Finding Strings in recursively zipped files

I had an itch to scratch. After I used Field Trip (which I like a lot) to determine unused fields, the team managing the external Informatica integration claimed they would need weeks to ensure none of those fields is used in any of their hundreds of pipelines.

ZIP inception

My first reaction (OK, the second, first one isn't PC) was: Let's go after the source code and just use an editor of choice to do a find in files. Turns out: not so fast. The source export offered by the team was a zip file with an elaborate directory structure containing, tada, zip files. So each of the pipes would need multiple unzip operations.

Itch defined

I needed a tool that would start in a directory with a bunch of zip files and unpack them all, then check the unpacked result for more zip files, unzip these and repeat. Once done, it would take a list of strings, search for occurrences of those and generate a report showing which files contain them.

Itch scratched

I created findstring, a command line tool that takes a directory as its starting point, unzips what can be unzipped (optional) and searches for occurrences of the strings provided in a text file.

Initially I contemplated rendering the output as XML, so the final report could be designed in whatever fashion using XSLT. However, following KISS, I ended up using Markdown. I might add the XML option later on.
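
To give an idea of how little a Markdown report needs, here is a minimal sketch. The class name MarkdownReport, the method render and the layout (one heading per search string, one bullet per matching file) are assumptions made for this example, not the tool's actual output format.

```java
import java.util.List;
import java.util.Map;

// Hypothetical class, named for this example only
final class MarkdownReport {

    /** Renders one heading per search string, followed by a bullet list of matching files. */
    static String render(final Map<String, List<String>> hitsByString) {
        final StringBuilder md = new StringBuilder("# Search results\n");
        for (final Map.Entry<String, List<String>> e : hitsByString.entrySet()) {
            md.append("\n## ").append(e.getKey()).append("\n\n");
            if (e.getValue().isEmpty()) {
                md.append("_No occurrences found_\n");
            }
            for (final String file : e.getValue()) {
                // Backticks keep underscores in file names from being read as emphasis
                md.append("- `").append(file).append("`\n");
            }
        }
        return md.toString();
    }
}
```

Passing a LinkedHashMap keeps the headings in the same order as the strings in the input file.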

Recursion

The key piece of the tool is recursion (until you stack overflow ;-) ): reading a directory and diving into the directories found. I could have avoided that using Guava and its fileTraverser, but I like some Inception style coding. The core method is this:

    private boolean expandSources(final File sourceDir) throws IOException {
        boolean result = false;
        // listFiles() returns null when sourceDir is not a readable directory
        final File[] allFiles = sourceDir.listFiles();
        if (allFiles == null) {
            return false;
        }

        for (final File f : allFiles) {
            if (f.isDirectory()) {
                // Recurse first: "result || call" would skip the call once result is true
                result = this.expandSources(f) || result;

            } else if (f.getName().endsWith(".zip")) {
                // Strip only the trailing ".zip", not every ".zip" in the path
                final String path = f.getAbsolutePath();
                final File newTarget = new File(path.substring(0, path.length() - 4));

                // Need to scan the new directory too
                if (this.expandFile(f, newTarget)) {
                    result = true;
                    this.expandSources(newTarget);
                }
            }
        }
        return result;
    }

The function returns true if at least one zip file was unzipped. The string finding operation (case insensitive) follows the same approach.
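
The expandFile helper called above isn't shown here. A minimal sketch of how such a helper could look with plain java.util.zip follows. The class name ZipExpander is made up for this example, and the check against entries that try to escape the target directory is my assumption of what a careful version would do; the real tool may differ in details.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper class, named for this example only
final class ZipExpander {

    /** Unpacks zipFile into targetDir; returns true if at least one file was extracted. */
    static boolean expandFile(final File zipFile, final File targetDir) throws IOException {
        final Path target = targetDir.toPath().toAbsolutePath().normalize();
        boolean extracted = false;

        try (InputStream in = Files.newInputStream(zipFile.toPath());
                ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                final Path out = target.resolve(entry.getName()).normalize();
                // Refuse entries such as "../../x" that would land outside targetDir
                if (!out.startsWith(target)) {
                    throw new IOException("Entry outside target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies the current entry only; the zip stream stays open
                    Files.copy(zip, out, StandardCopyOption.REPLACE_EXISTING);
                    extracted = true;
                }
            }
        }
        return extracted;
    }
}
```

A file that is not a zip archive yields no entries, so the sketch returns false for it and the caller skips the recursion into the (never created) target directory.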

Use cases

Find field usage in ZIP files. Works with a package downloaded from the Metadata API or with what Informatica exports.
Check a source directory (doesn't need to contain zips) for keywords like TODO, FIXME, XXX.

The command line syntax is very simple:

    java -jar findString.jar -d directory -s strings [-o output]

    -d,--dir <arg>          directory with all zip files
    -s,--stringfile <arg>   Filename with Strings to search, one per line
    -o,--output <arg>       Output file name for report in MD format
    -nz,--nz                Rerun find operation on a ready unzipped structure - good for alternate finds

Limits

In its current form the utility checks for strings in every file except zip files, which get unpacked so their contents can be checked. When your directory contains binary files (e.g. images), it will still look for the strings inside them. File extension filters might be a future enhancement (share your opinion).
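
A related limit: expandSources recognises archives purely by the .zip extension, so zip-based formats with other extensions are searched as plain files rather than unpacked. One possible remedy is to look at the first four bytes, which for a regular zip archive are the local file header signature "PK" followed by 0x03 0x04. A minimal sketch, with ZipSniffer and looksLikeZip as made-up names and readNBytes requiring Java 9 or later:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class, named for this example only
final class ZipSniffer {

    /** True if the file starts with the ZIP local file header signature PK 0x03 0x04. */
    static boolean looksLikeZip(final Path file) throws IOException {
        final byte[] head = new byte[4];
        try (InputStream in = Files.newInputStream(file)) {
            // readNBytes keeps reading until 4 bytes arrived or the stream ended
            if (in.readNBytes(head, 0, 4) < 4) {
                return false;
            }
        }
        return head[0] == 'P' && head[1] == 'K' && head[2] == 3 && head[3] == 4;
    }
}
```

An empty archive starts with a different signature (PK 0x05 0x06) and is not matched by this check, which is fine here since there would be nothing to unpack.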

Files are read into memory. So if your directory contains huge files, you will blow your heap. Source code files hardly pose an issue, so the approach worked for me. Alternatively a scanner could be used, should the need arise.
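
Should that need arise, a line-by-line variant could look roughly like the sketch below. It uses a BufferedReader rather than a Scanner, and LineSearch and findIn are made-up names for this example, not part of the current tool. Memory use is then bounded by the longest line instead of the file size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical class, named for this example only
final class LineSearch {

    /** Returns the search strings that occur (case-insensitively) somewhere in the file. */
    static Set<String> findIn(final Path file, final List<String> needles) throws IOException {
        final Set<String> found = new LinkedHashSet<>();
        // ISO-8859-1 maps every byte to a char, so binary content cannot cause a decoding error
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            String line;
            // Stop early once every search string has been seen
            while ((line = reader.readLine()) != null && found.size() < needles.size()) {
                final String lower = line.toLowerCase(Locale.ROOT);
                for (final String needle : needles) {
                    if (lower.contains(needle.toLowerCase(Locale.ROOT))) {
                        found.add(needle);
                    }
                }
            }
        }
        return found;
    }
}
```

Two trade-offs to keep in mind: a string that spans a line break is not found, and a binary file without line breaks is still read as one long line.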

Go give it a spin and keep in mind: YMMV

Posted by Stephan H Wissel on 16 March 2019 | Comments (1) | categories: Salesforce Singapore

posted by Ben Langhinrichs on Saturday 16 March 2019 AD:

Depending on the absolute need for completeness, you could probably check the initial bytes to see if it is a zip file. But if that is not compelling, you could at least add the other common zipped extensions such as .epub, .jar, and all the ODF and OOXML file extensions. There are probably other common ones, but those are the ones I use internally when I check for likely zip files. Zipped packaging is fairly common.