Java Travelogue: The Care and Feeding of Locales

Sun Feb 14 13:37:14 EST 2021

Tags: java

Over time, people using the NSF ODP Tooling project have periodically hit trouble with files that have non-ASCII names, as well as some related encoding issues.

Now, I know what you're thinking: why don't people hitting this trouble just be Americans and not use languages with accents? And yes, obviously, that's the optimal solution. However, given that, apparently, most people on the planet are not American, it's for the best to not write software that completely falls apart when encountering an umlaut.

When working to fix this, I found some areas where the fix was pretty obvious, and others where the trouble was a bit more insidious. I figure it'll be potentially useful to write these down, either for others running into similar trouble or for my own future self the next time I write overly-American code.

Early Encounters: ZIP Files

The earliest place people encountered trouble was the handling of ZIP files when transferring packages around. When compiling remotely, the local Maven plugin ZIPs up the ODP and related support files (OSGi sites, etc.) for transfer to the server, which then unzips them. This ran into the problem that the handling of file names in ZIP files is wildly inconsistent across platforms and locales.

Fortunately, this one has a clean fix: when using ZipOutputStream and ZipInputStream (which were my preferred mechanisms), you can specify your encoding:

try(OutputStream fos = Files.newOutputStream(packageZip, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
    try(ZipOutputStream zos = new ZipOutputStream(fos, StandardCharsets.UTF_8)) {
        // Add entries to the ZIP here
    }
}

// And to read:
try(InputStream is = Files.newInputStream(zipFilePath)) {
    try(ZipInputStream zis = new ZipInputStream(is, StandardCharsets.UTF_8)) {
        // Iterate over entries here
    }
}

Since I control both sides of the operation in this case, I can then be confident that it will use UTF-8 across the board.
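
As a quick demonstration that names survive the round trip this way (a contrived sketch with a made-up entry name, not code from the project):

Path packageZip = Files.createTempFile("package", ".zip");

// Write an entry with a non-ASCII name, explicitly as UTF-8
try(OutputStream fos = Files.newOutputStream(packageZip, StandardOpenOption.TRUNCATE_EXISTING)) {
    try(ZipOutputStream zos = new ZipOutputStream(fos, StandardCharsets.UTF_8)) {
        zos.putNextEntry(new ZipEntry("Code/Agents/Beispiel mit ä.fa"));
        zos.write("agent content".getBytes(StandardCharsets.UTF_8));
        zos.closeEntry();
    }
}

// Read the names back with the same explicit encoding
try(InputStream is = Files.newInputStream(packageZip)) {
    try(ZipInputStream zis = new ZipInputStream(is, StandardCharsets.UTF_8)) {
        ZipEntry entry;
        while((entry = zis.getNextEntry()) != null) {
            System.out.println(entry.getName()); // "Code/Agents/Beispiel mit ä.fa", intact
        }
    }
}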

Next Problem: Filesystem Restrictions

The next problem I ran into actually happened when I was setting up a compiler server in a Docker container. One of the design elements in the example projects is an agent with umlauts in its name, based on a reported problem. When I tried compiling this project in a Docker-housed Domino server, I ran into this trouble:

java.nio.file.InvalidPathException: Malformed input or input contains unmappable characters: Code/Agents/Example Agent with ref?r?ns.fa
    at sun.nio.fs.UnixPath.encode(UnixPath.java:147)
    at sun.nio.fs.UnixPath.<init>(UnixPath.java:71)
    at sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
    at sun.nio.fs.AbstractPath.resolve(AbstractPath.java:53)
    at org.openntf.nsfodp.compiler.servlet.ODPCompilerServlet.expandZip(ODPCompilerServlet.java:241)

Basically, it was trying to write out what it considered an illegal filename and choked on it.

I first spent some time double-checking my ZIP handling, since I assumed that the name it got out of the ZIP file was corrupted, hence the "?" characters in place of the original umlauted letters. This search brought me to this Stack Overflow question, which asks about the same exception and discusses the locale of the underlying system. The gist of it is that Java uses a semi-standard property (sun.jnu.encoding) to interpret a lot of things, filename mapping included, and it derives this property from the system locale.

I hopped into the Domino container to see what locale it uses (by way of echo $LANG) and saw that it's "C.utf8". I like the sound of that "utf8" part, but the "C" part is different from the comfy "en_US" that I'm used to, and it likely causes Java to be more restrictive. Counterintuitively, the typical "en_US" setup actually avoids this trouble, since it causes Java NIO to allow all sorts of characters in filenames.
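
As a quick way to check what a given JVM has decided on, you can print the relevant properties (a minimal sketch; sun.jnu.encoding is an internal, semi-standard property, so treat it as informational):

// The charset Java derives from the OS locale, used for e.g. filename mapping
System.out.println(System.getProperty("sun.jnu.encoding"));
// The default charset used by streams and readers when none is specified
System.out.println(Charset.defaultCharset());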

So I started seeing what I could do by way of setting ENV variables as part of the Dockerfile, but then realized that it'd be better to fix this in a way that doesn't depend on external configuration like that.

Java NIO

Here I realized that I didn't actually need to write these files out to the filesystem at all. Over a year ago, I wrote part 1 of an unfinished series talking about the Java NIO filesystem API from Java 7. That API exists for a number of reasons, and the best way to dive into it is to replace your uses of java.io.File, java.io.FileInputStream, etc. with it, which I did in the NSF ODP Tooling a while ago.
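
For illustration, the move is usually mechanical, something like this (with a hypothetical file name):

// Old java.io style, always bound to the default filesystem
try(InputStream is = new FileInputStream(new File("odp/database.properties"))) {
    // ...
}

// Equivalent NIO style, which routes through a FileSystem provider
try(InputStream is = Files.newInputStream(Paths.get("odp", "database.properties"))) {
    // ...
}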

What struck me, then, was that this earlier work also separated out the specifics of filesystem access. And, critically, Java ships with a ZIP file system provider that lets you point at a ZIP or JAR file and treat it like any old filesystem. The on-disk project representation I wrote for the compiler uses this NIO API as its entrypoint. By skipping the step of extracting the ODP from the ZIP to the filesystem, I could remove that entire problem from my view.

The Fiddly Parts

This process was mostly smooth, but there are a few fiddly parts that I had to account for:

  1. You have to use newFileSystem when you crack open a ZIP this way, rather than trying to open it by "jar:file" URL directly. Additionally, you have to pass a Map of options including "create":"true" to make it work (see the sketch after this list).
  2. Paths.get, which is a common mechanism for creating either a full or relative path, is a bit insidious. Since the paths it creates belong to the default system filesystem, you can't pass them to methods like resolve on paths from another filesystem type. Accordingly, I replaced uses of it with methods based on a context filesystem.
  3. Nested ZIPs aren't supported. That is, they show up like any other files, but you can't reach further inside of them with a "jar:jar:file" URL. So, when building the classpath for compilation, I have to extract them to temporary files. I suppose this part is technically a bug if those inner files have non-ASCII names, but that's rare enough to hopefully not be an issue.
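
Put together, the shape of it looks something like this (a sketch with illustrative names, not the actual NSF ODP code):

Path zipPath = Paths.get("/tmp/odp.zip"); // hypothetical location
URI uri = URI.create("jar:" + zipPath.toUri());
// The "create" option tells the provider to create the ZIP if it doesn't exist
Map<String, String> env = Collections.singletonMap("create", "true");
try(FileSystem zipFs = FileSystems.newFileSystem(uri, env)) {
    // Create paths from the ZIP filesystem itself, not via Paths.get, so that
    // resolve and friends stay within the same provider
    Path root = zipFs.getPath("/");
    Path agents = root.resolve("Code").resolve("Agents");

    // Nested ZIPs can't be opened in place with "jar:jar:file", so extract
    // them to temporary files before putting them on the classpath
    Path innerJar = root.resolve("plugins").resolve("some-bundle.jar"); // hypothetical
    if(Files.exists(innerJar)) {
        Path temp = Files.createTempFile("bundle", ".jar");
        Files.copy(innerJar, temp, StandardCopyOption.REPLACE_EXISTING);
    }
}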

Once I dealt with those, though, things went surprisingly smoothly. I even refactored earlier code to use this, replacing more-complicated streaming logic with conceptually-simpler file-copying logic. My guess is that this new route is slower, but the difference is negligible for my needs, so I'll take the higher abstraction here.
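
That file-copying logic can lean on the standard Files utilities, since paths from different filesystems can be copied between each other by bridging through strings (a minimal sketch, not the project's actual code):

public static void copyTree(Path source, Path dest) throws IOException {
    try(Stream<Path> walk = Files.walk(source)) {
        for(Path p : (Iterable<Path>)walk::iterator) {
            // Relativize on the source side, then re-resolve on the destination
            // side via a String, since the two sides may use different providers
            Path target = dest.resolve(source.relativize(p).toString());
            if(Files.isDirectory(p)) {
                Files.createDirectories(target);
            } else {
                Files.copy(p, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}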

Stream Locales

Unfortunately, while that helped a bit and is definitely conceptually neat, it didn't solve all my trouble. If I recall correctly, at this point, I was able to get the file imported, but the agent name itself was mangled in Notes, something that didn't happen when I compiled it locally.

This brought me to looking into the encodings used when reading and writing XML from the ZIP or filesystem. In theory, I had handled this cleanly. My file-reading utility methods were very similar to each other, just opening up an InputStream (which is too low-level to care about encoding) and passing it along to IBM Commons utilities to interpret it:

public static String readFile(Path path) {
    try(InputStream is = Files.newInputStream(path)) {
        return StreamUtil.readString(is);
    } catch(IOException e) {
        throw new RuntimeException(e);
    }
}

public static Document readXml(Path file) {
    try(InputStream is = Files.newInputStream(file)) {
        return DOMUtil.createDocument(is);
    } catch(IOException | XMLException e) {
        throw new RuntimeException(e);
    }
}

However, I realized that these were insidious traps, too. By not handling encoding on my side, I was leaving it up to the internals to pick a default encoding, which isn't guaranteed to be UTF-8 (even though it really should be for XML). StreamUtil.readString there has a variant that takes an encoding as the second argument, but I decided to instead handle this one step earlier. Rather than using InputStream, which deals with bytes directly, I decided to switch to Readers, which are more specialized for dealing with character sequences. The Files class provides methods to do this:

public static String readFile(Path path) {
    try(Reader r = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
        return StreamUtil.readString(r);
    } catch(IOException e) {
        throw new RuntimeException(e);
    }
}

public static Document readXml(Path file) {
    try(Reader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
        return DOMUtil.createDocument(r);
    } catch(IOException | XMLException e) {
        throw new RuntimeException(e);
    }
}

This way, it's explicit what I'm doing, and it allows for extra optimization at the NIO level where possible.

Writing Back Out

These rules also applied to writing back out. For the most part, Files.newBufferedWriter(..., StandardCharsets.UTF_8) was the way to go, though I did find one extra insidious bit:

try(PrintWriter writer = new PrintWriter(os)) {
    // ...
}

Here, PrintWriter doesn't have a character-set argument at all, and so one could be forgiven (hopefully) for just kind of assuming it'll use Unicode. However, delving into the implementation, it uses OutputStreamWriter's no-charset constructor, which in turn calls Charset.defaultCharset(), and there's your potential bug. Since I didn't actually need PrintWriter as such, I replaced this with a charset-specific call and all was well:

try(Writer writer = new OutputStreamWriter(os, StandardCharsets.UTF_8)) {
    // ...
}

Overall

I felt that this was a pretty good exercise to perform, not just because it'll be immediately useful for NSF ODP, but also because it's a good reminder to be more diligent about character encoding. And it's also just a good lesson for two critical parts of programming: take the higher abstraction when you can and be as explicit as possible in your intent.

By switching to using the ZIP filesystem implementation, I was able to remove an entire step and problem domain from my plate. Now, the code that reads and writes filenames server-side should be able to run on basically any locale setting, without concern for the restrictions of the filesystem (within reason). The code is simpler, the operations are the same whether it's working with the filesystem directly or not, and reading the ZIP'ed ODP should actually be slightly more efficient.

And for the rest, explicitly picking your character set is just good practice. Even in a case where the documentation says that it will default to UTF-8, I think it's better to do it this way, so anyone reading your code can see what you're doing without resting on implied behavior. Certainly, you can be too explicit in places where relying on natural behavior makes sense, but this highlighted that character sets aren't one of those cases.
