... until the collector arrives ...

This "blog" is really just a scratchpad of mine. There is not much of general interest here. Most of the content is scribbled down "live" as I discover things I want to remember. I rarely go back to correct mistakes in older entries. You have been warned :)

2012-12-13

Java vs. UTF-8 BOM

From time to time, one runs into Java software that has trouble handling a Unicode byte-order mark (BOM) at the start of a UTF-8-encoded character stream. I've seen this, for example, in various XML processing pipelines. Most recently, I came across it in a character stream generated by a .NET component that was being consumed by a Java component. The problem is always the same: the BOM finds its way into the data as a leading U+FEFF character instead of being silently consumed by the Java I/O infrastructure.
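To make the symptom concrete, here is a small sketch that decodes a BOM-prefixed byte array with a plain InputStreamReader. The class name BomLeakDemo and the sample payload are made up for illustration; only the standard java.io and java.nio.charset classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows the BOM surviving UTF-8 decoding.
public class BomLeakDemo {

    // Decodes the bytes as UTF-8 and returns the first character read
    // (or -1 if the stream is empty).
    public static int firstChar(byte[] utf8Bytes) throws IOException {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            return reader.read();
        }
    }

    public static void main(String[] args) throws IOException {
        // EF BB BF is the UTF-8 encoding of U+FEFF, followed by "<a/>".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        // prints "first char: U+FEFF" rather than U+003C ('<')
        System.out.printf("first char: U+%04X%n", firstChar(withBom));
    }
}
```

A parser expecting '<' as the first character sees U+FEFF instead, which is the kind of failure that tends to show up in XML pipelines.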

This is a known problem, listed as bug #4508058 on the Java bug parade. Oracle/Sun acknowledged the bug, and it was even briefly fixed in the "Mustang" (Java 6) release. However, a follow-on bug report (#6378911) complained that the fix broke backwards compatibility with previous releases. So the fix was ultimately withdrawn, and the original bug was marked as "won't fix".

Bottom line: when writing Java components that consume UTF-8-encoded character streams, be prepared to consume the BOM yourself. Also, be aware that the various Microsoft I/O frameworks aggressively write BOMs into UTF-8 streams, even if the stream would otherwise be empty.
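One way to consume the BOM yourself is to peek at the first three bytes and push them back if they are not EF BB BF. This is only a sketch under my own assumptions: the class name Utf8BomSkipper and the method openUtf8 are invented here and are not part of any JDK API. It handles the UTF-8 BOM only and does not try to detect other encodings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: wraps a byte stream in a UTF-8 Reader, dropping a leading BOM.
public class Utf8BomSkipper {

    public static Reader openUtf8(InputStream in) throws IOException {
        // A 3-byte pushback buffer is enough to "un-read" the bytes we peek at.
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 or EOF.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // Not a BOM: put back whatever we consumed so the decoder sees it.
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        try (Reader reader = openUtf8(new ByteArrayInputStream(withBom))) {
            // prints "first char: h"
            System.out.println("first char: " + (char) reader.read());
        }
    }
}
```

A stream that contains nothing but a BOM, which is what the Microsoft frameworks can produce for empty output, simply yields end-of-stream on the first read.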

It seems to me that there is an opportunity here for Java to add an alternate way to construct UTF-8 readers that handle BOMs properly.
