Weakly Reachable: Java vs. UTF-8 BOM

2012-12-13

Java vs. UTF-8 BOM

From time-to-time, one runs into Java software that has trouble handling a Unicode byte-order-mark (BOM) in a UTF8-encoded character stream. I've seen this, for example, in various XML processing pipelines. Most recently, I came across it in a character stream generated by a .NET component that was being consumed by a Java component. The problem is always the same -- the BOM is finding its way into the data stream instead of being silently consumed by the Java I/O infrastructure.

This is a known problem, listed as bug #4508058 on the Java bug parade. Oracle/Sun acknowledged the bug, and it was even briefly fixed in the "Mustang" (Java6) release. However, a follow-on bug report (#6378911) complained that the fix broke backwards compatibility with previous releases. So, the fix was ultimately withdrawn and the original bug was marked as "won't fix".

Bottom line: when writing Java components that consume UTF8-encoded character streams, be prepared to consume the BOM yourself. Also, be aware that the various Microsoft I/O frameworks aggressively write BOMs into UTF8 streams, even if the stream would be otherwise empty.

It seems to me that there is an opportunity here for Java to add an alternate method to construct UTF8-readers that handle BOMs properly.

Weakly Reachable

2012-12-13

Java vs. UTF-8 BOM

Blog Archive