... until the collector arrives ...

This "blog" is really just a scratchpad of mine. There is not much of general interest here. Most of the content is scribbled down "live" as I discover things I want to remember. I rarely go back to correct mistakes in older entries. You have been warned :)

2013-01-30

Java SAXParserFactory vs. XMLFilter

Using Java 7, I had problems trying to inject an XMLFilter into a SAX pipeline. My code looked something like this:

SAXParserFactory saxFactory = SAXParserFactory.newInstance();
saxFactory.setNamespaceAware(true);
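saxFactory.setValidating(false); // DTD validation only; schema validation is enabled below
saxFactory.setSchema(schema);    // the reader validates against this schema internally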

XMLReader reader = saxFactory.newSAXParser().getXMLReader();

XMLFilter filter = new NamespaceRewriter();

filter.setParent(reader);
filter.setContentHandler(handler);

filter.parse(source);

The filter was a trivial extension of XMLFilterImpl that rewrote some of the document's namespaces. However, this code would fail. As can be seen, the XMLReader was configured to perform XML Schema validation. The failure was that the reader bypassed the installed SAX filter when performing this validation. Using a debugger, I was able to verify at run time that the reader's ContentHandler was correctly set to be my filter. But the reader was apparently performing the schema validation prior to invoking its ContentHandler, so the validator saw the original namespaces rather than the rewritten ones.
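For the record, a filter of this kind can be sketched roughly as follows. This is not my original NamespaceRewriter: the class name, the constructor arguments and the urn:old / urn:new URIs are made up for illustration, and attribute namespaces are deliberately left alone to keep it short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical stand-in for NamespaceRewriter: maps one namespace URI to another. */
class NamespaceRewriterSketch extends XMLFilterImpl {
    private final String from;
    private final String to;

    NamespaceRewriterSketch(String from, String to) {
        this.from = from;
        this.to = to;
    }

    private String rewrite(String uri) {
        return from.equals(uri) ? to : uri;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        super.startPrefixMapping(prefix, rewrite(uri));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Only the element's own namespace is rewritten; attributes pass through as-is.
        super.startElement(rewrite(uri), localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(rewrite(uri), localName, qName);
    }

    /** Parses xml through the filter and returns the namespace URI reported for the root. */
    static String rootNamespace(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            NamespaceRewriterSketch filter = new NamespaceRewriterSketch("urn:old", "urn:new");
            filter.setParent(reader);

            final String[] seen = new String[1];
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes a) {
                    if (seen[0] == null) {
                        seen[0] = uri; // remember the first (root) element only
                    }
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return seen[0];
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootNamespace("<doc xmlns='urn:old'/>"));
    }
}
```

A handler placed downstream of the filter only ever sees the rewritten URI. Anything the reader does internally, before it calls its ContentHandler, still sees the original one.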

I was able to work around this problem by inserting an explicit validation stage in the pipeline instead of relying upon the validation baked into the XMLReader built by SAXParserFactory:

XMLReader reader = XMLReaderFactory.createXMLReader();

ValidatorHandler validator = schema.newValidatorHandler();

XMLFilter filter = new NamespaceRewriter();

filter.setParent(reader);
filter.setContentHandler(validator);
validator.setContentHandler(handler);

filter.parse(source);

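The changes are the plain XMLReader obtained from XMLReaderFactory (no schema attached) and the ValidatorHandler inserted between the filter and the final handler.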
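To convince myself that the explicit stage really validates what the filter emits, here is a self-contained sketch of the same pipeline. The schema, the input document, the urn:old / urn:new namespaces and the nested Rewriter class are all invented for the example. The reader comes from a SAXParserFactory without setSchema() rather than from XMLReaderFactory, which newer JDKs flag as deprecated.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class ExplicitValidationStage {
    // Hypothetical schema: a single element, doc, in the namespace urn:new.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:new' elementFormDefault='qualified'>"
        + "<xs:element name='doc' type='xs:string'/>"
        + "</xs:schema>";

    // Hypothetical input: the same element, but in the namespace urn:old.
    static final String XML = "<doc xmlns='urn:old'>hello</doc>";

    /** Stand-in for NamespaceRewriter: maps urn:old to urn:new on elements. */
    static class Rewriter extends XMLFilterImpl {
        private static String fix(String uri) {
            return "urn:old".equals(uri) ? "urn:new" : uri;
        }
        @Override
        public void startPrefixMapping(String prefix, String uri) throws SAXException {
            super.startPrefixMapping(prefix, fix(uri));
        }
        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            super.startElement(fix(uri), local, qName, atts);
        }
        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            super.endElement(fix(uri), local, qName);
        }
    }

    /** Runs reader -> filter -> validator -> handler; true if XML validates. */
    static boolean validates(XMLFilterImpl filter) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));

            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // no setSchema(): the reader must not validate
            XMLReader reader = factory.newSAXParser().getXMLReader();

            ValidatorHandler validator = schema.newValidatorHandler();
            validator.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // treat validation errors as fatal
                }
            });

            filter.setParent(reader);
            filter.setContentHandler(validator);               // filter feeds the validator
            validator.setContentHandler(new DefaultHandler()); // validator feeds the handler
            filter.parse(new InputSource(new StringReader(XML)));
            return true;
        } catch (SAXException e) {
            return false; // a validation error was raised
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("with rewriter:   " + validates(new Rewriter()));
        System.out.println("identity filter: " + validates(new XMLFilterImpl()));
    }
}
```

With the rewriting filter in front, the validator receives the element in urn:new and accepts it. With a plain XMLFilterImpl, which forwards events unchanged, the element stays in urn:old and validation fails.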

I would say that this is unfortunate behaviour in the Java 7 implementation, but I lay the blame squarely on the SAX API. The SAX API makes it devilishly difficult to build reader/filter chains correctly. In particular, it is almost impossible to link independently assembled subchains into a single larger chain. The API supports easy delegation of parse calls, but chaining content handlers is messy. A filter is required to destructively replace the content handler of the next reader/filter in the chain with an augmented content handler. If that next component has built-in behaviour that is not expressed in the content handler, then that component cannot be wrapped successfully (e.g. the problem at hand). Even if the XMLReader above had expressed the schema validation as an augmentation to the content handler, how would a filter recover it? Presumably getContentHandler() is supposed to return the unaugmented handler -- so how is one supposed to get the augmented one?
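The destructive replacement is easy to observe with a stock XMLFilterImpl. In the sketch below (the class and handler names are mine, invented for the demonstration), a handler is installed on the reader first and the reader is then wrapped in a filter. As far as I can tell, XMLFilterImpl points the parent's ContentHandler at itself when parse() is called, so the original handler never sees an event.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class HandlerReplacementDemo {
    /** Counts startDocument events so we can tell whether a handler was ever called. */
    static class CountingHandler extends DefaultHandler {
        int documents = 0;
        @Override
        public void startDocument() {
            documents++;
        }
    }

    /** What happened to each handler after one parse through the filter. */
    static class Result {
        int originalEvents;
        int downstreamEvents;
        boolean readerHandlerIsFilter;
    }

    static Result run() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            // Behaviour attached directly to the reader, as an independently
            // assembled subchain might do.
            CountingHandler original = new CountingHandler();
            reader.setContentHandler(original);

            // Wrapping the reader: the filter takes over the reader's handler slot.
            XMLFilterImpl filter = new XMLFilterImpl(reader);
            CountingHandler downstream = new CountingHandler();
            filter.setContentHandler(downstream);
            filter.parse(new InputSource(new StringReader("<doc/>")));

            Result r = new Result();
            r.originalEvents = original.documents;
            r.downstreamEvents = downstream.documents;
            r.readerHandlerIsFilter = reader.getContentHandler() == filter;
            return r;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Result r = run();
        System.out.println("original handler events:   " + r.originalEvents);
        System.out.println("downstream handler events: " + r.downstreamEvents);
        System.out.println("reader handler is filter:  " + r.readerHandlerIsFilter);
    }
}
```

To keep the original handler in play, the filter would have to fetch it with getContentHandler() before parsing and chain to it by hand, which is exactly the bookkeeping the API leaves to the caller.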
