... until the collector arrives ...

This "blog" is really just a scratchpad of mine. There is not much of general interest here. Most of the content is scribbled down "live" as I discover things I want to remember. I rarely go back to correct mistakes in older entries. You have been warned :)

2013-01-30

Java SAXParserFactory vs. XMLFilter

Using Java 7, I had problems trying to inject an XMLFilter into a SAX pipeline. My code looked something like this:

SAXParserFactory saxFactory = SAXParserFactory.newInstance();
saxFactory.setNamespaceAware(true);
saxFactory.setValidating(false);
saxFactory.setSchema(schema);

XMLReader reader = saxFactory.newSAXParser().getXMLReader();

XMLFilter filter = new NamespaceRewriter();

filter.setParent(reader);
filter.setContentHandler(handler);

filter.parse(source);

The filter was a trivial extension of XMLFilterImpl that rewrite some of the document namespaces. However, this code would fail. As can be seen, the XMLReader was configured to perform XML schema validation. The failure was that the reader would bypass the installed SAX filter when performing this validation. Using a debugger, I was able to verify at run-time the reader's ContentHandler was correctly set to be my filter. But the reader was apparently performing the schema validation prior prior to invoking its ContentHandler.

I was able to work around this problem by inserting an explicit validation stage in the pipeline instead of relying upon the validation baked into the XMLReader built by SAXParserFactory:

XMLReader reader = XMLReaderFactory.createXMLReader();

ValidatorHandler validator = schema.newValidatorHandler();

XMLFilter filter = new NamespaceRewriter();

filter.setParent(reader);
filter.setContentHandler(validator);
validator.setContentHandler(handler);

filter.parse(source);

The changes are emphasized.

I would say that this is unfortunate behaviour in the Java 7 implementation, but I lay the blame squarely on the SAX API. The SAX API makes it devilishly difficult to build reader/filter chains correctly. In particular, it is almost impossible to link independently assembled subchains into a single larger chain. The API supports easy delegation of parse calls, but chaining content handlers is messy. A filter is required to destructively replace the content handler of the next reader/filter in the chain with an augmented content handler. If that next component has built-in behaviour that is not expressed in the content handler, then that component cannot be wrapped successfully (e.g. the problem at hand). Even if the XMLReader above had expressed the schema validation as an augmentation to the content handler, how would a filter recover it? Presumably getContentHandler() is supposed to return the unaugmented handler -- so how is one supposed to get the augmented one?

2013-01-10

SQL Server vs. EXISTS

Using Microsoft SQL Server, I had a query that retrieved rows from a view of some complexity, i.e. something like this:

SELECT * FROM vwComplex WHERE ...

This query ran basically instantaneously, even though the view involved numerous UNION ALLs and joins on coalesced keys. My sample query criteria selected ~100 out of the view's ~80,000 rows.

In the general case, I needed a query that returned a single bit indicating whether the specific criteria will select at least one row. I tried this:

SELECT
  CASE WHEN EXISTS (
    SELECT * FROM vwComplex WHERE ...
  ) THEN 1 ELSE 0 END

I presumed that this statement would run faster than the original unadorned query since it can stop executing as soon as it finds the first row. My presumption was incorrect -- the query ran for a long, long time.

In this case, at least, it turns out that it is faster to run the full query than to perform an existence check:

SELECT SIGN(COUNT(*)) FROM vwComplex WHERE ...

No surprises here, of course. SQL query plan generation remains a mystical exercise.

Blog Archive