Techniques for parsing XML

language-agnostic, parsing, xml

I've always found XML somewhat cumbersome to process. I'm not talking about implementing an XML parser: I'm talking about using an existing stream-based parser, like a SAX parser, which processes the XML node by node.

Yes, it's really easy to learn the various APIs for these parsers, but whenever I look at code that processes XML I always find it to be somewhat convoluted. The essential problem seems to be that an XML document is logically separated into individual nodes, and yet the data types and attributes are often separated from the actual data, sometimes by multiple levels of nesting. Therefore, when processing any particular node individually, a lot of extra state needs to be maintained to determine where we are and what we need to do next.

For example, given a snippet from a typical XML document:

<book>
  <title>Blah blah</title>
  <author>Blah blah</author>
  <price>15 USD</price>
</book>

…how would I determine when I've encountered a text node containing a book title? Suppose we have a simple XML parser which acts like an iterator, giving us the next node in the XML document every time we call XMLParser.getNextNode(). I inevitably find myself writing code like the following:

boolean insideBookNode = false;
boolean insideTitleNode = false;

while (!XMLParser.finished())
{
    ....
    XMLNode n = XMLParser.getNextNode();

    if (n.type() == XMLTextNode)
    {
        if (insideBookNode && insideTitleNode)
        {
            // We have a book title, so do something with it
        }
    }
    else
    {
        if (n.type() == XMLStartTag)
        {
            if (n.name().equals("book")) insideBookNode = true;
            else if (n.name().equals("title")) insideTitleNode = true;
        }
        else if (n.type() == XMLEndTag)
        {
            if (n.name().equals("book")) insideBookNode = false;
            else if (n.name().equals("title")) insideTitleNode = false;
        }
    }
}

Basically, the XML processing quickly turns into a huge, state-machine-driven loop, with lots of state variables used to indicate parent nodes we've found earlier. The alternative is to maintain a stack object that keeps track of all the nested tags. Either way, this quickly becomes error-prone and difficult to maintain.
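To make the stack variant concrete, here is a minimal sketch. It assumes Java's StAX pull parser (javax.xml.stream) in place of the hypothetical XMLParser above, and the class name BookTitleStream and its titles method are invented for illustration. A single stack of open element names replaces the per-tag boolean flags:

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: collects the text of every <title> directly inside a <book>.
public class BookTitleStream {

    public static List<String> titles(String xml) {
        List<String> titles = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>();   // open elements, innermost first
        StringBuilder text = new StringBuilder();  // text seen since the last tag
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        path.push(reader.getLocalName());
                        text.setLength(0);
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        // Text may arrive in several chunks, so accumulate it.
                        text.append(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (isBookTitle(path)) {
                            titles.add(text.toString().trim());
                        }
                        path.pop();
                        text.setLength(0);
                        break;
                    default:
                        break; // comments, processing instructions, etc.
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return titles;
    }

    // True when the innermost open element is <title> and its parent is <book>.
    private static boolean isBookTitle(Deque<String> path) {
        Iterator<String> it = path.iterator(); // iterates innermost first
        return it.hasNext() && it.next().equals("title")
                && it.hasNext() && it.next().equals("book");
    }
}
```

The context check lives in one place (isBookTitle), but the loop is still a hand-written state machine; the stack only makes the state generic rather than removing it.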

Again, the problem seems to be that the data we're interested in is not directly associated with an individual node. Sure, it could be, if we wrote the XML like:

<book title="Blah blah" author="blah blah" price="15 USD" />

…but this is rarely how XML is used in reality. Mostly we have text nodes as children of parent nodes, and we need to keep track of the parent nodes in order to determine what a text node refers to.

So…am I doing something wrong? Is there a better way? At what point does using an XML stream-based parser become too cumbersome, so that a fully-fledged DOM parser becomes necessary? I'd like to hear from other programmers what sort of idioms they use when processing XML with stream-based parsers. Must stream-based XML parsing always turn into a huge state machine?

Best Answer

To me, the question is the other way round. At what point does an XML document become so cumbersome that you have to start using SAX instead of DOM?

I would only use SAX for a very large, indeterminately-sized stream of data; or if the behaviour the XML is intended to invoke is really event-driven, and therefore SAX-like.

The example you give looks very DOM-like to me.

  1. Load the XML
  2. Extract the title node(s) and "do something with them".
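Those two steps might look like the following in Java. This is only a sketch using the JDK's built-in DOM and XPath APIs; the class name BookTitleDom and the XPath expression //book/title are my assumptions about what "title node(s)" means for the question's example:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: loads the whole document, then queries it.
public class BookTitleDom {

    public static List<String> titles(String xml) {
        try {
            // 1. Load the XML into an in-memory tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // 2. Extract every <title> that is a direct child of a <book>.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//book/title", doc, XPathConstants.NODESET);

            List<String> titles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                titles.add(nodes.item(i).getTextContent().trim());
            }
            return titles;
        } catch (Exception e) {
            // Parser configuration, I/O, well-formedness, or XPath errors.
            throw new IllegalArgumentException("Could not read XML", e);
        }
    }
}
```

No state flags are needed because the parent/child context is part of the query. The cost is that the whole document is held in memory.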

EDIT: I'd also use SAX for streams that may be malformed, but where I want to make a best guess at getting the data out.
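As a rough illustration of that best-guess case, the sketch below uses the JDK's SAX parser and keeps whatever titles were reported before a well-formedness error stopped the parse. The class name BestEffortTitles is invented, and treating a SAXParseException as "stop and keep what we have" is my assumption about what a best guess should mean:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: returns the titles seen before any parse error.
public class BestEffortTitles {

    public static List<String> titles(String xml) {
        List<String> titles = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inTitle = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (qName.equals("title")) {
                    inTitle = true;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) {
                    text.append(ch, start, length); // may be called in chunks
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("title")) {
                    titles.add(text.toString().trim());
                    inTitle = false;
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXParseException e) {
            // Malformed input: fall through and return what was collected so far.
        } catch (Exception e) {
            throw new IllegalStateException("Could not run the SAX parser", e);
        }
        return titles;
    }
}
```

A DOM parser given the same malformed input would fail without producing a tree, so nothing could be extracted. The handler here only tracks a single flag; deeper context checks would bring back the state-tracking problem from the question.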
