Java – why is sax parsing faster than dom parsing ? and how does stax work

domjavasaxstaxxml

yes, this question is rather long-winded – sorry. I kept is as dense as I felt possible. I bolded the questions to make it easier to peek at before reading the whole thing.

Why is sax parsing faster than dom parsing? The only thing I can come up with is that w/ sax you're probably ignoring the majority of the incoming data, and thus not wasting time processing parts of the xml you don't care about. IOW – after parsing w/ SAX, you can't recreate the original input. If you wrote your SAX parser so that it accounted for each and every xml node (and could thus recreate the original), then it wouldn't be any faster than DOM would it?

The reason I'm asking is that I'm trying to parse xml documents more quickly. I need to have access to the entire xml tree AFTER parsing. I am writing a platform for 3rd party services to plug into, so I can't anticipate what parts of the xml document will be needed and which parts won't. I don't even know the structure of the incoming document. This is why I can't use jaxb or sax. Memory footprint isn't an issue for me because the xml documents are small and I only need 1 in memory at a time. It's the time it takes to parse this relatively small xml document that is killing me. I haven't used stax before, but perhaps I need to investigate further because it might be the middle ground? If I understand correctly, stax keeps the original xml structure and processes the parts that I ask for on demand? In this way, the original parse time might be quick, but each time I ask it to traverse part of the tree it hasn't yet traversed, that's when the processing takes place?

If you provide a link that answers most of the questions, I will accept your answer (you don't have to directly answer my questions if they're already answered elsewhere).

update: I rewrote it in sax and it parses documents on avg 2.1 ms. This is an improvement (16% faster) over the 2.5 ms that dom was taking, however it is not the magnitude that I (et al) would've guessed

Thanks

Best Answer

Assuming you do nothing but parse the document, the ranking of the different parser standards is as follows:

1. StAX is the fastest

The event is reported to you

2. SAX is next

It does everything StAX does plus the content is realized automatically (element name, namespace, attributes, ...)

3. DOM is last

It does everything SAX does and presents the information as an instance of Node.

Your Use Case

If you need to maintain all of the XML, DOM is the standard representation. It integrates cleanly with XSLT transforms (javax.xml.transform), XPath (javax.xml.xpath), and schema validation (javax.xml.validation) APIs. However if performance is key, you may be able to build your own tree structure using StAX faster than a DOM parser could build a DOM.

Related Solutions

Java – How does the Java ‘for each’ loop work

for (Iterator<String> i = someIterable.iterator(); i.hasNext();) {
    String item = i.next();
    System.out.println(item);
}

Note that if you need to use i.remove(); in your loop, or access the actual iterator in some way, you cannot use the for ( : ) idiom, since the actual iterator is merely inferred.

As was noted by Denis Bueno, this code works for any object that implements the Iterable interface.

Also, if the right-hand side of the for (:) idiom is an array rather than an Iterable object, the internal code uses an int index counter and checks against array.length instead. See the Java Language Specification.

Python – XML parsing – ElementTree vs SAX and DOM

ElementTree is much easier to use, because it represents an XML tree (basically) as a structure of lists, and attributes are represented as dictionaries.

ElementTree needs much less memory for XML trees than DOM (and thus is faster), and the parsing overhead via iterparse is comparable to SAX. Additionally, iterparse returns partial structures, and you can keep memory usage constant during parsing by discarding the structures as soon as you process them.

ElementTree, as in Python 2.5, has only a small feature set compared to full-blown XML libraries, but it's enough for many applications. If you need a validating parser or complete XPath support, lxml is the way to go. For a long time, it used to be quite unstable, but I haven't had any problems with it since 2.1.

ElementTree deviates from DOM, where nodes have access to their parent and siblings. Handling actual documents rather than data stores is also a bit cumbersome, because text nodes aren't treated as actual nodes. In the XML snippet

<a>This is <b>a</b> test</a>

The string test will be the so-called tail of element b.

In general, I recommend ElementTree as the default for all XML processing with Python, and DOM or SAX as the solutions for specific problems.

Best Answer

Related Solutions

Java – How does the Java ‘for each’ loop work

Python – XML parsing – ElementTree vs SAX and DOM

Related Topic