ElementTree is much easier to use, because it represents an XML tree (basically) as a structure of lists, and attributes are represented as dictionaries.
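To make the list-and-dictionary picture concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree. The library/book document and its attribute names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this example.
root = ET.fromstring(
    '<library name="city">'
    '<book id="1" lang="en">Dune</book>'
    '<book id="2" lang="de">Momo</book>'
    '</library>'
)

# An element behaves like a list of its child elements.
child_count = len(root)
first_book = root[0]
titles = [book.text for book in root]

# Attributes are exposed as a plain dictionary.
first_attrs = first_book.attrib
lang = first_book.get("lang")

print(child_count, titles, first_attrs, lang)
```

Indexing, len() and iteration work on the element itself, so most tree walking needs no special API.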
ElementTree needs much less memory for XML trees than DOM (and thus is faster), and the parsing overhead via iterparse is comparable to SAX. Additionally, iterparse returns partial structures, and you can keep memory usage constant during parsing by discarding the structures as soon as you process them.
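A rough sketch of that discard-as-you-go pattern follows. The records/record/value layout is an assumption made up for the example, and the input is built in memory only so the snippet is self-contained. In practice you would pass a file name or file object to iterparse.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: a flat list of <record> elements, built in memory.
records = "".join(
    '<record id="%d"><value>%d</value></record>' % (i, i * 10) for i in range(5)
)
xml_bytes = ("<records>" + records + "</records>").encode("utf-8")

total = 0
seen = 0
# iterparse yields (event, element) pairs; "end" fires once an element is complete.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "record":
        total += int(elem.findtext("value"))
        seen += 1
        # Drop the processed subtree so its children and text can be freed.
        elem.clear()

print(seen, total)
```

Note that clear() empties each record but leaves the emptied element attached to the root. For very large inputs it is common to clear the root periodically as well.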
ElementTree, as shipped with Python 2.5, has only a small feature set compared to full-blown XML libraries, but it's enough for many applications. If you need a validating parser or complete XPath support, lxml is the way to go. For a long time, lxml used to be quite unstable, but I haven't had any problems with it since its 2.1 release.
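As an illustration of the limited path support, the sketch below uses the subset that current standard-library versions of ElementTree accept (the version bundled with Python 2.5 supported less). The catalog/item document is hypothetical. XPath functions such as contains() are not part of this subset, which is where lxml comes in.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this example.
root = ET.fromstring(
    "<catalog>"
    '<item id="1"><name>bolt</name></item>'
    '<item id="2"><name>nut</name></item>'
    "</catalog>"
)

# Descendant search with an attribute predicate: within the supported subset.
match = root.find(".//item[@id='2']")
name = match.findtext("name")

# A child-element predicate is supported as well.
named_items = root.findall("item[name]")

print(name, len(named_items))
```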
ElementTree deviates from DOM, where nodes have access to their parent and siblings. Handling actual documents rather than data stores is also a bit cumbersome, because text nodes aren't treated as actual nodes. In the XML snippet
<a>This is <b>a</b> test</a>
the string test is the so-called tail of element b.
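The snippet above can be checked directly. This sketch shows where each piece of text ends up, plus one common workaround for the missing parent links: a child-to-parent map. The map is a widely used idiom, not a built-in ElementTree feature.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>This is <b>a</b> test</a>")
b = a.find("b")

a_text = a.text  # text before the first child: "This is "
b_text = b.text  # text inside <b>
b_tail = b.tail  # text after </b>, stored on b rather than on a (keeps its leading space)

# Elements carry no parent pointer; build a child -> parent map when one is needed.
parent_of = {child: parent for parent in a.iter() for child in parent}

# itertext() walks text and tails in document order.
full_text = "".join(a.itertext())

print(repr(a_text), repr(b_text), repr(b_tail), full_text)
```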
In general, I recommend ElementTree as the default for all XML processing with Python, and DOM or SAX as the solutions for specific problems.
Using the standard JAXP DOM API (javax.xml.parsers, not JDOM), taking an InputStream and making it a Document:
InputStream inputStream = (InputStream)httpURLConnection.getContent();
DocumentBuilderFactory docbf = DocumentBuilderFactory.newInstance();
docbf.setNamespaceAware(true);
DocumentBuilder docbuilder = docbf.newDocumentBuilder();
Document document = docbuilder.parse(inputStream, baseUrl);
At that point, you have the XML in a Java object. Done. Easy.
You can either use the document object and the Java API to just walk through it, or also use XPath, which I find easier (once I learned it).
Build an XPath object, which takes a bit:
public static XPath buildXPath() {
    XPathFactory factory = XPathFactory.newInstance();
    XPath xpath = factory.newXPath();
    // Register the prefix-to-URI mapping used in the XPath expressions.
    xpath.setNamespaceContext(new AtomNamespaceContext());
    return xpath;
}
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;

public class AtomNamespaceContext implements NamespaceContext {
    @Override
    public String getNamespaceURI(String prefix) {
        // The NamespaceContext contract specifies IllegalArgumentException for a null prefix.
        if (prefix == null)
            throw new IllegalArgumentException("Null prefix");
        else if ("a".equals(prefix))
            return "http://www.w3.org/2005/Atom";
        else if ("app".equals(prefix))
            return "http://www.w3.org/2007/app";
        else if ("os".equals(prefix))
            return "http://a9.com/-/spec/opensearch/1.1/";
        else if ("x".equals(prefix))
            return "http://www.w3.org/1999/xhtml";
        else if ("xml".equals(prefix))
            return XMLConstants.XML_NS_URI;
        // Unknown prefixes map to the empty namespace URI.
        return XMLConstants.NULL_NS_URI;
    }

    // This method isn't necessary for XPath processing.
    @Override
    public String getPrefix(String uri) {
        throw new UnsupportedOperationException();
    }

    // This method isn't necessary for XPath processing either.
    @Override
    public Iterator<String> getPrefixes(String uri) {
        throw new UnsupportedOperationException();
    }
}
Then just use it, which (thankfully) doesn't take much time at all:
return Integer.parseInt(xpath.evaluate("/a:feed/os:totalResults/text()", document));
If you are using C, then you can use libxml2 from the GNOME project. You can choose between DOM and SAX interfaces to your document, plus lots of additional features that have been developed over the years. If you really want C++, then you can use libxml++, which is an object-oriented C++ wrapper around libxml2.
The library has been proven again and again, is high performance, and can be compiled on almost any platform you can find.