Python – XML parsing – ElementTree vs SAX and DOM


Python has several ways to parse XML…

I understand the very basics of parsing with SAX. It functions as a stream parser, with an event-driven API.

I understand the DOM parser also. It reads the XML into memory and converts it to objects that can be accessed with Python.

Generally speaking, it was easy to choose between the two depending on what you needed to do, memory constraints, performance, etc.

(Hopefully I'm correct so far.)

Since Python 2.5, we also have ElementTree. How does this compare to DOM and SAX? Which is it more similar to? Why is it better than the previous parsers?

Best Answer

ElementTree is much easier to use, because it represents an XML tree (basically) as a structure of nested lists, and each element's attributes are represented as a dictionary.
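As a minimal sketch of that list-and-dictionary feel (the sample document and its tag names are made up for illustration), an element can be indexed, measured and iterated like a list of its children, while its attributes live in a plain dict:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
root = ET.fromstring(
    '<library name="city">'
    '<book id="1">Dune</book>'
    '<book id="2">Emma</book>'
    '</library>'
)

# An element behaves like a list of its child elements...
child_count = len(root)
first_book = root[0]
titles = [book.text for book in root]

# ...and its attributes are exposed as an ordinary dictionary.
library_attrs = root.attrib
first_id = first_book.get("id")

print(child_count, titles, library_attrs, first_id)
```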

ElementTree needs much less memory for XML trees than DOM (and thus is faster), and the parsing overhead via iterparse is comparable to SAX. Additionally, iterparse returns partial structures, and you can keep memory usage constant during parsing by discarding the structures as soon as you process them.
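A rough sketch of the iterparse pattern described above, assuming a hypothetical log document made of many small entry elements (the data, tag names and function name are invented for the example):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; a real input would typically be a large file.
xml_data = b"<log>" + b"".join(
    b'<entry level="%d">message</entry>' % (i % 3) for i in range(1000)
) + b"</log>"


def count_entries_by_level(source):
    """Stream <entry> elements and tally them by their level attribute."""
    counts = {}
    # An "end" event fires once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "entry":
            level = elem.get("level")
            counts[level] = counts.get(level, 0) + 1
            # Discard the element's text, attributes and children
            # as soon as it has been processed.
            elem.clear()
    return counts


counts = count_entries_by_level(io.BytesIO(xml_data))
print(counts)
```

Note that a cleared element still remains as an empty child of the root, so for very large inputs it is common to also clear the root element from time to time.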

ElementTree, as shipped with Python 2.5, has only a small feature set compared to full-blown XML libraries, but it's enough for many applications. If you need a validating parser or complete XPath support, lxml is the way to go. For a long time, lxml used to be quite unstable, but I haven't had any problems with it since version 2.1.

ElementTree deviates from DOM, where nodes have access to their parent and siblings; ElementTree elements carry no such references. Handling actual documents rather than data stores is also a bit cumbersome, because text nodes aren't treated as actual nodes. In the XML snippet

<a>This is <b>a</b> test</a>

the string " test" (including its leading space) will be the so-called tail of element b, rather than a node of its own.
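The snippet can be inspected directly. The following is a small sketch using the standard library's xml.etree.ElementTree; the parent_of dictionary is just one hand-rolled way to recover parent links and is not part of the ElementTree API:

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>This is <b>a</b> test</a>")
b = a.find("b")

text_of_a = a.text   # text before the first child of a
text_of_b = b.text   # text inside b
tail_of_b = b.tail   # text following </b>, stored on b itself

# itertext() yields text and tail strings in document order.
full_text = "".join(a.itertext())

# Elements hold no reference to their parent, so build a
# child-to-parent mapping by hand when one is needed.
parent_of = {child: parent for parent in a.iter() for child in parent}

print(repr(text_of_a), repr(text_of_b), repr(tail_of_b))
print(full_text)
```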

In general, I recommend ElementTree as the default for all XML processing with Python, and DOM or SAX as the solutions for specific problems.

Related Solutions

Python – Difference between staticmethod and classmethod

Maybe a bit of example code will help: Notice the difference in the call signatures of foo, class_foo and static_foo:

class A(object):
    def foo(self, x):
        print(f"executing foo({self}, {x})")

    @classmethod
    def class_foo(cls, x):
        print(f"executing class_foo({cls}, {x})")

    @staticmethod
    def static_foo(x):
        print(f"executing static_foo({x})")

a = A()

Below is the usual way an object instance calls a method. The object instance, a, is implicitly passed as the first argument.

a.foo(1)
# executing foo(<__main__.A object at 0xb7dbef0c>, 1)

With classmethods, the class of the object instance is implicitly passed as the first argument instead of self.

a.class_foo(1)
# executing class_foo(<class '__main__.A'>, 1)

You can also call class_foo using the class. In fact, if you define something to be a classmethod, it is probably because you intend to call it from the class rather than from a class instance. A.foo(1) would have raised a TypeError, but A.class_foo(1) works just fine:

A.class_foo(1)
# executing class_foo(<class '__main__.A'>, 1)

One use people have found for class methods is to create inheritable alternative constructors.
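As an illustration of such an alternative constructor (the Temperature and LabTemperature classes are hypothetical, invented for this sketch), note that because the classmethod builds its result through cls, a subclass calling it gets an instance of the subclass:

```python
class Temperature:
    def __init__(self, celsius):
        self.celsius = celsius

    @classmethod
    def from_fahrenheit(cls, fahrenheit):
        # cls is whichever class the method was called on,
        # so subclasses inherit a constructor that builds *them*.
        return cls((fahrenheit - 32) * 5 / 9)


class LabTemperature(Temperature):
    pass


t = LabTemperature.from_fahrenheit(212)
print(type(t).__name__, t.celsius)
```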

With staticmethods, neither self (the object instance) nor cls (the class) is implicitly passed as the first argument. They behave like plain functions except that you can call them from an instance or the class:

a.static_foo(1)
# executing static_foo(1)

A.static_foo('hi')
# executing static_foo(hi)

Staticmethods are used to attach functions that have some logical connection with a class to that class.

foo is just a function, but when you call a.foo you don't just get the function, you get a "partially applied" version of the function with the object instance a bound as the first argument to the function. foo expects 2 arguments, while a.foo only expects 1 argument.

a is bound to foo. That is what is meant by the term "bound" below:

print(a.foo)
# <bound method A.foo of <__main__.A object at 0xb7d52f0c>>
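The "partially applied" description can be checked by hand. This sketch uses a stripped-down stand-in for class A (its foo returns its arguments instead of printing them) and compares the bound method with an explicit functools.partial:

```python
from functools import partial


class A:
    def foo(self, x):
        return (self, x)


a = A()
bound = a.foo

# A bound method remembers both the instance and the plain function.
instance_is_bound = bound.__self__ is a
function_is_plain = bound.__func__ is A.foo

# Binding the first argument by hand gives an equivalent callable.
by_hand = partial(A.foo, a)
same_result = bound(1) == by_hand(1) == A.foo(a, 1)

print(instance_is_bound, function_is_plain, same_result)
```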

With a.class_foo, a is not bound to class_foo; rather, the class A is bound to class_foo.

print(a.class_foo)
# <bound method A.class_foo of <class '__main__.A'>>

Here, with a staticmethod, even though it is a method, a.static_foo just returns a good ol' function with no arguments bound. static_foo expects 1 argument, and a.static_foo expects 1 argument too.

print(a.static_foo)
# <function A.static_foo at 0xb7d479cc>

And of course the same thing happens when you access static_foo through the class A instead.

print(A.static_foo)
# <function A.static_foo at 0xb7d479cc>

Javascript – How to find out which DOM element has the focus

Use document.activeElement; it is supported in all major browsers.

Previously, there was no way to find out which form field had focus. To emulate detection in older browsers, add a "focus" event handler to all fields and record the last-focused field in a variable. Add a "blur" handler to clear the variable upon a blur event for the last-focused field.

If you need to remove focus from the activeElement, you can use blur: document.activeElement.blur(). This will change the activeElement to body.
