Html, xmlns, namespaces, xml


I just encountered some problems while parsing HTML documents with NekoHTML + dom4j.

I found out that my XPath expressions had stopped working because a default XML namespace was recently added to the HTML source.

Specification says:

The prefix xmlns is used only to declare namespace bindings and is by definition bound to the namespace name http://www.w3.org/2000/xmlns/. It MUST NOT be declared. Other prefixes MUST NOT be bound to this namespace name, and it MUST NOT be declared as the default namespace. Element names MUST NOT have the prefix xmlns.

But in my HTML docs, the following was recently added (I guess) to the html tag: xmlns="http://www.w3.org/1999/xhtml"

I found 2 solutions:

1) Drop namespace with:

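import org.cyberneko.html.parsers.DOMParser;

// Turn off namespace processing so that elements end up in no namespace.
DOMParser parser = new DOMParser();
parser.setFeature("http://xml.org/sax/features/namespaces", false);
parser.parse(url); // url: system identifier (String) of the page to parse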

This is what the NekoHTML FAQ suggests.
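For readers without NekoHTML at hand, the effect of switching namespace processing off can be sketched with the JDK's own JAXP parser. This is only an analogy, not NekoHTML itself: `setNamespaceAware(false)` stands in for the SAX `namespaces` feature, and the class name and sample page are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NamespaceAwarenessDemo {
    // Hypothetical sample page, not taken from the site discussed above.
    static final String PAGE =
        "<html xmlns='http://www.w3.org/1999/xhtml'>"
      + "<body><table><tr><td>a</td><td>b</td></tr></table></body></html>";

    /** Counts the nodes matched by the unprefixed expression //td. */
    static int countTd(boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // JAXP analogue of the SAX "namespaces" feature used with NekoHTML.
            factory.setNamespaceAware(namespaceAware);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//td", doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("namespace processing off: " + countTd(false));
        System.out.println("namespace processing on:  " + countTd(true));
    }
}
```

With namespace processing off, the xmlns attribute is treated as an ordinary attribute and //td finds both cells; with it on, the same expression finds nothing.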

2) Add a prefix to my XPath, bound to the default HTML namespace.
(It seems the empty-string prefix can't be bound to the namespace I want.)

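import java.util.HashMap;
import java.util.Map;
import org.dom4j.Element;
import org.dom4j.XPath;

// Bind an arbitrary prefix to the XHTML namespace for use in XPath expressions.
Map<String, String> XPATH_NAMESPACES = new HashMap<String, String>();
XPATH_NAMESPACES.put("my_prefix", "http://www.w3.org/1999/xhtml");

XPath xpath = document.createXPath(xpathExpr); // e.g. "//my_prefix:td"
xpath.setNamespaceURIs(XPATH_NAMESPACES);
Element element = (Element) xpath.selectSingleNode(document);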

And then, instead of using //td for example, I use //my_prefix:td

I'm posting these solutions because some people may find them useful.
See also http://www.edankert.com/defaultnamespaces.html#Jaxen_and_Dom4J

But what I would really like to know is:

I guess my question could seem obvious to some of you, but I don't really get what it brings.
I've read about the differences between HTML and XHTML. I guess people using an XHTML DTD would rather use this namespace, but what is the real benefit, apart from causing some additional pain to crawlers and similar tools?

PS: I've seen that to go from HTML to XHTML you have to add both xmlns and xml:lang, for example:

So that was probably not the aim of the website I was parsing, since no xml:lang was added…

Thanks

Best Answer

There's quite a lot of confusion evident in your question, and it's not easy to resolve it without writing an entire tutorial on XML namespaces. I'll try to cover as best I can how they relate to (X)HTML.

First though, the purpose of namespaces is to separate vocabularies. So, for example, the title element in the http://www.w3.org/1999/xhtml namespace can be distinguished from the title element in the http://www.w3.org/2000/svg namespace when they appear in the same document or are processed by a common processor.
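As a rough illustration of that separation, here is a minimal sketch using the JDK's namespace-aware DOM parser. The class name and the sample document are invented for the example; only the two namespace URIs come from the text above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TitleByNamespace {
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    static final String SVG = "http://www.w3.org/2000/svg";

    // Hypothetical document holding a title element from each vocabulary.
    static final String DOC =
        "<html xmlns='" + XHTML + "'>"
      + "<head><title>Page title</title></head>"
      + "<body><svg xmlns='" + SVG + "'><title>Drawing title</title></svg></body>"
      + "</html>";

    /** Returns the text of every title element in the given namespace. */
    static List<String> titlesIn(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(DOC)));
            NodeList nodes = doc.getElementsByTagNameNS(namespaceUri, "title");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                titles.add(nodes.item(i).getTextContent());
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("XHTML titles: " + titlesIn(XHTML));
        System.out.println("SVG titles:   " + titlesIn(SVG));
    }
}
```

Both elements share the local name title, yet each lookup returns only the one from the requested vocabulary.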

Second, forget about the http://www.w3.org/2000/xmlns/ namespace. What it does is largely behind the scenes and you rarely need to worry about it.

Next, we need to distinguish between the null namespace, the default namespace, and namespaces referenced by prefixes.

When an XML file has no xmlns= attributes defined, all the unprefixed elements are said to be 'in the null namespace', or 'in no namespace' which amounts to the same thing.

When an XML element has an xmlns= attribute, it and its descendant elements, if they are unprefixed, are said to be 'in the default namespace', where the default namespace is the value of the xmlns attribute.

Prefixed elements are always in the namespace mapped by the matching xmlns:prefix= attribute on the element itself or on one of its ancestors.
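The three cases can be checked with a small sketch along these lines. It assumes the JDK's namespace-aware DOM parser; the class name, the sample documents and the note element are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NamespaceKinds {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical samples: no xmlns at all, a default namespace, a prefix mapping.
    static final String NO_NAMESPACE = "<html><body/></html>";
    static final String DEFAULT_NAMESPACE = "<html xmlns='" + XHTML + "'><body/></html>";
    static final String PREFIXED =
        "<h:html xmlns:h='" + XHTML + "'><h:body/><note/></h:html>";

    /** Maps each element's qualified name to its namespace URI (null means no namespace). */
    static Map<String, String> namespacesOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagName("*"); // every element, document order
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < all.getLength(); i++) {
                Node element = all.item(i);
                result.put(element.getNodeName(), element.getNamespaceURI());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(namespacesOf(NO_NAMESPACE));
        System.out.println(namespacesOf(DEFAULT_NAMESPACE));
        System.out.println(namespacesOf(PREFIXED));
    }
}
```

In the first document every element reports a null namespace. In the second, both html and body report the XHTML URI. In the third, the prefixed elements report the XHTML URI while the unprefixed note element stays in no namespace, because no default namespace is declared.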

Now, the XHTML vocabulary is defined as elements in the http://www.w3.org/1999/xhtml namespace, so a correctly written XHTML document will declare either that namespace as being the default namespace, or will map a prefix to the namespace, in which case all the XHTML elements will need to include that prefix on their names. (This latter situation doesn't happen very often, for reasons given below).

So, when parsing XHTML with an XML parser, the namespace mapping needs to be there.

However, XPath 1.0 has no concept of a default namespace. If you don't put a prefix on the element names in an XPath expression, it will attempt to match elements in the null namespace. If the XHTML elements are in the http://www.w3.org/1999/xhtml namespace, the expression won't match anything.
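A minimal sketch of that behaviour, using the JDK's javax.xml.xpath API rather than dom4j, might look like this. The prefix h, the class name and the sample page are arbitrary choices for the example; any prefix works as long as it is bound to the XHTML namespace URI.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathDefaultNamespace {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical page that declares XHTML as its default namespace.
    static final String PAGE =
        "<html xmlns='" + XHTML + "'>"
      + "<body><table><tr><td>a</td><td>b</td></tr></table></body></html>";

    // Binds the (arbitrary) prefix "h" to the XHTML namespace for XPath lookups.
    static final NamespaceContext H_PREFIX = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "h".equals(prefix) ? XHTML : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return XHTML.equals(uri) ? "h" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return XHTML.equals(uri)
                ? Collections.singletonList("h").iterator()
                : Collections.<String>emptyIterator();
        }
    };

    /** Counts the nodes that the expression selects in PAGE. */
    static int count(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(H_PREFIX);
            NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("//td   -> " + count("//td"));
        System.out.println("//h:td -> " + count("//h:td"));
    }
}
```

The unprefixed expression selects nothing, because it only looks for td elements in no namespace, while the prefixed one selects both cells. Note that the document itself never uses the prefix h; the binding exists only on the XPath side.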


This is where it starts to get complicated - browsers.

If you serve XHTML web pages to browsers as you should, with an XML content type like application/xhtml+xml, the browser will use an XML parser to load it and all the above rules apply. If you don't include the xmlns="http://www.w3.org/1999/xhtml" attribute, browsers won't understand how to process it and will simply display the file as a raw XML structure.

However, because IE until IE9 didn't support XML content-types, hardly anybody does serve their web pages that way. Instead they use the "text/html" content type, in which case the browser doesn't use an XML parser at all, it uses an HTML one.

The HTML parser just ignores the prefix-to-namespace mappings, and simply "knows" which element names belong in which namespaces. This makes it ultimately less flexible, but within its specialized domain, more robust and simple to use. (In the example of the title element above, it determines which namespace applies by looking at the title's ancestor elements.) This is why XHTML documents don't use prefixed elements: an HTML parser won't recognise them.

Browsers (the modern ones, anyway) then have specialized DOM-like API methods and CSS rules to hide all this namespace complexity from the JavaScript and CSS author, so for the most part namespacing can be safely ignored by web authors.

Standalone HTML parsers, however, don't always do this. Instead, they place all the elements in the null namespace, which means that they can be found with XPath expressions that don't include prefixes on the element names, using standard DOM APIs. For most practical purposes, this amounts to the same thing as when browsers parse using their HTML parser.

So, in summary, you need to be aware of whether you are parsing your XHTML with an XML or an HTML parser, and how that particular parser assigns elements to namespaces, in order to write a correct XPath expression to query for elements within the document.