Html, xmlns, namespaces, xml

I just encountered some problems while parsing HTML documents with NekoHTML + dom4j.

I found out that my XPath expressions had stopped working because a default XML namespace had recently been added to the HTML source.

Specification says:

The prefix xmlns is used only to declare namespace bindings and is by definition bound to the namespace name http://www.w3.org/2000/xmlns/. It MUST NOT be declared. Other prefixes MUST NOT be bound to this namespace name, and it MUST NOT be declared as the default namespace. Element names MUST NOT have the prefix xmlns.

But in my HTML documents, the following attribute was added to the html tag (recently, I guess): xmlns="http://www.w3.org/1999/xhtml"

I found 2 solutions:

1) Drop namespace with:

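import org.cyberneko.html.parsers.DOMParser;

// Turn off namespace processing so that elements end up in no namespace.
DOMParser parser = new DOMParser();
parser.setFeature("http://xml.org/sax/features/namespaces", false);
parser.parse(url);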

This is what the NekoHTML FAQ recommends.

2) Add a prefix to my XPath expressions, bound to the default HTML namespace.
(It seems that the empty string cannot be bound as a prefix to the namespace I want.)

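// XPath 1.0 has no default namespace, so bind a prefix of our own to the XHTML namespace.
Map<String,String> XPATH_NAMESPACES = new HashMap<String, String>();
XPATH_NAMESPACES.put("my_prefix", "http://www.w3.org/1999/xhtml");

// xpathExpr must use that prefix on element names, e.g. //my_prefix:td
XPath xpath = document.createXPath(xpathExpr);
xpath.setNamespaceURIs(XPATH_NAMESPACES);
Element element = (Element) xpath.selectSingleNode(document);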

And then, instead of using //td for example, I use //my_prefix:td

I am posting these solutions in case other people find them useful.
See also http://www.edankert.com/defaultnamespaces.html#Jaxen_and_Dom4J

But what I would really like to know is:

Why use a namespace different from the default one?
Why would someone switch from http://www.w3.org/2000/xmlns/ to http://www.w3.org/1999/xhtml?
Why do we use W3C namespaces in general? Does the namespace have any impact on the browser?

I guess my question could seem obvious to some of you, but I don't really see what the namespace brings.
I've read about the differences between HTML and XHTML. I guess people using an XHTML DTD would rather use this namespace, but what is the real benefit, apart from giving some additional pain to crawlers and similar tools?

PS: I've seen that to go from HTML to XHTML you have to add both xmlns and xml:lang, for example:

So that was probably not the aim of the website I was parsing, since no xml:lang was added…

Thanks

Best Answer

There's quite a lot of confusion evident in your question, and it's not easy to resolve it without writing an entire tutorial on XML namespaces. I'll try to cover as best I can how they relate to (X)HTML.

First though, the purpose of namespaces is to separate vocabularies. So, for example, the title element in the http://www.w3.org/1999/xhtml namespace can be distinguished from the title element in the http://www.w3.org/2000/svg namespace when they appear in the same document or are processed by a common processor.

Second, forget about the http://www.w3.org/2000/xmlns/ namespace. What it does is largely behind the scenes and you rarely need to worry about it.

Next, we need to distinguish between the null namespace, the default namespace, and namespaces referenced by prefixes.

When an XML file has no xmlns= attributes defined, all the unprefixed elements are said to be 'in the null namespace', or 'in no namespace' which amounts to the same thing.

When an XML element has an xmlns= attribute, it and its descendant elements, if they are unprefixed, are said to be 'in the default namespace', where the default namespace is the value of the xmlns attribute.

Prefixed elements are always in the namespace mapped by xmlns:prefix= attributes in the element or ancestor of the element.
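
The three cases above can be checked with the JDK's own namespace-aware DOM parser. This is only a sketch: the class name NamespaceKinds, the helper namespaceOf, and the sample documents are made up for illustration, and it uses plain JAXP rather than any particular HTML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NamespaceKinds {

    /** Parses the XML text with namespace processing switched on. */
    private static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP factories are not namespace-aware by default
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    /** Depth-first search for the first element with the given local name. */
    private static Element find(Element start, String localName) {
        if (localName.equals(start.getLocalName())) {
            return start;
        }
        for (Node child = start.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                Element found = find((Element) child, localName);
                if (found != null) {
                    return found;
                }
            }
        }
        return null;
    }

    /** Returns the element's namespace URI, or null when it is in no namespace. */
    static String namespaceOf(String xml, String localName) {
        Element match = find(parse(xml).getDocumentElement(), localName);
        if (match == null) {
            throw new IllegalArgumentException("no element named " + localName);
        }
        return match.getNamespaceURI();
    }

    public static void main(String[] args) {
        String xhtml = "http://www.w3.org/1999/xhtml";
        // No xmlns attribute anywhere: null namespace.
        System.out.println(namespaceOf("<html><body/></html>", "body"));
        // Default namespace declared on the root and inherited by unprefixed descendants.
        System.out.println(namespaceOf("<html xmlns=\"" + xhtml + "\"><body/></html>", "body"));
        // Prefix mapped to the namespace: only prefixed elements are in it.
        String prefixed = "<h:html xmlns:h=\"" + xhtml + "\"><h:body/><note/></h:html>";
        System.out.println(namespaceOf(prefixed, "body"));
        System.out.println(namespaceOf(prefixed, "note"));
    }
}
```

Note that in the last document the unprefixed note element stays in no namespace, because a prefix declaration does not set a default namespace.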

Now, the XHTML vocabulary is defined as elements in the http://www.w3.org/1999/xhtml namespace, so a correctly written XHTML document will declare either that namespace as being the default namespace, or will map a prefix to the namespace, in which case all the XHTML elements will need to include that prefix on their names. (This latter situation doesn't happen very often, for reasons given below).

So, when parsing XHTML with an XML parser, the namespace mapping needs to be there.

However, XPath has no concept of a default namespace. If you don't put the prefix on the elements named in the xpath, it will attempt to match elements in the null namespace. If the XHTML elements are in the http://www.w3.org/1999/xhtml namespace, then the xpath won't match anything.
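
A minimal sketch of that behaviour, using the JDK's javax.xml.xpath API instead of the dom4j/Jaxen setup from the question. The class name, the count helper, and the prefix x are hypothetical; any prefix works as long as it is bound to the XHTML namespace.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathDefaultNamespace {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Binds the illustrative prefix "x" to the XHTML namespace for XPath lookups. */
    private static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "x".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI) ? "x" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI)
                    ? Collections.singletonList("x").iterator()
                    : Collections.<String>emptyList().iterator();
        }
    };

    /** Returns how many nodes the XPath expression selects in the given document. */
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(XHTML_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"" + XHTML_NS + "\"><body><table><tr><td>cell</td></tr></table></body></html>";
        System.out.println(count(page, "//td"));   // 0: unprefixed names only match the null namespace
        System.out.println(count(page, "//x:td")); // 1: the prefix resolves to the XHTML namespace
    }
}
```

The prefix used in the XPath expression does not have to match anything in the document itself; only the namespace URI it resolves to is compared.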

This is where it starts to get complicated - browsers.

If you serve XHTML web pages to browsers as you should, with an XML content type like application/xhtml+xml, the browser will use an XML parser to load it and all the above rules apply. If you don't include the xmlns="http://www.w3.org/1999/xhtml" attribute, browsers won't understand how to process it and will simply display the file as a raw XML structure.

However, because IE until IE9 didn't support XML content-types, hardly anybody does serve their web pages that way. Instead they use the "text/html" content type, in which case the browser doesn't use an XML parser at all, it uses an HTML one.

The HTML parser just ignores the namespace to prefix mappings, and simply "knows" which element names belong in which namespaces. This makes it ultimately less flexible, but within its specialized domain, more robust and simple to use. (In the example of the title element above, it determines which namespace applies by looking at the title's ancestor elements.) This is why XHTML documents don't use prefixed elements, because an HTML parser won't recognise them.

Browsers (the modern ones anyway) then have specialized DOM-like API methods and CSS rules to hide all this namespace complexity away from the JavaScript and CSS author, and thus, for the most part, namespacing can be safely ignored by web authors.

Standalone HTML parsers, however, don't always do this. Instead, they place all the elements in the null namespace, which means that they can be found with xpaths that don't include prefixes on the element names, using standard DOM APIs. For most practical purposes, this amounts to the same thing as when browsers parse using their HTML parser.

So, in summary, you need to be aware of whether you are parsing your XHTML with an XML or HTML parser, and how that particular parser is assigning elements to namespaces, in order to be able to write a correct xpath to query for elements within the document.

Related Solutions

Html – What are valid values for the id attribute in HTML

For HTML 4, the answer is technically:

ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").

HTML 5 is even more permissive, saying only that an id must contain at least one character and may not contain any space characters.
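
As a rough illustration, the two rules can be written as regular expressions. This is a sketch, not a validator taken from either specification: the class and method names are hypothetical, and the HTML 5 pattern assumes that "space characters" means space, tab, line feed, form feed, and carriage return.

```java
import java.util.regex.Pattern;

class IdSyntax {

    // HTML 4: a letter, followed by any number of letters, digits,
    // hyphens, underscores, colons, and periods.
    private static final Pattern HTML4_ID = Pattern.compile("[A-Za-z][A-Za-z0-9_:.-]*");

    // HTML 5: at least one character, none of which is a space character
    // (assumed here: space, tab, line feed, form feed, carriage return).
    private static final Pattern HTML5_ID = Pattern.compile("[^ \\t\\n\\f\\r]+");

    static boolean isValidHtml4Id(String id) {
        return HTML4_ID.matcher(id).matches();
    }

    static boolean isValidHtml5Id(String id) {
        return HTML5_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"first.name", "first-name", "1st", "first name", ""}) {
            System.out.println("\"" + id + "\" HTML4=" + isValidHtml4Id(id) + " HTML5=" + isValidHtml5Id(id));
        }
    }
}
```

An id such as 1st passes the HTML 5 check but not the HTML 4 one, because HTML 4 requires a leading letter.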

The id attribute is case sensitive in XHTML.

As a purely practical matter, you may want to avoid certain characters. Periods, colons and '#' have special meaning in CSS selectors, so you will have to escape those characters using a backslash in CSS or a double backslash in a selector string passed to jQuery. Think about how often you will have to escape a character in your stylesheets or code before you go crazy with periods and colons in ids.

For example, the HTML declaration <div id="first.name"></div> is valid. You can select that element in CSS as #first\.name and in jQuery like so: $('#first\\.name'). But if you forget the backslash, $('#first.name'), you will have a perfectly valid selector looking for an element with id first and also having class name. This is a bug that is easy to overlook. You might be happier in the long run choosing the id first-name (a hyphen rather than a period), instead.

You can simplify your development tasks by strictly sticking to a naming convention. For example, if you limit yourself entirely to lower-case characters and always separate words with either hyphens or underscores (but not both, pick one and never use the other), then you have an easy-to-remember pattern. You will never wonder "was it firstName or FirstName?" because you will always know that you should type first_name. Prefer camel case? Then limit yourself to that, no hyphens or underscores, and always, consistently use either upper-case or lower-case for the first character, don't mix them.

A now very obscure problem was that at least one browser, Netscape 6, incorrectly treated id attribute values as case-sensitive. That meant that if you had typed id="firstName" in your HTML (lower-case 'f') and #FirstName { color: red } in your CSS (upper-case 'F'), that buggy browser would have failed to set the element's color to red. At the time of this edit, April 2015, I hope you aren't being asked to support Netscape 6. Consider this a historical footnote.

Xml – What does “xmlns” in XML mean

It means XML namespace.

Basically, every element (or attribute) in XML belongs to a namespace, a way of "qualifying" the name of the element.

Imagine you and I both invent our own XML. You invent XML to describe people, I invent mine to describe cities. Both of us include an element called name. Yours refers to the person’s name, and mine to the city name—OK, it’s a little bit contrived.

<person>
    <name>Rob</name>
    <age>37</age>
    <homecity>
        <name>London</name>
        <lat>123.000</lat>
        <long>0.00</long>
    </homecity>
</person>

If our two XMLs were combined into a single document, how would we tell the two names apart? As you can see above, there are two name elements, but they both have different meanings.

The answer is that you and I would both assign a namespace to our XML, which we would make unique:

<personxml:person xmlns:personxml="http://www.your.example.com/xml/person"
                  xmlns:cityxml="http://www.my.example.com/xml/cities">
    <personxml:name>Rob</personxml:name>
    <personxml:age>37</personxml:age>
    <cityxml:homecity>
        <cityxml:name>London</cityxml:name>
        <cityxml:lat>123.000</cityxml:lat>
        <cityxml:long>0.00</cityxml:long>
    </cityxml:homecity>
</personxml:person>

Now we’ve fully qualified our XML, there is no ambiguity as to what each name element means. All of the tags that start with personxml: are tags belonging to your XML, all the ones that start with cityxml: are mine.

There are a few points to note:

If you leave out namespace declarations altogether, elements are considered to be in no namespace (often loosely called the default namespace).
If you declare a namespace without a prefix, that is, xmlns="http://somenamespace" rather than xmlns:rob="http://somenamespace", it becomes the default namespace for that element and its descendants.
The actual namespace itself, often an IRI, is of no real consequence. It should be unique, so people tend to choose an IRI/URI that they own, but it has no greater meaning than that. Sometimes people will place the schema (definition) for the XML at the specified IRI, but that is a convention of some people only.
The prefix is of no consequence either. The only thing that matters is which namespace the prefix is bound to. Several tags beginning with different prefixes, all of which map to the same namespace, are considered to be the same.

For instance, if the prefixes personxml and mycityxml both mapped to the same namespace (as in the snippet below), then it wouldn't matter if you prefixed a given element with personxml or mycityxml; they'd both be treated as the same thing by an XML parser. The point is that an XML parser doesn't care what you've chosen as the prefix, only the namespace it maps to. The prefix is just an indirection pointing to the namespace.
```
<personxml:person 
     xmlns:personxml="http://example.com/same/url"
     xmlns:mycityxml="http://example.com/same/url" />
```
Attributes can be qualified but are generally not. They also do not inherit their namespace from the element they are on, as opposed to elements (see below).
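
Both points can be checked with the JDK's namespace-aware DOM parser. The following is a sketch under assumptions: the class and helper names are hypothetical, and the id attribute in the usage example is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class PrefixesAndAttributes {

    /** Parses the XML text with namespace processing on and returns its root element. */
    private static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    /** Returns the root element's name as "{namespace}local", which ignores the prefix. */
    static String rootExpandedName(String xml) {
        Element root = parseRoot(xml);
        return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
    }

    /** Returns the namespace URI of an attribute on the root element, or null if it has none. */
    static String rootAttributeNamespace(String xml, String attributeName) {
        Attr attribute = parseRoot(xml).getAttributeNode(attributeName);
        if (attribute == null) {
            throw new IllegalArgumentException("no attribute named " + attributeName);
        }
        return attribute.getNamespaceURI();
    }

    public static void main(String[] args) {
        String sameUrl = "http://example.com/same/url";
        // Two different prefixes bound to the same namespace give the same expanded name.
        System.out.println(rootExpandedName("<personxml:person xmlns:personxml=\"" + sameUrl + "\"/>"));
        System.out.println(rootExpandedName("<mycityxml:person xmlns:mycityxml=\"" + sameUrl + "\"/>"));
        // An unprefixed attribute is in no namespace, even though its element is in one.
        System.out.println(rootAttributeNamespace("<person xmlns=\"" + sameUrl + "\" id=\"1\"/>", "id"));
    }
}
```

Comparing the namespace URI and local name, rather than the raw tag name, is what makes code independent of whatever prefix a document author happened to choose.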

Also, element namespaces are inherited from the parent element. In other words, I could equally have written the namespaced person XML above as:

<person xmlns="http://www.your.example.com/xml/person">
    <name>Rob</name>
    <age>37</age>
    <homecity xmlns="http://www.my.example.com/xml/cities">
        <name>London</name>
        <lat>123.000</lat>
        <long>0.00</long>
    </homecity>
</person>
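
To check that the prefixed form and the default-namespace form really describe the same elements, one can compare the namespace URI and local name of every element in both. This is a sketch using the JDK's DOM parser; the class name and the expandedNames helper are hypothetical, and the two constants simply repeat the person documents from the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class InheritedNamespaces {

    static final String PERSON_NS = "http://www.your.example.com/xml/person";
    static final String CITY_NS = "http://www.my.example.com/xml/cities";

    /** The version that puts a prefix on every element. */
    static final String PREFIXED =
            "<personxml:person xmlns:personxml=\"" + PERSON_NS + "\" xmlns:cityxml=\"" + CITY_NS + "\">"
            + "<personxml:name>Rob</personxml:name><personxml:age>37</personxml:age>"
            + "<cityxml:homecity><cityxml:name>London</cityxml:name>"
            + "<cityxml:lat>123.000</cityxml:lat><cityxml:long>0.00</cityxml:long>"
            + "</cityxml:homecity></personxml:person>";

    /** The version that relies on inherited default namespaces. */
    static final String DEFAULTED =
            "<person xmlns=\"" + PERSON_NS + "\"><name>Rob</name><age>37</age>"
            + "<homecity xmlns=\"" + CITY_NS + "\"><name>London</name>"
            + "<lat>123.000</lat><long>0.00</long></homecity></person>";

    /** Lists every element as "{namespace}local" in document order. */
    static List<String> expandedNames(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> names = new ArrayList<String>();
            collect(root, names);
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    /** Adds the element's expanded name, then recurses into its child elements. */
    private static void collect(Element element, List<String> names) {
        names.add("{" + element.getNamespaceURI() + "}" + element.getLocalName());
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                collect((Element) child, names);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(expandedNames(PREFIXED));
        System.out.println(expandedNames(PREFIXED).equals(expandedNames(DEFAULTED)));
    }
}
```

In the second form, the xmlns attribute on homecity replaces the inherited default for that element and everything inside it, which is why the nested name element ends up in the cities namespace.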
