Xml – What are invalid characters in XML

xml

I am working with some XML that holds strings like:

<node>This is a string</node>

Some of the strings that I am passing to the nodes will have characters like &, #, $, etc.:

<node>This is a string & so is this</node>

This is not valid due to &.

I cannot wrap these strings in CDATA as they need to be as they are. I tried looking for a list of characters that cannot be put in XML nodes without being in a CDATA.

Can someone point me in the direction of one or provide me with a list of illegal characters?

Best Answer

OK, let's separate the question of the characters that:

  1. aren't valid at all in any XML document.
  2. need to be escaped.

The answer provided by @dolmen in "What are invalid characters in XML" is still valid but needs to be updated with the XML 1.1 specification.

1. Invalid characters

The characters described here are all the characters that are allowed to be inserted in an XML document.

1.1. In XML 1.0

The global list of allowed characters is:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Basically, the control characters and characters out of the Unicode ranges are not allowed. This means also that calling for example the character entity &#x3; is forbidden.

1.2. In XML 1.1

The global list of allowed characters is:

[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

This revision of the XML recommendation has extended the allowed characters so control characters are allowed, and takes into account a new revision of the Unicode standard, but these ones are still not allowed : NUL (x00), xFFFE, xFFFF...

However, the use of control characters and undefined Unicode char is discouraged.

It can also be noticed that all parsers do not always take this into account and XML documents with control characters may be rejected.

2. Characters that need to be escaped (to obtain a well-formed document):

The < must be escaped with a &lt; entity, since it is assumed to be the beginning of a tag.

The & must be escaped with a &amp; entity, since it is assumed to be the beginning a entity reference

The > should be escaped with &gt; entity. It is not mandatory -- it depends on the context -- but it is strongly advised to escape it.

The ' should be escaped with a &apos; entity -- mandatory in attributes defined within single quotes but it is strongly advised to always escape it.

The " should be escaped with a &quot; entity -- mandatory in attributes defined within double quotes but it is strongly advised to always escape it.