Xml – UTF-8 or ISO-8859-1 in XML

utf-8 xml

We have an application that takes a text string entered by a user into a web form and packages it in XML. Just to confuse matters a little, the XML is sent as the body of an Outlook email message.

Because the users can paste almost anything into the web form (typically from Word), the text string can contain characters outside 7-bit ASCII, such as those used for open and close double quotes.

The string is travelling intact via email but when we use the Microsoft XML parser, it complains (quite rightly) that the XML contains invalid characters.

A quick fix is to put encoding="iso-8859-1" in the header. However, I wonder if it would be better to encode the XML file in true UTF-8 from the start, as I've read articles stating that it would make for a more harmonious world if every XML document were encoded in UTF-8.

But… are we going to have trouble because the XML document is actually being transferred via the body of an email message? I understand that UTF-8 is a variable-byte-length encoding, which I assume uses 7-bit ASCII plus escape characters to indicate "there is more data".

Another option is to set to UTF-8 but replace non-ASCII characters with the &#nnn; format.
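A rough sketch of what that option could look like, in Python purely for illustration (the helper name to_ascii_xml_text is made up here, and our real application is not assumed to work this way): escape the markup characters first, then turn everything outside ASCII into decimal character references.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ascii_xml_text(text):
    """Return text that is safe as XML element content and pure ASCII.

    Hypothetical helper: &, < and > are escaped first, then every
    non-ASCII character is replaced by a decimal reference (&#nnn;).
    """
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


# Curly quotes as typically pasted from Word (U+201C and U+201D).
original = "\u201cA & B\u201d"
encoded = to_ascii_xml_text(original)
print(encoded)  # &#8220;A &amp; B&#8221;

# A parser turns the references back into the original characters.
parsed = ET.fromstring("<note>" + encoded + "</note>").text
print(parsed == original)  # True
```

The resulting document contains only ASCII bytes, so it is valid whether the header declares UTF-8 or iso-8859-1.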

Any advice on this rather complicated area would be appreciated.

Cheers, Rob.

Best Answer

Here from outside English-only-land{1} I can confirm that UTF-8 works fine everywhere and has done so for many, many years. I have trouble remembering when any MTA last crippled emails by stripping off the 8th bit (which led to "inventions" like QP that basically treated the symptom rather than solving the problem). That most certainly happened during the mid-90s, but UTF-8 quickly gained popularity and replaced iso-8859-1. I do not remember when I switched, but I would guess it was before 2000 at the latest.
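If you would rather not rely on every hop being 8-bit clean, a MIME transfer encoding keeps the message 7-bit on the wire while the payload itself stays UTF-8. A minimal sketch using Python's standard email package (the language and the note element are only for illustration; nothing here is specific to Outlook or to your application):

```python
import email
from email import policy
from email.message import EmailMessage

# Hypothetical XML payload containing Word-style curly quotes.
xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<note>\u201cHello\u201d</note>\n"
)

msg = EmailMessage()
# charset says how the text is encoded; cte says how those bytes
# are wrapped for transport (base64 output is plain ASCII).
msg.set_content(xml, subtype="xml", charset="utf-8", cte="base64")

raw = msg.as_bytes()
print(all(b < 0x80 for b in raw))  # every byte on the wire is 7-bit

# The receiving side undoes the transfer encoding and gets the text back.
received = email.message_from_bytes(raw, policy=policy.default)
roundtrip = received.get_content()
print(roundtrip == xml)
```

Mail clients normally pick a transfer encoding on their own, so this is mostly a way to see that the character encoding of the XML and the transport encoding of the mail are separate layers.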

Speaking of iso-8859-1, it will not be able to cover all possible input from your users. Depending on the language, other iso-8859 variants might be needed (for instance for Finnish and Welsh), and even then the 8859 family does not support languages like Chinese. UTF-8, on the other hand, should cover everything, so I strongly recommend it over iso-8859-1.
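A quick way to see the coverage gap for yourself, sketched in Python (the sample strings and the can_encode helper are my own illustration, not taken from your data):

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples only.
samples = {
    "Word curly quotes": "\u201cquoted\u201d",
    "Welsh w-circumflex": "\u0175",
    "Chinese": "\u4e2d\u6587",
}

for label, text in samples.items():
    print(label, can_encode(text, "iso-8859-1"), can_encode(text, "utf-8"))
```

Note that the curly quotes from Word are themselves outside iso-8859-1; they belong to Windows-1252, which is a different character set.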

{1} This might bias my experience since any program not fully supporting UTF-8 would be considered crap and tend not to be used here.