I'm going to ask what is probably quite a controversial question: "Should one of the most
popular encodings, UTF-16, be considered harmful?"
Why do I ask this question?
How many programmers are aware of the fact that UTF-16 is actually a variable-length encoding? By this I mean that there are code points that, represented as surrogate pairs, take more than one 16-bit element.
I know; lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, Win32 APIs, Qt GUI libraries, the ICU Unicode library, etc. However, with all of that, there are lots of basic bugs in the processing of characters outside the BMP (characters that must be encoded using two UTF-16 elements).
For example, try to edit one of these characters:
- 𝄞 (U+1D11E) MUSICAL SYMBOL G CLEF
- 𝕥 (U+1D565) MATHEMATICAL DOUBLE-STRUCK SMALL T
- 𝟶 (U+1D7F6) MATHEMATICAL MONOSPACE DIGIT ZERO
- 𠂊 (U+2008A) Han Character
You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference.
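To make the surrogate-pair point concrete, here is a minimal C++ sketch of how a code point outside the BMP turns into two UTF-16 elements. The function names `encode_utf16` and `count_code_points` are made up for this illustration and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Code points above U+FFFF are split into a high and a low surrogate.
std::u16string encode_utf16(char32_t cp) {
    // Precondition: a valid scalar value (not a surrogate, not above U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}

// Count code points (not code units) in a well-formed UTF-16 string.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        // A low surrogate is the second half of a pair; do not count it again.
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    }
    return n;
}
```

With this sketch, U+1D11E (the G clef above) encodes as the two elements 0xD834 0xDD1E, so `size()` on the resulting string reports 2 while `count_code_points` reports 1. Code that treats `size()` as a character count is where the bugs described here tend to start.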
For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:
- Opera has problems editing them (deleting requires 2 presses of backspace)
- Notepad can't deal with them correctly (deleting requires 2 presses of backspace)
- File name editing in Windows dialogs is broken (deleting requires 2 presses of backspace)
- All Qt3 applications can't deal with them: they show two empty squares instead of one symbol.
- Python encodes such characters incorrectly when used directly: `u'X'!=unicode('X','utf-16')` on some platforms when X is a character outside of the BMP.
- Python 2.5 unicodedata fails to get properties of such characters when Python is compiled with UTF-16 Unicode strings.
- StackOverflow seems to remove these characters from the text if they are entered directly as Unicode characters (these characters are shown using HTML Unicode escapes).
- WinForms TextBox may generate an invalid string when limited with MaxLength.
It seems that such bugs are extremely easy to find in many applications that use UTF-16.
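The "two presses of backspace" and MaxLength symptoms can be reproduced in a few lines. The following is only a sketch of the general failure mode, not the code of any of the applications listed; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Naive backspace: removes one code unit, which can leave a lone high surrogate.
void backspace_naive(std::u16string& s) {
    if (!s.empty()) s.pop_back();
}

// Surrogate-aware backspace: removes a whole pair when the string ends with one.
void backspace_code_point(std::u16string& s) {
    if (s.empty()) return;
    const std::size_t n = s.size();
    if (n >= 2 && is_low_surrogate(s[n - 1]) && is_high_surrogate(s[n - 2])) {
        s.resize(n - 2);
    } else {
        s.pop_back();
    }
}

// Truncate to at most max_units code units without splitting a surrogate pair.
void truncate_units(std::u16string& s, std::size_t max_units) {
    if (s.size() <= max_units) return;
    // If the cut would fall between the two halves of a pair, cut one unit earlier.
    if (max_units > 0 && is_high_surrogate(s[max_units - 1]) &&
        is_low_surrogate(s[max_units])) {
        --max_units;
    }
    s.resize(max_units);
}

// Small self-check: "a" followed by U+1D11E (0xD834 0xDD1E).
void backspace_demo() {
    const std::u16string text = {u'a', char16_t(0xD834), char16_t(0xDD1E)};

    std::u16string broken = text;
    backspace_naive(broken);        // leaves "a" plus an unpaired high surrogate
    assert(broken.size() == 2 && is_high_surrogate(broken.back()));

    std::u16string fixed = text;
    backspace_code_point(fixed);    // removes the whole character
    assert(fixed == u"a");
}
```

Note that this handles surrogate pairs only. Combining sequences and other multi-code-point characters raise the same kind of question in every encoding, so the sketch should not be read as a complete text-editing solution.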
So… Do you think that UTF-16 should be considered harmful?
Best Answer
Opinion: Yes, UTF-16 should be considered harmful. The very reason it exists is that some time ago there was a misguided belief that widechar would be what UCS-4 now is.
Despite the "anglo-centrism" of UTF-8, it should be considered the only useful encoding for text. One can argue that program source code, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed. But since they do, text is not only for human readers.
On the other hand, UTF-8 overhead is a small price to pay for its significant advantages, such as compatibility with unaware code that just passes strings around as `char*`. This is a great thing. There are few useful characters that are SHORTER in UTF-16 than they are in UTF-8.
I believe that all other encodings will die eventually. This requires that MS Windows, Java, ICU and Python stop using UTF-16 as their favorite. After long research and discussions, the development conventions at my company ban using UTF-16 anywhere except OS API calls, and this despite the importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF-8 `std::string`s to native UTF-16, which Windows itself does not support properly.
To people who say "use what needed where it is needed", I say: there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding `wchar_t` to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations though is that every `std::string` or `char*` parameter would be considered unicode-compatible.
I am also against the "use what you want" approach. I see no reason for such liberty. There's enough confusion on the subject of text, resulting in all this broken software. Having said the above, I am convinced that programmers must finally reach consensus on UTF-8 as the one proper way. (I come from a non-ASCII-speaking country and grew up on Windows, so I'd be the last person expected to attack UTF-16 on religious grounds.)
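As a rough illustration of the `char*` pass-through argument, the sketch below encodes code points as UTF-8 and checks the property that makes byte-oriented code safe: every byte of a multi-byte sequence has its high bit set, so none of them can be mistaken for `'\0'`, `'/'`, a quote or any other ASCII byte. The names `encode_utf8` and `all_bytes_non_ascii` are invented for this example.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    // Precondition: a valid scalar value (not a surrogate, not above U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII, unchanged
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: the rest of the BMP
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: outside the BMP
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// True if no byte in the sequence falls in the ASCII range 0x00..0x7F.
bool all_bytes_non_ascii(const std::string& bytes) {
    for (unsigned char b : bytes) {
        if (b < 0x80) return false;
    }
    return true;
}
```

On size: in this scheme the only code points that come out shorter in UTF-16 are those from U+0800 to U+FFFF (3 bytes versus 2), while ASCII is half the size in UTF-8 and everything outside the BMP takes 4 bytes in both.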
I'd like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time checked unicode correctness, ease of use and better multi-platformness of the code. The suggestion substantially differs from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research of these recommendations resulted in the same conclusion. So here goes:
- Do not use `wchar_t` or `std::wstring` in any place other than adjacent point to APIs accepting UTF-16.
- Do not use `_T("")` or `L""` UTF-16 literals (These should IMO be taken out of the standard, as a part of UTF-16 deprecation).
- Do not use anything that depends on the `_UNICODE` constant, such as `LPTSTR` or `CreateWindow()`.
- Keep `_UNICODE` always defined, so that passing `char*` strings to WinAPI does not get silently compiled.
- `std::strings` and `char*` anywhere in the program are considered UTF-8 (if not said otherwise).
- Use `std::string` for strings, though you can pass `char*` or a string literal to `convert(const std::string &)`.
- Only use Win32 functions that accept widechars (`LPWSTR`). Never those which accept `LPTSTR` or `LPSTR`. Pass parameters this way:
  (The policy uses conversion functions below.)
- With MFC strings:
- Working with files, filenames and fstream on Windows:
  - Do not pass `std::string` or `const char*` filename arguments to the `fstream` family. MSVC STL does not support UTF-8 arguments, but has a non-standard extension which should be used as follows:
  - Convert `std::string` arguments to `std::wstring` with `Utils::Convert`:
    We'll have to manually remove the convert when MSVC's attitude to `fstream` changes.
  - See `fstream` unicode research/discussion case 4215 for more info.
  - Avoid `fopen()` for RAII/OOD reasons. If necessary, use `_wfopen()` and WinAPI conventions above.
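The conversion functions the policy refers to are not shown here. As a stand-in, the following is a minimal, portable sketch of what a UTF-8 to UTF-16 `convert` could look like. It is an assumption, not the original code: the name `utf8_to_utf16` is hypothetical, and it returns `std::u16string` so that it compiles everywhere. A Windows build would more likely return `std::wstring` and wrap the Win32 `MultiByteToWideChar` function with `CP_UTF8`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the policy's convert(const std::string&).
// Decodes UTF-8 and re-encodes it as UTF-16. Malformed input (invalid lead
// bytes, stray or missing continuation bytes, overlong forms, encoded
// surrogates, values above U+10FFFF) yields U+FFFD per offending byte.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;   // total bytes in this sequence, 0 if b0 is invalid
        char32_t min = 0;      // smallest code point allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (!ok) {
            out.push_back(static_cast<char16_t>(0xFFFD));
            i += 1;            // resynchronise on the next byte
            continue;
        }
        assert(len >= 1 && len <= 4);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
        }
        i += len;
    }
    return out;
}
```

Under the policy above, a helper like this would be called only at the API boundary, so the UTF-16 string exists just long enough to be handed to a wide-character function.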