UTF-32 vs UTF-8 – Does It Make Sense to Choose UTF-32?

Tags: c++, cross-platform, programming-practices, unicode

I'm working on a cross-platform C++ project that currently has no Unicode support, and I need to change it to support Unicode.

There are two choices, and I need to decide which one to use:

1. Use UTF-8 (std::string), which makes it easy to support POSIX systems.
2. Use UTF-32 (std::wstring), which makes it easy to call the Windows API.

For item #1 (UTF-8), the benefit is that few code changes are needed. My concern is that some basic assumptions break with UTF-8, for example:

- string.size() will no longer equal the number of characters.
- Searching for a '/' in a path will be hard to implement (I'm not 100% sure).

Does anyone have more experience with this? Which one should I choose?

Best Answer

Use UTF-8. string.size() won't equal the number of code points, but that is mostly a useless metric anyway. In almost all cases, you should either worry about the number of user-perceived characters/glyphs (and for that, UTF-32 fails just as badly), or about the number of bytes of storage used (for this, UTF-32 offers no advantage and uses more bytes to boot).
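To illustrate the three different "lengths" involved, here is a minimal sketch. The helper name count_code_points is made up for this example, and it assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a string that is
// assumed to hold valid UTF-8. Continuation bytes have the bit pattern
// 10xxxxxx, so every byte that does not match it starts a new code point.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For "na\xC3\xAFve" ("naïve" with a precomposed ï), size() is 6 and the function returns 5. For "e\xCC\x81" (an "e" followed by a combining acute accent), size() is 3 and the function returns 2, although a reader sees a single character. UTF-32 would also store that second string as two units, which is why the code point count is rarely the number you actually need.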

Searching for an ASCII character, such as /, will actually be easier than with other encodings, because you can simply use any byte/ASCII based search routine (even old C strstr if you have 0 terminators). UTF-8 is designed such that all ASCII characters use the same byte representation in UTF-8, and no non-ASCII character shares any byte with any ASCII character.
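As a rough sketch of this point, the following splits a UTF-8 path on '/' with a plain byte search. The name split_path is hypothetical, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a UTF-8 path on '/' using std::string::find.
// This is safe because the byte 0x2F ('/') never occurs inside a multi-byte
// UTF-8 sequence: all bytes of such sequences have the high bit set.
std::vector<std::string> split_path(const std::string& utf8_path)
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t slash = utf8_path.find('/', start);
        if (slash == std::string::npos) {
            parts.push_back(utf8_path.substr(start));
            break;
        }
        parts.push_back(utf8_path.substr(start, slash - start));
        start = slash + 1;
    }
    return parts;
}
```

Calling split_path("home/\xD0\xBF\xD0\xB0\xD0\xBF\xD0\xBA\xD0\xB0/file.txt") yields three components, and the middle one (the Cyrillic word "папка") comes back as an intact 10-byte string.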

The Windows API uses UTF-16, and UTF-16 doesn't offer string.size() == code_point_count either. It also shares all downsides of UTF-32, more or less. Furthermore, making the application handle Unicode probably won't be as simple as making all strings UTF-{8,16,32}; good Unicode support can require some tricky logic like normalizing text, handling silly code points well (this can become a security issue for some applications), making string manipulations such as slicing and iteration work with glyphs or code points instead of bytes, etc.
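The UTF-16 length mismatch can be shown with a short sketch. The function name count_code_points_utf16 is made up for this example, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-16 string that is
// assumed to be well formed. Code points above U+FFFF are stored as a
// surrogate pair, so each low (trailing) surrogate in 0xDC00..0xDFFF is
// the second half of a pair and is not counted.
std::size_t count_code_points_utf16(const std::u16string& text)
{
    std::size_t count = 0;
    for (char16_t unit : text) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

For std::u16string s = u"a\U0001F600", s.size() is 3 while the function returns 2, because U+1F600 occupies two 16-bit code units.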

There are more reasons to use UTF-8 (and reasons not to use UTF-{16,32}) than I can reasonably describe here. Please refer to the UTF-8 manifesto if you need more convincing.

Related Solutions

Should UTF-16 be considered harmful

This is an old answer.
See UTF-8 Everywhere for the latest updates.

Opinion: Yes, UTF-16 should be considered harmful. It exists only because there was once a misguided belief that widechar would become what UCS-4 is now.

Despite the "anglo-centrism" of UTF-8, it should be considered the only useful encoding for text. One can argue that program source code, web pages, XML files, OS file names and other computer-to-computer text interfaces should never have existed. But since they do exist, text is not only for human readers.

On the other hand, the overhead of UTF-8 is a small price to pay for its significant advantages, such as compatibility with unaware code that just passes strings around as char*. This is a great thing. There are few useful characters that are SHORTER in UTF-16 than they are in UTF-8.
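To make the size comparison concrete, here is a small sketch that returns the encoded size in bytes of a single code point in each encoding. The function names are made up for illustration, and the argument is assumed to be a valid code point (at most U+10FFFF, not a surrogate).

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to encode one code point in UTF-8.
std::size_t utf8_bytes(char32_t code_point)
{
    if (code_point < 0x80)    return 1;  // ASCII
    if (code_point < 0x800)   return 2;  // e.g. Latin with accents, Cyrillic, Greek
    if (code_point < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                            // supplementary planes
}

// Hypothetical helper: bytes needed to encode one code point in UTF-16.
std::size_t utf16_bytes(char32_t code_point)
{
    return code_point < 0x10000 ? 2 : 4; // one code unit or a surrogate pair
}
```

Only code points from U+0800 to U+FFFF (3 bytes versus 2) are shorter in UTF-16. ASCII is shorter in UTF-8, and everything else ties.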

I believe that all other encodings will die eventually. This requires MS-Windows, Java, ICU and Python to stop using UTF-16 as their favorite. After long research and discussions, the development conventions at my company ban using UTF-16 anywhere except OS API calls, and this despite the importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF-8 std::strings to native UTF-16, which Windows itself does not support properly.

To people who say "use what needed where it is needed", I say: there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations though is that every std::string or char* parameter would be considered unicode-compatible.

I am also against the "use what you want" approach. I see no reason for such liberty. There's enough confusion on the subject of text, resulting in all this broken software. Having said the above, I am convinced that programmers must finally reach consensus on UTF-8 as the one proper way. (I come from a non-ASCII-speaking country and grew up on Windows, so I'd be the last person expected to attack UTF-16 on religious grounds.)

I'd like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time checked Unicode correctness, ease of use and better multi-platformness of the code. The suggestion differs substantially from what is usually recommended as the proper way of using Unicode on Windows. Yet, in-depth research of these recommendations resulted in the same conclusion. So here goes:

- Do not use wchar_t or std::wstring anywhere other than at points adjacent to APIs that accept UTF-16.
- Don't use _T("") or L"" UTF-16 literals (these should IMO be taken out of the standard, as a part of UTF-16 deprecation).
- Don't use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
- Yet, keep _UNICODE always defined, so that char* strings passed to WinAPI do not get silently compiled.
- std::string and char* anywhere in the program are considered UTF-8 (unless stated otherwise).
- All my strings are std::string, though you can pass a char* or a string literal to convert(const std::string &).
- Only use Win32 functions that accept widechars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way:
```
// hWnd is the target window handle; a "string literal" works in place of someStdString.
::SetWindowTextW(hWnd, Utils::convert(someStdString).c_str());
```
(The policy uses conversion functions below.)

With MFC strings:

CString someoneElse; // something that arrived from MFC. Converted as soon as possible, before passing any further away from the API call:

std::string s = str(boost::format("Hello %s\n") % Convert(someoneElse));
AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK);

Working with files, filenames and fstream on Windows:
- Never pass std::string or const char* filename arguments to fstream family. MSVC STL does not support UTF-8 arguments, but has a non-standard extension which should be used as follows:
- Convert std::string arguments to std::wstring with Utils::Convert:
```
std::ifstream ifs(Utils::Convert("hello"),
                  std::ios_base::in |
                  std::ios_base::binary);
```
  We'll have to manually remove the convert, when MSVC's attitude to fstream changes.
- This code is not multi-platform and may have to be changed manually in the future
- See fstream unicode research/discussion case 4215 for more info.
- Never produce text output files with non-UTF8 content
- Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and WinAPI conventions above.

// For interface to win32 API functions
std::string convert(const std::wstring& str, unsigned int codePage /*= CP_UTF8*/)
{
    // Ask me for implementation..
    ...
}

std::wstring convert(const std::string& str, unsigned int codePage /*= CP_UTF8*/)
{
    // Ask me for implementation..
    ...
}

// Interface to MFC
std::string convert(const CString &mfcString)
{
#ifdef UNICODE
    return Utils::convert(std::wstring(mfcString.GetString()));
#else
    return mfcString.GetString();   // This branch is deprecated.
#endif
}

CString convert(const std::string &s)
{
#ifdef UNICODE
    return CString(Utils::convert(s).c_str());
#else
    Exceptions::Assert(false, "Unicode policy violation. See W569"); // This branch is deprecated as it does not support unicode
    return s.c_str();   
#endif
}
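The bodies of the two Utils::convert overloads are elided above. On Windows they would typically wrap the Win32 functions MultiByteToWideChar and WideCharToMultiByte, but that is an assumption and not something the answer states. Purely to illustrate what such a conversion has to do, here is a portable sketch of the UTF-8 to UTF-16 direction. The name utf8_to_utf16 and the choice of std::u16string are made up for this example, and this is not the author's implementation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, not the author's convert(): decodes UTF-8 and
// re-encodes it as UTF-16. Structurally malformed input throws; overlong
// forms and encoded surrogates are not rejected in this sketch.
std::u16string utf8_to_utf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char next = static_cast<unsigned char>(utf8[i + k]);
            if ((next & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (next & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point out of range");
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With this sketch, utf8_to_utf16("\xF0\x9F\x98\x80") produces two code units, the surrogate pair for U+1F600. A production version would also need the reverse direction and stricter validation.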

Legacy Code – Does Adding Unit Tests Make Sense?

You have to be pragmatic about these situations. Everything has to have a business value, but the business has to trust you to judge what the value of technical work is. Yes, there is always a benefit to having unit tests, but is that benefit great enough to justify the time spent?

I would argue that it always is for new code but, for legacy code, you have to make a judgement call.

Are you in this area of code often? Then there's a case for continual improvement. Are you making a significant change? Then there's a case that it is already new code. But if you're making a one-line change in a complex area that will probably not be touched again for a year, then of course the cost (not to mention the risk) of reengineering is too great. Just slap your one line of code in there and go take a quick shower.

Rule of thumb: Always think to yourself, "Do I believe that the business benefits more from this technical work that I feel I should do than the job they asked for which is going to be delayed as a result?"
