UTF-32 vs UTF-8 – Does It Make Sense to Choose UTF-32?

Tags: c, cross platform, programming practices, unicode

I'm working on a cross-platform C++ project that doesn't currently handle Unicode, and I need to change it to support Unicode.

There are two options, and I need to decide which one to choose:

  • Using UTF-8 (std::string), which will make it easy to support POSIX systems.
  • Using UTF-32 (std::wstring), which will make it easy to call the Windows API.

For option 1 (UTF-8), the benefit is that it won't require many code changes. My concern is that some basic assumptions break with UTF-8, for example:

  • string.size() will no longer equal the number of characters.
  • searching for a '/' in a path will be hard to implement (I'm not 100% sure).

Does anyone have more experience with this? And which one should I choose?

Best Answer

Use UTF-8. string.size() won't equal the number of code points, but that is mostly a useless metric anyway. In almost all cases, you should worry either about the number of user-perceived characters/glyphs (and for that, UTF-32 fails just as badly), or about the number of bytes of storage used (for this, UTF-32 offers no advantage and uses more bytes to boot).
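To make the difference between bytes, code points, and user-perceived characters concrete, here is a minimal sketch. It assumes the std::string already holds valid UTF-8; the helper name count_code_points is made up for illustration and is not part of the standard library.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a string that is assumed to hold valid UTF-8.
// Continuation bytes always have the bit pattern 10xxxxxx, so every byte
// that does NOT match that pattern starts a new code point.
// (Hypothetical helper for illustration; it does not validate its input.)
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "e\xCC\x81" (the letter e followed by the combining acute accent U+0301), size() is 3 bytes and the helper reports 2 code points, yet a user sees a single character. The same text in UTF-32 would also be 2 code units for 1 visible character, which is why the code point count rarely answers the question you actually have.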

Searching for an ASCII character, such as /, will actually be easier than with other encodings, because you can simply use any byte/ASCII-based search routine (even old C strstr if you have 0 terminators). UTF-8 is designed such that every ASCII character keeps its single-byte ASCII representation, and every byte of a multi-byte (non-ASCII) sequence has its high bit set, so no non-ASCII character shares any byte with any ASCII character.
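As a rough illustration of the path case from the question, the following sketch splits a UTF-8 path at the last '/' using nothing but the ordinary byte-oriented std::string::rfind. The function name last_path_component is hypothetical, and the sketch assumes '/' is the only separator.

```cpp
#include <string>

// Returns everything after the last '/' in a UTF-8 encoded path.
// A plain byte search is safe here: the byte 0x2F ('/') can never occur
// inside a multi-byte UTF-8 sequence, because all bytes of such a
// sequence are >= 0x80.
// (Hypothetical helper for illustration; assumes '/' is the separator.)
std::string last_path_component(const std::string& utf8_path) {
    const std::string::size_type slash = utf8_path.rfind('/');
    if (slash == std::string::npos) {
        return utf8_path;  // no separator: the whole string is the component
    }
    return utf8_path.substr(slash + 1);
}
```

Calling it with "/home/\xC3\xBCser/caf\xC3\xA9.txt" returns the bytes of "caf\xC3\xA9.txt" unchanged, so the non-ASCII characters survive the split without any Unicode-aware code.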

The Windows API uses UTF-16, and UTF-16 doesn't offer string.size() == code_point_count either, because code points above U+FFFF are stored as two code units (a surrogate pair). It also shares all downsides of UTF-32, more or less. Furthermore, making the application handle Unicode probably won't be as simple as making all strings UTF-{8,16,32}; good Unicode support can require some tricky logic like normalizing text, handling silly code points well (this can become a security issue for some applications), making string manipulations such as slicing and iteration work with glyphs or code points instead of bytes, etc.
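The UTF-16 point can be sketched the same way. This example assumes the std::u16string holds well-formed UTF-16; count_code_points_utf16 is a made-up name used only for illustration.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string that is assumed to hold well-formed UTF-16.
// A code point above U+FFFF is stored as a surrogate pair: a high surrogate
// (0xD800..0xDBFF) followed by a low surrogate (0xDC00..0xDFFF). Skipping
// the low surrogates counts each pair exactly once.
// (Hypothetical helper for illustration; it does not validate its input.)
std::size_t count_code_points_utf16(const std::u16string& utf16) {
    std::size_t count = 0;
    for (char16_t unit : utf16) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

For u"\U0001F600" (a single emoji code point), size() is 2 while the helper reports 1, so a UTF-16 string type gives no stronger size() guarantee than UTF-8 does.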

There are more reasons to use UTF-8 (and reasons not to use UTF-{16,32}) than I can reasonably describe here. Please refer to the UTF-8 manifesto if you need more convincing.
