UTF-16 vs UTF-8 – Fixed-Width or Variable-Width and Byte-Order Issues

character-encoding, unicode, utf-8

  1. Is UTF-16 fixed-width or variable-width? I got different results
    from different sources:

    From http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF:

    UTF-16 stores Unicode characters in sixteen-bit chunks.

    From http://en.wikipedia.org/wiki/UTF-16/UCS-2:

    UTF-16 (16-bit Unicode Transformation Format) is a character
    encoding for Unicode capable of encoding 1,112,064[1] numbers
    (called code points) in the Unicode code space from 0 to 0x10FFFF.
    It produces a variable-length result of either one or two 16-bit
    code units per code point.

  2. From the first source

    UTF-8 also has the advantage that the unit of encoding is the
    byte, so there are no byte-ordering issues.

    Why doesn't UTF-8 have a byte-order problem? It is variable-width,
    and one character may contain more than one byte, so I think byte
    order could still be a problem?

Thanks and regards!

Best Answer

(1) What does "byte sequence" mean, an array of char in C? Is UTF-16 a byte sequence, or what is it then? (2) Why does a byte sequence have nothing to do with variable length?

You seem to be misunderstanding what endian issues are. Here's a brief summary.

A 32-bit integer takes up 4 bytes. Now, we know the logical ordering of these bytes. If you have a 32-bit integer, you can get the high byte of this with the following code:

uint32_t value = 0x8100FF32;
uint8_t highByte = (uint8_t)((value >> 24) & 0xFF); //Now contains 0x81

That's all well and good. Where the problem begins is how various hardware stores and retrieves integers from memory.

In Big Endian order, a 4 byte piece of memory that you read as a 32-bit integer will be read with the first byte being the high byte:

[0][1][2][3]

In Little Endian order, a 4 byte piece of memory that you read as a 32-bit integer will be read with the first byte being the low byte:

[3][2][1][0]

If you have a pointer to a 32-bit value, you can do this:

uint32_t value = 0x8100FF32;
uint32_t *pValue = &value;
uint8_t *pHighByte = (uint8_t*)pValue;
uint8_t highByte = pHighByte[0]; //Now contains... ?

The C and C++ standards don't say which byte you get here; it depends on how the implementation lays out the bytes of the integer in memory. It could be 0x81, or it could be 0x32. On real systems it will be one or the other: the high byte on a big endian machine, the low byte on a little endian one.

If you have a pointer to a memory address, you can read that address as a 32-bit value, a 16-bit value, or an 8-bit value. On a big endian machine, the pointer points to the high byte; on a little endian machine, the pointer points to the low byte.
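As a quick illustration (a minimal sketch of my own, not something the question requires), you can observe your machine's byte order by looking at the first byte of a known 32-bit value; memcpy is used here instead of a pointer cast to sidestep any aliasing questions:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t value = 0x8100FF32;
    uint8_t firstByte;
    memcpy(&firstByte, &value, 1); //Copy the first byte of the in-memory representation

    if (firstByte == 0x81)
        printf("big endian\n");
    else if (firstByte == 0x32)
        printf("little endian\n");
    else
        printf("some other byte order\n");
    return 0;
}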

Note that this is all about reading and writing to/from memory. It has nothing to do with arithmetic done purely in C/C++. The first version of the code, the one that uses shifts and masks, will always get the high byte, regardless of the machine's byte order.

The issue arises when you start reading byte streams, such as from a file.

16-bit values have the same issues as 32-bit ones; they just have 2 bytes instead of 4. Therefore, a file could contain 16-bit values stored in big endian or little endian order.
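As a sketch of the usual fix (the function names here are just for illustration), code that reads such a file picks the interpretation explicitly, combining the bytes with shifts so the result is the same on any host:

#include <stdint.h>

//Interpret two bytes from a buffer as a big endian 16-bit value.
uint16_t read_u16_be(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

//Interpret the same two bytes as a little endian 16-bit value.
uint16_t read_u16_le(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[1] << 8) | p[0]);
}

The bytes {0x26, 0x3A} come out as 0x263A with the first function and 0x3A26 with the second; the file format has to tell you which one is right.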

UTF-16 is defined as a sequence of 16-bit values. Effectively, it is a uint16_t[]. Each individual code unit is a 16-bit value. Therefore, in order to properly load UTF-16, you must know what the endian-ness of the data is.
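One common way to learn that endian-ness is a byte order mark: the codepoint U+FEFF at the start of the data. A minimal sketch of that check (the function name is just an illustration):

#include <stdint.h>
#include <stddef.h>

//Returns 1 for big endian data, 0 for little endian data, -1 if no BOM is present.
int utf16_order_from_bom(const uint8_t *data, size_t size)
{
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return 1;  //U+FEFF stored as FE FF: big endian
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return 0;  //U+FEFF stored as FF FE: little endian
    return -1;     //No BOM; the byte order must come from the protocol or file format
}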

UTF-8 is defined as a sequence of 8-bit values. It is a uint8_t[]. Each individual code unit is 8-bits in size: a single byte.

Now, both UTF-16 and UTF-8 allow for multiple code units (16-bit or 8-bit values) to combine together to form a Unicode codepoint (a "character", but that's not the correct term; it is a simplification). The order of these code units that form a codepoint is dictated by the UTF-16 and UTF-8 encodings.

When processing UTF-16, you read a 16-bit value, doing whatever endian conversion is needed. Then, you detect if it is a surrogate pair; if it is, then you read another 16-bit value, combine the two, and from that, you get the Unicode codepoint value.
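That combining step looks roughly like this (a sketch that assumes the two code units have already been read and endian-converted, and that the caller has checked they really are a high/low surrogate pair):

#include <stdint.h>

//Combine a UTF-16 surrogate pair into a Unicode codepoint.
//'high' must be in [0xD800, 0xDBFF] and 'low' in [0xDC00, 0xDFFF].
uint32_t utf16_combine_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10) | (uint32_t)(low - 0xDC00));
}

A code unit outside the surrogate range is already a complete codepoint on its own.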

When processing UTF-8, you read an 8-bit value. No endian conversion is possible, since there is only one byte. If the first byte denotes a multi-byte sequence, then you read some number of bytes, as dictated by the multi-byte sequence. Each individual byte is a byte and therefore has no endian conversion. The order of these bytes in the sequence, just as the order of surrogate pairs in UTF-16, is defined by UTF-8.

So there can be no endian issues with UTF-8.
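For comparison, here is a sketch of decoding one UTF-8 sequence (validation of continuation bytes and overlong forms is left out for brevity). Notice that it only ever reads single bytes, in the order they appear, so there is nothing to byte-swap:

#include <stdint.h>

//Decode one UTF-8 sequence starting at 'p' into a codepoint.
//Returns the number of bytes consumed (1-4); no validation for brevity.
int utf8_decode(const uint8_t *p, uint32_t *codepoint)
{
    if (p[0] < 0x80) {                     //0xxxxxxx: one byte
        *codepoint = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {    //110xxxxx: two bytes
        *codepoint = ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        return 2;
    } else if ((p[0] & 0xF0) == 0xE0) {    //1110xxxx: three bytes
        *codepoint = ((uint32_t)(p[0] & 0x0F) << 12) | ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        return 3;
    } else {                               //11110xxx: four bytes
        *codepoint = ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12) |
                     ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
        return 4;
    }
}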
