UTF-8 Encoding – Why Does It Waste Several Bits?

character encodingtext-encodingutf-8

According to the Wikipedia article, UTF-8 has this format:

First code Last code Bytes Byte 1    Byte 2    Byte 3    Byte 4
point      point     Used
U+0000     U+007F    1     0xxxxxxx
U+0080     U+07FF    2     110xxxxx  10xxxxxx
U+0800     U+FFFF    3     1110xxxx  10xxxxxx  10xxxxxx
U+10000    U+1FFFFF  4     11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
x means that this bit is used to select the code point.

This wastes two bits on each continuation byte and one bit in the first byte. Why is UTF-8 not encoded like the following?

First code Last code Bytes Byte 1    Byte 2    Byte 3
point      point     Used
U+0000     U+007F    1     0xxxxxxx
U+0080     U+3FFF    2     10xxxxxx  xxxxxxxx
U+0800     U+1FFFFF  3     110xxxxx  xxxxxxxx  xxxxxxxx

It would save one byte when the code point is out of the Basic Multilingual Plane or if the code point is in range [U+800,U+3FFF].

Why is UTF-8 not encoded in a more efficient way?

Best Answer

This is done so that you can detect when you are in the middle of a multi-byte sequence. When looking at UTF-8 data, you know that if you see 10xxxxxx, that you are in the middle of a multibyte character, and should back up in the stream until you see either 0xxxxxx or 11xxxxxx. Using your scheme, bytes 2 or 3 could easily end up with patters like either 0xxxxxxx or 11xxxxxx

Also keep in mind that how much is saved varies entirely on what sort of string data you are encoding. For most text, even Asian text, you will rarely if ever see four byte characters with normal text. Also, people's naive estimates about how text will look are often wrong. I have text localized for UTF-8 that includes Japanese, Chinese and Korean strings, yet it is actually Russian that takes most space. (Because our Asian strings often have Roman characters interspersed for proper names, punctuation and such and because the average Chinese word is 1-3 characters while the average Russian word is many, many more.)

Related Solutions

UTF-16 vs UTF-8 – Fixed-Width or Variable-Width and Byte-Order Issues

(1) What does byte sequence mean, an arrary of char in C? Is UTF-16 a byte sequence, or what is it then? (2) Why does a byte sequence have nothing to do with variable length?

You seem to be misunderstanding what endian issues are. Here's a brief summary.

A 32-bit integer takes up 4 bytes. Now, we know the logical ordering of these bytes. If you have a 32-bit integer, you can get the high byte of this with the following code:

uint32_t value = 0x8100FF32;
uint8_t highByte = (uint8_t)((value >> 24) & 0xFF); //Now contains 0x81

That's all well and good. Where the problem begins is how various hardware stores and retrieves integers from memory.

In Big Endian order, a 4 byte piece of memory that you read as a 32-bit integer will be read with the first byte being the high byte:

[0][1][2][3]

In Little Endian order, a 4 byte piece of memory that you read as a 32-bit integer will be read with the first byte being the low byte:

[3][2][1][0]

If you have a pointer to a pointer to a 32-bit value, you can do this:

uint32_t value = 0x8100FF32;
uint32_t *pValue = &value;
uint8_t *pHighByte = (uint8_t*)pValue;
uint8_t highByte = pHighByte[0]; //Now contains... ?

According to C/C++, the result of this is undefined. It could be 0x81. Or it could be 0x32. Technically, it could return anything, but for real systems, it will return one or the other.

If you have a pointer to a memory address, you can read that address as a 32-bit value, a 16-bit value, or an 8-bit value. On a big endian machine, the pointer points to the high byte; on a little endian machine, the pointer points to the low byte.

Note that this is all about reading and writing to/from memory. It has nothing to do with the internal C/C++ code. The first version of the code, the one that C/C++ doesn't declare as undefined, will always work to get the high byte.

The issue is when you start reading byte streams. Such as from a file.

16-bit values have the same issues as 32-bit ones; they just have 2 bytes instead of 4. Therefore, a file could contain 16-bit values stored in big endian or little endian order.

UTF-16 is defined as a sequence of 16-bit values. Effectively, it is a uint16_t[]. Each individual code unit is a 16-bit value. Therefore, in order to properly load UTF-16, you must know what the endian-ness of the data is.

UTF-8 is defined as a sequence of 8-bit values. It is a uint8_t[]. Each individual code unit is 8-bits in size: a single byte.

Now, both UTF-16 and UTF-8 allow for multiple code units (16-bit or 8-bit values) to combine together to form a Unicode codepoint (a "character", but that's not the correct term; it is a simplification). The order of these code units that form a codepoint is dictated by the UTF-16 and UTF-8 encodings.

When processing UTF-16, you read a 16-bit value, doing whatever endian conversion is needed. Then, you detect if it is a surrogate pair; if it is, then you read another 16-bit value, combine the two, and from that, you get the Unicode codepoint value.

When processing UTF-8, you read an 8-bit value. No endian conversion is possible, since there is only one byte. If the first byte denotes a multi-byte sequence, then you read some number of bytes, as dictated by the multi-byte sequence. Each individual byte is a byte and therefore has no endian conversion. The order of these bytes in the sequence, just as the order of surrogate pairs in UTF-16, is defined by UTF-8.

So there can be no endian issues with UTF-8.

ASCII vs UTF-8 – Advantages of Choosing ASCII Encoding

In some cases it can speed up access to individual characters. Imagine string str='ABC' encoded in UTF8 and in ASCII (and assuming that the language/compiler/database knows about encoding)

To access third (C) character from this string using array-access operator which is featured in many programming languages you would do something like c = str[2].

Now, if the string is ASCII encoded, all we need to do is to fetch third byte from the string.

If, however string is UTF-8 encoded, we must first check if first character is a one or two byte char, then we need to perform same check on second character, and only then we can access the third character. The difference in performance will be the bigger, the longer the string.

This is an issue for example in some database engines, where to find a beginning of a column placed 'after' a UTF-8 encoded VARCHAR, database does not only need to check how many characters are there in the VARCHAR field, but also how many bytes each one of them uses.

Best Answer

Related Solutions

UTF-16 vs UTF-8 – Fixed-Width or Variable-Width and Byte-Order Issues

ASCII vs UTF-8 – Advantages of Choosing ASCII Encoding

Related Topic