UTF-8 Encoding – Why Does It Waste Several Bits?

character encodingtext-encodingutf-8

According to the Wikipedia article, UTF-8 has this format:

First code Last code Bytes Byte 1    Byte 2    Byte 3    Byte 4
point      point     Used
U+0000     U+007F    1     0xxxxxxx
U+0080     U+07FF    2     110xxxxx  10xxxxxx
U+0800     U+FFFF    3     1110xxxx  10xxxxxx  10xxxxxx
U+10000    U+1FFFFF  4     11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
x means that this bit is used to select the code point.

This wastes two bits on each continuation byte and one bit in the first byte. Why is UTF-8 not encoded like the following?

First code Last code Bytes Byte 1    Byte 2    Byte 3
point      point     Used
U+0000     U+007F    1     0xxxxxxx
U+0080     U+3FFF    2     10xxxxxx  xxxxxxxx
U+0800     U+1FFFFF  3     110xxxxx  xxxxxxxx  xxxxxxxx

It would save one byte when the code point is out of the Basic Multilingual Plane or if the code point is in range [U+800,U+3FFF].

Why is UTF-8 not encoded in a more efficient way?

Best Answer

This is done so that you can detect when you are in the middle of a multi-byte sequence. When looking at UTF-8 data, you know that if you see 10xxxxxx, that you are in the middle of a multibyte character, and should back up in the stream until you see either 0xxxxxx or 11xxxxxx. Using your scheme, bytes 2 or 3 could easily end up with patters like either 0xxxxxxx or 11xxxxxx

Also keep in mind that how much is saved varies entirely on what sort of string data you are encoding. For most text, even Asian text, you will rarely if ever see four byte characters with normal text. Also, people's naive estimates about how text will look are often wrong. I have text localized for UTF-8 that includes Japanese, Chinese and Korean strings, yet it is actually Russian that takes most space. (Because our Asian strings often have Roman characters interspersed for proper names, punctuation and such and because the average Chinese word is 1-3 characters while the average Russian word is many, many more.)

Related Topic