Is the BOM optional for UTF-16 and UTF-32

character encodingunicode

I used to think that the BOM is optional for UTF-8, but mandatory for UTF-16 and UTF-32.

But then I have read the following (in this article):

Let's look just at the ones that Notepad supports.

8-bit ANSI (of which 7-bit ASCII is a subset). These have no BOM; they
just dive right in with bytes of text. They are also probably the most
common type of text file.

UTF-8. These usually begin with a BOM but not always.

Unicode big-endian (UTF-16BE). These usually begin with a BOM but not
always.

Unicode little-endian (UTF-16LE). These usually begin with a BOM but
not always.

So is Notepad not complying with the Unicode standard, or does the Unicode standard says that the BOM is optional for UTF-16?

Best Answer

The BOM is entirely optional. However, in order to decode UTF-16 you need to know the correct byte order. If you decode with the wrong byte order, you will generally also get valid codepoints. To know the correct byte order, we

either know the byte order from an external source, e.g. documentation that states “this tool will always produce UTF-16LE output”,
or the encoded text contains a BOM.

For example, the XML standard is defined in a way so that XML documents can optionally start with a BOM, but the <?xml declaration at the start can also be used to determine the encoding.

Editors or web browsers have to work reasonably even when BOMs are missing, and the encoding is ambiguous. They can use statistical data e.g. on expected character frequencies to take a guess, but in the end the user should be able to override the encoding.

If we look at the Unicode Specification (version 10.0), then section 2.6 Encoding Schemes states:

When a higher-level protocol supplies mechanisms for handling the endianness of integral data types, it is not necessary to use Unicode encoding schemes or the byte order mark. In those cases Unicode text is simply a sequence of integral data types.

I.e. as explained above, the BOM is not necessary when we have external information about the byte order. However, there is a requirement that a BOM must be understood by some software dealing with Unicode. From section 23.8 Specials:

Where the byte order is explicitly specified, such as in UTF-16BE or UTF-16LE, then all U+FEFF characters—even at the very beginning of the text—are to be interpreted as zero width no-break spaces. Similarly, where Unicode text has known byte order, initial U+FEFF characters are not required, but for backward compatibility are to be interpreted as zero width no-break spaces. […]

Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space.

In section 3.10 Encoding Schemes, various UTF encodings are defined. Here, UTF-16LE, UTF16BE, and UTF-16 are different encodings. The LE and BE variants do not have a BOM. For UTF-16:

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

Equivalently for UTF-32LE, UTF-32BE, and UTF-32.

So the Unicode Standard does state that the BOM is optional, and mandates how software must handle the presence or absence of a BOM under various circumstances. Unicode-aware software that does not handle encoded text without BOMs is broken.

Related Solutions

Character Encoding – Should Non-UTF Encodings Be Deprecated?

Since you mentioned PostgreSQL, I can say with some authority that the main killer reason why non-UTF8 server-side encodings are supported in such detail is that the Japanese need it. Apparently, identical round-trip conversion between Unicode and the various Japanese "legacy" encodings is not always possible, and in some cases conversion tables are even different between vendors. It's baffling really, but it's apparently so. (The extensive character set support is also one of the reasons why PostgreSQL is so popular in Japan.)

Since we are talking about a database system, one of the main jobs is to be able to store and retrieve data reliably, as defined by the user, so lossy character set conversion sometimes won't fly. If you were dealing with the a web browser, say, where all that really matters is whether the result looks OK, then you could probably get away with supporting less encodings, but in a database system you have extra requirements.

Some of the other reasons mentioned in other answers also apply as supporting arguments. But as long as the Japanese veto it, character setup support cannot be reduced.

Should UTF-16 be considered harmful

This is an old answer.
See UTF-8 Everywhere for the latest updates.

Opinion: Yes, UTF-16 should be considered harmful. The very reason it exists is because some time ago there used to be a misguided belief that widechar is going to be what UCS-4 now is.

Despite the "anglo-centrism" of UTF-8, it should be considered the only useful encoding for text. One can argue that source codes of programs, web pages and XML files, OS file names and other computer-to-computer text interfaces should never have existed. But when they do, text is not only for human readers.

On the other hand, UTF-8 overhead is a small price to pay while it has significant advantages. Advantages such as compatibility with unaware code that just passes strings with char*. This is a great thing. There're few useful characters which are SHORTER in UTF-16 than they are in UTF-8.

I believe that all other encodings will die eventually. This involves that MS-Windows, Java, ICU, python stop using it as their favorite. After long research and discussions, the development conventions at my company ban using UTF-16 anywhere except OS API calls, and this despite importance of performance in our applications and the fact that we use Windows. Conversion functions were developed to convert always-assumed-UTF8 std::strings to native UTF-16, which Windows itself does not support properly.

To people who say "use what needed where it is needed", I say: there's a huge advantage to using the same encoding everywhere, and I see no sufficient reason to do otherwise. In particular, I think adding wchar_t to C++ was a mistake, and so are the Unicode additions to C++0x. What must be demanded from STL implementations though is that every std::string or char* parameter would be considered unicode-compatible.

I am also against the "use what you want" approach. I see no reason for such liberty. There's enough confusion on the subject of text, resulting in all this broken software. Having above said, I am convinced that programmers must finally reach consensus on UTF-8 as one proper way. (I come from a non-ascii-speaking country and grew up on Windows, so I'd be last expected to attack UTF-16 based on religious grounds).

I'd like to share more information on how I do text on Windows, and what I recommend to everyone else for compile-time checked unicode correctness, ease of use and better multi-platformness of the code. The suggestion substantially differs from what is usually recommended as the proper way of using Unicode on windows. Yet, in depth research of these recommendations resulted in the same conclusion. So here goes:

Do not use wchar_t or std::wstring in any place other than adjacent point to APIs accepting UTF-16.
Don't use _T("") or L"" UTF-16 literals (These should IMO be taken out of the standard, as a part of UTF-16 deprecation).
Don't use types, functions or their derivatives that are sensitive to the _UNICODE constant, such as LPTSTR or CreateWindow().
Yet, _UNICODE always defined, to avoid passing char* strings to WinAPI getting silently compiled
std::strings and char* anywhere in program are considered UTF-8 (if not said otherwise)
All my strings are std::string, though you can pass char* or string literal to convert(const std::string &).
only use Win32 functions that accept widechars (LPWSTR). Never those which accept LPTSTR or LPSTR. Pass parameters this way:
```
::SetWindowTextW(Utils::convert(someStdString or "string litteral").c_str())
```
(The policy uses conversion functions below.)

With MFC strings:

CString someoneElse; // something that arrived from MFC. Converted as soon as possible, before passing any further away from the API call:

std::string s = str(boost::format("Hello %s\n") % Convert(someoneElse));
AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK);

Working with files, filenames and fstream on Windows:
- Never pass std::string or const char* filename arguments to fstream family. MSVC STL does not support UTF-8 arguments, but has a non-standard extension which should be used as follows:
- Convert std::string arguments to std::wstring with Utils::Convert:
```
std::ifstream ifs(Utils::Convert("hello"),
                  std::ios_base::in |
                  std::ios_base::binary);
```
  We'll have to manually remove the convert, when MSVC's attitude to fstream changes.
- This code is not multi-platform and may have to be changed manually in the future
- See fstream unicode research/discussion case 4215 for more info.
- Never produce text output files with non-UTF8 content
- Avoid using fopen() for RAII/OOD reasons. If necessary, use _wfopen() and WinAPI conventions above.

// For interface to win32 API functions
std::string convert(const std::wstring& str, unsigned int codePage /*= CP_UTF8*/)
{
    // Ask me for implementation..
    ...
}

std::wstring convert(const std::string& str, unsigned int codePage /*= CP_UTF8*/)
{
    // Ask me for implementation..
    ...
}

// Interface to MFC
std::string convert(const CString &mfcString)
{
#ifdef UNICODE
    return Utils::convert(std::wstring(mfcString.GetString()));
#else
    return mfcString.GetString();   // This branch is deprecated.
#endif
}

CString convert(const std::string &s)
{
#ifdef UNICODE
    return CString(Utils::convert(s).c_str());
#else
    Exceptions::Assert(false, "Unicode policy violation. See W569"); // This branch is deprecated as it does not support unicode
    return s.c_str();   
#endif
}

Best Answer

Related Solutions

Character Encoding – Should Non-UTF Encodings Be Deprecated?

Should UTF-16 be considered harmful

Related Topic