Is the BOM optional for UTF-16 and UTF-32?

character encoding, unicode

I used to think that the BOM is optional for UTF-8, but mandatory for UTF-16 and UTF-32.

But then I read the following (in this article):

Let's look just at the ones that Notepad supports.

8-bit ANSI (of which 7-bit ASCII is a subset). These have no BOM; they
just dive right in with bytes of text. They are also probably the most
common type of text file.

UTF-8. These usually begin with a BOM but not always.

Unicode big-endian (UTF-16BE). These usually begin with a BOM but not
always.

Unicode little-endian (UTF-16LE). These usually begin with a BOM but
not always.

So is Notepad not complying with the Unicode standard, or does the Unicode standard say that the BOM is optional for UTF-16?

Best Answer

The BOM is entirely optional. However, in order to decode UTF-16 you need to know the correct byte order. If you decode with the wrong byte order, you will generally still get valid code points, just not the intended ones, so the mistake can go unnoticed. To determine the correct byte order,

  • either we know it from an external source, e.g. documentation that states “this tool will always produce UTF-16LE output”,
  • or the encoded text contains a BOM.
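
To make the "valid but wrong code points" claim concrete, here is a minimal sketch in Python (chosen only for illustration; the question does not mention any language). It encodes two ASCII letters as UTF-16LE and then decodes the same bytes with both byte orders.

```python
# Encode "Hi" as UTF-16 little-endian: each code unit is stored low byte first.
encoded_le = "Hi".encode("utf-16-le")  # bytes 48 00 69 00

# Decoding with the correct byte order recovers the original text.
right = encoded_le.decode("utf-16-le")

# Decoding with the wrong byte order swaps the two bytes of each code unit:
# U+0048 becomes U+4800 and U+0069 becomes U+6900. Both are valid code
# points (CJK ideographs), so the decoder raises no error.
wrong = encoded_le.decode("utf-16-be")

print([f"U+{ord(c):04X}" for c in right])  # ['U+0048', 'U+0069']
print([f"U+{ord(c):04X}" for c in wrong])  # ['U+4800', 'U+6900']
```

Nothing in the byte stream itself reveals which of the two results was intended, which is why the byte order has to come from a BOM or from outside the data.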

For example, the XML standard is defined in a way so that XML documents can optionally start with a BOM, but the <?xml declaration at the start can also be used to determine the encoding.

Editors or web browsers have to work reasonably even when BOMs are missing and the encoding is ambiguous. They can use statistical data, e.g. on expected character frequencies, to take a guess, but in the end the user should be able to override the encoding.
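
As a rough illustration of such a guess, the sketch below uses a deliberately crude heuristic: in mostly-ASCII text, the high byte of each UTF-16 code unit is zero, so the position of the zero bytes hints at the byte order. The function name `guess_utf16_byte_order` is hypothetical, and this is not a claim about how Notepad or any real editor or browser works.

```python
from typing import Optional


def guess_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess a codec name for BOM-less UTF-16 data, or None if undecided.

    Assumption: the text is mostly ASCII/Latin-1, so one byte of each
    16-bit code unit is usually zero.
    """
    even_zeros = data[0::2].count(0)  # zero bytes at offsets 0, 2, 4, ...
    odd_zeros = data[1::2].count(0)   # zero bytes at offsets 1, 3, 5, ...
    if even_zeros > odd_zeros:
        return "utf-16-be"  # high (zero) byte comes first
    if odd_zeros > even_zeros:
        return "utf-16-le"  # high (zero) byte comes second
    return None  # no evidence either way; ask the user


sample = "Hello".encode("utf-16-le")
guess = guess_utf16_byte_order(sample)
```

The heuristic has no basis for text in scripts whose code units rarely contain a zero byte, which is exactly why the user must be able to override the result.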


If we look at the Unicode Specification (version 10.0), then section 2.6 Encoding Schemes states:

When a higher-level protocol supplies mechanisms for handling the endianness of integral data types, it is not necessary to use Unicode encoding schemes or the byte order mark. In those cases Unicode text is simply a sequence of integral data types.

I.e., as explained above, the BOM is not necessary when we have external information about the byte order. However, software that does use the BOM is required to recognize it. From section 23.8 Specials:

Where the byte order is explicitly specified, such as in UTF-16BE or UTF-16LE, then all U+FEFF characters—even at the very beginning of the text—are to be interpreted as zero width no-break spaces. Similarly, where Unicode text has known byte order, initial U+FEFF characters are not required, but for backward compatibility are to be interpreted as zero width no-break spaces. […]

Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space.
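
The two quoted rules can be observed with Python's standard codecs, used here only as an example of one implementation: the `utf-16` codec treats an initial U+FEFF as a byte order signal and removes it, while the `utf-16-le` codec, where the byte order is explicitly specified, keeps it as a character.

```python
import codecs

# A little-endian BOM followed by the letter "A" in UTF-16LE.
data = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")  # b'\xff\xfeA\x00'

# "utf-16": the initial U+FEFF signals the byte order and is dropped.
bom_consumed = data.decode("utf-16")

# "utf-16-le": the byte order is already fixed, so U+FEFF stays in the text.
bom_kept = data.decode("utf-16-le")

print(repr(bom_consumed))  # 'A'
print(repr(bom_kept))      # '\ufeffA'
```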

In section 3.10 Encoding Schemes, various UTF encodings are defined. Here, UTF-16LE, UTF-16BE, and UTF-16 are different encodings. The LE and BE variants do not have a BOM. For UTF-16:

The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

The same holds for UTF-32LE, UTF-32BE, and UTF-32.
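
The rule for the plain UTF-16 and UTF-32 encoding schemes can be written out as a short sketch. The helper names `decode_utf16_scheme` and `decode_utf32_scheme` are hypothetical, and the code assumes that no higher-level protocol supplies the byte order, so the big-endian default from the quote applies.

```python
import codecs


def decode_utf16_scheme(data: bytes) -> str:
    """Decode the 'UTF-16' encoding scheme: honor a BOM, else big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No BOM and no external information: the byte order is big-endian.
    return data.decode("utf-16-be")


def decode_utf32_scheme(data: bytes) -> str:
    """Decode the 'UTF-32' encoding scheme: honor a BOM, else big-endian."""
    if data.startswith(codecs.BOM_UTF32_BE):
        return data[len(codecs.BOM_UTF32_BE):].decode("utf-32-be")
    if data.startswith(codecs.BOM_UTF32_LE):
        return data[len(codecs.BOM_UTF32_LE):].decode("utf-32-le")
    # No BOM and no external information: the byte order is big-endian.
    return data.decode("utf-32-be")


print(decode_utf16_scheme(b"\xff\xfeA\x00"))  # BOM says little-endian
print(decode_utf16_scheme(b"\x00A"))          # no BOM, big-endian default
```

Each function handles only its own encoding family. Telling UTF-16 from UTF-32 in the first place is a separate problem, since the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.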

So the Unicode Standard does state that the BOM is optional, and mandates how software must handle the presence or absence of a BOM under various circumstances. Unicode-aware software that does not handle encoded text without BOMs is broken.