A Unicode sentinel value I can use

unicode

I am designing a file format and I want to do it right. Since it is a binary format, the very first byte (or bytes) of the file should not form valid textual characters (just like in the PNG file header; see footnote 1). This allows tools that do not recognize the format to still see that it's not a text file by looking at the first few bytes.

Any byte value above 0x7F is invalid US-ASCII, so that's easy. But for Unicode it's a whole different story. Apart from valid Unicode characters there are private-use characters, noncharacters and sentinels, as I found in the Unicode Private-Use Characters, Noncharacters & Sentinels FAQ.

What would be a sentinel sequence of bytes that I can use at the start of the file that would result in invalid US-ASCII, UTF-8, UTF-16LE and UTF-16BE?

  • Obviously the first byte cannot have a value below 0x80, as that would be a valid US-ASCII (control) character, so 0x00 cannot be used.
  • Also, since private-use characters are valid Unicode characters, I can't use those codepoints either.
  • Since it must work with both little-endian and big-endian UTF-16, a noncharacter such as 0xFFFE is also not possible as its reverse 0xFEFF is a valid Unicode character.
  • The above-mentioned FAQ suggests not using any of the noncharacters, as that would still result in a valid Unicode sequence, so something like 0xFFFF is also out of the picture.
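
The exclusions above can be checked with a small sketch. It assumes Python's built-in strict codecs as the judge of validity; the helper name `decodes` is made up for this illustration, and other decoders may be more lenient or stricter than Python's.

```python
def decodes(data: bytes, encoding: str) -> bool:
    """Return True if Python's strict codec accepts `data` in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# 0x00 is a valid US-ASCII control character (NUL).
print(decodes(b"\x00", "ascii"))

# E0 00 read as big-endian UTF-16 is U+E000, a private-use character.
print(decodes(b"\xe0\x00", "utf-16-be"))

# FF FE is the noncharacter U+FFFE in big-endian order, but the same
# bytes read as little-endian are U+FEFF, an ordinary valid character.
print(decodes(b"\xff\xfe", "utf-16-be"))
print(decodes(b"\xff\xfe", "utf-16-le"))

# FF FF is the noncharacter U+FFFF in either byte order; the codec
# still accepts it, which matches the FAQ's warning.
print(decodes(b"\xff\xff", "utf-16-le"))
```

Under these assumptions every call reports that the candidate decodes successfully, which is why none of them works as a sentinel.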

What would be the future-proof sentinel values that are left for me to use?


1) The PNG format has as its very first byte the non-ASCII value 0x89, followed by the string PNG. A tool that reads the first few bytes of a PNG file may determine that it is a binary file, since it cannot interpret 0x89. A GIF file, on the other hand, starts directly with the valid and readable ASCII string GIF, followed by three more valid ASCII characters. For GIF, a tool might therefore wrongly determine that it is a readable text file. The idea of starting the file with a non-textual byte sequence came from Designing File Formats by Andy McFadden.

Best Answer

0xDC 0xDC

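  • Obviously invalid ASCII and UTF-8: 0xDC is above 0x7F, and in UTF-8 it is the lead byte of a two-byte sequence, which must be followed by a continuation byte in the range 0x80 to 0xBF. The second 0xDC is not one.
  • Unpaired trail surrogate in lead position, regardless of endianness, in UTF-16: both bytes are identical, so the first code unit is 0xDCDC in either byte order. It doesn't get more invalid UTF-16 than that.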
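
As a quick sanity check, the two points above can be tested with a minimal sketch. It assumes Python's strict codecs as the reference decoders; the names `SENTINEL`, `ENCODINGS` and `rejecting_encodings` are made up for this illustration, and a lenient decoder that substitutes U+FFFD instead of raising an error would behave differently.

```python
SENTINEL = bytes([0xDC, 0xDC])
ENCODINGS = ("ascii", "utf-8", "utf-16-le", "utf-16-be")


def rejecting_encodings(data: bytes, encodings=ENCODINGS) -> list:
    """Return the encodings whose strict decoder refuses `data`."""
    rejected = []
    for encoding in encodings:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            # The strict codec raised, so `data` is not valid text here.
            rejected.append(encoding)
    return rejected


# All four encodings from the question should appear in the result.
print(rejecting_encodings(SENTINEL))
```

The same helper can be pointed at other candidate prefixes to compare them before settling on a file signature.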