Should Source Code Be in UTF-8?

character encoding, coding-standards, source code, utf-8

I feel that you often don't really choose what encoding your code is in. Most of my tools in the past have decided for me, or I haven't even thought about it. I was using TextPad on Windows the other day, and as I was saving a file it prompted me to choose between ASCII, UTF-8/16, Unicode, etc.

I am assuming that almost all code written is ASCII, but why should it be ASCII? Should we actually be using UTF-8 files now for source code, and why? I'd imagine this might be useful on multi-lingual teams. Are there standards associated with how multilingual teams name variables/functions/etc?

Best Answer

The choice is not between ASCII and UTF-8. ASCII is a 7-bit encoding, and UTF-8 is a superset of it: any valid ASCII text is also valid UTF-8. The problems arise when you use non-ASCII characters; for these you have to pick between UTF-8, UTF-16, UTF-32, and various 8-bit encodings (ISO-xxxx, etc.).

The best solution is to stick with a strict ASCII charset, that is, just don't use any non-ASCII characters in your code. Most programming languages provide ways to express non-ASCII characters using ASCII characters, e.g. "\u1234" to indicate the Unicode code point U+1234. In particular, avoid using non-ASCII characters in identifiers. Even if they work correctly, people who use a different keyboard layout are going to curse you for making them type these characters.
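The ASCII-only rule is easy to check mechanically, because an ASCII byte never has its high bit set. A minimal sketch in C of such a check; the function name `first_non_ascii` is made up for this illustration and is not part of any library:

```c
#include <stddef.h>

/* Hypothetical helper: returns the index of the first byte that is
   outside 7-bit ASCII, or len if every byte is plain ASCII. */
size_t first_non_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 0x7F) {
            return i;
        }
    }
    return len;
}
```

Run over the bytes of a source file, a return value equal to `len` means the file reads the same whether it is opened as ASCII, UTF-8, or an ASCII-compatible 8-bit code page.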

If you can't avoid non-ASCII characters, UTF-8 is your best bet. Unlike UTF-16 and UTF-32, it is a superset of ASCII, which means anyone who opens the file with the wrong encoding still gets most of it right; and unlike 8-bit codepages, it can encode virtually every character you'll ever need, unambiguously, and it's available on every system, regardless of locale.

And then you have the encoding that your code processes; this doesn't have to be the same as the encoding of your source file. For example, I can easily write PHP in UTF-8 but set its internal multibyte encoding to, say, Latin-1. Because the PHP parser does not concern itself with encodings at all, but rather just reads byte sequences, my UTF-8 string literals will be misinterpreted as Latin-1. If I output these strings on a UTF-8 terminal, I won't see any difference, because the bytes pass through unchanged, but string lengths and other multibyte operations (e.g. substr) will produce wrong results.
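The same mismatch between bytes and characters can be shown in C rather than PHP. The sketch below counts code points in a UTF-8 string by skipping continuation bytes; `utf8_codepoints` is a hypothetical name, and the code assumes the input is already valid UTF-8:

```c
#include <stddef.h>

/* Hypothetical helper: counts Unicode code points in a NUL-terminated
   UTF-8 string. Every byte that is not a continuation byte (10xxxxxx)
   starts a new code point. Assumes valid UTF-8; does no validation. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the string "naïve" written as `"na\xC3\xAF" "ve"`, `strlen` reports 6 bytes while `utf8_codepoints` reports 5 code points. Code that treats each byte as one character, as a Latin-1 setting would, gets the first number where it wanted the second.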

My rule of thumb is to use UTF-8 for everything; only if you absolutely have to deal with other encodings, convert to UTF-8 as early as possible and from UTF-8 as late as possible.
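As an illustration of converting to UTF-8 at the boundary, here is a rough sketch of a Latin-1 to UTF-8 conversion in C. Latin-1 is a convenient case because its byte values map directly to the code points U+0000 through U+00FF; other legacy encodings need lookup tables. The function name `latin1_to_utf8` and its buffer contract are assumptions made for this example:

```c
#include <stddef.h>

/* Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
   The caller must provide room for 2 * len bytes in out (worst case).
   Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[n++] = c;                                   /* ASCII: unchanged */
        } else {
            out[n++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte 110xxxxx */
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation 10xxxxxx */
        }
    }
    return n;
}
```

The Latin-1 byte 0xE9 ("é") becomes the two UTF-8 bytes 0xC3 0xA9, while ASCII bytes pass through untouched.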

Related Solutions

Character Encoding – Should Non-UTF Encodings Be Deprecated?

Since you mentioned PostgreSQL, I can say with some authority that the main killer reason why non-UTF8 server-side encodings are supported in such detail is that the Japanese need it. Apparently, identical round-trip conversion between Unicode and the various Japanese "legacy" encodings is not always possible, and in some cases conversion tables are even different between vendors. It's baffling really, but it's apparently so. (The extensive character set support is also one of the reasons why PostgreSQL is so popular in Japan.)

Since we are talking about a database system, one of the main jobs is to be able to store and retrieve data reliably, as defined by the user, so lossy character set conversion sometimes won't fly. If you were dealing with a web browser, say, where all that really matters is whether the result looks OK, then you could probably get away with supporting fewer encodings, but in a database system you have extra requirements.

Some of the other reasons mentioned in other answers also apply as supporting arguments. But as long as the Japanese veto it, character set support cannot be reduced.

UTF-16 vs UTF-8 – Fixed-Width or Variable-Width and Byte-Order Issues

(1) What does byte sequence mean, an array of char in C? Is UTF-16 a byte sequence, or what is it then? (2) Why does a byte sequence have nothing to do with variable length?

You seem to be misunderstanding what endian issues are. Here's a brief summary.

A 32-bit integer takes up 4 bytes. Now, we know the logical ordering of these bytes. If you have a 32-bit integer, you can get the high byte of this with the following code:

uint32_t value = 0x8100FF32;
uint8_t highByte = (uint8_t)((value >> 24) & 0xFF); //Now contains 0x81

That's all well and good. Where the problem begins is how various hardware stores and retrieves integers from memory.

In Big Endian order, a 4-byte piece of memory that you read as a 32-bit integer will be read with the first byte being the high byte. Numbering the integer's bytes from 0 (high) to 3 (low), memory holds them in this order:

[0][1][2][3]

In Little Endian order, a 4-byte piece of memory that you read as a 32-bit integer will be read with the first byte being the low byte:

[3][2][1][0]

If you have a pointer to a 32-bit value, you can do this:

uint32_t value = 0x8100FF32;
uint32_t *pValue = &value;
uint8_t *pHighByte = (uint8_t*)pValue;
uint8_t highByte = pHighByte[0]; //Now contains... ?

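C/C++ does not specify the result of this; it depends on how the machine stores integers in memory. (Strictly speaking it is implementation-defined rather than undefined, since reading an object through a byte pointer is allowed.) It could be 0x81. Or it could be 0x32. In principle other byte orders exist, but on real systems it will be one or the other.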

If you have a pointer to a memory address, you can read that address as a 32-bit value, a 16-bit value, or an 8-bit value. On a big endian machine, the pointer points to the high byte; on a little endian machine, the pointer points to the low byte.

Note that this is all about reading and writing to/from memory. It has nothing to do with how values behave inside C/C++ expressions. The first version of the code, the one that uses a shift instead of reading memory through a byte pointer, will always work to get the high byte.
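The two approaches can be put side by side in a short sketch. The function names are invented for this illustration; `memcpy` is used to look at the stored bytes, which sidesteps any pointer-aliasing questions:

```c
#include <stdint.h>
#include <string.h>

/* High byte via arithmetic: same result on every machine. */
uint8_t high_byte_by_shift(uint32_t value)
{
    return (uint8_t)((value >> 24) & 0xFF);
}

/* First byte as stored in memory: depends on the machine's byte order. */
uint8_t first_byte_in_memory(uint32_t value)
{
    uint8_t bytes[sizeof value];
    memcpy(bytes, &value, sizeof value);
    return bytes[0];
}

/* Returns 1 on a little-endian machine, 0 otherwise. */
int is_little_endian(void)
{
    return first_byte_in_memory(1u) == 1;
}
```

For 0x8100FF32, `high_byte_by_shift` always yields 0x81, while `first_byte_in_memory` yields 0x81 on a big-endian machine and 0x32 on a little-endian one.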

The issue arises when you start reading byte streams, such as from a file.

16-bit values have the same issues as 32-bit ones; they just have 2 bytes instead of 4. Therefore, a file could contain 16-bit values stored in big endian or little endian order.

UTF-16 is defined as a sequence of 16-bit values. Effectively, it is a uint16_t[]. Each individual code unit is a 16-bit value. Therefore, in order to properly load UTF-16, you must know what the endian-ness of the data is.
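A portable way to load such data is to assemble each code unit from its two bytes with shifts, so the result does not depend on the byte order of the machine doing the reading. A minimal sketch, with helper names chosen for this example:

```c
#include <stdint.h>

/* Assembles one 16-bit code unit from two bytes of a big-endian
   (UTF-16BE) stream: the first byte is the high byte. */
uint16_t read_u16_be(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Same for a little-endian (UTF-16LE) stream: the first byte is
   the low byte. */
uint16_t read_u16_le(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}
```

The bytes 0x00 0x41 decode to 0x0041 ("A") when read as big endian but to 0x4100, a different character entirely, when read as little endian. That is why the byte order has to be known before the data can be interpreted.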

UTF-8 is defined as a sequence of 8-bit values. It is a uint8_t[]. Each individual code unit is 8 bits in size: a single byte.

Now, both UTF-16 and UTF-8 allow for multiple code units (16-bit or 8-bit values) to combine together to form a Unicode codepoint (a "character", but that's not the correct term; it is a simplification). The order of these code units that form a codepoint is dictated by the UTF-16 and UTF-8 encodings.

When processing UTF-16, you read a 16-bit value, doing whatever endian conversion is needed. Then, you detect if it is the first half of a surrogate pair (a high surrogate); if it is, then you read another 16-bit value (the low surrogate), combine the two, and from that, you get the Unicode codepoint value.
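The surrogate step might look like the sketch below. It assumes the code units have already been converted to native byte order and that the input is well formed (every high surrogate is followed by a low surrogate); `utf16_next` is a hypothetical name:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point starting at units[*pos]
   and advances *pos past it. Assumes well-formed UTF-16 in native byte
   order; does no validation. */
uint32_t utf16_next(const uint16_t *units, size_t *pos)
{
    uint16_t first = units[(*pos)++];
    if (first >= 0xD800 && first <= 0xDBFF) {       /* high surrogate */
        uint16_t second = units[(*pos)++];          /* low surrogate */
        return 0x10000u + (((uint32_t)(first - 0xD800) << 10)
                           | (uint32_t)(second - 0xDC00));
    }
    return first;                                   /* single code unit */
}
```

For example, the code units 0xD83D 0xDE00 combine into the code point U+1F600.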

When processing UTF-8, you read an 8-bit value. No endian conversion is possible, since there is only one byte. If the first byte denotes a multi-byte sequence, then you read some number of additional bytes, as dictated by that first byte. Each individual byte is a byte and therefore has no endian conversion. The order of these bytes in the sequence, just as the order of surrogate pairs in UTF-16, is defined by UTF-8.
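A corresponding sketch for UTF-8 follows. The lead byte alone tells the decoder how many continuation bytes to read, and no byte is ever reassembled from smaller parts. As before, `utf8_next` is a made-up name and the code assumes well-formed input:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point starting at bytes[*pos]
   and advances *pos past it. Assumes well-formed UTF-8; does no
   validation. */
uint32_t utf8_next(const uint8_t *bytes, size_t *pos)
{
    uint8_t lead = bytes[(*pos)++];
    uint32_t cp;
    int extra;

    if (lead < 0x80)      { cp = lead;        extra = 0; }  /* 0xxxxxxx */
    else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  /* 110xxxxx */
    else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  /* 1110xxxx */
    else                  { cp = lead & 0x07; extra = 3; }  /* 11110xxx */

    /* Each continuation byte (10xxxxxx) contributes six more bits. */
    while (extra-- > 0) {
        cp = (cp << 6) | (uint32_t)(bytes[(*pos)++] & 0x3F);
    }
    return cp;
}
```

The bytes 0xF0 0x9F 0x98 0x80 decode to U+1F600, the same code point as the UTF-16 pair 0xD83D 0xDE00, and they decode that way on any machine.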

So there can be no endian issues with UTF-8.
