Should Source Code Be in UTF-8?

character encoding, coding-standards, source code, utf-8

I feel that you often don't really choose what encoding your code is in. Most of my tools in the past have decided for me, or I haven't even thought about it. I was using TextPad on Windows the other day, and as I was saving a file it prompted me to choose between ASCII, UTF-8/16, Unicode, etc.

I am assuming that almost all code is written in ASCII, but why should it be ASCII? Should we actually be using UTF-8 files for source code now, and why? I'd imagine this might be useful on multilingual teams. Are there standards for how multilingual teams name variables, functions, etc.?

Best Answer

The choice is not between ASCII and UTF-8. ASCII is a 7-bit encoding, and UTF-8 is a superset of it: any valid ASCII text is also valid UTF-8. The problems arise when you use non-ASCII characters; for these you have to pick between UTF-8, UTF-16, UTF-32, and various 8-bit encodings (ISO-xxxx, etc.).

The best solution is to stick with strict ASCII, that is, don't use any non-ASCII characters in your code. Most programming languages provide ways to express non-ASCII characters using only ASCII characters, e.g. "\u1234" for the Unicode code point U+1234. In particular, avoid non-ASCII characters in identifiers. Even if they work correctly, people who use a different keyboard layout are going to curse you for making them type these characters.
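As a rough illustration of the escape approach, here is a minimal Python sketch (Python is my choice for the example, not something the question specifies). The source text is pure ASCII, yet the strings contain non-ASCII characters; the variable names are made up for the example.

```python
# ASCII-only source: non-ASCII characters are written as escapes.
cafe = "caf\u00e9"         # U+00E9, LATIN SMALL LETTER E WITH ACUTE
snowman = "\N{SNOWMAN}"    # named escape for U+2603

print(len(cafe))              # 4 characters, not 9
print(cafe.encode("utf-8"))   # b'caf\xc3\xa9'
print(ascii(snowman))         # '\u2603'
```

Because the file itself contains only ASCII bytes, it reads the same no matter which ASCII-compatible encoding an editor guesses.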

If you can't avoid non-ASCII characters, UTF-8 is your best bet. Unlike UTF-16 and UTF-32, it is a superset of ASCII, which means anyone who opens the file with the wrong encoding still gets most of it right. Unlike 8-bit codepages, it can encode virtually every character you'll ever need, unambiguously, and it's available on every system, regardless of locale.
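The "gets most of it right" point can be checked with a small Python sketch. It assumes the wrong guess is Latin-1, which is just one plausible 8-bit codepage; the sample text is arbitrary.

```python
text = "na\u00efve caf\u00e9"   # two accented letters, the rest is ASCII

# Pure ASCII text has identical bytes in ASCII and in UTF-8.
same_bytes = "hello".encode("ascii") == "hello".encode("utf-8")

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# UTF-8 misread as Latin-1: ASCII letters survive, only the accents garble.
wrong_utf8 = utf8_bytes.decode("latin-1")
# UTF-16 misread as Latin-1: a NUL character appears after every letter.
wrong_utf16 = utf16_bytes.decode("latin-1")

print(same_bytes)
print(repr(wrong_utf8))
print(repr(wrong_utf16))
```

In the UTF-8 case the ASCII parts of the text are still readable, whereas the UTF-16 bytes come out interleaved with NULs throughout.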

Then there is the encoding of the data your code processes, which doesn't have to be the same as the encoding of your source file. For example, I can easily write PHP in UTF-8 but set its internal multibyte encoding to, say, Latin-1. Because the PHP parser does not concern itself with encodings at all, but just reads byte sequences, my UTF-8 string literals will be misinterpreted as Latin-1. If I output these strings on a UTF-8 terminal, I won't see any difference, because the bytes pass through unchanged, but string lengths and other multibyte operations (e.g. substr) will produce wrong results.
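The same effect can be reproduced outside PHP. The following Python sketch is only an analogue of the situation described, not PHP's actual behaviour: it takes the bytes a UTF-8 source file would store for a literal and interprets them once correctly and once as Latin-1.

```python
# Bytes that a UTF-8 source file would contain for the literal "cafe" with an accent.
literal_bytes = "caf\u00e9".encode("utf-8")

as_utf8 = literal_bytes.decode("utf-8")
as_latin1 = literal_bytes.decode("latin-1")   # the misconfigured interpretation

print(len(as_utf8))     # 4
print(len(as_latin1))   # 5: the two-byte sequence is counted as two characters

# Writing the bytes back out hides the mistake: the output is unchanged.
print(as_latin1.encode("latin-1") == literal_bytes)   # True

# Cutting at "4 characters" splits the two-byte sequence in half.
print(as_latin1[:4].encode("latin-1"))   # b'caf\xc3'
```

The truncated result ends in a lone lead byte, which is no longer valid UTF-8, so the damage only shows up once lengths or substrings are involved.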

My rule of thumb is to use UTF-8 for everything. If you absolutely have to deal with other encodings, convert to UTF-8 as early as possible and convert back from UTF-8 as late as possible.
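A minimal Python sketch of that rule of thumb, assuming a hypothetical legacy system that speaks Latin-1. The helper names `read_legacy` and `write_legacy` are invented for the example. Python's own strings are Unicode text rather than UTF-8 bytes, so UTF-8 appears here only where data is stored.

```python
def read_legacy(raw: bytes, source_encoding: str) -> str:
    """Decode at the input boundary so all later code sees text."""
    return raw.decode(source_encoding)


def write_legacy(text: str, target_encoding: str) -> bytes:
    """Encode at the output boundary, as late as possible."""
    return text.encode(target_encoding)


incoming = b"caf\xe9"                        # Latin-1 bytes from the legacy system
text = read_legacy(incoming, "latin-1")      # convert as early as possible
stored = text.encode("utf-8")                # everything in between is UTF-8
processed = stored.decode("utf-8").upper()   # work on properly decoded text
outgoing = write_legacy(processed, "latin-1")  # convert back as late as possible

print(stored)     # b'caf\xc3\xa9'
print(outgoing)   # b'CAF\xc9'
```

Keeping the conversions at the edges means only two places in the program need to know that any encoding other than UTF-8 exists.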