Character Encoding – Should Non-UTF Encodings Be Deprecated?

character-encoding, unicode, utf-8

A pet peeve of mine is looking at so many software projects that have mountains of code for character set support. Don't get me wrong, I'm all for compatibility, and I'm happy that text editors let you open and save files in multiple character sets. What annoys me is how the proliferation of non-universal character encodings is labeled “proper Unicode support” rather than “a problem”.

For example, let me pick on PostgreSQL and its character set support. PostgreSQL deals with two types of encodings:

  • Client encoding: Used in communication between the client and the server.
  • Server encoding: Used to store text internally in the database.

I can understand why supporting a lot of client encodings is a good thing. It enables clients that don't operate in UTF-8 to communicate with PostgreSQL without themselves needing to perform conversion. What I don't get is: why does PostgreSQL support multiple server encodings? Database files are (almost always) incompatible from one PostgreSQL version to the next, so cross-version compatibility is not the issue here.

UTF-8 is the only standard, ASCII-compatible character set that can encode all Unicode codepoints (if I'm wrong, let me know). I'm in the camp that UTF-8 is the best character set, but I am willing to put up with other universal character sets such as UTF-16 and UTF-32.
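To make the ASCII-compatibility point concrete, here is a small Python sketch. The sample string and variable names are made up for illustration. It only checks that ASCII text keeps the same bytes under UTF-8, that UTF-16 and UTF-32 do not have that property, and that the highest code point still round-trips through UTF-8.

```python
# Illustrative only: the sample text and names are not from the original post.
ascii_text = "SELECT 1"

# Pure ASCII text has exactly the same bytes in ASCII and in UTF-8.
utf8_bytes = ascii_text.encode("utf-8")
same_as_ascii = utf8_bytes == ascii_text.encode("ascii")

# UTF-16 and UTF-32 are universal too, but not ASCII-compatible:
# every ASCII character is padded out to 2 or 4 bytes.
utf16_bytes = ascii_text.encode("utf-16-le")
utf32_bytes = ascii_text.encode("utf-32-le")

# The highest code point, U+10FFFF, still round-trips through UTF-8.
highest = chr(0x10FFFF)
round_trips = highest.encode("utf-8").decode("utf-8") == highest

print(same_as_ascii, len(utf8_bytes), len(utf16_bytes), len(utf32_bytes), round_trips)
# prints: True 8 16 32 True
```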

I believe all non-universal character sets should be deprecated. Is there any compelling reason they shouldn't?

Best Answer

Since you mentioned PostgreSQL, I can say with some authority that the main killer reason why non-UTF8 server-side encodings are supported in such detail is that the Japanese need it. Apparently, identical round-trip conversion between Unicode and the various Japanese "legacy" encodings is not always possible, and in some cases conversion tables are even different between vendors. It's baffling really, but it's apparently so. (The extensive character set support is also one of the reasons why PostgreSQL is so popular in Japan.)

Since we are talking about a database system, one of the main jobs is to be able to store and retrieve data reliably, as defined by the user, so lossy character set conversion sometimes won't fly. If you were dealing with a web browser, say, where all that really matters is whether the result looks OK, then you could probably get away with supporting fewer encodings, but in a database system you have extra requirements.

Some of the other reasons mentioned in other answers also apply as supporting arguments. But as long as the Japanese veto it, character set support cannot be reduced.