Terminology – Why Does ‘Charset’ Mean ‘Encoding’?

history, terminology, unicode

Something that has long confused me is that so much software uses the terms "charset" and "encoding" as synonyms.

When people refer to a Unicode "encoding", they always mean a ruleset for representing Unicode characters as a sequence of bytes, like ASCII or UTF-8. This seems reasonable and intuitive; the idea is that you are "encoding" those characters as bytes using the specified ruleset.

Since those rulesets sometimes only provide the ability to "encode" some subset of all Unicode characters, you might imagine that a "charset", short for 'set of characters', would simply mean a set of Unicode characters, without any regard for how those characters are encoded. An encoding would thus imply a charset: an encoding like ASCII, which only has rules for encoding 128 characters, would be associated with the charset of those 128 characters. But a charset need not imply an encoding: UTF-8, UTF-16 and UTF-32, for example, are all different encodings that can encode the same set of characters.
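To make the distinction concrete, here is a minimal Python sketch (my own illustration, not part of the original question): UTF-8 and UTF-16 produce different bytes for the same characters, while ASCII rejects any character outside its 128-character set.

```python
# Two encodings of the same character set, plus one that covers only a subset.
s = "héllo"

print(s.encode("utf-8"))     # b'h\xc3\xa9llo'
print(s.encode("utf-16-be")) # b'\x00h\x00\xe9\x00l\x00l\x00o' -- same characters, different bytes

try:
    s.encode("ascii")
except UnicodeEncodeError as err:
    # 'é' is outside ASCII's 128-character charset, so it cannot be encoded.
    print("not encodable:", err)
```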

Yet – and here is the crux of my question – real-world usage of the word "charset" does not match what the construction of the word would imply. It is almost always used to mean "encoding".

For example, HTML's charset attribute and the charset parameter of the HTTP Content-Type header both specify an encoding such as UTF-8, not a set of characters.

How old is this curious (ab)use of language, and how did this counter-intuitive definition of 'charset' come to exist? Does it perhaps originate from a time when there truly was, in practice, a one-to-one mapping between the encodings in use and the sets of characters they supported? Or was there some particularly influential standard or specification that dictated this definition of the word?

Best Answer

The concept of character sets is older than Unicode.

Before Unicode, a character set defined both a set of characters and how each character was represented as bits. Most character sets mapped each character to a single byte (allowing a set of up to 256 characters), some mapped to two bytes, and a few (like ASCII) used only 7 bits. Different character sets often assigned different values to the same character, and there was no universal translation key between the various character sets in use.
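A short Python sketch (my own illustration) of that translation problem, using the legacy codecs latin-1, cp437 and mac-roman as representative examples: the same character receives a different byte value in each character set.

```python
# One character, three pre-Unicode character sets, three different byte values:
# there was no universal translation key between them.
for charset in ("latin-1", "cp437", "mac-roman"):
    print(charset, "é".encode(charset))
# latin-1   b'\xe9'
# cp437     b'\x82'
# mac-roman b'\x8e'
```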

Unicode was an attempt to solve this problem by unifying all the various character sets in a common "superset". For this purpose, Unicode introduced some additional levels of abstraction, for example the concept of a character encoding as something separate from the code point values. This allowed Unicode to redefine the pre-Unicode character sets as Unicode character encodings.
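As an illustration of that extra abstraction layer: a code point is a fixed abstract number, and each encoding turns that number into a different byte sequence.

```python
# One code point (U+00E9, 'é'), three byte sequences: the code point is an
# abstract number, and the encoding decides how that number becomes bytes.
ch = "\u00e9"
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, ch.encode(encoding))
# utf-8     b'\xc3\xa9'
# utf-16-be b'\x00\xe9'
# utf-32-be b'\x00\x00\x00\xe9'
```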

The charset attribute in HTML (which mirrors the charset parameter in the HTTP Content-Type header), for example, dates from before Unicode was widely adopted. When Unicode was accepted as the universal character set of the internet, the charset attribute was simply redefined to specify the encoding in use, but the name wasn't changed, in order to preserve backwards compatibility.
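The redefinition is visible in how the parameter is used today: the value of charset names an encoding, which is exactly what you hand to a decoder. A minimal sketch using Python's standard library (the header value here is invented for illustration):

```python
# The 'charset' parameter names an encoding, not a set of characters:
# its value is passed straight to the decoder.
from email.message import Message  # stdlib parser for MIME-style headers

msg = Message()
msg["Content-Type"] = "text/html; charset=ISO-8859-1"  # hypothetical header
encoding = msg.get_content_charset()                    # -> 'iso-8859-1'

body = b"caf\xe9"              # the bytes as they would arrive on the wire
print(body.decode(encoding))   # -> 'café'
```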
