Why does Unicode have separate codepoints for characters with identical glyphs?

unicode

(Not entirely sure whether this should go in the information-security StackExchange instead; feel free to move it there if that's where it belongs.)

Unicode has many, many instances of pairs or larger sets of characters with identical glyphs nevertheless being assigned to separate codepoints (for instance, the Latin capital letter A, the Cyrillic capital letter А, and the Greek capital letter Α all have identical glyphs, but are assigned to codepoints U+0041, U+0410, and U+0391, respectively). This causes severe security issues, as well as the more minor problem of cluttering up Unicode with redundant characters.

Why doesn't Unicode assign all characters that share a particular glyph to the same codepoint, which would resolve both of these problems?

Best Answer

The short answer to this question is, "Unicode encodes characters, not glyphs". But as with many questions about Unicode, a related answer is "plain text may be plain, but it's not simple". A good place to start unpacking this question is chapter 1, Introduction, of The Unicode Standard (TUS).

You say:

Unicode has many, many instances of pairs or larger sets of characters with identical glyphs nevertheless being assigned to separate codepoints (for instance, the Latin capital letter A, the Cyrillic capital letter А, and the Greek capital letter Α all have identical glyphs, but are assigned to codepoints U+0041, U+0410, and U+0391, respectively).

It is a mistake to say "Unicode has …[characters]… with identical glyphs", because Unicode does not standardise glyphs.

"The Unicode Standard specifies a numeric value (code point) and a name for each of its characters. [It also defines]… a character’s case, directionality, …alphabetic properties…, and other semantic values.…" (TUS, Chapter 1, p. 1)

Notice that the concept of "glyph" does not appear in that list. Unicode does not encode glyphs.
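None of this is hypothetical: any Unicode-aware language lets you query those properties directly. Here is a quick sketch in Python using the standard unicodedata module; it prints the name, general category, and directionality class of the three look-alike capitals, and notice that nothing in the output describes appearance:

    import unicodedata

    # Three look-alike capital letters, one per script.
    for ch in ("\u0041", "\u0410", "\u0391"):
        print(
            "U+%04X" % ord(ch),
            unicodedata.name(ch),           # normative character name
            unicodedata.category(ch),       # general category, "Lu" for all three
            unicodedata.bidirectional(ch),  # directionality class, "L" for all three
        )
    # U+0041 LATIN CAPITAL LETTER A Lu L
    # U+0410 CYRILLIC CAPITAL LETTER A Lu L
    # U+0391 GREEK CAPITAL LETTER ALPHA Lu L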

The difference between identifying a character and rendering it on screen or paper is crucial to understanding the Unicode Standard’s role in text processing. The character identified by a Unicode code point is an abstract entity, such as “latin capital letter a” or “bengali digit five”. The mark made on screen or paper, called a glyph, is a visual representation of the character. The Unicode Standard does not define glyph images. That is, the standard defines how characters are interpreted, not how glyphs are rendered. (TUS, section 1.3 Characters and Glyphs, p. 6)

So what defines how glyphs are rendered? Fonts, the text layout engine, and their character-to-glyph mappings. If you give me the three characters Latin capital letter A, Cyrillic capital letter А, and Greek capital letter Α, I could choose three fonts that render them identically, or three other fonts that render them differently. Unicode's architecture has no say there.
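If you want to see a character-to-glyph mapping concretely, the third-party fontTools library can read one out of a font file. A sketch, with the font path as a stand-in for whatever font you have on hand:

    from fontTools.ttLib import TTFont  # pip install fonttools

    # "SomeFont.ttf" is a placeholder; any TrueType/OpenType font will do.
    font = TTFont("SomeFont.ttf")
    cmap = font.getBestCmap()  # maps codepoint -> glyph name, per this font

    for cp in (0x0041, 0x0410, 0x0391):
        print("U+%04X -> glyph %s" % (cp, cmap.get(cp, "<not in this font>")))

A font is free to point all three codepoints at one glyph outline, or at three different ones; Unicode is silent either way.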

Characters that look similar, or homographs, are a problem even within character sets for a single writing system. I can present you with a domain name "PaypaI.com", which will look pretty similar to "Paypal.com", as long as I can choose a font where uppercase "I" looks identical to lower-case "l". In some fonts, digit "0" and uppercase letter "O" look darn similar. Homographs abound.
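The comparison, of course, operates on codepoints and knows nothing about fonts. A few lines of Python make the point:

    # Same script, still a homograph: capital I (U+0049) vs. small L (U+006C).
    spoof, real = "PaypaI.com", "Paypal.com"
    print(spoof == real)                           # False
    print("U+%04X U+%04X" % (ord("I"), ord("l")))  # U+0049 U+006C

The strings are unambiguously distinct; only a font's rendering makes them confusable.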

It gets worse. Consider U+0020 " " and U+00A0 " ". In just about every reasonable font, they will have identical glyphs. But they are distinct characters in many character sets, including Unicode, ISO 8859-1, and Windows CP-1252 "ANSI". Why? Because the first is a conventional space character, and the second is a non-breaking space. They have identical appearance, but have properties which affect text layout differently.
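Again the difference is carried by character identity and properties, not by any glyph (the Unicode Character Database gives U+0020 the line-breaking class SP and U+00A0 the class GL, "glue"). A quick check:

    import unicodedata

    for ch in ("\u0020", "\u00A0"):
        print("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))
    # U+0020 SPACE Zs
    # U+00A0 NO-BREAK SPACE Zs

    print("\u0020" == "\u00A0")  # False: identical glyphs, distinct characters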

You say:

This causes severe security issues

I challenge your word "severe". Unicode makes many things possible which were difficult before, including some homograph attacks. But as the "PaypaI"/"Paypal" example shows, homograph attacks exist even within a single script. And within the landscape of security risks caused by text representations of domain names and the like, I argue that the attacks Unicode makes easier are medium to minor. Links where innocuous link text points to a brazenly malicious URL are a bigger problem. Users who click on such links are a bigger problem. Users who misinterpret long domain names like friendlybank.com.security_department.malicious.example.com, thinking they say friendlybank.com/security_department/, are a bigger problem.

I also challenge your word "causes". All non-trivial design involves tradeoffs. Unicode opens the door to new benefits, traded off with new risks. The designers of Unicode take the risks seriously: see UTR#36 Unicode Security Considerations.

To someone who is well-served by Latin-script text, the benefit of Bengali support may not carry much weight, and the novel homograph risks might be dazzling. But to the next billion internet users, who will largely prefer non-Latin scripts, the language coverage of Unicode is indispensable, and well worth the tradeoff of some risks.

Turning to "redundancy", you say:

…the more minor problem of cluttering up Unicode with redundant characters.

The three examples you give, U+0041 LATIN CAPITAL LETTER A, U+0410 CYRILLIC CAPITAL LETTER A, and U+0391 GREEK CAPITAL LETTER ALPHA, are not redundant for Unicode's text processing objectives in at least two ways.

The first way is case mapping. The Latin, Cyrillic, and Greek scripts all pair upper-case and lower-case letters in a 1:1 relationship. What is the upper-case counterpart to U+0061 LATIN SMALL LETTER A? U+0041, obviously. But what is the upper-case counterpart to U+0430 CYRILLIC SMALL LETTER A, or to U+03B1 GREEK SMALL LETTER ALPHA? A single codepoint for capital letter A that mapped to one of three different small letters, depending on whether the text is Latin, Cyrillic, or Greek script, would be an implementation nightmare. It is much more efficient to have a separate codepoint for each script's capital letter A, encoding its distinct text properties.
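The 1:1 case pairs are visible in any Unicode-aware language; here is a short Python sketch. Because each script's capital A is a distinct codepoint, lower() needs no script detection at all:

    # One lower-case partner per script; a single shared capital "A"
    # could not choose among these without out-of-band script information.
    for upper in ("\u0041", "\u0410", "\u0391"):
        lower = upper.lower()
        print("U+%04X -> U+%04X" % (ord(upper), ord(lower)))
    # U+0041 -> U+0061  (Latin a)
    # U+0410 -> U+0430  (Cyrillic a)
    # U+0391 -> U+03B1  (Greek alpha)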

The second way is round-trip convertibility with legacy encodings. The standard says,

Character identity is preserved for interchange with a number of different base standards, including national, international, and vendor standards. … This choice guarantees the existence of a mapping between the Unicode Standard and base standards. Accurate convertibility is guaranteed between the Unicode Standard and other standards in wide usage as of May 1993.… (TUS, section 2.2 Unicode Design Principles, subsection Convertibility, pp. 23-24)

One of those "other standards" is ISO/IEC 8859-5, which encodes Cyrillic characters including 0xB0 Cyrillic Capital Letter A, alongside an ASCII-compatible block with 0x41 Latin Capital Letter A. Convertibility requires that Unicode have separate codepoints for these two characters. Another "other standard" is ISO/IEC 8859-7, which encodes Greek characters including, you guessed it, 0xC1 Greek Capital Letter Alpha, as well as ASCII-compatible 0x41 Latin Capital Letter A. Convertibility likewise requires separate codepoints for these two characters. And so convertibility, as well as case mapping, justifies Unicode having three distinct codepoints for these three characters.
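The round trip is easy to demonstrate with Python's built-in codecs; a sketch:

    # Decode each legacy byte to Unicode...
    cyr = bytes([0xB0]).decode("iso8859_5")  # ISO/IEC 8859-5
    grk = bytes([0xC1]).decode("iso8859_7")  # ISO/IEC 8859-7
    lat = bytes([0x41]).decode("ascii")

    print("U+%04X U+%04X U+%04X" % (ord(cyr), ord(grk), ord(lat)))
    # U+0410 U+0391 U+0041 -- three distinct codepoints

    # ...and encode back without loss.
    assert cyr.encode("iso8859_5") == b"\xB0"
    assert grk.encode("iso8859_7") == b"\xC1"

With a single merged codepoint, the encode step would be ambiguous: nothing would say whether "capital A" should become 0xB0 or 0x41 in 8859-5, or 0xC1 or 0x41 in 8859-7, since each legacy standard encodes both letters.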

Finally, you ask:

Why doesn't Unicode assign all characters that share a particular glyph to the same codepoint, which would resolve both of these problems [homograph attacks and "redundancy"]?

Because that would make it harder for Unicode to reach its goals of a "universal", "efficient", "unambiguous" text encoding (TUS, Section 1.2 Design Goals, p. 4). Because that would not "resolve" either security issues in general, or homograph attacks in particular; they would still be plentiful. Because the "redundancy" you see is not really a problem for text systems actually using Unicode to represent the world's plain text information.

[Update: added paragraphs on case-mapping and convertibility, as reminded by Nicol Bolas's answer. Thanks for the reminder.]