I personally find reading code full of Unicode identifiers confusing. In my opinion, it also prevents the code from being easily maintained, not to mention all the effort required from the authors of the various translators to implement such support. I also constantly notice the lack (or presence) of Unicode identifier support in lists of (dis)advantages of various language implementations, as if it really mattered. I don't get it: why so much attention?
What’s the point of adding Unicode identifier support to various language implementations?
unicode
Related Solutions
Any website that purports to be multi-lingual or to deal with documents or content that is not representable in Latin-1 is likely to be problematic if you don't have Unicode support.
For example, http://amazon.jp would be toast without Unicode.
Another problematic use-case is when content might contain mathematical and other symbols.
However, your example of Facebook suggests that you can in fact "do" Unicode in PHP. Alternatively, http://facebook.jp is not implemented in PHP. Either way, the home page says:
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
and has lots of UTF-8 content.
OK, here's what the PHP doc for "String" says:
"A string is series of characters, therefore, a character is the same as a byte. That is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode. See utf8_encode() and utf8_decode() for some basic Unicode functionality."
So PHP does have some basic Unicode support. It is just that "native strings" are byte-based rather than Unicode-based.
So what this means is that if you need to deal with any language (or set of languages) that cannot be encoded in an 8-bit character set, your PHP code is going to be more cumbersome at any point where it needs to process content as (real) characters.
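For instance (a minimal sketch, assuming the mbstring extension is available and the file is saved as UTF-8), the byte-oriented string functions miscount multibyte text, while their mb_* counterparts count actual characters:

    <?php
    $s = "日本語";                  // 3 characters, 9 bytes in UTF-8

    echo strlen($s);               // 9 -- native strings count bytes
    echo mb_strlen($s, 'UTF-8');   // 3 -- counts real characters

    // the basic helpers mentioned in the manual only convert
    // between ISO-8859-1 and UTF-8:
    echo utf8_encode("\xE9");      // "é" (0xE9 is 'é' in Latin-1)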
The Unicode standard has lots of space to spare. Unicode code points are organized in "planes" and "blocks". Of 17 total planes, 11 are currently completely unassigned. Each plane holds 65,536 code points, so those unassigned planes alone leave over 700,000 code points to spare for an alien language (unless we fill all of that up with more emoji before first contact). As of Unicode 8.0, only 120,737 code points have been assigned in total (roughly 10% of the total capacity), with roughly the same amount being unassigned but reserved for private, application-specific use. In total, 974,530 code points are unassigned.
UTF-8 is a specific encoding of Unicode, and is currently limited to four octets (bytes) per code point, which matches the limitations of UTF-16. In particular, UTF-16 only supports 17 planes. UTF-8 as originally designed allowed up to six octets per code point, enough for 32,768 planes. In principle this four-octet limit could be lifted, but that would break the current organizational structure of Unicode, and would require UTF-16 to be phased out – unlikely to happen in the near future, considering how entrenched it is in certain operating systems and programming languages.
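For reference, the standard UTF-8 byte layout (per RFC 3629, which imposed the four-octet cap):

    1 byte : 0xxxxxxx                             -> U+0000  .. U+007F
    2 bytes: 110xxxxx 10xxxxxx                    -> U+0080  .. U+07FF
    3 bytes: 1110xxxx 10xxxxxx 10xxxxxx           -> U+0800  .. U+FFFF
    4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  -> U+10000 .. U+10FFFF

The four-byte form carries 21 payload bits, enough for 32 planes; RFC 3629 cuts it off at U+10FFFF precisely so that UTF-8 covers nothing UTF-16 cannot.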
The only reason UTF-16 is still in common use is that it's an extension to the flawed UCS-2 encoding which only supported a single Unicode plane. It otherwise inherits undesirable properties from both UTF-8 (not fixed-width) and UTF-32 (not ASCII compatible, waste of space for common data), and requires byte order marks to declare endianness. Given that despite these problems UTF-16 is still popular, I'm not too optimistic that this is going to change by itself very soon. Hopefully, our new Alien Overlords will see this impediment to Their rule, and in Their wisdom banish UTF-16 from the face of the earth.
Best Answer
When you think Unicode, you think of Chinese or Russian characters, which makes you think of some source code written in Russian you've seen on the internet, and which was unusable (unless you know Russian).
But the fact that Unicode can be used in a wrong way doesn't mean it's bad by itself in source code.
When writing code for a specific field, Unicode can shorten your code and make it more readable. For instance, in numerical code, instead of something like:
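    // spelled-out names (the identifiers and the LargestEigenvalue method
    // are hypothetical, for illustration only)
    double lambda = matrix.LargestEigenvalue();
    if (Math.Abs(lambda - mu) < epsilon)
        return sigma * Math.Sqrt(lambda);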
you can write:
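    // the same sketch using the field's own symbols
    // (C# accepts Unicode letters in identifiers)
    double λ = matrix.LargestEigenvalue();
    if (Math.Abs(λ - μ) < ε)
        return σ * Math.Sqrt(λ);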
which may not be easy to read for an average developer, but is still easy to read for a person who uses mathematical symbols daily.
Or, when writing an application related to SLR photography, instead of:
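    currentLens.GetMaximumAperture();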
you can replace the aperture with its symbol ƒ, bringing the notation closer to the usual ƒ/1.8:
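    currentLens.GetMaximumƒ();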
This may be inconvenient: when typing general C# code, I would prefer writing something like:
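    // plain ASCII: IntelliSense completes Math.PI after a few keystrokes
    double area = Math.PI * radius * radius;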
rather than:
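    // assuming a constant π has been declared somewhere: the symbol itself
    // is the problem, since there is no obvious way to type it
    double area = π * radius * radius;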
because in the first case, IntelliSense helps me write the whole line nearly without typing, and especially without using my mouse, while in the second case, I have no idea where to find those symbols and would be forced to rely on the mouse to search for them in the auto-completion list.
This being said, it's still useful in some cases.
The currentLens.GetMaximumƒ() of my previous example can rely on IntelliSense and is as easy to type as GetMaximumAperture, while being shorter and more readable. Also, for specific domains with lots of symbols, keyboard shortcuts may help typing the symbols quicker than their literal equivalents in source code.

The same, by the way, applies to comments. No one wants to read code full of comments in Chinese (unless you know Chinese well yourself). But in some programming languages, Unicode symbols can still be useful. One example is footnotes¹.
¹ I certainly wouldn't enjoy footnotes in C# code, where there is a strict set of style rules for how to write comments. In PHP, on the other hand, if there are lots of things to explain, but those things are not very important, why not put them at the bottom of the file and create a footnote in the PHPDoc of the method?