Php – What does the lack of Unicode support in PHP mean

PHP, Unicode

How can the lack of Unicode support in PHP affect a PHP web app?

Best Answer

Any website that purports to be multi-lingual or to deal with documents or content that is not representable in Latin-1 is likely to be problematic if you don't have Unicode support.

For example, http://amazon.jp would be toast without Unicode.

Another problematic use-case is when content might contain mathematical and other symbols.

However, your example of Facebook suggests that you can in fact "do" Unicode in PHP. Alternatively, http://facebook.jp is not implemented in PHP. Either way, the home page says:

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

and has lots of UTF-8 content.

OK, here's what the PHP doc for "String" says:

"A string is series of characters, therefore, a character is the same as a byte. That is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode. See utf8_encode() and utf8_decode() for some basic Unicode functionality."

So PHP does have Unicode support. It is just that "native strings" are not Unicode based.
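To make the "basic Unicode functionality" concrete: utf8_encode() converts ISO-8859-1 bytes to UTF-8. The sketch below does the same conversion with mb_convert_encoding(), on the assumption that the mbstring extension is available (utf8_encode() itself is deprecated as of PHP 8.2). The variable names are only illustrative.

```php
<?php
// Assumption: the mbstring extension is enabled.
$latin1 = "caf\xE9"; // "café" in ISO-8859-1: 'é' is the single byte 0xE9

// Re-encode the Latin-1 bytes as UTF-8; 'é' becomes the two bytes 0xC3 0xA9.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo bin2hex($latin1), "\n"; // 636166e9
echo bin2hex($utf8), "\n";   // 636166c3a9
```

Either way, the result is still an ordinary PHP byte string; it just happens to contain UTF-8.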

So what it means is that if you need to deal with any language (or set of languages) that cannot be encoded in an 8-bit character set, your PHP code is going to be more cumbersome at any point where it needs to process content as (real) characters.
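A minimal sketch of what "more cumbersome" looks like in practice, assuming the file is saved as UTF-8 and the mbstring extension is enabled: the native function counts bytes, and you have to reach for the mb_* variant (and tell it the encoding) to count characters.

```php
<?php
// Assumptions: this file is saved as UTF-8 and mbstring is enabled.
$word = "日本語"; // three characters, each three bytes long in UTF-8

$byteCount = strlen($word);             // native strings are bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts code points instead

echo $byteCount, "\n"; // 9
echo $charCount, "\n"; // 3
```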

Related Solutions

Php – Why exactly can’t PHP have full unicode support

PHP as a language definitely can have it, but I think the problem is with compatibility with existing programs. Unicode support can break them in subtle ways, which is the most annoying kind of bug to have.

Currently most string-processing functions in PHP are "binary-safe", which means you can use them to process any file in any encoding as well as binary formats like image data, etc.

With the addition of Unicode strings you'd have to be very careful not to mix Unicode strings with binary strings (pretty hard when your strings come from different sources and you never had to worry about it before). And you couldn't be ignorant about encodings any more (and lots of scripts are ignorant about this!).
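Binary safety cuts both ways, which a small sketch can show. It assumes mbstring is enabled and PHP 7.4 or later (for mb_str_split()); the variable names are made up for the example. strrev() happily reverses the bytes of a UTF-8 string and produces something that is no longer valid UTF-8, while reversing code points keeps the text intact.

```php
<?php
// Assumptions: mbstring is enabled and PHP is 7.4+ (for mb_str_split()).
$name = "caf\u{00E9}"; // "café" as UTF-8; 'é' is the two bytes 0xC3 0xA9

// strrev() is binary-safe: it reverses bytes and knows nothing of encodings.
$byteReversed = strrev($name);

// Reversing code points keeps each multi-byte sequence in one piece.
$charReversed = implode('', array_reverse(mb_str_split($name, 1, 'UTF-8')));

var_dump(mb_check_encoding($byteReversed, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charReversed, 'UTF-8')); // bool(true)
echo $charReversed, "\n";                            // éfac
```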

Another hard, but solvable problem is random access in Unicode strings. The implementation of $string[$offset] changes from trivial to either very slow, or a little slow and very complex.
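The difference is visible with today's byte strings. In this sketch (assuming mbstring is enabled), $string[$offset] is a constant-time byte lookup that can land in the middle of a character, whereas mb_substr() returns a whole character but generally has to scan the string to find it.

```php
<?php
// Assumption: the mbstring extension is enabled.
$text = "na\u{00EF}ve"; // "naïve" as UTF-8; 'ï' occupies bytes 2 and 3

$byte = $text[2];                        // byte offset 2: first half of 'ï'
$char = mb_substr($text, 2, 1, 'UTF-8'); // character offset 2: all of 'ï'

echo bin2hex($byte), "\n"; // c3
echo bin2hex($char), "\n"; // c3af
```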

Also, I think it was a mistake to choose UTF-16 as the internal encoding for PHP. It has the same problem as UTF-8 (variable width, because of surrogate pairs) and the inefficiency of UCS-2. Maybe they should scrap that and start again with UTF-8?
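The surrogate-pair point is easy to check. A small sketch, assuming mbstring is enabled: a character inside the Basic Multilingual Plane takes one 16-bit unit in UTF-16, but anything outside it takes two, so UTF-16 is variable-width as well.

```php
<?php
// Assumption: the mbstring extension is enabled.
$euro  = "\u{20AC}";  // € is inside the Basic Multilingual Plane
$emoji = "\u{1F600}"; // 😀 is outside it

$euroUtf16  = mb_convert_encoding($euro, 'UTF-16BE', 'UTF-8');
$emojiUtf16 = mb_convert_encoding($emoji, 'UTF-16BE', 'UTF-8');

echo strlen($euro), ' ', strlen($euroUtf16), "\n";   // 3 2
echo strlen($emoji), ' ', strlen($emojiUtf16), "\n"; // 4 4
echo bin2hex($emojiUtf16), "\n";                     // d83dde00 (surrogate pair)
```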

</speculation>

What’s the point of adding Unicode identifier support to various language implementations

When you think of Unicode, you think of Chinese or Russian characters, which brings to mind some source code written in Russian that you've seen on the internet and found unusable (unless you know Russian).

But the fact that Unicode can be misused doesn't make it bad in source code per se.

When writing code for a specific field, Unicode lets you shorten your code and make it more readable. Instead of:

const numeric Pi = 3.1415926535897932384626433832795;
numeric firstAlpha = deltaY / deltaX + Pi;
numeric secondAlpha = this.Compute(firstAlpha);
Assert.Equals(math.Infinity, secondAlpha);

you can write:

const numeric π = 3.1415926535897932384626433832795;
numeric α₁ = Δy / Δx + π;
numeric α₂ = this.Compute(α₁);
Assert.Equals(math.∞, α₂);

which may not be easy to read for an average developer, but is still easy to read for a person who uses mathematical symbols daily.

Or, when writing an application related to SLR photography, instead of:

int aperture = currentLens.GetMaximumAperture();
Assert.AreEqual(this.Aperture1_8, aperture);

you can replace the aperture by its symbol ƒ, with a notation closer to ƒ/1.8:

int ƒ = currentLens.GetMaximumƒ();
Assert.AreEqual(this.ƒ1¸8, ƒ);

This may be inconvenient: when typing general C# code, I would prefer writing:

var productPrices = this.Products.Select(c => c.Price);
double average = productPrices.Average();
double sum = productPrices.Sum();

rather than:

var productPrices = this.Products.Select(c => c.Price);
double average = productPrices.x̅();
double sum = productPrices.Σ();

because in the first case, IntelliSense helps me write the whole code nearly without typing and especially without using my mouse, while in the second case, I have no idea where to find those symbols and would be forced to rely on the mouse to search for them in the auto-completion list.

This being said, it's still useful in some cases. currentLens.GetMaximumƒ() from my previous example can rely on IntelliSense and is as easy to type as GetMaximumAperture, while being shorter and more readable. Also, for specific domains with lots of symbols, keyboard shortcuts may make typing the symbols quicker than typing their spelled-out equivalents in source code.
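For what it's worth, PHP already accepts such identifiers, because its naming rule allows any byte from 0x80 to 0xFF and every byte of a non-ASCII UTF-8 character falls in that range. Here is a rough PHP transliteration of the mathematical example, assuming the file is saved as UTF-8; the input values are made up for the illustration.

```php
<?php
// Assumption: this file is saved as UTF-8.
$π  = M_PI;
$Δy = 3.0; // illustrative values, not from any real computation
$Δx = 4.0;

$α₁ = $Δy / $Δx + $π;

printf("%.4f\n", $α₁); // 3.8916
```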

The same, by the way, applies to comments. No one wants to read code full of comments in Chinese (unless you know Chinese well yourself). But in some programming languages, Unicode symbols can still be useful. One example is footnotes¹.

¹ I certainly wouldn't enjoy footnotes in C# code, where there is a strict set of style rules for how to write comments. In PHP, on the other hand, if there are lots of things to explain, but those things are not very important, why not put them at the bottom of the file and create a footnote in the PHPDoc of the method?
