Everybody knows that PHP has problems with Unicode. Version 6 was effectively abandoned because of difficulties implementing Unicode. But does anyone know what the exact reasons are? Architecture or design problems, performance concerns, community problems (I bet not), something else?
PHP – Why exactly can't PHP have full Unicode support?
Related Solutions
Any website that purports to be multi-lingual or to deal with documents or content that is not representable in Latin-1 is likely to be problematic if you don't have Unicode support.
For example, http://amazon.jp would be toast without Unicode.
Another problematic use-case is when content might contain mathematical and other symbols.
However, your example of Facebook suggests that you can in fact "do" Unicode in PHP. Alternatively, http://facebook.jp is not implemented in PHP. Either way, the home page says:
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
and has lots of UTF-8 content.
OK, here's what the PHP doc for "String" says:
"A string is series of characters, therefore, a character is the same as a byte. That is, there are exactly 256 different characters possible. This also implies that PHP has no native support of Unicode. See utf8_encode() and utf8_decode() for some basic Unicode functionality."
So PHP does have Unicode support. It is just that "native strings" are not Unicode based.
What this means is that if you need to deal with any language (or set of languages) that cannot be encoded in an 8-bit character set, your PHP code is going to be more cumbersome at any point where it needs to process content as (real) characters.
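To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes PHP 7 or later (for the "\u{...}" escapes) and that the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Assumption: PHP 7+ and the mbstring extension are available.
$text = "\u{65E5}\u{672C}\u{8A9E}"; // "日本語" in UTF-8: 3 characters, 9 bytes

// Native string functions count and slice bytes.
$byteLength   = strlen($text);        // 9
$brokenPrefix = substr($text, 0, 2);  // first 2 bytes: cuts the first character apart

// mbstring functions count and slice characters, given the encoding.
$charLength  = mb_strlen($text, 'UTF-8');        // 3
$cleanPrefix = mb_substr($text, 0, 2, 'UTF-8');  // first 2 characters

var_dump($byteLength, $charLength);
var_dump(mb_check_encoding($brokenPrefix, 'UTF-8')); // false: truncated sequence
var_dump($cleanPrefix === "\u{65E5}\u{672C}");       // true
```

The "cumbersome" part is that every place that treats content as characters has to remember to call the mb_* variant and pass the right encoding.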
The only compelling reason to use XML is to establish an open data standard. XAML is the same display language used in both Silverlight and WPF; any vendor can use the same markup standard to create a display definition for their own platform, and it can be reused in Silverlight or WPF.
In the aerospace industry, we have control rooms that, thanks to the advances of computer technology, are now reasonably flexible. In the past, all the hardware was custom, unique, and very expensive; today it is all run with inexpensive, commonly available, off-the-shelf PCs. This greatly reduces vendor lock-in. However, display widgets are still written using ActiveX, because that's how it's always been done.
ActiveX requires access to Microsoft tools that are, well, obsolete. So the Air Force and the Inter-Range Instrumentation Group are coming up with a Data Display Markup Language, which is XML-based. This will allow practitioners to design displays using XML markup, in the editor of their choice. Sound familiar?
Nobody argues that XML is without its faults. But it's the best thing available for what it was designed to do, until something better comes along.
See Also
Why XML Doesn't Suck
Best Answer
PHP as a language definitely can have it, but I think the problem is with compatibility with existing programs. Unicode support can break them in subtle ways, which is the most annoying kind of bug to have.
Currently most string-processing functions in PHP are "binary-safe", which means you can use them to process any file in any encoding as well as binary formats like image data, etc.
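As a rough illustration of what "binary-safe" means here, the sketch below runs ordinary string functions over the first bytes of a PNG file (the 8-byte signature plus the start of an IHDR chunk). The variable names are made up for the example.

```php
<?php
// The 8-byte PNG signature followed by the start of an IHDR chunk.
// The data contains NUL bytes and a byte above 0x7F.
$signature = "\x89PNG\r\n\x1a\n";
$chunk     = "\x00\x00\x00\x0D" . "IHDR";
$data      = $signature . $chunk;

var_dump(strlen($data));                 // int(16): NUL bytes do not end the string
var_dump(strpos($data, "IHDR"));         // int(12)
var_dump(substr($data, 1, 3));           // string(3) "PNG"
var_dump(bin2hex(substr($data, 8, 4)));  // string(8) "0000000d"
```

None of these calls care what encoding the bytes are in, which is exactly the property that would be at risk if strings became Unicode by default.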
With the addition of Unicode strings you'd have to be very careful not to mix Unicode strings with binary strings (pretty hard when your strings come from different sources and you never had to worry about it before). And you couldn't be ignorant about encodings any more (and lots of scripts are ignorant about this!).
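Here is a small sketch of how mixing goes wrong when a script ignores encodings. It assumes PHP 7+ with the mbstring extension, and it assumes one input arrives as UTF-8 and the other as ISO-8859-1; both inputs are invented for the example.

```php
<?php
// Assumption: PHP 7+ and the mbstring extension are available.
$utf8   = "caf\u{E9}"; // "café" encoded as UTF-8 (5 bytes)
$latin1 = "caf\xE9";   // "café" encoded as ISO-8859-1 (4 bytes)

// Byte-oriented concatenation joins them without complaint...
$mixed = $utf8 . ' / ' . $latin1;

// ...but the result is no longer valid UTF-8.
var_dump(mb_check_encoding($mixed, 'UTF-8')); // false

// Converting at the boundary, where the source encoding is still known, avoids this.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($converted === $utf8);                                    // true
var_dump(mb_check_encoding($utf8 . ' / ' . $converted, 'UTF-8')); // true
```

With byte strings this bug stays silent until something displays the text; with a separate Unicode string type, every such boundary would need an explicit conversion.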
Another hard but solvable problem is random access in Unicode strings. The implementation of $string[$offset] changes from trivial to either very slow, or a little slow and very complex.
<speculation>
Also, I think it was a mistake to choose UTF-16 as the internal encoding for PHP. It has the same problem as UTF-8 (variable width, because of surrogate pairs) along with the inefficiency of UCS-2. Maybe they should scrap that and start again with UTF-8?
</speculation>
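To illustrate the random-access point and the surrogate-pair point, here is a sketch assuming PHP 7.1+ with the mbstring extension. The helper utf8CharAt() is hypothetical (it is not a PHP built-in), it assumes valid UTF-8 input, and it shows only the "very slow" option: a linear scan from the start of the string.

```php
<?php
// Assumption: PHP 7.1+ and the mbstring extension are available.
$s = "na\u{EF}ve \u{1F600}"; // "naïve 😀" in UTF-8: 7 characters, 11 bytes

// $s[$offset] is a constant-time byte lookup.
var_dump(bin2hex($s[2])); // "c3": only the first byte of "ï"

// Hypothetical helper: return the character at a character index by
// scanning from the start of the string. O(n), and assumes valid UTF-8.
function utf8CharAt(string $s, int $index): ?string
{
    $len = strlen($s);
    $pos = 0; // byte position
    $i   = 0; // character position
    while ($pos < $len) {
        $byte = ord($s[$pos]);
        // The lead byte determines the length of the sequence.
        if ($byte < 0x80) {
            $width = 1;
        } elseif ($byte < 0xE0) {
            $width = 2;
        } elseif ($byte < 0xF0) {
            $width = 3;
        } else {
            $width = 4;
        }
        if ($i === $index) {
            return substr($s, $pos, $width);
        }
        $pos += $width;
        $i++;
    }
    return null; // index is past the end of the string
}

var_dump(utf8CharAt($s, 2) === "\u{EF}");    // true
var_dump(utf8CharAt($s, 6) === "\u{1F600}"); // true

// UTF-16 is variable width too: characters outside the Basic
// Multilingual Plane take a surrogate pair (4 bytes instead of 2).
var_dump(strlen(mb_convert_encoding("\u{EF}", 'UTF-16BE', 'UTF-8')));    // int(2)
var_dump(strlen(mb_convert_encoding("\u{1F600}", 'UTF-16BE', 'UTF-8'))); // int(4)
```

The "a little slow and very complex" alternative would be some kind of index or cache from character offsets to byte offsets, which is not shown here.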