UTF-8? UTF-16?
Do strings in PHP also keep track of the encoding used?
Let's look at this script for example. Say I run:
$original = "शक्नोम्यत्तुम्";
What actually happens?
Obviously, I think $original will not contain just 7 characters. Those glyphs must each be represented by several bytes there.
Then I do:
$converted = mb_convert_encoding($original, "UTF-8");
What will happen to $converted? How will $converted be different from $original?
Will it be just the exact same byte sequence as $original but with a different encoding?
Best Answer
A PHP string is just a sequence of bytes, with no encoding tagged to it whatsoever. String values can come from various sources: the client (over HTTP), a database, a file, or from string literals in your source code. PHP reads all these as byte sequences, and it never extracts any encoding information.
As long as all your data sources and destinations use the same encoding, the worst thing that can happen is that string positions are wrong (if you use multi-byte encodings), since PHP will count bytes, not characters.
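To make the "bytes, not characters" point concrete, here is a minimal sketch. The sample word and the variable names are made up for illustration, and the string is written with explicit byte escapes so that the result does not depend on how the source file happens to be saved.

```php
<?php
// "héllo" with "é" spelled out as its two UTF-8 bytes (0xC3 0xA9).
// Hypothetical sample value, not taken from the question.
$word = "h\xC3\xA9llo";

// strlen() counts bytes: 5 characters on screen, but 6 bytes in memory.
$byteLength = strlen($word);

// substr() uses byte offsets too, so cutting after 2 bytes splits the
// two-byte "é" in half and leaves an invalid UTF-8 sequence behind.
$cut = substr($word, 0, 2);

echo $byteLength, "\n";    // 6
echo bin2hex($cut), "\n";  // 68c3
```

The second line of output shows the problem with positions: the cut lands in the middle of a character, which is harmless only as long as nothing downstream tries to interpret that fragment as text.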
But if the encodings don't match (e.g. you write a string literal in a source file stored as UTF-8, and then send it to a database that expects Latin-1), PHP will not perform any conversion for you: it will happily copy the bytes over raw.
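The effect of such a mismatch can be simulated without a database. The sketch below takes the UTF-8 bytes of "é" and converts them as if they were Latin-1, which is roughly what a consumer with the wrong assumption ends up doing. The variable names are illustrative, and the mbstring extension is assumed to be available.

```php
<?php
// "é" as UTF-8 bytes (hypothetical sample value).
$utf8 = "\xC3\xA9";

// Simulate a consumer that wrongly assumes the bytes are Latin-1
// (ISO-8859-1) and re-encodes them as UTF-8 for display.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";     // c3a9
echo bin2hex($misread), "\n";  // c383c2a9
```

Each of the two original bytes was treated as a character of its own, so the single "é" turns into the familiar two-character garbage "Ã©". PHP did nothing wrong at any point: it only moved bytes around.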
The sanest solution is to use UTF-8 everywhere:
- Tell clients to expect UTF-8 (Content-type headers).
- Set your database connections to UTF-8 (SET NAMES UTF8 in MySQL).
Why UTF-8? Because it can represent all Unicode characters and thus supersedes all the existing 7-bit and 8-bit encodings, and because it is binary compatible with ASCII, that is, every valid ASCII string is also a valid UTF-8 string (but not vice versa).
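As a rough sketch of the points above: the snippet declares UTF-8 to the client with a Content-Type header and then checks the ASCII compatibility claim with mb_check_encoding. The sample strings are made up, and the mbstring extension is assumed to be loaded.

```php
<?php
// Declare the encoding to the client. Guarded so the sketch also runs
// when output has already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
// The database connection charset is configured separately on the open
// connection (for example with mysqli::set_charset()); not run here.

$ascii  = "plain ASCII";  // only bytes in the range 0x00-0x7F
$latin1 = "caf\xE9";      // "café" in Latin-1: 0xE9 is a single byte

var_dump(mb_check_encoding($ascii, 'ASCII'));   // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8'));  // bool(false)
```

The ASCII string passes both checks unchanged, while the Latin-1 string is rejected as UTF-8 because 0xE9 announces a multi-byte sequence that never follows.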
In your example, what happens is this.
First, you save your source file; your text editor is probably configured to use UTF-8, so your string literal ends up UTF-8 encoded on disk. PHP reads this file, interpreting the string as a series of bytes; $original now holds a UTF-8 encoded string that displays as 7 characters, which is just a byte sequence (though it contains far more than 7 bytes, because UTF-8 needs three bytes for every Devanagari code point). If you then call echo $original, the encoded string is sent to the client as-is; if you have told the client to expect UTF-8, everything is fine, but if you haven't, PHP has no way to tell the difference, and you'll end up with garbage in the browser. As an experiment, call strlen($original): strlen is encoding-agnostic and assumes a fixed-width 8-bit encoding, that is, one byte per character, so it will count bytes, not characters.
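One possible form of that experiment, assuming the script file itself is saved as UTF-8 and the mbstring extension is available, is to print the byte count next to the count that mb_strlen reports when it is told the encoding:

```php
<?php
// Assumes this source file is saved as UTF-8.
$original = "शक्नोम्यत्तुम्";

// Number of bytes in the string.
$bytes = strlen($original);

// Number of Unicode code points, given that the bytes are UTF-8.
$codePoints = mb_strlen($original, 'UTF-8');

echo $bytes, "\n";
echo $codePoints, "\n";
```

The two numbers differ, and the first is the larger one. Note that mb_strlen counts code points, and in Devanagari the vowel signs and the virama are code points of their own, so even the second number can be higher than the 7 glyphs you see on screen.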