PHP Strings – How PHP Internally Represents Strings

UTF-8? UTF-16?

Do strings in PHP also keep track of the encoding used?

Let's look at this script for example. Say I run:

$original = "शक्नोम्यत्तुम्";

What actually happens?

I assume $original will not hold just 7 bytes; each of those glyphs must be represented by several bytes.

Then I do:

$converted = mb_convert_encoding($original, "UTF-8");

What will happen to $converted? How will $converted be different from $original?

Will it be just the exact same byte sequence as $original but with a different encoding?

Best Answer

A PHP string is just a sequence of bytes, with no encoding tagged to it whatsoever. String values can come from various sources: the client (over HTTP), a database, a file, or from string literals in your source code. PHP reads all these as byte sequences, and it never extracts any encoding information.
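To make the "just bytes" point concrete, here is a minimal sketch. It assumes the mbstring extension is loaded; the variable names are illustrative only. It stores the same letter in two encodings and shows that all PHP can report is the raw bytes.

```php
<?php
// Two byte sequences that both "mean" é, in different encodings.
$utf8   = "\xC3\xA9"; // é encoded as UTF-8 (two bytes)
$latin1 = "\xE9";     // é encoded as ISO-8859-1 (one byte)

// PHP stores no encoding label with the value; it only has the bytes.
echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($latin1), "\n"; // e9

// The most you can do is ask whether the bytes are well-formed in an
// encoding that you name yourself (mbstring extension assumed).
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
```

Nothing in either variable records which encoding was intended; that knowledge has to live in your code or configuration.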

As long as all your data sources and destinations use the same encoding, the worst thing that can happen is that string positions are wrong (if you use multi-byte encodings), since PHP will count bytes, not characters.
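The "wrong string positions" problem can be sketched like this, assuming PHP 7.0 or later (for the "\u{...}" escape) and the mbstring extension. The sample word is my own, not from the question.

```php
<?php
$word = "na\u{00EF}ve"; // "naïve"; the ï takes two bytes in UTF-8

// Byte-oriented functions report byte offsets.
$bytePos = strpos($word, 'v');                // 4
// mbstring functions count characters in the encoding you name.
$charPos = mb_strpos($word, 'v', 0, 'UTF-8'); // 3

// Cutting at a byte offset can split a multi-byte character in half.
$byteCut = substr($word, 0, 3);               // "na" plus half of ï
$charCut = mb_substr($word, 0, 3, 'UTF-8');   // "naï"

var_dump($bytePos, $charPos);
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```

The byte-based cut yields a string that is no longer valid UTF-8, which is the kind of damage that shows up as garbage characters further down the line.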

But if the encodings don't match (e.g. you write a string literal in a source file stored as UTF-8, and then send it to a database that expects Latin-1), PHP will not perform any conversion for you: it will happily copy the bytes over raw.
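Since PHP never converts on its own, any conversion has to be requested explicitly. A minimal sketch, assuming the mbstring extension and a sample string of my own choosing:

```php
<?php
$utf8 = "caf\xC3\xA9"; // "café" as UTF-8 bytes

// Explicit conversion: target encoding first, source encoding second.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo bin2hex($utf8), "\n";   // 636166c3a9
echo bin2hex($latin1), "\n"; // 636166e9

// Converting UTF-8 to UTF-8 leaves valid input byte-for-byte unchanged.
$same = mb_convert_encoding($utf8, 'UTF-8', 'UTF-8');
var_dump($same === $utf8); // bool(true)
```

In the question's call, mb_convert_encoding($original, "UTF-8"), the source encoding is omitted, so mbstring falls back to its internal encoding. If that is UTF-8 (the default in recent PHP versions, as far as I know), $converted should hold exactly the same bytes as $original.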

The sanest solution is this:

Set PHP's internal encoding to UTF-8.
Save all your source files as UTF-8.
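Use UTF-8 as your output encoding (don't forget to send a suitable Content-Type header).
Set the database connection to use UTF-8 (SET NAMES utf8mb4 in MySQL; plain utf8 there is a 3-byte subset that cannot store characters outside the Basic Multilingual Plane).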
Configure everything else to be UTF-8 if at all possible.
For anything that you can't control (e.g. third-party web services), make sure you know the encoding, and convert to UTF-8 as early as possible, and back to the other encoding as late as possible.
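The first few items of the checklist above might look roughly like this in code. This is a sketch under assumptions: the mbstring extension is loaded, the page is HTML, and the DSN values (host, dbname) are placeholders I made up, so adapt them to your setup.

```php
<?php
// Tell mbstring and PHP's default output charset to use UTF-8.
mb_internal_encoding('UTF-8');
ini_set('default_charset', 'UTF-8');

// Announce the encoding to the client before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical DSN: host and dbname are placeholders. The charset
// parameter asks the MySQL driver for a UTF-8 connection.
$dsn = 'mysql:host=localhost;dbname=example;charset=utf8mb4';
// $pdo = new PDO($dsn, $user, $password); // needs a real server

echo mb_internal_encoding(), "\n"; // UTF-8
```

The PDO line is left commented out because it needs real credentials; passing the charset in the DSN is one common way to get the same effect as SET NAMES.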

Why UTF-8? Because it can represent every Unicode character, and therefore supersedes all the existing 7-bit and 8-bit encodings, and because it is binary compatible with ASCII: every valid ASCII string is also a valid UTF-8 string (but not vice versa).
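The ASCII compatibility claim is easy to check. In this sketch, isAscii() is a hypothetical helper written for the example, not a built-in, and the UTF-8 check assumes the mbstring extension.

```php
<?php
// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
function isAscii(string $s): bool
{
    return preg_match('/^[\x00-\x7F]*$/', $s) === 1;
}

$ascii = "hello";     // pure 7-bit ASCII
$utf8  = "h\xC3\xA9"; // "hé": contains a non-ASCII character

// An ASCII string is already valid UTF-8, with identical bytes.
var_dump(isAscii($ascii));                    // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8')); // bool(true)

// The reverse does not hold: these UTF-8 bytes are not ASCII.
var_dump(mb_check_encoding($utf8, 'UTF-8'));  // bool(true)
var_dump(isAscii($utf8));                     // bool(false)
```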

In your example, what happens is this.

First, you save your source file; your text editor is probably configured to use UTF-8, so your string literal ends up UTF-8 encoded on disk. PHP reads this file, interpreting the string as a series of bytes; $original now holds a UTF-8 encoded string, which is just a byte sequence. It is far longer in bytes than the number of glyphs you see, because UTF-8 uses three bytes for each Devanagari code point, and several of the glyphs here are built from more than one code point.

If you then call echo $original, the encoded string is sent to the client as-is. If you have told the client to expect UTF-8, everything is fine; if you haven't, PHP has no way to tell the difference, and you'll end up with garbage in the browser. As an experiment, try this:

$original = "शक्नोम्यत्तुम्";
echo strlen($original);

strlen knows nothing about encodings: it effectively treats the string as a fixed-width 8-bit encoding, one byte per character, so it counts bytes, not characters.
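For comparison, here is the same experiment with a character-aware count next to it. This is a sketch assuming PHP 7.0 or later and the mbstring extension; the word is spelled out as code point escapes so that the result does not depend on how the file is saved.

```php
<?php
// The word from the question, written as explicit code points
// (PHP 7.0+ escape syntax always produces UTF-8 bytes).
$original = "\u{0936}\u{0915}\u{094D}\u{0928}\u{094B}\u{092E}\u{094D}"
          . "\u{092F}\u{0924}\u{094D}\u{0924}\u{0941}\u{092E}\u{094D}";

echo strlen($original), "\n";             // 42 (bytes)
echo mb_strlen($original, 'UTF-8'), "\n"; // 14 (code points)
```

Note that even the code point count is larger than the number of glyphs on screen, because vowel signs and viramas are separate code points that combine with the preceding consonant.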
