A PHP string is just a sequence of bytes, with no encoding tagged to it whatsoever. String values can come from various sources: the client (over HTTP), a database, a file, or from string literals in your source code. PHP reads all these as byte sequences, and it never extracts any encoding information.
As long as all your data sources and destinations use the same encoding, the worst thing that can happen is that string positions are wrong (if you use multi-byte encodings), since PHP will count bytes, not characters.
But if the encodings don't match (e.g. you write a string literal in a source file stored as UTF-8, and then send it to a database that expects Latin-1), PHP will not perform any conversion for you: it will happily copy the bytes over raw.
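To make "copy the bytes over raw" concrete, here is a minimal sketch that compares the bytes of one character in two encodings. It assumes the mbstring extension is loaded; the variable names are purely illustrative.

```php
<?php
// Assumes the mbstring extension is loaded.
$utf8 = "\xC3\xA9";  // "é" encoded as UTF-8: two bytes

// An explicit conversion produces the single Latin-1 byte for "é".
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";    // c3a9
echo bin2hex($latin1), "\n";  // e9

// PHP never performs this conversion implicitly. A consumer that expects
// Latin-1 but receives the UTF-8 bytes reads two characters instead of one.
// Simulate that misreading by treating the UTF-8 bytes as Latin-1:
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n";          // Ã©
```

The last line is the classic symptom of a UTF-8 string being interpreted as Latin-1.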
The sanest solution is this:
- Set PHP's internal encoding to UTF-8.
- Save all your source files as UTF-8.
- Use UTF-8 as your output encoding (don't forget to send suitable `Content-Type` headers).
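- Set the database connection to use UTF-8 (`SET NAMES utf8mb4` in MySQL; the older `utf8` charset there stores at most three bytes per character, so it cannot hold every Unicode character).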
- Configure everything else to be UTF-8 if at all possible.
- For anything that you can't control (e.g. third-party web services), make sure you know the encoding, and convert to UTF-8 as early as possible, and back to the other encoding as late as possible.
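The "convert early, convert late" rule from the list above can be sketched roughly as follows. The helper names `to_utf8` and `from_utf8` are hypothetical, the Latin-1 web service is an invented example, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helpers; assumes the mbstring extension is loaded.
ini_set('default_charset', 'UTF-8');  // charset PHP puts in its default Content-Type
mb_internal_encoding('UTF-8');        // default encoding for the mb_* functions

// Convert as early as possible: external bytes become UTF-8 on arrival.
function to_utf8(string $bytes, string $sourceEncoding): string
{
    return mb_convert_encoding($bytes, 'UTF-8', $sourceEncoding);
}

// Convert as late as possible: UTF-8 becomes whatever the destination expects.
function from_utf8(string $utf8, string $targetEncoding): string
{
    return mb_convert_encoding($utf8, $targetEncoding, 'UTF-8');
}

$incoming = "caf\xE9";                         // "café" as a Latin-1 service might send it
$internal = to_utf8($incoming, 'ISO-8859-1');  // use this form everywhere inside the app
$outgoing = from_utf8($internal, 'ISO-8859-1');

echo bin2hex($internal), "\n";  // 636166c3a9
```

For the other list items, `header('Content-Type: text/html; charset=utf-8')` covers the output side, and with mysqli the connection charset can be set through `$mysqli->set_charset('utf8mb4')`.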
Why UTF-8? Because it can represent all Unicode characters and thus supersedes all the existing 7-bit and 8-bit encodings, and because it is binary-compatible with ASCII: every valid ASCII string is also a valid UTF-8 string (but not vice versa).
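The one-way compatibility with ASCII can be checked with a small sketch like this one, assuming the mbstring extension is loaded:

```php
<?php
// Assumes the mbstring extension is loaded.
$ascii = "hello";
$utf8  = "h\xC3\xA9llo";  // "héllo" encoded as UTF-8

var_dump(mb_check_encoding($ascii, 'ASCII'));  // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8'));  // bool(true): ASCII bytes are valid UTF-8
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($utf8, 'ASCII'));   // bool(false): contains bytes above 0x7F
```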
In your example, what happens is this.
First, you save your source file; your text editor is probably configured to use UTF-8, so your string literal ends up UTF-8 encoded on disk. PHP reads this file, interpreting the string as a series of bytes; `$original` now holds a UTF-8 encoded string of 7 characters, which is just a byte sequence (though it contains more than 7 bytes, because each character is represented by two or more bytes). If you then call `echo $original`, the encoded string is sent to the client as-is; if you have told the client to expect UTF-8, everything is fine, but if you haven't, PHP has no way to tell the difference, and you'll end up with garbage in the browser. As an experiment, try this:
$original = "शक्नोम्यत्तुम्";
echo strlen($original);
`strlen` is encoding-agnostic and assumes a fixed-width 8-bit encoding, that is, one byte per character, so it will count bytes, not characters.
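For comparison, here is a sketch that puts the byte count next to an encoding-aware count. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded. Note that `mb_strlen` counts Unicode code points, so vowel signs and viramas are counted separately and the result need not match the number of characters a reader perceives.

```php
<?php
// Assumes this file is saved as UTF-8 and that mbstring is loaded.
$original = "शक्नोम्यत्तुम्";

$bytes      = strlen($original);              // raw byte count
$codePoints = mb_strlen($original, 'UTF-8');  // Unicode code points

echo $bytes, "\n";
echo $codePoints, "\n";
```

Every code point in the Devanagari block takes three bytes in UTF-8, so the first number should come out as three times the second.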
You need log2(n) bits, rounded up, to address n bytes. For example, an 8-bit number can hold 256 different values, so 8 bits can address 256 bytes. 2^10 = 1024, so you need 10 bits to address every byte in a kilobyte. Likewise, you need 20 bits to address every byte in a megabyte, and 30 bits to address every byte in a gigabyte. 2^32 = 4294967296, which is the number of bytes in 4 gigabytes, so you need a 32-bit address for 4 GB of memory.
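The arithmetic can be checked with a small sketch. The function name `address_bits` is made up for illustration, and the code assumes a 64-bit PHP build so that `1 << 32` does not overflow. It uses an integer loop rather than `log()` to avoid floating-point rounding.

```php
<?php
// Hypothetical helper; assumes a 64-bit PHP build.
// Returns the smallest number of bits b such that 2^b >= $n.
function address_bits(int $n): int
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }
    $bits = 0;
    while ((1 << $bits) < $n) {
        $bits++;
    }
    return $bits;
}

echo address_bits(256), "\n";         // 8
echo address_bits(1024), "\n";        // 10
echo address_bits(1024 ** 2), "\n";   // 20
echo address_bits(1024 ** 3), "\n";   // 30
echo address_bits(4294967296), "\n";  // 32
```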
Is the ID always in the form IO123456? What your colleague could mean is that he only sends the numeric part, which fits easily in 4 bytes once the "IO" prefix is omitted.
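If that guess is right, the packing might look something like the sketch below. The function names are hypothetical, and both the ID format ("IO" followed by decimal digits) and the big-endian byte order are assumptions, since the actual wire format is not known here.

```php
<?php
// Assumption: IDs look like "IO" followed by decimal digits, e.g. "IO123456".
function pack_id(string $id): string
{
    if (!preg_match('/^IO(\d+)$/', $id, $m)) {
        throw new InvalidArgumentException("Unexpected ID format: $id");
    }
    $number = (int) $m[1];
    if ($number > 0xFFFFFFFF) {
        throw new RangeException("Numeric part does not fit in 4 bytes: $id");
    }
    return pack('N', $number);  // unsigned 32-bit, big-endian (an assumption)
}

function unpack_id(string $bytes): string
{
    return 'IO' . unpack('N', $bytes)[1];
}

$packed = pack_id('IO123456');
echo strlen($packed), "\n";     // 4
echo bin2hex($packed), "\n";    // 0001e240
echo unpack_id($packed), "\n";  // IO123456
```

Note that leading zeros in the numeric part would be lost in the round trip, so this only works if the IDs are not zero-padded or the padding width is fixed and known.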