Data Compression – How to Compress ASCII Strings into Fewer Bytes


I'm working with an embedded device that uses a custom protocol to send messages to other devices, and I'm writing an application that parses the sent packets. Each packet carries 8 bytes: the first byte is a header and the remaining 7 bytes are the data.

They are trying to pass a particular ID string but the ID string is 8 characters long (ASCII) so it won't fit in 7 bytes.

My colleague told me that they're going to turn the 8 ASCII bytes of the original string into a (decimal) integer and send me 4 bytes of it. They said I should be able to recover the original string from those 4 bytes. I'm having a hard time wrapping my head around this.

So if you have an ID string like "IO123456", that's 0x49 0x4f 0x31 0x32 0x33 0x34 0x35 0x36 in ASCII. How on earth can you compress that into 4 bytes by turning it into an integer, in a way that lets me get the original string back? Am I missing something, or is my colleague mistaken? I understand this is a really bizarre question, but it seriously does not make any sense to me.

Best Answer

Is the ID always in the form IO123456? What your colleague could mean is that they send only the numeric part and omit the "IO" prefix. The number 123456 fits easily in 4 bytes, and the receiver can put the known prefix back.
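To illustrate that reading of the colleague's plan, here is a minimal Python sketch. It assumes every ID is the fixed prefix "IO" followed by exactly six decimal digits, and that the integer travels as 4 big-endian bytes. The byte order and the helper names (pack_id, unpack_id) are placeholders, not part of the actual protocol.

```python
import struct

PREFIX = "IO"   # assumed fixed prefix, known to sender and receiver
DIGITS = 6      # assumed number of decimal digits after the prefix

def pack_id(id_string: str) -> bytes:
    """Encode e.g. 'IO123456' as a 4-byte big-endian unsigned integer."""
    if not id_string.startswith(PREFIX):
        raise ValueError("unexpected prefix")
    digits = id_string[len(PREFIX):]
    if len(digits) != DIGITS or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly six decimal digits")
    return struct.pack(">I", int(digits))

def unpack_id(payload: bytes) -> str:
    """Rebuild the ID string from the 4 bytes produced by pack_id."""
    (number,) = struct.unpack(">I", payload)
    # Zero-pad so that leading zeros in the ID survive the round trip.
    return f"{PREFIX}{number:0{DIGITS}d}"

packed = pack_id("IO123456")
print(packed.hex())        # 0001e240
print(unpack_id(packed))   # IO123456
```

This only works because both sides already know the prefix and the digit count. Eight arbitrary ASCII characters carry 8 × 7 = 56 bits, so in general they cannot be recovered from a 32-bit integer.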

Related Solutions

PHP Strings – How PHP Internally Represents Strings

A PHP string is just a sequence of bytes, with no encoding tagged to it whatsoever. String values can come from various sources: the client (over HTTP), a database, a file, or from string literals in your source code. PHP reads all these as byte sequences, and it never extracts any encoding information.

As long as all your data sources and destinations use the same encoding, the worst thing that can happen is that string positions are wrong (if you use multi-byte encodings), since PHP will count bytes, not characters.

But if the encodings don't match (e.g. you write a string literal in a source file stored as UTF-8, and then send it to a database that expects Latin-1), PHP will not perform any conversion for you: it will happily copy the bytes over raw.
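The effect of such a mismatch can be sketched in a few lines. The example is in Python rather than PHP, purely for illustration, and the character "é" is an arbitrary choice. The bytes are never changed, only interpreted under a different encoding.

```python
# UTF-8 bytes for "é" (U+00E9), as a UTF-8 source file would store them.
utf8_bytes = "\u00e9".encode("utf-8")
print(utf8_bytes)                      # b'\xc3\xa9'

# A consumer that assumes Latin-1 sees two characters instead of one.
misread = utf8_bytes.decode("latin-1")
print(len(misread))                    # 2
print(misread == "\u00c3\u00a9")       # True: it displays as "Ã©"
```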

The sanest solution is this:

Set PHP's internal encoding to UTF-8.
Save all your source files as UTF-8.
Use UTF-8 as your output encoding (don't forget to send suitable Content-type headers).
Set the database connection to use UTF-8 (SET NAMES utf8mb4 in MySQL; the older utf8 alias covers only characters that need at most 3 bytes).
Configure everything else to be UTF-8 if at all possible.
For anything that you can't control (e.g. third-party web services), make sure you know the encoding, and convert to UTF-8 as early as possible, and back to the other encoding as late as possible.

Why UTF-8? Because it can represent all Unicode characters and thus supersedes all the existing 7-bit and 8-bit encodings, and because it is binary compatible with ASCII: every valid ASCII string is also a valid UTF-8 string (but not vice versa).
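The ASCII compatibility claim is easy to check. The following is a small sketch in Python (not PHP), using arbitrary sample strings:

```python
text = "hello"
ascii_bytes = text.encode("ascii")

# Every ASCII string encodes to the same bytes under UTF-8.
print(ascii_bytes == text.encode("utf-8"))   # True
print(ascii_bytes.decode("utf-8"))           # hello

# The reverse does not hold: UTF-8 bytes above 0x7F are not valid ASCII.
try:
    "\u00e9".encode("utf-8").decode("ascii")
    reverse_ok = True
except UnicodeDecodeError:
    reverse_ok = False
print(reverse_ok)                            # False
```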

In your example, what happens is this.

First, you save your source file; your text editor is probably configured to use UTF-8, so your string literal ends up UTF-8 encoded on disk. PHP reads this file and treats the string as a series of bytes; $original now holds a UTF-8 encoded string, which to PHP is just a byte sequence (it contains more bytes than characters, because each character in it is encoded as more than one byte).

If you then call echo $original, the encoded string is sent to the client as-is. If you have told the client to expect UTF-8, everything is fine; if you haven't, PHP has no way to tell the difference, and you'll end up with garbage in the browser. As an experiment, try this:

$original = "शक्नोम्यत्तुम्";
echo strlen($original); // prints the number of bytes, not the number of characters

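strlen is encoding-agnostic: it effectively treats the string as a fixed-width 8-bit encoding, that is, one byte per character, so it counts bytes, not characters. To count characters in a UTF-8 string, use mb_strlen($original, 'UTF-8') from the mbstring extension.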
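The gap between byte count and character count can be shown with any multi-byte text. This sketch uses Python rather than PHP, with sample strings chosen only for illustration:

```python
text = "h\u00e9llo"                     # "héllo": 5 characters
print(len(text))                        # 5 (code points)
print(len(text.encode("utf-8")))        # 6 (bytes; "é" takes two)

devanagari = "\u0936"                   # a single Devanagari letter
print(len(devanagari))                  # 1
print(len(devanagari.encode("utf-8")))  # 3
```

A byte-counting function such as PHP's strlen reports the second figure in each pair.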

Computer Science – How Many Address Bits Are Required for n Bytes of Memory?

You need log2(n) bits to address n bytes, rounded up to the next whole number when n is not a power of two. For example, an 8-bit number can hold 256 different values, so 8 bits can address 256 bytes. 2¹⁰ = 1024, so you need 10 bits to address every byte in a kilobyte. Likewise, you need 20 bits to address every byte in a megabyte, and 30 bits to address every byte in a gigabyte. 2³² = 4294967296, which is the number of bytes in 4 gigabytes, so you need a 32-bit address for 4 GB of memory.
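As a quick check of these figures, here is a minimal Python sketch. The function name address_bits is made up for this example. It uses integer bit_length instead of a floating-point logarithm so that large sizes are not affected by rounding.

```python
def address_bits(n_bytes: int) -> int:
    """Smallest number of bits that gives each of n_bytes a distinct address."""
    if n_bytes < 1:
        raise ValueError("need at least one byte")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1.
    return (n_bytes - 1).bit_length()

print(address_bits(256))      # 8
print(address_bits(2**10))    # 10
print(address_bits(2**20))    # 20
print(address_bits(2**30))    # 30
print(address_bits(2**32))    # 32
print(address_bits(1000))     # 10 (not a power of two, so rounded up)
```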
