What type of encoding can I use to make a string shorter

nettext-encoding

I am interested in encoding a string I have and I am curious if there is a type of encoding that can be used that will only include alpha and numeric characters and would preferably shorten the number of characters needed to represent the string.

So far I have looked at using Base64 encoding to do this but it appears to make my string longer and sometimes includes == which I would like to avoid. Example:

test name|120101

becomes

dGVzdCBuYW1lfDEyMDEwMQ==

which goes from 16 to 24 characters and includes non-alphanumeric.

Does anyone know of a different type of encoding that I could use that will achieve my requirements? Bonus points if it's either built into the .NET framework or there exists a third party library that will do the encoding.

Best Answer

The final '=' or '==' in Base64 is there only to make the number of characters a multiple of 4. You can remove it, since you can always put it back later on. Note that Base64 is so called because it uses 64 distinct characters. Uppercase letters, lowercase letters, and digits, that's 62. So Base64 also uses '/' and '+', which may or may not fit your bill.

On a general basis, if you want to encode arbitrary sequences of bytes into alphanumeric characters, there is necessarily some length extension somewhere, because there are 256 possible values for a byte, and only 62 alphanumeric characters. It is sometimes called the pigeonhole principle. An encoding scheme must have an average length extension of a factor log 256 / log 62 = 1.344 (average over all sequences of bytes); otherwise, it means that some pigeons are being crushed to death somewhere and you will not get them back without damage (which means: two distinct strings encoded to the same, so decoding cannot work reliably).

Now, it is quite possible that your strings are not exactly "sequences of uniformly random bytes"; your strings have some meaning which means that most possible sequence of bytes will not occur, because they are meaningless. On that basis, you can probably devise an encoding scheme which will incur less length extension than generic Base64 (or Base62 if you need to stick to strict alphanumeric characters). This is lossless data compression. It works over a clearly defined probabilistic model of what can appear as input.

Summary: a generic scheme for encoding strings into alphanumerical sequences such that no or little length extension ever occurs, cannot exist; it is a mathematical impossibility. A specific scheme tailored for the kind of input string you expect can probably exist (but since you do not tell what kind of string you may encounter, noone can help you on this).

Related Topic