Algorithms – How to Compress a List of Non-Unique Integers and Retain Order

algorithmscompression

Let's start out with an example

[1,1,1,5,3,1,1,2,78,2,3,1,1,...,1]

As you can tell by the example, 1 is repeated a lot, but there will be outliers (like 78, and really anything that isn't 1).

The catch to my question is, when I uncompress the numbers, they must retain the original order. The algorithms that I have found generally deal with unique numbers, either sorted or unsorted. I wouldn't mind using one of those, but since my integer list is not unique, I'm wondering if there is anything currently out there that is optimized for this sort of thing.

Quickly thinking, My first idea was to do a normal compression tree and just map in values for patterns.

My second Idea was to just describe repeating numbers like so: 3->1, 1->5, 1->3, 2->1, 1->2, 1->78, 1->2, 1->3, x->1.

Of course this only works if values are constantly repeated, which generally the lower numbers will be repeated a lot, but I can't be 100% positive that that will be the case every time.

Best Answer

The simplest approach to this is simple run length encoding. This is part of what you describe in "My second Idea was to just describe repeating numbers like so: 3->1, 1->5, 1->3, 2->1, 1->2, 1->78, 1->2, 1->3, x->1."

This works for a lot of data and you will frequently see it in bitmaps (where every value is a 0 or a 1 - thats why its in the 'image' section of the Data Compression Template on Wikipedia). The difficulty with this method is that the more irregular the data, the the smaller the runs and the more bookkeeping necessary. Essentially with RLE, the book keeping is with each block.

You could also look to Huffman coding. Its quite a bit more work to do than simple RLE, but it also can save quite a bit more space.

The thing is that your numbers, they're still fixed sized. Lets say they're all bytes (8 bits) rather than ints or longs. Now, the 1 is actually 0000 0001 and 78 is 0100 1110. The trick with huffman encoding is that we can make 1 take up 1 bit instead because it is so common.

For the sequence [1,1,1,5,3,1,1,2,78,2,3,1,1,1] we have 1, 5, 3, 2, 78 as symbols - 5 symbols total.

This could give us a Huffman encoding table that looks like:

1  => 1
2  => 01
3  => 0010
5  => 0011
78 => 0001

This would allow us to encode the above data from:

0000 0001, 0000 0001, 0000 0001, 0000 0101, 0000 0011, 0000 0001,
0000 0001, 0000 0010, 0100 1110, 0000 0010, 0000 0011, 0000 0001, 
0000 0001, 0000 0001

To:

1, 1, 1, 0011, 0010, 1, 1, 01, 0001, 01, 0010, 1, 1, 1

Which should be clear is much less information. Yes, you will still need to pass the book keeping information about what those numbers actually mean with it too. You've got something that compresses a bit tighter than RLE does in most cases because of the common bookkeeping information than it would with the information being stored with each run.

Related Topic