Algorithms – How to Compress a List of Non-Unique Integers and Retain Order

algorithmscompression

Let's start out with an example

[1,1,1,5,3,1,1,2,78,2,3,1,1,...,1]

As you can tell by the example, 1 is repeated a lot, but there will be outliers (like 78, and really anything that isn't 1).

The catch to my question is, when I uncompress the numbers, they must retain the original order. The algorithms that I have found generally deal with unique numbers, either sorted or unsorted. I wouldn't mind using one of those, but since my integer list is not unique, I'm wondering if there is anything currently out there that is optimized for this sort of thing.

Quickly thinking, My first idea was to do a normal compression tree and just map in values for patterns.

My second Idea was to just describe repeating numbers like so: 3->1, 1->5, 1->3, 2->1, 1->2, 1->78, 1->2, 1->3, x->1.

Of course this only works if values are constantly repeated, which generally the lower numbers will be repeated a lot, but I can't be 100% positive that that will be the case every time.

Best Answer

The simplest approach to this is simple run length encoding. This is part of what you describe in "My second Idea was to just describe repeating numbers like so: 3->1, 1->5, 1->3, 2->1, 1->2, 1->78, 1->2, 1->3, x->1."

This works for a lot of data and you will frequently see it in bitmaps (where every value is a 0 or a 1 - thats why its in the 'image' section of the Data Compression Template on Wikipedia). The difficulty with this method is that the more irregular the data, the the smaller the runs and the more bookkeeping necessary. Essentially with RLE, the book keeping is with each block.

You could also look to Huffman coding. Its quite a bit more work to do than simple RLE, but it also can save quite a bit more space.

The thing is that your numbers, they're still fixed sized. Lets say they're all bytes (8 bits) rather than ints or longs. Now, the 1 is actually 0000 0001 and 78 is 0100 1110. The trick with huffman encoding is that we can make 1 take up 1 bit instead because it is so common.

For the sequence [1,1,1,5,3,1,1,2,78,2,3,1,1,1] we have 1, 5, 3, 2, 78 as symbols - 5 symbols total.

This could give us a Huffman encoding table that looks like:

1  => 1
2  => 01
3  => 0010
5  => 0011
78 => 0001

This would allow us to encode the above data from:

0000 0001, 0000 0001, 0000 0001, 0000 0101, 0000 0011, 0000 0001,
0000 0001, 0000 0010, 0100 1110, 0000 0010, 0000 0011, 0000 0001, 
0000 0001, 0000 0001

To:

1, 1, 1, 0011, 0010, 1, 1, 01, 0001, 01, 0010, 1, 1, 1

Which should be clear is much less information. Yes, you will still need to pass the book keeping information about what those numbers actually mean with it too. You've got something that compresses a bit tighter than RLE does in most cases because of the common bookkeeping information than it would with the information being stored with each run.

Related Solutions

Finding common prefixes for a set of strings

You want to store strings in compressed form (to save space, I guess), but you want fast lookup, is this right? If I were you, I would go for the speed and use a trie (for the first few characters). It has O(log n) lookup, and it will automatically condense common prefixes.

A lot depends on the statistics of the strings, like how many there are, and their typical length.

ADDED: For the strings you gave, the trie would look like this:

- A B - 1 2 3 4 5 - 6 .
  |     |           |
  |     |           7 .
  |     |           
  |     C D E F G - H .
  |                 |
  |                 X 1 .
  |                 |
  |                 Y .
  |
  X X X X .

Each node of the trie contains a little "dictionary" of "words", initially 1 letter long, and each "word" points to a sub-node. If that sub-node contains only one "word" in its own "dictionary", then that "word" can be absorbed into its parent's "word", and that's how you build up the prefixes.

Video Compression – Fast, Lossless Compression of a Video Stream

You've got temporal prediction, but no spatial. For better compression at the cost of speed, you should be able to use the pixels above and to the left of the current pixel in the current frame as predictors, as well as the pixel at the same location in the previous frame. The reason for only looking up and left is the same as the reason for only looking at the previous frame; you want to only rely on data you've already decoded, and limit how much of it you have to keep around.

Rice codes are probably a good tradeoff between efficiency and speed, but a static Huffman code (precomputed by you on a sample of video data) might be more efficient and equally fast.

As for speed, make sure that your code is getting vectorized — either by using the right compiler flags and code patterns to allow the compiler to auto-vectorize, or by hand-writing the code to use vector intrinsics or assembly.

Finally, is dropping down to 8 bits per pixel a possibility? Obviously that's leaving the realm of "lossless", but not only would it reduce the size of your compressed output, it would also, with vectorized code, possibly increase your throughput up to 2x.

Best Answer

Related Solutions

Finding common prefixes for a set of strings

Video Compression – Fast, Lossless Compression of a Video Stream

Related Topic