Algorithms – Compression Algorithm for Non-Repeating Integers

algorithmscompression

I have an array of unique integers, for example: {1,3,,7,9,31,46,…}, which I want to compress. I have found compression techniques and algorithms for the list of integers, where some integers are repeated but I couldn't find any compression technique for list of unique integers. I would appreciate if anyone could help me to get an idea so that I could start with.

I have 1024 datasets of size 512 where where each value is a distinct integer between 1 and N (or 0 and N-1), i.e., a permutation of a set of N consecutive integers. I also assume that there is given a tolerance x and so, I want to find a compressed representation that allows the original values to be recovered with deviations no greater than x.
Thanks.

Best Answer

You have to consider that compression - by which I assume you mean lossless compression - equates to the removal of redundant information. If you write 12,12,12,12,12 there's redundance and you can write it as 12*5.

So you need to find the information that you can make redundant. For example the sequence 1,2,3,4,5,6,7,18,19,20,21 is nonrepeating, yet there is redundance and you can "compress" it as 1,7,18,4 (storing the first element of an increasing sequence and the number of elements) or 1,7,18,21 (storing the first and last elements of all sequences).

Then you must keep in mind that this kind of compression a trade - instead of using a set of symbols with some known occurrence probability you use another set, with a different spectrum, organized so that the most likely sequences map to the shortest symbols, even if on average the symbols are bulkier. But there will always be a killer sequence to which your compression will be applied with little result, or even catastrophic results (as whatsitsname remarked, most random sequences will turn out to become larger; which is why applying a compression unsuited for your data will likely avail little). For example in the "1,2,3..." sequence, if you get 1,3,5,7,9,11,13 you're forced to store it as 1,1,3,1,5,1,7,1,9,1,13,1 -- doubling the sequence size. For the same reason if you rename a .zip file as .dat (to prevent Zip from refusing to compress it) and try to compress it, you'll notice that once compressed, the file actually gets larger. Perhaps by very little, perhaps just by that handful of bits necessary to store the information "This file is stored uncompressed!" - but larger nonetheless.

Finally you need to state what you mean by compression. Are you referring to the number of integers in the sequence? Or the number of characters in its text representation? Say you have 1,7,9,5,11. You notice that no number is larger or equal to 12. So no product of any two numbers A and B will be larger than 11*A+B (actually a product of N0, N1, ...Nn can be translated as 11n*Nn+...+110*N0). Using Horner's algorithm, n integers become one (larger) integer. Using a larger base for clarity (100 instead of 12), 1,7,9,11 becomes 1070911. In characters, you haven't gained that much - eleven characters have become seven (using 12 as base, 1-7-9-11 becomes 2855 which is four characters). Counting integers, four have become one - or rather two, as you need to account for the need of storing the 12 also.

If you have an increasing sequence, you can store the first number and the differences (1-3-7-9-31-46 becomes 1-2-4-2-22-15), which may yield smaller numbers, which might be useful, though they'll now be probably non-monotonic.

Finally, a sequence with no redundancy because there is no relation between the previous numbers and the next one, so that each number provides 100% information on itself - in other words, a random sequence - will never be losslessly compressible, because there is nothing to squeeze out. You will be able to compress its textual representation, but this is not what your question seems to suggest.

So, without more information on the sequence's characteristics and your specific scenario, I'm afraid there's very little that can be done for you.

Case in point

You need to deal with VERY large numbers (N=512, and you don't want to manipulate (N-1)! without heavy duty gloves).

But the main problem is that the permutation is essentially random, so compressing the numbers is not going to work reliably; I think we might have better luck with some ad hoc coding.

Now every integer in itself holds 9 bits of information (512 = 29), so your data set is at a first glance actuality composed of 4608 bits (or 4599, since the last number is uniquely identified by the other 511 - it's the only one missing from the set).

But this is assuming that it's not a permutation but a combination, that every integer is free to assume any value; we do not exploit the uniqueness constraint, which adds at least 511 bits more information, or in other words requires 2n*(n-1)+1 bytes, not 2n * n. Let's look at it more closely.

The first number (0) has 512 possible positions which we can code in 9 bits (29 = 512).

The next number can only go in one of the remaining 511 positions, which still requires 9 bits. In short, the first 256 numbers all need 9 bits.

When the number of positions remaining drop below 256, we can use 8 bits. So, the next 128 numbers can be coded into the remaining 256 positions using 8 bits, until the number of free positions becomes 127.

And so on, until the 511th number can only go in either of two positions and requires one bit.

One number is left and it can only go in the last place, so it adds no information and it makes for a further 0 bits.

So we need

256*9 + 128*8 + 64*7 + 32*6 + 16*5 + 8*4 + 4*3 + 2*2 + 1*1 bits =

= 2304 + 1024 + 448 + 192 + 80 + 32 + 12 + 4 + 1 =

= 4097 bits

which is 2n * (n-1)+1 (in this case it looks like 214+1).

You can represent those 4097 bits using 133 dword integers from 0 to 2147483648 if you want to code them in the smallest possible number of integers.

If you want to code them in the smallest possible number of decimal digits, then you want to find a power of ten which is the nearest possible to the immediately lesser power of two. With a number of bits below 32, this happens with three: 23 is 8, which requires one digit, and in one decimal digit you can code at the most the number 9.

But taking into account the separator, a single digit requires two characters each three bits, if you also need the comma (you mightn't, because one digit needs no separator).

The best becomes then 223 or 8388608, seven digits (plus one comma, eight).

So your 512 numbers (1940 characters) become (4097/23) * 8 = 1432 numeric characters and commas. Or it is an uninterrupted sequence of (4097/3) * 1 = 1366 digits. Using 63 bits and numbers up to 263-1 or 9223372036854775807, we get around (4097/63)(19+1) = 1300 characters.

At best, you can code the sequence into 683 base64 characters, or a 65% reduction, which makes your 1024 dataset 684 Kb in total.

(Or you can binary-code it into 513 bytes).

The algorithm

Let BUFFER be your sequence of numbers (137, 410, 128, ...)
initialize a vector POSITION[512]
for i = 0 to 511 included:
    # The position of number BUFFER[i] (e.g. 128) is 2.
    POSITION[BUFFER[i]] = i

next = 256
for i = 0 to 510 included:
    POS = POSITION[i]
    # How many unchecked positions are between 0 and POS?
    Q   = 0
    for j = 0 to POS:
        if COPYOFBUFFER[j] >= 0:
            Q = Q + 1
    # Mark the POS-th address as "checked" by setting it to -1
    COPYOFBUFFER[POS] = -1;
    # Or you could set it to the first unused number, 512, as MAXVAL, and
    # modify the previous cycle  to check that COPYOFBUFFER[j] < MAXVAL

    # Now, i-th number is uniquely identified by Q
    # And Q is uniquely identified by a diminishing number of bits
    output(Q, bits)
    if (i == next):
        bits = bits - 1
        next = next / 2

The function "output" will output 8 bits (one character) or maybe 6 bits (one Base64 digit) and keep the rest internally, prepending said rest to the next call. You will need to make a last call to flush any remaining bits.

This is a not too efficient C implementation.

/**
 * outputs N bits (maximum N depends on integer type)
 * @param n   number of bits
 * @param val value holding the bits in its "n" least significant bits
 */
void output_bits(int n, int val) {
    static int buffer = 0;
    static int bits   = 0;
    if (0 == n) {
        // The request to output 0 bits makes no sense and is used
        // to mean "flush the buffer".
        // If the buffer contains 5 bits, we need 3 additional bits.
        if (0 == bits) {
            // buffer is already flushed.
            return;
        }
        n    = 8 - bits;
        val  = 0;
    }
    int mask = 1 << n; // we have 000111 (so, 7) and 3 bits. We get 0100 (4)
    while (n--) {
        // shift buffer by 1 bit left to accommodate next bit.
        buffer = buffer <<= 1;
        if (val & mask) {
            // We have a 1.
            buffer |= 1;
        }
        bits++;
        if (0 == (bits % 8)) {
            // We have a full byte.
            output_byte(buffer & 0xff);
        }
        // Shift the mask right to get the next bit.
        mask >>= 1;
    }
}

To reconstruct the array, proceed in reverse: as soon as you've gathered enough bits to retrieve a Q, move into the buffer by Q empty positions and place the number there:

 |__________________________________________|
 read first byte:  23, but we need 9 bits
 read second byte: 139, so the first number is 23*2+(139/128) = 47
      and the remainder is (139-27) = 112
 read first Q: 47
 ...so the first number, 1, goes into slot 47
 |______________________1___________________|
 now read third byte: 34, so the second number is 112*2+(34/128) = 224
 the second number, 2, should go into the 224th empty slot, which is
 number 225 because slot 47 is already taken
 |______________________1__________________2____|
 ...and so on.

Another algorithm

You can squeeze some more bits out of the sequence by operating as follows. You code the position of the first number among the 512 still free spaces in the sequence using 9 bits. Then, the second number's position requires a number from 0 to 511 to identify the p-th free slot in the sequence. 511 is 8.99718... bits. You have now need of coding fractional bits, and you can google those and look e.g. phase-in codes.

The sequence will now size sum which is rounded to 3867, from the original naive estimate of 4599, or a reduction of around 16% to 484 bytes. Decoding will be quite slow as you need to walk the whole sequence to translate a number to its true position.

Related Topic