I implemented a few methods to test if data is compressible.
Simplified Compression
This basically checks for duplicate byte pairs:
    static boolean isCompressible(byte[] data, int len) {
        int result = 0;
        // check in blocks of 256 bytes,
        // and sum up how compressible each block is
        for (int start = 0; start < len; start += 256) {
            result += matches(data, start, Math.min(start + 255, len));
        }
        // the result is proportional to the number of
        // bytes that can be saved
        // if we can save many bytes, then it is compressible
        return ((len - result) * 777) < len * 100;
    }

    static int matches(byte[] data, int i, int end) {
        // bitArray is a bloom filter of seen byte pairs
        // match counts duplicate byte pairs
        // last is the last seen byte
        int bitArray = 0, match = 0, last = 0;
        if (i < 0 || end > data.length) {
            // this check may allow the JVM to avoid
            // array bound checks in the following loop
            throw new ArrayIndexOutOfBoundsException();
        }
        for (; i < end; i++) {
            int x = data[i];
            // the bloom filter bit to set
            int bit = 1 << ((last ^ x) & 31);
            // if it was already set, increment match
            // (without using a branch, as branches are slow)
            match -= (-(bitArray & bit)) >> 31;
            bitArray |= bit;
            last = x;
        }
        return match;
    }
On my (limited) set of test data, this algorithm is quite accurate. It is about 5 times faster than actually compressing the data when the data is not compressible. For trivial data (all zeroes), however, it is only about half as fast.
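For example, calling it on an all-zero buffer and on a random buffer looks roughly like this (a small illustrative sketch; the class name and buffer sizes are mine):

    import java.util.Random;

    public class CompressibilityDemo {

        // isCompressible(..) and matches(..) from above go here

        public static void main(String[] args) {
            byte[] zeros = new byte[8192];   // trivial data, should be reported as compressible
            byte[] random = new byte[8192];  // random data, should usually not be
            new Random().nextBytes(random);
            System.out.println(isCompressible(zeros, zeros.length));
            System.out.println(isCompressible(random, random.length));
        }
    }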
Partial Entropy
This algorithm estimates the entropy of the high nibbles. I wanted to avoid using too many buckets, because they have to be zeroed out each time (which is slow if the blocks to check are small). 63 - numberOfLeadingZeros is the integer base-2 logarithm (I wanted to avoid using floating point numbers). Depending on the data, it is faster or slower than the algorithm above (I'm not sure why). The result isn't quite as accurate as the algorithm above, possibly because it uses only 16 buckets and only integer arithmetic.
    static boolean isCompressible(byte[] data, int len) {
        // the number of bytes with
        // high nibble 0, 1, .., 15
        int[] sum = new int[16];
        for (int i = 0; i < len; i++) {
            int x = (data[i] & 255) >> 4;
            sum[x]++;
        }
        // see wikipedia to understand this formula :-)
        int r = 0;
        for (int x : sum) {
            long v = ((long) x << 32) / len;
            r += 63 - Long.numberOfLeadingZeros(v + 1);
        }
        return len * r < 438 * len;
    }
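To illustrate the 63 - numberOfLeadingZeros trick, here is a small sketch (mine, purely for illustration) showing that it is just the position of the highest set bit, i.e. the integer base-2 logarithm rounded down:

    public class IntegerLog2Demo {
        public static void main(String[] args) {
            for (long v : new long[] { 1, 2, 3, 15, 16, 255, 1L << 32, (1L << 40) + 123 }) {
                // position of the highest set bit
                int fast = 63 - Long.numberOfLeadingZeros(v);
                // the same value, computed the slow and obvious way
                int slow = 0;
                for (long t = v; t > 1; t >>= 1) {
                    slow++;
                }
                System.out.println(v + ": " + fast + " = " + slow);
            }
        }
    }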
Best Answer
Most generic compression algorithms work with a one-byte granularity.
Let's consider a very repetitive example string; ours begins with three "X" characters (three times the byte 0x58), and any generic compression algorithm handles that kind of data very well.
Now let's encode our string in Base64. All algorithms are now saying: "What kind of mess is that?". And they're not likely to compress the encoded string very well.
As a reminder, Base64 basically works by re-encoding groups of 3 bytes in (0...255) into groups of 4 bytes in (0...63): the 24 bits of each input group, [aaaaaaaa][bbbbbbbb][cccccccc], are simply regrouped into four 6-bit packets, [aaaaaa][aabbbb][bbbbcc][cccccc].
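In code, that regrouping looks roughly like this (a minimal sketch of the bit manipulation, not taken from any real Base64 implementation):

    public class SixBitGroups {
        // split one group of 3 bytes (24 bits) into 4 values of 6 bits each
        static int[] toSixBitGroups(int b0, int b1, int b2) {
            return new int[] {
                b0 >> 2,                        // top 6 bits of b0
                ((b0 & 0x03) << 4) | (b1 >> 4), // low 2 bits of b0, top 4 bits of b1
                ((b1 & 0x0F) << 2) | (b2 >> 6), // low 4 bits of b1, top 2 bits of b2
                b2 & 0x3F                       // low 6 bits of b2
            };
        }

        public static void main(String[] args) {
            // three bytes 0x58 ("XXX") become the values 22, 5, 33, 24
            for (int v : toSixBitGroups(0x58, 0x58, 0x58)) {
                System.out.println(v);
            }
        }
    }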
Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):

    0         1         2         3         4         5         6
    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
For instance, our example string begins with a group of three bytes equal to 0x58 in hexadecimal (the ASCII code of the character "X"), or 01011000 in binary. Let's apply Base64 encoding:

    0x58     0x58     0x58          the three input bytes
    01011000 01011000 01011000      the same bytes in binary
    010110 000101 100001 011000     regrouped into 6-bit packets
    22     5      33     24         their decimal values
    W      F      h      Y          the Base64 characters: "WFhY"
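If you want to double-check that result, a tiny snippet (mine) using the standard java.util.Base64 encoder reproduces it:

    import java.util.Base64;

    public class Base64Check {
        public static void main(String[] args) {
            // three times the byte 0x58, i.e. the string "XXX"
            byte[] input = { 0x58, 0x58, 0x58 };
            // prints "WFhY": one repeated byte has turned into
            // four different, seemingly unrelated characters
            System.out.println(Base64.getEncoder().encodeToString(input));
        }
    }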
Basically, the pattern "3 times the byte 0x58" which was obvious in the original data stream is not obvious anymore in the encoded data stream because we've broken the bytes into 6-bit packets and mapped them to new bytes that now appear to be random.
Or in other words: we have broken the original byte alignment that most compression algorithms rely on.
Whatever compression method is used, encoding first usually has a severe impact on compression performance. That's why you should always compress first and encode second.
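To see the effect in practice, here is a small experiment (mine, using java.util.zip.Deflater as the compressor) that compresses some mildly redundant data once directly and once after Base64 encoding; the Base64-encoded version typically compresses noticeably worse:

    import java.util.Base64;
    import java.util.zip.Deflater;

    public class CompressThenEncode {

        // size of the data after running it through Deflater
        static int deflatedSize(byte[] data) {
            Deflater deflater = new Deflater();
            deflater.setInput(data);
            deflater.finish();
            byte[] out = new byte[data.length * 2 + 64];
            int n = deflater.deflate(out);
            deflater.end();
            return n;
        }

        public static void main(String[] args) {
            // some mildly redundant sample data
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 200; i++) {
                sb.append("the quick brown fox jumps over the lazy dog ").append(i);
            }
            byte[] raw = sb.toString().getBytes();
            byte[] encoded = Base64.getEncoder().encode(raw);
            System.out.println("raw:    " + raw.length + " -> " + deflatedSize(raw) + " bytes");
            System.out.println("base64: " + encoded.length + " -> " + deflatedSize(encoded) + " bytes");
        }
    }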
This is even more true for encryption: compress first, encrypt second.
EDIT - A note about LZMA
As MSalters noticed, LZMA -- which xz uses -- works on bit streams rather than byte streams.
Still, this algorithm will also suffer from Base64 encoding, in a way that is essentially consistent with my earlier description: even when working at the bit level, it's much easier to recognize a pattern in the original bit sequence than in the Base64-encoded one.