C# – way to produce a binary diff on two byte arrays in c# .net

algorithmcdata structuresnet

I am trying to compare two byte arrays in memory and produce a data structure to hold the differences so that with byte array B and the data structure holding the differences it is possible to recreate byte array A.

The size of the byte arrays are always identical and are relatively small. The byte arrays represent bitmaps which typically range in size from 1000x1000x32 to 128x128x32.

The speed and efficiency (from a cpu time pov) at which the data-structure-that-holds-the-differences and the second byte array can be combined/used to reconstruct byte array A is of the highest importance. It is not as important that the generation of the difference object be as efficient.

Right now my solution will be to use a binary search + md5 hashing approach to build a list of the smallest units of difference inside the image and package the raw byte data with a reference of its offset inside byte array A.

I.e. I generate the hash for Image A's byte array and compare it to the hash of Image B's byte array, if it doesn't match I horizontally split the image in half and hash each block then compare those hashes between the images, this process is then repeated recursively until all blocks that don't match hit a minimum size like 32x32x32 and/or a max number of splits. Once a block is marked as matching it is no longer part of the recursive search. Once all the differences have been identified they will be added to a list of changes and that list is then the difference object.

Are there any built in ways to efficiently produce the results I'm looking for? Or are there any libraries that would do the job?


Notes:

  • This is a learning project (for WCF, WPF and a number of other technologies),
  • It is a VNC style system – so the images are snapshots of a window and will be sent over a LAN connection.
  • The reason the reconstructing of Byte array A must be efficient is that a client can connect to more than one server and each server can serve more than one window, so a client could be monitoring/interacting with 30+ windows.
  • I would like to achieve 3 fps+ in each window for any content that changes.

Best Answer

If they're guaranteed to be the same size - actually, the same dimensions - then I'm not seeing the importance of all this hashing and binary searches and other overhead. You can simply compare the two byte-by-byte in a loop, and if they don't match, add a "point" to your diff containing both the index and the value in A. To reverse the process, you don't need to look at every byte because you already have indexes.

If the two arrays differ by, say, just 1 byte, then you'll end up with a diff structure that's 5 bytes in size (assuming you use an Int32 for the index), and takes exactly 1 iteration to mutate B back into A. In general the process is O(n) for the diff and O(m) for the revert, where m is the total number of points that actually changed. I'm not an expert on data structures but I doubt you'll be able to come up with something more efficient.

So, something like this:

Diff GetDiff(byte[] a, byte[] b)
{
    Diff diff = new Diff();
    for (int i = 0; i < a.Length; i++)
    {
        if (a[i] != b[i])
        {
            diff.Points.Add(new DiffPoint(i, a[i]));
        }
    }
    return diff;
}

// Mutator method - turns "b" into the original "a"
void ApplyDiff(byte[] b, Diff diff)
{
    foreach (DiffPoint point in diff.Points)
    {
        b[point.Index] = point.Value;
    }
}

// Copy method - recreates "a" leaving "b" intact
byte[] ApplyDiffCopy(byte[] b, Diff diff)
{
    byte[] a = new byte[b.Length];
    int startIndex = 0;
    foreach (DffPoint point in diff.Points)
    {
        for (int i = startIndex; i < point.Index; i++)
        {
            a[i] = b[i];
        }
        a[point.Index] = point.Value;
        startIndex = point.Index + 1;
    }
    for (int j = startIndex; j < b.Length; j++)
    {
        a[j] = b[j];
    }
    return a;
}

struct DiffPoint
{
    public int Index;
    public byte Value;

    public DiffPoint(int index, byte value) : this()
    {
        this.Index = index;
        this.Value = value;
    }
}

class Diff
{
    public Diff()
    {
        Points = new List<DiffPoint>();
    }

    public List<DiffPoint> Points { get; private set; }
}

There's a lot of looping in the ApplyDiffCopy but if you work it out you'll see that it actually only performs one operation per point. Of course, if you don't need a copy and just want to mutate B, then the first ApplyDiff method will be extremely fast if there aren't many actual differences.

And obviously I haven't done much error-checking here. You would want to write your version a bit more defensively (verify array lengths, etc.)

If I've correctly understood the assumptions here and the problem you're trying to solve, then the original ApplyDiff method is going to be the fastest way to restore the original image.