Best method for Pattern Matching on Binary String

Tags: algorithms, binary, pattern-matching, search

I need to search a long string of binary data (a couple of megabytes' worth of binary digits) for patterns of binary digits. There are about 100 000 different patterns to search for, each a string of 1s and 0s at most 10 bits/characters long. I then need to display the most common of all these patterns.

The binary data is fed in real time to some sort of text file or database. It's essentially a large block of binary digits that I need to search for these 100 000 different patterns in the fastest, most effective way possible. Each new digit comes in roughly every 10 seconds, one after the other.

So for example, the following string might come in:

10101110000110001011011111000101011010111101110100000011011101010101010111011100101010101011000101110011011111110001010010000110101011100000110011100101001010011001011010110101101010001010101010100010101010010100101001010101010111010101010010100101010011101010101010101001001010110101

and I need to find which of these 100 000 patterns is the most common, with patterns such as 0001100111, 0110011011, 1100011 and so on.

What would be the best way to do this on real-time data being fed to a text file? It does not really need to be exactly real time, but I would rather it be as close to real time as possible. I'd also prefer a straight console affair to keep latency as low as possible, but I can adapt, as it doesn't really need to be terminal-based either.

I've got about 1 year of programming experience in Bash. Would a simple Bash script be fast enough for 100 000 such patterns? All those thousands of patterns make me fear that a simple Bash script might be a little too slow and outdated. Or can SQL really be faster than a straight Bash script? I've heard about binary matching algorithms, but I don't really know what those are, and they seem a little out of my league at the moment; however, I'm willing to get to the bottom of it if it's truly the most effective way. What's the best way to do this?

Best Answer

Reformulated problem statement

You have some huge binary content and you'll have to scan it for a set of thousands of bit patterns. The main problem is that the bit patterns can happen anywhere in the content, and not just at the start of a byte.

Algorithmic solution

Let's say that each pattern has an id, an initial bit offset of 0, a bit length, and is defined as a sequence of bytes aligned on the first byte's most significant bit. The sequence of bytes shall be long enough to accommodate 7 more bits (e.g. 110110..-........, where the dots represent unused bits in the pattern bytes that will be set to 0, and the dash marks a change of byte).
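To make that representation concrete, here is a minimal sketch in Python (chosen only for readability; the answer itself is language-agnostic) of packing a pattern such as 110110 into byte-aligned form with room for 7 extra bits. The function name pack_pattern and the padding choice are illustrative assumptions, not part of the original answer.

```python
def pack_pattern(bits: str) -> bytes:
    """Pack a pattern like '110110' into bytes, aligned on the first
    byte's most significant bit; unused trailing bits are left as 0.
    The buffer is sized so that shifting by up to 7 bits still fits."""
    total_bits = len(bits) + 7                         # room for shifts of 1..7
    n_bytes = (total_bits + 7) // 8                    # round up to whole bytes
    value = int(bits, 2) << (n_bytes * 8 - len(bits))  # left-align the bits
    return value.to_bytes(n_bytes, "big")

# '110110' -> b'\xd8\x00'  (11011000 00000000)
print(pack_pattern("110110").hex())
```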

  1. For each pattern, build the 7 derived variants obtained by shifting all of the pattern's bits by an offset of 1 to 7 (e.g. .110110.-........, ..110110-........, ...11011-0......., etc.), and store all 7 of them together with the original pattern in a map that maps each variant back to the pattern id.

  2. For every map item:

    1. For every byte offset of the content for which the remaining content is at least as long as the length n (in bytes) of the pattern bytes:

      1. Make a copy of the n content bytes, and set the first offset bits of the copy to 0. Also set the last (8 - ((offset + bit length) mod 8)) mod 8 bits to 0 (the unused trailing bits of the last byte).

      2. Compare the altered copy with the pattern variant, byte by byte. If every byte is equal, increment the count for the mapped pattern.

  3. Sort the patterns by decreasing match count.

This algorithm will be very fast because the comparison in step 2.1.2 is done byte by byte, as for strings. Step 2.1.1 is also fast, as it is based on two binary AND operations on the copy.
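The original answer gives no code, so here is a minimal sketch of the whole procedure in Python (again chosen only for readability; a C implementation would follow the same structure). The names pack_shifted, make_mask and count_matches, as well as the end-of-stream padding, are illustrative assumptions. The sketch builds the 8 shifted variants of each pattern, masks each content window with two AND operations, compares byte by byte, and finally sorts by count.

```python
from collections import Counter

def pack_shifted(bits: str, shift: int, n_bytes: int) -> bytes:
    """Pattern bits placed at bit offset `shift` inside n_bytes bytes,
    all other bits 0 (one of the 8 'derivatives' of the pattern)."""
    value = int(bits, 2) << (n_bytes * 8 - shift - len(bits))
    return value.to_bytes(n_bytes, "big")

def make_mask(shift: int, bit_len: int, n_bytes: int) -> bytes:
    """AND mask that keeps only the bits occupied by the pattern:
    the first `shift` bits and the trailing unused bits become 0."""
    ones = (1 << bit_len) - 1
    value = ones << (n_bytes * 8 - shift - bit_len)
    return value.to_bytes(n_bytes, "big")

def count_matches(content: bytes, patterns: list[str]) -> Counter:
    counts = Counter()
    for bits in patterns:
        bit_len = len(bits)
        for shift in range(8):                    # the original + 7 shifted variants
            n_bytes = (shift + bit_len + 7) // 8  # bytes spanned at this shift
            variant = pack_shifted(bits, shift, n_bytes)
            mask = make_mask(shift, bit_len, n_bytes)
            for i in range(len(content) - n_bytes + 1):
                window = content[i:i + n_bytes]
                # zero the leading and trailing unused bits (two ANDs),
                # then compare byte by byte
                if bytes(w & m for w, m in zip(window, mask)) == variant:
                    counts[bits] += 1
    return counts

# Example: content given as a string of '0'/'1' digits, as in the question
bit_string = "10101110000110001011011111000101"
padded = bit_string + "0" * (-len(bit_string) % 8)  # pad to whole bytes
content = int(padded, 2).to_bytes(len(padded) // 8, "big")
patterns = ["0001100111", "0110011011", "1100011"]
for pattern, count in count_matches(content, patterns).most_common(3):
    print(pattern, count)
```

Note that padding the tail of the content with zero bits, as done here for simplicity, can produce spurious matches at the very end of the stream; a real implementation would track the true bit length and stop the scan accordingly.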
