What are the shift rules for Boyer–Moore string search algorithm

algorithmboyer-moorestring-search

I have been trying to understand shift rules in Boyer–Moore string search algorithm but haven't understood them. I read here on wikipedia but that is too complex !

It will be of great help if someone lists the rule in a simple manner.

Best Answer

In the Boyer-Moore algorithm, you start comparing pattern characters to text characters from the end of the pattern. If you find a mismatch, you have a configuration of the type

....xyzabc....      <-text
  ....uabc          <- pattern
      ^
    mismatch

Now the bad character shift means to shift the pattern so that the text character of the mismatch is aligned to the last occurrence of that character in the initial part of the pattern (pattern minus last pattern character), if there is such an occurrence, or one position before the pattern if the mismatched character doesn't appear in the initial part of the pattern at all.

That could be a shift to the left, if the situation is

     v
...xyzazc...
 ....uazc
 ..uazc

so that alone doesn't guarantee a progress.

The other shift, the good suffix shift, aligns the matched part of the text, m, with the rightmost occurrence of that character sequence in the pattern that is preceded by a different character (including none, if the matched suffix is also a prefix of the pattern) than the matched suffix m of the pattern - if there is such an occurrence.

So for example

           v
....abcdabceabcfabc...
    ...xabcfabcfabc
        ...xabcfabcfabc

would lead to a good suffix shift of four positions, since the matched part m = abcfabc occurs in the pattern four places left of its suffix-occurrence and is preceded by a different character there (x instead of f) than in the suffix position.

If there is no complete occurrence of the matched part in the pattern preceded by a different character than the suffix, the good suffix shift aligns a suffix of the matched part of the text with a prefix of the pattern, choosing maximal overlap, e.g.

      v
...robocab....
    abacab
        abacab

The good suffix shift always shifts the pattern to the right, so guarantees progress.

Then, on every mismatch the advances of the bad character shift and the good suffix shift are compared, and the greater is chosen. It is explained in greater depth by Christian Charras and Thierry Lecroq here, along with many other string searching algorithms.

For the example you mentioned in the comments,

SSIMPLE EXAMPLE
EXAMPLE
  ^

the matched suffix is MPLE, and the mismatched text character is I. So the bad character shift looks for the last occurrence of I in the initial part of the pattern. There is none, so that bad character shift would shift the pattern so that the mismatched I is one before the start of the pattern

SSIMPLE EXAMPLE
   EXAMPLE

and the good suffix shift looks for the rightmost occurrence of MPLE in the pattern not preceded by an A, or the longest suffix of MPLE that is a prefix of the pattern. There is no complete occurrence of the matched part in the pattern before the suffix, so the longest suffix of the matched part that is also a prefix of the pattern determines the good suffix shift. In this case, the two suffixes of the matched part that are prefixes of the pattern are the single-character string E, and the empty string. The longest is obviously the nonempty string, so the good suffix shift aligns the one-character suffix E in the matched part of the text with the one-character prefix of the pattern

SSIMPLE EXAMPLE
      EXAMPLE

The good suffix shift shifts the pattern farther right, so that is the chosen shift.

Then there is an immediate mismatch at the last pattern position, and then the bad character shift aligns the P in the text with the P in the pattern (and the good suffix shift need not be considered at all if the mismatch occurs at the last pattern character, since in that case, it would never produce a larger shift than the bad character shift).

Then we have the complete match.

In the variant with the pattern TXAMPLE, the good suffix shift finds that no non-empty suffix of the matched part is a prefix of the pattern (and there is no occurrence of the complete matched part in the pattern not preceded by A), so the good suffix shift aligns the empty suffix of the matched part of the text (the boundary between the E and the space) with the empty prefix of the pattern (the empty string preceding the T), resulting in

SSIMPLE EXAMPLE
       TXAMPLE

(then in the next step, the bad character shift aligns the two Ls, and the next mismatch in the step thereafter occurs at the initial T of the pattern).

Related Solutions

.net – the best algorithm for overriding GetHashCode

I usually go with something like the implementation given in Josh Bloch's fabulous Effective Java. It's fast and creates a pretty good hash which is unlikely to cause collisions. Pick two different prime numbers, e.g. 17 and 23, and do:

public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        // Suitable nullity checks etc, of course :)
        hash = hash * 23 + field1.GetHashCode();
        hash = hash * 23 + field2.GetHashCode();
        hash = hash * 23 + field3.GetHashCode();
        return hash;
    }
}

As noted in comments, you may find it's better to pick a large prime to multiply by instead. Apparently 486187739 is good... and although most examples I've seen with small numbers tend to use primes, there are at least similar algorithms where non-prime numbers are often used. In the not-quite-FNV example later, for example, I've used numbers which apparently work well - but the initial value isn't a prime. (The multiplication constant is prime though. I don't know quite how important that is.)

This is better than the common practice of XORing hashcodes for two main reasons. Suppose we have a type with two int fields:

XorHash(x, x) == XorHash(y, y) == 0 for all x, y
XorHash(x, y) == XorHash(y, x) for all x, y

By the way, the earlier algorithm is the one currently used by the C# compiler for anonymous types.

This page gives quite a few options. I think for most cases the above is "good enough" and it's incredibly easy to remember and get right. The FNV alternative is similarly simple, but uses different constants and XOR instead of ADD as a combining operation. It looks something like the code below, but the normal FNV algorithm operates on individual bytes, so this would require modifying to perform one iteration per byte, instead of per 32-bit hash value. FNV is also designed for variable lengths of data, whereas the way we're using it here is always for the same number of field values. Comments on this answer suggest that the code here doesn't actually work as well (in the sample case tested) as the addition approach above.

// Note: Not quite FNV!
public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = (int) 2166136261;
        // Suitable nullity checks etc, of course :)
        hash = (hash * 16777619) ^ field1.GetHashCode();
        hash = (hash * 16777619) ^ field2.GetHashCode();
        hash = (hash * 16777619) ^ field3.GetHashCode();
        return hash;
    }
}

Note that one thing to be aware of is that ideally you should prevent your equality-sensitive (and thus hashcode-sensitive) state from changing after adding it to a collection that depends on the hash code.

As per the documentation:

You can override GetHashCode for immutable reference types. In general, for mutable reference types, you should override GetHashCode only if:

You can compute the hash code from fields that are not mutable; or

You can ensure that the hash code of a mutable object does not change while the object is contained in a collection that relies on its hash code.

The link to the FNV article is broken but here is a copy in the Internet Archive: Eternally Confuzzled - The Art of Hashing

The fastest substring search algorithm

Build up a test library of likely needles and haystacks. Profile the tests on several search algorithms, including brute force. Pick the one that performs best with your data.

Boyer-Moore uses a bad character table with a good suffix table.

Boyer-Moore-Horspool uses a bad character table.

Knuth-Morris-Pratt uses a partial match table.

Rabin-Karp uses running hashes.

They all trade overhead for reduced comparisons to a different degree, so the real world performance will depend on the average lengths of both the needle and haystack. The more initial overhead, the better with longer inputs. With very short needles, brute force may win.

Edit:

A different algorithm might be best for finding base pairs, english phrases, or single words. If there were one best algorithm for all inputs, it would have been publicized.

Think about the following little table. Each question mark might have a different best search algorithm.

                 short needle     long needle
short haystack         ?               ?
long haystack          ?               ?

This should really be a graph, with a range of shorter to longer inputs on each axis. If you plotted each algorithm on such a graph, each would have a different signature. Some algorithms suffer with a lot of repetition in the pattern, which might affect uses like searching for genes. Some other factors that affect overall performance are searching for the same pattern more than once and searching for different patterns at the same time.

If I needed a sample set, I think I would scrape a site like google or wikipedia, then strip the html from all the result pages. For a search site, type in a word then use one of the suggested search phrases. Choose a few different languages, if applicable. Using web pages, all the texts would be short to medium, so merge enough pages to get longer texts. You can also find public domain books, legal records, and other large bodies of text. Or just generate random content by picking words from a dictionary. But the point of profiling is to test against the type of content you will be searching, so use real world samples if possible.

I left short and long vague. For the needle, I think of short as under 8 characters, medium as under 64 characters, and long as under 1k. For the haystack, I think of short as under 2^10, medium as under a 2^20, and long as up to a 2^30 characters.

Best Answer

Related Solutions

.net – the best algorithm for overriding GetHashCode

The fastest substring search algorithm

Related Topic