Algorithm to find multiple string matches

algorithmboostsearchstring

I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ possibilities. The search terms may be 1 or more words.

Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.

I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. Language is C++ and I can use boost.

An example of search terms could be a list of Fortune 500 company names.

Ideas?

Best Answer

DonĀ“t reinvent the wheel

This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.

The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.

In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.

And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.

A note of caution: multiple-string matching have been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm don't hesitate on standing on their broad shoulders. Don't reinvent the wheel.