How to Quickly Search Through a Very Large List of Strings/Records in a Database

algorithms · data-structures · database · indexing

I have the following problem: a database containing more than 2 million records, each about 500 bytes in size.
Each record has a string field X, and I want to display a list of the records for which field X contains a given string.

To make it more concrete: in the GUI of my application I have a text field where I can enter a string. Above the text field I have a table displaying the (first N, e.g. 100) records that match the string in the text field. When I type or delete one character in the text field, the table content must be updated on the fly.

I wonder if there is an efficient way of doing this using appropriate index structures and/or caching. As explained above, I only want to display the first N items that match the query, so for N small enough, loading the matching items from the database should not be a big issue. Besides, caching items in main memory can make retrieval faster.

I think the main problem is how to find the matching items quickly, given the pattern string. Can I rely on some DBMS facilities, or do I have to build some in-memory index myself? Any ideas?

EDIT

I have run a first experiment. I split the records into separate text files (at most 200 records per file) and put the files in different directories (using the content of one data field to determine the directory tree). I ended up with about 50,000 files in about 40,000 directories. I then ran Lucene to index the files. Searching for a string with the Lucene demo program is pretty fast. Splitting and indexing took a few minutes: this is totally acceptable for me because it is a static data set that I want to query.
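For reference, here is a minimal sketch of what that indexing step can look like in code, loosely following the Lucene demo's IndexFiles class: walk the directory tree and add one document per file, with a stored "path" field and an analyzed "contents" field. The directory names ("records", "index") are placeholders, not from my actual setup, and the API shown is roughly Lucene 5+; exact class names vary between versions.

```java
// Sketch only: index a tree of record files, one Lucene document per file.
// Field names follow the Lucene demo; directory names are placeholders.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class IndexRecords {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer =
                 new IndexWriter(FSDirectory.open(Paths.get("index")), config);
             Stream<Path> files = Files.walk(Paths.get("records"))) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    Document doc = new Document();
                    // "path" is stored so a search hit can point back at the file.
                    doc.add(new StringField("path", file.toString(), Field.Store.YES));
                    // "contents" is analyzed (tokenized into words) but not stored.
                    doc.add(new TextField("contents",
                            new String(Files.readAllBytes(file), StandardCharsets.UTF_8),
                            Field.Store.NO));
                    writer.addDocument(doc);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
    }
}
```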

The next step is to integrate Lucene in the main program and use the hits returned by Lucene to load the relevant records into main memory.
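A minimal sketch of that integration step, assuming the index was built with a stored "path" field and an analyzed "contents" field as in the Lucene demo (and in the indexing sketch above). The maxHits cutoff and class name are my own; the API again targets roughly Lucene 5+:

```java
// Sketch only: query an existing Lucene index and use each hit's stored
// "path" field to locate the record file on disk.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class SearchRecords {
    public static void main(String[] args) throws Exception {
        String userInput = args.length > 0 ? args[0] : "example";
        int maxHits = 100; // only the first N matches are displayed

        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            // Escape the raw text so it is treated as literal terms.
            Query query = parser.parse(QueryParser.escape(userInput));

            TopDocs hits = searcher.search(query, maxHits);
            for (ScoreDoc hit : hits.scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                System.out.println(doc.get("path"));
            }
        }
    }
}
```

Each hit's "path" field points back at one of the record files, so the surrounding application can load just those records into its in-memory cache and GUI table.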

Best Answer

Instead of putting your data inside the DB, you can keep it as a set of separate documents (text files) and store only a link (path/URL, etc.) in the DB.

This is essential because a SQL sub-string query (e.g. WHERE x LIKE '%foo%') cannot use an ordinary index and degenerates into a full table scan, so both the search and the retrieval will be very slow by design.

Your problem is now reformulated as searching a set of text files for a given string. There are two possibilities here.

  1. Sub-string match. If your text blob is a single string or word (without any whitespace) and you need to search for an arbitrary sub-string within it, then you have to scan every file to find the ones that match, using an algorithm such as Boyer-Moore (a sketch follows this list). This is essentially what grep does internally. But you may still have to scan 100+ files (worst case, all 2 million records) before returning.

  2. Indexed search. Here you assume that the text consists of words and that searches are for whole words. In this case, the documents are indexed over all the words they contain; this is commonly called "full-text search". There are a number of algorithms for this and a number of open-source projects that can be used directly. Many of them also support wildcard search, approximate search, etc. For example:
    a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
    b. OpenFTS : http://openfts.sourceforge.net/
    c. Sphinx: http://sphinxsearch.com/
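To make option 1 above concrete, here is a minimal sketch of such a scan. It uses the Boyer-Moore-Horspool simplification of Boyer-Moore, which keeps only the bad-character rule; the class and method names are mine, not from any library:

```java
// Sketch only: substring scan with the Boyer-Moore-Horspool variant.
public class SubstringScan {

    // Returns the index of the first occurrence of pattern in text, or -1.
    static int horspool(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        // Shift table: how far to slide the pattern on a mismatch, based on
        // the text character aligned with the pattern's last position.
        int[] shift = new int[Character.MAX_VALUE + 1];
        java.util.Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern.charAt(i)] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= n - m) {
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) j--;
            if (j < 0) return pos;                  // full match found
            pos += shift[text.charAt(pos + m - 1)]; // slide the pattern right
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(horspool("database indexing", "index")); // prints 9
    }
}
```

Even with a good skip table, this still has to touch every candidate file on every keystroke, which is exactly why the indexed approach of option 2 scales better for a data set this size.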

If your queries are for whole ("fixed") words, the second approach will most likely be very fast and effective.
