Database – Why don’t databases have good full text indexes

database, product-features

Why don't any of the major RDBMS systems like MySQL, SQL Server, Oracle, etc. have good full text indexing support?

I realize that most databases support full text indexes to some degree, but they are usually slower and have a smaller feature set. It seems that every time you want a really good full text index, you have to go outside the database and use something like Lucene/Solr or Sphinx.

Why isn't the technology in these full text search engines completely integrated into the database engine? There are lots of problems with keeping the data in another system such as Lucene, including keeping it up to date and the inability to join the results with other tables. Is there a specific technological reason why these two technologies can't be integrated?

Best Answer

The short answer is because text retrieval has almost nothing in common with how traditional databases are designed and used. Someone who is an ace at creating/using an RDBMS is like a lamb to the slaughter when they approach text retrieval for the first time.

(Sorry for the long answer, but I'm sick in bed today and I've got nothing else to do.)

The following could easily come under TL;DR, but if you have the time and the interest, what follows is a piece of the longer answer. Note: I'm speaking from having implemented a commercial information retrieval system starting in 1986. We were a technical success, but a marketing flop.

Doing IR (Information Retrieval) properly requires that you begin by thinking about what you are searching for and how you will find it using your query mechanism. This may sound easy, but it is anything but easy. Here are just some of the things you will have to decide before you even begin scanning your documents (or fields).

  1. Does case matter? Is DoD the same as dod? How about "flame" and "FLAME" (a cologne based on the Burger King Whopper (yes, really)).
  2. What kinds of tokens will you index? You obviously want to index "daddy". You probably want to index "daddy123". Do you want to index "123"? "12.3"? "192.168.1.1"?
  3. How do you deal with things like hyphenation? A somewhat out-of-date example is "data base", "database", and "data-base", all of which were in use concurrently in 1986.
  4. If your query language supports the concept of "Find A in the same sentence as B", how do you determine sentence breaks? Although '?' and '!' are easy enough, those '.'s are a bitch. Think about things like "Mr.", "2.", "etc.", etc.
  5. Are you going to support stemming? If so, how careful will you be not to accidentally change the POS (Part Of Speech)? E.g. "cats" can stem to "cat", but "blinds" may or may not stem to "blind". If it was a verb ("He blinds me") then you can stem, but if it was a noun ("I like your blinds") you can't (or at least shouldn't). Stemming is very seductive, but it is a swamp of the First Order.
  6. What languages are you going to support? What works in English can fail big time in either French or German, although strangely enough it will tend to work OK for Japanese in the Hepburn romaji representation.

And the list goes on and on.
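To make a couple of these decisions concrete, here is a tiny tokenizer sketch (illustrative Python of my own, not code from any particular engine). The case-folding flag and the token pattern are just one possible set of answers to points 1 and 2, and notice how it quietly splits "data-base" into two terms, which is point 3 biting you:

    import re

    # Hypothetical policy choices -- every engine has to pick its own answers.
    FOLD_CASE = True                     # point 1: "DoD" and "dod" become the same term
    TOKEN_RE = re.compile(
        r"[A-Za-z]+\d*"                  # point 2: index "daddy" and "daddy123" ...
        r"|\d+(?:\.\d+)+"                # ... plus dotted numbers like "12.3" or "192.168.1.1"
    )

    def tokenize(text):
        """Yield index terms from raw text under the policy above."""
        for match in TOKEN_RE.finditer(text):
            token = match.group(0)
            yield token.lower() if FOLD_CASE else token

    print(list(tokenize("DoD gave 192.168.1.1 to daddy123 for the data-base")))
    # ['dod', 'gave', '192.168.1.1', 'to', 'daddy123', 'for', 'the', 'data', 'base']

Change that pattern by one character and a different slice of your collection becomes findable (or unfindable). That is the whole point: these are policy decisions, and you have to make them before you index a single document.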

Then we have to think about our query language. It may seem that if all you are going to support is simple Boolean then it should be easy, but the one thing that is pretty much universally agreed upon is that pure Boolean sucks for text. For example, you will need additional operators to specify ordering and proximity, and boy, oh boy, does that ever make life more complicated. You also need to know what section you are in -- title, header, body, etc. -- which leads to all sorts of collection-specific parsing fun. But now it's no longer sufficient to just have a list of the tokens that occur in a doc; you have to know where in the doc they occur. This results in an address tuple of (docID, sectionID, para-in-section, sentence-in-para, word-in-sentence). Efficiently storing and searching this information gets gnarly for any non-toy collection.
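As a rough illustration of that address tuple and of why proximity operators complicate life, here is a toy positional index in Python (the field names and the near() operator are mine, purely for illustration):

    from collections import namedtuple

    # One posting per occurrence of a term -- the address tuple described above.
    # A real engine would pack this far more compactly than a namedtuple.
    Posting = namedtuple("Posting", "doc_id section_id para sentence word")

    # Toy positional index: term -> list of postings, kept in document order.
    index = {
        "full": [Posting(1, "body", 0, 2, 4), Posting(7, "title", 0, 0, 1)],
        "text": [Posting(1, "body", 0, 2, 5), Posting(7, "title", 0, 0, 2)],
    }

    def near(term_a, term_b, max_gap=3):
        """Naive 'A within max_gap words of B, in the same sentence'."""
        hits = []
        for a in index.get(term_a, []):
            for b in index.get(term_b, []):
                same_sentence = (a.doc_id, a.section_id, a.para, a.sentence) == \
                                (b.doc_id, b.section_id, b.para, b.sentence)
                if same_sentence and abs(a.word - b.word) <= max_gap:
                    hits.append((a, b))
        return hits

    print(near("full", "text", max_gap=1))   # both toy documents match here

Even this toy version is quadratic per term pair; doing it efficiently over millions of postings is where the real engineering lives.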

Then there is the actual structure of your data store. Text systems are normally implemented as a "full inversion" of the documents. How many indices does the average DB have? 10? 50? 500? In IR it is not uncommon to have 5,000,000 or more indices, one for each separate token. And any given token may have 1 instance (e.g., "narfle" or "garthok") or 10,000,000 instances (e.g., "the"). This means that your whole method for creating and updating indices has to be lightning fast or you are going to sink into the swamp. And you still have many of the same problems a traditional DB does: disk space management, crash recovery, coherent snapshots from a running system, etc., etc.
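If it helps to picture what a "full inversion" is, here is the idea in a few lines of Python (again, a sketch, nothing like an on-disk format): every distinct token gets its own posting list, and those lists are wildly skewed in size:

    from collections import defaultdict

    docs = {
        1: "the cat sat on the mat",
        2: "the narfle of the garthok",
        3: "the quick brown fox",
    }

    # Full inversion: one posting list per distinct token in the collection.
    inverted = defaultdict(list)          # token -> [(doc_id, word_position), ...]
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            inverted[token].append((doc_id, pos))

    print(inverted["the"])       # [(1, 0), (1, 4), (2, 0), (2, 3), (3, 0)]
    print(inverted["narfle"])    # [(2, 1)]

Scale those three documents up to a few million and that dictionary has millions of keys, some with one entry and some with tens of millions, which is exactly the update and storage problem described above.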

Finally there is results ranking. An unranked result set from a Boolean query against a large collection is useless to a human. It might be useful to a program, but that was not what I was dealing with. Although our system implemented Boolean, our selling point was that we were the first commercially available system to support similarity searching, based on the Cosine Coefficient. The math and logic of this type of search (basically a normalized dot product of the query vector against millions of document vectors) required radically different approaches to data representation and storage than did Boolean -- definitely not something available in your average DB.
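For the curious, the core of that kind of ranking fits in a few lines; this is a bare-bones cosine similarity sketch in Python using plain term-frequency vectors (our real system used far better weighting and a very different storage layout, so treat this purely as an illustration of the math):

    import math
    from collections import Counter

    def vector(text):
        """Term-frequency vector for a piece of text (no idf weighting, for brevity)."""
        return Counter(text.lower().split())

    def cosine(q, d):
        """Normalized dot product of two sparse term vectors."""
        dot = sum(q[t] * d.get(t, 0) for t in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * \
               math.sqrt(sum(v * v for v in d.values()))
        return dot / norm if norm else 0.0

    docs = {
        "d1": "text retrieval is not a database problem",
        "d2": "the database stores rows and columns",
    }
    query = vector("full text retrieval")
    ranked = sorted(docs, key=lambda name: cosine(query, vector(docs[name])), reverse=True)
    print(ranked)   # ['d1', 'd2'] -- d1 scores higher for this query

The hard part is not the formula; it is evaluating it against millions of document vectors fast enough to be interactive, which is what forces the exotic data representations.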

All of this (and more) is why "text retrieval" and "database" almost don't belong in the same sentence together. I think you would be better off picking a good database for your "normal" needs, and then using an external IR system to index/search the "documents" in your primary DB.