The short answer is because text retrieval has almost nothing in common with how traditional databases are designed and used. Someone who is an ace at creating/using an RDBMS is like a lamb to the slaughter when they approach text retrieval for the first time.
(Sorry for the long answer, but I'm sick in bed today and I've got nothing else to do.)
The following could easily come under TL;DR, but if you have the time and the interest, what follows is a piece of the longer answer. Note: I'm speaking from having implemented a commercial information retrieval system starting in 1986. We were a technical success, but a marketing flop.
Doing IR (Information Retrieval) properly requires that you begin by thinking about what you are searching for and how you will find it using your query mechanism. This may sound easy, but it is anything but easy. Here are just some of the things you will have to decide before you even begin scanning your documents (or fields).
- Does case matter? Is DoD the same as dod? How about "flame" and "FLAME" (a cologne based on the Burger King Whopper (yes, really)).
- What kinds of tokens will you index? You obviously want to index "daddy". You probably want to index "daddy123". Do you want to index "123"? "12.3"? "192.168.1.1"?
- How do you deal with things like hyphenation? A somewhat out-of-date example is "data base", "database", and "data-base", all of which were in use concurrently in 1986.
- If your query language supports the concept of "Find A in the same sentence as B", how do you determine sentence breaks? Although '?' and '!' are easy enough, those '.'s are a bitch. Think about things like "Mr.", "2.", "etc.", etc.
- Are you going to support stemming? If so, how careful will you be not to accidentally change the POS (Part Of Speech)? E.g. "cats" can stem to "cat", but "blinds" may or may not stem to "blind". If it is a verb ("He blinds me") then you can stem, but if it is a noun ("I like your blinds") you can't (or at least shouldn't). Stemming is very seductive, but it is a swamp of the First Order.
- What languages are you going to support? What works in English can fail big time in either French or German, although strangely enough it will tend to work OK for Japanese in the Hepburn Romaji representation.
And the list goes on and on.
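To make those decisions concrete, here is a toy tokenizer sketch (mine, not from any real IR engine) where each of the choices above becomes an explicit policy knob. The regex and the flag names are illustrative assumptions, not a recommendation:

```python
import re

def tokenize(text, keep_numbers=True, fold_case=True):
    """A toy tokenizer illustrating the indexing decisions above.

    Policy choices (all of them debatable, which is the point):
    - fold_case: "DoD" and "dod" become the same token
    - keep_numbers: whether "123" or "12.3" get indexed at all
    - hyphens are treated as separators, so "data-base" matches "data base"
    """
    # Split on anything that is not a letter, digit, or dot
    raw = re.split(r"[^A-Za-z0-9.]+", text)
    tokens = []
    for tok in raw:
        tok = tok.strip(".")          # "etc." -> "etc", "2." -> "2"
        if not tok:
            continue
        if fold_case:
            tok = tok.lower()
        if not keep_numbers and not re.search(r"[A-Za-z]", tok):
            continue
        tokens.append(tok)
    return tokens
```

Note that even this tiny sketch already bakes in answers to the case, number, hyphen, and abbreviation questions; a different collection would want different answers.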
Then we have to think about our query language. It may seem that if all you are going to support is simple Boolean then it should be easy, but the one thing that is pretty much universally agreed upon is that pure Boolean sucks for text. For example, you will need additional operators to specify ordering and proximity, and boy, oh, boy does that ever make life more complicated. You also need to know what section you are in -- title, header, body, etc. -- which leads to all sorts of collection-specific parsing fun. But now it's no longer sufficient to just have a list of tokens that occur in the doc; you have to know where in the doc they occur. This results in an address tuple of (docID, sectionID, para-in-section, sentence-in-para, word-in-sentence). Efficiently storing and searching this information can get gnarly for a non-toy collection.
Then there is the actual structure of your data store. Text systems are normally implemented as a "full inversion" of the documents. How many indices does the average DB have? 10? 50? 500? In IR it is not uncommon to have 5,000,000 or more indices, one for each separate token. And any given token may have 1 instance (e.g. "narfle" or "garthok") or 10,000,000 instances (e.g. "the"). This means that your whole method for creating and updating indices has to be lightning fast or you are going to sink into the swamp. And you still have many of the same problems that a traditional DB does: disk space management, crash recovery, coherent snapshots from a running system, etc., etc.
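The full inversion plus the positional address tuple can be sketched in a few lines. This is a purely in-memory illustration with made-up document layout conventions; a real engine keeps compressed (delta plus varint encoded) posting lists on disk, which is where the gnarly part lives:

```python
from collections import defaultdict

# postings: token -> list of address tuples
# (docID, sectionID, para-in-section, sentence-in-para, word-in-sentence)
index = defaultdict(list)

def add_document(doc_id, sections):
    """sections maps section_id -> paragraphs -> sentences -> tokens,
    e.g. {"body": [[["the", "cat", "sat"]]]} for one paragraph of one
    sentence. One posting is appended per token occurrence: this is the
    'full inversion' -- every token becomes its own index."""
    for sec_id, paragraphs in sections.items():
        for p, sentences in enumerate(paragraphs):
            for s, sentence in enumerate(sentences):
                for w, token in enumerate(sentence):
                    index[token].append((doc_id, sec_id, p, s, w))

def same_sentence(a, b):
    """'Find A in the same sentence as B': two postings match if they
    agree on everything except the word-in-sentence slot."""
    spots_a = {addr[:4] for addr in index[a]}
    return [addr for addr in index[b] if addr[:4] in spots_a]
```

The proximity operator falls out of comparing tuple prefixes, which is exactly why the address tuple has to be stored at all.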
Finally there is results ranking. An unranked result set from a Boolean query against a large collection is useless to a human. It might be useful to a program, but that was not what I was dealing with. Although our system implemented Boolean, our selling point was that we were the first commercially available system to support similarity searching, based on the Cosine Coefficient. The math and logic of this type of search (basically a normalized dot product of the query vector against millions of document vectors) required radically different approaches to data representation and storage than did Boolean -- definitely not something available in your average DB.
All of this (and more) is why "text retrieval" and "database" almost don't belong in the same sentence together. I think you would be better off picking a good database for your "normal" needs, and then using an external IR system to index/search the "documents" in your primary DB.
The key is the R in RDBMS, which stands for relational. Contrary to popular belief, it doesn't mean relations between tables, but rather the fact that each table is a relation in the mathematical sense of the word.
The relational model has quite significant implications. You have to model your data to fit relations and normalize that model. If your application is designed as an object-oriented model, the relational model is not a good fit. This is widely known as the object-relational impedance mismatch.
One approach to this mismatch is ORMs (object-relational mappers), which have gained a lot of popularity. But they are not a true solution; they are more of a workaround for the problem. In particular, they still don't really solve the problem of mapping class inheritance onto the relational model.
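To see the inheritance problem concretely, here is a minimal sketch of the "single-table inheritance" strategy many ORMs fall back on. The class hierarchy and column names are hypothetical; the point is that every subclass's attributes get squeezed into one wide row, with a type discriminator and NULLs for the columns that don't apply:

```python
# Hypothetical class hierarchy an ORM might need to persist:
class Vehicle:
    def __init__(self, vid):
        self.vid = vid

class Car(Vehicle):
    def __init__(self, vid, doors):
        super().__init__(vid)
        self.doors = doors

class Boat(Vehicle):
    def __init__(self, vid, draft):
        super().__init__(vid)
        self.draft = draft

def to_row(v):
    """Single-table inheritance: one wide row per object, a 'type'
    discriminator column, and NULL for every column belonging to a
    different subclass -- the mismatch made visible."""
    return {
        "vid": v.vid,
        "type": type(v).__name__,
        "doors": getattr(v, "doors", None),  # NULL for boats
        "draft": getattr(v, "draft", None),  # NULL for cars
    }
```

The alternatives (one table per class, or one table per concrete class) just trade NULL-heavy rows for join-heavy queries, which is why none of them feels like a real solution.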
The true solution to the object-relational mismatch would be OODBMSes, which unfortunately didn't get much traction. A popular engine with native OO support is PostgreSQL, which is a hybrid OO/RDBMS. Another OODBMS is the Zope Object DB, which is written in Python and in a typical setup uses an RDBMS as the underlying engine.
An alternative approach is to implement more logic at the application or middleware level and use a NoSQL solution for the underlying storage.
Neither OODBMS nor NoSQL are "just a flat-file".
Today, most database management systems (e.g. PostgreSQL, MongoDB, etc.) internally keep their data inside OS files (in the past, some DBMSs used raw disk partitions directly).
On recent computers still using spinning hard disks, the disk is so slow - relative to the CPU or the RAM - that adding a few software layers costs nothing noticeable. SSD technology might change that a bit, and some file systems are optimized for SSDs.
Files are present in most OSes in general for historical and social reasons (in particular, C compilers and most tools - editors, linkers - want files, so there is a chicken and egg issue), and because there are a lot of very good file system implementations.
BTW, some essential system facilities can use databases. For example on Linux PAM can be configured to use information in databases (but this is rarely done in practice). Also, some mail servers may store some or most of their data in databases (e.g. Exim).
Files are slightly lower-level abstractions than databases, so they can be easier to implement (as with the file systems & VFS layer in the Linux kernel) and faster to use. In particular, the operations on files are much more restricted than those on databases. In fact, you could see files or file systems as some very restricted databases!
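As a throwaway illustration of "files as a very restricted database", here is a sketch (my own toy, not any real library) that uses a plain directory as a key-value store. You get put/get by name, but no secondary indexes, no queries, and no multi-key transactions; that restriction is exactly what makes the file abstraction simpler and cheaper than a full DBMS:

```python
import os
import tempfile

class FileKV:
    """A directory used as a (very restricted) key-value database:
    each key is a filename, each value is the file's contents."""
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, key, value):
        # One key = one file; the OS handles allocation and durability.
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(value)

    def get(self, key):
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

# Usage: store and retrieve a value by name, nothing more.
kv = FileKV(tempfile.mkdtemp())
kv.put("theme", b"dark")
```

Everything a real database adds (queries, indexes, transactions, schemas) is precisely what this abstraction deliberately leaves out.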
You could design an operating system without any files, but with some other orthogonal persistence machinery (e.g. having every process be persistent; then you don't care much explicitly about storage, since the OS is managing persistent resources). This has been done in several academic operating systems(1) (and also in the Smalltalk and Lisp machines of the 1980s, somehow in the IBM System i, a.k.a. AS/400, and in some toy projects linked from osdev), but when you design your OS this way you cannot leverage many existing tools (e.g. you also need to make your compiler and your user interface from scratch, and that is a lot of work).
Notice that microkernel operating systems might not need files provided by kernel layers, since the file systems are just application servers (e.g. Hurd translators running in userland). Look also at the unikernel approach in today's MirageOS.
Linux (and probably Windows, which got much of its inspiration from VMS & Unix) needs files to work. At the very least, the init program (the first program started by the kernel) must be an executable stored in a file (often /sbin/init, but it could be systemd these days), and (nearly) all other programs are started with the execve(2) syscall, so they must be stored in files. However, FUSE enables you to give file-like semantics to non-file things.
Notice also that on Linux (and perhaps even Windows, which I don't know and never used) sqlite is a library managing an SQL database in a file and providing an API for that. It is widely known that Android (a Linux variant) uses a lot of sqlite files (but it still does have a POSIX-like file system).
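sqlite is worth a two-line demo, because it is exactly the "database in a file" idea: the whole DB lives in one ordinary file, accessed through a library rather than a server process. A minimal sketch using Python's standard sqlite3 module (an in-memory database here, but passing a path instead of ":memory:" gives you the single-file database Android apps use):

```python
import sqlite3

# ":memory:" keeps the demo self-contained; a filename like
# "prefs.db" would put the entire database in one ordinary file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prefs (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO prefs VALUES (?, ?)", ("theme", "dark"))
row = conn.execute("SELECT value FROM prefs WHERE key = ?",
                   ("theme",)).fetchone()
```

No daemon, no socket, no configuration: the file (plus the library) is the whole database, which is why it sits so comfortably between the file and DBMS abstractions discussed here.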
Read also about application checkpointing (which, on many current OSes, is implemented by writing the process state to files). Pushed to the extreme, that approach removes the need to manually write application files at all: you only persist the entire process state using the checkpointing machinery.
Actually, the interesting question is why current operating systems still use files, and the answer is legacy, plus economic and cultural reasons (sadly, most programming languages and libraries today still want files).
Note 1: persistent academic OSes include Lisaac & Grasshopper, but these academic projects seem to be inactive. Look also into http://tunes.org/ ; it is inactive, but has hosted lots of discussion around such topics.
Note 2: the notion of file has changed widely over time (look at this answer about my first programming experiences): the first MSDOS on 1980s IBM PCs (no directories!), VMS - on 1978 Vaxen - (which had both fixed-record files and sequential files, with a primitive versioning system), and the 1970s mainframes (IBM/370 with OS/VS2 MVS) had very different notions of files and file systems (in particular because in their time the ratio of hard disk access time to core memory access time was a few thousand, so disks were relatively faster then than now; even though today's disks are absolutely faster than in the previous century, the CPU / disk speed ratio today is about a million - but we now have SSDs). Also, files are less useful (or even useless) when the memory is persistent (as on the CAB500 magnetic drum in the 1960s, or on future computers using MRAM).