Data structure for file search

cdata structuresnetworking

I've asked this question before and I got a few answers/idea, but I'm not sure how to implement them.

I'm building a telecom messaging solution. Currently, I'm using a database to save my transaction/messages for the network stack I've built, and as you know it's slower than using a data structure (hash, linkedlist, etc…). My problem is that the data can be really huge, and it won't fit in the memory.

I was thinking of saving the records in a file and the a key and line number in a hash, then if I want to access some record then I can get the line number from the hash, and get it from the file. I don't know how efficient is this; I think the database is doing a way better job than this on my behalf.

Please share whatever you have in mind.

Best Answer

An Inverted Index seems to be what you seek. The basic idea is:

The messages/files are saved on disk in their full structure.
Each word in the file has a row in a table in the database.
There's a link table that allows you to do a join between the words and a unique ID for each message/file.

For example, here's the 3 database tables as I would create:

CREATE TABLE words (
    word_id INT PRIMARY KEY,
    word TEXT
)
CREATE TABLE links (
    word_id INT FOREIGN KEY words(word_id),
    file_id INT FOREIGN KEY files(file_id),
)
CREATE TABLE files (
    file_id INT PRIMARY KEY,
    filename TEXT
)

To find all filenames that contain the words "help" and "me":

SELECT filename
FROM files
INNER JOIN links l1 ON (files.file_id = l1.file_id)
INNER JOIN words w1 ON (l1.word_id = w1.word_id)
INNER JOIN links l2 ON (files.file_id = l2.file_id)
INNER JOIN words w2 ON (l2.word_id = w2.word_id)
WHERE w1.word = "help" AND w2.word = "me"

That help?

Best Answer

Related Solutions

How to Design Data Transfer Between Client and Web API for Normalized Data

Option 1

Option 2

Recommendation

Caching Caveats

C++ – an efficient data structure for syntax highlighting in text editors

Edit: How to implement anchors

Related Topic