Zend: index generation and the pros and cons of Zend_Search_Lucene

zend-framework, zend-lucene, zend-search-lucene

I've never come across an app/class like Zend_Search_Lucene before, as I've always queried my database directly.

Zend_Search_Lucene operates with
documents as atomic objects for
indexing. A document is divided into
named fields, and fields have content
that can be searched.

A document is represented by the
Zend_Search_Lucene_Document class, and
objects of this class contain
instances of Zend_Search_Lucene_Field
that represent the fields on the
document.

It is important to note that any
information can be added to the index.
Application-specific information or
metadata can be stored in the document
fields, and later retrieved with the
document during search.

So this is basically saying that I can apply this to anything, including databases; the key thing is building indexes for searching.

What I'm trying to grasp is where exactly I should store the indexes in my application. For example, say we have phones stored in a database, with manufacturers and models: how should I categorize the indexes?

If I'm indexing users along with, say, their addresses, I obviously wouldn't want them to be publicly viewable. I'm just confused about how it all fits together, and whether there are known disadvantages or gotchas I should know about while using it.

Best Answer

A Lucene index is stored outside the database. I'd store it in a "data" directory as a sister to your controllers, models, and views. But you can store it anywhere; you just need to specify the path when you open the index for querying.
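As a rough sketch of the path handling (assuming Zend Framework 1 is on the include path; the `data/phone-index` location and the `openOrCreateIndex` helper are hypothetical names, not part of the library), opening the index could look something like this:

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Hypothetical location: a "data" directory next to this script.
$indexPath = dirname(__FILE__) . '/data/phone-index';

// Hypothetical helper: open the index at $path, or create it on first use.
function openOrCreateIndex($path)
{
    try {
        // open() throws if no index exists at the given path yet.
        return Zend_Search_Lucene::open($path);
    } catch (Zend_Search_Lucene_Exception $e) {
        return Zend_Search_Lucene::create($path);
    }
}

$index = openOrCreateIndex($indexPath);
```

The index is just a directory of files, so the same path is all the querying code needs later on.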

It's basically a redundant copy of the documents stored in your database, and you have to keep them in sync yourself. That's one of the disadvantages: you have to write code to populate the Lucene index based on results of a query against your database. As you add data to the database, you have to update your Lucene index as well.
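To give an idea of what that sync code might look like, here is a minimal sketch. It assumes Zend Framework 1 on the include path; `syncPhone`, the field names, the sample rows, and the temp-directory path are all made up for illustration. Zend_Search_Lucene has no in-place update, so a changed row is handled by deleting the old document and adding a new one:

```php
<?php
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Index/Term.php';
require_once 'Zend/Search/Lucene/Search/Query/Term.php';

// Hypothetical helper: call it whenever a phone row is inserted or updated.
function syncPhone($index, array $row)
{
    $key = (string) $row['id'];

    // Delete any document already indexed for this primary key.
    $term  = new Zend_Search_Lucene_Index_Term($key, 'phone_id');
    $query = new Zend_Search_Lucene_Search_Query_Term($term);
    foreach ($index->find($query) as $hit) {
        // $hit->id is Lucene's internal document number, not the database key.
        $index->delete($hit->id);
    }

    // Re-add the row as a fresh document.
    $doc = new Zend_Search_Lucene_Document();
    // Keyword: stored and indexed, but not tokenized.
    $doc->addField(Zend_Search_Lucene_Field::Keyword('phone_id', $key));
    // Text: stored, indexed, and tokenized for full-text search.
    $doc->addField(Zend_Search_Lucene_Field::Text('manufacturer', $row['manufacturer']));
    $doc->addField(Zend_Search_Lucene_Field::Text('model', $row['model']));
    $index->addDocument($doc);
    $index->commit();
}

// Hypothetical rows, standing in for what a database query would return.
$index = Zend_Search_Lucene::create(sys_get_temp_dir() . '/phone-index-sync-demo');
syncPhone($index, array('id' => 1, 'manufacturer' => 'Motorola', 'model' => 'Razr'));
syncPhone($index, array('id' => 1, 'manufacturer' => 'Motorola', 'model' => 'Droid'));
```

The key field is deliberately called `phone_id` rather than `id`, because `id` on a search hit already refers to the internal document number.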

An advantage of using an external full-text index solution is that you can reduce the workload on your RDBMS. To find a document, you execute a search using the Lucene API. Each document should include a field containing the primary key value (stored as part of the document, but it doesn't need to be analyzed for full-text search). You get this field back when you do a Lucene search, so you can look up the respective row in the database.
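A small sketch of that round trip, again assuming Zend Framework 1 on the include path. The index path, the `phone_id` field, the sample data, and the `phones` table in the comment are illustrative assumptions:

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Build a tiny throwaway index so the example is self-contained.
$index = Zend_Search_Lucene::create(sys_get_temp_dir() . '/phone-index-search-demo');
$rows = array(
    array('id' => 1, 'manufacturer' => 'Motorola', 'model' => 'Razr'),
    array('id' => 2, 'manufacturer' => 'Nokia', 'model' => 'Communicator'),
);
foreach ($rows as $row) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Keyword('phone_id', (string) $row['id']));
    $doc->addField(Zend_Search_Lucene_Field::Text('manufacturer', $row['manufacturer']));
    $doc->addField(Zend_Search_Lucene_Field::Text('model', $row['model']));
    $index->addDocument($doc);
}
$index->commit();

// Search the index, collecting the stored primary keys from the hits.
$ids = array();
foreach ($index->find('nokia') as $hit) {
    $ids[] = (int) $hit->phone_id;
}

// $ids can now drive an ordinary primary-key lookup in the database, e.g.
// SELECT * FROM phones WHERE id IN (...), with the ids bound as parameters.
```

The full-text matching happens entirely in Lucene; the database only sees a cheap lookup by primary key.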

Does that help answer your question?

I gave a presentation recently for MySQL University comparing full-text search solutions: http://forge.mysql.com/wiki/Practical_Full-Text_Search_in_MySQL

I also publish my slides at http://www.SlideShare.net/billkarwin.
