Syncing Lucene.net indexes across multiple app servers

lucene, lucene.net

We are designing the search architecture for a corporate web application, and we'll be using Lucene.net for it. The indexes will not be big (about 100,000 documents), but the search service must always be up and always be up to date. New documents will be added to the index all the time, alongside concurrent searches.
Since the search system must be highly available, we have 2 application servers which expose a WCF service to perform searches and indexing (a copy of the service runs on each server). The service then uses the Lucene.net API to access the indexes.

The problem is, what would be the best solution to keep the indexes synced all the time? We have considered several options:

  • Using one server for indexing and having the 2nd server access the indexes via SMB: no can do because we have a single point of failure situation;

  • Indexing to both servers, essentially writing every index twice: probably lousy performance, and a possibility of desync if, e.g., server 1 indexes OK and server 2 runs out of disk space or whatever;

  • Using Solr or Katta to wrap access to the indexes: nope, we cannot have Tomcat or similar running on the servers, we only have IIS.

  • Storing the index in a database: I found this can be done with the Java version of Lucene (JdbcDirectory module), but I couldn't find anything similar for Lucene.net. Even if it meant a small performance hit, we'd go for this option because it'd cleanly solve the concurrency and syncing problem with minimum development.

  • Using the Lucene.net DistributedSearch contrib module: I couldn't find a single link with documentation about this. Even from looking at the code I can't tell what it does, but it seems to me that it actually splits the index across multiple machines, which is not what we want.

  • rsync and friends, copying the indexes back and forth between the 2 servers: this feels hackish and error-prone to us, and if the indexes grow big the copy might take a while. During that period we would be returning either corrupt or inconsistent data to clients, so we'd have to develop some ad hoc locking policy, which we don't want to do.

I understand this is a complex problem, but I'm sure lots of people have faced it before. Any help is welcome!

Best Answer

It seems that the best solution would be to index the documents on both servers into their own copy of the index.

If you are worried about indexing succeeding on one server and failing on the other, then you'll need to keep track of the success or failure for each server so that you can retry the failed documents once the problem is resolved. This tracking would be done outside of Lucene, in whatever system you are using to present the documents to be indexed to Lucene. Depending on how critical the completeness of the index is to you, you may also have to remove the failed server from whatever load balancer you are using until the problem has been fixed and indexing has reprocessed any outstanding documents.
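To make the tracking idea concrete, here is a minimal, language-neutral sketch written in Python (the same shape would carry over to C#). It makes no Lucene.net calls. The `IndexTracker` class, the per-server indexer callables, and the server names are all hypothetical, and a real system would keep the pending list in durable storage rather than in memory.

```python
from typing import Callable, Dict

# Hypothetical signature: an indexer receives (doc_id, document) and raises
# an exception if the document could not be written to that server's index.
Indexer = Callable[[str, dict], None]


class IndexTracker:
    """Sends each document to every server and remembers what failed where."""

    def __init__(self, indexers: Dict[str, Indexer]):
        self.indexers = dict(indexers)
        # server name -> {doc_id: document} still waiting to be indexed there.
        # Assumption: kept in memory here; use a database or queue in practice.
        self.pending: Dict[str, Dict[str, dict]] = {
            name: {} for name in self.indexers
        }

    def index(self, doc_id: str, document: dict) -> None:
        """Try to index the document on every server independently."""
        for server in self.indexers:
            self._attempt(server, doc_id, document)

    def _attempt(self, server: str, doc_id: str, document: dict) -> None:
        try:
            self.indexers[server](doc_id, document)
        except Exception:
            # Record the failure so the document can be retried later.
            self.pending[server][doc_id] = document
        else:
            # Success: clear any earlier failure recorded for this document.
            self.pending[server].pop(doc_id, None)

    def retry(self, server: str) -> int:
        """Re-send outstanding documents; return how many are still pending."""
        for doc_id, document in list(self.pending[server].items()):
            self._attempt(server, doc_id, document)
        return len(self.pending[server])

    def in_sync(self, server: str) -> bool:
        """True when the server has no outstanding documents."""
        return not self.pending[server]
```

A health check used by the load balancer could consult `in_sync(server)` and take a server out of rotation while it still has outstanding documents, then put it back once `retry(server)` returns 0.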
