Should an index be optimised after incremental indexes in Lucene?

lucene optimization

We run full re-indexes every 7 days (i.e. creating the index from scratch) on our Lucene index and incremental indexes every 2 hours or so. Our index has around 700,000 documents and a full index takes around 17 hours (which isn't a problem).

When we do incremental indexes, we only index content that has changed in the past two hours, so it takes much less time – around half an hour. However, we've noticed that a lot of this time (maybe 10 minutes) is spent running the IndexWriter.optimize() method.

The LuceneFAQ mentions that:

The IndexWriter class supports an optimize() method that compacts the index database and speeds up queries. You may want to use this method after performing a complete indexing of your document set or after incremental updates of the index. If your incremental update adds documents frequently, you want to perform the optimization only once in a while to avoid the extra overhead of the optimization.

…but this doesn't define what "frequently" means. Optimizing is CPU-intensive and VERY IO-intensive, so we'd rather not do it if we can get away without it. How big is the hit when running queries on an un-optimized index (I'm thinking especially of query performance after a full re-index compared with after 20 incremental indexes where, say, 50,000 documents have changed)? Should we be optimizing after every incremental index, or is the query performance gain not worth the cost?

Best Answer

Mat, since you seem to have a good idea how long your current process takes, I suggest that you remove the optimize() and measure the impact.

Do many of the documents change in those 2-hour windows? If only a small fraction are incrementally re-indexed (50,000/700,000 is about 7%, and in your example that is spread over 20 runs), then I don't think you are getting much value out of an optimize().

Some ideas:

  • Don't do an incremental optimize() at all. My experience says you are not seeing a huge query improvement anyway.
  • Do the optimize() daily instead of 2-hourly.
  • Do the optimize() during low-volume times (which is what the javadoc says).
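
The second and third ideas can be combined into one small check that each incremental run asks before deciding whether to optimize. This is only a sketch under my own assumptions: OptimizePolicy, its parameters and the "quiet hours" window are hypothetical names, not part of Lucene, and your indexing job would still call IndexWriter.optimize() itself when the check returns true.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper (not a Lucene class): should this incremental run also optimize? */
final class OptimizePolicy {
    private final Duration minInterval; // minimum time between two optimizes
    private final int quietStartHour;   // inclusive, 0-23
    private final int quietEndHour;     // exclusive, 0-23; start == end means an empty window

    OptimizePolicy(Duration minInterval, int quietStartHour, int quietEndHour) {
        this.minInterval = minInterval;
        this.quietStartHour = quietStartHour;
        this.quietEndHour = quietEndHour;
    }

    /** True only if enough time has passed AND we are inside the low-volume window. */
    boolean shouldOptimize(Instant lastOptimize, Instant now, int hourOfDay) {
        boolean due = Duration.between(lastOptimize, now).compareTo(minInterval) >= 0;
        boolean quiet = quietStartHour <= quietEndHour
                ? hourOfDay >= quietStartHour && hourOfDay < quietEndHour
                : hourOfDay >= quietStartHour || hourOfDay < quietEndHour; // window wraps past midnight
        return due && quiet;
    }
}
```

With 2-hourly runs, an interval a little under 24 hours (say 20) keeps the optimize landing in the same nightly window instead of drifting a run later each day.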

And make sure you take measurements. These kinds of changes can be a shot in the dark without them.
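
For the measuring part, something as simple as the following is usually enough. It is a rough sketch, not a proper benchmark harness: QueryTimer is a hypothetical name, and the Runnable stands in for whatever search call you normally make against the index (I have left Lucene out of the code so it does not depend on your version). Run it with the same set of queries once right after a full re-index and once after a day of incrementals without optimize(), then compare the medians and the tail.

```java
import java.util.Arrays;

/** Hypothetical timing helper: measures how long a query takes over repeated runs. */
final class QueryTimer {

    /** Runs the query `warmup` times unmeasured, then `runs` times measured; returns sorted latencies in nanoseconds. */
    static long[] measure(Runnable query, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            query.run(); // let caches and the JIT settle before timing
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        return nanos;
    }

    /** Nearest-rank percentile of an already sorted, non-empty array; p in (0, 100]. */
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

Looking at the 95th percentile as well as the median matters here, because an un-optimized index tends to hurt the slow queries first.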
