How to implement an IFilter for indexing heavyweight formats

ifilter search sharepoint

I need to develop an IFilter for Microsoft Search Server 2008 that performs prolonged computations to extract text. Extracting text from one file can take anywhere from 5 seconds to 12 hours. How can I design such an IFilter so that the filter daemon doesn't reset it on timeout, while other IFilters can still be reset on timeout if they hang?

Best Answer

12 hours, wow!

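If it takes that long and there are many files, your best option would be to create a pre-processing application that extracts the text ahead of time and makes it available for the IFilter to read. That way the IFilter itself only has to return text that already exists, so it finishes quickly and never hits the timeout.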
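As a rough illustration of the pre-processing idea, here is a minimal Python sketch of a batch job that runs outside the crawler and stores extracted text in a cache folder. Everything in it is an assumption for illustration: the names `cache_path_for` and `preprocess`, the cache layout (one `.txt` file named after a hash of the source path), and the `extract_text` callback that stands in for your slow extraction routine. How the IFilter would look up the cached file is not shown.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path_for(source: Path, cache_dir: Path) -> Path:
    """Map a source file to a stable cache file name (hash of its absolute path)."""
    digest = hashlib.sha256(str(source.resolve()).encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.txt"


def preprocess(source_dir: Path, cache_dir: Path,
               extract_text: Callable[[Path], str]) -> int:
    """Extract text for every new or changed file; return how many were processed.

    cache_dir is assumed to live outside source_dir.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    processed = 0
    for source in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        target = cache_path_for(source, cache_dir)
        # Skip files whose cached text is at least as new as the source.
        if target.exists() and target.stat().st_mtime >= source.stat().st_mtime:
            continue
        text = extract_text(source)  # hypothetical slow step, possibly hours
        # Write to a temporary file first so a reader never sees partial text.
        tmp = target.with_suffix(".tmp")
        tmp.write_text(text, encoding="utf-8")
        tmp.replace(target)
        processed += 1
    return processed
```

Run something like this on a schedule, independently of the crawl. Because unchanged files are skipped, only new or modified documents pay the long extraction cost.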

Another option would be to create HTML summaries of the documents and instruct the crawler to index those instead. The summary page could easily link to the document itself if necessary.
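For the HTML summary route, a small generator along these lines might be enough. This is only a sketch: the function name `build_summary_html`, the `max_chars` cut-off, and the page layout are made up for illustration, and the text passed in is assumed to come from whatever extraction step you already have.

```python
from html import escape


def build_summary_html(title: str, extracted_text: str,
                       document_url: str, max_chars: int = 2000) -> str:
    """Return a small HTML page with a text summary and a link to the original."""
    summary = extracted_text[:max_chars]  # hypothetical cut-off for the summary
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"<h1>{escape(title)}</h1>\n"
        f"<p>{escape(summary)}</p>\n"
        # Escape the URL as an attribute value so quotes cannot break the markup.
        f'<p><a href="{escape(document_url, quote=True)}">Original document</a></p>\n'
        "</body>\n"
        "</html>\n"
    )
```

Write one such page per document into a folder the crawler is pointed at. Search hits then land on the summary, and the link takes the user to the real file.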