Html – How to index HTML files into Apache SOLR

apachehtmlindexinginverted-indexsolr

By default SOLR accepts XML files, I want to perform search on millions of crawled URLS (html).

Best Answer

Usually, the first step I would recommend rolling your own application using SolrJ or similar to handle the indexing, and not do it directly with the DataImportHandler.

Just write your application and have that output the contents of those web pages as a field in a SolrInputDocument. I recommend stripping the HTML in that application, because it gives you greater control. Besides, you probably want to get at some of the data inside that pag, such as <title>, and index it to a different field. An alternative is to use HTMLStripTransformer on one of your fields to make sure it strips HTML out of anything that you send to that field.

How are you crawling all this data? If you're using something like Apache Nutch it should already take care of most of this for you, allowing you to just plug in the connection details of your Solr server.