Html – How to index HTML files into Apache SOLR

Tags: apache, html, indexing, inverted-index, solr

By default, Solr accepts XML files. I want to run searches over millions of crawled URLs (HTML pages). How can I index them?

Best Answer

As a first step, I would recommend rolling your own application, using SolrJ or a similar client, to handle the indexing rather than doing it directly with the DataImportHandler.

Just write your application and have it output the contents of those web pages as a field in a SolrInputDocument. I recommend stripping the HTML in that application, because it gives you greater control. Besides, you probably want to get at some of the data inside each page, such as the <title>, and index it into a separate field. An alternative is to use HTMLStripTransformer on one of your fields, so that HTML is stripped out of anything you send to that field.
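As a rough illustration of the approach described above, here is a minimal Python sketch (standing in for SolrJ, using only the standard library) that strips the HTML, pulls out the <title>, and sends the result to Solr's JSON update endpoint. The field names id, title, and content, the core name pages, and the localhost URL are assumptions, not something the question specifies; adjust them to your own schema and server.

```python
import json
import urllib.request
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collects the <title> text and the visible text of an HTML page."""

    SKIPPED_TAGS = {"script", "style"}  # contents are code/CSS, not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0:
            self.text_parts.append(data)


def _collapse(parts):
    # Join fragments with a space, then squeeze runs of whitespace.
    # (This may split a word that is interrupted by an inline tag.)
    return " ".join(" ".join(parts).split())


def build_solr_doc(url, html):
    """Turn one crawled page into a dict shaped like a Solr document.

    The field names (id, title, content) are hypothetical and must
    match whatever your Solr schema actually defines.
    """
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": url,
        "title": _collapse(parser.title_parts),
        "content": _collapse(parser.text_parts),
    }


def post_to_solr(docs,
                 update_url="http://localhost:8983/solr/pages/update?commit=true"):
    """POST a list of documents as JSON; the URL and core name are assumed."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With millions of pages you would probably call post_to_solr with batches of documents and commit far less often than once per request; the commit=true parameter is only there to keep the sketch short.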

How are you crawling all this data? If you're using something like Apache Nutch, it should already take care of most of this for you, so you only need to plug in the connection details of your Solr server.

Related Solutions

Python – How to remove an element from a list by index

Use del and specify the index of the element you want to delete:

>>> a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> del a[-1]
>>> a
[0, 1, 2, 3, 4, 5, 6, 7, 8]

del also supports slices:

>>> del a[2:4]
>>> a
[0, 1, 4, 5, 6, 7, 8]

See the section on the del statement in the Python tutorial.
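As a small sketch of the same idea outside the interactive prompt, the script below applies each deletion to a fresh copy of the list, so the results do not depend on one another. The variable names are only illustrative. It also shows that deleting a single index that does not exist raises IndexError.

```python
digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Remove a single element by index (here, the last one).
without_last = digits.copy()
del without_last[-1]

# Remove a slice: indices 2 and 3.
without_slice = digits.copy()
del without_slice[2:4]

# Deleting a single out-of-range index raises IndexError.
try:
    del digits[10]
    out_of_range_raised = False
except IndexError:
    out_of_range_raised = True

print(without_last)
print(without_slice)
print(out_of_range_raised)
```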

Javascript – How to move an element into another element

You may want to use jQuery's appendTo function (which adds the element to the end of the destination element):

$("#source").appendTo("#destination");

Alternatively, you could use the prependTo function (which adds the element to the beginning of the destination element):

$("#source").prependTo("#destination");

Example:

$("#appendTo").click(function() {
  $("#moveMeIntoMain").appendTo($("#main"));
});
$("#prependTo").click(function() {
  $("#moveMeIntoMain").prependTo($("#main"));
});

#main {
  border: 2px solid blue;
  min-height: 100px;
}

.moveMeIntoMain {
  border: 1px solid red;
}

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="main">main</div>
<div id="moveMeIntoMain" class="moveMeIntoMain">move me to main</div>

<button id="appendTo">appendTo main</button>
<button id="prependTo">prependTo main</button>
