First of all we have to define a "write" operation. A write operation obtains a lock when it starts and holds it until you close the object performing the work. For example, creating an IndexWriter and indexing a document obtains the lock, and the lock is held until you close the IndexWriter.
Now we can talk about the lock a little bit. The lock that is obtained is file based. Like mythz mentioned earlier, a file called 'write.lock' is created. Once a write lock is obtained it is exclusive! It causes all index-modifying operations (IndexWriter, and some methods on IndexReader) to wait until the lock is released.
Overall, you can have multiple readers on an index. You can even read and write at the same time, no problem. But there is a problem with multiple writers: if one thread waits for the lock too long, it will time out.
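To see that exclusivity in action without Lucene, here's a minimal sketch using the JDK's own file locking (java.nio.channels.FileLock) to model what write.lock does; the temp file and the two "writers" are just stand-ins for two IndexWriter instances:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WriteLockDemo {
    public static void main(String[] args) throws IOException {
        Path lockFile = Files.createTempFile("index", ".lock");
        try (FileChannel writer1 = FileChannel.open(lockFile,
                StandardOpenOption.WRITE)) {
            // First "writer" obtains the exclusive lock.
            FileLock lock = writer1.tryLock();
            System.out.println("writer1 holds lock: "
                    + (lock != null && lock.isValid()));

            try (FileChannel writer2 = FileChannel.open(lockFile,
                    StandardOpenOption.WRITE)) {
                boolean blocked;
                try {
                    // Second "writer" is refused while the first holds the lock.
                    FileLock second = writer2.tryLock();
                    blocked = (second == null);
                } catch (OverlappingFileLockException e) {
                    // Within one JVM the overlap is reported as an exception.
                    blocked = true;
                }
                System.out.println("writer2 blocked: " + blocked);
            }
            lock.release(); // the equivalent of closing the IndexWriter
        }
        Files.deleteIfExists(lockFile);
    }
}
```

The second writer stays blocked until the first releases the lock, which is exactly the behavior that causes the timeout described above.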
1) Direct Operations
If you are sure that your indexing operations are short and quick, you may be able to just share the same index directly. Otherwise you will have to think about how to organize the indexing operations across the applications.
2) Web Service
Since you are working with a web solution, it might be possible to create a web service. When implementing it I would dedicate a single worker thread to indexing and create a work queue to hold the jobs; if the queue contains multiple jobs, the worker should grab them all and process them as one batch. This solves the multiple-writer problem, since only one thread ever opens an IndexWriter.
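A minimal sketch of that worker-queue idea using only the JDK's BlockingQueue; the doc names and the printed batch size are stand-ins for real indexing work:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class IndexingQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> workQueue = new LinkedBlockingQueue<>();

        // Web requests enqueue documents instead of touching the index directly.
        workQueue.put("doc-1");
        workQueue.put("doc-2");
        workQueue.put("doc-3");

        // The single indexing worker drains everything that is waiting and
        // commits it as one batch, so only one writer ever exists.
        Thread worker = new Thread(() -> {
            List<String> batch = new ArrayList<>();
            workQueue.drainTo(batch); // grab all pending jobs at once
            if (!batch.isEmpty()) {
                // A real worker would open one IndexWriter here, add every
                // document in the batch, and close it; we just report the size.
                System.out.println("indexed batch of " + batch.size());
            }
        });
        worker.start();
        worker.join();
    }
}
```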
3) Create Another Index, Then Merge
If the console application does heavy work on the index, you could have it build a separate index and then merge the indexes at some safe, scheduled time using IndexWriter.AddIndexes.
From here you can do this in two ways: merge directly into the live index, or merge into a third index and, when it is ready, swap it in for the original. You have to be careful here as well to make sure you're not locking an index in heavy use and causing a timeout for other write operations.
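The "build a third index, then swap it in" step can be sketched with plain directory renames; the directory names here are assumptions, and in real code the merged directory would first be produced by IndexWriter.AddIndexes:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IndexSwapDemo {
    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("indexes");
        Path live = root.resolve("live-index");
        Path merged = root.resolve("merged-index");
        Files.createDirectory(live);
        Files.createDirectory(merged);
        // Stand-in for the segment files AddIndexes would have written.
        Files.writeString(merged.resolve("segments"), "merged segments");

        // Retire the old index, then promote the freshly merged one.
        Path retired = root.resolve("old-index");
        Files.move(live, retired);
        Files.move(merged, live);

        System.out.println("live now merged: "
                + Files.exists(live.resolve("segments")));
    }
}
```

Because the swap is just a rename on the same filesystem, the window where searchers could see a half-built index is tiny compared to merging into the live index directly.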
4) Index & Search Multiple Indexes
Personally, I think people should separate their indexes. This separates the responsibilities of the programs and minimizes the downtime and maintenance of having a single index as a single point for everything. For example, if your console application is only responsible for adding certain fields, or is essentially extending an index, you could split the indexes apart but maintain identity by giving each document an ID field. With this you can take advantage of the built-in support for searching multiple indexes via the MultiSearcher class. Or, if you want, there is also a nice ParallelMultiSearcher class that searches the indexes concurrently.
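The ID-based join across separate indexes can be sketched like this; plain maps stand in for the two Lucene indexes, and MultiSearcher would do the actual searching:

```java
import java.util.HashMap;
import java.util.Map;

public class MultiIndexDemo {
    public static void main(String[] args) {
        // Two separate "indexes" keyed by a shared ID field, standing in
        // for the documents each application maintains independently.
        Map<String, String> coreIndex = new HashMap<>();
        Map<String, String> extraIndex = new HashMap<>();
        coreIndex.put("42", "title: Lucene in Action");
        extraIndex.put("42", "tags: search, indexing");

        // A MultiSearcher-style query consults both indexes and joins the
        // matching documents on the shared ID field.
        String id = "42";
        String combined = coreIndex.get(id) + " | " + extraIndex.get(id);
        System.out.println(combined);
    }
}
```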
5) Look Into Solr
Something else that can help with maintaining a single place for your index: you could change your program to work with a Solr server (http://lucene.apache.org/solr/). There is also a nice library, SolrNet (http://code.google.com/p/solrnet/), that can be helpful in this situation. I'm not experienced with Solr, but I'm under the impression it will help you manage situations such as this. It also has other benefits, such as hit highlighting, finding related items with "MoreLikeThis", and spell checking.
I'm sure there are other methods, but these are all the ones I can think of. Overall, your solution depends on how many writers you have and how up to date the search index needs to be. In any situation, if you can defer some operations to a later time and batch them, you will get the best performance. My suggestion is to understand what you're able to work with and go from there. Good luck!
Yep, use a source filter. If you're searching with JSON it'll look something like this:
{
"_source": ["user", "message", ...],
"query": ...,
"size": ...
}
In ES 2.4 and earlier, you could also use the fields option to the search API:
{
"fields": ["user", "message", ...],
"query": ...,
"size": ...
}
This is deprecated in ES 5+. And source filters are more powerful anyway!
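For completeness, here's a hedged sketch of building that source-filtered search request from code, using only the JDK's HTTP client; the index name my-index, the cluster URL, and the match_all query are assumptions, and the request is constructed but not sent:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class SourceFilterDemo {
    public static void main(String[] args) {
        // The body is exactly the source-filtered search shown above,
        // with a placeholder query and size filled in.
        String body = """
                {
                  "_source": ["user", "message"],
                  "query": { "match_all": {} },
                  "size": 10
                }""";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/my-index/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        System.out.println(request.method() + " " + request.uri());
    }
}
```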
Lucene
Lucene, and its .NET port Lucene.Net, is a search engine library for supporting full-text search in an application; it builds an inverted index based on the Documents (and the fields within each Document) that you feed it. An example of this is search within the NuGet Gallery source, where a NuGet package and its properties are converted to a Document and passed to Lucene. The inverted index is stored across files within a directory.
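A toy version of the inverted-index idea, assuming two tiny one-field documents (Lucene's real index also stores term positions, norms, and much more):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexDemo {
    public static void main(String[] args) {
        // Documents with an id and a text field, like Lucene Documents.
        Map<Integer, String> docs = Map.of(
                1, "full text search",
                2, "text indexing with lucene");

        // Build the inverted index: each term maps to the ids of the
        // documents that contain it.
        Map<String, Set<Integer>> index = new TreeMap<>();
        docs.forEach((id, text) -> {
            for (String term : text.split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        });

        // Look up every document containing the term "text".
        System.out.println(index.get("text"));
    }
}
```

Querying is then a lookup by term rather than a scan of every document, which is what makes full-text search fast.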
Elasticsearch
Elasticsearch is a distributed search engine that uses Lucene under the covers. An Elasticsearch cluster can be made up of one or more nodes; each node can contain a number of shards and replicas, and each shard is a complete Lucene index. This infrastructure enables fast performance and allows horizontal scaling for search across a large amount of data, since you are no longer limited by the constraints of a single Lucene index on a single machine. In addition, you can achieve high availability with fault tolerance and disaster recovery, since data can be replicated across shards, meaning there is no single point of failure. An example of Elasticsearch with NEST is up on my blog.
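The shard routing that distributes documents across nodes can be sketched roughly like this; note this is a simplification under assumed names, since Elasticsearch actually hashes the routing key with murmur3 rather than Java's hashCode:

```java
public class ShardRoutingDemo {
    public static void main(String[] args) {
        int numberOfShards = 5;
        // A document is routed to a shard by hashing its routing key
        // (the document id by default) modulo the shard count, so every
        // node knows where a given id lives without a central lookup.
        String docId = "user-42";
        int shard = Math.floorMod(docId.hashCode(), numberOfShards);
        System.out.println("document lands on shard " + shard);
    }
}
```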
Which to use?
Well, it depends on your use case (it nearly always does, right?). If your application is one that gets installed onto a machine and all data is persisted locally, you might decide to use the Lucene library within the application and persist the index directory to local disk. Similarly, if you have a simple web application that runs on a single server with a small number of users, using Lucene may also be a sensible choice. On the other hand, if your application runs across multiple machines in a web farm and requires search capabilities, going with a distributed search engine like Elasticsearch would be a good idea.
How well does Elasticsearch scale? Back in 2013, GitHub was using Elasticsearch to index 2 billion documents, i.e. all the code files in every repository on the site, across 44 separate Amazon EC2 instances, each with two terabytes of ephemeral SSD storage, for a total of 30 terabytes of primary data. Stack Overflow also uses Elasticsearch to power search on this site (perhaps a dev could comment with some figures/metrics?)