The difference between Lucene and Elasticsearch

elasticsearch, lucene

I know Elasticsearch is built on Apache Lucene, but I want to know the significant differences between the two.

Best Answer

Lucene is a Java library. You include it in your project and use its features through direct function calls.

Elasticsearch is a JSON-based, distributed web server built on top of Lucene. Although Lucene does the actual work underneath, Elasticsearch provides a convenient layer over it. Each shard created in Elasticsearch is a separate Lucene instance. To summarize:

Elasticsearch is built on Lucene and exposes Lucene's features through a JSON-based REST API.
Elasticsearch provides a distributed system on top of Lucene. Lucene itself is not aware of, or built for, distribution; Elasticsearch supplies that abstraction.
Elasticsearch provides supporting features such as thread pools, queues, node and cluster monitoring APIs, data monitoring APIs, and cluster management.
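
To make the "JSON-based REST API" point concrete, here is a minimal Python sketch that builds (but does not send) a search request. The address `localhost:9200`, the index name `articles`, and the field `title` are assumptions made up for this example; only the `_search` endpoint and the `match` query shape come from Elasticsearch itself.

```python
import json
import urllib.request

# Assumed values for illustration: a local node and a hypothetical index.
ES_URL = "http://localhost:9200"
INDEX = "articles"


def build_search_request(field, text):
    """Build an HTTP request for a match query; nothing is sent here."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("title", "lucene")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With plain Lucene, the equivalent search would be a Java method call inside your own process rather than an HTTP request to a server.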

Related Solutions

Shards and replicas in Elasticsearch

I'll try to explain with a real example since the answer and replies you got don't seem to help you.

When you download elasticsearch and start it up, you create an elasticsearch node which tries to join an existing cluster if available or creates a new one. Let's say you created your own new cluster with a single node, the one that you just started up. We have no data, therefore we need to create an index.

When you create an index (an index is also created automatically when you index the first document), you can define how many shards it will be composed of. If you don't specify a number, it will have the default number of shards: 5 primaries in this example (that was the default before version 7.0; newer versions default to 1). What does that mean?

It means that elasticsearch will create 5 primary shards that will contain your data:

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|
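
If you would rather set the shard counts explicitly than rely on defaults, the index creation request carries them in its settings. The sketch below only builds the JSON body in Python; the index name mentioned in the comment is hypothetical, while `number_of_shards` and `number_of_replicas` are the actual setting names.

```python
import json


def index_settings(primaries=5, replicas=1):
    """Return the JSON body for creating an index with explicit shard counts."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,
                "number_of_replicas": replicas,
            }
        }
    }


# Body for a request such as `PUT /my-index` (the index name is hypothetical).
print(json.dumps(index_settings(), indent=2))
```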

Every time you index a document, elasticsearch will decide which primary shard is supposed to hold that document and will index it there. Primary shards are not a copy of the data, they are the data! Having multiple shards does help take advantage of parallel processing on a single machine, but the whole point is that if we start another elasticsearch instance on the same cluster, the shards will be distributed evenly over the cluster.
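
The decision about which primary shard holds a document can be pictured as hashing a routing value (by default the document id) and taking it modulo the number of primaries. The following is only a sketch of that idea: `zlib.crc32` stands in for the hash, and elasticsearch's real hash function and formula differ in detail. Note that elasticsearch numbers shards from 0, whereas the diagrams here count from 1.

```python
import zlib


def pick_primary_shard(routing_value, num_primaries):
    """Map a routing value (by default the document id) to a shard number.

    zlib.crc32 is a stand-in hash for this sketch; elasticsearch uses its
    own hash function internally.
    """
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_primaries  # shard numbers start at 0


for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> shard", pick_primary_shard(doc_id, 5))
```

Because the result depends on the number of primaries, the same document would map to a different shard if that number changed.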

Node 1 will then hold for example only three shards:

 ____    ____    ____ 
| 1  |  | 2  |  | 3  |
|____|  |____|  |____|

Since the remaining two shards have been moved to the newly started node:

 ____    ____
| 4  |  | 5  |
|____|  |____|

Why does this happen? Because elasticsearch is a distributed search engine, and this way you can make use of multiple nodes/machines to manage large amounts of data.

Every elasticsearch index is composed of at least one primary shard since that's where the data is stored. Every shard comes at a cost, though, therefore if you have a single node and no foreseeable growth, just stick with a single primary shard.

Another type of shard is a replica. The default is 1 replica, meaning that every primary shard is copied to another shard that contains the same data. Replicas are used to increase search performance and for fail-over. A replica shard is never allocated on the same node as its primary (that would be pretty much like putting a backup on the same disk as the original data).

Back to our example, with 1 replica we'll have the whole index on each node, since 2 replica shards will be allocated on the first node and they will contain exactly the same data as the primary shards on the second node:

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4R |  | 5R |
|____|  |____|  |____|  |____|  |____|

Same for the second node, which will contain a copy of the primary shards on the first node:

 ____    ____    ____    ____    ____
| 1R |  | 2R |  | 3R |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|
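
The two-node layout in the diagrams can be reproduced with a small Python sketch of the placement rule "a replica never goes on the node that holds its primary". The function name and the least-loaded tie-break are assumptions for this illustration; the real allocator weighs many more factors.

```python
def allocate(primary_nodes, nodes):
    """Place one replica per primary on a node that does not hold that primary.

    primary_nodes maps shard number -> node holding the primary.
    Returns node -> list of shard labels ("4" is a primary, "4R" a replica).
    """
    layout = {node: [] for node in nodes}
    for shard, node in primary_nodes.items():
        layout[node].append(str(shard))
    for shard, node in primary_nodes.items():
        # Candidate nodes exclude the one that already holds the primary.
        candidates = [n for n in nodes if n != node]
        if not candidates:
            continue  # no eligible node: the replica stays unassigned
        target = min(candidates, key=lambda n: len(layout[n]))
        layout[target].append(f"{shard}R")
    return layout


primaries = {1: "node1", 2: "node1", 3: "node1", 4: "node2", 5: "node2"}
layout = allocate(primaries, ["node1", "node2"])
for node, shards in layout.items():
    print(node, shards)
```

With a single node, every replica is skipped, which is the unassigned-replica situation described for a node failure.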

With a setup like this, if a node goes down, you still have the whole index. The replica shards will automatically become primaries and the cluster will work properly despite the node failure, as follows:

 ____    ____    ____    ____    ____
| 1  |  | 2  |  | 3  |  | 4  |  | 5  |
|____|  |____|  |____|  |____|  |____|

Since you have "number_of_replicas":1, the replicas cannot be assigned anymore as they are never allocated on the same node where their primary is. That's why you'll have 5 unassigned shards, the replicas, and the cluster status will be YELLOW instead of GREEN. No data loss, but it could be better as some shards cannot be assigned.
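
The failure scenario can be walked through in a short Python sketch: drop one node, promote surviving replicas, then count the copies that can no longer be assigned. The data model and function name are made up for this illustration. The RED status, not discussed in this answer, is what the cluster reports when a primary itself is unassigned.

```python
def after_failure(copies, failed_node, replicas_per_shard=1):
    """Simulate losing a node.

    copies is a list of (shard, role, node) tuples, role being "P" or "R".
    Returns (surviving copies after promotion, unassigned count, status).
    """
    survivors = [c for c in copies if c[2] != failed_node]
    promoted, unassigned, lost_primary = [], 0, False
    for shard in sorted({c[0] for c in copies}):
        alive = [c for c in survivors if c[0] == shard]
        if not alive:
            # Every copy of this shard was on the failed node.
            lost_primary = True
            unassigned += 1 + replicas_per_shard
            continue
        has_primary = any(role == "P" for _, role, _ in alive)
        for i, (s, role, node) in enumerate(alive):
            if not has_primary and i == 0:
                role = "P"  # first surviving replica becomes the primary
            promoted.append((s, role, node))
        unassigned += (1 + replicas_per_shard) - len(alive)
    if lost_primary:
        status = "RED"
    elif unassigned:
        status = "YELLOW"
    else:
        status = "GREEN"
    return promoted, unassigned, status


copies = (
    [(s, "P", "node1") for s in (1, 2, 3)]
    + [(s, "R", "node2") for s in (1, 2, 3)]
    + [(s, "P", "node2") for s in (4, 5)]
    + [(s, "R", "node1") for s in (4, 5)]
)
remaining, unassigned, status = after_failure(copies, "node2")
print(remaining)
print(unassigned, status)
```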

As soon as the node that had left is back up, it'll join the cluster again and the replicas will be assigned again. The existing shards on the second node can be loaded, but they need to be synchronized with the other shards, as write operations most likely happened while the node was down. At the end of this operation, the cluster status will become GREEN.

Hope this clarifies things for you.

Java – Difference between solr and lucene

@darkheir: Lucene and Solr are two different Apache projects that are made to work together; I don't understand what the aim of each project is.

Solr uses Lucene under the hood. Lucene has no clue about the Solr API.
Lucene is a powerful search engine framework that lets us add search capability to our application. It exposes an easy-to-use API while hiding all the complex search-related operations. Any application can use this library, not just Solr.
Solr is built around Lucene. It is not just an HTTP wrapper around Lucene; it adds a good deal of functionality of its own. Solr is ready to use out of the box. It is a web application that offers related infrastructure and many more features in addition to what Lucene offers.
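
To show what "ready to use out of the box" means in practice, here is a minimal Python sketch that builds the URL for a Solr query over HTTP, with no Lucene code involved. The address `localhost:8983` and the core name `articles` are assumptions for this example; `select` is Solr's standard query handler.

```python
from urllib.parse import urlencode

# Assumed for illustration: a local Solr on its default port and a
# hypothetical core named "articles".
SOLR_URL = "http://localhost:8983/solr"
CORE = "articles"


def build_select_url(query, rows=10):
    """Return the URL for a query against the core's select handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_URL}/{CORE}/select?{params}"


print(build_select_url("title:lucene"))
```

Any HTTP client, in any language, can fetch such a URL, which is why Solr can be used without writing Java.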

@darkheir: Lucene is used to create a search index and Solr uses this index to perform searches. Am I right or is this a totally different approach?

Lucene doesn't just create the index for Solr to consume. Lucene handles all the search-related operations. Any application can use the Lucene framework.

Examples are Solr, Elasticsearch, LinkedIn (yes, under the hood), etc.

Check out this article: Lucene vs Solr

UPDATE (6/18/14)

When to use Lucene?

You are a search engineer AND
You are a programmer AND
You want full control over almost all the internals of Lucene AND
Your requirements demand you to do all sorts of geeky customization to Lucene AND
You are willing to take care of infrastructure elements of your search like scaling, distribution, etc.

When to use Solr?

At least one of the above didn't make sense OR
You want something that is ready to use out-of-the-box (even without knowledge of Java) OR
Your infrastructure requirements outweigh search customization requirements.

NOTE: I don't mean that Solr is hard to customize. Solr is very flexible and provides a lot of pluggable API points, allowing you to plug in your own code.

There are people who fall into the 'have to use Lucene' camp but still prefer Solr to plain Lucene because it's easy to use. However, they never hold back from customizing Solr to the maximum extent.

BTW, I see that there are more resources on Solr (4.x) than Lucene (4.x).
