Elasticsearch is using way too much disk space

disk-space-utilization elasticsearch lucene

I have a CentOS 6.5 server on which I installed Elasticsearch 1.3.2.

My elasticsearch.yml configuration file is a minimal modification of the one shipping with elasticsearch as a default. Once stripped of all commented lines, it looks like:

cluster.name: xxx-kibana

node:
    name: "xxx"
    master: true
    data: true

index.number_of_shards: 5

index.number_of_replicas: 1

path:
    logs: /log/elasticsearch/log
    data: /log/elasticsearch/data


transport.tcp.port: 9300

http.port: 9200

discovery.zen.ping.multicast.enabled: false

Elasticsearch should have compression ON by default, and I have read various benchmarks putting the compression ratio anywhere from 50% to 95%. Unfortunately, the compression ratio in my case is about -400%; in other words, data stored with ES takes nearly 4 times as much disk space as the text files with the same content. See:

12K     logstash-2014.10.07/2/translog
16K     logstash-2014.10.07/2/_state
116M    logstash-2014.10.07/2/index
116M    logstash-2014.10.07/2
12K     logstash-2014.10.07/4/translog
16K     logstash-2014.10.07/4/_state
127M    logstash-2014.10.07/4/index
127M    logstash-2014.10.07/4
12K     logstash-2014.10.07/0/translog
16K     logstash-2014.10.07/0/_state
109M    logstash-2014.10.07/0/index
109M    logstash-2014.10.07/0
16K     logstash-2014.10.07/_state
12K     logstash-2014.10.07/1/translog
16K     logstash-2014.10.07/1/_state
153M    logstash-2014.10.07/1/index
153M    logstash-2014.10.07/1
12K     logstash-2014.10.07/3/translog
16K     logstash-2014.10.07/3/_state
119M    logstash-2014.10.07/3/index
119M    logstash-2014.10.07/3
622M    logstash-2014.10.07/  # <-- This is the total!

versus:

6,3M    /var/log/td-agent/legacy_api.20141007_0.log
8,0M    /var/log/td-agent/legacy_api.20141007_10.log
7,6M    /var/log/td-agent/legacy_api.20141007_11.log
6,7M    /var/log/td-agent/legacy_api.20141007_12.log
8,0M    /var/log/td-agent/legacy_api.20141007_13.log
7,6M    /var/log/td-agent/legacy_api.20141007_14.log
7,6M    /var/log/td-agent/legacy_api.20141007_15.log
7,7M    /var/log/td-agent/legacy_api.20141007_16.log
5,6M    /var/log/td-agent/legacy_api.20141007_17.log
7,9M    /var/log/td-agent/legacy_api.20141007_18.log
6,3M    /var/log/td-agent/legacy_api.20141007_19.log
7,8M    /var/log/td-agent/legacy_api.20141007_1.log
7,1M    /var/log/td-agent/legacy_api.20141007_20.log
8,0M    /var/log/td-agent/legacy_api.20141007_21.log
7,2M    /var/log/td-agent/legacy_api.20141007_22.log
3,8M    /var/log/td-agent/legacy_api.20141007_23.log
7,5M    /var/log/td-agent/legacy_api.20141007_2.log
7,3M    /var/log/td-agent/legacy_api.20141007_3.log
8,0M    /var/log/td-agent/legacy_api.20141007_4.log
7,5M    /var/log/td-agent/legacy_api.20141007_5.log
7,5M    /var/log/td-agent/legacy_api.20141007_6.log
7,8M    /var/log/td-agent/legacy_api.20141007_7.log
7,8M    /var/log/td-agent/legacy_api.20141007_8.log
7,2M    /var/log/td-agent/legacy_api.20141007_9.log
173M    total

What am I doing wrong? Why is data not being compressed?

I have provisionally added index.store.compress.stored: 1 to my configuration file, as I found it in the elasticsearch 0.19.5 release notes (which is when store compression was first introduced), but I'm not yet able to tell whether it is making a difference, and in any case compression should be ON by default nowadays…

Best Answer

Elasticsearch does not shrink your data automagically; no database does. Besides storing the raw data, every database has to store metadata along with it. A conventional database only builds an index (for faster search) on the columns the DB admin chose up front. Elasticsearch is different: it indexes every field by default. This makes the index very large, but in return it gives you fast retrieval on any field.

In typical configurations, the indexed data takes up 4 to 6 times the size of the raw data, although this depends heavily on the actual data. This is intended behavior.
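
As a quick sanity check, the figures from the question can be compared against that range. This is only a sketch using the rounded numbers printed by du in the question; the variable and function names are made up for illustration.

```python
# Sizes copied from the du output in the question, in megabytes (rounded by du).
shard_index_mb = {0: 109, 1: 153, 2: 116, 3: 119, 4: 127}
index_total_mb = 622   # total du reported for logstash-2014.10.07/
raw_logs_mb = 173      # total du reported for the raw td-agent log files


def expansion_factor(indexed_mb, raw_mb):
    """Return how many times larger the indexed data is than the raw data."""
    return indexed_mb / raw_mb


shard_sum_mb = sum(shard_index_mb.values())
factor = expansion_factor(index_total_mb, raw_logs_mb)

# The per-shard sizes add up to slightly more than the du total
# because each figure was rounded separately.
print(f"sum of shard index sizes: {shard_sum_mb} MB")
print(f"expansion factor: {factor:.1f}x")
```

With these numbers the factor comes out at roughly 3.6, which is close to the low end of the range quoted above rather than a sign that something is broken.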

So to decrease the database size, you have to do the opposite of what you would do in an RDBMS: instead of choosing which columns to index, exclude the fields that you do not need from being indexed or stored.
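
A minimal sketch of what such an exclusion could look like for Elasticsearch 1.x, written as a Python script that only builds and prints the JSON body of an index template (it does not talk to a cluster). The field names user_agent and http_method and the helper build_logstash_template are hypothetical; the mapping keys used ("index": "no", "index": "not_analyzed", and disabling _all) are 1.x-era settings, so check them against the documentation for your version before relying on them.

```python
import json


def build_logstash_template(unsearched_fields, exact_match_fields):
    """Build an index template body that trims what gets indexed.

    unsearched_fields: string fields that are kept in _source but not indexed.
    exact_match_fields: string fields indexed as a single term (no analysis).
    Both lists hold hypothetical field names chosen by the caller.
    """
    properties = {}
    for name in unsearched_fields:
        # "index": "no" means the field cannot be searched at all.
        properties[name] = {"type": "string", "index": "no"}
    for name in exact_match_fields:
        # "not_analyzed" skips tokenization and stores one term per value.
        properties[name] = {"type": "string", "index": "not_analyzed"}
    return {
        "template": "logstash-*",  # apply to the daily logstash indices
        "mappings": {
            "_default_": {
                # _all duplicates every field into one catch-all field.
                "_all": {"enabled": False},
                "properties": properties,
            }
        },
    }


template = build_logstash_template(
    unsearched_fields=["user_agent"],
    exact_match_fields=["http_method"],
)
print(json.dumps(template, indent=2))
```

The printed body would be sent as an index template (PUT /_template/ followed by a name of your choice) so that it applies to indices created afterwards; existing indices keep their old mappings.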

Additionally you could turn on compression, but this will only improve when your "documents" are large, which is probably not true for log file entries.
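
The general effect of document size on compression can be illustrated with a small experiment. This sketch uses Python's zlib as a stand-in and synthetic log lines invented for the example; it is not the codec or storage layout Elasticsearch itself uses, so it only shows the principle.

```python
import zlib

# Synthetic, made-up log lines standing in for short log "documents".
lines = [
    f"2014-10-07T12:00:{i % 60:02d} GET /api/v1/items/{i} 200 {100 + i}ms".encode()
    for i in range(1000)
]

raw_bytes = sum(len(line) for line in lines)

# Compress every short line on its own: each result carries fixed overhead
# and the compressor sees very little repetition to exploit.
per_line_bytes = sum(len(zlib.compress(line)) for line in lines)

# Compress all lines as one large blob: repetition across lines is exploited.
batched_bytes = len(zlib.compress(b"\n".join(lines)))

print(f"raw:                 {raw_bytes} bytes")
print(f"compressed per line: {per_line_bytes} bytes")
print(f"compressed together: {batched_bytes} bytes")
```

Compressing many tiny inputs separately gains far less than compressing one large input, which is why short log entries benefit little from per-document compression.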

There are some comparisons and useful tips here: https://github.com/jordansissel/experiments/tree/master/elasticsearch/disk

But remember: searching comes at a cost, and that cost is disk space. What you gain in return is flexibility. If you outgrow your storage, scale horizontally. This is where Elasticsearch shines.
