ElasticSearch Server Randomly Stops Working

elasticsearch

I have 2 ES servers being fed by 1 logstash server, and I view the logs in Kibana. This is a POC to work out any issues before going into production. The system has run for ~1 month, and every few days Kibana stops showing logs at some random time in the middle of the night. Last night, the last log entry I received in Kibana was around 18:30. When I checked on the ES servers, /sbin/service elasticsearch status showed the master running and the secondary not running, yet I was able to curl localhost and it returned information, so I'm not sure what's up with that. Anyway, when I check the health from the master node, I get this:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "gis-elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 186,
  "active_shards" : 194,
  "relocating_shards" : 0,
  "initializing_shards" : 7,
  "unassigned_shards" : 249
}
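
For reference, here is the sort of thing I can run to see which nodes the cluster actually sees and which shards are stuck unassigned (a sketch; I'm assuming the _cat API is available on this version and I'm using the default host/port):

# which nodes the master currently sees (data nodes vs. logstash/Kibana client nodes)
curl -s 'http://localhost:9200/_cat/nodes?v'
# per-index health, to spot the red indexes
curl -s 'http://localhost:9200/_cat/indices?v'
# only the shards that are still unassigned
curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED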

When I view the indexes via "ls …nodes/0/indices/", it shows all the indexes modified today for some reason, and there are new files for today's date. So I think I'm starting to catch back up after I restarted both servers, but I'm not sure why it failed in the first place. When I look at the logs on the master, I only see 4 warnings at 18:57 and then the secondary leaving the cluster. I don't see any logs on the secondary (Pistol) explaining why it stopped working or what really happened.

[2014-03-06 18:57:04,121][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147630]
[2014-03-06 18:57:04,124][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147717]
[2014-03-06 18:57:04,124][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147718]
[2014-03-06 18:57:04,124][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147721]

[2014-03-06 19:56:08,467][INFO ][cluster.service          ] [ElasticSearch Server1] removed {[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.1.1.10:9301]]{client=true, data=false},}, reason: zen-disco-node_failed([Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.13.3.46:9301]]{client=true, data=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2014-03-06 19:56:12,304][INFO ][cluster.service          ] [ElasticSearch Server1] added {[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.1.1.10:9301]]{client=true, data=false},}, reason: zen-disco-receive(join from node[[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.13.3.46:9301]]{client=true, data=false}])

Any ideas on additional logging or troubleshooting I can turn on to keep this from happening in the future? Since the shards are not caught up, right now I'm just seeing a lot of debug messages about failing to parse. I'm assuming those will clear up once we catch up.

[2014-03-07 10:06:52,235][DEBUG][action.search.type       ] [ElasticSearch Server1] All shards failed for phase: [query]
[2014-03-07 10:06:52,223][DEBUG][action.search.type       ] [ElasticSearch Server1] [windows-2014.03.07][3], node[W6aEFbimR5G712ddG_G5yQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@74ecbbc6] lastShard [true]
org.elasticsearch.search.SearchParseException: [windows-2014.03.07][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"10m"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"(ASA AND Deny)"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1394118412373,"to":"now"}}}]}}}}}}}},"size":0}]]
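
On the logging question above: the only extra logging I can think of is bumping the discovery and transport logger levels, either in logging.yml or on the fly; a sketch, assuming the dynamic "logger.*" cluster settings are supported on this version:

# raise discovery/transport verbosity at runtime (revert by setting them back to INFO)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "logger.discovery" : "TRACE",
    "logger.transport" : "DEBUG"
  }
}'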

Best Answer

The usual suspects for ES with Kibana are:

  • Too little memory available for ES (which you can investigate with a monitoring tool such as Marvel, or anything that ships JVM metrics off the box for monitoring)
  • Long GC pauses (turn on GC logging, as sketched below, and check whether they coincide with ES becoming unresponsive)
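
To turn on GC logging, something along these lines should do; a sketch using standard HotSpot flags (Java 7), where the log path and the exact place you set it (ES_JAVA_OPTS, e.g. in /etc/sysconfig/elasticsearch if your init script reads its environment from there) are assumptions on my part:

# write GC events, with timestamps and stop-the-world pause times, to a dedicated log
ES_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc.log"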

Also, the "usual" setup for ES is 3 servers, to allow better redundancy when one server is down. But YMMV.
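
If you do go to 3 master-eligible servers, it is also worth requiring a quorum so a flaky node cannot split the cluster; a sketch for elasticsearch.yml (setting names from the 1.x zen discovery module, values assuming 3 master-eligible nodes):

# require a majority of master-eligible nodes (2 of 3) before electing a master
discovery.zen.minimum_master_nodes: 2
# optionally be more forgiving before declaring a node dead (default is 3 pings with a 30s timeout)
discovery.zen.fd.ping_timeout: 60s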

You can also try the new G1 garbage collector, which (in my case) has behaved much better than CMS for my Kibana ES cluster.
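
Switching collectors is done in the JVM options ES is started with; a sketch, keeping in mind that ES ships with CMS by default and the exact flag list in your elasticsearch.in.sh may differ:

# replace the CMS flags (-XX:+UseConcMarkSweepGC and friends) with G1
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=200"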

The GC pause problem is usually the one that strikes while you're looking somewhere else, and it typically leads to lost data because ES stops responding.

Good luck with these :)
