ElasticSearch Server Randomly Stops Working

elasticsearch

I have 2 ES servers being fed by 1 logstash server, and I view the logs in Kibana. This is a POC to work out any issues before going into production. The system has run for ~1 month, and every few days Kibana stops showing logs at some random time in the middle of the night. Last night, the last log entry I received in Kibana was around 18:30. When I checked on the ES servers, /sbin/service elasticsearch status showed the master running and the secondary not running, yet I was able to curl localhost and it returned information, so I'm not sure what's up with that. Anyway, when I check the health from the master node, I get this:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "gis-elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 186,
  "active_shards" : 194,
  "relocating_shards" : 0,
  "initializing_shards" : 7,
  "unassigned_shards" : 249
}
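
For reference, here is the sort of thing I can run to see which nodes the cluster actually sees and which shards are stuck unassigned (a sketch; I'm assuming the _cat API is available on this version and I'm using the default host/port):

# which nodes the master currently sees (data nodes vs. logstash/Kibana client nodes)
curl -s 'http://localhost:9200/_cat/nodes?v'
# per-index health, to spot the red indexes
curl -s 'http://localhost:9200/_cat/indices?v'
# only the shards that are still unassigned
curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED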

When I view the indexes via "ls …nodes/0/indices/", it shows all the indexes modified today for some reason, and there are new files for today's date. So I think I'm starting to catch back up after I restarted both servers, but I'm not sure why it failed in the first place. When I look at the logs on the master, I only see 4 warnings at 18:57 and then the secondary leaving the cluster. I don't see any logs on the secondary (Pistol) explaining why it stopped working or what really happened.

[2014-03-06 18:57:04,121][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147630]
[2014-03-06 18:57:04,124][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147717]
[2014-03-06 18:57:04,124][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147718]
[2014-03-06 18:57:04,124][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147721]

[2014-03-06 19:56:08,467][INFO ][cluster.service          ] [ElasticSearch Server1] removed {[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.1.1.10:9301]]{client=true, data=false},}, reason: zen-disco-node_failed([Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.13.3.46:9301]]{client=true, data=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2014-03-06 19:56:12,304][INFO ][cluster.service          ] [ElasticSearch Server1] added {[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.1.1.10:9301]]{client=true, data=false},}, reason: zen-disco-receive(join from node[[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.13.3.46:9301]]{client=true, data=false}])

Any ideas on additional logging or troubleshooting I can turn on to keep this from happening in the future? Since the shards are not caught up, right now I'm just seeing a lot of debug messages about failing to parse. I'm assuming those will clear up once we catch up.

[2014-03-07 10:06:52,235][DEBUG][action.search.type       ] [ElasticSearch Server1] All shards failed for phase: [query]
[2014-03-07 10:06:52,223][DEBUG][action.search.type       ] [ElasticSearch Server1] [windows-2014.03.07][3], node[W6aEFbimR5G712ddG_G5yQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@74ecbbc6] lastShard [true]
org.elasticsearch.search.SearchParseException: [windows-2014.03.07][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"10m"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"(ASA AND Deny)"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1394118412373,"to":"now"}}}]}}}}}}}},"size":0}]]
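
On the logging question above: the only extra logging I can think of is bumping the discovery and transport logger levels, either in logging.yml or on the fly; a sketch, assuming the dynamic "logger.*" cluster settings are supported on this version:

# raise discovery/transport verbosity at runtime (revert by setting them back to INFO)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "logger.discovery" : "TRACE",
    "logger.transport" : "DEBUG"
  }
}'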

Best Answer

The usual suspects for ES with Kibana are:

  • Too little memory available for ES (which you can investigate with a monitoring tool such as Marvel, or anything that ships JVM metrics off the box for monitoring)
  • Long GC pauses (turn on GC logging, as sketched below, and check whether they coincide with ES becoming unresponsive)
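
To turn on GC logging, something along these lines should do; a sketch using standard HotSpot flags (Java 7), where the log path and the exact place you set it (ES_JAVA_OPTS, e.g. in /etc/sysconfig/elasticsearch if your init script reads its environment from there) are assumptions on my part:

# write GC events, with timestamps and stop-the-world pause times, to a dedicated log
ES_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc.log"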

Also, the "usual" setup for ES is 3 servers, to allow better redundancy when one server is down. But YMMV.
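
If you do go to 3 master-eligible servers, it is also worth requiring a quorum so a flaky node cannot split the cluster; a sketch for elasticsearch.yml (setting names from the 1.x zen discovery module, values assuming 3 master-eligible nodes):

# require a majority of master-eligible nodes (2 of 3) before electing a master
discovery.zen.minimum_master_nodes: 2
# optionally be more forgiving before declaring a node dead (default is 3 pings with a 30s timeout)
discovery.zen.fd.ping_timeout: 60s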

You can also try the new G1 garbage collector, which (in my case) has behaved much better than CMS for my Kibana ES cluster.
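
Switching collectors is done in the JVM options ES is started with; a sketch, keeping in mind that ES ships with CMS by default and the exact flag list in your elasticsearch.in.sh may differ:

# replace the CMS flags (-XX:+UseConcMarkSweepGC and friends) with G1
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=200"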

The GC pause problem is usually the one that strikes while you're looking somewhere else, and it typically leads to lost data because ES stops responding.

Good luck with these :)
