CentOS – sudden peak in CPU usage

We are running a four-node Elasticsearch cluster; each machine has 12 cores, 96 GB of RAM, and 4 spinning disks. Under normal operation, CPU usage is mostly user time, at around 5-10%. Every few days, CPU usage on one of the machines gets pegged at 80-100%, all of it user and system time; I/O wait actually decreases. We first thought it was an Elasticsearch-specific issue, but after extensive debugging that doesn't seem to be the case:

The high CPU utilization survives a restart of the Elasticsearch node process.
The Elasticsearch threads are all behaving normally; things just take 10x longer.
Non-Elasticsearch operations (garbage collection) also take 10x longer, but heap activity is normal.
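
To put numbers on the user/system/I/O-wait split described above, one option is to diff two samples of the aggregate `cpu` line in `/proc/stat`. The following is only a minimal sketch: the function names (`parse_cpu_line`, `cpu_breakdown`, `sample`) are hypothetical, and it assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat on Linux.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu  1 2 3 ...' line into a dict of jiffy counters."""
    parts = line.split()
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Return the percentage of time spent in each state between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: b[name] - a[name] for name in a if name in b}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample(interval=5.0, path="/proc/stat"):
    """Hypothetical helper: read the first line of /proc/stat twice, interval seconds apart."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return cpu_breakdown(first, second)
```

Calling `sample()` on a healthy node and on the affected node during an episode would give directly comparable percentages, which may help show whether the extra time is going to user or to system.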

If we stop the process for about an hour and then restart only the process (not the machine), the problem goes away and things work fine for a few days.

We have also noticed that disk copy tests are very slow while the problem is occurring. With the process up but idle (not indexing or searching data), or soon after the process has stopped, copying a 1 GB file with dd runs at about 18 MB/s on the problematic machine, versus 490 MB/s when healthy. Interestingly, dstat showed that the slow copy spent about 25 seconds before doing any I/O and then took an additional 30 seconds to complete. The strace output didn't seem to be significantly different.
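
The delay before the first I/O could also be measured from inside a copy loop rather than inferred from dstat. The sketch below is not a replacement for dd: `timed_copy` is a hypothetical helper that copies a file in chunks, records how long it takes until the first chunk has been written, and calls `fsync` at the end so the total reflects data reaching the disk rather than only the page cache.

```python
import os
import time


def timed_copy(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst in chunks and report timing and throughput."""
    start = time.monotonic()
    first_chunk_seconds = None  # stays None if the source file is empty
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            if first_chunk_seconds is None:
                # Time from start until the first chunk was read and written.
                first_chunk_seconds = time.monotonic() - start
        fout.flush()
        os.fsync(fout.fileno())  # force the data out of the page cache
    elapsed = time.monotonic() - start
    return {
        "bytes": copied,
        "seconds": elapsed,
        "first_chunk_seconds": first_chunk_seconds,
        "mb_per_s": copied / elapsed / 1e6 if elapsed > 0 else float("inf"),
    }
```

Running this against the same 1 GB file on a healthy and an affected machine would show whether the roughly 25-second stall happens before the first chunk moves or is spread across the whole copy.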

Any idea what further tests we could run?

Best Answer

There are a lot of known issues with Elasticsearch, and a quick search will turn up several. One major cause of high CPU usage may be a lack of control over cache usage. See the references below:

https://github.com/elasticsearch/elasticsearch/issues/4288
http://elasticsearch-users.115913.n3.nabble.com/Very-high-sys-cpu-usage-with-HTTP-KeepAlive-td4049998.html
http://blog.sematext.com/2012/05/17/elasticsearch-cache-usage/
