EC2 CPU utilization hits 100% every time on a machine where a Drupal site is installed

amazon ec2, cpu-usage, drupal8, php-fpm

I have my production Drupal website running on EC2. Recently we noticed that the instance's CPU utilization keeps hitting 100%. I checked the site traffic, and there are not many users. I checked the processes running on the instance using the top command and saw one command, jbd2_sda1-8. I couldn't work out what this command is used for. It was invoked by user www-data and was showing 200% CPU usage. As far as I know, www-data means the command was invoked by some application running on my machine. I am using PHP 7.1 and Nginx for my Drupal site. I ran sudo service php7.1-fpm restart and then checked the processes again. The process had been killed, so I assume the command was invoked by some PHP process. I checked EC2 monitoring, and my CPU usage had gone down to 3%.

Everything seemed fine for around an hour, then I got another alert from AWS about high CPU usage. I went through the same steps to debug. This time I saw a different command, kjournald. This process was also using a lot of CPU, and was also owned by user www-data.

I am confused. I tried to find out what these commands mean but didn't understand anything, and I couldn't find any relation between the old process and the new one.

This issue keeps happening. If I restart PHP, the process gets killed, and after some time it reappears. I don't know what the issue is.

My troubleshooting experiments:

I copied the code to my staging environment. On that machine I did not see the issue (since it is staging, there is no traffic at all). Because staging showed no issue, I pointed the production DNS at it; after the DNS change, the same issue occurred in the second environment too, and the first instance went back to normal.
I enabled access logging in nginx and looked for unusual requests at the time of the 100% CPU spikes (to check whether my site was under attack). I couldn't find any suspicious request at the time of a CPU spike.
I enabled access logging on the ELB and checked the requests using AWS Athena. I couldn't find anything there either.

I am stuck here, with no idea what is going on. Can anybody help?

Best Answer

If kjournald is consuming a lot of resources, it means your OS is doing a lot of journaling operations, which arise from changes to the file system (disk operations).

That means something is doing a lot more writing to the file system than it should be, or that there is some issue with your block storage.
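One check that may be worth running alongside this, given that the question reports the processes as owned by www-data: on Linux, genuine kernel threads such as kjournald or jbd2 are children of kthreadd (PID 2) and normally run as root. The sketch below flags any process whose name looks like a journaling thread but does not fit that pattern. The function name and the name pattern are assumptions made for illustration, not part of the original answer.

```shell
# Reads "PID PPID USER COMMAND" lines on stdin (for example from
# `ps -eo pid=,ppid=,user=,comm=`) and prints the ones whose name looks like
# a journaling kernel thread but that are NOT children of PID 2 (kthreadd)
# or are not owned by root.
# Hypothetical helper name; the name pattern is an assumption.
suspicious_kthreads() {
    awk '$4 ~ /^(jbd2|kjournald)/ && ($2 != 2 || $3 != "root")'
}
```

Usage would be along the lines of: ps -eo pid=,ppid=,user=,comm= | suspicious_kthreads. An empty result means every matching process looks like an ordinary kernel thread.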

Use lsof to see what files are in use by your operating system at any given time.
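As a rough sketch of how the lsof output could be narrowed down to what is actually being written, the filter below keeps only regular files that are open for writing. The function name is hypothetical, and it assumes lsof's default column layout.

```shell
# Reads `lsof` output on stdin and keeps only regular files (TYPE == REG)
# that are open for writing: the FD column is a descriptor number followed
# by w (write) or u (read and write).
# Hypothetical helper name; assumes the default lsof column layout
# (COMMAND PID USER FD TYPE ...).
writers_only() {
    awk '$5 == "REG" && $4 ~ /^[0-9]+[wu]/'
}
```

For the setup in the question, something like sudo lsof -u www-data | writers_only would list the files that processes owned by www-data have open for writing.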

Failing that, if your Production and Staging environments were created at the same time, with the same instance class, in the same AZ, and you're confident that your PHP application is well designed (should be OK if it's Drupal) and you have not changed anything recently, I'd consider re-provisioning the EC2 instances.

Make a snapshot and re-deploy, ideally on a different instance class to ensure you're moving to a different infrastructure stack.

I presume the instances are EBS backed, in which case they could be using some block storage in AWS that has gone bad.

Related Solutions

Linux CPU Usage – Monitoring CPU Usage and Process Execution History

There are a couple of possible ways you can do this. Note that it's entirely possible that many processes in a runaway scenario are causing this, not just one.

The first way is to set up pidstat to run in the background and produce data.

pidstat -u 600 >/var/log/pidstats.log & disown $!

This will give you a quite detailed outlook of the running of the system at ten minute intervals. I would suggest this be your first port of call since it produces the most valuable/reliable data to work with.

There is a problem with this: if the box goes into a runaway CPU loop and produces huge load, you're not guaranteed that pidstat itself will execute in a timely manner during the load (if at all), so you could actually miss the output!
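Once the log has collected some samples, a small summary can help spot the heavy hitters. The following is a minimal sketch, assuming the log was produced by pidstat -u as above; the function name is hypothetical. It locates the %CPU and Command columns from the header row, because the exact column layout varies between sysstat versions.

```shell
# Reads a `pidstat -u` log on stdin and prints the peak %CPU seen for each
# command name, highest first. Commands that never rose above 0% are omitted.
# Hypothetical helper name; assumes single-word command names.
peak_cpu_by_command() {
    awk '
        # Header rows name the columns; remember where %CPU and Command are.
        /%CPU/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "%CPU") cpu = i
                if ($i == "Command") cmd = i
            }
            next
        }
        # Data rows: keep the highest %CPU seen for each command name.
        cpu && cmd && NF >= cmd && $cpu ~ /^[0-9.]+$/ {
            if ($cpu + 0 > peak[$cmd]) peak[$cmd] = $cpu + 0
        }
        END { for (name in peak) printf "%.2f %s\n", peak[name], name }
    ' | sort -rn
}
```

With the log path used above, this could be run as peak_cpu_by_command < /var/log/pidstats.log.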

The second way to look for this is to enable process accounting. Possibly more of a long term option.

accton on

This will enable process accounting (if not already enabled). If it was not running before, it will need time to collect data.

After it has run for, say, 24 hours, you can run a command like the following (which will produce output like this):

# sa --percentages --separate-times
     108  100.00%       7.84re  100.00%       0.00u  100.00%       0.00s  100.00%         0avio     19803k
       2    1.85%       0.00re    0.05%       0.00u   75.00%       0.00s    0.00%         0avio     29328k   troff
       2    1.85%       0.37re    4.73%       0.00u   25.00%       0.00s   44.44%         0avio     29632k   man
       7    6.48%       0.00re    0.01%       0.00u    0.00%       0.00s   44.44%         0avio     28400k   ps
       4    3.70%       0.00re    0.02%       0.00u    0.00%       0.00s   11.11%         0avio      9753k   ***other*
      26   24.07%       0.08re    1.01%       0.00u    0.00%       0.00s    0.00%         0avio      1130k   sa
      14   12.96%       0.00re    0.01%       0.00u    0.00%       0.00s    0.00%         0avio     28544k   ksmtuned*
      14   12.96%       0.00re    0.01%       0.00u    0.00%       0.00s    0.00%         0avio     28096k   awk
      14   12.96%       0.00re    0.01%       0.00u    0.00%       0.00s    0.00%         0avio     29623k   man*
       7    6.48%       7.00re   89.26%       0.00u    0.00%       0.00s

The columns are, in order:

Number of calls
Percentage of calls
Amount of real time spent on all the processes of this type
Percentage
User CPU time
Percentage
System CPU time
Percentage
Average I/O operations per call (avio)
CPU-time averaged memory usage (k)
Command name

What you'll be looking for is the process types that generate the most user/system CPU time.

The top row is the total, and the rows below it show how that total is split between commands. Process accounting only records a process properly if accounting was already on when the process was spawned, so it's probably best to restart the system after enabling it to ensure all services are accounted for.

This by no means gives you a definite answer as to which process is causing the problem, but it might give you a good feel. As it could be a 24-hour snapshot, there's a possibility of skewed results, so bear that in mind. It should also always log, since it's a kernel feature and, unlike pidstat, will produce output even during heavy load.
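To pick out the commands with the most user plus system CPU time from that report, a sketch along these lines may help. The function name is hypothetical, and it assumes the 11-column layout produced by sa --percentages --separate-times as shown above. In the sample output every CPU figure is 0.00, so the ranking only becomes informative on a box that has actually been busy.

```shell
# Reads `sa --percentages --separate-times` output on stdin and prints
# "user+system command", largest first. Only complete rows (11 fields, the
# last being the command name) are used; the totals row has no command.
# Hypothetical helper name.
rank_sa_by_cpu() {
    awk 'NF == 11 {
        user = $5; sys = $7
        sub(/u$/, "", user); sub(/s$/, "", sys)
        printf "%.2f %s\n", user + sys, $11
    }' | sort -rn
}
```

This could be run as sa --percentages --separate-times | rank_sa_by_cpu | head.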

The last option also uses process accounting, so turn it on as above, then use the program "lastcomm" to list the processes executed around the time of the problem, along with CPU statistics for each process.

lastcomm | grep "May  8 22:[01234]"
kworker/1:0       F    root     __         0.00 secs Tue May  8 22:20
sleep                  root     __         0.00 secs Tue May  8 22:49
sa                     root     pts/0      0.00 secs Tue May  8 22:49
sa                     root     pts/0      0.00 secs Tue May  8 22:49
sa                   X root     pts/0      0.00 secs Tue May  8 22:49
ksmtuned          F    root     __         0.00 secs Tue May  8 22:49
awk                    root     __         0.00 secs Tue May  8 22:49

This might give you some hints too as to what might be causing the problem.
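Since most entries in a listing like the one above show 0.00 secs, it can help to drop them. The filter below is a small sketch with a hypothetical function name; it looks for the word "secs" rather than a fixed field number because the flags column is sometimes empty.

```shell
# Reads `lastcomm` output on stdin and keeps lines whose CPU time (the number
# just before the word "secs") is greater than zero. The flags column may be
# empty, so the position is found by searching for "secs" rather than by
# fixed field number. Hypothetical helper name.
nonzero_cpu_only() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "secs" && $(i - 1) + 0 > 0) { print; break }
    }'
}
```

For example, lastcomm --user www-data | nonzero_cpu_only would show only the commands run by www-data that actually consumed CPU time.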

Linux – How to find what processes were running at a time in the past

There are several options:

Use a script which writes the needed data to a logfile on a regular basis. You could use cron to write the output of ps (and other commands) every x minutes into a logfile.
Better still, use a specialized program which does this for you. atop is very good at this, as it takes care of logfile retention.
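The first option (a script run from cron) might look roughly like the sketch below. The function name, the default log path and the 15-process cutoff are all assumptions made for illustration.

```shell
# Appends a timestamped snapshot of the busiest processes to a log file.
# Hypothetical helper; the default log path and the 15-process cutoff are
# assumptions. --sort is a procps (Linux) ps option.
log_ps_snapshot() {
    logfile="${1:-/var/log/ps-snapshots.log}"
    {
        printf '===== %s =====\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        # Header line plus the 15 processes using the most CPU.
        ps -eo pid,ppid,user,pcpu,pmem,comm --sort=-pcpu | head -n 16
        echo
    } >> "$logfile"
}
```

Saved in a script that calls log_ps_snapshot, this could be run from cron every few minutes. Unlike atop, it does nothing about log rotation, so the file will grow until something trims it.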

atop is available via the EPEL repo for CentOS/RHEL/Fedora and via the default repos of Debian/Ubuntu.

You can use atop like a normal real-time top utility, with slightly different behaviour (check out the manpage for keystrokes).

The more interesting part is: Once installed a daemon starts logging data into /var/log/atop and you can read these files with atop again:

atop -r /var/log/atop/atop_20160128

You then have access to all 'top' like functions (sorting/looking at memory/CPU/IO usage, etc.) and you can jump 10 minutes forward in time via 't' and 10 minutes back with 'T', or jump to a specific time via 'b'.

Check out the atop manpage; there are also plenty of how-tos about it online.
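When going back to a particular incident, it can be handy to build the log file name from the date of the alert. This is a minimal sketch with a hypothetical function name; it assumes the /var/log/atop/atop_YYYYMMDD naming used in the command above, which may differ between distributions.

```shell
# Builds the atop log path for a date given as YYYY-MM-DD, following the
# /var/log/atop/atop_YYYYMMDD naming shown above. Hypothetical helper name;
# the directory may differ between distributions.
atop_log_for() {
    printf '/var/log/atop/atop_%s\n' "$(printf '%s' "$1" | sed 's/-//g')"
}
```

For example, atop -r "$(atop_log_for 2016-01-28)" -b 22:00 would open that day's log starting at 22:00, using atop's -b (begin time) option.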

There might be other solutions, but atop is easy to understand and use and a good start before doing some more bespoke setups.
