Amazon EC2 – How to Handle Downtime and High CPU Usage

amazon ec2amazon-linuxcpu-usage

Recently I had an issue on an EC2 instance. The site running there was unavailable for 2 hours:

CPU utilization for the past week:

The rest are around the time it happened:




The systemd journal around that period.

What I can see there? At around 20:31 it looks like things get slow:

Job execution of per-minute job scheduled for 20:30 delayed into subsequent minute 20:31. Skipping job run.

Jobs (cronyd) fail to start.

Jan 12 21:31:29 ip-172-xx-x-xx.xx-yy-z.compute.internal chronyd[24287]: Forward time jump detected!
Jan 12 21:33:21 ip-172-xx-x-xx.xx-yy-z.compute.internal chronyd[24287]: Can't synchronise: no selectable sources

The dhclient lines, usually they come together, but at that period it was like this:

Jan 12 20:46:21 ip-172-xx-x-xx.xx-yy-z.compute.internal dhclient[3066]: DHCPREQUEST on eth0 to 172.xx.x.xx port 67 (xid=0x7cb0e02d)
Jan 12 20:46:22 ip-172-xx-x-xx.xx-yy-z.compute.internal dhclient[3066]: DHCPACK from 172.xx.x.xx (xid=0x7cb0e02d)
Jan 12 21:06:23 ip-172-xx-x-xx.xx-yy-z.compute.internal dhclient[3066]: bound to 172.yy.y.yy -- renewal in 354 seconds.

Also:

Jan 12 21:47:22 ip-172-xx-x-xx.xx-yy-z.compute.internal dhclient[3066]: bound to 172.yy.y.yy -- renewal in -554 seconds.

And it looks like at 21:47 things get back to normal.

The docker containers running there restarted. I remember that their logs started closer to 10pm, probably at 21:47.

The sysstat log (/var/log/sa/sar12):

07:00:01 PM     all      3.77      0.00      0.53      0.00      0.53      0.00      0.11      0.00      0.00     95.05
07:00:01 PM       0      4.22      0.00      0.54      0.01      0.45      0.00      0.11      0.00      0.00     94.68
07:00:01 PM       1      3.33      0.00      0.53      0.00      0.61      0.00      0.10      0.00      0.00     95.43
07:10:01 PM     all      3.47      0.00      0.52      0.00      0.54      0.00      0.13      0.00      0.00     95.34
07:10:01 PM       0      4.01      0.00      0.53      0.00      0.48      0.00      0.10      0.00      0.00     94.88
07:10:01 PM       1      2.93      0.00      0.52      0.01      0.60      0.00      0.15      0.00      0.00     95.80
07:20:01 PM     all      1.89      0.00      0.47      0.00      0.46      0.00      0.10      0.00      0.00     97.08
07:20:01 PM       0      1.54      0.00      0.46      0.00      0.39      0.00      0.10      0.00      0.00     97.50
07:20:01 PM       1      2.24      0.00      0.48      0.00      0.53      0.00      0.10      0.00      0.00     96.65
07:30:01 PM     all      1.37      0.00      0.47      0.00      0.42      0.00      0.09      0.00      0.00     97.65
07:30:01 PM       0      1.55      0.00      0.46      0.00      0.36      0.00      0.08      0.00      0.00     97.54
07:30:01 PM       1      1.18      0.00      0.48      0.00      0.47      0.00      0.10      0.00      0.00     97.77
07:40:01 PM     all      1.32      0.00      0.47      0.00      0.41      0.00      0.10      0.00      0.00     97.71
07:40:01 PM       0      1.46      0.00      0.46      0.00      0.33      0.00      0.09      0.00      0.00     97.66
07:40:01 PM       1      1.18      0.00      0.47      0.00      0.48      0.00      0.10      0.00      0.00     97.77
07:50:01 PM     all      1.36      0.00      0.48      0.00      0.41      0.00      0.10      0.00      0.00     97.65
07:50:01 PM       0      1.14      0.00      0.45      0.00      0.33      0.00      0.11      0.00      0.00     97.96
07:50:01 PM       1      1.58      0.00      0.50      0.00      0.50      0.00      0.09      0.00      0.00     97.33
08:00:01 PM     all      2.17      0.00      0.52      0.01      0.52      0.00      0.12      0.00      0.00     96.66
08:00:01 PM       0      2.26      0.00      0.49      0.01      0.45      0.00      0.13      0.00      0.00     96.67
08:00:01 PM       1      2.08      0.00      0.55      0.01      0.60      0.00      0.12      0.00      0.00     96.65
08:10:01 PM     all      3.47      1.35      2.41      0.08      0.58      0.00      0.15      0.00      0.00     91.96
08:10:01 PM       0      3.28      1.11      2.38      0.07      0.50      0.00      0.15      0.00      0.00     92.51
08:10:01 PM       1      3.66      1.58      2.45      0.09      0.66      0.00      0.15      0.00      0.00     91.40

08:10:01 PM     CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest    %gnice     %idle
08:20:01 PM     all      1.73      0.00      0.54      0.07      0.48      0.00      0.10      0.00      0.00     97.07
08:20:01 PM       0      1.94      0.00      0.58      0.07      0.40      0.00      0.10      0.00      0.00     96.90
08:20:01 PM       1      1.52      0.00      0.51      0.08      0.55      0.00      0.11      0.00      0.00     97.23
09:50:02 PM     all      2.11      0.11     50.63     43.63      0.09      0.00      0.02      0.00      0.00      3.41
09:50:02 PM       0      3.34      0.09     15.85     77.19      0.07      0.00      0.02      0.00      0.00      3.45
09:50:02 PM       1      0.93      0.12     83.90     11.54      0.11      0.00      0.02      0.00      0.00      3.37
10:00:01 PM     all      2.11      0.00      0.43      2.61      0.35      0.00      0.07      0.00      0.00     94.42
10:00:01 PM       0      1.87      0.00      0.45      2.73      0.25      0.00      0.07      0.00      0.00     94.63
10:00:01 PM       1      2.36      0.00      0.42      2.50      0.45      0.00      0.07      0.00      0.00     94.20
10:10:01 PM     all      0.80      0.00      0.33      0.00      0.29      0.00      0.06      0.00      0.00     98.52
10:10:01 PM       0      0.82      0.00      0.31      0.00      0.20      0.00      0.07      0.00      0.00     98.59
10:10:01 PM       1      0.77      0.00      0.35      0.00      0.37      0.00      0.06      0.00      0.00     98.45
10:20:01 PM     all      0.85      0.00      0.35      0.00      0.29      0.00      0.07      0.00      0.00     98.44
10:20:01 PM       0      0.85      0.00      0.34      0.00      0.21      0.00      0.07      0.00      0.00     98.53
10:20:01 PM       1      0.86      0.00      0.36      0.00      0.37      0.00      0.06      0.00      0.00     98.35
10:30:01 PM     all      1.41      0.00      0.38      0.00      0.33      0.00      0.08      0.00      0.00     97.79
10:30:01 PM       0      1.13      0.00      0.36      0.00      0.25      0.00      0.07      0.00      0.00     98.18
10:30:01 PM       1      1.69      0.00      0.40      0.00      0.42      0.00      0.09      0.00      0.00     97.40
10:40:01 PM     all      0.98      0.00      0.35      0.00      0.29      0.00      0.06      0.00      0.00     98.32
10:40:01 PM       0      0.70      0.00      0.33      0.00      0.22      0.00      0.06      0.00      0.00     98.69
10:40:01 PM       1      1.25      0.00      0.36      0.00      0.35      0.00      0.07      0.00      0.00     97.96
10:50:01 PM     all      0.65      0.00      0.34      0.00      0.28      0.00      0.06      0.00      0.00     98.68
10:50:01 PM       0      0.80      0.00      0.34      0.00      0.20      0.00      0.05      0.00      0.00     98.61
10:50:01 PM       1      0.50      0.00      0.34      0.00      0.35      0.00      0.06      0.00      0.00     98.75

There's a gap between 8:20 and 9:50, and only at 9:50 we see the load (idle 3%).

What might be relevant here, on Jan 04 I enabled time synchronization (timedatectl set-ntp true) because there was offset of 15 minutes:

System clock wrong by -910.996745 seconds

It's a t3a.medium instance. And I believe the credit specification was unlimited by then. At least that's what I saw on the next day. Anyways, the credit balance didn't hit the floor.

Can you explain it? What can I check?

To be fair, I can't be sure it can't be caused by the site or one of its components, but I haven't run into such issues.

UPD The issue might have been caused by memory leak in one of the containers. At least after making it run nokogiri tasks in separate processes, memory stopped growing and there were no similar incidents so far.

Best Answer

Looks like your instance got paused/suspended due to some issues with the physical host running your EC2 instance. Please remember that EC2 Instance level SLA is 99.5%. You probably want to enable monitoring for status check for your instances and automatic recovery.