AWS CloudWatch alarm going off prematurely

amazon-ec2 amazon-cloudwatch amazon-web-services

I created an alarm to stop an instance and email me if it was idle for too long (avg. CPU Utilization < 2% for 3 hours). However, in my testing I noticed that the instance was stopped after 1 hour. Below is the report from the email:

Alarm Details:

Name: Stop

Description: Created from EC2 Console

State Change: INSUFFICIENT_DATA -> ALARM

Reason for State Change: Threshold Crossed: 2 datapoints were less than the threshold (2.0). 

The most recent datapoints: 0.0425, 0.038363636363636364.

Timestamp: Thursday 14 March, 2013 22:20:11 UTC

AWS Account: xxxxxxxxxxxx

Threshold:
The alarm is in the ALARM state when the metric is LessThanThreshold 2.0 for 3600 seconds.

Monitored Metric:
MetricNamespace: AWS/EC2
MetricName: CPUUtilization
Dimensions: InstanceId = i-xxxxxxx
Period: 3600 seconds
Statistic: Average
Unit: not specified

State Change Actions:
OK:
ALARM: arn:aws:sns:us-east-1:xxxxxxxxxxxx:NotifyMe
INSUFFICIENT_DATA:

I'm confused as to why it enters the ALARM state after just 1 hour (3600s) when I set it to 3 hours (10800s). For my test, the instance had been stopped all day. Once I created the alarm I started it and didn't do anything with the instance. Does it take into account all those stopped hours when it calculates the avg CPU utilization over 3 hours?

I would like to have the alarm let the instance stay alive for the threshold of 3 hours before it stops the instance. Is there a better way to do this?

Best Answer

In your email it clearly states that your alarm is set to trigger after 3600 seconds.

Threshold: The alarm is in the ALARM state when the metric is LessThanThreshold 2.0 for 3600 seconds.

There should be an option to set "EvaluationPeriods". This tells the alarm how many consecutive periods of the metric to evaluate. In your case, you would keep the period at 3600 seconds and set EvaluationPeriods to 3, so the alarm checks once every hour whether the hourly average is LessThanThreshold 2.0. The alarm will then trigger only if the average is LessThanThreshold 2.0 for 3 consecutive hourly datapoints.
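As a rough sketch of the setting described above, the snippet below builds the arguments for CloudWatch's PutMetricAlarm call with a 3600-second period and 3 evaluation periods. The helper name `idle_alarm_params` and the alarm naming scheme are made up for illustration; the instance ID and SNS topic ARN are the redacted placeholders from the email.

```python
def idle_alarm_params(instance_id, topic_arn, hours=3, threshold=2.0):
    """Build keyword arguments for a PutMetricAlarm call (hypothetical helper)."""
    return {
        "AlarmName": f"stop-idle-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 3600,               # one averaged datapoint per hour
        "EvaluationPeriods": hours,   # consecutive datapoints that must breach
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [topic_arn],
    }


params = idle_alarm_params(
    "i-xxxxxxx", "arn:aws:sns:us-east-1:xxxxxxxxxxxx:NotifyMe"
)
# Total time the metric must stay below the threshold before ALARM.
window_seconds = params["Period"] * params["EvaluationPeriods"]
print(window_seconds)  # 10800
```

Assuming boto3 is installed and credentials are configured, the dictionary could be passed on as `boto3.client("cloudwatch").put_metric_alarm(**params)`.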

Another thing to note is that your alarm state went from INSUFFICIENT_DATA -> ALARM. I have noticed the same behavior with some alarms I am working on.

In my case:

I have an alarm that stops an instance when LessThanThreshold 5.0 for CPUUtilization for 1 hour with 6 evaluation periods, one every 10 minutes.
When an alarm gets new data after being in INSUFFICIENT_DATA, it seems to go straight to the ALARM state, as I think it treats INSUFFICIENT_DATA as 0.0 (don't quote me on this; it is just what I am assuming based on some tests I am running).
Even though the first real datapoint could be 25.6%, the previous 5 points were INSUFFICIENT_DATA (possibly 0.0?), so the average is around 4.3, which is LessThanThreshold 5.0.
Then my alarm is triggered even though it has technically only been 10 minutes with "real" data.
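To make the arithmetic behind that guess concrete, here is a small sketch. It only illustrates the assumption that missing datapoints are counted as 0.0; it is not confirmed CloudWatch behavior, and the function name is made up.

```python
def average_with_missing_as_zero(datapoints):
    # None stands for a period with no data. Counting it as 0.0 is the
    # unconfirmed assumption being illustrated, not documented behavior.
    values = [0.0 if point is None else point for point in datapoints]
    return sum(values) / len(values)


# Five 10-minute periods without data, then one real datapoint of 25.6%.
window = [None, None, None, None, None, 25.6]
avg = average_with_missing_as_zero(window)
print(round(avg, 2))  # 4.27
print(avg < 5.0)      # True
```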

To mitigate this, I have set up a script so that whenever an instance is started, the alarm is created with it, and whenever an alarm is triggered, it deletes itself after stopping the instance it is assigned to.
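The script itself is not shown here, but the create-and-delete cycle could look roughly like the sketch below. It assumes a boto3 CloudWatch client is passed in as `cloudwatch`; the function names and alarm naming scheme are hypothetical, and the period, threshold, and evaluation count are taken from the example above.

```python
def create_idle_alarm(cloudwatch, instance_id, action_arn):
    """Create the idle alarm when the instance starts (hypothetical helper)."""
    alarm_name = f"stop-idle-{instance_id}"  # hypothetical naming scheme
    cloudwatch.put_metric_alarm(
        AlarmName=alarm_name,
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        Statistic="Average",
        Period=600,            # one datapoint every 10 minutes
        EvaluationPeriods=6,   # 6 x 10 minutes = 1 hour
        Threshold=5.0,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=[action_arn],
    )
    return alarm_name


def delete_idle_alarm(cloudwatch, alarm_name):
    """Remove the alarm once it has fired and the instance is stopped."""
    cloudwatch.delete_alarms(AlarmNames=[alarm_name])
```

Here `action_arn` would be whatever the alarm should invoke, for example an SNS topic, or an EC2 stop action such as `arn:aws:automate:us-east-1:ec2:stop` (an assumption; check the ARN format for your region). Because a fresh alarm is created at each start, it never carries a backlog of empty periods from the time the instance was stopped.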

Related Solutions

AWS Alarms in “ALARM” state not triggering Policy actions

I asked this question on the Amazon Forums and apparently there is a recent bug in the creation of Alarms that automatically sets the "ActionsEnabled" property to False.

From AWS:

We have identified an issue in the AutoScaling console regarding the binding of
AutoScaling policies to CloudWatch alarms and are working on a fix. We will post
an update to this thread once the fix is rolled out. Thanks for bringing this to
our attention.

The workaround for now:

In the meantime, please try calling the DescribeAlarms CloudWatch API. If the
alarms associated with your policies have ActionsEnabled=false, then this could
cause your policies to not be invoked when the alarm is triggered. Please try
calling the PutMetricAlarm CloudWatch API to update ActionsEnabled=true for the
affected alarms, and that should fix the issues you are experiencing.

I have confirmed the bug and the workaround with my own Alarms through the API.
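For anyone wanting to script the check, a minimal sketch follows, assuming a boto3 CloudWatch client is passed in as `cloudwatch`. Note that it uses the EnableAlarmActions API instead of the PutMetricAlarm call suggested by AWS above, since PutMetricAlarm would need the full alarm definition resent. The function name is made up, and pagination of DescribeAlarms is left out, so only the first page of alarms is inspected.

```python
def enable_disabled_alarm_actions(cloudwatch):
    """Turn actions back on for alarms that have ActionsEnabled=false."""
    response = cloudwatch.describe_alarms()  # first page only in this sketch
    disabled = [
        alarm["AlarmName"]
        for alarm in response.get("MetricAlarms", [])
        if not alarm.get("ActionsEnabled", True)
    ]
    if disabled:
        # Swapped in for PutMetricAlarm: flips the flag without redefining
        # the rest of the alarm.
        cloudwatch.enable_alarm_actions(AlarmNames=disabled)
    return disabled
```

It returns the names of the alarms it changed, which makes it easy to log what was affected.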

CloudWatch not honoring alarm settings

Your EC2 instance must be enabled for monitoring at 1-minute intervals. If you have not enabled detailed monitoring on your EC2 instance, then you would be collecting data in 5-minute intervals. 3 consecutive periods of 5-minute intervals would be 15 minutes.

I'm not certain, but after reviewing some of my own CloudWatch alarms and playing with a new one in the console, it seems like, in this case, the alarm state triggers based on minutes instead of periods -- we just define minutes in terms of periods at alarm creation time. This seems sensible to me -- otherwise your alarm wouldn't ever be able to enter the alarm state if detailed (1-minute) monitoring was disabled.

Regarding detailed monitoring: I would turn it on for this case, if it is disabled. If you are using basic (5-minute) monitoring, the 3 data points don't necessarily mean that CPU Utilization has been >= 95% for 15 consecutive minutes. It rather means that CPU Utilization was >= 95% at the time the data was sampled, for 3 consecutive samplings.
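If you prefer to switch detailed monitoring on from code rather than the console, something like this sketch should work, assuming a boto3 EC2 client is passed in as `ec2`. The function name is made up for illustration.

```python
def enable_detailed_monitoring(ec2, instance_id):
    """Request 1-minute (detailed) monitoring for a single instance."""
    response = ec2.monitor_instances(InstanceIds=[instance_id])
    # Each entry reports the monitoring state for one instance; the state
    # typically passes through a pending value before becoming enabled.
    return [
        item["Monitoring"]["State"]
        for item in response["InstanceMonitorings"]
    ]
```

With boto3 installed and credentials configured, it could be called as `enable_detailed_monitoring(boto3.client("ec2"), "i-xxxxxxx")`. Keep in mind that detailed monitoring is billed separately.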
