CloudWatch Alarm – Creating an Alarm for ECS Service Task Failures

amazon-cloudwatch, amazon-ecs, amazon-web-services, monitoring

If I release a new Docker image with a bug to my ECS Service, then the service will attempt to start new Tasks but will keep the old version around if the new tasks fail to start.

In that scenario, it will sometimes (not always) emit an Event to the bus like:

service xxx is unable to consistently start tasks successfully. For more information, see the Troubleshooting section.

and sometimes it will just emit loads of events like:

service xxx deregistered 1 targets in target-group yyy

I would like a CloudWatch Alarm to fire in this scenario. How can I achieve that?

I cannot see any CloudWatch metrics that track these events and that I could use to trigger the Alarm: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html

If the Tasks fail to boot then I don't even get any UnHealthyHostCount metrics on the LB Target Group.

I think I will have to create an EventBridge rule to watch for the event named above, but I can't see an obvious way to have that rule trigger an Alarm. I have set up a rule to forward "WARN" and "ERROR" events to SNS/email, but these events are not always emitted, so I frequently get a restart loop with no alarms firing. 🙁

Best Answer

I have the following infrastructure which I think addresses this requirement:

1. An alarm on the metric AWS/ApplicationELB / UnHealthyHostCount, which fires in some cases but not all.
2. An Event rule with the following pattern, forwarding to SNS, which catches failed tasks:

{
    "source": [
        "aws.ecs"
    ],
    "detail-type": [
        "ECS Task State Change"
    ],
    "detail": {
        "group": [
            "service:${var.ecs_service_name}"
        ],
        "stoppedReason": [
            "Essential container in task exited"
        ]
    }
}

3. An Event rule with the following pattern, forwarding to SNS, which catches the "unable to consistently start tasks successfully" event (which is only sometimes emitted):

{
    "source": [
        "aws.ecs"
    ],
    "detail-type": [
        "ECS Service Action"
    ],
    "resources": [
        "${var.ecs_service_arn}"
    ],
    "detail": {
        "eventType": ["WARN", "ERROR"]
    }
}

4. An alarm on the metric AWS/Events / TriggeredRules, which fires when rule 2 or rule 3 is triggered.
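As a rough illustration of the TriggeredRules alarm, here is a minimal Python sketch that builds the arguments for CloudWatch's put_metric_alarm call. It is not the author's actual configuration (the `${var...}` placeholders suggest that was Terraform). The rule names, alarm names, period and threshold are all assumptions chosen for the example, and it creates one alarm per rule rather than a single combined alarm.

```python
def triggered_rules_alarm(rule_name, alarm_name, alarm_actions=()):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    rule_name, alarm_name and alarm_actions are hypothetical inputs.
    Period and threshold are illustrative, not values from the answer.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Events",
        "MetricName": "TriggeredRules",
        # TriggeredRules is reported per rule via the RuleName dimension.
        "Dimensions": [{"Name": "RuleName", "Value": rule_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No datapoints means the rule did not trigger, so treat as OK.
        "TreatMissingData": "notBreaching",
        "AlarmActions": list(alarm_actions),
    }


def create_alarms(rule_names):
    """Create one alarm per EventBridge rule (requires boto3 and AWS credentials)."""
    # Imported here so the builder above can be used without boto3 installed.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    for rule_name in rule_names:
        kwargs = triggered_rules_alarm(rule_name, f"{rule_name}-triggered")
        cloudwatch.put_metric_alarm(**kwargs)
```

Calling `create_alarms(["ecs-task-stopped", "ecs-service-warn"])` (hypothetical rule names) would create one alarm for each of the two Event rules. Treating missing data as not breaching matters here because the metric only has datapoints when a rule actually triggers.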

This is quite a messy approach, but the best I could find. I am disappointed that ECS doesn't publish metrics to track this common case.

(I do not subscribe anything to the SNS topics created above; they exist solely to make the above rules valid. The events are viewable in the ECS console if required.)
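Because the task-failure pattern matches on an exact stoppedReason string, it can be worth checking it against an event copied from the console before relying on the alarm. The sketch below is a deliberately simplified stand-in for EventBridge matching: it handles exact values only, with none of the prefix, numeric or other operators. The service name, the sample event and the second stop reason are made up for illustration.

```python
def matches(pattern, event):
    """Simplified EventBridge-style matching: exact values only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern objects are matched recursively.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # A pattern list means "any of these values"; an event list
            # (such as "resources") matches if any element is listed.
            actual_values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in actual_values):
                return False
    return True


def task_stopped_pattern(service_name):
    """The task-failure pattern from the answer, with the service name filled in."""
    return {
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {
            "group": [f"service:{service_name}"],
            "stoppedReason": ["Essential container in task exited"],
        },
    }


# Hypothetical, heavily trimmed sample event.
sample = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "detail": {
        "group": "service:my-service",
        "lastStatus": "STOPPED",
        "stoppedReason": "Essential container in task exited",
    },
}

pattern = task_stopped_pattern("my-service")
print(matches(pattern, sample))  # True

# A task stopped for any other reason is not caught by this rule.
other = {**sample, "detail": {**sample["detail"], "stoppedReason": "Some other reason"}}
print(matches(pattern, other))  # False
```

The second check shows the main limitation of the pattern: tasks that stop with a different reason string do not trigger the rule, and so do not contribute to the TriggeredRules metric.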
