After a discussion with the support team, it turns out that ECS cannot support our current use case.
There is a workaround that solves one of the issues we are facing: create a separate, essential health-check container in the same ECS task as the actual application container. The purpose of the health-check container is to monitor the application container and determine when the application has started completely. If it detects that the application has failed to start, it exits, causing the ECS service to cycle the task. The ELB is then configured to perform its health checks against the health-check container, which always reports that it is up via the relevant port. This workaround prevents the ECS service from cycling the ECS task due to failed health checks.
However, the ELB will begin routing traffic to the application container immediately, even if the application container is not yet ready to receive it (for example, because it is still waiting for a cache to load). Currently there is no way to delay the ELB from sending traffic to the application container, as the ECS service provides no support for a grace period. We have managed to work around this issue by delivering messages to our application containers via SQS, and only having them pull from the queue once their caches are fully loaded. However, we have future use cases (such as serving web requests) where this is not a feasible option. To this end, I intend to raise a feature request for the grace period.
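The "pull only when ready" pattern can be sketched generically like this. `loadCache` and `receiveOnce` are placeholder functions I've introduced for illustration; in the real service, `receiveOnce` would wrap something like `sqs.receiveMessage(...).promise()` from the AWS SDK, and the loop would run indefinitely rather than stopping when the queue is drained.

```javascript
// Sketch: the consumer does not start receiving messages until the
// cache-load promise has resolved, so no work arrives before the
// application is ready.
var consumeWhenReady = (loadCache, receiveOnce, handleMessage) =>
    loadCache().then(() => {
        var loop = () =>
            receiveOnce().then((messages) => {
                messages.forEach(handleMessage);
                // Stop when drained; a real consumer would keep polling.
                return messages.length ? loop() : null;
            });
        return loop();
    });
```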
As an aside, both Kubernetes (http://kubernetes.io/v1.0/docs/user-guide/walkthrough/k8s201.html#application-health-checking) and Marathon (https://mesosphere.github.io/marathon/docs/health-checks.html) already support this option for health checking, if someone reading this is happy not to use a managed service.
It turns out my original premise (needing to know the task container's own internal IP address for service discovery) was deeply flawed - that IP address is only usable within a single EC2 Container Instance. If you have multiple container instances (which you probably should), then those task container IPs are basically useless.
The alternative solution I came up with follows the pattern suggested for Application Load Balancers running HTTP/HTTPS: declare a port mapping with 0 as the host port, pointing at the container port I need to use. With this in place, Docker assigns a random host port, which I can then discover using the AWS SDK - in particular, the "describeTasks" function available on the ECS module. See here for details: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/ECS.html#describeTasks-property
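For reference, the relevant fragment of the task definition looks roughly like this. The container name, image, and container port here are made up for illustration; the only essential detail is `hostPort: 0`.

```javascript
// Hypothetical fragment of an ECS task definition (as it would be passed
// to registerTaskDefinition). hostPort 0 tells Docker to bind the
// container port to a random available port on the host.
var taskDefinitionFragment = {
    containerDefinitions: [
        {
            name: "my-app",        // illustrative container name
            image: "my-app:latest",
            portMappings: [
                {
                    containerPort: 5432, // port the app listens on inside the container
                    hostPort: 0          // 0 = let Docker pick a random host port
                }
            ]
        }
    ]
};
```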
This is the fundamental basis for my roll-your-own service discovery mechanism, though there are a lot of other details necessary to do this in a complete fashion. I used Lambda functions calling out to the AWS SDK, along with a PostgreSQL database, to keep my list of host containers up to date (something like a dynamic DNS registry). Part of the trick is that you need both the IP and the port for each container, but describeTasks only returns the port. Here is a handy NodeJS function I wrote that takes a container name and looks up all IP addresses and ports found within the cluster for containers of that name:
/**
 * @param {String} cluster - name of the cluster to query, e.g. "sqlfiddle3"
 * @param {String} containerType - name of the container to search for within the cluster
 * @returns {Promise} - promise resolved with a list of ip/port combinations
 * found for containers of this name, like so:
 * [
 *     {
 *         "connection_meta": "{\"type\":\"ecs\",\"taskArn\":\"arn:aws:ecs:u..\"}",
 *         "port": 32769,
 *         "ip": "10.0.1.49"
 *     }
 * ]
 */
exports.getAllHostsForContainerType = (cluster, containerType) => {
    var AWS = require('aws-sdk'),
        ecs = new AWS.ECS({ apiVersion: '2014-11-13' }),
        ec2 = new AWS.EC2({ apiVersion: '2016-11-15' });

    return ecs.listTasks({ cluster }).promise()
        .then((taskList) => ecs.describeTasks({ cluster, tasks: taskList.taskArns }).promise())
        .then((taskDetails) => {
            // Collect only the containers whose name matches, tagging each one
            // with the ARN of the container instance its task is running on.
            var containersForName = taskDetails.tasks
                .map((taskDetail) =>
                    taskDetail.containers
                        .filter((container) => container.name === containerType)
                        .map((container) => {
                            container.containerInstanceArn = taskDetail.containerInstanceArn;
                            return container;
                        })
                )
                .reduce((final, containers) => final.concat(containers), []);

            if (!containersForName.length) {
                return [];
            }

            return ecs.describeContainerInstances({
                cluster,
                containerInstances: containersForName.map(
                    (containerDetails) => containerDetails.containerInstanceArn
                )
            }).promise()
                .then((containerInstanceList) => {
                    // Attach the matching container-instance record to each container...
                    containersForName.forEach((containerDetails) => {
                        containerDetails.containerInstanceDetails =
                            containerInstanceList.containerInstances.filter((instance) =>
                                instance.containerInstanceArn === containerDetails.containerInstanceArn
                            )[0];
                    });
                    // ...then look up the underlying EC2 instances to get their private IPs.
                    return ec2.describeInstances({
                        InstanceIds: containerInstanceList.containerInstances.map(
                            (instance) => instance.ec2InstanceId
                        )
                    }).promise();
                })
                .then((instanceDetails) => {
                    var instanceList = instanceDetails.Reservations.reduce(
                        (final, res) => final.concat(res.Instances), []
                    );
                    containersForName.forEach((containerDetails) => {
                        if (containerDetails.containerInstanceDetails) {
                            containerDetails.containerInstanceDetails.ec2Instance = instanceList.filter(
                                (instance) =>
                                    instance.InstanceId === containerDetails.containerInstanceDetails.ec2InstanceId
                            )[0];
                        }
                    });
                    return containersForName;
                });
        })
        .then((containersForName) => containersForName.map((container) => ({
            connection_meta: JSON.stringify({
                type: "ecs",
                taskArn: container.taskArn
            }),
            // assumes that this container has exactly one network binding
            port: container.networkBindings[0].hostPort,
            ip: container.containerInstanceDetails.ec2Instance.PrivateIpAddress
        })));
};
Note that the AWS SDK's built-in .promise() support covers everything here, so a separate promise library such as 'Q' is not actually needed as a dependency.
The rest of my custom solution for handling ECS service discovery using Lambda functions can be found here: https://github.com/jakefeasel/sqlfiddle3#setting-up-in-amazon-web-services
There are a couple of ways to debug this: