After a discussion with the support team, it turns out that ECS cannot support our current use case.
There is a workaround that solves one of the issues we are facing: create a separate, essential health-check container in the same ECS task as the actual application container. The purpose of the health-check container is to monitor the application container and determine when the application has started completely. If it detects that the application has failed to start, it exits, causing the ECS service to cycle the task. The ELB is then configured to perform its health checks against the health-check container, which always reports that it is up via the relevant port. This workaround prevents the ECS service from cycling the ECS task due to failed health checks.
However, the ELB will begin routing traffic to the application container immediately, even if the application container is not yet ready to receive it (for example, because it is still waiting for a cache to load). Currently there is no way to delay the ELB from sending traffic to the application container, as the ECS service provides no support for a grace period. We have managed to work around this issue by delivering messages to our application containers via SQS, and only having them pull from the queue once their caches are fully loaded. However, we have future use cases (such as serving web requests) where this is not a feasible option. To this end, I intend to raise a feature request for the grace period.
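The "pull only when ready" pattern can be sketched generically like this. `loadCache` and `receiveOnce` are placeholder functions I've introduced for illustration; in the real service, `receiveOnce` would wrap something like `sqs.receiveMessage(...).promise()` from the AWS SDK, and the loop would run indefinitely rather than stopping when the queue is drained.

```javascript
// Sketch: the consumer does not start receiving messages until the
// cache-load promise has resolved, so no work arrives before the
// application is ready.
var consumeWhenReady = (loadCache, receiveOnce, handleMessage) =>
    loadCache().then(() => {
        var loop = () =>
            receiveOnce().then((messages) => {
                messages.forEach(handleMessage);
                // Stop when drained; a real consumer would keep polling.
                return messages.length ? loop() : null;
            });
        return loop();
    });
```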
As an aside, both Kubernetes (http://kubernetes.io/v1.0/docs/user-guide/walkthrough/k8s201.html#application-health-checking) and Marathon (https://mesosphere.github.io/marathon/docs/health-checks.html) already support this option for health checking, if someone reading this is happy not to use a managed service.
It turns out my original premise (needing to know the task container's own internal IP address for service discovery) was deeply flawed - that IP address is only usable within a single EC2 Container Instance. If you have multiple container instances (which you probably should), then those task container IPs are basically useless.
The alternative solution I came up with follows the pattern suggested for Application Load Balancers running HTTP/HTTPS: declare a port mapping with 0 as the host port, pointing at the container port I need to use. With this in place, Docker assigns a random host port, which I can then discover using the AWS SDK - in particular, the "describeTasks" function available on the ECS module. See here for details: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/ECS.html#describeTasks-property
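For reference, the relevant fragment of the task definition looks roughly like this. The container name, image, and container port here are made up for illustration; the only essential detail is `hostPort: 0`.

```javascript
// Hypothetical fragment of an ECS task definition (as it would be passed
// to registerTaskDefinition). hostPort 0 tells Docker to bind the
// container port to a random available port on the host.
var taskDefinitionFragment = {
    containerDefinitions: [
        {
            name: "my-app",        // illustrative container name
            image: "my-app:latest",
            portMappings: [
                {
                    containerPort: 5432, // port the app listens on inside the container
                    hostPort: 0          // 0 = let Docker pick a random host port
                }
            ]
        }
    ]
};
```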
This is the fundamental basis for my roll-your-own service discovery mechanism, though there are a lot of other details necessary to do this in a complete fashion. I used Lambda functions calling out to the AWS SDK, along with a PostgreSQL database, to keep my list of host containers up to date (something like a dynamic DNS registry). Part of the trick is that you need both the IP and the port for each container, but describeTasks only returns the port. Here is a handy NodeJS function I wrote that takes a container name and looks up all IP addresses and ports found within the cluster for containers of that name:
/**
 * @param {String} cluster - name of the cluster to query, e.g. "sqlfiddle3"
 * @param {String} containerType - name of the container to search for within the cluster
 * @returns {Promise} - promise resolved with a list of ip/port combinations
 * found for containers of this name, like so:
 * [
 *     {
 *         "connection_meta": "{\"type\":\"ecs\",\"taskArn\":\"arn:aws:ecs:u..\"}",
 *         "port": 32769,
 *         "ip": "10.0.1.49"
 *     }
 * ]
 */
exports.getAllHostsForContainerType = (cluster, containerType) => {
    var AWS = require('aws-sdk'),
        ecs = new AWS.ECS({ apiVersion: '2014-11-13' }),
        ec2 = new AWS.EC2({ apiVersion: '2016-11-15' });

    return ecs.listTasks({ cluster }).promise()
        .then((taskList) => ecs.describeTasks({ cluster, tasks: taskList.taskArns }).promise())
        .then((taskDetails) => {
            // Collect only the containers whose name matches, tagging each one
            // with the ARN of the container instance its task is running on.
            var containersForName = taskDetails.tasks
                .map((taskDetail) =>
                    taskDetail.containers
                        .filter((container) => container.name === containerType)
                        .map((container) => {
                            container.containerInstanceArn = taskDetail.containerInstanceArn;
                            return container;
                        })
                )
                .reduce((final, containers) => final.concat(containers), []);

            if (!containersForName.length) {
                return [];
            }

            return ecs.describeContainerInstances({
                cluster,
                containerInstances: containersForName.map(
                    (containerDetails) => containerDetails.containerInstanceArn
                )
            }).promise()
                .then((containerInstanceList) => {
                    // Attach the matching container-instance record to each container...
                    containersForName.forEach((containerDetails) => {
                        containerDetails.containerInstanceDetails =
                            containerInstanceList.containerInstances.filter((instance) =>
                                instance.containerInstanceArn === containerDetails.containerInstanceArn
                            )[0];
                    });
                    // ...then look up the underlying EC2 instances to get their private IPs.
                    return ec2.describeInstances({
                        InstanceIds: containerInstanceList.containerInstances.map(
                            (instance) => instance.ec2InstanceId
                        )
                    }).promise();
                })
                .then((instanceDetails) => {
                    var instanceList = instanceDetails.Reservations.reduce(
                        (final, res) => final.concat(res.Instances), []
                    );
                    containersForName.forEach((containerDetails) => {
                        if (containerDetails.containerInstanceDetails) {
                            containerDetails.containerInstanceDetails.ec2Instance = instanceList.filter(
                                (instance) =>
                                    instance.InstanceId === containerDetails.containerInstanceDetails.ec2InstanceId
                            )[0];
                        }
                    });
                    return containersForName;
                });
        })
        .then((containersForName) => containersForName.map((container) => ({
            connection_meta: JSON.stringify({
                type: "ecs",
                taskArn: container.taskArn
            }),
            // assumes that this container has exactly one network binding
            port: container.networkBindings[0].hostPort,
            ip: container.containerInstanceDetails.ec2Instance.PrivateIpAddress
        })));
};
Note that the AWS SDK's built-in .promise() support covers everything here, so a separate promise library such as 'Q' is not actually needed as a dependency.
The rest of my custom solution for handling ECS service discovery using Lambda functions can be found here: https://github.com/jakefeasel/sqlfiddle3#setting-up-in-amazon-web-services
There are a couple of ways to debug this: