Autoscaling health checks fail in Amazon-provided ECS CloudFormation template.

amazon-cloudformationamazon-ecsamazon-web-services

I am attempting to stand up a new ECS cluster using the CloudFormation ECS Service template AWS provided here as a guide. My ECS instances boot within the AutoScaling group, but then fail a health check and are always terminated.

The output doesn't really tell me much about what checks are failing or why.

The CloudFormation code I am using is pretty much the stock code provided in the AWS docs. I added a security group with broader permissions (so that I could SSH in as I iterate through) and updated the AMI to the most recent version of ECS-optimized Amazon Linux in us-east-1.

Current template:

{
  "AWSTemplateFormatVersion" : "2010-09-09",
  "Description": "Deploys PoC ECS infrastructure.",  

  "Parameters" : {
    "KeyName": {
      "Description": "Name of an existing EC2 KeyPair to enable SSH access to the Elastic Beanstalk and Bastion hosts",
      "Type": "String",
      "MinLength": "1",
      "MaxLength": "255",
      "AllowedPattern": "[\\x20-\\x7E]*",
      "ConstraintDescription": "can contain only ASCII characters.",
      "Default": "smx-test-key"
    },
    "SubnetID": {
      "Type": "List<AWS::EC2::Subnet::Id>",
      "Description": "Select a default subnet ID."
    },    
    "DesiredCapacity": {
      "Type": "Number",
      "Default" : "1",
      "Description": "Number of instances to launch in your ECS cluster."
    },    
    "MaxSize": {
      "Type": "Number",
      "Default" : "1",
      "Description": "Maximum number of instances that can be launched in your ECS cluster."
    },
    "ECSInstanceType": {
      "Description": "The type of instance to use for ECS app servers",
      "Type": "String",
      "Default": "t2.micro",
      "AllowedValues": ["t2.micro", "t2.small", "t2.medium", "t2.large", "m3.medium", "m3.large", "m3.xlarge" ]
    },
    "SSHLocation" : {
      "Description" : " The IP address range that can be used to SSH to the EC2 instances.",
      "Type": "String",
      "MinLength": "9",
      "MaxLength": "18",
      "Default": "0.0.0.0/0",
      "AllowedPattern": "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})",
      "ConstraintDescription": "must be a valid IP CIDR range of the form x.x.x.x/x."
    }
  },

  "Mappings" : {
    "AWSRegionToAMI" : {
      "us-east-1"      : { "AMIID" : "ami-5d1b984a" }
    }
  },

  "Resources" : {

    "ECSCluster": {
      "Type": "AWS::ECS::Cluster"
    },

    "taskdefinition": {
      "Type": "AWS::ECS::TaskDefinition",
      "Properties" : {
        "ContainerDefinitions" : [
          {
            "Name": "simple-app",
            "Cpu": "10",
            "Essential": "true",
            "Image":"httpd:2.4",
            "Memory":"300",
            "MountPoints": [{
              "ContainerPath": "/usr/local/apache2/htdocs",
              "SourceVolume": "my-vol"
            }],
            "PortMappings": [
              { "HostPort": 80, "ContainerPort": 80 }
            ]
          },
          {
            "Name": "busybox",
            "Cpu": 10,
            "Command": [
              "/bin/sh -c \"while true; do echo '<html> <head> <title>Amazon ECS Sample App</title> <style>body {margin-top: 40px; background-color: #333;} </style> </head><body> <div style=color:white;text-align:center> <h1>Amazon ECS Sample App</h1> <h2>Congratulations!</h2> <p>Your application is now running on a container in Amazon ECS.</p>' > top; /bin/date > date ; echo '</div></body></html>' > bottom; cat top date bottom > /usr/local/apache2/htdocs/index.html ; sleep 1; done\""
            ],      
            "EntryPoint": [ "sh", "-c"],
            "Essential": false,
            "Image": "busybox",
            "Memory": 200,
            "VolumesFrom": [
              {
                "SourceContainer": "simple-app"
              }
            ]
          }
        ],
        "Volumes": [
          { "Name": "my-vol" }
        ]
      }
    },    

    "EcsElasticLoadBalancer" : {
      "Type" : "AWS::ElasticLoadBalancing::LoadBalancer",
      "Properties" : {
        "Subnets" : { "Ref" : "SubnetID" },
        "Listeners" : [ {
          "LoadBalancerPort" : "80",
          "InstancePort" : "80",
          "Protocol" : "HTTP"
        } ],
        "HealthCheck" : {
          "Target" : "HTTP:80/",
          "HealthyThreshold" : "2",
          "UnhealthyThreshold" : "10",
          "Interval" : "30",
          "Timeout" : "5"
        }
      }
    }, 

    "ECSAutoScalingGroup" : {
      "Type" : "AWS::AutoScaling::AutoScalingGroup",
      "Properties" : {
        "VPCZoneIdentifier" : { "Ref" : "SubnetID" },
        "LaunchConfigurationName" : { "Ref" : "ContainerInstances" },
        "MinSize" : "1",
        "MaxSize" : { "Ref" : "MaxSize" },
        "DesiredCapacity" : { "Ref" : "DesiredCapacity" }
      },
      "CreationPolicy" : {
        "ResourceSignal" : {
          "Timeout" : "PT60M"
        }
      },
      "UpdatePolicy": {
        "AutoScalingRollingUpdate": {
          "MinInstancesInService": "1",
          "MaxBatchSize": "1",
          "PauseTime" : "PT60M",
          "WaitOnResourceSignals": "true"
        }
      }
    },   

    "ContainerInstances": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Metadata" : {
        "AWS::CloudFormation::Init" : {
          "config" : {
            "commands" : {
              "01_add_instance_to_cluster" : {
                "command" : { "Fn::Join": [ "", [ "#!/bin/bash\n", "echo ECS_CLUSTER=", { "Ref": "ECSCluster" }, " >> /etc/ecs/ecs.config" ] ] }
              }
            },
            "files" : {
              "/etc/cfn/cfn-hup.conf" : {
                "content" : { "Fn::Join" : ["", [
                  "[main]\n",
                  "stack=", { "Ref" : "AWS::StackId" }, "\n",
                  "region=", { "Ref" : "AWS::Region" }, "\n"
                ]]},
                "mode"    : "000400",
                "owner"   : "root",
                "group"   : "root"
              },
              "/etc/cfn/hooks.d/cfn-auto-reloader.conf" : {
                "content": { "Fn::Join" : ["", [
                  "[cfn-auto-reloader-hook]\n",
                  "triggers=post.update\n",
                  "path=Resources.ContainerInstances.Metadata.AWS::CloudFormation::Init\n",
                  "action=/opt/aws/bin/cfn-init -v ",
                  "         --stack ", { "Ref" : "AWS::StackName" },
                  "         --resource ContainerInstances ",
                  "         --region ", { "Ref" : "AWS::Region" }, "\n",
                  "runas=root\n"
                ]]}
              }
            },
            "services" : {
              "sysvinit" : {
                "cfn-hup" : { "enabled" : "true", "ensureRunning" : "true", "files" : ["/etc/cfn/cfn-hup.conf", "/etc/cfn/hooks.d/cfn-auto-reloader.conf"] }
              }
            }
          }
        }
      },
      "Properties": {
        "ImageId" : { "Fn::FindInMap" : [ "AWSRegionToAMI", { "Ref" : "AWS::Region" }, "AMIID" ] },
        "InstanceType"   : { "Ref" : "ECSInstanceType" },
        "IamInstanceProfile": { "Ref": "EC2InstanceProfile" },
        "KeyName"        : { "Ref" : "KeyName" },
        "SecurityGroups": { "Ref" : "ECSSecurityGroup" },
        "UserData"       : { "Fn::Base64" : { "Fn::Join" : ["", [
             "#!/bin/bash -xe\n",
             "yum install -y aws-cfn-bootstrap\n",

             "/opt/aws/bin/cfn-init -v ",
             "         --stack ", { "Ref" : "AWS::StackName" },
             "         --resource ContainerInstances ",
             "         --region ", { "Ref" : "AWS::Region" }, "\n",

             "/opt/aws/bin/cfn-signal -e $? ",
             "         --stack ", { "Ref" : "AWS::StackName" },
             "         --resource ECSAutoScalingGroup ",
             "         --region ", { "Ref" : "AWS::Region" }, "\n"
        ]]}},
        "Tags" : [ {"Key" : "Name", "Value" : "ECS autoscaling instance"} ]
      }
    },

    "service": {
      "Type": "AWS::ECS::Service",
      "DependsOn": ["ECSAutoScalingGroup"],
      "Properties" : {
        "Cluster": {"Ref": "ECSCluster"},
        "DesiredCount": "1",
        "LoadBalancers": [
          {
            "ContainerName": "simple-app",
            "ContainerPort": "80",
            "LoadBalancerName" : { "Ref" : "EcsElasticLoadBalancer" }
          }
        ],
        "Role" : {"Ref":"ECSServiceRole"},
        "TaskDefinition" : {"Ref":"taskdefinition"}
      }
    },

    "ECSServiceRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "ecs.amazonaws.com"
                ]
              },
              "Action": [
                "sts:AssumeRole"
              ]
            }
          ]
        },
        "Path": "/",
        "Policies": [
          {
            "PolicyName": "ecs-service",
            "PolicyDocument": {
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": [
                    "elasticloadbalancing:Describe*",
                    "elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
                    "elasticloadbalancing:RegisterInstancesWithLoadBalancer",
                    "ec2:Describe*",
                    "ec2:AuthorizeSecurityGroupIngress"
                  ],
                  "Resource": "*"
                }
              ]
            }
          }
        ]
      }
    },

    "EC2Role": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "ec2.amazonaws.com"
                ]
              },
              "Action": [
                "sts:AssumeRole"
              ]
            }
          ]
        },
        "Path": "/",
        "Policies": [
          {
            "PolicyName": "ecs-service",
            "PolicyDocument": {
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": [
                    "ecs:CreateCluster",
                    "ecs:DeregisterContainerInstance",
                    "ecs:DiscoverPollEndpoint",
                    "ecs:Poll",
                    "ecs:RegisterContainerInstance",
                    "ecs:StartTelemetrySession",
                    "ecs:Submit*",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                  ],
                  "Resource": "*"
                }
              ]
            }
          }
        ]
      }
    },

    "EC2InstanceProfile": {
      "Type": "AWS::IAM::InstanceProfile",
      "Properties": {
        "Path": "/",
        "Roles": [ { "Ref": "EC2Role" } ]
      }
    }

    "ECSSecurityGroup" : {
      "Type" : "AWS::EC2::SecurityGroup",
      "Properties" : {
        "GroupDescription" : "Fortigate recommended settings.  See marketplace for docs.",
        "VpcId" : { "Ref" : "VPC" },
        "SecurityGroupIngress" : [
           { "IpProtocol" : "tcp", "FromPort" : "22", "ToPort" : "22", "CidrIp" : "0.0.0.0/0" },
           { "IpProtocol" : "tcp", "FromPort" : "80", "ToPort" : "80", "CidrIp" : "0.0.0.0/0" },
           { "IpProtocol" : "icmp", "FromPort" : "-1",  "ToPort" : "-1",  "CidrIp" : {"Fn::GetAtt" : [ "VPC" , "CidrBlock" ]}}
        ],
        "Tags" : [ {"Key" : "Name", "Value" : "ECS Security Group"} ]
      }
    }
  },

  "Outputs" : {
    "ecsservice" : {
      "Value" : { "Ref" : "service" }
    },
    "ecscluster" : {
      "Value" : { "Ref" : "ECSCluster" }
    },
    "taskdef" : {
      "Value" : { "Ref" : "taskdefinition" }
    }
  }
}

When I create the stack, everything up to the AutoScaling group completes. The AS group get created and the instance boots. But then the health check fails, the instance is terminated, and the stack rolls back. CloudFormation shows the creation of the autoscaling group failed with Received 0 SUCCESS signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement.

Steps taken to troubleshoot so far:

  • Rolled the AMI mappings back to the versions listed in their sample results in the same failing checks. So that's not it.
  • Set the stack to a longer timeout (60m) to SSH in and start poking around the instance.
  • Looks like the CloudFormation Init isn't being run from the AWS::AutoScaling::LaunchConfiguration resource. /etc/ecs/ecs.config doesn't exist, other files are absent or missing config. But the LaunchConfiguration resources DOES appear to complete in the CloudFormation console event logs.

Couple questions:

  • Before we troubleshoot our way through this, I guess I should ask if there is there a more up-to-date, verified working template posted somewhere?
  • Does the AWS::CloudFormation::Init code for the "ContainerInstances" resource look okay?
  • Does the AutoScaling group itself look okay?
  • How can I best troubleshoot my way through these config issues?

Best Answer

Use this user data :

"UserData": { "Fn::Base64": { "Fn::Join": [ "", [ "#!/bin/bash -xe\n", "sudo mkdir /etc/ecs/\n", "sudo chmod 777 /etc/ecs/\n", "echo ECS_CLUSTER=", { "Ref": "ECSCluster" }, " >> /etc/ecs/ecs.config\n", "yum install -y aws-cfn-bootstrap\n", "/opt/aws/bin/cfn-signal -e $? ", " --stack ", { "Ref": "AWS::StackName" }, " --resource ECSAutoScalingGroup ", " --region ", { "Ref": "AWS::Region" }, "\n" ] ] } }

And add "cloudformation:SignalResource" to EC2Role Policy document.