AWS – Flink Taskmanager on ECS Cannot Connect to Jobmanager on EC2


I have an EC2 instance in us-east-1b running the Flink jobmanager (which coordinates work across multiple taskmanagers via RPC) and the history server. I can see from netstat that the jobmanager is listening on :::6123 for incoming taskmanager connections.

I have an Auto Scaling group (ASG) which will run up an EC2 instance into the same AZ, subnet and security group as the jobmanager instance.

The security group allows All Traffic on all ports from any source in the group to any destination in the group, in both its inbound and outbound rules.

I'm using that ASG as a capacity provider for ECS tasks. I'm then trying to run up a task in ECS that runs the taskmanager and uses that ASG.

The taskmanager starts up, but won't connect to the jobmanager:

2021-09-28 13:52:08,651 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Could not resolve ResourceManager address akka.tcp://flink@ip-xxx-xx-x-xxx.ec2.internal:6123/user/rpc/resourcemanager_*, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@ip-xxx-xx-x-xxx.ec2.internal:6123/user/rpc/resourcemanager_*.
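
When reading a log line like the one above, it can help to pull out the exact host and port the taskmanager is dialing, so they can be compared character for character with what the jobmanager is configured to use. The following is only a rough sketch: the function name `rpc_targets` and the sample hostname are made up for illustration, and the pattern assumes plain hostnames or IPv4 addresses in the `akka.tcp://system@host:port` form.

```python
import re

# Matches addresses of the form akka.tcp://<system>@<host>:<port>
AKKA_ADDRESS = re.compile(
    r"akka\.tcp://(?P<system>[^@/\s]+)@(?P<host>[^:/\s]+):(?P<port>\d+)"
)


def rpc_targets(log_line: str) -> list:
    """Return every (host, port) pair found in akka.tcp addresses on the line."""
    return [(m["host"], int(m["port"])) for m in AKKA_ADDRESS.finditer(log_line)]


# Hypothetical sample line; the hostname is a placeholder, not a real address.
sample = (
    "Could not resolve ResourceManager address "
    "akka.tcp://flink@jobmanager.example.internal:6123/user/rpc/resourcemanager_*, "
    "retrying in 10000 ms"
)
print(rpc_targets(sample))
```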

I've ssh-d onto the instance run up by the ASG and confirmed that I can curl the jobmanager on ip-xxx-xx-x-xxx.ec2.internal:8081 – it works. So I know that the taskmanager instance can see the jobmanager instance.
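
Note that curl against 8081 only exercises the web port, not the RPC port 6123. A minimal sketch of checking the RPC port directly follows; `can_connect` is a hypothetical helper name, and the hostname in the comment is the placeholder from this question. A successful TCP connect only rules out network and security group problems; it does not show that the RPC endpoint will accept messages sent to that address.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example use from the taskmanager instance (substitute the real hostname):
#   can_connect("ip-xxx-xx-x-xxx.ec2.internal", 8081)  # web port
#   can_connect("ip-xxx-xx-x-xxx.ec2.internal", 6123)  # RPC port
```

If both ports report as reachable and the taskmanager still cannot register, the problem is more likely in the Flink configuration than in the VPC setup.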

To summarise:

  • The taskmanager and jobmanager are in the same VPC, the same AZ, the same subnet and the same security group
  • The security group allows all inbound traffic from sources in the same security group
  • The security group allows all outbound traffic to any destination
  • The jobmanager is running on an EC2 instance manually created
  • The taskmanager is running on an EC2 instance created as part of an ASG by ECS. The taskmanager runs in a container on ECS
  • I can curl the jobmanager from the taskmanager node
  • The taskmanager and jobmanager communicate over RPC
  • The taskmanager won't resolve the address of the jobmanager

Why won't my task connect? I've also tried the public IP (v4) and the private IP (v4).

Best Answer

Today I discovered why this wasn't working.

The jobmanager was configured with:

jobmanager.rpc.address: localhost

and so, whilst listening on the right RPC port, it was not accepting messages addressed to any other host.

When I changed it to match the address the taskmanager was using:

jobmanager.rpc.address: ip-xxx-xx-x-xxx.ec2.internal

then the task manager connected immediately.
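
To catch this kind of misconfiguration before starting the cluster, something like the sketch below could be run against the jobmanager's config file. It is only an illustration under assumptions: `read_flink_option` and `is_loopback` are hypothetical helper names, and the parser handles only flat `key: value` lines like the ones shown above, not full YAML.

```python
import ipaddress


def read_flink_option(conf_text: str, key: str):
    """Return the value of a flat `key: value` line, or None if the key is absent."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        name, value = line.split(":", 1)
        if name.strip() == key:
            return value.strip()
    return None


def is_loopback(address: str) -> bool:
    """Return True for localhost or a loopback IP such as 127.0.0.1 or ::1."""
    if address.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(address).is_loopback
    except ValueError:
        return False  # not an IP literal, so a hostname other than localhost


conf = "jobmanager.rpc.address: localhost\njobmanager.rpc.port: 6123\n"
address = read_flink_option(conf, "jobmanager.rpc.address")
if address is not None and is_loopback(address):
    print(f"jobmanager.rpc.address is {address}: remote taskmanagers cannot use it")
```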