AWS regional traffic: Track down where it comes from

amazon-ec2, amazon-web-services

I recently started running a cluster of several machines on AWS EC2. Since starting this project, I have been seeing charges for regional traffic in my billing information:

regional data transfer – in/out/between EC2 AZs or using elastic IPs or ELB

Going by the name, there are three possibilities:

  • different Availability Zones
  • communication using elastic IPs
  • using an Elastic Load Balancer

My machines were indeed in different AZs, which was one problem. I fixed that, and all machines are now in the same AZ, but the costs have kept increasing for 24 hours now (there were 3 billing updates during that time). So putting all machines into the same AZ did not solve it.
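
To rule out a stray instance in another AZ, the placement and private IP of every instance can be listed with the AWS CLI (assuming it is installed and configured), for example:

aws ec2 describe-instances --query 'Reservations[].Instances[].[InstanceId,Placement.AvailabilityZone,PrivateIpAddress]' --output table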

However, I use neither Elastic IPs nor an ELB. When I open those pages in the AWS console, I just get an empty list with a message that I do not have any such resources at the moment.
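
To double-check this outside the console (again assuming the AWS CLI is set up), both resource types can be listed directly; an empty result means none exist:

aws ec2 describe-addresses

aws elb describe-load-balancers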

Another Server Fault question says this also happens when public IP addresses are used for communication, but a GitHub discussion points out that the public DNS name is resolved to the internal IP when queried from inside EC2 (traffic sent to the public IP itself, however, always goes through the external network and would therefore incur costs).
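
This is easy to verify from inside the VPC: resolving an instance's public DNS name (the hostname below is a made-up example) returns the private 172.31.x.x address when queried from within EC2, and the public IP when queried from outside:

dig +short ec2-54-198-22-153.compute-1.amazonaws.com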

If I trace the network traffic on the master and on one of the slaves in my cluster using

sudo tcpdump -i eth0 | grep -v $MY_HOSTNAME

I can see only internal traffic like this:

IP ip-172-31-48-176.ec2.internal.56372 > ip-172-31-51-15.ec2.internal.54768

So my question: how can I find out which component is causing this regional traffic?

Best Answer

tl;dr

The huge amount of regional traffic was caused by apt-get update and apt-get install runs at machine startup.

At first I suspected the software I am running on the cluster, because it sends out an enormous number of DNS requests (it probably does not do any DNS caching), and the DNS server sits in another Availability Zone.
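
Which DNS server an instance actually uses can be seen in its resolver configuration; in a default 172.31.0.0/16 VPC this should be the Amazon-provided resolver at the VPC base address plus two, i.e. 172.31.0.2 (the address that shows up again in the tcpdump output further down):

cat /etc/resolv.conf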

Full way to debug such stuff

I debugged this together with a friend; here is how we arrived at the solution, so that anyone with this issue can follow along:

First of all, the billing console shows that this traffic costs $0.01 per GB. That matches the following items from the EC2 pricing page (which go into a bit more detail):

  • Amazon EC2, Amazon RDS, Amazon Redshift and Amazon ElastiCache instances or Elastic Network Interfaces in the same Availability Zone
    • Using a public or Elastic IP address
  • Amazon EC2, Amazon RDS, Amazon Redshift and Amazon ElastiCache instances or Elastic Network Interfaces in another Availability Zone or peered VPC in the same AWS Region

Next I checked the AWS documentation on Availability Zones and Regions. What I have to pay for is definitely traffic that stays within the same region (us-east-1 in my case). It is either traffic passing from one AZ to another (which we already knew) or traffic using a public or Elastic IP address within the same AZ (which we also knew from the other Server Fault question). The important new insight is that this list appears to be exhaustive.

I knew I had (the CLI checks after this list confirm the "no" items):

  • 6 EC2 machines in a cluster
  • no RDS
  • no Redshift
  • no ElastiCache
  • no Elastic IP address
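
Each of the "no" items is quick to confirm from the command line; an empty list means the service is not in use:

aws rds describe-db-instances
aws redshift describe-clusters
aws elasticache describe-cache-clusters
aws ec2 describe-addresses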

Peered VPC

VPC is a product of its own, so go to the VPC console. There you can see how many VPCs you have. In my case there was only one, so peering is not possible at all. You can still open Peering Connections and check whether anything is configured there.
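
The same check works from the command line; an empty result means there are no peering connections:

aws ec2 describe-vpc-peering-connections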

Subnets

The Subnets section of the VPC console also gave us an important clue for further debugging: the IP ranges of the different Availability Zones in us-east-1:

  • 172.31.0.0/20 for us-east-1a
  • 172.31.16.0/20 for us-east-1b
  • 172.31.32.0/20 for us-east-1e
  • 172.31.48.0/20 for us-east-1d

Since all my machines should be in us-east-1d, I verified that. And indeed they all had IPs starting with 172.31.48, 172.31.51 and 172.31.54, all inside 172.31.48.0/20. So far, so good.
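
The AZ-to-CIDR mapping can also be pulled without clicking through the console:

aws ec2 describe-subnets --query 'Subnets[].[AvailabilityZone,CidrBlock]' --output table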

tcpdump

This finally helped us set the right filters for tcpdump. Knowing which IPs I can talk to without incurring costs (only the 172.31.48.0/20 network), we set up a tcpdump filter. It removed all the noise that had kept me from seeing the external communication. Before that I had not even realized that traffic to [something].ec2.internal could be the problem, because I did not know enough about regions, AZs and their respective IP ranges.

First we came up with this tcpdump filter:

tcpdump "not src net 172.31.48.0 mask 255.255.240.0" -i eth0

This should show all traffic coming from anywhere except us-east-1d. It showed a lot of traffic from my SSH connection, but something odd flew by: an ec2.internal address. Shouldn't those all have been filtered out, since we no longer show AZ-internal traffic?

IP ip-172-31-0-2.ec2.internal.domain > ip-172-31-51-15.ec2.internal.60851

But this is not AZ-internal! It comes from another AZ, namely us-east-1a. It is the DNS system, the Amazon-provided VPC resolver at the VPC base address plus two (172.31.0.2).

I extended the filter to check how many of these messages occur:

sudo tcpdump "not src net 172.31.48.0 mask 255.255.240.0 and not src host $MY_HOSTNAME" -i eth0

I waited 10 seconds, stopped the capture, and counted 16 responses from the DNS server!
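
As an aside, libpcap also understands CIDR notation, so an equivalent form of the same filter is:

sudo tcpdump -i eth0 "not src net 172.31.48.0/20 and not src host $MY_HOSTNAME"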

The next days: still the same problem

However, after installing dnsmasq as a local caching resolver, nothing changed: still several GB of traffic whenever I used the cluster.
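
For anyone reproducing this step: a minimal dnsmasq caching setup looks roughly like this (a sketch; details such as how /etc/resolv.conf gets pointed at 127.0.0.1 depend on the Ubuntu release and whether resolvconf is in use):

sudo apt-get install dnsmasq
echo "listen-address=127.0.0.1" | sudo tee -a /etc/dnsmasq.conf
echo "cache-size=1000" | sudo tee -a /etc/dnsmasq.conf
sudo service dnsmasq restart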

Day by day I removed more tasks from the cluster, and finally I tried one day without any startup scripts (no extra traffic!) and one day with only the startup scripts plus an immediate shutdown (traffic!).

Analyzing the startup script revealed that apt-get update and apt-get install ... are the only components pulling large amounts of data. Some googling taught me that Ubuntu indeed runs a package mirror inside AWS, which can also be seen in sources.list:

http://us-east-1.ec2.archive.ubuntu.com/ubuntu/
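
The configured mirrors can be listed quickly (paths may vary slightly between releases):

grep -h '^deb ' /etc/apt/sources.list /etc/apt/sources.list.d/*.list 2>/dev/null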

Resolving the hostname leads to the following IP addresses:

us-east-1.ec2.archive.ubuntu.com.   30  IN  A   54.87.136.115
us-east-1.ec2.archive.ubuntu.com.   30  IN  A   54.205.195.154
us-east-1.ec2.archive.ubuntu.com.   30  IN  A   54.198.110.211
us-east-1.ec2.archive.ubuntu.com.   30  IN  A   54.144.108.75
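
The records above are ordinary DNS answers; a lookup such as the following reproduces them, although the exact IPs rotate over time:

dig +noall +answer us-east-1.ec2.archive.ubuntu.com A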

So I set up VPC Flow Logs and logged the cluster during boot. Then I downloaded the log files and ran them through a Python script that sums up all bytes transferred to or from any of these 4 IP addresses. The result matches my traffic: in the last test I had 1.5 GB of traffic with 3 clusters of 5 machines each, and according to the flow logs each machine pulls about 100 MB from the Ubuntu repository.
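
The summing itself is simple. Here is a minimal sketch of such a script, assuming the default version-2 flow-log record format (version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status) exported to plain text files with one record per line, and using the four mirror IPs from above (which rotate, so re-resolve before reusing it):

import sys
# A records of us-east-1.ec2.archive.ubuntu.com at the time of the test
MIRROR_IPS = {"54.87.136.115", "54.205.195.154", "54.198.110.211", "54.144.108.75"}
total = 0
for path in sys.argv[1:]:                      # flow-log text files as arguments
    with open(path) as logfile:
        for line in logfile:
            fields = line.split()
            if len(fields) < 14:               # skip blank or truncated lines
                continue
            srcaddr, dstaddr, nbytes = fields[3], fields[4], fields[9]
            if srcaddr in MIRROR_IPS or dstaddr in MIRROR_IPS:
                try:
                    total += int(nbytes)
                except ValueError:             # NODATA/SKIPDATA records use '-'
                    pass
print("bytes to/from the Ubuntu mirror: %d (%.1f MB)" % (total, total / 1e6))

Called with the downloaded log files as arguments, it prints the total volume exchanged with the mirror, which in my case matched the billed regional traffic.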