Amazon EC2 – Restricting Access to Metadata for Specific Docker Containers

Tags: amazon-ec2, docker, iptables

I'm running Docker on AWS EC2 instances, and I'd like to block certain containers from accessing
the EC2 instance metadata (at IP address 169.254.169.254). I thought I could do this by running
those containers as a specific user (e.g. userx), in the presence of the following iptables rule:

$ iptables -A OUTPUT -m owner --uid-owner userx -d 169.254.169.254 -j DROP

This blocks the connection as expected when the container is run with host networking:

$ docker run -it --rm --network host -u $(id -u userx):$(id -g userx) appropriate/curl  http://169.254.169.254/latest/meta-data/
...blocks..

But it sadly allows the connection when the container runs within its own network:

$ docker run -it --rm -u $(id -u userx):$(id -g userx) appropriate/curl  http://169.254.169.254/latest/meta-data/
...show metadata...

How can I make this work? Or alternatively, is there some other technique that will give specific containers full network access whilst blocking the instance metadata?

Best Answer

Your issue is that OUTPUT doesn't catch packets coming out of containers. FORWARD does.

Why is that?

Every Docker container runs in its own network namespace. Every network namespace has its own routing table and iptables rules, and behaves exactly as if it was a separate physical machine.

In iptables:

  • INPUT matches packets going to local processes
  • FORWARD matches packets coming in one network interface and going out another one (being routed through).
  • OUTPUT matches packets coming from local processes

The key is that "local process" means "a process in this network namespace", not "a process in this machine".
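You can see this directly with a scratch network namespace (a sketch; requires root, and `demo` is an arbitrary namespace name):

```shell
# Create a throwaway network namespace (requires root).
ip netns add demo

# Its iptables state is completely separate from the host's: the
# OUTPUT chain is empty, with the default ACCEPT policy, no matter
# what rules exist in the host namespace.
ip netns exec demo iptables -S OUTPUT

# Clean up.
ip netns del demo
```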

Let's analyze what's going on:

  • Packets are generated by processes in the Docker container's network namespace.
  • They go through the OUTPUT chain of the container's network namespace's iptables (which is empty!).
  • They get routed out of the veth interface.
  • They arrive in the host's network namespace via the veth interface.
  • The host network namespace consults the routing table and decides they need to go out of e.g. eth0.
  • They go through the iptables FORWARD chain in the host's network namespace.
  • They go out eth0.

Therefore, the solution is putting your rule in the FORWARD chain instead.

The issue is that -m owner doesn't work in FORWARD. According to man iptables-extensions:

This match is only valid in the OUTPUT and POSTROUTING chains. Forwarded packets do not have any socket associated with them.

You can either hardcode the container's IP addresses, or put containers you want filtered in a special network, and match the whole range. Something similar to this should work:

    # single container
    iptables -A FORWARD -s 172.17.0.4 -d 169.254.169.254 -j DROP

    # or entire network
    iptables -A FORWARD -s 172.17.0.0/16 -d 169.254.169.254 -j DROP

Also, using owner matching is probably not a good idea either way, because processes inside Docker containers can change their UIDs via e.g. setuid binaries (like sudo), if there are any in the image.
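Putting that together, one way to set up the "special network" variant is a user-defined bridge network with a fixed subnet for the containers you want filtered. This is a sketch; `metadata-blocked` and the `172.30.0.0/24` subnet are arbitrary choices, and on newer Docker versions the `DOCKER-USER` chain is the documented place for such rules:

```shell
# Create a user-defined bridge network with a known subnet
# ("metadata-blocked" and 172.30.0.0/24 are arbitrary choices).
docker network create --subnet 172.30.0.0/24 metadata-blocked

# Drop that subnet's traffic to the metadata service. Insert (-I)
# rather than append, so the rule lands before Docker's own
# FORWARD rules accept the traffic.
iptables -I FORWARD -s 172.30.0.0/24 -d 169.254.169.254 -j DROP

# Containers on this network keep full network access otherwise,
# but requests to the metadata service should now hang:
docker run -it --rm --network metadata-blocked appropriate/curl \
    http://169.254.169.254/latest/meta-data/
```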