Iptables – Change default interface docker container

docker, iptables, nat, networking

I've been struggling with this problem for two days.

Assumptions:

  • Docker network (and interface) named docknet, type bridge, subnet 172.18.0.0/16
  • Two interfaces: eth0 (Gateway IP: 192.168.1.1, Interface Static IP: 192.168.1.100) and eth1 (Gateway IP: 192.168.2.1, Interface Static IP: 192.168.2.100)
  • Default routing goes through eth0

What I want:

  • Outgoing traffic from containers attached to docknet must go out through eth1

What I tried:

  • Default iptables rule created by Docker, left untouched:

-A POSTROUTING -s 172.18.0.0/16 ! -o docknet -j MASQUERADE

  • My rules:

iptables -t mangle -I PREROUTING -s 172.18.0.0/16 -j MARK --set-mark 1

ip rule add from all fwmark 1 table 2

Where table 2 is:

default via 192.168.2.1 dev eth1 proto static

With this setup, when I try to ping 8.8.8.8 from a container (172.18.0.2) attached to docknet, the following happens:

  • 172.18.0.2 gets translated to 192.168.2.1
  • the packet goes through eth1
  • the packet returns to eth1 with src addr 8.8.8.8 and dst 192.168.2.1

From here, a reverse translation from 192.168.2.1 to 172.18.0.2 should happen, but when running tcpdump -i any host 8.8.8.8 there is no trace of this translation.

I also checked conntrack -L, and this is the result:

icmp 1 29 src=172.18.0.2 dst=8.8.8.8 type=8 code=0 id=9 src=8.8.8.8 dst=192.168.2.1 type=0 code=0 id=9 mark=0 use=1
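
The second half of that entry is the reply tuple, i.e. the addresses the kernel expects on returning packets. A minimal sketch for pulling the reply destination out of such lines, assuming the field order shown above (original tuple first, reply tuple second); reply_dst is a hypothetical helper name, not part of conntrack-tools:

```shell
#!/bin/sh
# reply_dst: for each conntrack line read on stdin, print the address of the
# second dst= field, which belongs to the reply tuple.
reply_dst() {
    awk '{
        n = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^dst=/) {
                n++
                if (n == 2) { sub(/^dst=/, "", $i); print $i; break }
            }
        }
    }'
}

# Example (needs root and conntrack-tools):
#   conntrack -L -p icmp 2>/dev/null | reply_dst
```

Comparing that address with the destination seen by tcpdump on eth1 tells whether the returning packet matches the tracked flow at all.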

Useful info:

  • eth1 is actually a 4G USB dongle
  • IP forwarding is active
  • curl --interface eth1 ipinfo.io works as expected

EDIT:

Output of ip -d link show eth1:

eth1: mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 00:b0:d6:00:00:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535

Best Answer

I will assume rp_filter is enabled and is causing the trouble: it doesn't behave as expected in the presence of a mark. Some references are in this Q/A: Advanced routing with firewall marks and rp_filter.

While a mark is set on outgoing packets, which selects table 2, no such mark exists on incoming packets. Those packets are therefore not considered to be using table 2, and the kernel's reverse path filter (rp_filter) drops them, because a lookup in the main table finds no reverse route out of the interface they arrived on.

The fix should be:

ip rule add iif eth1 table 2

But because rp_filter doesn't behave as expected, a second fix is needed as well: set rp_filter to loose mode on eth1:

sysctl -w net.ipv4.conf.eth1.rp_filter=2
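
When checking this setting, keep in mind that the kernel validates sources on an interface using the larger of net.ipv4.conf.all.rp_filter and the per-interface value, so the two have to be read together. A small sketch of that check; effective_rp_filter and show_rp_filter are hypothetical helper names:

```shell
#!/bin/sh
# effective_rp_filter ALL_VALUE IFACE_VALUE
# Print the mode the kernel actually applies: the larger of the two values.
effective_rp_filter() {
    if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# show_rp_filter IFACE
# Read both sysctls from /proc and print the effective mode for IFACE.
show_rp_filter() {
    all=$(cat /proc/sys/net/ipv4/conf/all/rp_filter)
    dev=$(cat "/proc/sys/net/ipv4/conf/$1/rp_filter")
    effective_rp_filter "$all" "$dev"
}

# Example: show_rp_filter eth1    (0 = off, 1 = strict, 2 = loose)
```

This is also why setting only eth1 to 2 is enough to get loose mode there even when conf.all is 1.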

Now, the part I can't explain: the host apparently doesn't find a 172.18.0.0/16 entry when looking up table 2, and the container's outgoing packets are dropped on the host. It has no such problem with 192.168.2.0/24, which is also absent from table 2 ahead of the default route. So, without knowing exactly why it works for 192.168.2.0/24, the final fix is to duplicate the missing route from the main table:

ip route add table 2 172.18.0.0/16 dev docknet src 172.18.0.1

I usually duplicate all of them and don't think about it anymore. Now the ping from the container should work and go through eth1.
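
Duplicating every non-default route can be scripted. The following is only a sketch under assumptions: print_dup_routes is a hypothetical helper that prints the commands instead of running them, and some attributes that ip route show can print (for example linkdown) are not accepted by ip route add, so the output should be reviewed first.

```shell
#!/bin/sh
# print_dup_routes TABLE
# Read `ip route show table main` output on stdin and print one
# `ip route add table TABLE ...` command per non-default route.
print_dup_routes() {
    table=$1
    while IFS= read -r route; do
        case $route in
            ''|default*) continue ;;   # skip blank lines and the default route
        esac
        printf 'ip route add table %s %s\n' "$table" "$route"
    done
}

# Example (review the printed commands, then run them as root):
#   ip route show table main | print_dup_routes 2
```

Printing rather than executing keeps the helper safe to try on a live host; the default route is skipped because table 2 gets its own.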

UPDATE:

Actually, there's no need to involve iptables at all in this case: ip rule can do it on its own, and everything behaves better without a mark, because table 2 is looked up whenever it is needed, while with the mark it wouldn't always be (e.g. the iif eth1 rule is no longer needed here). So here's a simpler answer. It supersedes OP's settings and the previous answer (so don't add the mangle rule):

ip rule add iif docknet table 2
ip route add table 2 172.18.0.0/16 dev docknet src 172.18.0.1
ip route add table 2 default via 192.168.2.1 dev eth1

This makes the container use eth1, without even having to change rp_filter.
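
To see which route the kernel would pick for a forwarded container packet without sending any traffic, ip route get can simulate the lookup. A minimal sketch, assuming the addresses from the question; route_dev is a hypothetical helper that only extracts the output interface from the first line of that output:

```shell
#!/bin/sh
# route_dev: print the word following "dev" on the first line read from stdin.
route_dev() {
    awk 'NR == 1 {
        for (i = 1; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Example (on the Docker host; should print eth1 once the rules are in place):
#   ip route get 8.8.8.8 from 172.18.0.2 iif docknet | route_dev
```

If this prints eth0 instead, the iif docknet rule or the table 2 default route is probably missing.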

Now for this to also work from the host in my test, rp_filter must be loosened again (and of course oif must be used):

sysctl -q -w net.ipv4.conf.eth1.rp_filter=2
ip rule add oif eth1 table 2

Also, contrary to OP, in my tests ping -I eth1 8.8.8.8 or, for example, curl --interface eth1 8.8.8.8 from the "host" only worked once this rule (the same oif rule as above) was in place:

ip rule add oif eth1 table 2

This rule is for locally generated packets going out through eth1. Without it, when forcing interface eth1, the host sends direct ARP requests for 8.8.8.8 (I don't have a good explanation for this, other than that routes are missing), which won't work unless the 4G card is doing proxy ARP.


BONUS: mockup reproducer script

Even knowing what to look for (rp_filter, missing routes, missing rules...), finding a working solution was mostly trial and error. I made a script to reproduce a whole mockup internet with a multihomed setup, including two internet providers and Google's 8.8.8.8 IP. Using the script below, I get these results from the (real) host:

# ip netns exec dockerhost traceroute -n 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  192.168.1.1  0.160 ms  0.043 ms  0.031 ms
 2  192.0.2.1  0.111 ms  0.055 ms  0.046 ms
 3  203.0.113.11  0.112 ms  0.047 ms  0.041 ms
 4  8.8.8.8  0.073 ms  0.048 ms  0.045 ms
# ip netns exec dockerhost traceroute -i eth1 -n 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  192.168.2.1  0.071 ms  0.017 ms  0.014 ms
 2  198.51.100.1  0.044 ms  0.023 ms  0.020 ms
 3  203.0.113.22  0.042 ms  0.025 ms  0.024 ms
 4  8.8.8.8  0.032 ms  0.026 ms  0.025 ms
# ip netns exec container traceroute -n 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  172.18.0.1  0.081 ms  0.017 ms  0.012 ms
 2  192.168.2.1  0.038 ms  0.024 ms  0.022 ms
 3  198.51.100.1  0.035 ms  0.030 ms  0.028 ms
 4  203.0.113.22  0.036 ms  0.040 ms  0.027 ms
 5  8.8.8.8  0.046 ms  0.037 ms *

Script I made to create the mockup internet network parts (I ran out of test nets, so used ip address peer syntax for "LAN-less" address+routing at the end):

#!/bin/sh

if ip netns id | grep -qv '^ *$' ; then
    printf 'ERROR: leave netns "%s" first\n' "$(ip netns id)" >&2
    exit 1
fi

for ns in dockerhost container gw1 gw2 isp1 isp2 google inet; do
    ip netns del $ns 2>/dev/null || :
    ip netns add $ns
    ip -n $ns link set lo up
    ip netns exec $ns sysctl -q -w net.ipv4.conf.default.forwarding=1
    ip netns exec $ns sysctl -q -w net.ipv4.conf.default.rp_filter=1
    ip netns exec $ns sysctl -q -w net.ipv4.conf.all.rp_filter=1
    ip netns exec $ns sysctl -q -w net.ipv6.conf.default.disable_ipv6=1
done

ip -n dockerhost link add docknet type bridge
ip netns exec dockerhost iptables -t nat -A POSTROUTING -s 172.18.0.0/16 ! -o docknet -j MASQUERADE
ip -n dockerhost link set docknet up
ip -n dockerhost address add 172.18.0.1/16 dev docknet
ip -n dockerhost link add veth-container type veth peer netns container name eth0
ip -n dockerhost link set veth-container master docknet
ip -n dockerhost link set veth-container up

ip -n dockerhost link add eth0 type veth peer netns gw1 name lan1
ip -n dockerhost link set eth0 up
ip -n dockerhost address add 192.168.1.100/24 dev eth0
ip -n dockerhost route add default via 192.168.1.1

ip -n dockerhost link add eth1 type veth peer netns gw2 name lan2
ip -n dockerhost link set eth1 up
ip -n dockerhost address add 192.168.2.100/24 dev eth1

ip -n container link set eth0 up
ip -n container address add 172.18.0.2/16 dev eth0
ip -n container route add default via 172.18.0.1

ip -n gw1 route add unreachable 172.16.0.0/12
ip netns exec gw1 iptables -t nat -A POSTROUTING -s 192.168.1.0/24 ! -o lan1 -j MASQUERADE
ip -n gw1 link set lan1 up
ip -n gw1 address add 192.168.1.1/24 dev lan1
ip -n gw1 link add wan0 type veth peer netns isp1 name client0
ip -n gw1 link set wan0 up
ip -n gw1 address add 192.0.2.100/24 dev wan0
ip -n gw1 route add default via 192.0.2.1

ip -n gw2 route add unreachable 172.16.0.0/12
ip netns exec gw2 iptables -t nat -A POSTROUTING -s 192.168.2.0/24 ! -o lan2 -j MASQUERADE
ip -n gw2 link set lan2 up
ip -n gw2 address add 192.168.2.1/24 dev lan2
ip -n gw2 link add wan0 type veth peer netns isp2 name client0
ip -n gw2 link set wan0 up
ip -n gw2 address add 198.51.100.100/24 dev wan0
ip -n gw2 route add default via 198.51.100.1

ip -n isp1 route add unreachable 192.168.0.0/16
ip -n isp1 link set client0 up
ip -n isp1 address add 192.0.2.1/24 dev client0
ip -n isp1 link add wan0 type veth peer netns inet name isp1
ip -n isp1 link set wan0 up
ip -n isp1 address add 203.0.113.101 peer 203.0.113.11 dev wan0
ip -n isp1 route add default via 203.0.113.11

ip -n isp2 route add unreachable 192.168.0.0/16
ip -n isp2 link set client0 up
ip -n isp2 address add 198.51.100.1/24 dev client0
ip -n isp2 link add wan0 type veth peer netns inet name isp2
ip -n isp2 link set wan0 up
ip -n isp2 address add 203.0.113.102 peer 203.0.113.22 dev wan0
ip -n isp2 route add default via 203.0.113.22

ip -n google link add wan0 type veth peer netns inet name google0
ip -n google link set wan0 up
ip -n google address add 203.0.113.103 peer 203.0.113.33 dev wan0
ip -n google route add default via 203.0.113.33
ip -n google address add 8.8.8.8 dev lo

ip -n inet link set isp1 up
ip -n inet address add 203.0.113.11 peer 203.0.113.101 dev isp1
ip -n inet route add 192.0.2.0/24 via 203.0.113.101
ip -n inet link set isp2 up
ip -n inet address add 203.0.113.22 peer 203.0.113.102 dev isp2
ip -n inet route add 198.51.100.0/24 via 203.0.113.102
ip -n inet link set google0 up
ip -n inet address add 203.0.113.33 peer 203.0.113.103 dev google0
ip -n inet route add 8.8.8.8 via 203.0.113.103

#OP's additional settings for goal
#ip netns exec dockerhost iptables -t mangle -I PREROUTING -s 172.18.0.0/16 -j MARK --set-mark 1
#ip -n dockerhost rule add from all fwmark 1 table 2
#ip -n dockerhost route add table 2 default via 192.168.2.1 dev eth1 proto static
#ip -n dockerhost route add table 2 default via 192.168.2.1 dev eth1

#Superseded initial fix
#ip netns exec dockerhost sysctl -q -w net.ipv4.conf.eth1.rp_filter=2
#ip -n dockerhost rule add iif docknet table 2
#ip -n dockerhost rule add iif eth1 table 2
#ip -n dockerhost route add table 2 172.18.0.0/16 dev docknet src 172.18.0.1

#Superseded initial host fix
#ip -n dockerhost rule add oif eth1 table 2

#Or instead proxy_arp on gw2 would work
#ip netns exec gw2 sysctl -q -w net.ipv4.conf.lan2.proxy_arp=1

#Final fix for container, without additional iptables rule, not using marks at all:
ip -n dockerhost rule add iif docknet table 2
ip -n dockerhost route add table 2 172.18.0.0/16 dev docknet src 172.18.0.1
ip -n dockerhost route add table 2 default via 192.168.2.1 dev eth1

#Final fix for host
ip netns exec dockerhost sysctl -q -w net.ipv4.conf.eth1.rp_filter=2
ip -n dockerhost rule add oif eth1 table 2