iptables – Tcpdump Shows Different Redirection Port After Adding REDIRECT Rule

iptables, tcpdump

I am attempting to direct client traffic to a Kubernetes cluster NodePort listening on 192.168.1.100:30000.

Clients need to make requests to 192.168.1.100:8000, so I added the following REDIRECT rule in iptables:

iptables -t nat -I PREROUTING -p tcp --dst 192.168.1.100 --dport 8000 -j REDIRECT --to-port 30000
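
To confirm the rule is installed and see when it matches, the chain can be listed with its packet counters (standard iptables options; output will vary):

iptables -t nat -vnL PREROUTING --line-numbers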

I then issue a curl to 192.168.1.100:8000; however, in tcpdump I see a different port:

# tcpdump -i lo -nnvvv host 192.168.1.100 and port 8000
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
20:39:22.685968 IP (tos 0x0, ttl 64, id 20590, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.1.100.8000 > 192.168.1.100.49816: Flags [R.], cksum 0xacda (correct), seq 0, ack 3840205844, win 0, length 0
20:39:37.519256 IP (tos 0x0, ttl 64, id 34221, offset 0, flags [DF], proto TCP (6), length 40)

I would expect the tcpdump output to show something like

192.168.1.100.8000 > 192.168.1.100.30000

However, it is showing the following instead, and I get a connection refused error since no process is listening on 192.168.1.100:49816.

192.168.1.100.8000 > 192.168.1.100.49816

This is a test environment and I don't have access to remote devices, which is why I am using curl from the host itself to test the iptables REDIRECT path.

Is there a reason why adding a REDIRECT rule makes tcpdump show the traffic going to a different port than the one specified?

Edit:

Following @A.B.'s suggestion, I added the following OUTPUT rule:

iptables -t nat -I OUTPUT -d 192.168.1.100 -p tcp --dport 8000 -j REDIRECT --to-port 30000

and curl does proceed further; the packet count for the OUTPUT chain rule increases (the PREROUTING REDIRECT rule's count didn't increase, though):

2       10   600 REDIRECT   tcp  --  *      *       0.0.0.0/0            192.168.1.100         tcp dpt:8000 redir ports 30000
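
(Counter listings like the one above are typically produced with the usual counter view; zeroing the counters between tests makes each test's increment unambiguous:)

iptables -t nat -Z
iptables -t nat -vnL --line-numbers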

However, I get the following error:

# curl -vk https://192.168.1.100:8000/v1/api
* About to connect() to 192.168.1.100 port 8000 (#0)
*   Trying 192.168.1.100...
* Connected to 192.168.1.100 (192.168.1.100) port 8000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* NSS error -12263 (SSL_ERROR_RX_RECORD_TOO_LONG)
* SSL received a record that exceeded the maximum permissible length.
* Closing connection 0
curl: (35) SSL received a record that exceeded the maximum permissible length.
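
As a side note, NSS's SSL_ERROR_RX_RECORD_TOO_LONG usually means the server answered with something that isn't TLS at all (plain HTTP, for instance), so the redirect itself may already be working here; a plain-HTTP request against the same path is a quick cross-check (adjust to whatever the NodePort service actually speaks):

# curl -v http://192.168.1.100:8000/v1/api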

Also, I tried adding a remotesystem network namespace; this time the PREROUTING REDIRECT rule's packet count increases after executing curl from remotesystem (but the OUTPUT chain count doesn't):

2       34  2040 REDIRECT   tcp  --  *      *       0.0.0.0/0            172.16.128.1         tcp dpt:8000 redir ports 30000

Error:

# ip netns exec remotesystem curl -vk https://192.168.1.100:8000/v1/api
* About to connect() to 192.168.1.100 port 8000 (#0)
*   Trying 192.168.1.100...
* Connection timed out
* Failed connect to 192.168.1.100:8000; Connection timed out
* Closing connection 0
curl: (7) Failed connect to 192.168.1.100:8000; Connection timed out

Best Answer

To be clear: OP's test is done from the system 192.168.1.100 to itself, not from a remote system, and that's the cause of the problem. The port wasn't changed in this case because no NAT rule matched, while it would have matched if done from a remote system.

The schematic below shows the order in which operations are performed on a packet:

Packet flow in Netfilter and General Networking

The reason is how NAT works on Linux: iptables sees a packet in the nat table only for the first packet of a new conntrack flow (which is thus in the NEW state).

This rule works fine when used from a remote system. In this case the first packet seen will be an incoming packet:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack --> nat/PREROUTING (iptables REDIRECT): to port 30000
--> routing decision --> ... --> local process receiving on port 30000

For all following packets in the same flow, conntrack handles the port change directly (or the port reversion for replies) and skips any iptables rule in the nat table (as written in the schematic: the nat table is only consulted for NEW connections). So, skipping the reply packets, the next incoming packet will undergo this instead:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack: to port 30000
--> routing decision --> ... --> local process receiving on port 30000
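
If conntrack-tools is installed, the flow (including the rewritten port) can be inspected directly while the connection is active, e.g. by filtering for the port:

conntrack -L -p tcp | grep 8000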

For a test from the system to itself, the first packet isn't an incoming packet but an outgoing packet. This is what happens instead, using the outgoing lo interface:

local process client curl --> routing decision --> conntrack --> nat/OUTPUT (no rule here)
--> reroute check --> AF_PACKET (tcpdump) --> to port 8000

Now this packet is looped back on the lo interface and reappears as a packet which is no longer the first packet of a connection, so it follows the second case above: conntrack alone takes care of the NAT and nat/PREROUTING isn't consulted. Except conntrack wasn't instructed in the step before to do any NAT:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack
--> routing decision --> ... --> no local process receiving on port 8000

As there's nothing listening on port 8000, the OS sends back a TCP RST.
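
A quick way to confirm which of the two ports actually has a listener (ss is part of iproute2):

ss -tln | grep -E ':(8000|30000)\b'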

For this to work on the local system, a REDIRECT rule must also be put in the nat/OUTPUT chain:

iptables -t nat -I OUTPUT -d 192.168.1.100 -p tcp --dport 8000 -j REDIRECT --to-port 30000
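
With both rules in place (the original nat/PREROUTING rule for remote clients and this nat/OUTPUT rule for local tests), each connection matches exactly one of the two, which also explains the counter behaviour seen in the question's edits:

iptables -t nat -I PREROUTING -p tcp -d 192.168.1.100 --dport 8000 -j REDIRECT --to-port 30000
iptables -t nat -I OUTPUT -p tcp -d 192.168.1.100 --dport 8000 -j REDIRECT --to-port 30000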

Additional notes

  • if the setup is intended for remote use, don't test from the local system: the rules traversed by the test aren't the same, so the test doesn't reflect reality.

    Just use a network namespace to create a pocket remote system in case no other system is available. Here is an example that should work on a system having only the OP's nat/PREROUTING rule, using curl http://192.168.1.100:8000/ (which doesn't require DNS):

    ip netns add remotesystem
    ip link add name vethremote up type veth peer netns remotesystem name eth0
    ip address add 192.0.2.1/24 dev vethremote
    ip -n remotesystem address add 192.0.2.2/24 dev eth0
    ip -n remotesystem link set eth0 up
    ip -n remotesystem route add 192.168.1.100 via 192.0.2.1
    ip netns exec remotesystem curl http://192.168.1.100:8000/
    
  • tcpdump and NAT

    tcpdump happens at the AF_PACKET steps in the schematic above: very early for ingress and very late for egress. That means that in the remote system case, it will never capture port 30000 even when the redirect is working. In the local system case, once the nat/OUTPUT rule is added, it will capture port 30000.

    Just don't blindly trust the address/port displayed by tcpdump when NAT is involved: what you see depends on the case and on where the capture happens.
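
    For instance, with the namespace setup from the first note, a capture on the host side of the veth pair still shows destination port 8000 even while the redirect works, because the capture point is before nat/PREROUTING (vethremote is the interface name from that example):

    tcpdump -i vethremote -nn tcp port 8000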