I am attempting to direct client traffic to a Kubernetes cluster NodePort listening on 192.168.1.100:30000.
Clients need to make requests to 192.168.1.100:8000,
so I added the following REDIRECT rule in iptables:
iptables -t nat -I PREROUTING -p tcp --dst 192.168.1.100 --dport 8000 -j REDIRECT --to-port 30000
I then issue a curl to 192.168.1.100:8000; however, in tcpdump I see a different port:
# tcpdump -i lo -nnvvv host 192.168.1.100 and port 8000
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
20:39:22.685968 IP (tos 0x0, ttl 64, id 20590, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.1.100.8000 > 192.168.1.100.49816: Flags [R.], cksum 0xacda (correct), seq 0, ack 3840205844, win 0, length 0
20:39:37.519256 IP (tos 0x0, ttl 64, id 34221, offset 0, flags [DF], proto TCP (6), length 40)
I would expect the tcpdump output to show something like
192.168.1.100.8000 > 192.168.1.100.30000
However, it is showing
192.168.1.100.8000 > 192.168.1.100.49816
and causing a connection refused error, since no process is listening on 192.168.1.100:49816.
I am using a test environment, so I don't have access to remote devices; that is why I am using curl to test the iptables REDIRECT path.
Is there a reason why adding a REDIRECT rule causes tcpdump to show the traffic going to a different port than the one specified?
Edit:
After @A.B.'s suggestion, I added the following OUTPUT rule:
iptables -t nat -I OUTPUT -d 192.168.1.100 -p tcp --dport 8000 -j REDIRECT --to-port 30000
and curl does proceed further; the packet count for the OUTPUT chain does increase (the PREROUTING REDIRECT rule's packet count didn't increase, though):
2 10 600 REDIRECT tcp -- * * 0.0.0.0/0 192.168.1.100 tcp dpt:8000 redir ports 30000
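For reference, counter lines like the one above can be read back with something along these lines (a sketch, not from the original post; zeroing the counters first makes changes easier to spot, and the commands need root):

```shell
# Zero the nat table counters, replay the test, then inspect both chains.
iptables -t nat -Z
curl -sk https://192.168.1.100:8000/v1/api >/dev/null
iptables -t nat -L OUTPUT -v -n --line-numbers
iptables -t nat -L PREROUTING -v -n --line-numbers
```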
However, getting the following error:
# curl -vk https://192.168.1.100:8000/v1/api
* About to connect() to 192.168.1.100 port 8000 (#0)
* Trying 192.168.1.100...
* Connected to 192.168.1.100 (192.168.1.100) port 8000 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* NSS error -12263 (SSL_ERROR_RX_RECORD_TOO_LONG)
* SSL received a record that exceeded the maximum permissible length.
* Closing connection 0
curl: (35) SSL received a record that exceeded the maximum permissible length.
Also, I tried adding a remotesystem network namespace; this time the PREROUTING REDIRECT chain packet count increases after executing curl from the namespace
(but the OUTPUT chain count doesn't):
2 34 2040 REDIRECT tcp -- * * 0.0.0.0/0 172.16.128.1 tcp dpt:8000 redir ports 30000
Error:
# ip netns exec remotesystem curl -vk https://192.168.1.100:8000/v1/api
* About to connect() to 192.168.1.100 port 8000 (#0)
* Trying 192.168.1.100...
* Connection timed out
* Failed connect to 192.168.1.100:8000; Connection timed out
* Closing connection 0
curl: (7) Failed connect to 192.168.1.100:8000; Connection timed out
Best Answer
To be clear: OP's test is done from the system 192.168.1.100 to itself, not from a remote system, and that's the cause of the problem. The port wasn't changed in this case because no NAT rule matched, while it would have matched if done from a remote system.
The schematic below shows the order in which operations are performed on a packet:
The reason is how NAT works on Linux: iptables sees a packet in the nat table only for the first packet of a new conntrack flow (which is thus in NEW state). This rule works fine when used from a remote system. In that case the first packet seen will be an incoming packet:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack --> nat/PREROUTING (iptables REDIRECT): to port 30000
--> routing decision --> ... --> local process receiving on port 30000

All following packets in the same flow will have conntrack handle the port change directly (or the port reversion for replies) and will skip any iptables rule in the nat table (as written in the schematic: the nat table is only consulted for NEW connections). So, skipping the reply packet part, the next incoming packet will undergo this instead:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack: to port 30000
--> routing decision --> ... --> local process receiving on port 30000
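The per-flow translation that conntrack applies to those follow-up packets can be inspected with the conntrack tool from conntrack-tools (a sketch, assuming the tool is installed and run as root):

```shell
# A connection that hit the REDIRECT rule shows the original destination
# port (8000) on the request side of the entry and the translated port
# (30000) as the source port on the reply side.
conntrack -L -p tcp 2>/dev/null | grep 'dport=8000'
```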
For a test from the system to itself, the first packet isn't an incoming packet but an outgoing packet. This happens instead, using the outgoing lo interface:

local process (client curl) --> routing decision --> conntrack --> nat/OUTPUT (no rule here)
--> reroute check --> AF_PACKET (tcpdump) --> to port 8000
And now this packet is looped back on the lo interface; it reappears as a packet which isn't the first packet of the connection anymore, so it follows the second case above: conntrack alone takes care of the NAT and nat/PREROUTING isn't called. Except conntrack wasn't instructed in the previous step to do any NAT:

to port 8000 --> AF_PACKET (tcpdump) --> conntrack
--> routing decision --> ... --> no local process receiving on port 8000

As there's nothing listening on port 8000, the OS sends back a TCP RST.
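This is easy to confirm with ss from iproute2: nothing listens on 8000, while the NodePort does on 30000.

```shell
# Listening TCP sockets on the two ports; only 30000 should appear.
ss -tln '( sport = :8000 or sport = :30000 )'
```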
For this to work on the local system, a REDIRECT rule must also be put in the nat/OUTPUT chain, as OP added in the edit above.

Additional notes
If the case is intended for remote use, don't test from the local system: the rules traversed by the test aren't the same, so the test doesn't reflect reality. Just use a network namespace to create a pocket remote system when no other system is available. This works with a system having only OP's nat/PREROUTING rule and doing curl http://192.168.1.100/ (which doesn't require DNS).
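A minimal sketch of such a namespace setup, targeting OP's port 8000 (the names remotesystem, vethhost/vethremote and the 10.0.3.0/24 transfer subnet are assumptions; run as root):

```shell
# Create a "pocket remote system" connected to the host by a veth pair.
ip netns add remotesystem
ip link add vethhost type veth peer name vethremote
ip link set vethremote netns remotesystem

ip addr add 10.0.3.1/24 dev vethhost
ip link set vethhost up
ip netns exec remotesystem ip addr add 10.0.3.2/24 dev vethremote
ip netns exec remotesystem ip link set vethremote up
ip netns exec remotesystem ip link set lo up

# Let the namespace reach the host's 192.168.1.100 address over the veth link.
ip netns exec remotesystem ip route add 192.168.1.100/32 via 10.0.3.1

# This traffic arrives on a real interface and so traverses nat/PREROUTING:
ip netns exec remotesystem curl http://192.168.1.100:8000/
```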
tcpdump and NAT: tcpdump happens at the AF_PACKET steps in the schematic above, very early for ingress and very late for egress. That means in the remote-system case it will never capture port 30000, even when everything is working. In the local-system case, once the nat/OUTPUT rule is added, it will capture port 30000. Just don't blindly trust the address/port displayed by tcpdump when doing NAT: it depends on the case and on where the capture happens.
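To see the traffic regardless of which case applies, capture on all interfaces with a filter matching both the pre- and post-NAT ports (a sketch; needs root):

```shell
# -i any covers lo and the real interfaces in one capture.
tcpdump -i any -nn 'tcp and (port 8000 or port 30000)'
```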