Ubuntu Firewall Rejecting Final Packet – Troubleshooting Guide

firewalliptablesubuntu-8.04

Background, for a long time we have had problems with our firewall that sometimes keep HTTP requests hanging partially loaded until TCP times out.

After tracing the traffic on the firewall I noticed that it occurs only during certain timing conditions, e.g when the webserver has sent the entire response before the client has sent its second ACK on the payload. [SYN, SYN/ACK, ACK] has been exchanged, REQUEST has been sent and ACK'ed and the first RESPONSE packet has been received and ACK'ed, then the webserver sends the rest of the response body in one shot (8 packets including the last FIN, PSH) and before the client has ACK'ed any of those, the Firewall REJECTS with a RST towards the webserver and keeps the client hanging infinite.

Here is the entire wireshark trace with packets from both sides of the firewall. 192.168.126.161 is the private NAT'et IP address of the client. 172.16.1.2 is the webserver IP (not showing the real public IP) and 10.1.1.1 is the firewall external IP (not showing the real public IP)

2105 0.086275 192.168.126.161  172.16.1.2       TCP 37854 > http [SYN] Seq=0 Win=5840 Len=0 MSS=1460 SACK_PERM=1 TSV=89375083 TSER=0
2106 0.000066 10.1.1.1         172.16.1.2       TCP 37854 > http [SYN] Seq=0 Win=5840 Len=0 MSS=1460 SACK_PERM=1 TSV=89375083 TSER=0
2107 0.002643 172.16.1.2       10.1.1.1         TCP http > 37854 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=1460
2108 0.007705 172.16.1.2       192.168.126.161  TCP http > 37854 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=1460
2109 0.006301 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=1 Ack=1 Win=5840 Len=0
2110 0.000025 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=1 Ack=1 Win=5840 Len=0
2111 0.000007 192.168.126.161  172.16.1.2       HTTP GET /test/style.css HTTP/1.1 
2112 0.000015 10.1.1.1         172.16.1.2       HTTP GET /test/style.css HTTP/1.1 
2113 0.001536 172.16.1.2       10.1.1.1         TCP http > 37854 [ACK] Seq=1 Ack=111 Win=32658 Len=0
2114 0.000014 172.16.1.2       192.168.126.161  TCP http > 37854 [ACK] Seq=1 Ack=111 Win=32658 Len=0
2115 0.002274 172.16.1.2       10.1.1.1         HTTP HTTP/1.1 200 OK  (text/css)
2116 0.000025 172.16.1.2       192.168.126.161  HTTP HTTP/1.1 200 OK  (text/css)
2117 0.005689 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=1461 Win=8760 Len=0
2118 0.000024 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=1461 Win=8760 Len=0
2119 0.001536 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
2120 0.000026 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
2121 0.000007 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
2122 0.000023 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
2123 0.000313 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
2124 0.000030 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
2125 0.000007 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
2126 0.000023 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
2127 0.000009 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
2128 0.000023 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
2129 0.001108 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
2130 0.000035 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
2131 0.000008 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
2132 0.000022 172.16.1.2       192.168.126.161  HTTP Continuation or non-HTTP traffic
2133 0.000007 172.16.1.2       10.1.1.1         HTTP Continuation or non-HTTP traffic
REJECT-->
2134 0.000089 10.1.1.1         172.16.1.2       TCP 37854 > http [RST] Seq=111 Win=0 Len=0
CLIENT FIRST ACK-->
2135 0.002421 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=2921 Win=11680 Len=0
2136 0.000033 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=2921 Win=11680 Len=0
2137 0.000007 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=4381 Win=14600 Len=0
2138 0.000014 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=4381 Win=14600 Len=0
2139 0.000008 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=5841 Win=17520 Len=0
2140 0.000014 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=5841 Win=17520 Len=0
2141 0.000007 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=7301 Win=20440 Len=0
2142 0.000013 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=7301 Win=20440 Len=0
2143 0.000007 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=8761 Win=23360 Len=0
2144 0.000015 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=8761 Win=23360 Len=0
2145 0.000007 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=10221 Win=26280 Len=0
2146 0.000013 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=10221 Win=26280 Len=0
2147 0.001059 192.168.126.161  172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=11681 Win=29200 Len=0
2148 0.000018 10.1.1.1         172.16.1.2       TCP 37854 > http [ACK] Seq=111 Ack=11681 Win=29200 Len=0

I have been digging and logging the packet traversal according to this chart and it seems that the last incoming packet 2133 gets past raw-PREROUTING, conntrack, mangle-PREROUTING but then is lost. I have no REJECT rules in my iptables, I log all DROP rules and none of them show where packet 2133 gets lost.

I would love to use the TRACE target on the incoming filter, but unfortunately ubuntu 8.04 is not shipped with support for the TRACE target.

So I believe that some internal implicit routing/conntrack/mangling rules applies that for some reason resets the connection. Maybe the traffic trigger some DOS protection, but I have no idea where to configure/analyse that. The most frustrating thing is that a packet is rejected and nothing gets logged…

Also requesting this file works 100% from windows hosts, but it fails on certain Linux hosts and 99.9% of all requests get through but sometimes the timing of the packets triggers this behaviour in our firewall.

EDIT
Ok, now I have added tons of logging in iptables and it seems like the following happens (still dont know why!)

For packets successfully traversing the firewall the following steps are taken, table/step references from here

Table 3-3 step

2     raw-pre
      conntrack
3     mangle-pre
4     [nat-pre]
5     routing-decision -> destination forward
6     mangle-fwd
7     filter-fwd
8     mangle-post
9     [nat-post]

Packet 2133 that gets rejected traverses these steps:

Table 3-1 steps for the incoming FIN,ACK packet 2133
2     raw-pre
      conntrack
3     mangle-pre
4     [nat-pre]
5     routing-decision -> destination local
6     mangle-input
7     filter-input
8     local process emits RST -> webserver

Table 3-2 steps for the outgoing RST packet 2134 in response to 2133
1     raw-out
2     routing decision
      conntrack
3     mangle-out
      reroute-check
4     [nat-out]
5     filter-out
6     mangle-post
7     nat-post

The thing that is strange is that the routing decision for packet 2133 in step 5 now is different than the routing decision for the other packets. When analysing requests that works, e.g does not get stuck, even the last FIN gets properly routed. It seems like a bug in the kernel or that the routing decision is stateful in some way.

EDIT

One thing that could cause these problems is the following fact, the traffic is routed between the firewall and the local LAN, so the client LAN is not directly connected to the firewall through L2.

                +---------------------------+       +------------------+                         +------------------------+
                |                           |       |      Router      |   (   Lab network    )  |                        |
( Internet ) -- + eth1                 eth0 +-------+                  +-- (                  ) -+ Client 192.168.126.161 |
                | 10.1.1.1   192.168.60.254 |       |                  |   ( 192.168.126.0/24 )  |                        |
                +---------------------------+       +------------------+                         +------------------------+

In this picture, 10.1.1.1 represents the external IP address of the firewall, all other addresses are the real IP addresses used.

Here is the routing table on the firewall:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.1.1.0        0.0.0.0         255.255.255.240 U     0      0        0 eth1
192.168.126.0   0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.60.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
0.0.0.0         10.1.1.15       0.0.0.0         UG    0      0        0 eth1

Note that 10.1.1.0 and default gw 10.1.1.15 are made up, the rest is exactly the same as used. I had to manually add the 192.168.126.0/24 route to reach the lab network from eth0 (192.168.60.254).

Here are some extensive logs on the packet traversal for the last packet 2133 that gets rejected due to being routed to the local host (e.g the firewall).

[16406874.374588] raw pre IN=eth1 OUT= MAC=00:02:b3:b9:ff:b5:00:90:1a:10:06:88:08:00 SRC=172.16.1.2 DST=10.1.1.1 LEN=1004 TOS=0x00 PREC=0x00 TTL=55 ID=13739 DF PROTO=TCP SPT=80 DPT=53497 WINDOW=5840 RES=0x00 ACK PSH FIN URGP=0 
[16406874.374625] mangle pre IN=eth1 OUT= MAC=00:02:b3:b9:ff:b5:00:90:1a:10:06:88:08:00 SRC=172.16.1.2 DST=10.1.1.1 LEN=1004 TOS=0x00 PREC=0x00 TTL=55 ID=13739 DF PROTO=TCP SPT=80 DPT=53497 WINDOW=5840 RES=0x00 ACK PSH FIN URGP=0 
[16406874.374667] mangle in IN=eth1 OUT= MAC=00:02:b3:b9:ff:b5:00:90:1a:10:06:88:08:00 SRC=172.16.1.2 DST=10.1.1.1 LEN=1004 TOS=0x00 PREC=0x00 TTL=55 ID=13739 DF PROTO=TCP SPT=80 DPT=53497 WINDOW=5840 RES=0x00 ACK PSH FIN URGP=0 
[16406874.374699] filter in IN=eth1 OUT= MAC=00:02:b3:b9:ff:b5:00:90:1a:10:06:88:08:00 SRC=172.16.1.2 DST=10.1.1.1 LEN=1004 TOS=0x00 PREC=0x00 TTL=55 ID=13739 DF PROTO=TCP SPT=80 DPT=53497 WINDOW=5840 RES=0x00 ACK PSH FIN URGP=0 
[16406874.374780] mangle out IN= OUT=eth1 SRC=10.1.1.1 DST=172.16.1.2 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=53497 DPT=80 WINDOW=0 RES=0x00 RST URGP=0 
[16406874.374807] mangle post IN= OUT=eth1 SRC=10.1.1.1 DST=172.16.1.2 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=0 DF PROTO=TCP SPT=53497 DPT=80 WINDOW=0 RES=0x00 RST URGP=0 
[16406874.378813] mangle pre IN=eth0 OUT= MAC=00:02:b3:b9:ff:b4:00:90:1a:10:0c:dd:08:00 SRC=192.168.126.161 DST=172.16.1.2 LEN=40 TOS=0x00 PREC=0x00 TTL=63 ID=35424 DF PROTO=TCP SPT=53497 DPT=80 WINDOW=11680 RES=0x00 ACK URGP=0 
[16406874.378863] mangle fwd IN=eth0 OUT=eth1 SRC=192.168.126.161 DST=172.16.1.2 LEN=40 TOS=0x00 PREC=0x00 TTL=62 ID=35424 DF PROTO=TCP SPT=53497 DPT=80 WINDOW=11680 RES=0x00 ACK URGP=0 

Once again, our fw external IP has been replaced with 10.1.1.1 and the ip of webserver outside the NAT'ed network is replaced with 172.16.1.2

EDIT Breaking News!

Ok last try was to DROP the RST packet, very very interesting, I added an iptables rule that dropped all RST packets destinated to the webserver we have problem requesting files from. And then it worked e.g the last FIN, ACK, PSH packet 2133 in the log above is dropped, but since the RST is dropped the webserver has time to get all the ACK's ant then decides to retransmitt the last packet, packet 2133 once again, and now it goes through the firewall since the contrack module now has seen ACK's coming back from the client and allows the last ACK, FIN packet with the final payload.

So it is definitely a timing/window problem, this particular file, with the timing of the ACK's from the client, triggers something in conntrack that rejects the final packet from the webserver.

So far, googling and reading Kernel doc's reveal nothing that can cause this behaviour, next step will be to read the kernel source code for the routing/conntrack module.

PROBLEM SOLVED

Well, at least now we know exactly what happens and have a workaround that solves the problem.

Sergey pointed to the very valuable -m state –state INVALID matching rule which helped a lot in the debugging, I now realize that an iptables setup without an explicit rule for INVALID packets is not complete so strange behavior seems to happen sometimes.

When enabling logging in the conntrack module for what causes the invalid packet, what happens its pretty obvious and I had my suspicion regarding this.

[16659529.322465] nf_ct_tcp: SEQ is over the upper bound (over the window of the receiver) IN= OUT= SRC=172.16.1.2 DST=10.1.1.1 LEN=1004 TOS=0x00 PREC=0x00 TTL=55 ID=40874 DF PROTO=TCP SPT=80 DPT=55498 SEQ=658735108 ACK=1194081763 WINDOW=5840 RES=0x00 ACK PSH FIN URGP=0 

Once again, 172.16.1.2 is the external webserver (which behaves incorrect) and 10.1.1.1 is the external address of the firewall.

The webserver pushes more data over the wire than the client has advertised in the receive window (conntrack is state-full and verifies this), it seems that it is when the FIN packet arrives that conntrack bails out since the receive window is actually exceeded much earlier.

I believe that it might be caused by incorrect TCP offloading in the network card on the webserver. When I started to analyse this I took captures on the webserver and according to the tcpdump/wireshark traces jumbo frames were written by the TCP layer in the kernel which was then segmented into smaller frames with MTU=1500 by the network card. So obviously this needs to be adressed in the webserver since it is not correct TCP behaviour to send more data than the receiver has adverties in its receive window.

Both Polynomial and Sergey provided valuable input, but Sergey pointed me to the exact behavior of the conntrack/NAT module regarding packet traversal.

Best Answer

A similar situation is described at http://www.spinics.net/lists/netfilter/msg51408.html: some packets which should have been processed by NAT somehow got marked as INVALID instead of ESTABLISHED, and went to the INPUT chain. You should add some rules with -m state --state INVALID to check for this, and the answer at http://www.spinics.net/lists/netfilter/msg51409.html suggests that such INVALID packet should always be DROPped, because NAT is not performed on them properly, therefore addresses in them may be wrong.

If your problematic packets are really marked as INVALID, adding iptables -I INPUT -m state --state INVALID -j DROP probably will work around the problem (the broken packet will not get to the local process and will not cause the RST response, then TCP will recover from the lost packet after a timeout). Then you can try to debug the problem further, as described in http://www.spinics.net/lists/netfilter/msg51411.html:

echo 255 >/proc/sys/net/netfilter/nf_conntrack_log_invalid

(In that particular case the problem was caused by some broken networking hardware along the path, probably combined with some TCP checksum offload brokenness.)