Linux – High TCP reset and packet drop count on CentOS Linux

apache-2.2 centos linux packetloss reset

I have a small farm of web servers (HP Proliant and IBM x, with Broadcom Corporation NetXtreme II BCM5 NICs) running Apache 2.2.15 on CentOS 6, behind a Cisco ACE load balancer, serving a PHP/JS based web portal. This farm receives a lot of requests daily (it serves a whole small country) from clients trying to access a splash page (and go, from there, to the index page).

I've been struggling with the following problem:

I've noticed that requests to the web servers sometimes take quite a "long" time to be answered (from the client's point of view), and sometimes they are not answered at all (timeout on the web client side). In the latter case, I don't even see the request in the Apache logs.
I've also noticed that netstat reports an increasing amount of TCP resets being sent (netstat -st | grep 'resets sent')
Also, dropwatch -l kas shows there are many packets being dropped:

Initalizing kallsyms db
dropwatch> start
Enabling monitoring…
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
53 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
26 drops at tcp_rcv_established+926 (0xffffffff814981b6)
3 drops at tcp_v4_reqsk_destructor+fa (0xffffffff814a104a)
1 drops at netlink_unicast+251 (0xffffffff81471b11)
56 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
29 drops at tcp_rcv_established+926 (0xffffffff814981b6)
4 drops at tcp_v4_reqsk_destructor+fa (0xffffffff814a104a)
51 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
32 drops at tcp_rcv_established+926 (0xffffffff814981b6)
2 drops at tcp_v4_reqsk_destructor+fa (0xffffffff814a104a)
1 drops at ip_rcv_finish+199 (0xffffffff8147ea49)
1 drops at tcp_v4_destroy_sock+115 (0xffffffff814a0cf5)
1 drops at tcp_v4_reqsk_destructor+fa (0xffffffff814a104a)
22 drops at tcp_rcv_established+926 (0xffffffff814981b6)
36 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
2 drops at tcp_v4_reqsk_destructor+fa (0xffffffff814a104a)
49 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
29 drops at tcp_rcv_established+926 (0xffffffff814981b6)
26 drops at tcp_rcv_established+926 (0xffffffff814981b6)
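
To see which drop points dominate, the per-sample counts can be summed per kernel symbol. This is only a minimal sketch: the function name and regular expression are made up for illustration, and it assumes each sample has the "N drops at symbol+offset (address)" shape shown above.

```python
import re
from collections import Counter

# Matches dropwatch samples such as:
#   53 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
# \s+ is used between tokens so a sample wrapped across lines still matches.
DROP_SAMPLE = re.compile(
    r"(\d+)\s+drops\s+at\s+(\w+)\+[0-9a-f]+\s+\(0x[0-9a-f]+\)"
)

def total_drops_by_symbol(dropwatch_output: str) -> Counter:
    """Sum the drop counts per kernel symbol across all samples."""
    totals = Counter()
    for count, symbol in DROP_SAMPLE.findall(dropwatch_output):
        totals[symbol] += int(count)
    return totals

# Two samples copied from the output above, as a quick demonstration.
sample = (
    "53 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)\n"
    "26 drops at tcp_rcv_established+926 (0xffffffff814981b6)\n"
)
for symbol, count in total_drops_by_symbol(sample).most_common():
    print(f"{count:6d}  {symbol}")
```

Fed the whole capture above, this adds up to 245 drops at tcp_v4_md5_hash_skb, 164 at tcp_rcv_established and 12 at tcp_v4_reqsk_destructor, with a single drop each at the remaining three symbols.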

I've been following recommendations from RH (Red Hat Enterprise Linux Network Performance Tuning Guide), even though I've not seen some of the symptoms described there in my servers. In short:

I've increased the NIC ring buffers to maximum.
I've fiddled with (increased or changed) several kernel parameters (tcp_syncookies, netdev_budget, tcp_timestamps, tcp_window_scaling, tcp_rmem, dev_weight, tcp_tw_reuse…)
I've modified the Apache config according to several "Apache optimization guides" found on the web (even though there were, and still are, idle workers in the Apache stats)
I've stopped/disabled any system service/daemon not required (basically all that remains is sshd, httpd and snmpd)
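
When changing this many tunables it helps to record what is actually live on each server. A minimal sketch follows; the fully qualified sysctl names are my assumption (the list above only gives the short names), and the helper names are made up for illustration.

```python
from pathlib import Path

# Assumed full names for the tunables mentioned above.
TUNABLES = [
    "net.ipv4.tcp_syncookies",
    "net.ipv4.tcp_timestamps",
    "net.ipv4.tcp_window_scaling",
    "net.ipv4.tcp_rmem",
    "net.ipv4.tcp_tw_reuse",
    "net.core.netdev_budget",
    "net.core.dev_weight",
]

def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root, *name.split("."))

def read_sysctl(name: str, root: str = "/proc/sys"):
    """Return the live value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None

for name in TUNABLES:
    print(f"{name} = {read_sysctl(name)}")
```

Running it on every server in the farm and comparing the output is a quick way to spot a machine whose settings did not survive a reboot.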

All of the above with no luck.

All NICs are working at Speed: 1000Mb/s, CPU and disk usage are low, and neither netstat nor ethtool shows errors.

Any ideas what else can be done?

Best Answer

A TCP reset is an immediate close of a TCP connection. This allows for the resources that were allocated for the previous connection to be released and made available to the system.

Causes of RST generation

Ack, Reset

An Ack, Reset is sent in response to a Syn frame. It acknowledges receipt of the frame, but lets the client know that the server cannot allow the connection on that port. Among the reasons for the Ack, Reset are:

a. The node being connected to is not listening on the port the client node is trying to connect to.

b. There is some reason that the server node cannot complete the connection on that port. For example, the server is out of resources and so cannot allocate the needed resources to allow the connection.
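
Case (a) is easy to reproduce on a test machine. The sketch below is only an illustration (the function name is made up, and it assumes a Linux host with a working loopback interface): it picks a port nobody is listening on, connects to it, and checks that the connection is refused, which is how the Ack, Reset shows up to the client application.

```python
import socket

def connect_to_closed_port(host: str = "127.0.0.1") -> bool:
    """Return True if connecting to a port with no listener is refused."""
    # Bind to port 0 so the kernel picks a free port, then close the socket
    # again so that nothing is listening on that port.
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.bind((host, 0))
    port = probe.getsockname()[1]
    probe.close()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2.0)
    try:
        client.connect((host, port))
    except ConnectionRefusedError:
        # The SYN was answered with a reset instead of a SYN/ACK.
        return True
    except OSError:
        # Timeout or some other failure: not the case we wanted to show.
        return False
    finally:
        client.close()
    return False  # something accepted the connection after all

print("refused:", connect_to_closed_port())
```

Each refused connection of this kind should also bump the "resets sent" counter that netstat -st reports on the machine that refused it.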

RST

If the connection is in any non-synchronized state (LISTEN, SYN-SENT, SYN-RECEIVED), and the incoming segment acknowledges something not yet sent (the segment carries an unacceptable ACK), a reset is sent.
The next reset is a TCP reset that happens when a network frame is sent six times (this would be the original frame plus five retransmits of the frame) without a response. As a result, the sending node resets the connection.

As you have already tried various kernel tuning parameters, try using the kernel's TCP SYN cookies option.

Enable TCP SYN cookie protection

Edit the file /etc/sysctl.conf, run:
# vi /etc/sysctl.conf

Append the following entry:

net.ipv4.tcp_syncookies = 1

Save and close the file. To reload the change, type:
# sysctl -p
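
To double-check that the entry really made it into the file, something like the following can be used. It is a rough sketch with made-up helper names, and it only understands simple "key = value" lines plus comment lines starting with '#' or ';'.

```python
def parse_sysctl_conf(text: str) -> dict:
    """Parse sysctl.conf-style text into a {key: value} dictionary."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        key, separator, value = line.partition("=")
        if separator:
            # A later entry for the same key overrides an earlier one.
            settings[key.strip()] = value.strip()
    return settings

def syncookies_enabled(conf_text: str) -> bool:
    """True if the text sets net.ipv4.tcp_syncookies to 1."""
    return parse_sysctl_conf(conf_text).get("net.ipv4.tcp_syncookies") == "1"

example = """
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
"""
print(syncookies_enabled(example))
```

On a real server the text would come from open("/etc/sysctl.conf").read(); the value the kernel is actually using is in /proc/sys/net/ipv4/tcp_syncookies.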

A definitive solution can only be given after analyzing your logs; iptables can also help.

Related Solutions

Linux – How passively monitor for tcp packet loss? (Linux)

For a general sense of the scale of your problem netstat -s will track your total number of retransmissions.

# netstat -s | grep retransmitted
     368644 segments retransmitted

You can also grep for segments to get a more detailed view:

# netstat -s | grep segments
         149840 segments received
         150373 segments sent out
         161 segments retransmitted
         13 bad segments received
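
If you want a ratio rather than raw counters, the same numbers can be read programmatically. The sketch below assumes the TCP counters are exposed as a pair of "Tcp:" lines (names, then values) in /proc/net/snmp; the function names are made up, and the sample text is illustrative apart from the four segment counts taken from the output above.

```python
def parse_tcp_counters(snmp_text: str) -> dict:
    """Return the Tcp counters from /proc/net/snmp-style text."""
    tcp_lines = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = tcp_lines[0], tcp_lines[1]
    return dict(zip(names, (int(value) for value in values)))

def retransmit_ratio(counters: dict) -> float:
    """Retransmitted segments as a fraction of segments sent out."""
    sent = counters["OutSegs"]
    return counters["RetransSegs"] / sent if sent else 0.0

# Illustrative sample; only InSegs, OutSegs, RetransSegs and InErrs
# are taken from the netstat output above.
SAMPLE = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts\n"
    "Tcp: 1 200 120000 -1 10 20 0 1 5 149840 150373 161 13 7\n"
)
counters = parse_tcp_counters(SAMPLE)
print(f"retransmitted: {retransmit_ratio(counters):.2%}")
```

For live numbers, pass in the contents of /proc/net/snmp instead of SAMPLE. Sampling twice and diffing the counters gives a rate, which is usually more useful than totals since boot.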

For a deeper dive, you'll probably want to fire up Wireshark.

In Wireshark set your filter to tcp.analysis.retransmission to see retransmissions by flow.

That's the best option I can come up with.

Other dead ends explored:

netfilter/conntrack tools don't seem to keep retransmits
stracing netstat -s showed that it is just printing /proc/net/netstat
column 9 in /proc/net/tcp looked promising, but it unfortunately appears to be unused.

Linux – High Server Crash Rates During Leap Second Day

This is caused by a livelock when ntpd calls adjtimex(2) to tell the kernel to insert a leap second. See lkml posting http://lkml.indiana.edu/hypermail/linux/kernel/1203.1/04598.html

Red Hat should be updating their KB article as well: https://access.redhat.com/knowledge/articles/15145

UPDATE: Red Hat has a second KB article just for this issue here: https://access.redhat.com/knowledge/solutions/154713 - the previous article is for an earlier, unrelated problem

The work-around is to just turn off ntpd. If ntpd already issued the adjtimex(2) call, you may need to disable ntpd and reboot to be 100% safe.

This affects RHEL 6 and other distros running newer kernels (newer than approx 2.6.26), but not RHEL 5.

The reason this is occurring before the leap second is actually scheduled to occur is that ntpd lets the kernel handle the leap second at midnight, but needs to alert the kernel to insert the leap second before midnight. ntpd therefore calls adjtimex(2) sometime during the day of the leap second, at which point this bug is triggered.

If you have adjtimex(8) installed, you can use this script to determine if flag 16 is set. Flag 16 is "inserting leap second":

adjtimex -p | perl -p -e 'undef $_, next unless m/status: (\d+)/; (16 & $1) && print "leap second flag is set:\n"'
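
For anyone who would rather see all the low status bits than just flag 16, here is a rough equivalent. It is a sketch only: the function names are made up, the bit values are the STA_* constants from linux/timex.h, and it assumes the "status: N" line format that the one-liner above matches.

```python
import re

# Low status bits as defined in <linux/timex.h>.
STATUS_FLAGS = {
    0x0001: "STA_PLL",
    0x0002: "STA_PPSFREQ",
    0x0004: "STA_PPSTIME",
    0x0008: "STA_FLL",
    0x0010: "STA_INS",       # 16: insert leap second
    0x0020: "STA_DEL",       # 32: delete leap second
    0x0040: "STA_UNSYNC",
    0x0080: "STA_FREQHOLD",
}

def status_from_adjtimex_output(output: str) -> int:
    """Extract the numeric status word from 'adjtimex -p' style output."""
    match = re.search(r"status: (\d+)", output)
    if match is None:
        raise ValueError("no status line found")
    return int(match.group(1))

def decode_status(status: int) -> list:
    """Names of the known low status bits that are set."""
    return [name for bit, name in STATUS_FLAGS.items() if status & bit]

def leap_insert_pending(status: int) -> bool:
    """True if the 'inserting leap second' flag (16) is set."""
    return bool(status & 0x0010)

status = status_from_adjtimex_output("       status: 17\n")
print(decode_status(status), leap_insert_pending(status))
```

The output of the real command can be captured with subprocess.run(["adjtimex", "-p"], capture_output=True, text=True).stdout and passed to status_from_adjtimex_output.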

UPDATE:

Red Hat has updated their KB article to note: "RHEL 6 customers may be affected by a known issue that causes NMI Watchdog to detect a hang when receiving the NTP leapsecond announcement. This issue is being addressed in a timely manner. If your systems received the leapsecond announcement and did not experience this issue, then they are no longer affected."

UPDATE: The above language was removed from the Red Hat article; and a second KB solution was added detailing the adjtimex(2) crash issue: https://access.redhat.com/knowledge/solutions/154713

However, the code change in the LKML post by IBM Engineer John Stultz notes there may also be a deadlock when the leap second is actually applied, so you may want to disable the leap second by rebooting or using adjtimex(8) after disabling ntpd.

FINAL UPDATE:

Well, I'm no kernel dev, but I reviewed John Stultz's patch again here: https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d

If I'm reading it right this time, I was wrong about there being another deadlock when the leap second is applied. That seems to be Red Hat's opinion as well, based on their KB entry. However, if you have disabled ntpd, keep it disabled for another 10 minutes, so that you don't hit the deadlock when ntpd calls adjtimex(2).

We'll find out if there are any more bugs soon :)

POST-LEAP SECOND UPDATE:

I spent the last few hours reading through the ntpd and pre-patch (buggy) kernel code, and while I may be very wrong here, I'll attempt to explain what I think was going on:

First, ntpd calls adjtimex(2) all the time. It does this as part of its "clock loop filter", defined in local_clock in ntp_loopfilter.c. You can see that code here: http://www.opensource.apple.com/source/ntp/ntp-70/ntpd/ntp_loopfilter.c (from ntp version 4.2.6).

The clock loop filter runs quite often -- it runs every time ntpd polls its upstream servers, which by default is every 17 minutes or more often. The relevant bit of the clock loop filter is:

if (sys_leap == LEAP_ADDSECOND)
    ntv.status |= STA_INS;

And then:

ntp_adjtime(&ntv)

In other words, on days when there's a leap second, ntpd sets the "STA_INS" flag and calls adjtimex(2) (via its portability-wrapper).

That system call makes its way to the kernel. Here's the relevant kernel code: https://github.com/mirrors/linux/blob/a078c6d0e6288fad6d83fb6d5edd91ddb7b6ab33/kernel/time/ntp.c

The kernel codepath is roughly this:

line 663 - start of do_adjtimex routine.
line 691 - cancel any existing leap-second timer.
line 709 - grab the ntp_lock spinlock (this lock is involved in the possible livelock crash)
line 724 - call process_adjtimex_modes.
line 616 - call process_adj_status.
line 590 - set time_status global variable, based on flags set in adjtimex(2) call
line 592 - check time_state global variable. In most cases, call ntp_start_leap_timer.
line 554 - check time_status global variable. STA_INS will be set, so set time_state to TIME_INS and call hrtimer_start (another kernel function) to start the leap second timer. In the process of creating a timer, this code grabs the xtime_lock. If this happens while another CPU has already grabbed the xtime_lock and the ntp_lock, then the kernel livelocks. This is why John Stultz wrote the patch to avoid using hrtimers. This is what was causing everyone trouble today.
line 598 - if ntp_start_leap_timer did not actually start a leap timer, set time_state to TIME_OK
line 751 - assuming the kernel does not livelock, the stack is unwound and the ntp_lock spinlock is released.
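
The lock interaction in that codepath can be modelled as two paths taking the same pair of locks in opposite order. The sketch below is a toy model, not kernel code: the function names are made up, and the second path (a CPU that holds xtime_lock and then wants ntp_lock) is my reading of the scenario rather than something I traced line by line.

```python
from itertools import combinations

def ordered_pairs(acquisitions):
    """All (earlier, later) lock pairs in one path's acquisition order."""
    return set(combinations(acquisitions, 2))

def find_inversions(paths: dict) -> set:
    """Return the lock pairs that two different paths take in opposite order."""
    inversions = set()
    for path_a, path_b in combinations(paths, 2):
        pairs_b = ordered_pairs(paths[path_b])
        for first, second in ordered_pairs(paths[path_a]):
            if (second, first) in pairs_b:
                inversions.add(frozenset((first, second)))
    return inversions

paths = {
    # do_adjtimex takes ntp_lock, then hrtimer_start needs xtime_lock.
    "adjtimex path": ["ntp_lock", "xtime_lock"],
    # Assumed: another CPU already holds xtime_lock and then wants ntp_lock.
    "other CPU": ["xtime_lock", "ntp_lock"],
}
for pair in find_inversions(paths):
    print("taken in opposite orders:", sorted(pair))
```

If both paths run at the same moment, each ends up waiting for the lock the other one holds, which matches the hang described for line 554.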

There are a couple interesting things here.

First, line 691 cancels the existing timer every time adjtimex(2) is called. Then, 554 re-creates that timer. This means each time ntpd ran its clock loop filter, the buggy code was invoked.

Therefore I believe Red Hat was wrong when they said that once ntpd had set the leap-second flag, the system would not crash. I believe each system running ntpd had the potential to livelock every 17 minutes (or more often) for the 24-hour period before the leap-second. I believe this may also explain why so many systems crashed; a one-time chance of crashing would be much less likely to hit as compared to 3 chances an hour.
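
The arithmetic behind "3 chances an hour" is simple enough to check. This sketch assumes ntpd's default poll range of 64 to 1024 seconds (1024 seconds being the "17 minutes" above) and one adjtimex(2) call per poll; the function name is made up for illustration.

```python
def adjtimex_calls(window_seconds: int, poll_interval_seconds: int) -> int:
    """Whole number of polls (and so adjtimex(2) calls) that fit in a window."""
    return window_seconds // poll_interval_seconds

HOUR = 3600
DAY = 24 * HOUR
MAX_POLL = 1024  # assumed default maximum poll interval, about 17 minutes
MIN_POLL = 64    # assumed default minimum poll interval

print("per hour at the slowest polling rate:", adjtimex_calls(HOUR, MAX_POLL))
print("per day at the slowest polling rate: ", adjtimex_calls(DAY, MAX_POLL))
print("per day at the fastest polling rate: ", adjtimex_calls(DAY, MIN_POLL))
```

So even a well-settled ntpd polling at its slowest rate would have gone through the buggy path dozens of times during the day before the leap second.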

UPDATE: In Red Hat's KB solution at https://access.redhat.com/knowledge/solutions/154713 , Red Hat engineers did come to the same conclusion (that running ntpd would continuously hit the buggy code). And indeed they did so several hours before I did. This solution wasn't linked to the main article at https://access.redhat.com/knowledge/articles/15145 , so I didn't notice it until now.

Second, this explains why loaded systems were more likely to crash. Loaded systems will be handling more interrupts, causing the "do_tick" kernel function to be called more often, giving more of a chance for this code to run and grab the ntp_lock while the timer was being created.

Third, is there a chance of the system crashing when the leap-second actually occurs? I don't know for sure, but possibly yes, because the timer that fires and actually executes the leap-second adjustment (ntp_leap_second, on line 388) also grabs the ntp_lock spinlock, and has a call to hrtimer_add_expires_ns. I don't know if that call might also be able to cause a livelock, but it doesn't seem impossible.

Finally, what causes the leap-second flag to be disabled after the leap-second has run? The answer there is ntpd stops setting the leap-second flag at some point after midnight when it calls adjtimex(2). Since the flag isn't set, the check on line 554 will not be true, and no timer will be created, and line 598 will reset the time_state global variable to TIME_OK. This explains why if you checked the flag with adjtimex(8) just after the leap second, you would still see the leap-second flag set.

In short, the best advice for today seems to be the first I gave after all: disable ntpd, and disable the leap-second flag.

And some final thoughts:

none of the Linux vendors noticed John Stultz's patch and applied it to their kernels :(
why didn't John Stultz alert some of the vendors this was needed? perhaps the chance of the livelock seemed low enough that making noise wasn't warranted.
I've heard reports of Java processes locking up or spinning when the leap-second was applied. Perhaps we should follow Google's lead and rethink how we apply leap-seconds to our systems: http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html

07/02 Update from John Stultz:

https://lkml.org/lkml/2012/7/1/203

The post contained a step-by-step walk-through of why the leap second caused the futex timers to expire prematurely and continuously, spiking the CPU load.
