I'm having a strange problem with a server I have sitting at my workplace (it's behind a NAT, if that's important). The issue is that at times it becomes unreachable and then comes back up again, usually within a few seconds, though sometimes the outage lasts up to 1 minute. It doesn't reboot, it doesn't crash. It simply becomes inaccessible. During this time, I cannot ssh into it, nor can I access any applications running on the machine (it's running a couple of Rails apps, so they become unreachable as well). I checked dmesg and saw these lines –
[ 4.958074] ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 5.040476] ADDRCONF(NETDEV_UP): eth1: link is not ready
[ 5.175624] igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 5.177207] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
A couple of lines later, I see something similar concerning the network interfaces –
[1195777.544167] igb: eth0 NIC Link is Down
[1195780.962943] igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
It does look like a network issue. /var/log/messages doesn't show anything interesting. I'm not sure how to debug this. Any clue as to what it could be? And what should I be checking here? Thanks!
Best Answer
This kind of problem usually doesn't generate a lot of log messages. You have discovered the two important messages, which show the interface going down and coming back up. These can be generated by unplugging the ethernet cable and plugging it back in.
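To see how often the link flaps and how long each outage lasts, the Down/Up pairs in the kernel log can be matched up. The following is only a rough sketch: it assumes the log lines look like the igb messages quoted in the question, and the function name `link_outages` and the regular expression are illustrative, not part of any existing tool.

```python
import re

# Matches kernel log lines such as
# "[1195777.544167] igb: eth0 NIC Link is Down"
# (format assumed from the dmesg excerpt in the question).
LINK_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+\w+: (\w+) NIC Link is (Up|Down)")

def link_outages(lines):
    """Return (interface, down_timestamp, seconds_down) for each Down/Up pair."""
    down_since = {}
    outages = []
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue
        ts, iface, state = float(m.group(1)), m.group(2), m.group(3)
        if state == "Down":
            down_since[iface] = ts
        elif iface in down_since:
            # An "Up" that follows a recorded "Down" closes the outage.
            start = down_since.pop(iface)
            outages.append((iface, start, ts - start))
    return outages

# Lines copied from the question's dmesg output.
sample = [
    "[ 5.175624] igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX",
    "[1195777.544167] igb: eth0 NIC Link is Down",
    "[1195780.962943] igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX",
]
for iface, start, duration in link_outages(sample):
    print(f"{iface} went down at t={start:.0f}s for {duration:.1f}s")
```

In practice the input would be the saved output of dmesg rather than a hard-coded list. For the excerpt in the question, the one recorded outage on eth0 lasted about 3.4 seconds.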
It could be a bad cable between the NIC and the router. My first steps (done one at a time) would be:
netstat -i
or
ifconfig
and examine the error counts. Normally, they should be 0 or in the single digits. High carrier or frame errors may indicate a duplex mismatch. A duplex mismatch can be verified by uploading and then downloading a large file: large speed differences accompanied by increasing error counts indicate a mismatch on the link. Cable modems usually have different upload and download bandwidths, so local transfers work better for this test.
One tool I do use is mtr. I use a command like
mtr -i 15 -n google.com
to monitor connectivity. Consider using one of your ISP's servers instead of google.com. It can be run in report mode in batch. If the problem is upstream of the server, the output should help identify where the problem is occurring.
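For the error-count check described earlier, it can help to compare two snapshots of the counters (for example, before and after a large file transfer) rather than eyeballing absolute numbers. Below is a minimal sketch that assumes a Linux host, where /proc/net/dev exposes per-interface counters; the helper names `parse_proc_net_dev` and `error_deltas` and the choice of watched counters are my own.

```python
import os

# Column order of /proc/net/dev on Linux (receive half, then transmit half).
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed"]

# Counters most relevant to cable or duplex problems (an illustrative selection).
WATCHED = ["rx_errs", "rx_frame", "tx_errs", "tx_carrier", "tx_colls"]

def parse_proc_net_dev(text):
    """Parse /proc/net/dev content into {iface: {"rx_errs": int, ...}}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        iface, data = line.split(":", 1)
        values = [int(v) for v in data.split()]
        if len(values) < 16:
            continue
        counters = {"rx_" + k: v for k, v in zip(RX_FIELDS, values[:8])}
        counters.update({"tx_" + k: v for k, v in zip(TX_FIELDS, values[8:16])})
        stats[iface.strip()] = counters
    return stats

def error_deltas(before, after, iface):
    """Return the watched counters that increased between two snapshots."""
    return {k: after[iface][k] - before[iface][k]
            for k in WATCHED
            if after[iface][k] > before[iface][k]}

if __name__ == "__main__" and os.path.exists("/proc/net/dev"):
    with open("/proc/net/dev") as f:
        for iface, c in parse_proc_net_dev(f.read()).items():
            print(iface, {k: c[k] for k in WATCHED})
```

Taking one snapshot before the transfer and one after, then calling `error_deltas(before, after, "eth0")`, shows which counters grew during the test. An empty result means none of the watched counters increased.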