Linux – How to debug Linux latency issues under network load

latency linux networking

I have 12 Cassandra database nodes running a mix of Ubuntu 12 and 14. All of them are bare-metal machines with SSDs and 1Gb network cards, colocated in the same DC (managed colo).

Under light operation, the latency between the database nodes and our cloud nodes (also in the same DC) is under 1ms.

When I start ramping up writes to the database nodes, latency to and from them climbs to around 300ms. CPU load stays around 1 (on 4 physical cores), disk utilization is below 3%, and dstat shows network load of around 18MiB.

Local reads & writes to Cassandra are relatively quick, so I've ruled out the application layer being overloaded.

What tools and settings should I look into to understand why my latency is so bad? I have monitoring tools in place that show the problem, but I'm unsure where to start in diagnosing it.

Best Answer

My starting point for issues like this is usually perf top. It quickly gives you an idea of where most of the time is being spent, in both kernel and user-space code. See https://perf.wiki.kernel.org/index.php/Tutorial for some nice examples of how to use it.
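As a cross-check on the dstat figure from the question, here is a minimal sketch (not part of perf) that samples /proc/net/dev twice and reports per-interface throughput together with any growth in the kernel's drop counters. The function names parse_net_dev, rates and sample are made up for illustration, and the column positions assume the standard /proc/net/dev layout of 8 receive fields followed by 8 transmit fields.

```python
import time


def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        # Assumed layout: 8 RX columns, then 8 TX columns.
        stats[name.strip()] = {
            "rx_bytes": fields[0],
            "rx_drop": fields[3],
            "tx_bytes": fields[8],
            "tx_drop": fields[11],
        }
    return stats


def rates(before, after, seconds):
    """Byte rates (bytes/s) and drop-counter deltas between two samples."""
    out = {}
    for iface, new in after.items():
        old = before.get(iface)
        if old is None:  # interface appeared between samples; skip it
            continue
        out[iface] = {
            "rx_Bps": (new["rx_bytes"] - old["rx_bytes"]) / seconds,
            "tx_Bps": (new["tx_bytes"] - old["tx_bytes"]) / seconds,
            "rx_drop": new["rx_drop"] - old["rx_drop"],
            "tx_drop": new["tx_drop"] - old["tx_drop"],
        }
    return out


def sample(interval=1.0, path="/proc/net/dev"):
    """Read the counters twice, `interval` seconds apart, and diff them."""
    with open(path) as f:
        before = parse_net_dev(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_net_dev(f.read())
    return rates(before, after, interval)
```

Calling sample() on a node while the write load is running returns one entry per interface. If the byte rates are far below what the 1Gb link can carry but rx_drop or tx_drop keeps growing, that points at the network path rather than at Cassandra itself.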
