Low Latency Unix/Linux – Performance Optimization


Most low-latency/high-frequency programming jobs (based on the job specs) appear to be implemented on Unix platforms. A lot of the specs specifically request people with "low latency Linux" experience.

Assuming this does not mean a real-time Linux OS, could people help me understand what this might be referring to? I know you can set CPU affinity for threads, but I assume there is much more to it than that.

Kernel tuning? (Although I have heard that manufacturers like Solarflare produce kernel-bypass network cards anyway.)

What about DMA, or possibly shared memory between processes? If people could give me brief pointers, I can go and do the research on Google myself.

(This question will probably require somebody familiar with High Frequency Trading)

Best Answer

I've done a fair amount of work supporting HFT groups in investment bank and hedge fund settings. I'm going to answer from the sysadmin view, but some of this applies to programming in such environments as well.

There are a couple of things an employer is usually looking for when they refer to "Low Latency" support. Some of these are "raw speed" questions (do you know which type of 10GbE card to buy, and which slot to put it in?), but more of them are about the ways in which a high-frequency trading environment differs from a traditional Unix environment. Some examples:

  • Unix is traditionally tuned to support running a large number of processes without starving any of them of resources, but in an HFT environment you are likely to want to run one application with an absolute minimum of overhead from context switching and the like (a minimal CPU-pinning sketch follows this list). As a classic small example, turning on hyperthreading on an Intel CPU allows more processes to run at once -- but it has a significant impact on the speed at which each individual process executes. As a programmer, you're likewise going to have to look at the cost of abstractions like threading and RPC, and figure out where a more monolithic solution -- while less clean -- will avoid overhead.

  • TCP/IP is typically tuned to prevent connection drops and to make efficient use of the available bandwidth. If your goal is to get the lowest latency possible out of a very fast link -- instead of the highest bandwidth possible out of a more constrained link -- you're going to want to adjust the tuning of the network stack. From the programming side, you're likewise going to want to look at the available socket options and figure out which ones have defaults tuned more for bandwidth and reliability than for reducing latency (see the socket-option sketch after this list).

  • As with networking, so with storage -- you're going to want to know how to tell a storage performance problem from an application problem, and to learn which patterns of I/O usage are least likely to interfere with your program's performance (as an example, learn where the complexity of asynchronous I/O can pay off for you, and what the downsides are -- see the async I/O sketch after this list).

  • Finally, and more painfully: we Unix admins want as much information on the state of the environments we monitor as possible, so we like to run tools like SNMP agents, active monitoring tools like Nagios, and data gathering tools like sar(1). In an environment where context switches need to be absolutely minimized and use of disk and network IO tightly controlled, though, we have to find the right tradeoff between the expense of monitoring and the bare-metal performance of the boxes monitored. Similarly, what techniques are you using that make coding easier but are costing you performance?
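
To make the first point concrete, here is a minimal sketch of pinning a process to a single core with sched_setaffinity(2) -- roughly what taskset(1) does for you. The choice of core 2 is an assumption for illustration; on a real box you would pin to a core you have isolated from the general scheduler.

    /* Minimal sketch: pin the calling process to one CPU core so the
     * scheduler stops migrating it. Core 2 is an arbitrary choice.
     * Build with: gcc -o pin pin.c
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);           /* assumed core; pick an isolated one */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        printf("pinned to core 2; the hot loop would run here\n");
        return 0;
    }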
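
On the networking point, the canonical first socket option to look at is TCP_NODELAY, which disables Nagle's algorithm so small writes go out immediately instead of being coalesced -- trading bandwidth efficiency for latency. A minimal sketch, with connection setup and most error handling omitted:

    /* Minimal sketch: create a TCP socket tuned for latency rather
     * than throughput by disabling Nagle's algorithm.
     */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int make_low_latency_socket(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return -1;
        }

        int one = 1;
        /* Send small packets immediately; costs bandwidth, saves latency. */
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) != 0)
            perror("setsockopt(TCP_NODELAY)");

        return fd;
    }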
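
For the storage point, here is a bare-bones sketch of POSIX asynchronous I/O (aio_read(3)), which lets the hot path keep working while a read completes in the background. The input file is an assumption for illustration, and the busy-wait stands in for the useful work a real program would do:

    /* Bare-bones sketch: issue a read without blocking the caller,
     * using POSIX AIO. Build with: gcc -o aio_demo aio_demo.c -lrt
     */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[4096];
        int fd = open("/etc/hostname", O_RDONLY);  /* assumed input file */
        if (fd < 0) { perror("open"); return 1; }

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        /* The hot path could keep doing useful work here... */
        while (aio_error(&cb) == EINPROGRESS)
            ;                       /* spin only for the sketch's sake */

        ssize_t n = aio_return(&cb);
        printf("read %zd bytes asynchronously\n", n);
        close(fd);
        return 0;
    }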

Finally, there are other things that just come with time; tricks and details that you learn with experience. But these are more specialized (when do I use epoll? why do two models of HP server with theoretically identical PCIe controllers perform so differently?), more tied to whatever your specific shop is using, and more likely to change from one year to another.
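
As one concrete instance of the epoll question above: epoll(7) scales much better than select(2)/poll(2) when you are watching many descriptors, which is why it shows up in market-data handlers. A bare-bones wait loop, with the socket creation assumed to happen elsewhere:

    /* Bare-bones sketch of an epoll event loop over an already-created
     * socket (setup assumed elsewhere).
     */
    #include <stdio.h>
    #include <sys/epoll.h>

    int run_event_loop(int sock_fd)
    {
        int epfd = epoll_create1(0);
        if (epfd < 0) { perror("epoll_create1"); return -1; }

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock_fd };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev) != 0) {
            perror("epoll_ctl");
            return -1;
        }

        struct epoll_event events[16];
        for (;;) {
            /* -1 blocks until an event arrives; a latency-sensitive hot
             * path might instead pass a 0 timeout and busy-poll. */
            int n = epoll_wait(epfd, events, 16, -1);
            for (int i = 0; i < n; i++)
                printf("fd %d is readable\n", events[i].data.fd);
        }
    }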
