Java Performance – How to Significantly Improve Java Performance


The team over at LMAX have a presentation about how they were able to do 100k TPS at less than 1 ms of latency. They have backed up that presentation with a blog, technical paper (PDF) and the source code itself.

Recently, Martin Fowler published an excellent article on the LMAX architecture; he mentions that they are now able to handle six million orders per second and highlights a few of the steps the team took to go up another order of magnitude in performance.

So far I've explained that the key to the speed of the Business Logic
Processor is doing everything sequentially, in-memory. Just doing this (and
nothing really stupid) allows developers to write code that can process 10K TPS.

They then found that concentrating on the simple elements of good code could
bring this up into the 100K TPS range. This just needs well-factored code and
small methods – essentially this allows Hotspot to do a better job of
optimizing and for CPUs to be more efficient in caching the code as it's
running.

It took a bit more cleverness to go up another order of magnitude. There are
several things that the LMAX team found helpful to get there. One was to write
custom implementations of the Java collections that were designed to be
cache-friendly and careful with garbage.

Another technique to reach that top level of performance is paying attention
to performance testing. I've long noticed that people talk a lot about
techniques to improve performance, but the one thing that really makes a
difference is to test it.
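That last point is worth making concrete. Below is a toy measurement loop (my own sketch, not anything from Fowler or the LMAX team): it runs a warm-up phase so HotSpot has compiled the hot path before timing starts, then reports the average cost per call. A real benchmark would use a harness such as JMH instead.

```java
import java.util.function.LongSupplier;

// Toy microbenchmark loop: warm up first so the JIT has compiled the
// hot path, then time many iterations and report the average.
// Illustrative only - a real benchmark would use JMH, which also
// defends against dead-code elimination and unstable JIT states.
public class MiniBench {
    public static double avgNanos(LongSupplier work, int warmup, int measured) {
        long sink = 0;
        for (int i = 0; i < warmup; i++) sink += work.getAsLong(); // warm-up
        long t0 = System.nanoTime();
        for (int i = 0; i < measured; i++) sink += work.getAsLong();
        long elapsed = System.nanoTime() - t0;
        if (sink == 42) System.out.println(sink); // keep 'sink' live so the loop isn't optimized away
        return (double) elapsed / measured;
    }
}
```

Even a crude loop like this will catch order-of-magnitude regressions, which is the point Fowler is making: measure, don't guess.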

Fowler notes that the team found several helpful techniques, but he only describes a couple of them.

Are there other architectures, libraries, techniques or "things" that are helpful to reach such levels of performance?

Best Answer

There are all kinds of techniques for high-performance transaction processing and the one in Fowler's article is just one of many at the bleeding edge. Rather than listing a bunch of techniques which may or may not be applicable to anyone's situation, I think it's better to discuss the basic principles and how LMAX addresses a large number of them.

For a high-scale transaction processing system you want to do all of the following as much as possible:

  1. Minimize time spent in the slowest storage tiers. From fastest to slowest on a modern server you have: CPU/L1 -> L2 -> L3 -> RAM -> Disk/LAN -> WAN. The jump from even the fastest modern magnetic disk to the slowest RAM is over 1000x for sequential access; random access is even worse.

  2. Minimize or eliminate time spent waiting. This means sharing as little state as possible, and, if state must be shared, avoiding explicit locks whenever possible.

  3. Spread the workload. CPU clock speeds haven't risen much in the past several years, but core counts have, and 8 cores is pretty common on a server. Beyond that, you can even spread the work over multiple machines, which is Google's approach; the great thing about this is that it scales everything, including I/O.
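The gap described in #1 is easy to feel even without a disk in the picture, because the same effect shows up between cache and RAM. This toy program (names and sizes are my own) sums an array once in index order and once in a shuffled order; once the array exceeds the last-level cache, the shuffled pass is typically several times slower despite doing identical arithmetic:

```java
import java.util.Random;

public class AccessPatternDemo {
    // Sum in index order: the hardware prefetcher streams cache lines
    // ahead of the loop, so nearly every read is a cache hit.
    static long sumSequential(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) sum += data[i];
        return sum;
    }

    // Sum the same elements in a shuffled order: each read is likely a
    // cache miss once the array is larger than the last-level cache.
    static long sumRandomOrder(long[] data, int[] order) {
        long sum = 0;
        for (int i = 0; i < order.length; i++) sum += data[order[i]];
        return sum;
    }

    public static void main(String[] args) {
        int n = 1 << 22; // ~32 MB of longs, larger than most L3 caches
        long[] data = new long[n];
        int[] order = new int[n];
        for (int i = 0; i < n; i++) { data[i] = i; order[i] = i; }
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) { // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        long t0 = System.nanoTime();
        long s1 = sumSequential(data);
        long t1 = System.nanoTime();
        long s2 = sumRandomOrder(data, order);
        long t2 = System.nanoTime();
        System.out.println("sums equal: " + (s1 == s2));
        System.out.println("sequential ms: " + (t1 - t0) / 1_000_000);
        System.out.println("random     ms: " + (t2 - t1) / 1_000_000);
    }
}
```

The exact ratio depends on the machine, but the lesson carries all the way down the hierarchy: access patterns matter as much as raw operation counts.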

According to Fowler, LMAX takes the following approach to each of these:

  1. Keep all state in memory at all times. Most database engines will actually do this anyway, if the entire database can fit in memory, but they don't want to leave anything up to chance, which is understandable on a real-time trading platform. In order to pull this off without adding a ton of risk, they had to build a bunch of lightweight backup and failover infrastructure.

  2. Use a lock-free queue (the "disruptor") for the stream of input events. Contrast this with traditional durable message queues, which are definitely not lock-free and in fact usually involve painfully slow distributed transactions.

  3. Not much. LMAX throws this one under the bus on the basis that workloads are interdependent; the outcome of one changes the parameters for the others. This is a critical caveat, and one which Fowler explicitly calls out. They do make some use of concurrency in order to provide failover capabilities, but all of the business logic is processed on a single thread.
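The disruptor in #2 can be approximated, in spirit, by a single-producer/single-consumer ring buffer built from two sequence counters instead of locks. The sketch below is my own simplification, not the real Disruptor API, which adds cache-line padding against false sharing, batching, and multi-consumer coordination:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-producer/single-consumer ring buffer in the spirit of
// the LMAX Disruptor: pre-allocated slots, power-of-two capacity, and
// two sequence counters instead of locks. Illustrative sketch only.
public class SpscRingBuffer {
    private final long[] slots;  // pre-allocated, so no per-message garbage
    private final int mask;
    private final AtomicLong head = new AtomicLong(0); // next slot to read
    private final AtomicLong tail = new AtomicLong(0); // next slot to write

    public SpscRingBuffer(int capacityPowerOfTwo) {
        slots = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Producer side: spin (rather than block) until a slot is free.
    public void publish(long value) {
        long t = tail.get();
        while (t - head.get() >= slots.length) Thread.onSpinWait(); // buffer full
        slots[(int) (t & mask)] = value;
        tail.lazySet(t + 1); // ordered store: slot write is visible before the new tail
    }

    // Consumer side: spin until an event is available.
    public long take() {
        long h = head.get();
        while (tail.get() - h == 0) Thread.onSpinWait(); // buffer empty
        long value = slots[(int) (h & mask)];
        head.lazySet(h + 1); // free the slot for the producer
        return value;
    }
}
```

Because there is exactly one producer and one consumer, each counter has a single writer, so no compare-and-swap loop is needed - which is a large part of why this style of queue is so much cheaper than a locked one.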

LMAX is not the only approach to high-scale OLTP. And although it's quite brilliant in its own right, you do not need to use bleeding-edge techniques in order to pull off that level of performance.

Of all of the principles above, #3 is probably the most important and the most effective, because, frankly, hardware is cheap. If you can properly partition the workload across half a dozen cores and several dozen machines, then the sky's the limit for conventional parallel computing techniques. You'd be surprised how much throughput you can achieve with nothing but a bunch of message queues and a round-robin distributor. It's obviously not as efficient as LMAX - actually not even close - but throughput, latency, and cost-effectiveness are separate concerns, and here we're talking specifically about throughput.
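A minimal sketch of that "message queues plus a round-robin distributor" setup, under my own naming and with arbitrary queue sizes:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Deliberately plain throughput setup: a single-threaded round-robin
// distributor feeding one bounded queue per worker thread.
public class RoundRobinFarm {
    private final BlockingQueue<Runnable>[] queues;
    private int next = 0; // distributor runs on one thread, so no lock needed

    @SuppressWarnings("unchecked")
    public RoundRobinFarm(int nWorkers) {
        queues = new BlockingQueue[nWorkers];
        for (int i = 0; i < nWorkers; i++) {
            queues[i] = new ArrayBlockingQueue<>(1024);
            BlockingQueue<Runnable> q = queues[i];
            Thread worker = new Thread(() -> {
                try {
                    while (true) q.take().run(); // drain own queue forever
                } catch (InterruptedException e) { /* shutdown */ }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    public void submit(Runnable task) {
        try {
            queues[next].put(task); // blocks only when that worker's queue is full
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while submitting", e);
        }
        next = (next + 1) % queues.length; // round-robin, no shared counter
    }
}
```

Per-worker queues avoid the contention of a single shared queue; the trade-off is that a slow task can back up one worker's queue while others sit idle, which is why real systems often add work stealing or smarter routing.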

If you have the same sort of special needs that LMAX does - in particular, shared state that corresponds to a business reality rather than a hasty design choice - then I'd suggest trying out their Disruptor component, because I haven't seen much else that's suited to those requirements. But if we're simply talking about high scalability, then I'd urge you to do more research into distributed systems, because they are the canonical approach used by most organizations today (Hadoop and related projects, ESB and related architectures, CQRS, which Fowler also mentions, and so on).

SSDs are also going to be a game changer; arguably, they already are. You can now have persistent storage with access times approaching those of RAM, and although server-grade SSDs are still horribly expensive, they will eventually come down in price as adoption grows. SSD performance has been researched extensively, and the results are pretty mind-boggling and will only get better over time, so the whole "keep everything in memory" concept is a lot less important than it used to be. So once again, I'd try to focus on concurrency whenever possible.
