No, no, no, you can't separate something that is intrinsically a part of the job.
Let programmers write whatever crap makes all the tests go green, and then tell another team to rewrite everything so it also runs fast? It doesn't work that way.
It is the job of the same programmer who writes the code to also think about performance. If you free them of that obligation, what would motivate them to learn and do better each time?
Having said that, there is a career path if you choose to specialize in performance tuning. But it's not so much a day-to-day job as it is offering your consulting services to various clients to help with their performance issues. And obviously, to be able to do that, you must already have moved past merely writing code that works and gained insight into how to make working code fast.
There are all kinds of techniques for high-performance transaction processing, and the one in Fowler's article is just one of many at the bleeding edge. Rather than listing a bunch of techniques which may or may not be applicable to any particular situation, I think it's better to discuss the basic principles and how LMAX addresses a large number of them.
For a high-scale transaction processing system you want to do all of the following as much as possible:
Minimize time spent in the slowest storage tiers. From fastest to slowest on a modern server you have: CPU/L1 -> L2 -> L3 -> RAM -> Disk/LAN -> WAN. The jump from even the fastest modern magnetic disk to the slowest RAM is over 1000x for sequential access; random access is even worse.
Minimize or eliminate time spent waiting. This means sharing as little state as possible and, if state must be shared, avoiding explicit locks whenever possible (there's a small sketch of this right after the list).
Spread the workload. CPUs haven't gotten much faster in the past several years, but they have gotten smaller, and 8 cores is pretty common on a server. Beyond that, you can even spread the work over multiple machines, which is Google's approach; the great thing about this is that it scales everything including I/O.
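To make the second point a little more concrete, here is a toy sketch, entirely my own and not from Fowler's article, contrasting a mutex-guarded counter with a lock-free counter built on C11 atomics. The shared state is the same; the difference is that no thread ever blocks on the lock-free one. All names and numbers are purely illustrative.

```c
/* counters.c -- toy comparison of a locked vs. lock-free shared counter.
 * Build (POSIX): cc -O2 -pthread counters.c -o counters */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define THREADS    8
#define INCREMENTS 1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long        locked_counter   = 0;   /* shared state guarded by `lock` */
static atomic_long lockfree_counter = 0;   /* shared state with no lock at all */

static void *locked_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);          /* every thread queues up here */
        locked_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *lockfree_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&lockfree_counter, 1);  /* no blocking, just an atomic add */
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];

    for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, locked_worker, NULL);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);

    for (int i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, lockfree_worker, NULL);
    for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);

    printf("locked:    %ld\n", locked_counter);
    printf("lock-free: %ld\n", atomic_load(&lockfree_counter));
    return 0;
}
```

Real systems share far more interesting state than a counter, of course, but the trade-off between blocking on a lock and using atomic operations is the same.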
According to Fowler, LMAX takes the following approach to each of these:
Keep all state in memory at all times. Most database engines will actually do this anyway, if the entire database can fit in memory, but they don't want to leave anything up to chance, which is understandable on a real-time trading platform. In order to pull this off without adding a ton of risk, they had to build a bunch of lightweight backup and failover infrastructure.
Use a lock-free queue (the "disruptor") for the stream of input events. Contrast this with traditional durable message queues, which are decidedly not lock-free and in fact usually involve painfully slow distributed transactions. (A simplified ring-buffer sketch follows this list.)
Not much, as far as spreading the workload goes. LMAX throws this one under the bus on the basis that the workloads are interdependent; the outcome of one changes the parameters for the others. This is a critical caveat, and one which Fowler explicitly calls out. They do make some use of concurrency in order to provide failover capabilities, but all of the business logic is processed on a single thread.
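To make the disruptor idea more tangible, here is a minimal sketch of a lock-free single-producer/single-consumer ring buffer in C11. This is my own simplification of the general technique, not the actual LMAX Disruptor (which is a Java library and does far more: batching, multiple consumers, cache-line padding, and so on).

```c
/* spsc_ring.c -- minimal single-producer/single-consumer lock-free ring buffer.
 * A ring must be zero-initialized before use, e.g. `static struct ring r;`. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024                      /* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

struct ring {
    long slots[RING_SIZE];
    atomic_size_t head;                     /* next slot the producer will write */
    atomic_size_t tail;                     /* next slot the consumer will read  */
};

/* Producer side: returns false if the ring is full. */
static bool ring_publish(struct ring *r, long event)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;                       /* full: the consumer hasn't caught up */

    r->slots[head & RING_MASK] = event;
    /* The release store makes the slot write visible before the new head value. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if there is nothing to read. */
static bool ring_consume(struct ring *r, long *event)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail == head)
        return false;                       /* empty */

    *event = r->slots[tail & RING_MASK];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

In spirit, one thread publishing input events into such a ring while a single business-logic thread consumes them is what the input side of the LMAX design looks like.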
LMAX is not the only approach to high-scale OLTP. And although it's quite brilliant in its own right, you do not need to use bleeding-edge techniques in order to pull off that level of performance.
Of all of the principles above, the third (spread the workload) is probably the most important and the most effective, because, frankly, hardware is cheap. If you can properly partition the workload across half a dozen cores and several dozen machines, then the sky's the limit for conventional parallel-computing techniques. You'd be surprised how much throughput you can pull off with nothing but a bunch of message queues and a round-robin distributor (rough sketch below). It's obviously not as efficient as LMAX - actually not even close - but throughput, latency, and cost-effectiveness are separate concerns, and here we're talking specifically about throughput.
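Here is a rough sketch of that "message queues plus round-robin distributor" shape. The structure and names are my own toy construction, not anything prescribed by Fowler or LMAX: each worker owns a small mailbox, and the distributor simply deals incoming jobs out in rotation.

```c
/* roundrobin.c -- toy round-robin work distributor over per-worker mailboxes.
 * Build (POSIX): cc -O2 -pthread roundrobin.c -o roundrobin
 * Entirely illustrative; a real system would use proper message queues. */
#include <pthread.h>
#include <stdio.h>

#define WORKERS    4
#define QUEUE_SIZE 64
#define JOBS       32

struct mailbox {
    int jobs[QUEUE_SIZE];
    int count;
    int done;                               /* set when no more jobs will arrive */
    pthread_mutex_t mu;
    pthread_cond_t cv;
};

static struct mailbox boxes[WORKERS];

static void *worker(void *arg)
{
    struct mailbox *box = arg;
    for (;;) {
        pthread_mutex_lock(&box->mu);
        while (box->count == 0 && !box->done)
            pthread_cond_wait(&box->cv, &box->mu);
        if (box->count == 0 && box->done) {
            pthread_mutex_unlock(&box->mu);
            return NULL;
        }
        int job = box->jobs[--box->count];
        pthread_mutex_unlock(&box->mu);

        printf("worker %ld handled job %d\n", (long)(box - boxes), job);
    }
}

int main(void)
{
    pthread_t threads[WORKERS];

    for (int i = 0; i < WORKERS; i++) {
        pthread_mutex_init(&boxes[i].mu, NULL);
        pthread_cond_init(&boxes[i].cv, NULL);
        pthread_create(&threads[i], NULL, worker, &boxes[i]);
    }

    /* The "distributor": hand each incoming job to the next worker in turn. */
    for (int job = 0; job < JOBS; job++) {
        struct mailbox *box = &boxes[job % WORKERS];
        pthread_mutex_lock(&box->mu);
        box->jobs[box->count++] = job;
        pthread_cond_signal(&box->cv);
        pthread_mutex_unlock(&box->mu);
    }

    for (int i = 0; i < WORKERS; i++) {     /* tell workers to drain and exit */
        pthread_mutex_lock(&boxes[i].mu);
        boxes[i].done = 1;
        pthread_cond_broadcast(&boxes[i].cv);
        pthread_mutex_unlock(&boxes[i].mu);
        pthread_join(threads[i], NULL);
    }
    return 0;
}
```

Swap the printf for real work and the in-process mailboxes for queues on separate machines, and you have the basic shape of a partitioned, horizontally scaled system.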
If you have the same sort of special needs that LMAX does - in particular, a shared state which corresponds to a business reality as opposed to a hasty design choice - then I'd suggest trying out their component, because I haven't seen much else that's suited to those requirements. But if we're simply talking about high scalability then I'd urge you to do more research into distributed systems, because they are the canonical approach used by most organizations today (Hadoop and related projects, ESB and related architectures, CQRS which Fowler also mentions, and so on).
SSDs are also going to be a game-changer; arguably, they already are. You can now have permanent storage with access times orders of magnitude closer to RAM than spinning disks, and although server-grade SSDs are still horribly expensive, they will eventually come down in price as adoption grows. SSD-backed storage has been researched extensively, the results are pretty mind-boggling, and they will only get better over time, so the whole "keep everything in memory" concept is a lot less important than it used to be. So once again, I'd try to focus on concurrency whenever possible.
Best Answer
Every processor I've worked on does comparison by subtracting one of the operands from the other, discarding the numeric result, and keeping only the effect on the processor's status flags (zero, negative, carry, and so on). Because the subtraction is performed as a single operation, the values of the operands have no effect on how long it takes.
The best way to answer the question for sure is to compile your code into assembly and consult the target processor's documentation for the instructions generated. For current Intel CPUs, that would be the Intel 64 and IA-32 Architectures Software Developer’s Manual.
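For example (my own illustration, not taken from the manual), here is a trivial comparison function and the kind of x86-64 output you might see when you ask the compiler for assembly:

```c
/* compare.c -- compile with: cc -O2 -S compare.c
 * then look for the cmp instruction in the generated compare.s. */
int is_less(long a, long b)
{
    /* On x86-64 this typically becomes something like:
     *     cmpq   %rsi, %rdi    subtract b from a, set the flags, discard the result
     *     setl   %al           al = 1 if the flags say "less than"
     *     movzbl %al, %eax
     * The cmp takes the same time regardless of what values a and b hold. */
    return a < b;
}
```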
The description of the CMP ("compare") instruction is in volume 2A, page 3-126, or page 618 of the PDF, and describes its operation as follows: the second operand is sign-extended if necessary, subtracted from the first operand, and the result is placed in a temporary area in the processor. Then the status flags are set the same way as they would be for the SUB ("subtract") instruction (page 1492 of the PDF). There's no mention in the CMP or SUB documentation that the values of the operands have any bearing on latency, so any value you use is safe.