Don't use your Domain Model and aggregates for querying.
In fact, what you are asking about is a common enough problem that a set of principles and patterns has been established to avoid just that: CQRS (Command Query Responsibility Segregation).
There are all kinds of techniques for high-performance transaction processing and the one in Fowler's article is just one of many at the bleeding edge. Rather than listing a bunch of techniques which may or may not be applicable to anyone's situation, I think it's better to discuss the basic principles and how LMAX addresses a large number of them.
For a high-scale transaction processing system you want to do all of the following as much as possible:
Minimize time spent in the slowest storage tiers. From fastest to slowest on a modern server you have: CPU/L1 -> L2 -> L3 -> RAM -> Disk/LAN -> WAN. The gap between even the fastest modern magnetic disk and the slowest RAM is over 1000x for sequential access; random access is even worse.
Minimize or eliminate time spent waiting. This means sharing as little state as possible, and, if state must be shared, avoiding explicit locks whenever possible.
Spread the workload. CPUs haven't gotten much faster in the past several years, but they have gotten smaller, and 8 cores is pretty common on a server. Beyond that, you can even spread the work over multiple machines, which is Google's approach; the great thing about this is that it scales everything including I/O.
According to Fowler, LMAX takes the following approach to each of these:
Keep all state in memory at all times. Most database engines will actually do this anyway, if the entire database can fit in memory, but they don't want to leave anything up to chance, which is understandable on a real-time trading platform. In order to pull this off without adding a ton of risk, they had to build a bunch of lightweight backup and failover infrastructure.
Use a lock-free queue (the "disruptor") for the stream of input events. Contrast this with traditional durable message queues, which are definitively not lock-free and in fact usually involve painfully slow distributed transactions.
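The real disruptor is considerably more elaborate (cache-line padding, batching, multiple consumers), but the core trick - producers and consumers coordinating through monotonically increasing sequence counters instead of locks - can be sketched as a single-producer/single-consumer ring buffer. This is a simplified illustration of the idea, not LMAX's actual code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024  /* must be a power of two */

struct ring {
    long buf[RING_SIZE];
    atomic_size_t head;  /* next slot to write; owned by the producer */
    atomic_size_t tail;  /* next slot to read; owned by the consumer */
};

/* Single producer: returns false if the ring is full. No locks -
   each side only ever writes its own counter. */
bool ring_push(struct ring *r, long v)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;
    r->buf[head & (RING_SIZE - 1)] = v;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Single consumer: returns false if the ring is empty. */
bool ring_pop(struct ring *r, long *out)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->buf[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The release/acquire pairing on the counters is what replaces the lock: the consumer only observes a new `head` after the slot's data is fully written, and vice versa.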
Not much. LMAX throws this one under the bus on the basis that workloads are interdependent; the outcome of one changes the parameters for the others. This is a critical caveat, and one which Fowler explicitly calls out. They do make some use of concurrency in order to provide failover capabilities, but all of the business logic is processed on a single thread.
LMAX is not the only approach to high-scale OLTP. And although it's quite brilliant in its own right, you do not need to use bleeding-edge techniques in order to pull off that level of performance.
Of all of the principles above, #3 is probably the most important and the most effective, because, frankly, hardware is cheap. If you can properly partition the workload across half a dozen cores and several dozen machines, then the sky's the limit for conventional Parallel Computing techniques. You'd be surprised how much throughput you can pull off with nothing but a bunch of message queues and a round-robin distributor. It's obviously not as efficient as LMAX - actually not even close - but throughput, latency, and cost-effectiveness are separate concerns, and here we're talking specifically about throughput.
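That distributor really is as simple as it sounds. A minimal sketch, with in-memory mailboxes standing in for real message queues (which in practice would be separate processes or machines):

```c
#define NWORKERS 4

/* Stand-ins for per-worker message queues: here we just count
   how many messages each worker has been handed. */
static int mailbox[NWORKERS];

static int next_worker = 0;

/* Round-robin distributor: hand each incoming message to the next
   worker in turn. Load spreads evenly, and the only shared state
   is a single counter. Returns the chosen worker's index. */
int dispatch(void)
{
    int w = next_worker;
    next_worker = (next_worker + 1) % NWORKERS;
    mailbox[w]++;
    return w;
}
```

With independent workloads, each worker (and its I/O) scales out with no coordination beyond this.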
If you have the same sort of special needs that LMAX does - in particular, a shared state which corresponds to a business reality as opposed to a hasty design choice - then I'd suggest trying out their Disruptor component, because I haven't seen much else that's suited to those requirements. But if we're simply talking about high scalability, then I'd urge you to do more research into distributed systems, because they are the canonical approach used by most organizations today (Hadoop and related projects, ESB and related architectures, CQRS which Fowler also mentions, and so on).
SSDs are also going to become a game-changer; arguably, they already are. You can now have permanent storage with similar access times to RAM, and although server-grade SSDs are still horribly expensive, they will eventually come down in price once adoption rates grow. It's been researched extensively and the results are pretty mind-boggling and will only get better over time, so the whole "keep everything in memory" concept is a lot less important than it used to be. So once again, I'd try to focus on concurrency whenever possible.
insert premature-optimization-is-the-root-of-all-evil lecture
That said, here are some habits I've gotten into to avoid unnecessary inefficiency, and in some cases, make my code simpler and more correct as well.
This isn't a discussion of general principles, but of some things to be aware of to avoid introducing unnecessary inefficiencies into code.
Know your big-O
This should probably be merged into the lengthy discussion above. It's pretty much common sense that a loop inside a loop, where the inner loop repeats a calculation, is going to be slower. For example:
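Something along these lines (a hypothetical `lowercase_in_place`, standing in for any per-character loop):

```c
#include <string.h>

/* strlen(s) is itself O(n), so calling it in the loop condition
   makes this loop O(n^2): the length is recomputed from scratch
   on every iteration. */
void lowercase_in_place(char *s)
{
    for (size_t i = 0; i < strlen(s); i++) {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
    }
}
```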
This will take a horrendous amount of time if the string is really long, because the length is recalculated on every iteration of the loop. (Note that GCC actually optimizes this case, because `strlen()` is marked as a pure function.)

When sorting a million 32-bit integers, bubble sort would be the wrong way to go. In general, sorting can be done in O(n log n) time (or better, in the case of radix sort), so unless you know your data is going to be small, look for an algorithm that's at least O(n log n).
Likewise, when dealing with databases, be aware of indexes. If you `SELECT * FROM people WHERE age = 20`, and you don't have an index on `people(age)`, it'll require an O(n) sequential scan rather than a much faster O(log n) index scan.

Integer arithmetic hierarchy
When programming in C, bear in mind that some arithmetic operations are more expensive than others. For integers, the hierarchy goes something like this (least expensive first):
- `+ - ~ & | ^`
- `<< >>`
- `*`
- `/`
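A quick sketch of what's at stake, including the semantic difference discussed next (results assume a two's-complement machine with arithmetic right shift, which is what mainstream compilers give you):

```c
#include <assert.h>

/* / and % truncate toward zero, while >> and & effectively round
   toward negative infinity, so the cheap bitwise forms only agree
   with the arithmetic forms for non-negative operands. */
void compare_cheap_and_expensive(void)
{
    /* Non-negative values: the two families agree. */
    assert(9 / 2 == 9 >> 1);      /* both 4 */
    assert(9 % 2 == (9 & 1));     /* both 1 */

    /* Negative values: they do not. */
    assert(-9 / 2 == -4);
    assert(-9 >> 1 == -5);
    assert(-9 % 2 == -1);
    assert((-9 & 1) == 1);
}
```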
Granted, the compiler will usually optimize things like `n / 2` to `n >> 1` automatically if you're targeting a mainstream computer, but if you're targeting an embedded device, you might not get that luxury.

Also, `% 2` and `& 1` have different semantics. Division and modulus usually round toward zero, but that's implementation-defined. Good ol' `>>` and `&` always round toward negative infinity, which (in my opinion) makes a lot more sense. For instance, on my computer, `-9 % 2` is `-1`, while `-9 & 1` is `1`. Hence, use what makes sense. Don't think you're being a good boy by using `% 2` when you were originally going to write `& 1`.

Expensive floating point operations
Avoid heavy floating point operations like `pow()` and `log()` in code that doesn't really need them, especially when dealing with integers. Take, for example, reading a number digit by digit and using `pow(10, i)` to scale each digit into place. Not only is this use of `pow()` (and the `int` <-> `double` conversions needed to use it) rather expensive, but it creates an opportunity for precision loss (incidentally, the code in question doesn't have precision issues). That's why I wince when I see this type of function used in a non-mathematical context.

Also, notice how the "clever" algorithm, which simply multiplies the running value by 10 on each iteration, is actually more concise: