The short answer is that you switch when the effort to not switch exceeds the effort of switching, or when you can foresee that it will soon become so. That's a very subjective assessment which requires experience, but it's relatively easy to see the extreme cases.
For example, say you estimate one month to switch, but not switching means you can't use a module that's only available in the upgraded version, so you'll have to take two months to implement one from scratch. The choice is easy.
For another example, if you are spending all your time fixing security vulnerabilities in software that is no longer supported by the vendor, it's a good time to switch.
If you never have meetings where someone says, "This would be a lot better/easier with the new version," then you probably don't need to switch.
The in-between cases are harder to recognize, but they usually feel like mounting pressure: sticking with the old version becomes more and more limiting the longer you wait.
As for your "well-done software" test, your bug tracker provides a good rule of thumb. When you have zero critical defects and your list of non-critical defects is at a reasonable level and steadily shrinking, you can call it 'done.' You can get a sense of your architecture's quality from the turnaround time for adding new features or fixing bugs. There's no cut-off point that says, "this is professional," but the lower, the better.
Because there is a huge difference between optimizing for performance and turning a safety feature off completely.
By reducing the number of garbage collections, their framework is more responsive and can (presumably) run faster. Now, optimizing for the garbage collector doesn't mean they never do a garbage collection. It just means they do it less often, and when they do, it runs really fast. These kinds of optimizations include (a small sketch follows the list):
- Minimizing the number of objects that move to a survivor space (i.e., that survive at least one garbage collection) by using small, throw-away objects. Objects that have moved to a survivor space are harder to collect, and a garbage collection there sometimes implies freezing the whole JVM.
- Not allocating too many objects in the first place. This can backfire if you're not careful, since young-generation objects are very cheap to allocate and collect.
- Ensuring that new objects point to old ones (and not the other way around), so that young objects are easy to collect, since there is no reference to them that would cause them to be kept.
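To make this more concrete, here is a minimal, illustrative Java sketch (not LMAX's actual code; the class and field names are invented) of keeping allocation out of the hot path: one long-lived holder object is reused on every call instead of allocating a fresh object per event, so nothing lives long enough to be promoted to a survivor space.

```java
// Illustrative only: reuse one preallocated, mutable holder on the hot path
// instead of allocating a new object per event, so nothing survives long
// enough to be promoted to a survivor space.
final class PriceUpdate {
    long instrumentId;
    long priceInCents;
}

final class QuoteHandler {
    // One long-lived object, allocated once at startup.
    private final PriceUpdate scratch = new PriceUpdate();

    void onQuote(long instrumentId, long priceInCents) {
        // No allocation here: the same holder is overwritten on every call.
        scratch.instrumentId = instrumentId;
        scratch.priceInCents = priceInCents;
        process(scratch);
    }

    private void process(PriceUpdate update) {
        // Business logic that must not keep a reference to 'update'
        // beyond this call, otherwise the reuse trick breaks.
    }
}
```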
When you tune for performance, you usually tune a few very specific "hot spots" while ignoring code that doesn't run often. If you do that in Java, you can let the garbage collector keep taking care of those dark corners (since it won't make much difference) while optimizing very carefully the areas that run in a tight loop. So you can choose where you optimize and where you don't, and you can thus focus your effort where it matters.
Now, if you turn garbage collection off completely, you can't choose. You must manually dispose of every object, ever. That method that gets called at most once per day? In Java, you can leave it alone, since its performance impact is negligible (it may be fine to let a full GC occur every month). In C++, you would still be leaking resources there, so you must take care even of that obscure method. You must pay the price of resource management in every single part of your application, while in Java you can focus.
But it gets worse.
What if you have a bug, let's say in a dark corner of your application that is only reached on a Monday with a full moon? Java has strong safety guarantees. There is little to no "undefined behavior". If you use something incorrectly, an exception is thrown, your program stops, and no data corruption occurs. So you can be pretty sure that nothing bad happens without you noticing.
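As a trivial illustration of that fail-fast behavior (a toy example, nothing to do with LMAX's code), an out-of-bounds array access in Java throws immediately instead of silently scribbling over unrelated memory:

```java
// Toy example of Java's fail-fast guarantees: the bad access below throws
// ArrayIndexOutOfBoundsException at the faulty line; no other memory is touched.
public class FailFast {
    public static void main(String[] args) {
        int[] balances = new int[10];
        try {
            balances[42] = 100;   // bug: index out of range
        } catch (ArrayIndexOutOfBoundsException e) {
            // The JVM stops us right here, so the bug cannot silently
            // corrupt data and keep running.
            System.err.println("Caught: " + e.getMessage());
        }
    }
}
```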
But in something like D, you can have a bad pointer access, or a buffer overflow, and you can corrupt your memory, but your program won't know (you turned the safety off, remember?) and will keep running with its incorrect data, and do some pretty nasty things and corrupt your data, and you don't notice, and as more corruption happens your data gets more and more wrong, and then suddenly it breaks, and it was a life-critical application, and some error happened in the computation of a rocket, and so it doesn't work, and the rocket explodes, and someone dies, and your company is on the front page of every newspaper, and your boss points a finger at you saying, "You are the engineer who suggested we use D to optimize performance; how come you didn't think of safety?" And it is your fault. You killed those people with your foolish attempt at performance.
OK, OK, most of the time it is much less dramatic than that. But even a business-critical application, or just a GPS app or, let's say, a government healthcare website, can have some pretty negative consequences if you have bugs. Using a language that either prevents them completely or fails fast when they happen is usually a very good idea.
There is a cost to turning off a safety feature. Going native doesn't always make sense. Sometimes it is much simpler and safer to optimize a safe language a bit than to go all-in on a language where you can shoot yourself in the foot big-time. In a lot of cases, correctness and safety trump the few nanoseconds you would have scraped off by eliminating the GC completely. The Disruptor can be used in those situations, so I think LMAX-Exchange made the right call.
But what about D in particular? You do have a GC if you want it for the dark corners, and the SafeD subset (which I didn't know about before the edit) removes undefined behavior (if you remember to use it!).
Well, in that case it's a simple question of maturity. The Java ecosystem is full of well-written tools and mature libraries (better for development). Many more developers know Java than D (better for maintenance). Going for a new and not-so-popular language for something as critical as a financial application would not have been a good idea. With a less-known language, if you have a problem, few people can help you, and the libraries you find tend to have more bugs, since they have been exposed to fewer users.
So my last point still holds: if you want to avoid problems with dire consequences, stick with safe choices. At this point in D's life, its customers are the little start-ups ready to take crazy risks. If a problem can cost millions, you are better off staying further back on the innovation bell curve.
Best Answer
There are all kinds of techniques for high-performance transaction processing and the one in Fowler's article is just one of many at the bleeding edge. Rather than listing a bunch of techniques which may or may not be applicable to anyone's situation, I think it's better to discuss the basic principles and how LMAX addresses a large number of them.
For a high-scale transaction processing system you want to do all of the following as much as possible:
Minimize time spent in the slowest storage tiers. From fastest to slowest on a modern server you have: CPU/L1 -> L2 -> L3 -> RAM -> Disk/LAN -> WAN. The jump from even the fastest modern magnetic disk to the slowest RAM is over 1000x for sequential access; random access is even worse.
Minimize or eliminate time spent waiting. This means sharing as little state as possible and, if state must be shared, avoiding explicit locks whenever possible (a small compare-and-set sketch follows this list).
Spread the workload. CPUs haven't gotten much faster in the past several years, but they have gotten smaller, and 8 cores is pretty common on a server. Beyond that, you can even spread the work over multiple machines, which is Google's approach; the great thing about this is that it scales everything including I/O.
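As a small illustration of the second principle, here is a hedged Java sketch of lock-free shared state using a compare-and-set retry loop; the class and method names are invented for the example, and this is not how LMAX implements it.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of lock-free shared state: threads publish a new high-water mark
// with a compare-and-set retry loop instead of a synchronized block, so no
// thread ever blocks waiting for a lock holder.
public class HighWaterMark {
    private final AtomicLong value = new AtomicLong(Long.MIN_VALUE);

    public long update(long candidate) {
        long current;
        do {
            current = value.get();
            if (candidate <= current) {
                return current;       // nothing to publish, no retry needed
            }
            // Retry only if another thread changed the value in between.
        } while (!value.compareAndSet(current, candidate));
        return candidate;
    }
}
```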
According to Fowler, LMAX takes the following approach to each of these:
Keep all state in memory at all times. Most database engines will actually do this anyway, if the entire database can fit in memory, but they don't want to leave anything up to chance, which is understandable on a real-time trading platform. In order to pull this off without adding a ton of risk, they had to build a bunch of lightweight backup and failover infrastructure.
Use a lock-free queue ("disruptor") for the stream of input events. Contrast this with traditional durable message queues, which are definitely not lock-free and in fact usually involve painfully slow distributed transactions. (A toy sketch of the ring-buffer idea follows after the next point.)
Not much. LMAX throws this one under the bus on the basis that workloads are interdependent; the outcome of one changes the parameters for the others. This is a critical caveat, and one which Fowler explicitly calls out. They do make some use of concurrency in order to provide failover capabilities, but all of the business logic is processed on a single thread.
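To make the lock-free queue idea more tangible, here is a toy single-producer/single-consumer ring buffer in Java. This is emphatically not the Disruptor's actual API (see Fowler's article or the LMAX code for that); it only illustrates passing events through a preallocated array with sequence counters instead of a locked queue.

```java
// Toy single-producer/single-consumer ring buffer, for illustration only.
// Events flow through a fixed-size, preallocated array; the producer and the
// consumer each advance their own sequence counter, so no locks are needed.
public class ToyRingBuffer {
    private final long[] slots;
    private final int mask;                 // size must be a power of two
    private volatile long writeSeq = 0;     // next slot the producer will fill
    private volatile long readSeq = 0;      // next slot the consumer will read

    public ToyRingBuffer(int sizePowerOfTwo) {
        this.slots = new long[sizePowerOfTwo];
        this.mask = sizePowerOfTwo - 1;
    }

    // Called only by the single producer thread.
    public boolean offer(long event) {
        if (writeSeq - readSeq == slots.length) {
            return false;                   // buffer full, caller must retry
        }
        slots[(int) (writeSeq & mask)] = event;
        writeSeq = writeSeq + 1;            // publish after writing the slot
        return true;
    }

    // Called only by the single consumer thread.
    public long poll() {
        if (readSeq == writeSeq) {
            return -1;                      // nothing available (toy sentinel)
        }
        long event = slots[(int) (readSeq & mask)];
        readSeq = readSeq + 1;
        return event;
    }
}
```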
LMAX is not the only approach to high-scale OLTP. And although it's quite brilliant in its own right, you do not need to use bleeding-edge techniques in order to pull off that level of performance.
Of all of the principles above, #3 is probably the most important and the most effective, because, frankly, hardware is cheap. If you can properly partition the workload across half a dozen cores and several dozen machines, then the sky's the limit for conventional Parallel Computing techniques. You'd be surprised how much throughput you can pull off with nothing but a bunch of message queues and a round-robin distributor. It's obviously not as efficient as LMAX - actually not even close - but throughput, latency, and cost-effectiveness are separate concerns, and here we're talking specifically about throughput.
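For instance, here is a rough sketch of the "bunch of message queues and a round-robin distributor" setup mentioned above (the names and queue sizes are arbitrary): a single dispatcher deals work out in turn to N workers, each draining only its own queue, so the workers never contend with one another on shared state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Rough sketch of a round-robin distributor over per-worker message queues.
public class RoundRobinDistributor {
    private final List<BlockingQueue<Runnable>> queues = new ArrayList<>();
    private int next = 0;   // only touched by the single dispatching thread

    public RoundRobinDistributor(int workers) {
        for (int i = 0; i < workers; i++) {
            BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(1024);
            queues.add(queue);
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        queue.take().run();   // block until work arrives
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    // Called from the dispatcher thread: deal tasks out in turn.
    public void submit(Runnable task) throws InterruptedException {
        queues.get(next).put(task);
        next = (next + 1) % queues.size();
    }
}
```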
If you have the same sort of special needs that LMAX does - in particular, a shared state which corresponds to a business reality as opposed to a hasty design choice - then I'd suggest trying out their component, because I haven't seen much else that's suited to those requirements. But if we're simply talking about high scalability then I'd urge you to do more research into distributed systems, because they are the canonical approach used by most organizations today (Hadoop and related projects, ESB and related architectures, CQRS which Fowler also mentions, and so on).
SSDs are also going to become a game-changer; arguably, they already are. You can now have permanent storage with similar access times to RAM, and although server-grade SSDs are still horribly expensive, they will eventually come down in price once adoption rates grow. It's been researched extensively and the results are pretty mind-boggling and will only get better over time, so the whole "keep everything in memory" concept is a lot less important than it used to be. So once again, I'd try to focus on concurrency whenever possible.