First, most JVMs include a compiler, so "interpreted bytecode" is actually pretty rare (at least in benchmark code -- it's not quite as rare in real life, where your code is usually more than a few trivial loops that get repeated extremely often).
Second, a fair number of the benchmarks involved appear to be quite biased (whether by intent or incompetence, I can't really say). Just for example, years ago I looked at some of the source code linked from one of the links you posted. It had code like this:
init0 = (int*)calloc(max_x, sizeof(int));
init1 = (int*)calloc(max_x, sizeof(int));
init2 = (int*)calloc(max_x, sizeof(int));
for (x=0; x<max_x; x++) {
    init2[x] = 0;
    init1[x] = 0;
    init0[x] = 0;
}
Since calloc provides memory that's already zeroed, using the for loop to zero it again is obviously useless. This was followed (if memory serves) by filling the memory with other data anyway (with no dependence on it being zeroed), so all the zeroing was completely unnecessary. Replacing the code above with a simple malloc (like any sane person would have used to start with) improved the speed of the C++ version enough to beat the Java version (by a fairly wide margin, if memory serves).
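Concretely, the fix amounts to something like this (a sketch; the declarations of init0 through init2 and max_x are assumed to be as in the original benchmark):

/* Plain malloc is enough: the buffers are overwritten with other
   data before anything reads them, so no zeroing is needed. */
init0 = (int*)malloc(max_x * sizeof(int));
init1 = (int*)malloc(max_x * sizeof(int));
init2 = (int*)malloc(max_x * sizeof(int));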
Consider (for another example) the methcall benchmark used in the blog entry in your last link. Despite the name (and how things might even look), the C++ version of this is not really measuring much about method call overhead at all. The part of the code that turns out to be critical is in the Toggle class:
class Toggle {
public:
    Toggle(bool start_state) : state(start_state) { }
    virtual ~Toggle() { }
    bool value() {
        return(state);
    }
    virtual Toggle& activate() {
        state = !state;
        return(*this);
    }
    bool state;
};
The critical part turns out to be the assignment state = !state. Consider what happens when we change the code to encode the state as an int instead of a bool:
class Toggle {
    enum names { bfalse = -1, btrue = 1 };
    const static names values[2];
    int state;
public:
    Toggle(bool start_state) : state(values[start_state]) { }
    virtual ~Toggle() { }
    bool value() { return state == btrue; }
    virtual Toggle& activate() {
        state = -state;
        return(*this);
    }
};
This minor change improves the overall speed by about a 5:1 margin. Even though the benchmark was intended to measure method call time, in reality most of what it was measuring was the time to convert between int and bool. I'd certainly agree that the inefficiency shown by the original is unfortunate -- but given how rarely it seems to arise in real code, and the ease with which it can be fixed when/if it does arise, I have a difficult time thinking of it as meaning much.
In case anybody decides to re-run the benchmarks involved, I should also add that there's an almost equally trivial modification to the Java version that produces (or at least at one time produced -- I haven't re-run the tests with a recent JVM to confirm it still does) a fairly substantial improvement as well. The Java version has an NthToggle::activate() that looks like this:
public Toggle activate() {
    this.counter += 1;
    if (this.counter >= this.count_max) {
        this.state = !this.state;
        this.counter = 0;
    }
    return(this);
}
Changing this to call the base function instead of manipulating this.state directly gives quite a substantial speed improvement (though not enough to keep up with the modified C++ version).
So, what we end up with is a false assumption about interpreted byte codes on one hand, and some of the worst benchmarks I've ever seen on the other. Neither is giving a meaningful result.
My own experience is that with equally experienced programmers paying equal attention to optimizing, C++ will beat Java more often than not -- but (at least between these two), the language will rarely make as much difference as the programmers and design. The benchmarks being cited tell us more about the (in)competence/(dis)honesty of their authors than they do about the languages they purport to benchmark.
[Edit: As implied in one place above but never stated as directly as I probably should have, the results I'm quoting are those I got when I tested this ~5 years ago, using C++ and Java implementations that were current at that time. I haven't rerun the tests with current implementations. A glance, however, indicates that the code hasn't been fixed, so all that would have changed would be the compiler's ability to cover up the problems in the code.]
If we ignore the Java examples, however, it is actually possible for interpreted code to run faster than compiled code (though difficult and somewhat unusual).
The usual way this happens is that the code being interpreted is much more compact than the machine code, or it's running on a CPU that has a larger data cache than code cache.
In such a case, a small interpreter (e.g., the inner interpreter of a Forth implementation) may be able to fit entirely in the code cache, and the program it's interpreting fits entirely in the data cache. The cache is typically faster than main memory by a factor of at least 10, and often much more (a factor of 100 isn't particularly rare any more).
So, if the cache is faster than main memory by a factor of N, and it takes fewer than N machine code instructions to implement each byte code, the byte code should win (I'm simplifying, but I think the general idea should still be apparent).
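To make that concrete, here's a minimal sketch of such an inner interpreter (my own illustration, not taken from any of the benchmarks above). The dispatch loop compiles to a handful of machine instructions that stay hot in the code cache, while the program it executes is just a few bytes of data:

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical one-byte op codes, so the interpreted "program" stays tiny.
enum Op : std::uint8_t { PUSH, ADD, MUL, PRINT, HALT };

void run(const std::vector<std::uint8_t>& code) {
    std::vector<int> stack;
    std::size_t pc = 0;
    for (;;) {
        switch (code[pc++]) {              // the entire interpreter is this loop
        case PUSH:  stack.push_back(code[pc++]); break;
        case ADD:   { int b = stack.back(); stack.pop_back(); stack.back() += b; break; }
        case MUL:   { int b = stack.back(); stack.pop_back(); stack.back() *= b; break; }
        case PRINT: std::printf("%d\n", stack.back()); break;
        case HALT:  return;
        }
    }
}

int main() {
    // (2 + 3) * 4 -- ten bytes of byte code, living entirely in the data cache.
    run({PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT});
}

A real Forth-style inner interpreter is even tighter than this switch, but the shape is the same: one small, permanently-cached loop interpreting a stream of very compact instructions.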
There are many NoSQL solutions around, each one with its own strengths and weaknesses, so the following must be taken with a grain of salt.
But essentially, what many NoSQL databases do is rely on denormalization and try to optimize for the denormalized case. For instance, say you are reading a blog post together with its comments in a document-oriented database. Often, the comments will be saved together with the post itself. This means that it will be faster to retrieve all of them together, as they are stored in the same place and you do not have to perform a join.
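In code terms, the stored shape is something like the following sketch (field names made up for illustration; this is not any particular database's API):

#include <string>
#include <vector>

// Denormalized document: the comments live inside the post itself,
// so a single read brings back everything -- no join required.
struct Comment {
    std::string author;
    std::string body;
};

struct BlogPost {
    std::string title;
    std::string body;
    std::vector<Comment> comments;  // embedded inline, not referenced by id
};

A normalized SQL schema would instead keep the comments in their own table with a post_id foreign key, and join at read time.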
Of course, you can do the same in SQL, and denormalizing is a common practice when one needs performance. It is just that many NoSQL solutions are engineered from the start to always be used this way. You then get the usual tradeoffs: for instance, adding a comment in the above example will be slower because you have to rewrite the whole document containing it. And once you have denormalized, you have to take care of preserving data integrity in your application.
Moreover, in many NoSQL solutions it is impossible to do arbitrary joins, and hence arbitrary queries. Some databases, like CouchDB, require you to think ahead about the queries you will need and prepare them inside the DB.
All in all, it boils down to expecting a denormalized schema and optimizing reads for that situation, and this works well for data that is not highly relational and that sees many more reads than writes.
This may sound obvious, but computers don't execute formulas, they execute code, and how long that execution takes depends directly on the code they execute and only indirectly on whatever concept that code implements. Two logically identical pieces of code can have very different performance characteristics. Some reasons that are likely to crop up in matrix multiplication specifically:
- Branching: much modern hardware (SIMD units, GPUs) executes operations in lock-step groups, and if one element of a group takes a different side of an if statement, all the others in that group have to wait for it.

A lot of smart people have written very efficient code for common linear algebra operations, using the above tricks and many more, usually even with stupid platform-specific tricks. Therefore, transforming your formula into a matrix multiplication and then implementing that calculation by calling into a mature linear algebra library benefits from all that optimization effort. By contrast, if you simply write the formula out in the obvious way in a high-level language, the machine code that is eventually generated won't use all of those tricks and won't be as fast. This is also true if you take the matrix formulation and implement it by calling a naive matrix multiplication routine that you wrote yourself (again, in the obvious way).
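To make the contrast concrete, here's a sketch (mine, not from the original argument) of the naive routine next to a call into a mature BLAS library; cblas_dgemm is the standard C interface for double-precision matrix multiply and computes C = alpha*A*B + beta*C:

#include <cblas.h>  // BLAS C interface, e.g. from OpenBLAS or ATLAS

// The "obvious way": correct, but uses none of the cache, SIMD,
// or threading tricks a tuned library applies.
void naive_matmul(const double* A, const double* B, double* C, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

// The same product through the tuned library routine:
// C = 1.0 * A * B + 0.0 * C, row-major, no transposes.
void blas_matmul(const double* A, const double* B, double* C, int n) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}

For matrices of any serious size, the library version typically wins by a large factor, for exactly the reasons sketched above.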
Making code fast takes work, and often quite a lot of work if you want that last ounce of performance. Because so many important calculations can be expressed as combinations of a couple of linear algebra operations, it's economical to create highly optimized code for these operations. Your one-off specialized use case, though? Nobody cares about that except you, so optimizing the heck out of it is not economical.