First, most JVMs include a compiler, so "interpreted bytecode" is actually pretty rare (at least in benchmark code -- it's not quite as rare in real life, where your code is usually more than a few trivial loops that get repeated extremely often).
Second, a fair number of the benchmarks involved appear to be quite biased (whether by intent or incompetence, I can't really say). Just for example, years ago I looked at some of the source code linked from one of the links you posted. It had code like this:
init0 = (int*)calloc(max_x, sizeof(int));
init1 = (int*)calloc(max_x, sizeof(int));
init2 = (int*)calloc(max_x, sizeof(int));
for (x=0; x<max_x; x++) {
    init2[x] = 0;
    init1[x] = 0;
    init0[x] = 0;
}
Since calloc provides memory that's already zeroed, using the for loop to zero it again is obviously useless. This was followed (if memory serves) by filling the memory with other data anyway (with no dependence on it being zeroed), so all the zeroing was completely unnecessary. Replacing the code above with a simple malloc (like any sane person would have used to start with) improved the speed of the C++ version enough to beat the Java version (by a fairly wide margin, if memory serves).
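Concretely, the fix amounts to something like this (a sketch; the declarations of init0 through init2 and max_x are assumed to be as in the original benchmark):

/* Plain malloc is enough: the buffers are overwritten with other
   data before anything reads them, so no zeroing is needed. */
init0 = (int*)malloc(max_x * sizeof(int));
init1 = (int*)malloc(max_x * sizeof(int));
init2 = (int*)malloc(max_x * sizeof(int));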
Consider (for another example) the methcall benchmark used in the blog entry in your last link. Despite the name (and how things might even look), the C++ version of this is not really measuring much about method call overhead at all. The part of the code that turns out to be critical is in the Toggle class:
class Toggle {
public:
    Toggle(bool start_state) : state(start_state) { }
    virtual ~Toggle() { }
    bool value() {
        return(state);
    }
    virtual Toggle& activate() {
        state = !state;
        return(*this);
    }
    bool state;
};
The critical part turns out to be the assignment state = !state. Consider what happens when we change the code to encode the state as an int instead of a bool:
class Toggle {
    enum names { bfalse = -1, btrue = 1 };
    const static names values[2];
    int state;
public:
    Toggle(bool start_state) : state(values[start_state]) { }
    virtual ~Toggle() { }
    bool value() { return state == btrue; }
    virtual Toggle& activate() {
        state = -state;
        return(*this);
    }
};
This minor change improves the overall speed by about a 5:1 margin. Even though the benchmark was intended to measure method call time, in reality most of what it was measuring was the time to convert between int and bool. I'd certainly agree that the inefficiency shown by the original is unfortunate -- but given how rarely it seems to arise in real code, and the ease with which it can be fixed when/if it does arise, I have a difficult time thinking of it as meaning much.
In case anybody decides to re-run the benchmarks involved, I should also add that there's an almost equally trivial modification to the Java version that produces (or at least at one time produced -- I haven't re-run the tests with a recent JVM to confirm it still does) a fairly substantial improvement as well. The Java version has an NthToggle::activate() that looks like this:
public Toggle activate() {
    this.counter += 1;
    if (this.counter >= this.count_max) {
        this.state = !this.state;
        this.counter = 0;
    }
    return(this);
}
Changing this to call the base function instead of manipulating this.state directly gives quite a substantial speed improvement (though not enough to keep up with the modified C++ version).
So, what we end up with is a false assumption about interpreted byte codes on one hand, and some of the worst benchmarks I've ever seen on the other. Neither is giving a meaningful result.
My own experience is that with equally experienced programmers paying equal attention to optimizing, C++ will beat Java more often than not -- but (at least between these two), the language will rarely make as much difference as the programmers and design. The benchmarks being cited tell us more about the (in)competence/(dis)honesty of their authors than they do about the languages they purport to benchmark.
[Edit: As implied in one place above but never stated as directly as I probably should have, the results I'm quoting are those I got when I tested this ~5 years ago, using C++ and Java implementations that were current at that time. I haven't rerun the tests with current implementations. A glance, however, indicates that the code hasn't been fixed, so all that would have changed would be the compiler's ability to cover up the problems in the code.]
If we ignore the Java examples, however, it is actually possible for interpreted code to run faster than compiled code (though difficult and somewhat unusual).
The usual way this happens is that the code being interpreted is much more compact than the machine code, or it's running on a CPU that has a larger data cache than code cache.
In such a case, a small interpreter (e.g., the inner interpreter of a Forth implementation) may be able to fit entirely in the code cache, and the program it's interpreting fits entirely in the data cache. The cache is typically faster than main memory by a factor of at least 10, and often much more (a factor of 100 isn't particularly rare any more).
So, if the cache is faster than main memory by a factor of N, and it takes fewer than N machine code instructions to implement each byte code, the byte code should win (I'm simplifying, but I think the general idea should still be apparent).
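To make that concrete, here's a minimal sketch of such an inner interpreter (my own illustration, not taken from any of the benchmarks above). The dispatch loop compiles to a handful of machine instructions that stay hot in the code cache, while the program it executes is just a few bytes of data:

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical one-byte op codes, so the interpreted "program" stays tiny.
enum Op : std::uint8_t { PUSH, ADD, MUL, PRINT, HALT };

void run(const std::vector<std::uint8_t>& code) {
    std::vector<int> stack;
    std::size_t pc = 0;
    for (;;) {
        switch (code[pc++]) {              // the entire interpreter is this loop
        case PUSH:  stack.push_back(code[pc++]); break;
        case ADD:   { int b = stack.back(); stack.pop_back(); stack.back() += b; break; }
        case MUL:   { int b = stack.back(); stack.pop_back(); stack.back() *= b; break; }
        case PRINT: std::printf("%d\n", stack.back()); break;
        case HALT:  return;
        }
    }
}

int main() {
    // (2 + 3) * 4 -- ten bytes of byte code, living entirely in the data cache.
    run({PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT});
}

A real Forth-style inner interpreter is even tighter than this switch, but the shape is the same: one small, permanently-cached loop interpreting a stream of very compact instructions.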
There are many NoSQL solutions around, each one with its own strengths and weaknesses, so the following must be taken with a grain of salt.
But essentially, what many NoSQL databases do is rely on denormalization and try to optimize for the denormalized case. For instance, say you are reading a blog post together with its comments in a document-oriented database. Often, the comments will be saved together with the post itself. This means that it will be faster to retrieve all of them together, as they are stored in the same place and you do not have to perform a join.
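In code terms, the stored shape is something like the following sketch (field names made up for illustration; this is not any particular database's API):

#include <string>
#include <vector>

// Denormalized document: the comments live inside the post itself,
// so a single read brings back everything -- no join required.
struct Comment {
    std::string author;
    std::string body;
};

struct BlogPost {
    std::string title;
    std::string body;
    std::vector<Comment> comments;  // embedded inline, not referenced by id
};

A normalized SQL schema would instead keep the comments in their own table with a post_id foreign key, and join at read time.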
Of course, you can do the same in SQL, and denormalizing is a common practice when one needs performance. It is just that many NoSQL solutions are engineered from the start to always be used this way. You then get the usual tradeoffs: for instance, adding a comment in the above example will be slower because you have to rewrite the whole document containing it. And once you have denormalized, you have to take care of preserving data integrity in your application.
Moreover, in many NoSQL solutions it is impossible to do arbitrary joins, and hence arbitrary queries. Some databases, like CouchDB, require you to think ahead about the queries you will need and prepare them inside the DB.
All in all, it boils down to expecting a denormalized schema and optimizing reads for that situation, and this works well for data that is not highly relational and that sees many more reads than writes.
This may sound obvious, but computers don't execute formulas, they execute code, and how long that execution takes depends directly on the code they execute and only indirectly on whatever concept that code implements. Two logically identical pieces of code can have very different performance characteristics. Some reasons that are likely to crop up in matrix multiplication specifically:
- Branching: much modern hardware (SIMD units, GPUs) executes operations in lock-step groups, and if one element of a group takes a different side of an if statement, all the others in that group have to wait for it.

A lot of smart people have written very efficient code for common linear algebra operations, using the above tricks and many more, usually even with stupid platform-specific tricks. Therefore, transforming your formula into a matrix multiplication and then implementing that calculation by calling into a mature linear algebra library benefits from all that optimization effort. By contrast, if you simply write the formula out in the obvious way in a high-level language, the machine code that is eventually generated won't use all of those tricks and won't be as fast. This is also true if you take the matrix formulation and implement it by calling a naive matrix multiplication routine that you wrote yourself (again, in the obvious way).
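To make the contrast concrete, here's a sketch (mine, not from the original argument) of the naive routine next to a call into a mature BLAS library; cblas_dgemm is the standard C interface for double-precision matrix multiply and computes C = alpha*A*B + beta*C:

#include <cblas.h>  // BLAS C interface, e.g. from OpenBLAS or ATLAS

// The "obvious way": correct, but uses none of the cache, SIMD,
// or threading tricks a tuned library applies.
void naive_matmul(const double* A, const double* B, double* C, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

// The same product through the tuned library routine:
// C = 1.0 * A * B + 0.0 * C, row-major, no transposes.
void blas_matmul(const double* A, const double* B, double* C, int n) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}

For matrices of any serious size, the library version typically wins by a large factor, for exactly the reasons sketched above.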
Making code fast takes work, and often quite a lot of work if you want that last ounce of performance. Because so many important calculations can be expressed as combinations of a couple of linear algebra operations, it's economical to create highly optimized code for these operations. Your one-off specialized use case, though? Nobody cares about that except you, so optimizing the heck out of it is not economical.