Matrix Multiplications – Why They Speed Up Calculations

Tags: matrices, performance

In Google's MNIST tutorial using TensorFlow, a calculation is shown in which one step is equivalent to multiplying a matrix by a vector. Google first shows a picture in which each numeric multiplication and addition that goes into performing the calculation is written out in full. Next, they show a picture in which it is instead expressed as a matrix multiplication, claiming that this version of the calculation is, or at least might be, faster:

If we write that out as equations, we get:

$$\begin{aligned}
y_1 &= \text{softmax}(W_{1,1}x_1 + W_{1,2}x_2 + W_{1,3}x_3 + b_1)\\
y_2 &= \text{softmax}(W_{2,1}x_1 + W_{2,2}x_2 + W_{2,3}x_3 + b_2)\\
y_3 &= \text{softmax}(W_{3,1}x_1 + W_{3,2}x_2 + W_{3,3}x_3 + b_3)
\end{aligned}$$

We can "vectorize" this procedure, turning it into a matrix multiplication and vector addition. This is helpful for computational efficiency. (It's also a useful way to think.)

$$y = \text{softmax}(Wx + b)$$
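For concreteness, here is a minimal NumPy sketch (the sizes and names are illustrative, not taken from the tutorial) confirming that the longhand and vectorized forms produce exactly the same numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))  # illustrative 3x3 weight matrix
x = rng.standard_normal(3)
b = rng.standard_normal(3)

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

# Longhand form: every multiplication and addition written out in full.
z_longhand = np.array([
    W[0, 0]*x[0] + W[0, 1]*x[1] + W[0, 2]*x[2] + b[0],
    W[1, 0]*x[0] + W[1, 1]*x[1] + W[1, 2]*x[2] + b[1],
    W[2, 0]*x[0] + W[2, 1]*x[1] + W[2, 2]*x[2] + b[2],
])

# Vectorized form: one matrix-vector product plus one vector addition.
z_vectorized = W @ x + b

assert np.allclose(z_longhand, z_vectorized)  # same numbers either way
y = softmax(z_vectorized)
```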

I know that equations like this are usually written in the matrix-multiplication format by machine learning practitioners, and I can of course see advantages to doing so from the standpoint of code terseness or of understanding the mathematics. What I don't understand is Google's claim that converting from the longhand form to the matrix form "is helpful for computational efficiency".

When, why, and how would it be possible to gain performance improvements in software by expressing calculations as matrix multiplications? If I were to calculate the matrix multiplication in the second (matrix-based) image myself, as a human, I'd do it by sequentially doing each of the distinct calculations shown in the first (scalar) image. To me, they are nothing but two notations for the same sequence of calculations. Why is it different for my computer? Why would a computer be able to perform the matrix calculation faster than the scalar one?

Best Answer

This may sound obvious, but computers don't execute formulas; they execute code, and how long that execution takes depends directly on the code they execute and only indirectly on whatever concept that code implements. Two logically identical pieces of code can have very different performance characteristics. Some reasons that are likely to crop up in matrix multiplication specifically:

  • Using multiple threads. Almost every modern CPU has multiple cores; many have 8 or more, and specialized machines for high-performance computing can easily have 64 across several sockets. Writing code in the obvious way, in a normal programming language, uses only one of those. In other words, it may use less than 2% of the available computing resources of the machine it's running on.
  • Using SIMD instructions (confusingly, this is also called "vectorization", but in a different sense than in the text quoted in the question). In essence, instead of 4 or 8 or so scalar arithmetic instructions, you give the CPU one instruction that performs the same arithmetic on 4 or 8 or so values in parallel. This can literally make some calculations (when they're perfectly independent and fit the instruction set) 4 or 8 times faster.
  • Making smarter use of the cache. Memory accesses are faster if they are temporally and spatially coherent, that is, if consecutive accesses go to nearby addresses and accesses to the same address happen in quick succession rather than after a long pause. (The first sketch after this list demonstrates the effect.)
  • Using accelerators such as GPUs. These devices are very different beasts from CPUs, and programming them efficiently is a whole art form of its own. For example, they have hundreds of cores grouped into blocks of a few dozen, and each block shares resources: its cores share a few KiB of memory that is much faster than normal memory, and when any core of the block executes an if statement, all the others in the block have to wait for it.
  • Distributing the work over several machines (very important in supercomputers!). This introduces a whole new set of headaches but can, of course, give access to vastly greater computing resources.
  • Smarter algorithms. For matrix multiplication, the simple O(n^3) algorithm, properly optimized with the tricks above, is often faster than the sub-cubic algorithms (such as Strassen's) for reasonable matrix sizes, though sometimes the sub-cubic ones win. For special cases such as sparse matrices, you can write specialized algorithms.
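To make the cache bullet concrete, here is a small NumPy sketch (the array sizes are arbitrary). Both sums read exactly the same number of elements, but the strided one touches a new cache line for every element:

```python
import numpy as np
from timeit import timeit

a = np.ones(2**24)      # 128 MiB of float64, far larger than any CPU cache

contiguous = a[:2**20]  # 2**20 elements, adjacent in memory
strided = a[::16]       # 2**20 elements, 128 bytes apart

t_contiguous = timeit(contiguous.sum, number=100)
t_strided = timeit(strided.sum, number=100)
print(f"contiguous: {t_contiguous:.3f} s, strided: {t_strided:.3f} s")
# The strided sum is typically several times slower, because almost
# every element it reads misses the cache.
```

The same effect is a large part of why loop order and blocking matter so much in a hand-written matrix multiplication.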

A lot of smart people have written very efficient code for common linear algebra operations, using the tricks above and many more, often even with stupid platform-specific tricks. Therefore, transforming your formula into a matrix multiplication and then implementing that calculation by calling into a mature linear algebra library benefits from all of that optimization effort. By contrast, if you simply write the formula out in the obvious way in a high-level language, the machine code that is eventually generated won't use all of those tricks and won't be as fast. This is also true if you take the matrix formulation and implement it by calling a naive matrix multiplication routine that you wrote yourself (again, in the obvious way).
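As a rough illustration (the absolute timings depend on the machine and on which BLAS library NumPy is linked against), here is a sketch comparing the obvious triple loop with NumPy's `@` operator, which calls into exactly such a mature library:

```python
import numpy as np
from timeit import timeit

def matmul_obvious(A, B):
    """Textbook triple loop: the 'obvious way' in a high-level language."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
B = rng.standard_normal((200, 200))

t_loops = timeit(lambda: matmul_obvious(A, B), number=1)
t_blas = timeit(lambda: A @ B, number=100) / 100  # '@' calls an optimized BLAS
print(f"triple loop: {t_loops:.3f} s, A @ B: {t_blas:.6f} s")
assert np.allclose(matmul_obvious(A, B), A @ B)  # identical results
```

Part of the gap here is Python's interpreter overhead rather than the tricks above, but the same comparison written in C against a tuned BLAS still shows a large (if smaller) gap, for exactly the reasons in the list.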

Making code fast takes work, and often quite a lot of work if you want that last ounce of performance. Because so many important calculations can be expressed as a combination of a few linear algebra operations, it's economical to create highly optimized code for these operations. Your one-off specialized use case, though? Nobody cares about that except you, so optimizing the heck out of it is not economical.