SIMD Programming – Maintenance Cost of a SIMD Code Base

Tags: estimation, optimization, performance

Question:

The software industry's consensus is that clean and simple code is fundamental to the long-term viability of a code base and of the organization that owns it. These properties lead to lower maintenance costs and a higher likelihood that the code base will be carried forward.

However, SIMD code is different from general application code, and I would like to know whether there is a similar consensus regarding clean and simple code as it applies specifically to SIMD code.


Background to my question.

I write plenty of SIMD (single-instruction, multiple data) code for various image processing and analysis tasks. Recently I also had to port a small number of these functions from one architecture (SSE2) to another (ARM NEON).

The code is written for shrink-wrapped software, so it cannot depend on proprietary languages that lack unrestricted redistribution rights, such as MATLAB.

An example of typical code structure (a minimal sketch follows this list):

  • Using OpenCV's matrix type (Mat) for all memory, buffer and lifetime management.
  • After checking the sizes (dimensions) of the input arguments, pointers to the start address of each row of pixels are taken.
  • The pixel count and the start addresses of each row of pixels from each input matrix are passed into some low-level C++ functions.
  • These low-level C++ functions use SIMD intrinsics (for Intel Architecture, and ARM NEON), loading from and saving to raw pointer addresses.
  • Characteristics of these low-level C++ functions:
    • Exclusively one-dimensional (consecutive in memory)
    • Do not deal with memory allocation.
      (Every allocation, including temporaries, is handled by the outer code using OpenCV facilities.)
    • Symbol names (intrinsics, variable names, etc.) are roughly 10 – 20 characters long, which is quite excessive.
      (Reads like techno-babble.)
    • Reuse of SIMD variables is discouraged because compilers are quite buggy at correctly translating code that is not written in the "single-assignment" coding style.
      (I've filed several compiler bug reports.)
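A minimal sketch of this structure, for illustration only (the function names absdiff_row_u8_sse2 and absDiff are hypothetical, and error handling is pared down to one assertion):

    #include <opencv2/core/core.hpp>
    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>

    // Low-level kernel: strictly one-dimensional, raw pointers, no allocation.
    static void absdiff_row_u8_sse2(const uint8_t* a, const uint8_t* b,
                                    uint8_t* dst, int n)
    {
        int x = 0;
        for (; x + 16 <= n; x += 16) {
            __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + x));
            __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + x));
            // |a - b| for unsigned bytes: OR of the two saturating differences.
            __m128i d = _mm_or_si128(_mm_subs_epu8(va, vb), _mm_subs_epu8(vb, va));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + x), d);
        }
        for (; x < n; ++x)   // scalar tail for the leftover pixels
            dst[x] = static_cast<uint8_t>(a[x] > b[x] ? a[x] - b[x] : b[x] - a[x]);
    }

    // Outer layer: cv::Mat owns all memory; sizes are checked here, then the
    // pixel count and the row start addresses are handed down to the kernel.
    void absDiff(const cv::Mat& a, const cv::Mat& b, cv::Mat& dst)
    {
        CV_Assert(a.size() == b.size() && a.type() == CV_8UC1 && b.type() == CV_8UC1);
        dst.create(a.size(), CV_8UC1);
        for (int y = 0; y < a.rows; ++y)
            absdiff_row_u8_sse2(a.ptr<uint8_t>(y), b.ptr<uint8_t>(y),
                                dst.ptr<uint8_t>(y), a.cols);
    }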

What aspects of SIMD programming would cause the discussion to differ from the general case? Or, why is SIMD different?

In terms of initial development cost

  • It is well-known that the initial development cost of C++ SIMD code with good performance is roughly 10x – 100x (with a wide margin) that of casually-written C++ code.
  • As noted in the answers to Choosing between performance vs readable/cleaner code?, most code (both casually-written code and SIMD code) starts out neither clean nor fast.
  • Evolutionary improvements in code performance (in both scalar and SIMD code) are discouraged (because they are seen as a kind of software rework), and their cost and benefit are not tracked.

In terms of propensity
(e.g. the Pareto principle, a.k.a. the 80-20 rule)

  • Even if image processing comprises only 20% of a software system (in both code size and functionality), it is comparatively slow, taking more than 80% of the CPU time spent.
    • This is due to the data size effect: A typical image size is measured in megabytes, whereas the typical size of non-image data is measured in kilobytes.
  • Within the image processing code, a SIMD programmer is trained to recognize automatically the 20% of the code comprising the hotspots, by identifying the loop structure in the C++ code. Thus, from a SIMD programmer's perspective, 100% of the "code that matters" is a performance bottleneck.
  • Often in an image processing system, multiple hotspots exist and take up comparable proportions of time. For example, there may be 5 hotspots each taking up (20%, 18%, 16%, 14%, 12%) of total time. To achieve a high performance gain, all of the hotspots need to be rewritten in SIMD.
    • This is summarized as the balloon-popping rule: a balloon cannot be popped twice.
    • Suppose there are some balloons, say 5 of them. The only way to eliminate them is to pop them one by one.
    • Once the first balloon is popped, the remaining 4 balloons comprise a higher percentage of total execution time.
    • To make further gains, one must then pop another balloon.
      (This defies the 80-20 rule of optimization, which says a good economical outcome can be achieved once the 20% of lowest-hanging fruit has been picked; see the back-of-the-envelope check after this list.)
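To make the balloon-popping arithmetic concrete, here is a back-of-the-envelope Amdahl's-law check using the hypothetical five-hotspot split above. Speeding up a fraction f of total run time by a factor s yields:

    overall speedup = 1 / ((1 - f) + f / s)

Vectorizing only the first hotspot (f = 0.20) by s = 10 gives 1 / (0.80 + 0.02) ≈ 1.22x overall; vectorizing all five hotspots (f = 0.80) by the same factor gives 1 / (0.20 + 0.08) ≈ 3.6x. The economically interesting gains appear only after every balloon has been popped.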

In terms of readability and maintenance

  • SIMD code is patently hard to read.

    • This is true even if one follows every software-engineering best practice, e.g. naming, encapsulation, const-correctness (making side effects obvious), function decomposition, etc.
    • This is true even for experienced SIMD programmers.
  • Optimal SIMD code is heavily contorted (see the remark and sketch below) compared to its equivalent C++ prototype code.

    • There are many ways to contort SIMD code, but only about 1 out of 10 such attempts attains acceptably fast results.
    • (That is, gains on the order of 4x – 10x, which are needed to justify the high development cost. Even higher gains have been observed in practice.)

(Remark)
This is the main thesis of the MIT Halide project – quoting the paper's title verbatim:

"decoupling algorithms from schedules for easy optimization of image processing pipelines"

In terms of forward applicability

  • SIMD code is strictly tied to a single architecture. Each new architecture (or each widening of SIMD registers) requires a rewrite.
  • Unlike the majority of software development, each piece of SIMD code is typically written for a single purpose that never changes.
    (With the exception of porting to other architectures.)
  • Some architectures maintain perfect backward compatibility (Intel); some fall short by a trivial amount (ARM AArch64 replacing vtbl with vtblq), which is nonetheless sufficient to cause some code to fail to compile.

In terms of skills and training

  • It is not clear what knowledge prerequisites are required to properly train a new programmer to write and maintain SIMD code.
  • College graduates who have learned SIMD programming in school seem to despise it and dismiss it as an impractical career track.
  • Disassembly-reading and low-level performance profiling are cited as two fundamental skills for writing high-performance SIMD code. However, it is unclear how to systematically train programmers in these two skills.
  • Modern CPU architecture (which diverges significantly from what is taught in textbooks) makes training even more difficult.

In terms of correctness and defect-related costs

  • A single SIMD processing function is actually cohesive enough that one can establish correctness by:
    • Applying formal methods (with pen-and-paper), and
    • Verifying output integer ranges (with prototype code, performed offline rather than at run time). A sketch of such a check follows this list.
  • The verification process is, however, very costly: it adds 100% of the development time again for code review and another 100% for prototype model checking, which triples the already-expensive development cost of SIMD code.
  • If a bug somehow manages to slip through this verification process, it is nearly impossible to "repair" (fix) the defective function; the only remedy is to replace (rewrite) it.
  • SIMD code bears the brunt of defects in the C++ compiler's optimizing code generator.
    • SIMD code generated using C++ expression templates also suffers greatly from defects of the compiler.
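A minimal sketch of the offline verification described above, assuming the hypothetical weight_row_sse2 kernel from the earlier sketch (both the kernel name and the 0–77 output range bound come from that example, not from any real code base):

    #include <cassert>
    #include <cstdint>

    // Kernel under test, from the sketch in the readability section.
    void weight_row_sse2(const uint8_t* src, uint8_t* dst, int n);

    // Scalar reference model: dst[x] = (src[x] * 77 + 128) >> 8.
    static uint8_t ref_weight(uint8_t v)
    {
        return static_cast<uint8_t>((v * 77 + 128) >> 8);
    }

    int main()
    {
        uint8_t in[16], out[16];
        // Exhaustive over all 256 input byte values, 16 lanes at a time.
        for (int base = 0; base < 256; base += 16) {
            for (int i = 0; i < 16; ++i)
                in[i] = static_cast<uint8_t>(base + i);
            weight_row_sse2(in, out, 16);
            for (int i = 0; i < 16; ++i) {
                assert(out[i] == ref_weight(in[i]));  // bit-exact agreement
                assert(out[i] <= 77);                 // pen-and-paper range bound
            }
        }
        return 0;
    }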

In terms of disruptive innovations

  • Many solutions have been proposed by academia, but few see widespread commercial usage.

    • MIT Halide
    • Stanford Darkroom
    • NT2 (Numerical Template Toolbox) and the related Boost.SIMD
  • Libraries with widespread commercial usage do not seem to be heavily SIMD-enabled.

    • Open-source libraries seem lukewarm to SIMD.
      • I recently made this first-hand observation after profiling a large number of OpenCV API functions, as of version 2.4.9.
      • Many other image processing libraries I have profiled also do not make heavy use of SIMD, or they miss the true hotspots.
    • Commercial libraries seem to avoid SIMD altogether.
      • In a few cases, I have even seen image processing libraries replace SIMD-optimized code from an earlier version with non-SIMD code in a later version, resulting in severe performance regressions.
        (The vendor's response is that it was necessary to avoid compiler bugs.)

The Programmers question Does low latency code sometimes have to be "ugly"? is related; I wrote an answer to it a few years ago explaining my viewpoints.

However, that answer is pretty much an "appeasement" of the "premature optimization" viewpoint, i.e. the viewpoint that:

  • All optimizations are premature by definition (or, short-term by nature), and
  • The only optimization that has long-term benefit is toward simplicity.

But such viewpoints are contested in this ACM article.


All of that leads me to ask:
SIMD code is different from general application code, and I would like to know whether there is a similar industry consensus regarding the value of clean and simple code for SIMD code.

Best Answer

I have not written much SIMD code myself, but I wrote a lot of assembler code some decades ago. AFAIK, using SIMD intrinsics is essentially assembly programming, and your whole question could be rephrased simply by replacing "SIMD" with "assembly". For example, the points you already mentioned, like

  • the code takes 10x to 100x longer to develop than "high level code"

  • it is tied to a specific architecture

  • the code is never "clean" nor easy to refactor

  • you need experts for writing and maintaining it

  • debugging and maintaining is hard, evolving really hard

are in no way "special" to SIMD - these points hold for any kind of assembly language, and they are all "industry consensus". And the conclusion in the software industry is also pretty much the same as for assembler:

  • don't write it if you don't have to - use a high level language wherever possible and let compilers do the hard work

  • if the compilers are not sufficient, at least encapsulate the "low level" parts in some libraries (as sketched below), but avoid spreading the code all over your program

  • since it is almost impossible to write "self-documenting" assembler or SIMD code, try to compensate with thorough documentation.
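A minimal sketch of that encapsulation advice (the function name is illustrative, and the feature-test macros are the usual compiler predefines, which vary between toolchains): one portable entry point, per-architecture kernels hidden inside it, and a scalar fallback so the program builds everywhere:

    #include <cstddef>
    #include <cstdint>

    #if defined(__SSE2__)
      #include <emmintrin.h>
    #elif defined(__ARM_NEON)
      #include <arm_neon.h>
    #endif

    // Saturating add of two u8 buffers: dst[i] = min(a[i] + b[i], 255).
    void add_saturate_u8(const uint8_t* a, const uint8_t* b,
                         uint8_t* dst, size_t n)
    {
        size_t i = 0;
    #if defined(__SSE2__)
        for (; i + 16 <= n; i += 16)
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                _mm_adds_epu8(
                    _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i)),
                    _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i))));
    #elif defined(__ARM_NEON)
        for (; i + 16 <= n; i += 16)
            vst1q_u8(dst + i, vqaddq_u8(vld1q_u8(a + i), vld1q_u8(b + i)));
    #endif
        for (; i < n; ++i) {   // scalar fallback, also handles the tail
            unsigned s = unsigned(a[i]) + unsigned(b[i]);
            dst[i] = static_cast<uint8_t>(s > 255 ? 255 : s);
        }
    }

Callers see only add_saturate_u8; porting to a new architecture means adding one more #elif branch here instead of touching every call site.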

Of course, there is indeed a difference from the situation with "classic" assembly or machine code: today, modern compilers typically produce high quality machine code from a high level language, often better optimized than manually-written assembler. For the SIMD architectures that are popular today, the quality of the available compilers is AFAIK far below that - and maybe it will never reach it, since automatic vectorization is still a topic of scientific research. See, for example, this article, which describes the differences in optimization between a compiler and a human, giving a notion of how hard it might be to create good SIMD compilers.

As you already described in your question, there is also a quality problem with the current state-of-the-art libraries. So IMHO the best we can hope for is that over the next years the quality of compilers and libraries will increase; maybe SIMD hardware will have to change to become more "compiler friendly"; maybe specialized programming languages supporting easier vectorization (like Halide, which you mentioned twice) will become more popular (wasn't that already a strength of Fortran?). According to Wikipedia, SIMD became "a mass product" around 15 to 20 years ago (and Halide is less than 3 years old, if I interpret the docs correctly). Compare this to the time compilers for "classic" assembly language needed to become mature. According to this Wikipedia article, it took almost 30 years (from ~1970 to the end of the 1990s) until compilers exceeded the performance of human experts at producing non-parallel machine code. So we may just have to wait another 10 to 15 years until the same happens for SIMD-enabled compilers.
