C++ – An expensive jump with GCC 5.4.0

branch-predictioncconditional-statementsgccx86

I had a function which looked like this (showing only the important part):

double CompareShifted(const std::vector<uint16_t>& l, const std::vector<uint16_t> &curr, int shift, int shiftY)  {
...
  for(std::size_t i=std::max(0,-shift);i<max;i++) {
     if ((curr[i] < 479) && (l[i + shift] < 479)) {
       nontopOverlap++;
     }
     ...
  }
...
}

Written like this, the function took ~34ms on my machine. After changing the condition to bool multiplication (making the code look like this):

double CompareShifted(const std::vector<uint16_t>& l, const std::vector<uint16_t> &curr, int shift, int shiftY)  {
...
  for(std::size_t i=std::max(0,-shift);i<max;i++) {
     if ((curr[i] < 479) * (l[i + shift] < 479)) {
       nontopOverlap++;
     }
     ...
  }
...
}

the execution time decreased to ~19ms.

The compiler used was GCC 5.4.0 with -O3 and after checking the generated asm code using godbolt.org I found out that the first example generates a jump, while the second one does not. I decided to try GCC 6.2.0 which also generates a jump instruction when using the first example, but GCC 7 seems to not generate one anymore.

Finding out this way to speed up the code was rather gruesome and took quite some time. Why does the compiler behave this way? Is it intended and is it something the programmers should look out for? Are there any more things similar to this?

Best Answer

The logical AND operator (&&) uses short-circuit evaluation, which means that the second test is only done if the first comparison evaluates to true. This is often exactly the semantics that you require. For example, consider the following code:

if ((p != nullptr) && (p->first > 0))

You must ensure that the pointer is non-null before you dereference it. If this wasn't a short-circuit evaluation, you'd have undefined behavior because you'd be dereferencing a null pointer.

It is also possible that short circuit evaluation yields a performance gain in cases where the evaluation of the conditions is an expensive process. For example:

if ((DoLengthyCheck1(p) && (DoLengthyCheck2(p))

If DoLengthyCheck1 fails, there is no point in calling DoLengthyCheck2.

However, in the resulting binary, a short-circuit operation often results in two branches, since this is the easiest way for the compiler to preserve these semantics. (Which is why, on the other side of the coin, short-circuit evaluation can sometimes inhibit optimization potential.) You can see this by looking at the relevant portion of object code generated for your if statement by GCC 5.4:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13w, 478         ; (curr[i] < 479)
    ja      .L5

    cmp     ax, 478           ; (l[i + shift] < 479)
    ja      .L5

    add     r8d, 1            ; nontopOverlap++

You see here the two comparisons (cmp instructions) here, each followed by a separate conditional jump/branch (ja, or jump if above).

It is a general rule of thumb that branches are slow and are therefore to be avoided in tight loops. This has been true on virtually all x86 processors, from the humble 8088 (whose slow fetch times and extremely small prefetch queue [comparable to an instruction cache], combined with utter lack of branch prediction, meant that taken branches required the cache to be dumped) to modern implementations (whose long pipelines make mispredicted branches similarly expensive). Note the little caveat that I slipped in there. Modern processors since the Pentium Pro have advanced branch prediction engines that are designed to minimize the cost of branches. If the direction of the branch can be properly predicted, the cost is minimal. Most of the time, this works well, but if you get into pathological cases where the branch predictor is not on your side, your code can get extremely slow. This is presumably where you are here, since you say that your array is unsorted.

You say that benchmarks confirmed that replacing the && with a * makes the code noticeably faster. The reason for this is evident when we compare the relevant portion of the object code:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    xor     r15d, r15d        ; (curr[i] < 479)
    cmp     r13w, 478
    setbe   r15b

    xor     r14d, r14d        ; (l[i + shift] < 479)
    cmp     ax, 478
    setbe   r14b

    imul    r14d, r15d        ; meld results of the two comparisons

    cmp     r14d, 1           ; nontopOverlap++
    sbb     r8d, -1

It is a bit counter-intuitive that this could be faster, since there are more instructions here, but that is how optimization works sometimes. You see the same comparisons (cmp) being done here, but now, each is preceded by an xor and followed by a setbe. The XOR is just a standard trick for clearing a register. The setbe is an x86 instruction that sets a bit based on the value of a flag, and is often used to implement branchless code. Here, setbe is the inverse of ja. It sets its destination register to 1 if the comparison was below-or-equal (since the register was pre-zeroed, it will be 0 otherwise), whereas ja branched if the comparison was above. Once these two values have been obtained in the r15b and r14b registers, they are multiplied together using imul. Multiplication was traditionally a relatively slow operation, but it is darn fast on modern processors, and this will be especially fast, because it's only multiplying two byte-sized values.

You could just as easily have replaced the multiplication with the bitwise AND operator (&), which does not do short-circuit evaluation. This makes the code much clearer, and is a pattern that compilers generally recognize. But when you do this with your code and compile it with GCC 5.4, it continues to emit the first branch:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13w, 478         ; (curr[i] < 479)
    ja      .L4

    cmp     ax, 478           ; (l[i + shift] < 479)
    setbe   r14b

    cmp     r14d, 1           ; nontopOverlap++
    sbb     r8d, -1

There is no technical reason it had to emit the code this way, but for some reason, its internal heuristics are telling it that this is faster. It would probably be faster if the branch predictor was on your side, but it will likely be slower if branch prediction fails more often than it succeeds.

Newer generations of the compiler (and other compilers, like Clang) know this rule, and will sometimes use it to generate the same code that you would have sought by hand-optimizing. I regularly see Clang translate && expressions to the same code that would have been emitted if I'd have used &. The following is the relevant output from GCC 6.2 with your code using the normal && operator:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13d, 478         ; (curr[i] < 479)
    jg      .L7

    xor     r14d, r14d        ; (l[i + shift] < 479)
    cmp     eax, 478
    setle   r14b

    add     esi, r14d         ; nontopOverlap++

Note how clever this is! It is using signed conditions (jg and setle) as opposed to unsigned conditions (ja and setbe), but this isn't important. You can see that it still does the compare-and-branch for the first condition like the older version, and uses the same setCC instruction to generate branchless code for the second condition, but it has gotten a lot more efficient in how it does the increment. Instead of doing a second, redundant comparison to set the flags for a sbb operation, it uses the knowledge that r14d will be either 1 or 0 to simply unconditionally add this value to nontopOverlap. If r14d is 0, then the addition is a no-op; otherwise, it adds 1, exactly like it is supposed to do.

GCC 6.2 actually produces more efficient code when you use the short-circuiting && operator than the bitwise & operator:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13d, 478         ; (curr[i] < 479)
    jg      .L6

    cmp     eax, 478          ; (l[i + shift] < 479)
    setle   r14b

    cmp     r14b, 1           ; nontopOverlap++
    sbb     esi, -1

The branch and the conditional set are still there, but now it reverts back to the less-clever way of incrementing nontopOverlap. This is an important lesson in why you should be careful when trying to out-clever your compiler!

But if you can prove with benchmarks that the branching code is actually slower, then it may pay to try and out-clever your compiler. You just have to do so with careful inspection of the disassembly—and be prepared to re-evaluate your decisions when you upgrade to a later version of the compiler. For example, the code you have could be rewritten as:

nontopOverlap += ((curr[i] < 479) & (l[i + shift] < 479));

There is no if statement here at all, and the vast majority of compilers will never think about emitting branching code for this. GCC is no exception; all versions generate something akin to the following:

    movzx   r14d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r14d, 478         ; (curr[i] < 479)
    setle   r15b

    xor     r13d, r13d        ; (l[i + shift] < 479)
    cmp     eax, 478
    setle   r13b

    and     r13d, r15d        ; meld results of the two comparisons
    add     esi, r13d         ; nontopOverlap++

If you've been following along with the previous examples, this should look very familiar to you. Both comparisons are done in a branchless way, the intermediate results are anded together, and then this result (which will be either 0 or 1) is added to nontopOverlap. If you want branchless code, this will virtually ensure that you get it.

GCC 7 has gotten even smarter. It now generates virtually identical code (excepting some slight rearrangement of instructions) for the above trick as the original code. So, the answer to your question, "Why does the compiler behave this way?", is probably because they're not perfect! They try to use heuristics to generate the most optimal code possible, but they don't always make the best decisions. But at least they can get smarter over time!

One way of looking at this situation is that the branching code has the better best-case performance. If branch prediction is successful, skipping unnecessary operations will result in a slightly faster running time. However, branchless code has the better worst-case performance. If branch prediction fails, executing a few additional instructions as necessary to avoid a branch will definitely be faster than a mispredicted branch. Even the smartest and most clever of compilers will have a hard time making this choice.

And for your question of whether this is something programmers need to watch out for, the answer is almost certainly no, except in certain hot loops that you are trying to speed up via micro-optimizations. Then, you sit down with the disassembly and find ways to tweak it. And, as I said before, be prepared to revisit those decisions when you update to a newer version of the compiler, because it may either do something stupid with your tricky code, or it may have changed its optimization heuristics enough that you can go back to using your original code. Comment thoroughly!

Linking to static libraries

$ g++ -c b.cpp -o b.o
$ ar cr libb.a b.o
$ g++ -c d.cpp -o d.o
$ ar cr libd.a d.o

$ g++ -L. -ld -lb a.cpp # wrong order
$ g++ -L. -lb -ld a.cpp # wrong order
$ g++ a.cpp -L. -ld -lb # wrong order
$ g++ a.cpp -L. -lb -ld # right order

The linker searches from left to right, and notes unresolved symbols as it goes. If a library resolves the symbol, it takes the object files of that library to resolve the symbol (b.o out of libb.a in this case).

Dependencies of static libraries against each other work the same - the library that needs symbols must be first, then the library that resolves the symbol.

If a static library depends on another library, but the other library again depends on the former library, there is a cycle. You can resolve this by enclosing the cyclically dependent libraries by -( and -), such as -( -la -lb -) (you may need to escape the parens, such as -\( and -\)). The linker then searches those enclosed lib multiple times to ensure cycling dependencies are resolved. Alternatively, you can specify the libraries multiple times, so each is before one another: -la -lb -la.

Linking to dynamic libraries

$ export LD_LIBRARY_PATH=. # not needed if libs go to /usr/lib etc
$ g++ -fpic -shared d.cpp -o libd.so
$ g++ -fpic -shared b.cpp -L. -ld -o libb.so # specifies its dependency!

$ g++ -L. -lb a.cpp # wrong order (works on some distributions)
$ g++ -Wl,--as-needed -L. -lb a.cpp # wrong order
$ g++ -Wl,--as-needed a.cpp -L. -lb # right order

It's the same here - the libraries must follow the object files of the program. The difference here compared with static libraries is that you need not care about the dependencies of the libraries against each other, because dynamic libraries sort out their dependencies themselves.

Some recent distributions apparently default to using the --as-needed linker flag, which enforces that the program's object files come before the dynamic libraries. If that flag is passed, the linker will not link to libraries that are not actually needed by the executable (and it detects this from left to right). My recent archlinux distribution doesn't use this flag by default, so it didn't give an error for not following the correct order.

It is not correct to omit the dependency of b.so against d.so when creating the former. You will be required to specify the library when linking a then, but a doesn't really need the integer b itself, so it should not be made to care about b's own dependencies.

Here is an example of the implications if you miss specifying the dependencies for libb.so

$ export LD_LIBRARY_PATH=. # not needed if libs go to /usr/lib etc
$ g++ -fpic -shared d.cpp -o libd.so
$ g++ -fpic -shared b.cpp -o libb.so # wrong (but links)

$ g++ -L. -lb a.cpp # wrong, as above
$ g++ -Wl,--as-needed -L. -lb a.cpp # wrong, as above
$ g++ a.cpp -L. -lb # wrong, missing libd.so
$ g++ a.cpp -L. -ld -lb # wrong order (works on some distributions)
$ g++ -Wl,--as-needed a.cpp -L. -ld -lb # wrong order (like static libs)
$ g++ -Wl,--as-needed a.cpp -L. -lb -ld # "right"

If you now look into what dependencies the binary has, you note the binary itself depends also on libd, not just libb as it should. The binary will need to be relinked if libb later depends on another library, if you do it this way. And if someone else loads libb using dlopen at runtime (think of loading plugins dynamically), the call will fail as well. So the "right" really should be a wrong as well.

C++ – How to get rid of `deprecated conversion from string constant to ‘char*’` warnings in GCC

Any functions into which you pass string literals "I am a string literal" should use char const * as the type instead of char*.

If you're going to fix something, fix it right.

Explanation:

You can not use string literals to initialise strings that will be modified, because they are of type const char*. Casting away the constness to later modify them is undefined behaviour, so you have to copy your const char* strings char by char into dynamically allocated char* strings in order to modify them.

Example:

#include <iostream>

void print(char* ch);

void print(const char* ch) {
    std::cout<<ch;
}

int main() {
    print("Hello");
    return 0;
}

Best Answer

Related Solutions

Why does the order in which libraries are linked sometimes cause errors in GCC

Linking to static libraries

Linking to dynamic libraries

C++ – How to get rid of `deprecated conversion from string constant to ‘char*’` warnings in GCC

Related Topic