Efficiency of branching in shaders

branchperformanceshader

I understand that this question may seem somewhat ungrounded, but if someone knows anything theoretical / has practical experience on this topic, it would be great if you share it.

I am attempting to optimize one of my old shaders, which uses a lot of texture lookups.

I've got diffuse, normal, specular maps for each of three possible mapping planes and for some faces which are near to the user I also have to apply mapping techniques, which also bring a lot of texture lookups (like parallax occlusion mapping).

Profiling showed that texture lookups are the bottleneck of the shader and I am willing to remove some of them away. For some cases of the input parameters I already know that part of the texture lookups would be unnecessary and the obvious solution is to do something like (pseudocode):

if (part_actually_needed) {
   perform lookups;
   perform other steps specific for THIS PART;
}

// All other parts.

Now – here comes the question.

I do not remember exactly (that's why I stated the question might be ungrounded), but in some paper I recently read (unfortunately, can't remember the name) something similar to the following was stated:

The performance of the presented
technique depends on how efficient
the HARDWARE-BASED CONDITIONAL
BRANCHING is implemented.

I remembered this kind of statement right before I was about to start refactoring a big number of shaders and implement that if-based optimization I was talking about.

So – right before I start doing that – does someone know something about the efficiency of the branching in shaders? Why could branching give a severe performance penalty in shaders?

And is it even possible that I could only worsen the actual performance with the if-based branching?

You might say – try and see. Yes, that's what I'm going to do if nobody here is helps me 🙂

But still, what in the if case may be effective for new GPU's could be a nightmare for a bit older ones. And that kind of issue is very hard to forecast unless you have a lot of different GPU's (that's not my case)

So, if anyone knows something about that or has benchmarking experience for these kinds of shaders, I would really appreciate your help.

Few remaining brain cells that are actually working keep telling me that branching on the GPU's might be far not as effective as branching for the CPU (which usually has extremely efficient ways of branch predictions and eliminating cache misses) simply because it's a GPU (or that could be hard / impossible to implement on the GPU).

Unfortunately I am not sure if this statement has anything in common with the real situation…

Best Answer

If the condition is uniform (i.e. constant for the entire pass), then the branch is essentially free because the framework will essentially compile two versions of the shader (branch taken and not) and choose one of these for the entire pass based on your input variable. In this case, definitely go for the if statement as it will make your shader faster.

If the condition varies per vertex/pixel, then it can indeed degrade performance and older shader models don't even support dynamic branching.

Related Solutions

Java – What are the effects of exceptions on performance in Java

It depends how exceptions are implemented. The simplest way is using setjmp and longjmp. That means all registers of the CPU are written to the stack (which already takes some time) and possibly some other data needs to be created... all this already happens in the try statement. The throw statement needs to unwind the stack and restore the values of all registers (and possible other values in the VM). So try and throw are equally slow, and that is pretty slow, however if no exception is thrown, exiting the try block takes no time whatsoever in most cases (as everything is put on the stack which cleans up automatically if the method exists).

Sun and others recognized, that this is possibly suboptimal and of course VMs get faster and faster over the time. There is another way to implement exceptions, which makes try itself lightning fast (actually nothing happens for try at all in general - everything that needs to happen is already done when the class is loaded by the VM) and it makes throw not quite as slow. I don't know which JVM uses this new, better technique...

...but are you writing in Java so your code later on only runs on one JVM on one specific system? Since if it may ever run on any other platform or any other JVM version (possibly of any other vendor), who says they also use the fast implementation? The fast one is more complicated than the slow one and not easily possible on all systems. You want to stay portable? Then don't rely on exceptions being fast.

It also makes a big difference what you do within a try block. If you open a try block and never call any method from within this try block, the try block will be ultra fast, as the JIT can then actually treat a throw like a simple goto. It neither needs to save stack-state nor does it need to unwind the stack if an exception is thrown (it only needs to jump to the catch handlers). However, this is not what you usually do. Usually you open a try block and then call a method that might throw an exception, right? And even if you just use the try block within your method, what kind of method will this be, that does not call any other method? Will it just calculate a number? Then what for do you need exceptions? There are much more elegant ways to regulate program flow. For pretty much anything else but simple math, you will have to call an external method and this already destroys the advantage of a local try block.

See the following test code:

public class Test {
    int value;


    public int getValue() {
        return value;
    }

    public void reset() {
        value = 0;
    }

    // Calculates without exception
    public void method1(int i) {
        value = ((value + i) / i) << 1;
        // Will never be true
        if ((i & 0xFFFFFFF) == 1000000000) {
            System.out.println("You'll never see this!");
        }
    }

    // Could in theory throw one, but never will
    public void method2(int i) throws Exception {
        value = ((value + i) / i) << 1;
        // Will never be true
        if ((i & 0xFFFFFFF) == 1000000000) {
            throw new Exception();
        }
    }

    // This one will regularly throw one
    public void method3(int i) throws Exception {
        value = ((value + i) / i) << 1;
        // i & 1 is equally fast to calculate as i & 0xFFFFFFF; it is both
        // an AND operation between two integers. The size of the number plays
        // no role. AND on 32 BIT always ANDs all 32 bits
        if ((i & 0x1) == 1) {
            throw new Exception();
        }
    }

    public static void main(String[] args) {
        int i;
        long l;
        Test t = new Test();

        l = System.currentTimeMillis();
        t.reset();
        for (i = 1; i < 100000000; i++) {
            t.method1(i);
        }
        l = System.currentTimeMillis() - l;
        System.out.println(
            "method1 took " + l + " ms, result was " + t.getValue()
        );

        l = System.currentTimeMillis();
        t.reset();
        for (i = 1; i < 100000000; i++) {
            try {
                t.method2(i);
            } catch (Exception e) {
                System.out.println("You'll never see this!");
            }
        }
        l = System.currentTimeMillis() - l;
        System.out.println(
            "method2 took " + l + " ms, result was " + t.getValue()
        );

        l = System.currentTimeMillis();
        t.reset();
        for (i = 1; i < 100000000; i++) {
            try {
                t.method3(i);
            } catch (Exception e) {
                // Do nothing here, as we will get here
            }
        }
        l = System.currentTimeMillis() - l;
        System.out.println(
            "method3 took " + l + " ms, result was " + t.getValue()
        );
    }
}

Result:

method1 took 972 ms, result was 2
method2 took 1003 ms, result was 2
method3 took 66716 ms, result was 2

The slowdown from the try block is too small to rule out confounding factors such as background processes. But the catch block killed everything and made it 66 times slower!

As I said, the result will not be that bad if you put try/catch and throw all within the same method (method3), but this is a special JIT optimization I would not rely upon. And even when using this optimization, the throw is still pretty slow. So I don't know what you are trying to do here, but there is definitely a better way of doing it than using try/catch/throw.

Php – Preferred method to store PHP arrays (json_encode vs serialize)

Depends on your priorities.

If performance is your absolute driving characteristic, then by all means use the fastest one. Just make sure you have a full understanding of the differences before you make a choice

Unlike serialize() you need to add extra parameter to keep UTF-8 characters untouched: json_encode($array, JSON_UNESCAPED_UNICODE) (otherwise it converts UTF-8 characters to Unicode escape sequences).
JSON will have no memory of what the object's original class was (they are always restored as instances of stdClass).
You can't leverage __sleep() and __wakeup() with JSON
By default, only public properties are serialized with JSON. (in PHP>=5.4 you can implement JsonSerializable to change this behavior).
JSON is more portable

And there's probably a few other differences I can't think of at the moment.

A simple speed test to compare the two

<?php

ini_set('display_errors', 1);
error_reporting(E_ALL);

// Make a big, honkin test array
// You may need to adjust this depth to avoid memory limit errors
$testArray = fillArray(0, 5);

// Time json encoding
$start = microtime(true);
json_encode($testArray);
$jsonTime = microtime(true) - $start;
echo "JSON encoded in $jsonTime seconds\n";

// Time serialization
$start = microtime(true);
serialize($testArray);
$serializeTime = microtime(true) - $start;
echo "PHP serialized in $serializeTime seconds\n";

// Compare them
if ($jsonTime < $serializeTime) {
    printf("json_encode() was roughly %01.2f%% faster than serialize()\n", ($serializeTime / $jsonTime - 1) * 100);
}
else if ($serializeTime < $jsonTime ) {
    printf("serialize() was roughly %01.2f%% faster than json_encode()\n", ($jsonTime / $serializeTime - 1) * 100);
} else {
    echo "Impossible!\n";
}

function fillArray( $depth, $max ) {
    static $seed;
    if (is_null($seed)) {
        $seed = array('a', 2, 'c', 4, 'e', 6, 'g', 8, 'i', 10);
    }
    if ($depth < $max) {
        $node = array();
        foreach ($seed as $key) {
            $node[$key] = fillArray($depth + 1, $max);
        }
        return $node;
    }
    return 'empty';
}

Best Answer

Related Solutions

Java – What are the effects of exceptions on performance in Java

Php – Preferred method to store PHP arrays (json_encode vs serialize)

Related Topic