OpenGL Low-Level Performance Questions

openglopengl-3optimizationperformancesdl

This subject, as with any optimisation problem, gets hit on a lot, but I just couldn't find what I (think) I want.

A lot of tutorials, and even SO questions have similar tips; generally covering:

Use GL face culling (the OpenGL function, not the scene logic)
Only send 1 matrix to the GPU (projectionModelView combination), therefore decreasing the MVP calculations from per vertex to once per model (as it should be).
Use interleaved Vertices
Minimize as many GL calls as possible, batch where appropriate

And possibly a few/many others. I am (for curiosity reasons) rendering 28 million triangles in my application using several vertex buffers. I have tried all the above techniques (to the best of my knowledge), and received almost no performance change.

Whilst I am receiving around 40FPS in my implementation, which is by no means problematic, I am still curious as to where these optimisation 'tips' actually come into use?

My CPU is idling around 20-50% during rendering, therefore I assume I am GPU bound for increasing performance.

Note: I am looking into gDEBugger at the moment

Cross posted at Game Development

Best Answer

Point 1 is obvious, as is saves fill rate. In case the primitives of an objects backside get processed first this will omit those faces. However modern GPUs tolerate overdraw quite well. I once (GeForce8800 GTX) measured up to 20% overdraw before significant performance hit. But it's better to save this reserve for things like occlusion culling, rendering of blended geometry and the like.

Point 2 is, well pointless. The matrices never have been calculated on the GPU – well, if you don't count SGI Onyx. Matrices always were just some kind of rendering global parameter calculated on the CPU, then pushed into global registers on the GPU, now called a uniform, so joining them has only very little benefit. In the shader that saves only one additional vector matrix multiplication (boils down to 4 MAD instructions), at the expense of less algorithmic flexibility.

Point 3 is all about cache efficiency. Data belonging together should fit into a cache line.

Point 4 is about preventing state changes trashing the caches. But it strongly depends which GL calls they mean. Changing uniforms is cheap. Switching a texture is expensive. The reason is, that a uniform sits in a register, not some piece of memory that's cached. Switching a shader is expensive, because different shaders exhibit different runtime behaviour, thus trashing the pipeline execution predition, altering memory (and thus) cache access patterns and so on.

But those are all micro optimizations (some of them with huge impact). However I recommend looking in large impact optimizations, like implementing an early Z pass; using occlusion query in th early Z for quick discrimination of whole geometry batches. One large impact optimization, that essentially consists of summing up a lot of Point-4 like micro optimizations is to sort render batches by expensive GL states. So group everything with common shaders, within those groups sort by texture and so on. This state grouping will only affect the visible render passes. In early Z you're only testing outcomes on the Z buffer so there's only geometry transformation and the fragment shaders will just pass the Z value.

Related Solutions

Sqlite – Improve INSERT-per-second performance of SQLite

Several tips:

Put inserts/updates in a transaction.
For older versions of SQLite - Consider a less paranoid journal mode (pragma journal_mode). There is NORMAL, and then there is OFF, which can significantly increase insert speed if you're not too worried about the database possibly getting corrupted if the OS crashes. If your application crashes the data should be fine. Note that in newer versions, the OFF/MEMORY settings are not safe for application level crashes.
Playing with page sizes makes a difference as well (PRAGMA page_size). Having larger page sizes can make reads and writes go a bit faster as larger pages are held in memory. Note that more memory will be used for your database.
If you have indices, consider calling CREATE INDEX after doing all your inserts. This is significantly faster than creating the index and then doing your inserts.
You have to be quite careful if you have concurrent access to SQLite, as the whole database is locked when writes are done, and although multiple readers are possible, writes will be locked out. This has been improved somewhat with the addition of a WAL in newer SQLite versions.
Take advantage of saving space...smaller databases go faster. For instance, if you have key value pairs, try making the key an INTEGER PRIMARY KEY if possible, which will replace the implied unique row number column in the table.
If you are using multiple threads, you can try using the shared page cache, which will allow loaded pages to be shared between threads, which can avoid expensive I/O calls.
Don't use !feof(file)!

I've also asked similar questions here and here.

C++ – Why does changing 0.1f to 0 slow down performance by 10x

Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can't handle them directly and must trap and resolve them using microcode.

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.

Here's the test code compiled on x64:

int main() {

    double start = omp_get_wtime();

    const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for(int i=0;i<16;i++)
    {
        y[i]=x[i];
    }
    for(int j=0;j<9000000;j++)
    {
        for(int i=0;i<16;i++)
        {
            y[i]*=x[i];
            y[i]/=z[i];
#ifdef FLOATING
            y[i]=y[i]+0.1f;
            y[i]=y[i]-0.1f;
#else
            y[i]=y[i]+0;
            y[i]=y[i]-0;
#endif

            if (j > 10000)
                cout << y[i] << "  ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}

Output:

#define FLOATING
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007

//#define FLOATING
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.46842e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.45208e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044

Note how in the second run the numbers are very close to zero.

Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.

To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled.)

This means that rather than using these weird lower precision almost-zero values, we just round to zero instead.

Timings: Core i7 920 @ 3.5 GHz:

//  Don't flush denormals to zero.
0.1f: 0.564067
0   : 26.7669

//  Flush denormals to zero.
0.1f: 0.587117
0   : 0.341406

In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.

Best Answer

Related Solutions

Sqlite – Improve INSERT-per-second performance of SQLite

C++ – Why does changing 0.1f to 0 slow down performance by 10x

Related Topic