C++ templates for performance?

cc++11templates

I have seen online a few times it has been mentioned that C++ can be ever faster using templates.

Could someone explain, including at a low level why this is exactly? I always presumed such a "nice" feature would have overhead like most useful concepts.

I am really intrigued by this from a ultra low latency perspective!

Best Answer

A common example is sorting.

In C, qsort takes a pointer to a comparison function. Generally speaking, there will be one copy of the qsort code, which is not inlined. It will make a call through the pointer to the comparison routine -- this of course is also not inlined.

In C++, std::sort is a template, and it can take a functor object as comparator. There is a different copy of std::sort for each different type used as a comparator. Assuming you use a functor class with overloaded operator(), then the call to the comparator can easily be inlined into this copy of std::sort.

So, templates give you more inlining because there are more copies of the sort code, each of which can inline a different comparator. Inlining is quite a good optimization, and sort routines do a lot of comparisons, so you can often measure std::sort running faster than an equivalent qsort. The cost of this is the chance of much larger code -- if your program uses a lot of different comparators then you get a lot of different copies of the sort routine, each with a different comparator baked into it.

In principle there's no reason why a C implementation can't inline qsort into the place it is called. Then if it was called with the name of the function, the optimizer could in theory observe that at the point it is used, the function pointer must still point to that same function. Then it can inline the call to the function, and the result would be similar to the result with std::sort. But in practice, compilers tend not to take the first step, inlining qsort. That's because (a) it's large, and (b) it's in a different translation unit, usually compiled into some library that your program is linked against, and (c) to do it this way, you'd have an inlined copy of qsort for every call to it, not just a copy for every different comparator. So it would be even more bloated than the C++, unless the implementation could also find a way to common up the code in cases where qsort is called in different places with the same comparator.

So, general-purpose functions like qsort in C tend to have some overheads on account of calls through function pointers, or other indirection[*]. Templates in C++ are a common way of keeping the source code generic, but ensuring that it compiles to a special-purpose function (or several such functions). The special-purpose code hopefully is faster.

It's worth noting that templates are not by any means just about performance. std::sort is itself more general-purpose than qsort in some ways. For example qsort only sorts arrays, whereas std::sort can sort anything that provides a random-access iterator. It can for example sort a deque, which under the covers is several disjoint arrays allocated separately. So the use of templates doesn't necessarily provide any performance benefit, it might be done for other reasons. It just happens that templates do affect performance.

[*] another example with sorting - qsort takes an integer parameter saying how big each element of the array is, and when it moves elements it therefore must call memcpy or similar with the value of this variable. std::sort knows at compile-time the exact type of the elements, and hence the exact size. It can inline a copy constructor call that in turn might translate to instructions to copy that number of bytes. As with the inlined comparator, it's often possible to copy exactly 4 (or 8, or 16, or whatever) bytes faster than you'd get by calling a routine that copies a variable number of bytes, passing it the value 4 (or 8, or 16, or whatever). As before, if you called qsort with a literal value for the size, and that call to qsort was inlined, then the compiler could perform the exact same optimization in C. But in practice you don't see that.

Related Solutions

C++ – Why can templates only be implemented in the header file

Caveat: It is not necessary to put the implementation in the header file, see the alternative solution at the end of this answer.

Anyway, the reason your code is failing is that, when instantiating a template, the compiler creates a new class with the given template argument. For example:

template<typename T>
struct Foo
{
    T bar;
    void doSomething(T param) {/* do stuff using T */}
};

// somewhere in a .cpp
Foo<int> f;

When reading this line, the compiler will create a new class (let's call it FooInt), which is equivalent to the following:

struct FooInt
{
    int bar;
    void doSomething(int param) {/* do stuff using int */}
}

Consequently, the compiler needs to have access to the implementation of the methods, to instantiate them with the template argument (in this case int). If these implementations were not in the header, they wouldn't be accessible, and therefore the compiler wouldn't be able to instantiate the template.

A common solution to this is to write the template declaration in a header file, then implement the class in an implementation file (for example .tpp), and include this implementation file at the end of the header.

Foo.h

template <typename T>
struct Foo
{
    void doSomething(T param);
};

#include "Foo.tpp"

Foo.tpp

template <typename T>
void Foo<T>::doSomething(T param)
{
    //implementation
}

This way, implementation is still separated from declaration, but is accessible to the compiler.

Alternative solution

Another solution is to keep the implementation separated, and explicitly instantiate all the template instances you'll need:

Foo.h

// no implementation
template <typename T> struct Foo { ... };

Foo.cpp

// implementation of Foo's methods

// explicit instantiations
template class Foo<int>;
template class Foo<float>;
// You will only be able to use Foo with int or float

If my explanation isn't clear enough, you can have a look at the C++ Super-FAQ on this subject.

C++ – Difference of keywords ‘typename’ and ‘class’ in templates

typename and class are interchangeable in the basic case of specifying a template:

template<class T>
class Foo
{
};

and

template<typename T>
class Foo
{
};

are equivalent.

Having said that, there are specific cases where there is a difference between typename and class.

The first one is in the case of dependent types. typename is used to declare when you are referencing a nested type that depends on another template parameter, such as the typedef in this example:

template<typename param_t>
class Foo
{
    typedef typename param_t::baz sub_t;
};

The second one you actually show in your question, though you might not realize it:

template < template < typename, typename > class Container, typename Type >

When specifying a template template, the class keyword MUST be used as above -- it is not interchangeable with typename in this case (note: since C++17 both keywords are allowed in this case).

You also must use class when explicitly instantiating a template:

template class Foo<int>;

I'm sure that there are other cases that I've missed, but the bottom line is: these two keywords are not equivalent, and these are some common cases where you need to use one or the other.