Optimizing Redundant String Allocations in C++

Tags: c++, optimization, strings

I have a fairly complex C++ component whose performance has become a problem. Profiling shows that most of the execution time is spent simply allocating memory for std::strings.

I know that there's a lot of redundancy among those strings. A handful of values repeat very frequently but there are also lots of unique values. The strings are typically fairly short.

I'm now wondering whether it would make sense to somehow reuse those frequent allocations. Instead of 1000 pointers to 1000 distinct "foobar" values, I could have 1000 pointers to one "foobar" value. The fact that this would be more memory-efficient is a nice bonus, but I'm mostly concerned about latency here.

I guess one option would be to maintain some kind of registry of already-allocated values, but is it even possible to make the registry lookups faster than the redundant memory allocations? Is this a viable approach?

Best Answer

I lean heavily on interned strings, as Basile suggests: a string lookup translates to a 32-bit index that is cheap to store and compare. That's useful in my case, since I sometimes have hundreds of thousands to millions of components with a property named, e.g., "x", which still needs to be a user-friendly string name because scripters often access it by name.

I use a trie for the lookup (I also experimented with unordered_map, but my tuned trie backed by memory pools started out performing better and was also easier to make thread-safe without just locking on every access), though it isn't as fast to construct as a plain std::string. The point is more to speed up subsequent operations: checking string equality, in my case, boils down to comparing two integers, and memory usage drops drastically.
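For reference, here's a minimal, single-threaded sketch of the interning idea using unordered_map rather than my trie (the class and method names are just for illustration, not my actual implementation):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal interner: maps each distinct string to a stable 32-bit id.
// Equality checks on interned strings become integer comparisons.
class StringInterner
{
public:
    // Returns the id for s, creating one on first sight.
    uint32_t intern(const std::string& s)
    {
        std::unordered_map<std::string, uint32_t>::iterator it = ids.find(s);
        if (it != ids.end())
            return it->second;
        uint32_t id = static_cast<uint32_t>(strings.size());
        strings.push_back(s);
        ids.insert(std::make_pair(s, id));
        return id;
    }

    // Maps an id back to its user-friendly string.
    const std::string& lookup(uint32_t id) const
    {
        return strings[id];
    }

private:
    std::unordered_map<std::string, uint32_t> ids;
    std::vector<std::string> strings;
};
```

Repeated values like "foobar" then cost one allocation total, and every later occurrence is just a 32-bit id.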

I guess one option would be to maintain some kind of a registry of already allocated values but is it even possible to make the registry lookups faster than redundant memory allocations?

It's going to be tough to make a search through a data structure much faster than a single malloc. If you have a case where you're reading a boatload of strings from an external input like a file, then my temptation would be to use a sequential allocator if possible. That comes with the downside that you can't free the memory of an individual string: all the memory pooled by the allocator has to be freed at once or not at all. But a sequential allocator can be handy when you just need to allocate a boatload of tiny, variable-sized chunks of memory in a straight sequential fashion, only to then toss it all away later. I don't know whether that applies in your case, but when applicable, it can be an easy way to fix a hotspot related to frequent teeny memory allocations (which might have more to do with cache misses and page faults than with the underlying algorithm used by, say, malloc).
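A hedged sketch of such a sequential (bump) allocator, with the interface and default block size being my assumptions:

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Sequential (bump) allocator: hands out chunks from large blocks by
// advancing an offset. Individual chunks can't be freed; everything is
// released at once when the allocator is destroyed.
class SequentialAllocator
{
public:
    explicit SequentialAllocator(size_t block_size = 65536)
        : block_size(block_size), offset(block_size) {}

    ~SequentialAllocator()
    {
        for (size_t i = 0; i < blocks.size(); ++i)
            free(blocks[i]);
    }

    // Returns n bytes; just a pointer bump in the common case.
    void* allocate(size_t n)
    {
        if (offset + n > block_size)
        {
            // Current block exhausted (or request oversized): grab a new one.
            size_t size = n > block_size ? n : block_size;
            blocks.push_back(static_cast<char*>(malloc(size)));
            offset = 0;
        }
        void* mem = blocks.back() + offset;
        offset += n;
        return mem;
    }

private:
    std::vector<char*> blocks;
    size_t block_size;
    size_t offset;
};
```

Note the complete absence of a per-string deallocate: that constraint is exactly what makes the common path a pointer bump.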

Fixed-size allocations are easier to speed up without the sequential-allocator constraint that prevents you from freeing specific chunks of memory for later reuse. But making variable-sized allocation faster than the default allocator is pretty tough; making any kind of memory allocator faster than malloc is generally extremely tough unless you apply constraints that narrow its applicability. One solution is to use a fixed-size allocator for, say, all strings of 8 bytes or less, if you have a boatload of them and longer strings are a rare case (those can just use the default allocator). That does mean 7 bytes are wasted for a 1-byte string, but it should eliminate allocation-related hotspots if, say, 95% of the time your strings are very short.
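A hedged sketch of that hybrid scheme (the 8-byte threshold and the function names are my assumptions, and pool_alloc/pool_free are stubbed with malloc here so the snippet compiles on its own; in practice they'd call into a fixed-size pool allocator):

```cpp
#include <cstdlib>
#include <cstring>

// Strings of 8 bytes or less (including the terminator) share one
// fixed-size pool; longer, rarer strings fall back to the default
// allocator.
static const size_t kSmall = 8;

static void* pool_alloc()       { return malloc(kSmall); }  // stub for pool
static void  pool_free(void* p) { free(p); }                // stub for pool

char* copy_string(const char* s)
{
    size_t len = strlen(s) + 1;  // include '\0'
    char* mem = len <= kSmall ? static_cast<char*>(pool_alloc())
                              : static_cast<char*>(malloc(len));
    memcpy(mem, s, len);
    return mem;
}

void free_string(char* s)
{
    if (strlen(s) + 1 <= kSmall)
        pool_free(s);
    else
        free(s);
}
```

The wasted tail bytes in short strings buy you the fast fixed-size path on the overwhelmingly common case.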

Another solution that just occurred to me is to use unrolled linked lists which might sound crazy but hear me out.


The idea here is to make each unrolled node fixed-size instead of variable-sized. When you do that, you can use an extremely fast fixed-size chunk allocator which pools memory, allocating fixed-size chunks for variable-sized strings linked together. That won't reduce memory use; it'll tend to add to it because of the cost of the links, but you can play with the unrolled size to find a balance suitable for your needs. It's kind of a wacky idea, but it should eliminate memory-related hotspots, since you can now pool already-allocated memory in bulky contiguous blocks and still have the benefit of freeing strings individually. Here's a simple ol' fixed allocator I wrote (an illustrative one I made for someone else, devoid of production-related fluff) which you can freely use:

#ifndef FIXED_ALLOCATOR_HPP
#define FIXED_ALLOCATOR_HPP

class FixedAllocator
{
public:
    /// Creates a fixed allocator with the specified type and block size.
    explicit FixedAllocator(int type_size, int block_size = 2048);

    /// Destroys the allocator.
    ~FixedAllocator();

    /// @return A pointer to a newly allocated chunk.
    void* allocate();

    /// Frees the specified chunk.
    void deallocate(void* mem);

private:
    struct Block;
    struct FreeElement;

    FreeElement* free_element;
    Block* head;
    int type_size;
    int num_block_elements;
};

#endif

#include "FixedAllocator.hpp"
#include <cstdlib>

struct FixedAllocator::FreeElement
{
    FreeElement* next_element;
};

struct FixedAllocator::Block
{
    Block* next;
    char* mem;
};

FixedAllocator::FixedAllocator(int type_size, int block_size): free_element(0), head(0)
{
    // A chunk must be at least big enough to hold a free-list link.
    this->type_size = type_size > static_cast<int>(sizeof(FreeElement))
                    ? type_size : static_cast<int>(sizeof(FreeElement));
    num_block_elements = block_size / this->type_size;
    if (num_block_elements == 0)
        num_block_elements = 1;
}

FixedAllocator::~FixedAllocator()
{
    // Free each block in the list, popping a block until the stack is empty.
    while (head)
    {
        Block* block = head;
        head = head->next;
        free(block->mem);
        free(block);
    }
    free_element = 0;
}

void* FixedAllocator::allocate()
{
    // Common case: just pop free element and return.
    if (free_element)
    {
        void* mem = free_element;
        free_element = free_element->next_element;
        return mem;
    }

    // Rare case when we're out of free elements.
    // Create new block.
    Block* new_block = static_cast<Block*>(malloc(sizeof(Block)));
    new_block->mem = static_cast<char*>(malloc(type_size * num_block_elements));
    new_block->next = head;
    head = new_block;

    // Push all but one of the new block's elements to the free stack.
    char* mem = new_block->mem;
    for (int j=1; j < num_block_elements; ++j)
    {
        void* ptr = mem + j*type_size;
        FreeElement* element = static_cast<FreeElement*>(ptr);
        element->next_element = free_element;
        free_element = element;
    }
    return mem;
}

void FixedAllocator::deallocate(void* mem)
{
    // Just push a free element to the stack.
    FreeElement* element = static_cast<FreeElement*>(mem);
    element->next_element = free_element;
    free_element = element;
}
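To sketch how the unrolled idea plays out on top of an allocator like that (the StringNode layout and the 8-byte payload size are my assumptions, and malloc stands in for FixedAllocator::allocate so the snippet compiles on its own):

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical unrolled node: a fixed-size chunk holding a link plus a
// small character payload. Because sizeof(StringNode) is constant, one
// fixed-size pool can serve every node regardless of string length.
struct StringNode
{
    StringNode* next;
    char chars[8];  // payload; the final node may be partially filled
};

// Splits a string across a chain of fixed-size nodes.
StringNode* store_string(const std::string& s)
{
    StringNode* head = 0;
    StringNode** tail = &head;
    for (size_t i = 0; i < s.size(); i += sizeof(StringNode::chars))
    {
        // malloc here stands in for a FixedAllocator-style allocate().
        StringNode* node = static_cast<StringNode*>(malloc(sizeof(StringNode)));
        size_t left = s.size() - i;
        size_t n = left < sizeof(StringNode::chars) ? left
                                                    : sizeof(StringNode::chars);
        memcpy(node->chars, s.data() + i, n);
        node->next = 0;
        *tail = node;
        tail = &node->next;
    }
    return head;
}

// Reassembles a chain of known total length into a std::string.
std::string load_string(const StringNode* head, size_t length)
{
    std::string out;
    out.reserve(length);
    for (const StringNode* n = head; n; n = n->next)
    {
        size_t left = length - out.size();
        size_t take = left < sizeof(StringNode::chars) ? left
                                                       : sizeof(StringNode::chars);
        out.append(n->chars, take);
    }
    return out;
}
```

Every node is the same size, so the whole string store can live in the pooled blocks above while individual strings remain freeable node by node.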