C++ – Lock-Free Multiple Producer/Consumer Queue in C++11

c c++11 lock-free multithreading queue

I'm trying to implement a lock-free multiple-producer, multiple-consumer queue in C++11. I'm doing this as a learning exercise, so I'm well aware that I could just use an existing open-source implementation, but I'd really like to find out why my code doesn't work. The data is stored in a ring buffer; apparently this makes it a "bounded MPMC queue".

I've modelled it pretty closely on what I've read of the Disruptor. The thing I've noticed is that it works absolutely fine with a single consumer and single/multiple producers; it's only multiple consumers that seem to break it.

Here's the queue:

template <typename T>
class Queue : public IQueue<T>
{
public:
    explicit Queue( int capacity );
    ~Queue();

    bool try_push( T value );
    bool try_pop( T& value );
private:
    typedef struct
    {
        bool readable;
        T value;
    } Item;

    std::atomic<int> m_head;
    std::atomic<int> m_tail;
    int m_capacity;
    Item* m_items;
};

template <typename T>
Queue<T>::Queue( int capacity ) :
m_head( 0 ),
m_tail( 0 ),
m_capacity(capacity),
m_items( new Item[capacity] )
{
    for( int i = 0; i < capacity; ++i )
    {
        m_items[i].readable = false;
    }
}

template <typename T>
Queue<T>::~Queue()
{
    delete[] m_items;
}

template <typename T>
bool Queue<T>::try_push( T value )
{
    while( true )
    {
        // See that there's room
        int tail = m_tail.load(std::memory_order_acquire);
        int new_tail = ( tail + 1 );
        int head = m_head.load(std::memory_order_acquire);

        if( ( new_tail - head ) >= m_capacity )
        {
            return false;
        }

        if( m_tail.compare_exchange_weak( tail, new_tail, std::memory_order_acq_rel ) )
        {
            // In try_pop, m_head is incremented before the reading of the value has completed,
            // so though we've acquired this slot, a consumer thread may be in the middle of reading
            tail %= m_capacity;

            std::atomic_thread_fence( std::memory_order_acquire );
            while( m_items[tail].readable )
            {
            }

            m_items[tail].value = value;
            std::atomic_thread_fence( std::memory_order_release );
            m_items[tail].readable = true;

            return true;
        }
    }
}

template <typename T>
bool Queue<T>::try_pop( T& value )
{
    while( true )
    {
        int head = m_head.load(std::memory_order_acquire);
        int tail = m_tail.load(std::memory_order_acquire);

        if( head == tail )
        {
            return false;
        }

        int new_head = ( head + 1 );

        if( m_head.compare_exchange_weak( head, new_head, std::memory_order_acq_rel ) )
        {
            head %= m_capacity;

            std::atomic_thread_fence( std::memory_order_acquire );
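            // Mirror of the producer side: we've claimed this slot, but the producer
            // that filled it may still be writing, so wait until it's marked readable.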
            while( !m_items[head].readable )
            {
            }

            value = m_items[head].value;
            std::atomic_thread_fence( std::memory_order_release );
            m_items[head].readable = false;

            return true;
        }
    }
}

And here's the test I'm using:

void Test( std::string name, Queue<int>& queue )
{
    const int NUM_PRODUCERS = 64;
    const int NUM_CONSUMERS = 2;
    const int NUM_ITERATIONS = 512;
    bool table[NUM_PRODUCERS*NUM_ITERATIONS];
    memset(table, 0, NUM_PRODUCERS*NUM_ITERATIONS*sizeof(bool));
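    // Each produced value should be popped exactly once; table[v] records whether value v has been seen.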

    std::vector<std::thread> threads(NUM_PRODUCERS+NUM_CONSUMERS);

    std::chrono::system_clock::time_point start, end;
    start = std::chrono::system_clock::now();

    std::atomic<int> pop_count (NUM_PRODUCERS * NUM_ITERATIONS);
    std::atomic<int> push_count (0);

    for( int thread_id = 0; thread_id < NUM_PRODUCERS; ++thread_id )
    {
        threads[thread_id] = std::thread([&queue,thread_id,&push_count]()
                                 {
                                     int base = thread_id * NUM_ITERATIONS;

                                     for( int i = 0; i < NUM_ITERATIONS; ++i )
                                     {
                                         while( !queue.try_push( base + i ) ){};
                                         push_count.fetch_add(1);
                                     }
                                 });
    }

    for( int thread_id = 0; thread_id < ( NUM_CONSUMERS ); ++thread_id )
    {
        threads[thread_id+NUM_PRODUCERS] = std::thread([&]()
                                         {
                                             int v;

                                             while( pop_count.load() > 0 )
                                             {
                                                 if( queue.try_pop( v ) )
                                                 {
                                                     if( table[v] )
                                                     {
                                                         std::cout << v << " already set" << std::endl;
                                                     }
                                                     table[v] = true;
                                                     pop_count.fetch_sub(1);
                                                 }
                                             }
                                         });

    }

    for( int i = 0; i < ( NUM_PRODUCERS + NUM_CONSUMERS ); ++i )
    {
        threads[i].join();
    }

    end = std::chrono::system_clock::now();
    std::chrono::duration<double> duration = end - start;

    std::cout << name << " " << duration.count() << std::endl;

    std::atomic_thread_fence( std::memory_order_acq_rel );

    bool result = true;
    for( int i = 0; i < NUM_PRODUCERS * NUM_ITERATIONS; ++i )
    {
        if( !table[i] )
        {
            std::cout << "failed at " << i << std::endl;
            result = false;
        }
    }
    std::cout << name << " " << ( result? "success" : "fail" ) << std::endl;
}

Any nudging in the right direction would be greatly appreciated. I'm pretty new to using memory fences rather than just a mutex for everything, so I'm probably fundamentally misunderstanding something.

Cheers
J

Best Answer

I'd take a look at Moody Camel's implementation.

It is a fast, general-purpose lock-free queue for C++, written entirely in C++11. The documentation seems to be rather good, and it comes with a few performance benchmarks.

Among other interesting things (the whole write-up is worth a read anyway), it's all contained in a single header and available under the simplified BSD license. Just drop it into your project and enjoy!
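
If it helps to get started, here's a rough sketch of what using it can look like, loosely mirroring your test above. It assumes the single header concurrentqueue.h and the enqueue/try_dequeue calls from its documentation; treat it as an illustration rather than an official sample:

#include "concurrentqueue.h" // the single header from the moodycamel repository
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

int main()
{
    moodycamel::ConcurrentQueue<int> queue;

    const int NUM_PRODUCERS = 4;
    const int NUM_ITEMS_PER_PRODUCER = 100;
    std::atomic<int> remaining( NUM_PRODUCERS * NUM_ITEMS_PER_PRODUCER );
    std::atomic<long> sum( 0 );

    std::vector<std::thread> threads;

    // Producers: each enqueues a distinct range of values.
    for( int t = 0; t < NUM_PRODUCERS; ++t )
    {
        threads.emplace_back( [&queue, t]()
        {
            for( int i = 0; i < NUM_ITEMS_PER_PRODUCER; ++i )
            {
                queue.enqueue( t * NUM_ITEMS_PER_PRODUCER + i ); // may allocate; this queue is unbounded
            }
        } );
    }

    // Consumers: pop until every produced item has been accounted for.
    for( int t = 0; t < 2; ++t )
    {
        threads.emplace_back( [&queue, &remaining, &sum]()
        {
            int item;
            while( remaining.load() > 0 )
            {
                if( queue.try_dequeue( item ) ) // returns false if the queue looked empty
                {
                    sum.fetch_add( item );
                    remaining.fetch_sub( 1 );
                }
            }
        } );
    }

    for( auto& thread : threads )
    {
        thread.join();
    }

    // 0 + 1 + ... + 399 == 399 * 400 / 2
    assert( sum.load() == 399L * 400 / 2 );
    return 0;
}

The library also offers producer/consumer tokens and bulk operations for better throughput, but plain enqueue/try_dequeue is enough to get going.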
