Performance – Benchmarking Asynchronous Code in Node.js

asynchronous-programming · node.js · performance

Asynchronous programming seems to be getting quite popular these days. One of the most frequently cited advantages is the performance gained by removing operations that block threads. But I have also seen people say that the advantage is not that great, and that a properly configured thread pool can have the same effect as fully asynchronous code.

My question is: are there any real benchmarks that compare blocking vs. asynchronous code? It can be any language, but I think C# and Java would be most representative. I'm not sure how meaningful such a benchmark would be with Node.js or something similar.

Edit: My attempt at a general question combined with unclear terminology seems to have failed. By "asynchronous code" I mean what some answers described as event or callback programming: operations that would block a thread are instead delegated to some callback system so that threads can be utilized better.
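
To make that terminology concrete, here is a minimal Node.js sketch of the distinction I mean (the file name is just an illustration): the first call holds the thread until the read completes, while the second hands the operation to the runtime and registers a callback.

```ts
import { readFileSync } from "node:fs";
import { readFile } from "node:fs/promises";

// Blocking: the calling thread does nothing else until the read finishes.
const blocking = readFileSync("data.json", "utf8");
console.log(blocking.length);

// Asynchronous: the read is delegated to the runtime; the thread is free to
// serve other work, and this callback runs once the data is available.
readFile("data.json", "utf8").then((contents) => {
  console.log(contents.length);
});
```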

And if I wanted to ask a specific question: are there any benchmarks that compare the throughput/latency gains of async/await server code in .NET? Or any other similar comparison?

Best Answer

Based on your comments, it seems that you're really interested in "non-blocking IO." This differs from my definition of "asynchronous programming," which is an approach to decomposing work exemplified by Erlang processes or Goroutines.

And, if that is your definition, then yes, there have been benchmarks. But, like all benchmarks, they shouldn't be accepted blindly. Instead, you need to think about what goes on behind the scenes.

  • A thread is the unit of OS scheduling. Platforms such as Erlang and Go build their own schedulers on top of the OS scheduler, allowing multiple units of execution to share the same thread. This is great, as long as your units of execution are lightweight, because it avoids the overheads associated with threads.* However, IO operations require a trip to the kernel, which means that you need a real thread to do them. And if you're implementing a sub-thread scheduler, you need to be smart about not scheduling sub-thread tasks on a thread that's blocked in a kernel operation.
  • All IO operations have the potential to block.** When you make a read or write request, the kernel looks to see if there is data available or (for write) room in a buffer. If not, the kernel suspends the thread until the operation can complete. This makes thread-per-connection servers really simple to implement, but worries people who think about thread overheads.
  • Operating systems provide a way to block on multiple IO channels simultaneously. The select call on POSIX is one of these: you provide it with a list of channels (file descriptors / sockets) that you care about, and it will tell you when one of them is ready to read or write (read is what most people care about). You still have to make a kernel call, and you'll still end up blocking a thread if nothing's available, but that's only one thread. This is how Node.js works: when data is available, the proper event handler is called (I don't know the internals of Node, but I'd hope it also verifies that write buffers have room before calling write). A minimal sketch of this event-handler style follows this list.
  • When you max out the CPU, you're done. It doesn't matter whether you use select or a thread-per-connection approach; you still need to spend CPU to do whatever your server is meant to do. With the thread-per-connection approach, you don't really pay attention to that: the scheduler will assign threads to cores, and you'll degrade gracefully. With select, you will need to hand connections off to threads when they're ready for processing, or you'll be limited by the performance of a single core (Node.js gets around that by letting you spawn multiple server processes; see the cluster sketch below).
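
To illustrate the third point, here is a minimal sketch (port and echo behaviour are arbitrary) of the event-handler style Node.js exposes on top of that kernel-level multiplexing: one thread registers handlers for many connections and is only invoked when a socket is actually ready.

```ts
import net from "node:net";

// One thread, many connections: the runtime multiplexes the underlying
// sockets (epoll/kqueue/IOCP under the hood) and calls our handlers only
// when a socket is ready, so nothing here blocks waiting on one client.
const server = net.createServer((socket) => {
  socket.on("data", (chunk) => {
    // Fires only when the kernel reports data available on this socket.
    if (!socket.write(chunk)) {
      // write() returned false: the outgoing buffer is full. Pause reading
      // and resume once the kernel has drained it ('drain' event).
      socket.pause();
      socket.once("drain", () => socket.resume());
    }
  });
  socket.on("error", (err) => console.error("connection error:", err.message));
});

server.listen(9000, () => console.log("echo server listening on :9000"));
```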

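And for the fourth point, a sketch (the port and per-core worker count are arbitrary; cluster.isPrimary assumes a reasonably recent Node version) of how Node.js spawns multiple server processes so that a CPU-bound workload isn't limited to a single core:

```ts
import cluster from "node:cluster";
import http from "node:http";
import os from "node:os";

if (cluster.isPrimary) {
  // The primary process only forks workers; each worker runs its own event
  // loop on its own core, and they all share the same listening socket.
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  http
    .createServer((req, res) => {
      // Any CPU-bound work now runs in whichever worker took the connection.
      res.end(`handled by pid ${process.pid}\n`);
    })
    .listen(9000);
}
```
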
As I said, benchmarks shouldn't be accepted blindly; they're only valid as long as they model the real-world problem that you're trying to solve. The author of the linked benchmark works for (worked for?) Mailinator, which, if you haven't used it, is a poste restante service for ad hoc email addresses. That means it's going to be getting short-lived, high-activity connections from a relatively small number of clients. This is a perfect use case for thread-per-connection scheduling. As noted in the comments, a chat server (long-lived, low activity) might be different.

In my mind, the question of blocking versus non-blocking IO is rather boring: most real-world servers don't have that many concurrent connections. More interesting to me is the programming model: a worker-based model like Erlang's or Go's means that you can focus on your business logic and not care about how connections are being managed.
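
As a rough Node.js analogue of that worker-based split (the order-totalling logic, port, and message shape are all made up for illustration, and the single-file layout assumes an ES-module setup), the business logic can live in a worker thread that never sees a socket, while the serving side only shuttles messages:

```ts
import http from "node:http";
import { Worker, isMainThread, parentPort } from "node:worker_threads";

if (isMainThread) {
  // The serving side owns the connections and knows nothing about the work.
  const worker = new Worker(new URL(import.meta.url));
  let nextId = 0;
  const pending = new Map<number, http.ServerResponse>();

  worker.on("message", ({ id, total }: { id: number; total: number }) => {
    pending.get(id)?.end(JSON.stringify({ total }));
    pending.delete(id);
  });

  http
    .createServer((req, res) => {
      let body = "";
      req.on("data", (chunk) => (body += chunk));
      req.on("end", () => {
        const id = nextId++;
        pending.set(id, res);
        worker.postMessage({ id, items: JSON.parse(body).items });
      });
    })
    .listen(8080);
} else {
  // The worker side is pure business logic; it never touches a socket.
  parentPort?.on("message", ({ id, items }: { id: number; items: number[] }) => {
    const total = items.reduce((sum, price) => sum + price, 0);
    parentPort?.postMessage({ id, total });
  });
}
```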

* These overheads include kernel scheduling structures, and perhaps most important, a multi-megabyte thread stack, most of which goes unused. While 2MB doesn't seem like much, it adds up quickly if you have 100k processes ... which most applications don't have.

** Not 100% true, but I don't want to get too deep into the weeds here.