C# Kafka Consumer Design – Multithreading with Apache Kafka

apache-kafka, c#, multithreading

I have a C#-based system that works relatively well. It reads data from Kafka, processes it, and then writes it out to MSSQL.

The Kafka topic has 10 partitions, and after 2 years the team is considering increasing the number of partitions because the cluster is showing signs that it can no longer keep up with the volumes. (Volumes are large and grow every month.)

My consumers are implemented as Windows services, each of which starts one Kafka consumer. We've installed 10 services, so each service handles one partition.

This design works well: if 5 of the services go down, the consumer group rebalances and the 5 remaining services each handle 2 partitions.

When the partition count increases, I should be able to install and start a few extra consumers and everything should continue smoothly.

I'm starting to question whether I shouldn't change the design to have a single service start multiple KafkaReaders (basically multiple threads, each with a Kafka consumer). If I do this, scaling would be easier: just configure the service to start more threads. (I would probably still run at least two services for redundancy, or in case I want to take one down.)
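For illustration, a minimal sketch of what that single-service design could look like with the Confluent.Kafka client. The broker address, topic name, group id, and reader count are placeholders, and the actual processing and error handling are omitted:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Confluent.Kafka;

// One Windows service hosting several consumer threads (KafkaReaders).
public class KafkaReaderHost
{
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();
    private readonly List<Task> _readers = new List<Task>();

    public void Start(int readerCount)
    {
        for (int i = 0; i < readerCount; i++)
        {
            int readerId = i;
            _readers.Add(Task.Run(() => RunReader(readerId, _cts.Token)));
        }
    }

    private void RunReader(int readerId, CancellationToken token)
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "broker1:9092",   // placeholder
            GroupId = "my-consumer-group",       // same group => partitions shared across readers
            EnableAutoCommit = false
        };

        using var consumer = new ConsumerBuilder<string, string>(config).Build();
        consumer.Subscribe("my-topic");          // placeholder topic

        try
        {
            while (!token.IsCancellationRequested)
            {
                var result = consumer.Consume(token);
                // Process(result) and write to MSSQL here, then commit.
                consumer.Commit(result);
            }
        }
        catch (OperationCanceledException)
        {
            // Shutdown requested.
        }
        finally
        {
            consumer.Close();                    // leave the group cleanly
        }
    }

    public void Stop()
    {
        _cts.Cancel();
        Task.WaitAll(_readers.ToArray());
    }
}
```

Because all readers share the same consumer group, the broker spreads the partitions across them exactly as it does across separate services today.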

I have heard anecdotes, though, claiming that multiple threads in the same service consume from Kafka more slowly than multiple separate applications. I don't see why this would be the case, and in my situation I would get some nice memory savings by consolidating into a single service.

Is there any logical reason that you know of why running 10 threads in a single service would perform worse than running 10 services?

Either I'm missing some basic principle – or the anecdotes were just… inaccurate(?)…

Best Answer

Keep in mind that if you are going to run multiple Kafka consumers in a single process on multiple threads, you need logic to restart failed consumer threads, or at least to be notified of the failure.
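As a rough illustration of that supervision idea (not a prescribed implementation), a wrapper like the following could restart a reader whose consume loop throws; `runReader` stands in for the per-consumer loop from the earlier sketch:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ReaderSupervisor
{
    // If a reader's loop throws, log it, back off, and restart it
    // instead of letting the thread die silently.
    public static async Task Supervise(int readerId,
                                       Action<int, CancellationToken> runReader,
                                       CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            try
            {
                await Task.Run(() => runReader(readerId, token), token);
                break;                       // reader exited normally
            }
            catch (OperationCanceledException)
            {
                break;                       // shutdown requested
            }
            catch (Exception ex)
            {
                // Alert/log here so a dead consumer doesn't go unnoticed.
                Console.Error.WriteLine($"Reader {readerId} failed: {ex.Message}");
                try { await Task.Delay(TimeSpan.FromSeconds(5), token); }
                catch (OperationCanceledException) { break; }
            }
        }
    }
}
```

With separate services, the Windows Service Control Manager (or your monitoring) gives you this restart-and-alert behaviour for free; in a single process you have to build it yourself.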

Is there any logical reason that you know of why running 10 threads in a single service would perform worse than running 10 services?

Undoubtedly, threads are cheaper than processes, but processes can be scaled horizontally: you can run one instance on one machine and another instance on a different machine, which gives you better fault tolerance and availability.

Also, processes are easier to monitor from the outside; for example, when one goes down or comes back up, you find out easily.

When the partition count increases, I should be able to install and start a few extra consumers and everything should continue smoothly.

I'm starting to question whether I shouldn't change the design to have a single service start multiple KafkaReaders (basically multiple threads, each with a Kafka consumer). If I do this, scaling would be easier: just configure the service to start more threads.

You can have an initialization script that gets the number of topic partitions and spins up the required number of processes based on that; it should not be hard to implement, as sketched below.
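For example, something along these lines (the broker address and topic name are placeholders) could read the partition count with the Confluent.Kafka admin client and let a startup script decide how many readers or processes to launch:

```csharp
using System;
using Confluent.Kafka;

public static class PartitionInfo
{
    // Returns the current number of partitions for a topic.
    public static int GetPartitionCount(string bootstrapServers, string topic)
    {
        var config = new AdminClientConfig { BootstrapServers = bootstrapServers };
        using var admin = new AdminClientBuilder(config).Build();

        var metadata = admin.GetMetadata(topic, TimeSpan.FromSeconds(10));
        return metadata.Topics[0].Partitions.Count;
    }
}

// Usage: start one reader/process per partition (or fewer, if processing allows).
// int readers = PartitionInfo.GetPartitionCount("broker1:9092", "my-topic");
```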

A point to note here is that increasing the topic partitions is not a frequent operation, and when it happens, the Ops team will have to consider whether the consumers need to be scaled as well.

Moreover, an increase in topic partitions doesn't necessarily demand an increase in consumers, because the number of consumers you need depends on more than just the number of partitions, for example:

  1. The processing logic.
  2. The CPU, memory, and other resources allotted to each consumer.
  3. The network between the consumers and the Kafka cluster, etc.

Your question essentially boils down to multithreading vs. multiprocessing, and the answer depends on the use case, the amount of data to be processed, the available resources, the language chosen, etc.

Check this answer, and also this answer, which says that processes on Windows are heavyweight compared to their Linux counterparts.