Kafka Streaming to Website System – Solution Architecture

apache-kafka · architectural-patterns · design · integration · real-time

Not sure if this is the correct forum for this question – but I could really use some advice.

I need to design a system (within certain constraints).

The system must implement the following logic:

  1. A producer will drop JSON messages on a Kafka topic for me to consume. Each message will contain an image (something like a uuencoded jpg) along with some other fields.
  2. Each message must be shown to a group of administrators.
  3. An administrator must view the image and related data and will classify the image as "Good" or "Bad".
  4. Images that are classified as "Bad" will be retained (stored in a SQL database) and "Good" images will be discarded.

Constraints:

  1. The use of Kafka is out of my control and mandatory.
  2. I may use C# and MSSQL on a Windows OS (.NET 5 is preferred).
  3. The front-end must be a VueJS front-end.
  4. I'm allowed to use any tech inside the above-mentioned stack, so REST APIs, OData, SignalR etc. are all allowed.

Business Rules:

  1. All images must be processed as quickly as possible.
  2. Users may log on and off at any time. (If volume increases, more users will log in to help.)
  3. Only one administrator may classify a given image (images may not be classified more than once).
  4. The system must be available 24x7x365.
  5. Expect low volumes at night (a handful of images per hour).
  6. Expect about 1000 images per hour between 6am and 6pm, with peaks not exceeding 2000 images per hour at 8am, 1pm and 4pm.
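To give the volumes above a rough sense of scale, the peak rate translates into a required classifier head-count via Little's law (demand = arrival rate × service time). A quick back-of-envelope sketch — the 20-second average handling time per image is an assumed figure for illustration, not something stated in the question:

```python
# Back-of-envelope sizing from the business rules above.
# SECONDS_PER_IMAGE is an assumption; plug in a measured value.

PEAK_PER_HOUR = 2000        # peak rate from business rule 6
SECONDS_PER_IMAGE = 20      # assumed average time for one classification

peak_per_second = PEAK_PER_HOUR / 3600                     # arrival rate
admins_needed = PEAK_PER_HOUR * SECONDS_PER_IMAGE / 3600   # Little's law

print(f"peak arrival rate: {peak_per_second:.2f} images/s")   # ~0.56/s
print(f"admins busy at peak: {admins_needed:.1f}")            # ~11.1
```

So even at peak this is a low-throughput system; the hard problems are coordination and availability, not raw volume.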

The obvious (almost default) solution is to have a C# service that reads the Kafka topic and puts the data into MSSQL, and then to have the users "pull" images from there.

These users will be doing other work as well (typically in the same front-end), so I'd like to push the messages to them, rather than have them pull messages.

So, I could now read the Kafka topic and push the data into a SQL DB, and then publish a SignalR event with the db-key telling the front-end that a new message is available. This way, if the user clicks on the notification, I can show the message to the user.
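The consume-persist-notify flow described above can be sketched as follows. This is an illustrative Python sketch with in-memory stand-ins only: in the real stack the topic would be read by a Confluent Kafka consumer in a C# service, the dictionary would be an MSSQL staging table, and the notification list would be a SignalR hub broadcast. All names here are hypothetical.

```python
import json
import queue

topic = queue.Queue()     # stands in for the Kafka topic
staging_db = {}           # stands in for the MSSQL staging table
notifications = []        # stands in for SignalR "new image" events
next_key = 0

def consume_and_notify():
    """Drain the topic: persist each message, then push a notification
    carrying only the db key (not the heavy image payload)."""
    global next_key
    while not topic.empty():
        msg = json.loads(topic.get())
        next_key += 1
        staging_db[next_key] = msg                                    # 1. persist
        notifications.append({"event": "newImage", "key": next_key})  # 2. notify

# Example: a producer drops two messages; the service stages and notifies.
topic.put(json.dumps({"image": "<encoded jpg>", "source": "A"}))
topic.put(json.dumps({"image": "<encoded jpg>", "source": "B"}))
consume_and_notify()
```

The point of pushing only the key over SignalR is that the image payload stays in the database until a user actually opens the notification.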

Is this a decent enough starting point for a design or can you see huge flaws in my logic here?

My main worry about this design is that I will still persist every message in a database just to allow the users to process it. (I might be able to delete records when the user classifies them as "Good", rather than storing only the bad ones, but this is still a lot of overhead for no real benefit.) I only need to store the "Bad" images, so is there any way in which I can build this so that I don't need to "stage" the data in SQL?

Also, are there any patterns for how to show each image to a user, but to ensure that multiple users don't attempt to classify the same image? I'm mostly worried about showing an image to a user and then he logs off without classifying it. I can't afford to have abandoned images slipping through the cracks.
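A common pattern for the single-classifier and abandoned-image problem is a pessimistic claim with a lease timeout: the first user to open an image atomically claims it, and if they don't classify it before the lease expires (e.g. because they logged off), the image silently returns to the pool. A minimal sketch, assuming a 60-second lease; the in-memory dict stands in for a SQL row, where the claim would be a single atomic `UPDATE ... WHERE claimed_until IS NULL OR claimed_until < now`:

```python
import time

LEASE_SECONDS = 60  # assumed lease length; tune to expected handling time
images = {}         # key -> {"claimed_by", "claimed_until", "status"}

def try_claim(key, user, now=None):
    """Return True if `user` won the claim; only one live lease at a time."""
    now = now if now is not None else time.time()
    img = images[key]
    if img["status"] != "pending":
        return False                              # already classified
    if img["claimed_by"] and img["claimed_until"] > now:
        return False                              # someone else holds a live lease
    img["claimed_by"] = user                      # claim (or reclaim expired lease)
    img["claimed_until"] = now + LEASE_SECONDS
    return True

def classify(key, user, verdict):
    """Only the current lease holder may classify, and only once."""
    img = images[key]
    if img["claimed_by"] != user or img["status"] != "pending":
        return False
    img["status"] = verdict   # "good" -> can be discarded, "bad" -> keep
    return True

# Two users race for the same image; only one wins the claim.
images[1] = {"claimed_by": None, "claimed_until": 0.0, "status": "pending"}
assert try_claim(1, "alice")
assert not try_claim(1, "bob")   # blocked while alice's lease is live
```

If alice logs off without classifying, her lease simply expires and bob's next `try_claim` succeeds — no abandoned image slips through the cracks, at the cost of the lease delay.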

Some additional thoughts:

  1. Would it be a really bad idea to talk directly from Vue/JS to Kafka, or should I still go through some C# service layer? I do see some JS libraries for getting Kafka data, though I have never used any of them.

I guess if I followed this method it would be easy with only one user; multiple users might pose a challenge (especially those abandoned records).

And the MOST IMPORTANT question:

  1. Is there a totally different/better approach I could consider?

Best Answer

  1. Whether to consume Kafka from frontends.

No. It would be a bad idea to read Kafka directly from the frontends. It would work to a certain degree, but Kafka must re-balance consumers whenever the set of consumers changes, notifying all consumers in the process of which partitions they are allowed to consume. This is expensive.

Not to mention this strategy would limit the number of users that can work on the topic to the number of partitions the topic has.

  2. A different approach.

I would perhaps (without knowing your exact requirements here) cache images temporarily in the DB. That is, read only as many messages from Kafka as there are users to process them.

The server would not actually read Kafka messages until there are users to process them. This way, you don't have to dump all images into the DB at full speed, which doesn't really make sense and is not required anyway.

So the DB would basically have as many "slots" as there are users. Once an image is processed, remove it and read a new one from Kafka.

Of course, when an image is processed, the server has to check whether the DB "cache" contains images that users abandoned and that won't be processed anymore, and re-assign them.
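The slot idea can be sketched roughly as below. This is an illustrative Python sketch with hypothetical names (in the real stack the queue would be the Kafka consumer and the slot map a small MSSQL table): the service only pulls a message when a user slot is free, and re-assigns an abandoned image when a user logs off.

```python
import queue

topic = queue.Queue()             # stands in for the Kafka topic
for i in range(5):
    topic.put(f"image-{i}")

slots = {}                        # user -> image currently assigned (the DB "cache")

def assign_free_slots(online_users):
    """Pull one message per idle online user; nothing more is consumed."""
    for user in online_users:
        if user not in slots and not topic.empty():
            slots[user] = topic.get()

def handle_logoff(user, online_users):
    """Re-assign an abandoned image to another idle user, if any."""
    abandoned = slots.pop(user, None)
    if abandoned is not None:
        for other in online_users:
            if other not in slots:
                slots[other] = abandoned
                return
        # No idle user right now: in a real system the image would stay
        # staged until a slot frees up (simplified away here).

assign_free_slots(["alice", "bob"])        # only two messages leave the topic
handle_logoff("alice", ["bob", "carol"])   # alice's image moves to carol
```

Note the rest of the topic stays in Kafka — only as many images as there are users are ever staged in the DB at once.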

  3. High-Availability

In the above, assuming Kafka and the DB are already highly available, you can scale the service to multiple instances that share the same Kafka consumer group id. If some of them die, the frontends can re-connect to other instances, with Kafka guaranteeing that the partitions (and thus the messages) are redistributed among the remaining consumers.

This works assuming all instances will have an equal number of users. If not, you'll have to code something to make sure of it; otherwise a whole group of images might wait on an instance that has no users assigned.

However, if you're ok with a few seconds downtime on this service, you can just let it re-start and let frontends reconnect. This would be much cheaper and simpler.

HTH
