Message Brokers – Traditional Message Brokers vs. Streaming Data

apache-kafka, message-queue, stream-processing

According to the Kafka site:

"Kakfa is used for building real-time data pipelines and streaming apps."

Searching the internet far and wide, I've found the following generally accepted definition of "stream data":

  • Stream data is data that flows continuously from a source to a destination over a network; and
  • Stream data is not atomic in nature, meaning any part of a flowing stream of data is meaningful and processable, as opposed to a file whose bytes don't mean anything unless you have all of them; and
  • Stream data can be started/stopped at any time; and
  • Consumers can attach and detach from a stream of data at will, and process just the parts of it that they want

Now then, if anything I've said above is incorrect, incomplete or totally wrong, please begin by correcting me! Assuming I'm more or less on track, then…

Now that I understand what "streaming data" is, I understand what Kafka and Kinesis mean when they bill themselves as processing/brokering middleware for applications with streaming data. But it has piqued my interest: can/should "stream middleware" like Kafka or Kinesis be used for non-streaming data, like traditional message brokers? And vice versa: can/should traditional MQs like RabbitMQ, ActiveMQ, Apollo, etc. be used for streaming data?

Let's take an example where an application sends its backend a constant barrage of JSON messages that need to be processed, and the processing is fairly complex (validation, transforms on the data, filtering, aggregations, etc.; see the sketch after this list):

  • Case #1: The messages are each frames of a movie; that is, one JSON message per video frame containing the frame data and some supporting metadata
  • Case #2: The messages are time-series data, perhaps someone's heartbeat as a function of time. So Message #1 is sent representing my heartbeat at t=1, Message #2 contains my heartbeat at t=2, etc.
  • Case #3: The data is completely disparate and unrelated, either by time or as part of any "data stream". Perhaps audit/security events that get fired as hundreds of users navigate the application, clicking buttons and taking actions
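For concreteness, here's a minimal sketch (in Python) of the kind of processing I mean. The validate/transform rules and helper names are hypothetical, not from any particular framework:

    import json

    def validate(msg):
        # Hypothetical schema check: require a payload and a timestamp.
        return "payload" in msg and "ts" in msg

    def transform(msg):
        # Hypothetical normalization step.
        msg["payload"] = msg["payload"].strip().lower()
        return msg

    def process(raw_messages):
        # Validate -> transform -> filter -> aggregate, as described above.
        parsed = (json.loads(m) for m in raw_messages)
        valid = (m for m in parsed if validate(m))
        shaped = (transform(m) for m in valid)
        kept = [m for m in shaped if m["payload"] != "noise"]
        return {"count": len(kept)}  # trivial aggregation

    print(process(['{"payload": " PING ", "ts": 1}', '{"bad": true}']))
    # -> {'count': 1}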

Based on how Kafka/Kinesis are billed and on my understanding of what "streaming data" is, they seem to be obvious candidates for Cases #1 (contiguous video data) and #2 (contiguous time-series data). However, I don't see any reason why a traditional message broker like RabbitMQ couldn't handle both of these inputs efficiently as well.

And with Case #3, we're only provided with an event that has occurred and we need to process a reaction to that event. So to me this speaks to needing a traditional broker like RabbitMQ. But there's also no reason why you couldn't have Kafka or Kinesis handle the processing of event data either.

So basically, I'm looking to establish a rubric that says: I have X data with Y characteristics. I should use a stream processor like Kafka/Kinesis to handle it. Or, conversely, one that helps me determine: I have W data with Z characteristics. I should use a traditional message broker to handle it.

So I ask: What factors about the data (or otherwise) help steer the decision between stream processor or message broker, since both can handle streaming data, and both can handle (non-streaming) message data?

Best Answer

Kafka deals in ordered logs of atomic messages. You can view it sort of like the pub/sub mode of message brokers, but with strict ordering and the ability to replay or seek around the stream of messages at any point in the past that's still being retained on disk (which could be forever).
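To make the replay/seek point concrete, here is a minimal sketch using the kafka-python client, assuming a local broker and a hypothetical single-partition topic called "events":

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
    tp = TopicPartition("events", 0)  # hypothetical topic, partition 0
    consumer.assign([tp])

    # Replay from the very beginning of whatever the broker still retains...
    consumer.seek_to_beginning(tp)

    # ...or jump to the first message at/after a wall-clock timestamp (ms).
    offsets = consumer.offsets_for_times({tp: 1_700_000_000_000})
    if offsets[tp] is not None:
        consumer.seek(tp, offsets[tp].offset)

    for msg in consumer:
        print(msg.offset, msg.value)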

Kafka's flavor of streaming stands opposed to remote procedure calls like Thrift or HTTP, and to batch processing as in the Hadoop ecosystem. Unlike RPC, components communicate asynchronously: hours or days may pass between when a message is sent and when the recipient wakes up and acts on it. There could be many recipients at different points in time, or maybe no one will ever bother to consume a message. Multiple producers can produce to the same topic without any knowledge of the consumers. Kafka does not know whether you are subscribed or whether a message has been consumed; a message is simply committed to the log, where any interested party can read it.
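As a sketch of that decoupling (again kafka-python, with hypothetical names): two consumers in different groups each see every message on the topic at their own pace, and the broker tracks nothing beyond each group's committed offset:

    from kafka import KafkaConsumer

    # Each group_id gets an independent cursor into the same log.
    archiver = KafkaConsumer("events", group_id="audit-archive",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")  # read from the start
    alerter = KafkaConsumer("events", group_id="realtime-alerts",
                            bootstrap_servers="localhost:9092",
                            auto_offset_reset="latest")     # only new messages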

Unlike batch processing, you're interested in single messages, not just giant collections of messages. (Though it's not uncommon to archive Kafka messages into Parquet files on HDFS and query them as Hive tables).
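A rough sketch of that archive pattern, assuming kafka-python and pyarrow and hypothetical names (in practice the Parquet file would land on HDFS rather than a local path):

    import json
    import pyarrow as pa
    import pyarrow.parquet as pq
    from kafka import KafkaConsumer

    consumer = KafkaConsumer("events",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)  # stop when drained

    records = [json.loads(m.value) for m in consumer]
    if records:
        pq.write_table(pa.Table.from_pylist(records), "events.parquet")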

Case 1: Kafka does not preserve any particular temporal relationship between producer and consumer. It's a poor fit for streaming video because Kafka is allowed to slow down, speed up, and move in fits and starts. For streaming media, we want to trade away overall throughput in exchange for low and, more importantly, stable latency (otherwise known as low jitter). Kafka also takes great pains never to lose a message, whereas with streaming video we typically use UDP and are content to drop a frame here and there to keep the video running. The SLA on a Kafka-backed process is typically seconds to minutes when healthy, hours to days when unhealthy. The SLA on streaming media is in the tens of milliseconds.

Netflix could use Kafka to move frames around in an internal system that transcodes terabytes of video per hour and saves it to disk, but not to ship them to your screen.
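For contrast, fire-and-forget frame delivery over UDP looks like this sketch (standard library only; the address and port are made up): if a datagram is lost, nothing retries it, which is exactly the trade streaming media makes and Kafka refuses to make.

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    frame = b"\x00" * 1024  # stand-in for encoded frame bytes
    # Lost datagrams are simply gone; the player drops the frame and moves on.
    sock.sendto(frame, ("203.0.113.10", 5004))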

Case 2: Absolutely. We use Kafka this way at my employer.
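A minimal sketch of producing the heartbeat series from Case #2 (kafka-python; the topic and patient ID are hypothetical): keying every sample by the same ID routes them all to one partition, so Kafka's per-partition ordering preserves the time series.

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Same key -> same partition -> samples stay in time order.
    for t in range(1, 6):
        producer.send("heartbeats", key=b"patient-42",
                      value={"patient": "patient-42", "t": t, "bpm": 60 + t})
    producer.flush()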

Case 3: You can use Kafka for this kind of thing, and we do, but you are paying some unnecessary overhead to preserve ordering. Since you don't care about order, you could probably squeeze some more performance out of another system. If your company already maintains a Kafka cluster, though, it's probably best to reuse it rather than take on the maintenance burden of another messaging system.
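For Case #3 on a traditional broker, a classic competing-consumers queue fits: order across workers is not preserved, but each event is delivered to exactly one ready worker. A sketch with the pika RabbitMQ client (the queue name is hypothetical):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="audit-events", durable=True)

    def handle(ch, method, properties, body):
        print("processing", body)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_qos(prefetch_count=10)  # fan work out across workers
    channel.basic_consume(queue="audit-events", on_message_callback=handle)
    channel.start_consuming()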
