The basic rule is that you can scale up to the number of Kafka partitions. If you set spark.executor.cores greater than the number of partitions, some of the threads will be idle. If it's less than the number of partitions, Spark will have threads read from one partition, then another. So:
- 2 partitions, 1 core: reads from one partition, then the other (I am not sure how Spark decides how much to read from each before switching)
- 2 partitions, 2 cores: parallel execution
- 1 partition, 2 cores: one thread is idle
For case #1, note that having more partitions than cores is OK, since it allows you to scale out later without having to re-partition. The trick is to make sure that the number of partitions is evenly divisible by the number of cores. Spark has to process all the partitions before passing data on to the next step in the pipeline, so 'remainder' partitions can slow down processing. For example, with 5 partitions and 4 threads, processing takes the time of 2 partitions: 4 are processed at once, then one thread runs the 5th partition by itself.
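To see why, count the processing "waves": each wave runs up to one partition per core in parallel, so wall-clock time grows with ceil(partitions / cores), and any remainder adds a whole extra wave. A quick sketch of that arithmetic:

```scala
// Each "wave" processes up to `cores` partitions in parallel, so a
// batch takes roughly ceil(partitions / cores) waves of work.
def waves(partitions: Int, cores: Int): Int =
  (partitions + cores - 1) / cores

println(waves(4, 4)) // 1: all four partitions run at once
println(waves(5, 4)) // 2: four run at once, the fifth runs alone
println(waves(8, 4)) // 2: evenly divisible, no idle cores
```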
Also note that you may see better processing throughput if you keep the number of partitions/RDDs the same throughout the pipeline, by explicitly setting the number of data partitions in functions like reduceByKey().
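For example, a minimal batch-RDD sketch (the partition count and data are placeholders; the same idea applies to DStream transformations):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-count").setMaster("local[4]"))
    val numPartitions = 4 // e.g. match the Kafka topic's partition count

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numPartitions)

    // Passing the partition count explicitly keeps the shuffle output at the
    // same parallelism instead of falling back to a default.
    val counts = pairs.reduceByKey(_ + _, numPartitions)
    counts.collect().foreach(println)
    sc.stop()
  }
}
```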
This post already has answers, but I am adding my view with a few pictures from Kafka: The Definitive Guide.
Before answering the questions, it helps to picture the producer components: a ProducerRecord (topic, optional partition, optional key, value) is serialized, the partitioner picks a partition, and the record is added to a per-partition batch that is sent to the broker.
1. When a producer is producing a message - It will specify the topic it wants to send the message to, is that right? Does it care about partitions?
The producer decides the target partition for every message, depending on (see the sketch after this list):
- The partition id, if it is specified within the message
- hash(key) % number of partitions, if no partition id is given but the message has a key
- Round robin, if neither a partition id nor a key is present (only the value is available)
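A sketch of the three cases using the standard producer API (topic name and broker address are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)

// 1. Explicit partition id: goes to partition 0; the key is still stored
//    with the record but is not used to pick the partition.
producer.send(new ProducerRecord[String, String]("my-topic", 0, "key", "value"))

// 2. Key, no partition id: partition = hash(key) % number of partitions.
producer.send(new ProducerRecord[String, String]("my-topic", "key", "value"))

// 3. Neither: partitions are picked round-robin (sticky per batch in
//    newer client versions).
producer.send(new ProducerRecord[String, String]("my-topic", "value"))

producer.close()
```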
2. When a subscriber is running - Does it specify its group id so that it can be part of a cluster of consumers of the same topic or several topics that this group of consumers is interested in?
You should always configure group.id unless you are using the simple assignment API and you don't need to store offsets in Kafka. A consumer without a group id will not be part of any group.
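A minimal sketch of a consumer joining a group (all names are placeholders):

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker
props.put("group.id", "my-consumer-group") // consumers sharing this id split the topic's partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("my-topic")) // placeholder topic
```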
3. Does each consumer group have a corresponding partition on the broker or does each consumer have one?
In one consumer group, each partition will be processed by one consumer only. These are the possible scenarios:
- Number of consumers less than the number of topic partitions: some consumers in the group are assigned more than one partition
- Number of consumers equal to the number of topic partitions: a one-to-one mapping between partitions and consumers
- Number of consumers greater than the number of topic partitions: the excess consumers sit idle and receive nothing, which is not effective (in the book's figure, Consumer 5 is the idle one)
4. Since the partitions are created by the broker, are they therefore not a concern for the consumers?
The consumer should be aware of the number of partitions, as discussed in question 3.
5. Since this is a queue with an offset for each partition, is it the responsibility of the consumer to specify which messages it wants to read? Does it need to save its state?
Kafka (to be specific, the Group Coordinator) takes care of the offset state by producing a message to an internal __consumer_offsets topic. This behavior can be switched to manual by setting enable.auto.commit to false; in that case consumer.commitSync() and consumer.commitAsync() can be helpful for managing offsets.
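A sketch of manual offset management, building on the consumer sketch above (enable.auto.commit must be set in the Properties before the consumer is created):

```scala
import java.time.Duration

// Added to the Properties from the earlier sketch, before `new KafkaConsumer`:
props.put("enable.auto.commit", "false")

// Poll, process, then commit the offsets of the last poll explicitly.
val records = consumer.poll(Duration.ofMillis(100))
records.forEach { r =>
  println(s"partition ${r.partition()} offset ${r.offset()}: ${r.value()}")
}
consumer.commitSync()    // blocking commit of the last poll's offsets
// consumer.commitAsync() // non-blocking alternative
```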
More about Group Coordinator:
- It is one of the brokers in the cluster, elected on the Kafka server side.
- Consumers interact with the Group Coordinator for offset commits and offset fetch requests.
- Consumer sends periodic heartbeats to Group Coordinator.
6. What happens when a message is deleted from the queue? For example, if the retention was 3 hours and that time has passed, how is the offset handled on both sides?
If a consumer starts after the retention period, messages will be consumed as per the auto.offset.reset configuration, which can be latest or earliest. In effect it is latest (start processing new messages), because all the earlier messages have expired by that time; note that retention is a topic-level configuration.
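Continuing the same consumer config sketch, the property in question (set before the consumer is created):

```scala
// "latest" (the default) starts from new messages only;
// "earliest" starts from the oldest offset still retained.
props.put("auto.offset.reset", "earliest")
```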
Best Answer
I made the following observations, in case it's helpful for someone:
Creating multiple streams helps in two ways:
1. You don't need to apply the filter operation to process different topics differently.
2. You can read multiple streams in parallel (as opposed to one by one in the case of a single stream). To do so, there is an undocumented configuration parameter, spark.streaming.concurrentJobs. So, I decided to create multiple streams.
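A sketch of the multi-stream setup under those assumptions (topic names, batch interval, and the job count are placeholders; spark.streaming.concurrentJobs is undocumented, so treat it as experimental):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf()
  .setAppName("multi-stream")
  .set("spark.streaming.concurrentJobs", "2") // undocumented: lets the two streams' jobs run in parallel
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092", // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-streaming-group"
)

// One stream per topic, so each can be processed differently without filtering.
val streamA = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("topic-a"), kafkaParams))
val streamB = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("topic-b"), kafkaParams))

streamA.map(r => r.value()).print()        // topic-a gets one kind of processing
streamB.map(r => r.value().length).print() // topic-b gets another

ssc.start()
ssc.awaitTermination()
```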