I'm learning about Kafka and reading the introduction section here:
https://kafka.apache.org/documentation.html#introduction
specifically the portion about consumers. The second-to-last paragraph of the Introduction reads:
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is
able to provide both ordering guarantees and load balancing over a pool of consumer processes. This
is achieved by assigning the partitions in the topic to the consumers in the consumer group so that
each partition is consumed by exactly one consumer in the group. By doing this we ensure that the
consumer is the only reader of that partition and consumes the data in order. Since there are many
partitions this still balances the load over many consumer instances. Note however that there cannot
be more consumer instances than partitions.
My confusion stems from that last sentence, because in the image right above that paragraph where the author depicts two consumer groups and a 4-partition topic, there are more consumer instances than partitions!
It also doesn't make sense that there can't be more consumer instances than partitions, because then partitions would be incredibly small and it seems like the overhead in creating a new partition for each consumer instance would bog down Kafka. I understand that partitions are used for fault-tolerance and reducing the load on any one server, but the sentence above does not make sense in the context of a distributed system that's supposed to be able to handle thousands of consumers at a time.
Best Answer
Ok, to understand it, one needs to understand several parts.
Also, what you think is a performance penalty (multiple partitions) is actually a performance gain, as Kafka can work on different partitions completely in parallel instead of waiting for other partitions to finish.
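To make the parallelism point concrete, here is a minimal sketch in plain Python. It does not use a Kafka client; the `partitions` dictionary and the `consume_partition` and `consume_all` helpers are made-up stand-ins for a 4-partition topic and its readers. Each partition gets exactly one reader, so order inside a partition is preserved, while the partitions themselves make progress independently of each other.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a 4-partition topic:
# partition id -> messages in the order they were appended.
partitions = {
    0: ["a0", "a1", "a2"],
    1: ["b0", "b1"],
    2: ["c0", "c1", "c2", "c3"],
    3: ["d0"],
}

def consume_partition(partition_id, messages):
    """Read one partition front to back, as its single reader would."""
    seen = []
    for offset, message in enumerate(messages):
        seen.append((offset, message))
    return partition_id, seen

def consume_all(partitions):
    # One worker per partition: no partition waits for another one,
    # but each partition still has exactly one reader.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        futures = [
            pool.submit(consume_partition, pid, msgs)
            for pid, msgs in partitions.items()
        ]
        return dict(future.result() for future in futures)

result = consume_all(partitions)
for pid in sorted(result):
    print(pid, result[pid])
```

Within each partition the offsets come out in order, even though nothing coordinates the four readers with each other. That is the only ordering this sketch (and the quoted paragraph) relies on.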
In the beginning the two scenarios are described:
So, the more subscriber groups you have, the lower the performance is, as Kafka needs to deliver the messages to all those groups and guarantee the order within each partition for every one of them.
On the other hand, the fewer groups and the more partitions you have, the more you gain from parallelizing the message processing.
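As for the sentence that caused the confusion: the limit applies inside a single consumer group, not across all groups together. The picture in the documentation shows two groups reading the same 4-partition topic, and neither group has more than four consumers. Below is a rough sketch of that situation in plain Python. The `assign_partitions` helper is a simplified round-robin made up for illustration, not Kafka's actual assignor, and `group-C` is a hypothetical third group added to show what happens with one consumer too many.

```python
def assign_partitions(partitions, consumers):
    """Round-robin sketch: each partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

topic_partitions = ["P0", "P1", "P2", "P3"]

# Every group sees the whole topic; assignment is done per group.
groups = {
    "group-A": ["C1", "C2"],
    "group-B": ["C3", "C4", "C5", "C6"],
    "group-C": ["C7", "C8", "C9", "C10", "C11"],  # hypothetical: 5 consumers, 4 partitions
}

plan = {
    name: assign_partitions(topic_partitions, members)
    for name, members in groups.items()
}

for name, assignment in plan.items():
    idle = [consumer for consumer, owned in assignment.items() if not owned]
    print(name, assignment, "idle:", idle)
```

In this sketch the two consumers of `group-A` each own two partitions, the four consumers of `group-B` own one each, and the fifth consumer of `group-C` ends up with nothing to read. So you can have many consumers in total across many groups; you just don't gain anything from putting more consumers into one group than the topic has partitions.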