In Apache Kafka why can’t there be more consumer instances than partitions

apache-kafkadistributed

I'm learning about Kafka, reading the introduction section here

https://kafka.apache.org/documentation.html#introduction

specifically the portion about Consumers. In the second to last paragraph in the Introduction it reads

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is
able to provide both ordering guarantees and load balancing over a pool of consumer processes. This
is achieved by assigning the partitions in the topic to the consumers in the consumer group so that
each partition is consumed by exactly one consumer in the group. By doing this we ensure that the
consumer is the only reader of that partition and consumes the data in order. Since there are many
partitions this still balances the load over many consumer instances. Note however that there cannot
be more consumer instances than partitions.

My confusion stems from that last sentence, because in the image right above that paragraph where the author depicts two consumer groups and a 4-partition topic, there are more consumer instances than partitions!

It also doesn't make sense that there can't be more consumer instances than partitions, because then partitions would be incredibly small and it seems like the overhead in creating a new partition for each consumer instance would bog down Kafka. I understand that partitions are used for fault-tolerance and reducing the load on any one server, but the sentence above does not make sense in the context of a distributed system that's supposed to be able to handle thousands of consumers at a time.

Best Answer

Ok, to understand it, one needs to understand several parts.

In order to provide ordering total order, the message can be sent only to one consumer. Otherwise it would be extremely inefficient, because it would need to wait for all consumers to recieve the message before sending the next one:

However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.

Kafka only provides a total order over messages within a partition, not between different partitions in a topic.

Also what you think is a performance penalty (multiple partitions) is actually a performance gain, as Kafka can perform actions of different partitions completely in parallel, while waiting for other partitions to finish.

The picture show different consumer groups, but the limitation of maximum one consumer per partition is only within a group. You still can have multiple consumer groups.

In the beginning the two scenarios are described:

If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.

If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.

So, the more subscriber groups you have, the lower the performance is, as kafka needs to replicate the messages to all those groups and guarantee the total order.

On the other hand, the less group, and more partitions you have the more you gain from parallizing the message processing.

Related Solutions

Data Modeling with Kafka? Topics and Partitions

When structuring your data for Kafka it really depends on how it´s meant to be consumed.

In my mind, a topic is a grouping of messages of a similar type that will be consumed by the same type of consumer so in the example above, I would just have a single topic and if you´ll decide to push some other kind of data through Kafka, you can add a new topic for that later.

Topics are registered in ZooKeeper which means that you might run into issues if trying to add too many of them, e.g. the case where you have a million users and have decided to create a topic per user.

Partitions on the other hand is a way to parallelize the consumption of the messages. The total number of partitions in a broker cluster need to be at least the same as the number of consumers in a consumer group to make sense of the partitioning feature. Consumers in a consumer group will split the burden of processing the topic between themselves according to the partitioning so that one consumer will only be concerned with messages in the partition itself is "assigned to".

Partitioning can either be explicitly set using a partition key on the producer side or if not provided, a random partition will be selected for every message.

C++ – the cin analougus of scanf formatted input

You can make a simple class to verify input.

struct confirm_input {
    char const *str;

    confirm_input( char const *in ) : str( in ) {}

    friend std::istream &operator >>
        ( std::istream &s, confirm_input const &o ) {

        for ( char const *p = o.str; * p; ++ p ) {
            if ( std::isspace( * p ) ) {
                std::istream::sentry k( s ); // discard whitespace
            } else if ( (c = s.get() ) != * p ) {
                s.setstate( std::ios::failbit ); // stop extracting
            }
        }
        return s;
     }
};

usage:

std::cin >> x >> confirm_input( " = " ) >> y;

Best Answer

Related Solutions

Data Modeling with Kafka? Topics and Partitions

C++ – the cin analougus of scanf formatted input

Related Topic