Can a Kafka consumer (0.8.2.2) read messages in batch?

Based on your question clarification.

A Kafka Consumer can read multiple messages at a time. But a Kafka Consumer doesn’t really read messages; it’s more correct to say a Consumer reads a certain number of bytes, and the size of the individual messages then determines how many messages will be read. Reading through the Kafka Consumer Configs, you’re not allowed to specify how many messages to fetch; instead, you specify a max/min data size that a consumer can fetch, and however many messages fit inside that range is how many you will get (see the configuration sketch after the list below). You will always get messages sequentially, as you have pointed out.

Related Consumer Configs (for 0.9.0.0 and greater)

  • fetch.min.bytes
  • max.partition.fetch.bytes
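As a rough illustration of setting these, here is a minimal sketch using the 0.9.x-style Java consumer. The broker address, group id, and deserializers below are placeholder assumptions, not values from the question.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-group");                // placeholder group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        // The broker waits until at least this many bytes of messages are
        // available before answering a fetch request.
        props.put("fetch.min.bytes", "1024");

        // Upper bound on the data returned per partition per fetch; a fetch
        // returns however many whole messages fit under this limit.
        props.put("max.partition.fetch.bytes", "1048576");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.close();
    }
}
```

Note that neither setting is a message count; the two byte-size bounds are what control how many messages arrive per fetch.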

UPDATE

Using your example from the comments, “my understanding is if i specify in config to read 10 bytes and if each message is 2 bytes the consumer reads 5 messages at a time.” That is true. Your next statement, “that means the offsets of these 5 messages were random with in partition”, is false. Reading sequentially doesn’t mean one by one; it just means that the messages remain ordered. You are able to batch items and have them remain sequential/ordered. Take the following examples.

Suppose a Kafka log contains 10 messages (each 2 bytes) with the following offsets: [0,1,2,3,4,5,6,7,8,9].

If you read 10 bytes, you’ll get a batch containing the messages at offsets [0,1,2,3,4].

If you read 6 bytes, you’ll get a batch containing the messages at offsets [0,1,2].

If you read 6 bytes, then another 6 bytes, you’ll get two batches containing the messages at offsets [0,1,2] and [3,4,5].

If you read 8 bytes, then 4 bytes, you’ll get two batches containing the messages at offsets [0,1,2,3] and [4,5].
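To make the arithmetic concrete, here is a toy Java sketch (a plain simulation, not Kafka API code; the method name and parameters are made up for illustration) that computes which offsets fit in one fetch of a given byte budget when every message has the same size:

```java
import java.util.ArrayList;
import java.util.List;

public class FetchArithmetic {
    /** Offsets returned by one fetch of maxBytes, starting at startOffset,
     *  assuming every message is messageSize bytes and the log ends at
     *  logEndOffset (exclusive). */
    static List<Long> fetch(long startOffset, int maxBytes, int messageSize,
                            long logEndOffset) {
        List<Long> batch = new ArrayList<>();
        long offset = startOffset;
        int bytesUsed = 0;
        while (offset < logEndOffset && bytesUsed + messageSize <= maxBytes) {
            batch.add(offset++);
            bytesUsed += messageSize;
        }
        return batch;
    }

    public static void main(String[] args) {
        // Ten 2-byte messages at offsets 0..9, as in the examples above.
        System.out.println(fetch(0, 10, 2, 10)); // [0, 1, 2, 3, 4]
        System.out.println(fetch(0, 6, 2, 10));  // [0, 1, 2]
        System.out.println(fetch(3, 6, 2, 10));  // [3, 4, 5]  (second 6-byte read)
        System.out.println(fetch(4, 4, 2, 10));  // [4, 5]     (4 bytes after 8)
    }
}
```

Each batch picks up exactly where the previous one left off, which is all that “sequential” means here.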

Update: Clarifying Committing

I’m not 100% sure how committing works; I’ve mainly worked with Kafka from a Storm environment, where the provided KafkaSpout automatically commits Kafka messages.

But looking through the 0.9.0.1 Consumer APIs (which I would recommend you do too), there seem to be three methods in particular that are relevant to this discussion:

  • poll(long timeout)
  • commitSync()
  • commitSync(java.util.Map<TopicPartition, OffsetAndMetadata> offsets)

The poll method retrieves messages; it could return only 1, it could return 20. For your example, let’s say 3 messages were returned, at offsets [0,1,2]. You now have those three messages, and it’s up to you to determine how to process them. You could process them 0 => 1 => 2, 1 => 0 => 2, or 2 => 0 => 1; it just depends. However you process them, afterwards you’ll want to commit, which tells the Kafka server you’re done with those messages.

Calling commitSync() commits everything returned by the last poll; in this case it would commit offsets [0,1,2].
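As a rough sketch of that flow with the 0.9.x consumer API (the broker address, group id, and topic name are placeholders, and auto-commit is disabled so the commit below is explicit):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollAndCommitAll {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-group");                // placeholder
        props.put("enable.auto.commit", "false");         // commit manually
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder
            while (true) {
                // poll may return 0..n records, in offset order per partition.
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n",
                                      record.offset(), record.value());
                }
                // Commit the highest offsets returned by the last poll, all at once.
                consumer.commitSync();
            }
        }
    }
}
```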

On the other hand, if you choose to use commitSync(java.util.Map offsets), you can manually specify which offsets to commit. If you’re processing them in order, you can process offset 0 then commit it, process offset 1 then commit it, and finally process offset 2 and commit it.
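A minimal sketch of that per-offset pattern using the map-based overload (the surrounding class and method name are just scaffolding for illustration). One detail worth flagging: the committed offset is conventionally the offset of the next message to read, hence the + 1 below.

```java
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CommitEachOffset {
    /** Process each record from one poll and commit its offset immediately. */
    static void pollAndCommitEach(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset=%d value=%s%n",
                              record.offset(), record.value());
            TopicPartition tp =
                new TopicPartition(record.topic(), record.partition());
            // Committing offset + 1 marks this message as done and the
            // next one as the resume point after a restart.
            consumer.commitSync(Collections.singletonMap(
                tp, new OffsetAndMetadata(record.offset() + 1)));
        }
    }
}
```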

All in all, Kafka gives you the freedom to process messages however you desire: you can process them sequentially or in an entirely arbitrary order of your choosing.
