
Kafka's Duplicate Message Problem

Your consumers are humming along at 10,000 messages per second. One crashes. Suddenly you've got duplicate messages, stalled partitions, and angry Slack pings from downstream.

That's consumer group rebalancing, and I think it's the most common Kafka production issue that nobody warns you about until you're already dealing with it.

Consumer Group Rebalancing Visualizer - crash consumers and compare eager vs cooperative strategies in real time

Try the interactive demo - crash consumers, compare eager vs cooperative rebalancing, and watch duplicates happen in real time.

What Is Rebalancing?

Consumers in the same consumer group split up the partitions of a topic between them. When something changes - a consumer crashes, one joins, partitions change - Kafka needs to redistribute. The problem is how: by default, every consumer stops while partitions get shuffled around. Full stop-the-world. Pretty aggressive default if you ask me.

What Triggers a Rebalance?

1. A Consumer Crashes or Gets Evicted

If a consumer doesn't send heartbeats within session.timeout.ms (default: 45s), the coordinator assumes it's dead. But heartbeats run on a background thread, so the subtler trap is max.poll.interval.ms (default: 5 minutes): a slow DB query blocks the processing loop, poll() isn't called in time, and Kafka kicks the consumer out - even though it was actively doing work. Nothing actually broke, it was just slow.

// This consumer gets evicted if it stays away from poll()
// for longer than max.poll.interval.ms (default: 5 minutes)
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record); // Blocks for 60 seconds on a slow DB query -
                               // five of these back to back already blow the limit
    }
}
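One hedged way out, assuming the bottleneck is per-record processing time: shrink the batch so the loop returns to poll() sooner, and give it explicit headroom. The values below are illustrative, not recommendations:

```java
// Illustrative values - tune to your worst-case batch processing time
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);           // default 500; smaller batches return to poll() sooner
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000);  // allow 10 minutes between polls for slow batches
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45_000);     // heartbeat-based liveness (background thread)
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 15_000);  // roughly 1/3 of the session timeout
```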

2. A Consumer Joins the Group

Every join triggers a rebalance. Rolling deployment of 10 consumers? That's 10 rebalances back to back. If you're on eager assignment, that's 10 stop-the-world pauses.
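If restarts are your main trigger, static membership (Kafka 2.3+) is worth knowing about: a consumer that comes back with the same group.instance.id within session.timeout.ms reclaims its old partitions without a rebalance. A sketch; the id must be unique and stable per instance (the value here is made up):

```java
// Static membership: a restarted consumer with the same id gets its old
// partitions back without triggering a rebalance - as long as it returns
// within session.timeout.ms
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "orders-consumer-3");
```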

3. Partition Count Changes

Adding partitions triggers a full rebalance across the whole group.

Eager vs Cooperative Rebalancing


Eager Rebalancing (The Default)

With RangeAssignor or RoundRobinAssignor, all consumers give up all partitions and stop. One consumer out of twenty crashes? All twenty stop. That's what bothers me about this - it turns a single-consumer problem into a cluster-wide event.

// Eager rebalancing - stop the world
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    RangeAssignor.class.getName());

Cooperative Rebalancing

With CooperativeStickyAssignor, only the affected partitions move. Everyone else keeps going. This is almost certainly what you want.

// Cooperative rebalancing - only affected partitions move
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    CooperativeStickyAssignor.class.getName());

Trade-offs of Cooperative Rebalancing

It's not free, but the trade-offs are worth it in almost every case:

  • Multiple rounds. Takes at least two rounds instead of eager's one. Total time can actually be longer.
  • Brief partition gaps. Revoked partitions sit unassigned between rounds.
  • All-or-nothing migration. You can't mix eager and cooperative in the same group.
  • Quieter failures. A stuck rebalance on a subset of partitions can go unnoticed longer.

The Duplicate Processing Trap

Here's where it gets really ugly. When all consumers give up all partitions at once, every single one has a window of uncommitted work that gets replayed. Twenty consumers each holding uncommitted messages, one crashes, and now you've got 20 partitions worth of duplicates. Not just the dead consumer's - everyone's.

How Duplicates Happen During Rebalancing

With auto-commit (also the default), offsets get committed every 5 seconds. Anything processed between the last commit and the rebalance? Re-processed by the new owner. Eager rebalancing makes this as bad as possible because it forces every partition to change hands. Cooperative limits the damage to just the affected ones.
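Rough numbers, reusing the throughput from the opening and assuming it spreads evenly across the 20 partitions from the example above - a back-of-the-envelope sketch:

```java
// Back-of-the-envelope: worst-case replayed messages after an eager rebalance
long messagesPerSecond = 10_000;    // group-wide throughput from the intro
int partitions = 20;                // one per consumer in the example
long autoCommitIntervalMs = 5_000;  // auto.commit.interval.ms default

// Up to one full commit interval of work is uncommitted per partition
long perPartitionRate = messagesPerSecond / partitions;                      // 500 msg/s
long replayedPerPartition = perPartitionRate * autoCommitIntervalMs / 1_000; // 2,500
long replayedTotal = replayedPerPartition * partitions;                      // 50,000
```

Fifty thousand duplicates from one crash, and that's before anyone notices anything is wrong.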

So you've got eager rebalancing plus auto-commit. Both are defaults. Both should be changed for anything you care about. Honestly, I think this combination is one of the bigger footguns in Kafka's default config.
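The auto-commit half of the fix is a one-liner; what replaces it comes next:

```java
// Turn off the 5-second auto-commit timer - offsets are now your job
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
```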

The Fix: Manual Offset Management

// Requires enable.auto.commit=false
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));

    for (ConsumerRecord<String, String> record : records) {
        processRecord(record);

        // Commit the *next* offset to read, hence offset + 1
        consumer.commitSync(Collections.singletonMap(
            new TopicPartition(record.topic(), record.partition()),
            new OffsetAndMetadata(record.offset() + 1)
        ));
    }
}

If you need more throughput, batch your commits:

int count = 0;
Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();

for (ConsumerRecord<String, String> record : records) {
    processRecord(record);

    offsets.put(
        new TopicPartition(record.topic(), record.partition()),
        new OffsetAndMetadata(record.offset() + 1)
    );

    if (++count % 100 == 0) {
        // Pass a copy - clearing the live map while the async commit
        // is still in flight risks committing an empty snapshot
        consumer.commitAsync(new HashMap<>(offsets), null);
        offsets.clear();
    }
}
// Synchronous flush of the remainder before the next poll()
consumer.commitSync(offsets);
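Batched commits still leave a window if a rebalance hits mid-batch. Kafka's ConsumerRebalanceListener closes most of it: commit in onPartitionsRevoked, before the partitions change hands. A sketch, assuming `offsets` is the same map the batching loop above fills in and "orders" stands in for your topic:

```java
// Commit tracked offsets before partitions are handed to another consumer
consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> revoked) {
        // Synchronous on purpose: the commit must land before revocation completes
        if (!offsets.isEmpty()) {
            consumer.commitSync(offsets);
            offsets.clear();
        }
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> assigned) {
        // Nothing to do - the new owner starts from the committed offsets
    }
});
```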

Configuration Trade-offs

Configuration               | Throughput | Duplicate Risk | Best For
Auto-commit + eager         | High       | High           | Idempotent consumers, analytics
Manual sync per message     | Low        | Low            | Financial transactions, payments
Manual async + cooperative  | High       | Medium         | Event logging with deduplication
Batched commits + sticky    | Very High  | Medium         | High-volume analytics pipelines

Monitoring

Three metrics that actually tell you something:

  • rebalance-latency-avg - If this is going up, your consumers are spending more time rebalancing than processing. Not great.
  • assigned-partitions - If it's bouncing between 0 and N, you've got a consumer that keeps crashing and rejoining.
  • commit-latency-avg - Slow commits can trigger evictions, which cause more rebalances. It's a fun feedback loop.

The Checklist

  1. Switch to CooperativeStickyAssignor. Just do this. Should be the default for any new consumer group.
  2. Turn off auto-commit. Set enable.auto.commit=false and handle offsets yourself.
  3. Tune your timeouts. Find the balance between catching real failures and tolerating slow processing.
  4. Make consumers idempotent. Deduplication keys, upserts instead of inserts.
  5. Watch your rebalance frequency. More than a few per hour and something's wrong.
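Item 4 can be as simple as tracking a business key before doing the work. A stdlib-only sketch (DedupingProcessor is a hypothetical helper, not a Kafka class; in production the seen-set would live somewhere that survives restarts, e.g. the same database the upsert writes to):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: drops records whose business key was already processed
class DedupingProcessor {
    private final Set<String> seenKeys = new HashSet<>();

    /** Returns true if the record was processed, false if it was a duplicate. */
    boolean processIfNew(String businessKey) {
        if (!seenKeys.add(businessKey)) {
            return false; // replayed after a rebalance - safe to drop
        }
        // ... real work goes here (ideally an upsert keyed by businessKey)
        return true;
    }
}
```

A replayed batch then becomes a no-op: processIfNew("order-42") does the work once and silently drops every replay.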

Key Takeaways

  • Rebalancing will happen. Design for it instead of hoping it won't.
  • Switch to cooperative rebalancing. There's almost no reason to keep eager. It turns one failure into a cluster-wide pause.
  • Manage offsets yourself when you care about correctness more than raw throughput.
  • Make your consumers idempotent. Even the best offset strategy has edge cases. If duplicates are safe, rebalancing goes from incident to inconvenience.

Dig Deeper