
Kafka's Duplicate Message Problem

Your consumers are humming along at 10,000 messages per second. One crashes. Suddenly you've got duplicate messages, stalled partitions, and angry Slack pings from downstream.

That's consumer group rebalancing, and I think it's the most common Kafka production issue that nobody warns you about until you're already dealing with it.

Consumer Group Rebalancing Visualizer - crash consumers and compare eager vs cooperative strategies in real time

Try the interactive demo - crash consumers, compare eager vs cooperative rebalancing, and watch duplicates happen in real time.

What Is Rebalancing?

Consumers in the same consumer group split up the partitions of a topic between them. When something changes - a consumer crashes, one joins, partitions change - Kafka needs to redistribute. The problem is how: by default, every consumer stops while partitions get shuffled around. Full stop-the-world. Pretty aggressive default if you ask me.

What Triggers a Rebalance?

1. A Consumer Crashes or Gets Evicted

If a consumer doesn't send heartbeats within session.timeout.ms (default: 45s), the coordinator assumes it's dead. But heartbeats run on a background thread, so the subtler trap is max.poll.interval.ms (default: 5 minutes): a slow DB query blocks the processing loop, poll() isn't called in time, and Kafka kicks the consumer out - even though it was actively doing work. Nothing actually broke, it was just slow.

// This consumer gets evicted if it stays away from poll()
// for longer than max.poll.interval.ms (default: 5 minutes)
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record); // Blocks for 60 seconds on a slow DB query -
                               // five of these back to back already blow the limit
    }
}
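One hedged way out, assuming the bottleneck is per-record processing time: shrink the batch so the loop returns to poll() sooner, and give it explicit headroom. The values below are illustrative, not recommendations:

```java
// Illustrative values - tune to your worst-case batch processing time
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);           // default 500; smaller batches return to poll() sooner
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000);  // allow 10 minutes between polls for slow batches
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45_000);     // heartbeat-based liveness (background thread)
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 15_000);  // roughly 1/3 of the session timeout
```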

2. A Consumer Joins the Group

Every join triggers a rebalance. Rolling deployment of 10 consumers? That's 10 rebalances back to back. If you're on eager assignment, that's 10 stop-the-world pauses.
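If restarts are your main trigger, static membership (Kafka 2.3+) is worth knowing about: a consumer that comes back with the same group.instance.id within session.timeout.ms reclaims its old partitions without a rebalance. A sketch; the id must be unique and stable per instance (the value here is made up):

```java
// Static membership: a restarted consumer with the same id gets its old
// partitions back without triggering a rebalance - as long as it returns
// within session.timeout.ms
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "orders-consumer-3");
```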

3. Partition Count Changes

Adding partitions triggers a full rebalance across the whole group.

Eager vs Cooperative Rebalancing


Eager Rebalancing (The Default)

With RangeAssignor or RoundRobinAssignor, all consumers give up all partitions and stop. One consumer out of twenty crashes? All twenty stop. That's what bothers me about this - it turns a single-consumer problem into a cluster-wide event.

// Eager rebalancing - stop the world
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    RangeAssignor.class.getName());

Cooperative Rebalancing

With CooperativeStickyAssignor, only the affected partitions move. Everyone else keeps going. This is almost certainly what you want.

// Cooperative rebalancing - only affected partitions move
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    CooperativeStickyAssignor.class.getName());

Trade-offs of Cooperative Rebalancing

It's not free, but the trade-offs are worth it in almost every case:

  • Multiple rounds. Takes at least two rounds instead of eager's one. Total time can actually be longer.
  • Brief partition gaps. Revoked partitions sit unassigned between rounds.
  • All-or-nothing migration. You can't mix eager and cooperative in the same group.
  • Quieter failures. A stuck rebalance on a subset of partitions can go unnoticed longer.

The Duplicate Processing Trap

Here's where it gets really ugly. When all consumers give up all partitions at once, every single one has a window of uncommitted work that gets replayed. Twenty consumers each holding uncommitted messages, one crashes, and now you've got 20 partitions worth of duplicates. Not just the dead consumer's - everyone's.

How Duplicates Happen During Rebalancing

With auto-commit (also the default), offsets get committed every 5 seconds. Anything processed between the last commit and the rebalance? Re-processed by the new owner. Eager rebalancing makes this as bad as possible because it forces every partition to change hands. Cooperative limits the damage to just the affected ones.
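Rough numbers, reusing the throughput from the opening and assuming it spreads evenly across the 20 partitions from the example above - a back-of-the-envelope sketch:

```java
// Back-of-the-envelope: worst-case replayed messages after an eager rebalance
long messagesPerSecond = 10_000;    // group-wide throughput from the intro
int partitions = 20;                // one per consumer in the example
long autoCommitIntervalMs = 5_000;  // auto.commit.interval.ms default

// Up to one full commit interval of work is uncommitted per partition
long perPartitionRate = messagesPerSecond / partitions;                      // 500 msg/s
long replayedPerPartition = perPartitionRate * autoCommitIntervalMs / 1_000; // 2,500
long replayedTotal = replayedPerPartition * partitions;                      // 50,000
```

Fifty thousand duplicates from one crash, and that's before anyone notices anything is wrong.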

So you've got eager rebalancing plus auto-commit. Both are defaults. Both should be changed for anything you care about. Honestly, I think this combination is one of the bigger footguns in Kafka's default config.
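The auto-commit half of the fix is a one-liner; what replaces it comes next:

```java
// Turn off the 5-second auto-commit timer - offsets are now your job
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
```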

The Fix: Manual Offset Management

// Requires enable.auto.commit=false
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));

    for (ConsumerRecord<String, String> record : records) {
        processRecord(record);

        // Commit the *next* offset to read, hence offset + 1
        consumer.commitSync(Collections.singletonMap(
            new TopicPartition(record.topic(), record.partition()),
            new OffsetAndMetadata(record.offset() + 1)
        ));
    }
}

If you need more throughput, batch your commits:

int count = 0;
Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();

for (ConsumerRecord<String, String> record : records) {
    processRecord(record);

    offsets.put(
        new TopicPartition(record.topic(), record.partition()),
        new OffsetAndMetadata(record.offset() + 1)
    );

    if (++count % 100 == 0) {
        // Pass a copy - clearing the live map while the async commit
        // is still in flight risks committing an empty snapshot
        consumer.commitAsync(new HashMap<>(offsets), null);
        offsets.clear();
    }
}
// Synchronous flush of the remainder before the next poll()
consumer.commitSync(offsets);
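Batched commits still leave a window if a rebalance hits mid-batch. Kafka's ConsumerRebalanceListener closes most of it: commit in onPartitionsRevoked, before the partitions change hands. A sketch, assuming `offsets` is the same map the batching loop above fills in and "orders" stands in for your topic:

```java
// Commit tracked offsets before partitions are handed to another consumer
consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> revoked) {
        // Synchronous on purpose: the commit must land before revocation completes
        if (!offsets.isEmpty()) {
            consumer.commitSync(offsets);
            offsets.clear();
        }
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> assigned) {
        // Nothing to do - the new owner starts from the committed offsets
    }
});
```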

Configuration Trade-offs

Configuration               | Throughput | Duplicate Risk | Best For
Auto-commit + eager         | High       | High           | Idempotent consumers, analytics
Manual sync per message     | Low        | Low            | Financial transactions, payments
Manual async + cooperative  | High       | Medium         | Event logging with deduplication
Batched commits + sticky    | Very High  | Medium         | High-volume analytics pipelines

Monitoring

Three metrics that actually tell you something:

  • rebalance-latency-avg - If this is going up, your consumers are spending more time rebalancing than processing. Not great.
  • assigned-partitions - If it's bouncing between 0 and N, you've got a consumer that keeps crashing and rejoining.
  • commit-latency-avg - Slow commits can trigger evictions, which cause more rebalances. It's a fun feedback loop.

The Checklist

  1. Switch to CooperativeStickyAssignor. Just do this. Should be the default for any new consumer group.
  2. Turn off auto-commit. Set enable.auto.commit=false and handle offsets yourself.
  3. Tune your timeouts. Find the balance between catching real failures and tolerating slow processing.
  4. Make consumers idempotent. Deduplication keys, upserts instead of inserts.
  5. Watch your rebalance frequency. More than a few per hour and something's wrong.
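Item 4 can be as simple as tracking a business key before doing the work. A stdlib-only sketch (DedupingProcessor is a hypothetical helper, not a Kafka class; in production the seen-set would live somewhere that survives restarts, e.g. the same database the upsert writes to):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: drops records whose business key was already processed
class DedupingProcessor {
    private final Set<String> seenKeys = new HashSet<>();

    /** Returns true if the record was processed, false if it was a duplicate. */
    boolean processIfNew(String businessKey) {
        if (!seenKeys.add(businessKey)) {
            return false; // replayed after a rebalance - safe to drop
        }
        // ... real work goes here (ideally an upsert keyed by businessKey)
        return true;
    }
}
```

A replayed batch then becomes a no-op: processIfNew("order-42") does the work once and silently drops every replay.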

Key Takeaways

  • Rebalancing will happen. Design for it instead of hoping it won't.
  • Switch to cooperative rebalancing. There's almost no reason to keep eager. It turns one failure into a cluster-wide pause.
  • Manage offsets yourself when you care about correctness more than raw throughput.
  • Make your consumers idempotent. Even the best offset strategy has edge cases. If duplicates are safe, rebalancing goes from incident to inconvenience.

Dig Deeper