seaweedfs/weed
Chris Lu 6eb3bc46bd fix(kafka): close late-joiner orphan race in consumer-group rebalance
The CI-observed orphan (one consumer with empty assignment after
`TestConsumerGroups/Rebalancing/MultipleConsumersJoin`) came from a
race in the group coordinator: once the leader had taken its member-
list snapshot in its JoinGroup response, a new member could still
arrive before the leader's SyncGroup landed. The gateway accepted the
stale SyncGroup, moved to Stable, and the late joiner's own SyncGroup
then served an empty Assignment from the Stable-state path — leaving
it silently unassigned with no further rebalance to fix it.

Three changes in `handleJoinGroup` / `handleSyncGroup` close the race:

- Late join during `CompletingRebalance` bumps the generation and
  resets to `PreparingRebalance`, so the leader's in-flight SyncGroup
  fails its generation check and the round restarts with the new
  member in the snapshot.
- SyncGroup generation-mismatch returns `REBALANCE_IN_PROGRESS` (not
  `ILLEGAL_GENERATION`) while the group is rebalancing, mirroring the
  existing heartbeat fix — otherwise Sarama's `Consume()` tears down
  on the stale SyncGroup instead of retrying.
- Leader SyncGroup verifies its assignment covers every current
  member and rejects with `REBALANCE_IN_PROGRESS` otherwise, as a
  belt-and-suspenders catch for joins that slip in between the
  leader's JoinGroup reply and its SyncGroup without going through
  `CompletingRebalance` state.
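
The three checks above can be sketched against a toy coordinator model.
This is a minimal illustration, not the gateway's code: the `Group` type,
its fields, and the `join`, `sync`, and `replayLateJoin` helpers are
hypothetical names, and the real `handleJoinGroup` / `handleSyncGroup`
carry far more state. Only the error-code values (22 and 27) come from
the Kafka protocol.

```go
package main

// GroupState is a simplified stand-in for the coordinator's group states.
type GroupState int

const (
	Stable GroupState = iota
	PreparingRebalance
	CompletingRebalance
)

// ErrorCode uses the Kafka protocol numbers for the two codes discussed.
type ErrorCode int16

const (
	ErrNone                ErrorCode = 0
	ErrIllegalGeneration   ErrorCode = 22
	ErrRebalanceInProgress ErrorCode = 27
)

// Group is a hypothetical, heavily reduced consumer-group record.
type Group struct {
	State       GroupState
	Generation  int32
	Leader      string
	Members     map[string]bool
	Assignments map[string][]int32
}

// join registers a member. Change 1: a new member arriving during
// CompletingRebalance bumps the generation and restarts the round.
func (g *Group) join(memberID string) int32 {
	known := g.Members[memberID]
	g.Members[memberID] = true
	if !known && g.State == CompletingRebalance {
		g.Generation++
		g.State = PreparingRebalance
	}
	return g.Generation
}

// sync handles a SyncGroup call from memberID at the given generation.
func (g *Group) sync(memberID string, generation int32, assignments map[string][]int32) ErrorCode {
	if generation != g.Generation {
		// Change 2: while rebalancing, a stale generation is retriable.
		if g.State == PreparingRebalance || g.State == CompletingRebalance {
			return ErrRebalanceInProgress
		}
		return ErrIllegalGeneration
	}
	if memberID == g.Leader {
		// Change 3: the leader's assignment must cover every current member.
		for id := range g.Members {
			if _, ok := assignments[id]; !ok {
				return ErrRebalanceInProgress
			}
		}
		g.Assignments = assignments
		g.State = Stable
	}
	return ErrNone
}

// replayLateJoin walks through the race: c3 joins after the leader (c1)
// has taken its snapshot but before the leader's SyncGroup arrives.
func replayLateJoin() (ErrorCode, GroupState, int32) {
	g := &Group{State: PreparingRebalance, Generation: 1, Leader: "c1", Members: map[string]bool{}}
	g.join("c1")
	g.join("c2")
	g.State = CompletingRebalance // join phase done; leader snapshot is {c1, c2}
	leaderGen := g.Generation
	g.join("c3") // late joiner
	code := g.sync("c1", leaderGen, map[string][]int32{"c1": {0}, "c2": {1}})
	return code, g.State, g.Generation
}
```

In this model, `replayLateJoin` has the leader's stale SyncGroup rejected
with `REBALANCE_IN_PROGRESS` and leaves the group in `PreparingRebalance`
at the bumped generation, so the next round includes the late joiner.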

Verified: baseline reliably reproduces the orphan locally; with the
fix `TestConsumerGroups` passes end-to-end (53s total,
`MultipleConsumersJoin` 15-17s) and a 10-iteration stress loop against
the same gateway is 10/10 green with every consumer getting exactly
one partition.
2026-04-20 14:39:37 -07:00