mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-24 02:31:28 +00:00
The CI-observed orphan (one consumer with empty assignment after `TestConsumerGroups/Rebalancing/MultipleConsumersJoin`) came from a race in the group coordinator: once the leader had taken its member-list snapshot in its JoinGroup response, a new member could still arrive before the leader's SyncGroup landed. The gateway accepted the stale SyncGroup, moved to Stable, and the late joiner's own SyncGroup then served an empty Assignment from the Stable-state path, leaving it silently unassigned with no further rebalance to fix it.

Three changes in `handleJoinGroup` / `handleSyncGroup` close the race:

- Late join during `CompletingRebalance` bumps the generation and resets to `PreparingRebalance`, so the leader's in-flight SyncGroup fails its generation check and the round restarts with the new member in the snapshot.
- SyncGroup generation-mismatch returns `REBALANCE_IN_PROGRESS` (not `ILLEGAL_GENERATION`) while the group is rebalancing, mirroring the existing heartbeat fix. Otherwise Sarama's `Consume()` tears down on the stale SyncGroup instead of retrying.
- Leader SyncGroup verifies its assignment covers every current member and rejects with `REBALANCE_IN_PROGRESS` otherwise, as a belt-and-suspenders catch for joins that slip in between the leader's JoinGroup reply and its SyncGroup without going through `CompletingRebalance` state.

Verified: baseline reliably reproduces the orphan locally; with the fix `TestConsumerGroups` passes end-to-end (53s total, `MultipleConsumersJoin` 15-17s) and a 10-iteration stress loop against the same gateway is 10/10 green with every consumer getting exactly one partition.
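
For illustration only, the three checks described in this message can be sketched as a toy state machine. This is not the gateway's code: `Group`, `NewGroup`, `Join`, `Sync`, `ErrorCode` and the `Err*` constants are all hypothetical names, and the real `handleJoinGroup` / `handleSyncGroup` deal with wire formats, locking and timeouts that are left out here.

```go
package main

// All identifiers below are hypothetical and exist only for this sketch.
type GroupState int

const (
	PreparingRebalance GroupState = iota
	CompletingRebalance
	Stable
)

type ErrorCode int

const (
	ErrNone ErrorCode = iota
	ErrIllegalGeneration
	ErrRebalanceInProgress
)

type Group struct {
	State       GroupState
	Generation  int32
	Leader      string
	Members     map[string]bool
	Assignments map[string][]int32
}

// NewGroup models the moment right after the leader received its JoinGroup
// response: generation 1, CompletingRebalance, membership as snapshotted.
func NewGroup(leader string, members ...string) *Group {
	g := &Group{State: CompletingRebalance, Generation: 1, Leader: leader,
		Members: map[string]bool{}}
	for _, m := range members {
		g.Members[m] = true
	}
	return g
}

// Join sketches change 1: a new member arriving during CompletingRebalance
// bumps the generation and sends the group back to PreparingRebalance.
func (g *Group) Join(memberID string) {
	known := g.Members[memberID]
	g.Members[memberID] = true
	if !known && g.State == CompletingRebalance {
		g.Generation++
		g.State = PreparingRebalance
	}
}

// Sync sketches changes 2 and 3.
func (g *Group) Sync(memberID string, generation int32, assignments map[string][]int32) ErrorCode {
	if generation != g.Generation {
		// Change 2: a stale generation is retriable while a rebalance is running.
		if g.State == PreparingRebalance || g.State == CompletingRebalance {
			return ErrRebalanceInProgress
		}
		return ErrIllegalGeneration
	}
	if memberID == g.Leader {
		// Change 3: the leader's assignment must cover every current member.
		for id := range g.Members {
			if _, ok := assignments[id]; !ok {
				return ErrRebalanceInProgress
			}
		}
		g.Assignments = assignments
		g.State = Stable
	}
	return ErrNone
}
```

In this sketch, calling `Join("c")` on `NewGroup("a", "a", "b")` moves the group to generation 2, so the leader's `Sync("a", 1, ...)` built from the old snapshot is answered with the retriable code rather than being accepted.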