Mirror of https://github.com/seaweedfs/seaweedfs.git (synced 2026-05-23 02:01:32 +00:00)
* fix(kafka): close late-joiner orphan race in consumer-group rebalance
The CI-observed orphan (one consumer with empty assignment after
`TestConsumerGroups/Rebalancing/MultipleConsumersJoin`) came from a
race in the group coordinator: once the leader had taken its member-
list snapshot in its JoinGroup response, a new member could still
arrive before the leader's SyncGroup landed. The gateway accepted the
stale SyncGroup, moved to Stable, and the late joiner's own SyncGroup
then served an empty Assignment from the Stable-state path — leaving
it silently unassigned with no further rebalance to fix it.
Three changes in `handleJoinGroup` / `handleSyncGroup` close the race:
- Late join during `CompletingRebalance` bumps the generation and
resets to `PreparingRebalance`, so the leader's in-flight SyncGroup
fails its generation check and the round restarts with the new
member in the snapshot.
- SyncGroup generation-mismatch returns `REBALANCE_IN_PROGRESS` (not
`ILLEGAL_GENERATION`) while the group is rebalancing, mirroring the
existing heartbeat fix — otherwise Sarama's `Consume()` tears down
on the stale SyncGroup instead of retrying.
- Leader SyncGroup verifies its assignment covers every current
member and rejects with `REBALANCE_IN_PROGRESS` otherwise, as a
belt-and-suspenders catch for joins that slip in between the
leader's JoinGroup reply and its SyncGroup without going through
`CompletingRebalance` state.
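
The first two changes can be sketched as a small state machine. This is a minimal illustration, not the gateway's code: the `Group` type, `join`, and `syncErrorCode` are hypothetical names, and the real `handleJoinGroup` / `handleSyncGroup` carry far more state. Only the numeric error codes come from the public Kafka protocol.

```go
package main

import "fmt"

// GroupState is a simplified stand-in for the coordinator's group states.
type GroupState int

const (
	Stable GroupState = iota
	PreparingRebalance
	CompletingRebalance
)

// Kafka protocol error codes.
const (
	ErrNone                int16 = 0
	ErrIllegalGeneration   int16 = 22
	ErrRebalanceInProgress int16 = 27
)

// Group is a hypothetical, stripped-down consumer group.
type Group struct {
	State      GroupState
	Generation int32
	Members    map[string]bool
}

// join models a JoinGroup from memberID. A new member arriving while the
// group is in CompletingRebalance restarts the round: the generation is
// bumped so the leader's in-flight SyncGroup no longer matches.
func (g *Group) join(memberID string) {
	_, known := g.Members[memberID]
	g.Members[memberID] = true
	if !known && g.State == CompletingRebalance {
		g.Generation++
		g.State = PreparingRebalance
	}
}

// syncErrorCode picks the error for a SyncGroup carrying reqGeneration.
// While a rebalance is under way, a mismatch is reported as
// REBALANCE_IN_PROGRESS so the client rejoins instead of tearing down.
func (g *Group) syncErrorCode(reqGeneration int32) int16 {
	if reqGeneration == g.Generation {
		return ErrNone
	}
	if g.State == PreparingRebalance || g.State == CompletingRebalance {
		return ErrRebalanceInProgress
	}
	return ErrIllegalGeneration
}

func main() {
	g := &Group{
		State:      CompletingRebalance,
		Generation: 5,
		Members:    map[string]bool{"leader": true},
	}
	leaderGen := g.Generation // generation the leader saw in its JoinGroup response
	g.join("late-joiner")     // arrives before the leader's SyncGroup
	fmt.Println(g.Generation, g.syncErrorCode(leaderGen)) // prints: 6 27
}
```

In this sketch the leader's stale SyncGroup (generation 5) is answered with error code 27, so it rejoins and the next member-list snapshot includes the late joiner.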
Verified: baseline reliably reproduces the orphan locally; with the
fix `TestConsumerGroups` passes end-to-end (53s total,
`MultipleConsumersJoin` 15-17s) and a 10-iteration stress loop against
the same gateway is 10/10 green with every consumer getting exactly
one partition.
* fix(kafka): clear stale Assignment when restarting a rebalance round
Review spot: the two restart paths added in the previous commit bumped
group.Generation and reset each member's State to Pending but left
member.Assignment populated with the prior generation's partitions.
The non-leader SyncGroup path only returns REBALANCE_IN_PROGRESS when
`member.Assignment` is empty (handleSyncGroup ~line 982). Leaving the
stale assignment in place means a member rejoining at the new
generation — before the leader's SyncGroup has published fresh
assignments — falls through that guard and is served its old
partitions from the pre-rebalance state.
Clear m.Assignment alongside m.State in both restart sites so the
guard fires and the member correctly re-enters the join/sync cycle.
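
A minimal sketch of why the reset matters, using hypothetical simplified types (`Member`, `Group`, `restartRebalance`, `nonLeaderSync` are illustrative names, and partitions are reduced to a slice of IDs). The non-leader guard only asks for a retry when the stored assignment is empty, so the restart has to clear it.

```go
package main

import "fmt"

// Member is a hypothetical, simplified group member.
type Member struct {
	State      string
	Assignment []int32 // partition IDs from the last completed round
}

// Group is a hypothetical, simplified consumer group.
type Group struct {
	Generation int32
	State      string
	Members    map[string]*Member
}

// restartRebalance bumps the generation and resets every member,
// including the Assignment left over from the previous generation.
func restartRebalance(g *Group) {
	g.Generation++
	g.State = "PreparingRebalance"
	for _, m := range g.Members {
		m.State = "Pending"
		m.Assignment = nil // without this, the guard below never fires
	}
}

// nonLeaderSync mirrors the guard described above: an empty assignment
// means "leader has not published yet", so the member is told to retry.
func nonLeaderSync(m *Member) (assignment []int32, rebalanceInProgress bool) {
	if len(m.Assignment) == 0 {
		return nil, true
	}
	return m.Assignment, false
}

func main() {
	m := &Member{State: "Stable", Assignment: []int32{0}}
	g := &Group{
		Generation: 3,
		State:      "CompletingRebalance",
		Members:    map[string]*Member{"c1": m},
	}
	restartRebalance(g)
	_, retry := nonLeaderSync(m)
	fmt.Println(g.Generation, retry) // prints: 4 true
}
```

If the `m.Assignment = nil` line is removed from this sketch, `nonLeaderSync` returns the old partition `[0]` with `retry == false`, which is the stale-assignment behavior the commit fixes.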
Verified with a fresh-broker TestConsumerGroups run: 50.99s total,
MultipleConsumersJoin 15.25s, all four consumers each get exactly one
partition.
* fix(kafka): don't let empty leader assignments bypass coverage check
Review spot: the leader-assignment branch was gated on
`len(request.GroupAssignments) > 0`, so a leader SyncGroup that omitted
every current member (empty array with a non-empty group) fell through
to the server-side-assignment `else` branch and could move the group
Stable without the intended rebalance retry.
Drop the length guard. Whenever the caller is the leader, build the
assigned-member map and run the coverage check; if the assignment
omits any current member (including the all-empty case against a
non-empty group), bump the generation, reset to PreparingRebalance,
clear each member's Assignment, and return REBALANCE_IN_PROGRESS so
the leader rebuilds its snapshot and sends a complete assignment on
retry. The server-side-assignment branch (documented as "should not
happen with Sarama") is now reachable only when a non-leader sends a
SyncGroup with non-empty assignments, a genuinely unexpected case, and
it keeps its existing warning.
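
The coverage check itself can be sketched as below. This is an assumption-laden illustration: `uncoveredMembers` is a hypothetical helper, member IDs are plain strings, and assignment payloads are opaque byte slices. It only shows the "assigned-member map versus current members" comparison, not the generation bump that follows a failed check.

```go
package main

import (
	"fmt"
	"sort"
)

// uncoveredMembers returns, in sorted order, the current members that the
// leader's SyncGroup did not assign anything to. A nil or empty assigned
// map against a non-empty group reports every member as uncovered.
func uncoveredMembers(current []string, assigned map[string][]byte) []string {
	var missing []string
	for _, id := range current {
		if _, ok := assigned[id]; !ok {
			missing = append(missing, id)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	current := []string{"c1", "c2", "c3"}
	assigned := map[string][]byte{"c1": {0x00}, "c3": {0x01}}

	// c2 joined after the leader took its snapshot, so it is not covered.
	fmt.Println(uncoveredMembers(current, assigned)) // prints: [c2]

	// The all-empty case: nothing assigned, every member is uncovered.
	fmt.Println(len(uncoveredMembers(current, nil))) // prints: 3
}
```

A non-empty result is the condition under which the handler would reject the SyncGroup with REBALANCE_IN_PROGRESS.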
* revert: keep len(GroupAssignments) > 0 gate on leader-assign branch
The previous commit (797f4f779) dropped the len(request.GroupAssignments)
> 0 guard on the leader branch so that a leader SyncGroup carrying empty
assignments for a non-empty group would be forced through the coverage
check. Confluent Schema Registry's SchemaRegistryCoordinator, however,
uses a server-side-assignment protocol and by design sends leader
SyncGroup with an empty GroupAssignments array; dropping the gate put
the schema-registry group into a REBALANCE_IN_PROGRESS rejoin storm
(generation 84000+ observed in the Kafka Quick Test / Load Test with
Schema Registry CI job against PR #9162).
Restore the gate and document why it's load-bearing. The original
CodeRabbit concern (empty leader assignment from a client-side protocol
accidentally bypassing the coverage check) is theoretical — no
real client-side-assignment client sends empty leader assignments — and
the server-side-assignment else-branch is how schema-registry is
supposed to be served.
TestConsumerGroups still passes end-to-end (52.97s fresh-broker,
MultipleConsumersJoin 17.26s, all 4 consumers get exactly one
partition).
* fix(kafka): parse SyncGroup v5 protocol fields; skip partition decode for schema-registry
Two issues surfaced after PR #9162's coverage check was re-gated on
non-empty GroupAssignments:
1. parseSyncGroupRequest was stopping after GroupInstanceID even though
SyncGroup v5+ (the version Confluent Schema Registry uses) inserts
ProtocolType and ProtocolName strings before the assignments array.
The old parser read the protocol strings' compact-string length
prefixes as the assignments-array length and either failed or came
back with bogus assignment entries. Parse the v5 flexible protocol fields
explicitly and add them to SyncGroupRequest.
2. The schema-registry leader's assignment payload is the SR JSON
leader-identity blob, not ConsumerGroupMemberAssignment partition
bytes. processGroupAssignments would parse it as partition bytes
and either fail or corrupt member.Assignment. Special-case the
schema-registry group in the leader-assign branch: skip
processGroupAssignments, clear member.Assignment so
serializeSchemaRegistryAssignment rebuilds the response from the
elected leader's JoinGroup metadata, and transition to Stable.
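
The parsing fix in item 1 can be illustrated with a small decoder for the two extra fields. This is a sketch under stated assumptions, not the gateway's `parseSyncGroupRequest`: `readCompactNullableString`, `protocolFields`, and `parseProtocolFields` are hypothetical names, the buffer is assumed to start right after GroupInstanceID, and "consumer" / "range" are sample values. It relies on the Kafka flexible-version encoding, where a compact string is an unsigned varint holding length+1 (0 meaning null) followed by the bytes.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readCompactNullableString decodes one compact nullable string and
// returns the value (nil for null) plus the number of bytes consumed.
func readCompactNullableString(buf []byte) (*string, int, error) {
	l, w := binary.Uvarint(buf)
	if w <= 0 {
		return nil, 0, errors.New("bad varint length prefix")
	}
	if l == 0 {
		return nil, w, nil // null string
	}
	if l-1 > uint64(len(buf)-w) {
		return nil, 0, errors.New("string exceeds buffer")
	}
	size := int(l - 1)
	v := string(buf[w : w+size])
	return &v, w + size, nil
}

// protocolFields holds the two strings added in SyncGroup v5.
type protocolFields struct {
	ProtocolType string
	ProtocolName string
}

// parseProtocolFields reads ProtocolType and ProtocolName for v5+ and
// returns the offset at which the assignments array begins.
func parseProtocolFields(buf []byte, apiVersion int16) (protocolFields, int, error) {
	var pf protocolFields
	if apiVersion < 5 {
		return pf, 0, nil // older versions go straight to the assignments
	}
	offset := 0
	for _, dst := range []*string{&pf.ProtocolType, &pf.ProtocolName} {
		s, n, err := readCompactNullableString(buf[offset:])
		if err != nil {
			return pf, 0, err
		}
		if s != nil {
			*dst = *s
		}
		offset += n
	}
	return pf, offset, nil
}

func main() {
	// "consumer" (8 bytes, prefix 9), "range" (5 bytes, prefix 6),
	// then a compact array length of 1, meaning zero assignments.
	buf := append([]byte{9}, "consumer"...)
	buf = append(buf, 6)
	buf = append(buf, "range"...)
	buf = append(buf, 1)

	pf, offset, err := parseProtocolFields(buf, 5)
	if err != nil {
		panic(err)
	}
	fmt.Println(pf.ProtocolType, pf.ProtocolName, offset) // prints: consumer range 15
}
```

A parser that skips these two fields would read the leading byte 9 as the compact array length and expect eight assignment entries, which matches the "failed or came back with bogus assignment entries" symptom described in item 1.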
Adds two unit tests: one asserts the v5 parser pulls the protocol
fields out correctly, the other drives the full handleSyncGroup path
for a schema-registry leader and asserts the group reaches Stable
without a partition-decode error.