Files
seaweedfs/weed
Chris Lu 0007245c7f fix(kafka): close late-joiner orphan race in consumer-group rebalance (#9162)
* fix(kafka): close late-joiner orphan race in consumer-group rebalance

The CI-observed orphan (one consumer with empty assignment after
`TestConsumerGroups/Rebalancing/MultipleConsumersJoin`) came from a
race in the group coordinator: once the leader had taken its member-
list snapshot in its JoinGroup response, a new member could still
arrive before the leader's SyncGroup landed. The gateway accepted the
stale SyncGroup, moved to Stable, and the late joiner's own SyncGroup
then served an empty Assignment from the Stable-state path — leaving
it silently unassigned with no further rebalance to fix it.

Three changes in `handleJoinGroup` / `handleSyncGroup` close the race:

- Late join during `CompletingRebalance` bumps the generation and
  resets to `PreparingRebalance`, so the leader's in-flight SyncGroup
  fails its generation check and the round restarts with the new
  member in the snapshot.
- SyncGroup generation-mismatch returns `REBALANCE_IN_PROGRESS` (not
  `ILLEGAL_GENERATION`) while the group is rebalancing, mirroring the
  existing heartbeat fix — otherwise Sarama's `Consume()` tears down
  on the stale SyncGroup instead of retrying.
- Leader SyncGroup verifies its assignment covers every current
  member and rejects with `REBALANCE_IN_PROGRESS` otherwise, as a
  belt-and-suspenders catch for joins that slip in between the
  leader's JoinGroup reply and its SyncGroup without going through
  `CompletingRebalance` state.

Verified: baseline reliably reproduces the orphan locally; with the
fix `TestConsumerGroups` passes end-to-end (53s total,
`MultipleConsumersJoin` 15-17s) and a 10-iteration stress loop against
the same gateway is 10/10 green with every consumer getting exactly
one partition.

* fix(kafka): clear stale Assignment when restarting a rebalance round

Review spot: the two restart paths added in the previous commit bumped
group.Generation and reset each member's State to Pending but left
member.Assignment populated with the prior generation's partitions.

The non-leader SyncGroup path only returns REBALANCE_IN_PROGRESS when
`member.Assignment` is empty (handleSyncGroup ~line 982). Leaving the
stale assignment in place means a member rejoining at the new
generation — before the leader's SyncGroup has published fresh
assignments — falls through that guard and is served its old
partitions from the pre-rebalance state.

Clear m.Assignment alongside m.State in both restart sites so the
guard fires and the member correctly re-enters the join/sync cycle.

Verified with a fresh-broker TestConsumerGroups run: 50.99s total,
MultipleConsumersJoin 15.25s, all four consumers each get exactly one
partition.

* fix(kafka): don't let empty leader assignments bypass coverage check

Review spot: the leader-assignment branch was gated on
`len(request.GroupAssignments) > 0`, so a leader SyncGroup that omitted
every current member (empty array with a non-empty group) fell through
to the server-side-assignment `else` branch and could move the group
Stable without the intended rebalance retry.

Drop the length guard. Whenever the caller is the leader, build the
assigned-member map and run the coverage check; if the assignment
omits any current member (including the all-empty case against a
non-empty group), bump the generation, reset to PreparingRebalance,
clear each member's Assignment, and return REBALANCE_IN_PROGRESS so
the leader rebuilds its snapshot and sends a complete assignment on
retry. The server-side-assignment branch (documented as "should not
happen with Sarama") is now only reachable for non-leader+non-empty
SyncGroups — a genuinely unexpected case — and keeps its existing
warning.

* revert: keep len(GroupAssignments) > 0 gate on leader-assign branch

The previous commit (797f4f779) dropped the len(request.GroupAssignments)
> 0 guard on the leader-branch so that an empty-assignments-with-
non-empty-members leader SyncGroup would be forced through the coverage
check. Confluent Schema Registry's SchemaRegistryCoordinator, however,
uses a server-side-assignment protocol and by design sends leader
SyncGroup with an empty GroupAssignments array; dropping the gate put
the schema-registry group into a REBALANCE_IN_PROGRESS rejoin storm
(generation 84000+ observed in the Kafka Quick Test / Load Test with
Schema Registry CI job against PR #9162).

Restore the gate and document why it's load-bearing. The original
CodeRabbit concern (empty leader assignment from a client-side protocol
accidentally bypassing the coverage check) is theoretical — no
real client-side-assignment client sends empty leader assignments — and
the server-side-assignment else-branch is how schema-registry is
supposed to be served.

TestConsumerGroups still passes end-to-end (52.97s fresh-broker,
MultipleConsumersJoin 17.26s, all 4 consumers get exactly one
partition).

* fix(kafka): parse SyncGroup v5 protocol fields; skip partition decode for schema-registry

Two issues surfaced after PR #9162's coverage check was re-gated on
non-empty GroupAssignments:

1. parseSyncGroupRequest was stopping after GroupInstanceID even though
   SyncGroup v5+ (the version Confluent Schema Registry uses) inserts
   ProtocolType and ProtocolName strings before the assignments array.
   The old parser read the protocol strings' compact-string length
   prefixes as assignments-array length and either failed or came back
   with bogus assignment entries. Parse v5 flexible protocol fields
   explicitly and add them to SyncGroupRequest.

2. The schema-registry leader's assignment payload is the SR JSON
   leader-identity blob, not ConsumerGroupMemberAssignment partition
   bytes. processGroupAssignments would parse it as partition bytes
   and either fail or corrupt member.Assignment. Special-case the
   schema-registry group in the leader-assign branch: skip
   processGroupAssignments, clear member.Assignment so
   serializeSchemaRegistryAssignment rebuilds the response from the
   elected leader's JoinGroup metadata, and transition to Stable.

Adds two unit tests: one asserts the v5 parser pulls the protocol
fields out correctly, the other drives the full handleSyncGroup path
for a schema-registry leader and asserts the group reaches Stable
without a partition-decode error.
2026-04-20 18:44:16 -07:00
..
2026-04-10 17:31:14 -07:00
2026-04-10 17:31:14 -07:00
2026-04-14 20:48:24 -07:00