Chris Lu 86c5e815d2 fix(kafka): make consumer-group rebalancing work end-to-end (#9143)
* fix(kafka): make consumer-group rebalancing work end-to-end

TestConsumerGroups had failed on every run since the job was added
(2026-04-17), but the failures were masked by a `|| echo ...` trailer on
the go test invocation, so CI reported green. Removing the mask
exposes several real bugs in the gateway's group-coordinator code:

1. JoinGroup deduplicated members by ClientID, which collapsed two
   Sarama consumers that share the default ClientID ("sarama") into a
   single member slot and broke rebalancing. Key dedup off the TCP
   ConnectionID instead; keep ClientID on the member for DescribeGroup
   fidelity.

2. Every JoinGroup replaced the *GroupMember struct, wiping the
   Assignment the leader had just published in its SyncGroup and leaving
   non-leader consumers with 0 partitions after a rebalance. Update the
   existing member in place on rejoin.

3. Non-leader SyncGroup returned an empty assignment while the leader
   was mid-rebalance, so consumers silently came up with no partitions.
   Return REBALANCE_IN_PROGRESS when the group is not Stable so Sarama
   retries the join/sync cycle (4 retries x 2s backoff by default).

4. Heartbeat returned ILLEGAL_GENERATION on a generation mismatch even
   when the group was in PreparingRebalance/CompletingRebalance. Return
   REBALANCE_IN_PROGRESS in that case so the heartbeat loop cleanly
   cancels the session instead of tearing it down on a fatal error.

5. The LeaveGroup parser only handled v0-v2. Sarama at V2_8_0_0 sends v3
   (Members array) by default, so the gateway silently rejected the
   request as InvalidGroupID and dead consumers stayed in the group as
   phantom leaders. Add parsing for v3 and for v4+ (flexible/compact/
   tagged-fields).
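
A minimal sketch of the idea behind fixes 1 and 2, not the gateway's
actual code: the GroupMember fields, the Group type, and the Join
method below are hypothetical stand-ins. Members are keyed by
ConnectionID, and a rejoin updates the existing struct in place so a
previously published Assignment survives.

```go
package main

// GroupMember is a simplified, hypothetical stand-in for the gateway's
// member struct; the real field set is an assumption.
type GroupMember struct {
	ID           string
	ClientID     string // kept only for DescribeGroup-style reporting
	ConnectionID string
	Assignment   []int32
}

// Group indexes members by ConnectionID rather than ClientID, so two
// clients sharing a ClientID still occupy separate slots.
type Group struct {
	members map[string]*GroupMember
}

// NewGroup returns an empty group.
func NewGroup() *Group {
	return &Group{members: make(map[string]*GroupMember)}
}

// Join registers a new member, or refreshes the existing one in place
// when the same connection rejoins. The struct is not replaced, so its
// Assignment is left untouched.
func (g *Group) Join(connectionID, clientID string) *GroupMember {
	if m, ok := g.members[connectionID]; ok {
		m.ClientID = clientID
		return m
	}
	m := &GroupMember{
		ID:           clientID + "-" + connectionID,
		ClientID:     clientID,
		ConnectionID: connectionID,
	}
	g.members[connectionID] = m
	return m
}
```

With this shape, two consumers that both report ClientID "sarama" on
different connections yield two members, and a rejoin on the first
connection returns the same pointer with its Assignment intact.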

The rebalancing integration tests called Consume() once per consumer,
which cannot survive a rebalance: a REBALANCE_IN_PROGRESS heartbeat
response cancels the session and Consume() returns. This is documented
Sarama behaviour; callers are expected to loop. Added a runConsumeLoop
helper and used it in the four affected sub-tests.
RebalanceTestHandler.Setup now overwrites stale entries in its
assignments channel so the test observes the settled post-rebalance
snapshot rather than whatever arrived first.

* fix(kafka): address PR review feedback

- JoinGroup now snapshots existing members before mutating and restores
  the snapshot on INCONSISTENT_GROUP_PROTOCOL rollback. Previously the
  rollback path always deleted the entry, corrupting group state when
  an existing member rejoined with an incompatible protocol.

- handleLeaveGroup iterates request.Members instead of processing only
  the first entry, so v3+ batch departures (KIP-345 style) correctly
  remove every listed member and build a per-member response. A single
  group-state transition runs after the loop, with leader election
  only triggered if the actual group leader was among the departures.

- Added buildLeaveGroupFlexibleResponse for v4+ clients. The parser
  already decoded flexible versions, but the response still went out in
  non-flexible encoding (4-byte array lengths, 2-byte strings, no
  tagged fields), which v4+ clients could not parse. Route flexible
  versions through the new builder; v1-v3 keep buildLeaveGroupFullResponse.

- BasicFunctionality gives each consumer its own
  ConsumerGroupHandler/ready channel. The previous shared handler
  closed ready once, so readyCount advanced to numConsumers from a
  single signal; the test could proceed without the other consumers
  actually reaching Setup.

- RebalanceTestHandler.assignments is now a size-1 channel, so readers
  always observe the most recent rebalance snapshot instead of an
  intermediate one from an earlier round.
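
One way to get "latest value wins" semantics from a size-1 channel is
sketched below. The publishLatest name and the []int32 snapshot type
are assumptions for illustration, not the handler's actual code.

```go
package main

// publishLatest stores v in a capacity-1 channel, discarding any
// unread older value so a reader only ever sees the newest snapshot.
// It never blocks.
func publishLatest(ch chan []int32, v []int32) {
	for {
		select {
		case ch <- v:
			return
		default:
			// Channel is full: fall through and evict the stale value.
		}
		select {
		case <-ch:
			// Dropped the stale snapshot; retry the send.
		default:
			// A reader drained it first; retry the send.
		}
	}
}
```

Publishing twice before anyone reads leaves only the second snapshot
in the channel.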
2026-04-20 10:11:45 -07:00