Commit Graph

1 Commits

Author SHA1 Message Date
Ping Qiu
8b2b5f6f66 feat: Phase 6 CP6-3 -- failover + rebuild in Kubernetes, 126 tests
Wire low-level fencing primitives to master/VS control plane and CSI:

- Proto: replica/rebuild address fields on assignment/info/response messages
- Assignment queue: retain-until-confirmed (Peek+Confirm), stale epoch pruning
- VS assignment receiver: processes assignments from HeartbeatResponse
- BlockService replication: ProcessAssignments, deterministic ports (FNV hash)
- Registry replica tracking: SetReplica/ClearReplica/SwapPrimaryReplica
- CreateBlockVolume: primary + replica, enqueues assignments, single-copy mode
- Failover: lease-aware promotion, deferred timers with cancellation on reconnect
- ControllerPublish: returns fresh primary iSCSI address after failover
- Recovery: recoverBlockVolumes drains pendingRebuilds, enqueues Rebuilding
- Real integration tests on M02: failover address switch, rebuild data
  consistency, full lifecycle failover+rebuild (3 tests, all PASS)

Review fixes (12 findings, 5 High, 5 Medium, 2 Low):
- R1-1: AllocateBlockVolume returns replication ports
- R1-2: setupPrimaryReplication starts rebuild server
- R1-3: VS sends periodic block heartbeat for assignment confirmation
- R2-F1: LastLeaseGrant set before Register (no stale-lease race)
- R2-F2: Deferred promotion timers cancelled on VS reconnect
- R2-F3: SwapPrimaryReplica uses RoleToWire instead of uint32(1)
- R2-F4: DeleteBlockVolume deletes replica (best-effort)
- R2-F5: SwapPrimaryReplica computes epoch atomically under lock
- QA: SetReplica removes old replica from byServer index (BUG-QA-CP63-1)

126 CP6-3 tests (67 dev + 48 QA + 8 integration + 3 real).
Cumulative Phase 6: 352 tests. All PASS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 00:52:05 -08:00