2 Commits

Author SHA1 Message Date
Ping Qiu
979a9b496c feat: Phase 8 CP8-1/2/3/4 -- ops control plane, multi-replica, CSI snapshots, observability
CP8-1: HTTP REST API (create/delete/lookup/list/assign/servers), blockapi Go
client with multi-master failover, 5 shell commands, HTML dashboard at /block/.
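
A minimal sketch of how multi-master failover in a client like blockapi could work, assuming the client simply tries each configured master in order and returns the first success. The names `withFailover`, `callFunc`, and `fakeCall`, and the master addresses, are hypothetical and not taken from the commit; the real client may pick masters differently (for example by following a leader hint).

```go
package main

import (
	"errors"
	"fmt"
)

// callFunc is a hypothetical per-master request, e.g. one REST call.
type callFunc func(master string) (string, error)

// withFailover tries each master in order and returns the first
// successful result. If every master fails, all errors are joined.
func withFailover(masters []string, call callFunc) (string, error) {
	if len(masters) == 0 {
		return "", errors.New("no masters configured")
	}
	var errs []error
	for _, m := range masters {
		out, err := call(m)
		if err == nil {
			return out, nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", m, err))
	}
	return "", errors.Join(errs...)
}

// fakeCall simulates masters for the demo: any address listed in
// down returns an error, every other address answers.
func fakeCall(down map[string]bool) callFunc {
	return func(master string) (string, error) {
		if down[master] {
			return "", errors.New("connection refused")
		}
		return "served by " + master, nil
	}
}

func main() {
	masters := []string{"m1:9333", "m2:9333", "m3:9333"}
	out, err := withFailover(masters, fakeCall(map[string]bool{"m1:9333": true}))
	fmt.Println(out, err) // served by m2:9333 <nil>
}
```

Passing the request as a function keeps the retry loop independent of the transport, so the same loop could wrap create, delete, or lookup calls.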

CP8-2: RF=2/RF=3 multi-replica support -- ShipperGroup fan-out, distributed
sync, health scoring, segment-based scrub, gated promotion (heartbeat
freshness + WAL LSN + role checks), failover/rebuild for N>2 replicas.
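
The three promotion gates named above can be sketched as one predicate. This is an illustration under assumptions, not the commit's code: the `Candidate` type, the 5-second `maxHeartbeatAge`, and the rule that a candidate's WAL LSN must reach a required LSN are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Role is a hypothetical replica role enum.
type Role int

const (
	RoleReplica Role = iota
	RolePrimary
	RoleRebuilding
)

// maxHeartbeatAge is an assumed freshness bound, not a value from the commit.
const maxHeartbeatAge = 5 * time.Second

// Candidate is the state a master might hold about a replica.
type Candidate struct {
	HeartbeatAge time.Duration // time since the last heartbeat
	WALLSN       uint64        // highest WAL LSN the replica reports
	Role         Role
}

// promotionBlocker returns the reason a candidate may not be promoted,
// or the empty string if every gate passes.
func promotionBlocker(c Candidate, requiredLSN uint64) string {
	switch {
	case c.Role != RoleReplica:
		return "role"
	case c.HeartbeatAge > maxHeartbeatAge:
		return "stale heartbeat"
	case c.WALLSN < requiredLSN:
		return "wal behind"
	}
	return ""
}

func main() {
	ok := Candidate{HeartbeatAge: time.Second, WALLSN: 100, Role: RoleReplica}
	lagging := Candidate{HeartbeatAge: time.Second, WALLSN: 90, Role: RoleReplica}
	fmt.Printf("%q %q\n", promotionBlocker(ok, 100), promotionBlocker(lagging, 100))
}
```

Returning the blocking reason rather than a bare boolean makes it easy to log or count why a promotion was refused.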

CP8-3: CSI snapshot + expansion -- CreateSnapshot/DeleteSnapshot/ListSnapshots
RPCs, NodeExpandVolume with iSCSI rescan, snapshot ID helpers, 20 adversarial
tests covering concurrent ops, edge cases, and error injection.

CP8-4: Observability -- EngineMetrics atomic counters for flusher/group-commit/
WAL-shipper/scrub, 10 new Prometheus metrics, barrier_lag_lsn SLO gauge,
failover/promotion/rebuild counters, request ID correlation in master gRPC
logs, baseline regression framework with 7 hard-fail conditions.
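
As an illustration of lock-free counters plus a lag gauge, here is a small sketch using the standard `sync/atomic` types. The field names and the definition of barrier lag (primary LSN minus the last barrier-acknowledged LSN, clamped at zero) are assumptions; the commit does not spell out how `barrier_lag_lsn` is computed, and the Prometheus export is omitted.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// EngineMetrics holds hypothetical counters that hot paths can bump
// without taking a lock. The zero value is ready to use.
type EngineMetrics struct {
	FlushOps     atomic.Uint64
	GroupCommits atomic.Uint64
	PrimaryLSN   atomic.Uint64 // latest LSN written on the primary
	BarrierLSN   atomic.Uint64 // latest LSN acknowledged by a barrier
}

// BarrierLagLSN returns how far the barrier trails the primary.
// The two loads are not a single snapshot, so the result is clamped
// at zero instead of underflowing.
func (m *EngineMetrics) BarrierLagLSN() uint64 {
	primary := m.PrimaryLSN.Load()
	barrier := m.BarrierLSN.Load()
	if barrier >= primary {
		return 0
	}
	return primary - barrier
}

func main() {
	var m EngineMetrics
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				m.FlushOps.Add(1)
			}
		}()
	}
	wg.Wait()
	m.PrimaryLSN.Store(120)
	m.BarrierLSN.Store(100)
	fmt.Println(m.FlushOps.Load(), m.BarrierLagLSN()) // 8000 20
}
```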

Total: 63 files, ~11.2K LOC, 160+ new tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 00:05:17 -08:00
Ping Qiu
8b2b5f6f66 feat: Phase 6 CP6-3 -- failover + rebuild in Kubernetes, 126 tests
Wire low-level fencing primitives to master/VS control plane and CSI:

- Proto: replica/rebuild address fields on assignment/info/response messages
- Assignment queue: retain-until-confirmed (Peek+Confirm), stale epoch pruning
- VS assignment receiver: processes assignments from HeartbeatResponse
- BlockService replication: ProcessAssignments, deterministic ports (FNV hash)
- Registry replica tracking: SetReplica/ClearReplica/SwapPrimaryReplica
- CreateBlockVolume: primary + replica, enqueues assignments, single-copy mode
- Failover: lease-aware promotion, deferred timers with cancellation on reconnect
- ControllerPublish: returns fresh primary iSCSI address after failover
- Recovery: recoverBlockVolumes drains pendingRebuilds, enqueues Rebuilding
- Real integration tests on M02: failover address switch, rebuild data
  consistency, full lifecycle failover+rebuild (3 tests, all PASS)
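
The retain-until-confirmed queue in the list above can be sketched roughly as follows. This is a guess at the shape, not the actual implementation: the `Assignment` fields, the per-server map, and the pruning rule (a newer epoch for the same volume replaces older pending entries) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Assignment is a hypothetical instruction for a volume server.
type Assignment struct {
	VolumeID string
	Epoch    uint64
	Role     string
}

// AssignmentQueue keeps assignments per server until they are confirmed.
type AssignmentQueue struct {
	mu      sync.Mutex
	pending map[string][]Assignment
}

func NewAssignmentQueue() *AssignmentQueue {
	return &AssignmentQueue{pending: make(map[string][]Assignment)}
}

// Enqueue adds a and prunes pending entries for the same volume with an
// older or equal epoch. An assignment older than a pending one is dropped.
func (q *AssignmentQueue) Enqueue(server string, a Assignment) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for _, p := range q.pending[server] {
		if p.VolumeID == a.VolumeID && p.Epoch > a.Epoch {
			return // a is already stale
		}
	}
	kept := q.pending[server][:0]
	for _, p := range q.pending[server] {
		if p.VolumeID != a.VolumeID {
			kept = append(kept, p)
		}
	}
	q.pending[server] = append(kept, a)
}

// Peek returns a copy of the pending assignments without removing them,
// so they are re-sent on every heartbeat until confirmed.
func (q *AssignmentQueue) Peek(server string) []Assignment {
	q.mu.Lock()
	defer q.mu.Unlock()
	return append([]Assignment(nil), q.pending[server]...)
}

// Confirm removes entries for volumeID whose epoch is at most epoch.
func (q *AssignmentQueue) Confirm(server, volumeID string, epoch uint64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	kept := q.pending[server][:0]
	for _, p := range q.pending[server] {
		if p.VolumeID == volumeID && p.Epoch <= epoch {
			continue
		}
		kept = append(kept, p)
	}
	q.pending[server] = kept
}

func main() {
	q := NewAssignmentQueue()
	q.Enqueue("vs1", Assignment{"vol-a", 1, "replica"})
	q.Enqueue("vs1", Assignment{"vol-a", 2, "primary"}) // prunes epoch 1
	fmt.Println(len(q.Peek("vs1")), len(q.Peek("vs1"))) // 1 1
	q.Confirm("vs1", "vol-a", 2)
	fmt.Println(len(q.Peek("vs1"))) // 0
}
```

Because Peek never removes anything, a lost heartbeat response only delays delivery; the assignment is offered again on the next heartbeat.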

Review fixes (12 findings, 5 High, 5 Medium, 2 Low):
- R1-1: AllocateBlockVolume returns replication ports
- R1-2: setupPrimaryReplication starts rebuild server
- R1-3: VS sends periodic block heartbeat for assignment confirmation
- R2-F1: LastLeaseGrant set before Register (no stale-lease race)
- R2-F2: Deferred promotion timers cancelled on VS reconnect
- R2-F3: SwapPrimaryReplica uses RoleToWire instead of uint32(1)
- R2-F4: DeleteBlockVolume deletes replica (best-effort)
- R2-F5: SwapPrimaryReplica computes epoch atomically under lock
- QA: SetReplica removes old replica from byServer index (BUG-QA-CP63-1)
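
Fix R2-F2 (cancelling deferred promotion timers when a volume server reconnects) can be illustrated with `time.AfterFunc`. This is a sketch under assumptions: the `DeferredPromotions` type and its methods are hypothetical names, and the real code may key timers differently.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DeferredPromotions tracks one pending promotion timer per server.
type DeferredPromotions struct {
	mu     sync.Mutex
	timers map[string]*time.Timer
}

func NewDeferredPromotions() *DeferredPromotions {
	return &DeferredPromotions{timers: make(map[string]*time.Timer)}
}

// Defer schedules promote to run after delay unless Cancel is called
// first. A second Defer for the same server replaces the earlier timer.
func (d *DeferredPromotions) Defer(server string, delay time.Duration, promote func()) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if old, ok := d.timers[server]; ok {
		old.Stop()
	}
	var t *time.Timer
	t = time.AfterFunc(delay, func() {
		// Only run if this timer is still the registered one, so a
		// cancelled or replaced timer that already fired does nothing.
		d.mu.Lock()
		current := d.timers[server] == t
		if current {
			delete(d.timers, server)
		}
		d.mu.Unlock()
		if current {
			promote()
		}
	})
	d.timers[server] = t
}

// Cancel drops the pending promotion for server, e.g. on reconnect.
// It reports whether a pending promotion was registered.
func (d *DeferredPromotions) Cancel(server string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	t, ok := d.timers[server]
	if !ok {
		return false
	}
	delete(d.timers, server)
	t.Stop()
	return true
}

func main() {
	d := NewDeferredPromotions()
	d.Defer("vs1", 50*time.Millisecond, func() { fmt.Println("promote replica of vs1") })
	d.Defer("vs2", 50*time.Millisecond, func() { fmt.Println("promote replica of vs2") })
	fmt.Println("vs1 reconnected, cancelled:", d.Cancel("vs1"))
	time.Sleep(200 * time.Millisecond) // only the vs2 promotion fires
}
```

Checking the map entry inside the callback closes the window where a timer has fired but its callback has not yet run when the server reconnects.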

126 CP6-3 tests (67 dev + 48 QA + 8 integration + 3 real).
Cumulative Phase 6: 352 tests. All PASS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 00:52:05 -08:00