pingqiu
d1a16fac03
feat: protocol-aware execution wave — phase gate for live WAL shipping
Add host-side protocol state seam that derives per-replica execution
state from V2 sender/session snapshots and blocks live-tail WAL
shipping while an active recovery session is in progress.
New file: weed/server/block_protocol_state.go
- replicaProtocolExecutionState derived from engine snapshots
- LiveEligible=false during active catch-up/rebuild sessions
- bindProtocolExecutionPolicy wires policy into BlockVol
- syncProtocolExecutionState called after assignments + core events
Data plane changes:
- WALShipper.Ship() checks liveShippingPolicy before dial/send
- BlockVol.SetLiveShippingPolicy persists across shipper group rebuilds
- ShipperGroup propagates policy to all shippers
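The phase gate above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the `shipper` type, the `LiveShippingPolicy` function shape, and `ErrLiveShippingBlocked` are all hypothetical stand-ins, since the real signatures of WALShipper.Ship() and liveShippingPolicy are not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrLiveShippingBlocked is a hypothetical sentinel error; the real code may
// report a blocked ship differently.
var ErrLiveShippingBlocked = errors.New("live WAL shipping blocked by protocol state")

// LiveShippingPolicy is an assumed shape for the policy hook: it reports
// whether live-tail shipping to the given replica is currently allowed.
type LiveShippingPolicy func(replicaID string) bool

// shipper is a stand-in for WALShipper, reduced to what the gate needs.
type shipper struct {
	replicaID string
	policy    LiveShippingPolicy
	sent      []uint64 // LSNs that passed the gate
}

// Ship consults the policy before doing any network work.
// A nil policy means no gate is installed, so shipping is allowed.
func (s *shipper) Ship(lsn uint64) error {
	if s.policy != nil && !s.policy(s.replicaID) {
		return ErrLiveShippingBlocked
	}
	s.sent = append(s.sent, lsn) // placeholder for dial + send
	return nil
}

func main() {
	// recovering models "an active catch-up/rebuild session exists".
	recovering := map[string]bool{"r1": true}
	policy := func(id string) bool { return !recovering[id] }

	s := &shipper{replicaID: "r1", policy: policy}
	fmt.Println("during recovery:", s.Ship(10))

	recovering["r1"] = false
	fmt.Println("after recovery:", s.Ship(11))
}
```

Checking the policy before dialing means a blocked replica costs no connection attempt, which matches the "before dial/send" ordering stated above.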
Design contract: sw-block/design/v2-protocol-aware-execution.md
Scope: WAL-first rollout only. Prevents illegal live-tail delivery
during active recovery. Does not change snapshot/build behavior or
move backlog. Next wave: bounded WAL catch-up under same contract.
Tests: 4 unit/component tests for phase gate behavior, plus bootstrap
seam tests that confirmed the two pre-existing bugs locally.
13 files changed, 900 insertions, 69 deletions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 23:47:07 -07:00
pingqiu
cf16e53b04
feat: Phase 16M/17 + promote fixes + testrunner updates
Phase 16M: explicit replica readiness on heartbeat seam
- master.proto: optional bool replica_ready = 19 (proto regenerated on M01)
- block_heartbeat_proto.go: write/read ReplicaReady with presence semantics
- master_block_registry.go: replicaReadyObservedFromHeartbeat prefers
explicit proto field, falls back to address heuristic when absent
- volume_server_block.go: heartbeat emits ReplicaReady from core projection
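The "explicit field first, heuristic fallback" rule for replica readiness might look like the sketch below. The `heartbeat` struct is a hypothetical reduction of the real message, and the fallback shown (a non-empty replica address) is only an assumed heuristic; the actual logic in replicaReadyObservedFromHeartbeat is not reproduced here.

```go
package main

import "fmt"

// heartbeat mirrors only the fields discussed above. A *bool models a proto3
// "optional bool": nil means the sender did not set the field.
type heartbeat struct {
	ReplicaReady *bool
	ReplicaAddr  string
}

// replicaReady prefers the explicit field and falls back to a heuristic when
// the field is absent. The heuristic here is an assumption for illustration.
func replicaReady(hb heartbeat) bool {
	if hb.ReplicaReady != nil {
		return *hb.ReplicaReady
	}
	return hb.ReplicaAddr != ""
}

func main() {
	no := false
	// Explicit "not ready" wins even though an address is present.
	fmt.Println(replicaReady(heartbeat{ReplicaReady: &no, ReplicaAddr: "10.0.0.2:4000"})) // false
	// Field absent (older sender): fall back to the address heuristic.
	fmt.Println(replicaReady(heartbeat{ReplicaAddr: "10.0.0.2:4000"})) // true
	fmt.Println(replicaReady(heartbeat{}))                             // false
}
```

Presence semantics matter because a plain bool cannot distinguish "sender says not ready" from "sender predates the field".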
Phase 17: host effects extraction + stop line
- phase-17-log.md: Batch 10/11 delivery notes
Promote fixes:
- master_block_failover.go: deterministic replica addrs from path hash
- qa_promote_replication_test.go: address-upgrade trigger test
- qa_promote_rejoin_live_test.go: new live rejoin test
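Deriving replica addresses deterministically from a path hash could be sketched as below. The hash function is an assumption: FNV-1a is used here only because it is in the Go standard library, and the host, base port, and span values are illustrative, not taken from master_block_failover.go.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// portFor maps a volume path to a port in [base, base+span).
// base and span are illustrative values, not the project's actual range.
func portFor(path string, base, span uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(path)) // Write on a hash.Hash never returns an error
	return base + h.Sum32()%span
}

// replicaAddr builds "host:port" from a host and the hashed port.
func replicaAddr(host, path string) string {
	return fmt.Sprintf("%s:%d", host, portFor(path, 20000, 1000))
}

func main() {
	a := replicaAddr("10.0.0.2", "/data/vol1.blk")
	b := replicaAddr("10.0.0.2", "/data/vol1.blk")
	fmt.Println(a == b) // true: same path always yields the same address
}
```

Because the address depends only on the path, master and volume server can compute it independently after a failover. Two paths can still hash to the same port, so a real implementation would need some collision handling.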
Testrunner:
- devops.go: action improvements
- recovery-baseline-failover.yaml, suite-ha-failover.yaml: scenario updates
- cp11b3-manual-promote.yaml: promote scenario alignment
- fresh_volume_write_test.go: new component test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 11:38:05 -07:00
Ping Qiu
979a9b496c
feat: Phase 8 CP8-1/2/3/4 -- ops control plane, multi-replica, CSI snapshots, observability
CP8-1: HTTP REST API (create/delete/lookup/list/assign/servers), blockapi Go
client with multi-master failover, 5 shell commands, HTML dashboard at /block/.
CP8-2: RF=2/RF=3 multi-replica support -- ShipperGroup fan-out, distributed
sync, health scoring, segment-based scrub, gated promotion (heartbeat
freshness + WAL LSN + role checks), failover/rebuild for N>2 replicas.
CP8-3: CSI snapshot + expansion -- CreateSnapshot/DeleteSnapshot/ListSnapshots
RPCs, NodeExpandVolume with iSCSI rescan, snapshot ID helpers, 20 adversarial
tests covering concurrent ops, edge cases, and error injection.
CP8-4: Observability -- EngineMetrics atomic counters for flusher/group-commit/
WAL-shipper/scrub, 10 new Prometheus metrics, barrier_lag_lsn SLO gauge,
failover/promotion/rebuild counters, request ID correlation in master gRPC
logs, baseline regression framework with 7 hard-fail conditions.
Total: 63 files, ~11.2K LOC, 160+ new tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 00:05:17 -08:00
Ping Qiu
8b2b5f6f66
feat: Phase 6 CP6-3 -- failover + rebuild in Kubernetes, 126 tests
Wire low-level fencing primitives to master/VS control plane and CSI:
- Proto: replica/rebuild address fields on assignment/info/response messages
- Assignment queue: retain-until-confirmed (Peek+Confirm), stale epoch pruning
- VS assignment receiver: processes assignments from HeartbeatResponse
- BlockService replication: ProcessAssignments, deterministic ports (FNV hash)
- Registry replica tracking: SetReplica/ClearReplica/SwapPrimaryReplica
- CreateBlockVolume: primary + replica, enqueues assignments, single-copy mode
- Failover: lease-aware promotion, deferred timers with cancellation on reconnect
- ControllerPublish: returns fresh primary iSCSI address after failover
- Recovery: recoverBlockVolumes drains pendingRebuilds, enqueues Rebuilding
- Real integration tests on M02: failover address switch, rebuild data
consistency, full lifecycle failover+rebuild (3 tests, all PASS)
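The retain-until-confirmed assignment queue (Peek+Confirm with stale epoch pruning) from the list above could be sketched as follows. All type and method names are illustrative assumptions; the real queue's API and pruning rule are not shown in this commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// assignment is a reduced, hypothetical form of a block volume assignment.
type assignment struct {
	Path  string
	Epoch uint64
}

// queue keeps assignments per volume server until they are confirmed.
type queue struct {
	mu      sync.Mutex
	pending map[string][]assignment
}

func newQueue() *queue { return &queue{pending: map[string][]assignment{}} }

func (q *queue) Enqueue(server string, a assignment) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending[server] = append(q.pending[server], a)
}

// Peek returns a copy without removing anything, so an assignment whose
// heartbeat response was lost is simply sent again next time.
func (q *queue) Peek(server string) []assignment {
	q.mu.Lock()
	defer q.mu.Unlock()
	return append([]assignment(nil), q.pending[server]...)
}

// Confirm drops the confirmed (path, epoch) entry and prunes any entries for
// the same path with an older epoch, since they are now stale.
func (q *queue) Confirm(server, path string, epoch uint64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	kept := q.pending[server][:0]
	for _, a := range q.pending[server] {
		if a.Path == path && a.Epoch <= epoch {
			continue
		}
		kept = append(kept, a)
	}
	q.pending[server] = kept
}

func main() {
	q := newQueue()
	q.Enqueue("vs1", assignment{Path: "vol-a", Epoch: 1})
	q.Enqueue("vs1", assignment{Path: "vol-a", Epoch: 2})
	q.Enqueue("vs1", assignment{Path: "vol-b", Epoch: 1})

	fmt.Println(len(q.Peek("vs1")), len(q.Peek("vs1"))) // 3 3: Peek does not consume
	q.Confirm("vs1", "vol-a", 2)
	fmt.Println(q.Peek("vs1")) // only vol-b remains
}
```

Separating Peek from Confirm gives at-least-once delivery: the master keeps resending until the volume server's heartbeat proves the assignment was applied.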
Review fixes (12 findings, 5 High, 5 Medium, 2 Low):
- R1-1: AllocateBlockVolume returns replication ports
- R1-2: setupPrimaryReplication starts rebuild server
- R1-3: VS sends periodic block heartbeat for assignment confirmation
- R2-F1: LastLeaseGrant set before Register (no stale-lease race)
- R2-F2: Deferred promotion timers cancelled on VS reconnect
- R2-F3: SwapPrimaryReplica uses RoleToWire instead of uint32(1)
- R2-F4: DeleteBlockVolume deletes replica (best-effort)
- R2-F5: SwapPrimaryReplica computes epoch atomically under lock
- QA: SetReplica removes old replica from byServer index (BUG-QA-CP63-1)
126 CP6-3 tests (67 dev + 48 QA + 8 integration + 3 real).
Cumulative Phase 6: 352 tests. All PASS.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 00:52:05 -08:00