13186 Commits

Author SHA1 Message Date
pingqiu
f20ec2ef79 test: align collector readiness check with replica eligibility
Use ReplicaEligible instead of PublishHealthy in the heartbeat collector test now that publish health is rebound to publication truth rather than receiver readiness.

Made-with: Cursor
2026-04-04 14:03:21 -07:00
pingqiu
6cad5bb8e1 refactor: rebind bounded volume mode heartbeat truth
Make the heartbeat/master boundary preserve explicit volume_mode truth so master consume no longer reconstructs outward mode only from secondary heartbeat signals. Keep backward compatibility by falling back to the previous reconstruction when older heartbeats do not send the field.

Made-with: Cursor
2026-04-04 13:56:41 -07:00
pingqiu
6794f79df9 refactor: preserve bounded publish healthy heartbeat truth
Make the heartbeat/master boundary preserve explicit publish_healthy truth so master consume no longer reconstructs healthy publication only from secondary readiness and degraded heuristics. Keep backward compatibility by falling back to the previous reconstruction when older heartbeats do not send the field.

Made-with: Cursor
2026-04-04 13:43:19 -07:00
pingqiu
eb610deb92 refactor: preserve bounded needs_rebuild heartbeat truth
Make the heartbeat/master boundary preserve explicit needs_rebuild truth so primary heartbeat consume no longer collapses that stronger mode into a generic degraded signal. Keep backward compatibility by falling back to the previous heuristic when older heartbeats do not send the field.

Made-with: Cursor
2026-04-04 13:11:42 -07:00
pingqiu
69b41a7f16 refactor: rebind bounded replica-ready heartbeat truth
Make the heartbeat/master boundary carry explicit replica readiness truth so the registry no longer depends only on replica transport-address presence as a readiness proxy. Keep backward compatibility by falling back to the old address heuristic when older heartbeats do not send the field.

Made-with: Cursor
2026-04-04 12:06:53 -07:00
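The "explicit field, else fall back to the old heuristic" rule shared by the four heartbeat-truth commits above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `heartbeat` struct, its field names, and `replicaReady` are hypothetical, and a pointer stands in for "field not sent".

```go
package main

import "fmt"

// heartbeat is a hypothetical stand-in for the replica heartbeat message.
// ReplicaReady is a pointer so that "field not sent" (nil) can be told apart
// from an explicit false.
type heartbeat struct {
	ReplicaReady    *bool
	ReplicaDataAddr string
}

// replicaReady prefers the explicit readiness field and falls back to the
// older address-presence heuristic when the sender omitted the field.
func replicaReady(hb heartbeat) bool {
	if hb.ReplicaReady != nil {
		return *hb.ReplicaReady
	}
	return hb.ReplicaDataAddr != ""
}

func main() {
	notReady := false
	// Older heartbeat: no explicit field, so the address heuristic decides.
	fmt.Println(replicaReady(heartbeat{ReplicaDataAddr: "10.0.0.3:14260"}))
	// Newer heartbeat: explicit false wins even though an address is present.
	fmt.Println(replicaReady(heartbeat{ReplicaReady: &notReady, ReplicaDataAddr: "10.0.0.3:14260"}))
}
```

The point of the optional field is that an address being present no longer implies readiness once the sender states readiness explicitly.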
pingqiu
43dbebfa04 refactor: close bounded recovery drain and invalidation seams
Move removed-replica drain and replica-scoped invalidation onto explicit core-command paths so the widened multi-replica runtime no longer depends on coarse host-side recovery handling.

Made-with: Cursor
2026-04-04 11:01:12 -07:00
pingqiu
5fd9ec0edf refactor: widen bounded multi-replica catch-up startup ownership
Emit one core-owned start_recovery_task per primary catch-up replica so the bounded multi-replica startup path no longer depends on a single-replica assumption.

Made-with: Cursor
2026-04-04 10:21:28 -07:00
pingqiu
92c006eb29 refactor: aggregate bounded multi-replica catch-up conservatively
Track catch-up observations per replica so the volume-level recovery view stays in catching_up until all bounded replicas complete. This preserves the current bounded semantics while removing an overclaim that would block later multi-replica startup ownership work.

Made-with: Cursor
2026-04-04 09:27:03 -07:00
pingqiu
16ba70f856 refactor: make bounded recovery observation events replica-scoped
Carry replica-scoped addressing through bounded recovery planning and completion events so the core no longer depends on a volume-only observation seam. This preserves the current single-replica catch-up and rebuilding behavior while aligning the observation side with the replica-scoped command path.

Made-with: Cursor
2026-04-04 09:18:07 -07:00
pingqiu
b304b8e212 refactor: make bounded recovery command addressing replica-scoped
Replace the remaining volume-scoped recovery command and pending slot
with replica-scoped addressing on the bounded core-present path. This
preserves the current single-replica catch-up and rebuilding behavior
while removing the structural blocker for later multi-replica startup
ownership.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:05:36 -07:00
pingqiu
1453274988 refactor: extract host effects adapter and define Phase 17 stop line
Move dispatcher-facing host effects out of volume_server_block.go into
blockcmd while keeping server-owned cache/state semantics in weed/server.
Document Batch 10 delivery and Batch 11 stop-line review so the
separation line closes without over-extracting readiness-state mutation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 08:43:21 -07:00
pingqiu
38b5042997 refactor: extract command bindings and service ops from volume server
Move BlockVol-backed command bindings into v2bridge and move non-BlockVol
command operations into weed/server/blockcmd. This keeps dispatch and host
effects in weed/server, keeps backend binding in v2bridge, and further
shrinks volume_server_block.go toward a host shell while preserving
current command-driven proofs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 08:11:39 -07:00
pingqiu
11c6aaf316 feat: Batch 7 + Phase 16C-E — command dispatch extraction + engine refinements
Batch 7: Command dispatch binding extraction
- New weed/server/blockcmd package: CommandHandler interface + DispatchCommands
- volume_server_block.go applyCoreCommandsWithAssignment delegates to dispatcher
- weed/server still owns RecordCommand, EmitCoreEvent, PublishProjection
- v2bridge NOT given command-switch or event-emission semantics

Phase 16C: Rebuilding assignment enters core command path
Phase 16D: Rebuild recovery-task startup is command-driven
Phase 16E: Catch-up recovery-task startup is command-driven

Engine refinements:
- RecoveryTarget on AssignmentDelivered event
- shouldStartRecoveryTask / shouldStartReceiver guards
- bootstrapReason: awaiting_rebuild_start

Bridge/contract updates:
- control_adapter.go: refined translation helpers
- contract.go: executor port alignment

Migration design docs (Batch 1-3 delivered, design artifacts):
- v2-first/second/third-migration-batch.md + task-pack.md
- v2-assignment-translation-unification.md
- v2-execution-muscles-inventory.md
- v2-separation-port-layer-audit.md
- v2-legacy-runtime-exit-criteria.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 02:13:08 -07:00
pingqiu
41082bf92c fix: Batch 6 completion — rebuildAddr folded into resolveRecoveryContext
resolveRecoveryContext now also derives rebuildAddr from assignments,
so the full host-side recovery context is resolved in one call:
- volPath (from replicaID)
- rebuildAddr (from assignments via deriveRebuildAddr)
- recovery bindings (driver + executor via BuildRecoveryBundle)
- replicaFlushedLSN (from sender session)

startTask/runRecovery/runCatchUp/runRebuild now pass assignments
instead of rebuildAddr. No separate rebuildAddr resolution remains
outside the resolver.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:52:35 -07:00
pingqiu
a48da0f674 refactor: Batch 6 — recovery context resolver extracted
New recoveryContext type + resolveRecoveryContext method consolidates:
- volumePathForReplica (volPath from replicaID)
- v2bridge.BuildRecoveryBundle (driver + executor from BlockVol)
- sender/session lookup (replicaFlushedLSN for catch-up start)

runCatchUp and runRebuild now read as:
  resolve → plan → branch (legacy or core-present)

Removed buildRecoveryBundle (inlined into resolveRecoveryContext).
block_recovery.go no longer has any inline context assembly —
it is now a pure orchestration shell.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:46:06 -07:00
pingqiu
263611004e refactor: Batch 5 — recovery binding factory moved to v2bridge
New v2bridge.BuildRecoveryBundle(vol, rebuildAddr) assembles all
recovery bindings (Reader + Pinner + StorageAdapter + Executor) from
a real BlockVol instance in one call.

block_recovery.go changes:
- Removed local recoveryBundle type
- buildRecoveryBundle now delegates to v2bridge.BuildRecoveryBundle
  inside WithVolume, returns (driver, executor, err)
- Removed direct v2bridge.NewReader/NewPinner/NewExecutor construction
- Removed bridge import (no longer needed)
- runCatchUp/runRebuild use (driver, executor, err) directly

block_recovery.go no longer knows how to construct Reader, Pinner,
StorageAdapter, or Executor. It only knows: resolve volPath, ask the
factory for bindings, plan, branch to legacy or core-present path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:39:40 -07:00
pingqiu
ded84b25e6 refactor: Batch 4 steps 2+3 — rebuild status port + recovery bundle factory
Step 2: Rebuild completion status port
- New runtime.RebuildCompletionStatus + DeriveRebuildCommitted:
  reusable shaping logic for post-rebuild snapshot → RebuildCommitted event
- block_recovery.go OnRebuildCompleted: delegates to DeriveRebuildCommitted,
  host only reads raw snapshot via readRebuildStatus (thin binding)
- Removed 15 lines of inline flushedLSN/checkpointLSN/achievedLSN computation

Step 3: Recovery bundle factory
- New buildRecoveryBundle: shared host-side setup for both catch-up and rebuild
  (creates Reader + Pinner + StorageAdapter + Executor + RecoveryDriver)
- runCatchUp and runRebuild both use buildRecoveryBundle instead of
  duplicating the WithVolume → NewReader → NewPinner → NewStorageAdapter →
  NewExecutor → RecoveryDriver chain
- runCatchUp/runRebuild are now thin host-shell methods

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:32:34 -07:00
pingqiu
0bcfc678d0 refactor: Batch 4 step 1 — typed PendingExecution, zero type assertions
Replace interface{} fields in runtime.PendingExecution with typed handles:
- Driver: *engine.RecoveryDriver (was interface{})
- Plan: *engine.RecoveryPlan (was interface{})
- CatchUpIO: engine.CatchUpIO (was interface{})
- RebuildIO: engine.RebuildIO (was interface{})

block_recovery.go:
- ExecutePendingCatchUp/Rebuild: direct field access (pe.Driver, pe.Plan)
  instead of type assertions (pe.Driver.(*engine.RecoveryDriver))
- CancelFunc: pe.Driver.CancelPlan(pe.Plan, reason) — no casts
- 6 type assertions removed from production path

Test files: remove Plan type assertions — fields are typed end-to-end.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:27:29 -07:00
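To illustrate why the commit above replaces `interface{}` fields with typed handles, here is a small self-contained comparison. The `RecoveryDriver` and `RecoveryPlan` types below are simplified stand-ins (the real `engine` types are not shown in this log), and `untypedPending` / `typedPending` are hypothetical names.

```go
package main

import "fmt"

// Simplified stand-ins for the engine types named in the commit.
type RecoveryPlan struct{ TargetLSN uint64 }

type RecoveryDriver struct{ cancelled []string }

func (d *RecoveryDriver) CancelPlan(p *RecoveryPlan, reason string) {
	d.cancelled = append(d.cancelled, reason)
}

// Before: untyped fields force a runtime assertion at every use site.
type untypedPending struct {
	Driver interface{}
	Plan   interface{}
}

// After: typed fields are checked by the compiler.
type typedPending struct {
	Driver *RecoveryDriver
	Plan   *RecoveryPlan
}

// cancelUntyped can fail at runtime if either field holds the wrong type.
func cancelUntyped(pe untypedPending, reason string) bool {
	d, ok := pe.Driver.(*RecoveryDriver)
	if !ok {
		return false
	}
	p, ok := pe.Plan.(*RecoveryPlan)
	if !ok {
		return false
	}
	d.CancelPlan(p, reason)
	return true
}

// cancelTyped needs no assertions; a wrong type would not compile.
func cancelTyped(pe typedPending, reason string) {
	pe.Driver.CancelPlan(pe.Plan, reason)
}

func main() {
	d := &RecoveryDriver{}
	p := &RecoveryPlan{TargetLSN: 42}
	// A mis-typed Plan is only detected at runtime in the untyped variant.
	fmt.Println(cancelUntyped(untypedPending{Driver: d, Plan: "not a plan"}, "shutdown"))
	cancelTyped(typedPending{Driver: d, Plan: p}, "shutdown")
	fmt.Println(d.cancelled)
}
```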
pingqiu
3a5fbbfded fix: Batch 3 wiring — production path uses runtime helpers, legacy isolated
H wiring: block_recovery.go now uses runtime.PendingCoordinator
- Removed local pendingRecoveryExecution type + store/take/peek/has/cancel
- ExecutePendingCatchUp/Rebuild delegate to coord.TakeCatchUp/TakeRebuild
- Shutdown uses coord.CancelAll
- Added CancelAll to PendingCoordinator

I wiring: executeCatchUpPlan/executeRebuildPlan replaced
- ExecutePendingCatchUp now calls rt.ExecuteCatchUpPlan with RecoveryManager
  as RecoveryCallbacks (OnCatchUpCompleted/OnRebuildCompleted)
- ExecutePendingRebuild follows same pattern
- Local executeCatchUpPlan/executeRebuildPlan methods removed

J structural: legacy no-core branches extracted
- executeLegacyCatchUp: wraps rt.ExecuteCatchUpPlan for v2Core==nil path
- executeLegacyRebuild: wraps rt.ExecuteRebuildPlan for v2Core==nil path
- Clear "LEGACY NO-CORE COMPATIBILITY" section with structural separation
- runCatchUp/runRebuild now branch cleanly: legacy helper vs core coordinator

Test updates: pendingRecoveryExecution → rt.PendingExecution, field casing,
Plan type assertions.

Validation: all P4, P16B, and ApplyAssignments tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:20:41 -07:00
pingqiu
e075d77619 refactor: Task J — legacy no-core paths explicitly labeled
Add explicit "LEGACY NO-CORE COMPATIBILITY" section header in
block_recovery.go marking HandleAssignmentResult and
HandleRemovedAssignments as compatibility-only entry points.

The comment block explicitly states:
- These are for pre-Phase-16 no-core paths and older tests
- Core-present paths use StartRecoveryTask + ExecutePending*
- These should NOT be strengthened into semantic-authority proofs

No behavioral change — structural labeling only. All validation passes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:05:16 -07:00
pingqiu
e200df7791 feat: Task I — recovery execution helpers extracted to sw-block runtime
New reusable execution helpers in sw-block/engine/replication/runtime:
- ExecuteCatchUpPlan: drives catch-up execution, notifies host via callback
- ExecuteRebuildPlan: drives rebuild execution, notifies host via callback
- RecoveryCallbacks interface: host-side OnCatchUpCompleted/OnRebuildCompleted

The host (weed/server/block_recovery.go) supplies concrete IO bindings and
receives completion notifications. The reusable execution logic no longer
requires weed/server ownership.

4 tests prove boundary behavior:
- catch-up callback receives achievedLSN matching plan target
- catch-up with plan-derived target works correctly
- rebuild callback receives plan reference
- nil callbacks don't panic

weed/server rebinding to use these helpers deferred to Task J
(legacy isolation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:03:37 -07:00
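The callback boundary described in the commit above could be shaped roughly as below. The names `ExecuteCatchUpPlan`, `RecoveryCallbacks`, `OnCatchUpCompleted`, and `OnRebuildCompleted` appear in the commit message, but every signature here is an assumption, and `CatchUpIO.StreamTo`, `fixedIO`, and `recorder` are hypothetical helpers for the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

type RecoveryPlan struct{ TargetLSN uint64 }

// RecoveryCallbacks is how the host learns about completion (assumed shape).
type RecoveryCallbacks interface {
	OnCatchUpCompleted(plan *RecoveryPlan, achievedLSN uint64)
	OnRebuildCompleted(plan *RecoveryPlan)
}

// CatchUpIO is a hypothetical IO port supplied by the host.
type CatchUpIO interface {
	StreamTo(targetLSN uint64) (achievedLSN uint64, err error)
}

// ExecuteCatchUpPlan drives the IO and notifies the host; nil callbacks are tolerated.
func ExecuteCatchUpPlan(plan *RecoveryPlan, io CatchUpIO, cb RecoveryCallbacks) error {
	achieved, err := io.StreamTo(plan.TargetLSN)
	if err != nil {
		return err
	}
	if cb != nil {
		cb.OnCatchUpCompleted(plan, achieved)
	}
	return nil
}

// fixedIO pretends the replica can reach at most LSN `reach`.
type fixedIO struct{ reach uint64 }

func (f fixedIO) StreamTo(target uint64) (uint64, error) {
	if f.reach < target {
		return f.reach, errors.New("catch-up stopped short of target")
	}
	return target, nil
}

// recorder is a host-side callback implementation that keeps what it was told.
type recorder struct{ achieved []uint64 }

func (r *recorder) OnCatchUpCompleted(_ *RecoveryPlan, lsn uint64) {
	r.achieved = append(r.achieved, lsn)
}
func (r *recorder) OnRebuildCompleted(_ *RecoveryPlan) {}

func main() {
	rec := &recorder{}
	err := ExecuteCatchUpPlan(&RecoveryPlan{TargetLSN: 10}, fixedIO{reach: 20}, rec)
	fmt.Println(err, rec.achieved)
	// Passing a nil callback must not panic.
	fmt.Println(ExecuteCatchUpPlan(&RecoveryPlan{TargetLSN: 10}, fixedIO{reach: 20}, nil))
}
```

Keeping the execution helper free of host types is what lets the host supply only the IO binding and the callbacks.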
pingqiu
6fea93e821 feat: Task H — PendingCoordinator extracted to sw-block/engine/replication/runtime
New reusable pending-execution coordinator with fail-closed command matching:
- Store/TakeCatchUp/TakeRebuild/Cancel/Has/Peek
- TakeCatchUp: fail-closed on target LSN mismatch (cancel + return nil)
- TakeRebuild: same fail-closed semantics
- Cancel callback invoked on mismatch or explicit cancellation

9 tests prove boundary behavior:
- match succeeds, mismatch cancels, explicit cancel, noop on empty,
  peek non-destructive, store replaces, take from empty

No weed/ imports. Pure coordination logic reusable by any adapter shell.
weed/server/block_recovery.go rebinding deferred to Task I.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 00:59:10 -07:00
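A fail-closed take, as described in the commit above, can be sketched like this. It is a simplified illustration under assumptions: the `coordinator` and `pending` types, the per-replica slot map, and the `target_lsn_mismatch` reason string are hypothetical; only the Store/TakeCatchUp vocabulary and the "cancel and return nil on mismatch" rule come from the commit message. Whether the real `Store` cancels a replaced entry is not stated, so this sketch simply overwrites.

```go
package main

import (
	"fmt"
	"sync"
)

// pending is one stored execution waiting for a matching start command.
type pending struct {
	TargetLSN uint64
	Cancel    func(reason string)
}

type coordinator struct {
	mu    sync.Mutex
	slots map[string]*pending // keyed by replica ID (assumption)
}

func newCoordinator() *coordinator {
	return &coordinator{slots: make(map[string]*pending)}
}

// Store replaces any previously stored execution for the replica.
func (c *coordinator) Store(replicaID string, p *pending) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.slots[replicaID] = p
}

// TakeCatchUp removes the pending execution and returns it only when the
// command's target LSN matches. On mismatch it cancels and returns nil.
func (c *coordinator) TakeCatchUp(replicaID string, targetLSN uint64) *pending {
	c.mu.Lock()
	p := c.slots[replicaID]
	delete(c.slots, replicaID)
	c.mu.Unlock()

	if p == nil {
		return nil
	}
	if p.TargetLSN != targetLSN {
		if p.Cancel != nil {
			p.Cancel("target_lsn_mismatch")
		}
		return nil
	}
	return p
}

func main() {
	c := newCoordinator()
	c.Store("vol1/vs2", &pending{TargetLSN: 100, Cancel: func(r string) { fmt.Println("cancelled:", r) }})
	fmt.Println(c.TakeCatchUp("vol1/vs2", 99) == nil)  // mismatch: cancelled, nothing runs
	fmt.Println(c.TakeCatchUp("vol1/vs2", 100) == nil) // slot is already empty
}
```

Removing the entry before comparing means a mismatched command can never leave a stale plan behind to be picked up later.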
pingqiu
519c849946 refactor: Task F+G — remove pinner shim, executor already clean
Task F (Pinner):
- block_recovery.go: removed pinnerShimForRecovery (11 lines of pure
  pass-through). v2bridge.Pinner structurally satisfies bridge.BlockVolPinner
  (same method signatures), so it's passed directly.

Task G (Executor):
- Already clean. v2bridge.Executor is used directly without any shim —
  structurally satisfies engine.CatchUpIO and engine.RebuildIO.
  No code changes needed.

After Task E+F+G: zero shim types remain in block_recovery.go.
v2bridge Reader/Pinner/Executor all satisfy sw-block contracts directly.

Validation:
- go test ./weed/storage/blockvol/v2bridge/ -run "TestPinner_|TestExecutor_|TestBridge_" → PASS
- go test ./weed/server/ -run "TestP4_|TestP16B_" → PASS (8 tests)
- go test ./sw-block/bridge/blockvol/... → PASS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 00:45:43 -07:00
pingqiu
680b530314 refactor: Task E — reader returns bridge.BlockVolState directly
Reader backend-binding extraction:
- v2bridge/reader.go: Reader.ReadState() now returns bridge.BlockVolState
  directly instead of a local v2bridge.BlockVolState mirror type.
  Removed the local BlockVolState type entirely.
- block_recovery.go: removed readerShimForRecovery (12 lines of 1:1
  field copying). Reader is now passed directly as bridge.BlockVolReader.

Before: v2bridge.Reader → v2bridge.BlockVolState → readerShim → bridge.BlockVolState
After:  v2bridge.Reader → bridge.BlockVolState (direct)

v2bridge now imports sw-block/bridge/blockvol for the contract type
(control.go already did this, reader.go now follows the same pattern).

Validation:
- go test ./sw-block/bridge/blockvol/... → PASS
- go test ./weed/storage/blockvol/v2bridge/ -run "TestReader_" → PASS
- go test ./weed/server/ -run "TestP4_|TestP16B_" → PASS (8 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 00:43:30 -07:00
pingqiu
a38e04c03b refactor: Task A — canonical identity/recovery rules via bridge helpers
Remove direct fmt.Sprintf identity construction from v2bridge/control.go.
Both convertReplicaAssignment and convertRebuildAssignment now use:
- bridge.ReplicaAssignmentForServer (canonical ReplicaID derivation)
- bridge.RecoveryTargetForRole (canonical role → SessionKind mapping)

Before: 3 call sites with inline fmt.Sprintf("%s/%s", vol, server)
After: 0 — all identity construction goes through sw-block canonical helpers

volume_server_block.go already used bridge helpers (no change needed).

Validation:
- go test ./sw-block/bridge/blockvol/... → PASS (10 tests)
- go test ./weed/storage/blockvol/v2bridge/ -run "TestControl_|TestBridge_" → PASS (7 tests)
- go test ./weed/server/ -run "TestBlockService_ApplyAssignments_RebuildingRole_" → PASS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 00:10:48 -07:00
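The motivation for routing identity construction through one helper, as in the commit above, can be shown with a tiny sketch. The `canonicalReplicaID` function is a hypothetical stand-in for the bridge helper; only the `"%s/%s"` volume/server format is taken from the commit message.

```go
package main

import "fmt"

// canonicalReplicaID is the single place that knows the ID format.
// Hypothetical stand-in for the canonical bridge helper.
func canonicalReplicaID(volume, server string) string {
	return fmt.Sprintf("%s/%s", volume, server)
}

// Both conversion paths call the helper instead of formatting inline,
// so the two can never drift apart.
func replicaAssignmentID(volume, server string) string {
	return canonicalReplicaID(volume, server)
}

func rebuildAssignmentID(volume, server string) string {
	return canonicalReplicaID(volume, server)
}

func main() {
	fmt.Println(replicaAssignmentID("vol1", "vs2:9333"))
	fmt.Println(rebuildAssignmentID("vol1", "vs2:9333"))
}
```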
pingqiu
13680c9aa6 feat: Phase 16B rev3 — bounded rebuild execution ownership + review
16B widened from catch-up-only to catch-up + rebuild:
- StartRebuildCommand: core emits rebuild command, adapter executes
- Fail-closed: pending rebuild does not run without fresh command
- Recovery observations close back into core projection

New proofs:
- StartRebuildCommand_ConsumesPendingPlanAndUpdatesProjection
- RunRebuild_FailClosedWithoutFreshStartRebuildCommand

Review docs:
- phase-16-rev3-review.md: widened 16B review object
- phase-16-rev3-manager-rereview.md: manager challenge response
- phase-16-checkpoint-review.md: updated

Non-claims: not full recovery-loop closure, not end-to-end
failover/publication, not launch readiness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 21:38:44 -07:00
pingqiu
8c2485e0e9 feat: Phase 15 + Phase 16A/B — V2 core integration + checkpoint review
Phase 15: V2 core wired into BlockService
- volume_server_block.go: v2Core field, applyCoreAssignmentEvent,
  core command executors (ApplyRole, StartReceiver, ConfigureShipper,
  InvalidateSession, StartCatchUp, StartRebuild, PublishProjection)
- Assignment processing now goes through core engine → command emission
  → bounded execution, replacing direct V1 replication setup
- master_block_registry.go: ClusterHealthSummary, VolumeMode in entries
- master_server_handlers_block.go: blockStatusHandler, entryToVolumeInfo
  refactored with entryReplicaSurface

Phase 16A: Core projection surfaces
Phase 16B: Bounded closure (checkpoint review ready)

Test fixes: add v2Core to manually-constructed BlockService in
idempotence, convergence, soak, and CP13-8A tests (required because
V1 replication setup paths now delegate to core engine).

All tests pass (21s regression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 20:58:12 -07:00
pingqiu
a6fc8545b9 feat: Phase 14A+14B — V2 core publication ownership + command semantics
14A: Publication as explicit core-owned state
- state.go: PublicationView on VolumeState, explicit gate reasons
- engine.go: mode→readiness→publication chain with named gates
  (awaiting_role_apply, awaiting_shipper_configured, awaiting_barrier_durability)
- projection.go: PublicationProjection carries publication truth
- RF=1/no-replicas → allocated_only (CP13-9 constraint in core)
- phase14_core_test.go: strengthened publication closure + RF=1 proof

14B: Command emission bounded by semantic gap
- engine.go: repeated same-assignment skips redundant commands,
  repeated same-reason BarrierRejected skips duplicate invalidation,
  command-state tracking on VolumeState
- command.go: new command types for bounded emission
- event.go: new boundary events
- phase14_command_test.go: exact command sequences frozen as proofs
  (primary/replica repeated assignment, assignment changed, repeated failure)
- phase14_boundary_test.go: boundary/recovery structural tests

All tests pass in sw-block/engine/replication.
Phase 14 docs updated (14A accepted, 14B active→14C planned).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:52:55 -07:00
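The named-gate chain from 14A in the commit above might be evaluated along these lines. The gate names and the RF=1 → `allocated_only` rule come from the commit message; the `volumeState` struct, its fields, the gate ordering, and the `publish_healthy` return label are assumptions made for this sketch.

```go
package main

import "fmt"

// volumeState is a hypothetical, flattened view of the core's per-volume state.
type volumeState struct {
	ReplicaFactor     int
	RoleApplied       bool
	ShipperConfigured bool
	BarrierDurable    bool
}

// publicationGate returns the first unmet gate as the reason, or reports
// healthy publication when every gate is satisfied.
func publicationGate(s volumeState) (reason string, healthy bool) {
	switch {
	case s.ReplicaFactor <= 1:
		// With no replicas there is nothing to publish as replicated.
		return "allocated_only", false
	case !s.RoleApplied:
		return "awaiting_role_apply", false
	case !s.ShipperConfigured:
		return "awaiting_shipper_configured", false
	case !s.BarrierDurable:
		return "awaiting_barrier_durability", false
	default:
		return "publish_healthy", true
	}
}

func main() {
	fmt.Println(publicationGate(volumeState{ReplicaFactor: 1}))
	fmt.Println(publicationGate(volumeState{ReplicaFactor: 2, RoleApplied: true}))
	fmt.Println(publicationGate(volumeState{ReplicaFactor: 2, RoleApplied: true, ShipperConfigured: true, BarrierDurable: true}))
}
```

Returning a reason alongside the boolean keeps "why not published" explicit instead of leaving callers to infer it.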
pingqiu
34f42078fb docs: Phase 13 CP13-9 accepted + Phase 14 preparation docs
- phase-13.md: CP13-8/8A/9 accepted with carry-forward
- phase-13-log.md: CP13-9 technical/delivery packs
- phase-13-cp9-mode-normalization.md: minor updates
- v2-protocol-claim-and-evidence.md: CP13-8/8A claims updated,
  constrained-V1-runtime interpretation rule added

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:13:03 -07:00
pingqiu
fb0da91196 feat: start Phase 14 V2 core shell
Make the first V2 core owner explicit in sw-block by freezing Phase 14 docs, mode/readiness/publication semantics, and bounded command emission rules. This turns accepted Phase 13 constraints into executable core behavior without overclaiming live runtime cutover.

Made-with: Cursor
2026-04-03 16:11:38 -07:00
pingqiu
6e1b8efd68 feat: CP13-9 — mode normalization for constrained V1 runtime
Add computed VolumeMode to BlockVolumeEntry with 5 normalized modes:
- allocated_only: RF=1, no replicas (standalone)
- bootstrap_pending: RF>1 but replicas not yet ready (first-write pending)
- publish_healthy: all replicas ready, no transport degradation
- degraded: replication impaired but recoverable
- needs_rebuild: unrecoverable gap, rebuild required

Code changes:
- master_block_registry.go: computeVolumeMode() called from
  recomputeReplicaState(), VolumeMode field on BlockVolumeEntry
- master_server_handlers_block.go: VolumeMode exposed in REST API
- blockapi/types.go: VolumeMode field in VolumeInfo
- testrunner types: VolumeMode for scenario assertions

7 tests prove mode normalization:
- AllocatedOnly, BootstrapPending (2 cases), PublishHealthy,
  Degraded, NeedsRebuild, SurfaceConsistency (transition proof)

Interpretation rule: current integrated tests validate V1 runtime
under V2 constraints, not a completed V2 runtime (Phase 14 scope).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 15:02:50 -07:00
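The five normalized modes listed in the commit above suggest a classification function along the following lines. The mode names are taken from the commit message; the `entry` struct, its fields, and above all the precedence between modes are assumptions, so this should be read as one plausible ordering rather than the actual `computeVolumeMode`.

```go
package main

import "fmt"

// entry is a hypothetical, reduced view of a block volume registry entry.
type entry struct {
	ReplicaFactor     int
	ReadyReplicas     int
	TransportDegraded bool
	NeedsRebuild      bool
}

// computeVolumeMode picks one normalized mode, strongest condition first.
func computeVolumeMode(e entry) string {
	switch {
	case e.ReplicaFactor <= 1:
		return "allocated_only" // standalone, no replicas expected
	case e.NeedsRebuild:
		return "needs_rebuild" // unrecoverable gap dominates everything else
	case e.TransportDegraded:
		return "degraded" // impaired but recoverable
	case e.ReadyReplicas < e.ReplicaFactor-1:
		return "bootstrap_pending" // replicas exist on paper but are not ready
	default:
		return "publish_healthy"
	}
}

func main() {
	fmt.Println(computeVolumeMode(entry{ReplicaFactor: 1}))
	fmt.Println(computeVolumeMode(entry{ReplicaFactor: 2}))
	fmt.Println(computeVolumeMode(entry{ReplicaFactor: 2, ReadyReplicas: 1}))
}
```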
pingqiu
4c7fbefe25 feat: CP13-8 PASSES — real-workload validation on RF=2 sync_all
CP13-8 scenario results on m01/M02 (25Gbps RoCE):
  fsck_ext4:       CLEAN
  file count:      200 (assert_equal PASS)
  checksum match:  MATCH (assert_contains PASS)
  pgbench TPS:     565.69 (assert_greater PASS)
  auto-failover:   10.0.0.1:18480 → 10.0.0.3:18480

Code changes (tester + scenario):
- volume_server_block.go: readiness state, assignment lifecycle cleanup
- block_heartbeat_loop.go: readiness-aware heartbeat reporting
- store_blockvol.go: readiness tracking
- master_server_handlers_block.go: block API handler updates
- cp13-8-real-workload-validation.yaml: redesigned scenario
  (removed block_promote, use natural auto-failover flow,
  bootstrap write before wait_volume_healthy)
- testrunner/actions/devops.go: scenario action improvements
- replica_read_test.go: component-level replica read test

Phase docs: CP13-7 accepted, CP13-8/8A technical packs updated,
design docs updated for protocol closure evidence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 14:24:13 -07:00
pingqiu
334c12664a fix: CP13-8A P0 — post-promote primary refresh with replica addresses
Bug: After failover promotes a replica to primary, the old primary
re-registers via heartbeat as a replica (lower epoch). But the master
never sent an updated Primary assignment to the new primary with the
re-registered replica's addresses. The new primary had 0 shippers →
replication dead. sync_all barrier passed vacuously.

Root cause: upsertServerAsReplica (heartbeat reconciliation) added the
re-registered server to Replicas[] but didn't (a) populate DataAddr/
CtrlAddr from heartbeat info, or (b) trigger a primary assignment
refresh.

Fix:
- master_block_registry.go: upsertServerAsReplica now copies DataAddr/
  CtrlAddr from heartbeat info and sets NeedsPrimaryRefresh flag.
  UpdateFullHeartbeat returns HeartbeatResult with PrimaryRefreshNeeded
  entries. DrainPrimaryRefreshNeeded collects and clears the flag.
- master_block_failover.go: add enqueuePrimaryRefresh — builds a
  Primary assignment with all current replica addresses and enqueues it.
- master_grpc_server.go: heartbeat handler processes PrimaryRefreshNeeded
  entries after UpdateFullHeartbeat.

Gate test: TestPromote_AssignmentHasReplicaAddrs now PASSES —
after promote + re-register, the new primary gets an assignment with
replicaDataAddr=vs1:14260 and replicaAddrs=1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 13:59:43 -07:00
pingqiu
7012383c3f fix: StartReplicaReceiver idempotency guard — skip if already running
P0 bug on real hardware: assignments are re-delivered every heartbeat
cycle (5s). First setupReplicaReceiver succeeds (receiver starts on
deterministic port). Second call fails with "bind: address already in
use" because the listener is already bound. The volume stays permanently
degraded, blocking all RF=2 sync_all replication.

Fix: skip StartReplicaReceiver if v.replRecv is already set. The
receiver only needs to start once per volume lifetime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 13:18:30 -07:00
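The idempotency guard from the commit above can be sketched without real sockets. In this illustration `portTable` is a hypothetical fake that fails a second bind of the same address (mimicking "address already in use"), and the `volume` and `receiver` types are simplified stand-ins; only the "skip if `replRecv` is already set" rule comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// portTable fakes the OS: binding an address twice fails.
type portTable struct{ bound map[string]bool }

func (p *portTable) bind(addr string) error {
	if p.bound[addr] {
		return errors.New("bind: address already in use")
	}
	p.bound[addr] = true
	return nil
}

type receiver struct{ addr string }

type volume struct {
	mu       sync.Mutex
	replRecv *receiver
	ports    *portTable
}

// StartReplicaReceiver starts the receiver once; repeated deliveries of the
// same assignment become no-ops instead of failing on the second bind.
func (v *volume) StartReplicaReceiver(addr string) error {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.replRecv != nil {
		return nil // already running
	}
	if err := v.ports.bind(addr); err != nil {
		return err
	}
	v.replRecv = &receiver{addr: addr}
	return nil
}

func main() {
	v := &volume{ports: &portTable{bound: map[string]bool{}}}
	fmt.Println(v.StartReplicaReceiver("10.0.0.3:14260")) // first delivery binds
	fmt.Println(v.StartReplicaReceiver("10.0.0.3:14260")) // re-delivery is a no-op
}
```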
pingqiu
3da4c19046 fix: CP13-8A — fix malformed replica address in test allocator + add read proof
Investigation result:
- Dual-BlockVol hypothesis: DISPROVEN (one instance per path, correct wiring)
- Root cause: adapter wiring bug in test allocator
  soak_test.go blockVSAllocate returned ReplicaDataAddr = "vs2:9333:14260"
  (server + ":port" where server already has a port → three colons, invalid)
  This caused setupReplicaReceiver to fail silently → no data replicated

Root cause classification: adapter/test-harness bug
- NOT a backend data visibility bug
- NOT a core-rule gap
- The engine read path works correctly (TestSyncAll_FullRoundTrip passes)

Code changes:
- qa_block_soak_test.go: fix allocator to use host:port (not server:port),
  use deterministic FNV-hashed ports matching production ReplicationPorts
- qa_block_cp13_8a_test.go: 2 new integration tests proving replica reads
  work through both ReadLBA and adapter.ReadAt, before and after promotion

Remaining contradiction for CP13-8 scenario on real hardware:
- The production weed cluster uses ReplicationPorts (deterministic) which
  should not have this bug. If CP13-8 still fails on m01/M02, the cause
  is different from this test-harness issue and needs a separate investigation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 11:47:41 -07:00
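The malformed address in the commit above ("vs2:9333:14260") comes from appending a port to a string that already carries one. A hedged sketch of the safer construction, using the standard library's `net.SplitHostPort` and `net.JoinHostPort`, is shown below. The helper names, the port base of 14000, and the range of 1000 are hypothetical; the commit only says the ports are deterministic and FNV-hashed.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net"
	"strconv"
)

// replicationPort derives a stable port from the volume path.
// The base and range are illustrative assumptions.
func replicationPort(volumePath string) int {
	h := fnv.New32a()
	_, _ = h.Write([]byte(volumePath))
	return 14000 + int(h.Sum32()%1000)
}

// replicaDataAddr strips the server's own port before attaching the
// replication port, so the result is always host:port.
func replicaDataAddr(server string, port int) (string, error) {
	host, _, err := net.SplitHostPort(server)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(host, strconv.Itoa(port)), nil
}

// validAddr reports whether addr parses as host:port.
func validAddr(addr string) bool {
	_, _, err := net.SplitHostPort(addr)
	return err == nil
}

func main() {
	addr, err := replicaDataAddr("vs2:9333", 14260)
	fmt.Println(addr, err)                   // vs2:14260 <nil>
	fmt.Println(validAddr("vs2:9333:14260")) // false: too many colons
	fmt.Println(replicationPort("/data/vol1.blk") == replicationPort("/data/vol1.blk"))
}
```

Note that `replicaDataAddr` returns an error when the server string has no port at all, which surfaces the wiring problem instead of failing silently.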
pingqiu
2c305f9e7f fix: CP13-8 — use correct assert params + add pgbench TPS gate
1. assert_contains: change actual/expected to value/contains (matches
   the action implementation in system.go)
2. Add assert_greater for pgbench TPS > 0 after pgbench_run (closes
   the pgbench durability pass criterion in the doc)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 09:17:11 -07:00
pingqiu
d7cd415714 feat: CP13-8 — bounded real-workload validation scenario + envelope
One named workload validation package for RF=2 sync_all:
- Scenario: cp13-8-real-workload-validation.yaml (6 phases)
- ext4 proof: write 200 files → failover → fsck + file count + md5sum diff
- pgbench proof: TPC-B on promoted replica (database durability)
- Disturbance: one bounded failover (kill primary, promote replica)

Workload envelope doc: phase-13-cp8-workload-validation.md
- Named topology, transport, workloads, disturbance, exclusions
- Pass criteria: fsck passes, 200 files, checksums match, pgbench TPS > 0
- Maps each pass criterion to accepted CP13-1..7 semantics
- Explicit non-claims: not rollout approval, not NVMe, not soak, not CP13-9

Reuses existing infrastructure:
- cp85-db-ext4-fsck.yaml pattern (extended with checksums + pgbench)
- benchmark-pgbench.yaml actions (pgbench_init/pgbench_run)

Must run on real hardware (m01/M02). Cannot run in unit test harness.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 08:59:46 -07:00
pingqiu
4f7283b6be fix: registry role-aware failover + devops action + failover scenario update
- master_block_registry.go: minor role-handling fixes
- qa_failover_role_test.go: new failover role test
- testrunner/actions/devops.go: new devops action helpers
- recovery-baseline-failover.yaml: scenario alignment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 08:48:13 -07:00
pingqiu
21ccf06ef3 docs: Phase 13 CP13-1..CP13-7 technical packs, acceptance status, design updates
- phase-13.md: CP13-1 through CP13-6 accepted, CP13-7 active
- phase-13-log.md: full technical + delivery packs for CP13-2..CP13-7
- phase-13-cp4-state-eligibility.md: refined barrier behavior table
  (Disconnected/Degraded as recovery entry points, not eligibility)
- phase-12.md: minor cross-reference updates
- Older phase docs: minor wording alignment
- Design docs: V2 development plan and completion overview updated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 08:48:05 -07:00
pingqiu
1d3fb1f119 fix: CP13-7 rev3 — require NeedsRebuild, not Degraded, after handshake gap
Tighten TestReconnect_GapBeyondRetainedWal_NeedsRebuild assertion from
"NeedsRebuild or Degraded" to strictly "NeedsRebuild". The handshake
R < S path returns NeedsRebuild directly — tolerating Degraded weakened
the proof.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 08:36:00 -07:00
pingqiu
ec63c18438 fix: CP13-7 rev2 — real handshake gap detection, reclassify rebuild test
Two fixes:
1. TestReconnect_GapBeyondRetainedWal_NeedsRebuild: rewritten to test the
   real reconnect handshake gap detection path (R < S in
   reconnectWithHandshake). Sequence: establish sync → disconnect →
   release retention hold via timeout → write + flush to advance WAL past
   replica position → reconnect → handshake detects R=0 < S=9 → NeedsRebuild.
   Log proves: "reconnect: gap too large R=0 H=8 S=9"

2. TestReplicaState_RebuildComplete_ReentersInSync: reclassified from
   primary proof to support evidence (does not start from live NeedsRebuild
   shipper state, but proves rebuild mechanics work end-to-end).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 08:24:56 -07:00
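The gap check in the commit above ("R < S" leads to NeedsRebuild, with the log line "R=0 H=8 S=9") can be written as a small classifier. Only the R < S rule is taken from the commit. Reading R as the replica's flushed LSN, S as the oldest LSN the primary still retains, and H as the primary's head is an interpretation, and the `catch_up` / `in_sync` branches are assumptions added to make the function total.

```go
package main

import "fmt"

// classifyReconnect decides what a reconnecting replica needs.
//   r: replica position (assumed: flushed LSN)
//   s: start of the WAL the primary still retains (assumed)
//   h: primary head (assumed)
func classifyReconnect(r, s, h uint64) string {
	switch {
	case r < s:
		// The entries the replica is missing are no longer retained.
		return "needs_rebuild"
	case r < h:
		return "catch_up" // hypothetical label
	default:
		return "in_sync" // hypothetical label
	}
}

func main() {
	// Values from the log line quoted in the commit message.
	fmt.Println(classifyReconnect(0, 9, 8)) // needs_rebuild
}
```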
pingqiu
88c336b1c1 feat: CP13-7 — NeedsRebuild fail-closed fallback + rebuild handoff proof
Last baseline FAIL closed:
- TestAdversarial_NeedsRebuildBlocksAllPaths: rewritten to use
  EvaluateRetentionBudgets for NeedsRebuild trigger, then asserts
  5 properties: state=NeedsRebuild, Ship drops, Barrier rejects,
  state sticky after failed barrier, second SyncCache still fails

Last baseline PASS* closed:
- TestReconnect_GapBeyondRetainedWal_NeedsRebuild: rewritten with
  hard NeedsRebuild state assertion + SyncCache failure assertion

6 tests promoted to CP13-7 primary proof:
- NeedsRebuildBlocksAllPaths (fail-closed lifecycle)
- GapBeyondRetainedWal (transition)
- HeartbeatReportsNeedsRebuild (visibility)
- RebuildComplete_ReentersInSync (handoff)
- Rebuild_AbortOnEpochChange (epoch safety)
- PostRebuild_FlushedLSN_IsCheckpoint (progress initialization)

Baseline: 43 PASS / 0 FAIL / 1 PASS* (address witness only)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 00:13:33 -07:00
pingqiu
0ce5aa32e9 fix: CP13-6 rev3 — hard hold-release assertion + stale comment cleanup
1. TestWalRetention_TimeoutTriggersNeedsRebuild: add hard assertion that
   checkpoint advances past replicaFlushedLSN after NeedsRebuild (proves
   hold is actually released, not just state transition)
2. TestWalRetention_RequiredReplicaBlocksReclaim: remove stale "EXPECTED
   TO FAIL" / duplicate comment block

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 23:59:44 -07:00
pingqiu
4e55b53bef fix: CP13-6 rev2 — upgrade all 3 retention tests to hard assertions, block-size-aware budget
Three fixes:
1. TestWalRetention_RequiredReplicaBlocksReclaim: rewritten from log-only
   placeholder to hard assertion (checkpointLSN <= replicaFlushedLSN)
2. TestWalRetention_TimeoutTriggersNeedsRebuild: rewritten from log-only
   to hard assertion (State() == NeedsRebuild after 1ns timeout)
3. EvaluateRetentionBudgets: uses RetentionBudgetParams struct with
   actual BlockSize from volume config instead of hardcoded 4096

All 3 retention tests now have real state/progress assertions.
No placeholder or log-only evidence remains in CP13-6 proof package.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 23:45:29 -07:00
pingqiu
0ca57dc2eb feat: CP13-6 — replica-aware WAL retention with max-bytes budget
Add max-bytes retention budget alongside existing timeout budget:
- shipper_group.go: EvaluateRetentionBudgets now checks both timeout
  (last contact time) and max-bytes (entry lag * 4KB > maxBytes).
  Exceeding either budget triggers the NeedsRebuild state transition.
- blockvol.go: add walRetentionMaxBytes (64MB default), pass to
  EvaluateRetentionBudgets with primaryHeadLSN.

TestWalRetention_MaxBytesTriggersNeedsRebuild upgraded from PASS*
(log-only placeholder) to real PASS: asserts State()==NeedsRebuild
after lag exceeds configured max-bytes budget.

Retention contract: hold-back blocks reclaim for recoverable replicas;
the timeout and max-bytes budgets escalate to NeedsRebuild and release
the hold. Full rebuild lifecycle remains CP13-7 scope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 23:10:06 -07:00
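The two-budget check described in the CP13-6 commit above can be sketched as below. The commit computes lag bytes as entry lag times 4 KB, and the later rev2 fix takes the block size from the volume config, so the sketch carries it as a parameter. Only the names EvaluateRetentionBudgets, RetentionBudgetParams, BlockSize, primaryHeadLSN, and the 64MB default come from the commit messages; the struct fields, the `replica` type, and the 30-second timeout are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

type ReplicaState int

const (
	InSync ReplicaState = iota
	NeedsRebuild
)

// RetentionBudgetParams borrows its name from the commit message;
// the field layout here is an assumption.
type RetentionBudgetParams struct {
	Timeout        time.Duration // max time since last contact (0 disables)
	MaxBytes       uint64        // max WAL bytes held back (0 disables)
	BlockSize      uint64        // bytes per WAL entry
	PrimaryHeadLSN uint64        // newest LSN on the primary
}

// replica is a hypothetical view of one replica's retention inputs.
type replica struct {
	State        ReplicaState
	FlushedLSN   uint64
	SinceContact time.Duration
}

// exceedsBudget reports whether either budget is exhausted.
func exceedsBudget(r replica, p RetentionBudgetParams) bool {
	if p.Timeout > 0 && r.SinceContact > p.Timeout {
		return true
	}
	if p.MaxBytes == 0 || p.PrimaryHeadLSN <= r.FlushedLSN {
		return false
	}
	lag := p.PrimaryHeadLSN - r.FlushedLSN
	return lag*p.BlockSize > p.MaxBytes
}

// evaluateRetentionBudgets escalates every over-budget replica to
// NeedsRebuild, which in the real system also releases the WAL hold.
func evaluateRetentionBudgets(replicas []replica, p RetentionBudgetParams) {
	for i := range replicas {
		if replicas[i].State != NeedsRebuild && exceedsBudget(replicas[i], p) {
			replicas[i].State = NeedsRebuild
		}
	}
}

func main() {
	p := RetentionBudgetParams{
		Timeout:        30 * time.Second, // arbitrary value for the demo
		MaxBytes:       64 << 20,         // 64MB default per the commit
		BlockSize:      4096,
		PrimaryHeadLSN: 20000,
	}
	rs := []replica{
		{FlushedLSN: 19900, SinceContact: time.Second}, // within both budgets
		{FlushedLSN: 19900, SinceContact: time.Minute}, // timeout exceeded
		{FlushedLSN: 0, SinceContact: time.Second},     // byte lag exceeded
	}
	evaluateRetentionBudgets(rs, p)
	for i, r := range rs {
		fmt.Printf("replica %d needsRebuild=%v\n", i, r.State == NeedsRebuild)
	}
}
```

Passing the block size in, rather than hardcoding 4096, keeps the byte budget meaningful for volumes configured with a different block size.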
pingqiu
20a1a4995c fix: CP13-5 doc — remove stale CatchingUp transition claim
Replace "observable CatchingUp state transition" with the actual 3
signals the test asserts: seeded hasFlushedProgress, receivedLSN
advance, non-zero replicaFlushedLSN.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 22:59:44 -07:00
pingqiu
4681df6b56 fix: CP13-5 — tighten reconnect proof with observable handshake evidence
Findings fixed:
1. TestAdversarial_ReconnectUsesHandshakeNotBootstrap now has 3 observable
   proof points instead of just "SyncCache succeeded":
   - new shipper HasFlushedProgress=true (seeded from old group)
   - replica receivedLSN advances during SyncCache (catch-up delivered entries)
   - shipper replicaFlushedLSN > 0 after barrier (durable progress established)
   Bootstrap alone would not advance receivedLSN — it only sends the barrier.

2. TestBug2 stale comment removed: "must NOT call SetReplicaAddr" replaced
   with accurate CP13-5 explanation that SetReplicaAddrs now preserves
   hasFlushedProgress across shipper replacement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 22:56:20 -07:00
pingqiu
80be2ec05a feat: CP13-5 — reconnect handshake + WAL catch-up on SetReplicaAddrs
Bug: SetReplicaAddrs created fresh shippers (hasFlushedProgress=false),
so after a disconnect the new shipper used bootstrap instead of the
reconnect handshake. Bootstrap doesn't replay missed WAL entries, so the
barrier hung.

Fix:
- blockvol.go: SetReplicaAddrs checks if old shipper group had durable
  progress (AnyHasFlushedProgress). If so, seeds new shippers with
  hasFlushedProgress=true → they use reconnect handshake + catch-up.
- shipper_group.go: add AnyHasFlushedProgress() helper.

3 baseline FAILs now PASS:
- ReconnectUsesHandshakeNotBootstrap: reconnect path used, not bootstrap
- CatchupMultipleDisconnects: repeated disconnect/reconnect recovers
- CatchupDoesNotOverwriteNewerData: catch-up completes, safety exercised

7 tests promoted to CP13-5 primary proof.
TestAdversarial_NeedsRebuildBlocksAllPaths still FAIL (CP13-7 scope).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 22:39:08 -07:00
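The CP13-5 fix above (seeding replacement shippers from the old group's durable progress) might look roughly like this. It is a sketch under stated assumptions: SetReplicaAddrs, AnyHasFlushedProgress, and hasFlushedProgress are named in the commit, while the `shipper`, `shipperGroup`, and `volume` types and the `connectMode` helper are hypothetical.

```go
package main

import "fmt"

// shipper is a hypothetical stand-in for the real WAL shipper.
type shipper struct {
	addr               string
	hasFlushedProgress bool
}

// connectMode shows which path a shipper would take on its next connect.
func (s *shipper) connectMode() string {
	if s.hasFlushedProgress {
		return "reconnect-handshake"
	}
	return "bootstrap"
}

type shipperGroup struct{ shippers []*shipper }

// AnyHasFlushedProgress reports whether any shipper in the group has
// established durable progress with its replica.
func (g *shipperGroup) AnyHasFlushedProgress() bool {
	for _, s := range g.shippers {
		if s.hasFlushedProgress {
			return true
		}
	}
	return false
}

type volume struct{ group *shipperGroup }

// SetReplicaAddrs replaces the shipper group but carries the durable
// progress flag over, so the new shippers reconnect and catch up instead
// of bootstrapping.
func (v *volume) SetReplicaAddrs(addrs []string) {
	seed := v.group != nil && v.group.AnyHasFlushedProgress()
	g := &shipperGroup{}
	for _, a := range addrs {
		g.shippers = append(g.shippers, &shipper{addr: a, hasFlushedProgress: seed})
	}
	v.group = g
}

func main() {
	v := &volume{}
	v.SetReplicaAddrs([]string{"replica-1:9000"})
	fmt.Println("first connect:", v.group.shippers[0].connectMode())

	// Simulate a successful barrier establishing durable progress.
	v.group.shippers[0].hasFlushedProgress = true

	v.SetReplicaAddrs([]string{"replica-1:9000"})
	fmt.Println("after replacement:", v.group.shippers[0].connectMode())
}
```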
pingqiu
1c294af169 feat: CP13-4 — replica state machine / barrier eligibility contract + proof
Contract review: 6-state set (Disconnected, Connecting, CatchingUp,
InSync, Degraded, NeedsRebuild). Only InSync proceeds to the barrier
request path. All other states either fail immediately or attempt a
reconnect, which must succeed before the barrier is reached.

New test: TestBarrier_NonEligibleStates_FailClosed — systematically
verifies each non-eligible state (Connecting, CatchingUp, NeedsRebuild,
Disconnected) is rejected by Barrier(), and InSync is the only state
that enters the barrier request path.

5 baseline tests promoted to CP13-4 primary proof.
No production code changed — contract review + new focused test only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 22:01:05 -07:00
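The barrier eligibility contract from the CP13-4 commit above can be illustrated with a small gate. The six state names come from the commit message; the `barrier` function, its `send` callback, and `ErrNotEligible` are hypothetical. The sketch also simplifies the contract: it treats any reconnect attempt as already resolved, so every non-InSync state is rejected outright.

```go
package main

import (
	"errors"
	"fmt"
)

type ReplicaState int

const (
	Disconnected ReplicaState = iota
	Connecting
	CatchingUp
	InSync
	Degraded
	NeedsRebuild
)

var stateNames = [...]string{
	"Disconnected", "Connecting", "CatchingUp",
	"InSync", "Degraded", "NeedsRebuild",
}

func (s ReplicaState) String() string { return stateNames[s] }

// ErrNotEligible is an assumed sentinel error for this sketch.
var ErrNotEligible = errors.New("replica not eligible for barrier")

// barrierEligible encodes the contract: only InSync may enter the
// barrier request path.
func barrierEligible(s ReplicaState) bool { return s == InSync }

// barrier fails closed for every non-eligible state and only invokes
// send (the actual barrier request) for InSync.
func barrier(s ReplicaState, send func() error) error {
	if !barrierEligible(s) {
		return fmt.Errorf("%w: state=%s", ErrNotEligible, s)
	}
	return send()
}

func main() {
	send := func() error { return nil }
	for s := Disconnected; s <= NeedsRebuild; s++ {
		fmt.Printf("%-12s -> %v\n", s, barrier(s, send))
	}
}
```

Checking eligibility with a single allow-list predicate, rather than enumerating rejected states, means a state added later is rejected by default.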
pingqiu
d4ff6b482b fix: CP13-3 test — exercise real shipper.Barrier() against legacy server
The previous test only checked wire decode + fresh shipper state, never
calling shipper.Barrier() against a legacy response source.

New test runs a fake TCP control server that responds with a 1-byte
BarrierOK (no FlushedLSN). Shipper.Barrier() is called against it and
must return an error containing "no FlushedLSN". Verifies the real
rejection path at wal_shipper.go:229-231.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:47:58 -07:00
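The legacy-response rejection exercised by the CP13-3 test above can be sketched as a decode step. The real wire format is not shown in the commit, so the layout here is purely an assumption: one status byte, followed by an 8-byte big-endian FlushedLSN in the new format. The `barrierOK` value and the `decodeBarrierResponse` function are hypothetical; only the "no FlushedLSN" error text is taken from the commit message.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// barrierOK is an assumed status code; the real value is not in the source.
const barrierOK byte = 0

// decodeBarrierResponse parses a barrier reply under the assumed layout
// [status:1][flushedLSN:8, big-endian]. A legacy 1-byte BarrierOK carries
// no durable progress, so it is rejected rather than treated as success.
func decodeBarrierResponse(buf []byte) (uint64, error) {
	if len(buf) == 0 {
		return 0, errors.New("empty barrier response")
	}
	if buf[0] != barrierOK {
		return 0, fmt.Errorf("barrier failed: status=%d", buf[0])
	}
	if len(buf) < 1+8 {
		return 0, errors.New("barrier response has no FlushedLSN")
	}
	return binary.BigEndian.Uint64(buf[1:9]), nil
}

func main() {
	// Legacy server: status byte only.
	_, err := decodeBarrierResponse([]byte{barrierOK})
	fmt.Println("legacy:", err)

	// Current server: status byte plus FlushedLSN.
	reply := make([]byte, 9)
	reply[0] = barrierOK
	binary.BigEndian.PutUint64(reply[1:], 1234)
	lsn, err := decodeBarrierResponse(reply)
	fmt.Println("current:", lsn, err)
}
```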