Commit Graph

8 Commits

Author SHA1 Message Date
pingqiu
44103a1bd7 feat: Phase 20 acceptance fixes + sw-test-runner suite mode
Acceptance rows closed:
- WriteLBA/SyncCache contract: code comments document write-back vs
  durability fence semantics
- RF=2 stable identity: v2bridge always uses SetReplicaAddrs (preserves
  ServerID); blockcmd dispatcher also fixed to use setupPrimaryReplicationMulti;
  test asserts exact expected ReplicaID="vs-2" (not just non-empty)
- Tests treating WriteLBA as commit: replica_read_test rewritten with
  SyncCache as durability fence
- publish_healthy contract: 3 gate tests with hard assertions including
  gate 3 (PrimaryShipperConnected)
- SetReplicaAddr deprecation warning added
- WALShipper.ReplicaID() getter added for identity verification

Test runner enhancements:
- sw-test-runner suite command: build → deploy → run N scenarios in one
  invocation with --skip-deploy support
- Suite YAML definitions for T6 Stage 0 and Stage 1
- deploy action: kill stale processes, clean dirs, cross-compile, upload
- run-phase20-t6.ps1 PowerShell script (deprecated by suite command)

Engine/runtime fixes:
- Recovery executor nil-safety improvements
- Recovery bundle BuildRecoveryBundle defensive checks
- ShipperGroup MinReplicaFlushedLSNAll surface

Docs: acceptance checklist refined, test matrix updated, T6 runbook,
engine maintainer tutorial, design README updated.

26 files changed, ~1600 insertions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 11:30:54 -07:00
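The write-back versus durability-fence distinction named in the WriteLBA/SyncCache acceptance row above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the project's implementation: the `Volume` type, its two maps, and the `Crash` helper are hypothetical, and only the method names `WriteLBA` and `SyncCache` come from the commit message.

```go
package main

import "fmt"

// Volume is a hypothetical in-memory model: writes land in a volatile
// cache and only become durable once SyncCache has run.
type Volume struct {
	cache   map[uint64][]byte // acknowledged but not yet durable
	durable map[uint64][]byte // survives a simulated crash
}

func NewVolume() *Volume {
	return &Volume{cache: map[uint64][]byte{}, durable: map[uint64][]byte{}}
}

// WriteLBA acknowledges the write after buffering it (write-back).
// It makes no durability promise in this model.
func (v *Volume) WriteLBA(lba uint64, data []byte) {
	v.cache[lba] = append([]byte(nil), data...)
}

// SyncCache is the durability fence: everything written before it
// is moved to durable storage.
func (v *Volume) SyncCache() {
	for lba, data := range v.cache {
		v.durable[lba] = data
		delete(v.cache, lba)
	}
}

// Crash drops whatever was never fenced by SyncCache.
func (v *Volume) Crash() {
	v.cache = map[uint64][]byte{}
}

// ReadLBA serves the newest data, checking the cache first.
func (v *Volume) ReadLBA(lba uint64) ([]byte, bool) {
	if data, ok := v.cache[lba]; ok {
		return data, true
	}
	data, ok := v.durable[lba]
	return data, ok
}

func main() {
	v := NewVolume()
	v.WriteLBA(7, []byte("unfenced"))
	v.Crash()
	_, ok := v.ReadLBA(7)
	fmt.Println("survived without SyncCache:", ok)

	v.WriteLBA(7, []byte("fenced"))
	v.SyncCache()
	v.Crash()
	_, ok = v.ReadLBA(7)
	fmt.Println("survived with SyncCache:", ok)
}
```

A test that asserts durability right after `WriteLBA` would pass against this model only until a crash is injected, which is why a fence call belongs before any durability assertion.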
pingqiu
16ba70f856 refactor: make bounded recovery observation events replica-scoped
Carry replica-scoped addressing through bounded recovery planning and
completion events so the core no longer depends on a volume-only
observation seam. This preserves the current single-replica catch-up and
rebuilding behavior while aligning the observation side with the
replica-scoped command path.

Made-with: Cursor
2026-04-04 09:18:07 -07:00
pingqiu
b304b8e212 refactor: make bounded recovery command addressing replica-scoped
Replace the remaining volume-scoped recovery command and pending slot
with replica-scoped addressing on the bounded core-present path. This
preserves the current single-replica catch-up and rebuilding behavior
while removing the structural blocker for later multi-replica startup
ownership.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:05:36 -07:00
pingqiu
ded84b25e6 refactor: Batch 4 steps 2+3 — rebuild status port + recovery bundle factory
Step 2: Rebuild completion status port
- New runtime.RebuildCompletionStatus + DeriveRebuildCommitted:
  reusable shaping logic for post-rebuild snapshot → RebuildCommitted event
- block_recovery.go OnRebuildCompleted: delegates to DeriveRebuildCommitted,
  host only reads raw snapshot via readRebuildStatus (thin binding)
- Removed 15 lines of inline flushedLSN/checkpointLSN/achievedLSN computation

Step 3: Recovery bundle factory
- New buildRecoveryBundle: shared host-side setup for both catch-up and rebuild
  (creates Reader + Pinner + StorageAdapter + Executor + RecoveryDriver)
- runCatchUp and runRebuild both use buildRecoveryBundle instead of
  duplicating the WithVolume → NewReader → NewPinner → NewStorageAdapter →
  NewExecutor → RecoveryDriver chain
- runCatchUp/runRebuild are now thin host-shell methods

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:32:34 -07:00
pingqiu
0bcfc678d0 refactor: Batch 4 step 1 — typed PendingExecution, zero type assertions
Replace interface{} fields in runtime.PendingExecution with typed handles:
- Driver: *engine.RecoveryDriver (was interface{})
- Plan: *engine.RecoveryPlan (was interface{})
- CatchUpIO: engine.CatchUpIO (was interface{})
- RebuildIO: engine.RebuildIO (was interface{})

block_recovery.go:
- ExecutePendingCatchUp/Rebuild: direct field access (pe.Driver, pe.Plan)
  instead of type assertions (pe.Driver.(*engine.RecoveryDriver))
- CancelFunc: pe.Driver.CancelPlan(pe.Plan, reason) — no casts
- 6 type assertions removed from production path

Test files: remove Plan type assertions — fields are typed end-to-end.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:27:29 -07:00
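The move from `interface{}` fields to typed handles described in Batch 4 step 1 above can be sketched side by side. The types below are simplified stand-ins, not the real `engine` or `runtime` definitions: `untypedPending`, `typedPending`, and both cancel helpers are hypothetical names, and `RecoveryDriver.CancelPlan` here only records the reason.

```go
package main

import "fmt"

// Simplified stand-ins for engine.RecoveryPlan and engine.RecoveryDriver.
type RecoveryPlan struct{ TargetLSN uint64 }

type RecoveryDriver struct{ cancelled []string }

// CancelPlan records the cancellation reason in this sketch.
func (d *RecoveryDriver) CancelPlan(p *RecoveryPlan, reason string) {
	d.cancelled = append(d.cancelled, reason)
}

// untypedPending mirrors the old shape: every use needs a type assertion.
type untypedPending struct {
	Driver interface{}
	Plan   interface{}
}

// typedPending mirrors the new shape: the compiler checks the field types.
type typedPending struct {
	Driver *RecoveryDriver
	Plan   *RecoveryPlan
}

// cancelUntyped must assert both fields and handle failure at run time.
func cancelUntyped(pe untypedPending, reason string) error {
	d, ok := pe.Driver.(*RecoveryDriver)
	if !ok {
		return fmt.Errorf("driver has unexpected type %T", pe.Driver)
	}
	p, ok := pe.Plan.(*RecoveryPlan)
	if !ok {
		return fmt.Errorf("plan has unexpected type %T", pe.Plan)
	}
	d.CancelPlan(p, reason)
	return nil
}

// cancelTyped needs no casts; a wrong type cannot be stored at all.
func cancelTyped(pe typedPending, reason string) {
	pe.Driver.CancelPlan(pe.Plan, reason)
}

func main() {
	plan := &RecoveryPlan{TargetLSN: 42}

	// A wrongly typed value compiles with interface{} and fails at run time.
	err := cancelUntyped(untypedPending{Driver: "not a driver", Plan: plan}, "shutdown")
	fmt.Println("untyped error:", err)

	d := &RecoveryDriver{}
	cancelTyped(typedPending{Driver: d, Plan: plan}, "shutdown")
	fmt.Println("typed cancellations:", d.cancelled)
}
```

With the typed struct, the mistake shown in `main` becomes a compile error rather than a run-time branch that production code has to handle.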
pingqiu
3a5fbbfded fix: Batch 3 wiring — production path uses runtime helpers, legacy isolated
H wiring: block_recovery.go now uses runtime.PendingCoordinator
- Removed local pendingRecoveryExecution type + store/take/peek/has/cancel
- ExecutePendingCatchUp/Rebuild delegate to coord.TakeCatchUp/TakeRebuild
- Shutdown uses coord.CancelAll
- Added CancelAll to PendingCoordinator

I wiring: executeCatchUpPlan/executeRebuildPlan replaced
- ExecutePendingCatchUp now calls rt.ExecuteCatchUpPlan with RecoveryManager
  as RecoveryCallbacks (OnCatchUpCompleted/OnRebuildCompleted)
- ExecutePendingRebuild follows same pattern
- Local executeCatchUpPlan/executeRebuildPlan methods removed

J structural: legacy no-core branches extracted
- executeLegacyCatchUp: wraps rt.ExecuteCatchUpPlan for v2Core==nil path
- executeLegacyRebuild: wraps rt.ExecuteRebuildPlan for v2Core==nil path
- Clear "LEGACY NO-CORE COMPATIBILITY" section with structural separation
- runCatchUp/runRebuild now branch cleanly: legacy helper vs core coordinator

Test updates: pendingRecoveryExecution → rt.PendingExecution, field casing,
Plan type assertions.

Validation: all P4, P16B, and ApplyAssignments tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:20:41 -07:00
pingqiu
e200df7791 feat: Task I — recovery execution helpers extracted to sw-block runtime
New reusable execution helpers in sw-block/engine/replication/runtime:
- ExecuteCatchUpPlan: drives catch-up execution, notifies host via callback
- ExecuteRebuildPlan: drives rebuild execution, notifies host via callback
- RecoveryCallbacks interface: host-side OnCatchUpCompleted/OnRebuildCompleted

The host (weed/server/block_recovery.go) supplies concrete IO bindings and
receives completion notifications. The reusable execution logic no longer
requires weed/server ownership.

4 tests prove boundary behavior:
- catch-up callback receives achievedLSN matching plan target
- catch-up with plan-derived target works correctly
- rebuild callback receives plan reference
- nil callbacks don't panic

weed/server rebinding to use these helpers deferred to Task J
(legacy isolation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 01:03:37 -07:00
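The callback boundary described in Task I above (the runtime drives execution, the host is notified on completion, and nil callbacks are tolerated) might look roughly like the following. This is a sketch under assumptions: the `Plan` and `CatchUpIO` shapes, the `StreamTo` method, and the exact signatures are hypothetical; only the names `ExecuteCatchUpPlan`, `RecoveryCallbacks`, `OnCatchUpCompleted`, and `OnRebuildCompleted` appear in the commit message.

```go
package main

import "fmt"

// Plan is a hypothetical stand-in for the recovery plan.
type Plan struct{ TargetLSN uint64 }

// CatchUpIO is a hypothetical host-supplied IO binding.
type CatchUpIO interface {
	// StreamTo ships entries up to targetLSN and reports the LSN reached.
	StreamTo(targetLSN uint64) (achievedLSN uint64, err error)
}

// RecoveryCallbacks is implemented by the host to receive completions.
// The parameter lists are assumptions made for this sketch.
type RecoveryCallbacks interface {
	OnCatchUpCompleted(plan *Plan, achievedLSN uint64)
	OnRebuildCompleted(plan *Plan)
}

// ExecuteCatchUpPlan drives the catch-up and notifies the host if a
// callback receiver was supplied. A nil interface value is skipped;
// note that this check does not catch a typed nil pointer.
func ExecuteCatchUpPlan(plan *Plan, cio CatchUpIO, cb RecoveryCallbacks) error {
	achieved, err := cio.StreamTo(plan.TargetLSN)
	if err != nil {
		return err
	}
	if cb != nil {
		cb.OnCatchUpCompleted(plan, achieved)
	}
	return nil
}

// fakeIO pretends every catch-up reaches its target.
type fakeIO struct{}

func (fakeIO) StreamTo(targetLSN uint64) (uint64, error) { return targetLSN, nil }

// recordingHost captures what the runtime reported.
type recordingHost struct {
	achieved uint64
	rebuilt  bool
}

func (h *recordingHost) OnCatchUpCompleted(_ *Plan, achievedLSN uint64) { h.achieved = achievedLSN }
func (h *recordingHost) OnRebuildCompleted(_ *Plan)                     { h.rebuilt = true }

func main() {
	plan := &Plan{TargetLSN: 100}
	host := &recordingHost{}

	if err := ExecuteCatchUpPlan(plan, fakeIO{}, host); err != nil {
		fmt.Println("catch-up failed:", err)
		return
	}
	fmt.Println("host saw achieved LSN:", host.achieved)

	// Passing nil callbacks is allowed and must not panic.
	fmt.Println("nil callbacks error:", ExecuteCatchUpPlan(plan, fakeIO{}, nil))
}
```

Keeping the IO binding and the callbacks as interfaces is what lets the execution helper live outside the host package: the helper depends only on behavior the host supplies.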
pingqiu
6fea93e821 feat: Task H — PendingCoordinator extracted to sw-block/engine/replication/runtime
New reusable pending-execution coordinator with fail-closed command matching:
- Store/TakeCatchUp/TakeRebuild/Cancel/Has/Peek
- TakeCatchUp: fail-closed on target LSN mismatch (cancel + return nil)
- TakeRebuild: same fail-closed semantics
- Cancel callback invoked on mismatch or explicit cancellation

9 tests prove boundary behavior:
- match succeeds, mismatch cancels, explicit cancel, noop on empty,
  peek non-destructive, store replaces, take from empty

No weed/ imports. Pure coordination logic reusable by any adapter shell.
weed/server/block_recovery.go rebinding deferred to Task I.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 00:59:10 -07:00
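The fail-closed matching described in Task H above can be sketched as a small coordinator. This is an illustration under assumptions rather than the actual `PendingCoordinator`: the string key, the `PendingExecution` fields, and the choice to silently overwrite on `Store` are hypothetical; the commit message only names the operations and states that a target LSN mismatch cancels the entry and returns nil.

```go
package main

import (
	"fmt"
	"sync"
)

// PendingExecution is a hypothetical stand-in for stored recovery work.
type PendingExecution struct {
	TargetLSN uint64
	Cancel    func(reason string) // invoked on mismatch or explicit cancel
}

// PendingCoordinator keeps at most one pending execution per key.
type PendingCoordinator struct {
	mu      sync.Mutex
	pending map[string]*PendingExecution
}

func NewPendingCoordinator() *PendingCoordinator {
	return &PendingCoordinator{pending: make(map[string]*PendingExecution)}
}

// Store replaces any existing entry; this sketch simply overwrites it.
func (c *PendingCoordinator) Store(key string, pe *PendingExecution) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending[key] = pe
}

// Has reports whether an entry is pending, without consuming it.
func (c *PendingCoordinator) Has(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.pending[key]
	return ok
}

// take removes and returns the entry for key, or nil if none exists.
func (c *PendingCoordinator) take(key string) *PendingExecution {
	c.mu.Lock()
	defer c.mu.Unlock()
	pe := c.pending[key]
	delete(c.pending, key)
	return pe
}

// TakeCatchUp is fail-closed: the entry is always consumed, and on a
// target LSN mismatch it is cancelled and nil is returned.
func (c *PendingCoordinator) TakeCatchUp(key string, targetLSN uint64) *PendingExecution {
	pe := c.take(key)
	if pe == nil {
		return nil
	}
	if pe.TargetLSN != targetLSN {
		if pe.Cancel != nil {
			pe.Cancel("target LSN mismatch") // called outside the lock
		}
		return nil
	}
	return pe
}

// Cancel removes the entry and invokes its cancel callback; no-op if empty.
func (c *PendingCoordinator) Cancel(key, reason string) {
	if pe := c.take(key); pe != nil && pe.Cancel != nil {
		pe.Cancel(reason)
	}
}

func main() {
	c := NewPendingCoordinator()
	onCancel := func(reason string) { fmt.Println("cancelled:", reason) }

	c.Store("vol-1", &PendingExecution{TargetLSN: 100, Cancel: onCancel})
	fmt.Println("mismatched take returned nil:", c.TakeCatchUp("vol-1", 90) == nil)
	fmt.Println("entry still pending:", c.Has("vol-1"))

	c.Store("vol-1", &PendingExecution{TargetLSN: 100, Cancel: onCancel})
	fmt.Println("matching take returned entry:", c.TakeCatchUp("vol-1", 100) != nil)
}
```

Consuming the entry even on mismatch is the fail-closed part: a stale command can never start work that was planned for a different target, and the cancel callback gives the owner a chance to release whatever the plan had pinned.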