pingqiu
26a1b33c2e
feat: add A5-A8 acceptance traceability and rebuild-source evidence
Cleanup: removed redundant TargetLSNAtStart from CatchUpBudget.
FrozenTargetLSN on RecoverySession is the single source of truth.
Acceptance traceability (acceptance_test.go):
- A5: 3 evidence tests (unrecoverable gap, budget escalation, frozen target)
- A6: 2 evidence tests (exact boundary, contiguity required)
- A7: 3 evidence tests (snapshot history, catch-up replay, truncation)
- A8: 2 evidence tests (convergence required, truncation required)
Rebuild-source decision evidence:
- snapshot_tail when trusted base exists
- full_base when no snapshot or untrusted
- 3 explicit tests
13 new tests total.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:42:48 -07:00
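The rebuild-source rule recorded in commit 26a1b33c2e (snapshot_tail when a trusted base exists, full_base otherwise) can be sketched as below. This is a minimal illustration, not the enginev2 code: the function name, its two boolean parameters, and the Go type are assumptions, and only the two source names come from the commit message.

```go
package main

// rebuildSource names where a rebuild takes its base data from.
// The two string values are the ones named in the commit message.
type rebuildSource string

const (
	snapshotTail rebuildSource = "snapshot_tail"
	fullBase     rebuildSource = "full_base"
)

// selectRebuildSource is a hypothetical helper. It prefers snapshot_tail
// only when a snapshot exists AND is trusted; every other combination
// falls back to full_base.
func selectRebuildSource(hasSnapshot, trusted bool) rebuildSource {
	if hasSnapshot && trusted {
		return snapshotTail
	}
	return fullBase
}
```

Treating "no snapshot" and "untrusted snapshot" identically keeps the decision a single predicate, which matches the two-way split the commit describes.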
pingqiu
8f5070679c
fix: make frozen target intrinsic and rebuild completion exclusive
Frozen target is now unconditional:
- FrozenTargetLSN field on RecoverySession, set by BeginCatchUp
- RecordCatchUpProgress enforces FrozenTargetLSN regardless of Budget
- Catch-up is always a bounded (R, H0] contract
Rebuild completion exclusivity:
- CompleteSessionByID explicitly rejects SessionRebuild by kind
- Rebuild sessions can ONLY complete via CompleteRebuild
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:30:17 -07:00
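The completion exclusivity described in commit 8f5070679c can be pictured roughly as follows. The sketch assumes a simplified session struct and free functions; the real CompleteSessionByID and CompleteRebuild are Sender methods whose signatures are not shown in this log, so every type, field, and error value here is hypothetical.

```go
package main

import "errors"

// sessionKind distinguishes catch-up sessions from rebuild sessions
// (illustrative stand-in for the SessionRebuild kind in the log).
type sessionKind int

const (
	sessionCatchUp sessionKind = iota
	sessionRebuild
)

// session is a simplified, hypothetical recovery session.
type session struct {
	id        uint64
	kind      sessionKind
	completed bool
}

var (
	errStaleSession     = errors.New("session ID does not match the active session")
	errRebuildExclusive = errors.New("rebuild sessions must complete via the rebuild API")
	errNotRebuild       = errors.New("session is not a rebuild session")
)

// completeSessionByID rejects rebuild sessions by kind, so the generic
// completion path can never finish a rebuild.
func completeSessionByID(s *session, id uint64) error {
	if s == nil || s.id != id {
		return errStaleSession
	}
	if s.kind == sessionRebuild {
		return errRebuildExclusive
	}
	s.completed = true
	return nil
}

// completeRebuild is the only path that may finish a rebuild session.
func completeRebuild(s *session, id uint64) error {
	if s == nil || s.id != id {
		return errStaleSession
	}
	if s.kind != sessionRebuild {
		return errNotRebuild
	}
	s.completed = true
	return nil
}
```

Checking the session ID before the kind means a stale caller learns nothing about the current session beyond the fact that it no longer owns it.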
pingqiu
8e4028758f
fix: make rebuild path exclusive, enforce phase discipline, require tick for stall budget
Rebuild exclusivity:
- BeginCatchUp rejects SessionRebuild ("must use rebuild APIs")
- RecordCatchUpProgress rejects SessionRebuild
- Rebuild sessions can only be completed via CompleteRebuild
- All legacy rebuild-through-catch-up paths in tests converted
Phase discipline:
- SelectRebuildSource requires session.Phase == PhaseHandshake
- Cannot skip BeginConnect + RecordHandshake
Stall budget:
- RecordCatchUpProgress requires tick parameter when
ProgressDeadlineTicks > 0 (no silent stall budget bypass)
3 new tests: rebuild exclusivity (catch-up APIs rejected),
rebuild source requires handshake phase, stall budget requires tick.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:21:39 -07:00
pingqiu
5b66a85f92
fix: wire rebuild FSM into sender, enforce frozen target, fix entry counting
Rebuild execution path:
- newRecoverySession auto-initializes RebuildState for SessionRebuild
- Sender rebuild APIs: SelectRebuildSource, BeginRebuildTransfer,
RecordRebuildTransferProgress, BeginRebuildTailReplay,
RecordRebuildTailProgress, CompleteRebuild
- All rebuild APIs are sender-authority-gated by sessionID
- E2E rebuild test now drives through rebuild FSM, not catch-up APIs
Bounded CatchUp enforcement:
- BeginCatchUp freezes TargetLSNAtStart from session.TargetLSN
- RecordCatchUpProgress rejects progress beyond frozen target
- Entry counting uses LSN delta (recoveredTo - previous), not call count
- Merged RecordCatchUpProgressAt into RecordCatchUpProgress (tick param)
5 new tests: target-frozen enforcement, sender-level rebuild via
rebuild APIs, reject non-rebuild, reject stale ID on rebuild.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:16:56 -07:00
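Two of the catch-up rules from commit 5b66a85f92, rejecting progress beyond the frozen target and counting entries by LSN delta rather than by call count, might look like the sketch below. The struct, its field names, and the error values are hypothetical. Rejecting non-advancing progress is an added assumption that the log does not state for catch-up.

```go
package main

import "errors"

// catchUpSession is a simplified, hypothetical view of a catch-up session.
type catchUpSession struct {
	frozenTargetLSN uint64 // upper bound fixed when catch-up began
	recoveredTo     uint64 // highest LSN replayed so far
	entriesReplayed uint64 // consumed entry budget
}

var (
	errBeyondTarget = errors.New("progress beyond frozen target")
	errNoProgress   = errors.New("progress must advance recoveredTo") // assumption
)

// recordProgress applies one progress report. Entries are counted as
// (recoveredTo - previous), so a single call that replays many entries
// consumes the same budget as many small calls.
func (s *catchUpSession) recordProgress(recoveredTo uint64) error {
	if recoveredTo > s.frozenTargetLSN {
		return errBeyondTarget
	}
	if recoveredTo <= s.recoveredTo {
		return errNoProgress
	}
	s.entriesReplayed += recoveredTo - s.recoveredTo
	s.recoveredTo = recoveredTo
	return nil
}
```

For example, a session with frozen target 10 that starts at LSN 4 and reports 7 and then 10 has replayed 6 entries in two calls; a further report of 11 is rejected and leaves the counters unchanged.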
pingqiu
3f0048cbd9
feat: add bounded CatchUp budget and Rebuild mode state machine (Phase 4.5 P0)
Bounded CatchUp:
- CatchUpBudget: MaxDurationTicks, MaxEntries, ProgressDeadlineTicks
- BudgetCheck: runtime consumption tracker (StartTick, EntriesReplayed, LastProgressTick)
- Sender.CheckBudget: evaluates budget, escalates to NeedsRebuild on violation
- RecordCatchUpProgressAt: tracks progress tick for stall detection
- BeginCatchUp accepts optional startTick for budget tracking
Rebuild state machine:
- RebuildSource: snapshot_tail (preferred) vs full_base (fallback)
- RebuildPhase: init → source_select → transfer → tail_replay → completed|aborted
- SelectSource: chooses based on snapshot availability
- Phase ordering enforced, transfer regression rejected
- ReadyToComplete validates target reached
13 new tests: budget enforcement (duration, entries, stall, no-budget),
sender budget integration, rebuild lifecycle (snapshot+tail, full base,
abort, phase order, regression), E2E bounded catch-up → rebuild.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 14:33:06 -07:00
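A budget evaluation along the lines of commit 3f0048cbd9 could be sketched as follows. The struct and field names are taken from the commit message. The uint64 tick type, the treatment of zero as "limit disabled", the strict-greater-than comparisons, and the returned reason strings are all assumptions; the actual Sender.CheckBudget also escalates to NeedsRebuild, which is left out here.

```go
package main

// CatchUpBudget holds the three limits named in the commit message.
// Assumption: a zero value disables that limit.
type CatchUpBudget struct {
	MaxDurationTicks      uint64
	MaxEntries            uint64
	ProgressDeadlineTicks uint64
}

// BudgetCheck tracks runtime consumption against a budget.
type BudgetCheck struct {
	StartTick        uint64
	EntriesReplayed  uint64
	LastProgressTick uint64
}

// violation returns an empty string while the budget holds, or a
// hypothetical reason string for the first limit that is exceeded.
func violation(b CatchUpBudget, c BudgetCheck, now uint64) string {
	// Guard each subtraction so unsigned arithmetic cannot wrap around.
	if b.MaxDurationTicks > 0 && now > c.StartTick &&
		now-c.StartTick > b.MaxDurationTicks {
		return "duration_exceeded"
	}
	if b.MaxEntries > 0 && c.EntriesReplayed > b.MaxEntries {
		return "entries_exceeded"
	}
	if b.ProgressDeadlineTicks > 0 && now > c.LastProgressTick &&
		now-c.LastProgressTick > b.ProgressDeadlineTicks {
		return "progress_stalled"
	}
	return ""
}
```

A caller in this sketch would treat any non-empty result as the trigger to abandon catch-up and escalate the replica to a rebuild.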
pingqiu
90c39b549d
feat: add prototype scenario closure (Phase 04 P4)
Maps V2 acceptance criteria A1-A7, A10 to enginev2 prototype evidence.
Adds 4 V2-boundary scenarios against the prototype.
Scenario tests:
- A1: committed data survives promotion (WAL truncation boundary)
- A2: uncommitted data truncated, not revived
- A3: stale epoch fenced at sender + session + assignment layers
- A4: short-gap catch-up with WAL-backed proof + data verification
- A5: unrecoverable gap escalates to NeedsRebuild with proof
- A6: recoverability boundary exact (tail +/- 1 LSN)
- A7: historical data correct after tail advancement (snapshot)
- A10: changed-address → invalidation → new assignment → recovery
V2-boundary scenarios:
- NeedsRebuild persists across topology update
- catch-up does not overwrite safe data
- 5 disconnect/reconnect cycles preserve sender identity
- full V2 harness: 3 replicas, 3 outcomes (zero-gap, catch-up, rebuild)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 11:31:56 -07:00
pingqiu
942a0b7da7
fix: strengthen IsRecoverable contiguity check and StateAt snapshot correctness
IsRecoverable now verifies three conditions:
- startExclusive >= tailLSN (not recycled)
- endInclusive <= headLSN (within WAL)
- all LSNs in range exist contiguously (no holes)
StateAt now uses base snapshot captured during AdvanceTail:
- returns nil for LSNs before snapshot boundary (unreconstructable)
- correctly includes block state from recycled entries via snapshot
5 new tests: end-beyond-head, missing entries, state after tail
advance, nil before snapshot, block last written before tail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 18:52:11 -07:00
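The three IsRecoverable conditions listed in commit 942a0b7da7 translate fairly directly into code. The sketch below assumes a map-backed history in which every LSN at or below tailLSN has been recycled; the struct layout and field names are hypothetical and are not the WALHistory implementation.

```go
package main

// walHistory is a hypothetical stand-in for the WAL history model.
type walHistory struct {
	tailLSN uint64          // assumption: LSNs <= tailLSN are recycled
	headLSN uint64          // highest LSN appended
	entries map[uint64]bool // LSNs currently retained
}

// isRecoverable reports whether every entry in the half-open range
// (startExclusive, endInclusive] can still be replayed from the WAL.
func (w *walHistory) isRecoverable(startExclusive, endInclusive uint64) bool {
	// 1. The range must not begin before the retained tail.
	if startExclusive < w.tailLSN {
		return false
	}
	// 2. The range must not extend past the head of the WAL.
	if endInclusive > w.headLSN {
		return false
	}
	// 3. Every LSN in the range must exist: no holes.
	for lsn := startExclusive + 1; lsn <= endInclusive; lsn++ {
		if !w.entries[lsn] {
			return false
		}
	}
	return true
}
```

With tailLSN 3 and headLSN 6, the range (3, 6] is recoverable only if entries 4, 5 and 6 are all present; (2, 6] fails the tail check and (3, 7] fails the head check.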
pingqiu
c89709e47e
feat: add WAL history model and recoverability proof (Phase 04 P3)
Adds minimal historical-data prototype to enginev2:
- WALHistory: retained-prefix model with Append, Commit, AdvanceTail,
Truncate, EntriesInRange, IsRecoverable, StateAt
- MakeHandshakeResult connects WAL state to outcome classification
- RecordTruncation execution API for divergent tail cleanup
- CompleteSessionByID gates on truncation when required
- Zero-gap requires exact equality (FlushedLSN == CommittedLSN)
- Replica-ahead classified as CatchUp with mandatory truncation
15 new tests: WAL basics, provable recoverability, unprovable gap,
exact boundary, truncation enforcement, WAL-backed end-to-end
recovery with data verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 11:29:27 -07:00
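The outcome classification that commit c89709e47e attributes to MakeHandshakeResult can be sketched as a four-way switch. Everything named here is an assumption made for illustration: the function signature, the outcome strings, the result struct, and the use of a callback in place of the WAL recoverability check.

```go
package main

// outcome labels are hypothetical; the log names the three branches as
// zero-gap, catch-up, and needs-rebuild.
type outcome string

const (
	zeroGap      outcome = "zero_gap"
	catchUp      outcome = "catch_up"
	needsRebuild outcome = "needs_rebuild"
)

// handshakeResult is a simplified, hypothetical result type.
type handshakeResult struct {
	Outcome          outcome
	TruncateRequired bool
}

// classify compares the replica's flushed LSN with the primary's
// committed LSN. recoverable stands in for a WAL check over the range
// (startExclusive, endInclusive].
func classify(flushedLSN, committedLSN uint64,
	recoverable func(startExclusive, endInclusive uint64) bool) handshakeResult {
	switch {
	case flushedLSN == committedLSN:
		// Zero-gap requires exact equality.
		return handshakeResult{Outcome: zeroGap}
	case flushedLSN > committedLSN:
		// Replica is ahead: the divergent tail must be truncated.
		return handshakeResult{Outcome: catchUp, TruncateRequired: true}
	case recoverable(flushedLSN, committedLSN):
		return handshakeResult{Outcome: catchUp}
	default:
		// The gap cannot be proven recoverable from retained WAL.
		return handshakeResult{Outcome: needsRebuild}
	}
}
```

Ordering the equality case first keeps zero-gap strict: a replica one LSN ahead or behind never takes the fast path.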
pingqiu
edec7098e8
feat: add V2 protocol simulator and enginev2 sender/session prototype
Adds sw-block/ directory with:
- distsim: protocol correctness simulator (96 tests)
- cluster model with epoch fencing, barrier semantics, commit modes
- endpoint identity, control-plane flow, candidate eligibility
- timeout events, timer races, same-tick ordering
- session ownership tracking with ID-based stale fencing
- enginev2: standalone V2 sender/session implementation (63 tests)
- per-replica Sender with identity-preserving reconciliation
- RecoverySession with FSM phase transitions and session ID
- execution APIs: BeginConnect, RecordHandshake, BeginCatchUp,
RecordCatchUpProgress, CompleteSessionByID — all sender-authority-gated
- recovery outcome branching: zero-gap, catch-up, needs-rebuild
- assignment-intent orchestration with epoch fencing
- design docs: acceptance criteria, open questions, first-slice spec,
protocol development process
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 10:38:27 -07:00