9 Commits

Author SHA1 Message Date
pingqiu
26a1b33c2e feat: add A5-A8 acceptance traceability and rebuild-source evidence
Cleanup: removed redundant TargetLSNAtStart from CatchUpBudget.
FrozenTargetLSN on RecoverySession is the single source of truth.

Acceptance traceability (acceptance_test.go):
- A5: 3 evidence tests (unrecoverable gap, budget escalation, frozen target)
- A6: 2 evidence tests (exact boundary, contiguity required)
- A7: 3 evidence tests (snapshot history, catch-up replay, truncation)
- A8: 2 evidence tests (convergence required, truncation required)

Rebuild-source decision evidence:
- snapshot_tail when trusted base exists
- full_base when no snapshot or untrusted
- 3 explicit tests
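
The rebuild-source rule above can be sketched in a few lines of Go. This is an illustrative sketch only: `chooseRebuildSource` and its two boolean inputs are hypothetical names, not the enginev2 API; only the `snapshot_tail` and `full_base` labels come from the commit message.

```go
package main

import "fmt"

// RebuildSource uses the two labels named in the commit message.
type RebuildSource string

const (
	SnapshotTail RebuildSource = "snapshot_tail"
	FullBase     RebuildSource = "full_base"
)

// chooseRebuildSource is a hypothetical helper: it prefers snapshot_tail
// only when a snapshot exists AND it is trusted; otherwise it falls back
// to full_base.
func chooseRebuildSource(hasSnapshot, trusted bool) RebuildSource {
	if hasSnapshot && trusted {
		return SnapshotTail
	}
	return FullBase
}

func main() {
	fmt.Println(chooseRebuildSource(true, true))   // trusted base exists
	fmt.Println(chooseRebuildSource(true, false))  // snapshot is untrusted
	fmt.Println(chooseRebuildSource(false, false)) // no snapshot at all
}
```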

13 new tests total.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:42:48 -07:00
pingqiu
8f5070679c fix: make frozen target intrinsic and rebuild completion exclusive
Frozen target is now unconditional:
- FrozenTargetLSN field on RecoverySession, set by BeginCatchUp
- RecordCatchUpProgress enforces FrozenTargetLSN regardless of Budget
- Catch-up is always a bounded (R, H0] contract

Rebuild completion exclusivity:
- CompleteSessionByID explicitly rejects SessionRebuild by kind
- Rebuild sessions can ONLY complete via CompleteRebuild

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:30:17 -07:00
pingqiu
8e4028758f fix: make rebuild path exclusive, enforce phase discipline, require tick for stall budget
Rebuild exclusivity:
- BeginCatchUp rejects SessionRebuild ("must use rebuild APIs")
- RecordCatchUpProgress rejects SessionRebuild
- Rebuild sessions can only be completed via CompleteRebuild
- All legacy rebuild-through-catch-up paths in tests converted

Phase discipline:
- SelectRebuildSource requires session.Phase == PhaseHandshake
- Cannot skip BeginConnect + RecordHandshake

Stall budget:
- RecordCatchUpProgress requires tick parameter when
  ProgressDeadlineTicks > 0 (no silent stall budget bypass)

3 new tests: rebuild exclusivity (catch-up APIs rejected),
rebuild source requires handshake phase, stall budget requires tick.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:21:39 -07:00
pingqiu
5b66a85f92 fix: wire rebuild FSM into sender, enforce frozen target, fix entry counting
Rebuild execution path:
- newRecoverySession auto-initializes RebuildState for SessionRebuild
- Sender rebuild APIs: SelectRebuildSource, BeginRebuildTransfer,
  RecordRebuildTransferProgress, BeginRebuildTailReplay,
  RecordRebuildTailProgress, CompleteRebuild
- All rebuild APIs are sender-authority-gated by sessionID
- E2E rebuild test now drives through rebuild FSM, not catch-up APIs

Bounded CatchUp enforcement:
- BeginCatchUp freezes TargetLSNAtStart from session.TargetLSN
- RecordCatchUpProgress rejects progress beyond frozen target
- Entry counting uses LSN delta (recoveredTo - previous), not call count
- Merged RecordCatchUpProgressAt into RecordCatchUpProgress (tick param)
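
A minimal sketch of the frozen-target and LSN-delta counting rules listed above. The `session` type, its fields, and both function names are hypothetical stand-ins for the real RecoverySession and sender APIs. Rejecting backwards progress is an assumption added here so the unsigned subtraction cannot underflow.

```go
package main

import (
	"errors"
	"fmt"
)

// session is a hypothetical, stripped-down stand-in for a recovery session.
type session struct {
	frozenTarget uint64 // fixed when catch-up begins, never moved afterwards
	recoveredTo  uint64 // highest LSN recovered so far
	entries      uint64 // entries replayed, counted as LSN deltas
}

var (
	errBeyondTarget = errors.New("progress beyond frozen target")
	errRegression   = errors.New("progress moved backwards")
)

// beginCatchUp freezes the target at the moment catch-up starts.
func beginCatchUp(startLSN, targetLSN uint64) *session {
	return &session{frozenTarget: targetLSN, recoveredTo: startLSN}
}

// recordProgress rejects progress past the frozen target and counts
// entries by LSN delta rather than by number of calls.
func (s *session) recordProgress(recoveredTo uint64) error {
	if recoveredTo > s.frozenTarget {
		return errBeyondTarget
	}
	if recoveredTo < s.recoveredTo { // assumption: regressions are rejected
		return errRegression
	}
	s.entries += recoveredTo - s.recoveredTo
	s.recoveredTo = recoveredTo
	return nil
}

func main() {
	s := beginCatchUp(40, 100)
	err := s.recordProgress(70)
	fmt.Println(err, s.entries)
	err = s.recordProgress(100)
	fmt.Println(err, s.entries)
	err = s.recordProgress(101)
	fmt.Println(err, s.entries)
}
```

With this shape, one call that advances 30 LSNs consumes the same entry budget as 30 calls that advance one LSN each.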

5 new tests: target-frozen enforcement, sender-level rebuild via
rebuild APIs, reject non-rebuild, reject stale ID on rebuild.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 15:16:56 -07:00
pingqiu
3f0048cbd9 feat: add bounded CatchUp budget and Rebuild mode state machine (Phase 4.5 P0)
Bounded CatchUp:
- CatchUpBudget: MaxDurationTicks, MaxEntries, ProgressDeadlineTicks
- BudgetCheck: runtime consumption tracker (StartTick, EntriesReplayed, LastProgressTick)
- Sender.CheckBudget: evaluates budget, escalates to NeedsRebuild on violation
- RecordCatchUpProgressAt: tracks progress tick for stall detection
- BeginCatchUp accepts optional startTick for budget tracking
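
The budget evaluation described above might look roughly like the following. The field names are taken from the commit message, but the `uint64` types, the `violation` function, the strict `>` comparisons, and the treatment of a zero limit as "disabled" are all assumptions made for this sketch.

```go
package main

import "fmt"

// CatchUpBudget holds the configured limits; a zero limit is treated as
// disabled in this sketch (an assumption).
type CatchUpBudget struct {
	MaxDurationTicks      uint64
	MaxEntries            uint64
	ProgressDeadlineTicks uint64
}

// BudgetCheck tracks runtime consumption against the budget.
type BudgetCheck struct {
	StartTick        uint64
	EntriesReplayed  uint64
	LastProgressTick uint64
}

// elapsed returns now-since, clamped at zero to avoid unsigned underflow.
func elapsed(now, since uint64) uint64 {
	if now < since {
		return 0
	}
	return now - since
}

// violation is a hypothetical helper: it returns the name of the first
// exceeded limit, or "" when the session is within budget. A caller
// would escalate to NeedsRebuild on any non-empty result.
func violation(b CatchUpBudget, c BudgetCheck, now uint64) string {
	switch {
	case b.MaxDurationTicks > 0 && elapsed(now, c.StartTick) > b.MaxDurationTicks:
		return "duration"
	case b.MaxEntries > 0 && c.EntriesReplayed > b.MaxEntries:
		return "entries"
	case b.ProgressDeadlineTicks > 0 && elapsed(now, c.LastProgressTick) > b.ProgressDeadlineTicks:
		return "stall"
	}
	return ""
}

func main() {
	b := CatchUpBudget{MaxDurationTicks: 100, MaxEntries: 50, ProgressDeadlineTicks: 10}
	c := BudgetCheck{StartTick: 5, EntriesReplayed: 20, LastProgressTick: 30}
	fmt.Printf("%q\n", violation(b, c, 35)) // within all limits
	fmt.Printf("%q\n", violation(b, c, 45)) // no progress for 15 ticks
}
```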

Rebuild state machine:
- RebuildSource: snapshot_tail (preferred) vs full_base (fallback)
- RebuildPhase: init → source_select → transfer → tail_replay → completed|aborted
- SelectSource: chooses based on snapshot availability
- Phase ordering enforced, transfer regression rejected
- ReadyToComplete validates target reached
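
As a rough illustration of the phase ordering and regression rules above, the sketch below models the phases as an ordered enum. The type and method names are hypothetical. Allowing an abort from any non-terminal phase, and requiring every phase to be visited in order, are assumptions that the commit message does not confirm.

```go
package main

import (
	"errors"
	"fmt"
)

type phase int

// Phases in the order given by the commit message.
const (
	phaseInit phase = iota
	phaseSourceSelect
	phaseTransfer
	phaseTailReplay
	phaseCompleted
	phaseAborted
)

type rebuild struct {
	phase       phase
	transferred uint64 // transfer progress so far
}

var (
	errTerminal   = errors.New("rebuild already completed or aborted")
	errOrder      = errors.New("phase transition out of order")
	errWrongPhase = errors.New("not in transfer phase")
	errRegression = errors.New("transfer progress regressed")
)

// advance permits only the next phase in sequence, or an abort.
func (r *rebuild) advance(next phase) error {
	if r.phase == phaseCompleted || r.phase == phaseAborted {
		return errTerminal
	}
	if next == phaseAborted { // assumption: abort allowed from any live phase
		r.phase = next
		return nil
	}
	if next != r.phase+1 {
		return errOrder
	}
	r.phase = next
	return nil
}

// recordTransfer rejects progress that moves backwards.
func (r *rebuild) recordTransfer(to uint64) error {
	if r.phase != phaseTransfer {
		return errWrongPhase
	}
	if to < r.transferred {
		return errRegression
	}
	r.transferred = to
	return nil
}

func main() {
	r := &rebuild{}
	fmt.Println(r.advance(phaseTransfer)) // skipping source_select is rejected
	fmt.Println(r.advance(phaseSourceSelect))
	fmt.Println(r.advance(phaseTransfer))
	fmt.Println(r.recordTransfer(50))
	fmt.Println(r.recordTransfer(40)) // regression is rejected
}
```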

13 new tests: budget enforcement (duration, entries, stall, no-budget),
sender budget integration, rebuild lifecycle (snapshot+tail, full base,
abort, phase order, regression), E2E bounded catch-up → rebuild.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 14:33:06 -07:00
pingqiu
90c39b549d feat: add prototype scenario closure (Phase 04 P4)
Maps V2 acceptance criteria A1-A7, A10 to enginev2 prototype evidence.
Adds 4 V2-boundary scenarios against the prototype.

Scenario tests:
- A1: committed data survives promotion (WAL truncation boundary)
- A2: uncommitted data truncated, not revived
- A3: stale epoch fenced at sender + session + assignment layers
- A4: short-gap catch-up with WAL-backed proof + data verification
- A5: unrecoverable gap escalates to NeedsRebuild with proof
- A6: recoverability boundary exact (tail +/- 1 LSN)
- A7: historical data correct after tail advancement (snapshot)
- A10: changed-address → invalidation → new assignment → recovery

V2-boundary scenarios:
- NeedsRebuild persists across topology update
- catch-up does not overwrite safe data
- 5 disconnect/reconnect cycles preserve sender identity
- full V2 harness: 3 replicas, 3 outcomes (zero-gap, catch-up, rebuild)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 11:31:56 -07:00
pingqiu
942a0b7da7 fix: strengthen IsRecoverable contiguity check and StateAt snapshot correctness
IsRecoverable now verifies three conditions:
- startExclusive >= tailLSN (not recycled)
- endInclusive <= headLSN (within WAL)
- all LSNs in range exist contiguously (no holes)
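
The three conditions above can be sketched as follows. The `wal` struct and its map-based storage are hypothetical simplifications, not the WALHistory implementation. The sketch assumes `tailLSN` is the highest recycled LSN, so retained entries start at `tailLSN + 1`.

```go
package main

import "fmt"

// wal is a hypothetical, simplified retained-prefix model.
type wal struct {
	tailLSN  uint64          // highest recycled LSN (assumption)
	headLSN  uint64          // highest appended LSN
	retained map[uint64]bool // LSNs still present in the log
}

// isRecoverable reports whether every entry in (startExclusive, endInclusive]
// can be replayed from the retained log.
func (w *wal) isRecoverable(startExclusive, endInclusive uint64) bool {
	if startExclusive < w.tailLSN {
		return false // range starts in recycled territory
	}
	if endInclusive > w.headLSN {
		return false // range extends past the head of the WAL
	}
	for lsn := startExclusive + 1; lsn <= endInclusive; lsn++ {
		if !w.retained[lsn] {
			return false // hole: contiguity is required
		}
	}
	return true
}

func main() {
	w := &wal{tailLSN: 10, headLSN: 20, retained: map[uint64]bool{}}
	for lsn := uint64(11); lsn <= 20; lsn++ {
		w.retained[lsn] = true
	}
	fmt.Println(w.isRecoverable(10, 20)) // exactly at the tail boundary
	fmt.Println(w.isRecoverable(9, 20))  // one LSN before the tail
	fmt.Println(w.isRecoverable(10, 21)) // one LSN past the head
	delete(w.retained, 15)
	fmt.Println(w.isRecoverable(10, 20)) // hole at LSN 15
}
```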

StateAt now uses base snapshot captured during AdvanceTail:
- returns nil for LSNs before snapshot boundary (unreconstructable)
- correctly includes block state from recycled entries via snapshot

5 new tests: end-beyond-head, missing entries, state after tail
advance, nil before snapshot, block last written before tail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 18:52:11 -07:00
pingqiu
c89709e47e feat: add WAL history model and recoverability proof (Phase 04 P3)
Adds minimal historical-data prototype to enginev2:

- WALHistory: retained-prefix model with Append, Commit, AdvanceTail,
  Truncate, EntriesInRange, IsRecoverable, StateAt
- MakeHandshakeResult connects WAL state to outcome classification
- RecordTruncation execution API for divergent tail cleanup
- CompleteSessionByID gates on truncation when required
- Zero-gap requires exact equality (FlushedLSN == CommittedLSN)
- Replica-ahead classified as CatchUp with mandatory truncation
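
One way to read the classification rules above is sketched below. The `classify` function, the outcome strings, and the `recoverable` flag (standing in for an IsRecoverable proof over the gap) are hypothetical. Only the zero-gap equality rule and the replica-ahead rule come directly from the commit message; the handling of a replica that is behind is an assumption based on the outcome names used elsewhere in this log.

```go
package main

import "fmt"

// classify is a hypothetical sketch of handshake outcome classification.
// It returns the outcome and whether truncation is mandatory.
func classify(flushedLSN, committedLSN uint64, recoverable bool) (string, bool) {
	switch {
	case flushedLSN == committedLSN:
		return "zero_gap", false // exact equality only
	case flushedLSN > committedLSN:
		return "catch_up", true // replica ahead: divergent tail must be truncated
	case recoverable:
		return "catch_up", false // replica behind, gap provably replayable (assumption)
	default:
		return "needs_rebuild", false // replica behind, gap not provable (assumption)
	}
}

func main() {
	outcome, truncate := classify(100, 100, true)
	fmt.Println(outcome, truncate)
	outcome, truncate = classify(105, 100, true)
	fmt.Println(outcome, truncate)
	outcome, truncate = classify(90, 100, true)
	fmt.Println(outcome, truncate)
	outcome, truncate = classify(90, 100, false)
	fmt.Println(outcome, truncate)
}
```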

15 new tests: WAL basics, provable recoverability, unprovable gap,
exact boundary, truncation enforcement, WAL-backed end-to-end
recovery with data verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 11:29:27 -07:00
pingqiu
edec7098e8 feat: add V2 protocol simulator and enginev2 sender/session prototype
Adds sw-block/ directory with:

- distsim: protocol correctness simulator (96 tests)
  - cluster model with epoch fencing, barrier semantics, commit modes
  - endpoint identity, control-plane flow, candidate eligibility
  - timeout events, timer races, same-tick ordering
  - session ownership tracking with ID-based stale fencing

- enginev2: standalone V2 sender/session implementation (63 tests)
  - per-replica Sender with identity-preserving reconciliation
  - RecoverySession with FSM phase transitions and session ID
  - execution APIs: BeginConnect, RecordHandshake, BeginCatchUp,
    RecordCatchUpProgress, CompleteSessionByID — all sender-authority-gated
  - recovery outcome branching: zero-gap, catch-up, needs-rebuild
  - assignment-intent orchestration with epoch fencing

- design docs: acceptance criteria, open questions, first-slice spec,
  protocol development process

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 10:38:27 -07:00