Commit Graph

24 Commits

Author SHA1 Message Date
pingqiu
c7eb87c587 feat: Phase 09 — V2 execution primitives and production closure
Engine execution layer for V2 replication protocol:
- RebuildInstaller: full state handoff (dirty map, WAL, superblock, flusher)
- TruncateToLSN: exact safety predicate (checkpointLSN == truncateLSN),
  ErrTruncationUnsafe escalation to NeedsRebuild
- SyncReceiverProgress: unconditional Store for post-rebuild alignment
- V2StatusSnapshot: CommittedLSN = nextLSN-1 for sync_all

V2 bridge real I/O executors:
- TransferFullBase: TCP streaming + RebuildInstaller + second catch-up
- TransferSnapshot: SHA-256 verified streaming to disk
- TruncateWAL: ErrTruncationUnsafe detection + escalation
- StreamWALEntries: rebuild-mode TCP apply

Engine executor interfaces:
- CatchUpIO.TruncateWAL, RebuildIO.TransferFullBase returns achievedLSN
- CatchUpExecutor truncation-only skip, NeedsRebuild escalation
- RebuildExecutor uses achievedLSN for progress tracking

Design docs reorganized: superseded planning docs removed, protocol
truths and closure map added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:25:23 -07:00
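
The truncation safety predicate and the sync_all rule described in c7eb87c587 can be sketched roughly as follows. This is a minimal illustration, not the engine's code: the function names `truncateToLSN`, `needsRebuild`, and `committedLSN` are hypothetical, and only `ErrTruncationUnsafe`, the predicate `checkpointLSN == truncateLSN`, and `CommittedLSN = nextLSN-1` come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTruncationUnsafe reuses the error name from the commit message; this
// definition is an assumption.
var ErrTruncationUnsafe = errors.New("truncation unsafe")

// truncateToLSN permits truncation only when the checkpoint sits exactly at
// the truncation point (checkpointLSN == truncateLSN).
func truncateToLSN(checkpointLSN, truncateLSN uint64) error {
	if checkpointLSN != truncateLSN {
		return ErrTruncationUnsafe
	}
	return nil
}

// needsRebuild reports whether a truncation attempt must escalate to a rebuild.
func needsRebuild(checkpointLSN, truncateLSN uint64) bool {
	return errors.Is(truncateToLSN(checkpointLSN, truncateLSN), ErrTruncationUnsafe)
}

// committedLSN applies the sync_all rule CommittedLSN = nextLSN-1.
// Returning 0 for an empty log is an assumption of this sketch.
func committedLSN(nextLSN uint64) uint64 {
	if nextLSN == 0 {
		return 0
	}
	return nextLSN - 1
}

func main() {
	fmt.Println(needsRebuild(100, 100)) // false
	fmt.Println(needsRebuild(90, 100))  // true
	fmt.Println(committedLSN(101))      // 100
}
```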
pingqiu
1578adfba5 fix: wire real v2bridge I/O into engine executors (Phase 08 P2 closure)
Engine executors now have IO interfaces for real bridge I/O:
- CatchUpExecutor.IO (CatchUpIO): StreamWALEntries
- RebuildExecutor.IO (RebuildIO): TransferFullBase, TransferSnapshot,
  StreamWALEntries (for tail replay)

When IO is set, executor calls real bridge I/O during execution.
When IO is nil, executor uses caller-supplied progress (test mode).

RecoveryPlan.CatchUpStartLSN: bound at plan time for IO bridge.

v2bridge.Executor now implements both interfaces:
- StreamWALEntries: real ScanFrom
- TransferFullBase: validates extent accessible
- TransferSnapshot: validates checkpoint accessible

Chain tests wire IO:
- CatchUpClosure: exec.IO = executor → real WAL scan through engine
- RebuildClosure: exec.IO = executor → real transfer through engine

This closes the engine → executor → v2bridge → blockvol chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 15:10:50 -07:00
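
The "IO set vs. IO nil" behaviour from 1578adfba5 might look like the sketch below. The method signature on `CatchUpIO`, the `fakeBridge` type, and the `Execute` parameters are assumptions made for illustration; the commit only names the interface, the `IO` field, and `StreamWALEntries`.

```go
package main

import "fmt"

// CatchUpIO is named after the interface in the commit; this one-method
// signature is an assumption.
type CatchUpIO interface {
	StreamWALEntries(startLSN, endLSN uint64) (achievedLSN uint64, err error)
}

// fakeBridge is a hypothetical stand-in for v2bridge.Executor.
type fakeBridge struct{ calls int }

func (b *fakeBridge) StreamWALEntries(startLSN, endLSN uint64) (uint64, error) {
	b.calls++
	return endLSN, nil
}

// CatchUpExecutor uses real I/O when IO is set and caller-supplied progress
// when IO is nil (test mode).
type CatchUpExecutor struct {
	IO CatchUpIO
}

// Execute returns the LSN reached by this catch-up run.
func (e *CatchUpExecutor) Execute(startLSN, targetLSN, callerProgress uint64) (uint64, error) {
	if e.IO == nil {
		return callerProgress, nil
	}
	return e.IO.StreamWALEntries(startLSN, targetLSN)
}

func main() {
	testMode := &CatchUpExecutor{}
	got, err := testMode.Execute(10, 20, 15)
	fmt.Println(got, err) // 15 <nil>

	bridge := &fakeBridge{}
	wired := &CatchUpExecutor{IO: bridge}
	got, err = wired.Execute(10, 20, 15)
	fmt.Println(got, err, bridge.calls) // 20 <nil> 1
}
```

One caveat of this shape in Go: assigning a typed nil pointer to `IO` yields a non-nil interface value, so test mode should leave the field unset rather than set it to a nil bridge.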
pingqiu
4df61f290b fix: true mid-executor invalidation test via OnStep hook
CatchUpExecutor.OnStep: optional callback fired between executor-managed
progress steps. Enables deterministic fault injection (epoch bump)
between steps without racing or manual sender calls.

E2_EpochBump_MidExecutorLoop:
- Executor runs 5 progress steps
- OnStep hook bumps epoch after step 1 (zero-indexed, so after 2 successful steps)
- Executor's own loop detects invalidation at the step 2 check
- Resources released by executor's release path (not manual cancel)
- Log shows session_invalidated + exec_resources_released

This closes the remaining FC2 gap: invalidation is now detected
and cleaned up by the executor itself, not by external code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 15:51:21 -07:00
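
The OnStep fault-injection pattern from 4df61f290b can be illustrated with a small loop. Everything here except the `OnStep` name and the two log event names is hypothetical (the `executor` type, its epoch fields, and `run`); the sketch only shows how a hook fired between steps lets a test bump the epoch deterministically.

```go
package main

import (
	"errors"
	"fmt"
)

var errSessionInvalidated = errors.New("session_invalidated")

// executor is a hypothetical, simplified stand-in for CatchUpExecutor.
type executor struct {
	sessionEpoch uint64    // epoch the session was created at
	currentEpoch *uint64   // live epoch, shared with the test
	OnStep       func(int) // optional hook fired after each successful step
	log          []string
}

// run executes up to steps progress steps, checking session validity before
// each one and releasing resources on every exit path.
func (e *executor) run(steps int) (completed int, err error) {
	defer func() { e.log = append(e.log, "exec_resources_released") }()
	for i := 0; i < steps; i++ {
		if *e.currentEpoch != e.sessionEpoch {
			e.log = append(e.log, "session_invalidated")
			return completed, errSessionInvalidated
		}
		completed++ // the progress step itself
		if e.OnStep != nil {
			e.OnStep(i)
		}
	}
	return completed, nil
}

func main() {
	epoch := uint64(1)
	e := &executor{sessionEpoch: 1, currentEpoch: &epoch}
	e.OnStep = func(step int) {
		if step == 1 { // bump after the second successful step
			epoch++
		}
	}
	n, err := e.run(5)
	fmt.Println(n, err, e.log)
	// 2 session_invalidated [session_invalidated exec_resources_released]
}
```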
pingqiu
5b63d34d6b fix: snapshot+tail WAL pin failure cleanup + true mid-executor epoch test
Finding 1: PlanRebuild snapshot+tail WAL pin failure now fail-closed
- InvalidateSession("wal_pin_failed_during_rebuild", StateNeedsRebuild)
- Snapshot pin released, session invalidated, no dangling state
- New test: E2_RebuildWALPinFailure_SessionCleaned

Finding 2: True mid-executor invalidation test
- Executor makes 2 successful progress steps (60, 70)
- Epoch bumps BETWEEN steps (real mid-execution)
- Third progress step fails — session invalidated
- Resources released via executor cancel
- New test: E2_EpochBump_AfterExecutorProgress

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 15:44:21 -07:00
pingqiu
332f598606 fix: close P3 failure classes — session cleanup, causal logging, CancelPlan
Finding 1: PlanRebuild now invalidates session on pin failure
- FullBasePin failure → InvalidateSession("full_base_pin_failed", StateNeedsRebuild)
- SnapshotPin failure → InvalidateSession("snapshot_pin_failed", StateNeedsRebuild)
- No dangling rebuild session after resource acquisition failure

Finding 2: Rebuild source logging shows causal reason
- plan_rebuild_full_base now logs: untrusted_checkpoint,
  trusted_checkpoint_unreplayable_tail, or no_checkpoint

Finding 3: CancelPlan for address-change cleanup
- New RecoveryDriver.CancelPlan(plan, reason): releases resources +
  invalidates session + logs plan_cancelled with reason
- Changed-address test uses CancelPlan (not manual ReleasePlan)

Finding 4: Executor-level epoch-bump test
- Executor's mid-step invalidation detection catches stale session
- Resources released via executor release path, not manual cancel

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:28:57 -07:00
pingqiu
56afa55f13 feat: add P3 failure-class validation through planner/executor (Phase 06)
6 new tests (validation_test.go) mapped to tester expectations E1-E5:

E1/FC1: Changed-address restart through planner/executor
- Active session invalidated by address change
- Sender identity preserved, old plan resources released
- Log shows: endpoint_changed → new session → plan → execute

E2/FC2: Epoch bump mid-execution step
- Partial progress, epoch bumps between steps
- Further progress rejected, executor cancels with resource release
- Log shows: session_invalidated + exec_resources_released

E3/FC5: Cross-layer proof — trusted base + unreplayable tail
- Storage: checkpoint=50, tail=80 → unreplayable
- RebuildSourceDecision → FullBase (not SnapshotTail)
- FullBasePin acquired, executed through RebuildExecutor, released
- Log shows: plan_rebuild_full_base (observable reason)

E4/FC8: Rebuild fallback when trusted-base proof fails
- Untrusted checkpoint → full-base, full-base pin fails → error
- Untrusted checkpoint → full-base, full-base pin succeeds → InSync
- Log shows: full_base_pin_failed

E5: Observability — full recovery chain logged
- Verifies 7 required log events from assignment through completion

Delivery template:
Changed contracts: P3 validates the planner/executor path, not the convenience wrappers
Fail-closed: epoch bump mid-step releases resources + logs cause
Resources: cross-layer proof chain validated end-to-end
Carry-forward: FC3/FC4/FC6/FC7 sufficient from prior phases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:17:24 -07:00
pingqiu
f5c0aab454 fix: rebuild executor consumes bound plan, fix catch-up timing
Planner/executor contract:
- RebuildExecutor.Execute() takes no arguments — consumes plan-bound
  RebuildSource, RebuildSnapshotLSN, RebuildTargetLSN
- RecoveryPlan binds all rebuild targets at plan time
- Executor cannot re-derive policy from caller-supplied history

Catch-up timing:
- Removed unused completeTick parameter from CatchUpExecutor.Execute
- Per-step ticks synthesized as startTick + stepIndex + 1
- API shape matches implementation

New test: PlanExecuteConsistency_RebuildCannotSwitchSource
- Plans snapshot+tail, then mutates storage history
- Executor succeeds using plan-bound values (not re-derived)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 13:33:52 -07:00
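
A rough sketch of the two contracts in f5c0aab454: the executor acts only on values bound at plan time, and per-step ticks are synthesized as `startTick + stepIndex + 1`. The `history` type, `planRebuild`, `stepTicks`, the `"snapshot_tail"` string, and the return type of `Execute` are assumptions; the `RecoveryPlan` field names follow the commit message.

```go
package main

import "fmt"

// history is a hypothetical stand-in for mutable storage state.
type history struct{ checkpointLSN, committedLSN uint64 }

// RecoveryPlan binds all rebuild targets at plan time.
type RecoveryPlan struct {
	RebuildSource      string
	RebuildSnapshotLSN uint64
	RebuildTargetLSN   uint64
}

// planRebuild copies values out of history, so later mutations cannot leak in.
func planRebuild(h *history) RecoveryPlan {
	return RecoveryPlan{
		RebuildSource:      "snapshot_tail",
		RebuildSnapshotLSN: h.checkpointLSN,
		RebuildTargetLSN:   h.committedLSN,
	}
}

// RebuildExecutor holds the bound plan; Execute takes no arguments.
type RebuildExecutor struct{ plan RecoveryPlan }

// Execute returns the values the executor would act on.
func (e *RebuildExecutor) Execute() RecoveryPlan { return e.plan }

// stepTicks synthesizes one tick per step: startTick + stepIndex + 1.
func stepTicks(startTick uint64, steps int) []uint64 {
	ticks := make([]uint64, steps)
	for i := range ticks {
		ticks[i] = startTick + uint64(i) + 1
	}
	return ticks
}

func main() {
	h := &history{checkpointLSN: 50, committedLSN: 100}
	exec := &RebuildExecutor{plan: planRebuild(h)}
	h.checkpointLSN, h.committedLSN = 0, 500 // mutate history after planning
	used := exec.Execute()
	fmt.Println(used.RebuildSnapshotLSN, used.RebuildTargetLSN) // 50 100
	fmt.Println(stepTicks(100, 3))                              // [101 102 103]
}
```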
pingqiu
50442acb2e feat: add stepwise executor with release symmetry (Phase 06 P2)
New: executor.go — CatchUpExecutor + RebuildExecutor
Replaces convenience wrappers with stepwise execution that owns
resource lifecycle on every exit path.

CatchUpExecutor.Execute:
  1. BeginCatchUp (freezes target)
  2. Stepwise RecordCatchUpProgress + CheckBudget per step
  3. RecordTruncation (if required)
  4. CompleteSessionByID
  5. Release resources (success or failure)

RebuildExecutor.Execute:
  1. BeginConnect + RecordHandshake
  2. SelectRebuildFromHistory
  3. BeginRebuildTransfer + progress
  4. BeginRebuildTailReplay + progress (snapshot+tail)
  5. CompleteRebuild
  6. Release resources (success or failure)

Both executors:
- Release all pins on every exit path (success, failure, cancellation)
- Check session validity mid-execution (detect epoch bump / endpoint change)
- Log resource release with causal reason

14 new tests (executor_test.go), mapped to tester expectations:
- E1: Partial catch-up failure releases WAL pin (2 tests)
- E2: Partial rebuild failure releases all pins (1 test)
- E3: Epoch bump / cancel releases resources (3 tests)
- E4: Successful execution releases resources (2 tests)
- E5: Stepwise not convenience (2 tests)

Delivery template:
Changed contracts: executor owns resource lifecycle (not caller)
Fail-closed: session check mid-execution, release on every error
Resources: WAL/snapshot/full-base pins released on all exit paths
Carry-forward: CompleteCatchUp/CompleteRebuild remain test-only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 13:24:37 -07:00
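
The release-symmetry idea in 50442acb2e (resources released on every exit path, with a causal reason) maps naturally onto a Go `defer`. The sketch below is an illustration under assumptions: `planResources`, `execute`, `errBudget`, and the reason strings are hypothetical names, not the engine's API.

```go
package main

import (
	"errors"
	"fmt"
)

var errBudget = errors.New("budget_exceeded")

// planResources is a hypothetical holder for pins acquired at plan time.
type planResources struct {
	walPinned bool
	released  []string // one entry per release, recording the reason
}

func (r *planResources) release(reason string) {
	r.walPinned = false
	r.released = append(r.released, reason)
}

// execute runs the steps in order. The deferred release fires on success and
// on failure alike, so no exit path can leave a pin behind.
func execute(r *planResources, steps []func() error) (err error) {
	defer func() {
		reason := "completed"
		if err != nil {
			reason = err.Error()
		}
		r.release(reason)
	}()
	for _, step := range steps {
		if err = step(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ok := &planResources{walPinned: true}
	fmt.Println(execute(ok, []func() error{func() error { return nil }}), ok.released)
	// <nil> [completed]

	bad := &planResources{walPinned: true}
	fmt.Println(execute(bad, []func() error{func() error { return errBudget }}), bad.released)
	// budget_exceeded [budget_exceeded]
}
```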
pingqiu
45bf111ce8 fix: derive WAL pin from actual replay need, PlanRebuild fails closed
WAL pin tied to actual recovery contract:
- Truncation-only (replica ahead): no WAL pin acquired
- Real catch-up: pins from replicaFlushedLSN (actual replay start)
- Logs distinguish plan_truncate_only from plan_catchup

PlanRebuild precondition checks:
- Error on missing sender
- Error on no active session
- Error on non-rebuild session kind
- All fail closed with clear error messages

4 new tests:
- ReplicaAhead_NoWALPin: truncation-only, no WAL resources
- PlanRebuild_MissingSender: returns error
- PlanRebuild_NoSession: returns error
- PlanRebuild_NonRebuildSession: returns error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:51:38 -07:00
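
The pin-derivation rule in 45bf111ce8 can be summarized as a three-way decision on the replica's flushed LSN. This is a simplified sketch: the `plan` struct and `planRecovery` signature are hypothetical, and `plan_zero_gap` is an invented label for the zero-gap case (only `plan_truncate_only` and `plan_catchup` appear in the commit message).

```go
package main

import "fmt"

// plan is a hypothetical, reduced view of a recovery plan.
type plan struct {
	kind       string
	walPinned  bool
	walPinFrom uint64
}

// planRecovery acquires a WAL pin only when entries actually need replaying.
func planRecovery(replicaFlushedLSN, committedLSN uint64) plan {
	switch {
	case replicaFlushedLSN > committedLSN:
		// Replica is ahead: truncation only, nothing to replay, no pin.
		return plan{kind: "plan_truncate_only"}
	case replicaFlushedLSN == committedLSN:
		// Zero gap: nothing to replay, no pin.
		return plan{kind: "plan_zero_gap"}
	default:
		// Real catch-up: pin from the actual replay start.
		return plan{kind: "plan_catchup", walPinned: true, walPinFrom: replicaFlushedLSN}
	}
}

func main() {
	fmt.Println(planRecovery(120, 100)) // {plan_truncate_only false 0}
	fmt.Println(planRecovery(70, 100))  // {plan_catchup true 70}
}
```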
pingqiu
d4f7697dd8 fix: add full-base pin and clean up session on WAL pin failure
Full-base rebuild resource:
- StorageAdapter.PinFullBase/ReleaseFullBase for full-extent base image
- PlanRebuild full_base branch now acquires FullBasePin
- RecoveryPlan.FullBasePin field, released by ReleasePlan

Session cleanup on resource failure:
- PlanRecovery invalidates session when WAL pin fails
  (no dangling live session after failed resource acquisition)

3 new tests:
- PlanRebuild_FullBase_PinsBaseImage: pin acquired + released
- PlanRebuild_FullBase_PinFailure: logged + error
- PlanRecovery_WALPinFailure_CleansUpSession: session invalidated,
  sender disconnected (no dangling state)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:20:24 -07:00
pingqiu
f73a3fdab2 feat: add storage/control adapters and recovery driver (Phase 06 P0/P1)
Phase 06 module boundaries:

adapter.go — StorageAdapter + ControlPlaneAdapter interfaces:
- GetRetainedHistory: real WAL retention state
- PinSnapshot / ReleaseSnapshot: rebuild resource management
- PinWALRetention / ReleaseWALRetention: catch-up resource management
- HandleHeartbeat / HandleFailover: control-plane event conversion

driver.go — RecoveryDriver replaces synchronous convenience:
- PlanRecovery: connect + handshake from storage state + acquire resources
- PlanRebuild: acquire snapshot + WAL pins for rebuild
- ReleasePlan: release all acquired resources

Convenience flow classification:
- ProcessAssignment, UpdateSenderEpoch, InvalidateEpoch → stepwise engine tasks
- ExecuteRecovery → planner (connect + classify)
- CompleteCatchUp, CompleteRebuild → TEST-ONLY convenience

7 new tests (driver_test.go):
- CatchUp plan + execute with WAL pin
- ZeroGap plan (no resources pinned)
- NeedsRebuild → rebuild plan with resource acquisition
- WAL pin failure → logged + error
- Snapshot pin failure → logged + error
- ReplicaAhead truncation through driver
- Cross-layer: storage proves recoverability, engine consumes proof

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:35:25 -07:00
pingqiu
512bb5bcf6 fix: orchestrator owns full catch-up contract (budget + truncation)
CompleteCatchUp now integrates:
- BeginCatchUp with start tick (freezes target)
- RecordCatchUpProgress (skips if already converged, e.g., truncation-only)
- CheckBudget at completion tick (escalates to NeedsRebuild + logs)
- RecordTruncation before completion (logs truncation_recorded)
- Logs causal reason for every rejection/escalation

CatchUpOptions: StartTick/CompleteTick (separate) + TruncateLSN.

3 new orchestrator-level tests:
- ReplicaAhead_TruncateViaOrchestrator: truncation through entry path
- ReplicaAhead_NoTruncate_CompletionRejected: logs completion_rejected
- BudgetEscalation_ViaOrchestrator: budget violation → NeedsRebuild + logs

Observability tests relabeled as sender-level (not entry-path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:04:34 -07:00
pingqiu
adaff8ddb3 fix: only log endpoint_changed when endpoint actually changed
ProcessAssignment now compares pre/post endpoint state before
logging session_invalidated with "endpoint_changed" reason.
A normal session supersede (same endpoint, assignment_intent) is no
longer mislabeled as an endpoint change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 08:10:35 -07:00
pingqiu
5cdee4a011 fix: orchestrator owns zero-gap completion and per-replica invalidation logging
Zero-gap completion:
- ExecuteRecovery auto-completes zero-gap sessions (no sender call needed)
- RecoveryResult.FinalState = StateInSync for zero-gap

Epoch transition:
- UpdateSenderEpoch: orchestrator-owned epoch advancement with auto-log
- InvalidateEpoch: per-replica session_invalidated events (not aggregate)

Endpoint-change invalidation:
- ProcessAssignment detects session ID change from endpoint update
- Logs per-replica session_invalidated with "endpoint_changed" reason

All integration tests now use orchestrator exclusively for core lifecycle.
No direct sender API calls for recovery execution in integration tests.

1 new test: EndpointChange_LogsInvalidation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 01:01:53 -07:00
pingqiu
47238df0d7 fix: add RecoveryOrchestrator as real integrated entry path
New: orchestrator.go — RecoveryOrchestrator drives recovery lifecycle
from assignment through execution to completion/escalation:
- ProcessAssignment: reconcile + session creation + auto-log
- ExecuteRecovery: connect → handshake from RetainedHistory → outcome
- CompleteCatchUp: begin catch-up → progress → complete + auto-log
- CompleteRebuild: connect → handshake → history-driven source →
  transfer → tail replay → complete + auto-log
- InvalidateEpoch: invalidate stale sessions + auto-log

All integration tests rewritten to use orchestrator as entry path.
No direct sender API calls in recovery lifecycle.

SessionSnapshot now includes: TruncateRequired/ToLSN/Recorded,
RebuildSource, RebuildPhase.

RecoveryLog is auto-populated by orchestrator at every transition.

7 integration tests via orchestrator:
- ChangedAddress, NeedsRebuild→Rebuild, EpochBump, MultiReplica
- Observability: session snapshot, rebuild snapshot, auto-populated log

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:25:58 -07:00
pingqiu
7436b3b79c feat: add integration closure and observability (Phase 05 Slice 4)
New files:
- observe.go: RegistryStatus, SenderStatus, RecoveryLog for debugging
- integration_test.go: V2-boundary integration tests through real
  engine entry path

Observability:
- Registry.Status() returns full snapshot: per-sender state, session
  snapshots, counts by category (InSync, Recovering, Rebuilding)
- RecoveryLog: append-only event log for recovery lifecycle debugging

Integration tests (6):
- ChangedAddress_FullFlow: initial recovery → address change →
  sender preserved → new session → recovery with proof
- NeedsRebuild_ThenRebuildAssignment: catch-up fails → NeedsRebuild
  → rebuild assignment → history-driven source → InSync
- EpochBump_DuringRecovery: mid-recovery epoch bump → old session
  rejected → new assignment at new epoch → InSync
- MultiReplica_MixedOutcomes: 3 replicas, 3 outcomes via
  RetainedHistory proofs, registry status verified
- RegistryStatus_Snapshot: observability snapshot structure
- RecoveryLog: event recording and filtering

Engine module at 54 tests (12 + 18 + 18 + 6).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:15:46 -07:00
pingqiu
4d06622c01 fix: add nil check for RetainedHistory in sender APIs
RecordHandshakeFromHistory and SelectRebuildFromHistory now
return an error instead of panicking on nil history input.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:57:19 -07:00
pingqiu
cc8c529962 fix: connect recovery decisions to RetainedHistory, fix rebuild source
RetainedHistory as engine input:
- RecordHandshakeFromHistory: sender-level API consuming RetainedHistory
  directly, returns RecoverabilityProof alongside outcome
- SelectRebuildFromHistory: sender-level API consuming RetainedHistory
  for rebuild-source decision

RebuildSourceDecision soundness:
- Now requires BOTH trusted checkpoint AND replayable tail
  (CheckpointLSN >= TailLSN and CommittedLSN <= HeadLSN)
- Trusted checkpoint with unreplayable tail falls back to full_base

4 new tests:
- TrustedCheckpoint_UnreplayableTail (the regression case)
- SenderDriven_CatchUp (history → proof → outcome → complete)
- SenderDriven_Rebuild_SnapshotTail (history → source → rebuild)
- SenderDriven_Rebuild_FallsBackToFullBase (unreplayable tail)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:55:31 -07:00
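
The soundness condition introduced in cc8c529962 (snapshot+tail requires both a trusted checkpoint and a replayable tail) might be expressed as below. The struct layout, the `CheckpointTrusted` field, the `rebuildSource` function, and the `"snapshot_tail"` string are assumptions; the two LSN comparisons and the `full_base` fallback are taken from the commit message.

```go
package main

import "fmt"

// RetainedHistory here is a reduced, hypothetical view of WAL retention state.
type RetainedHistory struct {
	TailLSN           uint64 // oldest retained WAL entry
	HeadLSN           uint64 // newest retained WAL entry
	CommittedLSN      uint64
	CheckpointLSN     uint64
	CheckpointTrusted bool
}

// rebuildSource picks snapshot+tail only when the checkpoint is trusted AND
// the tail from the checkpoint onward can actually be replayed.
func rebuildSource(h RetainedHistory) string {
	replayable := h.CheckpointLSN >= h.TailLSN && h.CommittedLSN <= h.HeadLSN
	if h.CheckpointTrusted && replayable {
		return "snapshot_tail"
	}
	return "full_base"
}

func main() {
	// Trusted checkpoint at 50, but WAL is only retained from 80: the gap
	// between 50 and 80 cannot be replayed, so fall back to full base.
	fmt.Println(rebuildSource(RetainedHistory{
		CheckpointTrusted: true, CheckpointLSN: 50,
		TailLSN: 80, HeadLSN: 100, CommittedLSN: 100,
	})) // full_base

	fmt.Println(rebuildSource(RetainedHistory{
		CheckpointTrusted: true, CheckpointLSN: 90,
		TailLSN: 80, HeadLSN: 100, CommittedLSN: 100,
	})) // snapshot_tail
}
```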
pingqiu
ff7ea41099 feat: add engine data/recoverability core (Phase 05 Slice 3)
New file: history.go — RetainedHistory connects recovery decisions
to actual WAL retention state:
- IsRecoverable: checks gap against tail/head boundaries
- MakeHandshakeResult: generates HandshakeResult from retention state
- RebuildSourceDecision: chooses snapshot+tail vs full base from
  checkpoint state (trusted vs untrusted)
- ProveRecoverability: generates explicit proof explaining why
  recovery is or is not allowed

14 new tests (recoverability_test.go):
- Recoverable/unrecoverable gap (exact boundary, beyond head)
- Trusted/untrusted/no checkpoint → rebuild source selection
- Handshake from retained history → outcome classification
- Recoverability proofs (zero-gap, ahead, within retention, beyond)
- E2E: two replicas driven by retained history (catch-up + rebuild)
- Truncation required for replica ahead of committed

Engine module at 44 tests (12 + 18 + 14).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:04:51 -07:00
pingqiu
368a956aee fix: correct catch-up entry counting and rebuild transfer gate
Entry counting:
- Session.setRange now initializes recoveredTo = startLSN
- RecordCatchUpProgress delta counts only actual catch-up work
  (recoveredTo - startLSN), not the replica's pre-existing prefix

Rebuild transfer gate:
- BeginTailReplay requires TransferredTo >= SnapshotLSN
- Prevents tail replay on incomplete base transfer

3 new regression tests:
- BudgetEntries_NonZeroStart_CountsOnlyDelta (30 entries fit within a budget of 50)
- BudgetEntries_NonZeroStart_ExceedsBudget (30 entries exceed a budget of 20)
- Rebuild_PartialTransfer_BlocksTailReplay

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:35:03 -07:00
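
The entry-counting fix in 368a956aee comes down to measuring work from `startLSN` rather than from zero. A minimal sketch, assuming a hypothetical `session` type and a start LSN of 70 (the commit gives only the 30-entry delta and the two budgets, not the LSNs):

```go
package main

import "fmt"

// session is a hypothetical, reduced catch-up session.
type session struct {
	startLSN    uint64
	recoveredTo uint64
}

// setRange initializes recoveredTo to startLSN so that the replica's
// pre-existing prefix is never counted as catch-up work.
func (s *session) setRange(startLSN uint64) {
	s.startLSN = startLSN
	s.recoveredTo = startLSN
}

// recordProgress advances recoveredTo; it never moves backwards.
func (s *session) recordProgress(lsn uint64) {
	if lsn > s.recoveredTo {
		s.recoveredTo = lsn
	}
}

// entriesReplayed is the delta recoveredTo - startLSN.
func (s *session) entriesReplayed() uint64 { return s.recoveredTo - s.startLSN }

// exceedsBudget reports whether the replayed delta is over maxEntries.
func (s *session) exceedsBudget(maxEntries uint64) bool {
	return s.entriesReplayed() > maxEntries
}

func main() {
	s := &session{}
	s.setRange(70)        // replica already holds LSNs up to 70
	s.recordProgress(100) // catch-up replays 71..100
	fmt.Println(s.entriesReplayed())  // 30
	fmt.Println(s.exceedsBudget(50))  // false
	fmt.Println(s.exceedsBudget(20))  // true
}
```

Counting from zero instead would report 100 entries here and trip the 50-entry budget even though only 30 entries were replayed.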
pingqiu
930de4ba78 feat: add Slice 2 recovery execution tests (Phase 05)
15 new engine-level recovery execution tests:
- Zero-gap / catch-up / needs-rebuild branching (3 tests)
- Stale execution rejection during active recovery (2 tests)
- Bounded catch-up: frozen target, duration, entries, stall (5 tests)
- Completion before convergence rejected
- Rebuild exclusivity: catch-up APIs excluded (1 test)
- Rebuild lifecycle: snapshot+tail, full base, stale ID (3 tests)
- Assignment-driven recovery flow

Engine module now at 27 tests (12 Slice 1 + 15 Slice 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:14:18 -07:00
pingqiu
61e9408261 fix: separate stable ReplicaID from Endpoint in registry
Registry is now keyed by stable ReplicaID, not by address.
DataAddr changes preserve sender identity — the core V2 invariant.

Changes:
- ReplicaAssignment{ReplicaID, Endpoint} replaces map[string]Endpoint
- AssignmentIntent.Replicas uses []ReplicaAssignment
- Registry.Reconcile takes []ReplicaAssignment
- Tests use stable IDs ("replica-1", "r1") independent of addresses

New test: ChangedDataAddr_PreservesSenderIdentity
- Same ReplicaID, different DataAddr (10.0.0.1 → 10.0.0.2)
- Sender pointer preserved, session invalidated, new session attached
- This is the exact V1/V1.5 regression that V2 must fix

doc.go: clarified Slice 1 core vs carried-forward files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:06:11 -07:00
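
The invariant from 61e9408261 (a changed DataAddr keeps the same sender but invalidates its session) can be sketched with a registry keyed by ReplicaID. The type names follow the commit message, but the fields, the session-ID counter, and the omission of sender removal are simplifications assumed for this example.

```go
package main

import "fmt"

type Endpoint struct{ DataAddr string }

type ReplicaAssignment struct {
	ReplicaID string
	Endpoint  Endpoint
}

// Sender is reduced to the two fields this sketch needs.
type Sender struct {
	endpoint  Endpoint
	sessionID uint64
}

// Registry is keyed by stable ReplicaID, never by address.
type Registry struct {
	senders     map[string]*Sender
	nextSession uint64
}

func NewRegistry() *Registry { return &Registry{senders: map[string]*Sender{}} }

func (r *Registry) newSessionID() uint64 {
	r.nextSession++
	return r.nextSession
}

// Reconcile creates senders for new IDs. For a known ID with a new endpoint
// it keeps the sender and replaces only the session.
func (r *Registry) Reconcile(assignments []ReplicaAssignment) {
	for _, a := range assignments {
		s, ok := r.senders[a.ReplicaID]
		if !ok {
			r.senders[a.ReplicaID] = &Sender{endpoint: a.Endpoint, sessionID: r.newSessionID()}
			continue
		}
		if s.endpoint != a.Endpoint {
			s.endpoint = a.Endpoint
			s.sessionID = r.newSessionID() // old session is superseded
		}
	}
}

func (r *Registry) Sender(id string) *Sender { return r.senders[id] }

func main() {
	r := NewRegistry()
	r.Reconcile([]ReplicaAssignment{{ReplicaID: "replica-1", Endpoint: Endpoint{DataAddr: "10.0.0.1"}}})
	before := r.Sender("replica-1")
	r.Reconcile([]ReplicaAssignment{{ReplicaID: "replica-1", Endpoint: Endpoint{DataAddr: "10.0.0.2"}}})
	after := r.Sender("replica-1")
	fmt.Println(before == after, after.sessionID) // true 2
}
```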
pingqiu
bb24b4b039 fix: encapsulate engine sender/session authority state
All mutable state on Sender and Session is now unexported:
- Sender.state, .epoch, .endpoint, .session, .stopped → accessors
- Session.id, .phase, .kind, etc. → read-only accessors
- Session() replaced by SessionSnapshot() (returns disconnected copy)
- SessionID() and HasActiveSession() for common queries
- AttachSession returns (sessionID, error) not (*Session, error)
- SupersedeSession returns sessionID not *Session

Budget configuration via SessionOption:
- WithBudget(CatchUpBudget) passed to AttachSession
- No direct field mutation on session from external code

New test: Encapsulation_SnapshotIsReadOnly proves snapshot
mutation does not leak back to sender state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:58:28 -07:00
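
The encapsulation pattern from bb24b4b039 combines two common Go idioms: functional options for configuration and value-copy snapshots for reads. The sketch below is an assumption-heavy reduction: `AttachSession` in the real engine very likely takes more arguments, and the fields of `SessionSnapshot` and `CatchUpBudget` shown here are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type CatchUpBudget struct{ MaxEntries uint64 }

// session keeps all state unexported.
type session struct {
	id     uint64
	budget CatchUpBudget
}

// SessionOption configures a session at attach time.
type SessionOption func(*session)

func WithBudget(b CatchUpBudget) SessionOption {
	return func(s *session) { s.budget = b }
}

// SessionSnapshot is a disconnected copy; mutating it cannot affect the sender.
type SessionSnapshot struct {
	ID     uint64
	Budget CatchUpBudget
}

type Sender struct {
	session *session
	nextID  uint64
}

// AttachSession returns the session ID, never a *session.
func (s *Sender) AttachSession(opts ...SessionOption) (uint64, error) {
	if s.session != nil {
		return 0, errors.New("session already active")
	}
	s.nextID++
	sess := &session{id: s.nextID}
	for _, opt := range opts {
		opt(sess)
	}
	s.session = sess
	return sess.id, nil
}

// SessionSnapshot copies the session by value; ok is false with no session.
func (s *Sender) SessionSnapshot() (snap SessionSnapshot, ok bool) {
	if s.session == nil {
		return SessionSnapshot{}, false
	}
	return SessionSnapshot{ID: s.session.id, Budget: s.session.budget}, true
}

func main() {
	s := &Sender{}
	id, _ := s.AttachSession(WithBudget(CatchUpBudget{MaxEntries: 50}))
	snap, _ := s.SessionSnapshot()
	snap.Budget.MaxEntries = 1 // mutates the copy only
	again, _ := s.SessionSnapshot()
	fmt.Println(id, again.Budget.MaxEntries) // 1 50
}
```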
pingqiu
20d70f9fb6 feat: add V2 engine replication core (Phase 05 Slice 1)
Creates sw-block/engine/replication/ — the real V2 engine ownership core,
promoted from sw-block/prototype/enginev2/ with all accepted invariants.

Files:
- types.go: Endpoint, ReplicaState, SessionKind, SessionPhase, FSM transitions
- sender.go: per-replica Sender with full execution + rebuild APIs
- session.go: Session with identity, phases, frozen target, truncation, budget
- registry.go: Registry with reconcile + assignment intent + epoch invalidation
- budget.go: CatchUpBudget (duration, entries, stall detection)
- rebuild.go: RebuildState FSM (snapshot+tail vs full base)
- outcome.go: HandshakeResult + ClassifyRecoveryOutcome

Tests (ownership_test.go, 13 tests):
- Changed-address invalidation (A10)
- Stale session ID rejected at all APIs (A3)
- Stale completion after supersede (A3)
- Epoch bump invalidates all sessions (A3)
- Stale assignment epoch rejected
- Rebuild exclusivity (catch-up APIs rejected)
- Rebuild full lifecycle
- Frozen target rejects chase (A5)
- Budget violation escalates (A5)
- E2E: 3 replicas, 3 outcomes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:51:01 -07:00