Commit Graph

13 Commits

Author SHA1 Message Date
pingqiu
512bb5bcf6 fix: orchestrator owns full catch-up contract (budget + truncation)
CompleteCatchUp now integrates:
- BeginCatchUp with start tick (freezes target)
- RecordCatchUpProgress (skips if already converged, e.g., truncation-only)
- CheckBudget at completion tick (escalates to NeedsRebuild + logs)
- RecordTruncation before completion (logs truncation_recorded)
- Logs causal reason for every rejection/escalation

CatchUpOptions: StartTick/CompleteTick (separate) + TruncateLSN.

3 new orchestrator-level tests:
- ReplicaAhead_TruncateViaOrchestrator: truncation through entry path
- ReplicaAhead_NoTruncate_CompletionRejected: logs completion_rejected
- BudgetEscalation_ViaOrchestrator: budget violation → NeedsRebuild + logs

Observability tests relabeled as sender-level (not entry-path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:04:34 -07:00
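
The completion-time budget check in commit 512bb5bcf6 above can be pictured with a minimal Go sketch. It is hypothetical: `catchUpOptions` borrows the StartTick/CompleteTick names from the message, while `budget`, `completeCatchUp`, the state strings, and the log text are assumptions made for illustration, not the engine's actual API.

```go
package main

import "fmt"

// catchUpOptions mirrors the StartTick/CompleteTick split named in the
// commit message. The field types are assumptions.
type catchUpOptions struct {
	StartTick    uint64
	CompleteTick uint64
}

// budget is a hypothetical duration budget measured in ticks.
type budget struct {
	MaxDurationTicks uint64
}

// completeCatchUp returns the resulting state and appends one causal log
// entry. Assumption: a catch-up that ran longer than the budget escalates
// to NeedsRebuild instead of completing.
func completeCatchUp(opts catchUpOptions, b budget, log *[]string) string {
	if opts.CompleteTick < opts.StartTick {
		// Guard against unsigned underflow; the reason text is illustrative.
		*log = append(*log, "rejected: complete tick precedes start tick")
		return "CatchingUp"
	}
	elapsed := opts.CompleteTick - opts.StartTick
	if elapsed > b.MaxDurationTicks {
		*log = append(*log, fmt.Sprintf("escalated: budget exceeded (%d > %d ticks)", elapsed, b.MaxDurationTicks))
		return "NeedsRebuild"
	}
	*log = append(*log, "completed")
	return "InSync"
}

func main() {
	var log []string
	b := budget{MaxDurationTicks: 10}
	fmt.Println(completeCatchUp(catchUpOptions{StartTick: 10, CompleteTick: 15}, b, &log))
	fmt.Println(completeCatchUp(catchUpOptions{StartTick: 10, CompleteTick: 40}, b, &log))
	for _, entry := range log {
		fmt.Println(entry)
	}
}
```

Keeping the start and completion ticks separate lets the check measure elapsed time at the moment of completion, which is the behavior the message describes.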
pingqiu
adaff8ddb3 fix: only log endpoint_changed when endpoint actually changed
ProcessAssignment now compares pre/post endpoint state before
logging session_invalidated with "endpoint_changed" reason.
A normal session supersede (same endpoint, assignment_intent) is no
longer mislabeled as an endpoint change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 08:10:35 -07:00
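
A minimal sketch of the pre/post comparison described in commit adaff8ddb3 above. The `endpoint` type and the `endpointChangeReason` helper are hypothetical; only the "endpoint_changed" reason string comes from the message.

```go
package main

import "fmt"

// endpoint is a hypothetical stand-in; only DataAddr is modeled.
type endpoint struct {
	DataAddr string
}

// endpointChangeReason compares the endpoint before and after an
// assignment. It reports "endpoint_changed" only when the two differ;
// the boolean is false when no endpoint-change event should be logged.
func endpointChangeReason(before, after endpoint) (string, bool) {
	if before != after {
		return "endpoint_changed", true
	}
	return "", false
}

func main() {
	old := endpoint{DataAddr: "10.0.0.1"}

	reason, changed := endpointChangeReason(old, endpoint{DataAddr: "10.0.0.2"})
	fmt.Println(reason, changed)

	reason, changed = endpointChangeReason(old, old)
	fmt.Println(reason, changed)
}
```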
pingqiu
5cdee4a011 fix: orchestrator owns zero-gap completion and per-replica invalidation logging
Zero-gap completion:
- ExecuteRecovery auto-completes zero-gap sessions (no sender call needed)
- RecoveryResult.FinalState = StateInSync for zero-gap

Epoch transition:
- UpdateSenderEpoch: orchestrator-owned epoch advancement with auto-log
- InvalidateEpoch: per-replica session_invalidated events (not aggregate)

Endpoint-change invalidation:
- ProcessAssignment detects session ID change from endpoint update
- Logs per-replica session_invalidated with "endpoint_changed" reason

All integration tests now use orchestrator exclusively for core lifecycle.
No direct sender API calls for recovery execution in integration tests.

1 new test: EndpointChange_LogsInvalidation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 01:01:53 -07:00
pingqiu
47238df0d7 fix: add RecoveryOrchestrator as real integrated entry path
New: orchestrator.go — RecoveryOrchestrator drives recovery lifecycle
from assignment through execution to completion/escalation:
- ProcessAssignment: reconcile + session creation + auto-log
- ExecuteRecovery: connect → handshake from RetainedHistory → outcome
- CompleteCatchUp: begin catch-up → progress → complete + auto-log
- CompleteRebuild: connect → handshake → history-driven source →
  transfer → tail replay → complete + auto-log
- InvalidateEpoch: invalidate stale sessions + auto-log

All integration tests rewritten to use orchestrator as entry path.
No direct sender API calls in recovery lifecycle.

SessionSnapshot now includes: TruncateRequired/ToLSN/Recorded,
RebuildSource, RebuildPhase.

RecoveryLog is auto-populated by orchestrator at every transition.

7 integration tests via orchestrator:
- ChangedAddress, NeedsRebuild→Rebuild, EpochBump, MultiReplica
- Observability: session snapshot, rebuild snapshot, auto-populated log

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:25:58 -07:00
pingqiu
7436b3b79c feat: add integration closure and observability (Phase 05 Slice 4)
New files:
- observe.go: RegistryStatus, SenderStatus, RecoveryLog for debugging
- integration_test.go: V2-boundary integration tests through real
  engine entry path

Observability:
- Registry.Status() returns full snapshot: per-sender state, session
  snapshots, counts by category (InSync, Recovering, Rebuilding)
- RecoveryLog: append-only event log for recovery lifecycle debugging

Integration tests (6):
- ChangedAddress_FullFlow: initial recovery → address change →
  sender preserved → new session → recovery with proof
- NeedsRebuild_ThenRebuildAssignment: catch-up fails → NeedsRebuild
  → rebuild assignment → history-driven source → InSync
- EpochBump_DuringRecovery: mid-recovery epoch bump → old session
  rejected → new assignment at new epoch → InSync
- MultiReplica_MixedOutcomes: 3 replicas, 3 outcomes via
  RetainedHistory proofs, registry status verified
- RegistryStatus_Snapshot: observability snapshot structure
- RecoveryLog: event recording and filtering

Engine module at 54 tests (12 + 18 + 18 + 6).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:15:46 -07:00
pingqiu
4d06622c01 fix: add nil check for RetainedHistory in sender APIs
RecordHandshakeFromHistory and SelectRebuildFromHistory now
return an error instead of panicking on nil history input.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:57:19 -07:00
pingqiu
cc8c529962 fix: connect recovery decisions to RetainedHistory, fix rebuild source
RetainedHistory as engine input:
- RecordHandshakeFromHistory: sender-level API consuming RetainedHistory
  directly, returns RecoverabilityProof alongside outcome
- SelectRebuildFromHistory: sender-level API consuming RetainedHistory
  for rebuild-source decision

RebuildSourceDecision soundness:
- Now requires BOTH trusted checkpoint AND replayable tail
  (CheckpointLSN >= TailLSN and CommittedLSN <= HeadLSN)
- Trusted checkpoint with unreplayable tail falls back to full_base

4 new tests:
- TrustedCheckpoint_UnreplayableTail (the regression case)
- SenderDriven_CatchUp (history → proof → outcome → complete)
- SenderDriven_Rebuild_SnapshotTail (history → source → rebuild)
- SenderDriven_Rebuild_FallsBackToFullBase (unreplayable tail)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:55:31 -07:00
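
The soundness rule from commit cc8c529962 above (trusted checkpoint AND replayable tail, otherwise full_base) might look roughly like the following. This is a sketch under assumptions: the struct layout, the `CheckpointTrusted` flag, and the "snapshot_tail" label are hypothetical, and only the two LSN comparisons and the "full_base" fallback are taken from the message.

```go
package main

import "fmt"

// retainedHistory is a hypothetical view of WAL retention state.
type retainedHistory struct {
	HeadLSN           uint64
	TailLSN           uint64
	CommittedLSN      uint64
	CheckpointLSN     uint64
	CheckpointTrusted bool
}

// rebuildSourceDecision picks snapshot+tail only when the checkpoint is
// trusted AND the tail after it can still be replayed. Every other case
// falls back to a full base transfer.
func (h retainedHistory) rebuildSourceDecision() string {
	tailReplayable := h.CheckpointLSN >= h.TailLSN && h.CommittedLSN <= h.HeadLSN
	if h.CheckpointTrusted && tailReplayable {
		return "snapshot_tail" // assumed label for the snapshot+tail source
	}
	return "full_base"
}

func main() {
	ok := retainedHistory{HeadLSN: 100, TailLSN: 40, CommittedLSN: 100, CheckpointLSN: 50, CheckpointTrusted: true}
	fmt.Println(ok.rebuildSourceDecision())

	// Regression case: the checkpoint is trusted but older than the retained tail.
	stale := retainedHistory{HeadLSN: 100, TailLSN: 60, CommittedLSN: 100, CheckpointLSN: 50, CheckpointTrusted: true}
	fmt.Println(stale.rebuildSourceDecision())
}
```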
pingqiu
ff7ea41099 feat: add engine data/recoverability core (Phase 05 Slice 3)
New file: history.go — RetainedHistory connects recovery decisions
to actual WAL retention state:
- IsRecoverable: checks gap against tail/head boundaries
- MakeHandshakeResult: generates HandshakeResult from retention state
- RebuildSourceDecision: chooses snapshot+tail vs full base from
  checkpoint state (trusted vs untrusted)
- ProveRecoverability: generates explicit proof explaining why
  recovery is or is not allowed

14 new tests (recoverability_test.go):
- Recoverable/unrecoverable gap (exact boundary, beyond head)
- Trusted/untrusted/no checkpoint → rebuild source selection
- Handshake from retained history → outcome classification
- Recoverability proofs (zero-gap, ahead, within retention, beyond)
- E2E: two replicas driven by retained history (catch-up + rebuild)
- Truncation required for replica ahead of committed

Engine module at 44 tests (12 + 18 + 14).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:04:51 -07:00
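
The four proof cases listed in commit ff7ea41099 above (zero-gap, ahead, within retention, beyond) can be sketched as a single classification. The boundary convention is an assumption: here a replica at exactly TailLSN is still treated as recoverable, and a replica ahead of the committed LSN is treated as recoverable via truncation. The type names and reason strings are hypothetical.

```go
package main

import "fmt"

// retainedHistory is a hypothetical view of WAL retention state.
// Assumption: a replica at or above TailLSN can be caught up by replay.
type retainedHistory struct {
	TailLSN      uint64
	HeadLSN      uint64
	CommittedLSN uint64
}

// proof explains why recovery is or is not allowed.
type proof struct {
	Recoverable bool
	Reason      string
}

// proveRecoverability classifies a replica's position against the
// retained WAL range and returns an explicit reason.
func (h retainedHistory) proveRecoverability(replicaLSN uint64) proof {
	switch {
	case replicaLSN == h.CommittedLSN:
		return proof{Recoverable: true, Reason: "zero_gap"}
	case replicaLSN > h.CommittedLSN:
		return proof{Recoverable: true, Reason: "replica_ahead_truncation_required"}
	case replicaLSN >= h.TailLSN && h.CommittedLSN <= h.HeadLSN:
		return proof{Recoverable: true, Reason: "gap_within_retention"}
	default:
		return proof{Recoverable: false, Reason: "gap_beyond_retention"}
	}
}

func main() {
	h := retainedHistory{TailLSN: 40, HeadLSN: 100, CommittedLSN: 100}
	for _, lsn := range []uint64{100, 120, 40, 39} {
		fmt.Println(lsn, h.proveRecoverability(lsn))
	}
}
```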
pingqiu
368a956aee fix: correct catch-up entry counting and rebuild transfer gate
Entry counting:
- Session.setRange now initializes recoveredTo = startLSN
- RecordCatchUpProgress delta counts only actual catch-up work
  (recoveredTo - startLSN), not the replica's pre-existing prefix

Rebuild transfer gate:
- BeginTailReplay requires TransferredTo >= SnapshotLSN
- Prevents tail replay on incomplete base transfer

3 new regression tests:
- BudgetEntries_NonZeroStart_CountsOnlyDelta (30 entries, within a budget of 50)
- BudgetEntries_NonZeroStart_ExceedsBudget (30 entries, exceeding a budget of 20)
- Rebuild_PartialTransfer_BlocksTailReplay

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:35:03 -07:00
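
Both fixes in commit 368a956aee above are small enough to sketch. The types and lowercase method names below are hypothetical stand-ins, and the LSN values 70 and 100 are chosen only so that the delta comes out to the 30 entries mentioned in the regression tests.

```go
package main

import (
	"errors"
	"fmt"
)

// session models only the fields needed to illustrate delta counting.
type session struct {
	startLSN    uint64
	targetLSN   uint64
	recoveredTo uint64
}

// setRange seeds recoveredTo with startLSN so the replica's pre-existing
// prefix is never counted as catch-up work.
func (s *session) setRange(startLSN, targetLSN uint64) {
	s.startLSN = startLSN
	s.targetLSN = targetLSN
	s.recoveredTo = startLSN
}

// recordProgress advances recoveredTo and returns the entries replayed
// so far, measured from startLSN rather than from zero.
func (s *session) recordProgress(lsn uint64) (uint64, error) {
	if lsn < s.recoveredTo || lsn > s.targetLSN {
		return 0, errors.New("progress outside session range")
	}
	s.recoveredTo = lsn
	return s.recoveredTo - s.startLSN, nil
}

// rebuild models the transfer gate in front of tail replay.
type rebuild struct {
	snapshotLSN   uint64
	transferredTo uint64
}

// beginTailReplay refuses to start until the base transfer has reached
// the snapshot LSN.
func (r *rebuild) beginTailReplay() error {
	if r.transferredTo < r.snapshotLSN {
		return errors.New("base transfer incomplete")
	}
	return nil
}

func main() {
	s := &session{}
	s.setRange(70, 100)
	entries, _ := s.recordProgress(100)
	fmt.Println("entries:", entries, "within 50:", entries <= 50, "within 20:", entries <= 20)

	partial := &rebuild{snapshotLSN: 100, transferredTo: 60}
	fmt.Println("partial transfer:", partial.beginTailReplay())
}
```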
pingqiu
930de4ba78 feat: add Slice 2 recovery execution tests (Phase 05)
15 new engine-level recovery execution tests:
- Zero-gap / catch-up / needs-rebuild branching (3 tests)
- Stale execution rejection during active recovery (2 tests)
- Bounded catch-up: frozen target, duration, entries, stall (5 tests)
- Completion before convergence rejected
- Rebuild exclusivity: catch-up APIs excluded (1 test)
- Rebuild lifecycle: snapshot+tail, full base, stale ID (3 tests)
- Assignment-driven recovery flow

Engine module now at 27 tests (12 Slice 1 + 15 Slice 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:14:18 -07:00
pingqiu
61e9408261 fix: separate stable ReplicaID from Endpoint in registry
Registry is now keyed by stable ReplicaID, not by address.
DataAddr changes preserve sender identity — the core V2 invariant.

Changes:
- ReplicaAssignment{ReplicaID, Endpoint} replaces map[string]Endpoint
- AssignmentIntent.Replicas uses []ReplicaAssignment
- Registry.Reconcile takes []ReplicaAssignment
- Tests use stable IDs ("replica-1", "r1") independent of addresses

New test: ChangedDataAddr_PreservesSenderIdentity
- Same ReplicaID, different DataAddr (10.0.0.1 → 10.0.0.2)
- Sender pointer preserved, session invalidated, new session attached
- This is the exact V1/V1.5 regression that V2 must fix

doc.go: clarified Slice 1 core vs carried-forward files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:06:11 -07:00
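
The invariant from commit 61e9408261 above (same ReplicaID, new DataAddr, sender preserved, session replaced) can be illustrated with a small registry sketch. `ReplicaAssignment` and `Endpoint` reuse names from the message, but their field sets, the `sender` and `registry` types, and the session counter are assumptions. Removal of senders that drop out of the assignment is deliberately left out.

```go
package main

import "fmt"

// Endpoint models only the data address. Other fields are unknown.
type Endpoint struct {
	DataAddr string
}

// ReplicaAssignment pairs a stable identity with a mutable endpoint.
type ReplicaAssignment struct {
	ReplicaID string
	Endpoint  Endpoint
}

// sender is a hypothetical per-replica sender.
type sender struct {
	endpoint  Endpoint
	sessionID uint64
}

// registry is keyed by ReplicaID, never by address.
type registry struct {
	senders     map[string]*sender
	nextSession uint64
}

func newRegistry() *registry {
	return &registry{senders: make(map[string]*sender)}
}

// reconcile keeps the existing sender for a known ReplicaID. An endpoint
// change only replaces the session, not the sender itself.
func (r *registry) reconcile(assignments []ReplicaAssignment) {
	for _, a := range assignments {
		s, ok := r.senders[a.ReplicaID]
		if !ok {
			s = &sender{}
			r.senders[a.ReplicaID] = s
		}
		if !ok || s.endpoint != a.Endpoint {
			s.endpoint = a.Endpoint
			r.nextSession++
			s.sessionID = r.nextSession
		}
	}
}

func main() {
	r := newRegistry()
	r.reconcile([]ReplicaAssignment{{ReplicaID: "replica-1", Endpoint: Endpoint{DataAddr: "10.0.0.1"}}})
	before := r.senders["replica-1"]
	firstSession := before.sessionID

	r.reconcile([]ReplicaAssignment{{ReplicaID: "replica-1", Endpoint: Endpoint{DataAddr: "10.0.0.2"}}})
	after := r.senders["replica-1"]

	fmt.Println("sender preserved:", before == after)
	fmt.Println("session replaced:", firstSession != after.sessionID)
}
```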
pingqiu
bb24b4b039 fix: encapsulate engine sender/session authority state
All mutable state on Sender and Session is now unexported:
- Sender.state, .epoch, .endpoint, .session, .stopped → accessors
- Session.id, .phase, .kind, etc. → read-only accessors
- Session() replaced by SessionSnapshot() (returns disconnected copy)
- SessionID() and HasActiveSession() for common queries
- AttachSession returns (sessionID, error) not (*Session, error)
- SupersedeSession returns sessionID not *Session

Budget configuration via SessionOption:
- WithBudget(CatchUpBudget) passed to AttachSession
- No direct field mutation on session from external code

New test: Encapsulation_SnapshotIsReadOnly proves snapshot
mutation does not leak back to sender state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:58:28 -07:00
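
A minimal sketch of the "disconnected copy" idea from commit bb24b4b039 above. `SessionSnapshot` is named in the message, but its fields, the boolean return value, and the `session` and `sender` layouts here are assumptions.

```go
package main

import "fmt"

// session keeps its state unexported so external code cannot mutate it.
type session struct {
	id    uint64
	phase string
}

// SessionSnapshot is a plain value copy of the session's state.
type SessionSnapshot struct {
	ID    uint64
	Phase string
}

// sender owns the session pointer and never hands it out.
type sender struct {
	session *session
}

// SessionSnapshot returns a copy of the current session state. The
// boolean is false when no session is attached.
func (s *sender) SessionSnapshot() (SessionSnapshot, bool) {
	if s.session == nil {
		return SessionSnapshot{}, false
	}
	return SessionSnapshot{ID: s.session.id, Phase: s.session.phase}, true
}

func main() {
	s := &sender{session: &session{id: 7, phase: "catching_up"}}

	snap, _ := s.SessionSnapshot()
	snap.Phase = "completed" // mutates the copy only

	fresh, _ := s.SessionSnapshot()
	fmt.Println(snap.Phase, fresh.Phase)
}
```

Because the snapshot holds only value fields, there is no pointer path from it back to the sender, which is the property the Encapsulation_SnapshotIsReadOnly test is described as proving.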
pingqiu
20d70f9fb6 feat: add V2 engine replication core (Phase 05 Slice 1)
Creates sw-block/engine/replication/ — the real V2 engine ownership core,
promoted from sw-block/prototype/enginev2/ with all accepted invariants.

Files:
- types.go: Endpoint, ReplicaState, SessionKind, SessionPhase, FSM transitions
- sender.go: per-replica Sender with full execution + rebuild APIs
- session.go: Session with identity, phases, frozen target, truncation, budget
- registry.go: Registry with reconcile + assignment intent + epoch invalidation
- budget.go: CatchUpBudget (duration, entries, stall detection)
- rebuild.go: RebuildState FSM (snapshot+tail vs full base)
- outcome.go: HandshakeResult + ClassifyRecoveryOutcome

Tests (ownership_test.go, 13 tests):
- Changed-address invalidation (A10)
- Stale session ID rejected at all APIs (A3)
- Stale completion after supersede (A3)
- Epoch bump invalidates all sessions (A3)
- Stale assignment epoch rejected
- Rebuild exclusivity (catch-up APIs rejected)
- Rebuild full lifecycle
- Frozen target rejects chase (A5)
- Budget violation escalates (A5)
- E2E: 3 replicas, 3 outcomes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:51:01 -07:00