Commit Graph

13129 Commits

Author SHA1 Message Date
pingqiu
600dac6029 feat: Phase 13 CP13-1 — frozen test-first baseline for sync replication gaps
Baseline report (phase-13-cp1-baseline.md) from running 44 existing
replication-gap tests on current code with zero protocol changes:

  37 PASS / 4 FAIL / 3 PASS*

4 FAILs expose real gaps:
- ReconnectUsesHandshakeNotBootstrap: degraded shipper doesn't catch up (CP13-5)
- CatchupMultipleDisconnects: repeated reconnect cycles don't recover (CP13-5)
- NeedsRebuildBlocksAllPaths: stays Degraded after large gap (CP13-5+7)
- CatchupDoesNotOverwriteNewerData: catch-up fails at barrier (CP13-5)

3 PASS* are witness-only (pass but don't prove the property):
- Bug3_ReplicaAddr: documents gap, not fix (CP13-2)
- GapBeyondRetainedWal: asserts barrier failure, not NeedsRebuild (CP13-7)
- MaxBytesTriggersNeedsRebuild: logs "not implemented" (CP13-6)

No protocol code changed. Baseline is test-first evidence only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 17:07:21 -07:00
pingqiu
c0a805184f chore: archive superseded V2 design docs
Copies of design docs removed in Phase 09, preserved in sw-block/docs/archive/
for historical reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:26:34 -07:00
pingqiu
bdf20fde71 feat: Phase 12 — production hardening (disturbance, soak, testrunner scenarios)
P1 Disturbance: restart/reconnect correctness tests — assignment delivery
  through real proto → ProcessAssignments, epoch validation on promoted
  volume, mandatory reconnect assertions

P2 Soak: repeated create/failover/recover cycles with end-of-cycle truth
  checks, runtime hygiene (no stale tasks/entries), steady-state idempotence

Testrunner recovery actions + scenarios:
- recovery.go: wait_recovery_complete, assert_recovery_state, trigger_rebuild
- 8 new YAML scenarios: baseline (failover/crash/partition), stability
  (replication-tax, netem-sweep, packet-loss, degraded), robust shipper

HA edge case and EC6 fix tests for regression coverage.

(P3 diagnosability + P4 perf floor committed separately in 643a5a1074)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:26:17 -07:00
pingqiu
bdf83e350e feat: Phase 11 — product-surface rebinding (snapshot, CSI, publication, restore)
P1 Snapshots: CoW snapshot lifecycle through V2 engine path, create/list/delete
  via master RPC, BaseLSN tracking in manifest, ImportSnapshotForRebuild

P2 CSI Lifecycle: masterServerBackend calling real MasterServer in-process,
  CreateVolume/DeleteVolume/ExpandVolume through CSI → master → VS flow,
  ExportedControllerServer/ExportedNodeServer for cross-package testing

P3 Publication: LookupBlockVolume coherence across failover, iSCSI + NVMe
  address switching on promotion, repeated lookup self-consistency

P4 Restore: RestoreBlockSnapshot RPC through master and volume server,
  snapshot restore with runtime convergence, epoch/role validation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:25:58 -07:00
pingqiu
3ec8fab2f1 feat: Phase 10 — control-plane closure (identity, convergence, idempotence)
Stable identity on wire:
- ServerID fields in proto (replica_server_id, server_id on ReplicaAddrMessage)
- volumeServerId wired through volume.go → BlockService.SetServerID
- Identity derived from canonical server ID, not transport addresses

Assignment convergence:
- V2 idempotence via lastAppliedAssignment.equals (full replica set comparison)
- setupPrimaryReplication/Multi idempotence guards
- ProcessAssignments with V2 + V1 dual-path assignment handling
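
The idempotence guard named above (lastAppliedAssignment.equals with a full replica set comparison) could look roughly like the sketch below. The struct shapes, field names, and the apply helper are assumptions made for illustration; they are not the real proto types or the actual BlockService code. The sketch also assumes ServerIDs are unique within one assignment.

```go
package main

import "fmt"

// replica and assignment are hypothetical shapes, not the real proto types.
type replica struct {
	ServerID string
	Addr     string
}

type assignment struct {
	Epoch    uint64
	Role     string
	Replicas []replica
}

// equals compares epoch, role, and the full replica set, ignoring order.
func (a assignment) equals(b assignment) bool {
	if a.Epoch != b.Epoch || a.Role != b.Role || len(a.Replicas) != len(b.Replicas) {
		return false
	}
	pending := make(map[string]string, len(a.Replicas))
	for _, r := range a.Replicas {
		pending[r.ServerID] = r.Addr
	}
	for _, r := range b.Replicas {
		addr, ok := pending[r.ServerID]
		if !ok || addr != r.Addr {
			return false
		}
		delete(pending, r.ServerID) // each entry may match only once
	}
	return true
}

// applier remembers the last applied assignment and skips exact repeats.
type applier struct {
	last    *assignment
	applied int
}

// apply reports whether the assignment changed state (false = idempotent no-op).
func (p *applier) apply(a assignment) bool {
	if p.last != nil && p.last.equals(a) {
		return false
	}
	p.last = &a
	p.applied++
	return true
}

func main() {
	p := &applier{}
	a := assignment{Epoch: 1, Role: "primary", Replicas: []replica{
		{ServerID: "vs1", Addr: "10.0.0.1:9000"}, {ServerID: "vs2", Addr: "10.0.0.2:9000"}}}
	reordered := assignment{Epoch: 1, Role: "primary", Replicas: []replica{
		{ServerID: "vs2", Addr: "10.0.0.2:9000"}, {ServerID: "vs1", Addr: "10.0.0.1:9000"}}}
	fmt.Println(p.apply(a), p.apply(reordered), p.applied) // true false 1
}
```

Comparing the whole set rather than a single scalar replica field means a redelivered heartbeat with the same members in a different order is treated as a repeat, while any membership or address change is applied.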

Master-driven control loop:
- RecoveryManager: serialized cancel-and-drain via done channels
- Per-replica heartbeat state reporting (ReplicaShipperStatus)
- masterServerBackend: VolumeBackend calling real MasterServer in-process
- RestoreBlockSnapshot RPC (master + volume server proto)
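
The "serialized cancel-and-drain via done channels" item for RecoveryManager can be illustrated with a minimal sketch. Everything here (recoveryManager, task, Start, Stop, the stop/done channel pair) is a hypothetical reduction, not the real RecoveryManager API: the point is only that a new task for a replica never starts until the previous one has observed cancellation and closed its done channel.

```go
package main

import (
	"fmt"
	"sync"
)

// task pairs a cancellation signal with a completion signal.
type task struct {
	stop chan struct{} // closed to request cancellation
	done chan struct{} // closed by the task goroutine once it has exited
}

type recoveryManager struct {
	mu    sync.Mutex
	tasks map[string]*task
}

func newRecoveryManager() *recoveryManager {
	return &recoveryManager{tasks: make(map[string]*task)}
}

// cancelAndDrainLocked stops the task for replicaID, if any, and waits for it to exit.
func (m *recoveryManager) cancelAndDrainLocked(replicaID string) {
	if old, ok := m.tasks[replicaID]; ok {
		close(old.stop)
		<-old.done
		delete(m.tasks, replicaID)
	}
}

// Start replaces any running task for replicaID; the old one is fully drained first.
func (m *recoveryManager) Start(replicaID string, run func(stop <-chan struct{})) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.cancelAndDrainLocked(replicaID)
	t := &task{stop: make(chan struct{}), done: make(chan struct{})}
	m.tasks[replicaID] = t
	go func() {
		defer close(t.done)
		run(t.stop)
	}()
}

// Stop cancels and drains the task for replicaID without starting a new one.
func (m *recoveryManager) Stop(replicaID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.cancelAndDrainLocked(replicaID)
}

func main() {
	var events []string
	worker := func(name string) func(<-chan struct{}) {
		return func(stop <-chan struct{}) {
			events = append(events, name+"-start")
			<-stop
			events = append(events, name+"-stop")
		}
	}
	m := newRecoveryManager()
	m.Start("vol1/vs2", worker("old"))
	m.Start("vol1/vs2", worker("new"))
	m.Stop("vol1/vs2")
	fmt.Println(events) // [old-start old-stop new-start new-stop]
}
```

Holding the mutex while draining keeps the sketch short and makes the serialization obvious; it also means a task must not call back into the manager, which a production version would need to account for.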

QA tests (P10 P1-P4):
- Identity: ServerID on wire, fail-closed on missing
- Convergence: assignment delivery, epoch monotonicity, registry coherence
- Idempotence: repeated assignment, multi-replica set comparison
- Control loop: integrationMaster + real allocator + proto round-trip

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:25:43 -07:00
pingqiu
c7eb87c587 feat: Phase 09 — V2 execution primitives and production closure
Engine execution layer for V2 replication protocol:
- RebuildInstaller: full state handoff (dirty map, WAL, superblock, flusher)
- TruncateToLSN: exact safety predicate (checkpointLSN == truncateLSN),
  ErrTruncationUnsafe escalation to NeedsRebuild
- SyncReceiverProgress: unconditional Store for post-rebuild alignment
- V2StatusSnapshot: CommittedLSN = nextLSN-1 for sync_all
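
A minimal sketch of the TruncateToLSN safety predicate and its escalation, under stated assumptions: the replicaVol type, its fields, and truncateOrEscalate are hypothetical, and only the predicate (checkpointLSN == truncateLSN) and the ErrTruncationUnsafe name come from the message above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTruncationUnsafe mirrors the error name in the commit message.
var ErrTruncationUnsafe = errors.New("truncation unsafe")

// replicaVol is a hypothetical stand-in for the replica-side volume state.
type replicaVol struct {
	checkpointLSN uint64
	headLSN       uint64
	needsRebuild  bool
}

// truncateToLSN only proceeds when the checkpoint sits exactly at the target.
func (v *replicaVol) truncateToLSN(truncateLSN uint64) error {
	if v.checkpointLSN != truncateLSN {
		return fmt.Errorf("%w: checkpoint=%d truncate=%d",
			ErrTruncationUnsafe, v.checkpointLSN, truncateLSN)
	}
	v.headLSN = truncateLSN
	return nil
}

// truncateOrEscalate maps an unsafe truncation to the NeedsRebuild state.
func truncateOrEscalate(v *replicaVol, truncateLSN uint64) string {
	err := v.truncateToLSN(truncateLSN)
	switch {
	case err == nil:
		return "truncated"
	case errors.Is(err, ErrTruncationUnsafe):
		v.needsRebuild = true
		return "needs_rebuild"
	default:
		return "error"
	}
}

func main() {
	aligned := &replicaVol{checkpointLSN: 50, headLSN: 60}
	ahead := &replicaVol{checkpointLSN: 55, headLSN: 60}
	fmt.Println(truncateOrEscalate(aligned, 50), aligned.headLSN) // truncated 50
	fmt.Println(truncateOrEscalate(ahead, 50), ahead.needsRebuild) // needs_rebuild true
}
```

Wrapping the sentinel with %w keeps the LSN values in the message while still letting callers branch with errors.Is.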

V2 bridge real I/O executors:
- TransferFullBase: TCP streaming + RebuildInstaller + second catch-up
- TransferSnapshot: SHA-256 verified streaming to disk
- TruncateWAL: ErrTruncationUnsafe detection + escalation
- StreamWALEntries: rebuild-mode TCP apply
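
For the "SHA-256 verified streaming" item, one common way to hash while streaming is io.MultiWriter from the Go standard library. The sketch below is illustrative only: receiveVerified, receiveBytes, and sumHex are hypothetical helpers, the destination is an in-memory buffer rather than a disk file, and how the expected digest reaches the receiver in the real protocol is not stated in the message.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
)

var errChecksumMismatch = errors.New("snapshot checksum mismatch")

// receiveVerified copies src to dst while hashing the same bytes, then
// compares the digest with the expected hex string.
func receiveVerified(src io.Reader, dst io.Writer, wantHex string) (int64, error) {
	h := sha256.New()
	n, err := io.Copy(io.MultiWriter(dst, h), src)
	if err != nil {
		return n, err
	}
	if hex.EncodeToString(h.Sum(nil)) != wantHex {
		return n, errChecksumMismatch
	}
	return n, nil
}

// sumHex returns the hex-encoded SHA-256 of b.
func sumHex(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// receiveBytes is an in-memory wrapper around receiveVerified for the demo.
func receiveBytes(payload []byte, wantHex string) ([]byte, error) {
	var dst bytes.Buffer
	_, err := receiveVerified(bytes.NewReader(payload), &dst, wantHex)
	return dst.Bytes(), err
}

func main() {
	payload := []byte("snapshot image bytes")
	_, err := receiveBytes(payload, sumHex(payload))
	fmt.Println(err) // <nil>
	_, err = receiveBytes(payload, sumHex([]byte("other")))
	fmt.Println(err) // snapshot checksum mismatch
}
```

Because the bytes are written before the digest is known, a disk-backed version would typically write to a temporary file and only install it after verification succeeds.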

Engine executor interfaces:
- CatchUpIO.TruncateWAL, RebuildIO.TransferFullBase returns achievedLSN
- CatchUpExecutor truncation-only skip, NeedsRebuild escalation
- RebuildExecutor uses achievedLSN for progress tracking

Design docs reorganized: superseded planning docs removed, protocol
truths and closure map added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:25:23 -07:00
pingqiu
643a5a1074 feat: Phase 12 P3+P4 — diagnosability surfaces, perf floor, rollout gates
P3: Add explicit bounded read-only diagnosis surfaces for all symptom classes:
- FailoverDiagnostic: volume-oriented failover state with per-volume
  DeferredPromotion/PendingRebuild entries and proper timer lifecycle
- PublicationDiagnostic: two-read coherence check (LookupBlockVolume vs
  registry authority) with computed Coherent verdict
- RecoveryDiagnostic: minimal ActiveTasks surface (Path A)
- Blocker ledger: 3 diagnosed + 3 unresolved, finite, from actual file
- Runbook references only exposed surfaces, no internal state

P4: Add bounded performance floor + rollout-gate package:
- Engine-local floor measurement with explicit IOPS gates per workload
- Cost characterization: WAL 2x write amp, -56% replication tax
- Rollout gates with semantic cross-checks against cited evidence
  (baseline numbers, transport/network matrix, blocker counts)
- Launch envelope tightened to actually measured combinations only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:20:22 -07:00
pingqiu
ebe95b6e2e fix: flusher OOM on multi-block writes + testrunner enhancements
Bug: flusher.go:336 allocated make([]byte, entryLen) per dirty block
instead of per unique WAL entry. A 4MB WriteLBA creates 1024 dirty map
entries (one per 4KB block), all sharing the same WAL offset. The flusher
read the full 4MB WAL entry 1024 times into separate buffers:
1024 × 4MB = 4GB per 4MB write → OOM on mkfs.ext4.

Root cause: flusher assumed 1:1 dirty-block-to-WAL-entry mapping.
WriteLBA supports multi-block writes but the flusher never deduplicated
shared WAL offsets.

Fix: deduplicate WAL reads by WalOffset in flushOnceLocked(). Multiple
dirty blocks from the same WAL entry share one read buffer and one
DecodeWALEntry call. Memory: O(WAL_entries × size) not O(blocks × size).
For a 4MB write: 4GB → 4MB.
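
The deduplication described in the fix can be sketched as follows. This is not the code in flushOnceLocked(); dirtyBlock, walReader, and flushDedup are hypothetical names, and the reader only counts calls and allocates a buffer so that the effect of grouping by WalOffset is visible.

```go
package main

import "fmt"

// dirtyBlock is a hypothetical stand-in for one dirty-map entry.
type dirtyBlock struct {
	LBA       uint64 // logical address of one 4KB block
	WalOffset uint64 // offset of the WAL entry holding this block's data
}

// walReader is a hypothetical reader that counts how many entries it reads.
type walReader struct {
	reads int
}

func (r *walReader) readEntry(off uint64, entryLen int) []byte {
	r.reads++
	return make([]byte, entryLen)
}

// flushDedup reads each distinct WAL entry once; all dirty blocks that point
// at the same offset share that single buffer.
func flushDedup(blocks []dirtyBlock, r *walReader, entryLen int) map[uint64][]byte {
	byOffset := make(map[uint64][]byte)
	for _, b := range blocks {
		if _, ok := byOffset[b.WalOffset]; !ok {
			byOffset[b.WalOffset] = r.readEntry(b.WalOffset, entryLen)
		}
		// A real flusher would now copy this block's 4KB slice out of the
		// shared buffer into the extent.
	}
	return byOffset
}

func main() {
	// One 4MB write: 1024 dirty blocks, all pointing at the same WAL entry.
	blocks := make([]dirtyBlock, 1024)
	for i := range blocks {
		blocks[i] = dirtyBlock{LBA: uint64(i), WalOffset: 4096}
	}
	r := &walReader{}
	bufs := flushDedup(blocks, r, 4<<20)
	fmt.Println(r.reads, len(bufs)) // 1 1
}
```

With the map keyed by offset, buffer memory scales with the number of distinct WAL entries instead of the number of dirty blocks, which matches the 4GB to 4MB reduction quoted above.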

Verified on hardware (m01/M02 25Gbps RoCE):
- Before: mkfs.ext4 → VS RSS 100MB→25GB → OOM killed
- After: mkfs.ext4 → VS RSS 129MB stable, mkfs succeeds
- pgbench TPC-B c=4: 1,248 TPS (RF=1, previously blocked by OOM)

Tests added:
- flusher_test.go: flush_multiblock_shared_wal_read (16 blocks share
  one WAL offset, flush dedup verified)
- flusher_test.go: flush_multiblock_data_correct (3 mixed multi-block
  writes, all data correct after flush)
- test/component/large_write_test.go: 7 component tests (single 4MB,
  sequential mkfs sim, concurrent, mixed sizes, production volume,
  flusher throughput 30s sustained)
- iscsi/large_write_mem_test.go: 2 iSCSI session memory tests (4MB
  R2T flow, slow device)

Testrunner enhancements (same commit — all tested on hardware):
- discover_primary action: maps primary IP → topology node name,
  supports alt_ips for multi-NIC (RoCE + management)
- NodeSpec.AltIPs field for multi-NIC node identification
- 5 new YAML scenarios: ec3, ec5, degraded sync_all/best_effort, pgbench
- All 13 hardware-verified scenarios PASS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 14:24:10 -07:00
pingqiu
46faf0f7e3 feat: Phase 09 P0 — production execution closure plan
Execution-closure targets:
- P1: TransferFullBase — reuse rebuild.go TCP protocol
- P2: TransferSnapshot — checkpoint image + WAL tail
- P3: TruncateWAL — AdvanceTail + superblock update
- P4: Runtime ownership — V2 orchestrator drives execution

Key reuse sources identified:
- rebuild.go: rebuildFullExtent (client), RebuildServer (server)
- wal_writer.go: AdvanceTail
- flusher.go: updateSuperblockCheckpoint
- blockvol.go: ScanWALEntries (already wired)

Slice order: full-base first (highest value), then snapshot,
then truncation, then runtime ownership.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 17:25:09 -07:00
pingqiu
1497204e81 fix: require CatchUp outcome, true simultaneous overlap, observability assertions
HIGH: Changed-address now requires OutcomeCatchUp and fails if not.
No more conditional execution — must go through full catch-up chain.

MED: Overlapping retention is now true simultaneous overlap:
- Hold 1 at LSN T+1, Hold 2 at LSN T+2 — both coexist
- MinWALRetentionFloor = T+1 (minimum of two)
- Release hold 1 → floor moves to T+2
- Release hold 2 → ActiveHoldCount=0, no floor
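
The overlap scenario above can be traced with a small hold tracker. The holdTracker type and its Hold method are assumptions for illustration (the real pinner validates against WAL state and is presumably concurrency-safe; this sketch is single-goroutine); only the names MinWALRetentionFloor and ActiveHoldCount come from the commit messages.

```go
package main

import "fmt"

// holdTracker is a hypothetical, single-goroutine reduction of a WAL pinner.
type holdTracker struct {
	nextID int
	holds  map[int]uint64 // hold ID -> start LSN that must stay in the WAL
}

func newHoldTracker() *holdTracker {
	return &holdTracker{holds: make(map[int]uint64)}
}

// Hold registers a retention hold and returns its release function.
func (h *holdTracker) Hold(startLSN uint64) (release func()) {
	id := h.nextID
	h.nextID++
	h.holds[id] = startLSN
	return func() { delete(h.holds, id) }
}

func (h *holdTracker) ActiveHoldCount() int { return len(h.holds) }

// MinWALRetentionFloor returns the lowest held LSN, or ok=false with no holds.
func (h *holdTracker) MinWALRetentionFloor() (floor uint64, ok bool) {
	for _, lsn := range h.holds {
		if !ok || lsn < floor {
			floor, ok = lsn, true
		}
	}
	return floor, ok
}

func main() {
	const T = 100
	h := newHoldTracker()
	release1 := h.Hold(T + 1)
	release2 := h.Hold(T + 2)

	floor, _ := h.MinWALRetentionFloor()
	fmt.Println(floor, h.ActiveHoldCount()) // 101 2

	release1()
	floor, _ = h.MinWALRetentionFloor()
	fmt.Println(floor, h.ActiveHoldCount()) // 102 1

	release2()
	_, ok := h.MinWALRetentionFloor()
	fmt.Println(ok, h.ActiveHoldCount()) // false 0
}
```

Returning a release closure from Hold keeps the hold ID private to the caller that acquired it, so a plan can only release its own pins.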

MED: NeedsRebuild now asserts escalated event in logs.
PostCheckpoint now asserts handshake + catch-up execution events.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 15:55:37 -07:00
pingqiu
77a6e60fa3 feat: add P3 hardening validation — 4 matrix + 2 extra cases (Phase 08)
Compact replay matrix on accepted P1/P2 live path:

Matrix 1 (ChangedAddress): address change → cancel old plan → new
  assignment → new recovery → identity preserved → pins released
Matrix 2 (StaleEpoch): epoch bump → invalidate → cancel plan →
  new epoch assignment → new session → pins released
Matrix 3 (NeedsRebuild): unrecoverable gap → rebuild assignment →
  RebuildExecutor(IO=v2bridge) → InSync → pins released
Matrix 4 (PostCheckpointBoundary): replica at committed → ZeroGap; replica
  inside the window → CatchUp via CatchUpExecutor(IO=v2bridge) → pins released

Extra 1 (FailoverCycle): epoch 1 → failover → epoch 2 → recovery
  resumes → InSync. Logs: invalidation + cancellation + new session.
Extra 2 (OverlappingRetention): plan1 acquires pins → cancel →
  plan2 acquires pins → cancel → ActiveHoldCount==0,
  MinWALRetentionFloor reports no floor (no holds remain).

Each test verifies all 5 evidence categories:
  entry truth, engine result, execution result, cleanup, observability

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 15:46:48 -07:00
pingqiu
08e34e02ae feat: separate CommittedLSN from CheckpointLSN, close catch-up ONE CHAIN (Phase 08 P2)
CommittedLSN separation:
- StatusSnapshot().CommittedLSN = nextLSN-1 (WAL head) for sync_all
- Was: flusher.CheckpointLSN() (collapsed catch-up window to zero)
- Now: entries between checkpoint and head are committed but unflushed
- Creates real catch-up window: TailLSN=5 < replica=6 < CommittedLSN=10
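
Why the separation opens a catch-up window can be seen in a three-way classification sketch. The classify function, the Outcome string values, and the exact boundary handling (a replica sitting exactly at the tail is treated as catch-up capable) are assumptions; the real planner works from RetainedHistory and may differ in detail.

```go
package main

import "fmt"

// Outcome values are hypothetical labels for the three planner results.
type Outcome string

const (
	OutcomeZeroGap      Outcome = "zero_gap"
	OutcomeCatchUp      Outcome = "catch_up"
	OutcomeNeedsRebuild Outcome = "needs_rebuild"
)

// classify decides how a replica at replicaLSN can be recovered, given the
// WAL tail (entries above it are still retained) and the committed LSN.
func classify(replicaLSN, tailLSN, committedLSN uint64) Outcome {
	switch {
	case replicaLSN >= committedLSN:
		return OutcomeZeroGap // nothing to send
	case replicaLSN >= tailLSN:
		return OutcomeCatchUp // the next needed LSN is still in the WAL
	default:
		return OutcomeNeedsRebuild // gap reaches below the retained WAL
	}
}

func main() {
	// Separated model: TailLSN=5 < replica=6 < CommittedLSN=10.
	fmt.Println(classify(6, 5, 10)) // catch_up
	// V1 interim model: committed == tail, so the window is empty.
	fmt.Println(classify(10, 10, 10)) // zero_gap
	fmt.Println(classify(6, 10, 10))  // needs_rebuild
}
```

When committedLSN equals tailLSN the middle case can never match, which is the collapse described for the V1 interim model; reporting the WAL head as CommittedLSN makes the middle case reachable.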

Catch-up ONE CHAIN PROVEN:
  assignment → PlanRecovery(replica=6) → OutcomeCatchUp
  → CatchUpExecutor(IO=v2bridge) → StreamWALEntries(6,10)
  → real ScanFrom from disk → engine progress → InSync
  → pinner.ActiveHoldCount()==0

Both chains now closed:
- Catch-up: plan → executor(IO) → v2bridge → blockvol → complete
- Rebuild: plan → executor(IO) → v2bridge → blockvol → complete

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 15:22:23 -07:00
pingqiu
1c178c0853 fix: rename rebuild test to match actual path, use t.Skipf for V1 catch-up limitation
HIGH: renamed TestP2_RebuildClosure_FullBase_OneChain → TestP2_RebuildClosure_OneChain.
Log now shows actual source (snapshot_tail or full_base) from plan, not hardcoded claim.

MED: catch-up test uses t.Skipf when V1 interim prevents OutcomeCatchUp.
No longer silently passes — explicitly reports the V1 limitation as a skip.
One-chain wiring exists and would be exercised when planner yields CatchUp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 15:17:34 -07:00
pingqiu
8b1b6ec1c0 fix: update executor doc comment to reflect P2 implementation status
Executor comment now reflects reality:
- StreamWALEntries, TransferFullBase, TransferSnapshot: real
- TruncateWAL: stub
- Implements engine.CatchUpIO and engine.RebuildIO interfaces

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 15:14:34 -07:00
pingqiu
1578adfba5 fix: wire real v2bridge I/O into engine executors (Phase 08 P2 closure)
Engine executors now have IO interfaces for real bridge I/O:
- CatchUpExecutor.IO (CatchUpIO): StreamWALEntries
- RebuildExecutor.IO (RebuildIO): TransferFullBase, TransferSnapshot,
  StreamWALEntries (for tail replay)

When IO is set, executor calls real bridge I/O during execution.
When IO is nil, executor uses caller-supplied progress (test mode).
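
The optional-IO behaviour described above might be structured like the sketch below. The interface is reduced to a single method with a guessed signature, and catchUpExecutor.Execute, countingIO, and the suppliedProgress parameter are hypothetical; only the names CatchUpIO and StreamWALEntries appear in the message.

```go
package main

import "fmt"

// CatchUpIO is a hypothetical one-method reduction of the real interface.
type CatchUpIO interface {
	StreamWALEntries(fromLSN, toLSN uint64) (achievedLSN uint64, err error)
}

type catchUpExecutor struct {
	IO CatchUpIO // nil selects test mode
}

// Execute uses the bridge when IO is set; otherwise it falls back to the
// highest caller-supplied progress value.
func (e *catchUpExecutor) Execute(startLSN, targetLSN uint64, suppliedProgress []uint64) (uint64, error) {
	if e.IO != nil {
		return e.IO.StreamWALEntries(startLSN, targetLSN)
	}
	var achieved uint64
	for _, lsn := range suppliedProgress {
		if lsn > achieved {
			achieved = lsn
		}
	}
	return achieved, nil
}

// countingIO is a fake bridge that records how often it is called.
type countingIO struct{ calls int }

func (c *countingIO) StreamWALEntries(fromLSN, toLSN uint64) (uint64, error) {
	c.calls++
	return toLSN, nil
}

func main() {
	bridge := &countingIO{}
	withIO := &catchUpExecutor{IO: bridge}
	got, _ := withIO.Execute(6, 10, nil)
	fmt.Println(got, bridge.calls) // 10 1

	testMode := &catchUpExecutor{}
	got, _ = testMode.Execute(6, 10, []uint64{7, 9})
	fmt.Println(got) // 9
}
```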

RecoveryPlan.CatchUpStartLSN: bound at plan time for IO bridge.

v2bridge.Executor now implements both interfaces:
- StreamWALEntries: real ScanFrom
- TransferFullBase: validates extent accessible
- TransferSnapshot: validates checkpoint accessible

Chain tests wire IO:
- CatchUpClosure: exec.IO = executor → real WAL scan through engine
- RebuildClosure: exec.IO = executor → real transfer through engine

This closes the engine → executor → v2bridge → blockvol chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 15:10:50 -07:00
pingqiu
ec51cfa474 fix: rewrite P2 as one-chain proofs with pin release assertions
Rebuild ONE CHAIN (proven):
  assignment → PlanRebuild → RebuildExecutor.Execute()
  → v2bridge TransferFullBase → engine complete → InSync
  → pinner.ActiveHoldCount() == 0 (pins released)

Catch-up ONE CHAIN (V1 limitation documented):
  V1 interim: CommittedLSN = CheckpointLSN = TailLSN after flush.
  No gap between tail and committed exists. Engine can only produce:
  - ZeroGap (replica at committed)
  - NeedsRebuild (replica below committed/tail)
  Catch-up (OutcomeCatchUp) is structurally impossible under V1 model.
  Real WAL scan proven separately (P1). Engine catch-up chain requires
  CommittedLSN separation from CheckpointLSN.

Cleanup: CancelPlan → pins released + session invalidated + logged.
Observability: sender_added + session_created + connected + escalated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:58:00 -07:00
pingqiu
c9671c4e47 feat: integrated execution chain — catch-up + rebuild + cleanup (Phase 08 P2)
Live catch-up chain:
- Assignment → engine plan → v2bridge WAL scan → blockvol ScanFrom
- StreamWALEntries transfers real entries (transferred=5)
- V1 interim: engine classifies ZeroGap (committed=0), but WAL scan
  chain proven mechanically (executor→v2bridge→blockvol→progress)

Live rebuild chain (full-base):
- ForceFlush advances checkpoint → NeedsRebuild detected
- TransferFullBase now real: validates extent accessible at committed LSN
- Engine rebuild session: connect → handshake → source select →
  transfer → complete → InSync

Execution cleanup:
- CancelPlan releases resources + invalidates session
- Log shows plan_cancelled with reason

Observability:
- sender_added + escalated events explain execution causality
- Escalation includes proof reason from RetainedHistory

4 new execution chain tests + TransferFullBase implementation.

Carry-forward:
- Post-checkpoint catch-up not proven as integrated engine chain
  (V1 CommittedLSN=0 collapses to ZeroGap)
- TransferSnapshot: stub
- TruncateWAL: stub

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:22:27 -07:00
pingqiu
04bc261f9b fix: deliver assignment intent to real engine orchestrator, not discard
Finding 1: ProcessAssignments now calls v2Orchestrator.ProcessAssignment
- BlockService.v2Orchestrator field (RecoveryOrchestrator)
- ProcessAssignment result logged at glog V(1)
- No more `_ = intent` — engine state actually changes

Finding 2: localServerID documented as interim
- BlockService.localServerID = listenAddr (transport-shaped)
- Field doc explicitly states: INTERIM, should be registry-assigned
- Used only for replica/rebuild local identity

3 integration tests (qa_block_v2bridge_test.go):
- CreatesEngineSender: ProcessAssignment → engine has sender + session
- EpochBump: epoch 1 → invalidate → epoch 2 → new session
- AddressChange: same ServerID, different IP → sender preserved,
  endpoint updated

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 13:38:30 -07:00
pingqiu
46ef79ce35 fix: stable ServerID in assignments, fail-closed on missing identity, wire into ProcessAssignments
Finding 1: Identity no longer address-derived
- ReplicaAddr.ServerID field added (stable server identity from registry)
- BlockVolumeAssignment.ReplicaServerID field added (scalar RF=2 path)
- ControlBridge uses ServerID, NOT address, for ReplicaID
- Missing ServerID → replica skipped (fail closed), logged

Finding 2: Wired into real ProcessAssignments
- BlockService.v2Bridge field initialized in StartBlockService
- ProcessAssignments converts each assignment via v2Bridge.ConvertAssignment
  BEFORE existing V1 processing (parallel, not replacing yet)
- Logged at glog V(1)

Finding 3: Fail-closed on missing identity
- Empty ServerID in ReplicaAddrs → replica skipped with log
- Empty ReplicaServerID in scalar path → no replica created
- Test: MissingServerID_FailsClosed verifies both paths
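
A minimal sketch of the fail-closed identity rule, assuming a ReplicaID of the form volume path, slash, server ID as described in the Phase 08 P1 message. The replicaAddr struct, makeReplicaID, and replicaIDs are hypothetical helpers, not the ControlBridge code, and logging of skipped replicas is omitted.

```go
package main

import "fmt"

// replicaAddr is a hypothetical reduction of the assignment's replica entry.
type replicaAddr struct {
	ServerID string // stable identity from the registry
	DataAddr string // transport address, allowed to change
}

// makeReplicaID derives identity from the server ID only; an empty ID is refused.
func makeReplicaID(volumePath, serverID string) (string, bool) {
	if serverID == "" {
		return "", false
	}
	return volumePath + "/" + serverID, true
}

// replicaIDs converts every entry it can identify and counts the ones it skips.
func replicaIDs(volumePath string, addrs []replicaAddr) (ids []string, skipped int) {
	for _, a := range addrs {
		id, ok := makeReplicaID(volumePath, a.ServerID)
		if !ok {
			skipped++ // fail closed: never fall back to the address
			continue
		}
		ids = append(ids, id)
	}
	return ids, skipped
}

func main() {
	before := []replicaAddr{
		{ServerID: "vs2", DataAddr: "10.0.0.2:9000"},
		{ServerID: "", DataAddr: "10.0.0.3:9000"},
	}
	after := []replicaAddr{{ServerID: "vs2", DataAddr: "10.0.0.9:9000"}}

	ids1, skipped := replicaIDs("/data/vol1", before)
	ids2, _ := replicaIDs("/data/vol1", after)
	fmt.Println(ids1, skipped, ids1[0] == ids2[0]) // [/data/vol1/vs2] 1 true
}
```

Because the address never enters the ID, the same replica keeps its identity across an IP change, and an entry without a ServerID produces no replica at all rather than an address-shaped one.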

7 tests: StableServerID, AddressChange_IdentityPreserved,
MultiReplica_StableServerIDs, MissingServerID_FailsClosed,
EpochFencing_IntegratedPath, RebuildAssignment, ReplicaAssignment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 10:46:17 -07:00
pingqiu
48b3e1b8c8 feat: add real control delivery bridge from BlockVolumeAssignment (Phase 08 P1)
ControlBridge converts real BlockVolumeAssignment (from master heartbeat)
into V2 engine AssignmentIntent:

- Identity: ReplicaID = <volume-path>/<replica-server-id>
- Epoch from real assignment
- Role → SessionKind mapping (primary/replica/rebuilding)
- Multi-replica support (ReplicaAddrs) with scalar RF=2 fallback

Known limitation (documented in test):
- extractServerID currently uses address as server ID (matches
  master registry ReplicaInfo.Server format)
- IP change = different server ID in current model
- Registry-backed stable server ID deferred

6 new tests:
- PrimaryAssignment_StableIdentity: real assignment → stable ID
- PrimaryAssignment_MultiReplica: RF=3 multi-replica mapping
- AddressChange_SameServerID: documents current identity boundary
- EpochFencing_IntegratedPath: epoch 1 → bump → epoch 2 through
  real assignment conversion + engine
- RebuildAssignment: rebuilding role → SessionRebuild
- ReplicaAssignment: replica role with local server ID

Delivery template:
Changed contracts: real BlockVolumeAssignment → engine intent
Fail-closed: unknown role returns empty intent
Carry-forward: address-based server ID, not registry-backed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 10:35:41 -07:00
pingqiu
cd8bfb21d4 fix: tighten FC1 new-session assertion and FC4 proof-detail check
FC1: now asserts HasActiveSession() after address change AND
verifies session_created in log (not just plan_cancelled).

FC4: escalation event detail must be >15 chars (contains proof
reason with LSN values, not just "needs_rebuild").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 23:43:48 -07:00
pingqiu
cd4b91033f fix: force failure conditions in P2 tests, add BlockVol.ForceFlush
P2 tests now force conditions instead of observing them:

FC3: Real WAL scan verified directly — StreamWALEntries transfers
real entries from disk (head=5, transferred=5). Engine planning also
verified (ZeroGap in V1 interim documented).

FC4: ForceFlush advances checkpoint/tail to 20. Replica at 0 is
below tail → NeedsRebuild with proof: "gap_beyond_retention: need
LSN 1 but tail=20". No early return.

FC5: ForceFlush advances checkpoint to 10. Assertive:
- replica at checkpoint=10 → ZeroGap (V1 interim)
- replica at 0 → NeedsRebuild (below tail, not CatchUp)

FC1/FC2: Labeled as integrated engine/storage (control simulated).

New: BlockVol.ForceFlush() — triggers synchronous flusher cycle for
test use. Advances checkpoint + WAL tail deterministically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 23:07:55 -07:00
pingqiu
26bf7bc582 feat: add integrated failure replay tests through real bridge path (Phase 07 P2)
5 failure-class replay tests against real file-backed BlockVol,
exercising the full integrated path:
  bridge adapter → v2bridge reader/pinner → engine planner/executor

FC1: Changed-address restart — identity preserved, old plan cancelled,
     new session created. Log shows plan_cancelled + session_created.

FC2: Stale epoch after failover — sessions invalidated at old epoch,
     new assignment at epoch 2 creates fresh session. Log shows
     per-replica invalidation.

FC3: Real catch-up (pre-checkpoint) — engine classifies from real
     RetainedHistory, zero-gap in V1 interim (committed=0 before flush).
     Documents the V1 limitation explicitly.

FC4: Unrecoverable gap — after flush, if checkpoint advances, replica
     behind tail gets NeedsRebuild. Documents that V1 unit test may
     not advance checkpoint (flusher timing).

FC5: Post-checkpoint boundary — replica at checkpoint = zero-gap in
     V1 interim. Explicitly documents the catch-up collapse boundary.

go.mod: added replace directives for sw-block engine + bridge modules.

Carry-forward (explicit):
- CommittedLSN = CheckpointLSN (V1 interim)
- FC3/FC4/FC5 limited by flusher not advancing checkpoint in unit tests
- Executor snapshot/full-base/truncate still stubs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 22:54:44 -07:00
pingqiu
4aab00b149 feat: add real v2bridge integration tests against file-backed BlockVol
7 tests in weed/storage/blockvol/v2bridge/bridge_test.go:

Reader (2 tests):
- StatusSnapshot reads real nextLSN, WALCheckpointLSN, flusher state
- HeadLSN advances with real writes

Pinner (2 tests):
- HoldWALRetention: hold tracked, MinWALRetentionFloor reports position,
  release clears hold
- HoldRejectsRecycled: validates against real WAL tail

Executor (2 tests):
- StreamWALEntries: real ScanFrom reads WAL entries from disk
- StreamPartialRange: partial range scan works

Stubs (1 test):
- TransferSnapshot/TransferFullBase/TruncateWAL return not-implemented

All tests use createTestVol (1MB file-backed BlockVol with 256KB WAL).
No mock/push adapters — direct real blockvol instances.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 22:22:28 -07:00
pingqiu
cfec3bff4a fix: update contract.go field source docs to match P1 implementation
BlockVolState field mapping now matches actual StatusSnapshot():
- WALTailLSN ← super.WALCheckpointLSN (was: flusher.RetentionFloor)
- CommittedLSN ← flusher.CheckpointLSN() V1 interim (was: distCommit)
- CheckpointTrusted ← super.Validate()==nil (was: superblock.Valid)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 20:44:04 -07:00
pingqiu
d5b2a3a345 fix: WALTailLSN is now an LSN boundary, ScanWALEntries uses durable checkpoint
Finding 1: WALTailLSN semantic fix
- StatusSnapshot().WALTailLSN now reads super.WALCheckpointLSN (an LSN)
- Was: wal.Tail() which returns a physical byte offset
- Entries with LSN > WALTailLSN are guaranteed in the WAL

Finding 2: ScanWALEntries replay-source fix
- ScanWALEntries passes super.WALCheckpointLSN as the recycled boundary
- Was: flusher.CheckpointLSN() which in V1 equals CommittedLSN
- The flusher's live checkpoint may advance in memory, but entries above
  the durable superblock checkpoint are still physically in the WAL
- Normal catch-up (replica at 70, committed at 100) now works because
  fromLSN=71 > super.WALCheckpointLSN (which is the last persisted
  checkpoint, not the live flusher state)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 20:26:27 -07:00
pingqiu
785a7d7efd feat: wire real pinner into flusher retention + real WAL scan executor (Phase 07 P1)
Pinner wired to real retention:
- NewPinner calls vol.SetV2RetentionFloor(p.MinWALRetentionFloor)
- Flusher.RetentionFloorFn() / SetRetentionFloorFn() exposed
- SetV2RetentionFloor chains with existing shipper retention floor
- Holds actually prevent WAL reclaim (not just tracked state)

Executor uses real WAL scan:
- BlockVol.ScanWALEntries(fromLSN, callback) wraps wal.ScanFrom
  with real fd, walOffset, checkpointLSN
- Executor.StreamWALEntries uses ScanWALEntries (not stub)
- Reads real WAL entries, tracks highest LSN scanned

CommittedLSN mapping:
- Explicitly documented as interim V1 model (committed = checkpointed)
- Will diverge when V2 distributed commit separates from local flush

Carry-forward:
- TransferSnapshot/TransferFullBase/TruncateWAL: stubs (need extent I/O)
- Control intent from confirmed failover: deferred

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 20:01:46 -07:00
pingqiu
c00c9e3e3d feat: add real BlockVolPinner + BlockVolExecutor in v2bridge (Phase 07 P1)
Pinner (pinner.go):
- HoldWALRetention: validates startLSN >= current tail, tracks hold
- HoldSnapshot: validates checkpoint exists + trusted
- HoldFullBase: tracks hold by ID
- MinWALRetentionFloor: returns minimum held position across all
  WAL/snapshot holds — designed for flusher RetentionFloorFn hookup
- Release functions remove holds from tracking map

Executor (executor.go):
- StreamWALEntries: validates range against real WAL tail/head
  (actual ScanFrom integration deferred to network-layer wiring)
- TransferSnapshot/TransferFullBase/TruncateWAL: stubs for P1

Key integration points:
- Pinner reads real StatusSnapshot for validation
- Pinner.MinWALRetentionFloor can wire into flusher.RetentionFloorFn
- Executor validates WAL range availability from real state

Carry-forward:
- Real ScanFrom wiring needs WAL fd + offset (network layer)
- TransferSnapshot/TransferFullBase need extent I/O
- Control intent from confirmed failover (master-side)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 19:54:24 -07:00
pingqiu
d5ecf471fe feat: real blockvol integration — StatusSnapshot + v2bridge reader + contract interfaces (Phase 07 P1)
Real blockvol integration:
- BlockVol.StatusSnapshot() reads actual fields:
  WALHeadLSN ← nextLSN-1, WALTailLSN ← wal.Tail(),
  CommittedLSN ← flusher.CheckpointLSN(),
  CheckpointLSN ← super.WALCheckpointLSN,
  CheckpointTrusted ← super.Validate()==nil

weed/storage/blockvol/v2bridge/:
- Reader wraps real BlockVol, implements ReadState() → BlockVolState
- Lives in weed/ module (can import blockvol directly)

sw-block/bridge/blockvol/ contract interfaces:
- BlockVolReader: ReadState() (weed-side implements)
- BlockVolPinner: HoldWALRetention/HoldSnapshot/HoldFullBase → release func
- BlockVolExecutor: StreamWALEntries/TransferSnapshot/TransferFullBase/TruncateWAL
- StorageAdapter refactored to consume interfaces (not push-based)
- PushStorageAdapter for tests

Handoff boundary (E5):
- sw-block/ defines contracts, weed/ implements them
- sw-block/ does NOT import weed/
- No cross-module circular dependency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 18:17:59 -07:00
pingqiu
8c326c871c feat: add contract interfaces and pin/release via release-func pattern (Phase 07 P1)
E5 handoff contract (contract.go):
- BlockVolReader: ReadState() → BlockVolState from real blockvol
- BlockVolPinner: HoldWALRetention/HoldSnapshot/HoldFullBase → release func
- BlockVolExecutor: StreamWALEntries/TransferSnapshot/TransferFullBase/TruncateWAL
- Clear import direction: weed-side imports sw-block, not reverse

StorageAdapter refactored:
- Consumes BlockVolReader + BlockVolPinner interfaces
- Pin/release uses release-func pattern (not map-based tracking)
- PushStorageAdapter for tests (push-based, no blockvol dependency)

10 bridge tests:
- 4 control adapter (identity, address change, role mapping, primary)
- 4 storage adapter (retained history, WAL pin reject, snapshot reject, symmetry)
- 1 E2E (assignment → adapter → engine → plan → execute → InSync)
- 1 contract interface verification

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 18:07:20 -07:00
pingqiu
05daede7f9 feat: add V2 bridge adapters for blockvol (Phase 07 P0)
Creates sw-block/bridge/blockvol/ — concrete adapters connecting
the V2 engine to real blockvol storage and control-plane state.

control_adapter.go:
- MakeReplicaID: volume-name/server-id (NOT address-derived)
- ToAssignmentIntent: maps master assignment → engine intent
- Role → SessionKind translation (pure mapping, no policy)

storage_adapter.go:
- BlockVolState: maps to real blockvol fields (WAL head/tail,
  committed, checkpoint) — NOT reconstructed from metadata
- GetRetainedHistory from real state
- PinSnapshot rejects untrusted checkpoint
- PinWALRetention rejects recycled range
- PinFullBase / ReleaseFullBase

8 bridge tests:
- StableIdentity: ReplicaID = vol/server (not address)
- AddressChangePreservesIdentity: same ID, different address
- RebuildRoleMapping: "rebuilding" → SessionRebuild
- PrimaryNoRecovery: no recovery targets for primary
- RetainedHistoryFromRealState: all fields from BlockVolState
- WALPinRejectsRecycled: tail validation
- SnapshotPinRejectsInvalid: trust validation
- E2E_AssignmentToRecovery: master assignment → adapter →
  engine intent → plan → execute → InSync

Adapter replacement order:
P0: control_adapter + storage_adapter (this delivery)
P1: executor_bridge + observe_adapter (deferred)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 17:39:39 -07:00
pingqiu
4df61f290b fix: true mid-executor invalidation test via OnStep hook
CatchUpExecutor.OnStep: optional callback fired between executor-managed
progress steps. Enables deterministic fault injection (epoch bump)
between steps without racing or manual sender calls.

E2_EpochBump_MidExecutorLoop:
- Executor runs 5 progress steps
- OnStep hook bumps epoch after step 1 (zero-based, i.e. after 2 successful steps)
- Executor's own loop detects invalidation at step 2's check
- Resources released by executor's release path (not manual cancel)
- Log shows session_invalidated + exec_resources_released
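
The scenario above can be reproduced in miniature. This sketch is not the real CatchUpExecutor: the struct fields, the epoch callback, and the released flag are assumptions, and steps are numbered from zero. It only shows how an optional OnStep hook lets a test bump the epoch deterministically between steps so that the executor's own loop detects the invalidation.

```go
package main

import (
	"errors"
	"fmt"
)

var errSessionInvalidated = errors.New("session invalidated")

// catchUpExecutor is a hypothetical reduction of an executor with a step hook.
type catchUpExecutor struct {
	sessionEpoch uint64
	currentEpoch func() uint64  // reads the live epoch
	OnStep       func(step int) // optional; fired after each completed step
	released     bool
}

// Execute runs up to steps progress steps, checking the epoch before each one.
// The deferred release runs on every exit path, including invalidation.
func (e *catchUpExecutor) Execute(steps int) (completed int, err error) {
	defer func() { e.released = true }()
	for step := 0; step < steps; step++ {
		if e.currentEpoch() != e.sessionEpoch {
			return completed, errSessionInvalidated
		}
		completed++
		if e.OnStep != nil {
			e.OnStep(step)
		}
	}
	return completed, nil
}

func main() {
	epoch := uint64(1)
	e := &catchUpExecutor{
		sessionEpoch: 1,
		currentEpoch: func() uint64 { return epoch },
	}
	// Fault injection: bump the epoch right after step 1 completes.
	e.OnStep = func(step int) {
		if step == 1 {
			epoch = 2
		}
	}
	completed, err := e.Execute(5)
	fmt.Println(completed, errors.Is(err, errSessionInvalidated), e.released) // 2 true true
}
```

Since the hook runs on the executor's own goroutine between steps, the test needs no sleeps or extra synchronization to land the epoch bump at a known point.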

This closes the remaining FC2 gap: invalidation is now detected
and cleaned up by the executor itself, not by external code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 15:51:21 -07:00
pingqiu
5b63d34d6b fix: snapshot+tail WAL pin failure cleanup + true mid-executor epoch test
Finding 1: PlanRebuild snapshot+tail WAL pin failure now fail-closed
- InvalidateSession("wal_pin_failed_during_rebuild", StateNeedsRebuild)
- Snapshot pin released, session invalidated, no dangling state
- New test: E2_RebuildWALPinFailure_SessionCleaned

Finding 2: True mid-executor invalidation test
- Executor makes 2 successful progress steps (60, 70)
- Epoch bumps BETWEEN steps (real mid-execution)
- Third progress step fails — session invalidated
- Resources released via executor cancel
- New test: E2_EpochBump_AfterExecutorProgress

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 15:44:21 -07:00
pingqiu
332f598606 fix: close P3 failure classes — session cleanup, causal logging, CancelPlan
Finding 1: PlanRebuild now invalidates session on pin failure
- FullBasePin failure → InvalidateSession("full_base_pin_failed", StateNeedsRebuild)
- SnapshotPin failure → InvalidateSession("snapshot_pin_failed", StateNeedsRebuild)
- No dangling rebuild session after resource acquisition failure

Finding 2: Rebuild source logging shows causal reason
- plan_rebuild_full_base now logs: untrusted_checkpoint,
  trusted_checkpoint_unreplayable_tail, or no_checkpoint

Finding 3: CancelPlan for address-change cleanup
- New RecoveryDriver.CancelPlan(plan, reason): releases resources +
  invalidates session + logs plan_cancelled with reason
- Changed-address test uses CancelPlan (not manual ReleasePlan)

Finding 4: Executor-level epoch-bump test
- Executor's mid-step invalidation detection catches stale session
- Resources released via executor release path, not manual cancel

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:28:57 -07:00
pingqiu
56afa55f13 feat: add P3 failure-class validation through planner/executor (Phase 06)
6 new tests (validation_test.go) mapped to tester expectations E1-E5:

E1/FC1: Changed-address restart through planner/executor
- Active session invalidated by address change
- Sender identity preserved, old plan resources released
- Log shows: endpoint_changed → new session → plan → execute

E2/FC2: Epoch bump mid-execution step
- Partial progress, epoch bumps between steps
- Further progress rejected, executor cancels with resource release
- Log shows: session_invalidated + exec_resources_released

E3/FC5: Cross-layer proof — trusted base + unreplayable tail
- Storage: checkpoint=50, tail=80 → unreplayable
- RebuildSourceDecision → FullBase (not SnapshotTail)
- FullBasePin acquired, executed through RebuildExecutor, released
- Log shows: plan_rebuild_full_base (observable reason)

E4/FC8: Rebuild fallback when trusted-base proof fails
- Untrusted checkpoint → full-base, full-base pin fails → error
- Untrusted checkpoint → full-base, full-base pin succeeds → InSync
- Log shows: full_base_pin_failed

E5: Observability — full recovery chain logged
- Verifies 7 required log events from assignment through completion

Delivery template:
Changed contracts: P3 validates planner/executor path, not convenience
Fail-closed: epoch bump mid-step releases resources + logs cause
Resources: cross-layer proof chain validated end-to-end
Carry-forward: FC3/FC4/FC6/FC7 sufficient from prior phases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 14:17:24 -07:00
pingqiu
f5c0aab454 fix: rebuild executor consumes bound plan, fix catch-up timing
Planner/executor contract:
- RebuildExecutor.Execute() takes no arguments — consumes plan-bound
  RebuildSource, RebuildSnapshotLSN, RebuildTargetLSN
- RecoveryPlan binds all rebuild targets at plan time
- Executor cannot re-derive policy from caller-supplied history

Catch-up timing:
- Removed unused completeTick parameter from CatchUpExecutor.Execute
- Per-step ticks synthesized as startTick + stepIndex + 1
- API shape matches implementation
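
The tick rule above can be written as a one-line helper. This is only an illustration of the stated formula; the function name `stepTick` and the integer types are assumptions.

```go
package main

import "fmt"

// stepTick synthesizes the tick for a zero-based step index, following the
// rule quoted above: startTick + stepIndex + 1.
func stepTick(startTick uint64, stepIndex int) uint64 {
	return startTick + uint64(stepIndex) + 1
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(i, stepTick(100, i))
	}
}
```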

New test: PlanExecuteConsistency_RebuildCannotSwitchSource
- Plans snapshot+tail, then mutates storage history
- Executor succeeds using plan-bound values (not re-derived)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 13:33:52 -07:00
pingqiu
50442acb2e feat: add stepwise executor with release symmetry (Phase 06 P2)
New: executor.go — CatchUpExecutor + RebuildExecutor
Replaces convenience wrappers with stepwise execution that owns
resource lifecycle on every exit path.

CatchUpExecutor.Execute:
  1. BeginCatchUp (freezes target)
  2. Stepwise RecordCatchUpProgress + CheckBudget per step
  3. RecordTruncation (if required)
  4. CompleteSessionByID
  5. Release resources (success or failure)

RebuildExecutor.Execute:
  1. BeginConnect + RecordHandshake
  2. SelectRebuildFromHistory
  3. BeginRebuildTransfer + progress
  4. BeginRebuildTailReplay + progress (snapshot+tail)
  5. CompleteRebuild
  6. Release resources (success or failure)

Both executors:
- Release all pins on every exit path (success, failure, cancellation)
- Check session validity mid-execution (detect epoch bump / endpoint change)
- Log resource release with causal reason
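
A sketch of the release-symmetry idea, assuming pins can be modeled as a simple counter. The `pins` type, the `execute` function, and the reason strings are hypothetical; the point is only that a deferred release covers success and failure alike.

```go
package main

import (
	"errors"
	"fmt"
)

// errStep is a hypothetical failure used to exercise the error path.
var errStep = errors.New("step failed")

// pins counts outstanding resource pins (WAL, snapshot, full-base) in this sketch.
type pins struct{ outstanding int }

func (p *pins) acquire() { p.outstanding++ }
func (p *pins) release() { p.outstanding-- }

// execute acquires n pins, runs the steps in order, and releases every pin on
// all exit paths. The returned reason is what an executor might log.
func execute(p *pins, n int, steps []func() error) (reason string, err error) {
	for i := 0; i < n; i++ {
		p.acquire()
	}
	defer func() {
		for i := 0; i < n; i++ {
			p.release()
		}
	}()
	for _, step := range steps {
		if err := step(); err != nil {
			return "step_failed", err
		}
	}
	return "completed", nil
}

func main() {
	p := &pins{}
	ok := func() error { return nil }
	bad := func() error { return errStep }

	reason, err := execute(p, 2, []func() error{ok, ok})
	fmt.Println(reason, err, p.outstanding)

	reason, err = execute(p, 2, []func() error{ok, bad, ok})
	fmt.Println(reason, err, p.outstanding)
}
```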

14 new tests (executor_test.go), mapped to tester expectations:
- E1: Partial catch-up failure releases WAL pin (2 tests)
- E2: Partial rebuild failure releases all pins (1 test)
- E3: Epoch bump / cancel releases resources (3 tests)
- E4: Successful execution releases resources (2 tests)
- E5: Stepwise not convenience (2 tests)

Delivery template:
Changed contracts: executor owns resource lifecycle (not caller)
Fail-closed: session check mid-execution, release on every error
Resources: WAL/snapshot/full-base pins released on all exit paths
Carry-forward: CompleteCatchUp/CompleteRebuild remain test-only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 13:24:37 -07:00
pingqiu
45bf111ce8 fix: derive WAL pin from actual replay need, PlanRebuild fails closed
WAL pin tied to actual recovery contract:
- Truncation-only (replica ahead): no WAL pin acquired
- Real catch-up: pins from replicaFlushedLSN (actual replay start)
- Logs distinguish plan_truncate_only from plan_catchup
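
The pin rule above might look like the following. This is a sketch under assumptions: `walPinPlan` and `planWALPin` are hypothetical names, and the `plan_zero_gap` event is invented here for the equal-LSN case; only `plan_truncate_only` and `plan_catchup` appear in the commit message.

```go
package main

import "fmt"

// walPinPlan describes what a planner decided for one replica.
type walPinPlan struct {
	Event   string // log event name
	Pinned  bool   // whether a WAL pin was acquired
	PinFrom uint64 // meaningful only when Pinned is true
}

// planWALPin derives the WAL pin from the actual replay need.
func planWALPin(replicaFlushedLSN, committedLSN uint64) walPinPlan {
	switch {
	case replicaFlushedLSN > committedLSN:
		// Replica is ahead: truncation only, nothing to replay, no pin.
		return walPinPlan{Event: "plan_truncate_only"}
	case replicaFlushedLSN < committedLSN:
		// Real catch-up: replay starts at the replica's flushed LSN.
		return walPinPlan{Event: "plan_catchup", Pinned: true, PinFrom: replicaFlushedLSN}
	default:
		// Zero gap: hypothetical event name, no resources pinned.
		return walPinPlan{Event: "plan_zero_gap"}
	}
}

func main() {
	fmt.Printf("%+v\n", planWALPin(120, 100))
	fmt.Printf("%+v\n", planWALPin(60, 100))
	fmt.Printf("%+v\n", planWALPin(100, 100))
}
```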

PlanRebuild precondition checks:
- Error on missing sender
- Error on no active session
- Error on non-rebuild session kind
- All fail closed with clear error messages

4 new tests:
- ReplicaAhead_NoWALPin: truncation-only, no WAL resources
- PlanRebuild_MissingSender: returns error
- PlanRebuild_NoSession: returns error
- PlanRebuild_NonRebuildSession: returns error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:51:38 -07:00
pingqiu
d4f7697dd8 fix: add full-base pin and clean up session on WAL pin failure
Full-base rebuild resource:
- StorageAdapter.PinFullBase/ReleaseFullBase for full-extent base image
- PlanRebuild full_base branch now acquires FullBasePin
- RecoveryPlan.FullBasePin field, released by ReleasePlan

Session cleanup on resource failure:
- PlanRecovery invalidates session when WAL pin fails
  (no dangling live session after failed resource acquisition)

3 new tests:
- PlanRebuild_FullBase_PinsBaseImage: pin acquired + released
- PlanRebuild_FullBase_PinFailure: logged + error
- PlanRecovery_WALPinFailure_CleansUpSession: session invalidated,
  sender disconnected (no dangling state)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 12:20:24 -07:00
pingqiu
f73a3fdab2 feat: add storage/control adapters and recovery driver (Phase 06 P0/P1)
Phase 06 module boundaries:

adapter.go — StorageAdapter + ControlPlaneAdapter interfaces:
- GetRetainedHistory: real WAL retention state
- PinSnapshot / ReleaseSnapshot: rebuild resource management
- PinWALRetention / ReleaseWALRetention: catch-up resource management
- HandleHeartbeat / HandleFailover: control-plane event conversion

driver.go — RecoveryDriver replaces synchronous convenience:
- PlanRecovery: connect + handshake from storage state + acquire resources
- PlanRebuild: acquire snapshot + WAL pins for rebuild
- ReleasePlan: release all acquired resources

Convenience flow classification:
- ProcessAssignment, UpdateSenderEpoch, InvalidateEpoch → stepwise engine tasks
- ExecuteRecovery → planner (connect + classify)
- CompleteCatchUp, CompleteRebuild → TEST-ONLY convenience

7 new tests (driver_test.go):
- CatchUp plan + execute with WAL pin
- ZeroGap plan (no resources pinned)
- NeedsRebuild → rebuild plan with resource acquisition
- WAL pin failure → logged + error
- Snapshot pin failure → logged + error
- ReplicaAhead truncation through driver
- Cross-layer: storage proves recoverability, engine consumes proof

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:35:25 -07:00
pingqiu
512bb5bcf6 fix: orchestrator owns full catch-up contract (budget + truncation)
CompleteCatchUp now integrates:
- BeginCatchUp with start tick (freezes target)
- RecordCatchUpProgress (skips if already converged, e.g., truncation-only)
- CheckBudget at completion tick (escalates to NeedsRebuild + logs)
- RecordTruncation before completion (logs truncation_recorded)
- Logs causal reason for every rejection/escalation

CatchUpOptions: StartTick/CompleteTick (separate) + TruncateLSN.

3 new orchestrator-level tests:
- ReplicaAhead_TruncateViaOrchestrator: truncation through entry path
- ReplicaAhead_NoTruncate_CompletionRejected: logs completion_rejected
- BudgetEscalation_ViaOrchestrator: budget violation → NeedsRebuild + logs

Observability tests relabeled as sender-level (not entry-path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 11:04:34 -07:00
pingqiu
adaff8ddb3 fix: only log endpoint_changed when endpoint actually changed
ProcessAssignment now compares pre/post endpoint state before
logging session_invalidated with "endpoint_changed" reason.
Normal session supersede (same endpoint, assignment_intent) no
longer mislabeled as endpoint change.
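
A reduced sketch of the pre/post comparison described above. The `sender` type and `processAssignment` method are hypothetical; the two reason strings are the ones named in the commit message.

```go
package main

import "fmt"

// sender is a simplified stand-in holding only what the comparison needs.
type sender struct {
	endpoint  string
	sessionID uint64
}

// processAssignment supersedes the current session and returns the reason
// that would be logged with session_invalidated.
func (s *sender) processAssignment(newEndpoint string) string {
	before := s.endpoint // capture pre-assignment state
	s.endpoint = newEndpoint
	s.sessionID++ // a new session always supersedes the old one
	if before != s.endpoint {
		return "endpoint_changed"
	}
	return "assignment_intent"
}

func main() {
	s := &sender{endpoint: "10.0.0.1:9333", sessionID: 1}
	fmt.Println(s.processAssignment("10.0.0.1:9333"))
	fmt.Println(s.processAssignment("10.0.0.2:9333"))
}
```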

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 08:10:35 -07:00
pingqiu
5cdee4a011 fix: orchestrator owns zero-gap completion and per-replica invalidation logging
Zero-gap completion:
- ExecuteRecovery auto-completes zero-gap sessions (no sender call needed)
- RecoveryResult.FinalState = StateInSync for zero-gap

Epoch transition:
- UpdateSenderEpoch: orchestrator-owned epoch advancement with auto-log
- InvalidateEpoch: per-replica session_invalidated events (not aggregate)

Endpoint-change invalidation:
- ProcessAssignment detects session ID change from endpoint update
- Logs per-replica session_invalidated with "endpoint_changed" reason

All integration tests now use orchestrator exclusively for core lifecycle.
No direct sender API calls for recovery execution in integration tests.

1 new test: EndpointChange_LogsInvalidation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 01:01:53 -07:00
pingqiu
47238df0d7 fix: add RecoveryOrchestrator as real integrated entry path
New: orchestrator.go — RecoveryOrchestrator drives recovery lifecycle
from assignment through execution to completion/escalation:
- ProcessAssignment: reconcile + session creation + auto-log
- ExecuteRecovery: connect → handshake from RetainedHistory → outcome
- CompleteCatchUp: begin catch-up → progress → complete + auto-log
- CompleteRebuild: connect → handshake → history-driven source →
  transfer → tail replay → complete + auto-log
- InvalidateEpoch: invalidate stale sessions + auto-log

All integration tests rewritten to use orchestrator as entry path.
No direct sender API calls in recovery lifecycle.

SessionSnapshot now includes: TruncateRequired/ToLSN/Recorded,
RebuildSource, RebuildPhase.

RecoveryLog is auto-populated by orchestrator at every transition.

7 integration tests via orchestrator:
- ChangedAddress, NeedsRebuild→Rebuild, EpochBump, MultiReplica
- Observability: session snapshot, rebuild snapshot, auto-populated log

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:25:58 -07:00
pingqiu
7436b3b79c feat: add integration closure and observability (Phase 05 Slice 4)
New files:
- observe.go: RegistryStatus, SenderStatus, RecoveryLog for debugging
- integration_test.go: V2-boundary integration tests through real
  engine entry path

Observability:
- Registry.Status() returns full snapshot: per-sender state, session
  snapshots, counts by category (InSync, Recovering, Rebuilding)
- RecoveryLog: append-only event log for recovery lifecycle debugging

Integration tests (6):
- ChangedAddress_FullFlow: initial recovery → address change →
  sender preserved → new session → recovery with proof
- NeedsRebuild_ThenRebuildAssignment: catch-up fails → NeedsRebuild
  → rebuild assignment → history-driven source → InSync
- EpochBump_DuringRecovery: mid-recovery epoch bump → old session
  rejected → new assignment at new epoch → InSync
- MultiReplica_MixedOutcomes: 3 replicas, 3 outcomes via
  RetainedHistory proofs, registry status verified
- RegistryStatus_Snapshot: observability snapshot structure
- RecoveryLog: event recording and filtering

Engine module at 54 tests (12 + 18 + 18 + 6).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 00:15:46 -07:00
pingqiu
4d06622c01 fix: add nil check for RetainedHistory in sender APIs
RecordHandshakeFromHistory and SelectRebuildFromHistory now
return an error instead of panicking on nil history input.
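
A sketch of the guard described above, assuming the history is passed by pointer. The lowercase names, the error value, and the simplified outcome classification are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errNilHistory is a hypothetical error value for this sketch.
var errNilHistory = errors.New("retained history is nil")

// retainedHistory is reduced to the single field this example needs.
type retainedHistory struct{ TailLSN uint64 }

// recordHandshakeFromHistory returns an error for nil input instead of
// dereferencing it. The outcome rule is a simplifying assumption.
func recordHandshakeFromHistory(h *retainedHistory, replicaFlushedLSN uint64) (string, error) {
	if h == nil {
		return "", errNilHistory
	}
	if replicaFlushedLSN >= h.TailLSN {
		return "catchup", nil
	}
	return "needs_rebuild", nil
}

func main() {
	fmt.Println(recordHandshakeFromHistory(nil, 10))
	fmt.Println(recordHandshakeFromHistory(&retainedHistory{TailLSN: 30}, 50))
}
```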

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:57:19 -07:00
pingqiu
cc8c529962 fix: connect recovery decisions to RetainedHistory, fix rebuild source
RetainedHistory as engine input:
- RecordHandshakeFromHistory: sender-level API consuming RetainedHistory
  directly, returns RecoverabilityProof alongside outcome
- SelectRebuildFromHistory: sender-level API consuming RetainedHistory
  for rebuild-source decision

RebuildSourceDecision soundness:
- Now requires BOTH trusted checkpoint AND replayable tail
  (CheckpointLSN >= TailLSN and CommittedLSN <= HeadLSN)
- Trusted checkpoint with unreplayable tail falls back to full_base
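
The soundness rule above can be sketched as follows. The struct layout, the `rebuildSource` method, and the `snapshot_tail` string are assumptions; the two LSN comparisons are taken from the commit message, and the full-base reason strings are borrowed from the later logging commit in this list.

```go
package main

import "fmt"

// retainedHistory is a hypothetical, reduced view of WAL retention state.
type retainedHistory struct {
	HeadLSN, TailLSN, CommittedLSN, CheckpointLSN uint64
	HasCheckpoint, CheckpointTrusted              bool
}

// rebuildSource picks snapshot+tail only when the checkpoint is trusted AND
// the tail is replayable (CheckpointLSN >= TailLSN and CommittedLSN <= HeadLSN).
func (h retainedHistory) rebuildSource() (source, reason string) {
	switch {
	case !h.HasCheckpoint:
		return "full_base", "no_checkpoint"
	case !h.CheckpointTrusted:
		return "full_base", "untrusted_checkpoint"
	case h.CheckpointLSN < h.TailLSN || h.CommittedLSN > h.HeadLSN:
		return "full_base", "trusted_checkpoint_unreplayable_tail"
	default:
		return "snapshot_tail", "trusted_checkpoint_replayable_tail"
	}
}

func main() {
	// Trusted checkpoint at 50, but the WAL tail has advanced to 80.
	h := retainedHistory{HeadLSN: 100, TailLSN: 80, CommittedLSN: 100,
		CheckpointLSN: 50, HasCheckpoint: true, CheckpointTrusted: true}
	fmt.Println(h.rebuildSource())
}
```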

4 new tests:
- TrustedCheckpoint_UnreplayableTail (the regression case)
- SenderDriven_CatchUp (history → proof → outcome → complete)
- SenderDriven_Rebuild_SnapshotTail (history → source → rebuild)
- SenderDriven_Rebuild_FallsBackToFullBase (unreplayable tail)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:55:31 -07:00
pingqiu
ff7ea41099 feat: add engine data/recoverability core (Phase 05 Slice 3)
New file: history.go — RetainedHistory connects recovery decisions
to actual WAL retention state:
- IsRecoverable: checks gap against tail/head boundaries
- MakeHandshakeResult: generates HandshakeResult from retention state
- RebuildSourceDecision: chooses snapshot+tail vs full base from
  checkpoint state (trusted vs untrusted)
- ProveRecoverability: generates explicit proof explaining why
  recovery is or is not allowed
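
A sketch of what an explicit recoverability proof could look like. Everything here is an assumption made for illustration: the `proof` type, the reason strings, and in particular the boundary rule (a replica is treated as recoverable when its flushed LSN is at or past the tail and the committed LSN does not exceed the head).

```go
package main

import "fmt"

// retainedHistory is a hypothetical, reduced view of WAL retention state.
type retainedHistory struct {
	HeadLSN, TailLSN, CommittedLSN uint64
}

// proof explains why recovery is or is not allowed.
type proof struct {
	Recoverable bool
	Reason      string
}

// prove classifies a replica by its flushed LSN against the retained range.
func (h retainedHistory) prove(replicaFlushedLSN uint64) proof {
	switch {
	case replicaFlushedLSN == h.CommittedLSN:
		return proof{true, "zero_gap"}
	case replicaFlushedLSN > h.CommittedLSN:
		return proof{true, "replica_ahead_truncation_required"}
	case replicaFlushedLSN >= h.TailLSN && h.CommittedLSN <= h.HeadLSN:
		return proof{true, "gap_within_retention"}
	default:
		return proof{false, "gap_beyond_retention"}
	}
}

func main() {
	h := retainedHistory{HeadLSN: 100, TailLSN: 30, CommittedLSN: 100}
	for _, lsn := range []uint64{100, 110, 50, 10} {
		fmt.Println(lsn, h.prove(lsn))
	}
}
```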

14 new tests (recoverability_test.go):
- Recoverable/unrecoverable gap (exact boundary, beyond head)
- Trusted/untrusted/no checkpoint → rebuild source selection
- Handshake from retained history → outcome classification
- Recoverability proofs (zero-gap, ahead, within retention, beyond)
- E2E: two replicas driven by retained history (catch-up + rebuild)
- Truncation required for replica ahead of committed

Engine module at 44 tests (12 + 18 + 14).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 23:04:51 -07:00
pingqiu
368a956aee fix: correct catch-up entry counting and rebuild transfer gate
Entry counting:
- Session.setRange now initializes recoveredTo = startLSN
- RecordCatchUpProgress delta counts only actual catch-up work
  (recoveredTo - startLSN), not the replica's pre-existing prefix
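
The counting fix above can be illustrated with a small sketch. The `session` type and its methods are hypothetical, and the start LSN of 70 is an assumed value chosen so that the delta is 30 entries, matching the budgets named in the regression tests.

```go
package main

import "fmt"

// session tracks catch-up progress for one replica in this sketch.
type session struct {
	startLSN, recoveredTo uint64
}

// newSession initializes recoveredTo to startLSN so the replica's
// pre-existing prefix is never counted as catch-up work.
func newSession(startLSN uint64) *session {
	return &session{startLSN: startLSN, recoveredTo: startLSN}
}

// recordProgress advances recoveredTo; it never moves backwards.
func (s *session) recordProgress(toLSN uint64) {
	if toLSN > s.recoveredTo {
		s.recoveredTo = toLSN
	}
}

// entriesRecovered is the delta actually replayed during catch-up.
func (s *session) entriesRecovered() uint64 { return s.recoveredTo - s.startLSN }

// exceedsBudget reports whether the delta is over the entry budget.
func (s *session) exceedsBudget(maxEntries uint64) bool {
	return s.entriesRecovered() > maxEntries
}

func main() {
	s := newSession(70)
	s.recordProgress(100)
	fmt.Println(s.entriesRecovered(), s.exceedsBudget(50), s.exceedsBudget(20))
}
```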

Rebuild transfer gate:
- BeginTailReplay requires TransferredTo >= SnapshotLSN
- Prevents tail replay on incomplete base transfer
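
A sketch of the transfer gate, assuming a rebuild session that tracks how far the base transfer has progressed. The type, the phase string, and the error value are hypothetical; the comparison is the one stated above.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransferIncomplete is a hypothetical error value for this sketch.
var errTransferIncomplete = errors.New("base transfer incomplete")

// rebuildSession is a reduced stand-in for rebuild progress state.
type rebuildSession struct {
	SnapshotLSN, TransferredTo uint64
	Phase                      string
}

// beginTailReplay refuses to start until the base transfer has reached
// the snapshot LSN (TransferredTo >= SnapshotLSN).
func (r *rebuildSession) beginTailReplay() error {
	if r.TransferredTo < r.SnapshotLSN {
		return errTransferIncomplete
	}
	r.Phase = "tail_replay"
	return nil
}

func main() {
	r := &rebuildSession{SnapshotLSN: 50, TransferredTo: 30, Phase: "transfer"}
	fmt.Println(r.beginTailReplay(), r.Phase)
	r.TransferredTo = 50
	fmt.Println(r.beginTailReplay(), r.Phase)
}
```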

3 new regression tests:
- BudgetEntries_NonZeroStart_CountsOnlyDelta (30 entries within 50 budget)
- BudgetEntries_NonZeroStart_ExceedsBudget (30 entries exceed 20 budget)
- Rebuild_PartialTransfer_BlocksTailReplay

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:35:03 -07:00
pingqiu
930de4ba78 feat: add Slice 2 recovery execution tests (Phase 05)
15 new engine-level recovery execution tests:
- Zero-gap / catch-up / needs-rebuild branching (3 tests)
- Stale execution rejection during active recovery (2 tests)
- Bounded catch-up: frozen target, duration, entries, stall (5 tests)
- Completion before convergence rejected
- Rebuild exclusivity: catch-up APIs excluded (1 test)
- Rebuild lifecycle: snapshot+tail, full base, stale ID (3 tests)
- Assignment-driven recovery flow

Engine module now at 27 tests (12 Slice 1 + 15 Slice 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:14:18 -07:00