seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-07-19 14:32:41 +00:00

Author	SHA1	Message	Date
pingqiu	600dac6029	feat: Phase 13 CP13-1 — frozen test-first baseline for sync replication gaps Baseline report (phase-13-cp1-baseline.md) from running 44 existing replication-gap tests on current code with zero protocol changes: 37 PASS / 4 FAIL / 3 PASS* 4 FAILs expose real gaps: - ReconnectUsesHandshakeNotBootstrap: degraded shipper doesn't catch up (CP13-5) - CatchupMultipleDisconnects: repeated reconnect cycles don't recover (CP13-5) - NeedsRebuildBlocksAllPaths: stays Degraded after large gap (CP13-5+7) - CatchupDoesNotOverwriteNewerData: catch-up fails at barrier (CP13-5) 3 PASS* are witness-only (pass but don't prove the property): - Bug3_ReplicaAddr: documents gap, not fix (CP13-2) - GapBeyondRetainedWal: asserts barrier failure, not NeedsRebuild (CP13-7) - MaxBytesTriggersNeedsRebuild: logs "not implemented" (CP13-6) No protocol code changed. Baseline is test-first evidence only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 17:07:21 -07:00
pingqiu	c0a805184f	chore: archive superseded V2 design docs Copies of design docs removed in Phase 09, preserved in sw-block/docs/archive/ for historical reference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:26:34 -07:00
pingqiu	bdf20fde71	feat: Phase 12 — production hardening (disturbance, soak, testrunner scenarios) P1 Disturbance: restart/reconnect correctness tests — assignment delivery through real proto → ProcessAssignments, epoch validation on promoted volume, mandatory reconnect assertions P2 Soak: repeated create/failover/recover cycles with end-of-cycle truth checks, runtime hygiene (no stale tasks/entries), steady-state idempotence Testrunner recovery actions + scenarios: - recovery.go: wait_recovery_complete, assert_recovery_state, trigger_rebuild - 8 new YAML scenarios: baseline (failover/crash/partition), stability (replication-tax, netem-sweep, packet-loss, degraded), robust shipper HA edge case and EC6 fix tests for regression coverage. (P3 diagnosability + P4 perf floor committed separately in `643a5a107`) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:26:17 -07:00
pingqiu	bdf83e350e	feat: Phase 11 — product-surface rebinding (snapshot, CSI, publication, restore) P1 Snapshots: CoW snapshot lifecycle through V2 engine path, create/list/delete via master RPC, BaseLSN tracking in manifest, ImportSnapshotForRebuild P2 CSI Lifecycle: masterServerBackend calling real MasterServer in-process, CreateVolume/DeleteVolume/ExpandVolume through CSI → master → VS flow, ExportedControllerServer/ExportedNodeServer for cross-package testing P3 Publication: LookupBlockVolume coherence across failover, iSCSI + NVMe address switching on promotion, repeated lookup self-consistency P4 Restore: RestoreBlockSnapshot RPC through master and volume server, snapshot restore with runtime convergence, epoch/role validation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:25:58 -07:00
pingqiu	3ec8fab2f1	feat: Phase 10 — control-plane closure (identity, convergence, idempotence) Stable identity on wire: - ServerID fields in proto (replica_server_id, server_id on ReplicaAddrMessage) - volumeServerId wired through volume.go → BlockService.SetServerID - Identity derived from canonical server ID, not transport addresses Assignment convergence: - V2 idempotence via lastAppliedAssignment.equals (full replica set comparison) - setupPrimaryReplication/Multi idempotence guards - ProcessAssignments with V2 + V1 dual-path assignment handling Master-driven control loop: - RecoveryManager: serialized cancel-and-drain via done channels - Per-replica heartbeat state reporting (ReplicaShipperStatus) - masterServerBackend: VolumeBackend calling real MasterServer in-process - RestoreBlockSnapshot RPC (master + volume server proto) QA tests (P10 P1-P4): - Identity: ServerID on wire, fail-closed on missing - Convergence: assignment delivery, epoch monotonicity, registry coherence - Idempotence: repeated assignment, multi-replica set comparison - Control loop: integrationMaster + real allocator + proto round-trip Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:25:43 -07:00
pingqiu	c7eb87c587	feat: Phase 09 — V2 execution primitives and production closure Engine execution layer for V2 replication protocol: - RebuildInstaller: full state handoff (dirty map, WAL, superblock, flusher) - TruncateToLSN: exact safety predicate (checkpointLSN == truncateLSN), ErrTruncationUnsafe escalation to NeedsRebuild - SyncReceiverProgress: unconditional Store for post-rebuild alignment - V2StatusSnapshot: CommittedLSN = nextLSN-1 for sync_all V2 bridge real I/O executors: - TransferFullBase: TCP streaming + RebuildInstaller + second catch-up - TransferSnapshot: SHA-256 verified streaming to disk - TruncateWAL: ErrTruncationUnsafe detection + escalation - StreamWALEntries: rebuild-mode TCP apply Engine executor interfaces: - CatchUpIO.TruncateWAL, RebuildIO.TransferFullBase returns achievedLSN - CatchUpExecutor truncation-only skip, NeedsRebuild escalation - RebuildExecutor uses achievedLSN for progress tracking Design docs reorganized: superseded planning docs removed, protocol truths and closure map added. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:25:23 -07:00
pingqiu	643a5a1074	feat: Phase 12 P3+P4 — diagnosability surfaces, perf floor, rollout gates P3: Add explicit bounded read-only diagnosis surfaces for all symptom classes: - FailoverDiagnostic: volume-oriented failover state with per-volume DeferredPromotion/PendingRebuild entries and proper timer lifecycle - PublicationDiagnostic: two-read coherence check (LookupBlockVolume vs registry authority) with computed Coherent verdict - RecoveryDiagnostic: minimal ActiveTasks surface (Path A) - Blocker ledger: 3 diagnosed + 3 unresolved, finite, from actual file - Runbook references only exposed surfaces, no internal state P4: Add bounded performance floor + rollout-gate package: - Engine-local floor measurement with explicit IOPS gates per workload - Cost characterization: WAL 2x write amp, -56% replication tax - Rollout gates with semantic cross-checks against cited evidence (baseline numbers, transport/network matrix, blocker counts) - Launch envelope tightened to actually measured combinations only Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:20:22 -07:00
pingqiu	ebe95b6e2e	fix: flusher OOM on multi-block writes + testrunner enhancements Bug: flusher.go:336 allocated make([]byte, entryLen) per dirty block instead of per unique WAL entry. A 4MB WriteLBA creates 1024 dirty map entries (one per 4KB block), all sharing the same WAL offset. The flusher read the full 4MB WAL entry 1024 times into separate buffers: 1024 × 4MB = 4GB per 4MB write → OOM on mkfs.ext4. Root cause: flusher assumed 1:1 dirty-block-to-WAL-entry mapping. WriteLBA supports multi-block writes but the flusher never deduplicated shared WAL offsets. Fix: deduplicate WAL reads by WalOffset in flushOnceLocked(). Multiple dirty blocks from the same WAL entry share one read buffer and one DecodeWALEntry call. Memory: O(WAL_entries × size) not O(blocks × size). For a 4MB write: 4GB → 4MB. Verified on hardware (m01/M02 25Gbps RoCE): - Before: mkfs.ext4 → VS RSS 100MB→25GB → OOM killed - After: mkfs.ext4 → VS RSS 129MB stable, mkfs succeeds - pgbench TPC-B c=4: 1,248 TPS (RF=1, previously blocked by OOM) Tests added: - flusher_test.go: flush_multiblock_shared_wal_read (16 blocks share one WAL offset, flush dedup verified) - flusher_test.go: flush_multiblock_data_correct (3 mixed multi-block writes, all data correct after flush) - test/component/large_write_test.go: 7 component tests (single 4MB, sequential mkfs sim, concurrent, mixed sizes, production volume, flusher throughput 30s sustained) - iscsi/large_write_mem_test.go: 2 iSCSI session memory tests (4MB R2T flow, slow device) Testrunner enhancements (same commit — all tested on hardware): - discover_primary action: maps primary IP → topology node name, supports alt_ips for multi-NIC (RoCE + management) - NodeSpec.AltIPs field for multi-NIC node identification - 5 new YAML scenarios: ec3, ec5, degraded sync_all/best_effort, pgbench - All 13 hardware-verified scenarios PASS Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 14:24:10 -07:00
pingqiu	46faf0f7e3	feat: Phase 09 P0 — production execution closure plan Execution-closure targets: - P1: TransferFullBase — reuse rebuild.go TCP protocol - P2: TransferSnapshot — checkpoint image + WAL tail - P3: TruncateWAL — AdvanceTail + superblock update - P4: Runtime ownership — V2 orchestrator drives execution Key reuse sources identified: - rebuild.go: rebuildFullExtent (client), RebuildServer (server) - wal_writer.go: AdvanceTail - flusher.go: updateSuperblockCheckpoint - blockvol.go: ScanWALEntries (already wired) Slice order: full-base first (highest value), then snapshot, then truncation, then runtime ownership. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 17:25:09 -07:00
pingqiu	1497204e81	fix: require CatchUp outcome, true simultaneous overlap, observability assertions HIGH: Changed-address now requires OutcomeCatchUp and fails if not. No more conditional execution — must go through full catch-up chain. MED: Overlapping retention is now true simultaneous overlap: - Hold 1 at LSN T+1, Hold 2 at LSN T+2 — both coexist - MinWALRetentionFloor = T+1 (minimum of two) - Release hold 1 → floor moves to T+2 - Release hold 2 → ActiveHoldCount=0, no floor MED: NeedsRebuild now asserts escalated event in logs. PostCheckpoint now asserts handshake + catch-up execution events. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:55:37 -07:00
pingqiu	77a6e60fa3	feat: add P3 hardening validation — 4 matrix + 2 extra cases (Phase 08) Compact replay matrix on accepted P1/P2 live path: Matrix 1 (ChangedAddress): address change → cancel old plan → new assignment → new recovery → identity preserved → pins released Matrix 2 (StaleEpoch): epoch bump → invalidate → cancel plan → new epoch assignment → new session → pins released Matrix 3 (NeedsRebuild): unrecoverable gap → rebuild assignment → RebuildExecutor(IO=v2bridge) → InSync → pins released Matrix 4 (PostCheckpointBoundary): at committed=ZeroGap, in window=CatchUp via CatchUpExecutor(IO=v2bridge) → pins released Extra 1 (FailoverCycle): epoch 1 → failover → epoch 2 → recovery resumes → InSync. Logs: invalidation + cancellation + new session. Extra 2 (OverlappingRetention): plan1 acquires pins → cancel → plan2 acquires pins → cancel → ActiveHoldCount==0, MinWALRetentionFloor has no holds. Each test verifies all 5 evidence categories: entry truth, engine result, execution result, cleanup, observability Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:46:48 -07:00
pingqiu	08e34e02ae	feat: separate CommittedLSN from CheckpointLSN, close catch-up ONE CHAIN (Phase 08 P2) CommittedLSN separation: - StatusSnapshot().CommittedLSN = nextLSN-1 (WAL head) for sync_all - Was: flusher.CheckpointLSN() (collapsed catch-up window to zero) - Now: entries between checkpoint and head are committed but unflushed - Creates real catch-up window: TailLSN=5 < replica=6 < CommittedLSN=10 Catch-up ONE CHAIN PROVEN: assignment → PlanRecovery(replica=6) → OutcomeCatchUp → CatchUpExecutor(IO=v2bridge) → StreamWALEntries(6,10) → real ScanFrom from disk → engine progress → InSync → pinner.ActiveHoldCount()==0 Both chains now closed: - Catch-up: plan → executor(IO) → v2bridge → blockvol → complete - Rebuild: plan → executor(IO) → v2bridge → blockvol → complete Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:22:23 -07:00
pingqiu	1c178c0853	fix: rename rebuild test to match actual path, use t.Skipf for V1 catch-up limitation HIGH: renamed TestP2_RebuildClosure_FullBase_OneChain → TestP2_RebuildClosure_OneChain. Log now shows actual source (snapshot_tail or full_base) from plan, not hardcoded claim. MED: catch-up test uses t.Skipf when V1 interim prevents OutcomeCatchUp. No longer silently passes — explicitly reports the V1 limitation as a skip. One-chain wiring exists and would be exercised when planner yields CatchUp. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:17:34 -07:00
pingqiu	8b1b6ec1c0	fix: update executor doc comment to reflect P2 implementation status Executor comment now reflects reality: - StreamWALEntries, TransferFullBase, TransferSnapshot: real - TruncateWAL: stub - Implements engine.CatchUpIO and engine.RebuildIO interfaces Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:14:34 -07:00
pingqiu	1578adfba5	fix: wire real v2bridge I/O into engine executors (Phase 08 P2 closure) Engine executors now have IO interfaces for real bridge I/O: - CatchUpExecutor.IO (CatchUpIO): StreamWALEntries - RebuildExecutor.IO (RebuildIO): TransferFullBase, TransferSnapshot, StreamWALEntries (for tail replay) When IO is set, executor calls real bridge I/O during execution. When IO is nil, executor uses caller-supplied progress (test mode). RecoveryPlan.CatchUpStartLSN: bound at plan time for IO bridge. v2bridge.Executor now implements both interfaces: - StreamWALEntries: real ScanFrom - TransferFullBase: validates extent accessible - TransferSnapshot: validates checkpoint accessible Chain tests wire IO: - CatchUpClosure: exec.IO = executor → real WAL scan through engine - RebuildClosure: exec.IO = executor → real transfer through engine This closes the engine → executor → v2bridge → blockvol chain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:10:50 -07:00
pingqiu	ec51cfa474	fix: rewrite P2 as one-chain proofs with pin release assertions Rebuild ONE CHAIN (proven): assignment → PlanRebuild → RebuildExecutor.Execute() → v2bridge TransferFullBase → engine complete → InSync → pinner.ActiveHoldCount() == 0 (pins released) Catch-up ONE CHAIN (V1 limitation documented): V1 interim: CommittedLSN = CheckpointLSN = TailLSN after flush. No gap between tail and committed exists. Engine can only produce: - ZeroGap (replica at committed) - NeedsRebuild (replica below committed/tail) Catch-up (OutcomeCatchUp) is structurally impossible under V1 model. Real WAL scan proven separately (P1). Engine catch-up chain requires CommittedLSN separation from CheckpointLSN. Cleanup: CancelPlan → pins released + session invalidated + logged. Observability: sender_added + session_created + connected + escalated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 14:58:00 -07:00
pingqiu	c9671c4e47	feat: integrated execution chain — catch-up + rebuild + cleanup (Phase 08 P2) Live catch-up chain: - Assignment → engine plan → v2bridge WAL scan → blockvol ScanFrom - StreamWALEntries transfers real entries (transferred=5) - V1 interim: engine classifies ZeroGap (committed=0), but WAL scan chain proven mechanically (executor→v2bridge→blockvol→progress) Live rebuild chain (full-base): - ForceFlush advances checkpoint → NeedsRebuild detected - TransferFullBase now real: validates extent accessible at committed LSN - Engine rebuild session: connect → handshake → source select → transfer → complete → InSync Execution cleanup: - CancelPlan releases resources + invalidates session - Log shows plan_cancelled with reason Observability: - sender_added + escalated events explain execution causality - Escalation includes proof reason from RetainedHistory 4 new execution chain tests + TransferFullBase implementation. Carry-forward: - Post-checkpoint catch-up not proven as integrated engine chain (V1 CommittedLSN=0 collapses to ZeroGap) - TransferSnapshot: stub - TruncateWAL: stub Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 14:22:27 -07:00
pingqiu	04bc261f9b	fix: deliver assignment intent to real engine orchestrator, not discard Finding 1: ProcessAssignments now calls v2Orchestrator.ProcessAssignment - BlockService.v2Orchestrator field (RecoveryOrchestrator) - ProcessAssignment result logged at glog V(1) - No more `_ = intent` — engine state actually changes Finding 2: localServerID documented as interim - BlockService.localServerID = listenAddr (transport-shaped) - Field doc explicitly states: INTERIM, should be registry-assigned - Used only for replica/rebuild local identity 3 integration tests (qa_block_v2bridge_test.go): - CreatesEngineSender: ProcessAssignment → engine has sender + session - EpochBump: epoch 1 → invalidate → epoch 2 → new session - AddressChange: same ServerID, different IP → sender preserved, endpoint updated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 13:38:30 -07:00
pingqiu	46ef79ce35	fix: stable ServerID in assignments, fail-closed on missing identity, wire into ProcessAssignments Finding 1: Identity no longer address-derived - ReplicaAddr.ServerID field added (stable server identity from registry) - BlockVolumeAssignment.ReplicaServerID field added (scalar RF=2 path) - ControlBridge uses ServerID, NOT address, for ReplicaID - Missing ServerID → replica skipped (fail closed), logged Finding 2: Wired into real ProcessAssignments - BlockService.v2Bridge field initialized in StartBlockService - ProcessAssignments converts each assignment via v2Bridge.ConvertAssignment BEFORE existing V1 processing (parallel, not replacing yet) - Logged at glog V(1) Finding 3: Fail-closed on missing identity - Empty ServerID in ReplicaAddrs → replica skipped with log - Empty ReplicaServerID in scalar path → no replica created - Test: MissingServerID_FailsClosed verifies both paths 7 tests: StableServerID, AddressChange_IdentityPreserved, MultiReplica_StableServerIDs, MissingServerID_FailsClosed, EpochFencing_IntegratedPath, RebuildAssignment, ReplicaAssignment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:46:17 -07:00
pingqiu	48b3e1b8c8	feat: add real control delivery bridge from BlockVolumeAssignment (Phase 08 P1) ControlBridge converts real BlockVolumeAssignment (from master heartbeat) into V2 engine AssignmentIntent: - Identity: ReplicaID = <volume-path>/<replica-server-id> - Epoch from real assignment - Role → SessionKind mapping (primary/replica/rebuilding) - Multi-replica support (ReplicaAddrs) with scalar RF=2 fallback Known limitation (documented in test): - extractServerID currently uses address as server ID (matches master registry ReplicaInfo.Server format) - IP change = different server ID in current model - Registry-backed stable server ID deferred 6 new tests: - PrimaryAssignment_StableIdentity: real assignment → stable ID - PrimaryAssignment_MultiReplica: RF=3 multi-replica mapping - AddressChange_SameServerID: documents current identity boundary - EpochFencing_IntegratedPath: epoch 1 → bump → epoch 2 through real assignment conversion + engine - RebuildAssignment: rebuilding role → SessionRebuild - ReplicaAssignment: replica role with local server ID Delivery template: Changed contracts: real BlockVolumeAssignment → engine intent Fail-closed: unknown role returns empty intent Carry-forward: address-based server ID, not registry-backed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:35:41 -07:00
pingqiu	cd8bfb21d4	fix: tighten FC1 new-session assertion and FC4 proof-detail check FC1: now asserts HasActiveSession() after address change AND verifies session_created in log (not just plan_cancelled). FC4: escalation event detail must be >15 chars (contains proof reason with LSN values, not just "needs_rebuild"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 23:43:48 -07:00
pingqiu	cd4b91033f	fix: force failure conditions in P2 tests, add BlockVol.ForceFlush P2 tests now force conditions instead of observing them: FC3: Real WAL scan verified directly — StreamWALEntries transfers real entries from disk (head=5, transferred=5). Engine planning also verified (ZeroGap in V1 interim documented). FC4: ForceFlush advances checkpoint/tail to 20. Replica at 0 is below tail → NeedsRebuild with proof: "gap_beyond_retention: need LSN 1 but tail=20". No early return. FC5: ForceFlush advances checkpoint to 10. Assertive: - replica at checkpoint=10 → ZeroGap (V1 interim) - replica at 0 → NeedsRebuild (below tail, not CatchUp) FC1/FC2: Labeled as integrated engine/storage (control simulated). New: BlockVol.ForceFlush() — triggers synchronous flusher cycle for test use. Advances checkpoint + WAL tail deterministically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 23:07:55 -07:00
pingqiu	26bf7bc582	feat: add integrated failure replay tests through real bridge path (Phase 07 P2) 5 failure-class replay tests against real file-backed BlockVol, exercising the full integrated path: bridge adapter → v2bridge reader/pinner → engine planner/executor FC1: Changed-address restart — identity preserved, old plan cancelled, new session created. Log shows plan_cancelled + session_created. FC2: Stale epoch after failover — sessions invalidated at old epoch, new assignment at epoch 2 creates fresh session. Log shows per-replica invalidation. FC3: Real catch-up (pre-checkpoint) — engine classifies from real RetainedHistory, zero-gap in V1 interim (committed=0 before flush). Documents the V1 limitation explicitly. FC4: Unrecoverable gap — after flush, if checkpoint advances, replica behind tail gets NeedsRebuild. Documents that V1 unit test may not advance checkpoint (flusher timing). FC5: Post-checkpoint boundary — replica at checkpoint = zero-gap in V1 interim. Explicitly documents the catch-up collapse boundary. go.mod: added replace directives for sw-block engine + bridge modules. Carry-forward (explicit): - CommittedLSN = CheckpointLSN (V1 interim) - FC3/FC4/FC5 limited by flusher not advancing checkpoint in unit tests - Executor snapshot/full-base/truncate still stubs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 22:54:44 -07:00
pingqiu	4aab00b149	feat: add real v2bridge integration tests against file-backed BlockVol 7 tests in weed/storage/blockvol/v2bridge/bridge_test.go: Reader (2 tests): - StatusSnapshot reads real nextLSN, WALCheckpointLSN, flusher state - HeadLSN advances with real writes Pinner (2 tests): - HoldWALRetention: hold tracked, MinWALRetentionFloor reports position, release clears hold - HoldRejectsRecycled: validates against real WAL tail Executor (2 tests): - StreamWALEntries: real ScanFrom reads WAL entries from disk - StreamPartialRange: partial range scan works Stubs (1 test): - TransferSnapshot/TransferFullBase/TruncateWAL return not-implemented All tests use createTestVol (1MB file-backed BlockVol with 256KB WAL). No mock/push adapters — direct real blockvol instances. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 22:22:28 -07:00
pingqiu	cfec3bff4a	fix: update contract.go field source docs to match P1 implementation BlockVolState field mapping now matches actual StatusSnapshot(): - WALTailLSN ← super.WALCheckpointLSN (was: flusher.RetentionFloor) - CommittedLSN ← flusher.CheckpointLSN() V1 interim (was: distCommit) - CheckpointTrusted ← super.Validate()==nil (was: superblock.Valid) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 20:44:04 -07:00
pingqiu	d5b2a3a345	fix: WALTailLSN is now an LSN boundary, ScanWALEntries uses durable checkpoint Finding 1: WALTailLSN semantic fix - StatusSnapshot().WALTailLSN now reads super.WALCheckpointLSN (an LSN) - Was: wal.Tail() which returns a physical byte offset - Entries with LSN > WALTailLSN are guaranteed in the WAL Finding 2: ScanWALEntries replay-source fix - ScanWALEntries passes super.WALCheckpointLSN as the recycled boundary - Was: flusher.CheckpointLSN() which in V1 equals CommittedLSN - The flusher's live checkpoint may advance in memory, but entries above the durable superblock checkpoint are still physically in the WAL - Normal catch-up (replica at 70, committed at 100) now works because fromLSN=71 > super.WALCheckpointLSN (which is the last persisted checkpoint, not the live flusher state) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 20:26:27 -07:00
pingqiu	785a7d7efd	feat: wire real pinner into flusher retention + real WAL scan executor (Phase 07 P1) Pinner wired to real retention: - NewPinner calls vol.SetV2RetentionFloor(p.MinWALRetentionFloor) - Flusher.RetentionFloorFn() / SetRetentionFloorFn() exposed - SetV2RetentionFloor chains with existing shipper retention floor - Holds actually prevent WAL reclaim (not just tracked state) Executor uses real WAL scan: - BlockVol.ScanWALEntries(fromLSN, callback) wraps wal.ScanFrom with real fd, walOffset, checkpointLSN - Executor.StreamWALEntries uses ScanWALEntries (not stub) - Reads real WAL entries, tracks highest LSN scanned CommittedLSN mapping: - Explicitly documented as interim V1 model (committed = checkpointed) - Will diverge when V2 distributed commit separates from local flush Carry-forward: - TransferSnapshot/TransferFullBase/TruncateWAL: stubs (need extent I/O) - Control intent from confirmed failover: deferred Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 20:01:46 -07:00
pingqiu	c00c9e3e3d	feat: add real BlockVolPinner + BlockVolExecutor in v2bridge (Phase 07 P1) Pinner (pinner.go): - HoldWALRetention: validates startLSN >= current tail, tracks hold - HoldSnapshot: validates checkpoint exists + trusted - HoldFullBase: tracks hold by ID - MinWALRetentionFloor: returns minimum held position across all WAL/snapshot holds — designed for flusher RetentionFloorFn hookup - Release functions remove holds from tracking map Executor (executor.go): - StreamWALEntries: validates range against real WAL tail/head (actual ScanFrom integration deferred to network-layer wiring) - TransferSnapshot/TransferFullBase/TruncateWAL: stubs for P1 Key integration points: - Pinner reads real StatusSnapshot for validation - Pinner.MinWALRetentionFloor can wire into flusher.RetentionFloorFn - Executor validates WAL range availability from real state Carry-forward: - Real ScanFrom wiring needs WAL fd + offset (network layer) - TransferSnapshot/TransferFullBase need extent I/O - Control intent from confirmed failover (master-side) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 19:54:24 -07:00
pingqiu	d5ecf471fe	feat: real blockvol integration — StatusSnapshot + v2bridge reader + contract interfaces (Phase 07 P1) Real blockvol integration: - BlockVol.StatusSnapshot() reads actual fields: WALHeadLSN ← nextLSN-1, WALTailLSN ← wal.Tail(), CommittedLSN ← flusher.CheckpointLSN(), CheckpointLSN ← super.WALCheckpointLSN, CheckpointTrusted ← super.Validate()==nil weed/storage/blockvol/v2bridge/: - Reader wraps real BlockVol, implements ReadState() → BlockVolState - Lives in weed/ module (can import blockvol directly) sw-block/bridge/blockvol/ contract interfaces: - BlockVolReader: ReadState() (weed-side implements) - BlockVolPinner: HoldWALRetention/HoldSnapshot/HoldFullBase → release func - BlockVolExecutor: StreamWALEntries/TransferSnapshot/TransferFullBase/TruncateWAL - StorageAdapter refactored to consume interfaces (not push-based) - PushStorageAdapter for tests Handoff boundary (E5): - sw-block/ defines contracts, weed/ implements them - sw-block/ does NOT import weed/ - No cross-module circular dependency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 18:17:59 -07:00
pingqiu	8c326c871c	feat: add contract interfaces and pin/release via release-func pattern (Phase 07 P1) E5 handoff contract (contract.go): - BlockVolReader: ReadState() → BlockVolState from real blockvol - BlockVolPinner: HoldWALRetention/HoldSnapshot/HoldFullBase → release func - BlockVolExecutor: StreamWALEntries/TransferSnapshot/TransferFullBase/TruncateWAL - Clear import direction: weed-side imports sw-block, not reverse StorageAdapter refactored: - Consumes BlockVolReader + BlockVolPinner interfaces - Pin/release uses release-func pattern (not map-based tracking) - PushStorageAdapter for tests (push-based, no blockvol dependency) 10 bridge tests: - 4 control adapter (identity, address change, role mapping, primary) - 4 storage adapter (retained history, WAL pin reject, snapshot reject, symmetry) - 1 E2E (assignment → adapter → engine → plan → execute → InSync) - 1 contract interface verification Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 18:07:20 -07:00
pingqiu	05daede7f9	feat: add V2 bridge adapters for blockvol (Phase 07 P0) Creates sw-block/bridge/blockvol/ — concrete adapters connecting the V2 engine to real blockvol storage and control-plane state. control_adapter.go: - MakeReplicaID: volume-name/server-id (NOT address-derived) - ToAssignmentIntent: maps master assignment → engine intent - Role → SessionKind translation (pure mapping, no policy) storage_adapter.go: - BlockVolState: maps to real blockvol fields (WAL head/tail, committed, checkpoint) — NOT reconstructed from metadata - GetRetainedHistory from real state - PinSnapshot rejects untrusted checkpoint - PinWALRetention rejects recycled range - PinFullBase / ReleaseFullBase 8 bridge tests: - StableIdentity: ReplicaID = vol/server (not address) - AddressChangePreservesIdentity: same ID, different address - RebuildRoleMapping: "rebuilding" → SessionRebuild - PrimaryNoRecovery: no recovery targets for primary - RetainedHistoryFromRealState: all fields from BlockVolState - WALPinRejectsRecycled: tail validation - SnapshotPinRejectsInvalid: trust validation - E2E_AssignmentToRecovery: master assignment → adapter → engine intent → plan → execute → InSync Adapter replacement order: P0: control_adapter + storage_adapter (this delivery) P1: executor_bridge + observe_adapter (deferred) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 17:39:39 -07:00
pingqiu	4df61f290b	fix: true mid-executor invalidation test via OnStep hook CatchUpExecutor.OnStep: optional callback fired between executor-managed progress steps. Enables deterministic fault injection (epoch bump) between steps without racing or manual sender calls. E2_EpochBump_MidExecutorLoop: - Executor runs 5 progress steps - OnStep hook bumps epoch after step 1 (after 2 successful steps) - Executor's own loop detects invalidation at step 2's check - Resources released by executor's release path (not manual cancel) - Log shows session_invalidated + exec_resources_released This closes the remaining FC2 gap: invalidation is now detected and cleaned up by the executor itself, not by external code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 15:51:21 -07:00
pingqiu	5b63d34d6b	fix: snapshot+tail WAL pin failure cleanup + true mid-executor epoch test Finding 1: PlanRebuild snapshot+tail WAL pin failure now fail-closed - InvalidateSession("wal_pin_failed_during_rebuild", StateNeedsRebuild) - Snapshot pin released, session invalidated, no dangling state - New test: E2_RebuildWALPinFailure_SessionCleaned Finding 2: True mid-executor invalidation test - Executor makes 2 successful progress steps (60, 70) - Epoch bumps BETWEEN steps (real mid-execution) - Third progress step fails — session invalidated - Resources released via executor cancel - New test: E2_EpochBump_AfterExecutorProgress Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 15:44:21 -07:00
pingqiu	332f598606	fix: close P3 failure classes — session cleanup, causal logging, CancelPlan Finding 1: PlanRebuild now invalidates session on pin failure - FullBasePin failure → InvalidateSession("full_base_pin_failed", StateNeedsRebuild) - SnapshotPin failure → InvalidateSession("snapshot_pin_failed", StateNeedsRebuild) - No dangling rebuild session after resource acquisition failure Finding 2: Rebuild source logging shows causal reason - plan_rebuild_full_base now logs: untrusted_checkpoint, trusted_checkpoint_unreplayable_tail, or no_checkpoint Finding 3: CancelPlan for address-change cleanup - New RecoveryDriver.CancelPlan(plan, reason): releases resources + invalidates session + logs plan_cancelled with reason - Changed-address test uses CancelPlan (not manual ReleasePlan) Finding 4: Executor-level epoch-bump test - Executor's mid-step invalidation detection catches stale session - Resources released via executor release path, not manual cancel Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:28:57 -07:00
pingqiu	56afa55f13	feat: add P3 failure-class validation through planner/executor (Phase 06) 6 new tests (validation_test.go) mapped to tester expectations E1-E5: E1/FC1: Changed-address restart through planner/executor - Active session invalidated by address change - Sender identity preserved, old plan resources released - Log shows: endpoint_changed → new session → plan → execute E2/FC2: Epoch bump mid-execution step - Partial progress, epoch bumps between steps - Further progress rejected, executor cancels with resource release - Log shows: session_invalidated + exec_resources_released E3/FC5: Cross-layer proof — trusted base + unreplayable tail - Storage: checkpoint=50, tail=80 → unreplayable - RebuildSourceDecision → FullBase (not SnapshotTail) - FullBasePin acquired, executed through RebuildExecutor, released - Log shows: plan_rebuild_full_base (observable reason) E4/FC8: Rebuild fallback when trusted-base proof fails - Untrusted checkpoint → full-base, full-base pin fails → error - Untrusted checkpoint → full-base, full-base pin succeeds → InSync - Log shows: full_base_pin_failed E5: Observability — full recovery chain logged - Verifies 7 required log events from assignment through completion Delivery template: Changed contracts: P3 validates planner/executor path, not convenience Fail-closed: epoch bump mid-step releases resources + logs cause Resources: cross-layer proof chain validated end-to-end Carry-forward: FC3/FC4/FC6/FC7 sufficient from prior phases Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 14:17:24 -07:00
pingqiu	f5c0aab454	fix: rebuild executor consumes bound plan, fix catch-up timing Planner/executor contract: - RebuildExecutor.Execute() takes no arguments — consumes plan-bound RebuildSource, RebuildSnapshotLSN, RebuildTargetLSN - RecoveryPlan binds all rebuild targets at plan time - Executor cannot re-derive policy from caller-supplied history Catch-up timing: - Removed unused completeTick parameter from CatchUpExecutor.Execute - Per-step ticks synthesized as startTick + stepIndex + 1 - API shape matches implementation New test: PlanExecuteConsistency_RebuildCannotSwitchSource - Plans snapshot+tail, then mutates storage history - Executor succeeds using plan-bound values (not re-derived) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 13:33:52 -07:00
pingqiu	50442acb2e	feat: add stepwise executor with release symmetry (Phase 06 P2) New: executor.go — CatchUpExecutor + RebuildExecutor Replaces convenience wrappers with stepwise execution that owns resource lifecycle on every exit path. CatchUpExecutor.Execute: 1. BeginCatchUp (freezes target) 2. Stepwise RecordCatchUpProgress + CheckBudget per step 3. RecordTruncation (if required) 4. CompleteSessionByID 5. Release resources (success or failure) RebuildExecutor.Execute: 1. BeginConnect + RecordHandshake 2. SelectRebuildFromHistory 3. BeginRebuildTransfer + progress 4. BeginRebuildTailReplay + progress (snapshot+tail) 5. CompleteRebuild 6. Release resources (success or failure) Both executors: - Release all pins on every exit path (success, failure, cancellation) - Check session validity mid-execution (detect epoch bump / endpoint change) - Log resource release with causal reason 14 new tests (executor_test.go), mapped to tester expectations: - E1: Partial catch-up failure releases WAL pin (2 tests) - E2: Partial rebuild failure releases all pins (1 test) - E3: Epoch bump / cancel releases resources (3 tests) - E4: Successful execution releases resources (2 tests) - E5: Stepwise not convenience (2 tests) Delivery template: Changed contracts: executor owns resource lifecycle (not caller) Fail-closed: session check mid-execution, release on every error Resources: WAL/snapshot/full-base pins released on all exit paths Carry-forward: CompleteCatchUp/CompleteRebuild remain test-only Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 13:24:37 -07:00
pingqiu	45bf111ce8	fix: derive WAL pin from actual replay need, PlanRebuild fails closed WAL pin tied to actual recovery contract: - Truncation-only (replica ahead): no WAL pin acquired - Real catch-up: pins from replicaFlushedLSN (actual replay start) - Logs distinguish plan_truncate_only from plan_catchup PlanRebuild precondition checks: - Error on missing sender - Error on no active session - Error on non-rebuild session kind - All fail closed with clear error messages 4 new tests: - ReplicaAhead_NoWALPin: truncation-only, no WAL resources - PlanRebuild_MissingSender: returns error - PlanRebuild_NoSession: returns error - PlanRebuild_NonRebuildSession: returns error Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:51:38 -07:00
pingqiu	d4f7697dd8	fix: add full-base pin and clean up session on WAL pin failure Full-base rebuild resource: - StorageAdapter.PinFullBase/ReleaseFullBase for full-extent base image - PlanRebuild full_base branch now acquires FullBasePin - RecoveryPlan.FullBasePin field, released by ReleasePlan Session cleanup on resource failure: - PlanRecovery invalidates session when WAL pin fails (no dangling live session after failed resource acquisition) 3 new tests: - PlanRebuild_FullBase_PinsBaseImage: pin acquired + released - PlanRebuild_FullBase_PinFailure: logged + error - PlanRecovery_WALPinFailure_CleansUpSession: session invalidated, sender disconnected (no dangling state) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:20:24 -07:00
pingqiu	f73a3fdab2	feat: add storage/control adapters and recovery driver (Phase 06 P0/P1) Phase 06 module boundaries: adapter.go — StorageAdapter + ControlPlaneAdapter interfaces: - GetRetainedHistory: real WAL retention state - PinSnapshot / ReleaseSnapshot: rebuild resource management - PinWALRetention / ReleaseWALRetention: catch-up resource management - HandleHeartbeat / HandleFailover: control-plane event conversion driver.go — RecoveryDriver replaces synchronous convenience: - PlanRecovery: connect + handshake from storage state + acquire resources - PlanRebuild: acquire snapshot + WAL pins for rebuild - ReleasePlan: release all acquired resources Convenience flow classification: - ProcessAssignment, UpdateSenderEpoch, InvalidateEpoch → stepwise engine tasks - ExecuteRecovery → planner (connect + classify) - CompleteCatchUp, CompleteRebuild → TEST-ONLY convenience 7 new tests (driver_test.go): - CatchUp plan + execute with WAL pin - ZeroGap plan (no resources pinned) - NeedsRebuild → rebuild plan with resource acquisition - WAL pin failure → logged + error - Snapshot pin failure → logged + error - ReplicaAhead truncation through driver - Cross-layer: storage proves recoverability, engine consumes proof Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 11:35:25 -07:00
pingqiu	512bb5bcf6	fix: orchestrator owns full catch-up contract (budget + truncation) CompleteCatchUp now integrates: - BeginCatchUp with start tick (freezes target) - RecordCatchUpProgress (skips if already converged, e.g., truncation-only) - CheckBudget at completion tick (escalates to NeedsRebuild + logs) - RecordTruncation before completion (logs truncation_recorded) - Logs causal reason for every rejection/escalation CatchUpOptions: StartTick/CompleteTick (separate) + TruncateLSN. 3 new orchestrator-level tests: - ReplicaAhead_TruncateViaOrchestrator: truncation through entry path - ReplicaAhead_NoTruncate_CompletionRejected: logs completion_rejected - BudgetEscalation_ViaOrchestrator: budget violation → NeedsRebuild + logs Observability tests relabeled as sender-level (not entry-path). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 11:04:34 -07:00
pingqiu	adaff8ddb3	fix: only log endpoint_changed when endpoint actually changed ProcessAssignment now compares pre/post endpoint state before logging session_invalidated with "endpoint_changed" reason. Normal session supersede (same endpoint, assignment_intent) no longer mislabeled as endpoint change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 08:10:35 -07:00
pingqiu	5cdee4a011	fix: orchestrator owns zero-gap completion and per-replica invalidation logging Zero-gap completion: - ExecuteRecovery auto-completes zero-gap sessions (no sender call needed) - RecoveryResult.FinalState = StateInSync for zero-gap Epoch transition: - UpdateSenderEpoch: orchestrator-owned epoch advancement with auto-log - InvalidateEpoch: per-replica session_invalidated events (not aggregate) Endpoint-change invalidation: - ProcessAssignment detects session ID change from endpoint update - Logs per-replica session_invalidated with "endpoint_changed" reason All integration tests now use orchestrator exclusively for core lifecycle. No direct sender API calls for recovery execution in integration tests. 1 new test: EndpointChange_LogsInvalidation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 01:01:53 -07:00
pingqiu	47238df0d7	fix: add RecoveryOrchestrator as real integrated entry path New: orchestrator.go — RecoveryOrchestrator drives recovery lifecycle from assignment through execution to completion/escalation: - ProcessAssignment: reconcile + session creation + auto-log - ExecuteRecovery: connect → handshake from RetainedHistory → outcome - CompleteCatchUp: begin catch-up → progress → complete + auto-log - CompleteRebuild: connect → handshake → history-driven source → transfer → tail replay → complete + auto-log - InvalidateEpoch: invalidate stale sessions + auto-log All integration tests rewritten to use orchestrator as entry path. No direct sender API calls in recovery lifecycle. SessionSnapshot now includes: TruncateRequired/ToLSN/Recorded, RebuildSource, RebuildPhase. RecoveryLog is auto-populated by orchestrator at every transition. 7 integration tests via orchestrator: - ChangedAddress, NeedsRebuild→Rebuild, EpochBump, MultiReplica - Observability: session snapshot, rebuild snapshot, auto-populated log Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 00:25:58 -07:00
pingqiu	7436b3b79c	feat: add integration closure and observability (Phase 05 Slice 4) New files: - observe.go: RegistryStatus, SenderStatus, RecoveryLog for debugging - integration_test.go: V2-boundary integration tests through real engine entry path Observability: - Registry.Status() returns full snapshot: per-sender state, session snapshots, counts by category (InSync, Recovering, Rebuilding) - RecoveryLog: append-only event log for recovery lifecycle debugging Integration tests (6): - ChangedAddress_FullFlow: initial recovery → address change → sender preserved → new session → recovery with proof - NeedsRebuild_ThenRebuildAssignment: catch-up fails → NeedsRebuild → rebuild assignment → history-driven source → InSync - EpochBump_DuringRecovery: mid-recovery epoch bump → old session rejected → new assignment at new epoch → InSync - MultiReplica_MixedOutcomes: 3 replicas, 3 outcomes via RetainedHistory proofs, registry status verified - RegistryStatus_Snapshot: observability snapshot structure - RecoveryLog: event recording and filtering Engine module at 54 tests (12 + 18 + 18 + 6). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 00:15:46 -07:00
pingqiu	4d06622c01	fix: add nil check for RetainedHistory in sender APIs RecordHandshakeFromHistory and SelectRebuildFromHistory now return an error instead of panicking on nil history input. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 23:57:19 -07:00
pingqiu	cc8c529962	fix: connect recovery decisions to RetainedHistory, fix rebuild source RetainedHistory as engine input: - RecordHandshakeFromHistory: sender-level API consuming RetainedHistory directly, returns RecoverabilityProof alongside outcome - SelectRebuildFromHistory: sender-level API consuming RetainedHistory for rebuild-source decision RebuildSourceDecision soundness: - Now requires BOTH trusted checkpoint AND replayable tail (CheckpointLSN >= TailLSN and CommittedLSN <= HeadLSN) - Trusted checkpoint with unreplayable tail falls back to full_base 4 new tests: - TrustedCheckpoint_UnreplayableTail (the regression case) - SenderDriven_CatchUp (history → proof → outcome → complete) - SenderDriven_Rebuild_SnapshotTail (history → source → rebuild) - SenderDriven_Rebuild_FallsBackToFullBase (unreplayable tail) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 23:55:31 -07:00
pingqiu	ff7ea41099	feat: add engine data/recoverability core (Phase 05 Slice 3) New file: history.go — RetainedHistory connects recovery decisions to actual WAL retention state: - IsRecoverable: checks gap against tail/head boundaries - MakeHandshakeResult: generates HandshakeResult from retention state - RebuildSourceDecision: chooses snapshot+tail vs full base from checkpoint state (trusted vs untrusted) - ProveRecoverability: generates explicit proof explaining why recovery is or is not allowed 14 new tests (recoverability_test.go): - Recoverable/unrecoverable gap (exact boundary, beyond head) - Trusted/untrusted/no checkpoint → rebuild source selection - Handshake from retained history → outcome classification - Recoverability proofs (zero-gap, ahead, within retention, beyond) - E2E: two replicas driven by retained history (catch-up + rebuild) - Truncation required for replica ahead of committed Engine module at 44 tests (12 + 18 + 14). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 23:04:51 -07:00
pingqiu	368a956aee	fix: correct catch-up entry counting and rebuild transfer gate Entry counting: - Session.setRange now initializes recoveredTo = startLSN - RecordCatchUpProgress delta counts only actual catch-up work (recoveredTo - startLSN), not the replica's pre-existing prefix Rebuild transfer gate: - BeginTailReplay requires TransferredTo >= SnapshotLSN - Prevents tail replay on incomplete base transfer 3 new regression tests: - BudgetEntries_NonZeroStart_CountsOnlyDelta (30 entries within 50 budget) - BudgetEntries_NonZeroStart_ExceedsBudget (30 entries exceeds 20 budget) - Rebuild_PartialTransfer_BlocksTailReplay Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 21:35:03 -07:00
pingqiu	930de4ba78	feat: add Slice 2 recovery execution tests (Phase 05) 15 new engine-level recovery execution tests: - Zero-gap / catch-up / needs-rebuild branching (3 tests) - Stale execution rejection during active recovery (2 tests) - Bounded catch-up: frozen target, duration, entries, stall (5 tests) - Completion before convergence rejected - Rebuild exclusivity: catch-up APIs excluded (1 test) - Rebuild lifecycle: snapshot+tail, full base, stale ID (3 tests) - Assignment-driven recovery flow Engine module now at 27 tests (12 Slice 1 + 15 Slice 2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 21:14:18 -07:00
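
Note on cc8c529962 and 332f598606: the log states that RebuildSourceDecision picks snapshot+tail only when the checkpoint is trusted and the tail is replayable (CheckpointLSN >= TailLSN and CommittedLSN <= HeadLSN), and otherwise falls back to full_base with a logged reason. The following is a minimal Go sketch of that rule, not the repository's code. The struct layout, the HasCheckpoint and CheckpointTrusted fields, the DecideRebuildSource name, and the order in which the checks run are all assumptions made for illustration; only the two LSN comparisons and the three reason strings come from the commit messages.

```go
package main

import "fmt"

// RetainedHistory is a hypothetical stand-in for the engine's view of WAL
// retention. Field names are assumptions, not the repository's definitions.
type RetainedHistory struct {
	TailLSN           uint64 // oldest LSN still retained in the WAL
	HeadLSN           uint64 // newest LSN present in the WAL
	CommittedLSN      uint64 // LSN the rebuild must reach
	CheckpointLSN     uint64 // LSN covered by the checkpoint/snapshot
	HasCheckpoint     bool   // assumption: whether any checkpoint exists
	CheckpointTrusted bool   // assumption: whether that checkpoint is trusted
}

// RebuildSource names the two rebuild paths mentioned in the log.
type RebuildSource string

const (
	SnapshotTail RebuildSource = "snapshot_tail"
	FullBase     RebuildSource = "full_base"
)

// DecideRebuildSource returns the source and, for full-base fallbacks, the
// causal reason string listed in commit 332f598606. The reason is empty when
// snapshot+tail is chosen.
func DecideRebuildSource(h RetainedHistory) (RebuildSource, string) {
	if !h.HasCheckpoint {
		return FullBase, "no_checkpoint"
	}
	if !h.CheckpointTrusted {
		return FullBase, "untrusted_checkpoint"
	}
	// Both conditions from commit cc8c529962 must hold for the tail to be
	// replayable on top of the checkpoint.
	replayable := h.CheckpointLSN >= h.TailLSN && h.CommittedLSN <= h.HeadLSN
	if !replayable {
		return FullBase, "trusted_checkpoint_unreplayable_tail"
	}
	return SnapshotTail, ""
}

func main() {
	// The E3/FC5 case from commit 56afa55f13: checkpoint=50, tail=80.
	// HeadLSN and CommittedLSN are made-up values for the example.
	h := RetainedHistory{
		TailLSN: 80, HeadLSN: 100, CommittedLSN: 100,
		CheckpointLSN: 50, HasCheckpoint: true, CheckpointTrusted: true,
	}
	src, reason := DecideRebuildSource(h)
	fmt.Println(src, reason)
}
```

With checkpoint=50 and tail=80 the checkpoint precedes the retained tail, so the sketch selects the full-base path, matching the outcome the E3/FC5 test describes.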

13129 Commits