From 600dac602938af3eca1dfea197bb8dca8a71d366 Mon Sep 17 00:00:00 2001 From: pingqiu Date: Thu, 2 Apr 2026 17:07:21 -0700 Subject: [PATCH] =?UTF-8?q?feat:=20Phase=2013=20CP13-1=20=E2=80=94=20froze?= =?UTF-8?q?n=20test-first=20baseline=20for=20sync=20replication=20gaps?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Baseline report (phase-13-cp1-baseline.md) from running the 44 existing replication-gap tests on current code with zero protocol changes: 37 PASS / 4 FAIL / 3 PASS*. The 4 FAILs expose real gaps: - ReconnectUsesHandshakeNotBootstrap: degraded shipper doesn't catch up (CP13-5) - CatchupMultipleDisconnects: repeated reconnect cycles don't recover (CP13-5) - NeedsRebuildBlocksAllPaths: stays Degraded after large gap (CP13-5 + CP13-7) - CatchupDoesNotOverwriteNewerData: catch-up fails at barrier (CP13-5) The 3 PASS* are witness-only (they pass but don't prove the property): - Bug3_ReplicaAddr: documents gap, not fix (CP13-2) - GapBeyondRetainedWal: asserts barrier failure, not NeedsRebuild (CP13-7) - MaxBytesTriggersNeedsRebuild: logs "not implemented" (CP13-6) No protocol code changed. Baseline is test-first evidence only. 
Co-Authored-By: Claude Opus 4.6 (1M context) --- .../.private/phase/phase-13-cp1-baseline.md | 111 +++++++ sw-block/.private/phase/phase-13-log.md | 304 ++++++++++++++++++ 2 files changed, 415 insertions(+) create mode 100644 sw-block/.private/phase/phase-13-cp1-baseline.md create mode 100644 sw-block/.private/phase/phase-13-log.md diff --git a/sw-block/.private/phase/phase-13-cp1-baseline.md b/sw-block/.private/phase/phase-13-cp1-baseline.md new file mode 100644 index 000000000..4521c3655 --- /dev/null +++ b/sw-block/.private/phase/phase-13-cp1-baseline.md @@ -0,0 +1,111 @@ +# CP13-1 Baseline Report + +Date: 2026-04-02 +Commit: c0a805184 (feature/sw-block HEAD) +Runner: `go test ./weed/storage/blockvol/ -v -count=1 -timeout 120s` +Protocol changes in this checkpoint: NONE — test-first baseline only + +## Category 1: Address Truth + +| Result | Test | Reason | +|--------|------|--------| +| PASS | `TestCanonicalizeAddr_WildcardIPv4_UsesAdvertised` | canonicalization infra works | +| PASS | `TestCanonicalizeAddr_WildcardIPv6_UsesAdvertised` | canonicalization infra works | +| PASS | `TestCanonicalizeAddr_NilIP_UsesAdvertised` | canonicalization infra works | +| PASS | `TestCanonicalizeAddr_AlreadyCanonical_Unchanged` | no-op on canonical input | +| PASS | `TestCanonicalizeAddr_Loopback_Unchanged` | loopback preserved intentionally | +| PASS | `TestCanonicalizeAddr_NoAdvertised_FallsBackToOutbound` | fallback path works | +| PASS* | `TestBug3_ReplicaAddr_MustBeIPPort_WildcardBind` | documents gap: ReplicaReceiver may return `:port` not `ip:port` on wildcard bind; test passes as documentation, not as proof of fix → CP13-2 | + +## Category 2: Durable Progress Truth + +| Result | Test | Reason | +|--------|------|--------| +| PASS | `TestReplicaProgress_BarrierUsesFlushedLSN` | barrier now gates on replicaFlushedLSN (CP13-3 done) | +| PASS | `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | flushedLSN monotonic within epoch (CP13-3 done) | +| PASS | 
`TestBarrier_RejectsReplicaNotInSync` | barrier rejects non-InSync replica | +| PASS | `TestBarrier_EpochMismatchRejected` | barrier rejects epoch mismatch | +| PASS | `TestBarrier_DuringCatchup_Rejected` | barrier rejected during CatchingUp state (CP13-4 done) | +| PASS | `TestBarrier_ReplicaSlowFsync_Timeout` | barrier timeout on slow replica | +| PASS | `TestBarrierResp_FlushedLSN_Roundtrip` | barrier response wire format carries flushedLSN | +| PASS | `TestBarrierResp_BackwardCompat_1Byte` | backward compat with old 1-byte response | +| PASS | `TestReplica_FlushedLSN_OnlyAfterSync` | flushedLSN only updated after fdatasync | +| PASS | `TestReplica_FlushedLSN_NotOnReceive` | flushedLSN not updated on entry receive | +| PASS | `TestShipper_ReplicaFlushedLSN_UpdatedOnBarrier` | shipper tracks replica flushedLSN from barrier | +| PASS | `TestShipper_ReplicaFlushedLSN_Monotonic` | tracked flushedLSN is monotonic | +| PASS | `TestShipperGroup_MinReplicaFlushedLSN` | group computes min flushedLSN across replicas | +| PASS | `TestDistSync_SyncAll_NilGroup_Succeeds` | sync_all with no replicas succeeds locally | +| PASS | `TestDistSync_SyncAll_AllDegraded_Fails` | sync_all fails when all replicas degraded | +| PASS | `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | after recovery, catch-up + barrier succeeds (CP13-5 done) | +| PASS | `TestBug1_SyncAll_WriteDuringDegraded_SyncCacheMustFail` | SyncCache correctly fails during degraded | + +## Category 3: Reconnect / Catch-up + +| Result | Test | Reason | +|--------|------|--------| +| PASS | `TestReconnect_CatchupFromRetainedWal` | reconnect + WAL catch-up works (CP13-5 done) | +| PASS* | `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | correctly fails SyncCache after large gap, but does NOT assert NeedsRebuild state transition — asserts barrier failure only → CP13-5+CP13-7 | +| PASS | `TestReconnect_EpochChangeDuringCatchup_Aborts` | catch-up aborts on epoch change | +| PASS | 
`TestReconnect_CatchupTimeout_TransitionsDegraded` | catch-up timeout → degraded | +| PASS | `TestAdversarial_FreshShipperUsesBootstrapNotReconnect` | fresh shipper uses bootstrap path | +| FAIL | `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | **gap: degraded shipper with prior flushed progress reconnects but barrier fails** — shipper does not catch up before attempting barrier → CP13-5 | +| PASS | `TestAdversarial_ReplicaRejectsDuplicateLSN` | replica rejects duplicate LSN | +| PASS | `TestAdversarial_ReplicaRejectsGapLSN` | replica rejects LSN gap | +| FAIL | `TestAdversarial_CatchupMultipleDisconnects` | **gap: catch-up across multiple disconnect/reconnect cycles fails** — first reconnect barrier fails, subsequent cycles never recover → CP13-5 | +| PASS | `TestAdversarial_ConcurrentBarrierDoesNotCorruptCatchupFailures` | concurrent barriers don't corrupt counter | + +## Category 4: Retention / Rebuild Boundary + +| Result | Test | Reason | +|--------|------|--------| +| PASS | `TestWalRetention_RequiredReplicaBlocksReclaim` | replica-aware WAL retention works (CP13-6 done) | +| PASS | `TestWalRetention_TimeoutTriggersNeedsRebuild` | retention timeout → NeedsRebuild (CP13-6 done) | +| PASS* | `TestWalRetention_MaxBytesTriggersNeedsRebuild` | passes but logs "max-bytes retention trigger not implemented yet" — shipper stays Degraded, does not transition to NeedsRebuild → CP13-6 | +| FAIL | `TestAdversarial_NeedsRebuildBlocksAllPaths` | **gap: after large WAL gap, shipper stays Degraded instead of NeedsRebuild; Ship/Barrier not blocked** → CP13-5+CP13-7 | +| FAIL | `TestAdversarial_CatchupDoesNotOverwriteNewerData` | **gap: catch-up after disconnect fails at barrier level** — catch-up doesn't complete, so newer-data safety not actually exercised → CP13-5 | +| PASS | `TestHeartbeat_ReportsPerReplicaState` | heartbeat reports per-replica shipper state | +| PASS | `TestHeartbeat_ReportsNeedsRebuild` | heartbeat reports NeedsRebuild per-replica | +| PASS | 
`TestReplicaState_RebuildComplete_ReentersInSync` | full rebuild cycle: NeedsRebuild → rebuild → InSync | +| PASS | `TestRebuild_AbortOnEpochChange` | rebuild aborts on epoch change | +| PASS | `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | post-rebuild flushedLSN = checkpoint | + +## Summary + +| Category | PASS | FAIL | PASS* | Total | +|----------|------|------|-------|-------| +| 1. Address Truth | 6 | 0 | 1 | 7 | +| 2. Durable Progress Truth | 17 | 0 | 0 | 17 | +| 3. Reconnect / Catch-up | 7 | 2 | 1 | 10 | +| 4. Retention / Rebuild | 7 | 2 | 1 | 10 | +| **Total** | **37** | **4** | **3** | **44** | + +## Failure → Checkpoint Mapping + +| FAIL Test | Root Cause | Closes In | +|-----------|-----------|-----------| +| `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | degraded shipper reconnects but doesn't catch up before barrier | CP13-5 | +| `TestAdversarial_CatchupMultipleDisconnects` | repeated disconnect/reconnect cycles don't recover | CP13-5 | +| `TestAdversarial_NeedsRebuildBlocksAllPaths` | shipper stays Degraded after large gap, should be NeedsRebuild | CP13-5 + CP13-7 | +| `TestAdversarial_CatchupDoesNotOverwriteNewerData` | catch-up fails, so newer-data safety not exercised | CP13-5 | + +## PASS* → Checkpoint Mapping + +| PASS* Test | Why Not Full Proof | Closes In | +|------------|-------------------|-----------| +| `TestBug3_ReplicaAddr_MustBeIPPort_WildcardBind` | documents gap, doesn't prove fix | CP13-2 | +| `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | asserts barrier failure, not NeedsRebuild state | CP13-5 + CP13-7 | +| `TestWalRetention_MaxBytesTriggersNeedsRebuild` | logs "not implemented", shipper stays Degraded | CP13-6 | + +## What Was NOT Changed + +This baseline was captured on current code without any protocol modifications: + +- No reconnect handshake changes +- No WAL catch-up logic changes +- No retention policy changes +- No rebuild behavior changes +- No barrier protocol changes +- No state machine changes +- No 
new protocol code of any kind + +All 4 FAILs and 3 PASS* entries expose real gaps that exist in the current codebase. diff --git a/sw-block/.private/phase/phase-13-log.md b/sw-block/.private/phase/phase-13-log.md new file mode 100644 index 000000000..4f9da74f4 --- /dev/null +++ b/sw-block/.private/phase/phase-13-log.md @@ -0,0 +1,304 @@ +Purpose: append-only technical pack and delivery log for `Phase 13` sync replication correctness. + +--- + +### `CP13-1` Technical Pack + +Date: 2026-04-02 +Goal: freeze a focused test-first baseline for sync replication correctness before major implementation work so `Phase 13` closes real gaps rather than validating against moving expectations + +#### Layer 1: Semantic Core + +##### Problem statement + +`Phase 12` accepted bounded hardening and a first launch envelope on the chosen path. +That acceptance did not prove that cross-machine `RF=2 sync_all` already has a fully explicit replicated-durability model for: + +1. reconnect after outage +2. catch-up from retained WAL +3. durable-progress truth at barrier time +4. retention vs rebuild boundary + +The first checkpoint therefore accepts only one bounded thing: + +1. a frozen failing/passing baseline for the replication gaps that `Phase 13` will close + +It does not accept: + +1. protocol implementation by implication +2. broad performance or rollout claims +3. happy-path-only validation + +##### State / contract + +`CP13-1` must make these truths explicit: + +1. the target replication gaps are named before implementation +2. at least one current-code failure exists for each major missing protocol property +3. already-correct behavior may remain green and should be recorded as such +4. later checkpoints must refer back to this baseline rather than redefining success after the fact + +##### Reject shapes + +Reject before implementation if the checkpoint: + +1. adds tests only after protocol code lands +2. reports “some failures happened” without mapping failures to named gaps +3. 
mixes proxy coverage and true proof coverage without distinction +4. quietly turns the baseline into broad workload benchmarking + +#### Layer 2: Execution Core + +##### Current gaps `CP13-1` must expose + +1. canonical replica endpoint truth may be weaker than real cross-machine requirements +2. barrier correctness may still depend on sender-side progress rather than flushed durability truth +3. reconnect / catch-up behavior may fail or degrade unclearly after outage +4. retention / rebuild boundary may be implicit instead of explicit + +##### Suggested file targets + +1. `weed/storage/blockvol/blockvol_test.go` +2. `weed/storage/blockvol/replica_test.go` +3. `weed/storage/blockvol/dist_group_commit_test.go` +4. `weed/storage/blockvol/wal_shipper_test.go` +5. `weed/storage/blockvol/test/component/` +6. `weed/storage/blockvol/testrunner/scenarios/internal/` for bounded real-node scenarios when justified + +##### Validation focus + +Required proofs: + +1. baseline-freeze proof + - the focused tests are added before the major protocol checkpoints land +2. gap-visibility proof + - named protocol gaps fail clearly on current code or are marked as bounded witness coverage +3. boundedness proof + - the checkpoint remains test-first baseline work, not hidden implementation + +Reject if: + +1. tests are too indirect to say which gap they expose +2. failing behavior is captured only in chat or terminal output, not in a baseline artifact +3. baseline wording already claims the later protocol is fixed + +##### Suggested first cut + +1. prepare a compact test inventory grouped by: + - address truth + - durable progress truth + - reconnect / catch-up + - retention / rebuild boundary +2. run on current code +3. freeze one baseline report with explicit categories: + - `FAIL` + - `PASS` + - `PASS*` + +##### Assignment For `tester` + +1. Goal + - add and run the focused replication-gap tests before `sw` starts major protocol work +2. 
Required outputs + - one frozen baseline report + - one explicit list of current expected failures + - one explicit list of already-green behaviors +3. Hard rules + - do not strengthen the current implementation first + - do not let component or real-node tests replace the smaller protocol-gap tests + - do not turn proxy passes into full-proof claims + +##### Assignment For `sw` + +1. Goal + - start `Phase 13` immediately by making the baseline package real without pre-solving the protocol gaps +2. Allowed work before baseline freeze + - test harness support that does not change protocol behavior + - small cleanup required to make the baseline runnable + - test additions or renames that make the current gaps explicit +3. Hard rules + - do not pre-fix reconnect / catch-up / retention semantics before the baseline is captured + - do not weaken current degraded-mode signaling just to make tests pass + +##### `P1` Start Pack For `sw` + +`sw` may start now, but only inside this bounded `CP13-1` package. + +--- + +###### Task 1: Baseline Inventory Freeze + +Collect the existing test inventory and classify each test. The inventory below is the frozen starting point — `sw` validates it against current code, fixes classification errors, and adds missing entries only. 
+ +**Category 1: Address Truth** + +| Test | File | Status | Classification | +|------|------|--------|----------------| +| `TestCanonicalizeAddr_WildcardIPv4_UsesAdvertised` | `net_util_test.go` | PASS | existing, reusable | +| `TestCanonicalizeAddr_WildcardIPv6_UsesAdvertised` | `net_util_test.go` | PASS | existing, reusable | +| `TestCanonicalizeAddr_NilIP_UsesAdvertised` | `net_util_test.go` | PASS | existing, reusable | +| `TestCanonicalizeAddr_AlreadyCanonical_Unchanged` | `net_util_test.go` | PASS | existing, reusable | +| `TestCanonicalizeAddr_Loopback_Unchanged` | `net_util_test.go` | PASS | existing, reusable | +| `TestCanonicalizeAddr_NoAdvertised_FallsBackToOutbound` | `net_util_test.go` | PASS | existing, reusable | +| `TestBug3_ReplicaAddr_MustBeIPPort_WildcardBind` | `sync_all_bug_test.go` | PASS* | documents gap: ReplicaReceiver may return `:port` not `ip:port` | + +**Category 2: Durable Progress Truth** + +| Test | File | Status | Classification | +|------|------|--------|----------------| +| `TestReplicaProgress_BarrierUsesFlushedLSN` | `sync_all_protocol_test.go` | FAIL expected | gap: barrier doesn't gate on replicaFlushedLSN | +| `TestReplicaProgress_FlushedLSNMonotonicWithinEpoch` | `sync_all_protocol_test.go` | FAIL expected | gap: replicaFlushedLSN API missing | +| `TestBarrier_EpochMismatchRejected` | `sync_all_protocol_test.go` | FAIL expected | gap: barrier doesn't check epoch on replica | +| `TestBarrier_ReplicaSlowFsync_Timeout` | `sync_all_protocol_test.go` | FAIL expected | gap: barrier timeout is hardcoded | +| `TestBarrier_RejectsReplicaNotInSync` | `sync_all_protocol_test.go` | verify | existing, needs verification | +| `TestBarrierResp_FlushedLSN_Roundtrip` | `sync_all_protocol_test.go` | verify | existing, needs verification | +| `TestBarrierResp_BackwardCompat_1Byte` | `sync_all_protocol_test.go` | verify | existing, needs verification | +| `TestReplica_FlushedLSN_OnlyAfterSync` | `sync_all_protocol_test.go` | verify | existing, 
needs verification | +| `TestReplica_FlushedLSN_NotOnReceive` | `sync_all_protocol_test.go` | verify | existing, needs verification | +| `TestShipper_ReplicaFlushedLSN_UpdatedOnBarrier` | `sync_all_protocol_test.go` | verify | existing, needs verification | +| `TestShipper_ReplicaFlushedLSN_Monotonic` | `sync_all_protocol_test.go` | verify | existing, needs verification | +| `TestShipperGroup_MinReplicaFlushedLSN` | `sync_all_protocol_test.go` | verify | existing, needs verification | +| `TestDistSync_SyncAll_NilGroup_Succeeds` | `dist_group_commit_test.go` | PASS | existing, reusable | +| `TestDistSync_SyncAll_AllDegraded_Fails` | `dist_group_commit_test.go` | PASS | existing, reusable | +| `TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers` | `sync_all_bug_test.go` | FAIL expected | gap: catch-up not implemented, barrier hangs after recovery | + +**Category 3: Reconnect / Catch-up** + +| Test | File | Status | Classification | +|------|------|--------|----------------| +| `TestReconnect_CatchupFromRetainedWal` | `sync_all_protocol_test.go` | FAIL expected | gap: no reconnect handshake or WAL catch-up | +| `TestReconnect_GapBeyondRetainedWal_NeedsRebuild` | `sync_all_protocol_test.go` | FAIL expected | gap: no retention tracking, no NeedsRebuild transition | +| `TestReconnect_EpochChangeDuringCatchup_Aborts` | `sync_all_protocol_test.go` | FAIL expected | gap: no CatchingUp state, no epoch-aware abort | +| `TestReconnect_CatchupTimeout_TransitionsDegraded` | `sync_all_protocol_test.go` | FAIL expected | gap: no catch-up timeout | +| `TestBarrier_DuringCatchup_Rejected` | `sync_all_protocol_test.go` | FAIL expected | gap: no CatchingUp state | +| `TestAdversarial_FreshShipperUsesBootstrapNotReconnect` | `sync_all_adversarial_test.go` | verify | existing, needs verification | +| `TestAdversarial_ReconnectUsesHandshakeNotBootstrap` | `sync_all_adversarial_test.go` | FAIL expected | gap: handshake protocol missing | +| 
`TestAdversarial_ReplicaRejectsDuplicateLSN` | `sync_all_adversarial_test.go` | verify | existing, needs verification | +| `TestAdversarial_ReplicaRejectsGapLSN` | `sync_all_adversarial_test.go` | verify | existing, needs verification | +| `TestAdversarial_CatchupMultipleDisconnects` | `sync_all_adversarial_test.go` | FAIL expected | gap: no catch-up protocol | +| `TestAdversarial_ConcurrentBarrierDoesNotCorruptCatchupFailures` | `sync_all_adversarial_test.go` | verify | existing, needs verification | + +**Category 4: Retention / Rebuild Boundary** + +| Test | File | Status | Classification | +|------|------|--------|----------------| +| `TestWalRetention_RequiredReplicaBlocksReclaim` | `sync_all_protocol_test.go` | FAIL expected | gap: WAL reclaim not replica-aware | +| `TestWalRetention_TimeoutTriggersNeedsRebuild` | `sync_all_protocol_test.go` | FAIL expected | gap: no retention timeout | +| `TestWalRetention_MaxBytesTriggersNeedsRebuild` | `sync_all_protocol_test.go` | FAIL expected | gap: no max-bytes retention | +| `TestAdversarial_NeedsRebuildBlocksAllPaths` | `sync_all_adversarial_test.go` | FAIL expected | gap: NeedsRebuild state incomplete | +| `TestAdversarial_CatchupDoesNotOverwriteNewerData` | `sync_all_adversarial_test.go` | verify | existing, needs verification | +| `TestHeartbeat_ReportsPerReplicaState` | `rebuild_v1_test.go` | verify | existing, needs verification | +| `TestHeartbeat_ReportsNeedsRebuild` | `rebuild_v1_test.go` | verify | existing, needs verification | +| `TestReplicaState_RebuildComplete_ReentersInSync` | `rebuild_v1_test.go` | verify | existing, needs verification | +| `TestRebuild_AbortOnEpochChange` | `rebuild_v1_test.go` | verify | existing, needs verification | +| `TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint` | `rebuild_v1_test.go` | verify | existing, needs verification | + +###### Task 2: Runnable Baseline Harness + +Fix only the minimum harness friction needed to run the baseline cleanly: + +- if any test cannot compile 
or run due to missing test helpers, add the helpers +- if any test panics on setup (not on the gap itself), fix the setup +- do NOT change protocol behavior to make failing tests pass +- do NOT add new protocol code (reconnect, retention, rebuild) + +Scope guard: if a fix touches `wal_shipper.go`, `replica_apply.go`, `dist_group_commit.go`, or `blockvol.go` beyond test-helper support, it is out of bounds for Task 2. + +###### Task 3: Focused Gap Tests + +Add or tighten the smallest set of tests that exposes the current gap on present code: + +- if the inventory has a `verify` entry that turns out to be proxy coverage (passes but doesn't actually prove the property), reclassify it as `PASS*` +- if a gap has no test at all, add the minimum test that fails on current code +- prefer unit/protocol tests; add component tests only where the unit test cannot expose the gap +- each new test must have a comment naming which CP13 checkpoint it maps to + +Hard rule: do NOT add tests that only pass after protocol work. The baseline must fail cleanly on current code. + +###### Task 4: Frozen Baseline Report + +Run the full inventory on current code and produce one explicit report: + +``` +CP13-1 Baseline Report +Date: YYYY-MM-DD +Commit: + +Category 1: Address Truth + PASS TestCanonicalizeAddr_WildcardIPv4_UsesAdvertised + PASS TestCanonicalizeAddr_... + PASS* TestBug3_ReplicaAddr_MustBeIPPort_WildcardBind — documents gap, not proof + ... + +Category 2: Durable Progress Truth + FAIL TestReplicaProgress_BarrierUsesFlushedLSN — barrier doesn't gate on flushedLSN + PASS TestDistSync_SyncAll_NilGroup_Succeeds + ... + +Category 3: Reconnect / Catch-up + FAIL TestReconnect_CatchupFromRetainedWal — no catch-up protocol + ... + +Category 4: Retention / Rebuild Boundary + FAIL TestWalRetention_RequiredReplicaBlocksReclaim — WAL reclaim not replica-aware + PASS TestHeartbeat_ReportsPerReplicaState + ... 
+ +Summary: X PASS / Y FAIL / Z PASS* out of N total +``` + +The report must be saved to `sw-block/.private/phase/phase-13-cp1-baseline.md`. + +--- + +###### Required output from `sw` + +1. one delivery note naming: + - files changed (test helpers, test additions, renames) + - tests added or strengthened + - which gaps are now exposed by the baseline +2. one frozen result summary (`phase-13-cp1-baseline.md`) +3. one explicit statement of what `sw` did NOT fix + +###### Hard boundary + +`sw` may do Tasks 1-4 now. `sw` may NOT: + +1. implement reconnect handshake protocol (→ CP13-5) +2. implement WAL retention policy (→ CP13-6) +3. implement rebuild fallback behavior (→ CP13-7) +4. change barrier protocol to return flushedLSN (→ CP13-3) +5. add CatchingUp or NeedsRebuild state transitions (→ CP13-4, CP13-5) +6. change `wal_shipper.go` Ship/Connect behavior beyond test-helper wiring + +These are explicitly reserved for CP13-2 through CP13-7. The baseline must expose the gaps without closing them. + +###### Reject if `sw` + +1. starts implementing reconnect protocol, retention policy, or rebuild behavior before the baseline is frozen +2. buries the real gap under large harness churn +3. upgrades witness coverage into proof coverage without saying so +4. 
turns `CP13-1` into `CP13-2+` by stealth + +--- + +#### Gap → Checkpoint Mapping + +| Gap | Exposed by baseline tests | Closed by checkpoint | +|-----|--------------------------|---------------------| +| ReplicaReceiver returns `:port` not `ip:port` | TestBug3 (PASS*) | CP13-2 | +| Barrier doesn't gate on replicaFlushedLSN | TestReplicaProgress_BarrierUsesFlushedLSN (FAIL) | CP13-3 | +| No CatchingUp state, no barrier rejection during catch-up | TestBarrier_DuringCatchup_Rejected (FAIL) | CP13-4 | +| No reconnect handshake or WAL catch-up replay | TestReconnect_CatchupFromRetainedWal (FAIL) | CP13-5 | +| WAL reclaim not replica-aware | TestWalRetention_RequiredReplicaBlocksReclaim (FAIL) | CP13-6 | +| No NeedsRebuild transition on WAL gap | TestReconnect_GapBeyondRetainedWal_NeedsRebuild (FAIL) | CP13-5 + CP13-7 | +| Post-reconnect barrier hangs | TestBug2 (FAIL) | CP13-5 | + +#### Short judgment + +`CP13-1` is acceptable when: + +1. the phase has a frozen, named failing baseline +2. the failing baseline maps cleanly to later checkpoints +3. already-correct behavior is distinguished from true gaps +4. no implementation overclaim sneaks into the checkpoint