Investigation result:
- Dual-BlockVol hypothesis: DISPROVEN (one instance per path, correct wiring)
- Root cause: adapter wiring bug in test allocator
soak_test.go blockVSAllocate returned ReplicaDataAddr = "vs2:9333:14260"
(server + ":port" where server already has a port → two colons, not a valid host:port)
This caused setupReplicaReceiver to fail silently → no data replicated
Root cause classification: adapter/test-harness bug
- NOT a backend data visibility bug
- NOT a core-rule gap
- The engine read path works correctly (TestSyncAll_FullRoundTrip passes)
Code changes:
- qa_block_soak_test.go: fix allocator to use host:port (not server:port),
use deterministic FNV-hashed ports matching production ReplicationPorts
- qa_block_cp13_8a_test.go: 2 new integration tests proving replica reads
work through both ReadLBA and adapter.ReadAt, before and after promotion
Remaining contradiction for CP13-8 scenario on real hardware:
- The production weed cluster uses ReplicationPorts (deterministic) which
should not have this bug. If CP13-8 still fails on m01/M02, the cause
is different from this test-harness issue and needs a separate investigation.
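Illustrative sketch of the address fix (not the allocator code itself).
replicaAddr is a hypothetical helper; it assumes the server string is either
a bare host or host:port and uses only the standard net package:

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// replicaAddr is a hypothetical helper: it returns "host:port" for a replica
// listener. server may be a bare host ("vs2") or already carry a port
// ("vs2:9333"); an existing port is dropped before the new one is attached.
func replicaAddr(server string, port int) string {
	host := server
	if h, _, err := net.SplitHostPort(server); err == nil {
		host = h // server was "host:port"; keep only the host
	}
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	bad := "vs2:9333" + ":14260" // the buggy concatenation
	_, _, err := net.SplitHostPort(bad)
	fmt.Println(bad, "rejected:", err != nil) // vs2:9333:14260 rejected: true

	fmt.Println(replicaAddr("vs2:9333", 14260)) // vs2:14260
	fmt.Println(replicaAddr("vs2", 14260))      // vs2:14260
}
```

Validating the result with net.SplitHostPort at allocation time would also
turn the silent receiver failure into an immediate error.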
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. assert_contains: change actual/expected to value/contains (matches
the action implementation in system.go)
2. Add assert_greater for pgbench TPS > 0 after pgbench_run (closes
the pgbench durability pass criterion in the doc)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- master_block_registry.go: minor role-handling fixes
- qa_failover_role_test.go: new failover role test
- testrunner/actions/devops.go: new devops action helpers
- recovery-baseline-failover.yaml: scenario alignment
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- phase-13.md: CP13-1 through CP13-6 accepted, CP13-7 active
- phase-13-log.md: full technical + delivery packs for CP13-2..CP13-7
- phase-13-cp4-state-eligibility.md: refined barrier behavior table
(Disconnected/Degraded as recovery entry points, not eligibility)
- phase-12.md: minor cross-reference updates
- Older phase docs: minor wording alignment
- Design docs: V2 development plan and completion overview updated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tighten TestReconnect_GapBeyondRetainedWal_NeedsRebuild assertion from
"NeedsRebuild or Degraded" to strictly "NeedsRebuild". The handshake
R < S path returns NeedsRebuild directly — tolerating Degraded weakened
the proof.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:
1. TestReconnect_GapBeyondRetainedWal_NeedsRebuild: rewritten to test the
real reconnect handshake gap detection path (R < S in
reconnectWithHandshake). Sequence: establish sync → disconnect →
release retention hold via timeout → write + flush to advance WAL past
replica position → reconnect → handshake detects R=0 < S=9 → NeedsRebuild.
Log proves: "reconnect: gap too large R=0 H=8 S=9"
2. TestReplicaState_RebuildComplete_ReentersInSync: reclassified from
primary proof to support evidence (does not start from live NeedsRebuild
shipper state, but proves rebuild mechanics work end-to-end).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. TestWalRetention_TimeoutTriggersNeedsRebuild: add hard assertion that
checkpoint advances past replicaFlushedLSN after NeedsRebuild (proves
hold is actually released, not just state transition)
2. TestWalRetention_RequiredReplicaBlocksReclaim: remove stale "EXPECTED
TO FAIL" / duplicate comment block
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes:
1. TestWalRetention_RequiredReplicaBlocksReclaim: rewritten from log-only
placeholder to hard assertion (checkpointLSN <= replicaFlushedLSN)
2. TestWalRetention_TimeoutTriggersNeedsRebuild: rewritten from log-only
to hard assertion (State() == NeedsRebuild after 1ns timeout)
3. EvaluateRetentionBudgets: uses RetentionBudgetParams struct with
actual BlockSize from volume config instead of hardcoded 4096
All 3 retention tests now have real state/progress assertions.
No placeholder or log-only evidence remains in CP13-6 proof package.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add max-bytes retention budget alongside existing timeout budget:
- shipper_group.go: EvaluateRetentionBudgets now checks both timeout
(last contact time) and max-bytes (entry lag * 4KB > maxBytes).
Exceeding either budget → NeedsRebuild state transition.
- blockvol.go: add walRetentionMaxBytes (64MB default), pass to
EvaluateRetentionBudgets with primaryHeadLSN.
TestWalRetention_MaxBytesTriggersNeedsRebuild upgraded from PASS*
(log-only placeholder) to real PASS: asserts State()==NeedsRebuild
after lag exceeds configured max-bytes budget.
Retention contract: hold-back blocks reclaim for recoverable replicas,
timeout and max-bytes budgets escalate to NeedsRebuild and release hold.
Full rebuild lifecycle remains CP13-7 scope.
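Illustrative sketch of the two-budget check (not the shipper_group.go code).
budgetParams and exceedsBudget are hypothetical names; the 30s timeout is an
assumed value, and lag is assumed to be primaryHeadLSN minus the replica's
flushed LSN:

```go
package main

import (
	"fmt"
	"time"
)

// budgetParams is a hypothetical stand-in for the retention budget inputs.
type budgetParams struct {
	Timeout   time.Duration // max time since last replica contact
	MaxBytes  uint64        // max WAL bytes held back for one replica
	BlockSize uint64        // bytes accounted per WAL entry
}

// exampleParams uses the 64MB default and 4KB entry size named in this
// commit; the timeout value is an assumption for the example.
var exampleParams = budgetParams{Timeout: 30 * time.Second, MaxBytes: 64 << 20, BlockSize: 4096}

// exceedsBudget reports whether a replica should escalate to NeedsRebuild:
// either it has been silent longer than Timeout, or its entry lag times
// BlockSize is larger than MaxBytes.
func exceedsBudget(p budgetParams, sinceContact time.Duration, headLSN, replicaFlushedLSN uint64) bool {
	if sinceContact > p.Timeout {
		return true
	}
	var lag uint64
	if headLSN > replicaFlushedLSN {
		lag = headLSN - replicaFlushedLSN
	}
	return lag*p.BlockSize > p.MaxBytes
}

func main() {
	fmt.Println(exceedsBudget(exampleParams, 0, 16384, 0))         // false: exactly 64MB of lag
	fmt.Println(exceedsBudget(exampleParams, 0, 16385, 0))         // true: one entry over max-bytes
	fmt.Println(exceedsBudget(exampleParams, time.Minute, 1, 0))   // true: silent past the timeout
}
```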
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace "observable CatchingUp state transition" with the actual 3
signals the test asserts: seeded hasFlushedProgress, receivedLSN
advance, non-zero replicaFlushedLSN.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Findings fixed:
1. TestAdversarial_ReconnectUsesHandshakeNotBootstrap now has 3 observable
proof points instead of just "SyncCache succeeded":
- new shipper HasFlushedProgress=true (seeded from old group)
- replica receivedLSN advances during SyncCache (catch-up delivered entries)
- shipper replicaFlushedLSN > 0 after barrier (durable progress established)
Bootstrap alone would not advance receivedLSN — it only sends the barrier.
2. TestBug2 stale comment removed: "must NOT call SetReplicaAddr" replaced
with accurate CP13-5 explanation that SetReplicaAddrs now preserves
hasFlushedProgress across shipper replacement.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug: SetReplicaAddrs created fresh shippers (hasFlushedProgress=false),
so after disconnect, the new shipper used bootstrap instead of reconnect
handshake. Bootstrap doesn't replay missed WAL entries — barrier hung.
Fix:
- blockvol.go: SetReplicaAddrs checks if old shipper group had durable
progress (AnyHasFlushedProgress). If so, seeds new shippers with
hasFlushedProgress=true → they use reconnect handshake + catch-up.
- shipper_group.go: add AnyHasFlushedProgress() helper.
3 baseline FAILs now PASS:
- ReconnectUsesHandshakeNotBootstrap: reconnect path used, not bootstrap
- CatchupMultipleDisconnects: repeated disconnect/reconnect recovers
- CatchupDoesNotOverwriteNewerData: catch-up completes, safety exercised
7 tests promoted to CP13-5 primary proof.
TestAdversarial_NeedsRebuildBlocksAllPaths still FAILs (CP13-7 scope).
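Illustrative sketch of the seeding rule (not the blockvol.go code). The
shipper type, replaceShippers, and connectPath are hypothetical stand-ins;
only the hasFlushedProgress flag and the AnyHasFlushedProgress idea come
from this commit:

```go
package main

import "fmt"

// shipper is a hypothetical, reduced model of a WAL shipper.
type shipper struct {
	addr               string
	hasFlushedProgress bool
}

// anyHasFlushedProgress reports whether any shipper in the group has
// established durable progress with its replica.
func anyHasFlushedProgress(group []*shipper) bool {
	for _, s := range group {
		if s.hasFlushedProgress {
			return true
		}
	}
	return false
}

// replaceShippers builds a new group for addrs and carries the durable
// progress flag over from the old group instead of resetting it.
func replaceShippers(old []*shipper, addrs []string) []*shipper {
	seed := anyHasFlushedProgress(old)
	fresh := make([]*shipper, 0, len(addrs))
	for _, a := range addrs {
		fresh = append(fresh, &shipper{addr: a, hasFlushedProgress: seed})
	}
	return fresh
}

// connectPath names the path a shipper takes on its next connect.
func (s *shipper) connectPath() string {
	if s.hasFlushedProgress {
		return "reconnect-handshake"
	}
	return "bootstrap"
}

func main() {
	old := []*shipper{{addr: "10.0.0.2:14260", hasFlushedProgress: true}}
	next := replaceShippers(old, []string{"10.0.0.2:14260"})
	fmt.Println(next[0].connectPath()) // reconnect-handshake

	first := replaceShippers(nil, []string{"10.0.0.2:14260"})
	fmt.Println(first[0].connectPath()) // bootstrap
}
```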
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contract review: 6-state set (Disconnected, Connecting, CatchingUp,
InSync, Degraded, NeedsRebuild). Only InSync proceeds to barrier
request path. All other states either fail immediately or attempt
reconnect (must succeed before reaching barrier).
New test: TestBarrier_NonEligibleStates_FailClosed — systematically
verifies each non-eligible state (Connecting, CatchingUp, NeedsRebuild,
Disconnected) is rejected by Barrier(), and InSync is the only state
that enters the barrier request path.
5 baseline tests promoted to CP13-4 primary proof.
No production code changed — contract review + new focused test only.
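Illustrative sketch of the eligibility rule (not the shipper code). The
replicaState type and barrierEligible are hypothetical names; the six state
names are taken from the contract above:

```go
package main

import "fmt"

// replicaState is a hypothetical enum for the 6-state set.
type replicaState int

const (
	Disconnected replicaState = iota
	Connecting
	CatchingUp
	InSync
	Degraded
	NeedsRebuild
)

var stateNames = [...]string{"Disconnected", "Connecting", "CatchingUp", "InSync", "Degraded", "NeedsRebuild"}

// barrierEligible is fail-closed: only InSync may enter the barrier
// request path. Any state added later is rejected by default.
func barrierEligible(s replicaState) bool {
	return s == InSync
}

func main() {
	for s := Disconnected; s <= NeedsRebuild; s++ {
		fmt.Printf("%-12s barrier-eligible=%v\n", stateNames[s], barrierEligible(s))
	}
}
```

Testing for the single allowed state, rather than listing rejected ones,
keeps the check fail-closed if the state set grows.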
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous test only checked wire decode + fresh shipper state, never
calling shipper.Barrier() against a legacy response source.
New test runs a fake TCP control server that responds with a 1-byte
BarrierOK (no FlushedLSN). Shipper.Barrier() is called against it and
must return an error containing "no FlushedLSN". Verifies the real
rejection path at wal_shipper.go:229-231.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug: BarrierOK with FlushedLSN == 0 (legacy 1-byte response) was counted
as successful sync_all durability even though no authoritative durable
progress was established. This allowed a legacy replica to silently pass
through the sync_all barrier without proving any LSN was fsynced.
Fix (wal_shipper.go): BarrierOK with FlushedLSN == 0 now returns an
error instead of nil. Barrier success requires the replica to report a
non-zero FlushedLSN proving which LSN was durably persisted. This makes
the code match the CP13-3 contract: replicaFlushedLSN is the sole
authority for sync_all durability.
New test: TestBarrier_LegacyResponseRejectedBySyncAll — proves legacy
1-byte responses don't establish durable authority.
Contract review doc updated to reflect the code fix.
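Illustrative sketch of the rejection rule (not the wal_shipper.go code). The
wire layout here is an assumption: one status byte, optionally followed by
an 8-byte big-endian FlushedLSN. barrierOK and parseBarrierOK are
hypothetical names:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const barrierOK = 0x00 // hypothetical status code for a successful barrier

// parseBarrierOK returns the replica's FlushedLSN from a barrier response.
// A legacy 1-byte response decodes as FlushedLSN == 0 and is rejected,
// because it proves nothing about what was fsynced.
func parseBarrierOK(resp []byte) (uint64, error) {
	if len(resp) == 0 || resp[0] != barrierOK {
		return 0, errors.New("barrier: replica did not report OK")
	}
	var flushed uint64
	if len(resp) >= 9 {
		flushed = binary.BigEndian.Uint64(resp[1:9])
	}
	if flushed == 0 {
		return 0, errors.New("barrier: no FlushedLSN in response")
	}
	return flushed, nil
}

func main() {
	_, err := parseBarrierOK([]byte{barrierOK})
	fmt.Println("legacy:", err)

	full := make([]byte, 9)
	full[0] = barrierOK
	binary.BigEndian.PutUint64(full[1:], 42)
	lsn, err := parseBarrierOK(full)
	fmt.Println("full:", lsn, err) // full: 42 <nil>
}
```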
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contract review (no code changed):
- replicaFlushedLSN is the sole authority for replica durability
- flushedLSN advanced only after fd.Sync() on replica (not on receive)
- shippedLSN/sentLSN are explicitly diagnostic (comment at line 268)
- barrier response carries flushedLSN; shipper updates via monotonic CAS
- sync_all gates on ALL barriers succeeding (fail-closed)
8 baseline tests promoted to CP13-3 primary proof:
- BarrierUsesFlushedLSN, FlushedLSNMonotonicWithinEpoch
- FlushedLSN_OnlyAfterSync, FlushedLSN_NotOnReceive
- ShipperReplicaFlushedLSN_UpdatedOnBarrier, _Monotonic
- BarrierResp_FlushedLSN_Roundtrip, BackwardCompat_1Byte
6 tests classified as support evidence (not primary proof).
Reconnect/retention/rebuild tests explicitly out of scope (CP13-4+).
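Illustrative sketch of a monotonic CAS update (not the shipper code).
flushedLSN and its methods are hypothetical; the sketch assumes Go 1.19+
for atomic.Uint64:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// flushedLSN is a hypothetical wrapper around the shipper's view of the
// replica's durable LSN.
type flushedLSN struct {
	v atomic.Uint64
}

// advance raises the stored LSN to candidate if candidate is larger and
// reports whether it changed. A stale or reordered barrier response can
// never move the value backwards.
func (f *flushedLSN) advance(candidate uint64) bool {
	for {
		cur := f.v.Load()
		if candidate <= cur {
			return false
		}
		if f.v.CompareAndSwap(cur, candidate) {
			return true
		}
		// Another goroutine won the race; reload and retry.
	}
}

func (f *flushedLSN) load() uint64 { return f.v.Load() }

func main() {
	var f flushedLSN
	fmt.Println(f.advance(10), f.load()) // true 10
	fmt.Println(f.advance(5), f.load())  // false 10
	fmt.Println(f.advance(12), f.load()) // true 12
}
```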
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:
1. Rename advertisedIP → advertisedHost throughout, relax contract from
"always a real IP" to "routable host from -ip flag (IP or resolvable
hostname)". This matches the actual -ip flag semantics which accepts
both IP addresses and server names.
2. Add TestCP13_2_BlockService_AdvertisedHost_NotOpaqueID that hits the
actual production wiring: BlockService with opaque localServerID +
routable advertisedHost → setupReplicaReceiver → verify exported
addresses use the routable host, not the opaque ID.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug: setupReplicaReceiver derived the advertised host from localServerID,
which can be an opaque string (from -id flag, e.g., "my-custom-server-id").
This would publish unusable endpoints like "my-custom-server-id:14260".
Fix:
- volume_server_block.go: add advertisedIP field (always a real IP from
-ip flag), use it instead of localServerID for replica canonicalization
- volume.go: wire *v.ip → blockService.SetAdvertisedIP() at startup
- blockvol.go: StartReplicaReceiver variadic advertisedHost unchanged
Proof (sync_all_bug_test.go TestBug3, 4 sub-cases):
- fallback: wildcard bind without advertisedHost → outbound-IP
- advertisedHost: explicit IP appears in exported addresses
- StartReplicaReceiver_API: public API forwards host correctly
- opaque_identity_not_routable: proves opaque string produces
non-routable address, confirming production must use advertisedIP
Identity vs transport separation preserved:
- localServerID: stable identity for V2 control (may be opaque)
- advertisedIP: routable IP for transport endpoints (always real IP)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem: StartReplicaReceiver didn't forward advertisedHost to
NewReplicaReceiver, so wildcard-bind listeners relied on outbound-IP
fallback for canonicalization. On multi-NIC hosts this could select
the wrong interface, leaking non-routable addresses into replication
truth.
Fix:
- blockvol.go: StartReplicaReceiver now accepts optional advertisedHost
variadic param and forwards it to NewReplicaReceiver
- volume_server_block.go: setupReplicaReceiver extracts host from
localServerID (the canonical VS identity) and passes it as
advertisedHost — wildcard-bind addresses now resolve to the
authoritative server IP, not outbound-IP fallback
Proof (sync_all_bug_test.go TestBug3, upgraded from PASS* to PASS):
- fallback: wildcard bind without advertisedHost still produces ip:port
- advertisedHost: explicit host appears in exported DataAddr/CtrlAddr
- StartReplicaReceiver_API: public API forwards advertisedHost correctly
What CP13-2 does NOT change:
- No reconnect handshake changes (CP13-5)
- No retention policy changes (CP13-6)
- No rebuild behavior changes (CP13-7)
- No barrier protocol changes (CP13-3)
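Illustrative sketch of wildcard-bind canonicalization (not the
NewReplicaReceiver code). canonicalAddr is a hypothetical helper; unlike the
real receiver it leaves the address unchanged when no advertised host is
given, instead of falling back to the outbound IP:

```go
package main

import (
	"fmt"
	"net"
)

// canonicalAddr rewrites a wildcard listen address (":port", "0.0.0.0:port",
// "[::]:port") to use advertisedHost. Addresses bound to a specific host are
// returned as-is.
func canonicalAddr(listenAddr, advertisedHost string) (string, error) {
	host, port, err := net.SplitHostPort(listenAddr)
	if err != nil {
		return "", err
	}
	ip := net.ParseIP(host)
	wildcard := host == "" || (ip != nil && ip.IsUnspecified())
	if wildcard && advertisedHost != "" {
		host = advertisedHost
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	for _, listen := range []string{":14260", "0.0.0.0:14260", "[::]:14260", "10.0.0.7:14260"} {
		addr, err := canonicalAddr(listen, "10.0.0.5")
		fmt.Println(listen, "->", addr, err)
	}
}
```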
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change "CP13-3/4/5/6 behavior already implemented in earlier phases" to
"current code already passes tests associated with later checkpoint themes"
— baseline evidence only, not implementation closure.
No .go files changed in CP13-1. All 44 baseline tests already existed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- phase-13-log.md: mark pre-baseline inventory table as superseded,
point to phase-13-cp1-baseline.md for authoritative results
- phase-13-cp1-baseline.md: replace "CP13-X done" language with neutral
"current code passes this test; suggests behavior may already exist"
— checkpoint closure still requires dedicated review
- Expand remaining-open-checkpoints section: CP13-2/5/6/7 all still
require review; the main failures cluster around CP13-5, but CP13-7 and
part of CP13-6 also remain open
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copies of design docs removed in Phase 09, preserved in sw-block/docs/archive/
for historical reference.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HIGH: Changed-address test now requires OutcomeCatchUp and fails otherwise.
No more conditional execution — must go through full catch-up chain.
MED: Overlapping retention is now true simultaneous overlap:
- Hold 1 at LSN T+1, Hold 2 at LSN T+2 — both coexist
- MinWALRetentionFloor = T+1 (minimum of two)
- Release hold 1 → floor moves to T+2
- Release hold 2 → ActiveHoldCount=0, no floor
MED: NeedsRebuild now asserts escalated event in logs.
PostCheckpoint now asserts handshake + catch-up execution events.
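Illustrative sketch of the overlapping-hold arithmetic (not the pinner
code). holdSet and its methods are hypothetical; Floor plays the role of
MinWALRetentionFloor, and T=100 is an arbitrary base LSN:

```go
package main

import "fmt"

// holdSet is a hypothetical tracker for WAL retention holds.
type holdSet struct {
	nextID int
	holds  map[int]uint64 // hold ID -> lowest LSN that hold still needs
}

func newHoldSet() *holdSet { return &holdSet{holds: make(map[int]uint64)} }

// Hold registers a hold at lsn and returns its ID.
func (h *holdSet) Hold(lsn uint64) int {
	h.nextID++
	h.holds[h.nextID] = lsn
	return h.nextID
}

// Release drops the hold with the given ID.
func (h *holdSet) Release(id int) { delete(h.holds, id) }

// Floor returns the minimum held LSN; ok is false when no hold is active.
func (h *holdSet) Floor() (floor uint64, ok bool) {
	for _, lsn := range h.holds {
		if !ok || lsn < floor {
			floor, ok = lsn, true
		}
	}
	return floor, ok
}

func main() {
	const T = 100 // arbitrary base LSN for the example
	h := newHoldSet()
	h1 := h.Hold(T + 1)
	h2 := h.Hold(T + 2)
	fmt.Println(h.Floor()) // 101 true
	h.Release(h1)
	fmt.Println(h.Floor()) // 102 true
	h.Release(h2)
	fmt.Println(h.Floor()) // 0 false
}
```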
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HIGH: renamed TestP2_RebuildClosure_FullBase_OneChain → TestP2_RebuildClosure_OneChain.
Log now shows actual source (snapshot_tail or full_base) from plan, not hardcoded claim.
MED: catch-up test uses t.Skipf when V1 interim prevents OutcomeCatchUp.
No longer silently passes — explicitly reports the V1 limitation as a skip.
One-chain wiring exists and would be exercised when planner yields CatchUp.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Engine executors now have IO interfaces for real bridge I/O:
- CatchUpExecutor.IO (CatchUpIO): StreamWALEntries
- RebuildExecutor.IO (RebuildIO): TransferFullBase, TransferSnapshot,
StreamWALEntries (for tail replay)
When IO is set, executor calls real bridge I/O during execution.
When IO is nil, executor uses caller-supplied progress (test mode).
RecoveryPlan.CatchUpStartLSN: bound at plan time for IO bridge.
v2bridge.Executor now implements both interfaces:
- StreamWALEntries: real ScanFrom
- TransferFullBase: validates extent accessible
- TransferSnapshot: validates checkpoint accessible
Chain tests wire IO:
- CatchUpClosure: exec.IO = executor → real WAL scan through engine
- RebuildClosure: exec.IO = executor → real transfer through engine
This closes the engine → executor → v2bridge → blockvol chain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Finding 1: ProcessAssignments now calls v2Orchestrator.ProcessAssignment
- BlockService.v2Orchestrator field (RecoveryOrchestrator)
- ProcessAssignment result logged at glog V(1)
- No more `_ = intent` — engine state actually changes
Finding 2: localServerID documented as interim
- BlockService.localServerID = listenAddr (transport-shaped)
- Field doc explicitly states: INTERIM, should be registry-assigned
- Used only for replica/rebuild local identity
3 integration tests (qa_block_v2bridge_test.go):
- CreatesEngineSender: ProcessAssignment → engine has sender + session
- EpochBump: epoch 1 → invalidate → epoch 2 → new session
- AddressChange: same ServerID, different IP → sender preserved,
endpoint updated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Finding 1: Identity no longer address-derived
- ReplicaAddr.ServerID field added (stable server identity from registry)
- BlockVolumeAssignment.ReplicaServerID field added (scalar RF=2 path)
- ControlBridge uses ServerID, NOT address, for ReplicaID
- Missing ServerID → replica skipped (fail closed), logged
Finding 2: Wired into real ProcessAssignments
- BlockService.v2Bridge field initialized in StartBlockService
- ProcessAssignments converts each assignment via v2Bridge.ConvertAssignment
BEFORE existing V1 processing (parallel, not replacing yet)
- Logged at glog V(1)
Finding 3: Fail-closed on missing identity
- Empty ServerID in ReplicaAddrs → replica skipped with log
- Empty ReplicaServerID in scalar path → no replica created
- Test: MissingServerID_FailsClosed verifies both paths
7 tests: StableServerID, AddressChange_IdentityPreserved,
MultiReplica_StableServerIDs, MissingServerID_FailsClosed,
EpochFencing_IntegratedPath, RebuildAssignment, ReplicaAssignment
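Illustrative sketch of identity derivation with fail-closed handling (not
the ControlBridge code). The replicaAddr struct and replicaIDs are
hypothetical stand-ins, the volume path is made up, and the
<volume-path>/<server-id> format is assumed:

```go
package main

import "fmt"

// replicaAddr is a hypothetical, reduced form of a replica entry.
type replicaAddr struct {
	ServerID string // stable identity from the registry
	DataAddr string // transport endpoint; may change across restarts
}

// replicaIDs derives one ReplicaID per replica from ServerID only.
// Replicas without a ServerID are skipped (fail closed) and counted.
func replicaIDs(volumePath string, replicas []replicaAddr) (ids []string, skipped int) {
	for _, r := range replicas {
		if r.ServerID == "" {
			skipped++ // never fall back to the address as identity
			continue
		}
		ids = append(ids, volumePath+"/"+r.ServerID)
	}
	return ids, skipped
}

func main() {
	before, _ := replicaIDs("/data/vol1.blk", []replicaAddr{{ServerID: "vs2", DataAddr: "10.0.0.2:14260"}})
	after, _ := replicaIDs("/data/vol1.blk", []replicaAddr{{ServerID: "vs2", DataAddr: "10.0.0.9:14260"}})
	fmt.Println(before[0] == after[0]) // true: address change keeps identity

	ids, skipped := replicaIDs("/data/vol1.blk", []replicaAddr{{DataAddr: "10.0.0.3:14260"}})
	fmt.Println(len(ids), skipped) // 0 1
}
```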
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ControlBridge converts real BlockVolumeAssignment (from master heartbeat)
into V2 engine AssignmentIntent:
- Identity: ReplicaID = <volume-path>/<replica-server-id>
- Epoch from real assignment
- Role → SessionKind mapping (primary/replica/rebuilding)
- Multi-replica support (ReplicaAddrs) with scalar RF=2 fallback
Known limitation (documented in test):
- extractServerID currently uses address as server ID (matches
master registry ReplicaInfo.Server format)
- IP change = different server ID in current model
- Registry-backed stable server ID deferred
6 new tests:
- PrimaryAssignment_StableIdentity: real assignment → stable ID
- PrimaryAssignment_MultiReplica: RF=3 multi-replica mapping
- AddressChange_SameServerID: documents current identity boundary
- EpochFencing_IntegratedPath: epoch 1 → bump → epoch 2 through
real assignment conversion + engine
- RebuildAssignment: rebuilding role → SessionRebuild
- ReplicaAssignment: replica role with local server ID
Delivery template:
Changed contracts: real BlockVolumeAssignment → engine intent
Fail-closed: unknown role returns empty intent
Carry-forward: address-based server ID, not registry-backed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FC1: now asserts HasActiveSession() after address change AND
verifies session_created in log (not just plan_cancelled).
FC4: escalation event detail must be >15 chars (contains proof
reason with LSN values, not just "needs_rebuild").
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P2 tests now force conditions instead of observing them:
FC3: Real WAL scan verified directly — StreamWALEntries transfers
real entries from disk (head=5, transferred=5). Engine planning also
verified (ZeroGap in V1 interim documented).
FC4: ForceFlush advances checkpoint/tail to 20. Replica at 0 is
below tail → NeedsRebuild with proof: "gap_beyond_retention: need
LSN 1 but tail=20". No early return.
FC5: ForceFlush advances checkpoint to 10. Assertive:
- replica at checkpoint=10 → ZeroGap (V1 interim)
- replica at 0 → NeedsRebuild (below tail, not CatchUp)
FC1/FC2: Labeled as integrated engine/storage (control simulated).
New: BlockVol.ForceFlush() — triggers synchronous flusher cycle for
test use. Advances checkpoint + WAL tail deterministically.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 failure-class replay tests against real file-backed BlockVol,
exercising the full integrated path:
bridge adapter → v2bridge reader/pinner → engine planner/executor
FC1: Changed-address restart — identity preserved, old plan cancelled,
new session created. Log shows plan_cancelled + session_created.
FC2: Stale epoch after failover — sessions invalidated at old epoch,
new assignment at epoch 2 creates fresh session. Log shows
per-replica invalidation.
FC3: Real catch-up (pre-checkpoint) — engine classifies from real
RetainedHistory, zero-gap in V1 interim (committed=0 before flush).
Documents the V1 limitation explicitly.
FC4: Unrecoverable gap — after flush, if checkpoint advances, replica
behind tail gets NeedsRebuild. Documents that V1 unit test may
not advance checkpoint (flusher timing).
FC5: Post-checkpoint boundary — replica at checkpoint = zero-gap in
V1 interim. Explicitly documents the catch-up collapse boundary.
go.mod: added replace directives for sw-block engine + bridge modules.
Carry-forward (explicit):
- CommittedLSN = CheckpointLSN (V1 interim)
- FC3/FC4/FC5 limited by flusher not advancing checkpoint in unit tests
- Executor snapshot/full-base/truncate still stubs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7 tests in weed/storage/blockvol/v2bridge/bridge_test.go:
Reader (2 tests):
- StatusSnapshot reads real nextLSN, WALCheckpointLSN, flusher state
- HeadLSN advances with real writes
Pinner (2 tests):
- HoldWALRetention: hold tracked, MinWALRetentionFloor reports position,
release clears hold
- HoldRejectsRecycled: validates against real WAL tail
Executor (2 tests):
- StreamWALEntries: real ScanFrom reads WAL entries from disk
- StreamPartialRange: partial range scan works
Stubs (1 test):
- TransferSnapshot/TransferFullBase/TruncateWAL return not-implemented
All tests use createTestVol (1MB file-backed BlockVol with 256KB WAL).
No mock/push adapters — direct real blockvol instances.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Finding 1: WALTailLSN semantic fix
- StatusSnapshot().WALTailLSN now reads super.WALCheckpointLSN (an LSN)
- Was: wal.Tail() which returns a physical byte offset
- Entries with LSN > WALTailLSN are guaranteed in the WAL
Finding 2: ScanWALEntries replay-source fix
- ScanWALEntries passes super.WALCheckpointLSN as the recycled boundary
- Was: flusher.CheckpointLSN() which in V1 equals CommittedLSN
- The flusher's live checkpoint may advance in memory, but entries above
the durable superblock checkpoint are still physically in the WAL
- Normal catch-up (replica at 70, committed at 100) now works because
fromLSN=71 > super.WALCheckpointLSN (which is the last persisted
checkpoint, not the live flusher state)
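Illustrative sketch of why the boundary choice matters (not the
ScanWALEntries code). stillInWAL is a hypothetical predicate; the replica
and committed LSNs come from the example above, while the persisted
checkpoint value of 50 is assumed:

```go
package main

import "fmt"

// stillInWAL reports whether replay can start at fromLSN, given the LSN
// boundary at or below which entries may already have been recycled.
func stillInWAL(fromLSN, recycledBoundaryLSN uint64) bool {
	return fromLSN > recycledBoundaryLSN
}

func main() {
	const (
		replicaLSN         = 70  // replica position from the commit message
		liveCheckpointLSN  = 100 // in-memory flusher checkpoint (equals CommittedLSN in V1)
		superCheckpointLSN = 50  // assumed last persisted superblock checkpoint
	)
	fromLSN := uint64(replicaLSN + 1)

	// Old boundary: the live checkpoint wrongly marks LSN 71 as recycled.
	fmt.Println(stillInWAL(fromLSN, liveCheckpointLSN)) // false
	// New boundary: LSN 71 is above the durable checkpoint, so it is replayable.
	fmt.Println(stillInWAL(fromLSN, superCheckpointLSN)) // true
}
```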
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>