Root cause for "volume not ready" gate: missing
--expected-slots-per-volume 2 flag on blockmaster.
Default is 3; QA's 2-node topology had 2 slots; controller
silently rejected observation snapshot (cmd/blockmaster/main.go:39).
Fix verified locally on Windows (single-node, no m01/M02 needed):
- Add --expected-slots-per-volume 2 to blockmaster command
- Primary reaches Healthy=true with epoch=1
- assignment-received fires; durable storage opens; status
endpoint serves {"Healthy":true}
Lesson learned (process improvement): for V3-internal bring-up
debug, try single-node local reproduction FIRST. The cluster
bring-up gate is V3 logic, not network topology. Reproduces in
seconds locally with full source-code access; m01/M02 only needed
for cross-node-specific scenarios (real network conditions,
iptables, multi-host wire).
Secondary finding: replica r2 sees primary r1's assignment but
records "supersede, not applying to adapter" because T1
HealthyPathExecutor only handles primary case. For G5-4 replica
bring-up, sw needs to wire T4a-T4d ReplicationVolume + ReplicaPeer
+ ReplicaListener stack (not just --t1-readiness flag). This is
the actual next gap for G5-4.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: cmd/blockmaster/main.go hardcoded ExpectedSlotsPerVolume=3.
QA's 2-slot topology silently failed validateVolumeTopology in the
controller, so no assignments were minted, no master-log lines,
and volumes timed out at durable open.
Fix landed in seaweed_block@f5de7c5: --expected-slots-per-volume
CLI flag, default 3, set 2 for the 2-node smoke.
QA next: rebuild blockmaster, pass --expected-slots-per-volume 2
in §3.4 of the handoff command sequence; rest unchanged.
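A minimal sketch of the flag plus an explicit (non-silent) topology check. The flag name and default of 3 come from this log; `parseExpectedSlots`, `validateSlots`, and the error text are hypothetical stand-ins, not the actual code in cmd/blockmaster/main.go or validateVolumeTopology.

```go
package main

import (
	"flag"
	"fmt"
)

// parseExpectedSlots reads --expected-slots-per-volume from args,
// defaulting to 3 as described above. Hypothetical helper.
func parseExpectedSlots(args []string) (int, error) {
	fs := flag.NewFlagSet("blockmaster", flag.ContinueOnError)
	n := fs.Int("expected-slots-per-volume", 3, "slots each volume must report")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *n, nil
}

// validateSlots is a hypothetical stand-in for the controller's topology
// check. It returns an error instead of dropping the snapshot silently.
func validateSlots(observed, expected int) error {
	if observed != expected {
		return fmt.Errorf("topology rejected: observed %d slots per volume, expected %d", observed, expected)
	}
	return nil
}

func main() {
	for _, args := range [][]string{nil, {"--expected-slots-per-volume", "2"}} {
		expected, err := parseExpectedSlots(args)
		if err != nil {
			fmt.Println("flag error:", err)
			continue
		}
		// Assumed observation: the 2-node smoke topology reports 2 slots.
		if err := validateSlots(2, expected); err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println("topology accepted with expected =", expected)
	}
}
```

Returning an error here is only an illustration of what a non-silent rejection could look like; the landed fix is the flag itself.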
Records QA's cross-node smoke attempt 2026-04-26: infrastructure
fully verified READY (m01+M02 reachability, SMB share for binary
distribution, master cross-node listen, network OK), but cluster
bring-up blocked at V3-internal gate.
Symptom: blockvolume on both nodes connects to master but logs
"durable open: frontend: volume not ready" — never reaches steady
state, status endpoint never binds, master log shows no heartbeat
or assignment-mint events.
Hand-off contents:
- §1 specific questions for sw (5 gaps to fill)
- §2 infrastructure verified READY (no action needed)
- §3 copy-pasteable commands sw can run/debug
(build → topology → master → primary → replica → cleanup)
- §4 QA's hypothesis on the gap (assignment-from-master flow)
- §5 debug suggestions for sw (log levels, integration test
references)
- §6 G5-4 script skeleton current state
- §7 QA's next steps once sw answers
Working dirs reproducible:
- Binaries: /mnt/smb/work/share/g5-binaries/{blockmaster,blockvolume}
- Run state: /tmp/g5sm/ on both nodes
- Logs: /tmp/g5sm/logs/{master,primary,replica}.log
Blocks: G5-4 implementation work (script scenario bodies, hardware
first-light scenarios). Does NOT block QA scenario authoring at
component scope (Cluster framework already covers that).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
QA infra-check round 2026-04-26: surfaces the real readiness gaps
before the architect ratifies the G5-4 schedule.
m01 (192.168.1.181 — primary node):
✅ 32-day uptime; passwordless sudo; 16 cores; 19 GiB RAM
✅ 177 GiB free disk; Go 1.26.2 installed
✅ iptables / netns / multi-process tools all available
✅ T2 m01 NVMe script template available as pattern reference
M02 (192.168.1.184 — replica node):
✅ Reachable from m01 (0.92ms); same kernel; 178 GiB free disk
❌ Go NOT installed — must scp binaries from m01
Implication for G5-4:
Build binaries on m01, scp to M02. Same cross-node binary pattern
T2 already uses for its iSCSI target deployment. G5-4 skeleton at
seaweed_block/scripts/iterate-m01-replicated-write.sh implements
this build-then-scp flow.
No infrastructure blockers. Architecture ready as soon as G5 mini-plan
ratifies scenario list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two artifacts landing together to close T4 batch series:
1. v3-phase-15-t4d-closure-report.md (NEW)
QA single-sign artifact for T4d batch close per §8C.2; architect
T-end three-sign per §8C.1 (T4d IS final T4 batch — confirmed at
round-48 review). Round-48 + round-49 corrections incorporated:
- Part C commit hash bound to e642ae8 throughout
- CARRY-T4D-LANE-CONTEXT-001 bind point = post-G5 hardening
backlog (not T4e — consistent with "T-end at this close")
- §H Finding #1 reworded — walstore HAS background flusher
(walstore.go:189-190); QA's earlier "caller-driven" was wrong
- §H Finding #3 RESOLVED at a0be6d5 (T2A NVMe race fixed +
m01 -race ×50 PASS)
- 16 invariants pinned (added 2 named for part C bug fixes:
INV-REPL-FAILED-SESSION-KIND-DRIVES-ESCALATION +
INV-REPL-REBUILD-ESCALATION-STICKY-UNTIL-TERMINAL)
- 22/22 packages green under -race on m01 (post-a0be6d5)
2. v3-phase-15-t4d-mini-plan.md (NEW — was uncommitted across
v0.1 → v0.5 evolution)
Final v0.5 incorporates: architect Path B fold; round-47
rebuild path engine-driven HARD GATE expansion; G5-DECISION-001
named decision record; 4-batch shape ratified; T4d-3 G-1 binding.
Active forward-carries (post-G5 hardening backlog):
- CARRY-T4D-LANE-CONTEXT-001 — replace TargetLSN==1 caller shim
with true handler/session-context lane signal
- G5-DECISION-001 — engine recovery state behavior across
primary restart (Path A persist vs Path B rebuild-from-probe)
G5 collective close items (NOT post-G5):
- m01 hardware first-light for replicated write path
- Multi-replica concurrent live + recovery scenarios
- walstore flusher cadence verification + tuning policy
- Minimal metrics/backpressure assessment
- G5-DECISION-001 architect resolution
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Architect sign by pingqiu 2026-04-25:
"T4d v0.2 scope accepted as one batch series; Option C for appliedLSN
source; BlockStore walHead hotfix may land pre-T4d; substrate defense-
in-depth included where practical; 4-batch order approved; T4d-3 G-1
required; T4d-2 no G-1; T-end three-sign at T4d close if T4d remains
final T4 batch."
All open architect-decision points (§2 scope, §2.5 Option/hotfix/
substrate, §3 batch shape, §4 acceptance bar) resolved. §6 open
issues all closed. §8 inscribes the verbatim ratification record.
Sw clearances effective immediately:
- Land BlockStore walHead one-liner as pre-T4d hotfix (single PR with
un-skipped regression test)
- Produce T4d mini-plan (4-batch shape per §3)
- Produce T4d-3 G-1 V2 read on wal_shipper.go runCatchUpTo
- T4d-2 spec is round-43/44 architect text (no G-1 needed)
T-end horizon: §8C.1 T-end three-sign lands at T4d close IF T4d
remains final T4 batch (per architect's criterion #10 wording tweak).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
QA single-sign artifact for T4c batch close per §8C.2; architect
acceptance of §B scope deltas signed 2026-04-25 by pingqiu.
Scope deltas accepted:
- T4c closes as mid-T4 batch under §8C.2, not T4 T-end
- L2/L3 mini-plan bar narrowed to muscle-level L2 + component evidence
- L3 m01 first-light deferred to T4d / G5 final close
- Substring "WAL recycled" matching accepted as TEMPORARY, replacement
bound to T4d (preferred) or G5 final sign (latest)
- INV-REPL-CATCHUP-WITHIN-RETENTION-001 downgraded to T4d blocker
(catch-up sender hardcodes ScanLBAs(1); replica's R+1 not threaded)
Doc-hygiene fixes per PM round-2 review (this commit):
- Drop INV-REPL-CATCHUP-DONE-MARKER-EMITTED (non-existent: V2 marker
collapsed into barrier-as-terminator per catchup_sender.go:48,187)
- §B/#2 + #5 reword "green at HEAD" to acknowledge architect Windows
cleanup-only repro failures (tracked as next-batch carry)
- Active formal-INV count 8 -> 6
Forward-carries to T4d (BLOCKERS):
- R+1 catch-up threading (StartCatchUp signature + adapter wire)
- Full engine→adapter→executor recovery wiring
- Structured RecoveryFailureKind replacing substring sentinel
- LastSentMonotonic_AcrossRetries cross-call form scenario
- Windows TempDir cleanup race investigation
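A sketch of what a structured failure kind could look like in place of matching the "WAL recycled" substring. Only the name `RecoveryFailureKind` appears in this log; the constants, the `RecoveryError` type, and the helpers are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// RecoveryFailureKind classifies a recovery failure without string matching.
type RecoveryFailureKind int

const (
	FailureUnknown     RecoveryFailureKind = iota
	FailureWALRecycled                     // requested LSN fell out of retention
	FailureTransport                       // connection-level failure
)

// RecoveryError is a hypothetical carrier for the kind plus the cause.
type RecoveryError struct {
	Kind RecoveryFailureKind
	Err  error
}

func (e *RecoveryError) Error() string {
	return fmt.Sprintf("recovery failed (kind=%d): %v", e.Kind, e.Err)
}

func (e *RecoveryError) Unwrap() error { return e.Err }

// kindOf extracts the kind even when the error has been wrapped.
func kindOf(err error) RecoveryFailureKind {
	var re *RecoveryError
	if errors.As(err, &re) {
		return re.Kind
	}
	return FailureUnknown
}

// wrapCatchUp mimics an intermediate layer adding context with %w.
func wrapCatchUp(err error) error { return fmt.Errorf("catch-up: %w", err) }

func main() {
	base := &RecoveryError{Kind: FailureWALRecycled, Err: errors.New("LSN below retained range")}
	fmt.Println(kindOf(wrapCatchUp(base)) == FailureWALRecycled)
}
```

Unlike a substring match, the classification survives rewording of the message and wrapping by intermediate layers.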
Forward-carry to G5 final close:
- m01 hardware first-light for replicated write path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes QA round-2 feedback loop. Three concerns resolved and one L2-blocker
hazard added.
## Q1-Q3 resolution (sw-verifiable per QA concern; V2 source check)
Q1 scope completeness: VERIFIED complete. V2 grep shows sync_all_* are
three test files only — `sync_all_adversarial_test.go`, `sync_all_bug_test.go`,
`sync_all_protocol_test.go`. Zero production files for sync_all / split_brain /
takeover / arbiter. These are cross-entity invariants, not distinct types.
10-entity set stands.
Q2 ReplicaReceiver scope: VERIFIED per-volume, not per-assignment.
`v.replRecv = recv` at `blockvol.go:1515` is the only write site; zero
`replRecv = nil` assignments in codebase. Receiver is constructed-once per
BlockVol instance. L1 §2.3 wording stands.
Q3 RebuildSession/Bitmap durability: VERIFIED no sidecar. Grep
`rebuild_bitmap.go` + `rebuild_session.go` for `os.Open / os.Create /
WriteFile / ReadFile / persist / sidecar` → empty. Recovery is WAL
hydration only (`hydrateBitmapFromRecoveredWAL` at `rebuild_session.go:102`).
L1 §2.10 invariant #3 CORRECTED — earlier draft incorrectly called out a
"sidecar schema" that doesn't exist.
## QA concern #3 resolution: §3.14 new hazard
`AllBlocks()` semantic divergence: V3 `walstore.go:565` and
`smartwal/store.go:367` both call `s.Read(lba)` which reads through the
dirty map (includes unflushed WAL bytes). V2 `rebuild.go:handleExtentStream`
uses `readBlockFromExtent` which BYPASSES dirty map (flushed-only).
Concrete impact: V3 base stream can contain bytes the primary hasn't fsynced.
If primary crashes pre-fsync, replica's copy is "newer" than primary's
recovered state. Epoch fencing + WAL-wins bitmap still prevent corruption,
but the invariant chain is "eventually consistent via epoch churn" instead
of V2's "base stream never contains unflushed bytes". Different contracts,
same end state.
Two L2 options proposed: (a) keep AllBlocks semantics + document non-claim
in §2.7 bridge; (b) add `LogicalStorage.AllBlocksFlushed()` preserving V2
invariant. H5 architect-line decision affects which path is safer.
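The divergence can be illustrated with a toy model. This is a sketch only: the `store` type and its two maps are assumptions, not the walstore or smartwal layout; `Read` and `ReadFlushed` stand in for the read-through and flushed-only behaviors described above.

```go
package main

import "fmt"

// store is a toy model: extent holds flushed blocks, dirty holds
// blocks that so far exist only in unflushed WAL.
type store struct {
	extent map[uint32]string
	dirty  map[uint32]string
}

// Read goes through the dirty map first (read-through, as V3 AllBlocks does).
func (s *store) Read(lba uint32) string {
	if b, ok := s.dirty[lba]; ok {
		return b
	}
	return s.extent[lba]
}

// ReadFlushed bypasses the dirty map (flushed-only, as V2's extent read does).
func (s *store) ReadFlushed(lba uint32) string { return s.extent[lba] }

// newDemoStore builds a store where LBA 7 has an unflushed overwrite.
func newDemoStore() *store {
	return &store{
		extent: map[uint32]string{7: "flushed"},
		dirty:  map[uint32]string{7: "unflushed"},
	}
}

func main() {
	s := newDemoStore()
	fmt.Println("read-through:", s.Read(7))
	fmt.Println("flushed-only:", s.ReadFlushed(7))
}
```

A base stream built on `Read` would ship "unflushed" for LBA 7, which is exactly the byte range the primary could lose on a pre-fsync crash.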
## QA concern #2 resolution: §3.a locked-pairs section (new)
Documents pre-coupled L2 decisions driven by V3 existing shape:
- H6 Option C → H7b locks automatically (Provider intercepts at LogicalStorage
  layer; Backend.Write stays host-facing, doesn't carry LSN)
- §3.14 + H5 → AllBlocks safety rationale depends on which H5 shape wins
Per BUG-005 documentation-discipline lesson: record coupled pairs explicitly
rather than leaving them as "implied". Saves L2 cycles and gives future
readers visible intent for why Backend.Write excludes LSN.
## QA concern #1 deferred to L2
Volumes map extension (single-map with role discrimination vs two separate
primaryHandles + replicaHandles maps) is a legitimate L2 design concern.
L1 appropriately hedges with "likely needs to grow" (§3.11 Option C); L2
picks shape. QA's BUG-005-adjacent concern (role-discriminated handle
callers forgetting to check role) is the right frame for the L2 decision.
No L1 edit needed; flagged for L2 attention.
## §4 open questions status
- Q1-Q3 ✓ resolved
- Q4 DistGroupCommit residence → effectively answered by §3.11 C
- Q5 protocol-frame wire-compat stance → still architect-line (pairs with H5)
Blocking L2 start now: only H5 + Q5, both architect-line. QA to draft
one-page arch memo per round-2 offer.
## Change log
- §5 feedback-round log gains round-3 entry
- §6 change log gains full round-3 detail with V2 line citations
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§8C.8 specifies exactly one three-sign per T-boundary — at T-start,
covering the bundled L1+L2+L3 package. I had proposed a separate
L1 three-sign in §5 that isn't in the rule. Architect correctly
pushed back.
§5 rewritten as lightweight cadence:
1. sw V3 pre-scan (~5 min, inline reply, prerequisite to L2 not a
sign gate) — same grep checklist retained, same BUG-005 rationale
2. sw + QA iterate on L2 (catalogue §3 filled) informally
3. sw + QA draft L3 (T4 port plan sketch)
4. T4 T-start three-sign on bundled L1+L2+L3 (only governance event)
Informal feedback-round log hook added so architect/PM inputs are
tracked without per-round sign ceremony.
Change log updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 5 feedback items accepted; no subsetting.
F1 — RebuildBitmap split into standalone §2.10 entity (10 total,
was 9). Rationale: bitmap has independent on-disk schema (~84 LOC
rebuild_bitmap.go) + independent conflict-resolution invariant
(WAL-wins-over-base). Collapsing into §2.6 RebuildSession at L1
would lose granularity for L2 — bitmap and session may have
different PRESERVE/REBUILD verdicts. §2.6 now explicitly
cross-references §2.10.
F2 — ShipperGroup §2.2 gains "External deps" row: N = RF comes
from master assignment via BlockVol.SetReplicaAddrs, not from
shipper-internal decision. Cross-entity contract (master assignment
↔ ShipperGroup size ↔ ReplicaReceiver expected-connection-count
↔ DistGroupCommit quorum arithmetic) made explicit so L2 split
can't silently drift sync_quorum.
F3 — ReplicaBarrier §2.4 scope rewritten from "per-request
ephemeral" to "per-request call-closure, BUT queue-state shared
per-volume via cond.Wait". Prior wording risked 1:1-porting into
a V3 stateless function, losing multi-watcher cond.Broadcast
semantics.
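To make the F3 point concrete, here is a sketch of per-volume queue state shared across per-request waiters through `sync.Cond`. The `barrierQueue` type and its methods are hypothetical; they are not the V2 ReplicaBarrier code, only the shape that a stateless port would lose.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// barrierQueue is per-volume shared state; each request is one waiter on it.
type barrierQueue struct {
	mu         sync.Mutex
	cond       *sync.Cond
	flushedLSN uint64
}

func newBarrierQueue() *barrierQueue {
	q := &barrierQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// WaitFor blocks until the shared flushedLSN reaches lsn.
func (q *barrierQueue) WaitFor(lsn uint64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.flushedLSN < lsn {
		q.cond.Wait()
	}
}

// Advance raises flushedLSN and wakes every waiter, not just one.
func (q *barrierQueue) Advance(lsn uint64) {
	q.mu.Lock()
	if lsn > q.flushedLSN {
		q.flushedLSN = lsn
	}
	q.mu.Unlock()
	q.cond.Broadcast()
}

// runWaiters starts n waiters on one queue and reports how many woke up
// after a single Advance.
func runWaiters(n int, lsn uint64) int {
	q := newBarrierQueue()
	var wg sync.WaitGroup
	var woken int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			q.WaitFor(lsn)
			atomic.AddInt64(&woken, 1)
		}()
	}
	q.Advance(lsn)
	wg.Wait()
	return int(woken)
}

func main() {
	fmt.Println("waiters released by one Advance:", runWaiters(3, 10))
}
```

A stateless per-request function has no shared `flushedLSN` and no condition variable, so one advance could not release several concurrent barrier requests.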
H5 added to §3 observations — cross-node epoch consistency
observation window for sync_quorum. V2 implicit via ack frame
carrying epoch; V3 L2 must pick "ack frame carries epoch" vs
"primary maintains per-replica epoch cache" before locking.
Different choices → different failover + rebuild-trigger semantics.
H6 added to §3 observations — write-path vs replication-path
concurrency residence. Three L2 options documented:
A) StorageBackend.Write triggers shipper (violates T3a layering)
B) ReplicatedBackend wraps StorageBackend+shipper (clean; +1 entity)
C) Replication inside DurableProvider (extends BUG-005 lesson)
L1 makes no recommendation; L2 LOCKS the decision before L3.
§5 restructured into 5 gated steps; step 1 is a mandatory sw V3
pre-scan of core/frontend/durable/ + core/frontend/*.go for
pre-baked replication-adjacent assumptions. Rationale cited per
architect: BUG-005 latent drift came from implicit V3 convention;
L1 must surface any such convention before L2 verdicts lock.
Concrete grep checklist included so the scan is 5 min, not open-ended.
§2 header + §4 open question #1 updated for 10-entity count.
Scope block references rebuild_bitmap.go explicitly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Predecessor docs for the T3 batch, retained here for audit trail.
The closure report (`v3-phase-15-t3-closure-report.md`), contract
bridge catalogue, and BUG-005/006 artifacts already landed in
commits `4127e5136` + `6e196885e`; this commit fills the docs
those closure artifacts reference back to.
Landed:
  v3-phase-15-t3-port-plan-sketch.md   T3 umbrella sketch (rev-2.1, three-signed)
  v3-phase-15-t3-port-audit.md         T3.0 port audit + Addendum A (QA-signed)
  v3-phase-15-t3a-mini-plan.md         T3a scope + sign-off (CLOSED 0e1595c)
  v3-phase-15-t3b-mini-plan.md         T3b scope + sign-off (CLOSED 72d0d40)
  v3-phase-15-t3c-mini-plan.md         T3c scope + sign-off (CLOSED 829c6a9)
Total 1,346 lines of doc; no code impact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers 6 areas based on CockroachDB/Ceph/etcd/Longhorn research:
1. Structured logging: zap + JSON + channel model (OPS/STORAGE/REPL/ISCSI/AUDIT/HEALTH)
2. Distributed tracing: OpenTelemetry spans across write/rebuild/failover paths
3. Metrics: 40+ must-have Prometheus metrics with histogram latency buckets
4. Debug tools: debug zip (logs+pprof+state), log merge, live tail
5. Audit logging: every admin mutation with actor/target/operation/result
6. Alert design: 3 tiers (page/ticket/log), anti-patterns to avoid
Identifies existing gaps: no I/O latency histogram, no rebuild duration
metric, no audit trail, no structured logging, no distributed tracing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers three personas (developer/operator/platform engineer) with:
- One-command setup: weed server -block (10 seconds to first volume)
- Shell commands: block.list, block.status, block.health, block.create, etc.
- REST API: /block/volumes CRUD, /block/health
- Observability: Prometheus metrics, alerting rules, Grafana dashboard
- Actionable error messages (every error tells you what to do next)
- Dry-run by default for all destructive operations
Competitive comparison: 10s setup vs Ceph 30min, 13.5x write IOPS,
single binary for object + block storage.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1 feature updated: replace generic "structured results" with concrete
runs.db design (newline-delimited JSON, one line per run). Leverages
existing RunBundle system (manifest.json, result.json already exist).
New CLI commands: list, trend, gc, reindex, diff.
Regression detection via stddev comparison against rolling baseline.
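A sketch of one way the stddev comparison could work, assuming a higher-is-better metric such as IOPS, a population standard deviation, and a tunable multiplier `k`. The function names and the threshold rule are assumptions; the roadmap does not specify them.

```go
package main

import (
	"fmt"
	"math"
)

// meanStddev returns the mean and population standard deviation of xs.
func meanStddev(xs []float64) (mean, sd float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	var ss float64
	for _, x := range xs {
		d := x - mean
		ss += d * d
	}
	return mean, math.Sqrt(ss / float64(len(xs)))
}

// isRegression flags current when it falls more than k standard deviations
// below the rolling baseline mean. With fewer than 2 baseline runs there is
// no spread to compare against, so nothing is flagged.
func isRegression(baseline []float64, current, k float64) bool {
	if len(baseline) < 2 {
		return false
	}
	mean, sd := meanStddev(baseline)
	return current < mean-k*sd
}

func main() {
	baseline := []float64{100, 102, 98, 101, 99} // last five runs, arbitrary units
	fmt.Println(isRegression(baseline, 90, 3))
	fmt.Println(isRegression(baseline, 99, 3))
}
```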
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
testrunner-roadmap.md: P0-P3 feature plan for multi-version comparison,
Ceph adapter, result tracking, cluster templates, debug mode.
dm-stripe-two-server.yaml: proven Linux dm-stripe across 2 sw-block
volumes on 2 servers. Results: single=42K IOPS → striped=79K IOPS (1.87x).
Data integrity verified via md5. Zero sw-block code changes needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The architect's refactor correctly routes remote rebuild acks through
the shared observation path (pins, watchdog, deferred terminal success).
But requireReplicaSession fails with "sender not found" when the
orchestrator registry is reconciled between installSession and the
first ack arrival.
Fix: when emitTerminal=false (remote path), treat sender-not-found as
non-fatal. The remote coordinator already validated the session — the
sender lookup is for local observation only. Pins and watchdog handle
nil snap gracefully (updateRebuildProgressPin line 296 already checks
snap != nil).
This preserves the architect's design (shared observation + deferred
terminal success) while tolerating the sender registry race that only
affects the remote rebuild path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After RemoteRebuildIO.TransferFullBase returns, the OnAck callback has
already emitted SessionCompleted and stored achievedLSN. But
RebuildExecutor.Execute() continues calling sender methods which fail
("sender stopped") because the completion event already cleaned up the
sender. This error propagated to ExecutePendingRebuild which emitted a
spurious SessionFailed, knocking the mode back to degraded.
Fix: check remoteRebuildAchieved before emitting SessionFailed. If the
rebuild already completed via the ack path, log the post-completion
error but suppress the SessionFailed event.
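The decision can be sketched as a small pure function. `remoteRebuildAchieved` and `SessionFailed` are names from this commit; `outcome` and its return strings are hypothetical and only illustrate the branch order.

```go
package main

import (
	"errors"
	"fmt"
)

// errSenderStopped models the post-completion error described above.
var errSenderStopped = errors.New("sender stopped")

// outcome decides what to do after Execute returns. If the ack path already
// recorded completion, a late executor error is logged, not escalated.
func outcome(execErr error, remoteRebuildAchieved bool) string {
	switch {
	case execErr == nil:
		return "none"
	case remoteRebuildAchieved:
		return "log-only"
	default:
		return "SessionFailed"
	}
}

func main() {
	fmt.Println(outcome(errSenderStopped, true))  // completed via ack, then sender error
	fmt.Println(outcome(errSenderStopped, false)) // genuine failure
}
```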
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes for the remote rebuild path:
1. Base-only completion: when BaseLSN == TargetLSN, the base image covers
all data — no WAL tail needed. MarkBaseComplete now auto-satisfies the
WAL condition and calls TryComplete so the session completes immediately
after the base transfer finishes.
2. Base lane protocol handshake: runBaseLaneClient now sends MsgRebuildReq
{Type: RebuildSessionBase} before reading. The RebuildServer requires
this handshake to dispatch to ServeBaseBlocks. Without it, the server
received raw frames it couldn't understand.
3. Direct ack events: OnAck emits engine events directly (SessionCompleted,
SessionProgressObserved, SessionFailed) instead of routing through
ObserveReplicaRebuildSessionAck which requires the sender in the
orchestrator registry. The remote coordinator owns the session — no
registry lookup needed.
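Fix 1 (base-only completion) can be sketched as below. `MarkBaseComplete` and `TryComplete` are named in this commit; the `rebuildSession` struct and its fields are a simplified assumption, not the real session type.

```go
package main

import "fmt"

// rebuildSession is a simplified model of a replica rebuild session.
type rebuildSession struct {
	baseLSN, targetLSN uint64
	baseComplete       bool
	walSatisfied       bool
	completed          bool
}

// MarkBaseComplete records the end of the base transfer. When the base
// already covers the target (BaseLSN == TargetLSN) there is no WAL tail to
// wait for, so the WAL condition is satisfied on the spot.
func (s *rebuildSession) MarkBaseComplete() {
	s.baseComplete = true
	if s.baseLSN == s.targetLSN {
		s.walSatisfied = true
	}
	s.TryComplete()
}

// TryComplete completes the session once both conditions hold.
func (s *rebuildSession) TryComplete() bool {
	if s.baseComplete && s.walSatisfied {
		s.completed = true
	}
	return s.completed
}

func main() {
	baseOnly := &rebuildSession{baseLSN: 100, targetLSN: 100}
	baseOnly.MarkBaseComplete()
	withTail := &rebuildSession{baseLSN: 100, targetLSN: 120}
	withTail.MarkBaseComplete()
	fmt.Println(baseOnly.completed, withTail.completed)
}
```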
Also adds diagnostic logging on both sides:
- Replica: logs parsed RebuildAddr and base lane client start
- Primary: logs sender state after installSession
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The accepted ack from the replica is rejected with "sender not found"
even though installSession succeeds. Add diagnostic logging to verify
the sender exists in the orchestrator registry immediately after
installSession, and dump all registry IDs if not found.
This will reveal whether the sender is removed between installSession
and the ack arrival (by syncProtocolExecutionState, evaluateActivationGate,
or another ProcessAssignment that reconciles with a stale replica list).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When CommittedLSN=0 (sync_all mode, replica degraded), snapshot-tail
rebuild was chosen because IsRecoverable(checkpoint, 0) is vacuously
true (0 <= HeadLSN always). But snapshot-tail requires a valid committed
endpoint for tail-replay. Without it, ExecuteRebuildPlan calls
TransferSnapshot which RemoteRebuildIO doesn't support → immediate fail.
Fix: if CommittedLSN=0, force RebuildFullBase. This is the correct
source when the primary has data but no replica has confirmed durability.
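A sketch of the source selection after this fix. `RebuildFullBase` is named above; `chooseSource`, the `RebuildSnapshotTail` constant, and the reduced `isRecoverable` (modelled only as `end <= headLSN`, per the "vacuously true" remark) are assumptions.

```go
package main

import "fmt"

type rebuildSource int

const (
	RebuildSnapshotTail rebuildSource = iota
	RebuildFullBase
)

// isRecoverable is a deliberately reduced model: the tail end must not
// exceed the WAL head. With end == 0 this is always true.
func isRecoverable(end, headLSN uint64) bool { return end <= headLSN }

// chooseSource forces a full base rebuild when there is no committed
// endpoint to replay the tail up to.
func chooseSource(committedLSN, headLSN uint64) rebuildSource {
	if committedLSN == 0 {
		return RebuildFullBase
	}
	if isRecoverable(committedLSN, headLSN) {
		return RebuildSnapshotTail
	}
	return RebuildFullBase
}

func main() {
	fmt.Println(chooseSource(0, 200) == RebuildFullBase)
	fmt.Println(chooseSource(150, 200) == RebuildSnapshotTail)
}
```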
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: StatusSnapshot().CommittedLSN reports 0 in sync_all mode when
the replica shipper has no flushed progress (NeedsRebuild state). This is
correct for lineage-safe committed boundary, but PlanRebuild uses
CommittedLSN as RebuildTargetLSN. With target=0, shouldStartSessionCommand
rejects the StartRebuildCommand, and the rebuild IO never executes.
Fix: PlanRebuild falls back to HeadLSN when CommittedLSN is 0. The
primary's WAL head IS the data boundary the replica needs to reach.
The fact that no replica has confirmed durability is exactly why we're
rebuilding.
Also adds command type logging to coreApplyAndLog so the tester can
verify which commands are actually emitted vs silently dropped.
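The fallback can be sketched as follows. `planRebuildTarget` and `shouldStart` are hypothetical reductions of PlanRebuild and shouldStartSessionCommand, which do more than this in the real code.

```go
package main

import "fmt"

// planRebuildTarget picks the rebuild target LSN: the committed boundary
// when one exists, otherwise the primary's WAL head.
func planRebuildTarget(committedLSN, headLSN uint64) uint64 {
	if committedLSN == 0 {
		return headLSN
	}
	return committedLSN
}

// shouldStart models the rejection described above: a zero target never starts.
func shouldStart(targetLSN uint64) bool { return targetLSN > 0 }

func main() {
	target := planRebuildTarget(0, 4096)
	fmt.Println(target, shouldStart(target))
}
```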
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three correctness fixes for the remote rebuild path:
1. No double completion: for remote rebuilds, OnRebuildCompleted skips
RebuildCommitted since ObserveReplicaRebuildSessionAck already emitted
SessionCompleted on the accepted ack. One rebuild = one completion event.
2. SessionAckFailed with rejected observation: if OnAck rejects the failed
ack (stale session), don't use the sentinel errRebuildAckFailed. Return
a regular error so ExecutePendingRebuild emits the fallback SessionFailed.
No path leaves the engine session hanging.
3. Diagnostic logging in ExecutePendingRebuild: log the replicaID and
targetLSN on both nil-return (TakeRebuild mismatch) and successful take
paths. Also log the pending store in runRebuild with replicaID, targetLSN,
and IO type. This makes the TakeRebuild seam diagnosable on hardware
without rebuilding the engine package.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the broken primary-local rebuild executor with RemoteRebuildIO,
a server-side engine.RebuildIO implementation that coordinates remotely.
The primary sends SessionControlV2 (with RebuildAddr trailer) to the
replica's control channel; the replica starts a local rebuild session
and auto-connects to the primary's rebuild server for the base lane.
Single rebuild route: ALL core-present rebuilds use RemoteRebuildIO.
The entire command chain is preserved unchanged:
PlanRebuild → pending → RebuildStarted → StartRebuildCommand
→ ExecutePendingRebuild → RemoteRebuildIO.TransferFullBase
Key changes:
- SessionControlMsg v2: optional RebuildAddr trailer (len-based decode)
- ReplicaRebuilding shipper state: session-gated live WAL lane
- RemoteRebuildIO: dials replica ctrl, sends session control, reads acks
- Ack forwarding through ObserveReplicaRebuildSessionAck (pins/watchdog)
- Completion proof from replica's achievedLSN, not primary's local vol
- Transport failures emit SessionFailed (no double-emit on ack failures)
- Progress ack rejection fails closed (stale session = abort)
- Replica auto-starts base lane client on v2 session control
State transitions:
NeedsRebuild → [accepted ack] → Rebuilding → [completed] → InSync
Rebuilding → [failed/EOF] → NeedsRebuild → [next probe] → retry
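The transitions above can be written as a lookup table. The state names are from this commit; the event names and the `next` helper are assumptions. The retry after the next probe is modelled as a fresh accepted ack from NeedsRebuild.

```go
package main

import "fmt"

type state string
type event string

const (
	NeedsRebuild state = "NeedsRebuild"
	Rebuilding   state = "Rebuilding"
	InSync       state = "InSync"

	AcceptedAck event = "accepted-ack"
	Completed   event = "completed"
	Failed      event = "failed-or-EOF"
)

// transitions lists every allowed (state, event) pair.
var transitions = map[state]map[event]state{
	NeedsRebuild: {AcceptedAck: Rebuilding},
	Rebuilding:   {Completed: InSync, Failed: NeedsRebuild},
}

// next returns the new state, or ok=false when the event is not valid
// in the current state (the state is then left unchanged).
func next(s state, e event) (state, bool) {
	if to, ok := transitions[s][e]; ok {
		return to, true
	}
	return s, false
}

func main() {
	s := NeedsRebuild
	for _, e := range []event{AcceptedAck, Failed, AcceptedAck, Completed} {
		s, _ = next(s, e)
		fmt.Println(e, "->", s)
	}
}
```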
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace three bypass mechanisms with one unified model. When the
probe returns ProbeRebuildRequired, the host now starts the rebuild
through the existing recovery manager (StartRecoveryTask), which
resolves the rebuild address, plans the rebuild, and executes via
the v2bridge executor — the same path as master-driven RoleRebuilding.
New per-replica probe API:
- WALShipper.ProbeReconnect() → ReplicaProbeResult with typed outcome
- ShipperGroup.ProbeReconnectAll() → []ReplicaProbeResult
- BlockVol.ProbeReplicaOnboarding() / IsClosed()
Host-side wiring:
- handleReplicaProbeResult routes outcomes:
    KeepUp           → ShipperConnectedObserved
    CatchUp          → ShipperConnectedObserved (recovery manager handles session)
    Rebuild          → NeedsRebuildObserved + StartRecoveryTask (executes rebuild)
    TemporaryFailure → no-op
- lastAssignmentsForPath reconstructs assignment for recovery manager
- onPrimaryRosterChanged probes all replicas (defined, called from watchdog)
- observePrimaryShipperConnectivity uses probe API
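A sketch of the routing as a switch. The event and task names are from this commit; the `probeOutcome` constants (other than ProbeRebuildRequired) and the `route` function are hypothetical, and actions are returned as strings purely for illustration.

```go
package main

import "fmt"

type probeOutcome int

const (
	ProbeKeepUp probeOutcome = iota
	ProbeCatchUp
	ProbeRebuildRequired
	ProbeTemporaryFailure
)

// route maps a typed probe outcome to the actions the host takes.
func route(o probeOutcome) []string {
	switch o {
	case ProbeKeepUp, ProbeCatchUp:
		// For CatchUp the recovery manager handles the session itself.
		return []string{"ShipperConnectedObserved"}
	case ProbeRebuildRequired:
		return []string{"NeedsRebuildObserved", "StartRecoveryTask"}
	default:
		// TemporaryFailure: nothing to do until the next probe.
		return nil
	}
}

func main() {
	for _, o := range []probeOutcome{ProbeKeepUp, ProbeCatchUp, ProbeRebuildRequired, ProbeTemporaryFailure} {
		fmt.Println(o, route(o))
	}
}
```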
Probe fires via syncProtocolExecutionState immediately after assignment
processing — same heartbeat cycle, no timer delay.
Deleted: startDirectRebuild, resolveCtrlAddrForShipper,
TryReconnect/TryReconnectAll/TryReconnectShippers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When proactive reconnect finds that the WAL gap exceeds the retained range:
1. Emit per-replica NeedsRebuildObserved to engine (with ReplicaID)
2. Resolve replica ctrl address from shipper group
3. Start direct rebuild session: send sessionControl(start_rebuild)
to replica's ctrl channel, stream base blocks, emit RebuildStarted
The primary drives the rebuild directly without master round-trip.
The master sees the result via heartbeat projection (needs_rebuild →
rebuilding → healthy). This matches V2 authority: master owns identity,
primary owns data-control recovery.
Added WALShipper.CtrlAddr() getter for address resolution.
resolveCtrlAddrForShipper maps data address to ctrl address via
shipper group (works for RF=2 and RF=3+).
startDirectRebuild runs in a goroutine: dials replica ctrl, sends
start_rebuild, waits for accepted ack, serves base blocks, emits
RebuildStarted to engine on success.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Revert detectAndEnqueueRebuildFromHeartbeat (Bridge 2) — master
should not drive rebuild assignments from heartbeat. The primary
owns data-control recovery per the V2 authority split.
Fix Bridge 1: NeedsRebuildObserved now carries per-replica identity.
resolveReplicaIDForShipper maps shipper DataAddr to ReplicaID via
the shipper group (works for RF=2 and RF=3+). The engine receives
the specific replica that needs rebuild, not a volume-level broadcast.
Primary-direct rebuild: the primary detects which replica needs
rebuild and will drive the session directly. The master learns about
it via subsequent heartbeat projection (needs_rebuild → rebuilding →
healthy). No master round-trip needed for the rebuild decision.
Added WALShipper.DataAddr() getter for address resolution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After rejoin, the shipper is configured but no I/O triggers Ship(),
so the shipper stays Disconnected and the core stays at
awaiting_shipper_connected indefinitely.
Fix: observePrimaryShipperConnectivity now calls TryReconnectShippers
when ShipperConfigured=true but ShipperConnected=false. This triggers
the full reconnect protocol (dial + handshake + bounded catch-up)
proactively, bringing the replica current without waiting for I/O.
Option B approach: uses the same reconnect path as Barrier() — not a
fake write or bare dial probe. CatchUpTo(headLSN) replays any retained
WAL entries, bringing the replica fully current.
New methods:
- WALShipper.TryReconnect(): full reconnect without foreground I/O
- ShipperGroup.TryReconnectAll(): probes all disconnected shippers
- BlockVol.TryReconnectShippers(): volume-level entry point
Also fix pre-existing test expectation: engine now emits
start_recovery_task on primary assignment with replicas.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix recover path TOCTOU: re-Lookup after AddReplica so the primary
refresh assignment includes the freshly added replica addresses.
Previously, Lookup (copy) was called before AddReplica modified the
registry, so entry.Replicas was empty → primary got replicas=0 →
shipper never configured.
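The stale-copy hazard can be reproduced with a toy registry. `Lookup` and `AddReplica` are named above; the `registry` and `entry` types here are assumptions, and locking is omitted to keep the sketch short.

```go
package main

import "fmt"

type entry struct {
	Replicas []string
}

type registry struct {
	entries map[string]*entry
}

// Lookup returns a copy, so later registry mutations are not visible
// through a value obtained earlier.
func (r *registry) Lookup(name string) (entry, bool) {
	e, ok := r.entries[name]
	if !ok {
		return entry{}, false
	}
	return entry{Replicas: append([]string(nil), e.Replicas...)}, true
}

// AddReplica mutates the registry's own entry.
func (r *registry) AddReplica(name, addr string) {
	if e, ok := r.entries[name]; ok {
		e.Replicas = append(e.Replicas, addr)
	}
}

// recoverCounts returns the replica count seen by the stale copy and by
// a re-Lookup performed after AddReplica.
func recoverCounts() (stale, fresh int) {
	r := &registry{entries: map[string]*entry{"vol1": {}}}
	before, _ := r.Lookup("vol1") // taken before the mutation
	r.AddReplica("vol1", "10.0.0.2:4000")
	after, _ := r.Lookup("vol1") // the fix: look up again
	return len(before.Replicas), len(after.Replicas)
}

func main() {
	stale, fresh := recoverCounts()
	fmt.Println("stale copy replicas:", stale, "re-Lookup replicas:", fresh)
}
```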
Add 2 WAL pressure edge case tests:
- ShipperCatchUpOrEscalate: 64KB WAL, 200 writes, aggressive flusher.
Proves no hang/deadlock/corruption. Shipper either keeps up or
correctly escalates to NeedsRebuild.
- RebuildWithPinWhilePrimaryWrites: rebuild session active while
primary writes 7600+ blocks in 2s. Proves primary never freezes
— rebuild pin is on replica only, primary WAL recycles freely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 43 actions pass on m01/m02 hardware. Auto-failover PASS.
dd_write: 30s → 123ms. Post-failover write: 33,621 IOPS.
1. WAL retention: remove keepup retention floor (MinShippedLSN).
WAL cannot be pinned during sustained async writes — any pin
strategy either fills WAL (blocking writes) or over-recycles
(breaking catch-up). Flusher recycles freely. Future LBA map
will provide catch-up without WAL retention.
MinShippedLSN on ShipperGroup retained as diagnostic surface.
2. Registry stale-cleanup race: add RegisteredAt grace period.
Race: master registers volume → next VS heartbeat arrives before
VS discovers the volume → stale cleanup deletes the entry →
failover finds 0 entries. Fix: skip stale cleanup for entries
registered within 30s (> 2 heartbeat intervals).
2 new tests: grace protects new entry, old entry still cleaned.
3. Shutdown heartbeat: VS disconnect heartbeat no longer claims
block inventory authority. Previously, the shutdown beat's
empty inventory triggered stale cleanup, deleting the entry
before failover could use it.
Scenario fix: recovery-baseline-failover.yaml now kills the
correct node (discovered primary, not hardcoded), connects to
the correct new primary for post-failover verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire protocol messages and transport handlers for the rebuild MVP:
Protocol messages (rebuild_transport.go):
- SessionControlMsg: epoch, sessionID, command, baseLSN, targetLSN,
snapshotID. Encode/Decode with fixed 37-byte wire format.
- SessionAckMsg: epoch, sessionID, phase, walAppliedLSN, baseComplete,
achievedLSN. Encode/Decode with fixed 34-byte wire format.
- MsgSessionControl (0x10) and MsgSessionAck (0x11) on control channel.
- SendSessionControl/SendSessionAck convenience functions.
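A sketch of a fixed 37-byte encoding for SessionControlMsg. Only the field names and the 37-byte total come from this commit. The field widths (8+8+1+8+8+4), the field order, and big-endian byte order are assumptions chosen so the sizes add up; the real layout in rebuild_transport.go may differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const sessionControlSize = 37 // 8 + 8 + 1 + 8 + 8 + 4 (assumed widths)

type SessionControlMsg struct {
	Epoch      uint64
	SessionID  uint64
	Command    uint8
	BaseLSN    uint64
	TargetLSN  uint64
	SnapshotID uint32
}

// Encode writes the message into a fixed-size buffer.
func (m SessionControlMsg) Encode() []byte {
	b := make([]byte, sessionControlSize)
	binary.BigEndian.PutUint64(b[0:8], m.Epoch)
	binary.BigEndian.PutUint64(b[8:16], m.SessionID)
	b[16] = m.Command
	binary.BigEndian.PutUint64(b[17:25], m.BaseLSN)
	binary.BigEndian.PutUint64(b[25:33], m.TargetLSN)
	binary.BigEndian.PutUint32(b[33:37], m.SnapshotID)
	return b
}

// DecodeSessionControl rejects any buffer that is not exactly 37 bytes.
func DecodeSessionControl(b []byte) (SessionControlMsg, error) {
	if len(b) != sessionControlSize {
		return SessionControlMsg{}, errors.New("session control: wrong length")
	}
	return SessionControlMsg{
		Epoch:      binary.BigEndian.Uint64(b[0:8]),
		SessionID:  binary.BigEndian.Uint64(b[8:16]),
		Command:    b[16],
		BaseLSN:    binary.BigEndian.Uint64(b[17:25]),
		TargetLSN:  binary.BigEndian.Uint64(b[25:33]),
		SnapshotID: binary.BigEndian.Uint32(b[33:37]),
	}, nil
}

func main() {
	in := SessionControlMsg{Epoch: 1, SessionID: 42, Command: 1, BaseLSN: 100, TargetLSN: 250, SnapshotID: 7}
	out, err := DecodeSessionControl(in.Encode())
	fmt.Println(out == in, err)
}
```

Under the same width assumptions, SessionAckMsg would come to 34 bytes (8+8+1+8+1+8), which matches the size stated above.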
Transport handlers:
- RebuildTransportServer: primary-side, streams all extent blocks as
MsgRebuildExtent frames (reusing existing rebuild message type),
ends with MsgRebuildDone.
- RebuildTransportClient: replica-side, receives base blocks and
routes through vol.ApplyRebuildSessionBaseBlock, marks base
complete on MsgRebuildDone.
4 transport tests:
- SessionControl wire round-trip
- SessionAck wire round-trip
- BaseBlockStreaming: full TCP loop, 1024 blocks streamed and verified
- SessionControlOverTCP: real TCP send/receive with accepted ack
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add BlockService replica-side rebuild routing API that bridges
transport/host layer to BlockVol session surface:
StartReplicaRebuildSession(path, config)
ApplyReplicaRebuildWALEntry(path, sessionID, entry)
ApplyReplicaRebuildBaseBlock(path, sessionID, lba, data)
MarkReplicaRebuildBaseComplete(path, sessionID, totalBlocks)
TryCompleteReplicaRebuildSession(path, sessionID)
CancelReplicaRebuildSession(path, sessionID, reason)
ReplicaRebuildSession(path) → snapshot
Each method does one thing: validate → WithVolume → delegate to BlockVol.
No wire decoding, no protocol decisions, no state invention. Transport
wiring (sessionControl/walData/sessionData handlers) is the next step.
2 focused tests: skeleton routes correctly, stale session ID rejected.
Updated v2-rebuild-mvp-session-protocol.md with server skeleton section.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tighten the acceptance matrix with explicit per-boundary rows, a signoff
reading split into hard blockers vs product hardening, and a clear
rule: architecture-complete ≠ product-complete.
6 hard blockers before T6/T7:
1. WriteLBA/SyncCache/sync_all contract closure
2. Fresh replica bounded catch-up before live tail
3. Timeout/retention-loss classification for catch-up
4. publish_healthy alignment with one protocol contract
5. RF=2 stable identity on all shipping paths
6. Test audit for incorrect WriteLBA==commit assumptions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7-area acceptance matrix mapping current state vs product requirements:
write/durability contract, fresh replica bootstrap, host observation
completeness, serving/publish alignment, snapshot/rebuild convergence,
adapter consistency, test contract alignment.
Each item is marked with: current state, required for product, blocks
T6/T7, best test level. Items are priority-ordered into
must-close-before-Stage-1, should-close-before-Stage-2, and
can-close-after-T6/T7.
Key diagnosis: architecture-complete, execution-incomplete. The engine
thinks like a product; the data plane still behaves partly like a
prototype. The gap is end-to-end contract closure.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>