13263 Commits

Author SHA1 Message Date
pingqiu
fbcfe89e24 G5-4 bring-up hand-off v0.2 — RESOLVED via local debug
Root cause for "volume not ready" gate: missing
--expected-slots-per-volume 2 flag on blockmaster.

Default is 3; QA's 2-node topology had 2 slots; controller
silently rejected observation snapshot (cmd/blockmaster/main.go:39).

Fix verified locally on Windows (single-node, no m01/M02 needed):
  - Add --expected-slots-per-volume 2 to blockmaster command
  - Primary reaches Healthy=true with epoch=1
  - assignment-received fires; durable storage opens; status
    endpoint serves {"Healthy":true}

Lesson learned (process improvement): for V3-internal bring-up
debug, try single-node local reproduction FIRST. The cluster
bring-up gate is V3 logic, not network topology. Reproduces in
seconds locally with full source-code access; m01/M02 only needed
for cross-node-specific scenarios (real network conditions,
iptables, multi-host wire).

Secondary finding: replica r2 sees primary r1's assignment but
records "supersede, not applying to adapter" because T1
HealthyPathExecutor only handles primary case. For G5-4 replica
bring-up, sw needs to wire T4a-T4d ReplicationVolume + ReplicaPeer
+ ReplicaListener stack (not just --t1-readiness flag). This is
the actual next gap for G5-4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 10:41:02 -07:00
pingqiu
e21c686939 G5-4 m01+M02 bring-up — sw answer: --expected-slots-per-volume flag
Root cause: cmd/blockmaster/main.go hardcoded ExpectedSlotsPerVolume=3.
QA's 2-slot topology silently failed validateVolumeTopology in the
controller, so no assignments were minted, no master-log lines,
and volumes timed out at durable open.

Fix landed in seaweed_block@f5de7c5: --expected-slots-per-volume
CLI flag, default 3, set 2 for the 2-node smoke.

QA next: rebuild blockmaster, pass --expected-slots-per-volume 2
in §3.4 of the handoff command sequence; rest unchanged.
2026-04-26 10:37:20 -07:00
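
The failure mode in e21c686939 and fbcfe89e24 (a default of 3 expected slots meeting a 2-slot topology) can be illustrated with a minimal sketch. Only the flag name `--expected-slots-per-volume` and its default of 3 come from the commit messages; `parseExpectedSlots`, `validateTopology`, and the exact equality check are hypothetical stand-ins, not the code in cmd/blockmaster/main.go. The sketch returns an explicit error on mismatch, which is one way to avoid the silent rejection described above.

```go
package main

import (
	"flag"
	"fmt"
)

// validateTopology is a hypothetical stand-in for the controller's topology
// check. The equality rule is an assumption; the point is that a mismatch
// produces a visible error instead of being dropped silently.
func validateTopology(observedSlots, expectedSlots int) error {
	if observedSlots != expectedSlots {
		return fmt.Errorf("topology mismatch: observed %d slots, expected %d (check --expected-slots-per-volume)",
			observedSlots, expectedSlots)
	}
	return nil
}

// parseExpectedSlots reads --expected-slots-per-volume from args, defaulting to 3.
func parseExpectedSlots(args []string) (int, error) {
	fs := flag.NewFlagSet("blockmaster", flag.ContinueOnError)
	slots := fs.Int("expected-slots-per-volume", 3, "replica slots each volume must have")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *slots < 1 {
		return 0, fmt.Errorf("expected-slots-per-volume must be >= 1, got %d", *slots)
	}
	return *slots, nil
}

func main() {
	// First run uses the default; second run passes the flag as in the fix.
	for _, args := range [][]string{nil, {"--expected-slots-per-volume", "2"}} {
		expected, err := parseExpectedSlots(args)
		if err != nil {
			fmt.Println("flag error:", err)
			return
		}
		// The QA topology in the commit message has 2 slots per volume.
		if err := validateTopology(2, expected); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted with expected slots =", expected)
	}
}
```

With the default, the 2-slot topology is rejected with a message that names the flag; with the flag set to 2 it is accepted.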
pingqiu
2d9c2be9f3 G5-4 m01+M02 cluster bring-up — hand-off to sw
Records QA's cross-node smoke attempt 2026-04-26: infrastructure
fully verified READY (m01+M02 reachability, SMB share for binary
distribution, master cross-node listen, network OK), but cluster
bring-up blocked at V3-internal gate.

Symptom: blockvolume on both nodes connects to master but logs
"durable open: frontend: volume not ready" — never reaches steady
state, status endpoint never binds, master log shows no heartbeat
or assignment-mint events.

Hand-off contents:
  - §1 specific questions for sw (5 gaps to fill)
  - §2 infrastructure verified READY (no action needed)
  - §3 copy-pasteable commands sw can run/debug
    (build → topology → master → primary → replica → cleanup)
  - §4 QA's hypothesis on the gap (assignment-from-master flow)
  - §5 debug suggestions for sw (log levels, integration test
    references)
  - §6 G5-4 script skeleton current state
  - §7 QA's next steps once sw answers

Working dirs reproducible:
  - Binaries: /mnt/smb/work/share/g5-binaries/{blockmaster,blockvolume}
  - Run state: /tmp/g5sm/ on both nodes
  - Logs: /tmp/g5sm/logs/{master,primary,replica}.log

Blocks: G5-4 implementation work (script scenario bodies, hardware
first-light scenarios). Does NOT block QA scenario authoring at
component scope (Cluster framework already covers that).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 10:32:10 -07:00
pingqiu
ce78fea36f G5 kickoff §7a: m01 + M02 infrastructure verification (QA pre-ratify)
Per QA infra-check round 2026-04-26, surfaces real readiness gaps
before architect ratifies G5-4 schedule:

m01 (192.168.1.181 — primary node):
   32-day uptime; sudo password-less; 16 cores; 19 GiB RAM
   177 GiB free disk; Go 1.26.2 installed
   iptables / netns / multi-process tools all available
   T2 m01 NVMe script template available as pattern reference

M02 (192.168.1.184 — replica node):
   Reachable from m01 (0.92ms); same kernel; 178 GiB free disk
   Go NOT installed — must scp binaries from m01

Implication for G5-4:
  Build binaries on m01, scp to M02. Same cross-node binary pattern
  T2 already uses for its iSCSI target deployment. G5-4 skeleton at
  seaweed_block/scripts/iterate-m01-replicated-write.sh implements
  this build-then-scp flow.

No infrastructure blockers. Architecture ready as soon as G5 mini-plan
ratifies scenario list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 09:38:57 -07:00
pingqiu
a792ed67e5 G5 kickoff PROPOSAL v0.1 (post-T4 close)
QA-authored proposal opening G5 collective close planning.
Inherits 5 forward-carries from T4d closure §I as G5 scope:
  - m01 hardware first-light for replicated write path
  - Multi-replica concurrent live + recovery scenarios
  - G5-DECISION-001 resolution (Path A persist vs Path B rebuild)
  - walstore flusher cadence verification + tuning policy
  - Minimal metrics/backpressure assessment

5-batch shape proposed:
  - G5-1 multi-replica scenarios (component) — QA + sw framework
  - G5-2 walstore cadence verification — sw + architect
  - G5-3 metrics/backpressure assessment — sw + architect
  - G5-4 m01 hardware L3 first-light — QA + sw
  - G5-5 G5-DECISION-001 resolution + closure report — architect + sw + QA

QA recommendations:
  - G5-DECISION-001: Path B (rebuild from probe after restart) for
    MVP scope. T4d-4 part B already structurally enables (ReplicaState
    JSON-clean per TestG5Decision001_*); production restarts rare;
    Path A's persistence work substantial. Backwards-compatible
    upgrade later if production usage proves Path B insufficient.
  - G5-5 timing at close (after G5-1/2/3/4 evidence informs decision)
  - §2.2 explicit non-claims to prevent G5 scope creep:
    * CARRY-T4D-LANE-CONTEXT-001 → post-G5 hardening backlog
    * --durable-walsize CLI flag → post-G5
    * Snapshot-based catch-up → post-G5
    * Wire protocol versioning → post-G5
    * Auth/encryption/mTLS → post-G5

Status: ⏸ DRAFT — awaiting architect ratification on §2 scope +
§3 batch shape + §4 acceptance bar + §5 G5-DECISION-001 path.

No G5 code work begins until ratified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:26:15 -07:00
pingqiu
75d18e676f T4d batch close: catalogue invariant upgrades + checklist v0.3
Catalogue §3.3 — 12 T4d invariants flipped from ⏭ to ✓ PORTED with
specific commit hashes; 4 round-47/48 invariants newly inscribed:

Pre-existing flipped to ✓ PORTED:
  - INV-REPL-NO-PER-LBA-DATA-REGRESSION → bd2de99 + 01f4ab9
  - INV-REPL-RECOVERY-STALE-ENTRY-SKIP-PER-LBA → bd2de99
  - INV-REPL-RECOVERY-COVERAGE-ADVANCES-ON-SKIP → bd2de99
  - INV-REPL-LIVE-LANE-STALE-FAILS-LOUD → bd2de99
  - INV-REPL-RECOVERY-COVERAGE-RESTART-SAFE → bd2de99
  - INV-REPL-LANE-DERIVED-FROM-HANDLER-CONTEXT → 01f4ab9 + 44c60dd
    (with named carry CARRY-T4D-LANE-CONTEXT-001 to post-G5)
  - INV-REPL-TRANSPORT-STORAGE-CONTRACT-ONLY → 44c60dd + 1edeb36
  - INV-REPL-CATCHUP-FROMLSN-IS-REPLICA-FLUSHED-PLUS-1 → 44c60dd
  - INV-REPL-CATCHUP-FROMLSN-FROM-ENGINE-STATE-NOT-PROBE → 44c60dd

Newly inscribed (round-47 + round-48 architect additions):
  - INV-REPL-CATCHUP-EXHAUSTION-ESCALATES-TO-REBUILD → 812d3fa + e642ae8
  - INV-REPL-REBUILD-FAILURE-TERMINAL → 812d3fa
  - INV-REPL-FAILED-SESSION-KIND-DRIVES-ESCALATION (part C bug #1) → e642ae8
  - INV-REPL-REBUILD-ESCALATION-STICKY-UNTIL-TERMINAL (part C bug #2) → e642ae8

Forward-carry checklist v0.3:
  - All per-batch focus rows resolved
  - m01 -race verified across all T4d batches including T2A NVMe race fix
  - Status transitions from "active gating" to "G5-baseline"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:24:23 -07:00
pingqiu
2ee12b2c14 T4d batch close artifact + mini-plan v0.5 (architect-accepted)
Two artifacts landing together to close T4 batch series:

1. v3-phase-15-t4d-closure-report.md (NEW)
   QA single-sign artifact for T4d batch close per §8C.2; architect
   T-end three-sign per §8C.1 (T4d IS final T4 batch — confirmed at
   round-48 review). Round-48 + round-49 corrections incorporated:
   - Part C commit hash bound to e642ae8 throughout
   - CARRY-T4D-LANE-CONTEXT-001 bind point = post-G5 hardening
     backlog (not T4e — consistent with "T-end at this close")
   - §H Finding #1 reworded — walstore HAS background flusher
     (walstore.go:189-190); QA's earlier "caller-driven" was wrong
   - §H Finding #3 RESOLVED at a0be6d5 (T2A NVMe race fixed +
     m01 -race ×50 PASS)
   - 16 invariants pinned (added 2 named for part C bug fixes:
     INV-REPL-FAILED-SESSION-KIND-DRIVES-ESCALATION +
     INV-REPL-REBUILD-ESCALATION-STICKY-UNTIL-TERMINAL)
   - 22/22 packages green under -race on m01 (post-a0be6d5)

2. v3-phase-15-t4d-mini-plan.md (NEW — was uncommitted across
   v0.1 → v0.5 evolution)
   Final v0.5 incorporates: architect Path B fold; round-47
   rebuild path engine-driven HARD GATE expansion; G5-DECISION-001
   named decision record; 4-batch shape ratified; T4d-3 G-1 binding.

Active forward-carries (post-G5 hardening backlog):
  - CARRY-T4D-LANE-CONTEXT-001 — replace TargetLSN==1 caller shim
    with true handler/session-context lane signal
  - G5-DECISION-001 — engine recovery state behavior across
    primary restart (Path A persist vs Path B rebuild-from-probe)

G5 collective close items (NOT post-G5):
  - m01 hardware first-light for replicated write path
  - Multi-replica concurrent live + recovery scenarios
  - walstore flusher cadence verification + tuning policy
  - Minimal metrics/backpressure assessment
  - G5-DECISION-001 architect resolution

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 00:21:20 -07:00
pingqiu
80036404ce T4d planning + G-1 doc landing (architect Path B + Issue 2(a) ratification)
Lands four T4d planning artifacts together:

1. v3-phase-15-t4d-3-g1-v2-read.md (NEW)
   T4d-3 G-1 V2 read v0.2, QA-signed in conversation 2026-04-25.
   Per architect Issue 2(a) ratification: G-1 docs land first;
   implementation references the committed hash. Future T4d-3
   commits should reference this commit's sha via:
     Refs G-1 sign: <this-commit-sha>

2. v3-phase-15-t4d-forward-carry-checklist.md (NEW)
   v0.2 — 19 active T4a/T4b/T4c invariants with risk grades and
   per-batch focus rows. T4d-3 close gate inscribed
   (CARRY-T4D-LANE-CONTEXT-001 option A or B); pre/with-T4d-3
   doc fixes recorded.

3. v3-phase-15-t4d-qa-scenario-catalogue.md (NEW)
   v0.1 — 9 QA component-scope scenarios mirroring T4c QA
   Stage-1 discipline. 10 framework primitives surfaced for
   sw's batch PRs.

4. v2-v3-contract-bridge-catalogue.md (UPDATED)
   §3.3 inscriptions for T4d-locked invariants:
     - INV-REPL-NO-PER-LBA-DATA-REGRESSION (round-43)
     - INV-REPL-RECOVERY-STALE-ENTRY-SKIP-PER-LBA (round-43)
     - INV-REPL-RECOVERY-COVERAGE-ADVANCES-ON-SKIP (round-44)
     - INV-REPL-LIVE-LANE-STALE-FAILS-LOUD (round-44)
     - INV-REPL-RECOVERY-COVERAGE-RESTART-SAFE (Option C)
     - INV-REPL-LANE-DERIVED-FROM-HANDLER-CONTEXT (Q2 + round-46)
     - INV-REPL-TRANSPORT-STORAGE-CONTRACT-ONLY (Q1+Q3 + T4d-1
       strengthening)
     - INV-REPL-CATCHUP-FROMLSN-IS-REPLICA-FLUSHED-PLUS-1
       (T4d-3 G-1 §5)
     - INV-REPL-CATCHUP-FROMLSN-FROM-ENGINE-STATE-NOT-PROBE
       (T4d-3 G-1 §5)
     - CARRY-T4D-LANE-CONTEXT-001 (named carry, T4e/post-G5)
   INV-REPL-CATCHUP-WITHIN-RETENTION-001 status updated:
   T4c downgrade → T4d-2+T4d-3 un-pin path.

Process rule inscribed (architect 2026-04-25):
G-1 sign docs land in seaweedfs FIRST; sw implementation in
seaweed_block references the committed G-1 hash via
"Refs G-1 sign: <sha>" per mini-plan §7.1 procedural binding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 22:26:21 -07:00
pingqiu
7b9b353293 T4d kickoff: v0.3 architect-ratified
Architect sign by pingqiu 2026-04-25:
"T4d v0.2 scope accepted as one batch series; Option C for appliedLSN
source; BlockStore walHead hotfix may land pre-T4d; substrate defense-
in-depth included where practical; 4-batch order approved; T4d-3 G-1
required; T4d-2 no G-1; T-end three-sign at T4d close if T4d remains
final T4 batch."

All open architect-decision points (§2 scope, §2.5 Option/hotfix/
substrate, §3 batch shape, §4 acceptance bar) resolved. §6 open
issues all closed. §8 inscribes the verbatim ratification record.

Sw clearances effective immediately:
  - Land BlockStore walHead one-liner as pre-T4d hotfix (single PR with
    un-skipped regression test)
  - Produce T4d mini-plan (4-batch shape per §3)
  - Produce T4d-3 G-1 V2 read on wal_shipper.go runCatchUpTo
  - T4d-2 spec is round-43/44 architect text (no G-1 needed)

T-end horizon: §8C.1 T-end three-sign lands at T4d close IF T4d
remains final T4 batch (per architect's criterion #10 wording tweak).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 15:46:13 -07:00
pingqiu
c910464a9a T4c batch close artifact: closure report (architect-accepted)
QA single-sign artifact for T4c batch close per §8C.2; architect
acceptance of §B scope deltas signed 2026-04-25 by pingqiu.

Scope deltas accepted:
  - T4c closes as mid-T4 batch under §8C.2, not T4 T-end
  - L2/L3 mini-plan bar narrowed to muscle-level L2 + component evidence
  - L3 m01 first-light deferred to T4d / G5 final close
  - Substring "WAL recycled" matching accepted as TEMPORARY, replacement
    bound to T4d (preferred) or G5 final sign (latest)
  - INV-REPL-CATCHUP-WITHIN-RETENTION-001 downgraded to T4d blocker
    (catch-up sender hardcodes ScanLBAs(1); replica's R+1 not threaded)
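
The R+1 gap named in the last scope delta can be sketched as follows. This is an illustration under stated assumptions: `walEntry`, `catchUpFrom`, and `scanFrom` are hypothetical names, and the in-memory slice stands in for the real WAL. The only behaviour taken from the commit history is that catch-up should begin at the replica's flushed LSN plus one instead of a hardcoded 1.

```go
package main

import "fmt"

// walEntry is a hypothetical, simplified WAL record.
type walEntry struct {
	LSN uint64
	LBA uint32
}

// catchUpFrom returns the first LSN a catch-up stream needs to send:
// everything up to and including replicaFlushed is already durable on the replica.
func catchUpFrom(replicaFlushed uint64) uint64 {
	return replicaFlushed + 1
}

// scanFrom returns the entries with LSN >= from. It assumes wal is ordered by LSN.
func scanFrom(wal []walEntry, from uint64) []walEntry {
	var out []walEntry
	for _, e := range wal {
		if e.LSN >= from {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	wal := []walEntry{{1, 10}, {2, 11}, {3, 10}, {4, 12}, {5, 11}}

	// Hardcoded start at 1 resends the whole log.
	fmt.Println("from 1:  ", len(scanFrom(wal, 1)), "entries")

	// Threading the replica's flushed LSN (R = 3) sends only the tail.
	fmt.Println("from R+1:", len(scanFrom(wal, catchUpFrom(3))), "entries")
}
```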

Doc-hygiene fixes per PM round-2 review (this commit):
  - Drop INV-REPL-CATCHUP-DONE-MARKER-EMITTED (non-existent: V2 marker
    collapsed into barrier-as-terminator per catchup_sender.go:48,187)
  - §B/#2 + #5 reword "green at HEAD" to acknowledge architect Windows
    cleanup-only repro failures (tracked as next-batch carry)
  - Active formal-INV count 8 -> 6

Forward-carries to T4d (BLOCKERS):
  - R+1 catch-up threading (StartCatchUp signature + adapter wire)
  - Full engine→adapter→executor recovery wiring
  - Structured RecoveryFailureKind replacing substring sentinel
  - LastSentMonotonic_AcrossRetries cross-call form scenario
  - Windows TempDir cleanup race investigation

Forward-carry to G5 final close:
  - m01 hardware first-light for replicated write path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 11:49:24 -07:00
pingqiu
6d8d088273 T4 L1 survey round 3: sw V2 verification of Q1-Q3 + §3.14 AllBlocks hazard + §3.a locked-pairs
Closes QA round-2 feedback loop. Three concerns resolved and one L2-blocker
hazard added.

## Q1-Q3 resolution (sw-verifiable per QA concern; V2 source check)

  Q1 scope completeness: VERIFIED complete. V2 grep shows sync_all_* are
  three test files only — `sync_all_adversarial_test.go`, `sync_all_bug_test.go`,
  `sync_all_protocol_test.go`. Zero production files for sync_all / split_brain /
  takeover / arbiter. These are cross-entity invariants, not distinct types.
  10-entity set stands.

  Q2 ReplicaReceiver scope: VERIFIED per-volume, not per-assignment.
  `v.replRecv = recv` at `blockvol.go:1515` is the only write site; zero
  `replRecv = nil` assignments in codebase. Receiver is constructed-once per
  BlockVol instance. L1 §2.3 wording stands.

  Q3 RebuildSession/Bitmap durability: VERIFIED no sidecar. Grep
  `rebuild_bitmap.go` + `rebuild_session.go` for `os.Open / os.Create /
  WriteFile / ReadFile / persist / sidecar` → empty. Recovery is WAL
  hydration only (`hydrateBitmapFromRecoveredWAL` at `rebuild_session.go:102`).
  L1 §2.10 invariant #3 CORRECTED — earlier draft incorrectly called out a
  "sidecar schema" that doesn't exist.

## QA concern #3 resolution: §3.14 new hazard

  `AllBlocks()` semantic divergence: V3 `walstore.go:565` and
  `smartwal/store.go:367` both call `s.Read(lba)` which reads through the
  dirty map (includes unflushed WAL bytes). V2 `rebuild.go:handleExtentStream`
  uses `readBlockFromExtent` which BYPASSES dirty map (flushed-only).

  Concrete impact: V3 base stream can contain bytes the primary hasn't fsynced.
  If primary crashes pre-fsync, replica's copy is "newer" than primary's
  recovered state. Epoch fencing + WAL-wins bitmap still prevent corruption,
  but the invariant chain is "eventually consistent via epoch churn" instead
  of V2's "base stream never contains unflushed bytes". Different contracts,
  same end state.

  Two L2 options proposed: (a) keep AllBlocks semantics + document non-claim
  in §2.7 bridge; (b) add `LogicalStorage.AllBlocksFlushed()` preserving V2
  invariant. H5 architect-line decision affects which path is safer.
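
A toy model may make the §3.14 divergence concrete. The `store` type and its two maps are hypothetical simplifications, not the walstore or smartwal code; `Read` stands for the read-through path that V3 `AllBlocks()` uses, and `ReadFlushed` stands for a flushed-only read in the spirit of V2 `readBlockFromExtent` or the proposed `AllBlocksFlushed()`.

```go
package main

import "fmt"

// store is a toy model: extent holds flushed blocks, dirty holds writes
// that are in the WAL but have not yet been flushed to the extent.
type store struct {
	extent map[uint32][]byte
	dirty  map[uint32][]byte
}

// Read returns the newest bytes for lba, consulting the dirty map first.
func (s *store) Read(lba uint32) []byte {
	if b, ok := s.dirty[lba]; ok {
		return b
	}
	return s.extent[lba]
}

// ReadFlushed bypasses the dirty map and returns only flushed bytes.
func (s *store) ReadFlushed(lba uint32) []byte {
	return s.extent[lba]
}

func main() {
	s := &store{
		extent: map[uint32][]byte{7: []byte("old")},
		dirty:  map[uint32][]byte{7: []byte("new")},
	}
	// A base stream built on Read can carry bytes the primary has not flushed.
	fmt.Printf("read-through: %s\n", s.Read(7))
	// A base stream built on ReadFlushed never does.
	fmt.Printf("flushed-only: %s\n", s.ReadFlushed(7))
}
```

If the primary crashed at this point, a replica fed from the read-through path would hold "new" for LBA 7 while the primary could recover to "old", which is the divergence that option (b) is meant to rule out.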

## QA concern #2 resolution: §3.a locked-pairs section (new)

  Documents pre-coupled L2 decisions driven by V3 existing shape:
    H6 Option C → H7b locks automatically (Provider intercepts at LogicalStorage
      layer; Backend.Write stays host-facing, doesn't carry LSN)
    §3.14 + H5 → AllBlocks safety rationale depends on which H5 shape wins

  Per BUG-005 documentation-discipline lesson: record coupled pairs explicitly
  rather than leaving them as "implied". Saves L2 cycles and gives future
  readers visible intent for why Backend.Write excludes LSN.

## QA concern #1 deferred to L2

  Volumes map extension (single-map with role discrimination vs two separate
  primaryHandles + replicaHandles maps) is a legitimate L2 design concern.
  L1 appropriately hedges with "likely needs to grow" (§3.11 Option C); L2
  picks shape. QA's BUG-005-adjacent concern (role-discriminated handle
  callers forgetting to check role) is the right frame for the L2 decision.
  No L1 edit needed; flagged for L2 attention.

## §4 open questions status

  Q1-Q3 ✓ resolved
  Q4 DistGroupCommit residence → effectively answered by §3.11 C
  Q5 protocol-frame wire-compat stance → still architect-line (pairs with H5)

  Blocking L2 start now: only H5 + Q5, both architect-line. QA to draft
  one-page arch memo per round-2 offer.

## Change log

  §5 feedback-round log gains round-3 entry
  §6 change log gains full round-3 detail with V2 line citations

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 23:42:39 -07:00
pingqiu
de2767cd3c T4 L1 survey round 2: sw pre-scan output + H6 narrowing + H7 + §3.13
Fulfills §5 step 1 pre-scan gate with concrete V3 source evidence and
propagates findings to §3 observations.

## Pre-scan output (§5 step 1)

5-row checklist table against V3 source:
  - SetReplicaAddrs / ReplicaAddrs / replica fields: NONE in
    `core/frontend/` or `core/storage/` (grep-clean)
  - Sync/Write remote-ack semantics: NONE; all returns pure-local
    (`types.go:50-78`, `logical_storage.go:57-70`)
  - LogicalStorage.Write LSN: pure-local; distributed durability
    is explicit non-contract (`logical_storage.go:45`)
  - Ship/Replicate/Quorum/Barrier/Durability identifiers: none in
    code; comments only
  - Replication stubs: NONE; but three fully-implemented replica-
    side primitives on LogicalStorage: ApplyEntry / AdvanceFrontier
    / AllBlocks, with impls in walstore.go + smartwal/store.go

Net: frontend/durable layer clean; LogicalStorage layer already
committed to a specific replica-side shape. L2 must ALIGN with
that shape, not override it.

## §3 updates driven by pre-scan

§3.11 (H6) narrowed with V3 existing-shape evidence:
  - Option A unlikely (no supporting V3 shape; StorageBackend is
    replication-unaware)
  - Option B effectively ruled out (ApplyEntry/AdvanceFrontier sit
    BELOW Backend on LogicalStorage; a ReplicatedBackend wrapper
    would either reach past its wrapped contents or duplicate the
    storage-layer contract)
  - Option C leading (matches V3 existing Provider-owns-lifecycle
    shape; generalizes BUG-005 lesson)

§3.12 (H7) new — LSN surface-up gap:
  - `Backend.Write → (int, error)` discards LSN
  - `LogicalStorage.Write → (lsn, error)` returns it
  - Primary-side shipper needs per-write LSN
  - H7a (extend Backend sig) unlikely; H7b (Provider intercepts
    at LogicalStorage layer) natural fit with H6 Option C; H7c
    (side-channel NextLSN+Boundaries delta) rejected as racy
  - H7 resolution coupled to H6 — joint L2 LOCK
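
The H7b shape (intercepting at the LogicalStorage layer so the host-facing Backend signature stays unchanged) can be sketched as below. The return shapes `(int, error)` and `(lsn, error)` are taken from the text above; the parameter lists, `memStorage`, `shippingStorage`, and the `onWrite` callback are assumptions made for illustration and do not describe the actual V3 types.

```go
package main

import "fmt"

// LogicalStorage mirrors the return shape quoted above: Write yields an LSN.
// The parameter list is an assumption.
type LogicalStorage interface {
	Write(lba uint32, data []byte) (lsn uint64, err error)
}

// memStorage is a trivial in-memory implementation that hands out LSNs 1, 2, 3, ...
type memStorage struct{ next uint64 }

func (m *memStorage) Write(lba uint32, data []byte) (uint64, error) {
	m.next++
	return m.next, nil
}

// shippingStorage is a hypothetical interceptor: it wraps a LogicalStorage and
// reports every successful write, with its LSN, to onWrite (the shipper hook).
type shippingStorage struct {
	inner   LogicalStorage
	onWrite func(lsn uint64, lba uint32, data []byte)
}

func (s *shippingStorage) Write(lba uint32, data []byte) (uint64, error) {
	lsn, err := s.inner.Write(lba, data)
	if err != nil {
		return 0, err
	}
	s.onWrite(lsn, lba, data)
	return lsn, nil
}

// backend is the host-facing layer; its Write keeps the (int, error) shape
// and never exposes the LSN.
type backend struct{ storage LogicalStorage }

func (b *backend) Write(lba uint32, data []byte) (int, error) {
	if _, err := b.storage.Write(lba, data); err != nil {
		return 0, err
	}
	return len(data), nil
}

func main() {
	var shipped []uint64
	b := &backend{storage: &shippingStorage{
		inner:   &memStorage{},
		onWrite: func(lsn uint64, _ uint32, _ []byte) { shipped = append(shipped, lsn) },
	}}
	n, err := b.Write(0, []byte("abc"))
	fmt.Println("bytes written:", n, "err:", err)
	fmt.Println("LSNs seen by shipper hook:", shipped)
}
```

The design point is that the shipper observes per-write LSNs without any change to what the host-facing caller sees.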

§3.13 new — replica-side bypasses Backend entirely:
  - Structural finding already locked by V3 shape, NOT an L2 choice
  - Primary-side traffic: session → handler → Backend → LogicalStorage
  - Replica-side traffic: network frame → ReplicaReceiver →
    LogicalStorage.ApplyEntry (bypasses Backend)
  - Explicit so L2 builds on it rather than fighting

## Feedback-round log + change log

§5 feedback log gains round 2 entry; §6 change log gains full
round-2 detail with line-level citations.

No sign event; this is iterative informal feedback per §8C.8
lightweight cadence. L1 stays DRAFT until bundled T4 T-start
three-sign with L2 + L3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 23:21:35 -07:00
pingqiu
b4adf76aa0 T4 L1 survey: drop invented L1-sign gate; keep only T-start three-sign
§8C.8 specifies exactly one three-sign per T-boundary — at T-start,
covering the bundled L1+L2+L3 package. I had proposed a separate
L1 three-sign in §5 that isn't in the rule. Architect correctly
pushed back.

§5 rewritten as lightweight cadence:
1. sw V3 pre-scan (~5 min, inline reply, prerequisite to L2 not a
   sign gate) — same grep checklist retained, same BUG-005 rationale
2. sw + QA iterate on L2 (catalogue §3 filled) informally
3. sw + QA draft L3 (T4 port plan sketch)
4. T4 T-start three-sign on bundled L1+L2+L3 (only governance event)

Informal feedback-round log hook added so architect/PM inputs are
tracked without per-round sign ceremony.

Change log updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 22:59:01 -07:00
pingqiu
d2588f5b77 T4 L1 survey: architect feedback round 1 (F1/F2/F3 + H5/H6 + sw pre-scan gate)
All 5 feedback items accepted; no subsetting.

F1 — RebuildBitmap split into standalone §2.10 entity (10 total,
was 9). Rationale: bitmap has independent on-disk schema (~84 LOC
rebuild_bitmap.go) + independent conflict-resolution invariant
(WAL-wins-over-base). Collapsing into §2.6 RebuildSession at L1
would lose granularity for L2 — bitmap and session may have
different PRESERVE/REBUILD verdicts. §2.6 now explicitly
cross-references §2.10.

F2 — ShipperGroup §2.2 gains "External deps" row: N = RF comes
from master assignment via BlockVol.SetReplicaAddrs, not from
shipper-internal decision. Cross-entity contract (master assignment
↔ ShipperGroup size ↔ ReplicaReceiver expected-connection-count
↔ DistGroupCommit quorum arithmetic) made explicit so L2 split
can't silently drift sync_quorum.

F3 — ReplicaBarrier §2.4 scope rewritten from "per-request
ephemeral" to "per-request call-closure, BUT queue-state shared
per-volume via cond.Wait". Prior wording risked 1:1-porting into
a V3 stateless function, losing multi-watcher cond.Broadcast
semantics.
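
The distinction F3 draws (a per-request call that blocks on per-volume shared state) maps onto Go's `sync.Cond`. The following is a minimal sketch, assuming a single "applied LSN" watermark as the shared state; `barrierQueue`, `WaitFor`, and `Advance` are hypothetical names, not the V2 ReplicaBarrier API.

```go
package main

import (
	"fmt"
	"sync"
)

// barrierQueue holds the per-volume shared state that every barrier call waits on.
type barrierQueue struct {
	mu      sync.Mutex
	cond    *sync.Cond
	applied uint64 // highest LSN applied so far
}

func newBarrierQueue() *barrierQueue {
	q := &barrierQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// WaitFor blocks the calling request until applied >= target, then returns
// the applied LSN it observed. Each call is ephemeral; the state is shared.
func (q *barrierQueue) WaitFor(target uint64) uint64 {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.applied < target {
		q.cond.Wait()
	}
	return q.applied
}

// Advance raises the applied LSN and wakes every waiter so each can
// re-check its own target. Signal would wake only one of them.
func (q *barrierQueue) Advance(lsn uint64) {
	q.mu.Lock()
	if lsn > q.applied {
		q.applied = lsn
	}
	q.mu.Unlock()
	q.cond.Broadcast()
}

func main() {
	q := newBarrierQueue()
	var wg sync.WaitGroup
	for _, target := range []uint64{3, 5} {
		wg.Add(1)
		go func(t uint64) {
			defer wg.Done()
			fmt.Printf("barrier %d released at applied=%d\n", t, q.WaitFor(t))
		}(target)
	}
	q.Advance(5) // one advance releases both waiters
	wg.Wait()
}
```

A stateless per-request function has nowhere to keep `applied` or the condition variable, which is why a 1:1 port into one would lose the multi-watcher wake-up.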

H5 added to §3 observations — cross-node epoch consistency
observation window for sync_quorum. V2 implicit via ack frame
carrying epoch; V3 L2 must pick "ack frame carries epoch" vs
"primary maintains per-replica epoch cache" before locking.
Different choices → different failover + rebuild-trigger semantics.

H6 added to §3 observations — write-path vs replication-path
concurrency residence. Three L2 options documented:
  A) StorageBackend.Write triggers shipper (violates T3a layering)
  B) ReplicatedBackend wraps StorageBackend+shipper (clean; +1 entity)
  C) Replication inside DurableProvider (extends BUG-005 lesson)
L1 makes no recommendation; L2 LOCKS the decision before L3.

§5 restructured into 5 gated steps; step 1 is a mandatory sw V3
pre-scan of core/frontend/durable/ + core/frontend/*.go for
pre-baked replication-adjacent assumptions. Rationale cited per
architect: BUG-005 latent drift came from implicit V3 convention;
L1 must surface any such convention before L2 verdicts lock.
Concrete grep checklist included so the scan is 5 min, not open-ended.

§2 header + §4 open question #1 updated for 10-entity count.
Scope block references rebuild_bitmap.go explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 22:41:18 -07:00
pingqiu
38cb25702f T4 kick-off: L1 V2 replication entity survey (pre-sketch review)
First artifact for T4 (Gate G5 Replicated Write Path) under the §8C.8
top-down port discipline added post-T3 retrospective. This is L1 only
— raw V2 entity enumeration with scope / lifecycle / concurrency /
cross-session / authority / protocol / invariants attributes. No V3
bridge verdicts proposed yet; L2 follows only after L1 review closes.

9 entities identified across replication surface:
- WALShipper (per-replica fan-out)
- ShipperGroup (per-volume aggregator)
- ReplicaReceiver (per-volume replica-side listener)
- ReplicaBarrier FSM (per-barrier ephemeral)
- DistGroupCommit closure (per-write-op, mode-aware)
- RebuildSession (volatile, non-crash-durable)
- RebuildServer (per-primary listener)
- RebuildTransportServer / Client (per-session base lane)

9 L1-level observations flagged as L2 hazards: epoch fencing
pervasiveness, contiguous-LSN cross-cutting invariant, two-lane
rebuild bitmap integration, mode-dependent durability, volatility
of rebuild session (vs BUG-005 Provider cache lesson), explicit
reconnect protocol, three-phase barrier, ioMu.RLock nesting,
shipper-group double watermark.

5 open questions raised for sw / architect / PM review before L1
sign: scope completeness (sync_all_reconnect, split-brain arbiter?),
scope accuracy (ReplicaReceiver per-volume vs per-assignment),
RebuildSession volatility confirmation, DistGroupCommit V3 residence
opinion, protocol-frame wire-compat stance.

Status: DRAFT — open for sw review; L2 + L3 work blocked on L1 sign
per §8C.8 discipline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 22:30:57 -07:00
pingqiu
88dcd49d67 sw-block/design: T3 mini-plan + audit + sketch docs (pre-close docs)
Predecessor docs for the T3 batch, retained here for audit trail.
The closure report (`v3-phase-15-t3-closure-report.md`), contract
bridge catalogue, and BUG-005/006 artifacts already landed in
commits `4127e5136` + `6e196885e`; this commit fills the docs
those closure artifacts reference back to.

Landed:
  v3-phase-15-t3-port-plan-sketch.md    T3 umbrella sketch (rev-2.1, three-signed)
  v3-phase-15-t3-port-audit.md          T3.0 port audit + Addendum A (QA-signed)
  v3-phase-15-t3a-mini-plan.md          T3a scope + sign-off (CLOSED 0e1595c)
  v3-phase-15-t3b-mini-plan.md          T3b scope + sign-off (CLOSED 72d0d40)
  v3-phase-15-t3c-mini-plan.md          T3c scope + sign-off (CLOSED 829c6a9)

Total 1,346 lines of doc; no code impact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 22:19:03 -07:00
pingqiu
6e196885e4 T3 closure: reconcile §8C.3 trigger narrative + C5 pin strength
Two document-truthfulness mismatches flagged by architect review:

§B Governance transition (closure report): previously claimed "no
§8C.3 triggers fired during T3"; §H Phase 3's own BUG-005 description
matches trigger #1 (unknown-unknown architectural bug, V2/V3 shape-
level mismatch). Corrected to say trigger #1 fired once (BUG-005)
and was handled per §8C.3, with log entry, architect+PM notification,
catalogue §2.3 drift-event row, and porting-discipline citation.

C5-NVME-SESSION-STATE-CLEANUP-ON-CLOSE (contract bridge catalogue
§2.2.14): previously stated "PASSES today" / "pinned explicitly".
Closure §H Phase 4 correctly narrows landed tests to "smoke +
goroutine-leak guard" with Target.ctrls/AER/KATO-stored-ms
introspection not exercised. Catalogue row now matches that
strength: "pin strength today: smoke + goroutine-leak guard only;
full state-release introspection NOT exercised; queued as
follow-up".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 22:14:08 -07:00
pingqiu
4127e5136c T3 closure: finalize sign-ready state + BUG-006/007 + catalogue retrofill
Closure report (v3-phase-15-t3-closure-report.md):
- §E rewritten as FINAL A–F m01 verification table with per-impl
  status + G4 pass criterion = smartwal (production default) full
  A–F green; walstore non-default fallback with Matrix D failure
  tracked via BUG-007
- §E sign table: QA re-sign 2026-04-22 with evidence basis
  (seaweed_block@313dd52 + BUG-005 fix 42b045a); prior RETRACTED
  row superseded
- §D INV-DURABLE-001: conditional "Path B pending" wording
  removed; scoped to smartwal; canonical row name stands
- §B non-claims: stale _TBD_ perf wording replaced with
  first-light scope statement; new non-claim "G4 pass =
  smartwal only; walstore deferred via BUG-007" added
- §G.3 finalized: FINAL resolution with smartwal A–F PASS;
  walstore deferred
- §H Phase 2 narrative updated to match final matrix outcome
  (Matrix E smartwal-only; walstore E skipped pending BUG-007)
- §H Phase 4: T3-DEF-6 test wording downgraded from
  "pins cleanup contract" to "smoke + goroutine-leak guard"
  per PM feedback (no test-only introspection of Target.ctrls/
  AER/KATO internals; follow-up deferred)
- §H Phase 5: BUG-007 filed and scoped; non-blocking basis
  spelled out

Contract Bridge Catalogue (v2-v3-contract-bridge-catalogue.md):
- §2.2.14 C1-NVME-SESSION-KATO reclassified PRESERVE-partial
  → VIOLATED with BUG-006 anchor + m01 Matrix D evidence
- §2.2.14 C5-NVME-SESSION-STATE-CLEANUP-ON-CLOSE added
  (T3-DEF-6 retrofit, pinned by QA L1 addendum)
- §2.3 drift-event audit table expanded with BUG-006, BUG-007,
  T3-DEF-5, T3-DEF-6

BUG-006 (006_nvme_kato_timer_not_enforced.md):
- Unified contract ID to catalogue name
  C1-NVME-SESSION-KATO-STORED-NOT-ENFORCED (was drifting as
  C3-NVME-KATO-ENFORCEMENT, PM Low catch)
- §7 reframed as "existing row reclassified VIOLATED"
  rather than "add new row"

BUG-007 (007_walstore_umount_remount_data_loss.md): filed as
pre-existing walstore-specific durability bug surfaced by
Matrix D re-verify; explicitly non-blocking for T3 since
smartwal is production default.

BUG-005 (005_backend_close_cross_session.md): committed for
HEAD-reproducibility (referenced by closure §H Phase 3).

Inventory (bugs/inventory/nvme-test-coverage-deferred.md):
T3-DEF-5/6/7 struck through with per-row resolution pointers;
zero open T3-scope inventory rows remaining.

Evidence artifacts committed in seaweed_block@313dd52
(scripts/iterate-m01-nvme.sh Matrix F robustness +
t3_qa_session_cleanup_addendum_test.go).

Awaiting architect + PM three-sign.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 22:08:30 -07:00
pingqiu
953fdb7564 doc: P14 S8 final bounded close — evidence matrix + P15 handoff
Adds the six S8 closure deliverables consolidating S4-S7 evidence,
classifying V2 scenarios, and mapping residual product gaps onto
canonical P15 tracks (per v3-phase-15-product-plan.md §4).

New docs:
- v3-phase-14-s8-assignment.md — S8 execution contract.
- v3-phase-14-s8-final-bounded-close.md — bounded P14 target,
  accepted topology, reject conditions.
- v3-phase-14-s8-evidence-matrix.md — 16 claims × {L0, L1, L2, L3,
  Status, Residual}. 15 PROVEN, 1 PARTIAL (Claim 15 fence
  quantitative bound, P14 internal follow-up). Rounds 2-3 architect
  corrections: Claim 10 / 12 L2 narrowed; Claim 6 refresh gap closed
  by the new L1 test (see companion commit in seaweed_block).
- v3-phase-14-s8-v2-scenario-classification.md — every V2 scenario
  mapped to RUNNABLE-P14 / BLOCKED-FRONTEND / BLOCKED-OPS /
  BLOCKED-HA / BLOCKED-PERF / PORT-MECHANISM; scenario YAMLs kept
  as L3 shape, not executed evidence.
- v3-phase-14-s8-p15-handoff.md — 11 rows (10 canonical P15 tracks
  + 1 P14 internal follow-up anchored to Claim 15 PARTIAL); §4
  integrity check split by row class.
- v3-phase-14-s8-closure.md — final P14 closure statement matching
  the close doc §10 wording; explicit non-goals; all 9 P15 tracks
  named with canonical numbering.

No claim of CSI / frontend / migration / security / performance /
production readiness. Every product gap is handed off with a
concrete first-proof gate.

Companion: seaweed_block commit adds the IntentRefreshEndpoint L1
route test that closes Claim 6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 01:44:11 -07:00
pingqiu
247d9f6fa6 doc: V3 observability — structured logging, tracing, metrics, debug zip, alerts
Covers 6 areas based on CockroachDB/Ceph/etcd/Longhorn research:

1. Structured logging: zap + JSON + channel model (OPS/STORAGE/REPL/ISCSI/AUDIT/HEALTH)
2. Distributed tracing: OpenTelemetry spans across write/rebuild/failover paths
3. Metrics: 40+ must-have Prometheus metrics with histogram latency buckets
4. Debug tools: debug zip (logs+pprof+state), log merge, live tail
5. Audit logging: every admin mutation with actor/target/operation/result
6. Alert design: 3 tiers (page/ticket/log), anti-patterns to avoid

Identifies existing gaps: no I/O latency histogram, no rebuild duration
metric, no audit trail, no structured logging, no distributed tracing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 02:32:30 -07:00
pingqiu
9437bd0b95 doc: V3 development process — branch strategy, CI/CD, review, release
Covers full engineering process based on SeaweedFS upstream audit:
- Branch strategy: feature/sw-block with checkpoint branches for perf baselines
- Commit conventions: type: description format
- Code review checklist with anti-pattern checks
- Testing standards: 5 levels, 1600+ tests, 4 hardware acceptance scenarios
- CI/CD pipeline: unit→component→hardware gates
- Release process: checklist, artifacts, versioning
- Issue/PR templates with anti-pattern classification
- Agent collaboration model (architect/sw/tester/manager roles)
- Code quality: golangci-lint config, race detection
- Upstream contribution path for SeaweedFS merger

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 01:58:19 -07:00
pingqiu
f11a5829d7 doc: update operations design — add existing V1 UI foundation, code map
Added section 8: existing UI/admin infrastructure from V1:
- iSCSI admin HTTP server (admin.go: /status, /assign, /rebuild, /snapshot)
- Grafana dashboard JSON (block-overview.json, already built)
- Master UI HTML (master.html, add Block Volumes tab)
- Volume server UI HTML (volume.html, add Block section)
- Prometheus metrics (already integrated)

Added section 10: existing vs new code map showing most backend
exists — work is wiring to user-facing interfaces.

Updated Phase 1 to include Master UI tab (+200 lines HTML/JS).
Updated Phase 5 with two options (lightweight extend vs full SPA).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 01:44:47 -07:00
pingqiu
3ba622d9e0 doc: V3 operations design — user-friendly setup, shell commands, REST API
Covers three personas (developer/operator/platform engineer) with:
- One-command setup: weed server -block (10 seconds to first volume)
- Shell commands: block.list, block.status, block.health, block.create, etc.
- REST API: /block/volumes CRUD, /block/health
- Observability: Prometheus metrics, alerting rules, Grafana dashboard
- Actionable error messages (every error tells you what to do next)
- Dry-run by default for all destructive operations

Competitive comparison: 10s setup vs Ceph 30min, 13.5x write IOPS,
single binary for object + block storage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 00:36:32 -07:00
pingqiu
2bc8dfcdde doc: update testrunner roadmap — add runs.db text index for result tracking
P1 feature updated: replace generic "structured results" with concrete
runs.db design (newline-delimited JSON, one line per run). Leverages
existing RunBundle system (manifest.json, result.json already exist).

New CLI commands: list, trend, gc, reindex, diff.
Regression detection via stddev comparison against rolling baseline.
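
A minimal Go sketch of the stddev comparison described above, assuming
a higher-is-better metric such as IOPS. The function names, the sample
values, and the threshold multiplier k are hypothetical; none of them
come from the testrunner code.

```go
package main

import (
	"fmt"
	"math"
)

// meanStddev returns the mean and population standard deviation of xs.
func meanStddev(xs []float64) (mean, stddev float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	var sumSq float64
	for _, x := range xs {
		d := x - mean
		sumSq += d * d
	}
	return mean, math.Sqrt(sumSq / float64(len(xs)))
}

// isRegression reports whether latest is more than k standard deviations
// below the mean of the rolling baseline (hypothetical rule, higher is better).
func isRegression(baseline []float64, latest, k float64) bool {
	mean, sd := meanStddev(baseline)
	return latest < mean-k*sd
}

func main() {
	baseline := []float64{100, 102, 98, 101, 99} // illustrative values only
	fmt.Println(isRegression(baseline, 90, 2))   // well below the baseline
	fmt.Println(isRegression(baseline, 99, 2))   // inside the noise band
}
```

The rolling baseline would be whatever window of prior runs the index
retains; the window size is left open here.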

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 22:00:35 -07:00
pingqiu
676539d3b9 doc: testrunner roadmap + dm-stripe scenario (42/42 PASS, 1.87x write IOPS)
testrunner-roadmap.md: P0-P3 feature plan for multi-version comparison,
Ceph adapter, result tracking, cluster templates, debug mode.

dm-stripe-two-server.yaml: proven Linux dm-stripe across 2 sw-block
volumes on 2 servers. Results: single=42K IOPS → striped=79K IOPS (1.87x).
Data integrity verified via md5. Zero sw-block code changes needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 21:57:03 -07:00
pingqiu
25ede892b4 doc: external failure taxonomy — 20 real bugs from Ceph/DRBD/Mayastor/Longhorn
Catalogs production failures organized by semantic class:
- Membership/liveness misjudgment (4 cases)
- Recovery decision error (3 cases)
- Completion/durability illusion (4 cases)
- Ordering/race conditions (4 cases)
- Background work corrupts semantics (3 cases)

Each entry maps to V2 exposure and V3 prevention rules.
Includes "Would V2 Have This Bug?" self-audit checklist.

Sources: Ceph tracker, DRBD changelogs, Longhorn/Mayastor GitHub issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 01:21:08 -07:00
pingqiu
8ecc506452 V2 stabilization: 144/144 hardware actions PASS + design docs + SmartWAL prototype
Hardware scenarios (all PASS on m01/m02, 25Gbps RoCE):
- I-V3 auto-failover: 43/43 (create→write→kill→promote→verify IO)
- I-R8 rebuild-rejoin: 58/58 (failover→write→restart→1GB rebuild in 2s→verify data)
- Fast rejoin: 43/43 (kill replica→3s→restart→recovery→data verified)

Performance: V2 RF=1 = 46,666 IOPS vs V1.5 RF=1 = 47,233 IOPS (-1.2%, noise)

New test scenarios:
- v2-rebuild-rejoin.yaml: full failover→rebuild→second failover→data integrity
- v2-fast-rejoin-catchup.yaml: replica kill→fast restart→recovery
- v2-rebuild-failure-retry.yaml: kill during rebuild→restart→data verified
- rf1-perf-compare.yaml: RF=1 perf baseline for V1.5 vs V2 comparison

Design documents:
- protocol-anti-patterns.md: 7 anti-patterns with cases from SeaweedFS/Ceph/DRBD
- smartwal-design-memo.md: extent-first write algorithm research (BlueStore/ZFS/DRBD)
- smartwal-prototype-spec.md: prototype spec with 16/16 crash tests PASS
- v3-clean-recovery-draft.md: V3 semantic cleanup principles
- v2-integration-matrix.md: 25-row integration coverage map
- v2-acceptance-evidence.md: gap analysis for remaining work

SmartWAL prototype (16/16 tests PASS):
- smartwal.go, smartwal_record.go, smartwal_recovery.go: core implementation
- smartwal_test.go: 9 single-node crash tests
- smartwal_repl_test.go: 7 two-node replication crash tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 00:18:20 -07:00
pingqiu
5279bd3945 fix: tolerate missing sender in remote rebuild ack observation
The architect's refactor correctly routes remote rebuild acks through
the shared observation path (pins, watchdog, deferred terminal success).
But requireReplicaSession fails with "sender not found" when the
orchestrator registry is reconciled between installSession and the
first ack arrival.

Fix: when emitTerminal=false (remote path), treat sender-not-found as
non-fatal. The remote coordinator already validated the session — the
sender lookup is for local observation only. Pins and watchdog handle
nil snap gracefully (updateRebuildProgressPin line 296 already checks
snap != nil).

This preserves the architect's design (shared observation + deferred
terminal success) while tolerating the sender registry race that only
affects the remote rebuild path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 23:00:06 -07:00
pingqiu
008ea03ef5 fix: suppress SessionFailed after successful remote rebuild completion
After RemoteRebuildIO.TransferFullBase returns, the OnAck callback has
already emitted SessionCompleted and stored achievedLSN. But
RebuildExecutor.Execute() continues calling sender methods which fail
("sender stopped") because the completion event already cleaned up the
sender. This error propagated to ExecutePendingRebuild which emitted a
spurious SessionFailed, knocking the mode back to degraded.

Fix: check remoteRebuildAchieved before emitting SessionFailed. If the
rebuild already completed via the ack path, log the post-completion
error but suppress the SessionFailed event.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 17:32:37 -07:00
pingqiu
55862f1ab1 fix: rebuild base-only completion + protocol handshake + direct ack events
Three fixes for the remote rebuild path:

1. Base-only completion: when BaseLSN == TargetLSN, the base image covers
   all data — no WAL tail needed. MarkBaseComplete now auto-satisfies the
   WAL condition and calls TryComplete so the session completes immediately
   after the base transfer finishes.

2. Base lane protocol handshake: runBaseLaneClient now sends MsgRebuildReq
   {Type: RebuildSessionBase} before reading. The RebuildServer requires
   this handshake to dispatch to ServeBaseBlocks. Without it, the server
   received raw frames it couldn't understand.

3. Direct ack events: OnAck emits engine events directly (SessionCompleted,
   SessionProgressObserved, SessionFailed) instead of routing through
   ObserveReplicaRebuildSessionAck which requires the sender in the
   orchestrator registry. The remote coordinator owns the session — no
   registry lookup needed.

Also adds diagnostic logging on both sides:
- Replica: logs parsed RebuildAddr and base lane client start
- Primary: logs sender state after installSession

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 16:18:44 -07:00
pingqiu
0faf93a152 diag: add sender registry verification after installSession
The accepted ack from the replica is rejected with "sender not found"
even though installSession succeeds. Add diagnostic logging to verify
the sender exists in the orchestrator registry immediately after
installSession, and dump all registry IDs if not found.

This will reveal whether the sender is removed between installSession
and the ack arrival (by syncProtocolExecutionState, evaluateActivationGate,
or another ProcessAssignment that reconciles with a stale replica list).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 15:53:33 -07:00
pingqiu
943000ae8e fix: RebuildSourceDecision returns FullBase when CommittedLSN=0
When CommittedLSN=0 (sync_all mode, replica degraded), snapshot-tail
rebuild was chosen because IsRecoverable(checkpoint, 0) is vacuously
true (0 <= HeadLSN always). But snapshot-tail requires a valid committed
endpoint for tail-replay. Without it, ExecuteRebuildPlan calls
TransferSnapshot which RemoteRebuildIO doesn't support → immediate fail.

Fix: if CommittedLSN=0, force RebuildFullBase. This is the correct
source when the primary has data but no replica has confirmed durability.
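
The decision rule can be sketched in Go as follows. This is an
illustration only: the RebuildSource type, the decideRebuildSource
helper, and the tailRecoverable parameter are hypothetical stand-ins
for the real RebuildSourceDecision inputs.

```go
package main

import "fmt"

// RebuildSource mirrors the two sources named in the commit message.
// The type and its values are hypothetical.
type RebuildSource int

const (
	RebuildSnapshotTail RebuildSource = iota
	RebuildFullBase
)

func (s RebuildSource) String() string {
	if s == RebuildFullBase {
		return "full_base"
	}
	return "snapshot_tail"
}

// decideRebuildSource forces a full base when there is no committed
// endpoint, even if the recoverability check passed vacuously.
func decideRebuildSource(committedLSN uint64, tailRecoverable bool) RebuildSource {
	if committedLSN == 0 {
		return RebuildFullBase
	}
	if tailRecoverable {
		return RebuildSnapshotTail
	}
	return RebuildFullBase
}

func main() {
	fmt.Println(decideRebuildSource(0, true))   // no committed endpoint
	fmt.Println(decideRebuildSource(120, true)) // valid endpoint, tail retained
}
```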

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 15:46:00 -07:00
pingqiu
a79cba0be7 fix: PlanRebuild targetLSN=0 when replica is degraded (CommittedLSN fallback)
Root cause: StatusSnapshot().CommittedLSN reports 0 in sync_all mode when
the replica shipper has no flushed progress (NeedsRebuild state). This is
correct for lineage-safe committed boundary, but PlanRebuild uses
CommittedLSN as RebuildTargetLSN. With target=0, shouldStartSessionCommand
rejects the StartRebuildCommand, and the rebuild IO never executes.

Fix: PlanRebuild falls back to HeadLSN when CommittedLSN is 0. The
primary's WAL head IS the data boundary the replica needs to reach.
The fact that no replica has confirmed durability is exactly why we're
rebuilding.
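
A small Go sketch of the fallback, assuming LSNs are uint64 and that
the session gate rejects a zero target as described above. Both helper
names are hypothetical; the real PlanRebuild and
shouldStartSessionCommand take richer inputs.

```go
package main

import "fmt"

// rebuildTargetLSN picks the rebuild target: the committed boundary when
// one exists, otherwise the primary's WAL head.
func rebuildTargetLSN(committedLSN, headLSN uint64) uint64 {
	if committedLSN == 0 {
		return headLSN
	}
	return committedLSN
}

// shouldStartSession models the gate that rejected target=0 before the fix.
func shouldStartSession(targetLSN uint64) bool {
	return targetLSN > 0
}

func main() {
	target := rebuildTargetLSN(0, 4096) // degraded replica, nothing committed
	fmt.Println(target, shouldStartSession(target))
}
```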

Also adds command type logging to coreApplyAndLog so the tester can verify
which commands are actually emitted and which are silently dropped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 15:35:31 -07:00
pingqiu
bc767eb9d2 fix: rebuild correctness — single completion, fail-closed acks, diagnostic logging
Three correctness fixes for the remote rebuild path:

1. No double completion: for remote rebuilds, OnRebuildCompleted skips
   RebuildCommitted since ObserveReplicaRebuildSessionAck already emitted
   SessionCompleted on the accepted ack. One rebuild = one completion event.

2. SessionAckFailed with rejected observation: if OnAck rejects the failed
   ack (stale session), don't use the sentinel errRebuildAckFailed. Return
   a regular error so ExecutePendingRebuild emits the fallback SessionFailed.
   No path leaves the engine session hanging.

3. Diagnostic logging in ExecutePendingRebuild: log the replicaID and
   targetLSN on both nil-return (TakeRebuild mismatch) and successful take
   paths. Also log the pending store in runRebuild with replicaID, targetLSN,
   and IO type. This makes the TakeRebuild seam diagnosable on hardware
   without rebuilding the engine package.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 15:25:26 -07:00
pingqiu
df69c83f41 feat: RemoteRebuildIO — primary coordinates rebuild, replica installs
Replace the broken primary-local rebuild executor with RemoteRebuildIO,
a server-side engine.RebuildIO implementation that coordinates remotely.
The primary sends SessionControlV2 (with RebuildAddr trailer) to the
replica's control channel; the replica starts a local rebuild session
and auto-connects to the primary's rebuild server for the base lane.

Single rebuild route: ALL core-present rebuilds use RemoteRebuildIO.
The entire command chain is preserved unchanged:
  PlanRebuild → pending → RebuildStarted → StartRebuildCommand
  → ExecutePendingRebuild → RemoteRebuildIO.TransferFullBase

Key changes:
- SessionControlMsg v2: optional RebuildAddr trailer (len-based decode)
- ReplicaRebuilding shipper state: session-gated live WAL lane
- RemoteRebuildIO: dials replica ctrl, sends session control, reads acks
- Ack forwarding through ObserveReplicaRebuildSessionAck (pins/watchdog)
- Completion proof from replica's achievedLSN, not primary's local vol
- Transport failures emit SessionFailed (no double-emit on ack failures)
- Progress ack rejection fails closed (stale session = abort)
- Replica auto-starts base lane client on v2 session control

State transitions:
  NeedsRebuild → [accepted ack] → Rebuilding → [completed] → InSync
  Rebuilding → [failed/EOF] → NeedsRebuild → [next probe] → retry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 15:04:22 -07:00
pingqiu
befe049b09 refactor: unified primary onboarding + rebuild execution wiring
Replace three bypass mechanisms with one unified model. When the
probe returns ProbeRebuildRequired, the host now starts the rebuild
through the existing recovery manager (StartRecoveryTask), which
resolves the rebuild address, plans the rebuild, and executes via
the v2bridge executor — the same path as master-driven RoleRebuilding.

New per-replica probe API:
- WALShipper.ProbeReconnect() → ReplicaProbeResult with typed outcome
- ShipperGroup.ProbeReconnectAll() → []ReplicaProbeResult
- BlockVol.ProbeReplicaOnboarding() / IsClosed()

Host-side wiring:
- handleReplicaProbeResult routes outcomes:
  KeepUp → ShipperConnectedObserved
  CatchUp → ShipperConnectedObserved (recovery manager handles session)
  Rebuild → NeedsRebuildObserved + StartRecoveryTask (executes rebuild)
  TemporaryFailure → no-op
- lastAssignmentsForPath reconstructs assignment for recovery manager
- onPrimaryRosterChanged probes all replicas (defined, called from watchdog)
- observePrimaryShipperConnectivity uses probe API
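
The outcome routing above can be sketched as a single switch. This is
a hedged illustration: the ProbeOutcome type and routeProbeOutcome
helper are hypothetical, and the returned strings only name the events
and tasks from the list rather than calling the real engine.

```go
package main

import "fmt"

// ProbeOutcome is a hypothetical enum for the typed probe result.
type ProbeOutcome int

const (
	ProbeKeepUp ProbeOutcome = iota
	ProbeCatchUp
	ProbeRebuildRequired
	ProbeTemporaryFailure
)

// routeProbeOutcome returns the names of the actions the host would take
// for one replica's probe result.
func routeProbeOutcome(o ProbeOutcome) []string {
	switch o {
	case ProbeKeepUp, ProbeCatchUp:
		// Catch-up sessions are left to the recovery manager.
		return []string{"ShipperConnectedObserved"}
	case ProbeRebuildRequired:
		return []string{"NeedsRebuildObserved", "StartRecoveryTask"}
	default:
		// Temporary failure: no-op, the next probe retries.
		return nil
	}
}

func main() {
	fmt.Println(routeProbeOutcome(ProbeRebuildRequired))
	fmt.Println(len(routeProbeOutcome(ProbeTemporaryFailure)))
}
```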

Probe fires via syncProtocolExecutionState immediately after assignment
processing — same heartbeat cycle, no timer delay.

Deleted: startDirectRebuild, resolveCtrlAddrForShipper,
TryReconnect/TryReconnectAll/TryReconnectShippers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 02:33:07 -07:00
pingqiu
d6bc7516f1 feat: primary-direct rebuild — start rebuild session on NeedsRebuild
When proactive reconnect finds that the WAL gap exceeds the retained range:
1. Emit per-replica NeedsRebuildObserved to engine (with ReplicaID)
2. Resolve replica ctrl address from shipper group
3. Start direct rebuild session: send sessionControl(start_rebuild)
   to replica's ctrl channel, stream base blocks, emit RebuildStarted

The primary drives the rebuild directly without master round-trip.
The master sees the result via heartbeat projection (needs_rebuild →
rebuilding → healthy). This matches V2 authority: master owns identity,
primary owns data-control recovery.

Added WALShipper.CtrlAddr() getter for address resolution.
resolveCtrlAddrForShipper maps data address to ctrl address via
shipper group (works for RF=2 and RF=3+).

startDirectRebuild runs in a goroutine: dials replica ctrl, sends
start_rebuild, waits for accepted ack, serves base blocks, emits
RebuildStarted to engine on success.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 01:04:00 -07:00
pingqiu
8b469cf70b fix: revert Bridge 2, fix Bridge 1 with per-replica identity
Revert detectAndEnqueueRebuildFromHeartbeat (Bridge 2) — master
should not drive rebuild assignments from heartbeat. The primary
owns data-control recovery per the V2 authority split.

Fix Bridge 1: NeedsRebuildObserved now carries per-replica identity.
resolveReplicaIDForShipper maps shipper DataAddr to ReplicaID via
the shipper group (works for RF=2 and RF=3+). The engine receives
the specific replica that needs rebuild, not a volume-level broadcast.

Primary-direct rebuild: the primary detects which replica needs
rebuild and will drive the session directly. The master learns about
it via subsequent heartbeat projection (needs_rebuild → rebuilding →
healthy). No master round-trip needed for the rebuild decision.

Added WALShipper.DataAddr() getter for address resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 00:55:50 -07:00
pingqiu
f90ccf5bfd fix: proactive shipper reconnect on rejoin (Bug 5)
After rejoin, the shipper is configured but no I/O triggers Ship(),
so the shipper stays Disconnected and the core stays at
awaiting_shipper_connected indefinitely.

Fix: observePrimaryShipperConnectivity now calls TryReconnectShippers
when ShipperConfigured=true but ShipperConnected=false. This triggers
the full reconnect protocol (dial + handshake + bounded catch-up)
proactively, bringing the replica current without waiting for I/O.

Option B approach: uses the same reconnect path as Barrier() — not a
fake write or bare dial probe. CatchUpTo(headLSN) replays any retained
WAL entries, bringing the replica fully current.

New methods:
- WALShipper.TryReconnect(): full reconnect without foreground I/O
- ShipperGroup.TryReconnectAll(): probes all disconnected shippers
- BlockVol.TryReconnectShippers(): volume-level entry point

Also fix pre-existing test expectation: engine now emits
start_recovery_task on primary assignment with replicas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 00:14:46 -07:00
pingqiu
53246d2780 fix: recover TOCTOU + WAL pressure edge case tests
Fix recover path TOCTOU: re-Lookup after AddReplica so the primary
refresh assignment includes the freshly added replica addresses.
Previously, Lookup (copy) was called before AddReplica modified the
registry, so entry.Replicas was empty → primary got replicas=0 →
shipper never configured.

Add 2 WAL pressure edge case tests:
- ShipperCatchUpOrEscalate: 64KB WAL, 200 writes, aggressive flusher.
  Proves no hang/deadlock/corruption. Shipper either keeps up or
  correctly escalates to NeedsRebuild.
- RebuildWithPinWhilePrimaryWrites: rebuild session active while
  primary writes 7600+ blocks in 2s. Proves primary never freezes
  — rebuild pin is on replica only, primary WAL recycles freely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 23:56:26 -07:00
pingqiu
e0116fc631 fix: three hardware blockers — WAL retention + registry race + shutdown beat
All 43 actions pass on m01/m02 hardware. Auto-failover PASS.
dd_write: 30s → 123ms. Post-failover write: 33,621 IOPS.

1. WAL retention: remove keepup retention floor (MinShippedLSN).
   WAL cannot be pinned during sustained async writes — any pin
   strategy either fills WAL (blocking writes) or over-recycles
   (breaking catch-up). Flusher recycles freely. Future LBA map
   will provide catch-up without WAL retention.
   MinShippedLSN on ShipperGroup retained as diagnostic surface.

2. Registry stale-cleanup race: add RegisteredAt grace period.
   Race: master registers volume → next VS heartbeat arrives before
   VS discovers the volume → stale cleanup deletes the entry →
   failover finds 0 entries. Fix: skip stale cleanup for entries
   registered within 30s (> 2 heartbeat intervals).
   2 new tests: grace protects new entry, old entry still cleaned.

3. Shutdown heartbeat: VS disconnect heartbeat no longer claims
   block inventory authority. Previously, the shutdown beat's
   empty inventory triggered stale cleanup, deleting the entry
   before failover could use it.
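
The grace-period check from item 2 can be sketched in Go as below.
Only the 30s window comes from the commit message; the registryEntry
type, the helper names, and the reference timestamp are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// registrationGrace is the 30s window from the commit message.
const registrationGrace = 30 * time.Second

// refNow is a fixed reference time so the example is deterministic.
var refNow = time.Date(2026, 4, 8, 12, 0, 0, 0, time.UTC)

// registryEntry is a hypothetical, trimmed-down registry record.
type registryEntry struct {
	Path         string
	RegisteredAt time.Time
}

// newEntry builds an entry registered secondsAgo before now.
func newEntry(path string, secondsAgo int, now time.Time) registryEntry {
	return registryEntry{Path: path, RegisteredAt: now.Add(-time.Duration(secondsAgo) * time.Second)}
}

// shouldRemoveStale reports whether stale cleanup may delete the entry:
// the volume server did not report it AND the grace period has elapsed.
func shouldRemoveStale(e registryEntry, reportedByVS bool, now time.Time) bool {
	if reportedByVS {
		return false
	}
	return now.Sub(e.RegisteredAt) >= registrationGrace
}

func main() {
	fresh := newEntry("/vol/a", 5, refNow)
	old := newEntry("/vol/b", 120, refNow)
	fmt.Println(shouldRemoveStale(fresh, false, refNow)) // protected by grace
	fmt.Println(shouldRemoveStale(old, false, refNow))   // still cleaned
}
```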

Scenario fix: recovery-baseline-failover.yaml now kills the
correct node (discovered primary, not hardcoded), connects to
the correct new primary for post-failover verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 22:59:46 -07:00
pingqiu
39f1232fe2 feat: validation matrix closure — Rebuild Ready 12/12, Restore Ready 10/10
Close all Rebuild Ready and Restore Ready matrix gaps. V2 Ready at 10/14
(2 partial, 2 missing — honest assessment).

New tests (tester-written):
- R1: syncAck-driven trigger via protocol engine decision
- R3: stale replica restart beyond WAL → rebuild converges
- R5: connection drop mid-base → cancel → fresh rebuild converges
- R10: failover-rejoin with forced WAL recycling, strict rebuild assert
- R11: divergent replica full overwrite convergence
- R12: crash mid-rebuild → fresh session converges (not resume)
- S2: corrupt WAL entry + corrupt base block both rejected
- S5: snapshot-tail rebuild (base + WAL tail replay)
- S7: crash between base install and tail replay
- S8: snapshot under concurrent writes
- V5: rebuild complete without DurableLSN blocks publish_healthy
- V9: mixed replica health aggregate projection
- V14: negative fail-closed matrix (epoch, kind, stale)

Bug fix: StartRebuildSession now clears stale dirty map + resets WAL +
updates checkpoint AFTER safety check but BEFORE session.Start(). Fixes
stale extent data shadowing rebuild base blocks on reopened replicas.

Cleanup: remove 14 obsolete design docs (migration batches, old WAL-v2
specs, simulator goals) — all superseded by current protocol docs.

34 component tests + 8 protocol engine tests + server tests all pass.
1GB CRC validation passes in 19s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 16:31:55 -07:00
pingqiu
59a36013d4 feat: rebuild hardening A1-A5 + session-controlled execution path
A1 Engine kind-routing fix:
  SessionProgressObserved/Completed/Failed now respect active session
  Kind. Rebuild progress no longer leaks into catch-up aggregate.
  sessionKindMismatch guard + observeRebuildProgress helper.
  2 regression tests lock kind isolation.

A2 Retention pin:
  Rebuild session ack drives progress-based WAL retention floor.
  Pin installed at base_lsn on accepted, advances with wal_applied_lsn,
  released on completed/failed/cancelled. rebuildProgressPinFloor
  returns min across all active replicas.
  Retention pin test: 100 blocks fill WAL, 5 flusher cycles with
  20 pinned rebuild entries — all verified correct.

A3 Progress ack emission:
  Automatic sessionAck(running/base_complete/completed/failed) emitted
  from rebuild session lifecycle transitions. sessionAckLocked builds
  ack under session lock. emitRebuildSessionAck callback wired through
  SetOnRebuildSessionAck on BlockVol.
  ObserveReplicaRebuildSessionAck maps acks to core engine events.
  WireLocalReplicaRebuildSessionAcks bridges local callback to server.
  5 server tests proving ack→core, pin advance, pin cleanup.

A4 Deadline/timeout:
  rebuildAckWatch watchdog: armed on accepted/running/base_complete,
  refreshed on each ack, cleared on completed/failed. Timeout
  cancels local session + clears pin + fail-closes.
  2 tests: timeout→fail-close, progress→refresh.

A5 Session-controlled execution path:
  v2bridge.Executor.TransferFullBase now uses session-controlled loop:
  beginControlledFullBase → real sessionControl over TCP →
  transferExtentToSession via RebuildTransportClient →
  PrepareFullBaseRebuild → TryCompleteRebuildSession.
  ReplicaReceiver control channel handles MsgSessionControl alongside
  MsgBarrierReq. Session acks written back on same TCP connection.
  RebuildSessionBase request type separates new per-block stream from
  legacy raw extent stream. Full-base cleanup deferred until success.
  Deadlock fix: ApplyBaseBlock releases session lock before ioMu.
  Hydration skip for full-base sessions.

23 rebuild component tests (all pass):
  11 kernel correctness, 8 transport/runtime, 3 scenario-scale,
  including 1GB primary-initiated with CRC validation.

29 files changed, ~2500 insertions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 14:39:11 -07:00
pingqiu
342f8baa69 feat: rebuild transport wiring — session control + base block streaming
Wire protocol messages and transport handlers for the rebuild MVP:

Protocol messages (rebuild_transport.go):
- SessionControlMsg: epoch, sessionID, command, baseLSN, targetLSN,
  snapshotID. Encode/Decode with fixed 37-byte wire format.
- SessionAckMsg: epoch, sessionID, phase, walAppliedLSN, baseComplete,
  achievedLSN. Encode/Decode with fixed 34-byte wire format.
- MsgSessionControl (0x10) and MsgSessionAck (0x11) on control channel.
- SendSessionControl/SendSessionAck convenience functions.
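
A hedged Go sketch of what a fixed 37-byte SessionControlMsg
encode/decode could look like. The field widths (8+8+1+8+8+4 bytes),
the field order, and big-endian byte order are assumptions chosen
because they sum to 37; the commit message states only the field names
and the total size, so the real layout in rebuild_transport.go may
differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// sessionControlSize is the fixed wire size stated in the commit message.
const sessionControlSize = 37

// SessionControlMsg uses assumed field widths; see the note above.
type SessionControlMsg struct {
	Epoch      uint64
	SessionID  uint64
	Command    uint8
	BaseLSN    uint64
	TargetLSN  uint64
	SnapshotID uint32
}

// Encode writes the message into a fixed 37-byte buffer (assumed layout).
func (m SessionControlMsg) Encode() []byte {
	b := make([]byte, sessionControlSize)
	binary.BigEndian.PutUint64(b[0:8], m.Epoch)
	binary.BigEndian.PutUint64(b[8:16], m.SessionID)
	b[16] = m.Command
	binary.BigEndian.PutUint64(b[17:25], m.BaseLSN)
	binary.BigEndian.PutUint64(b[25:33], m.TargetLSN)
	binary.BigEndian.PutUint32(b[33:37], m.SnapshotID)
	return b
}

// DecodeSessionControl rejects any buffer that is not exactly 37 bytes.
func DecodeSessionControl(b []byte) (SessionControlMsg, error) {
	if len(b) != sessionControlSize {
		return SessionControlMsg{}, errors.New("session control: bad length")
	}
	return SessionControlMsg{
		Epoch:      binary.BigEndian.Uint64(b[0:8]),
		SessionID:  binary.BigEndian.Uint64(b[8:16]),
		Command:    b[16],
		BaseLSN:    binary.BigEndian.Uint64(b[17:25]),
		TargetLSN:  binary.BigEndian.Uint64(b[25:33]),
		SnapshotID: binary.BigEndian.Uint32(b[33:37]),
	}, nil
}

func main() {
	in := SessionControlMsg{Epoch: 3, SessionID: 42, Command: 1, BaseLSN: 100, TargetLSN: 250, SnapshotID: 7}
	out, err := DecodeSessionControl(in.Encode())
	fmt.Println(len(in.Encode()), out == in, err)
}
```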

Transport handlers:
- RebuildTransportServer: primary-side, streams all extent blocks as
  MsgRebuildExtent frames (reusing existing rebuild message type),
  ends with MsgRebuildDone.
- RebuildTransportClient: replica-side, receives base blocks and
  routes through vol.ApplyRebuildSessionBaseBlock, marks base
  complete on MsgRebuildDone.

4 transport tests:
- SessionControl wire round-trip
- SessionAck wire round-trip
- BaseBlockStreaming: full TCP loop, 1024 blocks streamed and verified
- SessionControlOverTCP: real TCP send/receive with accepted ack

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 14:57:43 -07:00
pingqiu
49845dd509 feat: server-layer rebuild session skeleton — host routing for MVP
Add BlockService replica-side rebuild routing API that bridges
transport/host layer to BlockVol session surface:

  StartReplicaRebuildSession(path, config)
  ApplyReplicaRebuildWALEntry(path, sessionID, entry)
  ApplyReplicaRebuildBaseBlock(path, sessionID, lba, data)
  MarkReplicaRebuildBaseComplete(path, sessionID, totalBlocks)
  TryCompleteReplicaRebuildSession(path, sessionID)
  CancelReplicaRebuildSession(path, sessionID, reason)
  ReplicaRebuildSession(path) → snapshot

Each method does one thing: validate → WithVolume → delegate to BlockVol.
No wire decoding, no protocol decisions, no state invention. Transport
wiring (sessionControl/walData/sessionData handlers) is the next step.

2 focused tests: skeleton routes correctly, stale session ID rejected.

Updated v2-rebuild-mvp-session-protocol.md with server skeleton section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 14:53:32 -07:00
pingqiu
d2d57851b0 feat: rebuild MVP — dual-lane session with bitmap protection
Rebuild session protocol implementation for v2-rebuild-mvp-session-protocol.md.

New files:
- rebuild_bitmap.go: RebuildBitmap — session-scoped dense bitset for
  WAL-applied LBA tracking. MarkApplied on local WAL write (not receive).
  ShouldApplyBase returns false for WAL-covered LBAs (WAL always wins).

- rebuild_session.go: RebuildSession — replica-side two-line rebuild.
  WAL lane (ApplyWALEntry) + base lane (ApplyBaseBlock) with bitmap
  conflict resolution. TryComplete requires BOTH base_complete AND
  wal_applied_lsn >= target_lsn. Volume-level control surface:
  StartRebuildSession, ApplyRebuildSessionWALEntry/BaseBlock,
  MarkRebuildSessionBaseComplete, TryCompleteRebuildSession,
  CancelRebuildSession, ActiveRebuildSession.

- rebuild_mvp_test.go: 4 correctness tests — base+WAL converge,
  WAL-applied never overwritten by base, bitmap set on applied not
  received, control surface start/supersede/complete.

- rebuild_transport_test.go: 2 transport-level tests — two-line with
  real WAL shipping, live writes during base copy with bitmap conflict.
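
The bitmap rule and the completion rule can be sketched together in
Go. This is a minimal in-memory illustration, not the code in
rebuild_bitmap.go or rebuild_session.go: the session type, string
block payloads, and constructor names are hypothetical, and locking is
omitted.

```go
package main

import "fmt"

// RebuildBitmap is a dense bitset with one bit per LBA.
type RebuildBitmap struct {
	words []uint64
	size  uint64
}

func NewRebuildBitmap(totalBlocks uint64) *RebuildBitmap {
	return &RebuildBitmap{words: make([]uint64, (totalBlocks+63)/64), size: totalBlocks}
}

// MarkApplied records that a WAL entry for lba was written locally.
func (b *RebuildBitmap) MarkApplied(lba uint64) {
	if lba < b.size {
		b.words[lba/64] |= uint64(1) << (lba % 64)
	}
}

// ShouldApplyBase is false for WAL-covered (or out-of-range) LBAs.
func (b *RebuildBitmap) ShouldApplyBase(lba uint64) bool {
	if lba >= b.size {
		return false
	}
	return b.words[lba/64]&(uint64(1)<<(lba%64)) == 0
}

// session is a hypothetical replica-side rebuild session.
type session struct {
	bitmap        *RebuildBitmap
	blocks        []string
	baseComplete  bool
	walAppliedLSN uint64
	targetLSN     uint64
}

func newSession(totalBlocks, targetLSN uint64) *session {
	return &session{bitmap: NewRebuildBitmap(totalBlocks), blocks: make([]string, totalBlocks), targetLSN: targetLSN}
}

// ApplyWALEntry writes the block, then marks the bit (applied, not received).
func (s *session) ApplyWALEntry(lsn, lba uint64, data string) {
	if lba >= uint64(len(s.blocks)) {
		return
	}
	s.blocks[lba] = data
	s.bitmap.MarkApplied(lba)
	if lsn > s.walAppliedLSN {
		s.walAppliedLSN = lsn
	}
}

// ApplyBaseBlock skips LBAs already covered by the WAL lane (WAL wins).
func (s *session) ApplyBaseBlock(lba uint64, data string) bool {
	if !s.bitmap.ShouldApplyBase(lba) {
		return false
	}
	s.blocks[lba] = data
	return true
}

func (s *session) MarkBaseComplete() { s.baseComplete = true }

// TryComplete requires BOTH base_complete AND wal_applied_lsn >= target_lsn.
func (s *session) TryComplete() bool {
	return s.baseComplete && s.walAppliedLSN >= s.targetLSN
}

func main() {
	s := newSession(128, 10)
	s.ApplyWALEntry(10, 7, "wal")
	fmt.Println(s.ApplyBaseBlock(7, "base"), s.blocks[7]) // base skipped, WAL data kept
	fmt.Println(s.TryComplete())                          // base lane not done yet
	s.MarkBaseComplete()
	fmt.Println(s.TryComplete())
}
```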

Design docs:
- v2-rebuild-mvp-session-protocol.md: MVP spec with message set, apply
  rules, completion/failure/crash rules, test matrix
- v2-sync-recovery-protocol.md: full protocol context (keepup/catchup/
  rebuild unified design, primary decision logic, two-line model)
- v2-session-protocol-shape.md: protocol shape overview

Protocol engine (reference, not production):
- sw-block/protocol/: 7-event engine with ~300 lines, 13 tests

6 rebuild tests pass, all existing component tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 14:30:34 -07:00
pingqiu
55013e103b feat: Phase 20 Stage 0+1 closure — bootstrap + sustained workload on hardware
Stage 0 (bootstrap closure): PASS on m01/M02
  - create RF=2 sync_all → 10s shipper wait → 4k fsync → publish_healthy
  - Proves: BarrierAccepted observation, ShipperConnected, DurableLSN > 0

Stage 1 (sustained workload): 32/33 actions PASS
  - bootstrap → fio 10s randwrite → dd_write 1M×2 fsync → data checksum
  - Remaining: auto-failover promotion (separate issue)

Key fixes:
  - BarrierAccepted callback: SyncCache success → core DurableLSN update
  - BarrierRejected callback: barrier failures surface to core with reason
  - Shipper state callback for new volumes (not just startup volumes)
  - CatchUpTo ctrl conn reset: prevents stale control channel after recovery
  - CP13-6 max-bytes budget suspended: it uses replicaFlushedLSN, which can't
    advance without a barrier, so it kills healthy shippers during async
    writes. Will be replaced by the v2 negotiated sync/recovery protocol.
  - Barrier diagnostic logging: start/fail/success with reason and LSN
  - Scenario restructured: Stage 0 (bootstrap-closure) + Stage 1 (failover)
  - dd_write: sync_mode param + real stderr capture
  - sw-test-runner suite command: deploy once, run N scenarios
  - WAL size plumbing: proto + API + handler (forward-compatible)

Known: 6 blockvol/server test failures from Barrier() path change
(bounded catch-up in Barrier). Need test updates to match new semantics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 19:55:12 -07:00
pingqiu
44103a1bd7 feat: Phase 20 acceptance fixes + sw-test-runner suite mode
Acceptance rows closed:
- WriteLBA/SyncCache contract: code comments document write-back vs
  durability fence semantics
- RF=2 stable identity: v2bridge always uses SetReplicaAddrs (preserves
  ServerID); blockcmd dispatcher also fixed to use setupPrimaryReplicationMulti;
  test asserts exact expected ReplicaID="vs-2" (not just non-empty)
- Tests treating WriteLBA as commit: replica_read_test rewritten with
  SyncCache as durability fence
- publish_healthy contract: 3 gate tests with hard assertions including
  gate 3 (PrimaryShipperConnected)
- SetReplicaAddr deprecation warning added
- WALShipper.ReplicaID() getter added for identity verification

Test runner enhancements:
- sw-test-runner suite command: build → deploy → run N scenarios in one
  invocation with --skip-deploy support
- Suite YAML definitions for T6 Stage 0 and Stage 1
- deploy action: kill stale processes, clean dirs, cross-compile, upload
- run-phase20-t6.ps1 PowerShell script (deprecated by suite command)

Engine/runtime fixes:
- Recovery executor nil-safety improvements
- Recovery bundle BuildRecoveryBundle defensive checks
- ShipperGroup MinReplicaFlushedLSNAll surface

Docs: acceptance checklist refined, test matrix updated, T6 runbook,
engine maintainer tutorial, design README updated.

26 files changed, ~1600 insertions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 11:30:54 -07:00
pingqiu
275c3ee1c7 docs: Phase 20 acceptance checklist — architect-refined signoff matrix
Tighten the acceptance matrix with explicit per-boundary rows, a signoff
reading split into hard blockers vs product hardening, and a clear
rule: architecture-complete ≠ product-complete.

6 hard blockers before T6/T7:
1. WriteLBA/SyncCache/sync_all contract closure
2. Fresh replica bounded catch-up before live tail
3. Timeout/retention-loss classification for catch-up
4. publish_healthy alignment with one protocol contract
5. RF=2 stable identity on all shipping paths
6. Test audit for incorrect WriteLBA==commit assumptions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:12:32 -07:00
pingqiu
58aa842802 docs: Phase 20 product acceptance checklist
7-area acceptance matrix mapping current state vs product requirements:
write/durability contract, fresh replica bootstrap, host observation
completeness, serving/publish alignment, snapshot/rebuild convergence,
adapter consistency, test contract alignment.

Each item is marked with: current state, required for product, blocks
T6/T7, best test level. Items are priority-ordered into
must-close-before-Stage-1, should-close-before-Stage-2, and
can-close-after-T6/T7.

Key diagnosis: architecture-complete, execution-incomplete. The engine
thinks like a product; the data plane still behaves partly like a
prototype. The gap is end-to-end contract closure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 00:05:22 -07:00