seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-07-20 15:02:27 +00:00

Author	SHA1	Message	Date
pingqiu	fbcfe89e24	G5-4 bring-up hand-off v0.2 — RESOLVED via local debug Root cause for "volume not ready" gate: missing --expected-slots-per-volume 2 flag on blockmaster. Default is 3; QA's 2-node topology had 2 slots; controller silently rejected observation snapshot (cmd/blockmaster/main.go:39). Fix verified locally on Windows (single-node, no m01/M02 needed): - Add --expected-slots-per-volume 2 to blockmaster command - Primary reaches Healthy=true with epoch=1 - assignment-received fires; durable storage opens; status endpoint serves {"Healthy":true} Lesson learned (process improvement): for V3-internal bring-up debug, try single-node local reproduction FIRST. The cluster bring-up gate is V3 logic, not network topology. Reproduces in seconds locally with full source-code access; m01/M02 only needed for cross-node-specific scenarios (real network conditions, iptables, multi-host wire). Secondary finding: replica r2 sees primary r1's assignment but records "supersede, not applying to adapter" because T1 HealthyPathExecutor only handles primary case. For G5-4 replica bring-up, sw needs to wire T4a-T4d ReplicationVolume + ReplicaPeer + ReplicaListener stack (not just --t1-readiness flag). This is the actual next gap for G5-4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 10:41:02 -07:00
pingqiu	e21c686939	G5-4 m01+M02 bring-up — sw answer: --expected-slots-per-volume flag Root cause: cmd/blockmaster/main.go hardcoded ExpectedSlotsPerVolume=3. QA's 2-slot topology silently failed validateVolumeTopology in the controller, so no assignments were minted, no master-log lines, and volumes timed out at durable open. Fix landed in seaweed_block@f5de7c5: --expected-slots-per-volume CLI flag, default 3, set 2 for the 2-node smoke. QA next: rebuild blockmaster, pass --expected-slots-per-volume 2 in §3.4 of the handoff command sequence; rest unchanged.	2026-04-26 10:37:20 -07:00
pingqiu	2d9c2be9f3	G5-4 m01+M02 cluster bring-up — hand-off to sw Records QA's cross-node smoke attempt 2026-04-26: infrastructure fully verified READY (m01+M02 reachability, SMB share for binary distribution, master cross-node listen, network OK), but cluster bring-up blocked at V3-internal gate. Symptom: blockvolume on both nodes connects to master but logs "durable open: frontend: volume not ready" — never reaches steady state, status endpoint never binds, master log shows no heartbeat or assignment-mint events. Hand-off contents: - §1 specific questions for sw (5 gaps to fill) - §2 infrastructure verified READY (no action needed) - §3 copy-pasteable commands sw can run/debug (build → topology → master → primary → replica → cleanup) - §4 QA's hypothesis on the gap (assignment-from-master flow) - §5 debug suggestions for sw (log levels, integration test references) - §6 G5-4 script skeleton current state - §7 QA's next steps once sw answers Working dirs reproducible: - Binaries: /mnt/smb/work/share/g5-binaries/{blockmaster,blockvolume} - Run state: /tmp/g5sm/ on both nodes - Logs: /tmp/g5sm/logs/{master,primary,replica}.log Blocks: G5-4 implementation work (script scenario bodies, hardware first-light scenarios). Does NOT block QA scenario authoring at component scope (Cluster framework already covers that). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 10:32:10 -07:00
pingqiu	ce78fea36f	G5 kickoff §7a: m01 + M02 infrastructure verification (QA pre-ratify) Per QA infra-check round 2026-04-26, surfaces real readiness gaps before architect ratifies G5-4 schedule: m01 (192.168.1.181 — primary node): ✅ 32-day uptime; sudo password-less; 16 cores; 19 GiB RAM ✅ 177 GiB free disk; Go 1.26.2 installed ✅ iptables / netns / multi-process tools all available ✅ T2 m01 NVMe script template available as pattern reference M02 (192.168.1.184 — replica node): ✅ Reachable from m01 (0.92ms); same kernel; 178 GiB free disk ❌ Go NOT installed — must scp binaries from m01 Implication for G5-4: Build binaries on m01, scp to M02. Same cross-node binary pattern T2 already uses for its iSCSI target deployment. G5-4 skeleton at seaweed_block/scripts/iterate-m01-replicated-write.sh implements this build-then-scp flow. No infrastructure blockers. Architecture ready as soon as G5 mini-plan ratifies scenario list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 09:38:57 -07:00
pingqiu	a792ed67e5	G5 kickoff PROPOSAL v0.1 (post-T4 close) QA-authored proposal opening G5 collective close planning. Inherits 5 forward-carries from T4d closure §I as G5 scope: - m01 hardware first-light for replicated write path - Multi-replica concurrent live + recovery scenarios - G5-DECISION-001 resolution (Path A persist vs Path B rebuild) - walstore flusher cadence verification + tuning policy - Minimal metrics/backpressure assessment 5-batch shape proposed: - G5-1 multi-replica scenarios (component) — QA + sw framework - G5-2 walstore cadence verification — sw + architect - G5-3 metrics/backpressure assessment — sw + architect - G5-4 m01 hardware L3 first-light — QA + sw - G5-5 G5-DECISION-001 resolution + closure report — architect + sw + QA QA recommendations: - G5-DECISION-001: Path B (rebuild from probe after restart) for MVP scope. T4d-4 part B already structurally enables (ReplicaState JSON-clean per TestG5Decision001_); production restarts rare; Path A's persistence work substantial. Backwards-compatible upgrade later if production usage proves Path B insufficient. - G5-5 timing at close (after G5-1/2/3/4 evidence informs decision) - §2.2 explicit non-claims to prevent G5 scope creep: CARRY-T4D-LANE-CONTEXT-001 → post-G5 hardening backlog * --durable-walsize CLI flag → post-G5 * Snapshot-based catch-up → post-G5 * Wire protocol versioning → post-G5 * Auth/encryption/mTLS → post-G5 Status: ⏸ DRAFT — awaiting architect ratification on §2 scope + §3 batch shape + §4 acceptance bar + §5 G5-DECISION-001 path. No G5 code work begins until ratified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:26:15 -07:00
pingqiu	75d18e676f	T4d batch close: catalogue invariant upgrades + checklist v0.3 Catalogue §3.3 — 12 T4d invariants flipped from ⏭ to ✓ PORTED with specific commit hashes; 4 round-47/48 invariants newly inscribed: Pre-existing flipped to ✓ PORTED: - INV-REPL-NO-PER-LBA-DATA-REGRESSION → bd2de99 + 01f4ab9 - INV-REPL-RECOVERY-STALE-ENTRY-SKIP-PER-LBA → bd2de99 - INV-REPL-RECOVERY-COVERAGE-ADVANCES-ON-SKIP → bd2de99 - INV-REPL-LIVE-LANE-STALE-FAILS-LOUD → bd2de99 - INV-REPL-RECOVERY-COVERAGE-RESTART-SAFE → bd2de99 - INV-REPL-LANE-DERIVED-FROM-HANDLER-CONTEXT → 01f4ab9 + 44c60dd (with named carry CARRY-T4D-LANE-CONTEXT-001 to post-G5) - INV-REPL-TRANSPORT-STORAGE-CONTRACT-ONLY → 44c60dd + 1edeb36 - INV-REPL-CATCHUP-FROMLSN-IS-REPLICA-FLUSHED-PLUS-1 → 44c60dd - INV-REPL-CATCHUP-FROMLSN-FROM-ENGINE-STATE-NOT-PROBE → 44c60dd Newly inscribed (round-47 + round-48 architect additions): - INV-REPL-CATCHUP-EXHAUSTION-ESCALATES-TO-REBUILD → 812d3fa + e642ae8 - INV-REPL-REBUILD-FAILURE-TERMINAL → 812d3fa - INV-REPL-FAILED-SESSION-KIND-DRIVES-ESCALATION (part C bug #1) → e642ae8 - INV-REPL-REBUILD-ESCALATION-STICKY-UNTIL-TERMINAL (part C bug #2) → e642ae8 Forward-carry checklist v0.3: - All per-batch focus rows resolved - m01 -race verified across all T4d batches including T2A NVMe race fix - Status transitions from "active gating" to "G5-baseline" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:24:23 -07:00
pingqiu	2ee12b2c14	T4d batch close artifact + mini-plan v0.5 (architect-accepted) Two artifacts landing together to close T4 batch series: 1. v3-phase-15-t4d-closure-report.md (NEW) QA single-sign artifact for T4d batch close per §8C.2; architect T-end three-sign per §8C.1 (T4d IS final T4 batch — confirmed at round-48 review). Round-48 + round-49 corrections incorporated: - Part C commit hash bound to e642ae8 throughout - CARRY-T4D-LANE-CONTEXT-001 bind point = post-G5 hardening backlog (not T4e — consistent with "T-end at this close") - §H Finding #1 reworded — walstore HAS background flusher (walstore.go:189-190); QA's earlier "caller-driven" was wrong - §H Finding #3 RESOLVED at a0be6d5 (T2A NVMe race fixed + m01 -race ×50 PASS) - 16 invariants pinned (added 2 named for part C bug fixes: INV-REPL-FAILED-SESSION-KIND-DRIVES-ESCALATION + INV-REPL-REBUILD-ESCALATION-STICKY-UNTIL-TERMINAL) - 22/22 packages green under -race on m01 (post-a0be6d5) 2. v3-phase-15-t4d-mini-plan.md (NEW — was uncommitted across v0.1 → v0.5 evolution) Final v0.5 incorporates: architect Path B fold; round-47 rebuild path engine-driven HARD GATE expansion; G5-DECISION-001 named decision record; 4-batch shape ratified; T4d-3 G-1 binding. Active forward-carries (post-G5 hardening backlog): - CARRY-T4D-LANE-CONTEXT-001 — replace TargetLSN==1 caller shim with true handler/session-context lane signal - G5-DECISION-001 — engine recovery state behavior across primary restart (Path A persist vs Path B rebuild-from-probe) G5 collective close items (NOT post-G5): - m01 hardware first-light for replicated write path - Multi-replica concurrent live + recovery scenarios - walstore flusher cadence verification + tuning policy - Minimal metrics/backpressure assessment - G5-DECISION-001 architect resolution Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:21:20 -07:00
pingqiu	80036404ce	T4d planning + G-1 doc landing (architect Path B + Issue 2(a) ratification) Lands four T4d planning artifacts together: 1. v3-phase-15-t4d-3-g1-v2-read.md (NEW) T4d-3 G-1 V2 read v0.2, QA-signed in conversation 2026-04-25. Per architect Issue 2(a) ratification: G-1 docs land first; implementation references the committed hash. Future T4d-3 commits should reference this commit's sha via: Refs G-1 sign: <this-commit-sha> 2. v3-phase-15-t4d-forward-carry-checklist.md (NEW) v0.2 — 19 active T4a/T4b/T4c invariants with risk grades and per-batch focus rows. T4d-3 close gate inscribed (CARRY-T4D-LANE-CONTEXT-001 option A or B); pre/with-T4d-3 doc fixes recorded. 3. v3-phase-15-t4d-qa-scenario-catalogue.md (NEW) v0.1 — 9 QA component-scope scenarios mirroring T4c QA Stage-1 discipline. 10 framework primitives surfaced for sw's batch PRs. 4. v2-v3-contract-bridge-catalogue.md (UPDATED) §3.3 inscriptions for T4d-locked invariants: - INV-REPL-NO-PER-LBA-DATA-REGRESSION (round-43) - INV-REPL-RECOVERY-STALE-ENTRY-SKIP-PER-LBA (round-43) - INV-REPL-RECOVERY-COVERAGE-ADVANCES-ON-SKIP (round-44) - INV-REPL-LIVE-LANE-STALE-FAILS-LOUD (round-44) - INV-REPL-RECOVERY-COVERAGE-RESTART-SAFE (Option C) - INV-REPL-LANE-DERIVED-FROM-HANDLER-CONTEXT (Q2 + round-46) - INV-REPL-TRANSPORT-STORAGE-CONTRACT-ONLY (Q1+Q3 + T4d-1 strengthening) - INV-REPL-CATCHUP-FROMLSN-IS-REPLICA-FLUSHED-PLUS-1 (T4d-3 G-1 §5) - INV-REPL-CATCHUP-FROMLSN-FROM-ENGINE-STATE-NOT-PROBE (T4d-3 G-1 §5) - CARRY-T4D-LANE-CONTEXT-001 (named carry, T4e/post-G5) INV-REPL-CATCHUP-WITHIN-RETENTION-001 status updated: T4c downgrade → T4d-2+T4d-3 un-pin path. Process rule inscribed (architect 2026-04-25): G-1 sign docs land in seaweedfs FIRST; sw implementation in seaweed_block references the committed G-1 hash via "Refs G-1 sign: <sha>" per mini-plan §7.1 procedural binding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 22:26:21 -07:00
pingqiu	7b9b353293	T4d kickoff: v0.3 architect-ratified Architect sign by pingqiu 2026-04-25: "T4d v0.2 scope accepted as one batch series; Option C for appliedLSN source; BlockStore walHead hotfix may land pre-T4d; substrate defense- in-depth included where practical; 4-batch order approved; T4d-3 G-1 required; T4d-2 no G-1; T-end three-sign at T4d close if T4d remains final T4 batch." All open architect-decision points (§2 scope, §2.5 Option/hotfix/ substrate, §3 batch shape, §4 acceptance bar) resolved. §6 open issues all closed. §8 inscribes the verbatim ratification record. Sw clearances effective immediately: - Land BlockStore walHead one-liner as pre-T4d hotfix (single PR with un-skipped regression test) - Produce T4d mini-plan (4-batch shape per §3) - Produce T4d-3 G-1 V2 read on wal_shipper.go runCatchUpTo - T4d-2 spec is round-43/44 architect text (no G-1 needed) T-end horizon: §8C.1 T-end three-sign lands at T4d close IF T4d remains final T4 batch (per architect's criterion #10 wording tweak). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 15:46:13 -07:00
pingqiu	c910464a9a	T4c batch close artifact: closure report (architect-accepted) QA single-sign artifact for T4c batch close per §8C.2; architect acceptance of §B scope deltas signed 2026-04-25 by pingqiu. Scope deltas accepted: - T4c closes as mid-T4 batch under §8C.2, not T4 T-end - L2/L3 mini-plan bar narrowed to muscle-level L2 + component evidence - L3 m01 first-light deferred to T4d / G5 final close - Substring "WAL recycled" matching accepted as TEMPORARY, replacement bound to T4d (preferred) or G5 final sign (latest) - INV-REPL-CATCHUP-WITHIN-RETENTION-001 downgraded to T4d blocker (catch-up sender hardcodes ScanLBAs(1); replica's R+1 not threaded) Doc-hygiene fixes per PM round-2 review (this commit): - Drop INV-REPL-CATCHUP-DONE-MARKER-EMITTED (non-existent: V2 marker collapsed into barrier-as-terminator per catchup_sender.go:48,187) - §B/#2 + #5 reword "green at HEAD" to acknowledge architect Windows cleanup-only repro failures (tracked as next-batch carry) - Active formal-INV count 8 -> 6 Forward-carries to T4d (BLOCKERS): - R+1 catch-up threading (StartCatchUp signature + adapter wire) - Full engine→adapter→executor recovery wiring - Structured RecoveryFailureKind replacing substring sentinel - LastSentMonotonic_AcrossRetries cross-call form scenario - Windows TempDir cleanup race investigation Forward-carry to G5 final close: - m01 hardware first-light for replicated write path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 11:49:24 -07:00
pingqiu	6d8d088273	T4 L1 survey round 3: sw V2 verification of Q1-Q3 + §3.14 AllBlocks hazard + §3.a locked-pairs Closes QA round-2 feedback loop. Three concerns resolved and one L2-blocker hazard added. ## Q1-Q3 resolution (sw-verifiable per QA concern; V2 source check) Q1 scope completeness: VERIFIED complete. V2 grep shows sync_all_* are three test files only — `sync_all_adversarial_test.go`, `sync_all_bug_test.go`, `sync_all_protocol_test.go`. Zero production files for sync_all / split_brain / takeover / arbiter. These are cross-entity invariants, not distinct types. 10-entity set stands. Q2 ReplicaReceiver scope: VERIFIED per-volume, not per-assignment. `v.replRecv = recv` at `blockvol.go:1515` is the only write site; zero `replRecv = nil` assignments in codebase. Receiver is constructed-once per BlockVol instance. L1 §2.3 wording stands. Q3 RebuildSession/Bitmap durability: VERIFIED no sidecar. Grep `rebuild_bitmap.go` + `rebuild_session.go` for `os.Open / os.Create / WriteFile / ReadFile / persist / sidecar` → empty. Recovery is WAL hydration only (`hydrateBitmapFromRecoveredWAL` at `rebuild_session.go:102`). L1 §2.10 invariant #3 CORRECTED — earlier draft incorrectly called out a "sidecar schema" that doesn't exist. ## QA concern #3 resolution: §3.14 new hazard `AllBlocks()` semantic divergence: V3 `walstore.go:565` and `smartwal/store.go:367` both call `s.Read(lba)` which reads through the dirty map (includes unflushed WAL bytes). V2 `rebuild.go:handleExtentStream` uses `readBlockFromExtent` which BYPASSES dirty map (flushed-only). Concrete impact: V3 base stream can contain bytes the primary hasn't fsynced. If primary crashes pre-fsync, replica's copy is "newer" than primary's recovered state. Epoch fencing + WAL-wins bitmap still prevent corruption, but the invariant chain is "eventually consistent via epoch churn" instead of V2's "base stream never contains unflushed bytes". Different contracts, same end state. Two L2 options proposed: (a) keep AllBlocks semantics + document non-claim in §2.7 bridge; (b) add `LogicalStorage.AllBlocksFlushed()` preserving V2 invariant. H5 architect-line decision affects which path is safer. ## QA concern #2 resolution: §3.a locked-pairs section (new) Documents pre-coupled L2 decisions driven by V3 existing shape: H6 Option C → H7b locks automatically (Provider intercepts at LogicalStorage layer; Backend.Write stays host-facing, doesn't carry LSN) §3.14 + H5 → AllBlocks safety rationale depends on which H5 shape wins Per BUG-005 documentation-discipline lesson: record coupled pairs explicitly rather than leaving them as "implied". Saves L2 cycles and gives future readers visible intent for why Backend.Write excludes LSN. ## QA concern #1 deferred to L2 Volumes map extension (single-map with role discrimination vs two separate primaryHandles + replicaHandles maps) is a legitimate L2 design concern. L1 appropriately hedges with "likely needs to grow" (§3.11 Option C); L2 picks shape. QA's BUG-005-adjacent concern (role-discriminated handle callers forgetting to check role) is the right frame for the L2 decision. No L1 edit needed; flagged for L2 attention. ## §4 open questions status Q1-Q3 ✓ resolved Q4 DistGroupCommit residence → effectively answered by §3.11 C Q5 protocol-frame wire-compat stance → still architect-line (pairs with H5) Blocking L2 start now: only H5 + Q5, both architect-line. QA to draft one-page arch memo per round-2 offer. ## Change log §5 feedback-round log gains round-3 entry §6 change log gains full round-3 detail with V2 line citations Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 23:42:39 -07:00
pingqiu	de2767cd3c	T4 L1 survey round 2: sw pre-scan output + H6 narrowing + H7 + §3.13 Fulfills §5 step 1 pre-scan gate with concrete V3 source evidence and propagates findings to §3 observations. ## Pre-scan output (§5 step 1) 5-row checklist table against V3 source: - SetReplicaAddrs / ReplicaAddrs / replica fields: NONE in `core/frontend/` or `core/storage/` (grep-clean) - Sync/Write remote-ack semantics: NONE; all returns pure-local (`types.go:50-78`, `logical_storage.go:57-70`) - LogicalStorage.Write LSN: pure-local; distributed durability is explicit non-contract (`logical_storage.go:45`) - Ship/Replicate/Quorum/Barrier/Durability identifiers: none in code; comments only - Replication stubs: NONE; but three fully-implemented replica- side primitives on LogicalStorage: ApplyEntry / AdvanceFrontier / AllBlocks, with impls in walstore.go + smartwal/store.go Net: frontend/durable layer clean; LogicalStorage layer already committed to a specific replica-side shape. L2 must ALIGN with that shape, not override it. ## §3 updates driven by pre-scan §3.11 (H6) narrowed with V3 existing-shape evidence: - Option A unlikely (no supporting V3 shape; StorageBackend is replication-unaware) - Option B effectively ruled out (ApplyEntry/AdvanceFrontier sit BELOW Backend on LogicalStorage; a ReplicatedBackend wrapper would either reach past its wrapped contents or duplicate the storage-layer contract) - Option C leading (matches V3 existing Provider-owns-lifecycle shape; generalizes BUG-005 lesson) §3.12 (H7) new — LSN surface-up gap: - `Backend.Write → (int, error)` discards LSN - `LogicalStorage.Write → (lsn, error)` returns it - Primary-side shipper needs per-write LSN - H7a (extend Backend sig) unlikely; H7b (Provider intercepts at LogicalStorage layer) natural fit with H6 Option C; H7c (side-channel NextLSN+Boundaries delta) rejected as racy - H7 resolution coupled to H6 — joint L2 LOCK §3.13 new — replica-side bypasses Backend entirely: - Structural finding already locked by V3 shape, NOT an L2 choice - Primary-side traffic: session → handler → Backend → LogicalStorage - Replica-side traffic: network frame → ReplicaReceiver → LogicalStorage.ApplyEntry (bypasses Backend) - Explicit so L2 builds on it rather than fighting ## Feedback-round log + change log §5 feedback log gains round 2 entry; §6 change log gains full round-2 detail with line-level citations. No sign event; this is iterative informal feedback per §8C.8 lightweight cadence. L1 stays DRAFT until bundled T4 T-start three-sign with L2 + L3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 23:21:35 -07:00
pingqiu	b4adf76aa0	T4 L1 survey: drop invented L1-sign gate; keep only T-start three-sign §8C.8 specifies exactly one three-sign per T-boundary — at T-start, covering the bundled L1+L2+L3 package. I had proposed a separate L1 three-sign in §5 that isn't in the rule. Architect correctly pushed back. §5 rewritten as lightweight cadence: 1. sw V3 pre-scan (~5 min, inline reply, prerequisite to L2 not a sign gate) — same grep checklist retained, same BUG-005 rationale 2. sw + QA iterate on L2 (catalogue §3 filled) informally 3. sw + QA draft L3 (T4 port plan sketch) 4. T4 T-start three-sign on bundled L1+L2+L3 (only governance event) Informal feedback-round log hook added so architect/PM inputs are tracked without per-round sign ceremony. Change log updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 22:59:01 -07:00
pingqiu	d2588f5b77	T4 L1 survey: architect feedback round 1 (F1/F2/F3 + H5/H6 + sw pre-scan gate) All 5 feedback items accepted; no subsetting. F1 — RebuildBitmap split into standalone §2.10 entity (10 total, was 9). Rationale: bitmap has independent on-disk schema (~84 LOC rebuild_bitmap.go) + independent conflict-resolution invariant (WAL-wins-over-base). Collapsing into §2.6 RebuildSession at L1 would lose granularity for L2 — bitmap and session may have different PRESERVE/REBUILD verdicts. §2.6 now explicitly cross-references §2.10. F2 — ShipperGroup §2.2 gains "External deps" row: N = RF comes from master assignment via BlockVol.SetReplicaAddrs, not from shipper-internal decision. Cross-entity contract (master assignment ↔ ShipperGroup size ↔ ReplicaReceiver expected-connection-count ↔ DistGroupCommit quorum arithmetic) made explicit so L2 split can't silently drift sync_quorum. F3 — ReplicaBarrier §2.4 scope rewritten from "per-request ephemeral" to "per-request call-closure, BUT queue-state shared per-volume via cond.Wait". Prior wording risked 1:1-porting into a V3 stateless function, losing multi-watcher cond.Broadcast semantics. H5 added to §3 observations — cross-node epoch consistency observation window for sync_quorum. V2 implicit via ack frame carrying epoch; V3 L2 must pick "ack frame carries epoch" vs "primary maintains per-replica epoch cache" before locking. Different choices → different failover + rebuild-trigger semantics. H6 added to §3 observations — write-path vs replication-path concurrency residence. Three L2 options documented: A) StorageBackend.Write triggers shipper (violates T3a layering) B) ReplicatedBackend wraps StorageBackend+shipper (clean; +1 entity) C) Replication inside DurableProvider (extends BUG-005 lesson) L1 makes no recommendation; L2 LOCKS the decision before L3. §5 restructured into 5 gated steps; step 1 is a mandatory sw V3 pre-scan of core/frontend/durable/ + core/frontend/*.go for pre-baked replication-adjacent assumptions. Rationale cited per architect: BUG-005 latent drift came from implicit V3 convention; L1 must surface any such convention before L2 verdicts lock. Concrete grep checklist included so the scan is 5 min, not open-ended. §2 header + §4 open question #1 updated for 10-entity count. Scope block references rebuild_bitmap.go explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 22:41:18 -07:00
pingqiu	38cb25702f	T4 kick-off: L1 V2 replication entity survey (pre-sketch review) First artifact for T4 (Gate G5 Replicated Write Path) under the §8C.8 top-down port discipline added post-T3 retrospective. This is L1 only — raw V2 entity enumeration with scope / lifecycle / concurrency / cross-session / authority / protocol / invariants attributes. No V3 bridge verdicts proposed yet; L2 follows only after L1 review closes. 9 entities identified across replication surface: - WALShipper (per-replica fan-out) - ShipperGroup (per-volume aggregator) - ReplicaReceiver (per-volume replica-side listener) - ReplicaBarrier FSM (per-barrier ephemeral) - DistGroupCommit closure (per-write-op, mode-aware) - RebuildSession (volatile, non-crash-durable) - RebuildServer (per-primary listener) - RebuildTransportServer / Client (per-session base lane) 9 L1-level observations flagged as L2 hazards: epoch fencing pervasiveness, contiguous-LSN cross-cutting invariant, two-lane rebuild bitmap integration, mode-dependent durability, volatility of rebuild session (vs BUG-005 Provider cache lesson), explicit reconnect protocol, three-phase barrier, ioMu.RLock nesting, shipper-group double watermark. 5 open questions raised for sw / architect / PM review before L1 sign: scope completeness (sync_all_reconnect, split-brain arbiter?), scope accuracy (ReplicaReceiver per-volume vs per-assignment), RebuildSession volatility confirmation, DistGroupCommit V3 residence opinion, protocol-frame wire-compat stance. Status: DRAFT — open for sw review; L2 + L3 work blocked on L1 sign per §8C.8 discipline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 22:30:57 -07:00
pingqiu	88dcd49d67	sw-block/design: T3 mini-plan + audit + sketch docs (pre-close docs) Predecessor docs for the T3 batch, retained here for audit trail. The closure report (`v3-phase-15-t3-closure-report.md`), contract bridge catalogue, and BUG-005/006 artifacts already landed in commits `4127e5136` + `6e196885e`; this commit fills the docs those closure artifacts reference back to. Landed: v3-phase-15-t3-port-plan-sketch.md T3 umbrella sketch (rev-2.1, three-signed) v3-phase-15-t3-port-audit.md T3.0 port audit + Addendum A (QA-signed) v3-phase-15-t3a-mini-plan.md T3a scope + sign-off (CLOSED 0e1595c) v3-phase-15-t3b-mini-plan.md T3b scope + sign-off (CLOSED 72d0d40) v3-phase-15-t3c-mini-plan.md T3c scope + sign-off (CLOSED 829c6a9) Total 1,346 lines of doc; no code impact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 22:19:03 -07:00
pingqiu	6e196885e4	T3 closure: reconcile §8C.3 trigger narrative + C5 pin strength Two document-truthfulness mismatches flagged by architect review: §B Governance transition (closure report): previously claimed "no §8C.3 triggers fired during T3"; §H Phase 3's own BUG-005 description matches trigger #1 (unknown-unknown architectural bug, V2/V3 shape- level mismatch). Corrected to say trigger #1 fired once (BUG-005) and was handled per §8C.3, with log entry, architect+PM notification, catalogue §2.3 drift-event row, and porting-discipline citation. C5-NVME-SESSION-STATE-CLEANUP-ON-CLOSE (contract bridge catalogue §2.2.14): previously stated "PASSES today" / "pinned explicitly". Closure §H Phase 4 correctly narrows landed tests to "smoke + goroutine-leak guard" with Target.ctrls/AER/KATO-stored-ms introspection not exercised. Catalogue row now matches that strength: "pin strength today: smoke + goroutine-leak guard only; full state-release introspection NOT exercised; queued as follow-up". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 22:14:08 -07:00
pingqiu	4127e5136c	T3 closure: finalize sign-ready state + BUG-006/007 + catalogue retrofill Closure report (v3-phase-15-t3-closure-report.md): - §E rewritten as FINAL A–F m01 verification table with per-impl status + G4 pass criterion = smartwal (production default) full A–F green; walstore non-default fallback with Matrix D failure tracked via BUG-007 - §E sign table: QA re-sign 2026-04-22 with evidence basis (seaweed_block@313dd52 + BUG-005 fix 42b045a); prior RETRACTED row superseded - §D INV-DURABLE-001: conditional "Path B pending" wording removed; scoped to smartwal; canonical row name stands - §B non-claims: stale _TBD_ perf wording replaced with first-light scope statement; new non-claim "G4 pass = smartwal only; walstore deferred via BUG-007" added - §G.3 finalized: FINAL resolution with smartwal A–F PASS; walstore deferred - §H Phase 2 narrative updated to match final matrix outcome (Matrix E smartwal-only; walstore E skipped pending BUG-007) - §H Phase 4: T3-DEF-6 test wording downgraded from "pins cleanup contract" to "smoke + goroutine-leak guard" per PM feedback (no test-only introspection of Target.ctrls/ AER/KATO internals; follow-up deferred) - §H Phase 5: BUG-007 filed and scoped; non-blocking basis spelled out Contract Bridge Catalogue (v2-v3-contract-bridge-catalogue.md): - §2.2.14 C1-NVME-SESSION-KATO reclassified PRESERVE-partial → VIOLATED with BUG-006 anchor + m01 Matrix D evidence - §2.2.14 C5-NVME-SESSION-STATE-CLEANUP-ON-CLOSE added (T3-DEF-6 retrofit, pinned by QA L1 addendum) - §2.3 drift-event audit table expanded with BUG-006, BUG-007, T3-DEF-5, T3-DEF-6 BUG-006 (006_nvme_kato_timer_not_enforced.md): - Unified contract ID to catalogue name C1-NVME-SESSION-KATO-STORED-NOT-ENFORCED (was drifting as C3-NVME-KATO-ENFORCEMENT, PM Low catch) - §7 reframed as "existing row reclassified VIOLATED" rather than "add new row" BUG-007 (007_walstore_umount_remount_data_loss.md): filed as pre-existing walstore-specific durability bug surfaced by Matrix D re-verify; explicitly non-blocking for T3 since smartwal is production default. BUG-005 (005_backend_close_cross_session.md): committed for HEAD-reproducibility (referenced by closure §H Phase 3). Inventory (bugs/inventory/nvme-test-coverage-deferred.md): T3-DEF-5/6/7 struck through with per-row resolution pointers; zero open T3-scope inventory rows remaining. Evidence artifacts committed in seaweed_block@313dd52 (scripts/iterate-m01-nvme.sh Matrix F robustness + t3_qa_session_cleanup_addendum_test.go). Awaiting architect + PM three-sign. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 22:08:30 -07:00
pingqiu	953fdb7564	doc: P14 S8 final bounded close — evidence matrix + P15 handoff Adds the six S8 closure deliverables consolidating S4-S7 evidence, classifying V2 scenarios, and mapping residual product gaps onto canonical P15 tracks (per v3-phase-15-product-plan.md §4). New docs: - v3-phase-14-s8-assignment.md — S8 execution contract. - v3-phase-14-s8-final-bounded-close.md — bounded P14 target, accepted topology, reject conditions. - v3-phase-14-s8-evidence-matrix.md — 16 claims × {L0, L1, L2, L3, Status, Residual}. 15 PROVEN, 1 PARTIAL (Claim 15 fence quantitative bound, P14 internal follow-up). Rounds 2-3 architect corrections: Claim 10 / 12 L2 narrowed; Claim 6 refresh gap closed by the new L1 test (see companion commit in seaweed_block). - v3-phase-14-s8-v2-scenario-classification.md — every V2 scenario mapped to RUNNABLE-P14 / BLOCKED-FRONTEND / BLOCKED-OPS / BLOCKED-HA / BLOCKED-PERF / PORT-MECHANISM; scenario YAMLs kept as L3 shape, not executed evidence. - v3-phase-14-s8-p15-handoff.md — 11 rows (10 canonical P15 tracks + 1 P14 internal follow-up anchored to Claim 15 PARTIAL); §4 integrity check split by row class. - v3-phase-14-s8-closure.md — final P14 closure statement matching the close doc §10 wording; explicit non-goals; all 9 P15 tracks named with canonical numbering. No claim of CSI / frontend / migration / security / performance / production readiness. Every product gap is handed off with a concrete first-proof gate. Companion: seaweed_block commit adds the IntentRefreshEndpoint L1 route test that closes Claim 6. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 01:44:11 -07:00
pingqiu	247d9f6fa6	doc: V3 observability — structured logging, tracing, metrics, debug zip, alerts Covers 6 areas based on CockroachDB/Ceph/etcd/Longhorn research: 1. Structured logging: zap + JSON + channel model (OPS/STORAGE/REPL/ISCSI/AUDIT/HEALTH) 2. Distributed tracing: OpenTelemetry spans across write/rebuild/failover paths 3. Metrics: 40+ must-have Prometheus metrics with histogram latency buckets 4. Debug tools: debug zip (logs+pprof+state), log merge, live tail 5. Audit logging: every admin mutation with actor/target/operation/result 6. Alert design: 3 tiers (page/ticket/log), anti-patterns to avoid Identifies existing gaps: no I/O latency histogram, no rebuild duration metric, no audit trail, no structured logging, no distributed tracing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 02:32:30 -07:00
pingqiu	9437bd0b95	doc: V3 development process — branch strategy, CI/CD, review, release Covers full engineering process based on SeaweedFS upstream audit: - Branch strategy: feature/sw-block with checkpoint branches for perf baselines - Commit conventions: type: description format - Code review checklist with anti-pattern checks - Testing standards: 5 levels, 1600+ tests, 4 hardware acceptance scenarios - CI/CD pipeline: unit→component→hardware gates - Release process: checklist, artifacts, versioning - Issue/PR templates with anti-pattern classification - Agent collaboration model (architect/sw/tester/manager roles) - Code quality: golangci-lint config, race detection - Upstream contribution path for SeaweedFS merger Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 01:58:19 -07:00
pingqiu	f11a5829d7	doc: update operations design — add existing V1 UI foundation, code map Added section 8: existing UI/admin infrastructure from V1: - iSCSI admin HTTP server (admin.go: /status, /assign, /rebuild, /snapshot) - Grafana dashboard JSON (block-overview.json, already built) - Master UI HTML (master.html, add Block Volumes tab) - Volume server UI HTML (volume.html, add Block section) - Prometheus metrics (already integrated) Added section 10: existing vs new code map showing most backend exists — work is wiring to user-facing interfaces. Updated Phase 1 to include Master UI tab (+200 lines HTML/JS). Updated Phase 5 with two options (lightweight extend vs full SPA). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 01:44:47 -07:00
pingqiu	3ba622d9e0	doc: V3 operations design — user-friendly setup, shell commands, REST API Covers three personas (developer/operator/platform engineer) with: - One-command setup: weed server -block (10 seconds to first volume) - Shell commands: block.list, block.status, block.health, block.create, etc. - REST API: /block/volumes CRUD, /block/health - Observability: Prometheus metrics, alerting rules, Grafana dashboard - Actionable error messages (every error tells you what to do next) - Dry-run by default for all destructive operations Competitive comparison: 10s setup vs Ceph 30min, 13.5x write IOPS, single binary for object + block storage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 00:36:32 -07:00
pingqiu	2bc8dfcdde	doc: update testrunner roadmap — add runs.db text index for result tracking P1 feature updated: replace generic "structured results" with concrete runs.db design (newline-delimited JSON, one line per run). Leverages existing RunBundle system (manifest.json, result.json already exist). New CLI commands: list, trend, gc, reindex, diff. Regression detection via stddev comparison against rolling baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 22:00:35 -07:00
pingqiu	676539d3b9	doc: testrunner roadmap + dm-stripe scenario (42/42 PASS, 1.87x write IOPS) testrunner-roadmap.md: P0-P3 feature plan for multi-version comparison, Ceph adapter, result tracking, cluster templates, debug mode. dm-stripe-two-server.yaml: proven Linux dm-stripe across 2 sw-block volumes on 2 servers. Results: single=42K IOPS → striped=79K IOPS (1.87x). Data integrity verified via md5. Zero sw-block code changes needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 21:57:03 -07:00
pingqiu	25ede892b4	doc: external failure taxonomy — 20 real bugs from Ceph/DRBD/Mayastor/Longhorn Catalogs production failures organized by semantic class: - Membership/liveness misjudgment (4 cases) - Recovery decision error (3 cases) - Completion/durability illusion (4 cases) - Ordering/race conditions (4 cases) - Background work corrupts semantics (3 cases) Each entry maps to V2 exposure and V3 prevention rules. Includes "Would V2 Have This Bug?" self-audit checklist. Sources: Ceph tracker, DRBD changelogs, Longhorn/Mayastor GitHub issues. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 01:21:08 -07:00
pingqiu	8ecc506452	V2 stabilization: 144/144 hardware actions PASS + design docs + SmartWAL prototype Hardware scenarios (all PASS on m01/m02, 25Gbps RoCE): - I-V3 auto-failover: 43/43 (create→write→kill→promote→verify IO) - I-R8 rebuild-rejoin: 58/58 (failover→write→restart→1GB rebuild in 2s→verify data) - Fast rejoin: 43/43 (kill replica→3s→restart→recovery→data verified) Performance: V2 RF=1 = 46,666 IOPS vs V1.5 RF=1 = 47,233 IOPS (-1.2%, noise) New test scenarios: - v2-rebuild-rejoin.yaml: full failover→rebuild→second failover→data integrity - v2-fast-rejoin-catchup.yaml: replica kill→fast restart→recovery - v2-rebuild-failure-retry.yaml: kill during rebuild→restart→data verified - rf1-perf-compare.yaml: RF=1 perf baseline for V1.5 vs V2 comparison Design documents: - protocol-anti-patterns.md: 7 anti-patterns with cases from SeaweedFS/Ceph/DRBD - smartwal-design-memo.md: extent-first write algorithm research (BlueStore/ZFS/DRBD) - smartwal-prototype-spec.md: prototype spec with 16/16 crash tests PASS - v3-clean-recovery-draft.md: V3 semantic cleanup principles - v2-integration-matrix.md: 25-row integration coverage map - v2-acceptance-evidence.md: gap analysis for remaining work SmartWAL prototype (16/16 tests PASS): - smartwal.go, smartwal_record.go, smartwal_recovery.go: core implementation - smartwal_test.go: 9 single-node crash tests - smartwal_repl_test.go: 7 two-node replication crash tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:18:20 -07:00
pingqiu	5279bd3945	fix: tolerate missing sender in remote rebuild ack observation The architect's refactor correctly routes remote rebuild acks through the shared observation path (pins, watchdog, deferred terminal success). But requireReplicaSession fails with "sender not found" when the orchestrator registry is reconciled between installSession and the first ack arrival. Fix: when emitTerminal=false (remote path), treat sender-not-found as non-fatal. The remote coordinator already validated the session — the sender lookup is for local observation only. Pins and watchdog handle nil snap gracefully (updateRebuildProgressPin line 296 already checks snap != nil). This preserves the architect's design (shared observation + deferred terminal success) while tolerating the sender registry race that only affects the remote rebuild path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 23:00:06 -07:00
pingqiu	008ea03ef5	fix: suppress SessionFailed after successful remote rebuild completion After RemoteRebuildIO.TransferFullBase returns, the OnAck callback has already emitted SessionCompleted and stored achievedLSN. But RebuildExecutor.Execute() continues calling sender methods which fail ("sender stopped") because the completion event already cleaned up the sender. This error propagated to ExecutePendingRebuild which emitted a spurious SessionFailed, knocking the mode back to degraded. Fix: check remoteRebuildAchieved before emitting SessionFailed. If the rebuild already completed via the ack path, log the post-completion error but suppress the SessionFailed event. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 17:32:37 -07:00
pingqiu	55862f1ab1	fix: rebuild base-only completion + protocol handshake + direct ack events Three fixes for the remote rebuild path: 1. Base-only completion: when BaseLSN == TargetLSN, the base image covers all data — no WAL tail needed. MarkBaseComplete now auto-satisfies the WAL condition and calls TryComplete so the session completes immediately after the base transfer finishes. 2. Base lane protocol handshake: runBaseLaneClient now sends MsgRebuildReq {Type: RebuildSessionBase} before reading. The RebuildServer requires this handshake to dispatch to ServeBaseBlocks. Without it, the server received raw frames it couldn't understand. 3. Direct ack events: OnAck emits engine events directly (SessionCompleted, SessionProgressObserved, SessionFailed) instead of routing through ObserveReplicaRebuildSessionAck which requires the sender in the orchestrator registry. The remote coordinator owns the session — no registry lookup needed. Also adds diagnostic logging on both sides: - Replica: logs parsed RebuildAddr and base lane client start - Primary: logs sender state after installSession Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 16:18:44 -07:00
pingqiu	0faf93a152	diag: add sender registry verification after installSession The accepted ack from the replica is rejected with "sender not found" even though installSession succeeds. Add diagnostic logging to verify the sender exists in the orchestrator registry immediately after installSession, and dump all registry IDs if not found. This will reveal whether the sender is removed between installSession and the ack arrival (by syncProtocolExecutionState, evaluateActivationGate, or another ProcessAssignment that reconciles with a stale replica list). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 15:53:33 -07:00
pingqiu	943000ae8e	fix: RebuildSourceDecision returns FullBase when CommittedLSN=0 When CommittedLSN=0 (sync_all mode, replica degraded), snapshot-tail rebuild was chosen because IsRecoverable(checkpoint, 0) is vacuously true (0 <= HeadLSN always). But snapshot-tail requires a valid committed endpoint for tail-replay. Without it, ExecuteRebuildPlan calls TransferSnapshot which RemoteRebuildIO doesn't support → immediate fail. Fix: if CommittedLSN=0, force RebuildFullBase. This is the correct source when the primary has data but no replica has confirmed durability. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 15:46:00 -07:00
pingqiu	a79cba0be7	fix: PlanRebuild targetLSN=0 when replica is degraded (CommittedLSN fallback) Root cause: StatusSnapshot().CommittedLSN reports 0 in sync_all mode when the replica shipper has no flushed progress (NeedsRebuild state). This is correct for lineage-safe committed boundary, but PlanRebuild uses CommittedLSN as RebuildTargetLSN. With target=0, shouldStartSessionCommand rejects the StartRebuildCommand, and the rebuild IO never executes. Fix: PlanRebuild falls back to HeadLSN when CommittedLSN is 0. The primary's WAL head IS the data boundary the replica needs to reach. The fact that no replica has confirmed durability is exactly why we're rebuilding. Also adds command type logging to coreApplyAndLog so tester can verify which commands are actually emitted vs silently dropped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 15:35:31 -07:00
pingqiu	bc767eb9d2	fix: rebuild correctness — single completion, fail-closed acks, diagnostic logging Three correctness fixes for the remote rebuild path: 1. No double completion: for remote rebuilds, OnRebuildCompleted skips RebuildCommitted since ObserveReplicaRebuildSessionAck already emitted SessionCompleted on the accepted ack. One rebuild = one completion event. 2. SessionAckFailed with rejected observation: if OnAck rejects the failed ack (stale session), don't use the sentinel errRebuildAckFailed. Return a regular error so ExecutePendingRebuild emits the fallback SessionFailed. No path leaves the engine session hanging. 3. Diagnostic logging in ExecutePendingRebuild: log the replicaID and targetLSN on both nil-return (TakeRebuild mismatch) and successful take paths. Also log the pending store in runRebuild with replicaID, targetLSN, and IO type. This makes the TakeRebuild seam diagnosable on hardware without rebuilding the engine package. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 15:25:26 -07:00
pingqiu	df69c83f41	feat: RemoteRebuildIO — primary coordinates rebuild, replica installs Replace the broken primary-local rebuild executor with RemoteRebuildIO, a server-side engine.RebuildIO implementation that coordinates remotely. The primary sends SessionControlV2 (with RebuildAddr trailer) to the replica's control channel; the replica starts a local rebuild session and auto-connects to the primary's rebuild server for the base lane. Single rebuild route: ALL core-present rebuilds use RemoteRebuildIO. The entire command chain is preserved unchanged: PlanRebuild → pending → RebuildStarted → StartRebuildCommand → ExecutePendingRebuild → RemoteRebuildIO.TransferFullBase Key changes: - SessionControlMsg v2: optional RebuildAddr trailer (len-based decode) - ReplicaRebuilding shipper state: session-gated live WAL lane - RemoteRebuildIO: dials replica ctrl, sends session control, reads acks - Ack forwarding through ObserveReplicaRebuildSessionAck (pins/watchdog) - Completion proof from replica's achievedLSN, not primary's local vol - Transport failures emit SessionFailed (no double-emit on ack failures) - Progress ack rejection fails closed (stale session = abort) - Replica auto-starts base lane client on v2 session control State transitions: NeedsRebuild → [accepted ack] → Rebuilding → [completed] → InSync Rebuilding → [failed/EOF] → NeedsRebuild → [next probe] → retry Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 15:04:22 -07:00
pingqiu	befe049b09	refactor: unified primary onboarding + rebuild execution wiring Replace three bypass mechanisms with one unified model. When the probe returns ProbeRebuildRequired, the host now starts the rebuild through the existing recovery manager (StartRecoveryTask), which resolves the rebuild address, plans the rebuild, and executes via the v2bridge executor — the same path as master-driven RoleRebuilding. New per-replica probe API: - WALShipper.ProbeReconnect() → ReplicaProbeResult with typed outcome - ShipperGroup.ProbeReconnectAll() → []ReplicaProbeResult - BlockVol.ProbeReplicaOnboarding() / IsClosed() Host-side wiring: - handleReplicaProbeResult routes outcomes: KeepUp → ShipperConnectedObserved CatchUp → ShipperConnectedObserved (recovery manager handles session) Rebuild → NeedsRebuildObserved + StartRecoveryTask (executes rebuild) TemporaryFailure → no-op - lastAssignmentsForPath reconstructs assignment for recovery manager - onPrimaryRosterChanged probes all replicas (defined, called from watchdog) - observePrimaryShipperConnectivity uses probe API Probe fires via syncProtocolExecutionState immediately after assignment processing — same heartbeat cycle, no timer delay. Deleted: startDirectRebuild, resolveCtrlAddrForShipper, TryReconnect/TryReconnectAll/TryReconnectShippers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 02:33:07 -07:00
pingqiu	d6bc7516f1	feat: primary-direct rebuild — start rebuild session on NeedsRebuild When proactive reconnect finds WAL gap exceeds retained range: 1. Emit per-replica NeedsRebuildObserved to engine (with ReplicaID) 2. Resolve replica ctrl address from shipper group 3. Start direct rebuild session: send sessionControl(start_rebuild) to replica's ctrl channel, stream base blocks, emit RebuildStarted The primary drives the rebuild directly without master round-trip. The master sees the result via heartbeat projection (needs_rebuild → rebuilding → healthy). This matches V2 authority: master owns identity, primary owns data-control recovery. Added WALShipper.CtrlAddr() getter for address resolution. resolveCtrlAddrForShipper maps data address to ctrl address via shipper group (works for RF=2 and RF=3+). startDirectRebuild runs in a goroutine: dials replica ctrl, sends start_rebuild, waits for accepted ack, serves base blocks, emits RebuildStarted to engine on success. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 01:04:00 -07:00
pingqiu	8b469cf70b	fix: revert Bridge 2, fix Bridge 1 with per-replica identity Revert detectAndEnqueueRebuildFromHeartbeat (Bridge 2) — master should not drive rebuild assignments from heartbeat. The primary owns data-control recovery per the V2 authority split. Fix Bridge 1: NeedsRebuildObserved now carries per-replica identity. resolveReplicaIDForShipper maps shipper DataAddr to ReplicaID via the shipper group (works for RF=2 and RF=3+). The engine receives the specific replica that needs rebuild, not a volume-level broadcast. Primary-direct rebuild: the primary detects which replica needs rebuild and will drive the session directly. The master learns about it via subsequent heartbeat projection (needs_rebuild → rebuilding → healthy). No master round-trip needed for the rebuild decision. Added WALShipper.DataAddr() getter for address resolution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:55:50 -07:00
pingqiu	f90ccf5bfd	fix: proactive shipper reconnect on rejoin (Bug 5) After rejoin, the shipper is configured but no I/O triggers Ship(), so the shipper stays Disconnected and the core stays at awaiting_shipper_connected indefinitely. Fix: observePrimaryShipperConnectivity now calls TryReconnectShippers when ShipperConfigured=true but ShipperConnected=false. This triggers the full reconnect protocol (dial + handshake + bounded catch-up) proactively, bringing the replica current without waiting for I/O. Option B approach: uses the same reconnect path as Barrier() — not a fake write or bare dial probe. CatchUpTo(headLSN) replays any retained WAL entries, bringing the replica fully current. New methods: - WALShipper.TryReconnect(): full reconnect without foreground I/O - ShipperGroup.TryReconnectAll(): probes all disconnected shippers - BlockVol.TryReconnectShippers(): volume-level entry point Also fix pre-existing test expectation: engine now emits start_recovery_task on primary assignment with replicas. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 00:14:46 -07:00
pingqiu	53246d2780	fix: recover TOCTOU + WAL pressure edge case tests Fix recover path TOCTOU: re-Lookup after AddReplica so the primary refresh assignment includes the freshly added replica addresses. Previously, Lookup (copy) was called before AddReplica modified the registry, so entry.Replicas was empty → primary got replicas=0 → shipper never configured. Add 2 WAL pressure edge case tests: - ShipperCatchUpOrEscalate: 64KB WAL, 200 writes, aggressive flusher. Proves no hang/deadlock/corruption. Shipper either keeps up or correctly escalates to NeedsRebuild. - RebuildWithPinWhilePrimaryWrites: rebuild session active while primary writes 7600+ blocks in 2s. Proves primary never freezes — rebuild pin is on replica only, primary WAL recycles freely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 23:56:26 -07:00
pingqiu	e0116fc631	fix: three hardware blockers — WAL retention + registry race + shutdown beat All 43 actions pass on m01/m02 hardware. Auto-failover PASS. dd_write: 30s → 123ms. Post-failover write: 33,621 IOPS. 1. WAL retention: remove keepup retention floor (MinShippedLSN). WAL cannot be pinned during sustained async writes — any pin strategy either fills WAL (blocking writes) or over-recycles (breaking catch-up). Flusher recycles freely. Future LBA map will provide catch-up without WAL retention. MinShippedLSN on ShipperGroup retained as diagnostic surface. 2. Registry stale-cleanup race: add RegisteredAt grace period. Race: master registers volume → next VS heartbeat arrives before VS discovers the volume → stale cleanup deletes the entry → failover finds 0 entries. Fix: skip stale cleanup for entries registered within 30s (> 2 heartbeat intervals). 2 new tests: grace protects new entry, old entry still cleaned. 3. Shutdown heartbeat: VS disconnect heartbeat no longer claims block inventory authority. Previously, the shutdown beat's empty inventory triggered stale cleanup, deleting the entry before failover could use it. Scenario fix: recovery-baseline-failover.yaml now kills the correct node (discovered primary, not hardcoded), connects to the correct new primary for post-failover verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 22:59:46 -07:00
pingqiu	39f1232fe2	feat: validation matrix closure — Rebuild Ready 12/12, Restore Ready 10/10 Close all Rebuild Ready and Restore Ready matrix gaps. V2 Ready at 10/14 (2 partial, 2 missing — honest assessment). New tests (tester-written): - R1: syncAck-driven trigger via protocol engine decision - R3: stale replica restart beyond WAL → rebuild converges - R5: connection drop mid-base → cancel → fresh rebuild converges - R10: failover-rejoin with forced WAL recycling, strict rebuild assert - R11: divergent replica full overwrite convergence - R12: crash mid-rebuild → fresh session converges (not resume) - S2: corrupt WAL entry + corrupt base block both rejected - S5: snapshot-tail rebuild (base + WAL tail replay) - S7: crash between base install and tail replay - S8: snapshot under concurrent writes - V5: rebuild complete without DurableLSN blocks publish_healthy - V9: mixed replica health aggregate projection - V14: negative fail-closed matrix (epoch, kind, stale) Bug fix: StartRebuildSession now clears stale dirty map + resets WAL + updates checkpoint AFTER safety check but BEFORE session.Start(). Fixes stale extent data shadowing rebuild base blocks on reopened replicas. Cleanup: remove 14 obsolete design docs (migration batches, old WAL-v2 specs, simulator goals) — all superseded by current protocol docs. 34 component tests + 8 protocol engine tests + server tests all pass. 1GB CRC validation passes in 19s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 16:31:55 -07:00
pingqiu	59a36013d4	feat: rebuild hardening A1-A5 + session-controlled execution path A1 Engine kind-routing fix: SessionProgressObserved/Completed/Failed now respect active session Kind. Rebuild progress no longer leaks into catch-up aggregate. sessionKindMismatch guard + observeRebuildProgress helper. 2 regression tests lock kind isolation. A2 Retention pin: Rebuild session ack drives progress-based WAL retention floor. Pin installed at base_lsn on accepted, advances with wal_applied_lsn, released on completed/failed/cancelled. rebuildProgressPinFloor returns min across all active replicas. Retention pin test: 100 blocks fill WAL, 5 flusher cycles with 20 pinned rebuild entries — all verified correct. A3 Progress ack emission: Automatic sessionAck(running/base_complete/completed/failed) emitted from rebuild session lifecycle transitions. sessionAckLocked builds ack under session lock. emitRebuildSessionAck callback wired through SetOnRebuildSessionAck on BlockVol. ObserveReplicaRebuildSessionAck maps acks to core engine events. WireLocalReplicaRebuildSessionAcks bridges local callback to server. 5 server tests proving ack→core, pin advance, pin cleanup. A4 Deadline/timeout: rebuildAckWatch watchdog: armed on accepted/running/base_complete, refreshed on each ack, cleared on completed/failed. Timeout cancels local session + clears pin + fail-closes. 2 tests: timeout→fail-close, progress→refresh. A5 Session-controlled execution path: v2bridge.Executor.TransferFullBase now uses session-controlled loop: beginControlledFullBase → real sessionControl over TCP → transferExtentToSession via RebuildTransportClient → PrepareFullBaseRebuild → TryCompleteRebuildSession. ReplicaReceiver control channel handles MsgSessionControl alongside MsgBarrierReq. Session acks written back on same TCP connection. RebuildSessionBase request type separates new per-block stream from legacy raw extent stream. Full-base cleanup deferred until success. Deadlock fix: ApplyBaseBlock releases session lock before ioMu. Hydration skip for full-base sessions. 23 rebuild component tests (all pass): 11 kernel correctness, 8 transport/runtime, 3 scenario-scale, including 1GB primary-initiated with CRC validation. 29 files changed, ~2500 insertions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 14:39:11 -07:00
pingqiu	342f8baa69	feat: rebuild transport wiring — session control + base block streaming Wire protocol messages and transport handlers for the rebuild MVP: Protocol messages (rebuild_transport.go): - SessionControlMsg: epoch, sessionID, command, baseLSN, targetLSN, snapshotID. Encode/Decode with fixed 37-byte wire format. - SessionAckMsg: epoch, sessionID, phase, walAppliedLSN, baseComplete, achievedLSN. Encode/Decode with fixed 34-byte wire format. - MsgSessionControl (0x10) and MsgSessionAck (0x11) on control channel. - SendSessionControl/SendSessionAck convenience functions. Transport handlers: - RebuildTransportServer: primary-side, streams all extent blocks as MsgRebuildExtent frames (reusing existing rebuild message type), ends with MsgRebuildDone. - RebuildTransportClient: replica-side, receives base blocks and routes through vol.ApplyRebuildSessionBaseBlock, marks base complete on MsgRebuildDone. 4 transport tests: - SessionControl wire round-trip - SessionAck wire round-trip - BaseBlockStreaming: full TCP loop, 1024 blocks streamed and verified - SessionControlOverTCP: real TCP send/receive with accepted ack Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 14:57:43 -07:00
pingqiu	49845dd509	feat: server-layer rebuild session skeleton — host routing for MVP Add BlockService replica-side rebuild routing API that bridges transport/host layer to BlockVol session surface: StartReplicaRebuildSession(path, config) ApplyReplicaRebuildWALEntry(path, sessionID, entry) ApplyReplicaRebuildBaseBlock(path, sessionID, lba, data) MarkReplicaRebuildBaseComplete(path, sessionID, totalBlocks) TryCompleteReplicaRebuildSession(path, sessionID) CancelReplicaRebuildSession(path, sessionID, reason) ReplicaRebuildSession(path) → snapshot Each method does one thing: validate → WithVolume → delegate to BlockVol. No wire decoding, no protocol decisions, no state invention. Transport wiring (sessionControl/walData/sessionData handlers) is the next step. 2 focused tests: skeleton routes correctly, stale session ID rejected. Updated v2-rebuild-mvp-session-protocol.md with server skeleton section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 14:53:32 -07:00
pingqiu	d2d57851b0	feat: rebuild MVP — dual-lane session with bitmap protection Rebuild session protocol implementation for v2-rebuild-mvp-session-protocol.md. New files: - rebuild_bitmap.go: RebuildBitmap — session-scoped dense bitset for WAL-applied LBA tracking. MarkApplied on local WAL write (not receive). ShouldApplyBase returns false for WAL-covered LBAs (WAL always wins). - rebuild_session.go: RebuildSession — replica-side two-line rebuild. WAL lane (ApplyWALEntry) + base lane (ApplyBaseBlock) with bitmap conflict resolution. TryComplete requires BOTH base_complete AND wal_applied_lsn >= target_lsn. Volume-level control surface: StartRebuildSession, ApplyRebuildSessionWALEntry/BaseBlock, MarkRebuildSessionBaseComplete, TryCompleteRebuildSession, CancelRebuildSession, ActiveRebuildSession. - rebuild_mvp_test.go: 4 correctness tests — base+WAL converge, WAL-applied never overwritten by base, bitmap set on applied not received, control surface start/supersede/complete. - rebuild_transport_test.go: 2 transport-level tests — two-line with real WAL shipping, live writes during base copy with bitmap conflict. Design docs: - v2-rebuild-mvp-session-protocol.md: MVP spec with message set, apply rules, completion/failure/crash rules, test matrix - v2-sync-recovery-protocol.md: full protocol context (keepup/catchup/ rebuild unified design, primary decision logic, two-line model) - v2-session-protocol-shape.md: protocol shape overview Protocol engine (reference, not production): - sw-block/protocol/: 7-event engine with ~300 lines, 13 tests 6 rebuild tests pass, all existing component tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 14:30:34 -07:00
pingqiu	55013e103b	feat: Phase 20 Stage 0+1 closure — bootstrap + sustained workload on hardware Stage 0 (bootstrap closure): PASS on m01/M02 - create RF=2 sync_all → 10s shipper wait → 4k fsync → publish_healthy - Proves: BarrierAccepted observation, ShipperConnected, DurableLSN > 0 Stage 1 (sustained workload): 32/33 actions PASS - bootstrap → fio 10s randwrite → dd_write 1M×2 fsync → data checksum - Remaining: auto-failover promotion (separate issue) Key fixes: - BarrierAccepted callback: SyncCache success → core DurableLSN update - BarrierRejected callback: barrier failures surface to core with reason - Shipper state callback for new volumes (not just startup volumes) - CatchUpTo ctrl conn reset: prevents stale control channel after recovery - CP13-6 max-bytes budget suspended: uses replicaFlushedLSN which can't advance without barrier; kills healthy shippers during async writes. Will be replaced by v2 negotiated sync/recovery protocol. - Barrier diagnostic logging: start/fail/success with reason and LSN - Scenario restructured: Stage 0 (bootstrap-closure) + Stage 1 (failover) - dd_write: sync_mode param + real stderr capture - sw-test-runner suite command: deploy once, run N scenarios - WAL size plumbing: proto + API + handler (forward-compatible) Known: 6 blockvol/server test failures from Barrier() path change (bounded catch-up in Barrier). Need test updates to match new semantics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 19:55:12 -07:00
pingqiu	44103a1bd7	feat: Phase 20 acceptance fixes + sw-test-runner suite mode Acceptance rows closed: - WriteLBA/SyncCache contract: code comments document write-back vs durability fence semantics - RF=2 stable identity: v2bridge always uses SetReplicaAddrs (preserves ServerID); blockcmd dispatcher also fixed to use setupPrimaryReplicationMulti; test asserts exact expected ReplicaID="vs-2" (not just non-empty) - Tests treating WriteLBA as commit: replica_read_test rewritten with SyncCache as durability fence - publish_healthy contract: 3 gate tests with hard assertions including gate 3 (PrimaryShipperConnected) - SetReplicaAddr deprecation warning added - WALShipper.ReplicaID() getter added for identity verification Test runner enhancements: - sw-test-runner suite command: build → deploy → run N scenarios in one invocation with --skip-deploy support - Suite YAML definitions for T6 Stage 0 and Stage 1 - deploy action: kill stale processes, clean dirs, cross-compile, upload - run-phase20-t6.ps1 PowerShell script (deprecated by suite command) Engine/runtime fixes: - Recovery executor nil-safety improvements - Recovery bundle BuildRecoveryBundle defensive checks - ShipperGroup MinReplicaFlushedLSNAll surface Docs: acceptance checklist refined, test matrix updated, T6 runbook, engine maintainer tutorial, design README updated. 26 files changed, ~1600 insertions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 11:30:54 -07:00
pingqiu	275c3ee1c7	docs: Phase 20 acceptance checklist — architect-refined signoff matrix Tighten acceptance matrix with explicit per-boundary rows, signoff reading split into hard blockers vs product hardening, and clear rule: architecture-complete ≠ product-complete. 6 hard blockers before T6/T7: 1. WriteLBA/SyncCache/sync_all contract closure 2. Fresh replica bounded catch-up before live tail 3. Timeout/retention-loss classification for catch-up 4. publish_healthy alignment with one protocol contract 5. RF=2 stable identity on all shipping paths 6. Test audit for incorrect WriteLBA==commit assumptions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 00:12:32 -07:00
pingqiu	58aa842802	docs: Phase 20 product acceptance checklist 7-area acceptance matrix mapping current state vs product requirements: write/durability contract, fresh replica bootstrap, host observation completeness, serving/publish alignment, snapshot/rebuild convergence, adapter consistency, test contract alignment. Each item marked with: current state, required for product, blocks T6/T7, best test level. Priority ordered into must-close-before-Stage-1, should-close-before-Stage-2, and can-close-after-T6/T7. Key diagnosis: architecture-complete, execution-incomplete. The engine thinks like a product; the data plane still behaves partly like a prototype. The gap is end-to-end contract closure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 00:05:22 -07:00

1 2 3 4 5 ...

13263 Commits