T4 L1 survey: architect feedback round 1 (F1/F2/F3 + H5/H6 + sw pre-scan gate)

All 5 feedback items accepted; no subsetting.

F1 — RebuildBitmap split into standalone §2.10 entity (10 total,
was 9). Rationale: bitmap has independent on-disk schema (~84 LOC
rebuild_bitmap.go) + independent conflict-resolution invariant
(WAL-wins-over-base). Collapsing into §2.6 RebuildSession at L1
would lose granularity for L2 — bitmap and session may have
different PRESERVE/REBUILD verdicts. §2.6 now explicitly
cross-references §2.10.

F2 — ShipperGroup §2.2 gains "External deps" row: N = RF comes
from master assignment via BlockVol.SetReplicaAddrs, not from
shipper-internal decision. Cross-entity contract (master assignment
↔ ShipperGroup size ↔ ReplicaReceiver expected-connection-count
↔ DistGroupCommit quorum arithmetic) made explicit so L2 split
can't silently drift sync_quorum.

F3 — ReplicaBarrier §2.4 scope rewritten from "per-request
ephemeral" to "per-request call-closure, BUT queue-state shared
per-volume via cond.Wait". Prior wording risked 1:1-porting into
a V3 stateless function, losing multi-watcher cond.Broadcast
semantics.

H5 added to §3 observations — cross-node epoch consistency
observation window for sync_quorum. V2 implicit via ack frame
carrying epoch; V3 L2 must pick "ack frame carries epoch" vs
"primary maintains per-replica epoch cache" before locking.
Different choices → different failover + rebuild-trigger semantics.

H6 added to §3 observations — write-path vs replication-path
concurrency residence. Three L2 options documented:
  A) StorageBackend.Write triggers shipper (violates T3a layering)
  B) ReplicatedBackend wraps StorageBackend+shipper (clean; +1 entity)
  C) Replication inside DurableProvider (extends BUG-005 lesson)
L1 makes no recommendation; L2 LOCKS the decision before L3.

§5 restructured into 5 gated steps; step 1 is a mandatory sw V3
pre-scan of core/frontend/durable/ + core/frontend/*.go for
pre-baked replication-adjacent assumptions. Rationale cited per
architect: BUG-005 latent drift came from implicit V3 convention;
L1 must surface any such convention before L2 verdicts lock.
Concrete grep checklist included so the scan is 5 min, not open-ended.

§2 header + §4 open question #1 updated for 10-entity count.
Scope block references rebuild_bitmap.go explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
pingqiu
2026-04-22 22:41:18 -07:00
parent 38cb25702f
commit d2588f5b77

View File

@@ -5,7 +5,7 @@
**Owner**: QA (initial draft) → sw review → architect + PM sign as part of T4 T-start three-sign
**Purpose**: L1 stateful-entity enumeration per `v3-phase-15-qa-system.md` §8C.8 top-down port discipline, effective T4+. This doc is the raw entity landscape BEFORE any L2 bridge verdicts are proposed. Review target: is the entity set complete and correctly scoped? L2 builds on this.
**Scope**: V2 replication code under `weed/storage/blockvol/` — the files listed in `v3-phase-15-mvp-scope-gates.md` §G5 "Port from V2" (wal_shipper / shipper_group / repl_proto / replica_apply / replica_barrier) plus the Phase 4A additions (dist_group_commit / promotion / rebuild / rebuild_session / rebuild_transport). No V3 comparison made here — that is L2's job.
**Scope**: V2 replication code under `weed/storage/blockvol/` — the files listed in `v3-phase-15-mvp-scope-gates.md` §G5 "Port from V2" (wal_shipper / shipper_group / repl_proto / replica_apply / replica_barrier) plus the Phase 4A additions (dist_group_commit / promotion / rebuild / rebuild_session / rebuild_transport / rebuild_bitmap). No V3 comparison made here — that is L2's job.
---
@@ -21,7 +21,7 @@ T4 is the first track where this discipline is enforced pre-sketch. L1 survey is
## §2 Entity survey
Nine V2 stateful entities identified. For each: scope / lifecycle owner / concurrency model / cross-session behavior / authority coupling / network protocol surface / known tricky invariants.
Ten V2 stateful entities identified (updated per architect F1 — RebuildBitmap split out from RebuildSession). For each: scope / lifecycle owner / concurrency model / cross-session behavior / authority coupling / network protocol surface / known tricky invariants.
### §2.1 WALShipper
@@ -42,6 +42,7 @@ Nine V2 stateful entities identified. For each: scope / lifecycle owner / concur
|---|---|
| V2 file(s) | `shipper_group.go` (lines 19408) |
| Scope | per-volume (one group per primary; spans N configured replicas) |
| External deps | **RF is externally supplied, not internally decided.** `N = RF` (Replication Factor) comes from master assignment via `BlockVol.SetReplicaAddrs([]string)` (see `blockvol.go`). Group resizing is an assignment-op (teardown + rebuild), never a shipper-internal decision. RF change semantics are thus a cross-entity contract: `master assignment ↔ ShipperGroup size ↔ ReplicaReceiver expected-connection-count ↔ DistGroupCommit quorum arithmetic`. Losing this alignment during L2 split silently breaks sync_quorum. |
| Lifecycle owner | Constructed via `NewShipperGroup()`; owned by `BlockVol.shipperGroup`. Torn down via `StopAll()` during demote. No reconstruction mid-assignment. |
| Concurrency | No internal goroutines. `BarrierAll()` spawns N goroutines (one per shipper) in parallel via `WaitGroup`. Accessors hold `RWMutex.RLock()`. |
| Cross-session behavior | Pure aggregation; no durable state. Individual shippers retain `replicaFlushedLSN` and state machine across group lifecycle. `MinReplicaFlushedLSN()` reflects live shipper states at query time. |
@@ -67,8 +68,8 @@ Nine V2 stateful entities identified. For each: scope / lifecycle owner / concur
| Attribute | Value |
|---|---|
| V2 file(s) | `replica_barrier.go` (lines 147204) |
| Scope | per-barrier-request (ephemeral); state is `ReplicaReceiver.receivedLSN` + `flushedLSN` |
| Lifecycle owner | Invoked synchronously from `handleControlConn`. No external owner. |
| Scope | **per-request call-closure, BUT queue-state shared per-volume.** Each `handleBarrier` invocation is ephemeral, but the underlying `cond = sync.NewCond(&ReplicaReceiver.mu)` + the `receivedLSN` / `flushedLSN` counters it waits on are **per-volume durable** (for the life of the receiver process). Concurrent barriers share the cond; one broadcast may release several waiters. **L2 hazard**: mis-reading this as "stateless per-request" would 1:1-port it into a V3 stateless function, losing the queue-share semantics and the multi-watcher correctness. |
| Lifecycle owner | Invoked synchronously from `handleControlConn`. The per-call closure has no external owner; the **shared queue state lives in ReplicaReceiver** (§2.3). |
| Concurrency | Serial per control connection; races with concurrent data-channel entries. `cond.Wait()` blocks until `applyEntry` broadcast or timeout fires. Timer goroutine spawned per barrier; broadcasts cond on timeout. `vol.fd.Sync()` single-threaded. |
| Cross-session behavior | Does not persist. Durable boundary `flushedLSN` lives in `ReplicaReceiver`; barrier only advances it. |
| Authority coupling | `req.Epoch` vs `vol.epoch.Load()`; mismatch → `BarrierEpochMismatch`. Waits for strictly contiguous LSN receipt. |
@@ -92,7 +93,7 @@ Nine V2 stateful entities identified. For each: scope / lifecycle owner / concur
| Attribute | Value |
|---|---|
| V2 file(s) | `rebuild_session.go` (lines 41327) |
| V2 file(s) | `rebuild_session.go` (lines 41327). **Embeds RebuildBitmap (see §2.10)** — bitmap is a separate stateful entity with its own on-disk schema, NOT a session-internal implementation detail. |
| Scope | per-replica-session; volatile (tied to session lifecycle, NOT volume lifecycle) |
| Lifecycle owner | `NewRebuildSession()` called from primary host (via `StartRebuildSession()`). Owned by `BlockVol.rebuildSess`. Destroyed via `CancelRebuildSession()`. Does NOT survive crash; rebuild must restart from scratch. |
| Concurrency | All mutation under `mu`. Two concurrent data lanes: WAL lane via `ApplyWALEntry()`; base lane via `ApplyBaseBlock()`. Both acquire+release `mu` individually (short-hold, no deadlock). Bitmap update atomic within `mu`. Ack emit after `mu` release. |
@@ -140,6 +141,21 @@ Nine V2 stateful entities identified. For each: scope / lifecycle owner / concur
| Network protocol surface | In: `MsgRebuildExtent` (0x12), `MsgRebuildDone` (0x13), `MsgRebuildError` (0x14). Out: implicit (receiver). |
| Known tricky invariants | (1) Frame parsing extracts LBA from first 8 bytes of extent frame. (2) Block apply routed to `RebuildSession.ApplyBaseBlock()`; bitmap conflict → skip (not an error). (3) Graceful disconnect on error; replica's session handles timeout/retry. |
### §2.10 RebuildBitmap
Added per architect F1: RebuildBitmap has independent on-disk schema and independent conflict-resolution invariant; collapsing it into §2.6 RebuildSession at L1 would lose granularity for L2 verdicts (bitmap and session may have different PRESERVE/REBUILD outcomes).
| Attribute | Value |
|---|---|
| V2 file(s) | `rebuild_bitmap.go` (~84 LOC) + hydration in `rebuild_session.go` `hydrateBitmapFromRecoveredWAL()` |
| Scope | per-rebuild-session (one bitmap per active session); **has on-disk representation** independent of session state |
| Lifecycle owner | Constructed by `RebuildSession` during session init; persisted to sidecar file during session lifetime; unlinked on session `Completed` / `Failed`. Embedded in `RebuildSession.bitmap` field (§2.6). |
| Concurrency | Mutations under `RebuildSession.mu`. Bit-set operations are atomic within that lock; reads (for `ShouldApplyBase`) also under `mu`. No own internal synchronization. |
| Cross-session behavior | **NOT durable across crash** (matches session volatility per §2.6). On primary or replica restart, sidecar file is discarded; rebuild restarts. **WAL hydration** (§2.6 invariant): if recovery shows `TargetLSN != BaseLSN`, bitmap is re-populated from recovered WAL entries to avoid re-applying stale base blocks; hydration guard fails closed if local checkpoint > base LSN. |
| Authority coupling | Indirect — bitmap bit-set happens only AFTER successful WAL `append` (per §2.6 invariant). Epoch validation is upstream (in `ApplyWALEntry`); bitmap itself is epoch-agnostic. |
| Network protocol surface | None — bitmap is local-only state. Conflict resolution happens inside `ApplyBaseBlock` on the replica. |
| Known tricky invariants | (1) **WAL-wins-over-base conflict resolution**: `ShouldApplyBase(lba)` returns false if the bit is set; `ApplyBaseBlock` becomes a no-op (not an error). (2) **Hydration guard** fails closed: prevents unsound recovery if local checkpoint has advanced past the requested base LSN. (3) **Sidecar schema** is an independent on-disk format (84 LOC worth of serialization); any L2 REBUILD verdict would need to decide whether V3 keeps the same on-disk layout for wire-compat / recovery replay. (4) **Bit-set precedes ACK**: bitmap bit is set before the session emits `emitRebuildSessionAck()` (§2.6), so primary sees a consistent "LBA is WAL-owned now" view when it gets the ack. |
---
## §3 L1-level observations (for L2 to chew on)
@@ -155,6 +171,13 @@ These are patterns visible at L1 that will drive L2 bridge verdicts. **Not** rec
7. **Barrier is a three-phase operation.** Wait-LSN → fsync → advance `flushedLSN`. Both timeout and entry-apply broadcast the same cond; multi-watcher case handled naturally.
8. **ReplicaReceiver's `ioMu.RLock` nesting around apply** — the lock is held across the entire apply path for restore/import exclusion. Only released inside `replicaAppendWithRetry` during WAL-full wait. Any V3 bridge must either preserve this nesting or explicitly rebuild the exclusion (with the rationale documented).
9. **ShipperGroup double watermark**`MinShippedLSN` (Ceph retention watermark, advisory) vs `MinReplicaFlushedLSNAll` (authoritative sync_all durability). Two consumers, two semantics; losing the distinction in V3 would silently break one or the other.
10. **Cross-node epoch consistency observation window** (H5, architect-added). V2 `dist_group_commit` sync_quorum needs primary to know each replica's ack `{epoch, lsn}`; epoch mismatch → ack ignored for quorum. V2 makes this implicit — wire frame carries epoch, decoder in `repl_proto` extracts it. V3's `frontend.Identity.Epoch` is this-primary's view; the replica's epoch comes from replica-side `ProjectionView` (or from assignment directly). L2 decision required: does the V3 ack frame still carry epoch (wire-compat / simplest), OR does primary maintain a `per-replica epoch cache` (cleaner layering but needs refresh semantics on failover)? These choices produce different failover semantics and different rebuild-trigger conditions — cannot be deferred to implementation.
11. **Write-path vs replication-path concurrency residence** (H6, architect-added). V2 `BlockVol.Write` is one function that does durable-local-write AND triggers `ShipperGroup` broadcast. V3's `StorageBackend.Write → LogicalStorage.Write` is strictly local durable. **Where does replication get triggered?** Three L2 options:
- **Option A** — `StorageBackend.Write` calls local storage, then synchronously calls shipper. Matches V2 semantics, but adapter-layer gains a replication dependency — violates T3a layering.
- **Option B** — introduce a `ReplicatedBackend` wrapping `StorageBackend + shipper`; `frontend.Provider` returns Replicated vs Plain based on RF. Clean layering; adds one entity.
- **Option C** — replication lives inside `DurableProvider`; `StorageBackend` is unaware; Provider intercepts Write. Requires Provider to gain replication responsibility — rhymes with BUG-005's "Provider owns Backend lifecycle" lesson but extends it.
L1 makes no recommendation; L2 must pick and LOCK this before L3 touches files.
---
@@ -162,7 +185,7 @@ These are patterns visible at L1 that will drive L2 bridge verdicts. **Not** rec
For sw / architect / PM before L1 sign:
1. **Scope completeness**: is the 9-entity set complete? Anything I missed — `sync_all_reconnect_protocol` logic (not a distinct type but a cross-entity invariant)? `split_brain_*` tests reference a primary-takeover arbiter — is that an entity or just a protocol handshake?
1. **Scope completeness**: is the 10-entity set complete (post architect F1 adding RebuildBitmap as §2.10)? Anything still missed — `sync_all_reconnect_protocol` logic (not a distinct type but a cross-entity invariant)? `split_brain_*` tests reference a primary-takeover arbiter — is that an entity or just a protocol handshake?
2. **Scope accuracy**: any entity I've categorized at the wrong scope? Notably, **ReplicaReceiver** I've marked "per-volume replica-side" (survives demote/promote). Verify against V2 teardown code — if receiver is actually per-assignment, L2 verdicts change.
3. **RebuildSession volatility stance**: V2 documented "does NOT survive crash". Is there any post-4A work that made rebuild sessions durable? If yes, that changes the Provider-cache lesson in §3 observation 5.
4. **DistGroupCommit residence**: at L1 this is a closure bound to `vol.writeSync`. L2 will ask: does V3 put this inside DurableProvider, next to it as a new entity, or somewhere else entirely? Opinions now save a revision cycle later.
@@ -170,11 +193,24 @@ For sw / architect / PM before L1 sign:
---
## §5 Next steps (post-L1 sign)
## §5 Next steps (ordered; each a gate for the next)
1. **L2** — Fill `v2-v3-contract-bridge-catalogue.md` §3 Replication with per-entity bridge rows (shape tag + V3 embedding note + per-contract verdicts). Reference V3's five model shifts (event / storage / authority / lifecycle / concurrency) explicitly in embedding notes.
2. **L3** — Derive `v3-phase-15-t4-port-plan-sketch.md` structure: scope / V2 file classification / V3 target files / non-claims / test strategy / provisional ledger rows.
3. **T4 T-start three-sign** — architect + PM + QA on the combined L1 + L2 + L3 package.
1. **sw V3 pre-scan** (blocks L1 sign; ~5 min) — sw greps `core/frontend/durable/` + `core/frontend/*.go` for pre-baked replication-adjacent assumptions. Minimum checklist:
- Any `SetReplicaAddrs`, `ReplicaAddrs`, or similar field on `DurableProvider` / `LogicalStorage` / `StorageBackend`?
- Any `Sync` / `Write` return type or callback that implies waiting on remote acks?
- Does `LogicalStorage.Write` return a pure-local LSN or something that looks coordinated across nodes?
- Any `Ship`, `Replicate`, `Quorum`, `Barrier`, `Durability` identifiers in V3 code today?
- Any stub / comment referencing a future replication surface?
Output: one line "V3 has no pre-baked replication assumptions" OR a list of "V3 has X at Y, L2 must align/override". Rationale per architect feedback: BUG-005's latent drift came from "V3 already has implicit convention more opaque than V2 contract" — L1 must surface any such convention BEFORE L2 verdicts lock.
2. **L1 three-sign** — architect + PM + QA sign this doc, with sw pre-scan result attached. Triggers L2 work.
3. **L2** — Fill `v2-v3-contract-bridge-catalogue.md` §3 Replication with per-entity bridge rows (shape tag + V3 embedding note + per-contract verdicts). Must LOCK answers to H5 (cross-node epoch ack window) and H6 (Options A/B/C for write-path vs replication-path residence) from §3. Reference V3's five model shifts (event / storage / authority / lifecycle / concurrency) explicitly in embedding notes.
4. **L3** — Derive `v3-phase-15-t4-port-plan-sketch.md` structure: scope / V2 file classification / V3 target files / non-claims / test strategy / provisional ledger rows.
5. **T4 T-start three-sign** — architect + PM + QA on the combined L1 + L2 + L3 package.
---
@@ -183,3 +219,4 @@ For sw / architect / PM before L1 sign:
| Date | Change | Author |
|---|---|---|
| 2026-04-22 | Initial L1 survey drafted post-T3 close; 9 entities identified; 9 L1-level observations recorded; 5 open questions raised for sw/architect/PM review | QA Owner |
| 2026-04-22 | Architect feedback round 1 incorporated: F1 split RebuildBitmap into standalone §2.10 (10 entities total); F2 ShipperGroup gains "External deps" row citing master-assignment-provided RF as cross-entity contract; F3 ReplicaBarrier scope rewritten to "per-request call-closure BUT queue-state shared per-volume via cond.Wait"; H5 (cross-node epoch ack window) and H6 (Options A/B/C for write-path vs replication-path residence) added to §3 observations; §5 updated to require sw V3 pre-scan as blocking step before L1 three-sign | QA Owner |