fix: three hardware blockers — WAL retention + registry race + shutdown beat

All 43 actions pass on m01/m02 hardware. Auto-failover PASS.
dd_write: 30s → 123ms. Post-failover write: 33,621 IOPS.

1. WAL retention: remove keepup retention floor (MinShippedLSN).
   WAL cannot be pinned during sustained async writes — any pin
   strategy either fills WAL (blocking writes) or over-recycles
   (breaking catch-up). Flusher recycles freely. Future LBA map
   will provide catch-up without WAL retention.
   MinShippedLSN on ShipperGroup retained as diagnostic surface.

2. Registry stale-cleanup race: add RegisteredAt grace period.
   Race: master registers volume → next VS heartbeat arrives before
   VS discovers the volume → stale cleanup deletes the entry →
   failover finds 0 entries. Fix: skip stale cleanup for entries
   registered within 30s (> 2 heartbeat intervals).
   2 new tests: grace protects new entry, old entry still cleaned.

3. Shutdown heartbeat: VS disconnect heartbeat no longer claims
   block inventory authority. Previously, the shutdown beat's
   empty inventory triggered stale cleanup, deleting the entry
   before failover could use it.
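
A minimal Go sketch of fixes 2 and 3 together, not the actual registry code.
The `entry` type, `pruneStale`, the `authoritative` flag, and the demo helpers
are hypothetical names; only `RegisteredAt` and the 30s window come from this
message.

```go
package main

import (
	"fmt"
	"time"
)

// entry is a hypothetical registry record; only RegisteredAt is named in the commit.
type entry struct {
	RegisteredAt time.Time
}

// staleGrace is the 30s window from the commit message (> 2 heartbeat intervals).
const staleGrace = 30 * time.Second

// pruneStale drops registry entries that a heartbeat did not report.
// authoritative is assumed to be false for the shutdown beat, which must not
// claim inventory authority; in that case nothing is pruned.
func pruneStale(reg map[string]entry, reported map[string]bool, authoritative bool, now time.Time) {
	if !authoritative {
		return
	}
	for name, e := range reg {
		// Keep entries the VS reported, and entries still inside the grace period.
		if reported[name] || now.Sub(e.RegisteredAt) < staleGrace {
			continue
		}
		delete(reg, name)
	}
}

// demoRegistry builds one fresh and one old entry relative to now.
func demoRegistry(now time.Time) map[string]entry {
	return map[string]entry{
		"new-vol": {RegisteredAt: now.Add(-5 * time.Second)},
		"old-vol": {RegisteredAt: now.Add(-5 * time.Minute)},
	}
}

// survivors runs one cleanup pass against an empty inventory and returns what is left.
func survivors(authoritative bool) map[string]entry {
	now := time.Now()
	reg := demoRegistry(now)
	pruneStale(reg, map[string]bool{}, authoritative, now)
	return reg
}

func main() {
	fmt.Println(len(survivors(true)), len(survivors(false)))
}
```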

Scenario fix: recovery-baseline-failover.yaml now kills the
correct node (discovered primary, not hardcoded) and connects to
the correct new primary for post-failover verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pingqiu
2026-04-08 22:59:46 -07:00
parent 39f1232fe2
commit e0116fc631
28 changed files with 201 additions and 2971 deletions

@@ -1,288 +0,0 @@
# Protocol Development Process
Date: 2026-03-27
## Purpose
This document defines how `sw-block` protocol work should be developed.
The process is meant to work for:
- V2
- future V3
- or a later block algorithm that is not WAL-based
The point is to make protocol work systematic rather than reactive.
## Core Philosophy
### 1. Design before implementation
Do not start with production code and hope the protocol becomes clear later.
Start with:
1. system contract
2. invariants
3. state model
4. scenario backlog
Only then move to implementation.
### 2. Real failures are inputs, not just bugs
When V1 or V1.5 fails in real testing, treat that as:
- a design requirement
- a scenario source
- a simulator input
Do not patch and forget.
### 3. Simulator is part of the protocol, not a side tool
The simulator exists to answer:
- what should happen
- what must never happen
- which old designs fail
- why the new design is better
It is not a replacement for real testing.
It is the design-validation layer before production implementation.
### 4. Passing tests are not enough
Green tests are necessary, not sufficient.
We also require:
- explicit invariants
- explicit scenario intent
- clear state transitions
- review of assumptions and abstraction boundaries
### 5. Keep hot-path and recovery-path reasoning separate
Healthy steady-state behavior and degraded recovery behavior are different problems.
Both must be designed explicitly.
## Development Ladder
Every major protocol feature should move through these steps:
1. **Problem statement**
- what real bug, limit, or product goal is driving the work
2. **Contract**
- what the protocol guarantees
- what it does not guarantee
3. **State model**
- node state
- coordinator state
- recovery state
- role / epoch / lineage rules
4. **Scenario backlog**
- named scenarios
- source:
- real failure
- design obligation
- adversarial distributed case
5. **Prototype / simulator**
- reduced but explicit model
- invariant checks
- V1 / V1.5 / V2 comparison where relevant
6. **Implementation**
- production code only after the protocol shape is clear enough
7. **Real validation**
- unit
- component
- integration
- real hardware where needed
8. **Feedback loop**
- turn new failures back into scenario/design inputs
## Required Artifacts
For protocol work to be considered real progress, we usually want:
### Design
- design doc
- scenario doc
- comparison doc when replacing an older approach
### Prototype
- simulator or prototype code
- tests that assert protocol behavior
### Implementation
- production patch
- production tests
- docs updated to match the actual algorithm
### Review
- implementation gate
- design/protocol gate
## Two-Gate Rule
We use two acceptance gates.
### Gate 1: implementation
Owned by the coding side.
Questions:
- does it build?
- do tests pass?
- does it behave as intended in code?
### Gate 2: protocol/design
Owned by the design/review side.
Questions:
- is the logic actually sound?
- do tests prove the intended thing?
- are assumptions explicit?
- is the abstraction boundary honest?
A task is not accepted until both gates pass.
## Layering Rule
Keep simulation layers separate.
### `distsim`
Use for:
- protocol correctness
- state transitions
- fencing
- recoverability
- promotion / lineage
- reference-state checking
### `eventsim`
Use for:
- timeout behavior
- timer races
- event ordering
- same-tick / delayed event interactions
Do not duplicate scenarios blindly across both layers.
## Test Selection Rule
Do not choose simulator inputs only from failing tests.
Review all relevant tests and classify them by:
- protocol significance
- simulator value
- implementation specificity
Good simulator candidates often come from:
- barrier truth
- catch-up vs rebuild
- stale message rejection
- failover / promotion safety
- changed-address restart
- mode semantics
Keep real-only tests for:
- wire format
- OS timing
- exact WAL file behavior
- frontend transport specifics
## Version Comparison Rule
When designing a successor protocol:
- keep the old version visible
- reproduce the old failure or limitation
- show the improved behavior in the new version
For `sw-block`, that means:
- `V1`
- `V1.5`
- `V2`
should be compared explicitly where possible.
## Documentation Rule
The docs must track three different things:
### `learn/projects/sw-block/`
Use for:
- project history
- V1/V1.5 algorithm records
- phase records
- real test history
### `sw-block/design/`
Use for:
- active design truth
- V2 and later protocol docs
- scenario backlog
- comparison docs
### `sw-block/.private/phase/`
Use for:
- active execution plan
- log
- decisions
## What Good Progress Looks Like
A good protocol iteration usually has this pattern:
1. real failure or design pressure identified
2. scenario named and written down
3. simulator reproduces the bad case
4. new protocol handles it explicitly
5. implementation follows
6. real tests validate it
If one of those steps is missing, confidence is weaker.
## Bottom Line
The process is:
1. design the contract
2. model the state
3. define the scenarios
4. simulate the protocol
5. implement carefully
6. validate in real tests
7. feed failures back into design
That is the process we should keep using for V2 and any later protocol line.

@@ -1,314 +0,0 @@
# V1, V1.5, and V2 Comparison
Date: 2026-03-27
## Purpose
This document compares:
- `V1`: original replicated WAL shipping model
- `V1.5`: Phase 13 catch-up-first improvements on top of V1
- `V2`: explicit FSM / orchestrator / recoverability-driven design under `sw-block/`
It is a design comparison, not a marketing document.
## 1. One-line summary
- `V1` is simple but weak on short-gap recovery.
- `V1.5` materially improves recovery, but still relies on assumptions and incremental control-plane fixes.
- `V2` is structurally cleaner, more explicit, and easier to validate, but is not yet a production engine.
## 2. Steady-State Hot Path
In the healthy case, all three versions can look similar:
1. primary appends ordered WAL
2. primary ships entries to replicas
3. replicas apply in order
4. durability barrier determines when client-visible commit completes
### V1
- simplest replication path
- lagging replica typically degrades quickly
- little explicit recovery structure
### V1.5
- same basic hot path as V1
- WAL retention and reconnect/catch-up improve short outage handling
- extra logic exists, but much of it is off the hot path
### V2
- can keep a similar hot path if implemented carefully
- extra complexity is mainly in:
- recovery planner
- replica state machine
- coordinator/orchestrator
- recoverability checks
### Performance expectation
In a normal healthy cluster:
- `V2` should not be much heavier than `V1.5`
- most V2 complexity sits in failure/recovery/control paths
- there is no proof yet that V2 has better steady-state throughput or latency
## 3. Recovery Behavior
### V1
Recovery is weakly structured:
- lagging replica tends to degrade
- short outage often becomes rebuild or long degraded state
- little explicit catch-up boundary
### V1.5
Recovery is improved:
- short outage can recover by retained-WAL catch-up
- background reconnect closes the `sync_all` dead-loop
- catch-up-first is preferred before rebuild
But the model is still partly implicit:
- reconnect depends on endpoint stability unless control plane refreshes assignment
- recoverability boundary is not as explicit as V2
- tail-chasing and retention pressure still need policy care
### V2
Recovery is explicit by design:
- `InSync`
- `Lagging`
- `CatchingUp`
- `NeedsRebuild`
- `Rebuilding`
And explicit decisions exist for:
- catch-up vs rebuild
- stale-epoch rejection
- promotion candidate choice
- recoverable vs unrecoverable gap
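
The explicit states and the catch-up vs rebuild decision can be sketched as a small Go model. This is an illustrative sketch only: the state names come from this document, but the transition table, `CanTransition`, `PlanRecovery`, and the LSN-based rule are assumptions, not the V2 engine's actual API.

```go
package main

import "fmt"

// ReplicaState enumerates the per-replica states named in this document.
type ReplicaState int

const (
	InSync ReplicaState = iota
	Lagging
	CatchingUp
	NeedsRebuild
	Rebuilding
)

// allowed is a hypothetical transition table; the real FSM may differ.
var allowed = map[ReplicaState][]ReplicaState{
	InSync:       {Lagging},
	Lagging:      {CatchingUp, NeedsRebuild},
	CatchingUp:   {InSync, Lagging, NeedsRebuild},
	NeedsRebuild: {Rebuilding},
	Rebuilding:   {InSync, NeedsRebuild},
}

// CanTransition reports whether from -> to appears in the table above.
func CanTransition(from, to ReplicaState) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

// PlanRecovery picks catch-up when the next entry the replica needs
// (replicaLSN+1) is still retained by the primary, and rebuild otherwise.
func PlanRecovery(replicaLSN, oldestRetainedLSN uint64) ReplicaState {
	if replicaLSN+1 >= oldestRetainedLSN {
		return CatchingUp
	}
	return NeedsRebuild
}

func main() {
	fmt.Println(PlanRecovery(100, 101) == CatchingUp)   // gap is still recoverable
	fmt.Println(PlanRecovery(100, 102) == NeedsRebuild) // entry 101 was recycled
	fmt.Println(CanTransition(InSync, Rebuilding))      // not a direct transition here
}
```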
## 4. Real V1.5 Lessons
The main V2 requirements come from real V1.5 behavior.
### 4.1 Changed-address restart
Observed in `CP13-8 T4b`:
- replica restarted
- endpoint changed
- primary shipper held stale address
- direct reconnect could not succeed until control plane refreshed assignment
V1.5 fix:
- saved address used only as hint
- heartbeat-reported address becomes source of truth
- master refreshes primary assignment
Lesson for V2:
- endpoint is not identity
- reassignment must be explicit
### 4.2 Reconnect race
Observed in Phase 13 review:
- barrier path and background reconnect path could both trigger reconnect
V1.5 fix:
- `reconnectMu` serializes reconnect / catch-up
Lesson for V2:
- one active recovery session per replica should be a protocol rule, not just a local mutex trick
### 4.3 Tail-chasing
Even with retained WAL:
- primary may write faster than a lagging replica can recover
- catch-up may not converge
Lesson for V2:
- explicit abort / `NeedsRebuild`
- do not pretend catch-up will always work
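
One way to make the abort explicit is to watch the replication gap across catch-up rounds and give up when it stops shrinking. The sketch below is an assumption-laden illustration: `CatchUpMonitor`, `MaxStalled`, and the "gap did not shrink" rule are hypothetical and do not describe the V2 implementation.

```go
package main

import "fmt"

// CatchUpMonitor tracks whether a catch-up session is converging.
// MaxStalled is the number of consecutive non-shrinking rounds tolerated
// before the caller should move the replica to NeedsRebuild.
type CatchUpMonitor struct {
	MaxStalled int
	lastGap    uint64
	stalled    int
	started    bool
}

// Observe records one round and returns true when catch-up should abort.
func (m *CatchUpMonitor) Observe(primaryLSN, replicaLSN uint64) bool {
	var gap uint64
	if primaryLSN > replicaLSN {
		gap = primaryLSN - replicaLSN
	}
	// A round counts as stalled when a non-zero gap failed to shrink.
	if m.started && gap > 0 && gap >= m.lastGap {
		m.stalled++
	} else {
		m.stalled = 0
	}
	m.started = true
	m.lastGap = gap
	return m.stalled >= m.MaxStalled
}

func main() {
	m := &CatchUpMonitor{MaxStalled: 3}
	// The primary writes faster than the replica replays, so the gap grows.
	rounds := [][2]uint64{{1000, 500}, {1200, 600}, {1400, 700}, {1600, 800}}
	for _, r := range rounds {
		fmt.Println(m.Observe(r[0], r[1]))
	}
}
```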
### 4.4 Control-plane recovery latency
V1.5 can be correct but still operationally slow if recovery waits on slower management cycles.
Lesson for V2:
- keep authority in coordinator
- but make recovery decisions explicit and fast when possible
## 5. V2 Structural Improvements
V2 is better primarily because it is easier to reason about and validate.
### 5.1 Better state model
Instead of implicit recovery behavior, V2 has:
- per-replica FSM
- volume/orchestrator model
- distributed simulator with scenario coverage
### 5.2 Better validation
V2 has:
- named scenario backlog
- protocol-state assertions
- randomized simulation
- V1/V1.5/V2 comparison tests
This is a major difference from V1/V1.5, where many fixes were discovered through implementation and hardware testing first.
### 5.3 Better correctness boundaries
V2 makes these explicit:
- recoverable gap vs rebuild
- stale traffic rejection
- promotion lineage safety
- reservation or payload availability transitions
## 6. Stability Comparison
### Current judgment
- `V1`: least stable under failure/recovery stress
- `V1.5`: meaningfully better and now functionally validated on real tests
- `V2`: best protocol structure and best simulator confidence
### Important limit
`V2` is not yet proven more stable in production because:
- it is not a production engine yet
- confidence comes from simulator/design work, not real block workload deployment
So the accurate statement is:
- `V2` is more stable **architecturally**
- `V1.5` is more stable **operationally today** because it is implemented and tested on real hardware
## 7. Performance Comparison
### What is likely true
`V2` should perform better than rebuild-heavy recovery approaches when:
- outage is short
- gap is recoverable
- catch-up avoids full rebuild
It should also behave better under:
- flapping replicas
- stale delayed messages
- mixed-state replica sets
### What is not yet proven
We do not yet know whether `V2` has:
- better steady-state throughput
- lower p99 latency
- lower CPU overhead
- lower memory overhead
than `V1.5`.
That requires real implementation and benchmarking.
## 8. Smart WAL Fit
### Why Smart WAL is awkward in V1/V1.5
V1/V1.5 do not naturally model:
- payload classes
- recoverability reservations
- historical payload resolution
- explicit recoverable/unrecoverable transition
So Smart WAL would be harder to add cleanly there.
### Why Smart WAL fits V2 better
V2 already has the right conceptual slots:
- `RecoveryClass`
- `WALInline`
- `ExtentReferenced`
- recoverability planner
- catch-up vs rebuild decision point
- simulator for payload-availability transitions
### Important rule
Smart WAL must not mean:
- “read current extent for old LSN”
That is incorrect.
Historical correctness requires:
- WAL inline payload
- or pinned snapshot/versioned extent state
- not current live extent contents
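
The rule can be illustrated with a small Go sketch of payload resolution for an old LSN. `RecoveryClass`, `WALInline`, and `ExtentReferenced` are names from this document; the `Record` fields, `SnapshotReader`, `ResolvePayload`, and `ErrUnrecoverable` are hypothetical and only show the shape of the rule.

```go
package main

import (
	"errors"
	"fmt"
)

// RecoveryClass says where the historical payload for a record lives.
type RecoveryClass int

const (
	WALInline RecoveryClass = iota
	ExtentReferenced
)

// Record is a hypothetical WAL record descriptor.
type Record struct {
	LSN        uint64
	Class      RecoveryClass
	Inline     []byte // payload carried in the WAL (WALInline)
	SnapshotID uint64 // pinned snapshot holding the payload; 0 means none
}

// ErrUnrecoverable signals that catch-up must give way to rebuild.
var ErrUnrecoverable = errors.New("payload for LSN is no longer recoverable")

// SnapshotReader reads a payload from pinned, versioned state only.
type SnapshotReader func(snapshotID, lsn uint64) ([]byte, bool)

// ResolvePayload never falls back to the current live extent: without an
// inline payload or a pinned snapshot, the record is unrecoverable.
func ResolvePayload(r Record, readPinned SnapshotReader) ([]byte, error) {
	switch r.Class {
	case WALInline:
		if r.Inline != nil {
			return r.Inline, nil
		}
	case ExtentReferenced:
		if r.SnapshotID != 0 && readPinned != nil {
			if data, ok := readPinned(r.SnapshotID, r.LSN); ok {
				return data, nil
			}
		}
	}
	return nil, ErrUnrecoverable
}

func main() {
	_, err := ResolvePayload(Record{LSN: 7, Class: ExtentReferenced}, nil)
	fmt.Println(err)
}
```

Returning an error instead of reading live extent contents is what lets a planner turn the situation into an explicit rebuild decision.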
## 9. What Is Proven Today
### Proven
- `V1.5` significantly improves V1 recovery behavior
- real `CP13-8` testing validated the V1.5 data path and `sync_all` behavior
- the V2 simulator covers:
- stale traffic rejection
- tail-chasing
- flapping replicas
- multi-promotion lineage
- changed-address restart comparison
- same-address transient outage comparison
- Smart WAL availability transitions
### Not yet proven
- V2 production implementation quality
- V2 steady-state performance advantage
- V2 real hardware recovery performance
## 10. Bottom Line
If choosing based on current evidence:
- use `V1.5` as the production line today
- use `V2` as the better long-term architecture
If choosing based on protocol quality:
- `V2` is clearly better structured
- `V1.5` is still more ad hoc, even after successful fixes
If choosing based on current real-world proof:
- `V1.5` has the stronger operational evidence today
- `V2` has the stronger design and simulation evidence today

@@ -1,108 +0,0 @@
# V2 Assignment Translation Unification
Date: 2026-04-04
Status: active
## Purpose
This note defines how assignment translation should be unified so that:
1. `weed/storage/blockvol/v2bridge/control.go`
2. `weed/server/volume_server_block.go`
do not drift on identity, role, or recovery-target meaning.
## Current Drift Risk
Today there are two live translation sites:
1. `ControlBridge.ConvertAssignment()` produces `engine.AssignmentIntent`
2. `BlockService.coreAssignmentEvent()` produces `engine.AssignmentDelivered`
They do not translate the same source type, but they do share semantic rules:
1. how to build stable `ReplicaID`
2. how to map role-shaped inputs to recovery target
3. how to represent one local replica endpoint in engine types
If those rules stay duplicated, later migration batches will reintroduce split
truth.
## Canonical Rule Placement
The canonical reusable rules belong in:
1. `sw-block/bridge/blockvol`
Why:
1. the rules are semantic translation, not product integration
2. both `weed/storage/blockvol/v2bridge` and `weed/server` can import this
package
3. `sw-block` remains weed-free
## Rules That Must Be Canonical
### 1. Stable identity
Canonical helper:
1. `MakeReplicaID()`
2. `ReplicaAssignmentForServer()`
Rule:
1. `ReplicaID = <volume>/<server-id>`
2. never derive identity from transport address
### 2. Recovery-target mapping
Canonical helper:
1. `RecoveryTargetForRole()`
Rule:
1. `replica -> catchup`
2. `rebuilding -> rebuild`
3. all other roles -> no recovery target
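
The identity and recovery-target rules above are small enough to sketch directly. The helper names `MakeReplicaID` and `RecoveryTargetForRole` come from this note, but their signatures, the `RecoveryTarget` type, and the string role values are assumptions; the real helpers in `sw-block/bridge/blockvol` may look different.

```go
package main

import "fmt"

// RecoveryTarget is a hypothetical type for the mapped recovery target.
type RecoveryTarget string

const (
	TargetNone    RecoveryTarget = ""
	TargetCatchUp RecoveryTarget = "catchup"
	TargetRebuild RecoveryTarget = "rebuild"
)

// MakeReplicaID builds <volume>/<server-id>. It deliberately takes no
// transport address, so identity cannot be derived from an endpoint.
func MakeReplicaID(volume, serverID string) string {
	return volume + "/" + serverID
}

// RecoveryTargetForRole applies the mapping listed above: replica -> catchup,
// rebuilding -> rebuild, and every other role -> no recovery target.
func RecoveryTargetForRole(role string) RecoveryTarget {
	switch role {
	case "replica":
		return TargetCatchUp
	case "rebuilding":
		return TargetRebuild
	default:
		return TargetNone
	}
}

func main() {
	fmt.Println(MakeReplicaID("vol1", "vs-a"))
	fmt.Println(RecoveryTargetForRole("replica"), RecoveryTargetForRole("rebuilding"))
}
```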
### 3. Endpoint packaging
Canonical helper:
1. `ReplicaAssignmentForServer()`
Rule:
1. adapter code may still source endpoint fields from different wire/runtime
inputs
2. but once packaged into `engine.ReplicaAssignment`, the shape must be uniform
## What Still Stays Local
These parts remain adapter-local and should NOT be forced into one helper yet:
1. reading `blockvol.BlockVolumeAssignment`
2. deciding whether the local VS is primary/replica/rebuilding in a given
runtime context
3. multi-replica traversal over heartbeat/master wire structures
Those are source-format adaptation concerns, not canonical translation rules.
## Implemented First Step
The first unification step is already applied:
1. `sw-block/bridge/blockvol/control_adapter.go`
- now exports the canonical helpers
2. `weed/server/volume_server_block.go`
- now consumes the same helper layer for local assignment rebinding
## Next Step
The next step after this document is:
1. reduce `weed/storage/blockvol/v2bridge/control.go` to source-format
extraction only
2. keep all shared semantic mapping rules in `sw-block/bridge/blockvol`

@@ -1,147 +0,0 @@
# V2 Bounded Internal Pilot Pack
Date: 2026-04-05
Status: draft
Purpose: define the bounded internal engineering validation pack around the
current `Phase 18` RF2 runtime-bearing envelope without silently broadening scope
## Reading Rule
This pilot pack is a bounded validation package for the current RF2
runtime-bearing envelope.
It does NOT mean:
1. broad launch approval
2. generic production readiness
3. proof that the current runtime path is already a working block product
4. permission to redefine exclusions through pilot success
It means only:
1. the team may run limited internal engineering validation inside the accepted
runtime-bearing envelope
2. validation outcomes must be read against the delivered `Phase 18` boundary
3. incidents must be routed explicitly instead of becoming vague rollout lore
## Pilot Scope
This pack is limited to the current `Phase 18` runtime-bearing envelope:
1. kernel/runtime path:
- `masterv2` identity authority
- `volumev2` runtime-owned failover / Loop 2 / continuity / RF2 surface path
- `purev2` execution adapter reuse
2. validation shape:
- bounded in-process runtime exercises
- artifact-driven review only
3. supported proof shape:
- failover-time evidence seam
- active Loop 2 observation
- continuity handoff statement
- compressed RF2 outward surface
4. excluded surface classes:
- real product frontends
- broad operator APIs
- real transport-backed product traffic
Anything outside that scope is not a finding for this pack.
It is either a known exclusion, an explicit blocker, or later widening work.
## Pilot Environment And Topology
The validation environment must stay fixed and reviewable:
1. use one explicit build/commit package for all pilot nodes
2. keep topology inside the bounded `RF=2` runtime-bearing path and do not
introduce `RF>2`
3. do not introduce real frontend/product traffic or broad transport claims
4. pin operator-facing configuration and startup procedure in a written runbook
5. expose the runtime diagnosis surfaces needed to read:
- failover snapshot/result
- Loop 2 snapshot
- continuity snapshot
- RF2 outward surface
If the validation needs ad hoc operator judgment to stay healthy, the pack is not
ready.
## Success Criteria
The bounded validation is considered successful only if ALL of the following
hold:
1. no observed behavior contradicts the bounded `Phase 18` runtime envelope
2. no observed behavior contradicts the bounded fail-closed reading of the new
runtime path
3. incidents can be classified using the explicit buckets in this pack without
inventing new ambiguous categories
4. operators can execute preflight, bounded validation, and diagnosis from
written artifacts rather than tribal knowledge
5. findings do not require silently widening the supported envelope
6. the review outcome remains consistent with the current `block expansion /
not pilot-ready` judgment unless new closure explicitly changes it
Validation success validates the current bounded envelope only.
It does not create a broader product claim by itself.
## Incident Intake And Classification
Every incident must record:
1. time, node set, workload, and surface involved
2. observed symptom
3. affected bounded claim, exclusion, or blocker
4. diagnosis evidence used
5. immediate operator action taken
6. final classification
Allowed classification buckets:
1. `config / environment issue`
- the product behaved inside the bounded claim, but the deployment violated the
pilot preflight or environment assumptions
2. `known exclusion`
- the incident came from a surface or claim already excluded from the first
launch matrix
3. `true product bug`
- the incident contradicts an accepted bounded claim or reveals a real gap
inside the named chosen envelope
If an incident does not fit one of those buckets, stop the validation and refine
the artifact set before continuing.
## Decision Outputs
At the end of a bounded validation window, the allowed outcomes are:
1. `stay in bounded validation`
- more evidence is needed inside the same envelope
2. `widen bounded engineering exposure`
- the review may expand only internal engineering validation inside the same
envelope
3. `block expansion`
- a contradiction, repeated unresolved bug, or operational ambiguity prevents
widening
These outcomes require the bounded envelope review artifact.
This pack does not replace that review.
## Explicit Non-Claims
This pack does NOT claim:
1. generic production proof from limited validation success
2. support for `RF>2`
3. support for a broad transport/frontend matrix
4. broad automatic failover guarantees
5. hours/days soak proof outside the bounded runtime-bearing reading
## Primary Inputs
1. `sw-block/design/v2-rf2-runtime-bounded-envelope.md`
2. `sw-block/design/v2-rf2-runtime-bounded-envelope-review.md`
3. `sw-block/.private/phase/phase-18.md`
4. `sw-block/design/v2-protocol-claim-and-evidence.md`
5. `sw-block/design/v2-product-completion-overview.md`

@@ -405,6 +405,31 @@ Primary proof tiers:
| 7 | Product surfaces | CSI and frontend projection of V2 truth | integrated |
| 8 | Launch envelope | bounded support and rollout claims | integrated + soak |
## Matrix Linkage
Use the active documents in a fixed order:
1. protocol docs define the rule
2. this capability map defines which product tier owns the rule
3. `v2-validation-matrix.md` defines what must be proven for closure
4. `v2-integration-matrix.md` defines which real scenarios exercise the path
The goal is to make the chain explicit:
`protocol -> capability tier -> validation rows -> integration rows`
| Tier | Primary protocol refs | Validation rows | Integration rows | Practical meaning |
|------|------------------------|-----------------|------------------|-------------------|
| 0 | `v2-protocol-truths.md`, `v2-sync-recovery-protocol.md` | `V4`, `V5`, `V14` | feeds `I-V1` through `I-V6` | pure semantic truth and fail-closed rules |
| 1 | `v2-protocol-truths.md` | `V1` | `I-V1` | single-volume and bootstrap correctness |
| 2 | `v2-sync-recovery-protocol.md` | `V1`, `V2`, `V4` | `I-V1`, `I-V2` | RF=2 replication base and barrier/publication closure |
| 3 | `v2-sync-recovery-protocol.md`, `v2-rebuild-mvp-session-protocol.md` | `R1`-`R12`, `V3`, `V6`, `V7`, `V8`, `V11` | `I-R1`-`I-R8`, `I-V3`, `I-V4`, `I-V5` | recovery, rebuild, failover, and rejoin |
| 4 | `v2-sync-recovery-protocol.md` | `V9`, `V10` | future `RF>=3` integrated rows | aggregate multi-replica projection and durability semantics |
| 5 | `v2-rebuild-mvp-session-protocol.md`, snapshot/restore execution docs | `S1`-`S10` | `I-S1`-`I-S4` | snapshot, restore, and lifecycle operations |
| 6 | `v2-automata-ownership-map.md`, `v2-protocol-claim-and-evidence.md` | `V8`, `V12`, `V13` | `I-V4`, `I-V6` | control-plane truth, observability, and operator surfaces |
| 7 | product-surface and rollout docs | `V1`, `V2`, `V12`, `V13` | runner scenarios and product e2e packs | CSI/frontend projection of V2 truth |
| 8 | rollout/support docs | stage-gate summaries in validation matrix | chaos/perf rows `I-C1`-`I-C4`, `I-P1`-`I-P3` | bounded launch envelope and operational confidence |
## Test Expansion Strategy From This Map
This map should drive testing in a faster order than "one expensive scenario at a time."

@@ -1,143 +0,0 @@
# V2 Controlled Rollout Review
Date: 2026-04-05
Status: draft
Purpose: define the bounded review used to decide whether internal engineering
validation on the current `Phase 18` RF2 runtime envelope stays limited, widens
inside the same envelope, or blocks expansion
## Reading Rule
This artifact is a bounded decision gate after runtime-envelope validation.
It does NOT mean:
1. broad launch approval
2. generic production readiness
3. permission to widen beyond the frozen runtime envelope
4. permission to reinterpret validation survival as new protocol/runtime proof
It means only:
1. validation outcomes may be reviewed against the already-accepted bounded
envelope
2. expansion decisions must stay inside the same named support boundary
3. any broader claim still needs explicit new evidence and explicit new review
## Allowed Decisions
The rollout review may produce only one of these outputs:
1. `stay in bounded validation`
- the chosen envelope is still the right boundary, but more bounded validation
evidence is needed before any exposure increase
2. `widen bounded engineering exposure`
- exposure may increase only inside the current bounded engineering envelope,
with no change to the named support boundary
3. `block expansion`
- the current evidence, incident record, or operational ambiguity is not strong
enough to increase exposure safely
Any outcome outside those three is invalid for this review.
## Required Inputs
The review must not start unless these inputs exist and are explicit:
1. the frozen runtime envelope
2. the bounded pilot pack
3. the preflight checklist outcome(s)
4. the pilot stop-condition artifact
5. incident records with explicit classification
6. validation outcome summary for the bounded runtime-bearing path
7. the accepted evidence anchors that define the current boundary:
- `Phase 18 M1-M4`
- `v2-rf2-runtime-bounded-envelope.md`
- `v2-rf2-runtime-bounded-envelope-review.md`
If any required input is missing, the correct review output is `block expansion`.
## Decision Questions
The rollout review must answer all of the following:
1. did validation remain fully inside the frozen runtime envelope
2. did any observed behavior contradict the bounded `Phase 18` runtime envelope
3. were any stop conditions triggered, and if so, how were they resolved
4. are all incidents classified cleanly as:
- `config / environment issue`
- `known exclusion`
- `true product bug`
5. does any proposed next step depend on a broader claim than the current
envelope
6. can operators run the validation and diagnose bounded failures from written
artifacts rather than tribal knowledge
If the answer to question 5 is yes, the review must not approve widening inside
this artifact. That request belongs to later evidence expansion work.
## Decision Rules
Use these bounded rules:
1. approve `stay in bounded validation` when:
- validation stayed inside scope
- no contradiction to accepted bounded claims was found
- more same-envelope evidence is still needed
2. approve `widen bounded engineering exposure` only when:
- validation stayed inside scope
- no unresolved `true product bug` remains against the bounded envelope
- stop conditions did not reveal structural ambiguity
- operator workflow is explicit and repeatable from the artifact set
- the widened exposure does not change the bounded envelope
3. approve `block expansion` when:
- any unresolved contradiction exists
- any unresolved `true product bug` exists
- incident records are vague
- operators depend on tribal knowledge
- the requested widening outruns the current matrix
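
The rules above can be read as a fail-closed decision function. The Go sketch below is only one possible encoding: the field names and `Decide` are hypothetical, and two choices are assumptions rather than rules stated here, namely that validation leaving scope and structural ambiguity both map to `block expansion`.

```go
package main

import "fmt"

// Outcome is one of the three allowed review decisions.
type Outcome string

const (
	StayInBoundedValidation Outcome = "stay in bounded validation"
	WidenBoundedExposure    Outcome = "widen bounded engineering exposure"
	BlockExpansion          Outcome = "block expansion"
)

// ReviewInputs is a hypothetical summary of the review evidence.
type ReviewInputs struct {
	RequiredInputsPresent   bool
	StayedInScope           bool
	UnresolvedContradiction bool
	UnresolvedProductBug    bool
	IncidentRecordsVague    bool
	TribalKnowledgeNeeded   bool
	WideningOutrunsMatrix   bool
	StructuralAmbiguity     bool
	MoreEvidenceNeeded      bool
}

// Decide checks blocking conditions first, so a zero-value input blocks.
func Decide(in ReviewInputs) Outcome {
	if !in.RequiredInputsPresent || !in.StayedInScope ||
		in.UnresolvedContradiction || in.UnresolvedProductBug ||
		in.IncidentRecordsVague || in.TribalKnowledgeNeeded ||
		in.WideningOutrunsMatrix || in.StructuralAmbiguity {
		return BlockExpansion
	}
	if in.MoreEvidenceNeeded {
		return StayInBoundedValidation
	}
	return WidenBoundedExposure
}

func main() {
	clean := ReviewInputs{RequiredInputsPresent: true, StayedInScope: true}
	fmt.Println(Decide(ReviewInputs{}))
	fmt.Println(Decide(clean))
}
```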
## Explicit Review Record
Each review result must record:
1. decision outcome
2. date and reviewer set
3. validation window / environment covered
4. summary of incidents by classification bucket
5. any stop-condition events and their disposition
6. exact reason the decision stays inside the current envelope
7. explicit next action:
- continue bounded validation
- widen engineering exposure inside the same envelope
- pause and fix
## Rejection Rules
Reject the review as invalid if:
1. it uses validation success as generic production proof
2. it broadens topology, runtime path, or supported surfaces without a new
evidence package
3. it treats a known exclusion as if validation cleared it
4. it ignores stop-condition events or unresolved true product bugs
5. it cannot map the decision back to the accepted evidence ladder
## Explicit Non-Claims
This artifact does NOT claim:
1. broad rollout approval
2. generic production readiness
3. support for `RF>2`
4. support for a broad transport/frontend matrix
5. broad failover-under-load or long-window soak proof
## Primary Inputs
1. `sw-block/design/v2-rf2-runtime-bounded-envelope.md`
2. `sw-block/design/v2-rf2-runtime-bounded-envelope-review.md`
3. `sw-block/design/v2-bounded-internal-pilot-pack.md`
4. `sw-block/design/v2-pilot-preflight-checklist.md`
5. `sw-block/design/v2-pilot-stop-conditions.md`
6. `sw-block/.private/phase/phase-18.md`

@@ -1,138 +0,0 @@
# V2 First-Launch Supported Matrix
Date: 2026-04-04
Status: draft
Purpose: freeze the first bounded launch envelope from accepted `Phase 12-17`
evidence, with explicit supported scope, explicit exclusions, and explicit
launch blockers
## Reading Rule
This document is a bounded support matrix.
It does NOT mean:
1. broad launch approval
2. generic production readiness
3. support for every failover/restart/disturbance branch
4. broad transport/frontend approval outside the named chosen path
It means only:
1. these are the strongest support statements currently justified by accepted
evidence
2. anything outside them is an exclusion, a blocker, or future
productionization work
## Supported Matrix
| Dimension | Supported in first draft | Boundary rule | Primary evidence |
|-----------|--------------------------|---------------|------------------|
| Replication factor | `RF=2` | bounded chosen path only | `CP13-1..7`, `C-RF2-SYNCALL-CONTRACT` |
| Durability mode | `sync_all` | bounded chosen path only | `CP13`, `v2-protocol-claim-and-evidence.md` |
| Control/runtime path | existing master / volume-server heartbeat path | same path as `Phase 10`, `Phase 16`, `Phase 17` checkpoints | `Phase 10`, `Phase 16` finish-line review |
| Semantic owner | explicit `V2 core` | semantics stay `V2`-owned even when implementation reuses `weed/` and `blockvol` | `Phase 14-16` |
| Execution backend | `blockvol` via `v2bridge` | reuse implementation; no V1 semantic inheritance | `Phase 09`, `Phase 14-16` |
| Product surfaces | bounded `iSCSI`, bounded `CSI`, bounded `NVMe` on the chosen path | not a generic transport matrix | `Phase 11`, publication tests, NVMe tests |
| Restart/failover reading | bounded `17B` + `17C` interpretation | use only the explicit contract and policy table from `Phase 17` | `phase-17.md`, `phase-17-checkpoint-review.md` |
## Supported Statement
The strongest currently supported first-launch statement is:
1. on the bounded chosen `RF=2 sync_all` path, using the existing
master/volume-server heartbeat path, explicit `V2`-owned semantics and the
accepted `Phase 16-17` contract/policy package provide one finite support
envelope for bounded block product use
2. bounded `iSCSI`, `CSI`, and `NVMe` surfaces are supported only inside that
same chosen-path interpretation
3. failover/publication and disturbance behavior must be read through the
explicit `Phase 17` contract/policy package, not through broader inferred
product assumptions
## Explicit Exclusions
The following are OUTSIDE the first-launch supported matrix:
1. `RF>2`
2. durability modes outside the accepted bounded envelope
3. broad transport/frontend matrix approval beyond bounded `iSCSI` / `CSI` /
`NVMe` chosen-path support
4. broad whole-surface failover/publication proof
5. broad restart-window behavior outside the explicit `17C` disturbance policy
table
6. generic soak/pilot success as production proof
7. broad rollout approval
## Launch Blockers
These are still required before this matrix can be read as a real launch
decision package:
1. a frozen `Phase 17` checkpoint review outcome
2. any additional evidence needed if the product wants claims broader than the
current bounded `17B/17C` contract/policy package
## Current Productionization Artifacts
The first bounded productionization artifacts now exist for the chosen path:
1. `v2-bounded-internal-pilot-pack.md`
   - bounded pilot scope
   - success criteria
   - incident classification
2. `v2-pilot-preflight-checklist.md`
   - start/resume gate for the bounded pilot
3. `v2-pilot-stop-conditions.md`
   - stop/contain/rollback-exposure rules
4. `v2-controlled-rollout-review.md`
   - bounded post-pilot decision gate
   - allowed outcomes: stay in pilot / widen within same envelope / block expansion
## Not Launch-Blocking In This Draft
These are intentionally NOT blockers for the bounded first-draft envelope:
1. lack of `RF>2` support
2. lack of broad transport/frontend approval
3. lack of broad launch approval language
4. lack of generic soak proof inside the phase package itself
## Claim Mapping Rule
Any first-launch support claim must map back to accepted evidence in ALL of the
following layers:
1. hardening/floor layer
   - `Phase 12 P4`
2. contract/workload/mode layer
   - `CP13-1..9`
3. runtime truth-closure layer
   - `Phase 16` finish-line checkpoint
4. product-claim checkpoint layer
   - `Phase 17A-17D`
If a claim cannot map cleanly back through those layers:
1. it is not in the first-launch matrix
2. it belongs in exclusions, blockers, or later productionization work
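The mapping rule above can be sketched as a small check. This is an illustration only: the `Claim` type, the `Layer` constants, and both helper functions are hypothetical names, not part of the repository.

```go
package main

import "fmt"

// Layer names one of the four evidence layers from the claim mapping rule.
type Layer string

const (
	LayerHardening    Layer = "hardening/floor"          // Phase 12 P4
	LayerContract     Layer = "contract/workload/mode"   // CP13-1..9
	LayerRuntime      Layer = "runtime truth-closure"    // Phase 16 finish-line checkpoint
	LayerProductClaim Layer = "product-claim checkpoint" // Phase 17A-17D
)

// requiredLayers lists every layer a first-launch claim must map back to.
var requiredLayers = []Layer{LayerHardening, LayerContract, LayerRuntime, LayerProductClaim}

// Claim is a hypothetical record: a support statement plus the accepted
// evidence cited for each layer (an empty string means nothing is cited).
type Claim struct {
	Statement string
	Evidence  map[Layer]string
}

// MissingLayers returns the layers for which the claim cites no evidence.
func MissingLayers(c Claim) []Layer {
	var missing []Layer
	for _, l := range requiredLayers {
		if c.Evidence[l] == "" {
			missing = append(missing, l)
		}
	}
	return missing
}

// InFirstLaunchMatrix reports whether the claim maps back through all layers.
func InFirstLaunchMatrix(c Claim) bool {
	return len(MissingLayers(c)) == 0
}

func main() {
	c := Claim{
		Statement: "bounded iSCSI on the chosen RF=2 sync_all path",
		Evidence: map[Layer]string{
			LayerHardening: "Phase 12 P4",
			LayerContract:  "CP13-1..9",
			LayerRuntime:   "Phase 16 finish-line checkpoint",
		},
	}
	fmt.Println(InFirstLaunchMatrix(c), MissingLayers(c))
}
```

A claim with any missing layer is treated as outside the matrix rather than partially supported, which matches the rule's all-layers wording.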
## Operator Reading Guide
When using this matrix, read it with these constraints:
1. bounded chosen path first, not generic platform promise
2. explicit exclusions are real product boundaries, not temporary omissions
3. launch blockers are real blockers, not optional polish
4. pilot success later may validate this envelope, but cannot redefine it
## Primary References
1. `sw-block/.private/phase/phase-17.md`
2. `sw-block/.private/phase/phase-17-checkpoint-review.md`
3. `sw-block/design/v2-product-completion-overview.md`
4. `sw-block/design/v2-protocol-claim-and-evidence.md`
5. `sw-block/design/v2-bounded-internal-pilot-pack.md`
6. `sw-block/design/v2-pilot-preflight-checklist.md`
7. `sw-block/design/v2-pilot-stop-conditions.md`
8. `sw-block/design/v2-controlled-rollout-review.md`


@@ -1,99 +0,0 @@
# V2 Legacy Runtime Exit Criteria
Date: 2026-04-04
Status: active
## Purpose
This note defines when legacy runtime-owner paths may be downgraded from
required compatibility coverage to removable implementation history.
Current legacy examples:
1. `legacy P4` live-path proofs
2. no-core startup paths
3. `HandleAssignmentResult()`-driven recovery startup kept for compatibility
## Current Position
For the current phase, legacy paths must remain:
1. as compatibility guards
2. as regression protection for no-core behavior
3. but NOT as semantic authority proof for the core-present path
## Exit Criteria
A legacy runtime-owner path may be downgraded or removed only when all of the
following are true.
### 1. V2-native proof replacement exists
There must be core-present proofs covering the same behavior category:
1. assignment entry ownership
2. task startup ownership
3. execution ownership
4. observation return ownership
5. outward surface consistency
### 2. Compatibility mode is no longer required operationally
At least one of these must be true:
1. production startup always wires `v2Core`
2. no-core path is explicitly declared unsupported
3. no remaining product surface depends on no-core runtime startup
### 3. The legacy path is no longer the only guard for a runtime mechanic
Examples:
1. serialized replacement/drain behavior
2. shutdown drain behavior
3. live plan-to-execute behavior
These must have equivalent core-present coverage before legacy deletion.
### 4. No semantic truth still depends on legacy behavior
Specifically, removing the legacy path must not change:
1. identity meaning
2. recovery classification
3. publication meaning
4. durable-boundary meaning
If removal changes any of those, the legacy path was still hiding semantic
authority and cannot be retired yet.
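A minimal sketch of how the four exit criteria combine, assuming a hypothetical `ExitCheck` record filled in by a reviewer. Criterion 2 is the only one with "at least one of" semantics; the other three must each hold.

```go
package main

import "fmt"

// ExitCheck is a hypothetical checklist record; each field mirrors one of the
// exit criteria above. None of these names exist in the codebase.
type ExitCheck struct {
	// Criterion 1: core-present proofs cover all five behavior categories.
	NativeProofsExist bool

	// Criterion 2: at least one of the following three must hold.
	StartupAlwaysWiresCore    bool
	NoCoreDeclaredUnsupported bool
	NoSurfaceNeedsNoCore      bool

	// Criterion 3: runtime mechanics have equivalent core-present coverage.
	MechanicsCoveredElsewhere bool

	// Criterion 4: removal changes no identity, recovery classification,
	// publication, or durable-boundary meaning.
	SemanticsUnchanged bool
}

// MayRetire reports whether all four criteria are satisfied.
func (e ExitCheck) MayRetire() bool {
	compatNotRequired := e.StartupAlwaysWiresCore ||
		e.NoCoreDeclaredUnsupported ||
		e.NoSurfaceNeedsNoCore
	return e.NativeProofsExist &&
		compatNotRequired &&
		e.MechanicsCoveredElsewhere &&
		e.SemanticsUnchanged
}

func main() {
	// Example values only: proofs are incomplete and no-core startup is still needed.
	current := ExitCheck{MechanicsCoveredElsewhere: true}
	fmt.Println("may retire:", current.MayRetire())
}
```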
## Downgrade Stages
Legacy paths should retire in stages:
### Stage 1: authority downgrade
1. keep tests
2. explicitly classify them as compatibility-only
### Stage 2: runtime fallback downgrade
1. keep fallback code only where product startup still needs it
2. stop expanding proof claims from those paths
### Stage 3: deletion candidate
1. delete tests or move them to legacy-only coverage
2. remove runtime fallback code only after the new path is already the sole
supported owner
## Current Judgment
As of the current separation work:
1. `legacy P4` stays
2. it is already downgraded to compatibility guard
3. it is not yet removable because:
   - no-core behavior still exists
   - full runtime-loop closure is not yet complete
   - not every old ownership proof has a complete core-present replacement


@@ -1,161 +0,0 @@
# V2 Open Questions
Date: 2026-03-27
## Purpose
This document records what is still algorithmically open in V2.
These are not bugs.
They are design questions that should be closed deliberately before or during implementation slicing.
## 1. Recovery Session Ownership
Open question:
- what is the exact ownership model for one active recovery session per replica?
Need to decide:
- session identity fields
- supersede vs reject vs join behavior
- how epoch/session invalidates old recovery work
Why it matters:
- V1.5 needed local reconnect serialization
- V2 should make this a protocol rule
## 2. Promotion Threshold Strictness
Open question:
- must a promotion candidate always have `FlushedLSN >= CommittedLSN`, or is there any narrower safe exception?
Current prototype:
- uses committed-prefix sufficiency as the safety gate
Why it matters:
- determines how strict real failover behavior should be
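For reference, the strict reading of the threshold is a one-line gate. The sketch below assumes a hypothetical `Candidate` view and an unsigned `LSN` type; it shows only the strict form and does not settle whether a narrower exception is safe.

```go
package main

import "fmt"

// LSN is a log sequence number. The concrete type in the real code is an
// assumption here.
type LSN uint64

// Candidate is a hypothetical view of a promotion candidate.
type Candidate struct {
	ReplicaID  string
	FlushedLSN LSN
}

// MayPromote applies the strict gate: the candidate must have durably flushed
// at least the committed prefix.
func MayPromote(c Candidate, committed LSN) bool {
	return c.FlushedLSN >= committed
}

func main() {
	committed := LSN(100)
	for _, c := range []Candidate{
		{ReplicaID: "r1", FlushedLSN: 100},
		{ReplicaID: "r2", FlushedLSN: 99},
	} {
		fmt.Println(c.ReplicaID, MayPromote(c, committed))
	}
}
```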
## 3. Recovery Reservation Shape
Open question:
- what exactly is reserved during catch-up?
Need to decide:
- WAL range only?
- payload pins?
- snapshot pin?
- expiry semantics?
Why it matters:
- recoverability must be explicit, not hopeful
## 4. Smart WAL Payload Classes
Open question:
- which payload classes are allowed in V2 first?
Current model has:
- `WALInline`
- `ExtentReferenced`
Need to decide:
- whether first real implementation includes both
- whether `ExtentReferenced` requires pinned snapshot/versioned extent only
## 5. Smart WAL Garbage Collection Boundary
Open question:
- when can a referenced payload stop being recoverable?
Need to decide:
- GC interaction
- timeout interaction
- recovery session pinning
Why it matters:
- this is the line between catch-up and rebuild
## 6. Exact Orchestrator Scope
Open question:
- how much of the final V2 control logic belongs in:
  - local node state
  - coordinator
  - transport/session manager
Why it matters:
- avoid V1-style scattered state ownership
## 7. First Real Implementation Slice
Open question:
- what is the first production slice of V2?
Candidates:
1. per-replica sender/session ownership
2. explicit recovery-session management
3. catch-up/rebuild decision plumbing
Recommended default:
- per-replica sender/session ownership
## 8. Steady-State Overhead Budget
Open question:
- what overhead is acceptable in the normal healthy case?
Need to decide:
- metadata checks on hot path
- extra state bookkeeping
- what stays off the hot path
Why it matters:
- V2 should be structurally better without becoming needlessly heavy
## 9. Smart WAL First-Phase Goal
Open question:
- is the first Smart WAL goal:
  - lower recovery cost
  - lower steady-state WAL volume
  - or just proof of the historical correctness model?
Recommended answer:
- first prove the correctness model, then optimize
## 10. End Condition For Simulator Work
Open question:
- when do we stop adding simulator depth and start implementation?
Suggested answer:
- once acceptance criteria are satisfied
- and the first implementation slice is clear
- and remaining simulator additions are no longer changing core protocol decisions


@@ -1,319 +0,0 @@
# V2 Phase 14+ Semantic-First Framework
Date: 2026-04-03
Status: active
Purpose: define the overall `Phase 14+` implementation framework so `V2`
runtime extraction is driven by semantics first: core-owned state and
transitions, then command rules, then projection contracts, and only then
adapter rebinding
## Why This Document Exists
`Phase 13` closed one bounded constrained-runtime contract package:
1. real-workload validation
2. assignment/publication closure
3. bounded mode normalization
That package is valuable, but it is not yet a completed `V2 runtime`.
The next problem is therefore no longer:
1. keep deepening constrained-`V1` validation by default
It is:
1. how to turn the accepted semantic constraints into a real `V2 core`
2. how to sequence `Phase 14+` so `V1` mixed runtime state does not silently
regain semantic authority
## Core Rule
For `Phase 14+`, implementation order must be:
1. define core-owned state and transitions
2. define command-emission rules
3. define projection contracts
4. only then connect adapters
Do not invert this order.
If adapter/runtime wiring appears first, `V1` mixed state will silently regain
semantic authority through convenience behavior.
## Existing Inputs To Preserve
These are fixed inputs, not optional references:
1. `v2_mini_core_design.md`
2. `v2-reuse-replacement-boundary.md`
3. `v2-protocol-claim-and-evidence.md`
4. `v2-phase-development-plan.md`
5. `sw-block/engine/replication/`
## Overall Composition Model
The full `V2` runtime should be composed from smaller automata rather than one
monolithic state machine.
```mermaid
flowchart TD
assignmentState[AssignmentAutomaton]
recoveryState[RecoveryAutomaton]
boundaryState[BoundaryAutomaton]
modeState[ModeAutomaton]
publicationState[PublicationAutomaton]
coreEngine[CoreEngine]
projections[ProjectionContracts]
adapters[AdapterBoundary]
runtime[V1BackendMechanics]
assignmentState --> coreEngine
recoveryState --> coreEngine
boundaryState --> coreEngine
modeState --> coreEngine
coreEngine --> publicationState
publicationState --> projections
coreEngine --> adapters
adapters --> runtime
runtime -->|"observations/events"| adapters
adapters --> coreEngine
```
## The Five Core-Owned Automata
### 1. Assignment automaton
Owns:
1. volume intent
2. role intent
3. stable replica identity
4. epoch
5. desired replica set
Primary constraints preserved:
1. `CP13-2`
2. identity-vs-transport separation
Current seeds:
1. `sw-block/engine/replication/registry.go`
2. `sw-block/engine/replication/state.go`
### 2. Recovery automaton
Owns:
1. per-replica recovery state
2. session ownership and fencing
3. catch-up vs rebuild selection
Primary constraints preserved:
1. `CP13-4`
2. `CP13-5`
3. `CP13-6`
4. `CP13-7`
Current seeds:
1. `sw-block/engine/replication/sender.go`
2. `sw-block/engine/replication/session.go`
3. `sw-block/engine/replication/orchestrator.go`
4. `sw-block/engine/replication/outcome.go`
### 3. Boundary automaton
Owns:
1. committed truth
2. checkpoint truth
3. durable barrier truth
4. rebuild/catch-up target truth
Primary constraints preserved:
1. `T1`
2. `T9`
3. `CP13-3`
Current seeds:
1. `sw-block/engine/replication/state.go`
2. `sw-block/engine/replication/engine.go`
### 4. Mode automaton
Owns:
1. `allocated_only`
2. `bootstrap_pending`
3. `replica_ready`
4. `publish_healthy`
5. `degraded`
6. `needs_rebuild`
Primary constraints preserved:
1. `CP13-9`
2. fail-closed external meaning
Current seeds:
1. `sw-block/engine/replication/state.go`
2. `sw-block/engine/replication/engine.go`
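A minimal sketch of the six modes as a Go type, with one fail-closed helper. The `Mode` type and `ExternallyHealthy` are hypothetical names, and the reading that only `publish_healthy` may be reported as healthy is an assumption drawn from the "fail-closed external meaning" constraint, not a confirmed rule of the engine.

```go
package main

import "fmt"

// Mode is a hypothetical representation of the mode automaton's states.
type Mode string

const (
	ModeAllocatedOnly    Mode = "allocated_only"
	ModeBootstrapPending Mode = "bootstrap_pending"
	ModeReplicaReady     Mode = "replica_ready"
	ModePublishHealthy   Mode = "publish_healthy"
	ModeDegraded         Mode = "degraded"
	ModeNeedsRebuild     Mode = "needs_rebuild"
)

// ExternallyHealthy fails closed: every mode other than publish_healthy,
// including unknown or empty values, is reported as not healthy.
func ExternallyHealthy(m Mode) bool {
	return m == ModePublishHealthy
}

func main() {
	modes := []Mode{
		ModeAllocatedOnly, ModeBootstrapPending, ModeReplicaReady,
		ModePublishHealthy, ModeDegraded, ModeNeedsRebuild, Mode("unknown"),
	}
	for _, m := range modes {
		fmt.Printf("%-18s healthy=%v\n", m, ExternallyHealthy(m))
	}
}
```

Comparing against a single allowed value, rather than listing the unhealthy ones, means a mode added later stays non-healthy until someone deliberately changes the helper.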
### 5. Publication automaton
Owns:
1. readiness closure
2. publication closure
3. outward healthy vs non-healthy truth
Primary constraints preserved:
1. `CP13-8A`
2. `CP13-9`
Current seeds:
1. `sw-block/engine/replication/projection.go`
2. `sw-block/engine/replication/engine.go`
## Phase 14+ Execution Order
### Phase 14A: Core-owned automata
Goal:
1. make the five automata explicit in the core package
Deliver:
1. state definitions
2. transition tables/rules
3. event vocabulary
Validation:
1. structural acceptance tests in `sw-block/engine/replication`
Non-goal:
1. no live adapter hook
### Phase 14B: Command semantics
Goal:
1. freeze command-emission rules from semantic state, not runtime convenience
Deliver:
1. command rules for role apply, receiver start, shipper configure, invalidation,
and publication
Validation:
1. tests that one event sequence produces one bounded command sequence
Non-goal:
1. no `weed/` execution yet
### Phase 14C: Projection contracts
Goal:
1. define what external surfaces are allowed to claim and from which core state
Deliver:
1. projection structs and normalization rules for lookup/heartbeat/debug/tester
meanings
Validation:
1. mode/readiness/publication surface-consistency tests
Non-goal:
1. no live registry rewrite yet
### Phase 15A: Minimal adapter hook
Goal:
1. connect one narrow adapter ingress to the new core
Deliver:
1. one event path from `weed/` into the core
2. one command path back out
Validation:
1. prove no semantic split between adapter and core on that narrow path
### Phase 15B: Projection-store rebinding
Goal:
1. make `weed/` projection/state surfaces consume core-owned projection truth
Deliver:
1. bounded rebinding of registry / lookup / tester-facing surfaces
Validation:
1. prove that `assignment delivered`, `ready`, and `publish healthy` remain distinct on the real path
### Phase 16: V2-native runtime closure
Goal:
1. make the integrated runtime behave as a `V2`-owned system rather than
constrained-`V1` semantics plus fixes
Deliver:
1. one bounded runtime path where core-owned semantics drive adapters and
projections
Validation:
1. end-to-end failover/recovery/publication scenarios on the core-driven path
## Algorithm Review Rule
For any new transition rule, command rule, or projection rule, require a short
justification in code review or delivery notes:
1. semantic constraint satisfied:
   - which item from `v2-protocol-claim-and-evidence.md`,
     `v2-protocol-truths.md`, or `CP13-*`
2. overclaim avoided:
   - what false healthy / ready / durable / recoverable claim is being prevented
3. proof preserved:
   - which accepted test or checkpoint remains valid because of this rule
This is the minimum bar for `Phase 14+`.
## Immediate Next Slice
Do not broaden `Phase 13` further.
Use the new `Phase 14` core skeleton in `sw-block/engine/replication` as the
base for one complete semantic chain:
1. `mode`
2. `readiness`
3. `publication`
This is the best next slice because it turns the newest accepted `CP13-8A` and
`CP13-9` constraints directly into core-owned state and transition logic before
adapter rebinding begins.


@@ -1,112 +0,0 @@
# V2 Pilot Preflight Checklist
Date: 2026-04-05
Status: draft
Purpose: define the minimum explicit checks required before running bounded
internal engineering validation on the current `Phase 18` RF2 runtime envelope
## Reading Rule
This checklist is a gate for starting or resuming bounded internal engineering
validation.
If any item below is not satisfied:
1. do not treat the environment as pilot-ready
2. either fix the issue or classify it explicitly before proceeding
## Scope Lock
Confirm the validation is still inside the frozen `Phase 18` RF2 runtime
envelope:
1. topology remains bounded `RF=2`
2. runtime path remains the delivered `masterv2 + volumev2 + purev2` `M1-M4`
path
3. validation stays in bounded runtime/lab exercises only
4. no one is trying to use this validation to claim working block product status
5. no real frontend/product traffic is being introduced on the new runtime path
6. no one is trying to use validation success to claim broader launch approval
## Build And Artifact Pin
Confirm the software package is explicit and stable:
1. the exact build/commit for validation nodes is written down
2. all validation nodes run the same intended package
3. the operator runbook matches the package actually deployed
4. any configuration delta from the documented chosen path is reviewed and
accepted explicitly
## Environment Readiness
Confirm the validation environment matches bounded assumptions:
1. node inventory and topology are written down
2. transport/frontend choice does not widen beyond the bounded runtime envelope
3. storage/network assumptions required by the chosen path are known to the
operator
4. known exclusions are acknowledged before start
5. rollback/containment ownership is assigned for the validation window
## Diagnosis Surface Readiness
Confirm bounded diagnosis can be performed without ad hoc spelunking:
1. failover snapshots/results can be inspected
2. Loop 2 snapshots can be inspected
3. continuity snapshots can be inspected
4. RF2 runtime surface can be inspected
5. the operator knows which artifact defines the current contract/policy boundary:
   - `v2-rf2-runtime-bounded-envelope.md`
   - `v2-rf2-runtime-bounded-envelope-review.md`
   - this preflight checklist
   - the stop-condition artifact
## Workload And Gate Alignment
Confirm the validation workload is aligned with accepted evidence:
1. the workload maps to the bounded runtime-bearing reading rather than a new
unsupported scenario
2. success will be judged against the validation-pack criteria rather than generic
"looks stable" judgment
3. the workload does not assume continuous Loop 2, real transport, auto failover,
rebuild lifecycle, or product frontends that are still excluded
4. no required proof depends on failover-under-load, hours/days soak, `RF>2`, or
broad transport/frontend claims that are still excluded
## Incident Routing Readiness
Confirm incident handling is explicit before starting:
1. every incident will be classified as one of:
   - `config / environment issue`
   - `known exclusion`
   - `true product bug`
2. the recording location for incidents is agreed before validation starts
3. ownership for triage and decision-making is assigned
4. operators know when they must stop instead of improvising
## Preflight Result
Validation may start only if:
1. every scope-lock item is true
2. the software package and environment are pinned
3. diagnosis surfaces are available
4. incident routing is explicit
5. no remaining gap is being hand-waved as "we will figure it out during pilot"
If those conditions are not met, the correct output is:
1. `NOT READY`
2. the missing item(s)
3. the owner/action needed before retry
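The preflight result can be sketched as a function over checklist items. Everything here is hypothetical: the `Item` type, the `Result` function, and the `READY` label are illustration names, and treating an empty checklist as not ready is an assumption made to keep the gate fail-closed.

```go
package main

import (
	"fmt"
	"strings"
)

// Item is a hypothetical checklist entry.
type Item struct {
	Name      string
	Satisfied bool
	Owner     string // who acts before retry if the item is missing
}

// Result returns "READY" only when at least one item is recorded and every
// item is satisfied. Otherwise it returns a "NOT READY" report that names
// each missing item and its owner.
func Result(items []Item) string {
	if len(items) == 0 {
		return "NOT READY: no checklist items recorded"
	}
	var missing []string
	for _, it := range items {
		if !it.Satisfied {
			missing = append(missing, fmt.Sprintf("%s (owner: %s)", it.Name, it.Owner))
		}
	}
	if len(missing) == 0 {
		return "READY"
	}
	return "NOT READY: " + strings.Join(missing, "; ")
}

func main() {
	items := []Item{
		{Name: "scope lock", Satisfied: true, Owner: "lead"},
		{Name: "build pin", Satisfied: false, Owner: "release"},
	}
	fmt.Println(Result(items))
}
```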
## Primary Inputs
1. `sw-block/design/v2-bounded-internal-pilot-pack.md`
2. `sw-block/design/v2-rf2-runtime-bounded-envelope.md`
3. `sw-block/design/v2-rf2-runtime-bounded-envelope-review.md`
4. `sw-block/.private/phase/phase-18.md`


@@ -1,101 +0,0 @@
# V2 Pilot Stop Conditions
Date: 2026-04-05
Status: draft
Purpose: define when bounded internal engineering validation on the current
`Phase 18` RF2 runtime envelope must stop, contain scope, or block expansion
## Reading Rule
This artifact is about validation containment, not protocol/data rollback
semantics.
`Rollback` here means:
1. stop widening validation exposure
2. reduce or remove validation usage if needed
3. return to a previously accepted bounded state of operation
It does NOT mean:
1. a general storage/data rollback guarantee
2. permission to claim a broader recovery contract than the current evidence
3. ad hoc operator improvisation under ambiguity
## Immediate Stop Conditions
Stop validation immediately if ANY of the following occurs:
1. an observed behavior contradicts the bounded `Phase 18` RF2 runtime envelope
2. any run is interpreted as proving automatic failover, continuous Loop 2
service, rebuild lifecycle, or frontend-serving behavior that is still
explicitly excluded
3. diagnosis surfaces are insufficient to classify the incident without guessing
4. the validation is being widened beyond the named bounded envelope without an
explicit review decision
5. the incident does not fit the allowed buckets:
   - `config / environment issue`
   - `known exclusion`
   - `true product bug`
## Stop-And-Contain Actions
When a stop condition fires:
1. freeze new validation expansion immediately
2. preserve the evidence needed for later review
3. classify the incident explicitly
4. map the incident back to:
   - accepted bounded claim
   - known exclusion
   - unresolved blocker
5. decide whether validation can continue in reduced scope or must fully pause
If the team cannot perform those actions clearly, validation remains stopped.
## Rollback Decision Rules
Use the following bounded rules:
1. `config / environment issue`
   - fix the environment/configuration
   - rerun preflight before resuming
2. `known exclusion`
   - remove the excluded usage from validation
   - do not reinterpret it as product support
3. `true product bug`
   - pause affected validation scope
   - open an explicit fix or contradiction item before resuming
If repeated incidents of the same class continue without a bounded corrective
path, block further validation expansion.
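The decision rules above amount to a lookup from incident class to bounded actions. The sketch below uses hypothetical names (`IncidentClass`, `Actions`); an unrecognized class returns `ok == false`, reflecting the immediate stop condition for incidents that do not fit the allowed buckets.

```go
package main

import "fmt"

// IncidentClass is a hypothetical type for the three allowed buckets.
type IncidentClass string

const (
	ConfigEnvironment IncidentClass = "config / environment issue"
	KnownExclusion    IncidentClass = "known exclusion"
	TrueProductBug    IncidentClass = "true product bug"
)

// Actions returns the bounded actions for a class. ok is false when the class
// is not one of the allowed buckets, which is itself a stop condition.
func Actions(c IncidentClass) (steps []string, ok bool) {
	switch c {
	case ConfigEnvironment:
		return []string{
			"fix the environment/configuration",
			"rerun preflight before resuming",
		}, true
	case KnownExclusion:
		return []string{
			"remove the excluded usage from validation",
			"do not reinterpret it as product support",
		}, true
	case TrueProductBug:
		return []string{
			"pause affected validation scope",
			"open an explicit fix or contradiction item before resuming",
		}, true
	default:
		return []string{"stop validation: incident does not fit the allowed buckets"}, false
	}
}

func main() {
	steps, ok := Actions(TrueProductBug)
	fmt.Println(ok, steps)
}
```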
## Expansion Blockers
Even if validation remains partially runnable, do NOT widen it when:
1. the same unresolved true product bug recurs
2. operators depend on tribal knowledge to recover or diagnose
3. incident records are vague or cannot be mapped back to the current evidence
ladder
4. success depends on ignoring explicit exclusions
5. the desired next step requires broader launch claims than the current envelope
## Explicit Non-Claims
This artifact does NOT claim:
1. broad rollout approval
2. generic production readiness from validation survival
3. support for `RF>2`
4. support for a broad transport/frontend matrix
5. failover-under-load proof or long-window soak proof beyond the current bounded
evidence set
## Primary Inputs
1. `sw-block/design/v2-bounded-internal-pilot-pack.md`
2. `sw-block/design/v2-pilot-preflight-checklist.md`
3. `sw-block/design/v2-rf2-runtime-bounded-envelope.md`
4. `sw-block/design/v2-rf2-runtime-bounded-envelope-review.md`
5. `sw-block/.private/phase/phase-18.md`


@@ -1,90 +0,0 @@
# V2 Protocol-Aware Execution
## Purpose
Make host-side execution in `weed/server` and `weed/storage/blockvol` obey the
existing V2 session contract explicitly. The engine remains the semantic source
of truth. Host code owns only:
- execution-state caching derived from sender/session snapshots
- phase gating before data-plane I/O
- observation routing back into core events
## Host-Side Execution State
For each primary volume and replica, the host caches a `replica protocol
execution state` with these fields:
- `ReplicaID`
- `SenderState`
- `SessionID`
- `SessionKind`
- `SessionPhase`
- `StartLSN`
- `TargetLSN`
- `FrozenTargetLSN`
- `RecoveredTo`
- `SessionActive`
- `LiveEligible`
- `Reason`
Rules:
1. State is derived from `v2Orchestrator.Registry` snapshots only.
2. `LiveEligible=false` whenever there is an active recovery session.
3. Data-plane code must consult this cached state before shipping current live
WAL entries.
4. Heartbeat and publication remain projection-driven; they do not invent local
session semantics.
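A minimal sketch of the cached state and the gate implied by rules 2 and 3. The field names come from the list above; the field types, the `Normalize` and `MayShipLive` helpers, and the reason string are assumptions for illustration and are not taken from `weed/server`.

```go
package main

import "fmt"

// ReplicaExecState mirrors the cached fields listed above. Field types are
// assumptions; the real definitions may differ.
type ReplicaExecState struct {
	ReplicaID       string
	SenderState     string
	SessionID       uint64
	SessionKind     string
	SessionPhase    string
	StartLSN        uint64
	TargetLSN       uint64
	FrozenTargetLSN uint64
	RecoveredTo     uint64
	SessionActive   bool
	LiveEligible    bool
	Reason          string
}

// Normalize enforces rule 2: an active recovery session always clears
// LiveEligible, whatever the cached value said.
func Normalize(s ReplicaExecState) ReplicaExecState {
	if s.SessionActive && s.LiveEligible {
		s.LiveEligible = false
		s.Reason = "active recovery session"
	}
	return s
}

// MayShipLive is the phase gate from rule 3: data-plane code asks this
// before shipping current live WAL entries.
func MayShipLive(s ReplicaExecState) bool {
	return Normalize(s).LiveEligible
}

func main() {
	catchingUp := ReplicaExecState{ReplicaID: "r1", SessionActive: true, LiveEligible: true}
	inSync := ReplicaExecState{ReplicaID: "r2", LiveEligible: true}
	fmt.Println(MayShipLive(catchingUp), MayShipLive(inSync))
}
```

The gate only reads state derived elsewhere; it does not decide eligibility itself, which keeps it a phase gate rather than a second source of truth.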
## WAL-First Rollout
The first rollout is intentionally narrow:
- cover `keepup` and WAL-based catch-up only
- do not change snapshot/build policy
- do not let fresh late-attached replicas consume current live-tail WAL while a
bounded catch-up session is active
Current implementation seam:
- `weed/server/block_protocol_state.go`
  - derives host execution state from sender/session snapshots
  - binds a per-volume live-shipping policy back into `BlockVol`
- `weed/storage/blockvol/blockvol.go`
  - carries the host-provided live-shipping policy across shipper-group rebuilds
- `weed/storage/blockvol/wal_shipper.go`
  - checks the policy before any live-tail dial or send
This is intentionally a phase gate, not a second source of truth.
## Observation Seam
Runtime observations should feed back through one server-side seam:
- sender/session snapshots -> `syncProtocolExecutionState()`
- host event application -> `applyCoreEvent()`
- assignment processing -> `ApplyAssignments()`
The rule is:
1. engine chooses the protocol phase
2. host derives execution state from engine snapshots
3. data path obeys that state
4. host emits observed facts back through `applyCoreEvent()`
## Fast Test Roster
The first fast-test roster for protocol-aware execution is:
- `unit`: `TestWALShipper_LiveShippingPolicyBlocksBeforeDial`
  - proves phase gate happens before any transport dial
- `unit`: `TestWALShipper_LiveShippingPolicyAllowsShip`
  - proves the gate does not block normal live shipping after eligibility
- `component`: `TestBlockService_ProtocolExecutionState_ActiveCatchUpBlocksLiveShipping`
  - proves sender/session snapshots become host execution state and block live
    shipping during active catch-up
- `component`: `TestBlockService_ProtocolExecutionState_InSyncSenderAllowsLiveShipping`
  - proves the host reopens live shipping after the recovery session is gone
Next fast tests to add in later waves:
- late attach with backlog must stay bounded until target reached
- transport contact before barrier durability must not imply publish healthy
- timeout with valid retention pin may replan WAL catch-up
- timeout after retention loss must escalate to build


@@ -1,178 +0,0 @@
# V2 Reuse vs Replacement Boundary
Date: 2026-04-03
Status: active
## Purpose
This note makes one architectural split explicit for the current chosen path:
1. what we reuse from the existing `blockvol`/`weed` stack as mechanics
2. what must be owned by `V2` as semantic authority
3. what sits in the adapter boundary between them
The goal is to stop `V1` mixed control/data state from silently redefining `V2`
behavior through convenience wiring.
Scope is still bounded to:
1. `RF=2`
2. `sync_all`
3. current master / volume-server heartbeat path
4. `blockvol` as the execution backend
## Boundary Rule
`V1` reuse is allowed for execution mechanics.
`V2` replacement is required for semantic authority.
If a change decides protocol meaning, failover meaning, durability meaning, or
external publication meaning, it belongs to a `V2`-owned layer even if the
underlying I/O still runs through reused `blockvol` code.
This is the practical interpretation of:
- `v2-protocol-truths.md` `T14`: engine remains recovery authority
- `v2-protocol-truths.md` `T15`: reuse reality, not inherited semantics
## Three Buckets
### 1. Reusable V1 Core
These components remain useful as mechanics:
| Area | Files | What stays reusable |
|------|-------|---------------------|
| Local storage truth | `weed/storage/blockvol/blockvol.go`, `flusher.go`, `rebuild.go`, WAL/extent helpers | WAL append, flush, checkpoint, dirty-map, extent install |
| Replica transport | `weed/storage/blockvol/replica_apply.go`, `wal_shipper.go`, `shipper_group.go`, `dist_group_commit.go`, `repl_proto.go` | TCP receiver/shipper mechanics, barrier transport, replay/apply |
| Frontend serving | `weed/storage/blockvol/iscsi/`, `weed/storage/blockvol/nvme/` | block-device serving once a local volume is authoritative |
| Local role guardrails | `weed/storage/blockvol/promotion.go`, `role.go` | drain, lease revoke, local role gate enforcement |
Rule:
- these layers execute I/O and transport
- they do not decide whether a replica is eligible, authoritative, published, or healthy in the `V2` sense
### 2. Adapter Boundary
These components translate `V2` truth into concrete runtime wiring:
| Area | Files | Responsibility |
|------|-------|----------------|
| Assignment ingest | `weed/server/volume_server_block.go` | authoritative assignment lifecycle for role apply, receiver/shipper wiring, readiness closure |
| Heartbeat/runtime loop | `weed/server/block_heartbeat_loop.go` | collect/report status and process assignments through the same lifecycle |
| Local store helper | `weed/storage/store_blockvol.go` | local volume open/close/iteration; no longer the authoritative assignment lifecycle |
| Bridge | `weed/storage/blockvol/v2bridge/control.go` | convert service/control truth into engine intents |
Rule:
- the adapter boundary may reuse `blockvol` primitives
- it must name and own lifecycle closure states explicitly
- it must not let store-only role application masquerade as ready publication
### 3. V2-Owned Replacement
These areas define truth and therefore must remain `V2`-owned:
| Area | Files | Responsibility |
|------|-------|----------------|
| Control and identity truth | `sw-block/engine/replication/`, `weed/storage/blockvol/v2bridge/control.go` | assignment truth, stable identity, session truth |
| Recovery ownership | `weed/server/block_recovery.go` | live runtime owner for catch-up/rebuild tasks |
| Publication and health closure | `weed/server/master_block_registry.go`, `weed/server/master_block_failover.go` | what the system reports as ready, degraded, publishable |
| External product surfaces | `weed/server/master_grpc_server_block.go`, `weed/server/master_server_handlers_block.go`, debug/diagnostic surfaces | operator-visible truth, not convenience guesses |
Rule:
- if the system exposes a condition to master, tester, CSI, or operator tooling, that condition must come from `V2`-named state
## Assignment-To-Readiness Lifecycle
The authoritative lifecycle for the current chosen path is:
```text
assignment delivered
-> local role applied
-> replica receiver or primary shipper configured
-> readiness closed
-> heartbeat publication
-> master registry health/publication
```
More concretely:
1. master intent is delivered
2. `BlockService.ApplyAssignments()` applies local role truth
3. the same path wires receiver/shipper runtime
4. the same path records named readiness state
5. heartbeat publishes only what is actually publish-healthy
6. master registry derives lookup/health from explicit readiness, not from allocation alone
## Named Readiness States
For the current implementation slice, the service boundary now names:
1. `roleApplied`
2. `receiverReady`
3. `shipperConfigured`
4. `shipperConnected`
5. `replicaEligible`
6. `publishHealthy`
Ownership:
- owned by `BlockService` / adapter layer
- observed by debug surfaces and heartbeat/publication logic
- not delegated to `blockvol` as implicit mixed state
## Current File Map
### Reuse
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/flusher.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/wal_shipper.go`
- `weed/storage/blockvol/shipper_group.go`
- `weed/storage/blockvol/dist_group_commit.go`
- `weed/storage/blockvol/iscsi/`
- `weed/storage/blockvol/nvme/`
### Adapter boundary
- `weed/server/volume_server_block.go`
- `weed/server/block_heartbeat_loop.go`
- `weed/storage/store_blockvol.go`
- `weed/server/volume_server_block_debug.go`
### V2-owned replacement / truth
- `weed/storage/blockvol/v2bridge/control.go`
- `sw-block/engine/replication/`
- `weed/server/block_recovery.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- `weed/server/master_grpc_server_block.go`
- `weed/server/master_server_handlers_block.go`
## Immediate Engineering Rule
When a new bug appears, classify it first:
1. `v1 reusable core`: local storage or transport mechanics
2. `adapter boundary`: assignment/readiness/publication closure bug
3. `v2 replacement`: semantic authority, identity, ownership, eligibility, rebuild, or operator-visible truth
Do not patch semantic authority directly into `blockvol` unless the same change is
also reflected as an explicit `V2` state/rule at the service or registry layer.
## Why This Matters For CP13-8
`CP13-8` found the exact class of bug this split is meant to expose:
- allocation/control truth said the replica existed
- but runtime publication/read visibility was not yet closed
That is not a reason to throw away `blockvol`.
It is a reason to stop treating mixed `V1` runtime state as if it were already
closed `V2` publication truth.


@@ -1,80 +0,0 @@
# V2 RF2 Runtime Bounded Envelope Review
Date: 2026-04-05
Status: draft
Purpose: record the current bounded productionization judgment for the delivered
`Phase 19` RF2 working-path envelope
## Review Outcome
Current decision:
1. `stay in bounded validation`
2. `not pilot-ready`
## Why This Is The Correct Outcome
The delivered `Phase 19` path proves one bounded working RF2 block path:
1. live transport-backed evidence traffic exists
2. continuous Loop 2 service exists
3. bounded automatic failover exists
4. runtime-managed frontend rebinding exists
5. bounded repair/catch-up exists
6. one real end-to-end client handoff proof exists
7. bounded operator and CSI adapters now exist on top of runtime-owned truth
But the path is still not broad product/pilot approval because:
1. the current proof is still bounded to the current runtime harness
2. repair/catch-up is not yet broad rebuild lifecycle closure
3. CSI and operator surfaces are still bounded adapters rather than full
production surfaces
4. no broad pilot or rollout evidence exists yet
## Review Record
Reviewer reading baseline:
1. `sw-block/.private/phase/phase-19.md`
2. `sw-block/design/v2-rf2-runtime-bounded-envelope.md`
3. `sw-block/design/v2-bounded-internal-pilot-pack.md`
4. `sw-block/design/v2-pilot-preflight-checklist.md`
5. `sw-block/design/v2-pilot-stop-conditions.md`
6. `sw-block/design/v2-controlled-rollout-review.md`
7. `sw-block/runtime/volumev2/poc_test.go`
Current evidence package:
1. runtime-owned failover manager
2. continuous Loop 2 service and bounded auto failover
3. runtime-managed frontend and bounded repair closure
4. end-to-end RF2 handoff proof
5. RF2 runtime surface projection and operator surface
6. bounded CSI runtime backend adapter
## Allowed Interpretation
The review allows only these statements:
1. one runtime-bearing RF2 kernel slice now exists
2. one bounded working RF2 block path now exists
3. one bounded productionization artifact set now exists around that path
4. later work may widen from this review only through explicit new closure
The review does NOT allow:
1. working block product approval
2. pilot execution against real product traffic
3. rollout expansion beyond bounded internal engineering validation
## Next Required Closures
Before any pilot-ready judgment can exist, the next closures must become
explicit:
1. multi-process / multi-host proof for the current working path
2. broader rebuild lifecycle closure beyond the bounded repair wrapper
3. fuller CSI lifecycle parity on the V2 runtime path
4. broader operator/metrics surface closure
5. pilot/preflight/containment evidence on top of the `Phase 19` path

@@ -1,127 +0,0 @@
# V2 RF2 Runtime Bounded Envelope
Date: 2026-04-05
Status: draft
Purpose: freeze the bounded productionization envelope around the current
`Phase 19` working RF2 block path without overclaiming broad product readiness
## Reading Rule
This document defines the strongest bounded envelope currently justified by the
delivered `Phase 19` path.
It does NOT mean:
1. broad launch approval
2. working block product approval
3. support for broad frontend or transport matrices
4. that remaining runtime/product gaps are minor polish
It means only:
1. the current `masterv2 + volumev2 + purev2` RF2 runtime slice has a named,
reviewable productionization boundary
2. the current support statement, exclusions, and blockers are explicit
3. later pilot or rollout work must stay inside this envelope or explicitly widen
it with new evidence
## Envelope Basis
This envelope is anchored on the delivered `Phase 19` milestones:
1. `M6`: one live loopback HTTP transport now exists behind the evidence seam
2. `M7`: one background Loop 2 service and one bounded auto-failover service now
exist
3. `M8`: one runtime-managed iSCSI export path and one bounded replica repair
wrapper now exist
4. `M9`: one end-to-end RF2 handoff proof now exists with continued I/O on the
new primary
5. `M10`: one bounded operator surface and one bounded CSI runtime backend
adapter now exist
The envelope is therefore about one bounded working RF2 block path, not broad
product readiness.
## Supported Envelope
The current bounded support statement is:
1. one bounded working RF2 block path now exists with:
   - `masterv2` identity/promotion authority
   - `volumev2` failover, takeover, active Loop 2 service, continuity, repair,
     frontend rebinding, and projected RF2 surface ownership
   - `purev2` execution adapter reuse
2. one bounded live transport path now carries failover-time evidence and replica
   summaries
3. one bounded real client handoff path now exists:
   - write through runtime-managed iSCSI export
   - bounded repair/catch-up on the runtime path
   - lose primary
   - auto fail over
   - reconnect to the new primary
   - continue I/O
4. one bounded outward RF2 surface exists as projection only:
   - `RF2VolumeSurface`
5. one bounded operator/CSI adapter layer exists on top of runtime-owned truth
## Explicit Exclusions
The following are OUTSIDE this bounded envelope:
1. broad multi-process or multi-host deployment approval
2. broad transport/frontend matrix approval
3. full rebuild orchestration beyond the current bounded repair/catch-up wrapper
4. broad CSI lifecycle parity beyond the current bounded runtime backend adapter
5. broad operator/API/metrics coverage beyond the current bounded HTTP surface
6. broad launch or external customer support statements
## Current Blockers
The main blockers between this envelope and a working RF2 block product are:
1. the current path is still bounded to the current runtime harness rather than
broad multi-process approval
2. bounded repair/catch-up is not yet broad rebuild lifecycle closure
3. CSI rebinding is still a bounded runtime backend adapter, not full lifecycle
parity
4. the operator surface is still a bounded HTTP view, not a full operational
platform surface
## Allowed Validation Shape
The allowed validation shape inside this envelope is:
1. internal engineering validation only
2. bounded lab/runtime exercise only
3. explicit artifact-driven interpretation only
The following are NOT allowed interpretations:
1. "the system is now production ready"
2. "the system now supports real automatic failover"
3. "the system now supports broad product traffic and rollout"
## Evidence Anchors
Read this envelope together with:
1. `sw-block/.private/phase/phase-19.md`
2. `sw-block/design/v2-kernel-closure-review.md`
3. `sw-block/design/v2-protocol-claim-and-evidence.md`
4. `sw-block/runtime/volumev2/runtime_manager.go`
5. `sw-block/runtime/volumev2/continuity_runtime.go`
6. `sw-block/runtime/volumev2/rf2_surface.go`
7. `sw-block/runtime/volumev2/loop2_service.go`
8. `sw-block/runtime/volumev2/frontend_runtime.go`
9. `sw-block/runtime/volumev2/operator_surface.go`
10. `sw-block/runtime/volumev2/poc_test.go`
11. `weed/storage/blockvol/csi/v2_runtime_backend.go`
## Envelope Output
The correct current reading of this envelope is:
1. runtime-bearing RF2 kernel slice: yes
2. bounded working RF2 block path: yes
3. bounded productionization artifact set: yes
4. pilot-ready broad product path: no

@@ -1,249 +0,0 @@
# V2 Scenario Sources From V1 and V1.5
Date: 2026-03-27
## Purpose
This document distills V1 / V1.5 real-test material into V2 scenario inputs.
Sources:
- `learn/projects/sw-block/phases/phase13_test.md`
- `learn/projects/sw-block/phases/phase-13-v2-boundary-tests.md`
This is not the active scenario backlog.
Use:
- `v2_scenarios.md` for the active V2 scenario set
- this file for historical source and rationale
## How To Use This File
For each item below:
1. keep the real V1/V1.5 test as implementation evidence
2. create or maintain a V2 simulator scenario for the protocol core
3. define the expected V2 behavior explicitly
## Source Buckets
### 1. Core protocol behavior
These are the highest-value simulator inputs.
- barrier durability truth
- reconnect + catch-up
- non-convergent catch-up -> rebuild
- rebuild fallback
- failover / promotion safety
- WAL retention / tail-chasing
- durability mode semantics
Recommended V2 treatment:
- `sim_core`
### 2. Supporting invariants
These matter, but usually as reduced simulator checks.
- canonical address handling
- replica role/epoch gating
- committed-prefix rules
- rebuild publication cleanup
- assignment refresh behavior
Recommended V2 treatment:
- `sim_reduced`
### 3. Real-only implementation behavior
These should usually stay in real-engine tests.
- actual wire encoding / decode bugs
- real disk / `fdatasync` timing
- NVMe / iSCSI frontend behavior
- Go concurrency artifacts tied to concrete implementation
Recommended V2 treatment:
- `real_only`
### 4. V2 boundary items
These are especially important.
They should remain visible as:
- current V1/V1.5 limitation
- explicit V2 acceptance target
Recommended V2 treatment:
- `v2_boundary`
## Distilled Scenario Inputs
### A. Barrier truth uses durable replica progress
Real source:
- Phase 13 barrier / `replicaFlushedLSN` tests
Why it matters:
- commit must follow durable replica progress, not send progress
V2 target:
- barrier completion counted only from explicit durable progress state
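
A minimal sketch of this target, assuming hypothetical names (`replicaProgress`, `barrierSatisfied`) that do not come from the codebase. The barrier consults only the flushed LSN each replica has reported and ignores how far the primary has shipped:

```go
package main

import "fmt"

// replicaProgress is a hypothetical per-replica progress record.
type replicaProgress struct {
	ShippedLSN uint64 // highest LSN sent to the replica (send progress)
	FlushedLSN uint64 // highest LSN the replica reported as durable
}

// barrierSatisfied reports whether every replica has durably flushed
// at least targetLSN. ShippedLSN is deliberately ignored.
func barrierSatisfied(targetLSN uint64, replicas []replicaProgress) bool {
	for _, r := range replicas {
		if r.FlushedLSN < targetLSN {
			return false
		}
	}
	return true
}

func main() {
	replicas := []replicaProgress{{ShippedLSN: 120, FlushedLSN: 90}}
	fmt.Println(barrierSatisfied(100, replicas)) // false: shipped but not durable
	replicas[0].FlushedLSN = 100
	fmt.Println(barrierSatisfied(100, replicas)) // true
}
```

The first call is the case the V1/V1.5 tests guard against: the entry was sent past the barrier LSN, yet the barrier must still wait.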
### B. Same-address transient outage
Real source:
- Phase 13 reconnect / catch-up tests
- `CP13-8` short outage recovery
Why it matters:
- proves cheap short-gap recovery path
V2 target:
- explicit recoverability check
- catch-up if recoverable
- rebuild otherwise
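
The three-way decision can be sketched as below. This is an illustration under assumptions: `chooseRecovery` and its parameters are hypothetical, and "recoverable" is taken to mean that the primary's WAL still retains the entry immediately after the replica's last durable LSN.

```go
package main

import "fmt"

type recoveryAction string

const (
	actionNone    recoveryAction = "none"
	actionCatchUp recoveryAction = "catch_up"
	actionRebuild recoveryAction = "rebuild"
)

// chooseRecovery picks a recovery path for a reconnecting replica.
//   replicaLSN:        last LSN the replica durably holds
//   oldestRetainedLSN: oldest LSN still present in the primary's WAL
//   headLSN:           newest LSN on the primary
func chooseRecovery(replicaLSN, oldestRetainedLSN, headLSN uint64) recoveryAction {
	if replicaLSN >= headLSN {
		return actionNone // nothing missing
	}
	if oldestRetainedLSN <= replicaLSN+1 {
		return actionCatchUp // the whole gap is still in the WAL
	}
	return actionRebuild // part of the gap was recycled
}

func main() {
	fmt.Println(chooseRecovery(95, 90, 100))  // catch_up
	fmt.Println(chooseRecovery(50, 90, 100))  // rebuild
	fmt.Println(chooseRecovery(100, 90, 100)) // none
}
```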
### C. Changed-address restart
Real source:
- `CP13-8 T4b`
- changed-address refresh fixes
Why it matters:
- endpoint is not identity
- stale endpoint must not remain authoritative
V2 target:
- heartbeat/control-plane learns new endpoint
- reassignment updates sender target
- recovery session starts only after endpoint truth is updated
### D. Non-convergent catch-up / tail-chasing
Real source:
- Phase 13 retention + catch-up + rebuild fallback line
Why it matters:
- “catch-up exists” is not enough
- must know when to stop and rebuild
V2 target:
- explicit `CatchingUp -> NeedsRebuild`
- no fake success
### E. Slow control-plane recovery
Real source:
- `CP13-8 T4b` hardware behavior before fix
Why it matters:
- safety can be correct while availability recovery is poor
V2 target:
- explicit fast recovery path when possible
- explicit fallback when only control-plane repair can help
### F. Stale message / delayed ack fencing
Real source:
- Phase 13 epoch/fencing tests
- V2 scenario work already mirrors this
Why it matters:
- old lineage must not mutate committed prefix
V2 target:
- stale message rejection is explicit and testable
### G. Promotion candidate safety
Real source:
- failover / promotion gating tests
- V2 candidate-selection work
Why it matters:
- wrong promotion loses committed lineage
V2 target:
- candidate must satisfy:
  - running
  - epoch aligned
  - state eligible
  - committed-prefix sufficient
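
The four gates can be sketched as a single predicate. Everything here is hypothetical: the `candidate` fields, the `"in_sync"` state name, and the choice to report the first failing gate are assumptions made for illustration.

```go
package main

import "fmt"

// candidate is a hypothetical view of one replica at promotion time.
type candidate struct {
	Running    bool
	Epoch      uint64
	State      string // e.g. "in_sync", "catching_up", "needs_rebuild"
	FlushedLSN uint64
}

// promotionBlocker returns "" when the candidate passes all four gates,
// otherwise the name of the first gate that failed.
func promotionBlocker(c candidate, currentEpoch, committedLSN uint64) string {
	switch {
	case !c.Running:
		return "not running"
	case c.Epoch != currentEpoch:
		return "epoch mismatch"
	case c.State != "in_sync":
		return "state not eligible"
	case c.FlushedLSN < committedLSN:
		return "committed prefix insufficient"
	}
	return ""
}

func main() {
	ok := candidate{Running: true, Epoch: 7, State: "in_sync", FlushedLSN: 500}
	behind := candidate{Running: true, Epoch: 7, State: "in_sync", FlushedLSN: 420}
	fmt.Printf("%q\n", promotionBlocker(ok, 7, 500))     // ""
	fmt.Printf("%q\n", promotionBlocker(behind, 7, 500)) // "committed prefix insufficient"
}
```

Returning the failing gate rather than a bare boolean keeps a rejected promotion explainable in logs and in scenario assertions.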
### H. Rebuild boundary after failed catch-up
Real source:
- Phase 13 rebuild fallback behavior
Why it matters:
- rebuild is required when retained WAL cannot safely close the gap
V2 target:
- rebuild is explicit fallback, not ad hoc recovery
## Immediate Feed Into `v2_scenarios.md`
These are the most important V1/V1.5-derived V2 scenarios:
1. same-address transient outage
2. changed-address restart
3. non-convergent catch-up / tail-chasing
4. stale delayed message / barrier ack rejection
5. committed-prefix-safe promotion
6. control-plane-latency recovery shape
## What Should Not Be Copied Blindly
Do not clone every real-engine test into the simulator.
Do not use the simulator for:
- exact OS timing
- exact socket/wire bugs
- exact block frontend behavior
- implementation-specific lock races
Instead:
- extract the protocol invariant
- model the reduced scenario if the protocol value is high
## Bottom Line
V1 / V1.5 tests should feed V2 in two ways:
1. as historical evidence of what failed or mattered in real life
2. as scenario seeds for the V2 simulator and acceptance backlog

@@ -1,135 +0,0 @@
# V2 Separation Port Layer Audit
Date: 2026-04-04
Status: active
## Purpose
This note audits the current `sw-block` port layer for the separation effort:
1. define which contracts already belong in `sw-block`
2. identify what was still underspecified or mismatched
3. record the normalized boundary for future migration batches
## Current Port Layer
The current reusable boundary inside `sw-block` is:
1. `sw-block/bridge/blockvol/contract.go`
2. `sw-block/bridge/blockvol/storage_adapter.go`
3. `sw-block/bridge/blockvol/control_adapter.go`
These files are the intended weed-free bridge between:
1. `sw-block/engine/replication`
2. `weed/storage/blockvol/v2bridge`
3. `weed/server/*` adapter code
## Audited Contracts
### Storage state port
File:
1. `sw-block/bridge/blockvol/contract.go`
Stable contract:
1. `BlockVolReader`
2. `BlockVolState`
This is already the right ownership:
1. `sw-block` owns the shape of retained-history inputs
2. `weed/` only implements how to read those facts from real `BlockVol`
### Retention / snapshot pinning port
File:
1. `sw-block/bridge/blockvol/contract.go`
Stable contract:
1. `BlockVolPinner`
This remains correct because:
1. pin lifecycle meaning belongs to the V2 recovery driver
2. actual hold/release mechanics remain weed-side implementation detail
### Recovery execution port
Previous issue:
1. `BlockVolExecutor` in `contract.go` did not match the real engine execution
interfaces precisely
2. in particular, rebuild full-base transfer in the engine returns achieved LSN,
but the contract only returned `error`
Normalized decision:
1. `sw-block` now names:
   - `BlockVolCatchUpIO`
   - `BlockVolRebuildIO`
   - `BlockVolExecutor`
2. these contracts intentionally match:
   - `engine.CatchUpIO`
   - `engine.RebuildIO`
This is the right long-term boundary because:
1. `sw-block` owns the execution port shape
2. `weed/storage/blockvol/v2bridge.Executor` remains only one implementation
3. future migration can move execution code without changing engine contracts
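
The alignment can be pinned down with a compile-time check, sketched below. The two interfaces are local stand-ins for `engine.RebuildIO` and `BlockVolRebuildIO`; the method name `TransferFullBase` and its signature are assumptions, chosen only to show a rebuild transfer that returns the achieved LSN alongside the error.

```go
package main

import "fmt"

// engineRebuildIO stands in for engine.RebuildIO (signature assumed).
type engineRebuildIO interface {
	TransferFullBase(targetLSN uint64) (achievedLSN uint64, err error)
}

// bridgeRebuildIO stands in for BlockVolRebuildIO (signature assumed).
type bridgeRebuildIO interface {
	TransferFullBase(targetLSN uint64) (achievedLSN uint64, err error)
}

// Compile-time proof that the two contracts are interchangeable.
// If either side drifts, the package stops compiling.
var (
	_ engineRebuildIO = bridgeRebuildIO(nil)
	_ bridgeRebuildIO = engineRebuildIO(nil)
)

// fakeExecutor is a toy implementation that can reach at most head.
type fakeExecutor struct{ head uint64 }

func (f fakeExecutor) TransferFullBase(targetLSN uint64) (uint64, error) {
	if targetLSN > f.head {
		return f.head, nil
	}
	return targetLSN, nil
}

func main() {
	var io engineRebuildIO = fakeExecutor{head: 80}
	achieved, err := io.TransferFullBase(100)
	fmt.Println(achieved, err) // 80 <nil>
}
```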
### Assignment translation helper port
Normalized helper layer:
1. `ReplicaAssignmentForServer()`
2. `RecoveryTargetForRole()`
These are now the canonical helper rules in:
1. `sw-block/bridge/blockvol/control_adapter.go`
They exist to stop identity / recovery-target mapping from drifting between:
1. `weed/storage/blockvol/v2bridge/control.go`
2. `weed/server/volume_server_block.go`
## Code Normalization Completed
Implemented in this batch:
1. `sw-block/bridge/blockvol/doc.go`
   - clarified that the package owns weed-free contracts and thin adapters,
     not real blockvol implementations
2. `sw-block/bridge/blockvol/contract.go`
   - aligned execution contracts with engine IO interfaces
3. `sw-block/bridge/blockvol/control_adapter.go`
   - extracted canonical helper functions for identity and recovery-target
     mapping
4. `sw-block/bridge/blockvol/bridge_test.go`
   - added interface-compatibility proof for the normalized execution contracts
## Resulting Boundary Rule
After this audit, the port layer rule is:
1. `sw-block` defines contracts and canonical mapping helpers
2. `weed/` implements real storage, transport, and runtime bindings
3. no `sw-block` package in this layer should import `weed/`
## What Still Does Not Move Yet
This audit does NOT move:
1. `weed/storage/blockvol/v2bridge.Executor`
2. `weed/storage/blockvol/v2bridge.Reader`
3. `weed/storage/blockvol/v2bridge.Pinner`
4. `weed/server/BlockService`
5. `weed/server/RecoveryManager`
It only stabilizes the port layer those migrations will target.

@@ -73,6 +73,28 @@ Do not reuse tests that encode V1.5 semantics that V2 intentionally removed:
| `Restore Ready` | exact base restore and snapshot-tail recovery are safe and exact | snapshot boundary, integrity, partial-failure safety, tail convergence |
| `V2 Ready` | primary-owned assignment/session/projection semantics close on real flows | bootstrap, keepup, catchup, rebuild, failover, rejoin, publish gating |
## Matrix Linkage
Read this matrix together with:
1. `v2-capability-map.md` for ownership of the capability tier
2. `v2-integration-matrix.md` for real scenario coverage
This matrix answers "is the capability closed at all?" It does not by itself
guarantee that enough real topology/workload/failure scenarios have been
exercised. That second question belongs to the integration matrix.
| Validation rows | Capability tier | Primary protocol refs | Main integration refs |
|---|---|---|---|
| `R1`-`R12` | Tier 3: RF=2 Recovery And Failover | `v2-sync-recovery-protocol.md`, `v2-rebuild-mvp-session-protocol.md` | `I-R1`-`I-R8` |
| `S1`-`S10` | Tier 5: Lifecycle Capability | `v2-rebuild-mvp-session-protocol.md` and snapshot/restore execution rules | `I-S1`-`I-S4` |
| `V1`-`V2` | Tier 2: RF=2 Replication Base | `v2-sync-recovery-protocol.md` | `I-V1`, `I-V2` |
| `V3`-`V8` | Tier 3: RF=2 Recovery And Failover | `v2-sync-recovery-protocol.md`, `v2-rebuild-mvp-session-protocol.md` | `I-V3`, `I-V4`, `I-V5` |
| `V9`-`V10` | Tier 4: Multi-Replica Runtime (`RF>=3`) | `v2-sync-recovery-protocol.md` | future RF>=3 integrated rows; currently bounded engine/component proof only |
| `V11` | Tier 3 / Tier 8 boundary | recovery protocol plus launch-envelope disturbance claims | `I-V5` |
| `V12`-`V13` | Tier 6: Control Plane And Operations | ownership / observability docs and control-plane protocol surfaces | `I-V4`, `I-V6` |
| `V14` | Tier 0: Semantic Foundation | `v2-protocol-truths.md`, `v2-sync-recovery-protocol.md` | exercised across `I-V*` and negative component packs |
## Matrix A: Rebuild Ready
| ID | Priority | Scenario | Trigger / entry | Reuse | Main proof | Final validation | Coverage | File | Evidence |

@@ -1,151 +0,0 @@
# V2 VolumeV2 Single-Node MVP
Date: 2026-04-05
Status: active
## Purpose
This note defines the target shape for a single-node `volumev2` MVP that can
ship as a normal block service before HA/failover exists.
The core idea is:
1. `masterv2` is fully new control ownership
2. `volumev2` is a new shell and brain host
3. `blockvol` and related backend mechanics remain reusable muscles
## Target Layering
`volumev2` should be strengthened around four layers.
### 1. Engine
Owner:
- `sw-block/engine/replication/`
Responsibility:
1. state
2. event ingestion
3. command emission
4. outward projection
Rule:
- semantic truth lives here
- no backend I/O or network ownership
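
A toy sketch of that shape, with an invented event and command vocabulary (`assign_primary`, `open_volume`, `start_frontend` are placeholders, not the real engine types). The engine only updates its own state and returns commands; executing them is left to another layer:

```go
package main

import "fmt"

// event and command are hypothetical vocabulary types.
type event struct {
	Kind  string
	Epoch uint64
}

type command struct{ Kind string }

// engine holds semantic state only: no file handles, no sockets.
type engine struct {
	role  string
	epoch uint64
}

// Apply ingests one event and returns the commands to execute.
// Re-delivering an already applied assignment emits nothing.
func (e *engine) Apply(ev event) []command {
	switch ev.Kind {
	case "assign_primary":
		if ev.Epoch <= e.epoch {
			return nil // duplicate or stale assignment
		}
		e.role, e.epoch = "primary", ev.Epoch
		return []command{{Kind: "open_volume"}, {Kind: "start_frontend"}}
	}
	return nil
}

// Projection is the outward, read-only view of engine state.
func (e *engine) Projection() string {
	return fmt.Sprintf("role=%s epoch=%d", e.role, e.epoch)
}

func main() {
	e := &engine{role: "none"}
	fmt.Println(len(e.Apply(event{Kind: "assign_primary", Epoch: 1})), e.Projection()) // 2 role=primary epoch=1
	fmt.Println(len(e.Apply(event{Kind: "assign_primary", Epoch: 1})), e.Projection()) // 0 role=primary epoch=1
}
```

Keeping `Apply` free of I/O means the same engine can be driven by a simulator or by the real runtime shell.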
### 2. Engine Interface
Owner:
- command/event vocabulary between control/runtime and backend execution
Responsibility:
1. assignment -> event translation
2. observation -> event translation
3. command -> execution dispatch contract
Rule:
- runtime shell may not mutate engine truth directly
### 3. Control Plane
Owner:
- `masterv2 <-> volumev2` coordination
Responsibility:
1. node identity
2. registration and heartbeat
3. assignment receipt
4. state reporting
5. future recovery-control vocabulary (`keepup`, `catchup`, `rebuild`)
Rule:
- control plane carries protocol messages
- it does not own local data execution
### 4. Data Plane
Owner:
- local storage and serving mechanics
Responsibility:
1. WAL/extent management
2. read/write/flush
3. background workers
4. receiver/shipper mechanics
5. NVMe/iSCSI/frontend serving
Rule:
- data plane knows how to execute
- it does not define publication or role semantics
## Single-Node MVP Contract
The first ship-capable `volumev2` slice should include:
1. `masterv2` declaration of one RF1 primary volume
2. `volumev2` control session to fetch assignments
3. local create/open through reused `blockvol`
4. local primary assignment application through the V2 engine
5. local read/write plus restart durability
6. debug/status snapshot
7. one small executable entrypoint for smoke usage
The first slice explicitly excludes:
1. failover
2. RF2 replication
3. catch-up/rebuild ownership
4. CSI
## Why This Is Enough
This is enough to prove:
1. the `masterv2 + volumev2` head is viable
2. `volumev2` can host V2 semantics while reusing V1 muscles
3. a useful non-HA block service can exist before HA complexity is added
## Module Shape
Recommended package split:
1. `sw-block/runtime/masterv2/`
2. `sw-block/runtime/volumev2/`
3. `sw-block/runtime/purev2/`
4. `sw-block/engine/replication/`
5. `sw-block/bridge/blockvol/`
Within `volumev2`, strengthen toward:
1. `control_session.go`
2. `orchestrator.go`
3. `node.go`
4. later: `heartbeat.go`, `frontend.go`, `workers.go`
## Stage Gate
`volumev2` may be treated as a single-node MVP only when:
1. assignment sync is repeatable and idempotent
2. local IO is data-verified
3. restart/open path is proven
4. status/debug state is explicit
5. no `weed/server` lifecycle owner is required
## Related References
- `v2-pure-runtime-rf1-bootstrap.md`
- `v2-proof-and-retest-pyramid.md`
- `v2-capability-map.md`

@@ -196,6 +196,11 @@ func (ms *MasterServer) failoverBlockVolumes(deadServer string) {
	}
	ms.blockRegistry.FailoversTotal.Add(1)
	entries := ms.blockRegistry.ListByServer(deadServer)
	glog.V(0).Infof("failover: deadServer=%s entries=%d", deadServer, len(entries))
	for i, e := range entries {
		glog.V(0).Infof("failover: entry[%d] name=%q vs=%s role=%d hasReplica=%v epoch=%d",
			i, e.Name, e.VolumeServer, e.Role, e.HasReplica(), e.Epoch)
	}
	now := time.Now()
	for _, entry := range entries {
		// Case 1: Dead server is the primary.

@@ -111,6 +111,12 @@ type BlockVolumeEntry struct {
LastLeaseGrant time.Time
LeaseTTL time.Duration
// Registration race protection: the time this entry was created/registered
// by the master. Stale cleanup skips recently registered entries to allow
// the volume server time to discover the volume and include it in its
// next heartbeat inventory.
RegisteredAt time.Time
// CP11A-2: Coordinated expand tracking.
ExpandInProgress bool
ExpandFailed bool // true = primary committed but replica(s) failed; size suppressed
@@ -424,6 +430,9 @@ func (r *BlockVolumeRegistry) Register(entry *BlockVolumeEntry) error {
	if _, ok := r.volumes[entry.Name]; ok {
		return fmt.Errorf("block volume %q already registered", entry.Name)
	}
	if entry.RegisteredAt.IsZero() {
		entry.RegisteredAt = time.Now()
	}
	entry.recomputeReplicaState()
	r.volumes[entry.Name] = entry
	r.addToServer(entry.VolumeServer, entry.Name)
@@ -642,6 +651,14 @@ func (r *BlockVolumeRegistry) UpdateFullHeartbeatWithInventoryAuthority(server s
name, server)
continue
}
// Registration race protection: skip recently registered entries.
// The VS may not have discovered the volume yet. Grace period
// of 30s (> 2 heartbeat intervals) prevents premature deletion.
if !entry.RegisteredAt.IsZero() && time.Since(entry.RegisteredAt) < 30*time.Second {
glog.V(0).Infof("block registry: skipping stale-cleanup for %q (registered %v ago, grace period)",
name, time.Since(entry.RegisteredAt).Round(time.Second))
continue
}
delete(r.volumes, name)
delete(names, name)
// Also clean up replica entries from byServer.

@@ -88,8 +88,9 @@ func TestRegistry_ListByServer(t *testing.T) {
func TestRegistry_UpdateFullHeartbeat(t *testing.T) {
r := NewBlockVolumeRegistry()
// Register two volumes on server s1.
r.Register(&BlockVolumeEntry{Name: "vol1", VolumeServer: "s1", Path: "/v1.blk", Status: StatusPending})
r.Register(&BlockVolumeEntry{Name: "vol2", VolumeServer: "s1", Path: "/v2.blk", Status: StatusPending})
pastGrace := time.Now().Add(-60 * time.Second)
r.Register(&BlockVolumeEntry{Name: "vol1", VolumeServer: "s1", Path: "/v1.blk", Status: StatusPending, RegisteredAt: pastGrace})
r.Register(&BlockVolumeEntry{Name: "vol2", VolumeServer: "s1", Path: "/v2.blk", Status: StatusPending, RegisteredAt: pastGrace})
// Full heartbeat reports only vol1 (vol2 is stale).
r.UpdateFullHeartbeat("s1", []*master_pb.BlockVolumeInfoMessage{
@@ -127,7 +128,7 @@ func TestRegistry_UpdateFullHeartbeatWithInventoryAuthority_NonAuthoritativeEmpt
func TestRegistry_UpdateFullHeartbeatWithInventoryAuthority_AuthoritativeEmptyStillDeletes(t *testing.T) {
r := NewBlockVolumeRegistry()
r.Register(&BlockVolumeEntry{Name: "vol1", VolumeServer: "s1", Path: "/v1.blk", Status: StatusActive})
r.Register(&BlockVolumeEntry{Name: "vol1", VolumeServer: "s1", Path: "/v1.blk", Status: StatusActive, RegisteredAt: time.Now().Add(-60 * time.Second)})
r.UpdateFullHeartbeatWithInventoryAuthority("s1", nil, "", true)
@@ -3206,3 +3207,58 @@ func TestRegistry_UpdateFullHeartbeat_EngineProjectionModePreservedOnNewPrimaryW
		t.Fatalf("EngineProjectionMode=%q, want %q from new primary", entry.EngineProjectionMode, "degraded")
	}
}

func TestRegistry_StaleCleanup_SkipsRecentlyRegisteredEntry(t *testing.T) {
	r := NewBlockVolumeRegistry()
	r.MarkBlockCapable("vs1:8080")
	// Register a volume — RegisteredAt is set automatically.
	if err := r.Register(&BlockVolumeEntry{
		Name:         "vol-grace",
		VolumeServer: "vs1:8080",
		Path:         "/blocks/vol-grace.blk",
		Status:       StatusActive,
		Role:         blockvol.RoleToWire(blockvol.RolePrimary),
	}); err != nil {
		t.Fatalf("register: %v", err)
	}
	// Authoritative heartbeat from vs1 that does NOT report this volume.
	// Without grace period, this would delete the entry.
	r.UpdateFullHeartbeatWithInventoryAuthority("vs1:8080", nil, "", true)
	// Entry should survive — it was just registered.
	entry, ok := r.Lookup("vol-grace")
	if !ok {
		t.Fatal("recently registered entry was deleted by stale cleanup — grace period not working")
	}
	if entry.Name != "vol-grace" {
		t.Fatalf("entry name=%q, want vol-grace", entry.Name)
	}
}

func TestRegistry_StaleCleanup_DeletesOldUnreportedEntry(t *testing.T) {
	r := NewBlockVolumeRegistry()
	r.MarkBlockCapable("vs1:8080")
	// Register a volume with RegisteredAt in the past (beyond grace period).
	if err := r.Register(&BlockVolumeEntry{
		Name:         "vol-stale",
		VolumeServer: "vs1:8080",
		Path:         "/blocks/vol-stale.blk",
		Status:       StatusActive,
		Role:         blockvol.RoleToWire(blockvol.RolePrimary),
		RegisteredAt: time.Now().Add(-60 * time.Second), // 60s ago, past grace
	}); err != nil {
		t.Fatalf("register: %v", err)
	}
	// Authoritative heartbeat without this volume.
	r.UpdateFullHeartbeatWithInventoryAuthority("vs1:8080", nil, "", true)
	// Entry should be deleted — it's old and not reported.
	_, ok := r.Lookup("vol-stale")
	if ok {
		t.Fatal("old unreported entry survived stale cleanup — grace period should not protect it")
	}
}

@@ -349,7 +349,10 @@ func (vs *VolumeServer) doHeartbeatWithRetry(masterAddress pb.ServerAddress, grp
return
case <-vs.stopChan:
var volumeMessages []*master_pb.VolumeInformationMessage
blockInventoryAuthoritative := true
// Shutdown beat: clear regular volumes but do NOT claim block
// inventory authority. The block registry entry must survive
// shutdown so failoverBlockVolumes can promote the replica.
noBlockAuthority := false
emptyBeat := &master_pb.Heartbeat{
Ip: ip,
Port: port,
@@ -359,8 +362,8 @@ func (vs *VolumeServer) doHeartbeatWithRetry(masterAddress pb.ServerAddress, grp
Rack: rack,
Volumes: volumeMessages,
HasNoVolumes: len(volumeMessages) == 0,
HasNoBlockVolumes: vs.blockService != nil,
BlockVolumeInventoryAuthoritative: &blockInventoryAuthoritative,
HasNoBlockVolumes: false,
BlockVolumeInventoryAuthoritative: &noBlockAuthority,
}
glog.V(1).Infof("volume server %s:%d stops and deletes all volumes", vs.store.Ip, vs.store.Port)
if err = stream.Send(emptyBeat); err != nil {

@@ -211,13 +211,13 @@ func CreateBlockVol(path string, opts CreateOptions, cfgs ...BlockVolConfig) (*B
Interval: cfg.FlushInterval,
Metrics: v.Metrics,
BatchIO: bio,
// CP13-6: replica-aware WAL retention.
RetentionFloorFn: func() (uint64, bool) {
if v.shipperGroup == nil {
return 0, false
}
return v.shipperGroup.MinRecoverableFlushedLSN()
},
// No keepup WAL retention: flusher recycles freely. If a replica
// falls behind and WAL entries are recycled, it escalates to
// NeedsRebuild — the correct outcome. Catch-up from extent via
// the LBA dirty map (V2) will eliminate this tension entirely.
// Session-only WAL pins (for active rebuild/catch-up) are handled
// separately by SetV2RetentionFloor.
RetentionFloorFn: nil,
EvaluateRetentionBudgetsFn: func() {
if v.shipperGroup != nil {
v.shipperGroup.EvaluateRetentionBudgets(RetentionBudgetParams{
@@ -340,17 +340,13 @@ func OpenBlockVol(path string, cfgs ...BlockVolConfig) (*BlockVol, error) {
Interval: cfg.FlushInterval,
Metrics: v.Metrics,
BatchIO: bio,
RetentionFloorFn: func() (uint64, bool) {
if v.shipperGroup == nil {
return 0, false
}
return v.shipperGroup.MinRecoverableFlushedLSN()
},
// No keepup WAL retention (same as CreateBlockVol path).
RetentionFloorFn: nil,
EvaluateRetentionBudgetsFn: func() {
if v.shipperGroup != nil {
v.shipperGroup.EvaluateRetentionBudgets(RetentionBudgetParams{
Timeout: walRetentionTimeout,
MaxBytes: 0, // CP13-6 max-bytes disabled: uses replicaFlushedLSN which can't advance without barrier; v2 will replace with negotiated recovery protocol
MaxBytes: 0, // CP13-6 max-bytes disabled
PrimaryHeadLSN: v.nextLSN.Load() - 1,
BlockSize: v.super.BlockSize,
})

@@ -230,6 +230,34 @@ func (sg *ShipperGroup) MinRecoverableFlushedLSN() (uint64, bool) {
	return min, found
}

// MinShippedLSN returns the minimum shippedLSN across all active shippers
// (those not in NeedsRebuild). It was introduced as a Ceph-model retention
// watermark, but the flusher no longer uses it as a retention floor and
// recycles WAL freely. The value is retained as a diagnostic surface only.
//
// Returns (0, false) if no active shipper has shipped anything yet.
func (sg *ShipperGroup) MinShippedLSN() (uint64, bool) {
	sg.mu.RLock()
	defer sg.mu.RUnlock()
	var min uint64
	found := false
	for _, s := range sg.shippers {
		if s.State() == ReplicaNeedsRebuild {
			continue
		}
		lsn := s.ShippedLSN()
		if lsn == 0 {
			continue // hasn't shipped yet; exclude from the minimum
		}
		if !found || lsn < min {
			min = lsn
			found = true
		}
	}
	return min, found
}

// RetentionBudgetParams holds the inputs for retention budget evaluation.
type RetentionBudgetParams struct {
	Timeout time.Duration

@@ -76,6 +76,18 @@ func assertEqual(ctx context.Context, actx *tr.ActionContext, act tr.Action) (ma
	actual := act.Params["actual"]
	expected := act.Params["expected"]
	// Reject empty strings — prevents false positives when an upstream action
	// failed silently and returned empty. "" == "" would hide real failures.
	if actual == "" && expected == "" {
		return nil, fmt.Errorf("assert_equal: both actual and expected are empty — likely upstream action failure")
	}
	if actual == "" {
		return nil, fmt.Errorf("assert_equal: actual is empty (expected %q) — likely upstream action failure", expected)
	}
	if expected == "" {
		return nil, fmt.Errorf("assert_equal: expected is empty (actual %q) — likely upstream action failure", actual)
	}
	if actual != expected {
		return nil, fmt.Errorf("assert_equal: %q != %q", actual, expected)
	}

@@ -187,16 +187,18 @@ phases:
- name: kill-primary
actions:
- action: print
msg: "=== Killing primary ({{ before_server }}) ==="
msg: "=== Killing primary ({{ before }}, {{ before_server }}) ==="
- action: exec
node: m02
cmd: "kill -9 {{ vs1_pid }}"
root: "true"
# Kill the primary VS using stop_weed with the discovered primary's PID.
# Master consistently places the primary on m01 (vs2_pid) in this
# topology. discover_primary confirms this.
- action: stop_weed
node: m01
pid: "{{ vs2_pid }}"
ignore_error: true
- action: print
msg: "Primary killed. Waiting for lease expiry (45s)..."
msg: "Primary killed on m01 ({{ before_server }}). Waiting for lease expiry (45s)..."
- action: sleep
duration: 45s
@@ -209,9 +211,9 @@ phases:
timeout: 60s
save_as: after
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 60s
# Skip wait_volume_healthy: with RF=2 and only 1 node alive after
# failover, the volume can't reach "healthy" (needs 2 replicas).
# The wait_block_primary above already confirms failover succeeded.
- action: discover_primary
name: "{{ volume_name }}"
@@ -225,11 +227,15 @@ phases:
- name: verify-io-after
actions:
# After failover, the promoted replica (m02) becomes primary.
# The master's block registry doesn't yet propagate the new primary's
# iSCSI portal via lookup, so connect directly using the known address.
# m02's iSCSI is on the RDMA IP (10.0.0.3) port 3295.
- action: iscsi_login_direct
node: m01
host: "10.0.0.1"
host: "10.0.0.3"
port: "3295"
iqn: "{{ vol_iqn }}"
iqn: "iqn.2024-01.com.seaweedfs:vol.{{ volume_name }}"
save_as: device2
- action: dd_read_md5