docs(p15): close G8 failover data continuity

pingqiu
2026-05-02 16:22:57 -07:00
parent 0483a035b0
commit 1a613afba4
3 changed files with 1320 additions and 0 deletions

@@ -0,0 +1,216 @@
# V3 Phase 15 — Control Plane Evolution Plan
**Date**: 2026-05-02
**Status**: ACTIVE design note for P15 MVP planning
**Code repo**: `seaweed_block`
**Docs repo**: `seaweedfs/sw-block/design`
---
## 0. Purpose
This note pins the control-plane shape we want while P15 is still small.
The product direction is a Kubernetes-usable block service that can later grow toward a Ceph-like cluster manager without rewriting the authority boundary. The immediate goal is not to build a full Ceph-class master. The goal is to keep the interfaces correct now, so later placement, repair, rebalance, and reintegration are not forced through ad-hoc shortcuts.
---
## 1. Core Rule
`blockmaster` may own placement as a product responsibility.
But code must express that ownership through a control-plane policy layer, not through daemon-local if/then assignment mutation.
Allowed shape:
```text
cmd/blockmaster
-> ObservationHost
-> ClusterState / TopologyController / PlacementPolicy / ReintegrationPolicy
-> AssignmentAsk / PlacementPlan / RecoveryPlan
-> authority.Publisher
-> AssignmentFact
-> blockvolume adapter.OnAssignment
```
Rejected shape:
```text
heartbeat handler
-> if server timed out then publish r2 assignment
```
The rejected shape is the known anti-pattern: heartbeat timing becomes authority, `cmd` becomes policy owner, and assignment truth bypasses the testable authority controller seam.
---
## 2. Layering Contract
| Layer | Owns | Must not own |
|---|---|---|
| `cmd/blockmaster` | process lifecycle, flags, RPC servers, durable directories, wiring | placement policy, failover policy, direct `AssignmentInfo` construction |
| Observation ingestion | heartbeat/progress/durable-ack facts, freshness classification | authority minting, primary selection |
| Cluster state | current known servers, volumes, candidates, supportability, progress facts | direct frontend/data-plane mutation |
| Policy / planner | placement, failover, reintegration, repair, drain decisions as intent | epoch/endpoint minting, adapter calls |
| `authority.Publisher` | epoch/endpointVersion minting, assignment fan-out, durable authority line | deciding policy |
| `core/engine` | local replica recovery semantics: catch-up vs rebuild, session lifecycle, stale lineage | placement, failover candidate choice, rebalance |
| blockvolume runtime | execute assignment, feed WAL/base, report progress | self-promote, heartbeat-as-authority |
The short form:
```text
master owns cluster policy;
authority owns minting;
engine owns local recovery semantics;
runtime executes and reports facts.
```
---
## 3. Fact Types We Need To Keep Distinct
P15 should avoid overloading one heartbeat/status struct into every decision.
| Fact | Producer | Consumer | Meaning |
|---|---|---|---|
| `ObservationReport` | blockvolume | master observation layer | server/slot reachability and eligibility |
| `DurableProgressAck` | blockvolume / replica runtime | master cluster state | replica durable frontier / sync evidence |
| `BaseProgressAck` | recovery receiver | master/diagnostics | rebuild/base feed progress; useful for pin/retention and visibility |
| `PlacementIntent` | user/CSI/operator | master policy | desired volume/RF/topology input |
| `PlacementPlan` | planner | publisher/reconciler | bounded desired primary/replica set; not minted authority |
| `AssignmentFact` | publisher | blockvolumes | current authority line, epoch/EV minted by publisher |
| `RecoveryPlan` | planner/coordinator | feeder/runtime | who feeds whom from what base/WAL pin |
| `ReadinessFact` | runtime/feeder | master policy | candidate/syncing/ready, not frontend primary readiness |
Important distinction:
```text
FrontendHealthy != ReplicaReady
Observed != ReplicaCandidate
ReplicaCandidate != ReplicaReady
AuthorityMoved != DataContinuityProven
```
---
## 4. Returned Replica State Model
Old primary return is the first G8 pressure point where the distinction matters.
Correct sequence:
```text
old primary returns
-> Observed
-> FrontendClosed / Superseded
-> ReplicaCandidate
-> Syncing or Rebuilding
-> ReplicaReady
```
Do not collapse this to:
```text
heartbeat arrived -> replica ready
```
For P15, the minimum future-facing interface should expose or internally model:
1. `FrontendPrimaryReady`: can serve frontend read/write.
2. `AuthorityRole`: current primary / non-primary / superseded.
3. `ReplicationRole`: none / candidate / syncing / ready.
4. `Progress`: durable ack, base progress, WAL frontier.
This does not require a production external API immediately. It does require tests and code to stop using `Healthy=false` as a catch-all for every non-primary state.
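As a minimal sketch of the returned-replica sequence above, the following Go snippet models the states as an ordered enum and allows only single forward steps. All names here (`ReplicaState`, `Advance`, `ErrIllegalTransition`) are hypothetical and not taken from `seaweed_block`; `Syncing` stands in for "Syncing or Rebuilding", and the evidence flag is an assumed stand-in for durable/progress facts.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaState is a hypothetical enum for the returned-replica sequence.
type ReplicaState int

const (
	Observed         ReplicaState = iota
	FrontendClosed                // also stands in for Superseded
	ReplicaCandidate              // known to the master, not yet fed
	Syncing                       // also stands in for Rebuilding
	ReplicaReady                  // requires progress evidence
)

// ErrIllegalTransition is returned when a step is skipped or evidence is missing.
var ErrIllegalTransition = errors.New("illegal replica state transition")

// Advance permits only a single forward step. Entering ReplicaReady
// additionally requires durable/progress evidence, never a bare heartbeat.
func Advance(from, to ReplicaState, hasProgressEvidence bool) (ReplicaState, error) {
	if to > ReplicaReady || to != from+1 {
		return from, ErrIllegalTransition
	}
	if to == ReplicaReady && !hasProgressEvidence {
		return from, ErrIllegalTransition
	}
	return to, nil
}

func main() {
	// Collapsed path "heartbeat arrived -> replica ready" must be rejected.
	_, err := Advance(Observed, ReplicaReady, true)
	fmt.Println("heartbeat -> ready:", err)

	// Full path, with progress evidence supplied only at the final step.
	state := Observed
	for _, next := range []ReplicaState{FrontendClosed, ReplicaCandidate, Syncing, ReplicaReady} {
		state, err = Advance(state, next, next == ReplicaReady)
		if err != nil {
			fmt.Println("unexpected:", err)
			return
		}
	}
	fmt.Println("reached ReplicaReady:", state == ReplicaReady)
}
```

Keeping the transition check in one function gives tests a single place to assert that no caller can shortcut from `Observed` to `ReplicaReady`.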
---
## 5. MVP Strengthening From This Design
The P15 MVP needs these additions before it can look like a smooth beta:
1. **Control-plane state vocabulary**
- distinguish frontend readiness, authority role, replication role, and progress.
- avoid interpreting one boolean `Healthy` as product state.
2. **Reintegration policy seam**
- returned peer becomes candidate, not ready.
- current primary/coordinator feeds it.
- durable/progress facts promote it to ready.
3. **Placement intent seam**
- user/CSI asks for volume + RF.
- planner creates desired topology/placement.
- publisher mints assignment only from bounded intent.
4. **Plan/result split**
- planner returns `PlacementPlan` / `RecoveryPlan`.
- reconciler executes and observes completion.
- closure needs fact evidence, not just plan existence.
5. **Operator-visible explanation**
- why selected primary?
- why candidate rejected?
- why replica is syncing instead of ready?
- why recovery is stalled?
---
## 6. P15 Near-Term Application
### G8
G8 should remain narrow:
1. authority failover eligibility,
2. old primary stale fence,
3. new primary data continuity,
4. old primary return does not become double-primary.
G8 may add a prelude for a returned peer becoming `ReplicaCandidate`, but `ReplicaReady` after reintegration can be a G8-followup/G9A item unless explicitly pulled into scope.
### G9 / G9A
G9/G9A should introduce the first product-shaped placement intent:
```text
CreateVolume(volume_id, size, rf)
-> desired volume record
-> placement plan
-> publisher assignment
```
No CSI/API path should be allowed to submit raw `AssignmentInfo`.
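The intent-to-plan step can be sketched in Go as follows. This is an illustration under stated assumptions, not the planned implementation: `PlacementIntent`, `PlacementPlan`, `Plan`, and `ErrNotEnoughServers` are hypothetical names, and the selection rule (sort by server name, take the first RF) is a placeholder for a real placement policy. The point of the sketch is that the plan carries no epoch or endpoint version, so a caller cannot smuggle minted authority through it.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// PlacementIntent is the bounded request a CSI/API caller may submit.
// All type and field names here are hypothetical.
type PlacementIntent struct {
	VolumeID  string
	SizeBytes uint64
	RF        int
}

// PlacementPlan is desired topology only. It has no epoch or endpoint
// version field, because minting belongs to the publisher.
type PlacementPlan struct {
	VolumeID string
	Primary  string
	Replicas []string
}

// ErrNotEnoughServers is returned when the intent cannot be satisfied.
var ErrNotEnoughServers = errors.New("not enough eligible servers for requested RF")

// Plan picks RF servers deterministically (sorted by name) from the
// eligible set. A real policy would weigh capacity and topology.
func Plan(intent PlacementIntent, eligible []string) (PlacementPlan, error) {
	if intent.RF < 1 || len(eligible) < intent.RF {
		return PlacementPlan{}, ErrNotEnoughServers
	}
	servers := append([]string(nil), eligible...) // do not mutate the caller's slice
	sort.Strings(servers)
	return PlacementPlan{
		VolumeID: intent.VolumeID,
		Primary:  servers[0],
		Replicas: servers[1:intent.RF],
	}, nil
}

func main() {
	intent := PlacementIntent{VolumeID: "vol-1", SizeBytes: 1 << 30, RF: 2}
	plan, err := Plan(intent, []string{"s3", "s1", "s2"})
	fmt.Println(plan, err)
}
```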
### G17-lite
Observability should expose the split state:
```text
frontend_ready
authority_role
replication_role
durable_ack
recovery_phase
placement_reason
unsupported_reason
```
This is what makes a beta cluster understandable instead of log archaeology.
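One way to picture the split state is a status struct whose JSON keys match the field names listed above. The sketch below is hypothetical: the `VolumeStatus` type, the Go field types (for example `durable_ack` as a `uint64` frontier), and the sample values are assumptions for illustration, not an existing status surface.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VolumeStatus is a hypothetical diagnostics record. The JSON keys
// mirror the split-state names; the Go types are assumptions.
type VolumeStatus struct {
	FrontendReady     bool   `json:"frontend_ready"`
	AuthorityRole     string `json:"authority_role"`
	ReplicationRole   string `json:"replication_role"`
	DurableAck        uint64 `json:"durable_ack"`
	RecoveryPhase     string `json:"recovery_phase"`
	PlacementReason   string `json:"placement_reason"`
	UnsupportedReason string `json:"unsupported_reason"`
}

// Render encodes the status as a single JSON object.
func Render(s VolumeStatus) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// A returned old primary: frontend closed, superseded, still syncing.
	out, err := Render(VolumeStatus{
		FrontendReady:   false,
		AuthorityRole:   "superseded",
		ReplicationRole: "syncing",
		DurableAck:      1024,
		RecoveryPhase:   "catch-up",
		PlacementReason: "superseded by newer epoch",
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out)
}
```

Because each dimension is its own field, a reader can tell "frontend closed but replica syncing" apart from "process down" without inferring it from one boolean.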
---
## 7. Anti-Pattern Checklist
Before any master/control-plane change lands, check:
1. Does a heartbeat timeout directly mint authority?
2. Does `cmd/blockmaster` construct or mutate `AssignmentInfo`?
3. Does a blockvolume local role claim become authority?
4. Does assignment movement alone claim data continuity?
5. Does a returned process imply replica-ready without progress evidence?
6. Does engine receive placement/failover policy it should not own?
7. Does runtime callback return a semantic decision instead of an observation?
Any "yes" is a design stop unless explicitly ratified as a new product contract.

@@ -0,0 +1,548 @@
# V3 Phase 15 — G8 Failover Data Continuity Mini-Plan
**Date**: 2026-05-02
**Status**: CLOSED 2026-05-02; code evidence pushed on `p15-g8/failover-data-continuity`
**Gate**: G8 — Failover Data Continuity
**Predecessor**: G7 CLOSED 2026-05-02, canonical evidence `seaweed_block@d09fcc6`
**Code repo**: `seaweed_block`
**Docs repo**: `seaweedfs/sw-block/design`
**§1.A architect ratification**: 2026-05-02. First G8 close uses RF=2 subprocess L2 with real `cmd/blockmaster` + 2x `cmd/blockvolume` daemons and explicit iSCSI reconnect as sufficient evidence. m01/M02 cross-node confirmation is forward-carry, not first-close blocking.
---
## 0. Product Sentence
After the primary for a replicated volume fails, the system can move service to an eligible replica without losing previously acknowledged data, and the old primary cannot later acknowledge stale reads or writes against the old authority line.
G8 is the first P15 gate that turns recovery correctness into availability semantics.
---
## 1. Scope
### 1.1 In scope
G8 covers one narrow but product-critical path:
1. A replicated volume has a primary and at least one eligible replica.
2. A client writes known data and receives success.
3. The current primary process is killed or made unreachable.
4. V3 authority publishes a new primary assignment.
5. The client reconnects or reattaches to the new primary.
6. The new primary reads the exact acknowledged data.
7. The old primary, if it returns, cannot corrupt future state and cannot serve stale success under the old line.
### 1.2 Required evidence shape
G8 success must include data verification. Authority movement alone is insufficient.
Required proof:
```text
acknowledged write before failure
-> primary failure
-> reassignment / new primary
-> reconnect or reattach
-> byte-equal read from new primary
-> old primary stale path rejects
```
### 1.3 Out of scope
| Item | Disposition |
|---|---|
| Multi-master HA / distributed authority store | Out of P15 unless separately ratified. |
| Rack/AZ-aware placement | G20/P16; not G8. |
| Full chaos matrix | G22 final validation. G8 binds a minimal failure-continuity gate. |
| Performance SLO / failover time SLO | G21. G8 records timing, but no SLO claim. |
| CSI Kubernetes failover under workload | Dogfood checkpoint / G15a+G17 after G9A; G8 may provide the lower data-continuity primitive. |
| Snapshot/resize behavior across failover | G10/G11/G15b. |
| Strict RF=2 quorum/full-ack write contract under replica lag | Deferred to G8-followup/G9A policy. G8 proves healthy-path data continuity, not that every primary ACK is blocked on a replica durable ACK. |
### 1.4 ACK profile vs recovery profile
G8 keeps these two concepts separate:
1. **ACK profile** decides when the frontend write/sync returns success.
2. **Recovery profile** decides what happens when a replica falls behind, disappears, or misses progress.
`best-effort` means the frontend ACK path does not require a full remote replica durable ACK for every write/sync. It does **not** mean the system ignores lag. If progress facts show the replica is behind, stalled, or below the retained WAL window, the coordinator/feeder path must still catch up or rebuild that replica.
The stricter future profile is a quorum/full-ack mode: frontend success is withheld until the configured replica durability condition is met. That is the profile needed before claiming RF=2 zero acknowledged-write loss under arbitrary replica lag or failure.
For RF=2 in quorum/full-ack mode, a replica that has entered recovery cannot count as the synchronous ACK peer. While that condition holds, the primary has only three valid product behaviors:
1. **Fail or block new writes/syncs** until the replica leaves recovery and becomes sync-ack eligible again.
2. **Explicitly degrade the volume to best-effort / single-replica availability** through a named policy transition that is visible to operators and clients.
3. **Fence or make the volume read-only/unavailable** if policy requires no RPO exposure.
It must not silently ACK writes as "full sync" while the only secondary is in recovery. Recovery traffic may continue feeding the replica, but recovery progress is not a substitute for the synchronous ACK contract.
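The RF=2 rule above can be written down as a small decision function. The following Go sketch is illustrative only: `AckProfile`, `RecoveryPolicy`, `WriteOutcome`, and `Decide` are hypothetical names, and the function returns the effective profile alongside the outcome so that a degrade to best-effort is an explicit, observable result rather than a silent one. Blocking and fencing both reject the write here; how they differ for reads and availability is outside this sketch.

```go
package main

import "fmt"

// AckProfile decides when a frontend write/sync may return success.
type AckProfile int

const (
	BestEffort AckProfile = iota
	FullAck
)

// RecoveryPolicy is the configured behavior while the only secondary
// is in recovery under FullAck. All names are hypothetical.
type RecoveryPolicy int

const (
	BlockWrites RecoveryPolicy = iota
	DegradeToBestEffort
	FenceReadOnly
)

// WriteOutcome is what the primary does with an incoming write.
type WriteOutcome int

const (
	AckOnPrimary        WriteOutcome = iota // ACK without a replica durable ACK
	AckOnReplicaDurable                     // ACK only after the replica durable ACK
	RejectWrite                             // fail or block the write/sync
)

// Decide returns the write outcome and the effective ACK profile.
// A replica in recovery never counts as the synchronous ACK peer.
func Decide(profile AckProfile, replicaInRecovery bool, policy RecoveryPolicy) (WriteOutcome, AckProfile) {
	if profile == BestEffort {
		return AckOnPrimary, BestEffort
	}
	if !replicaInRecovery {
		return AckOnReplicaDurable, FullAck
	}
	if policy == DegradeToBestEffort {
		// Named transition: the caller must surface the new profile.
		return AckOnPrimary, BestEffort
	}
	// BlockWrites and FenceReadOnly both refuse the write.
	return RejectWrite, FullAck
}

func main() {
	outcome, effective := Decide(FullAck, true, BlockWrites)
	fmt.Println("rejected:", outcome == RejectWrite, "still full-ack:", effective == FullAck)

	outcome, effective = Decide(FullAck, true, DegradeToBestEffort)
	fmt.Println("acked on primary:", outcome == AckOnPrimary, "degraded:", effective == BestEffort)
}
```

Note that no branch returns a primary-only ACK while the effective profile is still `FullAck`, which is the silent case the text forbids.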
---
## 2. Starting Point
### 2.1 What G7 gives us
G7 closed:
1. empty replica join -> rebuild -> byte-equal;
2. concurrent writes during rebuild -> byte-equal;
3. stale WAL -> rebuild -> byte-equal;
4. practical single-ingress WAL feeder;
5. `targetLSN` removed as terminal completion truth;
6. progress fact and flow-control diagnostics seams.
G8 should not reopen those contracts unless a failing G8 test proves a missing handoff.
### 2.2 Current likely gap
The likely G8 gap is not raw rebuild transport. It is the integration boundary:
1. authority reassignment to new primary;
2. frontend readiness on the new primary;
3. stale frontend fail-closed on the old primary;
4. client reconnect / retry semantics;
5. ensuring the chosen new primary has the acknowledged data before service success.
---
## 3. Architecture Bindings
### 3.1 Truth domains
| Domain | G8 owner |
|---|---|
| Authority / assignment | master publisher remains single source of authority; no volume-local role minting. |
| Data correctness | primary/replica storage + replication/recovery proof. |
| Frontend stale fencing | frontend backend/projection must reject old line after EV/epoch moves. |
| Client reconnect | protocol-specific behavior may differ; G8 binds at least one declared path. |
### 3.2 Non-negotiable rules
1. Do not port V2 promote/demote or heartbeat-as-authority.
2. Do not declare failover success from assignment movement alone.
3. Do not let stale primary reads return success.
4. Do not let stale primary writes ACK success.
5. Do not make a recovered replica a serving primary solely because a rebuild session completed; assignment must come from master/publisher.
---
## 4. Scenario Set
### G8-0 — Authority failover eligibility + publisher epoch
**Purpose**: prove the control-plane precondition for failover before data-plane claims enter the picture.
**Shape**:
```text
current authority is r1@epoch=N
r1 is no longer acceptable
only ReadyForPrimary+Reachable+Eligible candidates may be selected
publisher mints r2@epoch=N+1 via IntentReassign
```
**Level**: authority unit/component.
**Pass**: controller emits only a bounded `IntentReassign`; publisher remains the sole epoch minter.
**Fail**: a high-evidence but not-ready candidate is selected, the no-candidate case leaves a desired mint, or local code mints authority.
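A minimal Go sketch of the G8-0 shape follows. It assumes the eligibility predicate `Reachable && ReadyForPrimary && Eligible && !Withdrawn` and uses hypothetical types (`Candidate`, `DecideFailover`, `Publisher.Apply`, an `Evidence` ranking field); it is not the `TopologyController` or `Publisher` code. The controller side returns only an intent with no epoch, and the publisher side is the only place the epoch advances.

```go
package main

import "fmt"

// Candidate holds the per-replica facts the controller may consult.
type Candidate struct {
	ReplicaID       string
	Reachable       bool
	ReadyForPrimary bool
	Eligible        bool
	Withdrawn       bool
	Evidence        uint64 // hypothetical ranking signal
}

// IntentReassign is bounded intent: it names a target, not an epoch.
type IntentReassign struct {
	VolumeID  string
	ReplicaID string
}

// AssignmentFact is minted authority, produced only by the publisher.
type AssignmentFact struct {
	VolumeID  string
	ReplicaID string
	Epoch     uint64
}

func acceptable(c Candidate) bool {
	return c.Reachable && c.ReadyForPrimary && c.Eligible && !c.Withdrawn
}

// DecideFailover returns an intent only if an acceptable candidate
// other than the current primary exists; otherwise it returns false.
func DecideFailover(volumeID, current string, cands []Candidate) (IntentReassign, bool) {
	var best *Candidate
	for i := range cands {
		c := &cands[i]
		if c.ReplicaID == current || !acceptable(*c) {
			continue
		}
		if best == nil || c.Evidence > best.Evidence {
			best = c
		}
	}
	if best == nil {
		return IntentReassign{}, false
	}
	return IntentReassign{VolumeID: volumeID, ReplicaID: best.ReplicaID}, true
}

// Publisher is the sole epoch minter in this sketch.
type Publisher struct{ epochs map[string]uint64 }

// Apply advances the per-volume epoch by one and returns the new fact.
func (p *Publisher) Apply(in IntentReassign) AssignmentFact {
	if p.epochs == nil {
		p.epochs = map[string]uint64{}
	}
	p.epochs[in.VolumeID]++
	return AssignmentFact{VolumeID: in.VolumeID, ReplicaID: in.ReplicaID, Epoch: p.epochs[in.VolumeID]}
}

func main() {
	pub := &Publisher{epochs: map[string]uint64{"vol-1": 1}} // r1@epoch=1
	cands := []Candidate{
		{ReplicaID: "r2", Reachable: true, ReadyForPrimary: true, Eligible: true, Evidence: 10},
		{ReplicaID: "r3", Reachable: true, Eligible: true, Evidence: 99}, // high evidence, not ready
	}
	if intent, ok := DecideFailover("vol-1", "r1", cands); ok {
		fmt.Printf("%+v\n", pub.Apply(intent))
	}
}
```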
---
### G8-A — Component authority move + stale old primary fence
**Purpose**: prove stale read/write fail-closed at the backend/projection boundary after authority moves.
**Shape**:
```text
old primary backend writes data
authority moves epoch/EV to new primary
old backend write -> ErrStalePrimary / protocol failure
old backend read -> ErrStalePrimary / protocol failure
new backend can read/write under new line
```
**Level**: component.
**Expected first red test**: stale read path returns data success or stale write path succeeds after move.
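The fence itself can be sketched as a backend that compares its own authority line with the newest line it has observed. The Go code below is a simplified illustration: `Line`, `Backend`, and `ObserveLine` are hypothetical, and `ErrStalePrimary` is redefined locally for the example rather than imported from the product code. Any observed line that is newer than the backend's own is treated as a supersede, which is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStalePrimary is a local stand-in for the product error of the same name.
var ErrStalePrimary = errors.New("stale primary: authority line superseded")

// Line identifies an authority line by epoch and endpoint version.
type Line struct {
	Epoch           uint64
	EndpointVersion uint64
}

func (l Line) newerThan(o Line) bool {
	return l.Epoch > o.Epoch || (l.Epoch == o.Epoch && l.EndpointVersion > o.EndpointVersion)
}

// Backend serves reads and writes only while its own line is current.
type Backend struct {
	mu     sync.Mutex
	own    Line
	latest Line // newest line observed for the volume
	blocks map[uint64][]byte
}

func NewBackend(own Line) *Backend {
	return &Backend{own: own, latest: own, blocks: map[uint64][]byte{}}
}

// ObserveLine records a line published for the volume; older lines are ignored.
func (b *Backend) ObserveLine(l Line) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if l.newerThan(b.latest) {
		b.latest = l
	}
}

// Write fails closed once a newer line has been observed.
func (b *Backend) Write(lba uint64, p []byte) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.latest.newerThan(b.own) {
		return ErrStalePrimary
	}
	b.blocks[lba] = append([]byte(nil), p...)
	return nil
}

// Read fails closed as well: stale data must not be returned as success.
func (b *Backend) Read(lba uint64) ([]byte, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.latest.newerThan(b.own) {
		return nil, ErrStalePrimary
	}
	return append([]byte(nil), b.blocks[lba]...), nil
}

func main() {
	old := NewBackend(Line{Epoch: 1, EndpointVersion: 1})
	fmt.Println("write before move:", old.Write(0, []byte("acked")))
	old.ObserveLine(Line{Epoch: 2, EndpointVersion: 1}) // authority moved
	_, readErr := old.Read(0)
	fmt.Println("read after move:", readErr)
	fmt.Println("write after move:", old.Write(0, []byte("stale")))
}
```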
---
### G8-B — Multi-process primary kill, new primary reads acknowledged data
**Purpose**: the core G8 pass gate.
**Component prelude**: `G8-B0` proves the data-continuity precondition before subprocess work: after acknowledged primary writes converge to a candidate, that candidate's local durable bytes are byte-equal. It also includes the negative oracle that authority movement alone is not G8 success.
**Shape**:
```text
start blockmaster + two blockvolume processes
write known data to primary
confirm replica has acknowledged/durable data or recovery can close
kill primary process
wait for master to publish new primary
reattach/reconnect client
read from new primary
assert byte-equal
```
**Level**: L2/L3 process/hardware.
**Pass**: byte-equal read from new primary after old primary failure.
**Fail**: only observes assignment moved, or read is not checked.
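The pass/fail rule can be expressed as an oracle that refuses to close on authority movement alone. The Go sketch below uses hypothetical names (`FailoverEvidence`, `CheckContinuity`, and the three error values); it only shows the order of checks, not how the harness gathers the evidence.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// FailoverEvidence is what a scenario run hands to the oracle.
// Field names are hypothetical.
type FailoverEvidence struct {
	AuthorityMoved bool
	AckedWrite     []byte // bytes acknowledged before the primary was killed
	ReadPerformed  bool   // whether the new primary was actually read
	ReadBack       []byte // bytes read from the new primary
}

var (
	ErrNoAuthorityMove = errors.New("authority did not move")
	ErrReadNotChecked  = errors.New("read from new primary was not checked")
	ErrDataMismatch    = errors.New("new primary data is not byte-equal to acknowledged write")
)

// CheckContinuity passes only when authority moved, the new primary
// was read, and the read is byte-equal to the acknowledged write.
func CheckContinuity(e FailoverEvidence) error {
	switch {
	case !e.AuthorityMoved:
		return ErrNoAuthorityMove
	case !e.ReadPerformed:
		return ErrReadNotChecked
	case !bytes.Equal(e.AckedWrite, e.ReadBack):
		return ErrDataMismatch
	}
	return nil
}

func main() {
	acked := []byte("known-data")

	// Negative oracle: assignment moved, but nobody read the new primary.
	fmt.Println("move only:", CheckContinuity(FailoverEvidence{AuthorityMoved: true, AckedWrite: acked}))

	// Positive case: byte-equal read from the new primary.
	fmt.Println("full evidence:", CheckContinuity(FailoverEvidence{
		AuthorityMoved: true,
		AckedWrite:     acked,
		ReadPerformed:  true,
		ReadBack:       []byte("known-data"),
	}))
}
```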
---
### G8-C — Old primary returns stale and is fenced
**Purpose**: prevent split-brain-ish success after old primary restarts or old frontend remains attached.
**Shape**:
```text
after G8-B reassignment
old primary process returns or stale backend remains reachable
old path read -> fail-closed
old path write -> fail-closed
new primary remains authoritative
```
**Level**: component first, then process/hardware if feasible.
**Important non-claim**: `Healthy=false` on the returned old primary proves frontend fail-closed, not that the process has become a ready supporting replica. A returned old primary should eventually flow through:
```text
Observed -> FrontendClosed/Superseded -> ReplicaCandidate -> Syncing/Rebuilding -> ReplicaReady
```
G8-C binds the first two states. Claiming `ReplicaCandidate` or `ReplicaReady` requires additional progress/peer-set evidence and may be pulled into a follow-up slice only if explicitly ratified.
---
### G8-D — Primary kill during active rebuild
**Purpose**: consume the G7 non-claim: primary-kill mid-rebuild was explicitly deferred to G8.
**Shape**:
```text
start rebuild session
kill primary before rebuild completes
system either:
A. continues from new primary and converges, or
B. fails cleanly with explicit unsupported evidence
```
**Initial disposition**: OUT for first G8 close by default. This becomes `G8b` only if architect explicitly pulls it in after G8-0/A/B/C are green. If deferred, G8 close must state the exact consequence and owner.
---
### G8-E — Client reconnect semantics
**Purpose**: bind how clients find the new primary.
**Allowed paths**:
1. explicit reconnect / reattach through test harness;
2. iSCSI/NVMe reconnect if available;
3. frontend-harness reconnect if OS initiator path is not yet stable.
**Rule**: G8 must name the selected path. It cannot leave "client reconnects somehow" implicit.
---
## 5. V2 Port / Reuse Plan
| V2 asset | Disposition |
|---|---|
| `ha-io-continuity.yaml` | PORT scenario shape / oracle. |
| `ha-failover.yaml` | PORT scenario steps where they do not rely on V2 authority. |
| `ha-full-lifecycle.yaml` | Read for future G9/G15 dogfood; likely too broad for first G8. |
| HA component tests under `weed/storage/blockvol/test/` | PORT tests that assert data continuity and stale fencing; REBIND authority setup to V3 publisher. |
| V2 promote/demote RPC | PERMANENT SKIP. Violates V3 authority model. |
| heartbeat-as-authority | PERMANENT SKIP. |
G8 should port scenario oracles aggressively, not V2 authority semantics.
---
## 6. TDD Plan
### 6.1 Red tests before production changes
0. `TestG8_0_TopologyController_FailoverRequiresReadyEligibleReachableCandidate`
- Authority-only eligibility and publisher-minted epoch precondition (`G8-0`).
1. `TestG8A_SupersedeFact_FencesOldPrimaryDurableBackend`
- Component-level stale read/write.
2. `TestG8B0_NewPrimaryCandidate_ReadsAcknowledgedWritesAfterConvergence`
- Component-level data continuity after convergence; prelude to process-level G8-B.
3. `TestG8_FailoverCannotCloseOnAuthorityMoveOnly`
- Negative test: assignment movement without data oracle is not G8 success.
4. `TestG8_ProcessKillPrimary_NewPrimaryByteEqual`
- Multi-process or hardware scenario.
5. `TestG8_OldPrimaryReturn_StalePathRejected`
- Component first; hardware follow-up if feasible.
### 6.2 Test gates
| Layer | Required for first G8 close? |
|---|---|
| Unit / component stale fence | Required. |
| Component data continuity | Required. |
| Multi-process primary kill -> new primary byte-equal | Required. |
| OS initiator reconnect | Preferred; may be provisional if explicitly non-claimed and G15a/G17 dogfood owns it. |
| Primary kill mid-rebuild | Recommended; may be split into G8b if too broad. |
---
## 7. Implementation Audit Checklist
Before production code, sw performs a read-only audit:
1. Where does master decide new primary after failure?
2. What evidence marks old primary unreachable?
3. How does `cmd/blockvolume` update frontend projection after new assignment?
4. Can the old primary still serve an already-open iSCSI/NVMe/backend path?
5. Does the candidate new primary have durable acknowledged bytes before serving?
6. What code path proves client reconnect / reattach?
7. Which V2 scenario files provide the closest oracle?
Audit output must classify:
```text
PROCEED-verify-only
PROCEED-minor-patch
HALT-scope-evolution
```
### 7.1 Current audit snapshot (2026-05-02)
| Area | Current code fact | Verdict |
|---|---|---|
| Authority failover decision | `TopologyController.decideVolume` emits `IntentReassign` when current candidate is unacceptable and another candidate is `Reachable && ReadyForPrimary && Eligible && !Withdrawn`. | `PROCEED-verify-only` |
| Epoch minting | `Publisher.apply(IntentReassign)` advances per-volume epoch from max prior volume epoch; local volume/host code does not mint. | `PROCEED-verify-only` |
| Assignment fan-out | `master.SubscribeAssignments` is volume-scoped and fans in all replica slots, allowing old primary to observe a newer cross-replica line. | `PROCEED-verify-only` |
| Old-primary fence | `Host.recordOtherLine -> IsSuperseded -> AdapterProjectionView` turns local Healthy into frontend `Healthy=false`; durable backend maps that to `ErrStalePrimary`. | `PROCEED-verify-only` |
| Candidate data continuity | Process-level G8-B now writes through r1 iSCSI, kills r1, waits r2 failover, reconnects to r2 iSCSI, and reads byte-equal acknowledged data. | `PROCEED-verify-only` |
| Client reconnect / reattach | First G8 close uses explicit test-harness iSCSI reconnect to the new primary target. This is not a claim of transparent OS initiator reconnect. | `PROCEED-verify-only` |
| Primary kill during active rebuild | Crosses G7 recovery + G8 failover; not required for first close by default. | `HALT-scope-evolution` for first G8; split to `G8b` |
Expected first-close default: `PROCEED-minor-patch`, because G8 needs process integration wiring/tests rather than new recovery primitives.
---
## 8. Acceptance Criteria
G8 closes only when all first-close criteria are true:
1. G8-A component stale read/write fence is GREEN.
2. G8-B multi-process or hardware primary-kill scenario is GREEN.
3. The scenario verifies byte-equal acknowledged data on the new primary.
4. The old primary stale path is fail-closed in at least component coverage.
5. G8 close report names the client reconnect / reattach path used.
6. G8 close report names all non-claims, especially OS initiator behavior if not fully exercised.
7. No V2 authority-minting semantics are ported.
---
## 9. Non-Claims For First G8 Close
Unless explicitly expanded during §1.A ratification, first G8 close does not claim:
1. multi-master control-plane HA;
2. RF>=3 quorum/min-pin policy;
3. rack/AZ placement;
4. transparent OS initiator reconnect for all protocols;
5. primary crash during every possible recovery sub-phase;
6. performance/failover SLO;
7. Kubernetes workload failover under CSI.
These are forward-carry unless pulled into G8 by architect before code.
---
## 10. Initial §1.A Questions For Architect Ratification
| Question | Default recommendation |
|---|---|
| Q1: topology | RF=2 subprocess L2 with real product daemons and real loopback iSCSI is sufficient for first G8 close. m01/M02 cross-node confirmation is forward-carry; RF>=3 later. |
| Q2: frontend path | Use backend/harness reconnect for first G8; OS initiator reconnect preferred if available but not required for first data-continuity close. |
| Q3: primary-kill mid-rebuild | OUT for first G8 close by default; split to G8b unless architect explicitly pulls it in. |
| Q4: timing | Record dispatch/reassignment/reconnect time; no SLO. |
| Q5: old primary return | Component required; hardware preferred if harness can restart old primary cleanly. |
Architect decision can override these defaults before sw starts code.
---
## 11. Sequence Followed
1. sw read current failover / publisher / projection / frontend stale-fence code.
2. sw wrote the G8 audit table against §7.1.
3. sw landed `G8-0` / `G8-A` / `G8-B0` component tests before process-level close evidence.
4. G8-B process evidence landed through real product daemons and explicit iSCSI reconnect; see §12.
5. Architect §1.A ratification locked the first-close topology and non-claims on 2026-05-02.
6. Remaining hardware and wider failure-matrix work moves through §12.6 forward-carry, not hidden first-close scope.
---
## 12. Close-Ready Evidence Snapshot
### 12.1 Code evidence
Code branch: `seaweed_block:p15-g8/failover-data-continuity`
| Commit | Evidence |
|---|---|
| `292db26` | G8-0/A/B0 component tests: authority failover eligibility, old-primary stale fence, component data-continuity prelude, and authority-move-only negative oracle. |
| `b01bb9e` | G8-B process role movement: r1 primary kill -> r2 becomes `Healthy=true` at `r2@epoch>=2` through real `cmd/blockmaster` / `cmd/blockvolume` subprocesses. |
| `945942d` | G8-C process old-primary return: r1 restarts after r2 failover and remains frontend non-Healthy while r2 stays primary. |
| `6669cf5` | G8-B process data continuity: r1 iSCSI acknowledged write -> kill r1 -> r2 iSCSI reconnect -> byte-equal read from new primary. |
### 12.1.1 Hardware-fidelity choice
First G8 close deliberately uses subprocess L2 rather than m01/M02 cross-node hardware:
```text
real cmd/blockmaster process
real cmd/blockvolume r1 + r2 processes
real walstore-backed durable roots
real iSCSI target endpoints
explicit iSCSI reconnect from r1 to r2
byte-equal SCSI READ(10) after primary kill
```
This is sufficient for the first data-continuity close because G8's claim is authority-to-frontend failover correctness, not network partition timing, transparent initiator reconnect, or multi-host deployment behavior. m01/M02 confirmation remains forward-carry for the next hardware-fidelity pass.
### 12.2 Selected reconnect path
First G8 close uses explicit iSCSI reconnect in the Go subprocess harness:
```text
dial r1 iSCSI -> WRITE(10) acknowledged
kill r1
wait r2 assignment / Healthy
dial r2 iSCSI -> READ(10) byte-equal
```
This proves the data-plane continuity primitive through a real frontend target served by the product daemons themselves. It does not claim transparent kernel initiator reconnect or multipath behavior.
### 12.3 Tests run by sw
```powershell
go test ./cmd/blockvolume -run "TestG8|TestG54_BinaryWiring" -count=1
go test ./cmd/blockvolume ./core/authority ./core/host/volume ./core/frontend/durable ./core/replication/component -count=1
```
Both passed on 2026-05-02.
### 12.4 First G8 close claim
G8 can claim:
```text
V3 can move primary service after a primary process kill in an RF=2 product-daemon setup.
The new primary can read the tested acknowledged data written through the healthy replication path before failure.
The old primary, if it returns, remains frontend fail-closed and does not regain authority.
```
### 12.5 First G8 non-claims
G8 does not claim:
1. transparent OS/kernel initiator reconnect;
2. CSI/Kubernetes failover under workload;
3. primary kill during active rebuild (`G8b`);
4. returned old primary has become `ReplicaReady`;
5. RF>=3 quorum/min-pin policy;
6. rack/AZ placement;
7. multi-master control-plane HA;
8. failover time SLO;
9. strict RF=2 full-ack/quorum write semantics when the replica is lagging, down, or unable to durably ACK.
### 12.6 Forward-carry
| Item | Owner |
|---|---|
| Returned old primary -> `ReplicaCandidate` -> syncing/rebuild -> `ReplicaReady` | G8-followup / G9A reintegration policy |
| Primary kill during active rebuild | G8b |
| Transparent OS initiator reconnect / multipath behavior | G15a/G17 dogfood or dedicated frontend gate |
| m01/M02 cross-node G8 confirmation | G8-followup / hardware-fidelity pass |
| Control-plane placement intent and plan/result split | G9/G9A |
| State vocabulary in diagnostics (`frontend_ready`, `authority_role`, `replication_role`, progress) | G17-lite |
| ACK profile policy: `best-effort` vs quorum/full-ack, and user-visible mode naming | G8-followup / G9A |
---
## 13. Close
### 13.1 Close decision
G8 is closed on 2026-05-02.
Architect single-sign basis:
1. §1.A ratification accepts RF=2 subprocess L2 with real product daemons and real loopback iSCSI as first-close evidence.
2. QA verified the close packet after governance fixes and signed the evidence on 2026-05-02.
3. Code evidence is pinned to `seaweed_block:p15-g8/failover-data-continuity` through `6669cf5`.
4. The close claim is limited to authority-to-frontend failover data continuity under the tested healthy replication path.
### 13.2 Evidence accepted
Accepted evidence:
1. `292db26` — authority failover eligibility, stale-fence component coverage, data-continuity prelude, and authority-move-only negative oracle.
2. `b01bb9e` — process primary kill causes r2 to become `Healthy=true` at a newer epoch.
3. `945942d` — returned old primary remains frontend fail-closed.
4. `6669cf5` — real iSCSI `WRITE(10)` ACK on r1, kill r1, explicit reconnect to r2, real iSCSI `READ(10)` byte-equal.
Accepted test commands:
```powershell
go test ./cmd/blockvolume -run "TestG8|TestG54_BinaryWiring" -count=1
go test ./cmd/blockvolume ./core/authority ./core/host/volume ./core/frontend/durable ./core/replication/component -count=1
```
### 13.3 Final G8 claim
G8 proves:
```text
In an RF=2 product-daemon setup, V3 can move primary service after killing the current primary.
The new primary can read the tested data acknowledged through the old primary's healthy replication path.
The old primary, if restarted, remains frontend fail-closed and does not regain authority.
```
### 13.4 Final non-claims
G8 does not claim:
1. transparent kernel initiator reconnect or multipath behavior;
2. m01/M02 cross-node network-failure evidence;
3. primary kill during active rebuild;
4. returned old primary has become `ReplicaReady`;
5. strict RF=2 full-ack/quorum semantics under lag or recovery;
6. RF>=3 quorum/min-pin policy;
7. rack/AZ placement;
8. multi-master control-plane HA;
9. failover time SLO;
10. CSI/Kubernetes workload failover.
### 13.5 Forward owner
Next owner is G8-followup/G9A:
1. ACK profile policy: best-effort vs full-ack/quorum.
2. Returned replica reintegration: `Observed -> ReplicaCandidate -> Syncing/Rebuilding -> ReplicaReady`.
3. m01/M02 cross-node confirmation.
4. Primary-kill-mid-rebuild split to G8b.

@@ -0,0 +1,556 @@
# V3 Phase 15 — Priority MVP Plan
**Date**: 2026-05-02
**Status**: ACTIVE priority overlay; not a replacement for `v3-phase-15-mvp-scope-gates.md`
**Owner**: architect / sw / QA
**Code repo**: `seaweed_block`
**Docs repo**: `seaweedfs/sw-block/design`
---
## 0. Purpose
This document turns the P15 canonical gate list into an execution priority plan.
`v3-phase-15-mvp-scope-gates.md` remains the canonical definition of what each gate means. This plan answers a different question:
> Given current code reality after G7, what should we do first to reach a credible Kubernetes block-service MVP without over-claiming?
The north-star is a smooth, understandable beta:
1. A user can deploy the block service.
2. A user can create and attach a volume.
3. A pod can write/read real block data.
4. Replica loss, catch-up, and rebuild do not corrupt data.
5. Failover is explicitly tested before we imply availability.
6. Operators can see enough state to diagnose slow or stuck recovery.
7. Unsupported beta features are named, not silent.
---
## 1. Current Reality Snapshot
### 1.1 Closed or closure-ready capability
| Area | Current posture |
|---|---|
| G0/G1 product daemons + control RPC | Implemented via `cmd/blockmaster` / `cmd/blockvolume`, master-volume control route, status surfaces. |
| G2/G3 frontends | iSCSI + NVMe/TCP have substantial V2-port coverage and product-host wiring. Some original stale-OS critical-cell claims remain deferred to later failover evidence. |
| G4 local durability | smartwal path is the canonical durable path; walstore has known deferred issues. |
| G5/G6 replication + catch-up | Replicated write path, retention-aware catch-up, and WALRecycled -> rebuild dispatch are implemented and hardware-tested in prior gates. |
| G7 rebuild / replica re-creation | CLOSED 2026-05-02, pinned to `seaweed_block@d09fcc6`: #2 (empty join), #5 (concurrent writes during rebuild), and #6 (stale WAL -> rebuild) are all hardware GREEN. |
| Recovery debt cleanup | `targetLSN` no longer owns terminal close semantics; practical single feed ingress is in place; progress facts and flow-control diagnostics exist. |
### 1.2 Not yet product-complete
| Gap | Why it matters |
|---|---|
| G8 failover data continuity | CLOSED 2026-05-02 for first-close subprocess L2 + real iSCSI data-continuity scope. Follow-up owns m01/M02, G8b, and strict ACK policy. |
| ACK profile / quorum semantics | The current healthy-path continuity evidence must not be confused with strict RF=2 full-ack semantics under lag. |
| G9 lifecycle product verbs | Users still cannot naturally create/delete/attach/detach volumes without manual topology/assignment setup. |
| G9A placement intent | Current system has topology machinery, but not yet a full product path: intent -> placement -> publisher-minted assignment. |
| G15a CSI MVP | Kubernetes users need PVC -> pod -> block device workflow; CSI is the expected beta surface. |
| G17-lite observability | Hardware QA can inspect logs; ordinary users need status/metrics/diagnostics without grep archaeology. |
| G12/G13 failure/lifecycle policy | Disk failure, drain, decommission must either work or be explicitly unsupported with operator implications. |
| Flow-control enforcement | Diagnostics exist, but no product action yet. This is follow-up after observability and failover basics. |
### 1.3 ACK profile posture
P15 separates ACK policy from recovery policy.
`best-effort` is allowed as a beta profile only if it is explicitly named: frontend write/sync success is based on the primary path and does not wait for every replica to report a full durable ACK. A lagging replica is still a recovery concern. Progress/probe facts must drive catch-up or rebuild; best-effort is not permission to leave a replica silently behind.
`quorum` or `full-ack` is a separate future profile: frontend success waits for the configured replica durability condition. That profile is required before the product claims RF=2 no acknowledged-write loss when the secondary is lagging, down, or unable to durably ACK.
In RF=2 full-ack mode, a replica in recovery is not sync-ack eligible. The primary must therefore choose an explicit policy outcome: block/fail writes, transition the volume to a named degraded/best-effort mode, or make the volume read-only/unavailable. It must not silently return full-sync success while the only secondary is catching up or rebuilding.
---
## 2. Execution Principle
### 2.1 Gate discipline
Each gate gets a mini-plan before code:
1. **Scope**: exact claim, exact non-claims.
2. **Current code audit**: files and behavior already present.
3. **V2 port plan**: PORT-AS-IS / PORT-REBIND / SKIP / REWRITE-TINY.
4. **TDD plan**: red tests first for semantic risk.
5. **Hardware / integration evidence**: what scenario proves the gate.
6. **Closure report**: evidence, non-claims, forward-carry.
### 2.2 Priority rule
Prefer vertical product slices over deep internal polish.
Do not continue polishing recovery internals unless the work protects one of these beta-facing outcomes:
1. data correctness,
2. failover continuity,
3. lifecycle usability,
4. K8s attach/use,
5. operator diagnostics,
6. explicit unsupported-state handling.
### 2.3 Porting rule
Use the WAL-style port model for mature V2 mechanism files:
1. Port complete mechanism sections where the file is M or M*.
2. Rebind thin engine/storage calls to V3 contracts.
3. Reject V2 authority-minting permanently.
4. Do not ref-model protocol or CSI code unless V2 is authority-polluted or scope-deferred.
### 2.4 Control-plane interface rule
P15 must keep the master/control-plane interface future-compatible with a larger cluster manager.
`blockmaster` may own placement as a product responsibility, but placement, failover, rebalance, repair, and reintegration decisions must live behind a testable authority policy/planner seam. The daemon layer wires RPC, flags, stores, and components. It must not grow ad-hoc heartbeat-handler if/then assignment mutation.
Reference: `v3-phase-15-control-plane-evolution.md`.
Required vocabulary split:
1. `FrontendPrimaryReady` — can serve frontend read/write.
2. `AuthorityRole` — current primary / non-primary / superseded.
3. `ReplicationRole` — none / candidate / syncing / ready.
4. `Progress` — durable ack, base progress, WAL frontier.
Do not use one boolean `Healthy` as the product state for all four concepts. `Healthy=false` can mean old primary stale, supporting replica, syncing replica, unknown, or failed; those must become distinguishable before the MVP is user-friendly.
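A minimal Go sketch of the vocabulary split above. All type, constant, and field names are hypothetical illustrations and are not taken from the `seaweed_block` code. The point is only that the four concepts live in separate fields, so a stale old primary and a syncing replica no longer collapse into the same `Healthy=false`.

```go
package main

import "fmt"

// Hypothetical names: none of these identifiers come from the real code base.
type AuthorityRole string

const (
	AuthorityPrimary    AuthorityRole = "primary"
	AuthorityNonPrimary AuthorityRole = "non-primary"
	AuthoritySuperseded AuthorityRole = "superseded"
)

type ReplicationRole string

const (
	ReplicationNone      ReplicationRole = "none"
	ReplicationCandidate ReplicationRole = "candidate"
	ReplicationSyncing   ReplicationRole = "syncing"
	ReplicationReady     ReplicationRole = "ready"
)

// Progress carries durable/recovery facts as LSN-like counters (assumed model).
type Progress struct {
	DurableAck   uint64
	BaseProgress uint64
	WALFrontier  uint64
}

// VolumeState keeps the four concepts apart instead of one Healthy flag.
type VolumeState struct {
	FrontendPrimaryReady bool
	Authority            AuthorityRole
	Replication          ReplicationRole
	Progress             Progress
}

// Describe renders the state so that distinct situations stay distinguishable.
func Describe(s VolumeState) string {
	return fmt.Sprintf("frontend_ready=%t authority=%s replication=%s durable_ack=%d",
		s.FrontendPrimaryReady, s.Authority, s.Replication, s.Progress.DurableAck)
}

func main() {
	stalePrimary := VolumeState{Authority: AuthoritySuperseded, Replication: ReplicationCandidate}
	syncing := VolumeState{
		Authority:   AuthorityNonPrimary,
		Replication: ReplicationSyncing,
		Progress:    Progress{DurableAck: 40},
	}
	fmt.Println(Describe(stalePrimary))
	fmt.Println(Describe(syncing))
}
```

Both example states would have reported `Healthy=false` under a single boolean; here they render differently.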
---
## 3. Priority Spine
### Immediate spine
```text
P15-P0: Freeze and close G7
-> P15-P1: G7 follow-up hardening
-> P15-P2: G8 failover data continuity [CLOSED 2026-05-02 for first-close scope]
-> P15-P3: G9 lifecycle product verbs
-> P15-P4: G9A flat placement / desired topology
-> P15-P5: G17-lite observability + G15a CSI MVP in parallel
-> P15-P6: internal K8s dogfood checkpoint
-> P15-P7: remaining beta gates by dogfood feedback
-> P15-P8: G22 final validation
```
This preserves the canonical gate graph while making the execution path concrete.
---
## 4. Gate Plan
### P15-P0 — Close G7 on pinned tree — ✅ done
**Goal**: close G7 honestly before adding more behavior.
**Closure target**: `seaweed_block@d09fcc6`.
**Explicitly excluded from canonical G7 evidence**: the post-close `59ac82e` dry-run flow-control logging, unless QA re-runs the evidence and the architect chooses to move the close target.
**Why the remaining items stay out of G7**:
G7's closed product sentence is deliberately narrow: "dual-lane recovery data-plane correctness works for empty join, concurrent writes during rebuild, and stale-WAL rebuild." It is not "the recovery product is autonomous and operationally complete."
The remaining work is real, but it belongs to later gates because each item changes a different product contract:
| Remaining item | Why not G7 | Owning next place |
|---|---|---|
| Primary-kill / failover continuity | Changes authority-to-frontend serving semantics after primary loss; must prove new primary reads acknowledged data. | G8 |
| Autonomous probe-driven recover lifecycle | Changes coordinator policy: when to catch up, rebuild, degrade, or stay feeding. | G8/G9A follow-up or dedicated recover-policy batch |
| RF>=3 quorum / min-pin / slow-replica degrade | Changes write availability and quorum semantics, not just rebuild transport. | G8 or later RF policy batch |
| Strict RF=2 full-ack write contract | Changes frontend ACK behavior under replica lag; G8's current evidence is healthy-path data continuity, not full quorum semantics. | G8-followup / G9A |
| Flow-control enforcement | Changes write admission behavior under pressure. Dry-run diagnostics are safe; enforcement needs product tuning and observability. | G17-lite + flow-control track |
| External diagnostics endpoint | Product supportability surface, not data-plane closure. | G17-lite |
| Strict single-queue feeder refactor | Internal architecture hardening; G7 achieved a practical single ingress that was sufficient for its hardware gate. | G7-followup / pre-G8 hardening |
This boundary prevents G7 from becoming an unbounded recovery rewrite while still preserving every discovered debt in the forward plan.
**Done**:
1. Single-signed `v3-phase-15-g7-mini-plan.md` §close.
2. Updated roadmap state so it no longer claims P15 is closing G5.
3. Recorded G7 as closed with the precise non-claims:
- no autonomous recovery orchestration,
- no RF>=3 quorum/min-pin policy,
- no full transient disconnect / primary crash matrix,
- no flow-control enforcement,
- no external diagnostics endpoint,
- no strict single-queue feeder claim.
**Pass**: docs identify `d09fcc6` as G7 close evidence and no later commit is silently included.
---
### P15-P1 — G7 follow-up hardening
**Goal**: protect the G7 fixes from regression without expanding G7 scope.
**Why now**: the recovery work exposed structural risks. Small hardening is cheaper before G8 builds on it.
**Work items**:
1. Add invariant tests:
- post-recovery first live write lazy-dials / uses valid session,
- retained write is replayed by next recovery backlog,
- live write during base transfer enters the same feeder path,
- duplicate WAL/base apply remains idempotent.
2. Add anti-bypass test:
- production live WAL path must go through `ReplicaPeer.ShipEntry -> FeedLiveWrite`, not direct `BlockExecutor.Ship`.
3. Clarify `LiveWriteRetained`:
- stale epoch should log as dropped stale epoch, not "recovery will replay".
4. Gate, sample, or remove `g7-debug` logs from product default.
5. Document `ErrSinkSealed` invariant:
- sealed fall-through is safe only after steady emit context has been restored.
**Pass**: scoped `go test` green; no new product behavior claim.
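As one illustration of the "duplicate WAL/base apply remains idempotent" invariant, here is a hedged Go sketch. The `Entry` and `Replica` types are hypothetical and assume in-order delivery keyed by a monotonically increasing LSN; the real feeder and storage contracts may differ.

```go
package main

import "fmt"

// Entry is a hypothetical WAL record: an LSN plus a byte range to write.
type Entry struct {
	LSN    uint64
	Offset int
	Data   []byte
}

// Replica is a hypothetical in-memory replica tracking its applied LSN.
type Replica struct {
	applied uint64
	data    []byte
}

// Apply writes the entry unless its LSN is already covered.
// It returns true only when the entry actually changed state.
func (r *Replica) Apply(e Entry) bool {
	if e.LSN <= r.applied {
		return false // duplicate delivery: skip, state stays unchanged
	}
	copy(r.data[e.Offset:], e.Data)
	r.applied = e.LSN
	return true
}

func main() {
	r := &Replica{data: make([]byte, 8)}
	first := Entry{LSN: 1, Offset: 0, Data: []byte("aaaa")}
	second := Entry{LSN: 2, Offset: 0, Data: []byte("bb")}
	fmt.Println(r.Apply(first), r.Apply(second), r.Apply(first))
	fmt.Printf("%q\n", r.data[:4])
}
```

An invariant test in this style would replay an already-applied entry after a newer one and assert that the newer bytes survive.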
---
### P15-P2 — G8 Failover Data Continuity — ✅ closed for first-close scope
**Goal**: after primary failure, new primary serves tested acknowledged data; old primary cannot corrupt future state.
**Close status**: CLOSED 2026-05-02. Evidence is pinned in `v3-phase-15-g8-mini-plan.md` §13.
**ACK posture for first close**: G8 may close on healthy-path acknowledged data continuity. It must not claim strict RF=2 quorum/full-ack behavior unless an explicit ACK profile gate proves that writes/syncs fail or block when the replica cannot durably ACK.
**Scope**:
1. Kill primary after acknowledged writes.
2. Master publishes new assignment.
3. Client reconnects or reattaches.
4. New primary reads exact acknowledged data.
5. Old primary is fenced from future success.
**Non-scope**:
1. multi-master HA,
2. rack/AZ placement,
3. performance SLO,
4. full chaos matrix,
5. strict full-ack write contract under lagging or unavailable replicas.
**TDD first**:
1. component test: old primary stale write rejects after EV advance;
2. component test: new primary has acknowledged write after reassignment;
3. scenario test: kill primary -> wait assignment -> reconnect -> byte-equal;
4. negative test: failover cannot be declared by authority movement alone without data verification.
**Pass**: hardware or multi-process scenario proves data continuity, not just role movement.
**Control-plane strengthening needed before G8 close**:
1. G8 must explicitly state whether old-primary return is only `FrontendClosed/Superseded` or also `ReplicaCandidate`.
2. If G8 claims returned-peer reintegration, it must prove `ReplicationRole` transitions through candidate/syncing/ready using progress evidence.
3. Assignment movement alone remains a negative oracle; data continuity requires byte-equal proof.
4. `cmd/blockmaster` changes for failover must remain wiring/config only; policy stays in `core/authority`.
5. Best-effort ACK mode, if used, must still feed lagging replicas through catch-up/rebuild. It is an ACK latency/availability tradeoff, not a recovery exemption.
---
### P15-P3 — G9 Volume Lifecycle
**Goal**: users/tools can create, attach, detach, and delete volumes without hand-authoring internal authority state.
**Minimal product verbs**:
1. CreateVolume
2. DeleteVolume
3. Attach / Publish
4. Detach / Unpublish
5. Get/List status
**Implementation direction**:
1. Add V3-safe desired-volume model.
2. Reuse current master/volume RPC and authority publisher.
3. Keep mutating verbs intent-only; never expose direct `AssignmentInfo` mutation.
4. Use CSI/API semantics as the shape even if the first caller is a CLI/test client.
**TDD first**:
1. create volume intent persists;
2. attach cannot report ready before data path is usable;
3. delete leaves no orphan authority line;
4. all lifecycle verbs are idempotent.
**Pass**: external client or test drives create -> attach -> write/read -> detach -> delete through real product daemons.
---
### P15-P4 — G9A Flat Placement / Desired Topology
**Goal**: operator asks for replicated volume; system computes flat placement; publisher mints authority from desired topology, not heartbeat-as-authority.
**Current code advantage**: `core/authority.TopologyController` already contains placement-like machinery. This gate should productize it, not rewrite it.
**Must ship**:
1. flat RF=2/RF=3 placement from intent;
2. durable desired topology generation;
3. explain output: why selected / why rejected;
4. replacement placement for drain/disk-loss workflows if in scope;
5. no path constructs `AssignmentInfo` outside authority publisher.
**Explicit non-scope**:
1. rack/AZ awareness,
2. hot rebalance,
3. load-based auto movement,
4. multi-master HA,
5. V2 promote/demote semantics.
**Pass**: API/CSI/admin test creates a replicated (RF=2 or RF=3) volume from intent, observes desired topology generation, and volumes bind roles without manual topology stuffing.
**Interface strengthening**:
1. Introduce a product-shaped placement intent: `volume_id`, size, RF, and optional constraints.
2. Planner emits a `PlacementPlan` / bounded intent, never raw `AssignmentInfo`.
3. Reconciler observes plan progress and publisher facts separately.
4. Returned/stale replicas enter as `ReplicaCandidate`, not immediate `ReplicaReady`.
5. Candidate readiness requires durable/progress facts, not heartbeat presence.
---
### P15-P5A — G17-lite Observability
**Goal**: dogfood is not grep-only.
**Minimum endpoints / surfaces**:
1. volume status: role, epoch, endpoint version, mode;
2. recovery state: decision, R/S/H, session phase, durable ack known/R;
3. peer state: healthy/degraded/recovering, last probe/ack;
4. flow-control dry-run verdict;
5. structured log fields for recovery and failover events;
6. `/readyz` and `/healthz` or equivalent for deployment systems.
State vocabulary must separate:
1. frontend readiness;
2. authority role;
3. replication role;
4. durable/recovery progress;
5. placement or unsupported reason.
**TDD first**:
1. endpoint returns stable JSON shape;
2. flow-control verdict appears after observation;
3. stale/degraded/recovering states are distinguishable;
4. endpoint is loopback or authenticated until G16.
**Pass**: QA/user can diagnose stuck rebuild, lagging replica, and unavailable frontend from status output without reading engine internals.
---
### P15-P5B — G15a CSI MVP
**Goal**: Kubernetes PVC -> pod -> block device -> workload read/write -> clean teardown.
**Scheduling**: starts after G9A contracts are stable; can run in parallel with G17-lite.
**Port strategy from V2 CSI survey**:
| Module | Disposition |
|---|---|
| CSI gRPC service layer | PORT-AS-IS / light rebind |
| iSCSI/NVMe transport adapters | PORT-AS-IS |
| Volume backend bridge | PORT-REBIND to V3 lifecycle/placement APIs |
| Snapshot ID encoding | PORT-AS-IS, but snapshot RPCs deferred to G15b |
| deploy YAMLs | light rewrite |
| tests/scenarios | port with backend rewire |
**G15a must ship**:
1. `CreateVolume` / `DeleteVolume`;
2. `ControllerPublishVolume` / `ControllerUnpublishVolume`;
3. `NodeStageVolume` / `NodePublishVolume`;
4. `NodeUnpublishVolume` / `NodeUnstageVolume`;
5. `GetCapacity`;
6. controller/node capability truth.
**G15a must not claim**:
1. snapshots,
2. online resize,
3. clones,
4. rack-aware topology,
5. production security beyond declared beta posture.
**Pass**: k8s integration test provisions a volume, attaches it to a pod, runs an fio/postgres-class workload, and tears down cleanly.
---
### P15-P6 — Internal K8s Dogfood Checkpoint
**Goal**: first complete user-visible slice.
**Required evidence**:
1. deploy blockmaster/blockvolume components on a small cluster;
2. create volume via CSI;
3. pod mounts/uses the volume;
4. workload runs for a declared contiguous duration;
5. replica failure recovers without indefinite dataplane stall;
6. status surfaces explain current state;
7. teardown cleans resources.
**Non-claims**:
1. snapshot/resize,
2. rack/AZ placement,
3. full node decommission,
4. performance SLO,
5. non-local production auth unless G16 lands.
---
### P15-P7 — Remaining Beta Gates
Prioritize by dogfood pain:
| Gate | Recommended disposition |
|---|---|
| G10 Snapshot | Implement minimal or explicit beta-defer. Do not expose CSI snapshot capability before engine truth exists. |
| G11 Resize | Implement minimal offline/online story or explicit beta-defer. Do not claim PVC expansion silently. |
| G12 Disk failure | Must either detect/evict/recreate, or document unsupported disk-failure path with operator action. |
| G13 Node lifecycle | At least join/drain/decommission story or explicit unsupported state. |
| G14 External API | Narrow because CSI covers many verbs; keep safe intent/status API. |
| G16 Security/Auth | Required before non-local beta; until then loopback/local-only claims must be explicit. |
| G18 Deployment | Helm/manifests/config validation after CSI shapes stabilize. |
| G19 Migration | Prefer explicit non-migration beta unless product owner accepts migration scope. |
| G20 QoS/rack/operator/GC | Mostly defer, but GC/retention implication cannot be silent. |
| G21 Performance SLO | Run after correctness/dogfood; define named hardware and V2 comparison where meaningful. |
---
### P15-P8 — G22 Final Cluster Validation
**Goal**: release evidence bundle, not local smoke.
**Must include**:
1. multi-process / multi-node bring-up;
2. lifecycle create/use/delete;
3. replicated write/read;
4. failover data continuity;
5. replica restart/catch-up;
6. rebuild/re-create;
7. disk failure or explicit unsupported evidence path;
8. node lifecycle or explicit limitation;
9. security negative tests if non-local beta;
10. metrics/log/artifact validation;
11. performance/soak report;
12. `manifest.json`, `result.json`, `result.xml`, logs, metrics, and claim matrix.
---
## 5. What Not To Do Now
1. Do not pursue strict physical single-queue shipper before G8 unless a failing test requires it.
2. Do not enforce flow control before observability and policy are visible.
3. Do not implement CSI snapshot/resize before G10/G11 truth exists.
4. Do not port V2 authority semantics: promotion/demotion, heartbeat-as-authority, local-role-as-authority.
5. Do not claim Kubernetes beta while lifecycle, placement, and diagnostics are still manual/opaque.
6. Do not update G7 close language to include post-`d09fcc6` commits unless QA re-runs the evidence and architect moves the pin.
7. Do not let `blockmaster` become an ad-hoc if/then assignment mutator. Placement belongs in the authority policy/planner seam even if the product says "master owns placement."
8. Do not claim a returned old primary is a ready replica just because it heartbeats. It is at most observed/candidate until progress evidence proves readiness.
---
## 6. Immediate Next Actions
### Action 1 — G7 close packet — ✅ done
1. Architect signed G7 §close on `d09fcc6`.
2. Roadmap state changed from “closing G5/G7” to “G7 closed; G8 next”.
3. Forward-carry register points to this document for prioritization.
### Action 2 — Start P15-P1
Open a small G7-followup hardening mini-plan:
1. list the invariant tests;
2. list log/disposition cleanups;
3. run scoped tests only;
4. no behavior expansion.
### Action 3 — Start G8 mini-plan
Open `v3-phase-15-g8-mini-plan.md`:
1. bind failover scenarios;
2. name hardware target;
3. define reconnect/reattach semantics;
4. define exact old-primary stale rejection;
5. define byte-equal data oracle.
---
## 7. Working Rule For Each Gate
Every gate should start with this checklist:
```text
1. What product sentence becomes true after this gate?
2. What product sentence remains false?
3. What V2 files are assets?
4. What V2 semantics are permanently rejected?
5. What is the first red test?
6. What is the hardware or integration evidence?
7. What is the forward-carry if we stop here?
```
If a gate cannot answer these seven questions, it is not ready for code.
---
## 8. MVP Replication Usage Rules
These rules define the beta-facing behavior until a later gate changes them with tests and docs.
### 8.1 Default ACK profile
MVP default is **best-effort replication ACK**:
1. Foreground write/sync success does not require every replica to return a full durable ACK.
2. Replica lag, miss, or recovery state is still monitored through progress/probe facts.
3. A lagging replica must be fed by catch-up or rebuild; best-effort is not permission to abandon recovery.
4. User-facing docs must name the RPO implication: if the primary dies before a secondary has durable progress, the acknowledged write may not survive unless a stricter profile is enabled.
### 8.2 Future full-ack / quorum profile
Full-ack/quorum mode is a separate policy gate.
For RF=2:
1. The only secondary must be `ReplicaReady` and sync-ack eligible.
2. If that secondary enters recovery, it no longer counts as the synchronous ACK peer.
3. Primary writes/syncs must fail or time out with a clear quorum error, unless an operator explicitly changes the volume to a named degraded/best-effort mode.
4. Reads from the current primary may continue if the authority line is valid and local data is healthy.
5. The system must not silently return "full sync" success while the only secondary is catching up or rebuilding.
### 8.3 Recovery policy
Recovery remains mandatory under both ACK profiles:
1. A new or returning replica starts as observed/candidate, not ready.
2. If retained WAL can cover the gap, the feeder catches it up.
3. If retained WAL cannot cover the gap, the coordinator starts rebuild.
4. Base/rebuild and WAL feeding are one ownership story: do not allow two independent feeders to advance the same peer's recovery truth.
5. Progress facts should eventually move pin/frontier forward; slow pin movement is a flow-control and degradation-policy input, not an ACK shortcut.
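The catch-up versus rebuild choice in rules 2 and 3 can be sketched as a pure function over progress facts. This Go example assumes a simple LSN model that is not confirmed by the source: the replica reports its durable LSN, the primary reports its WAL frontier, and the WAL retains entries from `walRetainedFrom` onward. All names are hypothetical.

```go
package main

import "fmt"

// Decision is a hypothetical coordinator outcome for one replica.
type Decision string

const (
	DecisionNone    Decision = "none"
	DecisionCatchUp Decision = "catch-up"
	DecisionRebuild Decision = "rebuild"
)

// Decide picks a recovery action from progress facts alone.
//   - replicaDurableLSN: last LSN the replica has durably applied
//   - primaryFrontier:   last LSN written on the primary
//   - walRetainedFrom:   oldest LSN still present in the retained WAL
func Decide(replicaDurableLSN, primaryFrontier, walRetainedFrom uint64) Decision {
	if replicaDurableLSN >= primaryFrontier {
		return DecisionNone // nothing to feed
	}
	// The replica next needs LSN replicaDurableLSN+1. If the WAL still
	// holds it, retained WAL covers the whole gap.
	if replicaDurableLSN+1 >= walRetainedFrom {
		return DecisionCatchUp
	}
	return DecisionRebuild // the gap starts before the retained WAL
}

func main() {
	fmt.Println(Decide(20, 20, 5))
	fmt.Println(Decide(10, 20, 11))
	fmt.Println(Decide(10, 20, 12))
}
```

Keeping the decision a pure function of progress facts makes it testable without heartbeats or timers, and neither ACK profile appears as an input.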
### 8.4 Operator-visible states
The MVP must make these states distinguishable in status/logs before claiming production-grade availability:
1. `best_effort`: writes may ACK on primary durability only.
2. `full_ack_unavailable`: configured full-ack mode cannot currently accept writes.
3. `replica_recovering`: replica is being fed and is not sync-ack eligible.
4. `replica_ready`: replica can participate in the configured ACK profile.
5. `degraded`: fewer healthy/ready replicas than desired.