mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-17 07:11:30 +00:00
docs: Phase 20 product acceptance checklist
7-area acceptance matrix mapping current state vs product requirements: write/durability contract, fresh replica bootstrap, host observation completeness, serving/publish alignment, snapshot/rebuild convergence, adapter consistency, test contract alignment. Each item marked with: current state, required for product, blocks T6/T7, best test level. Priority ordered into must-close-before-Stage-1, should-close-before-Stage-2, and can-close-after-T6/T7. Key diagnosis: architecture-complete, execution-incomplete. The engine thinks like a product; the data plane still behaves partly like a prototype. The gap is end-to-end contract closure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
126
sw-block/.private/phase/phase-20-acceptance.md
Normal file
126
sw-block/.private/phase/phase-20-acceptance.md
Normal file
@@ -0,0 +1,126 @@
|
||||
# Phase 20 Product Acceptance Checklist
|
||||
|
||||
Date: 2026-04-06
|
||||
Status: active
|
||||
|
||||
## Reading
|
||||
|
||||
The system is architecture-complete but execution-incomplete. The V2
|
||||
engine thinks like a product. The data plane still behaves partly like
|
||||
a prototype. The gap is not features — it is end-to-end contract closure
|
||||
from engine through write/flush/catch-up/barrier to data serve.
|
||||
|
||||
## Acceptance Matrix
|
||||
|
||||
### 1. Write / Durability Contract
|
||||
|
||||
| Item | Current state | Required for product | Blocks T6/T7? | Best test level |
|
||||
|------|--------------|---------------------|---------------|-----------------|
|
||||
| WriteLBA() guarantee is explicit | implicit (WAL append + maybe ship) | must document: write-back only, not durable | yes | contract doc + unit |
|
||||
| SyncCache() = durability boundary | partially (barrier in group commit) | must be the single durability commit point | yes | component |
|
||||
| sync_all observable truth | barrier in distributed sync path | barrier success = all replicas durable before return | yes | component |
|
||||
| FUA / fdatasync boundary | exists in flusher | must be explicit in contract: which fsync is the commit fence | no | unit |
|
||||
| WriteLBA returns before replication | yes (fire-and-forget ship) | must document: write != replicated, sync = replicated | yes | contract doc |
|
||||
|
||||
### 2. Fresh Replica Bootstrap
|
||||
|
||||
| Item | Current state | Required for product | Blocks T6/T7? | Best test level |
|
||||
|------|--------------|---------------------|---------------|-----------------|
|
||||
| Fresh replica entry condition | SetReplicaAddrs creates shipper | must have explicit session before live tail | yes | component |
|
||||
| Catch-up target frozen | engine has FrozenTargetLSN | host must freeze target before live streaming | yes | component |
|
||||
| Live stream start condition | phase gate blocks during active session | must verify: gate clears only after catch-up complete | yes | component |
|
||||
| LSN gap prevention | phase gate prevents live tail | must also move backlog (bounded WAL catch-up) | yes | component |
|
||||
| Timeout → replan or rebuild | not implemented | must classify: retry catch-up vs escalate to rebuild | yes | component |
|
||||
| Retention loss → rebuild | not implemented | if WAL recycled during catch-up, must escalate | yes | component |
|
||||
|
||||
### 3. Host Observation Completeness
|
||||
|
||||
| Item | Current state | Required for product | Blocks T6/T7? | Best test level |
|
||||
|------|--------------|---------------------|---------------|-----------------|
|
||||
| ShipperConfiguredObserved | implemented | — | no | — |
|
||||
| ShipperConnectedObserved | implemented | — | no | — |
|
||||
| CatchUpCompleted | implemented | — | no | — |
|
||||
| NeedsRebuildObserved | implemented | — | no | — |
|
||||
| Replay progress observation | not centralized | must feed back as bounded progress events | no | component |
|
||||
| Target reached observation | not implemented | must emit when catch-up reaches frozen target | yes | component |
|
||||
| Timeout classification | not implemented | must distinguish "retry" from "must rebuild" | yes | component |
|
||||
| Retention loss detection | exists in shipper (NeedsRebuild) | must feed back through protocol observation seam | no | component |
|
||||
| Transport contact vs session complete | partially separated | must be strictly distinct in observation path | yes | component |
|
||||
|
||||
### 4. Serving / Publish / Durability Alignment
|
||||
|
||||
| Item | Current state | Required for product | Blocks T6/T7? | Best test level |
|
||||
|------|--------------|---------------------|---------------|-----------------|
|
||||
| publish_healthy requires | role + shipper + connected + durable>0 | must also require: no active recovery session, barrier confirmed | yes | component |
|
||||
| Frontend serving requires | activation gate (T4) | must be derived from one protocol contract, not mixed signals | partially | integration |
|
||||
| Replica durability observable | MinReplicaFlushedLSN on shipper group | must be the same boundary that publish_healthy reads | no | unit |
|
||||
| bootstrap_pending → publish_healthy | requires ShipperConnected + barrier | must require: catch-up complete + barrier durable + no recovery active | yes | component |
|
||||
|
||||
### 5. Snapshot / Rebuild Convergence
|
||||
|
||||
| Item | Current state | Required for product | Blocks T6/T7? | Best test level |
|
||||
|------|--------------|---------------------|---------------|-----------------|
|
||||
| Rebuild entry condition | engine has StartRebuild | must match session contract (not ad-hoc) | no | component |
|
||||
| Rebuild completion → host convergence | engine has RebuildCommitted | host must clear recovery state and re-enter normal protocol | no | integration |
|
||||
| WAL catch-up → snapshot escalation | not implemented | must have explicit timeout/retention boundary | no | component |
|
||||
| Snapshot/build under same protocol model | separate paths | should converge into same session kind dispatch | no | design doc |
|
||||
|
||||
### 6. Adapter Consistency
|
||||
|
||||
| Item | Current state | Required for product | Blocks T6/T7? | Best test level |
|
||||
|------|--------------|---------------------|---------------|-----------------|
|
||||
| RF=2 single-replica ServerID | missing in some paths | must carry stable identity in all adapter paths | yes | component |
|
||||
| ReplicaID derivation | path/serverID convention | must be consistent across v2bridge, blockcmd, registry | no | unit |
|
||||
| Assignment conversion edge cases | mostly covered | must handle: empty ReplicaAddrs, missing ServerID, role mismatch | no | unit |
|
||||
|
||||
### 7. Test Contract Alignment
|
||||
|
||||
| Item | Current state | Required for product | Blocks T6/T7? | Best test level |
|
||||
|------|--------------|---------------------|---------------|-----------------|
|
||||
| Tests treat WriteLBA as commit | some do | must not — WriteLBA is write-back only | yes | audit |
|
||||
| Tests treat SyncCache as barrier | most do | correct — preserve this | no | — |
|
||||
| Bootstrap tests cover bounded catch-up | partial (phase gate only) | must cover: freeze → catch-up → gate clear → live | yes | component |
|
||||
| Contract matrix test | missing | one test per contract boundary: write / ship / flush / barrier / catch-up / publish | yes | component |
|
||||
|
||||
## Priority Order
|
||||
|
||||
### Must close before Stage 1 hardware validation
|
||||
|
||||
1. **Fresh bootstrap bounded catch-up** — the biggest execution gap.
|
||||
Move backlog under the phase gate contract. Freeze target, stream
|
||||
WAL, clear gate on completion.
|
||||
|
||||
2. **Write/durability contract doc** — make WriteLBA vs SyncCache
|
||||
guarantees explicit so tests and callers align.
|
||||
|
||||
3. **RF=2 adapter ServerID** — fix the identity leak so assignments
|
||||
carry complete replica identity.
|
||||
|
||||
4. **Test contract audit** — find and fix tests that treat WriteLBA
|
||||
as a commit boundary.
|
||||
|
||||
### Should close before Stage 2 hardware validation
|
||||
|
||||
5. **Timeout / retention-loss / escalation** — runtime failures feed
|
||||
back as explicit protocol observations.
|
||||
|
||||
6. **Target reached observation** — catch-up completion emits a
|
||||
distinct observation event.
|
||||
|
||||
7. **publish_healthy alignment** — derive from one protocol contract
|
||||
(catch-up complete + barrier durable + no recovery active).
|
||||
|
||||
### Can close after T6/T7
|
||||
|
||||
8. **Snapshot/build convergence** — fold into same protocol model.
|
||||
|
||||
9. **Replay progress observation** — bounded progress events for
|
||||
operator visibility during long catch-ups.
|
||||
|
||||
10. **Full contract matrix test** — one test per boundary.
|
||||
|
||||
## The One-Sentence Gap
|
||||
|
||||
The engine already knows the protocol. The execution body does not yet
|
||||
fully obey it. Closing that gap is the difference between
|
||||
"architecture-complete" and "product-complete."
|
||||
Reference in New Issue
Block a user