# Protocol Development Process

Date: 2026-03-27

## Purpose

This document defines how `sw-block` protocol work should be developed.

The process is meant to work for:

- V2
- future V3
- or a later block algorithm that is not WAL-based

The point is to make protocol work systematic rather than reactive.

## Core Philosophy

### 1. Design before implementation

Do not start with production code and hope the protocol becomes clear later.

Start with:

1. system contract
2. invariants
3. state model
4. scenario backlog

Only then move to implementation.

### 2. Real failures are inputs, not just bugs

When V1 or V1.5 fails in real testing, treat that as:

- a design requirement
- a scenario source
- a simulator input

Do not patch and forget.

### 3. Simulator is part of the protocol, not a side tool

The simulator exists to answer:

- what should happen
- what must never happen
- which old designs fail
- why the new design is better

It is not a replacement for real testing.
It is the design-validation layer before production implementation.

### 4. Passing tests are not enough

Green tests are necessary, not sufficient.

We also require:

- explicit invariants
- explicit scenario intent
- clear state transitions
- review of assumptions and abstraction boundaries
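
As a rough illustration of what "explicit invariants" can mean in simulator code, the sketch below gives each invariant a name and a check function instead of burying it in test assertions. Everything in it is hypothetical: `NodeState`, `Invariant`, `CheckAll`, and the sample invariant are illustrative names and are not taken from `distsim`.

```go
package main

import "fmt"

// NodeState is a hypothetical, reduced view of one replica in a simulator.
type NodeState struct {
	ID         string
	FlushedLSN uint64 // highest LSN durable on this node
}

// Invariant pairs a name with a check over the whole simulated cluster.
type Invariant struct {
	Name  string
	Check func(committedLSN uint64, nodes []NodeState) error
}

// CommittedIsDurable is an assumed example invariant: whatever the cluster
// reports as committed must be durable on at least one node.
var CommittedIsDurable = Invariant{
	Name: "committed-lsn-is-durable-somewhere",
	Check: func(committedLSN uint64, nodes []NodeState) error {
		for _, n := range nodes {
			if n.FlushedLSN >= committedLSN {
				return nil
			}
		}
		return fmt.Errorf("committed LSN %d is not durable on any node", committedLSN)
	},
}

// CheckAll runs every invariant and returns the names of those violated.
func CheckAll(invs []Invariant, committedLSN uint64, nodes []NodeState) []string {
	var violated []string
	for _, inv := range invs {
		if err := inv.Check(committedLSN, nodes); err != nil {
			violated = append(violated, inv.Name)
		}
	}
	return violated
}

func main() {
	nodes := []NodeState{{ID: "a", FlushedLSN: 10}, {ID: "b", FlushedLSN: 7}}
	invs := []Invariant{CommittedIsDurable}
	fmt.Println("violations at committed=10:", CheckAll(invs, 10, nodes))
	fmt.Println("violations at committed=11:", CheckAll(invs, 11, nodes))
}
```

Naming each invariant lets a failing run report which rule broke, which is easier to review than a bare assertion failure.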

### 5. Keep hot-path and recovery-path reasoning separate

Healthy steady-state behavior and degraded recovery behavior are different problems.

Both must be designed explicitly.

## Development Ladder

Every major protocol feature should move through these steps:

1. **Problem statement**
   - what real bug, limit, or product goal is driving the work

2. **Contract**
   - what the protocol guarantees
   - what it does not guarantee

3. **State model**
   - node state
   - coordinator state
   - recovery state
   - role / epoch / lineage rules

4. **Scenario backlog**
   - named scenarios
   - source:
     - real failure
     - design obligation
     - adversarial distributed case

5. **Prototype / simulator**
   - reduced but explicit model
   - invariant checks
   - V1 / V1.5 / V2 comparison where relevant

6. **Implementation**
   - production code only after the protocol shape is clear enough

7. **Real validation**
   - unit
   - component
   - integration
   - real hardware where needed

8. **Feedback loop**
   - turn new failures back into scenario/design inputs
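
The scenario backlog in step 4 can be kept as structured data, so that every scenario has a name and a declared source. The sketch below is one possible shape, not the project's actual format: the `Scenario` type, the `Source` constants, the `Validate` helper, and the invariant names in `main` are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Source records why a scenario exists, mirroring the three sources in step 4.
type Source int

const (
	RealFailure Source = iota + 1
	DesignObligation
	AdversarialCase
)

// Scenario is a hypothetical backlog entry.
type Scenario struct {
	Name       string
	Source     Source
	Invariants []string // names of the invariants the scenario is meant to exercise
}

// Validate rejects backlog entries that are unnamed, unsourced, duplicated,
// or not tied to any invariant.
func Validate(backlog []Scenario) error {
	seen := map[string]bool{}
	for _, s := range backlog {
		switch {
		case s.Name == "":
			return errors.New("scenario without a name")
		case s.Source < RealFailure || s.Source > AdversarialCase:
			return fmt.Errorf("scenario %q has no recognised source", s.Name)
		case len(s.Invariants) == 0:
			return fmt.Errorf("scenario %q names no invariant", s.Name)
		case seen[s.Name]:
			return fmt.Errorf("duplicate scenario name %q", s.Name)
		}
		seen[s.Name] = true
	}
	return nil
}

func main() {
	backlog := []Scenario{
		{Name: "changed-address-restart", Source: RealFailure, Invariants: []string{"replica-identity-stable"}},
		{Name: "stale-message-after-failover", Source: AdversarialCase, Invariants: []string{"stale-epoch-rejected"}},
	}
	fmt.Println("backlog valid:", Validate(backlog) == nil)
}
```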

## Required Artifacts

For protocol work to be considered real progress, we usually want:

### Design

- design doc
- scenario doc
- comparison doc when replacing an older approach

### Prototype

- simulator or prototype code
- tests that assert protocol behavior

### Implementation

- production patch
- production tests
- docs updated to match the actual algorithm

### Review

- implementation gate
- design/protocol gate

## Two-Gate Rule

We use two acceptance gates.

### Gate 1: implementation

Owned by the coding side.

Questions:

- does it build?
- do tests pass?
- does it behave as intended in code?

### Gate 2: protocol/design

Owned by the design/review side.

Questions:

- is the logic actually sound?
- do tests prove the intended thing?
- are assumptions explicit?
- is the abstraction boundary honest?

A task is not accepted until both gates pass.
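
The two-gate rule is a conjunction: a task is accepted only when both owners have answered every question with yes. A minimal sketch of that bookkeeping follows. The `Gate` and `Task` types are hypothetical and exist only to make the rule concrete.

```go
package main

import "fmt"

// Gate is one acceptance gate: an owner answering a list of yes/no questions.
type Gate struct {
	Owner   string
	Answers map[string]bool // question -> answered "yes"
}

// Passed reports whether the gate was reviewed and every answer is yes.
func (g Gate) Passed() bool {
	if len(g.Answers) == 0 {
		return false // a gate with no answers has not been reviewed
	}
	for _, yes := range g.Answers {
		if !yes {
			return false
		}
	}
	return true
}

// Task carries both gates; neither one can stand in for the other.
type Task struct {
	Name           string
	Implementation Gate
	Protocol       Gate
}

// Accepted is true only when both gates pass.
func (t Task) Accepted() bool {
	return t.Implementation.Passed() && t.Protocol.Passed()
}

func main() {
	t := Task{
		Name: "catch-up session",
		Implementation: Gate{Owner: "coding", Answers: map[string]bool{
			"does it build?":  true,
			"do tests pass?":  true,
			"behaves in code": true,
		}},
		Protocol: Gate{Owner: "design/review", Answers: map[string]bool{
			"is the logic actually sound?":        true,
			"do tests prove the intended thing?":  false,
			"are assumptions explicit?":           true,
			"is the abstraction boundary honest?": true,
		}},
	}
	fmt.Println("accepted:", t.Accepted())
}
```

In this example the task is not accepted even though every implementation question is answered yes, because one protocol question is still open.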

## Layering Rule

Keep simulation layers separate.

### `distsim`

Use for:

- protocol correctness
- state transitions
- fencing
- recoverability
- promotion / lineage
- reference-state checking

### `eventsim`

Use for:

- timeout behavior
- timer races
- event ordering
- same-tick / delayed event interactions

Do not duplicate scenarios blindly across both layers.
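
To make the event-layer concerns concrete, the sketch below shows one way to make same-tick ordering deterministic: events are ordered by tick, and ties within a tick are broken by insertion order. This is an assumed policy chosen for illustration. The `Event` and `Queue` types are hypothetical and do not describe how `eventsim` is actually built.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical simulator event.
type Event struct {
	Tick uint64 // logical time at which the event fires
	Seq  uint64 // insertion order, used to break ties within a tick
	Name string
}

// Queue collects scheduled events.
type Queue struct {
	nextSeq uint64
	events  []Event
}

// Schedule adds an event and stamps it with the next sequence number.
func (q *Queue) Schedule(tick uint64, name string) {
	q.events = append(q.events, Event{Tick: tick, Seq: q.nextSeq, Name: name})
	q.nextSeq++
}

// Drain returns all pending events ordered by tick, then by insertion order,
// and empties the queue.
func (q *Queue) Drain() []Event {
	sort.Slice(q.events, func(i, j int) bool {
		a, b := q.events[i], q.events[j]
		if a.Tick != b.Tick {
			return a.Tick < b.Tick
		}
		return a.Seq < b.Seq
	})
	out := q.events
	q.events = nil
	return out
}

func main() {
	var q Queue
	q.Schedule(5, "barrier-timeout")
	q.Schedule(5, "barrier-ack") // same tick as the timeout: a timer race
	q.Schedule(3, "write")
	for _, e := range q.Drain() {
		fmt.Println(e.Tick, e.Name)
	}
}
```

A scenario that cares about the opposite outcome of the race can schedule the two same-tick events in the other order, so both interleavings are tested explicitly.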

## Test Selection Rule

Do not choose simulator inputs only from failing tests.

Review all relevant tests and classify them by:

- protocol significance
- simulator value
- implementation specificity

Good simulator candidates often come from:

- barrier truth
- catch-up vs rebuild
- stale message rejection
- failover / promotion safety
- changed-address restart
- mode semantics

Keep real-only tests for:

- wire format
- OS timing
- exact WAL file behavior
- frontend transport specifics
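
Two of the simulator candidates above, catch-up vs rebuild and stale message rejection, are pure decision logic, which is why they suit a simulator rather than a real-only test. The sketch below shows a reduced version of each. The function names, parameters, and exact rules are assumptions for illustration. In particular, treating a replica that is ahead of the primary as needing a rebuild is a choice made here, not a documented protocol rule.

```go
package main

import "fmt"

// Outcome is the recovery decision for a reconnecting replica.
type Outcome int

const (
	ZeroGap Outcome = iota
	CatchUp
	NeedsRebuild
)

func (o Outcome) String() string {
	switch o {
	case ZeroGap:
		return "zero-gap"
	case CatchUp:
		return "catch-up"
	default:
		return "needs-rebuild"
	}
}

// Classify decides how a replica at replicaLSN can rejoin a primary whose log
// head is headLSN and which still retains entries from retainedFrom onward.
func Classify(replicaLSN, retainedFrom, headLSN uint64) Outcome {
	switch {
	case replicaLSN == headLSN:
		return ZeroGap
	case replicaLSN < headLSN && replicaLSN+1 >= retainedFrom:
		return CatchUp // every missing entry is still retained
	default:
		return NeedsRebuild // gap fell out of retention, or replica is ahead
	}
}

// Message is a hypothetical protocol message carrying its fencing tokens.
type Message struct {
	Epoch     uint64
	SessionID uint64
}

// Accept rejects any message from a different epoch or a superseded session.
func Accept(currentEpoch, activeSession uint64, m Message) bool {
	return m.Epoch == currentEpoch && m.SessionID == activeSession
}

func main() {
	fmt.Println(Classify(100, 40, 100))
	fmt.Println(Classify(60, 40, 100))
	fmt.Println(Classify(10, 40, 100))
	fmt.Println("stale epoch accepted:", Accept(7, 3, Message{Epoch: 6, SessionID: 3}))
}
```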

## Version Comparison Rule

When designing a successor protocol:

- keep the old version visible
- reproduce the old failure or limitation
- show the improved behavior in the new version

For `sw-block`, that means `V1`, `V1.5`, and `V2` should be compared explicitly where possible.
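
One way to keep old versions visible is to run the same named scenario against a model of each version and record which ones hold their invariants. The sketch below shows only that harness shape. The `Model` type, the `Compare` and `Improved` helpers, and `ErrViolated` are hypothetical, and the stub models in `main` are placeholders that do not describe how V1, V1.5, or V2 actually behave.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrViolated is what a model returns when a scenario breaks an invariant.
var ErrViolated = errors.New("invariant violated")

// Model is a hypothetical reduced model of one protocol version.
type Model struct {
	Version string
	Run     func(scenario string) error // nil means all invariants held
}

// Compare runs one scenario against every model and records which held.
func Compare(scenario string, models []Model) map[string]bool {
	held := make(map[string]bool, len(models))
	for _, m := range models {
		held[m.Version] = m.Run(scenario) == nil
	}
	return held
}

// Improved is true only when the old version demonstrably fails the scenario
// and the new version demonstrably handles it.
func Improved(held map[string]bool, oldV, newV string) bool {
	oldHeld, okOld := held[oldV]
	newHeld, okNew := held[newV]
	return okOld && okNew && !oldHeld && newHeld
}

func main() {
	// Placeholder behaviors, chosen only to exercise the harness.
	fails := func(string) error { return ErrViolated }
	holds := func(string) error { return nil }
	models := []Model{
		{Version: "V1", Run: fails},
		{Version: "V1.5", Run: fails},
		{Version: "V2", Run: holds},
	}
	held := Compare("changed-address-restart", models)
	fmt.Println(held["V1"], held["V1.5"], held["V2"])
	fmt.Println("V2 improves on V1.5:", Improved(held, "V1.5", "V2"))
}
```

Requiring the old model to fail before counting an improvement guards against a scenario that passes everywhere and therefore proves nothing about the new design.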

## Documentation Rule

The docs must track three different things:

### `learn/projects/sw-block/`

Use for:

- project history
- V1/V1.5 algorithm records
- phase records
- real test history

### `sw-block/design/`

Use for:

- active design truth
- V2 and later protocol docs
- scenario backlog
- comparison docs

### `sw-block/.private/phase/`

Use for:

- active execution plan
- log
- decisions

## What Good Progress Looks Like

A good protocol iteration usually has this pattern:

1. real failure or design pressure identified
2. scenario named and written down
3. simulator reproduces the bad case
4. new protocol handles it explicitly
5. implementation follows
6. real tests validate it

If one of those steps is missing, confidence is weaker.

## Bottom Line

The process is:

1. design the contract
2. model the state
3. define the scenarios
4. simulate the protocol
5. implement carefully
6. validate in real tests
7. feed failures back into design

That is the process we should keep using for V2 and any later protocol line.