mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-20 16:51:31 +00:00
feat: CP13-8 PASSES — real-workload validation on RF=2 sync_all
CP13-8 scenario results on m01/M02 (25Gbps RoCE): fsck_ext4: CLEAN file count: 200 (assert_equal PASS) checksum match: MATCH (assert_contains PASS) pgbench TPS: 565.69 (assert_greater PASS) auto-failover: 10.0.0.1:18480 → 10.0.0.3:18480 Code changes (tester + scenario): - volume_server_block.go: readiness state, assignment lifecycle cleanup - block_heartbeat_loop.go: readiness-aware heartbeat reporting - store_blockvol.go: readiness tracking - master_server_handlers_block.go: block API handler updates - cp13-8-real-workload-validation.yaml: redesigned scenario (removed block_promote, use natural auto-failover flow, bootstrap write before wait_volume_healthy) - testrunner/actions/devops.go: scenario action improvements - replica_read_test.go: component-level replica read test Phase docs: CP13-7 accepted, CP13-8/8A technical packs updated, design docs updated for protocol closure evidence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1284,3 +1284,360 @@ Review checklist:
|
||||
3. is rebuild handoff bounded and epoch-safe?
|
||||
4. is post-rebuild progress initialized from checkpoint truth?
|
||||
5. is the checkpoint still bounded to rebuild fallback?
|
||||
|
||||
---
|
||||
|
||||
### `CP13-8` Technical Pack
|
||||
|
||||
Date: 2026-04-03
|
||||
Goal: validate the accepted `RF=2 sync_all` replication contract on one bounded set of real workloads so the engineering proof is demonstrated on named real block-device consumers rather than only protocol-level tests
|
||||
|
||||
#### Layer 1: Semantic Core
|
||||
|
||||
##### Problem statement
|
||||
|
||||
`CP13-1..7` have progressively closed replication correctness: address truth, durable progress, state eligibility, reconnect/catch-up, retention, and rebuild fallback.
|
||||
What remains is not more replication semantics. It is proving that the accepted contract survives contact with bounded real workloads.
|
||||
|
||||
`CP13-8` therefore accepts only one bounded thing:
|
||||
|
||||
1. one bounded real-workload validation package for the accepted `RF=2 sync_all` path
|
||||
|
||||
It does not accept:
|
||||
|
||||
1. broad launch approval
|
||||
2. broad benchmark positioning
|
||||
3. mode normalization or general product-policy closure
|
||||
|
||||
##### State / contract
|
||||
|
||||
`CP13-8` must make these truths explicit:
|
||||
|
||||
1. the workload envelope is named and bounded:
|
||||
- topology
|
||||
- transport/frontend
|
||||
- filesystem/application consumer
|
||||
- disturbance shapes included
|
||||
- exclusions
|
||||
2. the accepted `CP13-1..7` replication contract is the thing being validated, not redefined
|
||||
3. real-workload evidence must be replayable and attributable
|
||||
4. passing one named workload package does not imply generic production readiness outside the stated envelope
|
||||
|
||||
##### Reject shapes
|
||||
|
||||
Reject before implementation or review if the checkpoint:
|
||||
|
||||
1. presents ad hoc manual runs without a named envelope
|
||||
2. treats synthetic benchmarks as substitutes for real workload validation
|
||||
3. mixes workload validation with mode normalization or rollout-approval claims
|
||||
4. reopens already-accepted replication semantics instead of validating them
|
||||
|
||||
#### Layer 2: Execution Core
|
||||
|
||||
##### Current gap `CP13-8` must close
|
||||
|
||||
1. accepted replication semantics are still primarily validated by protocol/unit/adversarial evidence
|
||||
2. one bounded real workload package is still needed to show the contract survives real filesystem/application behavior
|
||||
3. the project still needs a replayable workload-evidence object before talking about final mode normalization or broader launch shaping
|
||||
|
||||
##### Suggested file targets
|
||||
|
||||
1. `weed/storage/blockvol/testrunner/*`
|
||||
2. bounded component/real-device validation under `weed/storage/blockvol/test/`
|
||||
3. workload docs or result artifacts under `sw-block/.private/phase/`
|
||||
4. `weed/server/*` or `blockvol` only if a real workload exposes a concrete bug
|
||||
|
||||
##### Validation focus
|
||||
|
||||
Required proofs:
|
||||
|
||||
1. filesystem proof
|
||||
- one named real filesystem workload completes correctly on the accepted path
|
||||
2. application proof
|
||||
- one named database/application workload completes correctly on the accepted path
|
||||
3. disturbance proof
|
||||
- only if explicitly included in the envelope, one bounded disturbance case remains correct
|
||||
4. envelope proof
|
||||
- topology/frontend/workload/exclusions are explicit and replayable
|
||||
5. boundedness proof
|
||||
- checkpoint remains about workload validation, not `CP13-9+`
|
||||
|
||||
Reject if:
|
||||
|
||||
1. a claimed proof is only a harness smoke test without real workload semantics
|
||||
2. failures cannot be attributed because the environment is underspecified
|
||||
3. delivery wording implies broad launch readiness from one bounded package
|
||||
|
||||
##### Suggested first cut
|
||||
|
||||
1. freeze one explicit workload matrix first, before chasing more scenarios
|
||||
2. use one filesystem workload and one application/database workload
|
||||
3. keep disturbances narrow and named if included at all
|
||||
4. produce one result artifact that ties outcomes back to accepted `CP13-1..7` semantics
|
||||
|
||||
##### Assignment For `sw`
|
||||
|
||||
1. Goal
|
||||
- deliver bounded real-workload validation on the accepted `RF=2 sync_all` path
|
||||
2. Required outputs
|
||||
- one explicit workload-envelope summary
|
||||
- one focused code/harness package only where needed to make the bounded workloads replayable
|
||||
- one delivery note explaining:
|
||||
- files updated in place
|
||||
- workload matrix
|
||||
- proof shape
|
||||
- what later checkpoints remain untouched
|
||||
3. Hard rules
|
||||
- do not broaden into generic benchmark marketing
|
||||
- do not claim launch approval from one workload package
|
||||
- do not reopen accepted `CP13-1..7` semantics unless the workload exposes a concrete bug
|
||||
|
||||
##### Assignment For `tester`
|
||||
|
||||
1. Goal
|
||||
- validate that `CP13-8` closes bounded real-workload validation and nothing broader
|
||||
2. Validate
|
||||
- a named filesystem workload completes correctly
|
||||
- a named application/database workload completes correctly
|
||||
- the environment and exclusions are explicit
|
||||
- evidence is replayable and attributable
|
||||
- no-overclaim around `CP13-9+`
|
||||
3. Reject if
|
||||
- workload evidence is underspecified or non-replayable
|
||||
- the validation object quietly broadens into mode/rollout policy
|
||||
- failures are explained away without a bounded root cause
|
||||
|
||||
#### Short judgment
|
||||
|
||||
`CP13-8` is acceptable when:
|
||||
|
||||
1. one bounded real-workload matrix is explicit
|
||||
2. the accepted replication contract is demonstrated on named real consumers
|
||||
3. the resulting evidence is replayable and bounded
|
||||
4. the checkpoint stays clearly separate from `CP13-9+`
|
||||
|
||||
---
|
||||
|
||||
### `CP13-8` Delivery Pack
|
||||
|
||||
Bounded contract:
|
||||
|
||||
1. `CP13-8` accepts real-workload validation only
|
||||
2. it does not accept mode normalization, rollout approval, or broad performance positioning
|
||||
|
||||
What `sw` should deliver:
|
||||
|
||||
1. one focused contract review of the workload envelope and its relation to accepted `CP13-1..7` semantics
|
||||
2. one bounded harness/evidence package only where needed to run the chosen workloads replayably
|
||||
3. one delivery note with:
|
||||
- changed files
|
||||
- workload matrix
|
||||
- proof shape
|
||||
- no-overclaim statement
|
||||
|
||||
Recommended delivery shape:
|
||||
|
||||
1. contract:
|
||||
- define the named workload envelope and exclusions
|
||||
2. code/tests/harness:
|
||||
- keep updates local to real-workload validation surfaces
|
||||
- make workload pass/fail conditions directly observable
|
||||
3. note:
|
||||
- distinguish primary proof from support evidence
|
||||
- explain why `CP13-9+` remains untouched
|
||||
|
||||
Review checklist:
|
||||
|
||||
1. is the workload envelope explicit and bounded?
|
||||
2. are the workloads real consumers, not just synthetic microbenchmarks?
|
||||
3. is evidence replayable and attributable?
|
||||
4. does the package validate accepted semantics rather than redefining them?
|
||||
5. is the checkpoint still bounded to real-workload validation?
|
||||
|
||||
---
|
||||
|
||||
### `CP13-8A` Technical Pack
|
||||
|
||||
Date: 2026-04-03
|
||||
Goal: close the assignment-to-publication contradiction exposed by `CP13-8` so the accepted `RF=2 sync_all` path no longer publishes replica readiness from allocation or assignment presence alone
|
||||
|
||||
#### Layer 1: Semantic Core
|
||||
|
||||
##### Problem statement
|
||||
|
||||
`CP13-8` exposed a live contradiction:
|
||||
|
||||
1. control truth says the replica exists and has assignment/addresses
|
||||
2. runtime truth may still be between:
|
||||
- role applied
|
||||
- receiver startup
|
||||
- shipper attachment
|
||||
- publish-ready closure
|
||||
3. external surfaces can therefore overstate readiness before the replica is actually safe to publish as a real block-device peer
|
||||
|
||||
`CP13-8A` therefore accepts only one bounded thing:
|
||||
|
||||
1. one bounded assignment-to-publication closure slice for the accepted `RF=2 sync_all` path
|
||||
|
||||
It does not accept:
|
||||
|
||||
1. broad mode normalization
|
||||
2. launch approval
|
||||
3. backend replacement by implication
|
||||
4. timing-based “wait longer” fixes that leave readiness semantics implicit
|
||||
|
||||
##### State / contract
|
||||
|
||||
`CP13-8A` must make these truths explicit:
|
||||
|
||||
1. assignment delivered is not the same as receiver ready
|
||||
2. receiver ready is not the same as publish healthy
|
||||
3. lookup / heartbeat / tester health must consume the same bounded readiness truth
|
||||
4. the closure remains inside the current chosen path:
|
||||
- `RF=2`
|
||||
- `sync_all`
|
||||
- current master / volume-server heartbeat path
|
||||
- `blockvol` backend
|
||||
|
||||
##### Reject shapes
|
||||
|
||||
Reject before implementation or review if the slice:
|
||||
|
||||
1. leaves two semantic assignment paths alive (`store`-only vs service/runtime path)
|
||||
2. treats allocation completion or precomputed ports as equivalent to publication readiness
|
||||
3. relies on sleeps, retries, or ad hoc timing instead of explicit readiness state
|
||||
4. broadens into `CP13-9` mode policy or generic backend redesign
|
||||
|
||||
#### Layer 2: Execution Core
|
||||
|
||||
##### Current gap `CP13-8A` must close
|
||||
|
||||
1. assignment application and replication/publication setup are still too easy to split semantically
|
||||
2. readiness truth is not yet a fully explicit first-class product surface across heartbeat / lookup / tester
|
||||
3. real-workload reruns cannot cleanly distinguish:
|
||||
- backend data-visibility bug
|
||||
- adapter timing/publication bug
|
||||
- true core-rule gap
|
||||
|
||||
##### Suggested file targets
|
||||
|
||||
1. `weed/server/volume_server_block.go`
|
||||
2. `weed/server/block_heartbeat_loop.go`
|
||||
3. `weed/server/master_block_registry.go`
|
||||
4. `weed/server/master_grpc_server_block.go`
|
||||
5. `weed/server/master_server_handlers_block.go`
|
||||
6. `weed/storage/blockvol/testrunner/actions/devops.go`
|
||||
7. bounded tests under `weed/server/*`
|
||||
|
||||
##### Validation focus
|
||||
|
||||
Required proofs:
|
||||
|
||||
1. lifecycle proof
|
||||
- assignment processing uses one authoritative path from role apply through runtime wiring
|
||||
2. readiness proof
|
||||
- replica-ready is explicit and not inferred from existence/allocation alone
|
||||
3. publication proof
|
||||
- lookup / heartbeat / tester surfaces do not publish the replica before readiness closure
|
||||
4. rerun proof
|
||||
- a bounded `CP13-8` rerun moves the remaining contradiction into an attributable bug class rather than mixed-state ambiguity
|
||||
5. boundedness proof
|
||||
- the slice remains about closure, not `CP13-9`
|
||||
|
||||
Reject if:
|
||||
|
||||
1. a claimed proof still depends on manual interpretation of timing
|
||||
2. different surfaces use different meanings of “healthy” or “ready”
|
||||
3. the rerun still fails but the failure cannot be classified beyond “timing”
|
||||
|
||||
##### Suggested first cut
|
||||
|
||||
1. make `BlockService` the single assignment/readiness owner on the VS side
|
||||
2. define one explicit readiness surface and project it into lookup/REST/tester gates
|
||||
3. rerun the bounded `CP13-8` workload package only after closure lands
|
||||
4. classify any remaining failure as:
|
||||
- backend data bug
|
||||
- adapter/publication bug
|
||||
- core-rule gap
|
||||
|
||||
##### Assignment For `sw`
|
||||
|
||||
1. Goal
|
||||
- deliver bounded assignment-to-publication closure on the accepted `RF=2 sync_all` path
|
||||
2. Required outputs
|
||||
- one focused code package closing the assignment/readiness/publication split
|
||||
- one delivery note explaining:
|
||||
- files updated in place
|
||||
- named readiness states
|
||||
- proof shape
|
||||
- `CP13-8` rerun outcome or remaining attributable contradiction
|
||||
- what later checkpoints remain untouched
|
||||
3. Hard rules
|
||||
- do not use timing sleeps as semantic fixes
|
||||
- do not broaden into `CP13-9` mode normalization
|
||||
- do not replace `blockvol` backend in this slice
|
||||
- do not reopen accepted `CP13-1..7` semantics unless a live contradiction is found
|
||||
|
||||
##### Assignment For `tester`
|
||||
|
||||
1. Goal
|
||||
- validate that `CP13-8A` closes assignment-to-publication truth and nothing broader
|
||||
2. Validate
|
||||
- one authoritative assignment path exists
|
||||
- readiness is explicit and externally consistent
|
||||
- lookup / heartbeat / tester health no longer overpublish readiness
|
||||
- the bounded `CP13-8` rerun is attributable
|
||||
- no-overclaim around `CP13-9`
|
||||
3. Reject if
|
||||
- old mixed-state behavior still leaks through one surface
|
||||
- the slice depends on timing luck
|
||||
- the rerun still fails but the team cannot say whether it is backend, adapter, or core
|
||||
|
||||
#### Short judgment
|
||||
|
||||
`CP13-8A` is acceptable when:
|
||||
|
||||
1. assignment-to-publication closure is explicit on the chosen path
|
||||
2. readiness is no longer inferred from allocation or assignment presence
|
||||
3. all product/tester surfaces consume the same bounded readiness truth
|
||||
4. the rerun result is attributable and the slice stays separate from `CP13-9`
|
||||
|
||||
---
|
||||
|
||||
### `CP13-8A` Delivery Pack
|
||||
|
||||
Bounded contract:
|
||||
|
||||
1. `CP13-8A` accepts assignment-to-publication closure only
|
||||
2. it does not accept mode normalization, broad launch approval, or backend replacement
|
||||
|
||||
What `sw` should deliver:
|
||||
|
||||
1. one focused closure package across VS assignment, heartbeat, lookup, and tester health surfaces
|
||||
2. one bounded rerun or equivalent evidence showing whether the remaining contradiction is backend, adapter, or core
|
||||
3. one delivery note with:
|
||||
- changed files
|
||||
- named readiness states
|
||||
- proof shape
|
||||
- rerun outcome / remaining attributable contradiction
|
||||
- no-overclaim statement
|
||||
|
||||
Recommended delivery shape:
|
||||
|
||||
1. contract:
|
||||
- define explicit readiness/publication truth for the chosen path
|
||||
2. code/tests:
|
||||
- unify assignment lifecycle
|
||||
- gate publication on readiness closure
|
||||
- prove lookup / heartbeat / tester consistency
|
||||
3. note:
|
||||
- distinguish closure proof from any later pure-core redesign
|
||||
- explain why `CP13-9` remains untouched
|
||||
|
||||
Review checklist:
|
||||
|
||||
1. is assignment processing semantically unified?
|
||||
2. is readiness explicit rather than inferred?
|
||||
3. do lookup / heartbeat / tester surfaces agree on publication truth?
|
||||
4. does the bounded rerun become attributable?
|
||||
5. is the slice still bounded to closure rather than mode policy or backend replacement?
|
||||
|
||||
@@ -584,12 +584,177 @@ Reject if:
|
||||
|
||||
Status:
|
||||
|
||||
- accepted
|
||||
|
||||
Carry-forward:
|
||||
|
||||
1. `NeedsRebuild` is now a real fail-closed fallback state
|
||||
2. rebuild handoff and post-rebuild progress are bounded by checkpoint truth rather than implicit recovery assumptions
|
||||
3. `CP13-8` must validate the accepted replication contract on named real workloads without reopening protocol semantics or quietly broadening into mode policy work
|
||||
|
||||
### `CP13-8`: Real-Workload Validation
|
||||
|
||||
Goal:
|
||||
|
||||
- validate the accepted `RF=2 sync_all` replication contract on one bounded set of real workloads so the engineering proof is no longer only protocol/unit-level but also demonstrated on named real block-device consumers
|
||||
|
||||
Acceptance object:
|
||||
|
||||
1. `CP13-8` accepts one bounded real-workload validation package for the accepted `RF=2 sync_all` path
|
||||
2. it does not accept broad rollout claims, broad benchmark positioning, or mode normalization by implication
|
||||
3. it does not accept vague “worked in a manual run” reasoning without named workloads, bounded envelope, and replayable evidence
|
||||
|
||||
Execution steps:
|
||||
|
||||
1. Step 1: workload envelope freeze
|
||||
- name one bounded validation matrix:
|
||||
- workload(s)
|
||||
- topology
|
||||
- transport/frontend
|
||||
- filesystem/application surface
|
||||
- disturbance shapes included and excluded
|
||||
- recommended first-cut surfaces:
|
||||
- real filesystem behavior such as `ext4`
|
||||
- one database/application surface such as `PostgreSQL`
|
||||
2. Step 2: harness and evidence hardening
|
||||
- wire the workload run through real block-device consumers on the accepted path
|
||||
- keep the environment reproducible and bounded enough that failures are attributable
|
||||
- collect evidence at the same semantic layer as accepted prior checkpoints
|
||||
3. Step 3: proof package
|
||||
- prove the named real workloads complete correctly on the accepted path
|
||||
- prove disturbance/failover behavior is bounded inside the named envelope if included
|
||||
- prove no-overclaim around `CP13-9+`
|
||||
|
||||
Required scope:
|
||||
|
||||
1. one bounded workload matrix on the accepted `RF=2 sync_all` path
|
||||
2. real block-device consumer validation (not only protocol/unit tests)
|
||||
3. bounded disturbance cases only if explicitly named in the envelope
|
||||
4. explicit separation between real-workload proof and later mode normalization / rollout claims
|
||||
|
||||
Must prove:
|
||||
|
||||
1. the accepted replication contract survives contact with named real workloads
|
||||
2. evidence is tied to a bounded environment and workload envelope, not generic “production ready” rhetoric
|
||||
3. failures, if any, are attributable to explicit workload-envelope gaps rather than ambiguous harness drift
|
||||
4. acceptance wording stays bounded to real-workload validation rather than `CP13-9` policy/mode closure
|
||||
|
||||
Reuse discipline:
|
||||
|
||||
1. prefer existing `testrunner`, bounded component scenarios, and real-device harnesses where possible
|
||||
2. update `weed/storage/blockvol/*` only when the real workload exposes a concrete bug in accepted semantics
|
||||
3. `weed/server/*` should remain reference only unless workload validation exposes a surfaced control/runtime issue
|
||||
4. no checkpoint work may silently broaden into generic benchmark marketing, launch approval, or mode policy redesign
|
||||
|
||||
Verification mechanism:
|
||||
|
||||
1. one named workload matrix with explicit environment description
|
||||
2. replayable runs or artifacts for the chosen workload package
|
||||
3. explicit pass/fail conditions tied back to accepted `CP13-1..7` semantics
|
||||
4. no-overclaim review so `CP13-8` does not absorb `CP13-9+`
|
||||
|
||||
Hard indicators:
|
||||
|
||||
1. one accepted filesystem proof:
|
||||
- a named real filesystem workload completes correctly on the accepted path
|
||||
2. one accepted application proof:
|
||||
- a named real application/database workload completes correctly on the accepted path
|
||||
3. one accepted envelope proof:
|
||||
- the validation matrix is explicit about topology, frontend, workload, and exclusions
|
||||
4. one accepted boundedness proof:
|
||||
- `CP13-8` claims real-workload validation only
|
||||
|
||||
Reject if:
|
||||
|
||||
1. the checkpoint relies on ad hoc manual runs with no bounded envelope
|
||||
2. a claimed real-workload proof is actually only a synthetic benchmark or unit test
|
||||
3. delivery wording quietly broadens into mode normalization, launch approval, or general production-readiness claims
|
||||
|
||||
Status:
|
||||
|
||||
- active
|
||||
|
||||
### `CP13-8A`: Assignment-to-Publication Closure
|
||||
|
||||
Goal:
|
||||
|
||||
- close the control/runtime/publication contradiction exposed by `CP13-8` so the system no longer treats allocation or assignment presence as equivalent to replica publication readiness
|
||||
|
||||
Acceptance object:
|
||||
|
||||
1. `CP13-8A` accepts one bounded closure slice for assignment-to-publication truth on the accepted `RF=2 sync_all` path
|
||||
2. it does not accept broad mode normalization, launch approval, or backend replacement by implication
|
||||
3. it does not accept sleep-based or timing-based fixes that leave readiness semantics implicit
|
||||
|
||||
Execution steps:
|
||||
|
||||
1. Step 1: unify assignment lifecycle
|
||||
- ensure assignment delivery flows through one authoritative path from role apply to receiver/shipper wiring to readiness bookkeeping
|
||||
- remove semantic split between store-only role application and service-level replication/publication setup
|
||||
2. Step 2: name readiness and publication truth
|
||||
- define explicit readiness states for the chosen path
|
||||
- ensure heartbeat / lookup / tester surfaces distinguish:
|
||||
- allocated
|
||||
- role applied
|
||||
- receiver ready
|
||||
- publish healthy
|
||||
3. Step 3: bounded rerun
|
||||
- rerun the bounded `CP13-8` workload package after closure lands
|
||||
- determine whether the remaining contradiction is backend data visibility, adapter timing/publication, or a true core-rule gap
|
||||
|
||||
Required scope:
|
||||
|
||||
1. assignment-to-publication closure only
|
||||
2. chosen path only: `RF=2 sync_all`
|
||||
3. existing master / volume-server heartbeat path only
|
||||
4. `blockvol` remains the execution backend
|
||||
|
||||
Must prove:
|
||||
|
||||
1. assignment delivered does not by itself imply receiver ready or publish healthy
|
||||
2. replica publication requires explicit readiness closure rather than allocation completion or precomputed port presence
|
||||
3. master lookup / REST / tester health checks consume the same bounded readiness truth
|
||||
4. `CP13-8A` remains about closure, not mode normalization or backend redesign
|
||||
|
||||
Reuse discipline:
|
||||
|
||||
1. prefer `weed/server/*` and bridge-layer updates first because this is a surfaced control/runtime issue
|
||||
2. update `weed/storage/blockvol/*` only if closure work exposes a concrete backend bug rather than a publication-path contradiction
|
||||
3. keep `CP13-1..7` semantics fixed unless the closure work exposes a live contradiction
|
||||
4. no checkpoint work may silently broaden into `CP13-9` mode policy or broad rollout claims
|
||||
|
||||
Verification mechanism:
|
||||
|
||||
1. one focused proof set around assignment lifecycle closure and readiness/publication gating
|
||||
2. explicit tests that heartbeat / lookup / tester surfaces do not publish a replica before readiness closes
|
||||
3. bounded `CP13-8` rerun or equivalent evidence showing the contradiction moves from mixed-state ambiguity to an attributable remaining cause
|
||||
4. no-overclaim review so `CP13-8A` does not absorb `CP13-9`
|
||||
|
||||
Hard indicators:
|
||||
|
||||
1. one accepted lifecycle proof:
|
||||
- assignment processing uses one authoritative path from role apply through runtime wiring
|
||||
2. one accepted readiness proof:
|
||||
- replica-ready is explicit and not inferred from mere existence/allocation
|
||||
3. one accepted publication proof:
|
||||
- lookup / heartbeat / tester gates do not publish a replica before readiness closure
|
||||
4. one accepted boundedness proof:
|
||||
- `CP13-8A` claims closure only and leaves broader mode policy untouched
|
||||
|
||||
Reject if:
|
||||
|
||||
1. assignment still reaches different semantic outcomes depending on whether it flows through heartbeat/store-only or service-level processing
|
||||
2. a replica can still be surfaced as healthy/ready before receiver/session readiness closes
|
||||
3. the slice relies on delays or ad hoc retries rather than explicit readiness semantics
|
||||
4. delivery wording broadens into `CP13-9` mode normalization, launch approval, or generic backend replacement
|
||||
|
||||
Status:
|
||||
|
||||
- active
|
||||
|
||||
### Later checkpoints inside `Phase 13`
|
||||
|
||||
1. `CP13-8`: real-workload validation
|
||||
2. `CP13-9`: mode normalization
|
||||
1. `CP13-9`: mode normalization (only after `CP13-8A` closes the assignment/publication contradiction)
|
||||
|
||||
## Reuse Discipline
|
||||
|
||||
|
||||
@@ -7,6 +7,7 @@ Historical planning/review documents were moved to `../docs/archive/design/` to
|
||||
## Read First
|
||||
|
||||
- `v2-protocol-truths.md`
|
||||
- `v2-protocol-claim-and-evidence.md`
|
||||
- `v2-product-completion-overview.md`
|
||||
- `v2-phase-development-plan.md`
|
||||
- `v2-semantic-methodology.zh.md`
|
||||
@@ -27,6 +28,7 @@ Historical planning/review documents were moved to `../docs/archive/design/` to
|
||||
- `v2_scenarios.md`
|
||||
- `v2-scenario-sources-from-v1.md`
|
||||
- `v1-v15-v2-comparison.md`
|
||||
- `v2-reuse-replacement-boundary.md`
|
||||
- `wal-replication-v2.md`
|
||||
- `wal-replication-v2-state-machine.md`
|
||||
- `wal-replication-v2-orchestrator.md`
|
||||
|
||||
145
sw-block/design/v2-protocol-claim-and-evidence.md
Normal file
145
sw-block/design/v2-protocol-claim-and-evidence.md
Normal file
@@ -0,0 +1,145 @@
|
||||
# V2 Protocol Claim And Evidence
|
||||
|
||||
Date: 2026-04-03
|
||||
Status: active
|
||||
Purpose: keep one centralized ledger for the current chosen envelope, accepted claims, supporting evidence, invalidated evidence, and rerun obligations
|
||||
|
||||
## Why This Document Exists
|
||||
|
||||
`v2-protocol-truths.md` records stable protocol truths.
|
||||
`v2-protocol-closure-map.zh.md` records the structural closure model.
|
||||
|
||||
What they do not track in one place is the current operational contract:
|
||||
|
||||
1. which claims are allowed right now
|
||||
2. which baselines are accepted right now
|
||||
3. which evidence supports each claim
|
||||
4. which evidence has been narrowed or invalidated
|
||||
5. which reruns are required before a claim can be restored
|
||||
|
||||
This document is that ledger.
|
||||
|
||||
## How To Use It
|
||||
|
||||
When reviewing any new slice, bug fix, workload run, or delivery note, ask:
|
||||
|
||||
1. which current claim does this change strengthen, narrow, or invalidate?
|
||||
2. which evidence row should be updated?
|
||||
3. does the change alter the current chosen envelope?
|
||||
4. does any old claim now require rerun or reclassification?
|
||||
|
||||
If the answer changes the current state of the product, update this ledger in the same change.
|
||||
|
||||
## Current Chosen Envelope
|
||||
|
||||
This is the bounded envelope currently allowed for active V2 claims:
|
||||
|
||||
| Item | Current value | Source |
|
||||
|------|---------------|--------|
|
||||
| Replication factor | `RF=2` | `v2-protocol-closure-map.zh.md` |
|
||||
| Durability mode | `sync_all` | `v2-protocol-closure-map.zh.md`, `Phase 13` |
|
||||
| Control path | current master / volume-server heartbeat path | `v2-protocol-closure-map.zh.md` |
|
||||
| Execution backend | `blockvol` | `v2-protocol-closure-map.zh.md`, `v2-reuse-replacement-boundary.md` |
|
||||
| Frontend in active validation | iSCSI | `Phase 11`, `CP13-8` |
|
||||
| Real-workload checkpoint | `CP13-8` | `Phase 13` |
|
||||
|
||||
Current explicit exclusions:
|
||||
|
||||
1. `RF>2` as a general accepted product claim
|
||||
2. broad mode normalization before `CP13-9`
|
||||
3. broad rollout / launch approval
|
||||
4. broad transport matrix claims outside explicitly named evidence
|
||||
5. treating synthetic benchmarks as substitutes for real workload validation
|
||||
|
||||
## Active Protocol Constraints
|
||||
|
||||
These are the currently binding constraints that later work must preserve.
|
||||
|
||||
| ID | Constraint | Source | Current status |
|
||||
|----|------------|--------|----------------|
|
||||
| `T1` | `CommittedLSN` is the external truth boundary | `v2-protocol-truths.md` | active |
|
||||
| `T9` | truncation is a protocol boundary, not cleanup | `v2-protocol-truths.md` | active |
|
||||
| `T14` | engine remains recovery authority; storage remains truth source | `v2-protocol-truths.md` | active |
|
||||
| `T15` | reuse reality, not inherited semantics | `v2-protocol-truths.md` | active |
|
||||
| `CP13-2` | stable identity must not be inferred from transport address shape | `Phase 13` | active |
|
||||
| `CP13-3` | durable authority is `replicaFlushedLSN`, not legacy success inference | `Phase 13` | active |
|
||||
| `CP13-4` | only eligible replica state may satisfy sync durability | `Phase 13` | active |
|
||||
| `CP13-5` | reconnect must use explicit handshake / catch-up semantics | `Phase 13` | active |
|
||||
| `CP13-6` | retention must fail closed for lagging replicas | `Phase 13` | active |
|
||||
| `CP13-7` | unrecoverable gap must escalate to `NeedsRebuild` and block normal paths | `Phase 13` | active |
|
||||
| `CP13-8A` | assignment delivered != receiver ready != publish healthy | `Phase 13` | active |
|
||||
|
||||
## Accepted Baselines
|
||||
|
||||
| Baseline | What it is allowed to say | Evidence location | Current validity |
|
||||
|----------|---------------------------|-------------------|------------------|
|
||||
| `CP13-1` replication baseline inventory | which tests originally passed/failed/`PASS*` before `CP13-2..7` closure | `sw-block/.private/phase/phase-13-cp1-baseline.md` | valid as baseline inventory, not as final product claim |
|
||||
| `Phase 12 P4` bounded floor | one bounded performance floor and rollout-gate package on the accepted chosen path | `sw-block/.private/phase/phase-12-p4-floor.md`, `phase-12-p4-rollout-gates.md` | valid inside its named envelope |
|
||||
| real-workload envelope draft | one bounded `ext4 + pgbench` package for `CP13-8` | `sw-block/.private/phase/phase-13-cp8-workload-validation.md` | active draft; full claim pending rerun after blockers close |
|
||||
|
||||
## Allowed Claims
|
||||
|
||||
These are the claims that may currently be made without overreach.
|
||||
|
||||
| Claim ID | Allowed claim | Scope boundary | Evidence anchor | Status |
|
||||
|----------|---------------|----------------|-----------------|--------|
|
||||
| `C-RF2-SYNCALL-CONTRACT` | the accepted `RF=2 sync_all` replication contract is closed at protocol/unit/adversarial level through `CP13-1..7` | protocol/unit/adversarial evidence only | `Phase 13` docs and tests | allowed |
|
||||
| `C-WORKLOAD-DRAFT` | one bounded real-workload validation package is defined for `CP13-8` | package definition only, not final pass claim | `phase-13-cp8-workload-validation.md`, YAML scenario | allowed |
|
||||
| `C-WORKLOAD-PASS` | the bounded real-workload package passes on the chosen path | only after rerun succeeds on corrected path | `CP13-8` rerun artifact | not yet allowed |
|
||||
| `C-ADAPTER-CLOSURE` | assignment / readiness / publication closure is explicit on the chosen path | only after `CP13-8A` acceptance | `CP13-8A` proof package | in progress |
|
||||
| `C-MODE-NORMALIZATION` | mode policy / normalization is closed | only in `CP13-9` or later | future | not allowed |
|
||||
| `C-LAUNCH-APPROVAL` | broad product launch readiness | outside current phase | future | not allowed |
|
||||
|
||||
## Evidence Map
|
||||
|
||||
| Evidence area | What it proves | Primary evidence | Support evidence |
|
||||
|---------------|----------------|------------------|------------------|
|
||||
| Identity / addressing | stable identity and routable publication | `CP13-2` tests and docs | `qa_block_soak_test.go`, `sync_all_bug_test.go` |
|
||||
| Durable progress | barrier durability truth and non-legacy authority | `CP13-3` tests and docs | protocol tests around barrier handling |
|
||||
| State eligibility | only eligible replica state may satisfy sync durability | `CP13-4` tests and docs | adversarial state tests |
|
||||
| Reconnect / catch-up | reconnect uses handshake/catch-up rather than bootstrap | `CP13-5` tests and docs | adversarial reconnect tests |
|
||||
| Retention | lagging replica retains WAL or escalates fail closed | `CP13-6` tests and docs | retention protocol tests |
|
||||
| Rebuild fallback | unrecoverable gap escalates to `NeedsRebuild` and blocks normal paths | `CP13-7` tests and docs | rebuild tests |
|
||||
| Performance floor | one bounded measured floor and rollout-gate package | `Phase 12 P4` docs/tests | cited baseline artifact |
|
||||
| Real-workload package | one bounded workload matrix exists | `CP13-8` scenario/doc | tester validation reports |
|
||||
| Assignment/publication closure | assignment does not imply readiness/publication | `CP13-8A` code/tests/debug evidence | tester investigation, bug docs |
|
||||
|
||||
## Invalidated Or Narrowed Evidence
|
||||
|
||||
This section records evidence that cannot currently be used at full strength.
|
||||
|
||||
| ID | Affected claim/evidence | Narrowing reason | Scope | Action required |
|
||||
|----|-------------------------|------------------|-------|-----------------|
|
||||
| `INV-CP13-8A-01` | any weed-VS scenario claim that `block_promote` preserved replication automatically | promote path could leave new primary without replica shipper wiring; barrier then became vacuous with `0` shippers | recent weed-VS testrunner scenarios using `block_promote` | rerun after fix |
|
||||
| `INV-CP13-8A-02` | bounded real-workload `CP13-8` pass claim | blocked by assignment/publication contradiction and then by promote/shipper closure issue | `CP13-8` only | rerun after `CP13-8A` blocker fixes |
|
||||
| `INV-CLAIM-SPREAD-01` | claims embedded only in phase delivery notes | phase docs are not a reliable centralized current-state ledger | all scattered phase notes | migrate ongoing claim state here |
|
||||
|
||||
Unaffected evidence currently believed to remain valid:
|
||||
|
||||
1. standalone `iscsi-target` scenarios that used direct `assign + set_replica` wiring rather than weed-VS `block_promote`
|
||||
2. protocol/unit/adversarial evidence from accepted `CP13-1..7`
|
||||
3. performance-only scenarios that did not claim active cross-node replication through the broken promote path
|
||||
|
||||
## Open Contradictions And Blockers
|
||||
|
||||
| ID | Blocker | Current classification | Impact |
|
||||
|----|---------|------------------------|--------|
|
||||
| `BUG-CP13-8A-ADDR` | malformed/mock replica addresses in some QA allocators | test/adapter bug | narrows affected QA evidence; does not by itself close real workload |
|
||||
| `BUG-CP13-8A-RECV-IDEMP` | repeated assignment delivery restarted replica receiver and hit bind conflict | adapter/runtime bug | blocks weed-VS replica from leaving degraded state until fixed |
|
||||
| `BUG-CP13-8A-PROMOTE-SHIPPER` | post-promote assignment could leave new primary with no replica shipper configured | master/adapter bug | invalidates weed-VS `block_promote` replication claims until rerun |
|
||||
| `CP13-8` | real bounded workload package still needs corrected rerun | blocked by `CP13-8A` issues | blocks real-workload pass claim |
|
||||
|
||||
## Rerun Queue
|
||||
|
||||
| Priority | Item | Why rerun is needed | Exit condition |
|
||||
|----------|------|---------------------|----------------|
|
||||
| `P0` | `CP13-8` bounded real-workload scenario | current pass claim is not yet allowed after `CP13-8A` blockers | bounded rerun passes or fails with attributable remaining cause |
|
||||
| `P0` | weed-VS scenarios using `block_promote` from the recent testrunner enhancement work | prior replication interpretation may have been vacuous (`0` shippers) | affected scenarios are reclassified or rerun |
|
||||
| `P1` | any recent degraded/perf interpretation derived from broken weed-VS promote path | performance interpretation may be based on RF=1 semantics | audit updated and affected numbers rerun or narrowed |
|
||||
|
||||
## Maintenance Rules
|
||||
|
||||
1. do not add a new claim anywhere else without adding or updating the corresponding row here
|
||||
2. when a bug narrows evidence, record the invalidation here in the same change
|
||||
3. when a rerun restores a claim, move the row from `Invalidated Or Narrowed Evidence` to `Allowed Claims` or update its status
|
||||
4. keep this document bounded to the active chosen path; do not turn it into a future roadmap
|
||||
@@ -21,6 +21,10 @@
|
||||
- `V2` 不是一组散乱 patch
|
||||
- 而是在明确边界内逐步建立的协议闭环
|
||||
|
||||
当前 chosen path 下哪些 claim 可以成立、对应 evidence 在哪里、哪些 evidence
|
||||
被收紧或失效,不在本文维护;统一见
|
||||
`v2-protocol-claim-and-evidence.md`。
|
||||
|
||||
这里的“闭环”是有范围的。
|
||||
|
||||
当前默认边界仍然是:
|
||||
|
||||
@@ -18,6 +18,10 @@ So the most important output to carry forward is not only code, but:
|
||||
|
||||
This document is the compact truth table for the V2 line.
|
||||
|
||||
Current chosen-envelope claims, accepted baselines, evidence mappings, and
|
||||
evidence invalidations are tracked separately in
|
||||
`v2-protocol-claim-and-evidence.md`.
|
||||
|
||||
## How To Use It
|
||||
|
||||
For each later phase or slice, ask:
|
||||
|
||||
178
sw-block/design/v2-reuse-replacement-boundary.md
Normal file
178
sw-block/design/v2-reuse-replacement-boundary.md
Normal file
@@ -0,0 +1,178 @@
|
||||
# V2 Reuse vs Replacement Boundary
|
||||
|
||||
Date: 2026-04-03
|
||||
Status: active
|
||||
|
||||
## Purpose
|
||||
|
||||
This note makes one architectural split explicit for the current chosen path:
|
||||
|
||||
1. what we reuse from the existing `blockvol`/`weed` stack as mechanics
|
||||
2. what must be owned by `V2` as semantic authority
|
||||
3. what sits in the adapter boundary between them
|
||||
|
||||
The goal is to stop `V1` mixed control/data state from silently redefining `V2`
|
||||
behavior through convenience wiring.
|
||||
|
||||
Scope is still bounded to:
|
||||
|
||||
1. `RF=2`
|
||||
2. `sync_all`
|
||||
3. current master / volume-server heartbeat path
|
||||
4. `blockvol` as the execution backend
|
||||
|
||||
## Boundary Rule
|
||||
|
||||
`V1` reuse is allowed for execution mechanics.
|
||||
|
||||
`V2` replacement is required for semantic authority.
|
||||
|
||||
If a change decides protocol meaning, failover meaning, durability meaning, or
|
||||
external publication meaning, it belongs to a `V2`-owned layer even if the
|
||||
underlying I/O still runs through reused `blockvol` code.
|
||||
|
||||
This is the practical interpretation of:
|
||||
|
||||
- `v2-protocol-truths.md` `T14`: engine remains recovery authority
|
||||
- `v2-protocol-truths.md` `T15`: reuse reality, not inherited semantics
|
||||
|
||||
## Three Buckets
|
||||
|
||||
### 1. Reusable V1 Core
|
||||
|
||||
These components remain useful as mechanics:
|
||||
|
||||
| Area | Files | What stays reusable |
|
||||
|------|-------|---------------------|
|
||||
| Local storage truth | `weed/storage/blockvol/blockvol.go`, `flusher.go`, `rebuild.go`, WAL/extent helpers | WAL append, flush, checkpoint, dirty-map, extent install |
|
||||
| Replica transport | `weed/storage/blockvol/replica_apply.go`, `wal_shipper.go`, `shipper_group.go`, `dist_group_commit.go`, `repl_proto.go` | TCP receiver/shipper mechanics, barrier transport, replay/apply |
|
||||
| Frontend serving | `weed/storage/blockvol/iscsi/`, `weed/storage/blockvol/nvme/` | block-device serving once a local volume is authoritative |
|
||||
| Local role guardrails | `weed/storage/blockvol/promotion.go`, `role.go` | drain, lease revoke, local role gate enforcement |
|
||||
|
||||
Rule:
|
||||
|
||||
- these layers execute I/O and transport
|
||||
- they do not decide whether a replica is eligible, authoritative, published, or healthy in the `V2` sense
|
||||
|
||||
### 2. Adapter Boundary
|
||||
|
||||
These components translate `V2` truth into concrete runtime wiring:
|
||||
|
||||
| Area | Files | Responsibility |
|
||||
|------|-------|----------------|
|
||||
| Assignment ingest | `weed/server/volume_server_block.go` | authoritative assignment lifecycle for role apply, receiver/shipper wiring, readiness closure |
|
||||
| Heartbeat/runtime loop | `weed/server/block_heartbeat_loop.go` | collect/report status and process assignments through the same lifecycle |
|
||||
| Local store helper | `weed/storage/store_blockvol.go` | local volume open/close/iteration; no longer the authoritative assignment lifecycle |
|
||||
| Bridge | `weed/storage/blockvol/v2bridge/control.go` | convert service/control truth into engine intents |
|
||||
|
||||
Rule:
|
||||
|
||||
- the adapter boundary may reuse `blockvol` primitives
|
||||
- it must name and own lifecycle closure states explicitly
|
||||
- it must not let store-only role application masquerade as ready publication
|
||||
|
||||
### 3. V2-Owned Replacement
|
||||
|
||||
These areas define truth and therefore must remain `V2`-owned:
|
||||
|
||||
| Area | Files | Responsibility |
|
||||
|------|-------|----------------|
|
||||
| Control and identity truth | `sw-block/engine/replication/`, `weed/storage/blockvol/v2bridge/control.go` | assignment truth, stable identity, session truth |
|
||||
| Recovery ownership | `weed/server/block_recovery.go` | live runtime owner for catch-up/rebuild tasks |
|
||||
| Publication and health closure | `weed/server/master_block_registry.go`, `weed/server/master_block_failover.go` | what the system reports as ready, degraded, publishable |
|
||||
| External product surfaces | `weed/server/master_grpc_server_block.go`, `weed/server/master_server_handlers_block.go`, debug/diagnostic surfaces | operator-visible truth, not convenience guesses |
|
||||
|
||||
Rule:
|
||||
|
||||
- if the system exposes a condition to master, tester, CSI, or operator tooling, that condition must come from `V2`-named state
|
||||
|
||||
## Assignment-To-Readiness Lifecycle
|
||||
|
||||
The authoritative lifecycle for the current chosen path is:
|
||||
|
||||
```text
|
||||
assignment delivered
|
||||
-> local role applied
|
||||
-> replica receiver or primary shipper configured
|
||||
-> readiness closed
|
||||
-> heartbeat publication
|
||||
-> master registry health/publication
|
||||
```
|
||||
|
||||
More concretely:
|
||||
|
||||
1. master intent is delivered
|
||||
2. `BlockService.ApplyAssignments()` applies local role truth
|
||||
3. the same path wires receiver/shipper runtime
|
||||
4. the same path records named readiness state
|
||||
5. heartbeat publishes only what is actually publish-healthy
|
||||
6. master registry derives lookup/health from explicit readiness, not from allocation alone
|
||||
|
||||
## Named Readiness States
|
||||
|
||||
For the current implementation slice, the service boundary now names:
|
||||
|
||||
1. `roleApplied`
|
||||
2. `receiverReady`
|
||||
3. `shipperConfigured`
|
||||
4. `shipperConnected`
|
||||
5. `replicaEligible`
|
||||
6. `publishHealthy`
|
||||
|
||||
Ownership:
|
||||
|
||||
- owned by `BlockService` / adapter layer
|
||||
- observed by debug surfaces and heartbeat/publication logic
|
||||
- not delegated to `blockvol` as implicit mixed state
|
||||
|
||||
## Current File Map
|
||||
|
||||
### Reuse
|
||||
|
||||
- `weed/storage/blockvol/blockvol.go`
|
||||
- `weed/storage/blockvol/flusher.go`
|
||||
- `weed/storage/blockvol/replica_apply.go`
|
||||
- `weed/storage/blockvol/wal_shipper.go`
|
||||
- `weed/storage/blockvol/shipper_group.go`
|
||||
- `weed/storage/blockvol/dist_group_commit.go`
|
||||
- `weed/storage/blockvol/iscsi/`
|
||||
- `weed/storage/blockvol/nvme/`
|
||||
|
||||
### Adapter boundary
|
||||
|
||||
- `weed/server/volume_server_block.go`
|
||||
- `weed/server/block_heartbeat_loop.go`
|
||||
- `weed/storage/store_blockvol.go`
|
||||
- `weed/server/volume_server_block_debug.go`
|
||||
|
||||
### V2-owned replacement / truth
|
||||
|
||||
- `weed/storage/blockvol/v2bridge/control.go`
|
||||
- `sw-block/engine/replication/`
|
||||
- `weed/server/block_recovery.go`
|
||||
- `weed/server/master_block_registry.go`
|
||||
- `weed/server/master_block_failover.go`
|
||||
- `weed/server/master_grpc_server_block.go`
|
||||
- `weed/server/master_server_handlers_block.go`
|
||||
|
||||
## Immediate Engineering Rule
|
||||
|
||||
When a new bug appears, classify it first:
|
||||
|
||||
1. `v1 reusable core`: local storage or transport mechanics
|
||||
2. `adapter boundary`: assignment/readiness/publication closure bug
|
||||
3. `v2 replacement`: semantic authority, identity, ownership, eligibility, rebuild, or operator-visible truth
|
||||
|
||||
Do not patch semantic authority directly into `blockvol` unless the same change is
|
||||
also reflected as an explicit `V2` state/rule at the service or registry layer.
|
||||
|
||||
## Why This Matters For CP13-8
|
||||
|
||||
`CP13-8` found the exact class of bug this split is meant to expose:
|
||||
|
||||
- allocation/control truth said the replica existed
|
||||
- but runtime publication/read visibility was not yet closed
|
||||
|
||||
That is not a reason to throw away `blockvol`.
|
||||
It is a reason to stop treating mixed `V1` runtime state as if it were already
|
||||
closed `V2` publication truth.
|
||||
767
sw-block/design/v2_mini_core_design.md
Normal file
767
sw-block/design/v2_mini_core_design.md
Normal file
@@ -0,0 +1,767 @@
|
||||
|
||||
## 1. 设计目标
|
||||
|
||||
目标不是立即重写 `blockvol`,而是建立一个**纯 `V2` 语义核心**,使它成为系统唯一的语义 authority:
|
||||
|
||||
- `V2 core` 负责定义 truth、state、event、decision
|
||||
- `adapter` 负责把外部输入翻译成 `V2` 事件,并把 `V2` 决策翻译成 runtime/backend 调用
|
||||
- `V1 backend` 只保留执行能力,不再解释协议语义
|
||||
|
||||
当前 chosen path 不变:
|
||||
|
||||
- `RF=2`
|
||||
- `sync_all`
|
||||
- 当前 master / volume-server heartbeat path
|
||||
- `blockvol` 作为执行 backend
|
||||
|
||||
这和已有设计是一致的,不是新改向。它只是把 `Phase 1-13` 一直在做的事显式化。
|
||||
|
||||
---
|
||||
|
||||
## 2. 核心分层
|
||||
|
||||
### 2.1 `V2 Core`
|
||||
职责:
|
||||
|
||||
- 持有控制真相
|
||||
- 持有恢复真相
|
||||
- 持有数据边界真相
|
||||
- 持有对外发布真相
|
||||
- 根据事件做决策
|
||||
- 输出 commands / projections
|
||||
|
||||
### 2.2 Adapter Boundary
|
||||
职责:
|
||||
|
||||
- 把 master / heartbeat / runtime observation 翻译成 `V2 event`
|
||||
- 把 `V2 command` 翻译成对 `blockvol` / transport / frontend 的调用
|
||||
- 把 backend/runtime 事实翻译成 `projection`
|
||||
|
||||
### 2.3 `V1 Backend`
|
||||
职责:
|
||||
|
||||
- WAL / extent / dirty map / flusher
|
||||
- receiver / shipper transport
|
||||
- iSCSI / NVMe frontend
|
||||
- rebuild install primitive
|
||||
|
||||
规则:
|
||||
|
||||
- backend 报告事实
|
||||
- core 解释语义
|
||||
- adapter 做翻译
|
||||
- 不允许 `blockvol` 的混合状态直接越权成为系统 truth
|
||||
|
||||
---
|
||||
|
||||
## 3. `V2 Core` 最小对象清单
|
||||
|
||||
下面是“最小可行”的 `struct / event / command / projection` 集合。
|
||||
|
||||
### 3.1 Struct 清单
|
||||
|
||||
#### A. Control structs
|
||||
```go
|
||||
type VolumeIntent struct {
|
||||
VolumeID string
|
||||
Epoch uint64
|
||||
PrimaryID string
|
||||
ReplicaIDs []string
|
||||
DurabilityMode string
|
||||
}
|
||||
|
||||
type ReplicaIdentity struct {
|
||||
ReplicaID string
|
||||
ServerID string
|
||||
}
|
||||
|
||||
type AssignmentView struct {
|
||||
VolumeID string
|
||||
Epoch uint64
|
||||
Role RoleIntent
|
||||
ReplicaEndpoints map[string]Endpoint
|
||||
}
|
||||
```
|
||||
|
||||
职责:
|
||||
|
||||
- 谁是 primary
|
||||
- 谁是 replica
|
||||
- 当前 epoch
|
||||
- stable identity 是谁
|
||||
- assignment intent 到底是什么
|
||||
|
||||
#### B. Recovery structs
|
||||
```go
|
||||
type ReplicaState string
|
||||
const (
|
||||
StateDisconnected ReplicaState = "disconnected"
|
||||
StateConnecting ReplicaState = "connecting"
|
||||
StateCatchingUp ReplicaState = "catching_up"
|
||||
StateInSync ReplicaState = "in_sync"
|
||||
StateDegraded ReplicaState = "degraded"
|
||||
StateNeedsRebuild ReplicaState = "needs_rebuild"
|
||||
)
|
||||
|
||||
type ReplicaSession struct {
|
||||
SessionID string
|
||||
ReplicaID string
|
||||
Epoch uint64
|
||||
Kind SessionKind
|
||||
Active bool
|
||||
Superseded bool
|
||||
}
|
||||
|
||||
type RecoveryOwner struct {
|
||||
ReplicaID string
|
||||
SessionID string
|
||||
Running bool
|
||||
}
|
||||
```
|
||||
|
||||
职责:
|
||||
|
||||
- 当前 replica 的恢复状态是什么
|
||||
- 当前 session 是谁
|
||||
- 谁拥有 recovery authority
|
||||
- 旧 session 是否已失效
|
||||
|
||||
#### C. Data-boundary structs
|
||||
```go
|
||||
type BoundaryView struct {
|
||||
CommittedLSN uint64
|
||||
CheckpointLSN uint64
|
||||
WALHeadLSN uint64
|
||||
ReceivedLSN uint64
|
||||
TargetLSN uint64
|
||||
AchievedLSN uint64
|
||||
SnapshotBaseLSN uint64
|
||||
}
|
||||
```
|
||||
|
||||
职责:
|
||||
|
||||
- durability boundary
|
||||
- catch-up target
|
||||
- rebuild target
|
||||
- stable base image
|
||||
- 实际已达到边界
|
||||
|
||||
#### D. Publication structs
|
||||
```go
|
||||
type ReadinessView struct {
|
||||
RoleApplied bool
|
||||
ReceiverReady bool
|
||||
ShipperConfigured bool
|
||||
ShipperConnected bool
|
||||
ReplicaEligible bool
|
||||
PublishHealthy bool
|
||||
}
|
||||
|
||||
type LookupProjection struct {
|
||||
VolumeID string
|
||||
PrimaryServer string
|
||||
ISCSIAddr string
|
||||
ReplicaReady bool
|
||||
ReplicaDegraded bool
|
||||
}
|
||||
```
|
||||
|
||||
职责:
|
||||
|
||||
- 什么时候可以 publish
|
||||
- lookup/heartbeat 应该看到什么
|
||||
- “存在”与“ready”分离
|
||||
|
||||
---
|
||||
|
||||
## 4. `V2 Core` 最小事件清单
|
||||
|
||||
### 4.1 Control events
|
||||
```go
|
||||
AssignmentDelivered
|
||||
EpochBumped
|
||||
RepeatedAssignmentDelivered
|
||||
IdentityResolved
|
||||
```
|
||||
|
||||
### 4.2 Recovery events
|
||||
```go
|
||||
SessionCreated
|
||||
SessionSuperseded
|
||||
SessionRemoved
|
||||
CatchUpPlanned
|
||||
CatchUpCompleted
|
||||
RebuildStarted
|
||||
RebuildCommitted
|
||||
```
|
||||
|
||||
### 4.3 Runtime observation events
|
||||
```go
|
||||
RoleApplied
|
||||
ReceiverStarted
|
||||
ReceiverReadyObserved
|
||||
ShipperConfiguredObserved
|
||||
ShipperConnectedObserved
|
||||
BarrierAccepted
|
||||
BarrierRejected
|
||||
```
|
||||
|
||||
### 4.4 Data-boundary events
|
||||
```go
|
||||
CommittedLSNAdvanced
|
||||
CheckpointLSNAdvanced
|
||||
ReceivedLSNAdvanced
|
||||
AchievedLSNAdvanced
|
||||
RetentionEscalated
|
||||
```
|
||||
|
||||
### 4.5 Publication events
|
||||
```go
|
||||
HeartbeatCollected
|
||||
PublicationProjected
|
||||
ReplicaMarkedReady
|
||||
ReplicaMarkedDegraded
|
||||
```
|
||||
|
||||
原则:
|
||||
|
||||
- event 是 observation 或 intent
|
||||
- event 不是直接结果承诺
|
||||
- `V2 core` 必须决定 event 的语义意义
|
||||
|
||||
---
|
||||
|
||||
## 5. `V2 Core` 最小 command 清单
|
||||
|
||||
这些 command 是 `V2 core` 输出给 adapter 的,不直接碰 backend。
|
||||
|
||||
```go
|
||||
type Command interface{}
|
||||
|
||||
type ApplyRoleCommand struct {
|
||||
VolumeID string
|
||||
Epoch uint64
|
||||
Role RoleIntent
|
||||
}
|
||||
|
||||
type StartReceiverCommand struct {
|
||||
VolumeID string
|
||||
DataAddr string
|
||||
CtrlAddr string
|
||||
}
|
||||
|
||||
type ConfigureShipperCommand struct {
|
||||
VolumeID string
|
||||
Replicas []ReplicaEndpoint
|
||||
}
|
||||
|
||||
type StartCatchUpCommand struct {
|
||||
ReplicaID string
|
||||
TargetLSN uint64
|
||||
}
|
||||
|
||||
type StartRebuildCommand struct {
|
||||
ReplicaID string
|
||||
RebuildAddr string
|
||||
SnapshotBaseLSN uint64
|
||||
}
|
||||
|
||||
type PublishProjectionCommand struct {
|
||||
VolumeID string
|
||||
Readiness ReadinessView
|
||||
}
|
||||
|
||||
type InvalidateSessionCommand struct {
|
||||
ReplicaID string
|
||||
Reason string
|
||||
}
|
||||
```
|
||||
|
||||
原则:
|
||||
|
||||
- command 只表达“该做什么”
|
||||
- backend 如何做,由 adapter 决定
|
||||
- command 不依赖 `blockvol` 内部字段
|
||||
|
||||
---
|
||||
|
||||
## 6. `V2 Core` 最小 projection 清单
|
||||
|
||||
projection 是给外部世界看的,不是内部原始状态 dump。
|
||||
|
||||
### 6.1 Master / Lookup projection
|
||||
- `PrimaryServer`
|
||||
- `ISCSIAddr`
|
||||
- `ReplicaReady`
|
||||
- `ReplicaDegraded`
|
||||
- `DurabilityMode`
|
||||
|
||||
### 6.2 Heartbeat projection
|
||||
- 当前 role / epoch
|
||||
- boundary fields
|
||||
- readiness fields
|
||||
- transport degraded
|
||||
- receiver published addr
|
||||
|
||||
### 6.3 Diagnostic projection
|
||||
- active recovery tasks
|
||||
- session ownership
|
||||
- publish gating reason
|
||||
- pending rebuild / deferred promotion reason
|
||||
|
||||
### 6.4 Tester projection
|
||||
- `wait_volume_healthy` 不再只看 “replica exists”
|
||||
- 必须看 `replica_ready`
|
||||
- 必须区分:
|
||||
- allocated
|
||||
- role applied
|
||||
- receiver ready
|
||||
- publish healthy
|
||||
|
||||
---
|
||||
|
||||
## 7. 直接映射到仓库路径
|
||||
|
||||
下面按“核心 / adapter / backend”映射。
|
||||
|
||||
### 7.1 `V2 Core` 现有基础
|
||||
这些文件已经在扮演 core 的雏形。
|
||||
|
||||
- `sw-block/engine/replication/registry.go`
|
||||
- `AssignmentIntent`
|
||||
- `AssignmentResult`
|
||||
- `ReplicaAssignment`
|
||||
- `sw-block/engine/replication/session.go`
|
||||
- session ownership / lifecycle
|
||||
- `sw-block/engine/replication/sender.go`
|
||||
- sender state abstraction
|
||||
- `sw-block/engine/replication/orchestrator.go`
|
||||
- assignment -> session/recovery orchestration
|
||||
- `sw-block/engine/replication/types.go`
|
||||
- `sw-block/engine/replication/outcome.go`
|
||||
- `sw-block/engine/replication/observe.go`
|
||||
- `weed/server/block_recovery.go`
|
||||
- runtime owner
|
||||
- `weed/server/master_block_registry.go`
|
||||
- publication truth / cluster registry truth
|
||||
- `weed/server/master_block_failover.go`
|
||||
- failover/rebuild truth
|
||||
|
||||
### 7.2 Adapter Boundary 现有基础
|
||||
- `weed/storage/blockvol/v2bridge/control.go`
|
||||
- assignment -> engine intent
|
||||
- `weed/server/volume_server_block.go`
|
||||
- assignment lifecycle / readiness closure / wiring
|
||||
- `weed/server/block_heartbeat_loop.go`
|
||||
- heartbeat + assignment loop
|
||||
- `weed/server/volume_server_block_debug.go`
|
||||
- readiness/debug projection
|
||||
- `weed/server/master_grpc_server_block.go`
|
||||
- lookup projection
|
||||
- `weed/server/master_server_handlers_block.go`
|
||||
- REST projection
|
||||
|
||||
### 7.3 `V1 Backend` 现有基础
|
||||
- `weed/storage/blockvol/blockvol.go`
|
||||
- `weed/storage/blockvol/flusher.go`
|
||||
- `weed/storage/blockvol/replica_apply.go`
|
||||
- `weed/storage/blockvol/wal_shipper.go`
|
||||
- `weed/storage/blockvol/shipper_group.go`
|
||||
- `weed/storage/blockvol/dist_group_commit.go`
|
||||
- `weed/storage/blockvol/rebuild.go`
|
||||
- `weed/storage/blockvol/iscsi/`
|
||||
- `weed/storage/blockvol/nvme/`
|
||||
|
||||
---
|
||||
|
||||
## 8. 近期可做
|
||||
|
||||
这里说的是“现在应该做”的,不是中长期重构幻想。
|
||||
|
||||
### 8.1 让 `V2 core` 成为唯一 assignment 语义入口
|
||||
目标:
|
||||
|
||||
- 所有 assignment 先进入 `V2 core`
|
||||
- `BlockService` 不再自己解释太多语义
|
||||
- `BlockService` 更像 command executor
|
||||
|
||||
直接涉及文件:
|
||||
|
||||
- `sw-block/engine/replication/orchestrator.go`
|
||||
- `weed/storage/blockvol/v2bridge/control.go`
|
||||
- `weed/server/volume_server_block.go`
|
||||
|
||||
### 8.2 把 readiness 做成正式 projection,不只是 service 内部状态
|
||||
目标:
|
||||
|
||||
- `roleApplied`
|
||||
- `receiverReady`
|
||||
- `shipperConfigured`
|
||||
- `shipperConnected`
|
||||
- `replicaEligible`
|
||||
- `publishHealthy`
|
||||
|
||||
这些状态进入稳定 projection,而不是只在 debug 内可见。
|
||||
|
||||
直接涉及文件:
|
||||
|
||||
- `weed/server/volume_server_block.go`
|
||||
- `weed/server/master_block_registry.go`
|
||||
- `weed/server/master_grpc_server_block.go`
|
||||
- `weed/server/master_server_handlers_block.go`
|
||||
|
||||
### 8.3 把 `wait_volume_healthy`、lookup、heartbeat 都统一到同一个 readiness 定义
|
||||
目标:
|
||||
|
||||
- 不再出现:
|
||||
- registry says healthy
|
||||
- VS side not ready
|
||||
- tester 误判 ready
|
||||
|
||||
直接涉及文件:
|
||||
|
||||
- `weed/storage/blockvol/testrunner/actions/devops.go`
|
||||
- `weed/server/master_block_registry.go`
|
||||
- `weed/server/volume_grpc_client_to_master.go`
|
||||
|
||||
### 8.4 把 `CP13-8` 用作 adapter/publication closure 的真实验证
|
||||
目标:
|
||||
|
||||
- 确认当前 failure 到底是:
|
||||
- backend data bug
|
||||
- adapter timing/publication bug
|
||||
- core rule gap
|
||||
|
||||
这一步是近期必须做的,因为它是 live contradiction。
|
||||
|
||||
---
|
||||
|
||||
## 9. 中期演进
|
||||
|
||||
### 9.1 把 `V2 core` 真正做成 command/event 模式
|
||||
目标:
|
||||
|
||||
- engine 输出 command
|
||||
- adapter 执行 command
|
||||
- runtime 返回 event
|
||||
- core 更新 state
|
||||
|
||||
这会比现在 “ProcessAssignments 里又做判断又做执行” 更干净。
|
||||
|
||||
### 9.2 把 `master_block_registry` 从“半业务逻辑半存储”收敛成 projection store
|
||||
目标:
|
||||
|
||||
- registry 不负责猜测 semantics
|
||||
- registry 存放 `V2 projection`
|
||||
- 真正的语义判断放在 core
|
||||
|
||||
### 9.3 把 backend 接口化
|
||||
候选接口:
|
||||
|
||||
```go
|
||||
type StorageBackend interface {
|
||||
StatusSnapshot() BoundaryView
|
||||
SetRetentionFloor(...)
|
||||
}
|
||||
|
||||
type TransportBackend interface {
|
||||
StartReceiver(...)
|
||||
ConfigureReplicas(...)
|
||||
ShipperStates() ...
|
||||
}
|
||||
|
||||
type RebuildBackend interface {
|
||||
StartRebuild(...)
|
||||
InstallSnapshot(...)
|
||||
}
|
||||
|
||||
type FrontendBackend interface {
|
||||
PublishISCSI(...)
|
||||
PublishNVMe(...)
|
||||
}
|
||||
```
|
||||
|
||||
### 9.4 让 `blockvol` 逐渐退化成 pure backend
|
||||
目标:
|
||||
|
||||
- `blockvol` 不再持有系统级语义 authority
|
||||
- 它保留 local storage truth 和执行能力
|
||||
- 语义解释都上提到 `V2 core`
|
||||
|
||||
---
|
||||
|
||||
## 10. 如何保持前面建立的约束和 envelope
|
||||
|
||||
这部分最重要。
|
||||
|
||||
### 10.1 已接受约束必须升格为 core invariants
|
||||
前面 `CP13-1..7` 不能只留在测试里,必须进入 core 规则:
|
||||
|
||||
- canonical identity
|
||||
- durable progress truth
|
||||
- only eligible replicas count
|
||||
- reconnect handshake
|
||||
- retention fail-closed
|
||||
- rebuild fallback
|
||||
|
||||
### 10.2 envelope 不允许被 core 重构顺手扩大
|
||||
继续固定:
|
||||
|
||||
- `RF=2`
|
||||
- `sync_all`
|
||||
- 当前 heartbeat/gRPC path
|
||||
- `blockvol` backend
|
||||
|
||||
不要因为做架构分层,就顺手放大:
|
||||
- `RF>2`
|
||||
- broader transport matrix
|
||||
- broader rollout claim
|
||||
|
||||
### 10.3 每一步都做双重验收
|
||||
每个演进 slice 必须同时证明:
|
||||
|
||||
- 旧约束仍成立
|
||||
- 新边界更显式、更少混态
|
||||
|
||||
### 10.4 禁止“抽象重构先降低 bar”
|
||||
不能接受:
|
||||
- 为了重构,暂时弱化 fail-closed
|
||||
- 为了抽接口,暂时模糊 readiness
|
||||
- 为了分层,暂时把 publish truth 放宽
|
||||
|
||||
---
|
||||
|
||||
## 11. V2 如何从已接受 claim 变得可靠
|
||||
|
||||
`V2 core` 的可靠性不是来自“状态更多”或“架构更复杂”。
|
||||
它的可靠性来自两个更严格的来源:
|
||||
|
||||
1. 只消费已经进入 claim/evidence ledger 的 accepted constraints
|
||||
2. 只复用 `V1` 中已知可靠的实现行为,而不继承其隐含语义
|
||||
|
||||
### 11.1 `V2` 不是重新发明 truth,而是消费已接受 truth
|
||||
|
||||
`V2 core` 不应该把当前 runtime 中“看起来通常有效”的行为直接提升为语义 truth。
|
||||
|
||||
它只能建立在已经被 claim / evidence 支撑的约束上,例如:
|
||||
|
||||
- canonical identity
|
||||
- durable progress authority = `replicaFlushedLSN`
|
||||
- only eligible replica may satisfy sync durability
|
||||
- reconnect must use explicit handshake / catch-up
|
||||
- retention must fail closed
|
||||
- unrecoverable gap must escalate to `NeedsRebuild`
|
||||
|
||||
这些约束不是设计说明的附属物,而应该是 `V2 core` 的输入边界。
|
||||
|
||||
换句话说:
|
||||
|
||||
- 能进入 core 的,只能是已经被 ledger 接受的 truth
|
||||
- 没有进入 ledger 的运行假设,不能直接成为 core 依赖
|
||||
|
||||
### 11.2 claim 是 core 的输入约束,不只是 review 文档
|
||||
|
||||
`V2 core` 的一个基本规则是:
|
||||
|
||||
> 任何没有被 claim/evidence 接受的行为,只能作为 observation,不能作为 authority。
|
||||
|
||||
例如,下面这些不能直接进入 `V2` truth:
|
||||
|
||||
- “第一次写通常会触发 shipper 连接”
|
||||
- “promote 之后 replication 大概会自己恢复”
|
||||
- “`degraded=false` 大概就表示 ready”
|
||||
- “有 published addr 就说明 replica 可用”
|
||||
|
||||
这些都只能当作 runtime observation,必须再经过 `V2` 的 readiness / eligibility / publication 规则过滤。
|
||||
|
||||
### 11.3 `V2` 的可靠性来自 fail-closed,而不是隐式收敛
|
||||
|
||||
`V1` 常见的问题不是“完全不能工作”,而是很多语义靠时序和重试隐式收敛:
|
||||
|
||||
- assignment delivered 之后,何时真正 ready
|
||||
- promote 之后,何时真正恢复 replication
|
||||
- `sync_all` 何时真的表示 cross-node durability
|
||||
|
||||
`V2 core` 必须拒绝从这些模糊状态直接宣布 success。
|
||||
|
||||
它应该采用更硬的规则:
|
||||
|
||||
- 不满足 accepted claim 的条件时,保持 not-ready / degraded / blocked
|
||||
- 不允许从 convenience state 猜测 publish healthy
|
||||
- 不允许用工作负载本身去“顺便推动系统进入正确状态”
|
||||
|
||||
一句话:
|
||||
|
||||
- `V2` 的可靠性来自明确边界 + fail-closed
|
||||
- 不来自“系统大概率最终会自己好起来”
|
||||
|
||||
### 11.4 `V2 core` 使用哪些 accepted claim
|
||||
|
||||
| Claim / Constraint | 用于 core 的哪里 | 作用 |
|
||||
|---|---|---|
|
||||
| canonical identity | control truth | 不再从地址猜身份 |
|
||||
| durable progress = `replicaFlushedLSN` | boundary truth | 不再从 success/ack 猜 durability |
|
||||
| eligible-only barrier | readiness / publication | 不让非闭环 replica 参与 durability |
|
||||
| reconnect handshake | recovery truth | 不再靠第一次写触发隐式恢复 |
|
||||
| retention fail-closed | recovery truth | 不让 lagging replica 以模糊状态长期存在 |
|
||||
| rebuild fallback | fail-closed policy | gap 不再长期悬挂在 degraded |
|
||||
|
||||
这些 claim 越清楚,`V2 core` 就越可靠。
|
||||
|
||||
---
|
||||
|
||||
## 12. `V2` 复用哪些 `V1` 可靠行为
|
||||
|
||||
`V2` 不是完全抛弃 `V1`。
|
||||
它复用的是 `V1` 中已经被证明是局部可靠、实现性稳定的行为。
|
||||
|
||||
但必须明确区分:
|
||||
|
||||
- **可复用的可靠行为**
|
||||
- **不应继续复用的旧语义**
|
||||
|
||||
### 12.1 可复用的可靠行为
|
||||
|
||||
这些行为可以继续作为 backend primitive 使用:
|
||||
|
||||
| `V1` behavior | 为什么可复用 | 为什么不构成语义 authority |
|
||||
|---|---|---|
|
||||
| WAL append / read | 局部实现、可验证 | 不决定外部 durability meaning |
|
||||
| flusher / checkpoint | 局部物化机制 | 不决定 cluster-level readiness |
|
||||
| dirty-map local read/write | 局部一致性行为 | 不决定 publication truth |
|
||||
| receiver transport | 纯执行路径 | 不决定 session authority |
|
||||
| shipper transport | 纯传输机制 | 不决定 eligibility / publish truth |
|
||||
| rebuild installer / extent install | 局部 install primitive | 不决定 rebuild policy |
|
||||
| iSCSI / NVMe serving | frontend primitive | 不决定 replicated visibility truth |
|
||||
|
||||
这些行为的共同特征是:
|
||||
|
||||
1. 局部
|
||||
2. 可测试
|
||||
3. 不依赖 cluster-level 推断
|
||||
4. 不应该自己解释系统语义
|
||||
|
||||
### 12.2 不继续复用的 `V1` 语义
|
||||
|
||||
下面这些即使在 `V1` 中曾经“工作过”,也不应该进入 `V2` truth:
|
||||
|
||||
- ready from existence
|
||||
- healthy from non-empty publication
|
||||
- `sync_all` from vacuous barrier success
|
||||
- promote implies replication closure
|
||||
- assignment arrival implies runtime closure
|
||||
- first write implicitly fixes transport state
|
||||
|
||||
`V2` 可以复用 `V1` 的动作,但不能继承这些旧语义。
|
||||
|
||||
### 12.3 `weed/` 中当前改动的地位
|
||||
|
||||
当前 branch 中 `weed/` 的很多改动更接近:
|
||||
|
||||
- 现象验证
|
||||
- integration closure
|
||||
- debug/diagnostic surfaces
|
||||
- 暂时性 runtime fix
|
||||
|
||||
它们的价值在于暴露现实、定位问题、验证边界。
|
||||
|
||||
但长期语义 authority 不应该放在这些改动本身上。
|
||||
|
||||
长期可保留的,应当是那些已经被 `V2` 吸收为:
|
||||
|
||||
- backend primitive
|
||||
- adapter boundary
|
||||
- projection surface
|
||||
|
||||
的部分。
|
||||
|
||||
---
|
||||
|
||||
## 13. 长期资产 vs 当前实现现实
|
||||
|
||||
当前项目中最重要的区分不是“哪个文件在跑”,而是“哪个资产值得长期保留”。
|
||||
|
||||
### 13.1 长期资产
|
||||
|
||||
长期需要保留和演进的是 `sw-block/` 中的 `V2` 语义资产:
|
||||
|
||||
- `v2-protocol-truths.md`
|
||||
- `v2-protocol-closure-map.zh.md`
|
||||
- `v2-protocol-claim-and-evidence.md`
|
||||
- `v2-reuse-replacement-boundary.md`
|
||||
- `v2_mini_core_design.md`
|
||||
- `sw-block/engine/replication/` 中逐步成形的 core semantics
|
||||
|
||||
这些资产定义的是:
|
||||
|
||||
- truth
|
||||
- claim
|
||||
- closure
|
||||
- reliability model
|
||||
- reusable semantics
|
||||
|
||||
### 13.2 当前实现现实
|
||||
|
||||
`weed/` 中的当前改动更多代表:
|
||||
|
||||
- 当前 chosen path 的运行现实
|
||||
- backend / adapter / publication 的实现尝试
|
||||
- 用来暴露矛盾和验证边界的现实载体
|
||||
|
||||
因此,它们不应该被自动视为长期保留资产。
|
||||
|
||||
更准确的原则是:
|
||||
|
||||
- `weed/` 中的改动,只有在被 `V2` 语义明确吸收之后,才应该作为长期实现保留
|
||||
- 否则,它们可以只是阶段性的验证资产
|
||||
|
||||
### 13.3 一个总规则
|
||||
|
||||
> `V2 core` 的可靠性不是来自信任当前 `weed/` 分支实现。
|
||||
> 它来自只消费被 claim/evidence ledger 接受的约束,并只复用 `V1` 中已知可靠的实现行为。
|
||||
> 因此,`weed/` 中的当前改动可以是临时验证资产,而 `sw-block/` 中的 truth / claim / core design 才是长期保留资产。
|
||||
|
||||
这个规则的好处是:
|
||||
|
||||
1. 不会因为当前 integration patch 看起来有效,就把它误当成长期语义
|
||||
2. 不会因为 `V1` 仍被复用,就把旧混态继续当作 authority
|
||||
3. 后续收敛分支时,可以明确区分:
|
||||
- 哪些东西应进入长期 `V2` 资产
|
||||
- 哪些东西只是当前实现现实
|
||||
|
||||
---
|
||||
|
||||
## 14. 推荐的实施顺序
|
||||
|
||||
### 近期
|
||||
1. 继续收紧 assignment -> readiness -> publication closure
|
||||
2. 用 `CP13-8` 证明当前 split 能否识别真实 bug 类型
|
||||
3. 把 readiness / projection 固化为稳定 surface
|
||||
|
||||
### 中期
|
||||
1. 引入 command/event 风格的 `V2 core`
|
||||
2. 减少 `BlockService` 中的语义判断
|
||||
3. 把 registry 收敛为 projection store
|
||||
4. 抽 backend interface
|
||||
|
||||
### 更后面
|
||||
1. 评估是否需要物理独立的 `V2 core process`
|
||||
2. 如果需要,那是因为逻辑已经独立,不是为了“好看”
|
||||
|
||||
---
|
||||
|
||||
## 15. 最短结论
|
||||
|
||||
你要的“更工程化”版本可以归纳为一句话:
|
||||
|
||||
- `V2 core` 负责定义 truth、state、event、command、projection
|
||||
- `adapter` 负责隔离 `V1` 污染并翻译输入输出
|
||||
- `V1 backend` 负责 WAL / transport / frontend / rebuild 执行
|
||||
- 后续 phase 的方向不是换路线,而是把这件事一步步显式化并固化成正式结构
|
||||
|
||||
如果你愿意,下一步我可以继续给你一版更像真正设计文档里的内容:
|
||||
|
||||
- `V2 core` 的 Go package 目录建议
|
||||
- 每个 struct/event/command 放在哪个文件
|
||||
- 一个最小 `ApplyEvent() -> Decide() -> EmitCommands()` 伪代码骨架
|
||||
@@ -93,7 +93,7 @@ func (c *BlockVolumeHeartbeatCollector) Run() {
|
||||
select {
|
||||
case <-ticker.C:
|
||||
// Outbound: collect and report status.
|
||||
msgs := c.blockService.Store().CollectBlockVolumeHeartbeat()
|
||||
msgs := c.blockService.CollectBlockVolumeHeartbeat()
|
||||
c.safeCallback(msgs)
|
||||
// Inbound: process any pending assignments.
|
||||
c.processAssignments()
|
||||
@@ -115,7 +115,7 @@ func (c *BlockVolumeHeartbeatCollector) processAssignments() {
|
||||
if len(assignments) == 0 {
|
||||
return
|
||||
}
|
||||
errs := c.blockService.Store().ProcessBlockVolumeAssignments(assignments)
|
||||
errs := c.blockService.ApplyAssignments(assignments)
|
||||
c.cbMu.Lock()
|
||||
cb := c.assignmentCallback
|
||||
c.cbMu.Unlock()
|
||||
|
||||
@@ -463,6 +463,42 @@ func TestBlockAssign_NilSource(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
// TestBlockAssign_CollectorUsesAuthoritativeLifecycle verifies the heartbeat
|
||||
// collector now drives the full BlockService assignment path, not the store-only
|
||||
// role path. A replica assignment must start the receiver and close publish
|
||||
// readiness.
|
||||
func TestBlockAssign_CollectorUsesAuthoritativeLifecycle(t *testing.T) {
|
||||
bs := newTestBlockService(t)
|
||||
path := testBlockVolPath(t, bs)
|
||||
|
||||
collector := NewBlockVolumeHeartbeatCollector(bs, 5*time.Millisecond)
|
||||
collector.SetAssignmentSource(func() []blockvol.BlockVolumeAssignment {
|
||||
return []blockvol.BlockVolumeAssignment{{
|
||||
Path: path,
|
||||
Epoch: 1,
|
||||
Role: uint32(blockvol.RoleReplica),
|
||||
ReplicaDataAddr: ":0",
|
||||
ReplicaCtrlAddr: ":0",
|
||||
}}
|
||||
})
|
||||
go collector.Run()
|
||||
defer collector.Stop()
|
||||
|
||||
deadline := time.After(500 * time.Millisecond)
|
||||
for {
|
||||
dataAddr, ctrlAddr := bs.GetReplState(path)
|
||||
readiness := bs.ReadinessSnapshot(path)
|
||||
if dataAddr != "" && ctrlAddr != "" && readiness.ReceiverReady && readiness.PublishHealthy {
|
||||
return
|
||||
}
|
||||
select {
|
||||
case <-deadline:
|
||||
t.Fatalf("collector did not start replica receiver: data=%q ctrl=%q readiness=%+v", dataAddr, ctrlAddr, readiness)
|
||||
case <-time.After(10 * time.Millisecond):
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// TestBlockAssign_MixedBatch verifies a batch with 1 success, 1 unknown volume,
|
||||
// and 1 invalid transition returns parallel errors correctly.
|
||||
func TestBlockAssign_MixedBatch(t *testing.T) {
|
||||
|
||||
@@ -148,7 +148,8 @@ func TestClusterHealthSummary(t *testing.T) {
|
||||
Path: "/data/healthy.blk",
|
||||
Role: blockvol.RoleToWire(blockvol.RolePrimary),
|
||||
ReplicaFactor: 2,
|
||||
Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica)}},
|
||||
ReplicaReady: true,
|
||||
Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica), Ready: true}},
|
||||
Status: StatusActive,
|
||||
})
|
||||
|
||||
@@ -188,7 +189,8 @@ func TestBlockStatusHandler_IncludesHealthCounts(t *testing.T) {
|
||||
Path: "/data/status.blk",
|
||||
Role: blockvol.RoleToWire(blockvol.RolePrimary),
|
||||
ReplicaFactor: 2,
|
||||
Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica)}},
|
||||
ReplicaReady: true,
|
||||
Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica), Ready: true}},
|
||||
Status: StatusActive,
|
||||
})
|
||||
|
||||
|
||||
@@ -1965,3 +1965,45 @@ func TestRegistry_InflightBlocksAutoRegister(t *testing.T) {
|
||||
t.Fatalf("replica health not updated after inflight released: %f", entry.Replicas[0].HealthScore)
|
||||
}
|
||||
}
|
||||
|
||||
func TestRegistry_ReplicaReadyRequiresReplicaHeartbeat(t *testing.T) {
|
||||
r := NewBlockVolumeRegistry()
|
||||
if err := r.Register(&BlockVolumeEntry{
|
||||
Name: "vol-ready",
|
||||
VolumeServer: "primary-server:8080",
|
||||
Path: "/blocks/vol-ready.blk",
|
||||
Status: StatusActive,
|
||||
Replicas: []ReplicaInfo{{
|
||||
Server: "replica-server:8080",
|
||||
Path: "/blocks/vol-ready.blk",
|
||||
}},
|
||||
}); err != nil {
|
||||
t.Fatalf("register: %v", err)
|
||||
}
|
||||
|
||||
entry, _ := r.Lookup("vol-ready")
|
||||
if entry.ReplicaReady {
|
||||
t.Fatal("replica should not be ready before replica heartbeat confirms publication")
|
||||
}
|
||||
if !entry.ReplicaDegraded {
|
||||
t.Fatal("volume should remain degraded until replica readiness closes")
|
||||
}
|
||||
|
||||
r.UpdateFullHeartbeat("replica-server:8080", []*master_pb.BlockVolumeInfoMessage{{
|
||||
Path: "/blocks/vol-ready.blk",
|
||||
Epoch: 1,
|
||||
Role: uint32(blockvol.RoleReplica),
|
||||
VolumeSize: 1 << 30,
|
||||
HealthScore: 0.9,
|
||||
ReplicaDataAddr: "10.0.0.2:14260",
|
||||
ReplicaCtrlAddr: "10.0.0.2:14261",
|
||||
}}, "")
|
||||
|
||||
entry, _ = r.Lookup("vol-ready")
|
||||
if !entry.Replicas[0].Ready {
|
||||
t.Fatal("replica heartbeat with published receiver addresses should mark replica ready")
|
||||
}
|
||||
if !entry.ReplicaReady {
|
||||
t.Fatal("aggregate replica readiness should become true after replica heartbeat")
|
||||
}
|
||||
}
|
||||
|
||||
@@ -394,6 +394,7 @@ func entryToVolumeInfo(e *BlockVolumeEntry, primaryAlive bool) blockapi.VolumeIn
|
||||
ReplicaDataAddr: e.ReplicaDataAddr,
|
||||
ReplicaCtrlAddr: e.ReplicaCtrlAddr,
|
||||
ReplicaFactor: rf,
|
||||
ReplicaReady: e.ReplicaReady,
|
||||
HealthScore: e.HealthScore,
|
||||
ReplicaDegraded: e.ReplicaDegraded,
|
||||
DurabilityMode: durMode,
|
||||
@@ -407,6 +408,7 @@ func entryToVolumeInfo(e *BlockVolumeEntry, primaryAlive bool) blockapi.VolumeIn
|
||||
Server: ri.Server,
|
||||
ISCSIAddr: ri.ISCSIAddr,
|
||||
IQN: ri.IQN,
|
||||
Ready: ri.Ready,
|
||||
HealthScore: ri.HealthScore,
|
||||
WALLag: ri.WALLag,
|
||||
})
|
||||
|
||||
@@ -24,6 +24,23 @@ type volReplState struct {
|
||||
replicaCtrlAddr string
|
||||
// allReplicas stores the full replica set for multi-replica idempotence.
|
||||
allReplicas []blockvol.ReplicaAddr
|
||||
roleApplied bool
|
||||
receiverReady bool
|
||||
shipperConfigured bool
|
||||
replicaEligible bool
|
||||
publishHealthy bool
|
||||
}
|
||||
|
||||
// BlockReadinessSnapshot names the assignment-to-publication closure at the
|
||||
// BlockService boundary. These flags are owned by the service/adapter layer,
|
||||
// not by blockvol's local storage mechanics.
|
||||
type BlockReadinessSnapshot struct {
|
||||
RoleApplied bool
|
||||
ReceiverReady bool
|
||||
ShipperConfigured bool
|
||||
ShipperConnected bool
|
||||
ReplicaEligible bool
|
||||
PublishHealthy bool
|
||||
}
|
||||
|
||||
// NVMeConfig holds NVMe/TCP target configuration passed from CLI flags.
|
||||
@@ -373,6 +390,15 @@ func (bs *BlockService) DeleteBlockVol(name string) error {
|
||||
// ProcessAssignments applies assignments from master, including replication setup.
|
||||
// V2 bridge: also delivers each assignment to the V2 engine for recovery ownership.
|
||||
func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAssignment) {
|
||||
_ = bs.ApplyAssignments(assignments)
|
||||
}
|
||||
|
||||
// ApplyAssignments applies assignments through the single authoritative
|
||||
// BlockService lifecycle: role apply, replication wiring, and publication
|
||||
// readiness bookkeeping. Returns per-assignment errors parallel to the input.
|
||||
func (bs *BlockService) ApplyAssignments(assignments []blockvol.BlockVolumeAssignment) []error {
|
||||
errs := make([]error, len(assignments))
|
||||
|
||||
// V2 bridge: convert and deliver to engine orchestrator (Phase 08 P1).
|
||||
// P3: skip V2 processing for repeated unchanged assignments.
|
||||
// P4: RecoveryManager starts/cancels recovery goroutines based on results.
|
||||
@@ -400,9 +426,9 @@ func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAss
|
||||
|
||||
// V1 processing (requires blockStore).
|
||||
if bs.blockStore == nil {
|
||||
return
|
||||
return errs
|
||||
}
|
||||
for _, a := range assignments {
|
||||
for i, a := range assignments {
|
||||
role := blockvol.RoleFromWire(a.Role)
|
||||
ttl := blockvol.LeaseTTLFromWire(a.LeaseTtlMs)
|
||||
|
||||
@@ -410,22 +436,30 @@ func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAss
|
||||
if err := bs.blockStore.WithVolume(a.Path, func(vol *blockvol.BlockVol) error {
|
||||
return vol.HandleAssignment(a.Epoch, role, ttl)
|
||||
}); err != nil {
|
||||
errs[i] = err
|
||||
glog.Warningf("block service: assignment %s epoch=%d role=%s: %v", a.Path, a.Epoch, role, err)
|
||||
continue
|
||||
}
|
||||
bs.noteRoleApplied(a.Path, role)
|
||||
|
||||
// 2. Replication setup based on role + addresses.
|
||||
switch role {
|
||||
case blockvol.RolePrimary:
|
||||
// CP8-2: ReplicaAddrs (multi-replica) takes precedence over scalar fields.
|
||||
if len(a.ReplicaAddrs) > 0 {
|
||||
bs.setupPrimaryReplicationMulti(a.Path, a.ReplicaAddrs)
|
||||
if err := bs.setupPrimaryReplicationMulti(a.Path, a.ReplicaAddrs); err != nil {
|
||||
errs[i] = err
|
||||
}
|
||||
} else if a.ReplicaDataAddr != "" && a.ReplicaCtrlAddr != "" {
|
||||
bs.setupPrimaryReplication(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr)
|
||||
if err := bs.setupPrimaryReplication(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr); err != nil {
|
||||
errs[i] = err
|
||||
}
|
||||
}
|
||||
case blockvol.RoleReplica:
|
||||
if a.ReplicaDataAddr != "" && a.ReplicaCtrlAddr != "" {
|
||||
bs.setupReplicaReceiver(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr)
|
||||
if err := bs.setupReplicaReceiver(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr); err != nil {
|
||||
errs[i] = err
|
||||
}
|
||||
}
|
||||
case blockvol.RoleRebuilding:
|
||||
if a.RebuildAddr != "" {
|
||||
@@ -433,18 +467,23 @@ func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAss
|
||||
}
|
||||
}
|
||||
}
|
||||
return errs
|
||||
}
|
||||
|
||||
// setupPrimaryReplication configures WAL shipping from primary to replica
|
||||
// and starts the rebuild server (R1-2).
|
||||
func (bs *BlockService) setupPrimaryReplication(path, replicaDataAddr, replicaCtrlAddr string) {
|
||||
func (bs *BlockService) setupPrimaryReplication(path, replicaDataAddr, replicaCtrlAddr string) error {
|
||||
// P3 idempotence: skip if replica state is unchanged.
|
||||
bs.replMu.RLock()
|
||||
existing := bs.replStates[path]
|
||||
bs.replMu.RUnlock()
|
||||
if existing != nil && existing.replicaDataAddr == replicaDataAddr && existing.replicaCtrlAddr == replicaCtrlAddr {
|
||||
// Unchanged repeated assignment — idempotent, no side effects.
|
||||
return
|
||||
bs.markPrimaryTransportConfigured(path, []blockvol.ReplicaAddr{{
|
||||
DataAddr: replicaDataAddr,
|
||||
CtrlAddr: replicaCtrlAddr,
|
||||
}})
|
||||
return nil
|
||||
}
|
||||
|
||||
// Compute deterministic rebuild listen address.
|
||||
@@ -465,27 +504,19 @@ func (bs *BlockService) setupPrimaryReplication(path, replicaDataAddr, replicaCt
|
||||
return nil
|
||||
}); err != nil {
|
||||
glog.Warningf("block service: setup primary replication %s: %v", path, err)
|
||||
return
|
||||
return err
|
||||
}
|
||||
// Track replication state for heartbeat reporting (R1-4).
|
||||
// These addresses are what the primary ships to — they come from the
|
||||
// master's assignment. They should already be canonical (from
|
||||
// AllocateBlockVolumeResponse), but if not, they'll be reported as-is.
|
||||
bs.replMu.Lock()
|
||||
if bs.replStates == nil {
|
||||
bs.replStates = make(map[string]*volReplState)
|
||||
}
|
||||
bs.replStates[path] = &volReplState{
|
||||
replicaDataAddr: replicaDataAddr,
|
||||
replicaCtrlAddr: replicaCtrlAddr,
|
||||
}
|
||||
bs.replMu.Unlock()
|
||||
bs.markPrimaryTransportConfigured(path, []blockvol.ReplicaAddr{{
|
||||
DataAddr: replicaDataAddr,
|
||||
CtrlAddr: replicaCtrlAddr,
|
||||
}})
|
||||
glog.V(0).Infof("block service: primary %s shipping WAL to %s/%s (rebuild=%s)", path, replicaDataAddr, replicaCtrlAddr, rebuildAddr)
|
||||
return nil
|
||||
}
|
||||
|
||||
// setupPrimaryReplicationMulti configures WAL shipping from primary to N replicas
|
||||
// using SetReplicaAddrs (CP8-2: multi-replica support).
|
||||
func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockvol.ReplicaAddr) {
|
||||
func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockvol.ReplicaAddr) error {
|
||||
// P3 idempotence: skip if ALL replica addresses unchanged.
|
||||
// Compare full replica set, not just the first entry.
|
||||
if len(addrs) > 0 {
|
||||
@@ -493,7 +524,8 @@ func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockv
|
||||
existing := bs.replStates[path]
|
||||
bs.replMu.RUnlock()
|
||||
if existing != nil && bs.multiReplicaUnchanged(path, addrs) {
|
||||
return
|
||||
bs.markPrimaryTransportConfigured(path, addrs)
|
||||
return nil
|
||||
}
|
||||
}
|
||||
|
||||
@@ -513,30 +545,15 @@ func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockv
|
||||
return nil
|
||||
}); err != nil {
|
||||
glog.Warningf("block service: setup primary replication (multi) %s: %v", path, err)
|
||||
return
|
||||
return err
|
||||
}
|
||||
// Track replication state for heartbeat reporting.
|
||||
bs.replMu.Lock()
|
||||
if bs.replStates == nil {
|
||||
bs.replStates = make(map[string]*volReplState)
|
||||
}
|
||||
// Store full replica set + first replica for backward compat heartbeat.
|
||||
if len(addrs) > 0 {
|
||||
// Copy the addrs slice to avoid aliasing.
|
||||
copied := make([]blockvol.ReplicaAddr, len(addrs))
|
||||
copy(copied, addrs)
|
||||
bs.replStates[path] = &volReplState{
|
||||
replicaDataAddr: addrs[0].DataAddr,
|
||||
replicaCtrlAddr: addrs[0].CtrlAddr,
|
||||
allReplicas: copied,
|
||||
}
|
||||
}
|
||||
bs.replMu.Unlock()
|
||||
bs.markPrimaryTransportConfigured(path, addrs)
|
||||
glog.V(0).Infof("block service: primary %s shipping WAL to %d replicas (rebuild=%s)", path, len(addrs), rebuildAddr)
|
||||
return nil
|
||||
}
|
||||
|
||||
// setupReplicaReceiver starts the replica WAL receiver.
|
||||
func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) {
|
||||
func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) error {
|
||||
// CP13-2: Pass the routable advertisedIP (from -ip flag, NOT from -id/serverID)
|
||||
// so wildcard-bind listeners resolve to a real IP, not an opaque identity string.
|
||||
var canonDataAddr, canonCtrlAddr string
|
||||
@@ -559,7 +576,7 @@ func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) {
|
||||
return nil
|
||||
}); err != nil {
|
||||
glog.Warningf("block service: setup replica receiver %s: %v", path, err)
|
||||
return
|
||||
return err
|
||||
}
|
||||
// Fallback to assignment addresses if receiver didn't report.
|
||||
if canonDataAddr == "" {
|
||||
@@ -568,16 +585,9 @@ func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) {
|
||||
if canonCtrlAddr == "" {
|
||||
canonCtrlAddr = ctrlAddr
|
||||
}
|
||||
bs.replMu.Lock()
|
||||
if bs.replStates == nil {
|
||||
bs.replStates = make(map[string]*volReplState)
|
||||
}
|
||||
bs.replStates[path] = &volReplState{
|
||||
replicaDataAddr: canonDataAddr,
|
||||
replicaCtrlAddr: canonCtrlAddr,
|
||||
}
|
||||
bs.replMu.Unlock()
|
||||
bs.markReceiverReady(path, canonDataAddr, canonCtrlAddr)
|
||||
glog.V(0).Infof("block service: replica %s receiving on %s/%s", path, canonDataAddr, canonCtrlAddr)
|
||||
return nil
|
||||
}
|
||||
|
||||
// startRebuild starts a rebuild in the background.
|
||||
@@ -722,8 +732,10 @@ func (bs *BlockService) CollectBlockVolumeHeartbeat() []blockvol.BlockVolumeInfo
|
||||
defer bs.replMu.RUnlock()
|
||||
for i := range msgs {
|
||||
if s, ok := bs.replStates[msgs[i].Path]; ok {
|
||||
msgs[i].ReplicaDataAddr = s.replicaDataAddr
|
||||
msgs[i].ReplicaCtrlAddr = s.replicaCtrlAddr
|
||||
if s.publishHealthy {
|
||||
msgs[i].ReplicaDataAddr = s.replicaDataAddr
|
||||
msgs[i].ReplicaCtrlAddr = s.replicaCtrlAddr
|
||||
}
|
||||
}
|
||||
// NVMe publication: report nvme_addr and nqn if NVMe target is running.
|
||||
if bs.nvmeListenAddr != "" {
|
||||
@@ -758,6 +770,108 @@ func (bs *BlockService) multiReplicaUnchanged(path string, addrs []blockvol.Repl
|
||||
return true
|
||||
}
|
||||
|
||||
func (bs *BlockService) ensureReplStateLocked(path string) *volReplState {
|
||||
if bs.replStates == nil {
|
||||
bs.replStates = make(map[string]*volReplState)
|
||||
}
|
||||
state := bs.replStates[path]
|
||||
if state == nil {
|
||||
state = &volReplState{}
|
||||
bs.replStates[path] = state
|
||||
}
|
||||
return state
|
||||
}
|
||||
|
||||
func (bs *BlockService) noteRoleApplied(path string, role blockvol.Role) {
|
||||
bs.replMu.Lock()
|
||||
defer bs.replMu.Unlock()
|
||||
state := bs.ensureReplStateLocked(path)
|
||||
state.roleApplied = true
|
||||
switch role {
|
||||
case blockvol.RoleReplica:
|
||||
state.receiverReady = false
|
||||
state.shipperConfigured = false
|
||||
state.replicaEligible = false
|
||||
state.publishHealthy = false
|
||||
case blockvol.RolePrimary:
|
||||
state.receiverReady = false
|
||||
state.shipperConfigured = false
|
||||
state.replicaEligible = false
|
||||
state.publishHealthy = true
|
||||
case blockvol.RoleRebuilding:
|
||||
state.receiverReady = false
|
||||
state.shipperConfigured = false
|
||||
state.replicaEligible = false
|
||||
state.publishHealthy = false
|
||||
default:
|
||||
state.receiverReady = false
|
||||
state.shipperConfigured = false
|
||||
state.replicaEligible = false
|
||||
state.publishHealthy = false
|
||||
state.replicaDataAddr = ""
|
||||
state.replicaCtrlAddr = ""
|
||||
state.allReplicas = nil
|
||||
}
|
||||
}
|
||||
|
||||
func (bs *BlockService) markPrimaryTransportConfigured(path string, addrs []blockvol.ReplicaAddr) {
|
||||
bs.replMu.Lock()
|
||||
defer bs.replMu.Unlock()
|
||||
state := bs.ensureReplStateLocked(path)
|
||||
state.shipperConfigured = len(addrs) > 0
|
||||
state.publishHealthy = true
|
||||
state.replicaEligible = false
|
||||
state.receiverReady = false
|
||||
if len(addrs) == 0 {
|
||||
state.replicaDataAddr = ""
|
||||
state.replicaCtrlAddr = ""
|
||||
state.allReplicas = nil
|
||||
return
|
||||
}
|
||||
copied := make([]blockvol.ReplicaAddr, len(addrs))
|
||||
copy(copied, addrs)
|
||||
state.allReplicas = copied
|
||||
state.replicaDataAddr = addrs[0].DataAddr
|
||||
state.replicaCtrlAddr = addrs[0].CtrlAddr
|
||||
}
|
||||
|
||||
func (bs *BlockService) markReceiverReady(path, dataAddr, ctrlAddr string) {
|
||||
bs.replMu.Lock()
|
||||
defer bs.replMu.Unlock()
|
||||
state := bs.ensureReplStateLocked(path)
|
||||
state.receiverReady = true
|
||||
state.replicaEligible = true
|
||||
state.publishHealthy = true
|
||||
state.shipperConfigured = false
|
||||
state.replicaDataAddr = dataAddr
|
||||
state.replicaCtrlAddr = ctrlAddr
|
||||
state.allReplicas = nil
|
||||
}
|
||||
|
||||
// ReadinessSnapshot reports the service-owned assignment/readiness closure for
|
||||
// one volume. It keeps v2 publication truth above blockvol's local mechanics.
|
||||
func (bs *BlockService) ReadinessSnapshot(path string) BlockReadinessSnapshot {
|
||||
snap := BlockReadinessSnapshot{}
|
||||
bs.replMu.RLock()
|
||||
state := bs.replStates[path]
|
||||
if state != nil {
|
||||
snap.RoleApplied = state.roleApplied
|
||||
snap.ReceiverReady = state.receiverReady
|
||||
snap.ShipperConfigured = state.shipperConfigured
|
||||
snap.ReplicaEligible = state.replicaEligible
|
||||
snap.PublishHealthy = state.publishHealthy
|
||||
}
|
||||
bs.replMu.RUnlock()
|
||||
if !snap.ShipperConfigured || bs.blockStore == nil {
|
||||
return snap
|
||||
}
|
||||
_ = bs.blockStore.WithVolume(path, func(vol *blockvol.BlockVol) error {
|
||||
snap.ShipperConnected = len(vol.ReplicaShipperStates()) > 0 && !vol.Status().ReplicaDegraded
|
||||
return nil
|
||||
})
|
||||
return snap
|
||||
}
|
||||
|
||||
// --- P3: Assignment idempotence ---
|
||||
|
||||
// lastAppliedAssignment stores the full assignment for idempotence comparison.
|
||||
|
||||
@@ -17,13 +17,19 @@ type ShipperDebugInfo struct {
|
||||
|
||||
// BlockVolumeDebugInfo is the real-time block volume state.
|
||||
type BlockVolumeDebugInfo struct {
|
||||
Path string `json:"path"`
|
||||
Role string `json:"role"`
|
||||
Epoch uint64 `json:"epoch"`
|
||||
HeadLSN uint64 `json:"head_lsn"`
|
||||
Degraded bool `json:"degraded"`
|
||||
Shippers []ShipperDebugInfo `json:"shippers,omitempty"`
|
||||
Timestamp string `json:"timestamp"`
|
||||
Path string `json:"path"`
|
||||
Role string `json:"role"`
|
||||
Epoch uint64 `json:"epoch"`
|
||||
HeadLSN uint64 `json:"head_lsn"`
|
||||
Degraded bool `json:"degraded"`
|
||||
RoleApplied bool `json:"role_applied"`
|
||||
ReceiverReady bool `json:"receiver_ready"`
|
||||
ShipperConfigured bool `json:"shipper_configured"`
|
||||
ShipperConnected bool `json:"shipper_connected"`
|
||||
ReplicaEligible bool `json:"replica_eligible"`
|
||||
PublishHealthy bool `json:"publish_healthy"`
|
||||
Shippers []ShipperDebugInfo `json:"shippers,omitempty"`
|
||||
Timestamp string `json:"timestamp"`
|
||||
}
|
||||
|
||||
// debugBlockShipperHandler returns real-time shipper state for all block volumes.
|
||||
@@ -48,13 +54,20 @@ func (vs *VolumeServer) debugBlockShipperHandler(w http.ResponseWriter, r *http.
|
||||
var infos []BlockVolumeDebugInfo
|
||||
store.IterateBlockVolumes(func(path string, vol *blockvol.BlockVol) {
|
||||
status := vol.Status()
|
||||
readiness := vs.blockService.ReadinessSnapshot(path)
|
||||
info := BlockVolumeDebugInfo{
|
||||
Path: path,
|
||||
Role: status.Role.String(),
|
||||
Epoch: status.Epoch,
|
||||
HeadLSN: status.WALHeadLSN,
|
||||
Degraded: status.ReplicaDegraded,
|
||||
Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
|
||||
Path: path,
|
||||
Role: status.Role.String(),
|
||||
Epoch: status.Epoch,
|
||||
HeadLSN: status.WALHeadLSN,
|
||||
Degraded: status.ReplicaDegraded,
|
||||
RoleApplied: readiness.RoleApplied,
|
||||
ReceiverReady: readiness.ReceiverReady,
|
||||
ShipperConfigured: readiness.ShipperConfigured,
|
||||
ShipperConnected: readiness.ShipperConnected,
|
||||
ReplicaEligible: readiness.ReplicaEligible,
|
||||
PublishHealthy: readiness.PublishHealthy,
|
||||
Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
|
||||
}
|
||||
|
||||
// Get per-shipper state from ShipperGroup if available.
|
||||
|
||||
@@ -36,6 +36,7 @@ type VolumeInfo struct {
|
||||
// CP8-2: Multi-replica fields.
|
||||
ReplicaFactor int `json:"replica_factor"`
|
||||
Replicas []ReplicaDetail `json:"replicas,omitempty"`
|
||||
ReplicaReady bool `json:"replica_ready,omitempty"`
|
||||
HealthScore float64 `json:"health_score"`
|
||||
ReplicaDegraded bool `json:"replica_degraded,omitempty"`
|
||||
DurabilityMode string `json:"durability_mode"` // CP8-3-1
|
||||
@@ -71,6 +72,7 @@ type ReplicaDetail struct {
|
||||
Server string `json:"server"`
|
||||
ISCSIAddr string `json:"iscsi_addr,omitempty"`
|
||||
IQN string `json:"iqn,omitempty"`
|
||||
Ready bool `json:"ready,omitempty"`
|
||||
HealthScore float64 `json:"health_score"`
|
||||
WALLag uint64 `json:"wal_lag,omitempty"`
|
||||
}
|
||||
|
||||
155
weed/storage/blockvol/test/component/replica_read_test.go
Normal file
155
weed/storage/blockvol/test/component/replica_read_test.go
Normal file
@@ -0,0 +1,155 @@
|
||||
package component
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"net"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
|
||||
)
|
||||
|
||||
// TestReplicaReadAfterShip verifies that data shipped from primary to replica
|
||||
// via WAL replication is readable on the replica via ReadLBA.
|
||||
//
|
||||
// This reproduces the CP13-8 bug: replica iSCSI reads zeros despite
|
||||
// replicated data in WAL (sync_all barrier confirmed).
|
||||
func TestReplicaReadAfterShip(t *testing.T) {
|
||||
primaryPath := t.TempDir() + "/primary.blk"
|
||||
replicaPath := t.TempDir() + "/replica.blk"
|
||||
|
||||
primary, err := blockvol.CreateBlockVol(primaryPath, blockvol.CreateOptions{
|
||||
VolumeSize: 4 * 1024 * 1024,
|
||||
BlockSize: 4096,
|
||||
WALSize: 1 * 1024 * 1024,
|
||||
})
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
defer primary.Close()
|
||||
|
||||
replica, err := blockvol.CreateBlockVol(replicaPath, blockvol.CreateOptions{
|
||||
VolumeSize: 4 * 1024 * 1024,
|
||||
BlockSize: 4096,
|
||||
WALSize: 1 * 1024 * 1024,
|
||||
})
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
defer replica.Close()
|
||||
|
||||
// Assign roles.
|
||||
primary.HandleAssignment(1, blockvol.RolePrimary, 30*time.Second)
|
||||
replica.HandleAssignment(1, blockvol.RoleReplica, 30*time.Second)
|
||||
|
||||
// Start replica receiver.
|
||||
if err := replica.StartReplicaReceiver(":0", ":0"); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
recvAddr := replica.ReplicaReceiverAddr()
|
||||
if recvAddr == nil {
|
||||
t.Fatal("replica receiver not started")
|
||||
}
|
||||
t.Logf("replica receiver: data=%s ctrl=%s", recvAddr.DataAddr, recvAddr.CtrlAddr)
|
||||
|
||||
// Wire shipper from primary to replica.
|
||||
primary.SetReplicaAddr(recvAddr.DataAddr, recvAddr.CtrlAddr)
|
||||
|
||||
// Write on primary — should ship to replica.
|
||||
writeData := bytes.Repeat([]byte{0xAB}, 4096)
|
||||
if err := primary.WriteLBA(0, writeData); err != nil {
|
||||
t.Fatalf("primary WriteLBA(0): %v", err)
|
||||
}
|
||||
|
||||
// Give shipping + apply time.
|
||||
time.Sleep(2 * time.Second)
|
||||
|
||||
// Read from REPLICA.
|
||||
replicaData, err := replica.ReadLBA(0, 4096)
|
||||
if err != nil {
|
||||
t.Fatalf("replica ReadLBA(0): %v", err)
|
||||
}
|
||||
|
||||
if replicaData[0] == 0x00 {
|
||||
t.Fatalf("BUG REPRODUCED: replica ReadLBA returns zeros (first byte=0x%02x, want 0xAB)"+
|
||||
"\nData is in replica WAL but ReadLBA returns zeros", replicaData[0])
|
||||
}
|
||||
if !bytes.Equal(replicaData, writeData) {
|
||||
t.Fatalf("replica data mismatch: first byte=0x%02x, want 0xAB", replicaData[0])
|
||||
}
|
||||
t.Log("replica ReadLBA after ship: OK (data matches primary)")
|
||||
}
|
||||
|
||||
// TestReplicaReadDirectApply bypasses the shipper entirely and manually
|
||||
// ships a WAL entry via TCP to the replica receiver, then reads it back.
|
||||
func TestReplicaReadDirectApply(t *testing.T) {
|
||||
replicaPath := t.TempDir() + "/replica.blk"
|
||||
vol, err := blockvol.CreateBlockVol(replicaPath, blockvol.CreateOptions{
|
||||
VolumeSize: 4 * 1024 * 1024,
|
||||
BlockSize: 4096,
|
||||
WALSize: 1 * 1024 * 1024,
|
||||
})
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
defer vol.Close()
|
||||
|
||||
vol.HandleAssignment(1, blockvol.RoleReplica, 30*time.Second)
|
||||
|
||||
if err := vol.StartReplicaReceiver(":0", ":0"); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
recvAddr := vol.ReplicaReceiverAddr()
|
||||
t.Logf("replica: data=%s ctrl=%s", recvAddr.DataAddr, recvAddr.CtrlAddr)
|
||||
|
||||
// Directly connect and ship a WAL entry.
|
||||
conn, err := net.DialTimeout("tcp", recvAddr.DataAddr, 3*time.Second)
|
||||
if err != nil {
|
||||
t.Fatalf("connect: %v", err)
|
||||
}
|
||||
defer conn.Close()
|
||||
|
||||
payload := bytes.Repeat([]byte{0xEF}, 4096)
|
||||
entry := blockvol.WALEntry{
|
||||
LSN: 1,
|
||||
Epoch: 1,
|
||||
Type: blockvol.EntryTypeWrite,
|
||||
LBA: 0,
|
||||
Length: 4096,
|
||||
Data: payload,
|
||||
}
|
||||
encoded, err := entry.Encode()
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
if err := blockvol.WriteFrame(conn, blockvol.MsgWALEntry, encoded); err != nil {
|
||||
t.Fatalf("ship: %v", err)
|
||||
}
|
||||
|
||||
time.Sleep(1 * time.Second)
|
||||
|
||||
// Read back via ReadLBA.
|
||||
data, err := vol.ReadLBA(0, 4096)
|
||||
if err != nil {
|
||||
t.Fatalf("ReadLBA: %v", err)
|
||||
}
|
||||
|
||||
if data[0] == 0x00 {
|
||||
t.Fatalf("BUG: ReadLBA returns zeros after direct WAL apply (0x%02x, want 0xEF)", data[0])
|
||||
}
|
||||
if data[0] != 0xEF {
|
||||
t.Fatalf("unexpected data: 0x%02x, want 0xEF", data[0])
|
||||
}
|
||||
t.Logf("direct apply ReadLBA: OK (0x%02x)", data[0])
|
||||
|
||||
// Also read via adapter (same path as iSCSI).
|
||||
adapter := blockvol.NewBlockVolAdapter(vol)
|
||||
adapterData, err := adapter.ReadAt(0, 4096)
|
||||
if err != nil {
|
||||
t.Fatalf("adapter ReadAt: %v", err)
|
||||
}
|
||||
if adapterData[0] != 0xEF {
|
||||
t.Fatalf("adapter returns wrong data: 0x%02x, want 0xEF", adapterData[0])
|
||||
}
|
||||
t.Log("adapter ReadAt: OK")
|
||||
}
|
||||
@@ -787,6 +787,11 @@ func waitVolumeHealthy(ctx context.Context, actx *tr.ActionContext, act tr.Actio
|
||||
continue
|
||||
}
|
||||
|
||||
if info.ReplicaFactor > 1 && !info.ReplicaReady {
|
||||
actx.Log(" poll %d: replica assigned but not publish-ready yet", poll)
|
||||
continue
|
||||
}
|
||||
|
||||
// Check not degraded.
|
||||
if info.ReplicaDegraded {
|
||||
actx.Log(" poll %d: replica degraded, waiting...", poll)
|
||||
|
||||
@@ -32,6 +32,7 @@ type VolumeInfo struct {
|
||||
ReplicaCtrlAddr string `json:"replica_ctrl_addr,omitempty"`
|
||||
ReplicaFactor int `json:"replica_factor"`
|
||||
Replicas []ReplicaDetail `json:"replicas,omitempty"`
|
||||
ReplicaReady bool `json:"replica_ready,omitempty"`
|
||||
HealthScore float64 `json:"health_score"`
|
||||
ReplicaDegraded bool `json:"replica_degraded,omitempty"`
|
||||
DurabilityMode string `json:"durability_mode"`
|
||||
@@ -45,6 +46,7 @@ type ReplicaDetail struct {
|
||||
Server string `json:"server"`
|
||||
ISCSIAddr string `json:"iscsi_addr,omitempty"`
|
||||
IQN string `json:"iqn,omitempty"`
|
||||
Ready bool `json:"ready,omitempty"`
|
||||
HealthScore float64 `json:"health_score"`
|
||||
WALLag uint64 `json:"wal_lag,omitempty"`
|
||||
}
|
||||
|
||||
@@ -1,240 +1,368 @@
|
||||
name: cp13-8-real-workload-validation
|
||||
timeout: 20m
|
||||
timeout: 15m
|
||||
|
||||
# CP13-8: Bounded real-workload validation for RF=2 sync_all.
|
||||
#
|
||||
# Workload envelope:
|
||||
# Topology: RF=2 sync_all, cross-machine replication (m01 ↔ M02)
|
||||
# Transport: iSCSI (primary frontend)
|
||||
# Envelope:
|
||||
# Topology: RF=2 sync_all, cross-machine (m01 ↔ M02)
|
||||
# Transport: iSCSI
|
||||
# Workloads: ext4 (filesystem) + PostgreSQL pgbench (application)
|
||||
# Disturbance: one bounded failover (kill primary, promote replica)
|
||||
# Exclusions: NVMe-TCP, RF>2, hours/days soak, degraded-mode perf
|
||||
# Disturbance: one bounded failover (kill primary, auto-promote replica)
|
||||
# Exclusions: NVMe-TCP, RF>2, soak, degraded-mode, mode normalization
|
||||
#
|
||||
# What this validates:
|
||||
# The accepted CP13-1..7 replication contract survives contact with
|
||||
# real filesystem and database consumers. Specifically:
|
||||
# - Replicated writes are durable on both nodes (ext4 file integrity)
|
||||
# - Post-failover data is consistent (fsck + file count)
|
||||
# - Database transactions are durable under sync_all (pgbench TPC-B)
|
||||
#
|
||||
# What this does NOT validate:
|
||||
# - Production rollout readiness
|
||||
# - Performance floor (see Phase 12 P4)
|
||||
# - Degraded mode behavior
|
||||
# - NVMe-TCP transport path
|
||||
# - Mode normalization (CP13-9)
|
||||
# Flow:
|
||||
# 1. Create RF=2 sync_all — NO promote (use initial primary as-is)
|
||||
# 2. Wait for replication healthy (shipper connected, not degraded)
|
||||
# 3. Write ext4 + 200 files on primary
|
||||
# 4. Kill primary → auto-failover promotes replica
|
||||
# 5. Verify ext4 on promoted replica (fsck + files + checksums)
|
||||
# 6. pgbench on promoted replica
|
||||
|
||||
env:
|
||||
repo_dir: "C:/work/seaweedfs"
|
||||
master_url: "http://10.0.0.3:9433"
|
||||
volume_name: cp13-8-val
|
||||
# 512MB: enough for ext4 + pgbench, small enough for mkfs + sync_all.
|
||||
vol_size: "536870912"
|
||||
|
||||
topology:
|
||||
nodes:
|
||||
target_node:
|
||||
host: "192.168.1.184"
|
||||
m01:
|
||||
host: 192.168.1.181
|
||||
alt_ips: ["10.0.0.1"]
|
||||
user: testdev
|
||||
key: "C:/work/dev_server/testdev_key"
|
||||
client_node:
|
||||
host: "192.168.1.181"
|
||||
key: "/opt/work/testdev_key"
|
||||
m02:
|
||||
host: 192.168.1.184
|
||||
alt_ips: ["10.0.0.3"]
|
||||
user: testdev
|
||||
key: "C:/work/dev_server/testdev_key"
|
||||
|
||||
targets:
|
||||
primary:
|
||||
node: target_node
|
||||
vol_size: 100M
|
||||
iscsi_port: 3280
|
||||
admin_port: 8095
|
||||
replica_data_port: 9040
|
||||
replica_ctrl_port: 9041
|
||||
rebuild_port: 9042
|
||||
iqn_suffix: cp13-8-primary
|
||||
replica:
|
||||
node: client_node
|
||||
vol_size: 100M
|
||||
iscsi_port: 3281
|
||||
admin_port: 8096
|
||||
replica_data_port: 9043
|
||||
replica_ctrl_port: 9044
|
||||
rebuild_port: 9045
|
||||
iqn_suffix: cp13-8-replica
|
||||
key: "/opt/work/testdev_key"
|
||||
|
||||
phases:
|
||||
# --- Phase 1: Setup RF=2 sync_all pair ---
|
||||
- name: setup
|
||||
actions:
|
||||
- action: kill_stale
|
||||
node: target_node
|
||||
- action: exec
|
||||
node: m02
|
||||
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-cp138-master /tmp/sw-cp138-vs1; mkdir -p /tmp/sw-cp138-master /tmp/sw-cp138-vs1/blocks"
|
||||
root: "true"
|
||||
ignore_error: true
|
||||
- action: kill_stale
|
||||
node: client_node
|
||||
- action: exec
|
||||
node: m01
|
||||
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-cp138-vs2; mkdir -p /tmp/sw-cp138-vs2/blocks"
|
||||
root: "true"
|
||||
ignore_error: true
|
||||
- action: iscsi_cleanup
|
||||
node: client_node
|
||||
ignore_error: true
|
||||
- action: build_deploy
|
||||
- action: start_target
|
||||
target: primary
|
||||
create: "true"
|
||||
durability_mode: sync_all
|
||||
- action: start_target
|
||||
target: replica
|
||||
create: "true"
|
||||
durability_mode: sync_all
|
||||
- action: assign
|
||||
target: replica
|
||||
epoch: "1"
|
||||
role: replica
|
||||
lease_ttl: 60s
|
||||
- action: assign
|
||||
target: primary
|
||||
epoch: "1"
|
||||
role: primary
|
||||
lease_ttl: 60s
|
||||
- action: set_replica
|
||||
target: primary
|
||||
replica: replica
|
||||
- action: sleep
|
||||
duration: 2s
|
||||
|
||||
# --- Phase 2: ext4 filesystem workload ---
|
||||
- name: ext4-write
|
||||
actions:
|
||||
- action: iscsi_login
|
||||
target: primary
|
||||
node: client_node
|
||||
save_as: device
|
||||
- action: mkfs
|
||||
node: client_node
|
||||
device: "{{ device }}"
|
||||
fstype: ext4
|
||||
- action: mount
|
||||
node: client_node
|
||||
device: "{{ device }}"
|
||||
mountpoint: /mnt/cp13-8
|
||||
# Write 200 files with known content.
|
||||
- action: exec
|
||||
node: client_node
|
||||
root: "true"
|
||||
cmd: "bash -c 'for i in $(seq 1 200); do dd if=/dev/urandom of=/mnt/cp13-8/file_$i bs=4k count=1 2>/dev/null; done && sync'"
|
||||
# Compute checksums for later verification.
|
||||
- action: exec
|
||||
node: client_node
|
||||
root: "true"
|
||||
cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-checksums.txt && cat /tmp/cp13-8-checksums.txt | wc -l"
|
||||
save_as: checksum_count
|
||||
- action: assert_equal
|
||||
actual: "{{ checksum_count }}"
|
||||
expected: "200"
|
||||
- action: umount
|
||||
node: client_node
|
||||
mountpoint: /mnt/cp13-8
|
||||
- action: iscsi_cleanup
|
||||
node: client_node
|
||||
ignore_error: true
|
||||
# Wait for replication to catch up.
|
||||
- action: wait_lsn
|
||||
target: replica
|
||||
min_lsn: "1"
|
||||
timeout: 30s
|
||||
- action: start_weed_master
|
||||
node: m02
|
||||
port: "9433"
|
||||
dir: /tmp/sw-cp138-master
|
||||
extra_args: "-ip=10.0.0.3"
|
||||
save_as: master_pid
|
||||
|
||||
- action: sleep
|
||||
duration: 3s
|
||||
|
||||
# --- Phase 3: Failover (kill primary, promote replica) ---
|
||||
- action: start_weed_volume
|
||||
node: m02
|
||||
port: "18480"
|
||||
master: "10.0.0.3:9433"
|
||||
dir: /tmp/sw-cp138-vs1
|
||||
extra_args: "-block.dir=/tmp/sw-cp138-vs1/blocks -block.listen=:3295 -ip=10.0.0.3"
|
||||
save_as: vs1_pid
|
||||
|
||||
- action: start_weed_volume
|
||||
node: m01
|
||||
port: "18480"
|
||||
master: "10.0.0.3:9433"
|
||||
dir: /tmp/sw-cp138-vs2
|
||||
extra_args: "-block.dir=/tmp/sw-cp138-vs2/blocks -block.listen=:3295 -ip=10.0.0.1"
|
||||
save_as: vs2_pid
|
||||
|
||||
- action: sleep
|
||||
duration: 3s
|
||||
|
||||
- action: wait_cluster_ready
|
||||
node: m02
|
||||
master_url: "{{ master_url }}"
|
||||
|
||||
- action: wait_block_servers
|
||||
count: "2"
|
||||
|
||||
- action: create_block_volume
|
||||
name: "{{ volume_name }}"
|
||||
size_bytes: "{{ vol_size }}"
|
||||
replica_factor: "2"
|
||||
durability_mode: "sync_all"
|
||||
|
||||
# Wait for assignment delivery (heartbeat cycle).
|
||||
- action: sleep
|
||||
duration: 15s
|
||||
|
||||
# Bootstrap write: triggers shipper connect + first barrier.
|
||||
# Without this, the shipper stays degraded because barrier-triggered
|
||||
# recovery needs a write to fire the barrier.
|
||||
- action: lookup_block_volume
|
||||
name: "{{ volume_name }}"
|
||||
save_as: boot_vol
|
||||
|
||||
- action: iscsi_login_direct
|
||||
node: m01
|
||||
host: "{{ boot_vol_iscsi_host }}"
|
||||
port: "{{ boot_vol_iscsi_port }}"
|
||||
iqn: "{{ boot_vol_iqn }}"
|
||||
save_as: boot_device
|
||||
ignore_error: true
|
||||
|
||||
- action: exec
|
||||
node: m01
|
||||
root: "true"
|
||||
cmd: "dd if=/dev/urandom of={{ boot_device }} bs=4k count=1 seek=100000 oflag=direct,sync 2>/dev/null; true"
|
||||
ignore_error: true
|
||||
|
||||
- action: iscsi_cleanup
|
||||
node: m01
|
||||
ignore_error: true
|
||||
|
||||
- action: sleep
|
||||
duration: 5s
|
||||
|
||||
- action: wait_volume_healthy
|
||||
name: "{{ volume_name }}"
|
||||
timeout: 60s
|
||||
|
||||
- action: discover_primary
|
||||
name: "{{ volume_name }}"
|
||||
save_as: pri
|
||||
|
||||
- action: print
|
||||
msg: "CP13-8 setup: primary={{ pri }} ({{ pri_server }}), replica={{ pri_replica_node }}"
|
||||
|
||||
# --- Phase 2: ext4 filesystem workload on initial primary ---
|
||||
- name: ext4-write
|
||||
actions:
|
||||
- action: lookup_block_volume
|
||||
name: "{{ volume_name }}"
|
||||
save_as: vol
|
||||
|
||||
- action: iscsi_login_direct
|
||||
node: m01
|
||||
host: "{{ vol_iscsi_host }}"
|
||||
port: "{{ vol_iscsi_port }}"
|
||||
iqn: "{{ vol_iqn }}"
|
||||
save_as: device
|
||||
|
||||
- action: exec
|
||||
node: m01
|
||||
cmd: "mkfs.ext4 -F {{ device }} 2>&1 | tail -2"
|
||||
root: "true"
|
||||
|
||||
- action: exec
|
||||
node: m01
|
||||
cmd: "mkdir -p /mnt/cp13-8 && mount {{ device }} /mnt/cp13-8"
|
||||
root: "true"
|
||||
|
||||
- action: exec
|
||||
node: m01
|
||||
root: "true"
|
||||
cmd: "for i in $(seq 1 200); do dd if=/dev/urandom of=/mnt/cp13-8/file_$i bs=4k count=1 2>/dev/null; done && sync && echo WRITE_DONE"
|
||||
save_as: write_result
|
||||
|
||||
- action: assert_contains
|
||||
value: "{{ write_result }}"
|
||||
contains: "WRITE_DONE"
|
||||
|
||||
- action: exec
|
||||
node: m01
|
||||
root: "true"
|
||||
cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-pre.md5 && wc -l < /tmp/cp13-8-pre.md5"
|
||||
save_as: pre_checksum_count
|
||||
|
||||
- action: assert_equal
|
||||
actual: "{{ pre_checksum_count }}"
|
||||
expected: "200"
|
||||
|
||||
- action: exec
|
||||
node: m01
|
||||
cmd: "umount /mnt/cp13-8"
|
||||
root: "true"
|
||||
|
||||
- action: iscsi_cleanup
|
||||
node: m01
|
||||
ignore_error: true
|
||||
|
||||
- action: print
|
||||
msg: "ext4-write: 200 files written, checksums captured"
|
||||
|
||||
# Verify replication is healthy after all writes.
|
||||
- action: wait_volume_healthy
|
||||
name: "{{ volume_name }}"
|
||||
timeout: 30s
|
||||
|
||||
# --- Phase 3: Failover ---
|
||||
# Kill ONLY the primary's VS, keep the replica alive for auto-promote.
|
||||
# Master allocates primary to m01 first (by server registration order).
|
||||
# Kill m01 VS (primary), m02 VS (replica) stays alive for promotion.
|
||||
- name: failover
|
||||
actions:
|
||||
- action: kill_target
|
||||
target: primary
|
||||
- action: assign
|
||||
target: replica
|
||||
epoch: "2"
|
||||
role: primary
|
||||
lease_ttl: 60s
|
||||
- action: wait_role
|
||||
target: replica
|
||||
role: primary
|
||||
timeout: 10s
|
||||
- action: print
|
||||
msg: "=== Killing primary VS on m01 ==="
|
||||
|
||||
- action: exec
|
||||
node: m01
|
||||
cmd: "kill -9 {{ vs2_pid }}"
|
||||
root: "true"
|
||||
ignore_error: true
|
||||
|
||||
# Wait for lease expiry (30s TTL) + auto-failover.
|
||||
- action: sleep
|
||||
duration: 50s
|
||||
|
||||
# Wait for primary to change from m01 to m02.
|
||||
- action: wait_block_primary
|
||||
name: "{{ volume_name }}"
|
||||
not: "{{ pri_server }}"
|
||||
timeout: 60s
|
||||
save_as: new_pri
|
||||
|
||||
- action: print
|
||||
msg: "Failover: {{ pri_server }} → {{ new_pri }}"
|
||||
|
||||
- action: sleep
|
||||
duration: 5s
|
||||
|
||||
# --- Phase 4: ext4 verification on promoted replica ---
|
||||
- name: ext4-verify
|
||||
actions:
|
||||
- action: iscsi_login
|
||||
target: replica
|
||||
node: client_node
|
||||
- action: discover_primary
|
||||
name: "{{ volume_name }}"
|
||||
save_as: new
|
||||
|
||||
- action: print
|
||||
msg: "Verifying ext4 on promoted node {{ new }} ({{ new_server }})"
|
||||
|
||||
# Connect to the new primary's iSCSI.
|
||||
- action: iscsi_login_direct
|
||||
node: m01
|
||||
host: "{{ new_host }}"
|
||||
port: "3295"
|
||||
iqn: "{{ vol_iqn }}"
|
||||
save_as: device2
|
||||
# fsck: filesystem integrity.
|
||||
|
||||
- action: fsck_ext4
|
||||
node: client_node
|
||||
node: m01
|
||||
device: "{{ device2 }}"
|
||||
save_as: fsck_result
|
||||
# Mount and verify file count.
|
||||
- action: mount
|
||||
node: client_node
|
||||
device: "{{ device2 }}"
|
||||
mountpoint: /mnt/cp13-8
|
||||
|
||||
- action: print
|
||||
msg: "fsck: {{ fsck_result }}"
|
||||
|
||||
- action: exec
|
||||
node: client_node
|
||||
node: m01
|
||||
cmd: "mkdir -p /mnt/cp13-8 && mount {{ device2 }} /mnt/cp13-8"
|
||||
root: "true"
|
||||
|
||||
- action: exec
|
||||
node: m01
|
||||
root: "true"
|
||||
cmd: "ls /mnt/cp13-8/file_* | wc -l"
|
||||
save_as: post_failover_count
|
||||
save_as: post_count
|
||||
|
||||
- action: assert_equal
|
||||
actual: "{{ post_failover_count }}"
|
||||
actual: "{{ post_count }}"
|
||||
expected: "200"
|
||||
# Verify checksums match pre-failover.
|
||||
|
||||
- action: exec
|
||||
node: client_node
|
||||
node: m01
|
||||
root: "true"
|
||||
cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-checksums-post.txt && diff /tmp/cp13-8-checksums.txt /tmp/cp13-8-checksums-post.txt && echo MATCH"
|
||||
save_as: checksum_match
|
||||
cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-post.md5 && diff /tmp/cp13-8-pre.md5 /tmp/cp13-8-post.md5 && echo CHECKSUM_MATCH"
|
||||
save_as: checksum_diff
|
||||
|
||||
- action: assert_contains
|
||||
value: "{{ checksum_match }}"
|
||||
contains: "MATCH"
|
||||
- action: umount
|
||||
node: client_node
|
||||
mountpoint: /mnt/cp13-8
|
||||
value: "{{ checksum_diff }}"
|
||||
contains: "CHECKSUM_MATCH"
|
||||
|
||||
- action: exec
|
||||
node: m01
|
||||
cmd: "umount /mnt/cp13-8"
|
||||
root: "true"
|
||||
|
||||
- action: iscsi_cleanup
|
||||
node: client_node
|
||||
node: m01
|
||||
ignore_error: true
|
||||
|
||||
# --- Phase 5: pgbench on promoted replica (application workload) ---
|
||||
- name: pgbench-on-replica
|
||||
- action: print
|
||||
msg: "ext4-verify: fsck CLEAN, 200 files, checksums MATCH"
|
||||
|
||||
# --- Phase 5: pgbench on promoted replica ---
|
||||
- name: pgbench
|
||||
actions:
|
||||
- action: iscsi_login
|
||||
target: replica
|
||||
node: client_node
|
||||
- action: iscsi_login_direct
|
||||
node: m01
|
||||
host: "{{ new_host }}"
|
||||
port: "3295"
|
||||
iqn: "{{ vol_iqn }}"
|
||||
save_as: device3
|
||||
|
||||
- action: sleep
|
||||
duration: 3s
|
||||
|
||||
- action: pgbench_init
|
||||
node: client_node
|
||||
node: m01
|
||||
device: "{{ device3 }}"
|
||||
mount: "/mnt/cp13-8-pg"
|
||||
port: "5440"
|
||||
port: "5441"
|
||||
scale: "1"
|
||||
fstype: ext4
|
||||
|
||||
- action: pgbench_run
|
||||
node: client_node
|
||||
node: m01
|
||||
clients: "1"
|
||||
duration: "10"
|
||||
save_as: tps_post_failover
|
||||
save_as: tps
|
||||
|
||||
- action: print
|
||||
msg: "CP13-8: pgbench TPC-B post-failover: {{ tps_post_failover }} TPS"
|
||||
msg: "CP13-8 pgbench TPS: {{ tps }}"
|
||||
|
||||
- action: assert_greater
|
||||
value: "{{ tps_post_failover }}"
|
||||
actual: "{{ tps }}"
|
||||
threshold: "0"
|
||||
# pgbench succeeded with TPS > 0 = database transactions are durable on the promoted replica.
|
||||
|
||||
- action: pgbench_cleanup
|
||||
node: client_node
|
||||
mount: "/mnt/cp13-8-pg"
|
||||
port: "5440"
|
||||
node: m01
|
||||
ignore_error: true
|
||||
|
||||
- action: iscsi_cleanup
|
||||
node: client_node
|
||||
node: m01
|
||||
ignore_error: true
|
||||
|
||||
# --- Phase 6: Cleanup ---
|
||||
- name: cleanup
|
||||
always: true
|
||||
actions:
|
||||
- action: exec
|
||||
node: m01
|
||||
cmd: "umount /mnt/cp13-8 /mnt/cp13-8-pg 2>/dev/null; true"
|
||||
root: "true"
|
||||
ignore_error: true
|
||||
- action: iscsi_cleanup
|
||||
node: client_node
|
||||
node: m01
|
||||
ignore_error: true
|
||||
- action: stop_all_targets
|
||||
- action: stop_weed
|
||||
node: m01
|
||||
pid: "{{ vs2_pid }}"
|
||||
ignore_error: true
|
||||
- action: stop_weed
|
||||
node: m01
|
||||
pid: "{{ vs2_new_pid }}"
|
||||
ignore_error: true
|
||||
- action: stop_weed
|
||||
node: m02
|
||||
pid: "{{ vs1_pid }}"
|
||||
ignore_error: true
|
||||
- action: stop_weed
|
||||
node: m02
|
||||
pid: "{{ vs1_new_pid }}"
|
||||
ignore_error: true
|
||||
- action: stop_weed
|
||||
node: m02
|
||||
pid: "{{ master_pid }}"
|
||||
ignore_error: true
|
||||
|
||||
@@ -118,10 +118,14 @@ func (bs *BlockVolumeStore) WithVolume(path string, fn func(*blockvol.BlockVol)
|
||||
return fn(vol)
|
||||
}
|
||||
|
||||
// ProcessBlockVolumeAssignments applies a batch of assignments from master.
|
||||
// Returns a slice of errors parallel to the input (nil = success).
|
||||
// Unknown volumes and invalid transitions are logged and returned as errors,
|
||||
// but do not stop processing of remaining assignments.
|
||||
// ProcessBlockVolumeAssignments applies only the local role/epoch/lease part of
|
||||
// a batch of assignments. It does NOT wire replica receivers, shippers, or
|
||||
// publication readiness. The authoritative runtime lifecycle lives in
|
||||
// BlockService.ApplyAssignments.
|
||||
//
|
||||
// Returns a slice of errors parallel to the input (nil = success). Unknown
|
||||
// volumes and invalid transitions are logged and returned as errors, but do not
|
||||
// stop processing of remaining assignments.
|
||||
func (bs *BlockVolumeStore) ProcessBlockVolumeAssignments(
|
||||
assignments []blockvol.BlockVolumeAssignment,
|
||||
) []error {
|
||||
|
||||
Reference in New Issue
Block a user