feat: CP13-8 PASSES — real-workload validation on RF=2 sync_all

CP13-8 scenario results on m01/m02 (25Gbps RoCE):
  fsck_ext4:       CLEAN
  file count:      200 (assert_equal PASS)
  checksum match:  MATCH (assert_contains PASS)
  pgbench TPS:     565.69 (assert_greater PASS)
  auto-failover:   10.0.0.1:18480 → 10.0.0.3:18480

Code changes (tester + scenario):
- volume_server_block.go: readiness state, assignment lifecycle cleanup
- block_heartbeat_loop.go: readiness-aware heartbeat reporting
- store_blockvol.go: readiness tracking
- master_server_handlers_block.go: block API handler updates
- cp13-8-real-workload-validation.yaml: redesigned scenario
  (removed block_promote, use natural auto-failover flow,
  bootstrap write before wait_volume_healthy)
- testrunner/actions/devops.go: scenario action improvements
- replica_read_test.go: component-level replica read test

Phase docs: CP13-7 accepted, CP13-8/8A technical packs updated,
design docs updated for protocol closure evidence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pingqiu
2026-04-03 14:24:13 -07:00
parent 334c12664a
commit 4c7fbefe25
21 changed files with 2375 additions and 248 deletions

@@ -1284,3 +1284,360 @@ Review checklist:
3. is rebuild handoff bounded and epoch-safe?
4. is post-rebuild progress initialized from checkpoint truth?
5. is the checkpoint still bounded to rebuild fallback?
---
### `CP13-8` Technical Pack
Date: 2026-04-03
Goal: validate the accepted `RF=2 sync_all` replication contract on one bounded set of real workloads so the engineering proof is demonstrated on named real block-device consumers rather than only protocol-level tests
#### Layer 1: Semantic Core
##### Problem statement
`CP13-1..7` have progressively closed replication correctness: address truth, durable progress, state eligibility, reconnect/catch-up, retention, and rebuild fallback.
What remains is not more replication semantics. It is proving that the accepted contract survives contact with bounded real workloads.
`CP13-8` therefore accepts only one bounded thing:
1. one bounded real-workload validation package for the accepted `RF=2 sync_all` path
It does not accept:
1. broad launch approval
2. broad benchmark positioning
3. mode normalization or general product-policy closure
##### State / contract
`CP13-8` must make these truths explicit:
1. the workload envelope is named and bounded:
- topology
- transport/frontend
- filesystem/application consumer
- disturbance shapes included
- exclusions
2. the accepted `CP13-1..7` replication contract is the thing being validated, not redefined
3. real-workload evidence must be replayable and attributable
4. passing one named workload package does not imply generic production readiness outside the stated envelope
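
The envelope fields listed above can be made checkable before any run starts. The following is a minimal sketch, assuming a hypothetical `WorkloadEnvelope` type; the type, its field names, and `Validate` are illustrative and are not part of `testrunner`.

```go
package main

import (
	"fmt"
	"strings"
)

// WorkloadEnvelope is a hypothetical record of one bounded validation matrix.
// The type and field names are illustrative and are not taken from testrunner.
type WorkloadEnvelope struct {
	Topology     string   // for example "RF=2 sync_all"
	Frontend     string   // transport/frontend, for example "iSCSI"
	Consumers    []string // filesystem/application consumers, for example "ext4"
	Disturbances []string // disturbance shapes included; empty means none claimed
	Exclusions   []string // what the package explicitly does not claim
}

// Validate returns an error naming every required dimension left blank.
// Disturbances may be empty because they are optional in the envelope.
func (e WorkloadEnvelope) Validate() error {
	var missing []string
	if strings.TrimSpace(e.Topology) == "" {
		missing = append(missing, "topology")
	}
	if strings.TrimSpace(e.Frontend) == "" {
		missing = append(missing, "frontend")
	}
	if len(e.Consumers) == 0 {
		missing = append(missing, "consumers")
	}
	if len(e.Exclusions) == 0 {
		missing = append(missing, "exclusions")
	}
	if len(missing) > 0 {
		return fmt.Errorf("underspecified envelope: missing %s", strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	env := WorkloadEnvelope{
		Topology:     "RF=2 sync_all",
		Frontend:     "iSCSI",
		Consumers:    []string{"ext4", "pgbench"},
		Disturbances: []string{"auto-failover"},
		Exclusions:   []string{"RF>2", "launch approval", "mode normalization"},
	}
	if err := env.Validate(); err != nil {
		fmt.Println("reject:", err)
		return
	}
	fmt.Println("envelope is explicit")
}
```

Rejecting an envelope with no exclusions is a design choice in this sketch: it forces the no-overclaim statement to exist as data rather than as wording in a delivery note.
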
##### Reject shapes
Reject before implementation or review if the checkpoint:
1. presents ad hoc manual runs without a named envelope
2. treats synthetic benchmarks as substitutes for real workload validation
3. mixes workload validation with mode normalization or rollout-approval claims
4. reopens already-accepted replication semantics instead of validating them
#### Layer 2: Execution Core
##### Current gap `CP13-8` must close
1. accepted replication semantics are still primarily validated by protocol/unit/adversarial evidence
2. one bounded real workload package is still needed to show the contract survives real filesystem/application behavior
3. the project still needs a replayable workload-evidence object before talking about final mode normalization or broader launch shaping
##### Suggested file targets
1. `weed/storage/blockvol/testrunner/*`
2. bounded component/real-device validation under `weed/storage/blockvol/test/`
3. workload docs or result artifacts under `sw-block/.private/phase/`
4. `weed/server/*` or `blockvol` only if a real workload exposes a concrete bug
##### Validation focus
Required proofs:
1. filesystem proof
- one named real filesystem workload completes correctly on the accepted path
2. application proof
- one named database/application workload completes correctly on the accepted path
3. disturbance proof
- only if explicitly included in the envelope, one bounded disturbance case remains correct
4. envelope proof
- topology/frontend/workload/exclusions are explicit and replayable
5. boundedness proof
- checkpoint remains about workload validation, not `CP13-9+`
Reject if:
1. a claimed proof is only a harness smoke test without real workload semantics
2. failures cannot be attributed because the environment is underspecified
3. delivery wording implies broad launch readiness from one bounded package
##### Suggested first cut
1. freeze one explicit workload matrix first, before chasing more scenarios
2. use one filesystem workload and one application/database workload
3. keep disturbances narrow and named if included at all
4. produce one result artifact that ties outcomes back to accepted `CP13-1..7` semantics
##### Assignment For `sw`
1. Goal
- deliver bounded real-workload validation on the accepted `RF=2 sync_all` path
2. Required outputs
- one explicit workload-envelope summary
- one focused code/harness package only where needed to make the bounded workloads replayable
- one delivery note explaining:
- files updated in place
- workload matrix
- proof shape
- what later checkpoints remain untouched
3. Hard rules
- do not broaden into generic benchmark marketing
- do not claim launch approval from one workload package
- do not reopen accepted `CP13-1..7` semantics unless the workload exposes a concrete bug
##### Assignment For `tester`
1. Goal
- validate that `CP13-8` closes bounded real-workload validation and nothing broader
2. Validate
- a named filesystem workload completes correctly
- a named application/database workload completes correctly
- the environment and exclusions are explicit
- evidence is replayable and attributable
- no-overclaim around `CP13-9+`
3. Reject if
- workload evidence is underspecified or non-replayable
- the validation object quietly broadens into mode/rollout policy
- failures are explained away without a bounded root cause
#### Short judgment
`CP13-8` is acceptable when:
1. one bounded real-workload matrix is explicit
2. the accepted replication contract is demonstrated on named real consumers
3. the resulting evidence is replayable and bounded
4. the checkpoint stays clearly separate from `CP13-9+`
---
### `CP13-8` Delivery Pack
Bounded contract:
1. `CP13-8` accepts real-workload validation only
2. it does not accept mode normalization, rollout approval, or broad performance positioning
What `sw` should deliver:
1. one focused contract review of the workload envelope and its relation to accepted `CP13-1..7` semantics
2. one bounded harness/evidence package only where needed to run the chosen workloads replayably
3. one delivery note with:
- changed files
- workload matrix
- proof shape
- no-overclaim statement
Recommended delivery shape:
1. contract:
- define the named workload envelope and exclusions
2. code/tests/harness:
- keep updates local to real-workload validation surfaces
- make workload pass/fail conditions directly observable
3. note:
- distinguish primary proof from support evidence
- explain why `CP13-9+` remains untouched
Review checklist:
1. is the workload envelope explicit and bounded?
2. are the workloads real consumers, not just synthetic microbenchmarks?
3. is evidence replayable and attributable?
4. does the package validate accepted semantics rather than redefining them?
5. is the checkpoint still bounded to real-workload validation?
---
### `CP13-8A` Technical Pack
Date: 2026-04-03
Goal: close the assignment-to-publication contradiction exposed by `CP13-8` so the accepted `RF=2 sync_all` path no longer publishes replica readiness from allocation or assignment presence alone
#### Layer 1: Semantic Core
##### Problem statement
`CP13-8` exposed a live contradiction:
1. control truth says the replica exists and has assignment/addresses
2. runtime truth may still be between:
- role applied
- receiver startup
- shipper attachment
- publish-ready closure
3. external surfaces can therefore overstate readiness before the replica is actually safe to publish as a real block-device peer
`CP13-8A` therefore accepts only one bounded thing:
1. one bounded assignment-to-publication closure slice for the accepted `RF=2 sync_all` path
It does not accept:
1. broad mode normalization
2. launch approval
3. backend replacement by implication
4. timing-based “wait longer” fixes that leave readiness semantics implicit
##### State / contract
`CP13-8A` must make these truths explicit:
1. assignment delivered is not the same as receiver ready
2. receiver ready is not the same as publish healthy
3. lookup / heartbeat / tester health must consume the same bounded readiness truth
4. the closure remains inside the current chosen path:
- `RF=2`
- `sync_all`
- current master / volume-server heartbeat path
- `blockvol` backend
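
One way to keep "assignment delivered", "receiver ready", and "publish healthy" from collapsing into each other is an ordered state with a single publication gate. This is a sketch under assumptions: the `Readiness` type and its constant names are hypothetical and do not claim to match the states in `volume_server_block.go`.

```go
package main

import "fmt"

// Readiness is a hypothetical ordered lifecycle for one replica.
// The names are illustrative; the real code may name or split states differently.
type Readiness int

const (
	Allocated      Readiness = iota // control truth: the replica exists
	RoleApplied                     // assignment delivered and role applied
	ReceiverReady                   // receiver startup has completed
	PublishHealthy                  // safe to publish as a block-device peer
)

var readinessNames = [...]string{"allocated", "role_applied", "receiver_ready", "publish_healthy"}

// String returns a stable name for logs and REST output.
func (r Readiness) String() string {
	if r < Allocated || r > PublishHealthy {
		return "unknown"
	}
	return readinessNames[r]
}

// Publishable is true only at PublishHealthy. Earlier states never publish,
// so allocation or assignment presence alone cannot imply readiness.
func (r Readiness) Publishable() bool { return r == PublishHealthy }

func main() {
	for r := Allocated; r <= PublishHealthy; r++ {
		fmt.Printf("%-16s publishable=%v\n", r, r.Publishable())
	}
}
```
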
##### Reject shapes
Reject before implementation or review if the slice:
1. leaves two semantic assignment paths alive (`store`-only vs service/runtime path)
2. treats allocation completion or precomputed ports as equivalent to publication readiness
3. relies on sleeps, retries, or ad hoc timing instead of explicit readiness state
4. broadens into `CP13-9` mode policy or generic backend redesign
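
Reject shape 3 can be contrasted with an explicit signal. The sketch below assumes a hypothetical `ReadyGate` type: the waiter blocks on a readiness event, and the deadline only bounds the wait; it does not define readiness. None of these names come from the codebase.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// ReadyGate is a hypothetical one-shot readiness signal for a replica.
type ReadyGate struct {
	once sync.Once
	ch   chan struct{}
}

func NewReadyGate() *ReadyGate { return &ReadyGate{ch: make(chan struct{})} }

// MarkReady records readiness closure. It is safe to call more than once.
func (g *ReadyGate) MarkReady() { g.once.Do(func() { close(g.ch) }) }

// IsReady reports the current state without blocking.
func (g *ReadyGate) IsReady() bool {
	select {
	case <-g.ch:
		return true
	default:
		return false
	}
}

// Wait blocks until readiness is signalled or ctx ends. A timeout here is a
// failure with a named cause, not a substitute for the readiness event.
func (g *ReadyGate) Wait(ctx context.Context) error {
	select {
	case <-g.ch:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("replica not ready: %w", ctx.Err())
	}
}

func main() {
	gate := NewReadyGate()
	go func() {
		// Stand-in for the runtime wiring finishing on the replica side.
		gate.MarkReady()
	}()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := gate.Wait(ctx); err != nil {
		fmt.Println("gate:", err)
		return
	}
	fmt.Println("gate: ready")
}
```
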
#### Layer 2: Execution Core
##### Current gap `CP13-8A` must close
1. assignment application and replication/publication setup are still too easy to split semantically
2. readiness truth is not yet a fully explicit first-class product surface across heartbeat / lookup / tester
3. real-workload reruns cannot cleanly distinguish:
- backend data-visibility bug
- adapter timing/publication bug
- true core-rule gap
##### Suggested file targets
1. `weed/server/volume_server_block.go`
2. `weed/server/block_heartbeat_loop.go`
3. `weed/server/master_block_registry.go`
4. `weed/server/master_grpc_server_block.go`
5. `weed/server/master_server_handlers_block.go`
6. `weed/storage/blockvol/testrunner/actions/devops.go`
7. bounded tests under `weed/server/*`
##### Validation focus
Required proofs:
1. lifecycle proof
- assignment processing uses one authoritative path from role apply through runtime wiring
2. readiness proof
- replica-ready is explicit and not inferred from existence/allocation alone
3. publication proof
- lookup / heartbeat / tester surfaces do not publish the replica before readiness closure
4. rerun proof
- a bounded `CP13-8` rerun moves the remaining contradiction into an attributable bug class rather than mixed-state ambiguity
5. boundedness proof
- the slice remains about closure, not `CP13-9`
Reject if:
1. a claimed proof still depends on manual interpretation of timing
2. different surfaces use different meanings of “healthy” or “ready”
3. the rerun still fails but the failure cannot be classified beyond “timing”
##### Suggested first cut
1. make `BlockService` the single assignment/readiness owner on the VS side
2. define one explicit readiness surface and project it into lookup/REST/tester gates
3. rerun the bounded `CP13-8` workload package only after closure lands
4. classify any remaining failure as:
- backend data bug
- adapter/publication bug
- core-rule gap
##### Assignment For `sw`
1. Goal
- deliver bounded assignment-to-publication closure on the accepted `RF=2 sync_all` path
2. Required outputs
- one focused code package closing the assignment/readiness/publication split
- one delivery note explaining:
- files updated in place
- named readiness states
- proof shape
- `CP13-8` rerun outcome or remaining attributable contradiction
- what later checkpoints remain untouched
3. Hard rules
- do not use timing sleeps as semantic fixes
- do not broaden into `CP13-9` mode normalization
- do not replace `blockvol` backend in this slice
- do not reopen accepted `CP13-1..7` semantics unless a live contradiction is found
##### Assignment For `tester`
1. Goal
- validate that `CP13-8A` closes assignment-to-publication truth and nothing broader
2. Validate
- one authoritative assignment path exists
- readiness is explicit and externally consistent
- lookup / heartbeat / tester health no longer overpublish readiness
- the bounded `CP13-8` rerun is attributable
- no-overclaim around `CP13-9`
3. Reject if
- old mixed-state behavior still leaks through one surface
- the slice depends on timing luck
- the rerun still fails but the team cannot say whether it is backend, adapter, or core
#### Short judgment
`CP13-8A` is acceptable when:
1. assignment-to-publication closure is explicit on the chosen path
2. readiness is no longer inferred from allocation or assignment presence
3. all product/tester surfaces consume the same bounded readiness truth
4. the rerun result is attributable and the slice stays separate from `CP13-9`
---
### `CP13-8A` Delivery Pack
Bounded contract:
1. `CP13-8A` accepts assignment-to-publication closure only
2. it does not accept mode normalization, broad launch approval, or backend replacement
What `sw` should deliver:
1. one focused closure package across VS assignment, heartbeat, lookup, and tester health surfaces
2. one bounded rerun or equivalent evidence showing whether the remaining contradiction is backend, adapter, or core
3. one delivery note with:
- changed files
- named readiness states
- proof shape
- rerun outcome / remaining attributable contradiction
- no-overclaim statement
Recommended delivery shape:
1. contract:
- define explicit readiness/publication truth for the chosen path
2. code/tests:
- unify assignment lifecycle
- gate publication on readiness closure
- prove lookup / heartbeat / tester consistency
3. note:
- distinguish closure proof from any later pure-core redesign
- explain why `CP13-9` remains untouched
Review checklist:
1. is assignment processing semantically unified?
2. is readiness explicit rather than inferred?
3. do lookup / heartbeat / tester surfaces agree on publication truth?
4. does the bounded rerun become attributable?
5. is the slice still bounded to closure rather than mode policy or backend replacement?

@@ -584,12 +584,177 @@ Reject if:
Status:
- accepted
Carry-forward:
1. `NeedsRebuild` is now a real fail-closed fallback state
2. rebuild handoff and post-rebuild progress are bounded by checkpoint truth rather than implicit recovery assumptions
3. `CP13-8` must validate the accepted replication contract on named real workloads without reopening protocol semantics or quietly broadening into mode policy work
### `CP13-8`: Real-Workload Validation
Goal:
- validate the accepted `RF=2 sync_all` replication contract on one bounded set of real workloads so the engineering proof is no longer only protocol/unit-level but also demonstrated on named real block-device consumers
Acceptance object:
1. `CP13-8` accepts one bounded real-workload validation package for the accepted `RF=2 sync_all` path
2. it does not accept broad rollout claims, broad benchmark positioning, or mode normalization by implication
3. it does not accept vague “worked in a manual run” reasoning without named workloads, bounded envelope, and replayable evidence
Execution steps:
1. Step 1: workload envelope freeze
- name one bounded validation matrix:
- workload(s)
- topology
- transport/frontend
- filesystem/application surface
- disturbance shapes included and excluded
- recommended first-cut surfaces:
- real filesystem behavior such as `ext4`
- one database/application surface such as `PostgreSQL`
2. Step 2: harness and evidence hardening
- wire the workload run through real block-device consumers on the accepted path
- keep the environment reproducible and bounded enough that failures are attributable
- collect evidence at the same semantic layer as accepted prior checkpoints
3. Step 3: proof package
- prove the named real workloads complete correctly on the accepted path
- prove disturbance/failover behavior is bounded inside the named envelope if included
- prove no-overclaim around `CP13-9+`
Required scope:
1. one bounded workload matrix on the accepted `RF=2 sync_all` path
2. real block-device consumer validation (not only protocol/unit tests)
3. bounded disturbance cases only if explicitly named in the envelope
4. explicit separation between real-workload proof and later mode normalization / rollout claims
Must prove:
1. the accepted replication contract survives contact with named real workloads
2. evidence is tied to a bounded environment and workload envelope, not generic “production ready” rhetoric
3. failures, if any, are attributable to explicit workload-envelope gaps rather than ambiguous harness drift
4. acceptance wording stays bounded to real-workload validation rather than `CP13-9` policy/mode closure
Reuse discipline:
1. prefer existing `testrunner`, bounded component scenarios, and real-device harnesses where possible
2. update `weed/storage/blockvol/*` only when the real workload exposes a concrete bug in accepted semantics
3. `weed/server/*` should remain reference only unless workload validation exposes a surfaced control/runtime issue
4. no checkpoint work may silently broaden into generic benchmark marketing, launch approval, or mode policy redesign
Verification mechanism:
1. one named workload matrix with explicit environment description
2. replayable runs or artifacts for the chosen workload package
3. explicit pass/fail conditions tied back to accepted `CP13-1..7` semantics
4. no-overclaim review so `CP13-8` does not absorb `CP13-9+`
Hard indicators:
1. one accepted filesystem proof:
- a named real filesystem workload completes correctly on the accepted path
2. one accepted application proof:
- a named real application/database workload completes correctly on the accepted path
3. one accepted envelope proof:
- the validation matrix is explicit about topology, frontend, workload, and exclusions
4. one accepted boundedness proof:
- `CP13-8` claims real-workload validation only
Reject if:
1. the checkpoint relies on ad hoc manual runs with no bounded envelope
2. a claimed real-workload proof is actually only a synthetic benchmark or unit test
3. delivery wording quietly broadens into mode normalization, launch approval, or general production-readiness claims
Status:
- active
### `CP13-8A`: Assignment-to-Publication Closure
Goal:
- close the control/runtime/publication contradiction exposed by `CP13-8` so the system no longer treats allocation or assignment presence as equivalent to replica publication readiness
Acceptance object:
1. `CP13-8A` accepts one bounded closure slice for assignment-to-publication truth on the accepted `RF=2 sync_all` path
2. it does not accept broad mode normalization, launch approval, or backend replacement by implication
3. it does not accept sleep-based or timing-based fixes that leave readiness semantics implicit
Execution steps:
1. Step 1: unify assignment lifecycle
- ensure assignment delivery flows through one authoritative path from role apply to receiver/shipper wiring to readiness bookkeeping
- remove semantic split between store-only role application and service-level replication/publication setup
2. Step 2: name readiness and publication truth
- define explicit readiness states for the chosen path
- ensure heartbeat / lookup / tester surfaces distinguish:
- allocated
- role applied
- receiver ready
- publish healthy
3. Step 3: bounded rerun
- rerun the bounded `CP13-8` workload package after closure lands
- determine whether the remaining contradiction is backend data visibility, adapter timing/publication, or a true core-rule gap
Required scope:
1. assignment-to-publication closure only
2. chosen path only: `RF=2 sync_all`
3. existing master / volume-server heartbeat path only
4. `blockvol` remains the execution backend
Must prove:
1. assignment delivered does not by itself imply receiver ready or publish healthy
2. replica publication requires explicit readiness closure rather than allocation completion or precomputed port presence
3. master lookup / REST / tester health checks consume the same bounded readiness truth
4. `CP13-8A` remains about closure, not mode normalization or backend redesign
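
Item 3 above can be sketched as one decision function with thin projections on top of it. The `ReadinessView` type and its method names are assumptions made for illustration; they are not the master registry or handler API.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// ReadinessView is a hypothetical single source of readiness truth.
type ReadinessView struct {
	mu      sync.RWMutex
	healthy map[string]bool // replica address -> publish healthy
}

func NewReadinessView() *ReadinessView {
	return &ReadinessView{healthy: map[string]bool{}}
}

// Set records whether a replica has reached publish-healthy.
func (v *ReadinessView) Set(addr string, publishHealthy bool) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.healthy[addr] = publishHealthy
}

// published is the only place that decides what may be surfaced.
func (v *ReadinessView) published() []string {
	v.mu.RLock()
	defer v.mu.RUnlock()
	out := []string{}
	for addr, ok := range v.healthy {
		if ok {
			out = append(out, addr)
		}
	}
	sort.Strings(out)
	return out
}

// The three surfaces below are projections of the same decision, so they
// cannot disagree about what "healthy" means.
func (v *ReadinessView) HeartbeatReplicas() []string { return v.published() }
func (v *ReadinessView) LookupReplicas() []string    { return v.published() }
func (v *ReadinessView) TesterHealthy(addr string) bool {
	for _, a := range v.published() {
		if a == addr {
			return true
		}
	}
	return false
}

func main() {
	view := NewReadinessView()
	view.Set("replica-a", false) // assigned, not yet publish healthy
	fmt.Println(view.LookupReplicas(), view.TesterHealthy("replica-a"))
	view.Set("replica-a", true)
	fmt.Println(view.LookupReplicas(), view.TesterHealthy("replica-a"))
}
```
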
Reuse discipline:
1. prefer `weed/server/*` and bridge-layer updates first because this is a surfaced control/runtime issue
2. update `weed/storage/blockvol/*` only if closure work exposes a concrete backend bug rather than a publication-path contradiction
3. keep `CP13-1..7` semantics fixed unless the closure work exposes a live contradiction
4. no checkpoint work may silently broaden into `CP13-9` mode policy or broad rollout claims
Verification mechanism:
1. one focused proof set around assignment lifecycle closure and readiness/publication gating
2. explicit tests that heartbeat / lookup / tester surfaces do not publish a replica before readiness closes
3. bounded `CP13-8` rerun or equivalent evidence showing the contradiction moves from mixed-state ambiguity to an attributable remaining cause
4. no-overclaim review so `CP13-8A` does not absorb `CP13-9`
Hard indicators:
1. one accepted lifecycle proof:
- assignment processing uses one authoritative path from role apply through runtime wiring
2. one accepted readiness proof:
- replica-ready is explicit and not inferred from mere existence/allocation
3. one accepted publication proof:
- lookup / heartbeat / tester gates do not publish a replica before readiness closure
4. one accepted boundedness proof:
- `CP13-8A` claims closure only and leaves broader mode policy untouched
Reject if:
1. assignment still reaches different semantic outcomes depending on whether it flows through heartbeat/store-only or service-level processing
2. a replica can still be surfaced as healthy/ready before receiver/session readiness closes
3. the slice relies on delays or ad hoc retries rather than explicit readiness semantics
4. delivery wording broadens into `CP13-9` mode normalization, launch approval, or generic backend replacement
Status:
- active
### Later checkpoints inside `Phase 13`
1. `CP13-9`: mode normalization (only after `CP13-8A` closes the assignment/publication contradiction)
## Reuse Discipline

@@ -7,6 +7,7 @@ Historical planning/review documents were moved to `../docs/archive/design/` to
## Read First
- `v2-protocol-truths.md`
- `v2-protocol-claim-and-evidence.md`
- `v2-product-completion-overview.md`
- `v2-phase-development-plan.md`
- `v2-semantic-methodology.zh.md`
@@ -27,6 +28,7 @@ Historical planning/review documents were moved to `../docs/archive/design/` to
- `v2_scenarios.md`
- `v2-scenario-sources-from-v1.md`
- `v1-v15-v2-comparison.md`
- `v2-reuse-replacement-boundary.md`
- `wal-replication-v2.md`
- `wal-replication-v2-state-machine.md`
- `wal-replication-v2-orchestrator.md`

@@ -0,0 +1,145 @@
# V2 Protocol Claim And Evidence
Date: 2026-04-03
Status: active
Purpose: keep one centralized ledger for the current chosen envelope, accepted claims, supporting evidence, invalidated evidence, and rerun obligations
## Why This Document Exists
`v2-protocol-truths.md` records stable protocol truths.
`v2-protocol-closure-map.zh.md` records the structural closure model.
What they do not track in one place is the current operational contract:
1. which claims are allowed right now
2. which baselines are accepted right now
3. which evidence supports each claim
4. which evidence has been narrowed or invalidated
5. which reruns are required before a claim can be restored
This document is that ledger.
## How To Use It
When reviewing any new slice, bug fix, workload run, or delivery note, ask:
1. which current claim does this change strengthen, narrow, or invalidate?
2. which evidence row should be updated?
3. does the change alter the current chosen envelope?
4. does any old claim now require rerun or reclassification?
If the answer changes the current state of the product, update this ledger in the same change.
## Current Chosen Envelope
This is the bounded envelope currently allowed for active V2 claims:
| Item | Current value | Source |
|------|---------------|--------|
| Replication factor | `RF=2` | `v2-protocol-closure-map.zh.md` |
| Durability mode | `sync_all` | `v2-protocol-closure-map.zh.md`, `Phase 13` |
| Control path | current master / volume-server heartbeat path | `v2-protocol-closure-map.zh.md` |
| Execution backend | `blockvol` | `v2-protocol-closure-map.zh.md`, `v2-reuse-replacement-boundary.md` |
| Frontend in active validation | iSCSI | `Phase 11`, `CP13-8` |
| Real-workload checkpoint | `CP13-8` | `Phase 13` |
Current explicit exclusions:
1. `RF>2` as a general accepted product claim
2. broad mode normalization before `CP13-9`
3. broad rollout / launch approval
4. broad transport matrix claims outside explicitly named evidence
5. treating synthetic benchmarks as substitutes for real workload validation
## Active Protocol Constraints
These are the currently binding constraints that later work must preserve.
| ID | Constraint | Source | Current status |
|----|------------|--------|----------------|
| `T1` | `CommittedLSN` is the external truth boundary | `v2-protocol-truths.md` | active |
| `T9` | truncation is a protocol boundary, not cleanup | `v2-protocol-truths.md` | active |
| `T14` | engine remains recovery authority; storage remains truth source | `v2-protocol-truths.md` | active |
| `T15` | reuse reality, not inherited semantics | `v2-protocol-truths.md` | active |
| `CP13-2` | stable identity must not be inferred from transport address shape | `Phase 13` | active |
| `CP13-3` | durable authority is `replicaFlushedLSN`, not legacy success inference | `Phase 13` | active |
| `CP13-4` | only eligible replica state may satisfy sync durability | `Phase 13` | active |
| `CP13-5` | reconnect must use explicit handshake / catch-up semantics | `Phase 13` | active |
| `CP13-6` | retention must fail closed for lagging replicas | `Phase 13` | active |
| `CP13-7` | unrecoverable gap must escalate to `NeedsRebuild` and block normal paths | `Phase 13` | active |
| `CP13-8A` | assignment delivered != receiver ready != publish healthy | `Phase 13` | active |
## Accepted Baselines
| Baseline | What it is allowed to say | Evidence location | Current validity |
|----------|---------------------------|-------------------|------------------|
| `CP13-1` replication baseline inventory | which tests originally passed/failed/`PASS*` before `CP13-2..7` closure | `sw-block/.private/phase/phase-13-cp1-baseline.md` | valid as baseline inventory, not as final product claim |
| `Phase 12 P4` bounded floor | one bounded performance floor and rollout-gate package on the accepted chosen path | `sw-block/.private/phase/phase-12-p4-floor.md`, `phase-12-p4-rollout-gates.md` | valid inside its named envelope |
| real-workload envelope draft | one bounded `ext4 + pgbench` package for `CP13-8` | `sw-block/.private/phase/phase-13-cp8-workload-validation.md` | active draft; full claim pending rerun after blockers close |
## Allowed Claims
These are the claims that may currently be made without overreach.
| Claim ID | Allowed claim | Scope boundary | Evidence anchor | Status |
|----------|---------------|----------------|-----------------|--------|
| `C-RF2-SYNCALL-CONTRACT` | the accepted `RF=2 sync_all` replication contract is closed at protocol/unit/adversarial level through `CP13-1..7` | protocol/unit/adversarial evidence only | `Phase 13` docs and tests | allowed |
| `C-WORKLOAD-DRAFT` | one bounded real-workload validation package is defined for `CP13-8` | package definition only, not final pass claim | `phase-13-cp8-workload-validation.md`, YAML scenario | allowed |
| `C-WORKLOAD-PASS` | the bounded real-workload package passes on the chosen path | only after rerun succeeds on corrected path | `CP13-8` rerun artifact | not yet allowed |
| `C-ADAPTER-CLOSURE` | assignment / readiness / publication closure is explicit on the chosen path | only after `CP13-8A` acceptance | `CP13-8A` proof package | in progress |
| `C-MODE-NORMALIZATION` | mode policy / normalization is closed | only in `CP13-9` or later | future | not allowed |
| `C-LAUNCH-APPROVAL` | broad product launch readiness | outside current phase | future | not allowed |
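
The Status column above can be treated as data that a review step checks before a delivery note makes a claim. This is a sketch only: the `ledger` map copies the table as written in this document, and `MayClaim` is a hypothetical helper, not an existing tool.

```go
package main

import "fmt"

// ClaimStatus uses the same wording as the Status column of the table.
type ClaimStatus string

const (
	StatusAllowed       ClaimStatus = "allowed"
	StatusNotYetAllowed ClaimStatus = "not yet allowed"
	StatusInProgress    ClaimStatus = "in progress"
	StatusNotAllowed    ClaimStatus = "not allowed"
)

// ledger mirrors the Allowed Claims table at the time of this document.
var ledger = map[string]ClaimStatus{
	"C-RF2-SYNCALL-CONTRACT": StatusAllowed,
	"C-WORKLOAD-DRAFT":       StatusAllowed,
	"C-WORKLOAD-PASS":        StatusNotYetAllowed,
	"C-ADAPTER-CLOSURE":      StatusInProgress,
	"C-MODE-NORMALIZATION":   StatusNotAllowed,
	"C-LAUNCH-APPROVAL":      StatusNotAllowed,
}

// MayClaim reports whether a claim may be made right now.
// IDs without a row are rejected, so a new claim needs a ledger row first.
func MayClaim(id string) bool { return ledger[id] == StatusAllowed }

func main() {
	for _, id := range []string{"C-WORKLOAD-DRAFT", "C-WORKLOAD-PASS", "C-UNKNOWN"} {
		fmt.Printf("%s allowed=%v\n", id, MayClaim(id))
	}
}
```
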
## Evidence Map
| Evidence area | What it proves | Primary evidence | Support evidence |
|---------------|----------------|------------------|------------------|
| Identity / addressing | stable identity and routable publication | `CP13-2` tests and docs | `qa_block_soak_test.go`, `sync_all_bug_test.go` |
| Durable progress | barrier durability truth and non-legacy authority | `CP13-3` tests and docs | protocol tests around barrier handling |
| State eligibility | only eligible replica state may satisfy sync durability | `CP13-4` tests and docs | adversarial state tests |
| Reconnect / catch-up | reconnect uses handshake/catch-up rather than bootstrap | `CP13-5` tests and docs | adversarial reconnect tests |
| Retention | lagging replica retains WAL or escalates fail closed | `CP13-6` tests and docs | retention protocol tests |
| Rebuild fallback | unrecoverable gap escalates to `NeedsRebuild` and blocks normal paths | `CP13-7` tests and docs | rebuild tests |
| Performance floor | one bounded measured floor and rollout-gate package | `Phase 12 P4` docs/tests | cited baseline artifact |
| Real-workload package | one bounded workload matrix exists | `CP13-8` scenario/doc | tester validation reports |
| Assignment/publication closure | assignment does not imply readiness/publication | `CP13-8A` code/tests/debug evidence | tester investigation, bug docs |
## Invalidated Or Narrowed Evidence
This section records evidence that cannot currently be used at full strength.
| ID | Affected claim/evidence | Narrowing reason | Scope | Action required |
|----|-------------------------|------------------|-------|-----------------|
| `INV-CP13-8A-01` | any weed-VS scenario claim that `block_promote` preserved replication automatically | promote path could leave new primary without replica shipper wiring; barrier then became vacuous with `0` shippers | recent weed-VS testrunner scenarios using `block_promote` | rerun after fix |
| `INV-CP13-8A-02` | bounded real-workload `CP13-8` pass claim | blocked by assignment/publication contradiction and then by promote/shipper closure issue | `CP13-8` only | rerun after `CP13-8A` blocker fixes |
| `INV-CLAIM-SPREAD-01` | claims embedded only in phase delivery notes | phase docs are not a reliable centralized current-state ledger | all scattered phase notes | migrate ongoing claim state here |
Unaffected evidence currently believed to remain valid:
1. standalone `iscsi-target` scenarios that used direct `assign + set_replica` wiring rather than weed-VS `block_promote`
2. protocol/unit/adversarial evidence from accepted `CP13-1..7`
3. performance-only scenarios that did not claim active cross-node replication through the broken promote path
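
`INV-CP13-8A-01` describes a barrier that passed vacuously with `0` shippers. The sketch below illustrates that shape and one fail-closed guard against it. The `Shipper` interface, `SyncAllBarrier`, and `fakeShipper` are hypothetical stand-ins, and whether the real barrier should fail or report degraded in this situation is a policy the document does not state.

```go
package main

import (
	"errors"
	"fmt"
)

// Shipper is a hypothetical stand-in for a replica shipper.
type Shipper interface {
	Barrier(lsn uint64) error
}

// ErrMissingShippers marks a barrier that would otherwise pass vacuously.
var ErrMissingShippers = errors.New("fewer shippers than expected replicas")

// SyncAllBarrier returns nil only if every expected replica acknowledged lsn.
// Without the length check, an empty slice skips the loop and returns nil,
// which is the vacuous pass. For RF=2 the caller would pass expectedReplicas=1.
func SyncAllBarrier(shippers []Shipper, expectedReplicas int, lsn uint64) error {
	if len(shippers) < expectedReplicas {
		return fmt.Errorf("%w: have %d, expected %d",
			ErrMissingShippers, len(shippers), expectedReplicas)
	}
	for i, s := range shippers {
		if err := s.Barrier(lsn); err != nil {
			return fmt.Errorf("shipper %d: %w", i, err)
		}
	}
	return nil
}

// fakeShipper acknowledges every barrier unless fail is set.
type fakeShipper struct{ fail bool }

func (f fakeShipper) Barrier(lsn uint64) error {
	if f.fail {
		return errors.New("replica did not acknowledge")
	}
	return nil
}

func main() {
	fmt.Println("no shippers: ", SyncAllBarrier(nil, 1, 42))
	fmt.Println("one shipper: ", SyncAllBarrier([]Shipper{fakeShipper{}}, 1, 42))
}
```
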
## Open Contradictions And Blockers
| ID | Blocker | Current classification | Impact |
|----|---------|------------------------|--------|
| `BUG-CP13-8A-ADDR` | malformed/mock replica addresses in some QA allocators | test/adapter bug | narrows affected QA evidence; does not by itself close real workload |
| `BUG-CP13-8A-RECV-IDEMP` | repeated assignment delivery restarted replica receiver and hit bind conflict | adapter/runtime bug | blocks weed-VS replica from leaving degraded state until fixed |
| `BUG-CP13-8A-PROMOTE-SHIPPER` | post-promote assignment could leave new primary with no replica shipper configured | master/adapter bug | invalidates weed-VS `block_promote` replication claims until rerun |
| `CP13-8` | real bounded workload package still needs corrected rerun | blocked by `CP13-8A` issues | blocks real-workload pass claim |
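
`BUG-CP13-8A-RECV-IDEMP` is the case where a repeated assignment restarted the receiver and hit a bind conflict. A minimal sketch of idempotent handling follows, assuming a hypothetical `ReceiverManager`; the real listener start/stop is replaced by a counter, and the address shown is an arbitrary example.

```go
package main

import (
	"fmt"
	"sync"
)

// ReceiverManager is a hypothetical owner of replica receivers on one server.
type ReceiverManager struct {
	mu      sync.Mutex
	running map[string]string // volume name -> address currently bound
	starts  int               // count of real starts, kept for observation
}

func NewReceiverManager() *ReceiverManager {
	return &ReceiverManager{running: map[string]string{}}
}

// Ensure makes sure a receiver for volume is bound to addr. It returns true
// only when it actually started one. Repeated delivery of the same assignment
// is a no-op, so the existing listener is never re-bound.
func (m *ReceiverManager) Ensure(volume, addr string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if cur, ok := m.running[volume]; ok && cur == addr {
		return false
	}
	// A changed address would need the old listener stopped first; that step
	// is omitted in this sketch.
	m.running[volume] = addr
	m.starts++
	return true
}

// Starts reports how many times a receiver was really started.
func (m *ReceiverManager) Starts() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.starts
}

func main() {
	mgr := NewReceiverManager()
	fmt.Println(mgr.Ensure("vol1", "127.0.0.1:4000")) // first delivery
	fmt.Println(mgr.Ensure("vol1", "127.0.0.1:4000")) // repeated delivery
	fmt.Println("starts:", mgr.Starts())
}
```
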
## Rerun Queue
| Priority | Item | Why rerun is needed | Exit condition |
|----------|------|---------------------|----------------|
| `P0` | `CP13-8` bounded real-workload scenario | current pass claim is not yet allowed after `CP13-8A` blockers | bounded rerun passes or fails with attributable remaining cause |
| `P0` | weed-VS scenarios using `block_promote` from the recent testrunner enhancement work | prior replication interpretation may have been vacuous (`0` shippers) | affected scenarios are reclassified or rerun |
| `P1` | any recent degraded/perf interpretation derived from broken weed-VS promote path | performance interpretation may be based on RF=1 semantics | audit updated and affected numbers rerun or narrowed |
## Maintenance Rules
1. do not add a new claim anywhere else without adding or updating the corresponding row here
2. when a bug narrows evidence, record the invalidation here in the same change
3. when a rerun restores a claim, move the row from `Invalidated Or Narrowed Evidence` to `Allowed Claims` or update its status
4. keep this document bounded to the active chosen path; do not turn it into a future roadmap

View File

@@ -21,6 +21,10 @@
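- `V2` is not a scattered set of patches
- it is a protocol closure built up step by step within explicit boundaries
Which claims hold under the current chosen path, where the corresponding evidence lives, and which evidence has been narrowed or invalidated are not maintained in this document; see
`v2-protocol-claim-and-evidence.md`
The "closure" here is scoped.
The current default boundary is still: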

View File

@@ -18,6 +18,10 @@ So the most important output to carry forward is not only code, but:
This document is the compact truth table for the V2 line.
Current chosen-envelope claims, accepted baselines, evidence mappings, and
evidence invalidations are tracked separately in
`v2-protocol-claim-and-evidence.md`.
## How To Use It
For each later phase or slice, ask:

View File

@@ -0,0 +1,178 @@
# V2 Reuse vs Replacement Boundary
Date: 2026-04-03
Status: active
## Purpose
This note makes one architectural split explicit for the current chosen path:
1. what we reuse from the existing `blockvol`/`weed` stack as mechanics
2. what must be owned by `V2` as semantic authority
3. what sits in the adapter boundary between them
The goal is to stop `V1` mixed control/data state from silently redefining `V2`
behavior through convenience wiring.
Scope is still bounded to:
1. `RF=2`
2. `sync_all`
3. current master / volume-server heartbeat path
4. `blockvol` as the execution backend
## Boundary Rule
`V1` reuse is allowed for execution mechanics.
`V2` replacement is required for semantic authority.
If a change decides protocol meaning, failover meaning, durability meaning, or
external publication meaning, it belongs to a `V2`-owned layer even if the
underlying I/O still runs through reused `blockvol` code.
This is the practical interpretation of:
- `v2-protocol-truths.md` `T14`: engine remains recovery authority
- `v2-protocol-truths.md` `T15`: reuse reality, not inherited semantics
## Three Buckets
### 1. Reusable V1 Core
These components remain useful as mechanics:
| Area | Files | What stays reusable |
|------|-------|---------------------|
| Local storage truth | `weed/storage/blockvol/blockvol.go`, `flusher.go`, `rebuild.go`, WAL/extent helpers | WAL append, flush, checkpoint, dirty-map, extent install |
| Replica transport | `weed/storage/blockvol/replica_apply.go`, `wal_shipper.go`, `shipper_group.go`, `dist_group_commit.go`, `repl_proto.go` | TCP receiver/shipper mechanics, barrier transport, replay/apply |
| Frontend serving | `weed/storage/blockvol/iscsi/`, `weed/storage/blockvol/nvme/` | block-device serving once a local volume is authoritative |
| Local role guardrails | `weed/storage/blockvol/promotion.go`, `role.go` | drain, lease revoke, local role gate enforcement |
Rule:
- these layers execute I/O and transport
- they do not decide whether a replica is eligible, authoritative, published, or healthy in the `V2` sense
### 2. Adapter Boundary
These components translate `V2` truth into concrete runtime wiring:
| Area | Files | Responsibility |
|------|-------|----------------|
| Assignment ingest | `weed/server/volume_server_block.go` | authoritative assignment lifecycle for role apply, receiver/shipper wiring, readiness closure |
| Heartbeat/runtime loop | `weed/server/block_heartbeat_loop.go` | collect/report status and process assignments through the same lifecycle |
| Local store helper | `weed/storage/store_blockvol.go` | local volume open/close/iteration; no longer the authoritative assignment lifecycle |
| Bridge | `weed/storage/blockvol/v2bridge/control.go` | convert service/control truth into engine intents |
Rule:
- the adapter boundary may reuse `blockvol` primitives
- it must name and own lifecycle closure states explicitly
- it must not let store-only role application masquerade as ready publication
### 3. V2-Owned Replacement
These areas define truth and therefore must remain `V2`-owned:
| Area | Files | Responsibility |
|------|-------|----------------|
| Control and identity truth | `sw-block/engine/replication/`, `weed/storage/blockvol/v2bridge/control.go` | assignment truth, stable identity, session truth |
| Recovery ownership | `weed/server/block_recovery.go` | live runtime owner for catch-up/rebuild tasks |
| Publication and health closure | `weed/server/master_block_registry.go`, `weed/server/master_block_failover.go` | what the system reports as ready, degraded, publishable |
| External product surfaces | `weed/server/master_grpc_server_block.go`, `weed/server/master_server_handlers_block.go`, debug/diagnostic surfaces | operator-visible truth, not convenience guesses |
Rule:
- if the system exposes a condition to master, tester, CSI, or operator tooling, that condition must come from `V2`-named state
## Assignment-To-Readiness Lifecycle
The authoritative lifecycle for the current chosen path is:
```text
assignment delivered
-> local role applied
-> replica receiver or primary shipper configured
-> readiness closed
-> heartbeat publication
-> master registry health/publication
```
More concretely:
1. master intent is delivered
2. `BlockService.ApplyAssignments()` applies local role truth
3. the same path wires receiver/shipper runtime
4. the same path records named readiness state
5. heartbeat publishes only what is actually publish-healthy
6. master registry derives lookup/health from explicit readiness, not from allocation alone
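The lifecycle above can be read as an ordered sequence in which each stage closes only after the one before it. The following is a minimal sketch of that ordering, not the repository's implementation: `Stage`, `Lifecycle`, and `Advance` are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

// Stage is a hypothetical enumeration of the lifecycle steps listed above.
type Stage int

const (
	AssignmentDelivered Stage = iota
	RoleApplied
	TransportConfigured // replica receiver or primary shipper configured
	ReadinessClosed
	HeartbeatPublished
	RegistryPublished
)

var stageNames = [...]string{
	"assignment delivered", "local role applied", "receiver/shipper configured",
	"readiness closed", "heartbeat publication", "master registry health/publication",
}

func (s Stage) String() string { return stageNames[s] }

// Lifecycle tracks the furthest stage closed so far for one volume.
type Lifecycle struct{ reached Stage }

// Advance closes the next stage only if it directly follows the current one.
// Skipping a stage (for example publishing before readiness closes) is rejected.
func (l *Lifecycle) Advance(next Stage) error {
	if next > RegistryPublished || next != l.reached+1 {
		return fmt.Errorf("cannot move from stage %d to stage %d", l.reached, next)
	}
	l.reached = next
	return nil
}

func main() {
	var l Lifecycle // zero value starts at AssignmentDelivered
	fmt.Println(l.reached, "->", RoleApplied, ":", l.Advance(RoleApplied))
	fmt.Println("skip to", HeartbeatPublished, ":", l.Advance(HeartbeatPublished))
}
```

The point of the sketch is that publication cannot be reached by a shortcut: an out-of-order transition is an error rather than a silent success.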
## Named Readiness States
For the current implementation slice, the service boundary now names:
1. `roleApplied`
2. `receiverReady`
3. `shipperConfigured`
4. `shipperConnected`
5. `replicaEligible`
6. `publishHealthy`
Ownership:
- owned by `BlockService` / adapter layer
- observed by debug surfaces and heartbeat/publication logic
- not delegated to `blockvol` as implicit mixed state
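One way to picture "owned by the adapter layer, observed by others" is a lock-guarded store that hands out copies. This is a sketch under assumptions: `readinessStore`, `Update`, and `Snapshot` are hypothetical names, and only the six flag names come from the list above.

```go
package main

import (
	"fmt"
	"sync"
)

// Readiness mirrors the six named states listed above.
type Readiness struct {
	RoleApplied       bool
	ReceiverReady     bool
	ShipperConfigured bool
	ShipperConnected  bool
	ReplicaEligible   bool
	PublishHealthy    bool
}

// readinessStore is a hypothetical adapter-owned holder keyed by volume path.
type readinessStore struct {
	mu     sync.RWMutex
	states map[string]Readiness
}

// Update applies fn to the stored state for path under the write lock.
func (s *readinessStore) Update(path string, fn func(*Readiness)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.states == nil {
		s.states = make(map[string]Readiness)
	}
	r := s.states[path]
	fn(&r)
	s.states[path] = r
}

// Snapshot returns a copy, so observers cannot mutate owned state.
// An unknown path yields the zero value: nothing is ready by default.
func (s *readinessStore) Snapshot(path string) Readiness {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.states[path]
}

func main() {
	var store readinessStore
	store.Update("/blocks/vol-a.blk", func(r *Readiness) { r.RoleApplied = true })

	snap := store.Snapshot("/blocks/vol-a.blk")
	snap.PublishHealthy = true // changes only the observer's copy

	fmt.Printf("%+v\n", store.Snapshot("/blocks/vol-a.blk"))
	fmt.Printf("%+v\n", store.Snapshot("/blocks/unknown.blk"))
}
```

Returning a value rather than a pointer keeps debug surfaces and heartbeat logic read-only with respect to the owner.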
## Current File Map
### Reuse
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/flusher.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/wal_shipper.go`
- `weed/storage/blockvol/shipper_group.go`
- `weed/storage/blockvol/dist_group_commit.go`
- `weed/storage/blockvol/iscsi/`
- `weed/storage/blockvol/nvme/`
### Adapter boundary
- `weed/server/volume_server_block.go`
- `weed/server/block_heartbeat_loop.go`
- `weed/storage/store_blockvol.go`
- `weed/server/volume_server_block_debug.go`
### V2-owned replacement / truth
- `weed/storage/blockvol/v2bridge/control.go`
- `sw-block/engine/replication/`
- `weed/server/block_recovery.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_block_failover.go`
- `weed/server/master_grpc_server_block.go`
- `weed/server/master_server_handlers_block.go`
## Immediate Engineering Rule
When a new bug appears, classify it first:
1. `v1 reusable core`: local storage or transport mechanics
2. `adapter boundary`: assignment/readiness/publication closure bug
3. `v2 replacement`: semantic authority, identity, ownership, eligibility, rebuild, or operator-visible truth
Do not patch semantic authority directly into `blockvol` unless the same change is
also reflected as an explicit `V2` state/rule at the service or registry layer.
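As a first-pass aid for the classification step, the file map above can be turned into a lookup. This is only a sketch: the file where a bug surfaces is a starting hint, not a substitute for deciding which layer owns the semantics. `Classify` and `Bucket` are hypothetical names, and the sample paths in `main` are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

type Bucket string

const (
	ReusableCore Bucket = "v1 reusable core"
	Adapter      Bucket = "adapter boundary"
	V2Owned      Bucket = "v2 replacement"
	Unclassified Bucket = "unclassified"
)

// fileMap is copied from the "Current File Map" section.
// Entries ending in "/" are directories and match by prefix.
var fileMap = map[string]Bucket{
	"weed/storage/blockvol/blockvol.go":           ReusableCore,
	"weed/storage/blockvol/flusher.go":            ReusableCore,
	"weed/storage/blockvol/replica_apply.go":      ReusableCore,
	"weed/storage/blockvol/wal_shipper.go":        ReusableCore,
	"weed/storage/blockvol/shipper_group.go":      ReusableCore,
	"weed/storage/blockvol/dist_group_commit.go":  ReusableCore,
	"weed/storage/blockvol/iscsi/":                ReusableCore,
	"weed/storage/blockvol/nvme/":                 ReusableCore,
	"weed/server/volume_server_block.go":          Adapter,
	"weed/server/block_heartbeat_loop.go":         Adapter,
	"weed/storage/store_blockvol.go":              Adapter,
	"weed/server/volume_server_block_debug.go":    Adapter,
	"weed/storage/blockvol/v2bridge/control.go":   V2Owned,
	"sw-block/engine/replication/":                V2Owned,
	"weed/server/block_recovery.go":               V2Owned,
	"weed/server/master_block_registry.go":        V2Owned,
	"weed/server/master_block_failover.go":        V2Owned,
	"weed/server/master_grpc_server_block.go":     V2Owned,
	"weed/server/master_server_handlers_block.go": V2Owned,
}

// Classify returns the bucket of the longest matching entry.
func Classify(path string) Bucket {
	best, bucket := 0, Unclassified
	for entry, b := range fileMap {
		match := path == entry ||
			(strings.HasSuffix(entry, "/") && strings.HasPrefix(path, entry))
		if match && len(entry) > best {
			best, bucket = len(entry), b
		}
	}
	return bucket
}

func main() {
	for _, p := range []string{
		"weed/storage/blockvol/wal_shipper.go",
		"weed/server/volume_server_block.go",
		"sw-block/engine/replication/session.go",
		"weed/storage/blockvol/testrunner/actions/devops.go",
	} {
		fmt.Printf("%-52s %s\n", p, Classify(p))
	}
}
```

Paths outside the map come back as `unclassified` on purpose, so that a new file forces an explicit decision instead of inheriting a bucket by accident.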
## Why This Matters For CP13-8
`CP13-8` found the exact class of bug this split is meant to expose:
- allocation/control truth said the replica existed
- but runtime publication/read visibility was not yet closed
That is not a reason to throw away `blockvol`.
It is a reason to stop treating mixed `V1` runtime state as if it were already
closed `V2` publication truth.

View File

@@ -0,0 +1,767 @@
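## 1. Design Goals
The goal is not to rewrite `blockvol` immediately, but to establish a **pure `V2` semantic core** that becomes the system's sole semantic authority:
- `V2 core` defines truth, state, events, and decisions
- `adapter` translates external input into `V2` events and translates `V2` decisions into runtime/backend calls
- `V1 backend` keeps only execution capability and no longer interprets protocol semantics
The current chosen path is unchanged:
- `RF=2`
- `sync_all`
- the current master / volume-server heartbeat path
- `blockvol` as the execution backend
This is consistent with the existing design and is not a change of direction. It only makes explicit what `Phase 1-13` has been doing all along.
---
## 2. Core Layering
### 2.1 `V2 Core`
Responsibilities:
- hold control truth
- hold recovery truth
- hold data-boundary truth
- hold external publication truth
- make decisions based on events
- emit commands / projections
### 2.2 Adapter Boundary
Responsibilities:
- translate master / heartbeat / runtime observations into `V2 event`s
- translate `V2 command`s into calls on `blockvol` / transport / frontend
- translate backend/runtime facts into `projection`s
### 2.3 `V1 Backend`
Responsibilities:
- WAL / extent / dirty map / flusher
- receiver / shipper transport
- iSCSI / NVMe frontend
- rebuild install primitive
Rules:
- the backend reports facts
- the core interprets semantics
- the adapter translates
- the mixed state of `blockvol` must not overstep and directly become system truth
---
## 3. `V2 Core` Minimal Object List
Below is the "minimum viable" set of `struct / event / command / projection`.
### 3.1 Structs
#### A. Control structs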
```go
type VolumeIntent struct {
VolumeID string
Epoch uint64
PrimaryID string
ReplicaIDs []string
DurabilityMode string
}
type ReplicaIdentity struct {
ReplicaID string
ServerID string
}
type AssignmentView struct {
VolumeID string
Epoch uint64
Role RoleIntent
ReplicaEndpoints map[string]Endpoint
}
```
Responsibilities:
- who the primary is
- who the replica is
- the current epoch
- what the stable identity is
- what the assignment intent actually is
#### B. Recovery structs
```go
type ReplicaState string
const (
StateDisconnected ReplicaState = "disconnected"
StateConnecting ReplicaState = "connecting"
StateCatchingUp ReplicaState = "catching_up"
StateInSync ReplicaState = "in_sync"
StateDegraded ReplicaState = "degraded"
StateNeedsRebuild ReplicaState = "needs_rebuild"
)
type ReplicaSession struct {
SessionID string
ReplicaID string
Epoch uint64
Kind SessionKind
Active bool
Superseded bool
}
type RecoveryOwner struct {
ReplicaID string
SessionID string
Running bool
}
```
Responsibilities:
- what the current replica's recovery state is
- which session is current
- who holds recovery authority
- whether the old session has been invalidated
#### C. Data-boundary structs
```go
type BoundaryView struct {
CommittedLSN uint64
CheckpointLSN uint64
WALHeadLSN uint64
ReceivedLSN uint64
TargetLSN uint64
AchievedLSN uint64
SnapshotBaseLSN uint64
}
```
Responsibilities:
- durability boundary
- catch-up target
- rebuild target
- stable base image
- boundary actually achieved
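The boundary fields exist so that recovery can be chosen from explicit LSNs rather than guessed. Below is a minimal sketch of such a decision, reusing the `ReplicaState` values defined above. It rests on assumptions not stated in this document: `planRecovery` is a hypothetical helper, and `walTailLSN` (the oldest LSN still retained in the primary WAL) is an assumed input that is not a field of `BoundaryView`.

```go
package main

import "fmt"

type ReplicaState string

const (
	StateCatchingUp   ReplicaState = "catching_up"
	StateInSync       ReplicaState = "in_sync"
	StateNeedsRebuild ReplicaState = "needs_rebuild"
)

// planRecovery picks a recovery path for one replica.
//
//	replicaFlushedLSN: last LSN the replica reports as durable
//	walTailLSN:        oldest LSN still retained by the primary (assumed input)
//	walHeadLSN:        newest LSN in the primary WAL
func planRecovery(replicaFlushedLSN, walTailLSN, walHeadLSN uint64) ReplicaState {
	switch {
	case replicaFlushedLSN >= walHeadLSN:
		// Nothing is missing.
		return StateInSync
	case replicaFlushedLSN+1 >= walTailLSN:
		// Every missing entry is still retained, so WAL replay can close the gap.
		return StateCatchingUp
	default:
		// Part of the gap has been reclaimed: fail closed and rebuild.
		return StateNeedsRebuild
	}
}

func main() {
	fmt.Println(planRecovery(100, 40, 100))
	fmt.Println(planRecovery(70, 40, 100))
	fmt.Println(planRecovery(39, 40, 100))
	fmt.Println(planRecovery(30, 40, 100))
}
```

The default branch is the fail-closed one: when the gap cannot be proven recoverable from retained WAL, the sketch escalates to rebuild instead of leaving the replica in an ambiguous lagging state.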
#### D. Publication structs
```go
type ReadinessView struct {
RoleApplied bool
ReceiverReady bool
ShipperConfigured bool
ShipperConnected bool
ReplicaEligible bool
PublishHealthy bool
}
type LookupProjection struct {
VolumeID string
PrimaryServer string
ISCSIAddr string
ReplicaReady bool
ReplicaDegraded bool
}
```
Responsibilities:
- when publishing is allowed
- what lookup/heartbeat should see
- keeping "exists" separate from "ready"
---
## 4. `V2 Core` Minimal Event List
### 4.1 Control events
```go
AssignmentDelivered
EpochBumped
RepeatedAssignmentDelivered
IdentityResolved
```
### 4.2 Recovery events
```go
SessionCreated
SessionSuperseded
SessionRemoved
CatchUpPlanned
CatchUpCompleted
RebuildStarted
RebuildCommitted
```
### 4.3 Runtime observation events
```go
RoleApplied
ReceiverStarted
ReceiverReadyObserved
ShipperConfiguredObserved
ShipperConnectedObserved
BarrierAccepted
BarrierRejected
```
### 4.4 Data-boundary events
```go
CommittedLSNAdvanced
CheckpointLSNAdvanced
ReceivedLSNAdvanced
AchievedLSNAdvanced
RetentionEscalated
```
### 4.5 Publication events
```go
HeartbeatCollected
PublicationProjected
ReplicaMarkedReady
ReplicaMarkedDegraded
```
Principles:
- an event is an observation or an intent
- an event is not a promise of a direct result
- `V2 core` must decide the semantic meaning of each event
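The distinction between "an event happened" and "a state is now true" can be sketched as a small reducer. The rules inside `Apply` are illustrative assumptions, not the accepted protocol rules; the `Ev` prefix is added here only to keep event names apart from the struct fields.

```go
package main

import "fmt"

type Event string

const (
	EvRoleApplied           Event = "RoleApplied"
	EvReceiverStarted       Event = "ReceiverStarted"
	EvReceiverReadyObserved Event = "ReceiverReadyObserved"
)

// ReadinessView is reduced to the two fields this sketch needs.
type ReadinessView struct {
	RoleApplied   bool
	ReceiverReady bool
}

// Apply returns the view after the core has interpreted one event.
func Apply(v ReadinessView, e Event) ReadinessView {
	switch e {
	case EvRoleApplied:
		v.RoleApplied = true
	case EvReceiverStarted:
		// Starting a listener is not proof that it is ready: no state change.
	case EvReceiverReadyObserved:
		// The observation counts only once the role has been applied.
		if v.RoleApplied {
			v.ReceiverReady = true
		}
	}
	return v
}

func main() {
	var v ReadinessView
	events := []Event{
		EvReceiverStarted,
		EvReceiverReadyObserved, // arrives too early and is ignored
		EvRoleApplied,
		EvReceiverReadyObserved,
	}
	for _, e := range events {
		v = Apply(v, e)
		fmt.Printf("%-22s -> %+v\n", e, v)
	}
}
```

The same observation produces different outcomes depending on the state it arrives in, which is exactly why the meaning belongs to the core and not to the component that emitted the event.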
---
## 5. `V2 Core` Minimal Command List
These commands are emitted by `V2 core` to the adapter; they do not touch the backend directly.
```go
type Command interface{}
type ApplyRoleCommand struct {
VolumeID string
Epoch uint64
Role RoleIntent
}
type StartReceiverCommand struct {
VolumeID string
DataAddr string
CtrlAddr string
}
type ConfigureShipperCommand struct {
VolumeID string
Replicas []ReplicaEndpoint
}
type StartCatchUpCommand struct {
ReplicaID string
TargetLSN uint64
}
type StartRebuildCommand struct {
ReplicaID string
RebuildAddr string
SnapshotBaseLSN uint64
}
type PublishProjectionCommand struct {
VolumeID string
Readiness ReadinessView
}
type InvalidateSessionCommand struct {
ReplicaID string
Reason string
}
```
Principles:
- a command only expresses "what should be done"
- how the backend does it is decided by the adapter
- a command does not depend on `blockvol` internal fields
---
## 6. `V2 Core` Minimal Projection List
A projection is what the outside world sees; it is not a dump of raw internal state.
### 6.1 Master / Lookup projection
- `PrimaryServer`
- `ISCSIAddr`
- `ReplicaReady`
- `ReplicaDegraded`
- `DurabilityMode`
### 6.2 Heartbeat projection
- current role / epoch
- boundary fields
- readiness fields
- transport degraded
- receiver published addr
### 6.3 Diagnostic projection
- active recovery tasks
- session ownership
- publish gating reason
- pending rebuild / deferred promotion reason
### 6.4 Tester projection
- `wait_volume_healthy` no longer checks only "replica exists"
- it must check `replica_ready`
- it must distinguish:
  - allocated
  - role applied
  - receiver ready
  - publish healthy
---
## 7. Direct Mapping To Repository Paths
The mapping below is organized by "core / adapter / backend".
### 7.1 `V2 Core`: Existing Foundation
These files already act as an early form of the core.
- `sw-block/engine/replication/registry.go`
  - `AssignmentIntent`
  - `AssignmentResult`
  - `ReplicaAssignment`
- `sw-block/engine/replication/session.go`
  - session ownership / lifecycle
- `sw-block/engine/replication/sender.go`
  - sender state abstraction
- `sw-block/engine/replication/orchestrator.go`
  - assignment -> session/recovery orchestration
- `sw-block/engine/replication/types.go`
- `sw-block/engine/replication/outcome.go`
- `sw-block/engine/replication/observe.go`
- `weed/server/block_recovery.go`
  - runtime owner
- `weed/server/master_block_registry.go`
  - publication truth / cluster registry truth
- `weed/server/master_block_failover.go`
  - failover/rebuild truth
### 7.2 Adapter Boundary: Existing Foundation
- `weed/storage/blockvol/v2bridge/control.go`
  - assignment -> engine intent
- `weed/server/volume_server_block.go`
  - assignment lifecycle / readiness closure / wiring
- `weed/server/block_heartbeat_loop.go`
  - heartbeat + assignment loop
- `weed/server/volume_server_block_debug.go`
  - readiness/debug projection
- `weed/server/master_grpc_server_block.go`
  - lookup projection
- `weed/server/master_server_handlers_block.go`
  - REST projection
### 7.3 `V1 Backend`: Existing Foundation
- `weed/storage/blockvol/blockvol.go`
- `weed/storage/blockvol/flusher.go`
- `weed/storage/blockvol/replica_apply.go`
- `weed/storage/blockvol/wal_shipper.go`
- `weed/storage/blockvol/shipper_group.go`
- `weed/storage/blockvol/dist_group_commit.go`
- `weed/storage/blockvol/rebuild.go`
- `weed/storage/blockvol/iscsi/`
- `weed/storage/blockvol/nvme/`
---
## 8. Near-Term Work
This covers what "should be done now", not medium- or long-term refactoring fantasies.
### 8.1 Make `V2 core` the sole semantic entry point for assignments
Goals:
- all assignments enter `V2 core` first
- `BlockService` no longer interprets much of the semantics itself
- `BlockService` behaves more like a command executor
Files directly involved:
- `sw-block/engine/replication/orchestrator.go`
- `weed/storage/blockvol/v2bridge/control.go`
- `weed/server/volume_server_block.go`
### 8.2 Make readiness a formal projection, not just internal service state
Goals:
- `roleApplied`
- `receiverReady`
- `shipperConfigured`
- `shipperConnected`
- `replicaEligible`
- `publishHealthy`
These states enter a stable projection instead of being visible only in debug output.
Files directly involved:
- `weed/server/volume_server_block.go`
- `weed/server/master_block_registry.go`
- `weed/server/master_grpc_server_block.go`
- `weed/server/master_server_handlers_block.go`
### 8.3 Unify `wait_volume_healthy`, lookup, and heartbeat on a single readiness definition
Goals:
- no longer allow:
  - registry says healthy
  - VS side not ready
  - tester misjudges ready
Files directly involved:
- `weed/storage/blockvol/testrunner/actions/devops.go`
- `weed/server/master_block_registry.go`
- `weed/server/volume_grpc_client_to_master.go`
### 8.4 Use `CP13-8` as the real validation of adapter/publication closure
Goals:
- determine whether the current failure is actually:
  - backend data bug
  - adapter timing/publication bug
  - core rule gap
This step must be done in the near term because it is a live contradiction.
---
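## 9. Medium-Term Evolution
### 9.1 Turn `V2 core` into a true command/event model
Goals:
- the engine emits commands
- the adapter executes commands
- the runtime returns events
- the core updates state
This is cleaner than the current pattern, in which `ProcessAssignments` both decides and executes.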
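### 9.2 Converge `master_block_registry` from "half business logic, half storage" into a projection store
Goals:
- the registry does not guess at semantics
- the registry stores `V2 projection`s
- the real semantic decisions live in the core
### 9.3 Put the backend behind interfaces
Candidate interfaces: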
```go
type StorageBackend interface {
StatusSnapshot() BoundaryView
SetRetentionFloor(...)
}
type TransportBackend interface {
StartReceiver(...)
ConfigureReplicas(...)
ShipperStates() ...
}
type RebuildBackend interface {
StartRebuild(...)
InstallSnapshot(...)
}
type FrontendBackend interface {
PublishISCSI(...)
PublishNVMe(...)
}
```
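### 9.4 Let `blockvol` gradually reduce to a pure backend
Goals:
- `blockvol` no longer holds system-level semantic authority
- it keeps local storage truth and execution capability
- all semantic interpretation is lifted into `V2 core`
---
## 10. How To Preserve The Established Constraints And Envelope
This part matters most.
### 10.1 Accepted constraints must be promoted to core invariants
The results of `CP13-1..7` must not live only in tests; they must enter the core rules:
- canonical identity
- durable progress truth
- only eligible replicas count
- reconnect handshake
- retention fail-closed
- rebuild fallback
### 10.2 The envelope must not be widened as a side effect of core refactoring
Keep fixed:
- `RF=2`
- `sync_all`
- the current heartbeat/gRPC path
- `blockvol` backend
Do not use architectural layering as an excuse to widen:
- `RF>2`
- broader transport matrix
- broader rollout claim
### 10.3 Every step gets dual acceptance
Every evolution slice must prove both:
- the old constraints still hold
- the new boundary is more explicit and has less mixed state
### 10.4 No "abstraction refactor lowers the bar first"
Not acceptable:
- temporarily weakening fail-closed for the sake of refactoring
- temporarily blurring readiness for the sake of extracting interfaces
- temporarily relaxing publish truth for the sake of layering
---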
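## 11. How V2 Becomes Reliable From Accepted Claims
The reliability of `V2 core` does not come from "more states" or "a more complex architecture".
It comes from two stricter sources:
1. consuming only accepted constraints that have already entered the claim/evidence ledger
2. reusing only the implementation behavior in `V1` that is known to be reliable, without inheriting its implicit semantics
### 11.1 `V2` does not reinvent truth; it consumes accepted truth
`V2 core` should not promote behavior that "usually seems to work" in the current runtime directly into semantic truth.
It can only be built on constraints already backed by claim / evidence, for example:
- canonical identity
- durable progress authority = `replicaFlushedLSN`
- only eligible replica may satisfy sync durability
- reconnect must use explicit handshake / catch-up
- retention must fail closed
- unrecoverable gap must escalate to `NeedsRebuild`
These constraints are not an appendix to the design notes; they should be the input boundary of `V2 core`.
In other words:
- only truth already accepted by the ledger may enter the core
- runtime assumptions that have not entered the ledger must not become direct core dependencies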
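### 11.2 Claims are input constraints of the core, not just review documents
A basic rule of `V2 core`:
> Any behavior not accepted by claim/evidence may serve only as an observation, never as authority.
For example, the following must not enter `V2` truth directly:
- "the first write usually triggers the shipper connection"
- "replication will probably recover on its own after promote"
- "`degraded=false` probably means ready"
- "a published addr means the replica is usable"
These are only runtime observations; they must still be filtered through the `V2` readiness / eligibility / publication rules.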
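### 11.3 `V2` reliability comes from fail-closed, not implicit convergence
The common problem in `V1` is not that it "cannot work at all", but that many semantics converge implicitly through timing and retries:
- after assignment delivered, when is it truly ready
- after promote, when is replication truly restored
- when does `sync_all` truly mean cross-node durability
`V2 core` must refuse to declare success directly from these ambiguous states.
It should adopt harder rules:
- when the conditions of an accepted claim are not met, stay not-ready / degraded / blocked
- do not guess publish healthy from convenience state
- do not let the workload itself "incidentally push the system into the correct state"
In one sentence:
- `V2` reliability comes from explicit boundaries + fail-closed
- not from "the system will most likely fix itself eventually"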
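A fail-closed publish decision can be sketched as a gate that returns the first unmet condition instead of a bare boolean. This is an illustration under assumptions: `publishGate` is a hypothetical helper, it models a primary-side view only, and the order of checks is chosen for readability rather than taken from the implementation.

```go
package main

import "fmt"

// ReadinessView is reduced to the primary-side flags this sketch checks.
type ReadinessView struct {
	RoleApplied       bool
	ShipperConfigured bool
	ShipperConnected  bool
	ReplicaEligible   bool
}

// publishGate reports whether the view may be published as healthy and,
// if not, names the first condition that is still open.
func publishGate(v ReadinessView) (bool, string) {
	checks := []struct {
		ok     bool
		reason string
	}{
		{v.RoleApplied, "role not applied"},
		{v.ShipperConfigured, "shipper not configured"},
		{v.ShipperConnected, "shipper not connected"},
		{v.ReplicaEligible, "replica not eligible"},
	}
	for _, c := range checks {
		if !c.ok {
			return false, c.reason
		}
	}
	return true, ""
}

func main() {
	views := []ReadinessView{
		{}, // zero value: nothing has closed yet
		{RoleApplied: true, ShipperConfigured: true},
		{RoleApplied: true, ShipperConfigured: true, ShipperConnected: true, ReplicaEligible: true},
	}
	for _, v := range views {
		ok, reason := publishGate(v)
		fmt.Printf("healthy=%v reason=%q\n", ok, reason)
	}
}
```

Because the zero value fails the gate, a volume that has never reported anything is blocked by default, and the returned reason is the kind of "publish gating reason" a diagnostic projection could expose.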
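### 11.4 Which accepted claims `V2 core` uses
| Claim / Constraint | Where it is used in the core | Effect |
|---|---|---|
| canonical identity | control truth | identity is no longer guessed from addresses |
| durable progress = `replicaFlushedLSN` | boundary truth | durability is no longer guessed from success/ack |
| eligible-only barrier | readiness / publication | a replica whose closure is incomplete cannot take part in durability |
| reconnect handshake | recovery truth | recovery is no longer triggered implicitly by the first write |
| retention fail-closed | recovery truth | a lagging replica cannot persist indefinitely in an ambiguous state |
| rebuild fallback | fail-closed policy | a gap no longer hangs indefinitely in degraded |
The clearer these claims are, the more reliable `V2 core` becomes.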
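The "eligible-only barrier" row, combined with durable progress as the authority, can be sketched as a single predicate. This is a hedged illustration, not the shipped barrier code: `Replica` and `syncAllSatisfied` are hypothetical names, and the `rf - 1` requirement is an assumption about what `sync_all` demands within the `RF=2` envelope.

```go
package main

import "fmt"

// Replica is a hypothetical per-replica view as seen by the primary.
type Replica struct {
	ID         string
	Eligible   bool   // passed the readiness/eligibility rules
	FlushedLSN uint64 // durable progress reported by the replica
}

// syncAllSatisfied reports whether a write at lsn may be acknowledged as
// durable under sync_all for replica factor rf.
func syncAllSatisfied(rf int, lsn uint64, replicas []Replica) bool {
	need := rf - 1 // replicas required in addition to the primary
	got := 0
	for _, r := range replicas {
		if r.Eligible && r.FlushedLSN >= lsn {
			got++
		}
	}
	// With need > 0, an empty or ineligible set can never pass:
	// a barrier over zero shippers is a failure, not a vacuous success.
	return got >= need
}

func main() {
	const lsn = 42
	fmt.Println("no shippers:      ", syncAllSatisfied(2, lsn, nil))
	fmt.Println("ineligible ahead: ", syncAllSatisfied(2, lsn, []Replica{{ID: "r1", FlushedLSN: 50}}))
	fmt.Println("eligible behind:  ", syncAllSatisfied(2, lsn, []Replica{{ID: "r1", Eligible: true, FlushedLSN: 41}}))
	fmt.Println("eligible durable: ", syncAllSatisfied(2, lsn, []Replica{{ID: "r1", Eligible: true, FlushedLSN: 42}}))
}
```

The first case is the one that matters for the promote/shipper issue recorded in the claim ledger: counting required replicas explicitly means that a primary with no configured shipper cannot report cross-node durability.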
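---
## 12. Which reliable `V1` behaviors `V2` reuses
`V2` does not discard `V1` entirely.
It reuses the behaviors in `V1` that have been shown to be locally reliable and stable in implementation.
But a clear distinction is required:
- **reliable behaviors that can be reused**
- **old semantics that should no longer be reused**
### 12.1 Reusable reliable behaviors
These behaviors can continue to be used as backend primitives:
| `V1` behavior | Why it is reusable | Why it does not constitute semantic authority |
|---|---|---|
| WAL append / read | local implementation, verifiable | does not decide external durability meaning |
| flusher / checkpoint | local materialization mechanism | does not decide cluster-level readiness |
| dirty-map local read/write | local consistency behavior | does not decide publication truth |
| receiver transport | pure execution path | does not decide session authority |
| shipper transport | pure transport mechanism | does not decide eligibility / publish truth |
| rebuild installer / extent install | local install primitive | does not decide rebuild policy |
| iSCSI / NVMe serving | frontend primitive | does not decide replicated visibility truth |
What these behaviors have in common:
1. local
2. testable
3. no dependence on cluster-level inference
4. should not interpret system semantics on their own
### 12.2 `V1` semantics that are not carried forward
Even if the following once "worked" in `V1`, they should not enter `V2` truth:
- ready from existence
- healthy from non-empty publication
- `sync_all` from vacuous barrier success
- promote implies replication closure
- assignment arrival implies runtime closure
- first write implicitly fixes transport state
`V2` may reuse `V1` actions, but must not inherit these old semantics.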
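### 12.3 Standing of the current `weed/` changes
Many of the `weed/` changes on the current branch are closer to:
- validation of observed behavior
- integration closure
- debug/diagnostic surfaces
- temporary runtime fixes
Their value lies in exposing reality, locating problems, and validating boundaries.
But long-term semantic authority should not rest on these changes themselves.
What can be kept long term is the part that `V2` has already absorbed as one of:
- backend primitive
- adapter boundary
- projection surface
---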
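## 13. Long-Term Assets vs Current Implementation Reality
The most important distinction in the project right now is not "which file is running" but "which asset is worth keeping long term".
### 13.1 Long-term assets
What needs to be kept and evolved long term is the set of `V2` semantic assets in `sw-block/`:
- `v2-protocol-truths.md`
- `v2-protocol-closure-map.zh.md`
- `v2-protocol-claim-and-evidence.md`
- `v2-reuse-replacement-boundary.md`
- `v2_mini_core_design.md`
- the core semantics gradually taking shape in `sw-block/engine/replication/`
These assets define:
- truth
- claim
- closure
- reliability model
- reusable semantics
### 13.2 Current implementation reality
The current changes in `weed/` mostly represent:
- the runtime reality of the current chosen path
- implementation attempts at backend / adapter / publication
- a real-world vehicle for exposing contradictions and validating boundaries
Therefore they should not automatically be treated as long-term assets.
A more precise principle:
- a change in `weed/` should be kept as long-term implementation only after it has been explicitly absorbed by `V2` semantics
- otherwise it can remain a phase-specific validation asset
### 13.3 One overall rule
> The reliability of `V2 core` does not come from trusting the current `weed/` branch implementation.
> It comes from consuming only constraints accepted by the claim/evidence ledger, and reusing only the implementation behavior in `V1` that is known to be reliable.
> Therefore the current changes in `weed/` may be temporary validation assets, while the truth / claim / core design in `sw-block/` are the long-term assets.
The benefits of this rule:
1. an integration patch that currently appears to work will not be mistaken for long-term semantics
2. continued reuse of `V1` will not keep old mixed state in place as authority
3. when branches are converged later, it is possible to distinguish clearly:
   - what should enter long-term `V2` assets
   - what is only current implementation reality
---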
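## 14. Recommended Implementation Order
### Near term
1. keep tightening assignment -> readiness -> publication closure
2. use `CP13-8` to prove whether the current split can identify real bug types
3. solidify readiness / projection into a stable surface
### Medium term
1. introduce a command/event-style `V2 core`
2. reduce semantic decisions in `BlockService`
3. converge the registry into a projection store
4. extract backend interfaces
### Later
1. evaluate whether a physically separate `V2 core process` is needed
2. if it is, that is because the logic is already independent, not for the sake of appearance
---
## 15. Shortest Conclusion
The "more engineering-oriented" version you asked for can be summed up as follows:
- `V2 core` defines truth, state, event, command, and projection
- `adapter` isolates `V1` contamination and translates inputs and outputs
- `V1 backend` executes WAL / transport / frontend / rebuild
- the direction of later phases is not a change of route, but making this explicit step by step and solidifying it into formal structure
If you like, as a next step I can provide a version closer to a real design document:
- a suggested Go package layout for `V2 core`
- which file each struct/event/command belongs in
- a minimal `ApplyEvent() -> Decide() -> EmitCommands()` pseudocode skeleton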

View File

@@ -93,7 +93,7 @@ func (c *BlockVolumeHeartbeatCollector) Run() {
select {
case <-ticker.C:
// Outbound: collect and report status.
-msgs := c.blockService.Store().CollectBlockVolumeHeartbeat()
+msgs := c.blockService.CollectBlockVolumeHeartbeat()
c.safeCallback(msgs)
// Inbound: process any pending assignments.
c.processAssignments()
@@ -115,7 +115,7 @@ func (c *BlockVolumeHeartbeatCollector) processAssignments() {
if len(assignments) == 0 {
return
}
-errs := c.blockService.Store().ProcessBlockVolumeAssignments(assignments)
+errs := c.blockService.ApplyAssignments(assignments)
c.cbMu.Lock()
cb := c.assignmentCallback
c.cbMu.Unlock()

View File

@@ -463,6 +463,42 @@ func TestBlockAssign_NilSource(t *testing.T) {
}
}
// TestBlockAssign_CollectorUsesAuthoritativeLifecycle verifies the heartbeat
// collector now drives the full BlockService assignment path, not the store-only
// role path. A replica assignment must start the receiver and close publish
// readiness.
func TestBlockAssign_CollectorUsesAuthoritativeLifecycle(t *testing.T) {
bs := newTestBlockService(t)
path := testBlockVolPath(t, bs)
collector := NewBlockVolumeHeartbeatCollector(bs, 5*time.Millisecond)
collector.SetAssignmentSource(func() []blockvol.BlockVolumeAssignment {
return []blockvol.BlockVolumeAssignment{{
Path: path,
Epoch: 1,
Role: uint32(blockvol.RoleReplica),
ReplicaDataAddr: ":0",
ReplicaCtrlAddr: ":0",
}}
})
go collector.Run()
defer collector.Stop()
deadline := time.After(500 * time.Millisecond)
for {
dataAddr, ctrlAddr := bs.GetReplState(path)
readiness := bs.ReadinessSnapshot(path)
if dataAddr != "" && ctrlAddr != "" && readiness.ReceiverReady && readiness.PublishHealthy {
return
}
select {
case <-deadline:
t.Fatalf("collector did not start replica receiver: data=%q ctrl=%q readiness=%+v", dataAddr, ctrlAddr, readiness)
case <-time.After(10 * time.Millisecond):
}
}
}
// TestBlockAssign_MixedBatch verifies a batch with 1 success, 1 unknown volume,
// and 1 invalid transition returns parallel errors correctly.
func TestBlockAssign_MixedBatch(t *testing.T) {

View File

@@ -148,7 +148,8 @@ func TestClusterHealthSummary(t *testing.T) {
Path: "/data/healthy.blk",
Role: blockvol.RoleToWire(blockvol.RolePrimary),
ReplicaFactor: 2,
-Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica)}},
+ReplicaReady: true,
+Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica), Ready: true}},
Status: StatusActive,
})
@@ -188,7 +189,8 @@ func TestBlockStatusHandler_IncludesHealthCounts(t *testing.T) {
Path: "/data/status.blk",
Role: blockvol.RoleToWire(blockvol.RolePrimary),
ReplicaFactor: 2,
-Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica)}},
+ReplicaReady: true,
+Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica), Ready: true}},
Status: StatusActive,
})

View File

@@ -1965,3 +1965,45 @@ func TestRegistry_InflightBlocksAutoRegister(t *testing.T) {
t.Fatalf("replica health not updated after inflight released: %f", entry.Replicas[0].HealthScore)
}
}
func TestRegistry_ReplicaReadyRequiresReplicaHeartbeat(t *testing.T) {
r := NewBlockVolumeRegistry()
if err := r.Register(&BlockVolumeEntry{
Name: "vol-ready",
VolumeServer: "primary-server:8080",
Path: "/blocks/vol-ready.blk",
Status: StatusActive,
Replicas: []ReplicaInfo{{
Server: "replica-server:8080",
Path: "/blocks/vol-ready.blk",
}},
}); err != nil {
t.Fatalf("register: %v", err)
}
entry, _ := r.Lookup("vol-ready")
if entry.ReplicaReady {
t.Fatal("replica should not be ready before replica heartbeat confirms publication")
}
if !entry.ReplicaDegraded {
t.Fatal("volume should remain degraded until replica readiness closes")
}
r.UpdateFullHeartbeat("replica-server:8080", []*master_pb.BlockVolumeInfoMessage{{
Path: "/blocks/vol-ready.blk",
Epoch: 1,
Role: uint32(blockvol.RoleReplica),
VolumeSize: 1 << 30,
HealthScore: 0.9,
ReplicaDataAddr: "10.0.0.2:14260",
ReplicaCtrlAddr: "10.0.0.2:14261",
}}, "")
entry, _ = r.Lookup("vol-ready")
if !entry.Replicas[0].Ready {
t.Fatal("replica heartbeat with published receiver addresses should mark replica ready")
}
if !entry.ReplicaReady {
t.Fatal("aggregate replica readiness should become true after replica heartbeat")
}
}

View File

@@ -394,6 +394,7 @@ func entryToVolumeInfo(e *BlockVolumeEntry, primaryAlive bool) blockapi.VolumeIn
ReplicaDataAddr: e.ReplicaDataAddr,
ReplicaCtrlAddr: e.ReplicaCtrlAddr,
ReplicaFactor: rf,
ReplicaReady: e.ReplicaReady,
HealthScore: e.HealthScore,
ReplicaDegraded: e.ReplicaDegraded,
DurabilityMode: durMode,
@@ -407,6 +408,7 @@ func entryToVolumeInfo(e *BlockVolumeEntry, primaryAlive bool) blockapi.VolumeIn
Server: ri.Server,
ISCSIAddr: ri.ISCSIAddr,
IQN: ri.IQN,
Ready: ri.Ready,
HealthScore: ri.HealthScore,
WALLag: ri.WALLag,
})

View File

@@ -24,6 +24,23 @@ type volReplState struct {
replicaCtrlAddr string
// allReplicas stores the full replica set for multi-replica idempotence.
allReplicas []blockvol.ReplicaAddr
roleApplied bool
receiverReady bool
shipperConfigured bool
replicaEligible bool
publishHealthy bool
}
// BlockReadinessSnapshot names the assignment-to-publication closure at the
// BlockService boundary. These flags are owned by the service/adapter layer,
// not by blockvol's local storage mechanics.
type BlockReadinessSnapshot struct {
RoleApplied bool
ReceiverReady bool
ShipperConfigured bool
ShipperConnected bool
ReplicaEligible bool
PublishHealthy bool
}
// NVMeConfig holds NVMe/TCP target configuration passed from CLI flags.
@@ -373,6 +390,15 @@ func (bs *BlockService) DeleteBlockVol(name string) error {
// ProcessAssignments applies assignments from master, including replication setup.
// V2 bridge: also delivers each assignment to the V2 engine for recovery ownership.
func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAssignment) {
_ = bs.ApplyAssignments(assignments)
}
// ApplyAssignments applies assignments through the single authoritative
// BlockService lifecycle: role apply, replication wiring, and publication
// readiness bookkeeping. Returns per-assignment errors parallel to the input.
func (bs *BlockService) ApplyAssignments(assignments []blockvol.BlockVolumeAssignment) []error {
errs := make([]error, len(assignments))
// V2 bridge: convert and deliver to engine orchestrator (Phase 08 P1).
// P3: skip V2 processing for repeated unchanged assignments.
// P4: RecoveryManager starts/cancels recovery goroutines based on results.
@@ -400,9 +426,9 @@ func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAss
// V1 processing (requires blockStore).
if bs.blockStore == nil {
return
return errs
}
for _, a := range assignments {
for i, a := range assignments {
role := blockvol.RoleFromWire(a.Role)
ttl := blockvol.LeaseTTLFromWire(a.LeaseTtlMs)
@@ -410,22 +436,30 @@ func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAss
if err := bs.blockStore.WithVolume(a.Path, func(vol *blockvol.BlockVol) error {
return vol.HandleAssignment(a.Epoch, role, ttl)
}); err != nil {
errs[i] = err
glog.Warningf("block service: assignment %s epoch=%d role=%s: %v", a.Path, a.Epoch, role, err)
continue
}
bs.noteRoleApplied(a.Path, role)
// 2. Replication setup based on role + addresses.
switch role {
case blockvol.RolePrimary:
// CP8-2: ReplicaAddrs (multi-replica) takes precedence over scalar fields.
if len(a.ReplicaAddrs) > 0 {
bs.setupPrimaryReplicationMulti(a.Path, a.ReplicaAddrs)
if err := bs.setupPrimaryReplicationMulti(a.Path, a.ReplicaAddrs); err != nil {
errs[i] = err
}
} else if a.ReplicaDataAddr != "" && a.ReplicaCtrlAddr != "" {
bs.setupPrimaryReplication(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr)
if err := bs.setupPrimaryReplication(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr); err != nil {
errs[i] = err
}
}
case blockvol.RoleReplica:
if a.ReplicaDataAddr != "" && a.ReplicaCtrlAddr != "" {
bs.setupReplicaReceiver(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr)
if err := bs.setupReplicaReceiver(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr); err != nil {
errs[i] = err
}
}
case blockvol.RoleRebuilding:
if a.RebuildAddr != "" {
@@ -433,18 +467,23 @@ func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAss
}
}
}
return errs
}
// setupPrimaryReplication configures WAL shipping from primary to replica
// and starts the rebuild server (R1-2).
func (bs *BlockService) setupPrimaryReplication(path, replicaDataAddr, replicaCtrlAddr string) {
func (bs *BlockService) setupPrimaryReplication(path, replicaDataAddr, replicaCtrlAddr string) error {
// P3 idempotence: skip if replica state is unchanged.
bs.replMu.RLock()
existing := bs.replStates[path]
bs.replMu.RUnlock()
if existing != nil && existing.replicaDataAddr == replicaDataAddr && existing.replicaCtrlAddr == replicaCtrlAddr {
// Unchanged repeated assignment — idempotent, no side effects.
return
bs.markPrimaryTransportConfigured(path, []blockvol.ReplicaAddr{{
DataAddr: replicaDataAddr,
CtrlAddr: replicaCtrlAddr,
}})
return nil
}
// Compute deterministic rebuild listen address.
@@ -465,27 +504,19 @@ func (bs *BlockService) setupPrimaryReplication(path, replicaDataAddr, replicaCt
return nil
}); err != nil {
glog.Warningf("block service: setup primary replication %s: %v", path, err)
return
return err
}
// Track replication state for heartbeat reporting (R1-4).
// These addresses are what the primary ships to — they come from the
// master's assignment. They should already be canonical (from
// AllocateBlockVolumeResponse), but if not, they'll be reported as-is.
bs.replMu.Lock()
if bs.replStates == nil {
bs.replStates = make(map[string]*volReplState)
}
bs.replStates[path] = &volReplState{
replicaDataAddr: replicaDataAddr,
replicaCtrlAddr: replicaCtrlAddr,
}
bs.replMu.Unlock()
bs.markPrimaryTransportConfigured(path, []blockvol.ReplicaAddr{{
DataAddr: replicaDataAddr,
CtrlAddr: replicaCtrlAddr,
}})
glog.V(0).Infof("block service: primary %s shipping WAL to %s/%s (rebuild=%s)", path, replicaDataAddr, replicaCtrlAddr, rebuildAddr)
return nil
}
// setupPrimaryReplicationMulti configures WAL shipping from primary to N replicas
// using SetReplicaAddrs (CP8-2: multi-replica support).
func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockvol.ReplicaAddr) {
func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockvol.ReplicaAddr) error {
// P3 idempotence: skip if ALL replica addresses unchanged.
// Compare full replica set, not just the first entry.
if len(addrs) > 0 {
@@ -493,7 +524,8 @@ func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockv
existing := bs.replStates[path]
bs.replMu.RUnlock()
if existing != nil && bs.multiReplicaUnchanged(path, addrs) {
return
bs.markPrimaryTransportConfigured(path, addrs)
return nil
}
}
@@ -513,30 +545,15 @@ func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockv
return nil
}); err != nil {
glog.Warningf("block service: setup primary replication (multi) %s: %v", path, err)
return
return err
}
// Track replication state for heartbeat reporting.
bs.replMu.Lock()
if bs.replStates == nil {
bs.replStates = make(map[string]*volReplState)
}
// Store full replica set + first replica for backward compat heartbeat.
if len(addrs) > 0 {
// Copy the addrs slice to avoid aliasing.
copied := make([]blockvol.ReplicaAddr, len(addrs))
copy(copied, addrs)
bs.replStates[path] = &volReplState{
replicaDataAddr: addrs[0].DataAddr,
replicaCtrlAddr: addrs[0].CtrlAddr,
allReplicas: copied,
}
}
bs.replMu.Unlock()
bs.markPrimaryTransportConfigured(path, addrs)
glog.V(0).Infof("block service: primary %s shipping WAL to %d replicas (rebuild=%s)", path, len(addrs), rebuildAddr)
return nil
}
// setupReplicaReceiver starts the replica WAL receiver.
func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) {
func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) error {
// CP13-2: Pass the routable advertisedIP (from -ip flag, NOT from -id/serverID)
// so wildcard-bind listeners resolve to a real IP, not an opaque identity string.
var canonDataAddr, canonCtrlAddr string
@@ -559,7 +576,7 @@ func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) {
return nil
}); err != nil {
glog.Warningf("block service: setup replica receiver %s: %v", path, err)
return
return err
}
// Fallback to assignment addresses if receiver didn't report.
if canonDataAddr == "" {
@@ -568,16 +585,9 @@ func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) {
if canonCtrlAddr == "" {
canonCtrlAddr = ctrlAddr
}
bs.replMu.Lock()
if bs.replStates == nil {
bs.replStates = make(map[string]*volReplState)
}
bs.replStates[path] = &volReplState{
replicaDataAddr: canonDataAddr,
replicaCtrlAddr: canonCtrlAddr,
}
bs.replMu.Unlock()
bs.markReceiverReady(path, canonDataAddr, canonCtrlAddr)
glog.V(0).Infof("block service: replica %s receiving on %s/%s", path, canonDataAddr, canonCtrlAddr)
return nil
}
// startRebuild starts a rebuild in the background.
@@ -722,8 +732,10 @@ func (bs *BlockService) CollectBlockVolumeHeartbeat() []blockvol.BlockVolumeInfo
defer bs.replMu.RUnlock()
for i := range msgs {
if s, ok := bs.replStates[msgs[i].Path]; ok {
msgs[i].ReplicaDataAddr = s.replicaDataAddr
msgs[i].ReplicaCtrlAddr = s.replicaCtrlAddr
if s.publishHealthy {
msgs[i].ReplicaDataAddr = s.replicaDataAddr
msgs[i].ReplicaCtrlAddr = s.replicaCtrlAddr
}
}
// NVMe publication: report nvme_addr and nqn if NVMe target is running.
if bs.nvmeListenAddr != "" {
@@ -758,6 +770,108 @@ func (bs *BlockService) multiReplicaUnchanged(path string, addrs []blockvol.Repl
return true
}
func (bs *BlockService) ensureReplStateLocked(path string) *volReplState {
if bs.replStates == nil {
bs.replStates = make(map[string]*volReplState)
}
state := bs.replStates[path]
if state == nil {
state = &volReplState{}
bs.replStates[path] = state
}
return state
}
func (bs *BlockService) noteRoleApplied(path string, role blockvol.Role) {
bs.replMu.Lock()
defer bs.replMu.Unlock()
state := bs.ensureReplStateLocked(path)
state.roleApplied = true
switch role {
case blockvol.RoleReplica:
state.receiverReady = false
state.shipperConfigured = false
state.replicaEligible = false
state.publishHealthy = false
case blockvol.RolePrimary:
state.receiverReady = false
state.shipperConfigured = false
state.replicaEligible = false
state.publishHealthy = true
case blockvol.RoleRebuilding:
state.receiverReady = false
state.shipperConfigured = false
state.replicaEligible = false
state.publishHealthy = false
default:
state.receiverReady = false
state.shipperConfigured = false
state.replicaEligible = false
state.publishHealthy = false
state.replicaDataAddr = ""
state.replicaCtrlAddr = ""
state.allReplicas = nil
}
}
func (bs *BlockService) markPrimaryTransportConfigured(path string, addrs []blockvol.ReplicaAddr) {
bs.replMu.Lock()
defer bs.replMu.Unlock()
state := bs.ensureReplStateLocked(path)
state.shipperConfigured = len(addrs) > 0
state.publishHealthy = true
state.replicaEligible = false
state.receiverReady = false
if len(addrs) == 0 {
state.replicaDataAddr = ""
state.replicaCtrlAddr = ""
state.allReplicas = nil
return
}
copied := make([]blockvol.ReplicaAddr, len(addrs))
copy(copied, addrs)
state.allReplicas = copied
state.replicaDataAddr = addrs[0].DataAddr
state.replicaCtrlAddr = addrs[0].CtrlAddr
}
func (bs *BlockService) markReceiverReady(path, dataAddr, ctrlAddr string) {
bs.replMu.Lock()
defer bs.replMu.Unlock()
state := bs.ensureReplStateLocked(path)
state.receiverReady = true
state.replicaEligible = true
state.publishHealthy = true
state.shipperConfigured = false
state.replicaDataAddr = dataAddr
state.replicaCtrlAddr = ctrlAddr
state.allReplicas = nil
}
// ReadinessSnapshot reports the service-owned assignment/readiness closure for
// one volume. It keeps v2 publication truth above blockvol's local mechanics.
func (bs *BlockService) ReadinessSnapshot(path string) BlockReadinessSnapshot {
snap := BlockReadinessSnapshot{}
bs.replMu.RLock()
state := bs.replStates[path]
if state != nil {
snap.RoleApplied = state.roleApplied
snap.ReceiverReady = state.receiverReady
snap.ShipperConfigured = state.shipperConfigured
snap.ReplicaEligible = state.replicaEligible
snap.PublishHealthy = state.publishHealthy
}
bs.replMu.RUnlock()
if !snap.ShipperConfigured || bs.blockStore == nil {
return snap
}
_ = bs.blockStore.WithVolume(path, func(vol *blockvol.BlockVol) error {
snap.ShipperConnected = len(vol.ReplicaShipperStates()) > 0 && !vol.Status().ReplicaDegraded
return nil
})
return snap
}
// --- P3: Assignment idempotence ---
// lastAppliedAssignment stores the full assignment for idempotence comparison.
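The readiness flags introduced in this file can be read as a small state machine. The following is a minimal standalone sketch of that idea, not the code from the commit: the `role` and `readiness` types and the method names are hypothetical stand-ins for `blockvol.Role`, `volReplState`, and the `noteRoleApplied` / `markPrimaryTransportConfigured` / `markReceiverReady` helpers, and address tracking is left out.

```go
package main

import "fmt"

// role is a hypothetical stand-in for blockvol.Role.
type role int

const (
	roleNone role = iota
	rolePrimary
	roleReplica
	roleRebuilding
)

// readiness mirrors the boolean flags kept per volume in the diff above.
type readiness struct {
	roleApplied       bool
	receiverReady     bool
	shipperConfigured bool
	replicaEligible   bool
	publishHealthy    bool
}

// applyRole resets all transport flags on a role change. Only a primary is
// publish-healthy before its transport has been wired.
func (r *readiness) applyRole(ro role) {
	*r = readiness{roleApplied: true, publishHealthy: ro == rolePrimary}
}

// primaryConfigured records that WAL shipping was set up toward n replicas.
func (r *readiness) primaryConfigured(n int) {
	r.shipperConfigured = n > 0
	r.publishHealthy = true
	r.replicaEligible = false
	r.receiverReady = false
}

// receiverStarted records that the replica receiver is listening.
func (r *readiness) receiverStarted() {
	r.receiverReady = true
	r.replicaEligible = true
	r.publishHealthy = true
	r.shipperConfigured = false
}

// heartbeatAddr returns the address to report in a heartbeat, or "" while
// the volume is not yet publish-healthy.
func (r *readiness) heartbeatAddr(addr string) string {
	if !r.publishHealthy {
		return ""
	}
	return addr
}

func main() {
	var r readiness
	r.applyRole(roleReplica)
	fmt.Printf("after role assignment: %q\n", r.heartbeatAddr("10.0.0.3:9040"))
	r.receiverStarted()
	fmt.Printf("after receiver start:  %q\n", r.heartbeatAddr("10.0.0.3:9040"))
}
```

The point of the gate is that a replica whose role was applied but whose receiver has not started yet reports no replica addresses, so an observer of the heartbeat cannot mistake "assigned" for "ready".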


@@ -17,13 +17,19 @@ type ShipperDebugInfo struct {
// BlockVolumeDebugInfo is the real-time block volume state.
type BlockVolumeDebugInfo struct {
Path string `json:"path"`
Role string `json:"role"`
Epoch uint64 `json:"epoch"`
HeadLSN uint64 `json:"head_lsn"`
Degraded bool `json:"degraded"`
Shippers []ShipperDebugInfo `json:"shippers,omitempty"`
Timestamp string `json:"timestamp"`
Path string `json:"path"`
Role string `json:"role"`
Epoch uint64 `json:"epoch"`
HeadLSN uint64 `json:"head_lsn"`
Degraded bool `json:"degraded"`
RoleApplied bool `json:"role_applied"`
ReceiverReady bool `json:"receiver_ready"`
ShipperConfigured bool `json:"shipper_configured"`
ShipperConnected bool `json:"shipper_connected"`
ReplicaEligible bool `json:"replica_eligible"`
PublishHealthy bool `json:"publish_healthy"`
Shippers []ShipperDebugInfo `json:"shippers,omitempty"`
Timestamp string `json:"timestamp"`
}
// debugBlockShipperHandler returns real-time shipper state for all block volumes.
@@ -48,13 +54,20 @@ func (vs *VolumeServer) debugBlockShipperHandler(w http.ResponseWriter, r *http.
var infos []BlockVolumeDebugInfo
store.IterateBlockVolumes(func(path string, vol *blockvol.BlockVol) {
status := vol.Status()
readiness := vs.blockService.ReadinessSnapshot(path)
info := BlockVolumeDebugInfo{
Path: path,
Role: status.Role.String(),
Epoch: status.Epoch,
HeadLSN: status.WALHeadLSN,
Degraded: status.ReplicaDegraded,
Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
Path: path,
Role: status.Role.String(),
Epoch: status.Epoch,
HeadLSN: status.WALHeadLSN,
Degraded: status.ReplicaDegraded,
RoleApplied: readiness.RoleApplied,
ReceiverReady: readiness.ReceiverReady,
ShipperConfigured: readiness.ShipperConfigured,
ShipperConnected: readiness.ShipperConnected,
ReplicaEligible: readiness.ReplicaEligible,
PublishHealthy: readiness.PublishHealthy,
Timestamp: time.Now().UTC().Format(time.RFC3339Nano),
}
// Get per-shipper state from ShipperGroup if available.


@@ -36,6 +36,7 @@ type VolumeInfo struct {
// CP8-2: Multi-replica fields.
ReplicaFactor int `json:"replica_factor"`
Replicas []ReplicaDetail `json:"replicas,omitempty"`
ReplicaReady bool `json:"replica_ready,omitempty"`
HealthScore float64 `json:"health_score"`
ReplicaDegraded bool `json:"replica_degraded,omitempty"`
DurabilityMode string `json:"durability_mode"` // CP8-3-1
@@ -71,6 +72,7 @@ type ReplicaDetail struct {
Server string `json:"server"`
ISCSIAddr string `json:"iscsi_addr,omitempty"`
IQN string `json:"iqn,omitempty"`
Ready bool `json:"ready,omitempty"`
HealthScore float64 `json:"health_score"`
WALLag uint64 `json:"wal_lag,omitempty"`
}
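One detail of the new `replica_ready` and `ready` fields is the `omitempty` tag: with `encoding/json`, a `false` bool is omitted from the output, so a client sees the key only when the value is `true`. A minimal sketch, using a trimmed hypothetical `volumeInfo` type rather than the real struct:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// volumeInfo is a trimmed, hypothetical copy of two VolumeInfo fields.
type volumeInfo struct {
	ReplicaFactor int  `json:"replica_factor"`
	ReplicaReady  bool `json:"replica_ready,omitempty"`
}

// encode marshals v to a JSON string; it returns "" on error.
func encode(v volumeInfo) string {
	b, err := json.Marshal(v)
	if err != nil {
		return ""
	}
	return string(b)
}

// decode parses a JSON string into a volumeInfo.
func decode(s string) (volumeInfo, error) {
	var v volumeInfo
	err := json.Unmarshal([]byte(s), &v)
	return v, err
}

func main() {
	fmt.Println(encode(volumeInfo{ReplicaFactor: 2}))
	fmt.Println(encode(volumeInfo{ReplicaFactor: 2, ReplicaReady: true}))
}
```

A decoder therefore treats a missing key as "not ready", which is the conservative default for a readiness gate.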


@@ -0,0 +1,155 @@
package component
import (
"bytes"
"net"
"testing"
"time"
"github.com/seaweedfs/seaweedfs/weed/storage/blockvol"
)
// TestReplicaReadAfterShip verifies that data shipped from primary to replica
// via WAL replication is readable on the replica via ReadLBA.
//
// This reproduces the CP13-8 bug: replica iSCSI reads zeros despite
// replicated data in WAL (sync_all barrier confirmed).
func TestReplicaReadAfterShip(t *testing.T) {
primaryPath := t.TempDir() + "/primary.blk"
replicaPath := t.TempDir() + "/replica.blk"
primary, err := blockvol.CreateBlockVol(primaryPath, blockvol.CreateOptions{
VolumeSize: 4 * 1024 * 1024,
BlockSize: 4096,
WALSize: 1 * 1024 * 1024,
})
if err != nil {
t.Fatal(err)
}
defer primary.Close()
replica, err := blockvol.CreateBlockVol(replicaPath, blockvol.CreateOptions{
VolumeSize: 4 * 1024 * 1024,
BlockSize: 4096,
WALSize: 1 * 1024 * 1024,
})
if err != nil {
t.Fatal(err)
}
defer replica.Close()
// Assign roles.
primary.HandleAssignment(1, blockvol.RolePrimary, 30*time.Second)
replica.HandleAssignment(1, blockvol.RoleReplica, 30*time.Second)
// Start replica receiver.
if err := replica.StartReplicaReceiver(":0", ":0"); err != nil {
t.Fatal(err)
}
recvAddr := replica.ReplicaReceiverAddr()
if recvAddr == nil {
t.Fatal("replica receiver not started")
}
t.Logf("replica receiver: data=%s ctrl=%s", recvAddr.DataAddr, recvAddr.CtrlAddr)
// Wire shipper from primary to replica.
primary.SetReplicaAddr(recvAddr.DataAddr, recvAddr.CtrlAddr)
// Write on primary — should ship to replica.
writeData := bytes.Repeat([]byte{0xAB}, 4096)
if err := primary.WriteLBA(0, writeData); err != nil {
t.Fatalf("primary WriteLBA(0): %v", err)
}
// Allow time for the shipper to deliver the entry and the replica to apply it.
time.Sleep(2 * time.Second)
// Read from REPLICA.
replicaData, err := replica.ReadLBA(0, 4096)
if err != nil {
t.Fatalf("replica ReadLBA(0): %v", err)
}
if replicaData[0] == 0x00 {
t.Fatalf("BUG REPRODUCED: replica ReadLBA returns zeros (first byte=0x%02x, want 0xAB)"+
"\nData is in replica WAL but ReadLBA returns zeros", replicaData[0])
}
if !bytes.Equal(replicaData, writeData) {
t.Fatalf("replica data mismatch: first byte=0x%02x, want 0xAB", replicaData[0])
}
t.Log("replica ReadLBA after ship: OK (data matches primary)")
}
// TestReplicaReadDirectApply bypasses the shipper entirely and manually
// ships a WAL entry via TCP to the replica receiver, then reads it back.
func TestReplicaReadDirectApply(t *testing.T) {
replicaPath := t.TempDir() + "/replica.blk"
vol, err := blockvol.CreateBlockVol(replicaPath, blockvol.CreateOptions{
VolumeSize: 4 * 1024 * 1024,
BlockSize: 4096,
WALSize: 1 * 1024 * 1024,
})
if err != nil {
t.Fatal(err)
}
defer vol.Close()
vol.HandleAssignment(1, blockvol.RoleReplica, 30*time.Second)
if err := vol.StartReplicaReceiver(":0", ":0"); err != nil {
t.Fatal(err)
}
recvAddr := vol.ReplicaReceiverAddr()
if recvAddr == nil {
t.Fatal("replica receiver not started")
}
t.Logf("replica: data=%s ctrl=%s", recvAddr.DataAddr, recvAddr.CtrlAddr)
// Directly connect and ship a WAL entry.
conn, err := net.DialTimeout("tcp", recvAddr.DataAddr, 3*time.Second)
if err != nil {
t.Fatalf("connect: %v", err)
}
defer conn.Close()
payload := bytes.Repeat([]byte{0xEF}, 4096)
entry := blockvol.WALEntry{
LSN: 1,
Epoch: 1,
Type: blockvol.EntryTypeWrite,
LBA: 0,
Length: 4096,
Data: payload,
}
encoded, err := entry.Encode()
if err != nil {
t.Fatal(err)
}
if err := blockvol.WriteFrame(conn, blockvol.MsgWALEntry, encoded); err != nil {
t.Fatalf("ship: %v", err)
}
time.Sleep(1 * time.Second)
// Read back via ReadLBA.
data, err := vol.ReadLBA(0, 4096)
if err != nil {
t.Fatalf("ReadLBA: %v", err)
}
if data[0] == 0x00 {
t.Fatalf("BUG: ReadLBA returns zeros after direct WAL apply (0x%02x, want 0xEF)", data[0])
}
if data[0] != 0xEF {
t.Fatalf("unexpected data: 0x%02x, want 0xEF", data[0])
}
t.Logf("direct apply ReadLBA: OK (0x%02x)", data[0])
// Also read via adapter (same path as iSCSI).
adapter := blockvol.NewBlockVolAdapter(vol)
adapterData, err := adapter.ReadAt(0, 4096)
if err != nil {
t.Fatalf("adapter ReadAt: %v", err)
}
if adapterData[0] != 0xEF {
t.Fatalf("adapter returns wrong data: 0x%02x, want 0xEF", adapterData[0])
}
t.Log("adapter ReadAt: OK")
}


@@ -787,6 +787,11 @@ func waitVolumeHealthy(ctx context.Context, actx *tr.ActionContext, act tr.Actio
continue
}
if info.ReplicaFactor > 1 && !info.ReplicaReady {
actx.Log(" poll %d: replica assigned but not publish-ready yet", poll)
continue
}
// Check not degraded.
if info.ReplicaDegraded {
actx.Log(" poll %d: replica degraded, waiting...", poll)
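The two checks visible in this hunk can be expressed as a single predicate, which makes the ordering explicit: readiness is checked before degradation. This is a minimal sketch under assumptions: `volumeInfo` and `healthy` are hypothetical names, and any earlier checks in `waitVolumeHealthy` that fall outside the hunk are not modelled.

```go
package main

import "fmt"

// volumeInfo is a trimmed, hypothetical view of the fields the poll inspects.
type volumeInfo struct {
	ReplicaFactor   int
	ReplicaReady    bool
	ReplicaDegraded bool
}

// healthy reports whether polling may stop. When it may not, the second
// return value gives the reason.
func healthy(info volumeInfo) (bool, string) {
	if info.ReplicaFactor > 1 && !info.ReplicaReady {
		return false, "replica assigned but not publish-ready yet"
	}
	if info.ReplicaDegraded {
		return false, "replica degraded"
	}
	return true, ""
}

func main() {
	polls := []volumeInfo{
		{ReplicaFactor: 2},
		{ReplicaFactor: 2, ReplicaReady: true, ReplicaDegraded: true},
		{ReplicaFactor: 2, ReplicaReady: true},
	}
	for i, p := range polls {
		ok, why := healthy(p)
		fmt.Printf("poll %d: ok=%v reason=%q\n", i, ok, why)
	}
}
```

With `ReplicaFactor` of 1 the readiness check is skipped, so a single-copy volume is judged on the degraded flag alone.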


@@ -32,6 +32,7 @@ type VolumeInfo struct {
ReplicaCtrlAddr string `json:"replica_ctrl_addr,omitempty"`
ReplicaFactor int `json:"replica_factor"`
Replicas []ReplicaDetail `json:"replicas,omitempty"`
ReplicaReady bool `json:"replica_ready,omitempty"`
HealthScore float64 `json:"health_score"`
ReplicaDegraded bool `json:"replica_degraded,omitempty"`
DurabilityMode string `json:"durability_mode"`
@@ -45,6 +46,7 @@ type ReplicaDetail struct {
Server string `json:"server"`
ISCSIAddr string `json:"iscsi_addr,omitempty"`
IQN string `json:"iqn,omitempty"`
Ready bool `json:"ready,omitempty"`
HealthScore float64 `json:"health_score"`
WALLag uint64 `json:"wal_lag,omitempty"`
}


@@ -1,240 +1,368 @@
name: cp13-8-real-workload-validation
timeout: 20m
timeout: 15m
# CP13-8: Bounded real-workload validation for RF=2 sync_all.
#
# Workload envelope:
# Topology: RF=2 sync_all, cross-machine replication (m01 ↔ M02)
# Transport: iSCSI (primary frontend)
# Envelope:
# Topology: RF=2 sync_all, cross-machine (m01 ↔ M02)
# Transport: iSCSI
# Workloads: ext4 (filesystem) + PostgreSQL pgbench (application)
# Disturbance: one bounded failover (kill primary, promote replica)
# Exclusions: NVMe-TCP, RF>2, hours/days soak, degraded-mode perf
# Disturbance: one bounded failover (kill primary, auto-promote replica)
# Exclusions: NVMe-TCP, RF>2, soak, degraded-mode, mode normalization
#
# What this validates:
# The accepted CP13-1..7 replication contract survives contact with
# real filesystem and database consumers. Specifically:
# - Replicated writes are durable on both nodes (ext4 file integrity)
# - Post-failover data is consistent (fsck + file count)
# - Database transactions are durable under sync_all (pgbench TPC-B)
#
# What this does NOT validate:
# - Production rollout readiness
# - Performance floor (see Phase 12 P4)
# - Degraded mode behavior
# - NVMe-TCP transport path
# - Mode normalization (CP13-9)
# Flow:
# 1. Create RF=2 sync_all — NO promote (use initial primary as-is)
# 2. Wait for replication healthy (shipper connected, not degraded)
# 3. Write ext4 + 200 files on primary
# 4. Kill primary → auto-failover promotes replica
# 5. Verify ext4 on promoted replica (fsck + files + checksums)
# 6. pgbench on promoted replica
env:
repo_dir: "C:/work/seaweedfs"
master_url: "http://10.0.0.3:9433"
volume_name: cp13-8-val
# 512MB: enough for ext4 + pgbench, small enough for mkfs + sync_all.
vol_size: "536870912"
topology:
nodes:
target_node:
host: "192.168.1.184"
m01:
host: 192.168.1.181
alt_ips: ["10.0.0.1"]
user: testdev
key: "C:/work/dev_server/testdev_key"
client_node:
host: "192.168.1.181"
key: "/opt/work/testdev_key"
m02:
host: 192.168.1.184
alt_ips: ["10.0.0.3"]
user: testdev
key: "C:/work/dev_server/testdev_key"
targets:
primary:
node: target_node
vol_size: 100M
iscsi_port: 3280
admin_port: 8095
replica_data_port: 9040
replica_ctrl_port: 9041
rebuild_port: 9042
iqn_suffix: cp13-8-primary
replica:
node: client_node
vol_size: 100M
iscsi_port: 3281
admin_port: 8096
replica_data_port: 9043
replica_ctrl_port: 9044
rebuild_port: 9045
iqn_suffix: cp13-8-replica
key: "/opt/work/testdev_key"
phases:
# --- Phase 1: Setup RF=2 sync_all pair ---
- name: setup
actions:
- action: kill_stale
node: target_node
- action: exec
node: m02
cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-cp138-master /tmp/sw-cp138-vs1; mkdir -p /tmp/sw-cp138-master /tmp/sw-cp138-vs1/blocks"
root: "true"
ignore_error: true
- action: kill_stale
node: client_node
- action: exec
node: m01
cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-cp138-vs2; mkdir -p /tmp/sw-cp138-vs2/blocks"
root: "true"
ignore_error: true
- action: iscsi_cleanup
node: client_node
ignore_error: true
- action: build_deploy
- action: start_target
target: primary
create: "true"
durability_mode: sync_all
- action: start_target
target: replica
create: "true"
durability_mode: sync_all
- action: assign
target: replica
epoch: "1"
role: replica
lease_ttl: 60s
- action: assign
target: primary
epoch: "1"
role: primary
lease_ttl: 60s
- action: set_replica
target: primary
replica: replica
- action: sleep
duration: 2s
# --- Phase 2: ext4 filesystem workload ---
- name: ext4-write
actions:
- action: iscsi_login
target: primary
node: client_node
save_as: device
- action: mkfs
node: client_node
device: "{{ device }}"
fstype: ext4
- action: mount
node: client_node
device: "{{ device }}"
mountpoint: /mnt/cp13-8
# Write 200 files with known content.
- action: exec
node: client_node
root: "true"
cmd: "bash -c 'for i in $(seq 1 200); do dd if=/dev/urandom of=/mnt/cp13-8/file_$i bs=4k count=1 2>/dev/null; done && sync'"
# Compute checksums for later verification.
- action: exec
node: client_node
root: "true"
cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-checksums.txt && cat /tmp/cp13-8-checksums.txt | wc -l"
save_as: checksum_count
- action: assert_equal
actual: "{{ checksum_count }}"
expected: "200"
- action: umount
node: client_node
mountpoint: /mnt/cp13-8
- action: iscsi_cleanup
node: client_node
ignore_error: true
# Wait for replication to catch up.
- action: wait_lsn
target: replica
min_lsn: "1"
timeout: 30s
- action: start_weed_master
node: m02
port: "9433"
dir: /tmp/sw-cp138-master
extra_args: "-ip=10.0.0.3"
save_as: master_pid
- action: sleep
duration: 3s
# --- Phase 3: Failover (kill primary, promote replica) ---
- action: start_weed_volume
node: m02
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-cp138-vs1
extra_args: "-block.dir=/tmp/sw-cp138-vs1/blocks -block.listen=:3295 -ip=10.0.0.3"
save_as: vs1_pid
- action: start_weed_volume
node: m01
port: "18480"
master: "10.0.0.3:9433"
dir: /tmp/sw-cp138-vs2
extra_args: "-block.dir=/tmp/sw-cp138-vs2/blocks -block.listen=:3295 -ip=10.0.0.1"
save_as: vs2_pid
- action: sleep
duration: 3s
- action: wait_cluster_ready
node: m02
master_url: "{{ master_url }}"
- action: wait_block_servers
count: "2"
- action: create_block_volume
name: "{{ volume_name }}"
size_bytes: "{{ vol_size }}"
replica_factor: "2"
durability_mode: "sync_all"
# Wait for assignment delivery (heartbeat cycle).
- action: sleep
duration: 15s
# Bootstrap write: triggers shipper connect + first barrier.
# Without this, the shipper stays degraded because barrier-triggered
# recovery needs a write to fire the barrier.
- action: lookup_block_volume
name: "{{ volume_name }}"
save_as: boot_vol
- action: iscsi_login_direct
node: m01
host: "{{ boot_vol_iscsi_host }}"
port: "{{ boot_vol_iscsi_port }}"
iqn: "{{ boot_vol_iqn }}"
save_as: boot_device
ignore_error: true
- action: exec
node: m01
root: "true"
cmd: "dd if=/dev/urandom of={{ boot_device }} bs=4k count=1 seek=100000 oflag=direct,sync 2>/dev/null; true"
ignore_error: true
- action: iscsi_cleanup
node: m01
ignore_error: true
- action: sleep
duration: 5s
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 60s
- action: discover_primary
name: "{{ volume_name }}"
save_as: pri
- action: print
msg: "CP13-8 setup: primary={{ pri }} ({{ pri_server }}), replica={{ pri_replica_node }}"
# --- Phase 2: ext4 filesystem workload on initial primary ---
- name: ext4-write
actions:
- action: lookup_block_volume
name: "{{ volume_name }}"
save_as: vol
- action: iscsi_login_direct
node: m01
host: "{{ vol_iscsi_host }}"
port: "{{ vol_iscsi_port }}"
iqn: "{{ vol_iqn }}"
save_as: device
- action: exec
node: m01
cmd: "mkfs.ext4 -F {{ device }} 2>&1 | tail -2"
root: "true"
- action: exec
node: m01
cmd: "mkdir -p /mnt/cp13-8 && mount {{ device }} /mnt/cp13-8"
root: "true"
- action: exec
node: m01
root: "true"
cmd: "for i in $(seq 1 200); do dd if=/dev/urandom of=/mnt/cp13-8/file_$i bs=4k count=1 2>/dev/null; done && sync && echo WRITE_DONE"
save_as: write_result
- action: assert_contains
value: "{{ write_result }}"
contains: "WRITE_DONE"
- action: exec
node: m01
root: "true"
cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-pre.md5 && wc -l < /tmp/cp13-8-pre.md5"
save_as: pre_checksum_count
- action: assert_equal
actual: "{{ pre_checksum_count }}"
expected: "200"
- action: exec
node: m01
cmd: "umount /mnt/cp13-8"
root: "true"
- action: iscsi_cleanup
node: m01
ignore_error: true
- action: print
msg: "ext4-write: 200 files written, checksums captured"
# Verify replication is healthy after all writes.
- action: wait_volume_healthy
name: "{{ volume_name }}"
timeout: 30s
# --- Phase 3: Failover ---
# Kill ONLY the primary's VS, keep the replica alive for auto-promote.
# Master allocates primary to m01 first (by server registration order).
# Kill m01 VS (primary), m02 VS (replica) stays alive for promotion.
- name: failover
actions:
- action: kill_target
target: primary
- action: assign
target: replica
epoch: "2"
role: primary
lease_ttl: 60s
- action: wait_role
target: replica
role: primary
timeout: 10s
- action: print
msg: "=== Killing primary VS on m01 ==="
- action: exec
node: m01
cmd: "kill -9 {{ vs2_pid }}"
root: "true"
ignore_error: true
# Wait for lease expiry (30s TTL) + auto-failover.
- action: sleep
duration: 50s
# Wait for primary to change from m01 to m02.
- action: wait_block_primary
name: "{{ volume_name }}"
not: "{{ pri_server }}"
timeout: 60s
save_as: new_pri
- action: print
msg: "Failover: {{ pri_server }} → {{ new_pri }}"
- action: sleep
duration: 5s
# --- Phase 4: ext4 verification on promoted replica ---
- name: ext4-verify
actions:
- action: iscsi_login
target: replica
node: client_node
- action: discover_primary
name: "{{ volume_name }}"
save_as: new
- action: print
msg: "Verifying ext4 on promoted node {{ new }} ({{ new_server }})"
# Connect to the new primary's iSCSI.
- action: iscsi_login_direct
node: m01
host: "{{ new_host }}"
port: "3295"
iqn: "{{ vol_iqn }}"
save_as: device2
# fsck: filesystem integrity.
- action: fsck_ext4
node: client_node
node: m01
device: "{{ device2 }}"
save_as: fsck_result
# Mount and verify file count.
- action: mount
node: client_node
device: "{{ device2 }}"
mountpoint: /mnt/cp13-8
- action: print
msg: "fsck: {{ fsck_result }}"
- action: exec
node: client_node
node: m01
cmd: "mkdir -p /mnt/cp13-8 && mount {{ device2 }} /mnt/cp13-8"
root: "true"
- action: exec
node: m01
root: "true"
cmd: "ls /mnt/cp13-8/file_* | wc -l"
save_as: post_failover_count
save_as: post_count
- action: assert_equal
actual: "{{ post_failover_count }}"
actual: "{{ post_count }}"
expected: "200"
# Verify checksums match pre-failover.
- action: exec
node: client_node
node: m01
root: "true"
cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-checksums-post.txt && diff /tmp/cp13-8-checksums.txt /tmp/cp13-8-checksums-post.txt && echo MATCH"
save_as: checksum_match
cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-post.md5 && diff /tmp/cp13-8-pre.md5 /tmp/cp13-8-post.md5 && echo CHECKSUM_MATCH"
save_as: checksum_diff
- action: assert_contains
value: "{{ checksum_match }}"
contains: "MATCH"
- action: umount
node: client_node
mountpoint: /mnt/cp13-8
value: "{{ checksum_diff }}"
contains: "CHECKSUM_MATCH"
- action: exec
node: m01
cmd: "umount /mnt/cp13-8"
root: "true"
- action: iscsi_cleanup
node: client_node
node: m01
ignore_error: true
# --- Phase 5: pgbench on promoted replica (application workload) ---
- name: pgbench-on-replica
- action: print
msg: "ext4-verify: fsck CLEAN, 200 files, checksums MATCH"
# --- Phase 5: pgbench on promoted replica ---
- name: pgbench
actions:
- action: iscsi_login
target: replica
node: client_node
- action: iscsi_login_direct
node: m01
host: "{{ new_host }}"
port: "3295"
iqn: "{{ vol_iqn }}"
save_as: device3
- action: sleep
duration: 3s
- action: pgbench_init
node: client_node
node: m01
device: "{{ device3 }}"
mount: "/mnt/cp13-8-pg"
port: "5440"
port: "5441"
scale: "1"
fstype: ext4
- action: pgbench_run
node: client_node
node: m01
clients: "1"
duration: "10"
save_as: tps_post_failover
save_as: tps
- action: print
msg: "CP13-8: pgbench TPC-B post-failover: {{ tps_post_failover }} TPS"
msg: "CP13-8 pgbench TPS: {{ tps }}"
- action: assert_greater
value: "{{ tps_post_failover }}"
actual: "{{ tps }}"
threshold: "0"
# pgbench succeeded with TPS > 0 = database transactions are durable on the promoted replica.
- action: pgbench_cleanup
node: client_node
mount: "/mnt/cp13-8-pg"
port: "5440"
node: m01
ignore_error: true
- action: iscsi_cleanup
node: client_node
node: m01
ignore_error: true
# --- Phase 6: Cleanup ---
- name: cleanup
always: true
actions:
- action: exec
node: m01
cmd: "umount /mnt/cp13-8 /mnt/cp13-8-pg 2>/dev/null; true"
root: "true"
ignore_error: true
- action: iscsi_cleanup
node: client_node
node: m01
ignore_error: true
- action: stop_all_targets
- action: stop_weed
node: m01
pid: "{{ vs2_pid }}"
ignore_error: true
- action: stop_weed
node: m01
pid: "{{ vs2_new_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs1_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ vs1_new_pid }}"
ignore_error: true
- action: stop_weed
node: m02
pid: "{{ master_pid }}"
ignore_error: true


@@ -118,10 +118,14 @@ func (bs *BlockVolumeStore) WithVolume(path string, fn func(*blockvol.BlockVol)
return fn(vol)
}
// ProcessBlockVolumeAssignments applies a batch of assignments from master.
// Returns a slice of errors parallel to the input (nil = success).
// Unknown volumes and invalid transitions are logged and returned as errors,
// but do not stop processing of remaining assignments.
// ProcessBlockVolumeAssignments applies only the local role/epoch/lease part of
// a batch of assignments. It does NOT wire replica receivers, shippers, or
// publication readiness. The authoritative runtime lifecycle lives in
// BlockService.ApplyAssignments.
//
// Returns a slice of errors parallel to the input (nil = success). Unknown
// volumes and invalid transitions are logged and returned as errors, but do not
// stop processing of remaining assignments.
func (bs *BlockVolumeStore) ProcessBlockVolumeAssignments(
assignments []blockvol.BlockVolumeAssignment,
) []error {