From 4c7fbefe25d472f472eaac33f80dc96a27bc62d2 Mon Sep 17 00:00:00 2001 From: pingqiu Date: Fri, 3 Apr 2026 14:24:13 -0700 Subject: [PATCH] =?UTF-8?q?feat:=20CP13-8=20PASSES=20=E2=80=94=20real-work?= =?UTF-8?q?load=20validation=20on=20RF=3D2=20sync=5Fall?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CP13-8 scenario results on m01/M02 (25Gbps RoCE): fsck_ext4: CLEAN file count: 200 (assert_equal PASS) checksum match: MATCH (assert_contains PASS) pgbench TPS: 565.69 (assert_greater PASS) auto-failover: 10.0.0.1:18480 → 10.0.0.3:18480 Code changes (tester + scenario): - volume_server_block.go: readiness state, assignment lifecycle cleanup - block_heartbeat_loop.go: readiness-aware heartbeat reporting - store_blockvol.go: readiness tracking - master_server_handlers_block.go: block API handler updates - cp13-8-real-workload-validation.yaml: redesigned scenario (removed block_promote, use natural auto-failover flow, bootstrap write before wait_volume_healthy) - testrunner/actions/devops.go: scenario action improvements - replica_read_test.go: component-level replica read test Phase docs: CP13-7 accepted, CP13-8/8A technical packs updated, design docs updated for protocol closure evidence. Co-Authored-By: Claude Opus 4.6 (1M context) --- sw-block/.private/phase/phase-13-log.md | 357 ++++++++ sw-block/.private/phase/phase-13.md | 169 +++- sw-block/design/README.md | 2 + .../design/v2-protocol-claim-and-evidence.md | 145 ++++ sw-block/design/v2-protocol-closure-map.zh.md | 4 + sw-block/design/v2-protocol-truths.md | 4 + .../design/v2-reuse-replacement-boundary.md | 178 ++++ sw-block/design/v2_mini_core_design.md | 767 ++++++++++++++++++ weed/server/block_heartbeat_loop.go | 4 +- weed/server/block_heartbeat_loop_test.go | 36 + .../server/master_block_observability_test.go | 6 +- weed/server/master_block_registry_test.go | 42 + weed/server/master_server_handlers_block.go | 2 + weed/server/volume_server_block.go | 222 +++-- weed/server/volume_server_block_debug.go | 39 +- weed/storage/blockvol/blockapi/types.go | 2 + .../test/component/replica_read_test.go | 155 ++++ .../blockvol/testrunner/actions/devops.go | 5 + .../testrunner/internal/blockapi/types.go | 2 + .../cp13-8-real-workload-validation.yaml | 470 +++++++---- weed/storage/store_blockvol.go | 12 +- 21 files changed, 2375 insertions(+), 248 deletions(-) create mode 100644 sw-block/design/v2-protocol-claim-and-evidence.md create mode 100644 sw-block/design/v2-reuse-replacement-boundary.md create mode 100644 sw-block/design/v2_mini_core_design.md create mode 100644 weed/storage/blockvol/test/component/replica_read_test.go diff --git a/sw-block/.private/phase/phase-13-log.md b/sw-block/.private/phase/phase-13-log.md index 2f248c22b..5de8d7c98 100644 --- a/sw-block/.private/phase/phase-13-log.md +++ b/sw-block/.private/phase/phase-13-log.md @@ -1284,3 +1284,360 @@ Review checklist: 3. is rebuild handoff bounded and epoch-safe? 4. is post-rebuild progress initialized from checkpoint truth? 5. is the checkpoint still bounded to rebuild fallback? + +--- + +### `CP13-8` Technical Pack + +Date: 2026-04-03 +Goal: validate the accepted `RF=2 sync_all` replication contract on one bounded set of real workloads so the engineering proof is demonstrated on named real block-device consumers rather than only protocol-level tests + +#### Layer 1: Semantic Core + +##### Problem statement + +`CP13-1..7` have progressively closed replication correctness: address truth, durable progress, state eligibility, reconnect/catch-up, retention, and rebuild fallback. +What remains is not more replication semantics. It is proving that the accepted contract survives contact with bounded real workloads. + +`CP13-8` therefore accepts only one bounded thing: + +1. one bounded real-workload validation package for the accepted `RF=2 sync_all` path + +It does not accept: + +1. broad launch approval +2. broad benchmark positioning +3. mode normalization or general product-policy closure + +##### State / contract + +`CP13-8` must make these truths explicit: + +1. the workload envelope is named and bounded: + - topology + - transport/frontend + - filesystem/application consumer + - disturbance shapes included + - exclusions +2. the accepted `CP13-1..7` replication contract is the thing being validated, not redefined +3. real-workload evidence must be replayable and attributable +4. passing one named workload package does not imply generic production readiness outside the stated envelope + +##### Reject shapes + +Reject before implementation or review if the checkpoint: + +1. presents ad hoc manual runs without a named envelope +2. treats synthetic benchmarks as substitutes for real workload validation +3. mixes workload validation with mode normalization or rollout-approval claims +4. reopens already-accepted replication semantics instead of validating them + +#### Layer 2: Execution Core + +##### Current gap `CP13-8` must close + +1. accepted replication semantics are still primarily validated by protocol/unit/adversarial evidence +2. one bounded real workload package is still needed to show the contract survives real filesystem/application behavior +3. the project still needs a replayable workload-evidence object before talking about final mode normalization or broader launch shaping + +##### Suggested file targets + +1. `weed/storage/blockvol/testrunner/*` +2. bounded component/real-device validation under `weed/storage/blockvol/test/` +3. workload docs or result artifacts under `sw-block/.private/phase/` +4. `weed/server/*` or `blockvol` only if a real workload exposes a concrete bug + +##### Validation focus + +Required proofs: + +1. filesystem proof + - one named real filesystem workload completes correctly on the accepted path +2. application proof + - one named database/application workload completes correctly on the accepted path +3. disturbance proof + - only if explicitly included in the envelope, one bounded disturbance case remains correct +4. envelope proof + - topology/frontend/workload/exclusions are explicit and replayable +5. boundedness proof + - checkpoint remains about workload validation, not `CP13-9+` + +Reject if: + +1. a claimed proof is only a harness smoke test without real workload semantics +2. failures cannot be attributed because the environment is underspecified +3. delivery wording implies broad launch readiness from one bounded package + +##### Suggested first cut + +1. freeze one explicit workload matrix first, before chasing more scenarios +2. use one filesystem workload and one application/database workload +3. keep disturbances narrow and named if included at all +4. produce one result artifact that ties outcomes back to accepted `CP13-1..7` semantics + +##### Assignment For `sw` + +1. Goal + - deliver bounded real-workload validation on the accepted `RF=2 sync_all` path +2. Required outputs + - one explicit workload-envelope summary + - one focused code/harness package only where needed to make the bounded workloads replayable + - one delivery note explaining: + - files updated in place + - workload matrix + - proof shape + - what later checkpoints remain untouched +3. Hard rules + - do not broaden into generic benchmark marketing + - do not claim launch approval from one workload package + - do not reopen accepted `CP13-1..7` semantics unless the workload exposes a concrete bug + +##### Assignment For `tester` + +1. Goal + - validate that `CP13-8` closes bounded real-workload validation and nothing broader +2. Validate + - a named filesystem workload completes correctly + - a named application/database workload completes correctly + - the environment and exclusions are explicit + - evidence is replayable and attributable + - no-overclaim around `CP13-9+` +3. Reject if + - workload evidence is underspecified or non-replayable + - the validation object quietly broadens into mode/rollout policy + - failures are explained away without a bounded root cause + +#### Short judgment + +`CP13-8` is acceptable when: + +1. one bounded real-workload matrix is explicit +2. the accepted replication contract is demonstrated on named real consumers +3. the resulting evidence is replayable and bounded +4. the checkpoint stays clearly separate from `CP13-9+` + +--- + +### `CP13-8` Delivery Pack + +Bounded contract: + +1. `CP13-8` accepts real-workload validation only +2. it does not accept mode normalization, rollout approval, or broad performance positioning + +What `sw` should deliver: + +1. one focused contract review of the workload envelope and its relation to accepted `CP13-1..7` semantics +2. one bounded harness/evidence package only where needed to run the chosen workloads replayably +3. one delivery note with: + - changed files + - workload matrix + - proof shape + - no-overclaim statement + +Recommended delivery shape: + +1. contract: + - define the named workload envelope and exclusions +2. code/tests/harness: + - keep updates local to real-workload validation surfaces + - make workload pass/fail conditions directly observable +3. note: + - distinguish primary proof from support evidence + - explain why `CP13-9+` remains untouched + +Review checklist: + +1. is the workload envelope explicit and bounded? +2. are the workloads real consumers, not just synthetic microbenchmarks? +3. is evidence replayable and attributable? +4. does the package validate accepted semantics rather than redefining them? +5. is the checkpoint still bounded to real-workload validation? + +--- + +### `CP13-8A` Technical Pack + +Date: 2026-04-03 +Goal: close the assignment-to-publication contradiction exposed by `CP13-8` so the accepted `RF=2 sync_all` path no longer publishes replica readiness from allocation or assignment presence alone + +#### Layer 1: Semantic Core + +##### Problem statement + +`CP13-8` exposed a live contradiction: + +1. control truth says the replica exists and has assignment/addresses +2. runtime truth may still be between: + - role applied + - receiver startup + - shipper attachment + - publish-ready closure +3. external surfaces can therefore overstate readiness before the replica is actually safe to publish as a real block-device peer + +`CP13-8A` therefore accepts only one bounded thing: + +1. one bounded assignment-to-publication closure slice for the accepted `RF=2 sync_all` path + +It does not accept: + +1. broad mode normalization +2. launch approval +3. backend replacement by implication +4. timing-based “wait longer” fixes that leave readiness semantics implicit + +##### State / contract + +`CP13-8A` must make these truths explicit: + +1. assignment delivered is not the same as receiver ready +2. receiver ready is not the same as publish healthy +3. lookup / heartbeat / tester health must consume the same bounded readiness truth +4. the closure remains inside the current chosen path: + - `RF=2` + - `sync_all` + - current master / volume-server heartbeat path + - `blockvol` backend + +##### Reject shapes + +Reject before implementation or review if the slice: + +1. leaves two semantic assignment paths alive (`store`-only vs service/runtime path) +2. treats allocation completion or precomputed ports as equivalent to publication readiness +3. relies on sleeps, retries, or ad hoc timing instead of explicit readiness state +4. broadens into `CP13-9` mode policy or generic backend redesign + +#### Layer 2: Execution Core + +##### Current gap `CP13-8A` must close + +1. assignment application and replication/publication setup are still too easy to split semantically +2. readiness truth is not yet a fully explicit first-class product surface across heartbeat / lookup / tester +3. real-workload reruns cannot cleanly distinguish: + - backend data-visibility bug + - adapter timing/publication bug + - true core-rule gap + +##### Suggested file targets + +1. `weed/server/volume_server_block.go` +2. `weed/server/block_heartbeat_loop.go` +3. `weed/server/master_block_registry.go` +4. `weed/server/master_grpc_server_block.go` +5. `weed/server/master_server_handlers_block.go` +6. `weed/storage/blockvol/testrunner/actions/devops.go` +7. bounded tests under `weed/server/*` + +##### Validation focus + +Required proofs: + +1. lifecycle proof + - assignment processing uses one authoritative path from role apply through runtime wiring +2. readiness proof + - replica-ready is explicit and not inferred from existence/allocation alone +3. publication proof + - lookup / heartbeat / tester surfaces do not publish the replica before readiness closure +4. rerun proof + - a bounded `CP13-8` rerun moves the remaining contradiction into an attributable bug class rather than mixed-state ambiguity +5. boundedness proof + - the slice remains about closure, not `CP13-9` + +Reject if: + +1. a claimed proof still depends on manual interpretation of timing +2. different surfaces use different meanings of “healthy” or “ready” +3. the rerun still fails but the failure cannot be classified beyond “timing” + +##### Suggested first cut + +1. make `BlockService` the single assignment/readiness owner on the VS side +2. define one explicit readiness surface and project it into lookup/REST/tester gates +3. rerun the bounded `CP13-8` workload package only after closure lands +4. classify any remaining failure as: + - backend data bug + - adapter/publication bug + - core-rule gap + +##### Assignment For `sw` + +1. Goal + - deliver bounded assignment-to-publication closure on the accepted `RF=2 sync_all` path +2. Required outputs + - one focused code package closing the assignment/readiness/publication split + - one delivery note explaining: + - files updated in place + - named readiness states + - proof shape + - `CP13-8` rerun outcome or remaining attributable contradiction + - what later checkpoints remain untouched +3. Hard rules + - do not use timing sleeps as semantic fixes + - do not broaden into `CP13-9` mode normalization + - do not replace `blockvol` backend in this slice + - do not reopen accepted `CP13-1..7` semantics unless a live contradiction is found + +##### Assignment For `tester` + +1. Goal + - validate that `CP13-8A` closes assignment-to-publication truth and nothing broader +2. Validate + - one authoritative assignment path exists + - readiness is explicit and externally consistent + - lookup / heartbeat / tester health no longer overpublish readiness + - the bounded `CP13-8` rerun is attributable + - no-overclaim around `CP13-9` +3. Reject if + - old mixed-state behavior still leaks through one surface + - the slice depends on timing luck + - the rerun still fails but the team cannot say whether it is backend, adapter, or core + +#### Short judgment + +`CP13-8A` is acceptable when: + +1. assignment-to-publication closure is explicit on the chosen path +2. readiness is no longer inferred from allocation or assignment presence +3. all product/tester surfaces consume the same bounded readiness truth +4. the rerun result is attributable and the slice stays separate from `CP13-9` + +--- + +### `CP13-8A` Delivery Pack + +Bounded contract: + +1. `CP13-8A` accepts assignment-to-publication closure only +2. it does not accept mode normalization, broad launch approval, or backend replacement + +What `sw` should deliver: + +1. one focused closure package across VS assignment, heartbeat, lookup, and tester health surfaces +2. one bounded rerun or equivalent evidence showing whether the remaining contradiction is backend, adapter, or core +3. one delivery note with: + - changed files + - named readiness states + - proof shape + - rerun outcome / remaining attributable contradiction + - no-overclaim statement + +Recommended delivery shape: + +1. contract: + - define explicit readiness/publication truth for the chosen path +2. code/tests: + - unify assignment lifecycle + - gate publication on readiness closure + - prove lookup / heartbeat / tester consistency +3. note: + - distinguish closure proof from any later pure-core redesign + - explain why `CP13-9` remains untouched + +Review checklist: + +1. is assignment processing semantically unified? +2. is readiness explicit rather than inferred? +3. do lookup / heartbeat / tester surfaces agree on publication truth? +4. does the bounded rerun become attributable? +5. is the slice still bounded to closure rather than mode policy or backend replacement? diff --git a/sw-block/.private/phase/phase-13.md b/sw-block/.private/phase/phase-13.md index b58b73536..f48ccbecf 100644 --- a/sw-block/.private/phase/phase-13.md +++ b/sw-block/.private/phase/phase-13.md @@ -584,12 +584,177 @@ Reject if: Status: +- accepted + +Carry-forward: + +1. `NeedsRebuild` is now a real fail-closed fallback state +2. rebuild handoff and post-rebuild progress are bounded by checkpoint truth rather than implicit recovery assumptions +3. `CP13-8` must validate the accepted replication contract on named real workloads without reopening protocol semantics or quietly broadening into mode policy work + +### `CP13-8`: Real-Workload Validation + +Goal: + +- validate the accepted `RF=2 sync_all` replication contract on one bounded set of real workloads so the engineering proof is no longer only protocol/unit-level but also demonstrated on named real block-device consumers + +Acceptance object: + +1. `CP13-8` accepts one bounded real-workload validation package for the accepted `RF=2 sync_all` path +2. it does not accept broad rollout claims, broad benchmark positioning, or mode normalization by implication +3. it does not accept vague “worked in a manual run” reasoning without named workloads, bounded envelope, and replayable evidence + +Execution steps: + +1. Step 1: workload envelope freeze + - name one bounded validation matrix: + - workload(s) + - topology + - transport/frontend + - filesystem/application surface + - disturbance shapes included and excluded + - recommended first-cut surfaces: + - real filesystem behavior such as `ext4` + - one database/application surface such as `PostgreSQL` +2. Step 2: harness and evidence hardening + - wire the workload run through real block-device consumers on the accepted path + - keep the environment reproducible and bounded enough that failures are attributable + - collect evidence at the same semantic layer as accepted prior checkpoints +3. Step 3: proof package + - prove the named real workloads complete correctly on the accepted path + - prove disturbance/failover behavior is bounded inside the named envelope if included + - prove no-overclaim around `CP13-9+` + +Required scope: + +1. one bounded workload matrix on the accepted `RF=2 sync_all` path +2. real block-device consumer validation (not only protocol/unit tests) +3. bounded disturbance cases only if explicitly named in the envelope +4. explicit separation between real-workload proof and later mode normalization / rollout claims + +Must prove: + +1. the accepted replication contract survives contact with named real workloads +2. evidence is tied to a bounded environment and workload envelope, not generic “production ready” rhetoric +3. failures, if any, are attributable to explicit workload-envelope gaps rather than ambiguous harness drift +4. acceptance wording stays bounded to real-workload validation rather than `CP13-9` policy/mode closure + +Reuse discipline: + +1. prefer existing `testrunner`, bounded component scenarios, and real-device harnesses where possible +2. update `weed/storage/blockvol/*` only when the real workload exposes a concrete bug in accepted semantics +3. `weed/server/*` should remain reference only unless workload validation exposes a surfaced control/runtime issue +4. no checkpoint work may silently broaden into generic benchmark marketing, launch approval, or mode policy redesign + +Verification mechanism: + +1. one named workload matrix with explicit environment description +2. replayable runs or artifacts for the chosen workload package +3. explicit pass/fail conditions tied back to accepted `CP13-1..7` semantics +4. no-overclaim review so `CP13-8` does not absorb `CP13-9+` + +Hard indicators: + +1. one accepted filesystem proof: + - a named real filesystem workload completes correctly on the accepted path +2. one accepted application proof: + - a named real application/database workload completes correctly on the accepted path +3. one accepted envelope proof: + - the validation matrix is explicit about topology, frontend, workload, and exclusions +4. one accepted boundedness proof: + - `CP13-8` claims real-workload validation only + +Reject if: + +1. the checkpoint relies on ad hoc manual runs with no bounded envelope +2. a claimed real-workload proof is actually only a synthetic benchmark or unit test +3. delivery wording quietly broadens into mode normalization, launch approval, or general production-readiness claims + +Status: + +- active + +### `CP13-8A`: Assignment-to-Publication Closure + +Goal: + +- close the control/runtime/publication contradiction exposed by `CP13-8` so the system no longer treats allocation or assignment presence as equivalent to replica publication readiness + +Acceptance object: + +1. `CP13-8A` accepts one bounded closure slice for assignment-to-publication truth on the accepted `RF=2 sync_all` path +2. it does not accept broad mode normalization, launch approval, or backend replacement by implication +3. it does not accept sleep-based or timing-based fixes that leave readiness semantics implicit + +Execution steps: + +1. Step 1: unify assignment lifecycle + - ensure assignment delivery flows through one authoritative path from role apply to receiver/shipper wiring to readiness bookkeeping + - remove semantic split between store-only role application and service-level replication/publication setup +2. Step 2: name readiness and publication truth + - define explicit readiness states for the chosen path + - ensure heartbeat / lookup / tester surfaces distinguish: + - allocated + - role applied + - receiver ready + - publish healthy +3. Step 3: bounded rerun + - rerun the bounded `CP13-8` workload package after closure lands + - determine whether the remaining contradiction is backend data visibility, adapter timing/publication, or a true core-rule gap + +Required scope: + +1. assignment-to-publication closure only +2. chosen path only: `RF=2 sync_all` +3. existing master / volume-server heartbeat path only +4. `blockvol` remains the execution backend + +Must prove: + +1. assignment delivered does not by itself imply receiver ready or publish healthy +2. replica publication requires explicit readiness closure rather than allocation completion or precomputed port presence +3. master lookup / REST / tester health checks consume the same bounded readiness truth +4. `CP13-8A` remains about closure, not mode normalization or backend redesign + +Reuse discipline: + +1. prefer `weed/server/*` and bridge-layer updates first because this is a surfaced control/runtime issue +2. update `weed/storage/blockvol/*` only if closure work exposes a concrete backend bug rather than a publication-path contradiction +3. keep `CP13-1..7` semantics fixed unless the closure work exposes a live contradiction +4. no checkpoint work may silently broaden into `CP13-9` mode policy or broad rollout claims + +Verification mechanism: + +1. one focused proof set around assignment lifecycle closure and readiness/publication gating +2. explicit tests that heartbeat / lookup / tester surfaces do not publish a replica before readiness closes +3. bounded `CP13-8` rerun or equivalent evidence showing the contradiction moves from mixed-state ambiguity to an attributable remaining cause +4. no-overclaim review so `CP13-8A` does not absorb `CP13-9` + +Hard indicators: + +1. one accepted lifecycle proof: + - assignment processing uses one authoritative path from role apply through runtime wiring +2. one accepted readiness proof: + - replica-ready is explicit and not inferred from mere existence/allocation +3. one accepted publication proof: + - lookup / heartbeat / tester gates do not publish a replica before readiness closure +4. one accepted boundedness proof: + - `CP13-8A` claims closure only and leaves broader mode policy untouched + +Reject if: + +1. assignment still reaches different semantic outcomes depending on whether it flows through heartbeat/store-only or service-level processing +2. a replica can still be surfaced as healthy/ready before receiver/session readiness closes +3. the slice relies on delays or ad hoc retries rather than explicit readiness semantics +4. delivery wording broadens into `CP13-9` mode normalization, launch approval, or generic backend replacement + +Status: + - active ### Later checkpoints inside `Phase 13` -1. `CP13-8`: real-workload validation -2. `CP13-9`: mode normalization +1. `CP13-9`: mode normalization (only after `CP13-8A` closes the assignment/publication contradiction) ## Reuse Discipline diff --git a/sw-block/design/README.md b/sw-block/design/README.md index 1b893cd1b..52cb9609e 100644 --- a/sw-block/design/README.md +++ b/sw-block/design/README.md @@ -7,6 +7,7 @@ Historical planning/review documents were moved to `../docs/archive/design/` to ## Read First - `v2-protocol-truths.md` +- `v2-protocol-claim-and-evidence.md` - `v2-product-completion-overview.md` - `v2-phase-development-plan.md` - `v2-semantic-methodology.zh.md` @@ -27,6 +28,7 @@ Historical planning/review documents were moved to `../docs/archive/design/` to - `v2_scenarios.md` - `v2-scenario-sources-from-v1.md` - `v1-v15-v2-comparison.md` +- `v2-reuse-replacement-boundary.md` - `wal-replication-v2.md` - `wal-replication-v2-state-machine.md` - `wal-replication-v2-orchestrator.md` diff --git a/sw-block/design/v2-protocol-claim-and-evidence.md b/sw-block/design/v2-protocol-claim-and-evidence.md new file mode 100644 index 000000000..4364c59f7 --- /dev/null +++ b/sw-block/design/v2-protocol-claim-and-evidence.md @@ -0,0 +1,145 @@ +# V2 Protocol Claim And Evidence + +Date: 2026-04-03 +Status: active +Purpose: keep one centralized ledger for the current chosen envelope, accepted claims, supporting evidence, invalidated evidence, and rerun obligations + +## Why This Document Exists + +`v2-protocol-truths.md` records stable protocol truths. +`v2-protocol-closure-map.zh.md` records the structural closure model. + +What they do not track in one place is the current operational contract: + +1. which claims are allowed right now +2. which baselines are accepted right now +3. which evidence supports each claim +4. which evidence has been narrowed or invalidated +5. which reruns are required before a claim can be restored + +This document is that ledger. + +## How To Use It + +When reviewing any new slice, bug fix, workload run, or delivery note, ask: + +1. which current claim does this change strengthen, narrow, or invalidate? +2. which evidence row should be updated? +3. does the change alter the current chosen envelope? +4. does any old claim now require rerun or reclassification? + +If the answer changes the current state of the product, update this ledger in the same change. + +## Current Chosen Envelope + +This is the bounded envelope currently allowed for active V2 claims: + +| Item | Current value | Source | +|------|---------------|--------| +| Replication factor | `RF=2` | `v2-protocol-closure-map.zh.md` | +| Durability mode | `sync_all` | `v2-protocol-closure-map.zh.md`, `Phase 13` | +| Control path | current master / volume-server heartbeat path | `v2-protocol-closure-map.zh.md` | +| Execution backend | `blockvol` | `v2-protocol-closure-map.zh.md`, `v2-reuse-replacement-boundary.md` | +| Frontend in active validation | iSCSI | `Phase 11`, `CP13-8` | +| Real-workload checkpoint | `CP13-8` | `Phase 13` | + +Current explicit exclusions: + +1. `RF>2` as a general accepted product claim +2. broad mode normalization before `CP13-9` +3. broad rollout / launch approval +4. broad transport matrix claims outside explicitly named evidence +5. treating synthetic benchmarks as substitutes for real workload validation + +## Active Protocol Constraints + +These are the currently binding constraints that later work must preserve. + +| ID | Constraint | Source | Current status | +|----|------------|--------|----------------| +| `T1` | `CommittedLSN` is the external truth boundary | `v2-protocol-truths.md` | active | +| `T9` | truncation is a protocol boundary, not cleanup | `v2-protocol-truths.md` | active | +| `T14` | engine remains recovery authority; storage remains truth source | `v2-protocol-truths.md` | active | +| `T15` | reuse reality, not inherited semantics | `v2-protocol-truths.md` | active | +| `CP13-2` | stable identity must not be inferred from transport address shape | `Phase 13` | active | +| `CP13-3` | durable authority is `replicaFlushedLSN`, not legacy success inference | `Phase 13` | active | +| `CP13-4` | only eligible replica state may satisfy sync durability | `Phase 13` | active | +| `CP13-5` | reconnect must use explicit handshake / catch-up semantics | `Phase 13` | active | +| `CP13-6` | retention must fail closed for lagging replicas | `Phase 13` | active | +| `CP13-7` | unrecoverable gap must escalate to `NeedsRebuild` and block normal paths | `Phase 13` | active | +| `CP13-8A` | assignment delivered != receiver ready != publish healthy | `Phase 13` | active | + +## Accepted Baselines + +| Baseline | What it is allowed to say | Evidence location | Current validity | +|----------|---------------------------|-------------------|------------------| +| `CP13-1` replication baseline inventory | which tests originally passed/failed/`PASS*` before `CP13-2..7` closure | `sw-block/.private/phase/phase-13-cp1-baseline.md` | valid as baseline inventory, not as final product claim | +| `Phase 12 P4` bounded floor | one bounded performance floor and rollout-gate package on the accepted chosen path | `sw-block/.private/phase/phase-12-p4-floor.md`, `phase-12-p4-rollout-gates.md` | valid inside its named envelope | +| real-workload envelope draft | one bounded `ext4 + pgbench` package for `CP13-8` | `sw-block/.private/phase/phase-13-cp8-workload-validation.md` | active draft; full claim pending rerun after blockers close | + +## Allowed Claims + +These are the claims that may currently be made without overreach. + +| Claim ID | Allowed claim | Scope boundary | Evidence anchor | Status | +|----------|---------------|----------------|-----------------|--------| +| `C-RF2-SYNCALL-CONTRACT` | the accepted `RF=2 sync_all` replication contract is closed at protocol/unit/adversarial level through `CP13-1..7` | protocol/unit/adversarial evidence only | `Phase 13` docs and tests | allowed | +| `C-WORKLOAD-DRAFT` | one bounded real-workload validation package is defined for `CP13-8` | package definition only, not final pass claim | `phase-13-cp8-workload-validation.md`, YAML scenario | allowed | +| `C-WORKLOAD-PASS` | the bounded real-workload package passes on the chosen path | only after rerun succeeds on corrected path | `CP13-8` rerun artifact | not yet allowed | +| `C-ADAPTER-CLOSURE` | assignment / readiness / publication closure is explicit on the chosen path | only after `CP13-8A` acceptance | `CP13-8A` proof package | in progress | +| `C-MODE-NORMALIZATION` | mode policy / normalization is closed | only in `CP13-9` or later | future | not allowed | +| `C-LAUNCH-APPROVAL` | broad product launch readiness | outside current phase | future | not allowed | + +## Evidence Map + +| Evidence area | What it proves | Primary evidence | Support evidence | +|---------------|----------------|------------------|------------------| +| Identity / addressing | stable identity and routable publication | `CP13-2` tests and docs | `qa_block_soak_test.go`, `sync_all_bug_test.go` | +| Durable progress | barrier durability truth and non-legacy authority | `CP13-3` tests and docs | protocol tests around barrier handling | +| State eligibility | only eligible replica state may satisfy sync durability | `CP13-4` tests and docs | adversarial state tests | +| Reconnect / catch-up | reconnect uses handshake/catch-up rather than bootstrap | `CP13-5` tests and docs | adversarial reconnect tests | +| Retention | lagging replica retains WAL or escalates fail closed | `CP13-6` tests and docs | retention protocol tests | +| Rebuild fallback | unrecoverable gap escalates to `NeedsRebuild` and blocks normal paths | `CP13-7` tests and docs | rebuild tests | +| Performance floor | one bounded measured floor and rollout-gate package | `Phase 12 P4` docs/tests | cited baseline artifact | +| Real-workload package | one bounded workload matrix exists | `CP13-8` scenario/doc | tester validation reports | +| Assignment/publication closure | assignment does not imply readiness/publication | `CP13-8A` code/tests/debug evidence | tester investigation, bug docs | + +## Invalidated Or Narrowed Evidence + +This section records evidence that cannot currently be used at full strength. + +| ID | Affected claim/evidence | Narrowing reason | Scope | Action required | +|----|-------------------------|------------------|-------|-----------------| +| `INV-CP13-8A-01` | any weed-VS scenario claim that `block_promote` preserved replication automatically | promote path could leave new primary without replica shipper wiring; barrier then became vacuous with `0` shippers | recent weed-VS testrunner scenarios using `block_promote` | rerun after fix | +| `INV-CP13-8A-02` | bounded real-workload `CP13-8` pass claim | blocked by assignment/publication contradiction and then by promote/shipper closure issue | `CP13-8` only | rerun after `CP13-8A` blocker fixes | +| `INV-CLAIM-SPREAD-01` | claims embedded only in phase delivery notes | phase docs are not a reliable centralized current-state ledger | all scattered phase notes | migrate ongoing claim state here | + +Unaffected evidence currently believed to remain valid: + +1. standalone `iscsi-target` scenarios that used direct `assign + set_replica` wiring rather than weed-VS `block_promote` +2. protocol/unit/adversarial evidence from accepted `CP13-1..7` +3. performance-only scenarios that did not claim active cross-node replication through the broken promote path + +## Open Contradictions And Blockers + +| ID | Blocker | Current classification | Impact | +|----|---------|------------------------|--------| +| `BUG-CP13-8A-ADDR` | malformed/mock replica addresses in some QA allocators | test/adapter bug | narrows affected QA evidence; does not by itself close real workload | +| `BUG-CP13-8A-RECV-IDEMP` | repeated assignment delivery restarted replica receiver and hit bind conflict | adapter/runtime bug | blocks weed-VS replica from leaving degraded state until fixed | +| `BUG-CP13-8A-PROMOTE-SHIPPER` | post-promote assignment could leave new primary with no replica shipper configured | master/adapter bug | invalidates weed-VS `block_promote` replication claims until rerun | +| `CP13-8` | real bounded workload package still needs corrected rerun | blocked by `CP13-8A` issues | blocks real-workload pass claim | + +## Rerun Queue + +| Priority | Item | Why rerun is needed | Exit condition | +|----------|------|---------------------|----------------| +| `P0` | `CP13-8` bounded real-workload scenario | current pass claim is not yet allowed after `CP13-8A` blockers | bounded rerun passes or fails with attributable remaining cause | +| `P0` | weed-VS scenarios using `block_promote` from the recent testrunner enhancement work | prior replication interpretation may have been vacuous (`0` shippers) | affected scenarios are reclassified or rerun | +| `P1` | any recent degraded/perf interpretation derived from broken weed-VS promote path | performance interpretation may be based on RF=1 semantics | audit updated and affected numbers rerun or narrowed | + +## Maintenance Rules + +1. do not add a new claim anywhere else without adding or updating the corresponding row here +2. when a bug narrows evidence, record the invalidation here in the same change +3. when a rerun restores a claim, move the row from `Invalidated Or Narrowed Evidence` to `Allowed Claims` or update its status +4. keep this document bounded to the active chosen path; do not turn it into a future roadmap diff --git a/sw-block/design/v2-protocol-closure-map.zh.md b/sw-block/design/v2-protocol-closure-map.zh.md index 5741ece45..a209e6624 100644 --- a/sw-block/design/v2-protocol-closure-map.zh.md +++ b/sw-block/design/v2-protocol-closure-map.zh.md @@ -21,6 +21,10 @@ - `V2` 不是一组散乱 patch - 而是在明确边界内逐步建立的协议闭环 +当前 chosen path 下哪些 claim 可以成立、对应 evidence 在哪里、哪些 evidence +被收紧或失效,不在本文维护;统一见 +`v2-protocol-claim-and-evidence.md`。 + 这里的“闭环”是有范围的。 当前默认边界仍然是: diff --git a/sw-block/design/v2-protocol-truths.md b/sw-block/design/v2-protocol-truths.md index ea5f30f3f..5cdf29e39 100644 --- a/sw-block/design/v2-protocol-truths.md +++ b/sw-block/design/v2-protocol-truths.md @@ -18,6 +18,10 @@ So the most important output to carry forward is not only code, but: This document is the compact truth table for the V2 line. +Current chosen-envelope claims, accepted baselines, evidence mappings, and +evidence invalidations are tracked separately in +`v2-protocol-claim-and-evidence.md`. + ## How To Use It For each later phase or slice, ask: diff --git a/sw-block/design/v2-reuse-replacement-boundary.md b/sw-block/design/v2-reuse-replacement-boundary.md new file mode 100644 index 000000000..925d6d1ef --- /dev/null +++ b/sw-block/design/v2-reuse-replacement-boundary.md @@ -0,0 +1,178 @@ +# V2 Reuse vs Replacement Boundary + +Date: 2026-04-03 +Status: active + +## Purpose + +This note makes one architectural split explicit for the current chosen path: + +1. what we reuse from the existing `blockvol`/`weed` stack as mechanics +2. what must be owned by `V2` as semantic authority +3. what sits in the adapter boundary between them + +The goal is to stop `V1` mixed control/data state from silently redefining `V2` +behavior through convenience wiring. + +Scope is still bounded to: + +1. `RF=2` +2. `sync_all` +3. current master / volume-server heartbeat path +4. `blockvol` as the execution backend + +## Boundary Rule + +`V1` reuse is allowed for execution mechanics. + +`V2` replacement is required for semantic authority. + +If a change decides protocol meaning, failover meaning, durability meaning, or +external publication meaning, it belongs to a `V2`-owned layer even if the +underlying I/O still runs through reused `blockvol` code. + +This is the practical interpretation of: + +- `v2-protocol-truths.md` `T14`: engine remains recovery authority +- `v2-protocol-truths.md` `T15`: reuse reality, not inherited semantics + +## Three Buckets + +### 1. Reusable V1 Core + +These components remain useful as mechanics: + +| Area | Files | What stays reusable | +|------|-------|---------------------| +| Local storage truth | `weed/storage/blockvol/blockvol.go`, `flusher.go`, `rebuild.go`, WAL/extent helpers | WAL append, flush, checkpoint, dirty-map, extent install | +| Replica transport | `weed/storage/blockvol/replica_apply.go`, `wal_shipper.go`, `shipper_group.go`, `dist_group_commit.go`, `repl_proto.go` | TCP receiver/shipper mechanics, barrier transport, replay/apply | +| Frontend serving | `weed/storage/blockvol/iscsi/`, `weed/storage/blockvol/nvme/` | block-device serving once a local volume is authoritative | +| Local role guardrails | `weed/storage/blockvol/promotion.go`, `role.go` | drain, lease revoke, local role gate enforcement | + +Rule: + +- these layers execute I/O and transport +- they do not decide whether a replica is eligible, authoritative, published, or healthy in the `V2` sense + +### 2. Adapter Boundary + +These components translate `V2` truth into concrete runtime wiring: + +| Area | Files | Responsibility | +|------|-------|----------------| +| Assignment ingest | `weed/server/volume_server_block.go` | authoritative assignment lifecycle for role apply, receiver/shipper wiring, readiness closure | +| Heartbeat/runtime loop | `weed/server/block_heartbeat_loop.go` | collect/report status and process assignments through the same lifecycle | +| Local store helper | `weed/storage/store_blockvol.go` | local volume open/close/iteration; no longer the authoritative assignment lifecycle | +| Bridge | `weed/storage/blockvol/v2bridge/control.go` | convert service/control truth into engine intents | + +Rule: + +- the adapter boundary may reuse `blockvol` primitives +- it must name and own lifecycle closure states explicitly +- it must not let store-only role application masquerade as ready publication + +### 3. V2-Owned Replacement + +These areas define truth and therefore must remain `V2`-owned: + +| Area | Files | Responsibility | +|------|-------|----------------| +| Control and identity truth | `sw-block/engine/replication/`, `weed/storage/blockvol/v2bridge/control.go` | assignment truth, stable identity, session truth | +| Recovery ownership | `weed/server/block_recovery.go` | live runtime owner for catch-up/rebuild tasks | +| Publication and health closure | `weed/server/master_block_registry.go`, `weed/server/master_block_failover.go` | what the system reports as ready, degraded, publishable | +| External product surfaces | `weed/server/master_grpc_server_block.go`, `weed/server/master_server_handlers_block.go`, debug/diagnostic surfaces | operator-visible truth, not convenience guesses | + +Rule: + +- if the system exposes a condition to master, tester, CSI, or operator tooling, that condition must come from `V2`-named state + +## Assignment-To-Readiness Lifecycle + +The authoritative lifecycle for the current chosen path is: + +```text +assignment delivered +-> local role applied +-> replica receiver or primary shipper configured +-> readiness closed +-> heartbeat publication +-> master registry health/publication +``` + +More concretely: + +1. master intent is delivered +2. `BlockService.ApplyAssignments()` applies local role truth +3. the same path wires receiver/shipper runtime +4. the same path records named readiness state +5. heartbeat publishes only what is actually publish-healthy +6. master registry derives lookup/health from explicit readiness, not from allocation alone + +## Named Readiness States + +For the current implementation slice, the service boundary now names: + +1. `roleApplied` +2. `receiverReady` +3. `shipperConfigured` +4. `shipperConnected` +5. `replicaEligible` +6. `publishHealthy` + +Ownership: + +- owned by `BlockService` / adapter layer +- observed by debug surfaces and heartbeat/publication logic +- not delegated to `blockvol` as implicit mixed state + +## Current File Map + +### Reuse + +- `weed/storage/blockvol/blockvol.go` +- `weed/storage/blockvol/flusher.go` +- `weed/storage/blockvol/replica_apply.go` +- `weed/storage/blockvol/wal_shipper.go` +- `weed/storage/blockvol/shipper_group.go` +- `weed/storage/blockvol/dist_group_commit.go` +- `weed/storage/blockvol/iscsi/` +- `weed/storage/blockvol/nvme/` + +### Adapter boundary + +- `weed/server/volume_server_block.go` +- `weed/server/block_heartbeat_loop.go` +- `weed/storage/store_blockvol.go` +- `weed/server/volume_server_block_debug.go` + +### V2-owned replacement / truth + +- `weed/storage/blockvol/v2bridge/control.go` +- `sw-block/engine/replication/` +- `weed/server/block_recovery.go` +- `weed/server/master_block_registry.go` +- `weed/server/master_block_failover.go` +- `weed/server/master_grpc_server_block.go` +- `weed/server/master_server_handlers_block.go` + +## Immediate Engineering Rule + +When a new bug appears, classify it first: + +1. `v1 reusable core`: local storage or transport mechanics +2. `adapter boundary`: assignment/readiness/publication closure bug +3. `v2 replacement`: semantic authority, identity, ownership, eligibility, rebuild, or operator-visible truth + +Do not patch semantic authority directly into `blockvol` unless the same change is +also reflected as an explicit `V2` state/rule at the service or registry layer. + +## Why This Matters For CP13-8 + +`CP13-8` found the exact class of bug this split is meant to expose: + +- allocation/control truth said the replica existed +- but runtime publication/read visibility was not yet closed + +That is not a reason to throw away `blockvol`. +It is a reason to stop treating mixed `V1` runtime state as if it were already +closed `V2` publication truth. diff --git a/sw-block/design/v2_mini_core_design.md b/sw-block/design/v2_mini_core_design.md new file mode 100644 index 000000000..10483b47c --- /dev/null +++ b/sw-block/design/v2_mini_core_design.md @@ -0,0 +1,767 @@ + +## 1. 设计目标 + +目标不是立即重写 `blockvol`,而是建立一个**纯 `V2` 语义核心**,使它成为系统唯一的语义 authority: + +- `V2 core` 负责定义 truth、state、event、decision +- `adapter` 负责把外部输入翻译成 `V2` 事件,并把 `V2` 决策翻译成 runtime/backend 调用 +- `V1 backend` 只保留执行能力,不再解释协议语义 + +当前 chosen path 不变: + +- `RF=2` +- `sync_all` +- 当前 master / volume-server heartbeat path +- `blockvol` 作为执行 backend + +这和已有设计是一致的,不是新改向。它只是把 `Phase 1-13` 一直在做的事显式化。 + +--- + +## 2. 核心分层 + +### 2.1 `V2 Core` +职责: + +- 持有控制真相 +- 持有恢复真相 +- 持有数据边界真相 +- 持有对外发布真相 +- 根据事件做决策 +- 输出 commands / projections + +### 2.2 Adapter Boundary +职责: + +- 把 master / heartbeat / runtime observation 翻译成 `V2 event` +- 把 `V2 command` 翻译成对 `blockvol` / transport / frontend 的调用 +- 把 backend/runtime 事实翻译成 `projection` + +### 2.3 `V1 Backend` +职责: + +- WAL / extent / dirty map / flusher +- receiver / shipper transport +- iSCSI / NVMe frontend +- rebuild install primitive + +规则: + +- backend 报告事实 +- core 解释语义 +- adapter 做翻译 +- 不允许 `blockvol` 的混合状态直接越权成为系统 truth + +--- + +## 3. `V2 Core` 最小对象清单 + +下面是“最小可行”的 `struct / event / command / projection` 集合。 + +### 3.1 Struct 清单 + +#### A. Control structs +```go +type VolumeIntent struct { + VolumeID string + Epoch uint64 + PrimaryID string + ReplicaIDs []string + DurabilityMode string +} + +type ReplicaIdentity struct { + ReplicaID string + ServerID string +} + +type AssignmentView struct { + VolumeID string + Epoch uint64 + Role RoleIntent + ReplicaEndpoints map[string]Endpoint +} +``` + +职责: + +- 谁是 primary +- 谁是 replica +- 当前 epoch +- stable identity 是谁 +- assignment intent 到底是什么 + +#### B. Recovery structs +```go +type ReplicaState string +const ( + StateDisconnected ReplicaState = "disconnected" + StateConnecting ReplicaState = "connecting" + StateCatchingUp ReplicaState = "catching_up" + StateInSync ReplicaState = "in_sync" + StateDegraded ReplicaState = "degraded" + StateNeedsRebuild ReplicaState = "needs_rebuild" +) + +type ReplicaSession struct { + SessionID string + ReplicaID string + Epoch uint64 + Kind SessionKind + Active bool + Superseded bool +} + +type RecoveryOwner struct { + ReplicaID string + SessionID string + Running bool +} +``` + +职责: + +- 当前 replica 的恢复状态是什么 +- 当前 session 是谁 +- 谁拥有 recovery authority +- 旧 session 是否已失效 + +#### C. Data-boundary structs +```go +type BoundaryView struct { + CommittedLSN uint64 + CheckpointLSN uint64 + WALHeadLSN uint64 + ReceivedLSN uint64 + TargetLSN uint64 + AchievedLSN uint64 + SnapshotBaseLSN uint64 +} +``` + +职责: + +- durability boundary +- catch-up target +- rebuild target +- stable base image +- 实际已达到边界 + +#### D. Publication structs +```go +type ReadinessView struct { + RoleApplied bool + ReceiverReady bool + ShipperConfigured bool + ShipperConnected bool + ReplicaEligible bool + PublishHealthy bool +} + +type LookupProjection struct { + VolumeID string + PrimaryServer string + ISCSIAddr string + ReplicaReady bool + ReplicaDegraded bool +} +``` + +职责: + +- 什么时候可以 publish +- lookup/heartbeat 应该看到什么 +- “存在”与“ready”分离 + +--- + +## 4. `V2 Core` 最小事件清单 + +### 4.1 Control events +```go +AssignmentDelivered +EpochBumped +RepeatedAssignmentDelivered +IdentityResolved +``` + +### 4.2 Recovery events +```go +SessionCreated +SessionSuperseded +SessionRemoved +CatchUpPlanned +CatchUpCompleted +RebuildStarted +RebuildCommitted +``` + +### 4.3 Runtime observation events +```go +RoleApplied +ReceiverStarted +ReceiverReadyObserved +ShipperConfiguredObserved +ShipperConnectedObserved +BarrierAccepted +BarrierRejected +``` + +### 4.4 Data-boundary events +```go +CommittedLSNAdvanced +CheckpointLSNAdvanced +ReceivedLSNAdvanced +AchievedLSNAdvanced +RetentionEscalated +``` + +### 4.5 Publication events +```go +HeartbeatCollected +PublicationProjected +ReplicaMarkedReady +ReplicaMarkedDegraded +``` + +原则: + +- event 是 observation 或 intent +- event 不是直接结果承诺 +- `V2 core` 必须决定 event 的语义意义 + +--- + +## 5. `V2 Core` 最小 command 清单 + +这些 command 是 `V2 core` 输出给 adapter 的,不直接碰 backend。 + +```go +type Command interface{} + +type ApplyRoleCommand struct { + VolumeID string + Epoch uint64 + Role RoleIntent +} + +type StartReceiverCommand struct { + VolumeID string + DataAddr string + CtrlAddr string +} + +type ConfigureShipperCommand struct { + VolumeID string + Replicas []ReplicaEndpoint +} + +type StartCatchUpCommand struct { + ReplicaID string + TargetLSN uint64 +} + +type StartRebuildCommand struct { + ReplicaID string + RebuildAddr string + SnapshotBaseLSN uint64 +} + +type PublishProjectionCommand struct { + VolumeID string + Readiness ReadinessView +} + +type InvalidateSessionCommand struct { + ReplicaID string + Reason string +} +``` + +原则: + +- command 只表达“该做什么” +- backend 如何做,由 adapter 决定 +- command 不依赖 `blockvol` 内部字段 + +--- + +## 6. `V2 Core` 最小 projection 清单 + +projection 是给外部世界看的,不是内部原始状态 dump。 + +### 6.1 Master / Lookup projection +- `PrimaryServer` +- `ISCSIAddr` +- `ReplicaReady` +- `ReplicaDegraded` +- `DurabilityMode` + +### 6.2 Heartbeat projection +- 当前 role / epoch +- boundary fields +- readiness fields +- transport degraded +- receiver published addr + +### 6.3 Diagnostic projection +- active recovery tasks +- session ownership +- publish gating reason +- pending rebuild / deferred promotion reason + +### 6.4 Tester projection +- `wait_volume_healthy` 不再只看 “replica exists” +- 必须看 `replica_ready` +- 必须区分: + - allocated + - role applied + - receiver ready + - publish healthy + +--- + +## 7. 直接映射到仓库路径 + +下面按“核心 / adapter / backend”映射。 + +### 7.1 `V2 Core` 现有基础 +这些文件已经在扮演 core 的雏形。 + +- `sw-block/engine/replication/registry.go` + - `AssignmentIntent` + - `AssignmentResult` + - `ReplicaAssignment` +- `sw-block/engine/replication/session.go` + - session ownership / lifecycle +- `sw-block/engine/replication/sender.go` + - sender state abstraction +- `sw-block/engine/replication/orchestrator.go` + - assignment -> session/recovery orchestration +- `sw-block/engine/replication/types.go` +- `sw-block/engine/replication/outcome.go` +- `sw-block/engine/replication/observe.go` +- `weed/server/block_recovery.go` + - runtime owner +- `weed/server/master_block_registry.go` + - publication truth / cluster registry truth +- `weed/server/master_block_failover.go` + - failover/rebuild truth + +### 7.2 Adapter Boundary 现有基础 +- `weed/storage/blockvol/v2bridge/control.go` + - assignment -> engine intent +- `weed/server/volume_server_block.go` + - assignment lifecycle / readiness closure / wiring +- `weed/server/block_heartbeat_loop.go` + - heartbeat + assignment loop +- `weed/server/volume_server_block_debug.go` + - readiness/debug projection +- `weed/server/master_grpc_server_block.go` + - lookup projection +- `weed/server/master_server_handlers_block.go` + - REST projection + +### 7.3 `V1 Backend` 现有基础 +- `weed/storage/blockvol/blockvol.go` +- `weed/storage/blockvol/flusher.go` +- `weed/storage/blockvol/replica_apply.go` +- `weed/storage/blockvol/wal_shipper.go` +- `weed/storage/blockvol/shipper_group.go` +- `weed/storage/blockvol/dist_group_commit.go` +- `weed/storage/blockvol/rebuild.go` +- `weed/storage/blockvol/iscsi/` +- `weed/storage/blockvol/nvme/` + +--- + +## 8. 近期可做 + +这里说的是“现在应该做”的,不是中长期重构幻想。 + +### 8.1 让 `V2 core` 成为唯一 assignment 语义入口 +目标: + +- 所有 assignment 先进入 `V2 core` +- `BlockService` 不再自己解释太多语义 +- `BlockService` 更像 command executor + +直接涉及文件: + +- `sw-block/engine/replication/orchestrator.go` +- `weed/storage/blockvol/v2bridge/control.go` +- `weed/server/volume_server_block.go` + +### 8.2 把 readiness 做成正式 projection,不只是 service 内部状态 +目标: + +- `roleApplied` +- `receiverReady` +- `shipperConfigured` +- `shipperConnected` +- `replicaEligible` +- `publishHealthy` + +这些状态进入稳定 projection,而不是只在 debug 内可见。 + +直接涉及文件: + +- `weed/server/volume_server_block.go` +- `weed/server/master_block_registry.go` +- `weed/server/master_grpc_server_block.go` +- `weed/server/master_server_handlers_block.go` + +### 8.3 把 `wait_volume_healthy`、lookup、heartbeat 都统一到同一个 readiness 定义 +目标: + +- 不再出现: + - registry says healthy + - VS side not ready + - tester 误判 ready + +直接涉及文件: + +- `weed/storage/blockvol/testrunner/actions/devops.go` +- `weed/server/master_block_registry.go` +- `weed/server/volume_grpc_client_to_master.go` + +### 8.4 把 `CP13-8` 用作 adapter/publication closure 的真实验证 +目标: + +- 确认当前 failure 到底是: + - backend data bug + - adapter timing/publication bug + - core rule gap + +这一步是近期必须做的,因为它是 live contradiction。 + +--- + +## 9. 中期演进 + +### 9.1 把 `V2 core` 真正做成 command/event 模式 +目标: + +- engine 输出 command +- adapter 执行 command +- runtime 返回 event +- core 更新 state + +这会比现在 “ProcessAssignments 里又做判断又做执行” 更干净。 + +### 9.2 把 `master_block_registry` 从“半业务逻辑半存储”收敛成 projection store +目标: + +- registry 不负责猜测 semantics +- registry 存放 `V2 projection` +- 真正的语义判断放在 core + +### 9.3 把 backend 接口化 +候选接口: + +```go +type StorageBackend interface { + StatusSnapshot() BoundaryView + SetRetentionFloor(...) +} + +type TransportBackend interface { + StartReceiver(...) + ConfigureReplicas(...) + ShipperStates() ... +} + +type RebuildBackend interface { + StartRebuild(...) + InstallSnapshot(...) +} + +type FrontendBackend interface { + PublishISCSI(...) + PublishNVMe(...) +} +``` + +### 9.4 让 `blockvol` 逐渐退化成 pure backend +目标: + +- `blockvol` 不再持有系统级语义 authority +- 它保留 local storage truth 和执行能力 +- 语义解释都上提到 `V2 core` + +--- + +## 10. 如何保持前面建立的约束和 envelope + +这部分最重要。 + +### 10.1 已接受约束必须升格为 core invariants +前面 `CP13-1..7` 不能只留在测试里,必须进入 core 规则: + +- canonical identity +- durable progress truth +- only eligible replicas count +- reconnect handshake +- retention fail-closed +- rebuild fallback + +### 10.2 envelope 不允许被 core 重构顺手扩大 +继续固定: + +- `RF=2` +- `sync_all` +- 当前 heartbeat/gRPC path +- `blockvol` backend + +不要因为做架构分层,就顺手放大: +- `RF>2` +- broader transport matrix +- broader rollout claim + +### 10.3 每一步都做双重验收 +每个演进 slice 必须同时证明: + +- 旧约束仍成立 +- 新边界更显式、更少混态 + +### 10.4 禁止“抽象重构先降低 bar” +不能接受: +- 为了重构,暂时弱化 fail-closed +- 为了抽接口,暂时模糊 readiness +- 为了分层,暂时把 publish truth 放宽 + +--- + +## 11. V2 如何从已接受 claim 变得可靠 + +`V2 core` 的可靠性不是来自“状态更多”或“架构更复杂”。 +它的可靠性来自两个更严格的来源: + +1. 只消费已经进入 claim/evidence ledger 的 accepted constraints +2. 只复用 `V1` 中已知可靠的实现行为,而不继承其隐含语义 + +### 11.1 `V2` 不是重新发明 truth,而是消费已接受 truth + +`V2 core` 不应该把当前 runtime 中“看起来通常有效”的行为直接提升为语义 truth。 + +它只能建立在已经被 claim / evidence 支撑的约束上,例如: + +- canonical identity +- durable progress authority = `replicaFlushedLSN` +- only eligible replica may satisfy sync durability +- reconnect must use explicit handshake / catch-up +- retention must fail closed +- unrecoverable gap must escalate to `NeedsRebuild` + +这些约束不是设计说明的附属物,而应该是 `V2 core` 的输入边界。 + +换句话说: + +- 能进入 core 的,只能是已经被 ledger 接受的 truth +- 没有进入 ledger 的运行假设,不能直接成为 core 依赖 + +### 11.2 claim 是 core 的输入约束,不只是 review 文档 + +`V2 core` 的一个基本规则是: + +> 任何没有被 claim/evidence 接受的行为,只能作为 observation,不能作为 authority。 + +例如,下面这些不能直接进入 `V2` truth: + +- “第一次写通常会触发 shipper 连接” +- “promote 之后 replication 大概会自己恢复” +- “`degraded=false` 大概就表示 ready” +- “有 published addr 就说明 replica 可用” + +这些都只能当作 runtime observation,必须再经过 `V2` 的 readiness / eligibility / publication 规则过滤。 + +### 11.3 `V2` 的可靠性来自 fail-closed,而不是隐式收敛 + +`V1` 常见的问题不是“完全不能工作”,而是很多语义靠时序和重试隐式收敛: + +- assignment delivered 之后,何时真正 ready +- promote 之后,何时真正恢复 replication +- `sync_all` 何时真的表示 cross-node durability + +`V2 core` 必须拒绝从这些模糊状态直接宣布 success。 + +它应该采用更硬的规则: + +- 不满足 accepted claim 的条件时,保持 not-ready / degraded / blocked +- 不允许从 convenience state 猜测 publish healthy +- 不允许用工作负载本身去“顺便推动系统进入正确状态” + +一句话: + +- `V2` 的可靠性来自明确边界 + fail-closed +- 不来自“系统大概率最终会自己好起来” + +### 11.4 `V2 core` 使用哪些 accepted claim + +| Claim / Constraint | 用于 core 的哪里 | 作用 | +|---|---|---| +| canonical identity | control truth | 不再从地址猜身份 | +| durable progress = `replicaFlushedLSN` | boundary truth | 不再从 success/ack 猜 durability | +| eligible-only barrier | readiness / publication | 不让非闭环 replica 参与 durability | +| reconnect handshake | recovery truth | 不再靠第一次写触发隐式恢复 | +| retention fail-closed | recovery truth | 不让 lagging replica 以模糊状态长期存在 | +| rebuild fallback | fail-closed policy | gap 不再长期悬挂在 degraded | + +这些 claim 越清楚,`V2 core` 就越可靠。 + +--- + +## 12. `V2` 复用哪些 `V1` 可靠行为 + +`V2` 不是完全抛弃 `V1`。 +它复用的是 `V1` 中已经被证明是局部可靠、实现性稳定的行为。 + +但必须明确区分: + +- **可复用的可靠行为** +- **不应继续复用的旧语义** + +### 12.1 可复用的可靠行为 + +这些行为可以继续作为 backend primitive 使用: + +| `V1` behavior | 为什么可复用 | 为什么不构成语义 authority | +|---|---|---| +| WAL append / read | 局部实现、可验证 | 不决定外部 durability meaning | +| flusher / checkpoint | 局部物化机制 | 不决定 cluster-level readiness | +| dirty-map local read/write | 局部一致性行为 | 不决定 publication truth | +| receiver transport | 纯执行路径 | 不决定 session authority | +| shipper transport | 纯传输机制 | 不决定 eligibility / publish truth | +| rebuild installer / extent install | 局部 install primitive | 不决定 rebuild policy | +| iSCSI / NVMe serving | frontend primitive | 不决定 replicated visibility truth | + +这些行为的共同特征是: + +1. 局部 +2. 可测试 +3. 不依赖 cluster-level 推断 +4. 不应该自己解释系统语义 + +### 12.2 不继续复用的 `V1` 语义 + +下面这些即使在 `V1` 中曾经“工作过”,也不应该进入 `V2` truth: + +- ready from existence +- healthy from non-empty publication +- `sync_all` from vacuous barrier success +- promote implies replication closure +- assignment arrival implies runtime closure +- first write implicitly fixes transport state + +`V2` 可以复用 `V1` 的动作,但不能继承这些旧语义。 + +### 12.3 `weed/` 中当前改动的地位 + +当前 branch 中 `weed/` 的很多改动更接近: + +- 现象验证 +- integration closure +- debug/diagnostic surfaces +- 暂时性 runtime fix + +它们的价值在于暴露现实、定位问题、验证边界。 + +但长期语义 authority 不应该放在这些改动本身上。 + +长期可保留的,应当是那些已经被 `V2` 吸收为: + +- backend primitive +- adapter boundary +- projection surface + +的部分。 + +--- + +## 13. 长期资产 vs 当前实现现实 + +当前项目中最重要的区分不是“哪个文件在跑”,而是“哪个资产值得长期保留”。 + +### 13.1 长期资产 + +长期需要保留和演进的是 `sw-block/` 中的 `V2` 语义资产: + +- `v2-protocol-truths.md` +- `v2-protocol-closure-map.zh.md` +- `v2-protocol-claim-and-evidence.md` +- `v2-reuse-replacement-boundary.md` +- `v2_mini_core_design.md` +- `sw-block/engine/replication/` 中逐步成形的 core semantics + +这些资产定义的是: + +- truth +- claim +- closure +- reliability model +- reusable semantics + +### 13.2 当前实现现实 + +`weed/` 中的当前改动更多代表: + +- 当前 chosen path 的运行现实 +- backend / adapter / publication 的实现尝试 +- 用来暴露矛盾和验证边界的现实载体 + +因此,它们不应该被自动视为长期保留资产。 + +更准确的原则是: + +- `weed/` 中的改动,只有在被 `V2` 语义明确吸收之后,才应该作为长期实现保留 +- 否则,它们可以只是阶段性的验证资产 + +### 13.3 一个总规则 + +> `V2 core` 的可靠性不是来自信任当前 `weed/` 分支实现。 +> 它来自只消费被 claim/evidence ledger 接受的约束,并只复用 `V1` 中已知可靠的实现行为。 +> 因此,`weed/` 中的当前改动可以是临时验证资产,而 `sw-block/` 中的 truth / claim / core design 才是长期保留资产。 + +这个规则的好处是: + +1. 不会因为当前 integration patch 看起来有效,就把它误当成长期语义 +2. 不会因为 `V1` 仍被复用,就把旧混态继续当作 authority +3. 后续收敛分支时,可以明确区分: + - 哪些东西应进入长期 `V2` 资产 + - 哪些东西只是当前实现现实 + +--- + +## 14. 推荐的实施顺序 + +### 近期 +1. 继续收紧 assignment -> readiness -> publication closure +2. 用 `CP13-8` 证明当前 split 能否识别真实 bug 类型 +3. 把 readiness / projection 固化为稳定 surface + +### 中期 +1. 引入 command/event 风格的 `V2 core` +2. 减少 `BlockService` 中的语义判断 +3. 把 registry 收敛为 projection store +4. 抽 backend interface + +### 更后面 +1. 评估是否需要物理独立的 `V2 core process` +2. 如果需要,那是因为逻辑已经独立,不是为了“好看” + +--- + +## 15. 最短结论 + +你要的“更工程化”版本可以归纳为一句话: + +- `V2 core` 负责定义 truth、state、event、command、projection +- `adapter` 负责隔离 `V1` 污染并翻译输入输出 +- `V1 backend` 负责 WAL / transport / frontend / rebuild 执行 +- 后续 phase 的方向不是换路线,而是把这件事一步步显式化并固化成正式结构 + +如果你愿意,下一步我可以继续给你一版更像真正设计文档里的内容: + +- `V2 core` 的 Go package 目录建议 +- 每个 struct/event/command 放在哪个文件 +- 一个最小 `ApplyEvent() -> Decide() -> EmitCommands()` 伪代码骨架 \ No newline at end of file diff --git a/weed/server/block_heartbeat_loop.go b/weed/server/block_heartbeat_loop.go index e2804447b..7d2fcb38e 100644 --- a/weed/server/block_heartbeat_loop.go +++ b/weed/server/block_heartbeat_loop.go @@ -93,7 +93,7 @@ func (c *BlockVolumeHeartbeatCollector) Run() { select { case <-ticker.C: // Outbound: collect and report status. - msgs := c.blockService.Store().CollectBlockVolumeHeartbeat() + msgs := c.blockService.CollectBlockVolumeHeartbeat() c.safeCallback(msgs) // Inbound: process any pending assignments. c.processAssignments() @@ -115,7 +115,7 @@ func (c *BlockVolumeHeartbeatCollector) processAssignments() { if len(assignments) == 0 { return } - errs := c.blockService.Store().ProcessBlockVolumeAssignments(assignments) + errs := c.blockService.ApplyAssignments(assignments) c.cbMu.Lock() cb := c.assignmentCallback c.cbMu.Unlock() diff --git a/weed/server/block_heartbeat_loop_test.go b/weed/server/block_heartbeat_loop_test.go index 7af33c6f6..2caa8c03e 100644 --- a/weed/server/block_heartbeat_loop_test.go +++ b/weed/server/block_heartbeat_loop_test.go @@ -463,6 +463,42 @@ func TestBlockAssign_NilSource(t *testing.T) { } } +// TestBlockAssign_CollectorUsesAuthoritativeLifecycle verifies the heartbeat +// collector now drives the full BlockService assignment path, not the store-only +// role path. A replica assignment must start the receiver and close publish +// readiness. +func TestBlockAssign_CollectorUsesAuthoritativeLifecycle(t *testing.T) { + bs := newTestBlockService(t) + path := testBlockVolPath(t, bs) + + collector := NewBlockVolumeHeartbeatCollector(bs, 5*time.Millisecond) + collector.SetAssignmentSource(func() []blockvol.BlockVolumeAssignment { + return []blockvol.BlockVolumeAssignment{{ + Path: path, + Epoch: 1, + Role: uint32(blockvol.RoleReplica), + ReplicaDataAddr: ":0", + ReplicaCtrlAddr: ":0", + }} + }) + go collector.Run() + defer collector.Stop() + + deadline := time.After(500 * time.Millisecond) + for { + dataAddr, ctrlAddr := bs.GetReplState(path) + readiness := bs.ReadinessSnapshot(path) + if dataAddr != "" && ctrlAddr != "" && readiness.ReceiverReady && readiness.PublishHealthy { + return + } + select { + case <-deadline: + t.Fatalf("collector did not start replica receiver: data=%q ctrl=%q readiness=%+v", dataAddr, ctrlAddr, readiness) + case <-time.After(10 * time.Millisecond): + } + } +} + // TestBlockAssign_MixedBatch verifies a batch with 1 success, 1 unknown volume, // and 1 invalid transition returns parallel errors correctly. func TestBlockAssign_MixedBatch(t *testing.T) { diff --git a/weed/server/master_block_observability_test.go b/weed/server/master_block_observability_test.go index 72ef64e14..a58fe738e 100644 --- a/weed/server/master_block_observability_test.go +++ b/weed/server/master_block_observability_test.go @@ -148,7 +148,8 @@ func TestClusterHealthSummary(t *testing.T) { Path: "/data/healthy.blk", Role: blockvol.RoleToWire(blockvol.RolePrimary), ReplicaFactor: 2, - Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica)}}, + ReplicaReady: true, + Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica), Ready: true}}, Status: StatusActive, }) @@ -188,7 +189,8 @@ func TestBlockStatusHandler_IncludesHealthCounts(t *testing.T) { Path: "/data/status.blk", Role: blockvol.RoleToWire(blockvol.RolePrimary), ReplicaFactor: 2, - Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica)}}, + ReplicaReady: true, + Replicas: []ReplicaInfo{{Server: "vs2:9333", Role: blockvol.RoleToWire(blockvol.RoleReplica), Ready: true}}, Status: StatusActive, }) diff --git a/weed/server/master_block_registry_test.go b/weed/server/master_block_registry_test.go index 1f656ee81..14b20789e 100644 --- a/weed/server/master_block_registry_test.go +++ b/weed/server/master_block_registry_test.go @@ -1965,3 +1965,45 @@ func TestRegistry_InflightBlocksAutoRegister(t *testing.T) { t.Fatalf("replica health not updated after inflight released: %f", entry.Replicas[0].HealthScore) } } + +func TestRegistry_ReplicaReadyRequiresReplicaHeartbeat(t *testing.T) { + r := NewBlockVolumeRegistry() + if err := r.Register(&BlockVolumeEntry{ + Name: "vol-ready", + VolumeServer: "primary-server:8080", + Path: "/blocks/vol-ready.blk", + Status: StatusActive, + Replicas: []ReplicaInfo{{ + Server: "replica-server:8080", + Path: "/blocks/vol-ready.blk", + }}, + }); err != nil { + t.Fatalf("register: %v", err) + } + + entry, _ := r.Lookup("vol-ready") + if entry.ReplicaReady { + t.Fatal("replica should not be ready before replica heartbeat confirms publication") + } + if !entry.ReplicaDegraded { + t.Fatal("volume should remain degraded until replica readiness closes") + } + + r.UpdateFullHeartbeat("replica-server:8080", []*master_pb.BlockVolumeInfoMessage{{ + Path: "/blocks/vol-ready.blk", + Epoch: 1, + Role: uint32(blockvol.RoleReplica), + VolumeSize: 1 << 30, + HealthScore: 0.9, + ReplicaDataAddr: "10.0.0.2:14260", + ReplicaCtrlAddr: "10.0.0.2:14261", + }}, "") + + entry, _ = r.Lookup("vol-ready") + if !entry.Replicas[0].Ready { + t.Fatal("replica heartbeat with published receiver addresses should mark replica ready") + } + if !entry.ReplicaReady { + t.Fatal("aggregate replica readiness should become true after replica heartbeat") + } +} diff --git a/weed/server/master_server_handlers_block.go b/weed/server/master_server_handlers_block.go index 02fdad76f..a101f11c9 100644 --- a/weed/server/master_server_handlers_block.go +++ b/weed/server/master_server_handlers_block.go @@ -394,6 +394,7 @@ func entryToVolumeInfo(e *BlockVolumeEntry, primaryAlive bool) blockapi.VolumeIn ReplicaDataAddr: e.ReplicaDataAddr, ReplicaCtrlAddr: e.ReplicaCtrlAddr, ReplicaFactor: rf, + ReplicaReady: e.ReplicaReady, HealthScore: e.HealthScore, ReplicaDegraded: e.ReplicaDegraded, DurabilityMode: durMode, @@ -407,6 +408,7 @@ func entryToVolumeInfo(e *BlockVolumeEntry, primaryAlive bool) blockapi.VolumeIn Server: ri.Server, ISCSIAddr: ri.ISCSIAddr, IQN: ri.IQN, + Ready: ri.Ready, HealthScore: ri.HealthScore, WALLag: ri.WALLag, }) diff --git a/weed/server/volume_server_block.go b/weed/server/volume_server_block.go index 905c2bef5..ceb46fc8c 100644 --- a/weed/server/volume_server_block.go +++ b/weed/server/volume_server_block.go @@ -24,6 +24,23 @@ type volReplState struct { replicaCtrlAddr string // allReplicas stores the full replica set for multi-replica idempotence. allReplicas []blockvol.ReplicaAddr + roleApplied bool + receiverReady bool + shipperConfigured bool + replicaEligible bool + publishHealthy bool +} + +// BlockReadinessSnapshot names the assignment-to-publication closure at the +// BlockService boundary. These flags are owned by the service/adapter layer, +// not by blockvol's local storage mechanics. +type BlockReadinessSnapshot struct { + RoleApplied bool + ReceiverReady bool + ShipperConfigured bool + ShipperConnected bool + ReplicaEligible bool + PublishHealthy bool } // NVMeConfig holds NVMe/TCP target configuration passed from CLI flags. @@ -373,6 +390,15 @@ func (bs *BlockService) DeleteBlockVol(name string) error { // ProcessAssignments applies assignments from master, including replication setup. // V2 bridge: also delivers each assignment to the V2 engine for recovery ownership. func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAssignment) { + _ = bs.ApplyAssignments(assignments) +} + +// ApplyAssignments applies assignments through the single authoritative +// BlockService lifecycle: role apply, replication wiring, and publication +// readiness bookkeeping. Returns per-assignment errors parallel to the input. +func (bs *BlockService) ApplyAssignments(assignments []blockvol.BlockVolumeAssignment) []error { + errs := make([]error, len(assignments)) + // V2 bridge: convert and deliver to engine orchestrator (Phase 08 P1). // P3: skip V2 processing for repeated unchanged assignments. // P4: RecoveryManager starts/cancels recovery goroutines based on results. @@ -400,9 +426,9 @@ func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAss // V1 processing (requires blockStore). if bs.blockStore == nil { - return + return errs } - for _, a := range assignments { + for i, a := range assignments { role := blockvol.RoleFromWire(a.Role) ttl := blockvol.LeaseTTLFromWire(a.LeaseTtlMs) @@ -410,22 +436,30 @@ func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAss if err := bs.blockStore.WithVolume(a.Path, func(vol *blockvol.BlockVol) error { return vol.HandleAssignment(a.Epoch, role, ttl) }); err != nil { + errs[i] = err glog.Warningf("block service: assignment %s epoch=%d role=%s: %v", a.Path, a.Epoch, role, err) continue } + bs.noteRoleApplied(a.Path, role) // 2. Replication setup based on role + addresses. switch role { case blockvol.RolePrimary: // CP8-2: ReplicaAddrs (multi-replica) takes precedence over scalar fields. if len(a.ReplicaAddrs) > 0 { - bs.setupPrimaryReplicationMulti(a.Path, a.ReplicaAddrs) + if err := bs.setupPrimaryReplicationMulti(a.Path, a.ReplicaAddrs); err != nil { + errs[i] = err + } } else if a.ReplicaDataAddr != "" && a.ReplicaCtrlAddr != "" { - bs.setupPrimaryReplication(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr) + if err := bs.setupPrimaryReplication(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr); err != nil { + errs[i] = err + } } case blockvol.RoleReplica: if a.ReplicaDataAddr != "" && a.ReplicaCtrlAddr != "" { - bs.setupReplicaReceiver(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr) + if err := bs.setupReplicaReceiver(a.Path, a.ReplicaDataAddr, a.ReplicaCtrlAddr); err != nil { + errs[i] = err + } } case blockvol.RoleRebuilding: if a.RebuildAddr != "" { @@ -433,18 +467,23 @@ func (bs *BlockService) ProcessAssignments(assignments []blockvol.BlockVolumeAss } } } + return errs } // setupPrimaryReplication configures WAL shipping from primary to replica // and starts the rebuild server (R1-2). -func (bs *BlockService) setupPrimaryReplication(path, replicaDataAddr, replicaCtrlAddr string) { +func (bs *BlockService) setupPrimaryReplication(path, replicaDataAddr, replicaCtrlAddr string) error { // P3 idempotence: skip if replica state is unchanged. bs.replMu.RLock() existing := bs.replStates[path] bs.replMu.RUnlock() if existing != nil && existing.replicaDataAddr == replicaDataAddr && existing.replicaCtrlAddr == replicaCtrlAddr { // Unchanged repeated assignment — idempotent, no side effects. - return + bs.markPrimaryTransportConfigured(path, []blockvol.ReplicaAddr{{ + DataAddr: replicaDataAddr, + CtrlAddr: replicaCtrlAddr, + }}) + return nil } // Compute deterministic rebuild listen address. @@ -465,27 +504,19 @@ func (bs *BlockService) setupPrimaryReplication(path, replicaDataAddr, replicaCt return nil }); err != nil { glog.Warningf("block service: setup primary replication %s: %v", path, err) - return + return err } - // Track replication state for heartbeat reporting (R1-4). - // These addresses are what the primary ships to — they come from the - // master's assignment. They should already be canonical (from - // AllocateBlockVolumeResponse), but if not, they'll be reported as-is. - bs.replMu.Lock() - if bs.replStates == nil { - bs.replStates = make(map[string]*volReplState) - } - bs.replStates[path] = &volReplState{ - replicaDataAddr: replicaDataAddr, - replicaCtrlAddr: replicaCtrlAddr, - } - bs.replMu.Unlock() + bs.markPrimaryTransportConfigured(path, []blockvol.ReplicaAddr{{ + DataAddr: replicaDataAddr, + CtrlAddr: replicaCtrlAddr, + }}) glog.V(0).Infof("block service: primary %s shipping WAL to %s/%s (rebuild=%s)", path, replicaDataAddr, replicaCtrlAddr, rebuildAddr) + return nil } // setupPrimaryReplicationMulti configures WAL shipping from primary to N replicas // using SetReplicaAddrs (CP8-2: multi-replica support). -func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockvol.ReplicaAddr) { +func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockvol.ReplicaAddr) error { // P3 idempotence: skip if ALL replica addresses unchanged. // Compare full replica set, not just the first entry. if len(addrs) > 0 { @@ -493,7 +524,8 @@ func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockv existing := bs.replStates[path] bs.replMu.RUnlock() if existing != nil && bs.multiReplicaUnchanged(path, addrs) { - return + bs.markPrimaryTransportConfigured(path, addrs) + return nil } } @@ -513,30 +545,15 @@ func (bs *BlockService) setupPrimaryReplicationMulti(path string, addrs []blockv return nil }); err != nil { glog.Warningf("block service: setup primary replication (multi) %s: %v", path, err) - return + return err } - // Track replication state for heartbeat reporting. - bs.replMu.Lock() - if bs.replStates == nil { - bs.replStates = make(map[string]*volReplState) - } - // Store full replica set + first replica for backward compat heartbeat. - if len(addrs) > 0 { - // Copy the addrs slice to avoid aliasing. - copied := make([]blockvol.ReplicaAddr, len(addrs)) - copy(copied, addrs) - bs.replStates[path] = &volReplState{ - replicaDataAddr: addrs[0].DataAddr, - replicaCtrlAddr: addrs[0].CtrlAddr, - allReplicas: copied, - } - } - bs.replMu.Unlock() + bs.markPrimaryTransportConfigured(path, addrs) glog.V(0).Infof("block service: primary %s shipping WAL to %d replicas (rebuild=%s)", path, len(addrs), rebuildAddr) + return nil } // setupReplicaReceiver starts the replica WAL receiver. -func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) { +func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) error { // CP13-2: Pass the routable advertisedIP (from -ip flag, NOT from -id/serverID) // so wildcard-bind listeners resolve to a real IP, not an opaque identity string. var canonDataAddr, canonCtrlAddr string @@ -559,7 +576,7 @@ func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) { return nil }); err != nil { glog.Warningf("block service: setup replica receiver %s: %v", path, err) - return + return err } // Fallback to assignment addresses if receiver didn't report. if canonDataAddr == "" { @@ -568,16 +585,9 @@ func (bs *BlockService) setupReplicaReceiver(path, dataAddr, ctrlAddr string) { if canonCtrlAddr == "" { canonCtrlAddr = ctrlAddr } - bs.replMu.Lock() - if bs.replStates == nil { - bs.replStates = make(map[string]*volReplState) - } - bs.replStates[path] = &volReplState{ - replicaDataAddr: canonDataAddr, - replicaCtrlAddr: canonCtrlAddr, - } - bs.replMu.Unlock() + bs.markReceiverReady(path, canonDataAddr, canonCtrlAddr) glog.V(0).Infof("block service: replica %s receiving on %s/%s", path, canonDataAddr, canonCtrlAddr) + return nil } // startRebuild starts a rebuild in the background. @@ -722,8 +732,10 @@ func (bs *BlockService) CollectBlockVolumeHeartbeat() []blockvol.BlockVolumeInfo defer bs.replMu.RUnlock() for i := range msgs { if s, ok := bs.replStates[msgs[i].Path]; ok { - msgs[i].ReplicaDataAddr = s.replicaDataAddr - msgs[i].ReplicaCtrlAddr = s.replicaCtrlAddr + if s.publishHealthy { + msgs[i].ReplicaDataAddr = s.replicaDataAddr + msgs[i].ReplicaCtrlAddr = s.replicaCtrlAddr + } } // NVMe publication: report nvme_addr and nqn if NVMe target is running. if bs.nvmeListenAddr != "" { @@ -758,6 +770,108 @@ func (bs *BlockService) multiReplicaUnchanged(path string, addrs []blockvol.Repl return true } +func (bs *BlockService) ensureReplStateLocked(path string) *volReplState { + if bs.replStates == nil { + bs.replStates = make(map[string]*volReplState) + } + state := bs.replStates[path] + if state == nil { + state = &volReplState{} + bs.replStates[path] = state + } + return state +} + +func (bs *BlockService) noteRoleApplied(path string, role blockvol.Role) { + bs.replMu.Lock() + defer bs.replMu.Unlock() + state := bs.ensureReplStateLocked(path) + state.roleApplied = true + switch role { + case blockvol.RoleReplica: + state.receiverReady = false + state.shipperConfigured = false + state.replicaEligible = false + state.publishHealthy = false + case blockvol.RolePrimary: + state.receiverReady = false + state.shipperConfigured = false + state.replicaEligible = false + state.publishHealthy = true + case blockvol.RoleRebuilding: + state.receiverReady = false + state.shipperConfigured = false + state.replicaEligible = false + state.publishHealthy = false + default: + state.receiverReady = false + state.shipperConfigured = false + state.replicaEligible = false + state.publishHealthy = false + state.replicaDataAddr = "" + state.replicaCtrlAddr = "" + state.allReplicas = nil + } +} + +func (bs *BlockService) markPrimaryTransportConfigured(path string, addrs []blockvol.ReplicaAddr) { + bs.replMu.Lock() + defer bs.replMu.Unlock() + state := bs.ensureReplStateLocked(path) + state.shipperConfigured = len(addrs) > 0 + state.publishHealthy = true + state.replicaEligible = false + state.receiverReady = false + if len(addrs) == 0 { + state.replicaDataAddr = "" + state.replicaCtrlAddr = "" + state.allReplicas = nil + return + } + copied := make([]blockvol.ReplicaAddr, len(addrs)) + copy(copied, addrs) + state.allReplicas = copied + state.replicaDataAddr = addrs[0].DataAddr + state.replicaCtrlAddr = addrs[0].CtrlAddr +} + +func (bs *BlockService) markReceiverReady(path, dataAddr, ctrlAddr string) { + bs.replMu.Lock() + defer bs.replMu.Unlock() + state := bs.ensureReplStateLocked(path) + state.receiverReady = true + state.replicaEligible = true + state.publishHealthy = true + state.shipperConfigured = false + state.replicaDataAddr = dataAddr + state.replicaCtrlAddr = ctrlAddr + state.allReplicas = nil +} + +// ReadinessSnapshot reports the service-owned assignment/readiness closure for +// one volume. It keeps v2 publication truth above blockvol's local mechanics. +func (bs *BlockService) ReadinessSnapshot(path string) BlockReadinessSnapshot { + snap := BlockReadinessSnapshot{} + bs.replMu.RLock() + state := bs.replStates[path] + if state != nil { + snap.RoleApplied = state.roleApplied + snap.ReceiverReady = state.receiverReady + snap.ShipperConfigured = state.shipperConfigured + snap.ReplicaEligible = state.replicaEligible + snap.PublishHealthy = state.publishHealthy + } + bs.replMu.RUnlock() + if !snap.ShipperConfigured || bs.blockStore == nil { + return snap + } + _ = bs.blockStore.WithVolume(path, func(vol *blockvol.BlockVol) error { + snap.ShipperConnected = len(vol.ReplicaShipperStates()) > 0 && !vol.Status().ReplicaDegraded + return nil + }) + return snap +} + // --- P3: Assignment idempotence --- // lastAppliedAssignment stores the full assignment for idempotence comparison. diff --git a/weed/server/volume_server_block_debug.go b/weed/server/volume_server_block_debug.go index 747bc5f25..7c5e16df4 100644 --- a/weed/server/volume_server_block_debug.go +++ b/weed/server/volume_server_block_debug.go @@ -17,13 +17,19 @@ type ShipperDebugInfo struct { // BlockVolumeDebugInfo is the real-time block volume state. type BlockVolumeDebugInfo struct { - Path string `json:"path"` - Role string `json:"role"` - Epoch uint64 `json:"epoch"` - HeadLSN uint64 `json:"head_lsn"` - Degraded bool `json:"degraded"` - Shippers []ShipperDebugInfo `json:"shippers,omitempty"` - Timestamp string `json:"timestamp"` + Path string `json:"path"` + Role string `json:"role"` + Epoch uint64 `json:"epoch"` + HeadLSN uint64 `json:"head_lsn"` + Degraded bool `json:"degraded"` + RoleApplied bool `json:"role_applied"` + ReceiverReady bool `json:"receiver_ready"` + ShipperConfigured bool `json:"shipper_configured"` + ShipperConnected bool `json:"shipper_connected"` + ReplicaEligible bool `json:"replica_eligible"` + PublishHealthy bool `json:"publish_healthy"` + Shippers []ShipperDebugInfo `json:"shippers,omitempty"` + Timestamp string `json:"timestamp"` } // debugBlockShipperHandler returns real-time shipper state for all block volumes. @@ -48,13 +54,20 @@ func (vs *VolumeServer) debugBlockShipperHandler(w http.ResponseWriter, r *http. var infos []BlockVolumeDebugInfo store.IterateBlockVolumes(func(path string, vol *blockvol.BlockVol) { status := vol.Status() + readiness := vs.blockService.ReadinessSnapshot(path) info := BlockVolumeDebugInfo{ - Path: path, - Role: status.Role.String(), - Epoch: status.Epoch, - HeadLSN: status.WALHeadLSN, - Degraded: status.ReplicaDegraded, - Timestamp: time.Now().UTC().Format(time.RFC3339Nano), + Path: path, + Role: status.Role.String(), + Epoch: status.Epoch, + HeadLSN: status.WALHeadLSN, + Degraded: status.ReplicaDegraded, + RoleApplied: readiness.RoleApplied, + ReceiverReady: readiness.ReceiverReady, + ShipperConfigured: readiness.ShipperConfigured, + ShipperConnected: readiness.ShipperConnected, + ReplicaEligible: readiness.ReplicaEligible, + PublishHealthy: readiness.PublishHealthy, + Timestamp: time.Now().UTC().Format(time.RFC3339Nano), } // Get per-shipper state from ShipperGroup if available. diff --git a/weed/storage/blockvol/blockapi/types.go b/weed/storage/blockvol/blockapi/types.go index 04aed511a..db7545f2a 100644 --- a/weed/storage/blockvol/blockapi/types.go +++ b/weed/storage/blockvol/blockapi/types.go @@ -36,6 +36,7 @@ type VolumeInfo struct { // CP8-2: Multi-replica fields. ReplicaFactor int `json:"replica_factor"` Replicas []ReplicaDetail `json:"replicas,omitempty"` + ReplicaReady bool `json:"replica_ready,omitempty"` HealthScore float64 `json:"health_score"` ReplicaDegraded bool `json:"replica_degraded,omitempty"` DurabilityMode string `json:"durability_mode"` // CP8-3-1 @@ -71,6 +72,7 @@ type ReplicaDetail struct { Server string `json:"server"` ISCSIAddr string `json:"iscsi_addr,omitempty"` IQN string `json:"iqn,omitempty"` + Ready bool `json:"ready,omitempty"` HealthScore float64 `json:"health_score"` WALLag uint64 `json:"wal_lag,omitempty"` } diff --git a/weed/storage/blockvol/test/component/replica_read_test.go b/weed/storage/blockvol/test/component/replica_read_test.go new file mode 100644 index 000000000..b6871146f --- /dev/null +++ b/weed/storage/blockvol/test/component/replica_read_test.go @@ -0,0 +1,155 @@ +package component + +import ( + "bytes" + "net" + "testing" + "time" + + "github.com/seaweedfs/seaweedfs/weed/storage/blockvol" +) + +// TestReplicaReadAfterShip verifies that data shipped from primary to replica +// via WAL replication is readable on the replica via ReadLBA. +// +// This reproduces the CP13-8 bug: replica iSCSI reads zeros despite +// replicated data in WAL (sync_all barrier confirmed). +func TestReplicaReadAfterShip(t *testing.T) { + primaryPath := t.TempDir() + "/primary.blk" + replicaPath := t.TempDir() + "/replica.blk" + + primary, err := blockvol.CreateBlockVol(primaryPath, blockvol.CreateOptions{ + VolumeSize: 4 * 1024 * 1024, + BlockSize: 4096, + WALSize: 1 * 1024 * 1024, + }) + if err != nil { + t.Fatal(err) + } + defer primary.Close() + + replica, err := blockvol.CreateBlockVol(replicaPath, blockvol.CreateOptions{ + VolumeSize: 4 * 1024 * 1024, + BlockSize: 4096, + WALSize: 1 * 1024 * 1024, + }) + if err != nil { + t.Fatal(err) + } + defer replica.Close() + + // Assign roles. + primary.HandleAssignment(1, blockvol.RolePrimary, 30*time.Second) + replica.HandleAssignment(1, blockvol.RoleReplica, 30*time.Second) + + // Start replica receiver. + if err := replica.StartReplicaReceiver(":0", ":0"); err != nil { + t.Fatal(err) + } + recvAddr := replica.ReplicaReceiverAddr() + if recvAddr == nil { + t.Fatal("replica receiver not started") + } + t.Logf("replica receiver: data=%s ctrl=%s", recvAddr.DataAddr, recvAddr.CtrlAddr) + + // Wire shipper from primary to replica. + primary.SetReplicaAddr(recvAddr.DataAddr, recvAddr.CtrlAddr) + + // Write on primary — should ship to replica. + writeData := bytes.Repeat([]byte{0xAB}, 4096) + if err := primary.WriteLBA(0, writeData); err != nil { + t.Fatalf("primary WriteLBA(0): %v", err) + } + + // Give shipping + apply time. + time.Sleep(2 * time.Second) + + // Read from REPLICA. + replicaData, err := replica.ReadLBA(0, 4096) + if err != nil { + t.Fatalf("replica ReadLBA(0): %v", err) + } + + if replicaData[0] == 0x00 { + t.Fatalf("BUG REPRODUCED: replica ReadLBA returns zeros (first byte=0x%02x, want 0xAB)"+ + "\nData is in replica WAL but ReadLBA returns zeros", replicaData[0]) + } + if !bytes.Equal(replicaData, writeData) { + t.Fatalf("replica data mismatch: first byte=0x%02x, want 0xAB", replicaData[0]) + } + t.Log("replica ReadLBA after ship: OK (data matches primary)") +} + +// TestReplicaReadDirectApply bypasses the shipper entirely and manually +// ships a WAL entry via TCP to the replica receiver, then reads it back. +func TestReplicaReadDirectApply(t *testing.T) { + replicaPath := t.TempDir() + "/replica.blk" + vol, err := blockvol.CreateBlockVol(replicaPath, blockvol.CreateOptions{ + VolumeSize: 4 * 1024 * 1024, + BlockSize: 4096, + WALSize: 1 * 1024 * 1024, + }) + if err != nil { + t.Fatal(err) + } + defer vol.Close() + + vol.HandleAssignment(1, blockvol.RoleReplica, 30*time.Second) + + if err := vol.StartReplicaReceiver(":0", ":0"); err != nil { + t.Fatal(err) + } + recvAddr := vol.ReplicaReceiverAddr() + t.Logf("replica: data=%s ctrl=%s", recvAddr.DataAddr, recvAddr.CtrlAddr) + + // Directly connect and ship a WAL entry. + conn, err := net.DialTimeout("tcp", recvAddr.DataAddr, 3*time.Second) + if err != nil { + t.Fatalf("connect: %v", err) + } + defer conn.Close() + + payload := bytes.Repeat([]byte{0xEF}, 4096) + entry := blockvol.WALEntry{ + LSN: 1, + Epoch: 1, + Type: blockvol.EntryTypeWrite, + LBA: 0, + Length: 4096, + Data: payload, + } + encoded, err := entry.Encode() + if err != nil { + t.Fatal(err) + } + if err := blockvol.WriteFrame(conn, blockvol.MsgWALEntry, encoded); err != nil { + t.Fatalf("ship: %v", err) + } + + time.Sleep(1 * time.Second) + + // Read back via ReadLBA. + data, err := vol.ReadLBA(0, 4096) + if err != nil { + t.Fatalf("ReadLBA: %v", err) + } + + if data[0] == 0x00 { + t.Fatalf("BUG: ReadLBA returns zeros after direct WAL apply (0x%02x, want 0xEF)", data[0]) + } + if data[0] != 0xEF { + t.Fatalf("unexpected data: 0x%02x, want 0xEF", data[0]) + } + t.Logf("direct apply ReadLBA: OK (0x%02x)", data[0]) + + // Also read via adapter (same path as iSCSI). + adapter := blockvol.NewBlockVolAdapter(vol) + adapterData, err := adapter.ReadAt(0, 4096) + if err != nil { + t.Fatalf("adapter ReadAt: %v", err) + } + if adapterData[0] != 0xEF { + t.Fatalf("adapter returns wrong data: 0x%02x, want 0xEF", adapterData[0]) + } + t.Log("adapter ReadAt: OK") +} diff --git a/weed/storage/blockvol/testrunner/actions/devops.go b/weed/storage/blockvol/testrunner/actions/devops.go index daeb762b0..f256e90d0 100644 --- a/weed/storage/blockvol/testrunner/actions/devops.go +++ b/weed/storage/blockvol/testrunner/actions/devops.go @@ -787,6 +787,11 @@ func waitVolumeHealthy(ctx context.Context, actx *tr.ActionContext, act tr.Actio continue } + if info.ReplicaFactor > 1 && !info.ReplicaReady { + actx.Log(" poll %d: replica assigned but not publish-ready yet", poll) + continue + } + // Check not degraded. if info.ReplicaDegraded { actx.Log(" poll %d: replica degraded, waiting...", poll) diff --git a/weed/storage/blockvol/testrunner/internal/blockapi/types.go b/weed/storage/blockvol/testrunner/internal/blockapi/types.go index f1cb9038e..d99190435 100644 --- a/weed/storage/blockvol/testrunner/internal/blockapi/types.go +++ b/weed/storage/blockvol/testrunner/internal/blockapi/types.go @@ -32,6 +32,7 @@ type VolumeInfo struct { ReplicaCtrlAddr string `json:"replica_ctrl_addr,omitempty"` ReplicaFactor int `json:"replica_factor"` Replicas []ReplicaDetail `json:"replicas,omitempty"` + ReplicaReady bool `json:"replica_ready,omitempty"` HealthScore float64 `json:"health_score"` ReplicaDegraded bool `json:"replica_degraded,omitempty"` DurabilityMode string `json:"durability_mode"` @@ -45,6 +46,7 @@ type ReplicaDetail struct { Server string `json:"server"` ISCSIAddr string `json:"iscsi_addr,omitempty"` IQN string `json:"iqn,omitempty"` + Ready bool `json:"ready,omitempty"` HealthScore float64 `json:"health_score"` WALLag uint64 `json:"wal_lag,omitempty"` } diff --git a/weed/storage/blockvol/testrunner/scenarios/internal/cp13-8-real-workload-validation.yaml b/weed/storage/blockvol/testrunner/scenarios/internal/cp13-8-real-workload-validation.yaml index 9e7bd24d8..96648e555 100644 --- a/weed/storage/blockvol/testrunner/scenarios/internal/cp13-8-real-workload-validation.yaml +++ b/weed/storage/blockvol/testrunner/scenarios/internal/cp13-8-real-workload-validation.yaml @@ -1,240 +1,368 @@ name: cp13-8-real-workload-validation -timeout: 20m +timeout: 15m # CP13-8: Bounded real-workload validation for RF=2 sync_all. # -# Workload envelope: -# Topology: RF=2 sync_all, cross-machine replication (m01 ↔ M02) -# Transport: iSCSI (primary frontend) +# Envelope: +# Topology: RF=2 sync_all, cross-machine (m01 ↔ M02) +# Transport: iSCSI # Workloads: ext4 (filesystem) + PostgreSQL pgbench (application) -# Disturbance: one bounded failover (kill primary, promote replica) -# Exclusions: NVMe-TCP, RF>2, hours/days soak, degraded-mode perf +# Disturbance: one bounded failover (kill primary, auto-promote replica) +# Exclusions: NVMe-TCP, RF>2, soak, degraded-mode, mode normalization # -# What this validates: -# The accepted CP13-1..7 replication contract survives contact with -# real filesystem and database consumers. Specifically: -# - Replicated writes are durable on both nodes (ext4 file integrity) -# - Post-failover data is consistent (fsck + file count) -# - Database transactions are durable under sync_all (pgbench TPC-B) -# -# What this does NOT validate: -# - Production rollout readiness -# - Performance floor (see Phase 12 P4) -# - Degraded mode behavior -# - NVMe-TCP transport path -# - Mode normalization (CP13-9) +# Flow: +# 1. Create RF=2 sync_all — NO promote (use initial primary as-is) +# 2. Wait for replication healthy (shipper connected, not degraded) +# 3. Write ext4 + 200 files on primary +# 4. Kill primary → auto-failover promotes replica +# 5. Verify ext4 on promoted replica (fsck + files + checksums) +# 6. pgbench on promoted replica env: - repo_dir: "C:/work/seaweedfs" + master_url: "http://10.0.0.3:9433" + volume_name: cp13-8-val + # 512MB: enough for ext4 + pgbench, small enough for mkfs + sync_all. + vol_size: "536870912" topology: nodes: - target_node: - host: "192.168.1.184" + m01: + host: 192.168.1.181 + alt_ips: ["10.0.0.1"] user: testdev - key: "C:/work/dev_server/testdev_key" - client_node: - host: "192.168.1.181" + key: "/opt/work/testdev_key" + m02: + host: 192.168.1.184 + alt_ips: ["10.0.0.3"] user: testdev - key: "C:/work/dev_server/testdev_key" - -targets: - primary: - node: target_node - vol_size: 100M - iscsi_port: 3280 - admin_port: 8095 - replica_data_port: 9040 - replica_ctrl_port: 9041 - rebuild_port: 9042 - iqn_suffix: cp13-8-primary - replica: - node: client_node - vol_size: 100M - iscsi_port: 3281 - admin_port: 8096 - replica_data_port: 9043 - replica_ctrl_port: 9044 - rebuild_port: 9045 - iqn_suffix: cp13-8-replica + key: "/opt/work/testdev_key" phases: - # --- Phase 1: Setup RF=2 sync_all pair --- - name: setup actions: - - action: kill_stale - node: target_node + - action: exec + node: m02 + cmd: "fuser -k 9433/tcp 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-cp138-master /tmp/sw-cp138-vs1; mkdir -p /tmp/sw-cp138-master /tmp/sw-cp138-vs1/blocks" + root: "true" ignore_error: true - - action: kill_stale - node: client_node + - action: exec + node: m01 + cmd: "fuser -k 18480/tcp 2>/dev/null; sleep 1; rm -rf /tmp/sw-cp138-vs2; mkdir -p /tmp/sw-cp138-vs2/blocks" + root: "true" ignore_error: true - - action: iscsi_cleanup - node: client_node - ignore_error: true - - action: build_deploy - - action: start_target - target: primary - create: "true" - durability_mode: sync_all - - action: start_target - target: replica - create: "true" - durability_mode: sync_all - - action: assign - target: replica - epoch: "1" - role: replica - lease_ttl: 60s - - action: assign - target: primary - epoch: "1" - role: primary - lease_ttl: 60s - - action: set_replica - target: primary - replica: replica - - action: sleep - duration: 2s - # --- Phase 2: ext4 filesystem workload --- - - name: ext4-write - actions: - - action: iscsi_login - target: primary - node: client_node - save_as: device - - action: mkfs - node: client_node - device: "{{ device }}" - fstype: ext4 - - action: mount - node: client_node - device: "{{ device }}" - mountpoint: /mnt/cp13-8 - # Write 200 files with known content. - - action: exec - node: client_node - root: "true" - cmd: "bash -c 'for i in $(seq 1 200); do dd if=/dev/urandom of=/mnt/cp13-8/file_$i bs=4k count=1 2>/dev/null; done && sync'" - # Compute checksums for later verification. - - action: exec - node: client_node - root: "true" - cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-checksums.txt && cat /tmp/cp13-8-checksums.txt | wc -l" - save_as: checksum_count - - action: assert_equal - actual: "{{ checksum_count }}" - expected: "200" - - action: umount - node: client_node - mountpoint: /mnt/cp13-8 - - action: iscsi_cleanup - node: client_node - ignore_error: true - # Wait for replication to catch up. - - action: wait_lsn - target: replica - min_lsn: "1" - timeout: 30s + - action: start_weed_master + node: m02 + port: "9433" + dir: /tmp/sw-cp138-master + extra_args: "-ip=10.0.0.3" + save_as: master_pid + - action: sleep duration: 3s - # --- Phase 3: Failover (kill primary, promote replica) --- + - action: start_weed_volume + node: m02 + port: "18480" + master: "10.0.0.3:9433" + dir: /tmp/sw-cp138-vs1 + extra_args: "-block.dir=/tmp/sw-cp138-vs1/blocks -block.listen=:3295 -ip=10.0.0.3" + save_as: vs1_pid + + - action: start_weed_volume + node: m01 + port: "18480" + master: "10.0.0.3:9433" + dir: /tmp/sw-cp138-vs2 + extra_args: "-block.dir=/tmp/sw-cp138-vs2/blocks -block.listen=:3295 -ip=10.0.0.1" + save_as: vs2_pid + + - action: sleep + duration: 3s + + - action: wait_cluster_ready + node: m02 + master_url: "{{ master_url }}" + + - action: wait_block_servers + count: "2" + + - action: create_block_volume + name: "{{ volume_name }}" + size_bytes: "{{ vol_size }}" + replica_factor: "2" + durability_mode: "sync_all" + + # Wait for assignment delivery (heartbeat cycle). + - action: sleep + duration: 15s + + # Bootstrap write: triggers shipper connect + first barrier. + # Without this, the shipper stays degraded because barrier-triggered + # recovery needs a write to fire the barrier. + - action: lookup_block_volume + name: "{{ volume_name }}" + save_as: boot_vol + + - action: iscsi_login_direct + node: m01 + host: "{{ boot_vol_iscsi_host }}" + port: "{{ boot_vol_iscsi_port }}" + iqn: "{{ boot_vol_iqn }}" + save_as: boot_device + ignore_error: true + + - action: exec + node: m01 + root: "true" + cmd: "dd if=/dev/urandom of={{ boot_device }} bs=4k count=1 seek=100000 oflag=direct,sync 2>/dev/null; true" + ignore_error: true + + - action: iscsi_cleanup + node: m01 + ignore_error: true + + - action: sleep + duration: 5s + + - action: wait_volume_healthy + name: "{{ volume_name }}" + timeout: 60s + + - action: discover_primary + name: "{{ volume_name }}" + save_as: pri + + - action: print + msg: "CP13-8 setup: primary={{ pri }} ({{ pri_server }}), replica={{ pri_replica_node }}" + + # --- Phase 2: ext4 filesystem workload on initial primary --- + - name: ext4-write + actions: + - action: lookup_block_volume + name: "{{ volume_name }}" + save_as: vol + + - action: iscsi_login_direct + node: m01 + host: "{{ vol_iscsi_host }}" + port: "{{ vol_iscsi_port }}" + iqn: "{{ vol_iqn }}" + save_as: device + + - action: exec + node: m01 + cmd: "mkfs.ext4 -F {{ device }} 2>&1 | tail -2" + root: "true" + + - action: exec + node: m01 + cmd: "mkdir -p /mnt/cp13-8 && mount {{ device }} /mnt/cp13-8" + root: "true" + + - action: exec + node: m01 + root: "true" + cmd: "for i in $(seq 1 200); do dd if=/dev/urandom of=/mnt/cp13-8/file_$i bs=4k count=1 2>/dev/null; done && sync && echo WRITE_DONE" + save_as: write_result + + - action: assert_contains + value: "{{ write_result }}" + contains: "WRITE_DONE" + + - action: exec + node: m01 + root: "true" + cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-pre.md5 && wc -l < /tmp/cp13-8-pre.md5" + save_as: pre_checksum_count + + - action: assert_equal + actual: "{{ pre_checksum_count }}" + expected: "200" + + - action: exec + node: m01 + cmd: "umount /mnt/cp13-8" + root: "true" + + - action: iscsi_cleanup + node: m01 + ignore_error: true + + - action: print + msg: "ext4-write: 200 files written, checksums captured" + + # Verify replication is healthy after all writes. + - action: wait_volume_healthy + name: "{{ volume_name }}" + timeout: 30s + + # --- Phase 3: Failover --- + # Kill ONLY the primary's VS, keep the replica alive for auto-promote. + # Master allocates primary to m01 first (by server registration order). + # Kill m01 VS (primary), m02 VS (replica) stays alive for promotion. - name: failover actions: - - action: kill_target - target: primary - - action: assign - target: replica - epoch: "2" - role: primary - lease_ttl: 60s - - action: wait_role - target: replica - role: primary - timeout: 10s + - action: print + msg: "=== Killing primary VS on m01 ===" + + - action: exec + node: m01 + cmd: "kill -9 {{ vs2_pid }}" + root: "true" + ignore_error: true + + # Wait for lease expiry (30s TTL) + auto-failover. + - action: sleep + duration: 50s + + # Wait for primary to change from m01 to m02. + - action: wait_block_primary + name: "{{ volume_name }}" + not: "{{ pri_server }}" + timeout: 60s + save_as: new_pri + + - action: print + msg: "Failover: {{ pri_server }} → {{ new_pri }}" + + - action: sleep + duration: 5s # --- Phase 4: ext4 verification on promoted replica --- - name: ext4-verify actions: - - action: iscsi_login - target: replica - node: client_node + - action: discover_primary + name: "{{ volume_name }}" + save_as: new + + - action: print + msg: "Verifying ext4 on promoted node {{ new }} ({{ new_server }})" + + # Connect to the new primary's iSCSI. + - action: iscsi_login_direct + node: m01 + host: "{{ new_host }}" + port: "3295" + iqn: "{{ vol_iqn }}" save_as: device2 - # fsck: filesystem integrity. + - action: fsck_ext4 - node: client_node + node: m01 device: "{{ device2 }}" save_as: fsck_result - # Mount and verify file count. - - action: mount - node: client_node - device: "{{ device2 }}" - mountpoint: /mnt/cp13-8 + + - action: print + msg: "fsck: {{ fsck_result }}" + - action: exec - node: client_node + node: m01 + cmd: "mkdir -p /mnt/cp13-8 && mount {{ device2 }} /mnt/cp13-8" + root: "true" + + - action: exec + node: m01 root: "true" cmd: "ls /mnt/cp13-8/file_* | wc -l" - save_as: post_failover_count + save_as: post_count + - action: assert_equal - actual: "{{ post_failover_count }}" + actual: "{{ post_count }}" expected: "200" - # Verify checksums match pre-failover. + - action: exec - node: client_node + node: m01 root: "true" - cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-checksums-post.txt && diff /tmp/cp13-8-checksums.txt /tmp/cp13-8-checksums-post.txt && echo MATCH" - save_as: checksum_match + cmd: "md5sum /mnt/cp13-8/file_* | sort > /tmp/cp13-8-post.md5 && diff /tmp/cp13-8-pre.md5 /tmp/cp13-8-post.md5 && echo CHECKSUM_MATCH" + save_as: checksum_diff + - action: assert_contains - value: "{{ checksum_match }}" - contains: "MATCH" - - action: umount - node: client_node - mountpoint: /mnt/cp13-8 + value: "{{ checksum_diff }}" + contains: "CHECKSUM_MATCH" + + - action: exec + node: m01 + cmd: "umount /mnt/cp13-8" + root: "true" + - action: iscsi_cleanup - node: client_node + node: m01 ignore_error: true - # --- Phase 5: pgbench on promoted replica (application workload) --- - - name: pgbench-on-replica + - action: print + msg: "ext4-verify: fsck CLEAN, 200 files, checksums MATCH" + + # --- Phase 5: pgbench on promoted replica --- + - name: pgbench actions: - - action: iscsi_login - target: replica - node: client_node + - action: iscsi_login_direct + node: m01 + host: "{{ new_host }}" + port: "3295" + iqn: "{{ vol_iqn }}" save_as: device3 + + - action: sleep + duration: 3s + - action: pgbench_init - node: client_node + node: m01 device: "{{ device3 }}" mount: "/mnt/cp13-8-pg" - port: "5440" + port: "5441" scale: "1" fstype: ext4 + - action: pgbench_run - node: client_node + node: m01 clients: "1" duration: "10" - save_as: tps_post_failover + save_as: tps + - action: print - msg: "CP13-8: pgbench TPC-B post-failover: {{ tps_post_failover }} TPS" + msg: "CP13-8 pgbench TPS: {{ tps }}" + - action: assert_greater - value: "{{ tps_post_failover }}" + actual: "{{ tps }}" threshold: "0" - # pgbench succeeded with TPS > 0 = database transactions are durable on the promoted replica. + - action: pgbench_cleanup - node: client_node - mount: "/mnt/cp13-8-pg" - port: "5440" + node: m01 ignore_error: true + - action: iscsi_cleanup - node: client_node + node: m01 ignore_error: true # --- Phase 6: Cleanup --- - name: cleanup always: true actions: + - action: exec + node: m01 + cmd: "umount /mnt/cp13-8 /mnt/cp13-8-pg 2>/dev/null; true" + root: "true" + ignore_error: true - action: iscsi_cleanup - node: client_node + node: m01 ignore_error: true - - action: stop_all_targets + - action: stop_weed + node: m01 + pid: "{{ vs2_pid }}" + ignore_error: true + - action: stop_weed + node: m01 + pid: "{{ vs2_new_pid }}" + ignore_error: true + - action: stop_weed + node: m02 + pid: "{{ vs1_pid }}" + ignore_error: true + - action: stop_weed + node: m02 + pid: "{{ vs1_new_pid }}" + ignore_error: true + - action: stop_weed + node: m02 + pid: "{{ master_pid }}" ignore_error: true diff --git a/weed/storage/store_blockvol.go b/weed/storage/store_blockvol.go index f2d18fc5a..86e59bcda 100644 --- a/weed/storage/store_blockvol.go +++ b/weed/storage/store_blockvol.go @@ -118,10 +118,14 @@ func (bs *BlockVolumeStore) WithVolume(path string, fn func(*blockvol.BlockVol) return fn(vol) } -// ProcessBlockVolumeAssignments applies a batch of assignments from master. -// Returns a slice of errors parallel to the input (nil = success). -// Unknown volumes and invalid transitions are logged and returned as errors, -// but do not stop processing of remaining assignments. +// ProcessBlockVolumeAssignments applies only the local role/epoch/lease part of +// a batch of assignments. It does NOT wire replica receivers, shippers, or +// publication readiness. The authoritative runtime lifecycle lives in +// BlockService.ApplyAssignments. +// +// Returns a slice of errors parallel to the input (nil = success). Unknown +// volumes and invalid transitions are logged and returned as errors, but do not +// stop processing of remaining assignments. func (bs *BlockVolumeStore) ProcessBlockVolumeAssignments( assignments []blockvol.BlockVolumeAssignment, ) []error {