From ea2ca11f73a17d07c56ce4b24ca1aebafbfcb3ae Mon Sep 17 00:00:00 2001 From: pingqiu Date: Sun, 3 May 2026 09:25:47 -0700 Subject: [PATCH] docs(p15): update G15b M02 rerun and TestOps registration design --- ...3-phase-15-g15b-k8s-qa-test-instruction.md | 16 +- ...phase-15-testops-pluggable-registration.md | 260 ++++++++++++++++++ ...3-phase-15-g15b-k8s-static-pv-mini-plan.md | 21 +- 3 files changed, 291 insertions(+), 6 deletions(-) create mode 100644 sw-block/design/test/v3-phase-15-testops-pluggable-registration.md diff --git a/sw-block/design/test/v3-phase-15-g15b-k8s-qa-test-instruction.md b/sw-block/design/test/v3-phase-15-g15b-k8s-qa-test-instruction.md index 6f2429884..9c506eb08 100644 --- a/sw-block/design/test/v3-phase-15-g15b-k8s-qa-test-instruction.md +++ b/sw-block/design/test/v3-phase-15-g15b-k8s-qa-test-instruction.md @@ -1,14 +1,14 @@ # V3 Phase 15 G15b Kubernetes Static PV QA Test Instruction **Date**: 2026-05-03 -**Status**: K8s lab instruction for `p15-g15b/k8s-static-pv@5375add`; execution pending +**Status**: K8s lab instruction for `p15-g15b/k8s-static-pv@eb13105`; M02 re-run pending **Scope**: single-node Kubernetes static PV/PVC/pod smoke through real V3 daemons and CSI. --- ## Headline -At `seaweed_block@5375add`, the G15b lab harness and image build inputs are staged to prove: +At `seaweed_block@eb13105`, the G15b lab harness, image build inputs, and M02 DNS/log-preservation fixes are staged to prove: ```text blockmaster + product-loop + r1/r2 blockvolume @@ -39,6 +39,12 @@ Known current local limitation: - On the current dev workstation, `kubectl` context `rancher-desktop` exists but API server is not reachable. This instruction needs QA or a running K8s lab. +M02 first-run blocker fixed: + +- `5375add` failed because `hostNetwork: true` blockvolume pods inherited host DNS and could not resolve `blockmaster.kube-system.svc.cluster.local`. +- `eb13105` adds `dnsPolicy: ClusterFirstWithHostNet` to both blockvolume pods. +- `eb13105` also collects daemon logs on every exit before cleanup, so failure evidence is preserved. + --- ## Commands @@ -66,6 +72,12 @@ G15B_KIND_CLUSTER= bash scripts/build-g15b-images.sh "$PWD" Local image build result already verified at `5375add`: PASS, images `sw-block:local` and `sw-block-csi:local` built. +After pulling `eb13105`, rebuild images before rerun: + +```bash +bash scripts/build-g15b-images.sh "$PWD" +``` + Kubernetes lab run from Linux or WSL with `kubectl` configured: ```bash diff --git a/sw-block/design/test/v3-phase-15-testops-pluggable-registration.md b/sw-block/design/test/v3-phase-15-testops-pluggable-registration.md new file mode 100644 index 000000000..5a9881eb4 --- /dev/null +++ b/sw-block/design/test/v3-phase-15-testops-pluggable-registration.md @@ -0,0 +1,260 @@ +# V3 Phase 15 TestOps — Pluggable Registration Design + +**Date**: 2026-05-03 +**Status**: architect draft; complements `v3-phase-15-testops-plan.md` +**Scope**: how V3 gates/projects expose test scenarios so TestOps can discover, register, and run them independently +**Code anchor**: `seaweed_block/internal/testops` introduced at `c2b5d9a` + +--- + +## §0 Product Sentence + +V3 should be testable through a stable TestOps registration surface: + +```text +project / gate exposes scenario registration + -> TestOps registry binds scenario name to driver + -> TestOps consumes run-request.json + -> driver runs go-test / shell / privileged host / k8s / future YAML runner + -> result.json + artifact directory are emitted in canonical shape +``` + +This lets TestOps run V3 independently without importing V3 internals and without forcing every gate into one runner implementation. + +--- + +## §1 Core Rule + +Every V3 gate that needs L2+ evidence should register a TestOps scenario. + +Registration is: + +- data + driver binding; +- test workflow metadata; +- artifact contract; +- non-claims. + +Registration is **not**: + +- product authority; +- placement policy; +- failover policy; +- runtime plugin loading; +- a backdoor into internal state mutation. + +--- + +## §2 What Becomes Pluggable + +Pluggable: + +| Surface | Pluggable unit | Example | +|---|---|---| +| L1/L2 `go test` scenario | `GoTestDriver` | G8 failover L2, G9G product loop | +| Host privileged scenario | `ShellDriver` | G15a privileged iSCSI/mkfs/mount | +| Multi-host hardware scenario | `ShellDriver` / future `SSHDriver` | G7 recovery #2/#5/#6 | +| K8s scenario | `K8sDriver` or shell wrapper | G15b static PV/PVC/pod | +| Future YAML scenario | `YAMLDriver` | ported V2 testrunner engine/parser | + +Not pluggable: + +| Surface | Reason | +|---|---| +| `blockmaster` authority publisher | Product truth; must not be loaded as test plugin. | +| `blockvolume` recovery/replication engine | Product truth; TestOps observes, does not replace. | +| CSI controller/node service implementation | Product surface; TestOps drives it through CSI/K8s calls. | +| Placement/failover policy | Product semantics; registration cannot define policy. | + +The important distinction: + +> The test execution path is pluggable. The product runtime truth is not. + +--- + +## §3 V3 TestOps Path + +The V3 path is layered: + +```text +internal/testops + ├── RunRequest / Result schema + ├── Driver interface + ├── Registry + └── driver implementations + +testops/registry/ + ├── g15a-privileged.json + ├── g15b-manifest.json + ├── g15b-k8s-static.json + ├── g9g-l2.json + └── g8-failover-l2.json + +V:\share\v3-debug\bridge\ + ├── run-bridge.sh + ├── run-bridge.exe or go wrapper (future) + ├── scenarios\ + │ ├── g15a-privileged.sh + │ ├── g15b-k8s-static.sh + │ └── g7-recovery.sh + └── runs\\result.json + logs +``` + +Ownership: + +- `internal/testops`: V3 code repo. +- `testops/registry`: V3 code repo, because it pins real commands/paths relative to the code tree. +- `design/test/*.md`: docs repo, because it explains QA contract and close evidence. +- `V:\share\v3-debug\bridge`: harness side, mutable by QA/dev agents. + +--- + +## §4 Registration Shape + +Recommended file shape: + +```json +{ + "schema_version": "1.0", + "scenario": "g15b-k8s-static", + "gate": "G15b", + "layer": "L5", + "driver": { + "type": "shell", + "path": "scripts/run-g15b-k8s-static.sh" + }, + "default_timeout_s": 600, + "required_capabilities": [ + "kubectl", + "privileged-k8s-node", + "iscsiadm", + "mount" + ], + "required_images": [ + "sw-block:local", + "sw-block-csi:local" + ], + "qa_instruction": "sw-block/design/test/v3-phase-15-g15b-k8s-qa-test-instruction.md", + "known_green_commit": "5375add", + "artifacts": [ + "result.json", + "run-request.json", + "pod.log", + "blockmaster.log", + "blockvolume-r1.log", + "blockvolume-r2.log", + "blockcsi-controller.log" + ], + "non_claims": [ + "no dynamic provisioning", + "no failover under live mount", + "single-node only" + ] +} +``` + +Rules: + +- `scenario` is globally unique. +- `driver.type` must map to a TestOps `Driver`. +- `known_green_commit` is evidence, not a constraint. Agents may run newer commits. +- `non_claims` must be present for every L2+ scenario. +- The registration file must not contain authority-shaped fields such as epoch, endpoint version, primary, healthy, or ready unless the scenario is explicitly about observing those read-only facts. + +--- + +## §5 Driver Types + +Initial driver types: + +| Driver | Purpose | Current status | +|---|---|---| +| `shell` | Runs an existing script that reads normalized request and writes result. | Implemented as `internal/testops.ShellDriver`. | +| `go-test` | Runs `go test` package/focus commands and maps output to result. | Next recommended implementation. | +| `k8s` | Applies manifests, waits for resources, collects logs. | Can start as shell wrapper; later native. | +| `privileged-host` | Runs sudo/host OS checks and captures pre/post state. | Can start as shell wrapper. | +| `yaml` | Runs future ported V2 testrunner parser/engine. | Future; conditional. | + +The first bridge can implement all non-shell drivers as shell wrappers. Native drivers are optimization and safety improvements, not prerequisites. + +--- + +## §6 Scenario Lifecycle + +To add a new V3 scenario: + +1. Write or identify the backing test/harness. +2. Add registration file under `testops/registry/`. +3. Add/refresh QA instruction under `sw-block/design/test/`. +4. Add the scenario row to `v3-phase-15-testops-plan.md` §6. +5. Run through TestOps once and capture `result.json`. +6. Use that result as close evidence only if the scenario's non-claims match the gate claim. + +To update a scenario: + +1. Keep scenario name stable if the claim is unchanged. +2. Bump registration fields if driver/timeout/artifact shape changes. +3. Update `known_green_commit` only after verification. +4. Keep old artifact dirs; never mutate historical result directories. + +--- + +## §7 Anti-Patterns + +Do not: + +1. Register a scenario that mutates product authority directly. +2. Encode `primary=true` / `healthy=true` as desired state in TestOps metadata. +3. Use static PV target facts as the default close path for G15b while claiming ControllerPublish evidence. +4. Let a YAML runner call V2 promote/demote or heartbeat-as-authority semantics. +5. Treat a `pass` result as broader than the scenario's non-claims. +6. Hide missing artifacts by returning `status=pass`. +7. Make production code import `internal/testops`. + +The last rule is strict: + +> Product code must not depend on TestOps. TestOps depends on product binaries and public/control surfaces. + +--- + +## §8 Initial Registry Targets + +| Scenario | Driver | Layer | Known green | Status | +|---|---|---|---|---| +| `g15b-manifest` | `go-test` | L1/L2 | `62325c9` | ready to register | +| `g15b-k8s-static` | `shell` | L5 | `5375add` preflight only; K8s run pending | ready to register as pending-lab | +| `g15a-privileged` | `shell` | L3 | `ac49adb` | ready to register | +| `g15a-non-privileged` | `go-test` | L2 | `ac49adb` | ready to register | +| `g9g-l2` | `go-test` | L2 | `7ed9ab2` | ready to register | +| `g8-failover-l2` | `go-test` | L2 | `b320336` | needs instruction extraction | +| `g7-recovery-3scenarios` | `shell` | L4 | `d09fcc6` | wrap existing `g5-test` harness | + +--- + +## §9 Recommended Next Slice + +Implement `testops/registry/g15b-manifest.json` and a minimal `go-test` driver. + +Why this first: + +- It is non-privileged. +- It is fast. +- It proves the registration path without requiring K8s or m01. +- It gives QA an example registration file to copy. + +Pass condition: + +```powershell +go test ./internal/testops ./cmd/blockcsi -count=1 +``` + +plus a small smoke that loads `g15b-manifest.json`, runs the registered scenario, and emits a valid `result.json` in a temp artifact dir. + +--- + +## §10 Sign + +| Role | Status | Basis | +|---|---|---| +| sw | draft | captured V3 pluggable TestOps path after `internal/testops` skeleton | +| QA | pending | review registration shape and artifact expectations | +| architect | pending | ratify product/runtime non-plugin boundary | diff --git a/sw-block/design/v3-phase-15-g15b-k8s-static-pv-mini-plan.md b/sw-block/design/v3-phase-15-g15b-k8s-static-pv-mini-plan.md index 32be24c88..93745523c 100644 --- a/sw-block/design/v3-phase-15-g15b-k8s-static-pv-mini-plan.md +++ b/sw-block/design/v3-phase-15-g15b-k8s-static-pv-mini-plan.md @@ -1,7 +1,7 @@ # V3 Phase 15 — G15b Kubernetes Static PV Mini-Plan **Date**: 2026-05-03 -**Status**: G15b-1 manifests implemented at `62325c9`; G15b-2 lab harness staged at `32b3a13`; image build inputs added at `5375add`; Kubernetes run pending +**Status**: G15b-1 manifests implemented at `62325c9`; G15b-2 lab harness staged at `32b3a13`; image build inputs added at `5375add`; M02 first run found DNS/harness blockers; fixed at `eb13105`; Kubernetes re-run pending **Branch**: `p15-g15b/k8s-static-pv` from `ac49adb` **Goal**: prove a Kubernetes pod can consume a pre-provisioned V3 block volume through `cmd/blockcsi`, using real Kubernetes CSI control flow and real Linux iSCSI staging. @@ -162,7 +162,7 @@ Result: PASS on `62325c9`. ### G15b-2 — K8s Lab Harness -Status: **harness staged** at `seaweed_block@32b3a13`; image build inputs added at `seaweed_block@5375add`; real Kubernetes execution pending. +Status: **harness staged** at `seaweed_block@32b3a13`; image build inputs added at `seaweed_block@5375add`; DNS/logging fixes at `seaweed_block@eb13105`; real Kubernetes re-run pending. Artifacts: @@ -187,6 +187,19 @@ First topology: - iSCSI remains `127.0.0.1:3260`; - this intentionally preserves the G15a loopback-only frontend guard. +M02 first-run findings: + +- `blockvolume` pods use `hostNetwork: true`. +- Without `dnsPolicy: ClusterFirstWithHostNet`, they inherited host DNS and could not resolve `blockmaster.kube-system.svc.cluster.local`. +- Result: no heartbeat, no frontend fact, `ControllerPublish` returned NotFound, and the pod stayed Pending. +- The harness also collected daemon logs only after success, so failure-path evidence was lost unless captured manually. + +Fix at `eb13105`: + +- adds `dnsPolicy: ClusterFirstWithHostNet` to both `sw-blockvolume-r1` and `sw-blockvolume-r2`; +- changes `scripts/run-g15b-k8s-static.sh` so daemon logs are collected from the EXIT trap before cleanup; +- adds `.gitattributes` to keep `*.sh` as LF on future checkouts. + Harness responsibilities: 1. Build V3 binaries/images for `blockmaster`, `blockvolume`, and `blockcsi`. @@ -205,7 +218,7 @@ Pass: - Pod writes and reads byte-equal data. - No dangling iSCSI session for the test IQN after cleanup. -Pre-flight verification already green at `32b3a13`: +Pre-flight verification green at `eb13105`: ```powershell go test ./cmd/blockcsi -run TestG15b_Manifest -count=1 -v @@ -223,7 +236,7 @@ Result: PASS; built `sw-block:local` and `sw-block-csi:local`. Not yet proven: - Kubernetes API server availability; -- image load path into the target cluster; +- image load path into the target cluster after rebuilding at `eb13105`; - external-attacher calling `ControllerPublish`; - kubelet calling `NodeStage` / `NodePublish`; - pod checksum write/read.