docs(p15): update G15b M02 rerun and TestOps registration design

pingqiu
2026-05-03 09:25:47 -07:00
parent b3e92830aa
commit ea2ca11f73
3 changed files with 291 additions and 6 deletions

@@ -1,14 +1,14 @@
# V3 Phase 15 G15b Kubernetes Static PV QA Test Instruction
**Date**: 2026-05-03
**Status**: K8s lab instruction for `p15-g15b/k8s-static-pv@5375add`; execution pending
**Status**: K8s lab instruction for `p15-g15b/k8s-static-pv@eb13105`; M02 re-run pending
**Scope**: single-node Kubernetes static PV/PVC/pod smoke through real V3 daemons and CSI.
---
## Headline
At `seaweed_block@5375add`, the G15b lab harness and image build inputs are staged to prove:
At `seaweed_block@eb13105`, the G15b lab harness, image build inputs, and M02 DNS/log-preservation fixes are staged to prove:
```text
blockmaster + product-loop + r1/r2 blockvolume
@@ -39,6 +39,12 @@ Known current local limitation:
- On the current dev workstation, the `kubectl` context `rancher-desktop` exists, but the API server is not reachable. This instruction needs QA or a running K8s lab.
M02 first-run blocker fixed:
- `5375add` failed because `hostNetwork: true` blockvolume pods inherited host DNS and could not resolve `blockmaster.kube-system.svc.cluster.local`.
- `eb13105` adds `dnsPolicy: ClusterFirstWithHostNet` to both blockvolume pods.
- `eb13105` also collects daemon logs on every exit before cleanup, so failure evidence is preserved.
---
## Commands
@@ -66,6 +72,12 @@ G15B_KIND_CLUSTER=<kind-cluster-name> bash scripts/build-g15b-images.sh "$PWD"
Local image build result already verified at `5375add`: PASS, images `sw-block:local` and `sw-block-csi:local` built.
After pulling `eb13105`, rebuild images before rerun:
```bash
bash scripts/build-g15b-images.sh "$PWD"
```
Kubernetes lab run from Linux or WSL with `kubectl` configured:
```bash

@@ -0,0 +1,260 @@
# V3 Phase 15 TestOps — Pluggable Registration Design
**Date**: 2026-05-03
**Status**: architect draft; complements `v3-phase-15-testops-plan.md`
**Scope**: how V3 gates/projects expose test scenarios so TestOps can discover, register, and run them independently
**Code anchor**: `seaweed_block/internal/testops` introduced at `c2b5d9a`
---
## §0 Product Sentence
V3 should be testable through a stable TestOps registration surface:
```text
project / gate exposes scenario registration
-> TestOps registry binds scenario name to driver
-> TestOps consumes run-request.json
-> driver runs go-test / shell / privileged host / k8s / future YAML runner
-> result.json + artifact directory are emitted in canonical shape
```
This lets TestOps run V3 scenarios independently, without importing V3 internals and without forcing every gate into one runner implementation.
---
## §1 Core Rule
Every V3 gate that needs L2+ evidence should register a TestOps scenario.
Registration is:
- data + driver binding;
- test workflow metadata;
- artifact contract;
- non-claims.
Registration is **not**:
- product authority;
- placement policy;
- failover policy;
- runtime plugin loading;
- a backdoor into internal state mutation.
---
## §2 What Becomes Pluggable
Pluggable:
| Surface | Pluggable unit | Example |
|---|---|---|
| L1/L2 `go test` scenario | `GoTestDriver` | G8 failover L2, G9G product loop |
| Host privileged scenario | `ShellDriver` | G15a privileged iSCSI/mkfs/mount |
| Multi-host hardware scenario | `ShellDriver` / future `SSHDriver` | G7 recovery #2/#5/#6 |
| K8s scenario | `K8sDriver` or shell wrapper | G15b static PV/PVC/pod |
| Future YAML scenario | `YAMLDriver` | ported V2 testrunner engine/parser |
Not pluggable:
| Surface | Reason |
|---|---|
| `blockmaster` authority publisher | Product truth; must not be loaded as a test plugin. |
| `blockvolume` recovery/replication engine | Product truth; TestOps observes, does not replace. |
| CSI controller/node service implementation | Product surface; TestOps drives it through CSI/K8s calls. |
| Placement/failover policy | Product semantics; registration cannot define policy. |
The important distinction:
> The test execution path is pluggable. The product runtime truth is not.
---
## §3 V3 TestOps Path
The V3 path is layered:
```text
internal/testops
├── RunRequest / Result schema
├── Driver interface
├── Registry
└── driver implementations
testops/registry/
├── g15a-privileged.json
├── g15b-manifest.json
├── g15b-k8s-static.json
├── g9g-l2.json
└── g8-failover-l2.json
V:\share\v3-debug\bridge\
├── run-bridge.sh
├── run-bridge.exe or go wrapper (future)
├── scenarios\
│ ├── g15a-privileged.sh
│ ├── g15b-k8s-static.sh
│ └── g7-recovery.sh
└── runs\<RUN_ID>\result.json + logs
```
Ownership:
- `internal/testops`: V3 code repo.
- `testops/registry`: V3 code repo, because it pins real commands/paths relative to the code tree.
- `design/test/*.md`: docs repo, because it explains QA contract and close evidence.
- `V:\share\v3-debug\bridge`: harness side, mutable by QA/dev agents.
---
## §4 Registration Shape
Recommended file shape:
```json
{
"schema_version": "1.0",
"scenario": "g15b-k8s-static",
"gate": "G15b",
"layer": "L5",
"driver": {
"type": "shell",
"path": "scripts/run-g15b-k8s-static.sh"
},
"default_timeout_s": 600,
"required_capabilities": [
"kubectl",
"privileged-k8s-node",
"iscsiadm",
"mount"
],
"required_images": [
"sw-block:local",
"sw-block-csi:local"
],
"qa_instruction": "sw-block/design/test/v3-phase-15-g15b-k8s-qa-test-instruction.md",
"known_green_commit": "5375add",
"artifacts": [
"result.json",
"run-request.json",
"pod.log",
"blockmaster.log",
"blockvolume-r1.log",
"blockvolume-r2.log",
"blockcsi-controller.log"
],
"non_claims": [
"no dynamic provisioning",
"no failover under live mount",
"single-node only"
]
}
```
Rules:
- `scenario` is globally unique.
- `driver.type` must map to a TestOps `Driver`.
- `known_green_commit` is evidence, not a constraint. Agents may run newer commits.
- `non_claims` must be present for every L2+ scenario.
- The registration file must not contain authority-shaped fields such as epoch, endpoint version, primary, healthy, or ready unless the scenario is explicitly about observing those read-only facts.
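The rules above could be enforced at load time. The following is a minimal sketch, not the `internal/testops` implementation: the `Registration` type, `ValidateRegistration`, the set of accepted driver types, and the JSON key spellings in `authorityKeys` (for example `endpoint_version`) are assumptions made for illustration. It checks only top-level keys and does not model the exception for scenarios that explicitly observe read-only facts.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Registration mirrors part of the JSON shape above.
// Hypothetical type: the real internal/testops types may differ.
type Registration struct {
	SchemaVersion string `json:"schema_version"`
	Scenario      string `json:"scenario"`
	Gate          string `json:"gate"`
	Layer         string `json:"layer"`
	Driver        struct {
		Type string `json:"type"`
		Path string `json:"path"`
	} `json:"driver"`
	NonClaims []string `json:"non_claims"`
}

// knownDrivers is an assumed set taken from the driver types named in this design.
var knownDrivers = map[string]bool{
	"shell": true, "go-test": true, "k8s": true, "privileged-host": true, "yaml": true,
}

// authorityKeys are assumed key spellings for the authority-shaped fields.
var authorityKeys = []string{"epoch", "endpoint_version", "primary", "healthy", "ready"}

// ValidateRegistration parses one registration file and applies the rules.
// seen tracks scenario names already accepted, to enforce global uniqueness.
func ValidateRegistration(raw []byte, seen map[string]bool) (*Registration, error) {
	var reg Registration
	if err := json.Unmarshal(raw, &reg); err != nil {
		return nil, fmt.Errorf("parse registration: %w", err)
	}
	// Second pass over the raw keys to reject authority-shaped top-level fields.
	var top map[string]json.RawMessage
	if err := json.Unmarshal(raw, &top); err != nil {
		return nil, fmt.Errorf("parse registration keys: %w", err)
	}
	for _, key := range authorityKeys {
		if _, found := top[key]; found {
			return nil, fmt.Errorf("authority-shaped field %q is not allowed", key)
		}
	}
	switch {
	case reg.Scenario == "":
		return nil, fmt.Errorf("scenario is required")
	case seen[reg.Scenario]:
		return nil, fmt.Errorf("scenario %q is already registered", reg.Scenario)
	case !knownDrivers[reg.Driver.Type]:
		return nil, fmt.Errorf("driver.type %q has no TestOps driver", reg.Driver.Type)
	case reg.Layer != "L1" && len(reg.NonClaims) == 0:
		// Anything other than a pure L1 scenario is treated as L2+ here.
		return nil, fmt.Errorf("scenario %q (%s) must declare non_claims", reg.Scenario, reg.Layer)
	}
	seen[reg.Scenario] = true
	return &reg, nil
}

func main() {
	seen := map[string]bool{}
	raw := []byte(`{"scenario":"g15b-k8s-static","layer":"L5",
		"driver":{"type":"shell","path":"scripts/run-g15b-k8s-static.sh"},
		"non_claims":["single-node only"]}`)
	reg, err := ValidateRegistration(raw, seen)
	if err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("registered:", reg.Scenario)
}
```

Note that `known_green_commit` is deliberately not validated, since it is evidence rather than a constraint.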
---
## §5 Driver Types
Initial driver types:
| Driver | Purpose | Current status |
|---|---|---|
| `shell` | Runs an existing script that reads normalized request and writes result. | Implemented as `internal/testops.ShellDriver`. |
| `go-test` | Runs `go test` package/focus commands and maps output to result. | Next recommended implementation. |
| `k8s` | Applies manifests, waits for resources, collects logs. | Can start as shell wrapper; later native. |
| `privileged-host` | Runs sudo/host OS checks and captures pre/post state. | Can start as shell wrapper. |
| `yaml` | Runs future ported V2 testrunner parser/engine. | Future; conditional. |
The first bridge can implement all non-shell drivers as shell wrappers. Native drivers are optimization and safety improvements, not prerequisites.
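One way to picture the binding between `driver.type` and a driver is a small registry keyed by type name. This is a sketch under stated assumptions: `RunRequest`, `Result`, `Driver`, `DriverFunc`, `Registry`, and `NewStubShellDriver` are hypothetical names and shapes, not the actual `internal/testops` API. It shows how `k8s` and `privileged-host` can initially resolve to the same shell-backed driver and be swapped for native drivers later without changing any registration file.

```go
package main

import (
	"context"
	"fmt"
)

// RunRequest and Result are reduced, hypothetical stand-ins for the real schema.
type RunRequest struct {
	Scenario    string
	ArtifactDir string
}

type Result struct {
	Scenario string
	Status   string
}

// Driver is the assumed execution contract: one request in, one result out.
type Driver interface {
	Run(ctx context.Context, req RunRequest) (Result, error)
}

// DriverFunc adapts a plain function to the Driver interface.
type DriverFunc func(ctx context.Context, req RunRequest) (Result, error)

func (f DriverFunc) Run(ctx context.Context, req RunRequest) (Result, error) {
	return f(ctx, req)
}

// Registry binds a driver type name to its implementation.
type Registry struct {
	drivers map[string]Driver
}

func NewRegistry() *Registry {
	return &Registry{drivers: map[string]Driver{}}
}

// Register rejects a second binding for the same type name.
func (r *Registry) Register(driverType string, d Driver) error {
	if _, exists := r.drivers[driverType]; exists {
		return fmt.Errorf("driver type %q already registered", driverType)
	}
	r.drivers[driverType] = d
	return nil
}

// Run looks up the driver for driverType and executes the request.
func (r *Registry) Run(ctx context.Context, driverType string, req RunRequest) (Result, error) {
	d, ok := r.drivers[driverType]
	if !ok {
		return Result{}, fmt.Errorf("no driver for type %q", driverType)
	}
	return d.Run(ctx, req)
}

// NewStubShellDriver stands in for a shell-backed driver; it runs nothing
// and always reports pass, which is only acceptable in a sketch.
func NewStubShellDriver() Driver {
	return DriverFunc(func(_ context.Context, req RunRequest) (Result, error) {
		return Result{Scenario: req.Scenario, Status: "pass"}, nil
	})
}

func main() {
	reg := NewRegistry()
	shell := NewStubShellDriver()
	// Non-shell types start as shell wrappers.
	for _, driverType := range []string{"shell", "k8s", "privileged-host"} {
		if err := reg.Register(driverType, shell); err != nil {
			fmt.Println(err)
			return
		}
	}
	res, err := reg.Run(context.Background(), "k8s", RunRequest{Scenario: "g15b-k8s-static"})
	fmt.Println(res.Scenario, res.Status, err)
}
```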
---
## §6 Scenario Lifecycle
To add a new V3 scenario:
1. Write or identify the backing test/harness.
2. Add a registration file under `testops/registry/`.
3. Add or refresh the QA instruction under `sw-block/design/test/`.
4. Add the scenario row to `v3-phase-15-testops-plan.md` §6.
5. Run through TestOps once and capture `result.json`.
6. Use that result as close evidence only if the scenario's non-claims match the gate claim.
To update a scenario:
1. Keep scenario name stable if the claim is unchanged.
2. Update the affected registration fields when the driver, timeout, or artifact shape changes.
3. Update `known_green_commit` only after verification.
4. Keep old artifact dirs; never mutate historical result directories.
---
## §7 Anti-Patterns
Do not:
1. Register a scenario that mutates product authority directly.
2. Encode `primary=true` / `healthy=true` as desired state in TestOps metadata.
3. Use static PV target facts as the default close path for G15b while claiming ControllerPublish evidence.
4. Let a YAML runner call V2 promote/demote or heartbeat-as-authority semantics.
5. Treat a `pass` result as broader than the scenario's non-claims.
6. Hide missing artifacts by returning `status=pass`.
7. Make production code import `internal/testops`.
The last rule is strict:
> Product code must not depend on TestOps. TestOps depends on product binaries and public/control surfaces.
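Anti-pattern 6 can be guarded mechanically by checking the artifact directory against the registration's `artifacts` list before a result is accepted. A minimal sketch, assuming hypothetical helper names (`MissingArtifacts`, `FinalStatus`) and assuming `fail` is the status string used for a downgraded result; the source only names `pass`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// MissingArtifacts returns the required artifact names that cannot be
// found under dir. Any stat error counts as missing.
func MissingArtifacts(dir string, required []string) []string {
	var missing []string
	for _, name := range required {
		if _, err := os.Stat(filepath.Join(dir, name)); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

// FinalStatus downgrades a reported pass when artifacts are missing.
// "fail" is an assumed status value, not taken from the result schema.
func FinalStatus(reported string, missing []string) string {
	if reported == "pass" && len(missing) > 0 {
		return "fail"
	}
	return reported
}

func main() {
	dir, err := os.MkdirTemp("", "testops-run-")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(dir)

	// Only result.json is written; pod.log is deliberately absent.
	if err := os.WriteFile(filepath.Join(dir, "result.json"), []byte("{}"), 0o644); err != nil {
		fmt.Println(err)
		return
	}
	missing := MissingArtifacts(dir, []string{"result.json", "pod.log"})
	fmt.Println(FinalStatus("pass", missing), missing)
}
```

Keeping the check outside the driver means a scenario script cannot mask a missing log by reporting `pass` itself.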
---
## §8 Initial Registry Targets
| Scenario | Driver | Layer | Known green | Status |
|---|---|---|---|---|
| `g15b-manifest` | `go-test` | L1/L2 | `62325c9` | ready to register |
| `g15b-k8s-static` | `shell` | L5 | `5375add` preflight only; K8s run pending | ready to register as pending-lab |
| `g15a-privileged` | `shell` | L3 | `ac49adb` | ready to register |
| `g15a-non-privileged` | `go-test` | L2 | `ac49adb` | ready to register |
| `g9g-l2` | `go-test` | L2 | `7ed9ab2` | ready to register |
| `g8-failover-l2` | `go-test` | L2 | `b320336` | needs instruction extraction |
| `g7-recovery-3scenarios` | `shell` | L4 | `d09fcc6` | wrap existing `g5-test` harness |
---
## §9 Recommended Next Slice
Implement `testops/registry/g15b-manifest.json` and a minimal `go-test` driver.
Why this first:
- It is non-privileged.
- It is fast.
- It proves the registration path without requiring K8s or m01.
- It gives QA an example registration file to copy.
Pass condition:
```powershell
go test ./internal/testops ./cmd/blockcsi -count=1
```
plus a small smoke that loads `g15b-manifest.json`, runs the registered scenario, and emits a valid `result.json` in a temp artifact dir.
---
## §10 Sign
| Role | Status | Basis |
|---|---|---|
| sw | draft | captured V3 pluggable TestOps path after `internal/testops` skeleton |
| QA | pending | review registration shape and artifact expectations |
| architect | pending | ratify product/runtime non-plugin boundary |

@@ -1,7 +1,7 @@
# V3 Phase 15 — G15b Kubernetes Static PV Mini-Plan
**Date**: 2026-05-03
**Status**: G15b-1 manifests implemented at `62325c9`; G15b-2 lab harness staged at `32b3a13`; image build inputs added at `5375add`; Kubernetes run pending
**Status**: G15b-1 manifests implemented at `62325c9`; G15b-2 lab harness staged at `32b3a13`; image build inputs added at `5375add`; M02 first run found DNS/harness blockers; fixed at `eb13105`; Kubernetes re-run pending
**Branch**: `p15-g15b/k8s-static-pv` from `ac49adb`
**Goal**: prove a Kubernetes pod can consume a pre-provisioned V3 block volume through `cmd/blockcsi`, using real Kubernetes CSI control flow and real Linux iSCSI staging.
@@ -162,7 +162,7 @@ Result: PASS on `62325c9`.
### G15b-2 — K8s Lab Harness
Status: **harness staged** at `seaweed_block@32b3a13`; image build inputs added at `seaweed_block@5375add`; real Kubernetes execution pending.
Status: **harness staged** at `seaweed_block@32b3a13`; image build inputs added at `seaweed_block@5375add`; DNS/logging fixes at `seaweed_block@eb13105`; real Kubernetes re-run pending.
Artifacts:
@@ -187,6 +187,19 @@ First topology:
- iSCSI remains `127.0.0.1:3260`;
- this intentionally preserves the G15a loopback-only frontend guard.
M02 first-run findings:
- `blockvolume` pods use `hostNetwork: true`.
- Without `dnsPolicy: ClusterFirstWithHostNet`, they inherited host DNS and could not resolve `blockmaster.kube-system.svc.cluster.local`.
- Result: no heartbeat, no frontend fact, `ControllerPublish` returned NotFound, and the pod stayed Pending.
- The harness also collected daemon logs only after success, so failure-path evidence was lost unless captured manually.
Fix at `eb13105`:
- adds `dnsPolicy: ClusterFirstWithHostNet` to both `sw-blockvolume-r1` and `sw-blockvolume-r2`;
- changes `scripts/run-g15b-k8s-static.sh` so daemon logs are collected from the EXIT trap before cleanup;
- adds `.gitattributes` to keep `*.sh` as LF on future checkouts.
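The DNS finding lends itself to a static manifest check, so the same misconfiguration would be caught before a lab run. The sketch below is illustrative only: `PodSpec` is a reduced hypothetical struct rather than the Kubernetes API type, `CheckHostNetworkDNS` is not a function from the repository, and the pod values in `main` are sample data rather than the contents of the real manifests.

```go
package main

import "fmt"

// PodSpec holds only the fields this check needs.
// Hypothetical reduced type, not k8s.io/api/core/v1.PodSpec.
type PodSpec struct {
	Name        string
	HostNetwork bool
	DNSPolicy   string
}

// CheckHostNetworkDNS returns the names of pods that use hostNetwork
// without dnsPolicy ClusterFirstWithHostNet. Such pods resolve names
// through the host resolver and cannot see cluster service names.
func CheckHostNetworkDNS(pods []PodSpec) []string {
	var bad []string
	for _, p := range pods {
		if p.HostNetwork && p.DNSPolicy != "ClusterFirstWithHostNet" {
			bad = append(bad, p.Name)
		}
	}
	return bad
}

func main() {
	// Sample data: r1 shows the pre-fix shape, r2 the post-fix shape.
	pods := []PodSpec{
		{Name: "sw-blockvolume-r1", HostNetwork: true, DNSPolicy: "ClusterFirst"},
		{Name: "sw-blockvolume-r2", HostNetwork: true, DNSPolicy: "ClusterFirstWithHostNet"},
	}
	fmt.Println("pods needing dnsPolicy fix:", CheckHostNetworkDNS(pods))
}
```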
Harness responsibilities:
1. Build V3 binaries/images for `blockmaster`, `blockvolume`, and `blockcsi`.
@@ -205,7 +218,7 @@ Pass:
- Pod writes and reads byte-equal data.
- No dangling iSCSI session for the test IQN after cleanup.
Pre-flight verification already green at `32b3a13`:
Pre-flight verification green at `eb13105`:
```powershell
go test ./cmd/blockcsi -run TestG15b_Manifest -count=1 -v
@@ -223,7 +236,7 @@ Result: PASS; built `sw-block:local` and `sw-block-csi:local`.
Not yet proven:
- Kubernetes API server availability;
- image load path into the target cluster;
- image load path into the target cluster after rebuilding at `eb13105`;
- external-attacher calling `ControllerPublish`;
- kubelet calling `NodeStage` / `NodePublish`;
- pod checksum write/read.