seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-07-20 06:52:24 +00:00

Author	SHA1	Message	Date
pingqiu	f20ec2ef79	test: align collector readiness check with replica eligibility Use ReplicaEligible instead of PublishHealthy in the heartbeat collector test now that publish health is rebound to publication truth rather than receiver readiness. Made-with: Cursor	2026-04-04 14:03:21 -07:00
pingqiu	6cad5bb8e1	refactor: rebind bounded volume mode heartbeat truth Make the heartbeat/master boundary preserve explicit volume_mode truth so master consume no longer reconstructs outward mode only from secondary heartbeat signals. Keep backward compatibility by falling back to the previous reconstruction when older heartbeats do not send the field. Made-with: Cursor	2026-04-04 13:56:41 -07:00
pingqiu	6794f79df9	refactor: preserve bounded publish healthy heartbeat truth Make the heartbeat/master boundary preserve explicit publish_healthy truth so master consume no longer reconstructs healthy publication only from secondary readiness and degraded heuristics. Keep backward compatibility by falling back to the previous reconstruction when older heartbeats do not send the field. Made-with: Cursor	2026-04-04 13:43:19 -07:00
pingqiu	eb610deb92	refactor: preserve bounded needs_rebuild heartbeat truth Make the heartbeat/master boundary preserve explicit needs_rebuild truth so primary heartbeat consume no longer collapses that stronger mode into a generic degraded signal. Keep backward compatibility by falling back to the previous heuristic when older heartbeats do not send the field. Made-with: Cursor	2026-04-04 13:11:42 -07:00
pingqiu	69b41a7f16	refactor: rebind bounded replica-ready heartbeat truth Make the heartbeat/master boundary carry explicit replica readiness truth so the registry no longer depends only on replica transport-address presence as a readiness proxy. Keep backward compatibility by falling back to the old address heuristic when older heartbeats do not send the field. Made-with: Cursor	2026-04-04 12:06:53 -07:00
pingqiu	43dbebfa04	refactor: close bounded recovery drain and invalidation seams Move removed-replica drain and replica-scoped invalidation onto explicit core-command paths so the widened multi-replica runtime no longer depends on coarse host-side recovery handling. Made-with: Cursor	2026-04-04 11:01:12 -07:00
pingqiu	5fd9ec0edf	refactor: widen bounded multi-replica catch-up startup ownership Emit one core-owned start_recovery_task per primary catch-up replica so the bounded multi-replica startup path no longer depends on a single-replica assumption. Made-with: Cursor	2026-04-04 10:21:28 -07:00
pingqiu	92c006eb29	refactor: aggregate bounded multi-replica catch-up conservatively Track catch-up observations per replica so the volume-level recovery view stays in catching_up until all bounded replicas complete. This preserves the current bounded semantics while removing an overclaim that would block later multi-replica startup ownership work. Made-with: Cursor	2026-04-04 09:27:03 -07:00
pingqiu	16ba70f856	refactor: make bounded recovery observation events replica-scoped Carry replica-scoped addressing through bounded recovery planning and completion events so the core no longer depends on a volume-only observation seam. This preserves the current single-replica catch-up and rebuilding behavior while aligning the observation side with the replica-scoped command path. Made-with: Cursor	2026-04-04 09:18:07 -07:00
pingqiu	b304b8e212	refactor: make bounded recovery command addressing replica-scoped Replace the remaining volume-scoped recovery command and pending slot with replica-scoped addressing on the bounded core-present path. This preserves the current single-replica catch-up and rebuilding behavior while removing the structural blocker for later multi-replica startup ownership. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 09:05:36 -07:00
pingqiu	1453274988	refactor: extract host effects adapter and define Phase 17 stop line Move dispatcher-facing host effects out of volume_server_block.go into blockcmd while keeping server-owned cache/state semantics in weed/server. Document Batch 10 delivery and Batch 11 stop-line review so the separation line closes without over-extracting readiness-state mutation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 08:43:21 -07:00
pingqiu	38b5042997	refactor: extract command bindings and service ops from volume server Move BlockVol-backed command bindings into v2bridge and move non-BlockVol command operations into weed/server/blockcmd. This keeps dispatch and host effects in weed/server, keeps backend binding in v2bridge, and further shrinks volume_server_block.go toward a host shell while preserving current command-driven proofs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 08:11:39 -07:00
pingqiu	11c6aaf316	feat: Batch 7 + Phase 16C-E — command dispatch extraction + engine refinements Batch 7: Command dispatch binding extraction - New weed/server/blockcmd package: CommandHandler interface + DispatchCommands - volume_server_block.go applyCoreCommandsWithAssignment delegates to dispatcher - weed/server still owns RecordCommand, EmitCoreEvent, PublishProjection - v2bridge NOT given command-switch or event-emission semantics Phase 16C: Rebuilding assignment enters core command path Phase 16D: Rebuild recovery-task startup is command-driven Phase 16E: Catch-up recovery-task startup is command-driven Engine refinements: - RecoveryTarget on AssignmentDelivered event - shouldStartRecoveryTask / shouldStartReceiver guards - bootstrapReason: awaiting_rebuild_start Bridge/contract updates: - control_adapter.go: refined translation helpers - contract.go: executor port alignment Migration design docs (Batch 1-3 delivered, design artifacts): - v2-first/second/third-migration-batch.md + task-pack.md - v2-assignment-translation-unification.md - v2-execution-muscles-inventory.md - v2-separation-port-layer-audit.md - v2-legacy-runtime-exit-criteria.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 02:13:08 -07:00
pingqiu	41082bf92c	fix: Batch 6 completion — rebuildAddr folded into resolveRecoveryContext resolveRecoveryContext now also derives rebuildAddr from assignments, so the full host-side recovery context is resolved in one call: - volPath (from replicaID) - rebuildAddr (from assignments via deriveRebuildAddr) - recovery bindings (driver + executor via BuildRecoveryBundle) - replicaFlushedLSN (from sender session) startTask/runRecovery/runCatchUp/runRebuild now pass assignments instead of rebuildAddr. No separate rebuildAddr resolution remains outside the resolver. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 01:52:35 -07:00
pingqiu	a48da0f674	refactor: Batch 6 — recovery context resolver extracted New recoveryContext type + resolveRecoveryContext method consolidates: - volumePathForReplica (volPath from replicaID) - v2bridge.BuildRecoveryBundle (driver + executor from BlockVol) - sender/session lookup (replicaFlushedLSN for catch-up start) runCatchUp and runRebuild now read as: resolve → plan → branch (legacy or core-present) Removed buildRecoveryBundle (inlined into resolveRecoveryContext). block_recovery.go no longer has any inline context assembly — it is now a pure orchestration shell. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 01:46:06 -07:00
pingqiu	263611004e	refactor: Batch 5 — recovery binding factory moved to v2bridge New v2bridge.BuildRecoveryBundle(vol, rebuildAddr) assembles all recovery bindings (Reader + Pinner + StorageAdapter + Executor) from a real BlockVol instance in one call. block_recovery.go changes: - Removed local recoveryBundle type - buildRecoveryBundle now delegates to v2bridge.BuildRecoveryBundle inside WithVolume, returns (driver, executor, err) - Removed direct v2bridge.NewReader/NewPinner/NewExecutor construction - Removed bridge import (no longer needed) - runCatchUp/runRebuild use (driver, executor, err) directly block_recovery.go no longer knows how to construct Reader, Pinner, StorageAdapter, or Executor. It only knows: resolve volPath, ask the factory for bindings, plan, branch to legacy or core-present path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 01:39:40 -07:00
pingqiu	ded84b25e6	refactor: Batch 4 steps 2+3 — rebuild status port + recovery bundle factory Step 2: Rebuild completion status port - New runtime.RebuildCompletionStatus + DeriveRebuildCommitted: reusable shaping logic for post-rebuild snapshot → RebuildCommitted event - block_recovery.go OnRebuildCompleted: delegates to DeriveRebuildCommitted, host only reads raw snapshot via readRebuildStatus (thin binding) - Removed 15 lines of inline flushedLSN/checkpointLSN/achievedLSN computation Step 3: Recovery bundle factory - New buildRecoveryBundle: shared host-side setup for both catch-up and rebuild (creates Reader + Pinner + StorageAdapter + Executor + RecoveryDriver) - runCatchUp and runRebuild both use buildRecoveryBundle instead of duplicating the WithVolume → NewReader → NewPinner → NewStorageAdapter → NewExecutor → RecoveryDriver chain - runCatchUp/runRebuild are now thin host-shell methods Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 01:32:34 -07:00
pingqiu	0bcfc678d0	refactor: Batch 4 step 1 — typed PendingExecution, zero type assertions Replace interface{} fields in runtime.PendingExecution with typed handles: - Driver: engine.RecoveryDriver (was interface{}) - Plan: engine.RecoveryPlan (was interface{}) - CatchUpIO: engine.CatchUpIO (was interface{}) - RebuildIO: engine.RebuildIO (was interface{}) block_recovery.go: - ExecutePendingCatchUp/Rebuild: direct field access (pe.Driver, pe.Plan) instead of type assertions (pe.Driver.(*engine.RecoveryDriver)) - CancelFunc: pe.Driver.CancelPlan(pe.Plan, reason) — no casts - 6 type assertions removed from production path Test files: remove Plan type assertions — fields are typed end-to-end. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 01:27:29 -07:00
pingqiu	3a5fbbfded	fix: Batch 3 wiring — production path uses runtime helpers, legacy isolated H wiring: block_recovery.go now uses runtime.PendingCoordinator - Removed local pendingRecoveryExecution type + store/take/peek/has/cancel - ExecutePendingCatchUp/Rebuild delegate to coord.TakeCatchUp/TakeRebuild - Shutdown uses coord.CancelAll - Added CancelAll to PendingCoordinator I wiring: executeCatchUpPlan/executeRebuildPlan replaced - ExecutePendingCatchUp now calls rt.ExecuteCatchUpPlan with RecoveryManager as RecoveryCallbacks (OnCatchUpCompleted/OnRebuildCompleted) - ExecutePendingRebuild follows same pattern - Local executeCatchUpPlan/executeRebuildPlan methods removed J structural: legacy no-core branches extracted - executeLegacyCatchUp: wraps rt.ExecuteCatchUpPlan for v2Core==nil path - executeLegacyRebuild: wraps rt.ExecuteRebuildPlan for v2Core==nil path - Clear "LEGACY NO-CORE COMPATIBILITY" section with structural separation - runCatchUp/runRebuild now branch cleanly: legacy helper vs core coordinator Test updates: pendingRecoveryExecution → rt.PendingExecution, field casing, Plan type assertions. Validation: all P4, P16B, and ApplyAssignments tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 01:20:41 -07:00
pingqiu	e075d77619	refactor: Task J — legacy no-core paths explicitly labeled Add explicit "LEGACY NO-CORE COMPATIBILITY" section header in block_recovery.go marking HandleAssignmentResult and HandleRemovedAssignments as compatibility-only entry points. The comment block explicitly states: - These are for pre-Phase-16 no-core paths and older tests - Core-present paths use StartRecoveryTask + ExecutePending* - These should NOT be strengthened into semantic-authority proofs No behavioral change — structural labeling only. All validation passes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 01:05:16 -07:00
pingqiu	e200df7791	feat: Task I — recovery execution helpers extracted to sw-block runtime New reusable execution helpers in sw-block/engine/replication/runtime: - ExecuteCatchUpPlan: drives catch-up execution, notifies host via callback - ExecuteRebuildPlan: drives rebuild execution, notifies host via callback - RecoveryCallbacks interface: host-side OnCatchUpCompleted/OnRebuildCompleted The host (weed/server/block_recovery.go) supplies concrete IO bindings and receives completion notifications. The reusable execution logic no longer requires weed/server ownership. 4 tests prove boundary behavior: - catch-up callback receives achievedLSN matching plan target - catch-up with plan-derived target works correctly - rebuild callback receives plan reference - nil callbacks don't panic weed/server rebinding to use these helpers deferred to Task J (legacy isolation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 01:03:37 -07:00
pingqiu	6fea93e821	feat: Task H — PendingCoordinator extracted to sw-block/engine/replication/runtime New reusable pending-execution coordinator with fail-closed command matching: - Store/TakeCatchUp/TakeRebuild/Cancel/Has/Peek - TakeCatchUp: fail-closed on target LSN mismatch (cancel + return nil) - TakeRebuild: same fail-closed semantics - Cancel callback invoked on mismatch or explicit cancellation 9 tests prove boundary behavior: - match succeeds, mismatch cancels, explicit cancel, noop on empty, peek non-destructive, store replaces, take from empty No weed/ imports. Pure coordination logic reusable by any adapter shell. weed/server/block_recovery.go rebinding deferred to Task I. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 00:59:10 -07:00
pingqiu	519c849946	refactor: Task F+G — remove pinner shim, executor already clean Task F (Pinner): - block_recovery.go: removed pinnerShimForRecovery (11 lines of pure pass-through). v2bridge.Pinner structurally satisfies bridge.BlockVolPinner (same method signatures), so it's passed directly. Task G (Executor): - Already clean. v2bridge.Executor is used directly without any shim — structurally satisfies engine.CatchUpIO and engine.RebuildIO. No code changes needed. After Task E+F+G: zero shim types remain in block_recovery.go. v2bridge Reader/Pinner/Executor all satisfy sw-block contracts directly. Validation: - go test ./weed/storage/blockvol/v2bridge/ -run "TestPinner_\|TestExecutor_\|TestBridge_" → PASS - go test ./weed/server/ -run "TestP4_\|TestP16B_" → PASS (8 tests) - go test ./sw-block/bridge/blockvol/... → PASS Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 00:45:43 -07:00
pingqiu	680b530314	refactor: Task E — reader returns bridge.BlockVolState directly Reader backend-binding extraction: - v2bridge/reader.go: Reader.ReadState() now returns bridge.BlockVolState directly instead of a local v2bridge.BlockVolState mirror type. Removed the local BlockVolState type entirely. - block_recovery.go: removed readerShimForRecovery (12 lines of 1:1 field copying). Reader is now passed directly as bridge.BlockVolReader. Before: v2bridge.Reader → v2bridge.BlockVolState → readerShim → bridge.BlockVolState After: v2bridge.Reader → bridge.BlockVolState (direct) v2bridge now imports sw-block/bridge/blockvol for the contract type (control.go already did this, reader.go now follows the same pattern). Validation: - go test ./sw-block/bridge/blockvol/... → PASS - go test ./weed/storage/blockvol/v2bridge/ -run "TestReader_" → PASS - go test ./weed/server/ -run "TestP4_\|TestP16B_" → PASS (8 tests) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 00:43:30 -07:00
pingqiu	a38e04c03b	refactor: Task A — canonical identity/recovery rules via bridge helpers Remove direct fmt.Sprintf identity construction from v2bridge/control.go. Both convertReplicaAssignment and convertRebuildAssignment now use: - bridge.ReplicaAssignmentForServer (canonical ReplicaID derivation) - bridge.RecoveryTargetForRole (canonical role → SessionKind mapping) Before: 3 call sites with inline fmt.Sprintf("%s/%s", vol, server) After: 0 — all identity construction goes through sw-block canonical helpers volume_server_block.go already used bridge helpers (no change needed). Validation: - go test ./sw-block/bridge/blockvol/... → PASS (10 tests) - go test ./weed/storage/blockvol/v2bridge/ -run "TestControl_\|TestBridge_" → PASS (7 tests) - go test ./weed/server/ -run "TestBlockService_ApplyAssignments_RebuildingRole_" → PASS Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 00:10:48 -07:00
pingqiu	13680c9aa6	feat: Phase 16B rev3 — bounded rebuild execution ownership + review 16B widened from catch-up-only to catch-up + rebuild: - StartRebuildCommand: core emits rebuild command, adapter executes - Fail-closed: pending rebuild does not run without fresh command - Recovery observations close back into core projection New proofs: - StartRebuildCommand_ConsumesPendingPlanAndUpdatesProjection - RunRebuild_FailClosedWithoutFreshStartRebuildCommand Review docs: - phase-16-rev3-review.md: widened 16B review object - phase-16-rev3-manager-rereview.md: manager challenge response - phase-16-checkpoint-review.md: updated Non-claims: not full recovery-loop closure, not end-to-end failover/publication, not launch readiness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 21:38:44 -07:00
pingqiu	8c2485e0e9	feat: Phase 15 + Phase 16A/B — V2 core integration + checkpoint review Phase 15: V2 core wired into BlockService - volume_server_block.go: v2Core field, applyCoreAssignmentEvent, core command executors (ApplyRole, StartReceiver, ConfigureShipper, InvalidateSession, StartCatchUp, StartRebuild, PublishProjection) - Assignment processing now goes through core engine → command emission → bounded execution, replacing direct V1 replication setup - master_block_registry.go: ClusterHealthSummary, VolumeMode in entries - master_server_handlers_block.go: blockStatusHandler, entryToVolumeInfo refactored with entryReplicaSurface Phase 16A: Core projection surfaces Phase 16B: Bounded closure (checkpoint review ready) Test fixes: add v2Core to manually-constructed BlockService in idempotence, convergence, soak, and CP13-8A tests (required because V1 replication setup paths now delegate to core engine). All tests pass (21s regression). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 20:58:12 -07:00
pingqiu	a6fc8545b9	feat: Phase 14A+14B — V2 core publication ownership + command semantics 14A: Publication as explicit core-owned state - state.go: PublicationView on VolumeState, explicit gate reasons - engine.go: mode→readiness→publication chain with named gates (awaiting_role_apply, awaiting_shipper_configured, awaiting_barrier_durability) - projection.go: PublicationProjection carries publication truth - RF=1/no-replicas → allocated_only (CP13-9 constraint in core) - phase14_core_test.go: strengthened publication closure + RF=1 proof 14B: Command emission bounded by semantic gap - engine.go: repeated same-assignment skips redundant commands, repeated same-reason BarrierRejected skips duplicate invalidation, command-state tracking on VolumeState - command.go: new command types for bounded emission - event.go: new boundary events - phase14_command_test.go: exact command sequences frozen as proofs (primary/replica repeated assignment, assignment changed, repeated failure) - phase14_boundary_test.go: boundary/recovery structural tests All tests pass in sw-block/engine/replication. Phase 14 docs updated (14A accepted, 14B active→14C planned). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 16:52:55 -07:00
pingqiu	34f42078fb	docs: Phase 13 CP13-9 accepted + Phase 14 preparation docs - phase-13.md: CP13-8/8A/9 accepted with carry-forward - phase-13-log.md: CP13-9 technical/delivery packs - phase-13-cp9-mode-normalization.md: minor updates - v2-protocol-claim-and-evidence.md: CP13-8/8A claims updated, constrained-V1-runtime interpretation rule added Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 16:13:03 -07:00
pingqiu	fb0da91196	feat: start Phase 14 V2 core shell Make the first V2 core owner explicit in sw-block by freezing Phase 14 docs, mode/readiness/publication semantics, and bounded command emission rules. This turns accepted Phase 13 constraints into executable core behavior without overclaiming live runtime cutover. Made-with: Cursor	2026-04-03 16:11:38 -07:00
pingqiu	6e1b8efd68	feat: CP13-9 — mode normalization for constrained V1 runtime Add computed VolumeMode to BlockVolumeEntry with 5 normalized modes: - allocated_only: RF=1, no replicas (standalone) - bootstrap_pending: RF>1 but replicas not yet ready (first-write pending) - publish_healthy: all replicas ready, no transport degradation - degraded: replication impaired but recoverable - needs_rebuild: unrecoverable gap, rebuild required Code changes: - master_block_registry.go: computeVolumeMode() called from recomputeReplicaState(), VolumeMode field on BlockVolumeEntry - master_server_handlers_block.go: VolumeMode exposed in REST API - blockapi/types.go: VolumeMode field in VolumeInfo - testrunner types: VolumeMode for scenario assertions 7 tests prove mode normalization: - AllocatedOnly, BootstrapPending (2 cases), PublishHealthy, Degraded, NeedsRebuild, SurfaceConsistency (transition proof) Interpretation rule: current integrated tests validate V1 runtime under V2 constraints, not a completed V2 runtime (Phase 14 scope). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 15:02:50 -07:00
pingqiu	4c7fbefe25	feat: CP13-8 PASSES — real-workload validation on RF=2 sync_all CP13-8 scenario results on m01/M02 (25Gbps RoCE): fsck_ext4: CLEAN file count: 200 (assert_equal PASS) checksum match: MATCH (assert_contains PASS) pgbench TPS: 565.69 (assert_greater PASS) auto-failover: 10.0.0.1:18480 → 10.0.0.3:18480 Code changes (tester + scenario): - volume_server_block.go: readiness state, assignment lifecycle cleanup - block_heartbeat_loop.go: readiness-aware heartbeat reporting - store_blockvol.go: readiness tracking - master_server_handlers_block.go: block API handler updates - cp13-8-real-workload-validation.yaml: redesigned scenario (removed block_promote, use natural auto-failover flow, bootstrap write before wait_volume_healthy) - testrunner/actions/devops.go: scenario action improvements - replica_read_test.go: component-level replica read test Phase docs: CP13-7 accepted, CP13-8/8A technical packs updated, design docs updated for protocol closure evidence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 14:24:13 -07:00
pingqiu	334c12664a	fix: CP13-8A P0 — post-promote primary refresh with replica addresses Bug: After failover promotes a replica to primary, the old primary re-registers via heartbeat as a replica (lower epoch). But the master never sent an updated Primary assignment to the new primary with the re-registered replica's addresses. The new primary had 0 shippers → replication dead. sync_all barrier passed vacuously. Root cause: upsertServerAsReplica (heartbeat reconciliation) added the re-registered server to Replicas[] but didn't (a) populate DataAddr/ CtrlAddr from heartbeat info, or (b) trigger a primary assignment refresh. Fix: - master_block_registry.go: upsertServerAsReplica now copies DataAddr/ CtrlAddr from heartbeat info and sets NeedsPrimaryRefresh flag. UpdateFullHeartbeat returns HeartbeatResult with PrimaryRefreshNeeded entries. DrainPrimaryRefreshNeeded collects and clears the flag. - master_block_failover.go: add enqueuePrimaryRefresh — builds a Primary assignment with all current replica addresses and enqueues it. - master_grpc_server.go: heartbeat handler processes PrimaryRefreshNeeded entries after UpdateFullHeartbeat. Gate test: TestPromote_AssignmentHasReplicaAddrs now PASSES — after promote + re-register, the new primary gets an assignment with replicaDataAddr=vs1:14260 and replicaAddrs=1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 13:59:43 -07:00
pingqiu	7012383c3f	fix: StartReplicaReceiver idempotency guard — skip if already running P0 bug on real hardware: assignments are re-delivered every heartbeat cycle (5s). First setupReplicaReceiver succeeds (receiver starts on deterministic port). Second call fails with "bind: address already in use" because the listener is already bound. The volume stays permanently degraded, blocking all RF=2 sync_all replication. Fix: skip StartReplicaReceiver if v.replRecv is already set. The receiver only needs to start once per volume lifetime. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 13:18:30 -07:00
pingqiu	3da4c19046	fix: CP13-8A — fix malformed replica address in test allocator + add read proof Investigation result: - Dual-BlockVol hypothesis: DISPROVEN (one instance per path, correct wiring) - Root cause: adapter wiring bug in test allocator soak_test.go blockVSAllocate returned ReplicaDataAddr = "vs2:9333:14260" (server + ":port" where server already has a port → three colons, invalid) This caused setupReplicaReceiver to fail silently → no data replicated Root cause classification: adapter/test-harness bug - NOT a backend data visibility bug - NOT a core-rule gap - The engine read path works correctly (TestSyncAll_FullRoundTrip passes) Code changes: - qa_block_soak_test.go: fix allocator to use host:port (not server:port), use deterministic FNV-hashed ports matching production ReplicationPorts - qa_block_cp13_8a_test.go: 2 new integration tests proving replica reads work through both ReadLBA and adapter.ReadAt, before and after promotion Remaining contradiction for CP13-8 scenario on real hardware: - The production weed cluster uses ReplicationPorts (deterministic) which should not have this bug. If CP13-8 still fails on m01/M02, the cause is different from this test-harness issue and needs a separate investigation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 11:47:41 -07:00
pingqiu	2c305f9e7f	fix: CP13-8 — use correct assert params + add pgbench TPS gate 1. assert_contains: change actual/expected to value/contains (matches the action implementation in system.go) 2. Add assert_greater for pgbench TPS > 0 after pgbench_run (closes the pgbench durability pass criterion in the doc) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 09:17:11 -07:00
pingqiu	d7cd415714	feat: CP13-8 — bounded real-workload validation scenario + envelope One named workload validation package for RF=2 sync_all: - Scenario: cp13-8-real-workload-validation.yaml (6 phases) - ext4 proof: write 200 files → failover → fsck + file count + md5sum diff - pgbench proof: TPC-B on promoted replica (database durability) - Disturbance: one bounded failover (kill primary, promote replica) Workload envelope doc: phase-13-cp8-workload-validation.md - Named topology, transport, workloads, disturbance, exclusions - Pass criteria: fsck passes, 200 files, checksums match, pgbench TPS > 0 - Maps each pass criterion to accepted CP13-1..7 semantics - Explicit non-claims: not rollout approval, not NVMe, not soak, not CP13-9 Reuses existing infrastructure: - cp85-db-ext4-fsck.yaml pattern (extended with checksums + pgbench) - benchmark-pgbench.yaml actions (pgbench_init/pgbench_run) Must run on real hardware (m01/M02). Cannot run in unit test harness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:59:46 -07:00
pingqiu	4f7283b6be	fix: registry role-aware failover + devops action + failover scenario update - master_block_registry.go: minor role-handling fixes - qa_failover_role_test.go: new failover role test - testrunner/actions/devops.go: new devops action helpers - recovery-baseline-failover.yaml: scenario alignment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:48:13 -07:00
pingqiu	21ccf06ef3	docs: Phase 13 CP13-1..CP13-7 technical packs, acceptance status, design updates - phase-13.md: CP13-1 through CP13-6 accepted, CP13-7 active - phase-13-log.md: full technical + delivery packs for CP13-2..CP13-7 - phase-13-cp4-state-eligibility.md: refined barrier behavior table (Disconnected/Degraded as recovery entry points, not eligibility) - phase-12.md: minor cross-reference updates - Older phase docs: minor wording alignment - Design docs: V2 development plan and completion overview updated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:48:05 -07:00
pingqiu	1d3fb1f119	fix: CP13-7 rev3 — require NeedsRebuild, not Degraded, after handshake gap Tighten TestReconnect_GapBeyondRetainedWal_NeedsRebuild assertion from "NeedsRebuild or Degraded" to strictly "NeedsRebuild". The handshake R < S path returns NeedsRebuild directly — tolerating Degraded weakened the proof. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:36:00 -07:00
pingqiu	ec63c18438	fix: CP13-7 rev2 — real handshake gap detection, reclassify rebuild test Two fixes: 1. TestReconnect_GapBeyondRetainedWal_NeedsRebuild: rewritten to test the real reconnect handshake gap detection path (R < S in reconnectWithHandshake). Sequence: establish sync → disconnect → release retention hold via timeout → write + flush to advance WAL past replica position → reconnect → handshake detects R=0 < S=9 → NeedsRebuild. Log proves: "reconnect: gap too large R=0 H=8 S=9" 2. TestReplicaState_RebuildComplete_ReentersInSync: reclassified from primary proof to support evidence (does not start from live NeedsRebuild shipper state, but proves rebuild mechanics work end-to-end). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:24:56 -07:00
pingqiu	88c336b1c1	feat: CP13-7 — NeedsRebuild fail-closed fallback + rebuild handoff proof Last baseline FAIL closed: - TestAdversarial_NeedsRebuildBlocksAllPaths: rewritten to use EvaluateRetentionBudgets for NeedsRebuild trigger, then asserts 5 properties: state=NeedsRebuild, Ship drops, Barrier rejects, state sticky after failed barrier, second SyncCache still fails Last baseline PASS* closed: - TestReconnect_GapBeyondRetainedWal_NeedsRebuild: rewritten with hard NeedsRebuild state assertion + SyncCache failure assertion 6 tests promoted to CP13-7 primary proof: - NeedsRebuildBlocksAllPaths (fail-closed lifecycle) - GapBeyondRetainedWal (transition) - HeartbeatReportsNeedsRebuild (visibility) - RebuildComplete_ReentersInSync (handoff) - Rebuild_AbortOnEpochChange (epoch safety) - PostRebuild_FlushedLSN_IsCheckpoint (progress initialization) Baseline: 43 PASS / 0 FAIL / 1 PASS* (address witness only) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 00:13:33 -07:00
pingqiu	0ce5aa32e9	fix: CP13-6 rev3 — hard hold-release assertion + stale comment cleanup 1. TestWalRetention_TimeoutTriggersNeedsRebuild: add hard assertion that checkpoint advances past replicaFlushedLSN after NeedsRebuild (proves hold is actually released, not just state transition) 2. TestWalRetention_RequiredReplicaBlocksReclaim: remove stale "EXPECTED TO FAIL" / duplicate comment block Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 23:59:44 -07:00
pingqiu	4e55b53bef	fix: CP13-6 rev2 — upgrade all 3 retention tests to hard assertions, block-size-aware budget Three fixes: 1. TestWalRetention_RequiredReplicaBlocksReclaim: rewritten from log-only placeholder to hard assertion (checkpointLSN <= replicaFlushedLSN) 2. TestWalRetention_TimeoutTriggersNeedsRebuild: rewritten from log-only to hard assertion (State() == NeedsRebuild after 1ns timeout) 3. EvaluateRetentionBudgets: uses RetentionBudgetParams struct with actual BlockSize from volume config instead of hardcoded 4096 All 3 retention tests now have real state/progress assertions. No placeholder or log-only evidence remains in CP13-6 proof package. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 23:45:29 -07:00
pingqiu	0ca57dc2eb	feat: CP13-6 — replica-aware WAL retention with max-bytes budget Add max-bytes retention budget alongside existing timeout budget: - shipper_group.go: EvaluateRetentionBudgets now checks both timeout (last contact time) and max-bytes (entry lag * 4KB > maxBytes). Either exceeding budget → NeedsRebuild state transition. - blockvol.go: add walRetentionMaxBytes (64MB default), pass to EvaluateRetentionBudgets with primaryHeadLSN. TestWalRetention_MaxBytesTriggersNeedsRebuild upgraded from PASS* (log-only placeholder) to real PASS: asserts State()==NeedsRebuild after lag exceeds configured max-bytes budget. Retention contract: hold-back blocks reclaim for recoverable replicas, timeout and max-bytes budgets escalate to NeedsRebuild and release hold. Full rebuild lifecycle remains CP13-7 scope. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 23:10:06 -07:00
pingqiu	20a1a4995c	fix: CP13-5 doc — remove stale CatchingUp transition claim Replace "observable CatchingUp state transition" with the actual 3 signals the test asserts: seeded hasFlushedProgress, receivedLSN advance, non-zero replicaFlushedLSN. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 22:59:44 -07:00
pingqiu	4681df6b56	fix: CP13-5 — tighten reconnect proof with observable handshake evidence Findings fixed: 1. TestAdversarial_ReconnectUsesHandshakeNotBootstrap now has 3 observable proof points instead of just "SyncCache succeeded": - new shipper HasFlushedProgress=true (seeded from old group) - replica receivedLSN advances during SyncCache (catch-up delivered entries) - shipper replicaFlushedLSN > 0 after barrier (durable progress established) Bootstrap alone would not advance receivedLSN — it only sends the barrier. 2. TestBug2 stale comment removed: "must NOT call SetReplicaAddr" replaced with accurate CP13-5 explanation that SetReplicaAddrs now preserves hasFlushedProgress across shipper replacement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 22:56:20 -07:00
pingqiu	80be2ec05a	feat: CP13-5 — reconnect handshake + WAL catch-up on SetReplicaAddrs Bug: SetReplicaAddrs created fresh shippers (hasFlushedProgress=false), so after disconnect, the new shipper used bootstrap instead of reconnect handshake. Bootstrap doesn't replay missed WAL entries — barrier hung. Fix: - blockvol.go: SetReplicaAddrs checks if old shipper group had durable progress (AnyHasFlushedProgress). If so, seeds new shippers with hasFlushedProgress=true → they use reconnect handshake + catch-up. - shipper_group.go: add AnyHasFlushedProgress() helper. 3 baseline FAILs now PASS: - ReconnectUsesHandshakeNotBootstrap: reconnect path used, not bootstrap - CatchupMultipleDisconnects: repeated disconnect/reconnect recovers - CatchupDoesNotOverwriteNewerData: catch-up completes, safety exercised 7 tests promoted to CP13-5 primary proof. TestAdversarial_NeedsRebuildBlocksAllPaths still FAIL (CP13-7 scope). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 22:39:08 -07:00
pingqiu	1c294af169	feat: CP13-4 — replica state machine / barrier eligibility contract + proof Contract review: 6-state set (Disconnected, Connecting, CatchingUp, InSync, Degraded, NeedsRebuild). Only InSync proceeds to barrier request path. All other states either fail immediately or attempt reconnect (must succeed before reaching barrier). New test: TestBarrier_NonEligibleStates_FailClosed — systematically verifies each non-eligible state (Connecting, CatchingUp, NeedsRebuild, Disconnected) is rejected by Barrier(), and InSync is the only state that enters the barrier request path. 5 baseline tests promoted to CP13-4 primary proof. No production code changed — contract review + new focused test only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 22:01:05 -07:00
pingqiu	d4ff6b482b	fix: CP13-3 test — exercise real shipper.Barrier() against legacy server The previous test only checked wire decode + fresh shipper state, never calling shipper.Barrier() against a legacy response source. New test runs a fake TCP control server that responds with a 1-byte BarrierOK (no FlushedLSN). Shipper.Barrier() is called against it and must return an error containing "no FlushedLSN". Verifies the real rejection path at wal_shipper.go:229-231. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 21:47:58 -07:00

13186 Commits