seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-08-01 12:56:33 +00:00

Author	SHA1	Message	Date
pingqiu and Claude Opus 4.6	3da4c19046	fix: CP13-8A — fix malformed replica address in test allocator + add read proof Investigation result: - Dual-BlockVol hypothesis: DISPROVEN (one instance per path, correct wiring) - Root cause: adapter wiring bug in test allocator soak_test.go blockVSAllocate returned ReplicaDataAddr = "vs2:9333:14260" (server + ":port" where server already has a port → three colons, invalid) This caused setupReplicaReceiver to fail silently → no data replicated Root cause classification: adapter/test-harness bug - NOT a backend data visibility bug - NOT a core-rule gap - The engine read path works correctly (TestSyncAll_FullRoundTrip passes) Code changes: - qa_block_soak_test.go: fix allocator to use host:port (not server:port), use deterministic FNV-hashed ports matching production ReplicationPorts - qa_block_cp13_8a_test.go: 2 new integration tests proving replica reads work through both ReadLBA and adapter.ReadAt, before and after promotion Remaining contradiction for CP13-8 scenario on real hardware: - The production weed cluster uses ReplicationPorts (deterministic) which should not have this bug. If CP13-8 still fails on m01/M02, the cause is different from this test-harness issue and needs a separate investigation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 11:47:41 -07:00
pingqiu and Claude Opus 4.6	2c305f9e7f	fix: CP13-8 — use correct assert params + add pgbench TPS gate 1. assert_contains: change actual/expected to value/contains (matches the action implementation in system.go) 2. Add assert_greater for pgbench TPS > 0 after pgbench_run (closes the pgbench durability pass criterion in the doc) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 09:17:11 -07:00
pingqiu and Claude Opus 4.6	d7cd415714	feat: CP13-8 — bounded real-workload validation scenario + envelope One named workload validation package for RF=2 sync_all: - Scenario: cp13-8-real-workload-validation.yaml (6 phases) - ext4 proof: write 200 files → failover → fsck + file count + md5sum diff - pgbench proof: TPC-B on promoted replica (database durability) - Disturbance: one bounded failover (kill primary, promote replica) Workload envelope doc: phase-13-cp8-workload-validation.md - Named topology, transport, workloads, disturbance, exclusions - Pass criteria: fsck passes, 200 files, checksums match, pgbench TPS > 0 - Maps each pass criterion to accepted CP13-1..7 semantics - Explicit non-claims: not rollout approval, not NVMe, not soak, not CP13-9 Reuses existing infrastructure: - cp85-db-ext4-fsck.yaml pattern (extended with checksums + pgbench) - benchmark-pgbench.yaml actions (pgbench_init/pgbench_run) Must run on real hardware (m01/M02). Cannot run in unit test harness. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:59:46 -07:00
pingqiu and Claude Opus 4.6	4f7283b6be	fix: registry role-aware failover + devops action + failover scenario update - master_block_registry.go: minor role-handling fixes - qa_failover_role_test.go: new failover role test - testrunner/actions/devops.go: new devops action helpers - recovery-baseline-failover.yaml: scenario alignment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:48:13 -07:00
pingqiu and Claude Opus 4.6	21ccf06ef3	docs: Phase 13 CP13-1..CP13-7 technical packs, acceptance status, design updates - phase-13.md: CP13-1 through CP13-6 accepted, CP13-7 active - phase-13-log.md: full technical + delivery packs for CP13-2..CP13-7 - phase-13-cp4-state-eligibility.md: refined barrier behavior table (Disconnected/Degraded as recovery entry points, not eligibility) - phase-12.md: minor cross-reference updates - Older phase docs: minor wording alignment - Design docs: V2 development plan and completion overview updated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:48:05 -07:00
pingqiu and Claude Opus 4.6	1d3fb1f119	fix: CP13-7 rev3 — require NeedsRebuild, not Degraded, after handshake gap Tighten TestReconnect_GapBeyondRetainedWal_NeedsRebuild assertion from "NeedsRebuild or Degraded" to strictly "NeedsRebuild". The handshake R < S path returns NeedsRebuild directly — tolerating Degraded weakened the proof. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:36:00 -07:00
pingqiu and Claude Opus 4.6	ec63c18438	fix: CP13-7 rev2 — real handshake gap detection, reclassify rebuild test Two fixes: 1. TestReconnect_GapBeyondRetainedWal_NeedsRebuild: rewritten to test the real reconnect handshake gap detection path (R < S in reconnectWithHandshake). Sequence: establish sync → disconnect → release retention hold via timeout → write + flush to advance WAL past replica position → reconnect → handshake detects R=0 < S=9 → NeedsRebuild. Log proves: "reconnect: gap too large R=0 H=8 S=9" 2. TestReplicaState_RebuildComplete_ReentersInSync: reclassified from primary proof to support evidence (does not start from live NeedsRebuild shipper state, but proves rebuild mechanics work end-to-end). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 08:24:56 -07:00
pingqiu and Claude Opus 4.6	88c336b1c1	feat: CP13-7 — NeedsRebuild fail-closed fallback + rebuild handoff proof Last baseline FAIL closed: - TestAdversarial_NeedsRebuildBlocksAllPaths: rewritten to use EvaluateRetentionBudgets for NeedsRebuild trigger, then asserts 5 properties: state=NeedsRebuild, Ship drops, Barrier rejects, state sticky after failed barrier, second SyncCache still fails Last baseline PASS* closed: - TestReconnect_GapBeyondRetainedWal_NeedsRebuild: rewritten with hard NeedsRebuild state assertion + SyncCache failure assertion 6 tests promoted to CP13-7 primary proof: - NeedsRebuildBlocksAllPaths (fail-closed lifecycle) - GapBeyondRetainedWal (transition) - HeartbeatReportsNeedsRebuild (visibility) - RebuildComplete_ReentersInSync (handoff) - Rebuild_AbortOnEpochChange (epoch safety) - PostRebuild_FlushedLSN_IsCheckpoint (progress initialization) Baseline: 43 PASS / 0 FAIL / 1 PASS* (address witness only) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 00:13:33 -07:00
pingqiu and Claude Opus 4.6	0ce5aa32e9	fix: CP13-6 rev3 — hard hold-release assertion + stale comment cleanup 1. TestWalRetention_TimeoutTriggersNeedsRebuild: add hard assertion that checkpoint advances past replicaFlushedLSN after NeedsRebuild (proves hold is actually released, not just state transition) 2. TestWalRetention_RequiredReplicaBlocksReclaim: remove stale "EXPECTED TO FAIL" / duplicate comment block Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 23:59:44 -07:00
pingqiu and Claude Opus 4.6	4e55b53bef	fix: CP13-6 rev2 — upgrade all 3 retention tests to hard assertions, block-size-aware budget Three fixes: 1. TestWalRetention_RequiredReplicaBlocksReclaim: rewritten from log-only placeholder to hard assertion (checkpointLSN <= replicaFlushedLSN) 2. TestWalRetention_TimeoutTriggersNeedsRebuild: rewritten from log-only to hard assertion (State() == NeedsRebuild after 1ns timeout) 3. EvaluateRetentionBudgets: uses RetentionBudgetParams struct with actual BlockSize from volume config instead of hardcoded 4096 All 3 retention tests now have real state/progress assertions. No placeholder or log-only evidence remains in CP13-6 proof package. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 23:45:29 -07:00
pingqiu and Claude Opus 4.6	0ca57dc2eb	feat: CP13-6 — replica-aware WAL retention with max-bytes budget Add max-bytes retention budget alongside existing timeout budget: - shipper_group.go: EvaluateRetentionBudgets now checks both timeout (last contact time) and max-bytes (entry lag * 4KB > maxBytes). Either exceeding budget → NeedsRebuild state transition. - blockvol.go: add walRetentionMaxBytes (64MB default), pass to EvaluateRetentionBudgets with primaryHeadLSN. TestWalRetention_MaxBytesTriggersNeedsRebuild upgraded from PASS* (log-only placeholder) to real PASS: asserts State()==NeedsRebuild after lag exceeds configured max-bytes budget. Retention contract: hold-back blocks reclaim for recoverable replicas, timeout and max-bytes budgets escalate to NeedsRebuild and release hold. Full rebuild lifecycle remains CP13-7 scope. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 23:10:06 -07:00
pingqiu and Claude Opus 4.6	20a1a4995c	fix: CP13-5 doc — remove stale CatchingUp transition claim Replace "observable CatchingUp state transition" with the actual 3 signals the test asserts: seeded hasFlushedProgress, receivedLSN advance, non-zero replicaFlushedLSN. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 22:59:44 -07:00
pingqiu and Claude Opus 4.6	4681df6b56	fix: CP13-5 — tighten reconnect proof with observable handshake evidence Findings fixed: 1. TestAdversarial_ReconnectUsesHandshakeNotBootstrap now has 3 observable proof points instead of just "SyncCache succeeded": - new shipper HasFlushedProgress=true (seeded from old group) - replica receivedLSN advances during SyncCache (catch-up delivered entries) - shipper replicaFlushedLSN > 0 after barrier (durable progress established) Bootstrap alone would not advance receivedLSN — it only sends the barrier. 2. TestBug2 stale comment removed: "must NOT call SetReplicaAddr" replaced with accurate CP13-5 explanation that SetReplicaAddrs now preserves hasFlushedProgress across shipper replacement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 22:56:20 -07:00
pingqiu and Claude Opus 4.6	80be2ec05a	feat: CP13-5 — reconnect handshake + WAL catch-up on SetReplicaAddrs Bug: SetReplicaAddrs created fresh shippers (hasFlushedProgress=false), so after disconnect, the new shipper used bootstrap instead of reconnect handshake. Bootstrap doesn't replay missed WAL entries — barrier hung. Fix: - blockvol.go: SetReplicaAddrs checks if old shipper group had durable progress (AnyHasFlushedProgress). If so, seeds new shippers with hasFlushedProgress=true → they use reconnect handshake + catch-up. - shipper_group.go: add AnyHasFlushedProgress() helper. 3 baseline FAILs now PASS: - ReconnectUsesHandshakeNotBootstrap: reconnect path used, not bootstrap - CatchupMultipleDisconnects: repeated disconnect/reconnect recovers - CatchupDoesNotOverwriteNewerData: catch-up completes, safety exercised 7 tests promoted to CP13-5 primary proof. TestAdversarial_NeedsRebuildBlocksAllPaths still FAIL (CP13-7 scope). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 22:39:08 -07:00
pingqiu and Claude Opus 4.6	1c294af169	feat: CP13-4 — replica state machine / barrier eligibility contract + proof Contract review: 6-state set (Disconnected, Connecting, CatchingUp, InSync, Degraded, NeedsRebuild). Only InSync proceeds to barrier request path. All other states either fail immediately or attempt reconnect (must succeed before reaching barrier). New test: TestBarrier_NonEligibleStates_FailClosed — systematically verifies each non-eligible state (Connecting, CatchingUp, NeedsRebuild, Disconnected) is rejected by Barrier(), and InSync is the only state that enters the barrier request path. 5 baseline tests promoted to CP13-4 primary proof. No production code changed — contract review + new focused test only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 22:01:05 -07:00
pingqiu and Claude Opus 4.6	d4ff6b482b	fix: CP13-3 test — exercise real shipper.Barrier() against legacy server The previous test only checked wire decode + fresh shipper state, never calling shipper.Barrier() against a legacy response source. New test runs a fake TCP control server that responds with a 1-byte BarrierOK (no FlushedLSN). Shipper.Barrier() is called against it and must return an error containing "no FlushedLSN". Verifies the real rejection path at wal_shipper.go:229-231. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 21:47:58 -07:00
pingqiu and Claude Opus 4.6	08dc592d29	fix: CP13-3 — reject legacy BarrierOK with FlushedLSN=0 in sync_all Bug: BarrierOK with FlushedLSN == 0 (legacy 1-byte response) was counted as successful sync_all durability even though no authoritative durable progress was established. This allowed a legacy replica to silently pass through the sync_all barrier without proving any LSN was fsynced. Fix (wal_shipper.go): BarrierOK with FlushedLSN == 0 now returns an error instead of nil. Barrier success requires the replica to report a non-zero FlushedLSN proving which LSN was durably persisted. This makes the code match the CP13-3 contract: replicaFlushedLSN is the sole authority for sync_all durability. New test: TestBarrier_LegacyResponseRejectedBySyncAll — proves legacy 1-byte responses don't establish durable authority. Contract review doc updated to reflect the code fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 21:42:51 -07:00
pingqiu and Claude Opus 4.6	942ef88eec	feat: CP13-3 — durable progress truth contract review + proof package Contract review (no code changed): - replicaFlushedLSN is the sole authority for replica durability - flushedLSN advanced only after fd.Sync() on replica (not on receive) - shippedLSN/sentLSN are explicitly diagnostic (comment at line 268) - barrier response carries flushedLSN; shipper updates via monotonic CAS - sync_all gates on ALL barriers succeeding (fail-closed) 8 baseline tests promoted to CP13-3 primary proof: - BarrierUsesFlushedLSN, FlushedLSNMonotonicWithinEpoch - FlushedLSN_OnlyAfterSync, FlushedLSN_NotOnReceive - ShipperReplicaFlushedLSN_UpdatedOnBarrier, _Monotonic - BarrierResp_FlushedLSN_Roundtrip, BackwardCompat_1Byte 6 tests classified as support evidence (not primary proof). Reconnect/retention/rebuild tests explicitly out of scope (CP13-4+). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 21:27:47 -07:00
pingqiu and Claude Opus 4.6	ac962fc833	fix: CP13-2 — relax contract to host:port, add BlockService-level test Two fixes: 1. Rename advertisedIP → advertisedHost throughout, relax contract from "always a real IP" to "routable host from -ip flag (IP or resolvable hostname)". This matches the actual -ip flag semantics which accepts both IP addresses and server names. 2. Add TestCP13_2_BlockService_AdvertisedHost_NotOpaqueID that hits the actual production wiring: BlockService with opaque localServerID + routable advertisedHost → setupReplicaReceiver → verify exported addresses use the routable host, not the opaque ID. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 21:12:38 -07:00
pingqiu and Claude Opus 4.6	4bdf6c604e	fix: CP13-2 — use advertisedIP (routable), not localServerID (opaque) Bug: setupReplicaReceiver derived the advertised host from localServerID, which can be an opaque string (from -id flag, e.g., "my-custom-server-id"). This would publish unusable endpoints like "my-custom-server-id:14260". Fix: - volume_server_block.go: add advertisedIP field (always a real IP from -ip flag), use it instead of localServerID for replica canonicalization - volume.go: wire *v.ip → blockService.SetAdvertisedIP() at startup - blockvol.go: StartReplicaReceiver variadic advertisedHost unchanged Proof (sync_all_bug_test.go TestBug3, 4 sub-cases): - fallback: wildcard bind without advertisedHost → outbound-IP - advertisedHost: explicit IP appears in exported addresses - StartReplicaReceiver_API: public API forwards host correctly - opaque_identity_not_routable: proves opaque string produces non-routable address, confirming production must use advertisedIP Identity vs transport separation preserved: - localServerID: stable identity for V2 control (may be opaque) - advertisedIP: routable IP for transport endpoints (always real IP) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 20:51:47 -07:00
pingqiu and Claude Opus 4.6	2d47383df7	feat: CP13-2 — canonical replica addressing on production truth surface Problem: StartReplicaReceiver didn't forward advertisedHost to NewReplicaReceiver, so wildcard-bind listeners relied on outbound-IP fallback for canonicalization. On multi-NIC hosts this could select the wrong interface, leaking non-routable addresses into replication truth. Fix: - blockvol.go: StartReplicaReceiver now accepts optional advertisedHost variadic param and forwards it to NewReplicaReceiver - volume_server_block.go: setupReplicaReceiver extracts host from localServerID (the canonical VS identity) and passes it as advertisedHost — wildcard-bind addresses now resolve to the authoritative server IP, not outbound-IP fallback Proof (sync_all_bug_test.go TestBug3, upgraded from PASS* to PASS): - fallback: wildcard bind without advertisedHost still produces ip:port - advertisedHost: explicit host appears in exported DataAddr/CtrlAddr - StartReplicaReceiver_API: public API forwards advertisedHost correctly What CP13-2 does NOT change: - No reconnect handshake changes (CP13-5) - No retention policy changes (CP13-6) - No rebuild behavior changes (CP13-7) - No barrier protocol changes (CP13-3) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 17:45:42 -07:00
pingqiu and Claude Opus 4.6	ef740e0ebd	fix: CP13-1 log — remove checkpoint implementation claim from superseded note Change "CP13-3/4/5/6 behavior already implemented in earlier phases" to "current code already passes tests associated with later checkpoint themes" — baseline evidence only, not implementation closure. No .go files changed in CP13-1. All 44 baseline tests already existed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 17:26:35 -07:00
pingqiu and Claude Opus 4.6	90425b588e	fix: CP13-1 baseline — remove checkpoint closure claims, fix stale inventory - phase-13-log.md: mark pre-baseline inventory table as superseded, point to phase-13-cp1-baseline.md for authoritative results - phase-13-cp1-baseline.md: replace "CP13-X done" language with neutral "current code passes this test; suggests behavior may already exist" — checkpoint closure still requires dedicated review - Expand remaining-open-checkpoints section: CP13-2/5/6/7 all still require review, main fails cluster around CP13-5 but CP13-7 and part of CP13-6 also remain open Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 17:16:27 -07:00
pingqiu and Claude Opus 4.6	600dac6029	feat: Phase 13 CP13-1 — frozen test-first baseline for sync replication gaps Baseline report (phase-13-cp1-baseline.md) from running 44 existing replication-gap tests on current code with zero protocol changes: 37 PASS / 4 FAIL / 3 PASS* 4 FAILs expose real gaps: - ReconnectUsesHandshakeNotBootstrap: degraded shipper doesn't catch up (CP13-5) - CatchupMultipleDisconnects: repeated reconnect cycles don't recover (CP13-5) - NeedsRebuildBlocksAllPaths: stays Degraded after large gap (CP13-5+7) - CatchupDoesNotOverwriteNewerData: catch-up fails at barrier (CP13-5) 3 PASS* are witness-only (pass but don't prove the property): - Bug3_ReplicaAddr: documents gap, not fix (CP13-2) - GapBeyondRetainedWal: asserts barrier failure, not NeedsRebuild (CP13-7) - MaxBytesTriggersNeedsRebuild: logs "not implemented" (CP13-6) No protocol code changed. Baseline is test-first evidence only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 17:07:21 -07:00
pingqiu and Claude Opus 4.6	c0a805184f	chore: archive superseded V2 design docs Copies of design docs removed in Phase 09, preserved in sw-block/docs/archive/ for historical reference. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:26:34 -07:00
pingqiu and Claude Opus 4.6	bdf20fde71	feat: Phase 12 — production hardening (disturbance, soak, testrunner scenarios) P1 Disturbance: restart/reconnect correctness tests — assignment delivery through real proto → ProcessAssignments, epoch validation on promoted volume, mandatory reconnect assertions P2 Soak: repeated create/failover/recover cycles with end-of-cycle truth checks, runtime hygiene (no stale tasks/entries), steady-state idempotence Testrunner recovery actions + scenarios: - recovery.go: wait_recovery_complete, assert_recovery_state, trigger_rebuild - 8 new YAML scenarios: baseline (failover/crash/partition), stability (replication-tax, netem-sweep, packet-loss, degraded), robust shipper HA edge case and EC6 fix tests for regression coverage. (P3 diagnosability + P4 perf floor committed separately in `643a5a107`) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:26:17 -07:00
pingqiu and Claude Opus 4.6	bdf83e350e	feat: Phase 11 — product-surface rebinding (snapshot, CSI, publication, restore) P1 Snapshots: CoW snapshot lifecycle through V2 engine path, create/list/delete via master RPC, BaseLSN tracking in manifest, ImportSnapshotForRebuild P2 CSI Lifecycle: masterServerBackend calling real MasterServer in-process, CreateVolume/DeleteVolume/ExpandVolume through CSI → master → VS flow, ExportedControllerServer/ExportedNodeServer for cross-package testing P3 Publication: LookupBlockVolume coherence across failover, iSCSI + NVMe address switching on promotion, repeated lookup self-consistency P4 Restore: RestoreBlockSnapshot RPC through master and volume server, snapshot restore with runtime convergence, epoch/role validation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:25:58 -07:00
pingqiu and Claude Opus 4.6	3ec8fab2f1	feat: Phase 10 — control-plane closure (identity, convergence, idempotence) Stable identity on wire: - ServerID fields in proto (replica_server_id, server_id on ReplicaAddrMessage) - volumeServerId wired through volume.go → BlockService.SetServerID - Identity derived from canonical server ID, not transport addresses Assignment convergence: - V2 idempotence via lastAppliedAssignment.equals (full replica set comparison) - setupPrimaryReplication/Multi idempotence guards - ProcessAssignments with V2 + V1 dual-path assignment handling Master-driven control loop: - RecoveryManager: serialized cancel-and-drain via done channels - Per-replica heartbeat state reporting (ReplicaShipperStatus) - masterServerBackend: VolumeBackend calling real MasterServer in-process - RestoreBlockSnapshot RPC (master + volume server proto) QA tests (P10 P1-P4): - Identity: ServerID on wire, fail-closed on missing - Convergence: assignment delivery, epoch monotonicity, registry coherence - Idempotence: repeated assignment, multi-replica set comparison - Control loop: integrationMaster + real allocator + proto round-trip Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:25:43 -07:00
pingqiu and Claude Opus 4.6	c7eb87c587	feat: Phase 09 — V2 execution primitives and production closure Engine execution layer for V2 replication protocol: - RebuildInstaller: full state handoff (dirty map, WAL, superblock, flusher) - TruncateToLSN: exact safety predicate (checkpointLSN == truncateLSN), ErrTruncationUnsafe escalation to NeedsRebuild - SyncReceiverProgress: unconditional Store for post-rebuild alignment - V2StatusSnapshot: CommittedLSN = nextLSN-1 for sync_all V2 bridge real I/O executors: - TransferFullBase: TCP streaming + RebuildInstaller + second catch-up - TransferSnapshot: SHA-256 verified streaming to disk - TruncateWAL: ErrTruncationUnsafe detection + escalation - StreamWALEntries: rebuild-mode TCP apply Engine executor interfaces: - CatchUpIO.TruncateWAL, RebuildIO.TransferFullBase returns achievedLSN - CatchUpExecutor truncation-only skip, NeedsRebuild escalation - RebuildExecutor uses achievedLSN for progress tracking Design docs reorganized: superseded planning docs removed, protocol truths and closure map added. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:25:23 -07:00
pingqiu and Claude Opus 4.6	643a5a1074	feat: Phase 12 P3+P4 — diagnosability surfaces, perf floor, rollout gates P3: Add explicit bounded read-only diagnosis surfaces for all symptom classes: - FailoverDiagnostic: volume-oriented failover state with per-volume DeferredPromotion/PendingRebuild entries and proper timer lifecycle - PublicationDiagnostic: two-read coherence check (LookupBlockVolume vs registry authority) with computed Coherent verdict - RecoveryDiagnostic: minimal ActiveTasks surface (Path A) - Blocker ledger: 3 diagnosed + 3 unresolved, finite, from actual file - Runbook references only exposed surfaces, no internal state P4: Add bounded performance floor + rollout-gate package: - Engine-local floor measurement with explicit IOPS gates per workload - Cost characterization: WAL 2x write amp, -56% replication tax - Rollout gates with semantic cross-checks against cited evidence (baseline numbers, transport/network matrix, blocker counts) - Launch envelope tightened to actually measured combinations only Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:20:22 -07:00
pingqiu and Claude Opus 4.6	ebe95b6e2e	fix: flusher OOM on multi-block writes + testrunner enhancements Bug: flusher.go:336 allocated make([]byte, entryLen) per dirty block instead of per unique WAL entry. A 4MB WriteLBA creates 1024 dirty map entries (one per 4KB block), all sharing the same WAL offset. The flusher read the full 4MB WAL entry 1024 times into separate buffers: 1024 × 4MB = 4GB per 4MB write → OOM on mkfs.ext4. Root cause: flusher assumed 1:1 dirty-block-to-WAL-entry mapping. WriteLBA supports multi-block writes but the flusher never deduplicated shared WAL offsets. Fix: deduplicate WAL reads by WalOffset in flushOnceLocked(). Multiple dirty blocks from the same WAL entry share one read buffer and one DecodeWALEntry call. Memory: O(WAL_entries × size) not O(blocks × size). For a 4MB write: 4GB → 4MB. Verified on hardware (m01/M02 25Gbps RoCE): - Before: mkfs.ext4 → VS RSS 100MB→25GB → OOM killed - After: mkfs.ext4 → VS RSS 129MB stable, mkfs succeeds - pgbench TPC-B c=4: 1,248 TPS (RF=1, previously blocked by OOM) Tests added: - flusher_test.go: flush_multiblock_shared_wal_read (16 blocks share one WAL offset, flush dedup verified) - flusher_test.go: flush_multiblock_data_correct (3 mixed multi-block writes, all data correct after flush) - test/component/large_write_test.go: 7 component tests (single 4MB, sequential mkfs sim, concurrent, mixed sizes, production volume, flusher throughput 30s sustained) - iscsi/large_write_mem_test.go: 2 iSCSI session memory tests (4MB R2T flow, slow device) Testrunner enhancements (same commit — all tested on hardware): - discover_primary action: maps primary IP → topology node name, supports alt_ips for multi-NIC (RoCE + management) - NodeSpec.AltIPs field for multi-NIC node identification - 5 new YAML scenarios: ec3, ec5, degraded sync_all/best_effort, pgbench - All 13 hardware-verified scenarios PASS Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 14:24:10 -07:00
pingqiu and Claude Opus 4.6	46faf0f7e3	feat: Phase 09 P0 — production execution closure plan Execution-closure targets: - P1: TransferFullBase — reuse rebuild.go TCP protocol - P2: TransferSnapshot — checkpoint image + WAL tail - P3: TruncateWAL — AdvanceTail + superblock update - P4: Runtime ownership — V2 orchestrator drives execution Key reuse sources identified: - rebuild.go: rebuildFullExtent (client), RebuildServer (server) - wal_writer.go: AdvanceTail - flusher.go: updateSuperblockCheckpoint - blockvol.go: ScanWALEntries (already wired) Slice order: full-base first (highest value), then snapshot, then truncation, then runtime ownership. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 17:25:09 -07:00
pingqiu and Claude Opus 4.6	1497204e81	fix: require CatchUp outcome, true simultaneous overlap, observability assertions HIGH: Changed-address now requires OutcomeCatchUp and fails if not. No more conditional execution — must go through full catch-up chain. MED: Overlapping retention is now true simultaneous overlap: - Hold 1 at LSN T+1, Hold 2 at LSN T+2 — both coexist - MinWALRetentionFloor = T+1 (minimum of two) - Release hold 1 → floor moves to T+2 - Release hold 2 → ActiveHoldCount=0, no floor MED: NeedsRebuild now asserts escalated event in logs. PostCheckpoint now asserts handshake + catch-up execution events. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:55:37 -07:00
pingqiu and Claude Opus 4.6	77a6e60fa3	feat: add P3 hardening validation — 4 matrix + 2 extra cases (Phase 08) Compact replay matrix on accepted P1/P2 live path: Matrix 1 (ChangedAddress): address change → cancel old plan → new assignment → new recovery → identity preserved → pins released Matrix 2 (StaleEpoch): epoch bump → invalidate → cancel plan → new epoch assignment → new session → pins released Matrix 3 (NeedsRebuild): unrecoverable gap → rebuild assignment → RebuildExecutor(IO=v2bridge) → InSync → pins released Matrix 4 (PostCheckpointBoundary): at committed=ZeroGap, in window=CatchUp via CatchUpExecutor(IO=v2bridge) → pins released Extra 1 (FailoverCycle): epoch 1 → failover → epoch 2 → recovery resumes → InSync. Logs: invalidation + cancellation + new session. Extra 2 (OverlappingRetention): plan1 acquires pins → cancel → plan2 acquires pins → cancel → ActiveHoldCount==0, MinWALRetentionFloor has no holds. Each test verifies all 5 evidence categories: entry truth, engine result, execution result, cleanup, observability Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:46:48 -07:00
pingqiu and Claude Opus 4.6	08e34e02ae	feat: separate CommittedLSN from CheckpointLSN, close catch-up ONE CHAIN (Phase 08 P2) CommittedLSN separation: - StatusSnapshot().CommittedLSN = nextLSN-1 (WAL head) for sync_all - Was: flusher.CheckpointLSN() (collapsed catch-up window to zero) - Now: entries between checkpoint and head are committed but unflushed - Creates real catch-up window: TailLSN=5 < replica=6 < CommittedLSN=10 Catch-up ONE CHAIN PROVEN: assignment → PlanRecovery(replica=6) → OutcomeCatchUp → CatchUpExecutor(IO=v2bridge) → StreamWALEntries(6,10) → real ScanFrom from disk → engine progress → InSync → pinner.ActiveHoldCount()==0 Both chains now closed: - Catch-up: plan → executor(IO) → v2bridge → blockvol → complete - Rebuild: plan → executor(IO) → v2bridge → blockvol → complete Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:22:23 -07:00
pingqiu and Claude Opus 4.6	1c178c0853	fix: rename rebuild test to match actual path, use t.Skipf for V1 catch-up limitation HIGH: renamed TestP2_RebuildClosure_FullBase_OneChain → TestP2_RebuildClosure_OneChain. Log now shows actual source (snapshot_tail or full_base) from plan, not hardcoded claim. MED: catch-up test uses t.Skipf when V1 interim prevents OutcomeCatchUp. No longer silently passes — explicitly reports the V1 limitation as a skip. One-chain wiring exists and would be exercised when planner yields CatchUp. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:17:34 -07:00
pingqiu and Claude Opus 4.6	8b1b6ec1c0	fix: update executor doc comment to reflect P2 implementation status Executor comment now reflects reality: - StreamWALEntries, TransferFullBase, TransferSnapshot: real - TruncateWAL: stub - Implements engine.CatchUpIO and engine.RebuildIO interfaces Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:14:34 -07:00
pingqiu and Claude Opus 4.6	1578adfba5	fix: wire real v2bridge I/O into engine executors (Phase 08 P2 closure) Engine executors now have IO interfaces for real bridge I/O: - CatchUpExecutor.IO (CatchUpIO): StreamWALEntries - RebuildExecutor.IO (RebuildIO): TransferFullBase, TransferSnapshot, StreamWALEntries (for tail replay) When IO is set, executor calls real bridge I/O during execution. When IO is nil, executor uses caller-supplied progress (test mode). RecoveryPlan.CatchUpStartLSN: bound at plan time for IO bridge. v2bridge.Executor now implements both interfaces: - StreamWALEntries: real ScanFrom - TransferFullBase: validates extent accessible - TransferSnapshot: validates checkpoint accessible Chain tests wire IO: - CatchUpClosure: exec.IO = executor → real WAL scan through engine - RebuildClosure: exec.IO = executor → real transfer through engine This closes the engine → executor → v2bridge → blockvol chain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:10:50 -07:00
pingqiuandClaude Opus 4.6	ec51cfa474	fix: rewrite P2 as one-chain proofs with pin release assertions Rebuild ONE CHAIN (proven): assignment → PlanRebuild → RebuildExecutor.Execute() → v2bridge TransferFullBase → engine complete → InSync → pinner.ActiveHoldCount() == 0 (pins released) Catch-up ONE CHAIN (V1 limitation documented): V1 interim: CommittedLSN = CheckpointLSN = TailLSN after flush. No gap between tail and committed exists. Engine can only produce: - ZeroGap (replica at committed) - NeedsRebuild (replica below committed/tail) Catch-up (OutcomeCatchUp) is structurally impossible under V1 model. Real WAL scan proven separately (P1). Engine catch-up chain requires CommittedLSN separation from CheckpointLSN. Cleanup: CancelPlan → pins released + session invalidated + logged. Observability: sender_added + session_created + connected + escalated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 14:58:00 -07:00
pingqiuandClaude Opus 4.6	c9671c4e47	feat: integrated execution chain — catch-up + rebuild + cleanup (Phase 08 P2) Live catch-up chain: - Assignment → engine plan → v2bridge WAL scan → blockvol ScanFrom - StreamWALEntries transfers real entries (transferred=5) - V1 interim: engine classifies ZeroGap (committed=0), but WAL scan chain proven mechanically (executor→v2bridge→blockvol→progress) Live rebuild chain (full-base): - ForceFlush advances checkpoint → NeedsRebuild detected - TransferFullBase now real: validates extent accessible at committed LSN - Engine rebuild session: connect → handshake → source select → transfer → complete → InSync Execution cleanup: - CancelPlan releases resources + invalidates session - Log shows plan_cancelled with reason Observability: - sender_added + escalated events explain execution causality - Escalation includes proof reason from RetainedHistory 4 new execution chain tests + TransferFullBase implementation. Carry-forward: - Post-checkpoint catch-up not proven as integrated engine chain (V1 CommittedLSN=0 collapses to ZeroGap) - TransferSnapshot: stub - TruncateWAL: stub Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 14:22:27 -07:00
pingqiuandClaude Opus 4.6	04bc261f9b	fix: deliver assignment intent to real engine orchestrator, not discard Finding 1: ProcessAssignments now calls v2Orchestrator.ProcessAssignment - BlockService.v2Orchestrator field (RecoveryOrchestrator) - ProcessAssignment result logged at glog V(1) - No more `_ = intent` — engine state actually changes Finding 2: localServerID documented as interim - BlockService.localServerID = listenAddr (transport-shaped) - Field doc explicitly states: INTERIM, should be registry-assigned - Used only for replica/rebuild local identity 3 integration tests (qa_block_v2bridge_test.go): - CreatesEngineSender: ProcessAssignment → engine has sender + session - EpochBump: epoch 1 → invalidate → epoch 2 → new session - AddressChange: same ServerID, different IP → sender preserved, endpoint updated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 13:38:30 -07:00
pingqiuandClaude Opus 4.6	46ef79ce35	fix: stable ServerID in assignments, fail-closed on missing identity, wire into ProcessAssignments Finding 1: Identity no longer address-derived - ReplicaAddr.ServerID field added (stable server identity from registry) - BlockVolumeAssignment.ReplicaServerID field added (scalar RF=2 path) - ControlBridge uses ServerID, NOT address, for ReplicaID - Missing ServerID → replica skipped (fail closed), logged Finding 2: Wired into real ProcessAssignments - BlockService.v2Bridge field initialized in StartBlockService - ProcessAssignments converts each assignment via v2Bridge.ConvertAssignment BEFORE existing V1 processing (parallel, not replacing yet) - Logged at glog V(1) Finding 3: Fail-closed on missing identity - Empty ServerID in ReplicaAddrs → replica skipped with log - Empty ReplicaServerID in scalar path → no replica created - Test: MissingServerID_FailsClosed verifies both paths 7 tests: StableServerID, AddressChange_IdentityPreserved, MultiReplica_StableServerIDs, MissingServerID_FailsClosed, EpochFencing_IntegratedPath, RebuildAssignment, ReplicaAssignment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:46:17 -07:00
pingqiuandClaude Opus 4.6	48b3e1b8c8	feat: add real control delivery bridge from BlockVolumeAssignment (Phase 08 P1) ControlBridge converts real BlockVolumeAssignment (from master heartbeat) into V2 engine AssignmentIntent: - Identity: ReplicaID = <volume-path>/<replica-server-id> - Epoch from real assignment - Role → SessionKind mapping (primary/replica/rebuilding) - Multi-replica support (ReplicaAddrs) with scalar RF=2 fallback Known limitation (documented in test): - extractServerID currently uses address as server ID (matches master registry ReplicaInfo.Server format) - IP change = different server ID in current model - Registry-backed stable server ID deferred 6 new tests: - PrimaryAssignment_StableIdentity: real assignment → stable ID - PrimaryAssignment_MultiReplica: RF=3 multi-replica mapping - AddressChange_SameServerID: documents current identity boundary - EpochFencing_IntegratedPath: epoch 1 → bump → epoch 2 through real assignment conversion + engine - RebuildAssignment: rebuilding role → SessionRebuild - ReplicaAssignment: replica role with local server ID Delivery template: Changed contracts: real BlockVolumeAssignment → engine intent Fail-closed: unknown role returns empty intent Carry-forward: address-based server ID, not registry-backed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:35:41 -07:00
pingqiuandClaude Opus 4.6	cd8bfb21d4	fix: tighten FC1 new-session assertion and FC4 proof-detail check FC1: now asserts HasActiveSession() after address change AND verifies session_created in log (not just plan_cancelled). FC4: escalation event detail must be >15 chars (contains proof reason with LSN values, not just "needs_rebuild"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 23:43:48 -07:00
pingqiuandClaude Opus 4.6	cd4b91033f	fix: force failure conditions in P2 tests, add BlockVol.ForceFlush P2 tests now force conditions instead of observing them: FC3: Real WAL scan verified directly — StreamWALEntries transfers real entries from disk (head=5, transferred=5). Engine planning also verified (ZeroGap in V1 interim documented). FC4: ForceFlush advances checkpoint/tail to 20. Replica at 0 is below tail → NeedsRebuild with proof: "gap_beyond_retention: need LSN 1 but tail=20". No early return. FC5: ForceFlush advances checkpoint to 10. Assertive: - replica at checkpoint=10 → ZeroGap (V1 interim) - replica at 0 → NeedsRebuild (below tail, not CatchUp) FC1/FC2: Labeled as integrated engine/storage (control simulated). New: BlockVol.ForceFlush() — triggers synchronous flusher cycle for test use. Advances checkpoint + WAL tail deterministically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 23:07:55 -07:00
pingqiuandClaude Opus 4.6	26bf7bc582	feat: add integrated failure replay tests through real bridge path (Phase 07 P2) 5 failure-class replay tests against real file-backed BlockVol, exercising the full integrated path: bridge adapter → v2bridge reader/pinner → engine planner/executor FC1: Changed-address restart — identity preserved, old plan cancelled, new session created. Log shows plan_cancelled + session_created. FC2: Stale epoch after failover — sessions invalidated at old epoch, new assignment at epoch 2 creates fresh session. Log shows per-replica invalidation. FC3: Real catch-up (pre-checkpoint) — engine classifies from real RetainedHistory, zero-gap in V1 interim (committed=0 before flush). Documents the V1 limitation explicitly. FC4: Unrecoverable gap — after flush, if checkpoint advances, replica behind tail gets NeedsRebuild. Documents that V1 unit test may not advance checkpoint (flusher timing). FC5: Post-checkpoint boundary — replica at checkpoint = zero-gap in V1 interim. Explicitly documents the catch-up collapse boundary. go.mod: added replace directives for sw-block engine + bridge modules. Carry-forward (explicit): - CommittedLSN = CheckpointLSN (V1 interim) - FC3/FC4/FC5 limited by flusher not advancing checkpoint in unit tests - Executor snapshot/full-base/truncate still stubs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 22:54:44 -07:00
pingqiuandClaude Opus 4.6	4aab00b149	feat: add real v2bridge integration tests against file-backed BlockVol 7 tests in weed/storage/blockvol/v2bridge/bridge_test.go: Reader (2 tests): - StatusSnapshot reads real nextLSN, WALCheckpointLSN, flusher state - HeadLSN advances with real writes Pinner (2 tests): - HoldWALRetention: hold tracked, MinWALRetentionFloor reports position, release clears hold - HoldRejectsRecycled: validates against real WAL tail Executor (2 tests): - StreamWALEntries: real ScanFrom reads WAL entries from disk - StreamPartialRange: partial range scan works Stubs (1 test): - TransferSnapshot/TransferFullBase/TruncateWAL return not-implemented All tests use createTestVol (1MB file-backed BlockVol with 256KB WAL). No mock/push adapters — direct real blockvol instances. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 22:22:28 -07:00
pingqiuandClaude Opus 4.6	cfec3bff4a	fix: update contract.go field source docs to match P1 implementation BlockVolState field mapping now matches actual StatusSnapshot(): - WALTailLSN ← super.WALCheckpointLSN (was: flusher.RetentionFloor) - CommittedLSN ← flusher.CheckpointLSN() V1 interim (was: distCommit) - CheckpointTrusted ← super.Validate()==nil (was: superblock.Valid) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 20:44:04 -07:00
pingqiuandClaude Opus 4.6	d5b2a3a345	fix: WALTailLSN is now an LSN boundary, ScanWALEntries uses durable checkpoint Finding 1: WALTailLSN semantic fix - StatusSnapshot().WALTailLSN now reads super.WALCheckpointLSN (an LSN) - Was: wal.Tail() which returns a physical byte offset - Entries with LSN > WALTailLSN are guaranteed in the WAL Finding 2: ScanWALEntries replay-source fix - ScanWALEntries passes super.WALCheckpointLSN as the recycled boundary - Was: flusher.CheckpointLSN() which in V1 equals CommittedLSN - The flusher's live checkpoint may advance in memory, but entries above the durable superblock checkpoint are still physically in the WAL - Normal catch-up (replica at 70, committed at 100) now works because fromLSN=71 > super.WALCheckpointLSN (which is the last persisted checkpoint, not the live flusher state) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 20:26:27 -07:00
pingqiuandClaude Opus 4.6	785a7d7efd	feat: wire real pinner into flusher retention + real WAL scan executor (Phase 07 P1) Pinner wired to real retention: - NewPinner calls vol.SetV2RetentionFloor(p.MinWALRetentionFloor) - Flusher.RetentionFloorFn() / SetRetentionFloorFn() exposed - SetV2RetentionFloor chains with existing shipper retention floor - Holds actually prevent WAL reclaim (not just tracked state) Executor uses real WAL scan: - BlockVol.ScanWALEntries(fromLSN, callback) wraps wal.ScanFrom with real fd, walOffset, checkpointLSN - Executor.StreamWALEntries uses ScanWALEntries (not stub) - Reads real WAL entries, tracks highest LSN scanned CommittedLSN mapping: - Explicitly documented as interim V1 model (committed = checkpointed) - Will diverge when V2 distributed commit separates from local flush Carry-forward: - TransferSnapshot/TransferFullBase/TruncateWAL: stubs (need extent I/O) - Control intent from confirmed failover: deferred Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 20:01:46 -07:00
