seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-30 13:36:23 +00:00

Author	SHA1	Message	Date
pingqiu	bdf20fde71	feat: Phase 12 — production hardening (disturbance, soak, testrunner scenarios) P1 Disturbance: restart/reconnect correctness tests — assignment delivery through real proto → ProcessAssignments, epoch validation on promoted volume, mandatory reconnect assertions P2 Soak: repeated create/failover/recover cycles with end-of-cycle truth checks, runtime hygiene (no stale tasks/entries), steady-state idempotence Testrunner recovery actions + scenarios: - recovery.go: wait_recovery_complete, assert_recovery_state, trigger_rebuild - 8 new YAML scenarios: baseline (failover/crash/partition), stability (replication-tax, netem-sweep, packet-loss, degraded), robust shipper HA edge case and EC6 fix tests for regression coverage. (P3 diagnosability + P4 perf floor committed separately in `643a5a107`) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:26:17 -07:00
pingqiu	bdf83e350e	feat: Phase 11 — product-surface rebinding (snapshot, CSI, publication, restore) P1 Snapshots: CoW snapshot lifecycle through V2 engine path, create/list/delete via master RPC, BaseLSN tracking in manifest, ImportSnapshotForRebuild P2 CSI Lifecycle: masterServerBackend calling real MasterServer in-process, CreateVolume/DeleteVolume/ExpandVolume through CSI → master → VS flow, ExportedControllerServer/ExportedNodeServer for cross-package testing P3 Publication: LookupBlockVolume coherence across failover, iSCSI + NVMe address switching on promotion, repeated lookup self-consistency P4 Restore: RestoreBlockSnapshot RPC through master and volume server, snapshot restore with runtime convergence, epoch/role validation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:25:58 -07:00
pingqiu	3ec8fab2f1	feat: Phase 10 — control-plane closure (identity, convergence, idempotence) Stable identity on wire: - ServerID fields in proto (replica_server_id, server_id on ReplicaAddrMessage) - volumeServerId wired through volume.go → BlockService.SetServerID - Identity derived from canonical server ID, not transport addresses Assignment convergence: - V2 idempotence via lastAppliedAssignment.equals (full replica set comparison) - setupPrimaryReplication/Multi idempotence guards - ProcessAssignments with V2 + V1 dual-path assignment handling Master-driven control loop: - RecoveryManager: serialized cancel-and-drain via done channels - Per-replica heartbeat state reporting (ReplicaShipperStatus) - masterServerBackend: VolumeBackend calling real MasterServer in-process - RestoreBlockSnapshot RPC (master + volume server proto) QA tests (P10 P1-P4): - Identity: ServerID on wire, fail-closed on missing - Convergence: assignment delivery, epoch monotonicity, registry coherence - Idempotence: repeated assignment, multi-replica set comparison - Control loop: integrationMaster + real allocator + proto round-trip Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:25:43 -07:00
pingqiu	c7eb87c587	feat: Phase 09 — V2 execution primitives and production closure Engine execution layer for V2 replication protocol: - RebuildInstaller: full state handoff (dirty map, WAL, superblock, flusher) - TruncateToLSN: exact safety predicate (checkpointLSN == truncateLSN), ErrTruncationUnsafe escalation to NeedsRebuild - SyncReceiverProgress: unconditional Store for post-rebuild alignment - V2StatusSnapshot: CommittedLSN = nextLSN-1 for sync_all V2 bridge real I/O executors: - TransferFullBase: TCP streaming + RebuildInstaller + second catch-up - TransferSnapshot: SHA-256 verified streaming to disk - TruncateWAL: ErrTruncationUnsafe detection + escalation - StreamWALEntries: rebuild-mode TCP apply Engine executor interfaces: - CatchUpIO.TruncateWAL, RebuildIO.TransferFullBase returns achievedLSN - CatchUpExecutor truncation-only skip, NeedsRebuild escalation - RebuildExecutor uses achievedLSN for progress tracking Design docs reorganized: superseded planning docs removed, protocol truths and closure map added. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:25:23 -07:00
pingqiu	643a5a1074	feat: Phase 12 P3+P4 — diagnosability surfaces, perf floor, rollout gates P3: Add explicit bounded read-only diagnosis surfaces for all symptom classes: - FailoverDiagnostic: volume-oriented failover state with per-volume DeferredPromotion/PendingRebuild entries and proper timer lifecycle - PublicationDiagnostic: two-read coherence check (LookupBlockVolume vs registry authority) with computed Coherent verdict - RecoveryDiagnostic: minimal ActiveTasks surface (Path A) - Blocker ledger: 3 diagnosed + 3 unresolved, finite, from actual file - Runbook references only exposed surfaces, no internal state P4: Add bounded performance floor + rollout-gate package: - Engine-local floor measurement with explicit IOPS gates per workload - Cost characterization: WAL 2x write amp, -56% replication tax - Rollout gates with semantic cross-checks against cited evidence (baseline numbers, transport/network matrix, blocker counts) - Launch envelope tightened to actually measured combinations only Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 16:20:22 -07:00
pingqiu	ebe95b6e2e	fix: flusher OOM on multi-block writes + testrunner enhancements Bug: flusher.go:336 allocated make([]byte, entryLen) per dirty block instead of per unique WAL entry. A 4MB WriteLBA creates 1024 dirty map entries (one per 4KB block), all sharing the same WAL offset. The flusher read the full 4MB WAL entry 1024 times into separate buffers: 1024 × 4MB = 4GB per 4MB write → OOM on mkfs.ext4. Root cause: flusher assumed 1:1 dirty-block-to-WAL-entry mapping. WriteLBA supports multi-block writes but the flusher never deduplicated shared WAL offsets. Fix: deduplicate WAL reads by WalOffset in flushOnceLocked(). Multiple dirty blocks from the same WAL entry share one read buffer and one DecodeWALEntry call. Memory: O(WAL_entries × size) not O(blocks × size). For a 4MB write: 4GB → 4MB. Verified on hardware (m01/M02 25Gbps RoCE): - Before: mkfs.ext4 → VS RSS 100MB→25GB → OOM killed - After: mkfs.ext4 → VS RSS 129MB stable, mkfs succeeds - pgbench TPC-B c=4: 1,248 TPS (RF=1, previously blocked by OOM) Tests added: - flusher_test.go: flush_multiblock_shared_wal_read (16 blocks share one WAL offset, flush dedup verified) - flusher_test.go: flush_multiblock_data_correct (3 mixed multi-block writes, all data correct after flush) - test/component/large_write_test.go: 7 component tests (single 4MB, sequential mkfs sim, concurrent, mixed sizes, production volume, flusher throughput 30s sustained) - iscsi/large_write_mem_test.go: 2 iSCSI session memory tests (4MB R2T flow, slow device) Testrunner enhancements (same commit — all tested on hardware): - discover_primary action: maps primary IP → topology node name, supports alt_ips for multi-NIC (RoCE + management) - NodeSpec.AltIPs field for multi-NIC node identification - 5 new YAML scenarios: ec3, ec5, degraded sync_all/best_effort, pgbench - All 13 hardware-verified scenarios PASS Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 14:24:10 -07:00
pingqiu	1497204e81	fix: require CatchUp outcome, true simultaneous overlap, observability assertions HIGH: Changed-address now requires OutcomeCatchUp and fails if not. No more conditional execution — must go through full catch-up chain. MED: Overlapping retention is now true simultaneous overlap: - Hold 1 at LSN T+1, Hold 2 at LSN T+2 — both coexist - MinWALRetentionFloor = T+1 (minimum of two) - Release hold 1 → floor moves to T+2 - Release hold 2 → ActiveHoldCount=0, no floor MED: NeedsRebuild now asserts escalated event in logs. PostCheckpoint now asserts handshake + catch-up execution events. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:55:37 -07:00
pingqiu	77a6e60fa3	feat: add P3 hardening validation — 4 matrix + 2 extra cases (Phase 08) Compact replay matrix on accepted P1/P2 live path: Matrix 1 (ChangedAddress): address change → cancel old plan → new assignment → new recovery → identity preserved → pins released Matrix 2 (StaleEpoch): epoch bump → invalidate → cancel plan → new epoch assignment → new session → pins released Matrix 3 (NeedsRebuild): unrecoverable gap → rebuild assignment → RebuildExecutor(IO=v2bridge) → InSync → pins released Matrix 4 (PostCheckpointBoundary): at committed=ZeroGap, in window=CatchUp via CatchUpExecutor(IO=v2bridge) → pins released Extra 1 (FailoverCycle): epoch 1 → failover → epoch 2 → recovery resumes → InSync. Logs: invalidation + cancellation + new session. Extra 2 (OverlappingRetention): plan1 acquires pins → cancel → plan2 acquires pins → cancel → ActiveHoldCount==0, MinWALRetentionFloor has no holds. Each test verifies all 5 evidence categories: entry truth, engine result, execution result, cleanup, observability Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:46:48 -07:00
pingqiu	08e34e02ae	feat: separate CommittedLSN from CheckpointLSN, close catch-up ONE CHAIN (Phase 08 P2) CommittedLSN separation: - StatusSnapshot().CommittedLSN = nextLSN-1 (WAL head) for sync_all - Was: flusher.CheckpointLSN() (collapsed catch-up window to zero) - Now: entries between checkpoint and head are committed but unflushed - Creates real catch-up window: TailLSN=5 < replica=6 < CommittedLSN=10 Catch-up ONE CHAIN PROVEN: assignment → PlanRecovery(replica=6) → OutcomeCatchUp → CatchUpExecutor(IO=v2bridge) → StreamWALEntries(6,10) → real ScanFrom from disk → engine progress → InSync → pinner.ActiveHoldCount()==0 Both chains now closed: - Catch-up: plan → executor(IO) → v2bridge → blockvol → complete - Rebuild: plan → executor(IO) → v2bridge → blockvol → complete Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:22:23 -07:00
pingqiu	1c178c0853	fix: rename rebuild test to match actual path, use t.Skipf for V1 catch-up limitation HIGH: renamed TestP2_RebuildClosure_FullBase_OneChain → TestP2_RebuildClosure_OneChain. Log now shows actual source (snapshot_tail or full_base) from plan, not hardcoded claim. MED: catch-up test uses t.Skipf when V1 interim prevents OutcomeCatchUp. No longer silently passes — explicitly reports the V1 limitation as a skip. One-chain wiring exists and would be exercised when planner yields CatchUp. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:17:34 -07:00
pingqiu	8b1b6ec1c0	fix: update executor doc comment to reflect P2 implementation status Executor comment now reflects reality: - StreamWALEntries, TransferFullBase, TransferSnapshot: real - TruncateWAL: stub - Implements engine.CatchUpIO and engine.RebuildIO interfaces Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:14:34 -07:00
pingqiu	1578adfba5	fix: wire real v2bridge I/O into engine executors (Phase 08 P2 closure) Engine executors now have IO interfaces for real bridge I/O: - CatchUpExecutor.IO (CatchUpIO): StreamWALEntries - RebuildExecutor.IO (RebuildIO): TransferFullBase, TransferSnapshot, StreamWALEntries (for tail replay) When IO is set, executor calls real bridge I/O during execution. When IO is nil, executor uses caller-supplied progress (test mode). RecoveryPlan.CatchUpStartLSN: bound at plan time for IO bridge. v2bridge.Executor now implements both interfaces: - StreamWALEntries: real ScanFrom - TransferFullBase: validates extent accessible - TransferSnapshot: validates checkpoint accessible Chain tests wire IO: - CatchUpClosure: exec.IO = executor → real WAL scan through engine - RebuildClosure: exec.IO = executor → real transfer through engine This closes the engine → executor → v2bridge → blockvol chain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 15:10:50 -07:00
pingqiu	ec51cfa474	fix: rewrite P2 as one-chain proofs with pin release assertions Rebuild ONE CHAIN (proven): assignment → PlanRebuild → RebuildExecutor.Execute() → v2bridge TransferFullBase → engine complete → InSync → pinner.ActiveHoldCount() == 0 (pins released) Catch-up ONE CHAIN (V1 limitation documented): V1 interim: CommittedLSN = CheckpointLSN = TailLSN after flush. No gap between tail and committed exists. Engine can only produce: - ZeroGap (replica at committed) - NeedsRebuild (replica below committed/tail) Catch-up (OutcomeCatchUp) is structurally impossible under V1 model. Real WAL scan proven separately (P1). Engine catch-up chain requires CommittedLSN separation from CheckpointLSN. Cleanup: CancelPlan → pins released + session invalidated + logged. Observability: sender_added + session_created + connected + escalated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 14:58:00 -07:00
pingqiu	c9671c4e47	feat: integrated execution chain — catch-up + rebuild + cleanup (Phase 08 P2) Live catch-up chain: - Assignment → engine plan → v2bridge WAL scan → blockvol ScanFrom - StreamWALEntries transfers real entries (transferred=5) - V1 interim: engine classifies ZeroGap (committed=0), but WAL scan chain proven mechanically (executor→v2bridge→blockvol→progress) Live rebuild chain (full-base): - ForceFlush advances checkpoint → NeedsRebuild detected - TransferFullBase now real: validates extent accessible at committed LSN - Engine rebuild session: connect → handshake → source select → transfer → complete → InSync Execution cleanup: - CancelPlan releases resources + invalidates session - Log shows plan_cancelled with reason Observability: - sender_added + escalated events explain execution causality - Escalation includes proof reason from RetainedHistory 4 new execution chain tests + TransferFullBase implementation. Carry-forward: - Post-checkpoint catch-up not proven as integrated engine chain (V1 CommittedLSN=0 collapses to ZeroGap) - TransferSnapshot: stub - TruncateWAL: stub Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 14:22:27 -07:00
pingqiu	04bc261f9b	fix: deliver assignment intent to real engine orchestrator, not discard Finding 1: ProcessAssignments now calls v2Orchestrator.ProcessAssignment - BlockService.v2Orchestrator field (RecoveryOrchestrator) - ProcessAssignment result logged at glog V(1) - No more `_ = intent` — engine state actually changes Finding 2: localServerID documented as interim - BlockService.localServerID = listenAddr (transport-shaped) - Field doc explicitly states: INTERIM, should be registry-assigned - Used only for replica/rebuild local identity 3 integration tests (qa_block_v2bridge_test.go): - CreatesEngineSender: ProcessAssignment → engine has sender + session - EpochBump: epoch 1 → invalidate → epoch 2 → new session - AddressChange: same ServerID, different IP → sender preserved, endpoint updated Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 13:38:30 -07:00
pingqiu	46ef79ce35	fix: stable ServerID in assignments, fail-closed on missing identity, wire into ProcessAssignments Finding 1: Identity no longer address-derived - ReplicaAddr.ServerID field added (stable server identity from registry) - BlockVolumeAssignment.ReplicaServerID field added (scalar RF=2 path) - ControlBridge uses ServerID, NOT address, for ReplicaID - Missing ServerID → replica skipped (fail closed), logged Finding 2: Wired into real ProcessAssignments - BlockService.v2Bridge field initialized in StartBlockService - ProcessAssignments converts each assignment via v2Bridge.ConvertAssignment BEFORE existing V1 processing (parallel, not replacing yet) - Logged at glog V(1) Finding 3: Fail-closed on missing identity - Empty ServerID in ReplicaAddrs → replica skipped with log - Empty ReplicaServerID in scalar path → no replica created - Test: MissingServerID_FailsClosed verifies both paths 7 tests: StableServerID, AddressChange_IdentityPreserved, MultiReplica_StableServerIDs, MissingServerID_FailsClosed, EpochFencing_IntegratedPath, RebuildAssignment, ReplicaAssignment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:46:17 -07:00
pingqiu	48b3e1b8c8	feat: add real control delivery bridge from BlockVolumeAssignment (Phase 08 P1) ControlBridge converts real BlockVolumeAssignment (from master heartbeat) into V2 engine AssignmentIntent: - Identity: ReplicaID = <volume-path>/<replica-server-id> - Epoch from real assignment - Role → SessionKind mapping (primary/replica/rebuilding) - Multi-replica support (ReplicaAddrs) with scalar RF=2 fallback Known limitation (documented in test): - extractServerID currently uses address as server ID (matches master registry ReplicaInfo.Server format) - IP change = different server ID in current model - Registry-backed stable server ID deferred 6 new tests: - PrimaryAssignment_StableIdentity: real assignment → stable ID - PrimaryAssignment_MultiReplica: RF=3 multi-replica mapping - AddressChange_SameServerID: documents current identity boundary - EpochFencing_IntegratedPath: epoch 1 → bump → epoch 2 through real assignment conversion + engine - RebuildAssignment: rebuilding role → SessionRebuild - ReplicaAssignment: replica role with local server ID Delivery template: Changed contracts: real BlockVolumeAssignment → engine intent Fail-closed: unknown role returns empty intent Carry-forward: address-based server ID, not registry-backed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 10:35:41 -07:00
pingqiu	cd8bfb21d4	fix: tighten FC1 new-session assertion and FC4 proof-detail check FC1: now asserts HasActiveSession() after address change AND verifies session_created in log (not just plan_cancelled). FC4: escalation event detail must be >15 chars (contains proof reason with LSN values, not just "needs_rebuild"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 23:43:48 -07:00
pingqiu	cd4b91033f	fix: force failure conditions in P2 tests, add BlockVol.ForceFlush P2 tests now force conditions instead of observing them: FC3: Real WAL scan verified directly — StreamWALEntries transfers real entries from disk (head=5, transferred=5). Engine planning also verified (ZeroGap in V1 interim documented). FC4: ForceFlush advances checkpoint/tail to 20. Replica at 0 is below tail → NeedsRebuild with proof: "gap_beyond_retention: need LSN 1 but tail=20". No early return. FC5: ForceFlush advances checkpoint to 10. Assertive: - replica at checkpoint=10 → ZeroGap (V1 interim) - replica at 0 → NeedsRebuild (below tail, not CatchUp) FC1/FC2: Labeled as integrated engine/storage (control simulated). New: BlockVol.ForceFlush() — triggers synchronous flusher cycle for test use. Advances checkpoint + WAL tail deterministically. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 23:07:55 -07:00
pingqiu	26bf7bc582	feat: add integrated failure replay tests through real bridge path (Phase 07 P2) 5 failure-class replay tests against real file-backed BlockVol, exercising the full integrated path: bridge adapter → v2bridge reader/pinner → engine planner/executor FC1: Changed-address restart — identity preserved, old plan cancelled, new session created. Log shows plan_cancelled + session_created. FC2: Stale epoch after failover — sessions invalidated at old epoch, new assignment at epoch 2 creates fresh session. Log shows per-replica invalidation. FC3: Real catch-up (pre-checkpoint) — engine classifies from real RetainedHistory, zero-gap in V1 interim (committed=0 before flush). Documents the V1 limitation explicitly. FC4: Unrecoverable gap — after flush, if checkpoint advances, replica behind tail gets NeedsRebuild. Documents that V1 unit test may not advance checkpoint (flusher timing). FC5: Post-checkpoint boundary — replica at checkpoint = zero-gap in V1 interim. Explicitly documents the catch-up collapse boundary. go.mod: added replace directives for sw-block engine + bridge modules. Carry-forward (explicit): - CommittedLSN = CheckpointLSN (V1 interim) - FC3/FC4/FC5 limited by flusher not advancing checkpoint in unit tests - Executor snapshot/full-base/truncate still stubs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 22:54:44 -07:00
pingqiu	4aab00b149	feat: add real v2bridge integration tests against file-backed BlockVol 7 tests in weed/storage/blockvol/v2bridge/bridge_test.go: Reader (2 tests): - StatusSnapshot reads real nextLSN, WALCheckpointLSN, flusher state - HeadLSN advances with real writes Pinner (2 tests): - HoldWALRetention: hold tracked, MinWALRetentionFloor reports position, release clears hold - HoldRejectsRecycled: validates against real WAL tail Executor (2 tests): - StreamWALEntries: real ScanFrom reads WAL entries from disk - StreamPartialRange: partial range scan works Stubs (1 test): - TransferSnapshot/TransferFullBase/TruncateWAL return not-implemented All tests use createTestVol (1MB file-backed BlockVol with 256KB WAL). No mock/push adapters — direct real blockvol instances. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 22:22:28 -07:00
pingqiu	d5b2a3a345	fix: WALTailLSN is now an LSN boundary, ScanWALEntries uses durable checkpoint Finding 1: WALTailLSN semantic fix - StatusSnapshot().WALTailLSN now reads super.WALCheckpointLSN (an LSN) - Was: wal.Tail() which returns a physical byte offset - Entries with LSN > WALTailLSN are guaranteed in the WAL Finding 2: ScanWALEntries replay-source fix - ScanWALEntries passes super.WALCheckpointLSN as the recycled boundary - Was: flusher.CheckpointLSN() which in V1 equals CommittedLSN - The flusher's live checkpoint may advance in memory, but entries above the durable superblock checkpoint are still physically in the WAL - Normal catch-up (replica at 70, committed at 100) now works because fromLSN=71 > super.WALCheckpointLSN (which is the last persisted checkpoint, not the live flusher state) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 20:26:27 -07:00
pingqiu	785a7d7efd	feat: wire real pinner into flusher retention + real WAL scan executor (Phase 07 P1) Pinner wired to real retention: - NewPinner calls vol.SetV2RetentionFloor(p.MinWALRetentionFloor) - Flusher.RetentionFloorFn() / SetRetentionFloorFn() exposed - SetV2RetentionFloor chains with existing shipper retention floor - Holds actually prevent WAL reclaim (not just tracked state) Executor uses real WAL scan: - BlockVol.ScanWALEntries(fromLSN, callback) wraps wal.ScanFrom with real fd, walOffset, checkpointLSN - Executor.StreamWALEntries uses ScanWALEntries (not stub) - Reads real WAL entries, tracks highest LSN scanned CommittedLSN mapping: - Explicitly documented as interim V1 model (committed = checkpointed) - Will diverge when V2 distributed commit separates from local flush Carry-forward: - TransferSnapshot/TransferFullBase/TruncateWAL: stubs (need extent I/O) - Control intent from confirmed failover: deferred Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 20:01:46 -07:00
pingqiu	c00c9e3e3d	feat: add real BlockVolPinner + BlockVolExecutor in v2bridge (Phase 07 P1) Pinner (pinner.go): - HoldWALRetention: validates startLSN >= current tail, tracks hold - HoldSnapshot: validates checkpoint exists + trusted - HoldFullBase: tracks hold by ID - MinWALRetentionFloor: returns minimum held position across all WAL/snapshot holds — designed for flusher RetentionFloorFn hookup - Release functions remove holds from tracking map Executor (executor.go): - StreamWALEntries: validates range against real WAL tail/head (actual ScanFrom integration deferred to network-layer wiring) - TransferSnapshot/TransferFullBase/TruncateWAL: stubs for P1 Key integration points: - Pinner reads real StatusSnapshot for validation - Pinner.MinWALRetentionFloor can wire into flusher.RetentionFloorFn - Executor validates WAL range availability from real state Carry-forward: - Real ScanFrom wiring needs WAL fd + offset (network layer) - TransferSnapshot/TransferFullBase need extent I/O - Control intent from confirmed failover (master-side) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 19:54:24 -07:00
pingqiu	d5ecf471fe	feat: real blockvol integration — StatusSnapshot + v2bridge reader + contract interfaces (Phase 07 P1) Real blockvol integration: - BlockVol.StatusSnapshot() reads actual fields: WALHeadLSN ← nextLSN-1, WALTailLSN ← wal.Tail(), CommittedLSN ← flusher.CheckpointLSN(), CheckpointLSN ← super.WALCheckpointLSN, CheckpointTrusted ← super.Validate()==nil weed/storage/blockvol/v2bridge/: - Reader wraps real BlockVol, implements ReadState() → BlockVolState - Lives in weed/ module (can import blockvol directly) sw-block/bridge/blockvol/ contract interfaces: - BlockVolReader: ReadState() (weed-side implements) - BlockVolPinner: HoldWALRetention/HoldSnapshot/HoldFullBase → release func - BlockVolExecutor: StreamWALEntries/TransferSnapshot/TransferFullBase/TruncateWAL - StorageAdapter refactored to consume interfaces (not push-based) - PushStorageAdapter for tests Handoff boundary (E5): - sw-block/ defines contracts, weed/ implements them - sw-block/ does NOT import weed/ - No cross-module circular dependency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 18:17:59 -07:00
pingqiu	abbc8bff2b	fix: canonicalize host in AllocateBlockVolumeResponse (CP13-2 follow-up) AllocateBlockVolumeResponse used bs.ListenAddr() to derive replica addresses. When the VS binds to ":port" (no explicit IP), host resolved to empty string, producing ":dataPort" as the replica address. This ":port" propagated through master assignments to both primary and replica sides. Now canonicalizes empty/wildcard host using PreferredOutboundIP() before constructing replication addresses. Also exported PreferredOutboundIP for use by the server package. This is the source fix — all downstream paths (heartbeat, API response, assignment) inherit the canonical address. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 19:16:45 -07:00
pingqiu	ae87a31d22	fix: store canonical replica addresses in heartbeat state setupReplicaReceiver now reads back canonical addresses from the ReplicaReceiver (which applies CP13-2 canonicalization) instead of storing raw assignment addresses in replStates. This fixes the API-level leak where replica_data_addr showed ":port" instead of "ip:port" in /block/volumes responses, even though the engine-level CP13-2 fix was working. New BlockVol.ReplicaReceiverAddr() returns canonical addresses from the running receiver. Falls back to assignment addresses if receiver didn't report. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 19:08:48 -07:00
pingqiu	aa4688d5d5	fix: sync flusher checkpointLSN after rebuild (CP13-7) rebuildFullExtent updated superblock.WALCheckpointLSN but not the flusher's internal checkpointLSN. NewReplicaReceiver then read stale 0 from flusher.CheckpointLSN(), causing post-rebuild flushedLSN to be wrong. Added Flusher.SetCheckpointLSN() and call it after rebuild superblock persist. TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint flips FAIL→PASS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 17:22:55 -07:00
pingqiu	4ed54d04ba	fix: close leaked replica in TestShip_DegradedDoesNotSilently The test used createSyncAllPair(t) but discarded the replica return value, leaving the volume file open. On Windows this caused TempDir cleanup failure. All 7 CP13-1 baseline FAILs now PASS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 16:54:05 -07:00
pingqiu	3e9358f2be	feat: rebuild fallback with per-replica heartbeat state (CP13-7) Adds per-replica state reporting in heartbeat so master can identify which specific replica needs rebuild, not just a volume-level boolean. New ReplicaShipperStatus{DataAddr, State, FlushedLSN} type reported via ReplicaShipperStates field on BlockVolumeInfoMessage. Populated from ShipperGroup.ShipperStates() on each heartbeat. Scales to RF=3+. V1 constraints (explicit): - NeedsRebuild cleared only by control-plane reassignment (no local exit) - Post-rebuild replica re-enters as Disconnected/bootstrap, not InSync - flushedLSN = checkpointLSN after rebuild (durable baseline only) 4 new tests: heartbeat per-replica state, NeedsRebuild reporting, rebuild-complete-reenters-InSync (full cycle), epoch mismatch abort. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 16:46:31 -07:00
Ping Qiu	47f0111cae	feat: replica-aware WAL retention (CP13-6) Flusher now holds WAL entries needed by recoverable replicas. Both AdvanceTail (physical space) and checkpointLSN (scan gate) are gated by the minimum flushed LSN across catch-up-eligible replicas. New methods on ShipperGroup: - MinRecoverableFlushedLSN() (uint64, bool): pure read, returns min flushed LSN across InSync/Degraded/Disconnected/CatchingUp replicas with known progress. Excludes NeedsRebuild. - EvaluateRetentionBudgets(timeout): separate mutation step, escalates replicas that exceed walRetentionTimeout (5m default) to NeedsRebuild, releasing their WAL hold. Flusher integration: evaluates budgets then queries floor on each flush cycle. If floor < maxLSN, holds both checkpoint and tail. Extent writes proceed normally (reads work), only WAL reclaim is deferred. LastContactTime on WALShipper: updated on barrier success, handshake success, and catch-up completion. Not on Ship (TCP write only). Avoids misclassifying idle-but-healthy replicas. CP13-6 ships with timeout budget only. walRetentionMaxBytes is deferred (documented as partial slice). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 22:04:23 -07:00
Ping Qiu	9e481a83e9	fix: serialize LSN allocation + shipping with shipMu Concurrent WriteLBA/Trim calls could deliver WAL entries to replicas out of LSN order: two goroutines allocate LSN 4 and 5 concurrently, but LSN 5 could reach the replica first via ShipAll, causing the replica to reject it as an LSN gap. shipMu now wraps nextLSN.Add + wal.Append + ShipAll in both WriteLBA and Trim, guaranteeing LSN-ordered delivery to replicas under concurrent writers. The dirty map update and WAL pressure check happen after shipMu is released — they don't need ordering guarantees. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 16:33:42 -07:00
Ping Qiu	4429f2b8d2	fix: use handshake-reported flushedLSN for catch-up, fix receiver init doReconnectAndCatchUp() now uses the replicaFlushedLSN returned by the reconnect handshake as the catch-up start point, not the shipper's stale cached value. The replica may have less durable progress than the shipper last knew. ReplicaReceiver initialization: flushedLSN now set from the volume's checkpoint LSN (durable by definition), not nextLSN (which includes unflushed entries). receivedLSN still uses nextLSN-1 since those entries are in the WAL buffer even if not yet synced. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 15:54:23 -07:00
Ping Qiu	24de2cea2a	fix: refactor reconnect tests to preserve shipper identity (CP13-5) Updated 3 reconnect tests to stop/restart the ReplicaReceiver on the same addresses WITHOUT calling SetReplicaAddr. This preserves the shipper object, its ReplicaFlushedLSN, HasFlushedProgress flag, and catch-up state across the disconnect/reconnect cycle. All 3 tests now PASS: - TestReconnect_CatchupFromRetainedWal - CatchupReplay_DataIntegrity_AllBlocksMatch - CatchupReplay_DuplicateEntry_Idempotent Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 15:46:02 -07:00
Ping Qiu	548e47e482	feat: reconnect handshake + WAL catch-up protocol (CP13-5) Adds the sync_all reconnect protocol: when a degraded shipper reconnects, it performs a handshake (ResumeShipReq/Resp) to determine the replica's durable progress, then streams missed WAL entries to close the gap before resuming live shipping. New wire messages: - MsgResumeShipReq (0x03): primary sends epoch, headLSN, retainStart - MsgResumeShipResp (0x04): replica returns status + flushedLSN - MsgCatchupDone (0x05): marks end of catch-up stream Decision matrix after handshake: - R == H: already caught up → InSync - S <= R+1 <= H: recoverable gap → CatchingUp → stream → InSync - R+1 < S: gap exceeds retained WAL → NeedsRebuild - R > H: impossible progress → NeedsRebuild WALAccess interface: narrow abstraction (RetainedRange + StreamEntries) avoids coupling shipper to raw WAL internals. Bootstrap vs reconnect split: fresh shippers (HasFlushedProgress=false) use CP13-4 bootstrap path. Previously-synced shippers use handshake. Catch-up retry budget: maxCatchupRetries=3 before NeedsRebuild. ReplicaReceiver now initializes receivedLSN/flushedLSN from volume's nextLSN on construction (handles receiver restart on existing volume). TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers flips FAIL→PASS. All previously-passing baseline tests remain green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 15:38:06 -07:00
Ping Qiu	8d6379f841	feat: replica state machine + barrier eligibility gating (CP13-4) Replaces binary degraded flag with ReplicaState type: Disconnected, Connecting, CatchingUp, InSync, Degraded, NeedsRebuild. Ship() allowed from Disconnected (bootstrap: data must flow before first barrier) and InSync (steady state). Ship does NOT change state. Barrier() gating: - InSync: proceed normally - Disconnected: bootstrap path (connect + barrier) - Degraded: reconnect both data+ctrl connections, then barrier - Connecting/CatchingUp/NeedsRebuild: rejected immediately Only barrier success grants InSync. Reconnect alone does not. IsDegraded() now means "not sync-eligible" (any non-InSync state). InSyncCount() added to ShipperGroup. dist_group_commit.go: removed AllDegraded short-circuit that prevented bootstrap. Barrier attempts always run — individual shippers handle their own state-based gating. 8 CP13-4 tests + TestBarrier_RejectsReplicaNotInSync flips FAIL→PASS. All previously-passing baseline tests remain green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 02:39:32 -07:00
Ping Qiu	499e244b8e	feat: durable progress truth — replicaFlushedLSN in barrier (CP13-3) Barrier response extended from 1-byte status to 9-byte payload carrying the replica's durable WAL progress (FlushedLSN). Updated only after successful fd.Sync(), never on receive/append/send. Replica side: new flushedLSN field on ReplicaReceiver, advanced only in handleBarrier after proven contiguous receipt + sync. max() guard prevents regression. Shipper side: new replicaFlushedLSN (authoritative) replacing ShippedLSN (diagnostic only). Monotonic CAS update from barrier response. hasFlushedProgress flag tracks whether replica supports the extended protocol. ShipperGroup: MinReplicaFlushedLSN() returns (uint64, bool) — minimum across shippers with known progress. (0, false) for empty groups or legacy replicas. Backward compat: 1-byte legacy responses decoded as FlushedLSN=0. Legacy replicas explicitly excluded from sync_all correctness. 7 new tests: roundtrip, backward compat, flush-only-after-sync, not-on-receive, shipper update, monotonicity, group minimum. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 01:52:35 -07:00
Ping Qiu	4f3edffb0a	fix: canonical replica address resolution (CP13-2) ReplicaReceiver.DataAddr()/CtrlAddr() now return canonical ip:port instead of raw listener addresses that may be wildcard (:port, 0.0.0.0:port, [::]:port). New canonicalizeListenerAddr() resolves wildcard IPs using the provided advertised host (from VS listen address). Falls back to outbound-IP detection when no advertised host is available. NewReplicaReceiver accepts optional advertisedHost parameter for multi-NIC correctness. In production, the assignment path already provides canonical addresses; this fix ensures test patterns with :0 bind also produce routable addresses. 7 new tests. TestBug3_ReplicaAddr_MustBeIPPort_WildcardBind flips from FAIL to PASS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 01:38:55 -07:00
Ping Qiu	c263d082b5	fix: restart reconciliation — trust roles, upsert replicas Same-epoch reconciliation now trusts reported roles first: - one claims primary, other replica → trust roles - both claim primary → WALHeadLSN heuristic tiebreak - both claim replica → keep existing, log ambiguity Replaced addServerAsReplica with upsertServerAsReplica: checks for existing replica entry by server name before appending. Prevents duplicate ReplicaInfo rows during restart/replay windows. 2 new tests: role-trusted same-epoch, duplicate replica prevention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 01:24:53 -07:00
Ping Qiu	9137fa6486	fix: epoch-based reconciliation on master restart reconstruction When a second server reports the same volume during master restart, UpdateFullHeartbeat now uses epoch-based tie-breaking instead of first-heartbeat-wins: 1. Higher epoch wins as primary — old entry demoted to replica 2. Same epoch — higher WALHeadLSN wins (heuristic, warning logged) 3. Lower epoch — added as replica Applied in both code paths: the auto-register branch (no entry exists yet for this name) and the unlinked-server branch (entry exists but this server is not in it). This is a deterministic reconstruction improvement, not ground truth. The long-term fix is persisting authoritative volume state. 5 new tests covering all reconciliation scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 01:17:51 -07:00
Ping Qiu	a9a5e455c6	fix: Lookup/ListAll return copies, add UpdateEntry for safe mutation Lookup() and ListAll() now return value copies (not pointers to internal registry state). Callers can no longer mutate registry entries without holding a lock. Added clone() on BlockVolumeEntry with deep-copied Replicas slice. Added UpdateEntry(name, func(*BlockVolumeEntry)) for locked mutation. ListByServer() also returns copies. Migrated 1 production mutation (ReplicaPlacement + Preset in create handler) and ~20 test mutations to use UpdateEntry. 5 new copy-correctness tests: Lookup returns copy, Replicas slice isolated, ListAll returns copies, UpdateEntry mutates, UpdateEntry not-found error. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 01:00:27 -07:00
Ping Qiu	e8c921d9e8	fix: remove nil-optional superMu pattern, require in all FlusherConfigs superMu is mandatory for correctness — all superblock mutation+persist must be serialized. Remove the nil guard in updateSuperblockCheckpoint and add SuperMu to all 7 test FlusherConfig sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 00:19:25 -07:00
Ping Qiu	3ddb87adc9	fix: superblock write coordination (superMu) + remove debug logs Adds sync.Mutex (superMu) to BlockVol, shared between group commit's syncWithWALProgress() and flusher's updateSuperblockCheckpoint(). Both paths now serialize superblock mutation + persist, preventing WALTail/WALCheckpointLSN regression when flusher and group commit write the full superblock concurrently. persistSuperblock() also guarded for consistency. Removes temporary log.Printf lines in the open/recovery path that were added during BUG-RESTART-ZEROS investigation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 00:09:14 -07:00
Ping Qiu	e92263b4f4	fix: ioMu data-plane exclusion for restore/import/expand Adds sync.RWMutex (ioMu) to BlockVol enforcing mutual exclusion between normal I/O and destructive state operations. Shared (RLock): WriteLBA, ReadLBA, Trim, SyncCache, replica applyEntry, rebuild applyRebuildEntry — concurrent I/O safe. Exclusive (Lock): RestoreSnapshot, ImportSnapshot, Expand, PrepareExpand, CommitExpand, CancelExpand — drains all in-flight I/O before modifying extent/WAL/dirtyMap. Scope rule: RLock covers local data-structure mutation only. Replication shipping is asynchronous and outside the lock, so exclusive holders block only behind local I/O, not network stalls. Lock ordering: ioMu > snapMu > assignMu > mu. Closes the critical ER item: restore/import vs concurrent WriteLBA silent data corruption gap. 3 new tests: concurrent writes allowed, real restore-vs-write contention with data integrity check, close coordination. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 20:40:41 -07:00
Ping Qiu	bb691a5458	feat: CP11B-4 observability pack — health state, alerts, dashboard Health-state derivation: deriveHealthStateWithLiveness() computes per-volume state (unsafe > rebuilding > degraded > healthy) using role, replica count, durability mode, degraded flag, and primary server liveness. Used consistently in both volume responses and cluster summary. Extended GET /block/status with health counts (healthy, degraded, rebuilding, unsafe) and NVMe-capable server count. Response is now typed BlockStatusResponse instead of untyped map. Default alert pack: 7 Prometheus rules covering WAL pressure, flusher errors, replica degradation, rebuilding, scrub errors. Alert rules reference real seaweedfs_blockvol_* metric names. Default dashboard: Grafana JSON with 17 panels — cluster health, IOPS, latency P99, WAL pressure, flusher throughput, replication, scrub, dirty map, epoch. 17 tests: 9 health derivation, 1 cluster summary, 2 handler/API, 2 alert validation, 2 dashboard validation, 1 liveness parity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 02:12:42 -07:00
Ping Qiu	f501c63009	feat: CP11B-2 explainable placement / plan API New POST /block/volume/plan endpoint returns full placement preview: resolved policy, ordered candidate list, selected primary/replicas, and per-server rejection reasons with stable string constants. Core design: evaluateBlockPlacement() is a pure function with no registry/topology dependency. gatherPlacementCandidates() is the single topology bridge point. Plan and create share the same planner — parity contract is same ordered candidate list for same cluster state. Create path refactored: uses evaluateBlockPlacement() instead of PickServer(), iterates all candidates (no 3-retry cap), recomputes replica order after primary fallback. rf_not_satisfiable severity is durability-mode-aware (warning for best_effort, error for strict). 15 unit tests + 20 QA adversarial tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 02:12:25 -07:00
Ping Qiu	683969086c	feat: CP11B-1 provisioning presets + review fixes Preset system: ResolvePolicy resolves named presets (database, general, throughput) with per-field overrides into concrete volume parameters. Create path now uses resolved policy instead of ad-hoc validation. New /block/volume/resolve diagnostic endpoint for dry-run resolution. Review fix 1 (MED): HasNVMeCapableServer now derives NVMe capability from server-level heartbeat attribute (block_nvme_addr proto field) instead of scanning volume entries. Fixes false "no NVMe" warning on fresh clusters with NVMe-capable servers but no volumes yet. Review fix 2 (LOW): /block/volume/resolve no longer proxied to leader — read-only diagnostic endpoint can be served by any master. Engine fix: ReadLBA retry loop closes stale dirty-map race when WAL entry is recycled between lookup and read. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 14:44:24 -07:00
Ping Qiu	075ff52219	feat: CP11B-3 safe ops — promotion hardening, preflight, manual promote Six-task checkpoint hardening the promotion and failover paths: T1: 4-gate candidate evaluation (heartbeat freshness, WAL lag, role, server liveness) with structured rejection reasons. T2: Orphaned-primary re-evaluation on replica reconnect (B-06/B-08). T3: Deferred timer safety — epoch validation prevents stale timers from firing on recreated/changed volumes (B-07). T4: Rebuild addr cleanup on promotion (B-11), NVMe publication refresh on heartbeat, and preflight endpoint wiring. T5: Manual promote API — POST /block/volume/{name}/promote with force flag, target server selection, and structured rejection response. Shared applyPromotionLocked/finalizePromotion helpers eliminate duplication between auto and manual paths. T6: Read-only preflight endpoint (GET /block/volume/{name}/preflight) and blockapi client wrappers (Preflight, Promote). BUG-T5-1: PromotionsTotal counter moved to finalizePromotion (shared by both auto and manual paths) to prevent metrics divergence. 24 files changed, ~6500 lines added. 42 new QA adversarial tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 17:21:17 -07:00
Ping Qiu	ed11a09a61	fix: CP11A-4 snapshot export/import safety — 3 bugs from review BUG-CP11A4-1 (HIGH): ImportSnapshot now rejects when active snapshots exist. Import overwrites the extent region that non-CoW'd snapshot blocks read from, which would silently return import data instead of snapshot-time data. New ErrImportActiveSnapshots error and snapMu-guarded check. BUG-CP11A4-2 (HIGH): Double import without AllowOverwrite now correctly rejected. Import bypasses WAL so nextLSN stays at 1; added FlagImported (Superblock.Flags bit 0) set after successful import and checked alongside nextLSN in the non-empty gate. BUG-CP11A4-3 (MED): Replaced fixed exportTempSnapID (0xFFFFFFFE) with atomic sequence counter (exportTempSnapBase + exportTempSnapSeq). Each auto-export gets a unique temp snapshot ID, preventing concurrent export races and user snapshot ID collisions. Also added beginOp()/endOp() lifecycle guards to both ExportSnapshot and ImportSnapshot, and documented the non-atomic import failure semantics. 5 new regression tests + QA-EX-3 rewritten for rejection behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 10:56:18 -07:00
Ping Qiu	7cc6467d09	feat: CP11A-4 snapshot export/import to S3 — artifact format, engine, and transport Add crash-consistent snapshot export/import for single-profile block volumes. Export creates a temp snapshot, streams the full volume image with inline SHA-256, and uploads to S3. Import validates manifest + checksum and writes directly to extent region. Admin HTTP endpoints /export and /import added to the standalone iscsi-target binary. Engine: snapshot_export.go (manifest types, ExportSnapshot, ImportSnapshot) S3: snapshot_s3.go (AWS SDK v1 transport, pipe-based streaming upload) Tests: 14 engine + 9 QA adversarial = 23 new tests, all passing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 00:15:27 -07:00
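
Note on 548e47e482 (CP13-5): the post-handshake decision matrix in that commit message can be read as a small pure function. The following is only a minimal sketch under stated assumptions, not the code in the repository. The state names InSync, CatchingUp and NeedsRebuild come from the commit messages; the function name decideReconnect, its parameter names, and the order in which the four cases are checked are assumptions made for illustration. R is the replica's flushedLSN, H is the primary's headLSN, and S is retainStart.

```go
package main

import "fmt"

// ReplicaState mirrors three of the state names listed in the CP13-4/CP13-5
// commit messages. The type as written here is illustrative only.
type ReplicaState int

const (
	InSync ReplicaState = iota
	CatchingUp
	NeedsRebuild
)

func (s ReplicaState) String() string {
	switch s {
	case InSync:
		return "InSync"
	case CatchingUp:
		return "CatchingUp"
	default:
		return "NeedsRebuild"
	}
}

// decideReconnect is a hypothetical helper that applies the decision matrix:
//
//	R > H        impossible progress         -> NeedsRebuild
//	R == H       already caught up           -> InSync
//	R+1 < S      gap exceeds retained WAL    -> NeedsRebuild
//	S <= R+1 <= H  recoverable gap           -> CatchingUp
//
// Checking R > H and R == H first guarantees R < H afterwards,
// so R+1 cannot overflow uint64.
func decideReconnect(r, h, s uint64) ReplicaState {
	switch {
	case r > h:
		return NeedsRebuild
	case r == h:
		return InSync
	case r+1 < s:
		return NeedsRebuild
	default:
		return CatchingUp
	}
}

func main() {
	fmt.Println(decideReconnect(10, 10, 4)) // InSync
	fmt.Println(decideReconnect(7, 10, 4))  // CatchingUp
	fmt.Println(decideReconnect(2, 10, 4))  // NeedsRebuild
	fmt.Println(decideReconnect(12, 10, 4)) // NeedsRebuild
}
```

In this sketch CatchingUp only names the state entered before streaming. Per the commit message, the shipper then streams the missed WAL entries and only afterwards returns to InSync, with a retry budget (maxCatchupRetries=3) that is not modeled here.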

8455 Commits