seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-22 09:41:28 +00:00

Author	SHA1	Message	Date
pingqiu	abbc8bff2b	fix: canonicalize host in AllocateBlockVolumeResponse (CP13-2 follow-up) AllocateBlockVolumeResponse used bs.ListenAddr() to derive replica addresses. When the VS binds to ":port" (no explicit IP), host resolved to empty string, producing ":dataPort" as the replica address. This ":port" propagated through master assignments to both primary and replica sides. Now canonicalizes empty/wildcard host using PreferredOutboundIP() before constructing replication addresses. Also exported PreferredOutboundIP for use by the server package. This is the source fix — all downstream paths (heartbeat, API response, assignment) inherit the canonical address. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 19:16:45 -07:00
pingqiu	ae87a31d22	fix: store canonical replica addresses in heartbeat state setupReplicaReceiver now reads back canonical addresses from the ReplicaReceiver (which applies CP13-2 canonicalization) instead of storing raw assignment addresses in replStates. This fixes the API-level leak where replica_data_addr showed ":port" instead of "ip:port" in /block/volumes responses, even though the engine-level CP13-2 fix was working. New BlockVol.ReplicaReceiverAddr() returns canonical addresses from the running receiver. Falls back to assignment addresses if receiver didn't report. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 19:08:48 -07:00
pingqiu	aa4688d5d5	fix: sync flusher checkpointLSN after rebuild (CP13-7) rebuildFullExtent updated superblock.WALCheckpointLSN but not the flusher's internal checkpointLSN. NewReplicaReceiver then read stale 0 from flusher.CheckpointLSN(), causing post-rebuild flushedLSN to be wrong. Added Flusher.SetCheckpointLSN() and call it after rebuild superblock persist. TestRebuild_PostRebuild_FlushedLSN_IsCheckpoint flips FAIL→PASS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 17:22:55 -07:00
pingqiu	4ed54d04ba	fix: close leaked replica in TestShip_DegradedDoesNotSilently The test used createSyncAllPair(t) but discarded the replica return value, leaving the volume file open. On Windows this caused TempDir cleanup failure. All 7 CP13-1 baseline FAILs now PASS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 16:54:05 -07:00
pingqiu	3e9358f2be	feat: rebuild fallback with per-replica heartbeat state (CP13-7) Adds per-replica state reporting in heartbeat so master can identify which specific replica needs rebuild, not just a volume-level boolean. New ReplicaShipperStatus{DataAddr, State, FlushedLSN} type reported via ReplicaShipperStates field on BlockVolumeInfoMessage. Populated from ShipperGroup.ShipperStates() on each heartbeat. Scales to RF=3+. V1 constraints (explicit): - NeedsRebuild cleared only by control-plane reassignment (no local exit) - Post-rebuild replica re-enters as Disconnected/bootstrap, not InSync - flushedLSN = checkpointLSN after rebuild (durable baseline only) 4 new tests: heartbeat per-replica state, NeedsRebuild reporting, rebuild-complete-reenters-InSync (full cycle), epoch mismatch abort. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 16:46:31 -07:00
Ping Qiu	47f0111cae	feat: replica-aware WAL retention (CP13-6) Flusher now holds WAL entries needed by recoverable replicas. Both AdvanceTail (physical space) and checkpointLSN (scan gate) are gated by the minimum flushed LSN across catch-up-eligible replicas. New methods on ShipperGroup: - MinRecoverableFlushedLSN() (uint64, bool): pure read, returns min flushed LSN across InSync/Degraded/Disconnected/CatchingUp replicas with known progress. Excludes NeedsRebuild. - EvaluateRetentionBudgets(timeout): separate mutation step, escalates replicas that exceed walRetentionTimeout (5m default) to NeedsRebuild, releasing their WAL hold. Flusher integration: evaluates budgets then queries floor on each flush cycle. If floor < maxLSN, holds both checkpoint and tail. Extent writes proceed normally (reads work), only WAL reclaim is deferred. LastContactTime on WALShipper: updated on barrier success, handshake success, and catch-up completion. Not on Ship (TCP write only). Avoids misclassifying idle-but-healthy replicas. CP13-6 ships with timeout budget only. walRetentionMaxBytes is deferred (documented as partial slice). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 22:04:23 -07:00
Ping Qiu	9e481a83e9	fix: serialize LSN allocation + shipping with shipMu Concurrent WriteLBA/Trim calls could deliver WAL entries to replicas out of LSN order: two goroutines allocate LSN 4 and 5 concurrently, but LSN 5 could reach the replica first via ShipAll, causing the replica to reject it as an LSN gap. shipMu now wraps nextLSN.Add + wal.Append + ShipAll in both WriteLBA and Trim, guaranteeing LSN-ordered delivery to replicas under concurrent writers. The dirty map update and WAL pressure check happen after shipMu is released — they don't need ordering guarantees. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 16:33:42 -07:00
Ping Qiu	4429f2b8d2	fix: use handshake-reported flushedLSN for catch-up, fix receiver init doReconnectAndCatchUp() now uses the replicaFlushedLSN returned by the reconnect handshake as the catch-up start point, not the shipper's stale cached value. The replica may have less durable progress than the shipper last knew. ReplicaReceiver initialization: flushedLSN now set from the volume's checkpoint LSN (durable by definition), not nextLSN (which includes unflushed entries). receivedLSN still uses nextLSN-1 since those entries are in the WAL buffer even if not yet synced. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 15:54:23 -07:00
Ping Qiu	24de2cea2a	fix: refactor reconnect tests to preserve shipper identity (CP13-5) Updated 3 reconnect tests to stop/restart the ReplicaReceiver on the same addresses WITHOUT calling SetReplicaAddr. This preserves the shipper object, its ReplicaFlushedLSN, HasFlushedProgress flag, and catch-up state across the disconnect/reconnect cycle. All 3 tests now PASS: - TestReconnect_CatchupFromRetainedWal - CatchupReplay_DataIntegrity_AllBlocksMatch - CatchupReplay_DuplicateEntry_Idempotent Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 15:46:02 -07:00
Ping Qiu	548e47e482	feat: reconnect handshake + WAL catch-up protocol (CP13-5) Adds the sync_all reconnect protocol: when a degraded shipper reconnects, it performs a handshake (ResumeShipReq/Resp) to determine the replica's durable progress, then streams missed WAL entries to close the gap before resuming live shipping. New wire messages: - MsgResumeShipReq (0x03): primary sends epoch, headLSN, retainStart - MsgResumeShipResp (0x04): replica returns status + flushedLSN - MsgCatchupDone (0x05): marks end of catch-up stream Decision matrix after handshake: - R == H: already caught up → InSync - S <= R+1 <= H: recoverable gap → CatchingUp → stream → InSync - R+1 < S: gap exceeds retained WAL → NeedsRebuild - R > H: impossible progress → NeedsRebuild WALAccess interface: narrow abstraction (RetainedRange + StreamEntries) avoids coupling shipper to raw WAL internals. Bootstrap vs reconnect split: fresh shippers (HasFlushedProgress=false) use CP13-4 bootstrap path. Previously-synced shippers use handshake. Catch-up retry budget: maxCatchupRetries=3 before NeedsRebuild. ReplicaReceiver now initializes receivedLSN/flushedLSN from volume's nextLSN on construction (handles receiver restart on existing volume). TestBug2_SyncAll_SyncCache_AfterDegradedShipperRecovers flips FAIL→PASS. All previously-passing baseline tests remain green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 15:38:06 -07:00
Ping Qiu	8d6379f841	feat: replica state machine + barrier eligibility gating (CP13-4) Replaces binary degraded flag with ReplicaState type: Disconnected, Connecting, CatchingUp, InSync, Degraded, NeedsRebuild. Ship() allowed from Disconnected (bootstrap: data must flow before first barrier) and InSync (steady state). Ship does NOT change state. Barrier() gating: - InSync: proceed normally - Disconnected: bootstrap path (connect + barrier) - Degraded: reconnect both data+ctrl connections, then barrier - Connecting/CatchingUp/NeedsRebuild: rejected immediately Only barrier success grants InSync. Reconnect alone does not. IsDegraded() now means "not sync-eligible" (any non-InSync state). InSyncCount() added to ShipperGroup. dist_group_commit.go: removed AllDegraded short-circuit that prevented bootstrap. Barrier attempts always run — individual shippers handle their own state-based gating. 8 CP13-4 tests + TestBarrier_RejectsReplicaNotInSync flips FAIL→PASS. All previously-passing baseline tests remain green. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 02:39:32 -07:00
Ping Qiu	499e244b8e	feat: durable progress truth — replicaFlushedLSN in barrier (CP13-3) Barrier response extended from 1-byte status to 9-byte payload carrying the replica's durable WAL progress (FlushedLSN). Updated only after successful fd.Sync(), never on receive/append/send. Replica side: new flushedLSN field on ReplicaReceiver, advanced only in handleBarrier after proven contiguous receipt + sync. max() guard prevents regression. Shipper side: new replicaFlushedLSN (authoritative) replacing ShippedLSN (diagnostic only). Monotonic CAS update from barrier response. hasFlushedProgress flag tracks whether replica supports the extended protocol. ShipperGroup: MinReplicaFlushedLSN() returns (uint64, bool) — minimum across shippers with known progress. (0, false) for empty groups or legacy replicas. Backward compat: 1-byte legacy responses decoded as FlushedLSN=0. Legacy replicas explicitly excluded from sync_all correctness. 7 new tests: roundtrip, backward compat, flush-only-after-sync, not-on-receive, shipper update, monotonicity, group minimum. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 01:52:35 -07:00
Ping Qiu	4f3edffb0a	fix: canonical replica address resolution (CP13-2) ReplicaReceiver.DataAddr()/CtrlAddr() now return canonical ip:port instead of raw listener addresses that may be wildcard (:port, 0.0.0.0:port, [::]:port). New canonicalizeListenerAddr() resolves wildcard IPs using the provided advertised host (from VS listen address). Falls back to outbound-IP detection when no advertised host is available. NewReplicaReceiver accepts optional advertisedHost parameter for multi-NIC correctness. In production, the assignment path already provides canonical addresses; this fix ensures test patterns with :0 bind also produce routable addresses. 7 new tests. TestBug3_ReplicaAddr_MustBeIPPort_WildcardBind flips from FAIL to PASS. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-25 01:38:55 -07:00
Ping Qiu	c263d082b5	fix: restart reconciliation — trust roles, upsert replicas Same-epoch reconciliation now trusts reported roles first: - one claims primary, other replica → trust roles - both claim primary → WALHeadLSN heuristic tiebreak - both claim replica → keep existing, log ambiguity Replaced addServerAsReplica with upsertServerAsReplica: checks for existing replica entry by server name before appending. Prevents duplicate ReplicaInfo rows during restart/replay windows. 2 new tests: role-trusted same-epoch, duplicate replica prevention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 01:24:53 -07:00
Ping Qiu	9137fa6486	fix: epoch-based reconciliation on master restart reconstruction When a second server reports the same volume during master restart, UpdateFullHeartbeat now uses epoch-based tie-breaking instead of first-heartbeat-wins: 1. Higher epoch wins as primary — old entry demoted to replica 2. Same epoch — higher WALHeadLSN wins (heuristic, warning logged) 3. Lower epoch — added as replica Applied in both code paths: the auto-register branch (no entry exists yet for this name) and the unlinked-server branch (entry exists but this server is not in it). This is a deterministic reconstruction improvement, not ground truth. The long-term fix is persisting authoritative volume state. 5 new tests covering all reconciliation scenarios. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 01:17:51 -07:00
Ping Qiu	a9a5e455c6	fix: Lookup/ListAll return copies, add UpdateEntry for safe mutation Lookup() and ListAll() now return value copies (not pointers to internal registry state). Callers can no longer mutate registry entries without holding a lock. Added clone() on BlockVolumeEntry with deep-copied Replicas slice. Added UpdateEntry(name, func(*BlockVolumeEntry)) for locked mutation. ListByServer() also returns copies. Migrated 1 production mutation (ReplicaPlacement + Preset in create handler) and ~20 test mutations to use UpdateEntry. 5 new copy-correctness tests: Lookup returns copy, Replicas slice isolated, ListAll returns copies, UpdateEntry mutates, UpdateEntry not-found error. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 01:00:27 -07:00
Ping Qiu	e8c921d9e8	fix: remove nil-optional superMu pattern, require in all FlusherConfigs superMu is mandatory for correctness — all superblock mutation+persist must be serialized. Remove the nil guard in updateSuperblockCheckpoint and add SuperMu to all 7 test FlusherConfig sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 00:19:25 -07:00
Ping Qiu	3ddb87adc9	fix: superblock write coordination (superMu) + remove debug logs Adds sync.Mutex (superMu) to BlockVol, shared between group commit's syncWithWALProgress() and flusher's updateSuperblockCheckpoint(). Both paths now serialize superblock mutation + persist, preventing WALTail/WALCheckpointLSN regression when flusher and group commit write the full superblock concurrently. persistSuperblock() also guarded for consistency. Removes temporary log.Printf lines in the open/recovery path that were added during BUG-RESTART-ZEROS investigation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 00:09:14 -07:00
Ping Qiu	e92263b4f4	fix: ioMu data-plane exclusion for restore/import/expand Adds sync.RWMutex (ioMu) to BlockVol enforcing mutual exclusion between normal I/O and destructive state operations. Shared (RLock): WriteLBA, ReadLBA, Trim, SyncCache, replica applyEntry, rebuild applyRebuildEntry — concurrent I/O safe. Exclusive (Lock): RestoreSnapshot, ImportSnapshot, Expand, PrepareExpand, CommitExpand, CancelExpand — drains all in-flight I/O before modifying extent/WAL/dirtyMap. Scope rule: RLock covers local data-structure mutation only. Replication shipping is asynchronous and outside the lock, so exclusive holders block only behind local I/O, not network stalls. Lock ordering: ioMu > snapMu > assignMu > mu. Closes the critical ER item: restore/import vs concurrent WriteLBA silent data corruption gap. 3 new tests: concurrent writes allowed, real restore-vs-write contention with data integrity check, close coordination. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 20:40:41 -07:00
Ping Qiu	bb691a5458	feat: CP11B-4 observability pack — health state, alerts, dashboard Health-state derivation: deriveHealthStateWithLiveness() computes per-volume state (unsafe > rebuilding > degraded > healthy) using role, replica count, durability mode, degraded flag, and primary server liveness. Used consistently in both volume responses and cluster summary. Extended GET /block/status with health counts (healthy, degraded, rebuilding, unsafe) and NVMe-capable server count. Response is now typed BlockStatusResponse instead of untyped map. Default alert pack: 7 Prometheus rules covering WAL pressure, flusher errors, replica degradation, rebuilding, scrub errors. Alert rules reference real seaweedfs_blockvol_* metric names. Default dashboard: Grafana JSON with 17 panels — cluster health, IOPS, latency P99, WAL pressure, flusher throughput, replication, scrub, dirty map, epoch. 17 tests: 9 health derivation, 1 cluster summary, 2 handler/API, 2 alert validation, 2 dashboard validation, 1 liveness parity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 02:12:42 -07:00
Ping Qiu	f501c63009	feat: CP11B-2 explainable placement / plan API New POST /block/volume/plan endpoint returns full placement preview: resolved policy, ordered candidate list, selected primary/replicas, and per-server rejection reasons with stable string constants. Core design: evaluateBlockPlacement() is a pure function with no registry/topology dependency. gatherPlacementCandidates() is the single topology bridge point. Plan and create share the same planner — parity contract is same ordered candidate list for same cluster state. Create path refactored: uses evaluateBlockPlacement() instead of PickServer(), iterates all candidates (no 3-retry cap), recomputes replica order after primary fallback. rf_not_satisfiable severity is durability-mode-aware (warning for best_effort, error for strict). 15 unit tests + 20 QA adversarial tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 02:12:25 -07:00
Ping Qiu	683969086c	feat: CP11B-1 provisioning presets + review fixes Preset system: ResolvePolicy resolves named presets (database, general, throughput) with per-field overrides into concrete volume parameters. Create path now uses resolved policy instead of ad-hoc validation. New /block/volume/resolve diagnostic endpoint for dry-run resolution. Review fix 1 (MED): HasNVMeCapableServer now derives NVMe capability from server-level heartbeat attribute (block_nvme_addr proto field) instead of scanning volume entries. Fixes false "no NVMe" warning on fresh clusters with NVMe-capable servers but no volumes yet. Review fix 2 (LOW): /block/volume/resolve no longer proxied to leader — read-only diagnostic endpoint can be served by any master. Engine fix: ReadLBA retry loop closes stale dirty-map race when WAL entry is recycled between lookup and read. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 14:44:24 -07:00
Ping Qiu	075ff52219	feat: CP11B-3 safe ops — promotion hardening, preflight, manual promote Six-task checkpoint hardening the promotion and failover paths: T1: 4-gate candidate evaluation (heartbeat freshness, WAL lag, role, server liveness) with structured rejection reasons. T2: Orphaned-primary re-evaluation on replica reconnect (B-06/B-08). T3: Deferred timer safety — epoch validation prevents stale timers from firing on recreated/changed volumes (B-07). T4: Rebuild addr cleanup on promotion (B-11), NVMe publication refresh on heartbeat, and preflight endpoint wiring. T5: Manual promote API — POST /block/volume/{name}/promote with force flag, target server selection, and structured rejection response. Shared applyPromotionLocked/finalizePromotion helpers eliminate duplication between auto and manual paths. T6: Read-only preflight endpoint (GET /block/volume/{name}/preflight) and blockapi client wrappers (Preflight, Promote). BUG-T5-1: PromotionsTotal counter moved to finalizePromotion (shared by both auto and manual paths) to prevent metrics divergence. 24 files changed, ~6500 lines added. 42 new QA adversarial tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 17:21:17 -07:00
Ping Qiu	ed11a09a61	fix: CP11A-4 snapshot export/import safety — 3 bugs from review BUG-CP11A4-1 (HIGH): ImportSnapshot now rejects when active snapshots exist. Import overwrites the extent region that non-CoW'd snapshot blocks read from, which would silently return import data instead of snapshot-time data. New ErrImportActiveSnapshots error and snapMu-guarded check. BUG-CP11A4-2 (HIGH): Double import without AllowOverwrite now correctly rejected. Import bypasses WAL so nextLSN stays at 1; added FlagImported (Superblock.Flags bit 0) set after successful import and checked alongside nextLSN in the non-empty gate. BUG-CP11A4-3 (MED): Replaced fixed exportTempSnapID (0xFFFFFFFE) with atomic sequence counter (exportTempSnapBase + exportTempSnapSeq). Each auto-export gets a unique temp snapshot ID, preventing concurrent export races and user snapshot ID collisions. Also added beginOp()/endOp() lifecycle guards to both ExportSnapshot and ImportSnapshot, and documented the non-atomic import failure semantics. 5 new regression tests + QA-EX-3 rewritten for rejection behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 10:56:18 -07:00
Ping Qiu	7cc6467d09	feat: CP11A-4 snapshot export/import to S3 — artifact format, engine, and transport Add crash-consistent snapshot export/import for single-profile block volumes. Export creates a temp snapshot, streams the full volume image with inline SHA-256, and uploads to S3. Import validates manifest + checksum and writes directly to extent region. Admin HTTP endpoints /export and /import added to the standalone iscsi-target binary. Engine: snapshot_export.go (manifest types, ExportSnapshot, ImportSnapshot) S3: snapshot_s3.go (AWS SDK v1 transport, pipe-based streaming upload) Tests: 14 engine + 9 QA adversarial = 23 new tests, all passing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 00:15:27 -07:00
Ping Qiu	1c5b658170	feat: CP11A-3 WAL hardening foundations — pressure visibility, sizing guidance, preflight Add PressureState() and writer wait tracking to WALAdmission, WALStatus snapshot API on BlockVol, WAL sizing guidance pure functions, Prometheus histogram/gauge/counter exports, and admin /status WAL fields. 23 new tests (7 admission, 10 guidance, 6 QA adversarial). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 19:30:59 -07:00
Ping Qiu	67f6e73ca7	fix: B-09 stale entry during expand, B-10 heartbeat deletes during expand B-09: ExpandBlockVolume re-reads the registry entry after acquiring the expand inflight lock. Previously it used the entry from the initial Lookup, which could be stale if failover changed VolumeServer or Replicas between Lookup and PREPARE. B-10: UpdateFullHeartbeat stale-cleanup now skips entries with ExpandInProgress=true. Previously a primary VS restart during coordinated expand would delete the entry (path not in heartbeat), orphaning the volume and stranding the expand coordinator. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 15:12:40 -07:00
Ping Qiu	1b3edd7856	feat: CP11A-2 coordinated expand protocol for replicated block volumes Two-phase prepare/commit/cancel protocol ensures all replicas expand atomically. Standalone volumes use direct-commit (unchanged behavior). Engine: PrepareExpand/CommitExpand/CancelExpand with on-disk PreparedSize+ExpandEpoch in superblock, crash recovery clears stale prepare state on open, v.mu serializes concurrent expand operations. Proto: 3 new RPCs (PrepareExpand/CommitExpand/CancelExpandBlockVolume). Coordinator: expandClean flag pattern — ReleaseExpandInflight only on clean success or full cancel. Partial replica commit failure calls MarkExpandFailed (keeps ExpandInProgress=true, suppresses heartbeat size updates). ClearExpandFailed for manual reconciliation. Registry: AcquireExpandInflight records PendingExpandSize+ExpandEpoch. ExpandFailed state blocks new expands until cleared. Tests: 15 engine + 4 VS + 10 coordinator + heartbeat suppression regression + updated QA CP82/durability tests with prepare/commit mocks. Also includes CP11A-1 remaining: QA storage profile tests, QA io_backend config tests, testrunner perf-baseline scenarios and coordinated-expand actions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 15:06:48 -07:00
Ping Qiu	74e8a4ce68	feat: CP11A-1 storage profile type, superblock persistence, and validation Add StorageProfile enum (single=0, striped=1 reserved) persisted at superblock offset 105. Existing volumes auto-map to single via zero-pad backward compatibility. CreateBlockVol rejects striped and invalid profile values before file creation. ParseStorageProfile is case-insensitive and whitespace-tolerant. 13 tests: enum string/parse, superblock persistence, backward compat, create/open/reopen, striped rejection, invalid profile rejection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 21:52:00 -07:00
Ping Qiu	86cc5983f5	chore: Phase 10 remaining — QA WAL admission metrics tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 17:35:44 -07:00
Ping Qiu	a7b1b4cb22	fix: propagate NVMe fields through replica creation, heartbeat, and promotion ReplicaInfo now carries NvmeAddr/NQN. Fields are populated during replica allocation (tryCreateOneReplica), updated from replica heartbeats, and copied in PromoteBestReplica. This ensures master lookup returns correct NVMe endpoints immediately after failover, without waiting for the first post-promotion heartbeat. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 17:35:44 -07:00
Ping Qiu	9ef446d0cf	feat: master-backed NVMe/TCP publication (nvme_addr + nqn plumbing) Add nvme_addr and nqn fields to proto messages (AllocateBlockVolume, CreateBlockVolume, LookupBlockVolume, BlockVolumeInfoMessage), wire through volume server → master registry → CSI driver. Volume servers report NVMe address in heartbeats when NVMe target is running. CSI MasterVolumeClient now populates NvmeAddr/NQN from master responses, enabling NVMe/TCP via the master-backend path. Proto files regenerated with protoc 29.5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 17:35:43 -07:00
Ping Qiu	f698b1f154	fix: reject IOBackend=io_uring in Validate(), fix wal_admit_wait metric type Finding 1: IOBackend=io_uring was accepted and logged as resolved but had no runtime effect. Now rejected by Validate() until actually wired, preventing user confusion. Finding 2: wal_admit_wait_seconds_total was exported as GaugeFunc but is monotonically increasing. Changed to CounterFunc to match _total naming convention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 17:35:43 -07:00
Ping Qiu	e22e57a3f7	feat: WAL admission metrics for visibility into write pressure behavior Add counters (total, soft, hard, timeout) and wait-time histogram to WALAdmission, wired through EngineMetrics and exported as Prometheus metrics. Six new tests verify all code paths. Nil-safe for backwards compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 17:34:58 -07:00
Ping Qiu	003b8c2f28	fix: require explicit build tags for io_uring backends, add implementation logging All three io_uring backends (iceber, giouring, raw) now require explicit build tags — no tag means standard-only. Each backend registers its name via IOUringImpl so startup logs show compiled implementation alongside requested/selected backend mode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 18:19:31 -07:00
Ping Qiu	cd1e0afa3b	feat: three io_uring backends for A/B/C benchmarking Split iouring_linux.go into three build-tagged implementations: 1. iouring_iceber_linux.go (-tags iouring_iceber) iceber/iouring-go library. Goroutine-based completion model. Known -72% write regression due to per-op channel overhead. 2. iouring_giouring_linux.go (-tags iouring_giouring) pawelgaczynski/giouring — direct liburing port. No goroutines, no channels. Direct SQE/CQE ring manipulation. Kernel 6.0+. 3. iouring_raw_linux.go (default on Linux, no tags needed) Raw syscall wrappers — io_uring_setup/io_uring_enter + mmap. Zero dependencies. ~300 LOC. Kernel 5.6+. Build commands for benchmarking: go build -tags iouring_iceber ./... # option A go build -tags iouring_giouring ./... # option B go build ./... # option C (raw, default) go build -tags no_iouring ./... # disable all io_uring All variants implement the same BatchIO interface. Cross-compile verified for all four tag combinations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 18:11:39 -07:00
Ping Qiu	5e4baccc46	fix: use RequestSet.Requests() API for io_uring result iteration The iceber/iouring-go SubmitRequests returns a RequestSet interface which cannot be ranged over directly. Use resultSet.Done() to wait for all completions, then iterate resultSet.Requests(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 16:13:24 -07:00
Ping Qiu	9d0ec8efa3	feat: tri-state IOBackend config with explicit logging and CLI flag Replace UseIOUring bool with IOBackend IOBackendMode (tri-state): - "standard" (default): sequential pread/pwrite/fdatasync - "auto": try io_uring, fall back to standard with warning log - "io_uring": require io_uring, fail startup if unavailable NewIOUring now returns ErrIOUringUnavailable instead of silently falling back — callers decide whether to fail or fall back based on the requested mode. All mode transitions are logged: io backend: requested=auto selected=standard reason=... io backend: requested=io_uring selected=io_uring CLI: --io-backend=standard|auto|io_uring added to iscsi-target. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 16:02:30 -07:00
Ping Qiu	66d5ba0a84	fix: BatchIO review fixes — linked SQE, ring overflow, resource leak, sync parity 1. HIGH: LinkedWriteFsync now uses SubmitLinkRequests (IOSQE_IO_LINK) instead of SubmitRequests, ensuring write+fdatasync execute as a linked chain in the kernel. Falls back to sequential on error. 2. HIGH: PreadBatch/PwriteBatch chunk ops by ring capacity to prevent "too many requests" rejection when dirty map exceeds ring size (256). 3. MED: CloseBatchIO() added to Flusher, called in BlockVol.Close() after final flush to release io_uring ring / kernel resources. 4. MED: Sync parity — both standard and io_uring paths now use fdatasync (via platform-specific fdatasync_linux.go / fdatasync_other.go). Standard path previously used fsync; now matches io_uring semantics. On non-Linux, fdatasync falls back to fsync (only option available). 10 batchio tests, all blockvol tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 15:47:03 -07:00
Ping Qiu	04b1827b4a	feat: io_uring BatchIO implementation + UseIOUring config wiring Add iouring_linux.go (build-tagged linux && !no_iouring) using iceber/iouring-go for batched pread/pwrite/fdatasync. Includes linked write+fsync chain for group commit optimization. iouring_other.go provides silent fallback to standard on non-Linux. blockvol.go wires UseIOUring config flag through to flusher BatchIO. NewIOUring gracefully falls back if kernel lacks io_uring support. 10 batchio tests, all blockvol tests pass unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 15:19:00 -07:00
Ping Qiu	e55f369d66	feat: BatchIO interface for swappable flusher I/O backend New package batchio/ with BatchIO interface (PreadBatch, PwriteBatch, Fsync, LinkedWriteFsync) and standard sequential implementation. Flusher refactored to use BatchIO: WAL header reads, WAL entry reads, and extent writes are now batched through the interface. With the default NewStandard() backend, behavior is identical to before. UseIOUring config field added for future io_uring opt-in (Linux 5.6+). 9 interface tests, all existing blockvol tests pass unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 15:13:33 -07:00
Ping Qiu	4c5f9f2b9d	feat: CP10B-1 NVMe/TCP RX/TX split + CP10B-2 bench/profiling fixes RX/TX split: rxLoop reads PDUs, txLoop writes responses via respCh. Handlers refactored to void + enqueueResponse pattern. IOCCSZ fix enables inline write data (100K IOPS vs 15K before). R2T deadlock fix via completeWaiters. Shutdown cleans up pendingCapsules buffers. Bench: ParseFioMetric accepts plain/quoted numbers for aggregated medians. Profiling actions: pprof_capture, vmstat_capture, iostat_capture. 196 NVMe tests, 92 testrunner actions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 15:09:41 -07:00
Ping Qiu	3557ae283f	feat: Phase 10 CP10-3 -- NVMe/TCP Tier 1 optimizations, WAL admission control, benchmark platform CP10-3 Tier 1 optimizations (T1-T4): - TCP_NODELAY + 256KB socket buffers on NVMe/TCP connections - Response batching: all C2H data chunks + CapsuleResp in single flush - Tiered buffer pool (4KB/64KB/256KB sync.Pool) for write payloads - Configurable MaxH2CDataLength wiring through controller/IC/chunking BUG-CP103-1: NVMe write retry with jittered backoff for transient WAL pressure - writeWithRetry() with bounded backoff [50/200/800ms] - throttleOnWALPressure() pre-write delay above 90% WAL usage - WALPressureProvider interface + NVMeAdapter.WALPressure() BUG-CP103-2: Volume-level WAL admission control - WALAdmission with counting semaphore (max concurrent writers) - Soft watermark (0.7): small delay to desynchronize herd - Hard watermark (0.9): block until flusher drains - Single-deadline budget shared across watermark wait + semaphore - Close-aware during both watermark and semaphore waits - Wired into BlockVol.WriteLBA() and Trim() Benchmark platform enhancements: - NVMe benchmark actions and scenarios (A/B, CW sweep, IOQ sweep) - Database benchmark actions (SQLite, pgbench) - K8s operator QA reconciler tests - New testrunner scenarios for HA, fault injection, CSI lifecycle Test counts: 213 NVMe + 625 engine + operator + testrunner tests, all passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-09 17:44:01 -07:00
Ping Qiu	bbadeeb89b	feat: Phase 10 CP10-2 -- CSI NVMe/TCP node plugin, 210 tests NVMe/TCP transport support in the CSI driver so Kubernetes pods can mount block volumes via NVMe alongside (or instead of) iSCSI. Transport selection: NVMe preferred when nvme_tcp module loaded + metadata present + nvmeUtil available. Fail-fast on NVMe errors (no silent iSCSI fallback). .transport file persists across CSI restarts. Key changes: - BuildNQN() single source of truth for NQN construction (naming.go) - NVMeUtil interface + realNVMeUtil wrapping nvme-cli (nvme_util.go) - NodeStageVolume/Unstage/Expand dual-transport paths (node.go) - NvmeAddr/NQN fields in VolumeInfo, Controller contexts - VolumeManager NvmeAddr()/VolumeNQN() getters - BlockService NvmeListenAddr()/NQN() accessors - 27 unit tests + 26 QA adversarial tests (nvme_node_test.go, qa_cp102) - Fix: flaky TestQA_Node_ConcurrentStageUnstage (pre-alloc temp dirs) Review fixes applied: F1 (NQN format mismatch), F2 (CreateVolume drops NVMe context), F3 (IsConnected error classification), F4 (findSubsys path validation), F5 (MasterVolumeClient NVMe gap documented). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 23:02:59 -07:00
Ping Qiu	0e234f5c80	feat: Phase 10 CP10-1 -- NVMe/TCP target MVP, 109 tests NVMe over Fabrics (TCP) target implementation sharing the same BlockVol engine, fencing, replication, and failover as the existing iSCSI target. New package: weed/storage/blockvol/nvme/ (11 files, 2,242 production LOC) - protocol.go: PDU types, opcodes, status codes, marshal/unmarshal - wire.go: TCP reader/writer with header bounds validation - controller.go: IC handshake, per-queue state, command dispatch, KATO - fabric.go: Connect (admin+IO), PropertyGet/Set, Disconnect - identify.go: Controller/Namespace/NS list/NS descriptors (Linux 5.15) - admin.go: SetFeatures, GetFeatures, GetLogPage (SMART/ANA), KeepAlive - io.go: Read (C2HData), Write (inline), Flush, WriteZeros/Trim - server.go: TCP listener, admin session registry, graceful shutdown - adapter.go: BlockVol-to-NVMe bridge, error mapping, ANA state Integration: NVMeConfig + CLI flags (-block.nvme.*), disabled by default. Key design: inline-data writes only (no R2T), MaxH2CDataLength=32KB, single ANA group coherent with BlockVol role, CNTLID session registry for cross-connection IO queues, HostNQN continuity enforcement. Tests: 65 dev + 44 QA adversarial = 109 total, all passing. Bugs fixed during review: IO queue cross-connection (A), header bounds validation (B), write payload size check (C), disconnect error (D), stream desync prevention (E), HostNQN enforcement (F), capsule-before-IC state guard (H), flowCtlOff SQHD timing (I). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 16:52:37 -07:00
Ping Qiu	8fa1829992	feat: Phase 9A -- Kubernetes operator MVP for SeaweedFS block storage, 71 tests Nested Go module (operator/go.mod) isolating controller-runtime deps. CRD SeaweedBlockCluster (block.seaweedfs.com/v1alpha1) with dual-mode: CSI-only (MasterRef) connects to existing cluster; full-stack (Master) deploys master+volume StatefulSets. Single reconciler manages all sub-resources with ownership labels, finalizer cleanup, CHAP secret auto-generation, and multi-CR conflict detection. Review fixes: cross-NS label ownership (H1), ParseQuantity validation (H2), volume readiness probe (M1), leader election (M2), PVC StorageClassName (M3), condition type separation (M4), FQDN master address (L1), port validation (L3). QA adversarial fixes: ExtraArgs override rejection (BUG-QA-1), malformed lastRotated infinite rotation (BUG-QA-2), DNS label length validation (BUG-QA-3), replicas=0 error message (BUG-QA-4), RFC 1123 name validation (BUG-QA-5), whitespace field trimming (BUG-QA-6), zero storage size (BUG-QA-7). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-08 12:04:17 -07:00
Ping Qiu	9acd187587	feat: Phase 8 complete -- CP8-5 stability gate, lease grant fix, Docker e2e, 13 chaos scenarios Phase 8 closes with all 6 checkpoints done (CP8-1 through CP8-5 + CP8-3-1): - CP8-5: 12/12 enterprise QA scenarios PASS on real hardware (m01/M02) - Master-authoritative lease grants (BUG-CP85-11): master renews primary write leases on every heartbeat response, replacing retain-until-confirmed assignment queue semantics that caused 30s lease expiry - Post-rebuild WAL shipping gap fix (BUG-CP85-1): syncLSNAfterRebuild advances replica nextLSN so WAL entries are accepted after rebuild - Block heartbeat startup race fix (BUG-CP85-10): dynamic blockService check on each tick instead of one-shot at loop start - 8 new tests: 4 engine lease grant + 4 registry lease grant - 13 new YAML scenarios: chaos (kill-loop, partition, disk-full), database integrity (sqlite crash, ext4 fsck), perf baseline, metrics verify, snapshot stress, expand-failover, session storm, role flap, 24h soak - 12 new testrunner actions (database, fsck, grep_log, write_loop_bg, stop_bg, assert_metric_gt/eq/lt) + phase repeat support - Docker compose setup + getting-started guide for block storage users - 960+ cumulative unit tests, 24 YAML scenarios Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-07 21:30:14 -08:00
Ping Qiu	da1b81d1c9	feat: CP8-3-1 durability modes + testrunner platform + 21 adversarial tests Durability mode implementation (sync_all, sync_quorum, best_effort): - DurabilityMode type with superblock persistence, parse/validate/string - MakeDistributedSync mode-aware barrier enforcement in dist_group_commit - blockerr sentinel package (ErrDurabilityBarrierFailed, ErrDurabilityQuorumLost) - gRPC create path: mode validation, idempotent create consistency, partial cleanup - F1: strict mode rejects partial replica provisioning with cleanup - F3: empty heartbeat does not overwrite persisted strict mode - F4: SCSI error mapping uses errors.Is sentinels (not string matching) - Proto/wire/blockapi/CLI/UI plumbing for durability_mode field - Observability dashboard: cluster health cards + per-volume columns Testrunner platform (YAML-driven integration test framework): - Engine, parser, registry, reporter (JUnit XML + HTML), metrics scraping - 52 registered actions: block, iSCSI, I/O, fault injection, assertions - Baseline regression framework with 7 hard-fail conditions - 15 YAML scenarios (smoke, crash, HA, fault, consistency, snapshot) - 49 unit tests for testrunner internals QA adversarial suite (21 tests, all PASS): - Idempotent create mode/RF mismatch detection - Heartbeat mode downgrade prevention (F3) - sync_all/sync_quorum partial replica enforcement (F1) - Concurrent create race safety - Failover/expand mode preservation - Cleanup resilience when delete fails - Master restart auto-register mode handling - Superblock roundtrip all 3 modes - Validate edge cases (mode×RF matrix) - RequiredReplicas quorum math verification - Sentinel error categorization Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 01:06:51 -08:00
Ping Qiu	979a9b496c	feat: Phase 8 CP8-1/2/3/4 -- ops control plane, multi-replica, CSI snapshots, observability CP8-1: HTTP REST API (create/delete/lookup/list/assign/servers), blockapi Go client with multi-master failover, 5 shell commands, HTML dashboard at /block/. CP8-2: RF=2/RF=3 multi-replica support -- ShipperGroup fan-out, distributed sync, health scoring, segment-based scrub, gated promotion (heartbeat freshness + WAL LSN + role checks), failover/rebuild for N>2 replicas. CP8-3: CSI snapshot + expansion -- CreateSnapshot/DeleteSnapshot/ListSnapshots RPCs, NodeExpandVolume with iSCSI rescan, snapshot ID helpers, 20 adversarial tests covering concurrent ops, edge cases, and error injection. CP8-4: Observability -- EngineMetrics atomic counters for flusher/group-commit/ WAL-shipper/scrub, 10 new Prometheus metrics, barrier_lag_lsn SLO gauge, failover/promotion/rebuild counters, request ID correlation in master gRPC logs, baseline regression framework with 7 hard-fail conditions. Total: 63 files, ~11.2K LOC, 160+ new tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 00:05:17 -08:00
Ping Qiu	8b2b5f6f66	feat: Phase 6 CP6-3 -- failover + rebuild in Kubernetes, 126 tests Wire low-level fencing primitives to master/VS control plane and CSI: - Proto: replica/rebuild address fields on assignment/info/response messages - Assignment queue: retain-until-confirmed (Peek+Confirm), stale epoch pruning - VS assignment receiver: processes assignments from HeartbeatResponse - BlockService replication: ProcessAssignments, deterministic ports (FNV hash) - Registry replica tracking: SetReplica/ClearReplica/SwapPrimaryReplica - CreateBlockVolume: primary + replica, enqueues assignments, single-copy mode - Failover: lease-aware promotion, deferred timers with cancellation on reconnect - ControllerPublish: returns fresh primary iSCSI address after failover - Recovery: recoverBlockVolumes drains pendingRebuilds, enqueues Rebuilding - Real integration tests on M02: failover address switch, rebuild data consistency, full lifecycle failover+rebuild (3 tests, all PASS) Review fixes (12 findings, 5 High, 5 Medium, 2 Low): - R1-1: AllocateBlockVolume returns replication ports - R1-2: setupPrimaryReplication starts rebuild server - R1-3: VS sends periodic block heartbeat for assignment confirmation - R2-F1: LastLeaseGrant set before Register (no stale-lease race) - R2-F2: Deferred promotion timers cancelled on VS reconnect - R2-F3: SwapPrimaryReplica uses RoleToWire instead of uint32(1) - R2-F4: DeleteBlockVolume deletes replica (best-effort) - R2-F5: SwapPrimaryReplica computes epoch atomically under lock - QA: SetReplica removes old replica from byServer index (BUG-QA-CP63-1) 126 CP6-3 tests (67 dev + 48 QA + 8 integration + 3 real). Cumulative Phase 6: 352 tests. All PASS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 00:52:05 -08:00

8430 Commits