1821 Commits

Author SHA1 Message Date
pingqiu
abbc8bff2b fix: canonicalize host in AllocateBlockVolumeResponse (CP13-2 follow-up)
AllocateBlockVolumeResponse used bs.ListenAddr() to derive replica
addresses. When the VS binds to ":port" (no explicit IP), host
resolved to empty string, producing ":dataPort" as the replica
address. This ":port" propagated through master assignments to both
primary and replica sides.

Now canonicalizes empty/wildcard host using PreferredOutboundIP()
before constructing replication addresses. Also exported
PreferredOutboundIP for use by the server package.

This is the source fix — all downstream paths (heartbeat, API
response, assignment) inherit the canonical address.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:16:45 -07:00
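
Note on abbc8bff2b: the canonicalization step described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation. The function name canonicalizeAddr and the fallbackIP parameter are assumptions; fallbackIP stands in for whatever PreferredOutboundIP() returns.

```go
package main

import (
	"fmt"
	"net"
)

// canonicalizeAddr replaces an empty or wildcard host in addr with fallbackIP.
// Hypothetical helper: fallbackIP stands in for the result of PreferredOutboundIP().
func canonicalizeAddr(addr, fallbackIP string) (string, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	// ":port", "0.0.0.0:port" and "[::]:port" all mean "bind everywhere",
	// which is not a usable address for a remote peer.
	if host == "" || host == "0.0.0.0" || host == "::" {
		host = fallbackIP
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	for _, a := range []string{":8080", "0.0.0.0:8080", "[::]:8080", "10.0.0.5:8080"} {
		out, err := canonicalizeAddr(a, "192.168.1.10")
		fmt.Println(a, "->", out, err)
	}
}
```

Canonicalizing once at the source, as the commit message argues, means downstream consumers never need to repeat this check.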
pingqiu
ae87a31d22 fix: store canonical replica addresses in heartbeat state
setupReplicaReceiver now reads back canonical addresses from
the ReplicaReceiver (which applies CP13-2 canonicalization)
instead of storing raw assignment addresses in replStates.

This fixes the API-level leak where replica_data_addr showed
":port" instead of "ip:port" in /block/volumes responses,
even though the engine-level CP13-2 fix was working.

New BlockVol.ReplicaReceiverAddr() returns canonical addresses
from the running receiver. Falls back to assignment addresses
if receiver didn't report.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:08:48 -07:00
Ping Qiu
c263d082b5 fix: restart reconciliation — trust roles, upsert replicas
Same-epoch reconciliation now trusts reported roles first:
- one claims primary, other replica → trust roles
- both claim primary → WALHeadLSN heuristic tiebreak
- both claim replica → keep existing, log ambiguity

Replaced addServerAsReplica with upsertServerAsReplica: checks
for existing replica entry by server name before appending.
Prevents duplicate ReplicaInfo rows during restart/replay windows.

2 new tests: role-trusted same-epoch, duplicate replica prevention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 01:24:53 -07:00
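
Note on c263d082b5: the upsert-by-server-name idea can be illustrated with a small sketch. The replicaInfo type and upsertReplica function below are simplified, hypothetical stand-ins for ReplicaInfo and upsertServerAsReplica; the real field set is not shown in the commit message.

```go
package main

import "fmt"

// replicaInfo is a hypothetical, reduced stand-in for the registry's ReplicaInfo row.
type replicaInfo struct {
	Server   string
	DataAddr string
}

// upsertReplica overwrites the row for r.Server if one exists and appends otherwise,
// so a replayed heartbeat cannot create a duplicate row.
func upsertReplica(replicas []replicaInfo, r replicaInfo) []replicaInfo {
	for i := range replicas {
		if replicas[i].Server == r.Server {
			replicas[i] = r
			return replicas
		}
	}
	return append(replicas, r)
}

func main() {
	var rs []replicaInfo
	rs = upsertReplica(rs, replicaInfo{Server: "vs1", DataAddr: ":9001"})
	rs = upsertReplica(rs, replicaInfo{Server: "vs1", DataAddr: "10.0.0.1:9001"}) // replay
	fmt.Println(len(rs), rs[0].DataAddr)
}
```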
Ping Qiu
9137fa6486 fix: epoch-based reconciliation on master restart reconstruction
When a second server reports the same volume during master restart,
UpdateFullHeartbeat now uses epoch-based tie-breaking instead of
first-heartbeat-wins:

1. Higher epoch wins as primary — old entry demoted to replica
2. Same epoch — higher WALHeadLSN wins (heuristic, warning logged)
3. Lower epoch — added as replica

Applied in both code paths: the auto-register branch (no entry
exists yet for this name) and the unlinked-server branch (entry
exists but this server is not in it).

This is a deterministic reconstruction improvement, not ground
truth. The long-term fix is persisting authoritative volume state.

5 new tests covering all reconciliation scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 01:17:51 -07:00
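
Note on 9137fa6486: the three tie-breaking rules listed above could be expressed as a small pure function. This is a sketch under assumptions: the report type and pickPrimary name are invented for illustration, and the behavior on an exact tie (same epoch, same WALHeadLSN) is a guess, since the commit message does not state it.

```go
package main

import "fmt"

// report is a hypothetical summary of what one server's heartbeat claims.
type report struct {
	Server     string
	Epoch      uint64
	WALHeadLSN uint64
}

// pickPrimary returns the report that should become primary and whether the
// decision relied on the WALHeadLSN heuristic (callers would log a warning).
func pickPrimary(existing, incoming report) (winner report, heuristic bool) {
	switch {
	case incoming.Epoch > existing.Epoch:
		return incoming, false // rule 1: higher epoch wins
	case incoming.Epoch < existing.Epoch:
		return existing, false // rule 3: lower epoch joins as replica
	case incoming.WALHeadLSN > existing.WALHeadLSN:
		return incoming, true // rule 2: same epoch, higher WAL head
	default:
		return existing, true // assumed: exact tie keeps the existing entry
	}
}

func main() {
	w, h := pickPrimary(report{"vs1", 4, 100}, report{"vs2", 4, 900})
	fmt.Println(w.Server, h)
}
```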
Ping Qiu
a9a5e455c6 fix: Lookup/ListAll return copies, add UpdateEntry for safe mutation
Lookup() and ListAll() now return value copies (not pointers to
internal registry state). Callers can no longer mutate registry
entries without holding a lock.

Added clone() on BlockVolumeEntry with deep-copied Replicas slice.
Added UpdateEntry(name, func(*BlockVolumeEntry)) for locked mutation.
ListByServer() also returns copies.

Migrated 1 production mutation (ReplicaPlacement + Preset in create
handler) and ~20 test mutations to use UpdateEntry.

5 new copy-correctness tests: Lookup returns copy, Replicas slice
isolated, ListAll returns copies, UpdateEntry mutates, UpdateEntry
not-found error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 01:00:27 -07:00
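
Note on a9a5e455c6: the copy-on-read plus locked-mutation pattern might look like the sketch below. The entry and registry types are reduced, hypothetical versions of BlockVolumeEntry and the block volume registry; only the Replicas slice is deep-copied here because it is the only reference-typed field in this toy struct.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNotFound = errors.New("entry not found")

// entry is a hypothetical, reduced BlockVolumeEntry.
type entry struct {
	Name     string
	Replicas []string
}

// clone copies the struct and gives the copy its own Replicas backing array.
func (e *entry) clone() entry {
	c := *e
	c.Replicas = append([]string(nil), e.Replicas...)
	return c
}

type registry struct {
	mu      sync.RWMutex
	entries map[string]*entry
}

// Lookup returns a value copy, so callers cannot reach internal state.
func (r *registry) Lookup(name string) (entry, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	e, ok := r.entries[name]
	if !ok {
		return entry{}, false
	}
	return e.clone(), true
}

// UpdateEntry runs fn on the internal entry while holding the write lock.
func (r *registry) UpdateEntry(name string, fn func(*entry)) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.entries[name]
	if !ok {
		return errNotFound
	}
	fn(e)
	return nil
}

func main() {
	r := &registry{entries: map[string]*entry{
		"vol1": {Name: "vol1", Replicas: []string{"vs2"}},
	}}
	c, _ := r.Lookup("vol1")
	c.Replicas[0] = "tampered" // only touches the copy
	again, _ := r.Lookup("vol1")
	fmt.Println(again.Replicas[0])
}
```

Without the deep copy in clone(), the returned struct would still share the Replicas backing array with the registry, which is the isolation bug the "Replicas slice isolated" test guards against.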
Ping Qiu
bb691a5458 feat: CP11B-4 observability pack — health state, alerts, dashboard
Health-state derivation: deriveHealthStateWithLiveness() computes
per-volume state (unsafe > rebuilding > degraded > healthy) using
role, replica count, durability mode, degraded flag, and primary
server liveness. Used consistently in both volume responses and
cluster summary.

Extended GET /block/status with health counts (healthy, degraded,
rebuilding, unsafe) and NVMe-capable server count. Response is now
typed BlockStatusResponse instead of untyped map.

Default alert pack: 7 Prometheus rules covering WAL pressure,
flusher errors, replica degradation, rebuilding, scrub errors.
Alert rules reference real seaweedfs_blockvol_* metric names.

Default dashboard: Grafana JSON with 17 panels — cluster health,
IOPS, latency P99, WAL pressure, flusher throughput, replication,
scrub, dirty map, epoch.

17 tests: 9 health derivation, 1 cluster summary, 2 handler/API,
2 alert validation, 2 dashboard validation, 1 liveness parity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 02:12:42 -07:00
Ping Qiu
f501c63009 feat: CP11B-2 explainable placement / plan API
New POST /block/volume/plan endpoint returns full placement preview:
resolved policy, ordered candidate list, selected primary/replicas,
and per-server rejection reasons with stable string constants.

Core design: evaluateBlockPlacement() is a pure function with no
registry/topology dependency. gatherPlacementCandidates() is the
single topology bridge point. Plan and create share the same planner —
parity contract is same ordered candidate list for same cluster state.

Create path refactored: uses evaluateBlockPlacement() instead of
PickServer(), iterates all candidates (no 3-retry cap), recomputes
replica order after primary fallback. rf_not_satisfiable severity
is durability-mode-aware (warning for best_effort, error for strict).

15 unit tests + 20 QA adversarial tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 02:12:25 -07:00
Ping Qiu
683969086c feat: CP11B-1 provisioning presets + review fixes
Preset system: ResolvePolicy resolves named presets (database, general,
throughput) with per-field overrides into concrete volume parameters.
Create path now uses resolved policy instead of ad-hoc validation.
New /block/volume/resolve diagnostic endpoint for dry-run resolution.

Review fix 1 (MED): HasNVMeCapableServer now derives NVMe capability
from server-level heartbeat attribute (block_nvme_addr proto field)
instead of scanning volume entries. Fixes false "no NVMe" warning on
fresh clusters with NVMe-capable servers but no volumes yet.

Review fix 2 (LOW): /block/volume/resolve no longer proxied to leader —
read-only diagnostic endpoint can be served by any master.

Engine fix: ReadLBA retry loop closes stale dirty-map race when WAL
entry is recycled between lookup and read.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 14:44:24 -07:00
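
Note on 683969086c: resolving a named preset with per-field overrides could be sketched as below. The preset names (database, general, throughput) come from the commit message; the field set and every preset value are invented for illustration and should not be read as the real defaults.

```go
package main

import "fmt"

// policy is a hypothetical, reduced set of concrete volume parameters.
type policy struct {
	DurabilityMode string
	ReplicaFactor  int
}

// overrides uses pointers so "not set" is distinguishable from a zero value.
type overrides struct {
	DurabilityMode *string
	ReplicaFactor  *int
}

// presets holds made-up values; only the names appear in the commit message.
var presets = map[string]policy{
	"database":   {DurabilityMode: "sync_all", ReplicaFactor: 2},
	"general":    {DurabilityMode: "best_effort", ReplicaFactor: 2},
	"throughput": {DurabilityMode: "best_effort", ReplicaFactor: 1},
}

// resolvePolicy starts from the preset and applies any fields the caller set.
func resolvePolicy(preset string, o overrides) (policy, error) {
	p, ok := presets[preset]
	if !ok {
		return policy{}, fmt.Errorf("unknown preset %q", preset)
	}
	if o.DurabilityMode != nil {
		p.DurabilityMode = *o.DurabilityMode
	}
	if o.ReplicaFactor != nil {
		p.ReplicaFactor = *o.ReplicaFactor
	}
	return p, nil
}

func main() {
	rf := 3
	p, err := resolvePolicy("database", overrides{ReplicaFactor: &rf})
	fmt.Println(p, err)
}
```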
Ping Qiu
075ff52219 feat: CP11B-3 safe ops — promotion hardening, preflight, manual promote
Six-task checkpoint hardening the promotion and failover paths:

T1: 4-gate candidate evaluation (heartbeat freshness, WAL lag, role,
    server liveness) with structured rejection reasons.
T2: Orphaned-primary re-evaluation on replica reconnect (B-06/B-08).
T3: Deferred timer safety — epoch validation prevents stale timers
    from firing on recreated/changed volumes (B-07).
T4: Rebuild addr cleanup on promotion (B-11), NVMe publication
    refresh on heartbeat, and preflight endpoint wiring.
T5: Manual promote API — POST /block/volume/{name}/promote with
    force flag, target server selection, and structured rejection
    response. Shared applyPromotionLocked/finalizePromotion helpers
    eliminate duplication between auto and manual paths.
T6: Read-only preflight endpoint (GET /block/volume/{name}/preflight)
    and blockapi client wrappers (Preflight, Promote).

BUG-T5-1: PromotionsTotal counter moved to finalizePromotion (shared
    by both auto and manual paths) to prevent metrics divergence.

24 files changed, ~6500 lines added. 42 new QA adversarial tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 17:21:17 -07:00
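
Note on 075ff52219 (T3): the deferred-timer safety check can be illustrated as follows. The idea is that the timer callback captures the epoch at scheduling time and refuses to act if the volume's epoch has moved on. The volume type and method names are hypothetical; this is a sketch of the pattern, not the code from the commit.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// volume is a hypothetical stand-in for a registry entry with an epoch.
type volume struct {
	mu       sync.Mutex
	epoch    uint64
	promoted bool
}

// promoteIfCurrent is the timer body: it acts only if the epoch is unchanged
// since the timer was scheduled.
func (v *volume) promoteIfCurrent(scheduledEpoch uint64) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.epoch != scheduledEpoch {
		return false // stale timer: volume was recreated or changed
	}
	v.promoted = true
	return true
}

// schedulePromotion captures the current epoch and arms a deferred timer.
func (v *volume) schedulePromotion(delay time.Duration, done chan<- bool) *time.Timer {
	v.mu.Lock()
	epoch := v.epoch
	v.mu.Unlock()
	return time.AfterFunc(delay, func() { done <- v.promoteIfCurrent(epoch) })
}

func main() {
	v := &volume{epoch: 7}
	done := make(chan bool, 1)
	v.schedulePromotion(10*time.Millisecond, done)

	v.mu.Lock()
	v.epoch = 8 // the volume changes before the timer fires
	v.mu.Unlock()

	fmt.Println("promoted:", <-done)
}
```

Cancelling the timer on reconnect is still worthwhile, but the epoch check covers the case where cancellation loses the race with the timer firing.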
Ping Qiu
67f6e73ca7 fix: B-09 stale entry during expand, B-10 heartbeat deletes during expand
B-09: ExpandBlockVolume re-reads the registry entry after acquiring
the expand inflight lock. Previously it used the entry from the
initial Lookup, which could be stale if failover changed VolumeServer
or Replicas between Lookup and PREPARE.

B-10: UpdateFullHeartbeat stale-cleanup now skips entries with
ExpandInProgress=true. Previously a primary VS restart during
coordinated expand would delete the entry (path not in heartbeat),
orphaning the volume and stranding the expand coordinator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:12:40 -07:00
Ping Qiu
1b3edd7856 feat: CP11A-2 coordinated expand protocol for replicated block volumes
Two-phase prepare/commit/cancel protocol ensures all replicas expand
atomically. Standalone volumes use direct-commit (unchanged behavior).

Engine: PrepareExpand/CommitExpand/CancelExpand with on-disk
PreparedSize+ExpandEpoch in superblock, crash recovery clears stale
prepare state on open, v.mu serializes concurrent expand operations.

Proto: 3 new RPCs (PrepareExpand/CommitExpand/CancelExpandBlockVolume).

Coordinator: expandClean flag pattern — ReleaseExpandInflight only on
clean success or full cancel. Partial replica commit failure calls
MarkExpandFailed (keeps ExpandInProgress=true, suppresses heartbeat
size updates). ClearExpandFailed for manual reconciliation.

Registry: AcquireExpandInflight records PendingExpandSize+ExpandEpoch.
ExpandFailed state blocks new expands until cleared.

Tests: 15 engine + 4 VS + 10 coordinator + heartbeat suppression
regression + updated QA CP82/durability tests with prepare/commit mocks.

Also includes CP11A-1 remaining: QA storage profile tests, QA
io_backend config tests, testrunner perf-baseline scenarios and
coordinated-expand actions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 15:06:48 -07:00
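
Note on 1b3edd7856: a coordinator for the prepare/commit/cancel flow might be shaped like the sketch below. The expander interface, fakeNode type, and error value are assumptions made for illustration; the real RPCs also carry an expand epoch and persist state in the superblock, which this toy version omits.

```go
package main

import (
	"errors"
	"fmt"
)

// errExpandFailed models the "partial commit" state that needs manual reconciliation.
var errExpandFailed = errors.New("expand partially committed; manual reconciliation required")

// expander is a hypothetical view of one node (primary or replica).
type expander interface {
	Prepare(newSize uint64) error
	Commit() error
	Cancel()
}

// coordinatedExpand prepares every node first; any prepare failure cancels the
// nodes already prepared. Only after all prepares succeed does it commit.
func coordinatedExpand(nodes []expander, newSize uint64) error {
	for i, n := range nodes {
		if err := n.Prepare(newSize); err != nil {
			for _, p := range nodes[:i] {
				p.Cancel()
			}
			return fmt.Errorf("prepare failed, expand cancelled: %w", err)
		}
	}
	for _, n := range nodes {
		if err := n.Commit(); err != nil {
			return errExpandFailed // do not auto-cancel: some nodes may have committed
		}
	}
	return nil
}

// fakeNode is an in-memory node used only to exercise the coordinator.
type fakeNode struct {
	size, prepared uint64
	failPrepare    bool
}

func (f *fakeNode) Prepare(s uint64) error {
	if f.failPrepare {
		return errors.New("prepare rejected")
	}
	f.prepared = s
	return nil
}

func (f *fakeNode) Commit() error { f.size, f.prepared = f.prepared, 0; return nil }
func (f *fakeNode) Cancel()       { f.prepared = 0 }

func main() {
	a, b := &fakeNode{size: 10}, &fakeNode{size: 10, failPrepare: true}
	fmt.Println(coordinatedExpand([]expander{a, b}, 20), a.size, a.prepared)
}
```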
Ping Qiu
a7b1b4cb22 fix: propagate NVMe fields through replica creation, heartbeat, and promotion
ReplicaInfo now carries NvmeAddr/NQN. Fields are populated during
replica allocation (tryCreateOneReplica), updated from replica
heartbeats, and copied in PromoteBestReplica. This ensures master
lookup returns correct NVMe endpoints immediately after failover,
without waiting for the first post-promotion heartbeat.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 17:35:44 -07:00
Ping Qiu
9ef446d0cf feat: master-backed NVMe/TCP publication (nvme_addr + nqn plumbing)
Add nvme_addr and nqn fields to proto messages (AllocateBlockVolume,
CreateBlockVolume, LookupBlockVolume, BlockVolumeInfoMessage), wire
through volume server → master registry → CSI driver. Volume servers
report NVMe address in heartbeats when NVMe target is running. CSI
MasterVolumeClient now populates NvmeAddr/NQN from master responses,
enabling NVMe/TCP via the master-backed path.

Proto files regenerated with protoc 29.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 17:35:43 -07:00
Ping Qiu
bbadeeb89b feat: Phase 10 CP10-2 -- CSI NVMe/TCP node plugin, 210 tests
NVMe/TCP transport support in the CSI driver so Kubernetes pods can
mount block volumes via NVMe alongside (or instead of) iSCSI.

Transport selection: NVMe preferred when nvme_tcp module loaded +
metadata present + nvmeUtil available. Fail-fast on NVMe errors (no
silent iSCSI fallback). .transport file persists across CSI restarts.

Key changes:
- BuildNQN() single source of truth for NQN construction (naming.go)
- NVMeUtil interface + realNVMeUtil wrapping nvme-cli (nvme_util.go)
- NodeStageVolume/Unstage/Expand dual-transport paths (node.go)
- NvmeAddr/NQN fields in VolumeInfo, Controller contexts
- VolumeManager NvmeAddr()/VolumeNQN() getters
- BlockService NvmeListenAddr()/NQN() accessors
- 27 unit tests + 26 QA adversarial tests (nvme_node_test.go, qa_cp102)
- Fix: flaky TestQA_Node_ConcurrentStageUnstage (pre-alloc temp dirs)

Review fixes applied: F1 (NQN format mismatch), F2 (CreateVolume drops
NVMe context), F3 (IsConnected error classification), F4 (findSubsys
path validation), F5 (MasterVolumeClient NVMe gap documented).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 23:02:59 -07:00
Ping Qiu
0e234f5c80 feat: Phase 10 CP10-1 -- NVMe/TCP target MVP, 109 tests
NVMe over Fabrics (TCP) target implementation sharing the same BlockVol
engine, fencing, replication, and failover as the existing iSCSI target.

New package: weed/storage/blockvol/nvme/ (11 files, 2,242 production LOC)
- protocol.go: PDU types, opcodes, status codes, marshal/unmarshal
- wire.go: TCP reader/writer with header bounds validation
- controller.go: IC handshake, per-queue state, command dispatch, KATO
- fabric.go: Connect (admin+IO), PropertyGet/Set, Disconnect
- identify.go: Controller/Namespace/NS list/NS descriptors (Linux 5.15)
- admin.go: SetFeatures, GetFeatures, GetLogPage (SMART/ANA), KeepAlive
- io.go: Read (C2HData), Write (inline), Flush, WriteZeros/Trim
- server.go: TCP listener, admin session registry, graceful shutdown
- adapter.go: BlockVol-to-NVMe bridge, error mapping, ANA state

Integration: NVMeConfig + CLI flags (-block.nvme.*), disabled by default.

Key design: inline-data writes only (no R2T), MaxH2CDataLength=32KB,
single ANA group coherent with BlockVol role, CNTLID session registry
for cross-connection IO queues, HostNQN continuity enforcement.

Tests: 65 dev + 44 QA adversarial = 109 total, all passing.
Bugs fixed during review: IO queue cross-connection (A), header bounds
validation (B), write payload size check (C), disconnect error (D),
stream desync prevention (E), HostNQN enforcement (F), capsule-before-IC
state guard (H), flowCtlOff SQHD timing (I).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 16:52:37 -07:00
Ping Qiu
9acd187587 feat: Phase 8 complete -- CP8-5 stability gate, lease grant fix, Docker e2e, 13 chaos scenarios
Phase 8 closes with all 6 checkpoints done (CP8-1 through CP8-5 + CP8-3-1):
- CP8-5: 12/12 enterprise QA scenarios PASS on real hardware (m01/M02)
- Master-authoritative lease grants (BUG-CP85-11): master renews primary
  write leases on every heartbeat response, replacing retain-until-confirmed
  assignment queue semantics that caused 30s lease expiry
- Post-rebuild WAL shipping gap fix (BUG-CP85-1): syncLSNAfterRebuild
  advances replica nextLSN so WAL entries are accepted after rebuild
- Block heartbeat startup race fix (BUG-CP85-10): dynamic blockService
  check on each tick instead of one-shot at loop start
- 8 new tests: 4 engine lease grant + 4 registry lease grant
- 13 new YAML scenarios: chaos (kill-loop, partition, disk-full),
  database integrity (sqlite crash, ext4 fsck), perf baseline,
  metrics verify, snapshot stress, expand-failover, session storm,
  role flap, 24h soak
- 12 new testrunner actions (database, fsck, grep_log, write_loop_bg,
  stop_bg, assert_metric_gt/eq/lt) + phase repeat support
- Docker compose setup + getting-started guide for block storage users
- 960+ cumulative unit tests, 24 YAML scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 21:30:14 -08:00
Ping Qiu
da1b81d1c9 feat: CP8-3-1 durability modes + testrunner platform + 21 adversarial tests
Durability mode implementation (sync_all, sync_quorum, best_effort):
- DurabilityMode type with superblock persistence, parse/validate/string
- MakeDistributedSync mode-aware barrier enforcement in dist_group_commit
- blockerr sentinel package (ErrDurabilityBarrierFailed, ErrDurabilityQuorumLost)
- gRPC create path: mode validation, idempotent create consistency, partial cleanup
- F1: strict mode rejects partial replica provisioning with cleanup
- F3: empty heartbeat does not overwrite persisted strict mode
- F4: SCSI error mapping uses errors.Is sentinels (not string matching)
- Proto/wire/blockapi/CLI/UI plumbing for durability_mode field
- Observability dashboard: cluster health cards + per-volume columns

Testrunner platform (YAML-driven integration test framework):
- Engine, parser, registry, reporter (JUnit XML + HTML), metrics scraping
- 52 registered actions: block, iSCSI, I/O, fault injection, assertions
- Baseline regression framework with 7 hard-fail conditions
- 15 YAML scenarios (smoke, crash, HA, fault, consistency, snapshot)
- 49 unit tests for testrunner internals

QA adversarial suite (21 tests, all PASS):
- Idempotent create mode/RF mismatch detection
- Heartbeat mode downgrade prevention (F3)
- sync_all/sync_quorum partial replica enforcement (F1)
- Concurrent create race safety
- Failover/expand mode preservation
- Cleanup resilience when delete fails
- Master restart auto-register mode handling
- Superblock roundtrip all 3 modes
- Validate edge cases (mode×RF matrix)
- RequiredReplicas quorum math verification
- Sentinel error categorization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 01:06:51 -08:00
Ping Qiu
979a9b496c feat: Phase 8 CP8-1/2/3/4 -- ops control plane, multi-replica, CSI snapshots, observability
CP8-1: HTTP REST API (create/delete/lookup/list/assign/servers), blockapi Go
client with multi-master failover, 5 shell commands, HTML dashboard at /block/.

CP8-2: RF=2/RF=3 multi-replica support -- ShipperGroup fan-out, distributed
sync, health scoring, segment-based scrub, gated promotion (heartbeat
freshness + WAL LSN + role checks), failover/rebuild for N>2 replicas.

CP8-3: CSI snapshot + expansion -- CreateSnapshot/DeleteSnapshot/ListSnapshots
RPCs, NodeExpandVolume with iSCSI rescan, snapshot ID helpers, 20 adversarial
tests covering concurrent ops, edge cases, and error injection.

CP8-4: Observability -- EngineMetrics atomic counters for flusher/group-commit/
WAL-shipper/scrub, 10 new Prometheus metrics, barrier_lag_lsn SLO gauge,
failover/promotion/rebuild counters, request ID correlation in master gRPC
logs, baseline regression framework with 7 hard-fail conditions.

Total: 63 files, ~11.2K LOC, 160+ new tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 00:05:17 -08:00
Ping Qiu
8b2b5f6f66 feat: Phase 6 CP6-3 -- failover + rebuild in Kubernetes, 126 tests
Wire low-level fencing primitives to master/VS control plane and CSI:

- Proto: replica/rebuild address fields on assignment/info/response messages
- Assignment queue: retain-until-confirmed (Peek+Confirm), stale epoch pruning
- VS assignment receiver: processes assignments from HeartbeatResponse
- BlockService replication: ProcessAssignments, deterministic ports (FNV hash)
- Registry replica tracking: SetReplica/ClearReplica/SwapPrimaryReplica
- CreateBlockVolume: primary + replica, enqueues assignments, single-copy mode
- Failover: lease-aware promotion, deferred timers with cancellation on reconnect
- ControllerPublish: returns fresh primary iSCSI address after failover
- Recovery: recoverBlockVolumes drains pendingRebuilds, enqueues Rebuilding
- Real integration tests on M02: failover address switch, rebuild data
  consistency, full lifecycle failover+rebuild (3 tests, all PASS)

Review fixes (12 findings, 5 High, 5 Medium, 2 Low):
- R1-1: AllocateBlockVolume returns replication ports
- R1-2: setupPrimaryReplication starts rebuild server
- R1-3: VS sends periodic block heartbeat for assignment confirmation
- R2-F1: LastLeaseGrant set before Register (no stale-lease race)
- R2-F2: Deferred promotion timers cancelled on VS reconnect
- R2-F3: SwapPrimaryReplica uses RoleToWire instead of uint32(1)
- R2-F4: DeleteBlockVolume deletes replica (best-effort)
- R2-F5: SwapPrimaryReplica computes epoch atomically under lock
- QA: SetReplica removes old replica from byServer index (BUG-QA-CP63-1)

126 CP6-3 tests (67 dev + 48 QA + 8 integration + 3 real).
Cumulative Phase 6: 352 tests. All PASS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 00:52:05 -08:00
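
Note on 8b2b5f6f66: "deterministic ports (FNV hash)" suggests deriving a port from a stable key so that every party computes the same value without coordination. A minimal sketch, assuming the volume name is the key and that ports are drawn from a range [base, base+span); the function name, the choice of the 32-bit FNV-1a variant, and the range parameters are all assumptions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// replicaPort maps a volume name to a port in [base, base+span).
// Hypothetical helper: the real key and port range are not given in the commit message.
func replicaPort(volumeName string, base, span uint32) uint32 {
	if span == 0 {
		return base
	}
	h := fnv.New32a()
	h.Write([]byte(volumeName)) // Write on a hash.Hash never returns an error
	return base + h.Sum32()%span
}

func main() {
	fmt.Println(replicaPort("vol-a", 20000, 1000))
	fmt.Println(replicaPort("vol-a", 20000, 1000)) // same input, same port
}
```

A scheme like this can map two different names to the same port, so a real implementation would still need to handle bind conflicts.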
Ping Qiu
5a9a52f2d0 feat: Phase 6 CP6-2 -- CSI control-plane integration + csi-sanity/k3s validation
CP6-2 wires the CSI driver to SeaweedFS master/volume-server control plane:
- Proto: block volume messages in master.proto/volume_server.proto, codegen
- Master registry: in-memory BlockVolumeRegistry with Pending->Active status,
  full/delta heartbeat, inflight lock, placement (fewest volumes)
- VS gRPC: AllocateBlockVolume/DeleteBlockVolume handlers, shared naming
- Master RPCs: CreateBlockVolume (retry up to 3 servers), Delete, Lookup
- Heartbeat: block volume fields wired into bidirectional stream
- CSI Controller: VolumeBackend interface (Local + Master), returns volume_context
- CSI Node: reads volume_context for remote targets, staged map + IQN derivation
- Mode flag: --mode=controller/node/all, --master for control-plane
- K8s manifests: csi-driver.yaml, csi-controller.yaml, csi-node.yaml

csi-sanity conformance (33 pass, 58 skip) found 6 bugs:
- BUG-SANITY-1/2/3: missing VolumeCapabilities/VolumeCapability validation
- BUG-SANITY-4: NodePublish used mount instead of bind mount
- BUG-SANITY-5: NodeUnpublish didn't remove target path
- BUG-SANITY-6: NodeUnpublish failed on unmounted path

k3s Level 4 (PVC->Pod data persistence) found 1 bug:
- BUG-K3S-1: IsLoggedIn didn't handle iscsiadm exit code 21

226 CSI tests + 54 server tests = 280 new tests, all passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 11:01:08 -08:00
Ping Qiu
7c07d9c95a feat: Phase 4A CP4b-3 -- assignment processing, 2 bug fixes, 20 QA tests
Add ProcessBlockVolumeAssignments to BlockVolumeStore and wire
AssignmentSource/AssignmentCallback into the heartbeat collector's
Run() loop. Assignments are fetched and applied each tick after
status collection.

Bug fixes:
- BUG-CP4B3-1: TOCTOU between GetBlockVolume and HandleAssignment.
  Added withVolume() helper that holds RLock across lookup+operation,
  preventing RemoveBlockVolume from closing the volume mid-assignment.
- BUG-CP4B3-2: Data race on callback fields read by Run() goroutine.
  Made StatusCallback/AssignmentSource/AssignmentCallback private,
  added cbMu mutex and SetXxx() setter methods. Lock held only for
  load/store, not during callback execution.

7 dev tests + 13 QA adversarial tests = 20 new tests.
972 total unit tests, all passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 11:34:06 -08:00
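
Note on 7c07d9c95a (BUG-CP4B3-1): the withVolume() fix closes a check-then-use gap by holding the read lock across both the lookup and the operation. The sketch below shows the pattern with hypothetical store and blockVol types; it is not the project's code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errVolumeNotFound = errors.New("block volume not found")

// blockVol is a hypothetical volume handle.
type blockVol struct {
	closed bool
}

type store struct {
	mu      sync.RWMutex
	volumes map[string]*blockVol
}

// withVolume runs fn while holding the read lock, so remove() cannot close
// the volume between the lookup and the operation.
// fn must not call remove() on the same store, or it would deadlock.
func (s *store) withVolume(name string, fn func(*blockVol) error) error {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.volumes[name]
	if !ok {
		return errVolumeNotFound
	}
	return fn(v)
}

// remove takes the write lock, so it waits for in-flight withVolume calls.
func (s *store) remove(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if v, ok := s.volumes[name]; ok {
		v.closed = true
		delete(s.volumes, name)
	}
}

func main() {
	s := &store{volumes: map[string]*blockVol{"vol1": {}}}
	err := s.withVolume("vol1", func(v *blockVol) error {
		if v.closed {
			return errors.New("volume closed mid-operation")
		}
		return nil
	})
	fmt.Println(err)
	s.remove("vol1")
	fmt.Println(s.withVolume("vol1", func(*blockVol) error { return nil }))
}
```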
Ping Qiu
a089bf6828 feat: Phase 4A CP4b-2 -- heartbeat collector, 3 bug fixes, 9 QA tests
BlockVolumeHeartbeatCollector periodically collects block volume status
via callback (standalone, no gRPC wiring yet). Store() accessor on
BlockService. Three bugs found by QA and fixed: Stop-before-Run deadlock
(BUG-CP4B2-1), zero interval panic (BUG-CP4B2-2), callback panic crashes
goroutine (BUG-CP4B2-3). 12 new tests (3 dev + 9 QA adversarial).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 10:20:27 -08:00
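
Note on a089bf6828: the three bugs named above (Stop-before-Run deadlock, zero interval panic, callback panic) each have a common Go remedy, sketched below. The collector type here is hypothetical and the one-second default interval is an arbitrary choice; the actual fixes in BlockVolumeHeartbeatCollector may differ.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type collector struct {
	interval time.Duration
	callback func()
	stopCh   chan struct{}
	doneCh   chan struct{}
	stopOnce sync.Once
	started  atomic.Bool
}

// newCollector clamps non-positive intervals, since time.NewTicker panics on them.
func newCollector(interval time.Duration, cb func()) *collector {
	if interval <= 0 {
		interval = time.Second // assumed default
	}
	return &collector{
		interval: interval,
		callback: cb,
		stopCh:   make(chan struct{}),
		doneCh:   make(chan struct{}),
	}
}

// safeCall isolates a panicking callback so it cannot kill the Run goroutine.
func (c *collector) safeCall() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	c.callback()
	return false
}

// Run ticks until Stop is called.
func (c *collector) Run() {
	c.started.Store(true)
	defer close(c.doneCh)
	t := time.NewTicker(c.interval)
	defer t.Stop()
	for {
		select {
		case <-c.stopCh:
			return
		case <-t.C:
			c.safeCall()
		}
	}
}

// Stop waits for Run to exit only if Run has started, so calling Stop first
// does not block forever.
func (c *collector) Stop() {
	c.stopOnce.Do(func() { close(c.stopCh) })
	if c.started.Load() {
		<-c.doneCh
	}
}

func main() {
	c := newCollector(0, func() { panic("callback bug") })
	c.Stop() // returns immediately: Run was never started
	fmt.Println("interval:", c.interval, "panicked:", c.safeCall())
}
```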
Ping Qiu
ffdde15bcd feat: Phase 4A CP4b-1 -- wire types, conversion helpers, heartbeat collection
Add BlockVolumeInfoMessage, BlockVolumeShortInfoMessage, BlockVolumeAssignment
wire-type structs (proto-shaped Go structs). Add conversion helpers with
DiskType plumbing, overflow-safe LeaseTTLToWire, validated RoleFromWire.
Add CollectBlockVolumeHeartbeat on BlockVolumeStore. 9 new tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 09:34:00 -08:00
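
Note on ffdde15bcd: "overflow-safe LeaseTTLToWire" presumably means clamping a duration before narrowing it to a fixed-width wire field. The sketch below assumes the wire field is a uint32 count of milliseconds; that unit and width are guesses, and leaseTTLToWire is a hypothetical reimplementation rather than the project's helper.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// leaseTTLToWire converts a TTL to milliseconds, clamping to the uint32 range
// instead of letting the narrowing conversion wrap around.
func leaseTTLToWire(d time.Duration) uint32 {
	if d <= 0 {
		return 0
	}
	ms := d.Milliseconds()
	if ms > math.MaxUint32 {
		return math.MaxUint32
	}
	return uint32(ms)
}

func main() {
	fmt.Println(leaseTTLToWire(30 * time.Second))
	fmt.Println(leaseTTLToWire(-time.Second))
	fmt.Println(leaseTTLToWire(time.Duration(math.MaxInt64)) == math.MaxUint32)
}
```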
Ping Qiu
80801b0fac feat: Phase 3 — performance tuning, iSCSI session refactor, store integration
Phase 3 delivers five checkpoints:

CP1 Engine Tuning: BlockVolConfig tunables, 256-shard DirtyMap, adaptive
group commit (low-watermark immediate flush), WAL pressure handling with
backpressure and ErrWALFull timeout.

CP2 iSCSI Session Refactor: RX/TX goroutine split with respCh (cap 64),
txLoop for serialized response writes, StatSN assignment modes. Login
phase stays single-goroutine; full-duplex after login.

CP3 Store Integration: BlockVolAdapter (iscsi.BlockDevice interface),
BlockVolumeStore management, BlockService in volume_server_block.go,
CLI flags (--block.listen/dir/iqn.prefix), sw-block-attach.sh helper.

CP5 Concurrency Hardening: WAL reuse guard (LSN validation in ReadLBA),
opsOutstanding counter with beginOp/endOp + Close drain, appendWithRetry
shared by WriteLBA and TrimLBA, flusher LSN guard in FlushOnce.

Bug fixes (P3-BUG-2–11): unbounded pending queue cap, Data-Out timeout,
flusher error logging, GroupCommitter panic recovery, Close vs concurrent
ops guard, target shutdown race, WAL-full retry vs Close, WRITE SAME(16)
for XFS, MODE SENSE(10) + VPD 0xB0/0xB2 for Linux kernel compatibility.

797 tests passing (517 engine + 280 iSCSI), go vet clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 10:43:34 -08:00
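
Note on 80801b0fac (CP1): a 256-shard map reduces lock contention by giving each shard its own lock and routing keys by a cheap function of the key. The sketch below assumes LBA keys routed by modulo and a uint64 WAL offset as the value; the real DirtyMap's key routing and value type are not described in the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

const shardCount = 256

// shard is one independently locked slice of the key space.
type shard struct {
	mu sync.RWMutex
	m  map[uint64]uint64 // LBA -> WAL offset (assumed value type)
}

type dirtyMap struct {
	shards [shardCount]shard
}

func newDirtyMap() *dirtyMap {
	d := &dirtyMap{}
	for i := range d.shards {
		d.shards[i].m = make(map[uint64]uint64)
	}
	return d
}

// shardFor routes an LBA to its shard; writers to different shards never contend.
func (d *dirtyMap) shardFor(lba uint64) *shard {
	return &d.shards[lba%shardCount]
}

func (d *dirtyMap) Put(lba, walOffset uint64) {
	s := d.shardFor(lba)
	s.mu.Lock()
	s.m[lba] = walOffset
	s.mu.Unlock()
}

func (d *dirtyMap) Get(lba uint64) (uint64, bool) {
	s := d.shardFor(lba)
	s.mu.RLock()
	defer s.mu.RUnlock()
	off, ok := s.m[lba]
	return off, ok
}

func (d *dirtyMap) Delete(lba uint64) {
	s := d.shardFor(lba)
	s.mu.Lock()
	delete(s.m, lba)
	s.mu.Unlock()
}

func main() {
	d := newDirtyMap()
	d.Put(5, 4096)
	off, ok := d.Get(5)
	fmt.Println(off, ok)
}
```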
Chris Lu
da4edb5fe6 Fix live volume move tail timestamp (#8440)
* Improve move tail timestamp

* Add move tail timestamp integration test

* Simulate traffic during move
2026-02-24 20:07:26 -08:00
Chris Lu
e596542295 Move SQL engine and PostgreSQL server to their own binaries (#8417)
* Drop SQL engine and PostgreSQL server

* Split SQL tooling into weed-db and weed-sql

* move

* fix building
2026-02-23 16:27:08 -08:00
Chris Lu
57ab99d13e fix: generate topology uuid uniformly in single-master mode (#8405)
* fix: ensure topology uuid is generated in single master setups

* ensureTopologyId adds a Hashicorp-aware implementation

* simplify
2026-02-22 23:45:48 -08:00
Chris Lu
b5f3094619 fix format of internal node URLs in master UI templates
2026-02-22 13:47:29 -08:00
Chris Lu
e4b70c2521 go fix
2026-02-20 18:42:00 -08:00
Konstantin Lebedev
01b3125815 [shell]: volume balance capacity by min volume density (#8026)
volume balance by min volume density and active volumes
2026-02-19 13:30:59 -08:00
Chris Lu
7b8df39cf7 s3api: add AttachUserPolicy/DetachUserPolicy/ListAttachedUserPolicies (#8379)
* iam: add XML responses for managed user policy APIs

* s3api: implement attach/detach/list attached user policies

* s3api: add embedded IAM tests for managed user policies

* iam: update CredentialStore interface and Manager for managed policies

Updated the `CredentialStore` interface to include `AttachUserPolicy`,
`DetachUserPolicy`, and `ListAttachedUserPolicies` methods.
The `CredentialManager` was updated to delegate these calls to the store.
Added common error variables for policy management.

* iam: implement managed policy methods in MemoryStore

Implemented `AttachUserPolicy`, `DetachUserPolicy`, and
`ListAttachedUserPolicies` in the MemoryStore.
Also ensured deep copying of identities includes PolicyNames.

* iam: implement managed policy methods in PostgresStore

Modified Postgres schema to include `policy_names` JSONB column in `users`.
Implemented `AttachUserPolicy`, `DetachUserPolicy`, and `ListAttachedUserPolicies`.
Updated user CRUD operations to handle policy names persistence.

* iam: implement managed policy methods in remaining stores

Implemented user policy management in:
- `FilerEtcStore` (partial implementation)
- `IamGrpcStore` (delegated via GetUser/UpdateUser)
- `PropagatingCredentialStore` (to broadcast updates)
Ensures cluster-wide consistency for policy attachments.

* s3api: refactor EmbeddedIamApi to use managed policy APIs

- Refactored `AttachUserPolicy`, `DetachUserPolicy`, and `ListAttachedUserPolicies`
  to use `e.credentialManager` directly.
- Fixed a critical error suppression bug in `ExecuteAction` that always
  returned success even on failure.
- Implemented robust error matching using string comparison fallbacks.
- Improved consistency by reloading configuration after policy changes.

* s3api: update and refine IAM integration tests

- Updated tests to use a real `MemoryStore`-backed `CredentialManager`.
- Refined test configuration synchronization using `sync.Once` and
  manual deep-copying to prevent state corruption.
- Improved `extractEmbeddedIamErrorCodeAndMessage` to handle more XML
  formats robustly.
- Adjusted test expectations to match current AWS IAM behavior.

* fix compilation

* visibility

* ensure 10 policies

* reload

* add integration tests

* Guard raft command registration

* Allow IAM actions in policy tests

* Validate gRPC policy attachments

* Revert Validate gRPC policy attachments

* Tighten gRPC policy attach/detach

* Improve IAM managed policy handling

* Improve managed policy filters
2026-02-19 12:26:27 -08:00
Chris Lu
3300874cb5 filer: add default log purging to master maintenance scripts (#8359)
* filer: add default log purging to master maintenance scripts

* filer: fix default maintenance scripts to include full set of tasks

* filer: refactor maintenance scripts to avoid duplication
2026-02-16 16:58:15 -08:00
Lisandro Pin
a9d12a0792 Implement full scrubbing for EC volumes (#8318)
Implement full scrubbing for EC volumes.
2026-02-16 15:09:01 -08:00
Lisandro Pin
fbe7dd32c2 Implement full scrubbing for regular volumes (#8254)
Implement full scrubbing for regular volumes.
2026-02-13 15:47:29 -08:00
Chris Lu
b08bb8237c Fix master leader election startup issue (#8340)
* Fix master leader election startup issue

Fixes #error-log-leader-not-selected-yet

* Fix master leader election startup issue

This change improves server address comparison using the 'Equals' method and handles recursion in topology leader lookup, resolving the 'leader not selected yet' error during master startup.

* Merge user improvements: use MaybeLeader for non-blocking checks

* not useful test

* Address code review: optimize Equals, fix deadlock in IsLeader, safe access in Leader
2026-02-13 15:39:39 -08:00
Lisandro Pin
e657e7d827 Implement local scrubbing for EC volumes. (#8283)
2026-02-11 11:04:08 -08:00
Chris Lu
1c62808c0e iceberg: wire pagination for list namespaces/tables REST APIs (#8275)
* s3api/iceberg: wire list pagination tokens and page size

* fmt

* Update weed/s3api/iceberg/iceberg.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-09 21:46:55 -08:00
Chris Lu
839028b2e0 Fix EC rebuild shard detection (#8265)
Fix EC rebuild shard counting
2026-02-09 12:34:38 -08:00
Lisandro Pin
1a5679a5eb Implement a VolumeEcStatus() RPC for volume servers. (#8006)
Just like `VolumeStatus()`, this call allows inspecting details for
a given EC volume - including number of files and their total size.
2026-02-09 11:52:08 -08:00
Chris Lu
cb9e21cdc5 Normalize hashicorp raft peer ids (#8253)
* Normalize raft voter ids

* 4.11

* Update raft_hashicorp.go
2026-02-09 07:46:34 -08:00
Chris Lu
c284e51d20 fix: multipart upload ETag calculation (#8238)
* fix multipart etag

* address comments

* clean up

* clean up

* optimization

* address comments

* unquoted etag

* dedup

* upgrade

* clean

* etag

* return quoted tag

* quoted etag

* debug

* s3api: unify ETag retrieval and quoting across handlers

Refactor newListEntry to take *S3ApiServer and use getObjectETag,
and update setResponseHeaders to use the same logic. This ensures
consistent ETags are returned for both listing and direct access.

* s3api: implement ListObjects deduplication for versioned buckets

Handle duplicate entries between the main path and the .versions
directory by prioritizing the latest version when bucket versioning
is enabled.

* s3api: cleanup stale main file entries during versioned uploads

Add explicit deletion of pre-existing "main" files when creating new
versions in versioned buckets. This prevents stale entries from
appearing in bucket listings and ensures consistency.

* s3api: fix cleanup code placement in versioned uploads

Correct the placement of rm calls in completeMultipartUpload and
putVersionedObject to ensure stale main files are properly deleted
during versioned uploads.

* s3api: improve getObjectETag fallback for empty ExtETagKey

Ensure that when ExtETagKey exists but contains an empty value,
the function falls through to MD5/chunk-based calculation instead
of returning an empty string.
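
A minimal sketch of the fall-through described above, assuming a simplified entry type. The `entry` struct, the `extETagKey` value, and `objectETag` are hypothetical stand-ins for illustration, not the actual s3api definitions, and a plain content MD5 stands in for the MD5/chunk-based calculation:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// extETagKey and entry are hypothetical stand-ins, not the real SeaweedFS definitions.
const extETagKey = "etag"

type entry struct {
	Extended map[string][]byte
	Content  []byte
}

// objectETag returns the stored ETag only when it is non-empty;
// otherwise it falls through to a content-based MD5.
func objectETag(e entry) string {
	if v, ok := e.Extended[extETagKey]; ok && len(v) > 0 {
		return string(v)
	}
	sum := md5.Sum(e.Content)
	return hex.EncodeToString(sum[:])
}

func main() {
	empty := entry{Extended: map[string][]byte{extETagKey: {}}, Content: []byte("hello")}
	plain := entry{Content: []byte("hello")}
	// An empty stored value is ignored, so both entries get the same computed ETag.
	fmt.Println(objectETag(empty) == objectETag(plain))
}
```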

* s3api: fix test files for new newListEntry signature

Update test files to use the new newListEntry signature where the
first parameter is *S3ApiServer. Created mockS3ApiServer to properly
test owner display name lookup functionality.

* s3api: use filer.ETag for consistent Md5 handling in getEtagFromEntry

Change getEtagFromEntry fallback to use filer.ETag(entry) instead of
filer.ETagChunks to ensure legacy entries with Attributes.Md5 are
handled consistently with the rest of the codebase.

* s3api: optimize list logic and fix conditional header logging

- Hoist bucket versioning check out of per-entry callback to avoid
  repeated getVersioningState calls
- Extract appendOrDedup helper function to eliminate duplicate
  dedup/append logic across multiple code paths
- Change If-Match mismatch logging from glog.Errorf to glog.V(3).Infof
  and remove DEBUG prefix for consistency

* s3api: fix test mock to properly initialize IAM accounts

Fixed nil pointer dereference in TestNewListEntryOwnerDisplayName by
directly initializing the IdentityAccessManagement.accounts map in the
test setup. This ensures newListEntry can properly look up account
display names without panicking.

* cleanup

* s3api: remove premature main file cleanup in versioned uploads

Removed incorrect cleanup logic that was deleting main files during
versioned uploads. This was causing test failures because it deleted
objects that should have been preserved as null versions when
versioning was first enabled. The deduplication logic in listing is
sufficient to handle duplicate entries without deleting files during
upload.

* s3api: add empty-value guard to getEtagFromEntry

Added the same empty-value guard used in getObjectETag to prevent
returning quoted empty strings. When ExtETagKey exists but is empty,
the function now falls through to filer.ETag calculation instead of
returning "".

* s3api: fix listing of directory key objects with matching prefix

Revert prefix handling logic to use strings.TrimPrefix instead of
checking HasPrefix with empty string result. This ensures that when a
directory key object exactly matches the prefix (e.g. prefix="dir/",
object="dir/"), it is correctly handled as a regular entry instead of
being skipped or incorrectly processed as a common prefix. Also fixed
missing variable definition.

* s3api: refactor list inline dedup to use appendOrDedup helper

Refactored the inline deduplication logic in listFilerEntries to use the
shared appendOrDedup helper function. This ensures consistent behavior
and reduces code duplication.

* test: fix port allocation race in s3tables integration test

Updated startMiniCluster to find all required ports simultaneously using
findAvailablePorts instead of sequentially. This prevents race conditions
where the OS reallocates a port that was just released, causing multiple
services (e.g. Filer and Volume) to be assigned the same port and fail
to start.
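
One way to reserve several ports at once is to keep every listener open until all ports have been collected, so the OS cannot hand out the same port twice. The sketch below reuses the name `findAvailablePorts` from the message, but the body is an assumption and not the test helper's actual code:

```go
package main

import (
	"fmt"
	"net"
)

// findAvailablePorts reserves n distinct free TCP ports by keeping every
// listener open until all ports have been collected. Illustrative only.
func findAvailablePorts(n int) ([]int, error) {
	listeners := make([]net.Listener, 0, n)
	// Close all listeners only after every port has been picked.
	defer func() {
		for _, l := range listeners {
			l.Close()
		}
	}()

	ports := make([]int, 0, n)
	for i := 0; i < n; i++ {
		// Port 0 asks the OS for any free port on the loopback interface.
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			return nil, err
		}
		listeners = append(listeners, l)
		ports = append(ports, l.Addr().(*net.TCPAddr).Port)
	}
	return ports, nil
}

func main() {
	ports, err := findAvailablePorts(3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(ports), "distinct ports reserved")
}
```

A small window remains between closing the listeners and the services binding to the ports, but the ports returned by one call are distinct from each other.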
2026-02-06 21:54:43 -08:00
Lisandro Pin
2cda4289f4 Add a version token on RPCs to read/update volume server states. (#8191)
* Add a version token on `GetState()`/`SetState()` RPCs for volume server states.

* Make state version a property of `VolumeServerState` instead of an in-memory counter.

Also extend state atomicity to reads, instead of just writes.
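
The message does not spell out how the version token is checked. A common pattern is optimistic concurrency, sketched below under that assumption: a write succeeds only if the caller presents the version it last read. The `serverState` and `stateStore` types and their fields are hypothetical, not the real `VolumeServerState` message:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errVersionMismatch = errors.New("state version mismatch")

// serverState is a hypothetical stand-in; the version travels with the state.
type serverState struct {
	Maintenance bool
	Version     uint32
}

type stateStore struct {
	mu    sync.Mutex
	state serverState
}

// GetState returns a consistent snapshot, including the current version.
func (s *stateStore) GetState() serverState {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.state
}

// SetState applies the update only if expected matches the stored version.
func (s *stateStore) SetState(maintenance bool, expected uint32) (serverState, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if expected != s.state.Version {
		return s.state, errVersionMismatch
	}
	s.state.Maintenance = maintenance
	s.state.Version++
	return s.state, nil
}

func main() {
	store := &stateStore{}
	cur := store.GetState()
	_, err := store.SetState(true, cur.Version)
	fmt.Println("first write error:", err)
	_, err = store.SetState(false, cur.Version) // stale version
	fmt.Println("stale write error:", err)
}
```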
2026-02-06 10:58:43 -08:00
Lisandro Pin
9d751a7b61 Contrib/volume scrub local (#8226)
2026-02-05 14:44:12 -08:00
Lisandro Pin
f84b70c362 Implement index (fast) scrubbing for regular/EC volumes. (#8207)
Implement index (fast) scrubbing for regular/EC volumes via `ScrubVolume()`/`ScrubEcVolume()`.

Also rearranges existing index test files for reuse across unit tests for different modules.
2026-02-05 11:27:03 -08:00
Chris Lu
72a8f598f2 Fix Maintenance Task Sorting and Refactor Log Persistence (#8199)
* fix float stepping

* do not auto refresh

* only logs when non 200 status

* fix maintenance task sorting and cleanup redundant handler logic

* Refactor log retrieval to persist to disk and fix slowness

- Move log retrieval to disk-based persistence in GetMaintenanceTaskDetail
- Implement background log fetching on task completion in worker_grpc_server.go
- Implement async background refresh for in-progress tasks
- Completely remove blocking gRPC calls from the UI path to fix 10s timeouts
- Cleanup debug logs and performance profiling code

* Ensure consistent deterministic sorting in config_persistence cleanup

* Replace magic numbers with constants and remove debug logs

- Added descriptive constants for truncation limits and timeouts in admin_server.go and worker_grpc_server.go
- Replaced magic numbers with these constants throughout the codebase
- Verified removal of stdout debug printing
- Ensured consistent truncation logic during log persistence

* Address code review feedback on history truncation and logging logic

- Fix AssignmentHistory double-serialization by copying task in GetMaintenanceTaskDetail
- Fix handleTaskCompletion logging logic (mutually exclusive success/failure logs)
- Remove unused Timeout field from LogRequestContext and sync select timeouts with constants
- Ensure AssignmentHistory is only provided in the top-level field for better JSON structure

* Implement goroutine leak protection and request deduplication

- Add request deduplication in RequestTaskLogs to prevent multiple concurrent fetches for the same task
- Implement safe cleanup in timeout handlers to avoid race conditions in pendingLogRequests map
- Add a 10s cooldown for background log refreshes in GetMaintenanceTaskDetail to prevent spamming
- Ensure all persistent log-fetching goroutines are bounded and efficiently managed
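
The request deduplication in the first bullet can be sketched as a mutex-guarded set of in-flight task IDs. This is an illustration under assumptions: `logRequests`, `begin`, and `finish` are hypothetical names, and the actual `RequestTaskLogs` and `pendingLogRequests` code may differ:

```go
package main

import (
	"fmt"
	"sync"
)

// logRequests tracks which task IDs currently have a log fetch in flight.
type logRequests struct {
	mu      sync.Mutex
	pending map[string]struct{}
}

func newLogRequests() *logRequests {
	return &logRequests{pending: make(map[string]struct{})}
}

// begin reports whether the caller should start a fetch for taskID.
// It returns false when a fetch for the same task is already pending.
func (r *logRequests) begin(taskID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.pending[taskID]; ok {
		return false
	}
	r.pending[taskID] = struct{}{}
	return true
}

// finish clears the pending marker; deleting a missing key is a no-op,
// so it is safe to call from both completion and timeout paths.
func (r *logRequests) finish(taskID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.pending, taskID)
}

func main() {
	reqs := newLogRequests()
	fmt.Println(reqs.begin("task-1")) // first caller starts the fetch
	fmt.Println(reqs.begin("task-1")) // concurrent duplicate is skipped
	reqs.finish("task-1")
	fmt.Println(reqs.begin("task-1")) // allowed again after completion
}
```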

* Fix potential nil pointer panics in maintenance handlers

- Add nil checks for adminServer in ShowTaskDetail, ShowMaintenanceWorkers, and UpdateTaskConfig
- Update getMaintenanceQueueData to return a descriptive error instead of nil when adminServer is uninitialized
- Ensure internal helper methods consistently check for adminServer initialization before use

* Strictly enforce disk-only log reading

- Remove background log fetching from GetMaintenanceTaskDetail to prevent timeouts and network calls during page view
- Remove unused lastLogFetch tracking fields to clean up dead code
- Ensure logs are only updated upon task completion via handleTaskCompletion

* Refactor GetWorkerLogs to read from disk

- Update /api/maintenance/workers/:id/logs endpoint to use configPersistence.LoadTaskExecutionLogs
- Remove synchronous gRPC call RequestTaskLogs to prevent timeouts and bad gateway errors
- Ensure consistent log retrieval behavior across the application (disk-only)

* Fix timestamp parsing in log viewer

- Update task_detail.templ JS to handle both ISO 8601 strings and Unix timestamps
- Fix "Invalid time value" error when displaying logs fetched from disk
- Regenerate templates

* master: fallback to HDD if SSD volumes are full in Assign

* worker: improve EC detection logging and fix skip counters

* worker: add Sync method to TaskLogger interface

* worker: implement Sync and ensure logs are flushed before task completion

* admin: improve task log retrieval with retries and better timeouts

* admin: robust timestamp parsing in task detail view
2026-02-04 08:48:55 -08:00
Chris Lu
f66a23b472 Fix: filer not yet available in s3.configure (#8198)
* Fix: Initialize filer CredentialManager with filer address

* The fix involves checking for directory existence before creation.

* adjust error message

* Fix: Implement FilerAddressSetter in PropagatingCredentialStore

* Refactor: Reorder credential manager initialization in filer server

* refactor
2026-02-03 17:43:58 -08:00
Lisandro Pin
ff5a8f0579 Implement RPC skeleton for regular/EC volumes scrubbing. (#8187)
* Implement RPC skeleton for regular/EC volumes scrubbing.

See https://github.com/seaweedfs/seaweedfs/issues/8018 for details.

* Minor proto improvements for `ScrubVolume()`, `ScrubEcVolume()`:

  - Add fields for scrubbing details in `ScrubVolumeResponse` and `ScrubEcVolumeResponse`,
    instead of reporting these through RPC errors.
  - Return a list of broken shards when scrubbing EC volumes, via `EcShardInfo`.
2026-02-02 17:55:04 -08:00
Lisandro Pin
345ac950b6 Add volume server RPCs to read and update state flags. (#8186)
* Bootstrap persistent state for volume servers.

This PR implements logic to load/save persistent state information for storages
associated with volume servers, and to report state changes back to masters
via heartbeat messages.

More work ensues!

See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.

* Add volume server RPCs to read and update state flags.
2026-02-02 16:22:17 -08:00
Lisandro Pin
9638d37fe2 Block RPC write operations on volume servers when maintenance mode is enabled (#8115)
* Bootstrap persistent state for volume servers.

This PR implements logic to load/save persistent state information for storages
associated with volume servers, and to report state changes back to masters
via heartbeat messages.

More work ensues!

See https://github.com/seaweedfs/seaweedfs/issues/7977 for details.

* Block RPC operations writing to volume servers when maintenance mode is on.
2026-02-02 13:21:02 -08:00
Lisandro Pin
9e15823855 Have masters update DataNode details based on state heartbeats from volume servers. (#8017)
2026-01-29 21:51:46 -08:00