mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-30 13:36:23 +00:00
4.29
1908 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
85ca3cb757 |
filer: warm-up + fail-closed cooling for POSIX locks on owner (re)start (#9673)
After a (re)start the owner defers would-be grants for posixLockWarmup while mounts re-assert, trusting only locally-visible conflicts, so it does not double-grant from empty state; a deferred grant is a retry for SetLkw and EAGAIN for non-blocking SetLk, never a spurious grant. Cooling now fail-closes: if the previous owner is unreachable during a ring change, defer rather than risk a double-grant. readyAt is atomic so the handler reads it without locking. |
||
|
|
a3c0baa9b0 |
filer: cooling-off dual-read for POSIX locks during ring changes (#9672)
While the ring changed within the last snapshot interval, a fresh owner asks the key's previous owner (LockRing.PriorOwner) whether it still holds a conflicting lock before granting TRY_LOCK or answering GET_LK, so it does not double-grant before re-assertion rebuilds its local state. The probe is marked cooling_probe so the previous owner answers from local state without recursing. PriorOwner uses the snapshot's prebuilt ring rather than rebuilding a hash ring per call. |
||
|
|
f8caaa4464 |
mount,filer: re-assert POSIX locks via keepalive (ownership migration + restart) (#9668)
* mount: renew POSIX lock leases via keepalive The mount tracks the inode keys it holds locks on and a background loop renews its session lease (KEEP_ALIVE) with each key's owner filer every 5s, within the filer's 15s TTL. A live mount is never reaped; a dead one stops renewing and owners reclaim its locks. Tracking is a superset: holds are added on grant and dropped only on owner release, so a still held lock is never under-renewed. * mount,filer: re-assert held POSIX locks via keepalive The owner filer holds POSIX advisory locks as in-memory soft state, so a key's owner change (ring rebalance) or an owner restart lost or stranded them: the new or restarted owner was blind to existing holders and would double-grant. Make the keepalive carry the mount's held lock ranges per key. The mount mirrors its own granted locks (posixOwn), and each tick re-asserts them to the key's current owner, which rebuilds that session's locks from the assertion — self -healing after a takeover or restart. The owner arbitrates re-asserted locks against other sessions so it never double-grants; a lock that lost a migration race is reported, not forced. A bare keepalive (no ranges) still just renews. |
||
|
|
c97b69f8a4 |
filer: session lease + reaping for POSIX locks (#9666)
* filer: session lease + reaping for POSIX locks A mount renews its session lease by keepalive (new KEEP_ALIVE op); the owner filer records last-seen per session and a background sweeper reaps the locks of leased sessions that stop renewing — a dead or partitioned mount. Only sessions that have renewed are leased, so this is inert until mounts run with -posixLock. * mount: route POSIX advisory locks to the owner filer (-posixLock) (#9665) mount: route POSIX advisory locks to the owner filer under -dlm With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to the inode's owner filer via the PosixLock RPC instead of the local table, so flock/fcntl are honored across mounts. Advisory locking rides the same switch as whole-file write coordination — and is therefore off under writeback cache, which implies single-writer. The mount calls its filer and relies on filer-side forwarding to reach the owner. Keys are the inode identity (HardLinkId else path); SetLkw is client-side polling with the FUSE cancel channel (no server wait queue); a per-mount session id namespaces owners; a local hint avoids a release RPC on every close. * mount,filer: bound posix-lock release RPCs and stop the reaper on shutdown The unlock/release RPCs run off the syscall path (close/flush) and used context.Background() with no deadline, so a slow or unreachable filer could hang close() indefinitely; bound them to 5s (they still aren't cancelled by an interrupt). The lease-reaping sweeper now selects on a stop channel that FilerServer.Shutdown closes, instead of looping for the process lifetime. |
||
|
|
fef49c2d75 |
filer: routed PosixLock RPC over the in-memory authority (#9664)
* filer: in-memory POSIX lock authority (Manager) Concurrent multi-inode authority over the per-inode Set: a Set per opaque inode key (path, or hl:<HardLinkId>) plus a session->keys index so a dead mount's locks reap in O(locks held). Lock state stays in memory like the distributed lock manager's, off the replicated meta-log. TryLock/Unlock/ GetLk/ReleasePosixOwner/ReleaseFlockOwner/ReleaseSession; empty sets and stale index entries are pruned on release. * filer: routed PosixLock RPC over the in-memory authority Adds the PosixLock RPC (try/unlock/get_lk + the flush/release owner drops) that the owner filer answers from its in-memory Manager. The request key is the inode identity ring key; a non-owner filer forwards one hop (is_moved-bounded), mirroring ObjectTransaction, so the owner's table stays the single authority under a stale ring view. Strictly non-blocking; SetLkw polling lives in the mount. |
||
|
|
2a4923e7e8 |
ObjectTransaction: filer-side forwarding via route_key (#9659)
A non-owner filer forwards the whole transaction to the ring owner of route_key, so the owner's per-path lock stays the single serialization point even when the caller's ring view is stale. is_moved bounds forwarding to one hop. The gateway stamps route_key on every routed builder via the shared objectRouteKey helper. Completes taking S3 object mutations off the distributed lock. |
||
|
|
1f0c366583 |
s3: route metadata-only self-copy off the distributed lock (#9638)
A non-versioned metadata-only self-copy (CopyObject with source == destination and the REPLACE directive) is a read-modify-write of one entry, which is why it held the distributed lock. It now routes to the owner as a serialized PATCH_EXTENDED: the owner merges the new managed metadata (set the replacements, delete the dropped keys) onto a fresh read of the entry under its per-path lock, so a concurrent change to non-managed keys (legal hold, retention, version id) is preserved instead of clobbered, and bumps mtime. PATCH_EXTENDED gains touch_mtime for the mtime bump. Versioned and suspended self-copies create a new version (already routed via the copy finalize) and the no-owner bootstrap keep the lock. |
||
|
|
fa7056dc6f |
s3: route object-lock version-specific deletes off the distributed lock (#9657)
A version-specific DELETE (real version or the null version, including object-lock WORM-checked ones and governance-bypass) now runs as one routed transaction on the object's owner instead of holding the distributed lock. For a real version: recompute the .versions pointer excluding the version (repoint-before-delete, so a crash leaves a recoverable orphan rather than a dangling pointer), then delete the version file, under the object's per-path lock. The null version is the regular object entry, deleted directly (no pointer). Object-lock buckets gate the delete on the version's WORM guards evaluated on the owner: legal hold (always) + retention (while not elapsed). Governance bypass scopes the retention guard to COMPLIANCE mode, so the filer allows a governance-mode delete while still denying compliance and legal hold — the gateway never reads the version. Three primitives make this expressible: - ObjectTransaction.condition_key: evaluate the condition against a named entry (the version) while the lock stays on lock_key (the object). - Recompute.exclude_name: omit a child from the scan, to repoint before delete. - WriteCondition.Clause gate_key/gate_value: scope IF_EXTENDED_TIME_ELAPSED to a mode, expressing governance bypass without a gateway-side read. |
||
|
|
db954b5503 |
s3: route versioned PutObject finalize off the DLM (#9631)
s3: route versioned PutObject finalize off the distributed lock A versioned write's finalize (flip the .versions pointer to the newest version, demote the prior latest) now runs as a single RECOMPUTE_LATEST ObjectTransaction on the object's owner filer, under its per-path lock, instead of the unserialized updateLatestVersionInDirectory. The version file is written first; the owner re-derives the pointer by scanning the directory. RECOMPUTE_LATEST gains size_to_key / mtime_to_key to cache the chosen version's size and mtime on the pointer, and demote_key / demote_value to stamp the displaced prior latest (NoncurrentSinceNs for lifecycle) when the pointer moves. Falls back to updateLatestVersionInDirectory when no owner is known yet. |
||
|
|
f037fc4dce |
s3: dial the object lock's primary filer directly (#9626)
* s3: dial the object lock's primary filer directly The S3 object write lock builds a fresh short-lived lock per write, each starting at the seed filer. When the seed isn't the key's hash-ring primary the filer forwards the request to the primary, and in multi-cluster setups that forward crosses clusters on every write. Give the lock client a view of the filer lock ring, fed by the master's LockRingUpdate broadcasts the gateway already receives, so it dials the primary directly. The view tracks filer membership by version; a stale view stays correct because the filer still forwards as a fallback. Also send the initial ring snapshot to S3 clients, not just filers. * s3: subscribe to lock-ring updates before starting the master loop The master delivers the initial LockRingUpdate once, on connect. Registering the callback after KeepConnectedToMaster started left a window where that first update could arrive before the handler was set and be dropped, delaying the ring view until the next membership change. Build the lock client and register the callback in the masters block before launching the loop; the filers block reuses that client (or creates a plain one when no masters are configured). * lock_manager: build the hash ring in a deterministic server order rebuildRing ranged over the server set (a map), whose iteration order is randomized per process. On a vnode hash collision the last writer into vnodeToServer wins, so two nodes holding the same server set could resolve the collision to different servers and disagree on the primary for keys near that slot. Now that the S3 gateway also computes PrimaryForKey, such a disagreement would route the same key to different filers and defeat per-path serialization. Iterate the servers in sorted order so the ring is identical on every node with the same set, regardless of discovery order. * lock_manager: skip redundant ring rebuilds, trim comments SetRing now ignores a non-zero version at or below the current one once a ring exists, so repeated LockRingUpdate broadcasts on reconnect no longer rebuild the ring. * s3: hold the lock-ring client on the server for route-by-key Store the object-write lock client on S3ApiServer so handlers can resolve a key's owner filer via PrimaryForKey. |
||
|
|
b4d2224e97 |
filer: let PATCH_EXTENDED replace Entry.content (#9654)
* filer: let PATCH_EXTENDED replace Entry.content PATCH_EXTENDED merges extended attributes under the per-path lock, reading the entry fresh, so concurrent patches to different keys don't clobber each other. Some single-key state lives in Entry.content rather than an extended attribute (e.g. the S3 bucket metadata blob). Add set_content/content to the mutation so a patch can replace content the same way -- read fresh, set content, preserve the rest -- letting a content write and an extended-attribute write on the same entry serialize on the lock instead of racing whole-entry rewrites. * Update weed/server/filer_grpc_server.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * filer: test set_content FileSize sync; note chosen content-patch approach Cover the FileSize behavior of a set_content patch: a file's size follows the new content length (including when it shrinks), a directory's stays zero. Also document, in the bucket-config design, that extending PATCH_EXTENDED with set_content is the implemented path for content-backed config. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> |
||
|
|
83195fc111 |
filer: reuse the caller's fetched entry in CreateEntry (#9645)
CreateEntry starts with a FindEntry to load the current entry. A conditional CreateEntry already fetched that entry to evaluate the precondition under the per-path lock, so the create repeated the lookup. Add an existing *Entry parameter: when non-nil it is used as the current entry and the internal lookup is skipped; nil keeps the lookup. The gRPC CreateEntry handler passes the entry it fetched for the precondition, removing the redundant read while the lock is held. All other callers pass nil. |
||
|
|
091aad59dc |
filer: add ObjectTransactionBatch for multi-key object writes (#9649)
A multi-object delete spans many keys that route to different owner filers. The gateway groups keys by owner and sends one batch per owner; the filer applies each transaction under its own per-path lock, independent of the others. A failed transaction (precondition or mutation error) is reported in its own response without aborting the rest, matching S3 multi-object semantics where each key succeeds or fails on its own. There is no cross-key atomicity, which S3 batch delete does not require. |
||
|
|
e2203b2a0b |
filer: add extended-attribute guard clauses for object-lock (#9648)
Routing object-lock buckets off the distributed lock needs the retention and legal-hold check to run atomically with the write, under the per-path lock. Move just the comparison into the filer, not the S3 semantics: two generic clause kinds on an extended attribute. IF_EXTENDED_NOT_EQUAL blocks while extended[ext_key] equals ext_value (a legal hold). IF_EXTENDED_TIME_ELAPSED blocks while extended[ext_key], read as a unix- second deadline, is in the future against the filer's clock (retention); a malformed deadline fails safe. The caller composes these from the object-lock state and, for a governance bypass, simply omits the retention clause once the bypass is authorized -- the filer makes no authorization decision and keeps no S3 knowledge. |
||
|
|
e71bac55e9 |
filer: add RECOMPUTE_LATEST mutation to ObjectTransaction (#9647)
Deleting a specific version that happens to be the latest needs the new latest re-derived from the remaining versions, and that scan must run under the same lock as the delete. The gateway can't do it atomically across RPCs. Add a RECOMPUTE_LATEST mutation: it scans a directory under the transaction lock, picks the child that sorts last (descending) or first by name, copies the mapped extended keys from it into a pointer entry, and stores its name under name_to_key. An empty directory clears the pointer keys. The filer stays mechanical and S3-agnostic: the caller, which knows the versioning scheme, supplies the sort direction and the key mappings. A missing pointer entry is a no-op, so a replayed transaction is idempotent. |
||
|
|
bf022ca018 |
filer: add ObjectTransaction for atomic multi-entry object writes (#9646)
A versioned object write touches several entries that must change together: the main object, a delete marker or version file, and the latest pointer on the .versions directory. Holding a distributed lock across separate RPCs to do this is what the per-path lock was meant to replace, but a single CreateEntry only covers one entry. Add ObjectTransaction: a request carries a lock_key (the object path), an optional WriteCondition, and an ordered list of mutations (PUT / DELETE / PATCH_EXTENDED). The filer holds the per-path lock on lock_key for the whole call, checks the condition against the entry at lock_key, then applies the mutations in order. Callers route the object's writes to its owner filer so the lock is authoritative across all of the object's entries. DELETE and PATCH of an absent entry are no-ops, so a replayed transaction is idempotent. PUT entries are metadata-scoped; data-bearing writes (chunks) are written before the transaction, as today. |
||
|
|
b18d3dc96c |
filer: evaluate a write precondition in CreateEntry (#9650)
Add an optional WriteCondition to CreateEntryRequest. When set, the filer evaluates it against the current entry while holding the per-path lock, so the check and the write are atomic on this filer, and returns PRECONDITION_FAILED when it does not hold. The caller must route the key's writes to the owner filer for the check to be authoritative. A condition is a list of clauses that all must hold (logical AND). One clause is the common case; several express what a single comparison cannot: an ETag set (If-Match / If-None-Match with multiple values), weak-ETag comparison, and compound conditions. ETag comparison mirrors the S3 gateway's precedence (stored Seaweed ETag attribute, then the Md5/chunk fallback) and follows RFC 7232 strong/weak rules, so results match without coupling the filer to S3 handling. Condition parsing and evaluation live in filer_grpc_server_condition.go. |
||
|
|
bce76e6e21 |
filer: serialize same-path mutations with a per-path lock (#9639)
CreateEntry is a FindEntry-then-write with no lock, so concurrent creates to the same path race: OExcl can admit two creators, and a conditional check-then-act has no atomicity. Add a per-path exclusive lock (util.LockTable, which evicts idle keys so it stays bounded) on the FilerServer and take it in CreateEntry, so the existence check and the write are atomic on this filer. This is the local serialization point that lets callers route a key's writes to its owner filer and drop the distributed lock for that key. AppendToEntry keeps its distributed lock for now; it can move to the per-path lock once its callers route to the owner. |
||
|
|
9021225591 |
master: accept volume-server Ping targets on follower masters (#9614)
cluster.check asks every master to ping every volume server, but the Ping gate validated volume-server targets only against the local topology. Only the leader receives volume-server heartbeats, so a follower's topology is empty and every probe through it failed with "unknown ping target ... of type volumeServer". Fall back to the volume-server set the master learns over its own MasterClient subscription to the leader, the same source the filer gate already trusts. The anti-SSRF intent is preserved: Ping still only dials recognized cluster members. |
||
|
|
5af7d12f04 |
fix(filer.sync): keep sync_offset fresh while the source is read-only (#9589)
* fix(filer.sync): keep sync_offset fresh while the source is read-only sync_offset holds the timestamp of the last replicated source event, so monitoring derives lag from now-sync_offset. A read-only source emits no metadata events, so the gauge froze at the last write and the derived lag grew without bound, making thresholds unusable. The source filer now sends an idle heartbeat carrying its current time while a subscriber is caught up to the buffer head. filer.sync uses it to advance the gauge, so now-sync_offset reflects real lag. Heartbeats are opt-in (client_supports_idle_heartbeat), are never written to the metadata log, and do not move the resume checkpoint, so a restart still resumes from the last real event. * fix(filer.sync): gate idle heartbeat on the read cursor, not SinceNs In metadata-chunks mode persisted entries replay as log file refs and never reach eachLogEntryFn, so lastSeenTsNs stays put and a caught-up subscriber with an old SinceNs would never get a heartbeat. Use the read cursor (lastReadTime), which advances in that mode too, max'd with lastSeenTsNs so the in-memory backlog-then-idle case still works while the cursor returned to the caller has not yet updated. |
||
|
|
77ac781bbd |
fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers (#9568)
* fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers When a volume server holds EC shards for the same vid across more than one disk, each DiskLocation registers its own EcVolume entry and Store.FindEcVolume returns whichever one it hits first. The shard-info RPC iterated only that single EcVolume's Shards, so the response missed every shard mounted on a sibling disk. The worker's verifyEcShardsBeforeDelete sums the per-server responses into a union bitmap and refuses to delete the source volume when the union falls short of dataShards+parityShards. On multi-disk destinations, the union was systematically under-counted and source deletion got blocked even though all shards were physically present and mounted. Walk every DiskLocation in the handler and emit the deduplicated union of all shards. The .ecx-backed fields (file counts, volume size) still come from a single EcVolume since every disk's entry opens the same .ecx via NewEcVolume's cross-disk fallback. Tests: - TestVolumeEcShardsInfo_AggregatesAcrossDisks unit test in weed/server/. - test/volume_server/grpc/ec_verify_multi_disk_test.go integration test drives the full generate -> mount -> redistribute -> restart -> reconcile path and asserts both VolumeEcShardsInfo and VerifyShardsAcrossServers + RequireFullShardSet (the production source-deletion gate) report all 14 shards. - ec_multi_disk_lifecycle_test.go tightened: replaces the "VolumeEcShardsInfo only sees one disk's EcVolume" workaround with a full-shard-set assertion. * review: use ShardBits bitmask + cap-pre-allocation for shard dedup |
||
|
|
68794fb94c |
fix(ec_distribute): remove partial files on copy stream error (#9543)
* fix(ec_distribute): remove partial files on copy stream error writeToFile opens the destination with O_TRUNC and streams into it. On a mid-stream receive / write / cancellation error it returned the failure but left the destination behind in whatever state had been written so far — typically 0 bytes when the source errored before sending any FileContent. VolumeEcShardsCopy distributes .ecx by calling doCopyFile, so this same stub-leaving behaviour produced the 0-byte .ecx files seen on EC encoding failures: the source claims a non-zero ModifiedTsNs (so the existing "source not found" cleanup doesn't fire), the stream then errors immediately, and the receiver ends up with a 0-byte .ecx that downstream code mistook for a valid empty index. Clean up the partial file on every error path that returns from the streaming loop (receive, write, and cancellation). Skip cleanup when isAppend=true so resumable appends keep their existing content. As defense in depth, VolumeEcShardsCopy also stats the .ecx after copy and removes / errors on a 0-byte result so the orchestrator can pick a different source. The Rust volume server has only the source side of CopyFile (no client-side stream-to-disk consumer) and no .ecx subsystem yet, so this fix has no Rust mirror. * fix(ec_distribute): close file before remove, fail fast on stat error Address review feedback: - writeToFile's mid-stream removeIncomplete called os.Remove while the destination file handle was still open. On Windows os.Remove fails while a handle is open, so the cleanup wouldn't run there. Wrap the handle close in a once-only helper, call it from removeIncomplete and from the existing "source not found" cleanup, and keep a deferred close as the safety net for the normal-return path. - VolumeEcShardsCopy's post-copy .ecx check silently passed when os.Stat returned an error: doCopyFile had reported success but if the file was already gone, unreadable, or somehow a directory, the orchestrator only learned at mount time with no useful context. Treat any non-nil stat error and any directory result as a copy failure here and surface it immediately. |
||
|
|
6b94701213 |
mini: quieter startup with a docker-compose-style progress board (#9524)
* mini: quieter startup with a docker-compose-style progress board Replaces noisy startup/shutdown logs with a single in-place progress table on a TTY (or one line per state change off-TTY). Each component renders as `pending -> starting -> ready` during startup and `stopping -> stopped` during shutdown, with elapsed time on transition. Also folds in a few cleanups uncovered while making this readable: - route the admin.go startup prints through glog so quietMiniLogs() filters them under mini but standalone weed admin still shows them - generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under the data folder so restarts reuse the same key - demote worker/master gRPC Recv 'context canceled' to V(1); those are the normal shutdown signal, not Errors/Warnings - drop the 'Optimized Settings' block and the 'credentials loaded from environment variables' message from the welcome banner - only show the credentials setup hints when no S3 identities exist (new s3api.HasAnyIdentity accessor backed by an atomic.Bool) - use S3_BUCKET in the credentials hint so it pairs with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - reorder running-services list to master / volume / filer / webdav / s3 / iceberg / admin * mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3 won't encrypt data under a KEK that the next restart can't reproduce (which would orphan whatever was written this run). The caller already treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM just stay disabled for this run. startAdminServer's serve goroutine used to only log ListenAndServe failures, so a bind error left the caller blocked on ctx.Done() with no listener. Forward the error through a buffered channel and select on it alongside ctx.Done(). * ci(s3-proxy-signature): match weed mini's new progress-board ready line The readiness probe grepped for "S3 (gateway|service).*(started|ready)", which matched weed mini's old "S3 service is ready at ..." line. Mini now emits " S3 ready (Xs)" from its progress board, so the old pattern misses and the test timed out at the 30-second wait. Widen the alternation to also accept "S3\s+ready". The curl HEAD fallback already covers any remaining cases. |
||
|
|
2a41e76101 |
fix(ec): blanket-clean every destination over the full shard range (#9512)
* fix(ec): blanket-clean every destination over the full shard range The previous cleanup pass walked t.sources only, with the shard ids the topology had reported at detection time. In the wild, a destination can end up with EC shards mounted that the topology snapshot didn't list — shards on a sibling disk that hadn't heartbeated, or shards left over from a concurrent attempt's mount step. FindEcVolume still returns true, so the next ReceiveFile trips the mounted-volume guard. Cleanup now unions t.sources (with ShardIds) and t.targets and issues unmount + delete over [0..totalShards-1] on each. Both RPCs are idempotent on missing shards, so the wider sweep is free. Two new tests cover the gap: shards mounted beyond what t.sources lists, and a target-only destination with no source row. * log(ec): include disk_id in EC unmount/delete/refusal log lines The current logs identify the volume and shard but leave disk_id off, which makes the cross-server cleanup story hard to follow when multiple disks of one server hold pieces of the same volume: UnmountEcShards 4121.1 -> add disk_id ec volume video-recordings_4121 shard delete [1 5] -> add per-loc disk_id volume server X:Y deletes ec shards from 4121 [...] -> add disk_id ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids ReceiveFile's refusal now names the disk_ids actually holding the mount so operators can see whether the next cleanup pass needs to target a sibling disk. Added Store.FindEcVolumeDiskIds / Store::find_ec_volume_disk_ids as the supporting primitive. Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard, heartbeat delete log in diff_ec_shard_delta_messages, refusal in the ReceiveFile handler). * test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server The plugin-worker EC tests boot a fake volume server that embeds UnimplementedVolumeServerServer. After the worker started calling VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the default Unimplemented response surfaced as fourteen "method not implemented" errors and TestErasureCodingExecutionEncodesShards failed. Both RPCs are no-ops here — nothing on the fake server has mounted state or persisted shard files to remove. |
||
|
|
62821964dd |
filer/iam-grpc: make admin Bearer auth opt-in (fixes #9509) (#9514)
PR #9442 made the filer refuse to register the IAM gRPC service unless jwt.filer_signing.key was set in security.toml, which broke the admin UI Users/Groups/Policies pages for every deployment that ships without a security.toml — weed mini, plain Helm, vanilla weed filer. The Users tab returns Unimplemented and the page is unusable. Issues #9504, #9505 and #9509 all trace to this gap. The rest of the filer's gRPC surface is unauthenticated by default; treat IAM the same way. The service now always registers, and the auth gate is a no-op when no signing key is configured. When the key is set, every RPC still requires an admin-signed Bearer token, matching the post-#9442 behaviour. Operators who expose the filer gRPC port beyond a trusted network should set the key on both filer and admin. The admin client (IamGrpcStore.withIamClient) already skips attaching the authorization metadata when its key is empty, so no changes there. |
||
|
|
bfb2661fec |
fix(tests): make 32-bit GOARCH tests build and run (#9507)
fix(tests): make 32-bit GOARCH tests build and run (#9503) verifyTestFilerClient had bare int64 atomic counters after a map header, so atomic.AddInt64 panicked with "unaligned 64-bit atomic operation" on linux/386. Switch to atomic.Int64, which the stdlib guarantees is 8-byte aligned on all platforms. rpc_version_filter_test.go passed the untyped constant 0xdeadbeef to t.Errorf, where it default-promoted to int and overflowed 32-bit int. Bind it to a typed uint32 const used in both the comparison and the error message. |
||
|
|
3a8389cd68 |
fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490) Before this change, both the worker EC task and the shell ec.encode command would delete the source .dat as soon as MountEcShards returned — even if distribute/mount failed partway, leaving fewer than 14 shards in the cluster. The deletion was logged at V(2), so by the time someone noticed missing data the only trace was a 0-byte .dat synthesized by disk_location at next restart. - Worker path adds Step 6: poll VolumeEcShardsInfo on every destination, union the bitmaps, and refuse to call deleteOriginalVolume unless all TotalShardsCount distinct shard ids are observed. A failed gate leaves the source readonly so the next detection scan can retry. - Shell ec.encode adds the same gate after EcBalance, walking the master topology with collectEcNodeShardsInfo. - VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any source destruction is traceable in default-verbosity production logs. The EC-balance-vs-in-flight-encode race is intentionally left for a follow-up; balance should refuse to move shards for a volume whose encode job is not in Completed state. * fix(ec): trim doc comments on the new shard-verification path Drop WHAT-describing godoc on freshly added helpers; keep only the WHY notes (query-error policy in VerifyShardsAcrossServers, the #9490 reference at the call sites). * fix(ec): drop issue-number anchors from new comments Issue references age poorly — the why behind each comment already stands on its own. * fix(ec): parametrize RequireFullShardSet on totalShards Take totalShards as an argument instead of reading the package-level TotalShardsCount constant. The OSS callers continue to pass 14, but the helper is now usable with any DataShards+ParityShards ratio. * test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo The new pre-delete verification gate calls VolumeEcShardsInfo on every destination after mount, and the fake server's UnimplementedVolumeServer returns Unimplemented — the verifier read that as zero shards on every node and aborted source deletion. Build the response from recorded mount requests so the integration test exercises the gate end-to-end. * fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files Mirror the Go-side change in weed/storage/volume_write.go: stat each file before removing and emit an info-level log for .dat/.idx so a destructive call is always traceable. The OSS Rust crate previously unlinked them silently. * fix(ec/decode): verify regenerated .dat before deleting EC shards After mountDecodedVolume succeeds, the previous code immediately unmounts and deletes every EC shard. A silent failure in generate or mount could leave the cluster with neither shards nor a valid normal volume. Probe ReadVolumeFileStatus on the target and refuse to proceed if dat or idx is 0 bytes. Also make the fake volume server's VolumeEcShardsInfo reflect whichever shard files exist on disk (seeded for tests as well as mounted via RPC), so the new gate can be exercised end-to-end. * fix(ec): address PR review nits in verification + fake server - Drop unused ServerShardInventory.Sizes field. - Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits bound is explicit (Set already no-ops on overflow, this is for clarity). - Nil-guard the fake server's VolumeEcShardsInfo so a malformed call doesn't panic the test process. |
||
|
|
d5c0a7b153 |
fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487)
* fix(master): include GrpcPort in LookupEcVolume response LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume builds Location with only Url / PublicUrl / DataCenter, so callers fall back to ServerToGrpcAddress (httpPort + 10000). On any deployment where that convention does not hold — multi-disk integration tests, custom port layouts — EC reads dial the wrong port and quietly degrade to parity recovery. * fix(volume/ec): probe every DiskLocation when serving local shard reads reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the DiskLocation that physically owns it, so a multi-disk volume server can hold shards for the same vid in two separate ecVolumes — one per disk — with .ecx on whichever disk owned the original .dat. The read path only consulted the single EcVolume FindEcVolume picked, so requests for shards on the sibling disk fell through to errShardNotLocal and then to remote/loopback recovery. Walk all DiskLocations after the first probe in both readLocalEcShardInterval and the VolumeEcShardRead gRPC handler; the latter also covers the loopback that recoverOneRemoteEcShardInterval falls back to when a peer dial fails. * test(volume/ec): cover the multi-disk EC lifecycle end-to-end Two integration tests against a real volume server with two data dirs: TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read -> drop .dat -> stop -> redistribute shards across disks -> restart -> verify reconcileEcShardsAcrossDisks attached the orphan shards and reads still work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild pulls input from both disks -> reads still work. TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478 reproducer at the cluster level: seed a healthy .dat on disk 0, plant the on-disk footprint of an interrupted EC encode on disk 1, restart, and assert pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0. Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous run's volume.log is rotated to volume.log.previous so a startup regression on the second run does not lose the first run's diagnostics. * review: trim verbose comments * review: drop racy fast-path, use locked findEcShard directly gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent mount / unmount could race with it. findEcShard already walks every DiskLocation under the right lock, so the fast-path adds nothing but the race. Collapse both call sites to a single locked call. Also note in RestartVolumeServer why the log-rotation error is swallowed: absence on first call is benign; anything else surfaces in the next os.Create in startVolume. |
||
|
|
f51468cf73 |
Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474)
Revert "master: bind heartbeat claims to the connecting peer (#9443)" This reverts commit |
||
|
|
43a8c4fdca |
Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472)
* Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)" This reverts commit |
||
|
|
f28c7ce6df |
master: bind heartbeat claims to the connecting peer (#9443)
SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on
the wire. Three changes tighten that:
- Reject heartbeats whose Ip does not match the gRPC peer's source
address. Loopback peers are still trusted; operators behind a proxy
can opt out with -master.allowUntrustedHeartbeat.
- Track which (ip, port) first claimed a volume id or an ec shard slot
and drop foreign re-claims. Non-EC volume claims are bounded by the
replica copy count so legitimate replicas still register. EC
ownership is keyed by (vid, shard_id) so the same vid can legitimately
be split across many peers as long as their EcIndexBits are disjoint;
rejected bits are cleared from the bitmap and the parallel ShardSizes
array is compacted in lock-step.
- Maintain reverse indexes owner -> volumes and owner -> ec shard slots
so disconnect cleanup is O(M) in what that peer held rather than O(N)
over the whole map.
Bindings are also released when a heartbeat reports that the peer no
longer holds an id, either via explicit Deleted{Volumes,EcShards}
entries or by omitting it from a full snapshot. Without this, a planned
rebalance that moved a vid or an ec shard from peer A to peer B would
leave B's heartbeats permanently filtered out until A disconnected,
breaking ec encode/decode flows that delete shards on the source as
soon as the move completes.
The (vid -> owners) binding still does not track which replica slot
each peer occupies, so the first N claims under the copy count win;
strict per-slot mapping is a follow-up.
|
||
|
|
10cc06333b |
cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate each server's Ping handler on cluster membership: masters check the topology, registered cluster nodes, and configured master peers; volume servers only accept their seed/current masters; filers accept tracked peer filers, the master-learned volume server set, and configured masters. Use address-indexed peer lookups to keep Ping target validation O(1): - topology maintains a pb.ServerAddress -> *DataNode index alongside the dc/rack/node tree, kept in sync from doLinkChildNode and UnlinkChildNode plus the ip/port-rewrite branch in GetOrCreateDataNode. GetTopology now returns nil on a detached subtree instead of panicking, so the linkage hooks can no-op safely. - vid_map tracks a refcount per volume-server address so hasVolumeServer answers without scanning every vid location. The add path skips empty-address entries the same way the delete path already does, so a zero-value Location cannot leak a permanent serverRefCount[""] bucket. - masters reuse a cached master-address set from MasterClient instead of walking the configured peer slice on every request. - volume servers compare against a pre-built seed-master set and protect currentMaster reads/writes with an RWMutex, fixing the data race with the heartbeat goroutine. The seed slice is copied on construction so external mutation cannot desync it from the frozen lookup set. - cluster.check drops the direct volume-to-volume sweep; volume servers no longer carry a peer-volume list, and the note next to the dropped probe is reworded to make clear that direct volume-to-volume reachability is intentionally not validated by this command. Update the volume-server integration tests that drove Ping through the new admission gate: success-path coverage now targets the master peer (the only type a volume server tracks), and the unknown/unreachable path asserts the InvalidArgument the gate now returns instead of the old downstream dial error. Mirror the same admission gate in the Rust volume server crate: a seed-master HashSet built once at startup plus a tokio RwLock over the heartbeat-tracked current master, both consulted in is_known_ping_target on every Ping, with InvalidArgument returned for any target that isn't a recognised master. |
||
|
|
21054b6c18 |
volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)
Add Guard.IsAdminAuthorized, a fail-closed variant of IsWhiteListed, and use it to gate destructive volume admin RPCs. IsWhiteListed keeps its allow-all-when-empty semantics for HTTP compatibility. For TCP peers with an empty whitelist, off-host callers are rejected but loopback (127.0.0.0/8, ::1) is still trusted. A volume server commonly cohabits with the master/filer on a single host and in integration-test clusters; the loopback exception keeps cluster-internal admin traffic working without -whiteList while still locking out off-host attackers. Non-TCP peers (in-process / bufconn / unix-socket) bypass the host check entirely. When `weed server` runs master+volume+filer in a single process the master dials the volume server in-process and the peer address surfaces as "@", which has no parseable IP. Such a caller shares our OS process and cannot be spoofed by a remote attacker, so we treat it as trusted by construction. The gate also tolerates a nil guard (developmental / embedded path) and only enforces once a guard is wired up. UpdateWhiteList skips entries whose CIDR fails to parse so the IP-iteration path can no longer hit a nil *net.IPNet. |
||
|
|
69da20bdae |
volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441)
volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the destructive volume-server RPCs, and reject cluster-internal remote S3 endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before dialing. Pin the validated address against DNS rebinding by routing the AWS SDK through an HTTP transport whose DialContext re-resolves the host and re-applies the deny list on every dial, so an endpoint that resolves to a public IP at validate-time and then flips to 127.0.0.1 at connect time is refused. Operators that legitimately fetch from private hosts can opt out with -volume.allowUntrustedRemoteEndpoints. |
||
|
|
5e8f99f40a |
filer: require admin-signed JWT on the IAM gRPC service (#9442)
Every IAM RPC (CreateUser, PutPolicy, CreateAccessKey, ...) now requires a Bearer token in the authorization metadata, signed with the filer write-signing key. The service refuses to register on a filer that has no jwt.filer_signing.key set, so the unauthenticated default is gone: operators who use these RPCs must configure the key and attach a token on every call. Bearer scheme matching is case-insensitive (RFC 6750), every handler nil-checks req before dereferencing it, and tests now cover the expired-token path. |
||
|
|
05ed5c9ae8 |
filer: scope JWT allowed_prefixes to path components (#9439)
The allowed_prefixes check used a literal byte-prefix match, so a token scoped to /tenant1 also matched /tenant1234, /tenant1-old, and similar sibling paths. Match on /-separated path components after path.Clean normalisation instead. |
||
|
|
532b088262 |
fix(ec): preserve source disk type across EC encoding (#9423) (#9449)
* fix(ec): carry source disk type on VolumeEcShardsMount (#9423) When EC shards land on a target whose disk type differs from the source volume's, master heartbeats wrongly reported under the target disk's type. Add source_disk_type to VolumeEcShardsMountRequest; the target server applies it to the in-memory EcVolume via SetDiskType so the mount notification and steady-state heartbeat both carry the source's disk type. Empty value falls back to the location's disk type (used by disk-scan reload paths). The override is not persisted with the volume — disk type stays an environmental property and .vif remains portable. * fix(ec): plumb source disk type through plugin worker (#9423) Add source_disk_type to ErasureCodingTaskParams (field 8; 7 reserved), populate it from the metric the detector already collects, thread it through ec_task into the MountEcShards helper, and forward it on the VolumeEcShardsMount RPC. * fix(ec): mirror source disk type plumbing in rust volume server (#9423) The volume_ec_shards_mount handler now forwards source_disk_type into mount_ec_shard → DiskLocation::mount_ec_shards. When non-empty it overrides ec_vol.disk_type (and each mounted shard's disk_type) via the new set_disk_type method; empty value keeps the location's disk type, so disk-scan reload and reconcile paths are unchanged. Also picks up two pre-existing proto drifts that 'make gen' synced from weed/pb (LockRingUpdate in master.proto, listing_cache_ttl_seconds in remote.proto). * feat(ec): bias placement toward preferred disk type (#9423) Add DiskCandidate.DiskType and PlacementRequest.PreferredDiskType. When PreferredDiskType is non-empty, SelectDestinations partitions suitable disks into matching/fallback tiers and runs the rack/server/ disk-diversity passes on the matching tier first; the fallback tier is only consulted if the matching pool can't satisfy ShardsNeeded. PlacementResult.SpilledToOtherDiskType lets callers warn on spillover. Empty PreferredDiskType keeps the existing single-pool behavior. * fix(ec): plumb source disk type into placement planner (#9423) diskInfosToCandidates now copies DiskInfo.DiskType into the placement candidate, and ecPlacementPlanner.selectDestinations forwards metric.DiskType as PreferredDiskType so EC shards land on disks matching the source volume's disk type when possible. A glog warning fires when placement had to spill to other disk types. * test(ec): integration coverage for source-disk-type plumbing (#9423) store_ec_disk_type_test exercises Store.MountEcShards end-to-end: a shard physically lives on an HDD location, MountEcShards is called with sourceDiskType="ssd", and the test asserts that the in-memory EcVolume, the mounted shard, the NewEcShardsChan notification, and the steady-state heartbeat all report under the source's disk type. A companion test pins the empty-source path so disk-scan reload keeps the location's disk type. detection_disk_type_test exercises the worker plumbing: with a cluster of nodes carrying both HDD and SSD disks, planECDestinations must place every shard on SSD when metric.DiskType="ssd"; with only one SSD node and 13 HDD nodes it must still satisfy a 10+4 layout via spillover (and log a warning). * revert(ec): drop unrelated proto drift in seaweed-volume/proto (#9423) make gen pulled two pre-existing OSS changes into the rust proto tree (LockRingUpdate / by_plugin in master.proto, listing_cache_ttl_seconds in remote.proto). Reviewers flagged it as scope creep — none of the rust EC fix references those fields. Restore both files to origin/master so this branch only touches EC-related symbols. * fix(ec placement): treat empty disk type as hdd and skip used racks on spill (#9423) partitionByDiskType used raw string comparison, so a PreferredDiskType of "hdd" never matched candidates whose DiskType is "" (the HardDriveType sentinel that weed/storage/types uses). EC encoding of an HDD source would spill onto any HDD reporting "" even when the cluster has plenty of matching capacity. Normalize both sides through normalizeDiskType, which lowercases and folds "" → "hdd", mirroring types.ToDiskType without taking a dependency on it. selectFromTier's rack-diversity pass also kept revisiting racks the preferred tier had already used when running on the fallback tier, which negated PreferDifferentRacks on spillover. Skip racks already in usedRacks so fallback placements still spread onto new racks. * fix(ec): empty-source remount must not clobber existing disk type (#9423) mount_ec_shards_with_idx_dir runs more than once per vid (RPC mount, disk-scan reload, orphan-shard reconcile). After an RPC sets the source-derived disk type, any later call passing source_disk_type="" was resetting ec_vol.disk_type back to the location's value, which reintroduces the heartbeat drift this PR is meant to fix. Only default to the location's disk type when the EC volume is fresh (no shards mounted yet); otherwise leave the recorded type alone so empty-source reloads preserve whatever the original mount RPC set. |
||
|
|
b2d24dd54f |
volume: require admin auth on BatchDelete (#9438)
Run BatchDelete through checkGrpcAdminAuth like the other destructive volume-server RPCs (VolumeDelete, DeleteCollection, vacuum, EC, ...), so a whitelist-configured server denies non-admin callers. |
||
|
|
2b21d19e4c |
volume: require admin auth on ReadAllNeedles and VolumeNeedleStatus (#9437)
Both RPCs hand out raw needle bytes / cookies. Run them through checkGrpcAdminAuth like the rest of the volume-server admin handlers. |
||
|
|
a1e5eb9dad |
Fix UI prefix url encoding (#9344)
* Fix filer UI navigation for URL-sensitive object prefixes * Fix filer UI navigation for URL-sensitive object prefixes * Clarify filer UI path escaping test name Rename the legacy filer UI path test to describe the actual behavior being checked. The printpath helper preserves timestamp characters that are valid in URL path components, while the PR fix is focused on query-string escaping for path and cursor parameters. |
||
|
|
1c0e24f06a |
fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume A .vif left behind without its .idx (orphaned by a crashed move, partial copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the whole volume server down on boot, killing every healthy volume on it too. For remote-tiered volumes treat it as a per-volume load error so the server can come up and the operator can clean up the stray .vif. Refs #9331. * fix(balance): skip remote-tiered volumes in admin balance detection The admin/worker balance detector had no equivalent of the shell-side guard ("does not move volume in remote storage" in command_volume_balance.go), so it scheduled moves on remote-tiered volumes. The "move" copies .idx/.vif to the destination and then calls Volume.Destroy on the source, which calls backendStorage.DeleteFile — deleting the remote object the destination's new .vif now points at. Populate HasRemoteCopy on the metrics emitted by both the admin maintenance scanner and the worker's master poll, then drop those volumes at the top of Detection. Fixes #9331. * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix(volume): keep remote data on volume-move-driven delete The on-source delete after a volume move (admin/worker balance and shell volume.move) ran Volume.Destroy with no way to opt out of the remote-object cleanup. Volume.Destroy unconditionally calls backendStorage.DeleteFile for remote-tiered volumes, so a successful move would copy .idx/.vif to the destination and then nuke the cloud object the destination's new .vif was already pointing at. Add VolumeDeleteRequest.keep_remote_data and plumb it through Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The balance task and shell volume.move set it to true; the post-tier-upload cleanup of other replicas and the over-replication trim in volume.fix.replication also set it to true since the remote object is still referenced. Other real-delete callers keep the default. The delete-before-receive path in VolumeCopy also sets it: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. Refs #9331. * test(storage): in-process remote-tier integration tests Cover the four operations the user is most likely to run against a cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by registering a local-disk-backed BackendStorage as the "remote" tier and exercising the real Volume / DiskLocation / EC encoder code paths. Locks in: - Destroy(keepRemoteData=true) preserves the remote object (move case) - Destroy(keepRemoteData=false) deletes it (real-delete case) - Vacuum/compact on a remote-tier volume never deletes the remote object - EC encode requires the local .dat (callers must download first) - EC encode + rebuild round-trips after a tier-down Tests run in-process and finish in under a second total — no cluster, binary, or external storage required. * fix(rust-volume): keep remote data on volume-move-driven delete Mirror the Go fix in seaweed-volume: plumb keep_remote_data through grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume → Volume.destroy, and skip the s3-tier delete_file call when the flag is set. The pre-receive cleanup in volume_copy passes true for the same reason as the Go side: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. The Rust loader already warns rather than fataling on a stray .vif without an .idx (volume.rs load_index_inmemory / load_index_redb), so no counterpart to the Go fatal-on-missing-idx fix is needed. Refs #9331. * fix(volume): preserve remote tier on IO-error eviction; fix EC test target Two review nits: - Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored volumes with keepRemoteData=false, so a transient local fault on a remote-tiered volume would also nuke the cloud object. Track the delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile() for IO-error evictions; TTL-expired evictions still pass false. - TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but called them "parity" — by the klauspost/reedsolomon convention shards 0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1 are parity. Switch the loop to delete the parity range so the intent matches the indices. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> |
||
|
|
2417ba0354 |
fix(volume): add authentication to destructive gRPC admin endpoints (#8876)
* fix(volume): add authentication to destructive gRPC admin endpoints Three destructive VolumeServer gRPC endpoints (DeleteCollection, VolumeDelete, VolumeServerLeave) had no authentication checks, unlike their HTTP counterparts which are protected by the Guard whitelist. Add IsWhiteListed(host) to security.Guard and a checkGrpcAdminAuth helper on VolumeServer that extracts the peer IP from gRPC context and validates it against the guard whitelist. Gate all three endpoints behind this check. * fix(volume): tolerate unparseable gRPC peer address in admin auth check S3 Filer Group integration tests were failing with PermissionDenied "bad peer address: address @: missing port in address" when DeleteCollection ran across the in-process gRPC connection between filer and volume server — the peer addr surfaces as "@" there and net.SplitHostPort can't parse it. The check rejected before IsWhiteListed could exercise its allow-all path for empty-whitelist deployments. Hand the raw peer string to IsWhiteListed when SplitHostPort fails. With no whitelist configured (the test environment's mode) it accepts; with a whitelist configured the unparseable host won't match anything and the call still gets denied as it should. Adds three regression tests for IsWhiteListed pinning the empty-config allow-all, populated-list reject-unknown, and signing-key-only allow- all branches that the gRPC admin helper relies on. * refactor(security): dedup checkWhiteList through IsWhiteListed The HTTP-side checkWhiteList and the gRPC-side IsWhiteListed had the same lookup logic in two places; future drift was just a matter of time. Have checkWhiteList delegate so the membership semantics live in exactly one function. Behaviour is unchanged: the new path still returns nil for isEmptyWhiteList (signing-key-only mode) and still rejects unknown hosts when a whitelist is configured. Addresses gemini medium review on PR #8876. * fix(volume): protect remaining state-altering gRPC admin endpoints DeleteCollection, VolumeDelete, and VolumeServerLeave were the truly-destructive endpoints, but AllocateVolume, VolumeMount, VolumeUnmount, VolumeConfigure, VolumeMarkReadonly, and VolumeMarkWritable also modify server state and should sit behind the same whitelist gate. Read-only endpoints (VolumeStatus, VolumeServerStatus, VolumeNeedleStatus, Ping) stay open. The check is a no-op when no whitelist is configured (the default), so existing deployments keep working; operators who lock down their volume servers via guard.white_list now get consistent coverage. Addresses gemini security-high review on PR #8876. * fix(volume): typed peer addr + audit log for gRPC admin auth Prefer a typed *net.TCPAddr when extracting the peer IP — string parsing was already a fallback for the in-process case but using the typed form first is cleaner and skips an unnecessary parse on the common path. Log failed authorization attempts at V(0) so an operator running with a whitelist sees the host that was rejected (and the raw remote address in case the IP lookup itself was the failure mode), matching what the HTTP Guard already does. Addresses gemini medium review on PR #8876. * fix(volume): protect vacuum + scrub + EC-shards-delete admin endpoints Five more master/admin-driven destructive operations live outside volume_grpc_admin.go and were missing the same whitelist gate: - VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup - ScrubVolume - VolumeEcShardsDelete VacuumVolumeCheck stays open (read-only). BatchDelete also stays open: it's the data-plane multi-object delete called from the S3 API and filer, not an admin operation; gating it would break ordinary S3 DeleteObjects calls. Addresses gemini security-high review on PR #8876. * fix(volume): simplify no-peer-info branch in gRPC admin auth The IsWhiteListed("") fallback was defending against a scenario that doesn't actually arise — real gRPC connections always populate peer info. Drop the branch and just deny when peer info is missing, which is the safer default and matches "if we don't know who the caller is, refuse". * fix(volume-rust): mirror gRPC admin auth on the rust volume server The rust volume server has the same set of destructive admin endpoints as the Go side and the same Guard infrastructure, but nothing was wired together — every endpoint accepted unauthenticated calls regardless of guard configuration. Same vulnerability class the Go fix on this PR closes; this commit closes it on the rust side too so the two stacks stay aligned. Adds VolumeGrpcService::check_grpc_admin_auth that pulls the peer SocketAddr off the tonic Request and runs Guard::check_whitelist on its IP, then applies the helper to the same set the Go side covers: DeleteCollection, AllocateVolume, VolumeMount, VolumeUnmount, VolumeDelete, VolumeMarkReadonly, VolumeMarkWritable, VolumeConfigure, VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup, VolumeServerLeave, ScrubVolume, VolumeEcShardsDelete. Read-only endpoints stay open; BatchDelete stays open as a data-plane multi-object delete. |
||
|
|
d265274e13 |
fix(nfs): accept dirpath any-where under the export, mirroring rclone (#9291)
* fix(nfs): accept any MOUNT3 dirpath, mirroring rclone's permissive policy
weed nfs has exactly one export per process, so the MOUNT3 dirpath
argument has no second export to disambiguate against. Strict
comparison only translated PV-path typos into the inconsistent
"mount succeeds but empty" / "mount fails completely" split that
operators see.
Match rclone's serve nfs Handler.Mount: ignore the dirpath, log an INFO
line when it differs from the configured export, and always serve the
export root. Apply the same change to the UDP MOUNT3 path so kernel
clients defaulting to mountproto=udp see identical behaviour. Access
control still goes through -allowedClients / -ip.bind, and file-handle
scoping in FromHandle is unchanged so handles still cannot escape the
export.
Replace the prior single-path reject tests with table tests covering
the shapes operators commonly hit: root, parent, sibling, deeper child,
unrelated, empty, relative form, exact match, and trailing slash, at
the Handler.Mount, UDP MOUNT3, and full RPC layers.
* feat(nfs): mount at subdirectory when MOUNT3 dirpath is under the export
Make the dirpath argument meaningful when the client asks for a subtree
of the configured export. With -filer.path=/buckets, a client mounting
<server>:/buckets/data lands directly inside /buckets/data instead of
at the export root.
- dirpath equals the export root: serve the export root.
- dirpath strictly under the export, directory entry: serve that
subdirectory; the returned filehandle encodes its inode.
- dirpath strictly under the export, missing or non-directory: reject
with NoEnt or NotDir.
- dirpath outside the export: keep the rclone-style fallback to the
export root.
TCP returns a sub-rooted seaweedFileSystem and lets go-nfs's onMount
call ToHandle to encode the FH; UDP encodes the FH itself. FromHandle
is unchanged: handles are content-addressed by inode and resolve via
the inode index, so they remain stable across mounts and across
process restarts.
The trimmed permissive tests keep their outside-export shapes; new
subexport tests cover under-export directories, missing entries, and
non-directory entries on Handler.Mount, the UDP MOUNT3 wire, and
through the full RPC stack.
* nfs: propagate request context through MOUNT3 resolution
Mount now accepts the gonfs context and threads it through
resolveMountFilesystem and lstatExportStatus so a slow filer call
during MOUNT cannot outlive a cancelled or timed-out request.
lstatExportStatus uses fileInfoForVirtualPath(ctx, "/") directly
instead of billy.Filesystem.Lstat, which would otherwise drop the
context on the floor by calling fileInfoForVirtualPathWithOptions
with context.Background().
Lower the successful subexport-mount log from V(0) to V(1). The
fallback log stays at V(0) so operator typos still surface; the
success line is per-mount churn that adds up on NFS-CSI deployments.
* nfs: mirror TCP defensive checks on the UDP MOUNT3 path
Two transport-parity bugs the rabbit caught:
(1) The exact-export-root and outside-export branches were returning
MNT3_OK unconditionally, while the TCP handler runs lstatExportStatus
on those same branches. If the configured -filer.path has been
removed from the filer, TCP returns NoEnt/ServerFault but UDP would
still hand out a synthetic root handle pointing at nothing. Add
rootMountStatus as the UDP analogue and call it on both branches.
(2) resolveSubexportFileHandle did filer I/O on the single UDP serve
loop with context.Background(). One slow filer round-trip would
block every later MOUNT packet. Wrap each MOUNT call's filer work in
context.WithTimeout(mountUDPLookupTimeout) and thread that ctx
through both rootMountStatus and resolveSubexportFileHandle.
Lower the successful subexport log to V(1) to match the TCP side.
* nfs: assert TCP/UDP MOUNT3 produce byte-identical filehandles
The existing UDP subexport assertions only checked the decoded inode
and kind. A regression that drifted the generation, exportID, or
encoding format on one transport but not the other would have slipped
through. Build the TCP Handler from the same Server, drive its Mount
with the same dirpath, and require ToHandle to match the raw UDP FH
bytes for every OK case.
* nfs: take MOUNT3 dirpath as string in resolveMountFilesystem
Convert req.Dirpath to string once at the call site instead of
sprinkling string(...) casts through every log line and conversion
inside the function. Behavior unchanged.
* nfs: share rootFS lifecycle between TCP and UDP MOUNT handlers
Server.rootFilesystem() lazily constructs the seaweedFileSystem rooted
at the configured export the first time anything asks for it, then
hands the same instance to every subsequent caller. newHandler() and
mountUDPServer.rootMountStatus() now both go through it, so:
- Both transports observe the same chunk reader cache and chunk
invalidator without depending on call order during startup.
- The UDP defensive Lstat doesn't allocate a fresh wrapper per
MOUNT request anymore; one struct lives for the life of the
Server.
The sub-rooted seaweedFileSystem the subexport branch builds in
resolveSubexportFileHandle is still per-request because actualRoot
varies with the requested dirpath.
* nfs: drive rootFilesystem before reading sharedReaderCache on UDP
The UDP listener is started before serve() calls newHandler(), so an
under-export MOUNT3 request can reach resolveSubexportFileHandle before
Server.sharedReaderCache has been assigned. Reading it directly would
hand newSeaweedFileSystem a nil cache and the sub-fs would build a
throwaway ReaderCache that never gets shared with the TCP path.
Take rootFS off Server.rootFilesystem() (which drives the sync.Once
that initializes the shared cache) and read readerCache off that
instead, so subexport sub-fs instances always share the same reader
cache as rootFS regardless of which transport sees the first MOUNT.
* nfs: collapse exact-match and outside-export MOUNT branches
The two branches return the same filesystem (export root) and the
same status; only the log line differs. Combine the conditions and
guard the fallback log inline. Behavior unchanged.
|
||
|
|
35fe3c801b |
feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267)
* feat(nfs): add UDP MOUNT v3 responder
The upstream willscott/go-nfs library only serves the MOUNT protocol
over TCP. Linux's mount.nfs and the in-kernel NFS client default
mountproto to UDP in many configurations, so against a stock weed nfs
deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0
("not registered"), and either falls back inconsistently or surfaces
EPROTONOSUPPORT — surfacing as the user-visible "requested NFS version
or transport protocol is not supported" reported in #9263. The user has
to add `mountproto=tcp` or `mountport=2049` to mount options to coerce
TCP just for the MOUNT phase.
Add a small UDP responder that speaks just enough of MOUNT v3 to handle
the procedures the kernel actually invokes during mount setup and
teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors
handler.go's TCP path so both transports produce the same root
filehandle and the same auth flavor list for the same export. Other
v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL.
This commit only adds the responder; portmap-advertise and Server.Start
wire-up follow in subsequent commits so each step stays independently
reviewable.
References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing
constants and parseRPCCall / encodeAcceptedReply helpers from
portmap.go are reused so behaviour stays consistent across both UDP
listening goroutines.
* feat(nfs): advertise UDP MOUNT v3 in the portmap responder
The portmap responder advertised TCP-only entries because go-nfs only
serves TCP, but with the new UDP MOUNT responder in place we can now
honestly advertise MOUNT v3 over UDP as well. Linux clients whose
default mountproto is UDP query portmap during mount setup; if the
answer is "not registered" some kernels translate the result to
EPROTONOSUPPORT instead of falling back to TCP, which is exactly the
failure pattern reported in #9263.
Add the entry, refresh the doc comment, and extend the existing
GETPORT and DUMP unit tests so a regression that drops the entry shows
up at unit-test granularity rather than only in an end-to-end mount.
* feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener
Plug the new mountUDPServer into Server.Start so it comes up on the
same bind/port as the TCP NFS listener. Started before portmap so a
portmap query that races a fast client never returns a UDP MOUNT entry
the responder isn't actually answering, and shut down via the same
defer chain so a portmap-or-listener startup failure doesn't leave the
UDP responder dangling.
The portmap startup log now reflects all three advertised entries
(NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a
glance that the UDP MOUNT path is up.
Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a
container with -portmap.bind, and mounted from another container using
both the user-reported failing setup from #9263 (vers=3 + tcp without
mountport) and an explicit mountproto=udp to force the new code path.
The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049`
now leads to a successful mount instead of EPROTONOSUPPORT.
* docs(nfs): note that the plain mount form works on UDP-default clients
With UDP MOUNT v3 now served alongside TCP, the only path that ever
required mountproto=tcp / mountport=2049 — clients whose default
mountproto is UDP — works against the plain mount example. Update the
startup mount hint and the `weed nfs` long help so users don't go
hunting for a mount-option workaround that no longer applies.
The "without -portmap.bind" branch is unchanged: that path still has
to bypass portmap entirely because there is no portmap responder for
the kernel to query.
* test(nfs): add kernel-mount e2e tests under test/nfs
The existing test/nfs/ harness boots a real master + volume + filer +
weed nfs subprocess stack and drives it via go-nfs-client. That covers
protocol behaviour from a Go client's perspective, but anything
mis-coded once a real Linux kernel parses the wire bytes is invisible:
both ends of the test use the same RPC library, so identical bugs
round-trip cleanly. The two NFS issues hit recently were exactly that
shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3
— and only surfaced in a real client.
Add three end-to-end tests that mount the harness's running NFS server
through the in-tree Linux client:
- TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline).
- TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP
only — regression test for the new UDP MOUNT v3 responder.
- TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server,
asserting the kernel surfaces a protocol/version-level error rather
than a generic "mount system call failed" — regression test for the
PROG_MISMATCH path from #9262.
The tests pass explicit port=/mountport= mount options so the kernel
never queries portmap, which means the harness doesn't need to bind
the privileged port 111 and won't collide with a system rpcbind on a
shared CI runner. They t.Skip cleanly when the host isn't Linux, when
mount.nfs isn't installed, or when the test process isn't running as
root.
Run locally with:
cd test/nfs
sudo go test -v -run TestKernelMount ./...
CI wiring follows in the next commit.
* ci(nfs): run kernel-mount e2e tests in nfs-tests workflow
Wire the new TestKernelMount* tests from test/nfs into the existing
NFS workflow:
- Existing protocol-layer step now skips '^TestKernelMount' so a
"skipped because not root" line doesn't appear on every run.
- New "Install kernel NFS client" step pulls nfs-common (mount.nfs +
helpers) and netbase (/etc/protocols, which mount.nfs's protocol-
name lookups need to resolve `tcp`/`udp`).
- New privileged step runs only the kernel-mount tests under sudo,
preserving PATH and pointing GOMODCACHE/GOCACHE at the user's
caches so the second `go test` invocation reuses already-built
test binaries instead of redownloading modules under root.
The summary block now lists the three kernel-mount cases explicitly
so a regression on either of #9262 or this PR's UDP MOUNT change is
traceable from the workflow run page.
|
||
|
|
3f3aaa7cc8 |
Export Prometheus metrics for scrubbing operations. (#9264)
This PR introduces three new metrics... - `scrub_last_time_seconds` - `scrub_volume_failures` - `scrub_shard_failures` ...capturing overall volume scrub results, and allowing to construct alerts and dashboards to monitor scrubbing progress. Note that these metrics are aggregated at the volume/EC shard level, and not intended for fine-grained tracking of scrubbing operations. |
||
|
|
e2c8791441 |
fix(nfs): reject NFSv4 calls with PROG_MISMATCH so clients fall back to v3 (#9262)
* feat(nfs): add NFSv3-only RPC version filter The upstream willscott/go-nfs library dispatches RPC calls by (program, procedure) only — it does not validate the program version. A client sending NFSv4 (prog 100003 vers 4 proc 1 COMPOUND) lands on the same handler map as NFSv3 and gets routed to v3 SETATTR, which parses the COMPOUND args as SETATTR3args and writes a malformed reply. The kernel then returns EPROTONOSUPPORT and mount.nfs prints "requested NFS version or transport protocol is not supported" without retrying v3. This commit adds a listener wrapper that peeks the first RPC frame on each new TCP connection. If the program is NFS or MOUNT and the version is not 3, it writes a protocol-correct PROG_MISMATCH reply (supported range 3..3, per RFC 5531) directly to the socket and closes the connection. v3 frames are replayed unchanged via a bufio reader so go-nfs sees the original bytes. Unknown programs pass through so go-nfs's own PROG_UNAVAIL handling stays in charge. The filter is not yet wired into the server; the next commit activates it. Tests cover NFSv4 reject, MOUNTv4 reject, NFSv3 pass-through, and unknown-program pass-through. * fix(nfs): wire NFSv3 version filter into the listener chain Place the version filter after the optional client allowlist so that unauthorized peers are still rejected first by IP/CIDR before we look at RPC content. With the filter active, a Linux client doing the default v4-first probe gets a clean PROG_MISMATCH reply pointing at v3, which lets mount.nfs (and the in-kernel client) skip v4 and reuse the same v3 mountOptions that already work for rclone serve nfs against this deployment. * test(nfs): exercise MOUNT v4 in the v4-rejection test, not v1 TestVersionFilterRejectsMOUNTv4WithProgMismatch was sending mountProgramID with version 1, so the test never actually covered the "reject MOUNT v4" path it claims to exercise. The filter does reject any non-v3 version uniformly, so the test still passed, but a future change that tightened the version check (for example, only rejecting v4) would let this test silently lie about coverage. Bump the call to version 4 so the name matches what is actually exercised. * refactor(nfs): reuse package RPC constants and io.ReadFull in version filter The RPC numeric constants (msg_type=CALL/REPLY, MSG_ACCEPTED, PROG_MISMATCH, AUTH_NONE, the NFS/MOUNT program numbers) are already named in portmap.go alongside the portmap responder. Reuse them here instead of defining a parallel set in rpc_version_filter.go: keeping one source of truth per package means a future correction in one spot can't drift away from the other. The filter-only constants (peek timeout, peek length, supportedNFSVer) stay local because they have no portmap analog. In the test, drop the bespoke readFull loop in favor of io.ReadFull. The custom version was a near-identical reimplementation that did not return io.ErrUnexpectedEOF on short reads, so the standard library is both shorter and more diagnostic-friendly. * fix(nfs): move RPC peek off the Accept path The previous wrapper called filterFirstRPCFrame inline inside versionFilterListener.Accept, which meant a single slow or idle TCP connect could hold rpcVersionFilterPeekTimeout (10s) of head-of-line blocking against every other accept: gonfs.Serve calls Accept serially, so each in-flight peek stalled the next legitimate client until the deadline expired. An attacker who simply opens a TCP connection without sending any RPC payload could trivially throttle accept throughput. Restructure the wrapper so a background goroutine drives the inner Accept loop and hands each raw conn to its own short-lived goroutine that runs the peek. Validated conns are sent on a buffered-once channel, which the wrapper's Accept reads from; rejected conns finish their PROG_MISMATCH reply and disappear without ever reaching the channel. This means N concurrent slow clients only block themselves, not the N+1th fast client that connects after them. Add Close coordination — sync.WaitGroup for the accept loop and per-conn peek goroutines, plus a closed channel so Accept unblocks immediately on shutdown — so the wrapper now satisfies the full net.Listener contract instead of relying on the embedded listener. Add a regression test that opens a slow conn (TCP only, never writes) and a fast conn (sends a v3 frame) and asserts the fast conn reaches the inner accept handler well below the peek timeout. * test(nfs): assert io.EOF (not just any error) after PROG_MISMATCH close The post-rejection check was only failing when conn.Read succeeded; any error — including a deadline timeout because the server kept the socket open — let the test pass. That defeats the point of the assertion: a regression where the filter replies but forgets to close would slip through silently. Match against io.EOF explicitly. The TCP semantics are deterministic here: the server writes PROG_MISMATCH, calls conn.Close(), the client reads what's left in flight and then sees a clean FIN, which surfaces as io.EOF on the next zero-byte read. * fix(nfs): reject short first fragments before parsing RPC header fields bufio.Reader.Peek(28) is willing to read across record boundaries to satisfy the requested length, so a final fragment whose body is shorter than the 24-byte fixed RPC CALL header (xid + msg_type + rpcvers + prog + vers + proc) leaves the trailing peek bytes pointing at the next RPC's framing or whatever bytes happen to follow on the wire. Indexing hdr[16:24] for prog/vers in that state can spuriously reject (or pass through) traffic based on data that doesn't belong to the request being classified. Drop those frames out of the filter early: if the first fragment can't possibly hold a full CALL header, pass the connection straight to go-nfs, which has its own framing-error handling for malformed input. Add a regression test that crafts a 12-byte first fragment whose trailing peek bytes are deliberately shaped like an NFSv4 CALL — without the length check the filter sends a PROG_MISMATCH; with it, the conn passes through silently. Verified by stashing the production-code change and running the test in isolation: it fails as expected without the fix. * fix(nfs): retry transient Accept() errors instead of treating any error as terminal acceptLoop previously exited on the first error returned by the inner listener's Accept(). That conflates two very different failure modes: permanent shutdown (the listener was Close()d, OS-level fatal failure) and transient resource pressure (EMFILE, EAGAIN, ECONNABORTED on accept). The transient case should not take the entire NFS server down — a single fd-table-full event would leave the deployment offline until restart. Classify the error: errors.Is(err, net.ErrClosed) is the permanent signal we already wanted to surface to Accept(); everything else is transient. Log at V(1) and back off rpcVersionFilterAcceptBackoff (50ms, mirroring portmap.go's portmapRetryBackoff) before retrying. The backoff sleep is interruptible via the closed channel so Close() still shuts the loop down promptly. Add a regression test that wraps a real listener with one that injects 3 fake transient errors before delegating, and asserts Accept() still delivers the next real connection. Verified the test fails on the old "any error is terminal" loop and passes with this change. * fix(nfs): only synthesize PROG_MISMATCH for ONC RPC v2 traffic The filter was rejecting any CALL-shaped record with prog=100003 or 100005 and vers!=3, regardless of the rpcvers field. If the caller is speaking some other protocol that happens to share the port — or just sending garbled bytes — pretending to be an NFSv3 server replying PROG_MISMATCH is misleading at best, and at worst fabricates a coherent RPC reply for traffic we don't actually understand. Add an rpcvers==2 check between the msg_type and prog/vers parses. Any non-v2 record now passes through to go-nfs, whose RFC 5531 §9 RPC_MISMATCH handling is the correct place to reject mis-versioned RPC. Regression test takes a normal v3 NFS CALL frame, overwrites the rpcvers field with 99, and asserts no PROG_MISMATCH-shaped reply lands on the client and that the conn is delivered to the inner accept handler. Verified the test fails on the previous code (filter still rejected on prog/vers alone) and passes with the guard in place. * fix(nfs): bound Close() latency by evicting in-flight prefilter conns Close() does wg.Wait() to drain handleConn goroutines, but each of those goroutines can be parked inside filterFirstRPCFrame's bufio.Peek for up to rpcVersionFilterPeekTimeout (10s) waiting for the very first RPC header. A client that completes the TCP handshake but never sends a byte therefore stretched shutdown by 10s per such conn — a real regression for stop/restart paths and for tests that just want to tear the listener down. Track raw (pre-peek) conns in versionFilterListener.inFlight as handleConn enters, untrack on exit, and have Close() forcibly close every tracked conn before wg.Wait. Closing the underlying conn breaks its Peek immediately, so handleConn returns within a single scheduler hop. trackInFlight also short-circuits if shutdown has already started, so a conn accepted after signalClose can't slip past the eviction. Black-box regression test opens 4 idle TCP-handshake-only conns, lets their handleConn goroutines settle into Peek, and asserts Close() returns under 2s. Verified: same test fails on the previous code with Close taking ~9.9s; passes here at ~100ms. |
||
|
|
5fbe39320c |
fix(volume_server): pin EC shard auto-select to the .ecx-owning disk (#9212) (#9245)
* fix(volume_server): pin EC shard auto-select to the .ecx-owning disk (#9212) ec.rebuild only sets CopyEcxFile=true on the first shard sent to the rebuilder; subsequent shards rely on VolumeEcShardsCopy / ReceiveFile auto-select to land on the same disk. The old auto-select used FindEcVolume (in-memory) to detect the "already has this volume" case. Mid-rebuild, no EC volume has been mounted yet on the destination, so FindEcVolume returns nothing and the fallback picks "any HDD with free space" — which can split shards from their .ecx across disks of the same node and feed the orphan-shard layout reported in #9212 / fixed on the loader side in #9244. Add Store.FindEcShardTargetLocation as the canonical placement primitive: prefer a mounted EC volume, then a disk that has the .ecx on disk, then any HDD, then any disk. DiskLocation.HasEcxFileOnDisk is the new on-disk check, and it looks at IdxDirectory first with a fallback to Directory to handle .ecx written before -dir.idx was configured. Both VolumeEcShardsCopy and ReceiveFile now route through the new helper, dropping their duplicated 4-level fallback ladder. No protocol changes; explicit DiskId callers are unaffected. * fix(volume_server): treat directories named *.ecx as no-match in HasEcxFileOnDisk os.Stat(".ecx") succeeds for both files and directories. If something happens to leave a directory named X.ecx in the data or idx folder, HasEcxFileOnDisk would currently report true and FindEcShardTargetLocation would route shards to that disk — where NewEcVolume's eventual OpenFile(O_RDWR) on the same path errors out. Add a !info.IsDir() check on both stat sites. Cheap and conservative. Suggested in PR #9245 review by @gemini-code-assist. * refactor(volume_server): collapse EC placement helper to a single pass FindEcShardTargetLocation called FindFreeLocation up to four times. Each call iterates s.Locations and acquires VolumesLen / EcShardCount RLocks per disk — for a typical 4-disk node that's 32 RLock cycles per placement decision. Walk s.Locations once, score each disk by tier (mounted > .ecx-on-disk > HDD > any-disk), break ties by free count. The free-slot math is factored into a small helper that mirrors FindFreeLocation's formula without re-entering the location's locks. Behaviour is unchanged: each existing tier still wins over later tiers, and within a tier the disk with the most free count still wins, matching the original max-tracking in FindFreeLocation. Suggested in PR #9245 review by @gemini-code-assist. * refactor(volume_server): thread dataShardCount as a parameter through EC placement ecFreeShardCount and FindEcShardTargetLocation referenced erasure_coding.DataShardsCount directly. Take it as a parameter so custom-ratio builds (e.g. enterprise) can swap the default without touching the helper itself, and so unit tests can pin a specific ratio independent of the package constant. Default callsites in VolumeEcShardsCopy and ReceiveFile now pass the package default explicitly; tests pass a literal 10 for clarity. * fix(volume_server): treat MaxVolumeCount=0 as unlimited in EC placement ecFreeShardCount computed `MaxVolumeCount - VolumesLen()` and went negative when MaxVolumeCount was 0 — the "unlimited disk" sentinel already honoured by Store.hasFreeDiskLocation and friends. With a negative free count, FindEcShardTargetLocation's `freeCount <= 0` guard skipped the disk entirely, so unlimited disks could never receive EC shards via the placement helper. Special-case MaxVolumeCount<=0: report a synthetic large free count that decrements with current usage, so unlimited disks are eligible and tie-breaks still prefer the less-loaded one. Added TestFindEcShardTargetLocation_HonoursUnlimitedDisk as the regression. Reported in PR #9245 review by @gemini-code-assist. * fix(volume_server): account in shard slots, not volume slots, in ecFreeShardCount FindFreeLocation in store.go ends with `free /= DataShardsCount`, converting "shard slots free" back to "volume-equivalent slots." The truncation is harmless there, but my new ecFreeShardCount inherited the same final divide and re-introduced exactly the orphan-shard hazard #9245 was meant to prevent: with MaxVolumeCount=1, VolumesLen=0, EcShardCount=1 the formula reports 0 even though the disk has room for 9 more shards, so subsequent shards route off the .ecx-owning disk into the HDD-fallback tier. Drop the trailing divide and return the count directly in shard slots. Same shape, finer granularity; tie-breaks still order by free count. The unlimited branch's "used" calculation is updated to match (mix volume-slots and shard-slots in shard units). Added TestFindEcShardTargetLocation_TightProvisioningKeepsEcxDisk as the regression. Reported in PR #9245 review by @coderabbitai. |
||
|
|
2c404f66bc |
Export file_read_invalid_needles metric for REST read requests on invalid file IDs. (#9241)
Provides a straightforward metric to count read requests with incorrect file/needle IDs, which can indicate client issues. Note that the metric does not cover gRPC calls, as the current proto service API does not support seeking files by ID. |
||
|
|
7f770b1553 |
fix(filer): return 503 + Retry-After when remote object not cached yet (#9236)
* extend cache-not-ready handling to filer HTTP path Mirror the s3api change for the native filer HTTP handlers. When the filer GET hits a remote-only object whose cache fill hasn't completed, return 503 Service Unavailable with Retry-After: 5 instead of 500 Internal Error, and treat client disconnects as silent cancellations rather than logging them as errors. Adds an ErrCacheNotReady sentinel and a small helper used at the prepareWriteFn-error sites in ProcessRangeRequest, so the same classification (cancel / not-ready / other) applies to plain GETs, single-range, and multi-range requests. * clear Content-Range on prepareWriteFn error The single-range path sets Content-Range before calling prepareWriteFn. If prepareWriteFn fails, http.Error is about to write a fresh body for 503 or 500, but the stale Content-Range header would still go out and no longer match. Drop it alongside Content-Length in the shared helper so all current and future callers are covered. * strip success-path headers and forward NotFound on prepareWriteFn error When ProcessRangeRequest writes an error response, the previously-set success headers (Content-Disposition, ETag, Last-Modified, in addition to Content-Length/Content-Range) shouldn't ride along on the new body. With ?dl=1 a stale Content-Disposition would even cause browsers to save the error message under the object's filename. Strip them all in the shared helper. Also forward filer_pb.ErrNotFound through the cache-failure branch so a mid-cache entry deletion surfaces as 404, not as a 503 retry-loop. Permanent upstream cloud errors (403/404 from the cloud SDK) still come back as opaque wrapped strings via FetchAndWriteNeedle and remain mapped to 503; distinguishing those would need a wider refactor. |
||
|
|
93247d6de4 |
Export REST file_{read,write}_failures metrics on volume servers (#9215)
* Export gRPC `file_{read,write}_failures` metrics on volume servers.
Allows to track overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.
* Export REST `file_{read,write}_failures` metrics on volume servers.
|