seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-07-30 20:13:23 +00:00

Author	SHA1	Message	Date
Chris LuandGitHub	dd1b428789	s3,iceberg: reject `..` in URL path vars (#9687 ) * s3,iceberg: reject `..`/NUL in URL path vars Both gateway routers use mux.NewRouter().SkipClean(true), so a request like `GET /bucket-A/../evil-bucket/key` survives routing as bucket=bucket-A, object=../evil-bucket/key. The captured key is then joined into a filer path; util.JoinPath / path.Join collapse the `..` server-side and the read lands in evil-bucket. With auth on, IAM still authorizes against bucket-A (the mux var), so policy is evaluated against the wrong target. Add a middleware on the S3 bucket subrouter and the Iceberg REST router that rejects any `.`, `..`, NUL, or — for single-segment slots — embedded slash in the captured path vars before any handler runs. NormalizeObjectKey already folds `\` to `/` and decoding happens in mux, so `%2e%2e` and `..\` are caught. * s3,iceberg: reject empty captured vars and empty namespace parts Comma-ok the var lookup so we only check captured slots, then treat an empty captured value as a rejection on its own — downstream path.Join would otherwise collapse it and let the next segment pick the bucket. For iceberg, also reject empty parts after splitting the namespace on \x1F so leading/trailing/consecutive unit separators (which parseNamespace silently folds out) don't let distinct route values collapse to the same parsed namespace. Register loggingMiddleware before validateRequestPath on the iceberg router so rejected requests still produce an audit-log line.	2026-05-26 01:04:59 -07:00
Chris Lu	1355c7a102	4.29	2026-05-25 22:41:25 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	f72c5ec5d3	build(deps): bump github.com/go-sql-driver/mysql from 1.9.3 to 1.10.0 (#9682 ) Bumps [github.com/go-sql-driver/mysql](https://github.com/go-sql-driver/mysql) from 1.9.3 to 1.10.0. - [Release notes](https://github.com/go-sql-driver/mysql/releases) - [Changelog](https://github.com/go-sql-driver/mysql/blob/master/CHANGELOG.md) - [Commits](https://github.com/go-sql-driver/mysql/compare/v1.9.3...v1.10.0) --- updated-dependencies: - dependency-name: github.com/go-sql-driver/mysql dependency-version: 1.10.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:37:47 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	96f521addc	build(deps): bump github.com/linxGnu/grocksdb from 1.10.7 to 1.10.8 (#9683 ) Bumps [github.com/linxGnu/grocksdb](https://github.com/linxGnu/grocksdb) from 1.10.7 to 1.10.8. - [Release notes](https://github.com/linxGnu/grocksdb/releases) - [Commits](https://github.com/linxGnu/grocksdb/compare/v1.10.7...v1.10.8) --- updated-dependencies: - dependency-name: github.com/linxGnu/grocksdb dependency-version: 1.10.8 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:22:00 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	584da4cd10	build(deps): bump golang.org/x/crypto from 0.51.0 to 0.52.0 (#9681 ) Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.51.0 to 0.52.0. - [Commits](https://github.com/golang/crypto/compare/v0.51.0...v0.52.0) --- updated-dependencies: - dependency-name: golang.org/x/crypto dependency-version: 0.52.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:21:44 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	56b9df937c	build(deps): bump golang.org/x/sys from 0.44.0 to 0.45.0 (#9680 ) Bumps [golang.org/x/sys](https://github.com/golang/sys) from 0.44.0 to 0.45.0. - [Commits](https://github.com/golang/sys/compare/v0.44.0...v0.45.0) --- updated-dependencies: - dependency-name: golang.org/x/sys dependency-version: 0.45.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:21:36 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	e8ed043d2b	build(deps): bump go.etcd.io/etcd/client/pkg/v3 from 3.6.10 to 3.6.11 (#9679 ) Bumps [go.etcd.io/etcd/client/pkg/v3](https://github.com/etcd-io/etcd) from 3.6.10 to 3.6.11. - [Release notes](https://github.com/etcd-io/etcd/releases) - [Commits](https://github.com/etcd-io/etcd/compare/v3.6.10...v3.6.11) --- updated-dependencies: - dependency-name: go.etcd.io/etcd/client/pkg/v3 dependency-version: 3.6.11 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:21:28 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	502fef6b50	build(deps): bump docker/login-action from 4.1.0 to 4.2.0 (#9678 ) Bumps [docker/login-action](https://github.com/docker/login-action) from 4.1.0 to 4.2.0. - [Release notes](https://github.com/docker/login-action/releases) - [Commits](https://github.com/docker/login-action/compare/v4.1.0...v4.2.0) --- updated-dependencies: - dependency-name: docker/login-action dependency-version: 4.2.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:21:20 -07:00
Chris LuandGitHub	b21c263328	test/fuse_dlm: cross-mount POSIX locks + survival across a ring change (#9677 ) Adds two FUSE integration tests on the existing dlm cluster harness (the -dlm mounts route advisory locks to the owner filer): - TestPosixLockCrossMount: an flock taken on one mount blocks the other, and is grantable after release — the routed-to-owner path end to end. - TestPosixLockSurvivesFilerLoss: hold flocks on many files, stop filer1 so keys it owned migrate to filer0; after the ring settles and the holding mount re-asserts, every lock is still honored. Asserts only the settled state; the transient migration window is unit-covered. Locks are taken on read-only fds so the -dlm whole-file write lock (a different mechanism, held until close) isn't involved. Skipped on non-Linux: only Linux forwards advisory locks (SETLK) to the FUSE server; macFUSE handles flock in-kernel per mount.	2026-05-25 16:20:23 -07:00
Chris LuandGitHub	c9868dcf2f	filer/posixlock: remove the unused lock-set serde (#9676 ) The codec (Set.Marshal/Unmarshal) and its posix_lock.proto were built to let the lock set ride in an inode's entry metadata, but the authority is in-memory and ownership handoff/restart is handled by mounts re-asserting their held locks over the RPC — neither serializes the set. Nothing calls the serde outside its own tests, so drop it (codec, proto, generated pb, Makefile). The in-memory Set/Manager are unchanged.	2026-05-25 13:15:19 -07:00
Chris LuandGitHub	85ca3cb757	filer: warm-up + fail-closed cooling for POSIX locks on owner (re)start (#9673 ) After a (re)start the owner defers would-be grants for posixLockWarmup while mounts re-assert, trusting only locally-visible conflicts, so it does not double-grant from empty state; a deferred grant is a retry for SetLkw and EAGAIN for non-blocking SetLk, never a spurious grant. Cooling now fail-closes: if the previous owner is unreachable during a ring change, defer rather than risk a double-grant. readyAt is atomic so the handler reads it without locking.	2026-05-25 13:14:05 -07:00
Chris LuandGitHub	a3c0baa9b0	filer: cooling-off dual-read for POSIX locks during ring changes (#9672 ) While the ring changed within the last snapshot interval, a fresh owner asks the key's previous owner (LockRing.PriorOwner) whether it still holds a conflicting lock before granting TRY_LOCK or answering GET_LK, so it does not double-grant before re-assertion rebuilds its local state. The probe is marked cooling_probe so the previous owner answers from local state without recursing. PriorOwner uses the snapshot's prebuilt ring rather than rebuilding a hash ring per call.	2026-05-25 12:34:15 -07:00
7y-9 and GitHub	881226a81b	fix: avoid rclone nil close panics (#9674)	2026-05-25 09:53:45 -07:00
Chris Lu and GitHub	f8caaa4464	mount,filer: re-assert POSIX locks via keepalive (ownership migration + restart) (#9668) * mount: renew POSIX lock leases via keepalive The mount tracks the inode keys it holds locks on and a background loop renews its session lease (KEEP_ALIVE) with each key's owner filer every 5s, within the filer's 15s TTL. A live mount is never reaped; a dead one stops renewing and owners reclaim its locks. Tracking is a superset: holds are added on grant and dropped only on owner release, so a still-held lock is never under-renewed. * mount,filer: re-assert held POSIX locks via keepalive The owner filer holds POSIX advisory locks as in-memory soft state, so a key's owner change (ring rebalance) or an owner restart lost or stranded them: the new or restarted owner was blind to existing holders and would double-grant. Make the keepalive carry the mount's held lock ranges per key. The mount mirrors its own granted locks (posixOwn), and each tick re-asserts them to the key's current owner, which rebuilds that session's locks from the assertion, self-healing after a takeover or restart. The owner arbitrates re-asserted locks against other sessions so it never double-grants; a lock that lost a migration race is reported, not forced. A bare keepalive (no ranges) still just renews.	2026-05-25 01:02:45 -07:00
Chris LuandGitHub	c97b69f8a4	filer: session lease + reaping for POSIX locks (#9666 ) * filer: session lease + reaping for POSIX locks A mount renews its session lease by keepalive (new KEEP_ALIVE op); the owner filer records last-seen per session and a background sweeper reaps the locks of leased sessions that stop renewing — a dead or partitioned mount. Only sessions that have renewed are leased, so this is inert until mounts run with -posixLock. * mount: route POSIX advisory locks to the owner filer (-posixLock) (#9665) mount: route POSIX advisory locks to the owner filer under -dlm With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to the inode's owner filer via the PosixLock RPC instead of the local table, so flock/fcntl are honored across mounts. Advisory locking rides the same switch as whole-file write coordination — and is therefore off under writeback cache, which implies single-writer. The mount calls its filer and relies on filer-side forwarding to reach the owner. Keys are the inode identity (HardLinkId else path); SetLkw is client-side polling with the FUSE cancel channel (no server wait queue); a per-mount session id namespaces owners; a local hint avoids a release RPC on every close. * mount,filer: bound posix-lock release RPCs and stop the reaper on shutdown The unlock/release RPCs run off the syscall path (close/flush) and used context.Background() with no deadline, so a slow or unreachable filer could hang close() indefinitely; bound them to 5s (they still aren't cancelled by an interrupt). The lease-reaping sweeper now selects on a stop channel that FilerServer.Shutdown closes, instead of looping for the process lifetime.	2026-05-25 00:00:59 -07:00
Chris LuandGitHub	3976264391	mount: keep the posix-lock hint until the release RPC succeeds (#9670 ) routedReleasePosixOwner dropped the local owner hint before sending RELEASE_POSIX_OWNER, so a transient RPC failure left the lock held on the owner filer with no local record to retry from — stranded until session-lease reaping. Drop the hint only after a successful release; on failure keep it so a later flush retries, with lease reaping as the backstop.	2026-05-25 00:00:34 -07:00
Chris LuandGitHub	3481f13f54	mount: route POSIX advisory locks to the owner filer under -dlm (#9669 ) With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to the inode's owner filer via the PosixLock RPC instead of the local table, so flock/fcntl are honored across mounts. Advisory locking rides the same switch as whole-file write coordination — and is therefore off under writeback cache, which implies single-writer. Keys are the inode identity (HardLinkId else path); SetLkw is client-side polling with the FUSE cancel channel (no server wait queue); a per-mount session id namespaces owners; a local hint avoids a release RPC on every close. Background unlock/release RPCs are bounded so a stuck filer can't hang close().	2026-05-24 23:56:37 -07:00
Chris LuandGitHub	68cae26c0b	mount: fix SetAttr/GetAttr crash from concurrent chunk append under writebackCache (#9667 ) * mount: hold the entry lock while reading chunk size in GetAttr/SetAttr Async upload workers append chunks to an open handle's shared entry under the LockedEntry lock (FileHandle.AddChunks), but GetAttr and SetAttr computed FileSize by iterating entry.Chunks without taking it. A concurrent append that reallocated the backing array tore the slice read and crashed in filer.TotalSize. Surfaces with -writebackCache, where handles stay open and flush asynchronously while metadata ops keep arriving. Take the LockedEntry lock for those reads (and SetAttr's truncate rewrite). * mount: re-read entry under the lock in GetAttr/SetAttr If SetEntry swapped the handle's entry pointer between maybeReadEntry and the lock acquisition, the old pointer is orphaned. Re-read fh.entry.Entry under the lock so SetAttr mutates the live entry instead of losing the update, and GetAttr reports the current one. * mount: cover the truncate path in TestAttrChunkRace Alternate SetAttr between mtime-only and a shrinking size so the test also exercises the entry.Chunks rewrite under fh.entry.Lock, not just the read-side size walk. * mount: snapshot chunks under the entry lock on the read path readFromChunks holds fh.entryLock (excludes SetAttr) but not the LockedEntry lock the async uploader appends under, so IsInRemoteOnly, the FileSize fallback, and the RDMA/peer chunk walks read entry.Chunks while AddChunks reallocated it — the same torn-slice crash as GetAttr/SetAttr. Snapshot size, inline content, and the chunk list under a brief LockedEntry RLock, then hand the snapshot to the RDMA/peer helpers instead of holding the lock across network I/O. The captured slice stays valid: append never mutates the old backing array, and truncate is excluded by the fh.entryLock.	2026-05-24 23:49:41 -07:00
Chris LuandGitHub	fef49c2d75	filer: routed PosixLock RPC over the in-memory authority (#9664 ) * filer: in-memory POSIX lock authority (Manager) Concurrent multi-inode authority over the per-inode Set: a Set per opaque inode key (path, or hl:<HardLinkId>) plus a session->keys index so a dead mount's locks reap in O(locks held). Lock state stays in memory like the distributed lock manager's, off the replicated meta-log. TryLock/Unlock/ GetLk/ReleasePosixOwner/ReleaseFlockOwner/ReleaseSession; empty sets and stale index entries are pruned on release. * filer: routed PosixLock RPC over the in-memory authority Adds the PosixLock RPC (try/unlock/get_lk + the flush/release owner drops) that the owner filer answers from its in-memory Manager. The request key is the inode identity ring key; a non-owner filer forwards one hop (is_moved-bounded), mirroring ObjectTransaction, so the owner's table stays the single authority under a stale ring view. Strictly non-blocking; SetLkw polling lives in the mount.	2026-05-24 22:50:42 -07:00
Chris LuandGitHub	564b94796a	filer: in-memory POSIX lock authority (Manager) (#9663 ) Concurrent multi-inode authority over the per-inode Set: a Set per opaque inode key (path, or hl:<HardLinkId>) plus a session->keys index so a dead mount's locks reap in O(locks held). Lock state stays in memory like the distributed lock manager's, off the replicated meta-log. TryLock/Unlock/ GetLk/ReleasePosixOwner/ReleaseFlockOwner/ReleaseSession; empty sets and stale index entries are pruned on release.	2026-05-24 22:43:17 -07:00
Chris LuandGitHub	475ae2b443	filer: serialize the POSIX lock set for entry metadata (phase 2) (#9661 ) * filer: POSIX advisory lock set primitive (phase 1) Pure per-inode conflict/coalesce/range-split logic for fcntl byte-range and flock whole-file locks, extracted from the mount's PosixLockTable without its wait queue or inode-map concurrency. Owner identity is (Sid, Owner) so the same FUSE owner on different mounts never aliases, and ReleaseSession reaps a dead mount's locks. The owner filer will hold one Set per inode under the per-path lock; no concurrency control here. * test: tolerate transient FUSE invisibility in ConcurrentReadWrite A concurrent truncating overwrite leaves a short-lived dentry/cache window where the file is momentarily ENOENT to another opener. Retry the reads and writes a few times before failing, as ConcurrentDirectoryOperations does. * filer: serialize the POSIX lock set for entry metadata Versioned fixed-width binary encoding of a Set, so an inode's held locks can ride in its entry metadata: a lock op materializes the Set from the blob, applies under the per-path lock, and writes it back. Empty set encodes to nil so a lock-free inode carries no blob. * filer: encode the POSIX lock set as protobuf Replace the hand-rolled fixed-width codec with a LockSetProto message, so the metadata blob can gain fields without a format-version migration. proto.Unmarshal already rejects a malformed blob, so the explicit version and length checks go away. Marshal now returns an error to match.	2026-05-24 22:33:36 -07:00
Chris LuandGitHub	e8e7cd6fac	filer: POSIX advisory lock set primitive (phase 1 of distributed FUSE locking) (#9660 ) * filer: POSIX advisory lock set primitive (phase 1) Pure per-inode conflict/coalesce/range-split logic for fcntl byte-range and flock whole-file locks, extracted from the mount's PosixLockTable without its wait queue or inode-map concurrency. Owner identity is (Sid, Owner) so the same FUSE owner on different mounts never aliases, and ReleaseSession reaps a dead mount's locks. The owner filer will hold one Set per inode under the per-path lock; no concurrency control here. * test: tolerate transient FUSE invisibility in ConcurrentReadWrite A concurrent truncating overwrite leaves a short-lived dentry/cache window where the file is momentarily ENOENT to another opener. Retry the reads and writes a few times before failing, as ConcurrentDirectoryOperations does.	2026-05-24 21:56:48 -07:00
Chris LuandGitHub	0f1e50f9ec	fix(master): re-register volumes missing from the lookup index A disconnect/reconnect race could drop a volume from vid2location while it stayed in the data node's disk map, so it showed in volume.list and the admin UI but LookupVolume returned "volume id not found" and never self-healed (the full heartbeat only registered volumes new to the disk map). The full heartbeat now re-registers any reported volume missing from the lookup index, reusing the already-resolved VolumeLayout.	2026-05-24 15:11:09 -07:00
Chris LuandGitHub	2a4923e7e8	ObjectTransaction: filer-side forwarding via route_key (#9659 ) A non-owner filer forwards the whole transaction to the ring owner of route_key, so the owner's per-path lock stays the single serialization point even when the caller's ring view is stale. is_moved bounds forwarding to one hop. The gateway stamps route_key on every routed builder via the shared objectRouteKey helper. Completes taking S3 object mutations off the distributed lock.	2026-05-24 14:21:06 -07:00
Chris LuandGitHub	25beb7ec48	admin: expose Prometheus metrics (#9652 ) * admin: add -metricsPort flag to expose Prometheus metrics The admin command had no metrics endpoint, so passing -metricsPort (as the operator does for spec.admin.metricsPort) crashed the process with "flag provided but not defined". Wire up -metricsPort/-metricsIp and start the shared Prometheus metrics server, matching filer, master, and volume. * admin: emit maintenance task and worker fleet metrics Add Prometheus metrics for the admin server's distinctive work: the maintenance task queue and the worker fleet that executes it. Task lifecycle: maintenance_tasks_by_status / _by_type gauges (snapshot of the queue), maintenance_tasks_completed_total{type,outcome} counter and maintenance_task_duration_seconds{type} histogram (recorded when a task reaches a terminal state), and last/next scan timestamp gauges. Worker fleet: workers_connected and worker_slots{used,max} gauges, plus worker_events_total{event} counting register/unregister/stale removals. Gauges are snapshotted by a background goroutine on the admin server; counters and the histogram are recorded at their event sites. * admin: read worker slot totals under lock, clear next-scan gauge when idle GetWorkers returns live worker pointers; summing CurrentLoad/MaxConcurrent outside the queue lock races with task assignment and completion. Add GetWorkerSlotTotals to aggregate under the lock. Also reset maintenance_next_scan_timestamp_seconds to 0 when the scanner is not running, so it can't retain a stale value after a stop.	2026-05-24 14:09:02 -07:00
Chris LuandGitHub	6fc212cedb	test: wait for a writable volume before lifecycle tests' first write (#9658 ) Probe one throwaway write once per process before the lifecycle tests run, absorbing the post-start volume-growth window so the first real PutObject doesn't race volume growth and 500. Each call is bounded by the remaining 60s budget; CreateBucket is retried within it.	2026-05-24 14:01:13 -07:00
Chris LuandGitHub	1f0c366583	s3: route metadata-only self-copy off the distributed lock (#9638 ) A non-versioned metadata-only self-copy (CopyObject with source == destination and the REPLACE directive) is a read-modify-write of one entry, which is why it held the distributed lock. It now routes to the owner as a serialized PATCH_EXTENDED: the owner merges the new managed metadata (set the replacements, delete the dropped keys) onto a fresh read of the entry under its per-path lock, so a concurrent change to non-managed keys (legal hold, retention, version id) is preserved instead of clobbered, and bumps mtime. PATCH_EXTENDED gains touch_mtime for the mtime bump. Versioned and suspended self-copies create a new version (already routed via the copy finalize) and the no-owner bootstrap keep the lock.	2026-05-24 12:32:57 -07:00
Chris LuandGitHub	fa7056dc6f	s3: route object-lock version-specific deletes off the distributed lock (#9657 ) A version-specific DELETE (real version or the null version, including object-lock WORM-checked ones and governance-bypass) now runs as one routed transaction on the object's owner instead of holding the distributed lock. For a real version: recompute the .versions pointer excluding the version (repoint-before-delete, so a crash leaves a recoverable orphan rather than a dangling pointer), then delete the version file, under the object's per-path lock. The null version is the regular object entry, deleted directly (no pointer). Object-lock buckets gate the delete on the version's WORM guards evaluated on the owner: legal hold (always) + retention (while not elapsed). Governance bypass scopes the retention guard to COMPLIANCE mode, so the filer allows a governance-mode delete while still denying compliance and legal hold — the gateway never reads the version. Three primitives make this expressible: - ObjectTransaction.condition_key: evaluate the condition against a named entry (the version) while the lock stays on lock_key (the object). - Recompute.exclude_name: omit a child from the scan, to repoint before delete. - WriteCondition.Clause gate_key/gate_value: scope IF_EXTENDED_TIME_ELAPSED to a mode, expressing governance bypass without a gateway-side read.	2026-05-24 11:41:08 -07:00
Chris LuandGitHub	eeda7181aa	s3: route multipart-upload completion off the distributed lock (#9632 ) completeMultipartUpload routes its writes to the object's owner filer when an owner is known, off the distributed lock. Idempotent replay is handled gateway-side in prepareMultipartCompletionState (it returns the existing result when the object already carries this UploadId), so the lock is not needed to dedupe retries; with no owner yet, the lock remains as the bootstrap path. Versioned completion flips the .versions pointer via routedVersionedFinalize (RECOMPUTE_LATEST). Non-versioned and suspended completion write the object via routedMkFile (a routed PUT) so the write serializes with concurrent writes to the same key on the owner's per-path lock. The version file itself is a unique path and stays a plain mkFile.	2026-05-24 11:07:39 -07:00
Chris LuandGitHub	4b9d46b5ad	s3: route versioned COPY and delete-marker off the DLM (#9633 ) s3: route versioned/suspended delete markers and versioned COPY off the lock createDeleteMarker flips the .versions pointer via routedVersionedFinalize (RECOMPUTE_LATEST on the owner filer) when an owner is known, so an Enabled or Suspended DeleteObject takes its pointer flip off the distributed lock; the delete marker file is written first and the owner re-derives the pointer. DeleteObjectHandler routes a versioned/suspended delete with no specific version straight to the owner, off the lock. A specific-version delete and object-lock buckets keep the lock (the former needs a recompute-after-delete handled separately; the latter needs gateway-side enforcement). CopyObject into a versioned bucket finalizes the new version through the same routed pointer flip.	2026-05-24 07:22:27 -07:00
Chris LuandGitHub	5bac8b9281	s3: route object-lock object writes off the distributed lock (#9635 ) routableWriteOwner no longer excludes object-lock buckets, so a versioned PUT (which creates a new version, never overwriting a locked one) and a non-versioned overwrite (WORM-checked gateway-side before dispatch) route to the owner filer like any other write. routedObjectOwner still excludes object-lock: an unversioned object-lock delete enforces WORM under the lock, so it stays there rather than routing past the check. Version-specific deletes likewise stay on the lock — routing them needs the WORM check (on the version entry) and the latest-pointer recompute (on the object) under one transaction, which the current single condition target cannot express.	2026-05-24 07:20:44 -07:00
Chris LuandGitHub	db954b5503	s3: route versioned PutObject finalize off the DLM (#9631 ) s3: route versioned PutObject finalize off the distributed lock A versioned write's finalize (flip the .versions pointer to the newest version, demote the prior latest) now runs as a single RECOMPUTE_LATEST ObjectTransaction on the object's owner filer, under its per-path lock, instead of the unserialized updateLatestVersionInDirectory. The version file is written first; the owner re-derives the pointer by scanning the directory. RECOMPUTE_LATEST gains size_to_key / mtime_to_key to cache the chosen version's size and mtime on the pointer, and demote_key / demote_value to stamp the displaced prior latest (NoncurrentSinceNs for lifecycle) when the pointer moves. Falls back to updateLatestVersionInDirectory when no owner is known yet.	2026-05-24 03:10:30 -07:00
Chris Lu and GitHub	32aa70ab59	s3: serialize bucket config writes with field-level filer patches (#9655) When PutBucketVersioning and PutBucketEncryption ran concurrently, each did a whole-entry read-modify-write of the bucket entry, so one could overwrite the other's field with a stale copy. Each config write is now a field-level PATCH_EXTENDED (extended attributes) or set_content (the metadata blob) ObjectTransaction, routed to the bucket's owner filer and merged onto a fresh read under its per-path lock. Disjoint fields no longer clobber each other.	2026-05-24 02:30:26 -07:00
Chris LuandGitHub	f9bc6adf98	s3: route single-entry object writes to the owner filer, off the DLM (#9629 ) s3: route non-versioned object PUT and DELETE off the distributed lock A non-versioned, non-object-lock object write now goes straight to the key's owner filer as a single-mutation ObjectTransaction, which serializes it with the owner's per-path lock and evaluates the precondition, instead of taking a cluster-wide lock. PUT and DELETE use the object's full path as the lock key, so a concurrent create and delete of the same key serialize against each other. The fast path is taken only when the precondition reduces to clauses the filer can evaluate (existence and a single strong-ETag match); time-based conditions, ETag lists, weak ETags, post-create hooks, and an unknown owner fall back to the lock. A routed mutation error other than a failed precondition also falls back, so the lock path stays the authority for the cases it alone covers. PrimaryForKey returns "" until the ring view arrives, keeping writes on the lock until routing is known.	2026-05-24 02:10:32 -07:00
Chris Lu and GitHub	f037fc4dce	s3: dial the object lock's primary filer directly (#9626) * s3: dial the object lock's primary filer directly The S3 object write lock builds a fresh short-lived lock per write, each starting at the seed filer. When the seed isn't the key's hash-ring primary the filer forwards the request to the primary, and in multi-cluster setups that forward crosses clusters on every write. Give the lock client a view of the filer lock ring, fed by the master's LockRingUpdate broadcasts the gateway already receives, so it dials the primary directly. The view tracks filer membership by version; a stale view stays correct because the filer still forwards as a fallback. Also send the initial ring snapshot to S3 clients, not just filers. * s3: subscribe to lock-ring updates before starting the master loop The master delivers the initial LockRingUpdate once, on connect. Registering the callback after KeepConnectedToMaster started left a window where that first update could arrive before the handler was set and be dropped, delaying the ring view until the next membership change. Build the lock client and register the callback in the masters block before launching the loop; the filers block reuses that client (or creates a plain one when no masters are configured). * lock_manager: build the hash ring in a deterministic server order rebuildRing ranged over the server set (a map), whose iteration order is randomized per process. On a vnode hash collision the last writer into vnodeToServer wins, so two nodes holding the same server set could resolve the collision to different servers and disagree on the primary for keys near that slot. Now that the S3 gateway also computes PrimaryForKey, such a disagreement would route the same key to different filers and defeat per-path serialization. Iterate the servers in sorted order so the ring is identical on every node with the same set, regardless of discovery order. * lock_manager: skip redundant ring rebuilds, trim comments SetRing now ignores a non-zero version at or below the current one once a ring exists, so repeated LockRingUpdate broadcasts on reconnect no longer rebuild the ring. * s3: hold the lock-ring client on the server for route-by-key Store the object-write lock client on S3ApiServer so handlers can resolve a key's owner filer via PrimaryForKey.	2026-05-24 00:40:43 -07:00
Chris Lu GitHubgemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	b4d2224e97	filer: let PATCH_EXTENDED replace Entry.content (#9654 ) * filer: let PATCH_EXTENDED replace Entry.content PATCH_EXTENDED merges extended attributes under the per-path lock, reading the entry fresh, so concurrent patches to different keys don't clobber each other. Some single-key state lives in Entry.content rather than an extended attribute (e.g. the S3 bucket metadata blob). Add set_content/content to the mutation so a patch can replace content the same way -- read fresh, set content, preserve the rest -- letting a content write and an extended-attribute write on the same entry serialize on the lock instead of racing whole-entry rewrites. * Update weed/server/filer_grpc_server.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * filer: test set_content FileSize sync; note chosen content-patch approach Cover the FileSize behavior of a set_content patch: a file's size follows the new content length (including when it shrinks), a directory's stays zero. Also document, in the bucket-config design, that extending PATCH_EXTENDED with set_content is the implemented path for content-backed config. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-05-23 21:43:43 -07:00
Chris Lu and GitHub	83195fc111	filer: reuse the caller's fetched entry in CreateEntry (#9645) CreateEntry starts with a FindEntry to load the current entry. A conditional CreateEntry already fetched that entry to evaluate the precondition under the per-path lock, so the create repeated the lookup. Add an existing *Entry parameter: when non-nil it is used as the current entry and the internal lookup is skipped; nil keeps the lookup. The gRPC CreateEntry handler passes the entry it fetched for the precondition, removing the redundant read while the lock is held. All other callers pass nil.	2026-05-23 21:40:41 -07:00
Chris Lu and GitHub	091aad59dc	filer: add ObjectTransactionBatch for multi-key object writes (#9649) A multi-object delete spans many keys that route to different owner filers. The gateway groups keys by owner and sends one batch per owner; the filer applies each transaction under its own per-path lock, independent of the others. A failed transaction (precondition or mutation error) is reported in its own response without aborting the rest, matching S3 multi-object semantics where each key succeeds or fails on its own. There is no cross-key atomicity, which S3 batch delete does not require.	2026-05-23 21:09:02 -07:00
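The batching behaviour in the 091aad59dc message (group keys by owner, apply each key independently, report failures per key) can be sketched as below. The names `groupByOwner`, `applyBatch`, `txResult`, and the toy `ownerOf` routing are hypothetical and do not reflect the actual ObjectTransactionBatch protobuf or gateway code.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errPrecondition stands in for a per-key precondition failure.
var errPrecondition = errors.New("precondition failed")

// txResult is a hypothetical per-key outcome; Err is nil on success.
type txResult struct {
	Key string
	Err error
}

// groupByOwner buckets keys by the filer that owns them, so the gateway
// can send one batch per owner. ownerOf is an assumed routing function.
func groupByOwner(keys []string, ownerOf func(string) string) map[string][]string {
	batches := make(map[string][]string)
	for _, k := range keys {
		o := ownerOf(k)
		batches[o] = append(batches[o], k)
	}
	return batches
}

// applyBatch applies every key on its own: a failure is recorded in that
// key's result and does not abort the remaining keys.
func applyBatch(keys []string, apply func(string) error) []txResult {
	results := make([]txResult, 0, len(keys))
	for _, k := range keys {
		results = append(results, txResult{Key: k, Err: apply(k)})
	}
	return results
}

func main() {
	ownerOf := func(key string) string { // toy routing, for illustration only
		if len(key)%2 == 0 {
			return "filer-a"
		}
		return "filer-b"
	}
	del := func(key string) error {
		if key == "locked" {
			return errPrecondition
		}
		return nil
	}
	batches := groupByOwner([]string{"a", "bb", "locked", "dddd"}, ownerOf)
	owners := make([]string, 0, len(batches))
	for o := range batches {
		owners = append(owners, o)
	}
	sort.Strings(owners)
	for _, o := range owners {
		for _, r := range applyBatch(batches[o], del) {
			fmt.Println(o, r.Key, r.Err)
		}
	}
}
```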
Chris Lu and GitHub	dc5621d2ae	s3: use oidc: prefix for trust-policy conditions in IAM example (#9653) * s3: use oidc: prefix for trust-policy conditions in IAM example Trust-policy conditions for AssumeRoleWithWebIdentity see OIDC claims under the oidc: prefix, so the docker example's bare "roles" key never matched and denied every web-identity assume against those roles. Switch the three roles to oidc:roles. Also document the available trust-policy condition keys (oidc:iss/sub/aud, oidc:<claim>, aws:FederatedProvider, aws:userid, sts:DurationSeconds) and note that roleMapping selects the role for direct OIDC bearer auth while STS uses the explicit RoleArn plus trust policy. * s3: clarify aws:userid differs between trust policy and request auth aws:userid is the raw sub claim during trust-policy evaluation, but a stable sub+iss hash (ComputeParentUser) during S3 request authorization after the role is assumed. Note both so the two contexts aren't conflated.	2026-05-23 20:02:48 -07:00
Chris Lu and GitHub	e2203b2a0b	filer: add extended-attribute guard clauses for object-lock (#9648) Routing object-lock buckets off the distributed lock needs the retention and legal-hold check to run atomically with the write, under the per-path lock. Move just the comparison into the filer, not the S3 semantics: two generic clause kinds on an extended attribute. IF_EXTENDED_NOT_EQUAL blocks while extended[ext_key] equals ext_value (a legal hold). IF_EXTENDED_TIME_ELAPSED blocks while extended[ext_key], read as a unix-second deadline, is in the future against the filer's clock (retention); a malformed deadline fails safe. The caller composes these from the object-lock state and, for a governance bypass, simply omits the retention clause once the bypass is authorized -- the filer makes no authorization decision and keeps no S3 knowledge.	2026-05-23 19:38:08 -07:00
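A minimal sketch of the two guard clauses from the e2203b2a0b message follows. The `clause` struct, the constant names, and the `holds` method are illustrative rather than the filer's real types. One point is an explicit assumption: the commit message does not say how a missing attribute is treated, and this sketch lets the write proceed in that case.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

type clauseKind int

const (
	ifExtendedNotEqual    clauseKind = iota // blocks while extended[key] == value
	ifExtendedTimeElapsed                   // blocks while the deadline is in the future
)

// clause is a hypothetical guard on one extended attribute.
type clause struct {
	Kind     clauseKind
	ExtKey   string
	ExtValue string // only used by ifExtendedNotEqual
}

// holds reports whether the write may proceed, given the entry's extended
// attributes and the evaluating filer's clock in unix seconds.
func (c clause) holds(extended map[string][]byte, nowUnix int64) bool {
	raw, ok := extended[c.ExtKey]
	switch c.Kind {
	case ifExtendedNotEqual:
		return !ok || string(raw) != c.ExtValue
	case ifExtendedTimeElapsed:
		if !ok {
			return true // assumption: no deadline recorded means nothing to wait for
		}
		deadline, err := strconv.ParseInt(string(raw), 10, 64)
		if err != nil {
			return false // malformed deadline fails safe: keep blocking
		}
		return nowUnix >= deadline
	}
	return false // unknown clause kinds also fail safe in this sketch
}

func main() {
	now := time.Now().Unix()
	ext := map[string][]byte{
		"legal-hold":   []byte("ON"),
		"retain-until": []byte(strconv.FormatInt(now+3600, 10)),
	}
	hold := clause{Kind: ifExtendedNotEqual, ExtKey: "legal-hold", ExtValue: "ON"}
	retention := clause{Kind: ifExtendedTimeElapsed, ExtKey: "retain-until"}
	fmt.Println(hold.holds(ext, now), retention.holds(ext, now))
}
```

The attribute names `legal-hold` and `retain-until` are placeholders; in the design described above the caller chooses the keys and the filer only compares.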
Chris Lu and GitHub	e71bac55e9	filer: add RECOMPUTE_LATEST mutation to ObjectTransaction (#9647) Deleting a specific version that happens to be the latest needs the new latest re-derived from the remaining versions, and that scan must run under the same lock as the delete. The gateway can't do it atomically across RPCs. Add a RECOMPUTE_LATEST mutation: it scans a directory under the transaction lock, picks the child that sorts last (descending) or first by name, copies the mapped extended keys from it into a pointer entry, and stores its name under name_to_key. An empty directory clears the pointer keys. The filer stays mechanical and S3-agnostic: the caller, which knows the versioning scheme, supplies the sort direction and the key mappings. A missing pointer entry is a no-op, so a replayed transaction is idempotent.	2026-05-23 18:29:46 -07:00
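The RECOMPUTE_LATEST steps in the e71bac55e9 message can be sketched in memory as below. The `entry` type, the `recomputeLatest` signature, and the direction of `keyMap` (child key to pointer key) are assumptions for illustration; the real mutation works on filer entries under the transaction lock.

```go
package main

import "fmt"

// entry is a stand-in for a filer entry; only extended attributes matter here.
type entry struct {
	Extended map[string][]byte
}

// recomputeLatest picks the child that sorts last (descending) or first by
// name, copies the mapped extended keys into the pointer entry, and stores
// the child's name under nameToKey. An empty directory clears those keys.
// A nil pointer entry is a no-op, which keeps a replay idempotent.
func recomputeLatest(children map[string]*entry, pointer *entry,
	descending bool, keyMap map[string]string, nameToKey string) {
	if pointer == nil {
		return
	}
	if pointer.Extended == nil {
		pointer.Extended = make(map[string][]byte)
	}
	latest, found := "", false
	for name := range children {
		if !found || (descending && name > latest) || (!descending && name < latest) {
			latest, found = name, true
		}
	}
	if !found {
		delete(pointer.Extended, nameToKey)
		for _, dst := range keyMap {
			delete(pointer.Extended, dst)
		}
		return
	}
	pointer.Extended[nameToKey] = []byte(latest)
	for src, dst := range keyMap {
		if v, ok := children[latest].Extended[src]; ok {
			pointer.Extended[dst] = v
		} else {
			delete(pointer.Extended, dst)
		}
	}
}

func main() {
	children := map[string]*entry{
		"v001": {Extended: map[string][]byte{"etag": []byte("aaa")}},
		"v002": {Extended: map[string][]byte{"etag": []byte("bbb")}},
	}
	pointer := &entry{}
	recomputeLatest(children, pointer, true, map[string]string{"etag": "latest-etag"}, "latest-name")
	fmt.Println(string(pointer.Extended["latest-name"]), string(pointer.Extended["latest-etag"]))
}
```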
Chris Lu and GitHub	bf022ca018	filer: add ObjectTransaction for atomic multi-entry object writes (#9646) A versioned object write touches several entries that must change together: the main object, a delete marker or version file, and the latest pointer on the .versions directory. Holding a distributed lock across separate RPCs to do this is what the per-path lock was meant to replace, but a single CreateEntry only covers one entry. Add ObjectTransaction: a request carries a lock_key (the object path), an optional WriteCondition, and an ordered list of mutations (PUT / DELETE / PATCH_EXTENDED). The filer holds the per-path lock on lock_key for the whole call, checks the condition against the entry at lock_key, then applies the mutations in order. Callers route the object's writes to its owner filer so the lock is authoritative across all of the object's entries. DELETE and PATCH of an absent entry are no-ops, so a replayed transaction is idempotent. PUT entries are metadata-scoped; data-bearing writes (chunks) are written before the transaction, as today.	2026-05-23 17:34:30 -07:00
Chris Lu and GitHub	b18d3dc96c	filer: evaluate a write precondition in CreateEntry (#9650) Add an optional WriteCondition to CreateEntryRequest. When set, the filer evaluates it against the current entry while holding the per-path lock, so the check and the write are atomic on this filer, and returns PRECONDITION_FAILED when it does not hold. The caller must route the key's writes to the owner filer for the check to be authoritative. A condition is a list of clauses that all must hold (logical AND). One clause is the common case; several express what a single comparison cannot: an ETag set (If-Match / If-None-Match with multiple values), weak-ETag comparison, and compound conditions. ETag comparison mirrors the S3 gateway's precedence (stored Seaweed ETag attribute, then the Md5/chunk fallback) and follows RFC 7232 strong/weak rules, so results match without coupling the filer to S3 handling. Condition parsing and evaluation live in filer_grpc_server_condition.go.	2026-05-23 16:29:14 -07:00
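The AND-of-clauses condition and the RFC 7232 strong/weak ETag rules mentioned in the b18d3dc96c message can be sketched as follows. This is not the code in filer_grpc_server_condition.go: the `clause` function type, `ifMatch`, `ifNoneMatch`, and `evaluate` are hypothetical names, and stripping surrounding quotes in `parseETag` is an assumption about how tags are stored.

```go
package main

import (
	"fmt"
	"strings"
)

// parseETag splits an entity tag into its opaque value and weak flag.
func parseETag(s string) (opaque string, weak bool) {
	s = strings.TrimSpace(s)
	if strings.HasPrefix(s, "W/") {
		weak = true
		s = s[2:]
	}
	return strings.Trim(s, `"`), weak
}

// strongMatch: both tags must be strong and their opaque values equal.
func strongMatch(a, b string) bool {
	oa, wa := parseETag(a)
	ob, wb := parseETag(b)
	return !wa && !wb && oa == ob
}

// weakMatch: opaque values equal, weakness ignored.
func weakMatch(a, b string) bool {
	oa, _ := parseETag(a)
	ob, _ := parseETag(b)
	return oa == ob
}

// clause evaluates one condition against the current entry's ETag.
type clause func(currentETag string, exists bool) bool

// ifMatch holds when the entry exists and any candidate strongly matches.
func ifMatch(candidates ...string) clause {
	return func(cur string, exists bool) bool {
		if !exists {
			return false
		}
		for _, c := range candidates {
			if c == "*" || strongMatch(cur, c) {
				return true
			}
		}
		return false
	}
}

// ifNoneMatch holds when no candidate weakly matches the current entry.
func ifNoneMatch(candidates ...string) clause {
	return func(cur string, exists bool) bool {
		if !exists {
			return true
		}
		for _, c := range candidates {
			if c == "*" || weakMatch(cur, c) {
				return false
			}
		}
		return true
	}
}

// evaluate is the logical AND over all clauses.
func evaluate(cur string, exists bool, clauses ...clause) bool {
	for _, c := range clauses {
		if !c(cur, exists) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(evaluate(`"abc"`, true, ifMatch(`"abc"`, `"def"`)))         // strong match on a set
	fmt.Println(evaluate(`W/"abc"`, true, ifMatch(`"abc"`)))                // weak tag never strongly matches
	fmt.Println(evaluate(`"abc"`, true, ifMatch(`"abc"`), ifNoneMatch(`"abc"`))) // compound AND
}
```

If-Match uses the strong comparison and If-None-Match the weak one, as RFC 7232 specifies; a failed `evaluate` would correspond to the PRECONDITION_FAILED result described above.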
Chris Lu and GitHub	bce76e6e21	filer: serialize same-path mutations with a per-path lock (#9639) CreateEntry is a FindEntry-then-write with no lock, so concurrent creates to the same path race: OExcl can admit two creators, and a conditional check-then-act has no atomicity. Add a per-path exclusive lock (util.LockTable, which evicts idle keys so it stays bounded) on the FilerServer and take it in CreateEntry, so the existence check and the write are atomic on this filer. This is the local serialization point that lets callers route a key's writes to its owner filer and drop the distributed lock for that key. AppendToEntry keeps its distributed lock for now; it can move to the per-path lock once its callers route to the owner.	2026-05-23 14:22:42 -07:00
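The per-path lock in the bce76e6e21 message can be illustrated with a small reference-counted lock table. This is a sketch of the general technique, not `util.LockTable`: the `pathLocks` type, its methods, and the `raceCreates` helper are hypothetical, and the in-memory `store` map stands in for the filer store.

```go
package main

import (
	"fmt"
	"sync"
)

type lockEntry struct {
	mu   sync.Mutex
	refs int // holders plus waiters; the entry is evicted at zero
}

// pathLocks is a hypothetical per-path exclusive lock table that drops
// idle keys so it stays bounded by the number of in-flight paths.
type pathLocks struct {
	mu      sync.Mutex
	entries map[string]*lockEntry
}

func newPathLocks() *pathLocks {
	return &pathLocks{entries: make(map[string]*lockEntry)}
}

// lock blocks until the caller holds the path's lock and returns the
// function that releases it.
func (p *pathLocks) lock(path string) (unlock func()) {
	p.mu.Lock()
	e, ok := p.entries[path]
	if !ok {
		e = &lockEntry{}
		p.entries[path] = e
	}
	e.refs++
	p.mu.Unlock()

	e.mu.Lock()
	return func() {
		e.mu.Unlock()
		p.mu.Lock()
		e.refs--
		if e.refs == 0 {
			delete(p.entries, path)
		}
		p.mu.Unlock()
	}
}

func (p *pathLocks) size() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.entries)
}

// raceCreates runs n concurrent exclusive creates on one path and returns
// how many succeeded. The check and the write happen under the path lock.
func raceCreates(p *pathLocks, n int) int {
	const path = "/buckets/b/key"
	store := make(map[string]bool)
	winners := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := p.lock(path)
			defer unlock()
			if !store[path] { // existence check
				store[path] = true // write, atomic with the check
				winners++
			}
		}()
	}
	wg.Wait()
	return winners
}

func main() {
	p := newPathLocks()
	fmt.Println(raceCreates(p, 100), p.size()) // prints "1 0"
}
```

Exactly one creator wins regardless of scheduling, and the table is empty again once every holder has released, which is the bounded-size property the commit message attributes to evicting idle keys.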
Chris Lu, GitHub, gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	21f2699624	EC detection: build placement snapshot once per cycle (fix large-topology timeout) (#9625) * EC detection: build placement snapshot once per cycle, not per volume planECDestinations rebuilt the full ecbalancer snapshot (FromActiveTopology) for every eligible volume, and resolved each shard destination's address via ResolveServerAddress, which rebuilds the whole node map on every call. Both are O(volumes x topology) and made detection time out on large clusters (TestErasureCodingDetectionLargeTopology: 300k volumes hit the 2-minute deadline). Build the snapshot and the node-address map once per detection cycle and pass them in. planECDestinations now reserves the shards it assigns directly into the shared snapshot, so volumes planned later in the same cycle still see the reduced capacity (previously this was observed by rebuilding from ActiveTopology's pending tasks). Large-topology detection drops from a 120s timeout to ~3.5s. * Update weed/worker/tasks/erasure_coding/detection.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-05-22 22:20:39 -07:00
Chris Lu and GitHub	d1665750e1	Delete the EC placement package now that encode/repair use ecbalancer.Place (#9624) Delete the EC placement package and the dead encode planner code Now that encode (and repair) place via ecbalancer.Place, nothing uses the erasure_coding/placement package or the EC-only planner machinery (ecPlacementPlanner, diskInfosToCandidates, calculateECScoreCandidate, distributeECShards) in detection.go. Removes them and the package, along with the planner-direct unit tests.	2026-05-22 20:32:09 -07:00
Chris Lu and GitHub	0566fbd552	EC encode: place shards via ecbalancer.Place + configurable replica placement (#9623) * Add shared super_block.ResolveReplicaPlacement; use it in ec_balance * Add ecbalancer.FromActiveTopology snapshot constructor for EC encode/repair * Add ecbalancer.Place greenfield/repair placement core (strict + durability-first) * topology: add GetEffectiveAvailableEcShardSlots; FromActiveTopology uses shard-granular free slots GetDisksWithEffectiveCapacity flattens reserved shard slots into volume slots via integer truncation, so an in-flight EC task reserving a non-multiple-of-DataShardsCount number of shards was lost from the snapshot and freeSlots was over-reported. GetEffectiveAvailableEcShardSlots subtracts the full reservation impact at shard granularity. * ecbalancer.Place: reject nodes without a free disk of the requested type FromActiveTopology keeps all disk types in the snapshot, so an SSD-only request could be routed to a node with only HDD capacity (pickBestDiskOnNode then returns disk 0 on the wrong tier). Filter rack/node selection to those with a free disk of the requested type. * ecbalancer.Place: enforce ReplicaPlacement DiffDataCenterCount (per-DC shard cap) * ecbalancer: enforce DiffDataCenterCount in balance (cross-DC phase + cross-rack DC cap) Adds a cross-DC corrective phase that drains data centers holding more than DiffDataCenterCount shards of a volume, and a per-DC cap on cross-rack move targets. Both are no-ops when DiffDataCenterCount is unset, so balance output is unchanged for non-DC placements. * topology: ratio-aware EC shard slots and provisional empty-disk slot GetEffectiveAvailableEcShardSlots now takes the target collection's data-shard count, so a 4+2 volume's larger shards are not over-counted at 10 per volume slot; and it keeps the one provisional slot for freshly started empty servers that report max=0, matching getEffectiveAvailableCapacityUnsafe. FromActiveTopology threads the ratio through. 
* ecbalancer.Place: explicit disk-type filter signal (fix HDD vs any ambiguity) HardDriveType normalizes to "", which collided with "" meaning any disk. Add Constraints.FilterDiskType and normalize both sides so a hdd request matches disks reported as "" and never leaks to SSD, while filter=false still means any. * ecbalancer: add clearShardAccounting for repair snapshot reconciliation Clears one disk's copy of a shard from per-domain accounting and recomputes the node-level union (preserving a kept copy on another disk of the same node), without crediting capacity. Repair uses it to drop to-be-deleted copies before placing missing shards. * ecbalancer: don't cap cross-DC target racks when DiffRackCount is unset len(racks)+1 wrongly limited each target rack (3 in a 2-rack cluster), so draining a DC could stop short of the DiffDataCenterCount cap. Use MaxShardCount+1 as the effectively-unlimited default. * topology/ecbalancer: ratio-correct EC capacity accounting Reservation shard slots (default ShardsPerVolumeSlot units) are now converted to the target ratio before subtracting, and existing EC shards are charged by size (targetDataShards/shardDataShards) so a 2+1 shard isn't counted as one 10+4 slot. Per-shard ratio lookup is behind shardDataShards (OSS uses the standard ratio). * ecbalancer.Place: candidate tiering and eligible-rack caps Adds a per-disk eligibility/preference abstraction so Place supports: - preferred-tag whole-plan retry (try disks carrying the earliest tags first, widen to all only if a tier cannot place every shard; reports SpilledOutsidePreferredTags), - soft disk-type spill via DiskTypePolicy (Any/Prefer/Require): Prefer fills the preferred type then spills, reporting SpilledToOtherDiskType; Require filters, - even per-rack caps that divide by racks holding an eligible disk, so a tiered cluster (e.g. SSDs in 2 of 4 racks) isn't capped impossibly low. Disk tags carried via Node.AddDiskTags + FromActiveTopology. 
* ecbalancer: export ClearShardAccounting for repair snapshot reconciliation * ecbalancer: address review feedback (ratio rounding, bitmap walk, same-DC moves) - topology/ecbalancer: round shard-reservation and existing-shard footprint up when converting to target-ratio shard slots, so a sub-slot reservation is not truncated to zero and free capacity is not overstated for low-data-shard layouts (targetDataShards < ds). - erasure_coding: add ShardBits.All iterator and use it across the balancer, cross-DC phase, and placement scoring instead of scanning 0..MaxShardCount and probing Has on every id. - ecbalancer: allow same-DC cross-rack moves when a DC already sits at its DiffDataCenterCount cap; a same-DC move leaves the DC total unchanged. Add a regression test that fails without the guard. - ecbalancer cross-DC phase: pick targets via the eligible-aware pickNodeInRackEligible/pickBestDiskEligible helpers so the disk-type filter is honored and a 0 disk id is not mistaken for a valid selection. * ecbalancer: test ecShardSlotsOnDisk fractional round-up Cover the mixed-ratio path (targetDataShards < existing data shards) so a shard's fractional footprint is never floored to zero and free capacity is not overstated. Exercises the round-up via the targetDataShards parameter; OSS uses the standard ratio at runtime while the enterprise build hits it with real per-volume ratios. * ecbalancer: assert node B rack in TestFromActiveTopology * ecbalancer: split Destination into separate DataCenter and bare Rack Replace the composite "dc:rack" Rack field on Destination with separate DataCenter and bare Rack values, matching topology.DiskInfo and the worker-task convention. Callers (and tests) read the data center directly instead of parsing the composite with strings.SplitN. 
* shell ec.balance: use utilization-based global balancing (parity with worker) The shell's global rebalance phase balanced by raw shard count; switch it to fractional fullness (shards/capacity), as the worker already does. On uniform capacity the two agree; on heterogeneous capacity it fills nodes proportionally instead of driving small-capacity nodes toward full. Updates the heterogeneous-capacity regression test to assert even fullness (~equal shards/capacity per node) rather than even shard count. * ecbalancer: bounded-proportional per-DC shard spread DiffDataCenterCount was enforced only as a ceiling (drain-to-cap), which could leave a within-cap-but-lopsided DC distribution under a loose cap (e.g. 10/4 of 14 with cap=10). Now the cross-DC phase, the cross-rack DC guard, and Place all target boundedMaxPerDC = min(DiffDataCenterCount, max(ceil(total/numDCs), parityShards)): shards spread proportionally across DCs, but no tighter than the durability floor (once each DC holds <= parityShards a DC loss is recoverable, so further spreading only adds cross-DC/WAN traffic). No-op when DiffDataCenterCount is 0; identical to before when the cap is the binding constraint. * ecbalancer: drop DiffDataCenterCount enforcement for EC placement The 1-byte volume ReplicaPlacement packs xyz into x100+y10+z<=255, so the DC digit can only be 0-2 -- far too small to be a meaningful per-DC EC shard cap (a cap of 1-2 would demand 7-14 DCs for a 10+4 volume). It's volume replica-placement, not an EC spec. Removes the cross-DC balance phase, the DC guard in the cross-rack phase, and the per-DC cap in Place (and the just-added bounded-proportional logic); EC relies on the RP-independent rack/node even spread instead. Rack/node caps (DiffRackCount/SameRackCount) are unchanged. Per-domain EC caps are left for a real EC placement spec. 
* ecbalancer: enforce per-disk durability cap; symmetric reserve/release Place now refuses to put more than parityShards shards of a volume on a single disk (pickBestDiskEligible skips a disk once it holds parityShards of the volume, a hard cap not relaxed even in durability-first). Previously Place assigned by free capacity, so a skewed near-full cluster could pile >parityShards onto one disk -> losing it loses the volume; only distinct-disk count was checked. This covers encode and repair (both route through Place); the caller skips/leaves the volume rather than minting an unrecoverable layout. Also makes reserveShard decrement freeSlots unconditionally, symmetric with releaseShard's unconditional increment (the old guarded decrement could credit a phantom slot on release if a shard were ever reserved onto a full disk). * ecbalancer: add Topology.ReleaseVolumeShards (clear + credit) for greenfield encode Releases all of a volume's shards from the snapshot and credits the freed disk capacity, so a greenfield encode can plan as if stale EC shards from a prior failed attempt are gone. Safe to credit because the encode task deletes stale shards (cleanupStaleEcShards) before distributing the new ones. Distinct from ClearShardAccounting (repair), which does not credit. * ecbalancer: ReleaseVolumeShards credits node freeSlots, not just disks releaseShard only increments per-disk freeSlots, but rack capacity is summed from node freeSlots (buildRacks) and node freeSlots gates node eligibility. Crediting only disks left a node/rack looking full after releasing stale shards, so a greenfield encode still couldn't use the freed capacity. Now credits the node by the total disk-slots freed. * ecbalancer: correct PlacementMode docs (encode uses durability-first) PlaceStrict was labeled '(encode)' but encode uses PlaceDurabilityFirst. 
Clarify that durability-first is used by both encode and repair, reports relaxations in PlaceResult.Relaxed, and never relaxes the per-disk durability cap. * ecbalancer: treat SameRackCount as a direct per-node shard cap The 3rd ReplicaPlacement digit now caps shards per node at exactly the digit value, matching how DiffRackCount (2nd digit) caps per rack, instead of allowing digit+1 per node. This makes the per-rack and per-node caps consistent and matches the documented "digits cap EC shards per rack and per node" semantics; e.g. 011 now means at most one shard per rack and one per node. * EC encode: place shards via ecbalancer.Place + configurable replica placement Encode now plans destinations through the shared ecbalancer.Place policy (durability-first: prefers the source disk type and honors replica placement / caps / anti-affinity, relaxing rather than failing when capacity is tight) instead of the EC-only placement planner. Targets and capacity reservations use Place's actual per-disk shard assignment, not a round-robin guess; cross-volume in-cycle capacity is tracked by ActiveTopology's pending task, so the cached planner is no longer consulted. Adds a configurable replica_placement (proto field 6 + worker form + reader) that overrides the master default replication. The placement-package planner code is left in place (now unused) and removed in a follow-up that drops the package. * EC encode: drop unused dataShards param from createECTargets Addresses review feedback: after switching to Place's per-disk shardsPerPlan assignment, createECTargets no longer needs the data-shard count. * EC encode: fix packed-target validation, greenfield stale-shard accounting, RP docs - Validate counts distinct shard ids across targets, not target rows, so packed plans (fewer (node,disk) targets than shards) aren't rejected. - planECDestinations releases the volume's stale EC shards from the snapshot before Place (ReleaseVolumeShards), crediting their capacity. 
The encode task deletes stale shards before distributing, so a retry on tight capacity no longer fails planning by counting shards that are about to be removed. - replica_placement config/form help no longer claims a data-center limit (the DC digit is ignored for EC); detection logs a warning when a DC digit is set. * EC encode: surface relaxed placement; mark replica_placement best-effort Encode places with PlaceDurabilityFirst (the chosen lenient behavior), which can relax caps/anti-affinity/replica-placement to avoid deferring. That was silent (only disk-type/tag spills were logged). Now logs PlaceResult.Relaxed so a tight replica placement isn't weakened unnoticed, and the config/form help states the rack/node caps are best-effort during encode (enforced by rebalancing). * EC encode: key per-disk shard grouping by struct, not formatted string planECDestinations grouped destinations using a fmt.Sprintf("%s:%d") map key per shard; use a {node,diskID} struct key and pre-size the map/slice to the shard count to drop the per-shard string allocation.	2026-05-22 20:22:30 -07:00
Chris Lu and GitHub	d4e39b499b	EC placement: shared replica-placement resolver, snapshot + Place core, capacity fixes, tiering (#9621) * Add shared super_block.ResolveReplicaPlacement; use it in ec_balance * Add ecbalancer.FromActiveTopology snapshot constructor for EC encode/repair * Add ecbalancer.Place greenfield/repair placement core (strict + durability-first) * topology: add GetEffectiveAvailableEcShardSlots; FromActiveTopology uses shard-granular free slots GetDisksWithEffectiveCapacity flattens reserved shard slots into volume slots via integer truncation, so an in-flight EC task reserving a non-multiple-of-DataShardsCount number of shards was lost from the snapshot and freeSlots was over-reported. GetEffectiveAvailableEcShardSlots subtracts the full reservation impact at shard granularity. * ecbalancer.Place: reject nodes without a free disk of the requested type FromActiveTopology keeps all disk types in the snapshot, so an SSD-only request could be routed to a node with only HDD capacity (pickBestDiskOnNode then returns disk 0 on the wrong tier). Filter rack/node selection to those with a free disk of the requested type. * ecbalancer.Place: enforce ReplicaPlacement DiffDataCenterCount (per-DC shard cap) * ecbalancer: enforce DiffDataCenterCount in balance (cross-DC phase + cross-rack DC cap) Adds a cross-DC corrective phase that drains data centers holding more than DiffDataCenterCount shards of a volume, and a per-DC cap on cross-rack move targets. Both are no-ops when DiffDataCenterCount is unset, so balance output is unchanged for non-DC placements. * topology: ratio-aware EC shard slots and provisional empty-disk slot GetEffectiveAvailableEcShardSlots now takes the target collection's data-shard count, so a 4+2 volume's larger shards are not over-counted at 10 per volume slot; and it keeps the one provisional slot for freshly started empty servers that report max=0, matching getEffectiveAvailableCapacityUnsafe. 
FromActiveTopology threads the ratio through. * ecbalancer.Place: explicit disk-type filter signal (fix HDD vs any ambiguity) HardDriveType normalizes to "", which collided with "" meaning any disk. Add Constraints.FilterDiskType and normalize both sides so a hdd request matches disks reported as "" and never leaks to SSD, while filter=false still means any. * ecbalancer: add clearShardAccounting for repair snapshot reconciliation Clears one disk's copy of a shard from per-domain accounting and recomputes the node-level union (preserving a kept copy on another disk of the same node), without crediting capacity. Repair uses it to drop to-be-deleted copies before placing missing shards. * ecbalancer: don't cap cross-DC target racks when DiffRackCount is unset len(racks)+1 wrongly limited each target rack (3 in a 2-rack cluster), so draining a DC could stop short of the DiffDataCenterCount cap. Use MaxShardCount+1 as the effectively-unlimited default. * topology/ecbalancer: ratio-correct EC capacity accounting Reservation shard slots (default ShardsPerVolumeSlot units) are now converted to the target ratio before subtracting, and existing EC shards are charged by size (targetDataShards/shardDataShards) so a 2+1 shard isn't counted as one 10+4 slot. Per-shard ratio lookup is behind shardDataShards (OSS uses the standard ratio). * ecbalancer.Place: candidate tiering and eligible-rack caps Adds a per-disk eligibility/preference abstraction so Place supports: - preferred-tag whole-plan retry (try disks carrying the earliest tags first, widen to all only if a tier cannot place every shard; reports SpilledOutsidePreferredTags), - soft disk-type spill via DiskTypePolicy (Any/Prefer/Require): Prefer fills the preferred type then spills, reporting SpilledToOtherDiskType; Require filters, - even per-rack caps that divide by racks holding an eligible disk, so a tiered cluster (e.g. SSDs in 2 of 4 racks) isn't capped impossibly low. 
Disk tags carried via Node.AddDiskTags + FromActiveTopology. * ecbalancer: export ClearShardAccounting for repair snapshot reconciliation * ecbalancer: address review feedback (ratio rounding, bitmap walk, same-DC moves) - topology/ecbalancer: round shard-reservation and existing-shard footprint up when converting to target-ratio shard slots, so a sub-slot reservation is not truncated to zero and free capacity is not overstated for low-data-shard layouts (targetDataShards < ds). - erasure_coding: add ShardBits.All iterator and use it across the balancer, cross-DC phase, and placement scoring instead of scanning 0..MaxShardCount and probing Has on every id. - ecbalancer: allow same-DC cross-rack moves when a DC already sits at its DiffDataCenterCount cap; a same-DC move leaves the DC total unchanged. Add a regression test that fails without the guard. - ecbalancer cross-DC phase: pick targets via the eligible-aware pickNodeInRackEligible/pickBestDiskEligible helpers so the disk-type filter is honored and a 0 disk id is not mistaken for a valid selection. * ecbalancer: test ecShardSlotsOnDisk fractional round-up Cover the mixed-ratio path (targetDataShards < existing data shards) so a shard's fractional footprint is never floored to zero and free capacity is not overstated. Exercises the round-up via the targetDataShards parameter; OSS uses the standard ratio at runtime while the enterprise build hits it with real per-volume ratios. * ecbalancer: assert node B rack in TestFromActiveTopology * ecbalancer: split Destination into separate DataCenter and bare Rack Replace the composite "dc:rack" Rack field on Destination with separate DataCenter and bare Rack values, matching topology.DiskInfo and the worker-task convention. Callers (and tests) read the data center directly instead of parsing the composite with strings.SplitN. 
* shell ec.balance: use utilization-based global balancing (parity with worker) The shell's global rebalance phase balanced by raw shard count; switch it to fractional fullness (shards/capacity), as the worker already does. On uniform capacity the two agree; on heterogeneous capacity it fills nodes proportionally instead of driving small-capacity nodes toward full. Updates the heterogeneous-capacity regression test to assert even fullness (~equal shards/capacity per node) rather than even shard count. * ecbalancer: bounded-proportional per-DC shard spread DiffDataCenterCount was enforced only as a ceiling (drain-to-cap), which could leave a within-cap-but-lopsided DC distribution under a loose cap (e.g. 10/4 of 14 with cap=10). Now the cross-DC phase, the cross-rack DC guard, and Place all target boundedMaxPerDC = min(DiffDataCenterCount, max(ceil(total/numDCs), parityShards)): shards spread proportionally across DCs, but no tighter than the durability floor (once each DC holds <= parityShards a DC loss is recoverable, so further spreading only adds cross-DC/WAN traffic). No-op when DiffDataCenterCount is 0; identical to before when the cap is the binding constraint. * ecbalancer: drop DiffDataCenterCount enforcement for EC placement The 1-byte volume ReplicaPlacement packs xyz into x100+y10+z<=255, so the DC digit can only be 0-2 -- far too small to be a meaningful per-DC EC shard cap (a cap of 1-2 would demand 7-14 DCs for a 10+4 volume). It's volume replica-placement, not an EC spec. Removes the cross-DC balance phase, the DC guard in the cross-rack phase, and the per-DC cap in Place (and the just-added bounded-proportional logic); EC relies on the RP-independent rack/node even spread instead. Rack/node caps (DiffRackCount/SameRackCount) are unchanged. Per-domain EC caps are left for a real EC placement spec. 
* ecbalancer: enforce per-disk durability cap; symmetric reserve/release Place now refuses to put more than parityShards shards of a volume on a single disk (pickBestDiskEligible skips a disk once it holds parityShards of the volume, a hard cap not relaxed even in durability-first). Previously Place assigned by free capacity, so a skewed near-full cluster could pile >parityShards onto one disk -> losing it loses the volume; only distinct-disk count was checked. This covers encode and repair (both route through Place); the caller skips/leaves the volume rather than minting an unrecoverable layout. Also makes reserveShard decrement freeSlots unconditionally, symmetric with releaseShard's unconditional increment (the old guarded decrement could credit a phantom slot on release if a shard were ever reserved onto a full disk). * ecbalancer: add Topology.ReleaseVolumeShards (clear + credit) for greenfield encode Releases all of a volume's shards from the snapshot and credits the freed disk capacity, so a greenfield encode can plan as if stale EC shards from a prior failed attempt are gone. Safe to credit because the encode task deletes stale shards (cleanupStaleEcShards) before distributing the new ones. Distinct from ClearShardAccounting (repair), which does not credit. * ecbalancer: ReleaseVolumeShards credits node freeSlots, not just disks releaseShard only increments per-disk freeSlots, but rack capacity is summed from node freeSlots (buildRacks) and node freeSlots gates node eligibility. Crediting only disks left a node/rack looking full after releasing stale shards, so a greenfield encode still couldn't use the freed capacity. Now credits the node by the total disk-slots freed. * ecbalancer: correct PlacementMode docs (encode uses durability-first) PlaceStrict was labeled '(encode)' but encode uses PlaceDurabilityFirst. 
Clarify that durability-first is used by both encode and repair, reports relaxations in PlaceResult.Relaxed, and never relaxes the per-disk durability cap. * ecbalancer: treat SameRackCount as a direct per-node shard cap The 3rd ReplicaPlacement digit now caps shards per node at exactly the digit value, matching how DiffRackCount (2nd digit) caps per rack, instead of allowing digit+1 per node. This makes the per-rack and per-node caps consistent and matches the documented "digits cap EC shards per rack and per node" semantics; e.g. 011 now means at most one shard per rack and one per node.	2026-05-22 20:22:09 -07:00
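The arithmetic behind dropping the per-DC cap in the d4e39b499b message, and the direct per-rack and per-node reading of the remaining digits, can be checked with a short sketch. `packReplicaPlacement` and `ecCaps` are hypothetical helpers written for this example, not functions from super_block or ecbalancer; treating a 0 digit as "no cap" follows the message's remark about unset counts and is otherwise an assumption.

```go
package main

import "fmt"

// packReplicaPlacement packs the three digits xyz into one byte as
// x*100 + y*10 + z, rejecting values that do not fit.
func packReplicaPlacement(dc, rack, node int) (byte, error) {
	for _, d := range []int{dc, rack, node} {
		if d < 0 || d > 9 {
			return 0, fmt.Errorf("digit %d out of range 0-9", d)
		}
	}
	v := dc*100 + rack*10 + node
	if v > 255 {
		return 0, fmt.Errorf("%d does not fit in one byte", v)
	}
	return byte(v), nil
}

// ecCaps reads the 2nd and 3rd digits as direct per-rack and per-node EC
// shard caps. The DC digit is ignored for EC, as the commit describes.
// A 0 digit is treated here as "unset" (no cap).
func ecCaps(rp byte) (perRack, perNode int) {
	return int(rp) / 10 % 10, int(rp) % 10
}

func main() {
	// The DC digit can only be 0-2: 300 already overflows a byte.
	for dc := 0; dc <= 3; dc++ {
		_, err := packReplicaPlacement(dc, 0, 0)
		fmt.Println(dc, err == nil)
	}
	rp, _ := packReplicaPlacement(0, 1, 1)
	fmt.Println(ecCaps(rp)) // 011: one shard per rack, one per node
}
```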
Chris Lu	adfd731bb8	4.28 4.28	2026-05-21 17:16:32 -07:00
Aleksey, GitHub, Chris Lu	917a87928c	fix(s3api/list): cancel ListEntries stream in hasChildren (#9617) * fix(s3api/list): cancel ListEntries stream in hasChildren * fix(s3api): use filer_pb.List in hasChildren filer_pb.List already wraps the ListEntries stream in a cancellable context, so the single-entry probe needs no separate helper or manual context plumbing to avoid the leaked gRPC stream goroutine. * fix(s3api): propagate request context into hasChildren Thread r.Context() through listFilerEntries and hasChildren so the implicit-directory probe cancels when the client disconnects, instead of running on context.Background(). --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-21 15:48:47 -07:00
