* fix(filer.sync): resolve manifest chunks against source filer
`UpdateEntry` was passing `filer.LookupFn(fs)` — the sink filer client —
into `compareChunks`. But `oldEntry`/`newEntry` chunks come from the
source cluster, so manifest resolution must hit the source filer's
volume servers. With two clusters that have overlapping volume IDs
(common once they grow past a few hundred volumes), the sink lookup
returns its own volume's URLs and the fetch 404s on the source's
fileKey:
compare chunks error: fail to read manifest 631,0babe...: 404 Not Found
The 404 aborts the diff, the manifest chunk never gets replicated, and
the target ends up with whatever flat chunks happened to land from
earlier partial syncs — visible as `SIZE_MISMATCH` in filer.sync.verify
on files large enough to use chunk manifests (~150 GB+ in practice).
Only the manifest path was wrong; flat-chunk reads in `fetchAndWrite`
already use `fs.filerSource.ReadPart`.
* trim comment
* test(filer.sync): regression test for source-filer manifest lookup
Two recording filer gRPC servers stand in for source and sink. Driving
UpdateEntry with a manifest chunk and observing which one receives
LookupVolume proves compareChunks routes source-side lookups through
fs.filerSource, not fs. Reverting the fix flips the call onto the sink
filer and fails the assertion.
* drop test
* fix(filer.sync): validate chunk size in FilerSink to prevent 0-byte propagation
FilerSink.fetchAndWrite previously trusted the source response and the
upload result blindly: a 200 OK / Content-Length: 0 reply from a broken
source volume was happily uploaded as a 0-byte needle to the destination,
and the destination filer metadata was then written with the source
chunk size. The result was permanent silent corruption -- ls shows the
file at its original size but reads fail with EIO.
Add two cheap defenses inside fetchAndWrite:
1. After assembling fullData, compare its length against sourceChunk.Size.
2. After a successful upload, compare uploadResult.Size against
sourceChunk.Size.
Both checks wrap a new sentinel errChunkSizeMismatch that the retry
callback recognizes and refuses to retry -- needle.size=0 on disk is a
persistent state, not a transient network error, so the sync should stop
loudly on the affected entry instead of looping or, worse, silently
propagating it.
Tests:
* TestValidateReplicatedChunkSize -- table-driven coverage of healthy,
legitimately empty, zero-byte read, short read, and truncated upload
cases.
* TestFetchAndWriteRejectsZeroByteSource -- end-to-end: an httptest
source that returns 200 OK with an empty body must cause fetchAndWrite
to return errChunkSizeMismatch after exactly one source hit (fail
fast, no retry storm).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* filer.sync: bubble size-mismatch past CreateEntry/UpdateEntry
Three follow-ups on the chunk-size validation:
- Use %w in replicateOneChunk so the errChunkSizeMismatch sentinel
survives the wrap and reaches errors.Is callers up the stack.
- In FilerSink.CreateEntry/UpdateEntry, surface errChunkSizeMismatch
instead of warning-and-nil. Other errors (deleted source chunk,
transient network) keep the existing swallow so a hiccup doesn't
stall the stream.
- Drop validateReplicatedUploadSize: uploadResult.Size is set
client-side from the same len(fullData) we already validated
pre-upload, so the second check can't fail.
Test: scope the RetryWaitTime override to the one test that needs it,
add a regression that locks in the errors.Is chain through
replicateChunks.
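The sentinel-and-wrap pattern described above can be sketched as follows. This is a minimal illustration, not the actual FilerSink code: only errChunkSizeMismatch and the name replicateOneChunk come from the commit message, while validateChunkSize, shouldRetry, and the simplified signatures are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errChunkSizeMismatch is the sentinel named in the commit message.
var errChunkSizeMismatch = errors.New("chunk size mismatch")

// validateChunkSize is a hypothetical helper: it compares the number of bytes
// actually read against the size recorded on the source chunk.
func validateChunkSize(fileID string, got int, want uint64) error {
	if uint64(got) != want {
		return fmt.Errorf("chunk %s: read %d bytes, expected %d: %w",
			fileID, got, want, errChunkSizeMismatch)
	}
	return nil
}

// replicateOneChunk (simplified signature, an assumption) wraps with %w so
// the sentinel stays reachable through errors.Is further up the stack.
func replicateOneChunk(fileID string, got int, want uint64) error {
	if err := validateChunkSize(fileID, got, want); err != nil {
		return fmt.Errorf("replicate chunk %s: %w", fileID, err)
	}
	return nil
}

// shouldRetry is a hypothetical retry predicate: a size mismatch is treated
// as persistent state, so it is never retried.
func shouldRetry(err error) bool {
	return err != nil && !errors.Is(err, errChunkSizeMismatch)
}

func main() {
	err := replicateOneChunk("3,01637037d6", 0, 4096)
	fmt.Println(errors.Is(err, errChunkSizeMismatch)) // true
	fmt.Println(shouldRetry(err))                     // false
}
```

Had either wrap used %v instead of %w, errors.Is would report false and the mismatch would be retried like any transient failure.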
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* wdclient: prune filers dropped from master discovery
Filer discovery only appended new addresses; it never removed ones that
disappeared from the master snapshot. After a K8s filer pod rolled to a
new IP, the old address lingered in filerAddresses and was retried
every resetTimeout window, stalling S3 uploads on i/o timeouts.
Treat the master snapshot as authoritative: keep survivors (preserving
their health counters and the active round-robin index), append newcomers
with fresh health, drop the rest. Empty snapshots are still ignored so a
transient master outage can't wipe the list.
* wdclient: skip discovery snapshots with no usable addresses
Defensively guard against the case where master returns updates whose
addresses are all empty; reconciling against an empty discovered set
would prune every filer.
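A rough sketch of the reconciliation rule from the two wdclient commits above, under simplifying assumptions: filerHealth and reconcile are hypothetical names, and the round-robin index that the real client preserves is left out.

```go
package main

import "fmt"

// filerHealth is a hypothetical stand-in for the per-filer health counters.
type filerHealth struct{ failCount int }

// reconcile treats the discovered snapshot as authoritative: survivors keep
// their health, newcomers get fresh health, everything else is dropped.
// A snapshot with no usable address leaves the current list untouched.
func reconcile(current []string, health map[string]*filerHealth, discovered []string) ([]string, map[string]*filerHealth) {
	wanted := make(map[string]bool, len(discovered))
	usable := make([]string, 0, len(discovered))
	for _, addr := range discovered {
		if addr == "" || wanted[addr] {
			continue // skip empty and repeated addresses
		}
		wanted[addr] = true
		usable = append(usable, addr)
	}
	if len(usable) == 0 {
		return current, health
	}

	next := make([]string, 0, len(usable))
	nextHealth := make(map[string]*filerHealth, len(usable))
	for _, addr := range current { // survivors, in their existing order
		if _, kept := nextHealth[addr]; wanted[addr] && !kept {
			h := health[addr]
			if h == nil {
				h = &filerHealth{}
			}
			next = append(next, addr)
			nextHealth[addr] = h
		}
	}
	for _, addr := range usable { // newcomers, with fresh health
		if _, kept := nextHealth[addr]; !kept {
			next = append(next, addr)
			nextHealth[addr] = &filerHealth{}
		}
	}
	return next, nextHealth
}

func main() {
	current := []string{"10.0.0.1:8888", "10.0.0.2:8888"}
	health := map[string]*filerHealth{
		"10.0.0.1:8888": {failCount: 3},
		"10.0.0.2:8888": {},
	}
	next, _ := reconcile(current, health, []string{"10.0.0.2:8888", "10.0.0.9:8888"})
	fmt.Println(next)
}
```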
* fix(volume): avoid nil-deref when needle map loader errors
A corrupt .idx whose size is not a multiple of NeedleMapEntrySize sends
the read-only load path into NewSortedFileNeedleMap, which returns
(*SortedFileNeedleMap)(nil) when reverseWalkIndexFile rejects the file.
The multi-value assignment `v.nm, err = NewSortedFileNeedleMap(...)`
parks that typed-nil pointer in the v.nm NeedleMapper interface, so the
subsequent `v.nm != nil` guard still passes — and the post-load
MaxNeedleEnd structural check dispatches through the promoted mapMetric
accessor on a nil receiver, segfaulting the whole volume server at
load time.
Reset v.nm explicitly after every loader failure so the interface is
truly nil, and skip the MaxNeedleEnd check when err is non-nil since
the value would come from a partial walk anyway. NewLevelDbNeedleMap
has the same typed-nil-on-error shape and is fixed the same way.
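The typed-nil trap is a general Go behavior and can be reproduced in isolation. In the sketch below, needleMapper, sortedMap, and the two load functions are hypothetical stand-ins for NeedleMapper, SortedFileNeedleMap, and the volume load path; only the shape of the bug and of the fix is taken from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// needleMapper is a hypothetical stand-in for the NeedleMapper interface.
type needleMapper interface{ MaxEnd() int64 }

// sortedMap stands in for SortedFileNeedleMap.
type sortedMap struct{ maxEnd int64 }

// MaxEnd dereferences the receiver, so calling it on a nil *sortedMap panics.
func (m *sortedMap) MaxEnd() int64 { return m.maxEnd }

// newSortedMap returns a nil *sortedMap together with an error on bad input.
func newSortedMap(corrupt bool) (*sortedMap, error) {
	if corrupt {
		return nil, errors.New("index size is not a multiple of the entry size")
	}
	return &sortedMap{maxEnd: 42}, nil
}

// loadBuggy stores the typed-nil pointer in the interface: nm != nil is true.
func loadBuggy(corrupt bool) (needleMapper, error) {
	var nm needleMapper
	var err error
	nm, err = newSortedMap(corrupt)
	return nm, err
}

// loadFixed resets the interface after a loader failure so it is truly nil.
func loadFixed(corrupt bool) (needleMapper, error) {
	var nm needleMapper
	var err error
	nm, err = newSortedMap(corrupt)
	if err != nil {
		nm = nil
	}
	return nm, err
}

func main() {
	buggy, _ := loadBuggy(true)
	fixed, _ := loadFixed(true)
	fmt.Println(buggy != nil) // true: the nil guard passes, a later call would panic
	fmt.Println(fixed != nil) // false
}
```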
* fix(volume): close indexFile when needle map load errors
Before the fix, the typed-nil v.nm path either leaked indexFile silently
(SortedFileNeedleMap.Close had a nil-receiver early return) or crashed
(LevelDbNeedleMap.Close had no such guard). With v.nm cleared to nil
on error, the defer cleanup no longer calls Close at all, so the
LoadCompactNeedleMap success-with-error path now also leaks indexFile.
Close indexFile explicitly on each loader error to keep ownership
balanced.
* trim comments
Random allocation could pick 33646 = admin.port (23646) + GrpcPortOffset.
weed mini reserves that as Admin's gRPC port even when the test only
overrides Master/Filer/S3/Iceberg, so the explicit Filer flag failed
with "reserved for gRPC calculation" and TestRisingWaveIcebergCatalog
flaked. Pre-seed the reserved set with every mini default HTTP port
plus its +10000 offset so a random pick (or its own gRPC offset) cannot
land on a service the caller left at its default.
* master: timeout AllocateVolume/DeleteVolume and defer growRequest cleanup
The volume-grow goroutine clears the layout's growRequest flag only after
ms.DoAutomaticVolumeGrow returns, and AllocateVolume / DeleteVolume were
calling the volume-server RPC with context.Background(). A volume server
that hung mid-call (heavy I/O, stuck lock, dead peer behind a stable VIP)
would park the goroutine forever, leaving growRequest=true and silently
blocking every subsequent automatic grow for that layout — Assign retries
then drained their 30s budget with "context deadline exceeded" until the
operator restarted the master.
Bound both RPCs with a 5-minute deadline (creating/removing a volume is
sub-second normally, generous for contended disks) and move the flag
clear + filter delete into defers so a panic in DoAutomaticVolumeGrow
doesn't strand the layout either.
* allocate_volume: shorten timeout to 1m for faster recovery
Volume create/delete is sub-second under normal conditions; 1 minute is
generous even on a contended disk and clears the growRequest flag well
before too many client Assigns drain their own retry budget.
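The combination of a bounded RPC and a deferred flag clear can be sketched like this. It is an illustration under assumptions: volumeLayout, allocateVolume, and growOnce are hypothetical names, and the hanging RPC is simulated by blocking on the context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// volumeLayout is a hypothetical stand-in holding only the growRequest flag.
type volumeLayout struct{ growRequest atomic.Bool }

// allocateVolume simulates the volume-server RPC. With hang set, it blocks
// until the context is done, the way a stuck volume server would.
func allocateVolume(ctx context.Context, hang bool) error {
	if !hang {
		return nil
	}
	<-ctx.Done()
	return ctx.Err()
}

// growOnce bounds the RPC with a deadline and clears the flag in a defer, so
// neither a hung peer nor a panic can leave growRequest set.
func growOnce(l *volumeLayout, timeout time.Duration, hang bool) error {
	l.growRequest.Store(true)
	defer l.growRequest.Store(false)

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return allocateVolume(ctx, hang)
}

func main() {
	var l volumeLayout
	err := growOnce(&l, 50*time.Millisecond, true)
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // true
	fmt.Println(l.growRequest.Load())                     // false
}
```

With context.Background() in place of the deadline, the simulated hang would never return and the flag would stay set.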
* trim comments
KvPut/KvGet/KvDelete bypassed store.getKey(), so filer.store.id and
other KV writes landed outside the configured prefix. With a Redis
ACL restricted to the prefix this errored with NOPERM; without the
ACL the keys silently lived in the wrong namespace.
* writeJson: drop unused JSONP branch
No in-tree caller uses ?callback=. Always serve application/json
with X-Content-Type-Options: nosniff.
* seaweed-volume: drop unused JSONP branch
Mirror Go: always serve application/json with
X-Content-Type-Options: nosniff.
* writeJson: drop unreachable StatusNotModified check
bodyAllowedForStatus already returns early for 304.
* test/volume_server: rename and rewrite JSONP test to assert callback is ignored
CI: /status?callback=myFunc now returns plain application/json
with X-Content-Type-Options: nosniff.
* s3,iceberg: reject `..`/NUL in URL path vars
Both gateway routers use mux.NewRouter().SkipClean(true), so a request like
`GET /bucket-A/../evil-bucket/key` survives routing as bucket=bucket-A,
object=../evil-bucket/key. The captured key is then joined into a filer path;
util.JoinPath / path.Join collapse the `..` server-side and the read lands in
evil-bucket. With auth on, IAM still authorizes against bucket-A (the mux var),
so policy is evaluated against the wrong target.
Add a middleware on the S3 bucket subrouter and the Iceberg REST router that
rejects any `.`, `..`, NUL, or — for single-segment slots — embedded slash in
the captured path vars before any handler runs. NormalizeObjectKey already
folds `\` to `/` and decoding happens in mux, so `%2e%2e` and `..\` are caught.
* s3,iceberg: reject empty captured vars and empty namespace parts
Comma-ok the var lookup so we only check captured slots, then treat an empty
captured value as a rejection on its own — downstream path.Join would
otherwise collapse it and let the next segment pick the bucket.
For iceberg, also reject empty parts after splitting the namespace on \x1F so
leading/trailing/consecutive unit separators (which parseNamespace silently
folds out) don't let distinct route values collapse to the same parsed
namespace.
Register loggingMiddleware before validateRequestPath on the iceberg router
so rejected requests still produce an audit-log line.
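The validation rules from the two commits above can be expressed as a pair of pure functions. This is a sketch only: validatePathVar, validateNamespace, and errBadPathVar are hypothetical names, and the router wiring (mux vars, middleware order) is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errBadPathVar = errors.New("invalid path variable")

// validatePathVar rejects an empty value, NUL, and any "." or ".." segment.
// singleSegment is for slots such as a bucket name, where an embedded slash
// is also rejected. Backslashes are folded to "/" first so `..\` is caught.
func validatePathVar(value string, singleSegment bool) error {
	if value == "" || strings.ContainsRune(value, 0) {
		return errBadPathVar
	}
	normalized := strings.ReplaceAll(value, `\`, "/")
	if singleSegment && strings.Contains(normalized, "/") {
		return errBadPathVar
	}
	for _, seg := range strings.Split(normalized, "/") {
		if seg == "." || seg == ".." {
			return errBadPathVar
		}
	}
	return nil
}

// validateNamespace splits on the unit separator and rejects empty parts, so
// leading, trailing, or consecutive separators cannot collapse distinct
// route values into the same parsed namespace.
func validateNamespace(ns string) error {
	for _, part := range strings.Split(ns, "\x1f") {
		if err := validatePathVar(part, true); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(validatePathVar("../evil-bucket/key", false) != nil) // true
	fmt.Println(validatePathVar("photos/2024/a.jpg", false) == nil)  // true
	fmt.Println(validateNamespace("db\x1f\x1ftable") != nil)         // true
}
```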
Adds two FUSE integration tests on the existing dlm cluster harness (the
-dlm mounts route advisory locks to the owner filer):
- TestPosixLockCrossMount: an flock taken on one mount blocks the other,
and is grantable after release — the routed-to-owner path end to end.
- TestPosixLockSurvivesFilerLoss: hold flocks on many files, stop filer1
so keys it owned migrate to filer0; after the ring settles and the
holding mount re-asserts, every lock is still honored. Asserts only the
settled state; the transient migration window is unit-covered.
Locks are taken on read-only fds so the -dlm whole-file write lock (a
different mechanism, held until close) isn't involved. Skipped on
non-Linux: only Linux forwards advisory locks (SETLK) to the FUSE server;
macFUSE handles flock in-kernel per mount.
The codec (Set.Marshal/Unmarshal) and its posix_lock.proto were built to
let the lock set ride in an inode's entry metadata, but the authority is
in-memory and ownership handoff/restart is handled by mounts re-asserting
their held locks over the RPC — neither serializes the set. Nothing calls
the serde outside its own tests, so drop it (codec, proto, generated pb,
Makefile). The in-memory Set/Manager are unchanged.
After a (re)start the owner defers would-be grants for posixLockWarmup
while mounts re-assert, trusting only locally-visible conflicts, so it
does not double-grant from empty state; a deferred grant is a retry for
SetLkw and EAGAIN for non-blocking SetLk, never a spurious grant. Cooling
now fail-closes: if the previous owner is unreachable during a ring
change, defer rather than risk a double-grant. readyAt is atomic so the
handler reads it without locking.
While the ring changed within the last snapshot interval, a fresh owner
asks the key's previous owner (LockRing.PriorOwner) whether it still
holds a conflicting lock before granting TRY_LOCK or answering GET_LK, so
it does not double-grant before re-assertion rebuilds its local state.
The probe is marked cooling_probe so the previous owner answers from
local state without recursing. PriorOwner uses the snapshot's prebuilt
ring rather than rebuilding a hash ring per call.
* mount: renew POSIX lock leases via keepalive
The mount tracks the inode keys it holds locks on and a background loop
renews its session lease (KEEP_ALIVE) with each key's owner filer every
5s, within the filer's 15s TTL. A live mount is never reaped; a dead one
stops renewing and owners reclaim its locks. Tracking is a superset:
holds are added on grant and dropped only on owner release, so a still
held lock is never under-renewed.
* mount,filer: re-assert held POSIX locks via keepalive
The owner filer holds POSIX advisory locks as in-memory soft state, so a key's
owner change (ring rebalance) or an owner restart lost or stranded them: the new
or restarted owner was blind to existing holders and would double-grant.
Make the keepalive carry the mount's held lock ranges per key. The mount mirrors
its own granted locks (posixOwn), and each tick re-asserts them to the key's
current owner, which rebuilds that session's locks from the assertion and so
self-heals after a takeover or restart. The owner arbitrates re-asserted locks
against other sessions so it never double-grants; a lock that lost a migration
race is reported, not forced. A bare keepalive (no ranges) still just renews.
* filer: session lease + reaping for POSIX locks
A mount renews its session lease by keepalive (new KEEP_ALIVE op); the
owner filer records last-seen per session and a background sweeper reaps
the locks of leased sessions that stop renewing — a dead or partitioned
mount. Only sessions that have renewed are leased, so this is inert until
mounts run with -posixLock.
* mount: route POSIX advisory locks to the owner filer (-posixLock) (#9665)
mount: route POSIX advisory locks to the owner filer under -dlm
With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to
the inode's owner filer via the PosixLock RPC instead of the local table,
so flock/fcntl are honored across mounts. Advisory locking rides the same
switch as whole-file write coordination — and is therefore off under
writeback cache, which implies single-writer. The mount calls its filer
and relies on filer-side forwarding to reach the owner. Keys are the inode
identity (HardLinkId else path); SetLkw is client-side polling with the
FUSE cancel channel (no server wait queue); a per-mount session id
namespaces owners; a local hint avoids a release RPC on every close.
* mount,filer: bound posix-lock release RPCs and stop the reaper on shutdown
The unlock/release RPCs run off the syscall path (close/flush) and used
context.Background() with no deadline, so a slow or unreachable filer could
hang close() indefinitely; bound them to 5s (they still aren't cancelled by
an interrupt). The lease-reaping sweeper now selects on a stop channel that
FilerServer.Shutdown closes, instead of looping for the process lifetime.
routedReleasePosixOwner dropped the local owner hint before sending
RELEASE_POSIX_OWNER, so a transient RPC failure left the lock held on the
owner filer with no local record to retry from — stranded until session-lease
reaping. Drop the hint only after a successful release; on failure keep it so
a later flush retries, with lease reaping as the backstop.
With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to
the inode's owner filer via the PosixLock RPC instead of the local table,
so flock/fcntl are honored across mounts. Advisory locking rides the same
switch as whole-file write coordination — and is therefore off under
writeback cache, which implies single-writer. Keys are the inode identity
(HardLinkId else path); SetLkw is client-side polling with the FUSE cancel
channel (no server wait queue); a per-mount session id namespaces owners;
a local hint avoids a release RPC on every close. Background unlock/release
RPCs are bounded so a stuck filer can't hang close().
* mount: hold the entry lock while reading chunk size in GetAttr/SetAttr
Async upload workers append chunks to an open handle's shared entry under
the LockedEntry lock (FileHandle.AddChunks), but GetAttr and SetAttr
computed FileSize by iterating entry.Chunks without taking it. A concurrent
append that reallocated the backing array tore the slice read and crashed in
filer.TotalSize. Surfaces with -writebackCache, where handles stay open and
flush asynchronously while metadata ops keep arriving.
Take the LockedEntry lock for those reads (and SetAttr's truncate rewrite).
* mount: re-read entry under the lock in GetAttr/SetAttr
If SetEntry swapped the handle's entry pointer between maybeReadEntry and the
lock acquisition, the old pointer is orphaned. Re-read fh.entry.Entry under
the lock so SetAttr mutates the live entry instead of losing the update, and
GetAttr reports the current one.
* mount: cover the truncate path in TestAttrChunkRace
Alternate SetAttr between mtime-only and a shrinking size so the test also
exercises the entry.Chunks rewrite under fh.entry.Lock, not just the read-side
size walk.
* mount: snapshot chunks under the entry lock on the read path
readFromChunks holds fh.entryLock (excludes SetAttr) but not the LockedEntry
lock the async uploader appends under, so IsInRemoteOnly, the FileSize
fallback, and the RDMA/peer chunk walks read entry.Chunks while AddChunks
reallocated it — the same torn-slice crash as GetAttr/SetAttr.
Snapshot size, inline content, and the chunk list under a brief LockedEntry
RLock, then hand the snapshot to the RDMA/peer helpers instead of holding the
lock across network I/O. The captured slice stays valid: append never mutates
the old backing array, and truncate is excluded by the fh.entryLock.
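The snapshot-under-lock idea can be shown with a small stand-alone model. The types below (chunk, lockedEntry) are hypothetical simplifications of the mount's LockedEntry, and the second lock that excludes truncate is not modeled.

```go
package main

import (
	"fmt"
	"sync"
)

// chunk is a simplified stand-in for a file chunk.
type chunk struct{ offset, size int64 }

// lockedEntry models an entry whose chunk list is guarded by an RWMutex.
type lockedEntry struct {
	mu     sync.RWMutex
	chunks []chunk
}

// addChunks is the uploader side: it appends under the write lock.
func (e *lockedEntry) addChunks(cs ...chunk) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.chunks = append(e.chunks, cs...)
}

// snapshot captures the slice header under a brief read lock. Later appends
// only write past the captured length or into a new backing array, so the
// captured elements stay valid without holding the lock.
func (e *lockedEntry) snapshot() []chunk {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.chunks
}

// totalSize walks a snapshot; no lock is needed here.
func totalSize(chunks []chunk) int64 {
	var end int64
	for _, c := range chunks {
		if c.offset+c.size > end {
			end = c.offset + c.size
		}
	}
	return end
}

func main() {
	e := &lockedEntry{}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // concurrent uploader
		defer wg.Done()
		for i := int64(0); i < 1000; i++ {
			e.addChunks(chunk{offset: i * 4, size: 4})
		}
	}()
	for i := 0; i < 1000; i++ { // concurrent reader
		_ = totalSize(e.snapshot())
	}
	wg.Wait()
	fmt.Println(totalSize(e.snapshot())) // 4000
}
```

Reading e.chunks directly, without the read lock, is the torn-slice pattern the commit removes.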
* filer: in-memory POSIX lock authority (Manager)
Concurrent multi-inode authority over the per-inode Set: a Set per opaque
inode key (path, or hl:<HardLinkId>) plus a session->keys index so a dead
mount's locks reap in O(locks held). Lock state stays in memory like the
distributed lock manager's, off the replicated meta-log. TryLock/Unlock/
GetLk/ReleasePosixOwner/ReleaseFlockOwner/ReleaseSession; empty sets and
stale index entries are pruned on release.
* filer: routed PosixLock RPC over the in-memory authority
Adds the PosixLock RPC (try/unlock/get_lk + the flush/release owner
drops) that the owner filer answers from its in-memory Manager. The
request key is the inode identity ring key; a non-owner filer forwards
one hop (is_moved-bounded), mirroring ObjectTransaction, so the owner's
table stays the single authority under a stale ring view. Strictly
non-blocking; SetLkw polling lives in the mount.
* filer: POSIX advisory lock set primitive (phase 1)
Pure per-inode conflict/coalesce/range-split logic for fcntl byte-range
and flock whole-file locks, extracted from the mount's PosixLockTable
without its wait queue or inode-map concurrency. Owner identity is
(Sid, Owner) so the same FUSE owner on different mounts never aliases,
and ReleaseSession reaps a dead mount's locks. The owner filer will hold
one Set per inode under the per-path lock; no concurrency control here.
* test: tolerate transient FUSE invisibility in ConcurrentReadWrite
A concurrent truncating overwrite leaves a short-lived dentry/cache window
where the file is momentarily ENOENT to another opener. Retry the reads and
writes a few times before failing, as ConcurrentDirectoryOperations does.
* filer: serialize the POSIX lock set for entry metadata
Versioned fixed-width binary encoding of a Set, so an inode's held locks
can ride in its entry metadata: a lock op materializes the Set from the
blob, applies under the per-path lock, and writes it back. Empty set
encodes to nil so a lock-free inode carries no blob.
* filer: encode the POSIX lock set as protobuf
Replace the hand-rolled fixed-width codec with a LockSetProto message, so the
metadata blob can gain fields without a format-version migration. proto.Unmarshal
already rejects a malformed blob, so the explicit version and length checks go
away. Marshal now returns an error to match.
A disconnect/reconnect race could drop a volume from vid2location while it stayed in the data node's disk map, so it showed in volume.list and the admin UI but LookupVolume returned "volume id not found" and never self-healed (the full heartbeat only registered volumes new to the disk map). The full heartbeat now re-registers any reported volume missing from the lookup index, reusing the already-resolved VolumeLayout.
A non-owner filer forwards the whole transaction to the ring owner of route_key, so the owner's per-path lock stays the single serialization point even when the caller's ring view is stale. is_moved bounds forwarding to one hop. The gateway stamps route_key on every routed builder via the shared objectRouteKey helper. Completes taking S3 object mutations off the distributed lock.
* admin: add -metricsPort flag to expose Prometheus metrics
The admin command had no metrics endpoint, so passing -metricsPort
(as the operator does for spec.admin.metricsPort) crashed the process
with "flag provided but not defined". Wire up -metricsPort/-metricsIp
and start the shared Prometheus metrics server, matching filer, master,
and volume.
* admin: emit maintenance task and worker fleet metrics
Add Prometheus metrics for the admin server's distinctive work: the
maintenance task queue and the worker fleet that executes it.
Task lifecycle: maintenance_tasks_by_status / _by_type gauges (snapshot
of the queue), maintenance_tasks_completed_total{type,outcome} counter
and maintenance_task_duration_seconds{type} histogram (recorded when a
task reaches a terminal state), and last/next scan timestamp gauges.
Worker fleet: workers_connected and worker_slots{used,max} gauges, plus
worker_events_total{event} counting register/unregister/stale removals.
Gauges are snapshotted by a background goroutine on the admin server;
counters and the histogram are recorded at their event sites.
* admin: read worker slot totals under lock, clear next-scan gauge when idle
GetWorkers returns live worker pointers; summing CurrentLoad/MaxConcurrent
outside the queue lock races with task assignment and completion. Add
GetWorkerSlotTotals to aggregate under the lock.
Also reset maintenance_next_scan_timestamp_seconds to 0 when the scanner
is not running, so it can't retain a stale value after a stop.
Probe one throwaway write once per process before the lifecycle tests run, absorbing the post-start volume-growth window so the first real PutObject doesn't race volume growth and 500. Each call is bounded by the remaining 60s budget; CreateBucket is retried within it.
A non-versioned metadata-only self-copy (CopyObject with source == destination
and the REPLACE directive) is a read-modify-write of one entry, which is why it
held the distributed lock. It now routes to the owner as a serialized
PATCH_EXTENDED: the owner merges the new managed metadata (set the replacements,
delete the dropped keys) onto a fresh read of the entry under its per-path lock,
so a concurrent change to non-managed keys (legal hold, retention, version id) is
preserved instead of clobbered, and bumps mtime.
PATCH_EXTENDED gains touch_mtime for the mtime bump. Versioned and suspended
self-copies create a new version (already routed via the copy finalize) and the
no-owner bootstrap keep the lock.
A version-specific DELETE (real version or the null version, including
object-lock WORM-checked ones and governance-bypass) now runs as one routed
transaction on the object's owner instead of holding the distributed lock.
For a real version: recompute the .versions pointer excluding the version
(repoint-before-delete, so a crash leaves a recoverable orphan rather than a
dangling pointer), then delete the version file, under the object's per-path lock.
The null version is the regular object entry, deleted directly (no pointer).
Object-lock buckets gate the delete on the version's WORM guards evaluated on the
owner: legal hold (always) + retention (while not elapsed). Governance bypass
scopes the retention guard to COMPLIANCE mode, so the filer allows a
governance-mode delete while still denying compliance and legal hold — the
gateway never reads the version.
Three primitives make this expressible:
- ObjectTransaction.condition_key: evaluate the condition against a named entry
(the version) while the lock stays on lock_key (the object).
- Recompute.exclude_name: omit a child from the scan, to repoint before delete.
- WriteCondition.Clause gate_key/gate_value: scope IF_EXTENDED_TIME_ELAPSED to a
mode, expressing governance bypass without a gateway-side read.
completeMultipartUpload routes its writes to the object's owner filer when an
owner is known, off the distributed lock. Idempotent replay is handled
gateway-side in prepareMultipartCompletionState (it returns the existing result
when the object already carries this UploadId), so the lock is not needed to
dedupe retries; with no owner yet, the lock remains as the bootstrap path.
Versioned completion flips the .versions pointer via routedVersionedFinalize
(RECOMPUTE_LATEST). Non-versioned and suspended completion write the object via
routedMkFile (a routed PUT) so the write serializes with concurrent writes to
the same key on the owner's per-path lock. The version file itself is a unique
path and stays a plain mkFile.
s3: route versioned/suspended delete markers and versioned COPY off the lock
createDeleteMarker flips the .versions pointer via routedVersionedFinalize
(RECOMPUTE_LATEST on the owner filer) when an owner is known, so an Enabled or
Suspended DeleteObject takes its pointer flip off the distributed lock; the
delete marker file is written first and the owner re-derives the pointer.
DeleteObjectHandler routes a versioned/suspended delete with no specific version
straight to the owner, off the lock. A specific-version delete and object-lock
buckets keep the lock (the former needs a recompute-after-delete handled
separately; the latter needs gateway-side enforcement).
CopyObject into a versioned bucket finalizes the new version through the same
routed pointer flip.
routableWriteOwner no longer excludes object-lock buckets, so a versioned PUT
(which creates a new version, never overwriting a locked one) and a
non-versioned overwrite (WORM-checked gateway-side before dispatch) route to the
owner filer like any other write.
routedObjectOwner still excludes object-lock: an unversioned object-lock delete
enforces WORM under the lock, so it stays there rather than routing past the
check. Version-specific deletes likewise stay on the lock — routing them needs
the WORM check (on the version entry) and the latest-pointer recompute (on the
object) under one transaction, which the current single condition target cannot
express.
s3: route versioned PutObject finalize off the distributed lock
A versioned write's finalize (flip the .versions pointer to the newest version,
demote the prior latest) now runs as a single RECOMPUTE_LATEST ObjectTransaction
on the object's owner filer, under its per-path lock, instead of the unserialized
updateLatestVersionInDirectory. The version file is written first; the owner
re-derives the pointer by scanning the directory.
RECOMPUTE_LATEST gains size_to_key / mtime_to_key to cache the chosen version's
size and mtime on the pointer, and demote_key / demote_value to stamp the
displaced prior latest (NoncurrentSinceNs for lifecycle) when the pointer moves.
Falls back to updateLatestVersionInDirectory when no owner is known yet.
When PutBucketVersioning and PutBucketEncryption ran concurrently, each did a
whole-entry read-modify-write of the bucket entry, so one could overwrite the
other's field with a stale copy. Each config write is now a field-level
PATCH_EXTENDED (extended attributes) or set_content (the metadata blob)
ObjectTransaction, routed to the bucket's owner filer and merged onto a fresh
read under its per-path lock. Disjoint fields no longer clobber each other.
s3: route non-versioned object PUT and DELETE off the distributed lock
A non-versioned, non-object-lock object write now goes straight to the key's
owner filer as a single-mutation ObjectTransaction, which serializes it with the
owner's per-path lock and evaluates the precondition, instead of taking a
cluster-wide lock. PUT and DELETE use the object's full path as the lock key, so
a concurrent create and delete of the same key serialize against each other.
The fast path is taken only when the precondition reduces to clauses the filer
can evaluate (existence and a single strong-ETag match); time-based conditions,
ETag lists, weak ETags, post-create hooks, and an unknown owner fall back to the
lock. A routed mutation error other than a failed precondition also falls back,
so the lock path stays the authority for the cases it alone covers.
PrimaryForKey returns "" until the ring view arrives, keeping writes on the lock
until routing is known.
* s3: dial the object lock's primary filer directly
The S3 object write lock builds a fresh short-lived lock per write, each
starting at the seed filer. When the seed isn't the key's hash-ring primary
the filer forwards the request to the primary, and in multi-cluster setups
that forward crosses clusters on every write.
Give the lock client a view of the filer lock ring, fed by the master's
LockRingUpdate broadcasts the gateway already receives, so it dials the
primary directly. The view tracks filer membership by version; a stale view
stays correct because the filer still forwards as a fallback.
Also send the initial ring snapshot to S3 clients, not just filers.
* s3: subscribe to lock-ring updates before starting the master loop
The master delivers the initial LockRingUpdate once, on connect. Registering the
callback after KeepConnectedToMaster started left a window where that first
update could arrive before the handler was set and be dropped, delaying the ring
view until the next membership change. Build the lock client and register the
callback in the masters block before launching the loop; the filers block reuses
that client (or creates a plain one when no masters are configured).
* lock_manager: build the hash ring in a deterministic server order
rebuildRing ranged over the server set (a map), whose iteration order is
randomized per process. On a vnode hash collision the last writer into
vnodeToServer wins, so two nodes holding the same server set could resolve the
collision to different servers and disagree on the primary for keys near that
slot. Now that the S3 gateway also computes PrimaryForKey, such a disagreement
would route the same key to different filers and defeat per-path serialization.
Iterate the servers in sorted order so the ring is identical on every node with
the same set, regardless of discovery order.
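A small model of the fix: sort the server names before populating the ring so collisions always resolve the same way. The hash (CRC32 reduced to a deliberately tiny slot space, to force collisions) and the names buildRing and sameRing are assumptions for illustration, not the lock_manager implementation.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
	"strconv"
)

// buildRing places vnodes for every server into a slot map. Servers are
// visited in sorted order, so when two vnodes collide on a slot the same
// server wins on every node. slots must be non-zero.
func buildRing(servers map[string]struct{}, vnodes int, slots uint32) map[uint32]string {
	names := make([]string, 0, len(servers))
	for name := range servers {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is randomized; this removes it

	ring := make(map[uint32]string)
	for _, name := range names {
		for i := 0; i < vnodes; i++ {
			slot := crc32.ChecksumIEEE([]byte(name+"#"+strconv.Itoa(i))) % slots
			ring[slot] = name // last writer wins, in a fixed order
		}
	}
	return ring
}

// sameRing reports whether two rings map every slot to the same server.
func sameRing(a, b map[uint32]string) bool {
	if len(a) != len(b) {
		return false
	}
	for slot, server := range a {
		if b[slot] != server {
			return false
		}
	}
	return true
}

func main() {
	servers := map[string]struct{}{"filer0:8888": {}, "filer1:8888": {}, "filer2:8888": {}}
	first := buildRing(servers, 8, 16)
	stable := true
	for i := 0; i < 100; i++ {
		if !sameRing(first, buildRing(servers, 8, 16)) {
			stable = false
		}
	}
	fmt.Println(stable) // true
}
```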
* lock_manager: skip redundant ring rebuilds, trim comments
SetRing now ignores a non-zero version at or below the current one once a ring
exists, so repeated LockRingUpdate broadcasts on reconnect no longer rebuild the
ring.
* s3: hold the lock-ring client on the server for route-by-key
Store the object-write lock client on S3ApiServer so handlers can resolve a
key's owner filer via PrimaryForKey.
* filer: let PATCH_EXTENDED replace Entry.content
PATCH_EXTENDED merges extended attributes under the per-path lock, reading the
entry fresh, so concurrent patches to different keys don't clobber each other.
Some single-key state lives in Entry.content rather than an extended attribute
(e.g. the S3 bucket metadata blob). Add set_content/content to the mutation so a
patch can replace content the same way -- read fresh, set content, preserve the
rest -- letting a content write and an extended-attribute write on the same
entry serialize on the lock instead of racing whole-entry rewrites.
* Update weed/server/filer_grpc_server.go
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* filer: test set_content FileSize sync; note chosen content-patch approach
Cover the FileSize behavior of a set_content patch: a file's size follows the
new content length (including when it shrinks), a directory's stays zero. Also
document, in the bucket-config design, that extending PATCH_EXTENDED with
set_content is the implemented path for content-backed config.
---------
CreateEntry starts with a FindEntry to load the current entry. A conditional
CreateEntry has already fetched that entry to evaluate the precondition under
the per-path lock, so the create repeated the lookup.
Add an existing *Entry parameter: when non-nil it is used as the current entry
and the internal lookup is skipped; nil keeps the lookup. The gRPC CreateEntry
handler passes the entry it fetched for the precondition, removing the redundant
read while the lock is held. All other callers pass nil.
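A rough sketch of the optional-existing-entry pattern. The filer, entry,
findEntry and createEntry names here are simplified stand-ins, not the real
signatures.

```go
package main

import "fmt"

type entry struct{ Name string }

// filer is a hypothetical in-memory store that counts lookups.
type filer struct {
	entries map[string]*entry
	lookups int
}

func (f *filer) findEntry(path string) *entry {
	f.lookups++
	return f.entries[path]
}

// createEntry uses existing as the current entry when it is non-nil and
// only falls back to its own lookup when the caller passes nil.
func (f *filer) createEntry(path string, e *entry, existing *entry) *entry {
	current := existing
	if current == nil {
		current = f.findEntry(path)
	}
	f.entries[path] = e
	return current // the entry that was replaced, if any
}

func main() {
	f := &filer{entries: map[string]*entry{"/b/o": {Name: "old"}}}

	cur := f.findEntry("/b/o") // fetched once to evaluate the precondition
	if cur != nil {            // hypothetical precondition: entry must exist
		f.createEntry("/b/o", &entry{Name: "new"}, cur)
	}
	fmt.Println(f.lookups) // 1: the create did not repeat the lookup
}
```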
A multi-object delete spans many keys that route to different owner filers. The
gateway groups keys by owner and sends one batch per owner; the filer applies
each transaction under its own per-path lock, independent of the others.
A failed transaction (precondition or mutation error) is reported in its own
response without aborting the rest, matching S3 multi-object semantics where
each key succeeds or fails on its own. There is no cross-key atomicity, which S3
batch delete does not require.
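The grouping and per-key result handling could look roughly like this.
groupByOwner, applyBatch, deleteResult and the ownerOf resolver are
hypothetical names; in the gateway the resolver would be a PrimaryForKey-style
lookup.

```go
package main

import (
	"errors"
	"fmt"
)

var errPrecondition = errors.New("precondition failed")

// groupByOwner buckets keys by the filer that owns them.
func groupByOwner(keys []string, ownerOf func(string) string) map[string][]string {
	batches := make(map[string][]string)
	for _, k := range keys {
		o := ownerOf(k)
		batches[o] = append(batches[o], k)
	}
	return batches
}

// deleteResult carries the outcome of one key's transaction.
type deleteResult struct {
	Key string
	Err error
}

// applyBatch runs each key's delete independently: a failure is recorded
// in that key's result and the remaining keys still run.
func applyBatch(keys []string, deleteOne func(string) error) []deleteResult {
	results := make([]deleteResult, 0, len(keys))
	for _, k := range keys {
		results = append(results, deleteResult{Key: k, Err: deleteOne(k)})
	}
	return results
}

func main() {
	ownerOf := func(key string) string { // toy owner function for the demo
		if len(key)%2 == 0 {
			return "filer-a"
		}
		return "filer-b"
	}
	deleteOne := func(key string) error {
		if key == "locked" {
			return errPrecondition
		}
		return nil
	}
	for owner, keys := range groupByOwner([]string{"a", "bb", "locked", "dddd"}, ownerOf) {
		for _, r := range applyBatch(keys, deleteOne) {
			fmt.Println(owner, r.Key, r.Err)
		}
	}
}
```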
* s3: use oidc: prefix for trust-policy conditions in IAM example
Trust-policy conditions for AssumeRoleWithWebIdentity see OIDC claims
under the oidc: prefix, so the docker example's bare "roles" key never
matched and denied every web-identity assume against those roles. Switch
the three roles to oidc:roles.
Also document the available trust-policy condition keys (oidc:iss/sub/aud,
oidc:<claim>, aws:FederatedProvider, aws:userid, sts:DurationSeconds) and
note that roleMapping selects the role for direct OIDC bearer auth while
STS uses the explicit RoleArn plus trust policy.
* s3: clarify aws:userid differs between trust policy and request auth
aws:userid is the raw sub claim during trust-policy evaluation, but a
stable sub+iss hash (ComputeParentUser) during S3 request authorization
after the role is assumed. Note both so the two contexts aren't conflated.
Routing object-lock buckets off the distributed lock needs the retention and
legal-hold check to run atomically with the write, under the per-path lock. Move
just the comparison into the filer, not the S3 semantics, as two generic clause
kinds that each test one extended attribute.
IF_EXTENDED_NOT_EQUAL blocks while extended[ext_key] equals ext_value (a legal
hold). IF_EXTENDED_TIME_ELAPSED blocks while extended[ext_key], read as a unix-
second deadline, is in the future against the filer's clock (retention); a
malformed deadline fails safe. The caller composes these from the object-lock
state and, for a governance bypass, simply omits the retention clause once the
bypass is authorized -- the filer makes no authorization decision and keeps no
S3 knowledge.
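A sketch of how the two clause kinds might evaluate. The clause struct, the
blocked helper and the attribute names are hypothetical, and treating a missing
attribute as "not blocked" is an assumption this change does not spell out.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

type clauseKind int

const (
	ifExtendedNotEqual clauseKind = iota
	ifExtendedTimeElapsed
)

type clause struct {
	Kind     clauseKind
	ExtKey   string
	ExtValue string // only used by ifExtendedNotEqual
}

// blocked reports whether the clause rejects the write at nowUnix.
func blocked(c clause, extended map[string][]byte, nowUnix int64) bool {
	raw, present := extended[c.ExtKey]
	switch c.Kind {
	case ifExtendedNotEqual:
		// Blocks while the attribute equals the given value (legal hold).
		return present && string(raw) == c.ExtValue
	case ifExtendedTimeElapsed:
		if !present {
			return false // assumption: no deadline means nothing to wait for
		}
		deadline, err := strconv.ParseInt(string(raw), 10, 64)
		if err != nil {
			return true // malformed deadline fails safe
		}
		return deadline > nowUnix // still in the future (retention)
	default:
		return true // unknown kinds fail safe in this sketch
	}
}

func main() {
	ext := map[string][]byte{
		"legal-hold":   []byte("ON"),
		"retain-until": []byte("4102444800"), // a far-future unix-second deadline
	}
	clauses := []clause{
		{Kind: ifExtendedNotEqual, ExtKey: "legal-hold", ExtValue: "ON"},
		// For an authorized governance bypass the caller would omit this one.
		{Kind: ifExtendedTimeElapsed, ExtKey: "retain-until"},
	}
	now := time.Now().Unix()
	for _, c := range clauses {
		fmt.Println(c.ExtKey, "blocked:", blocked(c, ext, now))
	}
}
```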
Deleting a specific version that happens to be the latest needs the new latest
re-derived from the remaining versions, and that scan must run under the same
lock as the delete. The gateway can't do it atomically across RPCs.
Add a RECOMPUTE_LATEST mutation: it scans a directory under the transaction
lock, picks the child whose name sorts last (descending) or first (ascending), copies the
mapped extended keys from it into a pointer entry, and stores its name under
name_to_key. An empty directory clears the pointer keys. The filer stays
mechanical and S3-agnostic: the caller, which knows the versioning scheme,
supplies the sort direction and the key mappings. A missing pointer entry is a
no-op, so a replayed transaction is idempotent.
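One way to picture the mutation, as an in-memory sketch. The entry type, the
recomputeLatest signature, and the reading of the key mappings as "source key
on the child to destination key on the pointer" are assumptions.

```go
package main

import "fmt"

type entry struct {
	Name     string
	Extended map[string][]byte
}

// recomputeLatest re-derives the pointer entry from the directory's
// children. A nil pointer (missing pointer entry) is a no-op.
func recomputeLatest(children []entry, pointer *entry, descending bool, keyMap map[string]string, nameToKey string) {
	if pointer == nil {
		return
	}
	if pointer.Extended == nil {
		pointer.Extended = make(map[string][]byte)
	}
	if len(children) == 0 {
		// Empty directory: clear every key this mutation manages.
		for _, dst := range keyMap {
			delete(pointer.Extended, dst)
		}
		delete(pointer.Extended, nameToKey)
		return
	}
	pick := children[0]
	for _, c := range children[1:] {
		if (descending && c.Name > pick.Name) || (!descending && c.Name < pick.Name) {
			pick = c
		}
	}
	for src, dst := range keyMap {
		if v, ok := pick.Extended[src]; ok {
			pointer.Extended[dst] = v
		} else {
			delete(pointer.Extended, dst)
		}
	}
	pointer.Extended[nameToKey] = []byte(pick.Name)
}

func main() {
	children := []entry{
		{Name: "v1", Extended: map[string][]byte{"size": []byte("10")}},
		{Name: "v3", Extended: map[string][]byte{"size": []byte("30")}},
		{Name: "v2", Extended: map[string][]byte{"size": []byte("20")}},
	}
	pointer := &entry{Name: ".versions"}
	recomputeLatest(children, pointer, true, map[string]string{"size": "latest-size"}, "latest-name")
	fmt.Println(string(pointer.Extended["latest-name"]), string(pointer.Extended["latest-size"]))
}
```

Running the same call twice leaves the pointer unchanged, which is what makes a
replay safe.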
A versioned object write touches several entries that must change together: the
main object, a delete marker or version file, and the latest pointer on the
.versions directory. Holding a distributed lock across separate RPCs to do this
is what the per-path lock was meant to replace, but a single CreateEntry only
covers one entry.
Add ObjectTransaction: a request carries a lock_key (the object path), an
optional WriteCondition, and an ordered list of mutations (PUT / DELETE /
PATCH_EXTENDED). The filer holds the per-path lock on lock_key for the whole
call, checks the condition against the entry at lock_key, then applies the
mutations in order. Callers route the object's writes to its owner filer so the
lock is authoritative across all of the object's entries.
DELETE and PATCH of an absent entry are no-ops, so a replayed transaction is
idempotent. PUT entries are metadata-scoped; data-bearing writes (chunks) are
written before the transaction, as today.
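The request flow can be sketched against an in-memory store. Everything here
(the filer type, the mutation struct, the condition callback, the paths) is
illustrative, not the real RPC or protobuf shape; entries are reduced to their
extended attributes.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type mutationKind int

const (
	mutPut mutationKind = iota
	mutDelete
	mutPatchExtended
)

type mutation struct {
	Kind     mutationKind
	Path     string
	Extended map[string]string
}

var errConditionFailed = errors.New("write condition failed")

type filer struct {
	mu      sync.Mutex // guards the maps below; a sketch artifact
	locks   map[string]*sync.Mutex
	entries map[string]map[string]string
}

func newFiler() *filer {
	return &filer{locks: map[string]*sync.Mutex{}, entries: map[string]map[string]string{}}
}

func (f *filer) pathLock(key string) *sync.Mutex {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.locks[key] == nil {
		f.locks[key] = &sync.Mutex{}
	}
	return f.locks[key]
}

// objectTransaction holds the per-path lock on lockKey for the whole call,
// checks cond against the entry at lockKey, then applies muts in order.
func (f *filer) objectTransaction(lockKey string, cond func(ext map[string]string, exists bool) bool, muts []mutation) error {
	l := f.pathLock(lockKey)
	l.Lock()
	defer l.Unlock()
	f.mu.Lock()
	defer f.mu.Unlock()

	if cond != nil {
		cur, ok := f.entries[lockKey]
		if !cond(cur, ok) {
			return errConditionFailed
		}
	}
	for _, m := range muts {
		cur, ok := f.entries[m.Path]
		switch m.Kind {
		case mutPut:
			next := make(map[string]string, len(m.Extended))
			for k, v := range m.Extended {
				next[k] = v
			}
			f.entries[m.Path] = next
		case mutDelete:
			delete(f.entries, m.Path) // absent entry: no-op
		case mutPatchExtended:
			if !ok {
				continue // absent entry: no-op
			}
			for k, v := range m.Extended {
				cur[k] = v
			}
		}
	}
	return nil
}

func main() {
	f := newFiler()
	f.entries["/b/o.versions"] = map[string]string{}
	err := f.objectTransaction("/b/o", nil, []mutation{
		{Kind: mutPut, Path: "/b/o.versions/v2", Extended: map[string]string{"etag": "abc"}},
		{Kind: mutPatchExtended, Path: "/b/o.versions", Extended: map[string]string{"latest": "v2"}},
		{Kind: mutDelete, Path: "/b/o"},
	})
	fmt.Println(err, f.entries["/b/o.versions"]["latest"])
}
```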