14008 Commits

Author SHA1 Message Date
Chris Lu
ba9e74d8a7 docs: add zyner as a gold sponsor 2026-05-29 12:40:58 -07:00
7y-9
fbcba51e73 refactor: avoid unused sql insert result (#9734) 2026-05-29 00:45:45 -07:00
Chris Lu
c9623007a2 fix(filer.sync): keep sync_offset fresh through filtered-event markers (#9733)
On a read-only watched path the idle heartbeat keeps sync_offset fresh,
but a busy source filer still emits a MaxUnsyncedEvents marker after many
filtered events. The marker has a non-nil but empty EventNotification, so
the client routed it to the event path, where it advanced no real
watermark yet drove offsetFunc to republish the stale processed
watermark — regressing the gauge between heartbeats and spiking the
derived lag every time a filtered-event burst landed.

Route the empty marker through OnIdleHeartbeat like the idle heartbeat so
its fresh timestamp keeps the gauge current; it still advances the
in-stream resume cursor.
2026-05-28 23:29:59 -07:00
Chris Lu
5955972fe6 fix(shell): verify volume.merge output before overwriting replicas (#9731)
* fix(shell): verify volume.merge output before overwriting replicas

volume.merge overwrote every replica with the merged copy without checking it was complete. Read back the merged copy and refuse to overwrite unless it holds at least as many live needles as the most complete source replica, leaving the originals intact on a short or empty merge.

* fix(shell): keep merged volume until all replicas are rebuilt

On a copy failure partway through the overwrite loop, the temporary merged copy was deleted along with the half-rebuilt replicas. Stop deleting it until every replica has been rebuilt; on failure the verified copy is kept so the merge can be re-run to completion.

* refactor(shell): reuse readVolumeStatus in ensureVolumeReadonly

* fix(shell): guard against nil volume status response
2026-05-28 19:29:25 -07:00
Chris Lu
16717b0bf4 fix(s3): authenticate JWT unsigned-streaming uploads (#9729)
A bearer-token client whose SDK appends a CRC32 trailer sends an
unsigned-streaming PUT (STREAMING-UNSIGNED-PAYLOAD-TRAILER) with no SigV4
signature, so getRequestAuthType classifies it as authTypeStreamingUnsigned.
The auth dispatch ignored the bearer token and fell back to anonymous, and
newChunkedReader tried to verify the bearer token as a SigV4 seed signature
and failed, so the body could not be decoded either.

Dispatch the streaming-unsigned auth on whatever credential is present
(SigV4 / JWT / anonymous), and skip the SigV4 seed-signature recompute for
JWT requests in the chunked reader.
2026-05-28 18:10:24 -07:00
Chris Lu
2f0643e5b1 fix(volume): stop flipping volumes read-only on a non-append-ordered .idx (#9726)
* fix(volume): verify the .dat-tail needle in the integrity check

CheckVolumeDataIntegrity checked the last entry by file position in the .idx
and, for a live needle, flipped the volume read-only when fileSize > fileTailOffset.
That entry is the .dat tail only when the .idx is in append order; a key-sorted
.idx (weed fix and other rebuilds listed entries by key) puts the highest-key
needle last, whose tail sits mid-file, so healthy volumes went read-only on every
load and re-running weed fix only reproduced the sorted index.

Locate the needle at the maximum offset — the one physically last in the .dat —
and verify the .dat ends exactly at it, regardless of .idx ordering. The
append-ordered common case stays O(1) (the last entry's on-disk end matches the
.dat size); only a key-sorted index pays a single linear scan. Deletion
tombstones at the tail are now verified too, instead of skipping the file-size
check.

* fix(command): weed fix rebuilds the .idx in .dat offset order

SaveToIdx wrote entries via AscendingVisit — sorted by key, the .sdx/.ecx shape
— so the rebuilt .idx put the highest-key needle last instead of the .dat-tail
needle, and dropped tombstones whose live needle was gone. Collect the live and
deleted entries, sort by .dat offset, and write them in append order so the .idx
stays a faithful log whose last entry is the real .dat tail.
2026-05-28 18:04:31 -07:00
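The tail check described in 2f0643e5b1 can be sketched roughly as below. This is a simplified illustration, not the project's code: `idxEntry`, `tailEntry`, and `datEndsAtTail` are hypothetical names, and the entry is assumed to carry a plain byte offset and on-disk length (the real index encoding, needle headers, and padding are left out).

```go
package main

import "fmt"

// idxEntry is a simplified, hypothetical index record: where a needle
// starts in the .dat and how many bytes it occupies on disk.
type idxEntry struct {
	Key    uint64
	Offset int64
	Length int64
}

// tailEntry returns the entry that is physically last in the .dat.
// Fast path: if the last index entry already ends at datSize, the index is
// in append order and no scan is needed. Otherwise do one linear scan for
// the entry with the maximum offset.
func tailEntry(entries []idxEntry, datSize int64) (idxEntry, bool) {
	if len(entries) == 0 {
		// No tail to verify; the caller decides what an empty index means.
		return idxEntry{}, false
	}
	last := entries[len(entries)-1]
	if last.Offset+last.Length == datSize {
		return last, true
	}
	tail := entries[0]
	for _, e := range entries[1:] {
		if e.Offset > tail.Offset {
			tail = e
		}
	}
	return tail, true
}

// datEndsAtTail reports whether the .dat ends exactly at the tail needle,
// regardless of how the index is ordered.
func datEndsAtTail(entries []idxEntry, datSize int64) bool {
	t, ok := tailEntry(entries, datSize)
	return ok && t.Offset+t.Length == datSize
}

func main() {
	// Key-sorted index: the highest key (9) sits mid-file, key 5 is the tail.
	sorted := []idxEntry{{1, 8, 40}, {5, 88, 40}, {9, 48, 40}}
	fmt.Println(datEndsAtTail(sorted, 128))
	fmt.Println(datEndsAtTail(sorted, 136))
}
```

Checking only the last index entry would compare key 9's end (88) against the file size and misreport a healthy volume; the maximum-offset scan avoids that.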
Chris Lu
685571d93f fix(s3): allow anonymous unsigned-streaming PutObject (#9727)
Modern botocore attaches a CRC32 trailer to plain PutObject, turning the
payload into STREAMING-UNSIGNED-PAYLOAD-TRAILER. An anonymous upload then
carries that header but no Authorization, so it was classified as
authTypeStreamingUnsigned and sent straight to SigV4 verification, which
rejected it as AccessDenied while explicit credentials kept working.

Fall back to the anonymous identity when an unsigned-streaming request
carries no signature, mirroring the plain anonymous path. The request
stays classified as unsigned-streaming so the chunked body is still
decoded.
2026-05-28 17:00:41 -07:00
Chris Lu
f5b833ab6a test(ec): end-to-end encode over a multi-server multi-disk stuck layout (#9728)
* test(framework): support multiple disks per server in MultiVolumeCluster

StartMultiVolumeClusterWithDisks gives each volume server N data
directories (one DiskLocation each), passed to -dir as a comma list, with
a per-server disk-dir accessor for file inspection. StartMultiVolumeCluster
keeps its one-disk default.

* test(ec): end-to-end encode over a multi-server multi-disk stuck layout

A volume in the stuck state — real .dat source, a 0-byte stub replica, and
partial stale EC shards from an interrupted encode — must converge to one
valid EC layout. Asserts the full shard set across servers, .ecx/.vif kept
per server (info file survives the source-volume delete), stale shards
cleared, and no regular .dat/.idx left behind.
2026-05-28 16:44:42 -07:00
Chris Lu
3674f9d04d fix(storage): keep EC .vif when deleting a coexisting regular volume (#9723)
* fix(storage): keep EC .vif when deleting a coexisting regular volume

A regular volume and an EC volume for the same id share <base>.vif. When
EC shards are distributed onto a server that still holds the regular
volume — the encode source, or any replica the planner targets — the
post-encode VolumeDelete ran removeVolumeFiles and stripped the shared
.vif, leaving the freshly built EC volume without its info file.

Skip the .vif in removeVolumeFiles when an EC volume for the same id
exists on the disk (mounted, or a sealed .ecx on disk). The regular
volume's .dat/.idx still go; the EC sidecars survive.

A two-server end-to-end test encodes a volume whose source and a stub
replica both also receive shards, and asserts the final on-disk layout:
both .dat/.idx gone, each server holding only its assigned shards plus
.ecx/.vif. Storage unit tests cover the with-EC and no-EC cases, and the
Rust seaweed-volume port carries the same guard and tests.

* test(storage): assert .idx is removed in the no-EC destroy case

Strengthen TestDestroyRemovesVifWhenNoEc to confirm the full regular
volume cleanup (.dat, .idx, .vif) when no EC volume coexists.
2026-05-28 15:39:31 -07:00
Chris Lu
dfd05d14cb refactor(filer): remove the inode->path index and the NFS gateway (#9724)
* fix(filer): derive inodes by hash instead of a snowflake sequencer

Compute the same inode the FUSE mount would: non-hard-linked entries hash path + crtime, hard links hash their shared HardLinkId so every link resolves to one inode. Removes the snowflake inodeSequencer and the SEAWEEDFS_FILER_SNOWFLAKE_ID knob; inodes are now deterministic across filers.

* chore: remove the experimental NFS gateway

The NFS frontend ('weed nfs') was the only consumer of the inode->path index. Remove the weed/server/nfs package, the command and its registration, the integration test harness, and the CI workflow; go mod tidy drops the willscott/go-nfs and go-nfs-client dependencies.

* refactor(filer): drop the inode->path index

With the NFS gateway gone, nothing reads it. A regular file's inode is a pure hash of its path and a hard link's is a hash of its shared HardLinkId -- both derivable on demand -- so the secondary KV index and its write/remove hooks are dead. Removes filer_inode_index.go and the recordInodeIndex hooks from the store wrapper.
2026-05-28 15:00:18 -07:00
Konstantin Lebedev
3537312045 [docker] add make test_keycloak_s3 for local development and debugging (#9719)
* add make test_keycloak_s3 for local development and debugging

* fix typos

* add condition oidc:azp

* docker: reuse test/s3/iam realm and iam config for keycloak dev compose

Point the keycloak dev compose at the existing test/s3/iam configs instead
of a parallel realm/port/key/role set. Adds one declarative realm import
(seaweedfs-test-realm.json) as the single realm source and drops the
duplicated iam.json/s3.json.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-28 13:39:32 -07:00
Chris Lu
b1dcb6c52e fix(ec): delete empty stub replicas before distributing EC shards (#9722)
* fix(ec): delete empty stub replicas before distributing EC shards

An interrupted encode can leave a 0-byte .dat replica behind. Until now
the only thing that removed it was deleteOriginalVolume, which runs after
distribute+mount and calls VolumeDelete -> removeVolumeFiles. A regular
volume and an EC volume share the same <collection>_<vid>.vif path, so
deleting the stub at that point strips the .vif out from under the
freshly distributed shards.

Sweep the original replicas with VolumeDelete(OnlyEmpty=true) before
distribute: doIsEmpty uses the same superblock threshold, so only the
0-byte stubs go and any data-bearing replica is refused and kept for the
post-verify delete. Servers cleared in the sweep are skipped by
deleteOriginalVolume so it never touches a server that now holds only EC
shards.

* fix(ec): fail the encode when an empty-replica sweep can't confirm a node

The sweep swallowed every VolumeDelete(OnlyEmpty) error, so a transient
failure on a stub node fell through to the post-verify force-delete on
that node — the shared-.vif clobber the sweep exists to avoid.

Treat only the expected cases (volume not empty, or already gone) as
leave-in-place; any other error propagates and fails the encode, which
rolls back the readonly marks and retries next cycle.
2026-05-28 13:21:24 -07:00
Chris Lu
691e601e6f fix(ec): prefer credible replica as canonical metric in EC detection (#9717)
* fix(ec): prefer credible replica as canonical metric in EC detection

An interrupted encode can leave a 0-byte .dat replica behind. When that
stub sits on a lower-sorting server than the real replica, the
lowest-server canonical pick reported Size=0, tripped the min-size gate,
and the volume was stranded in skippedTooSmall: detection never proposed
an encode, so the partial EC shards were never cleared and re-distribute
kept hitting the mounted-volume guard.

selectCanonicalMetric now prefers the lowest-server credible replica
(data-bearing, not already EC), falling back to the lowest-server metric
only when nothing is credible so the downstream gates skip as before. A
leftover EC shard set on a lower server no longer short-circuits the
volume at the IsECVolume guard either, so the orphan-source cleanup and
re-encode paths get their chance.

* fix(ec): treat a bare superblock .dat as a stub too

An interrupted encode or copy can write the 8-byte superblock and then
fail, leaving an 8-byte .dat with no data. isStubReplica used a strict <
so that file slipped through as credible, could win the canonical pick on
a low server, and re-tripped the min-size gate. Use <= the superblock so a
data-less .dat never shadows a real replica.
2026-05-28 13:06:21 -07:00
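A minimal sketch of the selection rule from 691e601e6f, assuming a simplified replica record. The `replica` type, `isStub`, and `pickCanonical` are hypothetical stand-ins for the real metric types and for `isStubReplica` / `selectCanonicalMetric`, whose actual signatures are not shown in the log.

```go
package main

import "fmt"

// superBlockSize is the 8-byte superblock mentioned in the commit message.
const superBlockSize = 8

// isStub reports whether a .dat of the given size holds no needle data.
// Using <= (not <) also catches a file that is exactly one bare superblock.
func isStub(datSize uint64) bool {
	return datSize <= superBlockSize
}

// replica is a hypothetical, simplified view of one replica's metrics.
type replica struct {
	Server string
	Size   uint64
	IsEC   bool
}

// pickCanonical prefers the lowest-sorting server that holds a credible
// replica (data-bearing, not already EC). If nothing is credible it falls
// back to the lowest-sorting server overall.
func pickCanonical(rs []replica) (replica, bool) {
	var lowest, credible *replica
	for i := range rs {
		r := &rs[i]
		if lowest == nil || r.Server < lowest.Server {
			lowest = r
		}
		if !r.IsEC && !isStub(r.Size) && (credible == nil || r.Server < credible.Server) {
			credible = r
		}
	}
	switch {
	case credible != nil:
		return *credible, true
	case lowest != nil:
		return *lowest, true
	}
	return replica{}, false
}

func main() {
	rs := []replica{
		{Server: "a:8080", Size: 8},       // bare superblock stub
		{Server: "b:8080", Size: 1 << 20}, // real data
	}
	got, _ := pickCanonical(rs)
	fmt.Println(got.Server, got.Size)
}
```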
qzhello
5b1098e2ad fix(s3): honor MetadataDirective=REPLACE for system metadata on CopyObject (#9721)
* fix(s3): honor MetadataDirective=REPLACE for system metadata on CopyObject

* fix(s3): match copy metadata keys case-insensitively for legacy data

Legacy / non-S3 write paths (FUSE mount, direct filer HTTP API, older
versions) may persist Cache-Control etc. in lowercase form. Make
isManagedCopyMetadataKey case-insensitive so mergeCopyMetadata still
clears stale source values under REPLACE, and let the COPY branch of
processMetadataBytes fall back to a lowercase key on the source so
legacy values survive into the destination (re-emitted as canonical).

Mirrors the existing x-amz-meta-* backward-compat path.

* fix(s3): keep legacy non-canonical tag and system metadata across COPY

The previous case-insensitive isManagedCopyMetadataKey caused
mergeCopyMetadata to delete legacy lowercase x-amz-tagging-* and
mixed-case system headers, but the COPY branch in processMetadataBytes
only matched canonical or strict-lowercase keys when re-populating
them, so any non-canonical key was permanently dropped on COPY.

- COPY now scans existing in a single pass and uses strings.EqualFold
  against the system header whitelist, re-emitting under the canonical
  header name. Handles any case folding (CACHE-CONTROL, Cache-control,
  etc.), not just strings.ToLower.
- COPY tagging branch now uses hasPrefixFold(k, AmzObjectTagging) and
  re-emits the canonical X-Amz-Tagging-<suffix>, mirroring the existing
  X-Amz-Meta-* migration path.
- Tests cover lowercase/uppercase/mixed-case system headers and tags
  surviving COPY.

* fix(s3): make COPY of system metadata and tags deterministic across case variants

Single-pass EqualFold matching let Go's randomized map iteration pick
either the canonical or a legacy-cased value when both lived on the
source, so the COPY result varied between calls.

Both COPY branches now use two passes: a canonical-exact lookup first,
then a case-insensitive fallback that only writes when the canonical
slot is still empty. Mirrors the collision-check pattern used by the
X-Amz-Meta-* migration path.

Tests run the canonical-vs-legacy collision 32 times each to exercise
varied map orders.

* fix(s3): apply REPLACE Content-Type on in-place copy

The metadata-only self-copy path never set Attributes.Mime, so a same-key
CopyObject with REPLACE and a new Content-Type silently kept the old type.
Route in place only when the Mime is unchanged; otherwise take the locked
clone path (still metadata-only, reuses source chunks) and set the new Mime
there. Also covers the versioned self-copy path.

* perf(s3): drop per-key ToLower in isManagedCopyMetadataKey

Use the allocation-free hasPrefixFold helper instead of lowercasing the key
and both constant prefixes on every metadata-key check.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-28 12:55:08 -07:00
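The two-pass lookup described in 5b1098e2ad might look roughly like this. It is a sketch under assumptions: `copySystemHeader` is a hypothetical helper, metadata is modelled as a plain `map[string][]byte`, and the tie-break between several legacy-cased keys (smallest key wins) is this example's own choice, not something the commit states.

```go
package main

import (
	"fmt"
	"strings"
)

// copySystemHeader copies one system header from src to dst under its
// canonical name. Pass 1 is an exact canonical lookup; pass 2 is a
// case-insensitive fallback that only runs when pass 1 found nothing.
func copySystemHeader(dst, src map[string][]byte, canonical string) {
	// Pass 1: the canonical key always wins when present.
	if v, ok := src[canonical]; ok {
		dst[canonical] = v
		return
	}
	// Pass 2: legacy casing. Map iteration order is randomized in Go, so
	// pick the lexicographically smallest matching key to stay deterministic.
	best := ""
	for k := range src {
		if strings.EqualFold(k, canonical) && (best == "" || k < best) {
			best = k
		}
	}
	if best != "" {
		dst[canonical] = src[best]
	}
}

func main() {
	src := map[string][]byte{
		"Cache-Control": []byte("max-age=60"),
		"cache-control": []byte("no-store"), // legacy lowercase duplicate
	}
	dst := map[string][]byte{}
	copySystemHeader(dst, src, "Cache-Control")
	fmt.Println(string(dst["Cache-Control"]))
}
```

A single `EqualFold` pass over the map could return either value here depending on iteration order, which is the nondeterminism the commit describes.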
Lars Lehtonen
21ab68aa94 chore(weed/storage/backend/s3_backend): remove unused function (#9715)
* chore(weed/storage/backend/s3_backend): remove unused function

* fix(s3_backend): cache session under the composite region|endpoint key

createSession looked up sessions by region|endpoint but stored them by
region alone, so the cache never hit and a new session was built every
call. With getSession gone the lock can also drop to a plain Mutex.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-27 22:14:45 -07:00
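To illustrate the cache-key bug fixed in 21ab68aa94, here is a small sketch in which lookup and store use the same composite key. The `session` and `sessionCache` types are hypothetical placeholders; the real code caches AWS SDK sessions.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a placeholder for whatever client object is expensive to build.
type session struct {
	Region, Endpoint string
}

// sessionCache guards the map with a plain Mutex, since every access goes
// through get and there is no separate read-only path.
type sessionCache struct {
	mu       sync.Mutex
	sessions map[string]*session
	built    int // how many sessions were actually constructed
}

func newSessionCache() *sessionCache {
	return &sessionCache{sessions: make(map[string]*session)}
}

// get returns the cached session for region|endpoint, building it once.
func (c *sessionCache) get(region, endpoint string) *session {
	key := region + "|" + endpoint
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.sessions[key]; ok {
		return s
	}
	s := &session{Region: region, Endpoint: endpoint}
	c.built++
	// Store under the same key used for the lookup. Storing under region
	// alone would make every later lookup miss.
	c.sessions[key] = s
	return s
}

func main() {
	c := newSessionCache()
	c.get("us-east-1", "https://s3.example.com")
	c.get("us-east-1", "https://s3.example.com")
	fmt.Println("sessions built:", c.built)
}
```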
Chris Lu
24e664d651 fix(shell): don't halt volume.fsck purge on a stuck read-only volume (#9714)
* fix(shell): don't halt volume.fsck purge on a stuck read-only volume

A failed VolumeMarkWritable on one volume aborted the entire fsck purge
run; per-volume errors now log and continue so remaining volumes still
get purged.

* fix(shell): unify volume.fsck per-volume skip logging at the caller

Return the mark-writable error from purgeOneVolume instead of logging
in two places — the caller already prints "skip purging volume N: %v"
and defers still fire on the error return.

* fix(shell): collect volume.fsck purge-skipped volumes and report at end

Track volume IDs whose purge was skipped (mark-writable failure or
other per-volume errors) and print a sorted summary so operators don't
have to scrape the run log to find them. Deletes for those volumes are
already skipped; this just makes them explicit.
2026-05-27 17:49:35 -07:00
7y-9
bbbc3925ec fix: validate s3 ownership controls rule (#9684) 2026-05-27 14:41:10 -07:00
qzhello
69c84801e4 fix(s3tables/iceberg): make metadata spec-compliant and accept real-world manifest names (#9703)
* fix(s3tables/iceberg): make metadata spec-compliant and accept real-world manifest names

Two related issues prevent SeaweedFS S3 Tables from interoperating with
strict Iceberg clients (Java/Spark/Flink/Trino):

1. iceberg-go v0.5.0 serializes empty TableMetadata state by dropping
   keys via `omitempty` on optional pointer/slice fields. The Iceberg
   table spec, however, requires `current-snapshot-id`, `snapshots`,
   `snapshot-log`, `metadata-log`, and `refs` to be present even when
   empty (`current-snapshot-id` must be -1 for a table with no
   snapshots). Java's TableMetadataParser uses JsonUtil.getLong on
   `current-snapshot-id` and throws "Cannot parse missing long
   current-snapshot-id" against responses produced by this server.

2. The Iceberg layout validator only accepts manifest filenames that
   match Iceberg's internal naming (`{uuid}-m{n}.avro`,
   `snap-{n}-{n}-{uuid}.avro`). Real writers — notably Flink's sink —
   emit manifests like
   `{flink-job-id}-{checkpoint}-{operator-id}-{n}.avro`, which the
   validator rejects with 403, breaking INSERT commits.

Fixes:

* Add ensureMetadataSpecCompliance helper that backfills the five
  spec-required empty-state fields when iceberg-go omits them or emits
  explicit JSON null. Apply it on every code path that writes
  v*.metadata.json to S3 or returns metadata to clients
  (handlers_table create-table, handlers_commit, commit_helpers
  create-on-commit, plus MarshalJSON on LoadTableResult and
  CommitTableResponse). Real values from non-empty tables are never
  overwritten.

* Add catch-all regex entries to metadataFilePatterns accepting any
  *.avro / *.metadata.json filename composed of [A-Za-z0-9._-]. The
  Iceberg spec does not mandate filename format; the strict patterns
  remain for documentation. Metadata-directory subdirectory rejection
  and the data-file path validation are unchanged.

No upstream dependencies are forked: iceberg-go stays at v0.5.0 and
go.mod is untouched. The compliance layer can be removed once upstream
emits spec-compliant output.

Tests (all pass under `go test -race`):
- metadata_compliance_test.go: 5 cases covering missing fields,
  preserved real values, explicit null, invalid JSON, empty input.
- iceberg_layout_test.go: 3 groups (16 subtests) covering real-world
  manifest names from Flink/Spark/Iceberg, security boundary
  (subdirectories, bad extensions), and data-file regression.

* fix(s3tables/iceberg): preserve metadata key order and keep config field stable

Two small follow-ups on the spec-compliance fix:

* ensureMetadataSpecCompliance now splices missing keys in at the byte
  level just before the closing brace, so iceberg-go's struct-declared
  key order survives the backfill. The previous unmarshal/remarshal
  through map[string]json.RawMessage silently alphabetized every key in
  the document, which is spec-legal but breaks byte-equality fixtures
  and any downstream hashing of the persisted metadata. The slower
  remarshal path is kept for the rare explicit-null replacement case.

* LoadTableResult.MarshalJSON now serializes Config without omitempty,
  matching the struct field tag. The custom marshaler had silently
  flipped the tag to ,omitempty, which made the "config" key disappear
  from the response whenever s3Endpoint was unset (since
  buildFileIOConfig returned an empty but non-nil Properties map).

Tests:
- PreservesOriginalKeyOrder pins the byte-level output against
  iceberg-go's emitted shape; would have caught the alphabetization
  regression.
- EmptyObjectBackfilled covers the {} -> sentinels-only case (no
  leading comma).
- AllPresentReturnsSameBytes confirms the no-op path returns input
  bytes unchanged, with whitespace intact.
- iceberg_layout_test pins the catch-all $ anchor: metadata/file.avro.txt
  must still be rejected.

* fix(s3tables/iceberg): guard ensureMetadataSpecCompliance against top-level null

json.Unmarshal of a JSON `null` literal succeeds but leaves the map nil.
The current byte-append path no-ops gracefully on this input, but the
slow remarshal path would panic with "assignment to entry in nil map"
if the input ever combined `null` with the explicit-null detection. Add
an explicit nil-map short-circuit so the safety property is obvious
from the source, and a test that pins the contract.

* test(s3tables/iceberg): assert full byte equality in AllPresentReturnsSameBytes

The prefix check only caught a missing "{\n  " opener, so the test
would have passed even if the function silently reordered keys or
collapsed whitespace later in the document. Switch to a full string
comparison so any future regression in the no-op path is loud.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-27 13:05:41 -07:00
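The byte-level backfill from 69c84801e4 could be sketched as follows. This is an illustration only: `backfill` and `sentinel` are hypothetical names, the empty-state values other than `-1` for `current-snapshot-id` are assumptions of this example, and the explicit-`null` replacement path mentioned in the commit is not covered.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// sentinel pairs a spec-required key with the empty-state value to backfill.
type sentinel struct{ Key, Value string }

// Only -1 for current-snapshot-id comes from the commit message; the other
// empty values are assumptions made for this sketch.
var sentinels = []sentinel{
	{"current-snapshot-id", "-1"},
	{"snapshots", "[]"},
	{"snapshot-log", "[]"},
	{"metadata-log", "[]"},
	{"refs", "{}"},
}

// backfill splices missing keys in just before the closing brace, so the
// original bytes (and therefore the original key order) are left untouched.
func backfill(doc []byte) ([]byte, error) {
	var present map[string]json.RawMessage
	if err := json.Unmarshal(doc, &present); err != nil {
		return nil, err
	}
	if present == nil { // top-level JSON null: nothing to splice into
		return doc, nil
	}
	var extra [][]byte
	for _, s := range sentinels {
		if _, ok := present[s.Key]; !ok {
			extra = append(extra, []byte(fmt.Sprintf("%q:%s", s.Key, s.Value)))
		}
	}
	if len(extra) == 0 {
		return doc, nil // no-op path returns the input bytes unchanged
	}
	end := bytes.LastIndexByte(doc, '}')
	var out bytes.Buffer
	out.Write(doc[:end])
	if len(present) > 0 {
		out.WriteByte(',') // no leading comma when the object was empty
	}
	out.Write(bytes.Join(extra, []byte(",")))
	out.Write(doc[end:])
	return out.Bytes(), nil
}

func main() {
	out, err := backfill([]byte(`{"table-uuid":"x","format-version":2}`))
	fmt.Println(string(out), err)
}
```

Decoding into a map is used here only to detect which keys exist; the output is assembled from the original bytes, which is what keeps the key order stable.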
Chris Lu
21b4b81edb fix(filer/postgres): default to ON CONFLICT upsert to keep tx alive (#9709)
* fix(filer/postgres): default to ON CONFLICT upsert to keep tx alive

A KvPut from the inode-index secondary write could fail with 23505
(duplicate key) inside a rename's transaction, after which the next
statement returned 25P02 and rename surfaced to FUSE as EIO. Default
the postgres upsert query when enableUpsert=true so INSERTs are
idempotent; the enableUpsert=false escape hatch is preserved for
non-PG-compatible backends.

* fix(filer/mysql): default to ON DUPLICATE KEY UPDATE upsert

Same shape as the postgres default: when enableUpsert=true but no
upsertQuery is configured, install a sensible default so the
inode-index KvPut does not waste a duplicate-key roundtrip on every
entry write. Uses the VALUES() form so the default works on MariaDB
and MySQL >=5.7; the MySQL 8.0.19 row-alias form is left to explicit
config.

* fix(filer): default enableUpsert=true for sql stores

The default-template fallback only kicks in when enableUpsert=true,
so minimal configs that omit the flag entirely were still exposed.
Default it on for postgres/postgres2/mysql/mysql2; an explicit false
in filer.toml still wins because SetDefault only fills absent keys.
2026-05-27 12:23:30 -07:00
Chris Lu
396e3c326b fix(remote_storage/gcs): forward entry mime as ContentType (#9711)
fix(remote_storage/gcs): forward entry.Attributes.Mime as ContentType

Same gap as the S3 client: filer.remote.sync to GCS never populated
the object's ContentType, so HTML/CSS/etc. ended up stored as the GCS
default and didn't render correctly in browsers.

Mirrors the existing Azure client behavior
(weed/remote_storage/azure/azure_storage_client.go).
2026-05-27 12:21:27 -07:00
Chris Lu
9cb9699e9d fix(replication/s3sink): forward entry mime as ContentType (#9710)
* fix(replication/s3sink): forward entry.Attributes.Mime as ContentType

Same gap as the remote_storage S3 client: filer.replicate uploads via
s3manager.Uploader without populating ContentType, so replicated objects
on S3-compatible backends (e.g. Backblaze B2) store binary/octet-stream
and browsers refuse to render HTML, CSS, etc.

Pass entry.Attributes.Mime through to UploadInput.ContentType, leaving
the header unset when no Mime is recorded so the remote keeps its own
default.

* fix(replication/s3sink): nil-guard entry.Attributes when reading Mime

* Revert "fix(replication/s3sink): nil-guard entry.Attributes when reading Mime"

This reverts commit 08c3698e44.

The function already dereferences entry.Attributes.Mtime and
entry.Attributes.Md5 unconditionally on the same path, so a nil guard
on Mime alone is inconsistent and provides no real safety.
2026-05-27 12:20:51 -07:00
Chris Lu
629beda1eb fix(remote_storage/s3): forward entry mime as ContentType (#9708)
fix(remote_storage/s3): forward entry.Attributes.Mime as ContentType

filer.remote.sync was uploading every object without a Content-Type, so
S3-compatible backends (e.g. Backblaze B2) stored binary/octet-stream
and browsers refused to render HTML, CSS, etc.

Pass entry.Attributes.Mime through to UploadInput.ContentType, leaving
the header unset when no Mime is recorded so the remote keeps its own
default behavior.
2026-05-27 12:13:01 -07:00
Chris Lu
c3255b51fd fix(volume): avoid panic when URL path has a dot before the comma (#9712)
LastIndex returns -1 when a separator is missing, and when both are
present the dot can sit on either side of the comma. A path like
/vol/file.jpg,abc gives dotSep<commaSep, so path[commaSep+1:dotSep]
slices with start>end and panics. Only treat the dot as an extension
boundary when it sits after the comma.
2026-05-27 11:29:11 -07:00
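The guard from c3255b51fd can be illustrated with a small parser. This is a sketch, not the volume server's actual URL parsing: `splitFileID` is a hypothetical helper that takes the path without its leading slash.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFileID splits "vid,fid.ext" into its parts. The dot counts as an
// extension boundary only when it comes after the comma; otherwise
// s[comma+1:dot] would have start > end and panic.
func splitFileID(s string) (vid, fid, ext string, ok bool) {
	comma := strings.LastIndex(s, ",")
	if comma < 0 {
		return "", "", "", false
	}
	dot := strings.LastIndex(s, ".")
	end := len(s)
	if dot > comma {
		end = dot
		ext = s[dot:]
	}
	return s[:comma], s[comma+1 : end], ext, true
}

func main() {
	fmt.Println(splitFileID("3,01637037d6.jpg"))
	fmt.Println(splitFileID("vol/file.jpg,abc")) // dot before comma: no panic
}
```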
Chris Lu
65d557cbb0 fix(util): guard BytesToUint{16,32,64} against short input (#9713)
* fix(util): guard BytesToUint{16,32,64} against short input

length is computed as uint, so length-1 on an empty slice underflows
to MaxUint and the loop indexes b[0] on a zero-length slice. BytesToUint16
also indexed b[0]/b[1] with no length check. All call sites today gate
the slice length explicitly, so this hardens the API for new callers
rather than fixing a live crash.

Return 0 on short input, matching the existing variable-length contract.

* BytesToUint16: match variable-length contract of the 32/64 helpers

A 1-byte slice should return uint16(b[0]) rather than 0, matching how
BytesToUint32 and BytesToUint64 treat short input.
2026-05-27 11:29:01 -07:00
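A hedged sketch of the short-input guard from 65d557cbb0. The lowercase `bytesToUint64` and `bytesToUint16` below are illustrative re-implementations written for this example, assuming big-endian decoding; they are not copied from the repository.

```go
package main

import "fmt"

// bytesToUint64 decodes up to 8 big-endian bytes. Short input is treated as
// a shorter number, and empty input returns 0.
func bytesToUint64(b []byte) uint64 {
	if len(b) == 0 {
		// Guard: with an unsigned length, length-1 on an empty slice would
		// wrap to the maximum uint and the loop would index b[0].
		return 0
	}
	var v uint64
	for i := 0; i < len(b)-1; i++ {
		v += uint64(b[i])
		v <<= 8
	}
	v += uint64(b[len(b)-1])
	return v
}

// bytesToUint16 follows the same variable-length contract: a 1-byte slice
// yields uint16(b[0]) rather than 0.
func bytesToUint16(b []byte) uint16 {
	switch len(b) {
	case 0:
		return 0
	case 1:
		return uint16(b[0])
	}
	return uint16(b[0])<<8 | uint16(b[1])
}

func main() {
	fmt.Println(bytesToUint64(nil), bytesToUint64([]byte{0x01, 0x02}))
	fmt.Println(bytesToUint16([]byte{0x05}), bytesToUint16([]byte{0x01, 0x02}))
}
```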
Jaehoon Kim
d00acded8a fix(vacuum): batch all replicas in a single plugin worker task (#9702)
* fix(vacuum): batch all replicas in a single plugin worker task

The plugin worker vacuum path emitted one TaskDetectionResult per
(volume, server) replica, but the dispatcher gates duplicate tasks per
volume via ActiveTopology.HasAnyTask. The first replica's task was
created and the remaining N-1 replicas were silently dropped, so only
one replica per volume was ever vacuumed — leaving the others with all
their garbage intact.

Mirror the master built-in flow (topology.vacuumOneVolumeId →
batchVacuumVolumeCheck/Compact/Commit/Cleanup) by:

- aggregating detection metrics by VolumeID so a single task carries
  every replica in TaskParams.Sources
- having VacuumTask accept []string servers (instead of a single
  string), re-check each replica's garbage ratio at execute time to
  derive a vacuumTargets subset, and run Compact/Commit/Cleanup against
  only that subset
- updating the dispatcher (plugin_handler.Execute, register.CreateTask)
  to forward every Sources node to NewVacuumTask

* fix(vacuum): run all-replica vacuum in two phases to keep failure atomic

The prior implementation iterated Compact → Commit → Cleanup against
each replica in sequence. A Compact failure on the second replica left
the first one already committed (its active files swapped with the
.cp* files), producing replica divergence with no automatic recovery.

Split performVacuum into two phases, matching topology.vacuumOneVolumeId:

  Phase 1 — Compact all targets. If any fails, run VacuumVolumeCleanup
  on every target to drop the .cpd/.cpx/.cpldb temp files, then abort.
  No replica has swapped yet, so every replica returns to its original
  state.

  Phase 2 — Commit all targets. Best-effort, matching
  batchVacuumVolumeCommit: per-replica errors are collected and
  surfaced together. Once any replica has swapped there is no clean
  rollback, so a partial Phase 2 failure requires operator
  reconciliation.

Adds compactOne / commitOne / cleanupOne / cleanupAll helpers and
removes the old performVacuumOne.

* fix(vacuum): abort when any replica's garbage check fails

The prior check tolerated per-replica RPC errors and only failed the
task if every replica errored — partial failures were silently treated
as "ineligible" so the responding replicas would still be vacuumed.
That produces divergence the moment the unreachable replica comes
back: it still carries the original garbage while the others have
been compacted.

Match topology.batchVacuumVolumeCheck's contract instead — its return
value (errCount == 0 && len(vacuumLocationList.list) > 0) gates the
whole vacuum on every replica's check succeeding. If any replica is
unreachable or its VacuumVolumeCheck RPC errors, abort the task; the
volume will be retried on the next detection cycle once the replica
is healthy.

* fix(vacuum): guard against nil metrics and TaskSource entries

Detection's bucket-building loop dereferenced m.VolumeID without
checking m for nil. VacuumTask.Validate built sourceSet from
params.Sources without checking each entry for nil. Both paths would
panic on a malformed protobuf payload that managed to deliver a nil
slot. Skip nil entries in both loops, consistent with the existing
nil/empty filtering already done in register.CreateTask and
plugin_handler.Execute.

* test(vacuum): success path no longer calls VacuumVolumeCleanup

The plugin worker vacuum is now two-phase (Compact-all → Commit-all,
with Cleanup only invoked on Compact failure to roll back .cp* temp
files). This matches topology.vacuumOneVolumeId, where
batchVacuumVolumeCleanup runs only on the Compact-failure branch.

On a successful Commit the temp files do not linger:
  - CommitCompactVolume renames .cpd → .dat and .cpx → .idx
  - leveldb needle map renames .cpldb → .ldb (needle_map_leveldb.go)

so calling VacuumVolumeCleanup afterwards is a redundant no-op. The
prior worker code called it unconditionally and the integration test
asserted that — switch the expectation to cleanupCalls == 0 to
document the new (and master-aligned) contract.
2026-05-27 11:15:25 -07:00
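The two-phase flow described in d00acded8a can be sketched as below. The `replicaOps` struct and `vacuumAll` function are hypothetical; they stand in for the per-replica RPC helpers the commit names (compactOne, commitOne, cleanupOne, cleanupAll), whose real signatures are not shown in the log.

```go
package main

import (
	"errors"
	"fmt"
)

// replicaOps bundles the three per-replica operations as plain callbacks so
// the control flow can be shown without any RPC plumbing.
type replicaOps struct {
	compact func(server string) error
	commit  func(server string) error
	cleanup func(server string) error
}

var errBoom = errors.New("boom")

// vacuumAll runs Compact on every target first and only then Commit.
func vacuumAll(servers []string, ops replicaOps) error {
	// Phase 1: compact all. Nothing has been swapped yet, so a failure can
	// be rolled back by cleaning up temp files on every target.
	for _, s := range servers {
		if err := ops.compact(s); err != nil {
			for _, t := range servers {
				_ = ops.cleanup(t) // best-effort rollback
			}
			return fmt.Errorf("compact %s: %w", s, err)
		}
	}
	// Phase 2: commit all, best-effort. Errors are collected and returned
	// together because there is no clean rollback once a replica swapped.
	var errs []error
	for _, s := range servers {
		if err := ops.commit(s); err != nil {
			errs = append(errs, fmt.Errorf("commit %s: %w", s, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	ops := replicaOps{
		compact: func(s string) error {
			if s == "b" {
				return errBoom
			}
			return nil
		},
		commit:  func(string) error { return nil },
		cleanup: func(s string) error { fmt.Println("cleanup", s); return nil },
	}
	fmt.Println(vacuumAll([]string{"a", "b"}, ops))
}
```

Interleaving Compact and Commit per replica would leave earlier replicas committed when a later Compact fails, which is the divergence the commit set out to remove.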
Chris Lu
cd68313929 fix(filer.sync): resolve manifest chunks against source filer (#9705)
* fix(filer.sync): resolve manifest chunks against source filer

`UpdateEntry` was passing `filer.LookupFn(fs)` — the sink filer client —
into `compareChunks`. But `oldEntry`/`newEntry` chunks come from the
source cluster, so manifest resolution must hit the source filer's
volume servers. With two clusters that have overlapping volume IDs
(common once they grow past a few hundred volumes), the sink lookup
returns its own volume's URLs and the fetch 404s on the source's
fileKey:

  compare chunks error: fail to read manifest 631,0babe...: 404 Not Found

The 404 aborts the diff, the manifest chunk never gets replicated, and
the target ends up with whatever flat chunks happened to land from
earlier partial syncs — visible as `SIZE_MISMATCH` in filer.sync.verify
on files large enough to use chunk manifests (~150 GB+ in practice).

Only the manifest path was wrong; flat-chunk reads in `fetchAndWrite`
already use `fs.filerSource.ReadPart`.

* trim comment

* test(filer.sync): regression test for source-filer manifest lookup

Two recording filer gRPC servers stand in for source and sink. Driving
UpdateEntry with a manifest chunk and observing which one receives
LookupVolume proves compareChunks routes source-side lookups through
fs.filerSource, not fs. Reverting the fix flips the call onto the sink
filer and fails the assertion.

* drop test
2026-05-27 10:23:29 -07:00
Jaehoon Kim
675020b342 fix(filer.sync): validate chunk size in FilerSink to prevent 0-byte propagation (#9701)
* fix(filer.sync): validate chunk size in FilerSink to prevent 0-byte propagation

FilerSink.fetchAndWrite previously trusted the source response and the
upload result blindly: a 200 OK / Content-Length: 0 reply from a broken
source volume was happily uploaded as a 0-byte needle to the destination,
and the destination filer metadata was then written with the source
chunk size. The result was permanent silent corruption -- ls shows the
file at its original size but reads fail with EIO.

Add two cheap defenses inside fetchAndWrite:

  1. After assembling fullData, compare its length against sourceChunk.Size.
  2. After a successful upload, compare uploadResult.Size against
     sourceChunk.Size.

Both checks wrap a new sentinel errChunkSizeMismatch that the retry
callback recognizes and refuses to retry -- needle.size=0 on disk is a
persistent state, not a transient network error, so the sync should stop
loudly on the affected entry instead of looping or, worse, silently
propagating it.
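
A minimal sketch of the pre-upload check and the no-retry rule described above. Only `errChunkSizeMismatch` is named in this commit; `validateChunkSize`, `shouldRetry`, and the file IDs are hypothetical names used for illustration, not the actual FilerSink code.

```go
package main

import (
	"errors"
	"fmt"
)

// errChunkSizeMismatch is the sentinel named in the commit message.
var errChunkSizeMismatch = errors.New("chunk size mismatch")

// validateChunkSize is a hypothetical helper: it compares the number of bytes
// actually read from the source against the size recorded for the chunk.
func validateChunkSize(fileID string, got int, want uint64) error {
	if uint64(got) != want {
		return fmt.Errorf("chunk %s: read %d bytes, want %d: %w",
			fileID, got, want, errChunkSizeMismatch)
	}
	return nil
}

// shouldRetry is a hypothetical retry predicate: a size mismatch is treated
// as a persistent state, so it is never retried.
func shouldRetry(err error) bool {
	return err != nil && !errors.Is(err, errChunkSizeMismatch)
}

func main() {
	// A 200 OK with an empty body for a chunk recorded as 4096 bytes.
	err := validateChunkSize("3,01", 0, 4096)
	fmt.Println(err)
	fmt.Println("retry:", shouldRetry(err))

	// A legitimately empty chunk still passes.
	fmt.Println("empty chunk ok:", validateChunkSize("3,02", 0, 0) == nil)
}
```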

Tests:

  * TestValidateReplicatedChunkSize -- table-driven coverage of healthy,
    legitimately empty, zero-byte read, short read, and truncated upload
    cases.
  * TestFetchAndWriteRejectsZeroByteSource -- end-to-end: an httptest
    source that returns 200 OK with an empty body must cause fetchAndWrite
    to return errChunkSizeMismatch after exactly one source hit (fail
    fast, no retry storm).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* filer.sync: bubble size-mismatch past CreateEntry/UpdateEntry

Three follow-ups on the chunk-size validation:

- Use %w in replicateOneChunk so the errChunkSizeMismatch sentinel
  survives the wrap and reaches errors.Is callers up the stack.
- In FilerSink.CreateEntry/UpdateEntry, surface errChunkSizeMismatch
  instead of warning-and-nil. Other errors (deleted source chunk,
  transient network) keep the existing swallow so a hiccup doesn't
  stall the stream.
- Drop validateReplicatedUploadSize: uploadResult.Size is set
  client-side from the same len(fullData) we already validated
  pre-upload, so the second check can't fail.

Test: scope the RetryWaitTime override to the one test that needs it,
add a regression that locks in the errors.Is chain through
replicateChunks.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-26 20:47:53 -07:00
Chris Lu
7919cc7ca0 wdclient: prune filers dropped from master discovery (#9699)
* wdclient: prune filers dropped from master discovery

Filer discovery only appended new addresses; it never removed ones that
disappeared from the master snapshot. After a K8s filer pod rolled to a
new IP, the old address lingered in filerAddresses and was retried every
resetTimeout window, stalling S3 uploads on i/o timeouts.

Treat the master snapshot as authoritative: keep survivors (preserving
their health counters and the active round-robin index), append newcomers
with fresh health, drop the rest. Empty snapshots are still ignored so a
transient master outage can't wipe the list.

* wdclient: skip discovery snapshots with no usable addresses

Guard against the defensive case where master returns updates whose
addresses are all empty; reconciling against an empty discovered set
would prune every filer.
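
The reconciliation described in this commit could look roughly like the sketch below. The `filerHealth` type, the `reconcile` signature, and the addresses are assumptions made for illustration; the real wdclient data structures may differ.

```go
package main

import "fmt"

// filerHealth is a hypothetical stand-in for the per-filer health counters.
type filerHealth struct{ failCount int }

// reconcile treats the discovered snapshot as authoritative: survivors keep
// their health and the active index follows the active address, newcomers get
// fresh health, and everything else is dropped. A snapshot with no usable
// (non-empty) address is ignored entirely.
func reconcile(addrs []string, health []*filerHealth, active int, discovered []string) ([]string, []*filerHealth, int) {
	inSnapshot := make(map[string]bool)
	var usable []string
	for _, d := range discovered {
		if d == "" || inSnapshot[d] {
			continue
		}
		inSnapshot[d] = true
		usable = append(usable, d)
	}
	if len(usable) == 0 {
		return addrs, health, active // nothing to reconcile against
	}

	activeAddr := ""
	if active >= 0 && active < len(addrs) {
		activeAddr = addrs[active]
	}

	kept := make(map[string]bool)
	var newAddrs []string
	var newHealth []*filerHealth
	newActive := 0 // falls back to the first entry if the active filer was pruned
	for i, a := range addrs {
		if inSnapshot[a] && !kept[a] {
			if a == activeAddr {
				newActive = len(newAddrs)
			}
			kept[a] = true
			newAddrs = append(newAddrs, a)
			newHealth = append(newHealth, health[i])
		}
	}
	for _, d := range usable {
		if !kept[d] {
			kept[d] = true
			newAddrs = append(newAddrs, d)
			newHealth = append(newHealth, &filerHealth{})
		}
	}
	return newAddrs, newHealth, newActive
}

func main() {
	addrs := []string{"10.0.0.1:8888", "10.0.0.2:8888"}
	health := []*filerHealth{{failCount: 5}, {failCount: 3}}
	newAddrs, newHealth, active := reconcile(addrs, health, 1,
		[]string{"10.0.0.2:8888", "10.0.0.9:8888", ""})
	fmt.Println(newAddrs, newHealth[0].failCount, active)
}
```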
2026-05-26 17:49:18 -07:00
Chris Lu
1e91a99f79 fix(volume): avoid nil-deref when needle map loader errors (#9694) (#9697)
* fix(volume): avoid nil-deref when needle map loader errors

A corrupt .idx whose size is not a multiple of NeedleMapEntrySize sends
the read-only load path into NewSortedFileNeedleMap, which returns
(*SortedFileNeedleMap)(nil) when reverseWalkIndexFile rejects the file.
The multi-value assignment `v.nm, err = NewSortedFileNeedleMap(...)`
parks that typed-nil pointer in the v.nm NeedleMapper interface, so the
subsequent `v.nm != nil` guard still passes — and the post-load
MaxNeedleEnd structural check dispatches through the promoted mapMetric
accessor on a nil receiver, segfaulting the whole volume server at
load time.

Reset v.nm explicitly after every loader failure so the interface is
truly nil, and skip the MaxNeedleEnd check when err is non-nil since
the value would come from a partial walk anyway. NewLevelDbNeedleMap
has the same typed-nil-on-error shape and is fixed the same way.
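
The typed-nil behavior at the heart of this fix is a general Go property and can be reproduced in isolation. In the sketch below, `needleMapper`, `sortedMap`, `loadBuggy`, and `loadFixed` are simplified hypothetical stand-ins, not the actual volume code.

```go
package main

import (
	"errors"
	"fmt"
)

// needleMapper is a simplified stand-in for the NeedleMapper interface.
type needleMapper interface{ MaxEnd() int64 }

type sortedMap struct{ maxEnd int64 }

// MaxEnd dereferences the receiver, so calling it on a nil *sortedMap panics.
func (m *sortedMap) MaxEnd() int64 { return m.maxEnd }

// newSortedMap returns a nil *sortedMap together with an error on failure.
func newSortedMap(corrupt bool) (*sortedMap, error) {
	if corrupt {
		return nil, errors.New("index size is not a multiple of the entry size")
	}
	return &sortedMap{maxEnd: 42}, nil
}

// loadBuggy mirrors the pre-fix shape: the nil pointer is stored in the
// interface, which then holds a type and therefore compares non-nil.
func loadBuggy(corrupt bool) (needleMapper, error) {
	var nm needleMapper
	var err error
	nm, err = newSortedMap(corrupt)
	return nm, err
}

// loadFixed resets the interface to a true nil after a loader failure.
func loadFixed(corrupt bool) (needleMapper, error) {
	nm, err := loadBuggy(corrupt)
	if err != nil {
		return nil, err
	}
	return nm, nil
}

func main() {
	nm, err := loadBuggy(true)
	fmt.Println("buggy: nm != nil:", nm != nil, "err != nil:", err != nil)

	nm, _ = loadFixed(true)
	fmt.Println("fixed: nm != nil:", nm != nil)
}
```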

* fix(volume): close indexFile when needle map load errors

Before the fix, the typed-nil v.nm path either leaked indexFile silently
(SortedFileNeedleMap.Close had a nil-receiver early return) or crashed
(LevelDbNeedleMap.Close had no such guard). With v.nm cleared to nil
on error, the defer cleanup no longer calls Close at all, so the
LoadCompactNeedleMap success-with-error path now also leaks indexFile.
Close indexFile explicitly on each loader error to keep ownership
balanced.

* trim comments
2026-05-26 16:56:49 -07:00
Chris Lu
4f17c6661a test: keep AllocateMiniPorts off weed mini default ports
Random allocation could pick 33646 = admin.port (23646) + GrpcPortOffset.
weed mini reserves that as Admin's gRPC port even when the test only
overrides Master/Filer/S3/Iceberg, so the explicit Filer flag failed
with "reserved for gRPC calculation" and TestRisingWaveIcebergCatalog
flaked. Pre-seed the reserved set with every mini default HTTP port
plus its +10000 offset so a random pick (or its own gRPC offset) cannot
land on a service the caller left at its default.
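
A rough sketch of the pre-seeding idea, assuming a +10000 gRPC offset as stated above. `reservedSet` and `usable` are hypothetical helpers, and only the admin port 23646 is taken from this message; the full list of mini default ports is not reproduced here.

```go
package main

import "fmt"

// grpcPortOffset is the +10000 offset mentioned in the commit message.
const grpcPortOffset = 10000

// reservedSet marks every default HTTP port and its derived gRPC port.
func reservedSet(defaultHTTPPorts []int) map[int]bool {
	reserved := make(map[int]bool)
	for _, p := range defaultHTTPPorts {
		reserved[p] = true
		reserved[p+grpcPortOffset] = true
	}
	return reserved
}

// usable rejects a candidate if it, or the gRPC port derived from it,
// collides with a reserved port.
func usable(port int, reserved map[int]bool) bool {
	return !reserved[port] && !reserved[port+grpcPortOffset]
}

func main() {
	reserved := reservedSet([]int{23646}) // admin.port default from the message
	for _, candidate := range []int{33646, 13646, 40000} {
		fmt.Println(candidate, usable(candidate, reserved))
	}
}
```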
2026-05-26 16:48:46 -07:00
Chris Lu
29eec2f111 master: timeout AllocateVolume/DeleteVolume and defer growRequest cleanup (#9698)
* master: timeout AllocateVolume/DeleteVolume and defer growRequest cleanup

The volume-grow goroutine clears the layout's growRequest flag only after
ms.DoAutomaticVolumeGrow returns, and AllocateVolume / DeleteVolume were
calling the volume-server RPC with context.Background(). A volume server
that hung mid-call (heavy I/O, stuck lock, dead peer behind a stable VIP)
would park the goroutine forever, leaving growRequest=true and silently
blocking every subsequent automatic grow for that layout — Assign retries
then drained their 30s budget with "context deadline exceeded" until the
operator restarted the master.

Bound both RPCs with a 5-minute deadline (creating/removing a volume is
sub-second normally, generous for contended disks) and move the flag
clear + filter delete into defers so a panic in DoAutomaticVolumeGrow
doesn't strand the layout either.

* allocate_volume: shorten timeout to 1m for faster recovery

Volume create/delete is sub-second under normal conditions; 1 minute is
generous even on a contended disk and clears the growRequest flag well
before too many client Assigns drain their own retry budget.
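
The two mechanisms in this commit, a bounded RPC context and a deferred flag clear, can be sketched together. The `layout` type, `allocate`, `grow`, and the short `demoTimeout` are assumptions for illustration; the real code uses the 1-minute deadline described above.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// demoTimeout is deliberately short so the example finishes quickly.
const demoTimeout = 20 * time.Millisecond

// layout is a hypothetical stand-in carrying only the growRequest flag.
type layout struct{ growRequest atomic.Bool }

// allocate stands in for the volume-server RPC. When hang is true it blocks
// until the context expires, like a stuck volume server.
func allocate(ctx context.Context, hang bool) error {
	if !hang {
		return nil
	}
	<-ctx.Done()
	return ctx.Err()
}

// grow sets the flag, bounds the RPC with a deadline, and clears the flag in
// a defer so that an error or a panic cannot leave it set.
func grow(l *layout, timeout time.Duration, hang bool) error {
	l.growRequest.Store(true)
	defer l.growRequest.Store(false)

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return allocate(ctx, hang)
}

func main() {
	l := &layout{}
	err := grow(l, demoTimeout, true)
	fmt.Println("error:", err)
	fmt.Println("growRequest still set:", l.growRequest.Load())
}
```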

* trim comments
2026-05-26 16:26:21 -07:00
Chris Lu
8fd7c524c7 redis2: apply keyPrefix in KV methods (#9693)
KvPut/KvGet/KvDelete bypassed store.getKey(), so filer.store.id and
other KV writes landed outside the configured prefix. With a Redis
ACL restricted to the prefix this errored with NOPERM; without the
ACL the keys silently lived in the wrong namespace.
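
As an illustration of the fix, the sketch below routes every KV method through a `getKey` helper. It is an assumption-laden stand-in: an in-memory map replaces Redis, and `getKey` is modeled as plain prefix concatenation, which may not match the real store.

```go
package main

import "fmt"

// prefixedStore is a hypothetical in-memory stand-in for the redis2 store.
type prefixedStore struct {
	keyPrefix string
	data      map[string][]byte
}

func newPrefixedStore(prefix string) *prefixedStore {
	return &prefixedStore{keyPrefix: prefix, data: make(map[string][]byte)}
}

// getKey is assumed here to prepend the configured prefix.
func (s *prefixedStore) getKey(key string) string { return s.keyPrefix + key }

// The KV methods all go through getKey, so keys stay inside the prefix.
func (s *prefixedStore) KvPut(key, value []byte) { s.data[s.getKey(string(key))] = value }

func (s *prefixedStore) KvGet(key []byte) ([]byte, bool) {
	v, ok := s.data[s.getKey(string(key))]
	return v, ok
}

func (s *prefixedStore) KvDelete(key []byte) { delete(s.data, s.getKey(string(key))) }

func main() {
	s := newPrefixedStore("seaweed:")
	s.KvPut([]byte("filer.store.id"), []byte("abc"))
	for k := range s.data {
		fmt.Println("stored under:", k)
	}
}
```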
2026-05-26 12:49:31 -07:00
Chris Lu
77dcb20a74 writeJson: drop unused JSONP branch (#9686)
* writeJson: drop unused JSONP branch

No in-tree caller uses ?callback=. Always serve application/json
with X-Content-Type-Options: nosniff.

* seaweed-volume: drop unused JSONP branch

Mirror Go: always serve application/json with
X-Content-Type-Options: nosniff.

* writeJson: drop unreachable StatusNotModified check

bodyAllowedForStatus already returns early for 304.

* test/volume_server: rename and rewrite JSONP test to assert callback is ignored

CI: /status?callback=myFunc now returns plain application/json
with X-Content-Type-Options: nosniff.
2026-05-26 01:05:07 -07:00
Chris Lu
dd1b428789 s3,iceberg: reject .. in URL path vars (#9687)
* s3,iceberg: reject `..`/NUL in URL path vars

Both gateway routers use mux.NewRouter().SkipClean(true), so a request like
`GET /bucket-A/../evil-bucket/key` survives routing as bucket=bucket-A,
object=../evil-bucket/key. The captured key is then joined into a filer path;
util.JoinPath / path.Join collapse the `..` server-side and the read lands in
evil-bucket. With auth on, IAM still authorizes against bucket-A (the mux var),
so policy is evaluated against the wrong target.

Add a middleware on the S3 bucket subrouter and the Iceberg REST router that
rejects any `.`, `..`, NUL, or — for single-segment slots — embedded slash in
the captured path vars before any handler runs. NormalizeObjectKey already
folds `\` to `/` and decoding happens in mux, so `%2e%2e` and `..\` are caught.

* s3,iceberg: reject empty captured vars and empty namespace parts

Comma-ok the var lookup so we only check captured slots, then treat an empty
captured value as a rejection on its own — downstream path.Join would
otherwise collapse it and let the next segment pick the bucket.
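
The per-variable check might be sketched as follows. `validPathVar` is a hypothetical helper that takes an already decoded and normalized value; the mux wiring, the middleware itself, and the Iceberg namespace handling are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// validPathVar reports whether a captured path variable is safe to join into
// a filer path. The value is assumed to be URL-decoded, with `\` already
// folded to `/`. singleSegment marks slots such as a bucket name that must
// not contain a slash.
func validPathVar(value string, singleSegment bool) bool {
	if value == "" || strings.ContainsRune(value, 0) {
		return false // empty capture or embedded NUL
	}
	if singleSegment && strings.Contains(value, "/") {
		return false
	}
	for _, segment := range strings.Split(value, "/") {
		if segment == "." || segment == ".." {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(validPathVar("../evil-bucket/key", false))
	fmt.Println(validPathVar("photos/2026/cat.jpg", false))
	fmt.Println(validPathVar("bucket-A", true))
}
```

Only whole `.` and `..` segments are rejected, so an object key such as `dir/file..txt` is still accepted.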

For iceberg, also reject empty parts after splitting the namespace on \x1F so
leading/trailing/consecutive unit separators (which parseNamespace silently
folds out) don't let distinct route values collapse to the same parsed
namespace.

Register loggingMiddleware before validateRequestPath on the iceberg router
so rejected requests still produce an audit-log line.
2026-05-26 01:04:59 -07:00
Chris Lu
1355c7a102 4.29 4.29 2026-05-25 22:41:25 -07:00
dependabot[bot]
f72c5ec5d3 build(deps): bump github.com/go-sql-driver/mysql from 1.9.3 to 1.10.0 (#9682)
Bumps [github.com/go-sql-driver/mysql](https://github.com/go-sql-driver/mysql) from 1.9.3 to 1.10.0.
- [Release notes](https://github.com/go-sql-driver/mysql/releases)
- [Changelog](https://github.com/go-sql-driver/mysql/blob/master/CHANGELOG.md)
- [Commits](https://github.com/go-sql-driver/mysql/compare/v1.9.3...v1.10.0)

---
updated-dependencies:
- dependency-name: github.com/go-sql-driver/mysql
  dependency-version: 1.10.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-25 16:37:47 -07:00
dependabot[bot]
96f521addc build(deps): bump github.com/linxGnu/grocksdb from 1.10.7 to 1.10.8 (#9683)
Bumps [github.com/linxGnu/grocksdb](https://github.com/linxGnu/grocksdb) from 1.10.7 to 1.10.8.
- [Release notes](https://github.com/linxGnu/grocksdb/releases)
- [Commits](https://github.com/linxGnu/grocksdb/compare/v1.10.7...v1.10.8)

---
updated-dependencies:
- dependency-name: github.com/linxGnu/grocksdb
  dependency-version: 1.10.8
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-25 16:22:00 -07:00
dependabot[bot]
584da4cd10 build(deps): bump golang.org/x/crypto from 0.51.0 to 0.52.0 (#9681)
Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.51.0 to 0.52.0.
- [Commits](https://github.com/golang/crypto/compare/v0.51.0...v0.52.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-version: 0.52.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-25 16:21:44 -07:00
dependabot[bot]
56b9df937c build(deps): bump golang.org/x/sys from 0.44.0 to 0.45.0 (#9680)
Bumps [golang.org/x/sys](https://github.com/golang/sys) from 0.44.0 to 0.45.0.
- [Commits](https://github.com/golang/sys/compare/v0.44.0...v0.45.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sys
  dependency-version: 0.45.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-25 16:21:36 -07:00
dependabot[bot]
e8ed043d2b build(deps): bump go.etcd.io/etcd/client/pkg/v3 from 3.6.10 to 3.6.11 (#9679)
Bumps [go.etcd.io/etcd/client/pkg/v3](https://github.com/etcd-io/etcd) from 3.6.10 to 3.6.11.
- [Release notes](https://github.com/etcd-io/etcd/releases)
- [Commits](https://github.com/etcd-io/etcd/compare/v3.6.10...v3.6.11)

---
updated-dependencies:
- dependency-name: go.etcd.io/etcd/client/pkg/v3
  dependency-version: 3.6.11
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-25 16:21:28 -07:00
dependabot[bot]
502fef6b50 build(deps): bump docker/login-action from 4.1.0 to 4.2.0 (#9678)
Bumps [docker/login-action](https://github.com/docker/login-action) from 4.1.0 to 4.2.0.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](https://github.com/docker/login-action/compare/v4.1.0...v4.2.0)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-version: 4.2.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-25 16:21:20 -07:00
Chris Lu
b21c263328 test/fuse_dlm: cross-mount POSIX locks + survival across a ring change (#9677)
Adds two FUSE integration tests on the existing dlm cluster harness (the
-dlm mounts route advisory locks to the owner filer):

- TestPosixLockCrossMount: an flock taken on one mount blocks the other,
  and is grantable after release — the routed-to-owner path end to end.
- TestPosixLockSurvivesFilerLoss: hold flocks on many files, stop filer1
  so keys it owned migrate to filer0; after the ring settles and the
  holding mount re-asserts, every lock is still honored. Asserts only the
  settled state; the transient migration window is unit-covered.

Locks are taken on read-only fds so the -dlm whole-file write lock (a
different mechanism, held until close) isn't involved. Skipped on
non-Linux: only Linux forwards advisory locks (SETLK) to the FUSE server;
macFUSE handles flock in-kernel per mount.
2026-05-25 16:20:23 -07:00
Chris Lu
c9868dcf2f filer/posixlock: remove the unused lock-set serde (#9676)
The codec (Set.Marshal/Unmarshal) and its posix_lock.proto were built to
let the lock set ride in an inode's entry metadata, but the authority is
in-memory and ownership handoff/restart is handled by mounts re-asserting
their held locks over the RPC — neither serializes the set. Nothing calls
the serde outside its own tests, so drop it (codec, proto, generated pb,
Makefile). The in-memory Set/Manager are unchanged.
2026-05-25 13:15:19 -07:00
Chris Lu
85ca3cb757 filer: warm-up + fail-closed cooling for POSIX locks on owner (re)start (#9673)
After a (re)start the owner defers would-be grants for posixLockWarmup
while mounts re-assert, trusting only locally-visible conflicts, so it
does not double-grant from empty state; a deferred grant is a retry for
SetLkw and EAGAIN for non-blocking SetLk, never a spurious grant. Cooling
now fail-closes: if the previous owner is unreachable during a ring
change, defer rather than risk a double-grant. readyAt is atomic so the
handler reads it without locking.
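
One way to picture the warm-up gate is the sketch below. `lockOwner`, `restart`, `decide`, and the demo values are hypothetical; the string results stand in for the real reply codes, and the cooling probe to the previous owner is not modeled.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Demo values; the real warm-up duration (posixLockWarmup) is not given here.
var demoStart = time.Unix(1000, 0)

const demoWarmup = 10 * time.Second

// lockOwner keeps readyAt as unix nanoseconds in an atomic, so request
// handlers can read it without taking a lock.
type lockOwner struct{ readyAt atomic.Int64 }

// restart records when the warm-up window ends.
func (o *lockOwner) restart(now time.Time, warmup time.Duration) {
	o.readyAt.Store(now.Add(warmup).UnixNano())
}

// decide trusts locally visible conflicts at any time, but defers would-be
// grants during warm-up: a retry for blocking requests (SetLkw), EAGAIN for
// non-blocking ones (SetLk).
func (o *lockOwner) decide(now time.Time, localConflict, blocking bool) string {
	if localConflict {
		return "conflict"
	}
	if now.UnixNano() < o.readyAt.Load() {
		if blocking {
			return "retry"
		}
		return "EAGAIN"
	}
	return "grant"
}

func main() {
	o := &lockOwner{}
	o.restart(demoStart, demoWarmup)
	fmt.Println(o.decide(demoStart, false, false))
	fmt.Println(o.decide(demoStart, false, true))
	fmt.Println(o.decide(demoStart.Add(demoWarmup), false, false))
}
```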
2026-05-25 13:14:05 -07:00
Chris Lu
a3c0baa9b0 filer: cooling-off dual-read for POSIX locks during ring changes (#9672)
While the ring changed within the last snapshot interval, a fresh owner
asks the key's previous owner (LockRing.PriorOwner) whether it still
holds a conflicting lock before granting TRY_LOCK or answering GET_LK, so
it does not double-grant before re-assertion rebuilds its local state.
The probe is marked cooling_probe so the previous owner answers from
local state without recursing. PriorOwner uses the snapshot's prebuilt
ring rather than rebuilding a hash ring per call.
2026-05-25 12:34:15 -07:00
7y-9
881226a81b fix: avoid rclone nil close panics (#9674)
* fix: avoid rclone nil close panics
2026-05-25 09:53:45 -07:00
Chris Lu
f8caaa4464 mount,filer: re-assert POSIX locks via keepalive (ownership migration + restart) (#9668)
* mount: renew POSIX lock leases via keepalive

The mount tracks the inode keys it holds locks on and a background loop
renews its session lease (KEEP_ALIVE) with each key's owner filer every
5s, within the filer's 15s TTL. A live mount is never reaped; a dead one
stops renewing and owners reclaim its locks. Tracking is a superset:
holds are added on grant and dropped only on owner release, so a
still-held lock is never under-renewed.

* mount,filer: re-assert held POSIX locks via keepalive

The owner filer holds POSIX advisory locks as in-memory soft state, so a key's
owner change (ring rebalance) or an owner restart lost or stranded them: the new
or restarted owner was blind to existing holders and would double-grant.

Make the keepalive carry the mount's held lock ranges per key. The mount mirrors
its own granted locks (posixOwn), and each tick re-asserts them to the key's
current owner, which rebuilds that session's locks from the assertion, making
the state self-healing after a takeover or restart. The owner arbitrates re-asserted locks
against other sessions so it never double-grants; a lock that lost a migration
race is reported, not forced. A bare keepalive (no ranges) still just renews.
2026-05-25 01:02:45 -07:00
Chris Lu
c97b69f8a4 filer: session lease + reaping for POSIX locks (#9666)
* filer: session lease + reaping for POSIX locks

A mount renews its session lease by keepalive (new KEEP_ALIVE op); the
owner filer records last-seen per session and a background sweeper reaps
the locks of leased sessions that stop renewing — a dead or partitioned
mount. Only sessions that have renewed are leased, so this is inert until
mounts run with -posixLock.
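
The lease bookkeeping can be sketched as a last-seen table plus a sweep. `leaseTable`, its methods, and the session names are hypothetical; the 15s value mirrors the filer TTL mentioned in the related keepalive commit, and lock release on reaping is reduced to returning the expired session ids.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

var demoStart = time.Unix(1000, 0)

const demoTTL = 15 * time.Second

// leaseTable records when each session last renewed. Sessions that never
// sent a keepalive are absent from the map and are therefore never reaped.
type leaseTable struct {
	mu       sync.Mutex
	ttl      time.Duration
	lastSeen map[string]time.Time
}

func newLeaseTable(ttl time.Duration) *leaseTable {
	return &leaseTable{ttl: ttl, lastSeen: make(map[string]time.Time)}
}

// keepAlive renews (or starts) the lease for a session.
func (t *leaseTable) keepAlive(session string, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[session] = now
}

// reap removes sessions whose lease has expired and returns their ids,
// sorted for stable output. A real sweeper would release their locks here.
func (t *leaseTable) reap(now time.Time) []string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var expired []string
	for session, seen := range t.lastSeen {
		if now.Sub(seen) > t.ttl {
			expired = append(expired, session)
			delete(t.lastSeen, session)
		}
	}
	sort.Strings(expired)
	return expired
}

func main() {
	lt := newLeaseTable(demoTTL)
	lt.keepAlive("mount-1", demoStart)
	lt.keepAlive("mount-2", demoStart)
	lt.keepAlive("mount-1", demoStart.Add(10*time.Second)) // mount-2 stops renewing
	fmt.Println(lt.reap(demoStart.Add(20 * time.Second)))
}
```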

* mount: route POSIX advisory locks to the owner filer (-posixLock) (#9665)

mount: route POSIX advisory locks to the owner filer under -dlm

With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to
the inode's owner filer via the PosixLock RPC instead of the local table,
so flock/fcntl are honored across mounts. Advisory locking rides the same
switch as whole-file write coordination — and is therefore off under
writeback cache, which implies single-writer. The mount calls its filer
and relies on filer-side forwarding to reach the owner. Keys are the inode
identity (HardLinkId else path); SetLkw is client-side polling with the
FUSE cancel channel (no server wait queue); a per-mount session id
namespaces owners; a local hint avoids a release RPC on every close.

* mount,filer: bound posix-lock release RPCs and stop the reaper on shutdown

The unlock/release RPCs run off the syscall path (close/flush) and used
context.Background() with no deadline, so a slow or unreachable filer could
hang close() indefinitely; bound them to 5s (they still aren't cancelled by
an interrupt). The lease-reaping sweeper now selects on a stop channel that
FilerServer.Shutdown closes, instead of looping for the process lifetime.
2026-05-25 00:00:59 -07:00
Chris Lu
3976264391 mount: keep the posix-lock hint until the release RPC succeeds (#9670)
routedReleasePosixOwner dropped the local owner hint before sending
RELEASE_POSIX_OWNER, so a transient RPC failure left the lock held on the
owner filer with no local record to retry from — stranded until session-lease
reaping. Drop the hint only after a successful release; on failure keep it so
a later flush retries, with lease reaping as the backstop.
2026-05-25 00:00:34 -07:00
Chris Lu
3481f13f54 mount: route POSIX advisory locks to the owner filer under -dlm (#9669)
With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to
the inode's owner filer via the PosixLock RPC instead of the local table,
so flock/fcntl are honored across mounts. Advisory locking rides the same
switch as whole-file write coordination — and is therefore off under
writeback cache, which implies single-writer. Keys are the inode identity
(HardLinkId else path); SetLkw is client-side polling with the FUSE cancel
channel (no server wait queue); a per-mount session id namespaces owners;
a local hint avoids a release RPC on every close. Background unlock/release
RPCs are bounded so a stuck filer can't hang close().
2026-05-24 23:56:37 -07:00