seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-07-26 18:13:24 +00:00

Author	SHA1	Message	Date
7y-9 and GitHub	bbbc3925ec	fix: validate s3 ownership controls rule (#9684)	2026-05-27 14:41:10 -07:00
qzhello GitHub Chris Lu	69c84801e4	fix(s3tables/iceberg): make metadata spec-compliant and accept real-world manifest names (#9703 ) * fix(s3tables/iceberg): make metadata spec-compliant and accept real-world manifest names Two related issues prevent SeaweedFS S3 Tables from interoperating with strict Iceberg clients (Java/Spark/Flink/Trino): 1. iceberg-go v0.5.0 serializes empty TableMetadata state by dropping keys via `omitempty` on optional pointer/slice fields. The Iceberg table spec, however, requires `current-snapshot-id`, `snapshots`, `snapshot-log`, `metadata-log`, and `refs` to be present even when empty (`current-snapshot-id` must be -1 for a table with no snapshots). Java's TableMetadataParser uses JsonUtil.getLong on `current-snapshot-id` and throws "Cannot parse missing long current-snapshot-id" against responses produced by this server. 2. The Iceberg layout validator only accepts manifest filenames that match Iceberg's internal naming (`{uuid}-m{n}.avro`, `snap-{n}-{n}-{uuid}.avro`). Real writers — notably Flink's sink — emit manifests like `{flink-job-id}-{checkpoint}-{operator-id}-{n}.avro`, which the validator rejects with 403, breaking INSERT commits. Fixes: * Add ensureMetadataSpecCompliance helper that backfills the five spec-required empty-state fields when iceberg-go omits them or emits explicit JSON null. Apply it on every code path that writes v.metadata.json to S3 or returns metadata to clients (handlers_table create-table, handlers_commit, commit_helpers create-on-commit, plus MarshalJSON on LoadTableResult and CommitTableResponse). Real values from non-empty tables are never overwritten. Add catch-all regex entries to metadataFilePatterns accepting any .avro / .metadata.json filename composed of [A-Za-z0-9._-]. The Iceberg spec does not mandate filename format; the strict patterns remain for documentation. Metadata-directory subdirectory rejection and the data-file path validation are unchanged. 
No upstream dependencies are forked: iceberg-go stays at v0.5.0 and go.mod is untouched. The compliance layer can be removed once upstream emits spec-compliant output. Tests (all pass under `go test -race`): - metadata_compliance_test.go: 5 cases covering missing fields, preserved real values, explicit null, invalid JSON, empty input. - iceberg_layout_test.go: 3 groups (16 subtests) covering real-world manifest names from Flink/Spark/Iceberg, security boundary (subdirectories, bad extensions), and data-file regression. * fix(s3tables/iceberg): preserve metadata key order and keep config field stable Two small follow-ups on the spec-compliance fix: * ensureMetadataSpecCompliance now splices missing keys in at the byte level just before the closing brace, so iceberg-go's struct-declared key order survives the backfill. The previous unmarshal/remarshal through map[string]json.RawMessage silently alphabetized every key in the document, which is spec-legal but breaks byte-equality fixtures and any downstream hashing of the persisted metadata. The slower remarshal path is kept for the rare explicit-null replacement case. * LoadTableResult.MarshalJSON now serializes Config without omitempty, matching the struct field tag. The custom marshaler had silently flipped the tag to ,omitempty, which made the "config" key disappear from the response whenever s3Endpoint was unset (since buildFileIOConfig returned an empty but non-nil Properties map). Tests: - PreservesOriginalKeyOrder pins the byte-level output against iceberg-go's emitted shape; would have caught the alphabetization regression. - EmptyObjectBackfilled covers the {} -> sentinels-only case (no leading comma). - AllPresentReturnsSameBytes confirms the no-op path returns input bytes unchanged, with whitespace intact. - iceberg_layout_test pins the catch-all $ anchor: metadata/file.avro.txt must still be rejected. 
* fix(s3tables/iceberg): guard ensureMetadataSpecCompliance against top-level null json.Unmarshal of a JSON `null` literal succeeds but leaves the map nil. The current byte-append path no-ops gracefully on this input, but the slow remarshal path would panic with "assignment to entry in nil map" if the input ever combined `null` with the explicit-null detection. Add an explicit nil-map short-circuit so the safety property is obvious from the source, and a test that pins the contract. * test(s3tables/iceberg): assert full byte equality in AllPresentReturnsSameBytes The prefix check only caught a missing "{\n " opener, so the test would have passed even if the function silently reordered keys or collapsed whitespace later in the document. Switch to a full string comparison so any future regression in the no-op path is loud. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-27 13:05:41 -07:00
Chris LuandGitHub	21b4b81edb	fix(filer/postgres): default to ON CONFLICT upsert to keep tx alive (#9709 ) * fix(filer/postgres): default to ON CONFLICT upsert to keep tx alive A KvPut from the inode-index secondary write could fail with 23505 (duplicate key) inside a rename's transaction, after which the next statement returned 25P02 and rename surfaced to FUSE as EIO. Default the postgres upsert query when enableUpsert=true so INSERTs are idempotent; the enableUpsert=false escape hatch is preserved for non-PG-compatible backends. * fix(filer/mysql): default to ON DUPLICATE KEY UPDATE upsert Same shape as the postgres default: when enableUpsert=true but no upsertQuery is configured, install a sensible default so the inode-index KvPut does not waste a duplicate-key roundtrip on every entry write. Uses the VALUES() form so the default works on MariaDB and MySQL >=5.7; the MySQL 8.0.19 row-alias form is left to explicit config. * fix(filer): default enableUpsert=true for sql stores The default-template fallback only kicks in when enableUpsert=true, so minimal configs that omit the flag entirely were still exposed. Default it on for postgres/postgres2/mysql/mysql2; an explicit false in filer.toml still wins because SetDefault only fills absent keys.	2026-05-27 12:23:30 -07:00
Chris Lu and GitHub	396e3c326b	fix(remote_storage/gcs): forward entry mime as ContentType (#9711) fix(remote_storage/gcs): forward entry.Attributes.Mime as ContentType Same gap as the S3 client: filer.remote.sync to GCS never populated the object's ContentType, so HTML/CSS/etc. ended up stored as the GCS default and didn't render correctly in browsers. Mirrors the existing Azure client behavior (weed/remote_storage/azure/azure_storage_client.go).	2026-05-27 12:21:27 -07:00
Chris LuandGitHub	9cb9699e9d	fix(replication/s3sink): forward entry mime as ContentType (#9710 ) * fix(replication/s3sink): forward entry.Attributes.Mime as ContentType Same gap as the remote_storage S3 client: filer.replicate uploads via s3manager.Uploader without populating ContentType, so replicated objects on S3-compatible backends (e.g. Backblaze B2) store binary/octet-stream and browsers refuse to render HTML, CSS, etc. Pass entry.Attributes.Mime through to UploadInput.ContentType, leaving the header unset when no Mime is recorded so the remote keeps its own default. * fix(replication/s3sink): nil-guard entry.Attributes when reading Mime * Revert "fix(replication/s3sink): nil-guard entry.Attributes when reading Mime" This reverts commit `08c3698e44`. The function already dereferences entry.Attributes.Mtime and entry.Attributes.Md5 unconditionally on the same path, so a nil guard on Mime alone is inconsistent and provides no real safety.	2026-05-27 12:20:51 -07:00
Chris Lu and GitHub	629beda1eb	fix(remote_storage/s3): forward entry mime as ContentType (#9708) fix(remote_storage/s3): forward entry.Attributes.Mime as ContentType filer.remote.sync was uploading every object without a Content-Type, so S3-compatible backends (e.g. Backblaze B2) stored binary/octet-stream and browsers refused to render HTML, CSS, etc. Pass entry.Attributes.Mime through to UploadInput.ContentType, leaving the header unset when no Mime is recorded so the remote keeps its own default behavior.	2026-05-27 12:13:01 -07:00
Chris Lu and GitHub	c3255b51fd	fix(volume): avoid panic when URL path has a dot before the comma (#9712) LastIndex returns -1 when the separator is missing and can return any position when both are present. A path like /vol/file.jpg,abc gives dotSep<commaSep, so path[commaSep+1:dotSep] slices with start>end and panics. Only treat the dot as an extension boundary when it sits after the comma.	2026-05-27 11:29:11 -07:00
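The slicing hazard described in c3255b51fd can be reproduced in isolation. The following is a minimal sketch, not the SeaweedFS parser: `splitFidAndExt` is a hypothetical helper name, and the assumption is that the interesting part of the path has the shape `<prefix>,<fid>[.ext]`.

```go
package main

import (
	"fmt"
	"strings"
)

// splitFidAndExt is a hypothetical helper (not the SeaweedFS function).
// It returns the text after the last comma, split into id and extension.
func splitFidAndExt(path string) (fid, ext string) {
	commaSep := strings.LastIndex(path, ",")
	dotSep := strings.LastIndex(path, ".")
	if commaSep < 0 {
		return "", "" // no comma: nothing to split in this sketch
	}
	// Only treat the dot as an extension boundary when it sits after the
	// comma; otherwise path[commaSep+1:dotSep] would have start > end.
	if dotSep > commaSep {
		return path[commaSep+1 : dotSep], path[dotSep:]
	}
	return path[commaSep+1:], ""
}

func main() {
	for _, p := range []string{"/3,01637037d6.jpg", "/vol/file.jpg,abc"} {
		fid, ext := splitFidAndExt(p)
		fmt.Printf("%q -> fid=%q ext=%q\n", p, fid, ext)
	}
}
```

Comparing the two indices before slicing covers both the missing-dot case (`dotSep == -1`) and the dot-before-comma case with a single condition.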
Chris LuandGitHub	65d557cbb0	fix(util): guard BytesToUint{16,32,64} against short input (#9713 ) * fix(util): guard BytesToUint{16,32,64} against short input length is computed as uint, so length-1 on an empty slice underflows to MaxUint and the loop indexes b[0] on a zero-length slice. BytesToUint16 also indexed b[0]/b[1] with no length check. All call sites today gate the slice length explicitly, so this hardens the API for new callers rather than fixing a live crash. Return 0 on short input, matching the existing variable-length contract. * BytesToUint16: match variable-length contract of the 32/64 helpers A 1-byte slice should return uint16(b[0]) rather than 0, matching how BytesToUint32 and BytesToUint64 treat short input.	2026-05-27 11:29:01 -07:00
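The underflow described in 65d557cbb0 comes from computing `length-1` on an unsigned zero. One way to sidestep it, shown here as a sketch under stated assumptions rather than the actual patch, is to range over the slice so no index arithmetic is needed. The lowercase function names are hypothetical stand-ins for the `BytesToUint{16,64}` helpers, and big-endian byte order is assumed.

```go
package main

import "fmt"

// bytesToUint64 folds a big-endian byte slice into a uint64.
// Hypothetical stand-in: an empty or nil slice yields 0, and shorter
// slices are treated as variable-length values.
func bytesToUint64(b []byte) (v uint64) {
	for _, c := range b { // no length-1 arithmetic, so no unsigned underflow
		v = v<<8 | uint64(c)
	}
	return v
}

// bytesToUint16 follows the same variable-length contract:
// a 1-byte slice returns uint16(b[0]) rather than 0.
func bytesToUint16(b []byte) (v uint16) {
	for _, c := range b {
		v = v<<8 | uint16(c)
	}
	return v
}

func main() {
	fmt.Println(bytesToUint64(nil))
	fmt.Println(bytesToUint16([]byte{0x12}))
	fmt.Println(bytesToUint16([]byte{0x12, 0x34}))
}
```

This sketch does not cap the input length, so bytes beyond the width of the result type shift out of the value. Callers that can receive oversized slices would need an extra bound.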
Jaehoon KimandGitHub	d00acded8a	fix(vacuum): batch all replicas in a single plugin worker task (#9702 ) * fix(vacuum): batch all replicas in a single plugin worker task The plugin worker vacuum path emitted one TaskDetectionResult per (volume, server) replica, but the dispatcher gates duplicate tasks per volume via ActiveTopology.HasAnyTask. The first replica's task was created and the remaining N-1 replicas were silently dropped, so only one replica per volume was ever vacuumed — leaving the others with all their garbage intact. Mirror the master built-in flow (topology.vacuumOneVolumeId → batchVacuumVolumeCheck/Compact/Commit/Cleanup) by: - aggregating detection metrics by VolumeID so a single task carries every replica in TaskParams.Sources - having VacuumTask accept []string servers (instead of a single string), re-check each replica's garbage ratio at execute time to derive a vacuumTargets subset, and run Compact/Commit/Cleanup against only that subset - updating the dispatcher (plugin_handler.Execute, register.CreateTask) to forward every Sources node to NewVacuumTask * fix(vacuum): run all-replica vacuum in two phases to keep failure atomic The prior implementation iterated Compact → Commit → Cleanup against each replica in sequence. A Compact failure on the second replica left the first one already committed (its active files swapped with the .cp* files), producing replica divergence with no automatic recovery. Split performVacuum into two phases, matching topology.vacuumOneVolumeId: Phase 1 — Compact all targets. If any fails, run VacuumVolumeCleanup on every target to drop the .cpd/.cpx/.cpldb temp files, then abort. No replica has swapped yet, so every replica returns to its original state. Phase 2 — Commit all targets. Best-effort, matching batchVacuumVolumeCommit: per-replica errors are collected and surfaced together. Once any replica has swapped there is no clean rollback, so a partial Phase 2 failure requires operator reconciliation. 
Adds compactOne / commitOne / cleanupOne / cleanupAll helpers and removes the old performVacuumOne. * fix(vacuum): abort when any replica's garbage check fails The prior check tolerated per-replica RPC errors and only failed the task if every replica errored — partial failures were silently treated as "ineligible" so the responding replicas would still be vacuumed. That produces divergence the moment the unreachable replica comes back: it still carries the original garbage while the others have been compacted. Match topology.batchVacuumVolumeCheck's contract instead — its return value (errCount == 0 && len(vacuumLocationList.list) > 0) gates the whole vacuum on every replica's check succeeding. If any replica is unreachable or its VacuumVolumeCheck RPC errors, abort the task; the volume will be retried on the next detection cycle once the replica is healthy. * fix(vacuum): guard against nil metrics and TaskSource entries Detection's bucket-building loop dereferenced m.VolumeID without checking m for nil. VacuumTask.Validate built sourceSet from params.Sources without checking each entry for nil. Both paths would panic on a malformed protobuf payload that managed to deliver a nil slot. Skip nil entries in both loops — neutral with the existing nil/empty filtering already done in register.CreateTask and plugin_handler.Execute. * test(vacuum): success path no longer calls VacuumVolumeCleanup The plugin worker vacuum is now two-phase (Compact-all → Commit-all, with Cleanup only invoked on Compact failure to roll back .cp* temp files). This matches topology.vacuumOneVolumeId, where batchVacuumVolumeCleanup runs only on the Compact-failure branch. On a successful Commit the temp files do not linger: - CommitCompactVolume renames .cpd → .dat and .cpx → .idx - leveldb needle map renames .cpldb → .ldb (needle_map_leveldb.go) so calling VacuumVolumeCleanup afterwards is a redundant no-op. 
The prior worker code called it unconditionally and the integration test asserted that — switch the expectation to cleanupCalls == 0 to document the new (and master-aligned) contract.	2026-05-27 11:15:25 -07:00
Chris LuandGitHub	cd68313929	fix(filer.sync): resolve manifest chunks against source filer (#9705 ) * fix(filer.sync): resolve manifest chunks against source filer `UpdateEntry` was passing `filer.LookupFn(fs)` — the sink filer client — into `compareChunks`. But `oldEntry`/`newEntry` chunks come from the source cluster, so manifest resolution must hit the source filer's volume servers. With two clusters that have overlapping volume IDs (common once they grow past a few hundred volumes), the sink lookup returns its own volume's URLs and the fetch 404s on the source's fileKey: compare chunks error: fail to read manifest 631,0babe...: 404 Not Found The 404 aborts the diff, the manifest chunk never gets replicated, and the target ends up with whatever flat chunks happened to land from earlier partial syncs — visible as `SIZE_MISMATCH` in filer.sync.verify on files large enough to use chunk manifests (~150 GB+ in practice). Only the manifest path was wrong; flat-chunk reads in `fetchAndWrite` already use `fs.filerSource.ReadPart`. * trim comment * test(filer.sync): regression test for source-filer manifest lookup Two recording filer gRPC servers stand in for source and sink. Driving UpdateEntry with a manifest chunk and observing which one receives LookupVolume proves compareChunks routes source-side lookups through fs.filerSource, not fs. Reverting the fix flips the call onto the sink filer and fails the assertion. * drop test	2026-05-27 10:23:29 -07:00
Jaehoon Kim GitHub Claude Opus 4.7 Chris Lu	675020b342	fix(filer.sync): validate chunk size in FilerSink to prevent 0-byte propagation (#9701 ) * fix(filer.sync): validate chunk size in FilerSink to prevent 0-byte propagation FilerSink.fetchAndWrite previously trusted the source response and the upload result blindly: a 200 OK / Content-Length: 0 reply from a broken source volume was happily uploaded as a 0-byte needle to the destination, and the destination filer metadata was then written with the source chunk size. The result was permanent silent corruption -- ls shows the file at its original size but reads fail with EIO. Add two cheap defenses inside fetchAndWrite: 1. After assembling fullData, compare its length against sourceChunk.Size. 2. After a successful upload, compare uploadResult.Size against sourceChunk.Size. Both checks wrap a new sentinel errChunkSizeMismatch that the retry callback recognizes and refuses to retry -- needle.size=0 on disk is a persistent state, not a transient network error, so the sync should stop loudly on the affected entry instead of looping or, worse, silently propagating it. Tests: * TestValidateReplicatedChunkSize -- table-driven coverage of healthy, legitimately empty, zero-byte read, short read, and truncated upload cases. * TestFetchAndWriteRejectsZeroByteSource -- end-to-end: an httptest source that returns 200 OK with an empty body must cause fetchAndWrite to return errChunkSizeMismatch after exactly one source hit (fail fast, no retry storm). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * filer.sync: bubble size-mismatch past CreateEntry/UpdateEntry Three follow-ups on the chunk-size validation: - Use %w in replicateOneChunk so the errChunkSizeMismatch sentinel survives the wrap and reaches errors.Is callers up the stack. - In FilerSink.CreateEntry/UpdateEntry, surface errChunkSizeMismatch instead of warning-and-nil. 
Other errors (deleted source chunk, transient network) keep the existing swallow so a hiccup doesn't stall the stream. - Drop validateReplicatedUploadSize: uploadResult.Size is set client-side from the same len(fullData) we already validated pre-upload, so the second check can't fail. Test: scope the RetryWaitTime override to the one test that needs it, add a regression that locks in the errors.Is chain through replicateChunks. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-26 20:47:53 -07:00
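The sentinel-and-`%w` pattern from 675020b342 can be sketched as follows. The names `errChunkSizeMismatch` and `replicateOneChunk` come from the commit message, but the signatures, `validateChunkSize`, and `shouldRetry` are assumptions made for illustration and do not reflect the real FilerSink code.

```go
package main

import (
	"errors"
	"fmt"
)

// errChunkSizeMismatch marks a persistent corruption that must not be retried.
var errChunkSizeMismatch = errors.New("chunk size mismatch")

// validateChunkSize (hypothetical signature) compares the bytes actually
// read against the size recorded for the source chunk.
func validateChunkSize(got int, want uint64) error {
	if uint64(got) != want {
		return fmt.Errorf("%w: read %d bytes, expected %d", errChunkSizeMismatch, got, want)
	}
	return nil
}

// replicateOneChunk (hypothetical signature) wraps with %w so the sentinel
// survives and reaches errors.Is callers further up the stack.
func replicateOneChunk(fileID string, got int, want uint64) error {
	if err := validateChunkSize(got, want); err != nil {
		return fmt.Errorf("replicate chunk %s: %w", fileID, err)
	}
	return nil
}

// shouldRetry is a hypothetical retry predicate: anything except a size
// mismatch is assumed to be transient.
func shouldRetry(err error) bool {
	return err != nil && !errors.Is(err, errChunkSizeMismatch)
}

func main() {
	err := replicateOneChunk("3,01637037d6", 0, 1024)
	fmt.Println(err)
	fmt.Println("retry:", shouldRetry(err))
}
```

Had the wrap used `%v` instead of `%w`, `errors.Is` would return false and the mismatch would be retried like any transient error.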
Chris LuandGitHub	7919cc7ca0	wdclient: prune filers dropped from master discovery (#9699 ) * wdclient: prune filers dropped from master discovery Filer discovery only appended new addresses; it never removed ones that disappeared from the master snapshot. After a K8s filer pod rolled to a new IP the old address lingered in filerAddresses and got retried again every resetTimeout window, stalling S3 uploads on i/o timeouts. Treat the master snapshot as authoritative: keep survivors (preserving their health counters and the active round-robin index), append newcomers with fresh health, drop the rest. Empty snapshots are still ignored so a transient master outage can't wipe the list. * wdclient: skip discovery snapshots with no usable addresses Guard against the defensive case where master returns updates whose addresses are all empty; reconciling against an empty discovered set would prune every filer.	2026-05-26 17:49:18 -07:00
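The reconciliation rule in 7919cc7ca0 (keep survivors, append newcomers, drop the rest, ignore empty snapshots) can be sketched as below. The types and the `reconcileFilers` function are hypothetical, not the wdclient API, and the sketch leaves out the round-robin index handling mentioned in the commit.

```go
package main

import "fmt"

// filerHealth is a hypothetical per-address health record.
type filerHealth struct{ failCount int }

// reconcileFilers treats the discovered snapshot as authoritative.
// Survivors keep their position and health, newcomers get fresh health,
// and addresses missing from the snapshot are pruned.
func reconcileFilers(current []string, health map[string]*filerHealth, discovered []string) []string {
	want := make(map[string]bool)
	for _, a := range discovered {
		if a != "" {
			want[a] = true
		}
	}
	if len(want) == 0 {
		return current // no usable addresses: keep the existing list
	}
	next := make([]string, 0, len(want))
	kept := make(map[string]bool)
	for _, a := range current {
		if want[a] && !kept[a] {
			next = append(next, a) // survivor: health entry untouched
			kept[a] = true
		} else if !want[a] {
			delete(health, a) // dropped from discovery
		}
	}
	for _, a := range discovered {
		if a != "" && !kept[a] {
			next = append(next, a)
			kept[a] = true
			health[a] = &filerHealth{} // newcomer: fresh health
		}
	}
	return next
}

func main() {
	health := map[string]*filerHealth{"10.0.0.1:8888": {failCount: 3}, "10.0.0.2:8888": {}}
	current := []string{"10.0.0.1:8888", "10.0.0.2:8888"}
	fmt.Println(reconcileFilers(current, health, []string{"10.0.0.2:8888", "10.0.0.3:8888"}))
}
```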
Chris LuandGitHub	1e91a99f79	fix(volume): avoid nil-deref when needle map loader errors (#9694 ) (#9697 ) * fix(volume): avoid nil-deref when needle map loader errors A corrupt .idx whose size is not a multiple of NeedleMapEntrySize sends the read-only load path into NewSortedFileNeedleMap, which returns (SortedFileNeedleMap)(nil) when reverseWalkIndexFile rejects the file. The multi-value assignment `v.nm, err = NewSortedFileNeedleMap(...)` parks that typed-nil pointer in the v.nm NeedleMapper interface, so the subsequent `v.nm != nil` guard still passes — and the post-load MaxNeedleEnd structural check dispatches through the promoted mapMetric accessor on a nil receiver, segfaulting the whole volume server at load time. Reset v.nm explicitly after every loader failure so the interface is truly nil, and skip the MaxNeedleEnd check when err is non-nil since the value would come from a partial walk anyway. NewLevelDbNeedleMap has the same typed-nil-on-error shape and is fixed the same way. fix(volume): close indexFile when needle map load errors Pre-fix the typed-nil v.nm path either leaked indexFile silently (SortedFileNeedleMap.Close had a nil-receiver early return) or crashed (LevelDbNeedleMap.Close had no such guard). With v.nm cleared to nil on error, the defer cleanup no longer calls Close at all, so the LoadCompactNeedleMap success-with-error path now also leaks indexFile. Close indexFile explicitly on each loader error to keep ownership balanced. * trim comments	2026-05-26 16:56:49 -07:00
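The typed-nil pitfall behind 1e91a99f79 is a general Go behavior and can be shown without any SeaweedFS code. In this sketch the `needleMapper` interface, `sortedMap` type, and loader functions are hypothetical stand-ins for the real needle map types.

```go
package main

import (
	"errors"
	"fmt"
)

// needleMapper is a hypothetical stand-in for the NeedleMapper interface.
type needleMapper interface{ MaxNeedleEnd() int64 }

type sortedMap struct{ maxEnd int64 }

// MaxNeedleEnd dereferences the receiver, so it panics on a nil *sortedMap.
func (m *sortedMap) MaxNeedleEnd() int64 { return m.maxEnd }

// newSortedMap returns a nil pointer together with an error on failure.
func newSortedMap(fail bool) (*sortedMap, error) {
	if fail {
		return nil, errors.New("index size is not a multiple of the entry size")
	}
	return &sortedMap{}, nil
}

// loadUnsafe parks the typed-nil pointer in the interface value.
func loadUnsafe(fail bool) (needleMapper, error) {
	var nm needleMapper
	var err error
	nm, err = newSortedMap(fail) // nm now holds (*sortedMap)(nil) on error
	return nm, err
}

// loadSafe resets the interface explicitly after a loader failure.
func loadSafe(fail bool) (needleMapper, error) {
	nm, err := loadUnsafe(fail)
	if err != nil {
		return nil, err // a truly nil interface
	}
	return nm, nil
}

func main() {
	nm, err := loadUnsafe(true)
	fmt.Println("unsafe:", err, "nm != nil:", nm != nil)
	nm, err = loadSafe(true)
	fmt.Println("safe:  ", err, "nm != nil:", nm != nil)
}
```

An interface value is nil only when both its dynamic type and its value are nil. This is why the `v.nm != nil` guard in the commit message still passed after the loader failed.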
Chris Lu	4f17c6661a	test: keep AllocateMiniPorts off weed mini default ports Random allocation could pick 33646 = admin.port (23646) + GrpcPortOffset. weed mini reserves that as Admin's gRPC port even when the test only overrides Master/Filer/S3/Iceberg, so the explicit Filer flag failed with "reserved for gRPC calculation" and TestRisingWaveIcebergCatalog flaked. Pre-seed the reserved set with every mini default HTTP port plus its +10000 offset so a random pick (or its own gRPC offset) cannot land on a service the caller left at its default.	2026-05-26 16:48:46 -07:00
Chris LuandGitHub	29eec2f111	master: timeout AllocateVolume/DeleteVolume and defer growRequest cleanup (#9698 ) * master: timeout AllocateVolume/DeleteVolume and defer growRequest cleanup The volume-grow goroutine clears the layout's growRequest flag only after ms.DoAutomaticVolumeGrow returns, and AllocateVolume / DeleteVolume were calling the volume-server RPC with context.Background(). A volume server that hung mid-call (heavy I/O, stuck lock, dead peer behind a stable VIP) would park the goroutine forever, leaving growRequest=true and silently blocking every subsequent automatic grow for that layout — Assign retries then drained their 30s budget with "context deadline exceeded" until the operator restarted the master. Bound both RPCs with a 5-minute deadline (creating/removing a volume is sub-second normally, generous for contended disks) and move the flag clear + filter delete into defers so a panic in DoAutomaticVolumeGrow doesn't strand the layout either. * allocate_volume: shorten timeout to 1m for faster recovery Volume create/delete is sub-second under normal conditions; 1 minute is generous even on a contended disk and clears the growRequest flag well before too many client Assigns drain their own retry budget. * trim comments	2026-05-26 16:26:21 -07:00
Chris Lu and GitHub	8fd7c524c7	redis2: apply keyPrefix in KV methods (#9693) KvPut/KvGet/KvDelete bypassed store.getKey(), so filer.store.id and other KV writes landed outside the configured prefix. With a Redis ACL restricted to the prefix this errored with NOPERM; without the ACL the keys silently lived in the wrong namespace.	2026-05-26 12:49:31 -07:00
Chris LuandGitHub	77dcb20a74	writeJson: drop unused JSONP branch (#9686 ) * writeJson: drop unused JSONP branch No in-tree caller uses ?callback=. Always serve application/json with X-Content-Type-Options: nosniff. * seaweed-volume: drop unused JSONP branch Mirror Go: always serve application/json with X-Content-Type-Options: nosniff. * writeJson: drop unreachable StatusNotModified check bodyAllowedForStatus already returns early for 304. * test/volume_server: rename and rewrite JSONP test to assert callback is ignored CI: /status?callback=myFunc now returns plain application/json with X-Content-Type-Options: nosniff.	2026-05-26 01:05:07 -07:00
Chris LuandGitHub	dd1b428789	s3,iceberg: reject `..` in URL path vars (#9687 ) * s3,iceberg: reject `..`/NUL in URL path vars Both gateway routers use mux.NewRouter().SkipClean(true), so a request like `GET /bucket-A/../evil-bucket/key` survives routing as bucket=bucket-A, object=../evil-bucket/key. The captured key is then joined into a filer path; util.JoinPath / path.Join collapse the `..` server-side and the read lands in evil-bucket. With auth on, IAM still authorizes against bucket-A (the mux var), so policy is evaluated against the wrong target. Add a middleware on the S3 bucket subrouter and the Iceberg REST router that rejects any `.`, `..`, NUL, or — for single-segment slots — embedded slash in the captured path vars before any handler runs. NormalizeObjectKey already folds `\` to `/` and decoding happens in mux, so `%2e%2e` and `..\` are caught. * s3,iceberg: reject empty captured vars and empty namespace parts Comma-ok the var lookup so we only check captured slots, then treat an empty captured value as a rejection on its own — downstream path.Join would otherwise collapse it and let the next segment pick the bucket. For iceberg, also reject empty parts after splitting the namespace on \x1F so leading/trailing/consecutive unit separators (which parseNamespace silently folds out) don't let distinct route values collapse to the same parsed namespace. Register loggingMiddleware before validateRequestPath on the iceberg router so rejected requests still produce an audit-log line.	2026-05-26 01:04:59 -07:00
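The per-variable check described in dd1b428789 can be sketched as a pure function, independent of the router. `validPathVar` is a hypothetical name. The sketch assumes the value has already been URL-decoded and does not fold backslashes, which the commit attributes to NormalizeObjectKey.

```go
package main

import (
	"fmt"
	"strings"
)

// validPathVar reports whether a captured path variable is safe to join
// into a filer path. singleSegment is true for slots such as a bucket
// name that must not contain a slash.
func validPathVar(v string, singleSegment bool) bool {
	if v == "" || strings.ContainsRune(v, 0) {
		return false // empty values and NUL bytes are rejected outright
	}
	if singleSegment && strings.Contains(v, "/") {
		return false
	}
	for _, seg := range strings.Split(v, "/") {
		if seg == "." || seg == ".." {
			return false // path.Join would collapse these server-side
		}
	}
	return true
}

func main() {
	fmt.Println(validPathVar("bucket-A", true))
	fmt.Println(validPathVar("../evil-bucket/key", false))
	fmt.Println(validPathVar("photos/2026/a..b.jpg", false))
}
```

Checking whole segments rather than searching for the substring `..` keeps legitimate keys such as `a..b.jpg` valid.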
Chris Lu	1355c7a102	4.29 4.29	2026-05-25 22:41:25 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	f72c5ec5d3	build(deps): bump github.com/go-sql-driver/mysql from 1.9.3 to 1.10.0 (#9682 ) Bumps [github.com/go-sql-driver/mysql](https://github.com/go-sql-driver/mysql) from 1.9.3 to 1.10.0. - [Release notes](https://github.com/go-sql-driver/mysql/releases) - [Changelog](https://github.com/go-sql-driver/mysql/blob/master/CHANGELOG.md) - [Commits](https://github.com/go-sql-driver/mysql/compare/v1.9.3...v1.10.0) --- updated-dependencies: - dependency-name: github.com/go-sql-driver/mysql dependency-version: 1.10.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:37:47 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	96f521addc	build(deps): bump github.com/linxGnu/grocksdb from 1.10.7 to 1.10.8 (#9683 ) Bumps [github.com/linxGnu/grocksdb](https://github.com/linxGnu/grocksdb) from 1.10.7 to 1.10.8. - [Release notes](https://github.com/linxGnu/grocksdb/releases) - [Commits](https://github.com/linxGnu/grocksdb/compare/v1.10.7...v1.10.8) --- updated-dependencies: - dependency-name: github.com/linxGnu/grocksdb dependency-version: 1.10.8 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:22:00 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	584da4cd10	build(deps): bump golang.org/x/crypto from 0.51.0 to 0.52.0 (#9681 ) Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.51.0 to 0.52.0. - [Commits](https://github.com/golang/crypto/compare/v0.51.0...v0.52.0) --- updated-dependencies: - dependency-name: golang.org/x/crypto dependency-version: 0.52.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:21:44 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	56b9df937c	build(deps): bump golang.org/x/sys from 0.44.0 to 0.45.0 (#9680 ) Bumps [golang.org/x/sys](https://github.com/golang/sys) from 0.44.0 to 0.45.0. - [Commits](https://github.com/golang/sys/compare/v0.44.0...v0.45.0) --- updated-dependencies: - dependency-name: golang.org/x/sys dependency-version: 0.45.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:21:36 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	e8ed043d2b	build(deps): bump go.etcd.io/etcd/client/pkg/v3 from 3.6.10 to 3.6.11 (#9679 ) Bumps [go.etcd.io/etcd/client/pkg/v3](https://github.com/etcd-io/etcd) from 3.6.10 to 3.6.11. - [Release notes](https://github.com/etcd-io/etcd/releases) - [Commits](https://github.com/etcd-io/etcd/compare/v3.6.10...v3.6.11) --- updated-dependencies: - dependency-name: go.etcd.io/etcd/client/pkg/v3 dependency-version: 3.6.11 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:21:28 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	502fef6b50	build(deps): bump docker/login-action from 4.1.0 to 4.2.0 (#9678 ) Bumps [docker/login-action](https://github.com/docker/login-action) from 4.1.0 to 4.2.0. - [Release notes](https://github.com/docker/login-action/releases) - [Commits](https://github.com/docker/login-action/compare/v4.1.0...v4.2.0) --- updated-dependencies: - dependency-name: docker/login-action dependency-version: 4.2.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-25 16:21:20 -07:00
Chris LuandGitHub	b21c263328	test/fuse_dlm: cross-mount POSIX locks + survival across a ring change (#9677 ) Adds two FUSE integration tests on the existing dlm cluster harness (the -dlm mounts route advisory locks to the owner filer): - TestPosixLockCrossMount: an flock taken on one mount blocks the other, and is grantable after release — the routed-to-owner path end to end. - TestPosixLockSurvivesFilerLoss: hold flocks on many files, stop filer1 so keys it owned migrate to filer0; after the ring settles and the holding mount re-asserts, every lock is still honored. Asserts only the settled state; the transient migration window is unit-covered. Locks are taken on read-only fds so the -dlm whole-file write lock (a different mechanism, held until close) isn't involved. Skipped on non-Linux: only Linux forwards advisory locks (SETLK) to the FUSE server; macFUSE handles flock in-kernel per mount.	2026-05-25 16:20:23 -07:00
Chris LuandGitHub	c9868dcf2f	filer/posixlock: remove the unused lock-set serde (#9676 ) The codec (Set.Marshal/Unmarshal) and its posix_lock.proto were built to let the lock set ride in an inode's entry metadata, but the authority is in-memory and ownership handoff/restart is handled by mounts re-asserting their held locks over the RPC — neither serializes the set. Nothing calls the serde outside its own tests, so drop it (codec, proto, generated pb, Makefile). The in-memory Set/Manager are unchanged.	2026-05-25 13:15:19 -07:00
Chris Lu and GitHub	85ca3cb757	filer: warm-up + fail-closed cooling for POSIX locks on owner (re)start (#9673) After a (re)start the owner defers would-be grants for posixLockWarmup while mounts re-assert, trusting only locally-visible conflicts, so it does not double-grant from empty state; a deferred grant is a retry for SetLkw and EAGAIN for non-blocking SetLk, never a spurious grant. Cooling now fail-closes: if the previous owner is unreachable during a ring change, defer rather than risk a double-grant. readyAt is atomic so the handler reads it without locking.	2026-05-25 13:14:05 -07:00
Chris Lu and GitHub	a3c0baa9b0	filer: cooling-off dual-read for POSIX locks during ring changes (#9672) While the ring changed within the last snapshot interval, a fresh owner asks the key's previous owner (LockRing.PriorOwner) whether it still holds a conflicting lock before granting TRY_LOCK or answering GET_LK, so it does not double-grant before re-assertion rebuilds its local state. The probe is marked cooling_probe so the previous owner answers from local state without recursing. PriorOwner uses the snapshot's prebuilt ring rather than rebuilding a hash ring per call.	2026-05-25 12:34:15 -07:00
7y-9 and GitHub	881226a81b	fix: avoid rclone nil close panics (#9674) * fix: avoid rclone nil close panics * fix: avoid rclone nil close panics	2026-05-25 09:53:45 -07:00
Chris Lu and GitHub	f8caaa4464	mount,filer: re-assert POSIX locks via keepalive (ownership migration + restart) (#9668) * mount: renew POSIX lock leases via keepalive The mount tracks the inode keys it holds locks on and a background loop renews its session lease (KEEP_ALIVE) with each key's owner filer every 5s, within the filer's 15s TTL. A live mount is never reaped; a dead one stops renewing and owners reclaim its locks. Tracking is a superset: holds are added on grant and dropped only on owner release, so a still-held lock is never under-renewed. * mount,filer: re-assert held POSIX locks via keepalive The owner filer holds POSIX advisory locks as in-memory soft state, so a key's owner change (ring rebalance) or an owner restart lost or stranded them: the new or restarted owner was blind to existing holders and would double-grant. Make the keepalive carry the mount's held lock ranges per key. The mount mirrors its own granted locks (posixOwn), and each tick re-asserts them to the key's current owner, which rebuilds that session's locks from the assertion, making it self-healing after a takeover or restart. The owner arbitrates re-asserted locks against other sessions so it never double-grants; a lock that lost a migration race is reported, not forced. A bare keepalive (no ranges) still just renews.	2026-05-25 01:02:45 -07:00
Chris Lu and GitHub	c97b69f8a4	filer: session lease + reaping for POSIX locks (#9666) * filer: session lease + reaping for POSIX locks A mount renews its session lease by keepalive (new KEEP_ALIVE op); the owner filer records last-seen per session and a background sweeper reaps the locks of leased sessions that stop renewing (a dead or partitioned mount). Only sessions that have renewed are leased, so this is inert until mounts run with -posixLock. * mount: route POSIX advisory locks to the owner filer (-posixLock) (#9665) mount: route POSIX advisory locks to the owner filer under -dlm With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to the inode's owner filer via the PosixLock RPC instead of the local table, so flock/fcntl are honored across mounts. Advisory locking rides the same switch as whole-file write coordination, and is therefore off under writeback cache, which implies single-writer. The mount calls its filer and relies on filer-side forwarding to reach the owner. Keys are the inode identity (HardLinkId else path); SetLkw is client-side polling with the FUSE cancel channel (no server wait queue); a per-mount session id namespaces owners; a local hint avoids a release RPC on every close. * mount,filer: bound posix-lock release RPCs and stop the reaper on shutdown The unlock/release RPCs run off the syscall path (close/flush) and used context.Background() with no deadline, so a slow or unreachable filer could hang close() indefinitely; bound them to 5s (they still aren't cancelled by an interrupt). The lease-reaping sweeper now selects on a stop channel that FilerServer.Shutdown closes, instead of looping for the process lifetime.	2026-05-25 00:00:59 -07:00
Chris Lu and GitHub	3976264391	mount: keep the posix-lock hint until the release RPC succeeds (#9670) routedReleasePosixOwner dropped the local owner hint before sending RELEASE_POSIX_OWNER, so a transient RPC failure left the lock held on the owner filer with no local record to retry from, stranded until session-lease reaping. Drop the hint only after a successful release; on failure keep it so a later flush retries, with lease reaping as the backstop.	2026-05-25 00:00:34 -07:00
Chris Lu and GitHub	3481f13f54	mount: route POSIX advisory locks to the owner filer under -dlm (#9669) With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to the inode's owner filer via the PosixLock RPC instead of the local table, so flock/fcntl are honored across mounts. Advisory locking rides the same switch as whole-file write coordination, and is therefore off under writeback cache, which implies single-writer. Keys are the inode identity (HardLinkId else path); SetLkw is client-side polling with the FUSE cancel channel (no server wait queue); a per-mount session id namespaces owners; a local hint avoids a release RPC on every close. Background unlock/release RPCs are bounded so a stuck filer can't hang close().	2026-05-24 23:56:37 -07:00
Chris Lu and GitHub	68cae26c0b	mount: fix SetAttr/GetAttr crash from concurrent chunk append under writebackCache (#9667) * mount: hold the entry lock while reading chunk size in GetAttr/SetAttr Async upload workers append chunks to an open handle's shared entry under the LockedEntry lock (FileHandle.AddChunks), but GetAttr and SetAttr computed FileSize by iterating entry.Chunks without taking it. A concurrent append that reallocated the backing array tore the slice read and crashed in filer.TotalSize. Surfaces with -writebackCache, where handles stay open and flush asynchronously while metadata ops keep arriving. Take the LockedEntry lock for those reads (and SetAttr's truncate rewrite). * mount: re-read entry under the lock in GetAttr/SetAttr If SetEntry swapped the handle's entry pointer between maybeReadEntry and the lock acquisition, the old pointer is orphaned. Re-read fh.entry.Entry under the lock so SetAttr mutates the live entry instead of losing the update, and GetAttr reports the current one. * mount: cover the truncate path in TestAttrChunkRace Alternate SetAttr between mtime-only and a shrinking size so the test also exercises the entry.Chunks rewrite under fh.entry.Lock, not just the read-side size walk. * mount: snapshot chunks under the entry lock on the read path readFromChunks holds fh.entryLock (excludes SetAttr) but not the LockedEntry lock the async uploader appends under, so IsInRemoteOnly, the FileSize fallback, and the RDMA/peer chunk walks read entry.Chunks while AddChunks reallocated it, the same torn-slice crash as GetAttr/SetAttr. Snapshot size, inline content, and the chunk list under a brief LockedEntry RLock, then hand the snapshot to the RDMA/peer helpers instead of holding the lock across network I/O. The captured slice stays valid: append never mutates the old backing array, and truncate is excluded by the fh.entryLock.	2026-05-24 23:49:41 -07:00
Chris Lu and GitHub	fef49c2d75	filer: routed PosixLock RPC over the in-memory authority (#9664) * filer: in-memory POSIX lock authority (Manager) Concurrent multi-inode authority over the per-inode Set: a Set per opaque inode key (path, or hl:<HardLinkId>) plus a session->keys index so a dead mount's locks reap in O(locks held). Lock state stays in memory like the distributed lock manager's, off the replicated meta-log. TryLock/Unlock/GetLk/ReleasePosixOwner/ReleaseFlockOwner/ReleaseSession; empty sets and stale index entries are pruned on release. * filer: routed PosixLock RPC over the in-memory authority Adds the PosixLock RPC (try/unlock/get_lk + the flush/release owner drops) that the owner filer answers from its in-memory Manager. The request key is the inode identity ring key; a non-owner filer forwards one hop (is_moved-bounded), mirroring ObjectTransaction, so the owner's table stays the single authority under a stale ring view. Strictly non-blocking; SetLkw polling lives in the mount.	2026-05-24 22:50:42 -07:00
Chris Lu and GitHub	564b94796a	filer: in-memory POSIX lock authority (Manager) (#9663) Concurrent multi-inode authority over the per-inode Set: a Set per opaque inode key (path, or hl:<HardLinkId>) plus a session->keys index so a dead mount's locks reap in O(locks held). Lock state stays in memory like the distributed lock manager's, off the replicated meta-log. TryLock/Unlock/GetLk/ReleasePosixOwner/ReleaseFlockOwner/ReleaseSession; empty sets and stale index entries are pruned on release.	2026-05-24 22:43:17 -07:00
Chris Lu and GitHub	475ae2b443	filer: serialize the POSIX lock set for entry metadata (phase 2) (#9661) * filer: POSIX advisory lock set primitive (phase 1) Pure per-inode conflict/coalesce/range-split logic for fcntl byte-range and flock whole-file locks, extracted from the mount's PosixLockTable without its wait queue or inode-map concurrency. Owner identity is (Sid, Owner) so the same FUSE owner on different mounts never aliases, and ReleaseSession reaps a dead mount's locks. The owner filer will hold one Set per inode under the per-path lock; no concurrency control here. * test: tolerate transient FUSE invisibility in ConcurrentReadWrite A concurrent truncating overwrite leaves a short-lived dentry/cache window where the file is momentarily ENOENT to another opener. Retry the reads and writes a few times before failing, as ConcurrentDirectoryOperations does. * filer: serialize the POSIX lock set for entry metadata Versioned fixed-width binary encoding of a Set, so an inode's held locks can ride in its entry metadata: a lock op materializes the Set from the blob, applies under the per-path lock, and writes it back. Empty set encodes to nil so a lock-free inode carries no blob. * filer: encode the POSIX lock set as protobuf Replace the hand-rolled fixed-width codec with a LockSetProto message, so the metadata blob can gain fields without a format-version migration. proto.Unmarshal already rejects a malformed blob, so the explicit version and length checks go away. Marshal now returns an error to match.	2026-05-24 22:33:36 -07:00
Chris Lu and GitHub	e8e7cd6fac	filer: POSIX advisory lock set primitive (phase 1 of distributed FUSE locking) (#9660) * filer: POSIX advisory lock set primitive (phase 1) Pure per-inode conflict/coalesce/range-split logic for fcntl byte-range and flock whole-file locks, extracted from the mount's PosixLockTable without its wait queue or inode-map concurrency. Owner identity is (Sid, Owner) so the same FUSE owner on different mounts never aliases, and ReleaseSession reaps a dead mount's locks. The owner filer will hold one Set per inode under the per-path lock; no concurrency control here. * test: tolerate transient FUSE invisibility in ConcurrentReadWrite A concurrent truncating overwrite leaves a short-lived dentry/cache window where the file is momentarily ENOENT to another opener. Retry the reads and writes a few times before failing, as ConcurrentDirectoryOperations does.	2026-05-24 21:56:48 -07:00
Chris Lu and GitHub	0f1e50f9ec	fix(master): re-register volumes missing from the lookup index A disconnect/reconnect race could drop a volume from vid2location while it stayed in the data node's disk map, so it showed in volume.list and the admin UI but LookupVolume returned "volume id not found" and never self-healed (the full heartbeat only registered volumes new to the disk map). The full heartbeat now re-registers any reported volume missing from the lookup index, reusing the already-resolved VolumeLayout.	2026-05-24 15:11:09 -07:00
Chris Lu and GitHub	2a4923e7e8	ObjectTransaction: filer-side forwarding via route_key (#9659) A non-owner filer forwards the whole transaction to the ring owner of route_key, so the owner's per-path lock stays the single serialization point even when the caller's ring view is stale. is_moved bounds forwarding to one hop. The gateway stamps route_key on every routed builder via the shared objectRouteKey helper. Completes taking S3 object mutations off the distributed lock.	2026-05-24 14:21:06 -07:00
Chris Lu and GitHub	25beb7ec48	admin: expose Prometheus metrics (#9652) * admin: add -metricsPort flag to expose Prometheus metrics The admin command had no metrics endpoint, so passing -metricsPort (as the operator does for spec.admin.metricsPort) crashed the process with "flag provided but not defined". Wire up -metricsPort/-metricsIp and start the shared Prometheus metrics server, matching filer, master, and volume. * admin: emit maintenance task and worker fleet metrics Add Prometheus metrics for the admin server's distinctive work: the maintenance task queue and the worker fleet that executes it. Task lifecycle: maintenance_tasks_by_status / _by_type gauges (snapshot of the queue), maintenance_tasks_completed_total{type,outcome} counter and maintenance_task_duration_seconds{type} histogram (recorded when a task reaches a terminal state), and last/next scan timestamp gauges. Worker fleet: workers_connected and worker_slots{used,max} gauges, plus worker_events_total{event} counting register/unregister/stale removals. Gauges are snapshotted by a background goroutine on the admin server; counters and the histogram are recorded at their event sites. * admin: read worker slot totals under lock, clear next-scan gauge when idle GetWorkers returns live worker pointers; summing CurrentLoad/MaxConcurrent outside the queue lock races with task assignment and completion. Add GetWorkerSlotTotals to aggregate under the lock. Also reset maintenance_next_scan_timestamp_seconds to 0 when the scanner is not running, so it can't retain a stale value after a stop.	2026-05-24 14:09:02 -07:00
Chris Lu and GitHub	6fc212cedb	test: wait for a writable volume before lifecycle tests' first write (#9658) Probe one throwaway write once per process before the lifecycle tests run, absorbing the post-start volume-growth window so the first real PutObject doesn't race volume growth and 500. Each call is bounded by the remaining 60s budget; CreateBucket is retried within it.	2026-05-24 14:01:13 -07:00
Chris Lu and GitHub	1f0c366583	s3: route metadata-only self-copy off the distributed lock (#9638) A non-versioned metadata-only self-copy (CopyObject with source == destination and the REPLACE directive) is a read-modify-write of one entry, which is why it held the distributed lock. It now routes to the owner as a serialized PATCH_EXTENDED: the owner merges the new managed metadata (set the replacements, delete the dropped keys) onto a fresh read of the entry under its per-path lock, so a concurrent change to non-managed keys (legal hold, retention, version id) is preserved instead of clobbered, and bumps mtime. PATCH_EXTENDED gains touch_mtime for the mtime bump. Versioned and suspended self-copies create a new version (already routed via the copy finalize) and the no-owner bootstrap keep the lock.	2026-05-24 12:32:57 -07:00
Chris Lu and GitHub	fa7056dc6f	s3: route object-lock version-specific deletes off the distributed lock (#9657) A version-specific DELETE (real version or the null version, including object-lock WORM-checked ones and governance-bypass) now runs as one routed transaction on the object's owner instead of holding the distributed lock. For a real version: recompute the .versions pointer excluding the version (repoint-before-delete, so a crash leaves a recoverable orphan rather than a dangling pointer), then delete the version file, under the object's per-path lock. The null version is the regular object entry, deleted directly (no pointer). Object-lock buckets gate the delete on the version's WORM guards evaluated on the owner: legal hold (always) + retention (while not elapsed). Governance bypass scopes the retention guard to COMPLIANCE mode, so the filer allows a governance-mode delete while still denying compliance and legal hold; the gateway never reads the version. Three primitives make this expressible: - ObjectTransaction.condition_key: evaluate the condition against a named entry (the version) while the lock stays on lock_key (the object). - Recompute.exclude_name: omit a child from the scan, to repoint before delete. - WriteCondition.Clause gate_key/gate_value: scope IF_EXTENDED_TIME_ELAPSED to a mode, expressing governance bypass without a gateway-side read.	2026-05-24 11:41:08 -07:00
Chris Lu and GitHub	eeda7181aa	s3: route multipart-upload completion off the distributed lock (#9632) completeMultipartUpload routes its writes to the object's owner filer when an owner is known, off the distributed lock. Idempotent replay is handled gateway-side in prepareMultipartCompletionState (it returns the existing result when the object already carries this UploadId), so the lock is not needed to dedupe retries; with no owner yet, the lock remains as the bootstrap path. Versioned completion flips the .versions pointer via routedVersionedFinalize (RECOMPUTE_LATEST). Non-versioned and suspended completion write the object via routedMkFile (a routed PUT) so the write serializes with concurrent writes to the same key on the owner's per-path lock. The version file itself is a unique path and stays a plain mkFile.	2026-05-24 11:07:39 -07:00
Chris Lu and GitHub	4b9d46b5ad	s3: route versioned COPY and delete-marker off the DLM (#9633) s3: route versioned/suspended delete markers and versioned COPY off the lock createDeleteMarker flips the .versions pointer via routedVersionedFinalize (RECOMPUTE_LATEST on the owner filer) when an owner is known, so an Enabled or Suspended DeleteObject takes its pointer flip off the distributed lock; the delete marker file is written first and the owner re-derives the pointer. DeleteObjectHandler routes a versioned/suspended delete with no specific version straight to the owner, off the lock. A specific-version delete and object-lock buckets keep the lock (the former needs a recompute-after-delete handled separately; the latter needs gateway-side enforcement). CopyObject into a versioned bucket finalizes the new version through the same routed pointer flip.	2026-05-24 07:22:27 -07:00
Chris Lu and GitHub	5bac8b9281	s3: route object-lock object writes off the distributed lock (#9635) routableWriteOwner no longer excludes object-lock buckets, so a versioned PUT (which creates a new version, never overwriting a locked one) and a non-versioned overwrite (WORM-checked gateway-side before dispatch) route to the owner filer like any other write. routedObjectOwner still excludes object-lock: an unversioned object-lock delete enforces WORM under the lock, so it stays there rather than routing past the check. Version-specific deletes likewise stay on the lock: routing them needs the WORM check (on the version entry) and the latest-pointer recompute (on the object) under one transaction, which the current single condition target cannot express.	2026-05-24 07:20:44 -07:00
Chris Lu and GitHub	db954b5503	s3: route versioned PutObject finalize off the DLM (#9631) s3: route versioned PutObject finalize off the distributed lock A versioned write's finalize (flip the .versions pointer to the newest version, demote the prior latest) now runs as a single RECOMPUTE_LATEST ObjectTransaction on the object's owner filer, under its per-path lock, instead of the unserialized updateLatestVersionInDirectory. The version file is written first; the owner re-derives the pointer by scanning the directory. RECOMPUTE_LATEST gains size_to_key / mtime_to_key to cache the chosen version's size and mtime on the pointer, and demote_key / demote_value to stamp the displaced prior latest (NoncurrentSinceNs for lifecycle) when the pointer moves. Falls back to updateLatestVersionInDirectory when no owner is known yet.	2026-05-24 03:10:30 -07:00
Chris Lu and GitHub	32aa70ab59	s3: serialize bucket config writes with field-level filer patches (#9655) When PutBucketVersioning and PutBucketEncryption ran concurrently, each did a whole-entry read-modify-write of the bucket entry, so one could overwrite the other's field with a stale copy. Each config write is now a field-level PATCH_EXTENDED (extended attributes) or set_content (the metadata blob) ObjectTransaction, routed to the bucket's owner filer and merged onto a fresh read under its per-path lock. Disjoint fields no longer clobber each other.	2026-05-24 02:30:26 -07:00
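
The lost-update problem described in the bucket config commit (32aa70ab59) can be illustrated with a minimal Go sketch. This is not SeaweedFS code: `bucketStore`, `writeWhole`, and `patch` are hypothetical names, and a single mutex stands in for the owner filer's per-path lock. It only contrasts a whole-entry read-modify-write with a field-level merge onto the current entry.

```go
package main

import (
	"fmt"
	"sync"
)

// bucketStore is a hypothetical in-memory stand-in for the owner filer.
// One mutex plays the role of the per-path lock on a single bucket entry.
type bucketStore struct {
	mu       sync.Mutex
	extended map[string]string
}

// read returns a copy of the entry, as a caller would hold after a fetch.
func (s *bucketStore) read() map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]string, len(s.extended))
	for k, v := range s.extended {
		out[k] = v
	}
	return out
}

// writeWhole replaces the entire entry with the caller's (possibly stale) copy.
func (s *bucketStore) writeWhole(e map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.extended = e
}

// patch merges only the named fields onto the current entry under the lock.
func (s *bucketStore) patch(set map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k, v := range set {
		s.extended[k] = v
	}
}

func main() {
	// Whole-entry read-modify-write: both writers read before either writes.
	whole := &bucketStore{extended: map[string]string{}}
	a, b := whole.read(), whole.read()
	a["versioning"] = "Enabled"
	b["encryption"] = "AES256"
	whole.writeWhole(a)
	whole.writeWhole(b) // the stale copy drops "versioning"
	fmt.Println(len(whole.read())) // prints 1

	// Field-level patches: each writer names only its own field.
	patched := &bucketStore{extended: map[string]string{}}
	var wg sync.WaitGroup
	for _, p := range []map[string]string{
		{"versioning": "Enabled"},
		{"encryption": "AES256"},
	} {
		wg.Add(1)
		go func(p map[string]string) {
			defer wg.Done()
			patched.patch(p)
		}(p)
	}
	wg.Wait()
	fmt.Println(len(patched.read())) // prints 2
}
```

In the sketch, the patch path never carries a copy of fields it does not own, so the order in which the two writers run does not matter for disjoint fields.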
