mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-23 18:21:28 +00:00
master
941 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
adfd731bb8 | 4.28 | ||
|
|
2c2b2d4d3e |
chore(skiplist): remove unused NameList/NameBatch implementation (#9603)
NameList, NameBatch and their serde were an earlier in-memory directory batch implementation. The redis3 filer store uses its own ItemList backed by Redis sorted sets, so these types had no production callers (NameList only via its own test, LoadNameList none at all). Drop them and the now-orphaned NameBatchData proto message, regenerating skiplist.pb.go with the repo-standard protoc-gen-go v1.36.6. |
||
|
|
3f6410fdc3 |
fix(redis3): prevent filer crash from inconsistent skiplist ends (#9602)
* fix(redis3): prevent filer crash from inconsistent skiplist ends DeleteByKey updated the two ends asymmetrically: the start side decided whether to clear StartLevels[index] by comparing a cached reference key, while the end side cleared EndLevels[index] structurally. The redis3 ItemList re-keys a node while keeping its id, so that cached key drifts. When such a node was the only one at a level and got deleted, the stale key comparison left StartLevels dangling while EndLevels was cleared to nil. The next InsertByKey then dereferenced a nil EndLevels[0] and took down the whole filer during a rename or delete. Match the deleted node by its unique id so both ends stay consistent, and guard each end in InsertByKey so an already-corrupted skiplist persisted in Redis self-heals instead of crashing on load. * fix(redis3): propagate errors from WriteName node split The case 2.3 split path returned nil instead of the error in seven branches. Because the split runs a multi-step sequence (DeleteByKey on the skiplist, ItemAdd, redis range-delete, ItemAdd), a swallowed failure let WriteName report success while the skiplist was half-updated, which the caller then persisted - silently corrupting the directory listing and setting up the very inconsistent-ends state that crashes the filer. |
||
|
|
868849392c | 4.27 | ||
|
|
136eb1b7c8 | 4.26 | ||
|
|
7acba59a5c | 4.25 | ||
|
|
1dfea8a502 | 4.24 | ||
|
|
89608e5499 | chore(weed/util/log_buffer): remove unused functions (#9444) | ||
|
|
d8bbc1d855 |
fix: cap pool retention so chunk-copy buffers don't hoard memory (#9422)
Two pool-retention sites kept the runaway-RSS pattern in #6541 visible even after #9420 and #9421: * weed/util/buffer_pool: SyncPoolPutBuffer dropped a buffer back into sync.Pool regardless of how big it had grown. After a 64 MiB chunk upload through volume.PostHandler -> needle.ParseUpload, the pool hoarded a 64 MiB byte array per cached entry for the rest of the process's lifetime. Cap retention at 4 MiB; oversized buffers are dropped so GC can reclaim the backing array. * weed/s3api/...copy.go: uploadChunkData left UploadOption.BytesBuffer unset, so operation.upload_content fell back to the package-global valyala/bytebufferpool. That pool also retains high-water buffers forever, and concurrent UploadPartCopy filled it with one chunk-sized buffer per concurrent upload. Provide a fresh per-call bytes.Buffer pre-sized to chunk + multipart framing; it's GC'd as soon as the upload returns. Tests: - weed/util/buffer_pool/sync_pool_test.go: pin the cap (oversized buffers don't round-trip), the inverse (right-sized buffers do), and nil-safety. - weed/s3api/...copy_chunk_upload_test.go: extract newChunkUploadOption and pin that BytesBuffer is always non-nil and pre-sized, and that each call gets a distinct buffer. |
||
|
|
935fb42e1d |
chore(weed/util/chunk_cache): remove unused functions (#9372)
* chore(weed/util/chunk_cache): remove unused functions * fix(chunk_cache): bound ReadAt buffer in readNeedleSliceAt When the caller-provided buffer is larger than the remaining needle bytes, ReadAt would spill into the next needle and trigger the n != wanted error. Slice to data[:wanted] so the read stops at the needle boundary. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
73fc9e3833 | 4.23 | ||
|
|
d605feb403 |
refactor(command): expand "~" in all path-style CLI flags (#9306)
* refactor(command): expand "~" in all path-style CLI flags Many of weed's path-bearing flags (-s3.config, -s3.iam.config, -admin.dataDir, -webdav.cacheDir, -volume.dir.idx, TLS cert/key files, profile output paths, mount cache dirs, sftp key files, ...) were never run through util.ResolvePath, so a value like "~/iam.json" was used literally. Tilde only worked when the shell expanded it, which silently fails for the common -flag=~/path form (bash leaves the tilde literal in --opt=~/path). - Extend util.ResolvePath to also handle "~user" / "~user/rest", matching shell tilde expansion. Add unit tests. - Apply util.ResolvePath at the top of each shared start* function (s3, webdav, sftp) so mini/server/filer/standalone callers all inherit it; resolve at the few one-off use sites (mount cache dirs, volume idx folder, mini admin.dataDir, profile paths). - Drop the duplicate expandHomeDir helper from admin.go in favor of the now-equivalent util.ResolvePath. * fixup: handle comma-separated -dir flags for tilde expansion `weed mini -dir`, `weed server -dir`, and `weed volume -dir` accept comma-separated paths (`dir[,dir]...`). Calling util.ResolvePath on the whole string mishandled multi-folder values with tilde, e.g. "~/d1,~/d2" would resolve as if "d1,~/d2" were a single subpath. - Add util.ResolveCommaSeparatedPaths: split on ",", run each entry through ResolvePath, rejoin. Short-circuits when no "~" present. - Use it for *miniDataFolders (mini.go), *volumeDataFolders (server.go), and resolve each entry of v.folders in-place (volume.go) so all downstream consumers see resolved paths. - Add 7-case TestResolveCommaSeparatedPaths covering empty, single, multiple, and mixed inputs. * address PR review: metaFolder + Windows backslash - master.go: resolve *m.metaFolder at the top of runMaster so util.FullPath(*m.metaFolder) on the next line sees an expanded path. Drop the now-redundant ResolvePath in TestFolderWritable. - server.go: same treatment for *masterOptions.metaFolder, paired with the existing cpu/mem profile resolves. Drop the redundant inner ResolvePath at TestFolderWritable. - file_util.go: ResolvePath now accepts filepath.Separator as a separator after the tilde, so "~\\data" works on Windows. Other platforms keep current behaviour (backslash stays literal because it is a valid filename character in usernames and paths). - file_util_test.go: add two cases using filepath.Separator that exercise the new code path on Windows and remain a no-op on Unix. * address PR review: resolve "~" in remaining command path flags Comprehensive sweep of path-bearing flags across every weed subcommand, applying util.ResolvePath in-place at the top of each run* function so all downstream consumers see expanded paths. - webdav.go: resolve *wo.cacheDir at the top of startWebDav so mini/server/filer/standalone callers all inherit it. - mount_std.go: cpu/mem profile paths. - filer_sync.go: cpu/mem profile paths. - mq_broker.go: cpu/mem profile paths. - benchmark.go: cpuprofile output path. - backup.go: -dir resolved once at runBackup; drop the duplicated inline ResolvePath in NewVolume calls. - compact.go: -dir resolved at runCompact; drop inline ResolvePath. - export.go: -dir and -o resolved at runExport; drop inline ResolvePath in LoadFromIdx and ScanVolumeFile. - download.go: -dir resolved at runDownload; drop inline. - update.go: -dir resolved at runUpdate so filepath.Join uses the expanded path; drop inline ResolvePath in TestFolderWritable. - scaffold.go: -output expanded before filepath.Join. - worker.go: -workingDir expanded before being passed to runtime. * address PR review: resolve option-struct paths at run* entry points server.go:381 propagates s3Options.config to filerOptions.s3ConfigFile *before* startS3Server runs, which meant the filer-side code saw the unresolved tilde-prefixed pointer. Same pattern for webdavOptions and sftpOptions (and equivalent in mini.go / filer.go). The fix: hoist resolution from the shared start* functions up to the run* entry points, where every shared pointer is set up before any propagation happens. - s3.go, webdav.go, sftp.go: extract a resolvePaths() method on each Options struct that runs every path field through util.ResolvePath in-place. Idempotent. - runS3, runWebDav, runSftp: call the standalone struct's resolvePaths before starting metrics / loading security config. - runServer, runMini, runFiler: call resolvePaths on every embedded options struct, plus resolve loose flags (serverIamConfig, miniS3Config, miniIamConfig, miniMasterOptions.metaFolder, and filer's defaultLevelDbDirectory) so they're expanded before any pointer copy or use. - Drop the now-redundant inline ResolvePath at filer's defaultLevelDbDirectory composition. * address PR review: re-resolve mini -dir post-config, cover misc paths - mini.go: applyConfigFileOptions can overwrite -dir with a literal ~/data from mini.options. Re-resolve *miniDataFolders after the config-file apply, alongside the other path resolves, so the mini filer no longer ends up with a literal ~/data/filerldb2. - benchmark.go: resolve *b.idListFile (-list). - filer_sync.go: resolve *syncOptions.aSecurity / .bSecurity (-a.security / -b.security) before LoadClientTLSFromFile. - filer_cat.go: resolve *filerCat.output (-o) before os.OpenFile. - admin.go: drop trailing blank line at EOF (git diff --check). * address PR review: resolve -a.security/-b.security/-config before use Three follow-up fixes: - filer_sync.go: the -a.security / -b.security resolves were placed *after* LoadClientTLSFromFile / LoadHTTPClientFromFile were called, so weed filer.sync -a.security=~/a.toml still passed the literal tilde path. Hoist the resolves above the security-loading block so TLS clients see expanded paths. - filer_sync_verify.go: same flag pair was never resolved at all in the verify command; resolve at the top of runFilerSyncVerify. - filer_meta_backup.go: -config (the backup_filer.toml path) was passed directly to viper. Resolve at the top of runFilerMetaBackup. - mini.go: master.dir defaulted to the entire comma-joined miniDataFolders. With weed mini -dir=~/d1,~/d2 (or any multi-dir setup), TestFolderWritable then stat'd the joined string instead of a single directory. Default to the first entry via StringSplit to mirror the disk-space calculation a few lines below, and drop the now-redundant ResolvePath in TestFolderWritable. |
||
|
|
0b3cc8d121 | 4.22 | ||
|
|
c934b5dab6 |
fix(credential/postgres,s3api/iam): rename safety + pgxutil follow-ups to #9226 (#9230)
* refactor(util): extract pgx OpenDB + DSN builder into shared pgxutil
The postgres filer store had OpenPGXDB plus duplicated key=value DSN
assembly across postgres/ and postgres2/. Move the connection helper to
weed/util/pgxutil and add BuildDSN so the credential postgres store can
land on the same code path.
filer/postgres/pgx_conn.go keeps OpenPGXDB as a thin alias so postgres2
keeps building unchanged.
* refactor(credential/postgres): use shared pgxutil for connection setup
Replace the bespoke fmt.Sprintf DSN + sql.Open("pgx", ...) path with
pgxutil.BuildDSN + pgxutil.OpenDB so the credential store mirrors the
postgres filer store. This also drops the leaky RegisterConnConfig-style
init in favor of stdlib.OpenDB(*config), which doesn't accumulate
entries in the global pgx config map.
Adds parity knobs the filer store already exposes: sslcrl, and
configurable connection_max_idle / connection_max_open /
connection_max_lifetime_seconds (with the previous hardcoded 25/5/5min
as defaults). Also moves the jsonbParam helper here so other store
files can reuse it. (Helper is also referenced by postgres_identity.go,
which is migrated to it in the next commit.)
* refactor(credential/postgres): use jsonbParam helper across all writers
Consolidate JSONB write handling on the new pgxutil-adjacent helper
jsonbParam(b []byte) interface{}, which returns nil (driver writes SQL
NULL) when the marshaled JSON is empty and string(b) otherwise.
postgres_identity.go: replace the inline 'var fooParam any' /
'fooParam = string(b)' pattern with the helper. Same in CreateUser
and UpdateUser.
postgres_inline_policy.go, postgres_policy.go, postgres_service_account.go,
postgres_group.go: every JSONB writer was still passing []byte. Under
pgx simple_protocol (pgbouncer_compatible=true), []byte is encoded as
bytea and Postgres rejects that against a JSONB column with "invalid
input syntax for type json". Route them through jsonbParam too.
* fix(credential/postgres): rework SaveConfiguration to handle rename + UNIQUE access keys
The IAM rename path (s3api UpdateUser) renames an identity in place
and keeps its access keys. With the previous flow — upsert each user,
then per-user delete-and-insert credentials, then prune absent users —
the renamed user's access keys were still owned by the old row when
the INSERT for the new name ran, tripping credentials.access_key's
global UNIQUE constraint and failing every rename of a user with
credentials.
Reorder the SaveConfiguration body so the prune step runs BEFORE the
credential replace. CASCADE on the old user releases its access keys
in the same transaction, and the new name can then claim them.
While here:
- Replace the per-user loop DELETE FROM users WHERE username = $1 with
a single DELETE ... WHERE username = ANY($1), one round trip instead
of N inside the transaction.
- Surface inline-policy CASCADE losses: count user_inline_policies for
the prune set and emit a Warningf when the count is non-zero so
rename-driven drops are visible in operator logs (the structural
fix for renames lives at the IAM layer in a follow-up commit).
- Two-pass credential replace: clear credentials for every user we are
about to rewrite first, then insert, so an access key can be moved
between two users in the same SaveConfiguration call.
- credErr := credRows.Err() before credRows.Close() in
LoadConfiguration — Err() is documented as safe after Close, but
the leading-capture pattern matches the rest of the file.
* fix(s3api/iam): preserve inline policies when renaming a user
EmbeddedIamApi.UpdateUser renames an identity in place and the caller
persists via SaveConfiguration, which prunes the old username and
CASCADE-drops its rows from user_inline_policies. GetUserPolicy and
ListUserPolicies then return nothing under the new name even though
the API reported success — silent data loss.
Before flipping sourceIdent.Name, list the user's stored inline
policies and re-attach each one under the new name. The subsequent
SaveConfiguration prune still CASCADE-removes the old-name rows; only
the duplicates we just wrote under the new name survive. Adds a
regression test that puts a policy on the old name, renames, and
asserts the policy is readable under the new name.
* perf(credential/postgres): batch the credential clear in SaveConfiguration
The two-pass credential replace was clearing each incoming user's
credentials with its own DELETE statement — N round-trips inside the
transaction. Match the pattern already used for the user prune and
issue a single DELETE FROM credentials WHERE username = ANY($1)
instead.
* refactor(s3api/iam): plumb context through UpdateUser
UpdateUser was synthesizing a fresh context.Background() inside the
inline-policy migration block, which discards the request deadline,
cancellation, and tracing carried by the caller. Add ctx as the first
parameter and pass r.Context() in via the ExecuteAction dispatcher,
mirroring the signature already used by CreatePolicy /
AttachUserPolicy / DetachUserPolicy.
* fix(util/pgxutil): quote DSN values per libpq rules
BuildDSN was concatenating values directly, so any password / cert path
/ database name with a space, single quote, or backslash produced a
malformed connection string and pgx.ParseConfig either errored or
mis-parsed the remainder. Critical now that the helper is shared with
the credential store: mTLS deployments routinely sourcing passwords or
secret-mounted cert paths from a vault are exactly the case where
spaces and quotes show up.
Add quoteDSNValue: empty values and values containing whitespace, `'`,
or `\` are wrapped in single quotes with `'` and `\` escaped per
PostgreSQL libpq rules; plain alphanumeric values pass through
unchanged. Apply it to every variable field in BuildDSN.
Adds a test that round-trips a password containing spaces, quotes and
backslashes through pgx.ParseConfig and confirms the parsed Config
matches the input.
* fix(credential,s3api/iam): atomic UserRenamer to avoid FK violation on rename
The previous IAM rename path called PutUserInlinePolicy(newName, ...)
before SaveConfiguration created the new users row. user_inline_policies
has a non-deferrable FOREIGN KEY (username) REFERENCES users(username),
which Postgres validates at statement time, so every rename of a user
that owned at least one inline policy failed with an FK violation. The
existing memory-store regression test missed it because the memory
backend has no FK enforcement.
Add an optional credential.UserRenamer interface plus a
CredentialManager.RenameUser thin shim that returns (supported, err).
Implement it on PostgresStore as an atomic in-transaction migration:
INSERT the new users row by SELECT-copying from the old, UPDATE
credentials.username and user_inline_policies.username to the new
name (FK satisfied because both rows now exist), then DELETE the old
row. ErrUserNotFound / ErrUserAlreadyExists are surfaced cleanly.
Implement it on MemoryStore by re-binding store.users / store.accessKeys
/ store.inlinePolicies under the new name. Also fixes a small leak in
DeleteUser, which was forgetting to drop the user's inline-policy
bucket.
EmbeddedIamApi.UpdateUser now calls RenameUser first; if the store
implements the interface, that's the whole migration. If it doesn't
(stores without FK enforcement), fall back to the previous
list / get / put copy.
Adds a focused test for MemoryStore.RenameUser that asserts the
identity, the access-key index, and the inline policies all land
under the new name.
|
||
|
|
da2e90aefd |
fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9207)
* fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9139) A single file with invalid-UTF-8 bytes in its name (e.g. a GNOME Trash "partial" like \x10\x98=\\\x8a\x7f.trashinfo.9a51454f.partial) made every FUSE-initiated filer RPC fail with: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 and then produced an avalanche of "connection is closing" errors on unrelated LookupEntry / ReadDirAll / UpdateEntry calls, causing the volume-server QPS dips reported in #9139. Root cause is twofold: 1. Proto3 `string` fields require valid UTF-8, but the FUSE kernel passes raw name bytes. Create/Mknod/Mkdir/Unlink/Rmdir/Rename/Lookup/Link/ Symlink all forwarded those bytes directly into CreateEntryRequest.Name, DeleteEntryRequest.Name, StreamRenameEntryRequest.{Old,New}Name and Entry.Name. saveDataAsChunk also copied the FullPath into AssignVolumeRequest.Path unchecked. 2. When the marshal failed, shouldInvalidateConnection treated the resulting codes.Internal as a connection problem and dropped the shared cached ClientConn — canceling every other in-flight RPC on it. Fix: - Add sanitizeFuseName (strings.ToValidUTF8 with '?' replacement, matching util.FullPath.DirAndName) and make checkName return the sanitized name. Apply at every FUSE entry point that passes a name to the filer RPC, including Unlink/Rmdir (which did not previously call checkName) and both oldName/newName in Rename. Add a backstop scrub for AssignVolumeRequest.Path so async flush paths cannot reintroduce invalid bytes from a pre-sanitization cached FullPath. - In weed/pb.shouldInvalidateConnection, detect client-side marshal errors via the gRPC library's "error while marshaling" prefix and return false: the connection is healthy, only the request is bad. Refs: https://github.com/seaweedfs/seaweedfs/issues/9139#issuecomment-4301184231 * fix(mount,util): use '_' for invalid-UTF-8 replacement (URL-safe) Sanitized filenames flow downstream into HTTP URLs (volume-server uploads, filer HTTP API, S3/WebDAV gateways). '?' is the URL query-string delimiter and would split the path the first time the name lands in one, so swap every invalid-UTF-8 replacement to '_'. This covers the two pre-existing sites in weed/util/fullpath.go as well, keeping all paths sanitized the same way. * refactor(pb): detect client-side marshal errors via errors.As, not substring Replace the raw `strings.Contains(err.Error(), ...)` check with a type-based carve-out: use errors.As against the `GRPCStatus() *Status` interface to pull the original Status out of any fmt.Errorf("...: %w") wrapping, then match the library-owned "grpc:" prefix on that Status's Message. Why not errors.Is against a proto-level sentinel: gRPC's encode() collapses the inner proto error with "%v" (stringification) before wrapping it in a Status, so the original error type does not survive into the caller. The Status itself is the structural signal that does survive. Why not status.FromError: when the caller wraps the Status error with fmt.Errorf("...: %w", ...), status.FromError rewrites Status.Message with the full err.Error() of the outermost wrapper, which defeats a prefix check on the library-owned message. errors.As gives us the original Status whose Message is still verbatim from the gRPC library. A new test asserts that a plain errors.New("grpc: error while marshaling: …") — i.e. the same text attached to something that is NOT a gRPC status — does not short-circuit invalidation, so we never silently keep a cached connection alive based on a coincidental substring match. * refactor(util): centralize UTF-8 sanitization; add FullPath.Sanitized Addresses review feedback on PR #9207. Nitpick: every invalid-UTF-8 replacement across the codebase (DirAndName, Name, mount.sanitizeFuseName, the weedfs_write.go backstop) now goes through a single util.SanitizeUTF8Name helper, so the replacement char ('_' — URL-safe) is chosen in one place. Outside-diff: three proto fields took raw FullPath strings that could break marshaling if an entry ever carried invalid UTF-8 (CreateEntryRequest.Directory in Mkdir, DeleteEntryRequest.Directory in Unlink, AssignVolumeRequest.Path in command_fs_merge_volumes). The reviewer's suggested fix — using DirAndName() — would have silently changed Directory from parent to grandparent, because DirAndName sanitizes only the trailing component. Added FullPath.Sanitized(), which scrubs every component, and applied it at the three sites. Exposure is narrow in practice (FUSE-boundary sanitization and the gRPC-side isClientSideMarshalError carve-out already cover the #9139 cascade), but the defense-in-depth is cheap and consistent with the existing AssignVolume backstop. New tests in weed/util/fullpath_test.go document: - SanitizeUTF8Name: valid UTF-8 passes through unchanged; invalid bytes become '_' (not '?', which is URL-special). - FullPath.Sanitized: scrubs bytes in any component, not just the last. - FullPath.DirAndName: dir remains raw on purpose — callers needing a clean full path must use Sanitized(). The test pins this behavior so it is not accidentally "fixed" in a way that changes the (dir, name) semantics callers depend on. |
||
|
|
9ae905e456 |
feat(security): hot-reload HTTPS certs without restart (k8s cert-manager) (#9181)
* feat(security): hot-reload HTTPS certs for master/volume/filer/webdav/admin S3 and filer already use a refreshing pemfile provider for their HTTPS cert, so rotated certificates (e.g. from k8s cert-manager) are picked up without a restart. Master, volume, webdav, and admin, however, passed cert/key paths straight to ServeTLS/ListenAndServeTLS and loaded once at startup — rotating those certs required a pod restart. Add a small helper NewReloadingServerCertificate in weed/security that wraps pemfile.Provider and returns a tls.Config.GetCertificate closure, then wire it into the four remaining HTTPS entry points. httpdown now also calls ServeTLS when TLSConfig carries a GetCertificate/Certificates but CertFile/KeyFile are empty, so volume server can pre-populate TLSConfig. A unit test exercises the rotation path (write cert, rotate on disk, assert the callback returns the new cert) with a short refresh window. * refactor(security): route filer/s3 HTTPS through the shared cert reloader Before: filer.go and s3.go each kept a *certprovider.Provider on the options struct plus a duplicated GetCertificateWithUpdate method. Both were loading pemfile themselves. Behaviorally they already reloaded, but the logic was duplicated two ways and neither path was shared with the newly-added master/volume/webdav/admin wiring. After: both use security.NewReloadingServerCertificate like the other servers. The per-struct certProvider field and GetCertificateWithUpdate method are removed, along with the now-unused certprovider and pemfile imports. Net: -32 lines, one code path for all HTTPS cert reloading. No behavior change — the refresh window, cache, and handshake contract are identical (the helper wraps the same pemfile.NewProvider). * feat(security): hot-reload HTTPS client certs for mount/backup/upload/etc The HTTP client in weed/util/http/client loaded the mTLS client cert once at startup via tls.LoadX509KeyPair. That left every long-lived HTTPS client process (weed mount, backup, filer.copy, filer→volume, s3→filer/volume) unable to pick up a rotated client cert without a restart — even though the same cert-manager setup was already rotating the server side fine. Swap the client cert loader for a tls.Config.GetClientCertificate callback backed by the same refreshing pemfile provider. New TLS handshakes pick up the rotated cert; in-flight pooled connections keep their old cert and drop as normal transport churn happens. To keep this reusable from both server and client TLS code without an import cycle (weed/security already imports weed/util/http/client for LoadHTTPClientFromFile), extract the pemfile wrapper into a new weed/security/certreload subpackage. weed/security keeps its thin NewReloadingServerCertificate wrapper. The existing unit test moves with the implementation. gRPC mTLS was already handled by security.LoadServerTLS / LoadClientTLS; this PR does not change any gRPC paths. MQ broker, MQ agent, Kafka gateway, and FUSE mount control plane are gRPC-only and therefore already rotate. CA bundles (ClientCAs / RootCAs / grpc.ca) are still loaded once — noted as a known limitation in the wiki. * fix(security): address PR review feedback on cert reloader Bots (gemini-code-assist + coderabbit) flagged three real issues and a couple of nits. Addressing them here: 1. KeyMaterial used context.Background(). The grpc pemfile provider's KeyMaterial blocks until material arrives or the context deadline expires; with Background() a slow disk could hang the TLS handshake indefinitely. Switched both the server and client callbacks to use hello.Context() / cri.Context() so a stuck read is bounded by the handshake timeout. 2. Admin server loaded TLS inside the serve goroutine. If the cert was bad, the goroutine returned but startAdminServer kept blocking on <-ctx.Done() with no listener, making the process look healthy with nothing bound. Moved TLS setup to run before the goroutine starts and propagate errors via fmt.Errorf; also captures the provider and defers Close(). 3. HTTP client discarded the certprovider.Provider from NewClientGetCertificate. That leaked the refresh goroutine, and NewHttpClientWithTLS had a worse case where a CA-file failure after provider creation orphaned the provider entirely. Added a certProvider field and a Close() method on HTTPClient, and made the constructors close the provider on subsequent error paths. 4. Server-side paths (master/volume/filer/s3/webdav/admin) now retain the provider. filer and webdav run ServeTLS synchronously, so a plain defer works. master/volume/s3 dispatch goroutines and return while the server keeps running, so they hook Close() into grace.OnInterrupt. 5. Test: certreload_test now tolerates transient read/parse errors during file rotation (writeSelfSigned rewrites cert before key) and reports the last error only if the deadline expires. No user-visible behavior change for the happy path. * test(tls): add end-to-end HTTPS cert rotation integration test Boots a real `weed master` with HTTPS enabled, captures the leaf cert served at TLS handshake time, atomically rewrites the cert/key files on disk (the same rename-in-place pattern kubelet does when it swaps a cert-manager Secret), and asserts that a subsequent TLS handshake observes the rotated leaf — with no process restart, no SIGHUP, no reloader sidecar. Verifies the full path: on-disk change → pemfile refresh tick → provider.KeyMaterial → tls.Config.GetCertificate → server TLS handshake. Runtime is ~1s by exposing the reloader's refresh window as an env var (WEED_TLS_CERT_REFRESH_INTERVAL) and setting it to 500ms for the test. The same env var is user-facing — documented in the wiki — so operators running short-lived certs (Vault, cert-manager with duration: 24h, etc.) can tighten the rotation-pickup window without a rebuild. Defaults to 5h to preserve prior behavior. security.CredRefreshingInterval is kept for API compatibility but now aliases certreload.DefaultRefreshInterval so the same env controls both gRPC mTLS and HTTPS reload. * ci(tls): wire the TLS rotation integration test into GitHub Actions Mirrors the existing vacuum-integration-tests.yml shape: Ubuntu runner, Go 1.25, build weed, run `go test` in test/tls_rotation, upload master logs on failure. 10-minute job timeout; the test itself finishes in about a second because WEED_TLS_CERT_REFRESH_INTERVAL is set to 500ms inside the test. Runs on every push to master and on every PR to master. * fix(tls): address follow-up PR review comments Three new comments on the integration test + volume shutdown path: 1. Test: peekServerCert was swallowing every dial/handshake error, which meant waitForCert's "last err: <nil>" fatal message lost all diagnostic value. Thread errors back through: peekServerCert now returns (*x509.Certificate, error), and waitForCert records the latest error so a CI flake points at the actual cause (master didn't come up, handshake rejected, CA pool mismatch, etc.). 2. Test: set HOME=<tempdir> on the master subprocess. Viper today registers the literal path "$HOME/.seaweedfs" without env expansion, so a developer's ~/.seaweedfs/security.toml is accidentally invisible — the test was relying on that. Pinning HOME is belt-and-braces against a future viper upgrade that does expand env vars. 3. volume.go: startClusterHttpService's provider close was registered via grace.OnInterrupt, which fires on SIGTERM but NOT on the v.shutdownCtx.Done() path used by mini / integration tests. The pemfile refresh goroutine leaked in that shutdown path. Now the helper returns a close func and the caller invokes it on BOTH shutdown paths for parity. Also add MinVersion: TLS 1.2 to the test's tls.Config to quiet the ast-grep static-analysis nit — zero-risk since the pool only trusts our in-memory CA. Test runs clean 3/3. |
||
|
|
d2e64e85ce |
fix(log_buffer): back off disk-poll cadence when caught up to disk head (#9161)
Idle subscribers parked in the ResumeFromDiskError path were re-probing the in-memory buffer and disk every 250ms, emitting a "Notification timeout after ResumeFromDiskError, rechecking state" log line per tick even when nothing had changed. Once ReadFromDiskFn returns without advancing lastReadPosition, we know the disk has no data past the subscriber's current position. Switch to a 2s poll in that state so external disk writers (e.g. Schema Registry) are still re-detected on a bounded cadence, but idle CPU and log noise drop ~8x. Any progress (disk advance or notifyChan wakeup) clears the flag and restores the responsive 250ms tick for active readers. The per-timeout V(4) log is replaced by a single "Caught up to disk head" transition log. |
||
|
|
3ff92f797d | 4.21 | ||
|
|
45578a42e9 |
fix(volume): keep vacuum running past dangling .idx entries (#9115)
* fix(volume): keep vacuum running past dangling .idx entries Vacuum compaction aborted entirely on the first .idx entry whose offset pointed past the end of the .dat file, surfacing as `cannot hydrate needle from file: EOF` and stalling progress on every other volume. In both Go and Rust: - During compaction, skip an unreadable needle and continue. The bytes it pointed at were already unreachable via reads, so dropping the index reference makes the post-vacuum volume consistent. Real EIO still bails out so a disk fault is not silently papered over. - At volume load, do a single linear scan of the .idx and confirm every (offset + actual size) fits inside .dat. The pre-existing integrity check only looked at the last 10 entries, so deeper corruption (e.g. left over from a crashed batched write) went undetected and only surfaced later as a vacuum EOF. A failure now marks the volume read-only at load time so an operator can react. Refs #8928 * fix(volume): only skip permanent-corruption needle reads during vacuum Address PR review feedback (gemini-code-assist + coderabbit): The original patch skipped any non-EIO read failure, which would silently drop needles on transient errors — Windows hardware bad-sector errors (ERROR_CRC etc.) never surface as syscall.EIO; tiered-storage network timeouts and EROFS would also slip through and shrink the volume. Switch to an explicit whitelist of permanent-corruption shapes: - Add needle.ErrorCorrupted sentinel and wrap CRC and "index out of range" errors with %w so callers can match via errors.Is. - copyDataBasedOnIndexFile now skips only when the read failure is io.EOF, io.ErrUnexpectedEOF, ErrorSizeMismatch, ErrorSizeInvalid, or ErrorCorrupted. Anything else (real disk faults, environmental errors, Windows hardware codes) aborts the compaction so an operator notices. - Mirror the same whitelist in the Rust volume server, matching on io::ErrorKind::UnexpectedEof and the NeedleError corruption variants (SizeMismatch, CrcMismatch, IndexOutOfRange, TailTooShort). Also add `defer v.Close()` in TestVerifyIndexFitsInDat so Windows t.TempDir() cleanup can release the .dat/.idx handles. Refs #8928 * fix(volume): wrap entry-not-found size-mismatch with ErrorSizeMismatch Address PR review: the fallback branch in ReadBytes returned an unwrapped fmt.Errorf, so isSkippableNeedleReadError (and any caller using errors.Is(..., ErrorSizeMismatch)) could not match it. Wrap with %w so the whitelist applies, while leaving the existing direct sentinel return for the OffsetSize==4 / offset<MaxPossibleVolumeSize retry path unchanged so ReadData's `err == ErrorSizeMismatch` retry still triggers. Refs #8928 * fix(volume): integrate dangling-idx check into existing index load walk Address PR review (gemini-code-assist, medium): the structural .idx check used to do a second linear scan of the index file at every volume load, doubling the disk-I/O cost on servers managing many volumes. Track the largest (offset + actual size) seen during the existing needle-map load walks (`LoadCompactNeedleMap`, `NewLevelDbNeedleMap`, `NewSortedFileNeedleMap`'s `newNeedleMapMetricFromIndexFile`, `DoOffsetLoading`) on a new `MaximumNeedleEnd` field on `mapMetric`, exposed as `MaxNeedleEnd()` on the NeedleMapper interface. `volume.load()` then compares `nm.MaxNeedleEnd()` to the .dat size after the load is complete — pure numeric comparison, no extra I/O. The standalone `verifyIndexFitsInDat` helper and its caller in `CheckVolumeDataIntegrity` are removed; the test that used to drive the helper directly now exercises the new path via `LoadCompactNeedleMap`. Mirror the same change in the Rust volume server: track `max_needle_end` on `NeedleMapMetric`, expose via `max_needle_end()` on `CompactNeedleMap`, `RedbNeedleMap`, and the `NeedleMap` enum. The Rust load walk already happens in `load_from_idx` for both map kinds, so the structural check becomes free. Refs #8928 |
||
|
|
9d15705c16 |
fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C (#9112)
* fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C Interrupts fired grace hooks in registration order, so master (started first) shut down before its clients, producing heartbeat-canceled errors and masterClient reconnection noise during weed mini shutdown. Admin/s3/ webdav had no interrupt hooks at all and were killed at os.Exit. - grace: execute interrupt hooks in LIFO (defer-style) order so later- started services tear down first. - filer: consolidate the three separate interrupt hooks (gRPC / HTTP / DB) into one that runs in order, so filer shutdown stays correct independent of FIFO/LIFO semantics. - mini: add MiniClientsShutdownCtx (separate from test-facing MiniClusterCtx) plus an OnMiniClientsShutdown helper. Admin, S3, WebDAV and the maintenance worker observe it; runMini registers a cancel hook after startup so under LIFO it fires first and waits up to 10s on a WaitGroup for those services to drain before filer, volume, and master shut down. Resulting order on Ctrl+C: admin/s3/webdav/worker -> filer (gRPC -> HTTP -> DB) -> volume -> master. * refactor(mini): group mini-client shutdown into one state struct The first pass spread the shutdown plumbing across three globals (MiniClientsShutdownCtx, miniClientsWg, cancelMiniClients) and two ctx-derivation sites (OnMiniClientsShutdown and startMiniAdminWithWorker). Group into a private miniClientsState (ctx/cancel/wg) rebuilt per runMini invocation, and chain its ctx from MiniClusterCtx so clients only observe one signal. Tests that cancel MiniClusterCtx still trigger client shutdown via parent-child propagation. - resetMiniClients() installs fresh state at the top of runMini, so in-process test reruns don't inherit stale ctx/wg. - onMiniClientsShutdown(fn) replaces the exported OnMiniClientsShutdown and only observes one ctx. - trackMiniClient() replaces the manual wg.Add/Done dance for the admin goroutine. - miniClientsCtx() gives the admin startup a ctx without re-deriving. - triggerMiniClientsShutdown(timeout) is the interrupt hook body. No behaviour change; existing tests pass. * refactor: generalize shutdown ctx as an option, not a mini-specific helper Several service files (s3, webdav, filer, master, volume) observed the mini-specific MiniClusterCtx or called onMiniClientsShutdown directly. That leaked mini orchestration into code that also runs under weed s3, weed webdav, weed filer, weed master, and weed volume standalone. Replace with a generic `shutdownCtx context.Context` field on each service's Options struct. When non-nil, the server watches it and shuts down gracefully; when nil (standalone), the shutdown path is a no-op. Mini wires the contexts up from a single place (runMini): - miniMasterOptions/miniOptions.v/miniFilerOptions.shutdownCtx = MiniClusterCtx (drives test-triggered teardown) - miniS3Options/miniWebDavOptions.shutdownCtx = miniClientsCtx() (drives Ctrl+C teardown before filer/volume/master) All knowledge of MiniClusterCtx now lives in mini.go. * fix(mini): stop worker before clients ctx so admin shutdown isn't blocked Symptom on Ctrl+C of a clean weed mini: mini's Shutting down admin/s3/ webdav hook sat for 10s then logged "timed out". Admin had started its shutdown but was blocked inside StopWorkerGrpcServer's GracefulStop, waiting for the still-connected worker stream. That in turn left filer clients connected and cascaded into filer's own 10s gRPC graceful-stop timeout. Two causes, both fixed: 1. worker.Stop() deadlocked on clean shutdown. It sent ActionStop (which makes managerLoop `break out` and exit), then called getTaskLoad() which sends to the same unbuffered cmd channel — no receiver, hangs forever. Reorder Stop() to snapshot the admin client and drain tasks BEFORE sending ActionStop, and call Disconnect() via the local snapshot afterwards. 2. Worker's taskRequestLoop raced with Disconnect(): RequestTask reads from c.incoming, which Disconnect closes, yielding a nil response and a panic on response.Message. Handle the closed channel explicitly. 3. Mini now has a preCancel phase (beforeMiniClientsShutdown) that runs synchronously BEFORE the clients ctx is cancelled. Register worker shutdown there so admin's worker-gRPC GracefulStop finds the worker already disconnected and returns immediately, instead of waiting on a stream that is about to close anyway. Observed shutdown of a clean mini: admin/s3/webdav down in <10ms; full process exit in ~11s (the remaining 10s is a pre-existing filer gRPC graceful-stop timeout, not cascaded from the clients tier). * feat(mini): cap filer gRPC graceful stop at 1s under weed mini Full weed mini shutdown was ~11s on a clean exit, dominated by the filer's default 10s gRPC GracefulStop timeout while background SubscribeLocalMetadata streams drained. Expose the timeout as a FilerOptions.gracefulStopTimeout field (default 10s for standalone weed filer) and set it to 1s in mini. Clean weed mini shutdown now takes ~2s. |
||
|
|
886d50a6a5 |
feat(mount): singleflight dedup for concurrent chunk reads (#9100)
* feat(mount): add singleflight deduplication for concurrent chunk reads When multiple FUSE readers request the same uncached chunk concurrently, only one network fetch is performed. Other readers wait and share the downloaded data, reducing redundant volume server traffic under parallel read workloads. * fix(util): make singleflight panic-safe with defer cleanup If the provided function panics, the WaitGroup and map entry are now cleaned up via defer, preventing other waiters from hanging forever. * fix(filer): remove singleflight from reader_cache to fix buffer ownership The singleflight wrapper around chunk fetches returned the same []byte buffer to concurrent callers. Since each SingleChunkCacher owns and frees its data buffer in destroy(), sharing the same slice would cause a use-after-free or double-free with the mem allocator. The downloaders map already deduplicates in-flight downloads for the same fileId, so the singleflight was redundant at this layer. The SingleFlightGroup utility is retained for use elsewhere. |
||
|
|
e1fa4ec756 |
perf(cache): drop OS page cache after disk cache reads (#9098)
* perf(cache): drop OS page cache after disk cache reads After reading from the on-disk chunk cache, advise the kernel via FADV_DONTNEED to release the corresponding page cache pages. This prevents double-caching the same data in both user-space and kernel page caches, freeing RAM for other uses on systems with large disk caches. * fix(cache): guard dropReadCache against zero length and invalid fd A zero-length fadvise is interpreted as "to end of file" on Linux, which would inadvertently drop the page cache for the entire remainder of the cache volume. Also check fd >= 0 to avoid unnecessary syscalls when the backend file is closed. * perf(cache): only apply FADV_DONTNEED for reads >= 1 MiB For small needle reads the syscall overhead outweighs the memory savings, and the kernel page cache is more beneficial for warm data. Restrict fadvise to reads of at least 1 MiB where the freed page cache is meaningful. |
||
|
|
08d9193fe1 |
[nfs] Add NFS (#9067)
* add filer inode foundation for nfs
* nfs command skeleton
* add filer inode index foundation for nfs
* make nfs inode index hardlink aware
* add nfs filehandle and inode lookup plumbing
* add read-only nfs frontend foundation
* add nfs namespace mutation support
* add chunk-backed nfs write path
* add nfs protocol integration tests
* add stale handle nfs coverage
* complete nfs hardlink and failover coverage
* add nfs export access controls
* add nfs metadata cache invalidation
* fix nfs chunk read lookup routing
* fix nfs review findings and rename regression
* address pr 9067 review comments
- filer_inode: fail fast if the snowflake sequencer cannot start, and let
operators override the 10-bit node id via SEAWEEDFS_FILER_SNOWFLAKE_ID
to avoid multi-filer collisions
- filer_inode: drop the redundant retry loop in nextInode
- filerstore_wrapper: treat inode-index writes/removals as best-effort so
a primary store success no longer surfaces as an operation failure
- filer_grpc_server_rename: defer overwritten-target chunk deletion until
after CommitTransaction so a rolled-back rename does not strand live
metadata pointing at freshly deleted chunks
- command/nfs: default ip.bind to loopback and require an explicit
filer.path, so the experimental server does not expose the entire
filer namespace on first run
- nfs integration_test: document why LinkArgs matches go-nfs's on-the-wire
layout rather than RFC 1813 LINK3args
* mount: pre-allocate inode in Mkdir and Symlink
Mkdir and Symlink used to send filer_pb.CreateEntryRequest with
Attributes.Inode = 0. After PR 9067, the filer's CreateEntry now assigns
its own inode in that case, so the filer-side entry ends up with a
different inode than the one the mount allocates via inodeToPath.Lookup
and returns to the kernel. Once applyLocalMetadataEvent stores the
filer's entry in the meta cache, subsequent GetAttr calls read the
cached entry and hit the setAttrByPbEntry override at line 197 of
weedfs_attr.go, returning the filer-assigned inode instead of the
mount's local one. pjdfstest tests/rename/00.t (subtests 81/87/91)
caught this — it lstat'd a freshly-created directory/symlink, renamed
it, lstat'd again, and saw a different inode the second time.
createRegularFile already pre-allocates via inodeToPath.AllocateInode
and stamps it into the create request. Do the same thing in Mkdir and
Symlink so both sides agree on the object identity from the very first
request, and so GetAttr's cache path returns the same value as Mkdir /
Symlink's initial response.
* sequence: mask snowflake node id on int→uint32 conversion
CodeQL flagged the unchecked uint32(snowflakeId) cast in
NewSnowflakeSequencer as a potential truncation bug when snowflakeId is
sourced from user input (e.g. via SEAWEEDFS_FILER_SNOWFLAKE_ID). Mask
to the 10 bits the snowflake library actually uses so any caller-
supplied int is safely clamped into range.
* add test/nfs integration suite
Boots a real SeaweedFS cluster (master + volume + filer) plus the
experimental `weed nfs` frontend as subprocesses and drives it through
the NFSv3 wire protocol via go-nfs-client, mirroring the layout of
test/sftp. The tests run without a kernel NFS mount, privileged ports,
or any platform-specific tooling.
Coverage includes read/write round-trip, mkdir/rmdir, nested
directories, rename content preservation, overwrite + explicit
truncate, 3 MiB binary file, all-byte binary and empty files, symlink
round-trip, ReadDirPlus listing, missing-path remove, FSInfo sanity,
sequential appends, and readdir-after-remove.
Framework notes:
- Picks ephemeral ports with net.Listen("127.0.0.1:0") and passes
-port.grpc explicitly so the default port+10000 convention cannot
overflow uint16 on macOS.
- Pre-creates the /nfs_export directory via the filer HTTP API before
starting the NFS server — the NFS server's ensureIndexedEntry check
requires the export root to exist with a real entry, which filer.Root
does not satisfy when the export path is "/".
- Reuses the same rpc.Client for mount and target so go-nfs-client does
not try to re-dial via portmapper (which concatenates ":111" onto the
address).
* ci: add NFS integration test workflow
Mirror test/sftp's workflow for the new test/nfs suite so PRs that touch
the NFS server, the inode filer plumbing it depends on, or the test
harness itself run the 14 NFSv3-over-RPC integration tests on Ubuntu
22.04 via `make test`.
* nfs: use append for buffer growth in Write and Truncate
The previous make+copy pattern reallocated the full buffer on every
extending write or truncate, giving O(N^2) behaviour for sequential
write loops. Switching to `append(f.content, make([]byte, delta)...)`
lets Go's amortized growth strategy absorb the repeated extensions.
Called out by gemini-code-assist on PR 9067.
* filer: honor caller cancellation in collectInodeIndexEntries
Dropping the WithoutCancel wrapper lets DeleteFolderChildren bail out of
the inode-index scan if the client disconnects mid-walk. The cleanup is
already treated as best-effort by the caller (it logs on error and
continues), so a cancelled walk just means the partial index rebuild is
skipped — the same failure mode as any other index write error.
Flagged as a DoS concern by gemini-code-assist on PR 9067.
* nfs: skip filer read on open when O_TRUNC is set
openFile used to unconditionally loadWritableContent for every writable
open and then discard the buffer if O_TRUNC was set. For large files
that is a pointless 64 MiB round-trip. Reorder the branches so we only
fetch existing content when the caller intends to keep it, and mark the
file dirty right away so the subsequent Close still issues the
truncating write. Called out by gemini-code-assist on PR 9067.
* nfs: allow Seek on O_APPEND files and document buffered write cap
Two related cleanups on filesystem.go:
- POSIX only restricts Write on an O_APPEND fd, not lseek. The existing
Seek error ("append-only file descriptors may only seek to EOF")
prevented read-and-write workloads that legitimately reposition the
read cursor. Write already snaps the offset to EOF before persisting
(see seaweedFile Write), so Seek can unconditionally accept any
offset. Update the unit test that was asserting the old behaviour.
- Add a doc comment on maxBufferedWriteSize explaining that it is a
per-file ceiling, the memory footprint it implies, and that the real
fix for larger whole-file rewrites is streaming / multi-chunk support.
Both changes flagged by gemini-code-assist on PR 9067.
* nfs: guard offset before casting to int in Write
CodeQL flagged `int(f.offset) + len(p)` inside the Write growth path as
a potential overflow on architectures where `int` is 32-bit. The
existing check only bounded the post-cast value, which is too late.
Clamp f.offset against maxBufferedWriteSize before the cast and also
reject negative/overflowed endOffset results. Both branches fall
through to billy.ErrNotSupported, the same behaviour the caller gets
today for any out-of-range buffered write.
* nfs: compute Write endOffset in int64 to satisfy CodeQL
The previous guard bounded f.offset but left len(p) unchecked, so
CodeQL still flagged `int(f.offset) + len(p)` as a possible int-width
overflow path. Bound len(p) against maxBufferedWriteSize first, do the
addition in int64, and only cast down after the total has been clamped
against the buffer ceiling. Behaviour is unchanged: any out-of-range
write still returns billy.ErrNotSupported.
* ci: drop emojis from nfs-tests workflow summary
Plain-text step summary per user preference — no decorative glyphs in
the NFS CI output or checklist.
* nfs: annotate remaining DEV_PLAN TODOs with status
Three of the unchecked items are genuine follow-up PRs rather than
missing work in this one, and one was actually already done:
- Reuse chunk cache and mutation stream helpers without FUSE deps:
checked off — the NFS server imports weed/filer.ReaderCache and
weed/util/chunk_cache directly with no weed/mount or go-fuse imports.
- Extract shared read/write helpers from mount/WebDAV/SFTP: annotated
as deferred to a separate refactor PR (touches four packages).
- Expand direct data-path writes beyond the 64 MiB buffered fallback:
annotated as deferred — requires a streaming WRITE path.
- Shared lock state + lock tests: annotated as blocked upstream on
go-nfs's missing NLM/NFSv4 lock state RPCs, matching the existing
"Current Blockers" note.
* test/nfs: share port+readiness helpers with test/testutil
Drop the per-suite mustPickFreePort and waitForService re-implementations
in favor of testutil.MustAllocatePorts (atomic batch allocation; no
close-then-hope race) and testutil.WaitForPort / SeaweedMiniStartupTimeout.
Pull testutil in via a local replace directive so this standalone
seaweedfs-nfs-tests module can import the in-repo package without a
separate release.
Subprocess startup is still master + volume + filer + nfs — no switch to
weed mini yet, since mini does not know about the nfs frontend.
* nfs: stream writes to volume servers instead of buffering the whole file
Before this change the NFS write path held the full contents of every
writable open in memory:
- OpenFile(write) called loadWritableContent which read the existing
file into seaweedFile.content up to maxBufferedWriteSize (64 MiB)
- each Write() extended content in-place
- Close() uploaded the whole buffer as a single chunk via
persistContent + AssignVolume
The 64 MiB ceiling made large NFS writes return NFS3ERR_NOTSUPP, and
even below the cap every Write paid a whole-file-in-memory cost. This
PR rewrites the write path to match how `weed filer` and the S3 gateway
persist data:
- openFile(write) no longer loads the existing content at all; it
only issues an UpdateEntry when O_TRUNC is set *and* the file is
non-empty (so a fresh create+trunc is still zero-RPC)
- Write() streams the caller's bytes straight to a volume server via
one AssignVolume + one chunk upload, then atomically appends the
resulting chunk to the filer entry through mutateEntry. Any
previously inlined entry.Content is migrated to a chunk in the same
update so the chunk list becomes the authoritative representation.
- Truncate() becomes a direct mutateEntry (drop chunks past the new
size, clip inline content, update FileSize) instead of resizing an
in-memory buffer.
- Close() is a no-op because everything was flushed inline.
The small-file fast path that the filer HTTP handler uses is preserved:
if the post-write size still fits in maxInlineWriteSize (4 MiB) and
the file has no existing chunks, we rewrite entry.Content directly and
skip the volume-server round-trip. This keeps single-shot tiny writes
(echo, small edits) cheap while completely removing the 64 MiB cap on
larger files. Read() now always reads through the chunk reader instead
of a local byte slice, so reads inside the same session see the freshly
appended data.
Drops the unused seaweedFile.content / dirty fields, the
maxBufferedWriteSize constant, and the loadWritableContent helper.
Updates TestSeaweedFileSystemSupportsNamespaceMutations expectations
to match the new "no extra O_TRUNC UpdateEntry on an empty file"
behavior (still 3 updates: Write + Chmod + Truncate).
* filer: extract shared gateway upload helper for NFS and WebDAV
Three filer-backed gateways (NFS, WebDAV, and mount) each had a local
saveDataAsChunk that wrapped operation.NewUploader().UploadWithRetry
with near-identical bodies: build AssignVolumeRequest, build
UploadOption, build genFileUrlFn with optional filerProxy rewriting,
call UploadWithRetry, validate the result, and call ToPbFileChunk.
Pull that body into filer.SaveGatewayDataAsChunk with a
GatewayChunkUploadRequest struct so both NFS and WebDAV can delegate
to one implementation.
- NFS's saveDataAsChunk is now a thin adapter that assembles the
GatewayChunkUploadRequest from server options and calls the helper.
The chunkUploader interface keeps working for test injection because
the new GatewayChunkUploader interface is structurally identical.
- WebDAV's saveDataAsChunk is similarly a thin adapter — it drops the
local operation.NewUploader call plus the AssignVolume/UploadOption
scaffolding.
- mount is intentionally left alone. mount's saveDataAsChunk has two
features that do not fit the shared helper (a pre-allocated file-id
pool used to skip AssignVolume entirely, and a chunkCache
write-through at offset 0 so future reads hit the mount's local
cache), both of which are mount-specific.
Marks the Phase 2 "extract shared read/write helpers from mount,
WebDAV, and SFTP" DEV_PLAN item as done. The filer-level chunk read
path (NonOverlappingVisibleIntervals + ViewFromVisibleIntervals +
NewChunkReaderAtFromClient) was already shared.
* nfs: remove DESIGN.md and DEV_PLAN.md
The planning documents have served their purpose — all phase 1 and
phase 2 items are landed, phase 3 streaming writes are landed, phase 2
shared helpers are extracted, and the two remaining phase 4 items
(shared lock state + lock tests) are blocked upstream on
github.com/willscott/go-nfs which exposes no NLM or NFSv4 lock state
RPCs. The running decision log no longer reflects current code and
would just drift. The NFS wiki page
(https://github.com/seaweedfs/seaweedfs/wiki/NFS-Server) now carries
the overview, configuration surface, architecture notes, and known
limitations; the source is the source of truth for the rest.
|
||
|
|
50f25bb5cd | 4.20 | ||
|
|
80db692728 | fix(weed/util/chunk_cache): fix dropped errors (#9042) | ||
|
|
edf7d2a074 |
fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9039)
* fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9035) Since 4.18, LocalMetaLogBuffer's ReadFromDiskFn was set to readPersistedLogBufferPosition, causing LoopProcessLogData to call ReadPersistedLogBuffer on every 250ms health-check tick when a subscriber encounters ResumeFromDiskError. Each call creates an OrderedLogVisitor (ListDirectoryEntries on the filer store), spawns a readahead goroutine with a 1024-element channel, finds no data, and returns — 4 times per second even on an idle filer. This is redundant because SubscribeLocalMetadata already manages disk reads explicitly with its own shouldReadFromDisk / lastCheckedFlushTsNs tracking in the outer loop. Set ReadFromDiskFn back to nil for LocalMetaLogBuffer. When LoopProcessLogData encounters ResumeFromDiskError with nil ReadFromDiskFn, the HasData() guard returns ResumeFromDiskError to the caller (SubscribeLocalMetadata), which blocks efficiently on listenersCond.Wait() instead of polling. * fix(filer): add gap detection for slow consumers after disk-read stall When a slow consumer falls behind and LoopProcessLogData returns ResumeFromDiskError with no flush or read-position progress, there may be a gap between persisted data and in-memory data (e.g. writes stopped while consumer was still catching up). Without this, the consumer would block on listenersCond.Wait() forever. Skip forward to the earliest in-memory time to resume progress, matching the gap-handling pattern already used in the shouldReadFromDisk path. * fix(filer): clear stale ResumeFromDiskError after gap-skip to avoid stall The gap-detection block added in the previous commit skips lastReadTime forward to GetEarliestTime() and continues the outer loop. On the next iteration, shouldReadFromDisk becomes true (currentReadTsNs > lastDiskReadTsNs), the disk read returns processedTsNs == 0, and the existing gap handler at the top of the loop runs its own gap check. That check uses readInMemoryLogErr == ResumeFromDiskError as the entry condition — but readInMemoryLogErr is still the stale error from two iterations ago. GetEarliestTime() now equals lastReadTime.Time (we already advanced to it), so earliestTime.After(lastReadTime.Time) is false and the handler falls into listenersCond.Wait() — stuck. Clear readInMemoryLogErr at the gap-skip point, matching the existing pattern at the earlier gap handler that already clears it for the same reason. * fix(log_buffer): GetEarliestTime must include sealed prev buffers GetEarliestTime previously returned only logBuffer.startTime (the active buffer's first timestamp). That is narrower than ReadFromBuffer's tsMemory, which is the min across active + prev buffers. Callers using GetEarliestTime for gap detection after ResumeFromDiskError (the SubscribeLocalMetadata outer loop's disk-read path, the new gap-skip in the in-memory ResumeFromDiskError handler, and MQ HasData) saw a time that was *newer* than the real earliest in-memory data. Impact in SubscribeLocalMetadata's slow-consumer path: - tsMemory = earliest prev buffer time (T_prev) - GetEarliestTime() = active startTime (T_active, later than T_prev) - Consumer position = T1, with T_prev < T1 < T_active - ReadFromBuffer returns ResumeFromDiskError (T1 < tsMemory) - Gap detect: GetEarliestTime().After(T1) = T_active.After(T1) = true - Skip forward to T_active -- silently drops the prev-buffer data - And when T_active happens to equal the stuck position, gap detect evaluates false, and the subscriber stalls on listenersCond.Wait() This reproduces the TestMetadataSubscribeSlowConsumerKeepsProgressing failure in CI where the consumer stalled at 10220/20000 after writing stopped -- the buffer still had data in prev[0..3], but gap detection was comparing against the active buffer's startTime. Fix: scan all sealed prev buffers under RLock, return the true minimum startTime. Matches the min-of-buffers logic in ReadFromBuffer. * test(log_buffer): make DiskReadRetry test deterministic The previous test added the message via AddToBuffer + ForceFlush and relied on a race: the second disk read had to happen before the data was delivered through the in-memory path. Under the race detector or on a slow CI runner, the reader is woken by AddToBuffer's notification, finds the data in the active buffer or its prev slot, and returns after exactly one disk read — failing the >= 2 disk reads assertion even though the loop behaved correctly. Reproduced on master with race detector (2/5 failures). Rewrite the test to deliver the data exclusively through the disk-read path: no AddToBuffer, no ForceFlush. The test waits until the reader has issued at least one no-op disk read, then atomically flips a "dataReady" flag. The reader's next iteration through readFromDiskFn returns the entry. This deterministically exercises the retry-loop behavior the test was originally written to protect, and removes the in-memory delivery race entirely. |
||
|
|
e648c76bcf | go fmt | ||
|
|
eb5624233d |
[filer] fix log buffer idle polling (#9012)
* fix log buffer idle polling * log_buffer: document notificationHealthCheckInterval tradeoffs Explain that notifyChan is the primary wakeup path and this interval only bounds the fallback / state-recheck cadence, so future maintainers don't tune it without understanding the implications for client-disconnect detection latency. * log_buffer: rename waitForNotification to awaitNotificationOrTimeout The helper returns after either a notification or the health-check timeout; the old name read like it blocked indefinitely. No behavior change. * log_buffer: wake blocked subscribers on shutdown awaitNotificationOrTimeout previously only returned on notifyChan or the health-check timeout, so ShutdownLogBuffer on an idle buffer (where copyToFlush returns nil and loopFlush never fires the post-flush notification) would leave subscribers parked for up to 250ms before they noticed IsStopping. Add an internal shutdownCh closed by ShutdownLogBuffer and select on it from awaitNotificationOrTimeout, which is now a method on *LogBuffer. Subscribers wake immediately, re-check IsStopping, and exit. No change to LoopProcessLogData signatures or any caller (filer metadata subscribers, MQ broker, local partition subscribe). * log_buffer: regression tests for flush-notify wake-up TestLoopFlush_NotifiesSubscribersAfterFlush directly verifies that loopFlush calls notifySubscribers after processing a flush, so a reader parked on notifyChan wakes promptly when a flush lands. Verified to fail if that notification is removed. TestLoopProcessLogDataWithOffset_WakesOnDataArrival is the end-to-end counterpart: a real LoopProcessLogDataWithOffset reader parks on notifyChan via the ResumeFromDiskError branch, then wakes and processes the entry well under the 250ms fallback once data arrives. * log_buffer: keep notification-timeout logs at V(4) Revert the V(4)->V(5) demotion. Now that the shutdown wake-up path exists and (with the follow-up fix) idle-polling CPU churn is bounded by the 250ms health check, these timeout logs no longer flood at V=4 the way they did on the 10ms fallback, so the previous verbosity is appropriate again. * log_buffer: exit reader loops cleanly on shutdown awaitNotificationOrTimeout returns true on both data notifications and shutdown (shutdownCh closed). Without an explicit IsStopping() guard, the ResumeFromDiskError, offset-based no-data, empty-buffer, and timestamp-wait paths would either tight-spin against a closed shutdownCh or, in the offset-based case, return ResumeFromDiskError to the caller instead of exiting. Add an IsStopping() check after each awaitNotificationOrTimeout call that previously continued or returned ResumeFromDiskError, so subscribers exit promptly with isDone=true and err=nil when ShutdownLogBuffer is called. * log_buffer: regression test for shutdown wake-up Park a real LoopProcessLogDataWithOffset reader on notifyChan via the ResumeFromDiskError branch, call ShutdownLogBuffer, and assert the reader exits with isDone=true and err=nil well under the 250ms fallback. Verified to fail (timeout) if the IsStopping() guards added in the prior commit are removed. * log_buffer: bump reader-park sleep to 50ms with rationale Both wake-path tests use a sleep to give the goroutine time to reach awaitNotificationOrTimeout before the test triggers the wake-up. Bump from 20ms to 50ms and document the timing assumption to reduce flakiness on slow CI. Both paths are race-free either way (a buffered notification or a closed shutdownCh stays valid until consumed), so this is purely about exercising the park-then-wake path rather than the already-pending fast path. |
||
|
|
a04c9c7dde |
fix: close CPU profile file after stopping profiling (#9000)
The file handle from os.Create(cpuProfile) was passed to pprof.StartCPUProfile but never closed in the OnInterrupt handler. The block and mutex profile files are correctly closed, but the main CPU profile file was leaked. Add f.Close() after pprof.StopCPUProfile() to prevent the file descriptor leak. Co-authored-by: easonysliu <easonysliu@tencent.com> |
||
|
|
0bdf9b0683 | 4.19 | ||
|
|
2919bb27e5 |
fix(sync): use per-cluster TLS for HTTP volume connections in filer.sync (#8974)
* fix(sync): use per-cluster TLS for HTTP volume connections in filer.sync (#8965) When filer.sync runs with -a.security and -b.security flags, only gRPC connections received per-cluster TLS configuration. HTTP clients for volume server reads and uploads used a global singleton with the default security.toml, causing TLS verification failures when clusters use different self-signed certificates. Load per-cluster HTTPS client config from the security files and pass dedicated HTTP clients to FilerSource (for downloads) and FilerSink (for uploads) so each direction uses the correct cluster's certificates. * fix(sync): address review feedback for per-cluster HTTP TLS - Add insecure_skip_verify support to NewHttpClientWithTLS and read it from per-cluster security config via https.client.insecure_skip_verify - Error on partial mTLS config (cert without key or vice versa) - Add nil-check for client parameter in DownloadFileWithClient - Document SetUploader as init-only (same pattern as SetChunkConcurrency) |
||
|
|
f6df7126b6 |
feat(admin): add profiling options for debugging high memory/CPU usage (#8923)
* feat(admin): add profiling options for debugging high memory/CPU usage Add -debug, -debug.port, -cpuprofile, and -memprofile flags to the admin command, matching the profiling support already available in master, volume, and other server commands. This enables investigation of resource usage issues like #8919. * refactor(admin): move profiling flags into AdminOptions struct Move cpuprofile and memprofile flags from global variables into the AdminOptions struct and init() function for consistency with other flags. * fix(debug): bind pprof server to localhost only and document profiling flags StartDebugServer was binding to all interfaces (0.0.0.0), exposing runtime profiling data to the network. Restrict to 127.0.0.1 since this is a development/debugging tool. Also add a "Debugging and Profiling" section to the admin command's help text documenting the new flags. |
||
|
|
0798b274dd |
feat(s3): add concurrent chunk prefetch for large file downloads (#8917)
* feat(s3): add concurrent chunk prefetch for large file downloads Add a pipe-based prefetch pipeline that overlaps chunk fetching with response writing during S3 GetObject, SSE downloads, and filer proxy. While chunk N streams to the HTTP response, fetch goroutines for the next K chunks establish HTTP connections to volume servers ahead of time, eliminating the RTT gap between sequential chunk fetches. Uses io.Pipe for minimal memory overhead (~1MB per download regardless of chunk size, vs buffering entire chunks). Also increases the streaming read buffer from 64KB to 256KB to reduce syscall overhead. Benchmark results (64KB chunks, prefetch=4): - 0ms latency: 1058 → 2362 MB/s (2.2× faster) - 5ms latency: 11.0 → 41.7 MB/s (3.8× faster) - 10ms latency: 5.9 → 23.3 MB/s (4.0× faster) - 20ms latency: 3.1 → 12.1 MB/s (3.9× faster) * fix: address review feedback for prefetch pipeline - Fix data race: use *chunkPipeResult (pointer) on channel to avoid copying struct while fetch goroutines write to it. Confirmed clean with -race detector. - Remove concurrent map write: retryWithCacheInvalidation no longer updates fileId2Url map. Producer only reads it; consumer never writes. - Use mem.Allocate/mem.Free for copy buffer to reduce GC pressure. - Add local cancellable context so consumer errors (client disconnect) immediately stop the producer and all in-flight fetch goroutines. * fix(test): remove dead code and add Range header support in test server - Remove unused allData variable in makeChunksAndServer - Add Range header handling to createTestServer for partial chunk read coverage (206 Partial Content, 416 Range Not Satisfiable) * fix: correct retry condition and goroutine leak in prefetch pipeline - Fix retry condition: use result.fetchErr/result.written instead of copied to decide cache-invalidation retry. The old condition wrongly triggered retry when the fetch succeeded but the response writer failed on the first write (copied==0 despite fetcher having data). Now matches the sequential path (stream.go:197) which checks whether the fetcher itself wrote zero bytes. - Fix goroutine leak: when the producer's send to the results channel is interrupted by context cancellation, the fetch goroutine was already launched but the result was never sent to the channel. The drain loop couldn't handle it. Now waits on result.done before returning so every fetch goroutine is properly awaited. |
||
|
|
995dfc4d5d |
chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase Remove ~50,000 lines of unreachable code identified by static analysis. Major removals: - weed/filer/redis_lua: entire unused Redis Lua filer store implementation - weed/wdclient/net2, resource_pool: unused connection/resource pool packages - weed/plugin/worker/lifecycle: unused lifecycle plugin worker - weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy, multipart IAM, key rotation, and various SSE helper functions - weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions - weed/mq/offset: unused SQL storage and migration code - weed/worker: unused registry, task, and monitoring functions - weed/query: unused SQL engine, parquet scanner, and type functions - weed/shell: unused EC proportional rebalance functions - weed/storage/erasure_coding/distribution: unused distribution analysis functions - Individual unreachable functions removed from 150+ files across admin, credential, filer, iam, kms, mount, mq, operation, pb, s3api, server, shell, storage, topology, and util packages * fix(s3): reset shared memory store in IAM test to prevent flaky failure TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because the MemoryStore credential backend is a singleton registered via init(). Earlier tests that create anonymous identities pollute the shared store, causing LookupAnonymous() to unexpectedly return true. Fix by calling Reset() on the memory store before the test runs. * style: run gofmt on changed files * fix: restore KMS functions used by integration tests * fix(plugin): prevent panic on send to closed worker session channel The Plugin.sendToWorker method could panic with "send on closed channel" when a worker disconnected while a message was being sent. The race was between streamSession.close() closing the outgoing channel and sendToWorker writing to it concurrently. Add a done channel to streamSession that is closed before the outgoing channel, and check it in sendToWorker's select to safely detect closed sessions without panicking. |
||
|
|
6213daf118 | 4.18 | ||
|
|
ced2236cc6 |
Adjust rename events metadata format (#8854)
* rename metadata events * fix subscription filter to use NewEntry.Name for rename path matching The server-side subscription filter constructed the new path using OldEntry.Name instead of NewEntry.Name when checking if a rename event's destination matches the subscriber's path prefix. This could cause events to be incorrectly filtered when a rename changes the file name. * fix bucket events to handle rename of bucket directories onBucketEvents only checked IsCreate and IsDelete. A bucket directory rename via AtomicRenameEntry now emits a single rename event (both OldEntry and NewEntry non-nil), which matched neither check. Handle IsRename by deleting the old bucket and creating the new one. * fix replicator to handle rename events across directory boundaries Two issues fixed: 1. The replicator filtered events by checking if the key (old path) was under the source directory. Rename events now use the old path as key, so renames from outside into the watched directory were silently dropped. Now both old and new paths are checked, and cross-boundary renames are converted to create or delete. 2. NewParentPath was passed to the sink without remapping to the sink's target directory structure, causing the sink to write entries at the wrong location. Now NewParentPath is remapped alongside the key. * fix filer sync to handle rename events crossing directory boundaries The early directory-prefix filter only checked resp.Directory (old parent). Rename events now carry the old parent as Directory, so renames from outside the source path into it were dropped before reaching the existing cross-boundary handling logic. Check both old and new directories against sourcePath and excludePaths so the downstream old-key/new-key logic can properly convert these to create or delete operations. * fix metadata event path matching * fix metadata event consumers for rename targets * Fix replication rename target keys Logical rename events now reach replication sinks with distinct source and target paths.\n\nHandle non-filer sinks as delete-plus-create on the translated target key, and make the rename fallback path create at the translated target key too.\n\nAdd focused tests covering non-filer renames, filer rename updates, and the fallback path.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix filer sync rename path scoping Use directory-boundary matching instead of raw prefix checks when classifying source and target paths during filer sync.\n\nAlso apply excludePaths per side so renames across excluded boundaries downgrade cleanly to create/delete instead of being misclassified as in-scope updates.\n\nAdd focused tests for boundary matching and rename classification.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix replicator directory boundary checks Use directory-boundary matching instead of raw prefix checks when deciding whether a source or target path is inside the watched tree or an excluded subtree.\n\nThis prevents sibling paths such as /foo and /foobar from being misclassified during rename handling, and preserves the earlier rename-target-key fix.\n\nAdd focused tests for boundary matching and rename classification across sibling/excluded directories.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix etc-remote rename-out handling Use boundary-safe source/target directory membership when classifying metadata events under DirectoryEtcRemote.\n\nThis prevents rename-out events from being processed as config updates, while still treating them as removals where appropriate for the remote sync and remote gateway command paths.\n\nAdd focused tests for update/removal classification and sibling-prefix handling.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Defer rename events until commit Queue logical rename metadata events during atomic and streaming renames and publish them only after the transaction commits successfully.\n\nThis prevents subscribers from seeing delete or logical rename events for operations that later fail during delete or commit.\n\nAlso serialize notification.Queue swaps in rename tests and add failure-path coverage.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Skip descendant rename target lookups Avoid redundant target lookups during recursive directory renames once the destination subtree is known absent.\n\nThe recursive move path now inserts known-absent descendants directly, and the test harness exercises prefixed directory listing so the optimization is covered by a directory rename regression test.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Tighten rename review tests Return filer_pb.ErrNotFound from the bucket tracking store test stub so it follows the FilerStore contract, and add a webhook filter case for same-name renames across parent directories.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix HardLinkId format verb in InsertEntryKnownAbsent error HardLinkId is a byte slice. %d prints each byte as a decimal number which is not useful for an identifier. Use %x to match the log line two lines above. * only skip descendant target lookup when source and dest use same store moveFolderSubEntries unconditionally passed skipTargetLookup=true for every descendant. This is safe when all paths resolve to the same underlying store, but with path-specific store configuration a child's destination may map to a different backend that already holds an entry at that path. Use FilerStoreWrapper.SameActualStore to check per-child and fall back to the full CreateEntry path when stores differ. * add nil and create edge-case tests for metadata event scope helpers * extract pathIsEqualOrUnder into util.IsEqualOrUnder Identical implementations existed in both replication/replicator.go and command/filer_sync.go. Move to util.IsEqualOrUnder (alongside the existing FullPath.IsUnder) and remove the duplicates. * use MetadataEventTargetDirectory for new-side directory in filer sync The new-side directory checks and sourceNewKey computation used message.NewParentPath directly. If NewParentPath were empty (legacy events, older filer versions during rolling upgrades), sourceNewKey would be wrong (/filename instead of /dir/filename) and the UpdateEntry parent path rewrite would panic on slice bounds. Derive targetDir once from MetadataEventTargetDirectory, which falls back to resp.Directory when NewParentPath is empty, and use it consistently for all new-side checks and the sink parent path. |
||
|
|
92c2fc0d52 |
Add insecure_skip_verify option for HTTPS client in security.toml (#8781)
* Add -insecureSkipVerify flag and config option for filer.sync HTTPS connections When using filer.sync between clusters with different CAs (e.g., separate OpenShift clusters), TLS certificate verification fails with "x509: certificate signed by unknown authority". This adds two ways to skip TLS certificate verification: 1. CLI flag: `weed filer.sync -insecureSkipVerify ...` 2. Config option: `insecure_skip_verify = true` under [https.client] in security.toml Closes #8778 * Add insecure_skip_verify option for HTTPS client in security.toml When using filer.sync between clusters with different CAs (e.g., separate OpenShift clusters), TLS certificate verification fails. Adding insecure_skip_verify = true under [https.client] in security.toml allows skipping TLS certificate verification. The option is read during global HTTP client initialization so it applies to all HTTPS connections including filer.sync proxy reads and writes. Closes #8778 --------- Co-authored-by: Copilot <copilot@github.com> |
||
|
|
2877febd73 |
S3: fix silent PutObject failure and enforce 1024-byte key limit (#8764)
* S3: add KeyTooLongError error code Add ErrKeyTooLongError (HTTP 400, code "KeyTooLongError") to match the standard AWS S3 error for object keys that exceed length limits. * S3: fix silent PutObject failure when entry name exceeds max_file_name_length putToFiler called client.CreateEntry() directly and discarded the gRPC response. The filer embeds application errors like "entry name too long" in resp.Error (not as gRPC transport errors), so the error was silently swallowed and clients received HTTP 200 with an ETag for objects that were never stored. Switch to the filer_pb.CreateEntry() helper which properly checks resp.Error, and map "entry name too long" to KeyTooLongError (HTTP 400). To avoid fragile string parsing across the gRPC boundary, define shared error message constants in weed/util/constants and use them in both the filer (producing errors) and S3 API (matching errors). Switch filerErrorToS3Error to use strings.Contains/HasSuffix with these constants so matches work regardless of any wrapper prefix. Apply filerErrorToS3Error to the mkdir path for directory markers. Fixes #8759 * S3: enforce 1024-byte maximum object key length AWS S3 limits object keys to 1024 bytes. Add early validation on write paths (PutObject, CopyObject, CreateMultipartUpload) to reject keys exceeding the limit with the standard KeyTooLongError (HTTP 400). The key length check runs before bucket auto-creation to prevent overlong keys from triggering unnecessary side effects. Also use filerErrorToS3Error for CopyObject's mkFile error paths so name-too-long errors from the filer return KeyTooLongError instead of InternalError. Ref #8758 * S3: add handler-level tests for key length validation and error mapping Add tests for filerErrorToS3Error mapping "entry name too long" to KeyTooLongError, including a regression test for the CreateEntry-prefixed "existing ... is a directory" form. Add handler-level integration tests that exercise PutObjectHandler, CopyObjectHandler, and NewMultipartUploadHandler via httptest, verifying HTTP 400 and KeyTooLongError XML response for overlong keys and acceptance of keys at the 1024-byte limit. |
||
|
|
6bf654c25c |
fix: keep metadata subscriptions progressing (#8730) (#8746)
* fix: keep metadata subscriptions progressing (#8730) * test: cancel slow metadata writers with parent context * filer: ignore missing persisted log chunks |
||
|
|
08b79a30f6 |
Fix lock table shared wait condition (#8707)
* Fix lock table shared wait condition (#8696) * Refactor lock table waiter check * Add exclusive lock wait helper test |
||
|
|
b665c329bc |
fix(replication): resume partial chunk reads on EOF instead of re-downloading (#8607)
* fix(replication): resume partial chunk reads on EOF instead of re-downloading When replicating chunks and the source connection drops mid-transfer, accumulate the bytes already received and retry with a Range header to fetch only the remaining bytes. This avoids re-downloading potentially large chunks from scratch on each retry, reducing load on busy source servers and speeding up recovery. * test(replication): add tests for downloadWithRange including gzip partial reads Tests cover: - No offset (no Range header sent) - With offset (Range header verified) - Content-Disposition filename extraction - Partial read + resume: server drops connection mid-transfer, client resumes with Range from the offset of received bytes - Gzip partial read + resume: first response is gzip-encoded (Go auto- decompresses), connection drops, resume request gets decompressed data (Go doesn't add Accept-Encoding when Range is set, so the server decompresses), combined bytes match original * fix(replication): address PR review comments - Consolidate downloadWithRange into DownloadFile with optional offset parameter (variadic), eliminating code duplication (DRY) - Validate HTTP response status: require 206 + correct Content-Range when offset > 0, reject when server ignores Range header - Use if/else for fullData assignment for clarity - Add test for rejected Range (server returns 200 instead of 206) * refactor(replication): remove unused ReplicationSource interface The interface was never referenced and its signature didn't match the actual FilerSource.ReadPart method. --------- Co-authored-by: Copilot <copilot@github.com> |
||
|
|
4a5243886a | 4.17 | ||
|
|
8ad58e7002 | 4.16 | ||
|
|
2ec0a67ee3 |
master: return 503/Unavailable during topology warmup after leader change (#8529)
* master: return 503/Unavailable during topology warmup after leader change After a master restart or leader change, the topology is empty until volume servers reconnect and send heartbeats. During this warmup window (3 heartbeat intervals = 15 seconds), volume lookups that fail now return 503 Service Unavailable (HTTP) or gRPC Unavailable instead of 404 Not Found, signaling clients to retry with other masters. * master: skip warmup 503 on fresh start and single-master setups - Check MaxVolumeId > 0 to distinguish restart from fresh start (MaxVolumeId is Raft-persisted, so 0 means no prior data) - Check peer count > 1 so single-master deployments aren't affected (no point suggesting "retry with other masters" if there are none) * master: address review feedback and block assigns during warmup - Protect LastLeaderChangeTime with dedicated mutex (fix data race) - Extract warmup multiplier as WarmupPulseMultiplier constant - Derive Retry-After header from pulse config instead of hardcoding - Only trigger warmup 503 for "not found" errors, not parse errors - Return nil response (not partial) on gRPC Unavailable - Add doc comments to IsWarmingUp, getter/setter, WarmupDuration - Block volume assign requests (HTTP and gRPC) during warmup, since the topology is incomplete and assignments would be unreliable - Skip warmup behavior for single-master setups (no peers to retry) * master: apply warmup to all setups, skip only on fresh start Single-master restarts still have an empty topology until heartbeats arrive, so warmup protection should apply there too. The only case to skip is a fresh cluster start (MaxVolumeId == 0), which already has no volumes to look up. - Remove GetMasterCount() > 1 guard from all warmup checks - Remove now-unused GetMasterCount helper - Update error messages to "topology is still loading" (not "retry with other masters" which doesn't apply to single-master) * master: add client-side retry on Unavailable for lookup and assign The server-side 503/Unavailable during warmup needs client cooperation. Previously, LookupVolumeIds and Assign would immediately propagate the error without retry. Now both paths retry with exponential backoff (1s -> 1.5s -> ... up to 6s) when receiving Unavailable, respecting context cancellation. This covers the warmup window where the master's topology is still loading after a restart or leader change. * master: seed warmup timestamp in legacy raft path at setup The legacy raft path only set lastLeaderChangeTime inside the event listener callback, which could fire after IsLeader() was already observed as true in SetRaftServer. Seed the timestamp at setup time (matching the hashicorp path) so IsWarmingUp() is active immediately. * master: fix assign retry loop to cover full warmup window The retry loop used waitTime <= maxWaitTime as a stop condition, causing it to give up after ~13s while warmup lasts 15s. Now cap each individual sleep at maxWaitTime but keep retrying until the context is cancelled. * master: preserve gRPC status in lookup retry and fix retry window Return the raw gRPC error instead of wrapping with fmt.Errorf so status.FromError() can extract the status code. Use proper gRPC status check (codes.Unavailable) instead of string matching. Also cap individual sleep at maxWaitTime while retrying until ctx is done. * master: use gRPC status code instead of string matching in assign retry Use status.FromError/codes.Unavailable instead of brittle strings.Contains for detecting retriable gRPC errors in the assign retry loop. * master: use remaining warmup duration for Retry-After header Set Retry-After to the remaining warmup time instead of the full warmup duration, so clients don't wait longer than necessary. * master: reset ret.Replicas before populating from assign response Clear Replicas slice before appending to prevent duplicate entries when the assign response is retried or when alternative requests are attempted. * master: add unit tests for warmup retry behavior Test that Assign() and LookupVolumeIds() retry on codes.Unavailable and stop promptly when the context is cancelled. * master: record leader change time before initialization work Move SetLastLeaderChangeTime() to fire immediately when the leader change event is received, before DoBarrier(), EnsureTopologyId(), and updatePeers(), so the warmup clock starts at the true moment of leadership transition. * master: use topology warmup duration in volume growth wait loop Replace hardcoded constants.VolumePulsePeriod * 2 with topo.IsWarmingUp() and topo.WarmupDuration() so the growth wait stays in sync with the configured warmup window. Remove unused constants import. * master: resolve master before creating RPC timeout context Move GetMaster() call before context.WithTimeout() so master resolution blocking doesn't consume the gRPC call timeout. * master: use NotFound flag instead of string matching for volume lookup Add a NotFound field to LookupResult and set it in findVolumeLocation when a volume is genuinely missing. Update HTTP and gRPC warmup checks to use this flag instead of strings.Contains on the error message. * master: bound assign retry loop to 30s for deadline-free contexts Without a context deadline, the Unavailable retry loop could spin forever. Add a maxRetryDuration of 30s so the loop gives up even when no context deadline is set. * master: strengthen assign retry cancellation test Verify the retry loop actually retried (callCount > 1) and that the returned error is context.DeadlineExceeded, not just any error. * master: extract shared retry-with-backoff utility Add util.RetryWithBackoff for context-aware, bounded retry with exponential backoff. Refactor both Assign() and LookupVolumeIds() to use it instead of duplicating the retry/sleep/backoff logic. * master: cap waitTime in RetryWithBackoff to prevent unbounded growth Cap the backoff waitTime at maxWaitTime so it doesn't grow indefinitely in long-running retry scenarios. * master: only return Unavailable during warmup when all lookups failed For batched LookupVolume requests, return partial results when some volumes are found. Only return codes.Unavailable when no volumes were successfully resolved, so clients benefit from partial results instead of retrying unnecessarily. * master: set retriable error message in 503 response body When returning 503 during warmup, replace the "not found" error in the JSON body with "service warming up, please retry" so clients don't treat it as a permanent error. * master: guard empty master address in LookupVolumeIds If GetMaster() returns empty (no master found or ctx cancelled), return an appropriate error instead of dialing an empty address. Returns ctx.Err() if context is done, otherwise codes.Unavailable to trigger retry. * master: add comprehensive tests for RetryWithBackoff Test success after retries, non-retryable error handling, context cancellation, and maxDuration cap with context.Background(). * master: enforce hard maxDuration bound in RetryWithBackoff Use a deadline instead of elapsed-time check so the last sleep is capped to remaining time. This prevents the total retry duration from overshooting maxDuration by up to one full backoff interval. * master: respect fresh-start bypass in RemainingWarmupDuration Check IsWarmingUp() first (which returns false when MaxVolumeId==0) so RemainingWarmupDuration returns 0 on fresh clusters. * master: round up Retry-After seconds to avoid underestimating Use math.Ceil so fractional remaining seconds (e.g. 1.9s) round up to the next integer (2) instead of flooring down (1). * master: tighten batch lookup warmup to all-NotFound only Only return codes.Unavailable when every requested volume ID was a transient not-found. Mixed cases with non-NotFound errors now return the response with per-volume error details preserved. * master: reduce retry log noise and fix timer leak Lower per-attempt retry log from V(0) to V(1) to reduce noise during warmup. Replace time.After with time.NewTimer to avoid lingering timers when context is cancelled. * master: add per-attempt timeout for assign RPC Use a 10s per-attempt timeout so a single slow RPC can't consume the entire 30s retry budget when ctx has no deadline. * master: share single 30s retry deadline across assign request entries The Assign() function iterates over primary and fallback requests, previously giving each its own 30s RetryWithBackoff budget. With a primary + fallback, the total could reach 60s. Compute one deadline up front and pass the remaining budget to each RetryWithBackoff call so the entire Assign() call stays within a single 30s cap. * master: strengthen context-cancel test with DeadlineExceeded and retry assertions Assert errors.Is(err, context.DeadlineExceeded) to verify the error is specifically from the context deadline, and check callCount > 1 to prove retries actually occurred before cancellation. Mirrors the pattern used in TestAssignStopsOnContextCancel. * master: bound GetMaster with per-attempt timeout in LookupVolumeIds GetMaster() calls WaitUntilConnected() which can block indefinitely if no master is available. Previously it used the outer ctx, so a slow master resolution could consume the entire RetryWithBackoff budget in a single attempt. Move the per-attempt timeoutCtx creation before the GetMaster call so both master resolution and the gRPC LookupVolume RPC share one grpcTimeout-bounded attempt. * master: use deadline-aware context for assign retry budget The shared 30s deadline only limited RetryWithBackoff's internal wall-clock tracking, but per-attempt contexts were still derived from the original ctx and could run for up to 10s even when the budget was nearly exhausted. Create a deadlineCtx from the computed deadline and derive both RetryWithBackoff and per-attempt timeouts from it so all operations honor the shared 30s cap. * master: skip warmup gate for empty lookup requests When VolumeOrFileIds is empty, notFoundCount == len(req.VolumeOrFileIds) is 0 == 0 which is true, causing empty lookup batches during warmup to return codes.Unavailable and be retried endlessly. Add a len(req.VolumeOrFileIds) > 0 guard so empty requests pass through. * master: validate request fields before warmup gate in Assign Move Replication and Ttl parsing before the IsWarmingUp() check so invalid inputs get a proper validation error instead of being masked by codes.Unavailable during warmup. Pure syntactic validation does not depend on topology state and should run first. * master: check deadline and context before starting retry attempt RetryWithBackoff only checked the deadline and context after an attempt completed or during the sleep select. If the deadline expired or context was canceled during sleep, the next iteration would still call operation() before detecting it. Add pre-operation checks so no new attempt starts after the budget is exhausted. * master: always return ctx.Err() on context cancellation in RetryWithBackoff When ctx.Err() is non-nil, the pre-operation check was returning lastErr instead of ctx.Err(). This broke callers checking errors.Is(err, context.DeadlineExceeded) and contradicted the documented contract. Always return ctx.Err() so the cancellation reason is properly surfaced. * master: handle warmup errors in StreamAssign without killing the stream StreamAssign was returning codes.Unavailable errors from Assign directly, which terminates the gRPC stream and breaks pooled connections. Instead, return transient errors as in-band error responses so the stream survives warmup periods. Also reset assignClient in doAssign on Send/Recv failures so a broken stream doesn't leave the proxy permanently dead. * master: wait for warmup before slot search in findAndGrow findEmptySlotsForOneVolume was called before the warmup wait loop, selecting slots from an incomplete topology. Move the warmup wait before slot search so volume placement uses the fully warmed-up topology with all servers registered. * master: add Retry-After header to /dir/assign warmup response The /dir/lookup handler already sets Retry-After during warmup but /dir/assign did not, leaving HTTP clients without guidance on when to retry. Add the same header using RemainingWarmupDuration(). * master: only seed warmup timestamp on leader at startup SetLastLeaderChangeTime was called unconditionally for both leader and follower nodes. Followers don't need warmup state, and the leader change event listener handles real elections. Move the seed into the IsLeader() block so only the startup leader gets warmup initialized. * master: preserve codes.Unavailable for StreamAssign warmup errors in doAssign StreamAssign returns transient warmup errors as in-band AssignResponse.Error messages. doAssign was converting these to plain fmt.Errorf, losing the codes.Unavailable classification needed for the caller's retry logic. Detect warmup error messages and wrap them as status.Error(codes.Unavailable) so RetryWithBackoff can retry. |
||
|
|
540fc97e00 |
s3/iam: reuse one request id per request (#8538)
* request_id: add shared request middleware
* s3err: preserve request ids in responses and logs
* iam: reuse request ids in XML responses
* sts: reuse request ids in XML responses
* request_id: drop legacy header fallback
* request_id: use AWS-style request id format
* iam: fix AWS-compatible XML format for ErrorResponse and field ordering
- ErrorResponse uses bare <RequestId> at root level instead of
<ResponseMetadata> wrapper, matching the AWS IAM error response spec
- Move CommonResponse to last field in success response structs so
<ResponseMetadata> serializes after result elements
- Add randomness to request ID generation to avoid collisions
- Add tests for XML ordering and ErrorResponse format
* iam: remove duplicate error_response_test.go
Test is already covered by responses_test.go.
* address PR review comments
- Guard against typed nil pointers in SetResponseRequestID before
interface assertion (CodeRabbit)
- Use regexp instead of strings.Index in test helpers for extracting
request IDs (Gemini)
* request_id: prevent spoofing, fix nil-error branch, thread reqID to error writers
- Ensure() now always generates a server-side ID, ignoring client-sent
x-amz-request-id headers to prevent request ID spoofing. Uses a
private context key (contextKey{}) instead of the header string.
- writeIamErrorResponse in both iamapi and embedded IAM now accepts
reqID as a parameter instead of calling Ensure() internally, ensuring
a single request ID per request lifecycle.
- The nil-iamError branch in writeIamErrorResponse now writes a 500
Internal Server Error response instead of returning silently.
- Updated tests to set request IDs via context (not headers) and added
tests for spoofing prevention and context reuse.
* sts: add request-id consistency assertions to ActionInBody tests
* test: update admin test to expect server-generated request IDs
The test previously sent a client x-amz-request-id header and expected
it echoed back. Since Ensure() now ignores client headers to prevent
spoofing, update the test to verify the server returns a non-empty
server-generated request ID instead.
* iam: add generic WithRequestID helper alongside reflection-based fallback
Add WithRequestID[T] that uses generics to take the address of a value
type, satisfying the pointer receiver on SetRequestId without reflection.
The existing SetResponseRequestID is kept for the two call sites that
operate on interface{} (from large action switches where the concrete
type varies at runtime). Generics cannot replace reflection there since
Go cannot infer type parameters from interface{}.
* Remove reflection and generics from request ID setting
Call SetRequestId directly on concrete response types in each switch
branch before boxing into interface{}, eliminating the need for
WithRequestID (generics) and SetResponseRequestID (reflection).
* iam: return pointer responses in action dispatch
* Fix IAM error handling consistency and ensure request IDs on all responses
- UpdateUser/CreatePolicy error branches: use writeIamErrorResponse instead
of s3err.WriteErrorResponse to preserve IAM formatting and request ID
- ExecuteAction: accept reqID parameter and generate one if empty, ensuring
every response carries a RequestId regardless of caller
* Clean up inline policies on DeleteUser and UpdateUser rename
DeleteUser: remove InlinePolicies[userName] from policy storage before
removing the identity, so policies are not orphaned.
UpdateUser: move InlinePolicies[userName] to InlinePolicies[newUserName]
when renaming, so GetUserPolicy/DeleteUserPolicy work under the new name.
Both operations persist the updated policies and return an error if
the storage write fails, preventing partial state.
|
||
|
|
b3f7472fd3 | 4.15 | ||
|
|
7799804200 |
4.14
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com> |
||
|
|
f5c35240be |
Add volume dir tags and EC placement priority (#8472)
* Add volume dir tags to topology Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add preferred tag config for EC Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Prioritize EC destinations by tags Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add EC placement planner tag tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Refactor EC placement tests to reuse buildActiveTopology Remove buildActiveTopologyWithDiskTags helper function and consolidate tag setup inline in test cases. Tests now use UpdateTopology to apply tags after topology creation, reusing the existing buildActiveTopology function rather than duplicating its logic. All tag scenario tests pass: - TestECPlacementPlannerPrefersTaggedDisks - TestECPlacementPlannerFallsBackWhenTagsInsufficient Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Consolidate normalizeTagList into shared util package Extract normalizeTagList from three locations (volume.go, detection.go, erasure_coding_handler.go) into new weed/util/tag.go as exported NormalizeTagList function. Replace all duplicate implementations with imports and calls to util.NormalizeTagList. This improves code reuse and maintainability by centralizing tag normalization logic. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add PreferredTags to EC config persistence Add preferred_tags field to ErasureCodingTaskConfig protobuf with field number 5. Update GetConfigSpec to include preferred_tags field in the UI configuration schema. Add PreferredTags to ToTaskPolicy to serialize config to protobuf. Add PreferredTags to FromTaskPolicy to deserialize from protobuf with defensive copy to prevent external mutation. This allows EC preferred tags to be persisted and restored across worker restarts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add defensive copy for Tags slice in DiskLocation Copy the incoming tags slice in NewDiskLocation instead of storing by reference. This prevents external callers from mutating the DiskLocation.Tags slice after construction, improving encapsulation and preventing unexpected changes to disk metadata. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add doc comment to buildCandidateSets method Document the tiered candidate selection and fallback behavior. Explain that for a planner with preferredTags, it accumulates disks matching each tag in order into progressively larger tiers, emits a candidate set once a tier reaches shardsNeeded, and finally falls back to the full candidates set if preferred-tag tiers are insufficient. This clarifies the intended semantics for future maintainers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Apply final PR review fixes 1. Update parseVolumeTags to replicate single tag entry to all folders instead of leaving some folders with nil tags. This prevents nil pointer dereferences when processing folders without explicit tags. 2. Add defensive copy in ToTaskPolicy for PreferredTags slice to match the pattern used in FromTaskPolicy, preventing external mutation of the returned TaskPolicy. 3. Add clarifying comment in buildCandidateSets explaining that the shardsNeeded <= 0 branch is a defensive check for direct callers, since selectDestinations guarantees shardsNeeded > 0. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix nil pointer dereference in parseVolumeTags Ensure all folder tags are initialized to either normalized tags or empty slices, not nil. When multiple tag entries are provided and there are more folders than entries, remaining folders now get empty slices instead of nil, preventing nil pointer dereference in downstream code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix NormalizeTagList to return empty slice instead of nil Change NormalizeTagList to always return a non-nil slice. When all tags are empty or whitespace after normalization, return an empty slice instead of nil. This prevents nil pointer dereferences in downstream code that expects a valid (possibly empty) slice. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add nil safety check for v.tags pointer Add a safety check to handle the case where v.tags might be nil, preventing a nil pointer dereference. If v.tags is nil, use an empty string instead. This is defensive programming to prevent panics in edge cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add volume.tags flag to weed server and weed mini commands Add the volume.tags CLI option to both the 'weed server' and 'weed mini' commands. This allows users to specify disk tags when running the combined server modes, just like they can with 'weed volume'. The flag uses the same format and description as the volume command: comma-separated tag groups per data dir with ':' separators (e.g. fast:ssd,archive). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> |
||
|
|
4f647e1036 |
Worker set its working directory (#8461)
* set working directory * consolidate to worker directory * working directory * correct directory name * refactoring to use wildcard matcher * simplify * cleaning ec working directory * fix reference * clean * adjust test |
||
|
|
e4b70c2521 | go fix |