13396 Commits

Chris Lu
0bdf9b0683 4.19 2026-04-07 19:21:35 -07:00
Chris Lu
75dcb97187 filer: bootstrap pre-existing metadata when a new filer joins (#8979)
* filer: bootstrap pre-existing metadata when a new filer joins a cluster

When a filer connects to a peer for the first time (no stored sync
offset), it now does a full BFS traversal of the peer's metadata via
TraverseBfsMetadata before starting the incremental change stream.
This ensures filer2 sees all data that existed before it started,
fixing the issue where only post-startup changes were synced.

Closes #8961

* filer: upsert during bootstrap and persist offset immediately

- Use upsert (insert, then update on conflict) during metadata
  traversal so the bootstrap doesn't fail on the root directory
  or after a partial previous attempt.
- Persist the sync offset right after a successful traversal so
  a retry doesn't redo the full BFS.

* filer: address review feedback on metadata bootstrap

- Use peer-side max Mtime as the streaming cursor instead of local
  time.Now() to avoid missing events due to clock skew between filers.
  traversePeerMetadata now returns the high-water Mtime (nanoseconds)
  observed during BFS traversal.

- Compare Mtime before overwriting during bootstrap: if a local entry
  is newer than the peer's version, skip the update instead of
  clobbering it.

- Only trigger full BFS traversal on ErrKvNotFound (key genuinely
  missing). Transient KvGet errors (connection issues, etc.) are now
  propagated instead of silently falling through to a full re-sync.
  Changed readOffset to use %w so errors.Is works through the chain.

* filer: address review findings on bootstrap sync

- Use wall-clock time with safety margin for stream cursor instead of
  entry Mtime. Mtime is file modification time (can be arbitrary),
  while the metadata stream uses TsNs (event log time). Using
  time.Now() minus 1 minute before traversal ensures no events are
  missed even with clock skew, matching the proven filer.meta.backup
  pattern.

- Pass ExcludedPrefixes=[SystemLogDir] to TraverseBfsMetadata so
  the server prunes internal log entries server-side instead of
  transferring them over the network only to be filtered client-side.

- Fail fast if updateOffset fails after bootstrap. If we can't
  persist the offset, bail out rather than proceeding and potentially
  losing the expensive BFS work on the next retry.
2026-04-07 19:05:45 -07:00
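
A minimal sketch of the bootstrap decision described above, assuming hypothetical
stand-ins (readOffset, traversePeer, upsertEntry, persistOffset, ErrKvNotFound)
for the real filer helpers; imports: "errors", "fmt", "time":

    // bootstrapIfNeeded runs a one-time full traversal of a peer's metadata when
    // no sync offset is stored, then persists a wall-clock cursor taken before
    // the traversal (minus a safety margin) so the change stream replays anything
    // written during the BFS.
    func bootstrapIfNeeded(
        peer string,
        readOffset func(peer string) (int64, error),
        traversePeer func(peer string, visit func(entry any) error) error,
        upsertEntry func(entry any) error,
        persistOffset func(peer string, tsNs int64) error,
    ) (int64, error) {
        offset, err := readOffset(peer)
        switch {
        case errors.Is(err, ErrKvNotFound):
            // Never synced with this peer: take the cursor before traversal,
            // with a 1-minute margin against clock skew between filers.
            cursor := time.Now().Add(-time.Minute).UnixNano()
            if err := traversePeer(peer, upsertEntry); err != nil {
                return 0, fmt.Errorf("bootstrap traversal from %s: %w", peer, err)
            }
            // Fail fast: if the offset cannot be persisted, abort rather than
            // risk redoing the expensive BFS on the next retry.
            if err := persistOffset(peer, cursor); err != nil {
                return 0, fmt.Errorf("persist offset for %s: %w", peer, err)
            }
            return cursor, nil
        case err != nil:
            // Transient KvGet errors are propagated, not treated as "missing".
            return 0, err
        default:
            return offset, nil // resume the incremental stream from the stored offset
        }
    }
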
Chris Lu
940eed0bd3 fix(ec): generate .ecx before EC shards to prevent data inconsistency (#8972)
* fix(ec): generate .ecx before EC shards to prevent data inconsistency

In VolumeEcShardsGenerate, the .ecx index was generated from .idx AFTER
the EC shards were generated from .dat. If any write occurred between
these two steps (e.g. WriteNeedleBlob during replica sync, which bypasses
the read-only check), the .ecx would contain entries pointing to data
that doesn't exist in the EC shards, causing "shard too short" and
"size mismatch" errors on subsequent reads and scrubs.

Fix by generating .ecx FIRST, then snapshotting datFileSize, then
encoding EC shards. If a write sneaks in after .ecx generation, the
EC shards contain more data than .ecx references — which is harmless
(the extra data is simply not indexed).

Also snapshot datFileSize before EC encoding to ensure the .vif
reflects the same .dat state that .ecx was generated from.

Add TestEcConsistency_WritesBetweenEncodeAndEcx that reproduces the
race condition by appending data between EC encoding and .ecx generation.

* fix: pass actual offset to ReadBytes, improve test quality

- Pass offset.ToActualOffset() to ReadBytes instead of 0 to preserve
  correct error metrics and error messages within ReadBytes
- Handle Stat() error in assembleFromIntervalsAllowError
- Rename TestEcConsistency_DatFileGrowsDuringEncoding to
  TestEcConsistency_ExactLargeRowEncoding (test verifies fixed-size
  encoding, not concurrent growth)
- Update test comment to clarify it reproduces the old buggy sequence
- Fix verification loop to advance by readSize for full data coverage

* fix(ec): add dat/idx consistency check in worker EC encoding

The erasure_coding worker copies .dat and .idx as separate network
transfers. If a write lands on the source between these copies, the
.idx may have entries pointing past the end of .dat, leading to EC
volumes with .ecx entries that reference non-existent shard data.

Add verifyDatIdxConsistency() that walks the .idx and verifies no
entry's offset+size exceeds the .dat file size. This fails the EC
task early with a clear error instead of silently producing corrupt
EC volumes.

* test(ec): add integration test verifying .ecx/.ecd consistency

TestEcIndexConsistencyAfterEncode uploads multiple needles of varying
sizes (14B to 256KB), EC-encodes the volume, mounts data shards, then
reads every needle back via the EC read path and verifies payload
correctness. This catches any inconsistency between .ecx index entries
and EC shard data.

* fix(test): account for needle overhead in test volume fixture

WriteTestVolumeFiles created a .dat of exactly datSize bytes but the
.idx entry claimed a needle of that same size. GetActualSize adds
header + checksum + timestamp overhead, so the consistency check
correctly rejects this as the needle extends past the .dat file.

Fix by sizing the .dat to GetActualSize(datSize) so the .idx entry
is consistent with the .dat contents.

* fix(test): remove flaky shard ID assertion in EC scrub test

When shard 0 is truncated on disk after mount, the volume server may
detect corruption via parity mismatches (shards 10-13) rather than a
direct read failure on shard 0, depending on OS caching/mmap behavior.
Replace the brittle shard-0-specific check with a volume ID validation.

* fix(test): close upload response bodies and tighten file count assertion

Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd
leaks during test execution. Also tighten TotalFiles check from >= 1
to == 1 since ecSetup uploads exactly one file.
2026-04-07 19:05:36 -07:00
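
A simplified sketch of the .idx/.dat consistency walk described above (not the
worker code verbatim): it assumes the classic 16-byte index entry layout (8-byte
key, 4-byte offset in 8-byte units, 4-byte size, big-endian) and ignores the
per-needle header/checksum overhead and deleted-needle tombstones that the real
check must account for; imports: "encoding/binary", "fmt", "os":

    // verifyDatIdxConsistency fails if any index entry points past the end of
    // the .dat file, which would later surface as .ecx entries referencing
    // shard data that does not exist.
    func verifyDatIdxConsistency(idxPath string, datFileSize int64) error {
        idx, err := os.ReadFile(idxPath)
        if err != nil {
            return err
        }
        const entrySize = 16
        for i := 0; i+entrySize <= len(idx); i += entrySize {
            offset := int64(binary.BigEndian.Uint32(idx[i+8:i+12])) * 8
            size := int64(binary.BigEndian.Uint32(idx[i+12 : i+16]))
            if offset == 0 || size == 0 {
                continue // deleted or empty entry
            }
            if offset+size > datFileSize {
                return fmt.Errorf("idx entry %d: needle at %d with size %d extends past .dat size %d",
                    i/entrySize, offset, size, datFileSize)
            }
        }
        return nil
    }
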
Chris Lu
6098ef4bd3 fix(test): remove flaky shard ID assertion in EC scrub test (#8978)
* test: add integration tests for volume and EC volume scrubbing

Add scrub integration tests covering normal volumes (full data scrub,
corrupt .dat detection, mixed healthy/broken batches, missing volume
error) and EC volumes (INDEX/LOCAL modes on healthy volumes, corrupt
shard detection with broken shard info reporting, corrupt .ecx index,
auto-select, unsupported mode error).

Also adds framework helpers: CorruptDatFile, CorruptEcxFile,
CorruptEcShardFile for fault injection in scrub tests.

* fix: correct dat/ecx corruption helpers and ecx test setup

- CorruptDatFile: truncate .dat to superblock size instead of overwriting
  bytes (ensures scrub detects data file size mismatch)
- TestScrubEcVolumeIndexCorruptEcx: corrupt .ecx before mount so the
  corrupted size is loaded into memory (EC volumes cache ecx size at mount)

* fix(test): remove flaky shard ID assertion in EC scrub test

When shard 0 is truncated on disk after mount, the volume server may
detect corruption via parity mismatches (shards 10-13) rather than a
direct read failure on shard 0, depending on OS caching/mmap behavior.
Replace the brittle shard-0-specific check with a volume ID validation.

* fix(test): close upload response bodies and tighten file count assertion

Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd
leaks during test execution. Also tighten TotalFiles check from >= 1
to == 1 since ecSetup uploads exactly one file.
2026-04-07 18:15:53 -07:00
Chris Lu
4bf6d195e4 test: add integration tests for volume and EC scrubbing (#8977)
* test: add integration tests for volume and EC volume scrubbing

Add scrub integration tests covering normal volumes (full data scrub,
corrupt .dat detection, mixed healthy/broken batches, missing volume
error) and EC volumes (INDEX/LOCAL modes on healthy volumes, corrupt
shard detection with broken shard info reporting, corrupt .ecx index,
auto-select, unsupported mode error).

Also adds framework helpers: CorruptDatFile, CorruptEcxFile,
CorruptEcShardFile for fault injection in scrub tests.

* fix: correct dat/ecx corruption helpers and ecx test setup

- CorruptDatFile: truncate .dat to superblock size instead of overwriting
  bytes (ensures scrub detects data file size mismatch)
- TestScrubEcVolumeIndexCorruptEcx: corrupt .ecx before mount so the
  corrupted size is loaded into memory (EC volumes cache ecx size at mount)
2026-04-07 16:31:32 -07:00
Chris Lu
74905c4b5d shell: s3.* commands always output JSON, connection messages to stderr (#8976)
* shell: s3.* commands output JSON, connection messages to stderr

All s3.user.* and s3.policy.attach|detach commands now output structured
JSON to stdout instead of human-readable text:

- s3.user.create: {"name","access_key"} (secret key to stderr only)
- s3.user.list: [{name,status,policies,keys}]
- s3.user.show: {name,status,source,account,policies,credentials,...}
- s3.user.delete: {"name"}
- s3.user.enable/disable: {"name","status"}
- s3.policy.attach/detach: {"policy","user"}

Connection startup messages (master/filer) moved to stderr so they
don't pollute structured output when piping.

Closes #8962 (partial — covers merged s3.user/policy commands).

* shell: fix secret leak, duplicate JSON output, and non-interactive prompt

- s3.user.create: only echo secret key to stderr when auto-generated,
  never echo caller-supplied secrets
- s3.user.enable/disable: fix duplicate JSON output — remove inner
  write in early-return path, keep single write site after gRPC call
- shell_liner: use bufio.Scanner when stdin is not a terminal instead
  of liner.Prompt, suppressing the "> " prompt in piped mode

* shell: check scanner error, idempotent enable output, history errors to stderr

- Check scanner.Err() after non-interactive input loop to surface read errors
- s3.user.enable: always emit JSON regardless of current state (idempotent)
- saveHistory: write error messages to stderr instead of stdout
2026-04-07 16:27:21 -07:00
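
A small sketch of the piped-input path mentioned above (hypothetical shape, not
the shell_liner code): when stdin is not a terminal, lines are read with
bufio.Scanner, no prompt is printed, and scanner errors are surfaced on stderr
so stdout stays clean for JSON; imports: "bufio", "fmt", "os":

    func runNonInteractive(process func(cmd string)) error {
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            process(scanner.Text()) // no "> " prompt in piped mode
        }
        if err := scanner.Err(); err != nil {
            // Report read errors on stderr; structured output stays on stdout.
            fmt.Fprintf(os.Stderr, "reading piped input: %v\n", err)
            return err
        }
        return nil
    }
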
Lars Lehtonen
df619ec3f6 fix(weed/filer/redis2): fix dropped error (#8952)
* fix(weed/filer/redis2): fix dropped error

* fix(weed/filer/redis2): break on non-ErrNotFound errors in ListDirectoryEntries

Without the break, a hard FindEntry error gets overwritten by subsequent
iterations and the function may return nil, silently losing the error.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-07 14:59:01 -07:00
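
A sketch of the fixed listing loop (hypothetical types; Entry and ErrNotFound
stand in for the filer's own): a hard lookup error stops the loop and is
returned, instead of being overwritten by later iterations; imports: "context",
"errors", "path":

    func listDirectoryEntries(
        ctx context.Context,
        dirPath string,
        names []string,
        findEntry func(ctx context.Context, fullPath string) (*Entry, error),
    ) ([]*Entry, error) {
        var entries []*Entry
        for _, name := range names {
            entry, err := findEntry(ctx, path.Join(dirPath, name))
            if err != nil {
                if errors.Is(err, ErrNotFound) {
                    continue // entry vanished between listing and lookup; skip it
                }
                return entries, err // break on hard errors instead of dropping them
            }
            entries = append(entries, entry)
        }
        return entries, nil
    }
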
Chris Lu
fb0573ffc4 shell: rename -force to -apply in s3.iam.import for consistency 2026-04-07 14:17:07 -07:00
Chris Lu
b0e79ad207 fix(admin): respect urlPrefix for root redirect and JS API calls (#8975)
* fix(admin): respect urlPrefix for root redirect and JS API calls (#8967)

Two issues when running admin UI behind a reverse proxy with -urlPrefix:

1. Visiting the prefix path without trailing slash (e.g. /s3-admin) caused
   a redirect to / instead of /s3-admin/ because http.StripPrefix produced
   an empty path that the router redirected to root.

2. Several JavaScript API calls in admin.js used hardcoded paths instead
   of basePath(), causing file upload, download, and preview to fail.

* fix(admin): preserve query params in prefix redirect and use 302

Use http.StatusFound instead of 301 to avoid aggressive browser caching
of a configuration-dependent redirect, and preserve query parameters.
2026-04-07 14:12:05 -07:00
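
A sketch of the prefix redirect (generic net/http handler, not the admin router
verbatim): the bare prefix is redirected to prefix+"/" with 302 so browsers do
not cache a configuration-dependent redirect, and the query string is carried
along; imports: "net/http":

    func redirectBarePrefix(prefix string, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // After http.StripPrefix, a request to the bare prefix arrives with
            // an empty path.
            if r.URL.Path == "" || r.URL.Path == prefix {
                target := prefix + "/"
                if r.URL.RawQuery != "" {
                    target += "?" + r.URL.RawQuery
                }
                http.Redirect(w, r, target, http.StatusFound) // 302, not 301
                return
            }
            next.ServeHTTP(w, r)
        })
    }
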
Chris Lu
2919bb27e5 fix(sync): use per-cluster TLS for HTTP volume connections in filer.sync (#8974)
* fix(sync): use per-cluster TLS for HTTP volume connections in filer.sync (#8965)

When filer.sync runs with -a.security and -b.security flags, only gRPC
connections received per-cluster TLS configuration. HTTP clients for
volume server reads and uploads used a global singleton with the default
security.toml, causing TLS verification failures when clusters use
different self-signed certificates.

Load per-cluster HTTPS client config from the security files and pass
dedicated HTTP clients to FilerSource (for downloads) and FilerSink
(for uploads) so each direction uses the correct cluster's certificates.

* fix(sync): address review feedback for per-cluster HTTP TLS

- Add insecure_skip_verify support to NewHttpClientWithTLS and read it
  from per-cluster security config via https.client.insecure_skip_verify
- Error on partial mTLS config (cert without key or vice versa)
- Add nil-check for client parameter in DownloadFileWithClient
- Document SetUploader as init-only (same pattern as SetChunkConcurrency)
2026-04-07 14:11:44 -07:00
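
A sketch of a per-cluster HTTPS client along the lines described above (the
config parameters here are assumptions, not the actual security.toml keys);
imports: "crypto/tls", "crypto/x509", "fmt", "net/http", "os":

    func newClusterHTTPClient(caFile, certFile, keyFile string, insecureSkipVerify bool) (*http.Client, error) {
        tlsCfg := &tls.Config{InsecureSkipVerify: insecureSkipVerify}
        if caFile != "" {
            pem, err := os.ReadFile(caFile)
            if err != nil {
                return nil, err
            }
            pool := x509.NewCertPool()
            if !pool.AppendCertsFromPEM(pem) {
                return nil, fmt.Errorf("no certificates found in %s", caFile)
            }
            tlsCfg.RootCAs = pool
        }
        // Error on partial mTLS config instead of silently skipping client auth.
        if (certFile == "") != (keyFile == "") {
            return nil, fmt.Errorf("client cert and key must both be set or both be empty")
        }
        if certFile != "" {
            cert, err := tls.LoadX509KeyPair(certFile, keyFile)
            if err != nil {
                return nil, err
            }
            tlsCfg.Certificates = []tls.Certificate{cert}
        }
        return &http.Client{Transport: &http.Transport{TLSClientConfig: tlsCfg}}, nil
    }

Each cluster (source and sink) would get its own client built from its own
security files, so downloads and uploads each verify against the right CA.
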
Chris Lu
d50889002b shell: add s3.iam.*, s3.config.show, s3.user.provision; hide legacy commands (#8956)
* shell: add s3.iam.*, s3.config.show, s3.user.provision; hide legacy commands

Add import/export, configuration summary, and a convenience provisioning
command:

- s3.iam.export: dump full IAM state as JSON (stdout or file)
- s3.iam.import: replace IAM state from a JSON file
- s3.config.show: human-readable summary (users, policies, service
  accounts, groups with status and counts)
- s3.user.provision: one-step user+policy+credentials creation for
  common readonly/readwrite/admin roles

Hide legacy commands from help listing:
- s3.configure: still works but hidden from help output
- s3.bucket.access: still works but hidden from help output

Both hidden commands remain fully functional for existing scripts.

Also adds a Hidden command tag and filters it from printGenericHelp.

* shell: address review feedback for s3.iam.*, s3.config.show, s3.user.provision

- Simplify joinMax using strings.Join
- Fix rolePolicies: remove s3:ListBucket from object-level actions
  (already covered by bucket-level statement)
- Fix admin role: grant s3:* on bucket resource too
- Return flag parse errors instead of swallowing them

* shell: address missed review feedback for PR 3

- s3.iam.import: require -force flag for destructive IAM overwrite
- s3.config.show: add nil guard for resp.Configuration
- s3.user.provision: check if user exists before creating policy
- s3.user.provision: reject wildcard bucket names (* ?)

* shell: distinguish NotFound from transient errors in provision, use %w wrapping

- s3.user.provision: check gRPC status code on GetUser error — only
  proceed on NotFound, abort on transient/network errors
- s3.iam.import: use %w for error wrapping to preserve error chains,
  wrap PutConfiguration error with context

* shell: remove duplicate joinMax after PR 8954 merge

command_s3_helpers.go defined joinMax which is already in
command_s3_user_list.go from the merged PR 8954.

* shell: restrict export file permissions, rollback policy on user create failure

- s3.iam.export: use os.OpenFile with mode 0600 instead of os.Create
  to protect exported credentials from other users
- s3.user.provision: rollback the created policy if CreateUser fails,
  with a warning if the rollback itself fails
2026-04-07 14:10:15 -07:00
Chris Lu
efc7f3936f fix(s3): handle empty URL path in forwarded prefix signature verification (#8973)
fix(s3): handle empty URL path in forwarded prefix signature verification (#8966)

When S3 is behind a reverse proxy with a forwarded prefix (e.g. /s3),
requests with an empty URL path (like ListBuckets) would incorrectly
get a trailing slash appended (e.g. /s3/), causing signature
verification to fail because the client signs /s3 without the slash.
2026-04-07 13:22:21 -07:00
Chris Lu
79a48256f5 fix(s3): populate s3:prefix from query param for ListObjects policy conditions (#8971)
* fix(s3): populate s3:prefix from query param for ListObjects policy conditions (#8969)

ListObjectsV2/V1 requests with prefix-restricted STS session policies
were denied because:
1. s3:prefix was derived from objectKey, which the auth middleware set to
   the prefix value, but the resource ARN then included the prefix
   (e.g. arn:aws:s3:::bucket/prefix) instead of staying at bucket level
   (arn:aws:s3:::bucket) as AWS requires for ListBucket.
2. When objectKey was empty (no middleware propagation), s3:prefix was
   never populated from the query parameter at all.

Now AuthorizeAction extracts the prefix query parameter directly, sets it
as s3:prefix in the request context, and uses a bucket-level resource ARN
when the objectKey matches the propagated prefix.

* fix(s3): use AWS-style wildcard matching for StringLike policy conditions

filepath.Match treats * as not matching /, which breaks IAM StringLike
conditions on paths (e.g. arn:aws:s3:::bucket/* won't match nested keys).
Replace with a case-sensitive variant of AwsWildcardMatch that correctly
treats * as matching any character including /.

* refactor(s3): replace regex wildcard matching with string-based matcher

Use the existing wildcard.MatchesWildcard utility instead of compiling
and caching regexes for IAM wildcard matching. Removes the regexCache,
its mutex, and the sync import.

* refactor(s3): inline and remove AwsWildcardMatch wrapper functions

Replace all call sites with direct wildcard.MatchesWildcard calls.

* fix(s3): scope s3:prefix condition key to list operations only

The s3:prefix logic was running for all actions, so a GetObject on
"foo/bar" would wrongly populate s3:prefix. Restrict it to action "List"
and always reset resourceObjectKey to "" so the resource ARN stays at
bucket level. Also set s3:prefix to "" when no prefix is provided, so
policies with StringEquals {"s3:prefix": ""} evaluate correctly.
2026-04-07 13:21:30 -07:00
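
For reference, a self-contained AWS-style wildcard matcher of the kind the
commit describes (a generic dynamic-programming implementation, not the
wildcard package's code): '*' matches any run of characters including '/', and
'?' matches exactly one character, unlike filepath.Match where '*' stops at
path separators.

    func matchesWildcard(pattern, s string) bool {
        // dp[i][j] reports whether pattern[:i] matches s[:j].
        p, n := len(pattern), len(s)
        dp := make([][]bool, p+1)
        for i := range dp {
            dp[i] = make([]bool, n+1)
        }
        dp[0][0] = true
        for i := 1; i <= p; i++ {
            if pattern[i-1] == '*' {
                dp[i][0] = dp[i-1][0] // '*' may match the empty string
            }
            for j := 1; j <= n; j++ {
                switch pattern[i-1] {
                case '*':
                    dp[i][j] = dp[i-1][j] || dp[i][j-1] // match nothing, or consume s[j-1]
                case '?':
                    dp[i][j] = dp[i-1][j-1]
                default:
                    dp[i][j] = dp[i-1][j-1] && pattern[i-1] == s[j-1]
                }
            }
        }
        return dp[p][n]
    }

With these semantics, matchesWildcard("arn:aws:s3:::bucket/*", "arn:aws:s3:::bucket/a/b")
is true, which is what IAM StringLike conditions on nested keys require.
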
Chris Lu
a4753b6a3b S3: delay empty folder cleanup to prevent Spark write failures (#8970)
* S3: delay empty folder cleanup to prevent Spark write failures (#8963)

Empty folders were being cleaned up within seconds, causing Apache Spark
(s3a) writes to fail when temporary directories like _temporary/0/task_xxx/
were briefly empty.

- Increase default cleanup delay from 5s to 2 minutes
- Only process queue items that have individually aged past the delay
  (previously the entire queue was drained once any item triggered)
- Make the delay configurable via filer.toml:
  [filer.options]
  s3.empty_folder_cleanup_delay = "2m"

* test: increase cleanup wait timeout to match 2m delay

The empty folder cleanup delay was increased to 2 minutes, so the
Spark integration test needs to wait longer for temporary directories
to disappear.

* fix: eagerly clean parent directories after empty folder deletion

After deleting an empty folder, immediately try to clean its parent
rather than relying on cascading metadata events that each re-enter
the 2-minute delay queue. This prevents multi-minute waits when
cleaning nested temporary directory trees (e.g. Spark's _temporary
hierarchy with 3+ levels would take 6m+ vs near-instant).

Fixes the CI failure where lingering _temporary parent directories
were not cleaned within the test's 3-minute timeout.
2026-04-07 13:20:59 -07:00
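
A sketch of the per-item aging queue and the eager parent walk (hypothetical
shape, not the S3 server code; the real delay defaults to 2 minutes and is read
from filer.toml as shown above); imports: "path", "time":

    type pendingFolder struct {
        dir      string
        enqueued time.Time
    }

    // processAged deletes only the folders whose individual age exceeds the
    // delay, and after each successful delete walks upward so an emptied parent
    // does not wait another full delay per nesting level.
    func processAged(queue []pendingFolder, delay time.Duration,
        deleteIfEmpty func(dir string) bool) []pendingFolder {
        now := time.Now()
        var remaining []pendingFolder
        for _, item := range queue {
            if now.Sub(item.enqueued) < delay {
                remaining = append(remaining, item) // not aged yet; keep queued
                continue
            }
            for dir := item.dir; dir != "/" && deleteIfEmpty(dir); dir = path.Dir(dir) {
                // deleteIfEmpty returns false once a non-empty ancestor is reached
            }
        }
        return remaining
    }
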
Chris Lu
761ec7da00 fix(iceberg): use dot separator for namespace paths instead of unit separator (#8960)
* fix(iceberg): use dot separator for namespace paths instead of unit separator

The Iceberg REST Catalog handler was using \x1F (unit separator) to join
multi-level namespaces when constructing S3 location and filer paths. The
S3 Tables storage layer uses "." (dot) as the namespace separator, causing
tables created via the Iceberg REST API to point to different paths than
where S3 Tables actually stores them.

Fixes #8959

* fix(iceberg): use dot separator in log messages for readable namespace output

* fix(iceberg): use path.Join for S3 location path segments

Use path.Join to construct the namespace/table path segments in fallback
S3 locations for robustness and consistency with handleCreateTable.

* test(iceberg): add multi-level namespace integration tests for Spark and Trino

Add regression tests for #8959 that create a two-level namespace (e.g.
"analytics.daily"), create a table under it, insert data, and query it
back. This exercises the dot-separated namespace path construction and
verifies that Spark/Trino can actually read the data at the S3 location
returned by the Iceberg REST API.

* fix(test): enable nested namespace in Trino Iceberg catalog config

Trino requires `iceberg.rest-catalog.nested-namespace-enabled=true` to
support multi-level namespaces. Without this, CREATE SCHEMA with a
dotted name fails with "Nested namespace is not enabled for this catalog".

* fix(test): parse Trino COUNT(*) output as integer instead of substring match

Avoids false matches from strings.Contains(output, "3") by parsing the
actual numeric result with strconv.Atoi and asserting equality.

* fix(test): use separate Trino config for nested namespace test

The nested-namespace-enabled=true setting in Trino changes how SHOW
SCHEMAS works, causing "Internal error" for all tests sharing that
catalog config. Move the flag to a dedicated config used only by
TestTrinoMultiLevelNamespace.

* fix(iceberg): support parent query parameter in ListNamespaces for nested namespaces

Add handling for the Iceberg REST spec's `parent` query parameter in
handleListNamespaces. When Trino has nested-namespace-enabled=true, it
sends `GET /v1/namespaces?parent=<ns>` to list child namespaces. The
parent value is decoded from the Iceberg unit separator format and
converted to a dot-separated prefix for the S3 Tables layer.

Also simplify TestTrinoMultiLevelNamespace to focus on namespace
operations (create, list, show tables) rather than data operations,
since Trino's REST catalog has a non-empty location check that conflicts
with server-side metadata creation.

* fix(test): expand Trino multi-level namespace test and merge config helpers

- Expand TestTrinoMultiLevelNamespace to create a table with explicit
  location, insert rows, query them back, and verify the S3 file path
  contains the dot-separated namespace (not \x1F). This ensures the
  original #8959 bug would be caught by the Trino integration test.
- Merge writeTrinoConfig and writeTrinoNestedNamespaceConfig into a
  single parameterized function using functional options.
2026-04-07 12:21:22 -07:00
Chris Lu
d4548376a1 fix(ec): off-by-one in nLargeBlockRows causes EC read corruption (#8957)
* fix(ec): off-by-one in nLargeBlockRows causes EC read corruption (#8947)

The nLargeBlockRows formula in locateOffset used (shardDatSize-1)/largeBlockLength,
which produces an off-by-one error when shardDatSize is an exact multiple of
largeBlockLength (e.g. a 30GB volume with 10 data shards = 3GB per shard).
This causes needles in the last large block row to be mislocated as small blocks,
reading from completely wrong shard positions and returning garbage data.

Fix: remove the -1 from locateOffset and only apply it in the ecdFileSize fallback
path (old volumes without datFileSize in .vif), where it's needed to handle the
ambiguous case conservatively.

Also fix ReadEcShardNeedle to pass offset=0 to ReadBytes, consistent with the
scrub path, since the bytes buffer already starts at position 0.

* fix: add volume context to EC read errors, remove contextless glog

The glog.Errorf in ReadBytes logged "entry not found" without any volume
ID, making it impossible to identify which volume was affected. Remove
this contextless log and instead add volume ID, needle ID, offset, and
size to the error returned from the EC read path.

The EC scrub callers already wrap errors with volume context.
2026-04-07 12:02:51 -07:00
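
The core of the fix in sketch form (simplified; the real locateOffset also
deals with small blocks and shard indices): with, say, 1GB large blocks and a
3GB shard, the old formula (shardDatSize-1)/largeBlockLength gives 2 rows
instead of 3, so needles in the last large row were mislocated.

    // Row count over the authoritative datFileSize from .vif: no -1, so an
    // exact multiple of largeBlockLength counts all of its rows.
    func nLargeBlockRows(shardDatSize, largeBlockLength int64) int64 {
        return shardDatSize / largeBlockLength
    }

    // Fallback for old volumes without datFileSize in .vif, where the size is
    // inferred from the .ecd file: keep the -1 to resolve the ambiguous
    // exact-multiple case conservatively.
    func nLargeBlockRowsFromEcd(ecdFileSize, largeBlockLength int64) int64 {
        return (ecdFileSize - 1) / largeBlockLength
    }
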
Chris Lu
45bf3ad058 shell: add s3.user.* and s3.policy.attach|detach commands (#8954)
* shell: add s3.user.* and s3.policy.attach|detach commands

Add focused IAM shell commands following a noun-verb model:

- s3.user.create: create user with auto-generated or explicit credentials
- s3.user.list: tabular listing with status, policies, key count
- s3.user.show: detailed user view (status, source, policies, credentials)
- s3.user.delete: delete a user
- s3.user.enable: enable a disabled user
- s3.user.disable: disable a user (preserves credentials and policies)
- s3.policy.attach: attach a named policy to a user
- s3.policy.detach: detach a policy from a user

These commands are thin wrappers over the existing IAM gRPC service,
producing human-readable output instead of raw protobuf text.

This is part of a larger effort to replace the monolithic s3.configure
command with a composable set of single-purpose commands.

* shell: address review feedback for s3.user.* and s3.policy.attach|detach

- Return flag parse errors instead of swallowing them (all commands)
- Use GetConfiguration instead of N+1 GetUser calls in s3.user.list
- Add nil check for resp.Identity in s3.user.show
- Fix GetPolicy error masking in s3.policy.attach (wrap original error)
- Simplify joinMax using strings.Join

* shell: add nil identity guards and wrap gRPC errors

- Add nil check for resp.Identity in policy_attach, policy_detach,
  user_enable, user_disable
- Wrap GetUser errors with user context for better diagnostics
2026-04-07 11:26:57 -07:00
Chris Lu
d123a2768b shell: add s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* commands (#8955)
* shell: add s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* commands

Add credential, anonymous access, and service account management commands:

Access key commands:
- s3.accesskey.create: add credentials to an existing user
- s3.accesskey.list: list access keys for a user (key ID + status)
- s3.accesskey.delete: remove a specific access key
- s3.accesskey.rotate: atomic create-new + delete-old key rotation

Anonymous access commands:
- s3.anonymous.set: set/remove public access on a bucket
- s3.anonymous.get: show anonymous access for a bucket
- s3.anonymous.list: list all buckets with anonymous access

Service account commands:
- s3.serviceaccount.create: create with optional action subset and expiry
- s3.serviceaccount.list: tabular listing, optionally filtered by parent
- s3.serviceaccount.show: detailed view of a service account
- s3.serviceaccount.delete: remove a service account

These replace the credential and anonymous portions of the monolithic
s3.configure and s3.bucket.access commands.

* shell: address review feedback for s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.*

- Return flag parse errors instead of swallowing them (all commands)
- Add action validation in s3.anonymous.set (Read, Write, List, Tagging, Admin)
- Fix s3.serviceaccount.create output: point users to s3.serviceaccount.list
  for the server-assigned ID, since CreateServiceAccountResponse does not
  return it

* shell: fix bucket matching and action validation in s3.anonymous.*

- Use SplitN instead of HasSuffix for bucket name matching to avoid
  false positives when one bucket name is a suffix of another
- Make action validation case-insensitive with canonical normalization

* shell: fix nil panics, dedup actions, validate service account actions

- Fix nil-pointer panic in getOrCreateAnonymousUser when GetUser returns
  err==nil with nil Identity (status.FromError(nil) returns nil status)
- Add nil Identity guards in s3.anonymous.get and s3.anonymous.list
- Deduplicate action values in s3.anonymous.set (e.g. -access Read,Read)
- Add action validation in s3.serviceaccount.create with case normalization

* shell: dedup actions and reject negative expiry in s3.serviceaccount.create

- Deduplicate -actions values (e.g. Read,read,Read produces one entry)
- Reject negative -expiry values instead of silently treating as no expiration
2026-04-07 11:20:15 -07:00
Chris Lu
733517df30 fix(s3): s3:PutObject bucket policy now implicitly allows multipart uploads (#8968)
* fix(s3): s3:PutObject bucket policy now implicitly allows multipart uploads

The PolicyEngine.evaluateStatement() method used raw regex matching for
actions, bypassing the multipart-inherits-PutObject logic that only
existed in the unused CompiledStatement.MatchesAction() code path.

When a bucket policy granted only s3:PutObject, multipart upload
operations (CreateMultipartUpload, UploadPart, CompleteMultipartUpload,
etc.) were denied, forcing users to explicitly list every multipart
action.

Fixes https://github.com/seaweedfs/seaweedfs/discussions/8751

* fix(s3): add s3:UploadPartCopy to multipartActionSet and improve test coverage

Add missing S3_ACTION_UPLOAD_PART_COPY constant and include it in
multipartActionSet so UploadPartCopy is implicitly allowed by s3:PutObject.

Also add a bucket-ARN sub-test for ListBucketMultipartUploads to verify
that an object-only resource pattern does not match bucket-level requests.
2026-04-07 11:13:29 -07:00
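
A sketch of the expansion as described (the exact membership of the action set
is illustrative, and statementMatches stands in for the engine's own matcher):
multipart operations fall back to the statement's s3:PutObject grant, and the
lookup is case-insensitive because IAM action names are; imports: "strings":

    var multipartActionSet = map[string]bool{
        "s3:createmultipartupload":   true,
        "s3:uploadpart":              true,
        "s3:uploadpartcopy":          true,
        "s3:completemultipartupload": true,
        "s3:abortmultipartupload":    true,
    }

    func actionAllowed(statementMatches func(action string) bool, requested string) bool {
        if statementMatches(requested) {
            return true
        }
        // Only consult the PutObject grant when the request is a multipart action.
        if multipartActionSet[strings.ToLower(requested)] {
            return statementMatches("s3:PutObject")
        }
        return false
    }
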
Chris Lu
0fed72d95a volume.tier.move: fulfill target replication before deleting old replicas (#8950)
* volume.tier.move: fulfill target replication before deleting old replicas

When -toReplication is specified, volume.tier.move now creates all
required replicas on the destination tier before deleting old replicas.
This closes the data-loss window where only one copy existed on the
target tier while awaiting volume.fix.replication.

If replication fulfillment fails, old replicas are preserved and marked
writable so the volume remains accessible.

Also extracts replicateVolumeToServer and configureVolumeReplication
helpers to reduce duplication across volume.tier.move and
volume.fix.replication.

Fixes #8937

* volume.tier.move: always fulfill replication before deleting old replicas

When -toReplication is specified, use that replication setting.
Otherwise, read the volume's existing replication from the super block.
In both cases, all required replicas are created on the destination
tier before old replicas are deleted.

If replication fulfillment fails (e.g. not enough destination nodes),
old replicas are preserved and marked writable so no data is lost.

* volume.tier.move: address review feedback on ensureReplicationFulfilled

- Add 5s delay before re-collecting topology to allow master heartbeat
  propagation after the move
- Add nil guard for targetTierReplicas to prevent panic if the moved
  replica is not yet visible in the topology
- Treat configureVolumeReplication failure as a hard error instead of a
  warning, so the rollback logic preserves old replicas

* volume.tier.move: harden replication config error handling

- Make configureVolumeReplication failure on the primary moved replica a
  hard error that aborts the move, instead of logging and continuing
- Configure replication metadata on all existing target-tier replicas
  (not just newly created ones) when -toReplication is specified
- Deletion of old replicas cannot affect new replicas since the
  locations list only contains pre-move servers (verified, no change)

* volume.tier.move: fix cleanup deleting fulfilled replicas and broken recovery

Fix 1: The cleanup loop now preserves pre-existing target-tier replicas
that ensureReplicationFulfilled counted toward the replication target.
Previously, a mixed-tier volume with an existing replica on the target
tier could have that replica deleted right after being counted as
fulfilled, leaving the volume under-replicated.

ensureReplicationFulfilled now returns a preserveServers set that the
deletion loop checks before removing any old replica.

Fix 2: Failure paths after LiveMoveVolume (which deletes the source
replica) now use restoreSurvivingReplicasWritable instead of
markVolumeReplicasWritable. The old helper stopped on first error, so
attempting to mark the already-deleted source writable would prevent
all surviving replicas from being restored. The new helper skips the
deleted source and continues through all remaining locations, logging
per-replica errors instead of aborting.

* volume.tier.move: mark preserved replicas writable, skip nodes with existing volume

Fix 1: Preserved pre-existing target-tier replicas were left read-only
after the move completed. They were marked read-only at the start
(along with all other replicas) but never restored since the old code
deleted them. Now they are explicitly marked writable before cleanup.

Fix 2: The fulfillment loop could pick a candidate node that already
hosts this volume on a different disk type, causing a VolumeCopy
conflict. Added a guard that skips any node already hosting the volume
(on any disk) before attempting replication.
2026-04-06 14:55:37 -07:00
dependabot[bot]
d0692f14ad build(deps): bump github.com/aws/aws-sdk-go-v2/credentials from 1.19.13 to 1.19.14 (#8942)
build(deps): bump github.com/aws/aws-sdk-go-v2/credentials

Bumps [github.com/aws/aws-sdk-go-v2/credentials](https://github.com/aws/aws-sdk-go-v2) from 1.19.13 to 1.19.14.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/credentials/v1.19.13...credentials/v1.19.14)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/credentials
  dependency-version: 1.19.14
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 13:26:02 -07:00
Chris Lu
69218c88fe fix(stats): fix build on openbsd, solaris, and windows (#8951)
fix(stats): replace undefined calculateDiskRemaining with inline calculation

disk_openbsd.go, disk_solaris.go, and disk_windows.go all call
calculateDiskRemaining() which is never defined, causing build failures
on those platforms. Replace with the same inline calculation used in
disk_supported.go.
2026-04-06 12:48:48 -07:00
Chris Lu
b0a4647d87 fix: prevent stack overflow in ECBalanceTask.reportProgress (#8949)
* fix: prevent stack overflow in ECBalanceTask.reportProgress

Add re-entry guard to reportProgress() to prevent infinite recursion.
The progressCallback invoked by ReportProgressWithStage can re-enter
reportProgress, causing a stack overflow that crashes the worker process
(goroutine stack exceeds 1GB limit after ~22M frames).

* fix: use atomics for progress and re-entry guard to avoid data races

Address review feedback: GetProgress() can be called from a different
goroutine while reportProgress is updating the value. Use atomic
operations for both the progress field (via Float64bits/Float64frombits)
and the reporting re-entry guard (via CompareAndSwap).
2026-04-06 12:26:38 -07:00
dependabot[bot]
83a632669a build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.96.0 to 1.98.0 (#8943)
build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3

Bumps [github.com/aws/aws-sdk-go-v2/service/s3](https://github.com/aws/aws-sdk-go-v2) from 1.96.0 to 1.98.0.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/service/s3/v1.96.0...service/s3/v1.98.0)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/service/s3
  dependency-version: 1.98.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:51:10 -07:00
dependabot[bot]
331d76e024 build(deps): bump google.golang.org/api from 0.267.0 to 0.274.0 (#8945)
Bumps [google.golang.org/api](https://github.com/googleapis/google-api-go-client) from 0.267.0 to 0.274.0.
- [Release notes](https://github.com/googleapis/google-api-go-client/releases)
- [Changelog](https://github.com/googleapis/google-api-go-client/blob/main/CHANGES.md)
- [Commits](https://github.com/googleapis/google-api-go-client/compare/v0.267.0...v0.274.0)

---
updated-dependencies:
- dependency-name: google.golang.org/api
  dependency-version: 0.274.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:49 -07:00
dependabot[bot]
2b73db9c71 build(deps): bump go.etcd.io/etcd/client/v3 from 3.6.9 to 3.6.10 (#8944)
Bumps [go.etcd.io/etcd/client/v3](https://github.com/etcd-io/etcd) from 3.6.9 to 3.6.10.
- [Release notes](https://github.com/etcd-io/etcd/releases)
- [Commits](https://github.com/etcd-io/etcd/compare/v3.6.9...v3.6.10)

---
updated-dependencies:
- dependency-name: go.etcd.io/etcd/client/v3
  dependency-version: 3.6.10
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:41 -07:00
dependabot[bot]
9a7c731e68 build(deps): bump github.com/hashicorp/vault/api from 1.22.0 to 1.23.0 (#8941)
Bumps [github.com/hashicorp/vault/api](https://github.com/hashicorp/vault) from 1.22.0 to 1.23.0.
- [Release notes](https://github.com/hashicorp/vault/releases)
- [Changelog](https://github.com/hashicorp/vault/blob/main/CHANGELOG-v1.10-v1.15.md)
- [Commits](https://github.com/hashicorp/vault/compare/api/v1.22.0...api/v1.23.0)

---
updated-dependencies:
- dependency-name: github.com/hashicorp/vault/api
  dependency-version: 1.23.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:21 -07:00
dependabot[bot]
5c9d3949be build(deps): bump actions/upload-artifact from 4 to 7 (#8940)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:13 -07:00
dependabot[bot]
7dd6d5547e build(deps): bump docker/login-action from 4.0.0 to 4.1.0 (#8939)
Bumps [docker/login-action](https://github.com/docker/login-action) from 4.0.0 to 4.1.0.
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](https://github.com/docker/login-action/compare/v4...v4.1.0)

---
updated-dependencies:
- dependency-name: docker/login-action
  dependency-version: 4.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:06 -07:00
dependabot[bot]
b201386c8c build(deps): bump actions/download-artifact from 4 to 8 (#8938)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 8.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v4...v8)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-06 10:50:00 -07:00
Mmx233
3cea900241 fix: replication sinks upload ciphertext for SSE-encrypted objects (#8931)
* fix: decrypt SSE-encrypted objects in S3 replication sink

* fix: add SSE decryption support to GCS, Azure, B2, Local sinks

* fix: return error instead of warning for SSE-C objects during replication

* fix: close readers after upload to prevent resource leaks

* fix: return error for unknown SSE types instead of passing through ciphertext

* refactor(repl_util): extract CloseReader/CloseMaybeDecryptedReader helpers

The io.Closer close-on-error and defer-close pattern was duplicated in
copyWithDecryption and the S3 sink. Extract exported helpers to keep a
single implementation and prevent future divergence.

* fix(repl_util): warn on mixed SSE types across chunks in detectSSEType

detectSSEType previously returned the SSE type of the first encrypted
chunk without inspecting the rest. If an entry somehow has chunks with
different SSE types, only the first type's decryption would be applied.
Now scans all chunks and logs a warning on mismatch.

* fix(repl_util): decrypt inline SSE objects during replication

Small SSE-encrypted objects stored in entry.Content were being copied
as ciphertext because:
1. detectSSEType only checked chunk metadata, but inline objects have
   no chunks — now falls back to checking entry.Extended for SSE keys
2. Non-S3 sinks short-circuited on len(entry.Content)>0, bypassing
   the decryption path — now call MaybeDecryptContent before writing

Adds MaybeDecryptContent helper for decrypting inline byte content.

* fix(repl_util): add KMS initialization for replication SSE decryption

SSE-KMS decryption was not wired up for filer.backup — the only
initialization was for SSE-S3 key manager. CreateSSEKMSDecryptedReader
requires a global KMS provider which is only loaded by the S3 API
auth-config path.

Add InitializeSSEForReplication helper that initializes both SSE-S3
(from filer KEK) and SSE-KMS (from Viper config [kms] section /
WEED_KMS_* env vars). Replace the SSE-S3-only init in filer_backup.go.

* fix(replicator): initialize SSE decryption for filer.replicate

The SSE decryption setup was only added to filer_backup.go, but the
notification-based replicator (filer.replicate) uses the same sinks
and was missing the required initialization. Add SSE init in
NewReplicator so filer.replicate can decrypt SSE objects.

* refactor(repl_util): fold entry param into CopyFromChunkViews

Remove the CopyFromChunkViewsWithEntry wrapper and add the entry
parameter directly to CopyFromChunkViews, since all callers already
pass it.

* fix(repl_util): guard SSE init with sync.Once, error on mixed SSE types

InitializeWithFiler overwrites the global superKey on every call.
Wrap InitializeSSEForReplication with sync.Once so repeated calls
(e.g. from NewReplicator) are safe.

detectSSEType now returns an error instead of logging a warning when
chunks have inconsistent SSE types, so replication aborts rather than
silently applying the wrong decryption to some chunks.

* fix(repl_util): allow SSE init retry, detect conflicting metadata, add tests

- Replace sync.Once with mutex+bool so transient failures (e.g. filer
  unreachable) don't permanently prevent initialization. Only successful
  init flips the flag; failed attempts allow retries.

- Remove v.IsSet("kms") guard that prevented env-only KMS configs
  (WEED_KMS_*) from being detected. Always attempt KMS loading and let
  LoadConfigurations handle "no config found".

- detectSSEType now checks for conflicting extended metadata keys
  (e.g. both SeaweedFSSSES3Key and SeaweedFSSSEKMSKey present) and
  returns an error instead of silently picking the first match.

- Add table-driven tests for detectSSEType, MaybeDecryptReader, and
  MaybeDecryptContent covering plaintext, uniform SSE, mixed chunks,
  inline SSE via extended metadata, conflicting metadata, and SSE-C.

* test(repl_util): add SSE-S3 and SSE-KMS integration tests

Add round-trip encryption/decryption tests:
- SSE-S3: encrypt with CreateSSES3EncryptedReader, decrypt with
  CreateSSES3DecryptedReader, verify plaintext matches
- SSE-KMS: encrypt with AES-CTR, wire a mock KMSProvider via
  SetGlobalKMSProvider, build serialized KMS metadata, verify
  MaybeDecryptReader and MaybeDecryptContent produce correct plaintext

Fix existing tests to check io.ReadAll errors.

* test(repl_util): exercise full SSE-S3 path through MaybeDecryptReader

Replace direct CreateSSES3DecryptedReader calls with end-to-end tests
that go through MaybeDecryptReader → decryptSSES3 →
DeserializeSSES3Metadata → GetSSES3IV → CreateSSES3DecryptedReader.

Uses WEED_S3_SSE_KEK env var + a mock filer client to initialize the
global key manager with a test KEK, then SerializeSSES3Metadata to
build proper envelope-encrypted metadata. Cleanup restores the key
manager state.

* fix(localsink): write to temp file to prevent truncated replicas

The local sink truncated the destination file before writing content.
If decryption or chunk copy failed, the file was left empty/truncated,
destroying the previous replica.

Write to a temp file in the same directory and atomically rename on
success. On any error the temp file is cleaned up and the existing
replica is untouched.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-06 00:32:27 -07:00
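
A sketch of the retryable init guard described above (hypothetical names): a
sync.Once would latch even when initialization fails, so a mutex plus a success
flag is used instead and only a successful attempt flips the flag; imports:
"sync":

    var (
        sseInitMu   sync.Mutex
        sseInitDone bool
    )

    func initializeSSEForReplicationOnce(initialize func() error) error {
        sseInitMu.Lock()
        defer sseInitMu.Unlock()
        if sseInitDone {
            return nil // a previous call already succeeded
        }
        if err := initialize(); err != nil {
            return err // leave the flag unset so the next caller retries
        }
        sseInitDone = true
        return nil
    }
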
Chris Lu
7ab6306e15 fix(kafka): resolve consumer group resumption timeout in e2e tests (#8935)
* fix(kafka): resolve consumer group resumption timeout in e2e tests

Three issues caused ConsumerGroupResumption to time out when the second
consumer tried to resume from committed offsets:

1. ForceCompleteRebalance deadlock: performCleanup() held group.Mu.Lock
   then called ForceCompleteRebalance() which tried to acquire the same
   lock — a guaranteed deadlock on Go's non-reentrant sync.Mutex. Fixed
   by requiring callers to hold the lock (matching actual call sites).

2. Unbounded fallback fetch: when the multi-batch fetch timed out, the
   fallback GetStoredRecords call used the connection context (no
   deadline). A slow broker gRPC call could block the data-plane
   goroutine indefinitely, causing head-of-line blocking for all
   responses on that connection. Fixed with a 10-second timeout.

3. HWM lookup failure caused empty responses: after a consumer leaves
   and the partition is deactivated, GetLatestOffset can fail. The
   fetch handler treated this as "no data" and entered the long-poll
   loop (up to 10s × 4 retries = 40s timeout). Fixed by assuming data
   may exist when HWM lookup fails, so the actual fetch determines
   availability.

* fix(kafka): address review feedback on HWM sentinel and fallback timeout

- Don't expose synthetic HWM (requestedOffset+1) to clients; keep
  result.highWaterMark at 0 when the real HWM lookup fails.
- Tie fallback timeout to client's MaxWaitTime instead of a fixed 10s,
  so one slow partition doesn't hold the reader beyond the request budget.

* fix(kafka): use large HWM sentinel and clamp fallback timeout

- Use requestedOffset+10000 as sentinel HWM instead of +1, so
  FetchMultipleBatches doesn't artificially limit to 1 record.
- Add 2s floor to fallback timeout so disk reads via gRPC have
  a reasonable chance even when maxWaitMs is small or zero.

* fix(kafka): use MaxInt64 sentinel and derive HWM from fetch result

- Use math.MaxInt64 as HWM sentinel to avoid integer overflow risk
  (previously requestedOffset+10000 could wrap on large offsets).
- After the fetch, derive a meaningful HWM from newOffset so the
  client never sees MaxInt64 or 0 in the response.

* fix(kafka): use remaining time budget for fallback fetch

The fallback was restarting the full maxWaitMs budget even though the
multi-batch fetch already consumed part of it. Now compute remaining
time from either the parent context deadline or maxWaitMs minus
elapsed, skip the fallback if budget is exhausted, and clamp to
[2s, 10s] bounds.
2026-04-05 20:13:57 -07:00
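
A sketch of the remaining-budget calculation from the last item (hypothetical
shape; the 2s floor and 10s ceiling are the bounds named in the commit);
imports: "context", "time":

    // fallbackTimeout returns how long the fallback GetStoredRecords call may
    // take, or ok=false when the request budget is already exhausted.
    func fallbackTimeout(ctx context.Context, maxWait, elapsed time.Duration) (time.Duration, bool) {
        remaining := maxWait - elapsed
        if deadline, hasDeadline := ctx.Deadline(); hasDeadline {
            if until := time.Until(deadline); until < remaining {
                remaining = until
            }
        }
        if remaining <= 0 {
            return 0, false // skip the fallback entirely
        }
        const floor, ceiling = 2 * time.Second, 10 * time.Second
        if remaining < floor {
            remaining = floor // give disk reads over gRPC a reasonable chance
        }
        if remaining > ceiling {
            remaining = ceiling // never hold the data-plane goroutine too long
        }
        return remaining, true
    }
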
Chris Lu
72eb93919c fix(gcssink): prevent empty object finalization on write failure (#8933)
* fix(gcssink): prevent empty object finalization on write failure

The GCS writer was created unconditionally with defer wc.Close(),
which finalizes the upload even when content decryption or copy
fails. This silently overwrites valid objects with empty data.
Remove the unconditional defer, explicitly close on success to
propagate errors, and delete the object on write failure.

* fix(gcssink): use context cancellation instead of obj.Delete on failure

obj.Delete() after a failed write would delete the existing object at
that key, causing data loss on updates. Use a cancelable context
instead — cancelling before Close() aborts the GCS upload without
touching any pre-existing object.
2026-04-05 16:07:49 -07:00
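
A sketch of the abort-on-failure pattern (the cloud.google.com/go/storage
writer API is real; the surrounding sink code is paraphrased, and write stands
in for the decrypt-and-copy step): cancelling the writer's context before Close
abandons the upload, so nothing is finalized and no pre-existing object is
touched; imports: "context", "io", "cloud.google.com/go/storage":

    func writeObject(ctx context.Context, obj *storage.ObjectHandle, write func(io.Writer) error) error {
        writeCtx, cancel := context.WithCancel(ctx)
        defer cancel()
        wc := obj.NewWriter(writeCtx)
        if err := write(wc); err != nil {
            cancel()        // abort the upload; the object at this key is untouched
            _ = wc.Close()  // error intentionally ignored; the write error is what matters
            return err
        }
        return wc.Close() // only a successful Close finalizes the new object
    }
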
Chris Lu
4fd974b16b fix(azuresink): delete freshly created blob on write failure (#8934)
* fix(azuresink): delete freshly created blob on write failure

appendBlobClient.Create() runs before content decryption and copy.
If MaybeDecryptContent or CopyFromChunkViews fails, an empty blob
is left behind, silently replacing any previous valid data. Add
cleanup that deletes the blob on content write errors when we were
the ones who created it.

* fix(azuresink): track recreated blobs for cleanup on write failure

handleExistingBlob deletes and recreates the blob when overwrite is
needed, but freshlyCreated was only set on the initial Create success
path. Set freshlyCreated = needsWrite after handleExistingBlob so
recreated blobs are also cleaned up on content write failure.
2026-04-05 16:07:34 -07:00
Chris Lu
b8fc99a9cd fix(s3): apply PutObject multipart expansion to STS session policies (#8932)
* fix(s3): apply PutObject multipart expansion to STS session policy evaluation (#8929)

PR #8445 added logic to implicitly grant multipart upload actions when
s3:PutObject is authorized, but only in the S3 API policy engine's
CompiledStatement.MatchesAction(). STS session policies are evaluated
through the IAM policy engine's matchesActions() -> awsIAMMatch() path,
which did plain pattern matching without the multipart expansion.

Add the same multipart expansion logic to the IAM policy engine's
matchesActions() so that session policies containing s3:PutObject
correctly allow multipart upload operations.

* fix: make multipart action set lookup case-insensitive and optimize

Address PR review feedback:
- Lowercase multipartActionSet keys and use strings.ToLower for lookup,
  since AWS IAM actions are case-insensitive
- Only check for s3:PutObject permission when the requested action is
  actually a multipart action, avoiding unnecessary awsIAMMatch calls
- Add test case for case-insensitive multipart action matching
2026-04-05 14:06:50 -07:00
Mmx233
69cd5fa37b fix: S3 sink puts all entry.Extended into Tagging header instead of only object tags (#8930)
* test: add failing tests for S3 sink buildTaggingString

* fix: S3 sink should only put object tags into Tagging header

* fix: avoid sending empty x-amz-tagging header
2026-04-05 12:16:04 -07:00
Chris Lu
076d504044 fix(admin): reduce memory usage and verbose logging for large clusters (#8927)
* fix(admin): reduce memory usage and verbose logging for large clusters (#8919)

The admin server used excessive memory and produced thousands of log lines
on clusters with many volumes (e.g., 33k volumes). Three root causes:

1. Scanner duplicated all volume metrics: getVolumeHealthMetrics() created
   VolumeHealthMetrics objects, then convertToTaskMetrics() copied them all
   into identical types.VolumeHealthMetrics. Now uses the task-system type
   directly, eliminating the duplicate allocation and removing convertToTaskMetrics.

2. All previous task states loaded at startup: LoadTasksFromPersistence read
   and deserialized every .pb file from disk, logging each one. With thousands
   of balance tasks persisted, this caused massive startup I/O, memory usage,
   and log noise (including unguarded DEBUG glog.Infof per task). Now starts
   with an empty queue — the scanner re-detects current needs from live cluster
   state. Terminal tasks are purged from memory and disk when new scan results
   arrive.

3. Verbose per-volume/per-node logging: V(2) and V(3) logs produced thousands
   of lines per scan. Per-volume logs bumped to V(4), per-node/rack/disk logs
   bumped to V(3). Topology summary now logs counts instead of full node ID arrays.

Also removes lastTopologyInfo field from MaintenanceScanner — the raw protobuf
topology is returned as a local value and not retained between 30-minute scans.

* fix(admin): delete stale task files at startup, add DeleteAllTaskStates

Old task .pb files from previous runs were left on disk. The periodic
CleanupCompletedTasks still loads all files to find completed ones —
the same expensive 4GB path from the pprof profile.

Now at startup, DeleteAllTaskStates removes all .pb files by scanning
the directory without reading or deserializing them. The scanner will
re-detect any tasks still needed from live cluster state.

* fix(admin): don't persist terminal tasks to disk

CompleteTask was saving failed/completed tasks to disk where they'd
accumulate. The periodic cleanup only triggered for completed tasks,
not failed ones. Now terminal tasks are deleted from disk immediately
and only kept in memory for the current session's UI.

* fix(admin): cap in-memory tasks to 100 per job type

Without a limit, the task map grows unbounded — balance could create
thousands of pending tasks for a cluster with many imbalanced volumes.
Now AddTask rejects new tasks when a job type already has 100 in the
queue. The scanner will re-detect skipped volumes on the next scan.

* fix(admin): address PR review - memory-only purge, active-only capacity

- purgeTerminalTasks now only cleans in-memory map (terminal tasks are
  already deleted from disk by CompleteTask)
- Per-type capacity limit counts only active tasks (pending/assigned/
  in_progress), not terminal ones
- When at capacity, purge terminal tasks first before rejecting

* fix(admin): fix orphaned comment, add TaskStatusCancelled to terminal switch

- Move hasQueuedOrActiveTaskForVolume comment to its function definition
- Add TaskStatusCancelled to the terminal state switch in CompleteTask
  so cancelled task files are deleted from disk
2026-04-04 18:45:57 -07:00
Chris Lu
2c8a1ea6cc fix(docker): disable glibc _FORTIFY_SOURCE for aarch64-musl cross builds
When cross-compiling aws-lc-sys for aarch64-unknown-linux-musl using
aarch64-linux-gnu-gcc, glibc's _FORTIFY_SOURCE generates calls to
__memcpy_chk, __fprintf_chk etc. which don't exist in musl, causing
linker errors. Disable it via CFLAGS_aarch64_unknown_linux_musl.
2026-04-04 14:25:05 -07:00
Chris Lu
4efe0acaf5 fix(master): fast resume state and default resumeState to true (#8925)
* fix(master): fast resume state and default resumeState to true

When resumeState is enabled in single-master mode, the raft server had
existing log entries so the self-join path couldn't promote to leader.
The server waited the full election timeout (10-20s) before self-electing.

Fix by temporarily setting election timeout to 1ms before Start() when
in single-master + resumeState mode with existing log, then restoring
the original timeout after leader election. This makes resume near-instant.

Also change the default for resumeState from false to true across all
CLI commands (master, mini, server) so state is preserved by default.

* fix(master): prevent fastResume goroutine from hanging forever

Use defer to guarantee election timeout is always restored, and bound
the polling loop with a timeout so it cannot spin indefinitely if
leader election never succeeds.

* fix(master): use ticker instead of time.After in fastResume polling loop
2026-04-04 14:15:56 -07:00
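
A sketch of the bounded polling loop (hypothetical helpers; the 1ms temporary
election timeout comes from the commit, while the 50ms poll interval and 30s
overall bound are assumptions); imports: "time":

    func fastResume(setElectionTimeout func(time.Duration), isLeader func() bool, original time.Duration) {
        setElectionTimeout(1 * time.Millisecond)
        defer setElectionTimeout(original) // always restored, even if we give up

        ticker := time.NewTicker(50 * time.Millisecond)
        defer ticker.Stop()
        deadline := time.After(30 * time.Second)
        for {
            select {
            case <-ticker.C:
                if isLeader() {
                    return // leader elected almost immediately with the 1ms timeout
                }
            case <-deadline:
                return // never spin forever; fall back to normal election timing
            }
        }
    }
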
Chris Lu
0da1794856 fix(rust): remove transitive openssl dependency from seaweed-volume
reqwest's default features include native-tls which depends on
openssl-sys, causing builds to fail on musl targets where OpenSSL
headers are not available. Since we already use rustls-tls, disable
default features to eliminate the openssl-sys dependency entirely.
2026-04-04 14:07:01 -07:00
Chris Lu
47baf6c841 fix(docker): add Rust volume server pre-build to latest and dev container workflows
Both container_latest.yml and container_dev.yml use Dockerfile.go_build
which expects weed-volume-prebuilt/ with pre-compiled Rust binaries, but
neither workflow produced them, causing COPY failures during docker build.

Add build-rust-binaries jobs that natively cross-compile for amd64 and
arm64, then download and place the artifacts in the Docker build context.
Also fix the trivy-scan local build path in container_latest.yml.
2026-04-04 13:53:13 -07:00
Chris Lu
d37b592bc4 Update object_store_users_templ.go 2026-04-04 11:52:57 -07:00
Chris Lu
896114d330 fix(admin): fix master leader link showing incorrect port in Admin UI (#8924)
fix(admin): use gRPC address for current server in RaftListClusterServers

The old Raft implementation was returning the HTTP address
(ms.option.Master) for the current server, while peers used gRPC
addresses (peer.ConnectionString). The Admin UI's GetClusterMasters()
converts all addresses from gRPC to HTTP via GrpcAddressToServerAddress
(port - 10000), which produced a negative port (-667) for the current
server since its address was already in HTTP format (port 9333).

Use ToGrpcAddress() for consistency with both HashicorpRaft (which
stores gRPC addresses) and old Raft peers.

Fixes #8921
2026-04-04 11:50:43 -07:00
Chris Lu
f6df7126b6 feat(admin): add profiling options for debugging high memory/CPU usage (#8923)
* feat(admin): add profiling options for debugging high memory/CPU usage

Add -debug, -debug.port, -cpuprofile, and -memprofile flags to the admin
command, matching the profiling support already available in master, volume,
and other server commands. This enables investigation of resource usage
issues like #8919.

* refactor(admin): move profiling flags into AdminOptions struct

Move cpuprofile and memprofile flags from global variables into the
AdminOptions struct and init() function for consistency with other flags.

* fix(debug): bind pprof server to localhost only and document profiling flags

StartDebugServer was binding to all interfaces (0.0.0.0), exposing
runtime profiling data to the network. Restrict to 127.0.0.1 since
this is a development/debugging tool.

Also add a "Debugging and Profiling" section to the admin command's
help text documenting the new flags.
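
The loopback-only debug server is essentially the standard net/http/pprof
pattern; the port handling and function name below are illustrative:

    // Sketch only: pprof handlers bound to 127.0.0.1 so profiling data is
    // not reachable from other hosts.
    import (
        "fmt"
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on the default mux
    )

    func startDebugServer(port int) {
        go func() {
            addr := fmt.Sprintf("127.0.0.1:%d", port)
            if err := http.ListenAndServe(addr, nil); err != nil {
                log.Printf("pprof server stopped: %v", err)
            }
        }()
    }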
2026-04-04 10:05:19 -07:00
Chris Lu
9add18e169 fix(volume-rust): fix volume balance between Go and Rust servers (#8915)
Two bugs prevented reliable volume balancing when a Rust volume server
is the copy target:

1. find_last_append_at_ns returned None for delete tombstones (Size==0
   in dat header), falling back to file mtime truncated to seconds.
   This caused the tail step to re-send needles from the last sub-second
   window. Fix: change the guard from `needle_size <= 0` to `needle_size < 0`,
   since Size==0 delete needles still have a valid timestamp in their tail.

2. VolumeTailReceiver called read_body_v2 on delete needles, which have
   no DataSize/Data/flags — only checksum+timestamp+padding after the
   header. Fix: skip read_body_v2 when size == 0, reject negative sizes.

Also:
- Unify gRPC server bind: use TcpListener::bind before spawn for both
  TLS and non-TLS paths, propagating bind errors at startup.
- Add mixed Go+Rust cluster test harness and integration tests covering
  VolumeCopy in both directions, copy with deletes, and full balance
  move with tail tombstone propagation and source deletion.
- Make FindOrBuildRustBinary configurable for default vs no-default
  features (4-byte vs 5-byte offsets).
2026-04-04 09:13:23 -07:00
Chris Lu
d1823d3784 fix(s3): include static identities in listing operations (#8903)
* fix(s3): include static identities in listing operations

Static identities loaded from -s3.config file were only stored in the
S3 API server's in-memory state. Listing operations (s3.configure shell
command, aws iam list-users) queried the credential manager which only
returned dynamic identities from the backend store.

Register static identities with the credential manager after loading
so they are included in LoadConfiguration and ListUsers results, and
filtered out before SaveConfiguration to avoid persisting them to the
dynamic store.

Fixes https://github.com/seaweedfs/seaweedfs/discussions/8896

* fix: avoid mutating caller's config and defensive copies

- SaveConfiguration: use shallow struct copy instead of mutating the
  caller's config.Identities field
- SetStaticIdentities: skip nil entries to avoid panics
- GetStaticIdentities: defensively copy PolicyNames slice to avoid
  aliasing the original

* fix: filter nil static identities and sync on config reload

- SetStaticIdentities: filter nil entries from the stored slice (not
  just from staticNames) to prevent panics in LoadConfiguration/ListUsers
- Extract updateCredentialManagerStaticIdentities helper and call it
  from both startup and the grace.OnReload handler so the credential
  manager's static snapshot stays current after config file reloads

* fix: add mutex for static identity fields and fix ListUsers for store callers

- Add sync.RWMutex to protect staticIdentities/staticNames against
  concurrent reads during config reload
- Revert CredentialManager.ListUsers to return only store users, since
  internal callers (e.g. DeletePolicy) look up each user in the store
  and fail on non-existent static entries
- Merge static usernames in the filer gRPC ListUsers handler instead,
  via the new GetStaticUsernames method
- Fix CI: TestIAMPolicyManagement/managed_policy_crud_lifecycle was
  failing because DeletePolicy iterated static users that don't exist
  in the store

* fix: show static identities in admin UI and weed shell

The admin UI and weed shell s3.configure command query the filer's
credential manager via gRPC, which is a separate instance from the S3
server's credential manager. Static identities were only registered
on the S3 server's credential manager, so they never appeared in the
filer's responses.

- Add CredentialManager.LoadS3ConfigFile to parse a static S3 config
  file and register its identities
- Add FilerOptions.s3ConfigFile so the filer can load the same static
  config that the S3 server uses
- Wire s3ConfigFile through in weed mini and weed server modes
- Merge static usernames in filer gRPC ListUsers handler
- Add CredentialManager.GetStaticUsernames helper
- Add sync.RWMutex to protect concurrent access to static identity
  fields
- Avoid importing weed/filer from weed/credential (which pulled in
  filer store init() registrations and broke test isolation)
- Add docker/compose/s3_static_users_example.json

* fix(admin): make static users read-only in admin UI

Static users loaded from the -s3.config file should not be editable
or deletable through the admin UI since they are managed via the
config file.

- Add IsStatic field to ObjectStoreUser, set from credential manager
- Hide edit, delete, and access key buttons for static users in the
  users table template
- Show a "static" badge next to static user names
- Return 403 Forbidden from UpdateUser and DeleteUser API handlers
  when the target user is a static identity

* fix(admin): show details for static users

GetObjectStoreUserDetails called credentialManager.GetUser which only
queries the dynamic store. For static users this returned
ErrUserNotFound. Fall back to GetStaticIdentity when the store lookup
fails.

* fix(admin): load static S3 identities in admin server

The admin server has its own credential manager (gRPC store) which is
a separate instance from the S3 server's and filer's. It had no static
identity data, so IsStaticIdentity returned false (edit/delete buttons
shown) and GetStaticIdentity returned nil (details page failed).

Pass the -s3.config file path through to the admin server and call
LoadS3ConfigFile on its credential manager, matching the approach
used for the filer.

* fix: use protobuf is_static field instead of passing config file path

The previous approach passed -s3.config file path to every component
(filer, admin). This is wrong because the admin server should not need
to know about S3 config files.

Instead, add an is_static field to the Identity protobuf message.
The field is set when static identities are serialized (in
GetStaticIdentities and LoadS3ConfigFile). Any gRPC client that loads
configuration via GetConfiguration automatically sees which identities
are static, without needing the config file.

- Add is_static field (tag 8) to iam_pb.Identity proto message
- Set IsStatic=true in GetStaticIdentities and LoadS3ConfigFile
- Admin GetObjectStoreUsers reads identity.IsStatic from proto
- Admin IsStaticUser helper loads config via gRPC to check the flag
- Filer GetUser gRPC handler falls back to GetStaticIdentity
- Remove s3ConfigFile from AdminOptions and NewAdminServer signature
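
In spirit, the credential-manager side of this behaves like the sketch below;
Identity is a simplified stand-in for the iam_pb message, and the method
bodies only illustrate the nil filtering, locking, and username merge
described above:

    // Sketch only: simplified types, not the actual credential package.
    import "sync"

    type Identity struct {
        Name        string
        PolicyNames []string
        IsStatic    bool
    }

    type CredentialManager struct {
        mu               sync.RWMutex
        staticIdentities []*Identity
    }

    // SetStaticIdentities drops nil entries so listing code cannot panic,
    // and marks every stored entry as static.
    func (cm *CredentialManager) SetStaticIdentities(ids []*Identity) {
        kept := make([]*Identity, 0, len(ids))
        for _, id := range ids {
            if id == nil {
                continue
            }
            id.IsStatic = true
            kept = append(kept, id)
        }
        cm.mu.Lock()
        cm.staticIdentities = kept
        cm.mu.Unlock()
    }

    // GetStaticUsernames is what the filer's ListUsers handler merges with
    // the dynamic store's users.
    func (cm *CredentialManager) GetStaticUsernames() []string {
        cm.mu.RLock()
        defer cm.mu.RUnlock()
        names := make([]string, 0, len(cm.staticIdentities))
        for _, id := range cm.staticIdentities {
            names = append(names, id.Name)
        }
        return names
    }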
2026-04-03 20:01:28 -07:00
Chris Lu
0798b274dd feat(s3): add concurrent chunk prefetch for large file downloads (#8917)
* feat(s3): add concurrent chunk prefetch for large file downloads

Add a pipe-based prefetch pipeline that overlaps chunk fetching with
response writing during S3 GetObject, SSE downloads, and filer proxy.

While chunk N streams to the HTTP response, fetch goroutines for the
next K chunks establish HTTP connections to volume servers ahead of
time, eliminating the RTT gap between sequential chunk fetches.

Uses io.Pipe for minimal memory overhead (~1MB per download regardless
of chunk size, vs buffering entire chunks). Also increases the
streaming read buffer from 64KB to 256KB to reduce syscall overhead.

Benchmark results (64KB chunks, prefetch=4):
- 0ms latency:  1058 → 2362 MB/s (2.2× faster)
- 5ms latency:  11.0 → 41.7 MB/s (3.8× faster)
- 10ms latency: 5.9  → 23.3 MB/s (4.0× faster)
- 20ms latency: 3.1  → 12.1 MB/s (3.9× faster)
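
A stripped-down sketch of the producer/consumer shape (chunk fetching, SSE,
range handling, and the cache-invalidation retry are omitted; all names are
illustrative, not the actual stream code):

    // Sketch only: bounded chunk prefetch feeding an ordered writer.
    import (
        "context"
        "io"
    )

    type chunkResult struct {
        r    io.ReadCloser
        err  error
        done chan struct{}
    }

    func streamChunks(ctx context.Context, w io.Writer, fileIds []string,
        fetch func(ctx context.Context, fileId string) (io.ReadCloser, error),
        prefetch int) error {

        ctx, cancel := context.WithCancel(ctx)
        defer cancel() // a consumer error stops the producer and in-flight fetches

        results := make(chan *chunkResult, prefetch) // bounds how far fetches run ahead

        // Producer: launch fetch goroutines ahead of the consumer.
        go func() {
            defer close(results)
            for _, fid := range fileIds {
                if ctx.Err() != nil {
                    return
                }
                res := &chunkResult{done: make(chan struct{})}
                go func(fid string, res *chunkResult) {
                    defer close(res.done)
                    res.r, res.err = fetch(ctx, fid)
                }(fid, res)
                select {
                case results <- res:
                case <-ctx.Done():
                    <-res.done // await the launched fetch so nothing leaks
                    if res.r != nil {
                        res.r.Close()
                    }
                    return
                }
            }
        }()

        // Consumer: write chunks in order; on error keep draining so every
        // fetch goroutine is awaited and every reader is closed.
        var firstErr error
        for res := range results {
            <-res.done
            if firstErr == nil && res.err != nil {
                firstErr = res.err
                cancel()
            }
            if firstErr != nil {
                if res.r != nil {
                    res.r.Close()
                }
                continue
            }
            if _, err := io.Copy(w, res.r); err != nil {
                firstErr = err
                cancel()
            }
            res.r.Close()
        }
        return firstErr
    }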

* fix: address review feedback for prefetch pipeline

- Fix data race: use *chunkPipeResult (pointer) on channel to avoid
  copying struct while fetch goroutines write to it. Confirmed clean
  with -race detector.
- Remove concurrent map write: retryWithCacheInvalidation no longer
  updates fileId2Url map. Producer only reads it; consumer never writes.
- Use mem.Allocate/mem.Free for copy buffer to reduce GC pressure.
- Add local cancellable context so consumer errors (client disconnect)
  immediately stop the producer and all in-flight fetch goroutines.

* fix(test): remove dead code and add Range header support in test server

- Remove unused allData variable in makeChunksAndServer
- Add Range header handling to createTestServer for partial chunk
  read coverage (206 Partial Content, 416 Range Not Satisfiable)

* fix: correct retry condition and goroutine leak in prefetch pipeline

- Fix retry condition: use result.fetchErr/result.written instead of
  copied to decide cache-invalidation retry. The old condition wrongly
  triggered retry when the fetch succeeded but the response writer
  failed on the first write (copied==0 despite the fetcher having data).
  Now matches the sequential path (stream.go:197), which checks whether
  the fetcher itself wrote zero bytes.

- Fix goroutine leak: when the producer's send to the results channel
  is interrupted by context cancellation, the fetch goroutine was
  already launched but the result was never sent to the channel. The
  drain loop couldn't handle it. Now waits on result.done before
  returning so every fetch goroutine is properly awaited.
2026-04-03 19:57:30 -07:00
Chris Lu
3efe88c718 feat(s3): store and return checksum headers for additional checksum algorithms (#8914)
* feat(s3): store and return checksum headers for additional checksum algorithms

When clients upload with --checksum-algorithm (SHA256, CRC32, etc.),
SeaweedFS validated the checksum but discarded it. The checksum was
never stored in metadata or returned in PUT/HEAD/GET responses.

Now the checksum is computed alongside MD5 during upload, stored in
entry extended attributes, and returned as the appropriate
x-amz-checksum-* header in all responses.

Fixes #8911

* fix(s3): address review feedback and CI failures for checksum support

- Gate GET/HEAD checksum response headers on x-amz-checksum-mode: ENABLED
  per AWS S3 spec, fixing FlexibleChecksumError on ranged GETs and
  multipart copies
- Verify computed checksum against client-provided header value for
  non-chunked uploads, returning BadDigest on mismatch
- Add nil check for getCheckSumWriter to prevent panic
- Handle comma-separated values in X-Amz-Trailer header
- Use ordered slice instead of map for deterministic checksum header
  selection; extract shared mappings into package-level vars

* fix(s3): skip checksum header for ranged GET responses

The stored checksum covers the full object. Returning it for ranged
(partial) responses causes SDK checksum validation failures because the
SDK validates the header value against the partial content received.

Skip emitting x-amz-checksum-* headers when a Range request header is
present, fixing PyArrow large file read failures.

* fix(s3): reject unsupported checksum algorithm with 400

detectRequestedChecksumAlgorithm now returns an error code when
x-amz-sdk-checksum-algorithm or x-amz-checksum-algorithm contains
an unsupported value, instead of silently ignoring it.

* feat(s3): compute composite checksum for multipart uploads

Store the checksum algorithm during CreateMultipartUpload, then during
CompleteMultipartUpload compute a composite checksum from per-part
checksums following the AWS S3 spec: concatenate raw per-part checksums,
hash with the same algorithm, format as "base64-N" where N is part count.

The composite checksum is persisted on the final object entry and
returned in HEAD/GET responses (gated on x-amz-checksum-mode: ENABLED).

Reuses existing per-part checksum storage from putToFiler and the
getCheckSumWriter/checksumHeaders infrastructure.
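
As a rough illustration of the composite rule (SHA-256 used as the example
algorithm; the helper name is made up):

    // Sketch: composite = base64(hash(concat(raw per-part digests))) + "-" + partCount.
    import (
        "crypto/sha256"
        "encoding/base64"
        "fmt"
    )

    func compositeSHA256(partChecksums [][]byte) (string, error) {
        h := sha256.New()
        for i, digest := range partChecksums {
            if len(digest) == 0 {
                // The upload was initiated with a checksum algorithm, so every
                // part must have recorded one.
                return "", fmt.Errorf("part %d is missing its checksum", i+1)
            }
            h.Write(digest) // concatenate the raw (decoded) per-part digests
        }
        return fmt.Sprintf("%s-%d",
            base64.StdEncoding.EncodeToString(h.Sum(nil)), len(partChecksums)), nil
    }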

* fix(s3): validate checksum algorithm in CreateMultipartUpload, error on missing part checksums

- Move detectRequestedChecksumAlgorithm call before mkdir callback so
  an unsupported algorithm returns 400 before the upload is created
- Change computeCompositeChecksum to return an error when a part is
  missing its checksum (the upload was initiated with a checksum
  algorithm, so all parts must have checksums)
- Propagate the error as ErrInvalidPart in CompleteMultipartUpload

* fix(s3): return checksum header in CompleteMultipartUpload response, validate per-part algorithm

- Add ChecksumHeaderName/ChecksumValue fields to CompleteMultipartUploadResult
  and set the x-amz-checksum-* HTTP response header in the handler, matching
  the AWS S3 CompleteMultipartUpload response spec
- Validate that each part's stored checksum algorithm matches the upload's
  expected algorithm before assembling the composite checksum; return an
  error if a part was uploaded with a different algorithm
2026-04-03 18:37:54 -07:00
Chris Lu
36f37b9b6a fix(filer): remove cancellation guard from RollbackTransaction and clean up #8909 (#8916)
* fix(filer): remove cancellation guard from RollbackTransaction and clean up #8909

RollbackTransaction is a cleanup operation that must succeed even when
the context is cancelled — guarding it causes the exact orphaned state
that #8909 was trying to prevent.

Also:
- Use single-evaluation `if err := ctx.Err(); err != nil` pattern
  instead of double-calling ctx.Err()
- Remove spurious blank lines before guards
- Add context.DeadlineExceeded test coverage
- Simplify tests from ~230 lines to ~130 lines
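
The guard pattern plus the unguarded rollback look roughly like this; Store
and Tx are stand-ins for the filer store types, and the call sequence is only
illustrative:

    // Sketch only: evaluate ctx.Err() once; rollback is cleanup and is not
    // skipped just because the request context is already cancelled.
    import "context"

    type Tx interface{}

    type Store interface {
        BeginTransaction(ctx context.Context) (Tx, error)
        CommitTransaction(ctx context.Context, tx Tx) error
        RollbackTransaction(ctx context.Context, tx Tx) error
    }

    func applyWithRollback(ctx context.Context, store Store) error {
        tx, err := store.BeginTransaction(ctx)
        if err != nil {
            return err
        }
        // Single evaluation: the same err value is used for the check and the return.
        if err := ctx.Err(); err != nil {
            _ = store.RollbackTransaction(ctx, tx) // must still run with a cancelled ctx
            return err
        }
        return store.CommitTransaction(ctx, tx)
    }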

* fix(filer): call cancel() in expiredCtx and test rollback with expired context

- Call cancel() instead of suppressing it to avoid leaking timer resources
- Test RollbackTransaction with both cancelled and expired contexts
2026-04-03 17:55:27 -07:00
os-pradipbabar
d5128f00f1 fix: Prevent orphaned metadata from cancelled S3 operations (Issue #8908) (#8909)
fix(filer): check if context was already cancelled before ignoring cancellation
2026-04-03 16:22:46 -07:00