mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-22 09:41:28 +00:00
22ebe9feb09c82f9610a42e9ff229ca3d513da7e
13697 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
22ebe9feb0 |
ci(e2e): switch FUSE Mount build to Azure Ubuntu mirror, persist buildx cache
archive.ubuntu.com from GitHub-hosted runners has been Ign:/retrying for ~60s per package, eating the Start SeaweedFS step's 10-min budget before apt-get install finishes. The host already uses azure.archive.ubuntu.com; do the same inside Dockerfile.e2e and drop the Retries=5 amplifier. Also rotate /tmp/.buildx-cache-new over /tmp/.buildx-cache so the apt layer actually survives across runs, and bump the step to 15 minutes as a safety margin. |
||
|
|
4ded97a321 |
feat(iam): OIDC provider store + read-only IAM API (Phase 2a) (#9319)
* feat(iam): STS web-identity AWS-fidelity polish - OIDC discovery via .well-known/openid-configuration; falls back to /.well-known/jwks.json when discovery is absent. Reject discovery docs whose issuer claim does not match the configured issuer to defend against issuer-substitution. - ComputeParentUser derives a stable per-identity hash from (sub, iss). Surface as aws:userid in the request context and as a parent_user claim in the session JWT so per-user state survives token rotation. - Per-role MaxSessionDuration (3600..43200) clamps requested DurationSeconds before the STS service applies its own caps. - Tighten RoleSessionName to the AWS contract: 2..64 chars from [\w+=,.@-]. - Populate PackedPolicySize in AssumeRole / AssumeRoleWithWebIdentity / AssumeRoleWithLDAPIdentity responses as a percentage of the 2048-byte inline session policy budget. * fix(iam): leave omitted DurationSeconds nil so STS default applies capDurationByRole was substituting the role's MaxSessionDuration when the caller omitted DurationSeconds entirely. AWS returns the configured default (typically 1 hour) in that case, not the role's upper bound — a 12h MaxSessionDuration shouldn't silently make every no-duration assume-role mint a 12h session. Return nil when requested is nil; let the downstream calculateSessionDuration in the STS service apply its TokenDuration default. The role-max upper bound still clamps when the request arrives with a concrete value above the cap. Addresses gemini high-priority review on PR #9318. * fix(iam): synchronize OIDCProvider JWKS cache fields jwksCache, jwksFetchedAt, resolvedJWKSUri, and discoveryFailed are mutated lazily on the first token-validate call and refreshed afterwards on TTL expiry. Multiple S3 requests can land here in parallel, so the writes were racing against subsequent reads on every other goroutine. resolvedJWKSUri/discoveryFailed inherited the same un-protected pattern when discovery shipped. Add sync.RWMutex; getPublicKey takes the read lock for the common cache-hit path and promotes to the write lock for misses + refreshes. fetchJWKSLocked / resolveJWKSUriLocked assume the write lock is held by the caller; fetchJWKS keeps the test-friendly entry point that acquires the lock itself. Addresses gemini high-priority review on PR #9318. * fix(iam): trim trailing slash + retry discovery after transient failure Two OIDC discovery edge cases reviewers flagged: 1. Issuer comparison was sensitive to trailing slashes. resolveJWKSUri trims them when building the discovery URL, but the doc.Issuer ↔ p.config.Issuer check did not, so an IDP whose issuer claim drops or adds the slash relative to the configured value would be falsely rejected. Trim a single trailing slash on each side before comparing. 2. discoveryFailed flipped to true on any error and stayed there for the process lifetime. A transient 5xx at startup permanently locked the provider into the /.well-known/jwks.json fallback. Reset the flag at the top of fetchJWKSLocked when no URI has been cached yet, so each JWKS refresh (typically once per TTL = 1h) reattempts discovery. Successful discovery remains cached via resolvedJWKSUri so we don't pay the discovery RTT on every refresh. Addresses gemini security-medium + medium reviews on PR #9318. * fix(iam): require non-empty issuer in OIDC discovery doc The previous "doc.Issuer != "" && ..." guard let a discovery document that omitted the issuer field bypass the issuer-mismatch check entirely, letting the doc steer fetchJWKS at any URL it provided. OIDC Discovery 1.0 §3 mandates the issuer field; treat missing as a hard failure same as mismatched. Trailing-slash equivalence still applies. Adds TestDiscoveryRejectsMissingIssuer alongside the existing TestDiscoveryRejectsIssuerMismatch via a new omitDiscoveryIssuer toggle on fakeIDP. * feat(iam): OIDC provider store + read-only IAM API Add OIDCProviderRecord — the persisted, IAM-managed view of an OIDC identity provider — and an OIDCProviderStore interface with memory and filer implementations mirroring the existing role-store pattern. The store is hydrated at boot from the static STS.Providers list so the new IAM API surfaces the same set the STS service already validates against. Two read-only actions land now: - ListOpenIDConnectProviders -> ARN-only list, AWS-shape XML. - GetOpenIDConnectProvider -> URL, ClientIDList, ThumbprintList, Tags, CreateDate. Mutations (Create/Delete/Add-Remove ClientID/Update Thumbprint), multiple client_ids per provider, and TLS thumbprint pinning come in Phase 2b. * fix(iam): preserve CreatedAt across boots + paginate ListProviders Two medium-priority issues gemini flagged on the read-only IAM API: 1. The static-config bootstrap was setting CreatedAt = time.Now() on every server start, so the IAM GetOpenIDConnectProvider response's CreateDate shifted on each restart even when backed by a persistent store. Look up the existing record via GetProviderByARN first and preserve its CreatedAt; only the UpdatedAt advances. 2. FilerOIDCProviderStore.ListProviders had a hardcoded Limit: 1000 that silently truncated above that. Stream-paginate via StartFromFileName, returning io.EOF naturally and surfacing all other errors instead of swallowing them. Addresses two gemini medium reviews on PR #9319. |
||
|
|
d951a8df5a |
feat(iam): STS web-identity AWS-fidelity polish (Phase 1) (#9318)
* feat(iam): STS web-identity AWS-fidelity polish - OIDC discovery via .well-known/openid-configuration; falls back to /.well-known/jwks.json when discovery is absent. Reject discovery docs whose issuer claim does not match the configured issuer to defend against issuer-substitution. - ComputeParentUser derives a stable per-identity hash from (sub, iss). Surface as aws:userid in the request context and as a parent_user claim in the session JWT so per-user state survives token rotation. - Per-role MaxSessionDuration (3600..43200) clamps requested DurationSeconds before the STS service applies its own caps. - Tighten RoleSessionName to the AWS contract: 2..64 chars from [\w+=,.@-]. - Populate PackedPolicySize in AssumeRole / AssumeRoleWithWebIdentity / AssumeRoleWithLDAPIdentity responses as a percentage of the 2048-byte inline session policy budget. * fix(iam): leave omitted DurationSeconds nil so STS default applies capDurationByRole was substituting the role's MaxSessionDuration when the caller omitted DurationSeconds entirely. AWS returns the configured default (typically 1 hour) in that case, not the role's upper bound — a 12h MaxSessionDuration shouldn't silently make every no-duration assume-role mint a 12h session. Return nil when requested is nil; let the downstream calculateSessionDuration in the STS service apply its TokenDuration default. The role-max upper bound still clamps when the request arrives with a concrete value above the cap. Addresses gemini high-priority review on PR #9318. * fix(iam): synchronize OIDCProvider JWKS cache fields jwksCache, jwksFetchedAt, resolvedJWKSUri, and discoveryFailed are mutated lazily on the first token-validate call and refreshed afterwards on TTL expiry. Multiple S3 requests can land here in parallel, so the writes were racing against subsequent reads on every other goroutine. resolvedJWKSUri/discoveryFailed inherited the same un-protected pattern when discovery shipped. Add sync.RWMutex; getPublicKey takes the read lock for the common cache-hit path and promotes to the write lock for misses + refreshes. fetchJWKSLocked / resolveJWKSUriLocked assume the write lock is held by the caller; fetchJWKS keeps the test-friendly entry point that acquires the lock itself. Addresses gemini high-priority review on PR #9318. * fix(iam): trim trailing slash + retry discovery after transient failure Two OIDC discovery edge cases reviewers flagged: 1. Issuer comparison was sensitive to trailing slashes. resolveJWKSUri trims them when building the discovery URL, but the doc.Issuer ↔ p.config.Issuer check did not, so an IDP whose issuer claim drops or adds the slash relative to the configured value would be falsely rejected. Trim a single trailing slash on each side before comparing. 2. discoveryFailed flipped to true on any error and stayed there for the process lifetime. A transient 5xx at startup permanently locked the provider into the /.well-known/jwks.json fallback. Reset the flag at the top of fetchJWKSLocked when no URI has been cached yet, so each JWKS refresh (typically once per TTL = 1h) reattempts discovery. Successful discovery remains cached via resolvedJWKSUri so we don't pay the discovery RTT on every refresh. Addresses gemini security-medium + medium reviews on PR #9318. * fix(iam): require non-empty issuer in OIDC discovery doc The previous "doc.Issuer != "" && ..." guard let a discovery document that omitted the issuer field bypass the issuer-mismatch check entirely, letting the doc steer fetchJWKS at any URL it provided. OIDC Discovery 1.0 §3 mandates the issuer field; treat missing as a hard failure same as mismatched. Trailing-slash equivalence still applies. Adds TestDiscoveryRejectsMissingIssuer alongside the existing TestDiscoveryRejectsIssuerMismatch via a new omitDiscoveryIssuer toggle on fakeIDP. |
||
|
|
2417ba0354 |
fix(volume): add authentication to destructive gRPC admin endpoints (#8876)
* fix(volume): add authentication to destructive gRPC admin endpoints Three destructive VolumeServer gRPC endpoints (DeleteCollection, VolumeDelete, VolumeServerLeave) had no authentication checks, unlike their HTTP counterparts which are protected by the Guard whitelist. Add IsWhiteListed(host) to security.Guard and a checkGrpcAdminAuth helper on VolumeServer that extracts the peer IP from gRPC context and validates it against the guard whitelist. Gate all three endpoints behind this check. * fix(volume): tolerate unparseable gRPC peer address in admin auth check S3 Filer Group integration tests were failing with PermissionDenied "bad peer address: address @: missing port in address" when DeleteCollection ran across the in-process gRPC connection between filer and volume server — the peer addr surfaces as "@" there and net.SplitHostPort can't parse it. The check rejected before IsWhiteListed could exercise its allow-all path for empty-whitelist deployments. Hand the raw peer string to IsWhiteListed when SplitHostPort fails. With no whitelist configured (the test environment's mode) it accepts; with a whitelist configured the unparseable host won't match anything and the call still gets denied as it should. Adds three regression tests for IsWhiteListed pinning the empty-config allow-all, populated-list reject-unknown, and signing-key-only allow- all branches that the gRPC admin helper relies on. * refactor(security): dedup checkWhiteList through IsWhiteListed The HTTP-side checkWhiteList and the gRPC-side IsWhiteListed had the same lookup logic in two places; future drift was just a matter of time. Have checkWhiteList delegate so the membership semantics live in exactly one function. Behaviour is unchanged: the new path still returns nil for isEmptyWhiteList (signing-key-only mode) and still rejects unknown hosts when a whitelist is configured. Addresses gemini medium review on PR #8876. * fix(volume): protect remaining state-altering gRPC admin endpoints DeleteCollection, VolumeDelete, and VolumeServerLeave were the truly-destructive endpoints, but AllocateVolume, VolumeMount, VolumeUnmount, VolumeConfigure, VolumeMarkReadonly, and VolumeMarkWritable also modify server state and should sit behind the same whitelist gate. Read-only endpoints (VolumeStatus, VolumeServerStatus, VolumeNeedleStatus, Ping) stay open. The check is a no-op when no whitelist is configured (the default), so existing deployments keep working; operators who lock down their volume servers via guard.white_list now get consistent coverage. Addresses gemini security-high review on PR #8876. * fix(volume): typed peer addr + audit log for gRPC admin auth Prefer a typed *net.TCPAddr when extracting the peer IP — string parsing was already a fallback for the in-process case but using the typed form first is cleaner and skips an unnecessary parse on the common path. Log failed authorization attempts at V(0) so an operator running with a whitelist sees the host that was rejected (and the raw remote address in case the IP lookup itself was the failure mode), matching what the HTTP Guard already does. Addresses gemini medium review on PR #8876. * fix(volume): protect vacuum + scrub + EC-shards-delete admin endpoints Five more master/admin-driven destructive operations live outside volume_grpc_admin.go and were missing the same whitelist gate: - VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup - ScrubVolume - VolumeEcShardsDelete VacuumVolumeCheck stays open (read-only). BatchDelete also stays open: it's the data-plane multi-object delete called from the S3 API and filer, not an admin operation; gating it would break ordinary S3 DeleteObjects calls. Addresses gemini security-high review on PR #8876. * fix(volume): simplify no-peer-info branch in gRPC admin auth The IsWhiteListed("") fallback was defending against a scenario that doesn't actually arise — real gRPC connections always populate peer info. Drop the branch and just deny when peer info is missing, which is the safer default and matches "if we don't know who the caller is, refuse". * fix(volume-rust): mirror gRPC admin auth on the rust volume server The rust volume server has the same set of destructive admin endpoints as the Go side and the same Guard infrastructure, but nothing was wired together — every endpoint accepted unauthenticated calls regardless of guard configuration. Same vulnerability class the Go fix on this PR closes; this commit closes it on the rust side too so the two stacks stay aligned. Adds VolumeGrpcService::check_grpc_admin_auth that pulls the peer SocketAddr off the tonic Request and runs Guard::check_whitelist on its IP, then applies the helper to the same set the Go side covers: DeleteCollection, AllocateVolume, VolumeMount, VolumeUnmount, VolumeDelete, VolumeMarkReadonly, VolumeMarkWritable, VolumeConfigure, VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup, VolumeServerLeave, ScrubVolume, VolumeEcShardsDelete. Read-only endpoints stay open; BatchDelete stays open as a data-plane multi-object delete. |
||
|
|
a769c938ec |
test(s3tables): Unity Catalog OSS integration tests against SeaweedFS (#9308)
* test(s3tables): add Unity Catalog OSS integration test against SeaweedFS Mirrors the configuration used by the upstream playground at data-engineering-helpers/mds-in-a-box/unitycatalog-playground. Three test variants under test/s3tables/unity_catalog: - TestUnityCatalogDeltaIntegration: aws.masterRoleArn empty / static keys; catalog/schema/EXTERNAL Delta CRUD + temporary-table-credentials S3 round-trip (the playground's working configuration). - TestUnityCatalogMasterRoleIntegration: aws.masterRoleArn set to a SeaweedFS-side role with a permissive trust policy; UC's StsClient is pinned at SeaweedFS via AWS_ENDPOINT_URL_STS, and the test asserts the vended creds carry a session_token and a non-static access key, proving the role-vended path the playground notes as not-yet-working actually does work today. - TestUnityCatalogDeltaRsRoundTrip: writes/reads a real Delta table at the registered storage_location using delta-rs in a slim Python container, with temporary credentials fetched from UC. All three self-skip without Docker or a weed binary, matching the sibling lakekeeper / polaris tests. * test(s3tables): tighten Unity Catalog tests against actual UC OSS behavior After running the suite locally, ground the assertions in what the upstream UC OSS Docker image actually does against SeaweedFS today. - Static-key playground configuration (TestUnityCatalogDeltaIntegration): catalog/schema/EXTERNAL Delta CRUD pass against the SeaweedFS-backed warehouse. The temporary-table- credentials subtest is renamed and inverted to assert the failure mode the playground reports -- UC's AwsCredentialVendor falls through to an internal StsClient.assumeRole when masterRoleArn and sessionToken are both empty, which has no real STS to talk to. Bucket path is also fixed to match UC's getStorageBase() lookup (s3://lakehouse vs the playground's s3://lakehouse/warehouse, which the upstream code never matches). - Master-role variant (TestUnityCatalogMasterRoleIntegration): split into two passing slices. Slice 1 proves SeaweedFS' STS endpoint vending UnityCatalogVendedRole works via the Go AWS SDK and the vended creds round-trip on S3. Slice 2 boots UC with aws.masterRoleArn set and verifies catalog/schema/Delta CRUD. The third hop -- UC's Java StsClient actually reaching SeaweedFS' STS handler during /temporary-table-credentials -- is logged but not asserted, since the AWS Java SDK's STS request currently lands on a SeaweedFS S3 path rather than the STS handler. - Delta-RS round-trip (TestUnityCatalogDeltaRsRoundTrip): gated on UC_DELTA_RS_RUN=1 since it depends on the master-role STS handoff above. The Dockerfile / writer script stay in tree so the test runs end-to-end the moment that hop is fixed. README rewritten to be explicit about what each test validates today and what is still pending. Result: `go test -run TestUnityCatalog ./test/s3tables/unity_catalog/...` passes cleanly with weed + Docker available, and self-skips otherwise. * test(s3tables): exercise unity catalog integrations * ci: run Unity Catalog integration tests on PRs Adds a unity-catalog-integration-tests job to s3-tables-tests.yml, modeled on the existing lakekeeper / dremio jobs. Pre-pulls the UC image and python:3.11-slim (used by the delta-rs writer container) and runs `go test ./test/s3tables/unity_catalog`. Format-check and go-vet jobs already recurse into ./test/s3tables/... so the new package is covered there too. * test/ci: address PR review Tighten the UC readiness probe to require 200, not <500, so a 401/403/404 during startup surfaces immediately instead of being treated as ready (CodeRabbit). Pin the UC image to v0.4.0 in both the workflow and the test default, matching the pinned-tag convention the rest of s3-tables-tests.yml uses (CodeRabbit). Use UC_IMAGE=unitycatalog/unitycatalog:main to re-test against current upstream. * docs: separate UC static-key vs master-role failure modes The README mixed the two together. Static-key empty-sessionToken short-circuits with "S3 bucket configuration not found." before UC even fires an STS call; the AccessDenied I described is what happens in the master-role variant where UC's Java StsClient actually reaches SeaweedFS. Cross-link the playground PR that fixes the static-key vending side. Also drop the "what most playground users actually run" hand-wave under MANAGED tables. * docs: trim README Drop the playground cross-reference and the "two layers fail independently" framing. * docs: pin down what's actually pending Investigated the master-role STS handoff with a sniffer in front of SeaweedFS' STS port. UC's StsClient is constructed without an endpointOverride and never reads aws.endpoint or AWS_ENDPOINT_URL_STS; verified by pointing AWS_ENDPOINT_URL_STS at port 1 and seeing the same real-AWS InvalidClientTokenId 403 with zero traffic to SeaweedFS. The fix is upstream in UC. Updated the README and the master-role test's t.Logf to say so precisely, and dropped the stale "Spark client" bullet (delta-rs covers that path). * test(s3tables): use BaseEndpoint instead of deprecated resolver EndpointResolverWithOptions is deprecated in aws-sdk-go-v2; the supported way to override a service endpoint is via the per-service Options.BaseEndpoint. Switch the assume-role helper to that pattern so the test stops compiling against deprecated API and the resolver boilerplate disappears. Addresses gemini review on PR #9308. * test(s3tables): drop unused splitS3URI helper Helper had no callers; gemini caught it on PR #9308. Easy to bring back from git history if needed. * test(s3tables): extract last token of docker run output as container ID docker run -d may prefix the container ID with image-pull progress when the image isn't cached locally. strings.TrimSpace on the whole output then gave a multi-line string, not the ID. Take the last whitespace-separated token so the ID survives a fresh CI runner. Addresses gemini review on PR #9308. * test(s3tables): cap Unity Catalog response body reads at 10 MiB io.ReadAll without a limit could OOM the test runner if the UC container hands back an unexpectedly large body. 10 MiB is well above any well-formed catalog response and turns a misbehaving server into a test failure instead of a runner crash. Addresses gemini review on PR #9308. * docs: link UC fix PR and call out UC's mocked-Sts test pattern UC's own credential-vending tests substitute StsClient with an in-process EchoAwsStsClient (BaseCRUDTestWithMockCredentials) or Mockito.mockStatic (CloudCredentialVendorTest), so the wire path between UC's Java SDK and a real STS server is untested -- which is why the missing endpointOverride slipped through upstream. Linked the upstream fix at unitycatalog/unitycatalog#1532. |
||
|
|
6d95a5592a |
test(s3/policy): stop racing t.TempDir cleanup against mini shutdown
The mini cluster's admin/plugin worker keeps creating files under admin/plugin/job_types/ for ~1s after subtests finish, while the previous Stop() only cancelled an unobserved context and slept 500ms. t.TempDir()'s registered RemoveAll then raced the worker and intermittently failed with "directory not empty" (CI run 25352039081). Manage the data dir manually so it is removed only after the mini goroutine has exited, and wire MiniClusterCtx so cancel actually drains master/volume/filer/admin/s3/webdav. |
||
|
|
0a91b57f16 |
fix(s3): encrypt SSE-S3 KEK at rest with AES-GCM wrapping (#8880)
* fix(s3): encrypt SSE-S3 KEK at rest using passphrase-derived wrapping key * fix(s3): surface KEK migration failures instead of silently dropping them The legacy-plaintext -> encrypted-at-rest path used to swallow both wrapKEK and updateKEKContent errors. An operator who configured a passphrase but had a filer permission issue (or a wrap failure) saw nothing in the logs and the KEK stayed on disk in plaintext, with the migration retried on every restart and silently failing every time. Log each failure path explicitly so the unmigrated state is visible. The server still starts with the in-memory key loaded — refusing to boot here would be worse than the warning. Addresses gemini and coderabbit reviews on PR #8880. * fix(s3): use a per-installation random salt for KEK wrapping HKDF The original implementation hardcoded "seaweedfs-sse-s3-kek-wrapping-v1" as the HKDF salt for the KEK wrapping key. Two SeaweedFS installations using the same passphrase therefore produced byte-identical wrapping keys, opening a precomputation/rainbow-table angle against weaker passphrases. Generate a random 32-byte salt every time wrapKEK runs and embed it in the on-disk payload alongside the AES-GCM ciphertext. The new format is base64(magic("SWv2") || salt || nonce || ciphertext+tag); unwrapKEK detects the magic and reads the salt out of the payload. KEKs wrapped under the legacy fixed-salt format still unwrap cleanly and are opportunistically re-wrapped into v2 on the next load so operators get the stronger format without manual migration. Addresses gemini review on PR #8880. * fix(s3): plumb KEK passphrase from env into the global key manager InitializeGlobalSSES3KeyManager used to ignore the kekPassphrase field entirely — the global manager was always constructed with the empty constructor, so the encrypted-at-rest code path never engaged in production. Read the passphrase from WEED_S3_SSE_KEK_PASSPHRASE and apply it before InitializeWithFiler so the load path picks up the encrypted format. Log a warning when the env var is unset to make the plaintext fallback visible to operators upgrading from earlier builds. Adds SetKEKPassphrase as the public seam used by the global init and by tests, plus regression tests for the wrap/unwrap round-trip, random-salt independence across managers sharing a passphrase, and the no-passphrase fallback that preserves the legacy hex-decode path. Addresses coderabbit review on PR #8880. * fix(s3): drop redundant base64 decode in KEK migration check unwrapKEK already does the base64 decode and the magic-prefix check; isV2WrappedKEK was repeating both passes purely so the migration branch in loadSuperKeyFromFiler could ask "was this v1 or v2". Have unwrapKEK return the version flag directly and delete the redundant helper. Single decode pass per load. Addresses gemini review on PR #8880. * fix(s3): updateKEKContent must overwrite, not create filer_pb.MkFile maps to CreateEntry, which is O_EXCL: it fails with ErrEntryAlreadyExists when the file already exists. Both KEK migration paths (legacy v1→v2 rewrap and plaintext→encrypted) call updateKEKContent against the entry they just read, so MkFile errored out every time and the migrations only ran in memory while the on-disk KEK stayed in its old format. The previous commit logged the failure loudly but the result was the same: a pre-existing deployment never got migrated. Switch updateKEKContent to LookupDirectoryEntry + UpdateEntry so the overwrite actually persists. Surface the lookup/update errors so the caller's existing "migration failed; KEK still on disk in old form" warnings fire on the right cases. Addresses coderabbit critical review on PR #8880. * chore(s3): drop unused generateAndSaveSuperKeyToFiler Master removed this as dead code in #8913 and I reintroduced it during the merge resolution thinking the migration paths still needed it. On second look it has no callers in this branch — every KEK-creation path on PR #8880 goes through the existing reader code that handles "file not present, generate" inline. Drop the duplicate. Addresses gemini medium review on PR #8880. * fix(s3): updateKEKContent honours km.kekPath instead of hardcoded path The migration write path was always pointing at /etc/s3/sse_kek even when the manager was configured with an operator-overridden kekPath. Split km.kekPath at the last "/" so the lookup + UpdateEntry land on the same file the read path used. Defaults match defaultKEKPath when kekPath is unset. Addresses gemini medium review on PR #8880. * fix(s3): KEK passphrase via Viper key, with env-var fallback The KEK passphrase was read straight from os.Getenv, but every other SSE-S3 secret (s3.sse.kek, s3.sse.key) goes through Viper so an operator can set them in security.toml or via WEED_ env vars interchangeably. Add s3.sse.kek.passphrase to the same path; the existing SSES3KEKPassphraseEnv lookup stays as a fallback so deployments wired before this commit keep working. Addresses gemini medium review on PR #8880. |
||
|
|
e1d5e3899f |
fix(s3): add HMAC-SHA256 key commitment to SSE-S3 and SSE-KMS (#8879)
* fix(s3): add HMAC-SHA256 key commitment to SSE encryption modes * fix(s3): validate SSE-S3 IV length before cipher.NewCTR CreateSSES3DecryptedReader took the IV directly from object metadata and passed it to cipher.NewCTR. The standard library panics when the IV length differs from the block size, so a tampered metadata field turned what should be a decryption error into a crash. Validate via the existing ValidateIV helper first. Addresses coderabbit review on PR #8879. * fix(s3): centralize SSE-KMS key commitment in createSSEKMSKey KeyCommitment used to be set explicitly in only one of the four encryption entry points; the multipart-aware variants and the bucket- bound CreateSSEKMSEncryptedReaderForBucket left it empty, so writes through those paths landed with metadata that VerifyKeyCommitment treats as legacy and accepts unconditionally — exactly the downgrade gap the gemini review flagged. Move the HMAC computation into createSSEKMSKey so every helper-driven path picks it up automatically, and add the same call to the bucket- scoped path that builds its own struct literal. Addresses gemini review on PR #8879. * fix(s3): store base IV (not derived IV) for offset-aware SSE-KMS chunks CreateSSEKMSEncryptedReaderWithBaseIVAndOffset used to store the already-offset-derived IV plus the chunk offset in metadata. The decrypt path then applied calculateIVWithOffset a second time to that stored IV, producing the wrong CTR keystream and a decryption mismatch. Centralizing the key commitment made the bug visible as a commitment failure, but the underlying issue predates that change. Pass the base IV through to createSSEKMSKey so sseKey.IV is the unmodified base on disk; the decrypt path's offset application then recovers the correct chunk-level IV. The HMAC commitment binds the base IV — same value the verify call at decrypt time hashes — so the new commitment path stays consistent. Addresses gemini security-high review on PR #8879. * fix(s3): opt-in strict-commitment mode to close the downgrade vector WEED_S3_REQUIRE_KEY_COMMITMENT=true flips VerifyKeyCommitment from accept-when-missing (the AWS-compatible default needed for objects written before commitments shipped) to reject-when-missing. With the env var set, an attacker who strips the commitment field from object metadata can no longer bypass integrity verification — every object must carry a commitment that hashes to the right value. Default stays false so existing legacy objects keep decrypting; the warning the gemini review raised about the silent-downgrade vector is closed for operators who explicitly opt in once their bucket is fully migrated. SetRequireKeyCommitment exposes a runtime seam for tests and future config-reload paths. Addresses the security-medium review on PR #8879. * test(s3): fix mislabelled IV literal in strict-commitment test The "strict-mode-iv-16" literal was actually 17 bytes — the trailing "16" was meant as a comment but counted as content. The IV is fed into HMAC, not AES, so the length didn't matter for behaviour, but the discrepancy was confusing. Tighten to a real 16-byte literal and explain the choice in a comment. Addresses coderabbit minor review on PR #8879. * fix(s3): store KMS-resolved KeyID in SSE-KMS metadata, not the request CreateSSEKMSEncryptedReaderForBucket built its SSEKMSKey with the caller-supplied keyID, but CreateSSEKMSDecryptedReader later compares decryptResp.KeyID against sseKey.KeyID. A request that used an alias would resolve to a different ARN in the response; storing the request form would then trip the mismatch check at decrypt time and surface as a "KMS key ID mismatch" error against the operator's own object. The helper-driven encryption paths already do the right thing via createSSEKMSKey; this is the bucket-bound path catching up. Addresses coderabbit review on PR #8879. * test(s3): cover key commitment rejection paths under tampering Adds the negative-path tests coderabbit flagged as missing: a tampered key, IV, algorithm, or commitment field must fail VerifyKeyCommitment, otherwise a regression in the rejection logic could land silently. The HMAC binds all three inputs plus the commitment itself, so any single mutation is enough. Addresses coderabbit nitpick on PR #8879. |
||
|
|
3ee147dc4d |
build(deps): bump cloud.google.com/go/kms from 1.26.0 to 1.30.0 (#9311)
Bumps [cloud.google.com/go/kms](https://github.com/googleapis/google-cloud-go) from 1.26.0 to 1.30.0. - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/documentai/CHANGES.md) - [Commits](https://github.com/googleapis/google-cloud-go/compare/kms/v1.26.0...dlp/v1.30.0) --- updated-dependencies: - dependency-name: cloud.google.com/go/kms dependency-version: 1.30.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> |
||
|
|
66d9b89cd2 |
fix(iam): deny IAM users with no policies instead of granting full access (#9317)
* fix(iam): deny IAM users with zero policies instead of falling through to DefaultEffect=Allow A user created via the S3 IAM API with no policies attached was inheriting full S3 access. With `weed s3 -iam` and no explicit IAM config, the policy engine's DefaultEffect defaults to Allow for the in-memory zero-config path. The "no matching statement" guard in IsActionAllowed only triggered when the user already had at least one policy, so a fresh user with PolicyNames=[] slipped through and got allow-all. Track hasManagedSubject whenever the principal resolves to a registered user or role (or PolicyNames are supplied directly) and deny on no-match. The DefaultEffect=Allow fallback now only applies to truly unmanaged callers. * test(iam): cover non-matching attached-policy case for managed-subject deny * test(iam): cover role-with-empty-AttachedPolicies deny path Sibling case to TestIsActionAllowed_RegisteredUserWithoutPoliciesIsDenied: a managed role that resolves but has zero AttachedPolicies must also fall through to deny under DefaultEffect=Allow, not inherit full access. The fix already handles this branch via the hasManagedSubject flag; this test pins the regression class so we don't lose coverage on either side of the user/role split. Addresses coderabbit nitpick on PR #9317. |
||
|
|
57bb7b2f39 |
quiet noisy 'shard X not found' log when EC shard lives on another server (#9316)
quiet "shard X not found" log when EC shard lives on another server When EC shards are spread across multiple volume servers, every read that targets a shard not present locally was logging at V(0) with the same wording as a real read failure. Under rclone-style traffic this floods the logs and makes a healthy EC cluster look broken (issue #9310). Distinguish the two cases with an errShardNotLocal sentinel: the expected fall-through-to-remote path now logs at V(4); genuine local read failures still log at V(0). |
||
|
|
6aa353716a |
test(kafka): snapshot consumer-group state mid-attempt for resumption flake
The onRetry hook in TestOffsetManagement/ConsumerGroupResumption only fires after defer reader.Close(), so every dump shows the post-LeaveGroup Empty state — useless for diagnosing why the second consumer hangs in the join cycle. Add an onTick callback fired every 1.5s while the reader is still joined so we can see PreparingRebalance / CompletingRebalance churn, leader, and member assignments during the 20s attempt window. |
||
|
|
3efd1e8974 |
build(deps): bump github.com/a-h/templ from 0.3.977 to 0.3.1001 (#9312)
Bumps [github.com/a-h/templ](https://github.com/a-h/templ) from 0.3.977 to 0.3.1001. - [Release notes](https://github.com/a-h/templ/releases) - [Commits](https://github.com/a-h/templ/compare/v0.3.977...v0.3.1001) --- updated-dependencies: - dependency-name: github.com/a-h/templ dependency-version: 0.3.1001 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
39cf3cf719 |
build(deps): bump cloud.google.com/go/storage from 1.60.0 to 1.62.1 (#9313)
Bumps [cloud.google.com/go/storage](https://github.com/googleapis/google-cloud-go) from 1.60.0 to 1.62.1. - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md) - [Commits](https://github.com/googleapis/google-cloud-go/compare/compute/v1.60.0...storage/v1.62.1) --- updated-dependencies: - dependency-name: cloud.google.com/go/storage dependency-version: 1.62.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
dee1e12bcd |
build(deps): bump github.com/aws/smithy-go from 1.25.0 to 1.25.1 (#9314)
Bumps [github.com/aws/smithy-go](https://github.com/aws/smithy-go) from 1.25.0 to 1.25.1. - [Release notes](https://github.com/aws/smithy-go/releases) - [Changelog](https://github.com/aws/smithy-go/blob/main/CHANGELOG.md) - [Commits](https://github.com/aws/smithy-go/compare/v1.25.0...v1.25.1) --- updated-dependencies: - dependency-name: github.com/aws/smithy-go dependency-version: 1.25.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
bfcbd5aa0f |
build(deps): bump github.com/klauspost/reedsolomon from 1.13.3 to 1.14.0 (#9315)
Bumps [github.com/klauspost/reedsolomon](https://github.com/klauspost/reedsolomon) from 1.13.3 to 1.14.0. - [Release notes](https://github.com/klauspost/reedsolomon/releases) - [Commits](https://github.com/klauspost/reedsolomon/compare/v1.13.3...v1.14.0) --- updated-dependencies: - dependency-name: github.com/klauspost/reedsolomon dependency-version: 1.14.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
1de741737d |
test(s3tables): add Apache Doris Iceberg catalog integration test (#9307)
* test(s3tables): add Apache Doris Iceberg catalog integration test Adds an end-to-end smoke test that boots the apache/doris all-in-one container, registers SeaweedFS as an external Iceberg REST catalog (OAuth2 client_credentials), and validates metadata visibility plus the parquet read path against tables seeded via the Iceberg REST API and a PyIceberg writer container, mirroring the existing Trino, Spark, and Dremio coverage. Wires the test into a new s3-tables-tests workflow job. * test(s3tables): document weed shell -master flag format and fill in helper docstrings Restores the explanatory comment on createTableBucket about the host:port.grpcPort ServerAddress format used by `weed shell -master` (produced by pb.NewServerAddress) so the dot separator isn't mistaken for a typo, and adds doc comments for createIcebergNamespace, createIcebergTable, doIcebergJSONRequest, requireDorisRuntime, and hasDocker. |
||
|
|
90b706984e |
docs(readme): align Docker quick start with weed mini defaults
Replace the old "server -s3" form with a parallel docker run that uses the same env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_BUCKET) as the binary weed mini quick start. Drop the explicit "mini" subcommand since it is the default CMD in the image. |
||
|
|
7f1ac8cf1a |
docs(readme): dedup Quick Start, drop "weed server" section
Remove the "Quick Start with Single Binary" section which duplicated the binary-download instructions and pushed users toward "weed server". Fold the go install alternative and the volume-scaling tip into the weed mini section so there is one canonical Quick Start path. |
||
|
|
4e22d9533b |
docs(readme): highlight built-in Iceberg catalog and lakehouse use case
Surface the table-buckets / Iceberg REST Catalog story in the introduction and add a Data Lakehouse Features section so readers landing on the README see that SeaweedFS replaces both the object store and the metastore in an Iceberg stack. |
||
|
|
73fc9e3833 | 4.23 4.23 | ||
|
|
d605feb403 |
refactor(command): expand "~" in all path-style CLI flags (#9306)
* refactor(command): expand "~" in all path-style CLI flags Many of weed's path-bearing flags (-s3.config, -s3.iam.config, -admin.dataDir, -webdav.cacheDir, -volume.dir.idx, TLS cert/key files, profile output paths, mount cache dirs, sftp key files, ...) were never run through util.ResolvePath, so a value like "~/iam.json" was used literally. Tilde only worked when the shell expanded it, which silently fails for the common -flag=~/path form (bash leaves the tilde literal in --opt=~/path). - Extend util.ResolvePath to also handle "~user" / "~user/rest", matching shell tilde expansion. Add unit tests. - Apply util.ResolvePath at the top of each shared start* function (s3, webdav, sftp) so mini/server/filer/standalone callers all inherit it; resolve at the few one-off use sites (mount cache dirs, volume idx folder, mini admin.dataDir, profile paths). - Drop the duplicate expandHomeDir helper from admin.go in favor of the now-equivalent util.ResolvePath. * fixup: handle comma-separated -dir flags for tilde expansion `weed mini -dir`, `weed server -dir`, and `weed volume -dir` accept comma-separated paths (`dir[,dir]...`). Calling util.ResolvePath on the whole string mishandled multi-folder values with tilde, e.g. "~/d1,~/d2" would resolve as if "d1,~/d2" were a single subpath. - Add util.ResolveCommaSeparatedPaths: split on ",", run each entry through ResolvePath, rejoin. Short-circuits when no "~" present. - Use it for *miniDataFolders (mini.go), *volumeDataFolders (server.go), and resolve each entry of v.folders in-place (volume.go) so all downstream consumers see resolved paths. - Add 7-case TestResolveCommaSeparatedPaths covering empty, single, multiple, and mixed inputs. * address PR review: metaFolder + Windows backslash - master.go: resolve *m.metaFolder at the top of runMaster so util.FullPath(*m.metaFolder) on the next line sees an expanded path. Drop the now-redundant ResolvePath in TestFolderWritable. - server.go: same treatment for *masterOptions.metaFolder, paired with the existing cpu/mem profile resolves. Drop the redundant inner ResolvePath at TestFolderWritable. - file_util.go: ResolvePath now accepts filepath.Separator as a separator after the tilde, so "~\\data" works on Windows. Other platforms keep current behaviour (backslash stays literal because it is a valid filename character in usernames and paths). - file_util_test.go: add two cases using filepath.Separator that exercise the new code path on Windows and remain a no-op on Unix. * address PR review: resolve "~" in remaining command path flags Comprehensive sweep of path-bearing flags across every weed subcommand, applying util.ResolvePath in-place at the top of each run* function so all downstream consumers see expanded paths. - webdav.go: resolve *wo.cacheDir at the top of startWebDav so mini/server/filer/standalone callers all inherit it. - mount_std.go: cpu/mem profile paths. - filer_sync.go: cpu/mem profile paths. - mq_broker.go: cpu/mem profile paths. - benchmark.go: cpuprofile output path. - backup.go: -dir resolved once at runBackup; drop the duplicated inline ResolvePath in NewVolume calls. - compact.go: -dir resolved at runCompact; drop inline ResolvePath. - export.go: -dir and -o resolved at runExport; drop inline ResolvePath in LoadFromIdx and ScanVolumeFile. - download.go: -dir resolved at runDownload; drop inline. - update.go: -dir resolved at runUpdate so filepath.Join uses the expanded path; drop inline ResolvePath in TestFolderWritable. - scaffold.go: -output expanded before filepath.Join. - worker.go: -workingDir expanded before being passed to runtime. * address PR review: resolve option-struct paths at run* entry points server.go:381 propagates s3Options.config to filerOptions.s3ConfigFile *before* startS3Server runs, which meant the filer-side code saw the unresolved tilde-prefixed pointer. Same pattern for webdavOptions and sftpOptions (and equivalent in mini.go / filer.go). The fix: hoist resolution from the shared start* functions up to the run* entry points, where every shared pointer is set up before any propagation happens. - s3.go, webdav.go, sftp.go: extract a resolvePaths() method on each Options struct that runs every path field through util.ResolvePath in-place. Idempotent. - runS3, runWebDav, runSftp: call the standalone struct's resolvePaths before starting metrics / loading security config. - runServer, runMini, runFiler: call resolvePaths on every embedded options struct, plus resolve loose flags (serverIamConfig, miniS3Config, miniIamConfig, miniMasterOptions.metaFolder, and filer's defaultLevelDbDirectory) so they're expanded before any pointer copy or use. - Drop the now-redundant inline ResolvePath at filer's defaultLevelDbDirectory composition. * address PR review: re-resolve mini -dir post-config, cover misc paths - mini.go: applyConfigFileOptions can overwrite -dir with a literal ~/data from mini.options. Re-resolve *miniDataFolders after the config-file apply, alongside the other path resolves, so the mini filer no longer ends up with a literal ~/data/filerldb2. - benchmark.go: resolve *b.idListFile (-list). - filer_sync.go: resolve *syncOptions.aSecurity / .bSecurity (-a.security / -b.security) before LoadClientTLSFromFile. - filer_cat.go: resolve *filerCat.output (-o) before os.OpenFile. - admin.go: drop trailing blank line at EOF (git diff --check). * address PR review: resolve -a.security/-b.security/-config before use Three follow-up fixes: - filer_sync.go: the -a.security / -b.security resolves were placed *after* LoadClientTLSFromFile / LoadHTTPClientFromFile were called, so weed filer.sync -a.security=~/a.toml still passed the literal tilde path. Hoist the resolves above the security-loading block so TLS clients see expanded paths. - filer_sync_verify.go: same flag pair was never resolved at all in the verify command; resolve at the top of runFilerSyncVerify. - filer_meta_backup.go: -config (the backup_filer.toml path) was passed directly to viper. Resolve at the top of runFilerMetaBackup. - mini.go: master.dir defaulted to the entire comma-joined miniDataFolders. With weed mini -dir=~/d1,~/d2 (or any multi-dir setup), TestFolderWritable then stat'd the joined string instead of a single directory. Default to the first entry via StringSplit to mirror the disk-space calculation a few lines below, and drop the now-redundant ResolvePath in TestFolderWritable. |
||
|
|
87bb0a4115 |
test(s3tables): capture weed mini stdout/stderr in catalog_spark tests
Without this, when "weed mini" fails to bind its master port within the 45s startup timeout the only diagnostic is the timeout message itself, making flakes impossible to root-cause. The sibling catalog, catalog_dremio, and catalog_trino packages already wire stdout/stderr through; bring catalog_spark in line. |
||
|
|
6844ec067c |
fix(s3): cache remote-only source before CopyObject (#9304) (#9305)
* fix(s3): cache remote-only source before CopyObject (#9304) CopyObject from a remote.mount source whose object lives only upstream created a destination entry with FileSize > 0 but no chunks/content, because the resolved source entry has no local chunks and the copy path fell into the "inline/empty chunks" branch with empty entry.Content. A subsequent GET returned 500 with "data integrity error: size N reported but no content available". CopyObjectPart had the same shape via copyChunksForRange iterating an empty chunk list. Detect entry.IsInRemoteOnly() right after resolving the source in both CopyObjectHandler and CopyObjectPartHandler and cache the object to the local cluster first via a new cacheRemoteObjectForCopy helper (a copy-time analogue of cacheRemoteObjectForStreaming with a bounded 30s timeout and version-aware path resolution). If caching fails or produces no chunks, return 503 with Retry-After: 5 instead of writing a metadata-only destination, mirroring the GetObject behavior added in the #7817 cold-cache fix. Adds TestCopyObjectRemoteOnlySourceDetection pinning the four entry shapes the fix branches on plus the pre-fix broken-output shape. * address PR review on remote-only copy fix - Use the resolved entry's version id when srcVersionId is empty so a CopyObject reading the latest object in a versioning-enabled bucket caches the correct .versions/v_<id> path instead of getting stuck in a 503 retry loop. New helper resolvedSourceVersionId handles the fallback for both CopyObject and CopyObjectPart. - Drop the redundant cachedEntry.IsInRemoteOnly() recheck in both handlers; the cache helper now reports success based on local data presence, and IsInRemoteOnly does not look at inline Content so keeping the check would 503 on small inline-cached objects. - Treat inline Content as a successful cache result in both cacheRemoteObjectForStreaming and cacheRemoteObjectForCopy via a shared cachedEntryHasLocalData predicate. The CopyObject inline branch already handles entries that have Content but no chunks. - Extract buildVersionedRemoteObjectPath so the streaming and copy cache helpers share path construction. Adds TestResolvedSourceVersionId and TestCachedEntryHasLocalData to pin the new helpers' contracts. * narrow streaming cache contract back to chunks-only CodeRabbit flagged that cacheRemoteObjectForStreaming's caller in streamFromVolumeServers (lines 997-1002) still required non-empty chunks, so content-only cache hits would fall through to a 503 retry loop instead of being honored. Resolve by keeping the helper's contract chunks-only: the filer's caching code only ever writes chunks, the streaming downstream isn't wired to read inline Content from a cached entry, and a partial range- aware inline writer here would be overkill for a path that doesn't actually occur in practice. cacheRemoteObjectForCopy keeps the relaxed contract since the copy path's inline branch genuinely handles both chunked and content-only entries. Document the asymmetry on cachedEntryHasLocalData and on cacheRemoteObjectForStreaming so a future reader can see why the two helpers diverge. * extend version-id resolution to streaming cache path CodeRabbit flagged that GetObjectHandler still passed the raw query versionId to cacheRemoteObjectForStreaming. For latest-version reads in versioning-enabled buckets that stays empty even though the resolved entry lives at .versions/v_<id>, so remote-only GETs would keep caching the wrong path and 503-ing forever. Reuse the new resolvedSourceVersionId helper at the streaming call site too. Also document on cachedEntryHasLocalData that the zero-byte case flagged in the same review is handled upstream (IsInRemoteOnly requires RemoteSize > 0, so the cache helper is never invoked for empty remote objects -- CopyObject's pre-existing inline branch writes a correct empty destination directly). Pin this with a new test case. * trim verbose comments Drop tutorial-style and review-history comments. Keep only the WHY that isn't obvious from identifiers: the #9304 reference on the new branches in CopyObject / CopyObjectPart, the latest-version-fallback rationale on resolvedSourceVersionId, and the streaming/copy contract asymmetry on cachedEntryHasLocalData. * drop issue references from comments Issue numbers belong in PR descriptions and commit messages, not in source comments where they rot. Replace with the underlying invariant the code is preserving. * test: drive remote-object cache helpers through real gRPC Existing tests only re-enacted helper-function branching in test space, so they could not have caught a handler that consumed a remote-only entry without going through the cache. Stand up an in-process filer gRPC server (UnimplementedSeaweedFilerServer + configurable CacheRemoteObjectToLocalCluster response) and exercise the two cache helpers end-to-end. What's pinned: - cacheRemoteObjectForCopy returns nil when the cache makes no progress (response is still remote-only), lets gRPC errors through as nil, accepts both chunked and inline-content cache hits, and surfaces deadline-exceeded as nil so callers can 503 instead of holding the request open. - Versioned source paths route to .versions/v_<id>; non-versioned and "null" stay at the bucket-relative path. Captured by reading the request the stub server received. - cacheRemoteObjectForStreaming holds the stricter chunks-only contract: a content-only cache hit is not propagated, since streamFromVolumeServers' downstream isn't wired to read from inline Content there. Any current or future handler that calls these helpers exercises the same gRPC path under test, so the bug class is closed for helper-routed cache calls. * move remote-only copy test into the integration suite The previous gRPC-stub test in weed/s3api/ was integration-flavored but stubbed; relocate the coverage to the existing two-server suite under test/s3/remote_cache/, which already exercises the real remote.mount + remote.uncache flow against a primary SeaweedFS plus a secondary acting as remote storage. The new test/s3/remote_cache/remote_cache_copy_test.go drives: - TestRemoteCacheCopyObject: upload to primary, uncache (entry now remote-only), CopyObject to a new key, GET the destination. Pre- fix the GET returned 500 'data integrity error: size N reported but no content'; this pins the fixed behavior over real HTTP through the actual handler stack. - TestRemoteCacheCopyObjectPart: same shape via multipart UploadPartCopy on a 6 MiB object split into two parts, exercising CopyObjectPartHandler's range-copy path. Drop weed/s3api/s3api_remote_storage_grpc_test.go: the helper-level classification tests in s3api_remote_storage_test.go still cover the contract pieces (cachedEntryHasLocalData, resolvedSourceVersionId, the remote-only entry shape), and the integration suite covers the end-to-end behavior that those classifications enable. |
||
|
|
fc75f16c30 |
test(s3tables): expand Dremio Iceberg catalog test coverage (#9303)
* test(s3tables): expand Dremio Iceberg catalog test coverage
Restructure TestDremioIcebergCatalog into subtests and add three new
checks that go beyond a connectivity smoke test:
- ColumnProjection: SELECT id, label proves Dremio parsed the schema
served by the SeaweedFS REST catalog (the previous SELECT COUNT(*)
passed without exercising any column metadata).
- InformationSchemaColumns: verifies the table's columns are listed in
Dremio's INFORMATION_SCHEMA.COLUMNS in the expected ordinal order.
- InformationSchemaTables: verifies the table is registered in
INFORMATION_SCHEMA.TABLES.
All subtests share a single Dremio container startup, so total
runtime is unchanged.
* test(s3tables): exercise multi-level Iceberg namespaces from Dremio
Seed a 2-level Iceberg namespace (and a table inside it) via the REST
catalog before bootstrapping Dremio, then add a MultiLevelNamespace
subtest that scans the nested table by its dot-separated reference.
This relies on isRecursiveAllowedNamespaces=true (already set in the
Dremio source config) to surface the nested levels as folders. A
regression in either the SeaweedFS namespace path encoding (#8959-style)
or Dremio's recursive-namespace discovery would surface here.
Adds two helpers to keep the existing single-level call sites unchanged:
- createIcebergNamespaceLevels: namespace creation with []string levels
- createIcebergTableInLevels: table creation with []string levels and
unit-separator (0x1F) URL encoding for the namespace path component
* test(s3tables): verify Dremio reads PyIceberg-written rows
The previous Dremio subtests only scanned empty tables, so they did not
exercise the data path - just the catalog/metadata path. Add a
PyIceberg-based writer that materializes parquet files plus a snapshot
on a separate table before Dremio bootstraps, and two new subtests:
- ReadWrittenDataCount: SELECT COUNT(*) returns 3.
- ReadWrittenDataValues: SELECT id, label ORDER BY id returns the three
written rows with the expected (id, label) pairs.
The writer runs in a small image (Dockerfile.writer) built locally on
demand. It pip-installs pyiceberg+pyarrow once and reuses the layer
cache on subsequent runs. The CI workflow pre-pulls python:3.11-slim
to keep cold runs predictable.
The writer authenticates via the OAuth2 client_credentials flow that
SeaweedFS already exposes at /v1/oauth/tokens, mirroring the Go-side
helper used for REST-API table creation.
* test(s3tables): fix Dremio writer required-field schema mismatch
PyIceberg's append() compatibility check rejects an arrow column whose
nullability does not match the Iceberg field. The table schema declares
id as `required long`, but the default pyarrow int64 column is nullable
- so the writer failed with:
1: id: required long vs. 1: id: optional long
Declare an explicit pyarrow schema with nullable=False on id and
nullable=True on label to match the Iceberg side.
|
||
|
|
f16353de0b |
feat(mini): add -bucket flag to pre-create an S3 bucket on startup (#9302)
* feat(mini): add -bucket flag to pre-create an S3 bucket on startup Lets users hand a pre-provisioned object store to clients/CI without a post-start `weed shell s3.bucket.create` step. The flag is a no-op when empty (default) and idempotent on subsequent starts. * mini: bound bucket-creation RPCs with a timeout off miniClientsCtx Address PR review feedback: derive the lookup/mkdir context from miniClientsCtx() so Ctrl+C cancels the bucket RPCs, and cap with a 5s timeout so a stalled filer cannot block the welcome message indefinitely. Also wrap the DoMkdir error for parity with the lookup path. * mini: fall back to S3_BUCKET env var for -bucket Mirrors the existing -s3.externalUrl / S3_EXTERNAL_URL pattern so container/Kubernetes deployments can pre-create the bucket via env without overriding the entrypoint command. * docs(readme): lead weed mini quick start with credentials + bucket Promote the one-line setup (env vars + bucket) so users get a ready-to-use S3 endpoint without hopping between sections to find credential and bucket setup. * mini: accept comma-separated -bucket list Lets a single startup pre-create multiple S3 buckets, e.g. -bucket=bucket1,bucket2 (or S3_BUCKET=bucket1,bucket2). Names are trimmed and deduped; per-bucket errors are logged and the loop continues so one bad name does not block the rest. * mini: add -tableBucket flag for pre-creating S3 Tables buckets Mirrors -bucket but creates S3 Tables (Iceberg) buckets via s3tables.Manager so users can hand the all-in-one binary a ready-to-use table catalog without a follow-up weed shell call. Comma-separated, env fallback to S3_TABLE_BUCKET, idempotent on restart, owned by the DefaultAccountID placeholder. * mini: use errors.Is for ErrNotFound check in bucket lookup Matches the rest of the codebase (~20 call sites in weed/s3api). The direct equality works today because LookupEntry returns ErrNotFound unwrapped, but errors.Is future-proofs against any future wrapping. |
||
|
|
1f6f473995 |
refactor(worker): co-locate plugin handlers with their task packages (#9301)
* refactor(worker): co-locate plugin handlers with their task packages
Move every per-task plugin handler from weed/plugin/worker/ into the
matching weed/worker/tasks/<name>/ package, so each task owns its
detection, scheduling, execution, and plugin handler in one place.
Step 0 (within pluginworker, no behavior change): extract shared helpers
that previously lived inside individual handler files into dedicated
files and export the ones now consumed across packages.
- activity.go: BuildExecutorActivity, BuildDetectorActivity
- config.go: ReadStringConfig/Double/Int64/Bytes/StringList, MapTaskPriority
- interval.go: ShouldSkipDetectionByInterval
- volume_state.go: VolumeState + consts, FilterMetricsByVolumeState/Location
- collection_filter.go: CollectionFilterMode + consts
- volume_metrics.go: export CollectVolumeMetricsFromMasters,
MasterAddressCandidates, FetchVolumeList
- testing_senders_test.go: shared test stubs
Phase 1: move the per-task plugin handlers (and the iceberg subpackage)
into their task packages.
weed/plugin/worker/vacuum_handler.go -> weed/worker/tasks/vacuum/plugin_handler.go
weed/plugin/worker/ec_balance_handler.go -> weed/worker/tasks/ec_balance/plugin_handler.go
weed/plugin/worker/erasure_coding_handler.go -> weed/worker/tasks/erasure_coding/plugin_handler.go
weed/plugin/worker/volume_balance_handler.go -> weed/worker/tasks/balance/plugin_handler.go
weed/plugin/worker/iceberg/ -> weed/worker/tasks/iceberg/
weed/plugin/worker/handlers/handlers.go now blank-imports all five
task subpackages so their init() registrations fire.
weed/command/mini.go and the worker tests construct the handler with
vacuum.DefaultMaxExecutionConcurrency (the constant moved with the
vacuum handler).
admin_script remains in weed/plugin/worker/ because there is no
underlying weed/worker/tasks/admin_script/ package to merge with.
* refactor(worker): update test/plugin_workers imports for moved handlers
Three handler constructors moved out of pluginworker into their task
packages — update the integration test files in test/plugin_workers/
to import from the new locations:
pluginworker.NewVacuumHandler -> vacuum.NewVacuumHandler
pluginworker.NewVolumeBalanceHandler -> balance.NewVolumeBalanceHandler
pluginworker.NewErasureCodingHandler -> erasure_coding.NewErasureCodingHandler
The pluginworker import is kept where the file still uses
pluginworker.WorkerOptions / pluginworker.JobHandler.
* refactor(worker): update test/s3tables iceberg import path
The iceberg subpackage moved from weed/plugin/worker/iceberg/ to
weed/worker/tasks/iceberg/. test/s3tables/maintenance/maintenance_integration_test.go
still imported the old path, breaking S3 Tables / RisingWave / Trino /
Spark / Iceberg-catalog / STS integration test builds.
Mirrors the OSS-side fix needed by every job in the run that
transitively imports test/s3tables/maintenance.
* chore: gofmt PR-touched files
The S3 Tables Format Check job runs `gofmt -l` over weed/s3api/s3tables
and test/s3tables, then fails if anything is unformatted. Files this
PR moved or modified had import-grouping and trailing-spacing issues
introduced by perl-based renames; reformat them with gofmt -w.
Touched files:
test/plugin_workers/erasure_coding/{detection,execution}_test.go
test/s3tables/maintenance/maintenance_integration_test.go
weed/plugin/worker/handlers/handlers.go
weed/worker/tasks/{balance,ec_balance,erasure_coding,vacuum}/plugin_handler*.go
* refactor(worker): bounds-checked int conversions for plugin config values
CodeQL flagged 18 go/incorrect-integer-conversion warnings on the moved
plugin handler files: results of pluginworker.ReadInt64Config (which
ultimately calls strconv.ParseInt with bit size 64) were being narrowed
to int32/uint32/int without an upper-bound check, so a malicious or
malformed admin/worker config value could overflow the target type.
Add three helpers in weed/plugin/worker/config.go that wrap
ReadInt64Config and clamp out-of-range values back to the caller's
fallback:
ReadInt32Config (math.MinInt32 .. math.MaxInt32)
ReadUint32Config (0 .. math.MaxUint32)
ReadIntConfig (math.MinInt32 .. math.MaxInt32, platform-portable)
Update each flagged call site in the four moved task packages to use
the bounds-checked helper. For protobuf uint32 fields (volume IDs)
the variable type also becomes uint32, removing the trailing
uint32(volumeID) casts and changing the "missing volume_id" check
from `<= 0` to `== 0`.
Touched files:
weed/plugin/worker/config.go
weed/worker/tasks/balance/plugin_handler.go
weed/worker/tasks/erasure_coding/plugin_handler.go
weed/worker/tasks/vacuum/plugin_handler.go
* refactor(worker): use ReadIntConfig for clamped derive-worker-config helpers
CodeQL still flagged three call sites where ReadInt64Config was being
narrowed to int after a value-range clamp (max_concurrent_moves <= 50,
batch_size <= 100, min_server_count >= 2). The clamp is correct but
CodeQL's flow analysis didn't recognize the bound, so it flagged them
as unbounded narrowing.
Switch to ReadIntConfig (already int32-bounded by the helper) for
those three sites, drop the now-redundant int64 intermediate variables.
Also drops the now-unused `> math.MaxInt32` clamp in
ec_balance.deriveECBalanceWorkerConfig (the helper covers it).
|
||
|
|
b2f4ebb776 |
test(s3tables): add Dremio Iceberg catalog integration tests (#9299)
* test(s3tables): add Dremio Iceberg catalog integration tests
Add comprehensive integration tests for Dremio with SeaweedFS's Iceberg
REST Catalog, following the same patterns as existing Spark and Trino tests.
Tests include:
- Basic catalog connectivity and schema operations
- Table creation, insertion, and querying (CRUD)
- Deterministic table location specification
- Multi-level namespace support
Implementation includes:
- dremio_catalog_test.go: Core test environment and basic operations
- dremio_crud_operations_test.go: Schema and table CRUD testing
- dremio_deterministic_location_test.go: Location and namespace testing
- Comprehensive README and implementation documentation
CI/CD:
- Added dremio-iceberg-catalog-tests job to s3-tables-tests.yml
- Pre-pulls Dremio image, runs with 25m timeout
- Uploads artifacts on failure
* add docstrings to Dremio integration tests and fix CI image pre-pull
- Add function docstrings to all test functions and helper functions
in dremio_catalog_test.go, dremio_crud_operations_test.go, and
dremio_deterministic_location_test.go to improve code documentation
and satisfy CodeRabbit's docstring coverage requirements.
- Make Dremio Docker image pre-pull non-critical in CI workflow.
The pre-pull was failing with access denied error, but the image
can still be pulled at runtime. Using continue-on-error to allow
tests to proceed.
* fix: correct YAML syntax in Dremio CI workflow
Use multi-line run command with pipe operator (|) instead of
inline command with || operator to avoid YAML parsing errors.
The || operator was causing 'mapping values are not allowed here'
syntax errors in the YAML parser.
* make Dremio tests gracefully skip if container unavailable
Modify startDremioContainer and waitForDremio to return boolean values
instead of fataling. Tests now skip gracefully if:
- Dremio Docker image is unavailable
- Container fails to start
- Container doesn't become ready within timeout
This prevents CI failure when Dremio image is not accessible while
still testing the integration when it is available.
* Revert "make Dremio tests gracefully skip if container unavailable"
This reverts commit
|
||
|
|
31e5e0dee2 |
fix(mount): keep async flush when LockOwner has no POSIX locks (#9300)
FlushIn.LockOwner is populated by the kernel for any fd that may have participated in locking, not only when locks were actually taken. The previous Flush logic treated any non-zero LockOwner as a closing lock holder and forced a synchronous flush, which silently disabled the writebackCache async-flush path (introduced in #8727) for most ordinary close() calls. Consult the POSIX lock table before forcing sync: only owners that currently hold a non-flock byte-range lock need the synchronous path to coordinate with blocked SetLkw waiters. Other closes go async as intended. |
||
|
|
f2c3bd7b77 |
fix(admin/view): define basePath in plugin IIFE scopes (#9298)
The plugin.templ and plugin_lane.templ components use basePath() in their IIFE (Immediately Invoked Function Expression) scopes to handle subdirectory deployments. However, basePath was not defined locally, causing "basePath is not defined" errors when accessing plugin pages. Added local basePath function definitions in both files, matching the pattern from admin.js. This function checks window.__BASE_PATH__ (set by the layout during page initialization) and prepends it to API paths. |
||
|
|
913f98db10 | refactor(weed/storage): log volume file removal failures (#9297) | ||
|
|
1127672d10 |
fix(s3api,iamapi): avoid full SaveConfiguration when creating a single IAM user (#9261)
* fix(s3api,iamapi): avoid full SaveConfiguration when creating a single IAM user
When CreateUser was called, both the embedded IAM path (s3api_embedded_iam.go)
and the standalone IAM path (iamapi_management_handlers.go) would:
1. Load the full config (all N users)
2. Append the new user
3. Call SaveConfiguration, which re-writes ALL N user files in the filer_etc
store, triggering N file-change events and N reload cycles.
The fix replaces the full-save path with credentialManager.CreateUser, which
writes only the single new user's file. Set changed=false to skip the redundant
SaveConfiguration call, and add "CreateUser" to the reload-after-targeted-write
block so the in-memory config is refreshed.
Also adds a nil guard around iama.iam.GetCredentialManager() in DoActions to
avoid a nil-pointer panic in legacy test fixtures that leave iam unset.
Add SetCredentialManagerForTest to IdentityAccessManagement so tests can inject
an in-memory credential store without touching production code paths.
* test(s3api,iamapi): regression tests for CreateUser targeted-write fix
Add TestCreateUserDoesNotSaveAllUsers (iamapi) and
TestEmbeddedIamCreateUserDoesNotSaveAllUsers (s3api) to guard against the
regression where CreateUser would call SaveConfiguration and re-write all
N existing user files.
Both tests:
- Pre-populate 3 existing users
- Invoke CreateUser via the HTTP API
- Assert PutS3ApiConfiguration (full-config save) was NOT called
- Assert the new user is visible in the credential store
- Assert all pre-existing users are still intact
Also update TestEmbeddedIamExecuteAction to verify persistence via the
credential manager directly (mockConfig is no longer updated on CreateUser
since we skip the SaveConfiguration path).
* refactor(s3api,iamapi): share credential-error to IAM-code mapping
Move the credentialErrToIamErrCode helper from weed/iamapi to
weed/s3api as exported CredentialErrToIamErrCode and call it from
both the standalone IAM handler (iamapi) and the embedded IAM
handler (s3api). Previously the standalone path used the helper
while the embedded path duplicated the same switch inline; the two
sites could drift out of sync.
Also extend the mapping to cover ErrUserNotFound and
ErrAccessKeyNotFound (404 NoSuchEntity) so non-CreateUser callers
that opt into the helper get the right HTTP status.
* test(s3api): seed credential store explicitly in CreateUser regression
Previously the test relied on getS3ApiConfigurationFunc's syncOnce
side effect to populate the credential store with mockConfig
identities. If that fixture ever stopped seeding, the post-test
"pre-existing users still exist" assertion would silently start
passing for the wrong reason (the users were never created).
Seed cm.SaveConfiguration directly and assert the seed precondition
before the API call so any seeding regression fails loudly.
* fix(s3api): honor skipPersist in embedded CreateUser targeted path
ExecuteAction documents that skipPersist=true means "the changed
configuration is not saved to the persistent store", but the
targeted credentialManager.CreateUser added for the avoid-bulk-save
fix ran unconditionally. Gate it on !skipPersist so no-persist
callers don't silently leak a write to the credential store. Leave
changed=true in the skipPersist branch so the tail's existing
`if changed { if !skipPersist { ... } }` block keeps suppressing
persistence the same way it does for every other action.
Add TestEmbeddedIamCreateUserSkipPersist to pin the contract: a
CreateUser invocation with skipPersist=true must not call
PutS3ApiConfiguration and must not leave the user in the
credential store.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
|
||
|
|
7428c48dd6 |
fix(master): bump seaweedfs/raft to v1.1.8 for Windows syncDir fix (#9296)
raft v1.1.7 added an fsync of the state directory after persisting
currentTerm/votedFor. On Windows the syscall (FlushFileBuffers) requires
GENERIC_WRITE, but os.Open opens directories read-only, so every call
panicked the master with ERROR_ACCESS_DENIED ("Zugriff verweigert").
This took down "weed mini" on every restart that loaded persisted state
and entered candidateLoop.
v1.1.8 splits syncDir into build-tagged files: unix keeps the durability
fsync; Windows is a no-op since NTFS journals rename metadata and there
is no portable equivalent of fsync(dir_fd) there.
Fixes #9292.
|
||
|
|
6572b472c3 |
fix(s3): honor X-Forwarded-For in audit log remote_ip (#9295)
* fix(s3): honor X-Forwarded-For in audit log remote_ip When SeaweedFS S3 sits behind a reverse proxy (e.g., Caddy), the audit log's `remote_ip` was reporting the proxy's address because only `X-Real-IP` and `r.RemoteAddr` were consulted. Caddy and most other proxies set `X-Forwarded-For` by default but not `X-Real-IP`, so the real client IP was lost. Check `X-Forwarded-For` first (using the left-most non-empty entry as the originating client), then fall back to `X-Real-IP`, then `r.RemoteAddr`. Fixes #9293 * fix(s3): strip port from RemoteAddr fallback in audit log Address PR review: the X-Forwarded-For and X-Real-IP paths return host-only values, while the RemoteAddr fallback was returning "host:port", making the remote_ip field inconsistent with both the other code paths and the "192.0.2.3" example in the AccessLog struct. Use net.SplitHostPort to strip the port, falling back to the raw RemoteAddr for non-IP markers (e.g., "@" for unix sockets). |
||
|
|
8596434938 |
build(deps): bump github.com/ydb-platform/ydb-go-sdk/v3 from 3.134.0 to 3.134.2 (#9294)
build(deps): bump github.com/ydb-platform/ydb-go-sdk/v3 Bumps [github.com/ydb-platform/ydb-go-sdk/v3](https://github.com/ydb-platform/ydb-go-sdk) from 3.134.0 to 3.134.2. - [Release notes](https://github.com/ydb-platform/ydb-go-sdk/releases) - [Changelog](https://github.com/ydb-platform/ydb-go-sdk/blob/master/CHANGELOG.md) - [Commits](https://github.com/ydb-platform/ydb-go-sdk/compare/v3.134.0...v3.134.2) --- updated-dependencies: - dependency-name: github.com/ydb-platform/ydb-go-sdk/v3 dependency-version: 3.134.2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
d265274e13 |
fix(nfs): accept dirpath any-where under the export, mirroring rclone (#9291)
* fix(nfs): accept any MOUNT3 dirpath, mirroring rclone's permissive policy
weed nfs has exactly one export per process, so the MOUNT3 dirpath
argument has no second export to disambiguate against. Strict
comparison only translated PV-path typos into the inconsistent
"mount succeeds but empty" / "mount fails completely" split that
operators see.
Match rclone's serve nfs Handler.Mount: ignore the dirpath, log an INFO
line when it differs from the configured export, and always serve the
export root. Apply the same change to the UDP MOUNT3 path so kernel
clients defaulting to mountproto=udp see identical behaviour. Access
control still goes through -allowedClients / -ip.bind, and file-handle
scoping in FromHandle is unchanged so handles still cannot escape the
export.
Replace the prior single-path reject tests with table tests covering
the shapes operators commonly hit: root, parent, sibling, deeper child,
unrelated, empty, relative form, exact match, and trailing slash, at
the Handler.Mount, UDP MOUNT3, and full RPC layers.
* feat(nfs): mount at subdirectory when MOUNT3 dirpath is under the export
Make the dirpath argument meaningful when the client asks for a subtree
of the configured export. With -filer.path=/buckets, a client mounting
<server>:/buckets/data lands directly inside /buckets/data instead of
at the export root.
- dirpath equals the export root: serve the export root.
- dirpath strictly under the export, directory entry: serve that
subdirectory; the returned filehandle encodes its inode.
- dirpath strictly under the export, missing or non-directory: reject
with NoEnt or NotDir.
- dirpath outside the export: keep the rclone-style fallback to the
export root.
TCP returns a sub-rooted seaweedFileSystem and lets go-nfs's onMount
call ToHandle to encode the FH; UDP encodes the FH itself. FromHandle
is unchanged: handles are content-addressed by inode and resolve via
the inode index, so they remain stable across mounts and across
process restarts.
The trimmed permissive tests keep their outside-export shapes; new
subexport tests cover under-export directories, missing entries, and
non-directory entries on Handler.Mount, the UDP MOUNT3 wire, and
through the full RPC stack.
* nfs: propagate request context through MOUNT3 resolution
Mount now accepts the gonfs context and threads it through
resolveMountFilesystem and lstatExportStatus so a slow filer call
during MOUNT cannot outlive a cancelled or timed-out request.
lstatExportStatus uses fileInfoForVirtualPath(ctx, "/") directly
instead of billy.Filesystem.Lstat, which would otherwise drop the
context on the floor by calling fileInfoForVirtualPathWithOptions
with context.Background().
Lower the successful subexport-mount log from V(0) to V(1). The
fallback log stays at V(0) so operator typos still surface; the
success line is per-mount churn that adds up on NFS-CSI deployments.
* nfs: mirror TCP defensive checks on the UDP MOUNT3 path
Two transport-parity bugs the rabbit caught:
(1) The exact-export-root and outside-export branches were returning
MNT3_OK unconditionally, while the TCP handler runs lstatExportStatus
on those same branches. If the configured -filer.path has been
removed from the filer, TCP returns NoEnt/ServerFault but UDP would
still hand out a synthetic root handle pointing at nothing. Add
rootMountStatus as the UDP analogue and call it on both branches.
(2) resolveSubexportFileHandle did filer I/O on the single UDP serve
loop with context.Background(). One slow filer round-trip would
block every later MOUNT packet. Wrap each MOUNT call's filer work in
context.WithTimeout(mountUDPLookupTimeout) and thread that ctx
through both rootMountStatus and resolveSubexportFileHandle.
Lower the successful subexport log to V(1) to match the TCP side.
* nfs: assert TCP/UDP MOUNT3 produce byte-identical filehandles
The existing UDP subexport assertions only checked the decoded inode
and kind. A regression that drifted the generation, exportID, or
encoding format on one transport but not the other would have slipped
through. Build the TCP Handler from the same Server, drive its Mount
with the same dirpath, and require ToHandle to match the raw UDP FH
bytes for every OK case.
* nfs: take MOUNT3 dirpath as string in resolveMountFilesystem
Convert req.Dirpath to string once at the call site instead of
sprinkling string(...) casts through every log line and conversion
inside the function. Behavior unchanged.
* nfs: share rootFS lifecycle between TCP and UDP MOUNT handlers
Server.rootFilesystem() lazily constructs the seaweedFileSystem rooted
at the configured export the first time anything asks for it, then
hands the same instance to every subsequent caller. newHandler() and
mountUDPServer.rootMountStatus() now both go through it, so:
- Both transports observe the same chunk reader cache and chunk
invalidator without depending on call order during startup.
- The UDP defensive Lstat doesn't allocate a fresh wrapper per
MOUNT request anymore; one struct lives for the life of the
Server.
The sub-rooted seaweedFileSystem the subexport branch builds in
resolveSubexportFileHandle is still per-request because actualRoot
varies with the requested dirpath.
* nfs: drive rootFilesystem before reading sharedReaderCache on UDP
The UDP listener is started before serve() calls newHandler(), so an
under-export MOUNT3 request can reach resolveSubexportFileHandle before
Server.sharedReaderCache has been assigned. Reading it directly would
hand newSeaweedFileSystem a nil cache and the sub-fs would build a
throwaway ReaderCache that never gets shared with the TCP path.
Take rootFS off Server.rootFilesystem() (which drives the sync.Once
that initializes the shared cache) and read readerCache off that
instead, so subexport sub-fs instances always share the same reader
cache as rootFS regardless of which transport sees the first MOUNT.
* nfs: collapse exact-match and outside-export MOUNT branches
The two branches return the same filesystem (export root) and the
same status; only the log line differs. Combine the conditions and
guard the fallback log inline. Behavior unchanged.
|
||
|
|
9b624a73fe |
ci: provide a Docker tag for foundationdb release container on workflow_dispatch
The metadata-action used type=ref,event=tag, which produces no tag on workflow_dispatch, causing build-push to fail with "tag is needed when pushing to registry". Add a release_tag input and build the tag from a RELEASE_TAG env, mirroring container_release_unified.yml. |
||
|
|
14cd426cf9 | templ | ||
|
|
89b07f3e12 | chore(weed/mq/kafka/protocol): prune no-op test (#9287) | ||
|
|
02da928e25 | chore(weed/mq/kafka/protocol): prune dead code (#9288) | ||
|
|
be451d22b5 |
feat(filer.sync): add -verifySync mode to filer.sync for cross-cluster file comparison (#9284)
* Add -verifySync flag to filer.sync for cross-cluster file comparison
Add a verification mode to filer.sync that compares entries between two
filers without performing actual synchronization. Uses directory-level
sorted merge of ListEntries to detect missing files, size mismatches,
and ETag mismatches. Supports -isActivePassive for unidirectional check
and -modifyTimeAgo to skip recently modified files during sync lag.
* Add mtime annotation and JSON output to filer.sync -verifySync
Add automatic mtime relation analysis for SIZE_MISMATCH and
ETAG_MISMATCH diffs, and an NDJSON output mode for external tooling.
mtime classification:
- B_NEWER => "late_updates_skip_likely" hint. Surfaces the case
where target has a stub entry whose mtime is ahead of source's
real file, causing UpdateEntry's mtime guard in filersink to
permanently skip the update.
- A_NEWER => "sync_lag_or_event_miss" hint.
- EQUAL => no hint (chunk-level issue suspected).
Text output example:
[SIZE_MISMATCH] /path (a=996, b=0, B newer +274d [late-updates skip likely])
Add -verifyJsonOutput flag. When set, emits one JSON object per
line (NDJSON) for diffs and a final SUMMARY object, suitable for
piping into external diagnostic pipelines.
Concurrent writes from the directory worker pool are now serialized
via outputMu to keep both text lines and JSON records atomic.
* fix(filer.sync): use shared global semaphore in verifySync to bound goroutine explosion
Replace the per-call local semaphore in compareDirectory with a single
shared semaphore created in runVerifySync. The old per-level semaphore
applied a limit of verifySyncConcurrency only within each directory level,
allowing effective concurrency to grow as verifySyncConcurrency^depth on
deep trees.
The shared semaphore is held only for each directory's I/O phase
(listEntries + merge) and released before recursing into subdirectories,
so a parent never blocks waiting for children to acquire slots — which
would deadlock once tree depth exceeds the semaphore capacity.
Extract the capacity into a named constant (verifySyncConcurrency = 5)
with a comment explaining the memory vs. performance trade-off.
Add unit tests:
- correctness: missing file, only-in-B, size mismatch, active-passive mode
- concurrency bound: peak concurrent listings ≤ verifySyncConcurrency
- no-deadlock: binary tree of depth 10 completes within timeout
* fix(filer.sync): stream directory entries to prevent OOM on large directories
Replace the listEntries helper (which accumulated all entries into a
single []filer_pb.Entry slice) with an entryStream type that pages
through the directory in the background and forwards entries one at a
time through a buffered channel. Memory per directory comparison is now
O(channel buffer size = 64) regardless of how many entries the directory
contains.
Key design points:
- entryStream wraps a goroutine + buffered channel with a one-entry
lookahead (peek/advance) so the two-pointer sorted merge in
compareDirectory can work without buffering any full listing.
- A child context (mergeCtx) is passed to both stream goroutines so
they are cancelled promptly if compareDirectory returns early (e.g.
on error); the ctx.Done() select arm in the callback prevents
goroutine leaks when the consumer stops reading.
- stream.err is written by the goroutine before close(ch), so it is
safe to read after the channel is exhausted (Go memory model:
channel close happens-before the zero-value receive).
- countMissingRecursive is rewritten to use ReadDirAllEntries with a
direct callback, eliminating its own slice allocation.
- listEntries is removed; it is no longer called anywhere.
* fix(filer.sync): address verifySync review findings
Four real bugs found and fixed; one finding already resolved (shared
semaphore was introduced in a prior commit).
path.Join for child paths (filer_sync_verify.go)
fmt.Sprintf("%s/%s", dir, name) produced "//name" when dir was "/".
Replace all child-path concatenations with path.Join so root-level
walks emit clean paths.
cutoffTime check for ONLY_IN_B entries (filer_sync_verify.go)
The B-only branch ignored -modifyTimeAgo, so files recently written
to B were reported as ONLY_IN_B instead of being skipped. Mirror the
A-side mtime guard: skip and increment skippedRecent when the entry
is newer than cutoffTime.
Summary emitted before error check (filer_sync_verify.go)
A filer I/O error mid-walk still caused a SUMMARY record (or text
summary) to be printed, making partial runs appear complete. Move the
error check to before summary emission; on error, return immediately
without printing any summary.
Return false on verification failure (filer_sync.go)
runVerifySync returned true (exit 0) even when diffs were found or the
walk failed. Return false so the main binary sets exit status 1,
consistent with how all other commands signal failure.
* test(filer.sync): add missing verifySync test coverage
Four new tests covering gaps identified during review:
TestVerifySyncETagMismatch
Verifies that two files with identical size but different Md5 checksums
are counted as etagMismatch (not sizeMismatch). Exercises the second
branch of compareEntries that was previously untested.
TestVerifySyncCutoffTime (4 subtests)
A-only recent — recent file skipped (skippedRecent++), not MISSING
A-only old — old file reported as MISSING
B-only recent — recent file skipped (skippedRecent++), not ONLY_IN_B
B-only old — old file reported as ONLY_IN_B
The B-only subtests specifically cover the cutoffTime fix added in the
previous commit.
TestVerifySyncRootPath
Regression for the path.Join fix: walks from "/" and verifies that the
child directory is reached and compared correctly (the old Sprintf
produced "//data" which would silently produce wrong results).
Asserts dirCount=2 and fileCount=1 to confirm the full tree is walked.
* fix(filer.sync): use os.Exit(2) instead of return false on verify failure
return false triggered weed.go's error handler which printed the full
command usage — appropriate for invalid arguments, not for a completed
verification that found differences. Use os.Exit(2) consistent with
the existing pattern in filer_sync.go (lines 251, 293).
* refactor(filer.sync.verify): split verify into its own command
The verify mode is a one-shot batch operation with a fundamentally
different lifecycle from the long-running sync subscriber, and most of
filer.sync's flags (replication, metrics port, debug pprof, concurrency,
etc.) do not apply to it. Extract it into a sibling command alongside
filer.copy/filer.backup/filer.export rather than a flag mode on
filer.sync.
Also rename modifyTimeAgo to modifiedTimeAgo (grammatical) and drop the
verifyJsonOutput prefix to plain jsonOutput now that the verify context
is implicit in the command name.
* fix(filer.sync.verify): address review comments
- Bounded worker pool: cap subdirectory goroutines per level via a
jobs channel and min(verifySyncConcurrency, len(subDirs)) workers
instead of spawning one goroutine per child. Wide directories no
longer park ~2KB per queued goroutine.
- Don't gate recursion on a directory's mtime: a fresh child write
bumps the parent mtime, but older files inside should still be
reported as missing. Always recurse for missing-in-B directories
and apply the cutoff per-file inside countMissingRecursive.
- Apply -modifiedTimeAgo symmetrically: matched-name files now skip
the comparison when EITHER side is recently modified, not just A.
This restores lag tolerance when B was just rewritten.
Adds tests for both new behaviors and a shared isTooRecent helper.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
|
||
|
|
eebffd9df6 |
fix(storage): fix verifyDeletedNeedleIntegrity using wrong offset (#9273)
* fix(storage): fix verifyDeletedNeedleIntegrity using wrong offset verifyDeletedNeedleIntegrity was ignoring the offset recorded in the index file and instead always reading from fileSize-DiskSize(0), which is only correct when the deleted needle happens to be the very last entry in the .dat file. When a deletion tombstone is written in the middle of the volume (e.g. after subsequent writes), the function reads the wrong bytes, causing the key-mismatch check to either silently pass (if the bytes happen to form a valid needle with the same ID) or produce a spurious integrity error. Fix: accept the actual offset from the index entry and use types.TombstoneFileSize as the size, mirroring the approach used by verifyNeedleIntegrity for live needles. * fix(storage): fix error message in doCheckAndFixVolumeData The error message in the deleted-needle branch referenced verifyNeedleIntegrity, but the function being called is verifyDeletedNeedleIntegrity. This makes log output misleading when a deleted-needle integrity check fails. Suggested by gemini-code-assist review on PR #9273. --------- Co-authored-by: chenshi5012 <chenshi5012@github.com> |
||
|
|
db34e8b6fd |
feat(admin): prefer stored S3 Content-Type metadata over key-extension MIME inference (#9286)
* feat(admin): prefer stored filer mime in file browser and properties * feat(mime): enhance MIME type registration and improve fallback logic Co-authored-by: Copilot <copilot@github.com> --------- Co-authored-by: baracudaz <baracudaz@users.noreply.github.com> Co-authored-by: Copilot <copilot@github.com> |
||
|
|
60c76120fc |
fix(shell): use exact match for volume.balance -racks/-nodes filter (#9279)
* fix(shell): correct volume.list -writable filter unit and comparison * fix(shell): correct volume.list -writable filter unit and comparison * chore(shell): fix typo in EC shard helper param names * fix(shell): use exact match for volume.balance -racks/-nodes filter The old strings.Contains-based filter quietly included any id that was a substring of the user-supplied flag value (e.g. -racks=rack10 also matched rack1). Replace it with an exact-match set parsed from the comma-separated flag value, and add regression tests for both -racks and -nodes paths. Also fix a small typo in the "remote storage" error returned by maybeMoveOneVolume. * fix(shell): use exact match for volume.balance -racks/-nodes filter The old strings.Contains-based filter quietly included any id that was a substring of the user-supplied flag value (e.g. -racks=rack10 also matched rack1). Replace it with an exact-match set parsed from the comma-separated flag value, and add regression tests for both -racks and -nodes paths. Also fix a small typo in the "remote storage" error returned by maybeMoveOneVolume. * refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers |
||
|
|
1da091f798 |
ci: bring previously-uncovered integration tests into CI (#9281 follow-up) (#9283)
* ci: bring previously-uncovered integration tests into CI (#9281 follow-up) Six integration test packages had _test.go files but no GitHub workflow running them. The s3-sse-tests CI gap that let #8908's UploadPartCopy bug (and the four cross-SSE copy bugs in #9281) ship undetected was an instance of this same pattern. This change wires three of them into CI and removes a fourth that was deadcode: test/multi_master/ NEW workflow: multi-master-tests.yml - 3-node master raft cluster failover/recovery (5 tests, ~65s) test/testutil/ (run alongside multi_master) - port-allocator regression test test/s3/etag/ NEW workflow: s3-etag-acl-tests.yml - PutObject ETag format regression for #7768 (must be pure MD5 hex, not "<md5>-N" composite, for AWS Java SDK v2 compatibility) test/s3/acl/ (same workflow as etag) - object-ACL behavior on versioned buckets test/s3/catalog_trino/ DELETED (deadcode) - Single-file copy of test/s3tables/catalog_trino/trino_catalog_test.go from a 2024 commit that was never iterated, while the test/s3tables/ counterpart has been actively maintained (and IS in CI via s3-tables-tests.yml's trino-iceberg-catalog-tests job). Both workflows trigger only on changes to relevant code paths and use the existing simple "build weed → run go test" pattern (no per-test-dir Makefile boilerplate). The S3 workflow starts a single `weed mini` shared by etag and acl, which keeps the job under 2 minutes on a fresh runner. Two tests remain knowingly uncovered: test/s3/basic/ — order-dependent state across tests (TestListObjectV2 expects a bucket created by an earlier test, etc.) and uses the deprecated aws-sdk-go v1. Treated as sample programs, not a regression suite. Fixing them is out of scope for this PR. test/s3/catalog_trino/ — see "DELETED" above. Verified locally: - go test -v -timeout=8m ./test/multi_master/... ./test/testutil/... PASS (5 multi_master + 1 testutil tests, 64s) - weed mini + go test ./test/s3/etag/... + go test ./test/s3/acl/... PASS (8 etag + 5 acl tests, ~6s after server startup) * ci: fix log-collector glob for multi-master tests (review feedback on #9283) test/multi_master/cluster.go creates per-test temp dirs via os.MkdirTemp("", "seaweedfs_multi_master_it_"), so the glob has to match that prefix. The previous version looked for MasterCluster* / TestLeader* / TestTwoMasters* / TestAllMasters* which never matches — the failure-artifact upload would have been empty on a real failure. Switch the find to /tmp/seaweedfs_multi_master_it_* (maxdepth 1) so it actually picks up the per-node master*.log files under <baseDir>/logs/. Found by coderabbitai review on PR #9283. |
||
|
|
1f515f9d02 |
fix(s3api): cross-SSE copy operations and bring them back into CI (#9281) (#9282)
* fix(s3api): cross-SSE copy operations and bring them back into CI (#9281) Four cross-SSE copy tests were broken on master and excluded from CI with the comment "pre-existing SSE-C issues": - TestSSECObjectCopyIntegration/Copy_SSE-C_to_SSE-C_with_different_key - TestSSEKMSObjectCopyIntegration/Copy_SSE-KMS_with_different_key - TestCrossSSECopy/SSE-S3_to_SSE-C - TestSSEMultipartCopy/Copy_SSE-KMS_Multipart_Object Each surfaced as a different symptom — 500 InternalError, CRC32 mismatch, "unexpected EOF", MD5 mismatch — but they were all instances of the same root pattern that #8908 hit on UploadPartCopy: copy paths writing destination chunks tagged inconsistently with the bytes on disk, so detectPrimarySSEType / IsSSE*Encrypted disagreed about what the read path should do. Five fixes in this PR, each with its own targeted test: 1. SSE-C IV format: putToFiler stored entry.Extended[SeaweedFSSSEIV] as raw bytes (with a comment saying so), but StoreSSECIVInMetadata stored it base64-encoded. The two readers (the GET handler reading it raw, and GetSSECIVFromMetadata reading it base64-decoded) each matched one writer but not the other. Standardise on raw bytes everywhere; GetSSECIVFromMetadata accepts the legacy base64 form for backward compat. 2. SSE-C single-part copy chunk tagging: copyChunkWithReencryption re-encrypted the bytes for the destination but never set the destination chunk's SseType / SseMetadata. With chunks left SseType=NONE, detectPrimarySSEType returned "None" and the GET served still-encrypted volume bytes raw without decryption. Tag the chunk after re-encryption. 3. SSE-KMS single-part copy chunk tagging: same shape as (2). Also, the function discarded the destSSEKey returned from CreateSSEKMSEncryptedReaderWithBucketKey (with `_`) — that key carries the freshly-minted EncryptedDataKey + IV the read path needs, so it must be captured and serialized into the destination chunk's per-chunk metadata (and bubbled up to the entry-level SeaweedFSSSEKMSKey for single-chunk objects whose read path falls back to the entry-level key). 4. SSE-KMS multipart source decryption: copyChunkWithSSEKMSReencryption decrypted every source chunk with the entry-level sourceSSEKey. For multipart SSE-KMS objects each chunk has its own EDK + IV in per-chunk metadata, so the entry-level key is wrong. Decrypt with per-chunk metadata when present. 5. Same-key copy fast path chunk tagging: copySingleChunk uses createDestinationChunk which dropped SseType / SseMetadata. For same-key copies (e.g. SSE-KMS source → SSE-KMS dest with the same KMS key) the fast path reuses the source ciphertext as-is, so the destination chunks must keep the source's SSE tagging. Add a createDestinationChunkPreservingSSE helper for the fast path; the re-encryption paths still call createDestinationChunk and then overwrite the SSE fields after re-encrypting. CI: extend the comprehensive-test TEST_PATTERN to include the four test families that were previously excluded (`.*ObjectCopyIntegration`, `TestCrossSSECopy`, `TestSSEMultipartCopy`) so this category of regression is caught going forward. The exclusion comment is removed. Tests: - All four originally-failing tests pass. - The full pre-existing TestSSE* / TestCrossSSE / TestGitHub7562 / TestCopyToBucketDefaultEncryptedRegression / TestSSEMultipart suite still passes. - go test -race ./weed/s3api/ passes. Refs #8908, #9280. * fix(s3api): SSE-KMS copy ChunkOffset must stay 0 (review feedback on #9282) CreateSSEKMSEncryptedReaderWithBucketKey initialises a fresh CTR stream at counter 0 with a per-chunk random IV — there is no base-IV-plus-offset relationship. The previous commit on this branch wrote `destSSEKey.ChunkOffset = chunk.Offset` onto the per-chunk metadata, which the read-side CreateSSEKMSDecryptedReader applies as calculateIVWithOffset(IV, ChunkOffset) — i.e. it advances the decryption IV by chunk.Offset/16 blocks beyond where the encryption actually wrote. The bug only manifests for SSE-KMS-to-SSE-KMS-with-different-key copies of multipart sources (where source chunks live at non-zero offsets), which is why the existing TestSSEKMSObjectCopyIntegration (single-chunk source) and TestSSEMultipartCopy/Copy_SSE-KMS_Multipart_Object (same-key copy that takes the fast preserving path, not the re-encrypt path) both happened to pass. Set ChunkOffset to 0 to match the actual encryption position. Existing tests still pass; the dangerous case is only reachable with a multipart SSE-KMS source and a different destination key, which is not currently exercised in CI. Found by gemini-code-assist review on PR #9282. * fix(s3api): use first dst chunk's full key for entry-level SSE-KMS metadata in remaining copy paths (review feedback on #9282) Earlier this branch fixed copyChunksWithSSEKMSReencryption to populate the entry-level SeaweedFSSSEKMSKey from the first destination chunk's fully-formed metadata (with EDK + IV) instead of a stub key with only KeyID + EncryptionContext + BucketKeyEnabled. The same fix needs to apply to the other two paths that build entry-level SSE-KMS metadata: - copyMultipartCrossEncryption() — cross-encryption to SSE-KMS dest. Per-chunk metadata comes from copyCrossEncryptionChunk's CreateSSEKMSEncryptedReaderWithBucketKey call, so chunks[0] has a real EDK + IV. Use it. - copyChunksWithSSEKMS() direct (same-key) branch. After createDestinationChunkPreservingSSE in copySingleChunk, dst chunks carry the source's per-chunk SSE-KMS metadata. Use chunks[0] for the entry-level key so single-chunk same-key copies don't fall back to a stub key on the read path. Without this, single-chunk SSE-KMS reads through these two paths failed at GET with "Invalid ciphertext format" — KMS unwrap was called on an empty EDK. Found by coderabbitai review on PR #9282. * fix(s3api): add 0-byte fallback to SSE-KMS reencryption entry-level metadata (review feedback on #9282) copyChunksWithSSEKMSReencryption was missing the fallback for 0-byte objects (where dstChunks is empty), inconsistent with the fallback in copyChunksWithSSEKMS direct branch and copyMultipartCrossEncryption. Without it, a 0-byte SSE-KMS copy would land with no entry-level SeaweedFSSSEKMSKey, so the read path's IsSSEKMSEncryptedInternal check would not recognise the empty object as SSE-KMS. Mirror the existing fallback: build a stub SSEKMSKey with KeyID, context and bucket-key state; serialize it as the entry-level key. Found by gemini-code-assist review on PR #9282. * fix(s3api): SSE-KMS direct copy must check encryption context + bucket-key, not just key ID (review feedback on #9282) DetermineSSEKMSCopyStrategy / CanDirectCopySSEKMS only compares the source and destination KMS key IDs, but the destination request can also change the encryption context or the BucketKey flag. Both are embedded in the source ciphertext's wrapped EDK; preserving the source metadata verbatim does not satisfy a destination request that asks for different settings, so the destination object would silently report the source's context/flag instead of what was requested. Add srcSSEKMSStateMatchesDest: deserialize the source's stored SSEKMSKey and compare its EncryptionContext + BucketKeyEnabled to the destination request. If either differs, force the slow re-encrypt path (SSEKMSCopyStrategyDecryptEncrypt) so the destination gets a freshly-wrapped EDK bound to the requested context/flag. A malformed source key is treated as non-matching (conservative). nil and empty encryption-context maps are treated as equal to avoid spurious divergence when the request omits the context header. Found by coderabbitai review on PR #9282. * fix(s3api): copyMultipartSSEKMSChunk falls back to entry-level key + entry-level metadata uses first chunk's full key (review feedback on #9282) Two related issues in copyMultipartSSEKMSChunks / copyMultipartSSEKMSChunk: 1. copyMultipartSSEKMSChunks built the destination's entry-level SeaweedFSSSEKMSKey from a stub (KeyID + context + bucket-key only), missing the EDK + IV. Single-chunk reads through this path fall back to entry-level keyData and would fail at GET because KMS would be asked to unwrap an empty EDK. Mirrors the fix in copyChunksWithSSEKMS / copyMultipartCrossEncryption / copyChunksWithSSEKMSReencryption: prefer the first dst chunk's full per-chunk metadata, fall back to the stub only for 0-byte objects. 2. copyMultipartSSEKMSChunk hard-failed when chunk.GetSseMetadata() was empty. Newer multipart SSE-KMS uploads populate per-chunk metadata, but legacy objects may have only entry-level metadata and would now be impossible to copy. Add a sourceEntrySSEKey fallback parameter (deserialized once by the caller from entry.Extended[SeaweedFSSSEKMSKey]); use it when per-chunk metadata is absent. Found by coderabbitai review on PR #9282. * refactor(s3api): extract entry-level SSE-KMS deserialization and per-chunk fallback into helpers (review feedback on #9282) Three medium-priority maintainability comments from gemini-code-assist: - The same "deserialize entry.Extended[SeaweedFSSSEKMSKey]" pattern appeared in srcSSEKMSStateMatchesDest, copyMultipartSSEKMSChunks and copyChunksWithSSEKMSReencryption. - The "prefer per-chunk metadata, fall back to entry-level key" selection logic appeared inline in copyMultipartSSEKMSChunk and copyChunkWithSSEKMSReencryption with subtly different shapes. - encryptionContextEqual hand-rolled a map comparison. Pull both patterns out into named helpers: - deserializeEntrySSEKMSKey: returns the entry-level SSEKMSKey or nil on missing/malformed data, with a single V(2) log line. - resolveChunkSSEKMSKey: centralises the chunk-vs-entry-level selection so all sites use the same decryption-side selection logic (which must mirror the encryption side). Replace encryptionContextEqual's manual loop with reflect.DeepEqual, keeping the empty-vs-nil shortcut at the top because DeepEqual treats those as different. No behaviour change; existing copy tests still pass. |
||
|
|
82cf60a44f |
fix(s3api): re-encrypt UploadPartCopy bytes for the destination's SSE config (#8908) (#9280)
* fix(s3api): re-encrypt UploadPartCopy bytes for the destination's SSE config (#8908) The remaining failure mode in #8908 was that Docker Registry's blob finalization (server-side Move via UploadPartCopy) silently corrupts SSE-S3 multipart objects. Reproduces with `aws s3api upload-part-copy` under bucket-default SSE-S3: the GET on the completed object returns deterministic wrong bytes (correct length, same wrong SHA-256 across runs). The metadata is mathematically self-consistent — every chunk's stored IV equals `calculateIVWithOffset(baseIV_dst, partLocalOffset)` — but the bytes on disk were encrypted with the SOURCE upload's key+baseIV. Root cause: - `copyChunksForRange` (and `createDestinationChunk`) constructs new chunks for UploadPartCopy without copying `SseType` / `SseMetadata`, so destination chunks are written with `SseType=NONE`. - At completion, `completedMultipartChunk` (PR #9224's NONE→SSE_S3 backfill, intended to recover from a different missing-metadata bug) sees those NONE chunks under an SSE-S3 multipart upload and backfills SSE-S3 metadata derived from the destination upload's baseIV. The chunk metadata is now internally consistent and the GET path applies decryption — but the bytes on disk are encrypted with the source upload's key, not the destination's. Decryption produces deterministic garbage. Docker Registry pulls then fail with "Digest did not match". Fix: when either the source object or the destination multipart upload has any SSE configured, take a slow-path UploadPartCopy that (1) opens a plaintext reader of the source range — decrypting the source's per-chunk SSE-S3 metadata if needed via a reused `buildMultipartSSES3Reader`, and (2) feeds that plaintext through `putToFiler`'s existing encryption pipeline by staging the destination upload entry's SSE-S3/SSE-KMS headers on a cloned request. Encryption then matches PutObjectPart's contract: every part starts a fresh CTR stream from counter 0 with `baseIV_dst`, and each internal chunk's metadata records `calculateIVWithOffset(baseIV_dst, chunk.partLocalOffset)`. The `non-SSE → non-SSE` case still takes the existing fast raw-byte copy path — bytes on disk are plaintext on both sides, so chunk-level metadata is irrelevant. Cross-encryption from SSE-KMS / SSE-C sources is left as TODO — the new path returns an explicit error rather than the previous silent corruption. SSE-S3 (the user-reported case) round-trips correctly. Tests: - test/s3/sse/s3_sse_uploadpartcopy_integration_test.go pins three UploadPartCopy shapes against bucket-default SSE-S3: * Docker-Registry-shape 32MB+tail (the user's exact 5-chunk / 2-part metadata layout) * single full-object UploadPartCopy * many small range copies Each round-trips SHA-256. - test/s3/sse/s3_sse_concurrent_repro_test.go covers the parallel multipart-upload shape from the user report (5 blobs in parallel, full GET and chunked range GET both hash-checked) — pre-existing coverage; added here as a regression sentinel. * test(s3-sse): rename UploadPartCopy regression test so CI matches it The CI workflow .github/workflows/s3-sse-tests.yml dispatches on the TEST_PATTERN ".*Multipart.*Integration" — i.e. the test name must contain both "Multipart" and "Integration" for CI to run it. The previous name TestSSES3UploadPartCopyIntegration had only "Integration"; "UploadPart" isn't "Multipart". Rename to TestSSES3MultipartUploadPartCopyIntegration so the regression test actually runs in CI rather than only locally. * fix(s3api): map unsupported UploadPartCopy SSE source to 501, not 500 (review feedback on #9280) openSourcePlaintextReader explicitly rejects SSE-KMS and SSE-C sources (SSE-S3 is the only one wired up in this slow path so far). Earlier the caller blanket-mapped that to ErrInternalError, which collapses "this shape isn't implemented yet" into the same 500 response a real server failure would produce. Clients can no longer tell whether they hit a feature gap or a bug. Introduce a sentinel errCopySourceSSEUnsupported and have copyObjectPartViaReencryption errors.Is-check it; on match, return ErrNotImplemented (501) instead of ErrInternalError (500). Other failures still map to 500. Found by coderabbitai review on PR #9280. * fix(s3api): UploadPartCopy must fail with NoSuchUpload when upload entry is missing (review feedback on #9280) CopyObjectPartHandler's earlier checkUploadId call only verifies that the uploadID's hash prefix matches dstObject; it does not prove the upload directory exists in the filer. The previous logic silently swallowed filer_pb.ErrNotFound from getEntry(uploadDir) and fell through with uploadEntry=nil, which then skipped the destination SSE check and could route a plain-source copy through the raw-byte fast path even though the destination's encryption state is unknown. Treat ErrNotFound as ErrNoSuchUpload so the client sees the right status, matching the AWS S3 contract for UploadPartCopy on a non-existent upload. Found by coderabbitai review on PR #9280. * feat(s3api): set SSE response headers on UploadPartCopy slow path (review feedback on #9280) PutObjectPartHandler writes x-amz-server-side-encryption (and the KMS key-id header for SSE-KMS) on every successful part response so clients can confirm the destination's encryption state. The new UploadPartCopy slow path was missing this — it returned only the ETag in the response body and no SSE response headers. Plumb putToFiler's SSEResponseMetadata back through copyObjectPartViaReencryption to the handler, then call setSSEResponseHeaders before writing the XML response, matching the PutObjectPart contract. Found by gemini-code-assist review on PR #9280. * fix(s3api): map transient filer errors on UploadPartCopy upload-entry fetch to 503 (review feedback on #9280) Earlier non-ErrNotFound errors from getEntry(uploadDir, uploadID) all returned 500 InternalError, which most SDKs treat as fatal — even though a transient filer outage (gRPC Unavailable, leader election in flight, deadline exceeded) is exactly the kind of failure SDK retry logic is supposed to recover from. Add an isTransientFilerError helper that recognises: - context.DeadlineExceeded / context.Canceled - gRPC codes.Unavailable, DeadlineExceeded, ResourceExhausted, Aborted When the upload-entry fetch fails for one of those reasons, return 503 ServiceUnavailable so the client retries; everything else still maps to 500. Log line now also carries dstObject (in addition to dstBucket and uploadID) to make incident triage easier. Found by gemini-code-assist review on PR #9280. |
||
|
|
e82789ea4b |
rust(volume): strip grpc-port suffix from master URL before HTTP lookup (#9276)
* rust(volume): strip grpc-port suffix from master URL before HTTP lookup The volume server stores `master_url` in SeaweedFS's canonical `host:httpPort.grpcPort` form (e.g. `node-a:5300.5310`). When `lookup_volume` builds the master `/dir/lookup` URL, the appended gRPC port turns the URL into `http://node-a:5300.5310/...`, which reqwest rejects with "builder error". Every replicated write, batch-delete lookup, and proxy/redirect read then fails. Mirror Go's `pb.ServerAddress.ToHttpAddress()` with a new `to_http_address` helper and apply it inside `lookup_volume`, the single funnel for all three HTTP master lookups in the Rust volume server. Other consumers of `master_url` already go through gRPC and use `to_grpc_address` / `parse_grpc_address`. Includes a regression test that mocks a master HTTP server and calls `lookup_volume` with a `host:port.grpcPort` master URL — without the fix it reproduces the exact "lookup request failed: builder error" from issue #9274. Fixes #9274 * rust(volume): only strip dotted suffix from address when both ports are numeric Previously `to_http_address` rewrote any `host:foo.bar` to `host:foo`, which would silently drop the suffix on malformed config (e.g. a hostname like `host:abc.def` or `host:99999.19333`). Validate that both halves of the dotted suffix parse as `u16` before stripping — mirrors the validation that `to_grpc_address` already does in the inverse direction. Also slice the input directly instead of going through `format!`, since the result is just a prefix of `addr`. Adds a test that asserts non-numeric / out-of-range dotted suffixes are preserved unchanged. * rust(volume): strip grpc-port suffix from peer URLs in replicate / proxy / redirect In normal operation the master returns `Location.url` as plain `host:port` and the gRPC port arrives in a separate field. But the volume server already has defensive logic (`grpc_address_for_location`) for the `host:httpPort.grpcPort` form on those same URLs, which implies a code path where peer URLs do carry the suffix. Apply `to_http_address` to `loc.url` / `target.url` before building HTTP URLs in `do_replicated_request`, `proxy_request`, and `redirect_request` to keep replicate-write, proxy-read, and redirect paths from hitting the same `lookup request failed: builder error` mode that #9274 documented for master lookups. Adds a unit test exercising `redirect_request` with a `.grpcPort` suffix on `target.url`. * rust(volume): return Cow<str> from to_http_address to skip allocation on the no-suffix path Most addresses pass through `to_http_address` unchanged (master and peer URLs are normally plain `host:port`), so the previous String return type allocated on every call for nothing. Switch to `Cow<str>`: the common pass-through borrows from the input, and only the rewrite branch allocates. Call sites use the result via `format!`/`Display`, which both work transparently with `Cow<str>`. Adds a test asserting the variant is Borrowed in the no-rewrite cases and Owned only when the suffix is stripped. * rust(volume): cover bracketed IPv6 literals in to_http_address tests The current implementation already handles IPv6 correctly because `rfind(':')` lands on the colon after the closing bracket, leaving the dotted suffix logic unchanged. Add explicit test coverage so the behavior is locked down for dual-stack deployments — both the strip case (`[::1]:9333.19333` -> `[::1]:9333`) and the various passthrough cases (no suffix, non-numeric, out-of-range, trailing colon). * rust(volume): parse both ports in one tuple pattern match in to_http_address Combine the two `is_ok()` checks into a single `if let (Ok(_), Ok(_))` tuple match — equivalent semantics, slightly tighter expression of intent. No behavior change. |
||
|
|
7a461ffc2f |
fix(mount): copy xattr value bytes to avoid FUSE buffer aliasing (#9278)
fix(mount): copy xattr value bytes to avoid FUSE buffer aliasing (#9275) SetXAttr stored the caller-supplied `data` slice directly into entry.Extended. That slice aliases go-fuse's per-request input buffer, which is returned to a pool the moment the handler returns. When a file is open during setxattr (the open-fh path defers persistence to flush), the next FUSE request recycles the buffer and silently overwrites the stored xattr bytes; flushMetadataToFiler then ships the corrupted bytes to the filer. `cp -a` reproduces this because it issues a setxattr while holding an open fh, then continues to issue follow-up FUSE ops that reuse the same buffer. The path-based setxattr (e.g. setfattr without an open fh) saves synchronously inside the same handler, so the bytes were marshalled before the buffer could be reused — that is why the source file in the report looked fine and only the cp -a destination was garbage. Defensively copy the bytes when storing them, and add a unit test that mutates the caller buffer after SetXAttr returns to lock in the invariant. |
||
|
|
108e42fb8b |
chore(shell): fix typo in EC shard helper param names (#9277)
* fix(shell): correct volume.list -writable filter unit and comparison * fix(shell): correct volume.list -writable filter unit and comparison * chore(shell): fix typo in EC shard helper param names |