Commit Graph

14196 Commits

Author SHA1 Message Date
dependabot[bot] d28cd94601 build(deps): bump cloud.google.com/go/storage from 1.62.1 to 1.62.3 (#9977)
Bumps [cloud.google.com/go/storage](https://github.com/googleapis/google-cloud-go) from 1.62.1 to 1.62.3.
- [Release notes](https://github.com/googleapis/google-cloud-go/releases)
- [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md)
- [Commits](https://github.com/googleapis/google-cloud-go/compare/storage/v1.62.1...storage/v1.62.3)

---
updated-dependencies:
- dependency-name: cloud.google.com/go/storage
  dependency-version: 1.62.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-15 10:35:41 -07:00
dependabot[bot] d793efd13a build(deps): bump go.etcd.io/etcd/client/v3 from 3.6.10 to 3.6.12 (#9979)
Bumps [go.etcd.io/etcd/client/v3](https://github.com/etcd-io/etcd) from 3.6.10 to 3.6.12.
- [Release notes](https://github.com/etcd-io/etcd/releases)
- [Commits](https://github.com/etcd-io/etcd/compare/v3.6.10...v3.6.12)

---
updated-dependencies:
- dependency-name: go.etcd.io/etcd/client/v3
  dependency-version: 3.6.12
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-15 10:35:30 -07:00
dependabot[bot] 3888e6f58a build(deps): bump github.com/a-h/templ from 0.3.1001 to 0.3.1020 (#9978)
Bumps [github.com/a-h/templ](https://github.com/a-h/templ) from 0.3.1001 to 0.3.1020.
- [Release notes](https://github.com/a-h/templ/releases)
- [Commits](https://github.com/a-h/templ/compare/v0.3.1001...v0.3.1020)

---
updated-dependencies:
- dependency-name: github.com/a-h/templ
  dependency-version: 0.3.1020
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-15 10:35:19 -07:00
dependabot[bot] 303b23f96a build(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.21 to 1.32.25 (#9975)
build(deps): bump github.com/aws/aws-sdk-go-v2/config

Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.32.21 to 1.32.25.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.32.21...config/v1.32.25)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/config
  dependency-version: 1.32.25
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-15 10:34:35 -07:00
Chris Lu c6cf5a5bd7 4.34 2026-06-14 22:41:44 -07:00
Chris Lu 7df43ad9b5 admin: add connected Mount Clients page and dashboard section (#9968)
* admin: add connected mount clients page and dashboard section

The filer is the authority on who is subscribed to its metadata stream
(FUSE/VFS mounts, S3, peer filers, ...), but its in-memory listener
registry only tracked clientId->epoch and was not exposed.

- Enrich the filer subscriber registry with name/type/address/path/
  connected-time, populated in addClient and cleared in deleteClient so
  it reflects currently-connected clients only.
- Add a ListMetadataSubscribers filer gRPC (optional client-type filter).
- Admin server fans out to every filer, filters to mount types
  ("mount" Go weed mount, "sw-vfs" Rust VFS), and renders a new
  Cluster > Mount Clients page plus a Mount Clients dashboard section.

Read-only; no behavior change to the subscribe hot path.

* admin: address review — parallelize filer fan-out, guard nil map, robust CSV

- GetMountClients now queries filers concurrently, each under a 5s
  timeout, so a slow/unreachable filer can't stall the admin dashboard.
- Defensively initialize fs.subscribers before first write.
- Mount Clients CSV export uses a Blob with quote-escaping instead of a
  data: URI, so special characters in paths export correctly.
2026-06-14 21:44:10 -07:00
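The concurrent fan-out described in #9968 can be sketched as follows. This is a minimal illustration, not the admin server's code: `queryFunc`, `fanOut`, and `fakeQuery` are hypothetical names, and the real call is a filer gRPC rather than a local function.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// queryFunc is a hypothetical stand-in for the per-filer subscriber lookup.
type queryFunc func(ctx context.Context, filer string) ([]string, error)

// fanOut queries every filer concurrently, each under its own timeout, and
// merges the answers in filer order. Filers that fail or time out are skipped.
func fanOut(filers []string, perFiler time.Duration, query queryFunc) []string {
	results := make([][]string, len(filers)) // one slot per goroutine
	var wg sync.WaitGroup
	for i, filer := range filers {
		wg.Add(1)
		go func(i int, filer string) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), perFiler)
			defer cancel()
			if clients, err := query(ctx, filer); err == nil {
				results[i] = clients
			}
		}(i, filer)
	}
	wg.Wait()

	var all []string
	for _, r := range results {
		all = append(all, r...)
	}
	return all
}

// fakeQuery simulates one unreachable filer that blocks until its deadline.
func fakeQuery(ctx context.Context, filer string) ([]string, error) {
	if filer == "filer-slow" {
		<-ctx.Done()
		return nil, ctx.Err()
	}
	return []string{"mount@" + filer}, nil
}

func main() {
	filers := []string{"filer-a", "filer-slow", "filer-b"}
	fmt.Println(fanOut(filers, 50*time.Millisecond, fakeQuery))
}
```

Because each goroutine owns its deadline and its result slot, the total wait is bounded by the slowest single timeout rather than the sum over all filers.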
Chris Lu a736ba1c21 filer: keep metadata-subscription send gauge fresh on idle heartbeat (#9966)
* filer: keep metadata-subscription send gauge fresh on idle heartbeat

last_send_timestamp_of_subscribe only advanced when a real matching
metadata event was streamed to a subscriber. On a quiet path an idle but
perfectly healthy subscriber therefore looked increasingly stale, and the
dashboard panel rendered a large, misleading 'lag'.

An idle heartbeat is a send too, so advance the gauge when one is emitted.
Subscribers that opt into idle heartbeats (filer.sync) now report true
freshness; the rest still show time since the last real event.

Rename the dashboard panel 'Metadata Subscription Lag' ->
'Time Since Last Subscription Send' and clarify its description to match.

* filer: guard nil option when advancing heartbeat gauge

maybeSendIdleHeartbeat is unit-tested with a bare &FilerServer{} (nil
option), so dereferencing fs.option.Host for the sourceFiler label
panicked. Guard it: production always has option set; the test now gets
an empty sourceFiler label instead of a nil-pointer panic.
2026-06-14 21:43:03 -07:00
Chris Lu 14d247703a s3: register account-less identities' synthesized account so ACL/owner ids resolve (#9971)
* s3: register account-less identities' synthesized account in the lookup

#9962 gave each account-less identity a distinct account id derived from
its name (instead of collapsing into admin), but never registered that
account in the id->account map. GetAccountNameById then returned empty
for such ids, so ACL grantee validation rejected canonical grants to the
caller's own account with InvalidRequest, and bucket/object owner display
was dropped as 'owner is invalid'.

This broke a canned PutObjectAcl by an account-less identity (e.g.
TestVersionedObjectAcl with the default 'some_admin_user' identity):
ValidateAndTransferGrants -> GetAccountNameById -> 'account id is not
exists' -> 400 InvalidRequest.

Register the synthesized account at config load so its id resolves to a
display name. Add a regression test.

* s3: reuse explicitly-configured account for account-less identity

Address review: if an account with the same id as an account-less
identity's synthesized account is explicitly configured (custom display
name/email), reuse it instead of the synthesized one. Add a test.
2026-06-14 21:42:23 -07:00
Chris Lu 1391a85a20 metrics: add per-bucket S3 panels and volume slot utilization to dashboard (#9969)
* metrics: add per-bucket S3 panels and volume slot utilization to dashboard

The S3 Gateway/Buckets rows sliced requests only by type, never by bucket,
though SeaweedFS_s3_request_total and the request/TTFB histograms all carry a
bucket label. Add a Bucket template variable and four per-bucket panels to the
S3 Buckets row (request rate, response codes, request p95, TTFB p95), plus a
Volume Slots Utilization panel (volumes/max_volumes) to the Volume Servers row.

* metrics: scope bucket variable to cluster and guard slot utilization divide-by-zero

Source the Bucket variable from SeaweedFS_s3_bucket_size_bytes (a gauge emitted
for every bucket, including idle ones) scoped to the selected cluster, so the
dropdown lists all buckets in the cluster rather than only those that received
requests. Wrap the volume slot utilization denominator in clamp_min(..., 1) to
avoid +Inf/NaN when max_volumes is briefly zero or absent.
2026-06-14 21:10:02 -07:00
Chris Lu 0e9f702152 metrics: guard time()-based dashboard panels against unset gauges (#9965)
* metrics: guard time()-based panels against unset (zero) gauges

Several dashboard panels compute "time() - <gauge>" (uptime, time-since-
last-scrub, lifecycle cursor lag, last daily walk). When the underlying
gauge is unset it reports 0, so the panel rendered ~56 years (time since
the Unix epoch). This is the common case for "Volume Server Uptime": the
Rust volume server doesn't set SeaweedFS_volumeServer_start_time_seconds,
and Go components expose it registered-but-zero.

Guard each such expression with "> 0" so unset/zero series drop out (the
panel shows no data) instead of rendering a nonsensical epoch value.
Affected panels: Master Uptime, Volume Server Uptime, Time Since Last
Scrub, Lifecycle Cursor Lag, Time Since Last Daily Walk.

* metrics: guard Filer Sync Offset Lag and Metadata Subscription Lag panels too

Extend the > 0 guard to the two remaining time()-<gauge> panels that
share the same unset/zero failure mode (filerSync sync offset and the
metadata subscribe last-send timestamp), so they don't render ~56-year
lag when the gauge is absent.
2026-06-14 14:41:18 -07:00
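The failure mode fixed in #9965 is easy to reproduce outside a dashboard. The Go sketch below is only an analogy for the PromQL expressions: `secondsSince` is a hypothetical helper, and it assumes the "> 0" filter is applied to the gauge before the subtraction.

```go
package main

import "fmt"

// secondsSince mirrors "time() - gauge" with a "> 0" guard on the gauge:
// an unset (zero) gauge yields no sample instead of the time since 1970.
func secondsSince(now, gauge float64) (float64, bool) {
	if gauge <= 0 {
		return 0, false
	}
	return now - gauge, true
}

func main() {
	const now = 1781395200.0 // 2026-06-14 00:00:00 UTC as Unix seconds
	const secondsPerYear = 365.25 * 24 * 3600

	// Unguarded: a zero gauge renders as the time since the Unix epoch.
	fmt.Printf("unguarded: %.1f years\n", (now-0)/secondsPerYear) // about 56.4 years

	// Guarded: the series drops out, so the panel shows no data.
	if _, ok := secondsSince(now, 0); !ok {
		fmt.Println("guarded: no data")
	}
}
```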
Chris Lu d47cc45b1f admin: fold dashboard sparklines into the existing cards (de-dup) (#9964)
admin: fold dashboard sparklines into the existing cards

The trend sparklines added in #9957 lived in a separate "Cluster Trends"
row that duplicated the existing summary cards (Volumes, Files, Disk Used,
EC Shards). Remove that row and instead render each sparkline inside the
matching summary card, so every headline number shows its recent trend
without duplication. The two maintenance metrics that have no existing
card — Active Tasks and Workers — now fill the previously-empty columns of
the EC row (also with sparklines).

DashboardTrends changes from a Cards slice to named per-card sparkline
SVGs (+ current values for the two maintenance cards). Drops the now-unused
trendBytes helper (disk size keeps using the existing formatBytes).
2026-06-14 14:17:43 -07:00
Chris Lu b13463880c s3tables: scope management authorization to the caller's identity (#9961)
* s3tables: resolve account-less identities to a distinct principal

Static identities with no account block default to the shared admin
account, so getAccountID returned "admin" for every such user and the
permission checks treated them all as the admin principal. Only keep the
admin account when the identity actually carries an admin action;
otherwise fall back to the unique identity name.

* s3tables: limit the open-by-default fallback to anonymous access

The legacy permission path allowed any request that no policy explicitly
denied whenever default-allow was on, which is the zero-config default.
That let an authenticated identity without table permissions reach table
resources owned by others. Restrict the fallback to requests with no
identity or the anonymous identity; authenticated callers must pass an
explicit action or policy check. Zero-config and anonymous access are
unchanged.

* s3tables: drop the no-op ListTableBuckets account gate

The top-level check passed the principal as its own owner, so it always
allowed. Per-bucket filtering in the loop is the real authority; remove
the dead gate and the now-unused locals.

* s3tables: derive the Iceberg catalog's default-allow from auth state

The Iceberg catalog reuses the S3 Tables Manager, which hardcoded
default-allow on. Authenticated callers were enforced only because the
identity struct happens to propagate into the handler; if it were ever
dropped, a secured catalog would fall open. Mirror the S3 port and set
the Manager's default-allow from the authenticator, so an authenticated
caller is enforced regardless. Shell and admin keep their own trusted
Manager. Regression test covers the struct, name-only, and admin paths.

* s3tables: drop redundant ACTION_ADMIN string conversion

ACTION_ADMIN is an untyped string constant, so the conversion is a no-op.

* s3tables: enforce name-only authenticated callers, add trusted bypass

defaultAllowFor treated a request with no identity object as anonymous,
but the Manager path forwards only the identity name (not the struct).
A name-only authenticated caller could therefore be misclassified as
anonymous and allowed under the open default. Treat a server-set identity
name as authenticated too, and add an explicit trusted flag for the local
shell/admin tooling that legitimately bypasses authorization.

* s3tables: trim verbose comments
2026-06-14 13:55:36 -07:00
Chris Lu b56d155b31 admin: native at-a-glance trend sparklines on the dashboard (#9957)
* admin: native at-a-glance trend sparklines on the dashboard

Add a "Cluster Trends" row to the admin Dashboard with inline-SVG
sparklines for volumes, EC shards, disk used, files, active maintenance
tasks, and workers.

The data comes entirely from what the admin already holds — the cached
cluster topology and the in-process maintenance queue — sampled into a
small bounded ring buffer on the existing maintenance-metrics ticker
(~15 min of history). No Prometheus/Grafana dependency, no JS chart
library, no extra goroutine: the sparklines are self-contained SVG
rendered server-side via templ.

This gives basic trend visibility out of the box for clusters that don't
run Prometheus, and a quick glance next to the cluster controls; Grafana
remains the place for deep/historical dashboards.

* admin: cap trendBytes unit index to avoid out-of-bounds panic

A value >= 1 ZiB would push exp past the end of the units string and
panic on units[exp]; cap exp at the last unit (EiB).
2026-06-14 13:55:26 -07:00
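The out-of-bounds case fixed in the second part of #9957 can be illustrated with a small formatter. This is a sketch under assumptions: `formatBytes` here is a hypothetical helper written for this example, not the project's `trendBytes` or its existing `formatBytes`.

```go
package main

import "fmt"

// formatBytes renders a byte count with binary units. The unit index is
// capped at the last entry (EiB), so values of 1 ZiB and above cannot index
// past the end of the units string.
func formatBytes(v float64) string {
	const units = "KMGTPE" // KiB .. EiB
	if v < 1024 {
		return fmt.Sprintf("%.0f B", v)
	}
	exp := 0
	for v >= 1024*1024 && exp < len(units)-1 {
		v /= 1024
		exp++
	}
	return fmt.Sprintf("%.1f %ciB", v/1024, units[exp])
}

func main() {
	fmt.Println(formatBytes(1536))    // 1.5 KiB
	fmt.Println(formatBytes(1 << 70)) // 1024.0 EiB, no panic
}
```

Without the `exp < len(units)-1` bound, the second call would keep dividing and index one past the last unit.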
Chris Lu c1636ac41c s3: give STS sessions a distinct owner account instead of admin (#9963)
* s3: give STS sessions a distinct owner account, not admin

STS sessions were built with Account: &AccountAdmin, so every assumed-role
session shared the admin account for ownership and ACL checks. Use the
assumed-role user as the account id instead, matching the JWT auth path.
Session permissions are unchanged: they come from the session policies,
and admin is granted only through Actions.

* s3: resolve STS session identity to the OIDC subject

Use sessionInfo.Subject (falling back to the assumed-role user when
absent) for the session identity name and account id, so the SigV4 and
JWT auth paths resolve the same session to the same identity instead of
diverging on AssumedRoleUser vs Subject.

* s3: trim verbose comments
2026-06-14 13:55:11 -07:00
Chris Lu e64c821139 s3: give account-less identities a distinct owner instead of admin (#9962)
* s3: stop collapsing account-less identities into the admin account

Identities configured without an account block all defaulted to the
shared admin account, so distinct users got the same owner id and
ownership checks could not tell them apart. checkAccessByOwnership also
treated that id as an admin bypass, so any account-less caller passed
ownership for any bucket. Give such identities a distinct account id from
their name, and decide the ownership admin bypass by Admin capability
rather than by the account id. isUserAdmin is now nil-safe.

* s3: use the context identity in isUserAdmin before re-authenticating

The Auth middleware already verifies and stores the identity in the
request context. Read it there first so the ownership/admin checks don't
re-run signature verification, which is redundant and fails once the
request body has been consumed.

* s3: nil-guard the context identity in isUserAdmin

A non-nil interface wrapping a typed-nil *Identity passes the type
assertion; guard against it before calling isAdmin().

* s3: trim verbose comments
2026-06-14 13:54:49 -07:00
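The typed-nil pitfall guarded against in #9962 is a general Go behavior. The sketch below uses a hypothetical `Identity` type and a simplified `isUserAdmin` that takes the stored value directly, rather than reading it from a request context.

```go
package main

import "fmt"

// Identity is a hypothetical stand-in for the S3 identity type.
type Identity struct{ admin bool }

func (id *Identity) isAdmin() bool { return id.admin }

// isUserAdmin inspects a value as it might be stored in a request context.
// The type assertion succeeds for a typed-nil *Identity, so the pointer
// itself must be checked before calling a method that dereferences it.
func isUserAdmin(v interface{}) bool {
	id, ok := v.(*Identity)
	if !ok || id == nil {
		return false
	}
	return id.isAdmin()
}

func main() {
	var missing *Identity       // typed nil pointer
	var v interface{} = missing // non-nil interface wrapping it

	fmt.Println(v != nil)                            // true
	fmt.Println(isUserAdmin(v))                      // false
	fmt.Println(isUserAdmin(&Identity{admin: true})) // true
}
```

Dropping the `id == nil` check would make the second call panic inside `isAdmin`, because the method reads a field through a nil pointer.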
Chris Lu 3fd5018bd2 metrics: overhaul Grafana dashboard for full metric coverage (#9956)
The bundled dashboard (other/metrics/grafana_seaweedfs.json) covered only
18 of the 84 metrics weed/stats exposes and was a legacy Grafana 8 export
(graph panels, schemaVersion 30). Rebuild it as a modern dashboard
(timeseries panels, schemaVersion 39) with 100% metric coverage, targeting
the direct-scrape model used by Prometheus / seaweed-up / Kubernetes.

- Full coverage of every weed/stats metric: master, volume server, filer,
  filer store/sync, s3, s3 buckets, s3 lifecycle, admin/maintenance, build,
  wdclient, upload errors, plus Go runtime/process per component.
- Organized into collapsible rows with an always-on Overview.
- Scrape label model: group by `instance`; generic go_*/process_* panels
  use `job=~"seaweedfs-.*"` to separate components; an optional `cluster`
  template variable (from SeaweedFS_build_info, defaults to All) supports
  multi-cluster setups and is transparent when no cluster label is present.
- Same uid (nh02dOVnz) and title so it upgrades in place; drops the dead
  "AWS monthly cost" panel.

This is also the single source of truth bundled by seaweed-up's
`cluster dashboard install`.
2026-06-14 11:48:30 -07:00
Chris Lu 7e608c877a refactor(ec_balance): make the balance planner per-volume ratio-capable (#9960)
* refactor(ec_balance): make the balance planner per-volume ratio-capable

Thread a per-volume EC ratio through the balance planner: Plan resolves each
volume's data/parity from a new Options.VolumeRatio (falling back to the
collection Ratio, then the build default, when it reports 0), and keys the
global phase's ratio maps by volume instead of collection. The shell and
worker balance paths build the per-volume lookup from each shard's heartbeat
via the new ecbalancer.VolumeShardRatio.

In OSS this is behavior-preserving: VolumeShardRatio returns 0 because the
per-volume data_shards/parity_shards heartbeat fields are an enterprise
feature, so every volume falls back to the collection ratio -- the existing
standard-scheme behavior. The refactor keeps the shared planner in sync with
the enterprise fork, which overrides VolumeShardRatio to classify and spread
a mixed-ratio collection by each volume's own data/parity split.

* perf(ec_balance): hoist the collection ratio out of the per-volume loop

The collection ratio is constant for every volume in a collection, so
resolve it once per collection instead of per volume; a custom Ratio func
may do map lookups or locking. Addresses a review comment.
2026-06-14 11:33:31 -07:00
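The fallback order described in #9960 (per-volume ratio, then collection ratio, then build default) can be sketched as below. The types and names are hypothetical, and treating "reports 0" as "zero data shards" is an assumption made for this example; the 10+4 default is the one the log states for OSS builds.

```go
package main

import "fmt"

// ratio is a hypothetical data+parity pair; the zero value means "unknown".
type ratio struct{ data, parity int }

// options loosely mirrors the two lookups named in the commit message.
type options struct {
	volumeRatio     func(volumeID uint32) ratio
	collectionRatio func(collection string) ratio
}

var buildDefault = ratio{data: 10, parity: 4}

// resolve picks the volume's own ratio when known, else the collection's,
// else the build default.
func resolve(o options, collection string, volumeID uint32) ratio {
	if o.volumeRatio != nil {
		if r := o.volumeRatio(volumeID); r.data > 0 {
			return r
		}
	}
	if o.collectionRatio != nil {
		if r := o.collectionRatio(collection); r.data > 0 {
			return r
		}
	}
	return buildDefault
}

// demoOptions knows a custom ratio for volume 7 and for collection "archive".
func demoOptions() options {
	return options{
		volumeRatio: func(id uint32) ratio {
			if id == 7 {
				return ratio{9, 3}
			}
			return ratio{}
		},
		collectionRatio: func(c string) ratio {
			if c == "archive" {
				return ratio{12, 4}
			}
			return ratio{}
		},
	}
}

func main() {
	o := demoOptions()
	fmt.Println(resolve(o, "archive", 7)) // {9 3}
	fmt.Println(resolve(o, "archive", 8)) // {12 4}
	fmt.Println(resolve(o, "photos", 8))  // {10 4}
}
```

When the per-volume lookup always reports zero, every volume resolves through the collection ratio, which is the behavior-preserving case the commit describes for OSS.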
Chris Lu 138220b961 fix(ec): recover EC shards with the volume's own ratio, not the build default (#9958)
* fix(ec): recover EC shards with the volume's own ratio, not the build default

recoverOneRemoteEcShardInterval rebuilt a missing shard with a hardcoded
10+4 Reed-Solomon matrix (and counted sufficiency / iterated shards
against the 10+4 constants). For a custom-ratio volume (e.g. 9+3) that
reconstructs with the wrong matrix and corrupts the recovered bytes, and
cachedLookupEcShardLocations could wrongly reject a degraded but
recoverable custom-ratio read. Use the volume's own ECContext (loaded
from its .vif) for the encoder, the shard-iteration bound, and the
data-shard sufficiency checks. In OSS the ratio is always 10+4 so this is
a no-op; it brings the Go volume server in line with the Rust one, which
already reconstructs with the volume's ratio.

* fix(ec): close data races in the EC read-recovery path

Address review: the freshness check in cachedLookupEcShardLocations read
ecVolume.ShardLocations / ShardLocationsRefreshTime without the lock while
recover goroutines mutate them via forgetShardId -- snapshot both under
ShardLocationsLock.RLock(). The recover goroutines also wrote the shared
is_deleted return concurrently -- collect it via an atomic and fold it in
after they join. Also size availableShards/missingShards by the volume's
ECContext ratio rather than the 10+4 constants.
2026-06-14 07:32:36 -07:00
Chris Lu c7781bfca2 fix(ec): remove shared EC index only when no shard remains node-wide (#9955)
* fix(ec): remove the shared EC index only when no shard remains node-wide

deleteEcShardIdsForEachLocation removed the shared .ecx/.ecj/.vif index
as soon as a single disk's shard count hit 0, even when a sibling disk
of the same node still held shards of the volume (split-disk reconciled
layout) -- orphaning those shards without their index. Split the
non-teardown delete into two passes: delete the requested shard files
(and now-orphaned per-disk bitrot sidecars) on every disk, then remove
the shared index only once no shard of the volume remains on ANY disk.
This brings the Go volume server in line with the Rust one, which already
gates the index removal on a node-wide check.

* refactor(ec): reuse checkEcVolumeStatus across the two delete passes

Address review: cache hasEcxFile/hasIdxFile from the node-wide count pass
and pass them to removeEcSharedIndexFiles instead of re-listing each
location's directory.

* fix(ec): clean an orphaned EC .vif even when its .ecx is already gone

Address review: removeEcSharedIndexFiles returned early on !hasEcxFile,
so a node-wide teardown left a stale EC .vif behind when its .ecx was
already removed. Decouple the .vif removal (gated on !hasIdxFile) from
.ecx presence so the generation metadata doesn't leak once no shard
remains node-wide.
2026-06-14 06:36:50 -07:00
Chris Lu ef5fee6c28 fix(storage): delete/unmount every copy of a duplicate volume id (#9954)
* fix(storage): delete and unmount every copy of a duplicate volume id

NewStore has no cross-disk duplicate guard (unlike the Rust volume
server, which refuses to start in that state), so a stale twin of a
volume id can mount on a second disk after a disk repair. DeleteVolume
and UnmountVolume returned after the first matching disk, leaving the
twin to survive and re-register as the volume's content. Walk every disk
and act on all copies, emitting one heartbeat delta per copy.

* fix(storage): surface partial delete/unmount failures across duplicate copies

Address review: if removing one copy of a duplicate volume id fails with a
real error (disk IO, permissions), the loop logged it and could still
return success once another copy was removed -- leaving the stale copy to
re-register, the exact divergence this guards against. DeleteVolume and
UnmountVolume now accumulate such errors and return them (still attempting
every disk), so a copy left behind is never reported as success. Add a
DeleteVolume duplicate-copies regression test.
2026-06-14 06:36:47 -07:00
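The "attempt every disk, report every failure" rule from #9954 can be sketched with `errors.Join` (Go 1.20+). The `disk` type and `deleteEverywhere` are hypothetical; the real code operates on disk locations and emits heartbeat deltas, which this example leaves out.

```go
package main

import (
	"errors"
	"fmt"
)

// disk is a hypothetical model of one disk location.
type disk struct {
	name     string
	hasCopy  bool  // a copy of the volume id is mounted here
	failWith error // simulated deletion failure, nil on success
}

var errNotFound = errors.New("volume not found")

// deleteEverywhere walks every disk instead of returning after the first
// match, and accumulates real failures so a copy left behind is never
// reported as success.
func deleteEverywhere(disks []disk) (removed int, err error) {
	var errs []error
	for _, d := range disks {
		if !d.hasCopy {
			continue
		}
		if d.failWith != nil {
			errs = append(errs, fmt.Errorf("%s: %w", d.name, d.failWith))
			continue // keep going: other copies must still be removed
		}
		removed++
	}
	if removed == 0 && len(errs) == 0 {
		return 0, errNotFound
	}
	return removed, errors.Join(errs...) // nil when errs is empty
}

func main() {
	disks := []disk{
		{name: "disk0", hasCopy: true},
		{name: "disk1", hasCopy: true, failWith: errors.New("permission denied")},
		{name: "disk2"},
	}
	removed, err := deleteEverywhere(disks)
	fmt.Println(removed, err)
}
```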
Chris Lu 284796c7b6 fix(ec): fence stale-worker EC shard cleanup by encode generation (#9953)
* feat(ec): add encode_ts_ns to the EC task params, shard-unmount, and shard-delete RPCs

The generation fence for stale EC-worker cleanup needs the encode
generation on three messages: ErasureCodingTaskParams (admin issues it),
VolumeEcShardsUnmountRequest, and VolumeEcShardsDeleteRequest (the worker
carries it to the volume server). Additive fields only; 0 preserves the
existing unfenced behavior. Mirror the two volume-server fields in the
Rust volume server's proto copy.

* feat(ec): issue the EC encode generation from the admin and carry it on the worker

Stamp each EC proposal's encode_ts_ns from the admin's per-cycle
DetectionSequence (a single-clock value) so generations are globally
ordered even though detection runs on a rotating worker. The worker
writes that generation into the distributed .vif and passes it on its
shard unmount/delete RPCs; it falls back to a local timestamp for the
.vif only on the unfenced legacy/shell path (keeping the read guard on).

* fix(ec): fence the stale-worker EC shard unmount and teardown by generation

A reaped-but-still-running EC worker's cleanupStaleEcShards issued a
generation-blind unmount + full teardown that could unmount and then
overwrite a newer run's live shards on a shared node. Both RPCs now
carry the encode generation: the volume server unmounts/deletes a disk
only when its .vif generation is strictly older than the request, and
preserves a same-or-newer generation, a generation-0 (recovered or
pre-upgrade) volume, and an unreadable .vif. Unload is per-disk, never
node-wide. Request generation 0 keeps the blanket teardown for the shell
pre-encode cleanup and pre-upgrade callers. Mirrored in the Rust volume
server.

* test(ec): cover the generation-fenced teardown and unmount

End-to-end volume-server tests: a fenced FullTeardown wipes a strictly-
older generation, preserves a newer one, preserves a generation-0 volume,
and blanket-wipes on request generation 0; the gen-aware unmount preserves
a same-or-newer mounted generation; and the .vif generation reader handles
present/absent/no-config cases.

* test(ec): pin the fenced .vif==teardown generation and the unreadable-.vif preserve

A fenced run must stamp the admin generation verbatim into the .vif so it
matches the generation sent on the teardown RPCs; add a regression test
that sets the task generation and asserts the .vif carries it exactly.
Also cover the present-but-unparseable .vif case (reads as generation 0,
preserved) and correct the readEcGenerationTsNs docstring accordingly.

* fix(ec): surface EC full-teardown filesystem errors in the Rust volume server

remove_ec_volume_files(_full_teardown) discarded every fs::remove_file
error, so a teardown that failed on permissions or a full disk still
returned full_teardown_done=true and left stale artifacts to collide with
the next encode. Return io::Result, ignore NotFound, propagate the first
real error, and have the teardown RPC surface it -- matching the Go
contract. The best-effort reconcile/load-cleanup callers keep ignoring it.

* refactor(ec): reuse the EC volume lookup on unmount and short-circuit the gen read

Address review: the Rust unmount fence reuses the ec_vol it already
fetched instead of a second find_ec_volume; the Go .vif generation reader
breaks out of the data/idx loop early when the two dirs are the same.
2026-06-14 01:54:04 -07:00
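The fencing rule in #9953 reduces to a small decision function. The sketch below is an assumption-laden restatement of the commit text: `shouldTeardown` is a hypothetical name, `int64` is an assumed type for `encode_ts_ns`, and an unreadable .vif is modeled as generation 0, as the log describes.

```go
package main

import "fmt"

// shouldTeardown decides whether a disk's EC shards may be unmounted or
// deleted for a request carrying requestGen, given the generation read from
// that disk's .vif (0 when recovered, pre-upgrade, or unreadable).
func shouldTeardown(requestGen, diskGen int64) bool {
	if requestGen == 0 {
		return true // unfenced legacy/shell caller: blanket teardown
	}
	if diskGen == 0 {
		return false // unknown generation on disk: preserve
	}
	return diskGen < requestGen // only a strictly older generation is removed
}

func main() {
	fmt.Println(shouldTeardown(200, 100)) // true: stale shards
	fmt.Println(shouldTeardown(200, 200)) // false: same generation
	fmt.Println(shouldTeardown(200, 300)) // false: newer run's live shards
	fmt.Println(shouldTeardown(200, 0))   // false: generation-0 volume
	fmt.Println(shouldTeardown(0, 300))   // true: request generation 0
}
```

The third case is the one that matters for a reaped-but-still-running worker: its older request generation can no longer remove shards written by a newer run.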
Chris Lu 561768a426 [s3]: preserve multipart copy checksums (#9948)
* s3: preserve checksums for copied multipart parts

* s3: return checksums from multipart copy

* s3: pin the upload's checksum algorithm on copy-part re-stream

* s3: note why UploadPartCopy uses the re-stream slow path

* s3: explain the TLS proxy in the multipart copy checksum test

* s3: cover nil and unknown-algorithm edge cases in copy checksum tests

* s3: cover all checksum algorithms in the multipart copy test

* s3: run all checksum integration tests, not just presigned
2026-06-14 00:16:14 -07:00
Chris Lu da243b9423 fix(ec): group orphan-source completeness by encode generation (topology encode_ts_ns) (#9952)
* feat(ec): carry the encode generation through the topology heartbeat

Add encode_ts_ns (field 14) to VolumeEcShardInformationMessage and
populate it from each EC volume's .vif identity. The volume server emits
it on the full and incremental heartbeats; the master stores it on
EcVolumeInfo and re-emits it via GetTopologyInfo, so the admin/worker
layer can see which encode run produced each shard set. Field 14 avoids
the enterprise fork's reserved 10-13. Mirror the proto field and both
heartbeat emit sites in the Rust volume server.

* fix(ec): group orphan-source shard completeness by encode generation

countExistingEcShardsForVolume ORed EcIndexBits across every disk, so two
interrupted encode runs whose shard sets overlap unioned into a
false-complete set -- triggering the orphaned-source delete while no
single generation was actually complete. Group shards by encode_ts_ns and
return the largest single generation's count, so the trigger fires only
when one run holds the full set. Shards from pre-upgrade servers
(encode_ts_ns==0) form their own bucket.

The heartbeat carries one encode_ts_ns per (volume, disk), so this
separates generations on different disks; same-disk mixing is prevented
upstream by the pre-encode artifact wipe and the cross-run read guard.

* fix(ec): guard against a nil Ec shard info entry in the generation count

Defensive: a manually-constructed or corrupted topology could carry a nil
entry in EcShardInfos. Skip it rather than dereference.

* fix(ec): carry the encode generation on the EC shard unmount delta

The mount delta sets EncodeTsNs; the unmount deletion delta left it 0.
Populate it from the Ec volume before unloading so both incremental
deltas are consistent (the Rust volume server already does this via its
snapshot diff).
2026-06-14 00:14:12 -07:00
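The false-complete case fixed in #9952 shows up with two overlapping partial runs. The sketch below uses a hypothetical `shardInfo` type with a shard bitmask; it contrasts a union across all disks with the per-generation grouping the commit describes.

```go
package main

import (
	"fmt"
	"math/bits"
)

// shardInfo is a hypothetical per-(volume, disk) heartbeat entry.
type shardInfo struct {
	encodeTsNs int64  // encode generation; 0 for pre-upgrade servers
	shardBits  uint32 // bit i set means shard i is present on this disk
}

// unionCount ORs the shard bits across every disk, ignoring the generation.
func unionCount(infos []*shardInfo) int {
	var all uint32
	for _, info := range infos {
		if info != nil {
			all |= info.shardBits
		}
	}
	return bits.OnesCount32(all)
}

// largestGenerationCount groups shard bits by generation and returns the
// shard count of the most complete single generation. Nil entries are skipped.
func largestGenerationCount(infos []*shardInfo) int {
	byGen := make(map[int64]uint32)
	for _, info := range infos {
		if info == nil {
			continue
		}
		byGen[info.encodeTsNs] |= info.shardBits
	}
	best := 0
	for _, b := range byGen {
		if n := bits.OnesCount32(b); n > best {
			best = n
		}
	}
	return best
}

func main() {
	infos := []*shardInfo{
		{encodeTsNs: 100, shardBits: 0x00FF}, // run 1: shards 0-7
		{encodeTsNs: 200, shardBits: 0x3FC0}, // run 2: shards 6-13
		nil,
	}
	fmt.Println(unionCount(infos))             // 14: looks complete for 10+4
	fmt.Println(largestGenerationCount(infos)) // 8: no single run is complete
}
```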
Chris Lu 26754fca4d fix(ec): don't fabricate a stub .vif when mounting an EC volume (#9951)
When an EC volume's .vif was missing, NewEcVolume wrote a stub holding
only the version. That stub implies the default 10+4 ratio with
DatFileSize=0 and no encode identity, which the custom-ratio resolver
and the startup credibility checks then read as an authoritative config
-- masking the real ratio of a custom-ratio volume and defeating the
byte-exact .vif gate. Mount with in-memory defaults instead and leave
the real .vif to the encoder or a recovery tool. The Rust volume server
already behaves this way.
2026-06-13 22:15:13 -07:00
Chris Lu 94357ac6a9 [volume] preserve compression state during replication (#9946)
* preserve compression state during replication

* explain why ParseUpload skips compression for replica writes

* fix data race on err result in FetchAndWriteNeedle

The local-write and replica-write goroutines all wrote the named err return under an unsynchronized err==nil check. Give each goroutine its own error slot and combine after wg.Wait(): local error wins, then the first replica failure.

* skip redundant decompression of compressed needles during replication

doUploadData decompressed a compressed input only to report the clear-data length on UploadResult.Size, which both replication callers discard. Skip the decompress when IsReplication.
2026-06-13 21:52:59 -07:00
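The data-race fix in #9946 follows a common Go pattern: one error slot per goroutine, combined after `wg.Wait()`. The sketch below is a minimal illustration with hypothetical names (`writeAll`, `succeed`, `failWith`), not the code of FetchAndWriteNeedle.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errLocal   = errors.New("local write failed")
	errReplica = errors.New("replica write failed")
)

// writeAll runs the local write and every replica write concurrently. Each
// goroutine writes only its own slot, so no error variable is shared. After
// the join, the local error wins, then the first replica failure.
func writeAll(local func() error, replicas []func() error) error {
	var localErr error
	replicaErrs := make([]error, len(replicas))

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		localErr = local()
	}()
	for i, replica := range replicas {
		wg.Add(1)
		go func(i int, replica func() error) {
			defer wg.Done()
			replicaErrs[i] = replica()
		}(i, replica)
	}
	wg.Wait()

	if localErr != nil {
		return localErr
	}
	for _, err := range replicaErrs {
		if err != nil {
			return err
		}
	}
	return nil
}

func succeed() error { return nil }

func failWith(err error) func() error { return func() error { return err } }

func main() {
	fmt.Println(writeAll(succeed, []func() error{succeed, failWith(errReplica)}))
	fmt.Println(writeAll(failWith(errLocal), []func() error{failWith(errReplica)}))
}
```

Running a variant that writes a single shared `err` from every goroutine under `go run -race` is a quick way to see the race this pattern avoids.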
Chris Lu 240f82d6d2 fix(ec): persist EC source readonly mark and skip writable replicas on orphan cleanup (#9950)
* fix(ec): persist the EC source replica readonly mark

markReplicasReadonly marked each regular replica readonly without
persisting it, so a source-server restart during or after encoding
silently reopened the volume to writes. Those writes are not in the EC
shards, and the later orphan-source cleanup would then delete the
replica, losing them. Send Persist:true so the mark survives a restart;
rollbackReadonly still clears it via VolumeMarkWritable on a failed
encode.

* fix(ec): don't delete a writable source replica during orphan cleanup

cleanupOrphanSourceReplicas issued VolumeDelete to every regular replica
once the EC shard set looked complete, without checking the replica's
current state. A replica that came back writable may hold writes the EC
shards do not contain, so deleting it loses data. Re-probe each replica
via VolumeStatus and skip any that is no longer readonly, logging a
warning instead of deleting.
2026-06-13 21:26:16 -07:00
Chris Lu 1e858d8af0 fix(ec): make ec.decode write-path crash-safe and atomic (#9949)
* fix(ec): check decode .idx writes and fsync decoded .dat/.idx

WriteIdxFileFromEcIndex silently dropped io.Copy and Write errors, so a
short or failed write of the reconstructed .idx went unnoticed and the
caller proceeded to delete the source EC shards. Propagate those errors.

Also fsync the decoded .dat and .idx before returning, so the bytes are
durable before the shards that produced them are removed cluster-wide.
Mirror the .idx fsync into the Rust volume server (its .dat already
syncs and its writes already propagate errors).

* fix(ec): publish decoded .dat/.idx atomically via temp file and rename

WriteDatFile and WriteIdxFileFromEcIndex wrote in place at the final
name with O_TRUNC. A crash mid-write left a truncated .dat/.idx at the
final name beside the still-present EC shards; on restart that partial
file could be mounted as the live volume even though the shards held the
real data. Write to a .tmp file, fsync it, then rename into place and
fsync the directory, so the final name is only ever absent or complete.
A failed decode removes its own temp file rather than leaking it.

Add util.FsyncDir as the shared directory-fsync primitive and reuse the
Rust volume server's fsync_dir for the mirrored change.

* fix(ec): propagate .ecj read errors in the Rust decoder

Path::exists returned false for any error (permission denied, transient
IO), silently skipping the deletion journal and resurrecting deleted
needles as live. Read the journal directly and treat only NotFound as
absent, propagating other errors. The Go decoder already behaves this
way (FileExists returns false only for IsNotExist, then the open
surfaces other errors).
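
The same rule expressed as a short Go sketch: only a not-found error counts as "no journal", and anything else is returned to the caller. readJournal is a hypothetical name, not the decoder's actual function.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// readJournal returns the journal bytes and whether the file exists.
// Only "not found" is treated as absent; every other error is propagated
// so a permission or I/O failure cannot silently skip the journal.
func readJournal(path string) (data []byte, found bool, err error) {
	data, err = os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, false, nil
	}
	if err != nil {
		return nil, false, err
	}
	return data, true, nil
}

func main() {
	_, found, err := readJournal("no-such-dir/1.ecj")
	fmt.Println(found, err)
}
```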

* fix(ec): remove rename destination on Windows in the Rust decoder publish

std::fs::rename does not replace an existing file on every Windows
version. Remove the destination first under a Windows guard before the
atomic publish rename, matching the compaction commit path.
2026-06-13 21:26:07 -07:00
Chris Lu 4fb3e22a01 fix(tiering): never delete a shared remote object while replicas still reference it (#9942)
* tiering: stop a shared remote object being deleted while replicas still point at it

A remote-tiered volume's .dat content lives only in one cloud object that all
N replica .vif files point at. Deleting that object while destroying any one
replica, or before a downloaded replica is durable, bricks the survivors.

- volume.tier.move cleanup now deletes old replicas with keepRemoteData=true so
  surviving replicas keep the shared object. Document why the alreadyPlaced
  anchor needs no replica sync (same-object replicas are byte-identical).
- VolumeTierMoveDatFromRemote now fsyncs the downloaded .dat, fsyncs the
  containing directory, trims the .vif (fsynced) and swaps to the local DiskFile
  BEFORE deleting the remote object, on both the keep-remote and delete paths.
  Only the final DeleteFile is gated by keep_remote_dat_file, so a keep-remote
  download leaves the replica served from local disk rather than the shared
  object, and a crash before delete merely leaks the object.
- volume.tier.download keeps the shared object for every replica except the
  last, which deletes it.
- s3 and rclone download paths fsync the .dat before close.

* storage: swap the volume data backend under the data lock

The tier-download swap closed v.DataBackend and assigned the new local DiskFile
without holding dataFileAccessLock, racing concurrent reads/writes (use of a
closed file / nil deref). Add an exported Volume.SwapDataBackend that performs
the close-and-replace under the lock, and call it from the tier download.
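
A rough sketch of a close-and-replace performed under a lock. The types volume and memBackend and the method swapDataBackend are stand-ins invented for illustration; the real Volume.SwapDataBackend and dataFileAccessLock may differ in shape.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// memBackend is a stand-in backend that records whether it was closed.
type memBackend struct {
	name   string
	closed bool
}

func (b *memBackend) Close() error { b.closed = true; return nil }

// volume is a simplified stand-in; dataLock plays the role of the data lock.
type volume struct {
	dataLock sync.RWMutex
	backend  io.Closer
}

// swapDataBackend closes the current backend and installs next while holding
// the write lock, so readers never observe a closed or nil backend.
func (v *volume) swapDataBackend(next io.Closer) error {
	v.dataLock.Lock()
	defer v.dataLock.Unlock()
	var err error
	if v.backend != nil {
		err = v.backend.Close()
	}
	v.backend = next
	return err
}

func main() {
	remote := &memBackend{name: "remote"}
	v := &volume{backend: remote}
	err := v.swapDataBackend(&memBackend{name: "local"})
	fmt.Println(remote.closed, err)
}
```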

* server: skip directory fsync on Windows in the tier download path

os.Open(dir).Sync() is unsupported on Windows and returns an error, which would
fail VolumeTierMoveDatFromRemote entirely there. Skip the directory fsync on
Windows, matching how the storage-side helper tolerates the unsupported case.

* shell: make multi-replica tier.download resilient to already-local replicas

If a multi-replica download is interrupted and retried, a replica made local
in the prior attempt returns "already on local disk", which aborted the whole
command and left the remaining remote replicas dangling. Treat that case as a
skip-and-continue so a retry completes the rest.

* server: assert downloaded .dat content, not just length, in the tier test

A length-only check passes even if the bytes are corrupted; compare the full
content of the local .dat against the original.
2026-06-13 20:09:00 -07:00
Chris Lu 339a597e7e fix(vacuum): crash-safe compaction commit with a durable .cpc marker, fsync-before-rename, and a reload fence (#9944)
* storage: make vacuum/compaction commit crash-safe with a durable .cpc marker

A crash mid-compaction-commit could lose or corrupt volume data. The
two-rename commit (.cpd->.dat, .cpx->.idx) was not atomic, fsync results
were discarded before renaming over a healthy .dat, a stale .ldb could
poison the needle map, and a duplicate/late commit could delete the live
.dat/.idx outright.

Introduce a durable .cpc commit marker so the swap is atomic across a
crash:

- CommitCompact writes and fsyncs the .cpc marker after makeupDiff
  fsyncs the .cpd/.cpx, then runs applyCompactSwap: an existence-guarded
  rename of .cpd->.dat and .cpx->.idx, a directory fsync, removal of the
  stale .ldb/.rdb, and finally removal of the marker.
- reconcileCompactState recovers an interrupted commit on load: roll
  forward (finish the renames) when the marker is present, roll back
  (delete the orphan .cpd/.cpx) when it is absent. It runs from a
  directory pre-pass keyed on .cpd/.cpc existence, since the per-volume
  loader is keyed on .idx/.vif and misses the marker-only and
  already-renamed-.idx states.
- applyCompactSwap verifies BOTH .cpd and .cpx exist before touching the
  live files, so a stale-state commit (including the Windows
  RemoveAll-then-rename path) errors without deleting anything.
- Error-check the fsyncs that gate the swap: the .cpd close-fsync and
  .cpx fsync in copyDataBasedOnIndexFile, the makeupDiff .idx fsync, and
  MemDb.SaveToIdx.
- generateLevelDbFile rebuilds from offset 0 when the stored watermark
  sits past the end of the .idx, instead of replaying zero entries and
  poisoning the needle map.
- removeVolumeFiles and cleanupCompact sweep the .cpc marker; cleanup
  refuses to unlink the temp files while a marker is present.

Mirror the commit-marker, fsync-before-rename, guard, and
load/reconcile logic in the Rust volume server.

* storage: don't reconcile an already-loaded volume's compaction state on reload

reconcileCompactStates runs in loadExistingVolumes, which is re-invoked at
runtime on SIGHUP (Store.LoadNewVolumes). For a volume that is already loaded
and mid-vacuum, its .cpd/.cpx are live temp files, not crash leftovers --
rolling them back would clobber the in-flight compaction (and remove a live
.ldb out from under an open handle). Skip any vid already present in the
volume map; genuine startup recovery runs before any volume is loaded, so the
map is empty then. Mirrored in the Rust volume server.

Also drop the .note keepVif change that crept into this branch; it belongs to
the replica-copy/verify workstream and is restored to master's behavior here
so the two changes don't collide.

* storage: roll a compaction commit forward per-file, not all-or-nothing

A crash after the .cpd->.dat rename but before .cpx->.idx leaves .cpd gone,
.cpx and .cpc present, and a stale .idx. The roll-forward required BOTH temp
files, so it skipped the swap and cleared the marker, pairing the fresh .dat
with the stale .idx (index corruption). Finish whichever temp file remains:
extract finishCompactSwap to rename .cpd->.dat and/or .cpx->.idx independently;
applyCompactSwap keeps the both-present guard for the normal commit. Existence
in the Rust mirror is checked robustly so a transient error never skips the swap.

* seaweed-volume: propagate directory fsync failures on the compaction commit path

fsync_dir dropped every sync_all error, so the commit could proceed with an
undurable marker or rename and a later restart could recover the wrong
generation. Return the error and check it at the commit call sites (marker write
and the swap), matching the Go fsyncDir which already propagates. Directory
fsync stays a no-op on Windows, where it is unsupported.

* storage: overflow-safe stale-watermark check when rebuilding the leveldb index

watermark*NeedleMapEntrySize can overflow uint64 for a corrupted watermark and
wrap below the file size, defeating the stale-.ldb guard. Compare in entries
(watermark > size/NeedleMapEntrySize) instead, which is equivalent and cannot
overflow. LevelDb-backed needle map is Go-only; no Rust mirror.
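
The difference between the two comparisons, sketched in Go. The entry size of 16 bytes and the function names are assumptions made for the example, not values taken from the repository.

```go
package main

import "fmt"

// entrySize is an assumed needle map entry size, used only for illustration.
const entrySize uint64 = 16

// staleNaive multiplies first, so a huge watermark can wrap around uint64.
func staleNaive(watermark, fileSize uint64) bool {
	return watermark*entrySize > fileSize
}

// staleSafe compares in entries; the division cannot overflow.
func staleSafe(watermark, fileSize uint64) bool {
	return watermark > fileSize/entrySize
}

func main() {
	corrupted := uint64(1)<<60 + 1 // corrupted*16 wraps to 16
	fmt.Println(staleNaive(corrupted, 1600), staleSafe(corrupted, 1600))
	// prints "false true"
}
```

For integer watermarks the two forms agree whenever the product does not overflow, since w*E > s holds exactly when w > floor(s/E).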

* storage: propagate idxFile.Close error when writing the compacted index

SaveToIdx writes the .cpx that is renamed to .idx at commit; a discarded Close
error (buffered data not flushed) could leave a partially-written index after a
crash. Surface it in the same durability gate as the fsync.
2026-06-13 20:06:24 -07:00
Chris Lu c2591b4395 fix(replication): verify-before-destroy in VolumeCopy, check.disk, and over-replication trim (#9943)
* volume: verify before destroy in VolumeCopy and replication repair

Four data-safety fixes around copy/repair paths that could destroy or
resurrect data before verifying the source or survivors.

(a) VolumeCopy no longer deletes a pre-existing local replica up front.
The delete is deferred until ReadVolumeFileStatus on the source succeeds,
so a transient source outage (or a retry after one) can no longer wipe a
healthy destination replica. Gated on source readability only; size/count
comparisons are intentionally not used because they invert legitimately
after divergent vacuum/compaction. Mirrored in the Rust volume server.

(b) volume.check.disk no longer resurrects vacuumed-deleted needles. A
key present-and-live on the source but entirely absent on the target is
ambiguous: it may be a genuine missing write, or a needle deleted on the
target and then vacuumed (its index entry and any tombstone are gone). An
individual needle's AppendAtNs has no monotonic relation to a vacuum
watermark, so the old cutoff heuristic could not tell them apart. Without
positive proof that the absence is a missing write, the safe default is to
NOT push it back. Tradeoff: a real missing write may go unrepaired until a
tombstone-aware path exists, but deleted data is never resurrected.

(c) Over-replication trim no longer resurrects needles or removes the
wrong replica. The pre-delete sync now runs read-only (divergence check
only) instead of writing the doomed replica's needles into the survivor.
pickOneReplicaToDelete only ever removes the smallest of multiple healthy
writable replicas; it refuses the trim when doing so would leave only
read-only/integrity-flagged survivors, since file_count>0 alone cannot
prove the survivor's .dat is readable.

(d) Incomplete-volume (.note) cleanup keeps the shared .vif when an .ecx
for the same vid coexists on the disk, so removing an interrupted regular
copy cannot strip a coexisting EC volume's info file. VolumeCopy now
surfaces .note write/remove errors instead of ignoring them. In the Rust
volume server (where a persisting note is actually reachable) the .note
check moves below the empty-stub sweep and EC validation, keeps the .vif
on EC coexistence, and the mount path fails when a .note still persists.

* shell: scope the over-replication writable-survivor guard to the trim path only

The writable-survivor guard (never trim down to a read-only survivor) lived
inside the shared pickOneReplicaToDelete, so it also gated the misplaced-volume
relocation via pickOneMisplacedVolume -- a misplaced read-only volume (e.g. a
full one) would silently stop being rebalanced. Extract pickSmallestReplica
for the relocation path (which deletes-and-recreates and must act on read-only
replicas), and keep the writable-survivor guard only in pickOneReplicaToDelete
used by the over-replication trim.

* seaweed-volume: recompute keep_vif after invalid-EC cleanup in the .note path

keep_vif used the pre-validation ecx_exists snapshot, so when the EC-validation
step above removed the invalid .ecx/shards, the .note cleanup still preserved a
now-orphaned .vif. Re-check .ecx existence at cleanup time, matching the Go
hasEcxFile re-check.

* shell: keep placement when picking an over-replication victim to delete

The trim picked the smallest writable replica without regard to placement, so
it could delete the only replica in a required failure domain (e.g. with "100"
and replicas dc1 + two in dc2, deleting dc1 leaves both survivors in dc2).
Prefer a writable replica whose removal still satisfies placement, falling back
to the smallest writable only when none does.
2026-06-13 20:05:33 -07:00
Chris Lu aabd44fbb5 [volume] preserve volume data mtime across tier moves (#9947)
* fix(tier): preserve volume data modification time

* fix(tier): best-effort restore of data mtime on download

A failed Chtimes should not abort an otherwise complete tier-down; warn
and continue, matching the EC copy path.

* fix(tier): preserve volume data mtime in rust volume server

Mirror the Go fix: store the source .dat mtime on upload instead of the
upload time, and restore it on the downloaded .dat. Without this a
tiered-then-restored volume loads last_modified_ts_seconds from the
upload/download time, extending its TTL across a restart or remount.

* fix(tier): read source mtime via DiskFile.GetStat()

GetStat() is nil-safe when the backend is closed concurrently and skips a
redundant stat syscall; its cached modTime is the on-disk mtime a reload
reads, since every .dat write or Chtimes is followed by a DiskFile (re)open.

* fix(tier): surface mtime-restore failures on rust tier-down

set_file_mtime now returns io::Result; the tier-down path warns on a
failed restore instead of dropping it silently, so a wrong local .dat
mtime (and the TTL drift it causes) is observable. Matches the Go
download. The EC copy path keeps its best-effort silence.
2026-06-13 15:11:39 -07:00
Chris Lu f724828bcb fix(ec): never delete recoverable EC shards on startup/reconcile (the non-empty-.dat sibling of the stub bug) (#9941)
* fix(ec): never delete recoverable shards on startup/reconcile (size-direction + byte-exact .dat)

EC startup validation and the cross-disk reconcile could delete the only
copy of distributed-EC shards whenever a non-empty .dat sat beside them.
This is the same data-loss class as the empty-.dat-stub fix, now for a
real (non-empty) stale or partial .dat.

validateEcVolume: the discriminating signal is the shard size relative to
the .dat's full encode, not the shard count.
  - shards smaller than expected: an interrupted local encode left partial
    shards and the .dat is the complete source -> reclaim the .dat.
  - shards equal to expected: a valid (or still-distributing) EC volume ->
    keep; the shards may be the only copy.
  - shards larger than expected: the .dat is the stale/partial side (e.g. an
    interrupted decode left a half-written .dat next to the real shards) ->
    keep.
Previously any size mismatch, a low shard count beside a .dat, or a
transient stat error returned "delete", wiping sole-copy shards. Now every
ambiguity (size mismatch in either direction, inconsistent shard sizes,
transient I/O error, partial shard set) keeps the data; only a credible
full source .dat with no partial set to lose is reclaimed.

handleFoundEcxFile: a shard load failure (corrupt/locked .ecx, EMFILE
during a mass restart, transient I/O) no longer deletes the EC files when a
.dat exists -- it only unloads and keeps the files for retry. All deletion
authority now flows through validateEcVolume.

pruneIncompleteEcWithSiblingDat: count shards NODE-WIDE (a set split across
sibling disks summing to >= dataShards is independently recoverable and is
left alone), and require the sibling .dat to byte-exactly match the size
.vif recorded at encode time before deleting -- the prior "at least this
big, or bigger than a superblock" gate could trust a stale .dat and wipe
sole-copy shards. EC encode records the source size in .vif, so this gate
works for real volumes; older volumes without it fail safe (kept).

Rust volume server mirrors all of the above: size-direction + keep-on-
ambiguity in validate_ec_volume, keep-on-load-failure in
handle_found_ecx_file, and the node-wide + byte-exact gate in the prune.
The Rust validate/prune paths now resolve the data-shard count from the
volume's own .vif instead of hardcoding 10+4, so custom-ratio volumes are
not mis-sized and wrongly deleted on reboot.

Existing tests that encoded the old (unsafe) "delete on low count / size
mismatch" behavior are updated to the safe expectation, and new regression
tests cover the partial-decode-.dat-keeps-shards and transient-error-keeps
cases (Go and Rust); they fail on the pre-fix code.

* fix(ec): record DatFileSize in planted EC .vif for the prune test; trim comments

The multi-disk lifecycle e2e test planted a partial EC leftover with an
empty .vif, so the byte-exact prune gate (which a real encoded volume
satisfies via its recorded source size) kept it instead of cleaning up.
Record DatFileSize + the EC ratio in the planted .vif, matching production.

Also condense the verbose comments added in this change to the repo's
concise style.
2026-06-12 23:51:29 -07:00
Chris Lu 3718301599 shell: stop ec.encode/ec.rebuild from destroying live EC shards (no crash needed) (#9939)
* shell: stop ec.encode/ec.rebuild from destroying live EC shards

Three operator-triggered shell paths could destroy data with no crash:

ec.encode -volumeId on an already-EC volume tore down its shards before
failing. The volume-id path never checked the id was a regular volume:
the collection lookup scans only VolumeInfos (so an EC-only id maps to
""), and volumeLocations succeeds via the EC-location fallback, so
clearPreexistingEcShards ran a full teardown, deleting every shard cluster-wide
before doEcEncode failed. An EC volume has no .dat, so this is its only
copy. Add assertEncodableRegularVolumes: each requested id must be a
regular volume in the topology snapshot; an EC-only or unknown id is
refused before any teardown. A volume present as both a regular .dat and
stale orphan shards (a failed-encode retry) still passes. This closes
the operator-rerun/script-retry path; a worker racing the snapshot is a
fencing problem handled separately.

ec.rebuild dry-run (the default, without -apply) still issued real
VolumeEcShardsDelete RPCs: prepareDataToRecover appended every
would-copy shard to copiedShardIds even though the copy was skipped, and
the cleanup defer deleted that set unconditionally. Now a dry-run copies
nothing and records nothing to delete (a separate would-copy counter
drives the recoverability check so the dry-run still reports its plan),
and the cleanup runs only under -apply.

ec.rebuild could also self-destruct a live shard: localShardsInfo was
overwritten per disk instead of unioned, so a shard the rebuilder holds
on a non-last disk looked remote, got copied onto itself (in-place
O_TRUNC) and then node-wide deleted. Union local shards across all
disks, and never copy/delete a shard whose only listed holder is the
rebuilder itself.

* shell: address ec destructive-guards review comments

- countLocalShards: union shards across all of the rebuilder's disks so
  slot accounting matches what prepareDataToRecover treats as local;
  first-match counting overstated slotsNeeded on multi-disk rebuilders
- VolumeEcShardsCopy: resolve SourceDataNode via
  pb.NewServerAddressFromDataNode instead of the raw node id, which may
  not be a dialable host:port
- assertEncodableRegularVolumes: skip nil DiskInfo map entries, matching
  the other topology walks in this file; rename ecOnly to hasEcShards
  since the map marks any volume with shards, not only shard-only ones
2026-06-12 22:30:17 -07:00
Chris Lu 18cdb3819b fix(ec): crash-safe ecx-journal fold and shard rebuild (fsync before publish, no short-read-as-success) (#9938)
* fix(ec): make ecx-journal fold and shard rebuild crash-safe

Two EC rebuild paths could silently lose or corrupt data:

RebuildEcxFile folded the .ecj deletion journal into .ecx (in-place
WriteAt tombstones) and then unlinked the journal without flushing the
.ecx writes first. A crash could persist the unlink ahead of the
tombstones, resurrecting deleted needles on the next load. It also read
journal records with a bare n!=size break, so a torn tail silently
dropped the remaining tombstones before the unlink. Now: read records
with io.ReadFull (io.EOF ends cleanly, a torn tail aborts and leaves
.ecj in place for retry), fsync .ecx before removing the journal.
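
A small Go sketch of the record loop described above, assuming fixed-size journal records. readRecords and readRecordsFromBytes are hypothetical names; the point is only how io.ReadFull separates a clean end from a torn tail.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// readRecords reads fixed-size records until a clean EOF.
// io.ReadFull returns io.EOF only when zero bytes were read, and
// io.ErrUnexpectedEOF for a partial record, which is returned as an error
// so the caller can keep the journal for a retry.
func readRecords(r io.Reader, recordSize int) ([][]byte, error) {
	if recordSize <= 0 {
		return nil, errors.New("record size must be positive")
	}
	var records [][]byte
	for {
		buf := make([]byte, recordSize)
		_, err := io.ReadFull(r, buf)
		if errors.Is(err, io.EOF) {
			return records, nil
		}
		if err != nil {
			return nil, err
		}
		records = append(records, buf)
	}
}

// readRecordsFromBytes is a convenience wrapper for in-memory input.
func readRecordsFromBytes(b []byte, recordSize int) ([][]byte, error) {
	return readRecords(bytes.NewReader(b), recordSize)
}

func main() {
	whole, err := readRecordsFromBytes(make([]byte, 16), 8)
	fmt.Println(len(whole), err)
	_, err = readRecordsFromBytes(make([]byte, 20), 8)
	fmt.Println(err)
}
```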

rebuildEcFiles treated a zero/short ReadAt as a clean end-of-input and
discarded the read error, so a truncated or unreadable input shard
produced truncated regenerated shards that were then published as
restored redundancy; the regenerated shards were also never fsynced on
the no-sidecar path. Now: derive the expected shard size from the
present inputs up front (rejecting a divergent/zero-size input), drive
the loop by that size, fail on any short read or short write, and fsync
every regenerated shard before it is mounted/renamed.

Rust volume server mirrors the rebuild fix: rebuild_ec_files now checks
the read_at byte count (it previously discarded it, the same truncation
bug). The Rust ecx fold already synced .ecx before removing the journal.

Custom EC ratios are unaffected: the shard size derives from the input
shards and the loop uses the .vif-resolved data/parity counts, never a
hardcoded 10+4.

* storage: close ecx journal files via defer in RebuildEcxFile

Per review: a single deferred Close per file replaces the per-error-path
manual closes, so new early returns cannot leak descriptors. The journal
is still closed explicitly before its unlink since Windows cannot delete
an open file; the deferred second Close is a harmless no-op.
2026-06-12 22:28:56 -07:00
Chris Lu 871d7ddc02 [helm]: configure JWT expiration (#9940)
helm: configure JWT expiration
2026-06-12 21:11:30 -07:00
7y-9 5468707289 fix(util): ignore comment only sql input (#9933)
* fix(util): ignore comment only sql input

Problem: sqlutil.SplitStatements strips SQL comments while scanning, but when no statements remain it falls back to returning the original query. Inputs that contain only comments are therefore reported as executable SQL statements.

Root cause: The no-statements fallback did not distinguish a real single statement from input that had been fully removed by comment filtering.

Fix: Remove the original-query fallback and return an explicit empty slice when scanning produces no statements.

Reproduction: env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/util/sqlutil -run TestSplitStatements -count=1 failed before the fix because comment-only inputs returned the comment text as a statement.

Validation: gofmt -w weed/util/sqlutil/splitter.go weed/util/sqlutil/splitter_test.go; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/util/sqlutil -run TestSplitStatements -count=1; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/util/sqlutil -count=1; git diff --check; git diff --cached --check.

Duplicate check: Searched /private/tmp/seaweedfs-codex0610-old-branch-index.tsv and existing tests for sqlutil, SplitStatements, comments, and comment-only. Old PostgreSQL query branches cover malformed wire frames and SQL engine numeric parsing, not comment-only statement splitting.

Co-authored-by: Codex <noreply@openai.com>

* Update weed/util/sqlutil/splitter.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-12 10:10:27 -07:00
Chris Lu 0345658ea8 [s3] validate indirect filer path inputs (#9931)
* s3: validate indirect filer path inputs

* s3: avoid query parsing on common request path

* filer: scope copy/move source against JWT AllowedPrefixes

maybeCheckJwtAuthorization only checked r.URL.Path, but copy and move read
their source from the cp.from / mv.from query params. A prefix-restricted
token could copy or move data out of a subtree it cannot otherwise reach.
Check every path the request touches, reusing pathHasComponentPrefix so
`..` in the source is collapsed before the prefix match.

* s3: confine iceberg CreateTable location to the catalog bucket

CreateTable derived the metadata bucket and path from the client-supplied
req.Location / req.Name and wrote there directly, so a caller scoped to one
table bucket could place metadata in another bucket (and path.Join collapsed
any `..`). Require the parsed bucket to equal the request's catalog bucket
and reject traversal segments in the table path.

* webdav: clean client path before subFolder confinement

wrappedFs concatenated subFolder + name before the underlying FileSystem
ran path.Clean, so `..` in the request path or COPY/MOVE Destination
resolved across the FilerRootPath confinement boundary. Clean the name as a
rooted path first so traversal segments collapse below subFolder. Only the
non-default -filer.path (non-empty subFolder) setup was affected.
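
The ordering matters: cleaning the name as a rooted path before prepending subFolder is what keeps `..` from escaping. A minimal Go sketch, with confine as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"path"
)

// confine cleans name as a rooted path first, so any ".." collapses at "/",
// and only then prepends subFolder.
func confine(subFolder, name string) string {
	cleaned := path.Clean("/" + name)
	return subFolder + cleaned
}

func main() {
	fmt.Println(confine("/data", "../../etc/passwd"))
	// prints "/data/etc/passwd"
}
```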

* filer: enforce read-only rule on real write path with destination header

The x-seaweedfs-destination header overrides the path used for storage-rule
matching while the entry is written at r.URL.Path, letting a caller select a
writable rule for a read-only target. When the header is present, also check
the read-only/quota rule against the actual write path.
2026-06-11 21:56:16 -07:00
Chris Lu 34f9b91d69 fix(storage): never let an empty .dat delete healthy distributed EC shards (#9930)
* fix(storage): never let an empty .dat delete healthy distributed EC shards

A leftover empty .dat stub (a phantom from the pre-fix loader; zero
needles) next to a distributed EC volume's local shards made startup
classify the volume as an interrupted local encode: validateEcVolume
requires >= dataShards local shards when a .dat is present, fails with
the 1-2 shards a distributed volume keeps per disk, and the cleanup
deletes those shards -- the only copies of that part of the volume.
Repeated across restart waves this destroys enough shards cluster-wide
to make the volume unrecoverable.

Go:
- loadExistingVolume: hoist the empty-stub sweep above the EC presence
  checks. Previously the .vif-next-to-.ecx guard returned before the
  sweep ever ran, so exactly the dangerous layout (stub + .ecx + local
  shards) kept its stub and then lost its shards in loadAllEcShards.
- validateEcVolume / checkDatFileExists: treat a .dat <= a superblock
  (zero needles) as absent. An empty .dat cannot be the encode source,
  so it must never gate shard deletion; this also covers stubs without
  a .vif, which the sweep cannot prove are EC leftovers.

Rust mirror (seaweed-volume): the same gate in validate_ec_volume and
check_dat_file_exists (the Rust sweep already ran before validation);
the volume-load skip keeps a plain existence check so fresh,
needle-less volumes still load.

Regression tests in Go and Rust reproduce the production layout (a
zero-byte .dat beside .ecx/.ecj and two shards of a 10+4 volume, with
and without a .vif); without the fix they fail because the shards are deleted.

* fix(ec): gate source volume deletion on a recoverable shard set

After EC encode, the shell command and the (plugin) worker task refused
to delete the source volume unless every shard was present, and aborted
otherwise -- leaving the source .dat next to live shards, exactly the
mixed state the startup cleanup mishandles.

Replace the full-set requirement with a recoverability gate shared by
both callers (RequireRecoverableShardSet): deleting a non-empty source
.dat requires at least dataShards distinct shards cluster-wide. Below
that the source is kept and the encode fails as before. A degraded but
recoverable set (>= dataShards, < total) now proceeds with a warning
instead of aborting: the missing shards can be rebuilt from the
survivors, while keeping the source would preserve the dangerous mixed
state. Empty stub replicas are still swept unguarded (OnlyEmpty) -- an
empty .dat has nothing to lose.

dataShards/totalShards stay parameters so enterprise custom EC ratios
share the helper verbatim.
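
The gate can be pictured as a three-way decision on the number of distinct shards. This is an illustrative sketch only; sourceDeletionDecision and its result names are invented here and do not reflect the signature of RequireRecoverableShardSet.

```go
package main

import "fmt"

type decision int

const (
	keepSource        decision = iota // fewer than dataShards: not recoverable
	deleteWithWarning                 // recoverable but degraded
	deleteSource                      // full shard set present
)

// sourceDeletionDecision counts distinct, in-range shard ids seen
// cluster-wide and compares the count against the EC ratio.
func sourceDeletionDecision(shardIDs []int, dataShards, totalShards int) decision {
	distinct := make(map[int]struct{})
	for _, id := range shardIDs {
		if id >= 0 && id < totalShards {
			distinct[id] = struct{}{}
		}
	}
	switch {
	case len(distinct) < dataShards:
		return keepSource
	case len(distinct) < totalShards:
		return deleteWithWarning
	default:
		return deleteSource
	}
}

func main() {
	// Ten distinct shards of a 10+4 volume, one of them reported twice.
	ids := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9}
	fmt.Println(sourceDeletionDecision(ids, 10, 14) == deleteWithWarning)
}
```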

* test(ec): use recoverable shard verification gate
2026-06-11 20:26:20 -07:00
Chris Lu b44cf51fe9 s3: validate copy source path segments (#9929)
Reject copy sources whose bucket/object fail IsValidBucketName /
IsValidObjectKey, the helpers validateRequestPath already applies to the
request URL. The object is joined onto the bucket path and `.`/`..`
segments are collapsed by the filer, so without this the source need not
stay within the parsed bucket. Route UploadPartCopy through
ValidateCopySource too; it previously only checked for empty bucket/object.
2026-06-11 17:07:15 -07:00
Chris Lu 4f8af455bf feat(storage): sweep leftover empty EC .dat stubs on volume server startup (#9927)
* feat(storage): sweep leftover empty EC .dat stubs on volume server startup

An EC volume keeps no local .dat. The pre-fix loader left empty 8-byte
superblock .dat stubs next to EC metadata (one per lone .vif). Left in
place each loads as a phantom empty volume, and the same vid's stub on
two disks of one server blocks Rust startup via the duplicate-vid check
in Store::add_location -- the prior fix stops creating new stubs but does
not clean up existing ones.

On startup, when a .dat is empty (<= a superblock, i.e. zero needles) and
its .vif marks the volume erasure-coded, remove the stub (+ empty .idx)
instead of loading it. The real data is in the EC shards, so the empty
stub holds nothing to lose. Non-EC empty .dat files (e.g. freshly
allocated volumes) are left alone.

Done in both Rust (load_existing_volumes) and Go (loadExistingVolume),
with regression tests that fail without the sweep.

* refactor(storage): extract empty EC .dat stub sweep into its own function

Move the startup stub-sweep into remove_empty_ec_dat_stub (Rust) and
removeEmptyEcDatStub + vifIsEcVolume (Go) for clearer logic, and look up
the .vif in both the data and idx directories (each read at most once) so
a stub is still found when -dir.idx is configured. Adds direct tests for
the idx-directory lookup on both engines.
2026-06-11 12:26:21 -07:00
Chris Lu 37962e2445 admin: configure maintenance tasks via admin.toml (#9926)
* admin: configure maintenance tasks via admin.toml

Maintenance task settings could only be edited in the admin UI and live
under <dataDir>/conf, so they silently reverted to defaults whenever the
data directory was recreated. An optional admin.toml now declares vacuum,
balance, and erasure coding settings; keys set there are written through
to the persisted task configs at every startup, overriding UI edits, so
the configuration stays declarative. Generate an example with
"weed scaffold -config=admin".

* vacuum: round min volume age up to whole hours

MinVolumeAgeSeconds was truncated by integer division when converted to
the hour-granular protobuf field, so a sub-hour setting silently became
0 and disabled the age guard.
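
Ceiling division is one way to do the rounding. A sketch with a hypothetical helper name, minVolumeAgeHours:

```go
package main

import "fmt"

// minVolumeAgeHours converts seconds to whole hours, rounding up, so a
// sub-hour setting becomes 1 instead of truncating to 0.
func minVolumeAgeHours(seconds int64) int64 {
	if seconds <= 0 {
		return 0
	}
	return (seconds + 3599) / 3600
}

func main() {
	fmt.Println(minVolumeAgeHours(1800), minVolumeAgeHours(3600), minVolumeAgeHours(3601))
	// prints "1 1 2"
}
```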

* admin: split and normalize preferred_tags from admin.toml

A comma-separated string, as set via environment variable, came through
viper as a single slice element. Split on commas and reuse
util.NormalizeTagList, matching the plugin config path.

* scaffold: clarify admin.toml wording
2026-06-11 11:04:52 -07:00
Chris Lu 42030381ae shell: volume.tier.move can move volumes between data centers (#9925)
* shell: volume.tier.move can move volumes between data centers

-fromDataCenter scopes volume selection to volumes with a replica in
that data center. -toDataCenter constrains move destinations and
replication fulfillment. With identical disk types both flags are
required, moving full volumes between data centers on the same tier.

* shell: assert node identity in data center filter test

* shell: tier move resumes when the volume is already on the target

A replica already on the target tier and data center, typically left by
an interrupted earlier run, anchors the move: skip the copy and only
complete replication fulfillment and old replica cleanup. Previously
such volumes hit the no-destination path and the stale source replicas
were never removed.
2026-06-11 10:46:34 -07:00
Chris Lu c3b06bf809 ci: run weed tests on linux/386 (#9924)
386 test binaries execute natively on the amd64 runner, so the suite
catches what vet cannot: unaligned 64-bit atomics and arithmetic that
wraps at runtime. -short keeps the e2e suites on amd64 only.
2026-06-11 09:49:07 -07:00
Chris Lu 3eb550a3f1 fix(tests): 32-bit build of EC e2e tests, type-check linux/386 in CI (#9922)
* fix(tests): keep EC e2e fid cookie arithmetic in uint32

The cookie constants 0x9490CA00 and 0x9500CA00 were added to the int
loop variable before conversion, overflowing 32-bit int at compile
time on linux/386 and linux/arm. Convert the loop variable instead so
the addition stays in uint32.
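
A sketch of the fix pattern, using one of the cookie constants from the message above; cookieBase and cookieFor are hypothetical names.

```go
package main

import "fmt"

// cookieBase is an untyped constant larger than the maximum 32-bit int.
const cookieBase = 0x9490CA00

// cookieFor converts the loop variable first, so the constant is typed as
// uint32. Writing uint32(cookieBase + i) would instead type the constant as
// int and fail to compile where int is 32 bits wide.
func cookieFor(i int) uint32 {
	return cookieBase + uint32(i)
}

func main() {
	fmt.Printf("%#x\n", cookieFor(1))
}
```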

* fix(tests): pass s3client max backoff in milliseconds

MaxBackoffDelay is documented as milliseconds and multiplied by 1e6
before use, but the example set it to 5s in nanoseconds, yielding an
absurd backoff on 64-bit and a compile-time int overflow on 32-bit.

* ci: type-check code and tests for linux/386

64-bit-only constant arithmetic keeps slipping into test files and
breaking 32-bit downstream builds. Vet the whole root module under
GOOS=linux GOARCH=386 so these fail in CI instead of after release.

* fix(tests): convert s3client backoff to Duration before scaling

The ms-to-ns multiplication ran in int, wrapping at runtime on 32-bit;
scale by time.Millisecond after the Duration conversion instead.
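
As an illustration only, the conversion order can be sketched like this. backoffFromMillis is a hypothetical name, not the s3client API:

```go
package main

import (
	"fmt"
	"time"
)

// backoffFromMillis is a hypothetical helper. time.Duration is an int64,
// so converting before scaling keeps the multiplication in 64 bits even
// on targets where int is 32 bits wide.
func backoffFromMillis(ms int) time.Duration {
	return time.Duration(ms) * time.Millisecond
}

func main() {
	// By contrast, time.Duration(ms * 1e6) multiplies in int first and
	// wraps at runtime on linux/386 for ms = 5000.
	fmt.Println(backoffFromMillis(5000))
}
```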
2026-06-11 09:05:54 -07:00
Chris Lu 582b7268f5 s3: export per-bucket quota and read-only state metrics (#9923)
The quota enforcement loop already computes each bucket's configured
quota and effective read-only flag every minute, but neither was
visible to monitoring, so operators could not alert before a bucket
flips read-only.

Add two gauges next to the existing bucket size metrics:

  SeaweedFS_s3_bucket_quota_bytes  configured quota; the series is only
                                   present while the quota is enabled,
                                   so size/quota utilization queries
                                   never divide by zero
  SeaweedFS_s3_bucket_read_only    1 when the bucket's location rule is
                                   read-only (over quota or manually
                                   locked), 0 otherwise

Both are cleaned up with the other per-bucket gauges on bucket
deletion and inactivity TTL.
2026-06-11 09:03:00 -07:00
Chris Lu 55010be19b 4.33 4.33 2026-06-11 00:52:31 -07:00
Chris Lu 79ac279fe1 fix(ec): don't mix EC shards from different encode runs (#9880)
* feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC

EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns
(encode time in unix nanos). It rides in .vif and the read request so a
read can be scoped to the encode run that produced the index.

* fix(ec): stamp each encode and reject cross-run shard reads

Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the
shard's owning volume (resolved together via FindEcVolumeWithShard, so a
multi-disk server validates the disk that actually serves the bytes) and
reject a shard from a different encode run, recovering from parity. A
zero on either side (pre-upgrade volume) skips the guard.

* fix(ec): stamp the encode identity on the worker-generated .vif

The worker-local encode path now writes EncodeTsNs (and the resolved EC
ratio) into the .vif, so the read guard is not silently off for volumes
encoded by the maintenance worker.

* fix(ec): wipe stale EC artifacts before re-encoding

VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and
removes its on-disk shard/index/sidecar files before writing fresh ones,
so a retried encode never builds on a partial prior run and the unlink
frees the inodes instead of leaving open fds serving old bytes.

* fix(ec): unmount EC shards across all disks

UnmountEcShards walked only the first disk holding the shard, leaving a
duplicate copy mounted on a sibling disk (split-disk reconciled volumes)
still serving and heartbeating. Traverse every disk and emit one
deletion delta per disk.

* fix(ec): delete orphan shards without a local .ecx

deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx,
so it could not clean an orphan .ecNN left by a failed copy on a disk
with no index. Delete the requested shard files unconditionally; the
index-file (.ecx/.ecj/.vif) routing stays gated as before.

* fix(ec): clear stale EC shards cluster-wide before re-encoding

ec.encode unmounts and deletes EC shards for the target volumes on every
node before regenerating: fatal for the shards the topology reports
(mounted leftovers), best-effort for the rest (a sweep that catches
unmounted failed-copy orphans). A down node is a no-op.

* fix(ec): don't nil EC fds on close so reads can't race eviction

A reader resolves an EcVolume/shard under the lock then reads after it is
released, so an eviction that nils ecxFile/ecdFile would race that read
and panic. Close the fds without nilling the fields: the field is now
write-once (no data race) and a concurrent read hits a closed fd, getting
a clean error that the caller recovers from parity.

* fix(ec): wipe stale EC artifacts on every disk and surface failures

The pre-encode wipe only deleted beside the source volume, so a stale
shard on a sibling disk survived and could be mounted against the new
index at reconcile. Sweep every disk. Removal also ignored os.Remove
errors, reporting a failed cleanup as success and letting a stale shard
join the next generation; surface the first real failure (treating
already-gone as success) from removeStaleEcArtifacts and the shard delete.

* fix(ec): log when a local shard is skipped for a different encode run

The cross-run guard returned errShardNotLocal, indistinguishable in logs
from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so
operators can tell "wrong encode generation" from "shard not here".

* fix(ec): surface metadata removal failures in the shard delete path

deleteEcShardIdsForEachLocation still dropped os.Remove errors on the
.ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index
condition this path prevents, so route those through removeFileIfExists and
return the first real failure instead of reporting cleanup as success.
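
A rough sketch of the "already gone is success, first real failure wins" pattern. The name removeFileIfExists is taken from the message above, but the body and the removeAll wrapper are assumptions, not the project's code:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// removeFileIfExists deletes path and treats a missing file as success.
// The body is an assumed implementation for illustration.
func removeFileIfExists(path string) error {
	err := os.Remove(path)
	if err == nil || errors.Is(err, fs.ErrNotExist) {
		return nil
	}
	return err
}

// removeAll is a hypothetical wrapper: it attempts every path and
// returns the first real failure rather than reporting success.
func removeAll(paths []string) error {
	var first error
	for _, p := range paths {
		if err := removeFileIfExists(p); err != nil && first == nil {
			first = err
		}
	}
	return first
}

func main() {
	dir, err := os.MkdirTemp("", "ec-cleanup")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	present := filepath.Join(dir, "1.ecx")
	if err := os.WriteFile(present, nil, 0o600); err != nil {
		panic(err)
	}
	// One file exists and one does not; neither case is an error.
	fmt.Println(removeAll([]string{present, filepath.Join(dir, "1.ecj")}))
}
```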

* fix(ec): fail orphan cleanup when a reachable node's delete fails

The pre-encode orphan sweep swallowed every error for unreported (node,
volume) pairs. That is only safe for an unreachable node, which cannot
receive this encode's new generation. A reachable node whose delete
genuinely failed (permission/IO) keeps an orphan shard that a later copy
re-stamps with the new run's volume-level .vif identity, so the read guard
would accept stale data. Surface those; stay best-effort only for
unreachable nodes (gRPC Unavailable / no status).

* fix(ec): guard ecjFile under its lock in the EC delete path

EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved
its .ecx lookup before a concurrent eviction (the generate-time
UnloadEcVolume) could then reach the journal append with a nil fd. Bail
with a clear "volume closed" error under the lock instead.

* fix(ec): reject an unstamped shard when the caller has an encode identity

The read guard required both identities nonzero, so a current (stamped)
caller accepted a holder with identity 0 and could be served a stale
pre-upgrade shard. Reject when the caller is stamped and the holder
differs (including unstamped); stay lenient only when the caller itself
has no identity (pre-upgrade reader). A skipped shard recovers from parity.
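
The rule can be written as a small predicate. This is a sketch under assumptions: acceptShard, callerTs, and holderTs are hypothetical names standing in for the EncodeTsNs values on each side:

```go
package main

import "fmt"

// acceptShard is a hypothetical predicate for the guard described above.
// A value of zero means "unstamped" (a pre-upgrade volume or reader).
func acceptShard(callerTs, holderTs int64) bool {
	if callerTs == 0 {
		// A pre-upgrade reader has no identity to compare: stay lenient.
		return true
	}
	// A stamped caller accepts only an identical stamp. An unstamped
	// holder (0) differs from any nonzero stamp and is rejected too.
	return callerTs == holderTs
}

func main() {
	fmt.Println(acceptShard(0, 42))  // pre-upgrade reader
	fmt.Println(acceptShard(42, 42)) // same encode run
	fmt.Println(acceptShard(42, 0))  // unstamped holder
	fmt.Println(acceptShard(42, 43)) // different encode run
}
```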

* fix(ec): full-teardown delete so cluster cleanup wipes a whole generation

The pre-encode cluster sweep deleted only the listed canonical shards on
remote nodes, leaving index/sidecar (and, on builds with versioned
generations, those too) behind. Add a full_teardown flag to
VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for
it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode
cleanup paths set it. Other delete callers (balance/decode/repair) are
unchanged.

* fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close

Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while
Close nils it under the lock, a data race on the field. Take the lock
first, then nil-check inside, in both.
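
A minimal sketch of the lock-then-check ordering, using a hypothetical journal type in place of EcVolume. Returning an error from Sync on a closed journal is an assumption made for the example:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"sync"
)

// journal is a hypothetical stand-in for the EcVolume fields named above.
type journal struct {
	mu   sync.Mutex // plays the role of ecjFileAccessLock
	file *os.File   // plays the role of ecjFile
}

var errClosed = errors.New("volume closed")

// Sync takes the lock before reading the field, so it cannot observe
// a half-finished Close.
func (j *journal) Sync() error {
	j.mu.Lock()
	defer j.mu.Unlock()
	if j.file == nil {
		return errClosed
	}
	return j.file.Sync()
}

// Close nil-checks and nils the field inside the same critical section.
func (j *journal) Close() error {
	j.mu.Lock()
	defer j.mu.Unlock()
	if j.file == nil {
		return nil
	}
	err := j.file.Close()
	j.file = nil
	return err
}

func main() {
	f, err := os.CreateTemp("", "ecj")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	j := &journal{file: f}
	fmt.Println(j.Sync())  // open: syncs the file
	fmt.Println(j.Close()) // closes and nils under the lock
	fmt.Println(j.Sync())  // closed: reports errClosed
}
```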

* fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success

An old volume server silently ignores full_teardown and returns success
for an ordinary delete, so the caller wrongly believes the generation was
wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done
in the response; the worker destination cleanup fails when it is absent, and
the shell cluster sweep fails for a reported (mounted) leftover while staying
best-effort for an unreported node. encode_ts_ns stays an accepted transient
(an old server just skips the new read guard, no regression).

* fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown

A reachable pre-upgrade server ignores full_teardown and returns success
without wiping an orphan, which a later copy then folds into the new
generation. Treat a missing full_teardown_done ack as fatal for every
reachable node (best-effort only for a gRPC-unreachable one), not just for
topology-reported pairs.

* fix(ec): return the served shard identity and validate it client-side

The encode identity was only enforced server-side, so a pre-upgrade server
ignored the request field and served bytes unchecked. Echo the served
shard's EncodeTsNs on every read response chunk and have the client reject a
mismatch (including 0 from an old server), so the guard holds regardless of
server version; a rejected read recovers from parity.

* fix(ec): reject a short/empty remote shard read instead of serving zeros

doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and
returned success with a partly zero-filled, unvalidated buffer (the server
stamps the identity only on chunks that carry bytes). A non-deleted interval
must arrive whole: require n == len(buf), exempting the is_deleted
short-circuit (n=0), matching readLocalEcShardInterval's local check. A short
read now fails so the caller recovers from parity.
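
The length check can be sketched as follows. validateIntervalRead and errShortRead are hypothetical names, and the signature is simplified to the byte count, the destination buffer, and the deleted flag:

```go
package main

import (
	"errors"
	"fmt"
)

// errShortRead is a hypothetical sentinel for a truncated shard read.
var errShortRead = errors.New("short EC shard read")

// validateIntervalRead is a hypothetical check: a non-deleted interval
// must fill the whole buffer, while a deleted one may arrive with n == 0.
func validateIntervalRead(n int, buf []byte, isDeleted bool) error {
	if isDeleted && n == 0 {
		return nil
	}
	if n != len(buf) {
		return fmt.Errorf("%w: got %d of %d bytes", errShortRead, n, len(buf))
	}
	return nil
}

func main() {
	buf := make([]byte, 8)
	fmt.Println(validateIntervalRead(8, buf, false)) // whole interval
	fmt.Println(validateIntervalRead(3, buf, false)) // truncated stream
	fmt.Println(validateIntervalRead(0, buf, true))  // deleted needle
}
```

Failing here, rather than returning the partly zero-filled buffer, is what lets the caller fall back to parity recovery.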

* test(ec): fake volume server echoes the full_teardown acknowledgement

The worker now fails a teardown delete that isn't acknowledged (so a
pre-upgrade server can't silently skip the wipe). The fake server's no-op
VolumeEcShardsDelete returned an empty response, which the worker read as a
skipped teardown and aborted the encode. Echo full_teardown_done.

* feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server

The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif
and rejects a read served from a shard of a different run; full_teardown wipes a
whole generation and acknowledges it. The Rust volume server had none of it.
Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume,
stamp it on every read response, and reject a request/response mismatch on both
the server and the distributed-read client (recovering from parity); handle
full_teardown by evicting the volume and wiping every EC artifact on each disk,
echoing full_teardown_done so the caller can detect a server that ignored it.

* fix(ec): remove a stale .vif on full teardown of a shard-only node

A shard copy installs shards + .ecx before .vif, so an interrupted copy after a
teardown could mount the new files under the previous run's identity / version /
shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during
full teardown, gated on .idx absence so a source-volume holder keeps its live
.vif. In Rust this lives in a teardown-only helper so the reconcile / load-
fallback paths (which share the base removal) still preserve .vif.

* fix(ec): treat a missing teardown ack as fatal, not as an unreachable node

isNodeUnreachable returned true for any non-gRPC-status error, so a reachable
pre-upgrade server's missing full_teardown_done ack (a plain error) was
classified unreachable and the unreported pair was silently skipped. Classify
only a real codes.Unavailable as unreachable, and wrap the missing ack in a
sentinel the sweep treats as fatal regardless. A genuinely down node still
surfaces as Unavailable from the RPC and stays best-effort.

* fix(ec): reject a short shard read in the local EC needle reader

read_ec_shard_needle ignored the byte count from shard.read_at and appended the
whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later
length check and parsed as garbage. Require n == buf.len() per interval, erroring
on a short read like the local interval reader already does.

* fix(ec): probe reachability before skipping a node that returns Unavailable

The pre-encode sweep skipped any node whose teardown delete returned
codes.Unavailable, but a reachable volume server in maintenance mode also
returns that code for the maintenance-gated delete, so its stale EC files were
left behind on a node that can still receive the new generation. Confirm with a
non-maintenance-gated empty-target Ping: skip only when the node fails the probe
too (genuinely unreachable).

* fix(ec): use try_exists for the teardown .vif .idx guard

The teardown-only .vif removal gated on Path::exists(), which returns false on a
permission/IO stat error, so a stat failure on a present .idx would read as a
shard-only node and delete the live source volume's .vif. Gate on
try_exists() == Ok(false) instead, preserving the sidecar on any stat error.

* fix(ec): only skip a sweep node when a Ping confirms it is transport-down

The pre-encode sweep skipped a node whenever its teardown delete and a liveness
Ping both failed, but it treated ANY Ping error as down — an application-level
Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a
reachable node's stale generation in place. Classify the Ping tri-state and skip
only when it transport-fails with codes.Unavailable; a reachable or inconclusive
node stays fatal.

* fix(ec): exclude sweep-skipped nodes from the encode's rebalance

The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance
then recollected the current topology — a node that recovered between the two
could become a copy target and receive the new generation while still holding
its stale, never-cleared shards. Have the sweep return the skipped set and
exclude those nodes from the rebalance for this encode, so a node we could not
clean cannot receive the new generation. Standalone ec.balance is unaffected.

* fix(ec): re-sweep recovered nodes before generation so they aren't stranded

A node skipped as down by the pre-encode sweep is excluded from the rebalance,
but it can recover and become the generation host — mounting all shards locally,
then being excluded from distribution. Union-only verification accepts all
shards on one node and deletes the originals: a single point of failure. Re-sweep
the skipped nodes just before generation; one whose teardown now succeeds leaves
the skipped set and rebalances normally, while a node still down stays skipped.

* fix(ec): abort the encode if a selected source is still skipped after re-sweep

The re-sweep un-skips a recovered node, but the source was selected before it and
a node can stay down through the re-sweep then recover just in time to be the
generation host — mounting all shards locally while still excluded from the
rebalance, which union-only verification accepts before deleting the originals.
Abort the encode when a selected source remains skipped after the re-sweep.

* fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch

If a volume is not EC at the batch-delete classification but is encoded to EC and
its .dat deleted before the regular-volume mutation, the mutation returns an exact
"not found" that the filer chunk-GC treats as completed, dropping the delete.
Recheck EC presence under the mutation lock and return a retriable 503 with the
"try again" token so the filer requeues it onto the EC path.

* fix(ec): recheck EC state before the regular batch-delete mutation

ec.encode mounts EC shards (copied from the .dat) before deleting the originals,
so a volume can be EC while its .dat still exists. The batch delete only rechecked
EC after a NotFound, so a successful regular-volume delete in that window wrote a
tombstone to the soon-removed .dat — the delete was lost and the needle resurrected
from the pre-tombstone shards. Recheck has_ec_volume under the write lock before
delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path.

* fix(volume): make the metrics push test independent of test order

test_push_metrics_once asserted the pushed body contains the request-counter
family without ever touching the counter — a CounterVec with no children emits
nothing, so the assertion only held when another test had already created a
labelset in the shared registry. Create one in the test itself.
2026-06-10 22:31:18 -07:00
Bruce Zou 1dd292fb84 batch drain delta heartbeat messages (#9914) 2026-06-10 13:33:45 -07:00
Lisandro Pin 6b4d20a6f3 volume.scrub and ec.scrub shell commands: make the display of scrub details optional. (#9911)
When volumes fail scrubs, the detail output can get very verbose, which makes
the results hard to read. Most users do not need this information; they only
care whether volumes pass scrub tests.

This MR gates the display of scrub result details behind a `--details` flag.
2026-06-10 13:29:07 -07:00
Chris Lu caadd6ca79 ci(s3tables): stop Lakekeeper flaking on Docker Hub pull timeouts (#9920)
* ci(s3tables): drop docker pre-pull from Lakekeeper job

The lakekeeper repro is pure Go against the local weed binary; the job
kept failing on Docker Hub timeouts pulling python:3 and localstack
images the test never runs. Also drop the stale python-in-docker
comments left from the old harness.

* ci(s3tables): serve python:3 from GHA cache in the STS job

Retried pulls still die when both mirror.gcr.io and registry-1.docker.io
are unreachable from the runner. Cache the saved image tarball under a
weekly key: an exact hit skips the registry entirely, a miss pulls fresh
and refreshes the cache, and a stale tarball from a previous week is the
fallback when Docker Hub is down.

* ci(spark): pre-pull the spark tag the test actually runs

The workflow warmed apache/spark:3.5.8 with retries while the
testcontainers setup runs apache/spark:3.5.1, so the real image was
pulled at test time with no retry at all.
2026-06-10 13:26:30 -07:00