seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-06-09 18:32:43 +00:00

Author	SHA1	Message	Date
Chris Lu	d47cc45b1f	admin: fold dashboard sparklines into the existing cards (de-dup) (#9964 ) admin: fold dashboard sparklines into the existing cards The trend sparklines added in #9957 lived in a separate "Cluster Trends" row that duplicated the existing summary cards (Volumes, Files, Disk Used, EC Shards). Remove that row and instead render each sparkline inside the matching summary card, so every headline number shows its recent trend without duplication. The two maintenance metrics that have no existing card — Active Tasks and Workers — now fill the previously-empty columns of the EC row (also with sparklines). DashboardTrends changes from a Cards slice to named per-card sparkline SVGs (+ current values for the two maintenance cards). Drops the now-unused trendBytes helper (disk size keeps using the existing formatBytes).	2026-06-14 14:17:43 -07:00
Chris Lu	b13463880c	s3tables: scope management authorization to the caller's identity (#9961 ) * s3tables: resolve account-less identities to a distinct principal Static identities with no account block default to the shared admin account, so getAccountID returned "admin" for every such user and the permission checks treated them all as the admin principal. Only keep the admin account when the identity actually carries an admin action; otherwise fall back to the unique identity name. * s3tables: limit the open-by-default fallback to anonymous access The legacy permission path allowed any request that no policy explicitly denied whenever default-allow was on, which is the zero-config default. That let an authenticated identity without table permissions reach table resources owned by others. Restrict the fallback to requests with no identity or the anonymous identity; authenticated callers must pass an explicit action or policy check. Zero-config and anonymous access are unchanged. * s3tables: drop the no-op ListTableBuckets account gate The top-level check passed the principal as its own owner, so it always allowed. Per-bucket filtering in the loop is the real authority; remove the dead gate and the now-unused locals. * s3tables: derive the Iceberg catalog's default-allow from auth state The Iceberg catalog reuses the S3 Tables Manager, which hardcoded default-allow on. Authenticated callers were enforced only because the identity struct happens to propagate into the handler; if it were ever dropped, a secured catalog would fall open. Mirror the S3 port and set the Manager's default-allow from the authenticator, so an authenticated caller is enforced regardless. Shell and admin keep their own trusted Manager. Regression test covers the struct, name-only, and admin paths. * s3tables: drop redundant ACTION_ADMIN string conversion ACTION_ADMIN is an untyped string constant, so the conversion is a no-op. 
* s3tables: enforce name-only authenticated callers, add trusted bypass defaultAllowFor treated a request with no identity object as anonymous, but the Manager path forwards only the identity name (not the struct). A name-only authenticated caller could therefore be misclassified as anonymous and allowed under the open default. Treat a server-set identity name as authenticated too, and add an explicit trusted flag for the local shell/admin tooling that legitimately bypasses authorization. * s3tables: trim verbose comments	2026-06-14 13:55:36 -07:00
Chris Lu	b56d155b31	admin: native at-a-glance trend sparklines on the dashboard (#9957 ) * admin: native at-a-glance trend sparklines on the dashboard Add a "Cluster Trends" row to the admin Dashboard with inline-SVG sparklines for volumes, EC shards, disk used, files, active maintenance tasks, and workers. The data comes entirely from what the admin already holds — the cached cluster topology and the in-process maintenance queue — sampled into a small bounded ring buffer on the existing maintenance-metrics ticker (~15 min of history). No Prometheus/Grafana dependency, no JS chart library, no extra goroutine: the sparklines are self-contained SVG rendered server-side via templ. This gives basic trend visibility out of the box for clusters that don't run Prometheus, and a quick glance next to the cluster controls; Grafana remains the place for deep/historical dashboards. * admin: cap trendBytes unit index to avoid out-of-bounds panic A value >= 1 ZiB would push exp past the end of the units string and panic on units[exp]; cap exp at the last unit (EiB).	2026-06-14 13:55:26 -07:00
Chris Lu	c1636ac41c	s3: give STS sessions a distinct owner account instead of admin (#9963 ) * s3: give STS sessions a distinct owner account, not admin STS sessions were built with Account: &AccountAdmin, so every assumed-role session shared the admin account for ownership and ACL checks. Use the assumed-role user as the account id instead, matching the JWT auth path. Session permissions are unchanged: they come from the session policies, and admin is granted only through Actions. * s3: resolve STS session identity to the OIDC subject Use sessionInfo.Subject (falling back to the assumed-role user when absent) for the session identity name and account id, so the SigV4 and JWT auth paths resolve the same session to the same identity instead of diverging on AssumedRoleUser vs Subject. * s3: trim verbose comments	2026-06-14 13:55:11 -07:00
Chris Lu	e64c821139	s3: give account-less identities a distinct owner instead of admin (#9962 ) * s3: stop collapsing account-less identities into the admin account Identities configured without an account block all defaulted to the shared admin account, so distinct users got the same owner id and ownership checks could not tell them apart. checkAccessByOwnership also treated that id as an admin bypass, so any account-less caller passed ownership for any bucket. Give such identities a distinct account id from their name, and decide the ownership admin bypass by Admin capability rather than by the account id. isUserAdmin is now nil-safe. * s3: use the context identity in isUserAdmin before re-authenticating The Auth middleware already verifies and stores the identity in the request context. Read it there first so the ownership/admin checks don't re-run signature verification, which is redundant and fails once the request body has been consumed. * s3: nil-guard the context identity in isUserAdmin A non-nil interface wrapping a typed-nil Identity passes the type assertion; guard against it before calling isAdmin(). * s3: trim verbose comments	2026-06-14 13:54:49 -07:00
Chris Lu	3fd5018bd2	metrics: overhaul Grafana dashboard for full metric coverage (#9956 ) The bundled dashboard (other/metrics/grafana_seaweedfs.json) covered only 18 of the 84 metrics weed/stats exposes and was a legacy Grafana 8 export (graph panels, schemaVersion 30). Rebuild it as a modern dashboard (timeseries panels, schemaVersion 39) with 100% metric coverage, targeting the direct-scrape model used by Prometheus / seaweed-up / Kubernetes. - Full coverage of every weed/stats metric: master, volume server, filer, filer store/sync, s3, s3 buckets, s3 lifecycle, admin/maintenance, build, wdclient, upload errors, plus Go runtime/process per component. - Organized into collapsible rows with an always-on Overview. - Scrape label model: group by `instance`; generic go_/process_ panels use `job=~"seaweedfs-.*"` to separate components; an optional `cluster` template variable (from SeaweedFS_build_info, defaults to All) supports multi-cluster setups and is transparent when no cluster label is present. - Same uid (nh02dOVnz) and title so it upgrades in place; drops the dead "AWS monthly cost" panel. This is also the single source of truth bundled by seaweed-up's `cluster dashboard install`.	2026-06-14 11:48:30 -07:00
Chris Lu	7e608c877a	refactor(ec_balance): make the balance planner per-volume ratio-capable (#9960 ) * refactor(ec_balance): make the balance planner per-volume ratio-capable Thread a per-volume EC ratio through the balance planner: Plan resolves each volume's data/parity from a new Options.VolumeRatio (falling back to the collection Ratio, then the build default, when it reports 0), and keys the global phase's ratio maps by volume instead of collection. The shell and worker balance paths build the per-volume lookup from each shard's heartbeat via the new ecbalancer.VolumeShardRatio. In OSS this is behavior-preserving: VolumeShardRatio returns 0 because the per-volume data_shards/parity_shards heartbeat fields are an enterprise feature, so every volume falls back to the collection ratio -- the existing standard-scheme behavior. The refactor keeps the shared planner in sync with the enterprise fork, which overrides VolumeShardRatio to classify and spread a mixed-ratio collection by each volume's own data/parity split. * perf(ec_balance): hoist the collection ratio out of the per-volume loop The collection ratio is constant for every volume in a collection, so resolve it once per collection instead of per volume; a custom Ratio func may do map lookups or locking. Addresses a review comment.	2026-06-14 11:33:31 -07:00
Chris Lu	138220b961	fix(ec): recover EC shards with the volume's own ratio, not the build default (#9958 ) * fix(ec): recover EC shards with the volume's own ratio, not the build default recoverOneRemoteEcShardInterval rebuilt a missing shard with a hardcoded 10+4 Reed-Solomon matrix (and counted sufficiency / iterated shards against the 10+4 constants). For a custom-ratio volume (e.g. 9+3) that reconstructs with the wrong matrix and corrupts the recovered bytes, and cachedLookupEcShardLocations could wrongly reject a degraded but recoverable custom-ratio read. Use the volume's own ECContext (loaded from its .vif) for the encoder, the shard-iteration bound, and the data-shard sufficiency checks. In OSS the ratio is always 10+4 so this is a no-op; it brings the Go volume server in line with the Rust one, which already reconstructs with the volume's ratio. * fix(ec): close data races in the EC read-recovery path Address review: the freshness check in cachedLookupEcShardLocations read ecVolume.ShardLocations / ShardLocationsRefreshTime without the lock while recover goroutines mutate them via forgetShardId -- snapshot both under ShardLocationsLock.RLock(). The recover goroutines also wrote the shared is_deleted return concurrently -- collect it via an atomic and fold it in after they join. Also size availableShards/missingShards by the volume's ECContext ratio rather than the 10+4 constants.	2026-06-14 07:32:36 -07:00
Chris Lu	c7781bfca2	fix(ec): remove shared EC index only when no shard remains node-wide (#9955 ) * fix(ec): remove the shared EC index only when no shard remains node-wide deleteEcShardIdsForEachLocation removed the shared .ecx/.ecj/.vif index as soon as a single disk's shard count hit 0, even when a sibling disk of the same node still held shards of the volume (split-disk reconciled layout) -- orphaning those shards without their index. Split the non-teardown delete into two passes: delete the requested shard files (and now-orphaned per-disk bitrot sidecars) on every disk, then remove the shared index only once no shard of the volume remains on ANY disk. This brings the Go volume server in line with the Rust one, which already gates the index removal on a node-wide check. * refactor(ec): reuse checkEcVolumeStatus across the two delete passes Address review: cache hasEcxFile/hasIdxFile from the node-wide count pass and pass them to removeEcSharedIndexFiles instead of re-listing each location's directory. * fix(ec): clean an orphaned EC .vif even when its .ecx is already gone Address review: removeEcSharedIndexFiles returned early on !hasEcxFile, so a node-wide teardown left a stale EC .vif behind when its .ecx was already removed. Decouple the .vif removal (gated on !hasIdxFile) from .ecx presence so the generation metadata doesn't leak once no shard remains node-wide.	2026-06-14 06:36:50 -07:00
Chris Lu	ef5fee6c28	fix(storage): delete/unmount every copy of a duplicate volume id (#9954 ) * fix(storage): delete and unmount every copy of a duplicate volume id NewStore has no cross-disk duplicate guard (unlike the Rust volume server, which refuses to start in that state), so a stale twin of a volume id can mount on a second disk after a disk repair. DeleteVolume and UnmountVolume returned after the first matching disk, leaving the twin to survive and re-register as the volume's content. Walk every disk and act on all copies, emitting one heartbeat delta per copy. * fix(storage): surface partial delete/unmount failures across duplicate copies Address review: if removing one copy of a duplicate volume id fails with a real error (disk IO, permissions), the loop logged it and could still return success once another copy was removed -- leaving the stale copy to re-register, the exact divergence this guards against. DeleteVolume and UnmountVolume now accumulate such errors and return them (still attempting every disk), so a copy left behind is never reported as success. Add a DeleteVolume duplicate-copies regression test.	2026-06-14 06:36:47 -07:00
Chris Lu	284796c7b6	fix(ec): fence stale-worker EC shard cleanup by encode generation (#9953 ) * feat(ec): add encode_ts_ns to the EC task params, shard-unmount, and shard-delete RPCs The generation fence for stale EC-worker cleanup needs the encode generation on three messages: ErasureCodingTaskParams (admin issues it), VolumeEcShardsUnmountRequest, and VolumeEcShardsDeleteRequest (the worker carries it to the volume server). Additive fields only; 0 preserves the existing unfenced behavior. Mirror the two volume-server fields in the Rust volume server's proto copy. * feat(ec): issue the EC encode generation from the admin and carry it on the worker Stamp each EC proposal's encode_ts_ns from the admin's per-cycle DetectionSequence (a single-clock value) so generations are globally ordered even though detection runs on a rotating worker. The worker writes that generation into the distributed .vif and passes it on its shard unmount/delete RPCs; it falls back to a local timestamp for the .vif only on the unfenced legacy/shell path (keeping the read guard on). * fix(ec): fence the stale-worker EC shard unmount and teardown by generation A reaped-but-still-running EC worker's cleanupStaleEcShards issued a generation-blind unmount + full teardown that could unmount and then overwrite a newer run's live shards on a shared node. Both RPCs now carry the encode generation: the volume server unmounts/deletes a disk only when its .vif generation is strictly older than the request, and preserves a same-or-newer generation, a generation-0 (recovered or pre-upgrade) volume, and an unreadable .vif. Unload is per-disk, never node-wide. Request generation 0 keeps the blanket teardown for the shell pre-encode cleanup and pre-upgrade callers. Mirrored in the Rust volume server. 
* test(ec): cover the generation-fenced teardown and unmount End-to-end volume-server tests: a fenced FullTeardown wipes a strictly- older generation, preserves a newer one, preserves a generation-0 volume, and blanket-wipes on request generation 0; the gen-aware unmount preserves a same-or-newer mounted generation; and the .vif generation reader handles present/absent/no-config cases. * test(ec): pin the fenced .vif==teardown generation and the unreadable-.vif preserve A fenced run must stamp the admin generation verbatim into the .vif so it matches the generation sent on the teardown RPCs; add a regression test that sets the task generation and asserts the .vif carries it exactly. Also cover the present-but-unparseable .vif case (reads as generation 0, preserved) and correct the readEcGenerationTsNs docstring accordingly. * fix(ec): surface EC full-teardown filesystem errors in the Rust volume server remove_ec_volume_files(_full_teardown) discarded every fs::remove_file error, so a teardown that failed on permissions or a full disk still returned full_teardown_done=true and left stale artifacts to collide with the next encode. Return io::Result, ignore NotFound, propagate the first real error, and have the teardown RPC surface it -- matching the Go contract. The best-effort reconcile/load-cleanup callers keep ignoring it. * refactor(ec): reuse the EC volume lookup on unmount and short-circuit the gen read Address review: the Rust unmount fence reuses the ec_vol it already fetched instead of a second find_ec_volume; the Go .vif generation reader breaks out of the data/idx loop early when the two dirs are the same.	2026-06-14 01:54:04 -07:00
Chris Lu	561768a426	[s3]: preserve multipart copy checksums (#9948 ) * s3: preserve checksums for copied multipart parts * s3: return checksums from multipart copy * s3: pin the upload's checksum algorithm on copy-part re-stream * s3: note why UploadPartCopy uses the re-stream slow path * s3: explain the TLS proxy in the multipart copy checksum test * s3: cover nil and unknown-algorithm edge cases in copy checksum tests * s3: cover all checksum algorithms in the multipart copy test * s3: run all checksum integration tests, not just presigned	2026-06-14 00:16:14 -07:00
Chris Lu	da243b9423	fix(ec): group orphan-source completeness by encode generation (topology encode_ts_ns) (#9952 ) * feat(ec): carry the encode generation through the topology heartbeat Add encode_ts_ns (field 14) to VolumeEcShardInformationMessage and populate it from each EC volume's .vif identity. The volume server emits it on the full and incremental heartbeats; the master stores it on EcVolumeInfo and re-emits it via GetTopologyInfo, so the admin/worker layer can see which encode run produced each shard set. Field 14 avoids the enterprise fork's reserved 10-13. Mirror the proto field and both heartbeat emit sites in the Rust volume server. * fix(ec): group orphan-source shard completeness by encode generation countExistingEcShardsForVolume ORed EcIndexBits across every disk, so two interrupted encode runs whose shard sets overlap unioned into a false-complete set -- triggering the orphaned-source delete while no single generation was actually complete. Group shards by encode_ts_ns and return the largest single generation's count, so the trigger fires only when one run holds the full set. Shards from pre-upgrade servers (encode_ts_ns==0) form their own bucket. The heartbeat carries one encode_ts_ns per (volume, disk), so this separates generations on different disks; same-disk mixing is prevented upstream by the pre-encode artifact wipe and the cross-run read guard. * fix(ec): guard against a nil Ec shard info entry in the generation count Defensive: a manually-constructed or corrupted topology could carry a nil entry in EcShardInfos. Skip it rather than dereference. * fix(ec): carry the encode generation on the EC shard unmount delta The mount delta sets EncodeTsNs; the unmount deletion delta left it 0. Populate it from the Ec volume before unloading so both incremental deltas are consistent (the Rust volume server already does this via its snapshot diff).	2026-06-14 00:14:12 -07:00
Chris Lu	26754fca4d	fix(ec): don't fabricate a stub .vif when mounting an EC volume (#9951 ) When an EC volume's .vif was missing, NewEcVolume wrote a stub holding only the version. That stub implies the default 10+4 ratio with DatFileSize=0 and no encode identity, which the custom-ratio resolver and the startup credibility checks then read as an authoritative config -- masking the real ratio of a custom-ratio volume and defeating the byte-exact .vif gate. Mount with in-memory defaults instead and leave the real .vif to the encoder or a recovery tool. The Rust volume server already behaves this way.	2026-06-13 22:15:13 -07:00
Chris Lu	94357ac6a9	[volume] preserve compression state during replication (#9946 ) * preserve compression state during replication * explain why ParseUpload skips compression for replica writes * fix data race on err result in FetchAndWriteNeedle The local-write and replica-write goroutines all wrote the named err return under an unsynchronized err==nil check. Give each goroutine its own error slot and combine after wg.Wait(): local error wins, then the first replica failure. * skip redundant decompression of compressed needles during replication doUploadData decompressed a compressed input only to report the clear-data length on UploadResult.Size, which both replication callers discard. Skip the decompress when IsReplication.	2026-06-13 21:52:59 -07:00
Chris Lu	240f82d6d2	fix(ec): persist EC source readonly mark and skip writable replicas on orphan cleanup (#9950 ) * fix(ec): persist the EC source replica readonly mark markReplicasReadonly marked each regular replica readonly without persisting it, so a source-server restart during or after encoding silently reopened the volume to writes. Those writes are not in the EC shards, and the later orphan-source cleanup would then delete the replica, losing them. Send Persist:true so the mark survives a restart; rollbackReadonly still clears it via VolumeMarkWritable on a failed encode. * fix(ec): don't delete a writable source replica during orphan cleanup cleanupOrphanSourceReplicas issued VolumeDelete to every regular replica once the EC shard set looked complete, without checking the replica's current state. A replica that came back writable may hold writes the EC shards do not contain, so deleting it loses data. Re-probe each replica via VolumeStatus and skip any that is no longer readonly, logging a warning instead of deleting.	2026-06-13 21:26:16 -07:00
Chris Lu	1e858d8af0	fix(ec): make ec.decode write-path crash-safe and atomic (#9949 ) * fix(ec): check decode .idx writes and fsync decoded .dat/.idx WriteIdxFileFromEcIndex silently dropped io.Copy and Write errors, so a short or failed write of the reconstructed .idx went unnoticed and the caller proceeded to delete the source EC shards. Propagate those errors. Also fsync the decoded .dat and .idx before returning, so the bytes are durable before the shards that produced them are removed cluster-wide. Mirror the .idx fsync into the Rust volume server (its .dat already syncs and its writes already propagate errors). * fix(ec): publish decoded .dat/.idx atomically via temp file and rename WriteDatFile and WriteIdxFileFromEcIndex wrote in place at the final name with O_TRUNC. A crash mid-write left a truncated .dat/.idx at the final name beside the still-present EC shards; on restart that partial file could be mounted as the live volume even though the shards held the real data. Write to a .tmp file, fsync it, then rename into place and fsync the directory, so the final name is only ever absent or complete. A failed decode removes its own temp file rather than leaking it. Add util.FsyncDir as the shared directory-fsync primitive and reuse the Rust volume server's fsync_dir for the mirrored change. * fix(ec): propagate .ecj read errors in the Rust decoder Path::exists returned false for any error (permission denied, transient IO), silently skipping the deletion journal and resurrecting deleted needles as live. Read the journal directly and treat only NotFound as absent, propagating other errors. The Go decoder already behaves this way (FileExists returns false only for IsNotExist, then the open surfaces other errors). * fix(ec): remove rename destination on Windows in the Rust decoder publish std::fs::rename does not replace an existing file on every Windows version. 
Remove the destination first under a Windows guard before the atomic publish rename, matching the compaction commit path.	2026-06-13 21:26:07 -07:00
Chris Lu	4fb3e22a01	fix(tiering): never delete a shared remote object while replicas still reference it (#9942 ) * tiering: stop a shared remote object being deleted while replicas still point at it A remote-tiered volume's .dat content lives only in one cloud object that all N replica .vif files point at. Deleting that object while destroying any one replica, or before a downloaded replica is durable, bricks the survivors. - volume.tier.move cleanup now deletes old replicas with keepRemoteData=true so surviving replicas keep the shared object. Document why the alreadyPlaced anchor needs no replica sync (same-object replicas are byte-identical). - VolumeTierMoveDatFromRemote now fsyncs the downloaded .dat, fsyncs the containing directory, trims the .vif (fsynced) and swaps to the local DiskFile BEFORE deleting the remote object, on both the keep-remote and delete paths. Only the final DeleteFile is gated by keep_remote_dat_file, so a keep-remote download leaves the replica served from local disk rather than the shared object, and a crash before delete merely leaks the object. - volume.tier.download keeps the shared object for every replica except the last, which deletes it. - s3 and rclone download paths fsync the .dat before close. * storage: swap the volume data backend under the data lock The tier-download swap closed v.DataBackend and assigned the new local DiskFile without holding dataFileAccessLock, racing concurrent reads/writes (use of a closed file / nil deref). Add an exported Volume.SwapDataBackend that performs the close-and-replace under the lock, and call it from the tier download. * server: skip directory fsync on Windows in the tier download path os.Open(dir).Sync() is unsupported on Windows and returns an error, which would fail VolumeTierMoveDatFromRemote entirely there. Skip the directory fsync on Windows, matching how the storage-side helper tolerates the unsupported case. 
* shell: make multi-replica tier.download resilient to already-local replicas If a multi-replica download is interrupted and retried, a replica made local in the prior attempt returns "already on local disk", which aborted the whole command and left the remaining remote replicas dangling. Treat that case as a skip-and-continue so a retry completes the rest. * server: assert downloaded .dat content, not just length, in the tier test A length-only check passes even if the bytes are corrupted; compare the full content of the local .dat against the original.	2026-06-13 20:09:00 -07:00
Chris Lu	339a597e7e	fix(vacuum): crash-safe compaction commit with a durable .cpc marker, fsync-before-rename, and a reload fence (#9944 ) * storage: make vacuum/compaction commit crash-safe with a durable .cpc marker A crash mid-compaction-commit could lose or corrupt volume data. The two-rename commit (.cpd->.dat, .cpx->.idx) was not atomic, fsync results were discarded before renaming over a healthy .dat, a stale .ldb could poison the needle map, and a duplicate/late commit could delete the live .dat/.idx outright. Introduce a durable .cpc commit marker so the swap is atomic across a crash: - CommitCompact writes and fsyncs the .cpc marker after makeupDiff fsyncs the .cpd/.cpx, then runs applyCompactSwap: an existence-guarded rename of .cpd->.dat and .cpx->.idx, a directory fsync, removal of the stale .ldb/.rdb, and finally removal of the marker. - reconcileCompactState recovers an interrupted commit on load: roll forward (finish the renames) when the marker is present, roll back (delete the orphan .cpd/.cpx) when it is absent. It runs from a directory pre-pass keyed on .cpd/.cpc existence, since the per-volume loader is keyed on .idx/.vif and misses the marker-only and already-renamed-.idx states. - applyCompactSwap verifies BOTH .cpd and .cpx exist before touching the live files, so a stale-state commit (including the Windows RemoveAll-then-rename path) errors without deleting anything. - Error-check the fsyncs that gate the swap: the .cpd close-fsync and .cpx fsync in copyDataBasedOnIndexFile, the makeupDiff .idx fsync, and MemDb.SaveToIdx. - generateLevelDbFile rebuilds from offset 0 when the stored watermark sits past the end of the .idx, instead of replaying zero entries and poisoning the needle map. - removeVolumeFiles and cleanupCompact sweep the .cpc marker; cleanup refuses to unlink the temp files while a marker is present. Mirror the commit-marker, fsync-before-rename, guard, and load/reconcile logic in the Rust volume server. 
* storage: don't reconcile an already-loaded volume's compaction state on reload reconcileCompactStates runs in loadExistingVolumes, which is re-invoked at runtime on SIGHUP (Store.LoadNewVolumes). For a volume that is already loaded and mid-vacuum, its .cpd/.cpx are live temp files, not crash leftovers -- rolling them back would clobber the in-flight compaction (and remove a live .ldb out from under an open handle). Skip any vid already present in the volume map; genuine startup recovery runs before any volume is loaded, so the map is empty then. Mirrored in the Rust volume server. Also drop the .note keepVif change that crept into this branch; it belongs to the replica-copy/verify workstream and is restored to master's behavior here so the two changes don't collide. * storage: roll a compaction commit forward per-file, not all-or-nothing A crash after the .cpd->.dat rename but before .cpx->.idx leaves .cpd gone, .cpx and .cpc present, and a stale .idx. The roll-forward required BOTH temp files, so it skipped the swap and cleared the marker, pairing the fresh .dat with the stale .idx (index corruption). Finish whichever temp file remains: extract finishCompactSwap to rename .cpd->.dat and/or .cpx->.idx independently; applyCompactSwap keeps the both-present guard for the normal commit. Existence in the Rust mirror is checked robustly so a transient error never skips the swap. * seaweed-volume: propagate directory fsync failures on the compaction commit path fsync_dir dropped every sync_all error, so the commit could proceed with an undurable marker or rename and a later restart could recover the wrong generation. Return the error and check it at the commit call sites (marker write and the swap), matching the Go fsyncDir which already propagates. Directory fsync stays a no-op on Windows, where it is unsupported. 
* storage: overflow-safe stale-watermark check when rebuilding the leveldb index watermark*NeedleMapEntrySize can overflow uint64 for a corrupted watermark and wrap below the file size, defeating the stale-.ldb guard. Compare in entries (watermark > size/NeedleMapEntrySize) instead, which is equivalent and cannot overflow. LevelDb-backed needle map is Go-only; no Rust mirror. * storage: propagate idxFile.Close error when writing the compacted index SaveToIdx writes the .cpx that is renamed to .idx at commit; a discarded Close error (buffered data not flushed) could leave a partially-written index after a crash. Surface it in the same durability gate as the fsync.	2026-06-13 20:06:24 -07:00
Chris Lu	c2591b4395	fix(replication): verify-before-destroy in VolumeCopy, check.disk, and over-replication trim (#9943 ) * volume: verify before destroy in VolumeCopy and replication repair Four data-safety fixes around copy/repair paths that could destroy or resurrect data before verifying the source or survivors. (a) VolumeCopy no longer deletes a pre-existing local replica up front. The delete is deferred until ReadVolumeFileStatus on the source succeeds, so a transient source outage (or a retry after one) can no longer wipe a healthy destination replica. Gated on source readability only; size/count comparisons are intentionally not used because they invert legitimately after divergent vacuum/compaction. Mirrored in the Rust volume server. (b) volume.check.disk no longer resurrects vacuumed-deleted needles. A key present-and-live on the source but entirely absent on the target is ambiguous: it may be a genuine missing write, or a needle deleted on the target and then vacuumed (its index entry and any tombstone are gone). An individual needle AppendAtNs has no monotonic relation to a vacuum watermark, so the old cutoff heuristic could not tell them apart. Without positive proof the absence is a missing write, the safe default is to NOT push it back. Tradeoff: a real missing write may go unrepaired until a tombstone-aware path exists, but we never raise back deleted data. (c) Over-replication trim no longer resurrects needles or removes the wrong replica. The pre-delete sync now runs read-only (divergence check only) instead of writing the doomed replica's needles into the survivor. pickOneReplicaToDelete only ever removes the smallest of multiple healthy writable replicas; it refuses the trim when doing so would leave only read-only/integrity-flagged survivors, since file_count>0 alone cannot prove the survivor's .dat is readable. 
(d) Incomplete-volume (.note) cleanup keeps the shared .vif when an .ecx for the same vid coexists on the disk, so removing an interrupted regular copy cannot strip a coexisting EC volume's info file. VolumeCopy now surfaces .note write/remove errors instead of ignoring them. In the Rust volume server (where a persisting note is actually reachable) the .note check moves below the empty-stub sweep and EC validation, keeps the .vif on EC coexistence, and the mount path fails when a .note still persists. * shell: scope the over-replication writable-survivor guard to the trim path only The writable-survivor guard (never trim down to a read-only survivor) lived inside the shared pickOneReplicaToDelete, so it also gated the misplaced-volume relocation via pickOneMisplacedVolume -- a misplaced read-only volume (e.g. a full one) would silently stop being rebalanced. Extract pickSmallestReplica for the relocation path (which deletes-and-recreates and must act on read-only replicas), and keep the writable-survivor guard only in pickOneReplicaToDelete used by the over-replication trim. * seaweed-volume: recompute keep_vif after invalid-EC cleanup in the .note path keep_vif used the pre-validation ecx_exists snapshot, so when the EC-validation step above removed the invalid .ecx/shards, the .note cleanup still preserved a now-orphaned .vif. Re-check .ecx existence at cleanup time, matching the Go hasEcxFile re-check. * shell: keep placement when picking an over-replication victim to delete The trim picked the smallest writable replica without regard to placement, so it could delete the only replica in a required failure domain (e.g. with "100" and replicas dc1 + two in dc2, deleting dc1 leaves both survivors in dc2). Prefer a writable replica whose removal still satisfies placement, falling back to the smallest writable only when none does.	2026-06-13 20:05:33 -07:00
Chris Lu	aabd44fbb5	[volume] preserve volume data mtime across tier moves (#9947 ) * fix(tier): preserve volume data modification time * fix(tier): best-effort restore of data mtime on download A failed Chtimes should not abort an otherwise complete tier-down; warn and continue, matching the EC copy path. * fix(tier): preserve volume data mtime in rust volume server Mirror the Go fix: store the source .dat mtime on upload instead of the upload time, and restore it on the downloaded .dat. Without this a tiered-then-restored volume loads last_modified_ts_seconds from the upload/download time, extending its TTL across a restart or remount. * fix(tier): read source mtime via DiskFile.GetStat() GetStat() is nil-safe when the backend is closed concurrently and skips a redundant stat syscall; its cached modTime is the on-disk mtime a reload reads, since every .dat write or Chtimes is followed by a DiskFile (re)open. * fix(tier): surface mtime-restore failures on rust tier-down set_file_mtime now returns io::Result; the tier-down path warns on a failed restore instead of dropping it silently, so a wrong local .dat mtime (and the TTL drift it causes) is observable. Matches the Go download. The EC copy path keeps its best-effort silence.	2026-06-13 15:11:39 -07:00
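A best-effort mtime restore of the kind the tier-move commit above describes might look like the following sketch. The helper names, the file name, and the timestamp are hypothetical; only os.Chtimes and os.Stat are real standard-library calls.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

// restoreMtime applies the recorded source mtime to a downloaded file.
// A failure is logged as a warning instead of aborting (best-effort restore).
func restoreMtime(path string, sourceMtime time.Time) {
	if err := os.Chtimes(path, sourceMtime, sourceMtime); err != nil {
		log.Printf("warning: could not restore mtime on %s: %v", path, err)
	}
}

// demo writes a temp file, restores an older mtime, and reports whether
// a fresh stat sees the restored value.
func demo() (bool, error) {
	dir, err := os.MkdirTemp("", "mtime-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	p := filepath.Join(dir, "1.dat") // hypothetical file name
	if err := os.WriteFile(p, []byte("data"), 0o644); err != nil {
		return false, err
	}
	want := time.Unix(1700000000, 0) // stand-in for the source .dat mtime
	restoreMtime(p, want)

	st, err := os.Stat(p)
	if err != nil {
		return false, err
	}
	return st.ModTime().Equal(want), nil
}

func main() {
	ok, err := demo()
	fmt.Println(ok, err)
}
```

Logging instead of returning the error keeps an otherwise complete download from failing, while still making a wrong mtime observable.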
Chris Lu	f724828bcb	fix(ec): never delete recoverable EC shards on startup/reconcile (the non-empty-.dat sibling of the stub bug) (#9941 ) * fix(ec): never delete recoverable shards on startup/reconcile (size-direction + byte-exact .dat) EC startup validation and the cross-disk reconcile could delete the only copy of distributed-EC shards whenever a non-empty .dat sat beside them. This is the same data-loss class as the empty-.dat-stub fix, now for a real (non-empty) stale or partial .dat. validateEcVolume: the discriminating signal is the shard size relative to the .dat's full encode, not the shard count. - shards smaller than expected: an interrupted local encode left partial shards and the .dat is the complete source -> reclaim the .dat. - shards equal to expected: a valid (or still-distributing) EC volume -> keep; the shards may be the only copy. - shards larger than expected: the .dat is the stale/partial side (e.g. an interrupted decode left a half-written .dat next to the real shards) -> keep. Previously any size mismatch, a low shard count beside a .dat, or a transient stat error returned "delete", wiping sole-copy shards. Now every ambiguity (size mismatch in either direction, inconsistent shard sizes, transient I/O error, partial shard set) keeps the data; only a credible full source .dat with no partial set to lose is reclaimed. handleFoundEcxFile: a shard load failure (corrupt/locked .ecx, EMFILE during a mass restart, transient I/O) no longer deletes the EC files when a .dat exists -- it only unloads and keeps the files for retry. All deletion authority now flows through validateEcVolume. 
pruneIncompleteEcWithSiblingDat: count shards NODE-WIDE (a set split across sibling disks summing to >= dataShards is independently recoverable and is left alone), and require the sibling .dat to byte-exactly match the size .vif recorded at encode time before deleting -- the prior "at least this big, or bigger than a superblock" gate could trust a stale .dat and wipe sole-copy shards. EC encode records the source size in .vif, so this gate works for real volumes; older volumes without it fail safe (kept). Rust volume server mirrors all of the above: size-direction + keep-on- ambiguity in validate_ec_volume, keep-on-load-failure in handle_found_ecx_file, and the node-wide + byte-exact gate in the prune. The Rust validate/prune paths now resolve the data-shard count from the volume's own .vif instead of hardcoding 10+4, so custom-ratio volumes are not mis-sized and wrongly deleted on reboot. Existing tests that encoded the old (unsafe) "delete on low count / size mismatch" behavior are updated to the safe expectation, and new regression tests cover the partial-decode-.dat-keeps-shards and transient-error-keeps cases (Go and Rust); they fail on the pre-fix code. * fix(ec): record DatFileSize in planted EC .vif for the prune test; trim comments The multi-disk lifecycle e2e test planted a partial EC leftover with an empty .vif, so the byte-exact prune gate (which a real encoded volume satisfies via its recorded source size) kept it instead of cleaning up. Record DatFileSize + the EC ratio in the planted .vif, matching production. Also condense the verbose comments added in this change to the repo's concise style.	2026-06-12 23:51:29 -07:00
Chris Lu	3718301599	shell: stop ec.encode/ec.rebuild from destroying live EC shards (no crash needed) (#9939 ) * shell: stop ec.encode/ec.rebuild from destroying live EC shards Three operator-triggered shell paths could destroy data with no crash: ec.encode -volumeId on an already-EC volume tore down its shards before failing. The volume-id path never checked the id was a regular volume: the collection lookup scans only VolumeInfos (so an EC-only id maps to ""), and volumeLocations succeeds via the EC-location fallback, so clearPreexistingEcShards full-teardown-deleted every shard cluster-wide before doEcEncode failed. An EC volume has no .dat, so this is its only copy. Add assertEncodableRegularVolumes: each requested id must be a regular volume in the topology snapshot; an EC-only or unknown id is refused before any teardown. A volume present as both a regular .dat and stale orphan shards (a failed-encode retry) still passes. This closes the operator-rerun/script-retry path; a worker racing the snapshot is a fencing problem handled separately. ec.rebuild dry-run (the default, without -apply) still issued real VolumeEcShardsDelete RPCs: prepareDataToRecover appended every would-copy shard to copiedShardIds even though the copy was skipped, and the cleanup defer deleted that set unconditionally. Now a dry-run copies nothing and records nothing to delete (a separate would-copy counter drives the recoverability check so the dry-run still reports its plan), and the cleanup runs only under -apply. ec.rebuild could also self-destruct a live shard: localShardsInfo was overwritten per disk instead of unioned, so a shard the rebuilder holds on a non-last disk looked remote, got copied onto itself (in-place O_TRUNC) and then node-wide deleted. Union local shards across all disks, and never copy/delete a shard whose only listed holder is the rebuilder itself. 
* shell: address ec destructive-guards review comments - countLocalShards: union shards across all of the rebuilder's disks so slot accounting matches what prepareDataToRecover treats as local; first-match counting overstated slotsNeeded on multi-disk rebuilders - VolumeEcShardsCopy: resolve SourceDataNode via pb.NewServerAddressFromDataNode instead of the raw node id, which may not be a dialable host:port - assertEncodableRegularVolumes: skip nil DiskInfo map entries, matching the other topology walks in this file; rename ecOnly to hasEcShards since the map marks any volume with shards, not only shard-only ones	2026-06-12 22:30:17 -07:00
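The difference between overwriting and unioning per-disk shard sets, as described in the ec.rebuild fix above, can be sketched as follows. The bitmask encoding and all names here are assumptions for illustration, not the actual shell code.

```go
package main

import "fmt"

// shardBits marks shard i as present when bit i is set (an assumed encoding).
type shardBits uint32

// has reports whether shard id is in the set.
func (b shardBits) has(id uint) bool { return b&(1<<id) != 0 }

// lastDiskOnly reproduces the overwrite pattern: each disk replaces the result.
func lastDiskOnly(perDisk []shardBits) shardBits {
	var local shardBits
	for _, bits := range perDisk {
		local = bits
	}
	return local
}

// unionAcrossDisks ORs every disk's shards into one node-wide set.
func unionAcrossDisks(perDisk []shardBits) shardBits {
	var local shardBits
	for _, bits := range perDisk {
		local |= bits
	}
	return local
}

func main() {
	disks := []shardBits{1 << 3, 1 << 7} // shard 3 on disk 0, shard 7 on disk 1

	fmt.Println(lastDiskOnly(disks).has(3))     // false: shard 3 looks remote
	fmt.Println(unionAcrossDisks(disks).has(3)) // true: shard 3 is known to be local
}
```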
Chris Lu	18cdb3819b	fix(ec): crash-safe ecx-journal fold and shard rebuild (fsync before publish, no short-read-as-success) (#9938 ) * fix(ec): make ecx-journal fold and shard rebuild crash-safe Two EC rebuild paths could silently lose or corrupt data: RebuildEcxFile folded the .ecj deletion journal into .ecx (in-place WriteAt tombstones) and then unlinked the journal without flushing the .ecx writes first. A crash could persist the unlink ahead of the tombstones, resurrecting deleted needles on the next load. It also read journal records with a bare n!=size break, so a torn tail silently dropped the remaining tombstones before the unlink. Now: read records with io.ReadFull (io.EOF ends cleanly, a torn tail aborts and leaves .ecj in place for retry), fsync .ecx before removing the journal. rebuildEcFiles treated a zero/short ReadAt as a clean end-of-input and discarded the read error, so a truncated or unreadable input shard produced truncated regenerated shards that were then published as restored redundancy; the regenerated shards were also never fsynced on the no-sidecar path. Now: derive the expected shard size from the present inputs up front (rejecting a divergent/zero-size input), drive the loop by that size, fail on any short read or short write, and fsync every regenerated shard before it is mounted/renamed. Rust volume server mirrors the rebuild fix: rebuild_ec_files now checks the read_at byte count (it previously discarded it, the same truncation bug). The Rust ecx fold already synced .ecx before removing the journal. Custom EC ratios are unaffected: the shard size derives from the input shards and the loop uses the .vif-resolved data/parity counts, never a hardcoded 10+4. * storage: close ecx journal files via defer in RebuildEcxFile Per review: a single deferred Close per file replaces the per-error-path manual closes, so new early returns cannot leak descriptors. 
The journal is still closed explicitly before its unlink since Windows cannot delete an open file; the deferred second Close is a harmless no-op.	2026-06-12 22:28:56 -07:00
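The journal-reading rule from the commit above (a clean io.EOF ends the scan, a torn tail aborts) can be sketched with io.ReadFull. The 8-byte record width and the function names are assumptions for the example.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// recordSize is an assumed fixed journal record width (illustration only).
const recordSize = 8

// readJournal returns every complete record. io.EOF exactly on a record
// boundary ends the scan cleanly; a partial trailing record is an error,
// so the caller can leave the journal in place and retry.
func readJournal(r io.Reader) ([][]byte, error) {
	var records [][]byte
	for {
		buf := make([]byte, recordSize)
		_, err := io.ReadFull(r, buf)
		if err == io.EOF {
			return records, nil
		}
		if err != nil {
			return nil, fmt.Errorf("journal read after %d records: %w", len(records), err)
		}
		records = append(records, buf)
	}
}

// readJournalBytes is a convenience wrapper for in-memory input.
func readJournalBytes(b []byte) ([][]byte, error) {
	return readJournal(bytes.NewReader(b))
}

func main() {
	recs, err := readJournalBytes(make([]byte, 16))
	fmt.Println(len(recs), err) // 2 <nil>

	_, err = readJournalBytes(make([]byte, 20))      // 2 records + 4 torn bytes
	fmt.Println(errors.Is(err, io.ErrUnexpectedEOF)) // true
}
```

io.ReadFull returns io.EOF only when no bytes were read and io.ErrUnexpectedEOF when the read stopped partway, which is what lets the loop tell a clean end from a torn record.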
Chris Lu	871d7ddc02	[helm]: configure JWT expiration (#9940 ) helm: configure JWT expiration	2026-06-12 21:11:30 -07:00
7y-9	5468707289	fix(util): ignore comment only sql input (#9933 ) * fix(util): ignore comment only sql input Problem: sqlutil.SplitStatements strips SQL comments while scanning, but when no statements remain it falls back to returning the original query. Inputs that contain only comments are therefore reported as executable SQL statements. Root cause: The no-statements fallback did not distinguish a real single statement from input that had been fully removed by comment filtering. Fix: Remove the original-query fallback and return an explicit empty slice when scanning produces no statements. Reproduction: env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/util/sqlutil -run TestSplitStatements -count=1 failed before the fix because comment-only inputs returned the comment text as a statement. Validation: gofmt -w weed/util/sqlutil/splitter.go weed/util/sqlutil/splitter_test.go; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/util/sqlutil -run TestSplitStatements -count=1; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/util/sqlutil -count=1; git diff --check; git diff --cached --check. Duplicate check: Searched /private/tmp/seaweedfs-codex0610-old-branch-index.tsv and existing tests for sqlutil, SplitStatements, comments, and comment-only. Old PostgreSQL query branches cover malformed wire frames and SQL engine numeric parsing, not comment-only statement splitting. Co-authored-by: Codex <noreply@openai.com> * Update weed/util/sqlutil/splitter.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: Codex <noreply@openai.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-06-12 10:10:27 -07:00
Chris Lu	0345658ea8	[s3] validate indirect filer path inputs (#9931 ) * s3: validate indirect filer path inputs * s3: avoid query parsing on common request path * filer: scope copy/move source against JWT AllowedPrefixes maybeCheckJwtAuthorization only checked r.URL.Path, but copy and move read their source from the cp.from / mv.from query params. A prefix-restricted token could copy or move data out of a subtree it cannot otherwise reach. Check every path the request touches, reusing pathHasComponentPrefix so `..` in the source is collapsed before the prefix match. * s3: confine iceberg CreateTable location to the catalog bucket CreateTable derived the metadata bucket and path from the client-supplied req.Location / req.Name and wrote there directly, so a caller scoped to one table bucket could place metadata in another bucket (and path.Join collapsed any `..`). Require the parsed bucket to equal the request's catalog bucket and reject traversal segments in the table path. * webdav: clean client path before subFolder confinement wrappedFs concatenated subFolder + name before the underlying FileSystem ran path.Clean, so `..` in the request path or COPY/MOVE Destination resolved across the FilerRootPath confinement boundary. Clean the name as a rooted path first so traversal segments collapse below subFolder. Only the non-default -filer.path (non-empty subFolder) setup was affected. * filer: enforce read-only rule on real write path with destination header The x-seaweedfs-destination header overrides the path used for storage-rule matching while the entry is written at r.URL.Path, letting a caller select a writable rule for a read-only target. When the header is present, also check the read-only/quota rule against the actual write path.	2026-06-11 21:56:16 -07:00
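The webdav confinement fix above hinges on cleaning the client path as a rooted path before joining it to the sub-folder. A minimal sketch, using hypothetical function names and example paths:

```go
package main

import (
	"fmt"
	"path"
)

// naiveJoin concatenates first and cleans afterwards, so ".." can climb
// out of subFolder.
func naiveJoin(subFolder, name string) string {
	return path.Clean(subFolder + name)
}

// confinedJoin cleans name as a rooted path first; ".." cannot rise above
// "/", so the result stays under subFolder.
func confinedJoin(subFolder, name string) string {
	return path.Join(subFolder, path.Clean("/"+name))
}

func main() {
	fmt.Println(naiveJoin("/data/sub", "/../secret"))    // /data/secret
	fmt.Println(confinedJoin("/data/sub", "/../secret")) // /data/sub/secret
}
```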
Chris Lu	34f9b91d69	fix(storage): never let an empty .dat delete healthy distributed EC shards (#9930 ) * fix(storage): never let an empty .dat delete healthy distributed EC shards A leftover empty .dat stub (a phantom from the pre-fix loader; zero needles) next to a distributed EC volume's local shards made startup classify the volume as an interrupted local encode: validateEcVolume requires >= dataShards local shards when a .dat is present, fails with the 1-2 shards a distributed volume keeps per disk, and the cleanup deletes those shards -- the only copies of that part of the volume. Repeated across restart waves this destroys enough shards cluster-wide to make the volume unrecoverable. Go: - loadExistingVolume: hoist the empty-stub sweep above the EC presence checks. Previously the .vif-next-to-.ecx guard returned before the sweep ever ran, so exactly the dangerous layout (stub + .ecx + local shards) kept its stub and then lost its shards in loadAllEcShards. - validateEcVolume / checkDatFileExists: treat a .dat <= a superblock (zero needles) as absent. An empty .dat cannot be the encode source, so it must never gate shard deletion; this also covers stubs without a .vif, which the sweep cannot prove are EC leftovers. Rust mirror (seaweed-volume): the same gate in validate_ec_volume and check_dat_file_exists (the Rust sweep already ran before validation); the volume-load skip keeps a plain existence check so fresh, needle-less volumes still load. Regression tests in Go and Rust reproduce the production layout (a zero-byte .dat beside .ecx/.ecj and two shards of a 10+4 volume, with and without a .vif) and fail without the fix with the shards deleted. 
* fix(ec): gate source volume deletion on a recoverable shard set After EC encode, the shell command and the (plugin) worker task refused to delete the source volume unless every shard was present, and aborted otherwise -- leaving the source .dat next to live shards, exactly the mixed state the startup cleanup mishandles. Replace the full-set requirement with a recoverability gate shared by both callers (RequireRecoverableShardSet): deleting a non-empty source .dat requires at least dataShards distinct shards cluster-wide. Below that the source is kept and the encode fails as before. A degraded but recoverable set (>= dataShards, < total) now proceeds with a warning instead of aborting: the missing shards can be rebuilt from the survivors, while keeping the source would preserve the dangerous mixed state. Empty stub replicas are still swept unguarded (OnlyEmpty) -- an empty .dat has nothing to lose. dataShards/totalShards stay parameters so enterprise custom EC ratios share the helper verbatim. * test(ec): use recoverable shard verification gate	2026-06-11 20:26:20 -07:00
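The recoverability gate described above (at least dataShards distinct shards before the source may be deleted, with a warning when the set is degraded) can be sketched as follows. The function and type names are hypothetical and do not reproduce RequireRecoverableShardSet itself.

```go
package main

import "fmt"

// verdict is the outcome of the gate for a non-empty source .dat.
type verdict int

const (
	keepSource  verdict = iota // fewer than dataShards distinct shards: keep the source
	deleteWarn                 // recoverable but degraded: delete the source, warn
	deleteClean                // full shard set present: delete the source
)

// classifyShardSet counts distinct shard ids seen cluster-wide and applies the gate.
func classifyShardSet(shardIDs []int, dataShards, totalShards int) verdict {
	distinct := make(map[int]struct{})
	for _, id := range shardIDs {
		distinct[id] = struct{}{}
	}
	switch n := len(distinct); {
	case n < dataShards:
		return keepSource
	case n < totalShards:
		return deleteWarn
	default:
		return deleteClean
	}
}

func main() {
	// 10+4 layout; shard 8 is reported twice, so only 9 distinct shards exist.
	fmt.Println(classifyShardSet([]int{0, 1, 2, 3, 4, 5, 6, 7, 8, 8}, 10, 14) == keepSource) // true
	fmt.Println(classifyShardSet([]int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, 10, 14) == deleteWarn) // true
}
```

Keeping dataShards and totalShards as parameters mirrors the note above that custom EC ratios should be able to share the same gate.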
Chris Lu	b44cf51fe9	s3: validate copy source path segments (#9929 ) Reject copy sources whose bucket/object fail IsValidBucketName / IsValidObjectKey, the helpers validateRequestPath already applies to the request URL. The object is joined onto the bucket path and `.`/`..` segments are collapsed by the filer, so without this the source need not stay within the parsed bucket. Route UploadPartCopy through ValidateCopySource too; it previously only checked for empty bucket/object.	2026-06-11 17:07:15 -07:00
Chris Lu	4f8af455bf	feat(storage): sweep leftover empty EC .dat stubs on volume server startup (#9927 ) * feat(storage): sweep leftover empty EC .dat stubs on volume server startup An EC volume keeps no local .dat. The pre-fix loader left empty 8-byte superblock .dat stubs next to EC metadata (one per lone .vif). Left in place each loads as a phantom empty volume, and the same vid's stub on two disks of one server blocks Rust startup via the duplicate-vid check in Store::add_location -- the prior fix stops creating new stubs but does not clean up existing ones. On startup, when a .dat is empty (<= a superblock, i.e. zero needles) and its .vif marks the volume erasure-coded, remove the stub (+ empty .idx) instead of loading it. The real data is in the EC shards, so the empty stub holds nothing to lose. Non-EC empty .dat files (e.g. freshly allocated volumes) are left alone. Done in both Rust (load_existing_volumes) and Go (loadExistingVolume), with regression tests that fail without the sweep. * refactor(storage): extract empty EC .dat stub sweep into its own function Move the startup stub-sweep into remove_empty_ec_dat_stub (Rust) and removeEmptyEcDatStub + vifIsEcVolume (Go) for clearer logic, and look up the .vif in both the data and idx directories (each read at most once) so a stub is still found when -dir.idx is configured. Adds direct tests for the idx-directory lookup on both engines.	2026-06-11 12:26:21 -07:00
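The sweep condition in the commit above reduces to two checks: the .dat holds no needles, and its .vif marks the volume erasure-coded. A minimal sketch with a hypothetical predicate name:

```go
package main

import "fmt"

// superBlockSize is the 8-byte superblock size mentioned in the commit message.
const superBlockSize = 8

// shouldSweepStub reports whether a .dat is an empty EC leftover: no needles
// (size <= one superblock) and a .vif that marks the volume erasure-coded.
func shouldSweepStub(datSize int64, vifMarksEC bool) bool {
	return datSize <= superBlockSize && vifMarksEC
}

func main() {
	fmt.Println(shouldSweepStub(8, true))    // true: empty stub beside EC metadata
	fmt.Println(shouldSweepStub(8, false))   // false: freshly allocated regular volume
	fmt.Println(shouldSweepStub(4096, true)) // false: the .dat holds data
}
```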
Chris Lu	37962e2445	admin: configure maintenance tasks via admin.toml (#9926 ) * admin: configure maintenance tasks via admin.toml Maintenance task settings could only be edited in the admin UI and live under <dataDir>/conf, so they silently reverted to defaults whenever the data directory was recreated. An optional admin.toml now declares vacuum, balance, and erasure coding settings; keys set there are written through to the persisted task configs at every startup, overriding UI edits, so the configuration stays declarative. Generate an example with "weed scaffold -config=admin". * vacuum: round min volume age up to whole hours MinVolumeAgeSeconds was truncated by integer division when converted to the hour-granular protobuf field, so a sub-hour setting silently became 0 and disabled the age guard. * admin: split and normalize preferred_tags from admin.toml A comma-separated string, as set via environment variable, came through viper as a single slice element. Split on commas and reuse util.NormalizeTagList, matching the plugin config path. * scaffold: clarify admin.toml wording	2026-06-11 11:04:52 -07:00
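The truncation fixed in the vacuum change above is ordinary integer division; rounding up instead keeps a sub-hour setting from collapsing to zero. A sketch with hypothetical function names:

```go
package main

import "fmt"

const secondsPerHour = 3600

// hoursTruncated shows the old behavior: integer division drops the remainder.
func hoursTruncated(seconds int) int {
	return seconds / secondsPerHour
}

// hoursRoundedUp rounds any partial hour up to the next whole hour.
func hoursRoundedUp(seconds int) int {
	return (seconds + secondsPerHour - 1) / secondsPerHour
}

func main() {
	fmt.Println(hoursTruncated(1800)) // 0: the age guard would be disabled
	fmt.Println(hoursRoundedUp(1800)) // 1
	fmt.Println(hoursRoundedUp(3600)) // 1
	fmt.Println(hoursRoundedUp(3601)) // 2
}
```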
Chris Lu	42030381ae	shell: volume.tier.move can move volumes between data centers (#9925 ) * shell: volume.tier.move can move volumes between data centers -fromDataCenter scopes volume selection to volumes with a replica in that data center. -toDataCenter constrains move destinations and replication fulfillment. With identical disk types both flags are required, moving full volumes between data centers on the same tier. * shell: assert node identity in data center filter test * shell: tier move resumes when the volume is already on the target A replica already on the target tier and data center, typically left by an interrupted earlier run, anchors the move: skip the copy and only complete replication fulfillment and old replica cleanup. Previously such volumes hit the no-destination path and the stale source replicas were never removed.	2026-06-11 10:46:34 -07:00
Chris Lu	c3b06bf809	ci: run weed tests on linux/386 (#9924 ) 386 test binaries execute natively on the amd64 runner, so the suite catches what vet cannot: unaligned 64-bit atomics and arithmetic that wraps at runtime. -short keeps the e2e suites on amd64 only.	2026-06-11 09:49:07 -07:00
Chris Lu	3eb550a3f1	fix(tests): 32-bit build of EC e2e tests, type-check linux/386 in CI (#9922 ) * fix(tests): keep EC e2e fid cookie arithmetic in uint32 The cookie constants 0x9490CA00 and 0x9500CA00 were added to the int loop variable before conversion, overflowing 32-bit int at compile time on linux/386 and linux/arm. Convert the loop variable instead so the addition stays in uint32. * fix(tests): pass s3client max backoff in milliseconds MaxBackoffDelay is documented as milliseconds and multiplied by 1e6 before use, but the example set it to 5s in nanoseconds, yielding an absurd backoff on 64-bit and a compile-time int overflow on 32-bit. * ci: type-check code and tests for linux/386 64-bit-only constant arithmetic keeps slipping into test files and breaking 32-bit downstream builds. Vet the whole root module under GOOS=linux GOARCH=386 so these fail in CI instead of after release. * fix(tests): convert s3client backoff to Duration before scaling The ms-to-ns multiplication ran in int, wrapping at runtime on 32-bit; scale by time.Millisecond after the Duration conversion instead.	2026-06-11 09:05:54 -07:00
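Both 32-bit fixes in the commit above come down to converting before doing arithmetic. The sketch below uses one of the cookie constants from the message; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// cookieFor keeps the addition in uint32. Writing uint32(0x9490CA00 + i)
// with an int i would force the constant into int, which does not fit
// when int is 32 bits wide.
func cookieFor(i int) uint32 {
	return 0x9490CA00 + uint32(i)
}

// backoff converts to Duration (an int64) before scaling, so the
// multiplication does not run in a possibly 32-bit int.
func backoff(ms int) time.Duration {
	return time.Duration(ms) * time.Millisecond
}

func main() {
	fmt.Printf("%#x\n", cookieFor(1)) // 0x9490ca01
	fmt.Println(backoff(5000))        // 5s
}
```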
Chris Lu	582b7268f5	s3: export per-bucket quota and read-only state metrics (#9923 ) The quota enforcement loop already computes each bucket's configured quota and effective read-only flag every minute, but neither was visible to monitoring, so operators could not alert before a bucket flips read-only. Add two gauges next to the existing bucket size metrics: SeaweedFS_s3_bucket_quota_bytes configured quota; the series is only present while the quota is enabled, so size/quota utilization queries never divide by zero SeaweedFS_s3_bucket_read_only 1 when the bucket's location rule is read-only (over quota or manually locked), 0 otherwise Both are cleaned up with the other per-bucket gauges on bucket deletion and inactivity TTL.	2026-06-11 09:03:00 -07:00
Chris Lu	55010be19b	4.33 4.33	2026-06-11 00:52:31 -07:00
Chris Lu	79ac279fe1	fix(ec): don't mix EC shards from different encode runs (#9880 ) * feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns (encode time in unix nanos). It rides in .vif and the read request so a read can be scoped to the encode run that produced the index. * fix(ec): stamp each encode and reject cross-run shard reads Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the shard's owning volume (resolved together via FindEcVolumeWithShard, so a multi-disk server validates the disk that actually serves the bytes) and reject a shard from a different encode run, recovering from parity. A zero on either side (pre-upgrade volume) skips the guard. * fix(ec): stamp the encode identity on the worker-generated .vif The worker-local encode path now writes EncodeTsNs (and the resolved EC ratio) into the .vif, so the read guard is not silently off for volumes encoded by the maintenance worker. * fix(ec): wipe stale EC artifacts before re-encoding VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and removes its on-disk shard/index/sidecar files before writing fresh ones, so a retried encode never builds on a partial prior run and the unlink frees the inodes instead of leaving open fds serving old bytes. * fix(ec): unmount EC shards across all disks UnmountEcShards walked only the first disk holding the shard, leaving a duplicate copy mounted on a sibling disk (split-disk reconciled volumes) still serving and heartbeating. Traverse every disk and emit one deletion delta per disk. * fix(ec): delete orphan shards without a local .ecx deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx, so it could not clean an orphan .ecNN left by a failed copy on a disk with no index. Delete the requested shard files unconditionally; the index-file (.ecx/.ecj/.vif) routing stays gated as before. 
* fix(ec): clear stale EC shards cluster-wide before re-encoding ec.encode unmounts and deletes EC shards for the target volumes on every node before regenerating: fatal for the shards the topology reports (mounted leftovers), best-effort for the rest (a sweep that catches unmounted failed-copy orphans). A down node is a no-op. * fix(ec): don't nil EC fds on close so reads can't race eviction A reader resolves an EcVolume/shard under the lock then reads after it is released, so an eviction that nils ecxFile/ecdFile would race that read and panic. Close the fds without nilling the fields: the field is now write-once (no data race) and a concurrent read hits a closed fd, getting a clean error that the caller recovers from parity. * fix(ec): wipe stale EC artifacts on every disk and surface failures The pre-encode wipe only deleted beside the source volume, so a stale shard on a sibling disk survived and could be mounted against the new index at reconcile. Sweep every disk. Removal also ignored os.Remove errors, reporting a failed cleanup as success and letting a stale shard join the next generation; surface the first real failure (treating already-gone as success) from removeStaleEcArtifacts and the shard delete. * fix(ec): log when a local shard is skipped for a different encode run The cross-run guard returned errShardNotLocal, indistinguishable in logs from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so operators can tell "wrong encode generation" from "shard not here". * fix(ec): surface metadata removal failures in the shard delete path deleteEcShardIdsForEachLocation still dropped os.Remove errors on the .ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index condition this path prevents, so route those through removeFileIfExists and return the first real failure instead of reporting cleanup as success. 
* fix(ec): fail orphan cleanup when a reachable node's delete fails The pre-encode orphan sweep swallowed every error for unreported (node, volume) pairs. That is only safe for an unreachable node, which cannot receive this encode's new generation. A reachable node whose delete genuinely failed (permission/IO) keeps an orphan shard that a later copy re-stamps with the new run's volume-level .vif identity, so the read guard would accept stale data. Surface those; stay best-effort only for unreachable nodes (gRPC Unavailable / no status). * fix(ec): guard ecjFile under its lock in the EC delete path EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved its .ecx lookup before a concurrent eviction (the generate-time UnloadEcVolume) could then reach the journal append with a nil fd. Bail with a clear "volume closed" error under the lock instead. * fix(ec): reject an unstamped shard when the caller has an encode identity The read guard required both identities nonzero, so a current (stamped) caller accepted a holder with identity 0 and could be served a stale pre-upgrade shard. Reject when the caller is stamped and the holder differs (including unstamped); stay lenient only when the caller itself has no identity (pre-upgrade reader). A skipped shard recovers from parity. * fix(ec): full-teardown delete so cluster cleanup wipes a whole generation The pre-encode cluster sweep deleted only the listed canonical shards on remote nodes, leaving index/sidecar (and, on builds with versioned generations, those too) behind. Add a full_teardown flag to VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode cleanup paths set it. Other delete callers (balance/decode/repair) are unchanged. 
* fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while Close nils it under the lock, a data race on the field. Take the lock first, then nil-check inside, in both. * fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success An old volume server silently ignores full_teardown and returns success for an ordinary delete, so the caller wrongly believes the generation was wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done in the response; the worker destination cleanup fails when it is absent, and the shell cluster sweep fails for a reported (mounted) leftover while staying best-effort for an unreported node. encode_ts_ns stays an accepted transient (an old server just skips the new read guard, no regression). * fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown A reachable pre-upgrade server ignores full_teardown and returns success without wiping an orphan, which a later copy then folds into the new generation. Treat a missing full_teardown_done ack as fatal for every reachable node (best-effort only for a gRPC-unreachable one), not just for topology-reported pairs. * fix(ec): return the served shard identity and validate it client-side The encode identity was only enforced server-side, so a pre-upgrade server ignored the request field and served bytes unchecked. Echo the served shard's EncodeTsNs on every read response chunk and have the client reject a mismatch (including 0 from an old server), so the guard holds regardless of server version; a rejected read recovers from parity. * fix(ec): reject a short/empty remote shard read instead of serving zeros doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and returned success with a partly zero-filled, unvalidated buffer (the server stamps the identity only on chunks that carry bytes). 
A non-deleted interval must arrive whole: require n == len(buf), exempting the is_deleted short-circuit (n=0), matching readLocalEcShardInterval's local check. A short read now fails so the caller recovers from parity. * test(ec): fake volume server echoes the full_teardown acknowledgement The worker now fails a teardown delete that isn't acknowledged (so a pre-upgrade server can't silently skip the wipe). The fake server's no-op VolumeEcShardsDelete returned an empty response, which the worker read as a skipped teardown and aborted the encode. Echo full_teardown_done. * feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif and rejects a read served from a shard of a different run; full_teardown wipes a whole generation and acknowledges it. The Rust volume server had none of it. Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume, stamp it on every read response, and reject a request/response mismatch on both the server and the distributed-read client (recovering from parity); handle full_teardown by evicting the volume and wiping every EC artifact on each disk, echoing full_teardown_done so the caller can detect a server that ignored it. * fix(ec): remove a stale .vif on full teardown of a shard-only node A shard copy installs shards + .ecx before .vif, so an interrupted copy after a teardown could mount the new files under the previous run's identity / version / shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during full teardown, gated on .idx absence so a source-volume holder keeps its live .vif. In Rust this lives in a teardown-only helper so the reconcile / load- fallback paths (which share the base removal) still preserve .vif. 
* fix(ec): treat a missing teardown ack as fatal, not as an unreachable node isNodeUnreachable returned true for any non-gRPC-status error, so a reachable pre-upgrade server's missing full_teardown_done ack (a plain error) was classified unreachable and the unreported pair was silently skipped. Classify only a real codes.Unavailable as unreachable, and wrap the missing ack in a sentinel the sweep treats as fatal regardless. A genuinely down node still surfaces as Unavailable from the RPC and stays best-effort. * fix(ec): reject a short shard read in the local EC needle reader read_ec_shard_needle ignored the byte count from shard.read_at and appended the whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later length check and parsed as garbage. Require n == buf.len() per interval, erroring on a short read like the local interval reader already does. * fix(ec): probe reachability before skipping a node that returns Unavailable The pre-encode sweep skipped any node whose teardown delete returned codes.Unavailable, but a reachable volume server in maintenance mode also returns that code for the maintenance-gated delete, so its stale EC files were left behind on a node that can still receive the new generation. Confirm with a non-maintenance-gated empty-target Ping: skip only when the node fails the probe too (genuinely unreachable). * fix(ec): use try_exists for the teardown .vif .idx guard The teardown-only .vif removal gated on Path::exists(), which returns false on a permission/IO stat error, so a stat failure on a present .idx would read as a shard-only node and delete the live source volume's .vif. Gate on try_exists() == Ok(false) instead, preserving the sidecar on any stat error. 
* fix(ec): only skip a sweep node when a Ping confirms it is transport-down The pre-encode sweep skipped a node whenever its teardown delete and a liveness Ping both failed, but it treated ANY Ping error as down — an application-level Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a reachable node's stale generation in place. Classify the Ping tri-state and skip only when it transport-fails with codes.Unavailable; a reachable or inconclusive node stays fatal. * fix(ec): exclude sweep-skipped nodes from the encode's rebalance The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance then recollected the current topology — a node that recovered between the two could become a copy target and receive the new generation while still holding its stale, never-cleared shards. Have the sweep return the skipped set and exclude those nodes from the rebalance for this encode, so a node we could not clean cannot receive the new generation. Standalone ec.balance is unaffected. * fix(ec): re-sweep recovered nodes before generation so they aren't stranded A node skipped as down by the pre-encode sweep is excluded from the rebalance, but it can recover and become the generation host — mounting all shards locally, then being excluded from distribution. Union-only verification accepts all shards on one node and deletes the originals: a single point of failure. Re-sweep the skipped nodes just before generation; one whose teardown now succeeds leaves the skipped set and rebalances normally, while a node still down stays skipped. * fix(ec): abort the encode if a selected source is still skipped after re-sweep The re-sweep un-skips a recovered node, but the source was selected before it and a node can stay down through the re-sweep then recover just in time to be the generation host — mounting all shards locally while still excluded from the rebalance, which union-only verification accepts before deleting the originals. 
Abort the encode when a selected source remains skipped after the re-sweep. * fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch If a volume is not EC at the batch-delete classification but is encoded to EC and its .dat deleted before the regular-volume mutation, the mutation returns an exact "not found" that the filer chunk-GC treats as completed, dropping the delete. Recheck EC presence under the mutation lock and return a retriable 503 with the "try again" token so the filer requeues it onto the EC path. * fix(ec): recheck EC state before the regular batch-delete mutation ec.encode mounts EC shards (copied from the .dat) before deleting the originals, so a volume can be EC while its .dat still exists. The batch delete only rechecked EC after a NotFound, so a successful regular-volume delete in that window wrote a tombstone to the soon-removed .dat — the delete was lost and the needle resurrected from the pre-tombstone shards. Recheck has_ec_volume under the write lock before delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path. * fix(volume): make the metrics push test independent of test order test_push_metrics_once asserted the pushed body contains the request-counter family without ever touching the counter — a CounterVec with no children emits nothing, so the assertion only held when another test had already created a labelset in the shared registry. Create one in the test itself.	2026-06-10 22:31:18 -07:00
Bruce Zou	1dd292fb84	batch drain delta heartbeat messages (#9914)	2026-06-10 13:33:45 -07:00
Lisandro Pin	6b4d20a6f3	`volume.scrub` and `ec.scrub` shell commands: make the display of scrub details optional. (#9911) On volumes failing scrubs, the detail output can get very verbose, which makes reading results difficult. Most users won't care about this information to begin with - just whether or not volumes pass scrub tests. This MR gates the display of scrub result details behind a `--details` flag.	2026-06-10 13:29:07 -07:00
Chris Lu	caadd6ca79	ci(s3tables): stop Lakekeeper flaking on Docker Hub pull timeouts (#9920 ) * ci(s3tables): drop docker pre-pull from Lakekeeper job The lakekeeper repro is pure Go against the local weed binary; the job kept failing on Docker Hub timeouts pulling python:3 and localstack images the test never runs. Also drop the stale python-in-docker comments left from the old harness. * ci(s3tables): serve python:3 from GHA cache in the STS job Retried pulls still die when both mirror.gcr.io and registry-1.docker.io are unreachable from the runner. Cache the saved image tarball under a weekly key: an exact hit skips the registry entirely, a miss pulls fresh and refreshes the cache, and a stale tarball from a previous week is the fallback when Docker Hub is down. * ci(spark): pre-pull the spark tag the test actually runs The workflow warmed apache/spark:3.5.8 with retries while the testcontainers setup runs apache/spark:3.5.1, so the real image was pulled at test time with no retry at all.	2026-06-10 13:26:30 -07:00
Chris Lu	594fc667d5	Cut per-subscriber replay decode and widen replay concurrency (#9917 ) * Filter metadata events before unmarshaling them per subscriber Every subscriber unmarshaled every log entry into a full event just to run the path filter, and entries carry complete chunk lists, so a fleet of path-filtered subscribers spends almost all replay CPU materializing events it then discards. A shallow wire scan now extracts just the directory, entry names and rename destination into a skeleton event, feeds the same matcher, and skips the decode for entries the subscriber cannot match. Any scan surprise (malformed bytes, merged duplicate message fields) falls back to the full decode, and the unsynced-events heartbeat keeps firing for skipped entries. * Raise the legacy replay cap The cap was sized when every replay pinned a private chunk reader per source filer. Replays now share decoded chunks, so sixteen needlessly serializes subscriber catch-up; the expensive part stays bounded by the cache's load gate. * Weight concurrent log-chunk loads by size The flat eight-load gate let eight tiny chunks through as reluctantly as eight full ones. Charge each load's chunk size against a 128MB in-flight budget instead: small chunks decode wide open while full-size ones still serialize enough to cap the transient peak. Oversized weights clamp to the budget so they can always acquire. * Propagate heartbeat send failures and reset the skip counter A failed heartbeat send means the stream is gone, so end the replay instead of scanning on. A delivered event also resets the skip counter, keeping the heartbeat cadence relative to the last thing the client actually received. * Share the unsynced-events counter across the prefilter and delivery Two independent counters could starve the heartbeat: alternating drops reset each side before either reached its threshold. 
One shared counter increments on every dropped entry, prefiltered or not, and only an actual delivery resets it, restoring the original cadence exactly. * Tighten comments * Benchmark the subscription match paths For a thousand-chunk event that the subscriber filters out, the shallow scan matches in 10us and 9 allocations against 175us and 4031 allocations for the full decode.	2026-06-10 13:08:34 -07:00
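The size-weighted load gate described in the commit above (charge each chunk's size against a 128MB in-flight budget, clamping oversized weights so they can always acquire) could look roughly like the sketch below. It is an illustration only: `byteBudget` and its methods are hypothetical names, and the real gate may be built on a different primitive.

```go
package main

import (
	"fmt"
	"sync"
)

// byteBudget is a hypothetical size-weighted gate: each load charges its
// size against a fixed in-flight budget instead of taking one flat slot.
type byteBudget struct {
	mu       sync.Mutex
	cond     *sync.Cond
	capacity int64
	inFlight int64
}

func newByteBudget(capacity int64) *byteBudget {
	b := &byteBudget{capacity: capacity}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// clamp bounds a weight to [0, capacity] so an oversized load can still acquire.
func (b *byteBudget) clamp(weight int64) int64 {
	if weight > b.capacity {
		return b.capacity
	}
	if weight < 0 {
		return 0
	}
	return weight
}

// acquire blocks until the clamped weight fits and returns the amount charged.
func (b *byteBudget) acquire(weight int64) int64 {
	w := b.clamp(weight)
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.inFlight+w > b.capacity {
		b.cond.Wait()
	}
	b.inFlight += w
	return w
}

// tryAcquire is the non-blocking variant; ok is false when the weight does not fit.
func (b *byteBudget) tryAcquire(weight int64) (charged int64, ok bool) {
	w := b.clamp(weight)
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.inFlight+w > b.capacity {
		return 0, false
	}
	b.inFlight += w
	return w, true
}

// release returns a previously charged amount and wakes blocked loaders.
func (b *byteBudget) release(charged int64) {
	b.mu.Lock()
	b.inFlight -= charged
	b.mu.Unlock()
	b.cond.Broadcast()
}

func main() {
	const budget = 128 << 20 // 128MB, the figure from the commit message
	gate := newByteBudget(budget)
	held := gate.acquire(512 << 20) // oversized: clamped to the whole budget
	_, ok := gate.tryAcquire(1)
	fmt.Println(held == budget, ok)
	gate.release(held)
	_, ok = gate.tryAcquire(4 << 10)
	fmt.Println(ok)
}
```

With this shape, many small chunks can load concurrently while a single full-size chunk takes most or all of the budget, which is the trade-off the commit message describes.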
Chris Lu	e56a1c4c05	admin: pre-gzip embedded static assets, add cache headers (#9918 ) The admin UI served embedded static files uncompressed and without cache headers: embed.FS has zero mod times, so no Last-Modified, no ETag, no 304s -- every page load re-downloaded ~700KB of css/js in full, which gets painful over slow or tunneled links. Gzip the static tree at generation time (go generate ./weed/admin) and embed only the compressed mirror, shrinking the binary ~1.5MB. The handler hands the pre-compressed bytes to gzip-capable clients, decompresses for the rest, and sets Cache-Control, per-variant content-hash ETags and Vary so repeat loads revalidate with a 304. bootstrap.min.css goes 232KB -> 30KB on the wire. A drift test keeps static_gz/ in sync with static/.	2026-06-10 12:54:36 -07:00
Chris Lu	c2271d59bb	log_buffer: stop dumping the whole log entry on callback errors (#9919) The eachLogDataFn error path printed the full LogEntry proto. For an entry carrying a large chunk manifest that is hundreds of KB of escaped bytes in a single log line, burying the actual error -- often just a subscriber disconnect -- at the very end. Log the key, timestamp, offset and data size instead.	2026-06-10 12:47:35 -07:00
Chris Lu	2ac5aa72c7	add elastic8 filer store for Elasticsearch 8 (#9916 ) * elastic: fix listing against a missing or empty directory index The refresh 404 leaked into the named return, so the first listing of a directory whose index does not exist yet returned an error instead of an empty result. Sorting also fails on an index with no documents ("No mapping found for [_id] in order to sort on"); unmapped_type keeps the resumed-listing path working there. * add elastic8 filer store for Elasticsearch 8 Elasticsearch 8 disables _id fielddata by default, so the elastic7 store's directory listings fail with "Fielddata access on the _id field is disallowed". elastic8 uses the same client and configuration options, but also indexes the document id as an Id field and sorts listings on Id.keyword.	2026-06-10 12:10:49 -07:00
7y-9	689b5b61bf	fix(s3api): reject empty v4 signed header names (#9910 ) Problem: Signature V4 SignedHeaders parsing accepted empty header name segments such as host; or ;host. Malformed Authorization headers could continue into signature verification instead of failing during header parsing. Root cause: parseSignedHeader only checked that the SignedHeaders value was non-empty, then split it on semicolons without validating each element. Fix: reject empty or whitespace-only signed header elements with ErrMissingFields before returning the parsed header list. Reproduction: go test ./weed/s3api -run TestParseSignedHeaderRejectsEmptyHeaderNames -count=1 failed before the fix because SignedHeaders=host; returned ErrNone. Validation: gofmt -w weed/s3api/auth_signature_v4.go weed/s3api/auth_signature_v4_test.go; git diff --check; go test ./weed/s3api -run TestParseSignedHeaderRejectsEmptyHeaderNames -count=1; go test ./weed/s3api -count=1 Co-authored-by: Codex <noreply@openai.com>	2026-06-10 11:00:35 -07:00
Chris Lu	7bf2dfc9ab	Bound the metadata-log flush queue (#9907 ) * Bound the metadata-log flush queue A stalled flush, e.g. slow volume servers under a reconnect storm, let up to 256 queued 8MB buffer copies pin two gigabytes per log buffer while producers kept filling the queue. Cap the queue at 16 so a sustained stall backpressures writers instead of growing the heap. The flush goroutine never feeds back into the buffer (system-log paths skip event notification), so blocked producers cannot deadlock the consumer. * Don't drop a force-flushed buffer on a full queue ForceFlush enqueued with a two-second timeout, but by then the live buffer was already sealed and reset, so a timed-out send silently lost the copy. Block until the flush is queued; the wait for completion stays bounded since the data is durable once the flush loop drains it. * Never close the flush channel ShutdownLogBuffer closed flushChan while producers could still be blocked sending into it, which panics. Terminate loopFlush with a nil sentinel instead, so the channel is never closed, and give every producer-side send a shutdown escape so none parks forever once the flush loop exits. Everything queued before the sentinel still drains, preserving IsAllFlushed semantics. * Copy the shutdown flush under the buffer lock Every other copyToFlush call site holds the lock; the shutdown path read the live buffer unlocked while producers could still be appending.	2026-06-10 10:57:30 -07:00
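The shutdown scheme in the commit above (never close the flush channel, stop the loop with a nil sentinel, and give each producer-side send a shutdown escape) might be sketched like this. The `flushQueue` type and its method names are hypothetical; only the capacity of 16 and the nil-sentinel idea come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// flushQueue is a hypothetical bounded queue of buffer copies.
type flushQueue struct {
	ch       chan []byte   // bounded; never closed, so a blocked sender cannot panic
	done     chan struct{} // closed once the flush loop has exited
	stopOnce sync.Once
}

func newFlushQueue(capacity int, flush func([]byte)) *flushQueue {
	q := &flushQueue{ch: make(chan []byte, capacity), done: make(chan struct{})}
	go func() {
		defer close(q.done)
		for {
			buf := <-q.ch
			if buf == nil { // nil sentinel: everything queued before it has drained
				return
			}
			flush(buf)
		}
	}()
	return q
}

// enqueue blocks while the queue is full (backpressure) but gives up once the
// flush loop has exited. It reports whether the buffer was queued.
func (q *flushQueue) enqueue(buf []byte) bool {
	if buf == nil {
		return false // nil is reserved for the sentinel in this sketch
	}
	select {
	case <-q.done:
		return false
	default:
	}
	select {
	case q.ch <- buf:
		return true
	case <-q.done:
		return false
	}
}

// shutdown sends the sentinel once and waits for the loop to drain and exit.
func (q *flushQueue) shutdown() {
	q.stopOnce.Do(func() {
		select {
		case q.ch <- nil:
		case <-q.done:
		}
	})
	<-q.done
}

func main() {
	q := newFlushQueue(16, func(b []byte) { fmt.Println("flushed", len(b), "bytes") })
	q.enqueue(make([]byte, 8))
	q.shutdown()
	fmt.Println("accepted after shutdown:", q.enqueue(make([]byte, 8)))
}
```

Because the channel is never closed, a producer that is parked on a full queue during shutdown returns false instead of panicking on a send to a closed channel.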
Chris Lu	bf76040046	Share metadata-log replays per chunk instead of per file (#9906 ) * Share metadata-log replays per chunk instead of per file Log file chunks are immutable: each metadata-log flush uploads one whole buffer of complete records as a new chunk, and appends only add chunks. So cache decoded entries per chunk, with no age gate and no fingerprint revalidation. The per-file cache excluded files younger than two flush intervals, which is exactly the hot tail that every tailing or reconnecting subscriber replays — each through a private chunk reader holding an 8MB buffer and decoding the whole file from byte zero. A chunk's flush time also upper-bounds every record timestamp inside it, so a tail replay now skips cold chunks without reading them at all. If a chunk does not decode standalone (records spanning chunk boundaries, or a corrupt size prefix), fall back to streaming the whole file as one byte stream, resuming after the last yielded entry. * Evict idle metadata-log cache entries The replay cache only evicted on insert, so once filled it held its full budget forever. Stamp entries on use and sweep the LRU tail every minute, dropping anything untouched for five minutes; the cache now holds memory only while subscribers actually replay. * Reject implausible records when decoding log chunks proto.Unmarshal is permissive: empty payloads and unknown-field garbage parse without error, so a chunk starting mid-record could decode by coincidence and get cached instead of falling back to the byte stream. Enforce what the writer guarantees - records are never empty and carry strictly increasing positive timestamps within one flushed buffer. * Gate the singleflight test on an open flight The sleep alone only probabilistically created concurrent misses; a started channel now proves the loader holds the flight before callers are released.	2026-06-10 10:57:11 -07:00
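The plausibility rule from the commit above (records are never empty and carry strictly increasing positive timestamps within one flushed buffer) reduces to a short loop. In this sketch `logRecord` and `validateChunk` are hypothetical names, and the struct is a simplified stand-in for the decoded log entry.

```go
package main

import "fmt"

// logRecord is a simplified stand-in for a decoded metadata-log entry.
type logRecord struct {
	TsNs int64
	Data []byte
}

// validateChunk enforces what the writer guarantees for one flushed buffer:
// no empty payloads, and strictly increasing positive timestamps.
// A chunk with no records passes trivially in this sketch.
func validateChunk(records []logRecord) error {
	var lastTs int64 // starts at 0, so the first timestamp must be positive
	for i, r := range records {
		if len(r.Data) == 0 {
			return fmt.Errorf("record %d: empty payload", i)
		}
		if r.TsNs <= lastTs {
			return fmt.Errorf("record %d: timestamp %d not after %d", i, r.TsNs, lastTs)
		}
		lastTs = r.TsNs
	}
	return nil
}

func main() {
	good := []logRecord{{TsNs: 10, Data: []byte("a")}, {TsNs: 20, Data: []byte("b")}}
	bad := []logRecord{{TsNs: 20, Data: []byte("a")}, {TsNs: 20, Data: []byte("b")}}
	fmt.Println(validateChunk(good), validateChunk(bad))
}
```

A chunk that fails this check would not be cached; per the commit message, the reader falls back to streaming the whole file as one byte stream.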
Lisandro Pin	5150c86934	Make shell command `ec.scrub` return shard details upon scrub failures in `LOCAL` mode. (#9913) This is useful information to deal with issues requiring EC shard rebuilding, such as https://github.com/seaweedfs/seaweedfs/issues/9872.	2026-06-10 10:55:16 -07:00
7y-9	7c0a9acb30	fix(s3api): normalize checksum trailer header names (#9905 ) Problem: SigV4 chunked upload checksum trailer parsing rejected mixed-case checksum header names even though HTTP header field names are case-insensitive. Root cause: extractChecksumAlgorithm compared the x-amz-trailer value and trailer header key against exact lowercase strings. Fix: Trim and lowercase checksum trailer header names before matching supported checksum algorithms. Reproduction: go test ./weed/s3api -run TestExtractChecksumAlgorithmIsCaseInsensitive -count=1 with X-Amz-Checksum-Crc32; before the fix it returned unsupported checksum algorithm. Validation: gofmt -w weed/s3api/chunked_reader_v4.go weed/s3api/chunked_reader_v4_test.go; git diff --check; go test ./weed/s3api -run TestExtractChecksumAlgorithmIsCaseInsensitive -count=1; go test ./weed/s3api -count=1 Co-authored-by: Codex <noreply@openai.com>	2026-06-10 00:30:43 -07:00
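The normalization described above (trim and lowercase the trailer header name before matching) is small enough to show in full. This is a sketch, not the real `extractChecksumAlgorithm`: the function name, the lookup table, and the returned algorithm labels are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// checksumHeaders maps lowercase trailer header names to an algorithm label.
// The table and labels are illustrative, not the project's own definitions.
var checksumHeaders = map[string]string{
	"x-amz-checksum-crc32":  "CRC32",
	"x-amz-checksum-crc32c": "CRC32C",
	"x-amz-checksum-sha1":   "SHA1",
	"x-amz-checksum-sha256": "SHA256",
}

// checksumAlgorithm normalizes the header name first, because HTTP header
// field names are case-insensitive.
func checksumAlgorithm(headerName string) (string, error) {
	name := strings.ToLower(strings.TrimSpace(headerName))
	if algo, ok := checksumHeaders[name]; ok {
		return algo, nil
	}
	return "", fmt.Errorf("unsupported checksum algorithm: %q", headerName)
}

func main() {
	algo, err := checksumAlgorithm(" X-Amz-Checksum-Crc32 ")
	fmt.Println(algo, err)
}
```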
Chris Lu	0c2576c3d0	ci: route Docker Hub pulls through a mirror to cut registry timeouts (#9904 ) * ci(s3tables): route Docker Hub pulls through mirror, drop unused buildx The integration jobs set up docker/setup-buildx-action only to docker pull/run images; the buildx bootstrap pulls moby/buildkit from registry-1.docker.io, which times out and fails the whole job before any test runs. These jobs never docker build with buildx, so the setup is pure overhead and an extra registry dependency. Replace it with a daemon registry-mirror pointing at mirror.gcr.io (a pull-through cache for Docker Hub) and retry the pre-pulls a few times. That removes the buildkit pull entirely and routes the rest through the cache, with graceful fallback to Docker Hub on a miss. * ci: route Docker Hub through mirror in remaining docker test workflows Same registry-1.docker.io timeout fix for the other integration jobs. s3-spark only docker pulls/runs an image, so drop the vestigial buildx setup and pull through the mirror with retries, matching s3-tables. kafka-quicktest, s3-proxy-signature, e2e and postgres build/compose and genuinely need buildx (e2e/postgres export a local layer cache, which the default driver can't), so keep it and just configure the mirror first — that way even the moby/buildkit bootstrap pull is served from the cache. Left samba/pjdfstest alone: they build-push to a local registry and pull from localhost, so buildx is required and there's no Docker Hub runtime pull to mirror.	2026-06-09 17:12:42 -07:00

14186 Commits