seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-06-09 18:32:43 +00:00

Author	SHA1	Message	Date
Chris Lu	b13463880c	s3tables: scope management authorization to the caller's identity (#9961 ) * s3tables: resolve account-less identities to a distinct principal Static identities with no account block default to the shared admin account, so getAccountID returned "admin" for every such user and the permission checks treated them all as the admin principal. Only keep the admin account when the identity actually carries an admin action; otherwise fall back to the unique identity name. * s3tables: limit the open-by-default fallback to anonymous access The legacy permission path allowed any request that no policy explicitly denied whenever default-allow was on, which is the zero-config default. That let an authenticated identity without table permissions reach table resources owned by others. Restrict the fallback to requests with no identity or the anonymous identity; authenticated callers must pass an explicit action or policy check. Zero-config and anonymous access are unchanged. * s3tables: drop the no-op ListTableBuckets account gate The top-level check passed the principal as its own owner, so it always allowed. Per-bucket filtering in the loop is the real authority; remove the dead gate and the now-unused locals. * s3tables: derive the Iceberg catalog's default-allow from auth state The Iceberg catalog reuses the S3 Tables Manager, which hardcoded default-allow on. Authenticated callers were enforced only because the identity struct happens to propagate into the handler; if it were ever dropped, a secured catalog would fall open. Mirror the S3 port and set the Manager's default-allow from the authenticator, so an authenticated caller is enforced regardless. Shell and admin keep their own trusted Manager. Regression test covers the struct, name-only, and admin paths. * s3tables: drop redundant ACTION_ADMIN string conversion ACTION_ADMIN is an untyped string constant, so the conversion is a no-op. * s3tables: enforce name-only authenticated callers, add trusted bypass defaultAllowFor treated a request with no identity object as anonymous, but the Manager path forwards only the identity name (not the struct). A name-only authenticated caller could therefore be misclassified as anonymous and allowed under the open default. Treat a server-set identity name as authenticated too, and add an explicit trusted flag for the local shell/admin tooling that legitimately bypasses authorization. * s3tables: trim verbose comments	2026-06-14 13:55:36 -07:00
Chris Lu	7e608c877a	refactor(ec_balance): make the balance planner per-volume ratio-capable (#9960 ) * refactor(ec_balance): make the balance planner per-volume ratio-capable Thread a per-volume EC ratio through the balance planner: Plan resolves each volume's data/parity from a new Options.VolumeRatio (falling back to the collection Ratio, then the build default, when it reports 0), and keys the global phase's ratio maps by volume instead of collection. The shell and worker balance paths build the per-volume lookup from each shard's heartbeat via the new ecbalancer.VolumeShardRatio. In OSS this is behavior-preserving: VolumeShardRatio returns 0 because the per-volume data_shards/parity_shards heartbeat fields are an enterprise feature, so every volume falls back to the collection ratio -- the existing standard-scheme behavior. The refactor keeps the shared planner in sync with the enterprise fork, which overrides VolumeShardRatio to classify and spread a mixed-ratio collection by each volume's own data/parity split. * perf(ec_balance): hoist the collection ratio out of the per-volume loop The collection ratio is constant for every volume in a collection, so resolve it once per collection instead of per volume; a custom Ratio func may do map lookups or locking. Addresses a review comment.	2026-06-14 11:33:31 -07:00
Chris Lu	4fb3e22a01	fix(tiering): never delete a shared remote object while replicas still reference it (#9942 ) * tiering: stop a shared remote object being deleted while replicas still point at it A remote-tiered volume's .dat content lives only in one cloud object that all N replica .vif files point at. Deleting that object while destroying any one replica, or before a downloaded replica is durable, bricks the survivors. - volume.tier.move cleanup now deletes old replicas with keepRemoteData=true so surviving replicas keep the shared object. Document why the alreadyPlaced anchor needs no replica sync (same-object replicas are byte-identical). - VolumeTierMoveDatFromRemote now fsyncs the downloaded .dat, fsyncs the containing directory, trims the .vif (fsynced) and swaps to the local DiskFile BEFORE deleting the remote object, on both the keep-remote and delete paths. Only the final DeleteFile is gated by keep_remote_dat_file, so a keep-remote download leaves the replica served from local disk rather than the shared object, and a crash before delete merely leaks the object. - volume.tier.download keeps the shared object for every replica except the last, which deletes it. - s3 and rclone download paths fsync the .dat before close. * storage: swap the volume data backend under the data lock The tier-download swap closed v.DataBackend and assigned the new local DiskFile without holding dataFileAccessLock, racing concurrent reads/writes (use of a closed file / nil deref). Add an exported Volume.SwapDataBackend that performs the close-and-replace under the lock, and call it from the tier download. * server: skip directory fsync on Windows in the tier download path os.Open(dir).Sync() is unsupported on Windows and returns an error, which would fail VolumeTierMoveDatFromRemote entirely there. Skip the directory fsync on Windows, matching how the storage-side helper tolerates the unsupported case. * shell: make multi-replica tier.download resilient to already-local replicas If a multi-replica download is interrupted and retried, a replica made local in the prior attempt returns "already on local disk", which aborted the whole command and left the remaining remote replicas dangling. Treat that case as a skip-and-continue so a retry completes the rest. * server: assert downloaded .dat content, not just length, in the tier test A length-only check passes even if the bytes are corrupted; compare the full content of the local .dat against the original.	2026-06-13 20:09:00 -07:00
Chris Lu	c2591b4395	fix(replication): verify-before-destroy in VolumeCopy, check.disk, and over-replication trim (#9943 ) * volume: verify before destroy in VolumeCopy and replication repair Four data-safety fixes around copy/repair paths that could destroy or resurrect data before verifying the source or survivors. (a) VolumeCopy no longer deletes a pre-existing local replica up front. The delete is deferred until ReadVolumeFileStatus on the source succeeds, so a transient source outage (or a retry after one) can no longer wipe a healthy destination replica. Gated on source readability only; size/count comparisons are intentionally not used because they invert legitimately after divergent vacuum/compaction. Mirrored in the Rust volume server. (b) volume.check.disk no longer resurrects vacuumed-deleted needles. A key present-and-live on the source but entirely absent on the target is ambiguous: it may be a genuine missing write, or a needle deleted on the target and then vacuumed (its index entry and any tombstone are gone). An individual needle AppendAtNs has no monotonic relation to a vacuum watermark, so the old cutoff heuristic could not tell them apart. Without positive proof the absence is a missing write, the safe default is to NOT push it back. Tradeoff: a real missing write may go unrepaired until a tombstone-aware path exists, but we never raise back deleted data. (c) Over-replication trim no longer resurrects needles or removes the wrong replica. The pre-delete sync now runs read-only (divergence check only) instead of writing the doomed replica's needles into the survivor. pickOneReplicaToDelete only ever removes the smallest of multiple healthy writable replicas; it refuses the trim when doing so would leave only read-only/integrity-flagged survivors, since file_count>0 alone cannot prove the survivor's .dat is readable. (d) Incomplete-volume (.note) cleanup keeps the shared .vif when an .ecx for the same vid coexists on the disk, so removing an interrupted regular copy cannot strip a coexisting EC volume's info file. VolumeCopy now surfaces .note write/remove errors instead of ignoring them. In the Rust volume server (where a persisting note is actually reachable) the .note check moves below the empty-stub sweep and EC validation, keeps the .vif on EC coexistence, and the mount path fails when a .note still persists. * shell: scope the over-replication writable-survivor guard to the trim path only The writable-survivor guard (never trim down to a read-only survivor) lived inside the shared pickOneReplicaToDelete, so it also gated the misplaced-volume relocation via pickOneMisplacedVolume -- a misplaced read-only volume (e.g. a full one) would silently stop being rebalanced. Extract pickSmallestReplica for the relocation path (which deletes-and-recreates and must act on read-only replicas), and keep the writable-survivor guard only in pickOneReplicaToDelete used by the over-replication trim. * seaweed-volume: recompute keep_vif after invalid-EC cleanup in the .note path keep_vif used the pre-validation ecx_exists snapshot, so when the EC-validation step above removed the invalid .ecx/shards, the .note cleanup still preserved a now-orphaned .vif. Re-check .ecx existence at cleanup time, matching the Go hasEcxFile re-check. * shell: keep placement when picking an over-replication victim to delete The trim picked the smallest writable replica without regard to placement, so it could delete the only replica in a required failure domain (e.g. with "100" and replicas dc1 + two in dc2, deleting dc1 leaves both survivors in dc2). Prefer a writable replica whose removal still satisfies placement, falling back to the smallest writable only when none does.	2026-06-13 20:05:33 -07:00
Chris Lu	3718301599	shell: stop ec.encode/ec.rebuild from destroying live EC shards (no crash needed) (#9939 ) * shell: stop ec.encode/ec.rebuild from destroying live EC shards Three operator-triggered shell paths could destroy data with no crash: ec.encode -volumeId on an already-EC volume tore down its shards before failing. The volume-id path never checked the id was a regular volume: the collection lookup scans only VolumeInfos (so an EC-only id maps to ""), and volumeLocations succeeds via the EC-location fallback, so clearPreexistingEcShards full-teardown-deleted every shard cluster-wide before doEcEncode failed. An EC volume has no .dat, so this is its only copy. Add assertEncodableRegularVolumes: each requested id must be a regular volume in the topology snapshot; an EC-only or unknown id is refused before any teardown. A volume present as both a regular .dat and stale orphan shards (a failed-encode retry) still passes. This closes the operator-rerun/script-retry path; a worker racing the snapshot is a fencing problem handled separately. ec.rebuild dry-run (the default, without -apply) still issued real VolumeEcShardsDelete RPCs: prepareDataToRecover appended every would-copy shard to copiedShardIds even though the copy was skipped, and the cleanup defer deleted that set unconditionally. Now a dry-run copies nothing and records nothing to delete (a separate would-copy counter drives the recoverability check so the dry-run still reports its plan), and the cleanup runs only under -apply. ec.rebuild could also self-destruct a live shard: localShardsInfo was overwritten per disk instead of unioned, so a shard the rebuilder holds on a non-last disk looked remote, got copied onto itself (in-place O_TRUNC) and then node-wide deleted. Union local shards across all disks, and never copy/delete a shard whose only listed holder is the rebuilder itself. * shell: address ec destructive-guards review comments - countLocalShards: union shards across all of the rebuilder's disks so slot accounting matches what prepareDataToRecover treats as local; first-match counting overstated slotsNeeded on multi-disk rebuilders - VolumeEcShardsCopy: resolve SourceDataNode via pb.NewServerAddressFromDataNode instead of the raw node id, which may not be a dialable host:port - assertEncodableRegularVolumes: skip nil DiskInfo map entries, matching the other topology walks in this file; rename ecOnly to hasEcShards since the map marks any volume with shards, not only shard-only ones	2026-06-12 22:30:17 -07:00
Chris Lu	34f9b91d69	fix(storage): never let an empty .dat delete healthy distributed EC shards (#9930 ) * fix(storage): never let an empty .dat delete healthy distributed EC shards A leftover empty .dat stub (a phantom from the pre-fix loader; zero needles) next to a distributed EC volume's local shards made startup classify the volume as an interrupted local encode: validateEcVolume requires >= dataShards local shards when a .dat is present, fails with the 1-2 shards a distributed volume keeps per disk, and the cleanup deletes those shards -- the only copies of that part of the volume. Repeated across restart waves this destroys enough shards cluster-wide to make the volume unrecoverable. Go: - loadExistingVolume: hoist the empty-stub sweep above the EC presence checks. Previously the .vif-next-to-.ecx guard returned before the sweep ever ran, so exactly the dangerous layout (stub + .ecx + local shards) kept its stub and then lost its shards in loadAllEcShards. - validateEcVolume / checkDatFileExists: treat a .dat <= a superblock (zero needles) as absent. An empty .dat cannot be the encode source, so it must never gate shard deletion; this also covers stubs without a .vif, which the sweep cannot prove are EC leftovers. Rust mirror (seaweed-volume): the same gate in validate_ec_volume and check_dat_file_exists (the Rust sweep already ran before validation); the volume-load skip keeps a plain existence check so fresh, needle-less volumes still load. Regression tests in Go and Rust reproduce the production layout (a zero-byte .dat beside .ecx/.ecj and two shards of a 10+4 volume, with and without a .vif) and fail without the fix with the shards deleted. * fix(ec): gate source volume deletion on a recoverable shard set After EC encode, the shell command and the (plugin) worker task refused to delete the source volume unless every shard was present, and aborted otherwise -- leaving the source .dat next to live shards, exactly the mixed state the startup cleanup mishandles. Replace the full-set requirement with a recoverability gate shared by both callers (RequireRecoverableShardSet): deleting a non-empty source .dat requires at least dataShards distinct shards cluster-wide. Below that the source is kept and the encode fails as before. A degraded but recoverable set (>= dataShards, < total) now proceeds with a warning instead of aborting: the missing shards can be rebuilt from the survivors, while keeping the source would preserve the dangerous mixed state. Empty stub replicas are still swept unguarded (OnlyEmpty) -- an empty .dat has nothing to lose. dataShards/totalShards stay parameters so enterprise custom EC ratios share the helper verbatim. * test(ec): use recoverable shard verification gate	2026-06-11 20:26:20 -07:00
Chris Lu	42030381ae	shell: volume.tier.move can move volumes between data centers (#9925 ) * shell: volume.tier.move can move volumes between data centers -fromDataCenter scopes volume selection to volumes with a replica in that data center. -toDataCenter constrains move destinations and replication fulfillment. With identical disk types both flags are required, moving full volumes between data centers on the same tier. * shell: assert node identity in data center filter test * shell: tier move resumes when the volume is already on the target A replica already on the target tier and data center, typically left by an interrupted earlier run, anchors the move: skip the copy and only complete replication fulfillment and old replica cleanup. Previously such volumes hit the no-destination path and the stale source replicas were never removed.	2026-06-11 10:46:34 -07:00
Chris Lu	79ac279fe1	fix(ec): don't mix EC shards from different encode runs (#9880 ) * feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns (encode time in unix nanos). It rides in .vif and the read request so a read can be scoped to the encode run that produced the index. * fix(ec): stamp each encode and reject cross-run shard reads Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the shard's owning volume (resolved together via FindEcVolumeWithShard, so a multi-disk server validates the disk that actually serves the bytes) and reject a shard from a different encode run, recovering from parity. A zero on either side (pre-upgrade volume) skips the guard. * fix(ec): stamp the encode identity on the worker-generated .vif The worker-local encode path now writes EncodeTsNs (and the resolved EC ratio) into the .vif, so the read guard is not silently off for volumes encoded by the maintenance worker. * fix(ec): wipe stale EC artifacts before re-encoding VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and removes its on-disk shard/index/sidecar files before writing fresh ones, so a retried encode never builds on a partial prior run and the unlink frees the inodes instead of leaving open fds serving old bytes. * fix(ec): unmount EC shards across all disks UnmountEcShards walked only the first disk holding the shard, leaving a duplicate copy mounted on a sibling disk (split-disk reconciled volumes) still serving and heartbeating. Traverse every disk and emit one deletion delta per disk. * fix(ec): delete orphan shards without a local .ecx deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx, so it could not clean an orphan .ecNN left by a failed copy on a disk with no index. Delete the requested shard files unconditionally; the index-file (.ecx/.ecj/.vif) routing stays gated as before. * fix(ec): clear stale EC shards cluster-wide before re-encoding ec.encode unmounts and deletes EC shards for the target volumes on every node before regenerating: fatal for the shards the topology reports (mounted leftovers), best-effort for the rest (a sweep that catches unmounted failed-copy orphans). A down node is a no-op. * fix(ec): don't nil EC fds on close so reads can't race eviction A reader resolves an EcVolume/shard under the lock then reads after it is released, so an eviction that nils ecxFile/ecdFile would race that read and panic. Close the fds without nilling the fields: the field is now write-once (no data race) and a concurrent read hits a closed fd, getting a clean error that the caller recovers from parity. * fix(ec): wipe stale EC artifacts on every disk and surface failures The pre-encode wipe only deleted beside the source volume, so a stale shard on a sibling disk survived and could be mounted against the new index at reconcile. Sweep every disk. Removal also ignored os.Remove errors, reporting a failed cleanup as success and letting a stale shard join the next generation; surface the first real failure (treating already-gone as success) from removeStaleEcArtifacts and the shard delete. * fix(ec): log when a local shard is skipped for a different encode run The cross-run guard returned errShardNotLocal, indistinguishable in logs from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so operators can tell "wrong encode generation" from "shard not here". * fix(ec): surface metadata removal failures in the shard delete path deleteEcShardIdsForEachLocation still dropped os.Remove errors on the .ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index condition this path prevents, so route those through removeFileIfExists and return the first real failure instead of reporting cleanup as success. * fix(ec): fail orphan cleanup when a reachable node's delete fails The pre-encode orphan sweep swallowed every error for unreported (node, volume) pairs. That is only safe for an unreachable node, which cannot receive this encode's new generation. A reachable node whose delete genuinely failed (permission/IO) keeps an orphan shard that a later copy re-stamps with the new run's volume-level .vif identity, so the read guard would accept stale data. Surface those; stay best-effort only for unreachable nodes (gRPC Unavailable / no status). * fix(ec): guard ecjFile under its lock in the EC delete path EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved its .ecx lookup before a concurrent eviction (the generate-time UnloadEcVolume) could then reach the journal append with a nil fd. Bail with a clear "volume closed" error under the lock instead. * fix(ec): reject an unstamped shard when the caller has an encode identity The read guard required both identities nonzero, so a current (stamped) caller accepted a holder with identity 0 and could be served a stale pre-upgrade shard. Reject when the caller is stamped and the holder differs (including unstamped); stay lenient only when the caller itself has no identity (pre-upgrade reader). A skipped shard recovers from parity. * fix(ec): full-teardown delete so cluster cleanup wipes a whole generation The pre-encode cluster sweep deleted only the listed canonical shards on remote nodes, leaving index/sidecar (and, on builds with versioned generations, those too) behind. Add a full_teardown flag to VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode cleanup paths set it. Other delete callers (balance/decode/repair) are unchanged. * fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while Close nils it under the lock, a data race on the field. Take the lock first, then nil-check inside, in both. * fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success An old volume server silently ignores full_teardown and returns success for an ordinary delete, so the caller wrongly believes the generation was wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done in the response; the worker destination cleanup fails when it is absent, and the shell cluster sweep fails for a reported (mounted) leftover while staying best-effort for an unreported node. encode_ts_ns stays an accepted transient (an old server just skips the new read guard, no regression). * fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown A reachable pre-upgrade server ignores full_teardown and returns success without wiping an orphan, which a later copy then folds into the new generation. Treat a missing full_teardown_done ack as fatal for every reachable node (best-effort only for a gRPC-unreachable one), not just for topology-reported pairs. * fix(ec): return the served shard identity and validate it client-side The encode identity was only enforced server-side, so a pre-upgrade server ignored the request field and served bytes unchecked. Echo the served shard's EncodeTsNs on every read response chunk and have the client reject a mismatch (including 0 from an old server), so the guard holds regardless of server version; a rejected read recovers from parity. * fix(ec): reject a short/empty remote shard read instead of serving zeros doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and returned success with a partly zero-filled, unvalidated buffer (the server stamps the identity only on chunks that carry bytes). A non-deleted interval must arrive whole: require n == len(buf), exempting the is_deleted short-circuit (n=0), matching readLocalEcShardInterval's local check. A short read now fails so the caller recovers from parity. * test(ec): fake volume server echoes the full_teardown acknowledgement The worker now fails a teardown delete that isn't acknowledged (so a pre-upgrade server can't silently skip the wipe). The fake server's no-op VolumeEcShardsDelete returned an empty response, which the worker read as a skipped teardown and aborted the encode. Echo full_teardown_done. * feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif and rejects a read served from a shard of a different run; full_teardown wipes a whole generation and acknowledges it. The Rust volume server had none of it. Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume, stamp it on every read response, and reject a request/response mismatch on both the server and the distributed-read client (recovering from parity); handle full_teardown by evicting the volume and wiping every EC artifact on each disk, echoing full_teardown_done so the caller can detect a server that ignored it. * fix(ec): remove a stale .vif on full teardown of a shard-only node A shard copy installs shards + .ecx before .vif, so an interrupted copy after a teardown could mount the new files under the previous run's identity / version / shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during full teardown, gated on .idx absence so a source-volume holder keeps its live .vif. In Rust this lives in a teardown-only helper so the reconcile / load- fallback paths (which share the base removal) still preserve .vif. * fix(ec): treat a missing teardown ack as fatal, not as an unreachable node isNodeUnreachable returned true for any non-gRPC-status error, so a reachable pre-upgrade server's missing full_teardown_done ack (a plain error) was classified unreachable and the unreported pair was silently skipped. Classify only a real codes.Unavailable as unreachable, and wrap the missing ack in a sentinel the sweep treats as fatal regardless. A genuinely down node still surfaces as Unavailable from the RPC and stays best-effort. * fix(ec): reject a short shard read in the local EC needle reader read_ec_shard_needle ignored the byte count from shard.read_at and appended the whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later length check and parsed as garbage. Require n == buf.len() per interval, erroring on a short read like the local interval reader already does. * fix(ec): probe reachability before skipping a node that returns Unavailable The pre-encode sweep skipped any node whose teardown delete returned codes.Unavailable, but a reachable volume server in maintenance mode also returns that code for the maintenance-gated delete, so its stale EC files were left behind on a node that can still receive the new generation. Confirm with a non-maintenance-gated empty-target Ping: skip only when the node fails the probe too (genuinely unreachable). * fix(ec): use try_exists for the teardown .vif .idx guard The teardown-only .vif removal gated on Path::exists(), which returns false on a permission/IO stat error, so a stat failure on a present .idx would read as a shard-only node and delete the live source volume's .vif. Gate on try_exists() == Ok(false) instead, preserving the sidecar on any stat error. * fix(ec): only skip a sweep node when a Ping confirms it is transport-down The pre-encode sweep skipped a node whenever its teardown delete and a liveness Ping both failed, but it treated ANY Ping error as down — an application-level Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a reachable node's stale generation in place. Classify the Ping tri-state and skip only when it transport-fails with codes.Unavailable; a reachable or inconclusive node stays fatal. * fix(ec): exclude sweep-skipped nodes from the encode's rebalance The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance then recollected the current topology — a node that recovered between the two could become a copy target and receive the new generation while still holding its stale, never-cleared shards. Have the sweep return the skipped set and exclude those nodes from the rebalance for this encode, so a node we could not clean cannot receive the new generation. Standalone ec.balance is unaffected. * fix(ec): re-sweep recovered nodes before generation so they aren't stranded A node skipped as down by the pre-encode sweep is excluded from the rebalance, but it can recover and become the generation host — mounting all shards locally, then being excluded from distribution. Union-only verification accepts all shards on one node and deletes the originals: a single point of failure. Re-sweep the skipped nodes just before generation; one whose teardown now succeeds leaves the skipped set and rebalances normally, while a node still down stays skipped. * fix(ec): abort the encode if a selected source is still skipped after re-sweep The re-sweep un-skips a recovered node, but the source was selected before it and a node can stay down through the re-sweep then recover just in time to be the generation host — mounting all shards locally while still excluded from the rebalance, which union-only verification accepts before deleting the originals. Abort the encode when a selected source remains skipped after the re-sweep. * fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch If a volume is not EC at the batch-delete classification but is encoded to EC and its .dat deleted before the regular-volume mutation, the mutation returns an exact "not found" that the filer chunk-GC treats as completed, dropping the delete. Recheck EC presence under the mutation lock and return a retriable 503 with the "try again" token so the filer requeues it onto the EC path. * fix(ec): recheck EC state before the regular batch-delete mutation ec.encode mounts EC shards (copied from the .dat) before deleting the originals, so a volume can be EC while its .dat still exists. The batch delete only rechecked EC after a NotFound, so a successful regular-volume delete in that window wrote a tombstone to the soon-removed .dat — the delete was lost and the needle resurrected from the pre-tombstone shards. Recheck has_ec_volume under the write lock before delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path. * fix(volume): make the metrics push test independent of test order test_push_metrics_once asserted the pushed body contains the request-counter family without ever touching the counter — a CounterVec with no children emits nothing, so the assertion only held when another test had already created a labelset in the shared registry. Create one in the test itself.	2026-06-10 22:31:18 -07:00
Lisandro Pin	6b4d20a6f3	`volume.scrub` and `ec.scrub` shell commands: make the display of scrub details optional. (#9911 ) On volumes failing scrubs, the detail output can get very verbose, which makes reading results difficult. Most users won't care about this information to begin with - just whether or not volumes pass scrub tests. This MR gates the display of scrub result details behind a `--details` flag.	2026-06-10 13:29:07 -07:00
7y-9	d569dd686f	fix(shell): move files into existing destination directories (#9887 ) * fix(shell): move files into existing destination directories Problem: fs.mv /src/file /dst/dir treats an existing destination directory as a destination file path, so it renames the source to /dst/dir instead of moving it into /dst/dir/file. Root cause: commandFsMv builds the destination LookupDirectoryEntryRequest with Directory and Name swapped, so the destination directory lookup misses. Fix: Populate LookupDirectoryEntryRequest with Directory=destinationDir and Name=destinationName before deciding whether the destination is a directory. Reproduction: env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -run TestFsMvMovesIntoExistingDestinationDirectory -count=1 Validation: gofmt -w weed/shell/command_fs_mv.go weed/shell/command_fs_mv_test.go; git diff --check; git diff --cached --check; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -run TestFsMvMovesIntoExistingDestinationDirectory -count=1; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -count=1 * Update weed/shell/command_fs_mv_test.go Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-06-08 23:42:13 -07:00
Chris Lu	8cc10460b4	fix(remote): correct content and permissions when syncing/caching remote objects (#9879 ) * fix(remote): reject short reads when caching remote objects A short read from the remote (stale listing size, truncated or flaky response) was silently zero-padded: the S3 and Azure clients pre-size the buffer and discard the downloaded byte count, and the chunk is recorded with the requested size. The cached file then matched the expected size but its tail was NULL, and the entry was marked cached so it never re-fetched. Check the byte count against the requested size in both clients, and add a backend-agnostic guard in FetchAndWriteNeedle. The cache now fails loudly and the entry stays remote-only for a later retry. * fix(remote): match S3 default modes when syncing remote metadata Remote object listings carry no POSIX mode, so synced entries were created with a hardcoded 0644. Against a SeaweedFS remote, whose S3 layer writes objects as 0660 and auto-creates directories as 0771 (0660\|0111), the mounted copy ended up 0644/0755 and the permissions visibly diverged from the source. Default to the S3 modes instead (files 0660, directories 0771). The filer derives parent-dir modes from the child as fileMode\|0111, so fixing the file default also brings the directories into line. Directory mtimes still reflect sync time: S3 listings don't enumerate directories, so the remote's directory timestamps aren't available.	2026-06-08 13:55:53 -07:00
Chris Lu	f0d2a0d417	Treat co-located volume servers as one fault domain when balancing and allocating (#9854 ) * admin/topology: carry the volume server address on DiskInfo The planning DiskInfo exposed only the node id, which can be an opaque label rather than ip:port. Record the address too so callers can resolve the physical machine a disk sits on. * ec.balance: spread a volume's shards across machines, not just nodes Volume servers sharing a host are one fault domain, but the within-rack spread treated them as independent nodes, so one box could end up holding more shards of a volume than EC can afford to lose. Add a machine (host) tier between rack and node: the within-rack pass spreads each volume across machines, and the global load phase no longer re-concentrates a volume onto a machine it already sits on. Host defaults to the node id, so clusters with one server per host are unchanged. * ec placement: prefer machines holding fewer of a volume's shards EC allocation and repair picked the least-loaded node in a rack with no regard for which physical machine it sits on, so a volume's shards could pile onto several servers of one box. Rank candidate nodes by their machine's shard count first, then the node's own. The machine is derived from the volume server address carried on DiskInfo, falling back to the node id, matching how the balancer resolves it. * volume.balance: don't move a replica onto a machine already holding one isGoodMove only rejected a move onto the same data node, so two replicas could land on two volume servers of one box and a single machine failure would lose both. Reject a target whose host already holds another replica of the volume. Best-effort: balancing simply skips and tries the next target. * volume allocation: spread same-rack replicas across machines PickNodesByWeight filled the same-rack replica picks by weight alone, so replicas could co-locate on one box. Prefer candidates on not-yet-used hosts, falling back when too few distinct machines exist. Data-center and rack tiers have no host, so their ordering is unchanged. * ec.balance: harden machine spread against re-concentration and capped machines Two cases where the machine-aware spread could still leave a volume badly placed: - The global load phase could move a shard of a volume onto a machine that already held it, raising that machine's count and undoing the within-rack spread (a 4/4/3/3 layout could become 3/5/3/3, past parity for 10+4). Limit the load-only fallback to same-machine moves, which leave a machine's count unchanged; cross-machine concentration is no longer allowed for load alone. - The within-rack spread chose a destination machine by free slots alone, so if that machine's only nodes were already at the SameRackCount cap it skipped the move instead of trying another machine. Require a machine to have a node that can actually take the shard before selecting it. * reduce comments across the machine-affinity change Trim narration down to the non-obvious why; one terse line where a block was overkill. * ec.balance: gate machine spread on fault-tolerance feasibility Spreading a volume evenly across machines only helps when there are enough that each can stay within EC's parity tolerance (numMachines >= ceil(total/parity)). With fewer -- or wildly unequal -- machines it can't make a machine loss survivable anyway, and forcing it fights capacity: e.g. a cluster of 12 volume servers on one host and 2 on another would have half of every volume crammed onto the 2-server box. So spread across machines only when it's achievable; otherwise fall back to per-node spread and let capacity/global balancing decide. The global load phase applies the same test: it protects a volume's machine spread (no cross-machine move that raises a machine's count past the source's) only where that spread is achievable, so heterogeneous clusters still level by fullness. * ec.balance worker: group servers by host when planning The worker built its planner topology without recording each server's host, so automated ec.balance treated ports on one machine as independent nodes and could concentrate a volume's shards on one physical box. Set the host from the volume server address, matching the shell path. * volume.balance worker: don't move a replica onto a machine holding one The worker compared only node ids, and the replica map dropped the server address, so it could move replicas onto different ports of one machine. Carry the host on ReplicaLocation (from the server address) and reject a target whose host already holds another replica of the volume. Best-effort, matching the shell. * ec.balance: judge machine-spread feasibility by the rack's shards The within-rack and global feasibility checks compared the whole volume's shard count against a rack's machine count, so a rack holding only part of a volume after cross-rack spreading -- e.g. 7 of a 10+4 volume across 2 machines -- was wrongly judged infeasible and fell back to node spread, which could pile 6 shards onto one host, past parity. Gate on the rack's own shard count of the volume instead. * ec.balance: spread a volume's shards across machines by combined count EC recovers from any loss within parity regardless of shard type, so what bounds a machine's exposure is its total shards of the volume, not data and parity separately. Spreading the two independently let each type's remainder land on the same machine -- ceil(d/M)+ceil(p/M) can exceed ceil(total/M), e.g. a 5/3 split where 4/4 was achievable, past parity. Balance the combined count in one pass; disk-level data/parity anti-affinity stays in pickBestDiskOnNode. * ec.balance: don't let the imbalance threshold skip an over-parity machine The within-rack spread gated on relative skew ((max-min)/avg > threshold), so a worker threshold of 0.5 skipped an exactly-50%-skewed layout like 5/4/3 for a 10+4 volume, leaving 5 shards -- past parity -- on one machine. The even cap (ceil(shards/groups)) is the real bound and the move loop already sheds only what exceeds it, so drop the threshold gate from the within-rack phase (machine and node): a balanced rack stays a no-op while any over-cap machine is always fixed. * ec.balance: keep the imbalance threshold for the node fallback Dropping the threshold from the whole within-rack phase made the node fallback too eager: it runs only when machine fault tolerance is unachievable, so it is cosmetic load distribution that should defer to the global utilization phase. Without the gate it would, for a one-server-per-host 6/4 split at threshold 0.5, schedule a count move that worsens utilization balance. Restore the threshold there; machine spreading keeps bypassing it, since that bound is durability, not cosmetic skew.	2026-06-07 14:14:45 -07:00
Chris Lu	6e16994615	s3: make lifecycle TTL fast path per-bucket opt-in (#9825 ) Stamping an Expiration.Days rule as a volume TTL at write time bakes an irreversible TTL into the object: removing or lengthening the rule later can't un-expire it, unlike worker-driven expiration. The metadata-only delete it enables also skips per-chunk DeleteFile, so dead bytes linger in a not-yet-expired TTL volume with no deleted-byte accounting until the whole volume ages out. Gate the resolver on a per-bucket flag, off by default; toggle with the s3.bucket.lifecycle.fastpath shell command. Default writes take the worker path: real deletes that honor current policy and let vacuum reclaim space.	2026-06-06 11:20:15 -07:00
Chris Lu	ca81c0c525	fix(ec): pass per-volume data-shard count to the parity-shard split (#9781 ) * fix(ec): pass per-volume data-shard count to the parity-shard split ShardsInfo.DeleteParityShards/MinusParityShards looped ids 10..13, assuming the fixed 10+4 layout. For a non-default ratio this splits data vs parity wrong — a wide ratio (12+4, 16+6) drops real data ids >= 10, which breaks ec.decode. They now take a dataShards argument (<= 0 falls back to DataShardsCount) and clear ids dataShards..MaxShardCount. ec.decode threads the data-shard count from collectEcNodeShardsInfo to both split call sites, and admin LogicalSize passes DataShardsCount. Also: EC cleanup now sets an explicit per-disk storage impact (-len(ShardIds)) instead of falling back to the TotalShardsCount constant, so freed-capacity accounting matches the shards actually removed. OSS is always 10+4, so behavior is unchanged here; this keeps the split ratio-correct and the API aligned with the enterprise per-volume override. Adds parity-split ratio tests. * ec: clear parity shards in one locked pass Address review: DeleteParityShards looped si.Delete, taking the lock once per id. shards is sorted by Id and shardBits is a bitmap, so mask off the high bits and truncate the sorted slice at the first parity id (binary search) under a single lock. Preserves the dataShards<=0 -> DataShardsCount default.	2026-06-01 19:25:15 -07:00
Chris Lu	2386fa550a	grpc: don't tear down the shared master connection on a caller's own timeout (#9775 ) A Canceled/DeadlineExceeded from the caller's per-request context was treated like a dead channel: it closed the shared cached ClientConn and cancelled every other in-flight RPC on it with "the client connection is closing". Under a burst of concurrent chunk assigns (e.g. a large S3 multipart upload) one slow assign hitting its 10s attempt timeout could poison the connection for all the rest, cascading into a flood of 500s. Thread the caller's context into shouldInvalidateConnection and only invalidate on Canceled/DeadlineExceeded while that context is still live, which isolates the genuine stale-channel signal (a peer restart behind a k8s Service VIP). To carry the context, add a ctx parameter to the existing WithGrpcClient, WithMasterClient, and WithMasterServerClient; the master assign and volume-lookup paths pass their per-attempt context and every other caller passes context.Background().	2026-06-01 15:11:02 -07:00
Chris Lu	8c60408bfb	s3: auto-enforce bucket quota read-only both ways (#9774 ) * s3: auto-enforce bucket quota read-only both ways Quota read-only only ever flipped when an admin re-ran s3.bucket.quota.enforce, so a bucket that went over quota stayed read-only forever even after usage dropped back under. Fold enforcement into the per-minute, leader-locked bucket-size loop the s3 gateway already runs for metrics: it now flips each bucket's read-only flag to match its quota in both directions, rewriting filer.conf only when a flag actually changes. The set/clear decision lives in one shared FilerConf.ApplyBucketQuotaReadOnly helper so the shell command and the gateway can't drift. * only manage read-only when a quota is set, never clobber manual locks * trim comments	2026-06-01 13:11:18 -07:00
Chris Lu	f9ee49b03e	shell: volume.fsck must not skip the system-log subtree (#9764 ) shell: only skip system-log subtree in fs.meta.save, not fsck/verify The SystemLogDir skip lived in the shared BFS traversal, so volume.fsck built its in-use set without the /topic/.system/log chunks and flagged every referenced log needle as orphan. -reallyDeleteFromVolume would then delete live log data and leave dangling filer entries. Gate the skip behind a flag that only fs.meta.save sets.	2026-06-01 09:54:22 -07:00
Chris Lu	9658f309d2	EC bitrot detection: per-shard checksum sidecars (#9761 ) * ec: add EC bitrot checksum protobuf EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages, copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode. * ec: bitrot checksum sidecar format, validation, and per-volume load Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a self-integrity header; validation, rolling builder, backfill primitive, and EcVolume load on mount + removal on destroy. * ec: capture per-shard checksums at encode; verify-and-exclude on rebuild WriteEcFilesWithContext returns the protection computed inline during encoding. generateMissingEcFiles verifies present inputs against the sidecar, excludes corrupt ones, regenerates in place, and re-verifies; fail-closed unless unsafe_ignore_sidecar, removing all generated outputs on failure. * ec: read-only checksum scrub with Reed-Solomon arbiter ChecksumScrub verifies each local shard against the sidecar and reconstructs flagged shards from the clean shards so stale-sidecar false positives are not reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum. * ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant); remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and opportunistically backfills a sidecar when all shards are reachable. * ec: volume server bitrot config flags -ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16). * fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge flag value cannot overflow int64 and slip past the power-of-two check, and a block size cannot collapse a sidecar to a few oversized blocks. * fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path The worker EC encode wrote the generation-0 sidecar locally but never added it to shardFiles, so DistributeEcShards never shipped it and the distributed holders came up unprotected. Append it to shardFiles and map the ecsum shard type to its extension in the sender so it travels with the shards. * fix(ec_bitrot): remove orphaned sidecars when the generation is gone Gate sidecar removal on existingShardCount==0 alone rather than also requiring a stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays gated on hasEcxFile as before. * fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles ChecksumScrub's first return is blocks scanned, not files. Discard it so the scrub response's TotalFiles (a needle/file count) is not inflated by the block count for CHECKSUM mode. * test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles * fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast The header stores payload_len as a uint32; bound the payload before the conversion so a pathological manifest cannot truncate the length field and corrupt the sidecar. A real manifest is a few KB, so this never trips. * fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB The block size becomes the per-shard scratch buffer the scrub/backfill path allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent scrub worker. Lower the upper bound from 1024 to 64 MiB. * fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext, which fails closed on a malformed/stale .ecsum. Without an override flag an operator could not complete the rebuild without manually deleting the sidecar. Expose -ecUnsafeIgnoreSidecar (default false) and thread it through. * fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull Guard len(payload) against a plain int constant (1 GiB) before the allocation instead of a uint64 MaxUint32 compare, so the allocation-size value is provably bounded (clears the CodeQL overflow alert) and the math import is no longer needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and remove the now-redundant readFull helper (os.File.ReadAt fills the slice or errors). * test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32 * refactor(ec): fold the EcFiles WithContext variants into the base functions RebuildEcFiles now takes the ECContext directly (nil => derive from .vif as before) and WriteEcFiles takes it too (nil => default), removing the parallel RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an explicit context drop the WithContext suffix; the default-context callers pass nil. No behavior change. refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles Add a non-nil BackgroundECContext placeholder (analogous to context.Background()) and have callers with no specific layout pass it instead of a nil ECContext. WriteEcFiles resolves a zero/background context to the default ratio and RebuildEcFiles resolves it from the .vif, so behavior is unchanged. fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif - BackgroundECContext is now a function returning a fresh *ECContext, so callers cannot mutate a shared singleton or race on it (and it mirrors context.Background, which is also a function). - RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but- unreadable .vif fails closed instead of silently rebuilding with the default ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.	2026-05-31 18:52:44 -07:00
Chris Lu	fdfeb4063c	shell: warn in volume.list when a volume id spans collections (#9759 ) * shell: warn in volume.list when a volume id spans collections A reused volume id, the result of the master handing out an id already used by another collection (for example after losing its max-volume-id counter on restart), makes collection.delete destroy the wrong collection's data and makes any bare-id lookup, move, or vacuum ambiguous. volume.list now scans the full topology and warns on ids present in more than one collection so the clash is visible before any destructive operation. * volume.list: track duplicate ids lazily, sort with slices.Sort Allocate the per-id collection set only on the first cross-collection clash instead of one set per volume, so allocations scale with duplicates rather than the volume count.	2026-05-31 11:52:39 -07:00
Chris Lu	5955972fe6	fix(shell): verify volume.merge output before overwriting replicas (#9731 ) * fix(shell): verify volume.merge output before overwriting replicas volume.merge overwrote every replica with the merged copy without checking it was complete. Read back the merged copy and refuse to overwrite unless it holds at least as many live needles as the most complete source replica, leaving the originals intact on a short or empty merge. * fix(shell): keep merged volume until all replicas are rebuilt On a copy failure partway through the overwrite loop, the temporary merged copy was deleted along with the half-rebuilt replicas. Stop deleting it until every replica has been rebuilt; on failure the verified copy is kept so the merge can be re-run to completion. * refactor(shell): reuse readVolumeStatus in ensureVolumeReadonly * fix(shell): guard against nil volume status response	2026-05-28 19:29:25 -07:00
Chris Lu	24e664d651	fix(shell): don't halt volume.fsck purge on a stuck read-only volume (#9714 ) * fix(shell): don't halt volume.fsck purge on a stuck read-only volume A failed VolumeMarkWritable on one volume aborted the entire fsck purge run; per-volume errors now log and continue so remaining volumes still get purged. * fix(shell): unify volume.fsck per-volume skip logging at the caller Return the mark-writable error from purgeOneVolume instead of logging in two places — the caller already prints "skip purging volume N: %v" and defers still fire on the error return. * fix(shell): collect volume.fsck purge-skipped volumes and report at end Track volume IDs whose purge was skipped (mark-writable failure or other per-volume errors) and print a sorted summary so operators don't have to scrape the run log to find them. Deletes for those volumes are already skipped; this just makes them explicit.	2026-05-27 17:49:35 -07:00
Chris Lu	d4e39b499b	EC placement: shared replica-placement resolver, snapshot + Place core, capacity fixes, tiering (#9621 ) * Add shared super_block.ResolveReplicaPlacement; use it in ec_balance * Add ecbalancer.FromActiveTopology snapshot constructor for EC encode/repair * Add ecbalancer.Place greenfield/repair placement core (strict + durability-first) * topology: add GetEffectiveAvailableEcShardSlots; FromActiveTopology uses shard-granular free slots GetDisksWithEffectiveCapacity flattens reserved shard slots into volume slots via integer truncation, so an in-flight EC task reserving a non-multiple-of- DataShardsCount number of shards was lost from the snapshot and freeSlots was over-reported. GetEffectiveAvailableEcShardSlots subtracts the full reservation impact at shard granularity. * ecbalancer.Place: reject nodes without a free disk of the requested type FromActiveTopology keeps all disk types in the snapshot, so an SSD-only request could be routed to a node with only HDD capacity (pickBestDiskOnNode then returns disk 0 on the wrong tier). Filter rack/node selection to those with a free disk of the requested type. * ecbalancer.Place: enforce ReplicaPlacement DiffDataCenterCount (per-DC shard cap) * ecbalancer: enforce DiffDataCenterCount in balance (cross-DC phase + cross-rack DC cap) Adds a cross-DC corrective phase that drains data centers holding more than DiffDataCenterCount shards of a volume, and a per-DC cap on cross-rack move targets. Both are no-ops when DiffDataCenterCount is unset, so balance output is unchanged for non-DC placements. * topology: ratio-aware EC shard slots and provisional empty-disk slot GetEffectiveAvailableEcShardSlots now takes the target collection's data-shard count, so a 4+2 volume's larger shards are not over-counted at 10 per volume slot; and it keeps the one provisional slot for freshly started empty servers that report max=0, matching getEffectiveAvailableCapacityUnsafe. FromActiveTopology threads the ratio through. * ecbalancer.Place: explicit disk-type filter signal (fix HDD vs any ambiguity) HardDriveType normalizes to "", which collided with "" meaning any disk. Add Constraints.FilterDiskType and normalize both sides so a hdd request matches disks reported as "" and never leaks to SSD, while filter=false still means any. * ecbalancer: add clearShardAccounting for repair snapshot reconciliation Clears one disk's copy of a shard from per-domain accounting and recomputes the node-level union (preserving a kept copy on another disk of the same node), without crediting capacity. Repair uses it to drop to-be-deleted copies before placing missing shards. * ecbalancer: don't cap cross-DC target racks when DiffRackCount is unset len(racks)+1 wrongly limited each target rack (3 in a 2-rack cluster), so draining a DC could stop short of the DiffDataCenterCount cap. Use MaxShardCount+1 as the effectively-unlimited default. * topology/ecbalancer: ratio-correct EC capacity accounting Reservation shard slots (default ShardsPerVolumeSlot units) are now converted to the target ratio before subtracting, and existing EC shards are charged by size (targetDataShards/shardDataShards) so a 2+1 shard isn't counted as one 10+4 slot. Per-shard ratio lookup is behind shardDataShards (OSS uses the standard ratio). * ecbalancer.Place: candidate tiering and eligible-rack caps Adds a per-disk eligibility/preference abstraction so Place supports: - preferred-tag whole-plan retry (try disks carrying the earliest tags first, widen to all only if a tier cannot place every shard; reports SpilledOutsidePreferredTags), - soft disk-type spill via DiskTypePolicy (Any/Prefer/Require): Prefer fills the preferred type then spills, reporting SpilledToOtherDiskType; Require filters, - even per-rack caps that divide by racks holding an eligible disk, so a tiered cluster (e.g. SSDs in 2 of 4 racks) isn't capped impossibly low. Disk tags carried via Node.AddDiskTags + FromActiveTopology. * ecbalancer: export ClearShardAccounting for repair snapshot reconciliation * ecbalancer: address review feedback (ratio rounding, bitmap walk, same-DC moves) - topology/ecbalancer: round shard-reservation and existing-shard footprint up when converting to target-ratio shard slots, so a sub-slot reservation is not truncated to zero and free capacity is not overstated for low-data-shard layouts (targetDataShards < ds). - erasure_coding: add ShardBits.All iterator and use it across the balancer, cross-DC phase, and placement scoring instead of scanning 0..MaxShardCount and probing Has on every id. - ecbalancer: allow same-DC cross-rack moves when a DC already sits at its DiffDataCenterCount cap; a same-DC move leaves the DC total unchanged. Add a regression test that fails without the guard. - ecbalancer cross-DC phase: pick targets via the eligible-aware pickNodeInRackEligible/pickBestDiskEligible helpers so the disk-type filter is honored and a 0 disk id is not mistaken for a valid selection. * ecbalancer: test ecShardSlotsOnDisk fractional round-up Cover the mixed-ratio path (targetDataShards < existing data shards) so a shard's fractional footprint is never floored to zero and free capacity is not overstated. Exercises the round-up via the targetDataShards parameter; OSS uses the standard ratio at runtime while the enterprise build hits it with real per-volume ratios. * ecbalancer: assert node B rack in TestFromActiveTopology * ecbalancer: split Destination into separate DataCenter and bare Rack Replace the composite "dc:rack" Rack field on Destination with separate DataCenter and bare Rack values, matching topology.DiskInfo and the worker-task convention. Callers (and tests) read the data center directly instead of parsing the composite with strings.SplitN. * shell ec.balance: use utilization-based global balancing (parity with worker) The shell's global rebalance phase balanced by raw shard count; switch it to fractional fullness (shards/capacity), as the worker already does. On uniform capacity the two agree; on heterogeneous capacity it fills nodes proportionally instead of driving small-capacity nodes toward full. Updates the heterogeneous-capacity regression test to assert even fullness (~equal shards/capacity per node) rather than even shard count. * ecbalancer: bounded-proportional per-DC shard spread DiffDataCenterCount was enforced only as a ceiling (drain-to-cap), which could leave a within-cap-but-lopsided DC distribution under a loose cap (e.g. 10/4 of 14 with cap=10). Now the cross-DC phase, the cross-rack DC guard, and Place all target boundedMaxPerDC = min(DiffDataCenterCount, max(ceil(total/numDCs), parityShards)): shards spread proportionally across DCs, but no tighter than the durability floor (once each DC holds <= parityShards a DC loss is recoverable, so further spreading only adds cross-DC/WAN traffic). No-op when DiffDataCenterCount is 0; identical to before when the cap is the binding constraint. * ecbalancer: drop DiffDataCenterCount enforcement for EC placement The 1-byte volume ReplicaPlacement packs xyz into x100+y10+z<=255, so the DC digit can only be 0-2 -- far too small to be a meaningful per-DC EC shard cap (a cap of 1-2 would demand 7-14 DCs for a 10+4 volume). It's volume replica-placement, not an EC spec. Removes the cross-DC balance phase, the DC guard in the cross-rack phase, and the per-DC cap in Place (and the just-added bounded-proportional logic); EC relies on the RP-independent rack/node even spread instead. Rack/node caps (DiffRackCount/SameRackCount) are unchanged. Per-domain EC caps are left for a real EC placement spec. * ecbalancer: enforce per-disk durability cap; symmetric reserve/release Place now refuses to put more than parityShards shards of a volume on a single disk (pickBestDiskEligible skips a disk once it holds parityShards of the volume, a hard cap not relaxed even in durability-first). Previously Place assigned by free capacity, so a skewed near-full cluster could pile >parityShards onto one disk -> losing it loses the volume; only distinct-disk count was checked. This covers encode and repair (both route through Place); the caller skips/leaves the volume rather than minting an unrecoverable layout. Also makes reserveShard decrement freeSlots unconditionally, symmetric with releaseShard's unconditional increment (the old guarded decrement could credit a phantom slot on release if a shard were ever reserved onto a full disk). * ecbalancer: add Topology.ReleaseVolumeShards (clear + credit) for greenfield encode Releases all of a volume's shards from the snapshot and credits the freed disk capacity, so a greenfield encode can plan as if stale EC shards from a prior failed attempt are gone. Safe to credit because the encode task deletes stale shards (cleanupStaleEcShards) before distributing the new ones. Distinct from ClearShardAccounting (repair), which does not credit. * ecbalancer: ReleaseVolumeShards credits node freeSlots, not just disks releaseShard only increments per-disk freeSlots, but rack capacity is summed from node freeSlots (buildRacks) and node freeSlots gates node eligibility. Crediting only disks left a node/rack looking full after releasing stale shards, so a greenfield encode still couldn't use the freed capacity. Now credits the node by the total disk-slots freed. * ecbalancer: correct PlacementMode docs (encode uses durability-first) PlaceStrict was labeled '(encode)' but encode uses PlaceDurabilityFirst. Clarify that durability-first is used by both encode and repair, reports relaxations in PlaceResult.Relaxed, and never relaxes the per-disk durability cap. * ecbalancer: treat SameRackCount as a direct per-node shard cap The 3rd ReplicaPlacement digit now caps shards per node at exactly the digit value, matching how DiffRackCount (2nd digit) caps per rack, instead of allowing digit+1 per node. This makes the per-rack and per-node caps consistent and matches the documented "digits cap EC shards per rack and per node" semantics; e.g. 011 now means at most one shard per rack and one per node.	2026-05-22 20:22:09 -07:00
Chris Lu	cd15ae1395	fix(ec): bring ec.encode worker and EC/volume helpers to parity with shell (#9599 ) * refactor(volume): extract replica sync/select into shared volume_replica package Move the volume replica reconciliation helpers (status, union builder, SyncAndSelectBestReplica, ReadNeedleMeta) out of the shell into a new weed/storage/volume_replica package so both the shell (ec.encode, volume.tier.move, volume.check.disk) and the EC encode worker can reuse them. No behavior change. * fix(ec): bring ec.encode worker to parity with the shell - Sync replicas and encode the most-complete one (via the shared volume_replica.SyncAndSelectBestReplica) instead of a possibly-stale replica, marking all replicas readonly first. Prevents silent data loss when a stale replica is encoded and the originals deleted. - Skip remote/tiered volumes in detection (shell ec.encode excludes them). - Min-node safety gate: refuse to encode when cluster nodes < parity shards. - Align default thresholds with the shell (fullness 0.95, quiet 1h). * fix(vacuum): plugin path honors min_volume_age_seconds override deriveVacuumConfig hard-coded MinVolumeAgeSeconds=0, dropping any configured value. Read it from worker config (default 0, matching the shell/master vacuum which has no age gate) so an explicit override is honored. * address review feedback - config.go: align GetConfigSpec schema defaults (quiet_for_seconds=3600, fullness_ratio=0.95) with the runtime defaults so UI/bootstrap flows match the shell (coderabbitai). - ec_task.go: roll back readonly when markReplicasReadonly fails partway, so already-marked replicas don't stay readonly (coderabbitai). - volume_replica: pass the caller's replica statuses into buildUnionReplica instead of re-fetching them, and skip the per-needle ReadNeedleMeta RPC when the source replica is read-only (gemini-code-assist). * test(plugin_workers/ec): make fixtures eligible under the new defaults The default EC encode thresholds were raised to match the shell (fullness 0.95, quiet 1h), but the plugin-worker integration fixtures still used 90%-full / 10-minute-old volumes, so detection found no eligible volumes and the tests failed in CI. Bump the eligible fixtures to 96% full and 2h old.	2026-05-21 02:16:28 -07:00
Chris Lu	391f543ff2	fix(ec): correct multi-disk disk counting and EC balance shard attribution (#9594 ) * fix(shell): count physical disks in cluster.status on multi-disk nodes The master keys DataNodeInfo.DiskInfos by disk type, so several same-type physical disks on one node collapse into a single DiskInfo entry. cluster.status (printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting one disk per node instead of the real physical disk count, while volume.list and the admin ActiveTopology already split per physical disk. Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N same-type disks reports N. Cosmetic/diagnostic only; placement already uses the per-disk activeDisk map. * fix(ec): attribute EC balance source disk per shard and reject same-node moves On multi-disk nodes the EC balance worker built a node-level view that kept only the first physical disk id per (node, volume), so a move of a shard living on a different disk reported the wrong source disk. That source disk drives the per-disk capacity reservation, so the wrong disk drifts the capacity model the EC placement planner relies on. Track shards per physical disk and resolve the actual source disk for every emitted move (dedup, cross-rack, within-rack, global), keeping the per-disk view consistent as simulated moves are applied. Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the shard from every disk on the node) and copyAndMountShard skips the copy when source and target addresses match, so a same-node move would erase a shard it never copied. isDedupPhase now requires the same node AND disk, and Validate / Execute reject same-node cross-disk moves outright. * fix(ec): spread EC balance moves across destination disks Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance worker so a moved shard is placed on a good physical disk instead of always deferring to the volume server (target disk 0). The detection now builds a per-physical-disk view of each node (free slots split from the node total, exact EC shard count, disk type, discovered from both regular volumes and EC shards) and, for each cross-rack, within-rack, and global move, chooses the destination disk by ascending score: - fewer total EC shards on the disk, - far fewer shards of the same volume on the disk (spread a volume's shards across disks for fault tolerance), and - data/parity anti-affinity (a data shard avoids disks holding the volume's parity shards and vice versa). Planned placements are reserved on the in-memory model during a run so multiple shards moved to the same node spread across its disks rather than piling on one. * fix(ec): bring EC balance worker to parity with shell ec.balance The worker's cross-rack and within-rack balancing balanced shards by total count; the shell balances data and parity shards separately with anti-affinity and honors replica placement. Port that logic so the automatic balancer makes the same fault-tolerance-aware decisions as the manual command: - Cross-rack and within-rack now run a two-pass balance: data shards spread first, then parity shards spread while avoiding racks/nodes that already hold the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks and doBalanceEcShardsWithinOneRack. - Optional replica placement: a new replica_placement config (e.g. "020") constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty keeps the previous even-spread behavior. - The data/parity boundary is resolved from a per-collection EC ratio (standard 10+4 here), replacing the previously hardcoded constant at the call sites. Selection is deterministic (sorted keys) to keep behavior reproducible. * refactor(ec): extract shared ecbalancer package for shell and worker The EC shard balancing policy was duplicated between the shell ec.balance command and the admin EC balance worker, and the two had drifted (multi-disk handling, data/parity anti-affinity, replica placement). Extract the policy into a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers share so it cannot drift again. - ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and within-rack data/parity two-pass with anti-affinity, global per-rack balance, and diversity-aware disk selection) over a caller-built Topology snapshot and returns the shard Moves. It depends only on erasure_coding and super_block. - The worker builds the Topology from the master topology and turns Moves into task proposals; the shell builds it from its EcNode model and executes Moves via the existing move/delete RPCs. Per-collection EC ratio resolution stays in each caller (passed as Options.Ratio). - Options expose the two genuine policy differences: GlobalUtilizationBased (worker balances by fractional fullness; shell by raw count) and GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains in one pass). The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to the ecbalancer package; the shell and worker keep their adapter/execution tests. * fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor Address regressions and gaps from the ecbalancer extraction: - Shell ec.balance honors -maxParallelization again: planned moves run phase by phase (preserving cross-phase dependencies) with bounded concurrency within a phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and updates the in-memory model for inspection. - Rack and node balancing gate on per-type spread (data and parity separately) instead of combined totals, so a data/parity skew is corrected even when the per-rack/node totals are even. - Global rack balancing iterates the full shard-id space (MaxShardCount) so custom EC ratios with more than the standard total are candidates. - Cross-rack planning decrements the destination node's free slots per planned move, so limited-capacity targets are no longer over-planned. * fix(ec): make EC dedup keeper deterministic and capacity-aware When a shard is duplicated across nodes, keep the copy on the node with the most free slots and delete the duplicates from the more-constrained nodes, relieving capacity pressure where it is tightest. Tie-break on node id so the choice is deterministic. This unifies the shell and worker (the shell previously kept the least-free node, an incidental default) on the more sensible behavior. * fix(ec): restore global volume-diversity and per-volume move serialization Two more behaviors lost in the ecbalancer refactor: - Global rack balancing again prefers moving a shard of a volume the destination does not hold at all before adding another shard of an already-present volume (two-pass, mirroring the old balanceEcRack), keeping each volume's shards spread across nodes. - Shell apply-mode execution serializes a single volume's moves within a phase while still running different volumes in parallel, so concurrent moves of the same volume cannot race on its shared .ecx/.ecj/.vif sidecar files. * fix(ec): key EC balance shards by (collection, volume id) A numeric volume id can be reused across collections, and EC identity is (collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed Node.shards by vid alone, so volumes sharing an id across collections merged into one entry — letting dedup delete a "duplicate" that is actually a different collection's shard, and letting moves act across collections. Key shards by (collection, vid) throughout so each volume stays distinct. * fix(ec): credit freed capacity from dedup before later balance phases Dedup deletions are simulated only by applyMovesToTopology, which cleared shard bits but did not return the freed disk/node/rack slots. Later phases reject destinations with no free slots, so a slot opened by dedup could not be reused in the same Plan/ec.balance run. applyMovesToTopology now credits the freed disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline accounting their phase already did). * test(ec): add multi-disk EC balance integration test Cover issue 9593 end-to-end at the unit level the old tests missed: build the master's actual multi-disk wire format (same-type disks collapsed into one DiskInfo, real DiskId only in per-shard records), run it through a real ActiveTopology and the Detection entry point, then replay the planned moves with the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume, and a multi-rack spread, and asserts moves are safe (no same-node cross-disk), correctly attributed to the source disk, and redistribute concentrated volumes across both other racks and multiple destination disks. * fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a volume. A multi-disk node reports one EcShardInfo per physical disk holding shards of the volume, so only the last disk's shards survived — the node looked like it was missing shards it actually had. This made ec.encode's pre-delete verification (and ec.decode) under-count volumes whose shards are spread across disks on one server, falsely aborting the encode on multi-disk clusters. Union the per-disk shard sets per server instead. Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the master via volume-server heartbeats, so a freshly distributed shard set may not be fully visible the instant the balance returns. Retry before concluding the set is incomplete; genuine loss still fails after the retries are exhausted. * test(ec): end-to-end multi-disk EC balance shard-loss regression Start a real cluster of multi-disk volume servers (3 servers x 4 disks), EC-encode a volume, run ec.balance, and assert hard invariants the prior integration tests only logged: after encode all 14 shards exist, ec.balance loses no shard, shards span more than one disk per node, and cluster.status counts physical disks (not one per node). This reproduces issue 9593 end to end and would have caught the multi-disk shard-aggregation bug fixed alongside it. * fix(ec): bring EC balance worker/plugin path to parity with shell - Per-volume serialization and phase order: key the plugin proposal dedupe by (collection, volume) instead of (volume, shard, source), so the scheduler runs only one of a volume's moves at a time (within a run and against in-flight jobs). Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and because the planner emits a volume's moves in phase order, they now execute in order across detection cycles, matching the shell. - disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks instead of nothing; apply the canonical type to planner options and move params. - Replica placement: expose shard_replica_placement in the admin config form and read it into the worker config, mirroring ec.balance -shardReplicaPlacement. * test(ec): rename worker in-process test (not a real integration test) The worker-package multi-disk tests build a fake master topology and simulate move execution; they are not real-cluster integration tests. Rename integration_test.go -> multidisk_detection_test.go and drop the Integration prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding. * ci(ec): remove redundant ec-integration workflow ec-integration.yml duplicated EC Integration Tests under the same workflow name but ran only 'go test ec_integration_test.go' (one file), so it never ran new test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole test/erasure_coding package on every push/PR. * fix(ec): worker falls back to master default replication for EC balance For strict parity with the shell, the EC balance worker now uses the master's configured default replication as the replica-placement fallback when no explicit shard_replica_placement is set, instead of always defaulting to even spread. The maintenance scanner reads it via GetMasterConfiguration each cycle and passes it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint (explicit config wins, else master default, else none) in resolveReplicaPlacement. A zero-replication default (the common 000 case) still means even spread, so the common configuration is unchanged. * fix(ec): plugin path populates master default replication too The plugin worker built ClusterInfo with only ActiveTopology, so the master default replication fallback added for the maintenance path never reached plugin-driven EC balance detection — empty shard_replica_placement still meant even spread there. Fetch the master default via GetMasterConfiguration (new pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement so both detection paths resolve replica placement identically to the shell. * docs(ec): empty shard replica placement uses master default, not even spread The EC balance config text (admin plugin form, legacy form help text, and the struct/proto field comments) still said an empty shard_replica_placement spreads evenly. The runtime resolves empty to the master default replication (resolveReplicaPlacement), matching shell ec.balance, with even spread only when that default is empty or zero. Update the text to match and regenerate worker_pb for the proto comment change.	2026-05-20 23:31:21 -07:00
Chris Lu	4385b86bf1	fix(shell): volumeServer.evacuate no longer panics on a nil volume (#9587 ) adjustAfterMove now removes the moved volume from the source disk's VolumeInfos in place: it swaps the entry with the last one and nils the tail. evacuateNormalVolumes ranges directly over that same slice, so the niled tail slot is later read as a nil *VolumeInformationMessage and the move attempt panics on vol.DiskType. Iterate over a snapshot of the slice so in-place removals during a move cannot leave nil holes in the loop.	2026-05-20 10:27:00 -07:00
Chris Lu	e332b97d52	fix(shell): volume.balance no longer drains all volumes onto one server (#9579 ) * fix(shell): volume.balance no longer drains all volumes onto one server The density-based capacity function reads per-disk VolumeInfos sizes, but adjustAfterMove only updated VolumeCount and the selectedVolumes map. The planner re-read a stale topology after every move, so the source node's density never dropped and it kept moving volumes until that node was empty. Move the volume's size accounting between disks after each planned move so the density recomputes and the loop converges to an even distribution. * refactor(shell): O(1) volume removal and direct disk lookup in adjustAfterMove removeVolumeInfo swaps with the last element instead of shifting, and the disk is fetched by key rather than ranging the DiskInfos map.	2026-05-20 01:39:23 -07:00
Chris Lu	41b6ad002b	fix(volume.list): show one entry per physical disk on multi-disk nodes (#9541 ) * fix(volume.list): show one entry per physical disk on multi-disk nodes DataNodeInfo.DiskInfos is keyed by disk type, so several same-type physical disks on one node collapse to a single map entry at the master. volume.list iterated that map directly and reported one "Disk hdd ... id:0" line per node, hiding the per-disk volume and shard layout. EC operators on multi-disk volume servers had no way to verify which physical disk a shard landed on. Lift the per-physical-disk split into a DiskInfo.SplitByPhysicalDisk() method on the proto type so consumers outside admin/topology can use it. Apply it in writeDataNodeInfo so the verbose Disk block shows one entry per physical disk, ordered by DiskId. Capacity counters are split evenly across reconstructed disks since the wire format doesn't carry per-disk capacity yet. This is a display-only change. ActiveTopology already did the split on its own and is now updated to call the shared helper. * fix(volume.list): preserve totals, count active/remote exactly, dedupe header Address review feedback on the per-physical-disk split: - share() truncated remainders so reconstructed per-disk counters could sum to less than the original aggregate (10 / 3 = 3+3+3). Distribute the remainder to the lowest disk ids so MaxVolumeCount and FreeVolumeCount sum exactly back to the node totals. - ActiveVolumeCount and RemoteVolumeCount are derivable per disk from the VolumeInfos already grouped by DiskId, so count them exactly (ReadOnly=false and RemoteStorageName!="" respectively) instead of approximating with an even split. - writeDataNodeInfo's per-disk callback fired the DataNode header on every iteration after the split, so a node with 6 physical disks emitted 6 DataNode headers. Guard the callback with headerPrinted so the header still appears at most once per node. - Sort split disks deterministically using explicit DiskId comparison to avoid int overflow risk on 32-bit systems. - Tighten the volume.list test substring to "id:N\n" so unrelated tokens like "ec volume id:101" don't accidentally match the id:1 needle, and assert the rack callback fires once.	2026-05-18 14:43:44 -07:00
Chris Lu	37e6263efe	fix(shell): attach admin JWT for filer IAM gRPC calls (#9536 ) When jwt.filer_signing.key is set, the filer's IamGrpcServer requires a Bearer token on every IAM RPC. The shell's s3.* IAM commands dialed without that header and failed with Unauthenticated. Route them through a small helper that mints a token from the same key viper-loaded from security.toml and appends it as outgoing metadata, matching the credential grpc_store pattern.	2026-05-18 13:42:32 -07:00
Chris Lu	3a8389cd68	fix(ec): verify full shard set before deleting source volume (#9490 ) (#9493 ) * fix(ec): verify full shard set before deleting source volume (#9490) Before this change, both the worker EC task and the shell ec.encode command would delete the source .dat as soon as MountEcShards returned — even if distribute/mount failed partway, leaving fewer than 14 shards in the cluster. The deletion was logged at V(2), so by the time someone noticed missing data the only trace was a 0-byte .dat synthesized by disk_location at next restart. - Worker path adds Step 6: poll VolumeEcShardsInfo on every destination, union the bitmaps, and refuse to call deleteOriginalVolume unless all TotalShardsCount distinct shard ids are observed. A failed gate leaves the source readonly so the next detection scan can retry. - Shell ec.encode adds the same gate after EcBalance, walking the master topology with collectEcNodeShardsInfo. - VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any source destruction is traceable in default-verbosity production logs. The EC-balance-vs-in-flight-encode race is intentionally left for a follow-up; balance should refuse to move shards for a volume whose encode job is not in Completed state. * fix(ec): trim doc comments on the new shard-verification path Drop WHAT-describing godoc on freshly added helpers; keep only the WHY notes (query-error policy in VerifyShardsAcrossServers, the #9490 reference at the call sites). * fix(ec): drop issue-number anchors from new comments Issue references age poorly — the why behind each comment already stands on its own. * fix(ec): parametrize RequireFullShardSet on totalShards Take totalShards as an argument instead of reading the package-level TotalShardsCount constant. The OSS callers continue to pass 14, but the helper is now usable with any DataShards+ParityShards ratio. * test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo The new pre-delete verification gate calls VolumeEcShardsInfo on every destination after mount, and the fake server's UnimplementedVolumeServer returns Unimplemented — the verifier read that as zero shards on every node and aborted source deletion. Build the response from recorded mount requests so the integration test exercises the gate end-to-end. * fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files Mirror the Go-side change in weed/storage/volume_write.go: stat each file before removing and emit an info-level log for .dat/.idx so a destructive call is always traceable. The OSS Rust crate previously unlinked them silently. * fix(ec/decode): verify regenerated .dat before deleting EC shards After mountDecodedVolume succeeds, the previous code immediately unmounts and deletes every EC shard. A silent failure in generate or mount could leave the cluster with neither shards nor a valid normal volume. Probe ReadVolumeFileStatus on the target and refuse to proceed if dat or idx is 0 bytes. Also make the fake volume server's VolumeEcShardsInfo reflect whichever shard files exist on disk (seeded for tests as well as mounted via RPC), so the new gate can be exercised end-to-end. * fix(ec): address PR review nits in verification + fake server - Drop unused ServerShardInventory.Sizes field. - Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits bound is explicit (Set already no-ops on overflow, this is for clarity). - Nil-guard the fake server's VolumeEcShardsInfo so a malformed call doesn't panic the test process.	2026-05-13 19:29:24 -07:00
Chris Lu	79859fc21d	feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468 ) * feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd Three diagnostic additions on top of #9460, all aimed at making the next production incident faster to triage than the one we just spent hours on. 1. [versioning-heal] grep prefix on every heal-related log line, with a small fixed event vocabulary (produced / surfaced / healed / enqueue / drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed / teardown_failed / queue_full). One grep gives operators a single event stream across the produce-to-drain lifecycle. 2. Escalate the "scanned N>0 entries but no valid latest" case in updateLatestVersionAfterDeletion from V(1) Infof to a Warning that names the orphan entries it saw. This is the listing-after-rm inconsistency signature that pinned down 259064a8's failure — it should not be invisible at default log levels. 3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]` that walks .versions/ directories under a prefix and reports the stranded population. With -heal it clears the latest-version pointer in place on stranded directories so subsequent reads return a clean NoSuchKey instead of replaying the 10-retry self-heal loop. * fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry Address PR review: 1. s3.versions.audit walked only the first 1024-entry page of each .versions/ directory, false-positiving "stranded" on large dirs. Loop until the page returns < 1024 entries, advancing startName. 2. clean and orphan-only categories double-counted when a directory had no pointer and at least one orphan: incremented both. Make them mutually exclusive so report totals sum to versionsDirs. 3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep, non-interruptible by ctx. A server shutdown / client disconnect would wait out the budget per in-flight delete. Thread ctx through deleteSpecificObjectVersion -> repointLatestBeforeDeletion / updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context(); gRPC lifecycle handlers pass the stream ctx. New test pins the behavior: cancelling ctx mid-backoff returns ctx.Err() in <500ms instead of blocking ~6.3s. * fix(s3/versioning): clearStale outcome + escape grep-able log fields Two coderabbit follow-ups: 1. Successful pointer clear should suppress `produced`. updateLatestVersionAfterDeletion's transient-rm fallback called clearStaleLatestVersionPointer best-effort, then unconditionally returned retryErr. The caller (deleteSpecificObjectVersion) saw the error and emitted `event=produced` + enqueued the reconciler, even though clearStaleLatestVersionPointer had just driven the pointer to consistency and the next reader would get NoSuchKey via the clean-miss path. Make clearStaleLatestVersionPointer return cleared bool; on success the caller returns nil so neither produced nor the reconciler enqueue fires. Concurrent-writer aborts, re-scan errors, and CAS mismatches still report false so genuinely stranded state keeps surfacing. 2. Escape user-controlled fields in heal log lines. versioningHealInfof / Warningf / Errorf interpolated raw bucket / key / filename / err text into a single-space-separated line. An S3 key (or error string from gRPC) containing whitespace, newlines, or `event=...` could split one event into multiple tokens and spoof fake fields downstream. Sanitize each arg in the helper: safe values pass through; anything with whitespace, quotes, control chars, or backslashes is replaced with its strconv.Quote form. No caller changes — the format strings remain unchanged. Tests pin both behaviors: sanitization table covers the field boundary cases; an end-to-end shape test confirms a key containing `event=spoof` stays inside a single quoted token.	2026-05-13 10:48:58 -07:00
Chris Lu	b1d59b04a8	fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477 ) * fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on master after PRs #9475 + #9466: 1. Walker dispatch for ABORT_MPU was sending entry.DestKey as req.ObjectPath. The server's ABORT_MPU handler (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/ prefix to extract the upload id and reads the init record from that directory, so it expects the .uploads/<id> path verbatim. DestKey looks like a regular object path; the server's prefix check fails and the dispatch returns BLOCKED with "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/ prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey to ...UsesUploadsPath and inverts the assertion to match the actual server contract. DestKey is still used for the WalkBuckets shard predicate and for rule-prefix matching in bootstrap.walker; both surfaces want the user's intended path, while DISPATCH wants the .uploads/<id> directory. The bootstrap test (TestLifecycleAbortIncompleteMultipartUpload) caught this when the walker's BLOCKED error surfaced as FATAL output. 2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the shell command logs "loaded lifecycle for N bucket(s)" so a regression that produces half-shaped output (no load summary) is caught. The restored shell command (PR #9475) didn't print that line; add it back on the first pass that finds non-zero inputs. * fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path) runShard's empty-replay sentinel (rsh == [32]byte{}) was returning BEFORE the steady-state walker check. A bucket whose only lifecycle rule was walker-only (ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) would never have it dispatched because: - ReplayContentHash only hashes replay-eligible kinds, so walker-only-only snapshots produce rsh == empty. - The early-return persisted the empty cursor and exited before the steady-state walker block at the bottom of the function. Move the walker invocation INTO the empty-replay branch so walker- only rules dispatch on the same path as mixed-rule buckets. TestLifecycleExpirationDateInThePast and TestLifecycleExpiredDeleteMarkerCleanup were both timing out their "object must be deleted" Eventually polls because of this. Caught on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the shell entry point that exercises the integration tests. * fix(s3/lifecycle): cold-start walker covers pre-existing objects runShard only walked the bucket tree on the recovery branch (found && hash mismatch). For a fresh worker with no persisted cursor, found=false, so the recovery walker never fired and the meta-log replay only scanned runNow - maxTTL of events. Objects PUT before that window — including pre-existing objects in a newly-rule-enabled bucket — never matched the rule. The streaming worker handled this with scheduler.BucketBootstrapper. Daily-replay needed the equivalent: walk the live tree once on the first run for each shard so pre-existing objects get evaluated even when their PUT events are outside meta-log scan window. Restructured the recovery branch to fire the walker on either (found && mismatch) OR !found. On cold-start the cursor isn't rewound — we keep TsNs=0 and let the drain below floor to runNow - maxTTL like before; the walker just handles whatever the sliding window can't reach. TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151). * fix(s3/lifecycle): restore walker tag and null-version state Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(s3/lifecycle): parallelize shell shard sweeps Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard Two CI bugs surfaced after PR #9466 deleted the streaming worker: 1. The shell command's -refresh loop never fires. runPass used the outer ctx (full -runtime), so dailyrun.Run blocked for the entire 1800s s3tests window — the background worker only ran one pass and never re-loaded configs that tests created mid-run. test_lifecycle_expiration sees 6 objects when expecting 4 because expire1/* never reaches the worker's snapshot. Cap each pass to cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx. 2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but runLifecycleShard invoked with no -refresh — only one pass ran. The marker rule can't fire in the same pass that dispatches v1's delete because v1 is still in .versions/. Add -refresh 1s so the 10s runtime gets multiple passes. * fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout drainShardEvents only exits via ctx cancellation for an idle subscription — that's the steady-state when all replayed events are already past. Saving the cursor with the canceled passCtx silently drops every advance, so the next pass re-subscribes from the same floor and re-replays the same events. Symptom in s3tests: status=error shards=16 errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race between concurrent shard drains all retrying the same events. Use a 5s timeout derived from context.Background for the save, and treat passCtx Deadline/Canceled from drain as a clean end-of-pass — not a shard-level error to log. * fix(s3/lifecycle): trust persisted cursor; never bump past pending events The drain freezes cursorAdvanceTo at the last pre-skip event so pending matches (DueTime > runNow) re-enter the subscription next pass. Combined with the new cursor persistence, the floor bump (runNow - maxTTL) then orphans the very events the drain stopped at. Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME + maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the last advance saved a cursor right before the not-yet-due PUT (e.g., keep2/* between expire1/* and expire3/* on the same shard), the floor bump on pass 9 skips past the expire3 event itself — the worker never re-reads it. Test symptom: expire3/* never expires when worker shards include other earlier no-match events. Cold start (found=false) still subscribes from runNow - maxTTL. Steady state honors the cursor verbatim. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-13 00:19:05 -07:00
Chris Lu	7b8647e8bc	fix(shell): loop s3.lifecycle.run-shard so CI workflow stays alive (#9476 ) The s3tests workflow (.github/workflows/s3tests.yml) backgrounds `weed shell -c 's3.lifecycle.run-shard -shards 0-15 -s3 ... -refresh 2s'` and then runs `kill -0 $pid` to confirm the worker stayed alive. The PR-9475 restore ran dailyrun.Run once and exited cleanly — even faster when no buckets had lifecycle rules yet ("nothing to run"). The aliveness check then failed and the s3tests job died with "lifecycle worker died on startup". Caught on https://github.com/seaweedfs/seaweedfs/actions/runs/25772523143/job/75698413401. Fix: - -refresh now drives an inter-pass loop. cadence=0 (default) is one-shot, matching the test/s3/lifecycle/ integration-test invocation that omits -refresh and expects synchronous return. cadence>0 (the CI case) keeps the command alive until -runtime expires, running a fresh dailyrun.Run on every tick. - Each iteration re-loads bucket configs via scheduler.LoadCompileInputs so rules created mid-run (the s3tests flow creates rules AFTER the worker starts) get picked up. - The "no rules; nothing to run" early return is gone — the command stays alive even with an empty initial snapshot, waiting for tests to add rules. - -dispatch, -checkpoint, -bootstrap-interval stay accepted-but- ignored (legacy streaming flags).	2026-05-12 19:19:46 -07:00
Chris Lu	4ce027c2f3	fix(shell): restore s3.lifecycle.run-shard for CI/integration-test compatibility (#9475 ) fix(shell): restore s3.lifecycle.run-shard as a dailyrun.Run wrapper PR #9466 deleted weed/shell/command_s3_lifecycle_run_shard.go on the premise that it was a debug-only tool. It wasn't: the s3tests CI workflow (.github/workflows/s3tests.yml) and the test/s3/lifecycle/ integration tests invoke it via `weed shell` to drive lifecycle expirations on demand. Both started failing with "unknown command: s3.lifecycle.run-shard". This PR restores the command with the same flag set so existing callers (CI scripts and integration tests) work unchanged. The implementation no longer drives the streaming dispatcher.Pipeline + scheduler.BucketBootstrapper (deleted) — instead it does one bounded dailyrun.Run pass through the same daily-replay code path the production worker exercises. The walker fires for walker-bound rules just like in the worker. Obsolete streaming flags (-dispatch / -checkpoint / -refresh / -bootstrap-interval) are accepted-but-ignored so existing scripts don't need to drop them.	2026-05-12 18:29:18 -07:00
Chris Lu	10cc06333b	cluster: restrict Ping RPC to known peers of the requested type (#9445 ) Ping previously dialled whatever host:port the caller asked for. Gate each server's Ping handler on cluster membership: masters check the topology, registered cluster nodes, and configured master peers; volume servers only accept their seed/current masters; filers accept tracked peer filers, the master-learned volume server set, and configured masters. Use address-indexed peer lookups to keep Ping target validation O(1): - topology maintains a pb.ServerAddress -> *DataNode index alongside the dc/rack/node tree, kept in sync from doLinkChildNode and UnlinkChildNode plus the ip/port-rewrite branch in GetOrCreateDataNode. GetTopology now returns nil on a detached subtree instead of panicking, so the linkage hooks can no-op safely. - vid_map tracks a refcount per volume-server address so hasVolumeServer answers without scanning every vid location. The add path skips empty-address entries the same way the delete path already does, so a zero-value Location cannot leak a permanent serverRefCount[""] bucket. - masters reuse a cached master-address set from MasterClient instead of walking the configured peer slice on every request. - volume servers compare against a pre-built seed-master set and protect currentMaster reads/writes with an RWMutex, fixing the data race with the heartbeat goroutine. The seed slice is copied on construction so external mutation cannot desync it from the frozen lookup set. - cluster.check drops the direct volume-to-volume sweep; volume servers no longer carry a peer-volume list, and the note next to the dropped probe is reworded to make clear that direct volume-to-volume reachability is intentionally not validated by this command. Update the volume-server integration tests that drove Ping through the new admission gate: success-path coverage now targets the master peer (the only type a volume server tracks), and the unknown/unreachable path asserts the InvalidArgument the gate now returns instead of the old downstream dial error. Mirror the same admission gate in the Rust volume server crate: a seed-master HashSet built once at startup plus a tokio RwLock over the heartbeat-tracked current master, both consulted in is_known_ping_target on every Ping, with InvalidArgument returned for any target that isn't a recognised master.	2026-05-12 13:00:52 -07:00
Chris Lu	5004b4e542	feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466 ) * feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) Phase 5a (PR #9465) retired the algorithm flag and made daily_replay the only execution path. The streaming-side code (scheduler.Scheduler, scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher, dispatcher.FilerPersister, and their tests) has had no in-tree caller since then. This PR deletes it. Net change: ~4800 lines removed, ~130 added (the scheduler/configload tests' helper file the deleted bootstrap_test.go used to host). Removed: - weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test, scheduler,scheduler_test,pipeline_fanout_test, refresh_default,refresh_s3tests}.go - weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test, dispatcher_helpers_test,edge_cases_test,multi_shard_test, pipeline,pipeline_test,pipeline_helpers_test,toproto_test, dispatch_ticks_default,dispatch_ticks_s3tests}.go - weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go (FilerPersister deleted; FilerStore tests don't need their own file) - weed/shell/command_s3_lifecycle_run_shard{,_test}.go (debug-only shell command that only ever wrapped the streaming pipeline; the production worker now exercises the same path every daily run) Trimmed: - dispatcher/filer_persister.go down to FilerStore + NewFilerStoreClient — the small interface daily_replay's cursor persister (dailyrun.FilerCursorPersister) plugs into. Kept (still consumed by daily_replay): - scheduler/configload.{go,_test.go} (LoadCompileInputs, AllActivePriorStates) - dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister, FilerSiblingLister) - dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient) scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream, dirEntry, fileEntry — helpers the configload tests used to share with the deleted bootstrap_test.go. Updates the handler-package doc strings and one reader-package comment that still named the streaming pipeline. * fix(s3/lifecycle): hold lock through tree read in test filer client gemini caught an inconsistency in scheduler/testhelpers_test.go: LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was releasing the lock before reading c.tree. The map is effectively static during tests so there's no actual race today, but matching the convention keeps the helper safe if a future test mutates the tree mid-run.	2026-05-12 12:54:52 -07:00
Parviz Miriyev	2212cc8a5f	shell: expose retention flags on mq.topic.configure (#9416 ) * shell: expose retention flags on mq.topic.configure The ConfigureTopicRequest proto already carries a TopicRetention message ({retention_seconds, enabled}), and GetTopicConfigurationResponse / the admin UI both surface it — but the `mq.topic.configure` shell command sent Retention: nil unconditionally, leaving CLI and IaC users with no way to set retention without going through the admin UI. Add three flags so the shell matches the existing surface: -retention <duration> Go duration string (e.g. 168h, 30m) -retentionSeconds <int> raw seconds (mutually exclusive with -retention) -retentionEnabled bool toggle for retention enforcement Behavior: - If none of the retention flags are set, the request omits Retention (`nil`) — preserving the prior "leave server-side state alone" semantics for callers that only care about partition count. - Setting any retention flag populates TopicRetention with both fields so existing operators keep working when only one is provided. - -retention and -retentionSeconds together is a hard error (ambiguous); negative values are rejected. The new behavior is detected via flag.FlagSet.Visit so default values of 0 / false are distinguishable from "not provided". Help text updated with examples for both setting a TTL and disabling retention on an existing topic. No proto / server changes needed; this is a CLI-only patch. * broker, shell: address review for mq.topic.configure retention Three follow-up fixes for the review comments on the original commit: 1. Server-side: detect retention changes in the early-return path. Previously ConfigureTopic returned without persisting when partitionCount + schema were unchanged, even if the request supplied a different Retention. Now the early-return branch checks both schema and retention, persists either or both when they differ, and logs which fields actually changed. 2. Server-side: preserve existing retention when request.Retention is nil in the fresh-allocation path. Capture the prior resp.Retention before overwriting `resp = &mq_pb.ConfigureTopicResponse{}`, and only overwrite the new resp.Retention with `request.Retention` when the request actually supplies one. Otherwise carry the previous retention forward, so partition-count changes (or any path that bypasses the early-return branch) don't accidentally clear retention. 3. CLI: switch the `-retention` / `-retentionSeconds` mutual-exclusion check from value-based (`!= 0`) to flag.FlagSet.Visit-based, so an explicit `-retention=0 -retentionSeconds=N` is also rejected. 4. CLI: when any retention flag is set, fetch the current GetTopicConfiguration and fill in the fields the user didn't explicitly provide. This means `-retentionEnabled` alone no longer zeros the existing duration, and `-retention 24h` alone preserves the existing enabled flag. When the topic doesn't exist yet, the GetTopicConfiguration error is treated as "no current state" and we proceed with just the user-supplied values. Build / vet / gofmt / `go test ./weed/shell/... ./weed/mq/broker/...` clean.	2026-05-12 12:37:09 -07:00
Chris Lu	1854101125	feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8) (#9386 ) * feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8) scan_only actions only fire from the bootstrap walk: the engine classifies a rule as scan_only when its retention horizon exceeds the meta-log retention, so event-driven routing can't be trusted. Today each bucket walks once per process, so a long-running worker never revisits — scan_only retention only catches up when the worker restarts. Replace BucketBootstrapper.known (set) with BucketBootstrapper.lastWalk (name -> completion time). KickOffNew now re-walks a bucket whose last walk completed more than BootstrapInterval ago. Zero interval preserves the legacy walk-once-per-process behavior so existing deployments don't change cadence by default. walkBucket re-stamps on success and clears the stamp on failure (via MarkDirty), so the next KickOffNew picks failed walks back up. Add MarkDirty / MarkAllDirty operator hooks for forced re-walks, and a Now func() for testable time travel. weed shell run-shard grows --bootstrap-interval (cadence knob) and --force-bootstrap (drop in-memory state at startup so every bucket walks again immediately, useful when a config change should take effect without a restart). Tests: cadence respected (skip inside interval, re-walk past it); zero interval keeps once-per-process; MarkDirty forces re-walk under a 24h interval; MarkAllDirty resets every record. The fakeClock helper guards the test clock with a mutex so race-detector runs are clean. * fix(s3/lifecycle): split walk state, thread BootstrapInterval through worker, drop dead flag Three issues with the Phase 8 cadence work as it landed: 1. lastWalk did double duty as both completed-walk timestamp and in-flight debounce. A walk that took longer than BootstrapInterval would have a fresh KickOffNew start a duplicate goroutine on the next refresh tick because the stamp from KickOffNew looked stale against the interval. Split into lastCompleted (set on success) and inFlight (set on dispatch, cleared after the walk goroutine returns success or failure). KickOffNew skips inFlight buckets regardless of cadence. 2. The cadence knob existed on `weed shell` but not on the production path: scheduler.Scheduler constructed BucketBootstrapper without BootstrapInterval, and weed/worker/tasks/s3_lifecycle/Config had no field for it. Add Scheduler.BootstrapInterval, parse `bootstrap_interval_minutes` in ParseConfig (zero = legacy walk- once-per-process; negative clamps to zero), and forward it from the handler. Tests cover default, override, clamp, and explicit-zero. 3. --force-bootstrap was a no-op: BucketBootstrapper is freshly allocated at command start, so MarkAllDirty on empty state does nothing, and the flag couldn't influence an already-running process anyway. Remove it; a real runtime trigger (SIGHUP, control RPC) is a separate change. In-flight regression: a blockingInjector pins the first walk in progress while the test advances the clock past the interval. The second KickOffNew is a no-op (inFlight check). After release, the post-completion KickOffNew within the interval is also a no-op. * test(s3/lifecycle): wait for lastCompleted stamp before advancing fake clock The cadence test polled listedN to know "the walk happened" — but that fires once both list passes are issued, while the success-stamp lands later, after walkBucketDir returns. A clock.Advance(30m) between those two events would record the stamp at clock+30m instead of T0; the next assertion would then see now.Sub(last) < 1h and skip the expected re-walk. Tight in practice but exposed under -race / load. Add a waitForCompleted helper that polls b.lastCompleted directly, and use it before each clock advance in both the cadence and zero- interval tests. * fix(s3/lifecycle): expose bootstrap interval in worker UI; honor MarkDirty during walks Two follow-ups on Phase 8. The worker config descriptor had no bootstrap_interval_minutes field, so the production operator UI couldn't enable the cadence — only the internal ParseConfig + Scheduler wiring knew about it. Add the field to the cadence section (MinValue=0 since 0 is the legacy default) and include the default in DefaultValues so existing deployments see the knob with the right preset. MarkDirty / MarkAllDirty silently lost their effect when a walk was in flight: the methods cleared lastCompleted, but the walk's success path then wrote a fresh timestamp, hiding the operator's invalidation. Track a pendingDirty set; the walk goroutine consumes the flag on exit and skips the success stamp, so the next KickOffNew picks the bucket up immediately. Regression: pin a walk in progress with a blockingInjector, MarkDirty the bucket, release the walk, and assert lastCompleted stayed empty plus the next KickOffNew triggers a new walk inside the BootstrapInterval window. * refactor(s3/lifecycle): drop unused MarkDirty / MarkAllDirty + pendingDirty These methods were the operator-hook half of Phase 8, but the only caller (--force-bootstrap on the shell command) was removed when it turned out to be a no-op against a freshly-allocated bootstrapper. Nothing in production calls them anymore. Strip the dead surface: MarkDirty, MarkAllDirty, the pendingDirty set, the dirty-suppression branch in walkBucket, and the three tests that only exercised those methods. BootstrapInterval-driven re-bootstrap is the live mechanism. A real runtime trigger (SIGHUP, control RPC) is a separate change with a real call site.	2026-05-09 13:42:31 -07:00
Chris Lu	05d31a04b6	fix(s3tests): wire lifecycle worker for expiration suite (#9374 ) * fix(s3tests): wire lifecycle worker for expiration suite The upstream s3-tests `test_lifecycle_expiration` / `test_lifecyclev2_expiration` exercise the "set rule, wait, verify deletion" path. Phase 4 (#9367) intentionally stripped the PUT-time back-stamp, so pre-existing objects no longer pick up TtlSec on a freshly-applied rule. The s3tests CI bare-bones `weed -s3` had nothing left driving expiration. Three changes that work together: - Engine scales `Days` by `util.LifeCycleInterval`. Production keeps the 24h day; the `s3tests` build tag shrinks it to 10s so a `Days: 1` rule completes inside the suite's 30s polling window. Exported `DaysToDuration` so sibling-package tests pin to the same scale. - Scheduler/dispatcher tick defaults split into `_default` / `_s3tests` files. Production stays 5s/30s/5m; the test build runs at 500ms/2s/2s so deletions land within a couple ticks of becoming due. - s3tests.yml spawns `weed shell s3.lifecycle.run-shard -shards 0-15 -events 0 -runtime 1800s` alongside the s3 server in both the basic and SQL blocks; the shell command runs the full pipeline (reader + scheduler + dispatcher) for the duration of the suite. `test_lifecycle_expiration_versioning_enabled` is left out for now — versioned-bucket expiration via the worker still needs its own pass. Drive-by: bump `TestWorkerDefaultJobTypes` to 7 to match the registered handler count (`8b87ceb0d` updated `mini_plugin_test.go` for the s3_lifecycle plugin but missed this twin test). Two retention-gate engine tests `t.Skip` under the s3tests build because they rely on absolute lookback-vs-retention math the day-rescale collapses; the prod build still covers them. * review: harden lifecycle worker spawn + assert handler identity - Workflow: aliveness check on the backgrounded `weed shell` (a bad command exits in <1s and the suite would otherwise just opaque-timeout); move worker/server teardown into a `trap cleanup EXIT` so failure paths still print the worker log and reap the data dir. - worker_test: check the actual job-type set by name, not just the count. * fix(shell): keep s3.lifecycle.run-shard alive when no rules exist yet The s3-tests CI runs the worker BEFORE any test creates a bucket, so LoadCompileInputs returns empty and the shell command was bailing out with "no buckets with enabled lifecycle rules found" within ~1s. The aliveness check then fired exit 1 before tox ever started. Two changes: - Don't early-exit on empty inputs. Compile against the empty set, log a one-liner, and let the pipeline run normally — the meta-log subscription is already up, so events for buckets created later DO arrive; they just need the engine to know about them when they do. - Add `-refresh <duration>` (default 5m, 2s in s3tests CI) that periodically re-runs LoadCompileInputs + engine.Compile so rules added after startup land in the snapshot the dispatcher reads on its next tick. Production deployments keep the 5m default; only the CI workflow drops to 2s. Workflow passes `-refresh 2s` in both basic and SQL blocks. * fix(shell): backfill pre-rule entries via bootstrap walker The reader-driven path only sees meta-log events created AFTER its engine snapshot knows the rule. The s3-tests CI scenario PUTs objects first, then PUTs the lifecycle config, so by the time the engine refresh picks up the new bucket the object events have already been seen-and-dropped (BucketActionKeys returned empty for the bucket). Wire bootstrap.Walk into the shell command: - bucketBootstrapper tracks buckets seen so far. kickOffNew spawns one loop goroutine per fresh bucket. - Each goroutine re-walks the bucket every walkInterval (defaults to the same value as -refresh, i.e. 2s in s3tests CI, 5m in prod) and feeds each entry through bootstrap.Walk; due actions dispatch via a direct LifecycleDelete RPC. Not-yet-due entries are silently skipped and picked up on a later iteration once they age past their (rescaled or real) threshold. - LifecycleDelete is called with no expected_identity; the server-side identityMatches treats nil as "skip CAS", which is the right call for bootstrap (the bootstrap entry doesn't carry chunk fid / extended hash anyway). The dispatcher's pkg-private toProtoActionKind is duplicated in the shell file rather than exported, since the shape is six lines and the reverse import would pull a proto dep into the s3lifecycle root. * refactor(s3/lifecycle): hoist bucket bootstrapper into scheduler pkg The shell command got the backfill in the previous commit but the worker plugin (weed/worker/tasks/s3_lifecycle/handler.go) drives Scheduler.Run directly and missed it — same root cause: the reader-driven path only sees events created after the rule lands, so a daily cron picking up a freshly-PUT rule wouldn't expire any pre-rule object. Move the looping bucket walker into scheduler.BucketBootstrapper: - Scheduler.Run now constructs one and calls KickOffNew on every engine refresh. Per-bucket goroutines re-walk every BootstrapWalkInterval (defaults to RefreshInterval — 5m in prod, 2s under s3tests). - The shell command consumes the same struct instead of its own copy so the two paths can't drift in semantics. * refactor(s3/lifecycle): walk-once + schedule via event injection Previous per-bucket walker re-listed every WalkInterval forever. For a bucket with N objects under a long rule, the worker did O(N * runtime / walkInterval) listings even when nothing was newly due — way too much for production-scale buckets. New approach: walk each bucket exactly once on first sight, synthesize one reader.Event per existing entry, push it onto Pipeline.events. Router.Route builds a Match with DueTime=mtime+delay; future-due matches sit in the per-shard Schedule and fire when their DueTime arrives. Currently-due matches fire on the very next dispatch tick. Wiring: - dispatcher.Pipeline lifts its events channel into a struct field with sync.Once init, and exposes InjectEvent(ctx, ev). Reader no longer closes the channel — the dispatch goroutine exits on runCtx cancellation, which works the same as channel-close did. - scheduler.BucketBootstrapper drops the WalkInterval ticker. KickOffNew spawns one walker goroutine per fresh bucket; the goroutine lists, synthesizes events, then exits. - scheduler.Scheduler builds its pipelines up front and exposes a pipelineFanout (shard -> Pipeline) as the EventInjector, so a multi- worker scheduler routes each synthesized event to the pipeline that owns its shard. - Shell command's single-pipeline path passes pipeline.InjectEvent directly. Synthesized events carry TsNs=0; dispatcher.advance treats that as a no-op so the reader's persisted cursor isn't ratcheted past unprocessed meta-log events. Identity (HeadFid + ExtendedHash) is still computed from the real filer entry, so the server's identity-CAS catches an overwrite between bootstrap and dispatch. debug(s3tests): make lifecycle worker progress visible in CI logs The previous CI failure dumped an empty $LC_LOG even though the worker was running. Two reasons: 1. weed shell suppresses glog by default (logtostderr / alsologtostderr set to false). Pass `-debug` so the bootstrapper's V(0) lines reach stderr instead of disappearing into /tmp/weed..log. 2. cleanup used `kill -9` which skips Go's stdout flush. SIGTERM first with a 1s grace, then SIGKILL the holdout, then read the log. While here: bump the bootstrap walker's two informational logs to V(0) so the diagnosis from CI doesn't require -v=1 on the worker. fix(s3/lifecycle/dispatcher): refresh snap on every event Pipeline.Run captured snap at startup and only refreshed it on the dispatch tick. With bootstrap event injection, the walker pushes events seconds after engine.Compile sees the bucket — typically WITHIN the same dispatch interval. Routing against the cached (empty) snap then silently dropped every match because BucketActionKeys returned nil for the bucket-not-yet-in-snapshot case. Re-fetch on each event. Engine.Snapshot is an atomic.Pointer.Load, so the cost is negligible. The dispatch-tick branch keeps using a fresh local read for its own loop, so its semantics are unchanged.	2026-05-08 17:29:47 -07:00
Chris Lu	7f254e158e	feat(worker/s3_lifecycle): plugin handler with admin UI config (#9362 ) * feat(s3/lifecycle): scheduler — N pipelines over an even shard split Scheduler.Run spawns Workers Pipeline goroutines plus one engine-refresh ticker. Each worker owns a contiguous AssignShards(idx, total) slice of [0, ShardCount) and runs Pipeline.Run with EventBudget bounding each iteration; brief RetryBackoff between iterations avoids hot-loop on errors. The refresh ticker rebuilds the engine snapshot from the filer's bucket configs every RefreshInterval. LoadCompileInputs / IsBucketVersioned / AllActivePriorStates are exported from a configload.go sibling so the shell command can move to this shared implementation in a follow-up. * refactor(shell): reuse scheduler.LoadCompileInputs in run-shard Drop the local copies of loadLifecycleCompileInputs / isBucketVersioned / allActivePriorStates / lifecycleParseError that the new scheduler package now exports. Same behavior, one source of truth. * feat(worker/s3_lifecycle): plugin handler with admin UI config Registers a JobHandler for s3_lifecycle via pluginworker.RegisterHandler. Admin pulls the descriptor over the worker plugin gRPC and renders the AdminConfigForm + WorkerConfigForm in the existing UI: Admin form (cluster shape): - workers (1..16, default 1) - s3_grpc_endpoints (comma list) Worker form (operational tuning): - dispatch_tick_ms (default 5000) - checkpoint_tick_ms (default 30000) - refresh_interval_ms (default 300000) - event_budget (default 0 = unbounded) Detect emits a single proposal whenever S3 endpoints + filer addresses are configured. MaxExecutionConcurrency=1 so admin only ever runs one lifecycle daemon per worker; a fresh proposal next cycle restarts it if the prior Execute exits. Execute dials the configured S3 endpoint + filer, builds a scheduler.Scheduler with the parsed config, and runs it until ctx cancellation. Reuses the existing scheduler / dispatcher / reader / engine packages — the handler is the thin glue that parses descriptor values and wires the long-running daemon. * proto(plugin): add s3_grpc_addresses to ClusterContext So workers can dial s3 servers discovered by the master rather than a hand-typed list in the admin form. * feat(admin): populate ClusterContext.s3_grpc_addresses from master ListClusterNodes(S3Type) returns the live S3 servers; the plugin scheduler now hands these to job handlers alongside filer/volume addresses. * feat(worker/s3_lifecycle): discover s3 endpoints from cluster context Drop the s3_grpc_endpoints admin form field and read the master-supplied ClusterContext.S3GrpcAddresses instead. Operators no longer maintain a hand-typed list, and a stale entry self-heals when the master's view updates. * feat(worker/s3_lifecycle): time-based runtime cap, friendlier cadence units - dispatch_tick_minutes (was *_ms): minutes is the natural granularity for a daily batch; default 1 minute. - checkpoint_tick_seconds: seconds for the durable cursor write; default 30 seconds. - refresh_interval_minutes: minutes for the engine snapshot rebuild. - max_runtime_minutes replaces event_budget. Each daily run is bounded by wall clock — typical run wraps in well under an hour because the cursor persists and the meta-log streams fast. Default 60 minutes. - AdminRuntimeDefaults.DetectionIntervalSeconds = 86400 so the admin schedules one job per day.	2026-05-08 10:30:02 -07:00
Chris Lu	85abf3ca88	feat(shell): s3.lifecycle.run-shard + integration test (#9361 ) * feat(shell): s3.lifecycle.run-shard for manual Phase 3 dispatch Subscribes to the filer meta-log filtered to one (bucket, key-prefix-hash) shard, routes events through the compiled lifecycle engine, and dispatches due actions to the S3 server's LifecycleDelete RPC. Persists the per-shard cursor to /etc/s3/lifecycle/cursors/shard-NN.json so subsequent runs resume. Operator-runnable harness for end-to-end Phase 3 validation while the plugin-worker auto-scheduler is still pending. EventBudget bounds a single invocation; flags expose dispatch + checkpoint cadence. Discovers buckets by walking the configured DirBuckets path and reading each bucket entry's Extended[s3-bucket-lifecycle-configuration-xml] through lifecycle_xml.ParseCanonical. All compiled actions are seeded BootstrapComplete=true so the run dispatches whatever fires immediately; production bootstrap walks set this incrementally per bucket. * test(s3/lifecycle): integration test driving the run-shard shell command Spins up 'weed mini', creates a bucket with a 1-day expiration on a prefix, PUTs the target object, then rewrites the entry's Mtime via filer UpdateEntry to 30 days ago. Runs 's3.lifecycle.run-shard' for every shard via 'weed shell' subprocess and asserts the backdated object is deleted within 30s, and the in-prefix-but-recent object remains. The S3 API rejects Expiration.Days < 1, so 'wait a day' is unworkable. Backdating via the filer's gRPC sidesteps that constraint while still exercising the real Reader -> Router -> Schedule -> Dispatcher -> LifecycleDelete RPC path end-to-end. Wires a new s3-lifecycle-tests job into s3-go-tests.yml. The test runs all 16 shards because ShardID(bucket, key) is hash-based and the test shouldn't couple to that detail; running every shard keeps the test independent of the hash function. * fix(shell/s3.lifecycle.run-shard): address review findings - Reject negative -events explicitly. Help text already defines 0 as unbounded; negative budgets created ambiguous behavior in pipeline.Run. - Bound the gRPC dial with a 30s timeout instead of context.Background() so an unreachable S3 endpoint doesn't hang the shell. - Paginate the bucket listing in loadLifecycleCompileInputs. SeaweedList takes a single-RPC limit; the prior 4096 silently dropped buckets past that page on large clusters. Loop with startFrom until a page comes back short. - Surface parse errors instead of swallowing them. Buckets with malformed lifecycle XML now print the first three errors verbatim and a count for the rest, so an operator running this command for diagnostics can find what's wrong. * feat(shell/s3.lifecycle.run-shard): -shards range/set with one subscription Adds -shards "lo-hi" or "a,b,c" to the manual run command and threads the same model through Reader and Pipeline. - reader.Reader gains ShardPredicate (func(int) bool) and StartTsNs; ShardID stays for the single-shard short form. Event carries the computed ShardID so consumers can route per-shard without rehashing. - dispatcher.Pipeline gains Shards []int. When set, Run holds one Cursor + Schedule + Dispatcher per shard, opens one filer SubscribeMetadata stream with a predicate covering the whole set, and routes events into the matching shard's schedule from a single dispatch goroutine — no per-shard goroutine fan-out. - shell command parses -shard or -shards (mutually exclusive), formats progress messages with a contiguous-range label when applicable, and validates against ShardCount. Integration test now uses -shards 0-15 (one subprocess invocation) instead of a 16-iteration loop. * fix(s3/lifecycle): allow Reader with StartTsNs=0 + Cursor=nil The reader rejected the legitimate 'fresh subscription from epoch' state when called from a fresh Pipeline.Run on a multi-shard worker (no cursor file yet, all shards' MinTsNs=0). The downstream SubscribeMetadata call handles SinceNs=0 fine; the up-front check was over-defensive and broke the auto-scheduler completely (CI showed 5-second-cadence retries with this exact error). * fix(s3/lifecycle): schedule from ModTime not eventTime A backdated or out-of-band entry update has eventTime ≈ now while ModTime is far in the past; eventTime+Delay would push the dispatch into the future even though the rule already fires. ModTime+Delay is the correct fire moment. The dispatcher's identity-CAS still catches drift between schedule and dispatch. * fix(s3/lifecycle): -runtime cap on run-shard so it exits on quiet shards The CI integration test sets -events 200 expecting the subprocess to return after 200 in-shard events. But -events counts only events that pass the shard filter; the test produces ~5 such events (bucket create, lifecycle PUT, two object PUTs, mtime backdate), so the reader stays in stream.Recv forever and runShellCommand hangs the test deadline. - weed/shell/command_s3_lifecycle_run_shard.go: add -runtime D flag. When > 0, Pipeline.Run runs under context.WithTimeout(D); on expiry the reader/dispatcher drain cleanly and the cursor saves. - weed/s3api/s3lifecycle/dispatcher/pipeline.go: treat context.DeadlineExceeded the same as context.Canceled at exit (both are graceful shutdown signals). * test(s3/lifecycle): pass -runtime 10s to run-shard Pair with the new -runtime flag so the subprocess exits cleanly after 10s instead of waiting for an event budget that never lands on quiet shards. * refactor(s3/lifecycle): extract HashExtended to s3lifecycle pkg The worker's router needs the same length-prefixed sha256 of the entry's Extended map; pulling it out of the s3api private file lets both sides import it. * fix(s3/lifecycle): worker captures ExtendedHash for identity-CAS Without this, the dispatcher sends ExpectedIdentity.ExtendedHash = nil while the live entry on the server has a non-nil hash, so every dispatch returns NOOP_RESOLVED:STALE_IDENTITY and nothing is ever deleted. * fix(s3/lifecycle): identity HeadFid via GetFileIdString Meta-log events go through BeforeEntrySerialization, which clears FileChunk.FileId and writes the Fid struct instead. Reading .FileId directly returns "" on the worker side while the server's freshly fetched entry still has a populated string, so the identity-CAS would mismatch and every expiration ended in NOOP_RESOLVED:STALE_IDENTITY. * fix(s3/lifecycle): treat gRPC Canceled/DeadlineExceeded as graceful exit errors.Is doesn't unwrap a gRPC status error back to the stdlib ctx errors, so a subscription that ends because runCtx was canceled was being logged as a fatal reader error. Check status.Code as well so the shell's -runtime cap exits cleanly. * fix(test/s3/lifecycle): pass the gRPC port (not HTTP) to run-shard run-shard's -s3 flag dials the LifecycleDelete gRPC service, which listens on s3.port + 10000. The integration test was passing the HTTP port instead, so the dispatcher's RPC just timed out and the shell command exited under -runtime with no work done. * chore(test/s3/lifecycle): drop emoji from Makefile output * docs(test/s3/lifecycle): correct '-shards 0-15' wording * fix(s3/lifecycle): reject out-of-range shard IDs in Pipeline.Run The shell's parseShardsSpec already validates, but a programmatic caller (scheduler, future worker config) shouldn't be able to silently produce no-op states by passing -1 or 99. * fix(s3/lifecycle): bound drain + final-save with their own timeouts Shutdown was using context.Background, so a stuck dispatcher RPC or filer save could keep Pipeline.Run from ever returning. * fix(test/s3/lifecycle): drop self-killing pkill in stop-server The pkill pattern \"weed mini -dir=...\" is also in the running shell's argv (it's the recipe body), so pkill -f matches its own bash and the recipe exits with Terminated. CI test job passed but the cleanup step failed with exit 2. The PID file is sufficient on its own. * docs(test/s3/lifecycle): document S3_GRPC_ENDPOINT env var	2026-05-08 09:59:10 -07:00
Chris Lu	35e3fe89bc	feat(s3/lifecycle): filer-backed cursor Persister + drop BlockerStore (#9358 ) * feat(s3/lifecycle): filer-backed cursor Persister FilerPersister persists per-shard cursor maps as JSON to /etc/s3/lifecycle/cursors/shard-NN.json via filer.SaveInsideFiler. One file per shard keeps Save atomic — the filer writes the entry in a single mutation, so a crash mid-write doesn't leak partial state. Pipeline.Run loads on start; the periodic checkpoint and graceful-shutdown save go through this implementation. A small FilerStore interface wraps the SeaweedFilerClient surface the persister needs, so tests inject an in-memory fake instead of mocking the full gRPC client. * refactor(s3/lifecycle): drop BlockerStore — durable cursor IS the block A frozen cursor doesn't advance, so the durable cursor (FilerPersister) encodes the blocked state on its own. On worker restart the reader re-encounters the poison event at MinTsNs, the dispatcher walks the same retry budget to BLOCKED, and the cursor freezes at the same EventTs. Other in-flight events between freeze tsNs and prior cursor positions self-resolve via NOOP_RESOLVED (STALE_IDENTITY) since the underlying objects were already deleted on the prior pass. Removed: - BlockerStore interface + InMemoryBlockerStore + BlockerRecord - Dispatcher.Blockers + Dispatcher.ReplayBlockers - the BlockerStore.Put call in handleBlocked - Pipeline.Blockers field + the ReplayBlockers call on startup Added a TestDispatchRestartReFreezesNaturally that pins the self-recovery property: a fresh Dispatcher with a fresh Cursor, fed the same poison event, reaches the same frozen state at the same EventTs without any durable blocker store. Operator visibility: a cursor whose MinTsNs hasn't advanced is the signal — surfaced via the durable cursor file. * refactor(filer): SaveInsideFiler accepts ctx ReadInsideFiler already takes ctx; SaveInsideFiler used context.Background() internally and silently dropped the caller's ctx. Symmetric API now; cancellation/deadlines propagate through LookupEntry / CreateEntry / UpdateEntry. Mechanical update of all callers — most pass context.Background() since the existing call sites have no ctx in scope. * fix(s3/lifecycle): deterministic order in cursor save Iterating Go maps yields random order, so json.Encode produced a different byte sequence on each save even when the state hadn't changed. Sort entries by (Bucket, ActionKind, RuleHash) before encoding so the on-disk file diffs cleanly. New test pins byte-identical output across two saves of the same map. * fix(s3/lifecycle): log reason when freezing cursor in handleBlocked handleBlocked dropped the reason via _ = reason with a comment claiming the caller logged it; none of the three callers do. A frozen cursor is the only surface where the operator finds out something stuck, so the reason has to land somewhere. glog.Warningf with shard, key, eventTs, and the original reason — same shape the rest of the package uses.	2026-05-07 17:45:04 -07:00
Chris Lu	b9bf45cb2e	fix(shell): scope volume.fsck filer walk when -volumeId selects one bucketed collection (#9347 ) * fix(shell): scope volume.fsck filer walk to the bucket when -volumeId selects one bucketed collection Closes #9345. -volumeId only filtered which volume .idx files were pulled; the filer-side BFS still walked from "/", printing every directory under -v and making it look like the flag was ignored. When all requested volumes share a single non-empty collection that maps to an existing <bucketsPath>/<collection> directory, restrict the BFS root to that bucket. Empty-collection volumes or multi-collection selections fall back to the full walk, since chunks for those can live anywhere. * trim comments * address review: collapse getCollectFilerFilePath; unshadow receiver in loop	2026-05-07 10:04:01 -07:00
Chris Lu	1c0e24f06a	fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335 ) * fix(volume): don't fatal on missing .idx for remote-tiered volume A .vif left behind without its .idx (orphaned by a crashed move, partial copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the whole volume server down on boot, killing every healthy volume on it too. For remote-tiered volumes treat it as a per-volume load error so the server can come up and the operator can clean up the stray .vif. Refs #9331. * fix(balance): skip remote-tiered volumes in admin balance detection The admin/worker balance detector had no equivalent of the shell-side guard ("does not move volume in remote storage" in command_volume_balance.go), so it scheduled moves on remote-tiered volumes. The "move" copies .idx/.vif to the destination and then calls Volume.Destroy on the source, which calls backendStorage.DeleteFile — deleting the remote object the destination's new .vif now points at. Populate HasRemoteCopy on the metrics emitted by both the admin maintenance scanner and the worker's master poll, then drop those volumes at the top of Detection. Fixes #9331. * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix(volume): keep remote data on volume-move-driven delete The on-source delete after a volume move (admin/worker balance and shell volume.move) ran Volume.Destroy with no way to opt out of the remote-object cleanup. Volume.Destroy unconditionally calls backendStorage.DeleteFile for remote-tiered volumes, so a successful move would copy .idx/.vif to the destination and then nuke the cloud object the destination's new .vif was already pointing at. Add VolumeDeleteRequest.keep_remote_data and plumb it through Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The balance task and shell volume.move set it to true; the post-tier-upload cleanup of other replicas and the over-replication trim in volume.fix.replication also set it to true since the remote object is still referenced. Other real-delete callers keep the default. The delete-before-receive path in VolumeCopy also sets it: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. Refs #9331. * test(storage): in-process remote-tier integration tests Cover the four operations the user is most likely to run against a cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by registering a local-disk-backed BackendStorage as the "remote" tier and exercising the real Volume / DiskLocation / EC encoder code paths. Locks in: - Destroy(keepRemoteData=true) preserves the remote object (move case) - Destroy(keepRemoteData=false) deletes it (real-delete case) - Vacuum/compact on a remote-tier volume never deletes the remote object - EC encode requires the local .dat (callers must download first) - EC encode + rebuild round-trips after a tier-down Tests run in-process and finish in under a second total — no cluster, binary, or external storage required. * fix(rust-volume): keep remote data on volume-move-driven delete Mirror the Go fix in seaweed-volume: plumb keep_remote_data through grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume → Volume.destroy, and skip the s3-tier delete_file call when the flag is set. The pre-receive cleanup in volume_copy passes true for the same reason as the Go side: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. The Rust loader already warns rather than fataling on a stray .vif without an .idx (volume.rs load_index_inmemory / load_index_redb), so no counterpart to the Go fatal-on-missing-idx fix is needed. Refs #9331. * fix(volume): preserve remote tier on IO-error eviction; fix EC test target Two review nits: - Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored volumes with keepRemoteData=false, so a transient local fault on a remote-tiered volume would also nuke the cloud object. Track the delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile() for IO-error evictions; TTL-expired evictions still pass false. - TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but called them "parity" — by the klauspost/reedsolomon convention shards 0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1 are parity. Switch the loop to delete the parity range so the intent matches the indices. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-05-06 15:19:43 -07:00
qzhello	60c76120fc	fix(shell): use exact match for volume.balance -racks/-nodes filter (#9279 ) * fix(shell): correct volume.list -writable filter unit and comparison * fix(shell): correct volume.list -writable filter unit and comparison * chore(shell): fix typo in EC shard helper param names * fix(shell): use exact match for volume.balance -racks/-nodes filter The old strings.Contains-based filter quietly included any id that was a substring of the user-supplied flag value (e.g. -racks=rack10 also matched rack1). Replace it with an exact-match set parsed from the comma-separated flag value, and add regression tests for both -racks and -nodes paths. Also fix a small typo in the "remote storage" error returned by maybeMoveOneVolume. * fix(shell): use exact match for volume.balance -racks/-nodes filter The old strings.Contains-based filter quietly included any id that was a substring of the user-supplied flag value (e.g. -racks=rack10 also matched rack1). Replace it with an exact-match set parsed from the comma-separated flag value, and add regression tests for both -racks and -nodes paths. Also fix a small typo in the "remote storage" error returned by maybeMoveOneVolume. * refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers	2026-04-29 10:19:16 -07:00
qzhello	108e42fb8b	chore(shell): fix typo in EC shard helper param names (#9277 ) * fix(shell): correct volume.list -writable filter unit and comparison * fix(shell): correct volume.list -writable filter unit and comparison * chore(shell): fix typo in EC shard helper param names	2026-04-28 23:09:26 -07:00
Chris Lu	294f7c3d04	shell: expand `~` in local file path arguments (#9265 ) * shell: expand `~` in local file path arguments The weed shell parses commands itself instead of going through an OS shell, so a path like `~/Downloads/foo.meta` was passed verbatim to `os.Open`, which fails because no `~` directory exists. Users had to spell out absolute home paths in every command. Add an `expandHomeDir` helper that resolves a leading `~` or `~/...` to the user's home directory, and run user-supplied local file paths in the affected shell commands through it: fs.meta.load (positional file) fs.meta.save (-o) fs.meta.changeVolumeId (-mapping) s3.iam.export (-file) s3.iam.import (-file) s3.policy (-file) s3tables.bucket (-file) s3tables.table (-file, -metadata) volume.fsck (-tempPath) Filer-namespace path flags (`-dir`, `-path`, `-locationPrefix`, etc.) are unaffected; they live in the filer, not on the local FS. * shell: reuse util.ResolvePath instead of a new helper util.ResolvePath already does tilde expansion; drop the local expandHomeDir helper and route every shell call site through it.	2026-04-28 12:30:13 -07:00
qzh	21fadf5582	fix(shell): correct volume.list -writable filter unit and comparison (#9231 ) * fix(shell): correct volume.list -writable filter unit and comparison * fix(shell): correct volume.list -writable filter unit and comparison	2026-04-26 22:20:46 -07:00
Chris Lu	da2e90aefd	fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9207 ) * fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9139) A single file with invalid-UTF-8 bytes in its name (e.g. a GNOME Trash "partial" like \x10\x98=\\\x8a\x7f.trashinfo.9a51454f.partial) made every FUSE-initiated filer RPC fail with: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 and then produced an avalanche of "connection is closing" errors on unrelated LookupEntry / ReadDirAll / UpdateEntry calls, causing the volume-server QPS dips reported in #9139. Root cause is twofold: 1. Proto3 `string` fields require valid UTF-8, but the FUSE kernel passes raw name bytes. Create/Mknod/Mkdir/Unlink/Rmdir/Rename/Lookup/Link/ Symlink all forwarded those bytes directly into CreateEntryRequest.Name, DeleteEntryRequest.Name, StreamRenameEntryRequest.{Old,New}Name and Entry.Name. saveDataAsChunk also copied the FullPath into AssignVolumeRequest.Path unchecked. 2. When the marshal failed, shouldInvalidateConnection treated the resulting codes.Internal as a connection problem and dropped the shared cached ClientConn — canceling every other in-flight RPC on it. Fix: - Add sanitizeFuseName (strings.ToValidUTF8 with '?' replacement, matching util.FullPath.DirAndName) and make checkName return the sanitized name. Apply at every FUSE entry point that passes a name to the filer RPC, including Unlink/Rmdir (which did not previously call checkName) and both oldName/newName in Rename. Add a backstop scrub for AssignVolumeRequest.Path so async flush paths cannot reintroduce invalid bytes from a pre-sanitization cached FullPath. - In weed/pb.shouldInvalidateConnection, detect client-side marshal errors via the gRPC library's "error while marshaling" prefix and return false: the connection is healthy, only the request is bad. Refs: https://github.com/seaweedfs/seaweedfs/issues/9139#issuecomment-4301184231 * fix(mount,util): use '_' for invalid-UTF-8 replacement (URL-safe) Sanitized filenames flow downstream into HTTP URLs (volume-server uploads, filer HTTP API, S3/WebDAV gateways). '?' is the URL query-string delimiter and would split the path the first time the name lands in one, so swap every invalid-UTF-8 replacement to '_'. This covers the two pre-existing sites in weed/util/fullpath.go as well, keeping all paths sanitized the same way. * refactor(pb): detect client-side marshal errors via errors.As, not substring Replace the raw `strings.Contains(err.Error(), ...)` check with a type-based carve-out: use errors.As against the `GRPCStatus() Status` interface to pull the original Status out of any fmt.Errorf("...: %w") wrapping, then match the library-owned "grpc:" prefix on that Status's Message. Why not errors.Is against a proto-level sentinel: gRPC's encode() collapses the inner proto error with "%v" (stringification) before wrapping it in a Status, so the original error type does not survive into the caller. The Status itself is the structural signal that does survive. Why not status.FromError: when the caller wraps the Status error with fmt.Errorf("...: %w", ...), status.FromError rewrites Status.Message with the full err.Error() of the outermost wrapper, which defeats a prefix check on the library-owned message. errors.As gives us the original Status whose Message is still verbatim from the gRPC library. A new test asserts that a plain errors.New("grpc: error while marshaling: …") — i.e. the same text attached to something that is NOT a gRPC status — does not short-circuit invalidation, so we never silently keep a cached connection alive based on a coincidental substring match. refactor(util): centralize UTF-8 sanitization; add FullPath.Sanitized Addresses review feedback on PR #9207. Nitpick: every invalid-UTF-8 replacement across the codebase (DirAndName, Name, mount.sanitizeFuseName, the weedfs_write.go backstop) now goes through a single util.SanitizeUTF8Name helper, so the replacement char ('_' — URL-safe) is chosen in one place. Outside-diff: three proto fields took raw FullPath strings that could break marshaling if an entry ever carried invalid UTF-8 (CreateEntryRequest.Directory in Mkdir, DeleteEntryRequest.Directory in Unlink, AssignVolumeRequest.Path in command_fs_merge_volumes). The reviewer's suggested fix — using DirAndName() — would have silently changed Directory from parent to grandparent, because DirAndName sanitizes only the trailing component. Added FullPath.Sanitized(), which scrubs every component, and applied it at the three sites. Exposure is narrow in practice (FUSE-boundary sanitization and the gRPC-side isClientSideMarshalError carve-out already cover the #9139 cascade), but the defense-in-depth is cheap and consistent with the existing AssignVolume backstop. New tests in weed/util/fullpath_test.go document: - SanitizeUTF8Name: valid UTF-8 passes through unchanged; invalid bytes become '_' (not '?', which is URL-special). - FullPath.Sanitized: scrubs bytes in any component, not just the last. - FullPath.DirAndName: dir remains raw on purpose — callers needing a clean full path must use Sanitized(). The test pins this behavior so it is not accidentally "fixed" in a way that changes the (dir, name) semantics callers depend on.	2026-04-23 19:17:35 -07:00
Chris Lu	7364f148bd	fix(s3/shell): factor EC volumes into bucket size metrics and collection.list (#9182 ) * fix(s3/shell): include EC volumes in bucket size metrics and collection.list S3 bucket size metrics exported to Prometheus (and fed through stats.UpdateBucketSizeMetrics) are computed by collectCollectionInfoFromTopology, which only walked diskInfo.VolumeInfos. As soon as a volume was encoded to EC it dropped out of every aggregate, so Grafana showed bucket sizes shrinking while physical disk usage kept climbing. The shell helper collectCollectionInfo — used by collection.list and s3.bucket.quota.enforce — had the same gap, with the EC branch left as a commented-out TODO. Fold EC shards into both paths using the same approach the admin dashboard already uses (PR #9093): - PhysicalSize / Size sum across shard holders: EC shards are node-local (not replicas), so per-node TotalSize() and MinusParityShards().TotalSize() sum to the whole-volume physical and logical sizes respectively. - FileCount is deduped via max across reporters (every shard holder reports the same .ecx count; a slow node with a not-yet-loaded .ecx reports 0 and must not pin the aggregate). - DeleteCount is summed (each delete tombstones exactly one node's .ecj). - VolumeCount increments once per unique EC volume id. Adds regression tests covering pure-EC, mixed regular+EC, and the slow-reporter FileCount dedupe case. Refs #9086 * Address PR review feedback: EC size helpers, composite key, VolumeCount dedupe - Add EcShardsTotalSize / EcShardsDataSize helpers in the erasure_coding package that walk the shard bitmap directly instead of materializing a ShardsInfo and copying it via MinusParityShards(). Keeps the DataShardsCount dependency encapsulated in one place and avoids the per-shard allocation/copy overhead in the metrics hot path. - Switch shell collectCollectionInfo ecVolumes map to a composite {collection, volumeId} key, matching the bucket_size_metrics collector and defending against any cross-collection volume id aliasing. - Dedupe VolumeCount in shell addToCollection by volume id so regular volumes aren't counted once per replica presence. Aligns the shell's collection.list output with the S3 metrics collector and the EC branch, all of which now report logical volume counts. - Add unit tests for the new helpers and for the regular-volume VolumeCount dedupe. * Parameterize EcShardsDataSize with dataShards for custom EC ratios Add a dataShards parameter to EcShardsDataSize so forks with per-volume ratio metadata (e.g. the enterprise data_shards field carried on an extended VolumeEcShardInformationMessage) can pass the configured value and get accurate logical sizes under custom EC policies like 6+3 or 16+6. Passing 0 or a negative value falls back to the upstream DataShardsCount default, which is correct for the fixed 10+4 layout — so OSS callers in s3api and shell pass 0 and keep their current behavior. Added table cases covering the custom 6+3 and 16+6 paths so the parameterization is pinned by tests.	2026-04-21 20:17:42 -07:00
Chris Lu	49f62df2cf	fix(shell): mergeVolumes hard-link safety and cleaner cleanup logging (#9163 ) * fix(shell): skip hard-linked entries in fs.mergeVolumes Hard-linked entries share a chunk list with their siblings, but the filer's UpdateEntry only rewrites one entry at a time. Moving a chunk here leaves every other hard-linked sibling pointing at a fid that either gets deleted by the filer's own garbage step after UpdateEntry or by deleteMovedSourceNeedles (#9160) — either way, the siblings end up with dangling references. Skip the entry with a visible log line so operators know the file was bypassed and can handle it explicitly (copy-then-unlink, or dedup before merge). Detected via entry.HardLinkId being non-empty, which is the same signal the filer itself uses (weed/pb/filer_pb/filer.pb.go:451). Flagged by coderabbit on #9160 post-merge. * fix(shell): mergeVolumes suppresses 404 alongside 304 in source cleanup BatchDelete also returns StatusNotFound (404) for an already-deleted needle when ReadVolumeNeedle can't find it in the cookie-check path, not only StatusNotModified (304) from DeleteVolumeNeedle returning size 0. Both are benign races against a concurrent fsck purge or a replica that already reconciled, so don't clutter the output with "delete ... not found" warnings for them. Flagged by coderabbit on #9160 post-merge. * fix(shell): mergeVolumes merges hard-linked files via dedup on HardLinkId Hard-linked siblings share one chunk list through a KV blob keyed by HardLinkId (see weed/filer/filerstore_hardlink.go). UpdateEntry's setHardLink rewrites that blob and maybeReadHardLink overrides per-entry chunks with the blob's on every read, so a single UpdateEntry propagates new fids to every sibling automatically — the previous skip-hardlinks bailout was overly conservative and left hard-linked files stuck on merge-source volumes forever. Process each HardLinkId exactly once per run with a sync.Map so BFS workers in different directories synchronize without a global lock. First sibling carries the chunk move + UpdateEntry; later siblings find the id in the map and return — preventing the real race, which is two siblings trying to re-download an already-moved source needle or double-queue the same fid for deletion. Also address the log-spam review on deleteMovedSourceNeedles: an unreachable volume server returns one error per needle, so collapse multiple failures into a single per-server line with the first error as an example.	2026-04-20 17:55:20 -07:00

1 2 3 4 5 ...

1086 Commits