* fix(shell): honor explicit fs.mergeVolumes from/to direction
mergeVolumes only ever merged a smaller volume into a larger one. When the
user named both -fromVolumeId and -toVolumeId with the source larger than the
target, the planner produced an empty plan and the command printed just
"max volume size: N MB" and moved nothing.
Build the requested pair directly when both ids are given, instead of routing
through the size-descending heuristic. Read-only, empty, and wrong-collection
endpoints are rejected with a clear error rather than a silent no-op.
* fix(shell): allow fs.mergeVolumes into an empty target volume
Merging chunks into an empty volume is valid, e.g. consolidating data into a
freshly created or recently vacuumed volume. Only reject an empty source, which
has nothing to move.
* fix(shell): reject self-map in directed mergeVolumes planner
createMergePlan with from == to returned a {vid: vid} self-merge when called
directly. Guard it in the planner so it is correct independent of the Do
entrypoint.
* s3: invalidate stale reader cache locations on chunk read failure
* filer: share the chunk-read self-heal across reader cache and streaming paths
The reader cache retry added a third copy of the invalidate-relookup-compare-retry
dance already inlined in PrepareStreamContentWithThrottler and duplicated in
retryWithCacheInvalidation. Extract retryFetchWithFreshLocations and route all
three through it, parameterized by the refetch primitive.
* filer: drop redundant completedTimeNew store in reader cache success path
startCaching already stamps completedTimeNew unconditionally before the
fetchErr branch; the second store inside the success branch is dead.
* filer: make NewReaderCache cache invalidator an explicit parameter
The variadic ...CacheInvalidator only ever read the first element, so a caller
could pass two and silently get one. Take a single explicit argument and have
the non-S3 callers pass nil.
* filer: inject reader cache chunk fetch as a struct field
Replace the process-global readerCacheFetchChunkData test seam with a
per-instance fetchChunkDataFn field defaulted in NewReaderCache, matching how
lookupFileIdFn is already wired. Tests set the field on the cache instead of
swapping a shared global.
* filer: log the location count, not full URLs, on self-heal retry
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
Concurrent bucket deletion across multiple filer replicas races on the
per-bucket DROP TABLE. The first replica drops the table; the rest hit
an undefined-table error (postgres 42P01, mysql 1051) which propagates
out of DeleteFolderChildren and panics the filer. On restart the same
pending DROP re-runs and the filer crash-loops.
Make the drop idempotent. Same defect class in all SQL backends, so fix
postgres, postgres2, mysql, mysql2, and sqlite together.
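One way to picture an idempotent drop is to treat the undefined-table error from a lost race as success. The sketch below is only an illustration: `sqlError`, `isUndefinedTable`, and `dropTableIdempotent` are hypothetical names, and the real stores may instead rely on `DROP TABLE IF EXISTS` or on driver-specific error types. The two codes are the ones named above.

```go
package main

import (
	"errors"
	"fmt"
)

// sqlError is a hypothetical stand-in for a driver error that exposes a
// backend error code (SQLSTATE for postgres, numeric code for mysql).
type sqlError struct {
	Code string
	Msg  string
}

func (e *sqlError) Error() string { return e.Code + ": " + e.Msg }

// isUndefinedTable reports whether err means the table is already gone.
// 42P01 (postgres) and 1051 (mysql) are the codes from the commit message.
func isUndefinedTable(err error) bool {
	var se *sqlError
	if !errors.As(err, &se) {
		return false
	}
	return se.Code == "42P01" || se.Code == "1051"
}

// dropTableIdempotent runs drop and treats an undefined-table error as
// success, so a replica that loses the race does not fail the delete.
func dropTableIdempotent(drop func() error) error {
	err := drop()
	if err != nil && isUndefinedTable(err) {
		return nil
	}
	return err
}

func main() {
	lostRace := func() error {
		return &sqlError{Code: "42P01", Msg: "relation does not exist"}
	}
	fmt.Println(dropTableIdempotent(lostRace) == nil)
}
```

Any other error still propagates, so a genuine failure (permissions, connection loss) is not hidden.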
* fix(ec): correct EC FULL scrub for deleted needles + shard-location cache
Addresses review findings on the EC FULL distributed scrub:
- Remote EC reads now thread Go's (bytes, is_deleted) contract. A runtime EC
delete keeps the .ecx size positive (the delete lives in .ecj/memory), so the
raw-index walk verifies the needle, and its header interval is usually remote;
the peer answers is_deleted with no payload. The scrub zero-fills that interval
(so the needle reaches read_bytes -> SizeMismatch{0} -> the delete-state
suppression), the serving direct read short-circuits to not-found, and
reconstruction EXCLUDES the shard instead of feeding zeros into Reed-Solomon.
- The walk skips size.is_deleted() (not just is_tombstone), so a -originalSize
.ecx entry (pre-encode delete) can't yield empty intervals or panic parse_header.
- Restore Go's < data_shards completeness guard (per-volume, custom-ratio aware)
and per-shard merge in the location cache instead of clobber-with-partial.
- Abort the scrub with an error on mid-scan unmount instead of a false-CLEAN.
- Hoist the refreshed location map once instead of cloning it per needle.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* feat(scrub): keep RS parity check in EC FULL until CHECKSUM lands
The per-needle FULL walk only reads live data-shard intervals, so it can't catch
bitrot in a parity shard or an unwalked cold region. Run verify_ec_shards
alongside the walk, gated on all-shards-local (single-node EC), via spawn_blocking.
A deliberate temporary divergence from Go FULL; moves to mode 4 (CHECKSUM) once
the .ecsum subsystem lands.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* feat(ec): add scrub_ec_volume_distributed (FULL EC scrub, local+remote)
Ports Go's Store.ScrubEcVolume: walk the raw .ecx, verify every needle across
local AND remote shards without decoding (report faults, don't heal), with the
#10130 deleted-needle size-mismatch suppression gated on a force flag. Reuses
the read path's lock-drop + no-reconstruct read_remote_ec_shard_interval so no
!Send store guard is held across an .await.
Walks the unmasked index (scrub_snapshot_under_lock locates from the raw
(offset, size), not locate_needle) so logically-deleted-but-present needles are
still byte-verified, matching Go. Refreshes shard locations once up front and
hard-fails on a master-lookup error rather than retrying per needle.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* feat(scrub): dispatch EC FULL (mode 2) to the distributed needle walk
FULL ran a local-only Reed-Solomon parity check; route it to the per-needle
local+remote walk instead, mirroring Go. The handler collects vids under a brief
lock then releases it: FULL self-locks per needle (it awaits remote reads),
INDEX/LOCAL re-acquire a brief lock. verify_ec_shards is retained but no longer
wired to a mode.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
The EC volumes, EC shards, and collection details pages each rendered a
repair (wrench) button for incomplete EC volumes. Its handler POSTed to a
/repair endpoint that the admin server never registers, so every click
returned "404 page not found" (the collection details page only had a
placeholder handler).
Remove the buttons and their JavaScript handlers, and regenerate the
templ output. Manual EC shard recovery remains available from weed shell
via ec.rebuild.
EC volumes do not propagate deletions to all shard indexes, so it is possible
to run scrubbing on a volume where a deleted needle is still present in the
index, or a needle deleted from the index is still present on the volume.
In either scenario, scrubbing fails with size mismatch errors.
This PR reworks the scrubbing logic so needle size mismatches are
ignored in such scenarios.
Scrubbing can still be forced to check deleted needles (e.g. to discover
index inconsistencies); this option will be exposed in RPCs and `weed shell`
in a follow-up PR.
* fix(scrub): don't flag offset-0 logical tombstones in volume scrub
A remote-tier delete records a tombstone at .idx offset 0 with no physical .dat
bytes. Full scrub double-flagged a healthy remote-tiered volume with deletes:
scrubVolumeData counted the tombstone's GetActualSize(-1)=32 toward totalRead
(want > physical .dat), and CheckIndexFile treated it as occupying [0,31] and
flagged the first live needle as overlapping. Skip offset-0 logical tombstones
from both the size reconcile and the overlap check; they are still counted for
the index-size check. Local deletes (offset != 0) are unaffected.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* fix(scrub): mirror offset-0 logical tombstone handling into Rust
Same fix as the Go volume_checking.go + idx/check.go change: Volume::scrub skips
offset-0 logical tombstones from total_read, and check_index_file excludes them
from the overlap check (still counted for the index-size check).
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* fix(ec): suppress deleted-needle size mismatch in EC LOCAL scrub
EcVolume.ScrubLocal reassembles each fully-local needle and ReadBytes-checks
it, but appended every error unconditionally. A needle the .ecx still reports
live while its reassembled on-disk header carries size 0 (delete state
disagrees between index and header) is not corruption — the LOCAL twin of the
#10130 fix for the FULL path. Suppress the ErrorSizeMismatch in that case;
genuine (non-zero) size mismatches and CRC/tail errors are still reported.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* fix(ec): mirror EC LOCAL scrub deleted-needle suppression into Rust
Same suppression as the Go EcVolume.ScrubLocal change: a NeedleError::SizeMismatch
whose on-disk header size is 0 against a live index entry is a delete-state
disagreement, not corruption.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* feat(ec): extract locate_ec_shard_needle_interval
Mirrors Go's EcVolume.LocateEcShardNeedleInterval; reused by locate_needle
and the upcoming local scrub walk.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* feat(ec): add EcVolumeShard::to_ec_shard_info
Mirrors Go's ToEcShardInfo.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* feat(ec): add EcVolume::scrub_local
Walk the .ecx and verify each needle against the locally-held shards,
reading interval-by-interval (reusing one chunk buffer); CRC-check only
fully-local needles, report short/unreadable local shards, and abort the
scan on a structural size mismatch. Mirrors Go's EcVolume.ScrubLocal.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* feat(scrub): dispatch EC LOCAL (mode 3) to scrub_local
Splits the mode 2|3 arm: FULL (2) keeps the Reed-Solomon parity check;
LOCAL (3) now runs the per-needle local-shard walk.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* refactor(scrub): extract open_index_for_scrub shared by scrub_index
Mirrors Go's openIndex, shared by ScrubIndex and the upcoming Scrub rewrite.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* fix(scrub): walk the on-disk .idx in Volume::scrub
scrub walked the deduped in-memory map, so total_read undercounted the
physical .dat on any volume with overwrites or deletes and the size
reconcile falsely flagged healthy volumes broken. Walk every .idx row
instead (matching Go's scrubVolumeData): count all rows, CRC-verify live
needles, skip deleted, and reconcile against the .dat. Holds one data-file
read lock and reads via the unlocked path, like Go's Scrub.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* feat(idx): add check_index_file mirroring Go idx.CheckIndexFile
Index-only structural check: walk the on-disk index, sort by (offset, size),
flag overlapping needles, and verify the file is a whole number of entries.
No data-file reads.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
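The structural check described above (sort by (offset, size), flag overlaps, verify a whole number of entries) can be sketched in Go as follows. This is a simplified illustration, not the actual idx.CheckIndexFile: `indexEntry`, `checkIndex`, and the 16-byte `entrySize` are assumptions, and `Size` here stands for the full on-disk extent of a needle.

```go
package main

import (
	"fmt"
	"sort"
)

// entrySize is an assumed fixed size of one index row, for illustration only.
const entrySize = 16

// indexEntry is a hypothetical, simplified index row.
type indexEntry struct {
	Offset int64
	Size   int64
}

// checkIndex returns one error per structural problem found. It never
// touches the data file.
func checkIndex(fileSize int64, entries []indexEntry) []error {
	var errs []error
	if fileSize%entrySize != 0 {
		errs = append(errs, fmt.Errorf("index size %d is not a multiple of %d", fileSize, entrySize))
	}
	// Sort a copy by (offset, size) so the caller's slice is left untouched.
	sorted := append([]indexEntry(nil), entries...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Offset != sorted[j].Offset {
			return sorted[i].Offset < sorted[j].Offset
		}
		return sorted[i].Size < sorted[j].Size
	})
	// Track the furthest end seen so far so that one large entry covering
	// several later ones flags each of them.
	var maxEnd int64
	for i, e := range sorted {
		if i > 0 && e.Offset < maxEnd {
			errs = append(errs, fmt.Errorf("entry at offset %d overlaps a previous entry ending at %d", e.Offset, maxEnd))
		}
		if end := e.Offset + e.Size; end > maxEnd {
			maxEnd = end
		}
	}
	return errs
}

func main() {
	errs := checkIndex(32, []indexEntry{{Offset: 5, Size: 10}, {Offset: 0, Size: 10}})
	for _, err := range errs {
		fmt.Println(err)
	}
}
```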
* refactor(ec): use idx::check_index_file in EcVolume::scrub_index
Drops the inline walk/sort/overlap copy. Walks a private fd so the structural
scan never moves the shared ecx_file cursor (read positionally elsewhere).
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* fix(scrub): make Volume::scrub_index an index-only check on the on-disk .idx
INDEX mode walked the deduped in-memory map and read .dat headers — more
than the cheap-INDEX contract allows, yet missing Go's overlap and
size-multiple structural checks. Route it through idx::check_index_file so
it matches Go's Volume.ScrubIndex and the INDEX<LOCAL<FULL cost tiering holds.
Ports openIndex's zero-size-index guard (a populated .dat with an empty .idx
is corruption) and takes the data-file read lock for a consistent snapshot.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* fix(ec): cap EcShardConfig at MAX_SHARD_COUNT, not TOTAL_SHARDS_COUNT
read_ec_shard_config rejected any .vif ratio summing past 14 shards and
silently fell back to 10/4, so wider EC volumes ran against the wrong
shard set. Match Go's MaxShardCount(32) bound.
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
* docs(ec): correct stale 0..14 shard-count comments
Claude-Session: https://claude.ai/code/session_015EE9Sc9EvNp8BCVva4RKdo
Master /dir/lookup JSON omits publicUrl when empty (Go json omitempty).
The Rust volume server required the field, so serde failed with "lookup
parse failed: error decoding response body" and cross-DC replicated writes
failed.
Default publicUrl to empty, fall back to url for peer filtering, and
normalize addresses with to_http_address before excluding the local peer
(so host:port.grpcPort forms do not match self incorrectly).
* test(seaweed-volume): cover type=replicate fan-out writes
A holder must accept a replicated copy and store it locally without
re-replicating. Covers raw and multipart bodies, and a multi-copy
volume where re-replication would otherwise reach the master.
* test(seaweed-volume): use port 0 for the dead-master address
Connecting to port 0 is refused at the socket layer immediately, so the
plain-write fan-out path fails fast instead of risking a connect-timeout
hang where port 1 is filtered.
* feat: add collection pattern to delete empty volumes
Co-authored-by: Codex <noreply@openai.com>
* shell: match collection pattern with wildcard matcher
Use wildcard.MatchesWildcard in the shared collection-pattern helper,
matching command_volume_fix_replication's matchCollectionPattern. The
flag only advertises '*' and '?', which is exactly what the matcher
supports.
---------
Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* s3: support AWS object form for bucket policy Principal, add NotPrincipal
Bucket policy statements only accepted a bare string or array of strings for
the Principal element, so the AWS-documented object form was rejected:
"Principal": { "AWS": "arn:aws:iam::123456789012:root" }
"Principal": { "AWS": ["arn:...", "999999999999"] }
Add a PolicyPrincipal type that parses the bare string, the bare array
(retained for backward compatibility), and the object form keyed by AWS,
Service, Federated or CanonicalUser (each value a string or array). All keyed
values are flattened for principal matching, and the original JSON is preserved
so PutBucketPolicy/GetBucketPolicy returns the exact shape submitted - keeping
infrastructure-as-code tools (Terraform, Ansible) idempotent.
Also add NotPrincipal support (a statement applies to every principal except the
ones named), compiled and evaluated in both policy evaluators, and reject
statements that specify both Principal and NotPrincipal.
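A minimal sketch of how a type like PolicyPrincipal could accept all three shapes while keeping the submitted JSON for round-tripping. The field names, `stringOrList`, `parsePrincipal`, and the error messages are assumptions made for illustration; the real type may differ. It also rejects unsupported keys and empty values, in line with the review follow-up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PolicyPrincipal (sketch) keeps the submitted JSON and a flattened list.
type PolicyPrincipal struct {
	raw    json.RawMessage // original bytes, returned unchanged on marshal
	values []string        // flattened principals used for matching
}

var principalKeys = []string{"AWS", "Service", "Federated", "CanonicalUser"}

// stringOrList decodes either "x" or ["x", "y"].
func stringOrList(data []byte) ([]string, error) {
	if string(data) == "null" {
		return nil, fmt.Errorf("principal value must not be null")
	}
	var one string
	if err := json.Unmarshal(data, &one); err == nil {
		return []string{one}, nil
	}
	var many []string
	if err := json.Unmarshal(data, &many); err == nil {
		return many, nil
	}
	return nil, fmt.Errorf("principal value must be a string or an array of strings")
}

func (p *PolicyPrincipal) UnmarshalJSON(data []byte) error {
	flat, err := stringOrList(data) // bare string or bare array
	if err != nil {
		var obj map[string]json.RawMessage
		if err := json.Unmarshal(data, &obj); err != nil {
			return fmt.Errorf("principal must be a string, array, or object")
		}
		for _, key := range principalKeys {
			v, ok := obj[key]
			if !ok {
				continue
			}
			vals, err := stringOrList(v)
			if err != nil {
				return fmt.Errorf("principal %q: %w", key, err)
			}
			flat = append(flat, vals...)
			delete(obj, key)
		}
		if len(obj) > 0 {
			return fmt.Errorf("unsupported principal key")
		}
	}
	if len(flat) == 0 {
		return fmt.Errorf("principal must name at least one value")
	}
	p.raw, p.values = append(json.RawMessage(nil), data...), flat
	return nil
}

// MarshalJSON returns the shape that was submitted.
func (p PolicyPrincipal) MarshalJSON() ([]byte, error) {
	if p.raw == nil {
		return []byte("null"), nil
	}
	return p.raw, nil
}

func parsePrincipal(doc string) (*PolicyPrincipal, error) {
	p := &PolicyPrincipal{}
	if err := json.Unmarshal([]byte(doc), p); err != nil {
		return nil, err
	}
	return p, nil
}

func main() {
	p, err := parsePrincipal(`{"AWS":["arn:aws:iam::123456789012:root","999999999999"]}`)
	if err != nil {
		panic(err)
	}
	out, _ := json.Marshal(p)
	fmt.Println(p.values, string(out))
}
```

Keeping the raw bytes next to the flattened list is what lets a Get after a Put return the same shape, which is the property the commit cites for Terraform and Ansible.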
* s3: address review - validate principal object form, honor dynamic NotPrincipal
- Reject unsupported Principal object keys (only AWS/Service/Federated/
CanonicalUser) and empty values, so a form like {"AWS":[]} no longer compiles
to zero matchers and silently relies on the match-all fallback.
- Detect both Principal and NotPrincipal by field presence, not by flattened
length, so a present-but-empty field is still rejected.
- Honor dynamic (policy-variable) NotPrincipal/Principal patterns in the
compiled evaluator; previously a NotPrincipal made only of variables was
treated as absent and its exclusion bypassed.
- Add regression tests for the object-form validation and dynamic NotPrincipal.
* address review comment: remove unnecessary success and failure count
* fix: use Gather.Gather() with seeded counter for EC rebuild registration test
- Restore Gather.Gather() to verify MustRegister calls as requested in review
- Seed VolumeServerECRebuildCounter before gathering because CounterVec
only appears after at least one label value is observed
- Use correct fully-qualified metric names (SeaweedFS_volumeServer_*)
* fix: remove preflight checkEcVolumeStatus failure from ec_rebuild_total counter
ec_rebuild_total should only reflect actual rebuild execution failures
(from RebuildEcFiles / RebuildEcxFile), not scan/precheck failures in
the volume status loop. The error is still returned to the caller;
only the misleading counter increment was removed.
* address review comment: remove unnecessary observe
* label EC rebuild duration histogram by result
Without a result label, fast failures pull down the success-latency
quantiles shown on the EC Rebuild Duration panel. Make the histogram a
HistogramVec keyed by result, record success/failure through one
recordEcRebuild helper, and split the Grafana quantiles by (le, result).
* reset EC rebuild metric vecs in registration test
The HistogramVec needs a child before Gather emits it, so the test must
observe once; reset both vecs in cleanup so that sample doesn't leak into
other tests.
---------
Co-authored-by: Ubuntu User <ubuntu@example.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
An empty or truncated tasks/*.pb file unmarshals into a TaskStateFile
with a nil Task, and protobufToMaintenanceTask dereferenced it
immediately, panicking the whole admin process on startup. Guard the
nil case so the loader logs a warning and skips the bad file.
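The guard amounts to a nil check before the first dereference. A minimal sketch, with `Task`, `TaskStateFile`, `namedState`, and `loadTasks` as simplified hypothetical stand-ins for the real protobuf types and loader:

```go
package main

import "log"

// Task and TaskStateFile are simplified stand-ins for the protobuf types.
type Task struct{ ID string }

type TaskStateFile struct{ Task *Task }

// namedState pairs a decoded state file with the path it came from.
type namedState struct {
	Path  string
	State *TaskStateFile
}

// loadTasks skips state files that decoded without a task (for example an
// empty or truncated file) instead of dereferencing a nil pointer.
func loadTasks(states []namedState) []*Task {
	var tasks []*Task
	for _, s := range states {
		if s.State == nil || s.State.Task == nil {
			log.Printf("skipping %s: state file has no task", s.Path)
			continue
		}
		tasks = append(tasks, s.State.Task)
	}
	return tasks
}

func main() {
	tasks := loadTasks([]namedState{
		{Path: "tasks/good.pb", State: &TaskStateFile{Task: &Task{ID: "t1"}}},
		{Path: "tasks/empty.pb", State: &TaskStateFile{}},
	})
	log.Printf("loaded %d task(s)", len(tasks))
}
```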
Under a herd of concurrent assigns with no writable volume, Assign spun
PickForWrite for the full 10s timeout, pinning a goroutine per request and
starving the master of the cycles it needs to process growth and answer
heartbeats. When growth is the relevant remedy and already in flight, stop
spinning: if free space exists, shed with a fast retryable error so clients
back off and retry once growth lands; if the cluster is out of space, fail fast
with the real out-of-space error instead of masking it as retryable.
The gRPC shed uses ResourceExhausted, not Unavailable: operation.Assign retries
it, but the client connection layer doesn't treat it as a dead channel, so a
per-request shed across a herd doesn't tear down the shared master connection
and cancel every other in-flight assign. The HTTP dirAssignHandler sheds with
503 + Retry-After.
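On the HTTP side, the shed could look roughly like the sketch below. The sentinel errors, the `writeAssignError` and `probe` helpers, the one-second Retry-After value, and the 500 status for the out-of-space case are all assumptions for illustration; only the 503 + Retry-After pairing comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Hypothetical sentinel errors for the two outcomes described above.
var (
	errGrowthInFlight = errors.New("no writable volume yet, growth in progress")
	errOutOfSpace     = errors.New("no free volume slots")
)

// writeAssignError sheds retryable failures with 503 + Retry-After and
// reports everything else as a plain, non-retryable error.
func writeAssignError(w http.ResponseWriter, err error) {
	if errors.Is(err, errGrowthInFlight) {
		w.Header().Set("Retry-After", "1") // seconds; assumed value
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	// Assumed status for the fail-fast path; the real handler may differ.
	http.Error(w, err.Error(), http.StatusInternalServerError)
}

// probe runs the handler logic against an in-memory recorder.
func probe(err error) (int, string) {
	rec := httptest.NewRecorder()
	writeAssignError(rec, err)
	return rec.Code, rec.Header().Get("Retry-After")
}

func main() {
	code, retry := probe(errGrowthInFlight)
	fmt.Println(code, retry)
	code, retry = probe(errOutOfSpace)
	fmt.Println(code, retry == "")
}
```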
* volume server: route VolumeMarkReadonly to raft leader
After a master raft election, volume servers may still heartbeat a follower
while admin paths such as weed shell volume.mark call notifyMasterVolumeReadonly
via vs.GetMaster(). Followers reject VolumeMarkReadonly with NotLeader, which
breaks tiering and other mark-readonly workflows until the heartbeat loop
reconnects.
Resolve the leader through GetMasterConfiguration on configured -master peers
(same Leader field filer/master clients already use) before calling
VolumeMarkReadonly. When the leader differs from the heartbeat peer, update
currentMaster so the heartbeat loop converges faster.
Adds operation.LookupRaftLeaderMaster with unit tests.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: address review feedback on volume.mark raft leader routing
Do not update currentMaster during leader lookup — heartbeat owns that
field and uses stream GetLeader() to reconnect. Try the heartbeat peer
first and only resolve the raft leader after a NotLeader rejection.
Add ctx.Err() early exit and quieter logging for context cancellation.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(operation): thread the lookup timeout ctx into connection invalidation
The 5s timeout drove only the RPC; WithMasterServerClient saw the
unbounded outer ctx, so a self-inflicted timeout (slow GetMasterConfiguration
during an election) was treated as a stale channel and tore down the shared
master connection. Pass the timeout ctx into the helper so its own expiry
leaves ctx.Err() set and spares the connection.
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* fix(filer.sync.verify): sort listings client-side before merge
The merge walks both filers' directory listings in lockstep and needs
them in the same byte order. A filer before 4.32 with a locale SQL
collation lists case-insensitively while a 4.32+ peer lists byte-ordered,
so comparing two such clusters returns the same names in a different
order and the merge desyncs into spurious MISSING / ONLY_IN_B.
Buffer and sort each directory client-side so both sides agree on order
regardless of filer version or store backend. Trades the streaming
source's O(buffer) memory for O(directory) per side, fine for a one-shot
verify CLI; both sides still load concurrently.
Claude-Session: https://claude.ai/code/session_01BKsBdKYFNCEjeHLjJfumPF
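The lockstep merge only works when both inputs share one ordering. A minimal sketch of the buffer-sort-merge idea, with `diff` and `compareListings` as hypothetical names and "MISSING" assumed to mean "present in A, absent in B":

```go
package main

import (
	"fmt"
	"sort"
)

// diff is one reported difference between the two listings.
type diff struct {
	Name string
	Kind string
}

// compareListings sorts copies of both listings in byte order and then walks
// them in lockstep, so the order each filer returned them in does not matter.
func compareListings(a, b []string) []diff {
	a = append([]string(nil), a...)
	b = append([]string(nil), b...)
	sort.Strings(a) // Go compares strings byte-wise
	sort.Strings(b)

	var out []diff
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			i++
			j++
		case a[i] < b[j]:
			out = append(out, diff{Name: a[i], Kind: "MISSING"})
			i++
		default:
			out = append(out, diff{Name: b[j], Kind: "ONLY_IN_B"})
			j++
		}
	}
	for ; i < len(a); i++ {
		out = append(out, diff{Name: a[i], Kind: "MISSING"})
	}
	for ; j < len(b); j++ {
		out = append(out, diff{Name: b[j], Kind: "ONLY_IN_B"})
	}
	return out
}

func main() {
	// Same names, one side in case-insensitive order, the other byte-ordered.
	fmt.Println(len(compareListings([]string{"a", "A", "b"}, []string{"A", "a", "b"})))
}
```

Without the sort, the first pair ("a" vs "A") would already desync the walk and report names that exist on both sides.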
* fix(filer.sync.verify): surface listing errors before merging
A listing that fails mid-stream leaves a partial, unsorted buffer. Now
that both sides are fully buffered anyway, check each side's error right
after the loads finish and before the merge, so partial entries can't
emit spurious MISSING / ONLY_IN_B before the error aborts the run.
Claude-Session: https://claude.ai/code/session_01BKsBdKYFNCEjeHLjJfumPF
* fix(shell): correct volume.list -writable filter unit and comparison
* chore(shell): fix typo in EC shard helper param names
* fix(shell): use exact match for volume.balance -racks/-nodes filter
The old strings.Contains-based filter quietly included any id that was a
substring of the user-supplied flag value (e.g. -racks=rack10 also matched
rack1). Replace it with an exact-match set parsed from the comma-separated
flag value, and add regression tests for both -racks and -nodes paths.
Also fix a small typo in the "remote storage" error returned by
maybeMoveOneVolume.
* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers
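The difference between the substring filter and the exact-match set is easy to see in a small sketch. The name splitCSVSet comes from the commit titles above, but this body and the `matchesFilter` helper are assumptions about its behavior, not the actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// splitCSVSet parses a comma-separated flag value into a set. It always
// returns a non-nil map, so callers test len(set) rather than set == nil.
func splitCSVSet(value string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, part := range strings.Split(value, ",") {
		part = strings.TrimSpace(part)
		if part != "" {
			set[part] = struct{}{}
		}
	}
	return set
}

// matchesFilter reports whether id passes the filter. An empty set means
// no filter was given, so everything passes.
func matchesFilter(set map[string]struct{}, id string) bool {
	if len(set) == 0 {
		return true
	}
	_, ok := set[id]
	return ok
}

func main() {
	racks := splitCSVSet("rack10, rack2")
	// The old substring check: "rack1" is contained in "rack10, rack2".
	fmt.Println(strings.Contains("rack10, rack2", "rack1"))
	// The exact-match set does not include rack1.
	fmt.Println(matchesFilter(racks, "rack1"))
}
```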
* fix: apply collectionPattern during detection in volume.fix.replication
* use existing wildcard.MatchesWildcard for collection matching
It returns a plain bool, so drop the up-front filepath.Match validation
and the path/filepath import that only existed to handle its error.
* trim verbose comments to terse one-liners
* drop redundant per-path collection guards
Detection already filters by replicas[0].info.Collection. The repair guard
re-checked pickOneReplicaToCopyFrom's collection (a different replica), so a
mixed-collection volume could pass detection yet be skipped in repair without
decrementing the counter, spinning the -apply loop. deleteOneVolume keeps its
collectionIsMismatch safety.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
mount: move directory cache state to a side map to shrink InodeEntry
The mount keeps an InodeEntry alive for every inode the kernel references.
On a mount that is almost entirely regular files, each entry carried the full
directory readdir-cache bookkeeping (four time.Time fields plus counters),
bloating it to 152 bytes whether or not the inode was a directory.
Move that state into a dirState held in a side map keyed by inode, and drop the
isDirectory bool: an inode is a directory iff it has a dirState. InodeEntry is
now just paths + nlookup at 32 bytes, landing in a smaller Go allocator size
class; on a mount with tens of millions of cached file inodes that is several GB
less resident heap. As a side effect the readdir-cache scan helpers iterate only
directories instead of every inode.
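The layout change can be sketched as below. The `dirState` field names, the `inodeTable` type, and its methods are hypothetical; only the idea (small per-inode entry, directory bookkeeping in a side map, "is a directory iff it has a dirState") is taken from the description. The size printed by `main` depends on the platform.

```go
package main

import (
	"fmt"
	"time"
	"unsafe"
)

// InodeEntry holds only what every inode needs.
type InodeEntry struct {
	paths   []string
	nlookup uint64
}

// dirState carries readdir-cache bookkeeping; field names are hypothetical.
type dirState struct {
	cachedAt    time.Time
	refreshedAt time.Time
	hits        uint64
}

// inodeTable keeps directory state in a side map keyed by inode.
type inodeTable struct {
	entries map[uint64]*InodeEntry
	dirs    map[uint64]*dirState
}

func newInodeTable() *inodeTable {
	return &inodeTable{
		entries: make(map[uint64]*InodeEntry),
		dirs:    make(map[uint64]*dirState),
	}
}

// lookup records one kernel reference to ino under path.
func (t *inodeTable) lookup(ino uint64, path string) {
	e, ok := t.entries[ino]
	if !ok {
		e = &InodeEntry{}
		t.entries[ino] = e
	}
	e.paths = append(e.paths, path)
	e.nlookup++
}

// markDirectory attaches directory state, creating it on first use.
func (t *inodeTable) markDirectory(ino uint64) *dirState {
	ds, ok := t.dirs[ino]
	if !ok {
		ds = &dirState{}
		t.dirs[ino] = ds
	}
	return ds
}

// isDirectory: an inode is a directory iff it has a dirState.
func (t *inodeTable) isDirectory(ino uint64) bool {
	_, ok := t.dirs[ino]
	return ok
}

// forget drops the entry and any directory state together.
func (t *inodeTable) forget(ino uint64) {
	delete(t.entries, ino)
	delete(t.dirs, ino)
}

func main() {
	tbl := newInodeTable()
	tbl.lookup(1, "/dir")
	tbl.markDirectory(1).cachedAt = time.Now()
	tbl.lookup(2, "/dir/file")
	fmt.Println(tbl.isDirectory(1), tbl.isDirectory(2))
	fmt.Println("InodeEntry bytes:", unsafe.Sizeof(InodeEntry{}))
}
```

Cache-scan helpers can range over `dirs` alone, which is the side effect mentioned at the end of the commit message.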
* fix(volume): fsync .vif and downloaded tier .dat (Rust)
save_volume_info wrote the .vif with a plain write and no fsync, and the
tier download never synced the .dat it wrote. Either could be lost on a
crash before the tier-down path acts on them. fsync both, matching the Go
volume server's util.WriteFile and DownloadFile.
* fix(volume): swap to local before deleting remote on tier-down (Rust)
The tier-down path deleted the shared remote object before trimming the
.vif, so a crash in between left the volume's .vif pointing at a deleted
object. It also dropped the remote backend only on the delete path and
never opened the downloaded local .dat, so reads broke until reload and a
keep-remote download kept serving from the slow remote object.
Trim the .vif and swap to the local .dat on both paths, bracketed by
directory fsyncs, before removing the remote object; gate only the object
removal on keep_remote_dat_file. Matches the Go volume server's crash-safe
ordering.
After VolumeTierMoveDatToRemote uploaded the .dat, the volume closed its
local backend but never opened the remote one, leaving both dat_file and
remote_dat_file empty. The needle read path has no lazy reopen, so reads
returned "dat file not open" until the volume reloaded.
Switch to the remote backend right after saving the .vif, the same as the
Go volume server's LoadRemoteFile, so the volume keeps serving from remote
storage immediately after tiering.
* ci: add per-process memory sampler for perf jobs
Samples VmRSS once a second into a CSV and records peak VmHWM per process
on stop. Linux only; reads /proc/<pid>/status.
* ci: run perf benchmarks on the Rust volume server and report memory
Matrix the throughput and S3 jobs over go/rust volume servers, using a
standalone master (plus filer for S3) and swapping only the volume binary
so the two are directly comparable. Sample peak RSS in every job and surface
it per impl in the run summary.
* ci: harden mem sampler arg handling and peak fallback
Guard against missing args under set -u, and fall back to the max RSS
sampled when a process exits before VmHWM can be read.
* ec: recover EC shards whose .ecx index lives only on a peer server
A volume server that boots with EC shard files on disk but no .ecx index
on any local disk cannot mount the shards, so the master never learns
about them. ec.rebuild works off master-registered shards, so it sees the
volume as short and gives up even though the shard data is intact.
Add an operator-triggered recovery: VolumeEcShardsMount gains a
recover_missing_index flag that makes the volume server fetch the missing
.ecx (plus .ecj/.vif) from a peer holding it and mount the on-disk shards.
ec.rebuild runs this across the cluster before planning, so orphaned
shards register and the rebuild sees the true shard set.
.ecx is an immutable encode-time index, identical on every holder. .ecj
is a per-holder deletion journal that differs across holders, so the
recovered node adopts the source peer's deletion view, like a balanced or
rebuilt shard does.
* ec: mirror missing-index recovery into the Rust volume server
Port the #10104 recovery to seaweed-volume so the Rust volume server
self-heals the same layout: EC shards on disk with the .ecx index only on
a peer. Adds collect_ec_volumes_missing_index / mount_recovered_ec_shards
to the store, recover_missing_ec_indexes (master LookupEcVolume + peer
CopyFile fetch + mount) to the server, and the recover_missing_index flag
on VolumeEcShardsMount.
.ecx is the immutable encode-time index, identical on every holder. .ecj
is a per-holder deletion journal, so the recovered node adopts the source
peer's deletion view, matching the Go path.
* fix(volume): stream copy_file from disk instead of buffering whole file
copy_file pushed every 2MB chunk into a Vec and only then returned
tokio_stream::iter(results), so serving a near-limit volume as a copy source
(e.g. during volume.fix.replication) held the entire .dat resident and could
OOM the process. Stream chunks through a bounded mpsc channel from a
spawn_blocking reader instead; caps memory at ~16MB per transfer with
backpressure.
* fix(volume): stream volume_incremental_copy from disk instead of buffering
Same buffering pattern as copy_file: every 2MB chunk was pushed into a Vec and
only then returned via tokio_stream::iter, holding the entire delta resident.
Stream the byte range from an owned file handle through a bounded mpsc channel,
mirroring the copy_file fix.
* test(volume): cover streaming copy_file and volume_incremental_copy
Adds a multi-chunk .dat fixture and tests asserting both handlers stream in 2MB
chunks (multiple messages), reassemble byte-for-byte, carry modified_ts_ns only
on the first copy_file message, and honor stop_offset.
* address review: use u64 byte counters; stream local incremental copy without holding the store lock
- copy_file/volume_incremental_copy: track remaining bytes and offsets as u64
instead of casting uint64 stop_offset/dat_size through i64 (CodeRabbit).
- volume_incremental_copy: for local volumes open the .dat and stream directly
with no lock held; only remote/tiered volumes take the per-chunk read_dat_slice
path, so a remote S3 read is never performed while holding the store read lock
(Gemini).
* volume (Rust): stream tiered incremental copy off the store lock, open .dat under it
Capture the reader for volume_incremental_copy while the volume lookup is still
under the store read lock: an open File for local volumes, a cloned remote
backend handle for tiered ones. Then drop the lock and stream with none held.
Opening under the lock pins the reader to the volume that exists now, so a
concurrent delete/recreate can't stream from the wrong file, and a slow S3
fetch for a tiered .dat no longer blocks store writers (the remote path
previously re-took the store lock per chunk).
Use a non-uniform copy-test payload so chunk reassembly catches duplicated or
reordered chunks a repeated byte would hide.
* volume (Rust): return empty when incremental-copy start offset is past the .dat
A corrupt needle index could locate an offset beyond the captured .dat size,
underflowing the dat_size - start_offset subtraction (panic in debug, wrap in
release). Guard it up front like the other empty-delta early returns.
---------
Co-authored-by: adri <adri@digitalunited.net>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
Drop max-parallel so the 13 per-platform builds run together instead of two
waves of 8 (rocksdb was queuing behind the cap and starting ~8 min late).
Keep cache-to mode=max for rocksdb: its RocksDB static_lib compile is
sha-independent, so it caches across releases and stops being the ~16-min
long-pole that gates the merge fan-in. go-build variants stay mode=min.
docker release: build per-platform on native runners, drop mode=max cache
The build job built every platform of a variant on one runner, so 2-4 Go
cross-compiles fought over a single 2-vCPU box and arm64 ran in an emulated
context. Split the matrix to one platform per job on a native runner
(amd64/386 on ubuntu-latest, arm64/arm-v7 on ubuntu-24.04-arm); only arm/v7
still needs QEMU, and only for its final apk stage. Each job pushes by
digest, and a new merge job assembles the multi-arch tag with imagetools
and mirrors it to Docker Hub.
cache-to mode=max -> mode=min: BRANCH=sha cache-busts the heavy go-build
layer every release, so writing all intermediate layers to the gha backend
spent 3-11 min per variant on a cache the next release's sha can never hit.
* test: add self-contained S3 read/write load tool
Concurrent PUT/GET against the S3 gateway, reporting requests/sec,
transfer rate, and latency percentiles. Built on the aws-sdk-go-v2
client the S3 tests already use, so no extra benchmark binary is needed.
* ci: add performance workflow
Three parallel jobs: cpu/heap pprof of the server under write load,
native throughput via weed benchmark plus the Go micro-benchmarks, and
an S3 read/write benchmark against the gateway. Runs on push to master
and manual dispatch with tunable duration, object count, size, and
concurrency.
* sts: enforce session-policy explicit deny during role chaining
A chained AssumeRole caller authenticates with an STS session token whose
inline session policy can explicitly deny sts:AssumeRole. The deny check only
evaluated the caller's named policies, so such a session could still chain into
any role its trust policy admits. Validate the session token in the deny check
and honor an explicit Deny in the inline session policy too.
* test(sts): integration coverage for AssumeRole authorization
Add an end-to-end AssumeRole authorization test (real weed mini + boto3):
a non-admin caller assumes a role its trust policy admits, an explicit
identity-side deny is blocked, and a session policy's explicit deny blocks
role chaining.
* sts: skip OIDC tokens and reject revoked sessions in the chaining deny check
Review follow-ups on the session-policy deny check:
- Guard session validation with !isOIDCToken so a bearer token our STS service
cannot validate does not error into a false deny.
- Reject a revoked session before evaluating its policy, restoring the
revocation enforcement the AssumeRole path lost when it stopped routing
through IsActionAllowed.
* fix(sts): authorize AssumeRole by the role's trust policy
The role's trust policy already declares who may assume it, but the caller
also had to pass an identity-side sts:AssumeRole check that only the Admin
action could satisfy — legacy static identities have no way to express
sts:AssumeRole on a role. So assuming any role required a full admin
identity. Drop the redundant check and let the trust policy be the authority;
scope it to specific principals to restrict who can assume.
* sts: resolve caller principal ARN for the trust-policy check
A legacy static identity can reach AssumeRole without a PrincipalArn set;
passing the empty value would miss a trust policy that names a concrete
principal. Resolve it to the canonical user ARN, sharing the logic
GetCallerIdentity already used inline.
* sts: enforce explicit identity-side deny for AssumeRole
Authorizing a named role by its trust policy alone dropped identity-side
evaluation entirely, so a caller whose attached policy explicitly denies
sts:AssumeRole could still assume any role the trust policy admits. Re-check
the caller's policies through the IAM manager for an explicit deny
(deny-always-wins) without requiring an allow; the trust policy stays the
allow authority.
* fix(postgres): prevent uint32 underflow & OOM in message parsing
* postgres: drop redundant startup guard, use maxStartupMessageSize const
The msgTotalLen < 8 check already guarantees msgLength >= 4, so the extra
msgLength < 4 guard before reading the protocol version was unreachable.
Point the startup size limit at maxStartupMessageSize instead of a literal.
* postgres: trim query terminator safely, cap pre-auth payloads
Use strings.TrimSuffix for the simple-query null terminator so a
non-null-terminated body isn't silently shortened, matching the auth
handlers. Bound password/MD5 reads with a dedicated maxAuthMessageSize
(10 KiB) instead of the 100 MiB maxMessageSize, since these payloads are
read before authentication.
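The underflow class fixed here comes from the wire format: the 4-byte big-endian length field counts itself, so the body length is length minus 4, and an unsigned subtraction wraps when a client sends a length below 4. A minimal sketch of a guarded parse, where `bodyLength` is a hypothetical helper, the two constants mirror the limits named above, and applying the limit to the body rather than the whole message is an assumption:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	maxMessageSize     = 100 << 20 // 100 MiB, general limit
	maxAuthMessageSize = 10 << 10  // 10 KiB, for payloads read before auth
)

// bodyLength validates the 4-byte length field and returns the number of
// body bytes that follow. The length field includes its own 4 bytes.
func bodyLength(header []byte, limit uint32) (uint32, error) {
	if len(header) != 4 {
		return 0, fmt.Errorf("length header must be 4 bytes, got %d", len(header))
	}
	total := binary.BigEndian.Uint32(header)
	if total < 4 {
		// Without this check, total-4 would wrap to nearly 4 GiB.
		return 0, fmt.Errorf("invalid message length %d", total)
	}
	body := total - 4
	if body > limit {
		return 0, fmt.Errorf("message body of %d bytes exceeds limit %d", body, limit)
	}
	return body, nil
}

func main() {
	if _, err := bodyLength([]byte{0, 0, 0, 3}, maxAuthMessageSize); err != nil {
		fmt.Println("rejected:", err)
	}
	n, _ := bodyLength([]byte{0, 0, 0, 8}, maxAuthMessageSize)
	fmt.Println("body bytes:", n)
}
```

Checking the length before allocating the read buffer is what prevents a single malformed packet from triggering a huge allocation.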
---------
Co-authored-by: shangshuhan <shangshuhan@cmict.chinamobile.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>