13910 Commits

Author SHA1 Message Date
Chris Lu
87fdea5330 fix(admin): carry filer addresses as ServerAddress in plugin cluster context (#9600)
The plugin cluster context forwarded filers as gRPC-only addresses
(host:grpcPort). The admin-script worker stored that in
ShellOptions.FilerAddress, whose shell commands re-derive the gRPC port
via ToGrpcAddress() and re-add the +10000 offset, dialing a non-existent
host:28888.

Carry filers in pb.ServerAddress form (host:httpPort.grpcPort) and let
each consumer convert when it dials: the admin shell uses it verbatim,
while the s3_lifecycle and iceberg workers collapse it to a gRPC address.
Rename the proto field filer_grpc_addresses -> filer_addresses so the
name matches the content.
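
The double offset described above can be reproduced with a small sketch. The helper toGrpcAddress below is hypothetical and is not the real pb.ServerAddress code; it only assumes the two address forms named in the message (host:httpPort.grpcPort and host:httpPort) and the +10000 convention.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// grpcPortOffset is the +10000 convention mentioned in the commit message.
const grpcPortOffset = 10000

// toGrpcAddress is a hypothetical helper, not the real pb.ServerAddress code.
// "host:httpPort.grpcPort" yields "host:grpcPort";
// "host:httpPort" yields "host:(httpPort+10000)".
func toGrpcAddress(addr string) (string, error) {
	host, ports, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	httpPart, grpcPart, hasGrpc := strings.Cut(ports, ".")
	if hasGrpc {
		if _, err := strconv.Atoi(grpcPart); err != nil {
			return "", fmt.Errorf("bad gRPC port %q: %w", grpcPart, err)
		}
		return net.JoinHostPort(host, grpcPart), nil
	}
	httpPort, err := strconv.Atoi(httpPart)
	if err != nil {
		return "", fmt.Errorf("bad HTTP port %q: %w", httpPart, err)
	}
	return net.JoinHostPort(host, strconv.Itoa(httpPort+grpcPortOffset)), nil
}

func main() {
	// The last input is already a gRPC-only address, so the offset is applied twice.
	for _, addr := range []string{"filer:8888.18888", "filer:8888", "filer:18888"} {
		grpcAddr, err := toGrpcAddress(addr)
		fmt.Println(addr, "->", grpcAddr, err)
	}
}
```

Feeding the gRPC-only form back through the conversion is what produces the unreachable host:28888 from the report.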
2026-05-21 02:10:27 -07:00
Chris Lu
303c2be38d feat(fix): rebuild lost EC index (.ecx) and .vif from local shards (#9596)
weed fix -ecx reconstructs the .dat from the local data shards, scans the
needles, and writes a fresh ascending-sorted .ecx containing only live
entries — the same on-disk index WriteSortedFileFromIdx emits at encode
time. When the .vif is also missing it is regenerated from the inferred
EC ratio (flags > .vif > shard-count inference / 10+4) and the .dat size
recovered from the scan.

When some data shards are missing but at least dataShards shards survive,
the missing shards are first reconstructed from the survivors via
Reed-Solomon, so a partial shard set is repaired too.

Also makes erasure_coding.WriteDatFile de-stripe using len(shardFileNames)
instead of the DataShardsCount constant, so the caller's actual data-shard
count is honored (behavior-preserving for the default 10, and fixing the
existing caller that already passes ECContext.DataShards).

This recovers an EC volume whose sealed index was lost from every node
while enough shards survive, a state neither ec.rebuild nor ec.decode can
repair because both require an existing .ecx.

Flags: -ecx, -ecDataShards, -ecParityShards. Run with the volume server
stopped.
2026-05-21 00:41:27 -07:00
Mmx233
9b9fdb5b76 fix(s3): sync IAM policies to advanced IAM Manager policy engine (#9577)
* fix(s3): sync IAM policies to advanced IAM Manager policy engine

* test(s3): add unit tests for PutPolicy/DeletePolicy IAM Manager sync

* fix(s3): flush loaded policies in SetIAMIntegration, drop extra reload

Sync the policies already loaded from the credential store into the IAM
Manager's engine from SetIAMIntegration itself, instead of re-running a
full LoadS3ApiConfigurationFromCredentialManager after setup. This covers
both startup orderings without a second filer round-trip or racing the
async loader goroutine: if the load won, the policies are in memory to
push; if SetIAMIntegration won, the load's own sync runs afterward.

Move the runtime PutPolicy/DeletePolicy sync out of the iam.m write lock
so the per-request auth RLock path isn't blocked by the policy recompile.

* fix(s3): serialize IAM manager policy resync to avoid stale snapshots

SyncRuntimePolicies replaces the manager's full policy set, so applying a
policy view captured before a later mutation can resurrect a deleted
policy or drop a new one. Funnel every path (PutPolicy, DeletePolicy,
SetIAMIntegration, and the credential-manager load) through a single
resyncIAMManagerPolicies that serializes on a dedicated mutex and reads
iam.policies fresh at apply time, so the live map always wins regardless
of interleaving. The load now installs the config into iam.policies
before resyncing, closing the window where the manager held policies the
map didn't yet have.
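
A minimal sketch of the serialization idea, under stated assumptions: the types iam and engine and their fields are invented for illustration and do not mirror the real s3api structs. The point is only the ordering: take the dedicated resync mutex first, then copy the live map, then push.

```go
package main

import (
	"fmt"
	"sync"
)

// engine stands in for a policy engine whose Sync replaces its whole policy set.
type engine struct {
	mu       sync.Mutex
	policies map[string]string
}

func (e *engine) Sync(all map[string]string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.policies = all
}

type iam struct {
	m        sync.RWMutex      // guards policies (the live map)
	policies map[string]string
	resyncMu sync.Mutex        // serializes every resync
	eng      *engine
}

func newIAM() *iam {
	return &iam{policies: map[string]string{}, eng: &engine{}}
}

// resync copies the live map only after it holds resyncMu, so a snapshot
// applied later is never older than one applied earlier.
func (i *iam) resync() {
	i.resyncMu.Lock()
	defer i.resyncMu.Unlock()
	i.m.RLock()
	snapshot := make(map[string]string, len(i.policies))
	for name, doc := range i.policies {
		snapshot[name] = doc
	}
	i.m.RUnlock()
	i.eng.Sync(snapshot)
}

func (i *iam) Put(name, doc string) {
	i.m.Lock()
	i.policies[name] = doc
	i.m.Unlock()
	i.resync() // runs outside the write lock
}

func (i *iam) Delete(name string) {
	i.m.Lock()
	delete(i.policies, name)
	i.m.Unlock()
	i.resync()
}

func main() {
	i := newIAM()
	i.Put("read-only", "{...}")
	i.Put("admin", "{...}")
	i.Delete("read-only")
	fmt.Println(i.eng.policies)
}
```

Because every mutation happens before its own resync, the last resync to take the mutex sees every earlier mutation, whatever the interleaving.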

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-21 00:39:42 -07:00
Chris Lu
7e4691f2dc test(ec): make multi-disk EC balance disk-spread assertion deterministic (#9595)
test(ec): pre-populate disks so multi-disk EC balance spread is deterministic

The multidisk shard-loss regression asserts EC shards spread across more
than one disk per node, but that only holds for disks the balancer can see.
The master enumerates a physical disk only when it already holds a volume
or EC shard — an empty disk leaves no trace, since heartbeats aggregate
capacity per disk type, not per physical disk. So whether the post-encode
balance spread shards depended on how the master happened to place the
filler volumes across disks, which varies by environment: the test passed
locally (shards on 5 disks) but produced one disk per node in CI and failed
the "got 3 disks across 3 nodes" assertion.

Grow a few volumes on each server before encoding so every physical disk
holds a volume and is visible to the balancer. The volume server places
each new volume on its least-loaded disk, so a handful of grows touches
every disk, making the spread deterministic. The assertion still has teeth:
it counts disks holding shard files, so a balancer that failed to spread
would still collapse to one disk per node.
2026-05-21 00:17:14 -07:00
Chris Lu
391f543ff2 fix(ec): correct multi-disk disk counting and EC balance shard attribution (#9594)
* fix(shell): count physical disks in cluster.status on multi-disk nodes

The master keys DataNodeInfo.DiskInfos by disk type, so several same-type
physical disks on one node collapse into a single DiskInfo entry. cluster.status
(printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting
one disk per node instead of the real physical disk count, while volume.list and
the admin ActiveTopology already split per physical disk.

Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N
same-type disks reports N. Cosmetic/diagnostic only; placement already uses the
per-disk activeDisk map.

* fix(ec): attribute EC balance source disk per shard and reject same-node moves

On multi-disk nodes the EC balance worker built a node-level view that kept only
the first physical disk id per (node, volume), so a move of a shard living on a
different disk reported the wrong source disk. That source disk drives the
per-disk capacity reservation, so the wrong disk drifts the capacity model the
EC placement planner relies on. Track shards per physical disk and resolve the
actual source disk for every emitted move (dedup, cross-rack, within-rack,
global), keeping the per-disk view consistent as simulated moves are applied.

Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the
shard from every disk on the node) and copyAndMountShard skips the copy when
source and target addresses match, so a same-node move would erase a shard it
never copied. isDedupPhase now requires the same node AND disk, and Validate /
Execute reject same-node cross-disk moves outright.

* fix(ec): spread EC balance moves across destination disks

Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance
worker so a moved shard is placed on a good physical disk instead of always
deferring to the volume server (target disk 0). The detection now builds a
per-physical-disk view of each node (free slots split from the node total, exact
EC shard count, disk type, discovered from both regular volumes and EC shards)
and, for each cross-rack, within-rack, and global move, chooses the destination
disk by ascending score:
  - fewer total EC shards on the disk,
  - far fewer shards of the same volume on the disk (spread a volume's shards
    across disks for fault tolerance), and
  - data/parity anti-affinity (a data shard avoids disks holding the volume's
    parity shards and vice versa).

Planned placements are reserved on the in-memory model during a run so multiple
shards moved to the same node spread across its disks rather than piling on one.
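
The scoring and reservation described above might look roughly like the following. This is a sketch, not the pickBestDiskOnNode code: the diskView type, the weights, and the tie-break on disk ID are all assumptions made for illustration, since the message does not state them.

```go
package main

import "fmt"

// diskView is a hypothetical per-physical-disk summary for one candidate node.
type diskView struct {
	ID           uint32
	TotalShards  int // EC shards of any volume on the disk
	VolumeData   int // data shards of the volume being moved
	VolumeParity int // parity shards of the volume being moved
}

// Weights are invented for this sketch; the commit does not state them.
const (
	sameVolumeWeight   = 10
	antiAffinityWeight = 100
)

// score is lower for better destinations.
func score(d diskView, movingData bool) int {
	s := d.TotalShards + sameVolumeWeight*(d.VolumeData+d.VolumeParity)
	if (movingData && d.VolumeParity > 0) || (!movingData && d.VolumeData > 0) {
		s += antiAffinityWeight
	}
	return s
}

// pickAndReserve returns the lowest-scoring disk (ties go to the lower ID) and
// records the planned shard on the in-memory model so later picks see it.
func pickAndReserve(disks []diskView, movingData bool) (uint32, bool) {
	if len(disks) == 0 {
		return 0, false
	}
	best := 0
	for i := 1; i < len(disks); i++ {
		si, sb := score(disks[i], movingData), score(disks[best], movingData)
		if si < sb || (si == sb && disks[i].ID < disks[best].ID) {
			best = i
		}
	}
	disks[best].TotalShards++
	if movingData {
		disks[best].VolumeData++
	} else {
		disks[best].VolumeParity++
	}
	return disks[best].ID, true
}

func main() {
	disks := []diskView{{ID: 0}, {ID: 1}, {ID: 2}}
	for n := 0; n < 3; n++ {
		id, _ := pickAndReserve(disks, true)
		fmt.Println("data shard -> disk", id)
	}
}
```

Without the reservation step every pick would see the same empty model and choose the same disk.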

* fix(ec): bring EC balance worker to parity with shell ec.balance

The worker's cross-rack and within-rack balancing balanced shards by total
count; the shell balances data and parity shards separately with anti-affinity
and honors replica placement. Port that logic so the automatic balancer makes
the same fault-tolerance-aware decisions as the manual command:

- Cross-rack and within-rack now run a two-pass balance: data shards spread
  first, then parity shards spread while avoiding racks/nodes that already hold
  the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks
  and doBalanceEcShardsWithinOneRack.
- Optional replica placement: a new replica_placement config (e.g. "020")
  constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty
  keeps the previous even-spread behavior.
- The data/parity boundary is resolved from a per-collection EC ratio (standard
  10+4 here), replacing the previously hardcoded constant at the call sites.

Selection is deterministic (sorted keys) to keep behavior reproducible.

* refactor(ec): extract shared ecbalancer package for shell and worker

The EC shard balancing policy was duplicated between the shell ec.balance
command and the admin EC balance worker, and the two had drifted (multi-disk
handling, data/parity anti-affinity, replica placement). Extract the policy into
a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers
share so it cannot drift again.

- ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and
  within-rack data/parity two-pass with anti-affinity, global per-rack balance,
  and diversity-aware disk selection) over a caller-built Topology snapshot and
  returns the shard Moves. It depends only on erasure_coding and super_block.
- The worker builds the Topology from the master topology and turns Moves into
  task proposals; the shell builds it from its EcNode model and executes Moves
  via the existing move/delete RPCs. Per-collection EC ratio resolution stays in
  each caller (passed as Options.Ratio).
- Options expose the two genuine policy differences: GlobalUtilizationBased
  (worker balances by fractional fullness; shell by raw count) and
  GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains
  in one pass).

The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to
the ecbalancer package; the shell and worker keep their adapter/execution tests.

* fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor

Address regressions and gaps from the ecbalancer extraction:

- Shell ec.balance honors -maxParallelization again: planned moves run phase by
  phase (preserving cross-phase dependencies) with bounded concurrency within a
  phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and
  updates the in-memory model for inspection.
- Rack and node balancing gate on per-type spread (data and parity separately)
  instead of combined totals, so a data/parity skew is corrected even when the
  per-rack/node totals are even.
- Global rack balancing iterates the full shard-id space (MaxShardCount) so
  custom EC ratios with more than the standard total are candidates.
- Cross-rack planning decrements the destination node's free slots per planned
  move, so limited-capacity targets are no longer over-planned.

* fix(ec): make EC dedup keeper deterministic and capacity-aware

When a shard is duplicated across nodes, keep the copy on the node with the most
free slots and delete the duplicates from the more-constrained nodes, relieving
capacity pressure where it is tightest. Tie-break on node id so the choice is
deterministic. This unifies the shell and worker (the shell previously kept the
least-free node, an incidental default) on the more sensible behavior.
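
As a sketch of the keeper rule: keep the copy on the node with the most free slots and break ties on node id. The shardCopy type is hypothetical, and ascending order for the tie-break is an assumption; the message only says the tie-break is on node id.

```go
package main

import (
	"fmt"
	"sort"
)

// shardCopy is a hypothetical record of one duplicate of a shard.
type shardCopy struct {
	NodeID    string
	FreeSlots int
}

// chooseKeeper returns the copy to keep and the copies to delete.
// Most free slots wins; equal free slots fall back to the smaller node id.
func chooseKeeper(copies []shardCopy) (keep shardCopy, drop []shardCopy, ok bool) {
	if len(copies) == 0 {
		return shardCopy{}, nil, false
	}
	sorted := append([]shardCopy(nil), copies...) // leave the caller's slice untouched
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].FreeSlots != sorted[j].FreeSlots {
			return sorted[i].FreeSlots > sorted[j].FreeSlots
		}
		return sorted[i].NodeID < sorted[j].NodeID
	})
	return sorted[0], sorted[1:], true
}

func main() {
	keep, drop, _ := chooseKeeper([]shardCopy{
		{NodeID: "node-b", FreeSlots: 3},
		{NodeID: "node-a", FreeSlots: 3},
		{NodeID: "node-c", FreeSlots: 1},
	})
	fmt.Println("keep:", keep.NodeID, "drop:", drop)
}
```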

* fix(ec): restore global volume-diversity and per-volume move serialization

Two more behaviors lost in the ecbalancer refactor:

- Global rack balancing again prefers moving a shard of a volume the destination
  does not hold at all before adding another shard of an already-present volume
  (two-pass, mirroring the old balanceEcRack), keeping each volume's shards
  spread across nodes.
- Shell apply-mode execution serializes a single volume's moves within a phase
  while still running different volumes in parallel, so concurrent moves of the
  same volume cannot race on its shared .ecx/.ecj/.vif sidecar files.

* fix(ec): key EC balance shards by (collection, volume id)

A numeric volume id can be reused across collections, and EC identity is
(collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed
Node.shards by vid alone, so volumes sharing an id across collections merged into
one entry — letting dedup delete a "duplicate" that is actually a different
collection's shard, and letting moves act across collections. Key shards by
(collection, vid) throughout so each volume stays distinct.
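
In Go a composite key of this kind is usually a small comparable struct used as the map key. The sketch below assumes hypothetical names (volumeKey, nodeShards) and stores shards as a bitmask; it is not the ecbalancer code.

```go
package main

import "fmt"

// volumeKey identifies an EC volume by collection and numeric id together.
type volumeKey struct {
	Collection string
	VolumeID   uint32
}

// nodeShards maps each volume to a bitmask of the shard ids held on a node.
type nodeShards map[volumeKey]uint32

func (n nodeShards) add(collection string, vid uint32, shardID uint) {
	n[volumeKey{Collection: collection, VolumeID: vid}] |= 1 << shardID
}

func main() {
	shards := nodeShards{}
	shards.add("photos", 7, 0)
	shards.add("logs", 7, 0) // same numeric id, different collection
	fmt.Println(len(shards), "distinct volumes")
}
```

Keyed by the id alone, the two entries above would merge and shard 0 of one collection would look like a duplicate of the other.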

* fix(ec): credit freed capacity from dedup before later balance phases

Dedup deletions are simulated only by applyMovesToTopology, which cleared shard
bits but did not return the freed disk/node/rack slots. Later phases reject
destinations with no free slots, so a slot opened by dedup could not be reused in
the same Plan/ec.balance run. applyMovesToTopology now credits the freed
disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline
accounting their phase already did).

* test(ec): add multi-disk EC balance integration test

Cover issue 9593 end-to-end at the unit level the old tests missed: build the
master's actual multi-disk wire format (same-type disks collapsed into one
DiskInfo, real DiskId only in per-shard records), run it through a real
ActiveTopology and the Detection entry point, then replay the planned moves with
the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no
EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume,
and a multi-rack spread, and asserts moves are safe (no same-node cross-disk),
correctly attributed to the source disk, and redistribute concentrated volumes
across both other racks and multiple destination disks.

* fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes

collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a
volume. A multi-disk node reports one EcShardInfo per physical disk holding shards
of the volume, so only the last disk's shards survived — the node looked like it
was missing shards it actually had. This made ec.encode's pre-delete verification
(and ec.decode) under-count volumes whose shards are spread across disks on one
server, falsely aborting the encode on multi-disk clusters. Union the per-disk
shard sets per server instead.
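
The difference between overwriting and unioning can be shown with shard bitmasks. The ecShardInfo type here is a simplified, hypothetical stand-in for the per-disk records a multi-disk node reports.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ecShardInfo is a simplified stand-in: one record per physical disk
// that holds shards of a volume.
type ecShardInfo struct {
	VolumeID  uint32
	DiskID    uint32
	ShardBits uint32 // bit i set means shard i is present
}

// unionByVolume ORs the per-disk bitmasks so no disk's shards are lost.
func unionByVolume(infos []ecShardInfo) map[uint32]uint32 {
	out := make(map[uint32]uint32)
	for _, info := range infos {
		out[info.VolumeID] |= info.ShardBits
	}
	return out
}

func shardCount(b uint32) int { return bits.OnesCount32(b) }

func main() {
	infos := []ecShardInfo{
		{VolumeID: 5, DiskID: 0, ShardBits: 0x007F}, // shards 0-6
		{VolumeID: 5, DiskID: 1, ShardBits: 0x3F80}, // shards 7-13
	}
	fmt.Println(shardCount(unionByVolume(infos)[5])) // prints 14
}
```

Replacing the `|=` with plain assignment reproduces the bug: only the last disk's seven shards would be counted.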

Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the
master via volume-server heartbeats, so a freshly distributed shard set may not be
fully visible the instant the balance returns. Retry before concluding the set is
incomplete; genuine loss still fails after the retries are exhausted.

* test(ec): end-to-end multi-disk EC balance shard-loss regression

Start a real cluster of multi-disk volume servers (3 servers x 4 disks),
EC-encode a volume, run ec.balance, and assert hard invariants the prior
integration tests only logged: after encode all 14 shards exist, ec.balance loses
no shard, shards span more than one disk per node, and cluster.status counts
physical disks (not one per node). This reproduces issue 9593 end to end and would
have caught the multi-disk shard-aggregation bug fixed alongside it.

* fix(ec): bring EC balance worker/plugin path to parity with shell

- Per-volume serialization and phase order: key the plugin proposal dedupe by
  (collection, volume) instead of (volume, shard, source), so the scheduler runs
  only one of a volume's moves at a time (within a run and against in-flight jobs).
  Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and
  because the planner emits a volume's moves in phase order, they now execute in
  order across detection cycles, matching the shell.
- disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping
  a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks
  instead of nothing; apply the canonical type to planner options and move params.
- Replica placement: expose shard_replica_placement in the admin config form and
  read it into the worker config, mirroring ec.balance -shardReplicaPlacement.

* test(ec): rename worker in-process test (not a real integration test)

The worker-package multi-disk tests build a fake master topology and simulate
move execution; they are not real-cluster integration tests. Rename
integration_test.go -> multidisk_detection_test.go and drop the Integration
prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding.

* ci(ec): remove redundant ec-integration workflow

ec-integration.yml duplicated EC Integration Tests under the same workflow name
but ran only 'go test ec_integration_test.go' (one file), so it never ran new
test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered
subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole
test/erasure_coding package on every push/PR.

* fix(ec): worker falls back to master default replication for EC balance

For strict parity with the shell, the EC balance worker now uses the master's
configured default replication as the replica-placement fallback when no explicit
shard_replica_placement is set, instead of always defaulting to even spread.

The maintenance scanner reads it via GetMasterConfiguration each cycle and passes
it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint
(explicit config wins, else master default, else none) in resolveReplicaPlacement.
A zero-replication default (the common 000 case) still means even spread, so the
common configuration is unchanged.

* fix(ec): plugin path populates master default replication too

The plugin worker built ClusterInfo with only ActiveTopology, so the master
default replication fallback added for the maintenance path never reached
plugin-driven EC balance detection — empty shard_replica_placement still meant
even spread there. Fetch the master default via GetMasterConfiguration (new
pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement
so both detection paths resolve replica placement identically to the shell.

* docs(ec): empty shard replica placement uses master default, not even spread

The EC balance config text (admin plugin form, legacy form help text, and
the struct/proto field comments) still said an empty shard_replica_placement
spreads evenly. The runtime resolves empty to the master default replication
(resolveReplicaPlacement), matching shell ec.balance, with even spread only
when that default is empty or zero. Update the text to match and regenerate
worker_pb for the proto comment change.
2026-05-20 23:31:21 -07:00
Chris Lu
afcc491517 test: fix fd leak in the Samba DLM handoff test (promote xfail checks) (#9592)
test(mount): fix fd leak that deadlocked the DLM handoff check

The cross-mount handoff checks held a file open on mount 2 via fd 9 to
keep the distributed lock, then started the SMB writer in a background
subshell. The subshell inherited fd 9, so the SMB writer kept the file
open and waited on a lock held by its own descriptor; the put could
never complete, and the two checks were parked as expected-fail.

Close fd 9 in the subshell (9>&-) so the writer does not hold the file.
The waiter now acquires the freed lock within ~1s, so the two checks are
real assertions and the xfail machinery is gone.
2026-05-20 16:17:13 -07:00
Chris Lu
a5d0e4a735 Samba-over-FUSE integration test and distributed-lock handoff fixes (#9590)
* test(mount): add Samba over FUSE integration test

Export a SeaweedFS FUSE mount over SMB with smbd and drive it with
smbclient: file round-trips, directories, rename, large-file chunking,
recursive upload, cross-protocol consistency, and deletes.

A second -dlm mount adds locking coverage: POSIX fcntl byte-range locks,
distributed-lock write coordination, and concurrent writers. The two
cross-mount handoff checks currently fail and pin a known limitation -
the distributed lock is released on FUSE Release, which the kernel can
delay under contention.

Runs locally via test/samba/run.sh or in Docker via the compose file;
wired into CI as samba-integration.yml.

* fix(cluster): release distributed lock without racing the renewal goroutine

Stop() closed the cancel channel, slept 10ms, then unlocked using
renewToken. A renewal in flight during that window rotates the token on
the server, so the unlock may be sent with a stale token, fail with a
mismatch, and leave the lock to linger until its TTL expires - stalling
other mounts waiting to write the same file.

Wait for the renewal goroutine to exit before unlocking. The channel
close also makes the renewToken read happen-after the last renewal.

* fix(cluster): poll for distributed lock acquisition without exponential backoff

A mount waiting to write a file held by another mount acquired through
util.RetryUntil, whose backoff grows to several seconds. Once the holder
released, the waiter could sleep that long before retrying, stretching
the cross-mount handoff past client timeouts.

Poll at the steady ~1s cadence AttemptToLock already enforces instead.

* test(mount): tighten Samba harness and mark the DLM handoff checks xfail

Run the workflow for weed/cluster changes, fail fast when the filer or
smbd port never opens, and fold the recursive mput result into its own
assertion so it cannot false-pass.

Mark the two cross-mount handoff checks expected-fail: they pin the
remaining DLM liveness bug (the lock is freed only on the delayed FUSE
Release) without failing CI, and turn the suite red if the handoff is
ever fixed.

* fix(cluster): keep a wedged renewal shutdown from sending a stale unlock

If the renewal goroutine is stuck in a slow RPC, Stop() fell through to
unlock anyway once it timed out waiting. A late renewal can rotate
renewToken, so that unlock races it, is rejected on a stale token, and
leaves the lock lingering until its TTL regardless. On the timeout path,
skip the unlock and let the TTL expire the lock instead.

* fix(cluster): wake the long-lived lock renewal loop promptly on Stop

StartLongLivedLock's renewal loop slept uninterruptibly between attempts,
up to 5*renewInterval (2.5*lockTTL) while unlocked. Stop() waits only
lockTTL+2s for the goroutine to exit, so a Stop() during that backoff
would time out before the goroutine woke and closed renewalDone,
breaking the shutdown synchronization. Sleep on a timer with a select on
cancelCh so the loop exits immediately.
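
The three shutdown fixes above (wait for the renewal goroutine, skip the unlock on timeout, sleep on a timer with a select on the cancel channel) fit together roughly as follows. This is a sketch with invented names (liveLock, startLock); the renew and unlock callbacks stand in for the real RPCs.

```go
package main

import (
	"fmt"
	"time"
)

type liveLock struct {
	cancelCh chan struct{}
	done     chan struct{} // closed when the renewal goroutine exits
	unlock   func()
}

func startLock(interval time.Duration, renew, unlock func()) *liveLock {
	l := &liveLock{cancelCh: make(chan struct{}), done: make(chan struct{}), unlock: unlock}
	go func() {
		defer close(l.done)
		timer := time.NewTimer(interval)
		defer timer.Stop()
		for {
			select {
			case <-l.cancelCh: // Stop wakes the loop immediately
				return
			case <-timer.C:
				renew()
				timer.Reset(interval)
			}
		}
	}()
	return l
}

// Stop reports whether it unlocked explicitly. If the renewal goroutine does
// not exit within wait, the unlock is skipped and the lock is left to its TTL.
func (l *liveLock) Stop(wait time.Duration) bool {
	close(l.cancelCh)
	select {
	case <-l.done:
		l.unlock() // no renewal can be in flight any more
		return true
	case <-time.After(wait):
		return false
	}
}

func main() {
	l := startLock(time.Hour, func() {}, func() { fmt.Println("unlocked") })
	fmt.Println("explicit unlock:", l.Stop(time.Second))
}
```

Stop returns promptly here even though the renewal interval is an hour, because the loop selects on cancelCh rather than sleeping uninterruptibly.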
2026-05-20 14:52:17 -07:00
Chris Lu
a17dca7009 fix(filer): don't disable the SQL idle connection pool when unconfigured (#9591)
* fix(filer): don't disable the SQL idle connection pool when unconfigured

The mysql/mysql2/postgres stores called SetMaxIdleConns(maxIdle)
unconditionally, so an unset connection_max_idle (0) actively kept zero
idle connections - every query opened and closed a fresh connection
instead of reusing the pool.

Only apply the value when it's set; otherwise leave database/sql's
default idle pool of 2 in place.

* comments: shorten idle-pool note

* fix(filer): default the SQL idle pool via config, keep explicit 0 honored

Apply the idle-pool default at the config layer with SetDefault instead of
guarding the SetMaxIdleConns call. An absent connection_max_idle now reads
back as 2 (pool stays on), while an explicit 0 flows through to
SetMaxIdleConns(0) so operators can still disable idle pooling on purpose.
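
The distinction between an absent key and an explicit 0 is the crux of the second fix. A minimal sketch, using a plain map in place of the real configuration layer (resolveMaxIdle and the map are hypothetical; only the key name and the default of 2 come from the message):

```go
package main

import "fmt"

// defaultMaxIdle matches the database/sql default cited in the commit message.
const defaultMaxIdle = 2

// resolveMaxIdle returns the value to hand to (*sql.DB).SetMaxIdleConns.
// An absent key falls back to the default; an explicit 0 is passed through.
func resolveMaxIdle(cfg map[string]int) int {
	if v, ok := cfg["connection_max_idle"]; ok {
		return v
	}
	return defaultMaxIdle
}

func main() {
	fmt.Println(resolveMaxIdle(map[string]int{}))                         // key absent
	fmt.Println(resolveMaxIdle(map[string]int{"connection_max_idle": 0})) // pooling disabled on purpose
}
```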
2026-05-20 14:04:23 -07:00
Chris Lu
024b59fb31 fix(ec): pack EC shards onto fewer disks instead of refusing the task (#9588)
The planner refused to create an EC task unless it found totalShards
distinct (server, disk_id) targets, so a cluster with fewer disks than
shards (e.g. 8 single-disk servers for a 10+4 scheme) could never encode.

A disk safely holds several distinct shards of one volume: each is its own
.ecNN file and ReceiveFile keys by that extension. Drop the strict check and
let createECTargets round-robin shards across the available disks, matching
ec.encode's "4,4,3,3" fallback. The minTotalDisks floor (ceil(total/parity))
already keeps every disk at or below parityShards shards, so the volume still
survives losing any one disk.

Reserve capacity for the actual per-disk shard count rather than assuming
one shard each, so packing doesn't over-commit disk slots.
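
The arithmetic behind the packing can be checked with a short sketch. The function names below are illustrative, not the planner's; the logic is plain round-robin plus the ceil(total/parity) floor from the message.

```go
package main

import "fmt"

// minTotalDisks is ceil(totalShards / parityShards).
func minTotalDisks(totalShards, parityShards int) int {
	return (totalShards + parityShards - 1) / parityShards
}

// packRoundRobin returns how many shards land on each disk.
func packRoundRobin(totalShards, disks int) []int {
	counts := make([]int, disks)
	for shard := 0; shard < totalShards; shard++ {
		counts[shard%disks]++
	}
	return counts
}

func maxPerDisk(counts []int) int {
	m := 0
	for _, c := range counts {
		if c > m {
			m = c
		}
	}
	return m
}

func main() {
	const data, parity = 10, 4
	floor := minTotalDisks(data+parity, parity)
	fmt.Println("min disks:", floor)
	fmt.Println("on the floor:", packRoundRobin(data+parity, floor))
	fmt.Println("8 single-disk servers:", packRoundRobin(data+parity, 8))
}
```

With at least minTotalDisks disks, no disk holds more than parityShards shards, so losing one disk still leaves at least dataShards shards. The per-disk counts are also what the capacity reservation needs to use.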
2026-05-20 11:50:42 -07:00
Chris Lu
5af7d12f04 fix(filer.sync): keep sync_offset fresh while the source is read-only (#9589)
* fix(filer.sync): keep sync_offset fresh while the source is read-only

sync_offset holds the timestamp of the last replicated source event, so
monitoring derives lag from now-sync_offset. A read-only source emits no
metadata events, so the gauge froze at the last write and the derived lag
grew without bound, making thresholds unusable.

The source filer now sends an idle heartbeat carrying its current time
while a subscriber is caught up to the buffer head. filer.sync uses it to
advance the gauge, so now-sync_offset reflects real lag. Heartbeats are
opt-in (client_supports_idle_heartbeat), are never written to the metadata
log, and do not move the resume checkpoint, so a restart still resumes
from the last real event.
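
A sketch of the bookkeeping, with hypothetical names (syncState and its methods): real events move both the gauge and the resume checkpoint, while an idle heartbeat moves only the gauge.

```go
package main

import "fmt"

// syncState is a hypothetical view of what filer.sync tracks per source.
type syncState struct {
	offsetNs     int64 // exported as the sync_offset gauge
	checkpointNs int64 // where a restart resumes
}

// onEvent handles a replicated metadata event.
func (s *syncState) onEvent(tsNs int64) {
	s.offsetNs, s.checkpointNs = tsNs, tsNs
}

// onIdleHeartbeat advances the gauge only; the checkpoint stays put.
func (s *syncState) onIdleHeartbeat(sourceNowNs int64) {
	if sourceNowNs > s.offsetNs {
		s.offsetNs = sourceNowNs
	}
}

// lagNs is what monitoring derives: now minus sync_offset.
func (s *syncState) lagNs(nowNs int64) int64 { return nowNs - s.offsetNs }

func main() {
	s := &syncState{}
	s.onEvent(1_000)         // last real write
	s.onIdleHeartbeat(9_000) // source is idle but alive
	fmt.Println("lag:", s.lagNs(9_500), "checkpoint:", s.checkpointNs)
}
```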

* fix(filer.sync): gate idle heartbeat on the read cursor, not SinceNs

In metadata-chunks mode persisted entries replay as log file refs and
never reach eachLogEntryFn, so lastSeenTsNs stays put and a caught-up
subscriber with an old SinceNs would never get a heartbeat. Use the
read cursor (lastReadTime), which advances in that mode too, max'd with
lastSeenTsNs so the in-memory backlog-then-idle case still works while
the cursor returned to the caller has not yet updated.
2026-05-20 11:26:37 -07:00
Chris Lu
4385b86bf1 fix(shell): volumeServer.evacuate no longer panics on a nil volume (#9587)
adjustAfterMove now removes the moved volume from the source disk's
VolumeInfos in place: it swaps the entry with the last one and nils the
tail. evacuateNormalVolumes ranges directly over that same slice, so the
now-nil tail slot is later read as a nil *VolumeInformationMessage and the
move attempt panics on vol.DiskType.

Iterate over a snapshot of the slice so in-place removals during a move
cannot leave nil holes in the loop.
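
The hazard and the fix can be reproduced in a few lines. The types below are simplified stand-ins for the topology structs; removeVolume mimics the swap-with-last removal, and evacuate ranges over a copy.

```go
package main

import "fmt"

type volume struct{ ID uint32 }

type disk struct{ Volumes []*volume }

// removeVolume swaps the entry with the last one, nils the tail, and shrinks.
func (d *disk) removeVolume(id uint32) {
	for i, v := range d.Volumes {
		if v.ID == id {
			last := len(d.Volumes) - 1
			d.Volumes[i] = d.Volumes[last]
			d.Volumes[last] = nil
			d.Volumes = d.Volumes[:last]
			return
		}
	}
}

// evacuate iterates over a snapshot, so in-place removals cannot put a nil
// in the loop's path. Ranging over d.Volumes directly would reuse the same
// backing array and eventually read a nil tail slot.
func evacuate(d *disk) []uint32 {
	snapshot := append([]*volume(nil), d.Volumes...)
	var moved []uint32
	for _, v := range snapshot {
		moved = append(moved, v.ID)
		d.removeVolume(v.ID)
	}
	return moved
}

func main() {
	d := &disk{Volumes: []*volume{{ID: 1}, {ID: 2}, {ID: 3}}}
	fmt.Println(evacuate(d), len(d.Volumes))
}
```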
2026-05-20 10:27:00 -07:00
Chris Lu
c00aa90990 fix(s3/audit): populate requester for GET/HEAD/IAM operations (#9581)
Authentication records the identity with r.WithContext, which returns a
request copy. Handlers that log their own audit entry (PUT, DELETE,
tagging) see it, but GET/HEAD object and IAM operations rely on track()'s
fallback entry, which is built from the original request the auth copy
never reached - so requester came out empty.

Install a mutable identity holder on the request before authentication
and have SetIdentityNameInContext record into it. The holder is shared by
pointer across every request copy, so the fallback entry recovers the
authenticated requester. The per-request context value still takes
precedence, so nothing changes for handlers that see the auth copy.
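
The holder pattern relies on a pointer surviving r.WithContext copies. A self-contained sketch with invented names (identityHolder, withHolder, setIdentity, identityOf); a production version would also need to guard the holder if it can be touched concurrently.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

type holderKey struct{}
type traceKey struct{}

// identityHolder is shared by pointer across every copy of the request.
type identityHolder struct{ name string }

// withHolder installs an empty holder before authentication runs.
func withHolder(r *http.Request) *http.Request {
	return r.WithContext(context.WithValue(r.Context(), holderKey{}, &identityHolder{}))
}

func setIdentity(r *http.Request, name string) {
	if h, ok := r.Context().Value(holderKey{}).(*identityHolder); ok {
		h.name = name
	}
}

func identityOf(r *http.Request) string {
	if h, ok := r.Context().Value(holderKey{}).(*identityHolder); ok {
		return h.name
	}
	return ""
}

// demo returns the identity visible on the original request after
// authentication recorded it on a derived copy.
func demo() string {
	orig, err := http.NewRequest(http.MethodGet, "http://example.com/bucket/key", nil)
	if err != nil {
		return ""
	}
	orig = withHolder(orig)
	authCopy := orig.WithContext(context.WithValue(orig.Context(), traceKey{}, "auth"))
	setIdentity(authCopy, "alice")
	return identityOf(orig)
}

func main() {
	fmt.Println("requester seen by the fallback entry:", demo())
}
```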
2026-05-20 10:13:33 -07:00
Chris Lu
e332b97d52 fix(shell): volume.balance no longer drains all volumes onto one server (#9579)
* fix(shell): volume.balance no longer drains all volumes onto one server

The density-based capacity function reads per-disk VolumeInfos sizes, but
adjustAfterMove only updated VolumeCount and the selectedVolumes map. The
planner re-read a stale topology after every move, so the source node's
density never dropped and it kept moving volumes until that node was empty.

Move the volume's size accounting between disks after each planned move so the
density recomputes and the loop converges to an even distribution.

* refactor(shell): O(1) volume removal and direct disk lookup in adjustAfterMove

removeVolumeInfo swaps with the last element instead of shifting, and the disk
is fetched by key rather than ranging the DiskInfos map.
4.27
2026-05-20 01:39:23 -07:00
Chris Lu
868849392c 4.27
2026-05-20 00:25:16 -07:00
Chris Lu
a4415c39aa fix(mount): keep periodic metadata flush from dropping concurrent chunk uploads (#9574)
* fix(mount): keep periodic metadata flush from dropping concurrent chunk uploads

The periodic flush snapshotted entry.Chunks, then ran CompactFileChunks and
MaybeManifestize (the manifest upload is a network round trip) before
reassigning entry.Chunks. Async uploaders append freshly uploaded chunks
during that window, and the reassignment overwrote them: the data stayed on
the volumes but the file lost those chunk references, leaving zero-filled
holes on read. Large sequential writes such as cat of two 15 GiB files hit
several flush cycles and ended up corrupted.

Snapshot the chunk list under the entry lock with a length marker, do the
slow compaction and manifestization on the snapshot, then splice the
processed prefix back in front of whatever chunks arrived after the
snapshot.
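
A sketch of the snapshot-and-splice sequence, assuming chunks are only ever appended while a flush is in progress. The entry type and string chunks are stand-ins; process represents the slow compaction and manifestization step.

```go
package main

import (
	"fmt"
	"sync"
)

type entry struct {
	mu     sync.Mutex
	chunks []string
}

func (e *entry) appendChunk(c string) {
	e.mu.Lock()
	e.chunks = append(e.chunks, c)
	e.mu.Unlock()
}

// flush snapshots the chunk list with a length marker, runs the slow step
// without the lock, then puts the processed prefix in front of whatever
// chunks arrived after the snapshot.
func (e *entry) flush(process func([]string) []string) {
	e.mu.Lock()
	snapshot := append([]string(nil), e.chunks...)
	marker := len(snapshot)
	e.mu.Unlock()

	processedPrefix := process(snapshot) // may take a network round trip

	e.mu.Lock()
	e.chunks = append(processedPrefix, e.chunks[marker:]...)
	e.mu.Unlock()
}

func main() {
	e := &entry{chunks: []string{"c1", "c2"}}
	e.flush(func(snap []string) []string {
		e.appendChunk("c3") // an upload finishing mid-flush
		return []string{fmt.Sprintf("manifest(%d chunks)", len(snap))}
	})
	fmt.Println(e.chunks)
}
```

Assigning the processed list back without the `e.chunks[marker:]` tail is the original bug: c3 would be dropped from the entry even though its data is on the volume.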

* mount: drop redundant slice copies in the flush splice

processedPrefix is freshly built and the tail sub-slice is consumed
immediately under the entry lock, so append straight onto processedPrefix
instead of allocating two throwaway copies.
2026-05-19 20:47:52 -07:00
Lars Lehtonen
9914e6af30 chore(weed/command): prune unused functions (#9573)
* chore(weed/command): prune unused functions

* drop now-unused closed field and renderLocked guard

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-19 17:45:50 -07:00
Chris Lu
cc5ef1b741 feat(s3): add TagUser, UntagUser, ListUserTags IAM actions (#9572)
* feat(s3): add TagUser, UntagUser, ListUserTags IAM actions

Adds AWS IAM-compatible user tag operations on the embedded IAM
endpoint. Tags persist in the Identity proto as a repeated UserTag
field; the existing 50-tag / 128-byte-key / 256-byte-value AWS limits
are enforced. Pagination is stubbed (IsTruncated=false) since the
50-tag cap means all tags fit in a single response.
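
The limits can be sketched as a merge-with-validation step. Names here are hypothetical, and two details are assumptions rather than facts from the message: that an incoming tag replaces an existing tag with the same key, and that empty keys are rejected on TagUser as well.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxUserTags = 50
	maxKeyLen   = 128 // bytes
	maxValueLen = 256 // bytes
)

type tag struct{ Key, Value string }

// mergeUserTags validates the incoming tags and merges them over existing ones.
func mergeUserTags(existing, incoming []tag) ([]tag, error) {
	seen := make(map[string]bool, len(incoming))
	for _, t := range incoming {
		switch {
		case t.Key == "" || len(t.Key) > maxKeyLen:
			return nil, errors.New("invalid tag key length")
		case len(t.Value) > maxValueLen:
			return nil, errors.New("tag value too long")
		case seen[t.Key]:
			return nil, errors.New("duplicate tag key in request")
		}
		seen[t.Key] = true
	}
	merged := make([]tag, 0, len(existing)+len(incoming))
	for _, t := range existing {
		if !seen[t.Key] { // assumption: incoming replaces same-key existing
			merged = append(merged, t)
		}
	}
	merged = append(merged, incoming...)
	if len(merged) > maxUserTags {
		return nil, errors.New("too many tags")
	}
	return merged, nil
}

func main() {
	merged, err := mergeUserTags(
		[]tag{{Key: "team", Value: "storage"}},
		[]tag{{Key: "team", Value: "infra"}, {Key: "env", Value: "prod"}},
	)
	fmt.Println(merged, err)
}
```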

* review: validate UntagUser TagKeys entries

parseTagKeysParams now rejects empty keys and keys past
MaxUserTagKeyLength; UntagUser additionally requires at least one
TagKeys.member.N entry to match AWS validation behavior.

* review: pre-allocate user-tag merge and filter slices

mergeUserTags now allocates the combined existing+incoming capacity
up front; UntagUser builds the filtered slice via make with the full
ident.Tags capacity instead of ident.Tags[:0:0], which forced a
reallocation on every append.

* review: cover duplicate-in-request and invalid TagKeys cases

Regression tests assert TagUser rejects two members with the same key
in one request, and UntagUser rejects missing/empty/oversized TagKeys
entries.
2026-05-19 17:35:44 -07:00
Chris Lu
37b6a14b0d feat(s3): add four bucket configuration handlers (#9570)
* feat(s3): add four bucket configuration handlers

- GetBucketPolicyStatus: computes IsPublic from the existing bucket policy
- PutBucketRequestPayment: companion writer to the existing GET; accepts
  only BucketOwner
- GetBucketAccelerateConfiguration: returns <Status>Suspended</Status>
- GetBucketLogging: returns an empty BucketLoggingStatus

Lets AWS SDK probes succeed instead of returning MethodNotAllowed.

* review: route GetBucketPolicyStatus through checkBucket

Mirrors the existence/auth gating used by other bucket handlers and
drops the bespoke filer_pb lookup so NoSuchBucket precedence is
consistent across the API surface.

* review: cap PutBucketRequestPayment body with MaxBytesReader

The body is unmarshalled as RequestPaymentConfiguration, which is a
handful of bytes; reject excessively large payloads up front and
defer Close immediately after wrapping.
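
With the standard library, capping a request body this way is typically done with http.MaxBytesReader. The handler below is a sketch, not the SeaweedFS handler: the 1 KiB cap, the status codes, and the error text are assumptions, and the XML shape follows the RequestPaymentConfiguration document with a single Payer element.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

type requestPaymentConfiguration struct {
	XMLName xml.Name `xml:"RequestPaymentConfiguration"`
	Payer   string   `xml:"Payer"`
}

const maxBody = 1 << 10 // hypothetical cap for this sketch

func putRequestPayment(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, maxBody)
	defer r.Body.Close()

	var cfg requestPaymentConfiguration
	if err := xml.NewDecoder(r.Body).Decode(&cfg); err != nil {
		http.Error(w, "malformed or oversized body", http.StatusBadRequest)
		return
	}
	if cfg.Payer != "BucketOwner" {
		http.Error(w, "only BucketOwner is accepted", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// statusFor drives the handler in-process and returns the response code.
func statusFor(body string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPut, "/bucket?requestPayment", strings.NewReader(body))
	putRequestPayment(rec, req)
	return rec.Code
}

func main() {
	ok := "<RequestPaymentConfiguration><Payer>BucketOwner</Payer></RequestPaymentConfiguration>"
	bad := "<RequestPaymentConfiguration><Payer>Requester</Payer></RequestPaymentConfiguration>"
	fmt.Println(statusFor(ok), statusFor(bad))
}
```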

* review: gate static getters on checkBucket

GetBucketAccelerateConfiguration and GetBucketLogging now run the
standard bucket existence check before returning the static
Suspended / empty-status response so a missing bucket cannot appear
to have valid configuration.

* review: share cache helper across misc tests; check io.ReadAll error

Accelerate and Logging tests now run through newMiscTestServer like
the others so the checkBucket guard sees a cached bucket; the
ReadAll error is explicitly checked.
2026-05-19 17:35:08 -07:00
Chris Lu
cee2bf697c feat(s3): stub bucket configuration list endpoints (#9571)
* feat(s3): stub bucket configuration list endpoints

Adds Get and List handlers for Analytics, Inventory, IntelligentTiering,
and Metrics bucket configurations. List returns an empty result with
IsTruncated=false; single-get returns NoSuchConfiguration so SDK error
parsing remains predictable.

* review: gate stubs on bucket existence

All eight stub handlers now call checkBucket via stubBucketGuard so
NoSuchBucket takes precedence over NoSuchConfiguration / empty-list
responses, matching AWS S3 precedence. Tests provide a cached bucket
so the guard sees it as present.
2026-05-19 17:34:51 -07:00
Chris Lu
285025eb73 s3api: support group inline policies + Condition enforcement (#9569)
* test(s3api): cover IAM inline policy aws:SourceIp + group inline gap

Unit tests under weed/s3api/ drive PutUserPolicy / PutGroupPolicy → reload
→ VerifyActionPermission with a synthetic 127.0.0.1 request and assert that
the policy's IpAddress condition flips the outcome.

The user-policy cases pass on master (hydrateRuntimePolicies already routes
inline docs through the policy engine, so Condition blocks are honored end-
to-end). The group-policy case fails: PutGroupPolicy still returns
NotImplemented, so a group inline doc never lands in the engine.

Integration counterparts live under test/s3/iam/ and exercise the same
paths against a live SeaweedFS S3+IAM endpoint.

* s3api: support group inline policies + Condition enforcement

PutGroupPolicy/GetGroupPolicy/DeleteGroupPolicy/ListGroupPolicies used to
return NotImplemented in embedded IAM mode, so anything attached to a
group as an inline doc — including aws:SourceIp or any other Condition —
was simply unreachable.

Wire the four endpoints to the credential-store methods that were
already in place (memory, postgres, filer_etc all implement
GroupInlinePolicyStore). On every config reload, hydrateRuntimePolicies
now also walks LoadGroupInlinePolicies, registers each doc in the IAM
policy engine under __inline_group_policy__/<group>/<policy>, and
appends that key to Group.PolicyNames so evaluateIAMPolicies picks it up
through its existing group walk. PutGroupPolicy/DeleteGroupPolicy are
added to the ReloadConfiguration trigger list in DoActions.

Side fix: MemoryStore.LoadConfiguration now surfaces store.groups too.
Without it iam.groups never repopulated on a memory-store reload, so
group policy evaluation silently no-op'd whether the policy was inline
or attached. The existing tests didn't notice because no test reloaded
through cm after creating a group.

The NotImplemented unit test is inverted to drive the new round-trip.

* s3api: drop redundant refreshIAMConfiguration from Put/DeleteGroupPolicy

DoActions already triggers ReloadConfiguration for both actions via the
explicit reload list, so calling refreshIAMConfiguration inline runs the
load twice per request. Per PR review.

* s3api: scope group-policy resource names per test; tighten deny polling

- Integration test resource names get a per-test suffix so retried or
  parallel CI jobs don't trip EntityAlreadyExists / BucketAlreadyExists.
- Deny-path Eventually loops gate on AccessDenied via a typed helper
  rather than any non-nil error; transient setup errors no longer end
  the wait prematurely.
- ListGroupPolicies returns ServiceFailure when the credential manager
  is nil, matching Put/Get/DeleteGroupPolicy.

* test(s3 iam): cover both IPv4 and IPv6 loopback in allow CIDRs

CI runners with happy-eyeballs resolve `localhost` to ::1 first, in
which case a 127.0.0.0/8-only allow would silently never match and the
deny-driven enforcement test would hang for the allow case. Add ::1/128
to every loopback-matching policy so the allow path works regardless of
which loopback family the SDK lands on.
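
The family mismatch can be reproduced with the standard net/netip package. `inAnyCIDR` is a hypothetical helper for illustration and is not how the policy engine evaluates aws:SourceIp.

```go
package main

import (
	"fmt"
	"net/netip"
)

// inAnyCIDR reports whether addr falls inside at least one of the CIDRs.
// Unparseable inputs are treated as non-matching.
func inAnyCIDR(addr string, cidrs []string) bool {
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return false
	}
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			continue
		}
		if p.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	v4Only := []string{"127.0.0.0/8"}
	both := []string{"127.0.0.0/8", "::1/128"}

	fmt.Println(inAnyCIDR("127.0.0.1", v4Only)) // IPv4 loopback matches
	fmt.Println(inAnyCIDR("::1", v4Only))       // IPv6 loopback does not
	fmt.Println(inAnyCIDR("::1", both))         // covered once ::1/128 is listed
}
```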
2026-05-19 16:03:45 -07:00
Chris Lu
77ac781bbd fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers (#9568)
* fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers

When a volume server holds EC shards for the same vid across more than
one disk, each DiskLocation registers its own EcVolume entry and
Store.FindEcVolume returns whichever one it hits first. The shard-info
RPC iterated only that single EcVolume's Shards, so the response missed
every shard mounted on a sibling disk.

The worker's verifyEcShardsBeforeDelete sums the per-server responses
into a union bitmap and refuses to delete the source volume when the
union falls short of dataShards+parityShards. On multi-disk
destinations, the union was systematically under-counted and source
deletion got blocked even though all shards were physically present and
mounted.

Walk every DiskLocation in the handler and emit the deduplicated union
of all shards. The .ecx-backed fields (file counts, volume size) still
come from a single EcVolume since every disk's entry opens the same
.ecx via NewEcVolume's cross-disk fallback.

Tests:
- TestVolumeEcShardsInfo_AggregatesAcrossDisks unit test in
  weed/server/.
- test/volume_server/grpc/ec_verify_multi_disk_test.go integration test
  drives the full generate -> mount -> redistribute -> restart ->
  reconcile path and asserts both VolumeEcShardsInfo and
  VerifyShardsAcrossServers + RequireFullShardSet (the production
  source-deletion gate) report all 14 shards.
- ec_multi_disk_lifecycle_test.go tightened: replaces the
  "VolumeEcShardsInfo only sees one disk's EcVolume" workaround with a
  full-shard-set assertion.

* review: use ShardBits bitmask + cap-pre-allocation for shard dedup
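
A bitmask makes the cross-disk union and the deduplication a single OR per shard. The `shardBits` type below is a simplified, hypothetical stand-in written for this sketch, not the repository's ShardBits.

```go
package main

import (
	"fmt"
	"math/bits"
)

// shardBits is a hypothetical stand-in: bit i is set when shard i is present.
type shardBits uint32

// add returns the mask with shard id set; setting it twice has no extra effect.
func (b shardBits) add(id uint) shardBits { return b | 1<<id }

// count returns the number of distinct shards in the mask.
func (b shardBits) count() int { return bits.OnesCount32(uint32(b)) }

// unionAcrossDisks ORs together the shard ids mounted on each disk.
func unionAcrossDisks(disks [][]uint) shardBits {
	var all shardBits
	for _, shards := range disks {
		for _, id := range shards {
			all = all.add(id)
		}
	}
	return all
}

func main() {
	// Shard 6 appears on both disks but is counted once.
	disks := [][]uint{
		{0, 1, 2, 3, 4, 5, 6},
		{6, 7, 8, 9, 10, 11, 12, 13},
	}
	fmt.Println(unionAcrossDisks(disks).count())
}
```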
2026-05-19 14:58:56 -07:00
Chris Lu
f72983c1fd fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" (#9566)
* fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table"

The S3 Tables REST endpoints share top-level paths with the regular S3
API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for
GetTable). They are registered first on the same router as the bucket
subrouter, so a path-style request such as GET /buckets?list-type=2 on
a bucket actually named "buckets" matched ListTableBuckets and returned
JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with
"Unexpected character '{' (code 123) in prolog".

Disambiguate by requiring the AWS V4 credential scope to name the
s3tables service on the colliding routes. Regular S3 SDKs sign with
service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is
present in both the Authorization header and the X-Amz-Credential query
parameter for presigned URLs, so the matcher works for both flavors.

ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.)
already cannot collide because colons are not valid in bucket names, so
they are left untouched.

* fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal

The Iceberg catalog integration tests send unsigned PUT /buckets with
Content-Type: application/x-amz-json-1.1 to create table buckets. With
only the credential-scope check, those requests fell through to the
regular S3 CreateBucket handler and the suite went red on this branch.

Extend the matcher so a request is recognized as S3 Tables when either:

  - its AWS V4 credential scope names SERVICE=s3tables; or
  - it carries the canonical AWS JSON RPC 1.1 content type and is
    unsigned (a request explicitly signed for SERVICE=s3 still wins).

The regular S3 SDKs do not send application/x-amz-json-1.1, so the
signal is safe for the colliding paths (/buckets, /get-table).

Also add an AWS SDK V2 for Go integration test under
test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer
against a bucket literally named "buckets" and "get-table" — the SDK
errors before the test asserts if the server returns the wrong body
shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml,
mirroring the etag/acl workflow.

* s3api: extend service matcher to all S3 Tables routes; simplify scope check

- Apply serviceMatcher to every S3 Tables route, not just the bare-path
  ones. ARN-bearing paths could otherwise be hit by an S3 object key
  that starts with arn:aws:s3tables:..., inside a bucket named
  "buckets", "namespaces", "tables", or "tag". One matcher everywhere
  closes both collision classes.
- Replace strings.Split + index lookup with strings.Contains for the
  credential-scope check. The scope shape is fixed at
  AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components,
  and access keys are alphanumeric — so /s3tables/ matches iff SERVICE
  is exactly s3tables. Existing unit cases (including the
  access-key-substring case) still pass.
- Read the GetObject body in the SDK v2 routing test with io.ReadAll;
  the single Read could return short and make the equality check flaky.
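
The simplified scope check can be sketched as below. `scopeNamesS3Tables` is a hypothetical name, and the credential strings are made-up examples in the AK/DATE/REGION/SERVICE/aws4_request shape.

```go
package main

import (
	"fmt"
	"strings"
)

// scopeNamesS3Tables reports whether a SigV4 credential value names the
// s3tables service. It relies on the fixed scope shape, where slashes only
// delimit components.
func scopeNamesS3Tables(credential string) bool {
	return strings.Contains(credential, "/s3tables/")
}

func main() {
	fmt.Println(scopeNamesS3Tables("AKIAEXAMPLE/20260519/us-east-1/s3tables/aws4_request"))
	fmt.Println(scopeNamesS3Tables("AKIAEXAMPLE/20260519/us-east-1/s3/aws4_request"))
	// An access key that merely contains the substring does not match,
	// because it is not enclosed by slashes.
	fmt.Println(scopeNamesS3Tables("s3tablesKEY/20260519/us-east-1/s3/aws4_request"))
}
```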

* s3api: drop content-type fallback; sign s3 tables harness traffic instead

The content-type fallback in isS3TablesSignedRequest let an anonymous
regular-S3 request whose body type is application/x-amz-json-1.1 hit
an S3 Tables route when the path-style object key happened to be
shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets"
with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher
back to the AWS V4 credential scope so only requests signed for
SERVICE=s3tables match the S3 Tables routes.

Update the Iceberg catalog test harness — the only caller still
sending unsigned PUT /buckets — to sign with SERVICE=s3tables. The
mini instance runs in default-allow mode, so the signature itself is
not verified; only the credential scope matters for the route match.

Drop the stale unit cases for the JSON-RPC content-type signal and
the routing test that exercised unsigned harness traffic.
2026-05-19 14:24:25 -07:00
Chris Lu
cfc08fbf6c fix(volume): tombstone integrity check no longer flips volumes read-only (fixes #9563) (#9565)
* fix(volume): pass on-disk tombstone size to ReadData in verifyDeletedNeedleIntegrity

verifyDeletedNeedleIntegrity was forwarding TombstoneFileSize (-1) into
Needle.ReadData. A deletion tombstone is appended to .dat with DataSize=0
so the on-disk needle header carries Size=0; TombstoneFileSize is only
the .idx sentinel for "this entry is deleted" and is never written into
a needle header.

ReadBytes' size check therefore mismatched on every tombstone
(-1 != 0), returned ErrorSizeMismatch, and triggered the
4-byte-offset wrap-around retry in ReadData (offset + 32 GB). On any
volume large enough that offset+32 GB exceeds dat fileSize the retry
read EOF, CheckVolumeDataIntegrity reported corruption, and the loader
set noWriteOrDelete = true. Every volume whose last 10 .idx entries
included a deletion went read-only on startup — i.e. any healthy
volume where the most recent operations included a delete.

Pass Size(0) so the size check matches the on-disk tombstone header.

Add a regression test that writes three needles, deletes one, and
asserts CheckVolumeDataIntegrity succeeds with a tombstone at the .idx
tail. Without this fix the test reproduces the exact log shape from
the bug report:

  read 0 dataSize 32 offset <orig+32GB> fileSize <much smaller>: EOF
  verifyDeletedNeedleIntegrity ...idx failed: read data [N,N+32) : EOF

The Rust port guards its integrity-check size comparison with
!size.is_deleted() (seaweed-volume/src/storage/volume.rs) and never
hits this path, so no Rust mirror change is needed.

* test(seaweed-volume): mirror Go regression for deletion-tombstone integrity

The Rust integrity check already guards its size-mismatch comparison
with !size.is_deleted() (volume.rs:1859) and reads tombstone AppendAtNs
with body_size=0, so the Go regression fixed in the previous commit
does not apply. Lock that guarantee in with a parallel reload test:
write three needles, delete one, sync, reopen via Volume::new, assert
the volume is not flipped read-only.

Catches any future change that removes the deleted-entry guard or
re-introduces a size-strict path in check_volume_data_integrity for
tombstones.

* fix(volume): propagate io.EOF and ErrorSizeMismatch from verifyDeletedNeedleIntegrity

CheckVolumeDataIntegrity relies on identity comparison against io.EOF
and ErrorSizeMismatch to walk back through the last ten .idx entries
and tolerate a partial truncation at the tail (the "fix and continue"
loop). The live-needle branch in doCheckAndFixVolumeData already
returns those sentinels unwrapped; the deletion branch wrapped them
in fmt.Errorf, so a genuine .dat truncation past a tombstone offset
broke the recovery and flipped the volume read-only.

Mirror the live-needle handling: both verifyDeletedNeedleIntegrity
and doCheckAndFixVolumeData now short-circuit on io.EOF /
ErrorSizeMismatch and pass them through unwrapped. Other errors keep
their existing context wrapping.
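
A sketch of the pass-through rule, under the assumption that callers compare with `==`. `errSizeMismatch` and `wrapUnlessSentinel` are hypothetical stand-ins for the package's own sentinel and helper.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errSizeMismatch is a stand-in for the package's size-mismatch sentinel.
var errSizeMismatch = errors.New("size mismatch")

// wrapUnlessSentinel returns io.EOF and errSizeMismatch unchanged so that
// callers comparing with == still recognize them; other errors gain context.
func wrapUnlessSentinel(err error, what string) error {
	if err == nil || err == io.EOF || err == errSizeMismatch {
		return err
	}
	return fmt.Errorf("%s: %w", what, err)
}

func main() {
	// Passed through: identity comparison still holds.
	fmt.Println(wrapUnlessSentinel(io.EOF, "verify tombstone") == io.EOF)

	// Wrapped with fmt.Errorf: identity comparison no longer holds.
	wrapped := fmt.Errorf("verify tombstone: %v", io.EOF)
	fmt.Println(wrapped == io.EOF)

	// Any other error keeps its context.
	fmt.Println(wrapUnlessSentinel(errors.New("disk failure"), "verify tombstone"))
}
```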

Also tighten the regression test to capture lastAppendAtNs and assert
it's non-zero, so a future regression that skips the tombstone body
(and therefore never populates AppendAtNs) is caught even when the
err check still passes.
2026-05-19 13:11:19 -07:00
Chris Lu
d57de6dc20 fix(s3): keep anonymous access working with EnableIam default (fixes #9557) (#9567)
fix(s3): keep anonymous access working with EnableIam default

`docker run seaweedfs` (and `weed mini` with no config) start with
EnableIam=true but no IAM config file and no identities. The advanced-IAM
init path was failing in 4.25 because of the missing STS signing key,
which masked a latent bug: SetIAMIntegration unconditionally flipped
isAuthEnabled to true, and isEnabled() also treated a non-nil
iamIntegration as auth-on. Once the mini SSE-S3 KEK landed in 4.26 the
STS fallback started succeeding, the integration got installed end to
end, and every anonymous S3 request bounced as AccessDenied.

Separate the two concerns: SetIAMIntegration just plumbs in the OIDC /
embedded-IAM machinery, and a new EnableAuthEnforcement opts in to
enforcement. The startup path calls it only when -s3.iam.config is
actually provided, so operators with explicit IAM configs still get auth
(preserves #7726). isEnabled() now reads isAuthEnabled only.
2026-05-19 13:03:30 -07:00
Peter Dodd
4476cb282b feat(filer): add atime to FuseAttributes + TouchAccessTime RPC (#9556)
* feat(filer): add atime field and TouchAccessTime RPC to filer proto

Introduce POSIX-style access-time tracking on the filer:
- FuseAttributes gains atime (field 22) and atime_ns (field 23).
- New TouchAccessTime RPC (and TouchAccessTime{Request,Response})
  lets read paths bump atime without going through UpdateEntry's
  chunk-rewrite/EqualEntry short-circuit.

Additive proto changes only; zero atime is treated as unset and
existing clients are unaffected. The Java client proto is kept in
lockstep.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(filer): wire Atime through Attr codec with mtime fallback

Add Attr.Atime and round-trip it through EntryAttributeToPb /
EntryAttributeToExistingPb / PbToEntryAttribute. A zero proto atime
decodes as Mtime, so legacy entries report a sensible value and
freshly-created/updated entries default Atime to Mtime when callers
do not set it explicitly.

CreateEntry and UpdateEntry stamp Atime = Mtime (or Crtime) when it
is zero. TouchAccessTime later bypasses this path to write atime
alone via Store.UpdateEntry.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(filer): preserve atime in first epoch second on decode

The Atime decode branch previously treated any attr.Atime == 0 as
unset and overwrote it with Mtime, which drops valid timestamps in
the first second of the unix epoch where attr.Atime is 0 but
attr.AtimeNs > 0. Check both fields so we only fall back to Mtime
when both are zero.
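
The corrected fallback condition, sketched with plain integers. `decodeAtime` and its parameter types are assumptions for illustration, not the codec's actual signature.

```go
package main

import "fmt"

// decodeAtime falls back to mtime only when both atime fields are zero.
// A zero seconds field with non-zero nanoseconds is a valid timestamp in
// the first second of the Unix epoch and is kept as is.
func decodeAtime(atimeSec int64, atimeNs int32, mtimeSec int64, mtimeNs int32) (int64, int32) {
	if atimeSec == 0 && atimeNs == 0 {
		return mtimeSec, mtimeNs
	}
	return atimeSec, atimeNs
}

func main() {
	// Unset atime: report mtime.
	fmt.Println(decodeAtime(0, 0, 1700000000, 0))
	// atime inside the first epoch second: preserved.
	fmt.Println(decodeAtime(0, 500, 1700000000, 0))
}
```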

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-19 10:22:17 -07:00
Chris Lu
b63610cf8f volume: accept legacy needle CRC encoding on read (#9564)
Volumes written by versions before 3.09 (commit 056c480eb) store the
needle checksum using the deprecated CRC.Value() transform. When the
read path moved into readNeedleTail, the fallback that accepts both
encodings was dropped, so .dat files copied from old installs now fail
verification with "invalid CRC ... data on disk corrupted" even though
the data is intact. Restore the dual check, matching the surviving
fallback in volume_read.go.
2026-05-19 09:58:47 -07:00
Chris Lu
c61d227613 s3api: verify source permission on CopyObject and UploadPartCopy (#9555)
* s3api: verify source permission on CopyObject and UploadPartCopy

The Auth middleware only authorized the destination because routes key on
the request URL. The source from X-Amz-Copy-Source was never evaluated,
so an STS session token scoped to one prefix could copy from any other
prefix in the same bucket.

Add AuthorizeCopySource on IdentityAccessManagement to run the full
bucket-policy + IAM/identity flow against the source, using a synthetic
GetObject request so action resolution lands on s3:GetObject (or
s3:GetObjectVersion when a source versionId is supplied). Both
CopyObjectHandler and CopyObjectPartHandler now invoke it before reading
the source.

* s3api: preserve presigned-URL session token on copy-source check

Presigned CopyObject / UploadPartCopy requests carry the STS session
token in the query string (X-Amz-Security-Token), not in a header.
Rebuilding the synthetic source URL from scratch dropped that token, so
the source authorization would fall through to non-STS paths and miss
session policy enforcement. Forward X-Amz-Security-Token from the
original query (alongside versionId), still excluding unrelated params
like uploadId/partNumber that would steer ResolveS3Action away from
s3:GetObject.
2026-05-18 21:35:53 -07:00
Chris Lu
7c252e1f16 fix(volume): reopen .idx writable after MarkVolumeWritable (fixes #9515) (#9526)
* fix(volume): reopen .idx writable after MarkVolumeWritable

When .vif has ReadOnly=true, load() opens .idx as O_RDONLY and builds a
SortedFileNeedleMap whose Put returns os.ErrInvalid. MarkVolumeWritable
only flipped noWriteOrDelete back to false and rewrote .vif, so writes
still failed at v.nm.Put. Reopen .idx in O_RDWR and rebuild v.nm in its
writable form (in-memory or leveldb small/medium/large) before flipping
the flag.

Mirror the same fix in seaweed-volume: the Rust load path leaves
CompactNeedleMap/RedbNeedleMap with no idx_file writer when the volume
boots read-only, so post-MarkVolumeWritable puts silently succeeded
in-memory only and were lost on the next restart. set_writable now
reattaches an append-mode writer when one is missing.

* fix(volume): keep old needle map until replacement is built; defer writable flag

Go: build the writable needle map into a local before swapping. A
construction failure now leaves v.nm pointing at the original
SortedFileNeedleMap so MarkVolumeWritable can roll back, instead of
stranding the volume with v.nm == nil.

Rust: attach the .idx writer before flipping no_write_or_delete to
false. A transient open/metadata failure used to leave the volume
marked writable with no writer attached, and subsequent puts would
silently skip the on-disk append.
2026-05-18 20:51:04 -07:00
Chris Lu
7c5296dfb1 fix(admin): switch file browser upload/download to filer gRPC + volume HTTP (#9538)
* fix(admin): switch file browser upload/download to filer gRPC + volume HTTP

The admin file browser proxied uploads and downloads through the filer's
HTTP listener, so the whole feature 404'd against filers started with
-disableHttp=true even though S3 still worked on its own port. Re-route
through the filer gRPC service: LookupDirectoryEntry + StreamContent for
reads (chunks flow straight from the volume servers), AssignVolume +
volume HTTP POST + CreateEntry for writes. Volume read tokens come from
jwt.signing.read.key when configured; the old jwt.filer_signing tokens
no longer apply since the filer HTTP surface is bypassed.

* admin file browser: propagate request context + track response writes

Pass r.Context() into uploadFileToFiler so a client disconnect cancels
the in-flight chunked upload instead of letting it run to completion
against the volume servers. For DownloadFile, replace the Content-Type
probe with a small response-writer wrapper that records whether headers
or bytes have actually been sent, so the error path can't silently
convert a pre-stream failure into a partial response if future code
moves the header-setting around.
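
One way such a wrapper could look, as a sketch only. `trackingWriter`, `started`, and `startedAfter` are hypothetical names, not the identifiers used in the admin handler.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// trackingWriter records whether anything has been committed to the client.
type trackingWriter struct {
	http.ResponseWriter
	wroteHeader bool
	wroteBytes  int64
}

func (t *trackingWriter) WriteHeader(code int) {
	t.wroteHeader = true
	t.ResponseWriter.WriteHeader(code)
}

func (t *trackingWriter) Write(p []byte) (int, error) {
	t.wroteHeader = true // a Write without WriteHeader sends an implicit 200
	n, err := t.ResponseWriter.Write(p)
	t.wroteBytes += int64(n)
	return n, err
}

// started reports whether it is too late to send a clean error response.
func (t *trackingWriter) started() bool {
	return t.wroteHeader || t.wroteBytes > 0
}

// startedAfter is a small demo helper: it optionally writes one chunk and
// reports what the wrapper recorded.
func startedAfter(write bool) bool {
	tw := &trackingWriter{ResponseWriter: httptest.NewRecorder()}
	if write {
		tw.Write([]byte("chunk"))
	}
	return tw.started()
}

func main() {
	fmt.Println(startedAfter(false)) // nothing sent yet: an error status is still possible
	fmt.Println(startedAfter(true))  // bytes sent: the response is already underway
}
```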
2026-05-18 20:33:16 -07:00
Chris Lu
58c3fa802c fix(s3): keep host-less bucket catch-all so reverse proxies work (#9540)
When s3.domainName is set, all bucket-prefix routes were gated on a
matching Host header. Requests that arrive via an IP, an unlisted
hostname, or a reverse proxy that rewrites Host hit no router and bounce
back as 405/404 (and 503 once a proxy maps the upstream error).

Register the path-style catch-all unconditionally, after the
host-specific routers, so it only fires when no Host matcher applies.
2026-05-18 19:44:19 -07:00
dependabot[bot]
d3f80444df build(deps): bump github.com/cognusion/imaging from 1.0.2 to 1.0.3 (#9552)
Bumps [github.com/cognusion/imaging](https://github.com/cognusion/imaging) from 1.0.2 to 1.0.3.
- [Commits](https://github.com/cognusion/imaging/compare/v1.0.2...v1.0.3)

---
updated-dependencies:
- dependency-name: github.com/cognusion/imaging
  dependency-version: 1.0.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-18 19:43:33 -07:00
Chris Lu
0dc65e7069 fix(admin.plugin): include disk_id in EC execution plan (#9547)
TaskSource and TaskTarget carry disk_id on the wire, but the execution
plan map built for the admin UI dropped the field entirely. On a
multi-disk node holding shards of the same volume, there was no way to
tell from the plan which disk would receive each shard. Include
disk_id on each endpoint and target_disk_id on each shard assignment,
and extend the existing execution-plan test to set and assert the
field.
2026-05-18 19:43:18 -07:00
ᎠᎡ. Ѕϵrgϵ Ѵictor
18c6c24e47 Revise MinIO comparison in README for accuracy (#9548)
Updated the README to reflect the current status of MinIO, noting that its development has ceased and that it has security concerns, and revised the descriptions of its features compared to SeaweedFS.
2026-05-18 19:32:54 -07:00
dependabot[bot]
120901c883 build(deps): bump github.com/parquet-go/parquet-go from 0.28.0 to 0.30.1 (#9549)
Bumps [github.com/parquet-go/parquet-go](https://github.com/parquet-go/parquet-go) from 0.28.0 to 0.30.1.
- [Release notes](https://github.com/parquet-go/parquet-go/releases)
- [Changelog](https://github.com/parquet-go/parquet-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/parquet-go/parquet-go/compare/v0.28.0...v0.30.1)

---
updated-dependencies:
- dependency-name: github.com/parquet-go/parquet-go
  dependency-version: 0.30.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-18 19:28:42 -07:00
dependabot[bot]
a79880ed41 build(deps): bump github.com/redis/go-redis/v9 from 9.18.0 to 9.19.0 (#9550)
Bumps [github.com/redis/go-redis/v9](https://github.com/redis/go-redis) from 9.18.0 to 9.19.0.
- [Release notes](https://github.com/redis/go-redis/releases)
- [Changelog](https://github.com/redis/go-redis/blob/master/RELEASE-NOTES.md)
- [Commits](https://github.com/redis/go-redis/compare/v9.18.0...v9.19.0)

---
updated-dependencies:
- dependency-name: github.com/redis/go-redis/v9
  dependency-version: 9.19.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-18 19:28:32 -07:00
dependabot[bot]
f5aa776742 build(deps): bump github.com/Azure/azure-sdk-for-go/sdk/storage/azblob from 1.6.4 to 1.7.0 (#9551)
build(deps): bump github.com/Azure/azure-sdk-for-go/sdk/storage/azblob

Bumps [github.com/Azure/azure-sdk-for-go/sdk/storage/azblob](https://github.com/Azure/azure-sdk-for-go) from 1.6.4 to 1.7.0.
- [Release notes](https://github.com/Azure/azure-sdk-for-go/releases)
- [Commits](https://github.com/Azure/azure-sdk-for-go/compare/sdk/storage/azblob/v1.6.4...sdk/azcore/v1.7.0)

---
updated-dependencies:
- dependency-name: github.com/Azure/azure-sdk-for-go/sdk/storage/azblob
  dependency-version: 1.7.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-18 19:28:24 -07:00
dependabot[bot]
f3d6633aac build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.99.0 to 1.101.0 (#9553)
build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3

Bumps [github.com/aws/aws-sdk-go-v2/service/s3](https://github.com/aws/aws-sdk-go-v2) from 1.99.0 to 1.101.0.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/service/s3/v1.99.0...service/s3/v1.101.0)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/service/s3
  dependency-version: 1.101.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-18 19:28:06 -07:00
Chris Lu
68794fb94c fix(ec_distribute): remove partial files on copy stream error (#9543)
* fix(ec_distribute): remove partial files on copy stream error

writeToFile opens the destination with O_TRUNC and streams into it. On
a mid-stream receive / write / cancellation error it returned the
failure but left the destination behind in whatever state had been
written so far — typically 0 bytes when the source errored before
sending any FileContent. VolumeEcShardsCopy distributes .ecx by
calling doCopyFile, so this same stub-leaving behaviour produced the
0-byte .ecx files seen on EC encoding failures: the source claims a
non-zero ModifiedTsNs (so the existing "source not found" cleanup
doesn't fire), the stream then errors immediately, and the receiver
ends up with a 0-byte .ecx that downstream code mistook for a valid
empty index.

Clean up the partial file on every error path that returns from the
streaming loop (receive, write, and cancellation). Skip cleanup when
isAppend=true so resumable appends keep their existing content. As
defense in depth, VolumeEcShardsCopy also stats the .ecx after copy
and removes / errors on a 0-byte result so the orchestrator can pick
a different source.

The Rust volume server has only the source side of CopyFile (no
client-side stream-to-disk consumer) and no .ecx subsystem yet, so
this fix has no Rust mirror.

* fix(ec_distribute): close file before remove, fail fast on stat error

Address review feedback:

- writeToFile's mid-stream removeIncomplete called os.Remove while the
  destination file handle was still open. On Windows os.Remove fails
  while a handle is open, so the cleanup wouldn't run there. Wrap the
  handle close in a once-only helper, call it from removeIncomplete
  and from the existing "source not found" cleanup, and keep a deferred
  close as the safety net for the normal-return path.
- VolumeEcShardsCopy's post-copy .ecx check silently passed when
  os.Stat returned an error: doCopyFile had reported success but if
  the file was already gone, unreadable, or somehow a directory, the
  orchestrator only learned at mount time with no useful context.
  Treat any non-nil stat error and any directory result as a copy
  failure here and surface it immediately.
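
The close-before-remove ordering with a once-only close can be sketched as follows. `writeThenCleanup` and `cleanupLeavesNothing` are hypothetical helpers that simulate a failed stream; they are not the functions changed in this commit.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// writeThenCleanup writes a few bytes, then simulates a mid-stream failure:
// it closes the handle exactly once and only then removes the file.
func writeThenCleanup(path string) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	var once sync.Once
	closeOnce := func() { once.Do(func() { f.Close() }) }
	defer closeOnce() // safety net for the normal-return path

	if _, err := f.Write([]byte("partial")); err != nil {
		return err
	}

	// Close first so the remove also works on platforms that refuse to
	// delete a file with an open handle.
	closeOnce()
	return os.Remove(path)
}

// cleanupLeavesNothing reports whether the partial file is gone afterwards.
func cleanupLeavesNothing() bool {
	dir, err := os.MkdirTemp("", "ecx-demo")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "1.ecx")
	if err := writeThenCleanup(path); err != nil {
		return false
	}
	_, statErr := os.Stat(path)
	return os.IsNotExist(statErr)
}

func main() {
	fmt.Println(cleanupLeavesNothing())
}
```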
2026-05-18 15:19:51 -07:00
Chris Lu
af8d4e00ee fix(ec_mount): reject 0-byte .ecx and aggregate cross-disk failures (#9542)
* fix(ec_mount): reject 0-byte .ecx and aggregate cross-disk failures

MountEcShards's per-disk loop bailed on the first disk returning a
non-ENOENT error, and NewEcVolume wrapped its ENOENT with %v so the
caller's `err == os.ErrNotExist` check never matched. On a multi-disk
volume server where ec.balance / ec.rebuild had distributed shards
across sibling disks while the matching .ecx never arrived, the mount
loop bailed after disk 0 with "cannot open ec volume index" and the
operator never saw that the rest of the disks were also empty. The
companion failure mode is a 0-byte .ecx stub left by EC distribute's
writeToFile after a mid-stream copy failure: Stat() succeeds, so the
stub is treated as a valid index, and downstream mount work proceeds
against an empty file.

Wrap the ec-volume open errors with %w, treat a 0-byte .ecx as
os.ErrNotExist (in NewEcVolume, findEcxIdxDirForVolume, and
HasEcxFileOnDisk), and have MountEcShards collect per-disk failures
before returning a single aggregated error. The "no .ecx anywhere"
case gets a distinct error so the orchestrator can re-copy the index
from a healthy replica rather than retry against the same broken
state.
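
The effect of %v versus %w on a not-exist check can be seen in a small sketch. `statIndex` and `isMissing` are hypothetical helpers, and the path is a made-up one that is assumed not to exist.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// statIndex stats path and adds context to any failure. With wrap=false the
// cause is flattened to text via %v; with wrap=true it stays in the error
// chain via %w.
func statIndex(path string, wrap bool) error {
	_, err := os.Stat(path)
	if err == nil {
		return nil
	}
	if wrap {
		return fmt.Errorf("cannot open ec volume index %s: %w", path, err)
	}
	return fmt.Errorf("cannot open ec volume index %s: %v", path, err)
}

// isMissing reports whether err has os.ErrNotExist anywhere in its chain.
func isMissing(err error) bool {
	return errors.Is(err, os.ErrNotExist)
}

func main() {
	const path = "/nonexistent-dir-for-demo/1.ecx" // assumed to be absent

	fmt.Println(isMissing(statIndex(path, false))) // %v: the cause is lost
	fmt.Println(isMissing(statIndex(path, true)))  // %w: the cause is found
}
```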

* fix(ec_reconcile): indexEcxOwners also rejects 0-byte .ecx stubs

findEcxIdxDirForVolume already skipped 0-byte .ecx during MountEcShards,
but indexEcxOwners (used by reconcileEcShardsAcrossDisks at startup)
still recorded the first .ecx by name only. On a store where one disk
holds a 0-byte stub left by a failed EC distribute and a sibling disk
holds the real index, the stub would win the owner selection — and
NewEcVolume's new size check would then refuse to load against it,
leaving the orphan shards unloaded even though a valid index exists.

Mirror the size check from findEcxIdxDirForVolume: skip directory
entries whose .ecx Info() reports size 0 or whose Info() call fails.

* fix(ec_mount): accept 0-byte .ecx as valid empty index

The previous commit treated a 0-byte .ecx in NewEcVolume as
os.ErrNotExist, on the assumption that any empty .ecx was a stub left
by a failed copy stream. That broke the legitimate empty-volume case:
when an EC volume's source .idx has no live entries (e.g. all needles
deleted before WriteSortedFileFromIdx), the sorted .ecx is genuinely
0 bytes and must mount. The integration test
TestEcShardsToVolumeMissingShardAndNoLiveEntries fails with
"MountEcShards: no .ecx index found on any local disk" because the
mount path now refuses the legitimate empty index.

A 0-byte .ecx left by a failed copy stream is indistinguishable from
the legitimate empty case by file size alone. Preventing stub files
from being written is the receiver-side cleanup in writeToFile's job
(the companion EC distribute PR), not NewEcVolume's at mount time.

The cross-disk lookup helpers (findEcxIdxDirForVolume, HasEcxFileOnDisk,
indexEcxOwners) keep their size > 0 preference: when a real .ecx
exists on a sibling disk alongside a stub, we still want to route
mounts and reconciliation to the real one. If no non-zero .ecx exists
anywhere, the per-disk fallback in MountEcShards can still open the
0-byte .ecx and the volume mounts.

Replace TestMountEcShards_ZeroByteEcxOnlyDisk with
TestMountEcShards_EmptyEcxMountsSuccessfully, which pins the
empty-volume invariant.
2026-05-18 15:00:33 -07:00
Chris Lu
41b6ad002b fix(volume.list): show one entry per physical disk on multi-disk nodes (#9541)
* fix(volume.list): show one entry per physical disk on multi-disk nodes

DataNodeInfo.DiskInfos is keyed by disk type, so several same-type
physical disks on one node collapse to a single map entry at the master.
volume.list iterated that map directly and reported one "Disk hdd ...
id:0" line per node, hiding the per-disk volume and shard layout. EC
operators on multi-disk volume servers had no way to verify which
physical disk a shard landed on.

Lift the per-physical-disk split into a DiskInfo.SplitByPhysicalDisk()
method on the proto type so consumers outside admin/topology can use
it. Apply it in writeDataNodeInfo so the verbose Disk block shows one
entry per physical disk, ordered by DiskId. Capacity counters are
split evenly across reconstructed disks since the wire format doesn't
carry per-disk capacity yet.

This is a display-only change. ActiveTopology already did the split on
its own and is now updated to call the shared helper.

* fix(volume.list): preserve totals, count active/remote exactly, dedupe header

Address review feedback on the per-physical-disk split:

- share() truncated remainders, so reconstructed per-disk counters could
  sum to less than the original aggregate (10 split across 3 disks gave
  3+3+3 = 9). Distribute the remainder to the lowest disk ids so
  MaxVolumeCount and FreeVolumeCount sum exactly back to the node totals.
- ActiveVolumeCount and RemoteVolumeCount are derivable per disk from
  the VolumeInfos already grouped by DiskId, so count them exactly
  (ReadOnly=false and RemoteStorageName!="" respectively) instead of
  approximating with an even split.
- writeDataNodeInfo's per-disk callback fired the DataNode header on
  every iteration after the split, so a node with 6 physical disks
  emitted 6 DataNode headers. Guard the callback with headerPrinted so
  the header still appears at most once per node.
- Sort split disks deterministically using explicit DiskId comparison
  to avoid int overflow risk on 32-bit systems.
- Tighten the volume.list test substring to "id:N\n" so unrelated
  tokens like "ec volume id:101" don't accidentally match the id:1
  needle, and assert the rack callback fires once.
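
The remainder handling in the first bullet can be illustrated with a
minimal sketch. splitEvenly is a hypothetical stand-in, not the actual
share() helper, and it assumes non-negative counters.

```go
package main

import "fmt"

// splitEvenly divides total across n disks and gives the remainder to
// the lowest indexes, so the parts always sum back to total.
// Assumes total >= 0; hypothetical helper for illustration only.
func splitEvenly(total int64, n int) []int64 {
	if n <= 0 {
		return nil
	}
	parts := make([]int64, n)
	base, rem := total/int64(n), total%int64(n)
	for i := range parts {
		parts[i] = base
		if int64(i) < rem {
			parts[i]++ // the first rem disks absorb one extra unit each
		}
	}
	return parts
}

func main() {
	fmt.Println(splitEvenly(10, 3)) // [4 3 3]
}
```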
2026-05-18 14:43:44 -07:00
Chris Lu
a761441926 fix(test): reserve mini ports on all interfaces; bound risingwave cleanup shell (#9545)
The 127.0.0.1-only reservation in AllocateMiniPorts/AllocatePortSet let
another process hold the gRPC port on a different interface, so weed
mini's isPortAvailable check failed and it shifted master.grpc. weed
shell -master=<HTTP> still derives gRPC as HTTP+10000, so it dialed the
unused port and hung until the 30s context deadline killed it. Bind the
reservation listeners on :port to match mini's check.

Also bound listFilerContents in catalog_risingwave with a 30s
exec.CommandContext so a hung weed shell during failure-cleanup can't
burn the 20-minute test budget.
2026-05-18 14:16:22 -07:00
Chris Lu
37e6263efe fix(shell): attach admin JWT for filer IAM gRPC calls (#9536)
When jwt.filer_signing.key is set, the filer's IamGrpcServer requires
a Bearer token on every IAM RPC. The shell's s3.* IAM commands dialed
without that header and failed with Unauthenticated. Route them through
a small helper that mints a token from the same key viper-loaded from
security.toml and appends it as outgoing metadata, matching the credential
grpc_store pattern.
2026-05-18 13:42:32 -07:00
Chris Lu
3d872a1416 fix(filer): load -s3.config static identities into the filer's CredentialManager (#9537)
When weed filer started its embedded S3 gateway with -s3 -s3.config, only
the S3 server loaded the s3.json static identities — the filer's own
CredentialManager stayed empty, so the IAM gRPC service backing the admin
UI and weed shell returned only dynamic users. Mirror the wiring weed
server already does and hand the same config path to the filer.
2026-05-18 13:41:30 -07:00
Chris Lu
4d04609bb8 fix(mount): don't release file handles from FUSE Forget (#9529)
fix(mount): don't release file handles from Forget

Forget(nodeid, nlookup) only decrements the kernel inode lookup count.
File handle lifecycle belongs to FUSE Open/Release. Driving the FH
refcount from Forget coupled two unrelated counters and could tear down
a still-live handle if Forget ever raced ahead of Release.

Drop the ReleaseByInode call (and the now-unused method).
2026-05-18 01:02:58 -07:00
Chris Lu
01b3e4a71c template 4.26
2026-05-17 23:12:04 -07:00
Chris Lu
6cab199400 fix(iceberg): dial filer gRPC address verbatim in plugin worker (#9527)
* fix(iceberg): dial filer gRPC address verbatim in plugin worker

dialFiler was running its address argument through pb.ServerAddress.ToGrpcAddress,
whose single-port fallback adds +10000 to any host:port — so when the admin
forwards ClusterContext.FilerGrpcAddresses (already host:grpcPort) to the worker,
the iceberg handler turns the real gRPC port (e.g. 18888) into a non-existent
28888 and dispatched jobs fail with connection refused.

Drop the conversion; the address is already dialable. Tests that produced fake
filer addresses in dual-port form now return host:grpcPort to match the new
contract.
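
To make the double-offset failure concrete, here is a simplified model
of the conversion described above. toGrpcAddress is a hypothetical
re-implementation for illustration, not the real
pb.ServerAddress.ToGrpcAddress; it only assumes the two address forms
and the +10000 fallback named in this log.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const grpcPortOffset = 10000

// toGrpcAddress models the conversion: host:httpPort.grpcPort yields
// host:grpcPort, while a single-port host:port falls back to port+10000.
func toGrpcAddress(addr string) string {
	colon := strings.LastIndex(addr, ":")
	if colon < 0 {
		return addr
	}
	host, ports := addr[:colon], addr[colon+1:]
	if dot := strings.LastIndex(ports, "."); dot >= 0 {
		return host + ":" + ports[dot+1:]
	}
	port, err := strconv.Atoi(ports)
	if err != nil {
		return addr
	}
	return host + ":" + strconv.Itoa(port+grpcPortOffset)
}

func main() {
	fmt.Println(toGrpcAddress("filer:8888"))       // filer:18888
	fmt.Println(toGrpcAddress("filer:8888.18888")) // filer:18888
	// Converting an address that is already host:grpcPort adds the offset again.
	fmt.Println(toGrpcAddress("filer:18888")) // filer:28888
}
```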

* test(ec): use renamed detection_interval_minutes field

The admin_runtime.detection_interval_seconds field was renamed to
detection_interval_minutes back in May. This integration test was not
updated, so the unknown JSON field was silently ignored and the scheduler
fell back to the default detection interval (17 min for erasure_coding),
which exceeds the test's 5-minute wait and times out.

Switch to detection_interval_minutes: 1 — local run completes in ~120s.
2026-05-17 23:03:00 -07:00
Chris Lu
136eb1b7c8 4.26 2026-05-17 21:05:25 -07:00
Chris Lu
c11ff6657b fix(ec): mirror EC sidecars onto every shard-bearing disk at startup (#9525)
* fix(ec): mirror EC sidecars onto every shard-bearing disk at startup

In a multi-disk volume server, ec.balance and ec.rebuild can land shards
on a disk that does not also hold the matching .ecx / .ecj / .vif index
files. The orphan-shard reconciler in reconcileEcShardsAcrossDisks
already loads those shards by pointing the EcVolume at the sibling
disk's index files; reads work, but any failure on the index-owning
disk silently disables every shard on the other disk, even though those
shards are physically fine.

This change adds mirrorEcMetadataToShardDisks, a startup pass that
physically replicates .ecx / .ecj / .vif onto each disk that holds
shards but is missing them. Each copy is atomic (tmp + fsync + rename)
and idempotent (a destination that already has the sidecar is
preserved). After mirroring, the cross-disk reconciler prefers the
local IdxDirectory so the EcVolume mounts self-contained; the
cross-disk virtual mount remains as a fallback for volumes whose mirror
failed (read-only target, out of space, partial copy on a previous
boot).

The same-disk invariant that the EC lifecycle (encode / decode /
balance / vacuum / repair) was documented as promising is now actually
restored at boot, so a future failure of one disk in a split-shards
layout no longer takes the other disk's shards with it.

Tests cover the orphan-layout mirror (dir0 receives the .ecx / .ecj /
.vif from dir1) and idempotency (an existing destination .ecx is not
overwritten with the owner's copy).

* fix(ec): handle legacy pre-dir.idx sidecar layout in mirror skip-check

hasAllEcSidecarsLocally checked only the modern destination path
(IdxDirectory for .ecx/.ecj, Directory for .vif). A destination disk
that still had a legacy .ecx in its data dir (written before -dir.idx
was set) would report "not present" and the mirror would write a
second copy to IdxDirectory, leaving two .ecx files on disk.

Matches HasEcxFileOnDisk's open-with-fallback contract: check the
modern path first, then the opposite directory. Factored the
exists-and-not-a-dir check into a small statRegular helper so the
fallback ladder stays readable.

* rust(seaweed-volume): mirror EC sidecars onto shard-bearing disks at startup

Port of the Go fix (commit 088e26ea6) to the Rust volume server.
Adds Store::mirror_ec_metadata_to_shard_disks, called from
add_location / load_new_volumes before the cross-disk orphan
reconciler. Physically copies .ecx / .ecj / .vif from the disk that
owns the index files onto every disk holding shards but missing
sidecars, so each shard-bearing disk ends up self-contained.

The reconciler now prefers the local idx_directory when the mirror
has installed a .ecx there; the cross-disk virtual mount remains as
the fallback for volumes whose mirror failed (read-only target, out
of space, partial copy on a previous boot). Adds ec_local_ecx_path
helper shared between reconcile and mirror to detect the post-mirror
fast path.

Mirrors the Go-side fallback in hasAllEcSidecarsLocally: when
-dir.idx is configured and the destination still has a legacy .ecx
in its data dir, that's recognized so the mirror does not write a
duplicate copy into idx_directory.

Tests cover the two key cases: orphan layout (dir0 receives the
sidecars from dir1) and idempotency (a pre-existing destination .ecx
is not overwritten).

* trim verbose comments on EC mirror code

Comments now lead with the WHY (non-obvious constraints, the
post-mirror fast path, why local copies are authoritative) and drop
restate-the-code blocks, headers, and section dividers. Behavior is
unchanged; all existing tests still pass on both the Go volume
server and the seaweed-volume Rust port.

* drop github issue refs from added comments

Two stray "#9212" references slipped into comments I added on the
cross-disk reconciler call site. The git log carries the issue
history; comments stand on their own.

* test(ec): accept rebuild on either disk after sidecar mirror

TestEcLifecycleAcrossMultipleDisks asserted the rebuilt shard 9 must
land at the disk-0 path. With the boot-time sidecar mirror, every
shard-bearing disk owns its own .ecx, so VolumeEcShardsRebuild now
picks whichever disk hosts the most shards — disk 1 in this layout
after the deletion. The shard can legitimately rebuild on either
disk; the test now accepts both and uses the chosen path for the
subsequent mount + read verification.
2026-05-17 19:55:15 -07:00
Chris Lu
6b94701213 mini: quieter startup with a docker-compose-style progress board (#9524)
* mini: quieter startup with a docker-compose-style progress board

Replaces noisy startup/shutdown logs with a single in-place progress
table on a TTY (or one line per state change off-TTY). Each component
renders as `pending -> starting -> ready` during startup and
`stopping -> stopped` during shutdown, with elapsed time on transition.

Also folds in a few cleanups uncovered while making this readable:

- route the admin.go startup prints through glog so quietMiniLogs()
  filters them under mini but standalone weed admin still shows them
- generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK
  and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key
  conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under
  the data folder so restarts reuse the same key
- demote worker/master gRPC Recv 'context canceled' to V(1); those are
  the normal shutdown signal, not Errors/Warnings
- drop the 'Optimized Settings' block and the 'credentials loaded from
  environment variables' message from the welcome banner
- only show the credentials setup hints when no S3 identities exist
  (new s3api.HasAnyIdentity accessor backed by an atomic.Bool)
- use S3_BUCKET in the credentials hint so it pairs with
  AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
- reorder running-services list to master / volume / filer / webdav /
  s3 / iceberg / admin

* mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors

loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3
won't encrypt data under a KEK that the next restart can't reproduce
(which would orphan whatever was written this run). The caller already
treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM
just stay disabled for this run.

startAdminServer's serve goroutine used to only log ListenAndServe
failures, so a bind error left the caller blocked on ctx.Done() with
no listener. Forward the error through a buffered channel and select
on it alongside ctx.Done().

* ci(s3-proxy-signature): match weed mini's new progress-board ready line

The readiness probe grepped for "S3 (gateway|service).*(started|ready)",
which matched weed mini's old "S3 service is ready at ..." line. Mini
now emits "  S3           ready (Xs)" from its progress board, so the
old pattern missed it and the test timed out at the 30-second wait.

Widen the alternation to also accept "S3\s+ready". The curl HEAD
fallback already covers any remaining cases.
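
The effect of the wider alternation can be checked with a small Go
sketch. The CI probe itself is not shown in this log, so the combined
pattern below is an assumption built from the two fragments quoted
above, and the sample lines are the ones quoted above with a made-up
elapsed time.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Pattern the probe used before this change.
	oldReady = regexp.MustCompile(`S3 (gateway|service).*(started|ready)`)
	// Assumed widened pattern: the old alternation plus "S3\s+ready".
	newReady = regexp.MustCompile(`S3 (gateway|service).*(started|ready)|S3\s+ready`)
)

func main() {
	boardLine := "  S3           ready (3s)"
	legacyLine := "S3 service is ready at ..."

	fmt.Println(oldReady.MatchString(boardLine))  // false
	fmt.Println(newReady.MatchString(boardLine))  // true
	fmt.Println(newReady.MatchString(legacyLine)) // true
}
```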
2026-05-17 19:13:09 -07:00
Chris Lu
ff6f9fd90a iam: honor configured credential store for IAM API policies and propagate to S3 caches (fixes #9518) (#9522)
* iamapi: route managed policies through credential manager (fixes #9518)

CreatePolicy via the IAM API wrote straight to the filer
/etc/iam/policies.json, ignoring any non-filer credential store. When
credential.postgres was configured, policies created via the IAM API
landed only in the filer while the Admin UI wrote to postgres,
producing a split-brain where ListPolicies/GetPolicy never saw the
Admin UI's policies and vice versa.

GetPolicies/PutPolicies on IamS3ApiConfigure now load managed policies
from credentialManager and persist Create/Update/Delete as a delta
against the store. Inline user/group policies still live in the legacy
policies.json file (no credential-store API for them yet). Pre-existing
managed policies in the legacy file are merged on read so deployments
don't lose data, and re-persisted to the store on the next write so
the legacy file is drained over time.

* credential: route IAM API inline policies through credential manager

Extends the #9518 fix to user-inline and group-inline policies so the
IAM API never writes the legacy /etc/iam/policies.json bundle directly.
The previous patch only routed managed policies; this one finishes the
job for the other two policy types.

- Add GroupInlinePolicyStore + GroupInlinePoliciesLoader optional
  interfaces, mirroring the existing user-inline ones, and matching
  Put/Get/Delete/List/LoadAll wrappers on CredentialManager.
- Implement group-inline storage in memory (new map), filer_etc (new
  field on PoliciesCollection, reusing the legacy file under policyMu),
  and postgres (new group_inline_policies table with ON DELETE CASCADE
  off the groups FK).
- Wire the new methods through PropagatingCredentialStore so wrapped
  stores still delegate correctly.
- IamS3ApiConfigure.PutPolicies now applies managed + user-inline +
  group-inline as deltas through the credential manager; the legacy
  /etc/iam/policies.json file is never written when a credential
  manager is wired up. GetPolicies still reads the legacy bundle once
  as a fallback so unmigrated data is picked up and re-persisted into
  the store on the next write.

* credential: propagate SaveConfiguration writes to running S3 caches

Postgres (and any non-filer) credential stores never fired the S3 IAM
cache invalidation path on bulk identity / group updates. The
PropagatingCredentialStore had explicit Put/Remove handlers for
single-entity calls (CreateUser, PutPolicy, etc.) but inherited
SaveConfiguration unchanged from the embedded store, so the bulk path
the IAM API takes at the end of every handler was silent. Inline-policy
changes recompute identity.Actions and persist via SaveConfiguration,
so until restart the cached Actions on each S3 server stayed stale and
authorization decisions used the pre-change view.

Override SaveConfiguration to snapshot the prior user / group lists,
delegate the save, then fan out PutIdentity / PutGroup for what's in
the new config and RemoveIdentity / RemoveGroup for what got pruned.
Reuses the existing SeaweedS3IamCache RPCs, no protobuf changes.
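
The snapshot-and-diff step can be sketched with plain name lists.
planFanOut is a hypothetical helper, not the PropagatingCredentialStore
code; it only shows which names would get a Put call and which a Remove
call.

```go
package main

import "fmt"

// planFanOut compares the names present before and after a bulk save.
// Everything in the new config is re-sent as a put; names that were
// present before but are gone now are sent as removes.
func planFanOut(before, after []string) (puts, removes []string) {
	kept := make(map[string]struct{}, len(after))
	for _, name := range after {
		kept[name] = struct{}{}
		puts = append(puts, name)
	}
	for _, name := range before {
		if _, ok := kept[name]; !ok {
			removes = append(removes, name)
		}
	}
	return puts, removes
}

func main() {
	puts, removes := planFanOut([]string{"alice", "bob"}, []string{"bob", "carol"})
	fmt.Println(puts, removes) // [bob carol] [alice]
}
```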

* iamapi: drain legacy policies.json after authoritative credential-store writes

Review pointed out a resurrection bug: GetPolicies still reads
/etc/iam/policies.json as a one-way migration fallback, but PutPolicies
in the credential-manager path never wrote that file, so legacy-only
entries reappeared on the next read even after the IAM API "deleted"
them. PutPolicies now overwrites the bundle with an empty {} after a
successful credential-store write, unless the store is filer_etc
(which owns the bundle as its own inline-policy backing — clearing it
would wipe filer_etc's data). Also wraps the filer read, JSON
unmarshal, and marshal errors with context per the other review
comments.
2026-05-17 13:15:27 -07:00