seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-07-20 06:52:24 +00:00

Author	SHA1	Message	Date
Chris Lu	87fdea5330	fix(admin): carry filer addresses as ServerAddress in plugin cluster context (#9600 ) The plugin cluster context forwarded filers as gRPC-only addresses (host:grpcPort). The admin-script worker stored that in ShellOptions.FilerAddress, whose shell commands re-derive the gRPC port via ToGrpcAddress() and re-add the +10000 offset, dialing a non-existent host:28888. Carry filers in pb.ServerAddress form (host:httpPort.grpcPort) and let each consumer convert when it dials: the admin shell uses it verbatim, while the s3_lifecycle and iceberg workers collapse it to a gRPC address. Rename the proto field filer_grpc_addresses -> filer_addresses so the name matches the content.	2026-05-21 02:10:27 -07:00
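The double-offset failure described in the entry above can be sketched as follows. This is a minimal illustration, not the project's code: `toGrpcAddress` is a hypothetical stand-in that only mimics the described behavior (an explicit `host:httpPort.grpcPort` wins, otherwise 10000 is added to the port), and the host name `filer1` is made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toGrpcAddress is a hypothetical helper (an assumption for illustration).
// It accepts "host:httpPort" or "host:httpPort.grpcPort" and returns the
// address to dial for gRPC.
func toGrpcAddress(addr string) string {
	colon := strings.LastIndex(addr, ":")
	if colon < 0 {
		return addr
	}
	host, ports := addr[:colon], addr[colon+1:]
	// An explicit gRPC port after the dot is used as-is.
	if dot := strings.LastIndex(ports, "."); dot >= 0 {
		return host + ":" + ports[dot+1:]
	}
	port, err := strconv.Atoi(ports)
	if err != nil {
		return addr
	}
	// No explicit gRPC port: derive it with the +10000 offset.
	return host + ":" + strconv.Itoa(port+10000)
}

func main() {
	// A gRPC-only address fed through the conversion gets the offset twice.
	fmt.Println(toGrpcAddress("filer1:18888")) // filer1:28888
	// The host:httpPort.grpcPort form converts to the intended port.
	fmt.Println(toGrpcAddress("filer1:8888.18888")) // filer1:18888
}
```

Carrying the full form and converting only at dial time means each consumer applies the conversion once.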
Chris Lu	303c2be38d	feat(fix): rebuild lost EC index (.ecx) and .vif from local shards (#9596 ) weed fix -ecx reconstructs the .dat from the local data shards, scans the needles, and writes a fresh ascending-sorted .ecx containing only live entries — the same on-disk index WriteSortedFileFromIdx emits at encode time. When the .vif is also missing it is regenerated from the inferred EC ratio (flags > .vif > shard-count inference / 10+4) and the .dat size recovered from the scan. When some data shards are missing but at least dataShards shards survive, the missing shards are first reconstructed from the survivors via Reed-Solomon, so a partial shard set is repaired too. Also makes erasure_coding.WriteDatFile de-stripe using len(shardFileNames) instead of the DataShardsCount constant, so the caller's actual data-shard count is honored (behavior-preserving for the default 10, and fixing the existing caller that already passes ECContext.DataShards). This recovers an EC volume whose sealed index was lost from every node while enough shards survive, a state neither ec.rebuild nor ec.decode can repair because both require an existing .ecx. Flags: -ecx, -ecDataShards, -ecParityShards. Run with the volume server stopped.	2026-05-21 00:41:27 -07:00
Mmx233	9b9fdb5b76	fix(s3): sync IAM policies to advanced IAM Manager policy engine (#9577 ) * fix(s3): sync IAM policies to advanced IAM Manager policy engine * test(s3): add unit tests for PutPolicy/DeletePolicy IAM Manager sync * fix(s3): flush loaded policies in SetIAMIntegration, drop extra reload Sync the policies already loaded from the credential store into the IAM Manager's engine from SetIAMIntegration itself, instead of re-running a full LoadS3ApiConfigurationFromCredentialManager after setup. This covers both startup orderings without a second filer round-trip or racing the async loader goroutine: if the load won, the policies are in memory to push; if SetIAMIntegration won, the load's own sync runs afterward. Move the runtime PutPolicy/DeletePolicy sync out of the iam.m write lock so the per-request auth RLock path isn't blocked by the policy recompile. * fix(s3): serialize IAM manager policy resync to avoid stale snapshots SyncRuntimePolicies replaces the manager's full policy set, so applying a policy view captured before a later mutation can resurrect a deleted policy or drop a new one. Funnel every path (PutPolicy, DeletePolicy, SetIAMIntegration, and the credential-manager load) through a single resyncIAMManagerPolicies that serializes on a dedicated mutex and reads iam.policies fresh at apply time, so the live map always wins regardless of interleaving. The load now installs the config into iam.policies before resyncing, closing the window where the manager held policies the map didn't yet have. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-21 00:39:42 -07:00
Chris Lu	7e4691f2dc	test(ec): make multi-disk EC balance disk-spread assertion deterministic (#9595 ) test(ec): pre-populate disks so multi-disk EC balance spread is deterministic The multidisk shard-loss regression asserts EC shards spread across more than one disk per node, but that only holds for disks the balancer can see. The master enumerates a physical disk only when it already holds a volume or EC shard — an empty disk leaves no trace, since heartbeats aggregate capacity per disk type, not per physical disk. So whether the post-encode balance spread shards depended on how the master happened to place the filler volumes across disks, which varies by environment: the test passed locally (shards on 5 disks) but produced one disk per node in CI and failed the "got 3 disks across 3 nodes" assertion. Grow a few volumes on each server before encoding so every physical disk holds a volume and is visible to the balancer. The volume server places each new volume on its least-loaded disk, so a handful of grows touches every disk, making the spread deterministic. The assertion still has teeth: it counts disks holding shard files, so a balancer that failed to spread would still collapse to one disk per node.	2026-05-21 00:17:14 -07:00
Chris Lu	391f543ff2	fix(ec): correct multi-disk disk counting and EC balance shard attribution (#9594 ) * fix(shell): count physical disks in cluster.status on multi-disk nodes The master keys DataNodeInfo.DiskInfos by disk type, so several same-type physical disks on one node collapse into a single DiskInfo entry. cluster.status (printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting one disk per node instead of the real physical disk count, while volume.list and the admin ActiveTopology already split per physical disk. Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N same-type disks reports N. Cosmetic/diagnostic only; placement already uses the per-disk activeDisk map. * fix(ec): attribute EC balance source disk per shard and reject same-node moves On multi-disk nodes the EC balance worker built a node-level view that kept only the first physical disk id per (node, volume), so a move of a shard living on a different disk reported the wrong source disk. That source disk drives the per-disk capacity reservation, so the wrong disk drifts the capacity model the EC placement planner relies on. Track shards per physical disk and resolve the actual source disk for every emitted move (dedup, cross-rack, within-rack, global), keeping the per-disk view consistent as simulated moves are applied. Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the shard from every disk on the node) and copyAndMountShard skips the copy when source and target addresses match, so a same-node move would erase a shard it never copied. isDedupPhase now requires the same node AND disk, and Validate / Execute reject same-node cross-disk moves outright. * fix(ec): spread EC balance moves across destination disks Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance worker so a moved shard is placed on a good physical disk instead of always deferring to the volume server (target disk 0). 
The detection now builds a per-physical-disk view of each node (free slots split from the node total, exact EC shard count, disk type, discovered from both regular volumes and EC shards) and, for each cross-rack, within-rack, and global move, chooses the destination disk by ascending score: - fewer total EC shards on the disk, - far fewer shards of the same volume on the disk (spread a volume's shards across disks for fault tolerance), and - data/parity anti-affinity (a data shard avoids disks holding the volume's parity shards and vice versa). Planned placements are reserved on the in-memory model during a run so multiple shards moved to the same node spread across its disks rather than piling on one. * fix(ec): bring EC balance worker to parity with shell ec.balance The worker's cross-rack and within-rack balancing balanced shards by total count; the shell balances data and parity shards separately with anti-affinity and honors replica placement. Port that logic so the automatic balancer makes the same fault-tolerance-aware decisions as the manual command: - Cross-rack and within-rack now run a two-pass balance: data shards spread first, then parity shards spread while avoiding racks/nodes that already hold the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks and doBalanceEcShardsWithinOneRack. - Optional replica placement: a new replica_placement config (e.g. "020") constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty keeps the previous even-spread behavior. - The data/parity boundary is resolved from a per-collection EC ratio (standard 10+4 here), replacing the previously hardcoded constant at the call sites. Selection is deterministic (sorted keys) to keep behavior reproducible. 
* refactor(ec): extract shared ecbalancer package for shell and worker The EC shard balancing policy was duplicated between the shell ec.balance command and the admin EC balance worker, and the two had drifted (multi-disk handling, data/parity anti-affinity, replica placement). Extract the policy into a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers share so it cannot drift again. - ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and within-rack data/parity two-pass with anti-affinity, global per-rack balance, and diversity-aware disk selection) over a caller-built Topology snapshot and returns the shard Moves. It depends only on erasure_coding and super_block. - The worker builds the Topology from the master topology and turns Moves into task proposals; the shell builds it from its EcNode model and executes Moves via the existing move/delete RPCs. Per-collection EC ratio resolution stays in each caller (passed as Options.Ratio). - Options expose the two genuine policy differences: GlobalUtilizationBased (worker balances by fractional fullness; shell by raw count) and GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains in one pass). The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to the ecbalancer package; the shell and worker keep their adapter/execution tests. * fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor Address regressions and gaps from the ecbalancer extraction: - Shell ec.balance honors -maxParallelization again: planned moves run phase by phase (preserving cross-phase dependencies) with bounded concurrency within a phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and updates the in-memory model for inspection. 
- Rack and node balancing gate on per-type spread (data and parity separately) instead of combined totals, so a data/parity skew is corrected even when the per-rack/node totals are even. - Global rack balancing iterates the full shard-id space (MaxShardCount) so custom EC ratios with more than the standard total are candidates. - Cross-rack planning decrements the destination node's free slots per planned move, so limited-capacity targets are no longer over-planned. * fix(ec): make EC dedup keeper deterministic and capacity-aware When a shard is duplicated across nodes, keep the copy on the node with the most free slots and delete the duplicates from the more-constrained nodes, relieving capacity pressure where it is tightest. Tie-break on node id so the choice is deterministic. This unifies the shell and worker (the shell previously kept the least-free node, an incidental default) on the more sensible behavior. * fix(ec): restore global volume-diversity and per-volume move serialization Two more behaviors lost in the ecbalancer refactor: - Global rack balancing again prefers moving a shard of a volume the destination does not hold at all before adding another shard of an already-present volume (two-pass, mirroring the old balanceEcRack), keeping each volume's shards spread across nodes. - Shell apply-mode execution serializes a single volume's moves within a phase while still running different volumes in parallel, so concurrent moves of the same volume cannot race on its shared .ecx/.ecj/.vif sidecar files. * fix(ec): key EC balance shards by (collection, volume id) A numeric volume id can be reused across collections, and EC identity is (collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed Node.shards by vid alone, so volumes sharing an id across collections merged into one entry — letting dedup delete a "duplicate" that is actually a different collection's shard, and letting moves act across collections. 
Key shards by (collection, vid) throughout so each volume stays distinct. * fix(ec): credit freed capacity from dedup before later balance phases Dedup deletions are simulated only by applyMovesToTopology, which cleared shard bits but did not return the freed disk/node/rack slots. Later phases reject destinations with no free slots, so a slot opened by dedup could not be reused in the same Plan/ec.balance run. applyMovesToTopology now credits the freed disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline accounting their phase already did). * test(ec): add multi-disk EC balance integration test Cover issue 9593 end-to-end at the unit level the old tests missed: build the master's actual multi-disk wire format (same-type disks collapsed into one DiskInfo, real DiskId only in per-shard records), run it through a real ActiveTopology and the Detection entry point, then replay the planned moves with the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume, and a multi-rack spread, and asserts moves are safe (no same-node cross-disk), correctly attributed to the source disk, and redistribute concentrated volumes across both other racks and multiple destination disks. * fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a volume. A multi-disk node reports one EcShardInfo per physical disk holding shards of the volume, so only the last disk's shards survived — the node looked like it was missing shards it actually had. This made ec.encode's pre-delete verification (and ec.decode) under-count volumes whose shards are spread across disks on one server, falsely aborting the encode on multi-disk clusters. Union the per-disk shard sets per server instead. 
Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the master via volume-server heartbeats, so a freshly distributed shard set may not be fully visible the instant the balance returns. Retry before concluding the set is incomplete; genuine loss still fails after the retries are exhausted. * test(ec): end-to-end multi-disk EC balance shard-loss regression Start a real cluster of multi-disk volume servers (3 servers x 4 disks), EC-encode a volume, run ec.balance, and assert hard invariants the prior integration tests only logged: after encode all 14 shards exist, ec.balance loses no shard, shards span more than one disk per node, and cluster.status counts physical disks (not one per node). This reproduces issue 9593 end to end and would have caught the multi-disk shard-aggregation bug fixed alongside it. * fix(ec): bring EC balance worker/plugin path to parity with shell - Per-volume serialization and phase order: key the plugin proposal dedupe by (collection, volume) instead of (volume, shard, source), so the scheduler runs only one of a volume's moves at a time (within a run and against in-flight jobs). Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and because the planner emits a volume's moves in phase order, they now execute in order across detection cycles, matching the shell. - disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks instead of nothing; apply the canonical type to planner options and move params. - Replica placement: expose shard_replica_placement in the admin config form and read it into the worker config, mirroring ec.balance -shardReplicaPlacement. * test(ec): rename worker in-process test (not a real integration test) The worker-package multi-disk tests build a fake master topology and simulate move execution; they are not real-cluster integration tests. 
Rename integration_test.go -> multidisk_detection_test.go and drop the Integration prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding. * ci(ec): remove redundant ec-integration workflow ec-integration.yml duplicated EC Integration Tests under the same workflow name but ran only 'go test ec_integration_test.go' (one file), so it never ran new test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole test/erasure_coding package on every push/PR. * fix(ec): worker falls back to master default replication for EC balance For strict parity with the shell, the EC balance worker now uses the master's configured default replication as the replica-placement fallback when no explicit shard_replica_placement is set, instead of always defaulting to even spread. The maintenance scanner reads it via GetMasterConfiguration each cycle and passes it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint (explicit config wins, else master default, else none) in resolveReplicaPlacement. A zero-replication default (the common 000 case) still means even spread, so the common configuration is unchanged. * fix(ec): plugin path populates master default replication too The plugin worker built ClusterInfo with only ActiveTopology, so the master default replication fallback added for the maintenance path never reached plugin-driven EC balance detection — empty shard_replica_placement still meant even spread there. Fetch the master default via GetMasterConfiguration (new pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement so both detection paths resolve replica placement identically to the shell. 
* docs(ec): empty shard replica placement uses master default, not even spread The EC balance config text (admin plugin form, legacy form help text, and the struct/proto field comments) still said an empty shard_replica_placement spreads evenly. The runtime resolves empty to the master default replication (resolveReplicaPlacement), matching shell ec.balance, with even spread only when that default is empty or zero. Update the text to match and regenerate worker_pb for the proto comment change.	2026-05-20 23:31:21 -07:00
Chris Lu	afcc491517	test: fix fd leak in the Samba DLM handoff test (promote xfail checks) (#9592 ) test(mount): fix fd leak that deadlocked the DLM handoff check The cross-mount handoff checks held a file open on mount 2 via fd 9 to keep the distributed lock, then started the SMB writer in a background subshell. The subshell inherited fd 9, so the SMB writer kept the file open and waited on a lock held by its own descriptor; the put could never complete, and the two checks were parked as expected-fail. Close fd 9 in the subshell (9>&-) so the writer does not hold the file. The waiter now acquires the freed lock within ~1s, so the two checks are real assertions and the xfail machinery is gone.	2026-05-20 16:17:13 -07:00
Chris Lu	a5d0e4a735	Samba-over-FUSE integration test and distributed-lock handoff fixes (#9590 ) * test(mount): add Samba over FUSE integration test Export a SeaweedFS FUSE mount over SMB with smbd and drive it with smbclient: file round-trips, directories, rename, large-file chunking, recursive upload, cross-protocol consistency, and deletes. A second -dlm mount adds locking coverage: POSIX fcntl byte-range locks, distributed-lock write coordination, and concurrent writers. The two cross-mount handoff checks currently fail and pin a known limitation - the distributed lock is released on FUSE Release, which the kernel can delay under contention. Runs locally via test/samba/run.sh or in Docker via the compose file; wired into CI as samba-integration.yml. * fix(cluster): release distributed lock without racing the renewal goroutine Stop() closed the cancel channel, slept 10ms, then unlocked using renewToken. A renewal in flight during that window rotates the token on the server, so the unlock may be sent with a stale token, fail with a mismatch, and leave the lock to linger until its TTL expires - stalling other mounts waiting to write the same file. Wait for the renewal goroutine to exit before unlocking. The channel close also makes the renewToken read happen-after the last renewal. * fix(cluster): poll for distributed lock acquisition without exponential backoff A mount waiting to write a file held by another mount acquired through util.RetryUntil, whose backoff grows to several seconds. Once the holder released, the waiter could sleep that long before retrying, stretching the cross-mount handoff past client timeouts. Poll at the steady ~1s cadence AttemptToLock already enforces instead. * test(mount): tighten Samba harness and mark the DLM handoff checks xfail Run the workflow for weed/cluster changes, fail fast when the filer or smbd port never opens, and fold the recursive mput result into its own assertion so it cannot false-pass. 
Mark the two cross-mount handoff checks expected-fail: they pin the remaining DLM liveness bug (the lock is freed only on the delayed FUSE Release) without failing CI, and turn the suite red if the handoff is ever fixed. * fix(cluster): keep a wedged renewal shutdown from sending a stale unlock If the renewal goroutine is stuck in a slow RPC, Stop() fell through to unlock anyway once it timed out waiting. A late renewal can rotate renewToken, so that unlock races it, is rejected on a stale token, and leaves the lock lingering until its TTL regardless. On the timeout path, skip the unlock and let the TTL expire the lock instead. * fix(cluster): wake the long-lived lock renewal loop promptly on Stop StartLongLivedLock's renewal loop slept uninterruptibly between attempts, up to 5*renewInterval (2.5*lockTTL) while unlocked. Stop() waits only lockTTL+2s for the goroutine to exit, so a Stop() during that backoff would time out before the goroutine woke and closed renewalDone, breaking the shutdown synchronization. Sleep on a timer with a select on cancelCh so the loop exits immediately.	2026-05-20 14:52:17 -07:00
Chris Lu	a17dca7009	fix(filer): don't disable the SQL idle connection pool when unconfigured (#9591 ) * fix(filer): don't disable the SQL idle connection pool when unconfigured The mysql/mysql2/postgres stores called SetMaxIdleConns(maxIdle) unconditionally, so an unset connection_max_idle (0) actively kept zero idle connections - every query opened and closed a fresh connection instead of reusing the pool. Only apply the value when it's set; otherwise leave database/sql's default idle pool of 2 in place. * comments: shorten idle-pool note * fix(filer): default the SQL idle pool via config, keep explicit 0 honored Apply the idle-pool default at the config layer with SetDefault instead of guarding the SetMaxIdleConns call. An absent connection_max_idle now reads back as 2 (pool stays on), while an explicit 0 flows through to SetMaxIdleConns(0) so operators can still disable idle pooling on purpose.	2026-05-20 14:04:23 -07:00
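The distinction the entry above draws between an absent `connection_max_idle` and an explicit `0` can be sketched like this. The map-based config and `resolveMaxIdle` are assumptions for illustration; the real change applies the default at the config layer with SetDefault. The default of 2 matches what `database/sql` keeps when `SetMaxIdleConns` is never called.

```go
package main

import "fmt"

// defaultMaxIdle mirrors database/sql's built-in idle pool size.
const defaultMaxIdle = 2

// resolveMaxIdle is a hypothetical helper: an absent key falls back to the
// default, while an explicit value (including 0) is passed through.
func resolveMaxIdle(cfg map[string]int) int {
	if v, ok := cfg["connection_max_idle"]; ok {
		return v
	}
	return defaultMaxIdle
}

func main() {
	fmt.Println(resolveMaxIdle(map[string]int{}))                          // 2: pool stays on
	fmt.Println(resolveMaxIdle(map[string]int{"connection_max_idle": 0}))  // 0: pooling disabled on purpose
	fmt.Println(resolveMaxIdle(map[string]int{"connection_max_idle": 10})) // 10
}
```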
Chris Lu	024b59fb31	fix(ec): pack EC shards onto fewer disks instead of refusing the task (#9588 ) The planner refused to create an EC task unless it found totalShards distinct (server, disk_id) targets, so a cluster with fewer disks than shards (e.g. 8 single-disk servers for a 10+4 scheme) could never encode. A disk safely holds several distinct shards of one volume: each is its own .ecNN file and ReceiveFile keys by that extension. Drop the strict check and let createECTargets round-robin shards across the available disks, matching ec.encode's "4,4,3,3" fallback. The minTotalDisks floor (ceil(total/parity)) already keeps any disk under parityShards shards, so the volume still survives losing any one disk. Reserve capacity for the actual per-disk shard count rather than assuming one shard each, so packing doesn't over-commit disk slots.	2026-05-20 11:50:42 -07:00
Chris Lu	5af7d12f04	fix(filer.sync): keep sync_offset fresh while the source is read-only (#9589 ) * fix(filer.sync): keep sync_offset fresh while the source is read-only sync_offset holds the timestamp of the last replicated source event, so monitoring derives lag from now-sync_offset. A read-only source emits no metadata events, so the gauge froze at the last write and the derived lag grew without bound, making thresholds unusable. The source filer now sends an idle heartbeat carrying its current time while a subscriber is caught up to the buffer head. filer.sync uses it to advance the gauge, so now-sync_offset reflects real lag. Heartbeats are opt-in (client_supports_idle_heartbeat), are never written to the metadata log, and do not move the resume checkpoint, so a restart still resumes from the last real event. * fix(filer.sync): gate idle heartbeat on the read cursor, not SinceNs In metadata-chunks mode persisted entries replay as log file refs and never reach eachLogEntryFn, so lastSeenTsNs stays put and a caught-up subscriber with an old SinceNs would never get a heartbeat. Use the read cursor (lastReadTime), which advances in that mode too, max'd with lastSeenTsNs so the in-memory backlog-then-idle case still works while the cursor returned to the caller has not yet updated.	2026-05-20 11:26:37 -07:00
Chris Lu	4385b86bf1	fix(shell): volumeServer.evacuate no longer panics on a nil volume (#9587 ) adjustAfterMove now removes the moved volume from the source disk's VolumeInfos in place: it swaps the entry with the last one and nils the tail. evacuateNormalVolumes ranges directly over that same slice, so the niled tail slot is later read as a nil *VolumeInformationMessage and the move attempt panics on vol.DiskType. Iterate over a snapshot of the slice so in-place removals during a move cannot leave nil holes in the loop.	2026-05-20 10:27:00 -07:00
Chris Lu	c00aa90990	fix(s3/audit): populate requester for GET/HEAD/IAM operations (#9581 ) Authentication records the identity with r.WithContext, which returns a request copy. Handlers that log their own audit entry (PUT, DELETE, tagging) see it, but GET/HEAD object and IAM operations rely on track()'s fallback entry, which is built from the original request the auth copy never reached - so requester came out empty. Install a mutable identity holder on the request before authentication and have SetIdentityNameInContext record into it. The holder is shared by pointer across every request copy, so the fallback entry recovers the authenticated requester. The per-request context value still takes precedence, so nothing changes for handlers that see the auth copy.	2026-05-20 10:13:33 -07:00
Chris Lu	e332b97d52	fix(shell): volume.balance no longer drains all volumes onto one server (#9579 ) * fix(shell): volume.balance no longer drains all volumes onto one server The density-based capacity function reads per-disk VolumeInfos sizes, but adjustAfterMove only updated VolumeCount and the selectedVolumes map. The planner re-read a stale topology after every move, so the source node's density never dropped and it kept moving volumes until that node was empty. Move the volume's size accounting between disks after each planned move so the density recomputes and the loop converges to an even distribution. * refactor(shell): O(1) volume removal and direct disk lookup in adjustAfterMove removeVolumeInfo swaps with the last element instead of shifting, and the disk is fetched by key rather than ranging the DiskInfos map. 4.27	2026-05-20 01:39:23 -07:00
Chris Lu	868849392c	4.27	2026-05-20 00:25:16 -07:00
Chris Lu	a4415c39aa	fix(mount): keep periodic metadata flush from dropping concurrent chunk uploads (#9574 ) * fix(mount): keep periodic metadata flush from dropping concurrent chunk uploads The periodic flush snapshotted entry.Chunks, then ran CompactFileChunks and MaybeManifestize (the manifest upload is a network round trip) before reassigning entry.Chunks. Async uploaders append freshly uploaded chunks during that window, and the reassignment overwrote them: the data stayed on the volumes but the file lost those chunk references, leaving zero-filled holes on read. Large sequential writes such as cat of two 15 GiB files hit several flush cycles and ended up corrupted. Snapshot the chunk list under the entry lock with a length marker, do the slow compaction and manifestization on the snapshot, then splice the processed prefix back in front of whatever chunks arrived after the snapshot. * mount: drop redundant slice copies in the flush splice processedPrefix is freshly built and the tail sub-slice is consumed immediately under the entry lock, so append straight onto processedPrefix instead of allocating two throwaway copies.	2026-05-19 20:47:52 -07:00
Lars Lehtonen	9914e6af30	chore(weed/command): prune unused functions (#9573 ) * chore(weed/command): prune unused functions * drop now-unused closed field and renderLocked guard --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-19 17:45:50 -07:00
Chris Lu	cc5ef1b741	feat(s3): add TagUser, UntagUser, ListUserTags IAM actions (#9572 ) * feat(s3): add TagUser, UntagUser, ListUserTags IAM actions Adds AWS IAM-compatible user tag operations on the embedded IAM endpoint. Tags persist in the Identity proto as a repeated UserTag field; the existing 50-tag / 128-byte-key / 256-byte-value AWS limits are enforced. Pagination is stubbed (IsTruncated=false) since the 50-tag cap means all tags fit in a single response. * review: validate UntagUser TagKeys entries parseTagKeysParams now rejects empty keys and keys past MaxUserTagKeyLength; UntagUser additionally requires at least one TagKeys.member.N entry to match AWS validation behavior. * review: pre-allocate user-tag merge and filter slices mergeUserTags now allocates the combined existing+incoming capacity up front; UntagUser builds the filtered slice via make with the full ident.Tags capacity instead of ident.Tags[:0:0], which forced a reallocation on every append. * review: cover duplicate-in-request and invalid TagKeys cases Regression tests assert TagUser rejects two members with the same key in one request, and UntagUser rejects missing/empty/oversized TagKeys entries.	2026-05-19 17:35:44 -07:00
Chris Lu	37b6a14b0d	feat(s3): add four bucket configuration handlers (#9570 ) * feat(s3): add four bucket configuration handlers - GetBucketPolicyStatus: computes IsPublic from the existing bucket policy - PutBucketRequestPayment: companion writer to the existing GET; accepts only BucketOwner - GetBucketAccelerateConfiguration: returns <Status>Suspended</Status> - GetBucketLogging: returns an empty BucketLoggingStatus Lets AWS SDK probes succeed instead of returning MethodNotAllowed. * review: route GetBucketPolicyStatus through checkBucket Mirrors the existence/auth gating used by other bucket handlers and drops the bespoke filer_pb lookup so NoSuchBucket precedence is consistent across the API surface. * review: cap PutBucketRequestPayment body with MaxBytesReader The body is unmarshalled as RequestPaymentConfiguration, which is a handful of bytes; reject excessively large payloads up front and defer Close immediately after wrapping. * review: gate static getters on checkBucket GetBucketAccelerateConfiguration and GetBucketLogging now run the standard bucket existence check before returning the static Suspended / empty-status response so a missing bucket cannot appear to have valid configuration. * review: share cache helper across misc tests; check io.ReadAll error Accelerate and Logging tests now run through newMiscTestServer like the others so the checkBucket guard sees a cached bucket; the ReadAll error is explicitly checked.	2026-05-19 17:35:08 -07:00
Chris Lu	cee2bf697c	feat(s3): stub bucket configuration list endpoints (#9571 ) * feat(s3): stub bucket configuration list endpoints Adds Get and List handlers for Analytics, Inventory, IntelligentTiering, and Metrics bucket configurations. List returns an empty result with IsTruncated=false; single-get returns NoSuchConfiguration so SDK error parsing remains predictable. * review: gate stubs on bucket existence All eight stub handlers now call checkBucket via stubBucketGuard so NoSuchBucket takes precedence over NoSuchConfiguration / empty-list responses, matching AWS S3 precedence. Tests provide a cached bucket so the guard sees it as present.	2026-05-19 17:34:51 -07:00
Chris Lu	285025eb73	s3api: support group inline policies + Condition enforcement (#9569 ) * test(s3api): cover IAM inline policy aws:SourceIp + group inline gap Unit tests under weed/s3api/ drive PutUserPolicy / PutGroupPolicy → reload → VerifyActionPermission with a synthetic 127.0.0.1 request and assert that the policy's IpAddress condition flips the outcome. The user-policy cases pass on master (hydrateRuntimePolicies already routes inline docs through the policy engine, so Condition blocks are honored end-to-end). The group-policy case fails: PutGroupPolicy still returns NotImplemented, so a group inline doc never lands in the engine. Integration counterparts live under test/s3/iam/ and exercise the same paths against a live SeaweedFS S3+IAM endpoint. * s3api: support group inline policies + Condition enforcement PutGroupPolicy/GetGroupPolicy/DeleteGroupPolicy/ListGroupPolicies used to return NotImplemented in embedded IAM mode, so anything attached to a group as an inline doc — including aws:SourceIp or any other Condition — was simply unreachable. Wire the four endpoints to the credential-store methods that were already in place (memory, postgres, filer_etc all implement GroupInlinePolicyStore). On every config reload, hydrateRuntimePolicies now also walks LoadGroupInlinePolicies, registers each doc in the IAM policy engine under __inline_group_policy__/<group>/<policy>, and appends that key to Group.PolicyNames so evaluateIAMPolicies picks it up through its existing group walk. PutGroupPolicy/DeleteGroupPolicy are added to the ReloadConfiguration trigger list in DoActions. Side fix: MemoryStore.LoadConfiguration now surfaces store.groups too. Without it iam.groups never repopulated on a memory-store reload, so group policy evaluation silently no-op'd whether the policy was inline or attached. The existing tests didn't notice because no test reloaded through cm after creating a group. The NotImplemented unit test is inverted to drive the new round-trip. 
* s3api: drop redundant refreshIAMConfiguration from Put/DeleteGroupPolicy DoActions already triggers ReloadConfiguration for both actions via the explicit reload list, so calling refreshIAMConfiguration inline runs the load twice per request. Per PR review. * s3api: scope group-policy resource names per test; tighten deny polling - Integration test resource names get a per-test suffix so retried or parallel CI jobs don't trip EntityAlreadyExists / BucketAlreadyExists. - Deny-path Eventually loops gate on AccessDenied via a typed helper rather than any non-nil error; transient setup errors no longer end the wait prematurely. - ListGroupPolicies returns ServiceFailure when the credential manager is nil, matching Put/Get/DeleteGroupPolicy. * test(s3 iam): cover both IPv4 and IPv6 loopback in allow CIDRs CI runners with happy-eyeballs resolve `localhost` to ::1 first, in which case a 127.0.0.0/8-only allow would silently never match and the deny-driven enforcement test would hang for the allow case. Add ::1/128 to every loopback-matching policy so the allow path works regardless of which loopback family the SDK lands on.	2026-05-19 16:03:45 -07:00
Chris Lu	77ac781bbd	fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers (#9568 ) * fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers When a volume server holds EC shards for the same vid across more than one disk, each DiskLocation registers its own EcVolume entry and Store.FindEcVolume returns whichever one it hits first. The shard-info RPC iterated only that single EcVolume's Shards, so the response missed every shard mounted on a sibling disk. The worker's verifyEcShardsBeforeDelete sums the per-server responses into a union bitmap and refuses to delete the source volume when the union falls short of dataShards+parityShards. On multi-disk destinations, the union was systematically under-counted and source deletion got blocked even though all shards were physically present and mounted. Walk every DiskLocation in the handler and emit the deduplicated union of all shards. The .ecx-backed fields (file counts, volume size) still come from a single EcVolume since every disk's entry opens the same .ecx via NewEcVolume's cross-disk fallback. Tests: - TestVolumeEcShardsInfo_AggregatesAcrossDisks unit test in weed/server/. - test/volume_server/grpc/ec_verify_multi_disk_test.go integration test drives the full generate -> mount -> redistribute -> restart -> reconcile path and asserts both VolumeEcShardsInfo and VerifyShardsAcrossServers + RequireFullShardSet (the production source-deletion gate) report all 14 shards. - ec_multi_disk_lifecycle_test.go tightened: replaces the "VolumeEcShardsInfo only sees one disk's EcVolume" workaround with a full-shard-set assertion. * review: use ShardBits bitmask + cap-pre-allocation for shard dedup	2026-05-19 14:58:56 -07:00
Chris Lu	f72983c1fd	fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" (#9566 ) * fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" The S3 Tables REST endpoints share top-level paths with the regular S3 API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for GetTable). They are registered first on the same router as the bucket subrouter, so a path-style request such as GET /buckets?list-type=2 on a bucket actually named "buckets" matched ListTableBuckets and returned JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with "Unexpected character '{' (code 123) in prolog". Disambiguate by requiring the AWS V4 credential scope to name the s3tables service on the colliding routes. Regular S3 SDKs sign with service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is present in both the Authorization header and the X-Amz-Credential query parameter for presigned URLs, so the matcher works for both flavors. ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.) already cannot collide because colons are not valid in bucket names, so they are left untouched. * fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal The Iceberg catalog integration tests send unsigned PUT /buckets with Content-Type: application/x-amz-json-1.1 to create table buckets. With only the credential-scope check, those requests fell through to the regular S3 CreateBucket handler and the suite went red on this branch. Extend the matcher so a request is recognized as S3 Tables when either: - its AWS V4 credential scope names SERVICE=s3tables; or - it carries the canonical AWS JSON RPC 1.1 content type and is unsigned (a request explicitly signed for SERVICE=s3 still wins). The regular S3 SDKs do not send application/x-amz-json-1.1, so the signal is safe for the colliding paths (/buckets, /get-table). 
Also add an AWS SDK V2 for Go integration test under test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer against a bucket literally named "buckets" and "get-table" — the SDK errors before the test asserts if the server returns the wrong body shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml, mirroring the etag/acl workflow. * s3api: extend service matcher to all S3 Tables routes; simplify scope check - Apply serviceMatcher to every S3 Tables route, not just the bare-path ones. ARN-bearing paths could otherwise be hit by an S3 object key that starts with arn:aws:s3tables:..., inside a bucket named "buckets", "namespaces", "tables", or "tag". One matcher everywhere closes both collision classes. - Replace strings.Split + index lookup with strings.Contains for the credential-scope check. The scope shape is fixed at AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components, and access keys are alphanumeric — so /s3tables/ matches iff SERVICE is exactly s3tables. Existing unit cases (including the access-key-substring case) still pass. - Read the GetObject body in the SDK v2 routing test with io.ReadAll; the single Read could return short and make the equality check flaky. * s3api: drop content-type fallback; sign s3 tables harness traffic instead The content-type fallback in isS3TablesSignedRequest let an anonymous regular-S3 request whose body type is application/x-amz-json-1.1 hit an S3 Tables route when the path-style object key happened to be shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets" with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher back to the AWS V4 credential scope so only requests signed for SERVICE=s3tables match the S3 Tables routes. Update the Iceberg catalog test harness — the only caller still sending unsigned PUT /buckets — to sign with SERVICE=s3tables. 
The mini instance runs in default-allow mode, so the signature itself is not verified; only the credential scope matters for the route match. Drop the stale unit cases for the JSON-RPC content-type signal and the routing test that exercised unsigned harness traffic.	2026-05-19 14:24:25 -07:00
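The credential-scope check described in f72983c1fd can be sketched in Go as below. This is a minimal illustration, not the SeaweedFS matcher: the names scopeFromAuthorization and isS3TablesScope and the sample access key are hypothetical. Only the AK/DATE/REGION/SERVICE/aws4_request scope shape and the /s3tables/ substring test come from the commit message. For presigned URLs the X-Amz-Credential query value is assumed to already be the scope string.

```go
package main

import (
	"fmt"
	"strings"
)

// scopeFromAuthorization pulls the "AK/DATE/REGION/SERVICE/aws4_request"
// string out of an AWS V4 Authorization header value.
// Hypothetical helper for illustration; returns "" when no Credential= is present.
func scopeFromAuthorization(header string) string {
	const marker = "Credential="
	i := strings.Index(header, marker)
	if i < 0 {
		return ""
	}
	rest := header[i+len(marker):]
	// The scope ends at the next comma or space.
	if j := strings.IndexAny(rest, ", "); j >= 0 {
		rest = rest[:j]
	}
	return rest
}

// isS3TablesScope reports whether the scope names the s3tables service.
// Slashes only delimit scope components, so "/s3tables/" can only match
// a whole component that follows the access key.
func isS3TablesScope(scope string) bool {
	return strings.Contains(scope, "/s3tables/")
}

func main() {
	auth := "AWS4-HMAC-SHA256 Credential=AKIDEXAMPLE/20260519/us-east-1/s3tables/aws4_request, " +
		"SignedHeaders=host;x-amz-date, Signature=abc"
	fmt.Println(isS3TablesScope(scopeFromAuthorization(auth)))
	fmt.Println(isS3TablesScope("AKIDEXAMPLE/20260519/us-east-1/s3/aws4_request"))
}
```

An access key that merely contains the text "s3tables" does not match, because the key is the first component and has no slash in front of it.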
Chris Lu	cfc08fbf6c	fix(volume): tombstone integrity check no longer flips volumes read-only (fixes #9563 ) (#9565 ) * fix(volume): pass on-disk tombstone size to ReadData in verifyDeletedNeedleIntegrity verifyDeletedNeedleIntegrity was forwarding TombstoneFileSize (-1) into Needle.ReadData. A deletion tombstone is appended to .dat with DataSize=0 so the on-disk needle header carries Size=0; TombstoneFileSize is only the .idx sentinel for "this entry is deleted" and is never written into a needle header. ReadBytes' size check therefore mismatched on every tombstone (-1 != 0), returned ErrorSizeMismatch, and triggered the 4-byte-offset wrap-around retry in ReadData (offset + 32 GB). On any volume large enough that offset+32 GB exceeds dat fileSize the retry read EOF, CheckVolumeDataIntegrity reported corruption, and the loader set noWriteOrDelete = true. Every volume whose last 10 .idx entries included a deletion went read-only on startup — i.e. any healthy volume where the most recent operations included a delete. Pass Size(0) so the size check matches the on-disk tombstone header. Add a regression test that writes three needles, deletes one, and asserts CheckVolumeDataIntegrity succeeds with a tombstone at the .idx tail. Without this fix the test reproduces the exact log shape from the bug report: read 0 dataSize 32 offset <orig+32GB> fileSize <much smaller>: EOF verifyDeletedNeedleIntegrity ...idx failed: read data [N,N+32) : EOF The Rust port guards its integrity-check size comparison with !size.is_deleted() (seaweed-volume/src/storage/volume.rs) and never hits this path, so no Rust mirror change is needed. * test(seaweed-volume): mirror Go regression for deletion-tombstone integrity The Rust integrity check already guards its size-mismatch comparison with !size.is_deleted() (volume.rs:1859) and reads tombstone AppendAtNs with body_size=0, so the Go regression fixed in the previous commit does not apply. 
Lock that guarantee in with a parallel reload test: write three needles, delete one, sync, reopen via Volume::new, assert the volume is not flipped read-only. Catches any future change that removes the deleted-entry guard or re-introduces a size-strict path in check_volume_data_integrity for tombstones. * fix(volume): propagate io.EOF and ErrorSizeMismatch from verifyDeletedNeedleIntegrity CheckVolumeDataIntegrity relies on identity comparison against io.EOF and ErrorSizeMismatch to walk back through the last ten .idx entries and tolerate a partial truncation at the tail (the "fix and continue" loop). The live-needle branch in doCheckAndFixVolumeData already returns those sentinels unwrapped; the deletion branch wrapped them in fmt.Errorf, so a genuine .dat truncation past a tombstone offset broke the recovery and flipped the volume read-only. Mirror the live-needle handling: both verifyDeletedNeedleIntegrity and doCheckAndFixVolumeData now short-circuit on io.EOF / ErrorSizeMismatch and pass them through unwrapped. Other errors keep their existing context wrapping. Also tighten the regression test to capture lastAppendAtNs and assert it's non-zero, so a future regression that skips the tombstone body (and therefore never populates AppendAtNs) is caught even when the err check still passes.	2026-05-19 13:11:19 -07:00
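Why wrapping broke the identity comparison in cfc08fbf6c can be shown with a small Go sketch. The function names and errSizeMismatch are hypothetical stand-ins, not the SeaweedFS code. The sketch only assumes, as the commit message states, that the caller compares against io.EOF and ErrorSizeMismatch with ==.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errSizeMismatch is a stand-in for the volume package's ErrorSizeMismatch.
var errSizeMismatch = errors.New("size mismatch")

// wrapAlways mimics the old deletion branch: every error gets context added
// via fmt.Errorf, so the caller receives a new error value.
func wrapAlways(err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("verify deleted needle: %v", err)
}

// passSentinels mimics the fixed behavior: the two sentinels pass through
// unwrapped, everything else keeps its context wrapping.
func passSentinels(err error) error {
	if err == nil || err == io.EOF || err == errSizeMismatch {
		return err
	}
	return fmt.Errorf("verify deleted needle: %w", err)
}

// sentinelSurvives reports whether io.EOF is still identical (==) after f.
func sentinelSurvives(f func(error) error) bool {
	return f(io.EOF) == io.EOF
}

func main() {
	fmt.Println("wrapAlways keeps io.EOF identity:", sentinelSurvives(wrapAlways))
	fmt.Println("passSentinels keeps io.EOF identity:", sentinelSurvives(passSentinels))
}
```

A caller that used errors.Is together with %w wrapping would also recognize the sentinel. The sketch follows the commit's approach of returning it unwrapped.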
Chris Lu	d57de6dc20	fix(s3): keep anonymous access working with EnableIam default (fixes #9557 ) (#9567 ) fix(s3): keep anonymous access working with EnableIam default `docker run seaweedfs` (and `weed mini` with no config) start with EnableIam=true but no IAM config file and no identities. The advanced-IAM init path was failing in 4.25 because of the missing STS signing key, which masked a latent bug: SetIAMIntegration unconditionally flipped isAuthEnabled to true, and isEnabled() also treated a non-nil iamIntegration as auth-on. Once the mini SSE-S3 KEK landed in 4.26 the STS fallback started succeeding, the integration got installed end to end, and every anonymous S3 request bounced as AccessDenied. Separate the two concerns: SetIAMIntegration just plumbs in the OIDC / embedded-IAM machinery, and a new EnableAuthEnforcement opts in to enforcement. The startup path calls it only when -s3.iam.config is actually provided, so operators with explicit IAM configs still get auth (preserves #7726). isEnabled() now reads isAuthEnabled only.	2026-05-19 13:03:30 -07:00
Peter Dodd	4476cb282b	feat(filer): add atime to FuseAttributes + TouchAccessTime RPC (#9556 ) * feat(filer): add atime field and TouchAccessTime RPC to filer proto Introduce POSIX-style access-time tracking on the filer: - FuseAttributes gains atime (field 22) and atime_ns (field 23). - New TouchAccessTime RPC (and Touch{Access,Time}{Request,Response}) lets read paths bump atime without going through UpdateEntry's chunk-rewrite/EqualEntry short-circuit. Additive proto changes only; zero atime is treated as unset and existing clients are unaffected. Java client proto is kept in lock step. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(filer): wire Atime through Attr codec with mtime fallback Add Attr.Atime and round-trip it through EntryAttributeToPb / EntryAttributeToExistingPb / PbToEntryAttribute. A zero proto atime decodes as Mtime, so legacy entries report a sensible value and freshly-created/updated entries default Atime to Mtime when callers do not set it explicitly. CreateEntry and UpdateEntry stamp Atime = Mtime (or Crtime) when it is zero. TouchAccessTime later bypasses this path to write atime alone via Store.UpdateEntry. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(filer): preserve atime in first epoch second on decode The Atime decode branch previously treated any attr.Atime == 0 as unset and overwrote it with Mtime, which drops valid timestamps in the first second of the unix epoch where attr.Atime is 0 but attr.AtimeNs > 0. Check both fields so we only fall back to Mtime when both are zero. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-19 10:22:17 -07:00
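The decode rule from the last commit in 4476cb282b (fall back to Mtime only when both atime fields are zero) might look roughly like this in Go. The struct pbAttrs, its field types, and decodeAtime are hypothetical. The real FuseAttributes message and Attr codec may differ.

```go
package main

import (
	"fmt"
	"time"
)

// pbAttrs is a hypothetical stand-in for the relevant FuseAttributes fields.
type pbAttrs struct {
	Mtime   int64 // seconds since the unix epoch
	MtimeNs int32
	Atime   int64 // seconds since the unix epoch; 0 may be unset or a real value
	AtimeNs int32
}

// decodeAtime treats atime as unset only when both the seconds and the
// nanoseconds are zero, so timestamps inside the first epoch second survive.
func decodeAtime(a pbAttrs) time.Time {
	if a.Atime == 0 && a.AtimeNs == 0 {
		return time.Unix(a.Mtime, int64(a.MtimeNs))
	}
	return time.Unix(a.Atime, int64(a.AtimeNs))
}

func main() {
	legacy := pbAttrs{Mtime: 1700000000}
	early := pbAttrs{Mtime: 1700000000, AtimeNs: 500}
	fmt.Println(decodeAtime(legacy).Unix())    // falls back to Mtime
	fmt.Println(decodeAtime(early).UnixNano()) // keeps the sub-second atime
}
```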
Chris Lu	b63610cf8f	volume: accept legacy needle CRC encoding on read (#9564 ) Volumes written by versions before 3.09 (commit `056c480eb`) store the needle checksum using the deprecated CRC.Value() transform. When the read path moved into readNeedleTail, the fallback that accepts both encodings was dropped, so .dat files copied from old installs now fail verification with "invalid CRC ... data on disk corrupted" even though the data is intact. Restore the dual check, matching the surviving fallback in volume_read.go.	2026-05-19 09:58:47 -07:00
Chris Lu	c61d227613	s3api: verify source permission on CopyObject and UploadPartCopy (#9555 ) * s3api: verify source permission on CopyObject and UploadPartCopy The Auth middleware only authorized the destination because routes key on the request URL. The source from X-Amz-Copy-Source was never evaluated, so an STS session token scoped to one prefix could copy from any other prefix in the same bucket. Add AuthorizeCopySource on IdentityAccessManagement to run the full bucket-policy + IAM/identity flow against the source, using a synthetic GetObject request so action resolution lands on s3:GetObject (or s3:GetObjectVersion when a source versionId is supplied). Both CopyObjectHandler and CopyObjectPartHandler now invoke it before reading the source. * s3api: preserve presigned-URL session token on copy-source check Presigned CopyObject / UploadPartCopy requests carry the STS session token in the query string (X-Amz-Security-Token), not in a header. Rebuilding the synthetic source URL from scratch dropped that token, so the source authorization would fall through to non-STS paths and miss session policy enforcement. Forward X-Amz-Security-Token from the original query (alongside versionId), still excluding unrelated params like uploadId/partNumber that would steer ResolveS3Action away from s3:GetObject.	2026-05-18 21:35:53 -07:00
Chris Lu	7c252e1f16	fix(volume): reopen .idx writable after MarkVolumeWritable (fixes #9515 ) (#9526 ) * fix(volume): reopen .idx writable after MarkVolumeWritable When .vif has ReadOnly=true, load() opens .idx as O_RDONLY and builds a SortedFileNeedleMap whose Put returns os.ErrInvalid. MarkVolumeWritable only flipped noWriteOrDelete back to false and rewrote .vif, so writes still failed at v.nm.Put. Reopen .idx in O_RDWR and rebuild v.nm in its writable form (in-memory or leveldb small/medium/large) before flipping the flag. Mirror the same fix in seaweed-volume: the Rust load path leaves CompactNeedleMap/RedbNeedleMap with no idx_file writer when the volume boots read-only, so post-MarkVolumeWritable puts silently succeeded in-memory only and were lost on the next restart. set_writable now reattaches an append-mode writer when one is missing. * fix(volume): keep old needle map until replacement is built; defer writable flag Go: build the writable needle map into a local before swapping. A construction failure now leaves v.nm pointing at the original SortedFileNeedleMap so MarkVolumeWritable can roll back, instead of stranding the volume with v.nm == nil. Rust: attach the .idx writer before flipping no_write_or_delete to false. A transient open/metadata failure used to leave the volume marked writable with no writer attached, and subsequent puts would silently skip the on-disk append.	2026-05-18 20:51:04 -07:00
Chris Lu	7c5296dfb1	fix(admin): switch file browser upload/download to filer gRPC + volume HTTP (#9538 ) * fix(admin): switch file browser upload/download to filer gRPC + volume HTTP The admin file browser proxied uploads and downloads through the filer's HTTP listener, so the whole feature 404'd against filers started with -disableHttp=true even though S3 still worked on its own port. Re-route through the filer gRPC service: LookupDirectoryEntry + StreamContent for reads (chunks flow straight from the volume servers), AssignVolume + volume HTTP POST + CreateEntry for writes. Volume read tokens come from jwt.signing.read.key when configured; the old jwt.filer_signing tokens no longer apply since the filer HTTP surface is bypassed. * admin file browser: propagate request context + track response writes Pass r.Context() into uploadFileToFiler so a client disconnect cancels the in-flight chunked upload instead of letting it run to completion against the volume servers. For DownloadFile, replace the Content-Type probe with a small response-writer wrapper that records whether headers or bytes have actually been sent, so the error path can't silently convert a pre-stream failure into a partial response if future code moves the header-setting around.	2026-05-18 20:33:16 -07:00
Chris Lu	58c3fa802c	fix(s3): keep host-less bucket catch-all so reverse proxies work (#9540 ) When s3.domainName is set, all bucket-prefix routes were gated on a matching Host header. Requests that arrive via an IP, an unlisted hostname, or a reverse proxy that rewrites Host hit no router and bounce back as 405/404 (and 503 once a proxy maps the upstream error). Register the path-style catch-all unconditionally, after the host-specific routers, so it only fires when no Host matcher applies.	2026-05-18 19:44:19 -07:00
dependabot[bot]	d3f80444df	build(deps): bump github.com/cognusion/imaging from 1.0.2 to 1.0.3 (#9552 ) Bumps [github.com/cognusion/imaging](https://github.com/cognusion/imaging) from 1.0.2 to 1.0.3. - [Commits](https://github.com/cognusion/imaging/compare/v1.0.2...v1.0.3) --- updated-dependencies: - dependency-name: github.com/cognusion/imaging dependency-version: 1.0.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-18 19:43:33 -07:00
Chris Lu	0dc65e7069	fix(admin.plugin): include disk_id in EC execution plan (#9547 ) TaskSource and TaskTarget carry disk_id on the wire, but the execution plan map built for the admin UI dropped the field entirely. On a multi-disk node holding shards of the same volume, there was no way to tell from the plan which disk would receive each shard. Include disk_id on each endpoint and target_disk_id on each shard assignment, and extend the existing execution-plan test to set and assert the field.	2026-05-18 19:43:18 -07:00
ᎠᎡ. Ѕϵrgϵ Ѵictor	18c6c24e47	Revise MinIO comparison in README for accuracy (#9548 ) Updated the README to reflect the current status of MinIO, noting its ceased development and security concerns, along with changes in the descriptions of its features compared to SeaweedFS.	2026-05-18 19:32:54 -07:00
dependabot[bot]	120901c883	build(deps): bump github.com/parquet-go/parquet-go from 0.28.0 to 0.30.1 (#9549 ) Bumps [github.com/parquet-go/parquet-go](https://github.com/parquet-go/parquet-go) from 0.28.0 to 0.30.1. - [Release notes](https://github.com/parquet-go/parquet-go/releases) - [Changelog](https://github.com/parquet-go/parquet-go/blob/main/CHANGELOG.md) - [Commits](https://github.com/parquet-go/parquet-go/compare/v0.28.0...v0.30.1) --- updated-dependencies: - dependency-name: github.com/parquet-go/parquet-go dependency-version: 0.30.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-18 19:28:42 -07:00
dependabot[bot]	a79880ed41	build(deps): bump github.com/redis/go-redis/v9 from 9.18.0 to 9.19.0 (#9550 ) Bumps [github.com/redis/go-redis/v9](https://github.com/redis/go-redis) from 9.18.0 to 9.19.0. - [Release notes](https://github.com/redis/go-redis/releases) - [Changelog](https://github.com/redis/go-redis/blob/master/RELEASE-NOTES.md) - [Commits](https://github.com/redis/go-redis/compare/v9.18.0...v9.19.0) --- updated-dependencies: - dependency-name: github.com/redis/go-redis/v9 dependency-version: 9.19.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-18 19:28:32 -07:00
dependabot[bot]	f5aa776742	build(deps): bump github.com/Azure/azure-sdk-for-go/sdk/storage/azblob from 1.6.4 to 1.7.0 (#9551 ) build(deps): bump github.com/Azure/azure-sdk-for-go/sdk/storage/azblob Bumps [github.com/Azure/azure-sdk-for-go/sdk/storage/azblob](https://github.com/Azure/azure-sdk-for-go) from 1.6.4 to 1.7.0. - [Release notes](https://github.com/Azure/azure-sdk-for-go/releases) - [Commits](https://github.com/Azure/azure-sdk-for-go/compare/sdk/storage/azblob/v1.6.4...sdk/azcore/v1.7.0) --- updated-dependencies: - dependency-name: github.com/Azure/azure-sdk-for-go/sdk/storage/azblob dependency-version: 1.7.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-18 19:28:24 -07:00
dependabot[bot]	f3d6633aac	build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.99.0 to 1.101.0 (#9553 ) build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 Bumps [github.com/aws/aws-sdk-go-v2/service/s3](https://github.com/aws/aws-sdk-go-v2) from 1.99.0 to 1.101.0. - [Release notes](https://github.com/aws/aws-sdk-go-v2/releases) - [Commits](https://github.com/aws/aws-sdk-go-v2/compare/service/s3/v1.99.0...service/s3/v1.101.0) --- updated-dependencies: - dependency-name: github.com/aws/aws-sdk-go-v2/service/s3 dependency-version: 1.101.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-18 19:28:06 -07:00
Chris Lu	68794fb94c	fix(ec_distribute): remove partial files on copy stream error (#9543 ) * fix(ec_distribute): remove partial files on copy stream error writeToFile opens the destination with O_TRUNC and streams into it. On a mid-stream receive / write / cancellation error it returned the failure but left the destination behind in whatever state had been written so far — typically 0 bytes when the source errored before sending any FileContent. VolumeEcShardsCopy distributes .ecx by calling doCopyFile, so this same stub-leaving behaviour produced the 0-byte .ecx files seen on EC encoding failures: the source claims a non-zero ModifiedTsNs (so the existing "source not found" cleanup doesn't fire), the stream then errors immediately, and the receiver ends up with a 0-byte .ecx that downstream code mistook for a valid empty index. Clean up the partial file on every error path that returns from the streaming loop (receive, write, and cancellation). Skip cleanup when isAppend=true so resumable appends keep their existing content. As defense in depth, VolumeEcShardsCopy also stats the .ecx after copy and removes / errors on a 0-byte result so the orchestrator can pick a different source. The Rust volume server has only the source side of CopyFile (no client-side stream-to-disk consumer) and no .ecx subsystem yet, so this fix has no Rust mirror. * fix(ec_distribute): close file before remove, fail fast on stat error Address review feedback: - writeToFile's mid-stream removeIncomplete called os.Remove while the destination file handle was still open. On Windows os.Remove fails while a handle is open, so the cleanup wouldn't run there. Wrap the handle close in a once-only helper, call it from removeIncomplete and from the existing "source not found" cleanup, and keep a deferred close as the safety net for the normal-return path. 
- VolumeEcShardsCopy's post-copy .ecx check silently passed when os.Stat returned an error: doCopyFile had reported success but if the file was already gone, unreadable, or somehow a directory, the orchestrator only learned at mount time with no useful context. Treat any non-nil stat error and any directory result as a copy failure here and surface it immediately.	2026-05-18 15:19:51 -07:00
Chris Lu	af8d4e00ee	fix(ec_mount): reject 0-byte .ecx and aggregate cross-disk failures (#9542 ) * fix(ec_mount): reject 0-byte .ecx and aggregate cross-disk failures MountEcShards's per-disk loop bailed on the first disk returning a non-ENOENT error, and NewEcVolume wrapped its ENOENT with %v so the caller's `err == os.ErrNotExist` check never matched. On a multi-disk volume server where ec.balance / ec.rebuild had distributed shards across sibling disks while the matching .ecx never arrived, the mount loop bailed after disk 0 with "cannot open ec volume index" and the operator never saw that the rest of the disks were also empty. The companion failure mode is a 0-byte .ecx stub left by EC distribute's writeToFile after a mid-stream copy failure: Stat() succeeds, treating the stub as a valid index, and downstream mount work proceeds against an empty file. Wrap the ec-volume open errors with %w, treat a 0-byte .ecx as os.ErrNotExist (in NewEcVolume, findEcxIdxDirForVolume, and HasEcxFileOnDisk), and have MountEcShards collect per-disk failures before returning a single aggregated error. The "no .ecx anywhere" case gets a distinct error so the orchestrator can re-copy the index from a healthy replica rather than retry against the same broken state. * fix(ec_reconcile): indexEcxOwners also rejects 0-byte .ecx stubs findEcxIdxDirForVolume already skipped 0-byte .ecx during MountEcShards, but indexEcxOwners (used by reconcileEcShardsAcrossDisks at startup) still recorded the first .ecx by name only. On a store where one disk holds a 0-byte stub left by a failed EC distribute and a sibling disk holds the real index, the stub would win the owner selection — and NewEcVolume's new size check would then refuse to load against it, leaving the orphan shards unloaded even though a valid index exists. Mirror the size check from findEcxIdxDirForVolume: skip directory entries whose .ecx Info() reports size 0 or whose Info() call fails. 
* fix(ec_mount): accept 0-byte .ecx as valid empty index The previous commit treated a 0-byte .ecx in NewEcVolume as os.ErrNotExist, on the assumption that any empty .ecx was a stub left by a failed copy stream. That broke the legitimate empty-volume case: when an EC volume's source .idx has no live entries (e.g. all needles deleted before WriteSortedFileFromIdx), the sorted .ecx is genuinely 0 bytes and must mount. The integration test TestEcShardsToVolumeMissingShardAndNoLiveEntries fails with "MountEcShards: no .ecx index found on any local disk" because the mount path now refuses the legitimate empty index. A 0-byte .ecx left by a failed copy stream is indistinguishable from the legitimate empty case by file size alone. Preventing stub files from being written is the receiver-side cleanup in writeToFile's job (the companion EC distribute PR), not NewEcVolume's at mount time. The cross-disk lookup helpers (findEcxIdxDirForVolume, HasEcxFileOnDisk, indexEcxOwners) keep their size > 0 preference: when a real .ecx exists on a sibling disk alongside a stub, we still want to route mounts and reconcile at the real one. If no non-zero .ecx exists anywhere, the per-disk fallback in MountEcShards can still open the 0-byte .ecx and the volume mounts. Replace TestMountEcShards_ZeroByteEcxOnlyDisk with TestMountEcShards_EmptyEcxMountsSuccessfully, which pins the empty-volume invariant.	2026-05-18 15:00:33 -07:00
Chris Lu	41b6ad002b	fix(volume.list): show one entry per physical disk on multi-disk nodes (#9541 ) * fix(volume.list): show one entry per physical disk on multi-disk nodes DataNodeInfo.DiskInfos is keyed by disk type, so several same-type physical disks on one node collapse to a single map entry at the master. volume.list iterated that map directly and reported one "Disk hdd ... id:0" line per node, hiding the per-disk volume and shard layout. EC operators on multi-disk volume servers had no way to verify which physical disk a shard landed on. Lift the per-physical-disk split into a DiskInfo.SplitByPhysicalDisk() method on the proto type so consumers outside admin/topology can use it. Apply it in writeDataNodeInfo so the verbose Disk block shows one entry per physical disk, ordered by DiskId. Capacity counters are split evenly across reconstructed disks since the wire format doesn't carry per-disk capacity yet. This is a display-only change. ActiveTopology already did the split on its own and is now updated to call the shared helper. * fix(volume.list): preserve totals, count active/remote exactly, dedupe header Address review feedback on the per-physical-disk split: - share() truncated remainders so reconstructed per-disk counters could sum to less than the original aggregate (10 / 3 = 3+3+3). Distribute the remainder to the lowest disk ids so MaxVolumeCount and FreeVolumeCount sum exactly back to the node totals. - ActiveVolumeCount and RemoteVolumeCount are derivable per disk from the VolumeInfos already grouped by DiskId, so count them exactly (ReadOnly=false and RemoteStorageName!="" respectively) instead of approximating with an even split. - writeDataNodeInfo's per-disk callback fired the DataNode header on every iteration after the split, so a node with 6 physical disks emitted 6 DataNode headers. Guard the callback with headerPrinted so the header still appears at most once per node. 
- Sort split disks deterministically using explicit DiskId comparison to avoid int overflow risk on 32-bit systems. - Tighten the volume.list test substring to "id:N\n" so unrelated tokens like "ec volume id:101" don't accidentally match the id:1 needle, and assert the rack callback fires once.	2026-05-18 14:43:44 -07:00
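The remainder handling described in the review-feedback commit above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `splitEvenly` is hypothetical, and it assumes a non-negative total and a positive disk count.

```go
package main

import "fmt"

// splitEvenly divides total across n disks. Plain integer division would
// drop the remainder (10 / 3 = 3+3+3), so the remainder is handed out one
// unit at a time to the lowest disk indexes, and the parts sum back to total.
// Hypothetical helper for illustration; assumes total >= 0.
func splitEvenly(total int64, n int) []int64 {
	if n <= 0 {
		return nil
	}
	base := total / int64(n)
	rem := total % int64(n)
	out := make([]int64, n)
	for i := range out {
		out[i] = base
		if int64(i) < rem {
			out[i]++
		}
	}
	return out
}

func main() {
	fmt.Println(splitEvenly(10, 3)) // prints [4 3 3]
}
```

Giving the extra units to the lowest indexes keeps the output deterministic, which matters when the result is printed per disk in DiskId order.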
Chris Lu	a761441926	fix(test): reserve mini ports on all interfaces; bound risingwave cleanup shell (#9545 ) The 127.0.0.1-only reservation in AllocateMiniPorts/AllocatePortSet let another process hold the gRPC port on a different interface, so weed mini's isPortAvailable check failed and it shifted master.grpc. weed shell -master=<HTTP> still derives grpc as HTTP+10000 and dialed the unused port, hanging until the 30s context deadline killed it. Bind the reservation listeners on :port to match mini's check. Also bound listFilerContents in catalog_risingwave with a 30s exec.CommandContext so a hung weed shell during failure-cleanup can't burn the 20-minute test budget.	2026-05-18 14:16:22 -07:00
Chris Lu	37e6263efe	fix(shell): attach admin JWT for filer IAM gRPC calls (#9536 ) When jwt.filer_signing.key is set, the filer's IamGrpcServer requires a Bearer token on every IAM RPC. The shell's s3.* IAM commands dialed without that header and failed with Unauthenticated. Route them through a small helper that mints a token from the same key viper-loaded from security.toml and appends it as outgoing metadata, matching the credential grpc_store pattern.	2026-05-18 13:42:32 -07:00
Chris Lu	3d872a1416	fix(filer): load -s3.config static identities into the filer's CredentialManager (#9537 ) When weed filer started its embedded S3 gateway with -s3 -s3.config, only the S3 server loaded the s3.json static identities — the filer's own CredentialManager stayed empty, so the IAM gRPC service backing the admin UI and weed shell returned only dynamic users. Mirror the wiring weed server already does and hand the same config path to the filer.	2026-05-18 13:41:30 -07:00
Chris Lu	4d04609bb8	fix(mount): don't release file handles from FUSE Forget (#9529 ) fix(mount): don't release file handles from Forget Forget(nodeid, nlookup) only decrements the kernel inode lookup count. File handle lifecycle belongs to FUSE Open/Release. Driving the FH refcount from Forget coupled two unrelated counters and could tear down a still-live handle if Forget ever raced ahead of Release. Drop the ReleaseByInode call (and the now-unused method).	2026-05-18 01:02:58 -07:00
Chris Lu	01b3e4a71c	template 4.26	2026-05-17 23:12:04 -07:00
Chris Lu	6cab199400	fix(iceberg): dial filer gRPC address verbatim in plugin worker (#9527 ) * fix(iceberg): dial filer gRPC address verbatim in plugin worker dialFiler was running its address argument through pb.ServerAddress.ToGrpcAddress, whose single-port fallback adds +10000 to any host:port — so when the admin forwards ClusterContext.FilerGrpcAddresses (already host:grpcPort) to the worker, the iceberg handler turns the real gRPC port (e.g. 18888) into a non-existent 28888 and dispatched jobs fail with connection refused. Drop the conversion; the address is already dialable. Tests that produced fake filer addresses in dual-port form now return host:grpcPort to match the new contract. * test(ec): use renamed detection_interval_minutes field The admin_runtime.detection_interval_seconds field was renamed to detection_interval_minutes back in May. This integration test was not updated, so the unknown JSON field was silently ignored and the scheduler fell back to the default detection interval (17 min for erasure_coding), which exceeds the test's 5-minute wait and times out. Switch to detection_interval_minutes: 1 — local run completes in ~120s.	2026-05-17 23:03:00 -07:00
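The double-offset failure described in the iceberg commit above can be reproduced with a simplified model of the address conversion. The function `toGrpcAddress` below is a hypothetical stand-in written for this illustration, not the real `pb.ServerAddress.ToGrpcAddress`; it assumes only the two behaviors the commit messages describe (a `host:httpPort.grpcPort` form, and a single-port fallback that adds 10000) and ignores IPv6 literals.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toGrpcAddress is a simplified, hypothetical model of the conversion:
//   - "host:http.grpc" yields "host:grpc"
//   - "host:port" yields "host:(port+10000)"
func toGrpcAddress(addr string) (string, error) {
	i := strings.LastIndexByte(addr, ':')
	if i < 0 {
		return "", fmt.Errorf("missing port in %q", addr)
	}
	host, ports := addr[:i], addr[i+1:]
	if dot := strings.IndexByte(ports, '.'); dot >= 0 {
		return host + ":" + ports[dot+1:], nil
	}
	p, err := strconv.Atoi(ports)
	if err != nil {
		return "", fmt.Errorf("bad port in %q: %w", addr, err)
	}
	return host + ":" + strconv.Itoa(p+10000), nil
}

func main() {
	// Dual-port form converts cleanly.
	a, _ := toGrpcAddress("filer:8888.18888")
	fmt.Println(a) // prints filer:18888

	// An address that is already host:grpcPort gets the offset added again.
	b, _ := toGrpcAddress("filer:18888")
	fmt.Println(b) // prints filer:28888
}
```

Under this model, an address that is already dialable must either be passed through untouched or be carried in the dual-port form, so that the conversion is applied exactly once.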
Chris Lu	136eb1b7c8	4.26	2026-05-17 21:05:25 -07:00
Chris Lu	c11ff6657b	fix(ec): mirror EC sidecars onto every shard-bearing disk at startup (#9525 ) * fix(ec): mirror EC sidecars onto every shard-bearing disk at startup In a multi-disk volume server, ec.balance and ec.rebuild can land shards on a disk that does not also hold the matching .ecx / .ecj / .vif index files. The orphan-shard reconciler in reconcileEcShardsAcrossDisks already loads those shards by pointing the EcVolume at the sibling disk's index files; reads work, but any failure on the index-owning disk silently disables every shard on the other disk, even though those shards are physically fine. This change adds mirrorEcMetadataToShardDisks, a startup pass that physically replicates .ecx / .ecj / .vif onto each disk that holds shards but is missing them. Each copy is atomic (tmp + fsync + rename) and idempotent (a destination that already has the sidecar is preserved). After mirroring, the cross-disk reconciler prefers the local IdxDirectory so the EcVolume mounts self-contained; the cross-disk virtual mount remains as a fallback for volumes whose mirror failed (read-only target, out of space, partial copy on a previous boot). The same-disk invariant the EC lifecycle (encode / decode / balance / vacuum / repair) was already documented as promising is now actually restored at boot, so a future failure of one disk in a split-shards layout no longer takes the other disk's shards with it. Tests cover the orphan-layout mirror (dir0 receives the .ecx / .ecj / .vif from dir1) and idempotency (an existing destination .ecx is not overwritten with the owner's copy). * fix(ec): handle legacy pre-dir.idx sidecar layout in mirror skip-check hasAllEcSidecarsLocally checked only the modern destination path (IdxDirectory for .ecx/.ecj, Directory for .vif). A destination disk that still had a legacy .ecx in its data dir (written before -dir.idx was set) would report "not present" and the mirror would write a second copy to IdxDirectory, leaving two .ecx files on disk. 
Matches HasEcxFileOnDisk's open-with-fallback contract: check the modern path first, then the opposite directory. Factored the exists-and-not-a-dir check into a small statRegular helper so the fallback ladder stays readable. * rust(seaweed-volume): mirror EC sidecars onto shard-bearing disks at startup Port of the Go fix (commit `088e26ea6`) to the Rust volume server. Adds Store::mirror_ec_metadata_to_shard_disks, called from add_location / load_new_volumes before the cross-disk orphan reconciler. Physically copies .ecx / .ecj / .vif from the disk that owns the index files onto every disk holding shards but missing sidecars, so each shard-bearing disk ends up self-contained. The reconciler now prefers the local idx_directory when the mirror has installed a .ecx there; the cross-disk virtual mount remains as the fallback for volumes whose mirror failed (read-only target, out of space, partial copy on a previous boot). Adds ec_local_ecx_path helper shared between reconcile and mirror to detect the post-mirror fast path. Mirrors the Go-side fallback in hasAllEcSidecarsLocally: when -dir.idx is configured and the destination still has a legacy .ecx in its data dir, that's recognized so the mirror does not write a duplicate copy into idx_directory. Tests cover the two key cases: orphan layout (dir0 receives the sidecars from dir1) and idempotency (a pre-existing destination .ecx is not overwritten). * trim verbose comments on EC mirror code Comments now lead with the WHY (non-obvious constraints, the post-mirror fast path, why local copies are authoritative) and drop restate-the-code blocks, headers, and section dividers. Behavior is unchanged; all existing tests still pass on both the Go volume server and the seaweed-volume Rust port. * drop github issue refs from added comments Two stray "#9212" references slipped into comments I added on the cross-disk reconciler call site. The git log carries the issue history; comments stand on their own. 
* test(ec): accept rebuild on either disk after sidecar mirror TestEcLifecycleAcrossMultipleDisks asserted the rebuilt shard 9 must land at the disk-0 path. With the boot-time sidecar mirror, every shard-bearing disk owns its own .ecx, so VolumeEcShardsRebuild now picks whichever disk hosts the most shards — disk 1 in this layout after the deletion. The shard can legitimately rebuild on either disk; the test now accepts both and uses the chosen path for the subsequent mount + read verification.	2026-05-17 19:55:15 -07:00
Chris Lu	6b94701213	mini: quieter startup with a docker-compose-style progress board (#9524 ) * mini: quieter startup with a docker-compose-style progress board Replaces noisy startup/shutdown logs with a single in-place progress table on a TTY (or one line per state change off-TTY). Each component renders as `pending -> starting -> ready` during startup and `stopping -> stopped` during shutdown, with elapsed time on transition. Also folds in a few cleanups uncovered while making this readable: - route the admin.go startup prints through glog so quietMiniLogs() filters them under mini but standalone weed admin still shows them - generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under the data folder so restarts reuse the same key - demote worker/master gRPC Recv 'context canceled' to V(1); those are the normal shutdown signal, not Errors/Warnings - drop the 'Optimized Settings' block and the 'credentials loaded from environment variables' message from the welcome banner - only show the credentials setup hints when no S3 identities exist (new s3api.HasAnyIdentity accessor backed by an atomic.Bool) - use S3_BUCKET in the credentials hint so it pairs with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - reorder running-services list to master / volume / filer / webdav / s3 / iceberg / admin * mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3 won't encrypt data under a KEK that the next restart can't reproduce (which would orphan whatever was written this run). The caller already treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM just stay disabled for this run. startAdminServer's serve goroutine used to only log ListenAndServe failures, so a bind error left the caller blocked on ctx.Done() with no listener. 
Forward the error through a buffered channel and select on it alongside ctx.Done(). * ci(s3-proxy-signature): match weed mini's new progress-board ready line The readiness probe grepped for "S3 (gateway\|service).*(started\|ready)", which matched weed mini's old "S3 service is ready at ..." line. Mini now emits " S3 ready (Xs)" from its progress board, so the old pattern misses and the test timed out at the 30-second wait. Widen the alternation to also accept "S3\s+ready". The curl HEAD fallback already covers any remaining cases.	2026-05-17 19:13:09 -07:00
Chris Lu	ff6f9fd90a	iam: honor configured credential store for IAM API policies and propagate to S3 caches (fixes #9518 ) (#9522 ) * iamapi: route managed policies through credential manager (fixes #9518) CreatePolicy via the IAM API wrote straight to the filer /etc/iam/policies.json, ignoring any non-filer credential store. When credential.postgres was configured, policies created via the IAM API landed only in the filer while the Admin UI wrote to postgres, producing a split-brain where ListPolicies/GetPolicy never saw the Admin UI's policies and vice versa. GetPolicies/PutPolicies on IamS3ApiConfigure now load managed policies from credentialManager and persist Create/Update/Delete as a delta against the store. Inline user/group policies still live in the legacy policies.json file (no credential-store API for them yet). Pre-existing managed policies in the legacy file are merged on read so deployments don't lose data, and re-persisted to the store on the next write so the legacy file is drained over time. * credential: route IAM API inline policies through credential manager Extends the #9518 fix to user-inline and group-inline policies so the IAM API never writes the legacy /etc/iam/policies.json bundle directly. The previous patch only routed managed policies; this one finishes the job for the other two policy types. - Add GroupInlinePolicyStore + GroupInlinePoliciesLoader optional interfaces, mirroring the existing user-inline ones, and matching Put/Get/Delete/List/LoadAll wrappers on CredentialManager. - Implement group-inline storage in memory (new map), filer_etc (new field on PoliciesCollection, reusing the legacy file under policyMu), and postgres (new group_inline_policies table with ON DELETE CASCADE off the groups FK). - Wire the new methods through PropagatingCredentialStore so wrapped stores still delegate correctly. 
- IamS3ApiConfigure.PutPolicies now applies managed + user-inline + group-inline as deltas through the credential manager; the legacy /etc/iam/policies.json file is never written when a credential manager is wired up. GetPolicies still reads the legacy bundle once as a fallback so unmigrated data is picked up and re-persisted into the store on the next write. * credential: propagate SaveConfiguration writes to running S3 caches Postgres (and any non-filer) credential stores never fired the S3 IAM cache invalidation path on bulk identity / group updates. The PropagatingCredentialStore had explicit Put/Remove handlers for single-entity calls (CreateUser, PutPolicy, etc.) but inherited SaveConfiguration unchanged from the embedded store, so the bulk path the IAM API takes at the end of every handler was silent. Inline-policy changes recompute identity.Actions and persist via SaveConfiguration, so until restart the cached Actions on each S3 server stayed stale and authorization decisions used the pre-change view. Override SaveConfiguration to snapshot the prior user / group lists, delegate the save, then fan out PutIdentity / PutGroup for what's in the new config and RemoveIdentity / RemoveGroup for what got pruned. Reuses the existing SeaweedS3IamCache RPCs, no protobuf changes. * iamapi: drain legacy policies.json after authoritative credential-store writes Review pointed out a resurrection bug: GetPolicies still reads /etc/iam/policies.json as a one-way migration fallback, but PutPolicies in the credential-manager path never wrote that file, so legacy-only entries reappeared on the next read even after the IAM API "deleted" them. PutPolicies now overwrites the bundle with an empty {} after a successful credential-store write, unless the store is filer_etc (which owns the bundle as its own inline-policy backing — clearing it would wipe filer_etc's data). Also wraps the filer read, JSON unmarshal, and marshal errors with context per the other review comments.	
2026-05-17 13:15:27 -07:00
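The SaveConfiguration override described in the commit above (snapshot the prior lists, save, then fan out puts and removals) reduces to a set difference over names. The sketch below shows only that step. `diffNames` is a hypothetical helper, and the real code works on identity and group objects rather than bare strings.

```go
package main

import "fmt"

// diffNames compares the names present before and after a bulk save.
// Everything in the new configuration is re-announced (put), and anything
// that was present before but is absent now is reported as removed.
// Hypothetical helper for illustration only.
func diffNames(before, after []string) (put, removed []string) {
	inAfter := make(map[string]bool, len(after))
	for _, name := range after {
		inAfter[name] = true
	}
	put = append(put, after...)
	for _, name := range before {
		if !inAfter[name] {
			removed = append(removed, name)
		}
	}
	return put, removed
}

func main() {
	put, removed := diffNames(
		[]string{"alice", "bob"},
		[]string{"alice", "carol"},
	)
	fmt.Println("put:", put, "removed:", removed)
}
```

Re-announcing every entry in the new configuration, rather than only the changed ones, trades a few redundant cache updates for not having to compare entry contents.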


13910 Commits