seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-07-26 10:03:13 +00:00

Author	SHA1	Message	Date
Chris Lu	d44bbbfbfc	docs(parquet-design): co-locate side indexes in .index/ next to data Reverse the earlier decision to place side indexes outside the table prefix. Iceberg, Spark, and Hadoop ecosystems all treat dot-prefixed paths as hidden by convention — table listings, snapshot expiration, and orphan-file cleanup skip them — so a .index/ directory next to the Parquet data is invisible to those tools while keeping each index under the same parent folder as the file it describes (so backup, replication, and lifecycle policies apply uniformly without separate plumbing). Two scopes: - file-scoped: <parquet_parent>/.index/<file_name>/<identity>/... - folder-scoped: <parquet_parent>/.index/<identity>/... Document a fallback: deployments whose readers do not respect dot- prefix conventions can either rename the prefix or relocate indexes to a separate filer mount.	2026-04-25 13:25:21 -07:00
Chris Lu	06b4e46f4f	docs(parquet-design): add dev plan with milestones and open questions Companion to PARQUET_PUSHDOWN_DESIGN.md. Maps the design's six phases to thirteen shippable milestones, with the architectural decisions needed to start coding (residency, wire protocol, trust mode, predicate engine scope, side-index path, library choices) called out as defaults that the open-questions section asks to confirm. No code changes — this commit only adds the plan document. M0 does not start until the open questions are resolved.	2026-04-25 01:58:49 -07:00
Chris Lu	89ed588b34	docs(parquet-design): document trust model and catalog validation modes The previous text said the server "treats this list as authoritative and does not re-read the catalog" without flagging that this is only safe when the caller is trusted. An untrusted client could: - omit position/equality delete files to make deleted rows visible - claim a snapshot id that does not match the data files listed - point at data files outside the table's scope Add a Trust Model section enumerating three deployment modes: connector-trusted (v1 default; rely on a trust boundary around the data plane), catalog-validated (server checks DataFiles + Deletes against the manifest), and manifest-signed (future; requires catalog support that doesn't exist today). Note the always-on guards (object ACL, request-shape limits) and require the chosen mode to be reported in PushdownStats so callers can verify which level handled their request.	2026-04-25 01:56:07 -07:00
Chris Lu	c9c487282b	docs(parquet-design): include DVs in execution flow and Phase 4 rollout The end-to-end delete flow listed only position-delete bitmaps and equality-delete predicates, and Phase 4 only mentioned "position delete masks" and "equality delete masks". Iceberg v3 deletion vectors were covered in the dedicated subsection but never made it into the operational flow or the rollout checklist, so a reader could plausibly ship Phase 4 without DV support. Add DV bitmap application as a distinct step (b) in the end-to-end flow, and split Phase 4 into v2 position-delete ingestion, v3 DV ingestion, equality-delete evaluation, partition matching, and snapshot-aware identity. Phase 4 now lists the work that exists, not a v2-only subset.	2026-04-25 01:55:34 -07:00
Chris Lu	c3bfa48f25	docs(parquet-design): include partition matching in delete applicability Iceberg delete files (position and equality) carry a partition spec id and partition values, and apply only to data files in the same partition. Both prose applicability rules (position deletes, equality deletes) previously omitted this constraint, leaving them correct only for unpartitioned tables. - Add PartitionSpecId and PartitionValues to DataFileDescriptor and DeleteFileRef so the planner can pass the partition triple through. - Rewrite both applicability rules to include partition equality alongside the sequence-number rule, formatted as set-builder predicates so each clause is explicit. - Note that v3 deletion vectors get partition matching implicitly via ReferencedDataFile, but populating the fields anyway keeps the applicability check uniform.	2026-04-25 01:55:08 -07:00
Chris Lu	b47bcd26c0	docs(parquet-design): use Puffin offset+length for DV identity, not a digest The previous text claimed Puffin records a per-blob "content hash" that could be used as the bitmap cache key. Puffin metadata exposes offset, length, properties, optional compression info, and a CRC-32 per blob — not a manifest-level content hash. Treating the CRC as an identity hash would also be wrong: it is for corruption detection, not for cross-table cache lookup. Replace BlobDigest []byte with BlobCRC32 uint32 (clearly scoped to tamper / corruption checks), and use the immutable triple (puffin_file_identity, blob_offset, blob_length) as the DV cache identity. Update the prose and the cache-key shorthand to match.	2026-04-25 01:54:13 -07:00
Chris Lu	00f6ac592b	docs(parquet-design): drop DeletionVector content value Iceberg's manifest "content" enum has only two delete values: POSITION_DELETES (1) and EQUALITY_DELETES (2). Iceberg v3 deletion vectors are POSITION_DELETES stored in a Puffin file, not a third content value. The previous DeleteContent enum invented a DeletionVector = 3 that does not exist in the spec, which would have forced a translation step on the wire and confused any reader cross- checking against an Iceberg manifest. Drop DeletionVector from DeleteContent. A DV is now expressed as (Content == PositionDeletes, FileFormat == FileFormatPuffin), which matches what an Iceberg manifest actually carries. Update the DeleteFileRef field comments and the prose that listed the old three-way discriminator.	2026-04-25 01:53:34 -07:00
Chris Lu	89833c07d3	docs(parquet-design): correct DV applicability rule to inclusive Deletion vectors are position-based (they encode row positions, just in Puffin form), so they follow the same data_sequence_number rule as v2 position-delete files: data_file.seq <= delete_file.seq. The previous comment grouped DVs with equality deletes (strict <), which would silently drop deletes recorded in the same snapshot as the data file they target.	2026-04-25 01:52:54 -07:00
Chris Lu	08889b0f7b	docs(parquet-design): identify columns by Iceberg field ID String column refs are fragile under Iceberg schema evolution: a rename leaves indexes built under the old name unreachable, and a drop-and-re-add of the same name silently aliases two different columns onto the same index path. Replace string Column / Columns []string fields with a ColumnRef struct (FieldId int32 + Path string hint), used in ParquetPushdownRequest.Columns, VectorQuery.Column, and PageRef.Column. The server trusts FieldId when set and falls back to Path only for non-Iceberg-managed Parquet files where field IDs are absent. Side-index file names are also keyed by field ID (bitmap.fid_7, vector.fid_42.ivf) so that rename and re-add stay unambiguous on disk.	2026-04-25 01:46:47 -07:00
Chris Lu	706ad03d0c	docs(parquet-design): use DataSequenceNumber for delete scoping Iceberg manifest entries carry two sequence numbers: - data_sequence_number: the sequence number when the data was logically added (semantic time); - file_sequence_number: the sequence number of the manifest entry itself (commit time). Delete scoping (which delete files apply to which data files) is defined entirely in terms of data_sequence_number, so the request struct should name the field DataSequenceNumber, not the ambiguous SequenceNumber. Rename on both DataFileDescriptor and DeleteFileRef, spell out the scope inequalities (strict for equality/DV, inclusive for position), and clarify in prose that file_sequence_number is not used for delete scoping.	2026-04-25 01:45:07 -07:00
Chris Lu	3475df1203	docs(parquet-design): expand DeleteFileRef for v2 + v3 metadata The previous DeleteFileRef carried only Path, SizeBytes, SequenceNumber. That is not enough to distinguish v2 position-delete files from equality-delete files from v3 deletion vectors, to know which file format to read, to evaluate equality predicates (which field IDs?), or to locate a DV inside a Puffin file. Add: - Content discriminator (PositionDeletes / EqualityDeletes / DeletionVector) matching the Iceberg manifest "content" enum - FileFormat (Parquet / Avro / ORC / Puffin) - EqualityFieldIds for equality-delete predicate columns - BlobOffset / BlobLength / BlobDigest for Puffin DV blobs - ReferencedDataFile, when the delete targets a single data file Also collapse DataFileDescriptor.PositionDeletes + DataFileDescriptor.EqualityDeletes into a single Deletes list now that DeleteFileRef.Content discriminates among the three forms, including DVs which fit neither old slot.	2026-04-25 01:44:15 -07:00
Chris Lu	c9871ce290	docs(parquet-design): scope position-delete bitmap cache key The previous text said the merged position-delete bitmap was "cached as a side index" without spelling out the cache key, leaving the implication that data-file identity was sufficient. It is not: the merged bitmap is a function of the set of position-delete files that currently target the data file, and that set changes as snapshots add new delete files or compaction removes them. Add a "cache key" subsection requiring (data_file_identity, sorted_set_of_input_delete_file_ids), with the v3 DV shorthand of (data_file_identity, puffin_blob_digest), and explain why snapshot_id alone is neither necessary nor sufficient as a key.	2026-04-25 01:43:11 -07:00
Chris Lu	eeef173dd0	docs(parquet-design): cover Iceberg v3 deletion vectors The previous "Handling Iceberg Deletes" section knew only about v2 position-delete files and equality-delete files. Iceberg v3 introduces deletion vectors (DVs) — Puffin "deletion-vector-v1" blobs holding a roaring bitmap of file-absolute row positions for one data file — which are not optional for v3-spec tables. Add a comparison table for the three delete forms and a DV subsection covering: per-data-file scope, the (puffin_path, offset, length) pointer, cache-by-blob-content, and the v2/v3 mixed case where DVs and position-delete files can co-exist during migration. Forward-link to the cache-key section that the next commit tightens.	2026-04-25 01:42:42 -07:00
Chris Lu	317ee9d10c	docs(parquet-design): use planned scan size for pushdown threshold The cost rule of thumb keyed off "total file size" of the table, which is the wrong input — a huge table that Iceberg has narrowed to a single 4 MiB file should not pay for a pushdown round-trip, while a medium table that planning leaves with thousands of small surviving files still benefits. Switch the threshold to planned scan bytes plus surviving file count, both observable after Iceberg planning, and update the configuration knob names accordingly.	2026-04-25 01:31:39 -07:00
Chris Lu	9c71020915	docs(parquet-design): pair scores with row refs and clarify row position Two ambiguities in the response shape: - Scores []float32 was a parallel array to RowIds, with order as the only correlation. Replace with []ScoredRowRef so each score is bound to its row ref and unscored scalar results don't need a sentinel. - RowRef.RowId int64 left it unclear whether the value was file-absolute or row-group-local. Iceberg position deletes are file-absolute, so make RowRef.FilePosition file-absolute and treat the RowGroup field as a derived locality hint rather than identity.	2026-04-25 01:31:13 -07:00
Chris Lu	7110d326fa	docs(parquet-design): expand page-level index section The previous text described page pruning as per-column min/max plus offsets, which omits the actual hard part: page boundaries do not align across columns within a row group, so a predicate evaluated on one column's pages must be translated to row ranges (via OffsetIndex first_row_index) and then back to each projected column's pages. Spell out: - the index wraps Parquet's ColumnIndex and OffsetIndex; - pruning is row-range-driven, not byte-range-driven; - per-column page selection is required for projected columns; - dictionary pages are not in OffsetIndex but must be fetched along with any surviving data page from the same column chunk; - repeated/nested fields use row positions, not value positions, so the cross-column translation still holds; - planner falls back to row-group pruning if ColumnIndex/OffsetIndex are absent.	2026-04-25 01:30:45 -07:00
Chris Lu	06bb1782e4	docs(parquet-design): unify side-index paths under system-prefix The vector-index and position-delete-bitmap path examples still used the old "/table/_seaweed_index/" form, contradicting the rule (added in `605907f5`) that side indexes live outside the table prefix. Update both examples to use the system-prefix/table_uuid/.../identity layout.	2026-04-25 01:30:11 -07:00
Chris Lu	775095c50d	docs(parquet-design): correct equality-delete sequence-number scope The previous text said "new data files added later may need to be filtered by the same delete predicate until compaction", which is the opposite of Iceberg's rule: equality deletes apply only to data files whose sequence number is strictly less than the delete file's sequence number. Files added at or after that point are untouched. Rewrite the scope rule, link it to the per-data-file SequenceNumber that the request now carries, and key the optional materialized-bitmap cache by snapshot_id as well.	2026-04-25 01:29:44 -07:00
Chris Lu	7a53654b93	docs(parquet-design): add per-file descriptors to pushdown request The previous Files []string was insufficient for two reasons: - the server cannot validate that a cached side index still matches the data file without identity fields (size, record count, ETag); - correct equality-delete scoping needs the Iceberg sequence number per data file, plus the set of position/equality delete files the client's Iceberg planner has attached to it. Replace Files with DataFiles []DataFileDescriptor and document that the client owns Iceberg planning while the server treats the resolved set as authoritative.	2026-04-25 01:29:21 -07:00
Chris Lu	fbabfbd353	docs(parquet-design): add residency/cost/security sections and wording fixes - Add Index Residency and Failure Model section: filer-managed vs colocated vs index-serving-tier options, and graceful-degradation behavior when an index is missing or stale. - Add Cost Model section spelling out when connectors should skip the pushdown API (small tables, low selectivity, no predicate). - Add Security section noting that zone maps, blooms, and inverted indexes leak information beyond the underlying object's ACL, and scoping v1 to inherit the file's ACL with row-level security explicitly out of scope. - Wording fixes: clarify CONTAINS is engine-side translation of LIKE/MATCH/array_contains; clarify the embedding column is in Parquet but the vector index is a side file; flag local filtering on volume servers as new infrastructure introduced by this design.	2026-04-25 01:24:01 -07:00
Chris Lu	605907f522	docs(parquet-design): clarify Phase 1 wording and side-index placement - Phase 1 "row group stats cache" was misleading — those stats already live in the Parquet footer. The Phase 1 win is a parsed-footer cache, not new index data. - The previous "/table/_seaweed_index/" path nested side indexes under the table prefix, which Iceberg orphan-file removal and snapshot expiration would surface or attempt to clean. Move side indexes to a separate system bucket / filer mount keyed by file identity, with the in-table-prefix layout listed only as a fallback.	2026-04-25 01:22:31 -07:00
Chris Lu	afeb82ab87	docs(parquet-design): add filtered-ANN strategy for hybrid pushdown The hybrid scalar+vector flow assumed the vector index could accept arbitrary scalar pre-filters, which most off-the-shelf IVF/HNSW implementations do not. Document the realistic strategies (partitioned IVF for v1, post-filter with overscan as fallback, filtered HNSW deferred) and the planner heuristic for choosing between them.	2026-04-25 01:21:58 -07:00
Chris Lu	e31cb80812	docs(parquet-design): make row-id response optional and bounded A predicate matching millions of rows would force the server to return millions of row refs. Default the response to range-oriented data (file/row-group/page refs), require RequestRowIds opt-in for per-row output, and cap returned ids via MaxRowIds with a Truncated flag.	2026-04-25 01:21:32 -07:00
Chris Lu	28c834c76f	docs(parquet-design): specify predicate wire format and enum metrics Replace opaque Predicate []byte with a tagged (PredicateKind, bytes) pair, naming Substrait as the canonical v1 format and Iceberg Expression JSON as a convenience. Replace the free-form Metric string on VectorQuery with a VectorMetric enum.	2026-04-25 01:21:11 -07:00
Chris Lu	51a209cf60	docs(parquet-design): split position vs equality Iceberg deletes Position deletes are file-scoped and precomputable as a per-data-file bitmap. Equality deletes are predicate-scoped, can apply to many data files including ones added later, and must be evaluated at query time. Document both flows separately and remove the implication that equality-delete state can be flattened into a static bitmap.	2026-04-25 01:20:53 -07:00
Chris Lu	b114b318ae	docs(parquet-design): tighten index identity Drop modification time from the identity tuple — it is not stable across replication and metadata-only operations and is unnecessary alongside ETag or Iceberg manifest fields. Order remaining fields by strength, with Iceberg manifest identity preferred when available.	2026-04-25 01:20:21 -07:00
Chris Lu	9ecd455310	docs(parquet-design): reframe internal mapping as logical-only The previous wording suggested SeaweedFS would split a Parquet file into per-column-chunk needles, which contradicts the design's "do not modify the file" goal. Clarify that the per-column-chunk view is a planning abstraction over byte ranges in the unchanged object.	2026-04-25 01:20:02 -07:00
Chris Lu	ca4c8d977d	docs: add SeaweedFS-aware Parquet pushdown design draft Draft design note for keeping Parquet files compatible with existing engines (Spark, Trino, DuckDB, PyArrow, Iceberg) while building auxiliary side indexes and a pushdown API for SeaweedFS-aware clients. Covers physical layout, side index types (footer cache, zone map, page index, bloom, bitmap, B-tree, inverted, vector), pushdown API sketch, Iceberg delete handling, and a phased rollout plan.	2026-04-25 01:15:42 -07:00
Chris LuandGitHub	5eead9409a	fix(admin): S3 Tables CSRF token + non-empty 409 status (#9221 ) * fix(admin): attach CSRF token to S3 Tables write requests Several POST/PUT/DELETE calls in s3tables.js were sent without an X-CSRF-Token header while the corresponding handlers in weed/admin/dash/s3tables_management.go enforce CSRF via requireSessionCSRFToken, so authenticated users hit "invalid CSRF token" on actions like creating a table bucket (#9220), updating policies, and managing tags. Add an s3tWriteHeaders helper that pulls the token from the existing csrf-token meta tag and use it on every write to /api/s3tables/buckets, /bucket-policy, /tables, /table-policy, and /tags. The Iceberg-page write paths already attached the token and are unchanged. Fixes #9220 * fix(admin): map BucketNotEmpty/NamespaceNotEmpty to 409 for S3 Tables DELETE on a non-empty table bucket or namespace returned HTTP 500 because s3TablesErrorStatus didn't list ErrCodeBucketNotEmpty or ErrCodeNamespaceNotEmpty in its conflict case, even though the backend handler emits them with 409 Conflict (matching AWS S3 Tables). Add both codes to the existing conflict mapping. * refactor(admin): route Iceberg S3 Tables writes through s3tWriteHeaders Iceberg namespace/table create and Iceberg table delete were still hand-rolling CSRF headers. Replace those blocks with the existing s3tWriteHeaders() helper so every S3 Tables write uses the same code path. Drop the now-unused csrfTokenInput.value population in initIcebergNamespaces and initIcebergTables (the templ hidden inputs have no server-rendered value, and nothing reads the input now that the JS reads the token from the meta tag via getCSRFToken()).	2026-04-24 22:48:41 -07:00
Chris Lu	a14cbc176b	debug(kafka): add restart flake diagnostics	2026-04-24 15:02:07 -07:00
Chris LuandGitHub	f1f720f5da	fix(master): register EC shards per physical disk on full heartbeat sync (#9212 ) (#9219 ) * refactor(types): add DiskId type for physical-disk identifiers Names the uint32 physical-disk index that volume servers carry in VolumeEcShardInformationMessage / VolumeInformationMessage, so EC shard tracking that needs to distinguish disks within a DataNode can use a dedicated type instead of an untyped uint32. No behaviour change. * fix(master): register EC shards per physical disk on full heartbeat sync (#9212) When a volume's EC shards are spread across multiple physical disks on the same volume server (common after ec.balance / ec.rebuild on multi-disk nodes), the volume server emits one VolumeEcShardInformationMessage per (disk, volume) in its heartbeat. The master's DataNode.UpdateEcShards was building a `map[VolumeId]EcVolumeInfo` with last-write-wins, and doUpdateEcShards then overwrote `disk.ecShards[vid]` once per message, so all but the final disk's shards were silently dropped. Only the topology-global ecShardMap (built via RegisterEcShards in a per-message loop) stayed correct, which hid the problem from `topo.LookupEcShards` but broke everything that reads the DataNode/Disk view — volume.list, admin UI, ec.rebuild dry-run ("only 6 shards, skipping"), and `DiskInfo.EcShardInfos` which the shell's ec.balance / ec.rebuild planners group by `eci.DiskId`. Change the shape of `Disk.ecShards` from map[VolumeId]EcVolumeInfo to map[VolumeId]map[types.DiskId]EcVolumeInfo so every physical disk keeps its own entry. UpdateEcShards aggregates incoming messages by (vid, diskId) rather than vid alone; Add/Delete/ HasVolumesById and HasEcShards consult the nested map; doUpdateEcShards rewrites the nested structure from the aggregated map. Per-physical-disk attribution survives through DataNode.ToDataNodeInfo -> DiskInfo.EcShardInfos, matching the wire format the volume server produces and what downstream admin tooling expects. Delta sync (AddOrUpdateEcShard / DeleteEcShard) already merged via ShardsInfo.Add, so this only affects the full-sync path that runs on heartbeat reconnect. Adds data_node_ec_multi_disk_test.go with two regression tests that fail on pre-fix master: - TestEcShardsAcrossMultipleDisksOnSameNode: volume 15 spread over 3 disks (matches the bug report's volume-2 row); asserts every shard visible via LookupEcShards, DataNode.GetEcShards, and ToDataNodeInfo's per-disk EcShardInfos entries. - TestEcShardsAfterRestartHeartbeat: minimal 2-disk full sync case. fix(topology): tighten locking around EC shard map access Addresses review comments on #9219: * DataNode.UpdateEcShards now holds dn.Lock for the full read-diff-write cycle, matching UpdateVolumes' model, so concurrent heartbeats can no longer interleave their getOrCreateDisk / UpAdjustDiskUsageDelta updates with each other. Introduces a private getEcShardsLocked helper for reads under the held lock; renames doUpdateEcShards to doUpdateEcShardsLocked for the same reason. * DataNode.HasEcShards now takes each disk's ecShardsLock while reading disk.ecShards, closing a pre-existing map race with concurrent Add/Delete/Update writers. * doUpdateEcShardsLocked takes each disk's ecShardsLock around the reset-and-rewrite so readers (GetEcShards, HasEcShards) see a consistent map state rather than a partially-rebuilt one. * Disk.GetEcShards' slice-capacity hint now accounts for the nested per-physical-disk entries (sum of inner lengths) instead of underestimating by the unique-volume count.	2026-04-24 14:01:09 -07:00
Chris LuandGitHub	d65c568cbb	fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests (#9218 ) * fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests DeserializeSSES3Metadata does not require an IV, and a corrupted or legacy chunk without one would have flowed into cipher.NewCTR and panicked. Validate that each per-chunk IV is exactly AESBlockSize bytes before decryption, closing the current and any already-appended chunk readers on error. Factor the per-chunk decryption loop out of createMultipartSSES3DecryptedReaderDirect into buildMultipartSSES3Reader so it can be driven with a mock chunk fetcher, and add tests covering: the happy path with two parts (distinct per-chunk DEKs/IVs, out-of-order chunks) to lock in the fix from #9211; missing-IV and short-IV metadata rejection without panic; and reader cleanup when a later chunk fails. * address review: sort chunks copy; close encryptedStream on error - buildMultipartSSES3Reader now sorts a copy of the chunks slice so callers do not observe entry.Chunks reordered (other code paths, e.g. ETag computation, can rely on the original order). - createMultipartSSES3DecryptedReaderDirect now closes encryptedStream on the error path from buildMultipartSSES3Reader. All current callers pass nil, but this keeps cleanup symmetric with the success path. - Extend TestBuildMultipartSSES3Reader_PerChunkKeys to assert the input slice is not mutated. * address review: defer single close; extend chunk-copy + IV-guard pattern - createMultipartSSES3DecryptedReaderDirect: collapse the duplicated encryptedStream.Close() calls into a single nil-guarded defer so the error and success paths share cleanup. - createMultipartSSECDecryptedReaderDirect, createMultipartSSEKMSDecryptedReaderDirect: sort a copy of entry.Chunks instead of mutating the caller's slice, matching the SSE-S3 helper. - createMultipartSSECDecryptedReaderDirect: validate per-chunk IV length before handing it to cipher.NewCTR; a base64-decoded empty or short IV from malformed/corrupt metadata would otherwise panic. - SSE-KMS needs no IV guard: CreateSSEKMSDecryptedReader already calls ValidateIV before cipher.NewCTR. Note recorded in the sort comment. * address review: close appended readers on SSE-C/SSE-KMS error paths createMultipartSSECDecryptedReaderDirect and createMultipartSSEKMSDecryptedReaderDirect only closed the current chunk reader on error and leaked any chunk readers already appended to the local readers slice, mirroring the leak previously fixed in the SSE-S3 helper. Add the same closeAppendedReaders() closure pattern to both functions and invoke it on every error return inside the loop so failed requests do not leak volume-server HTTP connections. * address review: defer encryptedStream close in SSE-C/SSE-KMS; drop chunks reassignment - Move encryptedStream.Close() to a nil-guarded defer at the top of createMultipartSSECDecryptedReaderDirect and createMultipartSSEKMSDecryptedReaderDirect so the stream is closed on every return path (including error returns from inside the per-chunk loop), mirroring the SSE-S3 helper. - In buildMultipartSSES3Reader, iterate sortedChunks directly instead of reassigning chunks = sortedChunks.	2026-04-24 13:59:23 -07:00
Chris LuandGitHub	fe1d7a404d	fix(iam): substitute dynamic jwt:/saml:/oidc: claim variables in policies (#9217 ) * fix(iam): expand arbitrary jwt:/saml:/oidc: claim variables in policies The policy engine gated variable substitution on a fixed allowlist (jwt:sub, jwt:iss, jwt:aud, jwt:preferred_username), so patterns like arn:aws:s3:::softs/${jwt:project_path}/* were passed through as literals and never matched the requested resource. Dynamic claims from OIDC providers (e.g. GitLab CI's project_path / namespace_path) could not be used to scope policies. Allow any jwt:/saml:/oidc: prefixed variable to be substituted when the claim is present in RequestContext. These values originate from a cryptographically verified identity token (the STS session JWT or federated assertion), and the claim names are controlled by the trusted identity provider, so the dynamic prefix is safe. Missing claims keep the placeholder intact so the statement still fails to match. Numeric JWT claims (JSON-decoded as float64) are now stringified so patterns like ${jwt:project_id} work the same as string claims. Fixes #9214 * fix(iam): cover all integer widths in claim stringification Address PR review: stringifyClaimValue only handled int/int32/int64 on the signed side and nothing on the unsigned side, so int8, int16, uint, uint8, uint16, uint32, and uint64 claim values fell through to the default branch and the placeholder was left unsubstituted. JSON's generic decoder produces float64/json.Number for numbers, but RequestContext can also be populated from typed sources (custom providers or internal code), so cover all common integer widths - signed and unsigned - explicitly. Extend TestStringifyClaimValue to assert each supported type.	2026-04-24 13:08:24 -07:00
os-pradipbabar GitHub Chris Lu	8815844278	fix(s3api): correct SSE-S3 decryption key handling in multipart uploads (#9211 ) * fix(s3api): correct SSE-S3 decryption key handling in multipart uploads * fix(s3api): preallocate readers and close on error in SSE-S3 direct path Address review feedback on createMultipartSSES3DecryptedReaderDirect: preallocate the readers slice with the known chunk count, and close any already-appended chunk readers on error returns so failed requests do not leak volume-server HTTP connections. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-24 12:00:29 -07:00
Lisandro PinandGitHub	93247d6de4	Export REST file_{read,write}_failures metrics on volume servers (#9215 ) * Export gRPC `file_{read,write}_failures` metrics on volume servers. Allows to track overall R/W errors in real time through Prometheus. Will follow up with a PR for Seaweed's REST API. * Export REST `file_{read,write}_failures` metrics on volume servers.	2026-04-24 11:45:21 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	352ffdffe1	build(deps): bump rustls-webpki from 0.103.10 to 0.103.13 in /seaweed-volume (#9216 ) build(deps): bump rustls-webpki in /seaweed-volume Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.10 to 0.103.13. - [Release notes](https://github.com/rustls/webpki/releases) - [Commits](https://github.com/rustls/webpki/compare/v/0.103.10...v/0.103.13) --- updated-dependencies: - dependency-name: rustls-webpki dependency-version: 0.103.13 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-24 11:44:20 -07:00
Lars Lehtonen GitHub Chris Lu	29e14f89f1	fix(weed/command) address unhandled errors (#9208 ) * fix(weed/command) address unhandled errors * fix(command): don't log graceful-shutdown sentinels; plug response-body leak - s3: Serve on unix socket treated http.ErrServerClosed as fatal; now excluded like the other Serve/ServeTLS paths in this file. - mq_agent, mq_broker: filter grpc.ErrServerStopped so clean shutdown doesn't log as an error. - worker_runtime: the added decodeErr early-continue skipped resp.Body.Close(); drop it since the existing check below already surfaces the decode error. - mount_std: the pre-mount Unmount commonly fails when nothing is mounted; demote to V(1) Infof. - fuse_std: tidy panic message to match sibling cases. * fix(mq_broker): filter grpc.ErrServerStopped on localhost listener The localhost listener goroutine logged any Serve error unconditionally, which includes grpc.ErrServerStopped on graceful shutdown. Match the main listener's check so clean stops don't surface as errors. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-23 22:15:05 -07:00
Chris LuandGitHub	88c2f3c34d	fix(iam): accept bare "" resource in PutUserPolicy (#9209 ) (#9210 ) AWS IAM treats a bare "" in a statement's Resource as "any resource", but the embedded IAM resource parser required a 6-segment S3 ARN and silently skipped anything else. With a policy like {Action: "s3:", Resource: ""}, every resource was dropped and the statement produced no actions, so PutUserPolicy rejected the document with "no valid actions found in policy document". Short-circuit Resource == "" to the same full-wildcard path that "arn:aws:s3:::" already takes.	2026-04-23 22:14:41 -07:00
Chris LuandGitHub	da2e90aefd	fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9207 ) * fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9139) A single file with invalid-UTF-8 bytes in its name (e.g. a GNOME Trash "partial" like \x10\x98=\\\x8a\x7f.trashinfo.9a51454f.partial) made every FUSE-initiated filer RPC fail with: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 and then produced an avalanche of "connection is closing" errors on unrelated LookupEntry / ReadDirAll / UpdateEntry calls, causing the volume-server QPS dips reported in #9139. Root cause is twofold: 1. Proto3 `string` fields require valid UTF-8, but the FUSE kernel passes raw name bytes. Create/Mknod/Mkdir/Unlink/Rmdir/Rename/Lookup/Link/ Symlink all forwarded those bytes directly into CreateEntryRequest.Name, DeleteEntryRequest.Name, StreamRenameEntryRequest.{Old,New}Name and Entry.Name. saveDataAsChunk also copied the FullPath into AssignVolumeRequest.Path unchecked. 2. When the marshal failed, shouldInvalidateConnection treated the resulting codes.Internal as a connection problem and dropped the shared cached ClientConn — canceling every other in-flight RPC on it. Fix: - Add sanitizeFuseName (strings.ToValidUTF8 with '?' replacement, matching util.FullPath.DirAndName) and make checkName return the sanitized name. Apply at every FUSE entry point that passes a name to the filer RPC, including Unlink/Rmdir (which did not previously call checkName) and both oldName/newName in Rename. Add a backstop scrub for AssignVolumeRequest.Path so async flush paths cannot reintroduce invalid bytes from a pre-sanitization cached FullPath. - In weed/pb.shouldInvalidateConnection, detect client-side marshal errors via the gRPC library's "error while marshaling" prefix and return false: the connection is healthy, only the request is bad. Refs: https://github.com/seaweedfs/seaweedfs/issues/9139#issuecomment-4301184231 * fix(mount,util): use '_' for invalid-UTF-8 replacement (URL-safe) Sanitized filenames flow downstream into HTTP URLs (volume-server uploads, filer HTTP API, S3/WebDAV gateways). '?' is the URL query-string delimiter and would split the path the first time the name lands in one, so swap every invalid-UTF-8 replacement to '_'. This covers the two pre-existing sites in weed/util/fullpath.go as well, keeping all paths sanitized the same way. * refactor(pb): detect client-side marshal errors via errors.As, not substring Replace the raw `strings.Contains(err.Error(), ...)` check with a type-based carve-out: use errors.As against the `GRPCStatus() Status` interface to pull the original Status out of any fmt.Errorf("...: %w") wrapping, then match the library-owned "grpc:" prefix on that Status's Message. Why not errors.Is against a proto-level sentinel: gRPC's encode() collapses the inner proto error with "%v" (stringification) before wrapping it in a Status, so the original error type does not survive into the caller. The Status itself is the structural signal that does survive. Why not status.FromError: when the caller wraps the Status error with fmt.Errorf("...: %w", ...), status.FromError rewrites Status.Message with the full err.Error() of the outermost wrapper, which defeats a prefix check on the library-owned message. errors.As gives us the original Status whose Message is still verbatim from the gRPC library. A new test asserts that a plain errors.New("grpc: error while marshaling: …") — i.e. the same text attached to something that is NOT a gRPC status — does not short-circuit invalidation, so we never silently keep a cached connection alive based on a coincidental substring match. refactor(util): centralize UTF-8 sanitization; add FullPath.Sanitized Addresses review feedback on PR #9207. Nitpick: every invalid-UTF-8 replacement across the codebase (DirAndName, Name, mount.sanitizeFuseName, the weedfs_write.go backstop) now goes through a single util.SanitizeUTF8Name helper, so the replacement char ('_' — URL-safe) is chosen in one place. Outside-diff: three proto fields took raw FullPath strings that could break marshaling if an entry ever carried invalid UTF-8 (CreateEntryRequest.Directory in Mkdir, DeleteEntryRequest.Directory in Unlink, AssignVolumeRequest.Path in command_fs_merge_volumes). The reviewer's suggested fix — using DirAndName() — would have silently changed Directory from parent to grandparent, because DirAndName sanitizes only the trailing component. Added FullPath.Sanitized(), which scrubs every component, and applied it at the three sites. Exposure is narrow in practice (FUSE-boundary sanitization and the gRPC-side isClientSideMarshalError carve-out already cover the #9139 cascade), but the defense-in-depth is cheap and consistent with the existing AssignVolume backstop. New tests in weed/util/fullpath_test.go document: - SanitizeUTF8Name: valid UTF-8 passes through unchanged; invalid bytes become '_' (not '?', which is URL-special). - FullPath.Sanitized: scrubs bytes in any component, not just the last. - FullPath.DirAndName: dir remains raw on purpose — callers needing a clean full path must use Sanitized(). The test pins this behavior so it is not accidentally "fixed" in a way that changes the (dir, name) semantics callers depend on.	2026-04-23 19:17:35 -07:00
Chris Lu	a0be40e070	Merge branch 'master' of https://github.com/seaweedfs/seaweedfs	2026-04-23 16:25:12 -07:00
Chris Lu	b94ad82472	fix(test): stabilize ConcurrentLockContention; warn on coherence drift TestPosixFileLocking/ConcurrentLockContention failed in CI (run 24857323067) with ENOENT when re-opening the file after all 8 workers had successfully written and closed. The 20s openWithRetry budget was exhausted, pointing at a real but unproven metaCache/parent-cache coherence issue in the mount under bursts of concurrent Release. Test: hold the initial fd open for the whole subtest; use it for the post-workers Sync() and the verification read. Workers still exercise the concurrent-flock invariant and per-record write correctness; the re-open path is no longer load-bearing. On Eventually failure, dump ReadDir of the parent, Stat, and a fresh O_RDONLY open so a future recurrence has state to debug from. Drop the darwin-only ENOENT t.Skip branches that hid this same flake. Mount: in weedfs.lookupEntry, when returning ENOENT from the "parent cached but child missing" branch, log at Warningf instead of V(4) when the kernel is still tracking this path's inode. That combination is the smoking-gun signal for cache drift and is rare enough in normal use not to spam the log.	2026-04-23 15:57:35 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	cd5004cfbd	build(deps): bump github.com/Azure/go-ntlmssp from 0.1.0 to 0.1.1 in /test/kafka (#9204 ) build(deps): bump github.com/Azure/go-ntlmssp in /test/kafka Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.1.0 to 0.1.1. - [Release notes](https://github.com/Azure/go-ntlmssp/releases) - [Commits](https://github.com/Azure/go-ntlmssp/compare/v0.1.0...v0.1.1) --- updated-dependencies: - dependency-name: github.com/Azure/go-ntlmssp dependency-version: 0.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-23 15:02:28 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	5cbcfd311c	build(deps): bump github.com/Azure/go-ntlmssp from 0.1.0 to 0.1.1 (#9205 ) Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.1.0 to 0.1.1. - [Release notes](https://github.com/Azure/go-ntlmssp/releases) - [Commits](https://github.com/Azure/go-ntlmssp/compare/v0.1.0...v0.1.1) --- updated-dependencies: - dependency-name: github.com/Azure/go-ntlmssp dependency-version: 0.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-23 15:02:17 -07:00
Chris LuandGitHub	76f361fa77	fix(helm): gate S3 TLS cert args on httpsPort to stop probe failures (#9202 ) (#9206 ) * fix(helm): gate S3 TLS cert args on httpsPort to stop probe failures (#9202) With `global.seaweedfs.enableSecurity=true` and the default `s3.httpsPort=0`, the chart was unconditionally passing `-cert.file` / `-key.file` to the S3 frontend. In `weed/command/s3.go`, when `tlsPrivateKey != ""` and `portHttps == 0`, the server promotes its main `-port` (8333 by default) into an HTTPS listener. The pod's readiness / liveness probes still use `scheme: HTTP`, so every kubelet probe produces http: TLS handshake error from <node-ip>:<port>: client sent an HTTP request to an HTTPS server in the pod log, as reported in #9202. `enableSecurity=true` is supposed to activate security.toml / gRPC mTLS, not silently flip the S3 HTTP port to HTTPS. Move the `seaweedfs.s3.tlsArgs` include inside the `if httpsPort` guard in all three templates that wire up an S3 frontend (standalone S3 deployment, filer with S3 sub-server, all-in-one deployment). The TLS cert args are now emitted only when the user explicitly opts into an HTTPS port; the main `-port` stays HTTP so probes work. Also add a regression test to `.github/workflows/helm_ci.yml` that renders all three templates with and without `httpsPort` and asserts the cert/key/ `-port.https` args are emitted together or not at all. * test(helm): add bash -n parse check to the S3 TLS-gating regression test Addresses gemini-code-assist review comment on #9206 flagging a potential "dangling backslash" shell-syntax risk in the rendered all-in-one command script when httpsPort is set but most S3/SFTP args are defaulted off. In practice bash -n accepts a trailing `\<newline><EOF>` (it's line-continuation to an empty line), so no current rendering is broken. Locking that contract down in CI so a future helper change that leaves a dangling backslash — or any other shell-syntax regression in the rendered command — fails loudly instead of silently shipping broken pods.	2026-04-23 15:00:07 -07:00
Chris LuandGitHub	3d39324bc1	fix(nfs): make Linux `mount -t nfs` work without client workaround (#9199 ) (#9201 ) * fix(nfs): make Linux `mount -t nfs` work without client-side workaround (#9199) The upstream go-nfs library serves NFSv3 + MOUNT on a single TCP port and does not register with portmap. Linux mount.nfs queries portmap on port 111 first, so the plain `mount -t nfs host:/export /mnt` form failed with "portmap query failed" / "requested NFS version or transport protocol is not supported" against a default `weed nfs` deployment. - Add a minimal PORTMAP v2 responder (weed/server/nfs/portmap.go) with TCP+UDP listeners implementing PMAP_NULL, PMAP_GETPORT, PMAP_DUMP, and proper PROG_MISMATCH / PROG_UNAVAIL / PROC_UNAVAIL responses. Advertises NFS v3 TCP and MOUNT v3 TCP at the configured NFS port. - New CLI flag `-portmap.bind` (empty, disabled by default) to opt into the responder. Binding port 111 requires root or CAP_NET_BIND_SERVICE and must not collide with a system rpcbind. - Extended `weed nfs -h` help with the two supported ways to mount from Linux (client-side portmap bypass, or server-side `-portmap.bind`). - Startup log now prints a copy-pasteable mount command tailored to whether portmap is enabled. Unit tests cover RPC/XDR parsing, accept-stat paths, and a TCP+UDP round-trip against the real listener. Verified in a privileged Debian 12 container: with `-portmap.bind=0.0.0.0` the exact command from #9199 (`mount -t nfs -o nfsvers=3,nolock host:/export /mnt`) now succeeds and both read and write work. * fix(nfs): harden portmap responder per review feedback (#9201) Addresses three review findings on the portmap responder: - parseRPCCall: validate opaque_auth length against the record limit before applying the XDR 4-byte padding, so a near-uint32-max authLen can no longer overflow (authLen + 3) and bypass the bounds check. (gemini-code-assist) - serveTCP/Close: track live TCP connections and evict them on Close() so shutdown does not block on idle clients waiting for the read deadline to trip. serveTCP also no longer tears the listener down on a non-fatal Accept error (e.g. EMFILE); it logs and retries after a small back-off. Replaces the atomic.Bool closed flag with a mutex-guarded one so closed, conns, and the shutdown transition stay consistent. (coderabbit, minor) - handleTCPConn: apply per-IO read/write deadlines (30s idle, 10s in-flight) so a peer that opens the privileged port 111 and stalls cannot pin a goroutine indefinitely. (coderabbit, major) Adds TestPortmapServer_CloseEvictsIdleTCPConn, which holds a TCP connection idle and asserts Close() returns within 2s (well under the 30s idle deadline) and that the client sees the eviction. All existing tests still pass, including under -race. * fix(nfs): keep portmap UDP responder alive on transient read errors (#9201) - serveUDP: on a non-shutdown ReadFromUDP error, log, back off, and continue instead of returning. Matches how serveTCP now treats non-fatal Accept errors so a transient network blip doesn't take UDP portmap down until restart. (coderabbit) - Rename portmapAcceptBackoff -> portmapRetryBackoff now that both paths use it. - pmapProcDump: fix the pre-allocation capacity to match the actual encoding (20 bytes per entry + 4-byte terminator), replacing the old over-estimate of 24 per entry. No behavior change; just documents intent. (coderabbit nit) * docs(nfs): clarify encodeAcceptedReply body semantics (#9201) The prior comment said body is "nil when the accept_stat is itself an error", which was misleading: the PROG_MISMATCH branch already passes an 8-byte mismatch_info body. Rewrite to enumerate which error accept_stat values omit the body and call out PROG_MISMATCH as the exception, referencing RFC 5531 §9. Comment-only. (coderabbit nit) * fix(nfs): make portmap retry backoff interruptible by Close() (#9201) serveTCP and serveUDP both sleep portmapRetryBackoff (50ms) after a non-fatal listener error. If Close() races in during that sleep, the goroutine can't be interrupted, so Close() has to wait out the remaining backoff before wg.Wait() returns. Add a done channel that Close() closes once, and replace both time.Sleep calls with a select on ps.done + time.After. The window was tiny in practice but the select makes shutdown strictly bounded by Close()'s own work. (coderabbit nit)	2026-04-23 13:53:53 -07:00
FQHSLycopene GitHub Chris Lu	20f4fd9985	fix(storage): use ceil division for EC shard slots in maxVolumeCount (#9196 ) * fix(storage): use ceil division for EC shard slots in maxVolumeCount * fix(topology): use ceil division for EC shard slots consistently Applies the same ceiling-division formula used in store.go to the four remaining master-side sites that computed volume-slots consumed by EC shards with off-by-one approximations: - disk.go ToDiskInfo / Disk.ToDiskInfo used (n+1)/d, which under-counts slots for non-multiples of DataShardsCount, over-reporting FreeVolumeCount. - DiskUsageCounts.FreeSpace and NodeImpl.AvailableSpaceFor subtracted n/d + 1, which over-counts slots at multiples of DataShardsCount, under-reporting free space (and suppressing volume growth on nodes that still had room). All four now use (n + DataShardsCount - 1) / DataShardsCount, matching store.go:393, store.go:810, and command_ec_decode.go:422. * refactor(topology): extract ecShardSlots helper Deduplicates the (n + DataShardsCount - 1) / DataShardsCount ceiling expression now used by ToDiskInfo, DiskUsageCounts.FreeSpace, Disk.ToDiskInfo, and AvailableSpaceFor. Addresses PR review feedback. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-23 13:52:58 -07:00
faspix GitHub Maksim Babkou Chris Lu	0fcd5173be	fix(admin): use basePath for API fetches when urlPrefix is set (#9197 ) * fix(admin): use basePath for API fetches when urlPrefix is set * fix(admin): drop duplicate iam-utils script on Groups page * fix(admin): route topics page fetches through basePath The Topics page missed two fetch() calls that still used root-relative URLs, so create-topic and view-details still broke when -urlPrefix was set. --------- Co-authored-by: Maksim Babkou <maksim.babkou@innovatrics.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-23 11:55:07 -07:00
Chris LuandGitHub	749430dceb	fix(filer.meta.tail): include extended metadata in Elasticsearch docs (#9200 ) * fix(filer.meta.tail): include extended metadata in Elasticsearch docs The -es sink flattened only the FUSE attributes, so xattrs (including S3 user metadata like X-Amz-Meta-) never reached Elasticsearch. Add an Extended field and convert map[string][]byte to map[string]string so the values index as text; non-UTF-8 values fall back to base64. Addresses #9190 follow-up. fix(filer.meta.tail): prefix base64-encoded extended values with "base64:" Addresses review feedback: a plain UTF-8 xattr and a base64 fallback are otherwise indistinguishable to a consumer reading the ES doc.	2026-04-23 11:54:08 -07:00
Chris Lu	036191c78a	Merge branch 'master' of https://github.com/seaweedfs/seaweedfs	2026-04-23 11:09:59 -07:00
Chris Lu	34b236acfa	test(s3api): look up NewUser by name in CreateAccessKey collision test The memory credential store backs LoadConfiguration with a map, so the identity order is not stable across a save/load round trip. Indexing Identities[1] intermittently pointed at the owner identity and produced a spurious credential leak.	2026-04-23 11:09:17 -07:00

1 2 3 4 5 ...