mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-31 14:06:20 +00:00
* fix(s3tables/iceberg): make metadata spec-compliant and accept real-world manifest names
Two related issues prevent SeaweedFS S3 Tables from interoperating with
strict Iceberg clients (Java/Spark/Flink/Trino):
1. iceberg-go v0.5.0 serializes empty TableMetadata state by dropping
keys via `omitempty` on optional pointer/slice fields. The Iceberg
table spec, however, requires `current-snapshot-id`, `snapshots`,
`snapshot-log`, `metadata-log`, and `refs` to be present even when
empty (`current-snapshot-id` must be -1 for a table with no
snapshots). Java's TableMetadataParser uses JsonUtil.getLong on
`current-snapshot-id` and throws "Cannot parse missing long
current-snapshot-id" against responses produced by this server.
2. The Iceberg layout validator only accepts manifest filenames that
match Iceberg's internal naming (`{uuid}-m{n}.avro`,
`snap-{n}-{n}-{uuid}.avro`). Real writers — notably Flink's sink —
emit manifests like
`{flink-job-id}-{checkpoint}-{operator-id}-{n}.avro`, which the
validator rejects with 403, breaking INSERT commits.
Fixes:
* Add ensureMetadataSpecCompliance helper that backfills the five
spec-required empty-state fields when iceberg-go omits them or emits
explicit JSON null. Apply it on every code path that writes
v*.metadata.json to S3 or returns metadata to clients
(handlers_table create-table, handlers_commit, commit_helpers
create-on-commit, plus MarshalJSON on LoadTableResult and
CommitTableResponse). Real values from non-empty tables are never
overwritten.
* Add catch-all regex entries to metadataFilePatterns accepting any
*.avro / *.metadata.json filename composed of [A-Za-z0-9._-]. The
Iceberg spec does not mandate filename format; the strict patterns
remain for documentation. Metadata-directory subdirectory rejection
and the data-file path validation are unchanged.
No upstream dependencies are forked: iceberg-go stays at v0.5.0 and
go.mod is untouched. The compliance layer can be removed once upstream
emits spec-compliant output.
Tests (all pass under `go test -race`):
- metadata_compliance_test.go: 5 cases covering missing fields,
preserved real values, explicit null, invalid JSON, empty input.
- iceberg_layout_test.go: 3 groups (16 subtests) covering real-world
manifest names from Flink/Spark/Iceberg, security boundary
(subdirectories, bad extensions), and data-file regression.
* fix(s3tables/iceberg): preserve metadata key order and keep config field stable
Two small follow-ups on the spec-compliance fix:
* ensureMetadataSpecCompliance now splices missing keys in at the byte
level just before the closing brace, so iceberg-go's struct-declared
key order survives the backfill. The previous unmarshal/remarshal
through map[string]json.RawMessage silently alphabetized every key in
the document, which is spec-legal but breaks byte-equality fixtures
and any downstream hashing of the persisted metadata. The slower
remarshal path is kept for the rare explicit-null replacement case.
* LoadTableResult.MarshalJSON now serializes Config without omitempty,
matching the struct field tag. The custom marshaler had silently
flipped the tag to ,omitempty, which made the "config" key disappear
from the response whenever s3Endpoint was unset (since
buildFileIOConfig returned an empty but non-nil Properties map).
Tests:
- PreservesOriginalKeyOrder pins the byte-level output against
iceberg-go's emitted shape; would have caught the alphabetization
regression.
- EmptyObjectBackfilled covers the {} -> sentinels-only case (no
leading comma).
- AllPresentReturnsSameBytes confirms the no-op path returns input
bytes unchanged, with whitespace intact.
- iceberg_layout_test pins the catch-all $ anchor: metadata/file.avro.txt
must still be rejected.
* fix(s3tables/iceberg): guard ensureMetadataSpecCompliance against top-level null
json.Unmarshal of a JSON `null` literal succeeds but leaves the map nil.
The current byte-append path no-ops gracefully on this input, but the
slow remarshal path would panic with "assignment to entry in nil map"
if the input ever combined `null` with the explicit-null detection. Add
an explicit nil-map short-circuit so the safety property is obvious
from the source, and a test that pins the contract.
* test(s3tables/iceberg): assert full byte equality in AllPresentReturnsSameBytes
The prefix check only caught a missing "{\n " opener, so the test
would have passed even if the function silently reordered keys or
collapsed whitespace later in the document. Switch to a full string
comparison so any future regression in the no-op path is loud.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
see https://blog.aqwari.net/xml-schema-go/ 1. go get aqwari.net/xml/cmd/xsdgen 2. Add EncodingType element for ListBucketResult in AmazonS3.xsd 3. xsdgen -o s3api_xsd_generated.go -pkg s3api AmazonS3.xsd 4. Remove empty Grantee struct in s3api_xsd_generated.go 5. Remove xmlns: sed s'/http:\/\/s3.amazonaws.com\/doc\/2006-03-01\/\ //' s3api_xsd_generated.go