docs: add SeaweedFS-aware Parquet pushdown design draft

Draft design note for keeping Parquet files compatible with existing engines (Spark, Trino, DuckDB, PyArrow, Iceberg) while building auxiliary side indexes and a pushdown API for SeaweedFS-aware clients. Covers physical layout, side index types (footer cache, zone map, page index, bloom, bitmap, B-tree, inverted, vector), pushdown API sketch, Iceberg delete handling, and a phased rollout plan.
2026-07-20 06:52:24 +00:00 · 2026-04-25 01:15:42 -07:00
parent 5eead9409a
commit ca4c8d977d
1 changed files with 614 additions and 0 deletions
@@ -0,0 +1,614 @@
+# SeaweedFS-Aware Parquet Layout and Pushdown Design
+
+## Status
+
+Draft design note.
+
+## Goal
+
+Keep Parquet files fully compatible with existing engines such as Spark, Trino, DuckDB, PyArrow, and Iceberg, while allowing SeaweedFS-aware clients to take advantage of storage-side metadata, indexes, and pushdown execution.
+
+The design avoids replacing Parquet with a custom format. Instead, SeaweedFS stores normal Parquet objects and builds auxiliary metadata/indexes beside them.
+
+```text
+Standard Parquet compatibility
+ SeaweedFS side indexes
+ pushdown-aware connectors
+= faster scans, filtering, and hybrid vector search
+```
+
+## Non-Goals
+
+- Do not require changes to the Parquet file format.
+- Do not require existing Spark/Trino/DuckDB readers to change.
+- Do not make SeaweedFS the only way to read the data.
+- Do not rewrite all Parquet data into Lance or another vector-native format.
+
+## Background
+
+Parquet already contains useful metadata:
+
+- file footer
+- row group metadata
+- column chunk offsets
+- column statistics
+- optional page indexes
+- optional bloom filters
+
+Typical readers use this flow:
+
+```text
+1. Read Parquet footer
+2. Discover row groups and column chunks
+3. Prune row groups using statistics
+4. Read selected column chunks/pages
+5. Decode/filter/project rows
+```
+
+SeaweedFS can optimize this process by caching metadata, exposing smarter range reads, and building additional side indexes.
+
+## High-Level Architecture
+
+```text
+Spark / Trino / DuckDB / PyArrow / Custom Query Engine
+        |
+        | S3 API or SeaweedFS Pushdown API
+        v
+SeaweedFS S3 Gateway / Filer
+        |
+        | metadata/index lookup
+        v
+SeaweedFS Index Metadata
+        |
+        | range reads / local filtering
+        v
+SeaweedFS Volume Servers
+        |
+        v
+Parquet data needles
+```
+
+There are two access modes:
+
+### 1. Standard Object Access
+
+Existing engines read Parquet through the normal S3-compatible API.
+
+```text
+engine -> SeaweedFS S3 -> Parquet object ranges
+```
+
+This preserves compatibility.
+
+### 2. SeaweedFS-Aware Pushdown Access
+
+Optimized connectors call a SeaweedFS pushdown API before reading data.
+
+```text
+engine -> Iceberg planning -> SeaweedFS pushdown API -> candidate files/ranges/rows
+```
+
+The engine still reads standard Parquet data, but reads much less of it.
+
+## Physical Layout
+
+The user-visible layout remains normal Iceberg/Parquet layout:
+
+```text
+/table/
+  metadata/
+    v1.metadata.json
+    snapshots/...
+  data/
+    ds=2026-01-01/
+      part-00001.parquet
+      part-00002.parquet
+```
+
+SeaweedFS may store side indexes internally or in a hidden namespace:
+
+```text
+/table/_seaweed_index/
+  data/ds=2026-01-01/part-00001.parquet/
+    footer.cache
+    row_group_stats
+    page_index.timestamp
+    bloom.user_id
+    bitmap.tenant_id
+    btree.timestamp
+    inverted.message
+    vector.embedding.ivf
+```
+
+The original Parquet file is not modified.
+
+## SeaweedFS-Aware Internal Mapping
+
+A Parquet file can be represented internally as:
+
+```text
+Parquet file
+ ├── footer metadata
+ ├── row group 0
+ │    ├── column chunk: id
+ │    ├── column chunk: timestamp
+ │    ├── column chunk: tenant_id
+ │    └── column chunk: payload
+ └── row group 1
+      ├── column chunk: id
+      ├── column chunk: timestamp
+      ├── column chunk: tenant_id
+      └── column chunk: payload
+```
+
+SeaweedFS needle mapping may be:
+
+```text
+footer                         -> cached metadata needle
+row_group_0 / column_id         -> data needle(s)
+row_group_0 / column_timestamp  -> data needle(s)
+row_group_0 / column_tenant_id  -> data needle(s)
+row_group_0 / column_payload    -> data needle(s)
+row_group_1 / column_id         -> data needle(s)
+...
+```
+
+This does not require physically splitting the Parquet file for compatibility. SeaweedFS can also map byte ranges inside one object to the underlying needles.
+
+## Side Index Types
+
+### 1. Footer Cache
+
+Cache the parsed Parquet footer.
+
+Purpose:
+
+- avoid repeated footer reads
+- speed up planning
+- expose row group and column chunk offsets quickly
+
+Index content:
+
+```text
+file path
+file size
+last modified / etag
+schema
+row groups
+column chunks
+statistics
+byte offsets
+```
+
+### 2. Row Group Zone Map
+
+Stores min/max/null count per row group and column.
+
+Useful for predicates such as:
+
+```sql
+WHERE timestamp BETWEEN t1 AND t2
+WHERE id > 1000
+```
+
+Skip condition:
+
+```text
+row_group.max < predicate_min OR row_group.min > predicate_max
+```
+
+### 3. Page-Level Index
+
+Stores min/max and offsets per page.
+
+Useful when row groups are large and predicates are selective.
+
+Example:
+
+```text
+row_group_3 / column_timestamp:
+  page_0 min=2026-01-01 max=2026-01-02
+  page_1 min=2026-01-03 max=2026-01-04
+  page_2 min=2026-01-05 max=2026-01-06
+```
+
+### 4. Bloom Filter Index
+
+Useful for equality lookups.
+
+Example:
+
+```sql
+WHERE user_id = 'u123'
+```
+
+If the bloom filter says the value is absent, SeaweedFS skips the row group or page.
+
+### 5. Bitmap Index
+
+Useful for low- or medium-cardinality columns.
+
+Examples:
+
+```sql
+WHERE tenant_id = 'abc'
+WHERE status IN ('failed', 'pending')
+```
+
+Bitmap operations:
+
+```text
+tenant_id=abc AND status=failed
+```
+
+### 6. B-Tree / Range Index
+
+Useful for range predicates and point lookups on sortable columns.
+
+Examples:
+
+```sql
+WHERE timestamp > t
+WHERE id BETWEEN a AND b
+```
+
+The index maps values or ranges to:
+
+```text
+file
+row group
+page
+optional row ids
+```
+
+### 7. Inverted Index
+
+Useful for logs, text, tags, and JSON fields.
+
+Examples:
+
+```sql
+WHERE message CONTAINS 'timeout'
+WHERE tags CONTAINS 'gpu'
+```
+
+Index structure:
+
+```text
+token -> candidate row groups/pages/row ids
+```
+
+### 8. Vector Index
+
+Useful for embedding columns stored inside Parquet.
+
+Example:
+
+```sql
+ORDER BY embedding <-> query_vector
+LIMIT 20
+```
+
+Possible index types:
+
+- IVF
+- IVF+PQ
+- HNSW
+- scalar quantization
+
+Index layout:
+
+```text
+/table/_seaweed_index/.../vector.embedding.ivf/
+  centroids
+  list_000001
+  list_000002
+  pq_codebooks
+```
+
+This allows SeaweedFS to perform approximate nearest neighbor search before reading Parquet columns.
+
+## Pushdown Types
+
+### File Pruning
+
+Use Iceberg metadata plus SeaweedFS indexes to skip files.
+
+### Row Group Pruning
+
+Use row group statistics, bloom filters, and side indexes.
+
+### Page Pruning
+
+Use page-level indexes to read only matching pages.
+
+### Column Pruning
+
+Read only requested columns and predicate columns.
+
+Example:
+
+```sql
+SELECT id, image_path
+WHERE tenant_id = 'abc'
+ORDER BY embedding <-> ?
+LIMIT 10
+```
+
+Only these columns are needed:
+
+```text
+tenant_id
+embedding
+id
+image_path
+```
+
+### Scalar Predicate Pushdown
+
+Evaluate simple predicates using indexes:
+
+```text
+=, !=, <, <=, >, >=
+IN
+BETWEEN
+IS NULL
+CONTAINS
+```
+
+### Vector Search Pushdown
+
+Search vector side indexes on volume servers or index-serving nodes.
+
+Flow:
+
+```text
+1. Choose candidate vector partitions
+2. Route to volume servers holding vector index/data needles
+3. Compute local top-K
+4. Merge global top-K
+5. Fetch projected Parquet columns
+```
+
+### Hybrid Scalar + Vector Pushdown
+
+Example:
+
+```sql
+SELECT image_path
+WHERE tenant_id = 'abc'
+  AND timestamp >= '2026-01-01'
+ORDER BY embedding <-> query_vector
+LIMIT 20
+```
+
+Execution:
+
+```text
+1. tenant_id bitmap filters candidate rows/pages
+2. timestamp zone map or B-tree narrows candidates
+3. vector index searches only candidate partitions
+4. delete masks are applied if Iceberg delete files exist
+5. top-K candidates are returned
+6. projected columns are fetched
+```
+
+## API Sketch
+
+### Pushdown Request
+
+```go
+type ParquetPushdownRequest struct {
+    Table       string
+    SnapshotId  int64
+    Files       []string
+    Columns     []string
+    Predicate   []byte
+    VectorQuery *VectorQuery
+    Limit       int
+}
+
+type VectorQuery struct {
+    Column string
+    Vector []float32
+    Metric string // l2, cosine, dot
+    TopK   int
+    NProbe int
+}
+```
+
+### Pushdown Response
+
+```go
+type ParquetPushdownResponse struct {
+    FileRanges []FileRange
+    RowGroups  []RowGroupRef
+    Pages      []PageRef
+    RowIds     []RowRef
+    Scores     []float32
+    Stats      PushdownStats
+}
+
+type FileRange struct {
+    File   string
+    Offset int64
+    Length int64
+}
+
+type RowGroupRef struct {
+    File     string
+    RowGroup int
+}
+
+type PageRef struct {
+    File     string
+    RowGroup int
+    Column   string
+    Page     int
+    Offset   int64
+    Length   int64
+}
+
+type RowRef struct {
+    File     string
+    RowGroup int
+    RowId    int64
+}
+```
+
+## Connector Behavior
+
+### Existing Connector Path
+
+```text
+Spark/Trino/DuckDB
+  -> Iceberg planning
+  -> Parquet scan
+  -> SeaweedFS S3 range reads
+```
+
+### Optimized Connector Path
+
+```text
+Spark/Trino/DuckDB connector
+  -> Iceberg planning
+  -> SeaweedFS pushdown API
+  -> receive candidate row groups/pages/ranges/row ids
+  -> read only needed Parquet ranges
+```
+
+## Index Consistency
+
+Indexes must be tied to immutable file identity:
+
+```text
+file path
+file size
+etag/version
+modification time
+Iceberg snapshot id
+content hash, optional
+```
+
+If the Parquet file changes, indexes are invalidated and rebuilt.
+
+For Iceberg tables, this is easier because data files are generally immutable. New snapshots add or remove files rather than modifying files in place.
+
+## Handling Iceberg Deletes
+
+Iceberg may use:
+
+- position delete files
+- equality delete files
+
+SeaweedFS-aware pushdown should apply delete information when available.
+
+Possible strategy:
+
+```text
+1. Iceberg catalog returns data files and delete files
+2. SeaweedFS builds delete masks per data file
+3. Pushdown execution removes deleted rows from candidate sets
+```
+
+Delete masks can be stored as side indexes:
+
+```text
+/table/_seaweed_index/.../deletes.position.bitmap
+/table/_seaweed_index/.../deletes.equality.bitmap
+```
+
+## Compaction and Maintenance
+
+SeaweedFS can run background index maintenance:
+
+```text
+new Parquet file detected
+  -> parse footer
+  -> build row group stats
+  -> optionally build page/bloom/bitmap/vector indexes
+  -> publish index metadata
+```
+
+For Iceberg compaction:
+
+```text
+old files removed from snapshot
+new compacted files added
+old indexes become unreachable
+new indexes are built lazily or eagerly
+```
+
+Garbage collection can remove indexes for files no longer referenced by active snapshots.
+
+## Rollout Plan
+
+### Phase 1: Metadata Acceleration
+
+- Parquet footer cache
+- row group stats cache
+- column chunk range optimization
+- expose file/range-level pushdown
+
+### Phase 2: Scalar Indexes
+
+- bloom filter side index
+- bitmap index
+- B-tree/range index
+- basic predicate API
+
+### Phase 3: Page-Level Pruning
+
+- page statistics
+- page offset map
+- return page-level ranges
+
+### Phase 4: Iceberg Delete Integration
+
+- position delete masks
+- equality delete masks
+- snapshot-aware index validity
+
+### Phase 5: Vector Indexes
+
+- IVF side index for embedding columns
+- local top-K search on volume servers
+- global top-K merge
+- hybrid scalar + vector search
+
+### Phase 6: Query Engine Integrations
+
+- SeaweedFS-aware DuckDB extension
+- Trino connector pushdown
+- Spark DataSource/Iceberg integration
+
+## Benefits
+
+- keeps Parquet compatibility
+- accelerates existing Iceberg data
+- avoids full file scans
+- reduces network and disk I/O
+- enables vector search over Parquet data
+- lets SeaweedFS become more than object storage
+
+## Tradeoffs
+
+- side indexes require storage and maintenance
+- consistency must be carefully tied to immutable file identity
+- pushdown requires engine connector work
+- page-level and row-level filtering are more complex than row-group pruning
+- vector search on Parquet is less natural than LanceDB, but easier to adopt for existing lake data
+
+## Summary
+
+SeaweedFS should keep Parquet files standard and build auxiliary indexes beside them.
+
+The first useful version can focus on footer caching and row group pruning. Later versions can add scalar, text, delete, and vector indexes.
+
+This creates a practical path:
+
+```text
+Iceberg + Parquet compatibility today
+ SeaweedFS-aware pushdown
+ optional vector search acceleration
+```