Mirror of https://github.com/seaweedfs/seaweedfs.git (synced 2026-05-23 02:01:32 +00:00)
docs(parquet-design): clarify Phase 1 wording and side-index placement
- Phase 1 "row group stats cache" was misleading: those stats already live in the Parquet footer. The Phase 1 win is a parsed-footer cache, not new index data.
- The previous "/table/_seaweed_index/" path nested side indexes under the table prefix, which Iceberg orphan-file removal and snapshot expiration would surface or attempt to clean. Move side indexes to a separate system bucket / filer mount keyed by file identity, with the in-table-prefix layout listed only as a fallback.
@@ -105,22 +105,25 @@ The user-visible layout remains normal Iceberg/Parquet layout:
 part-00002.parquet
 ```

-SeaweedFS may store side indexes internally or in a hidden namespace:
+SeaweedFS stores side indexes outside the table's S3 prefix so they do not appear in Iceberg listings or interfere with snapshot expiration / orphan-file removal. Two acceptable placements:
+
+- a separate system bucket or filer mount (preferred), keyed by `(table_uuid, file_path, file_identity)`
+- the same bucket under a top-level prefix that engines are configured to ignore (`/_sw_index/...`), never under the table prefix
+
+Logical layout per data file:

 ```text
-/table/_seaweed_index/
-data/ds=2026-01-01/part-00001.parquet/
-footer.cache
-row_group_stats
-page_index.timestamp
-bloom.user_id
-bitmap.tenant_id
-btree.timestamp
-inverted.message
-vector.embedding.ivf
+<system-prefix>/<table_uuid>/data/ds=2026-01-01/part-00001.parquet/<identity>/
+footer.cache
+page_index.timestamp
+bloom.user_id
+bitmap.tenant_id
+btree.timestamp
+inverted.message
+vector.embedding.ivf
 ```

-The original Parquet file is not modified.
+`<identity>` is derived from the index identity rules in [Index Consistency](#index-consistency). The original Parquet file is not modified.

 ## Logical View for Planning

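The keying scheme in the new layout, `<system-prefix>/<table_uuid>/<file_path>/<identity>/<artifact>`, can be sketched roughly as follows. This is a minimal illustration, not SeaweedFS code: `DataFile`, `file_identity`, and `side_index_path` are hypothetical names, and hashing an ETag plus size is only an assumed stand-in for the identity rules defined in the Index Consistency section, which this hunk does not show.

```python
import hashlib
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class DataFile:
    table_uuid: str  # Iceberg table UUID
    file_path: str   # data file path relative to the table location
    etag: str        # ASSUMPTION: identity input, not taken from the design doc
    size: int        # ASSUMPTION: identity input, not taken from the design doc


def file_identity(f: DataFile) -> str:
    """Hypothetical identity: a short hash that changes when the file changes."""
    digest = hashlib.sha256(f"{f.etag}:{f.size}".encode()).hexdigest()
    return digest[:16]


def side_index_path(system_prefix: str, f: DataFile, artifact: str) -> str:
    """Build the side-index path for one artifact (e.g. 'footer.cache').

    The result lives under system_prefix, never under the table prefix.
    """
    rel = f.file_path.lstrip("/")  # a leading '/' would reset the joined path
    path = PurePosixPath(system_prefix) / f.table_uuid / rel / file_identity(f) / artifact
    return str(path)
```

Calling `side_index_path("/_sw_index", data_file, "bloom.user_id")` yields a path rooted at the system prefix. Because the identity segment is part of the key, a rewritten data file at the same path maps to a different directory, so stale indexes are never read by accident.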
@@ -586,8 +589,7 @@ Garbage collection can remove indexes for files no longer referenced by active s

 ### Phase 1: Metadata Acceleration

-- Parquet footer cache
-- row group stats cache
+- parsed-footer cache (row group stats already live in the footer; the win is avoiding repeated Thrift decode and offset lookups, not building new index data)
 - column chunk range optimization
 - expose file/range-level pushdown

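The parsed-footer cache named in this hunk can be pictured as a small LRU map from file identity to an already-decoded footer. The sketch below is only an assumption about its shape: `ParsedFooterCache` is a hypothetical name, and the `parse` callable stands in for whatever Thrift decoder the real implementation uses.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class ParsedFooterCache:
    """LRU cache of decoded Parquet footers, keyed by file identity.

    Hypothetical sketch: it stores no new index data, only the result of
    decoding footer bytes that already exist in the Parquet file.
    """

    def __init__(self, parse: Callable[[bytes], Any], capacity: int = 1024):
        self._parse = parse            # injected decoder (assumed, not a real API)
        self._capacity = capacity
        self._entries = OrderedDict()  # key -> parsed footer, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable, read_footer_bytes: Callable[[], bytes]) -> Any:
        """Return the parsed footer, decoding it only on a cache miss."""
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        parsed = self._parse(read_footer_bytes())
        self._entries[key] = parsed
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return parsed
```

Keying by `(file_path, file_identity)` rather than by path alone means a replaced file misses the cache and is decoded again. Passing the byte reader as a callable keeps the footer read itself lazy, so a hit costs neither I/O nor a decode.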