docs(parquet-design): clarify Phase 1 wording and side-index placement

- Phase 1 "row group stats cache" was misleading — those stats already live in the Parquet footer. The Phase 1 win is a parsed-footer cache, not new index data. - The previous "/table/_seaweed_index/" path nested side indexes under the table prefix, which Iceberg orphan-file removal and snapshot expiration would surface or attempt to clean. Move side indexes to a separate system bucket / filer mount keyed by file identity, with the in-table-prefix layout listed only as a fallback.
2026-07-25 09:32:45 +00:00 · 2026-04-25 01:22:31 -07:00
parent afeb82ab87
commit 605907f522
1 changed files with 16 additions and 14 deletions
@@ -105,22 +105,25 @@ The user-visible layout remains normal Iceberg/Parquet layout:
      part-00002.parquet
 ```

-SeaweedFS may store side indexes internally or in a hidden namespace:
+SeaweedFS stores side indexes outside the table's S3 prefix so they do not appear in Iceberg listings or interfere with snapshot expiration / orphan-file removal. Two acceptable placements:
+
+- a separate system bucket or filer mount (preferred), keyed by `(table_uuid, file_path, file_identity)`
+- the same bucket under a top-level prefix that engines are configured to ignore (`/_sw_index/...`), never under the table prefix
+
+Logical layout per data file:

 ```text
-/table/_seaweed_index/
-  data/ds=2026-01-01/part-00001.parquet/
-    footer.cache
-    row_group_stats
-    page_index.timestamp
-    bloom.user_id
-    bitmap.tenant_id
-    btree.timestamp
-    inverted.message
-    vector.embedding.ivf
+<system-prefix>/<table_uuid>/data/ds=2026-01-01/part-00001.parquet/<identity>/
+  footer.cache
+  page_index.timestamp
+  bloom.user_id
+  bitmap.tenant_id
+  btree.timestamp
+  inverted.message
+  vector.embedding.ivf
 ```

-The original Parquet file is not modified.
+`<identity>` is derived from the index identity rules in [Index Consistency](#index-consistency). The original Parquet file is not modified.

 ## Logical View for Planning

@@ -586,8 +589,7 @@ Garbage collection can remove indexes for files no longer referenced by active s

 ### Phase 1: Metadata Acceleration

- Parquet footer cache
- row group stats cache
+- parsed-footer cache (row group stats already live in the footer; the win is avoiding repeated Thrift decode and offset lookups, not building new index data)
 - column chunk range optimization
 - expose file/range-level pushdown