sstables: add LargeDataRecords metadata type (tag 13)

Add a new scylla metadata component LargeDataRecords (tag 13) that
stores per-SSTable top-N large data records.  Each record carries:
  - large_data_type (partition_size, row_size, cell_size, etc.)
  - binary serialized partition key and clustering key
  - column name (for cell records)
  - value (size in bytes)
  - element count (rows or collection elements, type-dependent)
  - range tombstones and dead rows (partition records only)

The struct uses disk_string<uint32_t> for key/name fields and is
serialized via the existing describe_type framework into the SSTable
Scylla metadata component.

Add JSON support in scylla-sstable and format documentation.
Author: Benny Halevy
Date:   2026-03-30 13:17:28 +03:00
parent  85e2c6f2a7
commit  d92cd42fe6
6 changed files with 128 additions and 5 deletions

@@ -33,6 +33,7 @@ in individual sections
| ext_timestamp_stats
| schema
| components_digests
| large_data_records

`sharding_metadata` (tag 1): describes what token sub-ranges are included in this
sstable. This is used, when loading the sstable, to determine which shard(s)
@@ -80,6 +81,11 @@ all SSTable component files that are checksummed during write. Each entry maps a
type (e.g., Data, Index, Filter, Statistics, etc.) to its CRC32 checksum. This allows
verifying the integrity of individual component files.

`large_data_records` (tag 13): an `array<large_data_record>` with the top-N individual large
data entries (partitions, rows, cells) found during the sstable write. Unlike `large_data_stats`
which only stores aggregate statistics, this records the actual keys and sizes so they survive
tablet/shard migration.

The [scylla sstable dump-scylla-metadata](https://github.com/scylladb/scylladb/blob/master/docs/operating-scylla/admin-tools/scylla-sstable.rst#dump-scylla-metadata) tool
can be used to dump the scylla metadata in JSON format.
@@ -203,3 +209,35 @@ in the statistics component, which lacks column names and other metadata. Unlike
the full schema stored in the system schema tables, it is not intended to be
comprehensive, but it contains enough information for tools like scylla-sstable
to parse an sstable in a self-sufficient manner.

## large_data_records subcomponent
```
large_data_records = record_count large_data_record*
record_count       = be32
large_data_record  = large_data_type partition_key clustering_key column_name value elements_count range_tombstones dead_rows
large_data_type    = be32     // same enum as in large_data_stats
partition_key      = string32 // binary serialized partition key (sstables::key::get_bytes())
clustering_key     = string32 // binary serialized clustering key (clustering_key_prefix::representation()), empty if N/A
column_name        = string32 // column name as text, empty for partition/row entries
value              = be64     // size in bytes (partition, row, or cell size depending on type)
elements_count     = be64     // type-dependent element count (see below)
range_tombstones   = be64     // number of range tombstones (partition_size records only, 0 otherwise)
dead_rows          = be64     // number of dead rows (partition_size records only, 0 otherwise)
string32           = string32_size byte*
string32_size      = be32
```

The large_data_records component holds individual top-N large data entries
(partitions, rows, cells) found during the sstable write. Unlike large_data_stats,
which only stores aggregate per-type statistics (max value, threshold, count above
threshold), large_data_records preserves the actual partition key, clustering key,
column name, and size for each above-threshold entry. This information is embedded
in the sstable file itself and therefore survives tablet/shard migration.

The elements_count field carries a type-dependent element count:
- For partition_size and rows_in_partition records: number of rows in the partition
- For cell_size and elements_in_collection records: number of elements in the collection (0 for non-collection cells)
- For row_size records: 0

The range_tombstones and dead_rows fields are meaningful only for
partition_size records and are zero for all other record types.
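The wire layout above is simple enough to decode by hand. The following is a minimal Python sketch of a decoder for the `large_data_records` subcomponent, assuming only the big-endian framing given by the grammar; the function and field names are illustrative, not part of the Scylla codebase:

```python
import struct

def read_string32(buf, off):
    # string32 = be32 length followed by that many raw bytes
    (n,) = struct.unpack_from(">I", buf, off)
    off += 4
    return buf[off:off + n], off + n

def read_record(buf, off):
    # One large_data_record, in the field order given by the grammar
    (large_data_type,) = struct.unpack_from(">I", buf, off)
    off += 4
    partition_key, off = read_string32(buf, off)
    clustering_key, off = read_string32(buf, off)
    column_name, off = read_string32(buf, off)
    # Four consecutive be64 fields: value, elements_count,
    # range_tombstones, dead_rows
    value, elements, range_tombstones, dead_rows = \
        struct.unpack_from(">QQQQ", buf, off)
    off += 32
    return {
        "large_data_type": large_data_type,
        "partition_key": partition_key,
        "clustering_key": clustering_key,
        "column_name": column_name.decode(),
        "value": value,
        "elements_count": elements,
        "range_tombstones": range_tombstones,
        "dead_rows": dead_rows,
    }, off

def read_large_data_records(buf):
    # large_data_records = be32 record_count followed by the records
    (count,) = struct.unpack_from(">I", buf, 0)
    off = 4
    records = []
    for _ in range(count):
        rec, off = read_record(buf, off)
        records.append(rec)
    return records
```

Empty `string32` fields (e.g. the clustering key of a partition-level record) decode naturally to zero-length byte strings, so a single record shape covers all `large_data_type` values.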