Store and retrieve the optional extended timestamp statistics (min_live_timestamp and min_live_row_marker_timestamp) in the scylla_metadata component.

Note that no cluster feature is needed to store these attributes, because the scylla_metadata on-disk format is extensible: new versions can read old sstables and see that the extra stats are missing, and old versions can read new sstables because they ignore unknown scylla metadata section types.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
147 lines
5.4 KiB
Markdown
# File format of the Scylla.db sstable component

The `Scylla.db` component (present in a file named like `mc-223-big-Scylla.db`)
contains assorted Scylla-only metadata. Its presence indicates the sstable was
created by Scylla (or some Scylla-aware creator). Non-Scylla consumers will ignore it.

The file is small and intended to be processed in-memory.

## Main structure

The main structure is that of an unordered set of subcomponents. Each subcomponent
is prefixed with a be32 tag that indicates its type and with its be32 serialized
size (so unknown subcomponents can be skipped).

    scylla_db = subcomponent_count (tag serialized_size subcomponent)*
    subcomponent_count = be32
    serialized_size = be32
    tag = be32

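The grammar above can be walked with a few lines of Python. This is a minimal sketch, not Scylla's own reader: the function name `split_subcomponents` is hypothetical, and it assumes that `serialized_size` counts only the bytes of the subcomponent body (not the tag or the size field itself).

```python
import struct


def split_subcomponents(data: bytes) -> dict[int, bytes]:
    """Split a Scylla.db image into {tag: raw subcomponent bytes}.

    Assumption: serialized_size covers the subcomponent body only.
    """
    (count,) = struct.unpack_from(">I", data, 0)  # subcomponent_count = be32
    offset = 4
    subcomponents = {}
    for _ in range(count):
        # tag = be32, serialized_size = be32
        tag, size = struct.unpack_from(">II", data, offset)
        offset += 8
        body = data[offset:offset + size]
        if len(body) != size:
            raise ValueError(f"subcomponent with tag {tag} is truncated")
        # Unknown tags are kept as opaque bytes; the size lets us skip them.
        subcomponents[tag] = body
        offset += size
    return subcomponents
```

Because every subcomponent carries its size, a reader that does not recognize a tag can step over it without understanding its contents.
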
## Subcomponents and tag values

The following subcomponents are recognized. They are described in more detail
in individual sections.

    subcomponent = sharding_metadata
                 | features
                 | extension_attributes
                 | run_identifier
                 | large_data_stats
                 | sstable_origin
                 | scylla_build_id
                 | scylla_version
                 | ext_timestamp_stats

`sharding_metadata` (tag 1): describes what token sub-ranges are included in this
sstable. This is used, when loading the sstable, to determine which shard(s)
it occupies.

`features` (tag 2): a set of boolean flags that describe the sstable.

`extension_attributes` (tag 3): a `map<string, string>` with additional attributes.

`run_identifier` (tag 4): a uuid that is the same for all sstables in the same run
(and different for sstables in different runs).

`large_data_stats` (tag 5): a `map<large_data_type, large_data_stats_entry>` with statistics
about large data entities in the sstable.

`sstable_origin` (tag 6): a string describing the origin of the
sstable ("memtable" for memtable flush, "garbage collection" for
compaction, etc.).

`scylla_build_id` (tag 7): a string containing the build id of the
Scylla executable that created the sstable.

`scylla_version` (tag 8): a string containing the version of the
Scylla executable that created the sstable.

`ext_timestamp_stats` (tag 9): a `map<ext_timestamp_stats_type, int64_t>` with statistics
about timestamps in the sstable, such as `min_live_timestamp` and `min_live_row_marker_timestamp`.

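For tooling that inspects `Scylla.db` files, the tag values listed above can be captured in an enumeration. The sketch below is illustrative only: the class name `ScyllaMetadataTag` and the helper `describe_tag` are hypothetical, while the numeric values are the ones given in this section.

```python
from enum import IntEnum


class ScyllaMetadataTag(IntEnum):
    """Tag values from this document (the class name is hypothetical)."""
    SHARDING_METADATA = 1
    FEATURES = 2
    EXTENSION_ATTRIBUTES = 3
    RUN_IDENTIFIER = 4
    LARGE_DATA_STATS = 5
    SSTABLE_ORIGIN = 6
    SCYLLA_BUILD_ID = 7
    SCYLLA_VERSION = 8
    EXT_TIMESTAMP_STATS = 9


def describe_tag(tag: int) -> str:
    """Return the subcomponent name for a tag, or a placeholder if unknown."""
    try:
        return ScyllaMetadataTag(tag).name.lower()
    except ValueError:
        # A tag this reader does not know about; it should be skipped, not rejected.
        return f"unknown({tag})"
```

Mapping unrecognized tags to a placeholder rather than raising mirrors the format's intent that unknown subcomponents are skipped.
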
## sharding_metadata subcomponent

    sharding_metadata = token_range_count token_range*
    token_range_count = be32
    token_range = left_token_bound right_token_bound
    left_token_bound = token_bound
    right_token_bound = token_bound
    token_bound = exclusive_flag token
    exclusive_flag = byte    // 0=inclusive, 1=exclusive
    token = token_size byte*
    token_size = be16

Sharding metadata is a sorted list of disjoint token ranges. Each token range
consists of a left bound and a right bound; either bound may be inclusive or
exclusive. The tokens are interpreted according to the partitioner.

The sstable contains no partitions whose token is outside the ranges described by
sharding_metadata.

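As a rough illustration of the grammar above, the following sketch decodes a `sharding_metadata` body into a list of bound pairs. The names `TokenBound` and `parse_sharding_metadata` are hypothetical, and tokens are left as raw bytes because their interpretation depends on the partitioner.

```python
import struct
from typing import NamedTuple


class TokenBound(NamedTuple):
    exclusive: bool  # exclusive_flag: 0=inclusive, 1=exclusive
    token: bytes     # raw token bytes, interpreted by the partitioner


def parse_sharding_metadata(body: bytes) -> list[tuple[TokenBound, TokenBound]]:
    """Decode a sharding_metadata body into (left_bound, right_bound) pairs."""

    def read_bound(offset: int) -> tuple[TokenBound, int]:
        # exclusive_flag = byte, token_size = be16
        exclusive, token_size = struct.unpack_from(">BH", body, offset)
        offset += 3
        token = body[offset:offset + token_size]
        if len(token) != token_size:
            raise ValueError("token is truncated")
        return TokenBound(bool(exclusive), token), offset + token_size

    (count,) = struct.unpack_from(">I", body, 0)  # token_range_count = be32
    offset = 4
    ranges = []
    for _ in range(count):
        left, offset = read_bound(offset)
        right, offset = read_bound(offset)
        ranges.append((left, right))
    return ranges
```
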
## features subcomponent

    features = be64    // interpreted as a set of bits

bit 0: NonCompoundPIEntries (if set, indicates the sstable was generated by
Scylla with issue #2993 fixed)

bit 1: NonCompoundRangeTombstones (if set, indicates the sstable was generated by
Scylla with issue #2986 fixed)

bit 2: ShadowableTombstones (if set, indicates the sstable was generated by
Scylla with issue #3885 fixed)

bit 3: CorrectStaticCompact (if set, indicates the sstable was generated by
Scylla with issue #4139 fixed)

bit 4: CorrectEmptyCounters (if set, indicates the sstable was generated by
Scylla with issue #4363 fixed)

bit 5: CorrectUDTsInCollections (if set, indicates the sstable was generated by
Scylla with issue #6130 fixed)

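The bit assignments above can be decoded as in the sketch below. It assumes that "bit 0" refers to the least significant bit of the be64 value, which this document does not state explicitly; the names `FEATURE_BITS` and `decode_features` are hypothetical.

```python
import struct

# Bit positions as listed in this document.
FEATURE_BITS = {
    0: "NonCompoundPIEntries",
    1: "NonCompoundRangeTombstones",
    2: "ShadowableTombstones",
    3: "CorrectStaticCompact",
    4: "CorrectEmptyCounters",
    5: "CorrectUDTsInCollections",
}


def decode_features(body: bytes) -> set[str]:
    """Return the names of the feature bits set in a features body.

    Assumption: bit 0 is the least significant bit of the be64 value.
    Bits without a known name are ignored.
    """
    (mask,) = struct.unpack(">Q", body)  # features = be64
    return {name for bit, name in FEATURE_BITS.items() if mask & (1 << bit)}
```
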
## extension_attributes subcomponent

    extension_attributes = extension_attribute_count extension_attribute*
    extension_attribute_count = be32
    extension_attribute = extension_attribute_key extension_attribute_value
    extension_attribute_key = string32
    extension_attribute_value = string32
    string32 = string32_size byte*
    string32_size = be32

There are currently no defined attributes.

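Although no attributes are defined yet, the layout is simple enough to sketch a reader for. The function name `parse_extension_attributes` is hypothetical, and keys and values are returned as raw bytes because this document does not specify a character encoding.

```python
import struct


def parse_extension_attributes(body: bytes) -> dict[bytes, bytes]:
    """Decode an extension_attributes body into a {key: value} mapping."""

    def read_string32(offset: int) -> tuple[bytes, int]:
        (size,) = struct.unpack_from(">I", body, offset)  # string32_size = be32
        offset += 4
        value = body[offset:offset + size]
        if len(value) != size:
            raise ValueError("string32 is truncated")
        return value, offset + size

    (count,) = struct.unpack_from(">I", body, 0)  # extension_attribute_count = be32
    offset = 4
    attributes = {}
    for _ in range(count):
        key, offset = read_string32(offset)
        value, offset = read_string32(offset)
        attributes[key] = value
    return attributes
```
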
## run_identifier subcomponent

    run_identifier = uuid
    uuid = uuid_high_bits uuid_low_bits
    uuid_high_bits = be64
    uuid_low_bits = be64

If the run_identifier subcomponent is present, the sstable is part of a run.
All sstables with the same run_identifier belong to the same run. They are
guaranteed to be disjoint (non-overlapping) in their partition keys.

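A run identifier can be turned into a standard UUID object as sketched below. This assumes that `uuid_high_bits` holds the most significant 64 bits of the 128-bit UUID value and `uuid_low_bits` the least significant 64 bits; the function name `parse_run_identifier` is hypothetical.

```python
import struct
import uuid


def parse_run_identifier(body: bytes) -> uuid.UUID:
    """Decode a run_identifier body (two be64 halves) into a uuid.UUID.

    Assumption: the high bits are the most significant half of the UUID.
    """
    high, low = struct.unpack(">QQ", body)  # uuid_high_bits, uuid_low_bits
    return uuid.UUID(int=(high << 64) | low)
```

Two sstables belong to the same run exactly when the values returned for them compare equal.
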
## large_data_stats subcomponent

    large_data_stats = large_data_count large_data_pair*
    large_data_count = be32
    large_data_pair = large_data_type large_data_stats_entry
    large_data_type = partition_size | row_size | cell_size | rows_in_partition | elements_in_collection
    partition_size = be32(1)            // partition size, in bytes
    row_size = be32(2)                  // row size, in bytes
    cell_size = be32(3)                 // cell size, in bytes
    rows_in_partition = be32(4)         // number of rows in a partition
    elements_in_collection = be32(5)    // number of elements in a collection
    large_data_stats_entry = max_value threshold above_threshold
    max_value = be64
    threshold = be64
    above_threshold = be32

The large_data_stats subcomponent holds statistics about partition,
row, and cell sizes, about the number of rows in a partition, and about
the number of elements in a collection.
For each entry, it keeps the largest value for the entry type,
the respective large_data threshold, and the number of entities
that are above the threshold.

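The fixed-size pairs in this subcomponent lend themselves to a single `struct` format. The sketch below is one possible reader, not Scylla's implementation: the names `LARGE_DATA_TYPES`, `LargeDataStatsEntry`, and `parse_large_data_stats` are hypothetical, and the be64 and be32 fields are assumed to be unsigned.

```python
import struct
from typing import NamedTuple

# large_data_type values as listed in the grammar above.
LARGE_DATA_TYPES = {
    1: "partition_size",
    2: "row_size",
    3: "cell_size",
    4: "rows_in_partition",
    5: "elements_in_collection",
}


class LargeDataStatsEntry(NamedTuple):
    max_value: int        # largest value seen for this entry type
    threshold: int        # the large_data threshold in effect
    above_threshold: int  # number of entities above the threshold


# large_data_type (be32) + max_value (be64) + threshold (be64) + above_threshold (be32)
# Assumption: all fields are unsigned.
PAIR = struct.Struct(">IQQI")


def parse_large_data_stats(body: bytes) -> dict[str, LargeDataStatsEntry]:
    """Decode a large_data_stats body into {type name: entry}."""
    (count,) = struct.unpack_from(">I", body, 0)  # large_data_count = be32
    stats = {}
    for i in range(count):
        type_id, max_value, threshold, above = PAIR.unpack_from(body, 4 + i * PAIR.size)
        name = LARGE_DATA_TYPES.get(type_id, f"unknown({type_id})")
        stats[name] = LargeDataStatsEntry(max_value, threshold, above)
    return stats
```
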