Shared-dictionary compression for SSTables
Overview
Scylla now supports dictionary-based compression for SSTables, which improves compression ratios by sharing compression dictionaries across compression chunks.
Background
Traditional SSTable compression in Scylla works on a chunk-by-chunk basis. Each chunk is compressed independently, which means patterns that occur across chunks cannot be effectively leveraged for better compression.
Dictionary-based compression addresses this limitation by training a dictionary on representative data samples and sharing it across all compression chunks, giving the compression algorithm additional context to reference matches against.
How it works
- Dictionary training: Scylla samples data chunks from across the cluster to build an optimized compression dictionary for a specific table.
- Dictionary distribution: Dictionaries are stored in the system.dicts table (managed by group0). Each table has its own (possibly absent) row there.
- Shared compression: When opening an SSTable for writing, if the table has compression dictionaries enabled, the current recommended dictionary for the table (i.e. the one in system.dicts) is used to compress the data, and is written into the header of CompressionInfo.db.
- Decompression: When opening an SSTable for reading, the dictionary blob is loaded from CompressionInfo.db and used to decompress the data.
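The write/read flow above can be sketched in miniature with Python's zlib, which also supports preset dictionaries. This is illustrative only: Scylla uses lz4/zstd, and real dictionaries are trained from samples rather than hand-written.

```python
import zlib

# Illustrative stand-in for a trained dictionary (Scylla trains these
# from sampled data; this hand-written blob just primes the match window).
DICTIONARY = b'{"user": "", "event": ""}'

def compress_chunks(chunks, dictionary):
    """Compress each chunk independently, but against a shared dictionary."""
    out = []
    for chunk in chunks:
        c = zlib.compressobj(level=9, zdict=dictionary)
        out.append(c.compress(chunk) + c.flush())
    return out

def decompress_chunks(blobs, dictionary):
    """Each chunk remains independently decompressible, given the dictionary."""
    out = []
    for blob in blobs:
        d = zlib.decompressobj(zdict=dictionary)
        out.append(d.decompress(blob) + d.flush())
    return out

chunks = [b'{"user": "alice", "event": "login"}',
          b'{"user": "bob", "event": "logout"}']
compressed = compress_chunks(chunks, DICTIONARY)
assert decompress_chunks(compressed, DICTIONARY) == chunks
```

The key property mirrored here is that chunks stay individually addressable (random reads need to decompress only one chunk), while cross-chunk redundancy is still exploited via the shared dictionary.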
Implementation details
New persistent data structures
There are two new persistent data structures involved:
- An extension to the SSTable format: CompressionInfo.db gains two new compressor IDs (lz4 with dicts, zstd with dicts) and new "compressor options" which store the dictionary blob used by this SSTable.
- An extension to system.dicts, which (in addition to the RPC compression dict) now also stores the current recommended SSTable compression dict for each table.
SSTable format extension
The structure of the format isn't affected. Instead, we add two new compressor
identifiers (LZ4WithDictsCompressor and ZstdWithDictsCompressor), which
use the "compressor options" map in CompressionInfo.db to store the dict.
Since the structure isn't affected, we don't increment the SSTable version for this. Naturally, dict-compressed SSTables won't be readable by older versions of Scylla (or by Cassandra), but those versions should complain about an unknown compressor rather than consider the SSTable malformed.
If a downgrade is necessary, it can be done by disabling dictionaries
(through schema, or by setting sstable_compression_dictionaries_enable_writing
to false on all nodes) and rewriting the SSTables
(with nodetool upgradesstables -a or similar).
The extension is hidden behind the SSTABLE_COMPRESSION_DICTS cluster feature.
New entries in CompressionInfo.db
We store the dictionary blob in the "options" map in the header of
CompressionInfo.db, under the keys .dictionary.00000000,
.dictionary.00000001, ...
(The blob is split into several parts because the "options" values have 16-bit lengths, and dictionaries are usually bigger than the 64 KiB that a 16-bit length can describe.)
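That splitting can be sketched as follows. The key format is taken from the text above; the helper names, the 8-digit zero-padding width, and the reassembly-by-sorted-keys are illustrative assumptions, not Scylla's actual C++ code.

```python
MAX_OPTION_LEN = 0xFFFF  # largest value a 16-bit length field can describe

def split_dictionary(blob: bytes) -> dict[str, bytes]:
    """Split a dictionary blob into numbered ".dictionary.NNNNNNNN" options."""
    return {
        f".dictionary.{i:08d}": blob[off:off + MAX_OPTION_LEN]
        for i, off in enumerate(range(0, len(blob), MAX_OPTION_LEN))
    }

def join_dictionary(options: dict[str, bytes]) -> bytes:
    """Reassemble the blob from the options map, in key order."""
    keys = sorted(k for k in options if k.startswith(".dictionary."))
    return b"".join(options[k] for k in keys)
```

Zero-padded numeric suffixes keep lexicographic key order identical to numeric order, so reassembly is a simple sorted concatenation.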
system.dicts extension
If a system.dicts partition with key sstables/{table_uuid} exists,
it provides the current recommended dict for this table, which is used
to compress new SSTables.
If a table doesn't have a matching row in system.dicts, then there's no
current dictionary for this table, and new SSTables should fall back to
dictionaryless compression.
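In pseudocode terms, the lookup-with-fallback looks like this. Here system.dicts is modeled as a plain mapping; ZstdWithDictsCompressor is one of the new IDs named above, while the plain-zstd ID shown for the fallback is a placeholder, not necessarily Scylla's real identifier.

```python
def pick_sstable_compressor(system_dicts: dict[str, bytes], table_uuid: str):
    """Return (compressor_id, dictionary) for a new SSTable of this table."""
    blob = system_dicts.get(f"sstables/{table_uuid}")
    if blob is None:
        # No recommended dict for this table: fall back to
        # dictionary-less compression ("ZstdCompressor" is a placeholder ID).
        return ("ZstdCompressor", None)
    return ("ZstdWithDictsCompressor", blob)
```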
Compressor factory
With "traditional" compression, a compressor was just a function in the code, not involving any data. This meant that the creation of compressors was cheap and easy.
But with dictionaries involved, each unique compressor has its own RAM and cache footprint. Therefore we want to deduplicate compressors as much as possible.
For this, we create new compressors through a central "compressor factory" which contacts other shards and ensures that there are no redundant copies of dictionaries in memory.
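A single-process sketch of the deduplication idea is below. The real factory is C++ and additionally coordinates across shards; the class and method names here are hypothetical, and zlib stands in for lz4/zstd.

```python
import hashlib
import zlib

class DictCompressor:
    """Holds one shared dictionary; compresses chunks against it."""
    def __init__(self, dictionary: bytes):
        self.dictionary = dictionary

    def compress(self, data: bytes) -> bytes:
        c = zlib.compressobj(zdict=self.dictionary)
        return c.compress(data) + c.flush()

class CompressorFactory:
    """Deduplicates compressors by dictionary digest, so two SSTables
    written with the same dict share one in-memory copy of it."""
    def __init__(self):
        self._cache = {}

    def get(self, dictionary: bytes) -> DictCompressor:
        key = hashlib.sha256(dictionary).digest()
        if key not in self._cache:
            self._cache[key] = DictCompressor(dictionary)
        return self._cache[key]
```

Keying the cache by a content digest (rather than by table) means that two tables which happen to share a dictionary blob also share the in-memory copy.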
Automatic training
To create a dictionary, some training data is needed. This means that a dictionary can't be created immediately for a new table; some data must accumulate in it first.
Also, the dataset can change over time, leaving an existing dictionary outdated; in that case it's worth retraining.
But it would be impractical to manually pick the right moments to train new
dicts. So there's sstable_dict_autotrainer, which periodically trains a
new dict whenever a dict-aware table seems to deserve one.
Refer to the implementation for up-to-date details.
New interfaces
- To enable dictionaries for a given table, the user sets its sstable_compression entry in the schema to one of the new compressor IDs. (The autotrainer will eventually train a dict for it.)
- The REST API storage_service/retrain_dict can be used to trigger dictionary training for a table manually, without waiting for the automatic training.
- The REST API storage_service/estimate_compression_ratios can be used to generate a report estimating compression ratios (on the given table) for various compression configs (algorithm, level, chunk size), to guide the choice of configuration.
New RPCs
- SAMPLE_SSTABLES is used by a dictionary-training node to gather SSTable samples from other nodes.
- ESTIMATE_SSTABLE_VOLUME is a helper RPC used by a dictionary-training node to find out how much data other nodes have, so that it can later request the right (i.e. proportional) amount of samples from each node. It's also used by the autotrainer to find out whether the table is big enough for dictionary training.
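The "proportional amount of samples" arithmetic might look like the sketch below. The function name and the largest-remainder rounding scheme are illustrative assumptions, not taken from the implementation.

```python
def samples_per_node(volumes: dict[str, int], total_samples: int) -> dict[str, int]:
    """Split a fixed sample budget across nodes, proportionally to the
    per-node data volumes reported by ESTIMATE_SSTABLE_VOLUME."""
    total = sum(volumes.values())
    if total == 0:
        return {node: 0 for node in volumes}
    exact = {n: total_samples * v / total for n, v in volumes.items()}
    counts = {n: int(x) for n, x in exact.items()}  # floor of each share
    # Largest-remainder rounding so the counts sum exactly to the budget.
    leftover = total_samples - sum(counts.values())
    for n in sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)[:leftover]:
        counts[n] += 1
    return counts
```

Sampling in proportion to volume keeps the training set representative of the table as a whole rather than skewed toward small nodes.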
New config entries
There are several new config knobs related to this feature, all named like
sstable_compression_dictionaries_*.
Refer to config.hh for up-to-date details.