# Shared-dictionary compression for SSTables

## Overview

Scylla now supports dictionary-based compression for SSTables, which improves
compression ratios by sharing compression dictionaries across compression
chunks.

## Background

Traditional SSTable compression in Scylla works on a chunk-by-chunk basis. Each
chunk is compressed independently, which means patterns that occur across chunks
cannot be effectively leveraged for better compression.

Dictionary-based compression addresses this limitation by training a dictionary
on representative data samples and using it across all compression chunks,
giving the compression algorithm additional context to reference.

## How it works

1. **Dictionary training**: Scylla samples data chunks from across the cluster
   to build an optimized compression dictionary for a specific table.

2. **Dictionary distribution**: Dictionaries are stored in the `system.dicts`
   table (managed by group0). Each table has its own (possibly absent) row
   there.

3. **Compression**: When an SSTable is opened for writing, and the table has
   compression dictionaries enabled, the current recommended dictionary for
   the table (i.e. the one in `system.dicts`) is used to compress the data,
   and is written into the header of `CompressionInfo.db`.

4. **Decompression**: When an SSTable is opened for reading, the dictionary
   blob is loaded from `CompressionInfo.db` and used to decompress the data.

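The benefit of sharing a dictionary across chunks can be illustrated with
Python's standard `zlib` module, which supports preset dictionaries. This is
only a stand-in for the lz4/zstd dictionary support Scylla actually uses, and
the sample data and dictionary below are made up:

```python
import zlib

# A "trained" dictionary: representative byte patterns from the dataset.
dictionary = b'{"user_id": , "event": "click", "timestamp": }' * 10

# One small compression chunk that shares patterns with the dictionary.
chunk = b'{"user_id": 1001, "event": "click", "timestamp": 1700000001}'

def compress(data, zdict=None):
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()

def decompress(blob, zdict=None):
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob)

plain = compress(chunk)                    # chunk compressed in isolation
with_dict = compress(chunk, dictionary)    # chunk compressed with context

# Decompression needs the same dictionary, which is why Scylla stores the
# blob in the SSTable's CompressionInfo.db header.
assert decompress(with_dict, dictionary) == chunk
print(f"no dict: {len(plain)} bytes, with dict: {len(with_dict)} bytes")
```

Short chunks compress poorly on their own because the compressor has no prior
context to back-reference; the preset dictionary supplies that context, which
is exactly what sharing a dictionary across chunks buys.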
## Implementation details

### New persistent data structures

There are two new persistent data structures involved:

- An extension to the SSTable format: `CompressionInfo.db` gains two new
  compressor IDs (lz4 with dicts, zstd with dicts) and new "compressor options"
  which store the dictionary blob used by the SSTable.
- An extension to `system.dicts`, which (in addition to the RPC compression
  dict) now also stores the current recommended SSTable compression dict
  for each table.

### SSTable format extension

The *structure* of the format isn't affected. Instead, we add two new compressor
identifiers (`LZ4WithDictsCompressor` and `ZstdWithDictsCompressor`), which
use the "compressor options" map in `CompressionInfo.db` to store the dict.

Since the structure isn't affected, we don't increment the SSTable version for
this. Naturally, dict-compressed SSTables won't be readable by older versions
of Scylla (or by Cassandra), but older versions should complain about an
unknown compressor rather than consider the SSTable malformed.

If a downgrade is necessary, it can be done by disabling dictionaries
(through the schema, or by setting
`sstable_compression_dictionaries_enable_writing` to `false` on all nodes)
and rewriting the SSTables (with `nodetool upgradesstables -a` or similar).

The extension is hidden behind the `SSTABLE_COMPRESSION_DICTS` cluster feature.

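For the config half of that downgrade path, the relevant setting might look
like this in `scylla.yaml` (the option name is taken from this document;
treating it as a plain `scylla.yaml` entry is an assumption):

```yaml
# On every node, stop writing dictionary-compressed SSTables before
# rewriting the existing ones with `nodetool upgradesstables -a`:
sstable_compression_dictionaries_enable_writing: false
```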
#### New entries in CompressionInfo.db

We store the dictionary blob in the "options" map in the header of
`CompressionInfo.db`, under the keys `.dictionary.00000000`,
`.dictionary.00000001`, ...

(It's split into several parts because the "options" values have 16-bit
lengths, and dictionaries are usually bigger than that.)

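The splitting scheme can be sketched as follows. This is illustrative Python,
not Scylla's actual serialization code; the zero-padded decimal key suffix
mirrors the keys shown above, and the 64 KiB cap stands in for the 16-bit
length limit:

```python
MAX_PART = 0xFFFF  # largest value representable with a 16-bit length prefix

def split_dict(blob: bytes) -> dict[str, bytes]:
    # Break the dictionary blob into numbered "options" entries, each small
    # enough for a 16-bit length field.
    return {
        f".dictionary.{i:08d}": blob[off:off + MAX_PART]
        for i, off in enumerate(range(0, len(blob), MAX_PART))
    }

def join_dict(options: dict[str, bytes]) -> bytes:
    # Zero-padded keys sort lexicographically in numeric order, so sorting
    # the keys recovers the original part order.
    parts = sorted(k for k in options if k.startswith(".dictionary."))
    return b"".join(options[k] for k in parts)

blob = bytes(200_000)  # a ~200 KiB dictionary
options = split_dict(blob)
assert join_dict(options) == blob
```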
### `system.dicts` extension

If a `system.dicts` partition with key `sstables/{table_uuid}` exists,
it provides the current recommended dict for the table, which is used
to compress new SSTables.

If a table doesn't have a matching row in `system.dicts`, then there's no
current dictionary for that table, and new SSTables fall back to
dictionaryless compression.

### Compressor factory

With "traditional" compression, a compressor was just a function in the code,
not involving any data. This made the creation of compressors cheap and easy.

But with dictionaries involved, each unique compressor has its own RAM and
cache footprint. Therefore we want to deduplicate compressors as much as
possible.

For this, we create new compressors through a central "compressor factory",
which contacts other shards and ensures that there are no redundant copies
of dictionaries in memory.

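The single-shard half of the deduplication idea can be sketched like this
(hypothetical names, not Scylla's API; the real factory additionally
coordinates across shards):

```python
import hashlib
import weakref

class Compressor:
    """Stand-in for a dictionary-aware compressor holding a dict blob."""
    def __init__(self, dict_blob: bytes):
        self.dict_blob = dict_blob

class CompressorFactory:
    def __init__(self):
        # Weak values: a cached compressor is dropped automatically once
        # no SSTable holds a reference to it anymore.
        self._cache = weakref.WeakValueDictionary()

    def make(self, dict_blob: bytes) -> Compressor:
        # Key by a digest of the dictionary, so SSTables written with the
        # same dict share one in-memory copy.
        key = hashlib.sha256(dict_blob).digest()
        comp = self._cache.get(key)
        if comp is None:
            comp = Compressor(dict_blob)
            self._cache[key] = comp
        return comp

factory = CompressorFactory()
a = factory.make(b"trained-dict")
b = factory.make(b"trained-dict")
assert a is b  # deduplicated: one copy of the dictionary in memory
```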
### Automatic training

To create a dictionary, some training data is needed.
This means that the dictionary can't be created immediately for a new table;
some data must accumulate in it first.

Also, the dataset can change over time, and a dictionary might become outdated.
In that case, it could be worth retraining it.

But it would be impractical to manually pick the right moments to train new
dicts. So there's `sstable_dict_autotrainer`, which periodically trains
new dicts if it seems that the given dict-aware table deserves one.
Refer to the implementation for up-to-date details.

### New interfaces

- To enable dictionaries for a given table, the user sets its
  `sstable_compression` entry in the schema to one of the new compressor IDs.
  (The autotrainer will eventually train a dict for it.)
- The REST API `storage_service/retrain_dict` can be used to trigger dictionary
  training for a table manually, without waiting for the automatic training.
- The REST API `storage_service/estimate_compression_ratios` can be used to
  generate a report with estimates of compression ratios (on the given table)
  for various compression configs (algorithm, level, chunk size), to guide the
  choice of configuration.

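The schema knob from the first bullet might look like this in CQL. This is a
hedged sketch: the compressor ID comes from this document, but `ks.tbl` is a
placeholder and the option-map syntax follows the usual `compression` schema
option:

```cql
-- Switch a table to dictionary-aware zstd compression; the autotrainer
-- will eventually train a dictionary for it.
ALTER TABLE ks.tbl
  WITH compression = {'sstable_compression': 'ZstdWithDictsCompressor'};
```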
### New RPCs

- `SAMPLE_SSTABLES` is used by a dictionary-training node to gather SSTable
  samples from other nodes.
- `ESTIMATE_SSTABLE_VOLUME` is a helper RPC used by a dictionary-training node
  to find out how much data other nodes have, so that it can later request
  the right (i.e. proportional) amount of samples from each node.
  It's also used by the autotrainer to find out whether the table is big
  enough for dictionary training.

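The proportional-sampling idea behind these two RPCs can be sketched as
follows (an illustrative helper with made-up names; the actual protocol lives
in Scylla's RPC layer): first ask every node for its data volume, then split
the total sample budget proportionally.

```python
def plan_samples(volumes: dict[str, int], total_samples: int) -> dict[str, int]:
    """Assign each node a share of the sample budget proportional to
    how much data it reported via ESTIMATE_SSTABLE_VOLUME."""
    total = sum(volumes.values())
    if total == 0:
        return {node: 0 for node in volumes}
    return {node: total_samples * vol // total for node, vol in volumes.items()}

# node2 holds 3x the data, so it is asked for 3x the samples.
plan = plan_samples({"node1": 100, "node2": 300}, total_samples=1000)
assert plan == {"node1": 250, "node2": 750}
```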
### New config entries

There are several new config knobs related to this feature, all named like
`sstable_compression_dictionaries_*`.
Refer to `config.hh` for up-to-date details.