This PR extends Scylla's SSTable compression with the ability to use compression dictionaries shared across compression chunks. This involves several changes:
- We refactor `compression_parameters` and friends (`compressor`, `sstables::local_compression`, `sstables::compression`) to prepare for making the construction of `compressor`s asynchronous, to enable sharing pieces of compressors (the dictionaries) across shards.
- We introduce the notion of "hidden compression options" which are written to `CompressionInfo.db` and used to construct decompressors, like regular options, but don't appear in the schema. (We later stuff the SSTable's dictionary into `CompressionInfo.db` using a sequence of such options).
- We add a cluster feature which guards the creation of dictionary-compressed SSTables.
- We introduce a central "compressor factory" (one instance shared by all shards), which from this point onward is used to construct all `compressor` objects (one per SSTable) used to process the SSTables. When constructing a compressor for writing, it uses the "current"/"recommended" dictionary (which is passed to the factory from the actively-observed contents of the group0-managed `system.dicts`). When constructing a compressor for reading, it uses the dictionary written in the hidden compression options in CompressionInfo.db. And it keeps dictionaries deduplicated, so that each unique live dictionary blob has only one instance in memory, shared across shards.
- We teach the relevant `lz4` and `zstd` compressor wrappers about the dictionaries.
- We add a HTTP API call which samples pieces of the given table (i.e. the Data.db files) from across the cluster, trains a dictionary on it, and publishes it via `system.dicts` as the new current dictionary for that table. (And we add some RPC verbs to support that).
- We add a HTTP API call which estimates the impact of various available compression configurations on the compression ratio.
- We add an autotrainer fiber which periodically retrains dicts for dict-aware tables and publishes them if they seem to be a significant improvement.
Known imperfections:
- The factory currently keeps one dictionary instance on the entire node, but we probably want one copy per NUMA node. I didn't do that because exposing NUMA knowledge to Scylla seems to require some changes in Seastar first.
New feature, no backporting involved.
Closesscylladb/scylladb#23025
* github.com:scylladb/scylladb:
docs: add user-facing documentation for SSTable compression with shared dicts
docs/dev: add sstable-compression-dicts.md
test: add test_sstable_compression_dictionaries_autotrain.py
test: add test_sstable_compression_dictionaries_basic.py
test/pylib/rest_client: add `keyspace_upgrade_sstables` helper
main: run a sstable_dict_autotrainer
api: add the estimate_compression_ratios API call
dict_autotrainer: introduce sstable_dict_autotrainer
db/system_keyspace: add query_dict_timestamp
compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor
main: clean up sstable compression dicts after table drops
sstables/compress: discard hidden compression options after the decompressor is created
compress: change compressor_ptr from shared_ptr to unique_ptr
api: add the retrain_dict API call
storage_service: add some dict-related routines
main: in compression_dict_updated_callback, recognize and use SSTable compression dicts
storage_service: add do_sample_sstables()
messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs
db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict
raft/group0_state_machine: on `system.dicts` mutations, pass the affected partitition keys to the callback
database: add sample_data_files()
database: add take_sstable_set_snapshot()
compress: teach `lz4_processor` about dictionaries
compress: teach `zstd_processor` about dictionaries
sstables: delegate compressor creation to the compressor factory
sstables: plug an `sstable_compressor_factory` into `sstables_manager`
sstables: introduce sstable_compressor_factory
utils/hashers: add get_sha256()
gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature
compress: add hidden dictionary options
compress: remove `compression_parameters::get_compressor()`
sstables/compress: remove get_sstable_compressor()
sstables/compress: move ownership of `compressor` to `sstable::compression`
compress: remove compressor::option_names()
compress: clean up the constructor of zstd_processor
compress: squash zstd.cc into compress.cc
sstables/compress: break the dependency of `compression_parameters` on `compressor`
compress.hh: switch compressor::name() from an instance member to a virtual call
bytes: adapt fmt_hex to std::span<const std::byte>