scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 11:30:36 +00:00

Author	SHA1	Message	Date
Avi Kivity	f7413a47e4	sstables: writer: avoid recursion in variadic write() Following `9b6ce030d0` ("sstables: remove quadratic (and possibly exponential) compile time in parse()"), where we removed recursion in reading, we do the same here for variadic write. This results in a small reduction in compile time. Note the problem isn't very bad here. This is tail-recursion, so likely removed by the compiler during optimization, and we don't have additional amplification due to future::then() double-compiling the ready-future and unready-future paths. Still, better to avoid quadratic compile times. Closes scylladb/scylladb#27050	2025-11-18 08:17:17 +02:00
Botond Dénes	8579e20bd1	Merge 'Enable digest+checksum verification for streaming/repair' from Taras Veretilnyk This PR enables integrity check of both checksum and digest for repair/streaming. In the past, streaming readers only verified the checksum of compressed SSTables. This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest. If the reader range doesn't cover the full SSTable, the digest is not loaded and the check is skipped. To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression. Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`. Backport is not required, it is an improvement Refs #21776 Closes scylladb/scylladb#26444 * github.com:scylladb/scylladb: boost/repair_test: add repair reader integrity verification test cases test/lib: allow to disable compression in random_mutation_generator sstables: Skip checksum and digest reads for unlinked SSTables table: enable integrity checks for streaming reader table: Add integrity option to table::make_sstable_reader() sstables: Add integrity option to create_single_key_sstable_reader	2025-11-14 18:00:33 +02:00
Benny Halevy	f9ce98384a	scylla-sstable: correctly dump sharding_metadata This patch fixes two issues in one go: First, currently sstables::load clears the sharding metadata (via open_data()), and so scylla-sstable always prints an empty array for it. Second, printing token values would generate invalid JSON as they are currently printed as binary bytes, and they should be printed simply as numbers, as we do elsewhere, for example, for the first and last keys. Fixes #26982 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26991	2025-11-14 17:55:41 +02:00
Lakshmi Narayanan Sreethar	3eba90041f	sstables: prevent oversized allocation when parsing summary positions During sstable summary parsing, the entire header was read into a single buffer upfront and then parsed to obtain the positions. If the header was too large, it could trigger oversized allocation warnings. This commit updates the parse method to read one position at a time from the input stream instead of reading the entire header at once. Since `random_access_reader` already maintains an internal buffer of 128 KB, there is no need to pre-read the entire header upfront. Fixes #24428 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26846	2025-11-14 06:40:53 +02:00
Taras Veretilnyk	add60d7576	sstables: Skip checksum and digest reads for unlinked SSTables Add an _unlinked flag to track SSTable unlink state and check it in read_digest() and read_checksum() methods to skip file reads for unlinked SSTables, preventing potential file not found errors.	2025-11-13 14:08:26 +01:00
Michał Chojnowski	b82c2aec96	sstables/trie: fix an assertion violation in bti_partition_index_writer_impl::write_last_key _last_key is a multi-fragment buffer. Some prefix of _last_key (up to _last_key_mismatch) is unneeded because it's already a part of the trie. Some suffix of _last_key (after needed_prefix) is unneeded because _last_key can be differentiated from its neighbors even without it. The job of write_last_key() is to find the middle fragments, (containing the range `[_last_key_mismatch, needed_prefix)`) trim the first and last of the middle fragments appropriately, and feed them to the trie writer. But there's an error in the current logic, in the case where `_last_key_mismatch` falls on a fragment boundary. To describe it with an example, if the key is fragmented like `aaa\|bbb\|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`, then the intended output to the trie writer is `bbb\|c`, but the actual output is `\|bbb\|c`. (I.e. the first fragment is empty). Technically the trie writer could handle empty fragments, but it has an assertion against them, because they are a questionable thing. Fix that. We also extend bti_index_test so that it's able to hit the assert violation (before the patch). The reason why it wasn't able to do that before the patch is that the violation requires decorated keys to differ on the _first_ byte of a partition key column, but the keys generated by the test only differed on the last byte of the column. (Because the test was using sequential integers to make the values more human-readable during debugging). So we modify the key generation to use random values that can differ on any position. Fixes scylladb/scylladb#26819 Closes scylladb/scylladb#26839	2025-11-07 11:25:07 +02:00
Avi Kivity	9b6ce030d0	sstables: remove quadratic (and possibly exponential) compile time in parse() parse() taking a list of elements is quadratic (during compile time) in that it generates recursive calls to itself, each time with one fewer parameter. The total size of the parameter lists in all these generated functions is quadratic in the initial parameter list size. It's also exponential if we ignore inlining limits, since each .then() call expands to two branches - a ready future branch and a non-ready future branch. If the compiler did not give up, we'd have 2^list_len branches. For sure the compiler does not do so indefinitely, but the effort getting there is wasted. Simplify by using a fold expression over the comma operator. Instead of passing the remaining parameter list in each step, we pass only the parameter we are processing now, making processing linear, and not generating unnecessary functions. It would be better expressed using pack expansion statements, but these are part of C++26. The largest offender is probably stats_metadata, with 21 elements. dev-mode sstables.o: text data bss dec hex filename 1760059 1312 7673 1769044 1afe54 sstables.o.before 1745533 1312 7673 1754518 1ac596 sstables.o.after We save about 15k of text with presumably a corresponding (small) decrease in compile time. Closes scylladb/scylladb#26735	2025-11-02 13:09:37 +01:00
Avi Kivity	d458dd41c6	Merge 'Avoid input_/output_stream-s default initialization and move-assignment' from Pavel Emelyanov Recent seastar update deprecated in/out streams usage pattern when a stream is default constructed early and then move-assigned with the proper one (see scylladb/seastar#3051). This PR fixes a few places in Scylla that still use it. Adopting newer seastar API, no need to backport Closes scylladb/scylladb#26747 * github.com:scylladb/scylladb: commitlog: Remove unused work::r stream variable ec2_snitch: Fix indentation after previous patch ec2_snitch: Coroutinize the aws_api_call_once() sstable: Construct output_stream for data instantly test: Don't reuse on-stack input stream	2025-10-31 21:22:41 +02:00
Nikos Dragazis	96e727d7b9	db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl The option is a knob that allows to reject dictionary-aware compressors in the validation stage of CREATE/ALTER statements, and in the validation of `sstable_compression_user_table_options`. It was introduced in `7d26d3c7cb` to allow the admins of Scylla Cloud to selectively enable it in certain clusters. For more details, check: https://github.com/scylladb/scylla-enterprise/issues/5435 As of this series, we want to start offering dictionary compression as the default option in all clusters, i.e., treat it as a generally available feature. This makes the knob redundant. Additionally, making dictionary compression the default choice in `sstable_compression_user_table_options` creates an awkward dependency with the knob (disabling the knob should cause `sstable_compression_user_table_options` to fall back to a non-dict compressor as default). That may not be very clear to the end user. For these reasons, mark the option as "Deprecated", remove all relevant tests, and adjust the business logic as if dictionary compression is always available. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-29 20:13:08 +02:00
Taras Veretilnyk	deb8e32e86	sstables: Add integrity option to create_single_key_sstable_reader Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations. This allows callers to enable SSTable integrity checks during single-key reads.	2025-10-28 19:27:35 +01:00
Pavel Emelyanov	5d89816fed	sstable: Construct output_stream for data instantly This change makes the local output_stream variable be constructed in the declaration statement with the help of the ternary operator, thus avoiding both default-initialization and move-assignment, which depended on a standalone condition check. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:27:22 +03:00
Radosław Cybulski	621e88ce52	Fix spelling errors Closes scylladb/scylladb#26652	2025-10-22 16:46:31 +02:00
Avi Kivity	ab488fbb3f	Merge 'Switch to seastar API level 9 (no more packet-s in output_stream/data_sink API)' from Pavel Emelyanov Other than patching Scylla sinks to implement new data_sink_impl::put(std::span<temporary_buffer>) overload, the PR changes transport write_response() method to stop using output_stream::write(scattered_message) because it's also gone. Using newer seastar API, no need to backport Closes scylladb/scylladb#26592 * github.com:scylladb/scylladb: code: Fix indentation after previous patch code: Switch to seastar API level 9 transport: Open-code invoke_with_counting into counting_data_sink::put transport: Don't use scattered_message utils: Implement memory_data_sink::put(net::packet)	2025-10-22 01:51:43 +03:00
Pavel Emelyanov	44ed3bbb7c	Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage. Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers. This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend. Similarly with storage_options. Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc). Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends. Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake. Fixes #25359 Fixes #26453 Closes scylladb/scylladb#26186 * github.com:scylladb/scylladb: docs::dev::object_storage: Add some initial info on GS storage docs/dev: Add mention of (nested) docker usage in testing.md sstables::object_storage_client: Forward memory limit semaphore to GS instance utils::gcp::object_storage: Add optional memory limits to up/download sstables::object_storage_client: Add multi-upload support for GS utils::gcp::storage: Add merge objects operation test_backup/test_basic: Make tests multiplex both s3 and gs backends test::cluster::conftest: Add support for multiple object storage backends boost::gcs_storage_test: reindent boost::gcs_storage_test: Convert to use fixture tests::boost: Add GS object storage cases to mirror S3 ones tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env sstables::object_storage_client: Add google storage implementation test_services: Allow 
testing with GS object storage parameters utils::gcp::gcp_credentials: Add option to create uninitialized credentials utils::gcp::object_storage: Make create_download_source return seekable_data_source utils::gcp::object_storage: Add defensive copies of string_view params utils::gcp::object_storage: Add missing retry backoff increate utils::gcp::object_storage: Add timestamp to object listing utils::gcp::object_storage: Add paging support to list_objects object_storage_client: Add object_name wrapper type utils::gcp::object_storage: Add optional abort_source utils::rest::client: Add abort_source support sstables: Use object_storage_client for remote storage sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial) s3::upload_progress: Promote to general util type storage_options: Abstract s3 to "object_storage" and add gs as option sstables::file_io_extension: Change "creator" callback to just data_source utils::io-wrappers: Add ranged data_source utils::io-wrappers: Add file wrapper type for seekable_source utils::seekable_source: Add a seekable IO source type object_storage_endpoint_param: Add gs storage as option config: break out object_storage_endpoint_param preparing for multi storage	2025-10-20 13:14:53 +03:00
Pavel Emelyanov	7d0722ba5c	code: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:26:50 +03:00
Pavel Emelyanov	a88a36f5b5	code: Switch to seastar API level 9 In the new API the biggest change is to implement the only data_sink_impl::put(span<temporary_buffer>) overload. Encrypted file impl and sstables compress sink use fallback_put() helper that generates a chain of continuations each holding a buffer. The counting_data_sink in transport had mostly been patched to correct implementation by the previous patch, the change here is to replace vector argument with span one. Most other sinks just re-implement their put(vector<temporary_buffer>) overload by iterating over span and non-preemptively grabbing buffers from it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:26:50 +03:00
Calle Wilund	403247243b	sstables::object_storage_client: Forward memory limit semaphore to GS instance Enforces object storage limits to the GS implementation as well.	2025-10-13 08:53:28 +00:00
Calle Wilund	5e4e5b1f4a	sstables::object_storage_client: Add multi-upload support for GS Uses file splitting + object merge to facilitate parallel, resumable upload of files with known size.	2025-10-13 08:53:27 +00:00
Calle Wilund	527a6df460	sstables::object_storage_client: Add google storage implementation Allowing GS config to be honoured.	2025-10-13 08:53:26 +00:00
Calle Wilund	144b550e4f	object_storage_client: Add object_name wrapper type Remaining S3-centric, but abstracting the object name to possible implementations not quite formatted the same.	2025-10-13 08:53:25 +00:00
Calle Wilund	5d4558df3b	sstables: Use object_storage_client for remote storage Replaces direct s3 interfaces with the abstraction layer, and opens the way for having multiple implementations/backends	2025-10-13 08:53:25 +00:00
Calle Wilund	25e932230a	sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial) Adds abstraction layer for creating the various ops and IO objects for storing sstable data on cloud storage	2025-10-13 08:53:25 +00:00
Calle Wilund	868d057aae	storage_options: Abstract s3 to "object_storage" and add gs as option Since both are bucket+prefix oriented, we can basically use same options for both, only distinguished by actual protocol. Abstract the types and the helper parse etc routines to handle either. Use "gs" as term for gcs (Google Cloud Storage), since this is the URL scheme used.	2025-10-13 08:53:25 +00:00
Calle Wilund	ac438c61e6	sstables::file_io_extension: Change "creator" callback to just data_source Because the concept of pushing reading range does not work for the wrapping we do (i.e. encryption), there is no point having it here. We need to do said range handling higher up. Also, must allow multi-layered wrapping.	2025-10-13 08:53:25 +00:00
Calle Wilund	78d9dda060	config: break out object_storage_endpoint_param preparing for multi storage Moves the config wrapper to own file (to reduce recompilation for modifying) and refactors to handle extending this parameter to non-s3 endpoint configs.	2025-10-13 08:53:24 +00:00
Michał Chojnowski	93dac3d773	sstables/compressor: relax a large allocation warning in ZSTD_CDict creation ZSTD_CDict needs a big contiguous allocation and there's no way around that. The only thing to do is relax the warning appropriately. Closes scylladb/scylladb#25393	2025-10-12 18:21:11 +03:00
Michał Chojnowski	3cf51cb9e8	sstables: fix some typos in comments I added those typos recently, and spellcheckers complain. Closes scylladb/scylladb#26376	2025-10-07 13:20:06 +03:00
Michał Chojnowski	dbddba0794	sstables/trie: actually apply BYPASS CACHE to index reads BYPASS CACHE is implemented for `bti_index_reader` by giving it its own private `cached_file` wrappers over Partitions.db and Rows.db, instead of passing it the shared `cached_file` owned by the sstable. But due to an oversight, the private `cached_file`s aren't constructed on top of the raw Partitions.db and Rows.db files, but on top of `cached_file_impl` wrappers around those files. Which means that BYPASS CACHE doesn't actually do its job. Tests based on `scylla_index_page_cache_*` metrics and on CQL tracing still see the reads from the private files as "cache misses", but those misses are served from the shared cached files anyway, so the tests don't see the problem. In this commit we extend `test_bti_index.py` with a check that looks at reactor's `io_queue` metrics instead, and catches the problem. Fixes scylladb/scylladb#26372 Closes scylladb/scylladb#26373	2025-10-06 15:32:05 +03:00
Michał Chojnowski	6efb807c1a	sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db TemporaryHashes.db is a temporary sstable component used during ms sstable writes. It's different from other sstable components in that it's not included in the TOC. Because of this, it has a special case in the logic that deletes unfinished sstables on boot. (After Scylla dies in the middle of a sstable write). But there's a bug in that special case, which causes Scylla to forget to delete other components from the same unfinished sstable. The code intends only to delete the TemporaryHashes.db file from the `_state->generations_found` multimap, but it accidentally also deletes the file's sibling components from the multimap. Fix that. Fixes scylladb/scylladb#26393	2025-10-04 00:45:55 +02:00
Avi Kivity	7230a04799	dht, sstables: replace vector with chunked_vector when computing sstable shards sstable::compute_shards_for_this_sstable() has a temporary of type std::vector<dht::token_range> (aka dht::partition_range_vector), which allocates a contiguous 300k when loading an sstable from disk. This causes large allocation warnings (it doesn't really stress the allocator since this typically happens during startup, but best to clear the warning anyway). Fix this by changing the container to be chunked_vector. It is passed to dht::ring_position_range_vector_sharder, but since we're the only user, we can change that class to accept the new type. Fixes #24198. Closes scylladb/scylladb#26353	2025-10-02 00:47:42 +02:00
Avi Kivity	15fa1c1c7e	Merge 'sstables/trie: translate all key cells in one go, not lazily' from Michał Chojnowski Applying lazy evaluation to the BTI encoding of clustering keys was probably a bad default. The possible benefits are dubious (because it's quite likely that the laziness won't allow us to avoid that much work), but the overhead needed to implement the laziness is large and immediate. In this patch we get rid of the laziness. We rewrite lazy_comparable_bytes_from_clustering_position and lazy_comparable_bytes_from_ring_position so that they perform the key translation eagerly, all components to a single bytes_ostream in one synchronous call. perf_bti_key_translation (microbenchmark added in this series, 1 iteration is 100 translations of a clustering key with 8 cells of int32_type): ``` Before: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6 After: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9 ``` Enhancement, backport not required. Closes scylladb/scylladb#26302 * github.com:scylladb/scylladb: sstables/trie: BTI-translate the entire partition key at once sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset() sstables/trie: perform the BTI-encoding of position_in_partition eagerly types/comparable_bytes: add comparable_bytes_from_compound test/perf: add perf_bti_key_translation	2025-10-01 14:59:06 +03:00
Avi Kivity	4d9271df98	Merge 'sstables: introduce sstable version `ms`' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation. This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version. (Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)). The high-level structure of the PR is: 1. Introduce new component types — `Partitions` and `Rows`. 2. Teach `class sstable` to open them when they exist. 3. Teach the sstable writer how to write index data to them. 4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead). 5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`. 6. Prepare unit tests for the appearance of `ms`. 7. Enable `ms` in unit tests. 8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled). 9. Prepare integration tests for the appearance of `ms`. 10. Enable both `ms` and `me` in tests where we want both versions to be tested. This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. 
It can be enabled by setting `sstable_format: ms` in the config. Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`. This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row. `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 70 18.0 18 128 500001 1 1 647 19.0 19 132 0 1000000 1 748 15.0 15 116 0 500000 2 372 29.0 29 284 0 250000 4 227 56.0 56 504 0 125000 8 116 106.0 106 928 0 62500 16 67 195.0 195 1732 ``` `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 51 5.1 5 20 500001 1 1 64 5.3 5 20 0 1000000 1 679 4.0 4 16 0 500000 2 492 8.0 8 88 0 250000 4 804 16.0 16 232 0 125000 8 409 31.0 31 516 0 62500 16 97 54.0 54 1056 ``` Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`: Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`. For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`. 
External tests: I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed. New functionality, no backport needed. Closes scylladb/scylladb#26215 * github.com:scylladb/scylladb: test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes test/cluster: add test_bti_index.py test: prepare bypass_cache_test.py for `ms` sstables sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write` db/config: expose "ms" format to the users via database config test: in Python tests, prepare some sstable filename regexes for `ms` sstables: add `ms` to `all_sstable_versions` test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests test/lib/index_reader_assertions: skip some row index checks for BTI indexes test/boost/sstable_inexact_index_test: explicitly use a `me` sstable test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables test/resource: add `ms` sample sstable files for relevant tests test/boost/sstable_compaction_test: prepare for `ms` sstables. test/boost/index_reader_test: prepare for `ms` sstables test/boost/bloom_filter_tests: prepare for `ms` sstables test/boost/sstable_datafile_test: prepare for `ms` sstables test/boost/sstable_test: prepare for `ms` sstables. 
sstables: introduce `ms` sstable format version tools/scylla-sstable: default to "preferred" sstable version, not "highest" sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller sstables/mx: make Index and Summary components optional sstables: open Partitions.db early when it's needed to populate key range for sharding metadata sstables: adapt sstable::set_first_and_last_keys to sstables without Summary sstables: implement an alternative way to rebuild bloom filters for sstables without Index utils/bloom_filter: add `add(const hashed_key&)` sstables: adapt estimated_keys_for_range to sstables without Summary sstables: make `sstable::estimated_keys_for_range` asynchronous sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary replica/database: add table::estimated_partitions_in_range() sstables/mx: implement sstable::has_partition_key using a regular read sstables: use BTI index for queries, when present and enabled sstables/mx/writer: populate BTI index files sstables: create and open BTI index files, when enabled sstables: introduce Partition and Rows component types sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`	2025-09-30 09:40:02 +03:00
Michał Chojnowski	621bfbe6d9	sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test_column_family.py::test_sstables_by_key_reader_closed injects a failure into `index_reader::advance_lower_and_check_if_present`. To preserve this test when BTI indexes are made the default, we have to add a corresponding error injection to `bti_index_reader::advance_lower_and_check_if_present`.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	ef11dc57c1	db/config: expose "ms" format to the users via database config Extend the `sstable_format` config enum with a "ms" value, and, if it's enabled (in the config and in cluster features), use it for new sstables on the node. (Before this commit, writing `ms` sstables should only be possible in unit tests, via internal APIs. After this commit, the format can be enabled in the config and the database will write it during normal operation). As of this commit, the new format is not the default yet. (But it will become the default in a later commit in the same series).	2025-09-29 22:15:25 +02:00
Michał Chojnowski	fe9f5f4da2	sstables: add `ms` to `all_sstable_versions` Add `ms` to the lists of sstable formats. This will cause it to be included in various unit tests.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	db4283b542	sstables: introduce `ms` sstable format version Introduce `ms` -- a new sstable format version which is a hybrid of Cassandra's `me` and `da`. It is based on `me`, but with the index components (Summary.db and Index.db) replaced with the index components of `da` (Partitions.db and Rows.db). As of this patch, the version is never chosen anywhere for writing sstables yet. It is only introduced. We will add it to unit tests in a later commit, and expose it to users in a yet later commit.	2025-09-29 22:15:24 +02:00
Michał Chojnowski	4ca215abbc	sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader Partitions.db uses a piece of the murmur hash of the partition key internally. The same hash is used to query the bloom filter. So to avoid computing the hash twice (which involves converting the key into a hashable linearized form) it would make sense to use the same `hashed_key` for both purposes. This is what we do in this patch. We extract the computation of the `hashed_key` from `make_pk_filter` up to its parent `sstable_set_impl::create_single_key_sstable_reader`, and we pass this hash down both to `make_pk_filter` and to the sstable reader. (And we add a pointer to the `hashed_key` as a parameter to all functions along the way, to propagate it). The number of parameters to `mx::make_reader` is getting uncomfortable. Maybe they should be packed into some structs.	2025-09-29 13:01:22 +02:00
Michał Chojnowski	420e215873	sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash Partitions.db internally uses a piece of the partition key murmur hash (the same hash which is used to compute the token and the relevant bits in the bloom filter). Before this patch, the Partitions.db reader computes the hash internally from the `sstables::partition_key`. That's a waste, because this hash is usually also computed for bloom filter purposes just before that. So in this patch we let the caller pass that hash instead. The old index interface, without the hash, is kept for convenience. In this patch we only add a new interface, we don't switch the callers to it yet. That will happen in the next commit.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	cee4011e7a	sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller Partitions.db internally uses a piece of the partition key murmur hash (the same hash which is used to compute the token and the relevant bits in the bloom filter). Before this patch, the Partitions.db writer computes the hash internally from the `sstables::partition_key`. That's a waste, because this hash is also computed for bloom filter purposes just before that, in the owning sstable writer. So in this patch we let the caller pass that hash here instead.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	f8e3d5e7c2	sstables/mx: make Index and Summary components optional In previous patches we (hopefully) modified all users of Index and Summary components so that they no longer need those components to exist. (And can use Partitions and Rows components instead).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	f003cbce6d	sstables: open Partitions.db early when it's needed to populate key range for sharding metadata If there's no metadata file with sharding metadata, the owning shards of an sstable are computed based on the partition key range within the sstable. This range is set in `set_first_and_last_keys()`, which (since another commit in this commit series) reads it either from the Summary component or from the footer of the Partitions component, whichever is available. But in some code paths `set_first_and_last_keys()` is called before the footer of Partitions is loaded. If the sstable doesn't have Summary, only Partitions, then `set_first_and_last_keys()` will fail. To prevent that, in those cases we have to open the file and read its footer early, before the `set_first_and_last_keys()` calls. Note: the changes in this commit shouldn't matter during normal operation, in which a Scylla component with sharding metadata is available. But it might be used when old and/or incomplete sstables are read.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	4bdf5ca0cf	sstables: adapt sstable::set_first_and_last_keys to sstables without Summary `sstable::set_first_and_last_keys` currently takes the first and last key from the Summary component. But if only BTI indexes are used, this component will be nonexistent. In this case, we can use the first and last keys written in the footer of Partitions.db.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	b1984d6798	sstables: implement an alternative way to rebuild bloom filters for sstables without Index For efficiency, the cardinality of the bloom filter (i.e. the number of partition keys which will be written into the sstable) has to be known before elements are inserted into the filter. In some cases (e.g. memtable flush) this number is known exactly. But in others (e.g. repair) it can only be estimated, and the estimation might be very wrong, leading to an oversized filter. Because of that, some time ago we added a piece of logic (run after the sstable is written, but before it's sealed) which looks at the actual number of written partitions, compares it to the initial estimate (on which the size of the bloom filter was based), and if the difference is unacceptably large, it rewrites the bloom filter from partition keys contained in Index.db. But the idea to rebuild the bloom filters from index files isn't going to work with BTI indexes, because they don't store whole partition keys. If we want sstables which don't have Index.db files, we need some other way to deal with oversized filters. Partition keys can be recovered from Data.db, but that would often be way too expensive. This patch adds another way. We introduce a new component file, TemporaryHashes. This component, if written at all, contains the 16-byte murmur hash for every partition key, in order, and can be used in place of Index to reconstruct the bloom filter. (Our bloom filters are actually built from the set of murmur hashes of partition keys. The first step of inserting a partition key into a filter is hashing the key. Remembering the hashes is sufficient to build the filter later, without looking at partition keys again.) As of this patch, if the Index component is not being written, we don't allocate and populate a bloom filter during the Data.db write. Instead, we write the murmur hashes to TemporaryHashes, and only later, after the Data write finishes, we allocate the optimal-size bloom filter, we read the hashes back from TemporaryHashes, and we populate the filter with them. That is suboptimal. Writing the hashes to disk (or worse, to S3) and reading them back is more expensive than building the bloom filter during the main Data pass. So ideally it should be avoided in cases where we know in advance that the partition key count estimate is good enough. (Which should be the case in flushes and compactions). But we defer that to a future patch. (Such a change would involve passing some flag to the sstable writer if the cardinality estimate is trustworthy, and not creating TemporaryHashes if the estimate is trustworthy).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	3c83914814	sstables: adapt estimated_keys_for_range to sstables without Summary Before this patch, `estimated_keys_for_range` assumes the presence of the Summary component. But we want to make this component optional in this series. This patch adds a second branch to this function, for sstables which don't have a BIG index (in particular, Summary component), but have a BTI index (Partitions component). In this case, instead of calculating the estimate as "fraction of summary overlapping with given range, multiplied by the total key estimate", we calculate it as "fraction of Data file overlapping with given range, multiplied by the total key estimate". (With an extra conditional for the special case when the given range doesn't overlap with the sstable's range at all. In this case, if the ranges are adjacent, the main path could easily return "1 partition" instead of "0 partitions", due to the inexactness of BTI indexes for range queries. Returning something non-zero in this case would be unfortunate, so the extra conditional makes sure that we return 0).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	55c4b89b88	sstables: make `sstable::estimated_keys_for_range` asynchronous Currently, `sstable::estimated_keys_for_range` works by checking what fraction of Summary is covered by the given range, and multiplying this fraction by the total number of keys. Since computing things on Summary doesn't involve I/O (because Summary is always kept in RAM), this is synchronous. In a later patch, we will modify `sstable::estimated_keys_for_range` so that it can deal with sstables that don't have a Summary (because they use BTI indexes instead of BIG indexes). In that case, the function is going to compute the relevant fraction by using the index instead of Summary. This will require making the function asynchronous. This is what we do in this patch. (The actual change to the logic of `sstable::estimated_keys_for_range` will come in the next patch. In this one, we only make it asynchronous).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	70994170e2	sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary `sstable::get_estimated_key_count()` estimates the partition count from the size of Summary, and the interval between Summary entries. But we want to allow writing sstables without a Summary (i.e. sstables that use BTI indexes instead of BIG indexes), so we want a way to get the key count without involving Summary. For that, we can use the `estimated_partition_size` histogram in Statistics. By counting the histogram entries, we get the exact number of partitions in the sstable.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	5f4b9a03d1	sstables/mx: implement sstable::has_partition_key using a regular read A BTI index isn't able to determine if a given key is present in the sstable, because it doesn't store full keys. (It only stores prefixes of decorated keys, so it might give false positives). If the sstable only has a BTI index, and no BIG index, then `sstable::has_partition_key()` will have to be implemented with something other than just the index reader. We might as well ignore the index in all cases and just check that a regular data read for the given partition returns a non-empty result. `sstable::has_partition_key` is only used in the `column_family/sstables/by_key` REST API call that nobody uses anyway, no point in trying to make special optimizations for it.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	893eb4ca1f	sstables: use BTI index for queries, when present and enabled This patch teaches `sstable::make_index_reader` how to create a BTI index reader, from the `Partitions.db` and `Rows.db` components, if they exist (in which case they are opened by this point).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	e0fda9ae6f	sstables/mx/writer: populate BTI index files In the previous patch we added code responsible for creating and opening Partitions.db and Rows.db, but we left those files empty. In this patch, we populate the files using `trie::bti_row_index_writer` and `trie::bti_partition_index_writer`. Note: for the row index, we insert the same clustering blocks to both indexes. The logic for choosing the size of the blocks hasn't been changed in any way. Much of this patch has to do with propagating the current range tombstone down to all places which can start a new clustering block. The reason we need that is that, for each clustering block, BIG indexes store the range tombstone succeeding the block (i.e. the range tombstone in between the given block and its successor), while BTI indexes store the range tombstone preceding the block (i.e. the range tombstone in between the given block and its predecessor). So before the patch there's no code which looks at the current tombstone when starting the block, only when ending the block. This patch adds an extra copy for each `decorated_key`. This is mostly unavoidable -- the BTI partition writer just has to remember the key until its successor appears, to find the common prefix. (We could avoid the key copy if the BTI isn't used, though. We don't do that in this patch, we just let the copy happen).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	cdcf34b3a0	sstables: create and open BTI index files, when enabled This patch adds code responsible for creation and opening of BTI index components (Rows.db, Partitions.db) when BTI index writing is enabled. (It is enabled if the cluster feature is enabled and the relevant config entry permits it). The files are empty for now, and are never read. We will populate and use them in following patches.	2025-09-29 13:01:21 +02:00
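
The bloom filter rebuild described in commit b1984d6798 above can be illustrated with a minimal sketch. This is not ScyllaDB's code: `key_hash`, `toy_bloom_filter`, `rebuild_filter_from_hashes`, the double-hashing scheme, and the default of 10 bits per key with 7 probes are all assumptions made for illustration, and an in-memory vector stands in for the TemporaryHashes component. The point it shows is only that stored hashes are enough to size and populate a filter after the exact key count is known, without looking at the partition keys again.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the 16-byte murmur hash of one partition key.
struct key_hash {
    uint64_t h1;
    uint64_t h2;
};

// Toy bloom filter that is fed precomputed hashes rather than keys.
// The probe positions use simple double hashing (an assumption, not
// necessarily the scheme used by the real filter).
class toy_bloom_filter {
    std::vector<bool> _bits;
    unsigned _hash_count;
public:
    toy_bloom_filter(size_t bit_count, unsigned hash_count)
        : _bits(bit_count), _hash_count(hash_count) {
        assert(bit_count > 0 && hash_count > 0);
    }
    void add(const key_hash& h) {
        for (unsigned i = 0; i < _hash_count; ++i) {
            _bits[(h.h1 + i * h.h2) % _bits.size()] = true;
        }
    }
    bool may_contain(const key_hash& h) const {
        for (unsigned i = 0; i < _hash_count; ++i) {
            if (!_bits[(h.h1 + i * h.h2) % _bits.size()]) {
                return false;
            }
        }
        return true;
    }
    size_t bit_count() const { return _bits.size(); }
};

// Builds a filter sized from the actual number of written partitions.
// `hashes` plays the role of the hashes read back from TemporaryHashes.
toy_bloom_filter rebuild_filter_from_hashes(const std::vector<key_hash>& hashes,
                                            size_t bits_per_key = 10,
                                            unsigned hash_count = 7) {
    // Avoid a zero-sized filter for an empty input.
    size_t key_count = hashes.empty() ? 1 : hashes.size();
    toy_bloom_filter filter(key_count * bits_per_key, hash_count);
    for (const key_hash& h : hashes) {
        filter.add(h);
    }
    return filter;
}
```

In this sketch the filter size depends only on `hashes.size()`, so a bad initial estimate cannot produce an oversized filter; the cost, as the commit message notes, is the extra write and read of the hashes.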

3934 Commits