scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-23 08:12:08 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	1c0f8ab66e	Merge 'sstables: introduce --abort-on-malformed-sstable-error' from Botond Dénes When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site. This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths: - Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`) - Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`) - `parse_assert()` failures (via `on_parse_error()`) - BTI parse errors (via `on_bti_parse_error()`) The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure. The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption. Commit breakdown: 1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc` 2. `on_parse_error()` and `on_bti_parse_error()` check the new flag 3. All ~50 `throw malformed_sstable_exception(...)` sites migrated 4. Both `throw bufsize_mismatch_exception(...)` sites migrated Refs: SCYLLADB-1087 Backport: new feature, no backport Closes scylladb/scylladb#29324 * github.com:scylladb/scylladb: sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception() sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception() sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose sstables: introduce --abort-on-malformed-sstable-error infrastructure sstables: refactor parse_path() to return std::expected<> instead of throwing	2026-05-12 12:38:25 +03:00
Botond Dénes	ad7ac62835	Merge ' Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key' from Dimitrios Symonidis Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomesv PRIMARY KEY ((table_id, node_owner), generation). This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562 No need to backport this, keyspace over object storage is experimental feature Closes scylladb/scylladb#29659 * github.com:scylladb/scylladb: db, sstables: add node_owner to sstables registry primary key db, sstables: rename sstables registry column owner to table_id	2026-05-11 14:08:19 +03:00
Botond Dénes	4ebcc002d6	sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose Add scoped_no_abort_on_malformed_sstable_error RAII guard (modeled after seastar::testing::scoped_no_abort_on_internal_error) and use it in all tests that intentionally corrupt sstables and expect malformed_sstable_exception to be thrown rather than the process aborting.	2026-05-11 11:58:14 +03:00
Ernest Zaslavsky	a97502920b	test: optimize compaction_strategy_cleanup_method for remote storage Parallelize SSTable creation using parallel_for_each. The file count is made a parameter with a default of 64, allowing future S3/GCS variants to use a smaller count if needed.	2026-04-28 16:59:38 +03:00
Ernest Zaslavsky	0b9a2844bd	test: optimize stcs_reshape_overlapping for remote storage Parallelize SSTable creation using parallel_for_each and reduce the SSTable count from 256 to 64 for S3/GCS variants. The local test variant retains the original 256 count.	2026-04-28 16:59:38 +03:00
Ernest Zaslavsky	ac89cffc9f	test: optimize twcs_reshape_with_disjoint_set for remote storage Parallelize SSTable creation across all sub-tests using parallel_for_each and reduce the SSTable count from 256 to 64 for S3/GCS variants. Re-enable the S3 test variant that was previously disabled due to taking 4+ minutes. With parallel creation and reduced count, the test now completes in a reasonable time.	2026-04-28 16:59:37 +03:00
Ernest Zaslavsky	01b4292f87	test: parallelize SSTable creation in cleanup_during_offstrategy_incremental Pre-extract mutation pairs and use parallel_for_each with make_sstable_containing_async to create SSTables concurrently instead of sequentially. The post-creation loop still runs serially to collect token ranges and generations.	2026-04-28 16:59:37 +03:00
Ernest Zaslavsky	923ff9abc9	test: parallelize SSTable creation in run_incremental_compaction_test Pre-extract mutation pairs and use parallel_for_each with make_sstable_containing_async to create SSTables concurrently instead of sequentially. The post-creation loop still runs serially to collect token ranges and generations that depend on SSTable order.	2026-04-28 16:59:37 +03:00
Ernest Zaslavsky	6a25f52473	test: parallelize SSTable creation in offstrategy_sstable_compaction Use parallel_for_each with make_sstable_containing_async to create SSTables concurrently instead of sequentially, reducing wall-clock time on remote storage backends (S3/GCS).	2026-04-28 16:59:37 +03:00
Ernest Zaslavsky	baca685629	test: parallelize SSTable creation in twcs_partition_estimate Use parallel_for_each with make_sstable_containing_async to create SSTables concurrently instead of sequentially, reducing wall-clock time on remote storage backends (S3/GCS).	2026-04-28 16:59:37 +03:00
Ernest Zaslavsky	a4ebe16517	test: make sstable test utilities natively async The original make_memtable used seastar::thread::yield() for preemption, which required all callers to run inside a seastar::thread context. This prevented the utilities from being used directly in coroutines or parallel_for_each lambdas. Make the primary functions — make_memtable, make_sstable_containing, and verify_mutation — return future<> directly. Callers now .get() explicitly when in seastar::thread context, or co_await when in a coroutine. make_memtable now uses coroutine::maybe_yield() instead of seastar::thread::yield(). verify_mutation is converted to coroutines as well. Requested in: https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282	2026-04-28 16:59:37 +03:00
Dimitrios Symonidis	c40842f60a	db, sstables: add node_owner to sstables registry primary key Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation). This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows. The new column is populated via sstables_manager::get_local_host_id(). No backward compatibility is preserved; the feature is experimental and gated by keyspace-storage-options.	2026-04-24 16:41:09 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Ernest Zaslavsky	422f107122	compaction_test: enable sstable clone tests for S3 and GCS Now that object_storage_base::clone is implemented, remove the early-return skips and re-enable the sstable_clone_leaving_unsealed_dest_sstable tests for both S3 and GCS storage backends.	2026-04-07 18:16:52 +03:00
Ernest Zaslavsky	c7a74237b3	compaction_test: fix formatting after previous patches	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	101b4ad7fa	compaction_test: add S3/GCS variations to tests Add S3 and GCS variants of the compaction tests to expand coverage for keyspaces configured to use object_storage backends.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	03bd3010bf	compaction_test: extract test_env-based tests into functions Move all test code that relies on test_env into standalone free functions so they can be reused by upcoming S3 and GCS test suites.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	b18528e97e	compaction_test: replace file_exists with storage::exists Replace direct filesystem checks (file_exists) with the storage-agnostic exists() method in unsealed_sstable_compaction, sstable_clone_leaving_unsealed_dest_sstable, and failure_when_adding_new_sstable tests, making them compatible with object-storage backends (S3, GCS).	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	98492e4ea8	compaction_test: initialize tables with schema via make_table_for_tests Start using `table_for_tests::make_default_schema` so test tables are created with a real schema. This is required for object-storage backends, which cannot operate correctly without proper schema initialization.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	5ba79e2ed4	compaction_test: use sstable APIs to manipulate component files Switch tests to use sstable member functions for file manipulation instead of opening files directly on the filesystem. This affects the helpers that emulate sstable corruption: we now overwrite the entire component file rather than just the first few kilobytes, which is sufficient for producing a corrupted sstable.	2026-04-05 11:07:17 +03:00
Ernest Zaslavsky	405c032f48	compaction_test: fix use-after-move issue We were moving `compaction_type_options` inside a loop, so on the second iteration the test received an already moved-from instance.	2026-04-05 11:07:17 +03:00
Taras Veretilnyk	95420014ea	sstable_compaction_test: Add scrub validate test for corrupted index Generalize corrupt_sstable() and scrub_validate_corrupted_file() to accept a component_type parameter, defaulting to Data, so they can be reused for corrupting other components.	2026-03-10 19:24:05 +01:00
Botond Dénes	81e214237f	Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk This pull request adds support for calculation and storing CRC32 digests for all SSTable components. This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream. Several test cases where introduced to verify expected behaviour. Additionally, this PR adds new rewrite component mechanism for safe sstable component rewriting. Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup. However, with component digests stored in scylla_metadata (#20100), replacing a component like Statistics requires atomically updating both the component and scylla_metadata with the new digest - impossible with POSIX rename. The new mechanism creates a clone sstable with a fresh generation: - Hard-links all components from the source except the component being rewritten and scylla_metadata - Copies original sstable components pointer and recognized components from the source - Invokes a modifier callback to adjust the new sstable before rewriting - Writes the modified component along with updated scylla_metadata containing the new digest - Seals the new sstable with a temporary TOC - Replaces the old sstable atomically, the same way as it is done in compaction This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair). In case of any failure durning the whole process, sstable will be automatically deleted on the node startup due to temporary toc persistence. Backport is not required, it is a new feature Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453 Closes scylladb/scylladb#28338 * github.com:scylladb/scylladb: docs: document components_digests subcomponent and trailing digest in Scylla.db sstable_compaction_test: Add tests for perform_component_rewrite sstable_test: add verification testcases of SSTable components digests persistance sstables: store digest of all sstable components in scylla metadata sstables: replace rewrite_statistics with new rewrite component mechanism sstables: add new rewrite component mechanism for safe sstable component rewriting compaction: add compaction_group_view method to specify sstable version sstables: add null_data_sink and serialized_checksum for checksum-only calculation sstables: extract default write open flags into a constant sstables: Add write_simple_with_digest for component checksumming sstables: Extract file writer closing logic into separate methods sstables: Implement CRC32 digest-only writer	2026-03-10 16:02:53 +02:00
Taras Veretilnyk	2b1c37396a	sstable_compaction_test: Add tests for perform_component_rewrite Add two test cases to verify the correctness of the perform_component_rewrite functionality: - test_perform_component_rewrite_single_sstable: Tests rewriting the Statistics component of a single sstable - test_perform_component_rewrite_multiple_sstables: Tests rewriting 5 out of 10 sstables	2026-03-06 21:58:15 +01:00
Botond Dénes	6004e84f18	test: move away from tombstone_gc_state(nullptr) ctor Use for_tests() instead (or no_gc() where approriate).	2026-03-03 14:09:28 +02:00
Raphael S. Carvalho	bc772b791d	test: boost: Add failure_when_adding_new_sstable_test Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:18 -03:00
Raphael S. Carvalho	1fdc410e24	Rename maybe_split_sstable() to maybe_split_new_sstable() Since the function must only be used on new sstables, it should be renamed to something describing its usage should be restricted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	1a077a80f1	sstables: Allow storage::snapshot() to leave destination sstable unsealed Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	c10486a5e9	test: Verify unsealed sstable can be compacted This is crucial for splitting before sealing the sstable produced by repair. This way, unsplit sstables won't be left on disk sealed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Michał Chojnowski	1cfce430f1	replica/table: keep track of total pre-compression file size Every table and sstable set keeps track of the total file size of contained sstables. Due to a feature request, we also want to keep track of the hypothetical file size if Data files were uncompressed, to add a metric that shows the compression ratio of sstables. We achieve this by replacing the relevant `uint_64 bytes_on_disk` counters everywhere with a struct that contains both the actual (post-compression) size and the hypothetical pre-compression size. This patch isn't supposed to change any observable behavior. In the next patch, we will use these changes to add a new metric.	2025-11-13 00:49:57 +01:00
Botond Dénes	f8b0142983	Merge 'Add --drop-unfixable-sstables flag for scrub in segregate mode' from Taras Veretilnyk This PR introduces support for a new scrub option: `--drop-unfixable-sstables`, which enables the dropping of corrupted SSTables during scrub only in segregate mode. The patch includes implementation, validation, and set of tests to ensure correct behavior and error handling. Fixes #19060 Backport is not required, it is a new feature Closes scylladb/scylladb#26579 * github.com:scylladb/scylladb: sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option test/nodetool: add scrub drop-unfixable-sstables option testcase scrub: add support for dropping unfixable sstables in segregate mode	2025-10-23 11:06:19 +03:00
Taras Veretilnyk	60334c6481	sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option Added a new test case, sstable_scrub_segregate_mode_drop_unfixable_sstables_test, which verifies that when the drop-unfixable-sstables flag is enabled in segregate mode, corrupted SSTables are correctly dropped.	2025-10-22 17:16:55 +02:00
Taras Veretilnyk	42da7f1eb6	scrub: add support for dropping unfixable sstables in segregate mode This patch adds a new flag `drop-unfixable-sstables` to the scrub operation in segregate mode, allowing to automatically drop SSTables that cannot be fixed during scrub. It also includes API support of the 'drop_unfixable_sstables' paramater and validation to ensure this flag is not enabled in other modes rather than segragate.	2025-10-22 17:16:49 +02:00
Pavel Emelyanov	44ed3bbb7c	Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage. Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers. This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend. Similarly with storage_options. Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc). Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends. Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake. Fixes #25359 Fixes #26453 Closes scylladb/scylladb#26186 * github.com:scylladb/scylladb: docs::dev::object_storage: Add some initial info on GS storage docs/dev: Add mention of (nested) docker usage in testing.md sstables::object_storage_client: Forward memory limit semaphore to GS instance utils::gcp::object_storage: Add optional memory limits to up/download sstables::object_storage_client: Add multi-upload support for GS utils::gcp::storage: Add merge objects operation test_backup/test_basic: Make tests multiplex both s3 and gs backends test::cluster::conftest: Add support for multiple object storage backends boost::gcs_storage_test: reindent boost::gcs_storage_test: Convert to use fixture tests::boost: Add GS object storage cases to mirror S3 ones tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env sstables::object_storage_client: Add google storage implementation test_services: Allow testing with GS object storage parameters utils::gcp::gcp_credentials: Add option to create uninitialized credentials utils::gcp::object_storage: Make create_download_source return seekable_data_source utils::gcp::object_storage: Add defensive copies of string_view params utils::gcp::object_storage: Add missing retry backoff increate utils::gcp::object_storage: Add timestamp to object listing utils::gcp::object_storage: Add paging support to list_objects object_storage_client: Add object_name wrapper type utils::gcp::object_storage: Add optional abort_source utils::rest::client: Add abort_source support sstables: Use object_storage_client for remote storage sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial) s3::upload_progress: Promote to general util type storage_options: Abstract s3 to "object_storage" and add gs as option sstables::file_io_extension: Change "creator" callback to just data_source utils::io-wrappers: Add ranged data_source utils::io-wrappers: Add file wrapper type for seekable_source utils::seekable_source: Add a seekable IO source type object_storage_endpoint_param: Add gs storage as option config: break out object_storage_endpoint_param preparing for multi storage	2025-10-20 13:14:53 +03:00
Lakshmi Narayanan Sreethar	18c071c94b	compaction: fix use after free when strategy is altered during compaction The `compaction_strategy_state` class holds strategy specific state via a `std::variant` containing different state types. When a compaction strategy performs compaction, it retrieves a reference to its state from the `compaction_strategy_state` object. If the table's compaction strategy is ALTERed while a compaction is in progress, the `compaction_strategy_state` object gets replaced, destroying the old state. This leaves the ongoing compaction holding a dangling reference, resulting in a use after free. Fix this by using `seastar::shared_ptr` for the state variant alternatives(`leveled_compaction_strategy_state_ptr` and `time_window_compaction_strategy_state_ptr`). The compaction strategies now hold a copy of the shared_ptr, ensuring the state remains valid for the duration of the compaction even if the strategy is altered. The `compaction_strategy_state` itself is still passed by reference and only the variant alternatives use shared_ptrs. This allows ongoing compactions to retain ownership of the state independently of the wrapper's lifetime. Fixes #25913 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-17 22:57:05 +05:30
Calle Wilund	7c6b4bed97	tests::boost: Add GS object storage cases to mirror S3 ones I.e. run same remote storage backend unit tests for GS backend	2025-10-13 08:53:27 +00:00
Avi Kivity	4d9271df98	Merge 'sstables: introduce sstable version `ms`' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation. This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version. (Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)). The high-level structure of the PR is: 1. Introduce new component types — `Partitions` and `Rows`. 2. Teach `class sstable` to open them when they exist. 3. Teach the sstable writer how to write index data to them. 4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead). 5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`. 6. Prepare unit tests for the appearance of `ms`. 7. Enable `ms` in unit tests. 8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled). 9. Prepare integration tests for the appearance of `ms`. 10. Enable both `ms` and `me` in tests where we want both versions to be tested. This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config. Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`. This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row. `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 70 18.0 18 128 500001 1 1 647 19.0 19 132 0 1000000 1 748 15.0 15 116 0 500000 2 372 29.0 29 284 0 250000 4 227 56.0 56 504 0 125000 8 116 106.0 106 928 0 62500 16 67 195.0 195 1732 ``` `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 51 5.1 5 20 500001 1 1 64 5.3 5 20 0 1000000 1 679 4.0 4 16 0 500000 2 492 8.0 8 88 0 250000 4 804 16.0 16 232 0 125000 8 409 31.0 31 516 0 62500 16 97 54.0 54 1056 ``` Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`: Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`. For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`. External tests: I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed. New functionality, no backport needed. Closes scylladb/scylladb#26215 * github.com:scylladb/scylladb: test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes test/cluster: add test_bti_index.py test: prepare bypass_cache_test.py for `ms` sstables sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write` db/config: expose "ms" format to the users via database config test: in Python tests, prepare some sstable filename regexes for `ms` sstables: add `ms` to `all_sstable_versions` test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests test/lib/index_reader_assertions: skip some row index checks for BTI indexes test/boost/sstable_inexact_index_test: explicitly use a `me` sstable test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables test/resource: add `ms` sample sstable files for relevant tests test/boost/sstable_compaction_test: prepare for `ms` sstables. test/boost/index_reader_test: prepare for `ms` sstables test/boost/bloom_filter_tests: prepare for `ms` sstables test/boost/sstable_datafile_test: prepare for `ms` sstables test/boost/sstable_test: prepare for `ms` sstables. sstables: introduce `ms` sstable format version tools/scylla-sstable: default to "preferred" sstable version, not "highest" sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller sstables/mx: make Index and Summary components optional sstables: open Partitions.db early when it's needed to populate key range for sharding metadata sstables: adapt sstable::set_first_and_last_keys to sstables without Summary sstables: implement an alternative way to rebuild bloom filters for sstables without Index utils/bloom_filter: add `add(const hashed_key&)` sstables: adapt estimated_keys_for_range to sstables without Summary sstables: make `sstable::estimated_keys_for_range` asynchronous sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary replica/database: add table::estimated_partitions_in_range() sstables/mx: implement sstable::has_partition_key using a regular read sstables: use BTI index for queries, when present and enabled sstables/mx/writer: populate BTI index files sstables: create and open BTI index files, when enabled sstables: introduce Partition and Rows component types sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`	2025-09-30 09:40:02 +03:00
Michał Chojnowski	6143dce3db	test/boost/sstable_compaction_test: prepare for `ms` sstables. Fix incompatibilites between the test's assumptions and the upcoming addition of `ms` sstables. Refer to individual tests for comments.	2025-09-29 22:15:25 +02:00
Botond Dénes	9c85046f93	sstables,compaction: move compaction exceptions to compaction/ sstables/exceptions.hh still hosts some compaction specific exception types. Move them over to the new compaction/exceptions.hh, to make the compaction module more self-contained.	2025-09-29 06:49:14 +03:00
Botond Dénes	1999d8e3d3	compaction: remove using namespace {compaction,sstables} Some files in compaction/ have using namespace {compaction,sstables} clauses, some even in headers. This is considered bad practice and muddies the namespace use. Remove them.	2025-09-25 15:03:57 +03:00
Botond Dénes	86ed627fc4	compaction: move code to namespace compaction The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables With cases, where all three used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanic and only the following kind of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.	2025-09-25 15:03:56 +03:00
Botond Dénes	761a32927e	Merge 'scrub: Handle malformed_sstable_exception in scrub skip mode' from Taras Veretilnyk This PR improves the handling of malformed SSTables during scrub and adds tests to validate the updated behavior. When scrub is used, there is an increased chance of encountering malformed SSTables. These should not be retried as in regular compaction. Instead, they must be handled according to the selected scrub mode: in skip mode, in case of malformed_sstable_exception, invalid data or whole SSTable should be removed, in abort and segregate modes, the scrub process should abort. Previously, all modes treated malformed_sstable_exception the same way, causing scrub to abort even when skip mode was selected. This PR updates the scrub logic to properly handle malformed SSTable exceptions based on the selected mode. Unit tests are added to verify the intended behavior. Fixes scylladb/scylladb#19059 Backport is not required, it is an improvement Closes scylladb/scylladb#25828 * github.com:scylladb/scylladb: sstable_compaction_test: add scrub tests for malformed SSTables scrub: skip sstable on malformed sstable exception in skip mode	2025-09-24 14:28:43 +03:00
Ernest Zaslavsky	5ba5aec1f8	treewide: Move mutation related files to a `mutation` directory As requested in #22104, moved the files and fixed other includes and build system. Moved files: - combine.hh - collection_mutation.hh - collection_mutation.cc - converting_mutation_partition_applier.hh - converting_mutation_partition_applier.cc - counters.hh - counters.cc - timestamp.hh Fixes: #22104 This is a cleanup, no need to backport Closes scylladb/scylladb#25085	2025-09-24 13:23:38 +03:00
Taras Veretilnyk	b81caf3f54	sstable_compaction_test: add scrub tests for malformed SSTables Add unit tests for scrub behavior with malformed SSTables: - sstable_scrub_abort_mode_malformed_sstable_test(verifies scrub aborts on malformed SSTables) - sstable_scrub_skip_mode_malformed_sstable_test(verifies scrub skips malformed SSTables without aborting)	2025-09-23 14:34:09 +02:00
Aleksandra Martyniuk	17707d0e6b	compaction: stop compaction module in really_do_stop Currently, compaction::task_manager_module is stopped in compaction_manager::stop, concurrently to really_do_stop. We can't predict the order of the two. Do not set _task_manager_module to nullptr at stop, because compaction_manager::really_do_stop() may be called before the actual shutdown, while other components still try to use it. compaction::task_manager_module does not keep a pointer to compaction_manager, so we won't end up with memory leak. Stop compaction module in really_do_stop, after ongoing compactions are stopped. It's a preparation for further patches.	2025-09-23 14:21:15 +02:00
Lakshmi Narayanan Sreethar	7cdda510ee	compaction/scrub: register sstables for compaction before validation When `scrub --validate` runs, it collects all candidate sstables at the start and validates them one by one in separate compaction tasks. However, scrub in validate mode does not register these sstables for compaction, which allows regular compaction to pick them up and potentially compact them away before validation begins. This leads to scrub failures because the sstables can no longer be found. This patch fixes the issue by first disabling compaction, collecting the sstables, and then registering them for compaction before starting validation. This ensures that the enqueued sstables remain available for the entire duration of the scrub validation task. Fixes #23363 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-16 15:29:57 +05:30
Łukasz Paszkowski	9539e80e54	compaction_manager: Subscribe to out of space controller	2025-08-29 14:56:07 +02:00
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but it is less efficient because it requires reading all of the dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows the entire SSTable to be omitted. This document patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, the incremental repair for vnode is much harder to implement. With vnodes, a SSTalbe could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired so that a sstable is either fully repaired or not repaired which is a huge simplification. This patch uses the repaired_at from sstables::statistics component to mark a sstable as repaired. It uses a virtual clock as the repair timestamp, i.e., using a monotonically increasing number for the repaired_at field of a SSTable and sstables_repaired_at column in system.tablets table. Notice that when a sstable is not repaired, the repaired_at field will be set to the default value 0 by default. The being_repaired in memory field of a SSTable is used to explicitly mark that a SSTable is being selected. The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch addes incremental_repair support in compaction. - The sstables are split into repaired and unrepaired set. - Repaired and unrepaired set compact sperately. - The repaired_at from sstable and sstables_repaired_at from system.tablets table are used to decide if a sstable is repaired or not. - Different compactions tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
Botond Dénes	ef7d49cd21	compaction/compaction_garbage_collector: refactor max_purgeable into a class Make members private, add getters and constructors. This struct will get more functionality soon, so class is a better fit.	2025-08-11 07:09:13 +03:00

1 2 3 4 5 ...

430 Commits