scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Botond Dénes	8579e20bd1	Merge 'Enable digest+checksum verification for streaming/repair' from Taras Veretilnyk This PR enables integrity check of both checksum and digest for repair/streaming. In the past, streaming readers only verified the checksum of compressed SSTables. This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest. If the reader range doesn't cover the full SSTable, the digest is not loaded and the check is skipped. To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression. Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`. Backport is not required, it is an improvement Refs #21776 Closes scylladb/scylladb#26444 * github.com:scylladb/scylladb: boost/repair_test: add repair reader integrity verification test cases test/lib: allow to disable compression in random_mutation_generator sstables: Skip checksum and digest reads for unlinked SSTables table: enable integrity checks for streaming reader table: Add integrity option to table::make_sstable_reader() sstables: Add integrity option to create_single_key_sstable_reader	2025-11-14 18:00:33 +02:00
Marcin Maliszkiewicz	958d04c349	service: attach storage_service to migration_manager using pluggable Migration manager depends on storage service. For instance, it has a reload_schema_in_bg background task which calls _ss.local() so it expects that storage service is not stopped before it stops. To solve this we use a permit approach, and during storage_service stop: - we ignore new code execution in migration_manager which'd use storage_service - but wait with storage_service shutdown until all existing executions are done Fixes scylladb/scylladb#26734	2025-11-14 08:50:19 +01:00
Taras Veretilnyk	554ce17769	test/lib: allow to disable compression in random_mutation_generator Adds a compress flag to random_mutation_generator, allowing tests to disable compression in generated mutations. When set to compress::no, the schema builder uses no_compression() parameters.	2025-11-13 14:08:33 +01:00
Calle Wilund	1d37873cba	test::lib: Add azure mock/real server fixture Wraps the real/mock azure server for test in a fixture. Note: retains the current test setup which explicitly runs some tests with "real" azure, if avail, and some always mock.	2025-11-05 10:22:22 +00:00
Calle Wilund	0842b2ae55	test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS Runs local-kms mock AWS KMS server unless overridden by env var. Allows tests to use real or fake AWS KMS endpoint and shared fixture for quicker execution.	2025-11-05 10:22:21 +00:00
Calle Wilund	98c060232e	test::lib::gcs_fixture: Only set port if running docker image + more retry Our connect can spuriously fail. Just retry.	2025-11-05 10:22:21 +00:00
Pavel Emelyanov	44ed3bbb7c	Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage. Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers. This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend. Similarly with storage_options. Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc). Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends. Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake. Fixes #25359 Fixes #26453 Closes scylladb/scylladb#26186 * github.com:scylladb/scylladb: docs::dev::object_storage: Add some initial info on GS storage docs/dev: Add mention of (nested) docker usage in testing.md sstables::object_storage_client: Forward memory limit semaphore to GS instance utils::gcp::object_storage: Add optional memory limits to up/download sstables::object_storage_client: Add multi-upload support for GS utils::gcp::storage: Add merge objects operation test_backup/test_basic: Make tests multiplex both s3 and gs backends test::cluster::conftest: Add support for multiple object storage backends boost::gcs_storage_test: reindent boost::gcs_storage_test: Convert to use fixture tests::boost: Add GS object storage cases to mirror S3 ones tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env sstables::object_storage_client: Add google storage implementation test_services: Allow 
testing with GS object storage parameters utils::gcp::gcp_credentials: Add option to create uninitialized credentials utils::gcp::object_storage: Make create_download_source return seekable_data_source utils::gcp::object_storage: Add defensive copies of string_view params utils::gcp::object_storage: Add missing retry backoff increase utils::gcp::object_storage: Add timestamp to object listing utils::gcp::object_storage: Add paging support to list_objects object_storage_client: Add object_name wrapper type utils::gcp::object_storage: Add optional abort_source utils::rest::client: Add abort_source support sstables: Use object_storage_client for remote storage sstables::object_storage_client: Add abstraction layer for OS clients (s3 initial) s3::upload_progress: Promote to general util type storage_options: Abstract s3 to "object_storage" and add gs as option sstables::file_io_extension: Change "creator" callback to just data_source utils::io-wrappers: Add ranged data_source utils::io-wrappers: Add file wrapper type for seekable_source utils::seekable_source: Add a seekable IO source type object_storage_endpoint_param: Add gs storage as option config: break out object_storage_endpoint_param preparing for multi storage	2025-10-20 13:14:53 +03:00
Tomasz Grabiec	c4a87453a2	Merge 'Add experimental feature flag for strongly consistent tables and extend keyspace creation syntax to allow specifying consistency mode.' from Gleb Natapov The series adds an experimental flag for strongly consistent tables and extends the "CREATE KEYSPACE" DDL with a `consistency` option that allows specifying the consistency mode for the keyspace. Closes scylladb/scylladb#26116 * github.com:scylladb/scylladb: schema: Allow configuring consistency setting for a keyspace db: experimental consistent-tablets option	2025-10-16 21:48:06 +02:00
Gleb Natapov	c255740989	schema: Allow configuring consistency setting for a keyspace We want to add strongly consistent tables as an option. We will have two kinds of strongly consistent tables: globally consistent and locally consistent. The former means that requests from all DCs will be globally linearisable, while the latter means that only requests to the same DC will be linearisable. To allow configuring all the possibilities, the patch adds a new parameter to a keyspace definition, "consistency", that can be configured to be `eventual`, `global` or `local`. A non-eventual setting is supported for tablet-enabled keyspaces only. Since we want to start with implementing local consistency, configuring global consistency will result in an error for now.	2025-10-16 13:34:49 +03:00
Nadav Har'El	921d07a26b	cql: make SELECT's "internal page size" configurable In some uses of SELECT, such as aggregation (sum() et al.), GROUP BY or secondary index, it needs to perform internal scans. It uses an "internal page size" which before this patch was always DEFAULT_COUNT_PAGE_SIZE = 10000. There was an ad-hoc and undocumented way to override this default in C++ tests, using functions in test/lib/select_statement_utils.hh, but it was so non-obvious that the test that most needed to override this default - the very slow test test_indexing_paging_and_aggregation which would have been much faster with a lower setting - never used it. So in this patch we replace the ad-hoc configuration functions by a bona-fide Scylla configuration option named "select_internal_page_size". The few C++ tests that used the old configuration functions were modified to use the new configuration parameters. The slow test test_indexing_paging_and_aggregation still doesn't use the new configuration to become faster - we'll do this in the next patch. Another benefit of having this "internal page size" as a configuration option is that one day a user might realize that the default choice 10,000 is bad for some reason (which I can't envision right now), so having it configurable might come in handy. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-15 18:42:09 +03:00
Calle Wilund	7c6b4bed97	tests::boost: Add GS object storage cases to mirror S3 ones I.e. run same remote storage backend unit tests for GS backend	2025-10-13 08:53:27 +00:00
Calle Wilund	af2616d750	tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS A test fixture object for either real google storage or fake-gcs-server using test local podman. Copied/transposed from gcp_object_storage_test.	2025-10-13 08:53:26 +00:00
Calle Wilund	a33fdd0b62	tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env Move some code to compilation unit + add some overloads. Add a RAII-object for temporary setting current process env as well.	2025-10-13 08:53:26 +00:00
Calle Wilund	956d26aa34	test_services: Allow testing with GS object storage parameters	2025-10-13 08:53:26 +00:00
Calle Wilund	5d4558df3b	sstables: Use object_storage_client for remote storage Replaces direct s3 interfaces with the abstraction layer, and opens the door to multiple implementations/backends	2025-10-13 08:53:25 +00:00
Calle Wilund	78d9dda060	config: break out object_storage_endpoint_param preparing for multi storage Moves the config wrapper to own file (to reduce recompilation for modifying) and refactors to handle extending this parameter to non-s3 endpoint configs.	2025-10-13 08:53:24 +00:00
Piotr Dulikowski	380f243986	Merge 'Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names. For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] } Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs. Specifying the rack list also allows adding replicas on the specified racks (increasing the replication factor), or decommissioning replicas from certain racks (by omitting those racks from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks. Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported. New feature, no backport required. 
Co-authored with @bhalevy Fixes https://github.com/scylladb/scylladb/issues/25269 Fixes https://github.com/scylladb/scylladb/issues/23525 Closes scylladb/scylladb#26358 * github.com:scylladb/scylladb: tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count locator: Make hasher for endpoint_dc_rack globally accessible test: tablets: Add test for replica allocation on rack list changes test: lib: topology_builder: generate unique rack names test: Add tests for rack list RF doc: Document rack-list replication factor topology_coordinator: Restore formatting topology_coordinator: Cancel keyspace alter on broader set of errors topology_coordinator: Make keyspace alter process options through as_ks_metadata_update() cql3: ks_prop_defs: Preserve old options cql3: ks_prop_defs: Introduce flattened() locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace() tablet_allocator: Respect binding replicas to racks locator: network_topology_strategy: Respect rack list when reallocating tablets cql3: ks_prop_defs: Fail with more information when options are not in expected format locator, cql3: Support rack lists in replication options cql3: Fail early on vnode/tablet flavor alter cql3: Extract convert_property_map() out of Cql.g schema: Use definition from the header instead of open-coding it locator: Abstract obtaining the number of replicas from replication_strategy_config_option cql3, locator: Use type aliases for option maps locator: Add debug logging locator: Pass topology to replication strategy constructor abstract_replication_strategy, network_topology_strategy: add replication_factor_data class	2025-10-06 14:14:09 +02:00
Piotr Dulikowski	e7907b173a	Merge 'db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Dawid Mędrek Materialized views are currently in the experimental phase and using them in tablet-based keyspaces requires starting Scylla with an experimental feature, `views-with-tablets`. Any attempts to create a materialized view or secondary index when it's not enabled will fail with an appropriate error. After considerable effort, we're drawing close to bringing views out of the experimental phase, and the experimental feature will no longer be needed. However, materialized views in tablet-based keyspaces will still be restricted, and creating them will only be possible after enabling the configuration option `rf_rack_valid_keyspaces`. That's what we do in this PR. In this patch, we adjust existing tests in the tree to work with the new restriction. That shouldn't have been necessary because we've already seemingly adjusted all of them to work with the configuration option, but some tests hid well. We fix that mistake now. After that, we introduce the new restriction. What's more, when starting Scylla, we verify that there is no materialized view that would violate the contract. If there are some that do, we list them, notify the user, and refuse to start. High-level implementation strategy: 1. Name the restrictions in form of a function. 2. Adjust existing tests. 3. Restrict materialized views by both the experimental feature and the configuration option. Add validation test. 4. Drop the requirement for the experimental feature. Adjust the added test and add a new one. 5. Update the user documentation. Fixes scylladb/scylladb#23030 Backport: 2025.4, as we are aiming to support materialized views for tablets from that version. 
Closes scylladb/scylladb#25802 * github.com:scylladb/scylladb: view: Stop requiring experimental feature db/view: Verify valid configuration for tablet-based views db/view: Require rf_rack_valid_keyspaces when creating view test/cluster/random_failures: Skip creating secondary indexes test/cluster/mv: Mark test_mv_rf_change as skipped test/cluster: Adjust MV tests to RF-rack-validity test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces db/view: Name requirement for views with tablets	2025-10-06 12:46:46 +02:00
Michał Chojnowski	16cb223d7f	test/boost/database_test: fix two no-op distributed loader tests There are two tests which effectively check nothing. They intend to check that distributed loader removes "leftover" sstable files. So they create some incomplete sstables, run the test env on the directory, and check that the files have disappeared. But the test env completely clears the test directory before the distributed loader looks at the files, so the tests succeed trivially. Fix that by adding a config knob to the test env which instructs it not to clear the directory before the test.	2025-10-04 00:44:49 +02:00
Benny Halevy	4955ca3ddd	test: lib: topology_builder: generate unique rack names Encode the dc identifier into each rack name so each dc will have its own unique racks. Just for easier distinction in logs. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	5fc617ecf5	test: Add tests for rack list RF	2025-10-02 19:45:00 +02:00
Dawid Mędrek	288be6c82d	db/view: Verify valid configuration for tablet-based views Creating a materialized view or a secondary index in a tablet-based keyspace requires that the user enable two options: * experimental feature `views-with-tablets`, * configuration option `rf_rack_valid_keyspaces`. Because the latter has only become a necessity recently (in this series), it's possible that there are already existing materialized views that violate it. We add a new check at start-up that iterates over existing views and makes sure that this is not the case. Otherwise, Scylla notifies the user of the problem.	2025-10-01 09:01:53 +02:00
Nadav Har'El	926089746b	message: move RPC compression from utils/ to message/ The directory utils/ is supposed to contain general-purpose utility classes and functions, which are either already used across the project, or are designed to be used across the project. This patch moves 8 files out of utils/: utils/advanced_rpc_compressor.hh utils/advanced_rpc_compressor.cc utils/advanced_rpc_compressor_protocol.hh utils/stream_compressor.hh utils/stream_compressor.cc utils/dict_trainer.cc utils/dict_trainer.hh utils/shared_dict.hh These 8 files together implement the compression feature of RPC. None of them are used by any other Scylla component (e.g., sstables have a different compression), or are ready to be used by another component, so this patch moves all of them into message/, where RPC is implemented. Theoretically, we may want in the future to use this cluster of classes for some other component, but even then, we shouldn't just have these files individually in utils/ - these are not useful stand-alone utilities. One cannot use "shared_dict.hh" assuming it is some sort of general-purpose shared hash table or something - it is completely specific to compression and zstd, and specifically to its use in those other classes. Beyond moving these 8 files, this patch also contains changes to: 1. Fix includes to the 5 moved header files (.hh). 2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt for the three moved source files (.cc). 3. In the moved files, change from the "utils::" namespace, to the "netw::" namespace used by RPC. Also needed to change a bunch of callers for the new namespace. Also, had to add "utils::" explicitly in several places which previously assumed the current namespace is "utils::". Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25149	2025-09-30 17:03:09 +03:00
Avi Kivity	4d9271df98	Merge 'sstables: introduce sstable version `ms`' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation. This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version. (Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)). The high-level structure of the PR is: 1. Introduce new component types — `Partitions` and `Rows`. 2. Teach `class sstable` to open them when they exist. 3. Teach the sstable writer how to write index data to them. 4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead). 5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`. 6. Prepare unit tests for the appearance of `ms`. 7. Enable `ms` in unit tests. 8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled). 9. Prepare integration tests for the appearance of `ms`. 10. Enable both `ms` and `me` in tests where we want both versions to be tested. This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. 
It can be enabled by setting `sstable_format: ms` in the config. Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`. This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row. `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 70 18.0 18 128 500001 1 1 647 19.0 19 132 0 1000000 1 748 15.0 15 116 0 500000 2 372 29.0 29 284 0 250000 4 227 56.0 56 504 0 125000 8 116 106.0 106 928 0 62500 16 67 195.0 195 1732 ``` `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 51 5.1 5 20 500001 1 1 64 5.3 5 20 0 1000000 1 679 4.0 4 16 0 500000 2 492 8.0 8 88 0 250000 4 804 16.0 16 232 0 125000 8 409 31.0 31 516 0 62500 16 97 54.0 54 1056 ``` Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`: Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`. For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`. 
External tests: I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed. New functionality, no backport needed. Closes scylladb/scylladb#26215 * github.com:scylladb/scylladb: test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes test/cluster: add test_bti_index.py test: prepare bypass_cache_test.py for `ms` sstables sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write` db/config: expose "ms" format to the users via database config test: in Python tests, prepare some sstable filename regexes for `ms` sstables: add `ms` to `all_sstable_versions` test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests test/lib/index_reader_assertions: skip some row index checks for BTI indexes test/boost/sstable_inexact_index_test: explicitly use a `me` sstable test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables test/resource: add `ms` sample sstable files for relevant tests test/boost/sstable_compaction_test: prepare for `ms` sstables. test/boost/index_reader_test: prepare for `ms` sstables test/boost/bloom_filter_tests: prepare for `ms` sstables test/boost/sstable_datafile_test: prepare for `ms` sstables test/boost/sstable_test: prepare for `ms` sstables. 
sstables: introduce `ms` sstable format version tools/scylla-sstable: default to "preferred" sstable version, not "highest" sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller sstables/mx: make Index and Summary components optional sstables: open Partitions.db early when it's needed to populate key range for sharding metadata sstables: adapt sstable::set_first_and_last_keys to sstables without Summary sstables: implement an alternative way to rebuild bloom filters for sstables without Index utils/bloom_filter: add `add(const hashed_key&)` sstables: adapt estimated_keys_for_range to sstables without Summary sstables: make `sstable::estimated_keys_for_range` asynchronous sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary replica/database: add table::estimated_partitions_in_range() sstables/mx: implement sstable::has_partition_key using a regular read sstables: use BTI index for queries, when present and enabled sstables/mx/writer: populate BTI index files sstables: create and open BTI index files, when enabled sstables: introduce Partition and Rows component types sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`	2025-09-30 09:40:02 +03:00
Piotr Dulikowski	4581c72430	Merge 'lwt: prohibit for tablet-based views and cdc logs' from Petr Gusev `SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables. We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now. Fixes https://github.com/scylladb/scylladb/issues/26258 backports: not needed (a new feature) Closes scylladb/scylladb#26284 * github.com:scylladb/scylladb: cql_test_env.cc: log exception when callback throws lwt: prohibit for tablet-based views and cdc logs tablets: disallow chains of colocated tables database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda	2025-09-30 07:15:16 +02:00
Michał Chojnowski	70a6c9481b	test/lib/index_reader_assertions: skip some row index checks for BTI indexes Block monotonicity checks can't be implemented for BTI row indexes because they don't store full clustering positions, only some encoded prefixes. The emptiness check could be implemented with some effort, but we currently don't bother. The two tests which use this `is_empty()` method aren't very useful anyway. (They check that the promoted index is empty when there are no clustering keys. That doesn't really need a dedicated test).	2025-09-29 22:15:25 +02:00
Petr Gusev	29f9c355ab	cql_test_env.cc: log exception when callback throws When a test fails inside a do_with_cql_env callback, the logs don’t make it clear where the failure happened. This is because cql_env immediately begins shutting down services, which obscures the original failure.	2025-09-29 17:53:36 +02:00
Lakshmi Narayanan Sreethar	7b97928152	cmake: link `vector_search` to `test-lib` instead of `cql3` PR #26237 fixed linker errors by linking `cql3` to `vector_search` but this introduced a circular dependency between these two static libraries, sometimes causing failures during compilation : ``` ninja: error: dependency cycle: /home/user/Development/scylladb/build/debug/cql3/CqlParser.hpp -> data_dictionary/libdata_dictionary.a -> data_dictionary/CMakeFiles/data_dictionary.dir/data_dictionary.cc.o -> /home/user/Development/scylladb/build/debug/cql3/CqlParser.hpp ``` So, instead of linking the `vector_search` library to the `cql3` library, link it directly to the executable where the `cql3` library is also to be linked. For the test cases, this means linking `vector_search` to the `test-lib` library. Since both `vector_search` and `cql3` are static libraries, the linker will resolve them correctly regardless of the order in which they are linked. Refs #26235 Refs #26237 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26318	2025-09-29 17:46:58 +03:00
Michał Chojnowski	4bdf5ca0cf	sstables: adapt sstable::set_first_and_last_keys to sstables without Summary `sstable::set_first_and_last_keys` currently takes the first and last key from the Summary component. But if only BTI indexes are used, this component will be nonexistent. In this case, we can use the first and last keys written in the footer of Partitions.db.	2025-09-29 13:01:21 +02:00
Avi Kivity	0f4363cc8d	Merge 'sstable: add more complete schema to scylla component' from Botond Dénes Sstables store a basic schema in the statistics component. The scylla-sstable tool uses this to be able to read and dump sstables in a self-contained manner, without requiring an external schema source. The problem is that the schema stored in the statistics component is incomplete: it doesn't store column names for key columns, so these have placeholder names in dump outputs where column names are visible. This is not a disaster but it is confusing and it can cause errors in scripts which want to check the content of sstables, while also knowing the schema and expecting the proper names for key columns. To make sstables truly self-contained w.r.t. the schema, add a complete schema to the scylla component. This schema contains the names and types of all columns, as well as some basic information about the schema: keyspace name, table name, id and version. When available, scylla-sstable's schema loader will use this new more complete schema and fall back to the old method of loading the (incomplete) schema from the statistics component otherwise. New feature, no backport required. Closes scylladb/scylladb#24187 * github.com:scylladb/scylladb: test/boost/schema_loader_test: add specific test with interesting types test/lib/random_schema: add random_schema(schema_ptr) constructor test/boost/schema_loader_test: test_load_schema_from_sstable: add fall-back test tools/schema_loader: add support for loading from scylla-metadata tools/schema_loader: extract code which load schema from statistics sstables: scylla_metadata: add schema member	2025-09-26 00:21:17 +03:00
Botond Dénes	1999d8e3d3	compaction: remove using namespace {compaction,sstables} Some files in compaction/ have using namespace {compaction,sstables} clauses, some even in headers. This is considered bad practice and muddies the namespace use. Remove them.	2025-09-25 15:03:57 +03:00
Botond Dénes	86ed627fc4	compaction: move code to namespace compaction The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables There are cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.	2025-09-25 15:03:56 +03:00
Botond Dénes	f10af4b5eb	test/lib/random_schema: add random_schema(schema_ptr) constructor Allow using the convenient random data generation facilities, for any schema.	2025-09-25 11:28:34 +03:00
Ernest Zaslavsky	5ba5aec1f8	treewide: Move mutation related files to a `mutation` directory As requested in #22104, moved the files and fixed other includes and build system. Moved files: - combine.hh - collection_mutation.hh - collection_mutation.cc - converting_mutation_partition_applier.hh - converting_mutation_partition_applier.cc - counters.hh - counters.cc - timestamp.hh Fixes: #22104 This is a cleanup, no need to backport Closes scylladb/scylladb#25085	2025-09-24 13:23:38 +03:00
Botond Dénes	7c6fb131f3	Merge 'compaction: ensure that all compaction executors are stopped' from Aleksandra Martyniuk Currently, while stopping the compaction_manager, we stop task_manager compaction module and concurrently run compaction_manager::really_do_stop. really_do_stop stops and waits for all task_executors that are kept in compaction_manager::_tasks, but nothing ensures that no more tasks will be added there. Due to leftover tasks, we trigger on_fatal_internal_error. Modify the order of compaction_manager::stop. After the change, we stop compaction tasks in the following order: - abort module abort source; - close module gate in the background; - stop_ongoing_compactions (kept in compaction_manager::_tasks); - wait until module gate is closed. Check module abort source before creating compaction executor and adding it to _tasks. Thanks to the above, we can be sure that: - after module::stop there will be no tasks in _tasks; - compaction_manager::stop aborts all tasks; we don't wait for any whole compaction to finish. Fixes: https://github.com/scylladb/scylladb/issues/25806. Fixes shutdown bug; needs backports to all versions Closes scylladb/scylladb#25885 * github.com:scylladb/scylladb: compaction: move _tasks check compaction: stop compaction module in really_do_stop	2025-09-24 06:49:52 +03:00
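The stop ordering described in the merge above can be illustrated with a small asyncio model. This is only a sketch: `CompactionManagerSketch`, `start_task` and the `aborted` flag are hypothetical stand-ins for the module abort source and `compaction_manager::_tasks`, not ScyllaDB's actual classes, and the module gate is not modelled.

```python
import asyncio


class CompactionManagerSketch:
    """Hypothetical model of the stop ordering; not ScyllaDB's real API."""

    def __init__(self):
        self.aborted = False  # stands in for the module abort source
        self.tasks = set()    # stands in for compaction_manager::_tasks

    def start_task(self, work):
        # Check the abort source before creating an executor and registering it.
        if self.aborted:
            raise RuntimeError("compaction module is stopping")
        task = asyncio.get_running_loop().create_task(work())
        self.tasks.add(task)
        task.add_done_callback(self.tasks.discard)
        return task

    async def stop(self):
        # 1. Abort first, so no new task can be registered from here on.
        self.aborted = True
        # 2. Abort the ongoing tasks instead of waiting for them to finish.
        pending = list(self.tasks)
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)
        self.tasks.difference_update(pending)


async def demo():
    mgr = CompactionManagerSketch()

    async def long_compaction():
        await asyncio.sleep(3600)

    mgr.start_task(long_compaction)
    await mgr.stop()
    try:
        mgr.start_task(long_compaction)
        rejected = False
    except RuntimeError:
        rejected = True
    return len(mgr.tasks), rejected


result = asyncio.run(demo())
```

Because the flag is set before the task set is drained, a late caller is rejected rather than leaving a leftover entry behind.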
Aleksandra Martyniuk	17707d0e6b	compaction: stop compaction module in really_do_stop Currently, compaction::task_manager_module is stopped in compaction_manager::stop, concurrently to really_do_stop. We can't predict the order of the two. Do not set _task_manager_module to nullptr at stop, because compaction_manager::really_do_stop() may be called before the actual shutdown, while other components still try to use it. compaction::task_manager_module does not keep a pointer to compaction_manager, so we won't end up with memory leak. Stop compaction module in really_do_stop, after ongoing compactions are stopped. It's a preparation for further patches.	2025-09-23 14:21:15 +02:00
Karol Nowacki	eae71d3e91	vector_store_client: Move to vector_search module Vector search related implementation moved to a new module vector_search. As the vector search functionality is going to be extended, it is better to keep it in a separate module.	2025-09-22 08:01:47 +02:00
Michał Chojnowski	9e70df83ab	db: get rid of sstables-format-selector Our sstable format selection logic is weird, and hard to follow. If I'm not misunderstanding, the pieces are: 1. There's the `sstable_format` config entry, which currently doesn't do anything, but in the past it used to disable cluster features for versions newer than the specified one. 2. There are deprecated and unused config entries for individual versions (`enable_sstables_mc_format`, `enable_sstables_md_format`, etc). 3. There is a cluster feature for each version: ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc. (Currently all sstable version features have been grandfathered, and aren't checked by the code anymore). 4. There's an entry in `system.scylla_local` which contains the latest enabled sstable version. (Why? Isn't this directly derived from cluster features anyway)? 5. There's `sstable_manager::_format` which contains the sstable version to be used for new writes. This field is updated by `sstables_format_selector` based on cluster features and the `system.scylla_local` entry. I don't see why those pieces are needed. Version selection has the following constraints: 1. New sstables must be written with a format that supports existing data. For example, range tombstones with an infinite bound are only supported by sstables since version "mc". So if a range tombstone with an infinite bound exists somewhere in the dataset, the format chosen for new sstables has to be at least as new as "mc". 2. A new format might only be used after a corresponding cluster feature is enabled. (Otherwise new sstables might become unreadable if they are sent to another node, or if a node is downgraded). 3. The user should have a way to inhibit format upgrades if he wishes. So far, constraint (1) has been fulfilled by never using formats older than the newest format ever enabled on the node. (With an exception for resharding and reshaping system tables). 
Constraint (2) has been fulfilled by calling `sstable_manager::set_format` only after the corresponding cluster feature is enabled. Constraint (3) has been fulfilled by the ability to inhibit cluster features by setting `sstable_format` to some fixed value. The main thing I don't like about this whole setup is that it doesn't let me downgrade the preferred sstable format. After a format is enabled, there is no way to go back to writing the old format again. That is no good -- after I make some performance-sensitive changes in a new format, it might turn out to be a pessimization for the particular workload, and I want to be able to go back. This patch aims to give a way to downgrade formats without violating the constraints. What it does is: 1. The entry in `system.scylla_local` becomes obsolete. After the patch we no longer update or read it. As far as I understand, the purpose of this entry is to prevent unwanted format downgrades (which is something cluster features are designed for) and it's updated if and only if relevant cluster features are updated. So there's no reason to have it, we can just directly use cluster features. 2. `sstable_format_selector` gets deleted. Without the `system.scylla_local` around, it's just a glorified feature listener. 3. The format selection logic is moved into `sstable_manager`. It already sees the `db::config` and the `gms::feature_service`. For the foreseeable future, the knowledge of enabled cluster features and current config should be enough information to pick the right formats. 4. The `sstable_format` entry in `db::config` is no longer intended to inhibit cluster features. Instead, it is intended to select the format for new sstables, and it becomes live-updatable. 5. 
Instead of writing new sstables with "highest supported" format, (which used to be set by `sstables_format_selector`) we write them with the "preferred" format, which is determined by `sstable_manager` based on the combination of enabled features and the current value of `sstable_format`. Closes scylladb/scylladb#26092 [avi: Pavel found the reason for the scylla_local entry - it predates stable storage for cluster features]	2025-09-19 16:17:56 +03:00
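As a rough illustration of the "preferred format" idea in the commit above, the sketch below picks the newest format that is both covered by an enabled cluster feature and not newer than the configured `sstable_format`. The function name, the three-entry format list and the selection rule are assumptions made for illustration; the actual logic in `sstable_manager` may differ, and constraint (1) about existing data is not modelled.

```python
# Oldest to newest. Only the three format names that appear in the commit
# message are listed; this list is an assumption, not the real enum.
FORMATS = ["mc", "md", "me"]


def preferred_format(configured, enabled_features):
    """Newest format that is feature-enabled and not newer than `configured`."""
    limit = FORMATS.index(configured)
    candidates = [f for f in FORMATS[: limit + 1] if f in enabled_features]
    if not candidates:
        raise ValueError("no usable sstable format")
    return candidates[-1]


# The config asks for "me", but the cluster has not enabled it yet.
during_upgrade = preferred_format("me", {"mc", "md"})
# Everything is enabled, and the user deliberately downgrades to "mc".
after_downgrade = preferred_format("mc", {"mc", "md", "me"})
```

Because the choice is recomputed from the current inputs each time, changing the configured value at runtime is enough to move the preferred format in either direction.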
Pavel Emelyanov	a1ea553fe1	code: Replace distributed<> with sharded<> The latter is recommended in seastar, and the former was left as a compatibility alias. Latest seastar explicitly marks it as deprecated so once the submodule is updated, compilation logs will explode. Most of the patch is generated with for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done and a small manual change in test/perf/perf.hh Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26136	2025-09-19 12:22:51 +02:00
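The substitution in the commit above can be tried out on a string with Python's `re` module. This is a sketch that assumes the template argument is matched by a capture group over `[A-Za-z0-9:_]*`, as the `\1` back-reference implies; `\b` is used as an approximation of the GNU `\<` word-start anchor, and `rewrite` is a hypothetical helper name.

```python
import re

# Assumed pattern: a single captured template argument made of identifier
# characters and "::" separators.
PATTERN = re.compile(r"\bdistributed<([A-Za-z0-9:_]*)>")


def rewrite(source: str) -> str:
    """Replace distributed<T> with sharded<T>, keeping T unchanged."""
    return PATTERN.sub(r"sharded<\1>", source)


simple = rewrite("seastar::distributed<db::system_keyspace>& sys;")
nested = rewrite("distributed<foo<bar>> x;")
```

Under this assumed pattern, a nested template argument such as `foo<bar>` is left untouched, because `<` is not in the character class.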
Ernest Zaslavsky	ddf2588985	treewide: Move replica related files to `replica` directory As requested in #22099, moved the files and fixed other includes and build system. Moved files: - cache_temperature.hh - cell_locking.hh Fixes: #22099 Closes scylladb/scylladb#25079	2025-09-18 08:00:35 +03:00
Ernest Zaslavsky	54aa552af7	treewide: Move type related files to a `type` directory As requested in #22110 , moved the files and fixed other includes and build system. Moved files: - duration.hh - duration.cc - concrete_types.hh Fixes: #22110 This is a cleanup, no need to backport Closes scylladb/scylladb#25088	2025-09-17 17:32:19 +03:00
Ernest Zaslavsky	d624413ddd	treewide: Move query related files to a new `query` directory As requested in #22120, moved the files and fixed other includes and build system. Moved files: - query.cc - query-request.hh - query-result.hh - query-result-reader.hh - query-result-set.cc - query-result-set.hh - query-result-writer.hh - query_id.hh - query_result_merger.hh Fixes: #22120 This is a cleanup, no need to backport Closes scylladb/scylladb#25105	2025-09-16 23:40:47 +03:00
Michał Jadwiszczak	dc1ffd2c10	service/storage_service: drain `view_building_worker` earlier Similarly to view builder, view building worker needs to be drained in `storage_service::do_drain()`. Storage service drain happens at the very beginning of the shutdown procedure. Before this patch, the worker was still building views after the storage service was drained and this caused errors like: `Error applying view update to (named_gate_closed_exception)` and `locator::no_such_tablet_map`. Fixes scylladb/scylladb#25908 Closes scylladb/scylladb#25984	2025-09-15 11:29:19 +03:00
Avi Kivity	fc64333040	Merge 'sstables/trie: add BTI index readers and writers' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25506/ Next part: plugging the BTI index readers and writers into sstable readers and writers. The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability. This series implements, on top of the key translation logic, and abstract trie writing and traversal logic, a writer and a reader of sstable index files (which map primary keys to positions in Data.db), as described in `f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md`. Caveats: 1. I think the added test has reasonable coverage, but that depends on running it multiple times. (Though it shouldn't need more than a few runs to catch any bug it covers). It's somewhat awkward as a test meant for running in CI, it's better as something you run many times after a relevant change. 2. These readers and writers are intended to be compatible with Cassandra, but I did NOT do any compatibility testing. The writers and readers added here have only been tested against each other, not against Cassandra's readers and writers. 3. This didn't undergo any proper benchmarking and optimization work. I was doing some measurements in the past, but everything was rewritten so much since then that my old measurements are effectively invalidated. Frankly I have no idea what the performance of all this branchy-branchy logic is now. No backports needed, new functionality. 
Closes scylladb/scylladb#25626 * github.com:scylladb/scylladb: test/manual: add bti_cassandra_compatibility_test test/lib/random_schema: add some constraints for generated uuid and time/date values test/lib/random_utils: add a variant of get_bytes which takes an `engine&` test/boost: add bti_index_test sstables/writer: add an accessor for the current write position in Data.db sstables/trie: introduce bti_index_reader sstables/trie: add bti_partition_index_writer.cc sstables/trie: add bti_row_index_writer.cc utils/bit_cast: add a new overload of write_unaligned() sstables/trie: add trie_writer::add_partial() sstables/consumer: add read_56() sstables/trie: make bti_node_reader::page_ptr copy-constructible sstables: extract abstract_index_reader from index_reader.hh to its own header sstables/trie: add an accessor to the file_writer under bti_node_sink sstables/types: make `deletion_time::operator tombstone()` const sstables/types: add sstables::deletion_time::make_live() sstables/trie: fix a special case in max_offset_from_child sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding sstables/trie: rewrite lcb_mismatch to handle fragment invalidation test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr`	2025-09-10 21:48:52 +03:00
Michał Chojnowski	77dcb2bcda	test/lib/random_schema: add some constraints for generated uuid and time/date values I want to write a test which generates a random table (random schema, random data) and uses the Python driver to query it. But it turns out that some values generated by test/lib/random_schema can't be deserialized by the Python driver. For example, it doesn't accept unknown uuid versions, dates before year 1 or after year 9999, or `time` values greater than or equal to the number of nanoseconds in a day. AFAIK those "driver-illegal" values aren't particularly interesting for tests which use `random_schema`, so we can just not generate them.	2025-09-07 00:32:02 +02:00
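The constraints listed in the commit above can be expressed as bounded generators. The sketch below is written in Python rather than in the C++ of `test/lib/random_schema`, the function names are hypothetical, and restricting UUIDs to version 4 is an assumption chosen for illustration (the commit does not say which versions the driver accepts).

```python
import datetime
import random
import uuid

NANOS_PER_DAY = 24 * 60 * 60 * 10**9


def random_time_nanos(rng):
    # `time` values must be strictly below the number of nanoseconds in a day.
    return rng.randrange(NANOS_PER_DAY)


def random_date(rng):
    # datetime.date covers exactly the years 1..9999.
    lo = datetime.date.min.toordinal()
    hi = datetime.date.max.toordinal()
    return datetime.date.fromordinal(rng.randint(lo, hi))


def random_uuid(rng):
    # Passing version=4 forces the version and variant bits to valid values.
    return uuid.UUID(int=rng.getrandbits(128), version=4)


rng = random.Random(0)
sample = (random_time_nanos(rng), random_date(rng), random_uuid(rng))
```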
Michał Chojnowski	3ce7b761ce	test/lib/random_utils: add a variant of get_bytes which takes an `engine&` I will need it in a test later.	2025-09-07 00:32:02 +02:00
Calle Wilund	5ead6ec420	proc-utils: Re-export waiting types from seastar Just to make directly accessible from wrapper type	2025-09-01 18:03:44 +00:00
Calle Wilund	8169327553	proc-utils: Inherit environment from current process In most cases, when launching a process from tests, we will want to inherit our own env. Add option (default true) to do so.	2025-09-01 18:03:44 +00:00
Avi Kivity	bc5773f777	Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational and recoverable even under extreme conditions. To achieve this, the following proactive measures are implemented: - reject writes - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes - applicable to: user tables, views, CDC log, audit, cql tracing - stop running compactions/repairs and prevent from starting new ones - reject incoming tablet migrations The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches the critical level (default: 98%) and disabled when the utilization drops below the threshold. Apart from that, the series adds tests that require mounted volumes to simulate out of space. The paths to the volumes can be provided using a pytest argument, i.e. `--space-limited-dirs`. When not provided, tests are skipped. Test scenarios: 1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization 2. Perform an operation that would take the nodes over 100% 3. The nodes should not exceed the critical disk utilization (98% by default) 4. Scale out the cluster by adding one node per rack 5. Retry or wait for the operation from step 2 The operation is: writing data, running compactions, building materialized views, running repair, migrating tablets (caused by RF change, decommission). The test is successful, if no nodes run out of space, the operation from step 2 is aborted/paused/timed out and the operation from step 5 is successful. 
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency: Read path (before) ``` instructions_per_op: mean= 39661.51 standard-deviation=34.53 median= 39655.39 median-absolute-deviation=23.33 maximum=39708.71 minimum=39622.61 ``` Read path (after) ``` instructions_per_op: mean= 39691.68 standard-deviation=34.54 median= 39683.14 median-absolute-deviation=11.94 maximum=39749.32 minimum=39656.63 ``` Write path (before): ``` instructions_per_op: mean= 50942.86 standard-deviation=97.69 median= 50974.11 median-absolute-deviation=34.25 maximum=51019.23 minimum=50771.60 ``` Write path (after): ``` instructions_per_op: mean= 51000.15 standard-deviation=115.04 median= 51043.93 median-absolute-deviation=52.19 maximum=51065.81 minimum=50795.00 ``` Fixes: https://github.com/scylladb/scylladb/issues/14067 Refs: https://github.com/scylladb/scylladb/issues/2871 No backport, as it is a new feature. Closes scylladb/scylladb#23917 * github.com:scylladb/scylladb: tests/cluster: Add new storage tests test/scylla_cluster: Override workdir when passed via cmdline streaming: Reject incoming migrations storage_service: extend locator::load_stats to collect per-node critical disk utilization flag repair_service: Add a facility to disable the service compaction_manager: Subscribe to out of space controller compaction_manager: Replace enabled/disabled states with running state database: Add critical_disk_utilization mode database can be moved to disk_space_monitor: add subscription API for threshold-based disk space monitoring docs: Add feature documentation config: Add critical_disk_utilization_level option replica/exceptions: Add a new custom replica exception	2025-08-30 18:47:57 +03:00
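The threshold behaviour described in the merge above (protections switch on when utilization reaches the critical level and off when it drops below) can be sketched as a small subscription model. `DiskSpaceMonitorSketch` and its methods are hypothetical names, not the API of ScyllaDB's `disk_space_monitor`; only the 98% default comes from the text.

```python
class DiskSpaceMonitorSketch:
    """Hypothetical threshold monitor that notifies subscribers on crossings."""

    def __init__(self, threshold=0.98):
        self.threshold = threshold
        self.critical = False
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def update(self, used_bytes, total_bytes):
        now_critical = used_bytes / total_bytes >= self.threshold
        # Notify only when the state changes, not on every sample.
        if now_critical != self.critical:
            self.critical = now_critical
            for callback in self.subscribers:
                callback(now_critical)


events = []
monitor = DiskSpaceMonitorSketch()
monitor.subscribe(events.append)
monitor.update(970, 1000)  # below the threshold: nothing happens
monitor.update(990, 1000)  # crosses the threshold: protections on
monitor.update(995, 1000)  # still critical: no duplicate notification
monitor.update(500, 1000)  # back below: protections off
```

In this model a subscriber such as a compaction manager or a write path would react to `True` by refusing new work and to `False` by resuming.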
Piotr Dulikowski	7ccb50514d	Merge 'Introduce view building coordinator' from Michał Jadwiszczak This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views. The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table. The coordinator builds one base table at a time and it can choose another when all views of the currently processed base table are built. The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results. This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch. The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation. If the operation fails at any moment, aborted tasks are rolled back. The view building coordinator can also handle staging sstables using process_staging view building tasks. 
We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149). For detailed description check: `docs/dev/view-building-coordinator.md` Fixes https://github.com/scylladb/scylladb/issues/22288 Fixes https://github.com/scylladb/scylladb/issues/19149 Fixes https://github.com/scylladb/scylladb/issues/21564 Fixes https://github.com/scylladb/scylladb/issues/17603 Fixes https://github.com/scylladb/scylladb/issues/22586 Fixes https://github.com/scylladb/scylladb/issues/18826 Fixes https://github.com/scylladb/scylladb/issues/23930 --- This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942 Closes scylladb/scylladb#23760 * github.com:scylladb/scylladb: test/cluster: add view build status tests test/cluster: add view building coordinator tests utils/error_injection: allow to abort `injection_handler::wait_for_message()` test: adjust existing tests utils/error_injection: add injection with `sleep_abortable()` db/view/view_builder: ignore `no_such_keyspace` exception docs/dev: add view building coordinator documentation db/view/view_building_worker: work on `process_staging` tasks db/view/view_building_worker: register staging sstable to view building coordinator when needed db/view/view_building_worker: discover staging sstables db/view/view_building_worker: add method to register staging sstable db/view/view_update_generator: add method to process staging sstables instantly db/view/view_update_generator: extract generating updates from staging sstables to a method db/view/view_update_generator: ignore tablet-based sstables db/view/view_building_coordinator: update view build status on node join/left db/view/view_building_coordinator: handle tablet operations db/view: add view building task mutation builder service/topology_coordinator: run view building coordinator db/view: introduce `view_building_coordinator` 
db/view/view_building_worker: update built views locally db/view: introduce `view_building_worker` db/view: extract common view building functionalities db/view: prepare to create abstract `view_consumer` message/messaging_service: add `work_on_view_building_tasks` RPC service/topology_coordinator: make `term_changed_error` public db/schema_tables: create/cleanup tasks when an index is created/dropped service/migration_manager: cleanup view building state on drop keyspace service/migration_manager: cleanup view building state on drop view service/migration_manager: create view building tasks on create view test/boost: enable proxy remote in some tests service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` service/migration_manager: coroutinize `prepare_new_view_announcement()` service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine` service: reload `view_building_state_machine` on group0 apply() service/vb_coordinator: add currently processing base db/system_keyspace: move `get_scylla_local_mutation()` up db/system_keyspace: add `view_building_tasks` table db/view: add view_building_state and views_state db/system_keyspace: add method to get view build status map db/view: extract `system.view_build_status_v2` cql statements to system_keyspace db/system_keyspace: move `internal_system_query_state()` function earlier db/view: ignore tablet-based views in `view_builder` gms/feature_service: add VIEW_BUILDING_COORDINATOR feature	2025-08-29 17:28:44 +02:00
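The scheduling rule from the merge above (each shard works on one tablet at a time, but several views of the same base table and tablet may be built together) can be sketched as follows. The `Task` fields and `pick_tasks_to_start` are hypothetical names for illustration; the real coordinator's data structures are described in `docs/dev/view-building-coordinator.md`.

```python
from collections import namedtuple

# Hypothetical task record; last_token identifies the tablet.
Task = namedtuple("Task", "view base_table node shard last_token")


def pick_tasks_to_start(pending, busy_shards):
    """Choose tasks so that each (node, shard) works on a single tablet."""
    chosen_tablet = {}
    started = []
    for task in pending:
        key = (task.node, task.shard)
        if key in busy_shards:
            continue  # this shard is still processing an earlier batch
        tablet = (task.base_table, task.last_token)
        # The first task seen for a shard fixes its tablet; later tasks join
        # only if they target the same base table and tablet.
        if chosen_tablet.setdefault(key, tablet) == tablet:
            started.append(task)
    return started


pending = [
    Task("view_a", "base", "n1", 0, 100),
    Task("view_b", "base", "n1", 0, 100),  # same tablet, different view
    Task("view_a", "base", "n1", 0, 200),  # different tablet on the same shard
    Task("view_a", "base", "n1", 1, 300),  # shard 1 is busy
]
started = pick_tasks_to_start(pending, busy_shards={("n1", 1)})
```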

1596 Commits