Commit Graph

4514 Commits

Author SHA1 Message Date
Botond Dénes
bdca5600ef Merge 'Prevent stalls due to large tablet mutations' from Benny Halevy
Currently, replica::tablet_map_to_mutation generates a mutation with a row per tablet.
With enough tablets (tens of thousands) in a table, we observe reactor stalls when freezing / unfreezing such large mutations, as seen in https://github.com/scylladb/scylladb/pull/18095#issuecomment-2029246954. I assume we would see similar stalls when converting those mutations into canonical_mutation and back, since canonical_mutation is similar to frozen_mutation but a bit more expensive, as it also saves the column mappings.

This series takes a different approach than allowing freeze to yield.
`tablet_map_to_mutation` is changed to `tablet_map_to_mutations`, which can generate multiple split mutations that, when squashed together, are equivalent to the previous single large mutation.  Those mutations are fed into a `process_mutation` callback function, provided by the caller, which may add them to a vector for further processing, and/or process them inline by freezing them or making a canonical mutation.

In addition, splitting the large mutations also prevents hitting the commitlog's maximum mutation size.
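As a rough illustration of the callback shape (stand-in types; the real signatures in replica/tablets differ):

```
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct mutation {}; // stand-in for the real mutation type

using process_mutation_fn = std::function<void(mutation)>;

// Emit one mutation per chunk of tablet rows instead of one huge mutation.
void tablet_map_to_mutations_sketch(std::size_t tablet_count,
                                    std::size_t rows_per_mutation,
                                    const process_mutation_fn& process_mutation) {
    for (std::size_t first = 0; first < tablet_count; first += rows_per_mutation) {
        mutation m; // would carry rows [first, first + rows_per_mutation)
        process_mutation(std::move(m)); // caller collects or freezes inline
    }
}

// A caller that simply collects the split mutations into a vector:
std::vector<mutation> collect_tablet_mutations(std::size_t tablets) {
    std::vector<mutation> out;
    tablet_map_to_mutations_sketch(tablets, 1024, [&out] (mutation m) {
        out.push_back(std::move(m));
    });
    return out;
}
```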

Closes scylladb/scylladb#18162

* github.com:scylladb/scylladb:
  schema_tables: convert_schema_to_mutations: simplify check for system keyspace
  tablets: read_tablet_mutations: use unfreeze_and_split_gently
  storage_service: merge_topology_snapshot: freeze snp.mutations gently
  mutation: async_utils: add unfreeze_and_split_gently
  mutation: add for_each_split_mutation
  tablets: tablet_map_to_mutations: maybe split tablets mutation
  tablets: tablet_map_to_mutations: accept process_func
  perf-tablets: change default tables and tablets-per-table
  perf-tablets: abort on unhandled exception
2025-10-01 07:04:09 +03:00
Benny Halevy
1ceb49f6c1 schema_tables: convert_schema_to_mutations: simplify check for system keyspace
Currently, the function unfreezes each schema mutation partition
and then checks if it's for a system keyspace.
This isn't really needed since we can check the partition key
using the frozen_mutation and skip it if the partition is for a system keyspace.

Note that the constructed partition_key just copies
the frozen partition_key_view, without copying or deserializing the
actual key contents.

Also, reserve `results` capacity using the queried
partitions' size to prevent reallocations of the results
vector.
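
A minimal sketch of the pattern with stand-in types (the real code inspects the frozen partition_key_view; the helper names here are hypothetical):

```
#include <string>
#include <string_view>
#include <vector>

// Stand-in for a frozen schema mutation: the keyspace name is recoverable
// from the frozen partition key without deserializing the key contents.
struct frozen_row {
    std::string keyspace_name;
    std::string payload;
};

static bool is_system_keyspace(std::string_view ks) {
    return ks.starts_with("system"); // illustrative check only
}

std::vector<frozen_row> filter_non_system(const std::vector<frozen_row>& partitions) {
    std::vector<frozen_row> results;
    results.reserve(partitions.size()); // one allocation, no reallocations
    for (const auto& p : partitions) {
        if (is_system_keyspace(p.keyspace_name)) {
            continue; // skipped without any unfreeze
        }
        results.push_back(p);
    }
    return results;
}
```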

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Nadav Har'El
926089746b message: move RPC compression from utils/ to message/
The directory utils/ is supposed to contain general-purpose utility
classes and functions, which are either already used across the project,
or are designed to be used across the project.

This patch moves 8 files out of utils/:

    utils/advanced_rpc_compressor.hh
    utils/advanced_rpc_compressor.cc
    utils/advanced_rpc_compressor_protocol.hh
    utils/stream_compressor.hh
    utils/stream_compressor.cc
    utils/dict_trainer.cc
    utils/dict_trainer.hh
    utils/shared_dict.hh

These 8 files together implement the compression feature of RPC.
None of them are used by any other Scylla component (e.g., sstables have
a different compression), or are ready to be used by another component,
so this patch moves all of them into message/, where RPC is implemented.

Theoretically, we may want in the future to use this cluster of classes
for some other component, but even then, we shouldn't just have these
files individually in utils/ - these are not useful stand-alone
utilities. One cannot use "shared_dict.hh" assuming it is some sort of
general-purpose shared hash table or something - it is completely
specific to compression and zstd, and specifically to its use in those
other classes.

Beyond moving these 8 files, this patch also contains changes to:
1. Fix includes to the 5 moved header files (.hh).
2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt
   for the three moved source files (.cc).
3. In the moved files, change from the "utils::" namespace, to the
   "netw::" namespace used by RPC. Also needed to change a bunch
   of callers for the new namespace. Also, had to add "utils::"
   explicitly in several places which previously assumed the
   current namespace is "utils::".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25149
2025-09-30 17:03:09 +03:00
Piotr Dulikowski
5e5a3c7ec5 view_building_worker.cc: fix spelling (commiting -> committing)
The typo is reported by GitHub action on each PR, so let's fix it to
reduce the noise for everybody.

Closes scylladb/scylladb#26329
2025-09-30 16:47:03 +03:00
Avi Kivity
72609b5f69 Merge 'mv: generate view updates on pending replica' from Michael Litvak
Generate view updates from a pending base replica if it's a reading
replica, i.e. it's in the last stage of transition write_both_read_new
before becoming the new base replica.

Previously we didn't generate view updates on a pending replica. The
problem with that is that when a base token is migrated from one replica
B1 to another B2, at one stage we generate view updates only from B1,
then at the next stage we generate view updates only from B2. During
this transition, it can happen that for some write neither B1 nor B2
generates a view update, because each one sees the other as the base
replica.

We fix this by generating view updates from both base replicas in the
phase before the transition. We can generate view updates on the pending
replica in this case, even if it requires read-before-write, because
it's in a stage where it contains all data and serves reads.

Fixes https://github.com/scylladb/scylladb/issues/24292

backport not needed - the issue mostly affects MV with tablets which is still experimental

Closes scylladb/scylladb#25904

* github.com:scylladb/scylladb:
  test: mv: test view update during topology operations
  mv: generate view updates on both shards in intranode migration
  mv: generate view updates on pending replica
2025-09-30 13:17:16 +03:00
Michał Jadwiszczak
3bbbbf419b test/cluster/test_view_building_coordinator: add reproducer for staging sstables with tablet merge
The test verifies if staging sstables are processed correctly after
tablet merge.

Refs scylladb/scylladb#26244

Closes scylladb/scylladb#26286
2025-09-30 09:05:31 +02:00
Avi Kivity
4d9271df98 Merge 'sstables: introduce sstable version ms' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation.

This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version.

(Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding `ms` as an intermediate step we can adopt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)).

The high-level structure of the PR is:
1. Introduce new component types — `Partitions` and `Rows`.
2. Teach `class sstable` to open them when they exist.
3. Teach the sstable writer how to write index data to them.
4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead).
5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`.
6. Prepare unit tests for the appearance of `ms`.
7. Enable `ms` in unit tests.
8. Make `ms` enablable via db::config (with a silent fallback to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled).
9. Prepare integration tests for the appearance of `ms`.
10. Enable both `ms` and `me` in tests where we want both versions to be tested.

This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config.

Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and improving index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.
This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.
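A back-of-the-envelope model for these numbers (our reading, not part of the measurements): a lookup over N index blocks with branching factor b costs roughly log_b(N) page reads.

```
% Rough model: page reads per lookup ~ log_b(N), with N ~ 500 index blocks.
\[
  \log_2 500 \approx 9
  \quad\text{(BIG binary search; measured $\approx 11.2$, the rest being fixed overhead)}
\]
\[
  500^{1/2.3} \approx 15
  \quad\text{(effective per-page fan-out implied by BTI's $\approx 2.3$ reads)}
\]
```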

`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset  stride  rows     iterations    avg aio    aio      (KiB)
500000  1       1                70       18.0     18        128
500001  1       1               647       19.0     19        132
0       1000000 1               748       15.0     15        116
0       500000  2               372       29.0     29        284
0       250000  4               227       56.0     56        504
0       125000  8               116      106.0    106        928
0       62500   16               67      195.0    195       1732
```
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset  stride  rows     iterations    avg aio    aio      (KiB)
500000  1       1                51        5.1      5         20
500001  1       1                64        5.3      5         20
0       1000000 1               679        4.0      4         16
0       500000  2               492        8.0      8         88
0       250000  4               804       16.0     16        232
0       125000  8               409       31.0     31        516
0       62500   16               97       54.0     54       1056
```

Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`:
Large-partition table (dominated by the intra-partition index): 2.4 MB with `me`, 732 kB with `ms`.
Small-partitions table (dominated by the inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.

External tests:
I ran the SCT test `longevity-mv-si-4days-streaming-test` on 6 nodes with 30 shards each for 8 hours. No anomalies were observed.

New functionality, no backport needed.

Closes scylladb/scylladb#26215

* github.com:scylladb/scylladb:
  test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
  test/cluster: add test_bti_index.py
  test: prepare bypass_cache_test.py for `ms` sstables
  sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
  test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables
  tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write`
  db/config: expose "ms" format to the users via database config
  test: in Python tests, prepare some sstable filename regexes for `ms`
  sstables: add `ms` to `all_sstable_versions`
  test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests
  test/lib/index_reader_assertions: skip some row index checks for BTI indexes
  test/boost/sstable_inexact_index_test: explicitly use a `me` sstable
  test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables
  test/resource: add `ms` sample sstable files for relevant tests
  test/boost/sstable_compaction_test: prepare for `ms` sstables.
  test/boost/index_reader_test: prepare for `ms` sstables
  test/boost/bloom_filter_tests: prepare for `ms` sstables
  test/boost/sstable_datafile_test: prepare for `ms` sstables
  test/boost/sstable_test: prepare for `ms` sstables.
  sstables: introduce `ms` sstable format version
  tools/scylla-sstable: default to "preferred" sstable version, not "highest"
  sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
  sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash
  sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
  sstables/mx: make Index and Summary components optional
  sstables: open Partitions.db early when it's needed to populate key range for sharding metadata
  sstables: adapt sstable::set_first_and_last_keys to sstables without Summary
  sstables: implement an alternative way to rebuild bloom filters for sstables without Index
  utils/bloom_filter: add `add(const hashed_key&)`
  sstables: adapt estimated_keys_for_range to sstables without Summary
  sstables: make `sstable::estimated_keys_for_range` asynchronous
  sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary
  replica/database: add table::estimated_partitions_in_range()
  sstables/mx: implement sstable::has_partition_key using a regular read
  sstables: use BTI index for queries, when present and enabled
  sstables/mx/writer: populate BTI index files
  sstables: create and open BTI index files, when enabled
  sstables: introduce Partition and Rows component types
  sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`
2025-09-30 09:40:02 +03:00
Michał Chojnowski
ef11dc57c1 db/config: expose "ms" format to the users via database config
Extend the `sstable_format` config enum with a "ms" value,
and, if it's enabled (in the config and in cluster features),
use it for new sstables on the node.

(Before this commit, writing `ms` sstables should only be possible
in unit tests, via internal APIs. After this commit, the format
can be enabled in the config and the database will write it during
normal operation).

As of this commit, the new format is not the default yet.
(But it will become the default in a later commit in the same series).
2025-09-29 22:15:25 +02:00
Michael Litvak
c9237bf5f6 mv: generate view updates on both shards in intranode migration
Similarly to the issue of tokens migrating from one host to another,
where we need to generate view updates on both replicas before
transitioning in order to not lose view updates, we need to do the same
in case of intranode migration.

In intranode migration we migrate tokens from one shard to another.
Previously we checked shard_for_reads in order to generate view updates
only on the single shard that is selected for reads, and not on a
pending shard that is not ready yet. The problem is that shard_for_reads
switches from the source shard to the destination shard in a single
transition, and during that switch we can lose view updates because
neither shard sees itself as the shard for reads.

We fix this by having a phase before the transition when both shards are
ready for reads and both will generate view updates.
2025-09-29 13:44:04 +02:00
Michael Litvak
d842ea2dc9 mv: generate view updates on pending replica
Generate view updates from a pending base replica if it's a reading
replica, i.e. it's in the last stage of transition write_both_read_new
before becoming the new base replica.

Previously we didn't generate view updates on a pending replica. The
problem with that is that when a base token is migrated from one replica
B1 to another B2, at one stage we generate view updates only from B1,
then at the next stage we generate view updates only from B2. During
this transition, it can happen that for some write neither B1 nor B2
generates a view update, because each one sees the other as the base
replica.

We fix this by generating view updates from both base replicas in the
phase before the transition. We can generate view updates on the pending
replica in this case, even if it requires read-before-write, because
it's in a stage where it contains all data and serves reads.

Fixes scylladb/scylladb#24292
2025-09-29 13:44:04 +02:00
Michał Chojnowski
55c4b89b88 sstables: make sstable::estimated_keys_for_range asynchronous
Currently, `sstable::estimated_keys_for_range` works by
checking what fraction of Summary is covered by the given
range, and multiplying this fraction by the total number of keys.
Since computing things on Summary doesn't involve I/O (because Summary
is always kept in RAM), this is synchronous.

In a later patch, we will modify `sstable::estimated_keys_for_range`
so that it can deal with sstables that don't have a Summary
(because they use BTI indexes instead of BIG indexes).
In that case, the function is going to compute the relevant fraction
by using the index instead of Summary. This will require making
the function asynchronous. This is what we do in this patch.

(The actual change to the logic of `sstable::estimated_keys_for_range`
will come in the next patch. In this one, we only make it asynchronous).
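
A sketch of the sync-to-async conversion, using seastar coroutines with simplified stand-in names:

```
#include <cstdint>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

struct dht_range {}; // stand-in for the real token-range type

class sstable_sketch {
    uint64_t _total_keys = 1000000;
public:
    // Before: synchronous, since Summary is always resident in RAM.
    uint64_t estimated_keys_for_range_sync(const dht_range&) const {
        double summary_fraction = 0.5; // fraction of Summary covered by the range
        return uint64_t(summary_fraction * _total_keys);
    }
    // After: same logic, but returning a future, so that the follow-up patch
    // can swap in an index walk (which performs I/O) for the Summary scan.
    seastar::future<uint64_t> estimated_keys_for_range(const dht_range& r) const {
        co_return estimated_keys_for_range_sync(r);
    }
};
```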
2025-09-29 13:01:21 +02:00
Michał Chojnowski
893eb4ca1f sstables: use BTI index for queries, when present and enabled
This patch teaches `sstable::make_index_reader` how to create
a BTI index reader, from the `Partitions.db` and `Rows.db`
components, if they exist (in which case they are opened by this point).
2025-09-29 13:01:21 +02:00
Michał Chojnowski
cdcf34b3a0 sstables: create and open BTI index files, when enabled
This patch adds code responsible for creation and opening
of BTI index components (Rows.db, Partitions.db) when
BTI index writing is enabled.

(It is enabled if the cluster feature is enabled and the relevant
config entry permits it).

The files are empty for now, and are never read.
We will populate and use them in following patches.
2025-09-29 13:01:21 +02:00
Ernest Zaslavsky
debc756794 treewide: Move transport related files to a transport directory
As requested in #22112, moved the files and fixed other includes and build system.
Moved files:
- generic_server.hh
- generic_server.cc
- protocol_server.hh

Fixes: #22112

This is a cleanup, no need to backport

Closes scylladb/scylladb#25090
2025-09-29 11:46:06 +03:00
Nadav Har'El
1aef733d48 Merge 'Alternator/cache expressions' from Szymon Malewski
Before this patch, every expression in Alternator's requests was parsed from a string into an adequate structure.
This patch enables caching, where input expression strings are mapped to parsed template structures.
Every new valid (parsable) expression is added to the cache. The cache has a limited (configurable) size - when it is reached, the least recently used entry is removed.
When a requested expression is in the cache, a copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values).
Caching is implemented for all expression types. The cache is per shard - shared for all operations, expression types, tables, users.
The default cache size is 2000 entries per shard; it is configurable via `alternator_max_expression_cache_entries_per_shard` (0 means the cache is disabled).
Basic metrics (total counts of hits and misses for each expression type, and the number of evicted entries) are implemented.
Cache features are tested in boost unit tests, and overall expression caching is tested with Python tests - both mostly rely on metrics.
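
For illustration, a minimal per-shard LRU string-keyed cache along these lines (not the actual `lru_string_map` API) could look like:

```
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

template <typename V>
class lru_cache_sketch {
    std::size_t _max_entries;
    std::list<std::pair<std::string, V>> _lru; // front = most recently used
    std::unordered_map<std::string,
        typename std::list<std::pair<std::string, V>>::iterator> _index;
public:
    explicit lru_cache_sketch(std::size_t max_entries) : _max_entries(max_entries) {}

    // Returns a copy, as described above: cached templates are copied and
    // then resolved (placeholders substituted) per request.
    std::optional<V> get(const std::string& key) {
        auto it = _index.find(key);
        if (it == _index.end()) {
            return std::nullopt; // cache miss
        }
        _lru.splice(_lru.begin(), _lru, it->second); // mark as recently used
        return it->second->second;
    }

    void put(std::string key, V value) {
        if (_max_entries == 0) {
            return; // cache disabled
        }
        if (auto it = _index.find(key); it != _index.end()) {
            it->second->second = std::move(value);
            _lru.splice(_lru.begin(), _lru, it->second);
            return;
        }
        if (_lru.size() >= _max_entries) { // evict least recently used
            _index.erase(_lru.back().first);
            _lru.pop_back();
        }
        _lru.emplace_front(std::move(key), std::move(value));
        _index[_lru.front().first] = _lru.begin();
    }
};
```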

refs #5023
`perf-alternator` test shows improvement (median):
| test | throughput | instructions_per_op | cpu_cycles_per_op | allocs_per_op |
| ------ | ---------------- | ----------------------------- | --------------------------- | -------------------  |
| read | +6.0% | -8.5% | -7.0% | -4.9% |
| write | +13.4% | -17.6% | -14.7% | -7.4% |
| write(lwt) | +12.7% | -7.9% | -6.9% | -2.8% |
| write_rmw | +5.4% | -10.5% | -7.3% | -4.1% |

"read" had a ProjectionExpression with 10 column names, "write" had a UpdateExpression with 10 column names and "write_rmw" had both ConditionExpression and UpdateExpression.

This patch also includes minor refactoring of other expressions related tests (https://github.com/scylladb/scylladb/issues/22494) - use `test_table_ss` instead of `test_table`.
Fixes #25855.

This is new feature - no backporting.

Closes scylladb/scylladb#25176

* github.com:scylladb/scylladb:
  alternator: use expression caching
  alternator: adds expression cache implementation
  utils: extend lru_string_map
  utils: add lru_string_map
  alternator/expressions: error on parsing empty update expression
  alternator/expressions: fix single value condition expression parsing
  test/alternator: use `test_table_ss` instead of `test_table` in expressions related tests.
2025-09-29 11:36:31 +03:00
Michael Litvak
6bc41926e2 view_builder: reduce log level for expected aborts during view creation
When draining the view builder, we abort ongoing operations using the
view builder's abort source, which may cause them to fail with
abort_requested_exception or raft::request_aborted exceptions.

Since these failures are expected during shutdown, reduce the log level
in add_new_view from 'error' to 'debug' for these specific exceptions
while keeping 'error' level for unexpected failures.

Closes scylladb/scylladb#26297
2025-09-28 22:55:07 +03:00
Avi Kivity
5b6570be52 Merge 'db/config: Add SSTable compression options for user tables' from Nikos Dragazis
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.

This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.

Fixes #25195.

Closes scylladb/scylladb#26003

* github.com:scylladb/scylladb:
  test/cluster: Add tests for invalid SSTable compression options
  test/boost: Add tests for SSTable compression config options
  main: Validate SSTable compression options from config
  db/config: Add SSTable compression options for user tables
  db/config: Prepare compression_parameters for config system
  compressor: Validate presence of sstable_compression in parameters
  compressor: Add missing space in exception message
2025-09-28 20:23:23 +03:00
Szymon Malewski
6ce7843774 alternator: use expression caching
Before this patch, every expression in Alternator's requests was parsed from a string into an adequate structure.
This patch enables caching: all calls to parse an expression (of any type) are proxied through the cache.
A new expression is added to the cache; when the cache size is exceeded, the least recently used entry is removed.
For existing entries, a copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values).
The cache is per shard - shared for all operations, expression types, tables, users.
The default cache size is 2000 entries per shard; it is configurable via `alternator_max_expression_cache_entries_per_shard` (0 means the cache is disabled).
The added Python tests are based on metrics.
2025-09-28 04:27:44 +02:00
Nikos Dragazis
e1d9c83406 db/config: Add SSTable compression options for user tables
ScyllaDB offers the `compression` DDL property for configuring
compression per user table (compression algorithm and chunk size). If
not specified, the default compression algorithm is the LZ4Compressor
with a 4KiB chunk size (refer to the default constructor for
`compression_parameters`). The same default applies to system tables as
well.

Add a new configuration option to allow customizing the default for user
tables. Use the previously hardcoded default as the new option's default
value.

Note that the option has no effect on ALTER TABLE statements. An altered
table either inherits explicit compression options from the CQL
statement, or maintains its existing options.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-09-26 12:02:00 +03:00
Botond Dénes
86ed627fc4 compaction: move code to namespace compaction
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables

There are cases where all three are used in the same file. This code used to
live in sstables/, and some of it still retains namespace sstables as a
holdover from that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.

This patch, although large, is mechanical, and only the following kinds of
changes are made:
* replace namespace sstables {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
  context

This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
2025-09-25 15:03:56 +03:00
Nikos Dragazis
a7e46974d4 db/config: Prepare compression_parameters for config system
SSTable compression is currently configurable only per table, via the
`compression` property in CREATE/ALTER TABLE statements. This is
represented internally via the `compression_parameters` class. We plan
to offer the same options via the configuration as well, to make the
default compression method for user tables configurable.

This patch prepares the ground by making the `compression_parameters`
usable as a `config_file::named_value`, namely (see the sketch after this list):

* Define an extraction operator (required by `boost::program_options`
  for parsing the options from command line).
* Define a formatter (required by `named_value::operator()`).
* Define a template specialization for `config_type_for` (required by
  `named_value` constructor).
* Define a yaml converter (required for parsing the options from
  scylla.yaml).
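
For illustration, here are the generic hook points from that list, for a hypothetical value type; the Scylla-specific `config_type_for` specialization and `named_value` wiring are omitted, and the field names are illustrative:

```
#include <istream>
#include <string>
#include <string_view>
#include <fmt/format.h>
#include <yaml-cpp/yaml.h>

struct compression_params_sketch {
    std::string algorithm = "LZ4Compressor";
    unsigned chunk_kb = 4;
};

// (1) extraction operator: lets boost::program_options parse "--option value"
std::istream& operator>>(std::istream& is, compression_params_sketch& p) {
    is >> p.algorithm; // real parsing is richer (algorithm + chunk size)
    return is;
}

// (2) formatter: lets the config system print the current value
template <>
struct fmt::formatter<compression_params_sketch> : fmt::formatter<std::string_view> {
    auto format(const compression_params_sketch& p, fmt::format_context& ctx) const {
        return fmt::format_to(ctx.out(),
            "{{sstable_compression: {}, chunk_length_in_kb: {}}}",
            p.algorithm, p.chunk_kb);
    }
};

// (4) yaml converter: lets yaml-cpp read the value from scylla.yaml
template <>
struct YAML::convert<compression_params_sketch> {
    static bool decode(const YAML::Node& node, compression_params_sketch& p) {
        if (node["sstable_compression"]) {
            p.algorithm = node["sstable_compression"].as<std::string>();
        }
        if (node["chunk_length_in_kb"]) {
            p.chunk_kb = node["chunk_length_in_kb"].as<unsigned>();
        }
        return true;
    }
};
```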

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-09-24 14:51:39 +03:00
Ernest Zaslavsky
5ba5aec1f8 treewide: Move mutation related files to a mutation directory
As requested in #22104, moved the files and fixed other includes and build system.

Moved files:
 - combine.hh
 - collection_mutation.hh
 - collection_mutation.cc
 - converting_mutation_partition_applier.hh
 - converting_mutation_partition_applier.cc
 - counters.hh
 - counters.cc
 - timestamp.hh

Fixes: #22104

This is a cleanup, no need to backport

Closes scylladb/scylladb#25085
2025-09-24 13:23:38 +03:00
Piotr Dulikowski
482ddfb3b4 Merge 'mv: handle mismatched base/view replica count caused by RF change' from Wojciech Mitros
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there are more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there are more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.

This patch adds a workaround for this scenario. If after one migration
we have more non-pending view replicas than base replicas, we add
the extra view replica to the pending replica list so that it gets an update anyway.

This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.

This patch also includes a test for this exact scenario, which is enforced by an injection.

Fixes https://github.com/scylladb/scylladb/issues/21492

Closes scylladb/scylladb#24396

* github.com:scylladb/scylladb:
  mv: handle mismatched base/view replica count caused by RF change
  mv: save the nodes used for pairing calculations for later reuse
  mv: move the decision about simple rack-aware pairing later
2025-09-23 08:10:08 +02:00
Dawid Mędrek
35f7d2aec6 db/batchlog: Drop batch if table has been dropped
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.

To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.

A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.
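
The shape of the fix, as a sketch with stand-in types rather than the actual `db/batchlog_manager.cc` code:

```
#include <stdexcept>

namespace db {
struct no_such_column_family : std::runtime_error {
    no_such_column_family() : std::runtime_error("no such column family") {}
};
}

struct batch {}; // stand-in for a batchlog entry

// Simulates replay against a table that has been dropped.
void replay_mutations(const batch&) { throw db::no_such_column_family(); }
void delete_batch(const batch&) { /* remove the row from system.batchlog */ }

void replay_one(const batch& b) {
    try {
        replay_mutations(b);
    } catch (const db::no_such_column_family&) {
        // The table is gone, so replay can never succeed; drop the batch,
        // just like the existing handling of a missing keyspace.
        delete_batch(b);
    }
}
```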

Fixes scylladb/scylladb#24806

Closes scylladb/scylladb#26057
2025-09-23 07:48:59 +02:00
Pavel Emelyanov
f6860d1de0 Merge 'mv: run view building worker fibers in streaming group' from Piotr Dulikowski
The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong.

Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit.
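
Roughly, the wrapping looks like this (the seastar API is real; the fiber function is a placeholder, kept with the misspelled name it has at this point in history):

```
#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>

// Placeholder for the worker's background fibers.
seastar::future<> start_backgroud_fibers() {
    return seastar::make_ready_future<>();
}

seastar::future<> start_in_streaming_group(seastar::scheduling_group streaming_sg) {
    // Anything spawned inside the lambda inherits streaming_sg instead of
    // the "main" scheduling group of the caller.
    return seastar::with_scheduling_group(streaming_sg, [] {
        return start_backgroud_fibers();
    });
}
```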

No need to backport, view build coordinator is not a part of any release yet.

Closes scylladb/scylladb#26122

* github.com:scylladb/scylladb:
  mv: fix typo in start_backgroud_fibers
  mv: run view building worker fibers in streaming group
2025-09-22 15:28:38 +03:00
Wojciech Mitros
d9b8278178 mv: handle mismatched base/view replica count caused by RF change
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there are more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there are more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.

This patch adds a workaround for this scenario. If after one migration
we have more non-pending view replicas than base replicas, we add
the extra view replica to the pending replica list so that it gets an update anyway.

This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.

This patch also includes a test for this exact scenario, which is enforced by an injection.

Fixes https://github.com/scylladb/scylladb/issues/21492
2025-09-22 12:50:16 +02:00
Wojciech Mitros
59c40a2edd mv: save the nodes used for pairing calculations for later reuse
In get_view_natural_endpoint() we start with the list of host_ids
from the effective replication maps, which we later translate to
locator::node to get the information about racks and datacenters.
We check all replicas, but we only store the ones relevant for
pairing, so for tablets, the ones in the same DC as the replica
sending the update.
In the next patch, we'll occasionally need to send cross-dc view
updates, so to avoid computing the nodes again, in this patch
we adjust the logic to prepare them in advance and save them so
that they can be later reused.
2025-09-22 12:45:24 +02:00
Wojciech Mitros
9d4449a492 mv: move the decision about simple rack-aware pairing later
We'll need to get the lists for the whole dc when fixing replica
count mismatches caused by RF changes, so let's first get these lists,
and only filter them later if we decide to use simple rack-aware pairing.
2025-09-22 12:45:24 +02:00
Avi Kivity
1258e7c165 Revert "Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski"
This reverts commit fe7e63f109, reversing
changes made to b5f3f2f4c5. It is causing
test.py failures around cqlpy.

Fixes #26163

Closes scylladb/scylladb#26174
2025-09-22 09:32:46 +03:00
Piotr Dulikowski
591a67c7e7 Merge 'view_builder: register view on all shards atomically' from Michael Litvak
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.

Previously, this happened independently on each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.

The reason we want to register all shards atomically is that if only
some of the shards were registered and we then restart and load the
status from the table, things don't work well, for multiple reasons.

One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.

This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only approximately half the
view. The problem is that we don't have enough information in
the tables to know that.

There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all token ranges.

By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.

Fixes https://github.com/scylladb/scylladb/issues/22989

backport not needed - the issue is probably not common and there's a workaround

Closes scylladb/scylladb#25790

* github.com:scylladb/scylladb:
  test: mv: add a test for view build interrupt during registration
  view_builder: register view on all shards atomically
2025-09-22 08:03:44 +02:00
Karol Nowacki
eedf506be5 vector_store_client: Rename vector_store_uri to vector_store_primary_uri
The configuration setting vector_store_uri is renamed to
vector_store_primary_uri according to the final design.
In the future, the vector_store_secondary_uri setting will
be introduced.

This setting now also accepts a comma-separated list of URIs to prepare
for future support for redundancy and load balancing. Currently, only the
first URI in the list is used.

This change must be included before the next release.
Otherwise, users will be affected by a breaking change.

References: VECTOR-187

Closes scylladb/scylladb#26033
2025-09-21 16:33:10 +03:00
Michael Litvak
3dffb8e0dc test: mv: add a test for view build interrupt during registration
Add a new test that reproduces issue #22989. The test starts view
building and interrupts it by restarting the node while some shards
registered their status and some didn't.
2025-09-21 10:39:30 +02:00
Michael Litvak
6043409c31 view_builder: register view on all shards atomically
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.

Previously, this happened independently on each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.

The reason we want to register all shards atomically is that if only
some of the shards were registered and we then restart and load the
status from the table, things don't work well, for multiple reasons.

One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.
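
In miniature, the inference that goes wrong (illustrative code, not the view builder's):

```
#include <algorithm>
#include <vector>

// Previous shard count as inferred from the rows present in
// scylla_views_builds_in_progress: max(shard_id) + 1.
unsigned previous_shard_count(const std::vector<unsigned>& registered_shard_ids) {
    if (registered_shard_ids.empty()) {
        return 0;
    }
    return *std::max_element(registered_shard_ids.begin(),
                             registered_shard_ids.end()) + 1;
}
// With two real shards where only shard 0 registered:
// previous_shard_count({0}) == 1, so the node looks like it used to have a
// single shard, and the reshard path runs with wrong inputs.
```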

This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only approximately half the
view. The problem is that we don't have enough information in
the tables to know that.

There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all token ranges.

By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.

Fixes scylladb/scylladb#22989
2025-09-21 10:39:05 +02:00
Michał Chojnowski
9e70df83ab db: get rid of sstables-format-selector
Our sstable format selection logic is weird, and hard to follow.

If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
   doesn't do anything, but in the past it used to disable
   cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
   versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
   etc).
3. There is a cluster feature for each version:
   ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
   (Currently all sstable version features have been grandfathered,
   and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
   latest enabled sstable version. (Why? Isn't this directly derived
   from cluster features anyway?)
5. There's `sstable_manager::_format` which contains the
   sstable version to be used for new writes.
   This field is updated by `sstables_format_selector`
   based on cluster features and the `system.scylla_local` entry.

I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
   data. For example, range tombstones with an infinite bound are only
   supported by sstables since version "mc". So if a range tombstone
   with an infinite bound exists somewhere in the dataset,
   the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
   is enabled. (Otherwise new sstables might become unreadable if they
   are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format upgrades if he wishes.

So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponding cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` to some fixed value.

The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.

This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
   After the patch we no longer update or read it.
   As far as I understand, the purpose of this entry is to prevent
   unwanted format downgrades (which is something cluster features
   are designed for) and it's updated if and only if relevant
   cluster features are updated. So there's no reason to have it,
   we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
   Without the `system.scylla_local` around, it's just a glorified
   feature listener.
3. The format selection logic is moved into `sstable_manager`.
   It already sees the `db::config` and the `gms::feature_service`.
   For the foreseeable future, the knowledge of enabled cluster features
   and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
   inhibit cluster features. Instead, it is intended to select the
   format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with the "highest supported" format
   (which used to be set by `sstables_format_selector`), we write
   them with the "preferred" format, which is determined by
   `sstable_manager` based on the combination of enabled features
   and the current value of `sstable_format` (see the sketch below).
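
A rough sketch of that "preferred format" rule (illustrative names, not sstable_manager's real API):

```
#include <algorithm>

enum class sstable_version { mc, md, me, ms }; // ordered oldest to newest

// "Preferred" format: honor the live-updatable sstable_format config entry,
// but never exceed the newest version enabled via cluster features.
sstable_version preferred_format(sstable_version configured,
                                 sstable_version newest_feature_enabled) {
    return std::min(configured, newest_feature_enabled);
}
```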

Closes scylladb/scylladb#26092

[avi: Pavel found the reason for the scylla_local entry -
      it predates stable storage for cluster features]
2025-09-19 16:17:56 +03:00
Pavel Emelyanov
a1ea553fe1 code: Replace distributed<> with sharded<>
The latter is recommended in seastar, and the former was left as
a compatibility alias. The latest seastar explicitly marks it as deprecated, so
once the submodule is updated, compilation logs will explode.

Most of the patch is generated with

    for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
    for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done

and a small manual change in test/perf/perf.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26136
2025-09-19 12:22:51 +02:00
Avi Kivity
fe7e63f109 Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski
This patch series:
 - Increases the number of allowed scheduling groups to allow creation of `sl:driver`
 - Implements `create_driver_service_level` that creates `sl:driver` with shares=200 if it wasn't already created
 - Implements creation of `sl:driver` for new systems and tests in `raft_initialize_discovery_leader`
 - Modifies `topology_coordinator` to create `sl:driver` after upgrades.
 - Implements using `sl:driver` for new connections in `transport/server`
 - Adds recognition of drivers' control connections to `transport/server` and forces them to keep using `sl:driver`.
 - Adds tests to verify the new functionality
 - Modifies existing tests to let them pass after `sl:driver` is added
 - Modifies the documentation to contain new `sl:driver`

The changes were evaluated by a test with the following scenario ([test_connections-sl-driver.py](https://github.com/user-attachments/files/22021273/test_connections-sl-driver.py)):
 - Start ScyllaDB with one node
 - Create 1000 keyspaces, 1 table in each keyspace
 - Start `cassandra-stress` (`-rate threads=50  -mode native cql3`)
 - Run connection storm with 1000 session (100 python processes, 10 sessions each)

The maximum latency during the connection storm dropped **from 224.94ms to 41.43ms** (those numbers are averages from 20 test executions, where max latency was in [140ms, 361ms] before the change and in [31.4ms, 61.5ms] after).

A snippet of the cassandra-stress output from the moment of the connection storm:
Before:
```
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
...
total,        789206,   85887,   85887,   85887,     0.6,     0.3,     2.0,     2.0,     2.5,     5.0,    9.0,  0.09679,      0,      0,       0,       0,       0,       0
total,        909322,  120116,  120116,  120116,     0.4,     0.2,     1.9,     2.0,     2.1,     3.1,   10.0,  0.09053,      0,      0,       0,       0,       0,       0
total,        964392,   55070,   55070,   55070,     0.9,     0.4,     2.0,     4.5,     7.7,    18.9,   11.0,  0.09203,      0,      0,       0,       0,       0,       0
total,        975705,   11313,   11313,   11313,     4.4,     3.5,     6.5,    24.5,    82.7,    83.0,   12.0,  0.11713,      0,      0,       0,       0,       0,       0
total,        987548,   11843,   11843,   11843,     4.2,     3.5,     6.5,    33.7,    48.6,    51.5,   13.0,  0.13366,      0,      0,       0,       0,       0,       0
total,        995422,    7874,    7874,    7874,     6.3,     4.0,     7.7,    85.6,   112.9,   113.5,   14.0,  0.14753,      0,      0,       0,       0,       0,       0
total,       1007228,   11806,   11806,   11806,     4.3,     3.5,     6.5,    29.1,    43.8,    87.1,   15.0,  0.15598,      0,      0,       0,       0,       0,       0
total,       1012840,    5612,    5612,    5612,     8.2,     5.0,    11.5,   121.8,   166.6,   170.1,   16.0,  0.16535,      0,      0,       0,       0,       0,       0
total,       1016186,    3346,    3346,    3346,    13.4,     7.4,    20.1,   204.9,   207.6,   210.4,   17.0,  0.17405,      0,      0,       0,       0,       0,       0
total,       1025462,    9276,    9276,    9276,     6.3,     3.9,     9.6,    74.6,   206.8,   210.0,   18.0,  0.17800,      0,      0,       0,       0,       0,       0
total,       1035979,   10517,   10517,   10517,     4.8,     3.5,     6.7,    38.5,    82.6,    83.0,   19.0,  0.18120,      0,      0,       0,       0,       0,       0
total,       1047488,   11509,   11509,   11509,     4.3,     3.5,     6.0,    32.6,    72.3,    74.0,   20.0,  0.18334,      0,      0,       0,       0,       0,       0
total,       1077456,   29968,   29968,   29968,     1.7,     1.6,     2.9,     3.6,     7.0,     8.2,   21.0,  0.17943,      0,      0,       0,       0,       0,       0
total,       1105490,   28034,   28034,   28034,     1.8,     1.8,     3.5,     4.6,     5.3,    13.8,   22.0,  0.17609,      0,      0,       0,       0,       0,       0
total,       1132221,   26731,   26731,   26731,     1.9,     1.8,     3.8,     5.2,     8.4,    11.1,   23.0,  0.17314,      0,      0,       0,       0,       0,       0
total,       1162149,   29928,   29928,   29928,     1.7,     1.7,     3.0,     4.5,     8.0,     9.1,   24.0,  0.16950,      0,      0,       0,       0,       0,       0
...
```

After:
```
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
...
total,        822863,   94379,   94379,   94379,     0.5,     0.3,     2.0,     2.0,     2.1,     3.7,    9.0,  0.06669,      0,      0,       0,       0,       0,       0
total,        937337,  114474,  114474,  114474,     0.4,     0.2,     2.0,     2.0,     2.1,     3.4,   10.0,  0.06301,      0,      0,       0,       0,       0,       0
total,        986630,   49293,   49293,   49293,     1.0,     1.0,     2.0,     2.1,    17.9,    19.0,   11.0,  0.07318,      0,      0,       0,       0,       0,       0
total,       1026734,   40104,   40104,   40104,     1.2,     1.0,     2.0,     2.2,     6.3,     7.1,   12.0,  0.08410,      0,      0,       0,       0,       0,       0
total,       1066124,   39390,   39390,   39390,     1.3,     1.0,     2.0,     2.2,     2.6,     3.4,   13.0,  0.09108,      0,      0,       0,       0,       0,       0
total,       1103082,   36958,   36958,   36958,     1.3,     1.1,     2.1,     2.5,     3.1,     4.2,   14.0,  0.09643,      0,      0,       0,       0,       0,       0
total,       1141987,   38905,   38905,   38905,     1.3,     1.0,     2.0,     2.4,    11.4,    12.7,   15.0,  0.09894,      0,      0,       0,       0,       0,       0
total,       1180023,   38036,   38036,   38036,     1.3,     1.0,     2.0,     3.7,     5.6,     7.1,   16.0,  0.10070,      0,      0,       0,       0,       0,       0
total,       1216481,   36458,   36458,   36458,     1.4,     1.0,     2.1,     3.6,     4.7,     5.0,   17.0,  0.10210,      0,      0,       0,       0,       0,       0
total,       1256819,   40338,   40338,   40338,     1.2,     1.0,     2.0,     2.2,     3.5,     5.4,   18.0,  0.10173,      0,      0,       0,       0,       0,       0
total,       1295122,   38303,   38303,   38303,     1.3,     1.0,     2.0,     2.4,    21.0,    21.1,   19.0,  0.10136,      0,      0,       0,       0,       0,       0
total,       1334743,   39621,   39621,   39621,     1.3,     1.0,     2.0,     2.3,     3.3,     4.0,   20.0,  0.10055,      0,      0,       0,       0,       0,       0
total,       1375579,   40836,   40836,   40836,     1.2,     1.0,     2.0,     2.1,     3.4,     5.7,   21.0,  0.09927,      0,      0,       0,       0,       0,       0
total,       1415576,   39997,   39997,   39997,     1.2,     1.0,     2.0,     2.3,     3.2,     4.1,   22.0,  0.09807,      0,      0,       0,       0,       0,       0
total,       1449268,   33692,   33692,   33692,     1.5,     1.4,     2.5,     3.2,     4.2,     5.6,   23.0,  0.09800,      0,      0,       0,       0,       0,       0
total,       1471873,   22605,   22605,   22605,     2.2,     2.0,     4.8,     5.9,     7.0,     7.9,   24.0,  0.10015,      0,      0,       0,       0,       0,       0
...
```

Fixes: https://github.com/scylladb/scylladb/issues/24411

This is a new feature, so no backport needed.

Closes scylladb/scylladb#25412

* github.com:scylladb/scylladb:
  docs: workload-prioritization: add driver service level
  test: add test to verify use of `sl:driver`
  transport: use `sl:driver` to handle driver's control connections
  transport: whitespace only change in update_scheduling_group
  transport: call update_scheduling_group for non-auth connections
  generic_server: transport: start using `sl:driver` for new connections
  test: add test_desc_* for driver service level
  test: service_levels: add tests for sl:driver creation and removal
  test: add reload_raft_topology_state() to ScyllaRESTAPIClient
  service_level_controller: automatically create `sl:driver`
  service_level_controller: methods to create driver service level
  service_level_controller: handle special sl:driver in DESC output
  topology_coordinator: add service_level_controller reference
  system_keyspace: add service_level_driver_created
  test: add MAX_USER_SERVICE_LEVELS
2025-09-18 19:45:17 +03:00
Piotr Dulikowski
fb0e5784e4 mv: fix typo in start_backgroud_fibers
Letter "n" was missing in this name.
2025-09-18 15:50:16 +02:00
Piotr Dulikowski
5f55787e50 Merge 'CDC with tablets' from Michael Litvak
initial implementation to support CDC in tablets-enabled keyspaces.

The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part, except for "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream-switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with a timestamp that is sufficiently far into the future.

In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces. Tablets are allocated for the CDC table, and a stream is created per tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets

Fixes https://github.com/scylladb/scylladb/issues/22576

backport not needed - new feature

update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136

Closes scylladb/scylladb#23795

* github.com:scylladb/scylladb:
  docs/dev: update CDC dev docs for tablets
  doc: update CDC docs for tablets
  test: cluster_events: enable add_cdc and drop_cdc
  test/cql: enable cql cdc tests to run with tablets
  test: test_cdc_with_alter: adjust for cdc with tablets
  test/cqlpy: adjust cdc tests for tablets
  test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
  cdc: enable cdc with tablets
  topology coordinator: change streams on tablet split/merge
  cdc: virtual tables for cdc with tablets
  cdc: generate_stream_diff helper function
  cdc: choose stream in tablets enabled keyspaces
  cdc: rename get_stream to get_vnode_stream
  cdc: load tablet streams metadata from tables
  cdc: helper functions for reading metadata from tables
  cdc: colocate cdc table with base
  cdc: remove streams when dropping CDC table
  cdc: create streams when allocating tablets
  migration_listener: add on_before_allocate_tablet_map notification
  cdc: notify when creating or dropping cdc table
  cdc: move cdc table creation to pre_create
  cdc: add internal tables for cdc with tablets
  cdc: add cdc_with_tablets feature flag
  cdc: add is_log_schema helper
2025-09-18 13:39:37 +02:00
Piotr Dulikowski
4ed045a15c Merge 'db/view/view_building_worker: wrap shared_sstable in foreign_ptr' from Michał Jadwiszczak
When a staging sstable is registered with the view building worker, it needs to make a round trip from its original shard to shard 0
(in order to create a view building task) and back (to be eventually processed).
Until now this was done using a plain `sstables::shared_sstable` (= `lw_shared_ptr`), which is not safe to move between shards.

This patch fixes this by wrapping the pointer in `foreign_ptr` and obtaining the necessary information (owner shard, last token) on the original shard (instead of on shard 0).
All of those objects are then put into a freshly introduced structure, `staging_sstable_task_info`, which can be safely moved between shards.

Fixes https://github.com/scylladb/scylladb/issues/25859

View building coordinator isn't present in any release yet, no backport needed.

Closes scylladb/scylladb#25832

* github.com:scylladb/scylladb:
  db/view/view_building_worker: fix indent
  db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr`
  db/view/view_building_worker: use table id in `register_staging_sstable_tasks()`
  db/view/view_building_worker: move helper functions higher
2025-09-18 10:24:27 +02:00
Piotr Dulikowski
b71af71ab5 Merge 'db/view/view_building_worker: change sharded<abort_source> to local abort_source' from Michał Jadwiszczak
Previously, the sharded abort_sources were stopped at the end of `batch::do_work()`, which runs in parallel with the view building worker's main loop.
This led to races because the worker may call `batch::abort()`, which accesses the abort_sources.

This patch solves this by changing the `sharded<abort_source>` into a plain `abort_source`.
Since `batch::do_work()` is now executed on the tasks' shard, all abort source checks are also done on the tasks' shard.
The only place where shard 0 uses the abort source is `batch::abort()`, but this method now does `smp::submit_to(replica.shard, [request abort])`, so the abort source is used on the tasks' shard exclusively.

Fixes https://github.com/scylladb/scylladb/issues/25805
Fixes https://github.com/scylladb/scylladb/issues/26045

View building coordinator hasn't been released yet, so no backport needed.

Closes scylladb/scylladb#26059

* github.com:scylladb/scylladb:
  db/view/view_building_worker: fix indents
  db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source`
  db/view/view_building_worker: execute entire `batch::do_work` on tasks shard
  db/view/view_building_worker: store reference to sharded worker in batch
2025-09-18 10:11:20 +02:00
Andrzej Jackowski
dd9b4c64d2 system_keyspace: add service_level_driver_created
This commit extends the system.scylla_local table with an additional
key/value pair that can be used later in this patch series to
record that `sl:driver` was already created. The purpose
of storing this information is to ensure that `sl:driver` is
not recreated after being intentionally removed.

A new mutation is included in `register_raft_pull_snapshot` to keep
`service_level_driver_created` in the state machine snapshot, which is
required for proper propagation of the value when a new node is added
to the cluster.

Refs: scylladb/scylladb#24411
2025-09-18 09:28:32 +02:00
Michał Jadwiszczak
1d8b41a51d db/view/view_building_worker: fix indents 2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
99db5a6c30 db/view/view_building_worker: change sharded<abort_source> to local abort_source
Previously, the sharded abort_sources were stopped at the end of batch::do_work(),
which runs in parallel with the view building worker's main loop.
This led to races because the worker may call batch::abort(),
which accesses the abort_sources.

This patch solves this by changing the `sharded<abort_source>` into a plain
`abort_source`.
Since `batch::do_work()` is now executed on the tasks' shard,
all abort source checks are also done on the tasks' shard.
The only place where shard 0 uses the abort source is `batch::abort()`,
but this method now does `smp::submit_to(replica.shard, [request abort])`,
so the abort source is used on the tasks' shard exclusively.
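
Schematically, the cross-shard abort looks like this (the seastar calls are real; the batch type is a stand-in):

```
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

struct batch_sketch {
    unsigned shard;           // the tasks' shard
    seastar::abort_source as; // created, checked and triggered on `shard` only
};

// Called from shard 0: hop to the owning shard instead of touching the
// abort source from a foreign shard.
seastar::future<> abort_batch(batch_sketch& b) {
    return seastar::smp::submit_to(b.shard, [&b] {
        b.as.request_abort();
    });
}
```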

Fixes scylladb/scylladb#25805
Fixes scylladb/scylladb#26045
2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
7b9db335c0 db/view/view_building_worker: execute entire batch::do_work on tasks shard
This change will allow us to get rid of problematic
`sharded<abort_source>` and use local `abort_source` instead.
2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
2f65af8aa7 db/view/view_building_worker: store reference to sharded worker in batch
Change the reference to the view building worker in batch to the sharded
container. In the next commits, I'm going to execute `do_work()`
exclusively on the task's target shard, where the sharded reference will
be more useful.
2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
e4a0de53ea db/view/view_building_worker: fix indent 2025-09-18 02:57:36 +02:00
Michał Jadwiszczak
7dfb76f9a7 db/view/view_building_worker: wrap shared_sstable in foreign_ptr
When a staging sstable is registered with the view building worker,
it needs to make a round trip from its original shard to shard 0
(in order to create a view building task) and back (to be eventually
processed).
Until now this was done using a plain `sstables::shared_sstable`
(= `lw_shared_ptr`), which is not safe to move between shards.

This patch fixes this by wrapping the pointer in `foreign_ptr` and
obtaining the necessary information (owner shard, last token) on the
original shard (instead of on shard 0).
All of those objects are then put into a freshly introduced structure,
`staging_sstable_task_info`, which can be safely moved between shards.
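
Schematically (seastar's `foreign_ptr` is real; the sstable type is a stand-in):

```
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/smp.hh>

struct sstable_stub { // stand-in for sstables::sstable
    long last_token() const { return 42; }
};
using shared_sstable = seastar::lw_shared_ptr<sstable_stub>;

// Everything needed by shard 0 is captured by value; the pointer itself is
// wrapped so its eventual destruction happens back on the owner shard.
struct staging_sstable_task_info {
    seastar::foreign_ptr<shared_sstable> sst;
    unsigned owner_shard;
    long last_token;
};

// Must run on the sstable's original shard.
staging_sstable_task_info make_task_info(shared_sstable sst) {
    auto token = sst->last_token(); // read before handing the pointer off
    return staging_sstable_task_info{
        .sst = seastar::make_foreign(std::move(sst)),
        .owner_shard = seastar::this_shard_id(),
        .last_token = token,
    };
}
```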

Fixes scylladb/scylladb#25859
2025-09-18 02:57:36 +02:00
Michał Jadwiszczak
50678030c0 db/view/view_building_worker: use table id in register_staging_sstable_tasks()
There is no need to pass the pointer only to get the id of the table.
2025-09-18 02:57:35 +02:00
Michał Jadwiszczak
b44c223d47 db/view/view_building_worker: move helper functions higher
So they can be used in
`view_building_worker::register_staging_sstable_tasks()`.
2025-09-18 02:57:35 +02:00
Ernest Zaslavsky
a1f18a8883 treewide: Move schema related files to a schema directory
As requested in #22111, moved the files and fixed other includes and build system.
Moved files:
- frozen_schema.hh
- frozen_schema.cc
- schema_mutations.hh
- schema_mutations.cc
- column_computation.hh

Fixes: #22111

Closes scylladb/scylladb#25089
2025-09-17 17:31:05 +03:00