scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 03:20:37 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	1e0487bd57	table: Add formatter for group_id argument in tablet merge exception message Fixes: SCYLLADB-1432 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29143 (cherry picked from commit `78f5bab7cf`) Closes scylladb/scylladb#29412 Closes scylladb/scylladb#29453	2026-04-16 10:57:37 +03:00
Pavel Emelyanov	e212762ab7	database: Rate limit all tokens from a range The limiter scans ranges to decide whether or not to rate-limit the query. However, when considering each range only the front one's token is accounted. This looks like a misprint. The limiter was introduced in `cc9a2ad41f` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29050 (cherry picked from commit `8b1ca6dcd6`) Closes scylladb/scylladb#29107 Closes scylladb/scylladb#29194	2026-03-24 16:04:01 +02:00
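To illustrate the misprint described in the commit above, here is a minimal sketch of accounting one token per range. `token_range` and `tokens_to_account` are hypothetical names for this illustration, not the actual limiter API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a partition range; only its token matters here.
struct token_range {
    int64_t token;
};

// Collects the token of every range, so each one can be accounted by a limiter.
std::vector<int64_t> tokens_to_account(const std::vector<token_range>& ranges) {
    std::vector<int64_t> tokens;
    tokens.reserve(ranges.size());
    for (const auto& range : ranges) {
        // The misprint would amount to reading ranges.front().token here,
        // which accounts the first range's token once per iteration.
        tokens.push_back(range.token);
    }
    return tokens;
}
```

With three ranges, the function yields three distinct tokens rather than the first token repeated three times.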
Amnon Heiman	1d8089408a	distributed_loader: system_replicated_keys as system keyspace This patch adds system_replicated_keys to the list of known system keyspaces. Signed-off-by: Amnon Heiman <amnon@scylladb.com> (cherry picked from commit `c6d1c63ddb`)	2026-01-19 14:59:26 +00:00
Łukasz Paszkowski	c04007c755	database: Log message after critical_disk_utilization mode is set This is a follow-up to the previous fix: https://github.com/scylladb/scylladb/pull/26030 The test test_user_writes_rejection starts a 3-node cluster and creates a large file on one of the nodes, to trigger the out-of-space prevention mechanism, which should reject writes on that node. It waits for the log message 'Setting critical disk utilization mode: true' and then executes a write expecting the node to reject it. Currently, the message is logged before the `_critical_disk_utilization` variable is actually updated. This causes the test to fail sporadically if it runs quickly enough. The fix splits the logging into two steps: 1. "Asked to set critical disk utilization mode" - logged before any action 2. "Set critical disk utilization mode" - logged after `_critical_disk_utilization` has been updated The tests are updated to wait for the second message. Fixes https://github.com/scylladb/scylladb/issues/26004 Closes scylladb/scylladb#26392 (cherry picked from commit `7ec369b900`) Closes scylladb/scylladb#26626	2026-01-19 06:42:01 +02:00
Patryk Jędrzejczak	7feafe9a62	Merge '[Backport 2025.4] database: truncate_table_on_all_shards: consider can_flush on all shards' from Scylladb[bot] Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them. This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged). Also, change database_test::do_with_some_data to use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`. Fixes #27639 * The issue has always existed and might cause data loss due to wrongly clearing the memtable, so it needs a backport to all live versions - (cherry picked from commit `ec4069246d`) - (cherry picked from commit `5be6b80936`) - (cherry picked from commit `0342a24ee0`) - (cherry picked from commit `02ee341a03`) - (cherry picked from commit `2a803d2261`) - (cherry picked from commit `93b827c185`) - (cherry picked from commit `ebd667a8e0`) Parent PR: #27643 Closes scylladb/scylladb#28074 * https://github.com/scylladb/scylladb: test: database_test: do_with_some_data: randomize keys database: truncate_table_on_all_shards: drop outdated TODO comment database: truncate_table_on_all_shards: consider can_flush on all shards memtable_list: unify can_flush and may_flush test: database_test: add test_flush_empty_table_waits_on_outstanding_flush replica: table, storage_group, compaction_group: add needs_flush test: database_test: do_with_some_data_in_thread: accept void callback function	2026-01-12 11:19:41 +01:00
Benny Halevy	67c23add98	database: truncate_table_on_all_shards: drop outdated TODO comment The comment was added in `83323e155e` Since then, table::seal_active_memtable was improved to guarantee waiting on outstanding flushes on success (See `d55a2ac762`), so we can remove this TODO comment (it is also not covered by any issue, so nobody is planning to ever work on it). Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `93b827c185`)	2026-01-09 08:12:25 +02:00
Benny Halevy	9ff984f0c9	database: truncate_table_on_all_shards: consider can_flush on all shards can_flush might return a different value for each shard so check it right before deciding whether to flush or clear a memtable shard. Note that under normal condition can_flush would always return true now that it checks only the presence of the seal memtable function rather than check memtable_list::empty(). Fixes #27639 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2a803d2261`)	2026-01-09 08:11:58 +02:00
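The per-shard decision described in the commit above can be sketched as follows. This is a simplified model under stated assumptions: `shard_memtables`, `truncate_action` and `plan_truncate` are hypothetical names, and the real code runs on Seastar shards rather than over a plain vector.

```cpp
#include <vector>

// What truncate does with the memtables of one shard.
enum class truncate_action { flush, clear };

// Hypothetical per-shard state; can_flush stands in for table::can_flush().
struct shard_memtables {
    bool can_flush;
};

// Decides flush-vs-clear for each shard from that shard's own can_flush value,
// instead of reusing the value observed on the coordinator shard.
std::vector<truncate_action> plan_truncate(const std::vector<shard_memtables>& shards) {
    std::vector<truncate_action> plan;
    plan.reserve(shards.size());
    for (const auto& shard : shards) {
        plan.push_back(shard.can_flush ? truncate_action::flush : truncate_action::clear);
    }
    return plan;
}
```

The point of the sketch is only that the check happens once per shard, right before the action is chosen for that shard.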
Benny Halevy	ebfdb83270	memtable_list: unify can_flush and may_flush Now that we have a unit test proving that it's safe to flush an empty memtable list there is no need to distinguish between may_flush and can_flush. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `02ee341a03`)	2026-01-09 08:09:06 +02:00
Benny Halevy	e9f76e46d1	test: database_test: add test_flush_empty_table_waits_on_outstanding_flush Test that table::flush waits on outstanding flushes, even if the active memtable is empty Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `0342a24ee0`)	2026-01-09 08:09:04 +02:00
Benny Halevy	e9ea043980	replica: table, storage_group, compaction_group: add needs_flush Table needs flush if not all its memtable lists are empty. To be used in the next patch for a unit test. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `5be6b80936`)	2026-01-09 08:04:58 +02:00
Aleksandra Martyniuk	50e2a1a9b0	replica: database: change type of tables_metadata::_ks_cf_to_uuid If there is a lot of tables, a node reports oversized allocation in _ks_cf_to_uuid of type flat_hash_map. Change the type to std::unordered_map to prevent oversized allocations. Fixes: https://github.com/scylladb/scylladb/issues/26787. Closes scylladb/scylladb#27165 (cherry picked from commit `19a7d8e248`) Closes scylladb/scylladb#27200	2026-01-08 16:36:34 +02:00
Botond Dénes	c697c6633b	Merge 'Remove noexcept from storage_group and table functions to allow exception propagation' from Tomasz Grabiec Fixed a critical bug where `storage_group::for_each_compaction_group()` was incorrectly marked `noexcept`, causing `std::terminate` when actions threw exceptions (e.g., `utils::memory_limit_reached` during memory-constrained reader creation). Changes made: 1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation 2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group) 3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups()) 4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures 5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers Rationale: There's no reason to kill the server if these functions throw. For coroutines returning futures, `noexcept` is appropriate because Seastar automatically captures exceptions and returns them as exceptional futures. For other functions, proper exception handling allows the system to recover gracefully instead of terminating. Fixes #27475 Closes scylladb/scylladb#27476 * github.com:scylladb/scylladb: replica: Remove unnecessary noexcept replica: Remove noexcept from compaction_groups() functions replica: Remove noexcept from storage_group::for_each_compaction_group (cherry picked from commit `730eca5dac`) Closes scylladb/scylladb#27914	2025-12-30 14:23:30 +01:00
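A minimal sketch of why dropping `noexcept` matters in the merge above: when the visiting function is not `noexcept`, an exception thrown by the action reaches the caller and can be handled, whereas with `noexcept` the same throw would call `std::terminate`. The `_stub` types are hypothetical simplifications, and `std::runtime_error` stands in for `utils::memory_limit_reached`.

```cpp
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical, heavily simplified stand-ins for the replica classes.
struct compaction_group_stub {
    int id;
};

class storage_group_stub {
    std::vector<compaction_group_stub> _groups{{0}, {1}, {2}};
public:
    // Deliberately not noexcept: the action may throw, and the exception
    // must be allowed to propagate to the caller.
    void for_each_compaction_group(
            const std::function<void(const compaction_group_stub&)>& action) const {
        for (const auto& cg : _groups) {
            action(cg);
        }
    }
};

// Returns true if the exception thrown by the action was caught by the caller.
bool caller_recovers_from_throwing_action() {
    storage_group_stub sg;
    try {
        sg.for_each_compaction_group([](const compaction_group_stub& cg) {
            if (cg.id == 1) {
                throw std::runtime_error("memory limit reached");
            }
        });
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

Marking `for_each_compaction_group` as `noexcept` in this sketch would turn the recoverable error into process termination, which is the behaviour the merge removes.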
Avi Kivity	4db6d3e924	database: fix overflow when computing data distribution over shards We store the per-shard chunk count in a uint64_t vector global_offset, and then convert the counts to offsets with a prefix sum: ```c++ // [1, 2, 3, 0] --> [0, 1, 3, 6] std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus()); ``` However, std::exclusive_scan takes the accumulator type from the initial value, 0, which is an int, instead of from the range being iterated, which is of uint64_t. As a result, the prefix sum is computed as a 32-bit integer value. If it exceeds 0x8000'0000, it becomes negative. It is then extended to 64 bits and stored. The result is a huge 64-bit number. Later on we try to find an sstable with this chunk and fail, crashing on an assertion. An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57 The fix is simple: the initial value is passed as uint64_t instead of int. Fixes https://github.com/scylladb/scylladb/issues/27417 Closes scylladb/scylladb#27418 (cherry picked from commit `9696ee64d0`)	2025-12-04 20:17:19 +02:00
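The fix in the commit above can be sketched as a small standalone function. `counts_to_offsets` is a hypothetical name for this illustration; the relevant detail is that the initial value passed to `std::exclusive_scan` is a `uint64_t`, so the running sum is kept in 64 bits.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Converts per-shard chunk counts into starting offsets with an exclusive
// prefix sum, e.g. [1, 2, 3, 0] --> [0, 1, 3, 6].
std::vector<uint64_t> counts_to_offsets(std::vector<uint64_t> counts) {
    // std::exclusive_scan takes its accumulator type from the initial value.
    // Passing a plain 0 (an int) would narrow the running sum to int on every
    // step; passing uint64_t(0) keeps it at the width of the elements.
    std::exclusive_scan(counts.begin(), counts.end(), counts.begin(),
                        uint64_t(0), std::plus());
    return counts;
}
```

With the 64-bit initial value, a running sum that reaches 0x8000'0000 or more is stored unchanged instead of turning into a huge sign-extended value.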
Raphael S. Carvalho	f2ee409fdd	replica: Fail timed-out single-key read on cleaned up tablet replica Consider the following: 1) single-key read starts, blocks on replica e.g. waiting for memory. 2) the same replica is migrated away 3) single-key read expires, coordinator abandons it, releases erm. 4) migration advances to cleanup stage, barrier doesn't wait on timed-out read 5) compaction group of the replica is deallocated on cleanup 6) that single-key read resumes, but doesn't find sstable set (post cleanup) 7) with abort-on-internal-error turned on, node crashes It's fine for abandoned (= timed out) reads to fail, since the coordinator is gone. For active reads (non timed out), the barrier will wait for them since their coordinator holds erm. This solution consists of failing reads whose underlying tablet replica has been cleaned up, by just converting the internal error to a plain exception. Fixes #26229. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#27078 (cherry picked from commit `74ecedfb5c`) Closes scylladb/scylladb#27158	2025-11-21 17:50:21 +03:00
Dawid Mędrek	a4fd7019e3	replica/database: Fix description of `validate_tablet_views_indexes` The current description is not accurate: the function doesn't throw an exception if there's an invalid materialized view. Instead, it simply logs the keyspaces that violate the requirement. Furthermore, the experimental feature `views-with-tablets` is no longer necessary for considering a materialized view as valid. It was dropped in scylladb/scylladb@b409e85c20. The replacement for it is the cluster feature `VIEWS_WITH_TABLETS`. Fixes scylladb/scylladb#26420 Closes scylladb/scylladb#26421 (cherry picked from commit `a9577e4d52`) Closes scylladb/scylladb#26476	2025-10-14 11:52:34 +03:00
Dawid Mędrek	2bdf792f8e	view: Stop requiring experimental feature We modify the requirements for using materialized views in tablet-based keyspaces. Before, it was necessary to enable the configuration option `rf_rack_valid_keyspaces`, having the cluster feature `VIEWS_WITH_TABLETS` enabled, and using the experimental feature `views-with-tablets`. We drop the last requirement. We adjust code to that change and provide a new validation test. We also update the user documentation to reflect the changes. Fixes scylladb/scylladb#23030 (cherry picked from commit `b409e85c20`)	2025-10-06 13:19:54 +00:00
Dawid Mędrek	2e2d1f17bb	db/view: Verify valid configuration for tablet-based views Creating a materialized view or a secondary index in a tablet-based keyspace requires the user to enable two options: * experimental feature `views-with-tablets`, * configuration option `rf_rack_valid_keyspaces`. Because the latter has only become a necessity recently (in this series), it's possible that there are already existing materialized views that violate it. We add a new check at start-up that iterates over existing views and makes sure that this is not the case. Otherwise, Scylla notifies the user of the problem. (cherry picked from commit `288be6c82d`)	2025-10-06 13:19:54 +00:00
Avi Kivity	4d9271df98	Merge 'sstables: introduce sstable version `ms`' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation. This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version. (Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)). The high-level structure of the PR is: 1. Introduce new component types — `Partitions` and `Rows`. 2. Teach `class sstable` to open them when they exist. 3. Teach the sstable writer how to write index data to them. 4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead). 5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`. 6. Prepare unit tests for the appearance of `ms`. 7. Enable `ms` in unit tests. 8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled). 9. Prepare integration tests for the appearance of `ms`. 10. Enable both `ms` and `me` in tests where we want both versions to be tested. This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. 
It can be enabled by setting `sstable_format: ms` in the config. Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`. This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row. `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 70 18.0 18 128 500001 1 1 647 19.0 19 132 0 1000000 1 748 15.0 15 116 0 500000 2 372 29.0 29 284 0 250000 4 227 56.0 56 504 0 125000 8 116 106.0 106 928 0 62500 16 67 195.0 195 1732 ``` `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 51 5.1 5 20 500001 1 1 64 5.3 5 20 0 1000000 1 679 4.0 4 16 0 500000 2 492 8.0 8 88 0 250000 4 804 16.0 16 232 0 125000 8 409 31.0 31 516 0 62500 16 97 54.0 54 1056 ``` Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`: Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`. For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`. 
External tests: I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed. New functionality, no backport needed. Closes scylladb/scylladb#26215 * github.com:scylladb/scylladb: test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes test/cluster: add test_bti_index.py test: prepare bypass_cache_test.py for `ms` sstables sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write` db/config: expose "ms" format to the users via database config test: in Python tests, prepare some sstable filename regexes for `ms` sstables: add `ms` to `all_sstable_versions` test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests test/lib/index_reader_assertions: skip some row index checks for BTI indexes test/boost/sstable_inexact_index_test: explicitly use a `me` sstable test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables test/resource: add `ms` sample sstable files for relevant tests test/boost/sstable_compaction_test: prepare for `ms` sstables. test/boost/index_reader_test: prepare for `ms` sstables test/boost/bloom_filter_tests: prepare for `ms` sstables test/boost/sstable_datafile_test: prepare for `ms` sstables test/boost/sstable_test: prepare for `ms` sstables. 
sstables: introduce `ms` sstable format version tools/scylla-sstable: default to "preferred" sstable version, not "highest" sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller sstables/mx: make Index and Summary components optional sstables: open Partitions.db early when it's needed to populate key range for sharding metadata sstables: adapt sstable::set_first_and_last_keys to sstables without Summary sstables: implement an alternative way to rebuild bloom filters for sstables without Index utils/bloom_filter: add `add(const hashed_key&)` sstables: adapt estimated_keys_for_range to sstables without Summary sstables: make `sstable::estimated_keys_for_range` asynchronous sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary replica/database: add table::estimated_partitions_in_range() sstables/mx: implement sstable::has_partition_key using a regular read sstables: use BTI index for queries, when present and enabled sstables/mx/writer: populate BTI index files sstables: create and open BTI index files, when enabled sstables: introduce Partition and Rows component types sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`	2025-09-30 09:40:02 +03:00
Piotr Dulikowski	4581c72430	Merge 'lwt: prohibit for tablet-based views and cdc logs' from Petr Gusev `SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables. We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now. Fixes https://github.com/scylladb/scylladb/issues/26258 backports: not needed (a new feature) Closes scylladb/scylladb#26284 * github.com:scylladb/scylladb: cql_test_env.cc: log exception when callback throws lwt: prohibit for tablet-based views and cdc logs tablets: disallow chains of colocated tables database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda	2025-09-30 07:15:16 +02:00
Michał Chojnowski	55c4b89b88	sstables: make `sstable::estimated_keys_for_range` asynchronous Currently, `sstable::estimated_keys_for_range` works by checking what fraction of Summary is covered by the given range, and multiplying this fraction by the number of all keys. Since computing things on Summary doesn't involve I/O (because Summary is always kept in RAM), this is synchronous. In a later patch, we will modify `sstable::estimated_keys_for_range` so that it can deal with sstables that don't have a Summary (because they use BTI indexes instead of BIG indexes). In that case, the function is going to compute the relevant fraction by using the index instead of Summary. This will require making the function asynchronous. This is what we do in this patch. (The actual change to the logic of `sstable::estimated_keys_for_range` will come in the next patch. In this one, we only make it asynchronous).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	68c33c0173	replica/database: add table::estimated_partitions_in_range() Add a function which computes an estimated number of partitions in the given token range. We will use this helper in a later patch to replace a few places in the code which de facto do the same thing "manually".	2025-09-29 13:01:21 +02:00
Botond Dénes	9c85046f93	sstables,compaction: move compaction exceptions to compaction/ sstables/exceptions.hh still hosts some compaction specific exception types. Move them over to the new compaction/exceptions.hh, to make the compaction module more self-contained.	2025-09-29 06:49:14 +03:00
Avi Kivity	5b40d4d52b	Merge 'root,replica: mv multishard_mutation_query -> replica/multishard_query' from Botond Dénes The code in `multishard_mutation_query.cc` implements the replica-side of range scans and as such it belongs in the replica module. Take the opportunity to also rename it to `multishard_query`, the code implements both data and mutation queries for a long time now. Code cleanup, no backport required. Closes scylladb/scylladb#26279 * github.com:scylladb/scylladb: test/boost: rename multishard_mutation_query_test to multishard_query_test replica/multishard_query: move code into namespace replica replica/multishard_query.cc: update logger name docs/paged-queries.md: update references to readers root,replica: move multishard_mutation_query to replica/	2025-09-28 20:24:46 +03:00
Petr Gusev	b01f56a6d3	database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda In upcoming commits we’ll add a test to ensure that a table cannot be colocated with another table that is itself already colocated. This must also hold in the case where both colocated tables are created simultaneously in a single migration_manager announcement. We use Paxos tables as an example of colocated tables in this test. To support this, get_base_table_for_tablet_colocation needs to look for the base table among the batch of tables being created.	2025-09-26 16:46:32 +02:00
Pavel Emelyanov	f3c57f7dd0	table: Move for_all_partitions_slow() to test It's now only used by a single test, so move it there and remove from public table API. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-26 16:33:25 +03:00
Botond Dénes	3be4f0698f	replica/multishard_query: move code into namespace replica Complete the migration, add code to the replica namespace too.	2025-09-26 11:15:38 +03:00
Botond Dénes	ed50a307db	replica/multishard_query.cc: update logger name To reflect the new file name.	2025-09-26 11:15:38 +03:00
Botond Dénes	fb16c0a6d4	root,replica: move multishard_mutation_query to replica/ It belongs there, it is a completely replica-side thing. Also take the opportunity to rename it to multishard_query.{hh,cc}, it is not just mutation anymore (data query is also implemented).	2025-09-26 11:15:38 +03:00
Botond Dénes	1999d8e3d3	compaction: remove using namespace {compaction,sstables} Some files in compaction/ have using namespace {compaction,sstables} clauses, some even in headers. This is considered bad practice and muddies the namespace use. Remove them.	2025-09-25 15:03:57 +03:00
Botond Dénes	86ed627fc4	compaction: move code to namespace compaction The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables There are cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.	2025-09-25 15:03:56 +03:00
Asias He	4f3d076dab	tablets: Demote set sstables_repaired_at log to debug level This log is too excessive when the tablet count is high. Demote to debug level. Fixes #25926 Closes scylladb/scylladb#26175	2025-09-25 11:05:51 +03:00
Ernest Zaslavsky	5ba5aec1f8	treewide: Move mutation related files to a `mutation` directory As requested in #22104, moved the files and fixed other includes and build system. Moved files: - combine.hh - collection_mutation.hh - collection_mutation.cc - converting_mutation_partition_applier.hh - converting_mutation_partition_applier.cc - counters.hh - counters.cc - timestamp.hh Fixes: #22104 This is a cleanup, no need to backport Closes scylladb/scylladb#25085	2025-09-24 13:23:38 +03:00
Wojciech Mitros	d9b8278178	mv: handle mismatched base/view replica count caused by RF change During an ALTER KEYSPACE statement execution where a table with a view is present, we need to perform tablet migrations for both tables. These migrations are not synchronized, so at some point the base may have a different number of non-pending replicas than the view. Because of that, we can't pair them correctly. If there are more non-pending base replicas than view replicas, we don't need to do anything because the view replica that didn't finish migrating is a pending replica and will get view updates from all base replicas. But if there are more non-pending view replicas than base replicas, we may currently lose view updates to the new view replica. This patch adds a workaround for this scenario. If after one migration we have more non-pending view replicas than base replicas, we add the extra view replica to the pending replica list so that it gets an update anyway. This patch will also take effect if the base and view replica counts differ due to some other bug. To track that, a new metric is added to count such occurrences. This patch also includes a test for this exact scenario, which is enforced by an injection. Fixes https://github.com/scylladb/scylladb/issues/21492	2025-09-22 12:50:16 +02:00
Michał Chojnowski	9e70df83ab	db: get rid of sstables-format-selector Our sstable format selection logic is weird, and hard to follow. If I'm not misunderstanding, the pieces are: 1. There's the `sstable_format` config entry, which currently doesn't do anything, but in the past it used to disable cluster features for versions newer than the specified one. 2. There are deprecated and unused config entries for individual versions (`enable_sstables_mc_format`, `enable_sstables_md_format`, etc). 3. There is a cluster feature for each version: ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc. (Currently all sstable version features have been grandfathered, and aren't checked by the code anymore). 4. There's an entry in `system.scylla_local` which contains the latest enabled sstable version. (Why? Isn't this directly derived from cluster features anyway)? 5. There's `sstable_manager::_format` which contains the sstable version to be used for new writes. This field is updated by `sstables_format_selector` based on cluster features and the `system.scylla_local` entry. I don't see why those pieces are needed. Version selection has the following constraints: 1. New sstables must be written with a format that supports existing data. For example, range tombstones with an infinite bound are only supported by sstables since version "mc". So if a range tombstone with an infinite bound exists somewhere in the dataset, the format chosen for new sstables has to be at least as new as "mc". 2. A new format might only be used after a corresponding cluster feature is enabled. (Otherwise new sstables might become unreadable if they are sent to another node, or if a node is downgraded). 3. The user should have a way to inhibit format upgrades if he wishes. So far, constraint (1) has been fulfilled by never using formats older than the newest format ever enabled on the node. (With an exception for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format` only after the corresponding cluster feature is enabled. Constraint (3) has been fulfilled by the ability to inhibit cluster features by setting `sstable_format` to some fixed value. The main thing I don't like about this whole setup is that it doesn't let me downgrade the preferred sstable format. After a format is enabled, there is no way to go back to writing the old format again. That is no good -- after I make some performance-sensitive changes in a new format, it might turn out to be a pessimization for the particular workload, and I want to be able to go back. This patch aims to give a way to downgrade formats without violating the constraints. What it does is: 1. The entry in `system.scylla_local` becomes obsolete. After the patch we no longer update or read it. As far as I understand, the purpose of this entry is to prevent unwanted format downgrades (which is something cluster features are designed for) and it's updated if and only if relevant cluster features are updated. So there's no reason to have it, we can just directly use cluster features. 2. `sstable_format_selector` gets deleted. Without the `system.scylla_local` around, it's just a glorified feature listener. 3. The format selection logic is moved into `sstable_manager`. It already sees the `db::config` and the `gms::feature_service`. For the foreseeable future, the knowledge of enabled cluster features and current config should be enough information to pick the right formats. 4. The `sstable_format` entry in `db::config` is no longer intended to inhibit cluster features. Instead, it is intended to select the format for new sstables, and it becomes live-updatable. 5. 
Instead of writing new sstables with "highest supported" format, (which used to be set by `sstables_format_selector`) we write them with the "preferred" format, which is determined by `sstable_manager` based on the combination of enabled features and the current value of `sstable_format`. Closes scylladb/scylladb#26092 [avi: Pavel found the reason for the scylla_local entry - it predates stable storage for cluster features]	2025-09-19 16:17:56 +03:00
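One way to picture the "preferred" format described in the commit above is as the older of two versions: the one requested in the config and the newest one whose cluster feature is enabled. This is only a sketch under that assumption; `sstable_version` and `preferred_format` are hypothetical names, and the actual logic in `sstable_manager` may weigh further inputs.

```cpp
#include <algorithm>

// Hypothetical ordered list of format versions, oldest first.
enum class sstable_version { mc, md, me, ms };

// Picks the format for new sstables: never newer than what the enabled
// cluster features allow, and never newer than what the user configured.
sstable_version preferred_format(sstable_version configured,
                                 sstable_version newest_enabled_by_features) {
    return std::min(configured, newest_enabled_by_features);
}
```

Under this model, configuring an older version than the features allow selects the older version, which is the downgrade path the commit asks for, while configuring a version whose feature is not yet enabled falls back to the newest enabled one.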
Pavel Emelyanov	a1ea553fe1	code: Replace distributed<> with sharded<> The latter is recommended in seastar, and the former was left as a compatibility alias. Latest seastar explicitly marks it as deprecated, so once the submodule is updated, compilation logs will explode. Most of the patch is generated with `for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done` and `for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done`, plus a small manual change in test/perf/perf.hh Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26136	2025-09-19 12:22:51 +02:00
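The textual rewrite in the commit above can be approximated in C++ with `std::regex`, assuming the intended pattern captures a simple template argument made of identifier characters and `::`. `distributed_to_sharded` is a hypothetical helper; `\b` is used here as the closest ECMAScript analogue of sed's `\<` word-start anchor.

```cpp
#include <regex>
#include <string>

// Rewrites distributed<T> to sharded<T> for simple (non-nested) template
// arguments, leaving everything else in the source text untouched.
std::string distributed_to_sharded(const std::string& source) {
    static const std::regex pattern(R"(\bdistributed<([A-Za-z0-9:_]*)>)");
    // $1 re-inserts the captured template argument.
    return std::regex_replace(source, pattern, "sharded<$1>");
}
```

Because the character class excludes `<`, nested arguments such as `distributed<std::vector<int>>` are not matched by this sketch and would need manual attention.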
Piotr Dulikowski	5f55787e50	Merge 'CDC with tablets' from Michael Litvak initial implementation to support CDC in tablets-enabled keyspaces. The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge. Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future. In this PR we: * add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata * add virtual tables for CDC consumers * the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata * enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet. * on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet. 
* the cdc tablets are co-located with the base tablets Fixes https://github.com/scylladb/scylladb/issues/22576 backport not needed - new feature update dtests: https://github.com/scylladb/scylla-dtest/pull/5897 update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102 update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136 Closes scylladb/scylladb#23795 * github.com:scylladb/scylladb: docs/dev: update CDC dev docs for tablets doc: update CDC docs for tablets test: cluster_events: enable add_cdc and drop_cdc test/cql: enable cql cdc tests to run with tablets test: test_cdc_with_alter: adjust for cdc with tablets test/cqlpy: adjust cdc tests for tablets test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests cdc: enable cdc with tablets topology coordinator: change streams on tablet split/merge cdc: virtual tables for cdc with tablets cdc: generate_stream_diff helper function cdc: choose stream in tablets enabled keyspaces cdc: rename get_stream to get_vnode_stream cdc: load tablet streams metadata from tables cdc: helper functions for reading metadata from tables cdc: colocate cdc table with base cdc: remove streams when dropping CDC table cdc: create streams when allocating tablets migration_listener: add on_before_allocate_tablet_map notification cdc: notify when creating or dropping cdc table cdc: move cdc table creation to pre_create cdc: add internal tables for cdc with tablets cdc: add cdc_with_tablets feature flag cdc: add is_log_schema helper	2025-09-18 13:39:37 +02:00
Piotr Dulikowski	4ed045a15c	Merge 'db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr`' from Michał Jadwiszczak When a staging sstable is registered to view building worker, it needs to make a round trip from its original shard to shard 0 (in order to create a view building task) and back (to be eventually processed). Until now this was done using plain `sstables::shared_sstable` (= `lw_shared_ptr`) which is not safe to be moved between shards. This patch fixes this by wrapping the pointer in `foreign_ptr` and obtains necessary information (owner shard, last token) on the original shard (instead of on shard0). Then all of those objects are put into freshly introduced structure `staging_sstable_task_info`, which can be safely moved between shards. Fixes https://github.com/scylladb/scylladb/issues/25859 View building coordinator isn't present in any release yet, no backport needed. Closes scylladb/scylladb#25832 * github.com:scylladb/scylladb: db/view/view_building_worker: fix indent db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr` db/view/view_building_worker: use table id in `register_staging_sstable_tasks()` db/view/view_building_worker: move helper functions higher	2025-09-18 10:24:27 +02:00
Ernest Zaslavsky	ddf2588985	treewide: Move replica related files to `replica` directory As requested in #22099, moved the files and fixed other includes and build system. Moved files: - cache_temperature.hh - cell_locking.hh Fixes: #22099 Closes scylladb/scylladb#25079	2025-09-18 08:00:35 +03:00
Michał Jadwiszczak	50678030c0	db/view/view_building_worker: use table id in `register_staging_sstable_tasks()` There is no need to pass the pointer only to get id of the table.	2025-09-18 02:57:35 +02:00
Michael Litvak	650ae30c97	cdc: colocate cdc table with base When creating a tablet map for a CDC table, make it be co-located with its base table. We modify db::get_base_table_for_tablet_colocation to return the base table id of a CDC table, handling both cases that the base table is a new table that's created in the same operation, or is an existing table in the db. This function is used by the tablet allocator to decide whether to create a co-located tablet map or allocate new tablets.	2025-09-17 14:47:12 +02:00
Ernest Zaslavsky	d624413ddd	treewide: Move query related files to a new `query` directory As requested in #22120, moved the files and fixed other includes and build system. Moved files: - query.cc - query-request.hh - query-result.hh - query-result-reader.hh - query-result-set.cc - query-result-set.hh - query-result-writer.hh - query_id.hh - query_result_merger.hh Fixes: #22120 This is a cleanup, no need to backport Closes scylladb/scylladb#25105	2025-09-16 23:40:47 +03:00
Aleksandra Martyniuk	75b772adfb	db: optimize cache invalidation following repair/streaming Currently, if a new sstable is created during repair/streaming, we invalidate its whole token range in cache. If the sstable is sparse, we unnecessarily clear too much data. Modify cache invalidation, so that only the partitions present in the sstable are cleared. To check whether a partition is present in the sstable, we use bloom filters. Bloom filters may return false positives and show that an sstable contains a partition, even though it does not. Due to that we may invalidate a bit more than we need to, but the cache will be in valid state. An issue arises when we do not invalidate two consecutive partitions that are continuous. The sstable may contain a token that falls between these partitions, breaking the continuity. To check that, we would need to scan sstable index. However, such a change would noticeably complicate the invalidation, both performance and code. In this change, sstable index reader isn't used. Instead, the continuity flag is unset for all scanned partitions. This comes at a cost of heavier reads, as we will need to verify continuity when reading more than one partition from cache. Fixes: https://github.com/scylladb/scylladb/issues/9136. Closes scylladb/scylladb#25996	2025-09-14 19:48:14 +03:00
Radosław Cybulski	436150eb52	treewide: fix spelling errors Fix spelling errors reported by copilot on github. Remove single use namespace alias. Closes scylladb/scylladb#25960	2025-09-12 15:58:19 +03:00
Avi Kivity	5237a20993	Merge 'replica: Fix split compaction when tablet boundaries change' from Raphael Raph Carvalho Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. All 2025.* versions are vulnerable, so fix must be backported to them. Closes scylladb/scylladb#25690 * github.com:scylladb/scylladb: replica: Fix split compaction when tablet boundaries change replica: Futurize split_compaction_options()	2025-09-09 17:05:32 +03:00
Asias He	cb7db47ae1	repair: Add incremental_mode option for tablet repair This patch introduces a new `incremental_mode` parameter to the tablet repair REST API, providing more fine-grained control over the incremental repair process. Previously, incremental repair was on and could not be turned off. This change allows users to select from three distinct modes: - `regular`: This is the default mode. It performs a standard incremental repair, processing only unrepaired sstables and skipping those that are already repaired. The repair state (`repaired_at`, `sstables_repaired_at`) is updated. - `full`: This mode forces the repair to process all sstables, including those that have been previously repaired. This is useful when a full data validation is needed without disabling the incremental repair feature. The repair state is updated. - `disabled`: This mode completely disables the incremental repair logic for the current repair operation. It behaves like a classic (pre-incremental) repair, and it does not update any incremental repair state (`repaired_at` in sstables or `sstables_repaired_at` in the system.tablets table). The implementation includes: - Adding the `incremental_mode` parameter to the `/storage_service/repair/tablet` API endpoint. - Updating the internal repair logic to handle the different modes. - Adding a new test case to verify the behavior of each mode. - Updating the API documentation and developer documentation. Fixes #25605 Closes scylladb/scylladb#25693	2025-09-09 06:50:21 +03:00
Raphael S. Carvalho	68f23d54d8	replica: Fix split compaction when tablet boundaries change Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-09-07 05:20:23 -03:00
Raphael S. Carvalho	0c1587473c	replica: Futurize split_compaction_options() Preparation for the fix of #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-09-07 05:19:09 -03:00
Botond Dénes	6116f9e11b	Merge 'Compaction tasks progress' from Aleksandra Martyniuk Determine the progress of compaction tasks that have children. The progress of a compaction task is calculated using the default get_progress method. If the expected_total_workload method is implemented, the default progress is computed as: (sum of child task progresses) / (expected total workload) If expected_total_workload is not defined, progress is estimated based on the children's progress. However, in this case, the total progress may increase over time as the task executes. All compaction tasks, except for reshape tasks, implement the expected_children_number method. To compute expected_total_workload, iterate over all SSTables covered by the task and sum their sizes. Note that expected_total_workload is just an approximation and the real workload may differ if the set of SSTables for the keyspace/table/compaction group changes. Reshape tasks are an exception, as their scope is determined during execution. Hence, for these tasks expected_total_workload isn't defined and their progress (both total and completed) is determined based on currently created children. Fixes: https://github.com/scylladb/scylladb/issues/8392. Fixes: https://github.com/scylladb/scylladb/issues/6406. Fixes: https://github.com/scylladb/scylladb/issues/7845. 
New feature, no backport needed Closes scylladb/scylladb#15158 * github.com:scylladb/scylladb: test: add compaction task progress test compaction: set progress unit for compaction tasks compaction: find expected workload for reshard tasks compaction: find expected workload for global cleanup compaction tasks compaction: find expected workload for global major compaction tasks compaction: find expected workload for keyspace compaction tasks compaction: find expected workload for shard compaction tasks compaction: find expected workload for table compaction tasks compaction: return empty progress when compaction_size isn't set compaction: update compaction_data::compaction_size at once tasks: do not check expected workload for done task	2025-09-03 13:23:42 +03:00
Radosław Cybulski	c242234552	Revert "build: add precompiled headers to CMakeLists.txt" This reverts commit `01bb7b629a`. Closes scylladb/scylladb#25735	2025-09-03 09:46:00 +03:00
Calle Wilund	bc20861afb	system_keyspace: Prune dropped tables from truncation on start/drop Fixes #25683 Once a table drop is complete, there should be no reason to retain truncation records for it, as any replay should skip mutations anyway (no CF), and iff we somehow resurrect a dropped table, this replay-resurrected data is the least problem anyway. Adds a prune phase to the startup drop_truncation_rp_records run, which ignores updating, and instead deletes records for nonexistent tables (which should patch any existing servers with lingering data as well). Also does an explicit delete of records on actual table DROP, to ensure we don't grow this table more than needed even in long uptime nodes. Small unit test included. Closes scylladb/scylladb#25699	2025-09-03 07:25:34 +03:00


1688 Commits