scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-20 16:40:35 +00:00

Author	SHA1	Message	Date
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Avi Kivity	ca80ee8586	Merge 'Introduce maintenance scheduling supergroup and do initial population' from Pavel Emelyanov The supergroup replaces streaming (a.k.a. maintenance as well) group, inherits 200 shares from it and consists of four sub-groups (all have equal shares of 200 within the new supergroup) * maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it * backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there * maintenance. It's a new "visible" name, everything that was called "maintenance" in the code ran in "streaming" group. Now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why) * `tablet_allocator::balance_tablets()` * `sstables_manager::components_reclaim_reload_fiber()` * `tablet_storage_group_manager::merge_completion_fiber()` * metrics exporting http server altogether * streaming. This is purely existing streaming group that just moves under the new supergroup. Everything else that was run there, continues doing so, including * hints sender * all view building related components (update generator, builder, workers) * repair * stream_manager * messaging service (except for verb handlers that switch groups) * join_cluster() activity * REST API * ... something else I forgot The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility. All new sched groups inherit `request_class::maintenance` (however, "backup" seems not to make any requests yet). Moving more activities from "streaming" into "maintenance" (or its own group) is possible, but one will need to take care of RPC group switching. 
The thing is that when a client makes an RPC call, the server may switch to one of pre-negotiated scheduling groups. Verbs for existing activities that run in "streaming" group are routed through RPC index that negotiates "streaming" group on the server side. If any of that client code moves to some other group, server will still run the handlers in "streaming" which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. Restoring code uses load-and-stream and corresponding RPCs, so it cannot be just moved into its own new group. Fixes SCYLLADB-351 New feature, not backporting Closes scylladb/scylladb#28542 * github.com:scylladb/scylladb: code: Add maintenance/maintenance group backup: Add maintenance/backup group compaction: Add maintenance/maintenance_compaction group main: Introduce maintenance supergroup main: Move all maintenance sched group into streaming one database: Use local variable for current_scheduling_group code: Live-update IO throughputs from main	2026-04-12 00:34:48 +03:00
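The option fallback described in this merge can be sketched as follows. This is a minimal illustration, not the code from the series: the helper name `effective_maintenance_throughput` and the use of `std::optional` to model "option not set" are assumptions made here.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper (not from the ScyllaDB source).
// maintenance_mbps models --maintenance_io_throughput_mb_per_sec,
// stream_mbps models --stream_io_throughput_mb_per_sec.
// std::nullopt stands for "option not set" (an assumption of this sketch).
std::optional<uint32_t> effective_maintenance_throughput(
        std::optional<uint32_t> maintenance_mbps,
        std::optional<uint32_t> stream_mbps) {
    if (maintenance_mbps) {
        // The new option wins whenever it is set.
        return maintenance_mbps;
    }
    // Otherwise fall back to the streaming option for backward compatibility.
    return stream_mbps;
}
```

With this shape, a deployment that only ever set the streaming option keeps the same limit on the maintenance supergroup after upgrading.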
Piotr Dulikowski	3bd770d4d9	Merge 'counters: reuse counter IDs by rack' from Michael Litvak For counter updates, use a counter ID that is constructed from the node's rack instead of the node's host ID. A rack can have at most two active tablet replicas at a time: a single normal tablet replica, and during tablet migration there are two active replicas, the normal and pending replica. Therefore we can have two unique counter IDs per rack that are reused by all replicas in the rack. We construct the counter ID from the rack UUID, which is constructed from the name "dc:rack". The pending replica uses a deterministic variation of the rack's counter ID by negating it. This improves the performance and size of counter cells by having less unique counter IDs and less counter shards in a counter cell. Previously the number of counter shards was the number of different host_id's that updated the counter, which can be typically the number of nodes in the cluster and continue growing indefinitely when nodes are replaced. with the rack-based counter id the number of counter shards will be at most twice the number of different racks (including removed racks, which should not be significant). Fixes SCYLLADB-356 backport not needed - an enhancement Closes scylladb/scylladb#28901 * github.com:scylladb/scylladb: docs/dev: add counters doc counters: reuse counter IDs by rack	2026-04-10 12:24:18 +02:00
Michael Litvak	b71762d5da	counters: reuse counter IDs by rack For counter updates, use a counter ID that is constructed from the node's rack instead of the node's host ID. A rack can have at most two active tablet replicas at a time: a single normal tablet replica, and during tablet migration there are two active replicas, the normal and pending replica. Therefore we can have two unique counter IDs per rack that are reused by all replicas in the rack. We construct the counter ID from the rack UUID, which is constructed from the name "dc:rack". The pending replica uses a deterministic variation of the rack's counter ID by negating it. This improves the performance and size of counter cells by having less unique counter IDs and less counter shards in a counter cell. Previously the number of counter shards was the number of different host_id's that updated the counter, which can be typically the number of nodes in the cluster and continue growing indefinitely when nodes are replaced. with the rack-based counter id the number of counter shards will be at most twice the number of different racks (including removed racks, which should not be significant). Fixes SCYLLADB-356	2026-04-09 13:08:02 +02:00
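The rack-based ID scheme can be illustrated with a toy model. Everything below is an assumption for illustration: the real counter ID is UUID-based, `std::hash` only stands in for the name-based UUID built from "dc:rack", and bitwise NOT is one possible reading of "negating"; the commit does not say which operation is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Toy 64-bit stand-in for a counter ID (the real type is UUID-based).
using toy_counter_id = int64_t;

// Derive a deterministic ID from the "dc:rack" name.
// std::hash is a placeholder for the name-based UUID construction.
toy_counter_id rack_counter_id(const std::string& dc, const std::string& rack) {
    return static_cast<toy_counter_id>(std::hash<std::string>{}(dc + ":" + rack));
}

// ID used by the pending replica during tablet migration.
// Bitwise NOT is an assumed interpretation of "negating"; it always
// differs from the input and is its own inverse.
toy_counter_id pending_counter_id(toy_counter_id normal) {
    return ~normal;
}

// Upper bound on counter shards per cell: two IDs per rack.
std::size_t max_counter_shards(std::size_t racks) {
    return 2 * racks;
}
```

Under this model a cluster with three racks can produce at most six counter shards per cell, independent of how many nodes are added or replaced.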
Yaniv Michael Kaul	2c0076d3ef	replica: set_skip_when_empty() for rare error-path metrics Add .set_skip_when_empty() to four metrics in replica/database.cc that are only incremented on very rare error paths and are almost always zero: - database::dropped_view_updates: view updates dropped due to overload. NOTE: this metric appears to never be incremented in the current codebase and may be a candidate for removal. - database::multishard_query_failed_reader_stops: documented as a 'hard badness counter' that should always be zero. NOTE: no increment site was found in the current codebase; may be a candidate for removal. - database::multishard_query_failed_reader_saves: documented as a 'hard badness counter' that should always be zero. - database::total_writes_rejected_due_to_out_of_space_prevention: only fires when disk utilization is critical and user table writes are disabled, a very rare operational state. These metrics create unnecessary reporting overhead when they are perpetually zero. set_skip_when_empty() suppresses them from metrics output until they become non-zero. AI-Assisted: yes Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#29345	2026-04-09 14:07:28 +03:00
Nikos Dragazis	0e1e6ebdc5	database: Construct migrating tables with tablet ERMs Extend `database::add_column_family()` with a `storage_mode` argument. If the table is under vnodes-to-tablets migration and the storage mode is "tablets", create a tablet ERM. Make the distributed loader determine the storage mode from topology (`intended_storage_mode` column in system.topology). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 13:20:39 +02:00
Pavel Emelyanov	cb329b10bf	code: Add maintenance/maintenance group And move some activities from streaming group into it, namely - tablet_allocator background group - sstables_manager-s components reclaimer - tablet storage group manager merge completion fiber - prometheus All other activity that was in streaming group remains there, but can be moved to this group (or to new maintenance subgroup) later. All but prometheus are patched here, prometheus still uses the maintenance_sched_group variable in main.cc, so it transparently moves into new group Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:03 +03:00
Pavel Emelyanov	de9bfe0f1d	backup: Add maintenance/backup group The snapshot_ctl::backup_task_impl runs in configured scheduling group. Now it's streaming one. This patch introduces the maintenance/backup group and re-configures backup task with it. The group gets its --backup_io_throughput_mb_per_sec option that controls bandwidth limit for this sub-group only. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Pavel Emelyanov	6f43e8562e	compaction: Add maintenance/maintenance_compaction group Compaction manager tells compaction_sched_group from maintenance_compaction_sched_group. The latter, however, is set to be "streaming" group. This patch adds real maintenance_compaction group under the maintenance supergroup and makes compaction manager use it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Pavel Emelyanov	45ecf15fff	database: Use local variable for current_scheduling_group The classify_request() helper captures current scheduling group into local variable and compares it with groups from db_config to decide which "class" it belongs to. One if uses current_scheduling_group(), while it could use the local variable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Michael Litvak	31d339e54a	logstor: trigger separator flush for buffers that hold old segments A compaction group has a separator buffer that holds the mixed segments alive until the separator buffer is flushed. A mixed segment can be freed only after all separator buffers that hold writes from the segment are flushed. Typically a separator buffer is flushed when it becomes full. However, it's possible, for example, that one compaction group is filled more slowly than others and holds many segments. To fix this we trigger a separator flush periodically for separator buffers that hold old segments. We track the active segment sequence number and for each separator buffer the oldest sequence number it holds.	2026-03-18 19:24:28 +01:00
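The periodic check described above can be sketched as a comparison of sequence numbers. The struct, the function name, and the `max_lag` threshold are hypothetical; the commit does not state how "old" is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-buffer bookkeeping: the oldest segment sequence
// number that this separator buffer still holds writes from.
struct separator_buffer_info {
    uint64_t oldest_segment_seq;
};

// Returns the indices of buffers whose oldest held segment lags the
// active segment by more than max_lag (an assumed threshold).
std::vector<std::size_t> buffers_to_flush(
        const std::vector<separator_buffer_info>& buffers,
        uint64_t active_seq,
        uint64_t max_lag) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < buffers.size(); ++i) {
        const uint64_t oldest = buffers[i].oldest_segment_seq;
        // Guard against underflow before computing the lag.
        if (oldest <= active_seq && active_seq - oldest > max_lag) {
            out.push_back(i);
        }
    }
    return out;
}
```

For example, with an active sequence number of 100 and a maximum lag of 40, buffers whose oldest segments are 10 and 50 would be flushed, while a buffer at 95 would be left to fill up normally.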
Michael Litvak	a0da07e5b7	logstor: recover segments into compaction groups Fix the logstor recovery to work with compaction groups. When recovering a segment find its token range and add it to the appropriate compaction groups. if it doesn't fit in a single compaction group then write each record to its compaction group's separator buffer.	2026-03-18 19:24:28 +01:00
Michael Litvak	a9d0211a64	logstor: change index to btree by token per table Change the primary index to be a btree that is ordered by token, similarly to a memtable, and create an index per table instead of a single global index.	2026-03-18 19:24:28 +01:00
Michael Litvak	e7c3942d43	logstor: move segments to replica::compaction_group Add a segment_set member to replica::compaction_group that manages the logstor segments that belong to the compaction group, similarly to how it manages sstables. Add also a separator buffer in each compaction group. When writing a mutation to a compaction group, the mutation is written to the active segment and to the separator buffer of the compaction group, and when the separator buffer is flushed the segment is added to the compaction_group's segment set.	2026-03-18 19:24:28 +01:00
Michael Litvak	d69f7eb0ee	db: update dirty mem limits dynamically When logstor is enabled, update the db dirty memory limits dynamically. Previously the threshold was set to 0.5 of the available memory, so 0.5 goes to memtables and 0.5 to others (cache). When logstor is enabled, we calculate the available memory excluding logstor, and divide it evenly between memtables and cache.	2026-03-18 19:24:27 +01:00
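The arithmetic of the new limit can be sketched as below. The names `memory_split` and `split_dirty_memory` are hypothetical, and the sketch assumes the logstor share is known as a byte count; passing zero for it reproduces the previous 0.5/0.5 split.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: bytes for memtables and for the cache.
struct memory_split {
    uint64_t memtables;
    uint64_t cache;
};

// Exclude the memory used by logstor, then divide the rest evenly.
// With logstor_bytes == 0 this is the old 0.5 / 0.5 split.
memory_split split_dirty_memory(uint64_t available_bytes, uint64_t logstor_bytes) {
    const uint64_t usable =
        available_bytes > logstor_bytes ? available_bytes - logstor_bytes : 0;
    const uint64_t half = usable / 2;
    // Any odd byte goes to the cache so the two parts sum to `usable`.
    return {half, usable - half};
}
```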
Michael Litvak	b7bdb1010a	logstor: logstor stats api add api to get logstor statistics about segments for a table	2026-03-18 19:24:27 +01:00
Michael Litvak	bd66edee5c	logstor: truncate table implement freeing all segments of a table for table truncate. first do barrier to flush all active and mixed segments and put all the table's data in compaction groups, then stop compaction for the table, then free the table's segments and remove the live entries from the index.	2026-03-18 19:24:27 +01:00
Michael Litvak	21db4f3ed8	logstor: separator buffer pool pre-allocate write buffers for the separator	2026-03-18 19:24:27 +01:00
Michael Litvak	31aefdc07d	logstor: segment and separator barrier add barrier operation that forces switch of the active segment and separator, and waits for all existing segments to close and all separators to flush.	2026-03-18 19:24:27 +01:00
Michael Litvak	1231fafb46	logstor: separator debt controller add tracking of the total separator debt - writes that were written to a separator and waiting to be flushed, and add flow control to keep the debt in control by delaying normal writes.	2026-03-18 19:24:27 +01:00
Michael Litvak	17cb173e18	logstor: compaction controller adjust compaction shares by the compaction overhead: how many segments compaction writes to generate a single free segment for new writes.	2026-03-18 19:24:27 +01:00
Michael Litvak	600ec82bec	logstor: separator initial implementation of the separator. it replaces "mixed" segments - segments that have records from different groups, to segments by group. every write is written to the active segment and to a buffer in the active separator. the active separator has in-memory buffers by group. at some threshold number of segments we switch the active segment and separator atomically, and start flushing the separator. the separator is flushed by writing the buffers into new non-mixed segments, adding them to a compaction group, and frees the mixed segments.	2026-03-18 19:24:27 +01:00
Michael Litvak	1ae1f37ec1	api: add logstor compaction trigger endpoint add a new api endpoint that triggers logstor compaction.	2026-03-18 19:24:26 +01:00
Michael Litvak	2128b1b15c	replica: add logstor to db Add a single logstor instance in the database that is used for writing and reading to tables with kv storage	2026-03-18 19:24:26 +01:00
Pavel Emelyanov	8b1ca6dcd6	database: Rate limit all tokens from a range The limiter scans ranges to decide whether or not to rate-limit the query. However, when considering each range only the front one's token is accounted. This looks like a misprint. The limiter was introduced in `cc9a2ad41f` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29050	2026-03-18 13:50:48 +01:00
Dawid Mędrek	a8dd13731f	Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed: * storage-service: add table name to mutation write failure error messages. * database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit. * test/test_data_resurrection_in_memtable.py: dump data from the datable, before the checks for expected data, so if checks fail, the data in the table is known. Refs: SCYLLADB-812 Refs: SCYLLADB-870 Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces) Backport: test related improvement, no backport Closes scylladb/scylladb#28899 * github.com:scylladb/scylladb: test/cluster/test_data_resurrection_in_memtable.py: dump rows before check replica/database: consolidate the two database_apply error injections service/storage_proxy: add name of table to error message for write errors	2026-03-17 13:35:19 +01:00
Botond Dénes	f375aae257	replica/database: consolidate the two database_apply error injections Into a single database_apply one. Add three parameters: * ks_name and cf_name to filter the tables to be affected * what - what to do: throw or wait This leads to smaller footprint in the code and improved filtering for table names at the cost of some extra error injection params in the tests.	2026-03-05 11:44:02 +02:00
Piotr Dulikowski	85dcbfae9a	Merge 'hint: Don't switch group in database::apply_hint()' from Pavel Emelyanov The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called from directly by hint sender, which already runs in streaming group, or via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming group as well. To be sure, add a debugging check for current group being the expected one. Code cleanup, not backporting Closes scylladb/scylladb#28545 * github.com:scylladb/scylladb: hint: Don't switch group in database::apply_hint() hint_sender: Switch to sender group on stop either	2026-03-04 09:36:38 +01:00
Pavel Emelyanov	b4b9b547ce	replica: Remove unused sched groups from keyspace and table configs Compaction and statement groups are carried over on those configs, but are in fact unused. Drop both. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28540	2026-02-19 15:47:31 +01:00
Piotr Dulikowski	7d6f734a51	dictionary compression: add missing co_awaits on get_units There is a handful of places in the code related to dictionary compression which calls get_units to acquire semaphore units but the returned future is not awaited, seemingly by mistake. The result of get_units is assigned to a variable - which is reasonable at a glance because the semaphore units need to be assigned to a variable in order to control their scope - but at the same time if co_await is mistakenly omitted, like here, doing so will silence the nodiscard check of seastar::future and, effectively, the get_units call will be nearly useless. Unfortunately, this is an easy mistake to make. Fix the places in the code that acquire semaphore units via get_units but never await the future returned by it. I found them by manual code inspection, so I hope that I didn't miss any. Closes scylladb/scylladb#28581	2026-02-18 16:40:40 +01:00
Pavel Emelyanov	83e64b516a	hint: Don't switch group in database::apply_hint() The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called from directly by hint sender, which already runs in streaming group, or via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming group as well. To be sure, add a debugging check for current group being the expected one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-09 08:54:51 +03:00
Pavel Emelyanov	10c278fff7	database: Remove _flush_sg member from replica::database This field is only used to initialize the following _memtable_controller one. It's simpler just to do the initialization with whatever value the field itself is initialized and drop the field itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28539	2026-02-05 13:02:35 +02:00
Łukasz Paszkowski	fde09fd136	reader_concurrency_semaphore: Add preemptive_abort_factor to constructors The new parameter sets the factor used to reject a read during admission. Its value shall be between 0.0 and 1.0, where + 0.0 means a read will never get rejected during admission + 1.0 means a read will immediately get rejected during admission Although passing values outside the interval is possible, they will have exactly the same effect as if they were clamped to [0.0, 1.0].	2026-01-28 14:20:01 +01:00
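The clamping behaviour stated in this commit can be shown with a short sketch. The function name is hypothetical, and the sketch covers only how out-of-range values are treated, not how the semaphore applies the factor during admission.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: the value that takes effect for a given
// preemptive_abort_factor. Values outside [0.0, 1.0] behave as if
// they were clamped to that interval.
double effective_preemptive_abort_factor(double factor) {
    return std::clamp(factor, 0.0, 1.0);
}
```

So a factor of -0.5 behaves like 0.0 (never reject during admission) and a factor of 2.0 behaves like 1.0 (reject immediately).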
Łukasz Paszkowski	8829098e90	reader_concurrency_semaphore: Remove cpu_concurrency's default value The commit `59faa6d`, introduces a new parameter called cpu_concurrency and sets its default value to 1 which violates the commit `fbb83dd` that removes all default values from constructors but one used by the unit tests. The patch removes the default value of the cpu_concurrency parameter and alters tests to use the test dedicated reader_concurrency_semaphore constructor wherever possible.	2026-01-27 15:40:11 +01:00
Piotr Dulikowski	fe9237fdc9	Merge 'alternator: don't require rf_rack flag for indexes, validate instead' from Michael Litvak In `8df61f6d99` we changed the requirements for creating materialized views and MV-based indexes - instead of requiring the rf_rack_valid_keyspaces flag to be set, we now require the keyspace to be RF-rack-valid at the time of creation, and it is enforced to remain RF-rack-valid while the MV exists. This validation is done in the cql create view/index statements. The same should be done also for alternator - when creating a table with GSI or LSI, or when adding a GSI to an existing table, previously we required the flag rf_rack_valid_keyspaces to be set. Now we change it to instead check if the keyspace is RF-rack-valid, and if not the operation fails with an appropriate error. Fixes https://github.com/scylladb/scylladb/issues/28214 backport to 2025.4 to add RF-rack-valid enforcements in alternator Closes scylladb/scylladb#28154 * github.com:scylladb/scylladb: locator: document the exception type of assert_rf_rack_valid_keyspace alternator: don't require rf_rack flag for indexes, validate instead	2026-01-23 11:49:02 +01:00
Michael Litvak	d5009882c6	locator: document the exception type of assert_rf_rack_valid_keyspace The function assert_rf_rack_valid_keyspace uses the exception type std::invalid_argument when the RF-rack validation fails. Document it and change all callers to catch this specific exception type when checking for RF-rack validation failures, so that other exception types can be propagated properly.	2026-01-22 16:11:35 +01:00
Pavel Emelyanov	cb6ee05391	Merge 'Extend snapshot manifest.json with tablet-aware metadata' from Benny Halevy This series extends the json manifest file we create when taking snapshots. It adds the following metadata: - manifest version and scope - snapshot name - created_at and expires_at timestamps (#24061) - node metadata (host_id, dc, rack) - keyspace and table metadata - tablet_count (#26352) - per-sstable metadata (#26352) Fixes [SCYLLADB-189](https://scylladb.atlassian.net/browse/SCYLLADB-189) Fixes [SCYLLADB-195](https://scylladb.atlassian.net/browse/SCYLLADB-195) Fixes [SCYLLADB-196](https://scylladb.atlassian.net/browse/SCYLLADB-196) * Enhancement, no backport needed [SCYLLADB-189]: https://scylladb.atlassian.net/browse/SCYLLADB-189?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [SCYLLADB-195]: https://scylladb.atlassian.net/browse/SCYLLADB-195?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [SCYLLADB-196]: https://scylladb.atlassian.net/browse/SCYLLADB-196?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#27945 * github.com:scylladb/scylladb: snapshot: keep per-sstable metadata in manifest.json snapshot: add table info and tablet_count to manifest.json snapshot: add basic support for snapshot ttl in manifest.json table: snapshot_on_all_shards: take snapshot_options db: snapshot_ctl: move skip_flush to struct snapshot_options snapshot: add snapshot name in manifest.json test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces snapshot: add node info to manifest.json snapshot: add manifest info to manifest.json test: database_test: snapshot_works: add validate_manifest	2026-01-22 15:19:11 +03:00
Botond Dénes	7d2e6c0170	Merge 'config: add enforce_rack_list option' from Aleksandra Martyniuk Add enforce_rack_list option. When the option is set to true, all tablet keyspaces have rack list replication factor. When the option is on: - CREATE STATEMENT always auto-extends rf to rack lists; - ALTER STATEMENT fails when there is numeric rf in any DC. The flag is set to false by default and a node needs to be restarted in order to change its value. Starting a node with enforce_rack_list option will fail, if there are any tablet keyspaces with numeric rf in any DC. enforce_rack_list is a per-node option and a user needs to ensure that no tablet keyspace is altered or created while nodes in the cluster don't have the consistent value. Mark rf_rack_valid_keyspaces as deprecated. Fixes: https://github.com/scylladb/scylladb/issues/26399. New feature; no backport needed Closes scylladb/scylladb#28084 * github.com:scylladb/scylladb: test: add test for enforce_rack_list option db: mark rf_rack_valid_keyspaces as deprecated config: add enforce_rack_list option Revert "alternator: require rf_rack_valid_keyspaces when creating index"	2026-01-22 10:27:35 +02:00
Benny Halevy	5e90fbb9d2	table: snapshot_on_all_shards: take snapshot_options And keep the options for now in the local_snapshot_writer. The options will be used by following patches to pass extra metadata like the snapshot creation time, expiration time, etc. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	49a3e0914d	db: snapshot_ctl: move skip_flush to struct snapshot_options So we can easily extend it and add more options. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Avi Kivity	c7dda5500c	database: simplify apply_counter_update exception handling Use coroutine::try_future to exit the coroutine immediately on error instead of explicit checks. Closes scylladb/scylladb#28257	2026-01-20 11:13:49 +02:00
Aleksandra Martyniuk	761ace4f05	config: add enforce_rack_list option Add enforce_rack_list option. When the option is set to true, all tablet keyspaces have rack list replication factor. When the option is on: - CREATE STATEMENT always auto-extends rf to rack lists; - ALTER STATEMENT fails when there is numeric rf in any DC. The flag is set to false by default and a node needs to be restarted in order to change its value. Starting a node with enforce_rack_list option will fail, if there are any tablet keyspaces with numeric rf in any DC. enforce_rack_list is a per-node option and a user needs to ensure that no tablet keyspace is altered or created while nodes in the cluster don't have the consistent value.	2026-01-20 09:58:51 +01:00
Botond Dénes	60570d7114	Merge 'topology coordinator: restrict node join/remove to preserve RF-rack validity' from Michael Litvak Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations: * Altering a keyspace's RF if it would make the keyspace RF-rack-invalid * Adding a node in a new rack * Removing / Decommissioning the last node in a rack Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views. The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid. Fixes scylladb/scylladb#23345 Fixes https://github.com/scylladb/scylladb/issues/26820 backport to relevant versions for materialized views with tablets since it depends on rf-rack validity Closes scylladb/scylladb#26354 * github.com:scylladb/scylladb: docs: update RF-rack restrictions cql3: don't apply RF-rack restrictions on vector indexes cql3: add warning when creating mv/index with tablets about rf-rack service/tablet_allocator: always allow tablet merge of tables with views locator: extend rf-rack validation for rack lists test: test rf-rack validity when creating keyspace during node ops locator: fix rf-rack validation during node join/remove test: test topology restrictions for views with tablets test: add test_topology_ops_with_rf_rack_valid topology coordinator: restrict node join/remove to preserve RF-rack validity topology coordinator: add validation to node remove locator: extend rf-rack validation functions view: change validate_view_keyspace to allow MVs if RF=Racks db: enforce rf-rack-validity for keyspaces with views replica/db: add enforce_rf_rack_validity_for_keyspace helper db: remove enforce parameter from check_rf_rack_validity test: adjust test 
to not break rf-rack validity	2026-01-09 10:01:23 +02:00
Benny Halevy	93b827c185	database: truncate_table_on_all_shards: drop outdated TODO comment The comment was added in `83323e155e`. Since then, table::seal_active_memtable was improved to guarantee waiting on outstanding flushes on success (see `d55a2ac762`), so we can remove this TODO comment (it is also not covered by any issue, so nobody is planning to work on it). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Benny Halevy	2a803d2261	database: truncate_table_on_all_shards: consider can_flush on all shards can_flush might return a different value for each shard so check it right before deciding whether to flush or clear a memtable shard. Note that under normal condition can_flush would always return true now that it checks only the presence of the seal memtable function rather than check memtable_list::empty(). Fixes #27639 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Benny Halevy	02ee341a03	memtable_list: unify can_flush and may_flush Now that we have a unit test proving that it's safe to flush an empty memtable list there is no need to distinguish between may_flush and can_flush. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Pavel Emelyanov	e31b72c61f	table: Rename and move snapshot_on_all_shards() method Now it's database::snapshot_table_on_all_shards(). This is symmetric to database::truncate_table_on_all_shards(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:14:36 +03:00
Michael Litvak	9e1f78d162	locator: extend rf-rack validation functions Extend the locator function assert_rf_rack_valid_keyspace to accept arbitrary topology dc-rack maps and nodes instead of using the current token metadata. This allows us to add a new variant of the function that checks rf-rack validity given a topology change that we want to apply. we will use it to check that rf-rack validity will be maintained before applying the topology change. The possible topology changes for the check are node add and node remove / decommission. These operations can change the number of normal racks - if a new node is added to a new rack, or the last node is removed from a rack.	2025-12-22 09:14:29 +01:00
Michael Litvak	8df61f6d99	view: change validate_view_keyspace to allow MVs if RF=Racks The function validate_view_keyspace checks if a keyspace is eligible for having materialized views, and it is used for validation when creating a MV or a MV-based index. Previously, it was required that the rf_rack_valid_keyspaces option is set in order for tablets-based keyspaces to be considered eligible, and the RF-rack condition was enforced when the option is set. Instead of this, we change the validation to allow MVs in a keyspace if the RF-rack condition is satisfied for the keyspace - regardless of the config option. We remove the config validation for views on startup that validates the option `rf_rack_valid_keyspaces` is set if there are any views with tablets, since this is not required anymore. We can do this without worrying about upgrades because this change will be effective from 2025.4 where MVs with tablets are first out of experimental phase. We update the test for MV and index restrictions in tablets keyspaces according to the new requirements. * Create MV/index: previously the test checked that it's allowed only if the config option `rf_rack_valid_keyspaces` is set. This is changed now so it's always allowed to create MV/index if the keyspace is RF-rack-valid. Update the test to verify that we can create MV/index when the keyspace is RF-rack-valid, even if the rf_rack option is not set, and verify that it fails when the keyspace is RF-rack-invalid. * Alter: Add a new test to verify that while a keyspace has views, it can't be altered to become RF-rack-invalid.	2025-12-22 09:14:29 +01:00
Michael Litvak	de1bb84fca	db: enforce rf-rack-validity for keyspaces with views Extend the RF-rack-validity enforcement to keyspaces that have views, regardless of the option `rf_rack_valid_keyspaces`. Previously, RF-rack-validity was enforced when `rf_rack_valid_keyspaces` was set for all keyspaces. Now we want to allow creating MVs in tablet keyspaces that are RF-rack-valid and enforce the RF-rack-validity even if the config option is not set.	2025-12-22 09:13:49 +01:00


688 Commits