scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 07:42:16 +00:00

Author	SHA1	Message	Date
Michał Jadwiszczak	1a32ccd8f6	db/system_keyspace: replace `make_remove_view_building_task_mutation()` with mutation builder Again, get rid of the system keyspace method in favor of the mutation builder, because `system.view_building_tasks` is a single-partition table.	2026-05-13 10:06:18 +02:00
Michał Jadwiszczak	e002665aa7	db/system_keyspace: replace `make_view_building_task_mutation()` with mutation builder `system.view_building_tasks` is a single partition table, so it makes more sense to use a mutation builder and generate 1 mutation per group0 command instead of generating multiple mutations.	2026-05-12 21:49:18 +02:00
Piotr Dulikowski	7c2b1ea0b5	Merge 'view_building: fix tombstone_warn_threshold warnings' from Michał Jadwiszczak `system.view_building_tasks` is a single-partition Raft group0 table (pk = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters. Two-part fix: 1. Range tombstones instead of row tombstones (commits 2–3) Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction. 2. Bounded scan with `min_task_id` (commits 4–6) Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all. - Add a `min_task_id timeuuid` static column to `system.view_building_tasks`. - On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch). - On reload, read `min_task_id` first using a static-only partition slice (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted. - Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows. The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan. The issue is not critical, so the fix shouldn't be backported. 
Fixes SCYLLADB-657 Closes scylladb/scylladb#28929 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning docs: document tombstone avoidance in view_building_tasks view_building: add `task_uuid_generator` to `view_building_task_mutation_builder` view_building: introduce `task_uuid_generator` view_building: store `min_alive_uuid` in view building state view_building: set min_task_id when GC-ing finished tasks view_building: add min_task_id support to view_building_task_mutation_builder view_building: add min_task_id static column and bounded scan to system_keyspace view_building: use range tombstone when GC-ing finished tasks view_building: add range tombstone support to view_building_task_mutation_builder view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature	2026-05-12 12:38:25 +03:00
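The two-part fix in the merge above can be illustrated with a small model. This is only a sketch, not ScyllaDB code: `task_id` is a plain integer standing in for a timeuuid, and `find_min_alive`, `covered_by_range_tombstone` and `bounded_scan` are hypothetical helper names. It shows how `min_alive_uuid` is derived, which rows the single range tombstone `[before_all, min_alive_uuid)` covers, and which rows a scan bounded by `id >= min_task_id` still visits. What happens when no task is alive is not described in the commit message, so the sketch simply returns an empty optional in that case.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// Assumption: a 64-bit integer stands in for the timeuuid clustering key.
using task_id = std::uint64_t;

struct task {
    bool finished = false;
};

// Smallest id among unfinished tasks, or nullopt when every task is finished.
std::optional<task_id> find_min_alive(const std::map<task_id, task>& tasks) {
    for (const auto& [id, t] : tasks) {  // std::map iterates in ascending key order
        if (!t.finished) {
            return id;
        }
    }
    return std::nullopt;
}

// Ids deleted by one range tombstone [before_all, min_alive).
std::vector<task_id> covered_by_range_tombstone(const std::map<task_id, task>& tasks,
                                                task_id min_alive) {
    std::vector<task_id> covered;
    const auto end = tasks.lower_bound(min_alive);
    for (auto it = tasks.begin(); it != end; ++it) {
        // Everything below the minimum alive id must already be finished.
        assert(it->second.finished);
        covered.push_back(it->first);
    }
    return covered;
}

// Ids visited by a scan with the lower bound `id >= min_task_id`.
std::vector<task_id> bounded_scan(const std::map<task_id, task>& tasks,
                                  task_id min_task_id) {
    std::vector<task_id> visited;
    for (auto it = tasks.lower_bound(min_task_id); it != tasks.end(); ++it) {
        visited.push_back(it->first);
    }
    return visited;
}
```

With tasks 1 and 2 finished, 3 alive, 4 finished and 5 alive, the minimum alive id is 3, the range tombstone covers ids 1 and 2, and the bounded scan visits 3, 4 and 5. Finished tasks above the boundary (id 4 here) are outside the range tombstone in this model.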
Nadav Har'El	5895dff03b	migration_manager: unique timeout exception for wait_for_schema_agreement() Before this patch, if wait_for_schema_agreement() timed out, it threw a generic std::runtime_error, making it inconvenient for callers to catch only this error. So in this patch we create and use a new exception type, schema_agreement_timeout, based on seastar::timed_out_error. Although wait_for_schema_agreement(), added in commit `a429018a8a`, was a utility function used in a dozen places, it has become less interesting after we introduced schema changes over Raft, and over the years most of the callers to this function were removed, except one in view.cc which uses an infinite timeout, so doesn't care about the timeout exception type. In the next patch we want to add a new caller which does care about the timeout exception type - hence this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 10:38:38 +03:00
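The benefit of a dedicated exception type can be sketched as follows. This is a self-contained illustration, not the ScyllaDB code: `timed_out_error` here is a local stand-in for `seastar::timed_out_error`, and `outcome` and `classify` are hypothetical names. The point is that a caller can catch the schema agreement timeout specifically, while existing handlers for the base timeout type keep working.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Assumption: local stand-in for seastar::timed_out_error.
struct timed_out_error : std::exception {
    const char* what() const noexcept override { return "timed out"; }
};

// Dedicated type, derived from the generic timeout error.
struct schema_agreement_timeout : timed_out_error {
    const char* what() const noexcept override {
        return "Unable to reach schema agreement";
    }
};

enum class outcome { agreed, schema_timeout, other_timeout, other_error };

// Runs fn and reports which kind of failure, if any, it raised.
template <typename Fn>
outcome classify(Fn&& fn) {
    try {
        fn();
        return outcome::agreed;
    } catch (const schema_agreement_timeout&) {
        return outcome::schema_timeout;  // most specific handler first
    } catch (const timed_out_error&) {
        return outcome::other_timeout;
    } catch (const std::exception&) {
        return outcome::other_error;
    }
}

// Usage example.
void example() {
    assert(classify([] {}) == outcome::agreed);
    assert(classify([] { throw schema_agreement_timeout{}; }) == outcome::schema_timeout);
    assert(classify([] { throw timed_out_error{}; }) == outcome::other_timeout);
    assert(classify([] { throw std::runtime_error("x"); }) == outcome::other_error);
}
```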
Michał Jadwiszczak	b64f2d2e90	view_building: introduce `task_uuid_generator` With the new `min_alive_uuid` saved in the group0 table, we need to make sure that all new tasks are created with a time uuid greater than the value saved in `min_alive_uuid`. This patch introduces the `task_uuid_generator`, which ensures that when we are generating multiple tasks in one group0 command, each task will have a unique time uuid and each time uuid will be greater than `min_alive_uuid`.	2026-04-22 09:10:14 +02:00
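The two guarantees described above (uniqueness within one group0 command, and every id greater than `min_alive_uuid`) can be sketched with a monotonic generator. This is a hypothetical model, not the actual `task_uuid_generator`: ids are plain 64-bit integers instead of timeuuids, the class name `task_uuid_generator_sketch` is made up, and the current time is passed in by the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: a 64-bit integer stands in for a timeuuid.
using id_type = std::uint64_t;

class task_uuid_generator_sketch {
    id_type _last;  // last id handed out, initially min_alive_uuid
public:
    explicit task_uuid_generator_sketch(id_type min_alive_uuid)
        : _last(min_alive_uuid) {}

    // Returns an id that is strictly greater than both min_alive_uuid and
    // every id returned earlier, preferring the caller-supplied timestamp.
    id_type next(id_type now) {
        assert(_last < std::numeric_limits<id_type>::max());
        _last = std::max(now, _last + 1);
        return _last;
    }
};
```

If the clock lags behind `min_alive_uuid`, or several tasks are generated at the same timestamp, the generator falls back to incrementing the previous id, so ids stay unique and above the boundary.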
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Botond Dénes	475220b9c9	Merge 'Remove the rest of pre raft topology code' from Gleb Natapov Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is till not upgraded to raft topology. Both of those are not supported any more. No need to backport since we remove functionality here. Closes scylladb/scylladb#28841 * github.com:scylladb/scylladb: service level: remove version 1 service level code features: move GROUP0_SCHEMA_VERSIONING to deprecated features list migration_manager: remove unused forward definitions test: remove unused code auth: drop auth_migration_listener since it does nothing now schema: drop schema_registry_entry::maybe_sync() function schema: drop make_table_deleting_mutations since it should not be needed with raft schema: remove calculate_schema_digest function schema: drop recalculate_schema_version function and its uses migration_manager: drop check for group0_schema_versioning feature cdc: drop usage of cdc_local table and v1 generation definition storage_service: no need to add yourself to the topology during reboot since raft state loading already did it storage_service: remove unused functions group0: drop with_raft() function from group0_guard since it always returns true now gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more gossiper: drop tokens from loaded_endpoint_state gossiper: remove unused functions storage_service: do not pass loaded_peer_features to join_topology() storage_service: remove unused fields from replacement_info gossiper: drop is_safe_for_restart() function and its use storage_service: remove unused variables from join_topology gossiper: remove the code that was only used in gossiper topology storage_service: drop the check for raft mode from recovery code cdc: remove legacy code test: remove unused injection points auth: remove legacy auth mode and upgrade code treewide: remove schema pull code since we never pull schema any more raft topology: drop upgrade_state and 
its type from the topology state machine since it is not used any longer group0: hoist the checks for an illegal upgrade into main.cc api: drop get_topology_upgrade_state and always report upgrade status as done service_level_controller: drop service level upgrade code test: drop run_with_raft_recovery parameter to cql_test_env group0: get rid of group0_upgrade_state storage_service: drop topology_change_kind as it is no longer needed storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more service_storage: remove unused functions storage_service: remove non raft rebuild code storage_service: set topology change kind only once group0: drop in_recovery function and its uses group0: rename use_raft to maintenance_mode and make it sync	2026-03-11 10:24:20 +02:00
Gleb Natapov	74b5a8d43d	schema: drop schema_registry_entry::maybe_sync() function Schema is synced through group0 now. Drop all the tests of the function as well.	2026-03-10 10:46:47 +02:00
Gleb Natapov	08e33ad7f7	schema: drop recalculate_schema_version function and its uses There is no need to recalculate schema version any more since it is set by group0.	2026-03-10 10:46:39 +02:00
Gleb Natapov	7bb334a5dd	migration_manager: drop check for group0_schema_versioning feature We do not allow upgrading from a version that does not have it any longer.	2026-03-10 10:39:59 +02:00
Gleb Natapov	0e3e7be335	group0: drop with_raft() function from group0_guard since it always returns true now Also drop the code that assumed that the function can return false.	2026-03-10 10:39:58 +02:00
Gleb Natapov	02fc4ad0a9	treewide: remove schema pull code since we never pull schema any more Schema pull was used by legacy schema code which is not supported for a long time now and during legacy recovery which is no longer supported as well. It can be dropped now.	2026-03-10 10:09:39 +02:00
Michał Chojnowski	ff60a5f1e5	cql3: suggest ALTER MATERIALIZED VIEW to users trying to use ALTER TABLE on a view When a user tries to use ALTER TABLE on a materialized view, the resulting error message is `Cannot use ALTER TABLE on Materialized View`. The intention behind this error is that ALTER MATERIALIZED VIEW should be used instead. But we observed that some users interpret this error message as a general "You cannot do any ALTER on this thing". This patch enhances the error message (and others similar to it) to prevent the confusion. Closes scylladb/scylladb#28831	2026-03-09 15:07:21 +01:00
Gleb Natapov	7d7cbae763	raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more No need for locking any more so the function may just return a value and be synchronous.	2026-02-25 10:08:32 +02:00
Pavel Emelyanov	5ce12f2404	gossiper: Export its scheduling group for those who need it There are several places in the code that need to explicitly switch into gossiper scheduling group. For that they currently call database to provide the group, but it's better to get gossiper sched group from gossiper itself, all the more so all those places have gossiper at hand. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-28 18:29:33 +03:00
Pavel Emelyanov	0da1a222fc	migration_manager: Reorder members This is to initialize dependency references, in particular gossiper&, before _group0_barrier. The latter will need to access this->_gossiper in the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-28 18:29:33 +03:00
Avi Kivity	c6dfae5661	treewide: #include Seastar headers with angle brackets Seastar is an external library from the point of view of ScyllaDB, so should be included with angle brackets. Closes scylladb/scylladb#27947	2026-01-13 14:56:15 +02:00
Michael Litvak	55f4a2b754	migration_listener: fix deadlock in nested notifications Calling a migration notification from the context of a notification callback could lead to a deadlock with unregistering a listener: A: the parent notification is called. It calls thread_for_each, where it acquires a read lock on the vector of listeners, and calls the callback function for each listener while holding the lock. B: a listener is unregistered. It calls `remove` and tries to acquire a write lock on the vector of listeners. It waits because the lock is held. A: the callback function calls another notification and calls thread_for_each, which tries to acquire the read lock again, but it waits since there is a waiter. Currently we have such a concrete scenario when creating a table, where the callback of `before_create_column_family` in the tablet allocator calls `before_allocate_tablet_map`, and this could deadlock with node shutdown where we unregister listeners. Fix this by not acquiring the read lock again in the nested notification. There is no need because the read lock is already held by the parent notification while the child notification is running. We add a function `thread_for_each_nested` that is similar to `thread_for_each` except it assumes the read lock is already held and doesn't acquire it, and it should be used for nested notifications instead of `thread_for_each`. Fixes scylladb/scylladb#27364 Closes scylladb/scylladb#27637	2025-12-17 14:00:28 +01:00
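The A/B/A sequence above can be replayed with a single-threaded toy model. This is an illustration under stated assumptions, not the ScyllaDB implementation: `fair_rwlock_model` and `notifier` are hypothetical types, the lock is a non-thread-safe counter that only models "a new reader queues behind a waiting writer", and "would block forever" is reported as a `false` return value instead of actually blocking. The names `thread_for_each` and `thread_for_each_nested` come from the commit message, but their signatures here are invented.

```cpp
#include <cassert>
#include <functional>

// Toy model of a read/write lock in which new readers queue behind a
// waiting writer. Not thread-safe; for illustration only.
struct fair_rwlock_model {
    int readers = 0;
    int waiting_writers = 0;

    // Returns false when a real lock would make the reader wait.
    bool try_read_lock() {
        if (waiting_writers > 0) {
            return false;
        }
        ++readers;
        return true;
    }

    void read_unlock() {
        assert(readers > 0);
        --readers;
    }

    // Returns true if acquired at once; otherwise registers a waiting writer.
    bool try_write_lock() {
        if (readers > 0) {
            ++waiting_writers;
            return false;
        }
        return true;
    }
};

class notifier {
    fair_rwlock_model _lock;
public:
    // Step B: a listener is being unregistered and wants the write lock.
    bool writer_arrives() { return _lock.try_write_lock(); }

    // Top-level notification: takes the read lock around the callback.
    // Returns false when acquiring the lock would block.
    bool thread_for_each(const std::function<void()>& callback) {
        if (!_lock.try_read_lock()) {
            return false;
        }
        callback();
        _lock.read_unlock();
        return true;
    }

    // Nested notification: relies on the read lock held by the parent.
    void thread_for_each_nested(const std::function<void()>& callback) {
        assert(_lock.readers > 0);
        callback();
    }
};
```

In this model, calling `thread_for_each` from inside a callback after `writer_arrives()` returns `false`, which corresponds to the deadlock: the nested reader waits for the writer, and the writer waits for the parent reader. Calling `thread_for_each_nested` in the same position runs the callback without touching the lock.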
Michał Jadwiszczak	24d69b4005	db/view/view_building_state: replace task's state with `aborted` flag After previous commits, we can drop entire task's state and replace it with single boolean flag, which determines if a task was aborted. Once a task was aborted, it cannot get resurrected to a normal state.	2025-11-25 12:14:04 +01:00
Pavel Emelyanov	1c9c4c8c8c	Merge 'service: attach storage_service to migration_manager using pluggable' from Marcin Maliszkiewicz Migration manager depends on storage service. For instance, it has a reload_schema_in_bg background task which calls _ss.local() so it expects that storage service is not stopped before it stops. To solve this we use permit approach, and during storage_service stop: - we ignore new code execution in migration_manager which'd use storage_service - but wait with storage_service shutdown until all existing executions are done Fixes scylladb/scylladb#26734 Backport: no need, problem existed since very long time, code restructure in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made it hitting more often, as _ss was called earlier, but it's not released yet. Closes scylladb/scylladb#26779 * github.com:scylladb/scylladb: service: attach storage_service to migration_manager using pluggabe service: migration_manager: corutinize merge_schema_from service: migration_manager: corutinize reload_schema	2025-11-14 15:14:28 +03:00
Marcin Maliszkiewicz	958d04c349	service: attach storage_service to migration_manager using pluggabe Migration manager depends on storage service. For instance, it has a reload_schema_in_bg background task which calls _ss.local() so it expects that storage service is not stopped before it stops. To solve this we use permit approach, and during storage_service stop: - we ignore new code execution in migration_manager which'd use storage_service - but wait with storage_service shutdown until all existing executions are done Fixes scylladb/scylladb#26734	2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz	cf9b2de18b	service: migration_manager: corutinize merge_schema_from It's needed to easily keep-alive pluggable storage_service permit in a following commit.	2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz	5241e9476f	service: migration_manager: corutinize reload_schema It's needed to easily keep-alive pluggable storage_service permit in a following commit.	2025-11-14 08:50:18 +01:00
Michael Litvak	eefae4cc4e	migration_manager: pass timestamp to pre_create Pass the write timestamp as a parameter to the on_pre_create_column_families notification.	2025-11-13 16:59:43 +01:00
Piotr Dulikowski	2e5eb92f21	Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema. The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error. We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table. When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well. The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit. 
Fixes https://github.com/scylladb/scylladb/issues/26405 backport not needed - enhancement Closes scylladb/scylladb#24960 * github.com:scylladb/scylladb: test: cdc: test cdc compatible schema cdc: use compatiable cdc schema db: schema_applier: create schema with pointer to CDC schema db: schema_applier: extract cdc tables schema: add pointer to CDC schema schema_registry: remove base_info from global_schema_ptr schema_registry: use extended_frozen_schema in schema load schema_registry: replace frozen_schema+base_info with extended_frozen_schema frozen_schema: extract info from schema_ptr in the constructor frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema	2025-11-13 10:11:54 +01:00
Nikos Dragazis	56e5dfc14b	migration_manager: Add missing validations for schema extensions The migration manager offers some free functions to prepare mutations for a new/updated table/view. Most of them include a validation check for the schema extensions, but in the following ones it's missing: * `prepare_new_column_family_announcement` (overload with vector as out parameter) * `prepare_new_column_families_announcement` Presumably, this was just an omission. It's also not a very important one since the only extension having validation logic is the `encryption_schema_extension`, but none of these functions is connected to user queries where encryption options can be provided in the schema. User queries go through the other `prepare_new_column_family_announcement` overload, which does perform a validation check. Add validation in the missing places. Fixes #26470. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#26487	2025-11-11 10:08:58 +02:00
Michael Litvak	6e2513c4d2	db: schema_applier: create schema with pointer to CDC schema When creating a schema for a non-CDC table in the schema_applier, find its CDC schema that we created previously in the same operation, if any, and create the schema with a pointer to the CDC schema. We use the fact that for a base table with CDC enabled, its CDC schema is created or altered together in the same group0 operation. Similarly, in schema_tables, when creating table schemas from the schema tables, first create all schemas that don't have CDC enabled, then create schemas that have CDC enabled by extending them with the pointer to the CDC schema that we created before. There are few additional cases where we create schemas that we need to consider how to handle. When loading a schema from schema tables in the schema_loader we decide not to set the CDC schema, because this schema is mostly used for tools and it's not used for generating CDC mutations. When transporting a schema by RPC in the migration manager, we don't transport its CDC schema, and we always set it to null. Because we use raft we expect this shouldn't have any effect, because the schema is synchronized through raft and not through the RPC.	2025-10-21 14:13:43 +02:00
Michael Litvak	ac96e40f13	schema: add pointer to CDC schema Add to the schema object a member that points to the CDC schema object that is compatible with this schema, if any. The compatible CDC schema is created and altered with its base schema in the same group0 operation. When generating CDC log mutations for some base mutation we want them to be created using a compatible schema that has a CDC column corresponding to each base column. This change will allow us to find the right CDC schema given a base mutation. We also update the relevant structures in the schema registry that are related to learning about schemas and transporting schemas across shards or nodes. When transporting a schema as frozen_schema, we need to transport the frozen cdc schema as well, and set it again when unfreezing and reconstructing the schema. When adding a schema to the registry, we need to ensure its CDC schema is added to the registry as well. Currently we always set the CDC schema to nullptr and maintain the previous behavior. We will change it in a later commit. Until then, we mark all places where CDC schema is passed clearly so we don't forget it.	2025-10-21 14:13:43 +02:00
Michael Litvak	085abef05d	schema_registry: use extended_frozen_schema in schema load Change the schema loader type in the schema_registry to return an extended_frozen_schema instead of view_schema_and_base_info, and remove view_schema_and_base_info which is not used anymore. The casting between them is trivial.	2025-10-21 14:13:43 +02:00
Michael Litvak	278801b2a6	frozen_schema: extract info from schema_ptr in the constructor Currently we construct a frozen schema with base info in few places, and the caller is responsible for constructing the frozen schema and extracting the base info if it's a view table. We change it to make it simpler and remove the burden from the caller. The caller can simply pass the schema_ptr, and the constructor for extended_frozen_schema will construct the frozen schema and extract the additional info it needs. This will make it easier to add additional fields, and reduces code duplication. We also make temporary castings between extended_frozen_schema and view_schema_and_base_info for the transition, which are trivial, until they are combined to a single type.	2025-10-21 14:13:42 +02:00
Marcin Maliszkiewicz	389afcdeb6	service: fix dependencies during migration_manager startup We need to avoid reloading schema early as it goes via schema_applier which internally depends on storage_service and on distributed_loader initializing all keyspaces. Simply moving migration manager startup later in the code is not easy as some services depend on it being initialized so we just enable those feature listeners a bit later.	2025-10-14 10:56:26 +02:00
Benny Halevy	b17a36c071	tablets: read_tablet_mutations: use unfreeze_and_split_gently Split the tablets mutations by number of rows, based on `min_tablets_in_mutation` (currently calibrated to 1024), similar to the splitting done in `storage_service::merge_topology_snapshot`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Michael Litvak	5a7e6e53ff	cdc: fix create table with cdc if not exists Fix an issue where executing a CREATE TABLE IF NOT EXISTS statement with CDC enabled fails with an error if the table already exists. Instead, the query should succeed and be a no-op. This regression was introduced by commit `fed1048059`. Previously, when executing the query, we would first check if the table exists in do_prepare_new_column_families_announcement. If it did, we would throw an already_exists_exception, which was handled correctly; otherwise, we would continue and create the CDC table in the before_create_column_families notification. The order of operations was changed in `fed1048059`, causing the regression. Now, we first create the CDC schema and add it to the schema list for creation, and then check for each of them if they already exist. The problem is that when we create the CDC schema in on_pre_create_column_families, it also checks if the CDC table already exists. If it does, it throws an invalid_request_exception, which is not caught and handled as expected. This patch restores the previous order of operations: we first check if the tables exist, and only then add the CDC schema in pre_create. Fixes scylladb/scylladb#26142	2025-09-21 09:38:36 +02:00
Michael Litvak	7f2cd06bdc	migration_listener: add on_before_allocate_tablet_map notification Add a new notification on_before_allocate_tablet_map that is called when creating a tablet map for a new table and passes the tablet map. This will be useful next for CDC, for example: when creating tablets for a new table we want to create CDC streams for each tablet in the same operation, and we need to have the tablet map with the tablet count and tokens for each tablet, because the CDC streams are based on that. We need to slightly change the tablet allocation code for this to work with colocated tables, because previously when we created the tablet map of a colocated table we didn't have a reference to the base tablet map, but now we do need it so we can pass it to the notification.	2025-09-17 14:47:11 +02:00
Michael Litvak	fed1048059	cdc: move cdc table creation to pre_create When creating a new table with CDC enabled, we also create a CDC log table by adding the CDC table's mutations in the same operation. Previously, this worked by the CDC log service subscribing to on_before_create_column_family and adding the CDC table's mutations there when being notified about a newly created table. The problem is that when we create the tables we also create their tablet maps in the tablet allocator, and we want to create the two tables as co-located tables: we allocate a tablet map for the base table, and the CDC table is co-located with the base table. This doesn't work well with the previous approach because the notification that creates the CDC table is the same notification in which the tablet allocator creates the base tablet map, so the two operations are independent, but really we want the tablet allocator to work on both tables together, so that we have the base table's schema and tablet map when we create the CDC table's co-located tablet map. In order to achieve this, we want to create and add the CDC table's schema, and only after that notify using before_create_column_families with a vector that contains both the base table and CDC table. The tablet allocator will then have all the information it needs to create the co-located tablet map. We move the creation of the CDC log table - instead of adding the table's mutations in on_before_create_column_family, we create the table schema and add it to the new tables vector in on_pre_create_column_families, which is called by the migration manager in do_prepare_new_column_families_announcement. The migration manager will then create and add all mutations for creating the tables, and notify about the tables being created together.	2025-09-17 14:47:11 +02:00
Michał Jadwiszczak	6e3e287a39	db/schema_tables: create/cleanup tasks when an index is created/dropped Similarly as in previous commits, create view building tasks when an index is created and cleanup view building status when it's dropped.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	76caaea3f1	service/migration_manager: cleanup view building state on drop keyspace When a keyspace is dropped, remove all unfinished building tasks for all views and remove their entries from `system.view_built_status_v2` and `system.built_views`.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	f10c5c4493	service/migration_manager: cleanup view building state on drop view When a view is dropped, remove all unfinished building tasks, remove entries from `system.view_built_status_v2` and `system.built_views`. If the view is currently being built, removing its tasks means they are also aborted. Finished tasks are already removed from the table.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	6d1fbf06ed	service/migration_manager: create view building tasks on create view Create view building tasks in the same batch as new view mutations. The tasks are created only if `VIEW_BUILDING_COORDINATOR` feature is on and the view is in tablet keyspace.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	204f61ffe1	service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` The reference is needed to get `view_building_state_machine`.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	76a6dd82fd	service/migration_manager: coroutinize `prepare_new_view_announcement()`	2025-08-27 08:55:47 +02:00
Gleb Natapov	198cfc6fe7	migration manager: do not use group0 on non zero shard Commit `ddc3b6dcf5` added a check of group0 state in get_schema_for_write(), but group0 client can only be used on shard 0, and get_schema_for_write() can be called on any shard, so we cannot use _group0_client there directly. Move the assert to where we already use another group0 function, where it is guaranteed to run on shard 0. Closes scylladb/scylladb#25204	2025-07-28 14:10:01 +02:00
Petr Gusev	3e0347c614	migration_manager: add timeout to start_group0_operation and announce Pass a timeout parameter through to start_operation() and add_entry(), respectively. This is a preparatory change for the next commit, which will use the timeout to properly handle timeouts during lazy creation of Paxos state tables.	2025-07-24 16:39:50 +02:00
Gleb Natapov	ddc3b6dcf5	migration manager: assert that if schema pull is disabled the group0 is not in use_pre_raft_procedures state If schema pulls are disabled, group0 is used to bring the schema up to date by calling start_group0_operation(), which executes a raft read barrier internally, but if the group0 is still in use_pre_raft_procedures, start_group0_operation() silently does nothing. Later the code that assumes that the schema is already up-to-date will fail and print warnings into the log. But since getting queries in the state when a node is in raft enabled mode but group0 is still not configured is illegal, it is better to make those errors more visible by asserting them during testing. Closes scylladb/scylladb#25112	2025-07-23 14:10:17 +02:00
Botond Dénes	054ea54565	Merge 'streaming: Avoid deadlock by running view checks in a separate scheduling group' from Tomasz Grabiec This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. Even if we didn't deadlock, and the streaming semaphore was simply exhausted by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes #24807 Fixes #24925 Closes scylladb/scylladb#24929 * github.com:scylladb/scylladb: streaming: Avoid deadlock by running view checks in a separate scheduling group service: migration_manager: Run group0 barrier in gossip scheduling group	2025-07-17 10:24:41 +03:00
Avi Kivity	6fce817aa8	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft. Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit(). Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Fixes https://github.com/scylladb/scylladb/issues/24531 Closes scylladb/scylladb#24886 [avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>] * github.com:scylladb/scylladb: test: add type creation to test_snapshot storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during shema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: 
introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-07-13 20:47:55 +03:00
Benny Halevy	3feb759943	everywhere: use utils::chunked_vector for list of mutations Currently, we use std::vector<*mutation> to keep a list of mutations for processing. This can lead to large allocation, e.g. when the vector size is a function of the number of tables. Use a chunked vector instead to prevent oversized allocations. `perf-simple-query --smp 1` results obtained for fixed 400MHz frequency and PGO disabled: Before (read path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 89055.97 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 18003 cycles/op, 0 errors) 103372.72 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39380 insns/op, 17300 cycles/op, 0 errors) 98942.27 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39413 insns/op, 17336 cycles/op, 0 errors) 103752.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39407 insns/op, 17252 cycles/op, 0 errors) 102516.77 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39403 insns/op, 17288 cycles/op, 0 errors) throughput: mean= 99528.13 standard-deviation=6155.71 median= 102516.77 median-absolute-deviation=3844.59 maximum=103752.93 minimum=89055.97 instructions_per_op: mean= 39403.99 standard-deviation=14.25 median= 39406.75 median-absolute-deviation=9.30 maximum=39416.63 minimum=39380.39 cpu_cycles_per_op: mean= 17435.81 standard-deviation=318.24 median= 17300.40 median-absolute-deviation=147.59 maximum=18002.53 minimum=17251.75 ``` After (read path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
59755.04 tps ( 66.2 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39466 insns/op, 22834 cycles/op, 0 errors) 71854.16 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 17883 cycles/op, 0 errors) 82149.45 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39411 insns/op, 17409 cycles/op, 0 errors) 49640.04 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 19975 cycles/op, 0 errors) 54963.22 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 18235 cycles/op, 0 errors) throughput: mean= 63672.38 standard-deviation=13195.12 median= 59755.04 median-absolute-deviation=8709.16 maximum=82149.45 minimum=49640.04 instructions_per_op: mean= 39448.38 standard-deviation=31.60 median= 39466.17 median-absolute-deviation=25.75 maximum=39474.12 minimum=39411.42 cpu_cycles_per_op: mean= 19267.01 standard-deviation=2217.03 median= 18234.80 median-absolute-deviation=1384.25 maximum=22834.26 minimum=17408.67 ``` `perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency and PGO disabled: Before (write path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 63736.96 tps ( 59.4 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 49667 insns/op, 19924 cycles/op, 0 errors) 64109.41 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 49992 insns/op, 20084 cycles/op, 0 errors) 56950.47 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50005 insns/op, 20501 cycles/op, 0 errors) 44858.42 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50014 insns/op, 21947 cycles/op, 0 errors) 28592.87 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50027 insns/op, 27659 cycles/op, 0 errors) throughput: mean= 51649.63 standard-deviation=15059.74 median= 56950.47 median-absolute-deviation=12087.33 maximum=64109.41 minimum=28592.87 instructions_per_op: mean= 49941.18 standard-deviation=153.76 median= 
50005.24 median-absolute-deviation=73.01 maximum=50027.07 minimum=49667.05 cpu_cycles_per_op: mean= 22023.01 standard-deviation=3249.92 median= 20500.74 median-absolute-deviation=1938.76 maximum=27658.75 minimum=19924.32 ``` After (write path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 53395.93 tps ( 59.4 allocs/op, 16.5 logallocs/op, 14.3 tasks/op, 50326 insns/op, 21252 cycles/op, 0 errors) 46527.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50704 insns/op, 21555 cycles/op, 0 errors) 55846.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50731 insns/op, 21060 cycles/op, 0 errors) 55669.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50735 insns/op, 21521 cycles/op, 0 errors) 52130.17 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50757 insns/op, 21334 cycles/op, 0 errors) throughput: mean= 52713.91 standard-deviation=3795.38 median= 53395.93 median-absolute-deviation=2955.40 maximum=55846.30 minimum=46527.83 instructions_per_op: mean= 50650.57 standard-deviation=182.46 median= 50731.38 median-absolute-deviation=84.09 maximum=50756.62 minimum=50325.87 cpu_cycles_per_op: mean= 21344.42 standard-deviation=202.86 median= 21334.00 median-absolute-deviation=176.37 maximum=21554.61 minimum=21060.24 ``` Fixes #24815 Improvement for rare corner cases. No backport required Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24919	2025-07-13 19:13:11 +03:00
Tomasz Grabiec	ee2fa58bd6	service: migration_manager: Run group0 barrier in gossip scheduling group Fixes two issues. The first is a potential priority inversion. The barrier is executed using the scheduling group of the first fiber that triggers it; the rest block waiting on it. For example, CQL statements which need to sync the schema on the replica side can block on a barrier triggered by streaming. That's undesirable. This is theoretical, not observed in the field. The second problem is blocking the error path. This barrier is called from the streaming error handling path. If the streaming concurrency semaphore is exhausted, and streaming fails due to a timeout on obtaining the permit in check_needs_view_update_path(), the error path will block too, because it will also attempt to obtain the permit as part of the group0 barrier. Running it in the gossip scheduling group prevents this. Fixes #24925	2025-07-11 16:29:31 +02:00
Marcin Maliszkiewicz	2f840e51d1	service: pull out update_tablet_metadata from migration_listener It's not a good usage as there is only one non-empty implementation. Also we need to change it further in the following commit which makes it incompatible with listener code.	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	fa157e7e46	db: service: add store_service dependency to schema_applier There is already an implicit logical dependency via migration_notifier, but in the next commits we'll be moving store_service out of it, as we need better control (i.e. to return a value from the call).	2025-07-10 10:40:43 +02:00
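
The 'Atomic in-memory schema changes application' merge above describes a phased ordering for future generic code: prepare() on all implementations, a single database::apply(), update() on all implementations, commit() for all implementations on each shard, and then post_commit(). A minimal sketch of that ordering follows. It is not ScyllaDB code: `subsystem_applier`, `logging_applier`, `apply_all`, `run_demo` and the subsystem names are hypothetical, shards are modelled as a plain loop index, and calling post_commit() once after all shards have committed is an assumption, since the commit message does not say where it runs.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface. Only the phase names come from the commit message.
struct subsystem_applier {
    virtual ~subsystem_applier() = default;
    virtual void prepare() = 0;
    virtual void update() = 0;
    virtual void commit(unsigned shard) = 0;
    virtual void post_commit() = 0;
};

using event_log = std::vector<std::string>;

// Toy implementation that only records which phase ran, and in what order.
class logging_applier final : public subsystem_applier {
    std::string _name;
    event_log& _log;
public:
    logging_applier(std::string name, event_log& log)
        : _name(std::move(name)), _log(log) {}
    void prepare() override { _log.push_back(_name + ":prepare"); }
    void update() override { _log.push_back(_name + ":update"); }
    void commit(unsigned shard) override {
        _log.push_back(_name + ":commit@" + std::to_string(shard));
    }
    void post_commit() override { _log.push_back(_name + ":post_commit"); }
};

// Drives all implementations through the phases in the described order.
void apply_all(std::vector<std::unique_ptr<subsystem_applier>>& impls,
               unsigned shard_count, event_log& log) {
    for (auto& a : impls) { a->prepare(); }
    log.push_back("database:apply");  // stand-in for the single database::apply()
    for (auto& a : impls) { a->update(); }
    for (unsigned shard = 0; shard < shard_count; ++shard) {
        // The commit message requires this inner loop to run without
        // preemption on each shard; a plain loop trivially never yields.
        for (auto& a : impls) { a->commit(shard); }
    }
    // Assumption: post_commit() runs once, after every shard has committed.
    for (auto& a : impls) { a->post_commit(); }
}

// Runs two toy subsystems over two simulated shards and returns the phase log.
event_log run_demo() {
    event_log log;
    std::vector<std::unique_ptr<subsystem_applier>> impls;
    impls.push_back(std::make_unique<logging_applier>("schema", log));
    impls.push_back(std::make_unique<logging_applier>("tablets", log));
    apply_all(impls, 2, log);
    return log;
}
```

Keeping commit() as a separate phase that does no I/O is what lets a change appear atomic across subsystems: all fallible work happens in prepare() and update(), so the commit loop can run to completion without yielding.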


448 Commits