scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-24 10:30:38 +00:00

Author	SHA1	Message	Date
Łukasz Paszkowski	98a6002a1a	compaction_manager: cancel submission timer on drain The `drain` method, cancels all running compactions and moves the compaction manager into the disabled state. To move it back to the enabled state, the `enable` method shall be called. This, however, throws an assertion error as the submission time is not cancelled and re-enabling the manager tries to arm the armed timer. Thus, cancel the timer, when calling the drain method to disable the compaction manager. Fixes https://github.com/scylladb/scylladb/issues/24504 All versions are affected. So it's a good candidate for a backport. Closes scylladb/scylladb#24505 (cherry picked from commit `a9a53d9178`) Closes scylladb/scylladb#24585	2025-06-29 14:40:43 +03:00
Avi Kivity	9f1ed14f9d	Merge '[Backport 6.2] cql: create default superuser if it doesn't exist' from Marcin Maliszkiewicz Backport of https://github.com/scylladb/scylladb/pull/20137 is needed for another backport https://github.com/scylladb/scylladb/pull/24690 to work correctly. Fixes https://github.com/scylladb/scylladb/issues/24712 Closes scylladb/scylladb#24711 * github.com:scylladb/scylladb: test: test_restart_cluster: create the test auth: standard_role_manager allows awaiting superuser creation auth: coroutinize the standard_role_manager start() function auth: don't start server until the superuser is created	2025-06-29 14:34:55 +03:00
Aleksandra Martyniuk	8321747451	test: rest_api: fix test_repair_task_progress test_repair_task_progress checks the progress of children of root repair task. However, nothing ensures that the children are already created. Wait until at least one child of a root repair task is created. Fixes: #24556. Closes scylladb/scylladb#24560 (cherry picked from commit `0deb9209a0`) Closes scylladb/scylladb#24652	2025-06-28 09:40:37 +03:00
Paweł Zakrzewski	3bd3d720e0	test: test_restart_cluster: create the test The purpose of this test that the cluster is able to boot up again after a full cluster shutdown, thus exhibiting no issues when connecting to raft group 0 that is larger than one. (cherry picked from commit `900a6706b8`)	2025-06-27 17:50:15 +02:00
Paweł Zakrzewski	3d6d3484b5	auth: standard_role_manager allows awaiting superuser creation This change implements the ability to await superuser creation in the function ensure_superuser_is_created(). This means that Scylla will not be serving CQL connections until the superuser is created. Fixes #10481 (cherry picked from commit `7008b71acc`)	2025-06-27 17:50:08 +02:00
Paweł Zakrzewski	db1d3cc342	auth: coroutinize the standard_role_manager start() function This change is a preparation for the next change. Moving to coroutines makes the code more readable and easier to process. (cherry picked from commit `04fc82620b`)	2025-06-27 17:50:01 +02:00
Paweł Zakrzewski	66301eb2b6	auth: don't start server until the superuser is created This change reorganizes the way standard_role_manager startup is handled: now the future returned by its start() function can be used to determine when startup has finished. We use this future to ensure the startup is finished prior to starting the CQL server. Some clusters are created without auth, and auth is added later. The first node to recognize that auth is needed must create the superuser. Currently this is always on restart, but if we were to ever make it LiveUpdate then it would not be on restart. This suggests that we don't really need to wait during restart. This is a preparatory commit, laying ground for implementation of a start() function that waits for the superuser to be created. The default implementation returns a ready future, which makes no change in the code behavior. (cherry picked from commit `f525d4b0c1`)	2025-06-27 17:49:33 +02:00
Pavel Emelyanov	e5ac2285c0	Merge '[Backport 6.2] memtable: ensure _flushed_memory doesn't grow above total_memory' from Scylladb[bot] `dirty_memory_manager` tracks two quantities about memtable memory usage: "real" and "unspooled" memory usage. "real" is the total memory usage (sum of `occupancy().total_space()`) by all memtable LSA regions, plus a upper-bound estimate of the size of memtable data which has already moved to the cache region but isn't evictable (merged into the cache) yet. "unspooled" is the difference between total memory usage by all memtable LSA regions, and the total flushed memory (sum of `_flushed_memory`) of memtables. `dirty_memory_manager` controls the shares of compaction and/or blocks writes when these quantities cross various thresholds. "Total flushed memory" isn't a well defined notion, since the actual consumption of memory by the same data can vary over time due to LSA compactions, and even the data present in memtable can change over the course of the flush due to removals of outdated MVCC versions. So `_flushed_memory` is merely an approximation computed by `flush_reader` based on the data passing through it. This approximation is supposed to be a conservative lower bound. In particular, `_flushed_memory` should be not greater than `occupancy().total_space()`. Otherwise, for example, "unspooled" memory could become negative (and/or wrap around) and weird things could happen. There is an assertion in `~flush_memory_accounter` which checks that `_flushed_memory < occupancy().total_space()` at the end of flush. But it can fail. Without additional treatment, the memtable reader sometimes emits data which is already deleted. (In particular, it emites rows covered by a partition tombstone in a newer MVCC version.) This data is seen by `flush_reader` and accounted in `_flushed_memory`. But this data can be garbage-collected by the `mutation_cleaner` later during the flush and decrease `total_memory` below `_flushed_memory`. There is a piece of code in `mutation_cleaner` intended to prevent that. If `total_memory` decreases during a `mutation_cleaner` run, `_flushed_memory` is lowered by the same amount, just to preserve the asserted property. (This could also make `_flushed_memory` quite inaccurate, but that's considered acceptable). But that only works if `total_memory` is decreased during that run. It doesn't work if the `total_memory` decrease (enabled by the new allocator holes made by `mutation_cleaner`'s garbage collection work) happens asynchronously (due to memory reclaim for whatever reason) after the run. This patch fixes that by tracking the decreases of `total_memory` closer to the source. Instead of relying on `mutation_cleaner` to notify the memtable if it lowers `total_memory`, the memtable itself listens for notifications about LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's estimate of flushed memory decreased by the change in `total_memory` since the beginning of flush (if it was positive), and it keeps the amount of "spooled" memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`. Fixes scylladb/scylladb#21413 Backport candidate because it fixes a crash that can happen in existing stable branches. - (cherry picked from commit `7d551f99be`) - (cherry picked from commit `975e7e405a`) Parent PR: #21638 Closes scylladb/scylladb#24601 * github.com:scylladb/scylladb: memtable: ensure _flushed_memory doesn't grow above total memory usage replica/memtable: move region_listener handlers from dirty_memory_manager to memtable	2025-06-24 10:12:41 +03:00
Michał Chojnowski	1c23edad22	memtable: ensure _flushed_memory doesn't grow above total memory usage dirty_memory_manager tracks two quantities about memtable memory usage: "real" and "unspooled" memory usage. "real" is the total memory usage (sum of `occupancy().total_space()`) by all memtable LSA regions, plus a upper-bound estimate of the size of memtable data which has already moved to the cache region but isn't evictable (merged into the cache) yet. "unspooled" is the difference between total memory usage by all memtable LSA regions, and the total flushed memory (sum of `_flushed_memory`) of memtables. dirty_memory_manager controls the shares of compaction and/or blocks writes when these quantities cross various thresholds. "Total flushed memory" isn't a well defined notion, since the actual consumption of memory by the same data can vary over time due to LSA compactions, and even the data present in memtable can change over the course of the flush due to removals of outdated MVCC versions. So `_flushed_memory` is merely an approximation computed by `flush_reader` based on the data passing through it. This approximation is supposed to be a conservative lower bound. In particular, `_flushed_memory` should be not greater than `occupancy().total_space()`. Otherwise, for example, "unspooled" memory could become negative (and/or wrap around) and weird things could happen. There is an assertion in ~flush_memory_accounter which checks that `_flushed_memory < occupancy().total_space()` at the end of flush. But it can fail. Without additional treatment, the memtable reader sometimes emits data which is already deleted. (In particular, it emites rows covered by a partition tombstone in a newer MVCC version.) This data is seen `flush_reader` and accounted in `_flushed_memory`. But this data can be garbage-collected by the mutation_cleaner later during the flush and decrease `total_memory` below `_flushed_memory`. There is a piece of code in mutation_cleaner intended to prevent that. If `total_memory` decreases during a `mutation_cleaner` run, `_flushed_memory` is lowered by the same amount, just to preserve the asserted property. (This could also make `_flushed_memory` quite inaccurate, but that's considered acceptable). But that only works if `total_memory` is decreased during that run. It doesn't work if the `total_memory` decrease (enabled by the new allocator holes made by `mutation_cleaner`'s garbage collection work) happens asynchronously (due to memory reclaim for whatever reason) after the run. This patch fixes that by tracking the decreases of `total_memory` closer to the source. Instead of relying on `mutation_cleaner` to notify the memtable if it lowers `total_memory`, the memtable itself listens for notifications about LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's estimate of flushed memory decreased by the change in `total_memory` since the beginning of flush (if it was positive), and it keeps the amount of "spooled" memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`. (cherry picked from commit `975e7e405a`)	2025-06-22 17:37:27 +00:00
Michał Chojnowski	40d1186218	replica/memtable: move region_listener handlers from dirty_memory_manager to memtable The memtable wants to listen for changes in its `total_memory` in order to decrease its `_flushed_memory` in case some of the freed memory has already been accounted as flushed. (This can happen because the flush reader sees and accounts even outdated MVCC versions, which can be deleted and freed during the flush). Today, the memtable doesn't listen to those changes directly. Instead, some calls which can affect `total_memory` (in particular, the mutation cleaner) manually check the value of `total_memory` before and after they run, and they pass the difference to the memtable. But that's not good enough, because `total_memory` can also change outside of those manually-checked calls -- for example, during LSA compaction, which can occur anytime. This makes memtable's accounting inaccurate and can lead to unexpected states. But we already have an interface for listening to `total_memory` changes actively, and `dirty_memory_manager`, which also needs to know it, does just that. So what happens e.g. when `mutation_cleaner` runs is that `mutation_cleaner` checks the value of `total_memory` before it runs, then it runs, causing several changes to `total_memory` which are picked up by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of `total_memory` and passes the difference to `memtable`, which corrects whatever was observed by `dirty_memory_manager`. To allow memtable to modify its `_flushed_memory` correctly, we need to make `memtable` itself a `region_listener`. Also, instead of the situation where `dirty_memory_manager` receives `total_memory` change notifications from `logalloc` directly, and `memtable` fixes the manager's state later, we want to only the memtable listen for the notifications, and pass them already modified accordingl to the manager, so there is no intermediate wrong states. This patch moves the `region_listener` callbacks from the `dirty_memory_manager` to the `memtable`. It's not intended to be a functional change, just a source code refactoring. The next patch will be a functional change enabled by this. (cherry picked from commit `7d551f99be`)	2025-06-22 17:37:27 +00:00
Pavel Emelyanov	60cb1c0b2f	sstable_directory: Print ks.cf when moving unshared remove sstables When an sstable is identified by sstable_directory as remote-unshared, it will at some point be moved to the target shard. When it happens a log-message appears: sstable_directory - Moving 1 unshared SSTables to shard 1 Processing of tables by sstable_directory often happens in parallel, and messages from sstable_directory are intermixed. Having a message like above is not very informative, as it tells nothing about sstables that are being moved. Equip the message with ks:cf pair to make it more informative. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23912 (cherry picked from commit `d40d6801b0`) Closes scylladb/scylladb#24014	2025-06-17 18:24:13 +03:00
Pavel Emelyanov	a00b4a027e	Update seastar submodule (no nested stall backtraces) * seastar e40388c4...b383afb0 (1): > stall_detector: no backtrace if exception Fixes #24464 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24536	2025-06-17 18:23:06 +03:00
Nikos Dragazis	8ef55444f5	sstables: Fix race when loading checksum component `read_checksum()` loads the checksum component from disk and stores a non-owning reference in the shareable components. To avoid loading the same component twice, the function has an early return statement. However, this does not guarantee atomicity - two fibers or threads may load the component and update the shareable components concurrently. This can lead to use-after-free situations when accessing the component through the shareable components, since the reference stored there is non-owning. This can happen when multiple compaction tasks run on the same SSTable (e.g., regular compaction and scrub-validate). Fix this by not updating the reference in shareable components, if a reference is already in place. Instead, create an owning reference to the existing component for the current fiber. This is less efficient than using a mutex, since the component may be loaded multiple times from disk before noticing the race, but no locks are used for any other SSTable component either. Also, this affects uncompressed SSTables, which are not that common. Fixes #23728. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#23872 (cherry picked from commit `eaa2ce1bb5`) Closes scylladb/scylladb#24267	2025-06-17 13:20:30 +03:00
Szymon Malewski	8eebc97ae1	mapreduce_service: Prevent race condition In parallelized aggregation functions super-coordinator (node performing final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`). Usually responses are spread over time and actual merging is atomic. However sometimes partial results are received at the similar time and if an aggregate function (e.g. lua script) yields, two coroutines can try to overwrite the same accumulator one after another, which leads to losing some of the results. To prevent this, in this patch each coroutine stores merging results in its own context and overwrites accumulator atomically, only after it was fully merged. Comparing to the previous implementation order of operands in merging function is swapped, but the order of aggregation is not guaranteed anyway. Fixes #20662 Closes scylladb/scylladb#24106 (cherry picked from commit `5969809607`) Closes scylladb/scylladb#24387	2025-06-17 13:20:12 +03:00
Pavel Emelyanov	884293a382	Merge '[Backport 6.2] tablets: deallocate storage state on end_migration' from Scylladb[bot] When a tablet is migrated and cleaned up, deallocate the tablet storage group state on `end_migration` stage, instead of `cleanup` stage: * When the stage is updated from `cleanup` to `end_migration`, the storage group is removed on the leaving replica. * When the table is initialized, if the tablet stage is `end_migration` then we don't allocate a storage group for it. This happens for example if the leaving replica is restarted during tablet migration. If it's initialized in `cleanup` stage then we allocate a storage group, and it will be deallocated when transitioning to `end_migration`. This guarantees that the storage group is always deallocated on the leaving replica by `end_migration`, and that it is always allocated if the tablet wasn't cleaned up fully yet. It is a similar case also for the pending replica when the migration is aborted. We deallocate the state on `revert_migration` which is the stage following `cleanup_target`. Previously the storage group would be allocated when the tablet is initialized on any of the tablet replicas - also on the leaving replica, and when the tablet stage is `cleanup` or `end_migration`, and deallocated during `cleanup`. This fixes the following issue: 1. A migrating tablet enters cleanup stage 2. the tablet is cleaned up successfuly 3. The leaving replica is restarted, and allocates storage group 4. tablet cleanup is not called because it's already cleaned up 5. the storage group remains allocated on the leaving replica after the migration is completed - it's not cleaned up properly. Fixes https://github.com/scylladb/scylladb/issues/23481 backport to all relevant releases since it's a bug that results in a crash - (cherry picked from commit `34f15ca871`) - (cherry picked from commit `fb18fc0505`) - (cherry picked from commit `bd88ca92c8`) Parent PR: #24393 Closes scylladb/scylladb#24486 * github.com:scylladb/scylladb: test/cluster/test_tablets: test restart during tablet cleanup test: tablets: add get_tablet_info helper tablets: deallocate storage state on end_migration	2025-06-17 13:19:49 +03:00
Michael Litvak	06e5a48d17	test/cluster/test_tablets: test restart during tablet cleanup Add a test that reproduces issue scylladb/scylladb#23481. The test migrates a tablet from one node to another, and while the tablet is in some stage of cleanup - either before or right after, depending on the parameter - the leaving replica, on which the tablet is cleaned, is restarted. This is interesting because when the leaving replica starts and loads its state, the tablet could be in different stages of cleanup - the SSTables may still exist or they may have been cleaned up already, and we want to make sure the state is loaded correctly. (cherry picked from commit `bd88ca92c8`)	2025-06-12 14:33:07 +03:00
Michael Litvak	c5d11205c1	test: tablets: add get_tablet_info helper Add a helper for tests to get the tablet info from system.tablets for a tablet owning a given token. (cherry picked from commit `fb18fc0505`)	2025-06-12 03:24:00 +00:00
Michael Litvak	0b46cfb60d	tablets: deallocate storage state on end_migration When a tablet is migrated and cleaned up, deallocate the tablet storage group state on `end_migration` stage, instead of `cleanup` stage: * When the stage is updated from `cleanup` to `end_migration`, the storage group is removed on the leaving replica. * When the table is initialized, if the tablet stage is `end_migration` then we don't allocate a storage group for it. This happens for example if the leaving replica is restarted during tablet migration. If it's initialized in `cleanup` stage then we allocate a storage group, and it will be deallocated when transitioning to `end_migration`. This guarantees that the storage group is always deallocated on the leaving replica by `end_migration`, and that it is always allocated if the tablet wasn't cleaned up fully yet. It is a similar case also for the pending replica when the migration is aborted. We deallocate the state on `revert_migration` which is the stage following `cleanup_target`. Previously the storage group would be allocated when the tablet is initialized on any of the tablet replicas - also on the leaving replica, and when the tablet stage is `cleanup` or `end_migration`, and deallocated during `cleanup`. This fixes the following issue: 1. A migrating tablet enters cleanup stage 2. the tablet is cleaned up successfuly 3. The leaving replica is restarted, and allocates storage group 4. tablet cleanup is not called because it was already cleaned up 4. the storage group remains allocated on the leaving replica after the migration is completed - it's not cleaned up properly. Fixes scylladb/scylladb#23481 (cherry picked from commit `34f15ca871`)	2025-06-12 03:24:00 +00:00
Michał Chojnowski	4a18a08284	utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity() `chunked_managed_vector` is a vector-like container which splits its contents into multiple contiguous allocations if necessary, in order to fit within LSA's max preferred contiguous allocation limits. Each limited-size chunk is stored in a `managed_vector`. `managed_vector` is unaware of LSA's size limits. It's up to the user of `managed_vector` to pick a size which is small enough. This happens in `chunked_managed_vector::max_chunk_capacity()`. But the calculation is wrong, because it doesn't account for the fact that `managed_vector` has to place some metadata (the backreference pointer) inside the allocation. In effect, the chunks allocated by `chunked_managed_vector` are just a tiny bit larger than the limit, and the limit is violated. Fix this by accounting for the metadata. Also, before the patch `chunked_managed_vector::max_contiguous_allocation`, repeats the definition of logalloc::max_managed_object_size. This is begging for a bug if `logalloc::max_managed_object_size` changes one day. Adjust it so that `chunked_managed_vector` looks directly at `logalloc::max_managed_object_size`, as it means to. Fixes scylladb/scylladb#23854 (cherry picked from commit `7f9152babc`) Closes scylladb/scylladb#24369	2025-06-03 18:10:36 +03:00
Piotr Dulikowski	1571948cd7	topology_coordinator: silence ERROR messages on abort When the topology coordinator is shut down while doing a long-running operation, the current operation might throw a raft::request_aborted exception. This is not a critical issue and should not be logged with ERROR verbosity level. Make sure that all the try..catch blocks in the topology coordinator which: - May try to acquire a new group0 guard in the `try` part - Have a `catch (...)` block that print an ERROR-level message ...have a pass-through `catch (raft::request_aborted&)` block which does not log the exception. Fixes: scylladb/scylladb#22649 Closes scylladb/scylladb#23962 (cherry picked from commit `156ff8798b`) Closes scylladb/scylladb#24074	2025-05-13 20:41:41 +03:00
Pavel Emelyanov	5a82f0d217	Merge '[Backport 6.2] replica: skip flush of dropped table' from Scylladb[bot] Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of dropped table instead. Fixes: #16095. Needs backport to 2025.1 and 6.2 as they contain the bug - (cherry picked from commit `91b57e79f3`) - (cherry picked from commit `c1618c7de5`) Parent PR: #23876 Closes scylladb/scylladb#23904 * github.com:scylladb/scylladb: test: test table drop during flush replica: skip flush of dropped table	2025-05-13 14:00:38 +03:00
Aleksandra Martyniuk	874b4f8d9c	streaming: skip dropped tables Currently, stream_session::prepare throws when a table in requests or summaries is dropped. However, we do not want to fail streaming if the table is dropped. Delete table checks from stream_session::prepare. Further streaming steps can handle the dropped table and finish the streaming successfully. Fixes: #15257. Closes scylladb/scylladb#23915 (cherry picked from commit `20c2d6210e`) Closes scylladb/scylladb#24050	2025-05-13 13:56:46 +03:00
Aleksandra Martyniuk	91a1acc314	test: test table drop during flush (cherry picked from commit `c1618c7de5`)	2025-05-09 09:27:08 +02:00
Aleksandra Martyniuk	720bd681f0	replica: skip flush of dropped table (cherry picked from commit `91b57e79f3`)	2025-05-09 09:26:44 +02:00
Piotr Dulikowski	b3bc3489dd	utils::loading_cache: gracefully skip timer if gate closed The loading_cache has a periodic timer which acquires the _timer_reads_gate. The stop() method first closes the gate and then cancels the timer - this order is necessary because the timer is re-armed under the gate. However, the timer callback does not check whether the gate was closed but tries to acquire it, which might result in unhandled exception which is logged with ERROR severity. Fix the timer callback by acquiring access to the gate at the beginning and gracefully returning if the gate is closed. Even though the gate used to be entered in the middle of the callback, it does not make sense to execute the timer's logic at all if the cache is being stopped. Fixes: scylladb/scylladb#23951 Closes scylladb/scylladb#23952 (cherry picked from commit `8ffe4b0308`) Closes scylladb/scylladb#23980	2025-05-06 10:19:32 +02:00
Botond Dénes	e4c6a3c068	Merge '[Backport 6.2] topology coordinator: do not proceed further on invalid boostrap tokens' from Scylladb[bot] In case when dht::boot_strapper::get_boostrap_tokens fail to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving boostrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data which implicitly expect that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897 From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them. - (cherry picked from commit `66acaa1bf8`) - (cherry picked from commit `845cedea7f`) - (cherry picked from commit `670a69007e`) Parent PR: #23914 Closes scylladb/scylladb#23948 * github.com:scylladb/scylladb: test: cluster: add test_bad_initial_token topology coordinator: do not proceed further on invalid boostrap tokens cdc: add sanity check for generating an empty generation scylla-6.2.4-candidate-20250502071756 scylla-6.2.4-candidate-20250502123017	2025-05-01 08:32:13 +03:00
Botond Dénes	254f535f63	Merge '[Backport 6.2] tasks: check whether a node is alive before rpc' from Scylladb[bot] Check whether a node is alive before making an rpc that gathers children infos from the whole cluster in virtual_task::impl::get_children. Fixes: https://github.com/scylladb/scylladb/issues/22514. Needs backport to 2025.1 and 6.2 as they contain the bug. - (cherry picked from commit `53e0f79947`) - (cherry picked from commit `e178bd7847`) Parent PR: #23787 Closes scylladb/scylladb#23942 * github.com:scylladb/scylladb: test: add test for getting tasks children tasks: check whether a node is alive before rpc	2025-05-01 08:30:09 +03:00
Avi Kivity	c20b2ea2af	Merge '[Backport 6.2] Ensure raft group0 RPCs use the gossip scheduling group.' from Scylladb[bot] Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore. Fixes scylladb/scylladb#21637 Backport: 6.2 and 6.1 - (cherry picked from commit `60f1053087`) - (cherry picked from commit `e05c082002`) Parent PR: #22779 Closes scylladb/scylladb#23769 * github.com:scylladb/scylladb: ensure raft group0 RPCs use the gossip scheduling group Move RAFT operations verbs to GOSSIP group.	2025-04-30 16:45:20 +03:00
Aleksandra Martyniuk	cc37c64467	test: add test for getting tasks children Add test that checks whether the children of a virtual task will be properly gathered if a node is down. (cherry picked from commit `e178bd7847`)	2025-04-30 10:40:32 +02:00
Aleksandra Martyniuk	47377cd5d4	tasks: check whether a node is alive before rpc Check whether a node is alive before making an rpc that gathers children infos from the whole cluster in virtual_task::impl::get_children. (cherry picked from commit `53e0f79947`)	2025-04-30 10:35:45 +02:00
Sergey Zolotukhin	5c90107c14	ensure raft group0 RPCs use the gossip scheduling group Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For Raft group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This commit adds a check to ensure that the raft group0 RPCs are executed with the `gossiper` scheduling group. (cherry picked from commit `e05c082002`)	2025-04-30 08:49:07 +02:00
Sergey Zolotukhin	612f184638	Move RAFT operations verbs to GOSSIP group. In order for RAFT operations to use the gossip system semaphore, moving RAFT verbs to the gossip group in `do_get_rpc_client_idx`, messaging_service. Fixes scylladb/scylladb21637 (cherry picked from commit `60f1053087`)	2025-04-29 19:25:55 +00:00
Piotr Dulikowski	d881f3f14d	test: cluster: add test_bad_initial_token Adds a test which checks that rollback works properly in case when a bad value of the initial_token function is provided. (cherry picked from commit `670a69007e`)	2025-04-28 17:07:23 +00:00
Piotr Dulikowski	7eda572129	topology coordinator: do not proceed further on invalid boostrap tokens In case when dht::boot_strapper::get_boostrap_tokens fail to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving boostrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data which implicitly expect that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897 (cherry picked from commit `845cedea7f`)	2025-04-28 17:07:23 +00:00
Piotr Dulikowski	41e6df7407	cdc: add sanity check for generating an empty generation It doesn't make sense to create an empty CDC generation because it does not make sense to have a cluster with no tokens. Add a sanity check to cdc::make_new_generation_description which fails if somebody attempts to do that (i.e. when the set of current tokens + optionally bootstrapping node's tokens is empty). The function does not work correctly if it is misused, as we saw in scylladb/scylladb#23897. While the function should not be misused in the first place, it's better to throw an exception rather than crash - especially that this crash could happen on the topology coordinator. (cherry picked from commit `66acaa1bf8`)	2025-04-28 17:07:22 +00:00
Tomasz Grabiec	b48d1abade	Merge '[Backport 6.2] Cache base info for view schemas in the schema registry' from Scylladb[bot] Currently, when we load a frozen schema into the registry, we lose the base info if the schema was of a view. Because of that, in various places we need to set the base info again, and in some codepaths we may miss it completely, which may make us unable to process some requests (for example, when executing reverse queries on views). Even after setting the base info, we may still lose it if the schema entry gets deactivated due to all `schema_ptr`s temporarily dying. To fix this, this patch adds the base schema to the registry, alongside the view schema. We store just the frozen base schema, so that we can transfer it across shards. With the base schema, we can now set the base info when returning the schema from the registry. As a result, we can now assume that all view schemas returned by the registry have base_info set. In this series we also make sure that the view schemas in the registry are kept up-to-date in regards to base schema changes. Fixes https://github.com/scylladb/scylladb/issues/21354 This issue is a bug, so adding backport labels 6.1 and 6.2 - (cherry picked from commit `6f11edbf3f`) - (cherry picked from commit `dfe3810f64`) - (cherry picked from commit `82f2e1b44c`) - (cherry picked from commit `3094ff7cbe`) - (cherry picked from commit `74cbc77f50`) Parent PR: #21862 Closes scylladb/scylladb#23046 * github.com:scylladb/scylladb: test: add test for schema registry maintaining base info for views schema_registry: avoid setting base info when getting the schema from registry schema_registry: update cached base schemas when updating a view schema_registry: cache base schemas for views db: set base info before adding schema to registry	2025-04-25 18:42:57 +02:00
Tomasz Grabiec	9190779ee3	Merge '[Backport 6.2] storage_service: preserve state of busy topology when transiting tablet' from Scylladb[bot] Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Unit test: Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Fixes https://github.com/scylladb/scylladb/issues/20073. Commit `876478b84f` was first released in scylla-6.0.0, so we might want to backport this patch accordingly. - (cherry picked from commit `e1186f0ae6`) - (cherry picked from commit `841ca652a0`) Parent PR: #23751 Closes scylladb/scylladb#23768 * github.com:scylladb/scylladb: storage_service: add unit test for mid-decommission transit_tablet() storage_service: preserve state of busy topology when transiting tablet	2025-04-20 20:13:11 +02:00
Avi Kivity	8db020e782	Merge '[Backport 6.2] managed_bytes: in the copy constructor, respect the target preferred allocation size' from Scylladb[bot] Commit `14bf09f447` added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer. But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too. But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes. In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator). In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments. Consequences of the bug: 1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2. 2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though). 3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory. There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew. But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments. If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation. Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072 Fixes https://github.com/scylladb/scylladb/issues/22941 Fixes https://github.com/scylladb/scylladb/issues/22389 Fixes https://github.com/scylladb/scylladb/issues/23781 This is a regression fix, should be backported to all affected releases. - (cherry picked from commit `4e2f62143b`) - (cherry picked from commit `6c1889f65c`) Parent PR: #23782 Closes scylladb/scylladb#23809 * github.com:scylladb/scylladb: managed_bytes_test: add a reproducer for #23781 managed_bytes: in the copy constructor, respect the target preferred allocation size	2025-04-19 18:43:31 +03:00
Calle Wilund	d566010412	network_topology_strategy/alter ks: Remove dc:s from options once rf=0 Fixes #22688 If we set a dc rf to zero, the options map will still retain a dc=0 entry. If this dc is decommissioned, any further alters of keyspace will fail, because the union of new/old options will now contained an unknown keyword. Change alter ks options processing to simply remove any dc with rf=0 on alter, and treat this as an implicit dc=0 in nw-topo strategy. This means we change the reallocate_tablets routine to not rely on the strategy objects dc mapping, but the full replica topology info for dc:s to consider for reallocation. Since we verify the input on attribute processing, the amount of rf/tablets moved should still be legal. v2: * Update docs as well. v3: * Simplify dc processing * Reintroduce options empty check, but do early in ks_prop_defs * Clean up unit test some Closes scylladb/scylladb#22693 (cherry picked from commit `342df0b1a8`) (Update: workaround python test objects not having dc info) Closes scylladb/scylladb#22876	2025-04-18 14:09:52 +03:00
Michał Chojnowski	ba3975858b	managed_bytes_test: add a reproducer for #23781 (cherry picked from commit `6c1889f65c`)	2025-04-18 07:55:23 +00:00
Michał Chojnowski	f75b0d49a0	managed_bytes: in the copy constructor, respect the target preferred allocation size Commit `14bf09f447` added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer. But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too. But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes. In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator). In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments. Consequences of the bug: 1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2. 2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though). 3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory. There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew. But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments. If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation. Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072 Fixes https://github.com/scylladb/scylladb/issues/22941 Fixes https://github.com/scylladb/scylladb/issues/22389 Fixes https://github.com/scylladb/scylladb/issues/23781 (cherry picked from commit `4e2f62143b`)	2025-04-18 07:55:23 +00:00
Botond Dénes	e1ce7c36a7	test/cluster/test_read_repair.py: increase read request timeout This test enables trace-level logging for the mutation_data logger, which seems to be too much in debug mode and the test read times out. Increase timeout to 1minute to avoid this. Fixes: #23513 Fixes: #23512 Closes scylladb/scylladb#23558 (cherry picked from commit `7bbfa5293f`) Closes scylladb/scylladb#23793	2025-04-18 06:34:25 +03:00
Laszlo Ersek	597030a821	storage_service: add unit test for mid-decommission transit_tablet() Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> (cherry picked from commit `841ca652a0`)	2025-04-17 09:09:43 +02:00
Laszlo Ersek	add4022d32	storage_service: preserve state of busy topology when transiting tablet Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> (cherry picked from commit `e1186f0ae6`)	2025-04-17 08:32:56 +02:00
Avi Kivity	b6f1cacfea	scylla-gdb: small-objects: fix for very small objects Because of rounding and alignment, there are multiple pools for small sizes (e.g. 4 for size 32). Because the pool selection algorithm ignores alignment, different pools can be chosen for different object sizes. For example, an object size of 29 will choose the first pool of size 32, while an object size of 32 will choose the fourth pool of size 32. The small-objects command doesn't know about this and always considers just the first pool for a given size. This causes it to miss out on sister pools. While it's possible to adjust pool selection to always choose one of the pools, it may eat a precious cycle. So instead let's compensate in the small-objects command. Instead of finding one pool for a given size, find all of them, and iterate over all those pools. Fixes #23603 Closes scylladb/scylladb#23604 (cherry picked from commit `b4d4e48381`) Closes scylladb/scylladb#23748	2025-04-16 14:36:28 +03:00
Botond Dénes	c1defce1de	Merge '[Backport 6.2] transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Scylladb[bot] A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause) can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver. However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to set it back then). This patch fixes this. Fixes #23173 The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases. - (cherry picked from commit `ca6bddef35`) - (cherry picked from commit `f7e1695068`) Parent PR: #23174 Closes scylladb/scylladb#23523 * github.com:scylladb/scylladb: CQL Tracing: set common query parameters in a single function transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing	2025-04-11 17:11:06 +03:00
Botond Dénes	61d1262674	Merge '[Backport 6.2] reader_concurrency_semaphore: register_inactive_read(): handle aborted permit' from Scylladb[bot] It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if permit timed out). If the permit also happens to have wait for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up. Fixes: scylladb/scylladb#22919 Bug is present in all live versions, backports are required. - (cherry picked from commit `4d8eb02b8d`) - (cherry picked from commit `7ba29ec46c`) Parent PR: #23044 Closes scylladb/scylladb#23144 * github.com:scylladb/scylladb: reader_concurrency_semaphore: register_inactive_read(): handle aborted permit test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()	2025-04-11 17:10:39 +03:00
Botond Dénes	ae54d4e886	Merge '[Backport 6.2] streaming: fix the way a reason of streaming failure is determined' from Scylladb[bot] During streaming receiving node gets and processes mutation fragments. If this operation fails, receiver responds with -1 status code, unless it failed due to no_such_column_family in which case streaming of this table should be skipped. However, when the table was dropped, an exception handler on receiver side may get not only data_dictionary::no_such_column_family, but also seastar::nested_exception of two no_such_column_family. Encountered example: ``` ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14)) ``` In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family> clause and gets handled the same as any other exception type. Replace try_catch clause with table_sync_and_check that synchronizes the schema and check if the table exists. Fixes: https://github.com/scylladb/scylladb/issues/22834. Needs backport to all live version, as they all contain the bug - (cherry picked from commit `876cf32e9d`) - (cherry picked from commit `faf3aa13db`) - (cherry picked from commit `44748d624d`) - (cherry picked from commit `35bc1fe276`) Parent PR: #22868 Closes scylladb/scylladb#23289 * github.com:scylladb/scylladb: streaming: fix the way a reason of streaming failure is determined streaming: save a continuation lambda streaming: use streaming namespace in table_check.{cc,hh} repair: streaming: move table_check.{cc,hh} to streaming	2025-04-11 15:13:14 +03:00
Alexander Turetskiy	e97e729973	Improve compation on read of expired tombstones compact expired tombstones in cache even if they are blocked by commitlog fixes #16781 Closes: #23033	2025-04-11 15:09:43 +03:00
Botond Dénes	20a1edd763	tools/scylla-nodetool: s/GetInt()/GetInt64()/ GetInt() was observed to fail when the integer JSON value overflows the int32_t type, which `GetInt()` uses for storage. When this happens, rapidjson will assign a distinct 64 bit integer type to the value, and attempting to access it as 32 bit integer triggers the wrong-type error, resulting in assert failure. This was hit on the field where invoking nodetool netstats resulted in nodetool crashing when the streamed bytes amounts were higher than maxint. To avoid such bugs in the future, replace all usage of GetInt() in nodetool of GetInt64(), just to be sure. A reproducer is added to the nodetool netstats crash. Fixes: scylladb/scylladb#23394 Closes scylladb/scylladb#23395 (cherry picked from commit `bd8973a025`) Closes scylladb/scylladb#23475	2025-04-11 15:00:47 +03:00

1 2 3 4 5 ...

44838 Commits