44838 Commits

Author SHA1 Message Date
Łukasz Paszkowski
98a6002a1a compaction_manager: cancel submission timer on drain
The `drain` method cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.

This, however, throws an assertion error, as the submission timer is
not cancelled and re-enabling the manager tries to arm the already-armed timer.

Thus, cancel the timer when calling the drain method to disable
the compaction manager.
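
The shape of the fix, as a minimal sketch assuming illustrative member names (not the actual compaction_manager code):

```cpp
// Minimal sketch, assuming illustrative member names; not the actual
// ScyllaDB compaction_manager code.
#include <seastar/core/timer.hh>
#include <seastar/core/future.hh>
#include <chrono>

class compaction_manager_sketch {
    seastar::timer<> _submission_timer;
    bool _enabled = false;

    seastar::future<> stop_ongoing_compactions(); // hypothetical helper
public:
    seastar::future<> drain() {
        _enabled = false;
        // The fix: disarm the timer here; otherwise a later enable()
        // calls arm() on an already-armed timer, which asserts.
        _submission_timer.cancel();
        return stop_ongoing_compactions();
    }

    void enable() {
        _enabled = true;
        _submission_timer.arm_periodic(std::chrono::minutes(1));
    }
};
```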

Fixes https://github.com/scylladb/scylladb/issues/24504

All versions are affected, so it's a good candidate for a backport.

Closes scylladb/scylladb#24505

(cherry picked from commit a9a53d9178)

Closes scylladb/scylladb#24585
2025-06-29 14:40:43 +03:00
Avi Kivity
9f1ed14f9d Merge '[Backport 6.2] cql: create default superuser if it doesn't exist' from Marcin Maliszkiewicz
Backport of https://github.com/scylladb/scylladb/pull/20137 is needed for another backport https://github.com/scylladb/scylladb/pull/24690 to work correctly.

Fixes https://github.com/scylladb/scylladb/issues/24712

Closes scylladb/scylladb#24711

* github.com:scylladb/scylladb:
  test: test_restart_cluster: create the test
  auth: standard_role_manager allows awaiting superuser creation
  auth: coroutinize the standard_role_manager start() function
  auth: don't start server until the superuser is created
2025-06-29 14:34:55 +03:00
Aleksandra Martyniuk
8321747451 test: rest_api: fix test_repair_task_progress
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.

Wait until at least one child of a root repair task is created.

Fixes: #24556.

Closes scylladb/scylladb#24560

(cherry picked from commit 0deb9209a0)

Closes scylladb/scylladb#24652
2025-06-28 09:40:37 +03:00
Paweł Zakrzewski
3bd3d720e0 test: test_restart_cluster: create the test
The purpose of this test is to verify that the cluster is able to boot up
again after a full cluster shutdown, thus exhibiting no issues when
connecting to a raft group 0 that is larger than one node.

(cherry picked from commit 900a6706b8)
2025-06-27 17:50:15 +02:00
Paweł Zakrzewski
3d6d3484b5 auth: standard_role_manager allows awaiting superuser creation
This change implements the ability to await superuser creation in the
function ensure_superuser_is_created(). This means that Scylla will not
be serving CQL connections until the superuser is created.

Fixes #10481

(cherry picked from commit 7008b71acc)
2025-06-27 17:50:08 +02:00
Paweł Zakrzewski
db1d3cc342 auth: coroutinize the standard_role_manager start() function
This change is a preparation for the next change. Moving to coroutines
makes the code more readable and easier to process.

(cherry picked from commit 04fc82620b)
2025-06-27 17:50:01 +02:00
Paweł Zakrzewski
66301eb2b6 auth: don't start server until the superuser is created
This change reorganizes the way standard_role_manager startup is
handled: now the future returned by its start() function can be used to
determine when startup has finished. We use this future to ensure the
startup is finished prior to starting the CQL server.

Some clusters are created without auth, and auth is added later. The
first node to recognize that auth is needed must create the superuser.
Currently this is always on restart, but if we were to ever make it
LiveUpdate then it would not be on restart.

This suggests that we don't really need to wait during restart.

This is a preparatory commit, laying ground for implementation of a
start() function that waits for the superuser to be created. The default
implementation returns a ready future, which makes no change in the code
behavior.
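
As a hedged illustration of the pattern (helper names are assumptions), the future returned by start() becomes a synchronization point for the CQL server:

```cpp
// Sketch only: the default start() returns a ready future, so behavior
// is unchanged until an implementation waits for superuser creation.
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>

struct role_manager_sketch {
    virtual seastar::future<> start() {
        return seastar::make_ready_future<>(); // default: ready future
    }
    virtual ~role_manager_sketch() = default;
};

seastar::future<> start_cql_server(); // hypothetical helper

seastar::future<> start_services(role_manager_sketch& rm) {
    co_await rm.start();        // superuser creation finished past here
    co_await start_cql_server();
}
```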

(cherry picked from commit f525d4b0c1)
2025-06-27 17:49:33 +02:00
Pavel Emelyanov
e5ac2285c0 Merge '[Backport 6.2] memtable: ensure _flushed_memory doesn't grow above total_memory' from Scylladb[bot]
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus an upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should not be greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emits rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
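
In rough pseudo-C++, the accounting described above looks like this (a sketch with made-up names, not the real memtable code):

```cpp
// Hedged sketch of the accounting; field names are made up.
#include <algorithm>
#include <cstdint>

struct flush_accounting_sketch {
    int64_t reader_estimate = 0;             // flush_reader's estimate
    int64_t total_decrease_since_flush = 0;  // from LSA dealloc notifications

    void on_segment_deallocated(int64_t bytes) {
        total_decrease_since_flush += bytes; // tracked at the source
    }

    // _flushed_memory: the reader's estimate, reduced by the decrease in
    // total_memory since the flush began (if that change was positive).
    int64_t flushed_memory() const {
        return reader_estimate - std::max<int64_t>(0, total_decrease_since_flush);
    }

    // The value reported to dirty_memory_manager is clamped at zero.
    int64_t reported() const {
        return std::max<int64_t>(0, flushed_memory());
    }
};
```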

Fixes scylladb/scylladb#21413

Backport candidate because it fixes a crash that can happen in existing stable branches.

- (cherry picked from commit 7d551f99be)

- (cherry picked from commit 975e7e405a)

Parent PR: #21638

Closes scylladb/scylladb#24601

* github.com:scylladb/scylladb:
  memtable: ensure _flushed_memory doesn't grow above total memory usage
  replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
2025-06-24 10:12:41 +03:00
Michał Chojnowski
1c23edad22 memtable: ensure _flushed_memory doesn't grow above total memory usage
dirty_memory_manager tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus an upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

dirty_memory_manager controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should not be greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emits rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the mutation_cleaner later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in mutation_cleaner intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

(cherry picked from commit 975e7e405a)
2025-06-22 17:37:27 +00:00
Michał Chojnowski
40d1186218 replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts for even outdated MVCC versions, which can be deleted and freed
during the flush).

Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.

But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.

But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.

To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want only the memtable to listen
for the notifications and pass them, already modified accordingly,
to the manager, so there are no intermediate wrong states.

This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.
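
The direction of the refactoring, sketched against an illustrative listener interface (the real logalloc::region_listener methods may differ):

```cpp
// Sketch only: the memtable is the region listener and forwards
// already-corrected numbers downstream, so the manager never observes
// an intermediate wrong state. Method names are assumptions.
#include <algorithm>
#include <cstdint>

struct region_listener_sketch {
    virtual void decrease_usage(int64_t delta) = 0;
    virtual ~region_listener_sketch() = default;
};

class memtable_sketch final : public region_listener_sketch {
    region_listener_sketch& _manager;   // dirty_memory_manager stand-in
    int64_t _flushed_memory = 0;
public:
    explicit memtable_sketch(region_listener_sketch& m) : _manager(m) {}

    void decrease_usage(int64_t delta) override {
        // Correct our own flush accounting first...
        _flushed_memory = std::max<int64_t>(0, _flushed_memory - delta);
        // ...then pass the notification on with consistent state.
        _manager.decrease_usage(delta);
    }
};
```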

(cherry picked from commit 7d551f99be)
2025-06-22 17:37:27 +00:00
Pavel Emelyanov
60cb1c0b2f sstable_directory: Print ks.cf when moving unshared remote sstables
When an sstable is identified by sstable_directory as remote-unshared,
it will at some point be moved to the target shard. When it happens a
log-message appears:

    sstable_directory - Moving 1 unshared SSTables to shard 1

Processing of tables by sstable_directory often happens in parallel, and
messages from sstable_directory are intermixed. Having a message like
the above is not very informative, as it tells nothing about the sstables
that are being moved.

Equip the message with the ks:cf pair to make it more informative.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23912

(cherry picked from commit d40d6801b0)

Closes scylladb/scylladb#24014
2025-06-17 18:24:13 +03:00
Pavel Emelyanov
a00b4a027e Update seastar submodule (no nested stall backtraces)
* seastar e40388c4...b383afb0 (1):
  > stall_detector: no backtrace if exception

Fixes #24464

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24536
2025-06-17 18:23:06 +03:00
Nikos Dragazis
8ef55444f5 sstables: Fix race when loading checksum component
`read_checksum()` loads the checksum component from disk and stores a
non-owning reference in the shareable components. To avoid loading the
same component twice, the function has an early return statement.
However, this does not guarantee atomicity - two fibers or threads may
load the component and update the shareable components concurrently.
This can lead to use-after-free situations when accessing the component
through the shareable components, since the reference stored there is
non-owning. This can happen when multiple compaction tasks run on the
same SSTable (e.g., regular compaction and scrub-validate).

Fix this by not updating the reference in shareable components if a
reference is already in place. Instead, create an owning reference to
the existing component for the current fiber. This is less efficient
than using a mutex, since the component may be loaded multiple times
from disk before noticing the race, but no locks are used for any other
SSTable component either. Also, this only affects uncompressed SSTables,
which are not that common.
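
A hedged sketch of the approach (types and helpers here are illustrative, not the sstables API):

```cpp
// Sketch: the first fiber publishes the non-owning reference; a racing
// fiber keeps its own owning copy and leaves the published one alone.
#include <memory>
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>

struct checksum { /* component data */ };

seastar::future<std::shared_ptr<checksum>> load_checksum_from_disk(); // hypothetical

struct shareable_components_sketch {
    checksum* cks = nullptr; // non-owning reference
};

seastar::future<std::shared_ptr<checksum>>
read_checksum(shareable_components_sketch& sc) {
    auto owned = co_await load_checksum_from_disk();
    if (!sc.cks) {
        sc.cks = owned.get(); // publish exactly once
    }
    // On a race we may have loaded the component twice, but we never
    // overwrite the published reference, so no use-after-free.
    co_return owned;
}
```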

Fixes #23728.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#23872

(cherry picked from commit eaa2ce1bb5)

Closes scylladb/scylladb#24267
2025-06-17 13:20:30 +03:00
Szymon Malewski
8eebc97ae1 mapreduce_service: Prevent race condition
In parallelized aggregation functions, the super-coordinator (the node performing the final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`).
Usually responses are spread over time and the actual merging is atomic.
However, sometimes partial results are received at around the same time, and if an aggregate function (e.g. a Lua script) yields, two coroutines can try to overwrite the same accumulator one after another,
which leads to losing some of the results.
To prevent this, in this patch each coroutine stores merging results in its own context and overwrites the accumulator atomically, only after it has been fully merged.
Compared to the previous implementation, the order of operands in the merging function is swapped, but the order of aggregation is not guaranteed anyway.
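
A sketch of the pattern the commit describes (illustrative types; the real merge runs the aggregate function, e.g. a Lua script):

```cpp
// Sketch: merge into this coroutine's own context, then publish to the
// shared accumulator in one non-suspending assignment.
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <vector>

struct result_sketch { long value = 0; };

seastar::future<result_sketch> fetch_partial(int node);              // hypothetical
seastar::future<result_sketch> merge(result_sketch, result_sketch);  // may yield

seastar::future<result_sketch> gather(std::vector<int> nodes) {
    result_sketch acc{};
    co_await seastar::coroutine::parallel_for_each(nodes, [&] (int n) -> seastar::future<> {
        result_sketch partial = co_await fetch_partial(n);
        // The merge may suspend, so it works on local state only...
        result_sketch merged = co_await merge(std::move(partial), acc);
        // ...and the accumulator is overwritten in a single step.
        acc = std::move(merged);
    });
    co_return acc;
}
```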

Fixes #20662

Closes scylladb/scylladb#24106

(cherry picked from commit 5969809607)

Closes scylladb/scylladb#24387
2025-06-17 13:20:12 +03:00
Pavel Emelyanov
884293a382 Merge '[Backport 6.2] tablets: deallocate storage state on end_migration' from Scylladb[bot]
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

The case is similar for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters the cleanup stage.
2. The tablet is cleaned up successfully.
3. The leaving replica is restarted, and allocates a storage group.
4. Tablet cleanup is not called because it's already cleaned up.
5. The storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes https://github.com/scylladb/scylladb/issues/23481

backport to all relevant releases since it's a bug that results in a crash

- (cherry picked from commit 34f15ca871)

- (cherry picked from commit fb18fc0505)

- (cherry picked from commit bd88ca92c8)

Parent PR: #24393

Closes scylladb/scylladb#24486

* github.com:scylladb/scylladb:
  test/cluster/test_tablets: test restart during tablet cleanup
  test: tablets: add get_tablet_info helper
  tablets: deallocate storage state on end_migration
2025-06-17 13:19:49 +03:00
Michael Litvak
06e5a48d17 test/cluster/test_tablets: test restart during tablet cleanup
Add a test that reproduces issue scylladb/scylladb#23481.

The test migrates a tablet from one node to another, and while the
tablet is in some stage of cleanup - either before or right after,
depending on the parameter - the leaving replica, on which the tablet is
cleaned, is restarted.

This is interesting because when the leaving replica starts and loads
its state, the tablet could be in different stages of cleanup - the
SSTables may still exist or they may have been cleaned up already, and
we want to make sure the state is loaded correctly.

(cherry picked from commit bd88ca92c8)
2025-06-12 14:33:07 +03:00
Michael Litvak
c5d11205c1 test: tablets: add get_tablet_info helper
Add a helper for tests to get the tablet info from system.tablets for a
tablet owning a given token.

(cherry picked from commit fb18fc0505)
2025-06-12 03:24:00 +00:00
Michael Litvak
0b46cfb60d tablets: deallocate storage state on end_migration
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

The case is similar for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters the cleanup stage.
2. The tablet is cleaned up successfully.
3. The leaving replica is restarted, and allocates a storage group.
4. Tablet cleanup is not called because it was already cleaned up.
5. The storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes scylladb/scylladb#23481

(cherry picked from commit 34f15ca871)
2025-06-12 03:24:00 +00:00
Michał Chojnowski
4a18a08284 utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()
`chunked_managed_vector` is a vector-like container which splits
its contents into multiple contiguous allocations if necessary,
in order to fit within LSA's max preferred contiguous allocation
limits.

Each limited-size chunk is stored in a `managed_vector`.
`managed_vector` is unaware of LSA's size limits.
It's up to the user of `managed_vector` to pick a size which
is small enough.

This happens in `chunked_managed_vector::max_chunk_capacity()`.
But the calculation is wrong, because it doesn't account for
the fact that `managed_vector` has to place some metadata
(the backreference pointer) inside the allocation.
In effect, the chunks allocated by `chunked_managed_vector`
are just a tiny bit larger than the limit, and the limit is violated.

Fix this by accounting for the metadata.

Also, before the patch, `chunked_managed_vector::max_contiguous_allocation`
repeats the definition of `logalloc::max_managed_object_size`.
This is begging for a bug if `logalloc::max_managed_object_size`
changes one day. Adjust it so that `chunked_managed_vector` looks
directly at `logalloc::max_managed_object_size`, as it means to.
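
The gist of the corrected calculation, as a sketch with stand-in constants (the overhead term is an assumption about managed_vector's layout):

```cpp
// Stand-in constants; the real values come from logalloc and
// managed_vector's actual layout.
#include <cstddef>

constexpr size_t max_managed_object_size = 128 * 1024;    // logalloc's limit
constexpr size_t managed_vector_overhead = sizeof(void*); // backref pointer
                                                          // (assumption)

template <typename T>
constexpr size_t max_chunk_capacity() {
    // Before the fix: max_managed_object_size / sizeof(T), which left no
    // room for the metadata, so allocations slightly exceeded the limit.
    return (max_managed_object_size - managed_vector_overhead) / sizeof(T);
}
```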

Fixes scylladb/scylladb#23854

(cherry picked from commit 7f9152babc)

Closes scylladb/scylladb#24369
2025-06-03 18:10:36 +03:00
Piotr Dulikowski
1571948cd7 topology_coordinator: silence ERROR messages on abort
When the topology coordinator is shut down while doing a long-running
operation, the current operation might throw a raft::request_aborted
exception. This is not a critical issue and should not be logged with
ERROR verbosity level.

Make sure that all the try..catch blocks in the topology coordinator
which:

- May try to acquire a new group0 guard in the `try` part
- Have a `catch (...)` block that prints an ERROR-level message

...have a pass-through `catch (raft::request_aborted&)` block which does
not log the exception.
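
Illustratively, the resulting handler shape (names are stand-ins, not the coordinator's code):

```cpp
// Sketch of the catch ordering; helper and logger names are stand-ins.
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/util/log.hh>
#include <exception>

namespace raft { struct request_aborted : std::exception {}; } // stand-in

static seastar::logger tlogger("topology");

seastar::future<> do_long_running_step(); // hypothetical

seastar::future<> run_step() {
    try {
        co_await do_long_running_step(); // may acquire a new group0 guard
    } catch (raft::request_aborted&) {
        // Expected during shutdown: propagate quietly, no ERROR log.
        throw;
    } catch (...) {
        tlogger.error("operation failed: {}", std::current_exception());
        throw;
    }
}
```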

Fixes: scylladb/scylladb#22649

Closes scylladb/scylladb#23962

(cherry picked from commit 156ff8798b)

Closes scylladb/scylladb#24074
2025-05-13 20:41:41 +03:00
Pavel Emelyanov
5a82f0d217 Merge '[Backport 6.2] replica: skip flush of dropped table' from Scylladb[bot]
Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of the dropped table instead.

Fixes: #16095.

Needs backport to 2025.1 and 6.2 as they contain the bug

- (cherry picked from commit 91b57e79f3)

- (cherry picked from commit c1618c7de5)

Parent PR: #23876

Closes scylladb/scylladb#23904

* github.com:scylladb/scylladb:
  test: test table drop during flush
  replica: skip flush of dropped table
2025-05-13 14:00:38 +03:00
Aleksandra Martyniuk
874b4f8d9c streaming: skip dropped tables
Currently, stream_session::prepare throws when a table in requests
or summaries is dropped. However, we do not want to fail streaming
if the table is dropped.

Delete table checks from stream_session::prepare. Further streaming
steps can handle the dropped table and finish the streaming successfully.

Fixes: #15257.

Closes scylladb/scylladb#23915

(cherry picked from commit 20c2d6210e)

Closes scylladb/scylladb#24050
2025-05-13 13:56:46 +03:00
Aleksandra Martyniuk
91a1acc314 test: test table drop during flush
(cherry picked from commit c1618c7de5)
2025-05-09 09:27:08 +02:00
Aleksandra Martyniuk
720bd681f0 replica: skip flush of dropped table
(cherry picked from commit 91b57e79f3)
2025-05-09 09:26:44 +02:00
Piotr Dulikowski
b3bc3489dd utils::loading_cache: gracefully skip timer if gate closed
The loading_cache has a periodic timer which acquires the
_timer_reads_gate. The stop() method first closes the gate and then
cancels the timer - this order is necessary because the timer is
re-armed under the gate. However, the timer callback does not check
whether the gate was closed but tries to acquire it, which might result
in unhandled exception which is logged with ERROR severity.

Fix the timer callback by acquiring access to the gate at the beginning
and gracefully returning if the gate is closed. Even though the gate
used to be entered in the middle of the callback, it does not make sense
to execute the timer's logic at all if the cache is being stopped.

Fixes: scylladb/scylladb#23951

Closes scylladb/scylladb#23952

(cherry picked from commit 8ffe4b0308)

Closes scylladb/scylladb#23980
2025-05-06 10:19:32 +02:00
Botond Dénes
e4c6a3c068 Merge '[Backport 6.2] topology coordinator: do not proceed further on invalid bootstrap tokens' from Scylladb[bot]
In the case when dht::boot_strapper::get_bootstrap_tokens fails to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving bootstrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data, which
implicitly expects that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.
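
Schematically (a fragment with made-up helpers, not the actual coordinator loop):

```cpp
// Sketch: on a parsing failure, schedule the rollback and leave the
// loop rather than continuing with empty bootstrap tokens.
while (co_await has_topology_work()) {              // hypothetical loop
    try {
        bootstrap_tokens = parse_bootstrap_tokens(cfg); // may throw
    } catch (...) {
        co_await schedule_rollback(std::current_exception());
        break; // the missing break
    }
    co_await prepare_and_broadcast_cdc_generation_data(bootstrap_tokens);
}
```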

Fixes: scylladb/scylladb#23897

From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them.

- (cherry picked from commit 66acaa1bf8)

- (cherry picked from commit 845cedea7f)

- (cherry picked from commit 670a69007e)

Parent PR: #23914

Closes scylladb/scylladb#23948

* github.com:scylladb/scylladb:
  test: cluster: add test_bad_initial_token
  topology coordinator: do not proceed further on invalid bootstrap tokens
  cdc: add sanity check for generating an empty generation
scylla-6.2.4-candidate-20250502071756 scylla-6.2.4-candidate-20250502123017
2025-05-01 08:32:13 +03:00
Botond Dénes
254f535f63 Merge '[Backport 6.2] tasks: check whether a node is alive before rpc' from Scylladb[bot]
Check whether a node is alive before making an rpc that gathers children
info from the whole cluster in virtual_task::impl::get_children.

Fixes: https://github.com/scylladb/scylladb/issues/22514.

Needs backport to 2025.1 and 6.2 as they contain the bug.

- (cherry picked from commit 53e0f79947)

- (cherry picked from commit e178bd7847)

Parent PR: #23787

Closes scylladb/scylladb#23942

* github.com:scylladb/scylladb:
  test: add test for getting tasks children
  tasks: check whether a node is alive before rpc
2025-05-01 08:30:09 +03:00
Avi Kivity
c20b2ea2af Merge '[Backport 6.2] Ensure raft group0 RPCs use the gossip scheduling group.' from Scylladb[bot]
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.

For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.

Fixes scylladb/scylladb#21637

Backport: 6.2 and 6.1

- (cherry picked from commit 60f1053087)

- (cherry picked from commit e05c082002)

Parent PR: #22779

Closes scylladb/scylladb#23769

* github.com:scylladb/scylladb:
  ensure raft group0 RPCs use the gossip scheduling group
  Move RAFT operations verbs to GOSSIP group.
2025-04-30 16:45:20 +03:00
Aleksandra Martyniuk
cc37c64467 test: add test for getting tasks children
Add a test that checks whether the children of a virtual task will be
properly gathered if a node is down.

(cherry picked from commit e178bd7847)
2025-04-30 10:40:32 +02:00
Aleksandra Martyniuk
47377cd5d4 tasks: check whether a node is alive before rpc
Check whether a node is alive before making an rpc that gathers children
info from the whole cluster in virtual_task::impl::get_children.

(cherry picked from commit 53e0f79947)
2025-04-30 10:35:45 +02:00
Sergey Zolotukhin
5c90107c14 ensure raft group0 RPCs use the gossip scheduling group
Scylla operations use concurrency semaphores to limit the number
of concurrent operations and prevent resource exhaustion. The
semaphore is selected based on the current scheduling group.
For Raft group operations, it is essential to use a system semaphore to
avoid queuing behind user operations.
This commit adds a check to ensure that the raft group0 RPCs are
executed with the `gossiper` scheduling group.

(cherry picked from commit e05c082002)
2025-04-30 08:49:07 +02:00
Sergey Zolotukhin
612f184638 Move RAFT operations verbs to GOSSIP group.
In order for RAFT operations to use the gossip system semaphore, move RAFT
verbs to the gossip group in `do_get_rpc_client_idx` in messaging_service.
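
Illustratively, the mapping looks like this (only a subset of verbs shown; index names are assumptions):

```cpp
// Sketch of the verb-to-client-index routing; constants are stand-ins.
unsigned do_get_rpc_client_idx(messaging_verb verb) {
    switch (verb) {
    // Raft group 0 verbs share the gossip client index, so their RPC
    // handlers run in the gossip scheduling group.
    case messaging_verb::RAFT_APPEND_ENTRIES:
    case messaging_verb::RAFT_VOTE_REQUEST:
    case messaging_verb::GOSSIP_DIGEST_SYN:
        return gossip_client_idx;  // hypothetical constant
    default:
        return default_client_idx; // hypothetical constant
    }
}
```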

Fixes scylladb/scylladb#21637

(cherry picked from commit 60f1053087)
2025-04-29 19:25:55 +00:00
Piotr Dulikowski
d881f3f14d test: cluster: add test_bad_initial_token
Adds a test which checks that rollback works properly when a bad
value of the initial_token function is provided.

(cherry picked from commit 670a69007e)
2025-04-28 17:07:23 +00:00
Piotr Dulikowski
7eda572129 topology coordinator: do not proceed further on invalid bootstrap tokens
In the case when dht::boot_strapper::get_bootstrap_tokens fails to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving bootstrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data, which
implicitly expects that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897
(cherry picked from commit 845cedea7f)
2025-04-28 17:07:23 +00:00
Piotr Dulikowski
41e6df7407 cdc: add sanity check for generating an empty generation
It doesn't make sense to create an empty CDC generation because it does
not make sense to have a cluster with no tokens. Add a sanity check to
cdc::make_new_generation_description which fails if somebody attempts to
do that (i.e. when the set of current tokens + optionally bootstrapping
node's tokens is empty).

The function does not work correctly if it is misused, as we saw in
scylladb/scylladb#23897. While the function should not be misused in the
first place, it's better to throw an exception rather than crash -
especially since this crash could happen on the topology coordinator.

(cherry picked from commit 66acaa1bf8)
2025-04-28 17:07:22 +00:00
Tomasz Grabiec
b48d1abade Merge '[Backport 6.2] Cache base info for view schemas in the schema registry' from Scylladb[bot]
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated due to all `schema_ptr`s temporarily dying.

To fix this, this patch adds the base schema to the registry, alongside
the view schema. We store just the frozen base schema, so that we can
transfer it across shards. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.

In this series we also make sure that the view schemas in the registry are
kept up-to-date with regard to base schema changes.

Fixes https://github.com/scylladb/scylladb/issues/21354

This issue is a bug, so adding backport labels 6.1 and 6.2

- (cherry picked from commit 6f11edbf3f)

- (cherry picked from commit dfe3810f64)

- (cherry picked from commit 82f2e1b44c)

- (cherry picked from commit 3094ff7cbe)

- (cherry picked from commit 74cbc77f50)

Parent PR: #21862

Closes scylladb/scylladb#23046

* github.com:scylladb/scylladb:
  test: add test for schema registry maintaining base info for views
  schema_registry: avoid setting base info when getting the schema from registry
  schema_registry: update cached base schemas when updating a view
  schema_registry: cache base schemas for views
  db: set base info before adding schema to registry
2025-04-25 18:42:57 +02:00
Tomasz Grabiec
9190779ee3 Merge '[Backport 6.2] storage_service: preserve state of busy topology when transiting tablet' from Scylladb[bot]
Commit 876478b84f ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.

In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way.

Unit test:

Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone.

Fixes https://github.com/scylladb/scylladb/issues/20073.

Commit 876478b84f was first released in scylla-6.0.0, so we might want to backport this patch accordingly.

- (cherry picked from commit e1186f0ae6)

- (cherry picked from commit 841ca652a0)

Parent PR: #23751

Closes scylladb/scylladb#23768

* github.com:scylladb/scylladb:
  storage_service: add unit test for mid-decommission transit_tablet()
  storage_service: preserve state of busy topology when transiting tablet
2025-04-20 20:13:11 +02:00
Avi Kivity
8db020e782 Merge '[Backport 6.2] managed_bytes: in the copy constructor, respect the target preferred allocation size' from Scylladb[bot]
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).

In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.

2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

This is a regression fix, should be backported to all affected releases.

- (cherry picked from commit 4e2f62143b)

- (cherry picked from commit 6c1889f65c)

Parent PR: #23782

Closes scylladb/scylladb#23809

* github.com:scylladb/scylladb:
  managed_bytes_test: add a reproducer for #23781
  managed_bytes: in the copy constructor, respect the target preferred allocation size
2025-04-19 18:43:31 +03:00
Calle Wilund
d566010412 network_topology_strategy/alter ks: Remove dc:s from options once rf=0
Fixes #22688

If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of the keyspace will fail,
because the union of new/old options will now contain an unknown keyword.

Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit dc=0 in nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy object's dc mapping, but on the full replica topology info
for dc:s to consider for reallocation. Since we verify the input
on attribute processing, the amount of rf/tablets moved should still
be legal.

v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some

Closes scylladb/scylladb#22693

(cherry picked from commit 342df0b1a8)

(Update: work around python test objects not having dc info)

Closes scylladb/scylladb#22876
2025-04-18 14:09:52 +03:00
Michał Chojnowski
ba3975858b managed_bytes_test: add a reproducer for #23781
(cherry picked from commit 6c1889f65c)
2025-04-18 07:55:23 +00:00
Michał Chojnowski
f75b0d49a0 managed_bytes: in the copy constructor, respect the target preferred allocation size
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).

In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.
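
A hedged sketch of the corrected copy constructor (helper names are made up):

```cpp
// Sketch: decide the layout from the *current* allocator's preferred
// size, not from the source object's layout. Helpers are stand-ins.
managed_bytes::managed_bytes(const managed_bytes& o) {
    size_t max_chunk = current_allocator().preferred_max_contiguous_allocation();
    if (o.size() <= max_chunk) {
        init_single_chunk(o);       // fits the target allocator's limit
    } else {
        init_chunked(o, max_chunk); // split into compliant chunks
    }
}
```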

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest
   power of 2.

2. With a pathological-enough allocation pattern
   (for example, one which somehow ends up placing a single 16 kiB
   memtable-owned allocation in every aligned 128 kiB span),
   memtable flushing could theoretically deadlock,
   because the allocator might be too fragmented to let the memtable
   grow by another 128 kiB segment, while keeping the sum of all
   allocations small enough to avoid triggering a flush.
   (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious
   allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether
   reclamation succeeded by checking that the number of free LSA
   segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong
   because the reclaim might have met its target by evicting the non-LSA
   allocations, in which case memory is returned directly to the
   standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing`
   to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

(cherry picked from commit 4e2f62143b)
2025-04-18 07:55:23 +00:00
Botond Dénes
e1ce7c36a7 test/cluster/test_read_repair.py: increase read request timeout
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode, and the test read times out.
Increase the timeout to 1 minute to avoid this.

Fixes: #23513
Fixes: #23512

Closes scylladb/scylladb#23558

(cherry picked from commit 7bbfa5293f)

Closes scylladb/scylladb#23793
2025-04-18 06:34:25 +03:00
Laszlo Ersek
597030a821 storage_service: add unit test for mid-decommission transit_tablet()
Start a two node cluster. Create a single tablet on one of the nodes.
Start decommissioning that node, but block decommissioning at once. In
that state (i.e., in "tablet_draining"), move the tablet manually to the
other node. Check that transit_tablet() leaves the topology transition
state alone.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 841ca652a0)
2025-04-17 09:09:43 +02:00
Laszlo Ersek
add4022d32 storage_service: preserve state of busy topology when transiting tablet
Commit 876478b84f ("storage_service: allow concurrent tablet migration
in tablets/move API", 2024-02-08) introduced a code path on which the
topology state machine would be busy -- in "tablet_draining" or
"tablet_migration" state -- at the time of starting tablet migration. The
pre-commit code would unconditionally transition the topology to
"tablet_migration" state, assuming the topology had been idle previously.
On the new code path, this state change would be idempotent if the
topology state machine had been busy in "tablet_migration", but the state
change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.

In addition, add the topology update to the "updates" vector with plain
push_back(). emplace_back() is not helpful here, as
topology_mutation_builder::build() cannot construct in-place, and so we
invoke the "canonical_mutation" move constructor once, either way.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit e1186f0ae6)
2025-04-17 08:32:56 +02:00
Avi Kivity
b6f1cacfea scylla-gdb: small-objects: fix for very small objects
Because of rounding and alignment, there are multiple pools for small
sizes (e.g. 4 for size 32). Because the pool selection algorithm
ignores alignment, different pools can be chosen for different object
sizes. For example, an object size of 29 will choose the first pool
of size 32, while an object size of 32 will choose the fourth pool of
size 32.

The small-objects command doesn't know about this and always considers
just the first pool for a given size. This causes it to miss out on
sister pools.

While it's possible to adjust pool selection to always choose one of the
pools, it may eat a precious cycle. So instead let's compensate in the
small-objects command. Instead of finding one pool for a given size,
find all of them, and iterate over all those pools.

Fixes #23603

Closes scylladb/scylladb#23604

(cherry picked from commit b4d4e48381)

Closes scylladb/scylladb#23748
2025-04-16 14:36:28 +03:00
Botond Dénes
c1defce1de Merge '[Backport 6.2] transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Scylladb[bot]
A default timestamp (not to be confused with the timestamp passed via the 'USING TIMESTAMP' query clause) can be set using the 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of the Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for EXECUTE and BATCH traces (I guess I simply forgot to set it back then).

This patch fixes this.

Fixes #23173

The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases.

- (cherry picked from commit ca6bddef35)

- (cherry picked from commit f7e1695068)

Parent PR: #23174

Closes scylladb/scylladb#23523

* github.com:scylladb/scylladb:
  CQL Tracing: set common query parameters in a single function
  transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
2025-04-11 17:11:06 +03:00
Botond Dénes
61d1262674 Merge '[Backport 6.2] reader_concurrency_semaphore: register_inactive_read(): handle aborted permit' from Scylladb[bot]
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if the permit timed out). If the permit also happens to be waiting for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already hold an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as an immediate eviction: close the reader and clean up.
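
A sketch of the added case (illustrative signatures, not the actual semaphore code):

```cpp
// Sketch: an already-aborted permit is handled as immediate eviction.
void register_inactive_read(flat_reader r, reader_permit p) {
    if (p.aborted()) {
        // The permit's promise already holds an exception (it timed
        // out), so set_exception() again would assert. Close instead.
        close_reader(std::move(r));
        return;
    }
    // ... normal inactive-read registration path ...
}
```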

Fixes: scylladb/scylladb#22919

Bug is present in all live versions, backports are required.

- (cherry picked from commit 4d8eb02b8d)

- (cherry picked from commit 7ba29ec46c)

Parent PR: #23044

Closes scylladb/scylladb#23144

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
  test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
2025-04-11 17:10:39 +03:00
Botond Dénes
ae54d4e886 Merge '[Backport 6.2] streaming: fix the way a reason of streaming failure is determined' from Scylladb[bot]
During streaming, the receiving node gets and processes mutation fragments.
If this operation fails, the receiver responds with a -1 status code, unless
it failed due to no_such_column_family, in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on the receiver
side may get not only data_dictionary::no_such_column_family, but also a
seastar::nested_exception wrapping two no_such_column_family exceptions.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace the try_catch clause with table_sync_and_check, which synchronizes
the schema and checks if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.

Needs backport to all live version, as they all contain the bug

- (cherry picked from commit 876cf32e9d)

- (cherry picked from commit faf3aa13db)

- (cherry picked from commit 44748d624d)

- (cherry picked from commit 35bc1fe276)

Parent PR: #22868

Closes scylladb/scylladb#23289

* github.com:scylladb/scylladb:
  streaming: fix the way a reason of streaming failure is determined
  streaming: save a continuation lambda
  streaming: use streaming namespace in table_check.{cc,hh}
  repair: streaming: move table_check.{cc,hh} to streaming
2025-04-11 15:13:14 +03:00
Alexander Turetskiy
e97e729973 Improve compaction on read of expired tombstones
Compact expired tombstones in cache even if they are blocked by the
commitlog.

fixes #16781

Closes: #23033
2025-04-11 15:09:43 +03:00
Botond Dénes
20a1edd763 tools/scylla-nodetool: s/GetInt()/GetInt64()/
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type, which `GetInt()` uses for storage. When this happens,
rapidjson will assign a distinct 64 bit integer type to the value, and
attempting to access it as a 32 bit integer triggers the wrong-type error,
resulting in an assert failure. This was hit in the field, where invoking
nodetool netstats resulted in nodetool crashing when the streamed bytes
amounts were higher than maxint.

To avoid such bugs in the future, replace all usage of GetInt() in
nodetool with GetInt64(), just to be sure.

A reproducer is added to the nodetool netstats crash.
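
A minimal standalone reproducer of the underlying rapidjson behavior (the field name is made up; the number is chosen to exceed INT32_MAX):

```cpp
#include <rapidjson/document.h>
#include <cassert>
#include <cstdint>

int main() {
    rapidjson::Document doc;
    // 3000000000 exceeds INT32_MAX, so rapidjson types it as a 64-bit
    // (and unsigned 32-bit) integer, not as a 32-bit int.
    doc.Parse(R"({"bytes_streamed": 3000000000})");
    const auto& v = doc["bytes_streamed"];
    assert(!v.IsInt());   // GetInt() here would hit the wrong-type assert
    assert(v.IsInt64());
    int64_t bytes = v.GetInt64(); // safe: reads the full 64-bit value
    (void)bytes;
    return 0;
}
```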

Fixes: scylladb/scylladb#23394

Closes scylladb/scylladb#23395

(cherry picked from commit bd8973a025)

Closes scylladb/scylladb#23475
2025-04-11 15:00:47 +03:00