Compare commits


188 Commits

Author SHA1 Message Date
Łukasz Paszkowski
98a6002a1a compaction_manager: cancel submission timer on drain
The `drain` method cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.

This, however, throws an assertion error, as the submission timer is
not cancelled, and re-enabling the manager tries to arm the
already-armed timer.

Thus, cancel the timer when calling the drain method to disable
the compaction manager.
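The failure mode can be sketched with a minimal, hypothetical timer type (mimicking how a seastar::timer asserts when armed twice; none of these names are Scylla's actual code):

```cpp
#include <cassert>

// Hypothetical sketch: arming an already-armed timer asserts,
// as seastar::timer does.
struct submission_timer {
    bool armed = false;
    void arm() { assert(!armed); armed = true; }
    void cancel() { armed = false; }
};

struct compaction_manager_sketch {
    submission_timer timer;
    bool enabled = false;
    void enable() { enabled = true; timer.arm(); }
    // The fix: drain() cancels the timer, so a later enable()
    // does not re-arm an already-armed timer and trip the assert.
    void drain() { timer.cancel(); enabled = false; }
};
```

Without the `timer.cancel()` call in `drain()`, the `enable()` after a drain would hit the assertion.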

Fixes https://github.com/scylladb/scylladb/issues/24504

All versions are affected, so it's a good candidate for a backport.

Closes scylladb/scylladb#24505

(cherry picked from commit a9a53d9178)

Closes scylladb/scylladb#24585
2025-06-29 14:40:43 +03:00
Avi Kivity
9f1ed14f9d Merge '[Backport 6.2] cql: create default superuser if it doesn't exist' from Marcin Maliszkiewicz
Backport of https://github.com/scylladb/scylladb/pull/20137 is needed for another backport https://github.com/scylladb/scylladb/pull/24690 to work correctly.

Fixes https://github.com/scylladb/scylladb/issues/24712

Closes scylladb/scylladb#24711

* github.com:scylladb/scylladb:
  test: test_restart_cluster: create the test
  auth: standard_role_manager allows awaiting superuser creation
  auth: coroutinize the standard_role_manager start() function
  auth: don't start server until the superuser is created
2025-06-29 14:34:55 +03:00
Aleksandra Martyniuk
8321747451 test: rest_api: fix test_repair_task_progress
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.

Wait until at least one child of a root repair task is created.

Fixes: #24556.

Closes scylladb/scylladb#24560

(cherry picked from commit 0deb9209a0)

Closes scylladb/scylladb#24652
2025-06-28 09:40:37 +03:00
Paweł Zakrzewski
3bd3d720e0 test: test_restart_cluster: create the test
The purpose of this test is to verify that the cluster is able to boot
up again after a full cluster shutdown, thus exhibiting no issues when
connecting to a raft group 0 that is larger than one.

(cherry picked from commit 900a6706b8)
2025-06-27 17:50:15 +02:00
Paweł Zakrzewski
3d6d3484b5 auth: standard_role_manager allows awaiting superuser creation
This change implements the ability to await superuser creation in the
function ensure_superuser_is_created(). This means that Scylla will not
be serving CQL connections until the superuser is created.

Fixes #10481

(cherry picked from commit 7008b71acc)
2025-06-27 17:50:08 +02:00
Paweł Zakrzewski
db1d3cc342 auth: coroutinize the standard_role_manager start() function
This change is a preparation for the next change. Moving to coroutines
makes the code more readable and easier to process.

(cherry picked from commit 04fc82620b)
2025-06-27 17:50:01 +02:00
Paweł Zakrzewski
66301eb2b6 auth: don't start server until the superuser is created
This change reorganizes the way standard_role_manager startup is
handled: now the future returned by its start() function can be used to
determine when startup has finished. We use this future to ensure the
startup is finished prior to starting the CQL server.

Some clusters are created without auth, and auth is added later. The
first node to recognize that auth is needed must create the superuser.
Currently this is always on restart, but if we were to ever make it
LiveUpdate then it would not be on restart.

This suggests that we don't really need to wait during restart.

This is a preparatory commit, laying ground for implementation of a
start() function that waits for the superuser to be created. The default
implementation returns a ready future, which makes no change in the code
behavior.

(cherry picked from commit f525d4b0c1)
2025-06-27 17:49:33 +02:00
Pavel Emelyanov
e5ac2285c0 Merge '[Backport 6.2] memtable: ensure _flushed_memory doesn't grow above total_memory' from Scylladb[bot]
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus an upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should not be greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emits rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

Fixes scylladb/scylladb#21413

Backport candidate because it fixes a crash that can happen in existing stable branches.

- (cherry picked from commit 7d551f99be)

- (cherry picked from commit 975e7e405a)

Parent PR: #21638

Closes scylladb/scylladb#24601

* github.com:scylladb/scylladb:
  memtable: ensure _flushed_memory doesn't grow above total memory usage
  replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
2025-06-24 10:12:41 +03:00
Michał Chojnowski
1c23edad22 memtable: ensure _flushed_memory doesn't grow above total memory usage
dirty_memory_manager tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus an upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

dirty_memory_manager controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should not be greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emits rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the mutation_cleaner later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in mutation_cleaner intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
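The accounting rule described above can be condensed into one small function. This is a hypothetical helper, not Scylla's actual code: the reader's flushed-memory estimate is reduced by any net decrease of total memory since the flush began, and the reported "spooled" amount is clamped at zero:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the accounting rule: _flushed_memory is the
// reader's estimate minus the (positive) decrease of total_memory since
// the flush started; the value reported as "spooled" never goes negative.
inline int64_t spooled_memory(int64_t reader_estimate,
                              int64_t total_at_flush_start,
                              int64_t total_now) {
    int64_t total_decrease =
        std::max<int64_t>(0, total_at_flush_start - total_now);
    int64_t flushed = reader_estimate - total_decrease;
    return std::max<int64_t>(0, flushed);  // max(0, _flushed_memory)
}
```

With this rule, an asynchronous drop of `total_memory` after garbage collection can no longer push the flushed estimate above total memory usage, nor wrap "unspooled" memory around.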

(cherry picked from commit 975e7e405a)
2025-06-22 17:37:27 +00:00
Michał Chojnowski
40d1186218 replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).

Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.

But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.

But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.

To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want only the memtable to listen
for the notifications and pass them on, already modified accordingly,
to the manager, so there are no intermediate wrong states.

This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.

(cherry picked from commit 7d551f99be)
2025-06-22 17:37:27 +00:00
Pavel Emelyanov
60cb1c0b2f sstable_directory: Print ks.cf when moving unshared remote sstables
When an sstable is identified by sstable_directory as remote-unshared,
it will at some point be moved to the target shard. When it happens a
log-message appears:

    sstable_directory - Moving 1 unshared SSTables to shard 1

Processing of tables by sstable_directory often happens in parallel, and
messages from sstable_directory are intermixed. Having a message like
above is not very informative, as it tells nothing about sstables that
are being moved.

Equip the message with ks:cf pair to make it more informative.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23912

(cherry picked from commit d40d6801b0)

Closes scylladb/scylladb#24014
2025-06-17 18:24:13 +03:00
Pavel Emelyanov
a00b4a027e Update seastar submodule (no nested stall backtraces)
* seastar e40388c4...b383afb0 (1):
  > stall_detector: no backtrace if exception

Fixes #24464

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24536
2025-06-17 18:23:06 +03:00
Nikos Dragazis
8ef55444f5 sstables: Fix race when loading checksum component
`read_checksum()` loads the checksum component from disk and stores a
non-owning reference in the shareable components. To avoid loading the
same component twice, the function has an early return statement.
However, this does not guarantee atomicity - two fibers or threads may
load the component and update the shareable components concurrently.
This can lead to use-after-free situations when accessing the component
through the shareable components, since the reference stored there is
non-owning. This can happen when multiple compaction tasks run on the
same SSTable (e.g., regular compaction and scrub-validate).

Fix this by not updating the reference in shareable components, if a
reference is already in place. Instead, create an owning reference to
the existing component for the current fiber. This is less efficient
than using a mutex, since the component may be loaded multiple times
from disk before noticing the race, but no locks are used for any other
SSTable component either. Also, this affects uncompressed SSTables,
which are not that common.
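The reuse logic described above can be sketched with standard-library smart pointers standing in for Scylla's shared components (hypothetical types; the real code does not use `weak_ptr`): the slot holds a non-owning reference, and a reader that finds it already populated takes an owning reference to the existing object instead of overwriting the slot:

```cpp
#include <memory>

// Hypothetical sketch of the fix: never replace an existing non-owning
// reference; instead, hand the caller an owning reference to the
// component that is already in place.
struct checksum { int data = 42; };

struct shareable_components {
    std::weak_ptr<checksum> slot;  // non-owning, like the component table
};

std::shared_ptr<checksum> read_checksum(shareable_components& sc) {
    if (auto existing = sc.slot.lock()) {
        return existing;                 // another fiber won: reuse theirs
    }
    auto loaded = std::make_shared<checksum>();  // "load from disk"
    sc.slot = loaded;                    // publish a non-owning reference
    return loaded;                       // caller keeps the owning reference
}
```

The component may occasionally be loaded more than once under a race, but no reader can ever be left holding a dangling reference.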

Fixes #23728.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#23872

(cherry picked from commit eaa2ce1bb5)

Closes scylladb/scylladb#24267
2025-06-17 13:20:30 +03:00
Szymon Malewski
8eebc97ae1 mapreduce_service: Prevent race condition
In parallelized aggregation functions, the super-coordinator (the node performing the final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`).
Usually the responses are spread over time and the actual merging is atomic.
However, sometimes partial results are received at about the same time, and if an aggregate function (e.g. a lua script) yields, two coroutines can try to overwrite the same accumulator one after another,
which leads to losing some of the results.
To prevent this, in this patch each coroutine stores its merging result in its own context and overwrites the accumulator atomically, only after the result has been fully merged.
Compared to the previous implementation, the order of operands in the merging function is swapped, but the order of aggregation is not guaranteed anyway.
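The fixed shape can be sketched as follows (hypothetical names, not Scylla's code). The key point is that in Seastar's cooperative model, code that does not yield runs without interleaving on a shard, so merging into a local context first and touching the shared accumulator in a single non-yielding step cannot lose updates:

```cpp
#include <numeric>
#include <vector>

// Hypothetical sketch: the shared accumulator is updated in one
// non-yielding step, after the partial result was fully merged locally.
struct accumulator { long value = 0; };

void merge_partial(accumulator& acc, const std::vector<long>& partial) {
    // All yielding work (e.g. running a lua merge function) happens on
    // a coroutine-local value...
    long local = std::accumulate(partial.begin(), partial.end(), 0L);
    // ...and only then is the shared accumulator touched, atomically
    // with respect to other fibers on the same shard.
    acc.value += local;
}
```

The buggy shape, by contrast, read the accumulator, yielded inside the merge, and wrote the result back, so two fibers could both read the same old value.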

Fixes #20662

Closes scylladb/scylladb#24106

(cherry picked from commit 5969809607)

Closes scylladb/scylladb#24387
2025-06-17 13:20:12 +03:00
Pavel Emelyanov
884293a382 Merge '[Backport 6.2] tablets: deallocate storage state on end_migration' from Scylladb[bot]
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfully
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it's already cleaned up
5. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes https://github.com/scylladb/scylladb/issues/23481

backport to all relevant releases since it's a bug that results in a crash

- (cherry picked from commit 34f15ca871)

- (cherry picked from commit fb18fc0505)

- (cherry picked from commit bd88ca92c8)

Parent PR: #24393

Closes scylladb/scylladb#24486

* github.com:scylladb/scylladb:
  test/cluster/test_tablets: test restart during tablet cleanup
  test: tablets: add get_tablet_info helper
  tablets: deallocate storage state on end_migration
2025-06-17 13:19:49 +03:00
Michael Litvak
06e5a48d17 test/cluster/test_tablets: test restart during tablet cleanup
Add a test that reproduces issue scylladb/scylladb#23481.

The test migrates a tablet from one node to another, and while the
tablet is in some stage of cleanup - either before or right after,
depending on the parameter - the leaving replica, on which the tablet is
cleaned, is restarted.

This is interesting because when the leaving replica starts and loads
its state, the tablet could be in different stages of cleanup - the
SSTables may still exist or they may have been cleaned up already, and
we want to make sure the state is loaded correctly.

(cherry picked from commit bd88ca92c8)
2025-06-12 14:33:07 +03:00
Michael Litvak
c5d11205c1 test: tablets: add get_tablet_info helper
Add a helper for tests to get the tablet info from system.tablets for a
tablet owning a given token.

(cherry picked from commit fb18fc0505)
2025-06-12 03:24:00 +00:00
Michael Litvak
0b46cfb60d tablets: deallocate storage state on end_migration
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfully
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it was already cleaned up
5. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.
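The allocation rule on table initialization reduces to a small decision function. This is a hypothetical sketch (the enum mirrors the stage names used above, not the full set of real transition stages):

```cpp
// Hypothetical sketch of the init-time rule: a storage group is
// allocated unless the tablet already passed its cleanup stage.
enum class tablet_stage {
    streaming,
    cleanup,           // leaving replica: cleanup still pending or running
    end_migration,     // leaving replica: cleanup already done
    cleanup_target,    // pending replica of an aborted migration
    revert_migration,  // pending replica: cleanup_target already done
};

bool allocate_storage_group_on_init(tablet_stage stage) {
    switch (stage) {
    case tablet_stage::end_migration:
    case tablet_stage::revert_migration:
        return false;  // cleanup already ran; don't resurrect the group
    default:
        return true;   // includes `cleanup`: the group is still needed
    }
}
```

A restart in `cleanup` stage still allocates the group (so cleanup can deallocate it when transitioning onward), while a restart in `end_migration` does not, which closes the leak described in steps 1-5.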

Fixes scylladb/scylladb#23481

(cherry picked from commit 34f15ca871)
2025-06-12 03:24:00 +00:00
Michał Chojnowski
4a18a08284 utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()
`chunked_managed_vector` is a vector-like container which splits
its contents into multiple contiguous allocations if necessary,
in order to fit within LSA's max preferred contiguous allocation
limits.

Each limited-size chunk is stored in a `managed_vector`.
`managed_vector` is unaware of LSA's size limits.
It's up to the user of `managed_vector` to pick a size which
is small enough.

This happens in `chunked_managed_vector::max_chunk_capacity()`.
But the calculation is wrong, because it doesn't account for
the fact that `managed_vector` has to place some metadata
(the backreference pointer) inside the allocation.
In effect, the chunks allocated by `chunked_managed_vector`
are just a tiny bit larger than the limit, and the limit is violated.

Fix this by accounting for the metadata.

Also, before the patch, `chunked_managed_vector::max_contiguous_allocation`
repeats the definition of `logalloc::max_managed_object_size`.
This is begging for a bug if `logalloc::max_managed_object_size`
changes one day. Adjust it so that `chunked_managed_vector` looks
directly at `logalloc::max_managed_object_size`, as it means to.
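The corrected calculation can be sketched like this (the constants are illustrative stand-ins, not the real logalloc values): the per-chunk capacity must be derived from the allocation limit minus the in-allocation metadata, so that a full chunk never exceeds the limit:

```cpp
#include <cstddef>

// Hypothetical stand-ins for the real limits.
constexpr std::size_t max_managed_object_size = 128 * 1024;
// managed_vector stores a backreference pointer inside the allocation:
constexpr std::size_t managed_vector_metadata = sizeof(void*);

// The fix: subtract the metadata before dividing by the element size,
// otherwise each chunk overshoots the limit by a tiny bit.
template <typename T>
constexpr std::size_t max_chunk_capacity() {
    return (max_managed_object_size - managed_vector_metadata) / sizeof(T);
}
```

With the subtraction in place, `capacity * sizeof(T) + metadata` is always within the limit; without it, a full chunk is `metadata` bytes over.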

Fixes scylladb/scylladb#23854

(cherry picked from commit 7f9152babc)

Closes scylladb/scylladb#24369
2025-06-03 18:10:36 +03:00
Piotr Dulikowski
1571948cd7 topology_coordinator: silence ERROR messages on abort
When the topology coordinator is shut down while doing a long-running
operation, the current operation might throw a raft::request_aborted
exception. This is not a critical issue and should not be logged with
ERROR verbosity level.

Make sure that all the try..catch blocks in the topology coordinator
which:

- May try to acquire a new group0 guard in the `try` part
- Have a `catch (...)` block that print an ERROR-level message

...have a pass-through `catch (raft::request_aborted&)` block which does
not log the exception.

Fixes: scylladb/scylladb#22649

Closes scylladb/scylladb#23962

(cherry picked from commit 156ff8798b)

Closes scylladb/scylladb#24074
2025-05-13 20:41:41 +03:00
Pavel Emelyanov
5a82f0d217 Merge '[Backport 6.2] replica: skip flush of dropped table' from Scylladb[bot]
Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of a dropped table instead.

Fixes: #16095.

Needs backport to 2025.1 and 6.2 as they contain the bug

- (cherry picked from commit 91b57e79f3)

- (cherry picked from commit c1618c7de5)

Parent PR: #23876

Closes scylladb/scylladb#23904

* github.com:scylladb/scylladb:
  test: test table drop during flush
  replica: skip flush of dropped table
2025-05-13 14:00:38 +03:00
Aleksandra Martyniuk
874b4f8d9c streaming: skip dropped tables
Currently, stream_session::prepare throws when a table in requests
or summaries is dropped. However, we do not want to fail streaming
if the table is dropped.

Delete table checks from stream_session::prepare. Further streaming
steps can handle the dropped table and finish the streaming successfully.

Fixes: #15257.

Closes scylladb/scylladb#23915

(cherry picked from commit 20c2d6210e)

Closes scylladb/scylladb#24050
2025-05-13 13:56:46 +03:00
Aleksandra Martyniuk
91a1acc314 test: test table drop during flush
(cherry picked from commit c1618c7de5)
2025-05-09 09:27:08 +02:00
Aleksandra Martyniuk
720bd681f0 replica: skip flush of dropped table
(cherry picked from commit 91b57e79f3)
2025-05-09 09:26:44 +02:00
Piotr Dulikowski
b3bc3489dd utils::loading_cache: gracefully skip timer if gate closed
The loading_cache has a periodic timer which acquires the
_timer_reads_gate. The stop() method first closes the gate and then
cancels the timer - this order is necessary because the timer is
re-armed under the gate. However, the timer callback does not check
whether the gate was closed but tries to acquire it, which might result
in unhandled exception which is logged with ERROR severity.

Fix the timer callback by acquiring access to the gate at the beginning
and gracefully returning if the gate is closed. Even though the gate
used to be entered in the middle of the callback, it does not make sense
to execute the timer's logic at all if the cache is being stopped.
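The fixed callback shape can be sketched with a toy gate type (hypothetical; a real seastar::gate also waits for holders on close, and entering a closed gate throws rather than returning false):

```cpp
// Hypothetical toy gate: try_enter() reports failure instead of
// throwing, which is the behavior the fixed callback relies on.
struct gate {
    bool closed = false;
    int entered = 0;
    bool try_enter() {
        if (closed) { return false; }
        ++entered;
        return true;
    }
    void leave() { --entered; }
    void close() { closed = true; }
};

bool timer_callback(gate& g) {
    if (!g.try_enter()) {
        return false;  // cache is stopping: skip this cycle gracefully
    }
    // ... periodic reload work would run here, then re-arm the timer ...
    g.leave();
    return true;
}
```

Checking the gate at the very start, rather than mid-callback, means no part of the periodic work runs once stop() has begun.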

Fixes: scylladb/scylladb#23951

Closes scylladb/scylladb#23952

(cherry picked from commit 8ffe4b0308)

Closes scylladb/scylladb#23980
2025-05-06 10:19:32 +02:00
Botond Dénes
e4c6a3c068 Merge '[Backport 6.2] topology coordinator: do not proceed further on invalid bootstrap tokens' from Scylladb[bot]
In case dht::boot_strapper::get_bootstrap_tokens fails to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving bootstrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data, which
implicitly expects that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.
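The control-flow shape of the fix can be sketched as follows (hypothetical names; the real coordinator loop is far more involved):

```cpp
#include <stdexcept>

enum class outcome { proceeded, rolled_back };

// Hypothetical sketch: on a token-parsing failure, schedule a rollback
// and break out, instead of falling through with empty bootstrap tokens.
outcome handle_bootstrap(bool tokens_parse_ok) {
    bool rollback_scheduled = false;
    do {
        try {
            if (!tokens_parse_ok) {
                throw std::runtime_error("bad initial_token");
            }
        } catch (...) {
            rollback_scheduled = true;
            break;  // the missing break: do not continue with empty tokens
        }
        // ... prepare_and_broadcast_cdc_generation_data() and the rest of
        // the coordinator logic would run here ...
    } while (false);
    return rollback_scheduled ? outcome::rolled_back : outcome::proceeded;
}
```

Without the `break`, the code after the try/catch would still run with an empty token set, which is exactly the misuse the CDC sanity check in the sibling commit guards against.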

Fixes: scylladb/scylladb#23897

From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them.

- (cherry picked from commit 66acaa1bf8)

- (cherry picked from commit 845cedea7f)

- (cherry picked from commit 670a69007e)

Parent PR: #23914

Closes scylladb/scylladb#23948

* github.com:scylladb/scylladb:
  test: cluster: add test_bad_initial_token
  topology coordinator: do not proceed further on invalid bootstrap tokens
  cdc: add sanity check for generating an empty generation
2025-05-01 08:32:13 +03:00
Botond Dénes
254f535f63 Merge '[Backport 6.2] tasks: check whether a node is alive before rpc' from Scylladb[bot]
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.

Fixes: https://github.com/scylladb/scylladb/issues/22514.

Needs backport to 2025.1 and 6.2 as they contain the bug.

- (cherry picked from commit 53e0f79947)

- (cherry picked from commit e178bd7847)

Parent PR: #23787

Closes scylladb/scylladb#23942

* github.com:scylladb/scylladb:
  test: add test for getting tasks children
  tasks: check whether a node is alive before rpc
2025-05-01 08:30:09 +03:00
Avi Kivity
c20b2ea2af Merge '[Backport 6.2] Ensure raft group0 RPCs use the gossip scheduling group.' from Scylladb[bot]
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.

For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.

Fixes scylladb/scylladb#21637

Backport: 6.2 and 6.1

- (cherry picked from commit 60f1053087)

- (cherry picked from commit e05c082002)

Parent PR: #22779

Closes scylladb/scylladb#23769

* github.com:scylladb/scylladb:
  ensure raft group0 RPCs use the gossip scheduling group
  Move RAFT operations verbs to GOSSIP group.
2025-04-30 16:45:20 +03:00
Aleksandra Martyniuk
cc37c64467 test: add test for getting tasks children
Add test that checks whether the children of a virtual task will be
properly gathered if a node is down.

(cherry picked from commit e178bd7847)
2025-04-30 10:40:32 +02:00
Aleksandra Martyniuk
47377cd5d4 tasks: check whether a node is alive before rpc
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.

(cherry picked from commit 53e0f79947)
2025-04-30 10:35:45 +02:00
Sergey Zolotukhin
5c90107c14 ensure raft group0 RPCs use the gossip scheduling group
Scylla operations use concurrency semaphores to limit the number
of concurrent operations and prevent resource exhaustion. The
semaphore is selected based on the current scheduling group.
For Raft group operations, it is essential to use a system semaphore to
avoid queuing behind user operations.
This commit adds a check to ensure that the raft group0 RPCs are
executed with the `gossiper` scheduling group.
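The selection logic can be sketched like this (hypothetical names and verb strings; the real mapping lives in messaging_service's `do_get_rpc_client_idx`): the semaphore follows from the scheduling group, so routing raft group0 verbs to the gossip group makes them use the system semaphore instead of queuing behind user operations:

```cpp
#include <string>

enum class scheduling_group { user, gossip, streaming };

// The concurrency semaphore is selected from the current scheduling
// group: only the user group maps to the user semaphore.
std::string semaphore_for(scheduling_group sg) {
    return sg == scheduling_group::user ? "user_semaphore"
                                        : "system_semaphore";
}

// Hypothetical verb-to-group mapping: raft group0 verbs go to gossip.
scheduling_group group_for_verb(const std::string& verb) {
    if (verb.rfind("RAFT_", 0) == 0 || verb == "GOSSIP") {
        return scheduling_group::gossip;
    }
    return scheduling_group::user;
}
```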

(cherry picked from commit e05c082002)
2025-04-30 08:49:07 +02:00
Sergey Zolotukhin
612f184638 Move RAFT operations verbs to GOSSIP group.
In order for RAFT operations to use the gossip system semaphore, move
RAFT verbs to the gossip group in `do_get_rpc_client_idx` in
messaging_service.

Fixes scylladb/scylladb#21637

(cherry picked from commit 60f1053087)
2025-04-29 19:25:55 +00:00
Piotr Dulikowski
d881f3f14d test: cluster: add test_bad_initial_token
Adds a test which checks that rollback works properly when a bad
value of initial_token is provided.

(cherry picked from commit 670a69007e)
2025-04-28 17:07:23 +00:00
Piotr Dulikowski
7eda572129 topology coordinator: do not proceed further on invalid bootstrap tokens
In case dht::boot_strapper::get_bootstrap_tokens fails to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving bootstrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data, which
implicitly expects that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897
(cherry picked from commit 845cedea7f)
2025-04-28 17:07:23 +00:00
Piotr Dulikowski
41e6df7407 cdc: add sanity check for generating an empty generation
It doesn't make sense to create an empty CDC generation because it does
not make sense to have a cluster with no tokens. Add a sanity check to
cdc::make_new_generation_description which fails if somebody attempts to
do that (i.e. when the set of current tokens + optionally bootstrapping
node's tokens is empty).

The function does not work correctly if it is misused, as we saw in
scylladb/scylladb#23897. While the function should not be misused in the
first place, it's better to throw an exception rather than crash -
especially as this crash could happen on the topology coordinator.

(cherry picked from commit 66acaa1bf8)
2025-04-28 17:07:22 +00:00
Tomasz Grabiec
b48d1abade Merge '[Backport 6.2] Cache base info for view schemas in the schema registry' from Scylladb[bot]
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated due to all `schema_ptr`s temporarily dying.

To fix this, this patch adds the base schema to the registry, alongside
the view schema. We store just the frozen base schema, so that we can
transfer it across shards. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.

In this series we also make sure that the view schemas in the registry are
kept up to date with regard to base schema changes.

Fixes https://github.com/scylladb/scylladb/issues/21354

This issue is a bug, so adding backport labels 6.1 and 6.2

- (cherry picked from commit 6f11edbf3f)

- (cherry picked from commit dfe3810f64)

- (cherry picked from commit 82f2e1b44c)

- (cherry picked from commit 3094ff7cbe)

- (cherry picked from commit 74cbc77f50)

Parent PR: #21862

Closes scylladb/scylladb#23046

* github.com:scylladb/scylladb:
  test: add test for schema registry maintaining base info for views
  schema_registry: avoid setting base info when getting the schema from registry
  schema_registry: update cached base schemas when updating a view
  schema_registry: cache base schemas for views
  db: set base info before adding schema to registry
2025-04-25 18:42:57 +02:00
Tomasz Grabiec
9190779ee3 Merge '[Backport 6.2] storage_service: preserve state of busy topology when transiting tablet' from Scylladb[bot]
Commit 876478b84f ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.

In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way.

Unit test:

Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone.

Fixes https://github.com/scylladb/scylladb/issues/20073.

Commit 876478b84f was first released in scylla-6.0.0, so we might want to backport this patch accordingly.

- (cherry picked from commit e1186f0ae6)

- (cherry picked from commit 841ca652a0)

Parent PR: #23751

Closes scylladb/scylladb#23768

* github.com:scylladb/scylladb:
  storage_service: add unit test for mid-decommission transit_tablet()
  storage_service: preserve state of busy topology when transiting tablet
2025-04-20 20:13:11 +02:00
Avi Kivity
8db020e782 Merge '[Backport 6.2] managed_bytes: in the copy constructor, respect the target preferred allocation size' from Scylladb[bot]
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).

In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.

2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

This is a regression fix, should be backported to all affected releases.

- (cherry picked from commit 4e2f62143b)

- (cherry picked from commit 6c1889f65c)

Parent PR: #23782

Closes scylladb/scylladb#23809

* github.com:scylladb/scylladb:
  managed_bytes_test: add a reproducer for #23781
  managed_bytes: in the copy constructor, respect the target preferred allocation size
2025-04-19 18:43:31 +03:00
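The layout decision at the heart of the fix above can be modeled without any allocator code: the copy must honor the *target* allocator's preferred maximum contiguous allocation, not the source object's existing (possibly single-chunk) layout. A sketch with an illustrative 12 kiB preferred chunk size (the real LSA value may differ):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the chunk sizes a copy of `size` bytes should use, given the
// target allocator's preferred maximum contiguous allocation size.
std::vector<std::size_t> copy_chunk_layout(std::size_t size,
                                           std::size_t target_preferred_max) {
    std::vector<std::size_t> chunks;
    while (size > 0) {
        std::size_t c = std::min(size, target_preferred_max);
        chunks.push_back(c); // never exceed the target's preferred chunk size
        size -= c;
    }
    return chunks;
}
```

The bug was equivalent to always copying the source layout: a 16 kiB single chunk copied into LSA stayed a single chunk, violating LSA's preference and landing in the standard allocator.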
Calle Wilund
d566010412 network_topology_strategy/alter ks: Remove dc:s from options once rf=0
Fixes #22688

If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of the keyspace will fail,
because the union of new/old options will now contain an unknown keyword.

Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit rf=0 in the nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy object's dc mapping, but on the full replica topology info
for which dc:s to consider for reallocation. Since we verify the input
during attribute processing, the amount of rf/tablets moved should still
be legal.

v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some

Closes scylladb/scylladb#22693

(cherry picked from commit 342df0b1a8)

(Update: work around python test objects not having dc info)

Closes scylladb/scylladb#22876
2025-04-18 14:09:52 +03:00
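The option-processing change above amounts to treating rf=0 as "remove the DC entry" when merging ALTER options. A sketch (merge_rf_options is a hypothetical helper, not the actual ks_prop_defs code):

```cpp
#include <cassert>
#include <map>
#include <string>

// Merge ALTER KEYSPACE options into the existing ones. An rf of zero drops
// the DC from the map instead of storing a dc=0 entry, so a later
// decommission of that DC cannot leave an unknown keyword behind.
std::map<std::string, int> merge_rf_options(std::map<std::string, int> old_opts,
                                            const std::map<std::string, int>& new_opts) {
    for (const auto& [dc, rf] : new_opts) {
        if (rf == 0) {
            old_opts.erase(dc); // absent entries are an implicit rf=0
        } else {
            old_opts[dc] = rf;
        }
    }
    return old_opts;
}
```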
Michał Chojnowski
ba3975858b managed_bytes_test: add a reproducer for #23781
(cherry picked from commit 6c1889f65c)
2025-04-18 07:55:23 +00:00
Michał Chojnowski
f75b0d49a0 managed_bytes: in the copy constructor, respect the target preferred allocation size
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).

In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest
   power of 2.

2. With a pathological-enough allocation pattern
   (for example, one which somehow ends up placing a single 16 kiB
   memtable-owned allocation in every aligned 128 kiB span),
   memtable flushing could theoretically deadlock,
   because the allocator might be too fragmented to let the memtable
   grow by another 128 kiB segment, while keeping the sum of all
   allocations small enough to avoid triggering a flush.
   (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious
   allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether
   reclamation succeeded by checking that the number of free LSA
   segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong
   because the reclaim might have met its target by evicting the non-LSA
   allocations, in which case memory is returned directly to the
   standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing`
   to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

(cherry picked from commit 4e2f62143b)
2025-04-18 07:55:23 +00:00
Botond Dénes
e1ce7c36a7 test/cluster/test_read_repair.py: increase read request timeout
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode, causing the test read to time
out. Increase the timeout to 1 minute to avoid this.

Fixes: #23513
Fixes: #23512

Closes scylladb/scylladb#23558

(cherry picked from commit 7bbfa5293f)

Closes scylladb/scylladb#23793
2025-04-18 06:34:25 +03:00
Laszlo Ersek
597030a821 storage_service: add unit test for mid-decommission transit_tablet()
Start a two node cluster. Create a single tablet on one of the nodes.
Start decommissioning that node, but block decommissioning at once. In
that state (i.e., in "tablet_draining"), move the tablet manually to the
other node. Check that transit_tablet() leaves the topology transition
state alone.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 841ca652a0)
2025-04-17 09:09:43 +02:00
Laszlo Ersek
add4022d32 storage_service: preserve state of busy topology when transiting tablet
Commit 876478b84f ("storage_service: allow concurrent tablet migration
in tablets/move API", 2024-02-08) introduced a code path on which the
topology state machine would be busy -- in "tablet_draining" or
"tablet_migration" state -- at the time of starting tablet migration. The
pre-commit code would unconditionally transition the topology to
"tablet_migration" state, assuming the topology had been idle previously.
On the new code path, this state change would be idempotent if the
topology state machine had been busy in "tablet_migration", but the state
change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.

In addition, add the topology update to the "updates" vector with plain
push_back(). emplace_back() is not helpful here, as
topology_mutation_builder::build() cannot construct in-place, and so we
invoke the "canonical_mutation" move constructor once, either way.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit e1186f0ae6)
2025-04-17 08:32:56 +02:00
Avi Kivity
b6f1cacfea scylla-gdb: small-objects: fix for very small objects
Because of rounding and alignment, there are multiple pools for small
sizes (e.g. 4 for size 32). Because the pool selection algorithm
ignores alignment, different pools can be chosen for different object
sizes. For example, an object size of 29 will choose the first pool
of size 32, while an object size of 32 will choose the fourth pool of
size 32.

The small-objects command doesn't know about this and always considers
just the first pool for a given size. This causes it to miss out on
sister pools.

While it's possible to adjust pool selection to always choose one of the
pools, it may eat a precious cycle. So instead let's compensate in the
small-objects command. Instead of finding one pool for a given size,
find all of them, and iterate over all those pools.

Fixes #23603

Closes scylladb/scylladb#23604

(cherry picked from commit b4d4e48381)

Closes scylladb/scylladb#23748
2025-04-16 14:36:28 +03:00
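The small-objects fix above is essentially "collect all pools with a matching object size instead of stopping at the first one". A sketch with an illustrative pool layout (the struct and counts are assumptions, not the real Seastar allocator internals):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sister pools share an object size and differ only in alignment.
struct pool { std::size_t object_size; std::size_t alignment; };

// Instead of returning the first matching pool, gather every pool with the
// requested object size, so the caller can iterate over all of them.
std::vector<const pool*> pools_for_size(const std::vector<pool>& pools, std::size_t size) {
    std::vector<const pool*> matches;
    for (const auto& p : pools) {
        if (p.object_size == size) {
            matches.push_back(&p); // keep the sister pools too
        }
    }
    return matches;
}
```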
Botond Dénes
c1defce1de Merge '[Backport 6.2] transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Scylladb[bot]
A default timestamp (not to be confused with the timestamp passed via the 'USING TIMESTAMP' query clause) can be set using the 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of the Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for EXECUTE and BATCH traces (I guess I simply forgot to set it back then).

This patch fixes this.

Fixes #23173

The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases.

- (cherry picked from commit ca6bddef35)

- (cherry picked from commit f7e1695068)

Parent PR: #23174

Closes scylladb/scylladb#23523

* github.com:scylladb/scylladb:
  CQL Tracing: set common query parameters in a single function
  transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
2025-04-11 17:11:06 +03:00
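The flag handling described above is the same for all three ops after the fix. A sketch (the 0x20 value comes from the commit message; the function name and signature are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// If the frame flags carry the default-timestamp bit, the <timestamp> field
// is present and should also be recorded in the tracing context -- for
// QUERY, EXECUTE and BATCH alike.
constexpr std::uint8_t default_timestamp_flag = 0x20;

std::optional<std::int64_t> default_timestamp(std::uint8_t flags,
                                              std::int64_t timestamp_field) {
    if (flags & default_timestamp_flag) {
        return timestamp_field;
    }
    return std::nullopt;
}
```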
Botond Dénes
61d1262674 Merge '[Backport 6.2] reader_concurrency_semaphore: register_inactive_read(): handle aborted permit' from Scylladb[bot]
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if the permit timed out). If the permit also happens to be waiting for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already hold an exception and this will trigger an assert. Add a separate case checking if the permit is already aborted. If so, treat it as immediate eviction: close the reader and clean up.

Fixes: scylladb/scylladb#22919

Bug is present in all live versions, backports are required.

- (cherry picked from commit 4d8eb02b8d)

- (cherry picked from commit 7ba29ec46c)

Parent PR: #23044

Closes scylladb/scylladb#23144

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
  test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
2025-04-11 17:10:39 +03:00
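The double-set-exception hazard above can be modeled with std::promise standing in for the Seastar promise (all names below are illustrative, not the semaphore's real API): setting an exception on a promise that already holds one fails, so the fix checks the already-aborted state first and takes the eviction path without touching the promise again.

```cpp
#include <cassert>
#include <future>
#include <stdexcept>
#include <string>

// Sketch of the fixed registration path: an already-aborted permit is
// treated as immediate eviction, and its promise is left alone.
std::string register_inactive_read(bool permit_aborted, std::promise<void>& waiters) {
    if (permit_aborted) {
        return "evict: close reader and clean up"; // the new early path
    }
    // Only safe when nothing has set the promise yet; calling this twice on
    // the same promise throws (the modeled assert).
    waiters.set_exception(std::make_exception_ptr(std::runtime_error("aborted")));
    return "waiters aborted";
}
```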
Botond Dénes
ae54d4e886 Merge '[Backport 6.2] streaming: fix the way a reason of streaming failure is determined' from Scylladb[bot]
During streaming, the receiving node gets and processes mutation fragments.
If this operation fails, the receiver responds with a -1 status code, unless
it failed due to no_such_column_family, in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on the receiver
side may see not only data_dictionary::no_such_column_family, but also a
seastar::nested_exception wrapping two no_such_column_family errors.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace the try_catch clause with table_sync_and_check, which synchronizes
the schema and checks if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.

Needs backport to all live versions, as they all contain the bug

- (cherry picked from commit 876cf32e9d)

- (cherry picked from commit faf3aa13db)

- (cherry picked from commit 44748d624d)

- (cherry picked from commit 35bc1fe276)

Parent PR: #22868

Closes scylladb/scylladb#23289

* github.com:scylladb/scylladb:
  streaming: fix the way a reason of streaming failure is determined
  streaming: save a continuation lambda
  streaming: use streaming namespace in table_check.{cc,hh}
  repair: streaming: move table_check.{cc,hh} to streaming
2025-04-11 15:13:14 +03:00
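The type-matching pitfall above can be demonstrated in miniature. Like seastar::nested_exception, the wrapper below is *not* derived from the inner exception type, so a catch clause for the inner type misses it; the fix sidesteps exception matching entirely and just checks whether the table still exists. All names here are illustrative stand-ins:

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

struct no_such_column_family : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// A wrapper holding the inner error by pointer, like seastar::nested_exception.
struct nested_exception : std::exception {
    std::exception_ptr inner;
    const char* what() const noexcept override { return "nested"; }
};

// Buggy classification: match on exception type only.
std::string classify_by_type(std::exception_ptr ep) {
    try {
        std::rethrow_exception(ep);
    } catch (const no_such_column_family&) {
        return "skip table";
    } catch (...) {
        return "fail stream"; // nested wrappers end up here
    }
}

// The fix: after syncing the schema, decide by whether the table still
// exists (table_sync_and_check in the real code).
std::string classify_by_table_check(bool table_exists) {
    return table_exists ? "fail stream" : "skip table";
}
```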
Alexander Turetskiy
e97e729973 Improve compaction on read of expired tombstones
Compact expired tombstones in the cache even if they are blocked by
the commitlog

fixes #16781

Closes: #23033
2025-04-11 15:09:43 +03:00
Botond Dénes
20a1edd763 tools/scylla-nodetool: s/GetInt()/GetInt64()/
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type that `GetInt()` returns. When this happens, rapidjson
assigns a distinct 64-bit integer type to the value, and attempting to
access it as a 32-bit integer triggers the wrong-type error, resulting
in an assert failure. This was hit in the field, where invoking nodetool
netstats crashed nodetool when the streamed-bytes amounts were higher
than maxint.

To avoid such bugs in the future, replace all usage of GetInt() in
nodetool with GetInt64(), just to be sure.

A reproducer is added for the nodetool netstats crash.

Fixes: scylladb/scylladb#23394

Closes scylladb/scylladb#23395

(cherry picked from commit bd8973a025)

Closes scylladb/scylladb#23475
2025-04-11 15:00:47 +03:00
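The failure mode above is plain integer-width overflow: a streamed-bytes counter above INT32_MAX fits a 64-bit integer but not a 32-bit one. In rapidjson terms, such a value satisfies IsInt64() but not IsInt(), so GetInt() trips the wrong-type assert while GetInt64() reads it fine. A minimal standalone check:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True iff the 64-bit value can be read through a 32-bit accessor.
bool fits_in_int32(std::int64_t v) {
    return v >= std::numeric_limits<std::int32_t>::min() &&
           v <= std::numeric_limits<std::int32_t>::max();
}
```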
Aleksandra Martyniuk
07236341ab streaming: fix the way a reason of streaming failure is determined
During streaming, the receiving node gets and processes mutation fragments.
If this operation fails, the receiver responds with a -1 status code, unless
it failed due to no_such_column_family, in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on the receiver
side may see not only data_dictionary::no_such_column_family, but also a
seastar::nested_exception wrapping two no_such_column_family errors.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace the try_catch clause with table_sync_and_check, which synchronizes
the schema and checks if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.
(cherry picked from commit 35bc1fe276)
2025-04-11 11:55:38 +02:00
Aleksandra Martyniuk
2027b9a21b streaming: save a continuation lambda
In the following patches, an additional preemption point will be
added to the coroutine lambda in register_stream_mutation_fragments.

Assign the lambda to a variable to prolong the captures' lifetime.

(cherry picked from commit 44748d624d)
2025-04-11 11:55:38 +02:00
Aleksandra Martyniuk
c1a211101d streaming: use streaming namespace in table_check.{cc,hh}
(cherry picked from commit faf3aa13db)
2025-04-11 11:55:38 +02:00
Aleksandra Martyniuk
e52fd85cf8 repair: streaming: move table_check.{cc,hh} to streaming
(cherry picked from commit 876cf32e9d)
2025-04-11 11:55:36 +02:00
Botond Dénes
6091e81e18 reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
It is possible that the permit handed in to register_inactive_read() is
already aborted (currently only possible if the permit timed out).
If the permit also happens to be waiting for memory, the current code
will attempt to call promise<>::set_exception() on the permit's promise
to abort its waiters. But if the permit was already aborted via timeout,
this promise will already have an exception and this will trigger an
assert. Add a separate case for checking if the permit is aborted
already. If so, treat it as immediate eviction: close the reader and
clean up.

Fixes: scylladb/scylladb#22919
(cherry picked from commit 7ba29ec46c)
2025-04-11 04:04:42 -04:00
Botond Dénes
742ff76d4d test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
Unless the test in question actually wants to test timeouts. Timeouts
will have more pronounced consequences soon and thus using
db::timeout_clock::now() becomes a sure way to make tests flaky.
To avoid this, use db::no_timeout in the tests that don't care about
timeouts.

(cherry picked from commit 4d8eb02b8d)
2025-04-11 04:04:42 -04:00
Botond Dénes
76f5816e43 Merge '[Backport 6.2] scylla sstable: Add standard extensions and propagate to schema load ' from Scylladb[bot]
Fixes #22314

Adds the expected schema extensions to the tools extension set (if used). Also uses the source config extensions in the schema loader instead of a temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in it.
Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points.
Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line.

- (cherry picked from commit e6aa09e319)

- (cherry picked from commit 4aaf3df45e)

- (cherry picked from commit 00b40eada3)

- (cherry picked from commit 48fda00f12)

Parent PR: #22327

Closes scylladb/scylladb#23089

* github.com:scylladb/scylladb:
  tools: Add standard extensions and propagate to schema load
  cql_test_env: Use add all extensions instead of individually
  main: Move extensions adding to function
  tombstone_gc: Make validate work for tools
2025-04-11 11:00:41 +03:00
Lakshmi Narayanan Sreethar
1f272eade5 topology_coordinator: handle_table_migration: do not continue after executing metadata barrier
Return after executing the global metadata barrier to allow the topology
handler to handle any transitions that might have started by a
concurrect transaction.

Fixes #22792

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#22793

(cherry picked from commit 0f7d08d41d)

Closes scylladb/scylladb#23020
2025-04-11 10:59:43 +03:00
yangpeiyu2_yewu
f9cb9b905f mutation_writer/multishard_writer.cc: wrap writer into futurize_invoke
Wrapped the writer in seastar::futurize_invoke to make sure that close() on the mutation_reader can be executed before destruction.

Fixes scylladb/scylladb#22790

Closes scylladb/scylladb#22812

(cherry picked from commit 0de232934a)

Closes scylladb/scylladb#22944
2025-04-11 10:59:17 +03:00
Avi Kivity
11993efc8c Merge '[Backport 6.2] row_cache: don't garbage-collect tombstones which cover data in memtables' from Scylladb[bot]
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;

In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which were not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.

Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252

The fix will need backporting to all live releases.

- (cherry picked from commit c2518cdf1a)

- (cherry picked from commit 6b5b563ef7)

- (cherry picked from commit 7e600a0747)

- (cherry picked from commit d126ea09ba)

- (cherry picked from commit cb76cafb60)

- (cherry picked from commit df09b3f970)

- (cherry picked from commit e5afd9b5fb)

- (cherry picked from commit 34b18d7ef4)

- (cherry picked from commit f7938e3f8b)

- (cherry picked from commit 6c1f6427b3)

- (cherry picked from commit 0d39091df2)

Parent PR: #23255

Closes scylladb/scylladb#23671

* github.com:scylladb/scylladb:
  test/boost/row_cache_test: add memtable overlap check tests
  replica/table: add error injection to memtable post-flush phase
  utils/error_injection: add a way to set parameters from error injection points
  test/cluster: add test_data_resurrection_in_memtable.py
  test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
  replica/mutation_dump: don't assume cells are live
  replica/database: do_apply() add error injection point
  replica: improve memtable overlap checks for the cache
  replica/memtable: add is_merging_to_cache()
  db/row_cache: add overlap-check for cache tombstone garbage collection
  mutation/mutation_compactor: copy key passed-in to consume_new_partition()
2025-04-10 21:41:52 +03:00
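The overlap check at the core of this series can be sketched abstractly (illustrative function and semantics; get_max_purgeable in the real code is keyed by partition and far richer): a tombstone is purgeable only if its timestamp lies below the minimum timestamp of data held by any overlapping memtable, whether active, flushing, or flushed but still referenced by readers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns true iff garbage-collecting the tombstone cannot resurrect data:
// every overlapping memtable's data is newer than the tombstone.
bool can_gc_tombstone(std::int64_t tombstone_ts,
                      const std::vector<std::int64_t>& overlapping_memtable_min_ts) {
    for (std::int64_t ts : overlapping_memtable_min_ts) {
        if (ts <= tombstone_ts) {
            return false; // the tombstone may still cover memtable data
        }
    }
    return true;
}
```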
Botond Dénes
5fb8a6dae2 mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.

This protocol implementation bug has user-visible effects when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to the query
result, these queries will trigger the skip sequence in the consumer and,
due to the bug described above, will skip the remaining partitions,
omitting them from the final query result.

This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.

A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.

Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
  underlying consumer's on_end_of_stream(), so when consuming multiple
  frozen mutations, the underlying's on_end_of_stream() is called for
  each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().

Fixes: #20084

Closes scylladb/scylladb#23657

(cherry picked from commit d67202972a)

Closes scylladb/scylladb#23693
2025-04-10 21:38:13 +03:00
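The pausable-consumer protocol above can be modeled in a few lines (hypothetical types; this is a model of the protocol, not the real adaptor): returning stop from a clustering-row consume skips only the rest of that partition, and consume_end_of_partition() alone decides whether the whole stream stops. The bug treated the per-partition stop as a stream-wide stop.

```cpp
#include <cassert>
#include <vector>

struct partition { int rows; };

// Consume at most row_limit rows per partition; count partitions visited.
int consume_stream(const std::vector<partition>& stream, int row_limit, bool with_fix) {
    int partitions_consumed = 0;
    for (const partition& p : stream) {
        bool paused = false;
        for (int r = 0; r < p.rows; ++r) {
            if (r + 1 >= row_limit) {
                paused = true; // stop_iteration::yes from the clustering consume
                break;
            }
        }
        ++partitions_consumed;
        // consume_end_of_partition() returns stop_iteration::no here.
        if (paused && !with_fix) {
            break; // buggy adaptor: terminates the whole stream
        }
    }
    return partitions_consumed;
}
```

A select distinct-style read of three partitions at one row each visits all three with the fix, but only the first with the bug.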
Botond Dénes
a157b3e62f test/boost/row_cache_test: add memtable overlap check tests
Similar to test/cluster/test_data_resurrection_in_memtable.py, but works
on a single node and uses a more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.

(cherry picked from commit 0d39091df2)
2025-04-10 07:33:09 -04:00
Botond Dénes
ce1d990dd6 replica/table: add error injection to memtable post-flush phase
After the memtable was flushed to disk, but before it is merged to
cache. The injection point will only be active for the table specified in
the "table_name" injection parameter.

(cherry picked from commit 6c1f6427b3)
2025-04-10 07:33:09 -04:00
Botond Dénes
37b51871ec utils/error_injection: add a way to set parameters from error injection points
With this, it is now possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit, before proceeding.

(cherry picked from commit f7938e3f8b)
2025-04-10 07:33:09 -04:00
Botond Dénes
ac18570069 test/cluster: add test_data_resurrection_in_memtable.py
Reproducers for #23252 and #23291 -- cache garbage
collecting tombstones resurrecting data in the memtable.

(cherry picked from commit 34b18d7ef4)
2025-04-10 07:33:09 -04:00
Botond Dénes
990e92d7cf test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
Such that a given index in the returned hosts refers to the same
underlying Scylla instance as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).

(cherry picked from commit e5afd9b5fb)
2025-04-10 07:33:09 -04:00
Botond Dénes
67a56ae192 replica/mutation_dump: don't assume cells are live
Currently the dumper unconditionally extracts the value of atomic cells,
assuming they are live. This doesn't always hold, of course, and
attempting to get the value of a dead cell will lead to marshalling
errors. Fix by checking is_live() before attempting to get the cell
value. Fix for both regular and collection cells.

(cherry picked from commit df09b3f970)
2025-04-10 07:33:09 -04:00
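The fix above is a liveness check before value extraction. A sketch with a hypothetical cell type (the real atomic_cell carries timestamps and deletion info; this just models the guard):

```cpp
#include <cassert>
#include <optional>
#include <string>

struct atomic_cell {
    bool live;
    std::string value; // meaningful only when live
};

// Check is_live() before reading the value; a dead cell carries only
// deletion info, and extracting a value from it would fail marshalling.
std::optional<std::string> dump_cell_value(const atomic_cell& cell) {
    if (!cell.live) {
        return std::nullopt; // dead cell: dump the tombstone, not a value
    }
    return cell.value;
}
```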
Botond Dénes
85a7a9cb05 replica/database: do_apply() add error injection point
So writes (to user tables) can be failed on a replica, via error
injection. Should simplify tests which want to create differences in
what writes different replicas receive.

(cherry picked from commit cb76cafb60)
2025-04-10 07:33:09 -04:00
Botond Dénes
95205a1b29 replica: improve memtable overlap checks for the cache
The current memtable overlap check that is used by the cache
-- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only
checks the active memtable, so memtables which are either being flushed
or are already flushed and also have active reads against them do not
participate in the overlap check.
This can result in temporary data resurrection, where a cache read can
garbage-collect a tombstone which still covers data in a flushing or
flushed memtable that still has an active read against it.

To prevent this, extend the overlap check to also consider the entire
memtable list. Furthermore, memtable_list::erase() now places the removed
(flushed) memtable in an intrusive list. These entries are alive only as
long as there are readers still keeping an `lw_shared_ptr<memtable>`
alive. This list is now also consulted on overlap checks.

(cherry picked from commit d126ea09ba)
2025-04-10 07:33:09 -04:00
Botond Dénes
ef423eb4c7 replica/memtable: add is_merging_to_cache()
And set it when the memtable is merged to cache.

(cherry picked from commit 7e600a0747)
2025-04-10 07:33:08 -04:00
Botond Dénes
d10a2688b1 db/row_cache: add overlap-check for cache tombstone garbage collection
The cache should not garbage-collect tombstones which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.

(cherry picked from commit 6b5b563ef7)
2025-04-10 07:33:08 -04:00
Botond Dénes
4647aa0366 mutation/mutation_compactor: copy key passed-in to consume_new_partition()
This doesn't introduce additional work for single-partition queries: the
key is copied anyway on consume_end_of_stream().
Multi-partition reads and compaction are not that sensitive to
additional copy added.

This change fixes a bug in the compacting_reader: currently the reader
passes _last_uncompacted_partition_start.key() to the compactor's
consume_new_partition(). When the compactor emits enough content for this
partition, _last_uncompacted_partition_start is moved from to emit the
partition start, leaving the key reference passed to the compactor
dangling (it refers to a moved-from value). This in turn means that
subsequent GC checks done by the compactor will use a corrupt key and
can therefore result in tombstones being garbage-collected while they
still cover data elsewhere (data resurrection).

The compacting reader is violating the API contract and normally the bug
should be fixed there. We make an exception here because doing the fix
in the mutation compactor better aligns with our future plans:
* The fix simplifies the compactor (gets rid of _last_dk).
* Prepares the way to get rid of the consume API used by the compactor.

(cherry picked from commit c2518cdf1a)
2025-04-09 14:02:30 +00:00
Michał Chojnowski
664d36c737 table: fix a race in table::take_storage_snapshot()
`safe_foreach_sstable` doesn't do its job correctly.

It iterates over an sstable set under the sstable deletion
lock in an attempt to ensure that SSTables aren't deleted during the iteration.

The thing is, it takes the deletion lock after the SSTable set is
already obtained, so SSTables might get unlinked *before* we take the lock.

Remove this function and fix its usages to obtain the set and iterate
over it under the lock.

Closes scylladb/scylladb#23397

(cherry picked from commit e23fdc0799)

Closes scylladb/scylladb#23627
2025-04-08 19:06:42 +03:00
Lakshmi Narayanan Sreethar
90add328ad replica/table::do_apply : do not check for async gate's closure
The `table::do_apply()` method verifies if the compaction group's async
gate is open to determine if the compaction group is active. Closing
this async gate prevents any new operations but waits for existing
holders to exit, allowing their operations to complete. When holding a
gate, holders will observe the gate as closed when it is being closed,
but this is irrelevant as they are already inside the gate and are
allowed to complete. All the callers of `table::do_apply()` already
enter the gate before calling the method. So, the async gate check
inside `table::do_apply()` will erroneously throw an exception when the
compaction group is closing despite holding the gate. This commit
removes the check to prevent this from happening.

Fixes #23348

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#23579

(cherry picked from commit 750f4baf44)

Closes scylladb/scylladb#23644
2025-04-08 18:58:35 +03:00
Yaron Kaikov
f4909aafc7 .github: Make "make-pr-ready-for-review" workflow run in base repo
in 57683c1a50 we fixed the `token` error,
but removed the checkout part, which now causes the following error:
```
failed to run git: fatal: not a git repository (or any of the parent directories): .git
```
Adding the repo checkout stage back avoids this error.

Fixes: https://github.com/scylladb/scylladb/issues/22765

Closes scylladb/scylladb#23641

(cherry picked from commit 2dc7ea366b)

Closes scylladb/scylladb#23653
2025-04-08 13:47:50 +03:00
Kefu Chai
9e3eb4329c .github: Make "make-pr-ready-for-review" workflow run in base repo
The "make-pr-ready-for-review" workflow was failing with an "Input
required and not supplied: token" error.  This was due to GitHub Actions
security restrictions preventing access to the token when the workflow
is triggered in a fork:
```
    Error: Input required and not supplied: token
```

This commit addresses the issue by:

- Running the workflow in the base repository instead of the fork. This
  grants the workflow access to the required token with write permissions.
- Simplifying the workflow by using a job-level `if` condition to
  control execution, as recommended in the GitHub Actions documentation
  (https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/using-conditions-to-control-job-execution).
  This is cleaner than conditional steps.
- Removing the repository checkout step, as the source code is not required for this workflow.

This change resolves the token error and ensures the
"make-pr-ready-for-review" workflow functions correctly.

Fixes scylladb/scylladb#22765

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22766

(cherry picked from commit ca832dc4fb)

Closes scylladb/scylladb#23617
2025-04-07 08:11:09 +03:00
Kefu Chai
4ac3f82df9 dist: systemd: use default KillMode
Before this change, we explicitly set the KillMode of the scylla
service unit to "process". According to
https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html,

> If set to process, only the main process itself is killed (not recommended!).

and the document suggests using "control-group" over "process".
But the scylla server is not a multi-process server, it is a multi-threaded
server, so it should not make any difference even if we switch to
the recommended "control-group".

In light of the "defunct" scylla processes we've been seeing after
stopping the scylla service using systemd, we are wondering if we should
change the `KillMode` to "control-group", which is the default
value of this setting.

In this change, we just drop the setting, so that systemd stops the
service by stopping all processes in the control group of this unit.

Fixes scylladb/scylladb#21507

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

(cherry picked from commit 961a53f716)

Closes scylladb/scylladb#23176
2025-04-04 17:55:04 +03:00
Vlad Zolotarov
e3d063a070 CQL Tracing: set common query parameters in a single function
Each query-type (QUERY, EXECUTE, BATCH) CQL opcode has a number of parameters
in their payload which we always want to record in the Tracing object.
Today it's a Consistency Level, Serial Consistency Level and a Default Timestamp.

Setting each of them individually can lead to a human error when one (or more) of
them would not be set. Let's eliminate such a possibility by defining
a single function that sets them all.

This also allows an easy addition of such parameters to this function in
the future.
2025-04-02 14:10:03 -04:00
Vlad Zolotarov
1b2ab34647 transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause)
can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of
QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation.
For an unknown reason we were not setting this for EXECUTE and BATCH traces (I guess I simply forgot to
set it back then).

This patch fixes this.

Fixes #23173

(cherry picked from commit ca6bddef35)
2025-04-01 11:45:33 +00:00
Yaron Kaikov
b67329b34e .github: add action to make PR ready for review when conflicts label was removed
Moving a PR out of draft is only allowed to users with write access,
adding a github action to switch PR to `ready for review` once the
`conflicts` label was removed

Closes scylladb/scylladb#22446

(cherry picked from commit ed4bfad5c3)

Closes scylladb/scylladb#23006
2025-03-30 11:59:40 +03:00
Kefu Chai
48ff7cf61c gms: Fix fmt formatter for gossip_digest_sync
In commit 4812a57f, the fmt-based formatter for gossip_digest_syn had
formatting code for cluster_id, partitioner, and group0_id
accidentally commented out, preventing these fields from being included
in the output. This commit restores the formatting by uncommenting the
code, ensuring full visibility of all fields in the gossip_digest_syn
message when logging permits.

This fixes a regression introduced in 4812a57f, which obscured these
fields and reduced debugging insight. Backporting is recommended for
improved observability.

Fixes #23142
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23155

(cherry picked from commit 2a9966a20e)

Closes scylladb/scylladb#23198
2025-03-30 11:57:15 +03:00
Kefu Chai
18d5af1cd3 storage_proxy: Prevent integer overflow in abstract_read_executor::execute
Fix UBSan abort caused by integer overflow when calculating time difference
between read and write operations. The issue occurs when:
1. The queried partition on replicas is not purgeable (has no recorded
   modified time)
2. Digests don't match across replicas
3. The system attempts to calculate timespan using missing/negative
   last_modified timestamps

This change skips cross-DC repair optimization when write timestamp is
negative or missing, as this optimization is only relevant for reads
occurring within write_timeout of a write.

Error details:
```
service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80
Aborting on shard 1, in scheduling group sl:default
```

Related to previous fix 39325cf which handled negative read_timestamp cases.

Fixes #23314
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23359

(cherry picked from commit ebf9125728)

Closes scylladb/scylladb#23386
2025-03-30 11:54:59 +03:00
Tomasz Grabiec
6cdd1cccdc test: tablets: Fix flakiness due to ungraceful shutdown
The test fails sporadically with:

cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}

That's because a server is stopped in the middle of the workload.

The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.

Fixes #20492

Closes scylladb/scylladb#23451

(cherry picked from commit 8e506c5a8f)

Closes scylladb/scylladb#23468
2025-03-28 14:56:39 +01:00
Anna Stuchlik
3f0e52a5ee doc: zero-token nodes and Arbiter DC
This commit adds documentation for zero-token nodes and an explanation
of how to use them to set up an arbiter DC to prevent a quorum loss
in multi-DC deployments.

The commit adds two documents:
- The one in Architecture describes zero-token nodes.
- The other in Cluster Management explains how to use them.

We need separate documents because zero-token nodes may be used
for other purposes in the future.

In addition, the documents are cross-linked, and the link is added
to the Create a ScyllaDB Cluster - Multi Data Centers (DC) document.

Refs https://github.com/scylladb/scylladb/pull/19684

Fixes https://github.com/scylladb/scylladb/issues/20294

Closes scylladb/scylladb#21348

(cherry picked from commit 9ac0aa7bba)

Closes scylladb/scylladb#23200
2025-03-10 10:52:13 +01:00
Piotr Dulikowski
9fc27b734f test: test_mv_topology_change: increase timeout for removenode
The test `test_mv_topology_change` is a regression test for
scylladb/scylladb#19529. The problem was that CL=ANY writes issued when
all replicas were down would be kept in memory until the timeout. In
particular, MV updates are CL=ANY writes and have a 5 minute timeout.
When doing topology operations for vnodes or when migrating tablet
replicas, the cluster goes through stages where the replica sets for
writes undergo changes, and the writes started with the old replica set
need to be drained first.

Because of the aforementioned MV updates, the removenode operation could
be delayed by 5 minutes or more. Therefore, the
`test_mv_topology_change` test uses a short timeout for the removenode
operation, i.e. 30s. Apparently, this is too low for the debug mode and
the test has been observed to time out even though the removenode
operation is progressing fine.

Increase the timeout to 60s. This is the lowest timeout for the
removenode operation that we currently use among the in-repo tests, and
is lower than 5 minutes so the test will still serve its purpose.

Fixes: scylladb/scylladb#22953

Closes scylladb/scylladb#22958

(cherry picked from commit 43ae3ab703)

Closes scylladb/scylladb#23052
2025-03-04 16:04:00 +01:00
Wojciech Mitros
8d392229cc test: add test for schema registry maintaining base info for views
In this patch we test the behavior of schema registry in a few
scenarios where it was identified it could misbehave.

The first one is reverse schemas for views. Previously, SELECT
queries with reverse order on views could fail because we didn't
have base info in the registry for such schemas.

The second one is schemas that temporarily died in the registry.
This can happen when, while processing a query for a given schema
version, all related schema_ptrs were destroyed, but this schema
was requested before schema_registry::grace_period() has passed.
In this scenario, the base info would not be recovered, causing
errors.

(cherry picked from commit 74cbc77f50)
2025-03-03 13:57:45 +01:00
Wojciech Mitros
3a1d2cbeb6 schema_registry: avoid setting base info when getting the schema from registry
After the previous patches, the view schemas returned by schema registry
always have their base info set. As such, we no longer need to set it after
getting the view schema from the registry. This patch removes these
unnecessary updates.

(cherry picked from commit 3094ff7cbe)
2025-03-03 13:49:11 +01:00
Calle Wilund
ecbb765b3f tools: Add standard extensions and propagate to schema load
Fixes #22314

Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.

(cherry picked from commit 48fda00f12)
2025-02-27 17:46:28 +00:00
Calle Wilund
3bf553efa1 cql_test_env: Use add all extensions instead of individually
(cherry picked from commit 00b40eada3)
2025-02-27 17:46:28 +00:00
Calle Wilund
9500c8c9cb main: Move extensions adding to function
Easily called from elsewhere. The extensions we should always include (oxymoron?)

(cherry picked from commit 4aaf3df45e)
2025-02-27 17:46:28 +00:00
Calle Wilund
4c735abb62 tombstone_gc: Make validate work for tools
Don't crash if validation is done as part of loading a schema from file (schema.cql)

(cherry picked from commit e6aa09e319)
2025-02-27 17:46:28 +00:00
Wojciech Mitros
b0ab86a8ad schema_registry: update cached base schemas when updating a view
The schema registry now holds base schemas for view schemas.
The base schema may change without changing the view schema, so to
preserve the change in the schema registry, we also update the
base schema in the registry when updating the base info in the
view schema.

(cherry picked from commit 82f2e1b44c)
2025-02-25 16:55:03 +00:00
Wojciech Mitros
1cf288dd97 schema_registry: cache base schemas for views
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated.

To fix this, this patch adds the base schema to the registry, alongside
the view schema. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.

To store the base schema, the loader methods now have to return the base
schema alongside the view schema. At the same time, when loading into
the registry, we need to check whether we're loading a view schema, and if
so, we need to also provide the base schema. When inserting a regular table
schema, the base schema should be a disengaged optional.

(cherry picked from commit dfe3810f64)
2025-02-25 16:55:03 +00:00
Wojciech Mitros
40312f482e db: set base info before adding schema to registry
In the following patches, we'll assure that view schemas returned by the
schema registry always have base info set. To prepare for that, make sure
that the base info is always set before inserting it into schema registry,

(cherry picked from commit 6f11edbf3f)
2025-02-25 16:55:03 +00:00
Benny Halevy
9fd5909a5e token_group_based_splitting_mutation_writer: maybe_switch_to_new_writer: prevent double close
Currently, maybe_switch_to_new_writer resets _current_writer
only in a continuation after closing the current writer.
This leaves a window of vulnerability if close() yields,
and token_group_based_splitting_mutation_writer::close()
is called. Seeing the engaged _current_writer, close()
will call _current_writer->close() - which must be called
exactly once.

Solve this when switching to a new writer by resetting
_current_writer before closing it and potentially yielding.

Fixes #22715

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22922

(cherry picked from commit 29b795709b)

Closes scylladb/scylladb#22964
2025-02-23 14:27:11 +02:00
Botond Dénes
103a986eca Merge '[Backport 6.2] reader_concurrency_semaphore: set_notify_handler(): disable timeout ' from Scylladb[bot]
`set_notify_handler()` is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. The latter was not disabling the pre-existing timeout of the permit (if any), which would lead to premature eviction of the cache entry if the timeout was shorter than the TTL (which is typical).
Disable the timeout before setting the TTL to prevent premature eviction.

Fixes: https://github.com/scylladb/scylladb/issues/22629

Backport required to all active releases, they are all affected.

- (cherry picked from commit a3ae0c7cee)

- (cherry picked from commit 9174f27cc8)

Parent PR: #22701

Closes scylladb/scylladb#22751

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: set_notify_handler(): disable timeout
  reader_permit: mark check_abort() as const
2025-02-19 09:59:47 +02:00
Botond Dénes
1d4ea169e3 reader_concurrency_semaphore: set_notify_handler(): disable timeout
set_notify_handler() is called after a querier was inserted into the
querier cache. It has two purposes: set a callback for eviction and set
a TTL for the cache entry. The latter was not disabling the
pre-existing timeout of the permit (if any), which would lead to
premature eviction of the cache entry if the timeout was shorter than
the TTL (which is typical).
Disable the timeout before setting the TTL to prevent premature
eviction.

Fixes: scylladb/scylladb#22629
(cherry picked from commit 9174f27cc8)
2025-02-18 04:48:22 -05:00
Botond Dénes
86c9bc778a tools/scylla-nodetool: netstats: don't assume both senders and receivers
The code currently assumes that a session has both sender and receiver
streams, but it is possible to have just one or the other.
Change the test to include this scenario and remove this assumption from
the code.

Fixes: #22770

Closes scylladb/scylladb#22771

(cherry picked from commit 87e8e00de6)

Closes scylladb/scylladb#22873
2025-02-18 10:35:10 +02:00
Botond Dénes
4acb366a28 service/storage_proxy: schedule_repair(): materialize the range into a vector
Said method passes down its `diff` input to `mutate_internal()`, after
some std::ranges massaging. Said massaging is destructive -- it moves
items from the diff. If the output range is iterated-over multiple
times, only the first time will see the actual output, further
iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens:
`mutate_internal()` iterates over the range multiple times, first to log
its content, then to pass it down the stack. This ends up resulting in
a range with moved-from elements being passed down and consequently write
handlers being created with nullopt mutations.

Make the range re-entrant by materializing it into a vector before
passing it to `mutate_internal()`.

Fixes: scylladb/scylladb#21907
Fixes: scylladb/scylladb#21714

Closes scylladb/scylladb#21910

(cherry picked from commit 7150442f6a)

Closes scylladb/scylladb#22853
2025-02-18 10:34:42 +02:00
Botond Dénes
24eb8f49ba service: query_pager: fix last-position for filtering queries
On short-pages, cut short because of a tombstone prefix.
When page-results are filtered and the filter drops some rows, the
last-position is taken from the page visitor, which does the filtering.
This means that last partition and row position will be that of the last
row the filter saw. This will not match the last position of the
replica, when the replica cut the page due to tombstones.
When fetching the next page, this means that all the tombstone suffix of
the last page, will be re-fetched. Worse still: the last position of the
next page will not match that of the saved reader left on the replica, so
the saved reader will be dropped and a new one created from scratch.
This wasted work will show up as elevated tail latencies.
Fix by always taking the last position from raw query results.

Fixes: #22620

Closes scylladb/scylladb#22622

(cherry picked from commit 7ce932ce01)

Closes scylladb/scylladb#22718
2025-02-13 15:15:24 +02:00
Botond Dénes
8e6648870b reader_concurrency_semaphore: with_permit(): proper clean-up after queue overload
with_permit() creates a permit, with a self-reference, to avoid
attaching a continuation to the permit's run function. This
self-reference is used to keep the permit alive, until the execution
loop processes it. This self reference has to be carefully cleared on
error-paths, otherwise the permit will become a zombie, effectively
leaking memory.
Instead of trying to handle all loose ends, get rid of this
self-reference altogether: ask caller to provide a place to save the
permit, where it will survive until the end of the call. This makes the
call-site a little bit less nice, but it gets rid of a whole class of
possible bugs.

Fixes: #22588

Closes scylladb/scylladb#22624

(cherry picked from commit f2d5819645)

Closes scylladb/scylladb#22703
2025-02-13 15:03:57 +02:00
Botond Dénes
77696b1e43 reader_concurrency_semaphore: foreach_permit(): include _inactive_reads
So inactive reads show up in semaphore diagnostics dumps (currently the
only non-test user of this method).

Fixes: #22574

Closes scylladb/scylladb#22575

(cherry picked from commit e1b1a2068a)

Closes scylladb/scylladb#22610
2025-02-13 15:03:23 +02:00
Aleksandra Martyniuk
3497ba7f60 replica: mark registry entry as synch after the table is added
When a replica gets a write request it performs get_schema_for_write,
which waits until the schema is synced. However, database::add_column_family
marks a schema as synced before the table is added. Hence, the write may
see the schema as synced, but hit no_such_column_family as the table
hasn't been added yet.

Mark schema as synced after the table is added to database::_tables_metadata.

Fixes: #22347.

Closes scylladb/scylladb#22348

(cherry picked from commit 328818a50f)

Closes scylladb/scylladb#22603
2025-02-13 15:02:59 +02:00
Aleksandra Martyniuk
ade0fe2d7a nodetool: tasks: print empty string for start_time/end_time if unspecified
If start_time/end_time is unspecified for a task, the task_manager API
returns the epoch. Nodetool prints this value in the task status.

Fix nodetool tasks commands to print empty string for start_time/end_time
if it isn't specified.

Modify nodetool tasks status docs to show empty end_time.

Fixes: #22373.

Closes scylladb/scylladb#22370

(cherry picked from commit 477ad98b72)

Closes scylladb/scylladb#22600
2025-02-13 13:26:54 +02:00
Jenkins Promoter
72cf5ef576 Update ScyllaDB version to: 6.2.4 2025-02-09 16:52:35 +02:00
Botond Dénes
2978ed58a2 reader_permit: mark check_abort() as const
All it does is read one field, making it const makes using it easier.

(cherry picked from commit a3ae0c7cee)
2025-02-09 00:32:13 +00:00
Tomasz Grabiec
6922acb69f Merge '[Backport 6.2] split: run set_split_mode() on all storage groups during all_storage_groups_split()' from Scylladb[bot]
`tablet_storage_group_manager::all_storage_groups_split()` calls `set_split_mode()` for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using `std::ranges::all_of()` which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (`set_split_mode()`) returning false. `set_split_mode()` creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups.

The missing split compaction groups are later created in `tablet_storage_group_manager::split_all_storage_groups()` which also calls `set_split_mode()`, and that is the reason why split completes successfully. The problem is that
`tablet_storage_group_manager::all_storage_groups_split()` runs under a group0 guard, but
`tablet_storage_group_manager::split_all_storage_groups()` does not. This can cause problems with operations which should be mutually exclusive with compaction group creation, e.g. DROP TABLE/DROP KEYSPACE.

Fixes #22431

This is a bugfix and should be back ported to versions with tablets: 6.1 6.2 and 2025.1

- (cherry picked from commit 24e8d2a55c)

- (cherry picked from commit 8bff7786a8)

Parent PR: #22330

Closes scylladb/scylladb#22559

* github.com:scylladb/scylladb:
  test: add reproducer and test for fix to split ready CG creation
  table: run set_split_mode() on all storage groups during all_storage_groups_split()
2025-02-07 14:22:57 +01:00
Tomasz Grabiec
61e303a3e3 locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
In that case, new_racks will be used, but when we discover no
candidates, we try to pop from existing_racks.

Fixes #22625

Closes scylladb/scylladb#22652

(cherry picked from commit e22e3b21b1)

Closes scylladb/scylladb#22721
2025-02-06 16:47:14 +01:00
Avi Kivity
8ede62d288 Update seastar submodule (hwloc failures on some AWS instances)
* seastar ec5da7a606...e40388c4c7 (1):
  > resource: fallback to sysconf when failed to detect memory size from hwloc

Fixes #22382
2025-02-04 16:29:45 +02:00
Avi Kivity
4ac9c710fc Merge '[Backport 6.2] api: task_manager: do not unregister finish task when its status is queried' from Scylladb[bot]
Currently, when the status of a task is queried and the task is already finished,
it gets unregistered. Getting the status shouldn't be a one-time operation.

Stop removing the task after its status is queried. Adjust tests not to rely
on this behavior. Add task_manager/drain API and nodetool tasks drain
command to remove finished tasks in the module.

Fixes: https://github.com/scylladb/scylladb/issues/21388.

It's a fix to task_manager API, should be backported to all branches

- (cherry picked from commit e37d1bcb98)

- (cherry picked from commit 18cc79176a)

Parent PR: #22310

Closes scylladb/scylladb#22597

* github.com:scylladb/scylladb:
  api: task_manager: do not unregister tasks on get_status
  api: task_manager: add /task_manager/drain
2025-02-03 23:04:31 +02:00
Avi Kivity
34fa9bd586 Merge '[Backport 6.2] Simplify loading_cache_test and use manual_clock' from Scylladb[bot]
This series exposes a Clock template parameter for loading_cache so that the test could use
the manual_clock rather than the lowres_clock, since relying on the latter is flaky.

In addition, the test load function is simplified to sleep some small random time and co_return the expected string,
rather than reading it from a real file, since the latter's timing might also be flaky, and it is out of scope for this test.

Fixes #20322

* The test was flaky forever, so backport is required for all live versions.

- (cherry picked from commit b509644972)

- (cherry picked from commit 934a9d3fd6)

- (cherry picked from commit d68829243f)

- (cherry picked from commit b258f8cc69)

- (cherry picked from commit 0841483d68)

- (cherry picked from commit 32b7cab917)

Parent PR: #22064

Closes scylladb/scylladb#22640

* github.com:scylladb/scylladb:
  tests: loading_cache_test: use manual_clock
  utils: loading_cache: make clock_type a template parameter
  test: loading_cache_test: use function-scope loader
  test: loading_cache_test: simulate loader using sleep
  test: lib: eventually: add sleep function param
  test: lib: eventually: make *EVENTUALLY_EQUAL inline functions
2025-02-03 22:56:31 +02:00
Benny Halevy
79bff0885c tests: loading_cache_test: use manual_clock
Relying on a real-time clock like lowres_clock
can be flaky (in particular in debug mode).
Use manual_clock instead to harden the test against
timing issues.

Fixes #20322

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 32b7cab917)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-03 16:15:46 +02:00
Benny Halevy
abf8f44e03 utils: loading_cache: make clock_type a template parameter
So the unit test can use manual_clock rather than lowres_clock
which can be flaky (in particular in debug mode).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0841483d68)
2025-02-03 16:02:37 +02:00
Benny Halevy
00f1dcfd09 test: loading_cache_test: use function-scope loader
Rather than a global function, accessing a thread-local `load_count`.
The thread-local load_count cannot be used when multiple test
cases run in parallel.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b258f8cc69)
2025-02-03 16:01:53 +02:00
Benny Halevy
b0166a3a9c test: loading_cache_test: simulate loader using sleep
This test isn't about reading values from file,
but rather it's about the loading_cache.
Reading from the file can sometimes take longer than
the expected refresh times, causing flakiness (see #20322).

Rather than reading a string from a real file, just
sleep a random, short time, and co_return the string.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d68829243f)
2025-02-03 16:00:51 +02:00
Benny Halevy
7addc3454d test: lib: eventually: add sleep function param
To allow support for manual_clock instead of seastar::sleep.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 934a9d3fd6)
2025-02-03 16:00:47 +02:00
Benny Halevy
9d5e3f050e test: lib: eventually: make *EVENTUALLY_EQUAL inline functions
rather than macros.

This is a first cleanup step before adding a sleep function
parameter to support also manual_clock.

Also, add a call to BOOST_REQUIRE_EQUAL/BOOST_CHECK_EQUAL,
respectively, to make an error more visible in the test log
since those entry points print the offending values
when not equal.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b509644972)
2025-02-03 15:50:22 +02:00
Michael Litvak
c7421a8804 view_builder: fix loop in view builder when tokens are moved
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.

A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.

The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.

To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.

Fixes scylladb/scylladb#21829

Closes scylladb/scylladb#22493

(cherry picked from commit 6d34125eb7)

Closes scylladb/scylladb#22606
2025-02-03 13:27:28 +01:00
Avi Kivity
700402a7bb seastar: point submodule at scylla-seastar.git
This allows backporting commits to seastar.
2025-01-31 19:49:15 +02:00
Aleksandra Martyniuk
424fab77d2 api: task_manager: do not unregister tasks on get_status
Currently, /task_manager/task_status_recursive/{task_id} and
/task_manager/task_status/{task_id} unregister the queried task if it
has already finished.

The status should not disappear after being queried. Do not unregister
finished task when its status or recursive status is queried.

(cherry picked from commit 18cc79176a)
2025-01-31 10:12:46 +01:00
Aleksandra Martyniuk
5d16157936 api: task_manager: add /task_manager/drain
In the following patches, get_status won't be unregistering finished
tasks. However, tests need a way to drop a task, so that
they can manipulate only the tasks for operations that were
invoked by these tests.

Add /task_manager/drain/{module} to unregister all finished tasks
from the module. Add respective nodetool command.

(cherry picked from commit e37d1bcb98)
2025-01-31 10:11:57 +01:00
Aleksandra Martyniuk
e7d891b629 repair: add repair_service gate
In main.cc storage_service is started before and stopped after
repair_service. storage_service keeps a reference to sharded
repair_service and calls its methods, but nothing ensures that
repair_service's local instance would be alive for the whole
execution of the method.

Add a gate to repair_service and enter it in storage_service
before executing methods on local instances of repair_service.

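The invariant a gate enforces can be modeled with a small synchronous sketch (seastar::gate is asynchronous; this is only an illustration of the counting discipline): callers enter before using the service and leave afterwards, and the service may only be torn down once the gate is closed and drained.

```cpp
#include <cassert>
#include <stdexcept>

// Minimal synchronous model of a gate: enter()/leave() bracket each use of
// the service; close() forbids new entries; destruction is safe only once
// the gate is closed and the in-flight count has drained to zero.
class gate {
    int _count = 0;
    bool _closed = false;
public:
    void enter() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_count;
    }
    void leave() { --_count; }
    bool try_close() {  // returns true when it is safe to destroy the service
        _closed = true;
        return _count == 0;
    }
    int count() const { return _count; }
};
```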
Fixes: #21964.

Closes scylladb/scylladb#22145

(cherry picked from commit 32ab58cdea)

Closes scylladb/scylladb#22318
2025-01-30 11:39:46 +02:00
Anna Stuchlik
e57e3b4039 doc: add troubleshooting removal with --autoremove-ubuntu
This commit adds a troubleshooting article on removing ScyllaDB
with the --autoremove option.

Fixes https://github.com/scylladb/scylladb/issues/21408

Closes scylladb/scylladb#21697

(cherry picked from commit 8d824a564f)

Closes scylladb/scylladb#22232
2025-01-29 20:24:24 +02:00
Kefu Chai
2514f50f7f docs: fix monospace formatting for rm command
Add missing space before `rm` to ensure proper rendering
in monospace font within documentation.

Fixes scylladb/scylladb#22255
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21576

(cherry picked from commit 6955b8238e)

Closes scylladb/scylladb#22257
2025-01-29 20:23:22 +02:00
Botond Dénes
e74c8372a9 tools/scylla-sstable: dump-statistics: fix handling of {min,max}_column_names
Said fields in statistics are of type
`disk_array<uint32_t, disk_string<uint16_t>>` and currently are handled
as array of regular strings. However these fields store exploded
clustering keys, so the elements store binary data and converting to
string can yield invalid UTF-8 characters that certain JSON parsers (jq,
or python's json) can choke on. Fix this by treating them as binary and
using `to_hex()` to convert them to string. This requires some massaging
of the json_dumper: passing field offset to all visit() methods and
using a caller-provided disk-string to sstring converter to convert disk
strings to sstring, so in the case of statistics, these fields can be
intercepted and properly handled.

While at it, the type of these fields is also fixed in the
documentation.

Before:

    "min_column_names": [
      "��Z���\u0011�\u0012ŷ4^��<",
      "�2y\u0000�}\u007f"
    ],
    "max_column_names": [
      "��Z���\u0011�\u0012ŷ4^��<",
      "}��B\u0019l%^"
    ],

After:

    "min_column_names": [
      "9dd55a92bc8811ef12c5b7345eadf73c",
      "80327900e2827d7f"
    ],
    "max_column_names": [
      "9dd55a92bc8811ef12c5b7345eadf73c",
      "7df79242196c255e"
    ],

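The encoding step is straightforward; a minimal standalone sketch of a `to_hex()` like the one the commit relies on (the real Scylla utility lives elsewhere, this is illustrative) shows why it sidesteps the JSON problem: hex output is always plain ASCII, so arbitrary clustering-key bytes can never become invalid UTF-8.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Encode arbitrary binary data as lowercase hex. Unlike emitting the raw
// bytes, the result is guaranteed to be valid UTF-8 (in fact, ASCII), so it
// is always safe to embed in a JSON string.
std::string to_hex(const std::vector<uint8_t>& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (uint8_t b : bytes) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0f]);
    }
    return out;
}
```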
Fixes: #22078

Closes scylladb/scylladb#22225

(cherry picked from commit f899f0e411)

Closes scylladb/scylladb#22296
2025-01-29 20:22:52 +02:00
Botond Dénes
aa16d736dc Merge '[Backport 6.2] sstable_directory: do not load remote unshared sstables in process_descriptor()' from Lakshmi Narayanan
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.

This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.

Fixes #21015

This fixes a regression and should be backported.

- (cherry picked from commit d2ba45a01f)

- (cherry picked from commit 6e3ecc70a6)

- (cherry picked from commit 63100b34da)

Parent PR: #22263

Closes scylladb/scylladb#22376

* github.com:scylladb/scylladb:
  sstable_directory: do not load remote sstables in process_descriptor
  sstable_directory: reintroduce `get_shards_for_this_sstable()`
2025-01-29 20:20:50 +02:00
Botond Dénes
fa73a8da34 replica: remove noexcept from token -> tablet resolution path
The methods to resolve a key/token/range to a table are all noexcept.
Yet the method below all of these, `storage_group_for_id()` can throw.
This means that if, due to any mistake, a tablet without a local replica is
looked up, the result is a crash, as the exception bubbles up into the
noexcept methods.
There is no value in pretending that looking up the tablet replica is
noexcept; remove the noexcept specifiers so that any bad lookup only
fails the operation at hand and doesn't crash the node. This is
especially relevant to replace, which still has a window where writes
can arrive for tablets that don't (yet) have a local replica. Currently,
this results in a crash. After this patch, this will only fail the
writes and the replace can move on.

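The general hazard can be shown in a minimal sketch (the `tablet_map` and message here are illustrative, not Scylla's actual types): when a callee can throw, a `noexcept` caller turns the exception into `std::terminate()` and a process crash, whereas a plain caller can propagate the error and fail just that one operation.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Illustrative lookup that can legitimately fail (assumed: only tablet 42
// has a local replica in this toy model).
struct tablet_map {
    int storage_group_for_id(int id) const {
        if (id != 42) {
            throw std::runtime_error("no local replica for tablet");
        }
        return 0;
    }
};

// Post-fix behavior: the caller is NOT noexcept, so a bad lookup becomes a
// failed operation instead of escalating to std::terminate().
std::string handle_write(const tablet_map& tm, int tablet_id) {
    try {
        tm.storage_group_for_id(tablet_id);
        return "ok";
    } catch (const std::exception& e) {
        return std::string("failed: ") + e.what();
    }
}
```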
Fixes: #21480

Closes scylladb/scylladb#22251

(cherry picked from commit 55963f8f79)

Closes scylladb/scylladb#22379
2025-01-29 20:19:52 +02:00
Avi Kivity
10abba4c64 Merge '[Backport 6.2] repair: handle no_such_keyspace in repair preparation phase' from null
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.

Skip the keyspace repair if no_such_keyspace is thrown during preparations.

Fixes: #22073.

Requires backport to 6.1 and 6.2 as they contain the bug

- (cherry picked from commit bfb1704afa)

- (cherry picked from commit 54e7f2819c)

Parent PR: #22473

Closes scylladb/scylladb#22541

* github.com:scylladb/scylladb:
  test: add test to check if repair handles no_such_keyspace
  repair: handle keyspace dropped
2025-01-29 19:52:38 +02:00
Michael Litvak
36b1a486de cdc: fix handling of new generation during raft upgrade
During raft upgrade, a node may gossip about a new CDC generation that
was propagated through raft. The node that receives the generation by
gossip may have not applied the raft update yet, and it will not find
the generation in the system tables. We should consider this error
non-fatal and retry to read until it succeeds or becomes obsolete.

Another issue is when we fail with a "fatal" exception and not retrying
to read, the cdc metadata is left in an inconsistent state that causes
further attempts to insert this CDC generation to fail.

What happens is we complete preparing the new generation by calling `prepare`,
we insert an empty entry for the generation's timestamp, and then we fail. The
next time we try to insert the generation, we skip inserting it because we see
that it already has an entry in the metadata and we determine that
there's nothing to do. But this is wrong, because the entry is empty,
and we should continue to insert the generation.

To fix it, we change `prepare` to return `true` when the entry already
exists but it's empty, indicating we should continue to insert the
generation.

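The corrected `prepare` semantics can be sketched with a toy model (the metadata shape and names are illustrative): an absent entry and an existing-but-empty entry both mean "continue inserting"; only a populated entry means the generation is already in place.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model: generation timestamp -> list of streams for that generation.
using generation_metadata = std::map<long, std::vector<std::string>>;

// Returns true when the caller should (continue to) insert the generation.
bool prepare(generation_metadata& md, long ts) {
    auto it = md.find(ts);
    if (it == md.end()) {
        md[ts] = {};  // reserve an (empty) entry for this timestamp
        return true;
    }
    // Fixed behavior: an existing but EMPTY entry means a prior attempt
    // failed mid-way, so insertion must still proceed.
    return it->second.empty();
}
```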
Fixes scylladb/scylladb#21227

Closes scylladb/scylladb#22093

(cherry picked from commit 4f5550d7f2)

Closes scylladb/scylladb#22545
2025-01-29 19:52:18 +02:00
Ferenc Szili
5a74ded582 test: add reproducer and test for fix to split ready CG creation
This adds a reproducer for #22431

In cases where a tablet storage group manager had more than one storage
group, it was possible to create compaction groups outside the group0
guard, which could create problems with operations which should exclude
with compaction group creation.

(cherry picked from commit 8bff7786a8)
2025-01-29 14:46:37 +01:00
Ferenc Szili
0ea5a1fc48 table: run set_split_mode() on all storage groups during all_storage_groups_split()
tablet_storage_group_manager::all_storage_groups_split() calls set_split_mode()
for each of its storage groups to create split ready compaction groups. It does
this by iterating through storage groups using std::ranges::all_of() which is
not guaranteed to iterate through the entire range, and will stop iterating on
the first occurrence of the predicate (set_split_mode()) returning false.
set_split_mode() creates the split compaction groups and returns false if the
storage group's main compaction group or merging groups are not empty. This
means that in cases where the tablet storage group manager has non-empty
storage groups, we could have a situation where split compaction groups are not
created for all storage groups.

The missing split compaction groups are later created in
tablet_storage_group_manager::split_all_storage_groups() which also calls
set_split_mode(), and that is the reason why split completes successfully. The
problem is that tablet_storage_group_manager::all_storage_groups_split() runs
under a group0 guard, and tablet_storage_group_manager::split_all_storage_groups()
does not. This can cause problems with operations that must be mutually
exclusive with compaction group creation, e.g. DROP TABLE/DROP KEYSPACE.

(cherry picked from commit 24e8d2a55c)
2025-01-29 14:42:17 +01:00
Aleksandra Martyniuk
be67b4d634 test: add test to check if repair handles no_such_keyspace
(cherry picked from commit 54e7f2819c)
2025-01-28 21:50:10 +00:00
Aleksandra Martyniuk
2143f8ccb6 repair: handle keyspace dropped
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.

Skip the keyspace repair if no_such_keyspace is thrown during preparations.

(cherry picked from commit bfb1704afa)
2025-01-28 21:50:10 +00:00
Kefu Chai
7da4223411 compress: fix compressor initialization order by making namespace_prefix a function
Fixes a race condition where COMPRESSOR_NAME in zstd.cc could be
initialized before compressor::namespace_prefix due to undefined
global variable initialization order across translation units. This
was causing ZstdCompressor to be unregistered in release builds,
making it impossible to create tables with Zstd compression.

Replace the global namespace_prefix variable with a function that
returns the fully qualified compressor name. This ensures proper
initialization order and fixes the registration of the ZstdCompressor.

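This is the classic "construct on first use" idiom. A minimal sketch (the prefix string below is an assumed example, not necessarily Scylla's actual value): a function-local static is initialized on first call, so its readiness no longer depends on unspecified cross-translation-unit initialization order the way a namespace-scope global's does.

```cpp
#include <cassert>
#include <string>

// Construct-on-first-use: the static local is initialized the first time the
// function runs, which is safe even when called from another global's
// initializer in a different translation unit.
const std::string& namespace_prefix() {
    // Assumed example value, for illustration only.
    static const std::string prefix = "org.apache.cassandra.io.compress.";
    return prefix;
}

std::string qualified_name(const std::string& short_name) {
    return namespace_prefix() + short_name;
}
```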
Fixes scylladb/scylladb#22444
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22451

(cherry picked from commit 4a268362b9)

Closes scylladb/scylladb#22510
2025-01-27 19:50:23 +02:00
Calle Wilund
2b40f405f0 commitlog: Fix assertion in oversized_alloc
Fixes #20633

Cannot assert on actual request_controller when releasing permit, as the
release, if we have waiters in queue, will subtract some units to hand to them.
Instead assert on permit size + waiter status (and if zero, also controller value)

* v2 - use SCYLLA_ASSERT

(cherry picked from commit 58087ef427)

Closes scylladb/scylladb#22455
2025-01-26 16:46:35 +02:00
Kamil Braun
399325f0f0 Merge '[Backport 6.2] raft: Handle non-critical config update errors when changing voter status.' from Sergey Z
When a node that bootstrapped and joined a cluster as a non-voter changes its role to a voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried.

Fixes scylladb/scylladb#20814

Backport: This issue occurs frequently and disrupts the CI workflow to some extent. Backports are needed for versions 6.1 and 6.2.

- (cherry picked from commit 775411ac56)

- (cherry picked from commit 16053a86f0)

- (cherry picked from commit 8c48f7ad62)

- (cherry picked from commit 3da4848810)

- (cherry picked from commit 228a66d030)

Parent PR: #22253

Closes scylladb/scylladb#22358

* github.com:scylladb/scylladb:
  raft: refactor `remove_from_raft_config` to use a timed `modify_config` call.
  raft: Refactor functions using `modify_config` to use a common wrapper for retrying.
  raft: Handle non-critical config update errors when changing status to voter.
  test: Add test to check that a node does not fail on unknown commit status error when starting up.
  raft: Add run_op_with_retry in raft_group0.
2025-01-24 17:05:50 +01:00
Sergey Zolotukhin
9730f98d34 raft: refactor remove_from_raft_config to use a timed modify_config call.
To avoid potential hangs during the `remove_from_raft_config` operation, use a timed `modify_config` call.
This ensures the operation doesn't get stuck indefinitely.

(cherry picked from commit 228a66d030)
2025-01-22 09:54:28 +01:00
Sergey Zolotukhin
d419fb4a0c raft: Refactor functions using modify_config to use a common wrapper
for retrying.

There are several places in `raft_group0` where almost identical code is
used for retrying `modify_config` in case of `commit_status_unknown`
error. To avoid code duplication all these places were changed to
use a new wrapper `run_op_with_retry`.

(cherry picked from commit 3da4848810)
2025-01-22 09:54:26 +01:00
Lakshmi Narayanan Sreethar
e0189ccac5 sstable_directory: do not load remote sstables in process_descriptor
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.

This commit restores that change - the shard id of an sstable is deduced by
reading minimally from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstables are loaded only in their respective shards.

Fixes #21015

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 63100b34da)
2025-01-21 00:48:34 +05:30
Lakshmi Narayanan Sreethar
8191f5d0f4 sstable_directory: reintroduce get_shards_for_this_sstable()
Reintroduce `get_shards_for_this_sstable()` that was removed in commit
ad375fbb. This will be used in the following patch to ensure that an
sstable is loaded only once.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d2ba45a01f)
2025-01-21 00:48:34 +05:30
Nadav Har'El
bff9ddde12 Merge '[Backport 6.2] view_builder: write status to tables before starting to build' from null
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.

Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.

Fixes #20638

The PR contains a few additional small fixes in separate commits related to the view build status table.

It addresses flakiness issues in tests that use the view build status table to determine when view building is complete. The table may be in an incorrect state due to these issues, having a row with status STARTED when it actually finished building the view, which will cause us to wait in `wait_for_view` until it times out.

For testing I used a test similar to `test_view_build_status_with_replace_node`, but it only creates the views and calls `wait_for_view`. Without these commits it failed in 4/1024 runs, and with the commits it passed 2048/2048.

backport to fix the bugs that affects previous versions and improve CI stability

- (cherry picked from commit b1be2d3c41)

- (cherry picked from commit 1104411f83)

- (cherry picked from commit 7a6aec1a6c)

Parent PR: #22307

Closes scylladb/scylladb#22356

* github.com:scylladb/scylladb:
  view_builder: hold semaphore during entire startup
  view_builder: pass view name by value to write_view_build_status
  view_builder: write status to tables before starting to build
2025-01-19 15:36:44 +02:00
Michael Litvak
e8fde3b0a3 view_builder: hold semaphore during entire startup
Guard the whole view builder startup routine by holding the semaphore
until it's done instead of releasing it early, so that it's not
intercepted by migration notifications.

(cherry picked from commit 7a6aec1a6c)
2025-01-19 11:03:53 +02:00
Michael Litvak
a7f8842776 view_builder: pass view name by value to write_view_build_status
The function write_view_build_status takes two lambda functions and
chooses which of them to run depending on the upgrade state. It might
run both of them.

The parameters ks_name and view_name should be passed by value instead
of by reference because they are moved inside each lambda function.
Otherwise, if both lambdas are run, the second call operates on invalid
values that were moved.

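The hazard and the fix can be demonstrated in a standalone sketch (the function body here is illustrative, not the real write_view_build_status): because each lambda moves its captured strings, the arguments must reach the lambdas as independent copies, which pass-by-value parameters plus by-value captures guarantee.

```cpp
#include <cassert>
#include <string>

// Parameters are taken BY VALUE, so each lambda can own an independent copy
// and moving inside one lambda cannot invalidate what the other one uses.
std::string write_status_both(std::string ks_name, std::string view_name) {
    std::string log;
    // Copies made here, before the second lambda moves the parameters.
    auto legacy = [ks = ks_name, view = view_name, &log]() mutable {
        log += "legacy:" + std::move(ks) + "." + std::move(view) + ";";
    };
    auto raft = [ks = std::move(ks_name), view = std::move(view_name),
                 &log]() mutable {
        log += "raft:" + std::move(ks) + "." + std::move(view) + ";";
    };
    legacy();  // during upgrade, both paths may run
    raft();
    return log;
}
```

Had the parameters been references into a single object moved by the first lambda, the second lambda would have operated on moved-from (empty) strings.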
(cherry picked from commit 1104411f83)
2025-01-19 11:03:53 +02:00
Piotr Dulikowski
b88e0f8f74 Merge '[Backport 6.2] main, view: Pair view builder drain with its start' from null
In this PR, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:

(a) The following things happened in order:

    1. The view builder would be constructed.
    2. Right after that, a deferred lambda would be created to stop the
       view builder during shutdown.
    3. group0_service would be started.
    4. A deferred lambda stopping group0_service would be created right
       after that.
    5. The view builder would be started.

(b) Because the view builder depends on group0_client, it couldn't be
    started before starting group0_service. On the other hand, other
    services depend on the view builder, e.g. the stream manager. That
    makes changing the order of initialization a difficult problem,
    so we want to avoid doing that unless we're sure it's the right
    choice.

(c) Since the view builder uses group0_client, there was a possibility
    of running into a segmentation fault issue in the following
    scenario:

    1. A call to `view_builder::mark_view_build_success()` is issued.
    2. We stop group0_service.
    3. `view_builder::mark_view_build_success()` calls
       `announce_with_raft()`, which leads to a use-after-free because
       group0_service has already been destroyed.

      This very scenario took place in scylladb/scylladb#20772.

Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534,
so we revert that patch. These changes are a second attempt
at the problem, aiming to solve it in a safer manner.

The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aforementioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.

Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.

Backport: The patch I'm reverting made it to 6.2, so we want to backport this one there too.

Fixes scylladb/scylladb#20772
Fixes scylladb/scylladb#21534

- (cherry picked from commit a5715086a4)

- (cherry picked from commit 06ce976370)

- (cherry picked from commit d1f960eee2)

Parent PR: #21909

Closes scylladb/scylladb#22331

* github.com:scylladb/scylladb:
  test/topology_custom: Add test for Scylla with disabled view building
  main, view: Pair view builder drain with its start
  Revert "main,cql_test_env: start group0_service before view_builder"
2025-01-17 09:57:36 +01:00
Sergey Zolotukhin
2ea97d8c19 raft: Handle non-critical config update errors when changing status
to voter.

When a node is bootstrapped and joins a cluster as a non-voter, errors can occur while committing
a new Raft record, for instance, if the Raft leader changes during this time. These errors are not
critical and should not cause a node crash, as the action can be retried.

Fixes scylladb/scylladb#20814

(cherry picked from commit 8c48f7ad62)
2025-01-16 20:09:31 +00:00
Sergey Zolotukhin
36ff3e8f5f test: Add test to check that a node does not fail on unknown commit status
error when starting up.

Test that a node starts successfully if, while joining a cluster and becoming a voter, it
receives an unknown commit status error.

Test for scylladb/scylladb#20814

(cherry picked from commit 16053a86f0)
2025-01-16 20:09:31 +00:00
Sergey Zolotukhin
325cdd3ebc raft: Add run_op_with_retry in raft_group0.
Since retries are often needed when calling `modify_config`, to avoid
code duplication a function wrapper that allows
a function to be called with automatic retries in case of failures
was added.

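A hedged sketch of such a wrapper (the real raft_group0 version is a Seastar coroutine; the error type and names here are illustrative): re-invoke the operation while it fails with a retryable error, up to a bounded number of attempts.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Illustrative retryable error, standing in for commit_status_unknown.
struct commit_status_unknown : std::runtime_error {
    commit_status_unknown() : std::runtime_error("commit status unknown") {}
};

// Run op, retrying on the retryable error; rethrow after max_attempts.
// Returns the number of attempts it took to succeed.
int run_op_with_retry(const std::function<void()>& op, int max_attempts = 5) {
    for (int attempt = 1; ; ++attempt) {
        try {
            op();
            return attempt;
        } catch (const commit_status_unknown&) {
            if (attempt == max_attempts) {
                throw;  // give up eventually instead of looping forever
            }
        }
    }
}
```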
(cherry picked from commit 775411ac56)
2025-01-16 20:09:30 +00:00
Michael Litvak
c5bdc9e58f view_builder: write status to tables before starting to build
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.

Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.

Fixes scylladb/scylladb#20638

(cherry picked from commit b1be2d3c41)
2025-01-16 20:08:36 +00:00
Kamil Braun
cbef20e977 Merge '[Backport 6.2] Fix possible data corruption due to token keys clashing in read repair.' from Sergey
This update addresses an issue in the mutation diff calculation algorithm used during read repair. Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated based on the Murmur3 hash function, it could generate duplicate values for different partition keys, causing corruption in the affected rows' values.

Fixes scylladb/scylladb#19101

Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2

- (cherry picked from commit e577f1d141)

- (cherry picked from commit 39785c6f4e)

- (cherry picked from commit 155480595f)

Parent PR: #21996

Closes scylladb/scylladb#22298

* github.com:scylladb/scylladb:
  storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function.
  storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap.
  test: Add test case for checking read repair diff calculation when having conflicting keys.
2025-01-16 17:13:12 +01:00
Dawid Mędrek
833ea91940 test/topology_custom: Add test for Scylla with disabled view building
Before this commit, there doesn't seem to have been a test verifying that
starting and shutting down Scylla behave correctly when the configuration
option `view_building` is set to false. In these changes, we add one.

(cherry picked from commit d1f960eee2)
2025-01-16 14:12:31 +01:00
Dawid Mędrek
1200d3b735 main, view: Pair view builder drain with its start
In these changes, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:

(a) The following things happened in order:

    1. The view builder would be constructed.
    2. Right after that, a deferred lambda would be created to stop the
       view builder during shutdown.
    3. group0_service would be started.
    4. A deferred lambda stopping group0_service would be created right
       after that.
    5. The view builder would be started.

(b) Because the view builder depends on group0_client, it couldn't be
    started before starting group0_service. On the other hand, other
    services depend on the view builder, e.g. the stream manager. That
    makes changing the order of initialization a difficult problem,
    so we want to avoid doing that unless we're sure it's the right
    choice.

(c) Since the view builder uses group0_client, there was a possibility
    of running into a segmentation fault issue in the following
    scenario:

    1. A call to `view_builder::mark_view_build_success()` is issued.
    2. We stop group0_service.
    3. `view_builder::mark_view_build_success()` calls
       `announce_with_raft()`, which leads to a use-after-free because
       group0_service has already been destroyed.

      This very scenario took place in scylladb/scylladb#20772.

Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534.
We reverted that change in the previous commit. These changes are the
second attempt at the problem, aiming to solve it in a safer manner.

The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aformentioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.

Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.

Fixes scylladb/scylladb#20772

(cherry picked from commit 06ce976370)
2025-01-16 12:37:04 +01:00
Dawid Mędrek
84b774515b Revert "main,cql_test_env: start group0_service before view_builder"
The patch solved a problem related to an initialization order
(scylladb/scylladb#20772), but we ran into another one: scylladb/scylladb#21534.
After moving the initialization of group0_service, it ended up being destroyed
AFTER the CDC generation service would. Since CDC generations are accessed
in `storage_service::topology_state_load()`:

```
for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
    rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
    co_await _cdc_gens.local().handle_cdc_generation(gen_id);
```

we started getting the following failure:

```
Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed.
```

We're reverting the patch to go back to a more stable version of Scylla
and in the following commit, we'll solve the original issue in a more
systematic way.

This reverts commit 7bad8378c7.

(cherry picked from commit a5715086a4)
2025-01-16 12:36:41 +01:00
Sergey Zolotukhin
06a8956174 test: Include parent test name in ScyllaClusterManager log file names.
Add the test file name to `ScyllaClusterManager` log file names alongside the test function name.
This avoids race conditions when tests with the same function names are executed simultaneously.

Fixes scylladb/scylladb#21807

Backport: not needed since this is a fix in the testing scripts.

Closes scylladb/scylladb#22192

(cherry picked from commit 2f1731c551)

Closes scylladb/scylladb#22249
2025-01-14 16:33:27 +01:00
Sergey Zolotukhin
f0f833e8ab storage_proxy/read_repair: Remove redundant 'schema' parameter from data_read_resolver::resolve
function.

The `data_read_resolver` class inherits from `abstract_read_resolver`, which already includes the
`schema_ptr _schema` member. Therefore, using a separate function parameter in `data_read_resolver::resolve`
initialized with the same variable in `abstract_read_executor` is redundant.

(cherry picked from commit 155480595f)
2025-01-14 14:43:52 +01:00
Sergey Zolotukhin
b04b6aad9e storage_proxy/read_repair: Use partition_key instead of token key for mutation
diff calculation hashmap.

This update addresses an issue in the mutation diff calculation algorithm used during read repair.
Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated based on
the Murmur3 hash function, it could generate duplicate values for different partition keys, causing
corruption in the affected rows' values.

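The failure mode generalizes to any map keyed by a hash of the real key. A standalone sketch (a deliberately weak hash stands in for the Murmur3 collision; names are illustrative): two distinct partition keys with equal tokens clobber each other, while keying by the partition key itself cannot collide.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Deliberately collision-prone stand-in for a token function:
// any two keys of equal length collide.
long weak_token(const std::string& pk) {
    return static_cast<long>(pk.size());
}

// Buggy: distinct partitions with equal tokens overwrite each other.
std::map<long, std::string> diff_by_token(
        const std::vector<std::pair<std::string, std::string>>& rows) {
    std::map<long, std::string> m;
    for (auto& [pk, value] : rows) {
        m[weak_token(pk)] = value;
    }
    return m;
}

// Fixed: the partition key itself is the map key, so no clobbering.
std::map<std::string, std::string> diff_by_key(
        const std::vector<std::pair<std::string, std::string>>& rows) {
    std::map<std::string, std::string> m;
    for (auto& [pk, value] : rows) {
        m[pk] = value;
    }
    return m;
}
```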
Fixes scylladb/scylladb#19101

(cherry picked from commit 39785c6f4e)
2025-01-14 14:37:47 +01:00
Sergey Zolotukhin
12ee41869a test: Add test case for checking read repair diff calculation when having
conflicting keys.

The test updates two rows with keys that result in a Murmur3 hash collision, which
is used to generate Scylla tokens. These tokens are involved in read repair diff
calculations. Due to the identical token values, a hash map key collision occurs.
Consequently, an incorrect value from the second row (with a different primary key)
is then sent for writing as 'repaired', causing data corruption.

(cherry picked from commit e577f1d141)
2025-01-13 22:05:32 +00:00
Kamil Braun
ef93c3a8d7 Merge '[Backport 6.2] cache_algorithm_test: fix flaky failures' from Michał Chojnowski
This series attempts to get rid of flakiness in cache_algorithm_test by solving two problems.

Problem 1:

The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.

Problem 2:

Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something Scylla does in the background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)

Fixes scylladb/scylladb#21536

(cherry picked from commit 1fffd976a4)
(cherry picked from commit 6caaead4ac)

Parent PR: scylladb/scylladb#21948

Closes scylladb/scylladb#22228

* github.com:scylladb/scylladb:
  cache_algorithm_test: harden against stats being confused by background activity
  cache_algorithm_test: fix a use of an uninitialized variable
2025-01-09 14:30:31 +01:00
Aleksandra Martyniuk
a59c4653fe repair: check tasks local to given shard
Currently, when task_manager_module::is_aborted is invoked on a given shard, it
checks the task map local to the caller's shard.

Fix the method to check the task map local to the given shard.

Fixes: #22156.

Closes scylladb/scylladb#22161

(cherry picked from commit a91e03710a)

Closes scylladb/scylladb#22197
2025-01-08 13:07:38 +02:00
Yaron Kaikov
2e87e317d9 .github/scripts/auto-backport.py: Add comment to PR when conflicts apply
When we open a PR with conflicts, the PR owner gets a notification about the assignment but has no idea whether the PR has conflicts (in Scylla this matters, since CI will not start on a draft PR).

Let's add a comment to notify the user we have conflicts

Closes scylladb/scylladb#21939

(cherry picked from commit 2e6755ecca)

Closes scylladb/scylladb#22190
2025-01-08 13:07:08 +02:00
Botond Dénes
f73f7c17ec Merge 'sstables_manager: do not reclaim unlinked sstables' from Lakshmi Narayanan Sreethar
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.

Added a testcase to verify the fix.

Fixes #21887

This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions.

Closes scylladb/scylladb#21895

* github.com:scylladb/scylladb:
  sstables_manager: reclaim memory from sstables on unlink
  sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
  sstables: introduce disable_component_memory_reload()
  sstables_manager: log sstable name when reclaiming components

(cherry picked from commit d4129ddaa6)

Closes scylladb/scylladb#21998
2025-01-08 13:06:24 +02:00
Michał Chojnowski
379f23d854 cache_algorithm_test: harden against stats being confused by background activity
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something Scylla does in the background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)

(cherry picked from commit 6caaead4ac)
2025-01-08 11:48:23 +01:00
Michał Chojnowski
10815d2599 cache_algorithm_test: fix a use of an uninitialized variable
The test needs to create some arbitrary partition keys of a given size.
It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass --
the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used.
But if some background task happens to overwrite the allocator slot during a
preemption, the keys used during "SELECT" will be different than the keys used
during "INSERT", and the test will fail due to extra cache misses.
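The fix amounts to writing every byte of the key explicitly, so that rebuilding the same key always yields identical bytes. A sketch (`make_key` is a hypothetical helper, not the test's actual API):

```python
def make_key(index: int, size: int) -> bytes:
    # Initialize every byte of the key: the index byte followed by explicit
    # zeros, instead of leaving the tail as whatever the buffer previously held.
    return bytes([index]) + bytes(size - 1)
```

Because the result depends only on the arguments, re-creating a key during "SELECT" yields the same bytes as during "INSERT", regardless of allocator reuse.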

(cherry picked from commit 1fffd976a4)
2025-01-08 11:48:17 +01:00
Patryk Jędrzejczak
a63a0eac1e [Backport 6.2] raft: improve logs for abort while waiting for apply
New logs allow us to easily distinguish two cases in which
waiting for apply times out:
- the node didn't receive the entry it was waiting for,
- the node received the entry but didn't apply it in time.

Distinguishing these cases simplifies reasoning about failures.
The first case indicates that something went wrong on the leader.
The second case indicates that something went wrong on the node
on which waiting for apply timed out.

As it turns out, many different bugs result in `read_barrier`
(which calls `wait_for_apply`) timing out. This change should help
us debug bugs like these.

We want to backport this change to all supported branches so that
it helps us in all tests.

Fixes scylladb/scylladb#22160

Closes scylladb/scylladb#22159
2025-01-07 17:01:22 +01:00
Kamil Braun
afd588d4c7 Merge '[Backport 6.2] Do not reset quarantine list in non raft mode' from Gleb
The series contains small fixes to the gossiper, one of which fixes #21930. I noticed the others while debugging the issue.

Fixes: #21930

(cherry picked from commit 91cddcc17f)

Parent PR: #21956

Closes scylladb/scylladb#21991

* github.com:scylladb/scylladb:
  gossiper: do not reset _just_removed_endpoints in non raft mode
  gossiper: do not call apply for the node's old state
2025-01-03 16:28:08 +01:00
Abhinav
f5bce45399 Fix gossiper orphan node floating problem by adding a remover fiber
In the current scenario, if a node crashes during startup after initiating gossip but before
joining group0, it keeps floating in the gossiper forever, because the Raft-based gossiper
purging logic only takes effect once a node joins group0. Such an orphan node prevents a
successor node with the same IP from joining the cluster, since the two collide during the
gossiper shadow round.

This commit fixes the issue by adding a background fiber that periodically checks for such
orphan entries in the gossiper and removes them.

A test is also added to verify this logic; it fails without the background fiber enabled,
thus verifying the behavior.
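The remover's core check can be sketched like this (data structures are illustrative, not ScyllaDB's actual gossiper state; the real fix runs this check periodically in a background fiber):

```python
def remove_orphans_once(gossiper_entries: dict, group0_members: set) -> list:
    # One pass of the remover: any gossiped node that never joined group0
    # is an orphan and gets dropped from the gossiper state.
    orphans = [node for node in gossiper_entries if node not in group0_members]
    for node in orphans:
        del gossiper_entries[node]
    return orphans
```

A node that crashed between initiating gossip and joining group0 appears in the gossiper map but not in the group0 member set, so the periodic pass eventually removes it and frees its IP for a successor.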

Fixes: scylladb/scylladb#20082

Closes scylladb/scylladb#21600

(cherry picked from commit 6c90a25014)

Closes scylladb/scylladb#21822
2025-01-02 14:57:46 +01:00
Gleb Natapov
cda997fe59 gossiper: do not reset _just_removed_endpoints in non raft mode
By the time the function is called during start it may already be
populated.

Fixes: scylladb/scylladb#21930
(cherry picked from commit e318dfb83a)
2024-12-25 12:01:16 +02:00
Gleb Natapov
155a0462d5 gossiper: do not call apply for the node's old state
If a node changed its address, its old state may still be in the gossiper,
so ignore it.

(cherry picked from commit e80355d3a1)
2024-12-23 11:47:12 +02:00
Piotr Dulikowski
76b1173546 Merge 'service/topology_coordinator: migrate view builder only if all nodes are up' from Michał Jadwiszczak
The migration process is doing read with consistency level ALL,
requiring all nodes to be alive.

Fixes scylladb/scylladb#20754

The PR should be backported to 6.2, this version has view builder on group0.

Closes scylladb/scylladb#21708

* github.com:scylladb/scylladb:
  test/topology_custom/test_view_build_status: add reproducer
  service/topology_coordinator: migrate view builder only if all nodes are up

(cherry picked from commit def51e252d)

Closes scylladb/scylladb#21850
2024-12-19 14:10:55 +01:00
Piotr Dulikowski
e5a37d63c0 Merge 'transport/server: revert using async function in for_each_gently()' from Michał Jadwiszczak
This patch reverts 324b3c43c0 and adds synchronous versions of `service_level_controller::find_effective_service_level()` and `client_state::maybe_update_per_service_level_params()`.

It isn't safe to do asynchronous calls in `for_each_gently`, as the
connection may be disconnected while a call in the callback preempts.

Fixes scylladb/scylladb#21801

Closes scylladb/scylladb#21761

* github.com:scylladb/scylladb:
  Revert "generic_server: use async function in `for_each_gently()`"
  transport/server: use synchronous calls in `for_each_gently` callback
  service/client_state: add synchronous method to update service level params
  qos/service_level_controller: add `find_cached_effective_service_level`

(cherry picked from commit c601f7a359)

Closes scylladb/scylladb#21849
2024-12-19 14:10:31 +01:00
Tomasz Grabiec
0851e3fba7 Merge '[Backport 6.2] utils: cached_file: Mark permit as awaiting on page miss' from ScyllaDB
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutilize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO (relies on https://github.com/scylladb/scylladb/pull/20522).
But there is IO during promoted index search (via cached_file).

Before the patch this workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or partition index IO involved
because that IO will signal read concurrency semaphore to invite more concurrency.

Fixes #21325

(cherry picked from commit 868f5b59c4)

(cherry picked from commit 0f2101b055)

Refs #21323

Closes scylladb/scylladb#21358

* github.com:scylladb/scylladb:
  utils: cached_file: Mark permit as awaiting on page miss
  utils: cached_file: Push resource_unit management down to cached_file
2024-12-16 19:55:00 +01:00
Michael Litvak
99f190f699 service/qos/service_level_controller: update cache on startup
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.

The cache update depends on the service level data accessor being set
and the auth service being initialized. Before the commit, it may happen that a
cache update is not triggered after the initialization. The commit adds
an explicit call to update the cache where it is guaranteed to be ready.

Fixes scylladb/scylladb#21763

Closes scylladb/scylladb#21773

(cherry picked from commit 373855b493)

Closes scylladb/scylladb#21893
2024-12-16 14:19:06 +01:00
Michael Litvak
04e8506cbb service/qos: increase timeout of internal get_service_levels queries
The function get_service_levels is used to retrieve all service levels
and it is called from multiple different contexts.
Importantly, it is called internally from the context of group0 state reload,
where it should be executed with a long timeout, similarly to other
internal queries, because a failure of this function affects the entire
group0 client, and a longer timeout can be tolerated.
The function is also called in the context of the user command LIST
SERVICE LEVELS, and perhaps other contexts, where a shorter timeout is
preferred.

The commit introduces a function parameter to indicate whether the
context is internal or not. For internal context, a long timeout is
chosen for the query. Otherwise, the timeout is shorter, the same as
before. When the distinction is not important, a default value is
chosen which maintains the same behavior.

The main purpose is to fix the case where the timeout is too short and causes
a failure that propagates and fails the group0 client.

Fixes scylladb/scylladb#20483

Closes scylladb/scylladb#21748

(cherry picked from commit 53224d90be)

Closes scylladb/scylladb#21890
2024-12-16 14:15:26 +01:00
Yaron Kaikov
8e606a239f github: check if PR is closed instead of merge
In Scylla, PRs can end up either `closed` or `merged`. Based on that, we decide when to start the backport process when the label is added after the PR is closed (or merged).

In https://github.com/scylladb/scylladb/pull/21876, even adding the proper backport label didn't trigger the backport automation. https://github.com/scylladb/scylladb/pull/21809/ caused this; we should have kept `state=closed` (which includes both closed and merged PRs).

Fixing it

Closes scylladb/scylladb#21906

(cherry picked from commit b4b7617554)

Closes scylladb/scylladb#21922
2024-12-16 14:07:32 +02:00
Jenkins Promoter
8cdff8f52f Update ScyllaDB version to: 6.2.3 2024-12-15 15:55:40 +02:00
Kamil Braun
3ec741dbac Merge 'topology_coordinator: introduce reload_count in topology state and use it to prevent race' from Gleb Natapov
The topology request table may change between the code reading it and
the call to cv::when(), since reading is a preemption point. In this
case a cv::signal() can be missed. Detect whether a signal arrived between
reading and waiting by introducing reload_count, which is increased each
time the state is reloaded and signaled. If the counter differs
before and after reading, the state may have changed, so re-check it
instead of sleeping.
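This is a standard defense against lost wakeups. A Python sketch of the idea (class and method names are hypothetical, not Scylla's actual API):

```python
import threading

class TopologyState:
    def __init__(self):
        self._cond = threading.Condition()
        self._reload_count = 0

    def reload(self):
        with self._cond:
            self._reload_count += 1   # state reloaded
            self._cond.notify_all()   # signal waiters

    def sample(self) -> int:
        with self._cond:
            return self._reload_count

    def wait_if_unchanged(self, seen: int, timeout: float = 0.05) -> bool:
        # Returns True if a reload happened since `seen` was sampled, so the
        # caller should re-read the state. A signal that fired between sampling
        # and waiting is caught by the counter check instead of being lost.
        with self._cond:
            return self._cond.wait_for(lambda: self._reload_count != seen,
                                       timeout=timeout)
```

The waiter samples the counter before reading the state, then waits only if the counter is still unchanged, so a reload that slipped in during the preemptible read is never missed.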

Closes scylladb/scylladb#21713

* github.com:scylladb/scylladb:
  topology_coordinator: introduce reload_count in topology state and use it to prevent race
  storage_service: use conditional_variable::when in co-routines consistently

(cherry picked from commit 8f858325b6)

Closes scylladb/scylladb#21803
2024-12-12 15:45:31 +01:00
Anna Stuchlik
e3e7ac16e9 doc: remove wrong image upgrade info (5.2-to-2023.1)
This commit removes the information about the recommended way of upgrading
ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade
procedure is not supported (it was implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

Closes scylladb/scylladb#21876
Fixes https://github.com/scylladb/scylla-enterprise/issues/5041
Fixes https://github.com/scylladb/scylladb/issues/21898

(cherry picked from commit 98860905d8)
2024-12-12 15:22:26 +02:00
Tomasz Grabiec
81d6d88016 utils: cached_file: Mark permit as awaiting on page miss
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutilize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO. But there is IO during
promoted index search (via cached_file). Before the patch this
workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or index
file IO involved because that IO will signal read concurrency
semaphore to invite more concurrency.

(cherry picked from commit 0f2101b055)
2024-12-09 23:18:00 +01:00
Tomasz Grabiec
56f93dd434 utils: cached_file: Push resource_unit management down to cached_file
It saves us permit operations on the hot path when we hit in cache.

Also, it will lay the ground for marking the permit as awaiting later.

(cherry picked from commit 868f5b59c4)
2024-12-09 23:17:56 +01:00
Kefu Chai
28a32f9c50 github: do not nest ${{}} inside condition
In commit 2596d157, we added a condition to run auto-backport.py only
when the GitHub Action is triggered by a push to the default branch.
However, this introduced an unexpected error due to incorrect condition
handling.

Problem:
- `github.event.before` evaluates to an empty string
- GitHub Actions' single-pass expression evaluation system causes
  the step to always execute, regardless of `github.event_name`

Although GitHub's documentation suggests that ${{ }} can be omitted,
it recommends using explicit ${{ }} expressions for compound conditions.

Changes:
- Use explicit ${{}} expression for compound conditions
- Avoid string interpolation in conditional statements

Root Cause:
The previous implementation failed because of how GitHub Actions
evaluates conditional expressions, leading to an unintended script
execution and a 404 error when attempting to compare commits.

Example Error:

```
  python .github/scripts/auto-backport.py --repo scylladb/scylladb --base-branch refs/heads/master --commits ..2b07d93beac7bc83d955dadc20ccc307f13f20b6
  shell: /usr/bin/bash -e {0}
  env:
    DEFAULT_BRANCH: master
    GITHUB_TOKEN: ***
Traceback (most recent call last):
  File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 201, in <module>
    main()
  File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 162, in main
    commits = repo.compare(start_commit, end_commit).commits
  File "/usr/lib/python3/dist-packages/github/Repository.py", line 888, in compare
    headers, data = self._requester.requestJsonAndCheck(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 353, in requestJsonAndCheck
    return self.__check(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 378, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/commits/commits#compare-two-commits", "status": "404"}
```

Fixes scylladb/scylladb#21808
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21809

(cherry picked from commit e04aca7efe)

Closes scylladb/scylladb#21820
2024-12-06 16:34:59 +02:00
Avi Kivity
20b2d5b7c9 Merge 'compaction: update maintenance sstable set on scrub compaction completion' from Lakshmi Narayanan Sreethar
Scrub compaction can pick up input sstables from the maintenance sstable set,
but on completion it doesn't update that set, leaving the original sstable
in it after it has been scrubbed. To fix this, compaction completion has to
update the maintenance set if the input originated from there. This PR solves
the issue by updating the correct sstable_sets on compaction completion.

Fixes #20030

This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.

Closes scylladb/scylladb#21582

* github.com:scylladb/scylladb:
  compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
  compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
  compaction_group: rename `update_main_sstable_list_on_compaction_completion`
  compaction_group: update maintenance sstable set on scrub compaction completion
  compaction_group: store table::sstable_list_builder::result in replacement_desc
  table::sstable_list_builder: remove old sstables only from current list
  table::sstable_list_builder: return removed sstables from build_new_list

(cherry picked from commit 58baeac0ad)

Closes scylladb/scylladb#21790
2024-12-06 10:36:46 +02:00
Michael Pedersen
f37deb7e98 docs: correct the storage size for n2-highmem-32 to 9000GB
Updated the storage size for n2-highmem-32 to 9000 GB, as this is the default in SC.

Fixes scylladb/scylladb#21785
Closes scylladb/scylladb#21537

(cherry picked from commit 309f1606ae)

Closes scylladb/scylladb#21595
2024-12-05 09:51:11 +02:00
Tomasz Grabiec
933ec7c6ab utils: UUID: Make get_time_UUID() respect the clock offset
schema_change_test currently fails due to failure to start a cql test
env in unit tests after the point where this is called (in one of the
test cases):

   forward_jump_clocks(std::chrono::seconds(60*60*24*31));

The problem manifests with a failure to join the cluster due to
missing_column exception ("missing_column: done") being thrown from
system_keyspace::get_topology_request_state(). It's a symptom of
join request being missing in system.topology_requests. It's missing
because the row is expired.

When a request is created, we insert the
mutations with an intended TTL of 1 month. The actual TTL value is
computed like this:

  ttl_opt topology_request_tracking_mutation_builder::ttl() const {
      return std::chrono::duration_cast<std::chrono::seconds>(std::chrono::microseconds(_ts)) + std::chrono::months(1)
          - std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch());
  }

_ts comes from the request_id, which is supposed to be a timeuuid set
from current time when request starts. It's set using
utils::UUID_gen::get_time_UUID(). It reads the system clock without
adding the clock offset, so after forward_jump_clocks(), _ts and
gc_clock::now() may be far off. In some cases the accumulated offset
is larger than 1 month and the TTL becomes negative, causing the
request row to expire immediately and failing the boot sequence.

The fix is to use db_clock, which respects offsets and is consistent
with gc_clock.
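The failure mode is plain arithmetic; a toy sketch with illustrative numbers (the hypothetical `request_ttl` mirrors the `ttl()` computation quoted above):

```python
MONTH = 31 * 24 * 3600  # intended TTL, in seconds

def request_ttl(request_ts: int, now: int) -> int:
    # _ts + 1 month - gc_clock::now(), in seconds
    return request_ts + MONTH - now

boot = 1_000_000
jump = 40 * 24 * 3600   # forward_jump_clocks() by more than a month

# Buggy: _ts comes from a clock that ignores the jump, while "now" includes it.
ttl_buggy = request_ttl(boot, boot + jump)

# Fixed: both timestamps come from clocks that respect the offset.
ttl_fixed = request_ttl(boot + jump, boot + jump)
```

Once the accumulated offset exceeds the intended month, `ttl_buggy` is negative and the request row expires immediately, which is exactly the boot failure described.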

The test doesn't fail in CI because there each test case runs in a
separate process, so there is no bootstrap attempt (by new cql test
env) after forward_jump_clocks().

Closes scylladb/scylladb#21558

(cherry picked from commit 1d0c6aa26f)

Closes scylladb/scylladb#21584

Fixes #21581
2024-12-04 14:18:16 +01:00
Kefu Chai
2b5cd10b66 docs: explain task status retention and one-time query behavior
Task status information from nodetool commands is not retained permanently:

- Status of completed tasks is only kept for `task_ttl_in_seconds`
- Status is removed after being queried, making it a one-time operation

This behavior is important for users to understand since subsequent
queries for the same completed task will not return any information.
Add documentation to make this clear to users.

Fixes scylladb/scylladb#21757
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21386

(cherry picked from commit afeff0a792)

Closes scylladb/scylladb#21759
2024-12-04 13:49:24 +02:00
Kefu Chai
bf47de9f7f test: topology_custom: ensure node visibility before keyspace creation
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.

Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency

Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions

Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios

Fixes scylladb/scylladb#21724

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21726

(cherry picked from commit 65949ce607)

Closes scylladb/scylladb#21734
2024-12-04 13:46:26 +02:00
Nadav Har'El
dc71a6a75e Merge 'test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime' from Dawid Mędrek
Before these changes, we didn't wait for the materialized views to
finish building before writing to the base table. That led to generating
an additional view update, which, in turn, led to test failures.

The scenario corresponding to the summary above looked like this:

1. The test creates an empty table and MVs on it.
2. The view builder starts, but it doesn't finish immediately.
3. The test performs mutations to the base table. Since the views
   already exist, view updates are generated.
4. Finally, the view builder finishes. It notices that the base
   table has a row, so it generates a view update for it because
   it doesn't notice that we already have data in the view.

We solve it by explicitly waiting for both views to finish building
and only then start writing to the base table.

Additionally, we also fix a lifetime issue of the row the test revolves
around, further stabilizing CI.

Fixes https://github.com/scylladb/scylladb/issues/20889

Backport: These changes have no semantic effect on the codebase,
but they stabilize CI, so we want to backport them to the maintained
versions of Scylla.

Closes scylladb/scylladb#21632

* github.com:scylladb/scylladb:
  test/boost/view_schema_test.cc: Increase TTL in test_view_update_generating_writetime
  test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime

(cherry picked from commit 733a4f94c7)

Closes scylladb/scylladb#21640
2024-12-04 13:44:33 +02:00
Aleksandra Martyniuk
f13f821b31 repair: implement tablet_repair_task_impl::release_resources
tablet_repair_task_impl keeps a vector of tablet_repair_task_meta,
each of which keeps an effective_replication_map_ptr. So, after
the task completes, the token metadata version will not change for
task_ttl seconds.

Implement tablet_repair_task_impl::release_resources method that clears
tablet_repair_task_meta vector when the task finishes.

Set task_ttl to 1h in test_tablet_repair to verify that the test
does not time out.

Fixes: #21503.

Closes scylladb/scylladb#21504

(cherry picked from commit 572b005774)

Closes scylladb/scylladb#21622
2024-12-04 13:43:40 +02:00
André LFA
74ad6f2fa3 Update report-scylla-problem.rst removing references to old Health Check Report
Closes scylladb/scylladb#21467

(cherry picked from commit 703e6f3b1f)

Closes scylladb/scylladb#21591
2024-12-04 13:41:00 +02:00
Abhinav
fc42571591 test: Parametrize 'replacement with inter-dc encryption' test to confirm behavior in zero token node cases.
In the current scenario, 'test_replace_with_encryption' only confirms the replacement with inter-dc encryption
for normal nodes. This commit increases the test's coverage by parametrizing it to also confirm
the behavior for zero-token node replacement. The test also implicitly provides
coverage for bootstrap with encryption of zero-token nodes.

This PR increases coverage for existing code, so we need to backport it. Since only version 6.2
has zero-token node support, we backport it only to 6.2.

Fixes: scylladb/scylladb#21096

Closes scylladb/scylladb#21609

(cherry picked from commit acd643bd75)

Closes scylladb/scylladb#21764
2024-12-04 11:22:39 +01:00
203 changed files with 4260 additions and 1251 deletions


@@ -47,6 +47,11 @@ def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr
        )
        logging.info(f"Pull request created: {backport_pr.html_url}")
        backport_pr.add_to_assignees(pr.user)
        if is_draft:
            backport_pr.add_to_labels("conflicts")
            pr_comment = f"@{pr.user} - This PR was marked as draft because it has conflicts\n"
            pr_comment += "Please resolve them and mark this PR as ready for review"
            backport_pr.create_issue_comment(pr_comment)
        logging.info(f"Assigned PR to original author: {pr.user}")
        return backport_pr
    except GithubException as e:


@@ -49,7 +49,7 @@ jobs:
          GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
        run: python .github/scripts/label_promoted_commits.py --commits ${{ github.event.before }}..${{ github.sha }} --repository ${{ github.repository }} --ref ${{ github.ref }}
      - name: Run auto-backport.py when promotion completed
        if: github.event_name == 'push' && github.ref == 'refs/heads/${{ env.DEFAULT_BRANCH }}'
        if: ${{ github.event_name == 'push' && github.ref == format('refs/heads/{0}', env.DEFAULT_BRANCH) }}
        env:
          GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
        run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --commits ${{ github.event.before }}..${{ github.sha }}
@@ -65,7 +65,7 @@ jobs:
          echo "backport_label=false" >> $GITHUB_OUTPUT
        fi
      - name: Run auto-backport.py when label was added
        if: github.event_name == 'pull_request_target' && steps.check_label.outputs.backport_label == 'true' && (github.event.pull_request.state == 'closed' && github.event.pull_request.merged == true)
        if: ${{ github.event_name == 'pull_request_target' && steps.check_label.outputs.backport_label == 'true' && github.event.pull_request.state == 'closed' }}
        env:
          GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
        run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --pull-request ${{ github.event.pull_request.number }} --head-commit ${{ github.event.pull_request.base.sha }}


@@ -0,0 +1,27 @@
name: Mark PR as Ready When Conflicts Label is Removed

on:
  pull_request_target:
    types:
      - unlabeled

env:
  DEFAULT_BRANCH: 'master'

jobs:
  mark-ready:
    if: github.event.label.name == 'conflicts'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          repository: ${{ github.repository }}
          ref: ${{ env.DEFAULT_BRANCH }}
          token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
          fetch-depth: 1

      - name: Mark pull request as ready for review
        run: gh pr ready "${{ github.event.pull_request.number }}"
        env:
          GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}

.gitmodules vendored

@@ -1,6 +1,6 @@
[submodule "seastar"]
	path = seastar
	url = ../seastar
	url = ../scylla-seastar
	ignore = dirty
[submodule "swagger-ui"]
	path = swagger-ui


@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=6.2.2
VERSION=6.2.4
if test -f version
then


@@ -218,6 +218,30 @@
               ]
            }
         ]
      },
      {
         "path":"/task_manager/drain/{module}",
         "operations":[
            {
               "method":"POST",
               "summary":"Drain finished local tasks",
               "type":"void",
               "nickname":"drain_tasks",
               "produces":[
                  "application/json"
               ],
               "parameters":[
                  {
                     "name":"module",
                     "description":"The module to drain",
                     "required":true,
                     "allowMultiple":false,
                     "type":"string",
                     "paramType":"path"
                  }
               ]
            }
         ]
      }
   ],
   "models":{
"models":{


@@ -218,6 +218,32 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
uint32_t ttl = cfg.task_ttl_seconds();
co_return json::json_return_type(ttl);
});
tm::drain_tasks.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
co_await tm.invoke_on_all([&req] (tasks::task_manager& tm) -> future<> {
tasks::task_manager::module_ptr module;
try {
module = tm.find_module(req->get_path_param("module"));
} catch (...) {
throw bad_param_exception(fmt::format("{}", std::current_exception()));
}
const auto& local_tasks = module->get_local_tasks();
std::vector<tasks::task_id> ids;
ids.reserve(local_tasks.size());
std::transform(begin(local_tasks), end(local_tasks), std::back_inserter(ids), [] (const auto& task) {
return task.second->is_complete() ? task.first : tasks::task_id::create_null_id();
});
for (auto&& id : ids) {
if (id) {
module->unregister_task(id);
}
co_await maybe_yield();
}
});
co_return json_void();
});
}
void unset_task_manager(http_context& ctx, routes& r) {
@@ -229,6 +255,7 @@ void unset_task_manager(http_context& ctx, routes& r) {
tm::get_task_status_recursively.unset(r);
tm::get_and_update_ttl.unset(r);
tm::get_ttl.unset(r);
tm::drain_tasks.unset(r);
}
}
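The `drain_tasks` handler above first snapshots the ids of completed tasks (mapping incomplete ones to a null id) and only then unregisters them, rather than erasing from the task map while iterating over it. A minimal sketch of that collect-then-remove pattern, in plain Python with hypothetical task objects standing in for the Seastar types:

```python
def drain_finished(tasks):
    """tasks: dict mapping task id -> object with .is_complete()."""
    # Snapshot the ids of finished tasks first; deleting from the
    # dict while iterating over it would break the iteration.
    finished = [tid for tid, t in tasks.items() if t.is_complete()]
    for tid in finished:
        del tasks[tid]
    return finished
```

The real handler additionally yields between unregistrations (`maybe_yield`) so a large module does not stall the reactor.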

View File

@@ -43,6 +43,10 @@ future<> maintenance_socket_role_manager::stop() {
return make_ready_future<>();
}
future<> maintenance_socket_role_manager::ensure_superuser_is_created() {
return make_ready_future<>();
}
template<typename T = void>
future<T> operation_not_supported_exception(std::string_view operation) {
return make_exception_future<T>(

View File

@@ -39,6 +39,8 @@ public:
virtual future<> stop() override;
virtual future<> ensure_superuser_is_created() override;
virtual future<> create(std::string_view role_name, const role_config&, ::service::group0_batch&) override;
virtual future<> drop(std::string_view role_name, ::service::group0_batch& mc) override;

View File

@@ -106,6 +106,13 @@ public:
virtual future<> stop() = 0;
///
/// Ensure that superuser role exists.
///
/// \returns a future once it is ensured that the superuser role exists.
///
virtual future<> ensure_superuser_is_created() = 0;
///
/// \returns an exceptional future with \ref role_already_exists for a role that has previously been created.
///

View File

@@ -257,6 +257,10 @@ future<> service::stop() {
});
}
future<> service::ensure_superuser_is_created() {
return _role_manager->ensure_superuser_is_created();
}
void service::update_cache_config() {
auto db = _qp.db();

View File

@@ -131,6 +131,8 @@ public:
future<> stop();
future<> ensure_superuser_is_created();
void update_cache_config();
void reset_authorization_cache();

View File

@@ -241,35 +241,39 @@ future<> standard_role_manager::migrate_legacy_metadata() {
}
future<> standard_role_manager::start() {
return once_among_shards([this] {
return futurize_invoke([this] () {
if (legacy_mode(_qp)) {
return create_legacy_metadata_tables_if_missing();
}
return make_ready_future<>();
}).then([this] {
_stopped = auth::do_after_system_ready(_as, [this] {
return seastar::async([this] {
if (legacy_mode(_qp)) {
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get();
return once_among_shards([this] () -> future<> {
if (legacy_mode(_qp)) {
co_await create_legacy_metadata_tables_if_missing();
}
if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get()) {
if (legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
}
auto handler = [this] () -> future<> {
const bool legacy = legacy_mode(_qp);
if (legacy) {
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
co_await _migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as);
return;
}
if (legacy_metadata_exists()) {
migrate_legacy_metadata().get();
return;
}
if (co_await any_nondefault_role_row_satisfies(_qp, &has_can_login)) {
if (legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
}
create_default_role_if_missing().get();
});
});
});
co_return;
}
if (legacy_metadata_exists()) {
co_await migrate_legacy_metadata();
co_return;
}
}
co_await create_default_role_if_missing();
if (!legacy) {
_superuser_created_promise.set_value();
}
};
_stopped = auth::do_after_system_ready(_as, handler);
co_return;
});
}
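The change above lets `ensure_superuser_is_created()` block on a `shared_promise<>` that the background start fiber resolves once the default role exists. The same wait-until-created handshake can be sketched with an `asyncio.Event` (hypothetical names, not the Seastar API):

```python
import asyncio

class RoleManager:
    def __init__(self):
        self._superuser_created = asyncio.Event()
        self.superuser_exists = False

    async def start(self):
        # Background fiber: create the default role, then resolve
        # the promise so every waiter is released.
        await asyncio.sleep(0)          # stand-in for the real work
        self.superuser_exists = True
        self._superuser_created.set()

    async def ensure_superuser_is_created(self):
        # Waiters created before or after set() both complete,
        # mirroring shared_promise::get_shared_future().
        await self._superuser_created.wait()

async def main():
    rm = RoleManager()
    waiter = asyncio.create_task(rm.ensure_superuser_is_created())
    await rm.start()
    await waiter
    return rm.superuser_exists
```

A `shared_promise` is used rather than a plain `promise` because several callers (e.g. the CQL server startup path) may wait on the same future.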
@@ -278,6 +282,11 @@ future<> standard_role_manager::stop() {
return _stopped.handle_exception_type([] (const sleep_aborted&) { }).handle_exception_type([](const abort_requested_exception&) {});
}
future<> standard_role_manager::ensure_superuser_is_created() {
SCYLLA_ASSERT(this_shard_id() == 0);
return _superuser_created_promise.get_shared_future();
}
future<> standard_role_manager::create_or_replace(std::string_view role_name, const role_config& c, ::service::group0_batch& mc) {
const sstring query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, ?, ?)",
get_auth_ks_name(_qp),

View File

@@ -37,6 +37,7 @@ class standard_role_manager final : public role_manager {
future<> _stopped;
abort_source _as;
std::string _superuser;
shared_promise<> _superuser_created_promise;
public:
standard_role_manager(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
@@ -49,6 +50,8 @@ public:
virtual future<> stop() override;
virtual future<> ensure_superuser_is_created() override;
virtual future<> create(std::string_view role_name, const role_config&, ::service::group0_batch&) override;
virtual future<> drop(std::string_view role_name, ::service::group0_batch& mc) override;

View File

@@ -122,6 +122,9 @@ class cache_mutation_reader final : public mutation_reader::impl {
gc_clock::time_point _read_time;
gc_clock::time_point _gc_before;
api::timestamp_type _max_purgeable_timestamp = api::missing_timestamp;
api::timestamp_type _max_purgeable_timestamp_shadowable = api::missing_timestamp;
future<> do_fill_buffer();
future<> ensure_underlying();
void copy_from_cache_to_buffer();
@@ -200,12 +203,17 @@ class cache_mutation_reader final : public mutation_reader::impl {
gc_clock::time_point get_gc_before(const schema& schema, dht::decorated_key dk, const gc_clock::time_point query_time) {
auto gc_state = _read_context.tombstone_gc_state();
if (gc_state) {
return gc_state->get_gc_before_for_key(schema.shared_from_this(), dk, query_time);
return gc_state->with_commitlog_check_disabled().get_gc_before_for_key(schema.shared_from_this(), dk, query_time);
}
return gc_clock::time_point::min();
}
bool can_gc(tombstone t, is_shadowable is) const {
const auto max_purgeable = is ? _max_purgeable_timestamp_shadowable : _max_purgeable_timestamp;
return t.timestamp < max_purgeable;
}
public:
cache_mutation_reader(schema_ptr s,
dht::decorated_key dk,
@@ -227,8 +235,19 @@ public:
, _read_time(get_read_time())
, _gc_before(get_gc_before(*_schema, dk, _read_time))
{
clogger.trace("csm {}: table={}.{}, reversed={}, snap={}", fmt::ptr(this), _schema->ks_name(), _schema->cf_name(), _read_context.is_reversed(),
fmt::ptr(&*_snp));
_max_purgeable_timestamp = ctx.get_max_purgeable(dk, is_shadowable::no);
_max_purgeable_timestamp_shadowable = ctx.get_max_purgeable(dk, is_shadowable::yes);
clogger.trace("csm {}: table={}.{}, dk={}, gc-before={}, max-purgeable-regular={}, max-purgeable-shadowable={}, reversed={}, snap={}",
fmt::ptr(this),
_schema->ks_name(),
_schema->cf_name(),
dk,
_gc_before,
_max_purgeable_timestamp,
_max_purgeable_timestamp_shadowable,
_read_context.is_reversed(),
fmt::ptr(&*_snp));
push_mutation_fragment(*_schema, _permit, partition_start(std::move(dk), _snp->partition_tombstone()));
}
cache_mutation_reader(schema_ptr s,
@@ -786,12 +805,12 @@ void cache_mutation_reader::copy_from_cache_to_buffer() {
t.apply(range_tomb);
auto row_tomb_expired = [&](row_tombstone tomb) {
return (tomb && tomb.max_deletion_time() < _gc_before);
return (tomb && tomb.max_deletion_time() < _gc_before && can_gc(tomb.tomb(), tomb.is_shadowable()));
};
auto is_row_dead = [&](const deletable_row& row) {
auto& m = row.marker();
return (!m.is_missing() && m.is_dead(_read_time) && m.deletion_time() < _gc_before);
return (!m.is_missing() && m.is_dead(_read_time) && m.deletion_time() < _gc_before && can_gc(tombstone(m.timestamp(), m.deletion_time()), is_shadowable::no));
};
if (row_tomb_expired(t) || is_row_dead(row)) {
@@ -799,9 +818,11 @@ void cache_mutation_reader::copy_from_cache_to_buffer() {
_read_context.cache()._tracker.on_row_compacted();
auto mutation_can_gc = can_gc_fn([this] (tombstone t, is_shadowable is) { return can_gc(t, is); });
with_allocator(_snp->region().allocator(), [&] {
deletable_row row_copy(row_schema, row);
row_copy.compact_and_expire(row_schema, t.tomb(), _read_time, always_gc, _gc_before, nullptr);
row_copy.compact_and_expire(row_schema, t.tomb(), _read_time, mutation_can_gc, _gc_before, nullptr);
std::swap(row, row_copy);
});
remove_row = row.empty();
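The new `can_gc` above gates tombstone purging on a per-partition max-purgeable timestamp, with a separate bound for shadowable tombstones. A minimal sketch of the predicate, using plain numbers for timestamps (hypothetical stand-ins for the real tombstone types):

```python
MISSING_TIMESTAMP = float("-inf")  # stand-in for api::missing_timestamp

def can_gc(tombstone_ts, shadowable, max_purgeable, max_purgeable_shadowable):
    """A tombstone may be purged only if its timestamp is below the
    relevant max-purgeable bound, i.e. no data it might still cover
    remains unaccounted for in other sstables."""
    bound = max_purgeable_shadowable if shadowable else max_purgeable
    return tombstone_ts < bound
```

With the bound left at the missing-timestamp sentinel, nothing is ever purgeable, which is the safe default before `get_max_purgeable` has been consulted.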

View File

@@ -364,6 +364,9 @@ cdc::topology_description make_new_generation_description(
const noncopyable_function<std::pair<size_t, uint8_t>(dht::token)>& get_sharding_info,
const locator::token_metadata_ptr tmptr) {
const auto tokens = get_tokens(bootstrap_tokens, tmptr);
if (tokens.empty()) {
on_internal_error(cdc_log, "Attempted to create a CDC generation from an empty list of tokens");
}
utils::chunked_vector<token_range_description> vnode_descriptions;
vnode_descriptions.reserve(tokens.size());
@@ -1111,7 +1114,9 @@ future<bool> generation_service::legacy_do_handle_cdc_generation(cdc::generation
auto sys_dist_ks = get_sys_dist_ks();
auto gen = co_await retrieve_generation_data(gen_id, _sys_ks.local(), *sys_dist_ks, { _token_metadata.get()->count_normal_token_owners() });
if (!gen) {
throw std::runtime_error(fmt::format(
// This may happen during raft upgrade when a node gossips about a generation that
// was propagated through raft and we didn't apply it yet.
throw generation_handling_nonfatal_exception(fmt::format(
"Could not find CDC generation {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
gen_id, db_clock::now()));

View File

@@ -186,7 +186,7 @@ bool cdc::metadata::prepare(db_clock::time_point tp) {
}
auto ts = to_ts(tp);
auto emplaced = _gens.emplace(to_ts(tp), std::nullopt).second;
auto [it, emplaced] = _gens.emplace(to_ts(tp), std::nullopt);
if (_last_stream_timestamp != api::missing_timestamp) {
auto last_correct_gen = gen_used_at(_last_stream_timestamp);
@@ -201,5 +201,5 @@ bool cdc::metadata::prepare(db_clock::time_point tp) {
}
}
return emplaced;
return !it->second;
}
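The fix above matters because `emplace(...).second` is false whenever the key already exists, even when the stored value is still an empty optional; the corrected code instead asks whether the slot is still unfilled. A dict-based sketch of the two behaviours (a hypothetical standalone version of `prepare`):

```python
def prepare(gens, ts):
    """gens: dict ts -> generation, where None means 'prepared but
    no generation installed yet'."""
    if ts not in gens:
        gens[ts] = None
    # Old (buggy) logic returned "did we just insert the key", which
    # is False on a repeated prepare() for the same timestamp.
    # The fix returns "is the slot still unfilled", which stays True
    # until a generation is actually installed.
    return gens[ts] is None
```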

View File

@@ -1120,7 +1120,10 @@ future<> compaction_manager::drain() {
cmlog.info("Asked to drain");
if (*_early_abort_subscription) {
_state = state::disabled;
_compaction_submission_timer.cancel();
co_await stop_ongoing_compactions("drain");
// Trigger a signal to properly exit from postponed_compactions_reevaluation() fiber
reevaluate_postponed_compactions();
}
cmlog.info("Drained");
}
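The one-line fix above exists because re-arming an already-armed Seastar timer triggers an assertion: `drain()` disabled the manager but left the submission timer armed, so a later `enable()` would assert. A toy model of the invariant (hypothetical classes, not the real compaction manager):

```python
class Timer:
    """Minimal stand-in for seastar::timer: arming twice is an error."""
    def __init__(self):
        self.armed = False
    def arm(self):
        assert not self.armed, "timer is already armed"
        self.armed = True
    def cancel(self):
        self.armed = False

class CompactionManager:
    def __init__(self):
        self.timer = Timer()
        self.enabled = False
    def enable(self):
        self.timer.arm()        # asserts if drain() left the timer armed
        self.enabled = True
    def drain(self):
        self.enabled = False
        self.timer.cancel()     # the fix: disarm before disabling
```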

View File

@@ -14,7 +14,9 @@
#include "exceptions/exceptions.hh"
#include "utils/class_registrator.hh"
const sstring compressor::namespace_prefix = "org.apache.cassandra.io.compress.";
sstring compressor::make_name(std::string_view short_name) {
return seastar::format("org.apache.cassandra.io.compress.{}", short_name);
}
class lz4_processor: public compressor {
public:
@@ -66,7 +68,7 @@ compressor::ptr_type compressor::create(const sstring& name, const opt_getter& o
return {};
}
qualified_name qn(namespace_prefix, name);
qualified_name qn(make_name(""), name);
for (auto& c : { lz4, snappy, deflate }) {
if (c->name() == static_cast<const sstring&>(qn)) {
@@ -91,9 +93,9 @@ shared_ptr<compressor> compressor::create(const std::map<sstring, sstring>& opti
return {};
}
thread_local const shared_ptr<compressor> compressor::lz4 = ::make_shared<lz4_processor>(namespace_prefix + "LZ4Compressor");
thread_local const shared_ptr<compressor> compressor::snappy = ::make_shared<snappy_processor>(namespace_prefix + "SnappyCompressor");
thread_local const shared_ptr<compressor> compressor::deflate = ::make_shared<deflate_processor>(namespace_prefix + "DeflateCompressor");
thread_local const shared_ptr<compressor> compressor::lz4 = ::make_shared<lz4_processor>(make_name("LZ4Compressor"));
thread_local const shared_ptr<compressor> compressor::snappy = ::make_shared<snappy_processor>(make_name("SnappyCompressor"));
thread_local const shared_ptr<compressor> compressor::deflate = ::make_shared<deflate_processor>(make_name("DeflateCompressor"));
const sstring compression_parameters::SSTABLE_COMPRESSION = "sstable_compression";
const sstring compression_parameters::CHUNK_LENGTH_KB = "chunk_length_in_kb";
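The refactor replaces ad-hoc `namespace_prefix + name` concatenations with a single `make_name` helper. A sketch of the same idea, including the lookup that accepts either a bare or a fully qualified compressor name (hypothetical standalone helpers):

```python
NAMESPACE_PREFIX = "org.apache.cassandra.io.compress."

def make_name(short_name):
    """Fully qualify a compressor class name, Cassandra-style."""
    return NAMESPACE_PREFIX + short_name

def qualify(name):
    """Accept either 'LZ4Compressor' or its fully qualified form."""
    return name if name.startswith(NAMESPACE_PREFIX) else make_name(name)
```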

View File

@@ -69,7 +69,7 @@ public:
static thread_local const ptr_type snappy;
static thread_local const ptr_type deflate;
static const sstring namespace_prefix;
static sstring make_name(std::string_view short_name);
};
template<typename BaseType, typename... Args>

View File

@@ -1113,7 +1113,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/lister.cc',
'repair/repair.cc',
'repair/row_level.cc',
'repair/table_check.cc',
'streaming/table_check.cc',
'exceptions/exceptions.cc',
'auth/allow_all_authenticator.cc',
'auth/allow_all_authorizer.cc',
@@ -1323,6 +1323,7 @@ scylla_tests_generic_dependencies = [
'test/lib/test_utils.cc',
'test/lib/tmpdir.cc',
'test/lib/sstable_run_based_compaction_strategy_for_tests.cc',
'test/lib/eventually.cc',
]
scylla_tests_dependencies = scylla_core + alternator + idls + scylla_tests_generic_dependencies + [
@@ -1363,6 +1364,7 @@ scylla_perfs = ['test/perf/perf_alternator.cc',
'test/lib/key_utils.cc',
'test/lib/random_schema.cc',
'test/lib/data_model.cc',
'test/lib/eventually.cc',
'seastar/tests/perf/linux_perf_event.cc']
deps = {
@@ -1470,7 +1472,7 @@ deps['test/boost/bytes_ostream_test'] = [
"test/lib/log.cc",
]
deps['test/boost/input_stream_test'] = ['test/boost/input_stream_test.cc']
deps['test/boost/UUID_test'] = ['utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'utils/hashers.cc', 'utils/on_internal_error.cc']
deps['test/boost/UUID_test'] = ['clocks-impl.cc', 'utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'utils/hashers.cc', 'utils/on_internal_error.cc']
deps['test/boost/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'test/boost/murmur_hash_test.cc']
deps['test/boost/allocation_strategy_test'] = ['test/boost/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['test/boost/log_heap_test'] = ['test/boost/log_heap_test.cc']
@@ -1501,11 +1503,11 @@ deps['test/boost/rust_test'] += ['rust/inc/src/lib.rs']
deps['test/boost/group0_cmd_merge_test'] += ['test/lib/expr_test_utils.cc']
deps['test/raft/replication_test'] = ['test/raft/replication_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/raft_server_test'] = ['test/raft/raft_server_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/replication_test'] = ['test/raft/replication_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc', 'test/lib/eventually.cc'] + scylla_raft_dependencies
deps['test/raft/raft_server_test'] = ['test/raft/raft_server_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc', 'test/lib/eventually.cc'] + scylla_raft_dependencies
deps['test/raft/randomized_nemesis_test'] = ['test/raft/randomized_nemesis_test.cc', 'direct_failure_detector/failure_detector.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/failure_detector_test'] = ['test/raft/failure_detector_test.cc', 'direct_failure_detector/failure_detector.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/many_test'] = ['test/raft/many_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc'] + scylla_raft_dependencies
deps['test/raft/many_test'] = ['test/raft/many_test.cc', 'test/raft/replication.cc', 'test/raft/helpers.cc', 'test/lib/eventually.cc'] + scylla_raft_dependencies
deps['test/raft/fsm_test'] = ['test/raft/fsm_test.cc', 'test/raft/helpers.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['test/raft/etcd_test'] = ['test/raft/etcd_test.cc', 'test/raft/helpers.cc', 'test/lib/log.cc'] + scylla_raft_dependencies
deps['test/raft/raft_sys_table_storage_test'] = ['test/raft/raft_sys_table_storage_test.cc'] + \

View File

@@ -27,6 +27,8 @@
#include "gms/feature_service.hh"
#include "replica/database.hh"
using namespace std::string_literals;
static logging::logger mylogger("alter_keyspace");
bool is_system_keyspace(std::string_view keyspace);
@@ -207,6 +209,25 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
auto ts = mc.write_timestamp();
auto global_request_id = mc.new_group0_state_id();
// #22688 - filter out any dc*:0 entries - consider these
// null and void (removed). Migration planning will treat it
// as dc*=0 still.
std::erase_if(ks_options, [](const auto& i) {
static constexpr std::string replication_prefix = ks_prop_defs::KW_REPLICATION + ":"s;
// Flattened map, replication entries starts with "replication:".
// Only valid options are replication_factor, class and per-dc rf:s. We want to
// filter out any dcN=0 entries.
auto& [key, val] = i;
if (key.starts_with(replication_prefix) && val == "0") {
std::string_view v(key);
v.remove_prefix(replication_prefix.size());
return v != ks_prop_defs::REPLICATION_FACTOR_KEY
&& v != ks_prop_defs::REPLICATION_STRATEGY_CLASS_KEY
;
}
return false;
});
// we only want to run the tablets path if there are actually any tablets changes, not only schema changes
// TODO: the current `if (changes_tablets(qp))` is insufficient: someone may set the same RFs as before,
// and we'll unnecessarily trigger the processing path for ALTER tablets KS,
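The `erase_if` above works on a flattened options map where replication entries are keyed as `replication:<key>`; per-DC entries with value `"0"` are dropped (treated as removed DCs), while the `replication_factor` and strategy-class keys survive even at `"0"`. A dict sketch of that filter, with key names assumed to match the flattened form:

```python
REPLICATION_PREFIX = "replication:"
KEEP_AT_ZERO = {"replication_factor", "class"}  # assumed key names

def filter_zero_dcs(ks_options):
    """Drop replication:dcN == "0" entries; treat them as removed DCs."""
    def is_removed_dc(key, val):
        if not key.startswith(REPLICATION_PREFIX) or val != "0":
            return False
        subkey = key[len(REPLICATION_PREFIX):]
        return subkey not in KEEP_AT_ZERO
    return {k: v for k, v in ks_options.items() if not is_removed_dc(k, v)}
```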

View File

@@ -69,6 +69,16 @@ static std::map<sstring, sstring> prepare_options(
}
}
// #22688 / #20039 - check for illegal, empty options (after above expand)
// moved to here. We want to be able to remove dc:s once rf=0,
// in which case, the options actually serialized in result mutations
// will in extreme cases in fact be empty -> cannot do this check in
// verify_options. We only want to apply this constraint on the input
// provided by the user
if (options.empty() && !tm.get_topology().get_datacenters().empty()) {
throw exceptions::configuration_exception("Configuration for at least one datacenter must be present");
}
return options;
}
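The guard above runs after option expansion rather than in `verify_options`, because the rf=0 removal path can legitimately leave the serialized map empty; only user input empty of all datacenters should be rejected. A sketch of the ordering (hypothetical helper, simplified to a None-filter for the expansion step):

```python
def prepare_options(options, datacenters):
    """Normalize options, then reject empty user input when the
    topology actually has datacenters."""
    options = {k: v for k, v in options.items() if v is not None}
    if not options and datacenters:
        raise ValueError(
            "Configuration for at least one datacenter must be present")
    return options
```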

View File

@@ -54,7 +54,7 @@ list_service_level_statement::execute(query_processor& qp,
return make_ready_future().then([this, &state] () {
if (_describe_all) {
return state.get_service_level_controller().get_distributed_service_levels();
return state.get_service_level_controller().get_distributed_service_levels(qos::query_context::user);
} else {
return state.get_service_level_controller().get_distributed_service_level(_service_level);
}

View File

@@ -1588,7 +1588,7 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
std::vector<std::pair<sseg_ptr, uint64_t>> maybe_clear;
assert(_request_controller.available_units() <= ssize_t(max_request_controller_units()));
SCYLLA_ASSERT(_request_controller.available_units() <= ssize_t(max_request_controller_units()));
auto fut = get_units(_request_controller, max_request_controller_units(), timeout);
if (_request_controller.waiters()) {
totals.requests_blocked_memory++;
@@ -1597,7 +1597,7 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
scope_increment_counter allocating(totals.active_allocations);
auto permit = co_await std::move(fut);
assert(_request_controller.available_units() == 0);
SCYLLA_ASSERT(_request_controller.available_units() == 0);
decltype(permit) fake_permit; // can't have allocate+sync release semaphore.
bool failed = false;
@@ -1858,10 +1858,13 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
}
}
}
assert(_request_controller.available_units() == 0);
SCYLLA_ASSERT(_request_controller.available_units() == 0);
SCYLLA_ASSERT(permit.count() == max_request_controller_units());
auto nw = _request_controller.waiters();
permit.return_all();
assert(_request_controller.available_units() == ssize_t(max_request_controller_units()));
// #20633 cannot guarantee controller avail is now full, since we could have had waiters when doing
// return all -> now will be less avail
SCYLLA_ASSERT(nw > 0 || _request_controller.available_units() == ssize_t(max_request_controller_units()));
if (!failed) {
clogger.trace("Oversized allocation succeeded.");
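The relaxed assertion above reflects how a counting semaphore behaves: if waiters queued up while the permit held all units, `return_all()` hands units straight to them, so availability can legitimately be below the maximum afterwards. A toy semaphore demonstrating this (hypothetical, much simplified from Seastar's `semaphore`):

```python
from collections import deque

class Semaphore:
    def __init__(self, units):
        self.available = units
        self._waiters = deque()   # FIFO of units wanted

    def try_wait(self, units):
        if self.available >= units and not self._waiters:
            self.available -= units
            return True
        self._waiters.append(units)
        return False

    def signal(self, units):
        self.available += units
        # Hand freed units to queued waiters first.
        while self._waiters and self.available >= self._waiters[0]:
            self.available -= self._waiters.popleft()

sem = Semaphore(4)
sem.try_wait(4)          # oversized allocation grabs everything
sem.try_wait(1)          # another fiber starts waiting
sem.signal(4)            # return_all(): the waiter is served immediately
```

Hence the fixed assertion only requires full availability when there were no waiters (`nw > 0 || available == max`).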

View File

@@ -27,6 +27,7 @@
#include "cdc/cdc_extension.hh"
#include "tombstone_gc_extension.hh"
#include "db/per_partition_rate_limit_extension.hh"
#include "db/paxos_grace_seconds_extension.hh"
#include "db/tags/extension.hh"
#include "config.hh"
#include "extensions.hh"
@@ -1192,6 +1193,18 @@ void db::config::add_tombstone_gc_extension() {
_extensions->add_schema_extension<tombstone_gc_extension>(tombstone_gc_extension::NAME);
}
void db::config::add_paxos_grace_seconds_extension() {
_extensions->add_schema_extension<db::paxos_grace_seconds_extension>(db::paxos_grace_seconds_extension::NAME);
}
void db::config::add_all_default_extensions() {
add_cdc_extension();
add_per_partition_rate_limit_extension();
add_tags_extension();
add_tombstone_gc_extension();
add_paxos_grace_seconds_extension();
}
void db::config::setup_directories() {
maybe_in_workdir(commitlog_directory, "commitlog");
if (!schema_commitlog_directory.is_set()) {

View File

@@ -140,6 +140,9 @@ public:
void add_per_partition_rate_limit_extension();
void add_tags_extension();
void add_tombstone_gc_extension();
void add_paxos_grace_seconds_extension();
void add_all_default_extensions();
/// True iff the feature is enabled.
bool check_experimental(experimental_features_t::feature f) const;

View File

@@ -741,8 +741,8 @@ system_distributed_keyspace::get_cdc_desc_v1_timestamps(context ctx) {
co_return res;
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_levels() const {
return qos::get_service_levels(_qp, NAME, SERVICE_LEVELS, db::consistency_level::ONE);
future<qos::service_levels_info> system_distributed_keyspace::get_service_levels(qos::query_context ctx) const {
return qos::get_service_levels(_qp, NAME, SERVICE_LEVELS, db::consistency_level::ONE, ctx);
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_level(sstring service_level_name) const {

View File

@@ -112,7 +112,7 @@ public:
future<db_clock::time_point> cdc_current_generation_timestamp(context);
future<qos::service_levels_info> get_service_levels() const;
future<qos::service_levels_info> get_service_levels(qos::query_context ctx) const;
future<qos::service_levels_info> get_service_level(sstring service_level_name) const;
future<> set_service_level(sstring service_level_name, qos::service_level_options slo) const;
future<> drop_service_level(sstring service_level_name) const;

View File

@@ -2044,7 +2044,6 @@ future<> view_builder::start_in_background(service::migration_manager& mm, utils
// the view build information.
fail.cancel();
co_await barrier.arrive_and_wait();
units.return_all();
co_await calculate_shard_build_step(vbi);
_mnotifier.register_listener(this);
@@ -2349,7 +2348,7 @@ static future<> announce_with_raft(
future<> view_builder::mark_view_build_started(sstring ks_name, sstring view_name) {
co_await write_view_build_status(
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
co_await utils::get_local_injector().inject("view_builder_pause_add_new_view",
[] (auto& handler) { return handler.wait_for_message(db::timeout_clock::now() + std::chrono::minutes(5)); });
const sstring query_string = format("INSERT INTO {}.{} (keyspace_name, view_name, host_id, status) VALUES (?, ?, ?, ?)",
@@ -2359,7 +2358,7 @@ future<> view_builder::mark_view_build_started(sstring ks_name, sstring view_nam
{std::move(ks_name), std::move(view_name), host_id.uuid(), "STARTED"},
"view builder: mark view build STARTED");
},
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
co_await utils::get_local_injector().inject("view_builder_pause_add_new_view",
[] (auto& handler) { return handler.wait_for_message(db::timeout_clock::now() + std::chrono::minutes(5)); });
co_await _sys_dist_ks.start_view_build(std::move(ks_name), std::move(view_name));
@@ -2369,7 +2368,7 @@ future<> view_builder::mark_view_build_started(sstring ks_name, sstring view_nam
future<> view_builder::mark_view_build_success(sstring ks_name, sstring view_name) {
co_await write_view_build_status(
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
co_await utils::get_local_injector().inject("view_builder_pause_mark_success",
[] (auto& handler) { return handler.wait_for_message(db::timeout_clock::now() + std::chrono::minutes(5)); });
const sstring query_string = format("UPDATE {}.{} SET status = ? WHERE keyspace_name = ? AND view_name = ? AND host_id = ?",
@@ -2379,7 +2378,7 @@ future<> view_builder::mark_view_build_success(sstring ks_name, sstring view_nam
{"SUCCESS", std::move(ks_name), std::move(view_name), host_id.uuid()},
"view builder: mark view build SUCCESS");
},
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
co_await utils::get_local_injector().inject("view_builder_pause_mark_success",
[] (auto& handler) { return handler.wait_for_message(db::timeout_clock::now() + std::chrono::minutes(5)); });
co_await _sys_dist_ks.finish_view_build(std::move(ks_name), std::move(view_name));
@@ -2389,14 +2388,14 @@ future<> view_builder::mark_view_build_success(sstring ks_name, sstring view_nam
future<> view_builder::remove_view_build_status(sstring ks_name, sstring view_name) {
co_await write_view_build_status(
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
const sstring query_string = format("DELETE FROM {}.{} WHERE keyspace_name = ? AND view_name = ?",
db::system_keyspace::NAME, db::system_keyspace::VIEW_BUILD_STATUS_V2);
co_await announce_with_raft(_qp, _group0_client, _as, std::move(query_string),
{std::move(ks_name), std::move(view_name)},
"view builder: delete view build status");
},
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
co_await _sys_dist_ks.remove_view(std::move(ks_name), std::move(view_name));
}
);
@@ -2444,11 +2443,11 @@ view_builder::view_build_statuses(sstring keyspace, sstring view_name) const {
future<> view_builder::add_new_view(view_ptr view, build_step& step) {
vlogger.info0("Building view {}.{}, starting at token {}", view->ks_name(), view->cf_name(), step.current_token());
if (this_shard_id() == 0) {
co_await mark_view_build_started(view->ks_name(), view->cf_name());
}
co_await _sys_ks.register_view_for_building(view->ks_name(), view->cf_name(), step.current_token());
step.build_status.emplace(step.build_status.begin(), view_build_status{view, step.current_token(), std::nullopt});
auto f = this_shard_id() == 0 ? mark_view_build_started(view->ks_name(), view->cf_name()) : make_ready_future<>();
return when_all_succeed(
std::move(f),
_sys_ks.register_view_for_building(view->ks_name(), view->cf_name(), step.current_token())).discard_result();
}
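`when_all_succeed` above starts the status write and the build registration concurrently and waits for both instead of sequencing them. The equivalent shape with `asyncio.gather` (hypothetical coroutines standing in for the two writes):

```python
import asyncio

async def mark_view_build_started(log):
    log.append("started")

async def register_view_for_building(log):
    log.append("registered")

async def add_new_view():
    log = []
    # Run both writes concurrently and wait for both to succeed,
    # like when_all_succeed(f, register(...)).discard_result().
    await asyncio.gather(mark_view_build_started(log),
                         register_view_for_building(log))
    return log
```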
static future<> flush_base(lw_shared_ptr<replica::column_family> base, abort_source& as) {
@@ -2990,6 +2989,12 @@ public:
_step.build_status.pop_back();
}
}
// before going back to the minimum token, advance current_key to the end
// and check for built views in that range.
_step.current_key = {_step.prange.end().value_or(dht::ring_position::max()).value().token(), partition_key::make_empty()};
check_for_built_views();
_step.current_key = {dht::minimum_token(), partition_key::make_empty()};
for (auto&& vs : _step.build_status) {
vs.next_token = dht::minimum_token();

View File

@@ -20,7 +20,6 @@ ExecStart=/usr/bin/scylla $SCYLLA_ARGS $SEASTAR_IO $DEV_MODE $CPUSET $MEM_CONF
ExecStopPost=+/opt/scylladb/scripts/scylla_stop
TimeoutStartSec=1y
TimeoutStopSec=900
KillMode=process
Restart=on-abnormal
User=scylla
OOMScoreAdjust=-950

View File

@@ -12,6 +12,7 @@ ScyllaDB Architecture
SSTable <sstable/index/>
Compaction Strategies <compaction/compaction-strategies>
Raft Consensus Algorithm in ScyllaDB </architecture/raft>
Zero-token Nodes </architecture/zero-token-nodes>
* :doc:`Data Distribution with Tablets </architecture/tablets/>` - Tablets in ScyllaDB
@@ -22,5 +23,6 @@ ScyllaDB Architecture
* :doc:`SSTable </architecture/sstable/index/>` - ScyllaDB SSTable 2.0 and 3.0 Format Information
* :doc:`Compaction Strategies </architecture/compaction/compaction-strategies>` - High-level analysis of different compaction strategies
* :doc:`Raft Consensus Algorithm in ScyllaDB </architecture/raft>` - Overview of how Raft is implemented in ScyllaDB.
* :doc:`Zero-token Nodes </architecture/zero-token-nodes>` - Nodes that do not replicate any data.
Learn more about these topics in the `ScyllaDB University: Architecture lesson <https://university.scylladb.com/courses/scylla-essentials-overview/lessons/architecture/>`_.

View File

@@ -0,0 +1,23 @@
=========================
Zero-token Nodes
=========================
By default, all nodes in a cluster own a set of token ranges and are used to
replicate data. In certain circumstances, you may choose to add a node that
doesn't own any token. Such nodes are referred to as zero-token nodes. They
do not have a copy of the data but only participate in Raft quorum voting.
To configure a zero-token node, set the ``join_ring`` parameter to ``false``.
You can use zero-token nodes in multi-DC deployments to reduce the risk of
losing a quorum of nodes.
See :doc:`Preventing Quorum Loss in Symmetrical Multi-DC Clusters </operating-scylla/procedures/cluster-management/arbiter-dc>` for details.
Note that:
* Zero-token nodes are ignored by drivers, so there is no need to change
the load balancing policy on the clients after adding zero-token nodes
to the cluster.
* Zero-token nodes never store replicated data, so running ``nodetool rebuild``,
``nodetool repair``, and ``nodetool cleanup`` can be skipped, as these
operations do not affect zero-token nodes.

View File

@@ -62,16 +62,18 @@ Briefly:
- `/task_manager/list_module_tasks/{module}` -
lists (by default non-internal) tasks in the module;
- `/task_manager/task_status/{task_id}` -
gets the task's status, unregisters the task if it's finished;
gets the task's status;
- `/task_manager/abort_task/{task_id}` -
aborts the task if it's abortable;
- `/task_manager/wait_task/{task_id}` -
waits for the task and gets its status;
- `/task_manager/task_status_recursive/{task_id}` -
gets statuses of the task and all its descendants in BFS
order, unregisters the task;
order;
- `/task_manager/ttl` -
gets or sets the ttl.
- `/task_manager/drain/{module}` -
unregisters all finished local tasks in the module.
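For example, the new drain endpoint can be invoked directly over the REST API. A sketch that builds the request URL (assuming the default API port 10000; the request itself is a plain POST with no body):

```python
from urllib.parse import quote

def drain_url(host, module, port=10000):
    """Build the task-manager drain URL for a module."""
    return f"http://{host}:{port}/task_manager/drain/{quote(module)}"
```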
# Virtual tasks

View File

@@ -175,7 +175,7 @@ Recommended instances types are `n1-highmem <https://cloud.google.com/compute/do
* - n2-highmem-32
- 32
- 256
- 6,000
- 9,000
* - n2-highmem-48
- 48
- 384

View File

@@ -430,8 +430,8 @@ The content is dumped in JSON, using the following schema:
"estimated_tombstone_drop_time": $STREAMING_HISTOGRAM,
"sstable_level": Uint,
"repaired_at": Uint64,
"min_column_names": [Uint, ...],
"max_column_names": [Uint, ...],
"min_column_names": [String, ...],
"max_column_names": [String, ...],
"has_legacy_counter_shards": Bool,
"columns_count": Int64, // >= MC only
"rows_count": Int64, // >= MC only

View File

@@ -73,17 +73,19 @@ API calls
- *keyspace* - if set, tasks are filtered to contain only the ones working on this keyspace;
- *table* - if set, tasks are filtered to contain only the ones working on this table;
* ``/task_manager/task_status/{task_id}`` - gets the task's status, unregisters the task if it's finished;
* ``/task_manager/task_status/{task_id}`` - gets the task's status;
* ``/task_manager/abort_task/{task_id}`` - aborts the task if it's abortable, otherwise 403 status code is returned;
* ``/task_manager/wait_task/{task_id}`` - waits for the task and gets its status (does not unregister the tasks); query params:
* ``/task_manager/wait_task/{task_id}`` - waits for the task and gets its status; query params:
- *timeout* - timeout in seconds; if set - 408 status code is returned if waiting times out;
* ``/task_manager/task_status_recursive/{task_id}`` - gets statuses of the task and all its descendants in BFS order, unregisters the root task;
* ``/task_manager/task_status_recursive/{task_id}`` - gets statuses of the task and all its descendants in BFS order;
* ``/task_manager/ttl`` - gets or sets new ttl; query params (if setting):
- *ttl* - new ttl value.
* ``/task_manager/drain/{module}`` - unregisters all finished local tasks in the module.
Cluster tasks are not unregistered from task manager with API calls.
Tasks API
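The BFS ordering that `task_status_recursive` reports (described in the diff above) can be sketched with a toy traversal. This is only an illustration: the `Task` class and field names here are hypothetical stand-ins, not the real task-manager state.

```python
from collections import deque

# Hypothetical toy model: each task has an id and child tasks.
# The real task manager tracks far more (state, ttl, parent_id, ...).
class Task:
    def __init__(self, task_id, children=()):
        self.task_id = task_id
        self.children = list(children)

def statuses_in_bfs_order(root):
    """Collect task ids level by level, mirroring the BFS order in which
    /task_manager/task_status_recursive/{task_id} reports statuses."""
    order = []
    queue = deque([root])
    while queue:
        task = queue.popleft()
        order.append(task.task_id)
        queue.extend(task.children)
    return order

root = Task("repair", [Task("shard-0", [Task("range-a")]), Task("shard-1")])
print(statuses_in_bfs_order(root))  # root first, then its children, then grandchildren
```

The point of the change in the diff is orthogonal to the ordering: the traversal still runs in BFS order, but querying it no longer unregisters the task.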

View File

@@ -0,0 +1,21 @@
Nodetool tasks drain
====================
**tasks drain** - Unregisters all finished local tasks from the module.
If a module is not specified, finished tasks in all modules are unregistered.
Syntax
-------
.. code-block:: console
nodetool tasks drain [--module <module>]
Options
-------
* ``--module`` - if set, only the specified module is drained.
For example:
.. code-block:: shell
> nodetool tasks drain --module repair
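The drain semantics documented above — unregister finished tasks, optionally restricted to one module — can be modeled in a few lines. A toy sketch only; the task dicts and field names are invented for illustration:

```python
def drain(tasks, module=None):
    """Drop finished tasks, optionally restricted to one module --
    a toy model of `nodetool tasks drain [--module <module>]`."""
    return [t for t in tasks
            if not (t["finished"] and (module is None or t["module"] == module))]

tasks = [
    {"id": 1, "module": "repair", "finished": True},
    {"id": 2, "module": "repair", "finished": False},
    {"id": 3, "module": "compaction", "finished": True},
]
print(drain(tasks, module="repair"))  # only the finished repair task is removed
```

Running tasks are never touched, matching the command's "finished local tasks" wording.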

View File

@@ -5,6 +5,7 @@ Nodetool tasks
:hidden:
abort <abort>
drain <drain>
list <list>
modules <modules>
status <status>
@@ -17,10 +18,17 @@ Nodetool tasks
Task manager is an API-based tool for tracking long-running background operations, such as repair or compaction,
which makes them observable and controllable. Task manager operates per node.
Task Status Retention
---------------------
* When a task completes, its status is temporarily stored on the executing node
* Status information is retained for up to :confval:`task_ttl_in_seconds` seconds
Supported tasks suboperations
-----------------------------
* :doc:`abort </operating-scylla/nodetool-commands/tasks/abort>` - Aborts the task.
* :doc:`drain </operating-scylla/nodetool-commands/tasks/drain>` - Unregisters all finished local tasks.
* :doc:`list </operating-scylla/nodetool-commands/tasks/list>` - Lists tasks in the module.
* :doc:`modules </operating-scylla/nodetool-commands/tasks/modules>` - Lists supported modules.
* :doc:`status </operating-scylla/nodetool-commands/tasks/status>` - Gets status of the task.

View File

@@ -1,6 +1,6 @@
Nodetool tasks status
=========================
**tasks status** - Gets the status of a task manager task. If the task was finished it is unregistered.
**tasks status** - Gets the status of a task manager task.
Syntax
-------
@@ -23,10 +23,10 @@ Example output
type: repair
kind: node
scope: keyspace
state: done
state: running
is_abortable: true
start_time: 2024-07-29T15:48:55Z
end_time: 2024-07-29T15:48:55Z
end_time:
error:
parent_id: none
sequence_number: 5

View File

@@ -1,7 +1,7 @@
Nodetool tasks tree
=======================
**tasks tree** - Gets the statuses of a task manager task and all its descendants.
The statuses are listed in BFS order. If the task was finished it is unregistered.
The statuses are listed in BFS order.
If task_id isn't specified, trees of all non-internal tasks are printed
(internal tasks are the ones that have a parent or cover an operation that

View File

@@ -201,6 +201,7 @@ Add New DC
#. If you are using ScyllaDB Monitoring, update the `monitoring stack <https://monitoring.docs.scylladb.com/stable/install/monitoring_stack.html#configure-scylla-nodes-from-files>`_ to monitor it. If you are using ScyllaDB Manager, make sure you install the `Manager Agent <https://manager.docs.scylladb.com/stable/install-scylla-manager-agent.html>`_ and Manager can access the new DC.
.. _add-dc-to-existing-dc-not-connect-clients:
Configure the Client not to Connect to the New DC
-------------------------------------------------

View File

@@ -0,0 +1,57 @@
=========================================================
Preventing Quorum Loss in Symmetrical Multi-DC Clusters
=========================================================
ScyllaDB requires at least a quorum (majority) of nodes in a cluster to be up
and communicate with each other. A cluster that loses a quorum can handle reads
and writes of user data, but cluster management operations, such as schema and
topology updates, are impossible.
In clusters that are symmetrical, i.e., have two datacenters (DCs) with the same number of
nodes, losing a quorum may occur if one of the DCs becomes unavailable.
For example, if one DC fails in a 2-DC cluster where each DC has three nodes,
only three out of six nodes are available, and the quorum is lost.
Adding another DC would mitigate the risk of losing a quorum, but it comes
with network and storage costs. To prevent the quorum loss with minimum costs,
you can configure an arbiter (tie-breaker) DC.
An arbiter DC is a datacenter with a :doc:`zero-token node </architecture/zero-token-nodes>`
-- a node that doesn't replicate any data but is only used for Raft quorum
voting. An arbiter DC maintains the cluster quorum if one of the other DCs
fails, while it doesn't incur extra network and storage costs as it has no
user data.
Adding an Arbiter DC
-----------------------
To set up an arbiter DC, follow the procedure to
:doc:`add a new datacenter to an existing cluster </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`.
When editing the *scylla.yaml* file, set the ``join_ring`` parameter to
``false`` following these guidelines:
* Set ``join_ring=false`` before you start the node(s). If you set that
parameter on a node that has already been bootstrapped and owns a token
range, the node startup will fail. In such a case, you'll need to
:doc:`decommission </operating-scylla/procedures/cluster-management/decommissioning-data-center>`
the node, :doc:`wipe it clean </operating-scylla/procedures/cluster-management/clear-data>`,
and add it back to the arbiter DC properly following
the :doc:`procedure </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`.
* As a rule, one node is sufficient for an arbiter to serve as a tie-breaker.
In case you add more than one node to the arbiter DC, ensure that you set
``join_ring=false`` on all the nodes in that DC.
Follow-up steps:
^^^^^^^^^^^^^^^^^^^
* An arbiter DC has a replication factor of 0 (RF=0) for all keyspaces. You
need to ``ALTER`` the keyspaces to update their RF.
* Since zero-token nodes are ignored by drivers, you can skip
:ref:`configuring the client not to connect to the new DC <add-dc-to-existing-dc-not-connect-clients>`.
References
----------------
* :doc:`Zero-token Nodes </architecture/zero-token-nodes>`
* :doc:`Raft Consensus Algorithm in ScyllaDB </architecture/raft>`
* :doc:`Handling Node Failures </troubleshooting/handling-node-failures>`
* :doc:`Adding a New Data Center Into an Existing ScyllaDB Cluster </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>`
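The quorum arithmetic behind the page added above can be sketched numerically. This is an illustration of the strict-majority rule only, assuming every node (including the zero-token arbiter) is a Raft voter:

```python
def has_quorum(total_voters, live_voters):
    """A Raft-style quorum needs a strict majority of voters."""
    return live_voters > total_voters // 2

# Symmetrical 2-DC cluster, 3 nodes each: losing one DC loses quorum.
print(has_quorum(6, 3))  # -> False: 3 of 6 is not a strict majority

# Add a one-node arbiter DC (zero-token, but still a voter): losing a
# data-bearing DC now leaves 4 of 7 voters, which is a majority.
print(has_quorum(7, 4))  # -> True
```

This is why a single tie-breaker node is sufficient: it changes the voter count from even to odd without replicating any user data.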

View File

@@ -209,6 +209,17 @@ In this example, we will show how to install a nine nodes cluster.
UN 54.187.142.201 109.54 KB 256 ? d99967d6-987c-4a54-829d-86d1b921470f RACK1
UN 54.187.168.20 109.54 KB 256 ? 2329c2e0-64e1-41dc-8202-74403a40f851 RACK1
See also:
--------------------------
Preventing Quorum Loss
--------------------------
If your cluster is symmetrical, i.e., it has an even number of datacenters
with the same number of nodes, consider adding an arbiter DC to mitigate
the risk of losing a quorum at a minimum cost.
See :doc:`Preventing Quorum Loss in Symmetrical Multi-DC Clusters </operating-scylla/procedures/cluster-management/arbiter-dc>`
for details.
------------
See also:
------------
:doc:`Create a ScyllaDB Cluster - Single Data Center (DC) </operating-scylla/procedures/cluster-management/create-cluster>`

View File

@@ -55,7 +55,7 @@ Procedure
cqlsh> DESCRIBE <KEYSPACE_NAME>
cqlsh> CREATE KEYSPACE <KEYSPACE_NAME> WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', '<DC_NAME1>' : 3, '<DC_NAME2>' : 3, '<DC_NAME3>' : 3};
cqlsh> ALTER KEYSPACE <KEYSPACE_NAME> WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', '<DC_NAME1>' : 3, '<DC_NAME2>' : 3};
cqlsh> ALTER KEYSPACE <KEYSPACE_NAME> WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', '<DC_NAME1>' : 3, '<DC_NAME2>' : 3, '<DC_NAME3>' : 0};
For example:
@@ -71,7 +71,7 @@ Procedure
.. code-block:: shell
cqlsh> ALTER KEYSPACE nba WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : 3, 'EUROPE-DC' : 3};
cqlsh> ALTER KEYSPACE nba WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : 3, 'ASIA-DC' : 0, 'EUROPE-DC' : 3};
#. Run :doc:`nodetool decommission </operating-scylla/nodetool-commands/decommission>` on every node in the data center that is to be removed.
Refer to :doc:`Remove a Node from a ScyllaDB Cluster - Down Scale </operating-scylla/procedures/cluster-management/remove-node>` for further information.

View File

@@ -26,6 +26,8 @@ Cluster Management Procedures
Safely Restart Your Cluster <safe-start>
Handling Membership Change Failures <handling-membership-change-failures>
repair-based-node-operation
Prevent Quorum Loss in Symmetrical Multi-DC Clusters <arbiter-dc>
.. panel-box::
:title: Cluster and DC Creation
@@ -84,6 +86,8 @@ Cluster Management Procedures
* :doc:`Repair Based Node Operations (RBNO) </operating-scylla/procedures/cluster-management/repair-based-node-operation>`
* :doc:`Preventing Quorum Loss in Symmetrical Multi-DC Clusters <arbiter-dc>`
.. panel-box::
:title: Topology Changes
:id: "getting-started"

View File

@@ -0,0 +1,38 @@
Removing ScyllaDB with the "--autoremove" option on Ubuntu breaks system packages
======================================================================================
Problem
^^^^^^^
Running ``apt purge scylla --autoremove`` marks most system packages for
removal.
.. code::
root@myserv:~# apt purge scylla --autoremove
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages will be REMOVED:
apport-symptoms* bc* bcache-tools* bolt* btrfs-progs* byobu* cloud-guest-utils* cloud-init* cloud-initramfs-copymods* cloud-initramfs-dyn-netconf* cryptsetup* cryptsetup-initramfs* dmeventd* eatmydata* ethtool* fdisk* fonts-ubuntu-console* fwupd* fwupd-signed* gdisk* gir1.2-packagekitglib-1.0* git* git-man* kpartx* landscape-common* libaio1* libappstream4* libatasmart4* libblockdev-crypto2* libblockdev-fs2*
libblockdev-loop2* libblockdev-part-err2* libblockdev-part2* libblockdev-swap2* libblockdev-utils2* libblockdev2* libdevmapper-event1.02.1* libeatmydata1* liberror-perl* libfdisk1* libfwupd2* libfwupdplugin5* libgcab-1.0-0* libgpgme11* libgstreamer1.0-0* libgusb2* libinih1* libintl-perl* libintl-xs-perl* libjcat1* libjson-glib-1.0-0* libjson-glib-1.0-common* liblvm2cmd2.03* libmbim-glib4* libmbim-proxy*
libmm-glib0* libmodule-find-perl* libmodule-scandeps-perl* libmspack0* libpackagekit-glib2-18* libparted-fs-resize0* libproc-processtable-perl* libqmi-glib5* libqmi-proxy* libsgutils2-2* libsmbios-c2* libsort-naturally-perl* libstemmer0d* libtcl8.6* libterm-readkey-perl* libudisks2-0* liburcu8* libutempter0* libvolume-key1* libxmlb2* libxmlsec1* libxmlsec1-openssl* libxslt1.1* lvm2* lxd-agent-loader* mdadm*
modemmanager* motd-news-config* multipath-tools* needrestart* open-vm-tools* overlayroot* packagekit* packagekit-tools* pastebinit* patch* pollinate* python3-apport* python3-certifi* python3-chardet* python3-configobj* python3-debconf* python3-debian* python3-json-pointer* python3-jsonpatch* python3-jsonschema* python3-magic* python3-newt* python3-packaging* python3-pexpect* python3-problem-report*
python3-ptyprocess* python3-pyrsistent* python3-requests* python3-software-properties* python3-systemd* python3-xkit* run-one* sbsigntool* screen* scylla* scylla-conf* scylla-cqlsh* scylla-kernel-conf* scylla-node-exporter* scylla-python3* scylla-server* secureboot-db* sg3-utils* sg3-utils-udev* software-properties-common* sosreport* tcl* tcl8.6* thin-provisioning-tools* tmux* ubuntu-drivers-common* udisks2*
unattended-upgrades* update-notifier-common* usb-modeswitch* usb-modeswitch-data* xfsprogs* zerofree*
0 upgraded, 0 newly installed, 139 to remove and 0 not upgraded.
Cause
^^^^^^^
This problem may occur on Ubuntu 22.04 or earlier. It is caused by
the ``systemd-coredump`` package installed with the ``scylla_setup`` script.
Installing ``systemd-coredump`` results in removing ``apport`` and ``ubuntu-server``.
In turn, the ``--autoremove`` option marks for removal all packages installed
by ``ubuntu-server`` dependencies.
Solution
^^^^^^^^^^
Do not run the ``--autoremove`` option when removing ScyllaDB.

View File

@@ -16,6 +16,4 @@ Solution
2. If you are deleting an entire keyspace, repeat the procedure above for every table inside the keyspace.
3. This behavior is controlled by the ``auto_snapshot`` flag in ``/etc/scylla/scylla.yaml``, which set to true by default. To stop taking snapshots on deletion, set that flag to false and restart all your scylla nodes.
.. note:: Alternatively you can use the``rm`` Linux utility to remove the files. If you do, keep in mind that the ``rm`` Linux utility is not aware if some snapshots are still associated with existing keyspaces, but nodetool is.
.. note:: Alternatively you can use the ``rm`` Linux utility to remove the files. If you do, keep in mind that the ``rm`` Linux utility is not aware if some snapshots are still associated with existing keyspaces, but nodetool is.

View File

@@ -14,6 +14,7 @@ Troubleshooting ScyllaDB
storage/index
CQL/index
monitor/index
install-remove/index
ScyllaDB's troubleshooting section contains articles which are targeted to pinpoint and answer problems with ScyllaDB. For broader issues and workarounds, consult the :doc:`Knowledge base </kb/index>`.
@@ -33,6 +34,7 @@ Keep your versions up-to-date. The two latest versions are supported. Also, alwa
* :doc:`Data Storage and SSTables <storage/index>`
* :doc:`CQL errors <CQL/index>`
* :doc:`ScyllaDB Monitoring and ScyllaDB Manager <monitor/index>`
* :doc:`Installation and Removal <install-remove/index>`
Also check out the `Monitoring lesson <https://university.scylladb.com/courses/scylla-operations/lessons/scylla-monitoring/>`_ on ScyllaDB University, which covers how to troubleshoot different issues when running a ScyllaDB cluster.

View File

@@ -0,0 +1,13 @@
Installation and Removal
===========================
.. toctree::
:hidden:
:maxdepth: 2
Removing ScyllaDB on Ubuntu breaks system packages </troubleshooting/autoremove-ubuntu/>
* :doc:`Removing ScyllaDB with the "--autoremove" option on Ubuntu breaks system packages </troubleshooting/autoremove-ubuntu/>`

View File

@@ -279,17 +279,12 @@ Once you have collected and compressed your reports, send them to ScyllaDB for a
curl -X PUT https://upload.scylladb.com/$report_uuid/yourfile -T yourfile
For example with the health check report and node health check report:
.. code-block:: shell
curl -X PUT https://upload.scylladb.com/$report_uuid/output_files.tgz -T output_files.tgz
For example with the Scylla Doctor's vitals:
.. code-block:: shell
curl -X PUT https://upload.scylladb.com/$report_uuid/192.0.2.0-health-check-report.txt -T 192.0.2.0-health-check-report.txt
curl -X PUT https://upload.scylladb.com/$report_uuid/my_cluster_123_vitals.tgz -T my_cluster_123_vitals.tgz
The **UUID** you generated replaces the variable ``$report_uuid`` at runtime. ``yourfile`` is any file you need to send to ScyllaDB support.

View File

@@ -162,54 +162,27 @@ Download and install the new release
.. group-tab:: EC2/GCP/Azure Ubuntu Image
Before upgrading, check what version you are running now using ``scylla --version``. You should use the same version as this version in case you want to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
Before upgrading, check what version you are running now using ``scylla --version``. You should use the same version as this version in case you want to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
There are two alternative upgrade procedures: upgrading ScyllaDB and simultaneously updating 3rd party and OS packages - recommended if you
are running a ScyllaDB official image (EC2 AMI, GCP, and Azure images), which is based on Ubuntu 20.04, and upgrading ScyllaDB without updating
any external packages.
If you're using the ScyllaDB official image (recommended), see
the **Debian/Ubuntu** tab for upgrade instructions. If you're using your
own image and have installed ScyllaDB packages for Ubuntu or Debian,
you need to apply an extended upgrade procedure:
#. Update the ScyllaDB deb repo (see above).
#. Configure Java 1.8 (see above).
#. Install the new ScyllaDB version with the additional
``scylla-enterprise-machine-image`` package:
**To upgrade ScyllaDB and update 3rd party and OS packages (RECOMMENDED):**
Choosing this upgrade procedure allows you to upgrade your ScyllaDB version and update the 3rd party and OS packages using one command.
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Load the new repo:
.. code:: sh
sudo apt-get update
#. Run the following command to update the manifest file:
.. code:: sh
cat scylla-enterprise-packages-<version>-<arch>.txt | sudo xargs -n1 apt-get install -y
Where:
* ``<version>`` - The ScyllaDB Enterprise version to which you are upgrading ( |NEW_VERSION| ).
* ``<arch>`` - Architecture type: ``x86_64`` or ``aarch64``.
The file is included in the ScyllaDB Enterprise packages downloaded in the previous step. The file location is ``http://downloads.scylladb.com/downloads/scylla/aws/manifest/scylla-packages-<version>-<arch>.txt``
Example:
.. code:: sh
cat scylla-enterprise-packages-2022.2.0-x86_64.txt | sudo xargs -n1 apt-get install -y
.. note::
Alternatively, you can update the manifest file with the following command:
``sudo apt-get install $(awk '{print $1'} scylla-enterprise-packages-<version>-<arch>.txt) -y``
To upgrade ScyllaDB without updating any external packages, follow the :ref:`download and installation instructions for Debian/Ubuntu <upgrade-debian-ubuntu-5.2-to-enterprise-2023.1>`.
.. code::
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla-enterprise
sudo apt-get dist-upgrade scylla-enterprise-machine-image
#. Run ``scylla_setup`` without running ``io_setup``.
#. Run ``sudo /opt/scylladb/scylla-machine-image/scylla_cloud_io_setup``.
Start the node
--------------

View File

@@ -38,12 +38,13 @@ connection::~connection()
_server._connections_list.erase(iter);
}
future<> server::for_each_gently(noncopyable_function<future<>(connection&)> fn) {
future<> server::for_each_gently(noncopyable_function<void(connection&)> fn) {
_gentle_iterators.emplace_front(*this);
std::list<gentle_iterator>::iterator gi = _gentle_iterators.begin();
return seastar::do_until([ gi ] { return gi->iter == gi->end; },
[ gi, fn = std::move(fn) ] {
return fn(*(gi->iter++));
fn(*(gi->iter++));
return make_ready_future<>();
}
).finally([ this, gi ] { _gentle_iterators.erase(gi); });
}
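The `for_each_gently` change above narrows the callback from future-returning to plain `void`, while the loop itself still yields between elements. A hedged asyncio analogy (not the Seastar API) of "gentle" iteration — apply a synchronous callback to one element per scheduling point so other tasks can run in between:

```python
import asyncio

async def for_each_gently(items, fn):
    # Apply a synchronous fn to each item, yielding to the event loop
    # between items -- roughly analogous to seastar::do_until driving
    # one step of the iteration per ready future.
    for item in items:
        fn(item)
        await asyncio.sleep(0)  # explicit preemption point

results = []
asyncio.run(for_each_gently([1, 2, 3], results.append))
print(results)
```

With the callback itself synchronous, each step completes immediately and the only suspension is the deliberate yield, which matches the diff's `fn(*(gi->iter++)); return make_ready_future<>();` shape.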

View File

@@ -118,7 +118,7 @@ protected:
virtual future<> unadvertise_connection(shared_ptr<connection> conn);
future<> for_each_gently(noncopyable_function<future<>(connection&)>);
future<> for_each_gently(noncopyable_function<void(connection&)>);
};
}

View File

@@ -149,7 +149,7 @@ public:
// the "features_enable_test_feature" injection is enabled.
// This feature MUST NOT be advertised in release mode!
gms::feature test_only_feature { *this, "TEST_ONLY_FEATURE"sv };
gms::feature enforced_raft_rpc_scheduling_group { *this, "ENFORCED_RAFT_RPC_SCHEDULING_GROUP"sv };
public:
const std::unordered_map<sstring, std::reference_wrapper<feature>>& registered_features() const;

View File

@@ -13,8 +13,8 @@
auto fmt::formatter<gms::gossip_digest_syn>::format(const gms::gossip_digest_syn& syn, fmt::format_context& ctx) const
-> decltype(ctx.out()) {
auto out = ctx.out();
// out = fmt::format_to(out, "cluster_id:{},partioner:{},group0_id{},"
// syn._cluster_id, syn._partioner, syn._group0_id);
out = fmt::format_to(out, "cluster_id:{},partioner:{},group0_id{},",
syn._cluster_id, syn._partioner, syn._group0_id);
out = fmt::format_to(out, "digests:{{");
for (auto& d : syn._digests) {
out = fmt::format_to(out, "{} ", d);

View File

@@ -356,31 +356,30 @@ future<> gossiper::handle_ack_msg(msg_addr id, gossip_digest_ack ack_msg) {
}
future<> gossiper::do_send_ack2_msg(msg_addr from, utils::chunked_vector<gossip_digest> ack_msg_digest) {
return futurize_invoke([this, from, ack_msg_digest = std::move(ack_msg_digest)] () mutable {
/* Get the state required to send to this gossipee - construct GossipDigestAck2Message */
std::map<inet_address, endpoint_state> delta_ep_state_map;
for (auto g_digest : ack_msg_digest) {
inet_address addr = g_digest.get_endpoint();
const auto es = get_endpoint_state_ptr(addr);
if (!es || es->get_heart_beat_state().get_generation() < g_digest.get_generation()) {
continue;
}
// Local generation for addr may have been increased since the
// current node sent an initial SYN. Comparing versions across
// different generations in get_state_for_version_bigger_than
// could result in losing some app states with smaller versions.
const auto version = es->get_heart_beat_state().get_generation() > g_digest.get_generation()
? version_type(0)
: g_digest.get_max_version();
auto local_ep_state_ptr = this->get_state_for_version_bigger_than(addr, version);
if (local_ep_state_ptr) {
delta_ep_state_map.emplace(addr, *local_ep_state_ptr);
}
/* Get the state required to send to this gossipee - construct GossipDigestAck2Message */
std::map<inet_address, endpoint_state> delta_ep_state_map;
for (auto g_digest : ack_msg_digest) {
inet_address addr = g_digest.get_endpoint();
const auto es = get_endpoint_state_ptr(addr);
if (!es || es->get_heart_beat_state().get_generation() < g_digest.get_generation()) {
continue;
}
gms::gossip_digest_ack2 ack2_msg(std::move(delta_ep_state_map));
logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
return ser::gossip_rpc_verbs::send_gossip_digest_ack2(&_messaging, from, std::move(ack2_msg));
});
// Local generation for addr may have been increased since the
// current node sent an initial SYN. Comparing versions across
// different generations in get_state_for_version_bigger_than
// could result in losing some app states with smaller versions.
const auto version = es->get_heart_beat_state().get_generation() > g_digest.get_generation()
? version_type(0)
: g_digest.get_max_version();
auto local_ep_state_ptr = get_state_for_version_bigger_than(addr, version);
if (local_ep_state_ptr) {
delta_ep_state_map.emplace(addr, *local_ep_state_ptr);
}
}
gms::gossip_digest_ack2 ack2_msg(std::move(delta_ep_state_map));
logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
co_await ser::gossip_rpc_verbs::send_gossip_digest_ack2(&_messaging, from, std::move(ack2_msg));
logger.debug("finished do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
}
// Depends on
@@ -683,6 +682,10 @@ future<> gossiper::apply_state_locally(std::map<inet_address, endpoint_state> ma
// If there is no host id in the new state there should be one locally
hid = get_host_id(ep);
}
if (hid == my_host_id()) {
logger.trace("Ignoring gossip for {} because it maps to local id, but is not local address", ep);
return make_ready_future<>();
}
if (_topo_sm->_topology.left_nodes.contains(raft::server_id(hid.uuid()))) {
logger.trace("Ignoring gossip for {} because it left", ep);
return make_ready_future<>();

View File

@@ -333,8 +333,10 @@ public:
void set_topology_state_machine(service::topology_state_machine* m) {
_topo_sm = m;
// In raft topology mode the coodinator maintains banned nodes list
_just_removed_endpoints.clear();
if (m) {
// In raft topology mode the coodinator maintains banned nodes list
_just_removed_endpoints.clear();
}
}
private:

View File

@@ -677,6 +677,8 @@ future<> global_vnode_effective_replication_map::get_keyspace_erms(sharded<repli
// all under the lock.
auto lk = co_await db.get_shared_token_metadata().get_lock();
auto erm = db.find_keyspace(keyspace_name).get_vnode_effective_replication_map();
utils::get_local_injector().inject("get_keyspace_erms_throw_no_such_keyspace",
[&keyspace_name] { throw data_dictionary::no_such_keyspace{keyspace_name}; });
auto ring_version = erm->get_token_metadata().get_ring_version();
_erms[0] = make_foreign(std::move(erm));
co_await coroutine::parallel_for_each(boost::irange(1u, smp::count), [this, &sharded_db, keyspace_name, ring_version] (unsigned shard) -> future<> {

View File

@@ -269,9 +269,9 @@ network_topology_strategy::calculate_natural_endpoints(
}
void network_topology_strategy::validate_options(const gms::feature_service& fs) const {
if(_config_options.empty()) {
throw exceptions::configuration_exception("Configuration for at least one datacenter must be present");
}
// #22688 / #20039 - we want to remove dc:s once rf=0, and we
// also want to allow fully setting rf=0 in _all_ dc:s (hello data loss)
// so empty options here are in fact ok. Removed check for it
validate_tablet_options(*this, fs, _config_options);
auto tablet_opts = recognized_tablet_options();
for (auto& c : _config_options) {
@@ -376,7 +376,10 @@ future<tablet_replica_set> network_topology_strategy::reallocate_tablets(schema_
++nodes_per_dc[node.dc_rack().dc];
}
for (const auto& [dc, dc_rf] : _dc_rep_factor) {
// #22688 - take all dcs in topology into account when determining migration.
// Any change should still have been pre-checked to never exceed rf factor one.
for (const auto& dc : tm->get_topology().get_datacenters()) {
auto dc_rf = get_replication_factor(dc);
auto dc_node_count = nodes_per_dc[dc];
if (dc_rf == dc_node_count) {
continue;
@@ -430,8 +433,8 @@ future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_p
continue;
}
const auto& existing = replicas_per_rack[rack];
auto& candidate = existing.empty() ?
new_racks.emplace_back(rack) : existing_racks.emplace_back(rack);
candidates_list& rack_list = existing.empty() ? new_racks : existing_racks;
auto& candidate = rack_list.emplace_back(rack);
for (const auto& node : nodes) {
if (!node->is_normal()) {
continue;
@@ -442,7 +445,7 @@ future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_p
}
}
if (candidate.nodes.empty()) {
existing_racks.pop_back();
rack_list.pop_back();
tablet_logger.trace("allocate_replica {}.{}: no candidate nodes left on rack={}", s->ks_name(), s->cf_name(), rack);
// Note that this rack can't be in new_racks since
// those had no existing replicas and if current rack has no nodes
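The bug fixed in the hunk above is that the cleanup `pop_back()` always targeted `existing_racks`, even when the ternary had appended the candidate to `new_racks`. A minimal Python sketch of the corrected shape — bind the chosen list first, then append and pop from that same list (all names here are illustrative, not ScyllaDB's actual types):

```python
def allocate(rack, has_existing_replicas, new_racks, existing_racks, nodes):
    # Pick the target list first, mirroring the fix: the later pop
    # must go to the same list the candidate was appended to.
    rack_list = new_racks if not has_existing_replicas else existing_racks
    candidate = {"rack": rack, "nodes": []}
    rack_list.append(candidate)
    candidate["nodes"].extend(n for n in nodes if n.get("normal"))
    if not candidate["nodes"]:
        # Before the fix, this popped from existing_racks unconditionally,
        # corrupting the bookkeeping when the candidate lived in new_racks.
        rack_list.pop()
    return new_racks, existing_racks

new_r, exist_r = allocate("rack1", False, [], [], [{"normal": False}])
print(len(new_r), len(exist_r))  # the empty candidate was removed from the correct list
```

With the pre-fix behavior, the dead candidate would have remained in `new_racks` while an unrelated entry was popped from `existing_racks`.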

View File

@@ -136,6 +136,18 @@ std::optional<tablet_replica> get_leaving_replica(const tablet_info& tinfo, cons
return *leaving.begin();
}
bool is_post_cleanup(tablet_replica replica, const tablet_info& tinfo, const tablet_transition_info& trinfo) {
if (replica == locator::get_leaving_replica(tinfo, trinfo)) {
// we do tablet cleanup on the leaving replica in the `cleanup` stage, after which there is only the `end_migration` stage.
return trinfo.stage == locator::tablet_transition_stage::end_migration;
}
if (replica == trinfo.pending_replica) {
// we do tablet cleanup on the pending replica in the `cleanup_target` stage, after which there is only the `revert_migration` stage.
return trinfo.stage == locator::tablet_transition_stage::revert_migration;
}
return false;
}
tablet_replica_set get_new_replicas(const tablet_info& tinfo, const tablet_migration_info& mig) {
return replace_replica(tinfo.replicas, mig.src, mig.dst);
}
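The new `is_post_cleanup` predicate pairs each replica role with the single transition stage that follows its cleanup stage. A minimal Python rendering of that decision table (stage names and roles are taken from the diff's comments; the surrounding replica/transition types are simplified to plain values):

```python
# Per the C++ comments above:
# - the leaving replica is cleaned in `cleanup`; only `end_migration` follows
# - the pending replica is cleaned in `cleanup_target`; only `revert_migration` follows
def is_post_cleanup(replica, leaving_replica, pending_replica, stage):
    if replica == leaving_replica:
        return stage == "end_migration"
    if replica == pending_replica:
        return stage == "revert_migration"
    return False

print(is_post_cleanup("n1", "n1", "n2", "end_migration"))   # -> True
print(is_post_cleanup("n2", "n1", "n2", "cleanup_target"))  # -> False
```

Any replica that is neither leaving nor pending is never "post cleanup", since no cleanup stage applies to it during the transition.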

View File

@@ -222,6 +222,10 @@ struct tablet_transition_info {
// Returns the leaving replica for a given transition.
std::optional<tablet_replica> get_leaving_replica(const tablet_info&, const tablet_transition_info&);
// True if the tablet is transitioning and it's in a stage that follows the stage
// where we clean up the tablet on the given replica.
bool is_post_cleanup(tablet_replica replica, const tablet_info& tinfo, const tablet_transition_info& trinfo);
/// Represents intention to move a single tablet replica from src to dst.
struct tablet_migration_info {
locator::tablet_transition_kind kind;

main.cc
View File

@@ -91,11 +91,7 @@
#include "redis/controller.hh"
#include "cdc/log.hh"
#include "cdc/cdc_extension.hh"
#include "cdc/generation_service.hh"
#include "tombstone_gc_extension.hh"
#include "db/tags/extension.hh"
#include "db/paxos_grace_seconds_extension.hh"
#include "service/qos/standard_service_level_distributed_data_accessor.hh"
#include "service/storage_proxy.hh"
#include "service/mapreduce_service.hh"
@@ -103,7 +99,6 @@
#include "alternator/ttl.hh"
#include "tools/entry_point.hh"
#include "test/perf/entry_point.hh"
#include "db/per_partition_rate_limit_extension.hh"
#include "lang/manager.hh"
#include "sstables/sstables_manager.hh"
#include "db/virtual_tables.hh"
@@ -635,15 +630,11 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
app_template app(std::move(app_cfg));
auto ext = std::make_shared<db::extensions>();
ext->add_schema_extension<db::tags_extension>(db::tags_extension::NAME);
ext->add_schema_extension<cdc::cdc_extension>(cdc::cdc_extension::NAME);
ext->add_schema_extension<db::paxos_grace_seconds_extension>(db::paxos_grace_seconds_extension::NAME);
ext->add_schema_extension<tombstone_gc_extension>(tombstone_gc_extension::NAME);
ext->add_schema_extension<db::per_partition_rate_limit_extension>(db::per_partition_rate_limit_extension::NAME);
auto cfg = make_lw_shared<db::config>(ext);
auto init = app.get_options_description().add_options();
cfg->add_all_default_extensions();
init("version", bpo::bool_switch(), "print version number and exit");
init("build-id", bpo::bool_switch(), "print build-id and exit");
init("build-mode", bpo::bool_switch(), "print build mode and exit");
@@ -1658,18 +1649,6 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
sys_dist_ks.invoke_on_all(&db::system_distributed_keyspace::stop).get();
});
group0_service.start().get();
auto stop_group0_service = defer_verbose_shutdown("group 0 service", [&group0_service] {
sl_controller.local().abort_group0_operations();
group0_service.abort().get();
});
utils::get_local_injector().inject("stop_after_starting_group0_service",
[] { std::raise(SIGSTOP); });
// Set up group0 service earlier since it is needed by group0 setup just below
ss.local().set_group0(group0_service);
supervisor::notify("starting view update generator");
view_update_generator.start(std::ref(db), std::ref(proxy), std::ref(stop_signal.as_sharded_abort_source())).get();
auto stop_view_update_generator = defer_verbose_shutdown("view update generator", [] {
@@ -1936,6 +1915,18 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
*/
db.local().enable_autocompaction_toggle();
group0_service.start().get();
auto stop_group0_service = defer_verbose_shutdown("group 0 service", [&group0_service] {
sl_controller.local().abort_group0_operations();
group0_service.abort().get();
});
utils::get_local_injector().inject("stop_after_starting_group0_service",
[] { std::raise(SIGSTOP); });
// Set up group0 service earlier since it is needed by group0 setup just below
ss.local().set_group0(group0_service);
const auto generation_number = gms::generation_type(sys_ks.local().increment_and_get_generation().get());
// Load address_map from system.peers and subscribe to gossiper events to keep it updated.
@@ -2019,6 +2010,11 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
api::unset_server_authorization_cache(ctx).get();
});
// update the service level cache after the SL data accessor and auth service are initialized.
if (sl_controller.local().is_v2()) {
sl_controller.local().update_cache(qos::update_both_cache_levels::yes).get();
}
sl_controller.invoke_on_all([&lifecycle_notifier] (qos::service_level_controller& controller) {
lifecycle_notifier.local().register_subscriber(&controller);
}).get();
@@ -2090,6 +2086,9 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
if (cfg->view_building()) {
view_builder.invoke_on_all(&db::view::view_builder::start, std::ref(mm), utils::cross_shard_barrier()).get();
}
auto drain_view_builder = defer_verbose_shutdown("draining view builders", [&] {
view_builder.invoke_on_all(&db::view::view_builder::drain).get();
});
api::set_server_view_builder(ctx, view_builder).get();
auto stop_vb_api = defer_verbose_shutdown("view builder API", [&ctx] {
@@ -2134,6 +2133,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
ss.local().drain_on_shutdown().get();
});
auth_service.local().ensure_superuser_is_created().get();
ss.local().register_protocol_server(cql_server_ctl, cfg->start_native_transport()).get();
api::set_transport_controller(ctx, cql_server_ctl).get();
auto stop_transport_controller = defer_verbose_shutdown("transport controller API", [&ctx] {

View File

@@ -584,6 +584,18 @@ static constexpr unsigned do_get_rpc_client_idx(messaging_verb verb) {
case messaging_verb::JOIN_NODE_RESPONSE:
case messaging_verb::JOIN_NODE_QUERY:
case messaging_verb::TASKS_GET_CHILDREN:
case messaging_verb::RAFT_SEND_SNAPSHOT:
case messaging_verb::RAFT_APPEND_ENTRIES:
case messaging_verb::RAFT_APPEND_ENTRIES_REPLY:
case messaging_verb::RAFT_VOTE_REQUEST:
case messaging_verb::RAFT_VOTE_REPLY:
case messaging_verb::RAFT_TIMEOUT_NOW:
case messaging_verb::RAFT_READ_QUORUM:
case messaging_verb::RAFT_READ_QUORUM_REPLY:
case messaging_verb::RAFT_EXECUTE_READ_BARRIER_ON_LEADER:
case messaging_verb::RAFT_ADD_ENTRY:
case messaging_verb::RAFT_MODIFY_CONFIG:
case messaging_verb::RAFT_PULL_SNAPSHOT:
// See comment above `TOPOLOGY_INDEPENDENT_IDX`.
// DO NOT put any 'hot' (e.g. data path) verbs in this group,
// only verbs which are 'rare' and 'cheap'.
@@ -637,19 +649,7 @@ static constexpr unsigned do_get_rpc_client_idx(messaging_verb verb) {
case messaging_verb::PAXOS_ACCEPT:
case messaging_verb::PAXOS_LEARN:
case messaging_verb::PAXOS_PRUNE:
case messaging_verb::RAFT_SEND_SNAPSHOT:
case messaging_verb::RAFT_APPEND_ENTRIES:
case messaging_verb::RAFT_APPEND_ENTRIES_REPLY:
case messaging_verb::RAFT_VOTE_REQUEST:
case messaging_verb::RAFT_VOTE_REPLY:
case messaging_verb::RAFT_TIMEOUT_NOW:
case messaging_verb::RAFT_READ_QUORUM:
case messaging_verb::RAFT_READ_QUORUM_REPLY:
case messaging_verb::RAFT_EXECUTE_READ_BARRIER_ON_LEADER:
case messaging_verb::RAFT_ADD_ENTRY:
case messaging_verb::RAFT_MODIFY_CONFIG:
case messaging_verb::DIRECT_FD_PING:
case messaging_verb::RAFT_PULL_SNAPSHOT:
return 2;
case messaging_verb::MUTATION_DONE:
case messaging_verb::MUTATION_FAILED:

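The hunk above moves the RAFT verbs out of index 2 and into the topology-independent group of rare, cheap verbs; the mechanism is simply a switch mapping each verb to a small connection index. A sketch with a hypothetical verb subset (not the real enum):

```cpp
#include <cassert>

// Sketch with a hypothetical verb subset: control-plane verbs share a
// dedicated connection index so data-path traffic cannot delay them, which
// is the effect of moving the RAFT verbs in the hunk above.
enum class verb { raft_vote_request, paxos_accept, mutation_done };

constexpr unsigned rpc_client_idx_sketch(verb v) {
    switch (v) {
    case verb::raft_vote_request:
        return 0;  // rare, cheap control-plane group
    case verb::paxos_accept:
        return 2;  // mixed group the RAFT verbs were moved out of
    case verb::mutation_done:
        return 3;  // hot data-path group
    }
    return 3;
}
```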
View File

@@ -134,9 +134,7 @@ public:
auto on_end_of_partition() {
flush_rows_and_tombstones(position_in_partition::after_all_clustered_rows());
if (_consumer.consume_end_of_partition()) {
_stop_consuming = stop_iteration::yes;
}
_stop_consuming = _consumer.consume_end_of_partition();
using consume_res_type = decltype(_consumer.consume_end_of_stream());
if constexpr (std::is_same_v<consume_res_type, void>) {
_consumer.consume_end_of_stream();

View File

@@ -356,7 +356,13 @@ void mutation_partition_json_writer::write_atomic_cell_value(const atomic_cell_v
}
void mutation_partition_json_writer::write_collection_value(const collection_mutation_view_description& mv, data_type type) {
write_each_collection_cell(mv, type, [&] (atomic_cell_view v, data_type t) { write_atomic_cell_value(v, t); });
write_each_collection_cell(mv, type, [&] (atomic_cell_view v, data_type t) {
if (v.is_live()) {
write_atomic_cell_value(v, t);
} else {
writer().Null();
}
});
}
void mutation_partition_json_writer::write(gc_clock::duration ttl, gc_clock::time_point expiry) {

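The fix above emits an explicit JSON null for dead cells instead of serializing them as if they were live. A toy serializer (hypothetical `cell` type, not the real atomic_cell_view) showing the shape:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy serializer (hypothetical `cell` type) showing the shape of the fix
// above: a dead cell has no value to print, so it must emit an explicit
// JSON null rather than having its value serialized as if it were live.
struct cell { bool live; std::string value; };

std::string write_collection_sketch(const std::vector<cell>& cells) {
    std::string out = "[";
    for (size_t i = 0; i < cells.size(); ++i) {
        if (i) {
            out += ",";
        }
        out += cells[i].live ? "\"" + cells[i].value + "\"" : "null";
    }
    return out + "]";
}
```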
View File

@@ -164,8 +164,7 @@ class compact_mutation_state {
uint32_t _current_partition_limit;
bool _empty_partition{};
bool _empty_partition_in_gc_consumer{};
const dht::decorated_key* _dk{};
dht::decorated_key _last_dk;
std::optional<dht::decorated_key> _dk;
bool _return_static_content_on_partition_with_no_rows{};
std::optional<static_row> _last_static_row;
@@ -310,7 +309,6 @@ public:
, _partition_limit(partition_limit)
, _partition_row_limit(_slice.options.contains(query::partition_slice::option::distinct) ? 1 : slice.partition_row_limit())
, _tombstone_gc_state(nullptr)
, _last_dk({dht::token(), partition_key::make_empty()})
, _last_pos(position_in_partition::for_partition_end())
, _validator("mutation_compactor for read", _schema, validation_level)
{
@@ -326,7 +324,6 @@ public:
, _can_gc([this] (tombstone t, is_shadowable is_shadowable) { return can_gc(t, is_shadowable); })
, _slice(s.full_slice())
, _tombstone_gc_state(gc_state)
, _last_dk({dht::token(), partition_key::make_empty()})
, _last_pos(position_in_partition::for_partition_end())
, _collector(std::make_unique<mutation_compactor_garbage_collector>(_schema))
// We already have a validator for compaction in the sstable writer, no need to validate twice
@@ -337,10 +334,10 @@ public:
void consume_new_partition(const dht::decorated_key& dk) {
_validator(mutation_fragment_v2::kind::partition_start, position_in_partition_view::for_partition_start(), {});
_validator(dk);
_stop = stop_iteration::no;
auto& pk = dk.key();
_dk = &dk;
_dk = dk;
auto& pk = _dk->key();
_validator(*_dk);
_return_static_content_on_partition_with_no_rows =
_slice.options.contains(query::partition_slice::option::always_return_static_content) ||
!has_ck_selector(_slice.row_ranges(_schema, pk));
@@ -531,10 +528,6 @@ public:
requires CompactedFragmentsConsumerV2<Consumer> && CompactedFragmentsConsumerV2<GCConsumer>
auto consume_end_of_stream(Consumer& consumer, GCConsumer& gc_consumer) {
_validator.on_end_of_stream();
if (_dk) {
_last_dk = *_dk;
_dk = &_last_dk;
}
if constexpr (std::is_void_v<std::invoke_result_t<decltype(&GCConsumer::consume_end_of_stream), GCConsumer&>>) {
gc_consumer.consume_end_of_stream();
return consumer.consume_end_of_stream();
@@ -546,7 +539,7 @@ public:
/// The decorated key of the partition the compaction is positioned in.
/// Can be null if the compaction wasn't started yet.
const dht::decorated_key* current_partition() const {
return _dk;
return _dk ? &*_dk : nullptr;
}
// Only updated when SSTableCompaction == compact_for_sstables::no.
@@ -629,7 +622,7 @@ public:
if (!_stop) {
return {};
}
partition_start ps(std::move(_last_dk), _partition_tombstone);
partition_start ps(*std::exchange(_dk, std::nullopt), _partition_tombstone);
if (_effective_tombstone) {
return detached_compaction_state{std::move(ps), std::move(_last_static_row),
range_tombstone_change(position_in_partition::after_key(_schema, _last_pos), _effective_tombstone)};

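The hunk above replaces the raw pointer `_dk` plus the `_last_dk` backup copy with a single `std::optional`, and `std::exchange(_dk, std::nullopt)` hands the key off when the detached state is built. A minimal sketch of that ownership pattern (hypothetical `key` type, not the real `dht::decorated_key`):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Minimal sketch (hypothetical `key` type) of the pattern in the hunk above:
// one std::optional owns the current partition key, replacing the
// raw-pointer-plus-backup-copy pair.
struct key { std::string value; };

class compactor_state {
    std::optional<key> _dk;  // disengaged until the first partition arrives
public:
    void consume_new_partition(const key& k) { _dk = k; }

    // Analogue of current_partition(): null before the first partition.
    const key* current_partition() const { return _dk ? &*_dk : nullptr; }

    // Analogue of building the detached partition_start: move the key out
    // and leave the optional empty, as std::exchange(_dk, std::nullopt) does.
    key detach() { return *std::exchange(_dk, std::nullopt); }
};
```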
View File

@@ -224,7 +224,7 @@ future<uint64_t> distribute_reader_and_consume_on_shards(schema_ptr s,
std::function<future<> (mutation_reader)> consumer,
utils::phased_barrier::operation&& op) {
return do_with(multishard_writer(std::move(s), sharder, std::move(producer), std::move(consumer)), std::move(op), [] (multishard_writer& writer, utils::phased_barrier::operation&) {
return writer().finally([&writer] {
return seastar::futurize_invoke(writer).finally([&writer] {
return writer.close();
});
});

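The point of wrapping the call in `seastar::futurize_invoke` is that an exception thrown synchronously by `writer()` becomes a failed future instead of propagating past the `finally`, so the chained `close()` still runs. A simplified stand-in for that idea (not Seastar's actual implementation):

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Simplified stand-in (not Seastar's implementation) for the idea behind
// seastar::futurize_invoke: capture a synchronously thrown exception as a
// failed result instead of letting it skip the cleanup chained after the call.
struct result {
    std::exception_ptr ex;  // non-null when the invocation threw
    bool failed() const { return ex != nullptr; }
};

template <typename Func>
result futurize_invoke_sketch(Func&& f) {
    try {
        static_cast<Func&&>(f)();
        return {};
    } catch (...) {
        return {std::current_exception()};
    }
}
```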
View File

@@ -35,6 +35,19 @@ private:
}
}
// Keeps the previous writer alive while it is being closed
// and then allocates a new writer, if needed.
future<> do_switch_to_new_writer() {
_current_writer->consume_end_of_stream();
// reset _current_writer while closing the previous one
// to prevent race with close() after abort()
auto wr = std::exchange(_current_writer, std::nullopt);
co_await wr->close();
allocate_new_writer_if_needed();
}
// Called frequently, hence yields (and allocates)
// only on the unlikely slow path.
future<> maybe_switch_to_new_writer(dht::token t) {
auto prev_group_id = _current_group_id;
_current_group_id = _classify(t);
@@ -44,11 +57,7 @@ private:
}
if (_current_writer && _current_group_id > prev_group_id) [[unlikely]] {
_current_writer->consume_end_of_stream();
return _current_writer->close().then([this] {
_current_writer = std::nullopt;
allocate_new_writer_if_needed();
});
return do_switch_to_new_writer();
}
allocate_new_writer_if_needed();
return make_ready_future<>();

View File

@@ -78,7 +78,7 @@ future<std::optional<tasks::task_status>> node_ops_virtual_task::get_status_help
.entity = "",
.progress_units = "",
.progress = tasks::task_manager::task::progress{},
.children = started ? co_await get_children(get_module(), id) : std::vector<tasks::task_identity>{}
.children = started ? co_await get_children(get_module(), id, [&gossiper = _ss.gossiper()] (gms::inet_address addr) { return gossiper.is_alive(addr); }) : std::vector<tasks::task_identity>{}
};
}

View File

@@ -589,7 +589,9 @@ future<> server_impl::wait_for_entry(entry_id eid, wait_type type, seastar::abor
check_not_aborted();
if (as && as->abort_requested()) {
throw request_aborted(format("Abort requested before waiting for entry with idx: {}, term: {}", eid.idx, eid.term));
throw request_aborted(format(
"Abort requested before waiting for entry with idx: {}, term: {}; last committed entry: {}, last applied entry: {}",
eid.idx, eid.term, _fsm->commit_idx(), _applied_idx));
}
auto& container = type == wait_type::committed ? _awaited_commits : _awaited_applies;
@@ -637,9 +639,11 @@ future<> server_impl::wait_for_entry(entry_id eid, wait_type type, seastar::abor
}
SCYLLA_ASSERT(inserted);
if (as) {
it->second.abort = as->subscribe([it = it, &container] noexcept {
it->second.abort = as->subscribe([this, it = it, &container] noexcept {
it->second.done.set_exception(
request_aborted(format("Abort requested while waiting for entry with idx: {}, term: {}", it->first, it->second.term)));
request_aborted(format(
"Abort requested while waiting for entry with idx: {}, term: {}; last committed entry: {}, last applied entry: {}",
it->first, it->second.term, _fsm->commit_idx(), _applied_idx)));
container.erase(it);
});
SCYLLA_ASSERT(it->second.abort);
@@ -854,6 +858,10 @@ future<add_entry_reply> server_impl::execute_modify_config(server_id from,
}
future<> server_impl::modify_config(std::vector<config_member> add, std::vector<server_id> del, seastar::abort_source* as) {
utils::get_local_injector().inject("raft/throw_commit_status_unknown_in_modify_config", [] {
throw raft::commit_status_unknown();
});
if (!_config.enable_forwarding) {
const auto leader = _fsm->current_leader();
if (leader != _id) {
@@ -1451,7 +1459,9 @@ term_t server_impl::get_current_term() const {
future<> server_impl::wait_for_apply(index_t idx, abort_source* as) {
if (as && as->abort_requested()) {
throw request_aborted(format("Aborted before waiting for applying entry: {}, last applied entry: {}", idx, _applied_idx));
throw request_aborted(format(
"Aborted before waiting for applying entry: {}, last committed entry: {}, last applied entry: {}",
idx, _fsm->commit_idx(), _applied_idx));
}
check_not_aborted();
@@ -1463,7 +1473,9 @@ future<> server_impl::wait_for_apply(index_t idx, abort_source* as) {
if (as) {
it->second.abort = as->subscribe([this, it] noexcept {
it->second.promise.set_exception(
request_aborted(format("Aborted while waiting to apply entry: {}, last applied entry: {}", it->first, _applied_idx)));
request_aborted(format(
"Aborted while waiting to apply entry: {}, last committed entry: {}, last applied entry: {}",
it->first, _fsm->commit_idx(), _applied_idx)));
_awaited_indexes.erase(it);
});
SCYLLA_ASSERT(it->second.abort);

View File

@@ -126,6 +126,7 @@ class read_context final : public enable_lw_shared_from_this<read_context> {
mutation_reader::forwarding _fwd_mr;
bool _range_query;
const tombstone_gc_state* _tombstone_gc_state;
max_purgeable_fn _get_max_purgeable;
// When reader enters a partition, it must be set up for reading that
// partition from the underlying mutation source (_underlying) in one of two ways:
//
@@ -149,6 +150,7 @@ public:
const dht::partition_range& range,
const query::partition_slice& slice,
const tombstone_gc_state* gc_state,
max_purgeable_fn get_max_purgeable,
tracing::trace_state_ptr trace_state,
mutation_reader::forwarding fwd_mr)
: _cache(cache)
@@ -160,6 +162,7 @@ public:
, _fwd_mr(fwd_mr)
, _range_query(!query::is_single_partition(range))
, _tombstone_gc_state(gc_state)
, _get_max_purgeable(std::move(get_max_purgeable))
, _underlying(_cache, *this)
{
++_cache._tracker._stats.reads;
@@ -195,6 +198,7 @@ public:
void on_underlying_created() { ++_underlying_created; }
bool digest_requested() const { return _slice.options.contains<query::partition_slice::option::with_digest>(); }
const tombstone_gc_state* tombstone_gc_state() const { return _tombstone_gc_state; }
api::timestamp_type get_max_purgeable(const dht::decorated_key& dk, is_shadowable is) const { return _get_max_purgeable(dk, is); }
public:
future<> ensure_underlying() {
if (_underlying_snapshot) {

View File

@@ -146,9 +146,6 @@ public:
promise<> pr;
std::optional<shared_future<>> fut;
reader_concurrency_semaphore::read_func func;
// Self reference to keep the permit alive while queued for execution.
// Must be cleared on all code-paths, otherwise it will keep the permit alive in perpetuity.
reader_permit_opt permit_keepalive;
std::optional<reader_concurrency_semaphore::inactive_read> ir;
};
@@ -224,8 +221,6 @@ private:
}
void on_timeout() {
auto keepalive = std::exchange(_aux_data.permit_keepalive, std::nullopt);
_ex = std::make_exception_ptr(timed_out_error{});
if (_state == state::waiting_for_admission || _state == state::waiting_for_memory || _state == state::waiting_for_execution) {
@@ -487,7 +482,11 @@ public:
_trace_ptr = std::move(trace_ptr);
}
void check_abort() {
bool aborted() const {
return bool(_ex);
}
void check_abort() const {
if (_ex) {
std::rethrow_exception(_ex);
}
@@ -636,7 +635,7 @@ void reader_permit::set_trace_state(tracing::trace_state_ptr trace_ptr) noexcept
_impl->set_trace_state(std::move(trace_ptr));
}
void reader_permit::check_abort() {
void reader_permit::check_abort() const {
return _impl->check_abort();
}
@@ -1089,6 +1088,11 @@ reader_concurrency_semaphore::~reader_concurrency_semaphore() {
reader_concurrency_semaphore::inactive_read_handle reader_concurrency_semaphore::register_inactive_read(mutation_reader reader,
const dht::partition_range* range) noexcept {
auto& permit = reader.permit();
if (permit->aborted()) {
permit->release_base_resources();
close_reader(std::move(reader));
return inactive_read_handle();
}
if (permit->get_state() == reader_permit::state::waiting_for_memory) {
// Kill all outstanding memory requests, the read is going to be evicted.
permit->aux_data().pr.set_exception(std::make_exception_ptr(std::bad_alloc{}));
@@ -1126,6 +1130,7 @@ void reader_concurrency_semaphore::set_notify_handler(inactive_read_handle& irh,
auto& ir = *(*irh._permit)->aux_data().ir;
ir.notify_handler = std::move(notify_handler);
if (ttl_opt) {
irh._permit->set_timeout(db::no_timeout);
ir.ttl_timer.set_callback([this, permit = *irh._permit] () mutable {
evict(*permit, evict_reason::time);
});
@@ -1579,10 +1584,10 @@ reader_permit reader_concurrency_semaphore::make_tracking_only_permit(schema_ptr
}
future<> reader_concurrency_semaphore::with_permit(schema_ptr schema, const char* const op_name, size_t memory,
db::timeout_clock::time_point timeout, tracing::trace_state_ptr trace_ptr, read_func func) {
auto permit = reader_permit(*this, std::move(schema), std::string_view(op_name), {1, static_cast<ssize_t>(memory)}, timeout, std::move(trace_ptr));
db::timeout_clock::time_point timeout, tracing::trace_state_ptr trace_ptr, reader_permit_opt& permit_holder, read_func func) {
permit_holder = reader_permit(*this, std::move(schema), std::string_view(op_name), {1, static_cast<ssize_t>(memory)}, timeout, std::move(trace_ptr));
auto permit = *permit_holder;
permit->aux_data().func = std::move(func);
permit->aux_data().permit_keepalive = permit;
return do_wait_admission(*permit);
}
@@ -1635,6 +1640,7 @@ void reader_concurrency_semaphore::foreach_permit(noncopyable_function<void(cons
boost::for_each(_wait_list._admission_queue, std::ref(func));
boost::for_each(_wait_list._memory_queue, std::ref(func));
boost::for_each(_ready_list, std::ref(func));
boost::for_each(_inactive_reads, std::ref(func));
}
void reader_concurrency_semaphore::foreach_permit(noncopyable_function<void(const reader_permit&)> func) const {

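The removed `permit_keepalive` field was a self-reference that kept the permit alive while queued and had to be cleared on every code path; the new `with_permit` signature instead makes the caller own the permit through the `reader_permit_opt&` out-parameter. A reduced sketch of that ownership change, with `std::shared_ptr` standing in for the permit:

```cpp
#include <cassert>
#include <memory>
#include <optional>

// Reduced sketch (std::shared_ptr stands in for the permit): the semaphore
// fills a caller-provided holder instead of storing a self-reference, so the
// permit's lifetime ends exactly when the caller drops the holder and no
// code path can leak it by forgetting to clear a keepalive.
struct permit { int id; };
using permit_opt = std::optional<std::shared_ptr<permit>>;

void with_permit_sketch(permit_opt& holder, int id) {
    holder = std::make_shared<permit>(permit{id});
}
```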
View File

@@ -456,7 +456,8 @@ public:
///
/// Some permits cannot be associated with any table, so passing nullptr as
/// the schema parameter is allowed.
future<> with_permit(schema_ptr schema, const char* const op_name, size_t memory, db::timeout_clock::time_point timeout, tracing::trace_state_ptr trace_ptr, read_func func);
future<> with_permit(schema_ptr schema, const char* const op_name, size_t memory, db::timeout_clock::time_point timeout,
tracing::trace_state_ptr trace_ptr, reader_permit_opt& permit_holder, read_func func);
/// Run the function through the semaphore's execution stage with a pre-admitted permit
///

View File

@@ -174,7 +174,7 @@ public:
// If the read was aborted, throw the exception the read was aborted with.
// Otherwise no-op.
void check_abort();
void check_abort() const;
query::max_result_size max_result_size() const;
void set_max_result_size(query::max_result_size);

View File

@@ -2,8 +2,7 @@ add_library(repair STATIC)
target_sources(repair
PRIVATE
repair.cc
row_level.cc
table_check.cc)
row_level.cc)
target_include_directories(repair
PUBLIC
${CMAKE_SOURCE_DIR})

View File

@@ -15,7 +15,7 @@
#include "gms/gossiper.hh"
#include "gms/feature_service.hh"
#include "message/messaging_service.hh"
#include "repair/table_check.hh"
#include "streaming/table_check.hh"
#include "replica/database.hh"
#include "service/migration_manager.hh"
#include "service/storage_service.hh"
@@ -536,9 +536,10 @@ size_t repair::task_manager_module::nr_running_repair_jobs() {
}
future<bool> repair::task_manager_module::is_aborted(const tasks::task_id& uuid, shard_id shard) {
return smp::submit_to(shard, [&] () {
auto it = get_local_tasks().find(uuid);
return it != get_local_tasks().end() && it->second->abort_requested();
return get_task_manager().container().invoke_on(shard, [name = get_name(), uuid] (tasks::task_manager& tm) {
auto module = tm.find_module(name);
auto it = module->get_local_tasks().find(uuid);
return it != module->get_local_tasks().end() && it->second->abort_requested();
});
}
@@ -741,7 +742,7 @@ future<> repair::shard_repair_task_impl::repair_range(const dht::token_range& ra
co_return;
}
try {
auto dropped = co_await with_table_drop_silenced(db.local(), mm, table.id, [&] (const table_id& uuid) {
auto dropped = co_await streaming::with_table_drop_silenced(db.local(), mm, table.id, [&] (const table_id& uuid) {
return repair_cf_range_row_level(*this, table.name, table.id, range, neighbors, _small_table_optimization);
});
if (dropped) {
@@ -1487,7 +1488,16 @@ future<> repair::data_sync_repair_task_impl::run() {
auto& keyspace = _status.keyspace;
auto& sharded_db = rs.get_db();
auto& db = sharded_db.local();
auto germs = make_lw_shared(co_await locator::make_global_effective_replication_map(sharded_db, keyspace));
auto germs_fut = co_await coroutine::as_future(locator::make_global_effective_replication_map(sharded_db, keyspace));
if (germs_fut.failed()) {
auto ex = germs_fut.get_exception();
if (try_catch<data_dictionary::no_such_keyspace>(ex)) {
rlogger.warn("sync data: keyspace {} does not exist, skipping", keyspace);
co_return;
}
co_await coroutine::return_exception_ptr(std::move(ex));
}
auto germs = make_lw_shared(germs_fut.get());
auto id = get_repair_uniq_id();
rlogger.info("repair[{}]: sync data for keyspace={}, status=started", id.uuid(), keyspace);
@@ -2215,7 +2225,7 @@ future<> repair_service::repair_tablets(repair_uniq_id rid, sstring keyspace_nam
}
table_id tid = t->schema()->id();
// Invoke group0 read barrier before obtaining erm pointer so that it sees all prior metadata changes
auto dropped = co_await repair::table_sync_and_check(_db.local(), _mm, tid);
auto dropped = co_await streaming::table_sync_and_check(_db.local(), _mm, tid);
if (dropped) {
rlogger.debug("repair[{}] Table {}.{} does not exist anymore", rid.uuid(), keyspace_name, table_name);
continue;
@@ -2372,6 +2382,16 @@ future<> repair_service::repair_tablets(repair_uniq_id rid, sstring keyspace_nam
auto task = co_await _repair_module->make_and_start_task<repair::tablet_repair_task_impl>({}, rid, keyspace_name, table_names, streaming::stream_reason::repair, std::move(task_metas), ranges_parallelism);
}
void repair::tablet_repair_task_impl::release_resources() noexcept {
_metas_size = _metas.size();
_metas = {};
_tables = {};
}
size_t repair::tablet_repair_task_impl::get_metas_size() const noexcept {
return _metas.size() > 0 ? _metas.size() : _metas_size;
}
future<> repair::tablet_repair_task_impl::run() {
auto m = dynamic_pointer_cast<repair::task_manager_module>(_module);
auto& rs = m->get_repair_service();
@@ -2478,7 +2498,7 @@ future<> repair::tablet_repair_task_impl::run() {
auto ep = res.get_exception();
sstring ignore_msg;
// Ignore the error if the keyspace and/or table were dropped
auto ignore = co_await repair::table_sync_and_check(rs.get_db().local(), rs.get_migration_manager(), m.tid);
auto ignore = co_await streaming::table_sync_and_check(rs.get_db().local(), rs.get_migration_manager(), m.tid);
if (ignore) {
ignore_msg = format("{} does not exist any more, ignoring it, ",
rs.get_db().local().has_keyspace(m.keyspace_name) ? "table" : "keyspace");
@@ -2507,12 +2527,12 @@ future<> repair::tablet_repair_task_impl::run() {
}
future<std::optional<double>> repair::tablet_repair_task_impl::expected_total_workload() const {
auto sz = _metas.size();
auto sz = get_metas_size();
co_return sz ? std::make_optional<double>(sz) : std::nullopt;
}
std::optional<double> repair::tablet_repair_task_impl::expected_children_number() const {
return _metas.size();
return get_metas_size();
}
node_ops_cmd_category categorize_node_ops_cmd(node_ops_cmd cmd) noexcept {

View File

@@ -3184,6 +3184,7 @@ future<> repair_service::stop() {
rlogger.debug("Unregistering gossiper helper");
co_await _gossiper.local().unregister_(_gossip_helper);
}
co_await async_gate().close();
_stopped = true;
rlogger.info("Stopped repair_service");
} catch (...) {

View File

@@ -109,6 +109,8 @@ class repair_service : public seastar::peering_sharded_service<repair_service> {
shared_ptr<row_level_repair_gossip_helper> _gossip_helper;
bool _stopped = false;
gate _gate;
size_t _max_repair_memory;
seastar::semaphore _memory_sem;
seastar::named_semaphore _load_parallelism_semaphore = {16, named_semaphore_exception_factory{"Load repair history parallelism"}};
@@ -191,6 +193,7 @@ public:
size_t max_repair_memory() const { return _max_repair_memory; }
seastar::semaphore& memory_sem() { return _memory_sem; }
gms::inet_address my_address() const noexcept;
gate& async_gate() noexcept { return _gate; }
repair::task_manager_module& get_repair_module() noexcept {
return *_repair_module;

View File

@@ -108,6 +108,7 @@ private:
std::vector<tablet_repair_task_meta> _metas;
optimized_optional<abort_source::subscription> _abort_subscription;
std::optional<int> _ranges_parallelism;
size_t _metas_size = 0;
public:
tablet_repair_task_impl(tasks::task_manager::module_ptr module, repair_uniq_id id, sstring keyspace, std::vector<sstring> tables, streaming::stream_reason reason, std::vector<tablet_repair_task_meta> metas, std::optional<int> ranges_parallelism)
: repair_task_impl(module, id.uuid(), id.id, "keyspace", keyspace, "", "", tasks::task_id::create_null_id(), reason)
@@ -121,6 +122,10 @@ public:
virtual tasks::is_abortable is_abortable() const noexcept override {
return tasks::is_abortable(!_abort_subscription);
}
virtual void release_resources() noexcept override;
private:
size_t get_metas_size() const noexcept;
protected:
future<> run() override;

View File

@@ -150,15 +150,10 @@ public:
void add_maintenance_sstable(sstables::shared_sstable sst);
api::timestamp_type max_seen_timestamp() const { return _max_seen_timestamp; }
// Update main sstable set based on info in completion descriptor, where input sstables
// will be replaced by output ones, row cache ranges are possibly invalidated and
// statistics are updated.
future<> update_main_sstable_list_on_compaction_completion(sstables::compaction_completion_desc desc);
// This will update sstable lists on behalf of off-strategy compaction, where
// input files will be removed from the maintenance set and output files will
// be inserted into the main set.
future<> update_sstable_lists_on_off_strategy_completion(sstables::compaction_completion_desc desc);
// Update main and/or maintenance sstable sets based on info in the completion descriptor,
// where input sstables will be replaced by output ones, row cache ranges are possibly
// invalidated and statistics are updated.
future<> update_sstable_sets_on_compaction_completion(sstables::compaction_completion_desc desc);
const lw_shared_ptr<sstables::sstable_set>& main_sstables() const noexcept;
void set_main_sstables(lw_shared_ptr<sstables::sstable_set> new_main_sstables);
@@ -340,13 +335,13 @@ public:
// new tablet replica is allocated).
virtual future<> update_effective_replication_map(const locator::effective_replication_map& erm, noncopyable_function<void()> refresh_mutation_source) = 0;
virtual compaction_group& compaction_group_for_token(dht::token token) const noexcept = 0;
virtual compaction_group& compaction_group_for_token(dht::token token) const = 0;
virtual utils::chunked_vector<compaction_group*> compaction_groups_for_token_range(dht::token_range tr) const = 0;
virtual compaction_group& compaction_group_for_key(partition_key_view key, const schema_ptr& s) const noexcept = 0;
virtual compaction_group& compaction_group_for_sstable(const sstables::shared_sstable& sst) const noexcept = 0;
virtual compaction_group& compaction_group_for_key(partition_key_view key, const schema_ptr& s) const = 0;
virtual compaction_group& compaction_group_for_sstable(const sstables::shared_sstable& sst) const = 0;
virtual size_t log2_storage_groups() const = 0;
virtual storage_group& storage_group_for_token(dht::token) const noexcept = 0;
virtual storage_group& storage_group_for_token(dht::token) const = 0;
virtual locator::table_load_stats table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const noexcept = 0;
virtual bool all_storage_groups_split() = 0;

View File

@@ -907,8 +907,15 @@ db::commitlog* database::commitlog_for(const schema_ptr& schema) {
}
future<> database::add_column_family(keyspace& ks, schema_ptr schema, column_family::config cfg, is_new_cf is_new) {
if (schema->is_view()) {
try {
auto base_schema = find_schema(schema->view_info()->base_id());
schema->view_info()->set_base_info(schema->view_info()->make_base_dependent_view_info(*base_schema));
} catch (no_such_column_family&) {
throw std::invalid_argument("The base table " + schema->view_info()->base_name() + " was already dropped");
}
}
schema = local_schema_registry().learn(schema);
schema->registry_entry()->mark_synced();
auto&& rs = ks.get_replication_strategy();
locator::effective_replication_map_ptr erm;
if (auto pt_rs = rs.maybe_as_per_table()) {
@@ -940,6 +947,8 @@ future<> database::add_column_family(keyspace& ks, schema_ptr schema, column_fam
co_await cf->stop();
co_await coroutine::return_exception_ptr(f.get_exception());
}
// Table must be added before entry is marked synced.
schema->registry_entry()->mark_synced();
}
future<> database::add_column_family_and_make_directory(schema_ptr schema, is_new_cf is_new) {
@@ -951,6 +960,14 @@ future<> database::add_column_family_and_make_directory(schema_ptr schema, is_ne
}
bool database::update_column_family(schema_ptr new_schema) {
if (new_schema->is_view()) {
try {
auto base_schema = find_schema(new_schema->view_info()->base_id());
new_schema->view_info()->set_base_info(new_schema->view_info()->make_base_dependent_view_info(*base_schema));
} catch (no_such_column_family&) {
throw std::invalid_argument("The base table " + new_schema->view_info()->base_name() + " was already dropped");
}
}
column_family& cfm = find_column_family(new_schema->id());
bool columns_changed = !cfm.schema()->equal_columns(*new_schema);
auto s = local_schema_registry().learn(new_schema);
@@ -958,11 +975,8 @@ bool database::update_column_family(schema_ptr new_schema) {
cfm.set_schema(s);
find_keyspace(s->ks_name()).metadata()->add_or_update_column_family(s);
if (s->is_view()) {
try {
find_column_family(s->view_info()->base_id()).add_or_update_view(view_ptr(s));
} catch (no_such_column_family&) {
// Update view mutations received after base table drop.
}
// We already tested that the base table exists
find_column_family(s->view_info()->base_id()).add_or_update_view(view_ptr(s));
}
cfm.get_index_manager().reload();
return columns_changed;
@@ -1499,7 +1513,9 @@ database::query(schema_ptr query_schema, const query::read_command& cmd, query::
querier_opt->permit().set_trace_state(trace_state);
f = co_await coroutine::as_future(semaphore.with_ready_permit(querier_opt->permit(), read_func));
} else {
f = co_await coroutine::as_future(semaphore.with_permit(query_schema, "data-query", cf.estimate_read_memory_cost(), timeout, trace_state, read_func));
reader_permit_opt permit_holder;
f = co_await coroutine::as_future(semaphore.with_permit(query_schema, "data-query", cf.estimate_read_memory_cost(), timeout,
trace_state, permit_holder, read_func));
}
if (!f.failed()) {
@@ -1561,7 +1577,9 @@ database::query_mutations(schema_ptr query_schema, const query::read_command& cm
querier_opt->permit().set_trace_state(trace_state);
f = co_await coroutine::as_future(semaphore.with_ready_permit(querier_opt->permit(), read_func));
} else {
f = co_await coroutine::as_future(semaphore.with_permit(query_schema, "mutation-query", cf.estimate_read_memory_cost(), timeout, trace_state, read_func));
reader_permit_opt permit_holder;
f = co_await coroutine::as_future(semaphore.with_permit(query_schema, "mutation-query", cf.estimate_read_memory_cost(), timeout,
trace_state, permit_holder, read_func));
}
if (!f.failed()) {
@@ -1712,6 +1730,38 @@ future<mutation> database::do_apply_counter_update(column_family& cf, const froz
co_return m;
}
api::timestamp_type memtable_list::min_live_timestamp(const dht::decorated_key& dk, is_shadowable is, api::timestamp_type max_seen_timestamp) const noexcept {
const auto get_min_ts = [is] (const memtable& mt) {
// see get_max_purgeable_timestamp() in compaction.cc for comments on choosing min timestamp
return is ? mt.get_min_live_row_marker_timestamp() : mt.get_min_live_timestamp();
};
auto min_live_ts = api::max_timestamp;
for (const auto& mt : _memtables) {
const auto mt_min_live_ts = get_min_ts(*mt);
if (mt_min_live_ts > max_seen_timestamp) {
continue;
}
// We cannot do lookups on flushing memtables, as they might be in the
// process of merging into cache. Keys already merged will not be seen
// by the lookup.
if (!mt->is_merging_to_cache() && !mt->contains_partition(dk)) {
continue;
}
min_live_ts = std::min(min_live_ts, mt_min_live_ts);
}
for (const auto& mt : _flushed_memtables_with_active_reads) {
// We cannot check if the flushed memtable contains the key as it
// becomes empty after the merge to cache completes, so we only use the
// min ts metadata.
min_live_ts = std::min(min_live_ts, get_min_ts(mt));
}
return min_live_ts;
}
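The core of `min_live_timestamp()` above is a min-fold that skips any memtable whose minimum live timestamp already exceeds the maximum timestamp seen, since such a memtable cannot lower the purge bound. A simplified sketch of just that fold (plain integers standing in for the memtable metadata):

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Simplified sketch of the min-fold above (plain integers stand in for the
// per-memtable metadata): take the minimum over candidates, skipping any
// whose minimum already exceeds the max timestamp seen, because those
// cannot constrain purging.
using ts = long;
constexpr ts max_ts = std::numeric_limits<ts>::max();

ts min_live_timestamp_sketch(const std::vector<ts>& memtable_min_ts, ts max_seen) {
    ts min_live = max_ts;
    for (ts t : memtable_min_ts) {
        if (t > max_seen) {
            continue;  // this memtable cannot lower the purge bound
        }
        min_live = std::min(min_live, t);
    }
    return min_live;
}
```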
future<> memtable_list::flush() {
if (!may_flush()) {
return make_ready_future<>();
@@ -1907,6 +1957,12 @@ future<> database::do_apply(schema_ptr s, const frozen_mutation& m, tracing::tra
// assume failure until proven otherwise
auto update_writes_failed = defer([&] { ++_stats->total_writes_failed; });
utils::get_local_injector().inject("database_apply", [&s] () {
if (!is_system_keyspace(s->ks_name())) {
throw std::runtime_error("injected error");
}
});
// I'm doing a nullcheck here since the init code path for db etc
// is a little in flux and commitlog is created only when db is
// initialized from datadir.
@@ -2271,7 +2327,10 @@ future<> database::flush(const sstring& ksname, const sstring& cfname) {
future<> database::flush_table_on_all_shards(sharded<database>& sharded_db, table_id id) {
return sharded_db.invoke_on_all([id] (replica::database& db) {
return db.find_column_family(id).flush();
if (db.column_family_exists(id)) {
return db.find_column_family(id).flush();
}
return make_ready_future();
});
}
@@ -2282,7 +2341,12 @@ future<> database::drop_cache_for_table_on_all_shards(sharded<database>& sharded
}
future<> database::flush_table_on_all_shards(sharded<database>& sharded_db, std::string_view ks_name, std::string_view table_name) {
return flush_table_on_all_shards(sharded_db, sharded_db.local().find_uuid(ks_name, table_name));
try {
return flush_table_on_all_shards(sharded_db, sharded_db.local().find_uuid(ks_name, table_name));
} catch (no_such_column_family&) {
// Skip.
return make_ready_future();
}
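Both `flush_table_on_all_shards` overloads above now tolerate a concurrently dropped table: by UUID via an existence check, and by name by catching `no_such_column_family`. A sketch of the by-name pattern, with a trivial `db_model` standing in for `replica::database` (names are illustrative only):

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for replica::no_such_column_family.
struct no_such_column_family : std::runtime_error {
    no_such_column_family() : std::runtime_error("no such column family") {}
};

struct db_model {
    std::map<std::string, int> tables;  // "ks.table" -> uuid
    int flushes = 0;

    int find_uuid(const std::string& name) const {
        auto i = tables.find(name);
        if (i == tables.end()) {
            throw no_such_column_family();
        }
        return i->second;
    }
    void flush(int /*uuid*/) { ++flushes; }
};

// Mirrors the patched flush-by-name path: a table dropped between the
// caller's decision and the lookup is skipped instead of failing.
void flush_table(db_model& db, const std::string& name) {
    try {
        db.flush(db.find_uuid(name));
    } catch (no_such_column_family&) {
        // Skip.
    }
}
```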
}
static future<> force_new_commitlog_segments(std::unique_ptr<db::commitlog>& cl1, std::unique_ptr<db::commitlog>& cl2) {
@@ -2301,6 +2365,9 @@ future<> database::flush_tables_on_all_shards(sharded<database>& sharded_db, std
* to discard the currently active segment. This ensures we get
* as sstable-ish a universe as we can, as soon as we can.
*/
if (utils::get_local_injector().enter("flush_tables_on_all_shards_table_drop")) {
table_names.push_back("");
}
return sharded_db.invoke_on_all([] (replica::database& db) {
return force_new_commitlog_segments(db._commitlog, db._schema_commitlog);
}).then([&, ks_name, table_names = std::move(table_names)] {
@@ -2452,6 +2519,12 @@ future<> database::truncate_table_on_all_shards(sharded<database>& sharded_db, s
});
});
co_await utils::get_local_injector().inject("truncate_compaction_disabled_wait", [] (auto& handler) -> future<> {
dblog.info("truncate_compaction_disabled_wait: wait");
co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::minutes{5});
dblog.info("truncate_compaction_disabled_wait: done");
}, false);
const auto should_flush = with_snapshot && cf.can_flush();
dblog.trace("{} {}.{} and views on all shards", should_flush ? "Flushing" : "Clearing", s->ks_name(), s->cf_name());
std::function<future<>(replica::table&)> flush_or_clear = should_flush ?


@@ -77,6 +77,8 @@ class compaction_manager;
class frozen_mutation;
class reconcilable_result;
namespace bi = boost::intrusive;
namespace tracing { class trace_state_ptr; }
namespace s3 { struct endpoint_config; }
@@ -173,8 +175,13 @@ class global_table_ptr;
class memtable_list {
public:
using seal_immediate_fn_type = std::function<future<> (flush_permit&&)>;
using intrusive_memtable_list = bi::list<
memtable,
bi::base_hook<bi::list_base_hook<bi::link_mode<bi::auto_unlink>>>,
bi::constant_time_size<false>>;
private:
std::vector<shared_memtable> _memtables;
intrusive_memtable_list _flushed_memtables_with_active_reads;
seal_immediate_fn_type _seal_immediate_fn;
std::function<schema_ptr()> _current_schema;
replica::dirty_memory_manager* _dirty_memory_manager;
@@ -230,6 +237,15 @@ public:
return _memtables.back();
}
// Returns the minimum live timestamp. Considers all memtables, even
// those that were flushed and removed with erase(), but an
// in-progress read is still using them.
// Memtables whose min live timestamp > max_seen_timestamp are ignored as we
// consider that their content is more recent than any potential tombstone in
// other mutation sources.
// Returns api::max_timestamp if the key is not in any of the memtables.
api::timestamp_type min_live_timestamp(const dht::decorated_key& dk, is_shadowable is, api::timestamp_type max_seen_timestamp) const noexcept;
// # 8904 - this method is akin to std::set::erase(key_type), not
// erase(iterator). Should be tolerant against non-existing.
void erase(const shared_memtable& element) noexcept {
@@ -237,6 +253,7 @@ public:
if (i != _memtables.end()) {
_memtables.erase(i);
}
_flushed_memtables_with_active_reads.push_back(*element);
}
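The `erase()` change above removes a flushed memtable from the active vector but keeps it reachable through `_flushed_memtables_with_active_reads` for as long as a reader references it. A simplified model of that idea, using `std::weak_ptr` in place of the boost::intrusive auto-unlink list the real code uses (the intrusive hook unlinks automatically on destruction; here we prune expired entries explicitly):

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <memory>
#include <vector>

struct memtable_model {};
using shared_mt = std::shared_ptr<memtable_model>;

// Simplified model of memtable_list: erase() moves a memtable out of
// the active vector, but it stays visible through a secondary list
// until the last reader drops its reference.
struct memtable_list_model {
    std::vector<shared_mt> active;
    std::list<std::weak_ptr<memtable_model>> flushed_with_reads;

    void erase(const shared_mt& mt) {
        auto i = std::find(active.begin(), active.end(), mt);
        if (i != active.end()) {
            active.erase(i);
        }
        // Track it until all readers are gone.
        flushed_with_reads.push_back(mt);
    }

    // The intrusive version drops entries automatically; this model
    // prunes expired weak pointers by hand.
    void prune() {
        flushed_with_reads.remove_if(
            [](const std::weak_ptr<memtable_model>& w) { return w.expired(); });
    }
};
```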
// Synchronously swaps the active memtable with a new, empty one,
@@ -563,8 +580,15 @@ public:
sstable_list_builder& operator=(const sstable_list_builder&) = delete;
sstable_list_builder(const sstable_list_builder&) = delete;
// Struct to return the newly built sstable set and the removed sstables
struct result {
lw_shared_ptr<sstables::sstable_set> new_sstable_set;
std::vector<sstables::shared_sstable> removed_sstables;
};
// Builds new sstable set from existing one, with new sstables added to it and old sstables removed from it.
future<lw_shared_ptr<sstables::sstable_set>>
// Returns the updated sstable set and a list of removed sstables.
future<result>
build_new_list(const sstables::sstable_set& current_sstables,
sstables::sstable_set new_sstable_list,
const std::vector<sstables::shared_sstable>& new_sstables,
@@ -593,19 +617,19 @@ private:
future<> handle_tablet_split_completion(size_t old_tablet_count, const locator::tablet_map& new_tmap);
// Select a storage group from a given token.
storage_group& storage_group_for_token(dht::token token) const noexcept;
storage_group& storage_group_for_token(dht::token token) const;
storage_group& storage_group_for_id(size_t i) const;
std::unique_ptr<storage_group_manager> make_storage_group_manager();
compaction_group* get_compaction_group(size_t id) const noexcept;
compaction_group* get_compaction_group(size_t id) const;
// Select a compaction group from a given token.
compaction_group& compaction_group_for_token(dht::token token) const noexcept;
compaction_group& compaction_group_for_token(dht::token token) const;
// Return compaction groups, present in this shard, that own a particular token range.
utils::chunked_vector<compaction_group*> compaction_groups_for_token_range(dht::token_range tr) const;
// Select a compaction group from a given key.
compaction_group& compaction_group_for_key(partition_key_view key, const schema_ptr& s) const noexcept;
compaction_group& compaction_group_for_key(partition_key_view key, const schema_ptr& s) const;
// Select a compaction group from a given sstable based on its token range.
compaction_group& compaction_group_for_sstable(const sstables::shared_sstable& sst) const noexcept;
compaction_group& compaction_group_for_sstable(const sstables::shared_sstable& sst) const;
// Safely iterate through compaction groups, while performing async operations on them.
future<> parallel_foreach_compaction_group(std::function<future<>(compaction_group&)> action);
void for_each_compaction_group(std::function<void(compaction_group&)> action);
@@ -613,13 +637,6 @@ private:
// Unsafe reference to all storage groups. Don't use it across preemption points.
const storage_group_map& storage_groups() const;
// Safely iterate through SSTables, with deletion guard taken to make sure they're not
// removed during iteration.
// WARNING: Be careful that the action doesn't perform an operation that will itself
// take the deletion guard, as that will cause a deadlock. For example, memtable flush
// can wait on compaction (backpressure) which in turn takes deletion guard on completion.
future<> safe_foreach_sstable(const sstables::sstable_set&, noncopyable_function<future<>(const sstables::shared_sstable&)> action);
// Returns a sstable set that can be safely used for purging any expired tombstone in a compaction group.
// The sstables in the compaction group alone are not sufficient, since there might be other compaction
// groups during tablet split with overlapping token range, and we need to include them all in a single
@@ -1627,6 +1644,8 @@ public:
seastar::scheduling_group get_streaming_scheduling_group() const { return _dbcfg.streaming_scheduling_group; }
seastar::scheduling_group get_gossip_scheduling_group() const { return _dbcfg.gossip_scheduling_group; }
compaction_manager& get_compaction_manager() {
return _compaction_manager;
}


@@ -100,7 +100,7 @@ struct reclaim_config {
// A container for memtables. Called "region_group" for historical
// reasons. Receives updates about memtable size change via the
// LSA region_listener interface.
class region_group : public logalloc::region_listener {
class region_group {
using region_heap = dirty_memory_manager_logalloc::region_heap;
public:
struct allocating_function {
@@ -238,8 +238,6 @@ private:
future<> release_queued_allocations();
void notify_unspooled_pressure_relieved();
friend void region_group_binomial_group_sanity_check(const region_group::region_heap& bh);
private: // from region_listener
virtual void moved(logalloc::region* old_address, logalloc::region* new_address) override;
public:
// When creating a region_group, one can specify an optional throttle_threshold parameter. This
// parameter won't affect normal allocations, but an API is provided, through the region_group's
@@ -268,30 +266,24 @@ public:
}
void update_unspooled(ssize_t delta);
// It would be easier to call update, but it is unfortunately broken in boost versions up to at
// least 1.59.
//
// One possibility would be to just test for delta signedness, but we adopt an explicit call for
// two reasons:
//
// 1) it saves us a branch
// 2) some callers would like to pass delta = 0. For instance, when we are making a region
// evictable / non-evictable. Because the evictable occupancy changes, we would like to call
// the full update cycle even then.
virtual void increase_usage(logalloc::region* r, ssize_t delta) override { // From region_listener
void increase_usage(logalloc::region* r) { // Called by memtable's region_listener
// It would be easier to call update, but it is unfortunately broken in boost versions up to at
// least 1.59.
//
// One possibility would be to just test for delta signedness, but we adopt an explicit call for
// two reasons:
//
// 1) it saves us a branch
// 2) some callers would like to pass delta = 0. For instance, when we are making a region
// evictable / non-evictable. Because the evictable occupancy changes, we would like to call
// the full update cycle even then.
_regions.increase(*static_cast<size_tracked_region*>(r)->_heap_handle);
update_unspooled(delta);
}
virtual void decrease_evictable_usage(logalloc::region* r) override { // From region_listener
void decrease_usage(logalloc::region* r) { // Called by memtable's region_listener
_regions.decrease(*static_cast<size_tracked_region*>(r)->_heap_handle);
}
virtual void decrease_usage(logalloc::region* r, ssize_t delta) override { // From region_listener
decrease_evictable_usage(r);
update_unspooled(delta);
}
//
// Make sure that the function specified by the parameter func only runs when this region_group,
// as well as each of its ancestors have a memory_used() amount of memory that is lesser or
@@ -332,10 +324,11 @@ private:
bool execution_permitted() noexcept;
uint64_t top_region_evictable_space() const noexcept;
virtual void add(logalloc::region* child) override; // from region_listener
virtual void del(logalloc::region* child) override; // from region_listener
public:
void add(logalloc::region* child); // Called by memtable's region_listener
void del(logalloc::region* child); // Called by memtable's region_listener
void moved(logalloc::region* old_address, logalloc::region* new_address); // Called by memtable's region_listener
private:
friend class ::test_region_group;
};


@@ -124,14 +124,13 @@ memtable::memtable(schema_ptr schema, dirty_memory_manager& dmm,
memtable_list* memtable_list, seastar::scheduling_group compaction_scheduling_group)
: dirty_memory_manager_logalloc::size_tracked_region()
, _dirty_mgr(dmm)
, _cleaner(*this, no_cache_tracker, table_stats.memtable_app_stats, compaction_scheduling_group,
[this] (size_t freed) { remove_flushed_memory(freed); })
, _cleaner(*this, no_cache_tracker, table_stats.memtable_app_stats, compaction_scheduling_group)
, _memtable_list(memtable_list)
, _schema(std::move(schema))
, _table_shared_data(table_shared_data)
, partitions(dht::raw_token_less_comparator{})
, _table_stats(table_stats) {
logalloc::region::listen(&dmm.region_group());
logalloc::region::listen(this);
}
static thread_local dirty_memory_manager mgr_for_tests;
@@ -149,23 +148,17 @@ memtable::~memtable() {
logalloc::region::unlisten();
}
uint64_t memtable::dirty_size() const {
return occupancy().total_space();
}
void memtable::evict_entry(memtable_entry& e, mutation_cleaner& cleaner) noexcept {
e.partition().evict(cleaner);
nr_partitions--;
}
void memtable::clear() noexcept {
auto dirty_before = dirty_size();
with_allocator(allocator(), [this] {
partitions.clear_and_dispose([this] (memtable_entry* e) noexcept {
evict_entry(*e, _cleaner);
});
});
remove_flushed_memory(dirty_before - dirty_size());
}
future<> memtable::clear_gently() noexcept {
@@ -176,7 +169,6 @@ future<> memtable::clear_gently() noexcept {
auto p = std::move(partitions);
nr_partitions = 0;
while (!p.empty()) {
auto dirty_before = dirty_size();
with_allocator(alloc, [&] () noexcept {
while (!p.empty()) {
if (p.begin()->clear_gently() == stop_iteration::no) {
@@ -188,7 +180,6 @@ future<> memtable::clear_gently() noexcept {
}
}
});
remove_flushed_memory(dirty_before - dirty_size());
seastar::thread::yield();
}
@@ -283,6 +274,11 @@ memtable::slice(const dht::partition_range& range) const {
}
class iterator_reader {
// DO NOT RELEASE the memtable! Keep a reference to it, so it stays in
// memtable_list::_flushed_memtables_with_active_reads and so that it keeps
// blocking tombstone GC of tombstone in the cache, which cover data that
// used to be in this memtable, and which will possibly be produced by this
// reader later on.
lw_shared_ptr<memtable> _memtable;
schema_ptr _schema;
const dht::partition_range* _range;
@@ -381,7 +377,6 @@ protected:
streamed_mutation::forwarding fwd,
mutation_reader::forwarding fwd_mr) {
auto ret = _memtable->_underlying->make_reader_v2(_schema, std::move(permit), delegate, slice, nullptr, fwd, fwd_mr);
_memtable = {};
_last = {};
return ret;
}
@@ -525,13 +520,16 @@ public:
void memtable::add_flushed_memory(uint64_t delta) {
_flushed_memory += delta;
_dirty_mgr.account_potentially_cleaned_up_memory(this, delta);
if (_flushed_memory > 0) {
_dirty_mgr.account_potentially_cleaned_up_memory(this, std::min<int64_t>(delta, _flushed_memory));
}
}
void memtable::remove_flushed_memory(uint64_t delta) {
delta = std::min(_flushed_memory, delta);
if (_flushed_memory > 0) {
_dirty_mgr.revert_potentially_cleaned_up_memory(this, std::min<int64_t>(delta, _flushed_memory));
}
_flushed_memory -= delta;
_dirty_mgr.revert_potentially_cleaned_up_memory(this, delta);
}
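The reworked accounting above lets `_flushed_memory` go negative when more memory is freed than was spooled, while only the positive part is ever registered with the dirty memory manager. A condensed sketch of the clamping rules, with a plain `accounted` counter standing in for the `account_potentially_cleaned_up_memory` / `revert_potentially_cleaned_up_memory` calls:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch of add/remove_flushed_memory clamping: the invariant is that
// the manager-visible amount equals max(flushed_memory, 0).
struct flush_accounting {
    int64_t flushed_memory = 0;  // may go negative
    int64_t accounted = 0;       // what the dirty memory manager was told

    void add_flushed_memory(int64_t delta) {
        flushed_memory += delta;
        if (flushed_memory > 0) {
            // Only the part above zero is newly visible to the manager.
            accounted += std::min(delta, flushed_memory);
        }
    }

    void remove_flushed_memory(int64_t delta) {
        if (flushed_memory > 0) {
            // Can't revert more than was ever accounted.
            accounted -= std::min(delta, flushed_memory);
        }
        flushed_memory -= delta;
    }
};
```

Stepping through a flush that frees more than it spooled shows the counter bottoming out at zero instead of going negative.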
void memtable::on_detach_from_region_group() noexcept {
@@ -540,8 +538,11 @@ void memtable::on_detach_from_region_group() noexcept {
}
void memtable::revert_flushed_memory() noexcept {
_dirty_mgr.revert_potentially_cleaned_up_memory(this, _flushed_memory);
if (_flushed_memory > 0) {
_dirty_mgr.revert_potentially_cleaned_up_memory(this, _flushed_memory);
}
_flushed_memory = 0;
_total_memory_low_watermark_during_flush = _total_memory;
}
class flush_memory_accounter {
@@ -554,7 +555,7 @@ public:
: _mt(mt)
{}
~flush_memory_accounter() {
SCYLLA_ASSERT(_mt._flushed_memory <= _mt.occupancy().total_space());
SCYLLA_ASSERT(_mt._flushed_memory <= static_cast<int64_t>(_mt.occupancy().total_space()));
}
uint64_t compute_size(memtable_entry& e, partition_snapshot& snp) {
return e.size_in_allocator_without_rows(_mt.allocator())
@@ -747,6 +748,7 @@ memtable::make_flat_reader_opt(schema_ptr query_schema,
mutation_reader
memtable::make_flush_reader(schema_ptr s, reader_permit permit) {
if (!_merged_into_cache) {
revert_flushed_memory();
return make_mutation_reader<flush_reader>(std::move(s), std::move(permit), shared_from_this());
} else {
auto& full_slice = s->full_slice();
@@ -834,6 +836,10 @@ void memtable::mark_flushed(mutation_source underlying) noexcept {
_underlying = std::move(underlying);
}
bool memtable::is_merging_to_cache() const noexcept {
return _merging_into_cache;
}
bool memtable::is_flushed() const noexcept {
return bool(_underlying);
}
@@ -871,3 +877,35 @@ auto fmt::formatter<replica::memtable>::format(replica::memtable& mt,
logalloc::reclaim_lock rl(mt);
return fmt::format_to(ctx.out(), "{{memtable: [{}]}}", fmt::join(mt.partitions, ",\n"));
}
void replica::memtable::increase_usage(logalloc::region* r, ssize_t delta) {
SCYLLA_ASSERT(delta >= 0);
_dirty_mgr.region_group().increase_usage(r);
_dirty_mgr.region_group().update_unspooled(delta);
_total_memory += delta;
}
void replica::memtable::decrease_evictable_usage(logalloc::region* r) {
_dirty_mgr.region_group().decrease_usage(r);
}
void replica::memtable::decrease_usage(logalloc::region* r, ssize_t delta) {
SCYLLA_ASSERT(delta <= 0);
_dirty_mgr.region_group().decrease_usage(r);
_dirty_mgr.region_group().update_unspooled(delta);
_total_memory += delta;
if (_total_memory < _total_memory_low_watermark_during_flush) {
remove_flushed_memory(_total_memory_low_watermark_during_flush - _total_memory);
_total_memory_low_watermark_during_flush = _total_memory;
}
}
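The `decrease_usage()` handler above folds LSA-compaction savings into the flush accounting through a low watermark: total memory can fluctuate up and down during the flush, but only the net new lows count as freed flush memory. A condensed model of just that tracking (the real code feeds the difference into `remove_flushed_memory()`; here `removed_flushed` simply accumulates it):

```cpp
#include <cassert>
#include <cstdint>

// Model of the _total_memory_low_watermark_during_flush logic:
// the watermark starts at the memory level at flush start, and each
// new low converts the drop below the previous low into "removed
// flushed memory".
struct watermark_tracker {
    int64_t total_memory;
    int64_t low_watermark;
    int64_t removed_flushed = 0;

    explicit watermark_tracker(int64_t at_flush_start)
        : total_memory(at_flush_start), low_watermark(at_flush_start) {}

    void change_usage(int64_t delta) {
        total_memory += delta;
        if (total_memory < low_watermark) {
            removed_flushed += low_watermark - total_memory;
            low_watermark = total_memory;
        }
    }
};
```

Note how an increase followed by a decrease back to a previous level contributes nothing; only drops below the lowest point seen so far do.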
void replica::memtable::add(logalloc::region* r) {
_dirty_mgr.region_group().add(r);
}
void replica::memtable::del(logalloc::region* r) {
_dirty_mgr.region_group().del(r);
}
void replica::memtable::moved(logalloc::region* old_address, logalloc::region* new_address) {
_dirty_mgr.region_group().moved(old_address, new_address);
}


@@ -104,7 +104,11 @@ class dirty_memory_manager;
struct table_stats;
// Managed by lw_shared_ptr<>.
class memtable final : public enable_lw_shared_from_this<memtable>, private dirty_memory_manager_logalloc::size_tracked_region {
class memtable final
: public enable_lw_shared_from_this<memtable>
, public boost::intrusive::list_base_hook<boost::intrusive::link_mode<boost::intrusive::auto_unlink>>
, private dirty_memory_manager_logalloc::size_tracked_region
, public logalloc::region_listener {
public:
using partitions_type = double_decker<int64_t, memtable_entry,
dht::raw_token_less_comparator, dht::ring_position_comparator,
@@ -126,7 +130,25 @@ private:
// mutation_source is necessary for the combined mutation source to be
// monotonic. That combined source in this case is cache + memtable.
mutation_source_opt _underlying;
uint64_t _flushed_memory = 0;
bool _merging_into_cache = false;
// Tracks the difference between the amount of memory "spooled" during the flush
// and the memory freed during the flush.
//
// If positive, this is equal to the difference between the amount of "spooled" memory
// registered in dirty_memory_manager with account_potentially_cleaned_up_memory
// and unregistered with revert_potentially_cleaned_up_memory.
// If negative, the above difference is 0.
int64_t _flushed_memory = 0;
// For most of the time, this is equal to occupancy().total_space().
// But we want to know the current memory usage in our logalloc::region_listener
// handlers, and at that point in time occupancy() is undefined. (LSA can choose to
// update it before or after the handler). So we track it ourselves, based on the deltas
// passed to the handlers.
uint64_t _total_memory = 0;
// During LSA compaction, _total_memory can fluctuate up and down.
// But we are only interested in the maximal total decrease since the beginning of flush.
// This tracks the lowest value of _total_memory seen during the flush.
uint64_t _total_memory_low_watermark_during_flush = 0;
bool _merged_into_cache = false;
replica::table_stats& _table_stats;
@@ -194,7 +216,6 @@ private:
void add_flushed_memory(uint64_t);
void remove_flushed_memory(uint64_t);
void clear() noexcept;
uint64_t dirty_size() const;
public:
explicit memtable(schema_ptr schema, dirty_memory_manager&,
memtable_table_shared_data& shared_data,
@@ -304,6 +325,7 @@ public:
bool empty() const noexcept { return partitions.empty(); }
void mark_flushed(mutation_source) noexcept;
bool is_merging_to_cache() const noexcept;
bool is_flushed() const noexcept;
void on_detach_from_region_group() noexcept;
void revert_flushed_memory() noexcept;
@@ -326,6 +348,14 @@ public:
return _dirty_mgr;
}
// Implementation of region_listener.
virtual void increase_usage(logalloc::region* r, ssize_t delta) override;
virtual void decrease_evictable_usage(logalloc::region* r) override;
virtual void decrease_usage(logalloc::region* r, ssize_t delta) override;
virtual void add(logalloc::region* r) override;
virtual void del(logalloc::region* r) override;
virtual void moved(logalloc::region* old_address, logalloc::region* new_address) override;
friend fmt::formatter<memtable>;
};


@@ -291,7 +291,12 @@ private:
auto& cdef = _underlying_schema->column_at(kind, id);
writer.writer().Key(cdef.name_as_text());
if (cdef.is_atomic()) {
writer.write_atomic_cell_value(cell.as_atomic_cell(cdef), cdef.type);
auto acv = cell.as_atomic_cell(cdef);
if (acv.is_live()) {
writer.write_atomic_cell_value(acv, cdef.type);
} else {
writer.writer().Null();
}
} else if (cdef.type->is_collection() || cdef.type->is_user_type()) {
cell.as_collection_mutation().with_deserialized(*cdef.type, [&] (collection_mutation_view_description mv) {
writer.write_collection_value(mv, cdef.type);
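The guard added above makes the JSON writer emit `null` for a dead (deleted) atomic cell instead of trying to serialize a value it does not have. Sketched with a tiny stand-in for the cell and writer types:

```cpp
#include <cassert>
#include <string>

// Minimal model of the patched branch: live cells serialize their
// value, dead cells (tombstoned) serialize as JSON null.
struct cell_model {
    bool live;
    std::string value;  // only meaningful when live
};

std::string write_atomic_cell(const cell_model& c) {
    if (c.live) {
        return "\"" + c.value + "\"";
    }
    return "null";
}
```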


@@ -247,7 +247,8 @@ table::make_reader_v2(schema_ptr s,
const auto bypass_cache = slice.options.contains(query::partition_slice::option::bypass_cache);
if (cache_enabled() && !bypass_cache) {
if (auto reader_opt = _cache.make_reader_opt(s, permit, range, slice, &_compaction_manager.get_tombstone_gc_state(), std::move(trace_state), fwd, fwd_mr)) {
if (auto reader_opt = _cache.make_reader_opt(s, permit, range, slice, &_compaction_manager.get_tombstone_gc_state(),
get_max_purgeable_fn_for_cache_underlying_reader(), std::move(trace_state), fwd, fwd_mr)) {
readers.emplace_back(std::move(*reader_opt));
}
} else {
@@ -692,7 +693,7 @@ public:
future<> update_effective_replication_map(const locator::effective_replication_map& erm, noncopyable_function<void()> refresh_mutation_source) override { return make_ready_future(); }
compaction_group& compaction_group_for_token(dht::token token) const noexcept override {
compaction_group& compaction_group_for_token(dht::token token) const override {
return get_compaction_group();
}
utils::chunked_vector<compaction_group*> compaction_groups_for_token_range(dht::token_range tr) const override {
@@ -700,16 +701,16 @@ public:
ret.push_back(&get_compaction_group());
return ret;
}
compaction_group& compaction_group_for_key(partition_key_view key, const schema_ptr& s) const noexcept override {
compaction_group& compaction_group_for_key(partition_key_view key, const schema_ptr& s) const override {
return get_compaction_group();
}
compaction_group& compaction_group_for_sstable(const sstables::shared_sstable& sst) const noexcept override {
compaction_group& compaction_group_for_sstable(const sstables::shared_sstable& sst) const override {
return get_compaction_group();
}
size_t log2_storage_groups() const override {
return 0;
}
storage_group& storage_group_for_token(dht::token token) const noexcept override {
storage_group& storage_group_for_token(dht::token token) const override {
return *_single_sg;
}
@@ -805,27 +806,34 @@ public:
auto local_replica = locator::tablet_replica{_my_host_id, this_shard_id()};
for (auto tid : tmap.tablet_ids()) {
auto range = tmap.get_token_range(tid);
if (tmap.has_replica(tid, local_replica)) {
tlogger.debug("Tablet with id {} and range {} present for {}.{}", tid, range, schema()->ks_name(), schema()->cf_name());
ret[tid.value()] = allocate_storage_group(tmap, tid, std::move(range));
if (!tmap.has_replica(tid, local_replica)) {
continue;
}
// if the tablet was cleaned up already on this replica, don't allocate a storage group for it.
auto trinfo = tmap.get_tablet_transition_info(tid);
if (trinfo && locator::is_post_cleanup(local_replica, tmap.get_tablet_info(tid), *trinfo)) {
continue;
}
auto range = tmap.get_token_range(tid);
tlogger.debug("Tablet with id {} and range {} present for {}.{}", tid, range, schema()->ks_name(), schema()->cf_name());
ret[tid.value()] = allocate_storage_group(tmap, tid, std::move(range));
}
_storage_groups = std::move(ret);
}
future<> update_effective_replication_map(const locator::effective_replication_map& erm, noncopyable_function<void()> refresh_mutation_source) override;
compaction_group& compaction_group_for_token(dht::token token) const noexcept override;
compaction_group& compaction_group_for_token(dht::token token) const override;
utils::chunked_vector<compaction_group*> compaction_groups_for_token_range(dht::token_range tr) const override;
compaction_group& compaction_group_for_key(partition_key_view key, const schema_ptr& s) const noexcept override;
compaction_group& compaction_group_for_sstable(const sstables::shared_sstable& sst) const noexcept override;
compaction_group& compaction_group_for_key(partition_key_view key, const schema_ptr& s) const override;
compaction_group& compaction_group_for_sstable(const sstables::shared_sstable& sst) const override;
size_t log2_storage_groups() const override {
return log2ceil(tablet_map().tablet_count());
}
storage_group& storage_group_for_token(dht::token token) const noexcept override {
storage_group& storage_group_for_token(dht::token token) const override {
return storage_group_for_id(storage_group_of(token).first);
}
@@ -952,8 +960,10 @@ bool tablet_storage_group_manager::all_storage_groups_split() {
return true;
}
auto split_ready = std::ranges::all_of(_storage_groups | boost::adaptors::map_values,
std::mem_fn(&storage_group::set_split_mode));
bool split_ready = true;
for (const storage_group_ptr& sg : _storage_groups | boost::adaptors::map_values) {
split_ready &= sg->set_split_mode();
}
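The explicit loop above deliberately replaces `std::ranges::all_of`, which stops at the first group that is not split-ready; here `set_split_mode()` must be invoked on every storage group for its side effects. The difference in a nutshell, using `std::all_of` (same short-circuit behavior) and a toy `group_model`:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrates why the patch replaces all_of with an explicit loop:
// all_of stops evaluating at the first false, so later
// set_split_mode() calls (with their side effects) would be skipped.
struct group_model {
    bool ready;
    int calls = 0;
    bool set_split_mode() { ++calls; return ready; }
};

bool all_of_version(std::vector<group_model>& groups) {
    return std::all_of(groups.begin(), groups.end(),
                       [](group_model& g) { return g.set_split_mode(); });
}

bool loop_version(std::vector<group_model>& groups) {
    bool split_ready = true;
    for (auto& g : groups) {
        split_ready &= g.set_split_mode();  // always invoked
    }
    return split_ready;
}
```

With a non-ready group first, the `all_of` version never touches the rest; the loop version calls all of them and still returns the combined result.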
// The table replica will say to coordinator that its split status is ready by
// mirroring the sequence number from tablet metadata into its local state,
@@ -981,6 +991,12 @@ sstables::compaction_type_options::split tablet_storage_group_manager::split_com
future<> tablet_storage_group_manager::split_all_storage_groups() {
sstables::compaction_type_options::split opt = split_compaction_options();
co_await utils::get_local_injector().inject("split_storage_groups_wait", [] (auto& handler) -> future<> {
dblog.info("split_storage_groups_wait: waiting");
co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::minutes{5});
dblog.info("split_storage_groups_wait: done");
}, false);
co_await for_each_storage_group_gently([opt] (storage_group& storage_group) {
return storage_group.split(opt);
});
@@ -1036,11 +1052,11 @@ std::unique_ptr<storage_group_manager> table::make_storage_group_manager() {
return ret;
}
compaction_group* table::get_compaction_group(size_t id) const noexcept {
compaction_group* table::get_compaction_group(size_t id) const {
return storage_group_for_id(id).main_compaction_group().get();
}
storage_group& table::storage_group_for_token(dht::token token) const noexcept {
storage_group& table::storage_group_for_token(dht::token token) const {
return _sg_manager->storage_group_for_token(token);
}
@@ -1048,13 +1064,13 @@ storage_group& table::storage_group_for_id(size_t i) const {
return _sg_manager->storage_group_for_id(_schema, i);
}
compaction_group& tablet_storage_group_manager::compaction_group_for_token(dht::token token) const noexcept {
compaction_group& tablet_storage_group_manager::compaction_group_for_token(dht::token token) const {
auto [idx, range_side] = storage_group_of(token);
auto& sg = storage_group_for_id(idx);
return *sg.select_compaction_group(range_side);
}
compaction_group& table::compaction_group_for_token(dht::token token) const noexcept {
compaction_group& table::compaction_group_for_token(dht::token token) const {
return _sg_manager->compaction_group_for_token(token);
}
@@ -1085,15 +1101,15 @@ utils::chunked_vector<compaction_group*> table::compaction_groups_for_token_rang
return _sg_manager->compaction_groups_for_token_range(tr);
}
compaction_group& tablet_storage_group_manager::compaction_group_for_key(partition_key_view key, const schema_ptr& s) const noexcept {
compaction_group& tablet_storage_group_manager::compaction_group_for_key(partition_key_view key, const schema_ptr& s) const {
return compaction_group_for_token(dht::get_token(*s, key));
}
compaction_group& table::compaction_group_for_key(partition_key_view key, const schema_ptr& s) const noexcept {
compaction_group& table::compaction_group_for_key(partition_key_view key, const schema_ptr& s) const {
return _sg_manager->compaction_group_for_key(key, s);
}
compaction_group& tablet_storage_group_manager::compaction_group_for_sstable(const sstables::shared_sstable& sst) const noexcept {
compaction_group& tablet_storage_group_manager::compaction_group_for_sstable(const sstables::shared_sstable& sst) const {
auto [first_id, first_range_side] = storage_group_of(sst->get_first_decorated_key().token());
auto [last_id, last_range_side] = storage_group_of(sst->get_last_decorated_key().token());
@@ -1111,7 +1127,7 @@ compaction_group& tablet_storage_group_manager::compaction_group_for_sstable(con
return *sg.select_compaction_group(first_range_side);
}
compaction_group& table::compaction_group_for_sstable(const sstables::shared_sstable& sst) const noexcept {
compaction_group& table::compaction_group_for_sstable(const sstables::shared_sstable& sst) const {
return _sg_manager->compaction_group_for_sstable(sst);
}
@@ -1153,14 +1169,6 @@ const storage_group_map& table::storage_groups() const {
return _sg_manager->storage_groups();
}
future<> table::safe_foreach_sstable(const sstables::sstable_set& set, noncopyable_function<future<>(const sstables::shared_sstable&)> action) {
auto deletion_guard = co_await get_units(_sstable_deletion_sem, 1);
co_await set.for_each_sstable_gently([&] (const sstables::shared_sstable& sst) -> future<> {
return action(sst);
});
}
future<utils::chunked_vector<sstables::sstable_files_snapshot>> table::take_storage_snapshot(dht::token_range tr) {
utils::chunked_vector<sstables::sstable_files_snapshot> ret;
@@ -1173,9 +1181,11 @@ future<utils::chunked_vector<sstables::sstable_files_snapshot>> table::take_stor
co_await cg->flush();
auto set = cg->make_sstable_set();
co_await safe_foreach_sstable(*set, [&] (const sstables::shared_sstable& sst) -> future<> {
// The sstable set must be obtained *after* the deletion lock is taken,
// otherwise components of sstables in the set might be unlinked from the filesystem
// by compaction while we are waiting for the lock.
auto deletion_guard = co_await get_units(_sstable_deletion_sem, 1);
co_await cg->make_sstable_set()->for_each_sstable_gently([&] (const sstables::shared_sstable& sst) -> future<> {
ret.push_back({
.sst = sst,
.files = co_await sst->readable_file_for_all_components(),
@@ -1194,8 +1204,11 @@ table::clone_tablet_storage(locator::tablet_id tid) {
auto& sg = storage_group_for_id(tid.value());
auto sg_holder = sg.async_gate().hold();
co_await sg.flush();
auto set = sg.make_sstable_set();
co_await safe_foreach_sstable(*set, [&] (const sstables::shared_sstable& sst) -> future<> {
// The sstable set must be obtained *after* the deletion lock is taken,
// otherwise components of sstables in the set might be unlinked from the filesystem
// by compaction while we are waiting for the lock.
auto deletion_guard = co_await get_units(_sstable_deletion_sem, 1);
co_await sg.make_sstable_set()->for_each_sstable_gently([&] (const sstables::shared_sstable& sst) -> future<> {
ret.push_back(co_await sst->clone(calculate_generation_for_new_table()));
});
co_return ret;
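The reordering in `take_storage_snapshot()` and `clone_tablet_storage()` above matters: materializing the sstable set first and waiting for the deletion lock afterwards leaves a window in which compaction can unlink files still referenced by the snapshot. A schematic of the corrected ordering, with `std::mutex` standing in for the deletion semaphore and integer generations standing in for sstables:

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <vector>

// Schematic of the fixed ordering: acquire the deletion guard FIRST,
// then materialize the sstable set, so compaction cannot unlink files
// between taking the snapshot and taking the guard.
struct table_model {
    std::mutex deletion_sem;      // stands in for _sstable_deletion_sem
    std::set<int> live_sstables;  // generations currently on disk

    std::vector<int> snapshot() {
        // Correct order: guard, then read the current set.
        std::lock_guard<std::mutex> deletion_guard(deletion_sem);
        return {live_sstables.begin(), live_sstables.end()};
    }

    void compact_away(int gen) {
        std::lock_guard<std::mutex> deletion_guard(deletion_sem);
        live_sstables.erase(gen);  // "unlink" happens under the guard
    }
};
```

Under this ordering a snapshot can only ever observe sstables that are still on disk, because removal is serialized behind the same guard.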
@@ -1561,6 +1574,17 @@ table::try_flush_memtable_to_sstable(compaction_group& cg, lw_shared_ptr<memtabl
co_await with_scheduling_group(_config.memtable_to_cache_scheduling_group, [this, old, &newtabs, &cg] {
return update_cache(cg, old, newtabs);
});
co_await utils::get_local_injector().inject("replica_post_flush_after_update_cache", [this] (auto& handler) -> future<> {
const auto this_table_name = format("{}.{}", _schema->ks_name(), _schema->cf_name());
if (this_table_name == handler.get("table_name")) {
tlogger.info("error injection handler replica_post_flush_after_update_cache: suspending flush for table {}", this_table_name);
handler.set("suspended", true);
co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::minutes{5});
tlogger.info("error injection handler replica_post_flush_after_update_cache: resuming flush for table {}", this_table_name);
}
});
cg.memtables()->erase(old);
tlogger.debug("Memtable for {}.{} replaced, into {} sstables", old->schema()->ks_name(), old->schema()->cf_name(), newtabs.size());
co_return;
@@ -1744,23 +1768,30 @@ void table::subtract_compaction_group_from_stats(const compaction_group& cg) noe
_stats.live_sstable_count -= cg.live_sstable_count();
}
future<lw_shared_ptr<sstables::sstable_set>>
future<table::sstable_list_builder::result>
table::sstable_list_builder::build_new_list(const sstables::sstable_set& current_sstables,
sstables::sstable_set new_sstable_list,
const std::vector<sstables::shared_sstable>& new_sstables,
const std::vector<sstables::shared_sstable>& old_sstables) {
std::unordered_set<sstables::shared_sstable> s(old_sstables.begin(), old_sstables.end());
// this might seem dangerous, but "move" here just avoids constness,
// making the two ranges compatible when compiling with boost 1.55.
// No one is actually moving anything...
for (auto all = current_sstables.all(); auto&& tab : boost::range::join(new_sstables, std::move(*all))) {
if (!s.contains(tab)) {
// add sstables from the current list into the new list except the ones that are in the old list
std::vector<sstables::shared_sstable> removed_sstables;
co_await current_sstables.for_each_sstable_gently([&s, &removed_sstables, &new_sstable_list] (const sstables::shared_sstable& tab) {
if (s.contains(tab)) {
removed_sstables.push_back(tab);
} else {
new_sstable_list.insert(tab);
}
});
// add new sstables into the new list
for (auto& tab : new_sstables) {
new_sstable_list.insert(tab);
co_await coroutine::maybe_yield();
}
co_return make_lw_shared<sstables::sstable_set>(std::move(new_sstable_list));
co_return table::sstable_list_builder::result {
make_lw_shared<sstables::sstable_set>(std::move(new_sstable_list)), std::move(removed_sstables)};
}
future<>
@@ -1788,54 +1819,7 @@ compaction_group::delete_unused_sstables(sstables::compaction_completion_desc de
}
future<>
compaction_group::update_sstable_lists_on_off_strategy_completion(sstables::compaction_completion_desc desc) {
class sstable_lists_updater : public row_cache::external_updater_impl {
using sstables_t = std::vector<sstables::shared_sstable>;
table& _t;
compaction_group& _cg;
table::sstable_list_builder _builder;
const sstables_t& _old_maintenance;
const sstables_t& _new_main;
lw_shared_ptr<sstables::sstable_set> _new_maintenance_list;
lw_shared_ptr<sstables::sstable_set> _new_main_list;
public:
explicit sstable_lists_updater(compaction_group& cg, table::sstable_list_builder::permit_t permit, const sstables_t& old_maintenance, const sstables_t& new_main)
: _t(cg._t), _cg(cg), _builder(std::move(permit)), _old_maintenance(old_maintenance), _new_main(new_main) {
}
virtual future<> prepare() override {
sstables_t empty;
// adding new sstables, created by off-strategy operation, to main set
_new_main_list = co_await _builder.build_new_list(*_cg.main_sstables(), _t._compaction_strategy.make_sstable_set(_t._schema), _new_main, empty);
// removing old sstables, used as input by off-strategy, from the maintenance set
_new_maintenance_list = co_await _builder.build_new_list(*_cg.maintenance_sstables(), std::move(*_t.make_maintenance_sstable_set()), empty, _old_maintenance);
}
virtual void execute() override {
_cg.set_main_sstables(std::move(_new_main_list));
_cg.set_maintenance_sstables(std::move(_new_maintenance_list));
// FIXME: the following is not exception safe
_t.refresh_compound_sstable_set();
// Input sstables aren't removed from the backlog tracker because they come from the maintenance set.
_cg.backlog_tracker_adjust_charges({}, _new_main);
}
static std::unique_ptr<row_cache::external_updater_impl> make(compaction_group& cg, table::sstable_list_builder::permit_t permit, const sstables_t& old_maintenance, const sstables_t& new_main) {
return std::make_unique<sstable_lists_updater>(cg, std::move(permit), old_maintenance, new_main);
}
};
auto permit = co_await seastar::get_units(_t._sstable_set_mutation_sem, 1);
auto updater = row_cache::external_updater(sstable_lists_updater::make(*this, std::move(permit), desc.old_sstables, desc.new_sstables));
// row_cache::invalidate() is only used to synchronize sstable list updates, to prevent race conditions from occurring,
// meaning nothing is actually invalidated.
dht::partition_range_vector empty_ranges = {};
co_await _t.get_row_cache().invalidate(std::move(updater), std::move(empty_ranges));
_t.get_row_cache().refresh_snapshot();
_t.rebuild_statistics();
co_await delete_unused_sstables(std::move(desc));
}
future<>
compaction_group::update_main_sstable_list_on_compaction_completion(sstables::compaction_completion_desc desc) {
compaction_group::update_sstable_sets_on_compaction_completion(sstables::compaction_completion_desc desc) {
// Build a new list of _sstables: We remove from the existing list the
// tables we compacted (by now, there might be more sstables flushed
// later), and we add the new tables generated by the compaction.
@@ -1884,7 +1868,8 @@ compaction_group::update_main_sstable_list_on_compaction_completion(sstables::co
const sstables::compaction_completion_desc& _desc;
struct replacement_desc {
sstables::compaction_completion_desc desc;
lw_shared_ptr<sstables::sstable_set> new_sstables;
table::sstable_list_builder::result main_sstable_set_builder_result;
std::optional<lw_shared_ptr<sstables::sstable_set>> new_maintenance_sstables;
};
std::unordered_map<compaction_group*, replacement_desc> _cg_desc;
public:
@@ -1900,18 +1885,34 @@ compaction_group::update_main_sstable_list_on_compaction_completion(sstables::co
// The group that triggered compaction is the only one to have sstables removed from it.
_cg_desc[&_cg].desc.old_sstables = _desc.old_sstables;
for (auto& [cg, d] : _cg_desc) {
d.new_sstables = co_await _builder.build_new_list(*cg->main_sstables(), _t._compaction_strategy.make_sstable_set(_t._schema),
d.main_sstable_set_builder_result = co_await _builder.build_new_list(*cg->main_sstables(), _t._compaction_strategy.make_sstable_set(_t._schema),
d.desc.new_sstables, d.desc.old_sstables);
if (!d.desc.old_sstables.empty()
&& d.main_sstable_set_builder_result.removed_sstables.size() != d.desc.old_sstables.size()) {
// Not all old_sstables were removed from the main sstable set, which implies that
// they don't exist there. This can happen if the input sstables were picked up from
// the maintenance set during an offstrategy or scrub compaction. So, remove the old
// sstables from the maintenance set. No need to add any new sstables to the maintenance
// set though, as they are always added to the main set.
auto builder_result = co_await _builder.build_new_list(
*cg->maintenance_sstables(), std::move(*_t.make_maintenance_sstable_set()), {}, d.desc.old_sstables);
d.new_maintenance_sstables = std::move(builder_result.new_sstable_set);
}
}
}
virtual void execute() override {
for (auto&& [cg, d] : _cg_desc) {
cg->set_main_sstables(std::move(d.new_sstables));
cg->set_main_sstables(std::move(d.main_sstable_set_builder_result.new_sstable_set));
if (d.new_maintenance_sstables) {
// offstrategy or scrub compaction - replace the maintenance set
cg->set_maintenance_sstables(std::move(d.new_maintenance_sstables.value()));
}
}
// FIXME: the following is not exception safe
_t.refresh_compound_sstable_set();
for (auto& [cg, d] : _cg_desc) {
cg->backlog_tracker_adjust_charges(d.desc.old_sstables, d.desc.new_sstables);
cg->backlog_tracker_adjust_charges(d.main_sstable_set_builder_result.removed_sstables, d.desc.new_sstables);
}
}
static std::unique_ptr<row_cache::external_updater_impl> make(compaction_group& cg, table::sstable_list_builder::permit_t permit, sstables::compaction_completion_desc& d) {
@@ -2245,12 +2246,10 @@ public:
return _cg.memtable_has_key(key);
}
future<> on_compaction_completion(sstables::compaction_completion_desc desc, sstables::offstrategy offstrategy) override {
co_await _cg.update_sstable_sets_on_compaction_completion(std::move(desc));
if (offstrategy) {
co_await _cg.update_sstable_lists_on_off_strategy_completion(std::move(desc));
_cg.trigger_compaction();
co_return;
}
co_await _cg.update_main_sstable_list_on_compaction_completion(std::move(desc));
}
bool is_auto_compaction_disabled_by_user() const noexcept override {
return _t.is_auto_compaction_disabled_by_user();
@@ -2441,7 +2440,7 @@ future<> tablet_storage_group_manager::update_effective_replication_map(const lo
co_return;
}
// Allocate storage group if tablet is migrating in.
// Allocate storage group if tablet is migrating in, or deallocate if it's migrating out.
auto this_replica = locator::tablet_replica{
.host = erm.get_token_metadata().get_my_id(),
.shard = this_shard_id()
@@ -2457,6 +2456,8 @@ future<> tablet_storage_group_manager::update_effective_replication_map(const lo
auto range = new_tablet_map->get_token_range(tid);
_storage_groups[tid.value()] = allocate_storage_group(*new_tablet_map, tid, std::move(range));
tablet_migrating_in = true;
} else if (_storage_groups.contains(tid.value()) && locator::is_post_cleanup(this_replica, new_tablet_map->get_tablet_info(tid), transition_info)) {
remove_storage_group(tid.value());
}
}
@@ -2536,20 +2537,7 @@ max_purgeable_fn table::get_max_purgeable_fn_for_cache_underlying_reader() const
auto max_purgeable_timestamp = api::max_timestamp;
sg.for_each_compaction_group([&dk, is_shadowable, &max_purgeable_timestamp] (const compaction_group_ptr& cg) {
const auto& mt = cg->memtables()->active_memtable();
// see get_max_purgeable_timestamp() in compaction.cc for comments on choosing min timestamp
api::timestamp_type memtable_min_timestamp = is_shadowable ? mt.get_min_live_row_marker_timestamp() : mt.get_min_live_timestamp();
if (memtable_min_timestamp > cg->max_seen_timestamp()) {
// All the entries in the memtable are newer than the entries in the
// SSTable within this compaction group. So, no need to check further.
return;
}
// If a memtable with a minimum timestamp lower than the current maximum
// purgeable timestamp has the given key, the tombstone should not be purged.
if (memtable_min_timestamp < max_purgeable_timestamp && mt.contains_partition(dk)) {
max_purgeable_timestamp = memtable_min_timestamp;
}
max_purgeable_timestamp = std::min(cg->memtables()->min_live_timestamp(dk, is_shadowable, cg->max_seen_timestamp()), max_purgeable_timestamp);
});
return max_purgeable_timestamp;
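The per-group check that this hunk folds into `memtables()->min_live_timestamp(...)` can be sketched standalone: each compaction group contributes the minimum live timestamp its memtable holds for the key, and a tombstone may only be purged if it is older than every such timestamp. The plain structs below are assumptions, not the Scylla types:

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

using timestamp = long long;
constexpr timestamp max_ts = std::numeric_limits<timestamp>::max();

struct group {
    timestamp memtable_min_live;  // min live timestamp in the active memtable
    timestamp max_seen;           // max timestamp seen in the group's sstables
    bool memtable_has_key;        // memtable contains the partition key?
};

timestamp max_purgeable(const std::vector<group>& groups) {
    timestamp result = max_ts;
    for (const group& g : groups) {
        if (g.memtable_min_live > g.max_seen) {
            // Every memtable entry is newer than anything in this group's
            // sstables, so the memtable cannot shadow covered data: skip.
            continue;
        }
        if (g.memtable_has_key) {
            // The tombstone must not outlive this memtable data: cap the
            // purgeable timestamp at the memtable's minimum.
            result = std::min(g.memtable_min_live, result);
        }
    }
    return result;
}
```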
@@ -2572,7 +2560,7 @@ table::sstables_as_snapshot_source() {
std::move(reader),
gc_clock::now(),
get_max_purgeable_fn_for_cache_underlying_reader(),
_compaction_manager.get_tombstone_gc_state(),
_compaction_manager.get_tombstone_gc_state().with_commitlog_check_disabled(),
fwd);
}, [this, sst_set] {
return make_partition_presence_checker(sst_set);
@@ -2983,8 +2971,15 @@ void table::set_schema(schema_ptr s) {
_schema = std::move(s);
for (auto&& v : _views) {
v->view_info()->set_base_info(
v->view_info()->make_base_dependent_view_info(*_schema));
auto base_info = v->view_info()->make_base_dependent_view_info(*_schema);
v->view_info()->set_base_info(base_info);
if (v->registry_entry()) {
v->registry_entry()->update_base_schema(_schema);
}
if (auto reverse_schema = local_schema_registry().get_or_null(reversed(v->version()))) {
reverse_schema->view_info()->set_base_info(base_info);
reverse_schema->registry_entry()->update_base_schema(_schema);
}
}
set_compaction_strategy(_schema->compaction_strategy());
@@ -2998,8 +2993,6 @@ static std::vector<view_ptr>::iterator find_view(std::vector<view_ptr>& views, c
}
void table::add_or_update_view(view_ptr v) {
v->view_info()->set_base_info(
v->view_info()->make_base_dependent_view_info(*_schema));
auto existing = find_view(_views, v);
if (existing != _views.end()) {
*existing = std::move(v);
@@ -3174,10 +3167,6 @@ db::replay_position table::set_low_replay_position_mark() {
template<typename... Args>
void table::do_apply(compaction_group& cg, db::rp_handle&& h, Args&&... args) {
if (cg.async_gate().is_closed()) [[unlikely]] {
on_internal_error(tlogger, "async_gate of table's compaction group is closed");
}
utils::latency_counter lc;
_stats.writes.set_latency(lc);
db::replay_position rp = h;
@@ -3823,7 +3812,6 @@ future<> table::cleanup_tablet(database& db, db::system_keyspace& sys_ks, locato
co_await stop_compaction_groups(sg);
co_await utils::get_local_injector().inject("delay_tablet_compaction_groups_cleanup", std::chrono::seconds(5));
co_await cleanup_compaction_groups(db, sys_ks, tid, sg);
_sg_manager->remove_storage_group(tid.value());
}
future<> table::cleanup_tablet_without_deallocation(database& db, db::system_keyspace& sys_ks, locator::tablet_id tid) {


@@ -774,12 +774,13 @@ row_cache::make_reader_opt(schema_ptr s,
const dht::partition_range& range,
const query::partition_slice& slice,
const tombstone_gc_state* gc_state,
max_purgeable_fn get_max_purgeable,
tracing::trace_state_ptr trace_state,
streamed_mutation::forwarding fwd,
mutation_reader::forwarding fwd_mr)
{
auto make_context = [&] {
return std::make_unique<read_context>(*this, s, permit, range, slice, gc_state, trace_state, fwd_mr);
return std::make_unique<read_context>(*this, s, permit, range, slice, gc_state, get_max_purgeable, trace_state, fwd_mr);
};
if (query::is_single_partition(range) && !fwd_mr) {
@@ -1108,6 +1109,7 @@ future<> row_cache::do_update(external_updater eu, replica::memtable& m, Updater
}
future<> row_cache::update(external_updater eu, replica::memtable& m, preemption_source& preempt_src) {
m._merging_into_cache = true;
return do_update(std::move(eu), m, [this] (logalloc::allocating_section& alloc,
row_cache::partitions_type::iterator cache_i, replica::memtable_entry& mem_e, partition_presence_checker& is_present,
real_dirty_memory_accounter& acc, const partitions_type::bound_hint& hint, preemption_source& preempt_src) mutable {


@@ -24,6 +24,7 @@
#include "db/cache_tracker.hh"
#include "readers/empty_v2.hh"
#include "readers/mutation_source.hh"
#include "compaction/compaction_garbage_collector.hh"
namespace bi = boost::intrusive;
@@ -376,8 +377,9 @@ public:
tracing::trace_state_ptr trace_state = nullptr,
streamed_mutation::forwarding fwd = streamed_mutation::forwarding::no,
mutation_reader::forwarding fwd_mr = mutation_reader::forwarding::no,
const tombstone_gc_state* gc_state = nullptr) {
if (auto reader_opt = make_reader_opt(s, permit, range, slice, gc_state, std::move(trace_state), fwd, fwd_mr)) {
const tombstone_gc_state* gc_state = nullptr,
max_purgeable_fn get_max_purgeable = can_never_purge) {
if (auto reader_opt = make_reader_opt(s, permit, range, slice, gc_state, std::move(get_max_purgeable), std::move(trace_state), fwd, fwd_mr)) {
return std::move(*reader_opt);
}
[[unlikely]] return make_empty_flat_reader_v2(std::move(s), std::move(permit));
@@ -389,6 +391,7 @@ public:
const dht::partition_range&,
const query::partition_slice&,
const tombstone_gc_state*,
max_purgeable_fn get_max_purgeable,
tracing::trace_state_ptr trace_state = nullptr,
streamed_mutation::forwarding fwd = streamed_mutation::forwarding::no,
mutation_reader::forwarding fwd_mr = mutation_reader::forwarding::no);
@@ -396,10 +399,11 @@ public:
mutation_reader make_reader(schema_ptr s,
reader_permit permit,
const dht::partition_range& range = query::full_partition_range,
const tombstone_gc_state* gc_state = nullptr) {
const tombstone_gc_state* gc_state = nullptr,
max_purgeable_fn get_max_purgeable = can_never_purge) {
auto& full_slice = s->full_slice();
return make_reader(std::move(s), std::move(permit), range, full_slice, nullptr,
streamed_mutation::forwarding::no, mutation_reader::forwarding::no, gc_state);
streamed_mutation::forwarding::no, mutation_reader::forwarding::no, gc_state, std::move(get_max_purgeable));
}
// Only reads what is in the cache, doesn't populate.


@@ -1837,8 +1837,16 @@ schema_ptr schema::make_reversed() const {
}
schema_ptr schema::get_reversed() const {
return local_schema_registry().get_or_load(reversed(_raw._version), [this] (table_schema_version) {
return frozen_schema(make_reversed());
return local_schema_registry().get_or_load(reversed(_raw._version), [this] (table_schema_version) -> base_and_view_schemas {
auto s = make_reversed();
if (s->is_view()) {
if (!s->view_info()->base_info()) {
on_internal_error(dblog, format("Tried to make a reverse schema for view {}.{} with an uninitialized base info", s->ks_name(), s->cf_name()));
}
return {frozen_schema(s), s->view_info()->base_info()->base_schema()};
}
return {frozen_schema(s)};
});
}


@@ -169,8 +169,11 @@ void schema_registry::clear() {
_entries.clear();
}
schema_ptr schema_registry_entry::load(frozen_schema fs) {
_frozen_schema = std::move(fs);
schema_ptr schema_registry_entry::load(base_and_view_schemas fs) {
_frozen_schema = std::move(fs.schema);
if (fs.base_schema) {
_base_schema = std::move(fs.base_schema);
}
auto s = get_schema();
if (_state == state::LOADING) {
_schema_promise.set_value(s);
@@ -183,6 +186,9 @@ schema_ptr schema_registry_entry::load(frozen_schema fs) {
schema_ptr schema_registry_entry::load(schema_ptr s) {
_frozen_schema = frozen_schema(s);
if (s->is_view()) {
_base_schema = s->view_info()->base_info()->base_schema();
}
_schema = &*s;
_schema->_registry_entry = this;
_erase_timer.cancel();
@@ -202,7 +208,7 @@ future<schema_ptr> schema_registry_entry::start_loading(async_schema_loader load
_state = state::LOADING;
slogger.trace("Loading {}", _version);
// Move to background.
(void)f.then_wrapped([self = shared_from_this(), this] (future<frozen_schema>&& f) {
(void)f.then_wrapped([self = shared_from_this(), this] (future<base_and_view_schemas>&& f) {
_loader = {};
if (_state != state::LOADING) {
slogger.trace("Loading of {} aborted", _version);
@@ -231,6 +237,10 @@ schema_ptr schema_registry_entry::get_schema() {
if (s->version() != _version) {
throw std::runtime_error(format("Unfrozen schema version doesn't match entry version ({}): {}", _version, *s));
}
if (s->is_view()) {
// We may encounter a no_such_column_family here, which means that the base table was deleted and we should fail the request
s->view_info()->set_base_info(s->view_info()->make_base_dependent_view_info(**_base_schema));
}
_erase_timer.cancel();
s->_registry_entry = this;
_schema = &*s;
@@ -251,6 +261,10 @@ frozen_schema schema_registry_entry::frozen() const {
return *_frozen_schema;
}
void schema_registry_entry::update_base_schema(schema_ptr s) {
_base_schema = s;
}
future<> schema_registry_entry::maybe_sync(std::function<future<>()> syncer) {
switch (_sync_state) {
case schema_registry_entry::sync_state::SYNCED:
@@ -324,17 +338,17 @@ schema_ptr global_schema_ptr::get() const {
if (this_shard_id() == _cpu_of_origin) {
return _ptr;
} else {
auto registered_schema = [](const schema_registry_entry& e) {
auto registered_schema = [](const schema_registry_entry& e, std::optional<schema_ptr> base_schema = std::nullopt) -> schema_ptr {
schema_ptr ret = local_schema_registry().get_or_null(e.version());
if (!ret) {
ret = local_schema_registry().get_or_load(e.version(), [&e](table_schema_version) {
return e.frozen();
ret = local_schema_registry().get_or_load(e.version(), [&e, &base_schema](table_schema_version) -> base_and_view_schemas {
return {e.frozen(), base_schema};
});
}
return ret;
};
schema_ptr registered_bs;
std::optional<schema_ptr> registered_bs;
// the following code contains registry entry dereference of a foreign shard
// however, it is guaranteed to succeed since we made sure in the constructor
// that _bs_schema and _ptr will have a registry on the foreign shard where this
@@ -343,16 +357,10 @@ schema_ptr global_schema_ptr::get() const {
if (_base_schema) {
registered_bs = registered_schema(*_base_schema->registry_entry());
if (_base_schema->registry_entry()->is_synced()) {
registered_bs->registry_entry()->mark_synced();
}
}
schema_ptr s = registered_schema(*_ptr->registry_entry());
if (s->is_view()) {
if (!s->view_info()->base_info()) {
// we know that registered_bs is valid here because we make sure of it in the constructors.
s->view_info()->set_base_info(s->view_info()->make_base_dependent_view_info(*registered_bs));
registered_bs.value()->registry_entry()->mark_synced();
}
}
schema_ptr s = registered_schema(*_ptr->registry_entry(), registered_bs);
if (_ptr->registry_entry()->is_synced()) {
s->registry_entry()->mark_synced();
}
@@ -369,25 +377,22 @@ global_schema_ptr::global_schema_ptr(const schema_ptr& ptr)
if (e) {
return s;
} else {
return local_schema_registry().get_or_load(s->version(), [&s] (table_schema_version) {
return frozen_schema(s);
return local_schema_registry().get_or_load(s->version(), [&s] (table_schema_version) -> base_and_view_schemas {
if (s->is_view()) {
if (!s->view_info()->base_info()) {
on_internal_error(slogger, format("Tried to build a global schema for view {}.{} with an uninitialized base info", s->ks_name(), s->cf_name()));
}
return {frozen_schema(s), s->view_info()->base_info()->base_schema()};
} else {
return {frozen_schema(s)};
}
});
}
};
schema_ptr s = ensure_registry_entry(ptr);
if (s->is_view()) {
if (s->view_info()->base_info()) {
_base_schema = ensure_registry_entry(s->view_info()->base_info()->base_schema());
} else if (ptr->view_info()->base_info()) {
_base_schema = ensure_registry_entry(ptr->view_info()->base_info()->base_schema());
} else {
on_internal_error(slogger, format("Tried to build a global schema for view {}.{} with an uninitialized base info", s->ks_name(), s->cf_name()));
}
if (!s->view_info()->base_info() || !s->view_info()->base_info()->base_schema()->registry_entry()) {
s->view_info()->set_base_info(s->view_info()->make_base_dependent_view_info(*_base_schema));
}
_base_schema = ensure_registry_entry(s->view_info()->base_info()->base_schema());
}
_ptr = s;
}


@@ -22,8 +22,13 @@ class schema_ctxt;
class schema_registry;
using async_schema_loader = std::function<future<frozen_schema>(table_schema_version)>;
using schema_loader = std::function<frozen_schema(table_schema_version)>;
struct base_and_view_schemas {
frozen_schema schema;
std::optional<schema_ptr> base_schema;
};
using async_schema_loader = std::function<future<base_and_view_schemas>(table_schema_version)>;
using schema_loader = std::function<base_and_view_schemas(table_schema_version)>;
class schema_version_not_found : public std::runtime_error {
public:
@@ -61,6 +66,8 @@ class schema_registry_entry : public enable_lw_shared_from_this<schema_registry_
shared_promise<schema_ptr> _schema_promise; // valid when state == LOADING
std::optional<frozen_schema> _frozen_schema; // engaged when state == LOADED
std::optional<schema_ptr> _base_schema; // engaged when state == LOADED for view schemas
// valid when state == LOADED
// This is != nullptr when there is an alive schema_ptr associated with this entry.
const ::schema* _schema = nullptr;
@@ -77,7 +84,7 @@ public:
schema_registry_entry(schema_registry_entry&&) = delete;
schema_registry_entry(const schema_registry_entry&) = delete;
~schema_registry_entry();
schema_ptr load(frozen_schema);
schema_ptr load(base_and_view_schemas);
schema_ptr load(schema_ptr);
future<schema_ptr> start_loading(async_schema_loader);
schema_ptr get_schema(); // call only when state >= LOADED
@@ -87,6 +94,9 @@ public:
future<> maybe_sync(std::function<future<>()> sync);
// Marks this schema version as synced. Syncing cannot be in progress.
void mark_synced();
// Updates the frozen base schema for a view; should be called when updating the base info.
// Not needed when the base info is set for the first time, since at that point the schema is not yet in the registry.
void update_base_schema(schema_ptr);
// Can be called from other shards
frozen_schema frozen() const;
// Can be called from other shards
@@ -108,6 +118,7 @@ public:
// alive the registry will keep its entry. To ensure remote nodes can query current node
// for schema version, make sure that schema_ptr for the request is alive around the call.
//
// Schemas of views returned by this registry always have base_info set.
class schema_registry {
std::unordered_map<table_schema_version, lw_shared_ptr<schema_registry_entry>> _entries;
std::unique_ptr<db::schema_ctxt> _ctxt;
@@ -125,6 +136,7 @@ public:
void init(const db::schema_ctxt&);
// Looks up schema by version or loads it using supplied loader.
// If the schema refers to a view, the loader must return both view and base schemas.
schema_ptr get_or_load(table_schema_version, const schema_loader&);
// Looks up schema by version or returns an empty pointer if not available.
@@ -134,6 +146,7 @@ public:
// deferring. The loader is copied and must be alive only until this method
// returns. If the loader fails, the future resolves with
// schema_version_loading_failed.
// If the schema refers to a view, the loader must return both view and base schemas.
future<schema_ptr> get_or_load(table_schema_version, const async_schema_loader&);
// Looks up schema version. Throws schema_version_not_found when not found
@@ -149,6 +162,7 @@ public:
// the schema which was passed as argument.
// The schema instance pointed to by the argument will be attached to the registry
// entry and will keep it alive.
// If the schema refers to a view, it must have base_info set.
schema_ptr learn(const schema_ptr&);
// Removes all entries from the registry. This in turn removes all dependencies


@@ -5012,8 +5012,10 @@ class scylla_small_objects(gdb.Command):
[2019] 0x635002ecbc60
"""
class small_object_iterator():
def __init__(self, small_pool, resolve_symbols):
self._small_pool = small_pool
def __init__(self, small_pools, resolve_symbols):
self._small_pools = small_pools
self._small_pool_addresses = [small_pool.address for small_pool in small_pools]
self._object_size = int(small_pools[0]['_object_size'])
self._resolve_symbols = resolve_symbols
self._text_ranges = get_text_ranges()
@@ -5023,8 +5025,9 @@ class scylla_small_objects(gdb.Command):
self._free_in_pool = set()
self._free_in_span = set()
pool_next_free = self._small_pool['_free']
while pool_next_free:
for small_pool in self._small_pools:
pool_next_free = small_pool['_free']
while pool_next_free:
self._free_in_pool.add(int(pool_next_free))
pool_next_free = pool_next_free.reinterpret_cast(self._free_object_ptr).dereference()
@@ -5035,7 +5038,7 @@ class scylla_small_objects(gdb.Command):
# Let any StopIteration bubble up, as it signals we are done with
# all spans.
span = next(self._span_it)
while span.pool() != self._small_pool.address:
while span.pool() not in self._small_pool_addresses:
span = next(self._span_it)
self._free_in_span = set()
@@ -5058,7 +5061,7 @@ class scylla_small_objects(gdb.Command):
pass
span_start, span_end = self._next_span()
self._obj_it = iter(range(span_start, span_end, int(self._small_pool['_object_size'])))
self._obj_it = iter(range(span_start, span_end, int(self._object_size)))
return next(self._obj_it)
def __next__(self):
@@ -5094,16 +5097,14 @@ class scylla_small_objects(gdb.Command):
return [int(small_pools['_u']['a'][i]['_object_size']) for i in range(nr)]
@staticmethod
def find_small_pool(object_size):
def find_small_pools(object_size):
cpu_mem = gdb.parse_and_eval('\'seastar::memory::cpu_mem\'')
small_pools = cpu_mem['small_pools']
small_pools_a = small_pools['_u']['a']
nr = int(small_pools['nr_small_pools'])
for i in range(nr):
sp = small_pools['_u']['a'][i]
if object_size == int(sp['_object_size']):
return sp
return None
return [small_pools_a[i]
for i in range(nr)
if int(small_pools_a[i]['_object_size']) == object_size]
def init_parser(self):
parser = argparse.ArgumentParser(description="scylla small-objects")
@@ -5120,10 +5121,10 @@ class scylla_small_objects(gdb.Command):
self._parser = parser
def get_objects(self, small_pool, offset=0, count=0, resolve_symbols=False, verbose=False):
if self._last_object_size != int(small_pool['_object_size']) or offset < self._last_pos:
def get_objects(self, small_pools, offset=0, count=0, resolve_symbols=False, verbose=False):
if self._last_object_size != int(small_pools[0]['_object_size']) or offset < self._last_pos:
self._last_pos = 0
self._iterator = scylla_small_objects.small_object_iterator(small_pool, resolve_symbols)
self._iterator = scylla_small_objects.small_object_iterator(small_pools, resolve_symbols)
skip = offset - self._last_pos
if verbose:
@@ -5153,15 +5154,15 @@ class scylla_small_objects(gdb.Command):
except SystemExit:
return
small_pool = scylla_small_objects.find_small_pool(args.object_size)
if small_pool is None:
small_pools = scylla_small_objects.find_small_pools(args.object_size)
if not small_pools:
raise ValueError("{} is not a valid object size for any small pool, valid object sizes are: {}".format(args.object_size, scylla_small_objects.get_object_sizes()))
if args.summarize:
if self._last_object_size != args.object_size:
if args.verbose:
gdb.write("Object size changed ({} -> {}), scanning pool.\n".format(self._last_object_size, args.object_size))
self._num_objects = len(self.get_objects(small_pool, verbose=args.verbose))
self._num_objects = len(self.get_objects(small_pools, verbose=args.verbose))
self._last_object_size = args.object_size
gdb.write("number of objects: {}\n"
"page size : {}\n"
@@ -5176,7 +5177,7 @@ class scylla_small_objects(gdb.Command):
if self._last_object_size != args.object_size:
if args.verbose:
gdb.write("Object size changed ({} -> {}), scanning pool.\n".format(self._last_object_size, args.object_size))
self._num_objects = len(self.get_objects(small_pool, verbose=args.verbose))
self._num_objects = len(self.get_objects(small_pools, verbose=args.verbose))
self._last_object_size = args.object_size
page = random.randint(0, int(self._num_objects / args.page_size) - 1)
else:
@@ -5184,7 +5185,7 @@ class scylla_small_objects(gdb.Command):
offset = page * args.page_size
gdb.write("page {}: {}-{}\n".format(page, offset, offset + args.page_size - 1))
for i, (obj, sym) in enumerate(self.get_objects(small_pool, offset, args.page_size, resolve_symbols=True, verbose=args.verbose)):
for i, (obj, sym) in enumerate(self.get_objects(small_pools, offset, args.page_size, resolve_symbols=True, verbose=args.verbose)):
if sym is None:
sym_text = ""
else:

Submodule seastar updated: ec5da7a606...b383afb0d0


@@ -243,27 +243,33 @@ future<> service::client_state::maybe_update_per_service_level_params() {
if (!slo_opt) {
co_return;
}
auto slo_timeout_or = [&] (const lowres_clock::duration& default_timeout) {
return std::visit(overloaded_functor{
[&] (const qos::service_level_options::unset_marker&) -> lowres_clock::duration {
return default_timeout;
},
[&] (const qos::service_level_options::delete_marker&) -> lowres_clock::duration {
return default_timeout;
},
[&] (const lowres_clock::duration& d) -> lowres_clock::duration {
return d;
},
}, slo_opt->timeout);
};
_timeout_config.read_timeout = slo_timeout_or(_default_timeout_config.read_timeout);
_timeout_config.write_timeout = slo_timeout_or(_default_timeout_config.write_timeout);
_timeout_config.range_read_timeout = slo_timeout_or(_default_timeout_config.range_read_timeout);
_timeout_config.counter_write_timeout = slo_timeout_or(_default_timeout_config.counter_write_timeout);
_timeout_config.truncate_timeout = slo_timeout_or(_default_timeout_config.truncate_timeout);
_timeout_config.cas_timeout = slo_timeout_or(_default_timeout_config.cas_timeout);
_timeout_config.other_timeout = slo_timeout_or(_default_timeout_config.other_timeout);
_workload_type = slo_opt->workload;
update_per_service_level_params(*slo_opt);
}
}
void service::client_state::update_per_service_level_params(qos::service_level_options& slo) {
auto slo_timeout_or = [&] (const lowres_clock::duration& default_timeout) {
return std::visit(overloaded_functor{
[&] (const qos::service_level_options::unset_marker&) -> lowres_clock::duration {
return default_timeout;
},
[&] (const qos::service_level_options::delete_marker&) -> lowres_clock::duration {
return default_timeout;
},
[&] (const lowres_clock::duration& d) -> lowres_clock::duration {
return d;
},
}, slo.timeout);
};
_timeout_config.read_timeout = slo_timeout_or(_default_timeout_config.read_timeout);
_timeout_config.write_timeout = slo_timeout_or(_default_timeout_config.write_timeout);
_timeout_config.range_read_timeout = slo_timeout_or(_default_timeout_config.range_read_timeout);
_timeout_config.counter_write_timeout = slo_timeout_or(_default_timeout_config.counter_write_timeout);
_timeout_config.truncate_timeout = slo_timeout_or(_default_timeout_config.truncate_timeout);
_timeout_config.cas_timeout = slo_timeout_or(_default_timeout_config.cas_timeout);
_timeout_config.other_timeout = slo_timeout_or(_default_timeout_config.other_timeout);
_workload_type = slo.workload;
}


@@ -346,6 +346,7 @@ public:
future<bool> check_has_permission(auth::command_desc) const;
future<> ensure_has_permission(auth::command_desc) const;
future<> maybe_update_per_service_level_params();
void update_per_service_level_params(qos::service_level_options& slo);
/**
* Returns an exceptional future with \ref exceptions::invalid_request_exception if the resource does not exist.


@@ -577,7 +577,7 @@ future<query::mapreduce_result> mapreduce_service::dispatch(query::mapreduce_req
co_await coroutine::parallel_for_each(vnodes_per_addr.begin(), vnodes_per_addr.end(),
[&] (std::pair<const netw::messaging_service::msg_addr, dht::partition_range_vector>& vnodes_with_addr) -> future<> {
netw::messaging_service::msg_addr addr = vnodes_with_addr.first;
-query::mapreduce_result& result_ = result;
+query::mapreduce_result& shared_accumulator = result;
tracing::trace_state_ptr& tr_state_ = tr_state;
retrying_dispatcher& dispatcher_ = dispatcher;
@@ -598,9 +598,21 @@ future<query::mapreduce_result> mapreduce_service::dispatch(query::mapreduce_req
flogger.debug("received mapreduce_result={} from {}", partial_printer, addr);
auto aggrs = mapreduce_aggregates(req);
co_return co_await aggrs.with_thread_if_needed([&result_, &aggrs, partial_result = std::move(partial_result)] () mutable {
aggrs.merge(result_, std::move(partial_result));
});
// Anytime this coroutine yields, other coroutines may want to write to `shared_accumulator`.
// As merging can yield internally, merging directly to `shared_accumulator` would result in race condition.
// We can safely write to `shared_accumulator` only when it is empty.
while (!shared_accumulator.query_results.empty()) {
// Move `shared_accumulator` content to local variable. Leave `shared_accumulator` empty - now other coroutines can safely write to it.
query::mapreduce_result previous_results = std::exchange(shared_accumulator, {});
// Merge two local variables - it can yield.
co_await aggrs.with_thread_if_needed([&previous_results, &aggrs, &partial_result] () mutable {
aggrs.merge(partial_result, std::move(previous_results));
});
// `partial_result` now contains results merged by this coroutine, but `shared_accumulator` might have been updated by others.
}
// `shared_accumulator` is empty, we can atomically write results merged by this coroutine.
shared_accumulator = std::move(partial_result);
});
mapreduce_aggregates aggrs(req);


@@ -1053,29 +1053,20 @@ static future<schema_ptr> get_schema_definition(table_schema_version v, netw::me
// with TTL (refresh TTL in case column mapping already existed prior to that).
auto us = s.unfreeze(db::schema_ctxt(proxy));
// if this is a view - sanity check that its schema doesn't need fixing.
schema_ptr base_schema;
if (us->is_view()) {
auto& db = proxy.local().local_db();
-schema_ptr base_schema = db.find_schema(us->view_info()->base_id());
+base_schema = db.find_schema(us->view_info()->base_id());
db::schema_tables::check_no_legacy_secondary_index_mv_schema(db, view_ptr(us), base_schema);
}
return db::schema_tables::store_column_mapping(proxy, us, true).then([us] {
return frozen_schema{us};
return db::schema_tables::store_column_mapping(proxy, us, true).then([us, base_schema] -> base_and_view_schemas {
if (us->is_view()) {
return {frozen_schema(us), base_schema};
} else {
return {frozen_schema(us)};
}
});
});
}).then([&storage_proxy] (schema_ptr s) {
// If this is a view so this schema also needs a reference to the base
// table.
if (s->is_view()) {
if (!s->view_info()->base_info()) {
auto& db = storage_proxy.local_db();
// This line might throw a no_such_column_family
// It should be fine since if we tried to register a view for which
// we don't know the base table, our registry is broken.
schema_ptr base_schema = db.find_schema(s->view_info()->base_id());
s->view_info()->set_base_info(s->view_info()->make_base_dependent_view_info(*base_schema));
}
}
return s;
});
}
@@ -1099,6 +1090,8 @@ future<schema_ptr> migration_manager::get_schema_for_write(table_schema_version
}
if (!s) {
// The schema returned by get_schema_definition comes (eventually) from the schema registry,
// so if it is a view, it already has base info and we don't need to set it later
s = co_await get_schema_definition(v, dst, ms, _storage_proxy);
}
@@ -1111,23 +1104,6 @@ future<schema_ptr> migration_manager::get_schema_for_write(table_schema_version
co_await maybe_sync(s, dst);
}
}
// here s is guaranteed to be valid and synced
if (s->is_view() && !s->view_info()->base_info()) {
// The way to get here is if the view schema was deactivated
// and reactivated again, or if we loaded it from the schema
// history.
auto& db = _storage_proxy.local_db();
// This line might throw a no_such_column_family but
// that is fine, if the schema is synced, it means that if
// we failed to get the base table, we learned about the base
// table not existing (which means that the view also doesn't exist
// any more), which means that this schema is actually useless for either
// read or write so we better throw than return an incomplete useless
// schema
schema_ptr base_schema = db.find_schema(s->view_info()->base_id());
s->view_info()->set_base_info(s->view_info()->make_base_dependent_view_info(*base_schema));
}
co_return s;
}


@@ -383,18 +383,14 @@ void query_pager::handle_result(
auto view = query::result_view(*results);
_last_pos = position_in_partition::for_partition_start();
-uint64_t row_count;
+uint64_t replica_row_count, row_count;
if constexpr(!std::is_same_v<std::decay_t<Visitor>, noop_visitor>) {
query_result_visitor<Visitor> v(std::forward<Visitor>(visitor));
view.consume(_cmd->slice, v);
if (_last_pkey) {
update_slice(*_last_pkey);
}
row_count = v.total_rows - v.dropped_rows;
_max = _max - row_count;
_exhausted = (v.total_rows < page_size && !results->is_short_read() && v.dropped_rows == 0) || _max == 0;
replica_row_count = v.total_rows;
// If per partition limit is defined, we need to accumulate rows fetched for last partition key if the key matches
if (_cmd->slice.partition_row_limit() < query::max_rows_if_set) {
if (_last_pkey && v.last_pkey && _last_pkey->equal(*_query_schema, *v.last_pkey)) {
@@ -403,32 +399,30 @@ void query_pager::handle_result(
_rows_fetched_for_last_partition = v.last_partition_row_count;
}
}
const auto& last_pos = results->last_position();
if (last_pos && !v.dropped_rows) {
_last_pkey = last_pos->partition;
_last_pos = last_pos->position;
} else {
_last_pkey = v.last_pkey;
if (v.last_ckey) {
_last_pos = position_in_partition::for_key(*v.last_ckey);
}
}
} else {
row_count = results->row_count() ? *results->row_count() : std::get<1>(view.count_partitions_and_rows());
_max = _max - row_count;
_exhausted = (row_count < page_size && !results->is_short_read()) || _max == 0;
replica_row_count = row_count;
}
if (!_exhausted) {
if (_last_pkey) {
update_slice(*_last_pkey);
}
{
_max = _max - row_count;
_exhausted = (replica_row_count < page_size && !results->is_short_read()) || _max == 0;
if (_last_pkey) {
update_slice(*_last_pkey);
}
// The last page can be truly empty -- with unset last-position and no data to calculate it based on.
if (!replica_row_count && !results->is_short_read()) {
_last_pkey = {};
} else {
auto last_pos = results->get_or_calculate_last_position();
_last_pkey = std::move(last_pos.partition);
_last_pos = std::move(last_pos.position);
}
}
-qlogger.debug("Fetched {} rows, max_remain={} {}", row_count, _max, _exhausted ? "(exh)" : "");
+qlogger.debug("Fetched {} rows (kept {}), max_remain={} {}", replica_row_count, row_count, _max, _exhausted ? "(exh)" : "");
if (_last_pkey) {
qlogger.debug("Last partition key: {}", *_last_pkey);

Some files were not shown because too many files have changed in this diff.