Compare commits

...

1303 Commits

Author SHA1 Message Date
Łukasz Paszkowski
98a6002a1a compaction_manager: cancel submission timer on drain
The `drain` method, cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.

This, however, throws an assertion error as the submission time is
not cancelled and re-enabling the manager tries to arm the armed timer.

Thus, cancel the timer, when calling the drain method to disable
the compaction manager.

Fixes https://github.com/scylladb/scylladb/issues/24504

All versions are affected. So it's a good candidate for a backport.

Closes scylladb/scylladb#24505

(cherry picked from commit a9a53d9178)

Closes scylladb/scylladb#24585
2025-06-29 14:40:43 +03:00
Avi Kivity
9f1ed14f9d Merge '[Backport 6.2] cql: create default superuser if it doesn't exist' from Marcin Maliszkiewicz
Backport of https://github.com/scylladb/scylladb/pull/20137 is needed for another backport https://github.com/scylladb/scylladb/pull/24690 to work correctly.

Fixes https://github.com/scylladb/scylladb/issues/24712

Closes scylladb/scylladb#24711

* github.com:scylladb/scylladb:
  test: test_restart_cluster: create the test
  auth: standard_role_manager allows awaiting superuser creation
  auth: coroutinize the standard_role_manager start() function
  auth: don't start server until the superuser is created
2025-06-29 14:34:55 +03:00
Aleksandra Martyniuk
8321747451 test: rest_api: fix test_repair_task_progress
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.

Wait until at least one child of a root repair task is created.

Fixes: #24556.

Closes scylladb/scylladb#24560

(cherry picked from commit 0deb9209a0)

Closes scylladb/scylladb#24652
2025-06-28 09:40:37 +03:00
Paweł Zakrzewski
3bd3d720e0 test: test_restart_cluster: create the test
The purpose of this test that the cluster is able to boot up again after
a full cluster shutdown, thus exhibiting no issues when connecting to
raft group 0 that is larger than one.

(cherry picked from commit 900a6706b8)
2025-06-27 17:50:15 +02:00
Paweł Zakrzewski
3d6d3484b5 auth: standard_role_manager allows awaiting superuser creation
This change implements the ability to await superuser creation in the
function ensure_superuser_is_created(). This means that Scylla will not
be serving CQL connections until the superuser is created.

Fixes #10481

(cherry picked from commit 7008b71acc)
2025-06-27 17:50:08 +02:00
Paweł Zakrzewski
db1d3cc342 auth: coroutinize the standard_role_manager start() function
This change is a preparation for the next change. Moving to coroutines
makes the code more readable and easier to process.

(cherry picked from commit 04fc82620b)
2025-06-27 17:50:01 +02:00
Paweł Zakrzewski
66301eb2b6 auth: don't start server until the superuser is created
This change reorganizes the way standard_role_manager startup is
handled: now the future returned by its start() function can be used to
determine when startup has finished. We use this future to ensure the
startup is finished prior to starting the CQL server.

Some clusters are created without auth, and auth is added later. The
first node to recognize that auth is needed must create the superuser.
Currently this is always on restart, but if we were to ever make it
LiveUpdate then it would not be on restart.

This suggests that we don't really need to wait during restart.

This is a preparatory commit, laying ground for implementation of a
start() function that waits for the superuser to be created. The default
implementation returns a ready future, which makes no change in the code
behavior.

(cherry picked from commit f525d4b0c1)
2025-06-27 17:49:33 +02:00
Pavel Emelyanov
e5ac2285c0 Merge '[Backport 6.2] memtable: ensure _flushed_memory doesn't grow above total_memory' from Scylladb[bot]
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

Fixes scylladb/scylladb#21413

Backport candidate because it fixes a crash that can happen in existing stable branches.

- (cherry picked from commit 7d551f99be)

- (cherry picked from commit 975e7e405a)

Parent PR: #21638

Closes scylladb/scylladb#24601

* github.com:scylladb/scylladb:
  memtable: ensure _flushed_memory doesn't grow above total memory usage
  replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
2025-06-24 10:12:41 +03:00
Michał Chojnowski
1c23edad22 memtable: ensure _flushed_memory doesn't grow above total memory usage
dirty_memory_manager tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

dirty_memory_manager controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in ~flush_memory_accounter which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the mutation_cleaner later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in mutation_cleaner intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

(cherry picked from commit 975e7e405a)
2025-06-22 17:37:27 +00:00
Michał Chojnowski
40d1186218 replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).

Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.

But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.

But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.

To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want to only the memtable listen
for the notifications, and pass them already modified accordingl
to the manager, so there is no intermediate wrong states.

This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.

(cherry picked from commit 7d551f99be)
2025-06-22 17:37:27 +00:00
Pavel Emelyanov
60cb1c0b2f sstable_directory: Print ks.cf when moving unshared remove sstables
When an sstable is identified by sstable_directory as remote-unshared,
it will at some point be moved to the target shard. When it happens a
log-message appears:

    sstable_directory - Moving 1 unshared SSTables to shard 1

Processing of tables by sstable_directory often happens in parallel, and
messages from sstable_directory are intermixed. Having a message like
above is not very informative, as it tells nothing about sstables that
are being moved.

Equip the message with ks:cf pair to make it more informative.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23912

(cherry picked from commit d40d6801b0)

Closes scylladb/scylladb#24014
2025-06-17 18:24:13 +03:00
Pavel Emelyanov
a00b4a027e Update seastar submodule (no nested stall backtraces)
* seastar e40388c4...b383afb0 (1):
  > stall_detector: no backtrace if exception

Fixes #24464

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24536
2025-06-17 18:23:06 +03:00
Nikos Dragazis
8ef55444f5 sstables: Fix race when loading checksum component
`read_checksum()` loads the checksum component from disk and stores a
non-owning reference in the shareable components. To avoid loading the
same component twice, the function has an early return statement.
However, this does not guarantee atomicity - two fibers or threads may
load the component and update the shareable components concurrently.
This can lead to use-after-free situations when accessing the component
through the shareable components, since the reference stored there is
non-owning. This can happen when multiple compaction tasks run on the
same SSTable (e.g., regular compaction and scrub-validate).

Fix this by not updating the reference in shareable components, if a
reference is already in place. Instead, create an owning reference to
the existing component for the current fiber. This is less efficient
than using a mutex, since the component may be loaded multiple times
from disk before noticing the race, but no locks are used for any other
SSTable component either. Also, this affects uncompressed SSTables,
which are not that common.

Fixes #23728.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#23872

(cherry picked from commit eaa2ce1bb5)

Closes scylladb/scylladb#24267
2025-06-17 13:20:30 +03:00
Szymon Malewski
8eebc97ae1 mapreduce_service: Prevent race condition
In parallelized aggregation functions super-coordinator (node performing final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`).
Usually responses are spread over time and actual merging is atomic.
However sometimes partial results are received at the similar time and if an aggregate function (e.g. lua script) yields, two coroutines can try to overwrite the same accumulator one after another,
which leads to losing some of the results.
To prevent this, in this patch each coroutine stores merging results in its own context and overwrites accumulator atomically, only after it was fully merged.
Comparing to the previous implementation order of operands in merging function is swapped, but the order of aggregation is not guaranteed anyway.

Fixes #20662

Closes scylladb/scylladb#24106

(cherry picked from commit 5969809607)

Closes scylladb/scylladb#24387
2025-06-17 13:20:12 +03:00
Pavel Emelyanov
884293a382 Merge '[Backport 6.2] tablets: deallocate storage state on end_migration' from Scylladb[bot]
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it's already cleaned up
5. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes https://github.com/scylladb/scylladb/issues/23481

backport to all relevant releases since it's a bug that results in a crash

- (cherry picked from commit 34f15ca871)

- (cherry picked from commit fb18fc0505)

- (cherry picked from commit bd88ca92c8)

Parent PR: #24393

Closes scylladb/scylladb#24486

* github.com:scylladb/scylladb:
  test/cluster/test_tablets: test restart during tablet cleanup
  test: tablets: add get_tablet_info helper
  tablets: deallocate storage state on end_migration
2025-06-17 13:19:49 +03:00
Michael Litvak
06e5a48d17 test/cluster/test_tablets: test restart during tablet cleanup
Add a test that reproduces issue scylladb/scylladb#23481.

The test migrates a tablet from one node to another, and while the
tablet is in some stage of cleanup - either before or right after,
depending on the parameter - the leaving replica, on which the tablet is
cleaned, is restarted.

This is interesting because when the leaving replica starts and loads
its state, the tablet could be in different stages of cleanup - the
SSTables may still exist or they may have been cleaned up already, and
we want to make sure the state is loaded correctly.

(cherry picked from commit bd88ca92c8)
2025-06-12 14:33:07 +03:00
Michael Litvak
c5d11205c1 test: tablets: add get_tablet_info helper
Add a helper for tests to get the tablet info from system.tablets for a
tablet owning a given token.

(cherry picked from commit fb18fc0505)
2025-06-12 03:24:00 +00:00
Michael Litvak
0b46cfb60d tablets: deallocate storage state on end_migration
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it was already cleaned up
4. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes scylladb/scylladb#23481

(cherry picked from commit 34f15ca871)
2025-06-12 03:24:00 +00:00
Michał Chojnowski
4a18a08284 utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()
`chunked_managed_vector` is a vector-like container which splits
its contents into multiple contiguous allocations if necessary,
in order to fit within LSA's max preferred contiguous allocation
limits.

Each limited-size chunk is stored in a `managed_vector`.
`managed_vector` is unaware of LSA's size limits.
It's up to the user of `managed_vector` to pick a size which
is small enough.

This happens in `chunked_managed_vector::max_chunk_capacity()`.
But the calculation is wrong, because it doesn't account for
the fact that `managed_vector` has to place some metadata
(the backreference pointer) inside the allocation.
In effect, the chunks allocated by `chunked_managed_vector`
are just a tiny bit larger than the limit, and the limit is violated.

Fix this by accounting for the metadata.

Also, before the patch `chunked_managed_vector::max_contiguous_allocation`,
repeats the definition of logalloc::max_managed_object_size.
This is begging for a bug if `logalloc::max_managed_object_size`
changes one day. Adjust it so that `chunked_managed_vector` looks
directly at `logalloc::max_managed_object_size`, as it means to.

Fixes scylladb/scylladb#23854

(cherry picked from commit 7f9152babc)

Closes scylladb/scylladb#24369
2025-06-03 18:10:36 +03:00
Piotr Dulikowski
1571948cd7 topology_coordinator: silence ERROR messages on abort
When the topology coordinator is shut down while doing a long-running
operation, the current operation might throw a raft::request_aborted
exception. This is not a critical issue and should not be logged with
ERROR verbosity level.

Make sure that all the try..catch blocks in the topology coordinator
which:

- May try to acquire a new group0 guard in the `try` part
- Have a `catch (...)` block that print an ERROR-level message

...have a pass-through `catch (raft::request_aborted&)` block which does
not log the exception.

Fixes: scylladb/scylladb#22649

Closes scylladb/scylladb#23962

(cherry picked from commit 156ff8798b)

Closes scylladb/scylladb#24074
2025-05-13 20:41:41 +03:00
Pavel Emelyanov
5a82f0d217 Merge '[Backport 6.2] replica: skip flush of dropped table' from Scylladb[bot]
Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of dropped table instead.

Fixes: #16095.

Needs backport to 2025.1 and 6.2 as they contain the bug

- (cherry picked from commit 91b57e79f3)

- (cherry picked from commit c1618c7de5)

Parent PR: #23876

Closes scylladb/scylladb#23904

* github.com:scylladb/scylladb:
  test: test table drop during flush
  replica: skip flush of dropped table
2025-05-13 14:00:38 +03:00
Aleksandra Martyniuk
874b4f8d9c streaming: skip dropped tables
Currently, stream_session::prepare throws when a table in requests
or summaries is dropped. However, we do not want to fail streaming
if the table is dropped.

Delete table checks from stream_session::prepare. Further streaming
steps can handle the dropped table and finish the streaming successfully.

Fixes: #15257.

Closes scylladb/scylladb#23915

(cherry picked from commit 20c2d6210e)

Closes scylladb/scylladb#24050
2025-05-13 13:56:46 +03:00
Aleksandra Martyniuk
91a1acc314 test: test table drop during flush
(cherry picked from commit c1618c7de5)
2025-05-09 09:27:08 +02:00
Aleksandra Martyniuk
720bd681f0 replica: skip flush of dropped table
(cherry picked from commit 91b57e79f3)
2025-05-09 09:26:44 +02:00
Piotr Dulikowski
b3bc3489dd utils::loading_cache: gracefully skip timer if gate closed
The loading_cache has a periodic timer which acquires the
_timer_reads_gate. The stop() method first closes the gate and then
cancels the timer - this order is necessary because the timer is
re-armed under the gate. However, the timer callback does not check
whether the gate was closed but tries to acquire it, which might result
in unhandled exception which is logged with ERROR severity.

Fix the timer callback by acquiring access to the gate at the beginning
and gracefully returning if the gate is closed. Even though the gate
used to be entered in the middle of the callback, it does not make sense
to execute the timer's logic at all if the cache is being stopped.

Fixes: scylladb/scylladb#23951

Closes scylladb/scylladb#23952

(cherry picked from commit 8ffe4b0308)

Closes scylladb/scylladb#23980
2025-05-06 10:19:32 +02:00
Botond Dénes
e4c6a3c068 Merge '[Backport 6.2] topology coordinator: do not proceed further on invalid boostrap tokens' from Scylladb[bot]
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897

From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them.

- (cherry picked from commit 66acaa1bf8)

- (cherry picked from commit 845cedea7f)

- (cherry picked from commit 670a69007e)

Parent PR: #23914

Closes scylladb/scylladb#23948

* github.com:scylladb/scylladb:
  test: cluster: add test_bad_initial_token
  topology coordinator: do not proceed further on invalid boostrap tokens
  cdc: add sanity check for generating an empty generation
2025-05-01 08:32:13 +03:00
Botond Dénes
254f535f63 Merge '[Backport 6.2] tasks: check whether a node is alive before rpc' from Scylladb[bot]
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.

Fixes: https://github.com/scylladb/scylladb/issues/22514.

Needs backport to 2025.1 and 6.2 as they contain the bug.

- (cherry picked from commit 53e0f79947)

- (cherry picked from commit e178bd7847)

Parent PR: #23787

Closes scylladb/scylladb#23942

* github.com:scylladb/scylladb:
  test: add test for getting tasks children
  tasks: check whether a node is alive before rpc
2025-05-01 08:30:09 +03:00
Avi Kivity
c20b2ea2af Merge '[Backport 6.2] Ensure raft group0 RPCs use the gossip scheduling group.' from Scylladb[bot]
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.

For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.

Fixes scylladb/scylladb#21637

Backport: 6.2 and 6.1

- (cherry picked from commit 60f1053087)

- (cherry picked from commit e05c082002)

Parent PR: #22779

Closes scylladb/scylladb#23769

* github.com:scylladb/scylladb:
  ensure raft group0 RPCs use the gossip scheduling group
  Move RAFT operations verbs to GOSSIP group.
2025-04-30 16:45:20 +03:00
Aleksandra Martyniuk
cc37c64467 test: add test for getting tasks children
Add test that checks whether the children of a virtual task will be
properly gathered if a node is down.

(cherry picked from commit e178bd7847)
2025-04-30 10:40:32 +02:00
Aleksandra Martyniuk
47377cd5d4 tasks: check whether a node is alive before rpc
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.

(cherry picked from commit 53e0f79947)
2025-04-30 10:35:45 +02:00
Sergey Zolotukhin
5c90107c14 ensure raft group0 RPCs use the gossip scheduling group
Scylla operations use concurrency semaphores to limit the number
of concurrent operations and prevent resource exhaustion. The
semaphore is selected based on the current scheduling group.
For Raft group operations, it is essential to use a system semaphore to
avoid queuing behind user operations.
This commit adds a check to ensure that the raft group0 RPCs are
executed with the `gossiper` scheduling group.

(cherry picked from commit e05c082002)
2025-04-30 08:49:07 +02:00
Sergey Zolotukhin
612f184638 Move RAFT operations verbs to GOSSIP group.
In order for RAFT operations to use the gossip system semaphore, moving RAFT
verbs to the gossip group in `do_get_rpc_client_idx`,  messaging_service.

Fixes scylladb/scylladb21637

(cherry picked from commit 60f1053087)
2025-04-29 19:25:55 +00:00
Piotr Dulikowski
d881f3f14d test: cluster: add test_bad_initial_token
Adds a test which checks that rollback works properly in case when a bad
value of the initial_token function is provided.

(cherry picked from commit 670a69007e)
2025-04-28 17:07:23 +00:00
Piotr Dulikowski
7eda572129 topology coordinator: do not proceed further on invalid boostrap tokens
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897
(cherry picked from commit 845cedea7f)
2025-04-28 17:07:23 +00:00
Piotr Dulikowski
41e6df7407 cdc: add sanity check for generating an empty generation
It doesn't make sense to create an empty CDC generation because it does
not make sense to have a cluster with no tokens. Add a sanity check to
cdc::make_new_generation_description which fails if somebody attempts to
do that (i.e. when the set of current tokens + optionally bootstrapping
node's tokens is empty).

The function does not work correctly if it is misused, as we saw in
scylladb/scylladb#23897. While the function should not be misused in the
first place, it's better to throw an exception rather than crash -
especially that this crash could happen on the topology coordinator.

(cherry picked from commit 66acaa1bf8)
2025-04-28 17:07:22 +00:00
Tomasz Grabiec
b48d1abade Merge '[Backport 6.2] Cache base info for view schemas in the schema registry' from Scylladb[bot]
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated due to all `schema_ptr`s temporarily dying.

To fix this, this patch adds the base schema to the registry, alongside
the view schema. We store just the frozen base schema, so that we can
transfer it across shards. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.

In this series we also make sure that the view schemas in the registry are
kept up-to-date in regards to base schema changes.

Fixes https://github.com/scylladb/scylladb/issues/21354

This issue is a bug, so adding backport labels 6.1 and 6.2

- (cherry picked from commit 6f11edbf3f)

- (cherry picked from commit dfe3810f64)

- (cherry picked from commit 82f2e1b44c)

- (cherry picked from commit 3094ff7cbe)

- (cherry picked from commit 74cbc77f50)

Parent PR: #21862

Closes scylladb/scylladb#23046

* github.com:scylladb/scylladb:
  test: add test for schema registry maintaining base info for views
  schema_registry: avoid setting base info when getting the schema from registry
  schema_registry: update cached base schemas when updating a view
  schema_registry: cache base schemas for views
  db: set base info before adding schema to registry
2025-04-25 18:42:57 +02:00
Tomasz Grabiec
9190779ee3 Merge '[Backport 6.2] storage_service: preserve state of busy topology when transiting tablet' from Scylladb[bot]
Commit 876478b84f ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.

In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way.

Unit test:

Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone.

Fixes https://github.com/scylladb/scylladb/issues/20073.

Commit 876478b84f was first released in scylla-6.0.0, so we might want to backport this patch accordingly.

- (cherry picked from commit e1186f0ae6)

- (cherry picked from commit 841ca652a0)

Parent PR: #23751

Closes scylladb/scylladb#23768

* github.com:scylladb/scylladb:
  storage_service: add unit test for mid-decommission transit_tablet()
  storage_service: preserve state of busy topology when transiting tablet
2025-04-20 20:13:11 +02:00
Avi Kivity
8db020e782 Merge '[Backport 6.2] managed_bytes: in the copy constructor, respect the target preferred allocation size' from Scylladb[bot]
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).

In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.

2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

This is a regression fix, should be backported to all affected releases.

- (cherry picked from commit 4e2f62143b)

- (cherry picked from commit 6c1889f65c)

Parent PR: #23782

Closes scylladb/scylladb#23809

* github.com:scylladb/scylladb:
  managed_bytes_test: add a reproducer for #23781
  managed_bytes: in the copy constructor, respect the target preferred allocation size
2025-04-19 18:43:31 +03:00
Calle Wilund
d566010412 network_topology_strategy/alter ks: Remove dc:s from options once rf=0
Fixes #22688

If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of keyspace will fail,
because the union of new/old options will now contained an unknown keyword.

Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit dc=0 in nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy objects dc mapping, but the full replica topology info
for dc:s to consider for reallocation. Since we verify the input
on attribute processing, the amount of rf/tablets moved should still
be legal.

v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some

Closes scylladb/scylladb#22693

(cherry picked from commit 342df0b1a8)

(Update: workaround python test objects not having dc info)

Closes scylladb/scylladb#22876
2025-04-18 14:09:52 +03:00
Michał Chojnowski
ba3975858b managed_bytes_test: add a reproducer for #23781
(cherry picked from commit 6c1889f65c)
2025-04-18 07:55:23 +00:00
Michał Chojnowski
f75b0d49a0 managed_bytes: in the copy constructor, respect the target preferred allocation size
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.

But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.

But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.

In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).

In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.

Consequences of the bug:

1. Effective memory consumption of an affected cell is rounded up to the nearest
   power of 2.

2. With a pathological-enough allocation pattern
   (for example, one which somehow ends up placing a single 16 kiB
   memtable-owned allocation in every aligned 128 kiB span),
   memtable flushing could theoretically deadlock,
   because the allocator might be too fragmented to let the memtable
   grow by another 128 kiB segment, while keeping the sum of all
   allocations small enough to avoid triggering a flush.
   (Such an allocation pattern probably wouldn't happen in practice though).

3. It triggers a bug in reclaim which results in spurious
   allocation failures despite ample evictable memory.

   There is a path in the reclaimer procedure where we check whether
   reclamation succeeded by checking that the number of free LSA
   segments grew.

   But in the presence of evictable non-LSA allocations, this is wrong
   because the reclaim might have met its target by evicting the non-LSA
   allocations, in which case memory is returned directly to the
   standard allocator, rather than to the pool of free segments.

   If that happens, the reclaimer wrongly returns `reclaimed_nothing`
   to Seastar, which fails the allocation.

Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781

(cherry picked from commit 4e2f62143b)
2025-04-18 07:55:23 +00:00
Botond Dénes
e1ce7c36a7 test/cluster/test_read_repair.py: increase read request timeout
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode and the test read times out.
Increase timeout to 1minute to avoid this.

Fixes: #23513
Fixes: #23512

Closes scylladb/scylladb#23558

(cherry picked from commit 7bbfa5293f)

Closes scylladb/scylladb#23793
2025-04-18 06:34:25 +03:00
Laszlo Ersek
597030a821 storage_service: add unit test for mid-decommission transit_tablet()
Start a two node cluster. Create a single tablet on one of the nodes.
Start decommissioning that node, but block decommissioning at once. In
that state (i.e., in "tablet_draining"), move the tablet manually to the
other node. Check that transit_tablet() leaves the topology transition
state alone.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 841ca652a0)
2025-04-17 09:09:43 +02:00
Laszlo Ersek
add4022d32 storage_service: preserve state of busy topology when transiting tablet
Commit 876478b84f ("storage_service: allow concurrent tablet migration
in tablets/move API", 2024-02-08) introduced a code path on which the
topology state machine would be busy -- in "tablet_draining" or
"tablet_migration" state -- at the time of starting tablet migration. The
pre-commit code would unconditionally transition the topology to
"tablet_migration" state, assuming the topology had been idle previously.
On the new code path, this state change would be idempotent if the
topology state machine had been busy in "tablet_migration", but the state
change would incorrectly overwrite the "tablet_draining" state otherwise.

Restrict the state change to when the topology state machine is idle.

In addition, add the topology update to the "updates" vector with plain
push_back(). emplace_back() is not helpful here, as
topology_mutation_builder::build() cannot construct in-place, and so we
invoke the "canonical_mutation" move constructor once, either way.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit e1186f0ae6)
2025-04-17 08:32:56 +02:00
Avi Kivity
b6f1cacfea scylla-gdb: small-objects: fix for very small objects
Because of rounding and alignment, there are multiple pools for small
sizes (e.g. 4 for size 32). Because the pool selection algorithm
ignores alignment, different pools can be chosen for different object
sizes. For example, an object size of 29 will choose the first pool
of size 32, while an object size of 32 will choose the fourth pool of
size 32.

The small-objects command doesn't know about this and always considers
just the first pool for a given size. This causes it to miss out on
sister pools.

While it's possible to adjust pool selection to always choose one of the
pools, it may eat a precious cycle. So instead let's compensate in the
small-objects command. Instead of finding one pool for a given size,
find all of them, and iterate over all those pools.

Fixes #23603

Closes scylladb/scylladb#23604

(cherry picked from commit b4d4e48381)

Closes scylladb/scylladb#23748
2025-04-16 14:36:28 +03:00
Botond Dénes
c1defce1de Merge '[Backport 6.2] transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Scylladb[bot]
A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause) can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to set it back then).

This patch fixes this.

Fixes #23173

The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases.

- (cherry picked from commit ca6bddef35)

- (cherry picked from commit f7e1695068)

Parent PR: #23174

Closes scylladb/scylladb#23523

* github.com:scylladb/scylladb:
  CQL Tracing: set common query parameters in a single function
  transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
2025-04-11 17:11:06 +03:00
Botond Dénes
61d1262674 Merge '[Backport 6.2] reader_concurrency_semaphore: register_inactive_read(): handle aborted permit' from Scylladb[bot]
It is possible that the permit handed in to register_inactive_read() is already aborted (currently only possible if permit timed out). If the permit also happens to have wait for memory, the current code will attempt to call promise<>::set_exception() on the permit's promise to abort its waiters. But if the permit was already aborted via timeout, this promise will already have an exception and this will trigger an assert. Add a separate case for checking if the permit is aborted already. If so, treat it as immediate eviction: close the reader and clean up.

Fixes: scylladb/scylladb#22919

Bug is present in all live versions, backports are required.

- (cherry picked from commit 4d8eb02b8d)

- (cherry picked from commit 7ba29ec46c)

Parent PR: #23044

Closes scylladb/scylladb#23144

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
  test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
2025-04-11 17:10:39 +03:00
Botond Dénes
ae54d4e886 Merge '[Backport 6.2] streaming: fix the way a reason of streaming failure is determined' from Scylladb[bot]
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.

Needs backport to all live version, as they all contain the bug

- (cherry picked from commit 876cf32e9d)

- (cherry picked from commit faf3aa13db)

- (cherry picked from commit 44748d624d)

- (cherry picked from commit 35bc1fe276)

Parent PR: #22868

Closes scylladb/scylladb#23289

* github.com:scylladb/scylladb:
  streaming: fix the way a reason of streaming failure is determined
  streaming: save a continuation lambda
  streaming: use streaming namespace in table_check.{cc,hh}
  repair: streaming: move table_check.{cc,hh} to streaming
2025-04-11 15:13:14 +03:00
Alexander Turetskiy
e97e729973 Improve compation on read of expired tombstones
compact expired tombstones in cache even if they are blocked by
commitlog

fixes #16781

Closes: #23033
2025-04-11 15:09:43 +03:00
Botond Dénes
20a1edd763 tools/scylla-nodetool: s/GetInt()/GetInt64()/
GetInt() was observed to fail when the integer JSON value overflows the
int32_t type, which `GetInt()` uses for storage. When this happens,
rapidjson will assign a distinct 64 bit integer type to the value, and
attempting to access it as 32 bit integer triggers the wrong-type error,
resulting in assert failure. This was hit on the field where invoking
nodetool netstats resulted in nodetool crashing when the streamed bytes
amounts were higher than maxint.

To avoid such bugs in the future, replace all usage of GetInt() in
nodetool of GetInt64(), just to be sure.

A reproducer is added to the nodetool netstats crash.

Fixes: scylladb/scylladb#23394

Closes scylladb/scylladb#23395

(cherry picked from commit bd8973a025)

Closes scylladb/scylladb#23475
2025-04-11 15:00:47 +03:00
Aleksandra Martyniuk
07236341ab streaming: fix the way a reason of streaming failure is determined
During streaming receiving node gets and processes mutation fragments.
If this operation fails, receiver responds with -1 status code, unless
it failed due to no_such_column_family in which case streaming of this
table should be skipped.

However, when the table was dropped, an exception handler on receiver
side may get not only data_dictionary::no_such_column_family, but also
seastar::nested_exception of two no_such_column_family.

Encountered example:
```
ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14))
```

In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family>
clause and gets handled the same as any other exception type.

Replace try_catch clause with table_sync_and_check that synchronizes
the schema and check if the table exists.

Fixes: https://github.com/scylladb/scylladb/issues/22834.
(cherry picked from commit 35bc1fe276)
2025-04-11 11:55:38 +02:00
Aleksandra Martyniuk
2027b9a21b streaming: save a continuation lambda
In the following patches, an additional preemption point will be
added to the coroutine lambda in register_stream_mutation_fragments.

Assign a lambda to a variable to prolong the captures lifetime.

(cherry picked from commit 44748d624d)
2025-04-11 11:55:38 +02:00
Aleksandra Martyniuk
c1a211101d streaming: use streaming namespace in table_check.{cc,hh}
(cherry picked from commit faf3aa13db)
2025-04-11 11:55:38 +02:00
Aleksandra Martyniuk
e52fd85cf8 repair: streaming: move table_check.{cc,hh} to streaming
(cherry picked from commit 876cf32e9d)
2025-04-11 11:55:36 +02:00
Botond Dénes
6091e81e18 reader_concurrency_semaphore: register_inactive_read(): handle aborted permit
It is possible that the permit handed in to register_inactive_read() is
already aborted (currently only possible if permit timed out).
If the permit also happens to have wait for memory, the current code
will attempt to call promise<>::set_exception() on the permit's promise
to abort its waiters. But if the permit was already aborted via timeout,
this promise will already have an exception and this will trigger an
assert. Add a separate case for checking if the permit is aborted
already. If so, treat it as immediate eviction: close the reader and
clean up.

Fixes: scylladb/scylladb#22919
(cherry picked from commit 7ba29ec46c)
2025-04-11 04:04:42 -04:00
Botond Dénes
742ff76d4d test/boost/reader_concurrency_semaphore_test: move away from db::timeout_clock::now()
Unless the test in question actually wants to test timeouts. Timeouts
will have more pronounced consequences soon and thus using
db::timeout_clock::now() becomes a sure way to make tests flaky.
To avoid this, use db::no_timeout in the tests that don't care about
timeouts.

(cherry picked from commit 4d8eb02b8d)
2025-04-11 04:04:42 -04:00
Botond Dénes
76f5816e43 Merge '[Backport 6.2] scylla sstable: Add standard extensions and propagate to schema load ' from Scylladb[bot]
Fixes #22314

Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.

Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points.
Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line.

- (cherry picked from commit e6aa09e319)

- (cherry picked from commit 4aaf3df45e)

- (cherry picked from commit 00b40eada3)

- (cherry picked from commit 48fda00f12)

Parent PR: #22327

Closes scylladb/scylladb#23089

* github.com:scylladb/scylladb:
  tools: Add standard extensions and propagate to schema load
  cql_test_env: Use add all extensions instead of inidividually
  main: Move extensions adding to function
  tomstone_gc: Make validate work for tools
2025-04-11 11:00:41 +03:00
Lakshmi Narayanan Sreethar
1f272eade5 topology_coordinator: handle_table_migration: do not continue after executing metadata barrier
Return after executing the global metadata barrier to allow the topology
handler to handle any transitions that might have started by a
concurrect transaction.

Fixes #22792

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#22793

(cherry picked from commit 0f7d08d41d)

Closes scylladb/scylladb#23020
2025-04-11 10:59:43 +03:00
yangpeiyu2_yewu
f9cb9b905f mutation_writer/multishard_writer.cc: wrap writer into futurize_invoke
wrapped writer in seastar::futurize_invoke to make sure that the close() for the mutation_reader can be executed before destruction.

Fixes scylladb/scylladb#22790

Closes scylladb/scylladb#22812

(cherry picked from commit 0de232934a)

Closes scylladb/scylladb#22944
2025-04-11 10:59:17 +03:00
Avi Kivity
11993efc8c Merge '[Backport 6.2] row_cache: don't garbage-collect tombstones which cover data in memtables' from Scylladb[bot]
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;

In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which were not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.

Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252

The fix will need backport to all live release.

- (cherry picked from commit c2518cdf1a)

- (cherry picked from commit 6b5b563ef7)

- (cherry picked from commit 7e600a0747)

- (cherry picked from commit d126ea09ba)

- (cherry picked from commit cb76cafb60)

- (cherry picked from commit df09b3f970)

- (cherry picked from commit e5afd9b5fb)

- (cherry picked from commit 34b18d7ef4)

- (cherry picked from commit f7938e3f8b)

- (cherry picked from commit 6c1f6427b3)

- (cherry picked from commit 0d39091df2)

Parent PR: #23255

Closes scylladb/scylladb#23671

* github.com:scylladb/scylladb:
  test/boost/row_cache_test: add memtable overlap check tests
  replica/table: add error injection to memtable post-flush phase
  utils/error_injection: add a way to set parameters from error injection points
  test/cluster: add test_data_resurrection_in_memtable.py
  test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
  replica/mutation_dump: don't assume cells are live
  replica/database: do_apply() add error injection point
  replica: improve memtable overlap checks for the cache
  replica/memtable: add is_merging_to_cache()
  db/row_cache: add overlap-check for cache tombstone garbage collection
  mutation/mutation_compactor: copy key passed-in to consume_new_partition()
2025-04-10 21:41:52 +03:00
Botond Dénes
5fb8a6dae2 mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.

This protocol implementation bug has user-visible effects, when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to to query
result, these queries will trigger the skip sequence in the consumer and
due to the above described bug, will skip the remaining partitions in
the results, omitting these from the final query result.

This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.

A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.

Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
  underlying consumer's on_end_of_stream(), so when consuming multiple
  frozen mutations, the underlying's on_end_of_stream() is called for
  each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().

Fixes: #20084

Closes scylladb/scylladb#23657

(cherry picked from commit d67202972a)

Closes scylladb/scylladb#23693
2025-04-10 21:38:13 +03:00
Botond Dénes
a157b3e62f test/boost/row_cache_test: add memtable overlap check tests
Similar to test/cluster/test_data_resurrection_in_memtable.py but works
on a single node and uses more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.

(cherry picked from commit 0d39091df2)
2025-04-10 07:33:09 -04:00
Botond Dénes
ce1d990dd6 replica/table: add error injection to memtable post-flush phase
After the memtable was flushed to disk, but before it is merged to
cache. The injection point will only active for the table specified in
the "table_name" injection parameter.

(cherry picked from commit 6c1f6427b3)
2025-04-10 07:33:09 -04:00
Botond Dénes
37b51871ec utils/error_injection: add a way to set parameters from error injection points
With this, now it is possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit, before proceedin.

(cherry picked from commit f7938e3f8b)
2025-04-10 07:33:09 -04:00
Botond Dénes
ac18570069 test/cluster: add test_data_resurrection_in_memtable.py
Reproducers for #23252 and #23291 -- cache garbage
collecting tombstones resurrecting data in the memtable.

(cherry picked from commit 34b18d7ef4)
2025-04-10 07:33:09 -04:00
Botond Dénes
990e92d7cf test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
Such that a given index in the return hosts refers to the same
underlying Scylla instance, as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).

(cherry picked from commit e5afd9b5fb)
2025-04-10 07:33:09 -04:00
Botond Dénes
67a56ae192 replica/mutation_dump: don't assume cells are live
Currently the dumper unconditionally extracts the value of atomic cells,
assuming they are live. This doesn't always hold of course and
attempting to get the value of a dead cell will lead to marshalling
errors. Fix by checking is_live() before attempting to get the cell
value. Fix for both regular and collection cells.

(cherry picked from commit df09b3f970)
2025-04-10 07:33:09 -04:00
Botond Dénes
85a7a9cb05 replica/database: do_apply() add error injection point
So writes (to user tables) can be failed on a replica, via error
injection. Should simplify tests which want to create differences in
what writes different replicas receive.

(cherry picked from commit cb76cafb60)
2025-04-10 07:33:09 -04:00
Botond Dénes
95205a1b29 replica: improve memtable overlap checks for the cache
The current memtable overlap check that is used by the cache
-- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only
checks the active memtable, so memtables which are either being flushed
or are already flushed and also have active reads against them do not
participate in the overlap check.
This can result in temporary data resurrection, where a cache read can
garbage-collect a tombstone which still covers data in a flushing or
flushed memtable, which still have active read against it.

To prevent this, extend the overlap check to also consider all of the
memtable list. Furthermore, memtable_list::erase() now places the removed
(flushed) memtable in an intrusive list. These entries are alive only as
long as there are readers still keeping an `lw_shared_ptr<memtable>`
alive. This list is now also consulted on overlap checks.

(cherry picked from commit d126ea09ba)
2025-04-10 07:33:09 -04:00
Botond Dénes
ef423eb4c7 replica/memtable: add is_merging_to_cache()
And set it when the memtable is merged to cache.

(cherry picked from commit 7e600a0747)
2025-04-10 07:33:08 -04:00
Botond Dénes
d10a2688b1 db/row_cache: add overlap-check for cache tombstone garbage collection
The cache should not garbage-collect tombstone which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.

(cherry picked from commit 6b5b563ef7)
2025-04-10 07:33:08 -04:00
Botond Dénes
4647aa0366 mutation/mutation_compactor: copy key passed-in to consume_new_partition()
This doesn't introduce additional work for single-partition queries: the
key is copied anyway on consume_end_of_stream().
Multi-partition reads and compaction are not that sensitive to
additional copy added.

This change fixes a bug in the compacting_reader: currently the reader
passes _last_uncompacted_partition_start.key() to the compactor's
consume_new_partition(). When the compactor emits enough content for this
partition, _last_uncompacted_partition_start is moved from to emit the
partition start, this makes the key reference passed to the compaction
corrupt (refer to moved-from value). This in turn means that subsequent
GC checks done by the compactor will be done with a corrupt key and
therefore can result in tombstone being garbage-collected while they
still cover data elsewhere (data resurrection).

The compacting reader is violating the API contract and normally the bug
should be fixed there. We make an exception here because doing the fix
in the mutation compactor better aligns with our future plans:
* The fix simplifies the compactor (gets rid of _last_dk).
* Prepares the way to get rid of the consume API used by the compactor.

(cherry picked from commit c2518cdf1a)
2025-04-09 14:02:30 +00:00
Michał Chojnowski
664d36c737 table: fix a race in table::take_storage_snapshot()
`safe_foreach_sstable` doesn't do its job correctly.

It iterates over an sstable set under the sstable deletion
lock in an attempt to ensure that SSTables aren't deleted during the iteration.

The thing is, it takes the deletion lock after the SSTable set is
already obtained, so SSTables might get unlinked *before* we take the lock.

Remove this function and fix its usages to obtain the set and iterate
over it under the lock.

Closes scylladb/scylladb#23397

(cherry picked from commit e23fdc0799)

Closes scylladb/scylladb#23627
2025-04-08 19:06:42 +03:00
Lakshmi Narayanan Sreethar
90add328ad replica/table::do_apply : do not check for async gate's closure
The `table::do_apply()` method verifies if the compaction group's async
gate is open to determine if the compaction group is active. Closing
this async gate prevents any new operations but waits for existing
holders to exit, allowing their operations to complete. When holding a
gate, holders will observe the gate as closed when it is being closed,
but this is irrelevant as they are already inside the gate and are
allowed to complete. All the callers of `table::do_apply()` already
enter the gate before calling the method. So, the async gate check
inside `table::do_apply()` will erroneously throw an exception when the
compaction group is closing despite holding the gate. This commit
removes the check to prevent this from happening.

Fixes #23348

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#23579

(cherry picked from commit 750f4baf44)

Closes scylladb/scylladb#23644
2025-04-08 18:58:35 +03:00
Yaron Kaikov
f4909aafc7 .github: Make "make-pr-ready-for-review" workflow run in base repo
in 57683c1a50 we fixed the `token` error,
but removed the checkout part which causing now the following error
```
failed to run git: fatal: not a git repository (or any of the parent directories): .git
```
Adding the repo checkout stage to avoid such error

Fixes: https://github.com/scylladb/scylladb/issues/22765

Closes scylladb/scylladb#23641

(cherry picked from commit 2dc7ea366b)

Closes scylladb/scylladb#23653
2025-04-08 13:47:50 +03:00
Kefu Chai
9e3eb4329c .github: Make "make-pr-ready-for-review" workflow run in base repo
The "make-pr-ready-for-review" workflow was failing with an "Input
required and not supplied: token" error.  This was due to GitHub Actions
security restrictions preventing access to the token when the workflow
is triggered in a fork:
```
    Error: Input required and not supplied: token
```

This commit addresses the issue by:

- Running the workflow in the base repository instead of the fork. This
  grants the workflow access to the required token with write permissions.
- Simplifying the workflow by using a job-level `if` condition to
  controlexecution, as recommended in the GitHub Actions documentation
  (https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/using-conditions-to-control-job-execution).
  This is cleaner than conditional steps.
- Removing the repository checkout step, as the source code is not required for this workflow.

This change resolves the token error and ensures the
"make-pr-ready-for-review" workflow functions correctly.

Fixes scylladb/scylladb#22765

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22766

(cherry picked from commit ca832dc4fb)

Closes scylladb/scylladb#23617
2025-04-07 08:11:09 +03:00
Kefu Chai
4ac3f82df9 dist: systemd: use default KillMode
before this change, we specify the KillMode of the scylla-service
service unit explicitly to "process". according to
according to
https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html,

> If set to process, only the main process itself is killed (not recommended!).

and the document suggests use "control-group" over "process".
but scylla server is not a multi-process server, it is a multi-threaded
server. so it should not make any difference even if we switch to
the recommended "control-group".

in the light that we've been seeing "defunct" scylla process after
stopping the scylla service using systemd. we are wondering if we should
try to change the `KillMode` to "control-group", which is the default
value of this setting.

in this change, we just drop the setting so that the systemd stops the
service by stopping all processes in the control group of this unit
are stopped.

Fixes scylladb/scylladb#21507

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

(cherry picked from commit 961a53f716)

Closes scylladb/scylladb#23176
2025-04-04 17:55:04 +03:00
Vlad Zolotarov
e3d063a070 CQL Tracing: set common query parameters in a single function
Each query-type (QUERY, EXECUTE, BATCH) CQL opcode has a number of parameters
in their payload which we always want to record in the Tracing object.
Today it's a Consistency Level, Serial Consistency Level and a Default Timestamp.

Setting each of them individually can lead to a human error when one (or more) of
them would not be set. Let's eliminate such a possibility by defining
a single function that sets them all.

This also allows an easy addition of such parameters to this function in
the future.
2025-04-02 14:10:03 -04:00
Vlad Zolotarov
1b2ab34647 transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing
A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause)
can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of
QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver.

However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation.
For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to
set it back then).

This patch fixes this.

Fixes #23173

(cherry picked from commit ca6bddef35)
2025-04-01 11:45:33 +00:00
Yaron Kaikov
b67329b34e .github: add action to make PR ready for review when conflicts label was removed
Moving a PR out of draft is only allowed to users with write access,
adding a github action to switch PR to `ready for review` once the
`conflicts` label was removed

Closes scylladb/scylladb#22446

(cherry picked from commit ed4bfad5c3)

Closes scylladb/scylladb#23006
2025-03-30 11:59:40 +03:00
Kefu Chai
48ff7cf61c gms: Fix fmt formatter for gossip_digest_sync
In commit 4812a57f, the fmt-based formatter for gossip_digest_syn had
formatting code for cluster_id, partitioner, and group0_id
accidentally commented out, preventing these fields from being included
in the output. This commit restores the formatting by uncommenting the
code, ensuring full visibility of all fields in the gossip_digest_syn
message when logging permits.

This fixes a regression introduced in 4812a57f, which obscured these
fields and reduced debugging insight. Backporting is recommended for
improved observability.

Fixes #23142
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23155

(cherry picked from commit 2a9966a20e)

Closes scylladb/scylladb#23198
2025-03-30 11:57:15 +03:00
Kefu Chai
18d5af1cd3 storage_proxy: Prevent integer overflow in abstract_read_executor::execute
Fix UBSan abort caused by integer overflow when calculating time difference
between read and write operations. The issue occurs when:
1. The queried partition on replicas is not purgeable (has no recorded
   modified time)
2. Digests don't match across replicas
3. The system attempts to calculate timespan using missing/negative
   last_modified timestamps

This change skips cross-DC repair optimization when write timestamp is
negative or missing, as this optimization is only relevant for reads
occurring within write_timeout of a write.

Error details:
```
service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80
Aborting on shard 1, in scheduling group sl:default
```

Related to previous fix 39325cf which handled negative read_timestamp cases.

Fixes #23314
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23359

(cherry picked from commit ebf9125728)

Closes scylladb/scylladb#23386
2025-03-30 11:54:59 +03:00
Tomasz Grabiec
6cdd1cccdc test: tablets: Fix flakiness due to ungraceful shutdown
The test fails sporadically with:

cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}

That's becase a server is stopped in the middle of the workload.

The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.

Fixes #20492

Closes scylladb/scylladb#23451

(cherry picked from commit 8e506c5a8f)

Closes scylladb/scylladb#23468
2025-03-28 14:56:39 +01:00
Anna Stuchlik
3f0e52a5ee doc: zero-token nodes and Arbiter DC
This commit adds documentation for zero-token nodes and an explanation
of how to use them to set up an arbiter DC to prevent a quorum loss
in multi-DC deployments.

The commit adds two documents:
- The one in Architecture describes zero-token nodes.
- The other in Cluster Management explains how to use them.

We need separate documents because zero-token nodes may be used
for other purposes in the future.

In addition, the documents are cross-linked, and the link is added
to the Create a ScyllaDB Cluster - Multi Data Centers (DC) document.

Refs https://github.com/scylladb/scylladb/pull/19684

Fixes https://github.com/scylladb/scylladb/issues/20294

Closes scylladb/scylladb#21348

(cherry picked from commit 9ac0aa7bba)

Closes scylladb/scylladb#23200
2025-03-10 10:52:13 +01:00
Piotr Dulikowski
9fc27b734f test: test_mv_topology_change: increase timeout for removenode
The test `test_mv_topology_change` is a regression test for
scylladb/scylladb#19529. The problem was that CL=ANY writes issued when
all replicas were down would be kept in memory until the timeout. In
particular, MV updates are CL=ANY writes and have a 5 minute timeout.
When doing topology operations for vnodes or when migrating tablet
replicas, the cluster goes through stages where the replica sets for
writes undergo changes, and the writes started with the old replica set
need to be drained first.

Because of the aforementioned MV updates, the removenode operation could
be delayed by 5 minutes or more. Therefore, the
`test_mv_topology_change` test uses a short timeout for the removenode
operation, i.e. 30s. Apparently, this is too low for the debug mode and
the test has been observed to time out even though the removenode
operation is progressing fine.

Increase the timeout to 60s. This is the lowest timeout for the
removenode operation that we currently use among the in-repo tests, and
is lower than 5 minutes so the test will still serve its purpose.

Fixes: scylladb/scylladb#22953

Closes scylladb/scylladb#22958

(cherry picked from commit 43ae3ab703)

Closes scylladb/scylladb#23052
2025-03-04 16:04:00 +01:00
Wojciech Mitros
8d392229cc test: add test for schema registry maintaining base info for views
In this patch we test the behavior of schema registry in a few
scenarios where it was identified it could misbehave.

The first one is reverse schemas for views. Previously, SELECT
queries with reverse order on views could fail because we didn't
have base info in the registry for such schemas.

The second one is schemas that temporarily died in the registry.
This can happen when, while processing a query for a given schema
version, all related schema_ptrs were destroyed, but this schema
was requested before schema_registry::grace_period() has passed.
In this scenario, the base info would not be recovered, causing
errors.

(cherry picked from commit 74cbc77f50)
2025-03-03 13:57:45 +01:00
Wojciech Mitros
3a1d2cbeb6 schema_registry: avoid setting base info when getting the schema from registry
After the previous patches, the view schemas returned by schema registry
always have their base info set. As such, we no longer need to set it after
getting the view schema from the registry. This patch removes these
unnecessary updates.

(cherry picked from commit 3094ff7cbe)
2025-03-03 13:49:11 +01:00
Calle Wilund
ecbb765b3f tools: Add standard extensions and propagate to schema load
Fixes #22314

Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them.

(cherry picked from commit 48fda00f12)
2025-02-27 17:46:28 +00:00
Calle Wilund
3bf553efa1 cql_test_env: Use add all extensions instead of inidividually
(cherry picked from commit 00b40eada3)
2025-02-27 17:46:28 +00:00
Calle Wilund
9500c8c9cb main: Move extensions adding to function
Easily called from elsewhere. The extensions we should always include (oxymoron?)

(cherry picked from commit 4aaf3df45e)
2025-02-27 17:46:28 +00:00
Calle Wilund
4c735abb62 tomstone_gc: Make validate work for tools
Don't crash if validation is done as part of loading a schema from file (schema.cql)

(cherry picked from commit e6aa09e319)
2025-02-27 17:46:28 +00:00
Wojciech Mitros
b0ab86a8ad schema_registry: update cached base schemas when updating a view
The schema registry now holds base schemas for view schemas.
The base schema may change without changing the view schema, so to
preserve the change in the schema registry, we also update the
base schema in the registry when updating the base info in the
view schema.

(cherry picked from commit 82f2e1b44c)
2025-02-25 16:55:03 +00:00
Wojciech Mitros
1cf288dd97 schema_registry: cache base schemas for views
Currently, when we load a frozen schema into the registry, we lose
the base info if the schema was of a view. Because of that, in various
places we need to set the base info again, and in some codepaths we
may miss it completely, which may make us unable to process some
requests (for example, when executing reverse queries on views).
Even after setting the base info, we may still lose it if the schema
entry gets deactivated.

To fix this, this patch adds the base schema to the registry, alongside
the view schema. With the base schema, we can now set the base
info when returning the schema from the registry. As a result, we can now
assume that all view schemas returned by the registry have base_info set.

To store the base schema, the loader methods now have to return the base
schema alongside the view schema. At the same time, when loading into
the registry, we need to check whether we're loading a view schema, and if
so, we need to also provide the base schema. When inserting a regular table
schema, the base schema should be a disengaged optional.

(cherry picked from commit dfe3810f64)
2025-02-25 16:55:03 +00:00
Wojciech Mitros
40312f482e db: set base info before adding schema to registry
In the following patches, we'll assure that view schemas returned by the
schema registry always have base info set. To prepare for that, make sure
that the base info is always set before inserting it into schema registry,

(cherry picked from commit 6f11edbf3f)
2025-02-25 16:55:03 +00:00
Benny Halevy
9fd5909a5e token_group_based_splitting_mutation_writer: maybe_switch_to_new_writer: prevent double close
Currently, maybe_switch_to_new_writer resets _current_writer
only in a continuation after closing the current writer.
This leaves a window of vulnerability if close() yields,
and token_group_based_splitting_mutation_writer::close()
is called. Seeing the engaged _current_writer, close()
will call _current_writer->close() - which must be called
exactly once.

Solve this when switching to a new writer by resetting
_current_writer before closing it and potentially yielding.

Fixes #22715

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#22922

(cherry picked from commit 29b795709b)

Closes scylladb/scylladb#22964
2025-02-23 14:27:11 +02:00
Botond Dénes
103a986eca Merge '[Backport 6.2] reader_concurrency_semaphore: set_notify_handler(): disable timeout ' from Scylladb[bot]
`set_notify_handler()` is called after a querier was inserted into the querier cache. It has two purposes: set a callback for eviction and set a TTL for the cache entry. This latter was not disabling the pre-existing timeout of the permit (if any) and this would lead to premature eviction of the cache entry if the timeout was shorter than TTL (which his typical).
Disable the timeout before setting the TTL to prevent premature eviction.

Fixes: https://github.com/scylladb/scylladb/issues/22629

Backport required to all active releases, they are all affected.

- (cherry picked from commit a3ae0c7cee)

- (cherry picked from commit 9174f27cc8)

Parent PR: #22701

Closes scylladb/scylladb#22751

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: set_notify_handler(): disable timeout
  reader_permit: mark check_abort() as const
2025-02-19 09:59:47 +02:00
Botond Dénes
1d4ea169e3 reader_concurrency_semaphore: set_notify_handler(): disable timeout
set_notify_handler() is called after a querier was inserted into the
querier cache. It has two purposes: set a callback for eviction and set
a TTL for the cache entry. This latter was not disabling the
pre-existing timeout of the permit (if any) and this would lead to
premature eviction of the cache entry if the timeout was shorter than
TTL (which his typical).
Disable the timeout before setting the TTL to prevent premature
eviction.

Fixes: #scylladb/scylladb#22629
(cherry picked from commit 9174f27cc8)
2025-02-18 04:48:22 -05:00
Botond Dénes
86c9bc778a tools/scylla-nodetool: netstats: don't assume both senders and receivers
The code currently assumes that a session has both sender and receiver
streams, but it is possible to have just one or the other.
Change the test to include this scenario and remove this assumption from
the code.

Fixes: #22770

Closes scylladb/scylladb#22771

(cherry picked from commit 87e8e00de6)

Closes scylladb/scylladb#22873
2025-02-18 10:35:10 +02:00
Botond Dénes
4acb366a28 service/storage_proxy: schedule_repair(): materialize the range into a vector
Said method passes down its `diff` input to `mutate_internal()`, after
some std::ranges massaging. Said massaging is destructive -- it moves
items from the diff. If the output range is iterated-over multiple
times, only the first time will see the actual output, further
iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens:
`mutate_internal()` iterates over the range multiple times, first to log
its content, then to pass it down the stack. This ends up resulting in
a range with moved-from elements being pased down and consequently write
handlers being created with nullopt mutations.

Make the range re-entrant by materializing it into a vector before
passing it to `mutate_internal()`.

Fixes: scylladb/scylladb#21907
Fixes: scylladb/scylladb#21714

Closes scylladb/scylladb#21910

(cherry picked from commit 7150442f6a)

Closes scylladb/scylladb#22853
2025-02-18 10:34:42 +02:00
Botond Dénes
24eb8f49ba service: query_pager: fix last-position for filtering queries
On short-pages, cut short because of a tombstone prefix.
When page-results are filtered and the filter drops some rows, the
last-position is taken from the page visitor, which does the filtering.
This means that last partition and row position will be that of the last
row the filter saw. This will not match the last position of the
replica, when the replica cut the page due to tombstones.
When fetching the next page, this means that all the tombstone suffix of
the last page, will be re-fetched. Worse still: the last position of the
next page will not match that of the saved reader left on the replica, so
the saved reader will be dropped and a new one created from scratch.
This wasted work will show up as elevated tail latencies.
Fix by always taking the last position from raw query results.

Fixes: #22620

Closes scylladb/scylladb#22622

(cherry picked from commit 7ce932ce01)

Closes scylladb/scylladb#22718
2025-02-13 15:15:24 +02:00
Botond Dénes
8e6648870b reader_concurrency_semaphore: with_permit(): proper clean-up after queue overload
with_permit() creates a permit, with a self-reference, to avoid
attaching a continuation to the permit's run function. This
self-reference is used to keep the permit alive, until the execution
loop processes it. This self reference has to be carefully cleared on
error-paths, otherwise the permit will become a zombie, effectively
leaking memory.
Instead of trying to handle all loose ends, get rid of this
self-reference altogether: ask caller to provide a place to save the
permit, where it will survive until the end of the call. This makes the
call-site a little bit less nice, but it gets rid of a whole class of
possible bugs.

Fixes: #22588

Closes scylladb/scylladb#22624

(cherry picked from commit f2d5819645)

Closes scylladb/scylladb#22703
2025-02-13 15:03:57 +02:00
Botond Dénes
77696b1e43 reader_concurrency_semaphore: foreach_permit(): include _inactive_reads
So inactive reads show up in semaphore diagnostics dumps (currently the
only non-test user of this method).

Fixes: #22574

Closes scylladb/scylladb#22575

(cherry picked from commit e1b1a2068a)

Closes scylladb/scylladb#22610
2025-02-13 15:03:23 +02:00
Aleksandra Martyniuk
3497ba7f60 replica: mark registry entry as synch after the table is added
When a replica get a write request it performs get_schema_for_write,
which waits until the schema is synced. However, database::add_column_family
marks a schema as synced before the table is added. Hence, the write may
see the schema as synced, but hit no_such_column_family as the table
hasn't been added yet.

Mark schema as synced after the table is added to database::_tables_metadata.

Fixes: #22347.

Closes scylladb/scylladb#22348

(cherry picked from commit 328818a50f)

Closes scylladb/scylladb#22603
2025-02-13 15:02:59 +02:00
Aleksandra Martyniuk
ade0fe2d7a nodetool: tasks: print empty string for start_time/end_time if unspecified
If start_time/end_time is unspecified for a task, task_manager API
returns epoch. Nodetool prints the value in task status.

Fix nodetool tasks commands to print empty string for start_time/end_time
if it isn't specified.

Modify nodetool tasks status docs to show empty end_time.

Fixes: #22373.

Closes scylladb/scylladb#22370

(cherry picked from commit 477ad98b72)

Closes scylladb/scylladb#22600
2025-02-13 13:26:54 +02:00
Jenkins Promoter
72cf5ef576 Update ScyllaDB version to: 6.2.4 2025-02-09 16:52:35 +02:00
Botond Dénes
2978ed58a2 reader_permit: mark check_abort() as const
All it does is read one field, making it const makes using it easier.

(cherry picked from commit a3ae0c7cee)
2025-02-09 00:32:13 +00:00
Tomasz Grabiec
6922acb69f Merge '[Backport 6.2] split: run set_split_mode() on all storage groups during all_storage_groups_split()' from Scylladb[bot]
`tablet_storage_group_manager::all_storage_groups_split()` calls `set_split_mode()` for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using `std::ranges::all_of()` which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (`set_split_mode()`) returning false. `set_split_mode()` creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups.

The missing split compaction groups are later created in `tablet_storage_group_manager::split_all_storage_groups()` which also calls `set_split_mode()`, and that is the reason why split completes successfully. The problem is that
`tablet_storage_group_manager::all_storage_groups_split()` runs under a group0 guard, but
`tablet_storage_group_manager::split_all_storage_groups()` does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE

Fixes #22431

This is a bugfix and should be back ported to versions with tablets: 6.1 6.2 and 2025.1

- (cherry picked from commit 24e8d2a55c)

- (cherry picked from commit 8bff7786a8)

Parent PR: #22330

Closes scylladb/scylladb#22559

* github.com:scylladb/scylladb:
  test: add reproducer and test for fix to split ready CG creation
  table: run set_split_mode() on all storage groups during all_storage_groups_split()
2025-02-07 14:22:57 +01:00
Tomasz Grabiec
61e303a3e3 locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
In that case, new_racks will be used, but when we discover no
candidates, we try to pop from existing_racks.

Fixes #22625

Closes scylladb/scylladb#22652

(cherry picked from commit e22e3b21b1)

Closes scylladb/scylladb#22721
2025-02-06 16:47:14 +01:00
Avi Kivity
8ede62d288 Update seastar submodule (hwloc failures on some AWS instances)
* seastar ec5da7a606...e40388c4c7 (1):
  > resource: fallback to sysconf when failed to detect memory size from hwloc

Fixes #22382
2025-02-04 16:29:45 +02:00
Avi Kivity
4ac9c710fc Merge '[Backport 6.2] api: task_manager: do not unregister finish task when its status is queried' from Scylladb[bot]
Currently, when the status of a task is queried and the task is already finished,
it gets unregistered. Getting the status shouldn't be a one-time operation.

Stop removing the task after its status is queried. Adjust tests not to rely
on this behavior. Add task_manager/drain API and nodetool tasks drain
command to remove finished tasks in the module.

Fixes: https://github.com/scylladb/scylladb/issues/21388.

It's a fix to task_manager API, should be backported to all branches

- (cherry picked from commit e37d1bcb98)

- (cherry picked from commit 18cc79176a)

Parent PR: #22310

Closes scylladb/scylladb#22597

* github.com:scylladb/scylladb:
  api: task_manager: do not unregister tasks on get_status
  api: task_manager: add /task_manager/drain
2025-02-03 23:04:31 +02:00
Avi Kivity
34fa9bd586 Merge '[Backport 6.2] Simplify loading_cache_test and use manual_clock' from Scylladb[bot]
This series exposes a Clock template parameter for loading_cache so that the test could use
the manual_clock rather than the lowres_clock, since relying on the latter is flaky.

In addition, the test load function is simplified to sleep some small random time and co_return the expected string,
rather than reading it from a real file, since the latter's timing might also be flaky, and it out-of-scope for this test.

Fixes #20322

* The test was flaky forever, so backport is required for all live versions.

- (cherry picked from commit b509644972)

- (cherry picked from commit 934a9d3fd6)

- (cherry picked from commit d68829243f)

- (cherry picked from commit b258f8cc69)

- (cherry picked from commit 0841483d68)

- (cherry picked from commit 32b7cab917)

Parent PR: #22064

Closes scylladb/scylladb#22640

* github.com:scylladb/scylladb:
  tests: loading_cache_test: use manual_clock
  utils: loading_cache: make clock_type a template parameter
  test: loading_cache_test: use function-scope loader
  test: loading_cache_test: simlute loader using sleep
  test: lib: eventually: add sleep function param
  test: lib: eventually: make *EVENTUALLY_EQUAL inline functions
2025-02-03 22:56:31 +02:00
Benny Halevy
79bff0885c tests: loading_cache_test: use manual_clock
Relying on a real-time clock like lowres_clock
can be flaky (in particular in debug mode).
Use manual_clock instead to harden the test against
timing issues.

Fixes #20322

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 32b7cab917)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-03 16:15:46 +02:00
Benny Halevy
abf8f44e03 utils: loading_cache: make clock_type a template parameter
So the unit test can use manual_clock rather than lowres_clock
which can be flaky (in particular in debug mode).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0841483d68)
2025-02-03 16:02:37 +02:00
Benny Halevy
00f1dcfd09 test: loading_cache_test: use function-scope loader
Rather than a global function, accessing a thread-local `load_count`.
The thread-local load_count cannot be used when multiple test
cases run in parallel.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b258f8cc69)
2025-02-03 16:01:53 +02:00
Benny Halevy
b0166a3a9c test: loading_cache_test: simlute loader using sleep
This test isn't about reading values from file,
but rather it's about the loading_cache.
Reading from the file can sometimes take longer than
the expected refresh times, causing flakiness (see #20322).

Rather than reading a string from a real file, just
sleep a random, short time, and co_return the string.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d68829243f)
2025-02-03 16:00:51 +02:00
Benny Halevy
7addc3454d test: lib: eventually: add sleep function param
To allow support for manual_clock instead of seastar::sleep.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 934a9d3fd6)
2025-02-03 16:00:47 +02:00
Benny Halevy
9d5e3f050e test: lib: eventually: make *EVENTUALLY_EQUAL inline functions
rather then macros.

This is a first cleanup step before adding a sleep function
parameter to support also manual_clock.

Also, add a call to BOOST_REQUIRE_EQUAL/BOOST_CHECK_EQUAL,
respectively, to make an error more visible in the test log
since those entry points print the offending values
when not equal.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b509644972)
2025-02-03 15:50:22 +02:00
Michael Litvak
c7421a8804 view_builder: fix loop in view builder when tokens are moved
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.

A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.

The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.

To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.

Fixes scylladb/scylladb#21829

Closes scylladb/scylladb#22493

(cherry picked from commit 6d34125eb7)

Closes scylladb/scylladb#22606
2025-02-03 13:27:28 +01:00
Avi Kivity
700402a7bb seatar: point submodule at scylla-seastar.git
This allows backporting commits to seastar.
2025-01-31 19:49:15 +02:00
Aleksandra Martyniuk
424fab77d2 api: task_manager: do not unregister tasks on get_status
Currently, /task_manager/task_status_recursive/{task_id} and
/task_manager/task_status/{task_id} unregister queries task if it
has already finished.

The status should not disappear after being queried. Do not unregister
finished task when its status or recursive status is queried.

(cherry picked from commit 18cc79176a)
2025-01-31 10:12:46 +01:00
Aleksandra Martyniuk
5d16157936 api: task_manager: add /task_manager/drain
In the following patches, get_status won't be unregistering finished
tasks. However, tests need a functionality to drop a task, so that
they could manipulate only with the tasks for operations that were
invoked by these tests.

Add /task_manager/drain/{module} to unregister all finished tasks
from the module. Add respective nodetool command.

(cherry picked from commit e37d1bcb98)
2025-01-31 10:11:57 +01:00
Aleksandra Martyniuk
e7d891b629 repair: add repair_service gate
In main.cc storage_service is started before and stopped after
repair_service. storage_service keeps a reference to sharded
repair_service and calls its methods, but nothing ensures that
repair_service's local instance would be alive for the whole
execution of the method.

Add a gate to repair_service and enter it in storage_service
before executing methods on local instances of repair_service.

Fixes: #21964.

Closes scylladb/scylladb#22145

(cherry picked from commit 32ab58cdea)

Closes scylladb/scylladb#22318
2025-01-30 11:39:46 +02:00
Anna Stuchlik
e57e3b4039 doc: add troubleshooting removal with --autoremove-ubuntu
This commit adds a troubleshooting article on removing ScyllaDB
with the --autoremove option.

Fixes https://github.com/scylladb/scylladb/issues/21408

Closes scylladb/scylladb#21697

(cherry picked from commit 8d824a564f)

Closes scylladb/scylladb#22232
2025-01-29 20:24:24 +02:00
Kefu Chai
2514f50f7f docs: fix monospace formatting for rm command
Add missing space before `rm` to ensure proper rendering
in monospace font within documentation.

Fixes scylladb/scylladb#22255
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21576

(cherry picked from commit 6955b8238e)

Closes scylladb/scylladb#22257
2025-01-29 20:23:22 +02:00
Botond Dénes
e74c8372a9 tools/scylla-sstable: dump-statistics: fix handling of {min,max}_column_names
Said fields in statistics are of type
`disk_array<uint32_t, disk_string<uint16_t>>` and currently are handled
as array of regular strings. However these fields store exploded
clustering keys, so the elements store binary data and converting to
string can yield invalid UTF-8 characters that certain JSON parsers (jq,
or python's json) can choke on. Fix this by treating them as binary and
using `to_hex()` to convert them to string. This requires some massaging
of the json_dumper: passing field offset to all visit() methods and
using a caller-provided disk-string to sstring converter to convert disk
strings to sstring, so in the case of statistics, these fields can be
intercepted and properly handled.

While at it, the type of these fields is also fixed in the
documentation.

Before:

    "min_column_names": [
      "��Z���\u0011�\u0012ŷ4^��<",
      "�2y\u0000�}\u007f"
    ],
    "max_column_names": [
      "��Z���\u0011�\u0012ŷ4^��<",
      "}��B\u0019l%^"
    ],

After:

    "min_column_names": [
      "9dd55a92bc8811ef12c5b7345eadf73c",
      "80327900e2827d7f"
    ],
    "max_column_names": [
      "9dd55a92bc8811ef12c5b7345eadf73c",
      "7df79242196c255e"
    ],

Fixes: #22078

Closes scylladb/scylladb#22225

(cherry picked from commit f899f0e411)

Closes scylladb/scylladb#22296
2025-01-29 20:22:52 +02:00
Botond Dénes
aa16d736dc Merge '[Backport 6.2] sstable_directory: do not load remote unshared sstables in process_descriptor()' from Lakshmi Narayanan
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.

This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.

Fixes #21015

This fixes a regression and should be backported.

- (cherry picked from commit d2ba45a01f)

- (cherry picked from commit 6e3ecc70a6)

- (cherry picked from commit 63100b34da)

Parent PR: #22263

Closes scylladb/scylladb#22376

* github.com:scylladb/scylladb:
  sstable_directory: do not load remote sstables in process_descriptor
  sstable_directory: reintroduce `get_shards_for_this_sstable()`
2025-01-29 20:20:50 +02:00
Botond Dénes
fa73a8da34 replica: remove noexcept from token -> tablet resolution path
The methods to resolve a key/token/range to a table are all noexcept.
Yet the method below all of these, `storage_group_for_id()` can throw.
This means that if due to any mistake a tablet without local replica is
attempted to be looked up, it will result in a crash, as the exception
bubbles up into the noexcept methods.
There is no value in pretending that looking up the tablet replica is
noexcept, remove the noexcept specifiers so that any bad lookup only
fails the operation at hand and doesn't crash the node. This is
especially relevant to replace, which still has a window where writes
can arrive for tablets that don't (yet) have a local replica. Currently,
this results in a crash. After this patch, this will only fail the
writes and the replace can move on.

Fixes: #21480

Closes scylladb/scylladb#22251

(cherry picked from commit 55963f8f79)

Closes scylladb/scylladb#22379
2025-01-29 20:19:52 +02:00
Avi Kivity
10abba4c64 Merge '[Backport 6.2] repair: handle no_such_keyspace in repair preparation phase' from null
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.

Skip the keyspace repair if no_such_keyspace is thrown during preparations.

Fixes: #22073.

Requires backport to 6.1 and 6.2 as they contain the bug

- (cherry picked from commit bfb1704afa)

- (cherry picked from commit 54e7f2819c)

Parent PR: #22473

Closes scylladb/scylladb#22541

* github.com:scylladb/scylladb:
  test: add test to check if repair handles no_such_keyspace
  repair: handle keyspace dropped
2025-01-29 19:52:38 +02:00
Michael Litvak
36b1a486de cdc: fix handling of new generation during raft upgrade
During raft upgrade, a node may gossip about a new CDC generation that
was propagated through raft. The node that receives the generation by
gossip may have not applied the raft update yet, and it will not find
the generation in the system tables. We should consider this error
non-fatal and retry to read until it succeeds or becomes obsolete.

Another issue is when we fail with a "fatal" exception and not retrying
to read, the cdc metadata is left in an inconsistent state that causes
further attempts to insert this CDC generation to fail.

What happens is we complete preparing the new generation by calling `prepare`,
we insert an empty entry for the generation's timestamp, and then we fail. The
next time we try to insert the generation, we skip inserting it because we see
that it already has an entry in the metadata and we determine that
there's nothing to do. But this is wrong, because the entry is empty,
and we should continue to insert the generation.

To fix it, we change `prepare` to return `true` when the entry already
exists but it's empty, indicating we should continue to insert the
generation.

Fixes scylladb/scylladb#21227

Closes scylladb/scylladb#22093

(cherry picked from commit 4f5550d7f2)

Closes scylladb/scylladb#22545
2025-01-29 19:52:18 +02:00
Ferenc Szili
5a74ded582 test: add reproducer and test for fix to split ready CG creation
This adds a reproducer for #22431

In cases where a tablet storage group manager had more than one storage
group, it was possible to create compaction groups outside the group0
guard, which could create problems with operations which should exclude
with compaction group creation.

(cherry picked from commit 8bff7786a8)
2025-01-29 14:46:37 +01:00
Ferenc Szili
0ea5a1fc48 table: run set_split_mode() on all storage groups during all_storage_groups_split()
tablet_storage_group_manager::all_storage_groups_split() calls set_split_mode()
for each of its storage groups to create split ready compaction groups. It does
this by iterating through storage groups using std::ranges::all_of() which is
not guaranteed to iterate through the entire range, and will stop iterating on
the first occurance of the predicate (set_split_mode()) returning false.
set_split_mode() creates the split compaction groups and returns false if the
storage group's main compaction group or merging groups are not empty. This
means that in cases where the tablet storage group manager has non-empty
storage groups, we could have a situation where split compaction groups are not
created for all storage groups.

The missing split compaction groups are later created in
tablet_storage_group_manager::split_all_storage_groups() which also calls
set_split_mode(), and that is the reason why split completes successfully. The
problem is that tablet_storage_group_manager::all_storage_groups_split() runs
under a group0 guard, and tablet_storage_group_manager::split_all_storage_groups()
does not. This can cause problems with operations which should exclude with
compaction group creation. i.e. DROP TABLE/DROP KEYSPACE

(cherry picked from commit 24e8d2a55c)
2025-01-29 14:42:17 +01:00
Aleksandra Martyniuk
be67b4d634 test: add test to check if repair handles no_such_keyspace
(cherry picked from commit 54e7f2819c)
2025-01-28 21:50:10 +00:00
Aleksandra Martyniuk
2143f8ccb6 repair: handle keyspace dropped
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.

Skip the keyspace repair if no_such_keyspace is thrown during preparations.

(cherry picked from commit bfb1704afa)
2025-01-28 21:50:10 +00:00
Kefu Chai
7da4223411 compress: fix compressor initialization order by making namespace_prefix a function
Fixes a race condition where COMPRESSOR_NAME in zstd.cc could be
initialized before compressor::namespace_prefix due to undefined
global variable initialization order across translation units. This
was causing ZstdCompressor to be unregistered in release builds,
making it impossible to create tables with Zstd compression.

Replace the global namespace_prefix variable with a function that
returns the fully qualified compressor name. This ensures proper
initialization order and fixes the registration of the ZstdCompressor.

Fixes scylladb/scylladb#22444
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22451

(cherry picked from commit 4a268362b9)

Closes scylladb/scylladb#22510
2025-01-27 19:50:23 +02:00
Calle Wilund
2b40f405f0 commitlog: Fix assertion in oversized_alloc
Fixes #20633

Cannot assert on actual request_controller when releasing permit, as the
release, if we have waiters in queue, will subtract some units to hand to them.
Instead assert on permit size + waiter status (and if zero, also controller value)

* v2 - use SCYLLA_ASSERT

(cherry picked from commit 58087ef427)

Closes scylladb/scylladb#22455
2025-01-26 16:46:35 +02:00
Kamil Braun
399325f0f0 Merge '[Backport 6.2] raft: Handle non-critical config update errors in when changing voter status.' from Sergey Z
When a node is bootstrapped and joined a cluster as a non-voter and changes it's role to a voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried.

Fixes scylladb/scylladb#20814

Backport: This issue occurs frequently and disrupts the CI workflow to some extent. Backports are needed for versions 6.1 and 6.2.

- (cherry picked from commit 775411ac56)

- (cherry picked from commit 16053a86f0)

- (cherry picked from commit 8c48f7ad62)

- (cherry picked from commit 3da4848810)

- (cherry picked from commit 228a66d030)

Parent PR: #22253

Closes scylladb/scylladb#22358

* github.com:scylladb/scylladb:
  raft: refactor `remove_from_raft_config` to use a timed `modify_config` call.
  raft: Refactor functions using `modify_config` to use a common wrapper for retrying.
  raft: Handle non-critical config update errors in when changing status to voter.
  test: Add test to check that a node does not fail on unknown commit status error when starting up.
  raft: Add run_op_with_retry in raft_group0.
2025-01-24 17:05:50 +01:00
Sergey Zolotukhin
9730f98d34 raft: refactor remove_from_raft_config to use a timed modify_config call.
To avoid potential hangs during the `remove_from_raft_config` operation, use a timed `modify_config` call.
This ensures the operation doesn't get stuck indefinitely.

(cherry picked from commit 228a66d030)
2025-01-22 09:54:28 +01:00
Sergey Zolotukhin
d419fb4a0c raft: Refactor functions using modify_config to use a common wrapper
for retrying.

There are several places in `raft_group0` where almost identical code is
used for retrying `modify_config` in case of `commit_status_unknown`
error. To avoid code duplication all these places were changed to
use a new wrapper `run_op_with_retry`.

(cherry picked from commit 3da4848810)
2025-01-22 09:54:26 +01:00
Lakshmi Narayanan Sreethar
e0189ccac5 sstable_directory: do not load remote sstables in process_descriptor
The sstable loader relied on the generation id to provide an efficient
hint about the shard that owns an sstable. But, this hint was rendered
ineffective with the introduction of UUID generation, as the shard id
was no longer embedded in the generation id. This also became suboptimal
with the introduction of tablets. Commit 0c77f77 addressed this issue by
reading the minimum from disk to determine sstable ownership but this
improvement was lost with commit 63f1969, which optimistically assumed
that hints would work most of the time, which isn't true.

This commit restores that change - shard id of a table is deduced by
reading minially from disk and then the sstable is fully loaded only if
it belongs to the local shard. This patch also adds a testcase to verify
that the sstable are loaded only in their respective shards.

Fixes #21015

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 63100b34da)
2025-01-21 00:48:34 +05:30
Lakshmi Narayanan Sreethar
8191f5d0f4 sstable_directory: reintroduce get_shards_for_this_sstable()
Reintroduce `get_shards_for_this_sstable()` that was removed in commit
ad375fbb. This will be used in the following patch to ensure that an
sstable is loaded only once.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit d2ba45a01f)
2025-01-21 00:48:34 +05:30
Nadav Har'El
bff9ddde12 Merge '[Backport 6.2] view_builder: write status to tables before starting to build' from null
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.

Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.

Fixes #20638

The PR contains few additional small fixes in separate commits related to the view build status table.

It addresses flakiness issues in tests that use the view build status table to determine when view building is complete. The table may be in incorrect state due to these issues, having a row with status STARTED when it actually finished building the view, which will cause us to wait in `wait_for_view` until it timeouts.

For testing I used a test similar to `test_view_build_status_with_replace_node`, but it only creates the views and calls `wait_for_view`. Without these commits it failed in 4/1024 runs, and with the commits it passed 2048/2048.

backport to fix the bugs that affects previous versions and improve CI stability

- (cherry picked from commit b1be2d3c41)

- (cherry picked from commit 1104411f83)

- (cherry picked from commit 7a6aec1a6c)

Parent PR: #22307

Closes scylladb/scylladb#22356

* github.com:scylladb/scylladb:
  view_builder: hold semaphore during entire startup
  view_builder: pass view name by value to write_view_build_status
  view_builder: write status to tables before starting to build
2025-01-19 15:36:44 +02:00
Michael Litvak
e8fde3b0a3 view_builder: hold semaphore during entire startup
Guard the whole view builder startup routine by holding the semaphore
until it's done instead of releasing it early, so that it's not
intercepted by migration notifications.

(cherry picked from commit 7a6aec1a6c)
2025-01-19 11:03:53 +02:00
Michael Litvak
a7f8842776 view_builder: pass view name by value to write_view_build_status
The function write_view_build_status takes two lambda functions and
chooses which of them to run depending on the upgrade state. It might
run both of them.

The parameters ks_name and view_name should be passed by value instead
of by reference because they are moved inside each lambda function.
Otherwise, if both lambdas are run, the second call operates on invalid
values that were moved.

(cherry picked from commit 1104411f83)
2025-01-19 11:03:53 +02:00
Piotr Dulikowski
b88e0f8f74 Merge '[Backport 6.2] main, view: Pair view builder drain with its start' from null
In this PR, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:

(a) The following things happened in order:

    1. The view builder would be constructed.
    2. Right after that, a deferred lambda would be created to stop the
       view builder during shutdown.
    3. group0_service would be started.
    4. A deferred lambda stopping group0_service would be created right
       after that.
    5. The view builder would be started.

(b) Because the view builder depends on group0_client, it couldn't be
    started before starting group0_service. On the other hand, other
    services depend on the view builder, e.g. the stream manager. That
    makes changing the order of initialization a difficult problem,
    so we want to avoid doing that unless we're sure it's the right
    choice.

(c) Since the view builder uses group0_client, there was a possibility
    of running into a segmentation fault issue in the following
    scenario:

    1. A call to `view_builder::mark_view_build_success()` is issued.
    2. We stop group0_service.
    3. `view_builder::mark_view_build_success()` calls
       `announce_with_raft()`, which leads to a use-after-free because
       group0_service has already been destroyed.

      This very scenario took place in scylladb/scylladb#20772.

Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534,
so we revert that patch. These changes are the second attempt
to the problem where we want to solve it in a safer manner.

The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aformentioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.

Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.

Backport: The patch I'm reverting made it to 6.2, so we want to backport this one there too.

Fixes scylladb/scylladb#20772
Fixes scylladb/scylladb#21534

- (cherry picked from commit a5715086a4)

- (cherry picked from commit 06ce976370)

- (cherry picked from commit d1f960eee2)

Parent PR: #21909

Closes scylladb/scylladb#22331

* github.com:scylladb/scylladb:
  test/topology_custom: Add test for Scylla with disabled view building
  main, view: Pair view builder drain with its start
  Revert "main,cql_test_env: start group0_service before view_builder"
2025-01-17 09:57:36 +01:00
Sergey Zolotukhin
2ea97d8c19 raft: Handle non-critical config update errors in when changing status
to voter.

When a node is bootstrapped and joins a cluster as a non-voter, errors can occur while committing
a new Raft record, for instance, if the Raft leader changes during this time. These errors are not
critical and should not cause a node crash, as the action can be retried.

Fixes scylladb/scylladb#20814

(cherry picked from commit 8c48f7ad62)
2025-01-16 20:09:31 +00:00
Sergey Zolotukhin
36ff3e8f5f test: Add test to check that a node does not fail on unknown commit status
error when starting up.

Test that a node is starting successfully if while joining a cluster and becoming a voter, it
receives an unknown commit status error.

Test for scylladb/scylladb#20814

(cherry picked from commit 16053a86f0)
2025-01-16 20:09:31 +00:00
Sergey Zolotukhin
325cdd3ebc raft: Add run_op_with_retry in raft_group0.
Since when calling `modify_config` it's quite often we need to do
retries, to avoid code duplication, a function wrapper that allows
a function to be called with automatic retries in case of failures
was added.

(cherry picked from commit 775411ac56)
2025-01-16 20:09:30 +00:00
Michael Litvak
c5bdc9e58f view_builder: write status to tables before starting to build
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.

Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.

Fixes scylladb/scylladb#20638

(cherry picked from commit b1be2d3c41)
2025-01-16 20:08:36 +00:00
Kamil Braun
cbef20e977 Merge '[Backport 6.2] Fix possible data corruption due to token keys clashing in read repair.' from Sergey
This update addresses an issue in the mutation diff calculation algorithm used during read repair. Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on the Murmur3 hash function, it could generate duplicate values for different partition keys, causing corruption in the affected rows' values.

Fixes scylladb/scylladb#19101

Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2

- (cherry picked from commit e577f1d141)

- (cherry picked from commit 39785c6f4e)

- (cherry picked from commit 155480595f)

Parent PR: #21996

Closes scylladb/scylladb#22298

* github.com:scylladb/scylladb:
  storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function.
  storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap.
  test: Add test case for checking read repair diff calculation when having conflicting keys.
2025-01-16 17:13:12 +01:00
Dawid Mędrek
833ea91940 test/topology_custom: Add test for Scylla with disabled view building
Before this commit, there doesn't seem to have been a test verifying that
starting and shutting down Scylla behave correctly when the configuration
option `view_building` is set to false. In these changes, we add one.

(cherry picked from commit d1f960eee2)
2025-01-16 14:12:31 +01:00
Dawid Mędrek
1200d3b735 main, view: Pair view builder drain with its start
In these changes, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:

(a) The following things happened in order:

    1. The view builder would be constructed.
    2. Right after that, a deferred lambda would be created to stop the
       view builder during shutdown.
    3. group0_service would be started.
    4. A deferred lambda stopping group0_service would be created right
       after that.
    5. The view builder would be started.

(b) Because the view builder depends on group0_client, it couldn't be
    started before starting group0_service. On the other hand, other
    services depend on the view builder, e.g. the stream manager. That
    makes changing the order of initialization a difficult problem,
    so we want to avoid doing that unless we're sure it's the right
    choice.

(c) Since the view builder uses group0_client, there was a possibility
    of running into a segmentation fault issue in the following
    scenario:

    1. A call to `view_builder::mark_view_build_success()` is issued.
    2. We stop group0_service.
    3. `view_builder::mark_view_build_success()` calls
       `announce_with_raft()`, which leads to a use-after-free because
       group0_service has already been destroyed.

      This very scenario took place in scylladb/scylladb#20772.

Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534.
We reverted that change in the previous commit. These changes are the
second attempt to the problem where we want to solve it in a safer manner.

The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aformentioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.

Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.

Fixes scylladb/scylladb#20772

(cherry picked from commit 06ce976370)
2025-01-16 12:37:04 +01:00
Dawid Mędrek
84b774515b Revert "main,cql_test_env: start group0_service before view_builder"
The patch solved a problem related to an initialization order
(scylladb/scylladb#20772), but we ran into another one: scylladb/scylladb#21534.
After moving the initialization of group0_service, it ended up being destroyed
AFTER the CDC generation service would. Since CDC generations are accessed
in `storage_service::topology_state_load()`:

```
for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
    rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
    co_await _cdc_gens.local().handle_cdc_generation(gen_id);
```

we started getting the following failure:

```
Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed.
```

We're reverting the patch to go back to a more stable version of Scylla
and in the following commit, we'll solve the original issue in a more
systematic way.

This reverts commit 7bad8378c7.

(cherry picked from commit a5715086a4)
2025-01-16 12:36:41 +01:00
Sergey Zolotukhin
06a8956174 test: Include parent test name in ScyllaClusterManager log file names.
Add the test file name to `ScyllaClusterManager` log file names alongside the test function name.
This avoids race conditions when tests with the same function names are executed simultaneously.

Fixes scylladb/scylladb#21807

Backport: not needed since this is a fix in the testing scripts.

Closes scylladb/scylladb#22192

(cherry picked from commit 2f1731c551)

Closes scylladb/scylladb#22249
2025-01-14 16:33:27 +01:00
Sergey Zolotukhin
f0f833e8ab storage_proxy/read_repair: Remove redundant 'schema' parameter from data_read_resolver::resolve
function.

The `data_read_resolver` class inherits from `abstract_read_resolver`, which already includes the
`schema_ptr _schema` member. Therefore, using a separate function parameter in `data_read_resolver::resolve`
initialized with the same variable in `abstract_read_executor` is redundant.

(cherry picked from commit 155480595f)
2025-01-14 14:43:52 +01:00
Sergey Zolotukhin
b04b6aad9e storage_proxy/read_repair: Use partition_key instead of token key for mutation
diff calculation hashmap.

This update addresses an issue in the mutation diff calculation algorithm used during read repair.
Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on
the Murmur3 hash function, it could generate duplicate values for different partition keys, causing
corruption in the affected rows' values.

Fixes scylladb/scylladb#19101

(cherry picked from commit 39785c6f4e)
2025-01-14 14:37:47 +01:00
Sergey Zolotukhin
12ee41869a test: Add test case for checking read repair diff calculation when having
conflicting keys.

The test updates two rows with keys that result in a Murmur3 hash collision, which
is used to generate Scylla tokens. These tokens are involved in read repair diff
calculations. Due to the identical token values, a hash map key collision occurs.
Consequently, an incorrect value from the second row (with a different primary key)
is then sent for writing as 'repaired', causing data corruption.

(cherry picked from commit e577f1d141)
2025-01-13 22:05:32 +00:00
Kamil Braun
ef93c3a8d7 Merge '[Backport 6.2] cache_algorithm_test: fix flaky failures' from Michał Chojnowski
This series attempts to get read of flakiness in cache_algorithm_test by solving two problems.

Problem 1:

The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form: 0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.

Problem 2:

Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)

Fixes scylladb/scylladb#21536

(cherry picked from commit 1fffd976a4)
(cherry picked from commit 6caaead4ac)

Parent PR: scylladb/scylladb#21948

Closes scylladb/scylladb#22228

* github.com:scylladb/scylladb:
  cache_algorithm_test: harden against stats being confused by background activity
  cache_algorithm_test: fix a use of an uninitialized variable
2025-01-09 14:30:31 +01:00
Aleksandra Martyniuk
a59c4653fe repair: check tasks local to given shard
Currently task_manager_module::is_aborted checks the tasks local
to caller's shard on a given shard.

Fix the method to check the task map local to the given shard.

Fixes: #22156.

Closes scylladb/scylladb#22161

(cherry picked from commit a91e03710a)

Closes scylladb/scylladb#22197
2025-01-08 13:07:38 +02:00
Yaron Kaikov
2e87e317d9 .github/scripts/auto-backport.py: Add comment to PR when conflicts apply
When we open a PR with conflicts, the PR owner gets a notification about the assignment but has no idea if this PR is with conflicts or not (in Scylla it's important since CI will not start on draft PR)

Let's add a comment to notify the user we have conflicts

Closes scylladb/scylladb#21939

(cherry picked from commit 2e6755ecca)

Closes scylladb/scylladb#22190
2025-01-08 13:07:08 +02:00
Botond Dénes
f73f7c17ec Merge 'sstables_manager: do not reclaim unlinked sstables' from Lakshmi Narayanan Sreethar
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.

Added a testcase to verify the fix.

Fixes #21887

This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions.

Closes scylladb/scylladb#21895

* github.com:scylladb/scylladb:
  sstables_manager: reclaim memory from sstables on unlink
  sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
  sstables: introduce disable_component_memory_reload()
  sstables_manager: log sstable name when reclaiming components

(cherry picked from commit d4129ddaa6)

Closes scylladb/scylladb#21998
2025-01-08 13:06:24 +02:00
Michał Chojnowski
379f23d854 cache_algorithm_test: harden against stats being confused by background activity
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)

(cherry picked from commit 6caaead4ac)
2025-01-08 11:48:23 +01:00
Michał Chojnowski
10815d2599 cache_algorithm_test: fix a use of an uninitialized variable
The test needs to create some arbitrary partition keys of a given size.
It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass --
the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used.
But if some background task happens to overwrite the allocator slot during a
preemption, the keys used during "SELECT" will be different than the keys used
during "INSERT", and the test will fail due to extra cache misses.

(cherry picked from commit 1fffd976a4)
2025-01-08 11:48:17 +01:00
Patryk Jędrzejczak
a63a0eac1e [Backport 6.2] raft: improve logs for abort while waiting for apply
New logs allow us to easily distinguish two cases in which
waiting for apply times out:
- the node didn't receive the entry it was waiting for,
- the node received the entry but didn't apply it in time.

Distinguishing these cases simplifies reasoning about failures.
The first case indicates that something went wrong on the leader.
The second case indicates that something went wrong on the node
on which waiting for apply timed out.

As it turns out, many different bugs result in the `read_barrier`
(which calls `wait_for_apply`) timeout. This change should help
us in debugging bugs like these.

We want to backport this change to all supported branches so that
it helps us in all tests.

Fixes scylladb/scylladb#22160

Closes scylladb/scylladb#22159
2025-01-07 17:01:22 +01:00
Kamil Braun
afd588d4c7 Merge '[Backport 6.2] Do not reset quarantine list in non raft mode' from Gleb
The series contains small fixes to the gossiper one of which fixes #21930. Others I noticed while debugged the issue.

Fixes: #21930

(cherry picked from commit 91cddcc17f)

Parent PR: #21956

Closes scylladb/scylladb#21991

* github.com:scylladb/scylladb:
  gossiper: do not reset _just_removed_endpoints in non raft mode
  gossiper: do not call apply for the node's old state
2025-01-03 16:28:08 +01:00
Abhinav
f5bce45399 Fix gossiper orphan node floating problem by adding a remover fiber
In the current scenario, if during startup, a node crashes after initiating gossip and before joining group0,
then it keeps floating in the gossiper forever because the raft based gossiper purging logic is only effective
once node joins group0. This orphan node hinders the successor node from same ip to join cluster since it collides
with it during gossiper shadow round.

This commit intends to fix this issue by adding a background thread which periodically checks for such orphan entries in
gossiper and removes them.

A test is also added in to verify this logic. This test fails without this background thread enabled, hence
verifying the behavior.

Fixes: scylladb/scylladb#20082

Closes scylladb/scylladb#21600

(cherry picked from commit 6c90a25014)

Closes scylladb/scylladb#21822
2025-01-02 14:57:46 +01:00
Gleb Natapov
cda997fe59 gossiper: do not reset _just_removed_endpoints in non raft mode
By the time the function is called during start it may already be
populated.

Fixes: scylladb/scylladb#21930
(cherry picked from commit e318dfb83a)
2024-12-25 12:01:16 +02:00
Gleb Natapov
155a0462d5 gossiper: do not call apply for the node's old state
If a nodes changed its address an old state may be still in a gossiper,
so ignore it.

(cherry picked from commit e80355d3a1)
2024-12-23 11:47:12 +02:00
Piotr Dulikowski
76b1173546 Merge 'service/topology_coordinator: migrate view builder only if all nodes are up' from Michał Jadwiszczak
The migration process is doing read with consistency level ALL,
requiring all nodes to be alive.

Fixes scylladb/scylladb#20754

The PR should be backported to 6.2, this version has view builder on group0.

Closes scylladb/scylladb#21708

* github.com:scylladb/scylladb:
  test/topology_custom/test_view_build_status: add reproducer
  service/topology_coordinator: migrate view builder only if all nodes are up

(cherry picked from commit def51e252d)

Closes scylladb/scylladb#21850
2024-12-19 14:10:55 +01:00
Piotr Dulikowski
e5a37d63c0 Merge 'transport/server: revert using async function in for_each_gently()' from Michał Jadwiszczak
This patch reverts 324b3c43c0 and adds synchronous versions of `service_level_controller::find_effective_service_level()` and `client_state::maybe_update_per_service_level_params()`.

It isn't safe to do asynchronous calls in `for_each_gently`, as the
connection may be disconnected while a call in callback preempts.

Fixes scylladb/scylladb#21801

Closes scylladb/scylladb#21761

* github.com:scylladb/scylladb:
  Revert "generic_server: use async function in `for_each_gently()`"
  transport/server: use synchronous calls in `for_each_gently` callback
  service/client_state: add synchronous method to update service level params
  qos/service_level_controller: add `find_cached_effective_service_level`

(cherry picked from commit c601f7a359)

Closes scylladb/scylladb#21849
2024-12-19 14:10:31 +01:00
Tomasz Grabiec
0851e3fba7 Merge '[Backport 6.2] utils: cached_file: Mark permit as awaiting on page miss' from ScyllaDB
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO (relies on https://github.com/scylladb/scylladb/pull/20522).
But there is IO during promoted index search (via cached_file).

Before the patch this workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or partition index IO involved
because that IO will signal read concurrency semaphore to invite more concurrency.

Fixes #21325

(cherry picked from commit 868f5b59c4)

(cherry picked from commit 0f2101b055)

Refs #21323

Closes scylladb/scylladb#21358

* github.com:scylladb/scylladb:
  utils: cached_file: Mark permit as awaiting on page miss
  utils: cached_file: Push resource_unit management down to cached_file
2024-12-16 19:55:00 +01:00
Michael Litvak
99f190f699 service/qos/service_level_controller: update cache on startup
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.

The cache update depends on the service level data accessor being set
and the auth service being initialized. Before the commit, it may happen that a
cache update is not triggered after the initialization. The commit adds
an explicit call to update the cache where it is guaranteed to be ready.

Fixes scylladb/scylladb#21763

Closes scylladb/scylladb#21773

(cherry picked from commit 373855b493)

Closes scylladb/scylladb#21893
2024-12-16 14:19:06 +01:00
Michael Litvak
04e8506cbb service/qos: increase timeout of internal get_service_levels queries
The function get_service_levels is used to retrieve all service levels
and it is called from multiple different contexts.
Importantly, it is called internally from the context of group0 state reload,
where it should be executed with a long timeout, similarly to other
internal queries, because a failure of this function affects the entire
group0 client, and a longer timeout can be tolerated.
The function is also called in the context of the user command LIST
SERVICE LEVELS, and perhaps other contexts, where a shorter timeout is
preferred.

The commit introduces a function parameter to indicate whether the
context is internal or not. For internal context, a long timeout is
chosen for the query. Otherwise, the timeout is shorter, the same as
before. When the distinction is not important, a default value is
chosen which maintains the same behavior.

The main purpose is to fix the case where the timeout is too short and causes
a failure that propagates and fails the group0 client.

Fixes scylladb/scylladb#20483

Closes scylladb/scylladb#21748

(cherry picked from commit 53224d90be)

Closes scylladb/scylladb#21890
2024-12-16 14:15:26 +01:00
Yaron Kaikov
8e606a239f github: check if PR is closed instead of merge
In Scylla, we can have either `closed` or `merged` PRs. Based on that we decide when to start the backport process when the label was added after the PR is closed (or merged),

In https://github.com/scylladb/scylladb/pull/21876 even when adding the proper backport label didn't trigger the backport automation. Https://github.com/scylladb/scylladb/pull/21809/ caused this, we should have left the `state=closed` (this includes both closed and merged PR)

Fixing it

Closes scylladb/scylladb#21906

(cherry picked from commit b4b7617554)

Closes scylladb/scylladb#21922
2024-12-16 14:07:32 +02:00
Jenkins Promoter
8cdff8f52f Update ScyllaDB version to: 6.2.3 2024-12-15 15:55:40 +02:00
Kamil Braun
3ec741dbac Merge 'topology_coordinator: introduce reload_count in topology state and use it to prevent race' from Gleb Natapov
Topology request table may change between the code reading it and
calling to cv::when() since reading is a preemption point. In this
case cv:signal can be missed. Detect that there was no signal in between
reading and waiting by introducing reload_count which is increased each
time the state is reloaded and signaled. If the counter is different
before and after reading the state may have change so re-check it again
instead of sleeping.

Closes scylladb/scylladb#21713

* github.com:scylladb/scylladb:
  topology_coordinator: introduce reload_count in topology state and use it to prevent race
  storage_service: use conditional_variable::when in co-routines consistently

(cherry picked from commit 8f858325b6)

Closes scylladb/scylladb#21803
2024-12-12 15:45:31 +01:00
Anna Stuchlik
e3e7ac16e9 doc: remove wrong image upgrade info (5.2-to-2023.1)
This commit removes the information about the recommended way of upgrading
ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade
procedure is not supported (it was implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

Closes scylladb/scylladb#21876
Fixes https://github.com/scylladb/scylla-enterprise/issues/5041
Fixes https://github.com/scylladb/scylladb/issues/21898

(cherry picked from commit 98860905d8)
2024-12-12 15:22:26 +02:00
Tomasz Grabiec
81d6d88016 utils: cached_file: Mark permit as awaiting on page miss
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO. But there is IO during
promoted index search (via cached_file). Before the patch this
workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or index
file IO involved because that IO will signal read concurrency
semaphore to invite more concurrency.

(cherry picked from commit 0f2101b055)
2024-12-09 23:18:00 +01:00
Tomasz Grabiec
56f93dd434 utils: cached_file: Push resource_unit management down to cached_file
It saves us permit operations on the hot path when we hit in cache.

Also, it will lay the ground for marking the permit as awaiting later.

(cherry picked from commit 868f5b59c4)
2024-12-09 23:17:56 +01:00
Kefu Chai
28a32f9c50 github: do not nest ${{}} inside condition
In commit 2596d157, we added a condition to run auto-backport.py only
when the GitHub Action is triggered by a push to the default branch.
However, this introduced an unexpected error due to incorrect condition
handling.

Problem:
- `github.event.before` evaluates to an empty string
- GitHub Actions' single-pass expression evaluation system causes
  the step to always execute, regardless of `github.event_name`

Despite GitHub's documentation suggesting that ${{ }} can be omitted,
it recommends using explicit ${{}} expressions for compound conditions.

Changes:
- Use explicit ${{}} expression for compound conditions
- Avoid string interpolation in conditional statements

Root Cause:
The previous implementation failed because of how GitHub Actions
evaluates conditional expressions, leading to an unintended script
execution and a 404 error when attempting to compare commits.

Example Error:

```
  python .github/scripts/auto-backport.py --repo scylladb/scylladb --base-branch refs/heads/master --commits ..2b07d93beac7bc83d955dadc20ccc307f13f20b6
  shell: /usr/bin/bash -e {0}
  env:
    DEFAULT_BRANCH: master
    GITHUB_TOKEN: ***
Traceback (most recent call last):
  File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 201, in <module>
    main()
  File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 162, in main
    commits = repo.compare(start_commit, end_commit).commits
  File "/usr/lib/python3/dist-packages/github/Repository.py", line 888, in compare
    headers, data = self._requester.requestJsonAndCheck(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 353, in requestJsonAndCheck
    return self.__check(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 378, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/commits/commits#compare-two-commits", "status": "404"}
```

Fixes scylladb/scylladb#21808
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21809

(cherry picked from commit e04aca7efe)

Closes scylladb/scylladb#21820
2024-12-06 16:34:59 +02:00
Avi Kivity
20b2d5b7c9 Merge 'compaction: update maintenance sstable set on scrub compaction completion' from Lakshmi Narayanan Sreethar
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.

Fixes #20030

This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.

Closes scylladb/scylladb#21582

* github.com:scylladb/scylladb:
  compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
  compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
  compaction_group: rename `update_main_sstable_list_on_compaction_completion`
  compaction_group: update maintenance sstable set on scrub compaction completion
  compaction_group: store table::sstable_list_builder::result in replacement_desc
  table::sstable_list_builder: remove old sstables only from current list
  table::sstable_list_builder: return removed sstables from build_new_list

(cherry picked from commit 58baeac0ad)

Closes scylladb/scylladb#21790
2024-12-06 10:36:46 +02:00
Michael Pedersen
f37deb7e98 docs: correct the storage size for n2-highmem-32 to 9000GB
updated storage size for n2-highmem-32 to 9000GB as this is default in SC

Fixes scylladb/scylladb#21785
Closes scylladb/scylladb#21537

(cherry picked from commit 309f1606ae)

Closes scylladb/scylladb#21595
2024-12-05 09:51:11 +02:00
Tomasz Grabiec
933ec7c6ab utils: UUID: Make get_time_UUID() respect the clock offset
schema_change_test currently fails due to failure to start a cql test
env in unit tests after the point where this is called (in one of the
test cases):

   forward_jump_clocks(std::chrono::seconds(60*60*24*31));

The problem manifests with a failure to join the cluster due to
missing_column exception ("missing_column: done") being thrown from
system_keyspace::get_topology_request_state(). It's a symptom of
join request being missing in system.topology_requests. It's missing
because the row is expired.

When request is created, we insert the
mutations with intended TTL of 1 month. The actual TTL value is
computed like this:

  ttl_opt topology_request_tracking_mutation_builder::ttl() const {
      return std::chrono::duration_cast<std::chrono::seconds>(std::chrono::microseconds(_ts)) + std::chrono::months(1)
          - std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch());
  }

_ts comes from the request_id, which is supposed to be a timeuuid set
from current time when request starts. It's set using
utils::UUID_gen::get_time_UUID(). It reads the system clock without
adding the clock offset, so after forward_jump_clocks(), _ts and
gc_clock::now() may be far off. In some cases the accumulated offset
is larger than 1month and the ttl becomes negative, causing the
request row to expire immediately and failing the boot sequence.

The fix is to use db_clock, which respects offsets and is consistent
with gc_clock.

The test doesn't fail in CI becuase there each test case runs in a
separate process, so there is no bootstrap attempt (by new cql test
env) after forward_jump_clocks().

Closes scylladb/scylladb#21558

(cherry picked from commit 1d0c6aa26f)

Closes scylladb/scylladb#21584

Fixes #21581
2024-12-04 14:18:16 +01:00
Kefu Chai
2b5cd10b66 docs: explain task status retention and one-time query behavior
Task status information from nodetool commands is not retained permanently:

- Status of completed tasks is only kept for `task_ttl_in_seconds`
- Status is removed after being queried, making it a one-time operation

This behavior is important for users to understand since subsequent
queries for the same completed task will not return any information.
Add documentation to make this clear to users.

Fixes scylladb/scylladb#21757
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21386

(cherry picked from commit afeff0a792)

Closes scylladb/scylladb#21759
2024-12-04 13:49:24 +02:00
Kefu Chai
bf47de9f7f test: topology_custom: ensure node visibility before keyspace creation
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.

Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency

Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions

Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios

Fixes scylladb/scylladb#21724

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21726

(cherry picked from commit 65949ce607)

Closes scylladb/scylladb#21734
2024-12-04 13:46:26 +02:00
Nadav Har'El
dc71a6a75e Merge 'test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime' from Dawid Mędrek
Before these changes, we didn't wait for the materialized views to
finish building before writing to the base table. That led to generating
an additional view update, which, in turn, led to test failures.

The scenario corresponding to the summary above looked like this:

1. The test creates an empty table and MVs on it.
2. The view builder starts, but it doesn't finish immediately.
3. The test performs mutations to the base table. Since the views
   already exist, view updates are generated.
4. Finally, the view builder finishes. It notices that the base
   table has a row, so it generates a view update for it because
   it doesn't notice that we already have data in the view.

We solve it by explicitly waiting for both views to finish building
and only then start writing to the base table.

Additionally, we also fix a lifetime issue of the row the test revolves
around, further stabilizing CI.

Fixes https://github.com/scylladb/scylladb/issues/20889

Backport: These changes have no semantic effect on the codebase,
but they stabilize CI, so we want to backport them to the maintained
versions of Scylla.

Closes scylladb/scylladb#21632

* github.com:scylladb/scylladb:
  test/boost/view_schema_test.cc: Increase TTL in test_view_update_generating_writetime
  test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime

(cherry picked from commit 733a4f94c7)

Closes scylladb/scylladb#21640
2024-12-04 13:44:33 +02:00
Aleksandra Martyniuk
f13f821b31 repair: implement tablet_repair_task_impl::release_resources
tablet_repair_task_impl keeps a vector of tablet_repair_task_meta,
each of which keeps an effective_replication_map_ptr. So, after
the task completes, the token metadata version will not change for
task_ttl seconds.

Implement tablet_repair_task_impl::release_resources method that clears
tablet_repair_task_meta vector when the task finishes.

Set task_ttl to 1h in test_tablet_repair to check whether the test
won't time out.

Fixes: #21503.

Closes scylladb/scylladb#21504

(cherry picked from commit 572b005774)

Closes scylladb/scylladb#21622
2024-12-04 13:43:40 +02:00
André LFA
74ad6f2fa3 Update report-scylla-problem.rst removing references to old Health Check Report
Closes scylladb/scylladb#21467

(cherry picked from commit 703e6f3b1f)

Closes scylladb/scylladb#21591
2024-12-04 13:41:00 +02:00
Abhinav
fc42571591 test: Parametrize 'replacement with inter-dc encryption' test to confirm behavior in zero token node cases.
In the current scenario, 'test_replace_with_encryption' only confirms the replacement with inter-dc encryption
for normal nodes. This commit increases the coverage of test by parametrizing the test to confirm behavior
for zero token node replacement as well. This test also implicitly provides
coverage for bootstrap with encryption of zero token nodes.

This PR increases coverage for existing code. Hence we need to backport it. Since only 6.2 version has zero
token node support, hence we only backport it to 6.2

Fixes: scylladb/scylladb#21096

Closes scylladb/scylladb#21609

(cherry picked from commit acd643bd75)

Closes scylladb/scylladb#21764
2024-12-04 11:22:39 +01:00
Botond Dénes
c6ef055e9c Merge 'repair: fix task_manager_module::abort_all_repairs' from Aleksandra Martyniuk
Currently, task_manager_module::abort_all_repairs marks top-level repairs as aborted (but does not abort them) and aborts all existing shard tasks.

A running repair checks whether its id isn't contained in _aborted_pending_repairs and then proceeds to create shard tasks. If abort_all_repairs is executed after _aborted_pending_repairs is checked but before shard tasks are created, then those new tasks won't be aborted. The issue is the most severe for tablet_repair_task_impl that checks the _aborted_pending_repairs content from different shards, that do not see the top-level task. Hence the repair isn't stopped but it creates shard repair tasks on all shards but the one that initialized repair.

Abort top-level tasks in abort_all_repairs. Fix the shard on which the task abort is checked.

Fixes: #21612.

Needs backport to 6.1 and 6.2 as they contain the bug.

Closes scylladb/scylladb#21616

* github.com:scylladb/scylladb:
  test: add test to check if repair is properly aborted
  repair: add shard param to task_manager_module::is_aborted
  repair: use task abort source to abort repair
  repair: drop _aborted_pending_repairs and utilize tasks abort mechanism
  repair: fix task_manager_module::abort_all_repairs

(cherry picked from commit 5ccbd500e0)

Closes scylladb/scylladb#21642
2024-11-21 06:33:31 +02:00
Nadav Har'El
6ba0253dd3 alternator: fix "/localnodes" to not return down nodes
Alternator's "/localnodes" HTTP requests is supposed to return the list
of nodes in the local DC to which the user can send requests.

Before commit bac7c33313 we used the
gossiper is_alive() method to determine if a node should be returned.
That commit changed the check to is_normal() - because a node can be
alive but in non-normal (e.g., joining) state and not ready for
requests.

However, it turns out that checking is_normal() is not enough, because
if node is stopped abruptly, other nodes will still consider it "normal",
but down (this is so-called "DN" state). So we need to check **both**
is_alive() and is_normal().

This patch also adds a test reproducing this case, where a node is
shut down abruptly. Before this patch, the test failed ("/localnodes"
continued to return the dead node), and after it it passes.

Fixes #21538

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21540

(cherry picked from commit 7607f5e33e)

Closes scylladb/scylladb#21634
2024-11-20 09:22:59 +02:00
Anna Stuchlik
d9eb502841 doc: add the 6.0-to-2024.2 upgrade guide-from-6
This commit adds an upgrade guide from ScyllDB 6.0
to ScyllaDB Enterprise 2024.2.

Fixes https://github.com/scylladb/scylladb/issues/20063
Fixes https://github.com/scylladb/scylladb/issues/20062
Refs https://github.com/scylladb/scylla-enterprise/issues/4544

(cherry picked from commit 3d4b7e41ef)

Closes scylladb/scylladb#21620
2024-11-18 17:28:44 +02:00
Emil Maskovsky
0c7c6f85e0 test/topology_custom: fix the flaky test_raft_recovery_stuck
The test is only sending a subset of the running servers for the rolling
restart. The rolling restart is checking the visibility of the restarted
node agains the other nodes, but if that set is incomplete some of the
running servers might not have seen the restarted node yet.

Improved the manager client rolling restart method to consider all the
running nodes for checking the restarted node visibility.

Fixes: scylladb/scylladb#19959

Closes scylladb/scylladb#21477

(cherry picked from commit 92db2eca0b)

Closes scylladb/scylladb#21556
2024-11-15 10:37:20 +02:00
Kefu Chai
2480decbc7 doc: import the new pub keys used to sign the package
before this change, when user follows the instruction, they'd get

```console
$ sudo apt-get update
Hit:1 http://us-east-1.ec2.archive.ubuntu.com/ubuntu noble InRelease
Hit:2 http://us-east-1.ec2.archive.ubuntu.com/ubuntu noble-updates InRelease
Hit:3 http://us-east-1.ec2.archive.ubuntu.com/ubuntu noble-backports InRelease
Hit:4 http://security.ubuntu.com/ubuntu noble-security InRelease
Get:5 https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease [7550 B]
Err:5 https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease
 The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A43E06657BAC99E3
Reading package lists... Done
W: GPG error: https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease: The following signatures couldn't be verified because the public key is not av
ailable: NO_PUBKEY A43E06657BAC99E3
E: The repository 'https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
```

because the packages were signed with a different keyring.

in this change, we import the new pubkey, so that the pacakge manager
can
verify the new packages (2024.2+ and 6.2+) signed with the new key.

see also https://github.com/scylladb/scylla-ansible-roles/issues/399
and https://forum.scylladb.com/t/release-scylla-manager-3-3-1/2516
for the annonucement on using the new key.

Fixes scylladb/scylladb#21557
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21524

(cherry picked from commit 1cedc45c35)

Closes scylladb/scylladb#21588
2024-11-15 10:36:44 +02:00
Botond Dénes
687a18db38 Merge 'scylla_raid_setup: fix failure on SELinux package installation' from Takuya ASADA
After merged 5a470b2bfb, we found that scylla_raid_setup fails on offline mode
installation.
This is because pkg_install() just print error and exit script on offline mode, instead of installing packages since offline mode not supposed able to connect
internet.
Seems like it occur because of missing "policycoreutils-python-utils"
package, which is the package for "semange" command.
So we need to implement the relabeling patch without using the command.

Fixes https://github.com/scylladb/scylladb/issues/21441

Also, since Amazon Linux 2 has different package name for semange, we need to
adjust package name.

Fixes https://github.com/scylladb/scylladb/issues/21351

Closes scylladb/scylladb#21474

* github.com:scylladb/scylladb:
  scylla_raid_setup: support installing semanage on Amazon Linux 2
  scylla_raid_setup: fix failure on SELinux package installation

(cherry picked from commit 1c212df62d)

Closes scylladb/scylladb#21547
2024-11-14 15:51:06 +02:00
Botond Dénes
548170fb68 Merge '[Backport 6.2] compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors' from ScyllaDB
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them by continue with shutdown.

stop_ongoing_compactions, in particular, currently returns the status
of stopped compaction tasks from `stop_tasks`, but still all tasks
must be stopped after it, even if they failed, so assert that
and ignore the errors.

Fixes scylladb/scylladb#21159

* Needs backport to 6.2 and 6.1, as commit 8cc99973eb causes handles storage that might cause compaction tasks to fail and eventually terminate on shudown when the exceptions are thrown in noexcept context in the deferred stop destructor body

(cherry picked from commit e942c074f2)

(cherry picked from commit d8500472b3)

(cherry picked from commit c08ba8af68)

(cherry picked from commit a7a55298ea)

(cherry picked from commit 6cce67bec8)

Refs #21299

Closes scylladb/scylladb#21434

* github.com:scylladb/scylladb:
  compaction_manager: stop: await _stop_future if engaged
  compaction_manager: really_do_stop:  assert that no tasks are left behind
  compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
  compaction/compaction_manager: stop_tasks(): unlink stopped tasks
  compaction/compaction_manager: make _tasks an intrusive list
2024-11-14 06:59:52 +02:00
Jenkins Promoter
75b79a30da Update ScyllaDB version to: 6.2.2 2024-11-13 23:22:52 +02:00
Benny Halevy
bdf31d7f54 compaction_manager: stop: await _stop_future if engaged
The current condition that consults the compaction manager
state for awaiting `_stop_future` works since _stop_future
is assigned after the state is set to `stopped`, but it is
incidental.  What matters is that `_stop_future` is engaged.

While at it, exchange _stop_future with a ready future
so that stop() can be safely called multiple times.
And dropped the superfluous co_return.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6cce67bec8)
2024-11-13 10:00:47 +02:00
Benny Halevy
3d915cd091 compaction_manager: really_do_stop: assert that no tasks are left behind
stop_ongoing_compactions now ignores any errors returned
by tasks, and it should leave no task left behind.
Assert that here, before the compaction_manager is destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a7a55298ea)
2024-11-13 09:59:57 +02:00
Benny Halevy
abb26ff913 compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them but continue with shutdown.

Leaked errors on the stop path may cause termination
on shutdown, when called in a deferred action destructor.

Fixes scylladb/scylladb#21298

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c08ba8af68)
2024-11-13 09:56:42 +02:00
Botond Dénes
3f821b7f4f compaction/compaction_manager: stop_tasks(): unlink stopped tasks
Stopped tasks currently linger in _tasks until the fiber that created
the task is scheduled again and unlinks the task. This window between
stop and remove prevents reliable checks for empty _tasks list after all
tasks are stopped.
Unlink the task early so really_do_stop() can safely check for an empty
_tasks list (next patch).

(cherry picked from commit d8500472b3)
2024-11-13 09:56:21 +02:00
Botond Dénes
cab3b86240 compaction/compaction_manager: make _tasks an intrusive list
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.

Code using _task has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.

(cherry picked from commit e942c074f2)
2024-11-13 09:48:00 +02:00
Piotr Dulikowski
2fa4f3a9fc Merge 'main,cql_test_env: start group0_service before view_builder' from Michał Jadwiszczak
In scylladb/scylladb#19745, view_builder was migrated to group0 and since then it is dependant on group0_service.
Because of this, group0_service should be initialized/destroyed before/after view_builder.

This patch also adds error injection to `raft_server_with_timeouts::read_barrier`, which does 1s sleep before doing the read barrier. There is a new test which reproduces the use after free bug using the error injection.

Fixes scylladb/scylladb#20772

scylladb/scylladb#19745 is present in 6.2, so this fix should be backported to it.

Closes scylladb/scylladb#21471

* github.com:scylladb/scylladb:
  test/boost/secondary_index_test: add test for use after free
  api/raft: use `get_server_with_timeouts().read_barrier()` in coroutines
  main,cql_test_env: start group0_service before view_builder

(cherry picked from commit 7021efd6b0)

Closes scylladb/scylladb#21506
2024-11-12 14:36:06 +01:00
Yaron Kaikov
a3e69cc8fb ./github/workflows/add-label-when-promoted.yaml: Run auto-backport only on default branch
In https://github.com/scylladb/scylladb/pull/21496#event-15221789614
```
scylladbbot force-pushed the backport/21459/to-6.1 branch from 414691c to 59a4ccd Compare 2 days ago
```

Backport automation triggered by `push` but also should either start from `master` branch (or `enterprise` branch from Enterprise), we need to verify it by checking also the default branch.

Fixes: https://github.com/scylladb/scylladb/issues/21514

Closes scylladb/scylladb#21515

(cherry picked from commit 2596d1577b)

Closes scylladb/scylladb#21531
2024-11-11 17:43:54 +02:00
Michał Chojnowski
876017efee mvcc_test: fix a benign failure of test_apply_to_incomplete_respects_continuity
For performance reasons, mutation_partition_v2::maybe_drop(), and by extension
also mutation_partition_v2::apply_monotonically(mutation_partition_v2&&)
can evict empty row entries, and hence change the continuity of the merged
entry.

For checking that apply_to_incomplete respects continuity,
test_apply_to_incomplete_respects_continuity obtains the continuity of
the partition entry before and after apply_to_incomplete by calling
e.squashed().get_continuity(). But squashed() uses apply_monotonically(),
so in some circumstances the result of squashed() can have smaller
continuity than the argument of squashed(), which messes with the thing
that the test is trying to check, and causes spurious failures.

This patch changes the method of calculating the continuity set,
so that it matches the entry exactly, fixing the test failures.

Fixes scylladb/scylladb#13757

Closes scylladb/scylladb#21459

(cherry picked from commit 35921eb67e)

Closes scylladb/scylladb#21497
2024-11-08 15:32:24 +01:00
Yaron Kaikov
9eed1d1cbd .github/scripts/auto-backport.py: update method to get closed prs
`commit.get_pulls()` in PyGithub returns pull requests that are directly associated with the given commit

Since in closed PR. the relevant commit is an event type, the backport
automation didn't get the PR info for backporting

Ref: https://github.com/scylladb/scylladb/issues/18973

Closes scylladb/scylladb#21468

(cherry picked from commit ef104b7b96)

Closes scylladb/scylladb#21483
2024-11-08 10:26:10 +02:00
Yaron Kaikov
d33538bdd4 .github/script/auto-backport.py: push backport PR to scylladbbot fork
Since Scylla is a public repo, when we create a fork, it doesn't fork the team and permissions (unlike private repos where it does).

When we have a backport PR with conflicts, the developers need to be able to update the branch to fix the conflicts. To do so, we modified the logic of the backport automation as follows:

- Every backport PR (with and without conflicts) will be open directly on the `scylladbbot` fork repo
- When there are conflicts, an email will be sent to the original PR author with an invitation to become a contributor in the `scylladbbot` fork with `push` permissions. This will happen only once if Auther is not a contributor.
- Together with sending the invite, all backport labels will be removed and a comment will be added to the original PR with instructions
- The PR author must add the backport labels after the invitation is accepted

Fixes: https://github.com/scylladb/scylladb/issues/18973

Closes scylladb/scylladb#21401

(cherry picked from commit 77604b4ac7)

Closes scylladb/scylladb#21466
2024-11-07 12:38:56 +02:00
Yaron Kaikov
073c9cbaa1 github: add script for backports automation instead of Mergify
Adding an auto-backport.py script to handle backport automation instead of Mergify.

The rules of backport are as follows:

* Merged or Closed PRs with any backport/x.y label (one or more) and promoted-to-master label
* Backport PR will be automatically assigned to the original PR author
* In case of conflicts the backport PR will be open in the original autoor fork in draft mode. This will give the PR owner the option to resolve conflicts and push those changes to the PR branch (Today in Scylla when we have conflicts, the developers are forced to open another PR and manually close the backport PR opened by Mergify)
* Fixing cherry-pick the wrong commit SHA. With the new script, we always take the SHA from the stable branch
* Support backport for enterprise releases (from Enterprise branch)

Fixes: https://github.com/scylladb/scylladb/issues/18973
(cherry picked from commit f9e171c7af)

Closes scylladb/scylladb#21469
2024-11-07 06:57:05 +02:00
Tomasz Grabiec
a3a0ffbcd0 Merge 'tablet: Fix single-sstable split when attaching new unsplit sstables' from Raphael "Raph" Carvalho
To fix a race between split and repair here c1de4859d8, a new sstable
  generated during streaming can be split before being attached to the sstable
  set. That's to prevent an unsplit sstable from reaching the set after the
  tablet map is resized.

  So we can think this split is an extension of the sstable writer. A failure
  during split means the new sstable won't be added. Also, the duration of split
  is also adding to the time erm is held. For example, repair writer will only
  release its erm once the split sstable is added into the set.

  This single-sstable split is going through run_custom_job(), which serializes
  with other maintenance tasks. That was a terrible decision, since the split may
  have to wait for ongoing maintenance task to finish, which means holding erm
  for longer. Additionally, if split monitor decides to run split on the entire
  compaction group, it can cause single-sstable split to be aborted since the
  former wants to select all sstables, propagating a failure to the streaming
  writer.
  That results in new sstable being leaked and may cause problems on restart,
  since the underlying tablet may have moved elsewhere or multiple splits may
  have happened. We have some fragility today in cleaning up leaked sstables on
  streaming failure, but this single-sstable split made it worse since the
  failure can happen during normal operation, when there's e.g. no I/O error.

  It makes sense to kill run_custom_job() usage, since the single-sstable split
  is offline and an extension of sstable writing, therefore it makes no sense to
  serialize with maintenance tasks. It must also inherit the sched group of the
  process writing the new sstable. The inheritance happens today, but is fragile.

  Fixes #20626.

Closes scylladb/scylladb#20737

* github.com:scylladb/scylladb:
  tablet: Fix single-sstable split when attaching new unsplit sstables
  replica: Fix tablet split execute after restart

(cherry picked from commit bca8258150)

Ref scylladb/scylladb#21415
2024-11-06 15:01:35 +02:00
Botond Dénes
8bf76d6be7 Merge '[Backport 6.2] replica: Fix tombstone GC during tablet split preparation' from Raphael Raph Carvalho
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:

sstable B is split first, and moved from main (unsplit) group to a
split-ready group
now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes https://github.com/scylladb/scylladb/issues/20044.

Please replace this line with justification for the backport/* labels added to this PR
Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.

(cherry picked from commit bcd358595f)

(cherry picked from commit 93815e0649)

Refs https://github.com/scylladb/scylladb/pull/20939

Closes scylladb/scylladb#21206

* github.com:scylladb/scylladb:
  replica: Fix tombstone GC during tablet split preparation
  service: Improve error handling for split
2024-11-06 09:55:47 +02:00
Raphael S. Carvalho
1e51ed88c6 replica: Fix tombstone GC during tablet split preparation
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split

tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes #20044.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 93815e0649)
2024-11-04 14:24:18 -03:00
Raphael S. Carvalho
ca5f938ed4 service: Improve error handling for split
Retry wasn't really happening since the loop was broken and sleep
part was skipped on error. Also, we were treating abort of split
during shutdown as if it were an actual error and that confused
longevity tests that parse for logs with error level. The fix is
about demoting the level of logs when we know the exception comes
from shutdown.

Fixes #20890.

(cherry picked from commit bcd358595f)
2024-11-04 14:22:08 -03:00
Botond Dénes
fb20ea7de1 Merge '[Backport 6.2] tasks: fix virtual tasks children' from ScyllaDB
Fix how regular tasks that have a virtual parent are created
in task_manager::module::make_task: set sequence number
of a task and subscribe to module's abort source.

Fixes: #21278.

Needs backport to 6.2

(cherry picked from commit 1eb47b0bbf)

(cherry picked from commit 910a6fc032)

Refs #21280

Closes scylladb/scylladb#21332

* github.com:scylladb/scylladb:
  tasks: fix sequence number assignment
  tasks: fix abort source subscription of virtual task's child
2024-11-04 18:18:35 +02:00
Tzach Livyatan
d5eb12c25d Update os-support-info.rst - add CentOS
ScyllaDB support RHEL 9 and derivatives, including CentOS 9.

Fix https://github.com/scylladb/scylladb/issues/21309

(cherry picked from commit 1878af9399)

Closes scylladb/scylladb#21331
2024-11-04 18:17:46 +02:00
Aleksandra Martyniuk
291f568585 test: repair: drop log checks from test_repair_succeeds_with_unitialized_bm
Currently, test_repair_succeeds_with_unitialized_bm checks whether
repair finishes successfully and the error is properly handled
if batchlog_manager isn't initialized. Error handling depends on
logs, making the test fragile to external conditions and flaky.

Drop the error handling check, successful repair is a sufficient
passing condition.

Fixes: #21167.
(cherry picked from commit 85d9565158)

Closes scylladb/scylladb#21330
2024-11-04 18:16:55 +02:00
Botond Dénes
d5475fbc07 Merge '[Backport 6.2] repair: Fix finished ranges metrics for removenode' from ScyllaDB
The skipped ranges should be multiplied by the number of tables

Otherwise the finished ranges ratio will not reach 100%.

Fixes #21174

(cherry picked from commit cffe3dc49f)

(cherry picked from commit 1392a6068d)

(cherry picked from commit 9868ccbac0)

Refs #21252

Closes scylladb/scylladb#21313

* github.com:scylladb/scylladb:
  test: Add test_node_ops_metrics.py
  repair: Make the ranges more consistent in the log
  repair: Fix finished ranges metrics for removenode
2024-11-04 18:16:21 +02:00
Anna Stuchlik
6916dbe822 doc: remove the Cassandra references from notedool
This PR removes the reference to Cassandra from the nodetool index,
as the native nodetool is no longer a fork.

In addition, it removes the Apache copyright.

Fixes https://github.com/scylladb/scylladb/issues/21238

(cherry picked from commit ef4bcf8b3f)

Closes scylladb/scylladb#21307
2024-11-04 18:15:36 +02:00
Michał Jadwiszczak
f51a8ed541 test/auth_cluster/test_raft_service_levels: match enterprise SL limit
Despite OSS doesn't limit number of created service levels, match the
enterprise limit to decrease divergence in the test between OSS and
enterprise.

Fixes scylladb/scylladb#21044

(cherry picked from commit 846d94134f)

Closes scylladb/scylladb#21282
2024-11-04 18:14:38 +02:00
Calle Wilund
127606f788 cql_test_env/gossip: Prevent double shutdown call crash
Fixes #21159

When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.

Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
    in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.

(cherry picked from commit c28a5173d9)

Closes scylladb/scylladb#21393
2024-11-04 16:52:42 +01:00
Benny Halevy
56a0fa922d storage_service: on_change: update_peer_info only if peer info changed
Return an optional peer_info from get_peer_info_for_update
when the `app_state_map` arg does not change peer_info,
so that we can skip calling update_peer_info, if it didn't
change.

Fixes scylladb/scylladb#20991
Refs scylladb/scylladb#16376

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21152

(cherry picked from commit 04d741bcbb)
2024-11-04 11:20:32 +02:00
Benny Halevy
c841a4a851 compaction_manager: compaction_disabled: return true if not in compaction_state
When a compaction_group is removed via `compaction_manager::remove`,
it is erase from `_compaction_state`, and therefore compaction
is definitely not enabled on it.

This triggers an internal error if tablets are cleaned up
during drop/truncate, which checks that compaction is disabled
in all compaction groups.

Note that the callers of `compaction_disabled` aren't really
interested in compaction being actively disabled on the
compaction_group, but rather if it's enabled or not.
A follow-up patch can be consider to reverse the logic
and expose `compaction_enabled` rather than `compaction_disabled`.

Fixes scylladb/scylladb#20060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 1c55747637)

Closes scylladb/scylladb#21404
2024-11-03 16:05:05 +02:00
Gleb Natapov
1a9721e93e topology coordinator: take a copy of a replication state in raft_topology_cmd_handler
Current code takes a reference and holds it past preemption points. And
while the state itself is not suppose to change the reference may
become stale because the state is re-created on each raft topology
command.

Fix it by taking a copy instead. This is a slow path anyway.

Fixes: scylladb/scylladb#21220
(cherry picked from commit fb38bfa35d)

Closes scylladb/scylladb#21361
2024-10-30 14:11:17 +01:00
Kamil Braun
1dded7e52f Merge '[Backport 6.2] fix nodetool status to show zero-token nodes' from ScyllaDB
In the current scenario, the nodetool status doesn’t display information regarding zero token nodes. For example, if 5 nodes are spun by the administrator, out of which, 2 nodes are zero token nodes, then nodetool status only shows information regarding the 3 non-zero token nodes.

This commit intends to fix this issue by leveraging the “/storage_service/host_id ” API  and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.

A test is also added in nodetool/test_status.py to verify this logic. This test fails without this commit’s zero token node support logic, hence verifying the behavior.

This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions don't support zero token nodes.

Fixes: scylladb/scylladb#19849
Fixes: scylladb/scylladb#17857

(cherry picked from commit 72f3c95a63)

(cherry picked from commit 39dfd2d7ac)

(cherry picked from commit c00d40b239)

Refs scylladb/scylladb#20909

Closes scylladb/scylladb#21334

* github.com:scylladb/scylladb:
  fix nodetool status to show zero-token nodes
  test: move `wait_for_first_completed` to pylib/util.py
  token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
2024-10-29 10:50:35 +01:00
Abhinav
9082d66d8a fix nodetool status to show zero-token nodes
In the current scenario, the nodetool status doesn’t display information
regarding zero token nodes. For example, if 5 nodes are spun by the
administrator, out of which, 2 nodes are zero token nodes, then nodetool
status only shows information regarding the 3 non-zero token nodes.

This commit intends to fix this issue by leveraging the “/storage_service/host_id
” API  and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.

Robust topology tests are added, which spins up scylla nodes and confirm nodetool
status output for various cases, providing good coverage.
A test is also added in nodetool/test_status.py to verify this logic. These tests fail
without this commit’s zero token node support logic, hence verifying the behavior.

The test `test_status_keyspace_joining_node` has been removed. This test is
based on case where host_id=None, which is impossible. Since we now use
host_id_map for node discovery in nodetool, the nodes with "host_id=None"
go undetected. Since this case is anyway impossible, we can get rid of this.

This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions dont support zero token nodes.

Fixes: scylladb/scylladb#19849
(cherry picked from commit c00d40b239)
2024-10-28 21:33:55 +00:00
Abhinav
c7a0876a73 test: move wait_for_first_completed to pylib/util.py
This function is needed in a new test added in the next commit and this
refactoring avoids code duplication.

(cherry picked from commit 39dfd2d7ac)
2024-10-28 21:33:55 +00:00
Abhinav
917d40e600 token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
Rename host_id map getter, 'get_endpoint_to_host_id_map_for_reading' to 'get_endpoint_to_host_id_map_'
Also modify the getter to return information regarding joining nodes as well.

This getter will later be used for retrieving the nodes in nodetool status, hence it needs to show all nodes,
including joining ones.

The function name suffix `_for_reading` suggests that the function was used
in some other places in the past, and indeed if we need endpoints
"for reading" then we cannot show joining endpoints. But it was confirmed
that this function is currently only used by "/storage_service/host_id" endpoint,
hence it can be modified as required.

Fixes: scylladb/scylladb#17857
(cherry picked from commit 72f3c95a63)
2024-10-28 21:33:54 +00:00
Aleksandra Martyniuk
1fd60424d9 tasks: fix sequence number assignment
Currently, children of virtual tasks do not have sequence number
assigned. Fix it.

(cherry picked from commit 910a6fc032)
2024-10-28 21:32:49 +00:00
Aleksandra Martyniuk
af6ddebc7f tasks: fix abort source subscription of virtual task's child
Currently, if a regular task does not have a parent or its parent
is a virtual tasks then it subscribes to module's abort source
in task_manager::task::impl constructor. However, at this point
the kind of the task's parent isn't set. Due to that, children
of virtual tasks aren't aborted on shutdown.

Subscribe to module's abort source in task::impl::set_virtual_parent.

(cherry picked from commit 1eb47b0bbf)
2024-10-28 21:32:49 +00:00
Tomasz Grabiec
fa71b82da4 node-exporter: Disable hwmon collector
This collector reads nvme temperature sensor, which was observed to
cause bad performance on Azure cloud following the reading of the
sensor for ~6 seconds. During the event, we can see elevated system
time (up to 30%) and softirq time. CPU utilization is high, with
nvm_queue_rq taking several orders of magnitude more time than
normally. There are signs of contention, we can see
__pv_queued_spin_lock_slowpath in the perf profile, called. This
manifests as latency spikes and potentially also throughput drop due
to reduced CPU capacity.

By default, the monitoring stack queries it once every 60s.

(cherry picked from commit 93777fa907)

Closes scylladb/scylladb#21304
2024-10-28 15:05:06 +01:00
Asias He
1a5a6a0758 test: Add test_node_ops_metrics.py
It tests the node_ops_metrics_done metric reaches 100% when a node ops
is done.

Refs: #21174
(cherry picked from commit 9868ccbac0)
2024-10-28 09:54:30 +00:00
Asias He
6ae5481de4 repair: Make the ranges more consistent in the log
Consider the number of tables for the number of ranges logging. Make it
more consistent with the log when the ops starts.

(cherry picked from commit 1392a6068d)
2024-10-28 09:54:30 +00:00
Asias He
0bc22db3a9 repair: Fix finished ranges metrics for removenode
The skipped ranges should be multiplied by the number of tables.

Otherwise the finished ranges ratio will not reach 100%.

Fixes #21174

(cherry picked from commit cffe3dc49f)
2024-10-28 09:54:30 +00:00
Botond Dénes
b78675270e streaming: stream-session: switch to tracking permit
The stream-session is the receiving end of streaming, it reads the
mutation fragment stream from an RPC stream and writes it onto the disk.
As such, this part does no disk IO and therefore, using a permit with
count resources is superfluous. Furthermore, after
d98708013c, the count resources on this
permit can cause a deadlock on the receiver end, via the
`db::view::check_view_update_path()`, which wants to read the content of
a system table and therefore has to obtain a permit of its own.

Switch to a tracking-only permit, primarily to resolve the deadlock, but
also because admission is not necessary for a read which does no IO.

Refs: scylladb/scylladb#20885 (partial fix, solves only one of the deadlocks)
Fixes: scylladb/scylladb#21264
(cherry picked from commit dbb26da2aa)

Closes scylladb/scylladb#21303
2024-10-28 08:07:05 +02:00
Jenkins Promoter
ea6fe4bfa1 Update ScyllaDB version to: 6.2.1 2024-10-27 12:06:35 +02:00
Botond Dénes
30a2ed7488 Merge '[Backport 6.2] cql/tablets: fix retrying ALTER tablets KEYSPACE' from Marcin Maliszkiewicz
ALTER tablets-enabled KEYSPACES (KS) may fail due to
group0_concurrent_modification, in which case it's repeated by a for
loop surrounding the code. But because raft's add_entry consumes the
raft's guard (by std::move'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned for loop altogether and rethrow the exception, as the rf_change event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
Note: refactor is implemented in the follow-up commit.

Fixes: https://github.com/scylladb/scylladb/issues/21102

Should be backported to every 6.x branch, as it may lead to a crash.

(cherry picked from commit de511f56ac)

(cherry picked from commit 3f4c8a30e3)

(cherry picked from commit 522bede8ec)

Refs https://github.com/scylladb/scylladb/pull/21121

Closes scylladb/scylladb#21256

* github.com:scylladb/scylladb:
  test: topology: add disable_schema_agreement_wait utility function
  test: add UT to test retrying ALTER tablets KEYSPACE
  cql/tablets: fix indentation in `rf_change` event handler
  cql/tablets: fix retrying ALTER tablets KEYSPACE
2024-10-25 10:57:36 +03:00
Botond Dénes
dcddb1ff4a Merge '[Backport 6.2] multishard reader: make it safe to create with admitted permits' from ScyllaDB
Passing an admitted permit -- i.e. one with count resources on it -- to the multishard reader, will possibly result in a deadlock, because the permit of the multishard reader is destroyed after the permits of its child readers. Therefore its semaphore resources won't be automatically released until children acquire their own resources. This creates a dependency (an edge in the "resource allocation graph"), where the semaphore used by the multishard reader depends on the semaphores used by children. When such dependencies create a cycle, and permits are acquired by different reads in just the right order, a deadlock will  happen.

Users of the multishard reader have to be aware of this gotcha -- and of course they aren't. This is small wonder, considering that not even the documentation on the multishard reader mentions this problem. To work around this, the user has to call `reader_permit::release_base_resources()` on the permit, before passing it to the multishard reader. On multiple occasions, developers (including the very author of the multishard reader), forgot or didn't know about this and this resulted in deadlocks down the line. This is a design-flaw of the multishard reader, which is addressed in this PR, after which, it is safe to pass admitted or not admitted permits to the multishard reader, it will handle the call to `release_base_resources()` if needed.

After fixing the problem in the multishard reader, the existing calls to `release_base_resources()` on permits passed to multishard readers are removed. A test is added which reproduces the problem and ensures we don't regress.

Refs: https://github.com/scylladb/scylladb/issues/20885 (partial fix, there is another deadlock in that issue, which this PR doesn't fix)
Fixes: https://github.com/scylladb/scylladb/issues/21263

This fixes (indirectly) a regression introduced by d98708013c so it has to be backported to 6.2

(cherry picked from commit e1d8cddd09)

Refs scylladb/scylladb#21058

Closes scylladb/scylladb#21178

* github.com:scylladb/scylladb:
  test/boost/mutation_test: add test for multishard permit safety
  test/lib/reader_lifecycle_policy: add semaphore factory to constructor
  test/lib/reader_lifecycle_policy: rename factory_function
  repair/row_level: drop now unneeded release_base_resource() calls
  readers/multishard: make multishard reader safe to create with admitted permits
2024-10-25 09:32:03 +03:00
Piotr Dulikowski
4ca0e31415 test/test_view_build_status: properly wait for v2 in migration test
The test_view_build_status_migration_to_v2 test case creates a new view
(vt2) after peforming the view_build_status -> view_build_status_v2
migration and waits until it is built by `wait_for_view_v2` function. It
works by waiting until a SELECT from view_build_status_v2 will return
the expected number of rows for a given view.

However, if the host parameter is unspecified, it will query only one
node on each attempt. Because `view_build_status_v2` is managed via
raft, queries always return data from the queried node only. It might
happen that `wait_for_view_v2` fetches expected results from one node
while a different node might be lagging behind the group0 coordinator
and might not have all data yet.

In case of test_view_build_status_migration_to_v2 this is a problem - it
first uses `wait_for_view_v2` to wait for view, later it queries
`view_build_status_v2` on a random node and asserts its state - and
might fail because that node didn't have the newest state yet.

Fix the issue by issuing `wait_for_view_v2` in parallel for all nodes in
the cluster and waiting until all nodes have the most recent state.

Fixes: scylladb/scylladb#21060
(cherry picked from commit a380a2efd9)

Closes scylladb/scylladb#21129
2024-10-24 16:42:53 +03:00
Raphael S. Carvalho
363bc7424e locator: Always preserve balancing_enabled in tablet_metadata::copy()
When there are zero tablets, tablet_metadata::_balancing_enabled
is ignored in the copy.

The property not being preserved can result in balancer not
respecting user's wish to disable balancing when a replica is
created later on.

Fixes #21175.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit dfc217f99a)

Closes scylladb/scylladb#21190
2024-10-24 16:37:41 +03:00
Botond Dénes
a5b11a3189 test/boost/mutation_test: add test for multishard permit safety
Add a test checking that the multishard reader will not deadlock, when
created with an admitted permit, on a semaphore with a single count
resource.

(cherry picked from commit e1d8cddd09)
2024-10-24 09:18:11 -04:00
Botond Dénes
c0eba659f6 test/lib/reader_lifecycle_policy: add semaphore factory to constructor
Allowing callers to specify how the semaphore is created and stopped,
instead of doing so via boolean flags like it is done currently. This
method doesn't scale, so use a factory instead.

(cherry picked from commit 5a3fd69374)
2024-10-24 09:18:11 -04:00
Botond Dénes
dbb1dc872d test/lib/reader_lifecycle_policy: rename factory_function
To reader_factor_function. We are about to add a new factory function
parameters, so the current factory_function has to be renamed to
something more specific.

(cherry picked from commit c8598e21e8)
2024-10-24 09:18:11 -04:00
Botond Dénes
07b288b7d7 repair/row_level: drop now unneeded release_base_resource() calls
The multishard reader now does this itself, no need to do it here.

(cherry picked from commit 76a5ba2342)
2024-10-24 09:18:11 -04:00
Botond Dénes
41a44ddc12 readers/multishard: make multishard reader safe to create with admitted permits
Passing an admitted permit -- i.e. one with count resources on it -- to
the multishard reader, will possibly result in a deadlock, because the
permit of the multishard reader is destroyed after the permits of its
child readers. Therefore its semaphore resources won't be automatically
released until children acquire their own resources.
This creates a dependency (an edge in the "resource allocation graph"),
where the semaphore used by the multishard reader depends on the
semaphores used by children. When such dependencies create a cycle, and
permits are acquired by different reads in just the right order, a
deadlock will happen.

Users of the multishard reader have to be aware of this gotcha -- and of
course they aren't. This is small wonder, considering that not even the
documentation on the multishard reader mentions this problem.
To work around this, the user has to call
`reader_permit::release_base_resources()` on the permit, before passing
it to the multishard reader.
On multiple occasions, developers (including the very author of the
multishard reader), forgot or didn't know about this and this resulted
in deadlocks down the line.
This is a design-flaw of the multishard reader, which is addressed in
this patch, after which, it is safe to pass admitted or not admitted
permits to the multishard reader, it will handle the call to
`release_base_resources()` if needed.

(cherry picked from commit 218ea449a5)
2024-10-24 09:18:11 -04:00
Lakshmi Narayanan Sreethar
3f04df55eb [Backport 6.2] replica/table: check memtable before discarding tombstone during read
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.

Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.

Fixes #20916

`perf-simple-query` stats before and after this fix :

`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59393 insns/op,   24029 cycles/op,        0 errors)
97551.14 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59376 insns/op,   23966 cycles/op,        0 errors)
96599.92 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59367 insns/op,   23998 cycles/op,        0 errors)
97774.91 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59370 insns/op,   23968 cycles/op,        0 errors)
97796.13 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59368 insns/op,   23947 cycles/op,        0 errors)

         throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
  cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19

// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59392 insns/op,   24058 cycles/op,        0 errors)
97311.48 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59375 insns/op,   24005 cycles/op,        0 errors)
98043.10 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59381 insns/op,   23941 cycles/op,        0 errors)
96750.31 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59396 insns/op,   24025 cycles/op,        0 errors)
93381.21 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59390 insns/op,   24097 cycles/op,        0 errors)

         throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
  cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```

This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.

Closes scylladb/scylladb#20985

* github.com:scylladb/scylladb:
  topology-custom: add test to verify tombstone gc in read path
  replica/table: check memtable before discarding tombstone during read
  compaction_group: track maximum timestamp across all sstables

(cherry picked from commit 519e167611)

Backported from #20985 to 6.2.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#21251
2024-10-24 15:33:39 +03:00
Marcin Maliszkiewicz
8c8f97c280 test: topology: add disable_schema_agreement_wait utility function
Code extracted from fa45fdf5f7 as it's being used by
test_alter_tablets_keyspace_concurrent_modification and we're
backporting it.
2024-10-24 11:24:56 +02:00
Piotr Smaron
a61ab7d02e test: add UT to test retrying ALTER tablets KEYSPACE
The newly added testcase is based on the already existing
`test_alter_dropped_tablets_keyspace`.
A new error injection is created, which stops the ALTER execution just
before the changes are submitted to RAFT. In the meantime, a new schema
change is performed using the 2nd node in the cluster, thus causing the
1st node to retry the ALTER statement.

(cherry picked from commit 522bede8ec)
2024-10-23 13:35:26 +00:00
Piotr Smaron
775578af59 cql/tablets: fix indentation in rf_change event handler
Just moved the code that previously was under a `for` loop by 1 tab, i.e. 4 spaces, to the left.

(cherry picked from commit 3f4c8a30e3)
2024-10-23 13:35:26 +00:00
Piotr Smaron
97f22f426f cql/tablets: fix retrying ALTER tablets KEYSPACE
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft's guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned `for` loop altogether and rethrow the exception, as the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
`topology_coordinator::handle_topology_coordinator_error` handling the
case of `group0_concurrent_modification` has been extended with logging
in order not to write catch-log-throw boilerplate.
Note: refactor is implemented in the follow-up commit.

Fixes: scylladb/scylladb#21102
(cherry picked from commit de511f56ac)
2024-10-23 13:35:26 +00:00
Botond Dénes
55a9605687 Merge '[Backport 6.2] Check system.tablets update before putting it into the table' from ScyllaDB
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.

Fixes #20043

(cherry picked from commit f09fe4f351)

(cherry picked from commit e5bf376cbc)

(cherry picked from commit 1863ccd900)

Refs #21020

Closes scylladb/scylladb#21111

* github.com:scylladb/scylladb:
  tablets: Validate system.tablets update
  group0_client: Introduce change validation
  group0_client: Add shared_token_metadata dependency
2024-10-23 10:00:39 +03:00
Pavel Emelyanov
83cc3e4791 tablets: Validate system.tablets update
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.

If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 14:45:51 +03:00
Pavel Emelyanov
aef7e7db0b group0_client: Introduce change validation
Add validate_change() methods (well, a template and an overload) that
are called by prepare_command() and are supposed to validate the
proposed change before it hits persistent storage

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 14:45:22 +03:00
Pavel Emelyanov
282cdfcfcc group0_client: Add shared_token_metadata dependency
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 14:45:12 +03:00
Daniel Reis
b661bc39df docs: fix redirect from cert-based auth to security/enable-auth page
(cherry picked from commit 28a265ccd8)

Closes scylladb/scylladb#21124
2024-10-22 09:10:04 +03:00
Botond Dénes
0805780064 Merge '[Backport 6.2] scylla_raid_setup: configure SELinux file context' from ScyllaDB
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump because the service only have write acess with systemd_coredump_var_lib_t. To make it writable, we need to add file context rule for /var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.

Fixes #19325

(cherry picked from commit 56c971373c)

(cherry picked from commit 0ac450de05)

Refs #20528

Closes scylladb/scylladb#21211

* github.com:scylladb/scylladb:
  scylla_raid_setup: configure SELinux file context
  scylla_coredump_setup: fix SELinux configuration for RHEL9
2024-10-21 16:03:08 +03:00
Takuya ASADA
3de8885161 scylla_raid_setup: configure SELinux file context
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump
because the service only have write acess with systemd_coredump_var_lib_t.
To make it writable, we need to add file context rule for
/var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.

Fixes #20573

(cherry picked from commit 0ac450de05)
2024-10-21 11:15:06 +00:00
Takuya ASADA
29a0ce3b0a scylla_coredump_setup: fix SELinux configuration for RHEL9
Seems like specific version of systemd pacakge on RHEL9 has a bug on
SELinux configuration, it introduced "systemd-container-coredump" module
to provide rule for systemd-coredump, but not enabled by default.
We have to manually load it, otherwise it causes permission error.

Fixes #19325

(cherry picked from commit 56c971373c)
2024-10-21 11:15:06 +00:00
Benny Halevy
eebf97c545 view: check_needs_view_update_path: get token_metadata_ptr
check_needs_view_update_path is async and might yield
so the token_metadata reference passed to it must be kept
alive throughout the call.

Fixes scylladb/scylladb#20979

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit eaa3b774a6)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21038
2024-10-21 10:27:52 +02:00
Artsiom Mishuta
a728695d10 test.py: deselect remove_data_dir_of_dead_node event
deselect remove_data_dir_of_dead_node event from test_random_failures
due to ussue #20751

(cherry picked from commit 9b0e15678e)

Closes scylladb/scylladb#21138
2024-10-17 11:38:35 +02:00
Piotr Smaron
82a34aa837 test: fix flaky test_multidc_alter_tablets_rf
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

(cherry picked from commit f8475915fb)

Closes scylladb/scylladb#21107
2024-10-15 09:26:28 +03:00
Piotr Dulikowski
d10c6a86cc SCYLLA-VERSION-GEN: correct the logic for skipping SCYLLA-*-FILE
The SCYLLA-VERSION-GEN file skips updating the SCYLLA-*-FILE files if
the commit hash from SCYLLA-RELEASE-FILE is the same. The original
reason for this was to prevent the date in the version string from
changing if multiple modes are built across midnight
(scylladb/scylla-pkg#826). However - intentionally or not - it serves
another purpose: it prevents an infinite loop in the build process.

If the build.ninja file needs to be rebuilt, the configure.py script
unconditionally calls ./SCYLLA-VERSION-GEN. On the other hand, if one
of the SCYLLA-*-FILE files is updated then this triggers rebuild
of build.ninja. Apparently, this is sufficient for ninja to enter an
infinite loop.

However, the check assumes that the RELEASE is in the format

  <build identifier>.<date>.<commit hash>

and assumes that none of the components have a dot inside - otherwise it
breaks and just works incorrectly. Specifically, when building a private
version, it is recommended to set the build identifier to
`count.yourname`.

Previously, before 85219e9, this problem wasn't noticed most likely
because reconfigure process was broken and stopped overwriting
the build.ninja file after the first iteration.

Fix the problem by fixing the logic that extracts the commit hash -
instead of looking at the third dot-separated field counting from the
left side, look at the last field.

Fixes: scylladb/scylladb#21027
(cherry picked from commit 64ca58125e)

Closes scylladb/scylladb#21103
2024-10-15 09:26:00 +03:00
Botond Dénes
554838691b Merge '[Backport 6.2] compaction: fix potential data resurrection with file-based migration' from Ferenc Szili
This is a manual backport of #20788

When tablets are migrated with file-based streaming, we can have a situation where a tombstone is garbage collected before the data it shadows lands. For instance, if we have a tablet replica with 3 sstables:

1. sstable containing an expired tombstone
2. sstable with additional data
3. sstable containing data which is shadowed by the expired tombstone in sstable 1

If this tablet is migrated, and the sstables are streamed in the order listed above, the first two sstables can be compacted before the third sstable arrives. In that case, the expired tombstone will be garbage collected, and data in the third sstable will be resurrected after it arrives to the pending replica.

This change fixes this problem by disabling tombstone garbage collection for pending replicas.

This fixes a problem in Enterprise, but the change is in OSS in order to have as few differences between OSS and Enterprise and to have a common infrastructure for disabling tombstone GC on pending replicas.

Fixes #21090

Closes scylladb/scylladb#21061

* github.com:scylladb/scylladb:
  test: test tombstone GC disabled on pending replica
  tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
  database::table: add tombstone_gc_enabled(locator::tablet_id)
2024-10-15 09:25:22 +03:00
Kefu Chai
b691dddf6b install.sh: install seastar/scripts/addr2line.py as well
seastar extracted `addr2line` python module out back in
e078d7877273e4a6698071dc10902945f175e8bc. but `install.sh` was
not updated accordingly. it still installs `seastar-addr2line`
without installing its new dependency. this leaves us with a
broken `seastar-addr2line` in the relocatable tarball.
```console
$ /opt/scylladb/scripts/seastar-addr2line
Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/seastar-addr2line", line 26, in <module>
    from addr2line import BacktraceResolver
ModuleNotFoundError: No module named 'addr2line'
```

in this change, we redistribute `addr2line.py` as well. this
should address the issue above.

Fixes scylladb/scylladb#21077

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit da433aad9d)

Closes scylladb/scylladb#21085
2024-10-14 09:52:21 +03:00
Botond Dénes
85b1c64a33 Merge '[Backport 6.2] storage_proxy: Add conditions checking to avoid UB in speculating read executors.' from ScyllaDB
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in  filter_for_query(): the map is considered incorrect if the list  of replicas contains a node from a data center whose replication factor is 0.

 Please note: This PR does not fix the issue found in scylladb/scylladb#20282;   it only adds condition checks to prevent undefined behavior in cases of  inconsistent inputs.

Refs scylladb/scylladb#20625

As this issue applies to the releases versions and can affect clients, we need backports to 6.0, 6.1, 6.2.

(cherry picked from commit 132358dc92)

(cherry picked from commit ae23d42889)

(cherry picked from commit ad93cf5753)

(cherry picked from commit 8db6d6bd57)

(cherry picked from commit c373edab2d)

Refs #20851

Closes scylladb/scylladb#21067

* github.com:scylladb/scylladb:
  Add conditions checking for get_read_executor
  Avoid an extra call to block_for in db::filter_for_query.
  Improve code readability in consistency_level.cc and storage_proxy.cc
  tools: Add build_info header with functions providing build type information
  tests: Add tests for alter table with RF=1 to RF=0
2024-10-14 09:51:50 +03:00
Benny Halevy
6e67a993ba storage_service: rebuild: warn about tablets-enabled keyspaces
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.

The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.

However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.

Refs scylladb/scylladb#17575

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ed1e9a1543)

Closes scylladb/scylladb#20722
2024-10-14 09:47:35 +03:00
Michał Chojnowski
b8a9fd4e49 reader_concurrency_semaphore: in stats, fix swapped count_resources and memory_resources
can_admit_read() returns reason::memory_resources when the permit is queued due
to lack of count resources, and it returns reason::count_resources when the
permit is queued due to lack of memory resources. It's supposed to be the other
way around.

This bug is causing the two counts to be swapped in the stat dumps printed to
the logs when semaphores time out.

(cherry picked from commit 6cf3747c5f)

Closes scylladb/scylladb#21030
2024-10-13 18:34:18 +03:00
Jenkins Promoter
363cf881d4 Update ScyllaDB version to: 6.2.0 2024-10-13 14:15:40 +03:00
Sergey Zolotukhin
68a55facdf Add conditions checking for get_read_executor
During the investigation of scylladb/scylladb#20282, it was discovered that
implementations of speculating read executors have undefined behavior
when called with an incorrect number of read replicas. This PR
introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in
  get_endpoints_for_reading(): the map is considered incorrect the number of
  read replica nodes is higher than replication factor. The check is
  applied only when built in non release mode.

Please note: This PR does not fix the issue found in scylladb/scylladb#20282;
it only adds condition checks to prevent undefined behavior in cases of
inconsistent inputs.

Refs scylladb/scylladb#20625

(cherry picked from commit c373edab2d)
2024-10-11 18:20:43 +00:00
Sergey Zolotukhin
9010d0a22f Avoid an extra call to block_for in db::filter_for_query.
(cherry picked from commit 8db6d6bd57)
2024-10-11 18:20:43 +00:00
Sergey Zolotukhin
3c0f43b6eb Improve code readability in consistency_level.cc and storage_proxy.cc
Add const correctness and rename some variables to improve code readability.

(cherry picked from commit ad93cf5753)
2024-10-11 18:20:43 +00:00
Sergey Zolotukhin
a22e4476ac tools: Add build_info header with functions providing build type information
A new header provides `constexpr` functions to retrieve build
type information: `get_build_type()`, `is_release_build()`,
and `is_debug_build()`. These functions are useful when adding
changes that should be enabled at compile time only for
specific build types.

(cherry picked from commit ae23d42889)
2024-10-11 18:20:42 +00:00
Sergey Zolotukhin
14650257c0 tests: Add tests for alter table with RF=1 to RF=0
Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor
from 1 to 0 for one of two data centers. Tablet version fails due to issue described in
scylladb/scylladb#20625.

Test for scylladb/scylladb#20625

(cherry picked from commit 132358dc92)
2024-10-11 18:20:42 +00:00
Ferenc Szili
2a318817ba test: test tombstone GC disabled on pending replica
This tests if tombstone GC is disabled on pending replicas
2024-10-11 14:10:30 +02:00
Ferenc Szili
5f052a2b52 tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
In order to avoid cases during tablet migrations where we garbage
collect tombstones before the data it shadows arrives, we will
disable tombstone GC on pending replicas.

To achieve this we added a tombston_gc_enabled flag to compaction_group.
This flag is updated from updte_effective_repliction_map method of the
tablet_storage_group_manager class.
2024-10-11 14:09:30 +02:00
David Garcia
e018b38a54 docs: Fix confgroup links
It was not possible to link to configuration parameters groups in docs/reference/configuration-parameters.rst if they contained a space.

(cherry picked from commit 2247bdbc8c)

Closes scylladb/scylladb#21037
2024-10-11 14:31:28 +03:00
Ferenc Szili
14ce5e14d0 database::table: add tombstone_gc_enabled(locator::tablet_id)
This change adds the flag tombstone_gc_enabled to compaction_group.
The value of this flag will be set in
tablet_storage_group_manager::update_effective_replication_map().
2024-10-11 13:29:30 +02:00
Piotr Smaron
d1a31460a0 cql/tablets: handle MVs in ALTER tablets KEYSPACE
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.

Fixes: #20240
(cherry picked from commit e0c1a51642)

Closes scylladb/scylladb#21022
2024-10-11 14:14:09 +03:00
Botond Dénes
9175cc528b Merge '[Backport 6.2] cql: improve validating RF's change in ALTER tablets KS' from ScyllaDB
This patch series fixes a couple of bugs around validating if RF is not changed by too much when performing ALTER tablets KS.
RF cannot change by more than 1 in total, because tablets load balancer cannot handle more work at once.

Fixes: #20039

Should be backported to 6.0 & 6.1 (wherever tablets feature is present), as this bug may break the cluster.

(cherry picked from commit 042825247f)

(cherry picked from commit adf453af3f)

(cherry picked from commit 9c5950533f)

(cherry picked from commit 47acdc1f98)

(cherry picked from commit 93d61d7031)

(cherry picked from commit 6676e47371)

(cherry picked from commit 2aabe7f09c)

(cherry picked from commit ee56bbfe61)

Refs #20208

Closes scylladb/scylladb#21009

* github.com:scylladb/scylladb:
  cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
  cql: join new and old KS options in ALTER tablets KS
  cql: fix validation of ALTERing RFs in tablets KS
  cql: harden `alter_keyspace_statement.cc::validate_rf_difference`
  cql: validate RF change for new DCs in ALTER tablets KS
  cql: extend test_alter_tablet_keyspace_rf
  cql: refactor test_tablets::test_alter_tablet_keyspace
  cql: remove unused helper function from test_tablets
2024-10-11 14:13:43 +03:00
Botond Dénes
18be4f454e Merge '[Backport 6.2] Node replace and remove operations: Add deprecate IP addresses usage warning.' from ScyllaDB
- As part of deprecation of IP address usage, warning messages were added when IP addresses specified in the `ignore-dead-nodes` and `--ignore-dead-nodes-for-replace` options for scylla and nodetool.
- Slight optimizations for `utils::split_comma_separated_list`, ` host_id_or_endpoint lists` and `storage_service` remove node operations, replacing `std::list` usage with `std::vector`.

Fixes scylladb/scylladb#19218

Backport: 6.2 as it's not yet released.

(cherry picked from commit 3b9033423d)

(cherry picked from commit a871321ecf)

(cherry picked from commit 9c692438e9)

(cherry picked from commit 6398b7548c)

Refs #20756

Closes scylladb/scylladb#20958

* github.com:scylladb/scylladb:
  config: Add a warning about use of IP address for join topology and replace operations.
  nodetool: Add IP address usage warning for 'ignore-dead-nodes'.
  tests: Fix incorrect UUIDs in test_nodeops
  utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
2024-10-11 14:12:51 +03:00
Botond Dénes
f35a083abe repair/row_level: remove reader timeout
This timeout was added to catch reader related deadlocks. We have not
seen such deadlocks for a long time, but we did see false-timeouts
caused by this, see explanation below. Since the cost now outweight the
benefit, remove the timeout altogether.

The false timeout happens during mixed-shard repair. The
`reader_permit::set_timeout()` call is called on the top-level permit
which repair has a handle on. In the case of the mixed-shard repair,
this belongs to the multishard reader. Calling set_timeout() on the
multishard reader has no effect on the actual shard readers, except in
one case: when the shard reader is created, it inherits the multishard
reader's current timeout. As the shard reader can be alive for a long
time, this timeout is not refreshed and ultimately causes a timeout and
fails the repair.

Refs: #18269
(cherry picked from commit 3ebb124eb2)

Closes scylladb/scylladb#20955
2024-10-11 14:11:03 +03:00
Anna Stuchlik
57affc7fad doc: document the option to run ScyllaDB in Docker on macOS
This commit adds a description of a workaround to create a multi-node ScyllaDB cluster
with Docker on macOS.

Refs https://github.com/scylladb/scylladb/issues/16806
See https://forum.scylladb.com/t/running-3-node-scylladb-in-docker/1057/4

(cherry picked from commit 7eb1dc2ae5)

Closes scylladb/scylladb#20931
2024-10-11 14:10:06 +03:00
Raphael S. Carvalho
927e526e2d replica: Fix schema change during migration cleanup
During migration cleanup, there's a small window in which the storage
group was stopped but not yet removed from the list. So concurrent
operations traversing the list could work with stopped groups.

During a test which emitted schema changes during migrations,
a failure happened when updating the compaction strategy of a table,
but since the group was stopped, the compaction manager was unable
to find the state for that group.

In order to fix it, we'll skip stopped groups when traversing the
list since they're unused at this stage of migration and going away
soon.

Fixes #20699.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit cf58674029)

Closes scylladb/scylladb#20899
2024-10-11 14:07:42 +03:00
Calle Wilund
b224665575 database: Also forced new schema commitlog segment on user initiated memtable flush
Refs #20686
Refs #15607

In #15060 we added forced new commitlog segment on user initated flush,
mainly so that tests can verify tombstone gc and other compaction related
things, without having to wait for "organic" segment deletion.
Schema commitlog was not included, mainly because we did not have tests
featuring compaction checks of schema related tables, but also because
it was assumed to be lower general througput.
There is however no real reason to not include it, and it will make some
testing much quicker and more predictable.

(cherry picked from commit 60f8a9f39d)

Closes scylladb/scylladb#20705
2024-10-11 14:03:17 +03:00
Gleb Natapov
9afb1afefa storage_proxy: make sure there is no end iterator in _live_iterators array
storage_proxy::cancellable_write_handlers_list::update_live_iterators
assumes that iterators in _live_iterators can be dereferenced, but
the code does not make any attempt to make sure this is the case. The
iterator can be the end iterator which cannot be dereferenced.

The patch makes sure that there is no end iterator in _live_iterators.

Fixes scylladb/scylladb#20874

(cherry picked from commit da084d6441)

Closes scylladb/scylladb#21003
2024-10-10 17:09:27 +03:00
Kefu Chai
72153cac96 auth: capture boost::regex_error not std::regex_error
in a3db5401, we introduced the TLS certi authenticator, which is
configured using `auth_certificate_role_queries` option . the
value of this option contains a regular expression. so there are
chances the regular expression is malformatted. in that case,
when converting its value presenting the regular expression to an
instance of `boost::regex`, Boost.Regex throws a `boost::regex_error`
exception, not `std::regex_error`.

since we decided to use Boost.Regex, let's catch `boost::regex_error`.

Refs a3db5401
Fixes scylladb/scylladb#20941
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 439c52c7c5)

Closes scylladb/scylladb#20952
2024-10-09 21:58:40 +03:00
Michał Chojnowski
f988980260 utils/rjson.cc: correct a comment about assert()
Commit aa1270a00c changed most uses
of `assert` in the codebase to `SCYLLA_ASSERT`.

But the comment fixed in this patch is talking specifically about
`assert`, and shouldn't have been changed. It doesn't make sense
after the change.

(cherry picked from commit da7edc3a08)

Closes scylladb/scylladb#20976
2024-10-09 21:50:26 +03:00
Anna Stuchlik
1d11adf766 doc: remove outdated JMX references
This commit removes references to JMX from the docs.

Context:
The JMX server has been dropped and removed from installation. The user can
install it manually if needed, as documented with https://github.com/scylladb/scylladb/issues/18687.

This commit removes the outdated information about JMX from other pages
in the documentation, including the docs for nodetool, the list of ports,
and the admin section.

Also, the no longer relevant JMX information is removed from
the Docker Hub docs.

Fixes https://github.com/scylladb/scylladb/issues/18687
Fixes https://github.com/scylladb/scylladb/issues/19575

(cherry picked from commit 4e43d542cd)

Closes scylladb/scylladb#20988
2024-10-09 20:57:49 +03:00
Jenkins Promoter
dae1d18145 Update ScyllaDB version to: 6.2.0-rc3 2024-10-09 15:10:48 +03:00
Kamil Braun
e9588a8a53 Merge '[Backport 6.2] Wait for all users of group0 server to complete before destroying it' from ScyllaDB
Group0 server is often used in asynchronous context, but we do not wait
for them to complete before destroying the server. We already have
shutdown gate for it, so lets use it in those asynch functions.

Also make sure to signal group0 abort source if initialization fails.

Fixes scylladb/scylladb#20701

Backport to 6.2 since it contains af83c5e53e and it made the race easier to hit, so tests became flaky.

(cherry picked from commit ba22493a69)

(cherry picked from commit e642f0a86d)

Refs #20891

Closes scylladb/scylladb#21008

* github.com:scylladb/scylladb:
  group: hold group0 shutdown gate during async operations
  group0: Stop group0 if node initialization fails
2024-10-09 12:19:16 +02:00
Piotr Smaron
c73d0ffbaa cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
Tablets load balancer is unable to process more than a single pending
replica, thus ALTER tablets KS cannot accept an ALTER statement which
would result in creating 2+ pending replicas, hence it has to validate
if the sum of absoulte differences of RFs specified in the statement is
not greter than 1.

(cherry picked from commit ee56bbfe61)
2024-10-08 18:06:52 +00:00
Piotr Smaron
c7b5571766 cql: join new and old KS options in ALTER tablets KS
A bug has been discovered while trying to ALTER tablets KS and
specifying only 1 out of 2 DCs - the not specified DC's RF has been
zeroed. This is because ALTER tablets KS updated the KS only with the
RF-per-DC mapping specified in the ALTER tablets KS statement, so if a
DC was ommitted, it was assigned a value of RF=0.
This commit fixes that plus additionally passes all the KS options, not
only the replication options, to the topology coordinator, where the KS
update is performed.
`initial_tablets` is a special case, which requires a special handling
in the source code, as we cannot simply update old initial_tablet's
settings with the new ones, because if only ` and TABLETS = {'enabled':
true}` is specified in the ALTER tablets KS statement, we should not zero the `initial_tablets`, but
rather keep the old value - this is tested by the
`test_alter_preserves_tablets_if_initial_tablets_skipped` testcase.
Other than that, the above mentioned testcase started to fail with
these changes, and it appeared to be an issue with the test not waiting
until ALTER is completed, and thus reading the old value, hence the
test's body has been modified to wait for ALTER to complete before
performing validation.

(cherry picked from commit 2aabe7f09c)
2024-10-08 18:06:48 +00:00
Piotr Smaron
92325073a9 cql: fix validation of ALTERing RFs in tablets KS
The validation has been corrected with:
1. Checking if a DC specified in ALTER exists.
2. Removing `REPLICATION_STRATEGY_CLASS_KEY` key from a map of RFs that
   needs their RFs to be validated.

(cherry picked from commit 6676e47371)
2024-10-08 18:06:47 +00:00
Piotr Smaron
f5c0969c06 cql: harden alter_keyspace_statement.cc::validate_rf_difference
This function assumed that strings passed as arguments will be of
integer types, but that wasn't the case, and we missed that because this
function didn't have any validation, so this change adds proper
validation and error logging.
Arguments passed to this function were forwarded from a call to
`ks_prop_defs::get_replication_options`, which, among rf-per-dc mapping, returns also
`class:replication_strategy` pair. Second pair's member has been casted
into an `int` type and somehow the code was still running fine, but only
extra testing added later discovered a bug in here.

(cherry picked from commit 93d61d7031)
2024-10-08 18:06:46 +00:00
Gleb Natapov
90ced080a8 group: hold group0 shutdown gate during async operations
Wait for all outstanding async work that uses group0 to complete before
destroying group0 server.

Fixes scylladb/scylladb#20701

(cherry picked from commit e642f0a86d)
2024-10-08 18:06:45 +00:00
Piotr Smaron
7674d80c31 cql: validate RF change for new DCs in ALTER tablets KS
ALTER tablets KS validated if RF is not changed by more than 1 for DCs
that already had replicas, but not for DCs that didn't have them yet, so
specifying an RF jump from 0 to 2 was possible when listing a new DC in
ALTER tablets KS statement, which violated internal invariants of
tablets load balancer.
This PR fixes that bug and adds a multi-dc testcases to check if adding
replicas to a new DC and removing replicas from a DC is honoring the RF
change constraints.

Refs: #20039
(cherry picked from commit 47acdc1f98)
2024-10-08 18:06:45 +00:00
Gleb Natapov
06ceef34a7 group0: Stop group0 if node initialization fails
Commit af83c5e53e moved aborting of group0 into the storage service
drain function. But it is not called if node fails during initialization
(if it failed to join cluster for instance). So lets abort on both
paths (but only once).

(cherry picked from commit ba22493a69)
2024-10-08 18:06:44 +00:00
Piotr Smaron
ec83367b45 cql: extend test_alter_tablet_keyspace_rf
Added cases to also test decreasing RF and setting the same RF.
Also added extra explanatory comments.

(cherry picked from commit 9c5950533f)
2024-10-08 18:06:44 +00:00
Piotr Smaron
dfe2e20442 cql: refactor test_tablets::test_alter_tablet_keyspace
1. Renamed the testcase to emphasize that it only focuses on testing
   changing RF - there are other tests that test ALTER tablets KS
in general.
2. Fixed whitespaces according to PEP8

(cherry picked from commit adf453af3f)
2024-10-08 18:06:42 +00:00
Piotr Smaron
ad2191e84f cql: remove unused helper function from test_tablets
`change_default_rf` is not used anywhere, moreover it uses
`replication_factor` tag, which is forbidden in ALTER tablets KS
statement.

(cherry picked from commit 042825247f)
2024-10-08 18:06:41 +00:00
Sergey Zolotukhin
855abd7368 config: Add a warning about use of IP address for join topology and replace
operations.

When the '--ignore-dead-nodes-for-replace' config option contains
IP addresses, a warning will be logged, notifying the user that
using IP addresses with this option is deprecated and will no
longer be supported in the next release.

Fixes scylladb/scylladb#19218

(cherry picked from commit 6398b7548c)
2024-10-03 14:10:30 +00:00
Sergey Zolotukhin
086dc6d53c nodetool: Add IP address usage warning for 'ignore-dead-nodes'.
Since we are deprecating the use of IP addresses, a warning message will be printed
if 'nodetool removenode --ignore-dead-nodes' is used with IP addresses.

(cherry picked from commit 9c692438e9)
2024-10-03 14:10:29 +00:00
Sergey Zolotukhin
09b0b3f7d6 tests: Fix incorrect UUIDs in test_nodeops
It was found that the UUIDs used in test_nodeops were
invalid. This update replaces those UUIDs with newly generated
random UUIDs.

(cherry picked from commit a871321ecf)
2024-10-03 14:10:28 +00:00
Sergey Zolotukhin
3bbb7a24b1 utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
- utils::split_comma_separated_list now accepts a reference to sstring instead
  of a copy to avoid extra memory allocations. Additionally, the results of
  trimming are moved to the resulting vector instead of being copied.
- service/storage_service removenode, raft_removenode, find_raft_nodes_from_hoeps,
  parse_node_list and api/storage_service::set_storage_service were changed to use
  std::vector<host_id_or_endpoint> instead of std::list<host_id_or_endpoint> as
  std::vector is a more cache-friendly structure,  resulting in better performance.

(cherry picked from commit 3b9033423d)
2024-10-03 14:10:27 +00:00
Pavel Emelyanov
b43454c658 cql: Check that CREATEing tablets/vnodes is consistent with the CLI
There are two bits that control whenter replication strategy for a
keyspace will use tablets or not -- the configuration option and CQL
parameter. This patch tunes its parsing to implement the logic shown
below:

    if (strategy.supports_tablets) {
         if (cql.with_tablets) {
             if (cfg.enable_tablets) {
                 return create_keyspace_with_tablets();
             } else {
                 throw "tablets are not enabled";
             }
         } else if (cql.with_tablets = off) {
              return create_keyspace_without_tablets();
         } else { // cql.with_tablets is not specified
              if (cfg.enable_tablets) {
                  return create_keyspace_with_tablets();
              } else {
                  return create_keyspace_without_tablets();
              }
         }
     } else { // strategy doesn't support tablets
         if (cql.with_tablets == on) {
             throw "invalid cql parameter";
         } else if (cql.with_tablets == off) {
             return create_keyspace_without_tablets();
         } else { // cql.with_tablets is not specified
             return create_keyspace_without_tablets();
         }
     }

closes: #20088

In order to enable tablets "by default" for NetworkTopologyStrategy
there's explicit check near ks_prop_defs::get_initial_tablets(), that's
not very nice. It needs more care to fix it, e.g. provide feature
service reference to abstract_replication_strategy constructor. But
since ks_prop_defs code already highjacks options specifically for that
strategy type (see prepare_options() helper), it's OK for now.

There's also #20768 misbehavior that's preserved in this patch, but
should be fixed eventually as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit ebedc57300)

Closes scylladb/scylladb#20927
2024-10-03 17:09:49 +03:00
Jenkins Promoter
93700ff5d1 Update ScyllaDB version to: 6.2.0-rc2 2024-10-02 14:58:37 +03:00
Anna Stuchlik
5e2b4a0e80 doc: add metric updates from 6.1 to 6.2
This commit specifies metrics that are new in version 6.2 compared to 6.1,
as specified in https://github.com/scylladb/scylladb/issues/20176.

Fixes https://github.com/scylladb/scylladb/issues/20176

(cherry picked from commit a97db03448)

Closes scylladb/scylladb#20930
2024-10-02 12:07:06 +03:00
Calle Wilund
bb5dc0771c commitlog: Fix buffer_list_bytes not updated correctly
Fixes #20862

With the change in 60af2f3cb2 the bookkeep
for buffer memory was changed subtly, the problem here that we would
shrink buffer size before we after flush use said buffer's size to
decrement the buffer_list_bytes value, previously inc:ed by the full,
allocated size. I.e. we would slowly grow this value instead of adjusting
properly to actual used bytes.

Test included.

(cherry picked from commit ee5e71172f)

Closes scylladb/scylladb#20902
2024-10-01 17:41:02 +03:00
Aleksandra Martyniuk
9ed8519362 node_ops: fix task_manager_module::get_nodes()
Currently, node ops virtual task gathers its children from all nodes contained
in a sum of service::topology::normal_nodes and service::topology::transition_nodes.
The maps may contain nodes that are down but weren't removed yet. So, if a user
requests the status of a node ops virtual task, the task's attempt to retrieve
its children list may fail with seastar::rpc::closed_error.

Filter out the tasks that are down in node_ops::task_manager_module::get_nodes.

Fixes: #20843.
(cherry picked from commit a558abeba3)

Closes scylladb/scylladb#20898
2024-10-01 14:52:11 +03:00
Avi Kivity
077d7c06a0 Merge '[Backport 6.2] sstables: Fix use-after-free on page cache buffer when parsing promoted index entries across pages' from ScyllaDB
This fixes a use-after-free bug when parsing clustering key across
pages.

Also includes a fix for allocating section retry, which is potentially not safe (not in practice yet).

Details of the first problem:

Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.

To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.

In 93482439, the promoted index cursor was optimized to avoid
fully page copy when parsing index blocks. Instead, parser is
given a temporary_buffer which is a view on the page.

A bit earlier, in b1b5bda, the parser was changed to keep shared
fragments of the buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.

If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.

The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.

This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.

Details of the solution:

We adapt page_view to a temporary_buffer-like API. For this, a new concept
is introduced called ContiguousSharedBuffer. We also change parsers so that
they can be templated on the type of the buffer they work with (page_view vs
temporary_buffer). This way we don't introduce indirection to existing algorithms.

We use page_view instead of temporary_buffer in the promoted
index parser which works with page cache buffers. page_view can be safely
shared via share() and stored across allocating sections. It keeps hold to the
LSA buffer even across allocating sections by the means of cached_file::page_ptr.

Fixes #20766

(cherry picked from commit 8aca93b3ec)

(cherry picked from commit ac823b1050)

(cherry picked from commit 93bfaf4282)

(cherry picked from commit c0fa49bab5)

(cherry picked from commit 29498a97ae)

(cherry picked from commit c15145b71d)

(cherry picked from commit 7670ee701a)

(cherry picked from commit c09fa0cb98)

(cherry picked from commit 0279ac5faa)

(cherry picked from commit 8e54ecd38e)

(cherry picked from commit b5ae7da9d2)

Refs #20837

Closes scylladb/scylladb#20905

* github.com:scylladb/scylladb:
  sstables: bsearch_clustered_cursor: Add trace-level logging
  sstables: bsearch_clustered_cursor: Move definitions out of line
  test, sstables: Verify parsing stability when allocating section is retried
  test, sstables: Verify parsing stability when buffers cross page boundary
  sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
  cached_file: Adapt page_view to ContiguousSharedBuffer
  cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
  sstables, utils: Allow parsers to work with different buffer types
  sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
  sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
  sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried
2024-10-01 14:51:29 +03:00
Tomasz Grabiec
5a1575678b sstables: bsearch_clustered_cursor: Add trace-level logging
(cherry picked from commit b5ae7da9d2)
2024-10-01 01:38:48 +00:00
Tomasz Grabiec
2401f7f9ca sstables: bsearch_clustered_cursor: Move definitions out of line
In order to later use the formatter for the inner class
promoted_index_block, which is defined out of line after
cached_promoted_index class definition.

(cherry picked from commit 8e54ecd38e)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
906d085289 test, sstables: Verify parsing stability when allocating section is retried
(cherry picked from commit 0279ac5faa)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
34dd3a6daa test, sstables: Verify parsing stability when buffers cross page boundary
(cherry picked from commit c09fa0cb98)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
3afa8ee2ca sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
This fixes a use-after-free bug when parsing clustering key across
pages.

Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.

To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.

In b1b5bda, the parser was changed to keep shared fragments of the
buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.

If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.

The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.

This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.

The solution is to use page_view instead of temporary_buffer, which
can be safely shared via share() and stored across allocating
section. The page_view maintains its hold to the LSA buffer even
across allocating sections.

Fixes #20766

(cherry picked from commit 7670ee701a)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
3347152ff9 cached_file: Adapt page_view to ContiguousSharedBuffer
(cherry picked from commit c15145b71d)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
ff7bd937e2 cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
Will be easier to implement ContiguousSharedBuffer API as the buffer
size will be equal to _size.

(cherry picked from commit 29498a97ae)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
50ea1dbe32 sstables, utils: Allow parsers to work with different buffer types
Currently, parsers work with temporary_buffer<char>. This is unsafe
when invoked by bsearch_clustered_cursor, which reuses some of the
parsers, and passes temporary_buffer<char> which is a view onto LSA
buffer which comes from the index file page cache. This view is stable
only around consume(). If parsing requires more than one page, it will
continue with a different input buffer. The old buffer will be
invalid, and it's unsafe for the parser to store and access
it. Unfortunetly, the temporary_buffer API allows sharing the buffer
via the share() method, which shares the underlying memory area. This
is not correct when the underlying is managed by LSA, because storage
may move. Parser uses this sharing when parsing blobs, e.g. clustering
key components. When parsing resumes in the next page, parser will try
to access the stored shared buffers pointing to the previous page,
which may result in use-after-free on the memory area.

In prearation for fixing the problem, parametrize parsers to work with
different kinds of buffers. This will allow us to instantiate them
with a buffer kind which supports sharing of LSA buffers properly in a
safe way.

It's not purely mechanical work. Some parts of the parsing state
machine still works with temporary_buffer<char>, and allocate buffers
internally, when reading into linearized destination buffer. They used
to store this destination in _read_bytes vector, same field which is
used to store the shared buffers. Now it's not possible, since shared
buffer type may be different than temporary_buffer<char>. So those
paths were changed to use a new field: _read_bytes_buf.

(cherry picked from commit c0fa49bab5)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
45125c4d7d sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
When reset() is done due to allocating section retry, it can be
theoretically in an arbitrary point. So we should not assume that it
finished parsing and state was reset by previous parsing. We should
reset all the fields.

(cherry picked from commit 93bfaf4282)
2024-10-01 01:38:46 +00:00
Tomasz Grabiec
9207f7823d sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
To unify logic which handles allocating section retry, and thus
improve safety.

(cherry picked from commit ac823b1050)
2024-10-01 01:38:46 +00:00
Tomasz Grabiec
711864687f sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried
Parser's state was not reset when allocating section was retried.

This doesn't cause problems in practice, because reserves are enough
to cover allocation demands of parsing clustering keys, which are at
most 64K in size. But it's still potentially unsafe and needs fixing.

(cherry picked from commit 8aca93b3ec)
2024-10-01 01:38:45 +00:00
Kamil Braun
faf11e5bc3 Merge '[Backport 6.2] Populate raft address map from gossiper on raft configuration change' from ScyllaDB
For each new node added to the raft config populate it's ID to IP mapping in raft address map from the gossiper. The mapping may have expired if a node is added to the raft configuration long after it first appears in the gossiper.

Fixes scylladb/scylladb#20600

Backport to all supported versions since the bug may cause bootstrapping failure.

(cherry picked from commit bddaf498df)

(cherry picked from commit 9e4cd32096)

Refs #20601

Closes scylladb/scylladb#20847

* github.com:scylladb/scylladb:
  test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
  group0: make sure that address map has an entry for each new node in the raft configuration
2024-09-30 17:01:52 +02:00
Gleb Natapov
f9215b4d7e test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
(cherry picked from commit 9e4cd32096)
2024-09-26 21:13:34 +00:00
Gleb Natapov
469ac9976a group0: make sure that address map has an entry for each new node in the raft configuration
ID->IP mapping is added to the raft address map when the mapping first
appears in the gossiper, but it is added as expiring entry. It becomes
non expiring when a node is added to raft configuration. But when a node
joins those two events may be distant in time (since the node's request
may sit in the topology coordinator queue for a while) and mappings may
expire already from the map. This patch makes sure to transfer the
mapping from the gossiper for a node that is added to the raft
configuration instead of assuming that the mapping is already there.

(cherry picked from commit bddaf498df)
2024-09-26 21:13:33 +00:00
Botond Dénes
d341f1ef1e Merge '[Backport 6.2] mark node as being replaced earlier' from ScyllaDB
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.

Fixes: https://github.com/scylladb/scylladb/issues/20629

Need to be backported since this is a regression

(cherry picked from commit 644e7a2012)

(cherry picked from commit c0939d86f9)

(cherry picked from commit 1b4c255ffd)

Refs #20743

Closes scylladb/scylladb#20829

* github.com:scylladb/scylladb:
  test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
  topology coordinator:: mark node as being replaced earlier
  topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
2024-09-26 10:37:25 +03:00
Kamil Braun
07dfcd1f64 service: raft: fix rpc error message
What it called "leader" is actually the destination of the RPC.

Trivial fix, should be backported to all affected versions.

(cherry picked from commit 09c68c0731)

Closes scylladb/scylladb#20826
2024-09-26 10:33:50 +03:00
Anna Stuchlik
f8d63b5572 doc: add OS support for version 6.2
This commit adds the OS support for version 6.2.
In addition, it removes support for 6.0, as the policy is only to include
information for the supported versions, i.e., the two latest versions.

Fixes https://github.com/scylladb/scylladb/issues/20804

(cherry picked from commit 8145109120)

Closes scylladb/scylladb#20825
2024-09-26 10:29:08 +03:00
Anna Stuchlik
ca83da91d1 doc: add an intro to the Features page
This commit modifies the Features page in the following way:

- It adds a short introduction and descriptions to each listed feature.
- It hides the ToC (required to control and modify the information on the page,
  e.g., to add descriptions, have full control over what is displayed, etc.)
- Removes the info about Enterprise features (following the request not to include
  Enterprise info in the OSS docs)

Fixes https://github.com/scylladb/scylladb/issues/20617
Blocks https://github.com/scylladb/scylla-enterprise/pull/4711

(cherry picked from commit da8047a834)

Closes scylladb/scylladb#20811
2024-09-26 10:22:36 +03:00
Botond Dénes
f55081fb1a Merge '[Backport 6.2] Rename Alternator batch item count metrics' from ScyllaDB
This PR addresses multiple issues with alternator batch metrics:

1. Rename the metrics to scylla_alternator_batch_item_count with op=BatchGetItem/BatchWriteItem
2. The batch size calculation was wrong and didn't count all items in the batch.
3. Add a test to validate that the metrics values increase by the correct value (not just increase). This also requires an addition to the testing to validate ops of different metrics and an exact value change.

Needs backporting to allow the monitoring to use the correct metrics names.

Fixes #20571

(cherry picked from commit 515857a4a9)

(cherry picked from commit 905408f764)

(cherry picked from commit 4d57a43815)

(cherry picked from commit 8dec292698)

Refs #20646

Closes scylladb/scylladb#20758

* github.com:scylladb/scylladb:
  alternator:test_metrics test metrics for batch item count
  alternator:test_metrics Add validating the increased value
  alternator: Fix item counting in batch operations
  Alterntor rename batch item count metrics
2024-09-26 10:22:00 +03:00
Anna Stuchlik
aa8cdec5bd doc: fix a broken link
This commit fixes a link to the Manager by adding a missing underscore
to the external link.

(cherry picked from commit aa0c95c95c)

Closes scylladb/scylladb#20710
2024-09-26 10:18:59 +03:00
Anna Stuchlik
75a2484dba doc: update the unified installer instructions
This commit updates the unified installer instructions to avoid specifying a given version.
At the moment, we're technically unable to use variables in URLs, so we need to update
the page each release.

Fixes https://github.com/scylladb/scylladb/issues/20677

(cherry picked from commit 400a14eefa)

Closes scylladb/scylladb#20708
2024-09-26 10:04:35 +03:00
Gleb Natapov
37387135b4 test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
(cherry picked from commit 1b4c255ffd)
2024-09-26 03:45:50 +00:00
Gleb Natapov
ac24ab5141 topology coordinator:: mark node as being replaced earlier
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.

(cherry picked from commit c0939d86f9)
2024-09-26 03:45:50 +00:00
Gleb Natapov
729dc03e0c topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
During replace with the same IP a node may get queries that were intended
for the node it was replacing since the new node declares itself UP
before it advertises that it is a replacement. But after the node
starts replacing procedure the old node is marked as "being replaced"
and queries no longer sent there. It is important to do so before the
new node start to get raft snapshot since the snapshot application is
not atomic and queries that run parallel with it may see partial state
and fail in weird ways. Queries that are sent before that will fail
because schema is empty, so they will not find any tables in the first
place. The is pre-existing and not addressed by this patch.

(cherry picked from commit 644e7a2012)
2024-09-26 03:45:50 +00:00
Kamil Braun
9d64ced982 test: fix topology_custom/test_raft_recovery_stuck flakiness
The test performs consecutive schema changes in RECOVERY mode. The
second change relies on the first. However the driver might route the
changes to different servers and we don't have group 0 to guarantee
linearizability. We must rely on the first change coordinator to push
the schema mutations to other servers before returning, but that only
happens when it sees other servers as alive when doing the schema
change. It wasn't guaranteed in the test. Fix this.

Fixes scylladb/scylladb#20791

Should be backported to all branches containing this test to reduce
flakiness.

(cherry picked from commit f390d4020a)

Closes scylladb/scylladb#20807
2024-09-25 15:11:10 +02:00
Abhinav
ea6349a6f5 raft topology: add error for removal of non-normal nodes
In the current scenario, We check if a node being removed is normal
on the node initiating the removenode request. However, we don't have a
similar check on the topology coordinator. The node being removed could be
normal when we initiate the request, but it doesn't have to be normal when
the topology coordinator starts handling the request.
For example, the topology coordinator could have removed this node while handling
another removenode request that was added to the request queue earlier.

This commit intends to fix this issue by adding more checks in the enqueuing phase
and return errors for duplicate requests for node removal.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#20271
(cherry picked from commit b25b8dccbd)

Closes scylladb/scylladb#20799
2024-09-25 11:34:20 +02:00
Benny Halevy
ed9122a84e time_window_compaction_strategy: get_reshaping_job: restrict sort of multi_window vector to its size
Currently the function calls boost::partial_sort with a middle
iterator that might be out of bound and cause undefined behavior.

Check the vector size, and do a partial sort only if its longer
than `max_sstables`, otherwise sort the whole vector.

Fixes scylladb/scylladb#20608

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20609

(cherry picked from commit 39ce358d82)

Refs: scylladb/scylladb#20609
2024-09-23 16:02:40 +03:00
Amnon Heiman
c7d6b4a194 alternator:test_metrics test metrics for batch item count
This patch adds tests for the batch operations item count.

The tests validate that the metrics tracking the number of items
processed in a batch increase by the correct amount.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 8dec292698)
2024-09-23 11:02:55 +00:00
Amnon Heiman
a35e138b22 alternator:test_metrics Add validating the increased value
The `check_increases_operation` now allows override the checked metric.

Additionally, a custom validation value can now be passed, which make it
possible to validate the amount by which a value has changed, rather
than just validating that the value increased.

The default behavior of validating that values have increased remains
unchanged, ensuring backward compatibility.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 4d57a43815)
2024-09-23 11:02:55 +00:00
Amnon Heiman
3db67faa8a alternator: Fix item counting in batch operations
This patch fixes the logic for counting items in batch operations.
Previously, the item count in requests was inaccurate, it count the
number of tabels in get_item and the request_items in write_items.

The new logic correctly counts each individual item in `BatchGetItem`
and `BatchWriteItem` requests.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 905408f764)
2024-09-23 11:02:55 +00:00
Amnon Heiman
6a12174e2d Alterntor rename batch item count metrics
This patch renames metrics tracking the total number of items in a batch
to `scylla_alternator_batch_item_count`.  It uses the existing `op` label to
differentiate between `BatchGetItem` and `BatchWriteItem` operations.

Ensures better clarity and distinction for batch operations in monitoring.

This an example of how it looks like:
 # HELP scylla_alternator_batch_item_count The total number of items processed across all batches
 # TYPE scylla_alternator_batch_item_count counter
 scylla_alternator_batch_item_count{op="BatchGetItem",shard="0"} 4
 scylla_alternator_batch_item_count{op="BatchWriteItem",shard="0"} 4

(cherry picked from commit 515857a4a9)
2024-09-23 11:02:55 +00:00
Piotr Dulikowski
ca0096ccb8 Merge '[Backport 6.2] message/messaging_service: guard adding maintenance tenant under cluster feature' from Michał Jadwiszczak
In https://github.com/scylladb/scylladb/pull/18729, we introduced a new statement tenant $maintenance, but the change wasn't protected by any cluster feature.
This wasn't a problem for OSS, since unknown isolation cookie just uses default scheduling group. However, in enterprise that leads to creating a service level on not-upgraded nodes, which may end up in an error if user create maximum number of service levels.

This patch adds a cluster feature to guard adding the new tenant. It's done in the way to handle two upgrade scenarios:

version without $maintenance tenant -> version with $maintenance tenant guarded by a feature
version with $maintenance tenant but not guarded by a feature -> version with $maintenance tenant guarded by a feature
The PR adds enabled flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create a connection, but it can be used to accept an incoming connection.
The $maintenance tenant is added to the config as disabled and it gets enabled once the corresponding feature is enabled.

Fixes https://github.com/scylladb/scylladb/issues/20070
Refs https://github.com/scylladb/scylla-enterprise/issues/4403

(cherry picked from commit d44844241d)

(cherry picked from commit 71a03ef6b0)

(cherry picked from commit b4b91ca364)

Refs https://github.com/scylladb/scylladb/pull/19802

Closes scylladb/scylladb#20690

* github.com:scylladb/scylladb:
  message/messaging_service: guard adding maintenance tenant under cluster feature
  message/messaging_service: add feature_service dependency
  message/messaging_service: add `enabled` flag to statement tenants
2024-09-23 09:48:12 +02:00
Jenkins Promoter
a71d4bc49c Update ScyllaDB version to: 6.2.0-rc1 2024-09-19 10:21:33 +03:00
Michał Jadwiszczak
749399e4b8 message/messaging_service: guard adding maintenance tenant under cluster feature
Set `enabled` flag for `$maintenance` tenant to false and
enable it when `MAINTENANCE_TENANT` feature is enabled.

(cherry-picked from b4b91ca364)
2024-09-18 19:10:24 +02:00
Michał Jadwiszczak
bdd97b2950 message/messaging_service: add feature_service dependency
(cherry-picked from 71a03ef6b0)
2024-09-18 19:09:46 +02:00
Michał Jadwiszczak
1a056f0cab message/messaging_service: add enabled flag to statement tenants
Adding a new tenant needs to be done under cluster feature protection.
However it wasn't the case for adding `$maintenance` statement tenant
and to fix it we need to support an upgrade from node which doesn't
know about maintenance tenant at all and from one which uses it without
any cluster feature protection.

This commit adds `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create
a connection, but it can be used to accept an incoming connection.

(cherry-picked from d44844241d)
2024-09-18 19:09:06 +02:00
Tzach Livyatan
cf78a2caca Update client-node-encryption: OpsnSSL is FIPS *enabled*
Closes scylladb/scylladb#19705

(cherry picked from commit cb864b11d8)
2024-09-18 11:58:46 +03:00
Anna Mikhlin
cbc53f0e81 Update ScyllaDB version to: 6.2.0-rc0 2024-09-17 13:40:50 +03:00
Botond Dénes
a4a8cad97f Merge 'atomic_delete: allow deletion of sstables from several prefixes' from Benny Halevy
Allow create_pending_deletion_log to delete a bunch of sstables
potentially resides in different prefixes (e.g. in the base directory
and under staging/).

The motivation arises from table::cleanup_tablet that calls compaction_group::cleanup on all cg:s via cleanup_compaction_groups.  Cleanup, in turn, calls delete_sstables_atomically on all sstables in the compaction_group, in all states, including the normal state as well as staging - hence the requirement to support deleting sstables in different sub-directories.

Also, apparently truncate calls delete_atomically for all sstables too, via table::discard_sstables, so if it happened to be executed during view update generation, i.e. when there are sstables in staging, it should hit the assertion failure reported in https://github.com/scylladb/scylladb/issues/18862 as well (although I haven't seen it yet, but I see no reason why it would happen). So the issue was apparently present since the initial implementation of the pending_delete_log. It's just that with tablet migration it is more likely to be hit.

Fixes scylladb/scylladb#18862

Needs backport to 6.0 since tablets require this capability

Closes scylladb/scylladb#19555

* github.com:scylladb/scylladb:
  sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
  sstables: storage: keep base directory in base class
  sstables: storage: define opened_directory in header file
  sstable_directory: use only dirlog
2024-09-17 08:30:40 +03:00
Lakshmi Narayanan Sreethar
626f55a2ea compaction: run cleanup under maintenance scheduling group
The cleanup compaction task is a maintenance operation that runs after
topology changes. So, run it under the maintenance scheduling group to
avoid interference with regular compaction tasks. Also remove the share
allocations done by the cleanup task, as they are unnecessary when
running under the maintenance group.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#20582
2024-09-16 16:58:43 +03:00
Avi Kivity
870d1c16f7 scripts: fix bin/cqlsh shortcut
Since 3c7af28725, the cqlsh submodule no longer contains a
bin/cqlsh shell script. This broke the supermodule's bin/cqlsh
shortcut.

Fix it by invoking cqlsh.py directly.

Closes scylladb/scylladb#20591
2024-09-16 09:52:29 +03:00
Botond Dénes
ea29fe579b Merge 'replica: ignore cleanup of deallocated storage group' from Aleksandra Martyniuk
Cleanup of a deallocated tablet throws an exception.
Since failed cleanup is retried, we end up in an infinite loop.

Ignore cleanup of deallocated storage groups.

Fixes:  #19752.

Needs to be backported to all branches with tablets (6.0 and later)

Closes scylladb/scylladb#20584

* github.com:scylladb/scylladb:
  test: check if cleanup of deallocated sg is ignored
  replica: ignore cleanup of deallocated storage group
2024-09-16 09:22:56 +03:00
Gleb Natapov
695f112795 paxos_state: release semaphore units before checking if a semaphore can be dropped
To drop a semaphore it should not be held by anyone, so we need to
release out units before checking if a semaphore can be dropped.

Fixes: scylladb/scylladb#20602

Closes scylladb/scylladb#20607
2024-09-15 21:21:03 +03:00
Kefu Chai
028410ba58 mutation_writer: use bucket parameter instead of using it->first
as `_bucket` is an `unordered_map<bucket_id, timestamp_bucket_writer>`,
when writing to a given bucket, we try to create a writer with the
specified bucket id, so the returned iterator should point to a node
whose `first` element is always the bucket id.

so, there is no need to reference `it` for the bucket id, let's just
reference the parameter. simpler this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20598
2024-09-15 20:05:12 +03:00
Kefu Chai
49f232f405 compaction: fix a typo in comment
s/expection/exception/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20594
2024-09-15 16:09:01 +03:00
Avi Kivity
b9bc783418 cql3: selection: don't ignore regular column restriction if a regular row is not present
If a regular row isn't present, no regular column restriction
(say, r=3) can pass since all regular columns are presented as NULL,
and we don't have an IS NULL predicate. Yet we just ignore it.

Handle the restriction on a missing column by return false, signifying
the row was filtered out.

We have to move the check after the conditional checking whether there's
any restriction at all, otherwise we exit early with a false failure.

Unit test marked xfail on this issue are now unmarked.

A subtest of test_tombstone_limit is adjusted since it depended on this
bug. It tested a regular column which wasn't there, and this bug caused
the filter to be ignored. Change to test a static column that is there.

A test for a bug found while developing the patch is also added. It is
also tested by test_tombstone_limit, but better to have a dedicated test.

Fixes #10357

Closes scylladb/scylladb#20486
2024-09-15 13:44:16 +03:00
Botond Dénes
6d8e9645ce test/*/run: restore --vnodes into working order
This option was silently broken when --enable-tablet's default changed
from false to true. The reason is that when --vnodes is passed, run only
removes --enable-tablets=true from scylla's command line. With the new
default this is not enough, we need to explicitely disable tablets to
override the default.

Closes scylladb/scylladb#20462
2024-09-13 17:10:09 +03:00
Nadav Har'El
f255391d52 cql-pytest: translate Cassandra's tests for arithmetic operators
This is a translation of Cassandra's CQL unit test source file
OperationFctsTest.java into our cql-pytest framework.

This is a massive test suite (over 800 lines of code) for Cassandra's
"arithmetic operators" CQL feature (CASSANDRA-11935), which was added
to Cassandra almost 8 years ago (and reached Cassandra 4.0), but we
never implemented it in Scylla.

All of the tests in suite fail in ScyllaDB due to our lack of this
feature:

  Refs #2693: Support arithmetic operators

One test also discovered a new issue:

  Refs #20501: timestamp column doesn't allow "UTC" in string format

All the tests pass on Cassandra.

Some of the tests insist on specific error message strings and specific
precision for decimal arithmetic operations - where we may not necessarily
want to be 100% compatible with Cassandra in our eventual implementation.
But at least the test will allow us to make deliberate - and not accidental -
deviations from compatibility with Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20502
2024-09-13 14:52:59 +03:00
Botond Dénes
d3a9654fcc Merge 'Make use of async() context in sstable_mutation_test' from Pavel Emelyanov
This test runs all its cases in seastar thread, but still uses .then() continuations in some of them.
This PR converts all continuations into plain .get()-s.

Closes scylladb/scylladb#20457

* github.com:scylladb/scylladb:
  test: Restore indentation after previous changes
  test: Threadify tombstone_in_tombstone2()
  test: Threadify range_tombstone_reading()
  test: Threadify tombstone_in_tombstone()
  test: Threadify broken_ranges_collection()
  test: Threadify compact_storage_dense_read()
  test: Threadify compact_storage_simple_dense_read()
  test: Threadify compact_storage_sparse_read()
  test: Simplify test_range_reads() counting
  test: Simplify test_range_reads() inner loop
  test: Threadify test_range_reads() itself
  test: Threadify test_range_reads() callers
  test: Threadify generate_clustered() itself
  test: Threadify generate_clustered() callers
  test: Threadify test_no_clustered test
  test: Threadify nonexistent_key test
2024-09-13 14:09:53 +03:00
Aleksandra Martyniuk
2c4b1d6b45 test: check if cleanup of deallocated sg is ignored 2024-09-13 13:00:58 +02:00
Aleksandra Martyniuk
20d6cf55f2 replica: ignore cleanup of deallocated storage group
Currently, attempt to cleanup deallocated storage group throws
an exception. Failed tablet cleanup is retried, stucking
in an endless loop.

Ignore cleanup of deallocated storage group.
2024-09-13 13:00:53 +02:00
Andrei Chekun
bad7407718 test.py: Add support for BOOST_DATA_TEST_CASE
Currently, test.py will throw an error if the test will use
BOOST_DATA_TEST_CASE. test.py as a first step getting all test
functions in the file, but when BOOST_DATA_TEST_CASE will be used the
output will have additional lines indicating parametrized test that
test.py can not handle. This commit adds handling this case, as a caveat
all tests should start from 'test' or they will be ignored.

Closes: #20530

Closes scylladb/scylladb#20556
2024-09-13 13:44:26 +03:00
Botond Dénes
7cb8cab2ae Merge 'Remove make_shared_schema() helper' from Pavel Emelyanov
This function was obsoleted by schema_builder some time ago. Not to patch all its callers, that helper became wrapper around it. Remained users are all in tests, and patching the to use builder directory makes the code shorter in many cases.

Closes scylladb/scylladb#20466

* github.com:scylladb/scylladb:
  schema: Ditch make_shared_schema() helper
  test: Tune up indentation in uncompressed_schema()
  test: Make tests use schema_builder instead of make_shared_schema
2024-09-13 12:25:10 +03:00
Pavel Emelyanov
730731da4a test: Remove unused table config from max_ongoing_compaction_test
The local config is unused since #15909, when the table creation was
changed to use env's facilities.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20511
2024-09-13 12:21:56 +03:00
Pavel Emelyanov
4c77f474ed test: Remove unused upload_path local variable
Since #14152 creation of an sstable takes table dir and its state. The
test in question wants to create and sstable in upload/ subdir and for
that it used to maintain full "cf.dir/upload" path, which is not
required any more.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20514
2024-09-13 12:21:00 +03:00
Pavel Emelyanov
e9a1c0716f test: Use sstables::test_env to make sstables for directory test
This is continuation of #20431 in another test. After #20395 it's also
possible to remove unused local dir variables.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20541
2024-09-13 12:19:59 +03:00
Botond Dénes
4fb194117e Merge 'Generalize multipart upload implementations in S3 client' from Pavel Emelyanov
There are two currently -- upload_sink_base and do_upload_file. This PR merges as much code as possible (spoiler: it's already mostly copy-n-pase-d, so squashing is pretty straightforward)

Closes scylladb/scylladb#20568

* github.com:scylladb/scylladb:
  s3/client: Reuse class multipart_upload in do_upload_file
  s3/client: Split upload_sink_base class into two
2024-09-13 10:35:10 +03:00
Kefu Chai
cf1f90fe0c auth: remove unused #include
the `seastar/core/print.hh` header is no longer required by
`auth/resource.hh`. this was identified by clang-include-cleaner.
As the code is audited, wecan safely remove the #include directive.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20575
2024-09-13 09:49:05 +03:00
Botond Dénes
c7c5817808 Merge 'Improve timestamp heuristics for tombstone garbage collection' from Benny Halevy
When purging regular tombstone consult the min_live_timestamp, if available.
This is safe since we don't need to protect dead data from resurrection, as it is already dead.

For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.

If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk resurrecting any deleted data.

In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older or equal T. So to know if a whole sstable can cause problems for shadowable tombstone of time T, we need to check if the sstable's oldest row marker (and not oldest column) is older or equal T. And the same check applies similarly to the memtable.

If both extended timestamp statistics are missing, fallback to the legacy (and inaccurate) min_timestamp.

Fixes scylladb/scylladb#20423
Fixes scylladb/scylladb#20424

> [!NOTE]
> no backport needed at this time
> We may consider backport later on after given some soak time in master/enterprise
> since we do see tombstone accumulation in the field under some materialized views workloads

Closes scylladb/scylladb#20446

* github.com:scylladb/scylladb:
  cql-pytest: add test_compaction_tombstone_gc
  sstable_compaction_test: add mv_tombstone_purge_test
  sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
  sstable_compaction_test: tombstone_purge_test: add testlog debugging
  sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
  sstable, compaction: add debug logging for extended min timestamp stats
  compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
  compaction: define max_purgeable_fn
  tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
  sstables: scylla_metadata: add ext_timestamp_stats
  compaction_group, storage_group, table_state: add extended timestamp stats getters
  sstables, memtable: track live timestamps
  memtable_encoding_stats_collector: update row_marker: do nothing if missing
2024-09-13 08:56:51 +03:00
Takuya ASADA
3cd2a61736 dist: drop scylla-jmx
Since JMX server is deprecated, drop them from submodule, build system
and package definition.

Related scylladb/scylla-tools-java#370
Related #14856

Signed-off-by: Takuya ASADA <syuu@scylladb.com>

Closes scylladb/scylladb#17969
2024-09-13 07:59:45 +03:00
Botond Dénes
fc9804ec31 Update tools/java submodule
* tools/java 0b4accdd...e505a6d3 (1):
  > [C-S] Make it use DCAwareRoundRobinPolicy unless rack is provided

Closes scylladb/scylladb#20562
2024-09-13 06:30:04 +03:00
Pavel Emelyanov
17e7d3145c s3/client: Reuse class multipart_upload in do_upload_file
Uploading a file is implemented by the do_upload_file class. This class
re-implements a big portion of what's currently in multipart_upload one.
This patch makes the former class inherit from the latter and removes
all the duplication from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-12 18:38:16 +03:00
Pavel Emelyanov
14b741afc9 s3/client: Split upload_sink_base class into two
This class implements two facilities -- multipart upload protocol itself
plus some common parts of upload_sink_impl (in fact -- only close() and
plugs put(packet)).

This patch aplits those two facilities into two classes. One of them
will be re-used later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-12 18:00:19 +03:00
Sergey Zolotukhin
612a141660 raft: Fix race condition on override_snapshot_thresholds.
When the server_impl::applier_fiber is paused by a co_await at line raft/server.cc:1375:
```
co_await override_snapshot_thresholds();
```
a new snapshot may be applied, which updates the actual values of the log's last applied
and snapshot indexes. As a result, the new snapshot index could become higher than the
old value stored in _applied_idx at line raft/server.cc:1365, leading to an assertion
failure in log::last_conf_for().
Since error injection is disabled in release builds, this issue does not affect production releases.

This issue was introduced in the following commit
9dfa041fe1,
when error injection was added to override the log snapshot configuration parameters.

How to reproduce:

1. Build debug version of randomized_nemesis_test
```
ninja-build build/debug/test/raft/randomized_nemesis_test
```
2. Run
```
parallel --halt now,fail=1 -j20 'build/debug/test/raft/randomized_nemesis_test \
--run_test=test_frequent_snapshotting  -- -c2 -m2G --overprovisioned --unsafe-bypass-fsync 1 \
--kernel-page-cache 1 --blocked-reactor-notify-ms 2000000  --default-log-level \
trace > tmp/logs/eraseme_{}.log  2>&1 && rm tmp/logs/eraseme_{}.log' ::: {1..1000}
```

Fixes scylladb/scylladb#20363

Closes scylladb/scylladb#20555
2024-09-12 16:19:27 +02:00
Aleksandra Martyniuk
59fba9016f docs: operating-scylla: add task manager docs
Admin-facing documentation of task manager.

Closes scylladb/scylladb#20209
2024-09-12 16:42:28 +03:00
Nadav Har'El
d49dbb944c Merge 'doc: move Alternator in the page tree and remove it's redundant ToC' from Anna Stuchlik
This PR hides the ToC on the Alternator page, as we don't need it, especially at the end of the page.

The ToC must be hidden rather than removed because removing it would, in turn, remove the "Getting Started With ScyllaDB Alternator" and "ScyllaDB Alternator for DynamoDB users" from the page tree and make them inaccessible.

In addition, this PR moves Alternator higher in the page tree.

Fixes https://github.com/scylladb/scylladb/issues/19823

Closes scylladb/scylladb#20565

* github.com:scylladb/scylladb:
  doc: move Alternator higher in the page tree
  doc: hide the redundant ToC on the Alternator page
2024-09-12 15:58:34 +03:00
Nadav Har'El
930accad12 alternator: return error on unused AttributeDefinitions
A CreateTable request defines the KeySchema of the base table and each
of its GSIs and LSIs. It also needs to give an AttributeDefinition for
each attribute used in a KeySchema - which among other things specifies
this attribute's type (e.g., S, N, etc.). Other, non-key, attributes *do
not* have a specified type, and accordingly must not be mentioned in
AttributeDefinitions.

Before this patch, Alternator just ignored unused AttributeDefinitions
entries, whereas DynamoDB throws an error in this case. This patch fixes
Alternator's behavior to match DynamoDB's - and adds a test to verify this.

Besides being more error-path-compatible with DynamoDB, this extra check
can also help users: We already had one user complaining that an
AttributeDefinitions setting he was using was ignored, not realizing
that it wasn't used by any KeySchema. A clear error message would have
saved this user hours of investigation.

Fixes #19784.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20378
2024-09-12 15:37:18 +03:00
Pavel Emelyanov
632a65bffa Merge 'repair: row_level: coroutinize more functions' from Avi Kivity
Coroutinize more functions in row-level repair to improve maintainability.

The functions all deal with repair buffers, so coroutinization does not affect performance.

Cleanup, no reason to backport

Closes scylladb/scylladb#20464

* github.com:scylladb/scylladb:
  repair: row_level: restore indentation
  repair: row_level: coroutinize repair_service::insert_repair_meta()
  repair: row_level: coroutinize repair_meta::get_full_row_hashes()
  repair: row_level: coroutinize repair_meta::apply_rows_on_follower()
  repair: row_level: coroutinize repair_meta::clear_working_row_buf()
  repair: row_level: coroutinize get_common_diff_detect_algorithm()
  repair: row_level: coroutinize repair_service::remove_repair_meta() (non-selective overload)
  repair: row_level: coroutinize repair_service::remove_repair_meta() (by-address overload)
  repair: row_level: coroutinize repair_service::remove_repair_meta() (by-id overload)
  repair: row_level: row_level_repair::run()
  repair: row_level: row_level_repair::send_missing_rows_to_follower_nodes()
  repair: row_level: row_level_repair::get_missing_rows_from_follower_nodes()
  repair: row_level: row_level_repair::negotiate_sync_boundary()
  repair: row_level: coroutinize repair_put_row_diff_with_rpc_stream_process_op()
  repair: row_level: coroutinize repair_meta::get_sync_boundary_handler()
  repair: row_level: coroutinize repair_meta::get_sync_boundary()
  repair: row_level: coroutinize repair_meta::repair_set_estimated_partitions_handler()
  repair: row_level: coroutinize repair_meta::repair_set_estimated_partitions()
  repair: row_level: coroutinize repair_meta::repair_get_estimated_partitions_handler()
  repair: row_level: coroutinize repair_meta::repair_get_estimated_partitions()
  repair: row_level: coroutinize repair_meta::repair_row_level_stop_handler()
  repair: row_level: coroutinize repair_meta::repair_row_level_stop()
  repair: row_level: coroutinize repair_meta::repair_row_level_start_handler()
  repair: row_level: coroutinize repair_meta::repair_row_level_start()
  repair: row_level: coroutinize repair_meta::get_combined_row_hash_handler()
  repair: row_level: coroutinize repair_meta::get_combined_row_hash()
  repair: row_level: coroutinize repair_meta::get_full_row_hashes_handler()
  repair: row_level: coroutinize repair_meta::get_full_row_hashes_with_rpc_stream()
  repair: row_level: coroutinize repair_meta::request_row_hashes()
2024-09-12 15:35:57 +03:00
Kefu Chai
197451f8c9 utils/rjson.cc: include the function name in exception message
recently, we are observing errors like:

```
stderr: error running operation: rjson::error (JSON SCYLLA_ASSERT failed on condition 'false', at: 0x60d6c8e 0x4d853fd 0x50d3ac8 0x518f5cd 0x51c4a4b 0x5fad446)
```

we only passed `false` to the `RAPIDJSON_ASSERT()` macro, so what we
have is but the type of the error (rjson::error) and a backtrace.
would be better if we can have more information without recompiling
or fetching the debug symbols for decipher the backtrace.

Refs scylladb/scylladb#20533
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20539
2024-09-12 15:22:49 +03:00
Anna Stuchlik
851e903f46 doc: move Alternator higher in the page tree 2024-09-12 14:08:26 +02:00
Anna Stuchlik
a32ff55c66 doc: hide the redundant ToC on the Alternator page
This commit hides the ToC, as we don't need it, especially at the end of the page.
The ToC must be hidden rather than removed because removing it would, in turn,
remove the "Getting Started With ScyllaDB Alternator" and "ScyllaDB Alternator for DynamoDB users"
from the page tree and make them inaccessible.
2024-09-12 14:01:15 +02:00
Alexey Novikov
8b6e987a99 test: add test_pinned_cl_segment_doesnt_resurrect_data
add test for issue when writes in commitlog segments pinned to another table can be resurrected.
This test based on dtest code published in #14870 and adapted for community version.
It's a regression test for #15060 fix and should fail before this patch and succeed afterwards.

Refs #14870, #15060

Closes scylladb/scylladb#20331
2024-09-12 10:58:22 +03:00
Takuya ASADA
90ab2a24df toolchain: restore multiarch build
When we introduced optimized clang at 6e487a4, we dropped multiarch build on frozen toolchain, because building clang on QEMU emulation is too heavy.

Actually, even after the patch merged, there are two mode which does not build clang, --clang-build-mode INSTALL_FROM and --clang-build-mode SKIP.
So we should restore multiarch build only these mode, and keep skipping on INSTALL mode since it builds clang.

Since we apply multiarch on INSTALL_FROM mode, --clang-archive replaced
to --clang-archive-x86_64 and --clang-archive-aarch64.

Note that this breaks compatibility of existing clang archive, since it
changes clang root directory name from llvm-project to llvm-project-$ARCH.

Closes #20442

Closes scylladb/scylladb#20444
2024-09-12 10:44:45 +03:00
Kefu Chai
3e84d43f93 treewide: use seastar::format() or fmt::format() explicitly
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.

that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:

```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
  265 |     return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
      |            ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
 4290 |     format(format_string<_Args...> __fmt, _Args&&... __args)
      |     ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
  143 | format(fmt::format_string<A...> fmt, A&&... a) {
      | ^
```

in this change, we

change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
  `seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
  because, `sstring::operator std::basic_string` would incur a deep
  copy.

we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-11 23:21:40 +03:00
Pavel Emelyanov
f227f4332c test: Remove unused path local variable
Left after #20499 :(

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20540
2024-09-11 23:10:25 +03:00
Avi Kivity
ed7d352e7d Merge 'Validate checksums for uncompressed SSTables' from Nikos Dragazis
This PR introduces a new file data source implementation for uncompressed SSTables that will be validating the checksum of each chunk that is being read. Unlike for compressed SSTables, checksum validation for uncompressed SSTables will be active for scrub/validate reads but not for normal user reads to ensure we will not have any performance regression.

It consists of:
* A new file data source for uncompressed SSTables.
* Integration of checksums into SSTable's shareable components. The validation code loads the component on demand and manages its lifecycle with shared pointers.
* A new `integrity_check` flag to enable the new file data source for uncompressed SSTables. The flag is currently enabled only through the validation path, i.e., it does not affect normal user reads.
* New scrub tests for both compressed and uncompressed SSTables, as well as improvements in the existing ones.
* A change in JSON response of `scylla validate-checksums` to report if an uncompressed SSTable cannot be validated due to lack of checksums (no `CRC.db` in `TOC.txt`).

Refs #19058.

New feature, no backport is needed.

Closes scylladb/scylladb#20207

* github.com:scylladb/scylladb:
  test: Add test to validate SSTables with no checksums
  tools: Fix typo in help message of scylla validate-checksums
  sstables: Allow validate_checksums() to report missing checksums
  test: Add test for concurrent scrub/validate operations
  test: Add scrub/validate tests for uncompressed SSTables
  test/lib: Add option to create uncompressed random schemas
  test: Add test for scrub/validate with file-level corruption
  test: Check validation errors in scrub tests
  sstables: Enable checksum validation for uncompressed SSTables
  sstables: Expose integrity option via crawling mutation readers
  sstables: Expose integrity option via data_consume_rows()
  sstables: Add option for integrity check in data streams
  sstables: Remove unused variable
  sstables: Add checksum in the SSTable components
  sstables: Introduce checksummed file data source implementation
  sstables: Replace assert with on_internal_error
2024-09-11 23:09:45 +03:00
Calle Wilund
b7839ec5d0 cql_test_env: Use temp socket + retry to ensure usable port for message_service if listen is enabled
Fixes #20543

In cql_test_env, if cfg_in.ms_listen is set, we try to get a free port for the current test on
which message service rpc can bind. This to allow multiple tests in parallel.

However, we just do this by using random and getting a number, not actually verifying it against
host ports in use.

This is complicated further by the fact that port reuse is effectively disabled in seastar
(see reactor::posix_reuseport_detect()). Due to this, the solution applied here is a combo
of
* Create temp socket with port = 0 to get a previously free port
* Close socket right before listen (to handle reuse not working)
* Retry on EADDRINUSE

Closes scylladb/scylladb#20547
2024-09-11 23:02:41 +03:00
Aleksandra Martyniuk
31ea74b96e db: system_keyspace: change version of topology_requests schema
In 880058073b a new column (request_type)
was added to topology_requests table, but the table's schema version
wasn't changed. Due to that during cluster upgrade, the old and the new
versions occur but they are not distinguishable.

Add offset to schema version of topology_requests table if it contains
request_type column.

Fixes: #20299.

Closes scylladb/scylladb#20402
2024-09-11 16:36:35 +03:00
Piotr Dulikowski
d98708013c Merge 'view: move view_build_status to group0' from Michael Litvak
Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations.

The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table.

New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table.

The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility.

When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows.

Fixes scylladb/scylladb#15329

dtest https://github.com/scylladb/scylla-dtest/pull/4827

Closes scylladb/scylladb#19745

* github.com:scylladb/scylladb:
  view: test view_build_status table with node replace
  test/pylib: use view_build_status_v2 table in wait_for_view
  view_builder: common write view_build_status function
  view_builder: improve migration to v2 with intermediate phase
  view: delete node rows from view_build_status on node removal
  view: sanitize view_build_status during migration
  view: make old view_build_status table a virtual table
  replica: move streaming_reader_lifecycle_policy to header file
  view_builder: test view_build_status_v2
  storage_service: add view_build_status to raft snapshot
  view_builder: migration to v2
  db:system_keyspace: add view_builder_version to scylla_local
  view_builder: read view status from v2 table
  view_builder: introduce writing status mutations via raft
  view_builder: pass group0_client and qp to view_builder
  view_builder: extract sys_dist status operations to functions
  db:system_keyspace: add view_build_status_v2 table
2024-09-11 13:02:58 +02:00
Nikos Dragazis
d1152a200f test: Add test to validate SSTables with no checksums
In a previous patch we extended the return status of
`sstables::validate_checksums()` to report if an SSTable cannot be
validated due to a missing CRC component (i.e., CRC.db does not appear
in TOC.txt).

Add a test case for this.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 13:12:40 +03:00
Nikos Dragazis
1f275c71b1 tools: Fix typo in help message of scylla validate-checksums
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 13:12:39 +03:00
Nikos Dragazis
5c0a7f706b sstables: Allow validate_checksums() to report missing checksums
Change the return type of `sstable::validate_checksums()` from binary
(valid/invalid) to a ternary (valid/invalid/no_checksums). The third
status represents uncompressed SSTables without a CRC component (no
entry for CRC.db in the TOC).

Also, change the JSON response of `sstable validate-checksums` to expose
the new status. Replace the boolean value for valid/invalid checksums
with an object that contains two boolean keys: one that indicates if the
SSTable has checksums, and one that indicates if the checksums are valid
or not. The second key is optional and appears only if the SSTable has
checksums.

Finally, update the documentation to reflect the changes in the API.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 13:12:39 +03:00
Nikos Dragazis
5a284f4a9d test: Add test for concurrent scrub/validate operations
Theoretically it is possible to launch more than one scrub instances
simultaneously. Since the checksum component is a shared resource,
accesses have to be synchronized.

Add a test that launches two scrub operations in validate mode and
ensures that the checksum component is loaded once, referenced by all
scrub instances via shared pointers, and deleted once the scrub
operations finish. Introduce an injection point to achieve concurrent
execution of scrubs.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 13:12:39 +03:00
Nikos Dragazis
e2353f3b3e test: Add scrub/validate tests for uncompressed SSTables
Currently the unit tests check scrub in validate mode against compressed
SSTables only. Mirror the tests for uncompressed SSTables as well.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 13:12:39 +03:00
Nikos Dragazis
2991b09c8e test/lib: Add option to create uncompressed random schemas
Extend the `random_schema_specification` to support creating both
compressed and uncompressed schemas.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 13:12:32 +03:00
Nikos Dragazis
4f56c587f6 test: Add test for scrub/validate with file-level corruption
Currently, we test scrub/validate only against a corrupted SSTable with
content-level corruption (out-of-order partition key).

Add a test for file-level corruption as well. This should trigger the
checksum check in the underlying compressed file data source
implementation.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:28:59 +03:00
Nikos Dragazis
cc10a5f287 test: Check validation errors in scrub tests
Scrub was extended in PR #11074 to report validation errors but the
unit tests were not updated.

Update the tests to check the validation errors reported by scrub.
Validation errors must be zero for valid SSTables and non-zero for
invalid SSTables.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:28:59 +03:00
Nikos Dragazis
719757fba9 sstables: Enable checksum validation for uncompressed SSTables
Extend the `sstable::validate()` to validate the checksums of
uncompressed SSTables. Given that this is already supported for
compressed SSTables, this allows us to provide consistent behavior
across any type of SSTable, be it either compressed or uncompressed.

The most prominent use case for this is scrub/validate, which is now
able to detect file-level corruption in uncompressed SSTables as
well.

Note that this change will not affect normal user reads which skip
checksum validation altogether.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:28:59 +03:00
Nikos Dragazis
716fc487fd sstables: Expose integrity option via crawling mutation readers
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:28:59 +03:00
Nikos Dragazis
1d2dc9f2e1 sstables: Expose integrity option via data_consume_rows()
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:28:59 +03:00
Nikos Dragazis
2feced32f7 sstables: Add option for integrity check in data streams
Add a new boolean parameter in `sstable::data_stream()` to
enable/disable integrity mechanisms in the underlying data streams.
Currently, this only affects uncompressed SSTables and it allows to
enable/disable checksum validation on each chunk. The validation happens
transparently via the checksummed data source implementation.

The reason we need this option is to allow differentiating the behavior
between normal user reads and scrub/validate reads. We would like to
enable scrub to verify checksums for uncompressed SSTables, while
leaving normal user reads unchanged for performance reasons (read
amplification due to round up of reads to chunk size and loading of the
CRC component).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:27:54 +03:00
Nikos Dragazis
d5bd40ad2c sstables: Remove unused variable
Remove unused stream variable from `sstable::data_stream()`. This was
introduced in commit 47e07b787e but never used.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:27:54 +03:00
Nikos Dragazis
2575d20f41 sstables: Add checksum in the SSTable components
Uncompressed SSTables store their checksums in a separate CRC.db file.
Add this in the list of SSTable components.

Since this component is used only for validation, load the component
on-demand for validation tasks and delete it when all validation tasks
finish. In more detail:

- Make the checksum component shareable and weakly referencable.
  Also, add a constructor since it is no longer an aggregate.
- Use a weak pointer to store a non-owning reference in the components
  and a shared pointer to keep the object alive while validation runs.
  Once validation finishes, the component should be cleaned up
  automatically.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:27:38 +03:00
Nikos Dragazis
b7dfba4c18 sstables: Introduce checksummed file data source implementation
Introduce a new data source implementation for uncompressed SSTables.

This is just a thin wrapper for a raw data source that also performs
checksum validation for each chunk. This way we can have consistent
behavior for compressed and uncompressed SSTables.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:26:18 +03:00
Botond Dénes
0e5b444777 Merge 'database::get_all_tables_flushed_at: fix return value' from Lakshmi Narayanan Sreethar
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.

Fixes #20301

Requires a backport to 6.0 and 6.1 as they have the same issue.

Closes scylladb/scylladb#20471

* github.com:scylladb/scylladb:
  cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config
  database::get_all_tables_flushed_at: fix return value
2024-09-11 11:43:45 +03:00
Benny Halevy
4e8f3f4cdd cql-pytest: add test_compaction_tombstone_gc
Test tombstone garbage collection with:
1. conflicting live data in memtable (verifying there is no regression
   in this area)
2. deletion in memtable (reproducing scylladb/scylladb#20423)
3. materialized view update in memtable (reproducing scylladb/scylladb#20424)
in materialized_views

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:06:23 +03:00
Benny Halevy
9270348c38 sstable_compaction_test: add mv_tombstone_purge_test
Simulate view updates pattern and verify that they
don't inhibit tombstone garbage collection.

Verify fix for scylladb/scylladb#20424

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:06:23 +03:00
Benny Halevy
0407e50aa4 sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
Tests fix for scylladb/scylladb#20423

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:06:06 +03:00
Benny Halevy
a7caa79df7 sstable_compaction_test: tombstone_purge_test: add testlog debugging
Add some testlog debug printouts for the make_* helpers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:58 +03:00
Benny Halevy
470d301fe3 sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
Rather than forging a timestamp from the gc_clock
just use `next_timestamp` do it can be considered
for tomebstone purging purposes.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:58 +03:00
Benny Halevy
5849ba83e0 sstable, compaction: add debug logging for extended min timestamp stats
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:57 +03:00
Benny Halevy
7d893a5ed9 compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
When purging regular tombstone consult the min_live_timestamp,
if available.

For shadowable_tombstones, consult the
min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.

If both are missing, fallback to the legacy
(and inaccurate) min_timestamp.

Fixes scylladb/scylladb#20423
Fixes scylladb/scylladb#20424

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:57 +03:00
Benny Halevy
57e9e9c369 compaction: define max_purgeable_fn
Before we add a new, is_shadowable, parameter to it.

And define global `can_always_purge` and `can_never_purge`
functions, a-la `always_gc` and `never_gc`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:57 +03:00
Benny Halevy
b6fabd98c6 tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
And define `never_gc` globally, same as `always_gc`

Before adding a new, is_shadowable parameter to it.

Since it is used in the context of compaction
it better fits compaction_garbage_collector header
rather than tombstone.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:57 +03:00
Benny Halevy
4de4af954f sstables: scylla_metadata: add ext_timestamp_stats
Store and retrieve the optional extended timestamp statistics
(min_live_timestamp and min_live_row_marker_timestamp)
in the scylla_metadata component.

Note that there is no need for a cluster feature to
store those attributes since the scylla_metadata
on-disk format is extensible so that old sstables
can be read by new versions, seeing the extra stats
is missing, and new sstables can be read by old
versions that ignore unknown scylla metadata section types.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:57 +03:00
Benny Halevy
6f202cf48b compaction_group, storage_group, table_state: add extended timestamp stats getters
To return the minimum live timestamp and live row-marker
timestamp across a compaction_group, storage_group, or
table_state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:57 +03:00
Benny Halevy
14d86a3a12 sstables, memtable: track live timestamps
When garbage collecting tombstones, we care only
about shadowing of live data.  However, currently
we track min/max timestamp of both live and dead
data, but there is no problem with purging tombstones
that shadow dead data (expired or shdowed by other
tombstones in the sstable/memtable).

Also, for shadowable tombstones, we track live row marker timestamps
separately since, if the live row marker timestamp is greater than
a shadowable tombstone timestamp, then the row marker
would shadow the shadowable tombstone thus exposing the cells
in that row, even if their timestasmp may be smaller
than the shadow tombstone's.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:49 +03:00
Abhi
9b09439065 raft: Add descriptions for requested abort errors
Fixes: scylladb/scylladb#18902

Closes scylladb/scylladb#20291
2024-09-10 17:56:29 +02:00
Botond Dénes
de81388edb Merge 'commitlog: Handle oversized entries' from Calle Wilund
Refs #18161

Yet another approach to dealing with large commitlog submissions.

We handle oversize single mutation by adding yet another entry
typo: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.

Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
                 *
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.

On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.

Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.

This approach bumps the file format (docs to come).

To ensure "atomicity" we both force synchronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.

Closes scylladb/scylladb#19472

* github.com:scylladb/scylladb:
  commitlog/database: Make some commitlog options updatable + add feature listener
  features/config: Add feature for fragmented commitlog entries
  docs: Add entry on commitlog file format v4
  commitlog_test: Add more oversized cases
  commitlog_replayer: Replay segments in order created
  commitlog_replayer: Use replay state to support fragmented entries
  commitlog_replayer: coroutinize partly
  commitlog: Handle oversized entries
2024-09-10 17:15:46 +03:00
Benny Halevy
8d67357c42 memtable_encoding_stats_collector: update row_marker: do nothing if missing
If the row_marker is missing then its timestamp
is missing as well, so there's no point
calling update_timestamp for it.  Better return early.

This should cause no functional change.

The following patch will add more logic
for tracking extended timestamp stats.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 16:46:34 +03:00
Pavel Emelyanov
b6f662417c table: Remove unused database& argument from take_snapshot() method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20496
2024-09-10 14:53:06 +03:00
Gleb Natapov
af83c5e53e group0: stop group0 before draining storage service during shutdown
Currently storage service is drained while group0 is still active. The
draining stops commitlogs, so after this point no more writes are
possible, but if group0 is still active it may try to apply commands
which will try to do writes and they will fail causing group0 state
machine errors. This is benign since we are shutting down anyway, but
better to fix shutdown order to keep logs clean.

Fixes scylladb/scylladb#19665
2024-09-10 13:15:56 +02:00
Lakshmi Narayanan Sreethar
a0f4fe3fc4 cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-10 16:39:05 +05:30
Lakshmi Narayanan Sreethar
4ca720f0bd database::get_all_tables_flushed_at: fix return value
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.

Fixes #20301

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-10 16:35:47 +05:30
Yaniv Michael Kaul
a4ff0aae47 HACKIGN.md: clarify the use of dbuild when running test.py
If you are using dbuild, that's where test.py needs to run.

Also, replace 'Docker image' with the more generic 'container' term.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#20336
2024-09-10 13:40:45 +03:00
Botond Dénes
08f109724b docs/cql/ddl.rst: fix description of sstable_compression
ScyllaDB doesn't support custom compressors. The available compressors
are the only available ones, not the default ones.
Adjust the text to reflect this.

Closes scylladb/scylladb#20225
2024-09-10 13:39:24 +03:00
Pavel Emelyanov
cfa59ab73d test: Use single temp dir for sharded<sstables::test_env>
The test-env in question is mostly started in one-shard mode. Also there
are several boost tests that start sharded<> environment. In that case
instances on different shards live in different temp dirs. That's not
critical yet, but better to have single directory for the whole test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20412
2024-09-10 11:25:04 +03:00
Artsiom Mishuta
f95c257a1e [test.py]: Fail test teardown in case of task leakage
In test.py every asyncio task spawned during the test must be finished before the next test, otherwise, tests might affect each other results.
The developers are responsible for writing asyncio code in a way that doesn’t leave task objects unfinished.
Test.py has a mechanism that helps test writers avoid such tasks. At the end of each test case, it verifies that the test did not produce/leave any tasks and sets an event object that fails the next test at the start if this is the case(issue https://github.com/scylladb/scylladb/issues/16472)
The problem with this was that breaking the next test was counterintuitive, and the logging for this situation was insufficient and unobvious.

notes:  Task.cancel() is not an option to avoid task leakage
        1) Calling cancel() Does Not Cancel The Task :  the cancel() method just  request that the target task cancel.
        2) Calling cancel() Does Not Block Until The Task is Cancelled:  If the caller needs to know the task is cancelled and done, it could await for the target
        3) In particular PR, task.cancel() cancell task on client(ManagerClient) but not on http server(ScyllaManager). so "await" is needed.

Closes scylladb/scylladb#20012
2024-09-10 10:51:45 +03:00
Pavel Emelyanov
ac2127a640 test: Call table::make_sstable() directly in compaction test
The test in question generates a bunch of table_for_tests objects and
creates sstables for each. For that it calls test_env::make_sstable(),
but it can be made shorter, by calling table method directly.

The hidden goal of this change is to remove the explicit caller of
table::dir() method. The latter is going away.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20451
2024-09-10 10:19:20 +03:00
Botond Dénes
76bb22664a Merge 'Sanitize open_sstables() helper in compaction test' from Pavel Emelyanov
This includes
- coroutinization
- elimination of unused overload

Closes scylladb/scylladb#20456

* github.com:scylladb/scylladb:
  test: Squash two open_sstables() helper together
  test: Coroutinize open_sstables() helper
2024-09-10 10:18:33 +03:00
Botond Dénes
a4a4797e27 Merge 'Alternator: tests and other preparation towards allowing adding a GSI to an existing table' from Nadav Har'El
This series prepares us for working on #11567 -  allow adding a GSI to a pre-existing table. This will require changing the implementation of GSIs in Alternator to not use real columns in the schema for the materialized view, and instead of a computed column - a function which extracts the desired member from the `:attrs` map and de-serializes it.

This series does not contain the GSI re-implementation itself. Rather it contains a few small cleanups and mostly - new regression tests that cover this area, of adding and removing a GSI, and **using** a GSI, in more details than the tests we already had. I developed most of these tests while working on **buggy** fixes for #11567; The bugs in those implementations were exposed by the tests added here - they exposed bugs both in the new feature of adding or removing a GSI, and also regressions to the ordinary operation of GSI. So these tests should be helpful for whoever ends up fixing #11567, be it me based on my buggy implementation (which is _not_ included in this patch series), or someone else.

No backports needed - this is part of a new feature, which we don't usually backport.

Closes scylladb/scylladb#20383

* github.com:scylladb/scylladb:
  test/alternator: more extensive tests for GSI with two new key attributes
  test/alternator: test invalid key types for GSI
  test/alternator: test combination of LSI and GSI
  test/alternator: expand another test to use different write operations
  test/alternator: test GSIs with different key types
  alternator: better error message in some cases of key type mismatch
  test/alternator: test for more elaborate GSI updates
  test/alternator: strengthen tests for empty attribute values
  test/alternator: fix typo in test_batch.py
  test/alternator: more checks for GSI-key attribute validation
  Alternator: drop unneeded "IS NOT NULL" clauses in MV of GSI/LSI
  test/alternator: add more checks for adding/deleting a GSI
  test/alternator: ensure table deletions in test_gsi.py
2024-09-10 10:13:52 +03:00
Pavel Emelyanov
42f8d06a17 test: Use correct schema in directory tests with created table
There are some test cases in sstable_directory_test test actually create
a table with CQL and then try to manipulate its sstables with the help
of sstable_directory. Those tests use existing local helper that starts
sharded<sstable_directory> and this helper passes test-local static
schema to sstable_directory constructor. As a result -- the schema of a
table that test case created and the schema that sstable_directory works
with are different. They match in the columns layout, which helps the
test cases pass, but otherwise are two different schema objects with
different IDs. It's more correct to use table schema for those runs.

The fix introduces another helper to start sharded<sstable_directory>,
and the older wrapper around cql_test_env becomes unused. Drop it too
not to encourage future tests use it and re-introduce schema mismatch
again.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20499
2024-09-10 09:56:26 +03:00
Benny Halevy
f47b5e60bc sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
To be able to atomically delete sstables both in
base table directory and in its sub-directories,
like `staging/`, use a shared pending_delete_dir
under under the base directory.

Note that this requires loading and processing
the base directory first.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 09:28:13 +03:00
Benny Halevy
44bd183187 sstables: storage: keep base directory in base class
so we can use the base (table) directory for
e.g. pending_delete logs, in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 09:28:13 +03:00
Benny Halevy
027e64876a sstables: storage: define opened_directory in header file
So it can be used outside the storage module
in the following patches.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 09:28:13 +03:00
Benny Halevy
a7b92d7b6f sstable_directory: use only dirlog
Currently, there are leftover log messages using
sstlog rather than dirlog, that was introduced
in aebd965f0e,
and that makes debugging harder.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 09:28:11 +03:00
Botond Dénes
fc690a60d8 Update tools/cqlsh submodule
* tools/cqlsh 86a280a1...b09bc793 (6):
  > build(deps): bump actions/download-artifact in /.github/workflows
  > cqlshlib/test: Add test_formatting.py
  > cqlshlib/test: Use assertEqual instead of assertEquals
  > cqlsh.py: Send DESCRIBE statement to server before parsing
  > cqlsh.py: Fix indentation
  > cqlsh.py: change shebang to /usr/bin/env python3
2024-09-10 08:11:40 +03:00
Lakshmi Narayanan Sreethar
2148e33d37 compaction: remove unnecessary share bump for split, scrub, and upgrade
When split, scrub, and upgrade compactions ran under the compaction
group, they had to bump up their shares to a minimum of 200 to prevent
slow progress as they neared completion, especially in workloads with
inconsistent ingestion rates. Since commit e86965c2 moved these
compactions to the maintenance group, this share bump is no longer
necessary. This patch removes the unnecessary share allocation.

Fixes #20224

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#20495
2024-09-09 22:03:38 +03:00
Avi Kivity
9448260b30 Merge 'major compaction: check only sstables being compacted for tombstone garbage collection' from Lakshmi Narayanan Sreethar
Any expired tombstone can be garbage collected if it doesn't shadow data in the commit log, memtable, or uncompacting SSTables.

This PR introduces a new mode to major compaction, enabled by the `consider_only_existing_data` flag that bypasses these checks. When enabled, memtables and old commitlog segments are cleared with a system-wide flush and all the sstables (after flush) are included in the compaction, so that it works with all data generated up to a given time point.

This new mode works with the assumption that newly written data will not be shadowed by expired tombstones. So it ignores new sstables (and new data written to memtable) created after compaction started. Since there was a system wide flush, commitlog checks can also be skipped when garbage collecting tombstones. Introducing data shadowed by a tombstone during compaction can lead to undefined behavior, even without this PR, as the tombstone may or may not have already been garbage collected.

Fixes #19728

Closes scylladb/scylladb#20031

* github.com:scylladb/scylladb:
  cql-pytest: add test to verify consider_only_existing_data compaction option
  tools/scylla-nodetool: add consider-only-existing-data option to compact command
  api: compaction: add `consider_only_existing_data` option
  compaction: consider gc_check_only_compacting_sstables when deducing max purgeable timestamp
  compaction: do not check commitlog if gc_check_only_compacting_sstables is enabled
  tombstone_gc_state: introduce with_commitlog_check_disabled()
  compaction: introduce new option to check only compacting sstables for gc
  compaction: rename maybe_flush_all_tables to maybe_flush_commitlog
  compaction: maybe_flush_all_tables: add new force_flush param
2024-09-09 20:45:41 +03:00
Avi Kivity
894b85ce95 Merge 'hints: send hints with CL=ALL if target is leaving' from Piotr Dulikowski
Currently, when attempting to send a hint, we might choose its recipients in one of two ways:

- If the original destination is a natural endpoint of the hint, we only send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.

There is a problem when we decommission a node: while data is streamed away from that node, it is still considered to be a natural endpoint of the data that it used to own. Because of that, it might happen that a hint is sent directly to it but streaming will miss it, effectively resulting in the hint being discarded.

As sending the hint _only_ to the leaving replica is a rather bad idea, send the hint to all replicas also in the case when the original destination of the hint is leaving.

Note that this is a conservative fix written only with the decommission + vnode-based keyspaces combo in mind. In general, such "data loss" can occur in other situations where the replica set is changing and we go through a streaming phase, i.e. other topology operations in case of vnodes and tablet load balancing. However, the consistency guarantees of hinted handoff in the face of topology changes are not defined and it is not clear what they should be, if there should be any at all. The picture is further complicated by the fact that hints are used by materialized views, and sending view updates to more replicas than necessary can introduce inconsistencies in the form of "ghost rows". This fix was developed in response to a failing test which checked the hint replay + decommission scenario, and it makes it work again.

Fixes scylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835

Should be backported to 6.0 and 6.1; the dtest started failing due to topology on raft, which sped up execution of the test and exposed the preexisting problem.

Closes scylladb/scylladb#20488

* github.com:scylladb/scylladb:
  test: topology_custom/test_hints: consistency test for decommission
  test: topology_custom/test_hints: move sync point helpers to top level
  test: topology/util: extract find_server_by_host_id
  hints: send hints with CL=ALL if target is leaving
  hints: inline do_send_one_mutation
2024-09-09 18:23:13 +03:00
Avi Kivity
c3e19425bd Merge 'docs/dev/docker-hub.md: refresh aio-max-nr calculation' from Laszlo Ersek
~~~
What we have today in "docs/dev/docker-hub.md" on "aio-max-nr" dates back
to scylla commit f4412029f4 ("docs/docker-hub.md: add quickstart section
with --smp 1", 2020-09-22). Problems with the current language:

- The "65K" claim as default value on non-production systems is wrong;
  "fs/aio.c" in Linux initializes "aio_max_nr" to 0x10000, which is 64K.

- The section in question uses equal signs (=) incorrectly. The intent was
  probably to say "which means the same as", but that's not what equality
  means.

- In the same section, the relational operator "<" is bogus. The available
  AIO count must be at least as high (>=) as the requested AIO count.

- Clearer names should be used;
  adjust_max_networking_aio_io_control_blocks() in "src/core/reactor.cc"
  sets a great example:

  - "reactor::max_aio" should be called "storage_iocbs",

  - "detect_aio_poll" should be called "preempt_iocbs",

  - "reactor_backend_aio::max_polls" should be called "network_iocbs".

- The specific value 10000 for the last one ("network_iocbs") is not
  correct in scylla's context. It is correct as the Seastar default, but
  scylla has used 50000 since commit 2cfc517874 ("main, test: adjust
  number of networking iocbs", 2021-07-18).

Rewrite the section to address these problems.

See also:
- https://github.com/scylladb/scylladb/issues/5981
- https://github.com/scylladb/seastar/pull/2396
- https://github.com/scylladb/scylladb/pull/19921

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
~~~

No need for backporting; the documentation being refreshed targets developers as audience, not end-users.

Closes scylladb/scylladb#20398

* github.com:scylladb/scylladb:
  docs/dev/docker-hub.md: refresh aio-max-nr calculation
  docs/dev/docker-hub.md: strip trailing whitespace
2024-09-09 15:04:38 +03:00
Botond Dénes
3e0bff161c Merge 'Use yielding directory lister in sstable_directory' from Pavel Emelyanov
The yielding lister is considered to be better replacement that scan_dir(lambda) one.
Also, the sstable directory will be patched to scan the contents of S3 bucket and yielding lister fits better for generalization.

Closes scylladb/scylladb#20114

* github.com:scylladb/scylladb:
  sstable_directory: Fix indentation after previous patches
  sstable_directory: Use yielding lister in .handle_sstables_pending_delete()
  sstable_directory: Use yielding lister in .cleanup_column_family_temp_sst_dirs()
  sstable_directory: Use yielding lister in .prepare()
  sstable_directory: Shorten lister loop
  sstable_directory: Use with_closeable() in .process()
  directory_lister: Add noexcept default move-constructor
2024-09-09 14:35:51 +03:00
Pavel Emelyanov
0f48847d02 test: Use shorter with_sstable_directory overload()
In sstable directory test there are two of those -- one that works on
path, state, env and callback, and the other one that just needs env and
callback, getting path from env and assuming state is normal.

Two test cases in this test can enjoy the shorter one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20395
2024-09-09 14:25:24 +03:00
Pavel Emelyanov
2bfbbaffac test: Use sstables::test_env to make sstables for schema loader test
This test calls manager directly, but it's shorter to ask test_env for
that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20431
2024-09-09 14:22:58 +03:00
Takuya ASADA
e36c939505 dist: tune LimitNOFILES for large nodes
On very large node, LimitNOFILES=80000 may not enough size, it can cause
"Too many files" error.

To avoid that, let's increase LimitNOFILES on scylla_setup stage,
generate optimal value calurated from memory size and number of cpus.

Closes scylladb/scylla-enterprise#4304

Closes scylladb/scylladb#20443
2024-09-09 14:13:49 +03:00
Piotr Smaron
60af48f5fd cql: fix exception when validating KS in CREATE TABLE
c70f321c6f added an extra check if KS
exists. This check can throw `data_dictionary::no_such_keyspace`
exception, which is supposed to be caught and a more user-friendly
exception should be thrown instead.
This commit fixes the above problem and adds a testcase to validate it
doesn't appear ever again.
Also, I moved the check for the keyspace outside of the `for` loop, as
it doesn't need to be checked repeatedly.

Fixes: scylladb/scylladb#20097

Closes scylladb/scylladb#20404
2024-09-09 13:30:57 +03:00
Nadav Har'El
ee7d4d8825 test/alternator: more extensive tests for GSI with two new key attributes
The case of a GSI with two key attributes (hash and range) which were both
not keys in the base table is a special case, not supported by CQL but
allowed in Alternator. We have several tests for this case, but they don't
cover all the strange possibilities that a GSI row disappears / reappears
when one or two of the attributes is updated / inserted / deleted.
So this patch includes a more extensive test for this case.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 13:14:49 +03:00
Nadav Har'El
ad53d6a230 test/alternator: test invalid key types for GSI
This patch adds a test that types which are not allowed for GSI keys -
basically any type except S(tring), B(ytes) or N(number), are rejected
as expected - an error path that we didn't cover in existing tests.

The new test passes - Alternator doesn't have a bug in this area, and
as usual, also passes on DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 13:14:49 +03:00
Nadav Har'El
c4021d0819 test/alternator: test combination of LSI and GSI
To allow adding a GSI to an existing table (refs #11567), we plan to
re-implement GSIs to stop forcing their key attribute to become a real
column in the schema - and let it remains a member of the map ":attrs"
like all non-key attributes. But since LSIs can only be defined on table
creation time, we don't have to change the LSI implementation, and these
can still force their key to become a real column.

What the test in this patch does is to verify that using the same
attribute as a key of *both* GSI and LSI on the same table works.
There's a high risk that it won't work: After all, the LSI should force the
attribute to become a real column (to which base reads and writes go), but
the GSI will use a computed column which reads from ":attrs", no? Well,
it turns out that view.cc's value_getter::operator() always had a
surprising exception which "rescues" this test and makes it pass: Before
using a computed column, this code checks if a base-table column with the
same name exists, and if it does, it is used instead of the computed column!
It's not clear why this logic was chosen, but it turns out to be really
useful for making the test in this test pass. And it's important that if
we ever change that unintuitive behavior, we will have this test as a
regression test.

The new test unsurprisingly passes on current Scylla because its
implementation of GSI and LSI is still the same. But it's an important
regression test for when we change the GSI implementation.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 13:14:49 +03:00
Nadav Har'El
7563d0a8a1 test/alternator: expand another test to use different write operations
Expand another Alternator test (test_gsi.py::test_gsi_missing_attribute)
to write items not just using PutItem, but also using UpdateItem and
BatchWriteItem. There is a risk that these different operations use
slightly different code paths - so better check all of them and not
just PutItem.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 13:14:49 +03:00
Nadav Har'El
4d02beec53 test/alternator: test GSIs with different key types
All of the tests in test/alternator/test_gsi.py use strings as the GSI's
keys. This tests a lot of GSI functionality, but we implicitly assumed that
our implementation used an already-correct and already-tested implementation
of key columns and MV, which if it works for one type, works for other types
as well.

This assumption will no longer hold if we reimplement GSI on a "computed
column" implementation, which might run different code for different types
of GSI key attributes (the supported types are "S"tring, "B"ytes, and
"N"umber).

So in this patch we add tests for writing and reading different types of
GSI key attributes. These tests showed their importance as regression
tests when the first draft of the GSI reimplementation series failed them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 13:14:49 +03:00
Nadav Har'El
80a0798e77 alternator: better error message in some cases of key type mismatch
Alternator uses a common function get_typed_value() to read the values
of key attribute and confirm they have the expected type (key attributes
have a fixed type in the schema). If the type is wrong, we want to print
a "Type mismatch" error message.

But the current implementation did the checks in the wrong order, and
as a result could print a "Malformed value object" message instead of a
"Type mismatch". That could happen if the wrong type is a boolean, map,
list, or basically any type whose JSON representation is not a string.
The allowed key types - bytes), string and number - all have string
representations in JSON, but still we should first report the mismatched
type and only report the "Malformed object" if the type matches but the
JSON is faulty.

In addition to fixing the error message, we fix an existing test which
complained in a comment (but ignored) that the error message in some
case (when trying to use a map where a key is expected) the strange
"Malformed value object" instead of the expected "Type mismatch".

The next patch will add an additional reproducer for this problem and
its fix. That test will do:

```
    with pytest.raises(ClientError, match='ValidationException.*mismatch'):
        test_table_gsi_6.put_item(Item={'p': p, 's': True})
```
I.e., it tries to set a boolean value for a string key column, and
expect to get the "Type mismatch" error and not the ugly "Malformed
value object".

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 13:14:49 +03:00
Nadav Har'El
624ed32278 test/alternator: test for more elaborate GSI updates
Most tests in test_gsi.py involve simple updates to a GSI, just
creating a GSI row. Although a couple of tests did involve more
complex operations (such as an update requiring deleting an old row
from the GSI and inserting a new one,), we did not have a single
organized test designed to check all these cases, so we add one in
this patch.

This test (test_update_gsi_pk) will be important for verifying
the low-level implementation of the new GSI implementation that
we plan to based on computed columns. Early versions of that code
passed many of the simpler tests, but not this one.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 13:14:49 +03:00
Nadav Har'El
65d4ddf093 test/alternator: strengthen tests for empty attribute values
We soon plan to refactor Alternator's GSI and change the validation of
values set in attributes which are GSI keys. It's important to test that
when updating attributes that are *not* GSI keys - and are either base-
table keys or normal non-key attributes - the validation didn't change.
For example, empty strings are still not allowed in base-table key
attributes, but are allowed (since May 2020 in DynamoDB) in non-key
attributes.

We did have tests in this area, but this patch strengthens them -
adding a test for non-key attribute, and expanding the key-attribute
test to cover the UpdateItem and BatchWriteItem operations, not just
PutItem.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 13:14:41 +03:00
Avi Kivity
9a5061209f Merge '[test.py] Enable allure for python test' from Andrei Chekun
To enhance the test reports UX:
1. switching off/on passed/failed/skipped test for better visibility
2. better searching in test results
3. understanding the trends of execution for each test
4. better configurability of the final report

Enable allure adapter for all python tests.
Add tags and parameters to the test to be able to distinguish them across modes and runs.

Related: https://github.com/scylladb/qa-tasks/issues/1665

Related: https://github.com/scylladb/scylladb/pull/19335

Related: https://github.com/scylladb/scylladb/pull/18169

Closes scylladb/scylladb#19942

* github.com:scylladb/scylladb:
  [test.py] Clean duplicated arg for test suite
  [test.py] Enable allure for python test
2024-09-09 12:53:00 +03:00
Nadav Har'El
5859daed68 test/alternator: fix typo in test_batch.py
Two tests had a typo 'item' instead of 'Item'. If Scylla had a bug, this
could have caused these tests to miss the bug.

Scylla passes also the fixed test, because Scylla's behavior is correct.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 12:09:25 +03:00
Nadav Har'El
1f8e39f680 test/alternator: more checks for GSI-key attribute validation
When an attribute is a GSI key, DynamoDB imposes certain rules when
writing values for it - it must be of the declared type for that key,
and can't be an empty string. We had tests for this, but all of them
did the write using the PutItem operation.

In this patch we also test the same things using the UpdateItem and
BatchWriteItem operations. Because Scylla has different code paths
for these three operations, and each code path needs to remember to
call the validation function, all three should all be checked and not just
PutItem.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 12:09:25 +03:00
Nadav Har'El
cf5d7ce212 Alternator: drop unneeded "IS NOT NULL" clauses in MV of GSI/LSI
Scylla's materialized views naturally skip any base rows where the view's
key isn't set (is NULL), because we can't create a view row with a null
key. To make the user aware that this is happening, the user is required
to add "WHERE ... IS NOT NULL" for the view's key columns when defining
the view. However, the only place that these extra IS NOT NULL clauses
are checked are in the CQL "CREATE MATERIALIZED VIEWS" statement - they
are completely ignored in all other places in the code.

In particular, when we create a materialized view in Alternator (GSI or
LSI), we don't have to add these "IS NOT NULL" clauses, as they are
outright ignored. We didn't know they were ignored, and made an effort
to add them - but no matter how incorrectly we did it, it didn't matter :-)
In commit 2bf2ffd3ed it turned out we had a
typo that caused the wrong column name to be printed. Also, even today we
are still missing base key columns that aren't listed as a view key in
Alternator but still added as view clustering keys in Scylla - and again
the fact these were missing also didn't matter. So I think it's time to
stop pretending, and stop calculating these "IS NOT NULL" strings, so
this patch outright removes them from the Alternator view-creation code.

Beyond being a nice cleanup of unnecessary and inaccurate code, it
will also be necessary when we allow in later patches to index for
an Alternator attribute "x" not a real column x in the base table but
rather an element in the ":attrs" map - so adding a "x IS NOT NULL" isn't
only unnecessary, it is outright illegal: The expression evaluation code,
even though it doesn't do anything with the "IS NOT NULL" expression,
still verifies that "x" is a valid column, which it isn't.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 12:09:25 +03:00
Nadav Har'El
8beaa9d10e test/alternator: add more checks for adding/deleting a GSI
We already have tests for the feature of adding or removing a GSI from
an existing table, which Alternator doesn't yet support (issue #11567).
In this patch we add another check, how after a GSI is added, you can
no longer add items with the wrong type for the indexed type, and after
removing a GSI, you can. The expanded tests pass on DynamoDB, and
obviously still xfail on Alternator because the feature is not yet
implemented.

Refs #11567.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 12:09:25 +03:00
Nadav Har'El
ce19311ab3 test/alternator: ensure table deletions in test_gsi.py
Most of the Alternator tests are careful to unconditionally remove the test
tables, even if the test fails. This is important when testing on a shared
database (e.g., DynamoDB) but also useful to make clean shutdown faster
as there should be no user table to flush.

We missed a few such cases in test_gsi.py, and fixed some of them in
commit 59c1498338 but still missed a few,
and this patch fixes some more instances of this problem.
We do this by using the context manager new_test_table() - which
automatically deletes the table when done - instead of the function
create_test_table() which needs an explicit delete at the end.

There are no functional changes in this patch - most of the lines
changed are just reindents.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-09-09 12:09:25 +03:00
Kefu Chai
ccbd3eb9f7 main: do not register redis and alternator services if not enabled
in main.cc, we start redis with `ss.local().register_protocol_server()`
only if it is enabled. but `storage_service` always calls
`stop_server()` with _all_ registered server, no matter if they have
started or not. in general, it does not hurt. for instance,

`redis::controller::stop_server()` is a noop, if the controller
is not started. but `storage_service` still print the logging message
like:
```
INFO  2024-09-04 11:20:02,224 [shard 0:main] storage_service - Shutting down redis server
INFO  2024-09-04 11:20:02,224 [shard 0:main] storage_service - Shutting down redis server was successful
```

this could be confusing or at least distracting when a field engineer
looks at the log. also, please note, `redis_port` and `redis_ssl_port`
cannot be changed dynamically once scylla server is up, so we do not
need to worry about "what if the redis server is started at runtime,
how can is be stopped?".

the same applies to alternator service.

in this change, to avoid surprises, we conditionally register the
protocol servers with the storage service based on their enabled statuses.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20472
2024-09-09 08:44:50 +03:00
Avi Kivity
58713f3080 types: remove some unused free functions
These functions are unused, so safe to remove, and reduce the work
to convert to managed_bytes{,_view}.

Closes scylladb/scylladb#20482
2024-09-09 08:36:33 +03:00
Kefu Chai
720997d1de cql3/statements: mark format string as constexpr const
after switching over to the new `seastar::format()` which enables
the compile-time format check, the fmt string should be a constexpr,
otherwise `fmt::format()` is not able to perform the check at compile
time.

to prepare for bumping up the seastar module to a version which
contains the change of `seastar::format()`, let's mark the format
string with `constexpr const`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20484
2024-09-09 08:35:45 +03:00
Piotr Dulikowski
6f3d0af994 test: topology_custom/test_hints: consistency test for decommission
Adds the test_hints_consistency_during_decommission test which
reproduces the failure observed in scylladb/scylla-dtest#4582. It uses
error injections, including the newly added
topology_coordinator_pause_after_streaming injection, to reliably
orchestrate the scenario observed there.

In a nutshell, the test makes sure to replay hints after streaming
during decommission has finished, but before the cluster switches to
reading from new replicas. Without the fix, hints would be replayed to
the decommissioned node and then would be lost forever after the cluster
start reading from new replicas.
2024-09-08 10:51:38 +02:00
Piotr Dulikowski
30d53167c9 test: topology_custom/test_hints: move sync point helpers to top level
Move create_sync_point and await_sync_point from the scope of the
test_sync_point test to the file scope. They will be used in a test that
will be introduced in the commit that follows.
2024-09-08 10:51:38 +02:00
Piotr Dulikowski
a75d0c0bfa test: topology/util: extract find_server_by_host_id
Move it out from test_mv_tablets_replace.py. It will be used by a test
introduced in a later commit.
2024-09-08 10:51:38 +02:00
Piotr Dulikowski
61ac0a336d hints: send hints with CL=ALL if target is leaving
Currently, when attempting to send a hint, we might choose its
recipients in one of two ways:

- If the original destination is a natural endpoint of the hint, we only
  send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.

There is a problem when we decommission a node: while data is streamed
away from that node, it is still considered to be a natural endpoint of
the data that it used to own. Because of that, it might happen that a
hint is sent directly to it but streaming will miss it, effectively
resulting in the hint being discarded.

As sending the hint _only_ to the leaving replica is a rather bad idea,
send the hint to all replicas also in the case when the original
destiantion of the hint is leaving.

Note that this is a conservative fix written only with the decommission
+ vnode-based keyspaces combo in mind. In general, such "data loss" can
occur in other situations where the replica set is changing and we go
through a streaming phase, i.e. other topology operations in case of
vnodes and tablet load balancing. However, the consistency guarantees of
hinted handoff in the face of topology changes are not defined and it is
not clear what they should be, if there should be any at all. The
picture is further complicated by the fact that hints are used by
materialized views, and sending view updates to more replicas than
necessary can introduce inconsistencies in the form of "ghost rows".
This fix was developed in response to a failing test which checked the
hint replay + decommission scenario, and it makes it work again.

Fixes scylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835
2024-09-08 10:50:59 +02:00
Piotr Dulikowski
8abb06ab82 hints: inline do_send_one_mutation
It's a small method and it is only used once in send_one_mutation.
Inlining it lets us get rid of its declaration in the header - now, if
one needs to change the variables passed from one function to another,
it is no longer necessary to change the header.
2024-09-08 07:19:35 +02:00
Avi Kivity
ab32ce6b45 Merge 'Coroutinize sstable::read_summary() method' from Pavel Emelyanov
Shorter and simpler this way. Hopefully it doesn't sit on critical paths

Closes scylladb/scylladb#20460

* github.com:scylladb/scylladb:
  sstables: Fix indentation after previous patch
  sstables: Coroutinize sstable::read_summary()
2024-09-06 18:45:54 +03:00
Kefu Chai
aeaeaf345d compaction: use structured binding when appropriate
for better readability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20473
2024-09-06 18:17:48 +03:00
Kamil Braun
427ad2040f Merge 'test: randomized failure injection for Raft-based topology' from Evgeniy Naydanov
The idea of the test is to have a cluster where one node is stressed with injections and failures and the rest of the cluster is used to make progress of the raft state machine.

To achieve this following two lists introduced in the PR:

  - ERROR_INJECTIONS in error_injections.py
  - CLUSTER_EVENTS in cluster_events.py

Each cluster event is an async generator which has 2 yields and should be used in the following way:

   0. Start the generator:
       ```python
       >>> cluster_event_steps = cluster_event(manager, random_tables, error_injection)
       ```

   1. Run the prepare part (before the first yield)
       ```python
       >>> await anext(cluster_event_steps)
       ```

   2.  Run the cluster event itself (between the yields)
       ```python
       >>> await anext(cluster_event_steps)
       ```

   3. Run the check part (after the second yield)
       ```python
       >>> await anext(cluster_event, None)
       ```

Closes scylladb/scylladb#16223

* github.com:scylladb/scylladb:
  test: randomized failure injection for Raft-based topology
  test: error injections for Raft-based topology
  [test.py] topology.util: add get_non_coordinator_host() function
  [test.py] random_tables: add UDT methods
  [test.py] random_tables: add CDC methods
  [test.py] api: get scylla process status
  [test.py] api: add expected_server_up_state argument to server_add()
2024-09-06 14:00:41 +02:00
Pavel Emelyanov
226fd03bae Merge 'service/qos: remove unused marked_for_deletion field from service_level struct' from Piotr Dulikowski
The `service_level::marked_for_deletion` field is always set to `false`. It might have served some purpose in the past, but now it can be just removed, simplifying the code and eliminating confusion about the field.

This is just code cleanup, no backport is needed.

Closes scylladb/scylladb#20452

* github.com:scylladb/scylladb:
  service/qos: remove the marked_for_deletion parameter
  service/qos: add constructors to service_level
2024-09-06 11:44:25 +03:00
Kamil Braun
52fdf5b4c9 test: test_raft_no_quorum: increase raft timeout in debug mode
The test cases in this file use an error injection to reduce raft group
0 timeouts (from the default 1 minute), in order to speed up the tests;
the scenarios expect these timeouts to happen, so we want them to happen
as quick as possible, but we don't want to reduce timeouts so much that
it will make other operations fail when we don't expect them to (e.g.
when the test wants to add a node to the cluster).

Unfortunately the selected 5 seconds in debug mode was not enough and
made the tests flaky: scylladb/scylladb#20111.

Increase it to 10 seconds. This unfortunately will slow down these tests
as they have to sometimes wait for 10 seconds for the timeout to happen.
But better to have this than a flaky test.

Fixes: scylladb/scylladb#20111

Closes scylladb/scylladb#20320
2024-09-06 11:40:09 +03:00
Avi Kivity
384a09585b repair: row_level: repair_get_row_diff_with_rpc_stream_process_op: simplify return value
During review of 0857b63259 it was noticed that the function

  repair_get_row_diff_with_rpc_stream_process_op()

and its _slow_path callee only ever return stop_iteration::no (or throw
an exception). As such, its return value is useless, and in fact the
only caller ignores it. Simplify by returning a plain future<>.

Closes scylladb/scylladb#20441
2024-09-06 11:39:21 +03:00
Kefu Chai
034c1df29b auth/authentication_options: move fmt::formatter up
so that it is accessible from its caller. if we enforce the
compile-time format string check, the formatter would need the access to
the specialization of `fmt::formatter` of the arguments being foramtted.
to be prepared for this change, let's move the `fmt::formatter`
specialization up, otherwise we'd have following error after switching
to the compile-time format string check introduced by a recent seastar
change:

```
In file included from ./auth/authenticator.hh:22:                                                                                                             ./auth/authentication_options.hh:50:49: error: call to consteval function 'fmt::basic_format_string<char, auth::authentication_option &>::basic_format_string<
char[32], 0>' is not a constant expression
   50 |             : std::invalid_argument(fmt::format("The {} option is not supported.", k)) {
      |                                                 ^                                                                                                     ./auth/authentication_options.hh:57:13: error: explicit specialization of 'fmt::formatter<auth::authentication_option>' after instantiation
   57 | struct fmt::formatter<auth::authentication_option> : fmt::formatter<string_view> {
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/fmt/base.h:1228:17: note: implicit instantiation first required here
 1228 |     -> decltype(typename Context::template formatter_type<T>().format(
      |                 ^
In file included from replica/distributed_loader.cc:30:
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20447
2024-09-06 09:12:38 +03:00
Pavel Emelyanov
527fc9594a sstables: Fix indentation after previous patch
And move the comment inside if while at it, it looks better in there
(and makes less churn in the patch itself)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-06 08:43:08 +03:00
Pavel Emelyanov
f7325586f3 sstables: Coroutinize sstable::read_summary()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-06 08:43:07 +03:00
Evgeniy Naydanov
dd99cf197d test: randomized failure injection for Raft-based topology
The idea of the test is to have a small cluster, where one node
is stressed with injections and failures and the rest of
the cluster is used to make progress of the Raft state machine.

To achieve this following two lists introduced in the commit:

  - ERROR_INJECTIONS in error_injections.py
  - CLUSTER_EVENTS in cluster_events.py

Each cluster event is an async generator which has 2 yields and should be
used in the following way:

   0. Start the generator:
       >>> cluster_event_steps = cluster_event(manager, random_tables, error_injection)

   1. Run the prepare part (before the first yield)
       >>> await anext(cluster_event_steps)

   2. Run the cluster event itself (between the yields)
       >>> await anext(cluster_event_steps)

   3. Run the check part (after the second yield)
       >>> await anext(cluster_event, None)
2024-09-05 22:11:32 +00:00
Evgeniy Naydanov
769424723b test: error injections for Raft-based topology
Add following error injections:
 - stop_after_init_of_system_ks
 - stop_after_init_of_schema_commitlog
 - stop_after_starting_gossiper
 - stop_after_starting_raft_address_map
 - stop_after_starting_migration_manager
 - stop_after_starting_commitlog
 - stop_after_starting_repair
 - stop_after_starting_cdc_generation_service
 - stop_after_starting_group0_service
 - stop_after_starting_auth_service
 - stop_during_gossip_shadow_round
 - stop_after_saving_tokens
 - stop_after_starting_gossiping
 - stop_after_sending_join_node_request
 - stop_after_setting_mode_to_normal_raft_topology
 - stop_before_becoming_raft_voter
 - topology_coordinator_pause_after_updating_cdc_generation
 - stop_before_streaming
 - stop_after_streaming
 - stop_after_bootstrapping_initial_raft_configuration
2024-09-05 22:11:31 +00:00
Evgeniy Naydanov
ac4ffbad5c [test.py] topology.util: add get_non_coordinator_host() function
Add get_non_coordinator_host() function which returns
ServerInfo for the first host which is not a coordinator
or None if there is no such host.

Also rework get_coordinator_host() to not fail if some
of the hosts don't have a host id.
2024-09-05 22:11:31 +00:00
Evgeniy Naydanov
d95d698601 [test.py] random_tables: add UDT methods
Add .add_udt() / .drop_udt() methods.
2024-09-05 22:11:31 +00:00
Evgeniy Naydanov
8cb442ca50 [test.py] random_tables: add CDC methods
Add .enabled_cdc() / .disable_cdc() methods.
2024-09-05 22:11:31 +00:00
Evgeniy Naydanov
a7119cf420 [test.py] api: get scylla process status
Add `server_get_process_status(server_id)` API call and
wait_for_scylla_process_status() helper function.
2024-09-05 22:11:31 +00:00
Evgeniy Naydanov
241bbb4172 [test.py] api: add expected_server_up_state argument to server_add()
Allow to return from server_add() when a server reaches specified state.
One of:
 - PROCESS_STARTED
 - HOST_ID_QUERIED (previously called NOT_CONNECTED)
 - CQL_CONNECTED (renamed from CONNECTED)
 - CQL_QUERIED (was just QUERIED)

Also, rename CqlUpState to ServerUpState and move to internal_types.
2024-09-05 22:11:31 +00:00
Pavel Emelyanov
f02a686115 schema: Ditch make_shared_schema() helper
Now it's unused

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 19:34:00 +03:00
Pavel Emelyanov
d045aa6df7 test: Tune up indentation in uncompressed_schema()
After it was switched to use schema builder, the indenation of untouched
lines deserves one extra space.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 19:33:29 +03:00
Pavel Emelyanov
a1deba0779 test: Make tests use schema_builder instead of make_shared_schema
Everything, but perf test is straightforward switch.

The perf-test generated regular columns dynamically via vector, with
builder the vector goes away.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 19:31:30 +03:00
Avi Kivity
c57b8dd0bf repair: row_level: restore indentation 2024-09-05 18:38:43 +03:00
Avi Kivity
710977ef88 repair: row_level: coroutinize repair_service::insert_repair_meta()
Some of the indentation was broken, and is partially repaired by this
change.
2024-09-05 17:59:42 +03:00
Avi Kivity
f23a32ed84 repair: row_level: coroutinize repair_meta::get_full_row_hashes() 2024-09-05 17:56:27 +03:00
Avi Kivity
607747beb1 repair: row_level: coroutinize repair_meta::apply_rows_on_follower() 2024-09-05 17:55:07 +03:00
Avi Kivity
89d4394d12 repair: row_level: coroutinize repair_meta::clear_working_row_buf() 2024-09-05 17:52:32 +03:00
Pavel Emelyanov
69a5ec69c4 test: Use table storage options in sstable_directory_test
When creating sstables this test allocates temporary local options.
That works, because this test doesn't run on object storage, but it's
more correct to pick storage options from the table at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20440
2024-09-05 17:48:25 +03:00
Avi Kivity
4cfc25f8d7 repair: row_level: coroutinize get_common_diff_detect_algorithm()
The function is threaded, but the inner lambda can be coroutinized.
2024-09-05 17:47:27 +03:00
Michael Litvak
9545e0a114 view: test view_build_status table with node replace
Add a test replacing a node and verifying the contents of the
view_build_status table are updated as expected, having rows for the new
node and no rows for the old node.
2024-09-05 15:42:35 +03:00
Michael Litvak
3ca5dd537f test/pylib: use view_build_status_v2 table in wait_for_view
Change the util function wait_for_view to read the view build status
from the system.view_build_status_v2 table which replaces
system_distributed.view_build_status.
The old table can still be used but it is less efficient because it's
implemented as a virtual table which reads from the v2 table, so it's
better to read directly from the v2 table. This can cause slowness in
tests.
The additional util function wait_for_view_v1 reads from the old table.
This may be needed in upgrade tests if the v2 table is not available
yet.
2024-09-05 15:42:35 +03:00
Michael Litvak
5c95aaae0d view_builder: common write view_build_status function
When writing to the view_build_status we have common logic related to
upgrade and deciding whether to write to sys_dist ks or group0.
Move this common logic to a generic function used by all functions
writing to the table.
2024-09-05 15:42:35 +03:00
Michael Litvak
c1f3517a75 view_builder: improve migration to v2 with intermediate phase
Add an intermediate phase to the view builder migration to v2 where we
write to both the old and new table in order to not lose writes during
the migration.
We add an additional view builder version v1_5 between v1 and v2 where
we write to both tables. We perform a barrier before moving to v2 to
ensure all the operations to the old table are completed.
2024-09-05 15:42:35 +03:00
Michael Litvak
446ad3c184 view: delete node rows from view_build_status on node removal
When a node is removed we want to clean its rows from the
view_build_status table.
Now when removing a node and generating the topology state update, we
generate also the mutations to delete all the possible rows belonging to
the node from the table.
2024-09-05 15:42:35 +03:00
Michael Litvak
08462aaff7 view: sanitize view_build_status during migration
When migrating the view_build_status to v2, skip adding any leftover
rows that don't correspond to an existing node or an existing view.

Previously such rows could have been created and not cleaned, for
example when a node is removed.
2024-09-05 15:42:35 +03:00
Michael Litvak
78d6ff6598 view: make old view_build_status table a virtual table
After migrating the view build status from
system_distributed.view_build_status to system.view_build_status_v2, we
set system_distributed.view_build_status to be a virtual table, such
that reading from it is actually reading from the underlying new table.

The reason for this is that we want to keep compatibility with the old
table, since it exists also in Cassandra and it is used by various external
tools to check the view build status. Making the table virtual makes the
transition transparent for external users.

The two tables are in different keyspaces and have different shard
mapping. The v1 table is a distributed table with a normal shard
mapping, and the v2 table is a local table using the null sharder. The
virtual reader works by constructing a multishard reader which reads the rows
from shard zero, and then filtering it to get only the rows owned by the
current shard.
2024-09-05 15:42:35 +03:00
Michael Litvak
09eadcff08 replica: move streaming_reader_lifecycle_policy to header file
move the class streaming_reader_lifecycle_policy to a header file in
order to make it reusable in other places.
2024-09-05 15:42:35 +03:00
Michael Litvak
22f4f1fa49 view_builder: test view_build_status_v2
Add tests to verify the new view_build_status_v2 is used by the
view_builder and can be read from all nodes with the expected values.
Also test a migration from the v1 layout to v2.
2024-09-05 15:42:35 +03:00
Michael Litvak
fcf66ad541 storage_service: add view_build_status to raft snapshot
Include the table system.view_build_status_v2 in the raft snapshot, and
also the view_builder version parameter.
2024-09-05 15:42:30 +03:00
Michael Litvak
8d25a4d678 view_builder: migration to v2
Migrate view_builder to v2, to store the view build status of all nodes
in the group0 based table view_build_status_v2.

Introduce a feature view_build_status_on_group0 so we know when all
nodes are ready to migrate and use the new table.

A new cluster is initialized to use v2. Otherwise, The topology coordinator
initiates the migration when the feature is enabled, if it was not done
already.

The migration reads all the rows in the v1 table and writes it via
group0 to the v2 table, together with a mutation that updates the
view_builder parameter in scylla_local to v2. When this mutation is
applied, it updates the view_builder service to start using the v2
table.
2024-09-05 15:41:04 +03:00
Michael Litvak
f3887cd80b db:system_keyspace: add view_builder_version to scylla_local
Add a new scylla_local parameter view_builder_version, and functions to
read and mutate the value.
The version value defaults to v1 if it doesn't exist in the table.
2024-09-05 15:41:04 +03:00
Michael Litvak
d58a8930c4 view_builder: read view status from v2 table
Update the view_status function to read from the new
view_build_status_v2 table when enabled.
The code to read and extract the values is identical to v1 and v2 except it
accesses different keyspace and table, so the common code is extracted
to the view_status_common function and used by both v1 and v2 flows with
appropriate parameters.
2024-09-05 15:41:04 +03:00
Michael Litvak
05d18b818f view_builder: introduce writing status mutations via raft
Introduce the announce_with_raft function as alternative to writing view build
status mutations to the table in system_distributed. Instead, we can
apply the mutations via group0 operation to the view_build_status_v2
table.
All the view_builder functions that write to the view_build_status table
can be configured by a flag to either write the legacy way or via raft.
2024-09-05 15:41:04 +03:00
Michael Litvak
b8c7a10ae6 view_builder: pass group0_client and qp to view_builder
Store references of group0_client and query_processor in the
view_builder service.
They are required for generating mutations and writing them via group0.
2024-09-05 15:41:04 +03:00
Michael Litvak
b2332c5a72 view_builder: extract sys_dist status operations to functions
Extract all the update and read operations of a view build status in the
table system_distributed.view_build_status to separate functions.
2024-09-05 15:41:04 +03:00
Michael Litvak
bf4a58bf91 db:system_keyspace: add view_build_status_v2 table
add the table system.view_build_status_v2 with the same schema as
system_distributed.view_build_status.
2024-09-05 15:41:04 +03:00
Gleb Natapov
807e37502a db/consistency_level: do not use result from heat weighted load balancer if it contains duplicates
Because of https://github.com/scylladb/scylladb/issues/9285 heat weighted
load balancer may sometimes return same node twice. It may cause wrong
data to be read or unexpected errors to be returned to a client. Since
the original bug is not easy to fix and it is rare lets introduce a
workaround. We will check for duplicates and will use non HWLB one if
one is found.

Fixes scylladb/scylladb#20430

Closes scylladb/scylladb#20414
2024-09-05 15:21:35 +03:00
Wojciech Mitros
c1b0434c16 test: finish mv view update explicitly instead of relying on delay duration
When testing mv admission control, we perform a large view update
and check if the following view update can be admitted due to the
high view backlog usage. We rely on a delay which keeps the backlog
high for longer to make sure the backlog is still increased during
the second write. However, in some test runs the delay is not long
enough, causing the second write to miss the large backlog and not
hit admission control.

In this patch we keep the increased backlog high using another
injection instead of relying on a delay to make absolute sure
that the backlog is still high during the second write.

Fixes scylladb/scylladb#20382

Closes scylladb/scylladb#20445
2024-09-05 15:08:04 +03:00
Lakshmi Narayanan Sreethar
7c5efab7d5 cql-pytest: add test to verify consider_only_existing_data compaction option
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:34:13 +05:30
Lakshmi Narayanan Sreethar
68a902f74a tools/scylla-nodetool: add consider-only-existing-data option to compact command
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:34:06 +05:30
Lakshmi Narayanan Sreethar
84d06a13c7 api: compaction: add consider_only_existing_data option
Added a new parameter `consider_only_existing_data` to major compaction
API endpoints. When enabled, major compaction will:

- Force-flush all tables.
- Force a new active segment in the commit log.
- Compact all existing SSTables and garbage-collect tombstones by only
  checking the SSTables being compacted. Memtables, commit logs, and
  other SSTables not part of the compaction will not be checked, as they
  will only contain newer data that arrived after the compaction
  started.

The `consider_only_existing_data` is passed down to the compaction
descriptor's `gc_check_only_compacting_sstables` option to ensure that
only the existing data is considered for garbage collection.

The option is also passed to the `maybe_flush_commitlog` method to make
sure all the tables are flushed and a new active segment is created in
the commit log.

Fixes #19728

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Lakshmi Narayanan Sreethar
98bc44f900 compaction: consider gc_check_only_compacting_sstables when deducing max purgeable timestamp
When gc_check_only_compacting_sstables is enabled,
get_max_purgeable_timestamp should not check memtables and other
sstables that are not part of the compaction to deduce the max purgeable
timestamp.

Refs #19728

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Lakshmi Narayanan Sreethar
7b9ce8e040 compaction: do not check commitlog if gc_check_only_compacting_sstables is enabled
When the compaction_descriptor's gc_check_only_compacting_sstables flag
is enabled, create and pass a copy of the get_tombstone_gc_state that
will skip checking the commitlog.

Refs #19728

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Lakshmi Narayanan Sreethar
12fa40154b tombstone_gc_state: introduce with_commitlog_check_disabled()
Added a new method, `with_commitlog_check_disabled`, that returns a new
copy of the tombstone_gc_state but with commitlog check disabled. This
will be used by a following patch to disable commitlog checks during
compaction.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Lakshmi Narayanan Sreethar
5b8c6a8a5e compaction: introduce new option to check only compacting sstables for gc
Added new option, `gc_check_only_compacting_sstables`, to
compaction_descriptor to control the garbage collection behavior. The
subsequent patches will use this flag to decide if the garbage
collection has to check only the SSTables being compacted to collect
tombstones. This option is disabled for now and will be enabled based on
a new compaction parameter that will be added later in this patch
series.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Lakshmi Narayanan Sreethar
5e6bffc146 compaction: rename maybe_flush_all_tables to maybe_flush_commitlog
Major compaction flushes all tables as a part of flushing the commitlog.
After forcing new active segments in the commitlog, all the tables are
flushed to enable reclaim of older commitlog segments. The main goal is
to flush the commitlog and flushing all the table is just a dependency.

Rename maybe_flush_all_tables to maybe_flush_commitlog so that it
reflects the actual intent of the major compaction code. Added a new
wrapper method to database::flush_all_tables(),
database::flush_commitlog(), that is now called from
maybe_flush_commitlog.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Lakshmi Narayanan Sreethar
fa2488cc83 compaction: maybe_flush_all_tables: add new force_flush param
Add a new parameter, `force_flush` to the maybe_flush_all_tables()
method. Setting `force_flush` to true will flush all the tables
regardless of when they were flushed last. This will be used by the new
compaction option in a following patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Laszlo Ersek
53524974db docs/dev/maintainer.md: clarify "Updating submodule references"
Before the introduction of "scripts/refresh-submodules.sh", there was
indeed some manual work for the maintainer to do, hence "publish your
work" must have sounded correct. Today, the phrase "publish your work"
sounds confusing.

Commit 71da4e6e79 ("docs: Document sync-submodules.sh script in
maintainer.md", 2020-06-18) should have arguably reworded the last step of
the submodule refresh procedure; let's do it now.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>

Closes scylladb/scylladb#20333
2024-09-05 13:57:32 +03:00
Pavel Emelyanov
1f0db29ef6 test: Remove unused directory semaphore
The with_sstable_dir() helper no longer needs one, it used to pass it as
argument to sstable_directory constructor, but now the directory doesn't
need it (takes semaphore via table object).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20396
2024-09-05 13:11:35 +03:00
Kefu Chai
b4fc24cc1f github: use needs.read-toolchain.outputs.image for build-scylla
so we don't need to hardwire the image on which we build scylla.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20370
2024-09-05 12:58:36 +03:00
Pavel Emelyanov
955391d209 sstable_directory: Fix indentation after previous patches
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 11:19:19 +03:00
Pavel Emelyanov
2febde24f3 sstable_directory: Use yielding lister in .handle_sstables_pending_delete()
Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 11:19:19 +03:00
Pavel Emelyanov
02aac3e407 sstable_directory: Use yielding lister in .cleanup_column_family_temp_sst_dirs()
Indentation is deliberately left broken

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 11:19:19 +03:00
Pavel Emelyanov
ff77a677a6 sstable_directory: Use yielding lister in .prepare()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 11:19:19 +03:00
Pavel Emelyanov
7b5fe6bee6 sstable_directory: Shorten lister loop
Squash call to lister.get() and check for the returned value into
while()'s condition. This saves few more lines of code as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 11:19:19 +03:00
Pavel Emelyanov
5dc266cefa sstable_directory: Use with_closeable() in .process()
The method already uses yielding lister, but handles the exceptions
explicitly. Use with_closeable() helper, it makes the code shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 11:19:19 +03:00
Pavel Emelyanov
7742b90cb1 directory_lister: Add noexcept default move-constructor
It's required to make it possible to push lister into with_closeable().
Its requiremenent of nothrow-move-constructible doesn't accept
default-generated one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 11:10:21 +03:00
Nikos Dragazis
2450afb934 sstables: Replace assert with on_internal_error
The `skip()` method of the compressed data source implementation uses an
assert statement to check if the given offset is valid.

Replace this with `on_internal_error()` to fail gracefully. An invalid
offset shouldn't bring the whole server down.

Also, enhance the error message for unsynced compressed readers.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-05 11:03:54 +03:00
Pavel Emelyanov
da598a6210 test: Restore indentation after previous changes
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:38:01 +03:00
Pavel Emelyanov
e16c07c896 test: Threadify tombstone_in_tombstone2()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:33 +03:00
Pavel Emelyanov
28d016f312 test: Threadify range_tombstone_reading()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:33 +03:00
Pavel Emelyanov
7d567d07ad test: Threadify tombstone_in_tombstone()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:33 +03:00
Pavel Emelyanov
a34e38f070 test: Threadify broken_ranges_collection()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:33 +03:00
Pavel Emelyanov
eac4ec47f8 test: Threadify compact_storage_dense_read()
Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:33 +03:00
Pavel Emelyanov
322c1ee9c5 test: Threadify compact_storage_simple_dense_read()
Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:33 +03:00
Pavel Emelyanov
df71b3e446 test: Threadify compact_storage_sparse_read()
Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:33 +03:00
Pavel Emelyanov
142ccc64fb test: Simplify test_range_reads() counting
It used to keep counter with the help of a smart pointer, now it can
just use on-stack variable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:33 +03:00
Pavel Emelyanov
a78ab2e998 test: Simplify test_range_reads() inner loop
It used to rely on bool (wrapped with pointer) and future<>-based loop
helper, now it can just break from the while loop.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:33 +03:00
Pavel Emelyanov
c84ae64562 test: Threadify test_range_reads() itself
And update its callers again.
Preserve no longer relevant local smart pointers until next patch.
Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:33 +03:00
Pavel Emelyanov
253d53b6a1 test: Threadify test_range_reads() callers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:36:00 +03:00
Pavel Emelyanov
fd8bb0c46c test: Threadify generate_clustered() itself
And update its callers again.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:35:59 +03:00
Pavel Emelyanov
f500ee690b test: Threadify generate_clustered() callers
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:34:54 +03:00
Pavel Emelyanov
08186c048d test: Threadify test_no_clustered test
And update its callers.
Indentation is deliberately left broken.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:26:25 +03:00
Pavel Emelyanov
5f0a40f959 test: Threadify nonexistent_key test
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 10:26:13 +03:00
Pavel Emelyanov
a150a63259 test: Squash two open_sstables() helper together
One accepts integer generations, another one accepts "generic" ones. The
latter is only called by the former, so no sense in keeping it around.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 09:08:40 +03:00
Pavel Emelyanov
4184c688ea test: Coroutinize open_sstables() helper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-05 09:08:12 +03:00
Piotr Dulikowski
ecd53db3b0 service/qos: remove the marked_for_deletion parameter
It is always set to false and it doesn't seem to serve any function now.
2024-09-04 21:52:34 +02:00
Piotr Dulikowski
bae6076541 service/qos: add constructors to service_level
Add a default constructor and a constructor which explicitly
initializes all fields of the service_level structure.

This is done in order to make sure that removal of the
marked_for_deletion field can be done safely - otherwise, for example,
service_level could be aggregate-initialized with an incomplete list of
values for the fields, and removing marked_for_deletion which is in the
middle of the struct would cause the is_static field to be initialized
with the value that was designated for marked_for_deletion.

As a bonus, make sure that marked_for_deletion and is_static bool fields
are initialized in the default constructor to false in order to avoid
potential undefined behavior.
2024-09-04 21:52:13 +02:00
Avi Kivity
ec8590ae6c Merge 'Always pass abort_source& to raft_group0_client::hold_read_apply_mutex' from Kamil Braun
There are two versions of `raft_group0_client::hold_read_apply_mutex`, one takes `abort_source&`, the other doesn't. Modify all call sites that used the non-abort-source version to pass an `abort_source&`, allowing us to remove the other overload.

If there is no explicit reason not to pass an `abort_source&`, then one should be passed by default -- it often prevents hangs during shutdown.

---

No backport needed -- no known issues affected by this change.

Closes scylladb/scylladb#19996

* github.com:scylladb/scylladb:
  raft_group0_client: remove `hold_read_apply_mutex` overload without `abort_source&`
  storage_service: pass `_abort_source` to `hold_read_apply_mutex`
  group0_state_machine: pass `_abort_source` to `hold_read_apply_mutex`
  api: move `reload_raft_topology_state` implementation inside `storage_service`
2024-09-04 21:35:27 +03:00
Kefu Chai
fe0e961856 docs: do not install scylla/ppa repo when perform upgrade
for following reasons:

1. the ppa in question does not provide the build for the latest ubuntu's LTS release. it only builds for trusty, xenial, bionic and jammy. according to https://wiki.ubuntu.com/Releases, the latest LTS release is ubuntu noble at the time of writing.
2. the ppa in question does not provide the packages used in production. it does provides the package for *building* scylla
3. after we introduced the relocatable package, there is no need to provide extra user space dependencies apart from scylla packages.

so, in this change, we remove all references to enabling the Scylla/PPA repository.

Fixes scylladb/scylladb#20449

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20450
2024-09-04 20:30:40 +03:00
Avi Kivity
20b79816f1 repair: row_level: coroutinize repair_service::remove_repair_meta() (non-selective overload) 2024-09-04 18:43:19 +03:00
Avi Kivity
3b9ac51b6b repair: row_level: coroutinize repair_service::remove_repair_meta() (by-address overload) 2024-09-04 18:39:21 +03:00
Avi Kivity
704e3f5432 repair: row_level: coroutinize repair_service::remove_repair_meta() (by-id overload) 2024-09-04 18:37:48 +03:00
Avi Kivity
9612c4d790 repair: row_level: row_level_repair::run()
The function itself is threaded, but the inner lambdas are coroutinized
(except one which is expected to run in a thread, and so is threaded).
2024-09-04 18:34:45 +03:00
Avi Kivity
2b94ee981b repair: row_level: row_level_repair::send_missing_rows_to_follower_nodes()
The function itself is threaded, but the inner lambda is coroutinized.
2024-09-04 18:28:27 +03:00
Avi Kivity
c768448339 repair: row_level: row_level_repair::get_missing_rows_from_follower_nodes()
The function itself is threaded, but the inner lambda is coroutinized.
2024-09-04 18:28:12 +03:00
Avi Kivity
d2f1b44487 repair: row_level: row_level_repair::negotiate_sync_boundary()
The function itself is threaded, but the inner lambda is coroutinized.
2024-09-04 18:21:39 +03:00
Kefu Chai
0756520f82 sstable: coroutinize sstable::seal_sstable()
for better readability.

presumably, `sstable::seal_sstable()` is not on the critical path,
and we don't need to worry about the overhead of using C++20 coroutine.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20410
2024-09-04 18:14:33 +03:00
Kefu Chai
88c5c3001a compaction: refactor compaction_manager::can_proceed()
instead of chaining the conditions with '&&', break them down.
for two reasons:

* for better readability: to group the conditions with the same
  purpose together
* so we don't look up the table twice. it's an anti-pattern of
  using STL, and it could be confusing at first glance.

this change is a cleanup, so it does not change the behavior.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20369
2024-09-04 18:12:29 +03:00
Avi Kivity
645e39e746 repair: row_level: coroutinize repair_put_row_diff_with_rpc_stream_process_op()
Both the outer function and the inner lambda are coroutinized.
2024-09-04 18:10:43 +03:00
Avi Kivity
4c05d0b965 repair: row_level: coroutinize repair_meta::get_sync_boundary_handler() 2024-09-04 15:33:40 +03:00
Avi Kivity
eea011fad5 repair: row_level: coroutinize repair_meta::get_sync_boundary()
Not really helping anything, but a coroutine is a safer platform for
future changes in administrative APIs.
2024-09-04 15:31:57 +03:00
Avi Kivity
91b88df956 repair: row_level: coroutinize repair_meta::repair_set_estimated_partitions_handler() 2024-09-04 15:20:53 +03:00
Avi Kivity
b73194c9bf repair: row_level: coroutinize repair_meta::repair_set_estimated_partitions()
Not really helping anything, but a coroutine is a safer platform for
future changes in administrative APIs.
2024-09-04 15:18:33 +03:00
Avi Kivity
a69fb626bd repair: row_level: coroutinize repair_meta::repair_get_estimated_partitions_handler() 2024-09-04 15:17:42 +03:00
Avi Kivity
5cd8207ac7 repair: row_level: coroutinize repair_meta::repair_get_estimated_partitions()
Not really helping anything, but a coroutine is a safer platform for
future changes in administrative APIs.
2024-09-04 15:16:32 +03:00
Avi Kivity
e108f867a9 repair: row_level: coroutinize repair_meta::repair_row_level_stop_handler() 2024-09-04 15:15:42 +03:00
Avi Kivity
ffbb973063 repair: row_level: coroutinize repair_meta::repair_row_level_stop()
Not really helping anything, but a coroutine is a safer platform for
future changes in administrative APIs.
2024-09-04 15:14:08 +03:00
Avi Kivity
587b6fe400 repair: row_level: coroutinize repair_meta::repair_row_level_start_handler() 2024-09-04 15:12:49 +03:00
Avi Kivity
db7b1014ff repair: row_level: coroutinize repair_meta::repair_row_level_start() 2024-09-04 15:10:45 +03:00
Avi Kivity
17b82265ae repair: row_level: coroutinize repair_meta::get_combined_row_hash_handler() 2024-09-04 15:08:58 +03:00
Avi Kivity
bacbdde791 repair: row_level: coroutinize repair_meta::get_combined_row_hash() 2024-09-04 15:07:27 +03:00
Avi Kivity
8b8dc5092f repair: row_level: coroutinize repair_meta::get_full_row_hashes_handler() 2024-09-04 15:05:28 +03:00
Avi Kivity
21e01990ff repair: row_level: coroutinize repair_meta::get_full_row_hashes_with_rpc_stream()
The when_all_succeed() call is changed to the safer coroutine::when_all(),
which avoids the temporary futures.
2024-09-04 15:03:00 +03:00
Avi Kivity
572fbfde09 repair: row_level: coroutinize repair_meta::request_row_hashes() 2024-09-04 14:07:59 +03:00
Nadav Har'El
15f8046fcb alternator ttl: fix use-after-free
The Alternator TTL scanning code uses an object "scan_ranges_context"
to hold the scanning context. One of the members of this object is
a service::query_state, and that in turn holds a reference to a
service::client_state. The existing constructor created a temporary
client_state object and saved a reference to it - which can result
in use after free as the temporary object is freed as soon as the
constructor ends.

The fix is to save a client_state in the scan_ranges_context object,
instead of a temporary object.

Fixes #19988

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20418
2024-09-03 22:15:18 +03:00
Pavel Emelyanov
c03b1e2827 test: Remove unused database argument from make_sstable_for_all_shards() helper
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20427
2024-09-03 21:36:28 +03:00
Calle Wilund
2695fefa81 commitlog/database: Make some commitlog options updatable + add feature listener
Makes some commitlog options runtime updatable. Most important for this case,
the usage of fragmented entries. Also adds a subscription in database on said
feature, to possibly enable once cluster enables it.
2024-09-03 16:38:28 +00:00
Calle Wilund
238a0236e5 features/config: Add feature for fragmented commitlog entries
Hides the functionality behind a cluster feature, i.e. postspones
using it until an upgrade is complete etc. This to allow rolling back
even with dirty nodes, at least until a cluster is commited.

Feature can also be disabled by scylla option, just in case. This will
lock it out of whole cluster, but this is probably good, because depending
on off or on, certain schema/raft ops might fail or succeed (due to large
mutations), and this should probably be equivalent across nodes.
2024-09-03 16:38:28 +00:00
Calle Wilund
9bf452c7a0 docs: Add entry on commitlog file format v4 2024-09-03 16:38:28 +00:00
Calle Wilund
ad595e4d6a commitlog_test: Add more oversized cases
Also adds some randomization to the tests.
2024-09-03 16:38:28 +00:00
Calle Wilund
1d5e509136 commitlog_replayer: Replay segments in order created
Minimizes potential buffer usage for fragmented entries.
2024-09-03 16:38:28 +00:00
Calle Wilund
61ff9486fb commitlog_replayer: Use replay state to support fragmented entries 2024-09-03 16:38:27 +00:00
Calle Wilund
7c16683184 commitlog_replayer: coroutinize partly 2024-09-03 16:38:27 +00:00
Calle Wilund
05bf2ae5d7 commitlog: Handle oversized entries
Refs #18161

Yet another approach to dealing with large commitlog submissions.

We handle oversize single mutation by adding yet another entry
type: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.

Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
                 *
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.

On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.

Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.

This approach bumps the file format (docs to come).

To ensure "atomicity" we both force syncronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.

v2:
* Improve some bookeep, ensure we keep track of segments and flush
  properly, to get counter correct
2024-09-03 16:38:27 +00:00
Anna Stuchlik
35796306a7 doc: comment out redirections for pages under Features
This commit temporarily disables redirections for all pages under Features
that were moved with this PR: https://github.com/scylladb/scylladb/pull/20401

Redirections work for all versions. This means that pages in 6.1 are redirected
to URLs that are not available yet (because 6.2 has not been released yet).

The redirections are correct and should be enabled when 6.2 is released:
I've created an issue to do it: https://github.com/scylladb/scylladb/issues/20428

Closes scylladb/scylladb#20429
2024-09-03 17:16:51 +02:00
Avi Kivity
6ddcf80d89 Merge 'Reuse sstable::test_env::reusable_sst() helper for pre-exsting sstables' from Pavel Emelyanov
Tests that try to access sstables from test/resource/ typically sstable::load() it after object creation. There's reusable_sst() helper for that. This PR fixes one more caller that still goes longer route by doing sstable and loading it on its own.

Closes scylladb/scylladb#20420

* github.com:scylladb/scylladb:
  test: Call reusable sst from ka_sst() helper
  test: Move sstable_open_config to reusable_sst()'s argument
2024-09-03 17:40:34 +03:00
Kamil Braun
504bf68ebb raft_group0_client: remove hold_read_apply_mutex overload without abort_source&
Ensure that every caller passes `abort_source&`.
2024-09-03 15:52:05 +02:00
Kamil Braun
79983723c8 storage_service: pass _abort_source to hold_read_apply_mutex
There's no point waiting for this lock if `storage_service` is being
aborted. In theory the lock, if held, should be eventually released by
whatever is holding it during shutdown -- but if there is some cyclic
reference between the services, and e.g. whatever holds the lock is
stuck because of ongoing shutdown and would only be unstuck by
`storage_service` getting stopped (which it can't because it's waiting
on the lock), that would cause a shutdown deadlock. Better to be safe
than sorry.
2024-09-03 15:52:05 +02:00
Kamil Braun
a7097fb985 group0_state_machine: pass _abort_source to hold_read_apply_mutex
`transfer_snapshot` was already passing `_abort_source` when trying to
take the lock but other member functions didn't.
2024-09-03 15:52:05 +02:00
Kamil Braun
a4d1065628 api: move reload_raft_topology_state implementation inside storage_service
In later commit we'll want to access more `storage_service` internals
in the API's implementation (namely, `_abort_source`)

Also moving the implementation there allows making
`service::topology_transition()` private again (it was made public in
992f1327d3 only for this API
implementation)
2024-09-03 15:52:03 +02:00
Andrei Chekun
27e5fa149a [test.py] Clean duplicated arg for test suite
Arguments mode and run_id already set in the _prepare_pytest_params, so
there is no need to set them one more time.
2024-09-03 14:41:57 +02:00
Andrei Chekun
8a9146ebda [test.py] Enable allure for python test
Enable allure adapter for all python tests. Add tag and parameters to the test to be able to distinguish them across modes and runs.

Related: https://github.com/scylladb/qa-tasks/issues/1665
2024-09-03 14:41:57 +02:00
Łukasz Paszkowski
20a6296309 test: Add reversed query tests on simulated upgrade process
Run the reversed queries on a 2-node cluster with CL=ALL with and
without NATIVE_REVERSE_QUERIES feature flag. When the flag is enabled,
the native reversed format is used, otherwise the legacy format.

The NATIVE_REVERSE_QUERIES feature flag is suppressed with an error
injection that simulates cluster upgrade process.

Backport is not required. The patch adds additional upgrade tests
for https://github.com/scylladb/scylladb/pull/18864

Closes scylladb/scylladb#20179
2024-09-03 14:45:08 +03:00
Pavel Emelyanov
0857b63259 Merge 'repair: row_level: coroutinize some slow-path functions' from Avi Kivity
This series coroutinizes up some functions in repair/row_level.cc. This enhances
readability and reduces bloat:

```
size  build/release/repair/row_level.o.{before,after}
   text	   data	    bss	    dec	    hex	filename
1650619	     48	    524	1651191	 1931f7	build/release/repair/row_level.o.before
1604610	     48	    524	1605182	 187e3e	build/release/repair/row_level.o.after
```

46kB of text were saved.

Functions that only touch a single mutation fragment were not coroutinized to avoid
adding a allocation in a fast path. In one case a function was split into a fast path and a
slow path.

Clean-up series, backport not needed.

Closes scylladb/scylladb#20283

* github.com:scylladb/scylladb:
  repair: row_level: restore indentation
  repair: row_level: coroutinize repair_meta::get_full_row_hashes_sink_op()
  repair: row_level: coroutinize repair_meta::get_full_row_hashes_source_op()
  repair: row_level: coroutinize repair_get_full_row_hashes_with_rpc_stream_handler()
  repair: row_level: coroutinize repair_put_row_diff_with_rpc_stream_handler()
  repair: row_level: coroutinize repair_get_row_diff_with_rpc_stream_handler()
  repair: row_level: coroutinize repair_get_full_row_hashes_with_rpc_stream_process()
  repair: row_level: coroutinize repair_get_row_diff_with_rpc_stream_process_op_slow_path()
  repair: row_level: split repair_get_row_diff_with_rpc_stream_process_op() into fast and slow paths
  repair: row_level: coroutinize repair_meta::put_row_diff_handler()
  repair: row_level: coroutinize repair_meta::put_row_diff_sink_op()
  repair: row_level: coroutinize repair_meta::put_row_diff_source_op()
  repair: row_level: coroutinize repair_meta::put_row_diff()
  repair: row_level: coroutinize repair_meta::get_row_diff_handler()
  repair: row_level: coroutinize repair_meta::get_row_diff_sink_op()
  repair: row_level: coroutinize repair_meta::to_repair_rows_on_wire()
  repair: row_level: coroutinize repair_meta::do_apply_rows()
  repair: row_level: coroutinize repair_meta::copy_rows_from_working_row_buf_within_set_diff()
  repair: row_level: coroutinize repair_meta::copy_rows_from_working_row_buf()
  repair: row_level: coroutinize repair_meta::row_buf_csum()
  repair: row_level: coroutinize repair_meta::get_repairs_row_size()
  repair: row_level: coroutinize repair_meta::set_estimated_partitions()
  repair: row_level: coroutinize repair_meta::get_estimated_partitions()
  repair: row_level: coroutinize repair_meta::do_estimate_partitions_on_local_shard()
  repair: row_level: coroutinize repair_reader::close()
  repair: row_level: coroutinize repair_reader::end_of_stream()
  repair: row_level: coroutinize sink_source_for_repair::close()
  repair: row_level: coroutinize sink_source_for_repair::get_sink_source()
2024-09-03 14:41:22 +03:00
Nadav Har'El
dd030f8112 alternator: improve RBAC access denied error messages
This patch address two requests made by reviewers of the original "Add
CQL-based RBAC support to Alternator" series. Both requests were about
the error messages produced when access is denied:

1. The error message is improved to use more proper English, and also
   to include the name of the role which was denied access.

2. The permission-check and error-message-formatting code is
   de-duplicated, using a common function verify_permission().

   This de-duplication required moving the access-denied error path to
   throwing an exception instead of the previous exception-free
   implementation. However, it can be argued that this change is actually
   a good thing, because it makes the successful case, when access is
   allowed, faster.

   The de-duplicated code is shorter and simpler, and allowed changing
   the text of the error message in just one place.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20326
2024-09-03 14:39:30 +03:00
Kefu Chai
d26bb9ae30 sstables: correct the debugging message printed when removing temp dir
in 372a4d1b79, we introduced a change
which was for debugging the logging message. but the logging message
intended for printing the temp_dir not prints an `optional<int>`. this
is both confusing, and more importantly, it hurts the debuggability.

in this change, the related change is reverted.

Fixes scylladb/scylladb#20408

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20409
2024-09-03 14:36:08 +03:00
Pavel Emelyanov
e4bc5470cf test: Call reusable sst from ka_sst() helper
The sstable_mutation_test wants to load pre-existing sstables from
resouce/ subdir. For that there's reusable_sst() helper on env.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-03 14:01:28 +03:00
Pavel Emelyanov
e9980bd6dd test: Move sstable_open_config to reusable_sst()'s argument
So that callers are able to provide custom config in the future

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-09-03 14:00:59 +03:00
Laszlo Ersek
cd0819e3ed docs/dev/docker-hub.md: refresh aio-max-nr calculation
What we have today in "docs/dev/docker-hub.md" on "aio-max-nr" dates back
to scylla commit f4412029f4 ("docs/docker-hub.md: add quickstart section
with --smp 1", 2020-09-22). Problems with the current language:

- The "65K" claim as default value on non-production systems is wrong;
  "fs/aio.c" in Linux initializes "aio_max_nr" to 0x10000, which is 64K.

- The section in question uses equal signs (=) incorrectly. The intent was
  probably to say "which means the same as", but that's not what equality
  means.

- In the same section, the relational operator "<" is bogus. The available
  AIO count must be at least as high (>=) as the requested AIO count.

- Clearer names should be used;
  adjust_max_networking_aio_io_control_blocks() in "src/core/reactor.cc"
  sets a great example:

  - "reactor::max_aio" should be called "storage_iocbs",

  - "detect_aio_poll" should be called "preempt_iocbs",

  - "reactor_backend_aio::max_polls" should be called "network_iocbs".

- The specific value 10000 for the last one ("network_iocbs") is not
  correct in scylla's context. It is correct as the Seastar default, but
  scylla has used 50000 since commit 2cfc517874 ("main, test: adjust
  number of networking iocbs", 2021-07-18).

Rewrite the section to address these problems.

See also:
- https://github.com/scylladb/scylladb/issues/5981
- https://github.com/scylladb/seastar/pull/2396
- https://github.com/scylladb/scylladb/pull/19921

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-09-03 12:10:59 +02:00
Laszlo Ersek
15738d14ce docs/dev/docker-hub.md: strip trailing whitespace
Strip trailing whitespace.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-09-03 12:00:28 +02:00
Botond Dénes
2556e902b1 Update tools/jmx submodule
* tools/jmx 89308b77...793452a9 (1):
  > dist: support building packages in Github Actions
2024-09-03 11:58:37 +03:00
Anna Stuchlik
5193d2d171 doc: remove the seeds-related questions from the FAQ
This commit one of the series to remove the FAQ page by removing irrelevant/outdated entries
or moving them to the forum.

The question about seeds is irrelevant, not frequently asked, and covered in other sections
of the docs. Also, it mentions versions that are no longer supported.

Closes scylladb/scylladb#20403
2024-09-03 11:01:49 +03:00
Takuya ASADA
9d7fed40b5 install.sh: fix more incorrect permission on strict umask
Even after 13caac7, we still have more files incorrect permission, since
we use "cp -r" and creating new file with redirect.

To fix this, we need to replace "cp -r" with "cp -pr", and "chmod <perm>" on
newly created files.

Fixes #14383
Related #19775

Closes scylladb/scylladb#19786
2024-09-03 10:37:53 +03:00
Anna Stuchlik
360f7b3d33 doc: move Features to the top-level page
This commit moves the Features page from the section for developers
to the top level in the page tree. This involves:
- Moving the source files to the *features* folder from the  *using-scylla* folder.
- Moving images into *features/images* folder.
- Updating references to the moved resources.
- Adding redirections to the moved pages.

Closes scylladb/scylladb#20401
2024-09-03 07:24:33 +03:00
Kefu Chai
fb2ed20b42 .github: post a comment if "Fixes" policy is violated
it's more visible than an "Error" in the action's detail message.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19271
2024-09-03 07:23:48 +03:00
Botond Dénes
8f31d3f1fc Merge 'tools/nodetool: improve backup and restore commands' from Kefu Chai
this change contains two improvements to "backup" and "restore" commands:

- let them print task id
- let them return 1 as the exist status code upon operation failure

----

these changes are improvements to the newly introduced commands, which are not in any LTS branches yet, so no need to backport.

Closes scylladb/scylladb#20371

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: return failure with exit code in backup/restore
  tools/scylla-nodetool: let backup/restore print task id
2024-09-02 16:40:55 +03:00
Takuya ASADA
59aedb38d0 locator: retry HTTP request to GCE/Azure metadata service
Like we already do on EC2, implement retrying request to the metadata
service on GCE and Azure.

Closes #19817

Closes scylladb/scylladb#20189
2024-09-02 13:04:05 +03:00
Kefu Chai
e66e885e5b tools/scylla-nodetool: return failure with exit code in backup/restore
before this change, "backup" and "restore" commands always return 0 as
their exist code no matter if the performed operation fails or not.

inspired by the "task" commands of nodetool, let's return 1 with
exit code if the operation fails.

the tests are updated accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-02 15:12:26 +08:00
Kefu Chai
470c3e8535 tools/scylla-nodetool: let backup/restore print task id
in 20fffcdc, we added the "task wait" subcommand, so user is allowed to
interact with a task with its task id. and in existing implementation of
"backup" and "restore" command, if user does not pass `--nowait`, the
command just exits without any output upon sending the request to
scylladb.

in this change, we print out the task_id if user does not pass
`--nowait` command line option to "backup" or "restore" command. this
allows user to follow up on the operation if necessary.

the tests are updated accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-02 15:12:26 +08:00
Nadav Har'El
0b3890df46 test/cql-pytest: test RBAC auto-grant (and reproduce CDC bug)
This patch adds functional testing for the role-based access control
(RBAC) "auto-grant" feature, where if a user that is allowed to create
a table, it also recieves full permissions over the table it just
created. We also test permissions over new materialized views created
by a user, and over CDC logs. The test for CDC logs reproduces an
already suspected bug, #19798: A user may be allowed to create a table
with CDC enabled, but then is not allowed to read the CDC log just
created. The tests show that the other cases (base tables and views)
do not have this bug, and the creating user does get appropriate
permissions over the new table and views.

In addition to testing auto-grant, the patch also includes tests for
the opposite feature, "auto-revoke" - that permissions are removed when
the table/view/cdc is deleted. If we forget to do that while implementing
auto-grant, we risk that users may be able to use tables created by
other users just because they used the same table _name_ earlier.

It's important to have these auto-revoke tests together with the
auto-grant tests that reproduce #19798 - so we don't forget this
part when finally fixing #19798.

Refs #19798.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19845
2024-09-02 09:03:40 +03:00
Botond Dénes
52bed81a1e Merge 'cql3: add option to not unify bind variables with the same name' from Avi Kivity
Bind variables in CQL have two formats: positional (`?`) where a variable is referred to by its relative position in the statement, and named (`:var`), where the user is expected to supply a name->value mapping.

In 19a6e69001 we identified the case where a named bind variable appears twice in a query, and collapsed it to a single entry in the statement metadata. Without this, a driver using the named variable syntax cannot disambiguate which variable is referred to.

However, it turns out that users can use the positional call form even with the named variable syntax, by using the positional API of the driver. To support this use case, we add a configuration variable to disable the same-variable detection.

Because the detection has to happen when the entire statement is visible, we have to supply the configuration to the parser. We call it the `dialect` and pass it from all callers. The alternative would be to add a pre-prepare call similar to fill_prepare_context that rewrites all expressions in a statement to deduplicate variables.

A unit test is added.

Fixes #15559

This may be useful to users transitioning from Cassandra, so merits a backport.

Closes scylladb/scylladb#19493

* github.com:scylladb/scylladb:
  cql3: add option to not unify bind variables with the same name
  cql3: introduce dialect infrastructure
  cql3: prepared_statement_cache: drop cache key default constructor
2024-09-02 08:34:24 +03:00
Kefu Chai
28b5471c01 docs/dev/maintainer.md: fix formatting
* in the "Backporting Seastar commits" section, there's a single quote
  instead of a backtick in this line, so fix it.
* add backticks around `refresh-submodules.sh`, which is a filename.
* correct the command line setting a git config option, because `git-config`
  does not support this command line syntax,

  ```console
  $ git config --global diff.conflictstyle = diff3
  $ git config --global get diff.conflictstyle
  =
  $ git config --global diff.conflictstyle diff3
  $ git config --global get diff.conflictstyle
  diff3
  ```

  quote from git-config(1)

  > ```
  > git config set [<file-option>] [--type=<type>] [--all] [--value=<value>] [--fixed-value] <name> <value>
  > ```
* stop using the deprecated mode of the `git-config` command, and use
  subcommand instead. as git-config(1) puts:

  > git config <name> <value> [<value-pattern>]
  >   Replaced by git config set [--value=<pattern>] <name> <value>.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20328
2024-09-01 22:24:01 +03:00
Yaniv Michael Kaul
2ebba9cd11 tools/toolchain/dbuild: prefer podman over docker
Check if podman is available before docker. If it is, use it. Otherwise, check for docker.

1. Podman is better. It runs with fewer resources, and I've had display issues with Docker (output was not shown consistently)
2. 'which docker' works even when the docker service and socket are turned off.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#20342
2024-09-01 22:17:01 +03:00
David Garcia
c4da75e392 docs: run docs test on changing config params
Triggers the "Build Docs" PR workflow whenever the `db/config.cc` or `db/config.h` files are edited. These files are used to produce documentation, and this change will help prevent the introduction of breaking changes to the documentation build when they are modified.

Closes scylladb/scylladb#20347
2024-09-01 22:15:48 +03:00
Avi Kivity
0f4b05824e Merge 'perf/perf_sstable: add {crawling,partitioned}_streaming modes' from Kefu Chai
for testing the load performance of load_and_stream operation.

Refs #19989

---

no need to backport. it adds two new tests to the existing `perf_sstable` tool for evaluating the load performance when performing the "load_and_streaming" operation. hence has no impact on the production.

Closes scylladb/scylladb#20186

* github.com:scylladb/scylladb:
  perf/perf_sstable: add {crawling,partitioned}_streaming modes
  test/perf/perf_sstable: use switch-case when appropriate
2024-09-01 22:04:22 +03:00
Avi Kivity
7197d280b0 Merge 'scylla-gdb.py: lazy-evaluate the constants ' from Kefu Chai
instead of evaluating the constants in-class, accessing them via
a cached class property.

it would be handy if we could source `scylla-gdb.py` in `.gdbinit`,
but this script accesses some symbols which are not available without
a file being debugged. what's why gdb fails to load the init script:

```
Traceback (most recent call last):
  File "/home/kefu/dev/scylladb/scylla-gdb.py", line 167, in <module>
    class intrusive_slist:
  File "/home/kefu/dev/scylladb/scylla-gdb.py", line 168, in intrusive_slist
    size_t = gdb.lookup_type('size_t')
             ^^^^^^^^^^^^^^^^^^^^^^^^^
gdb.error: No type named size_t.
```

so we have to `file path/to/scylla` and *then*
`source scylla-gdb.py` every time when we debug scylla or a seastar
application, instead of loading `scylla-gdb.py` in `.gdbinit`.

the reason is that the script accesses the debug symbols like
`gdb.lookup_type('size_t')` in-class. so when the python interpreter
reads the script, it evaluates this statement, but at that moment,
the debug symbols are not loaded, so `source scylla-gdb.py` fails
in `.gdbinit`.

in this change, we transform all these class variables to cached
properties, so that they

* are evaluated on-demand
* are evaluated only once at most

this addresses the pain at the expense of verbosity.

---

this change intends to improve the developer's user experience, and has no impacts on product, so no need to backport.

Closes scylladb/scylladb#20334

* github.com:scylladb/scylladb:
  test/scylla_gdb: test the .gdb init use case
  scylla-gdb.py: lazy-evaluate the constants
2024-09-01 20:00:53 +03:00
Pavel Emelyanov
7df43312ac test: Remove sstable making helpers from table_for_tests
All users of it have sstable_test_env at hand (in fact -- they call env
method to get table_for_test). And since sstable_test_env already has a
bunch of methods to create sstable, the table_for_test wrapper doesn't
need to duplicate this code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20360
2024-09-01 19:58:15 +03:00
Kefu Chai
bc2b7b47c8 build: cmake: add and use Scylla_CLANG_INLINE_THRESHOLD cmake parameter
so that we can set this the parameter passed to `-inline-threshold` with
`configure.py` when building with CMake.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20364
2024-09-01 19:56:02 +03:00
Kefu Chai
6970c502c9 dist: drop %pretrans section
before this change, if user does not have `/bin/sh` around, when
installing scylla packages, the script in `%pretrans" is executed,
and fails due to missing `/bin/sh`. per
https://docs.fedoraproject.org/en-US/packaging-guidelines/Scriptlets/#pretrans

> Note that the %pretrans scriptlet will, in the particular case of
> system installation, run before anything at all has been installed.
> This implies that it cannot have any dependencies at all. For this
> reason, %pretrans is best avoided, but if used it MUST (by necessity)
> be written in Lua. See
> https://rpm-software-management.github.io/rpm/manual/lua.html for more
> information.

but we were trying to warn users upgrading from scylla < 1.7.3, which
was released 7 years ago at the time of writing.

in this change, we drop the `%pretrans` section. hopefuly they will
find their way out if they still exist.

Fixes scylladb/scylladb#20321

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20365
2024-09-01 19:46:19 +03:00
Kefu Chai
a06e1c6545 scylla-housekeeping: use raw string to avoid using escape sequence
before this change, when running `scylla-housekeeping`:
```
/opt/scylladb/scripts/libexec/scylla-housekeeping:122: SyntaxWarning: invalid escape sequence '\s'
  match = re.search(".*http.?://repositories.*/scylladb/([^/\s]+)/.*/([^/\s]+)/scylladb-.*", line)
```

we could have the warning above. because `\s` is not a valid escape
sequence, but the Python interpreter accepts it as two separated
characters of `\s` after complaining. but it's still annoying.

so, let's use a raw string here.

Refs scylladb/scylladb#20317
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20359
2024-09-01 18:59:23 +03:00
Kefu Chai
e431b90145 test/boost/view_build_test: include used header
before this change, when building the test of `view_build_test` with
clang-20, we can have following build failure:

```
FAILED: test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o -MF test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o.d -o test/boost/CMakeFiles/view_build_test.dir/Debug/view_build_test.cc.o -c /home/kefu/dev/scylladb/test/boost/view_build_test.cc
/home/kefu/dev/scylladb/test/boost/view_build_test.cc:998:5: error: unknown type name 'simple_schema'
  998 |     simple_schema ss;
      |     ^
```

apparently, `simple_schema`'s declaration is not available in this
translation unit.

in this change

* we include the header where `simple_schema` is defined, so that
  the build passes with clang-20.
* also take this opportunity to reorder the header a little bit,
  so the testing headers are grouped together.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20367
2024-09-01 18:58:23 +03:00
Kefu Chai
753188c33d test: include seastar/testing/random.hh when appropriate
in a recent seastar change (644bb662), we do not include
`seastar/testing/random.hh` in `seastar/testing/test_runner.hh` anymore,
as the latter is not a facade of the former, and neither does it use the
former. as a sequence, some tests which take the advantage of the
included `seastar/testing/random.hh` do not build with the latest
seastar:

```
FAILED: test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o
/usr/bin/clang++ -DBOOST_REGEX_DYN_LINK -DBOOST_REGEX_NO_LIB -DBOOST_UNIT_TEST_FRAMEWORK_DYN_LINK -DBOOST_UNIT_TEST_FRAMEWORK_NO_LIB -DDEVEL -DFMT_SHARED -DSCYLLA_BUILD_MODE=dev -DSCYLLA_ENABLE_ERROR_INJECTION -DSCYLLA_ENABLE_PREEMPTION_SOURCE -DSEASTAR_API_LEVEL=7 -DSEASTAR_ENABLE_ALLOC_FAILURE_INJECTION -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/__w/scylladb/scylladb -I/__w/scylladb/scylladb/build/gen -I/__w/scylladb/scylladb/seastar/include -I/__w/scylladb/scylladb/build/seastar/gen/include -I/__w/scylladb/scylladb/build/seastar/gen/src -I/__w/scylladb/scylladb/build -isystem /__w/scylladb/scylladb/abseil -isystem /__w/scylladb/scylladb/build/rust -O2 -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/__w/scylladb/scylladb/build=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -MD -MT test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o -MF test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o.d -o test/lib/CMakeFiles/test-lib.dir/key_utils.cc.o -c /__w/scylladb/scylladb/test/lib/key_utils.cc
In file included from /__w/scylladb/scylladb/test/lib/key_utils.cc:11:
/__w/scylladb/scylladb/test/lib/random_utils.hh:25:30: error: no member named 'local_random_engine' in namespace 'seastar::testing'
   25 |     return seastar::testing::local_random_engine;
      |            ~~~~~~~~~~~~~~~~~~^
1 error generated.
```

in this change, we include `seastar/testing/random.hh` when the random
facility is used, so that they can be compiled with the latest seastar
library.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20368
2024-09-01 18:57:07 +03:00
Kefu Chai
0104c7d371 tools/scylla-nodetool: s/vm.count()/vm.contains()/
under the hood, std::map::count() and std::map::contains() are nearly
identical. both operations search for the given key witin the map.
however, the former finds a equal range with the given
key, and gets the distance between the disntance between the begin
and the end of the range; while the later just searches with the given
key.

since scylla-nodetool is not a performance-critical application, the
minor difference in efficiency between these two operations is unlikely
to have a significant impact on its overall performance.

while std::map::count() is generally suitable for our need, it might be
beneficial to use a more appropriate API.

in this change, we use std::map::contains() in the place of
std::map::count() when checking for the existence of a paramter with
given name.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20350
2024-09-01 18:39:00 +03:00
Avi Kivity
ddf344e4f1 Merge 'compaction: use structured binding and ranges library when appropriate' from Kefu Chai
for better readability

---

it's a cleanup, hence no need to backport.

Closes scylladb/scylladb#20366

* github.com:scylladb/scylladb:
  compaction: use std::views::reverse when appropriate
  compaction: use structured binding when appropriate
2024-09-01 18:35:15 +03:00
Avi Kivity
ea8441dfa3 cql3: add option to not unify bind variables with the same name
Bind variables in CQL have two formats: positional (`?`) where a
variable is referred to by its relative position in the statement,
and named (`:var`), where the user is expected to supply a
name->value mapping.

In 19a6e69001 we identified the case where a named bind variable
appears twice in a query, and collapsed it to a single entry in the
statement metadata. Without this, a driver using the named variable
syntax cannot disambiguate which variable is referred to.

However, it turns out that users can use the positional call form
even with the named variable syntax, by using the positional
API of the driver. To support this use case, we add a configuration
variable to disable the same-variable detection.

Because the detection has to happen when the entire statement is
visible, we have to supply the configuration to the parser. We
call it the `dialect` and pass it from all callers. The alternative
would be to add a pre-prepare call similar to fill_prepare_context that
rewrites all expressions in a statement to deduplicate variables.

A unit test is added.

Fixes #15559
2024-09-01 17:27:48 +03:00
Avi Kivity
60acfd8c08 docs: cql: document ZstdCompressor for CREATE TABLE
Adjust the wording slightly to be less awkward.

Closes scylladb/scylladb#20377
2024-09-01 14:28:09 +03:00
Kefu Chai
e53a9a99cd compaction: use std::views::reverse when appropriate
let's use the standard library when appropriate.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-01 08:44:01 +08:00
Kefu Chai
3801c079e2 compaction: use structured binding when appropriate
for better readability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-01 08:34:10 +08:00
Avi Kivity
61e6a77a99 repair: row_level: restore indentation 2024-08-30 23:00:59 +03:00
Avi Kivity
a35942e09a repair: row_level: coroutinize repair_meta::get_full_row_hashes_sink_op()
Extra care is needed for exception handling.
2024-08-30 22:55:16 +03:00
Avi Kivity
8e9ebd82fc repair: row_level: coroutinize repair_meta::get_full_row_hashes_source_op() 2024-08-30 22:55:16 +03:00
Avi Kivity
f7d19e237d repair: row_level: coroutinize repair_get_full_row_hashes_with_rpc_stream_handler()
Both the handle_exception() and finally() blocks need some extra care.
2024-08-30 22:55:16 +03:00
Avi Kivity
bb8751f4b5 repair: row_level: coroutinize repair_put_row_diff_with_rpc_stream_handler()
Both the handle_exception() and finally() blocks need some extra care.
2024-08-30 22:55:16 +03:00
Avi Kivity
7ba0642da2 repair: row_level: coroutinize repair_get_row_diff_with_rpc_stream_handler()
Both the handle_exception() and finally() blocks need some extra care.
2024-08-30 22:55:16 +03:00
Avi Kivity
61bbf452c6 repair: row_level: coroutinize repair_get_full_row_hashes_with_rpc_stream_process() 2024-08-30 22:55:16 +03:00
Avi Kivity
01a578f608 repair: row_level: coroutinize repair_get_row_diff_with_rpc_stream_process_op_slow_path() 2024-08-30 22:55:16 +03:00
Avi Kivity
3733105f78 repair: row_level: split repair_get_row_diff_with_rpc_stream_process_op() into fast and slow paths
This allows coroutinization of the slow path without affecting the fast path.
2024-08-30 22:55:16 +03:00
Avi Kivity
e17c3b71a8 repair: row_level: coroutinize repair_meta::put_row_diff_handler() 2024-08-30 22:55:16 +03:00
Avi Kivity
74ea2b9663 repair: row_level: coroutinize repair_meta::put_row_diff_sink_op()
Exception handling is a bit awkward since can't co_await in a catch block.
2024-08-30 22:55:16 +03:00
Avi Kivity
e4362a5b7b repair: row_level: coroutinize repair_meta::put_row_diff_source_op() 2024-08-30 22:55:16 +03:00
Avi Kivity
b998d69f09 repair: row_level: coroutinize repair_meta::put_row_diff() 2024-08-30 22:55:16 +03:00
Avi Kivity
3f2b5fe5dc repair: row_level: coroutinize repair_meta::get_row_diff_handler() 2024-08-30 22:55:16 +03:00
Avi Kivity
cd63971501 repair: row_level: coroutinize repair_meta::get_row_diff_sink_op()
Since sink.close() is called from an exception handler, some code
movement is needed so it isn't co_awaited from a catch block.
2024-08-30 22:55:16 +03:00
Avi Kivity
3f28dec88c repair: row_level: coroutinize repair_meta::to_repair_rows_on_wire()
coroutine::maybe_yield() introduced to compensate for loss of
stall-protected do_for_each()
2024-08-30 22:55:16 +03:00
Avi Kivity
1a84f1a73d repair: row_level: coroutinize repair_meta::do_apply_rows()
coroutine::maybe_yield() introduced to compensate for loss of
stall-protected do_for_each()
2024-08-30 22:55:16 +03:00
Avi Kivity
7f15cc446f repair: row_level: coroutinize repair_meta::copy_rows_from_working_row_buf_within_set_diff()
coroutine::maybe_yield() introduced to compensate for loss of
stall-protected do_for_each()
2024-08-30 22:55:16 +03:00
Avi Kivity
93ca202bd3 repair: row_level: coroutinize repair_meta::copy_rows_from_working_row_buf()
coroutine::maybe_yield() introduced to compensate for loss of
stall-protected do_for_each()
2024-08-30 22:55:15 +03:00
Avi Kivity
5f8895d908 repair: row_level: coroutinize repair_meta::row_buf_csum()
coroutine::maybe_yield() introduced to compensate for loss of
stall-protected do_for_each()
2024-08-30 22:55:15 +03:00
Avi Kivity
d1e45f2982 repair: row_level: coroutinize repair_meta::get_repairs_row_size()
coroutine::maybe_yield() introduced to compensate for loss of
stall-protected do_for_each()
2024-08-30 22:55:15 +03:00
Avi Kivity
0b1bf57d19 repair: row_level: coroutinize repair_meta::set_estimated_partitions() 2024-08-30 22:55:15 +03:00
Avi Kivity
aee078d8e5 repair: row_level: coroutinize repair_meta::get_estimated_partitions() 2024-08-30 22:55:15 +03:00
Avi Kivity
51534f60eb repair: row_level: coroutinize repair_meta::do_estimate_partitions_on_local_shard() 2024-08-30 22:55:12 +03:00
Kamil Braun
e01cef01a6 Merge 'Ignore seed name resolution errors during the restart of a cluster member node.' from Sergey Zolotukhin
All seeds hostname resolution errors will be ignored during a node
restart in case the node had already joined a cluster.  This will
prevent restart errors if some seed names are not resolvable.

Fixes scylladb/scylladb#14945

Closes scylladb/scylladb#20292

* github.com:scylladb/scylladb:
  Ignore seed name resolution errors on restart.
  Add a test for starting with a wrong seed.
2024-08-30 11:33:44 +02:00
Kamil Braun
292ef0d1f9 Merge 'Fix node replace with inter-dc encryption enabled.' from Gleb Natapov
Currently if a coordinator and a node being replaced are in the same DC
while inter-dc encryption is enabled (connections between nodes in the
same DC should not be encrypted) the replace operation will fail. It
fails because a coordinator uses non encrypted connection to push raft
data to the new node, but the new node will not accept such connection
until it knows which DC the coordinator belongs to and for that the raft
data needs to be transferred.

The series adds the test for this scenario and the fix for the
chicken&egg problem above.

The series (or at least the fix itself) needs to be backported because
this is a serious regression.

Fixes: scylladb/scylladb#19025

Closes scylladb/scylladb#20290

* github.com:scylladb/scylladb:
  topology coordinator: fix indentation after the last patch
  topology coordinator: do not add replacing node without a ring to topology
  test: add test for replace in clusters with encryption enabled
  test.py: add server encryption support to cluster manager
  .gitignore: fix pattern for resources to match only one specific directory
2024-08-30 11:29:05 +02:00
Kefu Chai
82fbe317ec test/scylla_gdb: test the .gdb init use case
before this change, we run all the tests in a single pytest session,
with scylladb debug symbols loaded. but we want to test another use
case, where the scylladb debug symbols are missing.

in this change,

* we do not check for the existence of debug symbols until necessary
* add a mark named "without_scylla"
* run the tests in two pytest sessions
  - one with "without_scylla" mark
  - one with "not without_scylla" mark
* add a test which is marked with the "without_scylla" mark. the test
  verify that the scylla-gdb.py script can be loaded even without
  scylladb debug symbols.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-30 17:05:29 +08:00
Kefu Chai
7dd63c891f scylla-gdb.py: lazy-evaluate the constants
instead of evaluating the constants in-class, accessing them via
a cached class property.

it would be handy if we could source `scylla-gdb.py` in `.gdbinit`,
but this script accesses some symbols which are not available with
a file being debugged. so when gdb fails to load init script:

```
Traceback (most recent call last):
  File "/home/kefu/dev/scylladb/scylla-gdb.py", line 167, in <module>
    class intrusive_slist:
  File "/home/kefu/dev/scylladb/scylla-gdb.py", line 168, in intrusive_slist
    size_t = gdb.lookup_type('size_t')
             ^^^^^^^^^^^^^^^^^^^^^^^^^
gdb.error: No type named size_t.
```

so we have to `file path/to/scylla` and *then*
`source scylla-gdb.py` every time when we debug scylla or a seastar
application, instead of loading `scylla-gdb.py` in `.gdbinit`.

the reason is that the script access the debug symbols like
`gdb.lookup_type('size_t')` in-class. so when the python interpreter
reads the script, it evaluates this statement, but at that moment,
the debug symbols are not loaded, so `source scylla-gdb.py` fails
in `.gdbinit`.

in this change, we transform all these class variables to cached
property, so that they

* are evaluated on-demand
* are evaluated only once at most

this addresses the pain at the expense of verbosity.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-30 17:05:29 +08:00
Pavel Emelyanov
cec4d207f6 Merge 'repair: throw if batchlog manager isn't initialized' from Aleksandra Martyniuk
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.

Throw if batchlog manager isn't initialized.

Fixes:  #20236.

Needs backport to 6.0 and 6.1 as they suffer from the uninitialized bm access.

Closes scylladb/scylladb#20251

* github.com:scylladb/scylladb:
  test: add test to ensure repair won't fail with uninitialized bm
  repair: throw if batchlog manager isn't initialized
2024-08-30 11:37:24 +03:00
Anna Stuchlik
4471c80bdc doc: add the 6.1-to-6.2 upgrade guide
This commit replaces the 6.0-to-6.1 upgrade guide with the 6.1-to-6.2 upgrade guide.

The new guide is a template that covers the basic procedure.
If any 6.2-specific updates are required, they will have to be added along with development.

Closes scylladb/scylladb#20178
2024-08-30 10:10:45 +03:00
Piotr Dulikowski
c05be27e4a Merge 'db/hints: Move the code for writing hints to a separate function' from Dawid Mędrek
In scylladb/scylladb@7301a96, in the function `hint_endpoint_manager::store_hint()`,
we transformed the lambda passed to `seastar::with_gate()` to a coroutine lambda
to improve the readability. However, there was a subtle problem related to
lifetimes of the captures that needed to be addressed:

* Since we started `co_await`ing in the lambda, the captures were at risk of
  being destructed too soon. The usual solution is to wrap a coroutine lambda
  within a `seastar::coroutine::lambda` object and rely on the extended lifetime
  enforced by the semantics of the language.
  See `docs/dev/lambda-coroutine-fiasco.md` for more context.
* However, since we don't immediately `co_await` the future returned by
  `with_gate()`, we cannot rely on the extended lifetime provided by the wrapper.
  The document linked in the previous bullet point suggests keeping the passed
  coroutine lambda as a variable and pass it as a reference to `with_gate()`.
  However, that's not feasible either because we discard the returned future and
  the function returns almost instantly -- destructing every local object, which
  would encompass the lambda too.

The solution used in the commit was to move captures of the lambda into
the lambda's body. That helped because Seastar's backend is responsible for
keeping all of the local variables alive until the lambda finishes its execution.
However, we didn't move all of the captures into the lambda -- the missing one
was the `this` pointer that was implicitly used in the lambda.

Address sanitiser hasn't reported any bugs related to the pointer yet, but
the bug is most likely there.

In this commit, we transform the lambda's body into a new member function
and only call it from the lambda. This way, we don't need to care about
the lifetimes of the captures because Seastar ensures that the function's
arguments stay alive until the coroutine finishes.

Choosing this solution instead of assigning `this` to a pointer variable
inside the lambda's body and using it to refer to the object's members
has actual benefit: it's not possible to accidentally forget to refer
to a member of the object via the pointer; it also makes the code less
awkward.

Fixes scylladb/scylladb#20306

Closes scylladb/scylladb#20258

* github.com:scylladb/scylladb:
  db/hints: Fix indentation in `do_store_hint()`
  db/hints: Move code for writing hints to separate function
2024-08-30 09:09:02 +02:00
Avi Kivity
bbcfd47bf5 doc: nodetool: toppartitions: document --samplers and --capacity
In particular --capacity is critical for obtaining accurate measurements.

Closes scylladb/scylladb#20192
2024-08-30 10:07:54 +03:00
Botond Dénes
9f9346fc59 Merge 'nodetool: tasks: add nodetool commands to track task manager tasks' from Aleksandra Martyniuk
Add nodetool commands to manage task manager tasks:
- tasks abort - aborts the task
- tasks list - lists all tasks in the module
- tasks modules - lists all modules
- tasks set-ttl - sets task ttl
- tasks status - gets status of the task
- tasks tree - gets statuses of the task and all its desendent's
- tasks ttl - gets task ttl
- tasks wait - waits for the task and gets its status

Fixes: https://github.com/scylladb/scylladb/issues/19201.

Closes scylladb/scylladb#19614

* github.com:scylladb/scylladb:
  test: nodetool: add tests for tasks commands
  nodetool: tasks: add nodetool commands to track task manager tasks
  api: task_manager: return status 403 if a task is not abortable
  api: task_manager: return none instead of empty task id
  api: task_manager: add timeout to wait_task
  api: task_manager: add operation to get ttl
  nodetool: add suboperations support
  nodetool: change operations_with_func type
  nodetool: prepare operation related classes for suboperations
2024-08-30 07:37:37 +03:00
Avi Kivity
d69bf4f010 cql3: introduce dialect infrastructure
A dialect is a different way to interpret the same CQL statement.

Examples:
 - how duplicate bind variable names are handled (later in this series)
 - whether `column = NULL` in LWT can return true (as is now) or
   whether it always returns NULL (as in SQL)

Currently, dialect is an empty structure and will be filled in later.
It is passed to query_processor methods that also accept a CQL string,
and from there to the parser. It is part of the prepared statement cache
key, so that if the dialect is changed online, previous parses of the
statement are ignored and the statement is prepared again.

The patch is careful to pick up the dialect at the entry point (e.g.
CQL protocol server) so that the dialect doesn't change while a statement
is parsed, prepared, and cached.
2024-08-29 21:19:23 +03:00
Avi Kivity
f9322799af cql3: prepared_statement_cache: drop cache key default constructor
It's unnecessary, and interferes with the following patch where
we change the cache key type.
2024-08-29 21:07:00 +03:00
Avi Kivity
67b24859bc Merge 'generic_server: convert connection tracking to seastar::gate' from Laszlo Ersek
~~~
generic_server: convert connection tracking to seastar::gate

If we call server::stop() right after "server" construction, it hangs:

With the server never listening (never accepting connections and never
serving connections), nothing ever calls server::maybe_stop().
Consequently,

    co_await _all_connections_stopped.get_future();

at the end of server::stop() deadlocks.

Such a server::stop() call does occur in controller::do_start_server()
[transport/controller.cc], when

- cserver->start() (sharded<cql_server>::start()) constructs a
  "server"-derived object,

- start_listening_on_tcp_sockets() throws an exception before reaching
  listen_on_all_shards() (for example because it fails to set up client
  encryption -- certificate file is inaccessible etc.),

- the "deferred_action"

      cserver->stop().get();

  is invoked during cleanup.

(The cserver->stop() call exposing the connection tracking problem dates
back to commit ae4d5a60ca ("transport::controller: Shut down distributed
object on startup exception", 2020-11-25), and it's been triggerable
through the above code path since commit 6b178f9a4a
("transport/controller: split configuring sockets into separate
functions", 2024-02-05).)

Tracking live connections and connection acceptances seems like a good fit
for "seastar::gate", so rewrite the tracking with that. "seastar::gate"
can be closed (and the returned future can be waited for) without anyone
ever having entered the gate.

NOTE: this change makes it quite clear that neither server::stop() nor
server::shutdown() must be called multiple times. The permitted sequences
are:

- server::shutdown() + server::stop()

- or just server::stop().

Fixes #10305

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
~~~

Fixes #10305.

I think we might want to backport this -- it fixes a hang-on-misconfiguration which affects `scylla-6.1.0-0.20240804.abbf0b24a60c.x86_64` minimally. Basically every release that contains commit ae4d5a60ca has a theoretical chance for the hang, and every release that contains commit 6b178f9a4a has a practical chance for the hang.

Focusing on the more practical symptom (i.e., releases containing commit 6b178f9a4a), `git tag --contains 6b178f9a4a90` gives us (ignoring candidates and release candidates):
- scylla-6.0.0
- scylla-6.0.1
- scylla-6.0.2
- scylla-6.1.0

Closes scylladb/scylladb#20212

* github.com:scylladb/scylladb:
  generic_server: make server::stop() idempotent
  generic_server: coroutinize server::shutdown()
  generic_server: make server::shutdown() idempotent
  test/generic_server: add test case
  configure, cmake: sort the lists of boost unit tests
  generic_server: convert connection tracking to seastar::gate
2024-08-29 19:45:48 +03:00
Laszlo Ersek
db44000f8d Update seastar submodule
* seastar 83e6cdfd...ec5da7a6 (1):
  > reactor, linux-aio: advise users in more detail on setting aio-max-nr

Fixes #5981

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>

Closes scylladb/scylladb#20307
2024-08-29 19:42:02 +03:00
Raphael S. Carvalho
26facd807e storage_service: avoid processing same table unnecessarily in split monitor
If there's a token metadata for a given table, and it is in split mode,
it will be registered such that split monitor can look at it, for
example, to start split work, or do nothing if table completed it.

during topology change, e.g. drain, split is stalled since it cannot
take over the state machine.
It was noticed that the log is being spammed with a message saying the
table completed split work, since every tablet metadata update, means
waking up the monitor on behalf of a table. So it makes sense to
demote the logging level to debug. That persists until drain completes
and split can finally complete.

Another thing that was noticed is that during drain, a table can be
submitted for processing faster than the monitor can handle, so the
candidate queue may end up with multiple duplicated entries for same
table, which means unnecessary work. That is fixed by using a
sequenced set, which keeps the current FIFO behavior.

Fixes #20339.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#20029
2024-08-29 19:38:43 +03:00
Aleksandra Martyniuk
1f46cad5de test: nodetool: add tests for tasks commands 2024-08-29 17:37:13 +02:00
Aleksandra Martyniuk
20fffcdcf5 nodetool: tasks: add nodetool commands to track task manager tasks 2024-08-29 17:37:12 +02:00
Avi Kivity
7da3314deb Merge 'Integrated restore' from Ernest Zaslavsky
Handed over from https://github.com/scylladb/scylladb/pull/20149

This adds minimal implementation of the start-restore API call.

The method starts a task that runs load-and-stream functionality against sstables from S3 bucket. Arguments are:

```
endpoint -- the ID in object_store.yaml config file
bucket -- the target bucket to get objects from
keyspace -- the keyspace to work on
table -- the table to work on
snapshot -- the name of the snapshot from which the backup was taken
```
The task runs in the background, its task_id is returned from the method once it's spawned and it should be used via /task_manager API to track the task execution and completion.

Remote sstables components are scanned as if they were placed in local upload/ directory. Then colelcted sstables are fed into load-and-stream.

This branch has https://github.com/scylladb/scylladb/pull/19890 (Integrated backup), https://github.com/scylladb/scylladb/pull/20120 (S3 lister) and few more minor PRs merged in. The restore branch itself starts with [utils: Introduce abstract (directory) lister](29c867b54d) commit.

refs: https://github.com/scylladb/scylladb/issues/18392

Closes scylladb/scylladb#20305

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: add restore integration
  test/object_store: Add simple restore test
  test/object_store: Generalize prepare_snapshot_for_backup()
  code: Introduce restore API method
  sstable_loader: Add sstables::storage_manager dependency
  sstable_loader: Maintain task manager module
  sstable_loader: Out-line constructor
  distributed_loader: Split get_sstables_from_upload_dir()
  sstables/storage: Compose uploaded sstable path simpler
  sstable_directory: Prepare FS lister to scan files on S3
  sstable_directory: Parse sstable component without full path
  s3-client: Add support for lister::filter
  utils: Introduce abstract (directory) lister
2024-08-29 18:25:30 +03:00
Kamil Braun
9574c399ce Merge 'add support for zero-token nodes' from Patryk Jędrzejczak
We revive the `join_ring` option. We support it only in the
Raft-based topology, as we plan to remove the gossip-based topology
when we fix the last blocker - the implementation of the manual
recovery tool. In the Raft-based topology, a node can be assigned
tokens only once when it joins the cluster. Hence, we disallow
joining the ring later, which is possible in Cassandra.

The main idea behind the solution is simple. We make the unsupported
special case of zero tokens a supported normal case. Nodes with zero
tokens assigned are called "zero-token nodes" from now on.

From the topology point of view, zero-token nodes are the same as
token-owning nodes. They can be in the same states, etc. From the
data point of view, they are different. They are not members of
the token ring, so they are not present in
`token_metadata::_normal_token_owners`. Hence, they are ignored in
all non-local replication strategies. The tablet load balancer also
ignores them.

Zero-token nodes can be used as coordinator-only nodes, just like in
Cassandra. They can handle requests just like token-owning nodes.

The main motivation behind zero-token nodes is that they can prevent
the Raft majority loss efficiently. Zero-token nodes are group 0
voters, but they can run on much weaker and cheaper machines because
they do not replicate data and handle client requests by default
(drivers ignore them). For example, if there are two DCs, one with 4
nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes,
every DC will contain less than half of the nodes, so we won't lose
the majority when any DC dies.

Another way of preventing the Raft majority loss is changing the
voter set, which is tracked by scylladb/scylladb#18793. That approach
can be used together with zero-token nodes. In the example above, if
we choose equal numbers of voters in both DCs, then a DC with one
zero-token node will be sufficient. However, in the typical setup of
2 DCs with the same number of nodes it is enough to add a DC with
only one zero-token node without changing the voter set.

Zero-token nodes could also be used as load balancers in the
Alternator.

Additionally, this PR fixes scylladb/scylladb#11087, which turned out to
be a blocker.

This PR introduced a new feature. There is no need to backport it.

Fixes scylladb/scylladb#6527
Fixes scylladb/scylladb#11087
Fixes scylladb/scylladb#15360

Closes scylladb/scylladb#19684

* github.com:scylladb/scylladb:
  docs: raft: document using zero-token nodes to prevent majority loss
  test: test recovery mode in the presence of zero-token nodes
  test: topology: util.py: add cqls parameter to check_system_topology_and_cdc_generations_v3_consistency
  test: topology: util.py: accept zero tokens in check_system_topology_and_cdc_generations_v3_consistency
  treewide: support zero-token nodes in the recovery mode
  storage_proxy: make TRUNCATE work locally for local tables
  test: topology: util.py: document that check_token_ring_and_group0_consistency fails with zero-token nodes
  test: test zero-token nodes
  test: test_topology_ops: move helpers to topology/util.py
  feature_service: introduce the ZERO_TOKEN_NODES feature
  storage_service: rename join_token_ring to join_topology
  storage_service: raft_topology_cmd_handler: improve warnings
  topology_coordinator: fix indentation after the previous patch
  treewide: introduce support for zero-token nodes in Raft topology
  system_keyspace: load_topology_state: remove assertion impossible to hit
  treewide: distinguish all nodes from all token owners
  gossip topology: make a replacing node remove the replaced node from topology
  locator: topology: add_or_update_endpoint: use none as the default node state
  test: boost: tablets tests: ensure all nodes are normal token owners
  token_metadata: rename get_all_endpoints and get_all_ips
  network_topology_strategy: reallocate_tablets: remove unused dc_rack_nodes
  virtual_tables: cluster_status_table: execute: set dc regardless of the token ownership
2024-08-29 16:26:21 +02:00
Gleb Natapov
32a59ba98f topology coordinator: fix indentation after the last patch 2024-08-29 17:14:09 +03:00
Gleb Natapov
17f4a151ce topology coordinator: do not add replacing node without a ring to topology
When only inter dc encryption is enabled a non encrypted connection
between two nodes is allowed only if both nodes are in the same dc.
If a nodes that initiates the connection knows that dst is in the same
dc and hence use non encrypted connection, but the dst not yet knows the
topology of the src such connection will not be allowed since dst cannot
guaranty that dst is in the same dc.

Currently, when topology coordinator is used, a replacing node will
appear in the coordinator's topology immediately after it is added to the
group0. The coordinator will try to send raft message to the new node
and (assuming only inter dc encryption is enabled and replacing node and
the coordinator are in the same dc) it will try to open regular, non encrypted,
connection to it. But the replacing node will not have the coordinator
in it's topology yet (it needs to sync the raft state for that). so it
will reject such connection.

To solve the problem the patch does not add a replacing node that was
just added to group0 to the topology. It will be added later, when
tokens will be assigned to it. At this point a replacing node will
already make sure that its topology state is up-to-date (since it will
execute a raft barrier in join_node_response_params handler) and it knows
coordinator's topology. This aligns replace behaviour with bootstrap
since bootstrap also does not add a node without a ring to the topology.

The patch effectively reverts b8ee8911ca

Fixes: scylladb/scylladb#19025
2024-08-29 17:14:09 +03:00
Gleb Natapov
2f1b1fd45e test: add test for replace in clusters with encryption enabled 2024-08-29 17:14:09 +03:00
Gleb Natapov
b98282a976 test.py: add server encryption support to cluster manager 2024-08-29 17:14:09 +03:00
Gleb Natapov
84757a4ed3 .gitignore: fix pattern for resources to match only one specific directory 2024-08-29 17:13:58 +03:00
Dawid Medrek
d459cf91eb db/hints: Fix indentation in do_store_hint() 2024-08-29 14:47:08 +02:00
Dawid Medrek
75ce6943d0 db/hints: Move code for writing hints to separate function
In scylladb/scylladb@7301a96, in the function `hint_endpoint_manager::store_hint()`,
we transformed the lambda passed to `seastar::with_gate()` to a coroutine lambda
to improve the readability. However, there was a subtle problem related to
lifetimes of the captures that needed to be addressed:

* Since we started `co_await`ing in the lambda, the captures were at risk of
  being destructed too soon. The usual solution is to wrap a coroutine lambda
  within a `seastar::coroutine::lambda` object and rely on the extended lifetime
  enforced by the semantics of the language.
  See `docs/dev/lambda-coroutine-fiasco.md` for more context.

* However, since we don't immediately `co_await` the future returned by
  `with_gate()`, we cannot rely on the extended lifetime provided by the wrapper.
  The document linked in the previous bullet point suggests keeping the passed
  coroutine lambda as a variable and pass it as a reference to `with_gate()`.
  However, that's not feasible either because we discard the returned future and
  the function returns almost instantly -- destructing every local object, which
  would encompass the lambda too.

The solution used in the commit was to move captures of the lambda into
the lambda's body. That helped because Seastar's backend is responsible for
keeping all of the local variables alive until the lambda finishes its execution.
However, we didn't move all of the captures into the lambda -- the missing one
was the `this` pointer that was implicitly used in the lambda.

Address sanitiser hasn't reported any bugs related to the pointer yet, but
the bug is most likely there.

In this commit, we transform the lambda's body into a new member function
and only call it from the lambda. This way, we don't need to care about
the lifetimes of the captures because Seastar ensures that the function's
arguments stay alive until the coroutine finishes.

Choosing this solution instead of assigning `this` to a pointer variable
inside the lambda's body and using it to refer to the object's members
has actual benefit: it's not possible to accidentally forget to refer
to a member of the object via the pointer; it also makes the code less
awkward.
2024-08-29 14:47:02 +02:00
Aleksandra Martyniuk
627fc46ca7 api: task_manager: return status 403 if a task is not abortable 2024-08-29 13:53:40 +02:00
Aleksandra Martyniuk
10ab60f32b api: task_manager: return none instead of empty task id
If a user requests a status of a task that does not have a parent,
show "none" instead of an empty parent_id.
2024-08-29 13:53:40 +02:00
Aleksandra Martyniuk
5bcff4d544 api: task_manager: add timeout to wait_task 2024-08-29 13:53:40 +02:00
Aleksandra Martyniuk
3d78172328 api: task_manager: add operation to get ttl 2024-08-29 13:53:39 +02:00
Aleksandra Martyniuk
fb160afaf6 nodetool: add suboperations support
Modify nodetool methods so that it support suboperations.
2024-08-29 13:53:39 +02:00
Aleksandra Martyniuk
4b96f9abb9 nodetool: change operations_with_func type
Change the type of operations_with_func so that they can contain
suboperations.
2024-08-29 13:53:39 +02:00
Aleksandra Martyniuk
c6f8a0116a nodetool: prepare operation related classes for suboperations
Modify operation and add operation_action class so that information
about suboperations is stored. It's a preparation for adding
suboperations support to nodetool.
2024-08-29 13:53:39 +02:00
Kefu Chai
dbb056f4f7 build: cmake: point -ffile-prefix-map to build directory
before this change, we included `-ffile-prefix-map=${CMAKE_SOURCE_DIR}=.`
in cflags when building the tree with CMake, but this was wrong.
as the "." directory is the build directory used by CMake. and this
directory is specified by the `-B` option when generating the building
system. if `configure.py --use-cmake` is used to build the tree,
the build directory would be "build". so this option instructs the compiler
to replace the directory of source file in the debug symbols and in
`__FILE__` at compile time.

but, in a typical workspace, for instance, `build/main.cc` does not exist.
the reason why this does not apply to CMake but applies to the rules
generated by `configure.py` is that, `configure.py` puts the generated
`build.ninja` right under the top source directory, so `.` is correct and
it helps to create reproducible builds. because this practically erases
the path prefixes in the build output. while CMake puts it under the
specified build directory, replacing the source directory with the build
directory with the file prefix map is just wrong.

there are two options to address this problem:

* stop passing this option. but this would lead to non-reproducible
  builds. as we would encode the build directory in the "scylla"
  executable. if a developer needs to rebuild an executable for debugging
  a coredump generated in production, he/she would have to either
  build the tree in the same directory as our CI does. or, he/she
  has to pass `-ffile-prefix-map=...` to map the local build directory
  to the one used by CI. this is not convenient.
* instead of using `${CMAKE_SOURCE_DIR}=.`, add `${CMAKE_BINARY_DIR}=.`.
  this erases the build directory in the outputs, but preserves the
  debuggability.

so we pick the second solution.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20329
2024-08-29 12:28:11 +03:00
Patryk Jędrzejczak
c192a9ee3b docs: raft: document using zero-token nodes to prevent majority loss 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
e027ffdffc test: test recovery mode in the presence of zero-token nodes
We modify existing tests to verify that the recovery mode works
correctly in the presence of zero-token nodes.

In `test_topology_recovery_basic`, we test the case when a
zero-token node is live. In particular, we test that the
gossip-based restart of such a node works.

In `test_topology_recovery_after_majority_loss`, we test the case
when zero-token nodes are unrecoverable. In particular, we test
that the gossip-based removenode of such nodes works.

Since zero-token nodes are ignored by the Python driver if it also
connects to other nodes, we use different CQL sessions for a
zero-token node in `test_topology_recovery_basic`.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
fb1e060c4c test: topology: util.py: add cqls parameter to check_system_topology_and_cdc_generations_v3_consistency
In the following commit, we modify `test_topology_recovery_basic`
to test the recovery mode in the presence of live zero-token nodes.
Unfortunately, it requires a bit ugly workaround. Zero-token nodes
are ignored by the Python driver if it also connects to other
nodes because of empty tokens in the `system.peers` table. In that
test, we must connect to a zero-token node to enter the recovery
mode and purge the Raft data. Hence, we use different CQL sessions
for different nodes.

In the future, we may change the Python driver behavior and revert
this workaround. Moreover, the recovery tests will be removed or
significantly changed when we implement the manual recovery tool.
Therefore, we shouldn't worry about this workaround too much.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
54905fc179 test: topology: util.py: accept zero tokens in check_system_topology_and_cdc_generations_v3_consistency
Before we use `check_system_topology_and_cdc_generations_v3_consistency`
in a test with a zero-token node, we must ensure it doesn't fail
because of zero tokens in a row of the `system.topology` table.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
02bb70da19 treewide: support zero-token nodes in the recovery mode
Before we implement the manual recovery tool, we must support
zero-token nodes in the recovery mode. This means that two topology
operations involving zero-token nodes must work in the gossip-based
topology:
- removing a dead zero-token node,
- restarting a live zero-token node.
We make changes necessary to make them work in this patch.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
87b415efdc storage_proxy: make TRUNCATE work locally for local tables
In on of the following patches, we implement support for zero-token
nodes in the recovery mode. To achieve this, we need to be able to
purge all Raft data on live zero-token nodes by using TRUNCATE.
Currently, TRUNCATE works the same for all replication strategies - it
is performed on all token owners. However, zero-token nodes are not
token owners, so TRUNCATE would ignore them. Since zero-token nodes
store only local tables, fixing scylladb/scylladb#11087 is the perfect
solution for the issue with zero-token nodes. We do it in this patch.

Fixes scylladb/scylladb#11087
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
21c8409fa4 test: topology: util.py: document that check_token_ring_and_group0_consistency fails with zero-token nodes 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
95e14ae44b test: test zero-token nodes
We add tests to verify the basic properties of zero-token nodes.

`test_zero_token_nodes_no_replication` and
`test_not_enough_token_owners` are more or less deterministic tests.
Running them only in the dev mode is sufficient.

`test_zero_token_nodes_topology_ops` is quite slow, as expected,
considering parameterization and the number of topology operations.
In the future we can think of making it faster or skipping in the
debug mode. For now, our priority is to test zero-token nodes
thoroughly.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
d43d67c525 test: test_topology_ops: move helpers to topology/util.py
In one of the following patches, we reuse the helper functions from
`test_topology_ops` in a new test, so we move them to `util.py`.

Also, we add the `cl` parameter to `start_writes`, as the new test
will use `cl=2`.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
574c252391 feature_service: introduce the ZERO_TOKEN_NODES feature
Zero-token nodes must be supported by all nodes in the cluster.
Otherwise, the non-supporting nodes would crash on some assertion
that assumes only token-owing normal nodes make sense.

Hence, we introduce the ZERO_TOKEN_NODES cluster feature. Zero-token
nodes refuse to boot if it is not supported.

I tested this patch manually. First, I booted a node built in the
previous patch. Then, I tried to add a zero-token node built in this
patch. It refused to boot as expected.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
c25eefe217 storage_service: rename join_token_ring to join_topology
After introducing zero-token nodes that call join_token_ring but do
not join the ring, the join_token_ring name does not make much sense.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
9937cf3a24 storage_service: raft_topology_cmd_handler: improve warnings 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
3ce936da7b topology_coordinator: fix indentation after the previous patch 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
22d907e721 treewide: introduce support for zero-token nodes in Raft topology
We revive the `join_ring` option. We support it only in the
Raft-based topology, as we plan to remove the gossip-based topology
when we fix the last blocker - the implementation of the manual
recovery tool. In the Raft-based topology, a node can be assigned
tokens only once when it joins the cluster. Hence, we disallow
joining the ring later, which is possible in Cassandra.

The main idea behind the solution is simple. We make the unsupported
special case of zero tokens a supported normal case. Nodes with zero
tokens assigned are called "zero-token nodes" from now on.

From the topology point of view, zero-token nodes are the same as
token-owning nodes. They can be in the same states, etc. From the
data point of view, they are different. They are not members of
the token ring, so they are not present in
`token_metadata::_normal_token_owners`. Hence, they are ignored in
all non-local replication strategies. The tablet load balancer also
ignores them.

Topology operations involving zero-token nodes are simplified:
- `add` and `replace` finish in the `join_group0` state, so creating
a new CDC generation and streaming are skipped,
- `removenode` and `decommission` skip streaming,
- `rebuild` does not even contact the topology coordinator as there
is nothing to rebuild,

Also, if the topology operation involves a token-owning node,
zero-token nodes are ignored in streaming.

Zero-token nodes can be used as coordinator-only nodes, just like in
Cassandra. They can handle requests just like token-owning nodes.

The main motivation behind zero-token nodes is that they can prevent
the Raft majority loss efficiently. Zero-token nodes are group 0
voters, but they can run on much weaker and cheaper machines because
they do not replicate data and handle client requests by default
(drivers ignore them). For example, if there are two DCs, one with 4
nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes,
every DC will contain less than half of the nodes, so we won't lose
the majority when any DC dies.

Another way of preventing the Raft majority loss is changing the
voter set, which is tracked by scylladb/scylladb#18793. That approach
can be used together with zero-token nodes. In the example above, if
we choose equal numbers of voters in both DCs, then a DC with one
zero-token node will be sufficient. However, in the typical setup of
2 DCs with the same number of nodes it is enough to add a DC with
only one zero-token node without changing the voter set.

Zero-token nodes could also be used as load balancers in the
Alternator.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
ba016c9af7 system_keyspace: load_topology_state: remove assertion impossible to hit
We store tokens in a non-frozen set, which doesn't distinguish an
empty set from no value. Hence, hitting this assertion is impossible.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
ed55261650 treewide: distinguish all nodes from all token owners
In one of the following patches, we introduce support for zero-token
nodes. From that point, getting all nodes and getting all token
owners isn't equivalent. In this patch, we ensure that we consider
only token owners when we want to consider only token owners (for
example, in the replication logic), and we consider all nodes when
we want to consider all nodes (for example, in the topology logic).

The main purpose of this patch is to make the PR introducing
zero-token nodes easier to review. The patch that introduces
zero-token nodes is already complicated. We don't want trivial
changes from this patch to make noise there.

This patch introduces changes needed for zero-token nodes only in the
Raft-based topology and in the recovery mode. Zero-token nodes are
unsupported in the gossip-based topology outside recovery.

Some functions added to `token_metadata` and `topology` are
inefficient because they compute a new data structure in every call.
They are never called in the hot path, so it's not a serious problem.
Nevertheless, we should improve it somehow. Note that it's not
obvious how to do it because we don't want to make `token_metadata`
store topology-related data. Similarly, we don't want to make
`topology` store token-related data. We can think of an improvement
in a follow-up.

We don't remove unused `topology::get_datacenter_rack_nodes` and
`topology::get_datacenter_nodes`. These function can be useful in the
future. Also, `topology::_dc_nodes` is used internally in `topology`.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
2d9575d6a9 gossip topology: make a replacing node remove the replaced node from topology
In the following patch, we change the gossiper to work the same for
zero-token nodes and token-owning nodes. We replace occurrences of
`is_normal_token_owner` with topology-based conditions. We want to
rely on the invariant that token-owning nodes own tokens if and only
if they are in the normal or leaving state. However, this invariant
is broken by a replacing node because it does not remove the
replaced node from topology. Hence, after joining, the replacing node
has topology with a node that is not a token owner anymore but is
in a leaving state (`being_replaced`). We fix it to prevent the
following patch from introducing a regression.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
c7016dedb3 locator: topology: add_or_update_endpoint: use none as the default node state
In one of the following patches, we change the gossiper to work the
same for zero-token nodes and token-owning nodes. We replace
occurrences of `is_normal_token_owner` with topology-based
conditions. We want to rely on the invariant that token-owning nodes
own tokens if and only if they are in the normal or leaving state.
However, this invariant can be broken in the gossip-based topology
when a new node joins the cluster. When a boostrapping node starts
gossiping, other nodes add it to their topology in
`storage_service::on_alive`. Surprisingly, the state of the new node
is set to `normal`, as it's the default value used by
`add_or_update_endpoint`. Later, the state will be set to
`bootstrapping` or `replacing`, and finally it will be set again to
`normal` when the join operation finishes. We fix this strange
behavior by setting the node state to `none` in
`storage_service::on_alive` for nodes not present in the topology.
Note that we must add such nodes to the topology. Other code needs
their Host ID, IP, and location.

We change the default node state from `normal` to `none` in
`add_or_update_endpoint` to prevent bugs like the one in
`storage_service::on_alive`. Also, we ensure that nodes in the `none`
state are ignored in the getters of `locator::topology`.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
6adaf85634 test: boost: tablets tests: ensure all nodes are normal token owners
In one of the following patches, we make NetworkTopologyStrategy
and the tablet load balancer consider only normal token owners to
ensure they ignore zero-token nodes. Some unit tests would start
failing after this change because they do not ensure that all
nodes are normal token owners. This patch prevents it.

Judging by the logic in the test cases in
`network_topology_strategy_test`, `point++` was probably intended
anyway.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
366605224c token_metadata: rename get_all_endpoints and get_all_ips
In one of the following patches, we introduce support for zero-token
nodes. A zero-token node that has successfully joined the cluster is
in the normal state but is not a normal token owner. Hence, the names
of `get_all_endpoints` and `get_all_ips` become misleading. They
should specify that the functions return only IDs/IPs of token owners.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
293a66fe41 network_topology_strategy: reallocate_tablets: remove unused dc_rack_nodes 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
4ff08decb8 virtual_tables: cluster_status_table: execute: set dc regardless of the token ownership
If a node is in `locator::topology`, then it has a location.
We remove the token ownership condition to make the table more
descriptive.
2024-08-29 10:37:06 +02:00
Kefu Chai
ecfe0aace6 perf: perf_mutation_readers: break memtable class down
before this change, memtable serves as the fixture for 6 test cases,
actually these 6 test cases can be categorized into a matrix of 3 x 2:
{ single_row, multi_row, large_partition } x { single_partition, multi_paritition }.

in this change, we break memtable into 3 different fixtures, to reflect
this fact. more readable this way. and a benefit is that each test does
not have to pay for the overhead of setup it does not use at all.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20177
2024-08-29 08:54:17 +03:00
Botond Dénes
e538e3593c Merge 'build: add --no-use-cmake option to configure.py' from Kefu Chai
as part of the efforts to address scylladb/scylladb#2717, we are
switching over to the CMake-based building system, and fade out the
mechinary to create the rules manually in `configure.py`.

in this change, we add `--no-use-cmake` to `configure.py`, it serves
two purposes:

* prepare for the change which enables cmake by default, by then,
  we would set the default value of `use_cmake` to True, and allow
  user to keep using the existing mechinary in the transition period
  using `--no-use-cmake`.
* allows the CI to tell if a tree is able to build with CMake.
  the command line option of `--use-cmake` is also used by the CI
  workflows, and is passed to `configure.py` if `BUILD_WITH_CMAKE`
  jenkins pipeline parameter is set. but not all branches with
  `--use-cmake` are ready to build with CMake -- only the latest
  master HEAD is ready. so the CI needs to check the capability of
  building with CMake by looking at the output of `configure.py --help`,
  to see if it includes `--no-use-cmake`.
  after this change lands. we will remove the `BUILD_WITH_CMAKE`
  parameter, and use cmake as long as `configure.py` supports
  `--no-use-cmake` option.

the existing mechinary will stay with us for a short transition
period so that developers can take time to get used to the
usage of the naming of targets and the new  directory arrangement.

as a side effect, #20079 will be fixed after switching to CMake.

---

this is a cmake-related change, hence no need to backport.

Closes scylladb/scylladb#20261

* github.com:scylladb/scylladb:
  build: add --no-use-cmake option to configure.py
  build: let configure.py fail if unknown option is passed to it
2024-08-29 08:51:41 +03:00
Kefu Chai
a182bfd96a tools/read_mutation: reuse parse_table_directory_name()
less repeatings this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20315
2024-08-29 08:49:20 +03:00
Nadav Har'El
6391550bbc test/alternator: add another check to test_stream_list_tables
The test test_streams.py::test_stream_list_tables reproduces a bug where
enabling streams added a spurious result to ListTables. A reviewer of
that patch asked to also add a check that name of the table itself
doesn't disappear from ListTables when a stream is enabled, so this is
what this patch adds.

This theoretical scenario (a table's name disappearing from ListTables)
never happened, so the new check doesn't reproduce any known bug, but
I guess it never hurts to make the test stronger for regression testing.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19934
2024-08-29 08:45:22 +03:00
Nadav Har'El
61e5927e8e repair: fix build on older compilers
The code tries to build as "neighbors" an unordered_map from an iterator
of std::tuple, instead of the correct std::pair. Apparently, the tuples
are transparently converted to pairs on the newest compilers and the
whole works, but on slightly older compilers (like the one on Fedora 39)
Scylla no longer compiles - the compiler complains it can't convert a
tuple to a pair in this context.

So fix the code to use pairs, not tuples, and it fixes the build on
Fedora 39.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#20319
2024-08-28 19:56:03 +03:00
Laszlo Ersek
49bff3b1ab generic_server: make server::stop() idempotent
After server::shutdown(), make server::stop() more robust too, by allowing
callers (internal or external) to call it several times (not concurrently
though, just yet; see
<https://github.com/scylladb/scylladb/issues/20309>).

Suggested-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-28 15:54:31 +02:00
Kefu Chai
03ab80501f tools/scylla-nodetool: add restore integration
as we have an API for restore a keyspace / table, let's expose this feature
with nodetool. so we can exercise it without the help of scylla-manager
or 3rd-party tools with a user-friendly interface.

in this change:

* add a new subcommand named "restore" to nodetool
* add test to verify its interaction with the API server
* update the document accordingly.
* the bash completion script is updated accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-28 15:42:49 +03:00
Pavel Emelyanov
41b9eda398 test/object_store: Add simple restore test
The test shows how to restore previously backed up table:

- backup
- truncate to get rid of existing sstables
- start restore with the new API method
- wait for the task to finish

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-28 15:42:49 +03:00
Pavel Emelyanov
f5a22a94c6 test/object_store: Generalize prepare_snapshot_for_backup()
Give it snapshot-name argument. Next test will want custom snapshot
name.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-28 15:42:49 +03:00
Pavel Emelyanov
11a04bfb66 code: Introduce restore API method
The method starts a task that uses sstables_loader load-and-stream
functionality to bring new sstables into the cluster. The existing
load-and-stream picks up sstables from upload/ directory, the newly
introduced task collects them from S3 bucket and given prefix (that
correspond to the path where backup API method put them).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-28 15:42:49 +03:00
Sergey Zolotukhin
65f37f3ba6 Ignore seed name resolution errors on restart.
Gossiper seeds host name resolution failures are ignored during restart if
a node is already boostrapped (i.e. it has successfully joined the cluster).

Fixes scylladb/scylladb#14945
2024-08-28 14:01:04 +02:00
Patryk Jędrzejczak
08cb3a5e2c test: test_raft_recovery_basic: add raft=trace logs
It could help when we hit scylladb/scylladb#17918 again.

This PR only changes log levels in a test, no need to backport it.

Refs scylladb/scylladb#17918

Closes scylladb/scylladb#20318
2024-08-28 13:50:09 +02:00
Sergey Zolotukhin
fc5e683d02 Add a test for starting with a wrong seed.
The test checks a bootstrapped node start with a wrong host name in the
seeds config.

Test for scylladb/scylladb#14945
2024-08-28 11:34:37 +02:00
Laszlo Ersek
1138347e7e generic_server: coroutinize server::shutdown()
By turning server::shutdown() into a coroutine, we need not dynamically
allocate "nr_conn".

Verified as follows:

(1) In terminal #1:

    build/Dev/scylla --overprovisioned --developer-mode=yes \
        --memory=2G --smp=1 --default-log-level error \
        --logger-log-level cql_server=debug:cql_server_controller=debug

> INFO  [...] cql_server_controller - Starting listening for CQL clients
>                                     on 127.0.0.1:9042 (unencrypted,
>                                     non-shard-aware)
> INFO  [...] cql_server_controller - Starting listening for CQL clients
>                                     on 127.0.0.1:19042 (unencrypted,
>                                     shard-aware)

(2) In terminals #2 and #3:

    tools/cqlsh/bin/cqlsh.py

(3) Press ^C in terminal #1:

> DEBUG [...] cql_server - abort accept nr_total=2
> DEBUG [...] cql_server - abort accept 1 out of 2 done
> DEBUG [...] cql_server - abort accept 2 out of 2 done
> DEBUG [...] cql_server - shutdown connection nr_total=4
> DEBUG [...] cql_server - shutdown connection 1 out of 4 done
> DEBUG [...] cql_server - shutdown connection 2 out of 4 done
> DEBUG [...] cql_server - shutdown connection 3 out of 4 done
> DEBUG [...] cql_server - shutdown connection 4 out of 4 done
> INFO  [...] cql_server_controller - CQL server stopped

This patch is best viewed with "git show --word-diff=color".

Suggested-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-28 10:59:44 +02:00
Laszlo Ersek
2216275ebd generic_server: make server::shutdown() idempotent
Make server::shutdown() more robust by allowing callers (internal or
external) to call it several times (not concurrently though, just yet; see
<https://github.com/scylladb/scylladb/issues/20309>).

Suggested-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-28 10:59:44 +02:00
Laszlo Ersek
dbc0ca6354 test/generic_server: add test case
Check whether we can stop a generic server without first asking it to
listen.

The test fails currently; the failure mode is a hang, which triggers the 5
minute timeout set in the test:

> unknown location(0): fatal error: in "stop_without_listening":
> seastar::timed_out_error: timedout
> seastar/src/testing/seastar_test.cc(43): last checkpoint
> test/boost/generic_server_test.cc(34): Leaving test case
> "stop_without_listening"; testing time: 300097447us

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-28 10:59:44 +02:00
Laszlo Ersek
931f2f8d73 configure, cmake: sort the lists of boost unit tests
Both lists were obviously meant to be sorted originally, but by today
we've introduced many instances of disorder -- thus, inserting a new test
in the proper place leaves the developer scratching their head. Sort both
lists.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-28 10:59:44 +02:00
Laszlo Ersek
5a04743663 generic_server: convert connection tracking to seastar::gate
If we call server::stop() right after "server" construction, it hangs:

With the server never listening (never accepting connections and never
serving connections), nothing ever calls server::maybe_stop().
Consequently,

    co_await _all_connections_stopped.get_future();

at the end of server::stop() deadlocks.

Such a server::stop() call does occur in controller::do_start_server()
[transport/controller.cc], when

- cserver->start() (sharded<cql_server>::start()) constructs a
  "server"-derived object,

- start_listening_on_tcp_sockets() throws an exception before reaching
  listen_on_all_shards() (for example because it fails to set up client
  encryption -- certificate file is inaccessible etc.),

- the "deferred_action"

      cserver->stop().get();

  is invoked during cleanup.

(The cserver->stop() call exposing the connection tracking problem dates
back to commit ae4d5a60ca ("transport::controller: Shut down distributed
object on startup exception", 2020-11-25), and it's been triggerable
through the above code path since commit 6b178f9a4a
("transport/controller: split configuring sockets into separate
functions", 2024-02-05).)

Tracking live connections and connection acceptances seems like a good fit
for "seastar::gate", so rewrite the tracking with that. "seastar::gate"
can be closed (and the returned future can be waited for) without anyone
ever having entered the gate.

NOTE: this change makes it quite clear that neither server::stop() nor
server::shutdown() must be called multiple times. The permitted sequences
are:

- server::shutdown() + server::stop()

- or just server::stop().

Fixes #10305

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-28 10:59:44 +02:00
Kefu Chai
6d8dca1e20 build: add --no-use-cmake option to configure.py
as part of the efforts to address scylladb/scylladb#2717, we are
switching over to the CMake-based building system, and fade out the
mechinary to create the rules manually in `configure.py`.

in this change, we add `--no-use-cmake` to `configure.py`, it serves
two purposes:

* prepare for the change which enables cmake by default, by then,
  we would set the default value of `use_cmake` to True, and allow
  user to keep using the existing mechinary in the transition period
  using `--no-use-cmake`.
* allows the CI to tell if a tree is able to build with CMake.
  the command line option of `--use-cmake` is also used by the CI
  workflows, and is passed to `configure.py` if `BUILD_WITH_CMAKE`
  jenkins pipeline parameter is set. but not all branches with
  `--use-cmake` are ready to build with CMake -- only the latest
  master HEAD is ready. so the CI needs to check the capability of
  building with CMake by looking at the output of `configure.py --help`,
  to see if it includes --no-use-cmake`.
  after this change lands. we will remove the `BUILD_WITH_CMAKE`
  parameter, and use cmake as long as `configure.py` supports
  `--no-use-cmake` option.

the existing mechinary will stay with us for a short transition
period so that developers can take time to get used to the
usage of the naming of targets and the new  directory arrangement.

as a side effect, #20079 will be fixed after switching to CMake.

Refs scylladb/scylladb#2717
Refs scylladb/scylladb#20079

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-28 11:37:56 +08:00
Kefu Chai
a2de14be7f build: let configure.py fail if unknown option is passed to it
this allows us to use `configure.py` to tell if a certain argument is supported
without parsing its output. in the next commit, we will add `--no-use-cmake` option,
which will be used to tell if the tree is ready for using CMake for its building
system.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-28 11:37:55 +08:00
Kefu Chai
e4b213f041 build: cmake: use the same options to configure seastar
in `configure.py`, a set of options are specified when configuring
seastar, but not all of them were ported to scylla's CMake building
system.

for instance, `configure.py` explicitly disables io_uring reactor
backend at build time, but the CMake-based system does not.

so, in this change, in order to preserve the existing behavior, let's
port the two previously missing option to CMake-based building system
as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20288
2024-08-28 06:15:59 +03:00
Avi Kivity
94d5507237 Merge 'select from mutation_fragments() + tablets: handle reads for non-owned partitions' from Botond Dénes
Attempting to read a partition via `SELECT * FROM MUTATION_FRAGMENTS()`, which the node doesn't own, from a table using tablets causes a crash.
This is because when using tablets, the replica side simply doesn't handle requests for un-owned tokens and this triggers a crash.
We should probably improve how this is handled (an exception is better than a crash), but this is outside the scope of this PR.
This PR fixes this and also adds a reproducer test.

Fixes: https://github.com/scylladb/scylladb/issues/18786

Fixes a regression introduced in 6.0, so needs backport to 6.0 and 6.1

Closes scylladb/scylladb#20109

* github.com:scylladb/scylladb:
  test/tablets: Test that reading tablets' mutations from MUTATION_FRAGMENTS works
  replica/mutation_dump: enfore pinning of effective replication map
  replica/mutation_dump: handle un-owned tokens (with tablets)
2024-08-27 20:46:10 +03:00
Avi Kivity
b13ab90448 Merge 'alternator/executor: Use native reversed format' from Łukasz Paszkowski
When executing reversed queries, a native revered format shall be used. Therefore, the table schema and the clustering key bounds are reversed before a partition slice and a read command are constructed.

It is, however, possible to run a reversed query passing a table schema but only when there are no restrictions on the clustering keys. In this particular situation, the query returns correct results. Since the current alternator tests in test.py do not imply any restrictions, this situation was not caught during development of https://github.com/scylladb/scylladb/pull/18864.

Hence, additional tests are provided that add clustering keys restrictions when executing reversed queries to capture such errors earlier than in dtests.

Additional manual tests were performed to test a mixed-node cluster (with alternator API enabled in Scylla on each node):

1. 2-node cluster with one node upgraded: reverse read queries performed on an old node
2. 2-node cluster with one node upgraded: reverse read queries performed on a new node
3. 2-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on an old node
4. 2-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on a new node

All reverse read queries above consists of:

- single-partition reverse reads with no clustering key restrictions, with single column restrictions and multi column restrictions both with and without paging turned on

The exact same tests were also performed on a fully upgraded cluster.

Fixes https://github.com/scylladb/scylladb/issues/20191

No backport is required as this is a complementary patch for the series https://github.com/scylladb/scylladb/pull/18864 that did not require backporting.

Closes scylladb/scylladb#20205

* github.com:scylladb/scylladb:
  test_query.py: Test reverse queries with clustering key bounds
  alternator::do_query Add additional trace log
  alternator::do_query: Use native reversed format
  alternator::do_query Rename schema with table_schema
2024-08-27 20:40:49 +03:00
Benny Halevy
18c45f7502 raft_rebuild: propagate source_dc force option to rebuild_option
Currently, the `force` property of the `source_dc` rebuild option
is lost and `raft_topology_cmd_handler` has no way to know
if it was given or not.

This in turn can cause rebuild to fail, even when `--force`
is set by the user, where it would succeed with gossip
topology changes, based on the source_dc --force semantics.

Fixes scylladb/scylladb#20242

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20249
2024-08-27 17:05:48 +02:00
Kefu Chai
d27fdf9f57 Update seastar submodule
* seastar a7d81328...83e6cdfd (29):
  > fair_queue: Export the number of times class was activated
  > tests/unit: drop support of C++17
  > remove vestigial OSv support
  > cmake: undefine _FORTIFY_SOURCE on thread.cc
  > container_perf: a benchmark for container perf
  > io_sink: use chunked_fifo as _pending_io container
  > chunked_fifo: implement clear in terms of pop_n
  > chunked_fifo: pop_front_n
  > io_sink: use iteration instead of indexing
  > json2code_test: choose less popular port number
  > ioinfo: add '--max-reqsize' parameter
  > treewide: drop the support of fmtlib < 8.0.0
  > build: bump up the required fmtlib version to 8.1.1
  > conditional-variable: align when() and wait() behaviour in case of a predicate throwing an exception
  > stall-analyser: add output support for flamegraph
  > reactor: Add --io-completion-notify-ms option
  > io_queue: Stall detector
  > io_queue: Keep local variable with request execution delay
  > io_queue: Rename flow ratio timer to be more generic
  > reactor: Export _polls counter (internally)
  > dns: de-inline dns_resolver::impl methods
  > dns: enter seastar::net namespace
  > dnf: drop compatibility for c-ares <= 1.16
  > reactor: add missing includes of noncopyable_function.hh
  > reactor: Reset one-shot signal to DFL before handling
  > future: correctly document nested exception type emitted by finally()
  > modules: fix FATAL_ERROR on compiler check
  > seastar.cc: include fmt/ranges.h
  > pack io_request

Closes scylladb/scylladb#20300
2024-08-27 17:51:21 +03:00
Avi Kivity
2f4ef31254 Merge 'tools/testing: update dist-check to use rockylinux and adapt to cmake' from Kefu Chai
`dist-check` tests the generated rpm packages by installing them in a centos 7 container. but this script is terribly outdated

- centos 7 is deprecated. we should use a new distro's latest stable release.
- cqlsh was added to the family of rpms a while ago. we should test it as well.
- the directory hierarchy has been changed. we should read the artifacts from the new directories.
- cmake uses a different directory hierarchy. we should check the directory used by cmake as well.

to address these breaking changes, the scripts are updated accordingly.

---

this change gives an overhaul to a test, which is not used in production. so no need to backport.

Closes scylladb/scylladb#20267

* github.com:scylladb/scylladb:
  tools/testing: add cqlsh rpm
  tools/testing: adapt to cmake build directory
  tools/testing: test with rockylinux:9 not centos:7
  tools/testing: correct the paths to rpm packages and SCYLLA-*-FILE
  dist-check: add :z option when mapping volume
2024-08-27 16:16:34 +03:00
Pavel Emelyanov
1f3f0b1926 sstable_loader: Add sstables::storage_manager dependency
The storage_manager maintains set of clients to configured object
storage(s). The sstables loader  is going to spawn tasks that will talk
to to those storages, thus it needs the storage manager to get the
clients clients from.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-27 16:15:41 +03:00
Pavel Emelyanov
06c3c53deb sstable_loader: Maintain task manager module
This service is going to start tasks managed by task manager. For that,
it should have its module set up and registered.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-27 16:15:41 +03:00
Pavel Emelyanov
9cf95e8a07 sstable_loader: Out-line constructor
It will grow and become more complicated. Better to have it outside the
header.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-27 16:15:41 +03:00
Pavel Emelyanov
6a006d2255 distributed_loader: Split get_sstables_from_upload_dir()
Next patches will need this method to initialize sstable_directory
differently and then do its regular processing. For that, split the
method into two, next patch will re-use the common part it needs.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-27 16:15:41 +03:00
Pavel Emelyanov
630ab1dbea sstables/storage: Compose uploaded sstable path simpler
Current S3 storage driver keeps sstables in bucket in a form of

  /bucket/generation/component-name

To get sstables that are backed up on S3 this format doesn't apply,
because components are uploaded with their names unmodified. This patch
makes S3 storage driver account for that and not re-format component
paths for upload sstable state.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-27 16:15:41 +03:00
Pavel Emelyanov
2eda917375 sstable_directory: Prepare FS lister to scan files on S3
When component lister is created it checks the target storage options
for what kind of lister to create. For local options it creates FS
lister that collects sstables from their component files. For S3
options, it relies on sstables registry.

When collecting sstables from backup, it's not possible to use registry,
because those entries are not there. Instead, lister should pick up
individual components as it they were on local FS. This patch prepares
the lister for that -- in case S3 options are provided and the sstables'
state is "upload", don't try to read those from registry, but
instantiate the FS lister that will later use s3::bucket_lister.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-27 16:15:41 +03:00
Pavel Emelyanov
60d43911a9 sstable_directory: Parse sstable component without full path
When sstable directory collects a entry from storage, it tries to parse
its full path with the help of sstables::parse_path(). There are two
overloads of that function -- one with ks:cf arguments and one without.
The latter tries to "guess" keyspace and table names from the directory
name.

However, ks and table names are already known by the directory, it
doesn't even use the returned ks and cf values, so this parsing is
excessive. Also, future patches will put here backup paths, that might
not match the ks_name/table_name-table_uuid/ pattern that the parser
expects.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-27 16:15:41 +03:00
Pavel Emelyanov
86bc5b11fe s3-client: Add support for lister::filter
Directory lister comes with a filter function that tells lister which
entries to skip by its .get() method. For uniformity, add the same to
S3 bucket_lister.

After this change the lister reports shorter name in the returned
directory entry (with the prefix cut), so also need to tune up the unit
test respectively.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-27 16:15:40 +03:00
Pavel Emelyanov
113d2449f8 utils: Introduce abstract (directory) lister
This patch hides directory_lister and bucket_lister behind a common
facade. The intention is to provide a uniform API for sstable_directory
that it could use to list sstables' components wherever they are.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-27 16:15:40 +03:00
Piotr Dulikowski
da5f4faac1 Merge 'mv: reject user requests by coordinator when a replica is overloaded by MVs' from Wojciech Mitros
Currently, when a view update backlog of one replica is full, the write is still sent by the coordinator to all replicas. Because of the backlog, the write fails on the replica, causing inconsistency that needs to be fixed by repair. To avoid these inconsistencies, this patch adds a check on the coordinator for overloaded replicas. As a result, a write may be rejected before being sent to any replicas and later retried by the user, when the replica is no longer overloaded.

This patch does not remove the replica write failures, because we still may reach a full backlog when more view updates are generated after the coordinator check is performed and before the write reaches the replica.

Fixes scylladb/scylladb#17426

Closes scylladb/scylladb#18334

* github.com:scylladb/scylladb:
  mv: test the view update behavior
  mv: add test for admission control
  storage_proxy: return overloaded_exception instead of throwing
  mv: reject user requests by coordinator when a replica is overloaded by MVs
2024-08-27 12:50:34 +02:00
Aleksandra Martyniuk
f38bb6483a test: add test to ensure repair won't fail with uninitialized bm 2024-08-27 11:37:50 +02:00
Aleksandra Martyniuk
d8e4393418 repair: throw if batchlog manager isn't initialized
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.

Batchlog manager cannot be initialized before repair as we have the
dependencies chain:
repair_service -> storage_service::join_cluster -> batchlog_manager.

Throw if batchlog manager isn't initialized. That won't cause repair
to fail.
2024-08-27 11:22:28 +02:00
Botond Dénes
5c0f6d4613 Merge 'Make Summary support histogram with infinite bucket vlaues' from Amnon Heiman
This series fixes an issue where histogram Summaries return an infinite value.

It updated the quantile calculation logic to address cases where values fall into the infinite bucket of a histogram.
Now, instead of returning infinite (max int), the calculation will return the last bucket limit, ensuring finite outputs in all cases.

The series adds a test for summaries with a specific test case for this scenario.

Fixes #20255
Need backport to 6.0, 6.1 and 2023.1 and above

Closes scylladb/scylladb#20257

* github.com:scylladb/scylladb:
  test/estimated_histogram_test Add summary tests
  utils/histogram.hh: Make summary support inifinite bucket.
2024-08-27 10:33:54 +03:00
Kefu Chai
ae7ce38721 build: print out the default value of options
instead of using the default `argparse.HelpFormatter`, let's
use `ArgumentDefaultsHelpFormatter`, so that the default values
of options are displayed in the help messages.

this should help developer understand the behavior of the script
better.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20262
2024-08-27 10:04:31 +03:00
Kefu Chai
e2747e4bb5 build: cmake: add dist-check target
to achieve feature parity with our existing building system, we
need to implement a new build target "dist-check" in the CMake-based
building system.

in this change, "dist-check" is added to CMake-based building system.
unlike the rules generated by `configure.py`, the `dist-check` target
in CMake depends on the dist-*-rpm targets. the goal is to enable user
to test `dist-check` without explicitly building the artifacts being
tested.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20266
2024-08-27 10:03:41 +03:00
Kefu Chai
ea612e7065 docs: install poetry>=1.8.0
in 57def6f1, we specified "package-mode" for poetry, but this option
was introduced in poetry 1.8.0, as the "non-package" mode support.
see https://github.com/python-poetry/poetry/releases/tag/1.8.0

this change practically bumps up the minimum required poetry version
to 1.8.0, we did update `pyproject.tombl` to reflect this change.
but wefailed to update the `Makefile`.

in this change, we update `Makefile` to ensure that user which happens
have an older version of poetry can install the version which supports
this version when running `make setupenv`.

Refs scylladb/scylladb#20284
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20286
2024-08-27 09:20:09 +03:00
Yaniv Michael Kaul
022eb25d98 tools/toolchain/README.md: fix wording
Forgot to add that 'reg' tool is also needed.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#20287
2024-08-27 09:18:23 +03:00
Kefu Chai
5cffb23aa3 scylla-gdb.py: use chunked_fifo to represent _sink._pending_io
we switched from `circular_buffer` to `chunked_fifo` to present
`io_sink::_pending_io` in the latest seastar now. to be prepared for
this change, let's

* add `chunked_fifo` class in `scylla-gdb.py`.
* use `circular_buffer` as a fallback of `chunked_fifo`. instead of
  doing this the other way around, we try to send the message that
  the latest seastar uses `chunked_fifo`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20280
2024-08-27 08:44:56 +03:00
Andrei Chekun
fd51332978 test.py: Add parameter to control the pool size from the command line
Add parameter --cluster-pool-size that can control pool size for all
PythonTestSuite tests. By default, the pool size set to 10 for most of
the suites, but this is too much for laptops. So this parameter can be
used to lower the pool size and not to freeze the system. Additionally,
the environment variable CLUSTER_POOL_SIZE was added for a convenient way
to limit pool size in the system without the need to provide each time an
additional parameter.

Related: https://github.com/scylladb/scylladb/pull/20276

Closes scylladb/scylladb#20289
2024-08-26 19:55:41 +03:00
Avi Kivity
0acfa4a00d Merge 'abstract_replication_strategy: make get_ranges async' from Benny Halevy
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.

Fixes #19757

Closes scylladb/scylladb#19758

* github.com:scylladb/scylladb:
  abstract_replication_strategy: make get_ranges async
  database: get_keyspace_local_ranges: get vnode_effective_replication_map_ptr param
  compaction: task_manager_module: open code maybe_get_keyspace_local_ranges
  alternator: ttl: token_ranges_owned_by_this_shard: let caller make the ranges_holder
  alternator: ttl: can pass const gms::gossiper& to ranges_holder
  alternator: ttl: ranges_holder_primary: unconstify _token_ranges member
  alternator: ttl: refactor token_ranges_owned_by_this_shard
2024-08-26 16:56:18 +03:00
Botond Dénes
6d633e89ef Merge 'update CODEOWNERS' from Piotr Smaron
Removed people that no longer contribute to the scylladb.git and added/substituted reviewers responsible for maintaining the frontend components.

No need to backport, this is just an information for the github tool.

Closes scylladb/scylladb#20136

* github.com:scylladb/scylladb:
  codeowners: add appropriate reviewers to the cluster components
  codeowners: add appropriate reviewers to the frontend components
  codeowners: fix codeowner names
  codeowners: remove non contributors
2024-08-26 16:44:39 +03:00
Botond Dénes
4505b14fd6 Merge 'table_helper: complete coroutinization' from Avi Kivity
table_helper has some quite awkward code, improve it a little.

Code cleanup, so no reason to backport.

Closes scylladb/scylladb#20194

* github.com:scylladb/scylladb:
  table_helper: insert(): improve indentation
  table_helper: coroutinize insert()
  table_helper: coroutinize cache_table_info()
  table_helper: extract try_prepare()
2024-08-26 13:43:17 +03:00
Botond Dénes
b2c07c9b6f Merge 'compaction: change compaction stop reason ' from Aleksandra Martyniuk
Currently "table removal" is logged as a reason of compaction stop for table drop,
tablet cleanup and tablet split. Modify log to reflect the reason.

Closes scylladb/scylladb#20042

* github.com:scylladb/scylladb:
  test: add test to check compaction stop log
  compaction: fix compaction group stop reason
2024-08-26 13:40:07 +03:00
Kefu Chai
4d516a8363 tools/testing: add cqlsh rpm
we need to test the installation of cqlsh rpm. also, we should use the
correct paths of the generated rpm packages.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-26 11:33:57 +08:00
Kefu Chai
baee15390e tools/testing: adapt to cmake build directory
cmake uses a different arrangement, so let's check for the existence
of the build directory and fallback to cmake's build directory.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-26 11:33:57 +08:00
Kefu Chai
b802c000e1 tools/testing: test with rockylinux:9 not centos:7
the centos image repos on docker has been deprecated, and the
repo for centos7 has been removed from the main CentOS servers.

so we are either not able to install packages from its default repo,
without using the vault mirror, or no longer to pull its image
from dockerhub.

so, in this change

* we switch over to rockylinux:9, which is the latest stable release
  of rockylinux, and rockylinux is a popular clone of RHEL, so it
  matches our expectation of a typical use case of scylla.
* use dnf to manage the packages. as dnf is the standard way to manage
  rpm packages in modern RPM-based distributions.
* do not install deltarpm.
  delta rpms are was not supported since RHEL8, and the `deltarpm` package
  is not longer available ever since. see
  https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html-single/considerations_in_adopting_rhel_8/index#ref_the-deltarpm-functionality-is-no-longer-supported_notable-changes-to-the-yum-stack
  as a sequence, this package does not exist in Rockylinux-9.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-26 11:33:53 +08:00
Kefu Chai
00dad27f67 tools/testing: correct the paths to rpm packages and SCYLLA-*-FILE
when building with the rules generated from `configure.py`,
these files are located under tools' own build directory.
so correct them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-26 11:19:24 +08:00
Kefu Chai
86ef63df92 dist-check: add :z option when mapping volume
if SELinux is enabled on the host, we'd have following failure when
running `dist-check.sh`:
```
+ podman run -i --rm -v /home/kefu/dev/scylladb:/home/kefu/dev/scylladb docker.io/centos:7 /bin/bash -c 'cd /home/kefu/dev/scylladb && /home/kefu/dev/scylladb/tools/testing/dist-check/docker.io/centos-7.sh --mode debug'
/bin/bash: line 0: cd: /home/kefu/dev/scylladb: Permission denied
```

to address the permission issue, we need to instruct podman to
relabel the shared volume, so that the container can access
the shared volume.

see also https://docs.podman.io/en/stable/markdown/podman-pod-create.1.html#volume-v-source-volume-host-dir-container-dir-options

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-26 11:15:40 +08:00
Kefu Chai
8ef26a9c8c build: cmake: add "test" target
before this change, none of the target generated by CMake-based
building system runs `test.py`. but `build.ninja` generated directly
by `configure.py` provides a target named `test`, which runs the
`test.py` with the options passed to `configure.py`.

to be more compatible with the rules generated by `configure.py`,
in this change

* do not include "CTest" module, as we are not using CTest for
  driving tests. we use the homebrew `test.py` for this purpose.
  more importantly, the target named "test" is provided by "CTest".
  so in order to add our own "test" target, we cannot use "CTest"
  module.
* add a target named "test" to run "test.py".
* add two CMake options so we can customize the behavior of "test.py",
  this is to be compatible with the existing behavior of `configure.py`.

Refs scylladb/scylladb#2717

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20263
2024-08-25 21:45:13 +03:00
Avi Kivity
72a85e3812 Merge 'Integrated backup' from Pavel Emelyanov
This adds minimal implementation of the start-backup API call.

The method starts a task that uploads all files from the given keyspace's snapshot to the requested endpoint/bucket. Arguments are:
- endpoint -- the ID in object_store.yaml config file
- bucket -- the target bucket to put objects into
- keyspace -- the keyspace to work on
- snapshot -- the method assumes that the snapshot had been already taken and only copies sstables from it

The task runs in the background, its task_id is returned from the method once it's spawned and it should be used via /task_manager API to track the task execution and completion (hint: it's good to have non-zero TTL value to make sure fast backups don't finish before the caller manages to call wait_task API).

Sstables components are scanned for all tables in the keyspace and are uploaded into the /bucket/${cf_name}/${snapshot_name}/ path.

refs: #18391

Closes scylladb/scylladb#19890

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: add backup integration
  docs: Document the new backup method
  test/object_store: Test that backup task is abortable
  test/object_store: Add simple backup test
  test/object_store: Move format_tuples()
  test/pylib: Add more methods to rest client
  backup-task: Make it abortable (almost)
  code: Introduce backup API method
  database: Export parse_table_directory_name() helper
  database: Introduce format_table_directory_name() helper
  snapshot-ctl: Add config to snapshot_ctl
  snapshot-ctl: Add sstables::storage_manager dependency
  snapshot-ctl: Maintain task manager module
  snapshot-ctl: Add "snapshots" logger
  snapshot-ctl: Outline stop() method and constructor
  snapshot-ctl: Inline run_snapshot_list<>
  test/cql_test_env: Export task manager from cql test env
  task_manager: Print task ttl on start (for debugging)
  docs: Update object_storage.md with AWS_ environment
  docs: Restructure object_storage.md
2024-08-25 20:19:10 +03:00
Kefu Chai
f8931a4578 build: cmake: add "dist" target
since the rules generated by `configure.py` has this target, we need
to have an equivalent target as well in CMake-based buidling system.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20265
2024-08-25 20:18:12 +03:00
Andrei Chekun
f54b7f5427 test.py: Increase pool size
Increase pool size changes were recently reverted because of the flakiness for the test_gossip_boot test. Test started
to fail on adding the node to the cluster without any issues in the Scylla log file. In test logs it looked like the
installation process for the new node just hanged. After investigating the problem, I've found out that the issue is that
test.py was draining the io_executor pool for cleaning the directory during install that was set to eight workers. So
to fix the issue, io_executor pool should be increased to more or less the same ratio as it was: doubled cluster pool size.

Closes scylladb/scylladb#20276
2024-08-25 19:59:18 +03:00
Kefu Chai
a0688b29ea replication_strategy: add fmt::formatter<replication_strategy_type>
so that we can use {fmt} with it without the help of fmt::streamed.
also since we have a proper formatter for replication_strategy_type,
let's implement
`formatter<vnode_effective_replication_map::factory_key>`
as well.

since there are no callers of these two operator<<, let's drop
them in this change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20248
2024-08-25 19:34:52 +03:00
Kefu Chai
c88b63ce13 github: use clang-20 in clang-nightly workflow
since clang 19 has been branched. let's track the development brach,
which is clang 20.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20279
2024-08-25 19:31:43 +03:00
Benny Halevy
686a8f2939 abstract_replication_strategy: make get_ranges async
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.

Fixes #19757

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:57:34 +03:00
Benny Halevy
2bbbe2a8bc database: get_keyspace_local_ranges: get vnode_effective_replication_map_ptr param
Prepare for making the function async.
Then, it will need to hold on to the erm while getting
the token_ranges asynchronously.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:55:33 +03:00
Benny Halevy
ea5a0cca10 compaction: task_manager_module: open code maybe_get_keyspace_local_ranges
It is used only here and can be simplified by
checking if the keyspace replication strategy
is per table by the caller.

Prepare for making get_keyspace_local_ranges async.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:25:32 +03:00
Benny Halevy
824bdf99d2 alternator: ttl: token_ranges_owned_by_this_shard: let caller make the ranges_holder
Add static `make` methods to ranges_holder_{primary,secondary}
and use them to make the ranges objects and pass them
to `token_ranges_owned_by_this_shard`, rather than letting
token_ranges_owned_by_this_shard invoke the right constructor
of the ranges_holder class.

Prepare for making `make` async.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:25:32 +03:00
Benny Halevy
b2abbae24b alternator: ttl: can pass const gms::gossiper& to ranges_holder
There's no need to pass a mutable reference to
the gossiper.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:25:32 +03:00
Benny Halevy
333c0d7c88 alternator: ttl: ranges_holder_primary: unconstify _token_ranges member
To allow the class to be nothrow_move_constructable.
Prepare for returning it as a future value.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:25:32 +03:00
Benny Halevy
d385219a12 alternator: ttl: refactor token_ranges_owned_by_this_shard
Rather than holding a variant member (and defining
both ranges_holder_{primary,secondary} in both
specilizations of the class, just make the internal
ranges_holder class first-class citizens
and parameterize the `token_ranges_owned_by_this_shard`
template by this class type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:25:32 +03:00
Avi Kivity
c4dd21de38 repair: row_level: coroutinize repair_reader::close() 2024-08-24 00:36:48 +03:00
Avi Kivity
b1dd470533 repair: row_level: coroutinize repair_reader::end_of_stream() 2024-08-24 00:35:59 +03:00
Avi Kivity
7ce76fd0ea repair: row_level: coroutinize sink_source_for_repair::close()
The repeat() loop translates to almost nothing.
2024-08-24 00:30:02 +03:00
Avi Kivity
168a018e45 repair: row_level: coroutinize sink_source_for_repair::get_sink_source() 2024-08-24 00:19:12 +03:00
Avi Kivity
6b370d8154 table_helper: insert(): improve indentation
Restore after coroutinization.
2024-08-24 00:08:05 +03:00
Avi Kivity
ecd7702007 table_helper: coroutinize insert()
Improves readability. The do_with() ensures it's at least as performant
(though it's not in any fast path).
2024-08-24 00:08:05 +03:00
Avi Kivity
980ec2f925 table_helper: coroutinize cache_table_info()
After we extracted try_prepare(), this is fairly simple, and improves
readability.
2024-08-24 00:08:05 +03:00
Avi Kivity
4e44a15d4d table_helper: extract try_prepare()
table_helper::cache_table_info() is fairly convoluted. It cannot be
easily coroutinized since it invokes asynchronous functions in a
catch block, which isn't supported in coroutines. To start to break it
down, extract a block try_prepare() from code that is called twice. It's
both a simplification and a first step towards coroutinization.

The new try_prepare() can return three values: `true` if it succeeded,
`false` if it failed and there's the possibility of attempting a fallback,
and an exception on error.
2024-08-24 00:08:05 +03:00
Lakshmi Narayanan Sreethar
4823a1e203 test/pylib: fix keyspace_compaction method
The `keyspace_compaction` method incorrectly appends the column family
parameter to the URL using a regular string, `"?cf={table}"`, instead of
an f-string, `f"?cf={table}"`. As a result, the column family name is
sent as `{table}` to the server, causing the compaction request to fail.
Fix this issue by passing the parameter to the POST request using a
dictionary instead of appending it to the URL.

Fixes #20264

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#20243
2024-08-23 15:20:10 +03:00
Kefu Chai
4a405b0af9 perf/perf_sstable: enumerate sstables when loading them
before this change, we use the default options when creating `test_env`,
and the default options enable `use_uuid`. but the modes of
`perf-sstables` involving reads assumes that the identifiers are
deterministic. so that the previously written sstables using the "write"
mode can be read with the modes like "index_read", which just uses
`test_env::make_sstable()` in `load_sstables()`, and under the hood,
`test_env::make_sstable()` uses `test_env::new_generation()` for
retrieving the next identifier of sstable. when using integer-base
identifier, this works. as the sstable identifiers are generated
from a monotonically increasing integer sequence, where the identifiers
are deterministic. but this does not apply anymore when the UUID-based
identifiers are used, as the identifiers are generated with a
pseudorandom generator of UUID v1.

in this change, to avoid relying on the determinism of the integer-based
sstable identifier generation, we enumerate sstables by listing the
given directory, and parse the path for their identifier.

after this change, we are able to support the UUID-based sstable
identifier.

another option is disable the UUID-based sstable identifier when
loading sstables. the upside is that this approach is minimal and
straightforward. but the downside is that it encodes the assumption
in the algorithm implicitly, and could be confusing -- we create
a new generation for loading an existing sstable with this generation.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20183
2024-08-23 10:39:24 +03:00
Pavel Emelyanov
d1ac58f088 api: Get compaction througput via compaction manager
Now the endpoint hanler gets the value from db::config which is not nice
from several perspectives. First, it gets config (ab)using database.
Second, it's compaction manager that "knows" its throughput, global
config is the initial source of that information.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20173
2024-08-23 10:33:03 +03:00
Pavel Emelyanov
38edbebb10 compaction_manager: Keep flush-all-before-major option on own config
Currently the major compaction task impl grabs this (non-updateable)
value from db::config. That's not good, all services including
compaction manager have their own configs from which they take options.
Said that, this patch puts the said option onto
compaction_manager::config, makes use of it and configures one from
db::config on start (and tests).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20174
2024-08-23 10:31:55 +03:00
Botond Dénes
15fdc3f6cc Merge 'Add ability to list S3 bucket contents' from Pavel Emelyanov
This is prerequisite for "restore from object storage" feature. In order to collect the sstables in bucket one would need to list the bucket contents with the given prefix. The ListObjectsV2 provides a way for it and here's the respective s3::client extension.

Closes scylladb/scylladb#20120

* github.com:scylladb/scylladb:
  test: Add test for s3::client::bucket_lister
  s3_client: Add bucket lister
  s3_client: Encode query parameter value for query-string
2024-08-23 10:16:07 +03:00
Kefu Chai
7f65ee3270 dbuild: pass --tty only if --interactive
in 947e2814, we pass `--tty` as long as we are using podman _or_
we are in interactive mode. but if we build the tree using podman
using jenkins, we are seeing that ninja is displaying the output
as if it's in an interactive mode. and the output includes ASCII
escape codes. this is distracting.

the reason is that we

* are using podman, and
* ninja tells if it should displaying with a "smart" terminal by
  checking istty() and the "TERM" environmental variable.

so, in this change, we add --tty only if

* we are in the interactive mode.
* or stdin is associated with a terminal. this is the use case
  where user uses dbuild to interactively build scylla

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20196
2024-08-23 09:30:20 +03:00
Kefu Chai
ee19bbed05 test: do not define boost_test_print_type() for types with operator<<
in 30e82a81, we add a contraint to the template parameter of
boost_test_print_type() to prevent it from being matched with
types which can be formatted with operator<<. but it failed to
work. we still have test failure reports like:

```
[Exception] - critical check ['s', 's', 't', '_', 'm', 'r', '.', 'i', 's', '_', 'e', 'n', 'd', '_', 'o', 'f', '_', 's', 't', 'r', 'e', 'a', 'm', '(', ')'] has failed
```

this is not what we expect. the reason is that we passed the template
parameters to the `has_left_shift` trait in the wrong order, see
https://live.boost.org/doc/libs/1_83_0/libs/type_traits/doc/html/boost_typetraits/reference/has_left_shift.html.
we should have passed the lhs of operator<< expression as first
parameter, and rhs the second.

so, in this change, we correct the type constraint by passing the
template parameter in the right order, now the error message looks
better, like:

```
test/boost/mutation_query_test.cc(110): error: in "test_partition_query_is_full": check !partition_slice_builder(*s) .with_range({}) .build() .is_full() has failed
```

it turns out boost::transformed_range<> is formattable with operator<<,
as it fulfills the constraints of `boost::has_left_shift<ostream, R>`,
but when printing it, the compiler fails when it tries to insert the
elements in the range to the output stream.

so, in order to workaround this issue, we add a specialization for
`boost::transformed_range<F, R`.

also, to improve the readability, we reimplement the `has_left_shift<>`
as a concept, so that it's obvious that we need to put both the output
stream as the first parameter.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20233
2024-08-23 09:26:22 +03:00
Amnon Heiman
644e6f0121 test/estimated_histogram_test Add summary tests
This patch adds tests for summary calculation. It adds two tests, the
first is a basic calculation for P50, P95, P99 by adding 100 elements
into 20 buckets.

The second test look that if elements are found in the infinite bucket,
the result would be the lower limit (33s) and not infinite.

Relates to #20255

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-08-22 23:34:24 +03:00
Amnon Heiman
011aa91a8c utils/histogram.hh: Make summary support inifinite bucket.
This patch handles an edge cases related to The infinite bucket  
limit.

Summaries are the P50, P95, and P99 quantiles.

The quantiles are calculated from a histogram; we find the bucket and
return its upper limit.

In classic histograms, there is a notion of the infinite bucket;
anything that does not fall into the last bucket is considered to be
infinite;

with quantile, it does not make sense. So instead of reporting infinite
we'll report the bucket lower limit.

Fixes #20255

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-08-22 23:34:24 +03:00
Kefu Chai
39dd088374 test: include used headers
before this change, clang 20 fails to build the tree, like:

```
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/database_test.dir/Debug/database_test.cc.o -MF test/boost/CMakeFiles/database_test.dir/Debug/database_test.cc.o.d -o test/boost/CMakeFiles/database_test.dir/Debug/database_test.cc.o -c /home/kefu/dev/scylladb/test/boost/database_test.cc
/home/kefu/dev/scylladb/test/boost/database_test.cc:539:29: error: invalid use of incomplete type 'schema_builder'
  539 |                     return *schema_builder(ks_name, cf_name)
      |                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/kefu/dev/scylladb/schema/schema.hh:115:7: note: forward declaration of 'schema_builder'
  115 | class schema_builder;
      |       ^
```
and
```
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/group0_cmd_merge_test.dir/Debug/group0_cmd_merge_test.cc.o -MF test/boost/CMakeFiles/group0_cmd_merge_test.dir/Debug/group0_cmd_merge_test.cc.o.d -o test/boost/CMakeFiles/group0_cmd_merge_test.dir/Debug/group0_cmd_merge_test.cc.o -c /home/kefu/dev/scylladb/test/boost/group0_cmd_merge_test.cc
/home/kefu/dev/scylladb/test/boost/group0_cmd_merge_test.cc:78:18: error: member access into incomplete type 'db::config'
   78 |     cfg.db_config->commitlog_segment_size_in_mb(1);
      |                  ^
/home/kefu/dev/scylladb/data_dictionary/data_dictionary.hh:28:7: note: forward declaration of 'db::config'
   28 | class config;
      |       ^
1 error generated.
```
and
```
`FAILED: test/boost/CMakeFiles/repair_test.dir/Debug/repair_test.cc.o
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/repair_test.dir/Debug/repair_test.cc.o -MF test/boost/CMakeFiles/repair_test.dir/Debug/repair_test.cc.o.d -o test/boost/CMakeFiles/repair_test.dir/Debug/repair_test.cc.o -c /home/kefu/dev/scylladb/test/boost/repair_test.cc
/home/kefu/dev/scylladb/test/boost/repair_test.cc:149:45: error: use of undeclared identifier 'global_schema_ptr'
  149 |         co_await e.db().invoke_on_all([gs = global_schema_ptr(gen.schema())](replica::database& db) -> future<> {
      |                                             ^
/home/kefu/dev/scylladb/test/boost/repair_test.cc:150:62: error: use of undeclared identifier 'gs'
  150 |             co_await db.add_column_family_and_make_directory(gs.get(), replica::database::is_new_cf::yes);
      |                                                              ^
2 errors generated.
```

because we are using incomplete types when their complete definitions
are required.
so, in this change, we include the headers for their complete definition.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20239
2024-08-22 20:51:38 +03:00
Kefu Chai
969cbb75ce tools/scylla-nodetool: add backup integration
as we have an API for backup a keyspace, let's expose this feature
with nodetool. so we can exercise it without the help of scylla-manager
or 3rd-party tools with a user-friendly interface.

in this change:

* add a new subcommand named "backup" to nodetool
* add test to verify its interaction with the API server
* add two more route to the REST API mock server, as
  the test is using /task_manager/wait_task/{task_id} API.
  for the sake of completeness, the route for
  /task_manager/{part1} is added as well.
* update the document accordingly.
* the bash completion script is updated accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-22 19:48:06 +03:00
Pavel Emelyanov
245cc852dd docs: Document the new backup method
Add the new /storage_service/backup endpoint to object_storage.md as yet
another way to use S3 from Scylla.
2024-08-22 19:47:06 +03:00
Pavel Emelyanov
de87450453 test/object_store: Test that backup task is abortable
It starts similarly to simpl backup test, but injects a pause into the
task once a single file is scheduled for upload, then aborts the task,
waits for it to fail, and check that _not_ all files are uploaded.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 19:47:06 +03:00
Pavel Emelyanov
f8d894bc23 test/object_store: Add simple backup test
The test shows how to backup a keyspace:

- flush
- take snapshot
- start backup with the new API method
- wait for the task to finish

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 19:47:06 +03:00
Pavel Emelyanov
47e49e6dec test/object_store: Move format_tuples()
There will soon appear a new .py file in the suite that will want to use
this helper too

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 19:47:06 +03:00
Pavel Emelyanov
d83d585709 test/pylib: Add more methods to rest client
Namely:
- POST /storage_service/snapshots to take snapshot on a ks
- GET /task_manager/get_task_status/{id} to get status of a running task
- GET /task_manager/wait_task/{id} to wait for a task to finish
- POST /task_manager/abort_task/{id} to abort a running task

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 19:47:06 +03:00
Pavel Emelyanov
ed6e6700ab backup-task: Make it abortable (almost)
Make the impl::is_abortable() return 'yes' and check the impl::_as in
the files listing loop. It's not real abort, since files listing loop is
expected to be fast and most of the time will be spent in s3::client
code reading data from disk and sending them to S3, but client doesn't
support aborting its requests. That's some work yet to be done.

Also add injection for future testing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 19:47:06 +03:00
Pavel Emelyanov
a812f13ddd code: Introduce backup API method
The method starts a task that uploads all files from the given
keyspace's snapshot to the requested endpoint/bucket. The task runs in
the background, its task_id is returned from the method once it's
spawned and it should be used via /task_manager API to track the task
execution and completion (hint: it's good to have non-zero TTL value to
make sure fast backups don't finish before the caller manages to call
wait_task API).

If snapshot doesn't exist, nothing happens (FIXME, need to return back
an error in that case).

If endpoint is not configured locally, the API call resolves with
bad-request instantly.

Sstables components are scanned for all tables in the keyspace and are
uploaded into the /bucket/${cf_name}/${snapshot_name}/ path.

Task is not abortable (FIXME -- to be added) and doesn't really report
its progress other than running/done state (FIXME -- to be added too).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 19:47:06 +03:00
Pavel Emelyanov
f7b380d53b database: Export parse_table_directory_name() helper
There's parse_table_directory_name() static helper in database.cc code
that is used by methods that parse table tree layout for snapshot.
Export this helper for external usage and rename to fit the format_...
one introduced by previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 14:57:48 +03:00
Pavel Emelyanov
33962946fc database: Introduce format_table_directory_name() helper
The one makes table directory (not full path) out of table name and
uuid. This is to be symmetrical with yet another helper that converts
dirctory name back to table name and uuid (next patch)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 14:57:48 +03:00
Pavel Emelyanov
dff51fd58c snapshot-ctl: Add config to snapshot_ctl
Pretty much all services in Scylla have their own config. Add one to
snapshot-ctl too, it will be populated later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 14:57:20 +03:00
Pavel Emelyanov
f37857e20a snapshot-ctl: Add sstables::storage_manager dependency
The storage_manager maintains set of clients to configured object
storage(s). The snapshot ctl is going to spawn tasks that will talk to
those storages, thus it needs the storage manager to get the clients
from.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 14:08:21 +03:00
Pavel Emelyanov
362331c89b snapshot-ctl: Maintain task manager module
This service is going to start tasks managed by task manager. For that,
it should have its module set up and registered.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 14:08:21 +03:00
Pavel Emelyanov
4ae89a9c81 snapshot-ctl: Add "snapshots" logger
Will be used later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 14:08:21 +03:00
Pavel Emelyanov
90c794172b snapshot-ctl: Outline stop() method and constructor
These two are going to grow, keep them out not to pollute the header

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 14:08:21 +03:00
Pavel Emelyanov
96946a4b11 snapshot-ctl: Inline run_snapshot_list<>
This helper will be used by a code from another .cc file, so the
template needs to be in header for smooth instantiation

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 14:08:21 +03:00
Pavel Emelyanov
4e73b4d8ad test/cql_test_env: Export task manager from cql test env
To be used by one of the next patches

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 14:08:21 +03:00
Pavel Emelyanov
4b86eede1f task_manager: Print task ttl on start (for debugging)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 14:08:21 +03:00
Pavel Emelyanov
8949d73cd9 docs: Update object_storage.md with AWS_ environment
Commit 51c53d8db6 made it possible to configure object storage endpoint
creds via environment. Mention this in the docs.
2024-08-22 14:08:21 +03:00
Pavel Emelyanov
d3f9865d2f docs: Restructure object_storage.md
Currently the doc assumes that object storage can only be used to keep
sstables on it. It's going to change, restructure the doc to allow for
more usage scenarios.
2024-08-22 14:08:21 +03:00
Pavel Emelyanov
4e2d7aa2a2 test/tablets: Test that reading tablets' mutations from MUTATION_FRAGMENTS works
Currently it doesn't, one of the node crashes with std::out_of_range
exception and meaningless calltrace

[Botond]: this test checks the case of reading a partition via
MUTATION_FRAGMENTS from a node which doesn't own said partition.

refs: #18786

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-22 06:24:06 -04:00
Botond Dénes
46563d719f replica/mutation_dump: enfore pinning of effective replication map
By making it a required argument, making sure the topology version is
pinned for the duration of the query. This is needed because mutation
dump queries bypass the storage proxy, where this pinning usually takes
place. So it has to be enforced here.
2024-08-22 06:24:06 -04:00
Botond Dénes
de5329157c replica/mutation_dump: handle un-owned tokens (with tablets)
When using tablets, the replica-side doesn't handle un-owned tokens.
table::shard_for_reads() will just return 0 for un-owned tokens, and a
later attempt at calling table::storage_group_for_token() with said
un-owned token will cause a crash (std::terminate due to
std::out_of_range thrown in noexcept context).
The replicas rely on the coordinator to not send stray requests, but for
select from mutation_fragments(table) queries, there is no coordinator
side who could do the correct dispatching. So do this in
mutation_dump(), just creating empty readers for un-owned tokens.
2024-08-22 03:06:55 -04:00
Łukasz Paszkowski
a11d19f321 test_query.py: Test reverse queries with clustering key bounds
Since a native reversed format is used for reversed queries,
additional tests with restrictions on clustering keys are required
to capture possible errors like https://github.com/scylladb/scylladb/issues/20191
earlier than in dtests.

Add parametrization to the following tests:
  + test_query_reverse
  + test_query_reverse_paging
to accept a comparison operator used in selection criteria for a Query
operation.
2024-08-21 14:21:34 +02:00
Aleksandra Martyniuk
9b7c837106 test: add test to check compaction stop log 2024-08-21 12:42:37 +02:00
Aleksandra Martyniuk
5005e19de7 compaction: fix compaction group stop reason
compaction_manager::remove passes "table removal" as a reason
of stopping ongoing compactions, but currently remove method
is also called when a tablet is migrated or split.

Pass the actual reason of compaction stop, so that logs aren't
misleading.
2024-08-21 12:42:09 +02:00
Avi Kivity
2ef5b5e4fe Revert "[test.py] Increase pool size for CI"
This reverts commit cc428e8a36. It causes
may spurious CI failures while nodes are being torn down. Revert it until
the root cause is fixed, after which it can be reinstated.

Fixes #20116.
2024-08-21 13:21:08 +03:00
Benny Halevy
f40d06b766 table: calculate_tablet_count: use sg_manager storage_groups size
Now, when each shard storage_group_manager keeps
only the storage_groups for the tablet replica it owns,
we can simple return the storage_group map size
instead of counting the number of tablet replicas
mapped to this shard.

Add a unit test that sums the tablet count
on all shards and tests that the sum is equal
to the configured default `initial_tablets.

Fixes #18909

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20223
2024-08-21 11:01:58 +02:00
Tomasz Grabiec
a3a97e8aad Merge 'schema_tables: calculate_schema_digest: prevent stalls due to large m…' from Benny Halevy
…utations vector

With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls when freed.

For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
 (inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
 (inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```

This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time, and by that, reducing memory footprint and preventing reactor stalls.

Fixes #18173

Closes scylladb/scylladb#18174

* github.com:scylladb/scylladb:
  schema_tables: calculate_schema_digest: filter the key earlier
  schema_tables: calculate_schema_digest: prevent stalls due to large mutations vector
2024-08-20 21:24:38 +02:00
Łukasz Paszkowski
f29d7ffa81 alternator::do_query Add additional trace log
Additional log prints information on the read query being executed.
It lists information like whether the query is a reversed one or
not, and table_schema and query_schema versions.
2024-08-20 20:56:15 +02:00
Łukasz Paszkowski
727cbd8151 alternator::do_query: Use native reversed format
When executing reversed queries, a native revered format shall be used.
Therefore the table schema and the clustering key bounds are reversed
before a partition slice and a read command are constructed.

Similarly as for cql3::statements::select_statement.
2024-08-20 20:56:15 +02:00
Łukasz Paszkowski
3720e8aabe alternator::do_query Rename schema with table_schema
In order to increase readability, a schema variable is renamed to
a table_schema to emphesize a table schema is passed to the function
and used across it.

Allows us to introduce a query_schema variable in the next patch.
2024-08-20 20:56:06 +02:00
Aleksandra Martyniuk
9d9414a75d replica: add/remove table atomically
Currently, database::tables_metadata::add_table needs to hold a write
lock before adding a table. So, if we update other classes keeping
track of tables before calling add_table, and the method yields,
table's metadata will be inconsistent.

Set all table-related info in tables_metadata::add_table_helper (called
by add_table) so that the operation is atomic.

Analogically for remove_table.

Fixes: #19833.

Closes scylladb/scylladb#20064
2024-08-20 20:53:32 +03:00
Kamil Braun
5c9efdff50 Merge 'raft: store_snapshot_descriptor to use actually preserved items number when truncating the local log table' from Sergey Zolotukhin
io_fiber/store_snapshot_descriptor now gets the actual number of items
preserved when the log is truncated, fixing extra entries remained after
log snapshot creation. Also removes incorrect check for the number of
truncated items in the
raft_sys_table_storage::store_snapshot_descriptor.

Minor change: Added error_injection test API for changing snapshot thresholds settings.

Fixes scylladb/scylladb#16817
Fixes scylladb/scylladb#20080

Closes scylladb/scylladb#20095

* github.com:scylladb/scylladb:
  raft:  Ensure const correctness in applier_fiber.
  raft: Invoke store_snapshot_descriptor with actually preserved items.
  raft: Use raft_server_set_snapshot_thresholds in tests.
  raft: Fix indentation in server.cc
  raft: Add a test to check log size after truncation.
  raft: Add raft_server_set_snapshot_thresholds injection.
  utils: Ensure const correctness of injection_handler::get().
2024-08-20 18:15:30 +02:00
Tomasz Grabiec
ff52527c54 Merge 'repair: do_rebuild_replace_with_repair: use source_dc only when safe' from Benny Halevy
It is unsafe to restrict the sync nodes for repair to the source data center if it has too low replication factor in network_topology_replication_strategy, or if other nodes in that DC are ignored.

Also, this change restricts the usage of source_dc to `network_topology` and `everywhere_topology`
strategies, as with simple replication strategy
there is no guarantee that there would be any
more replicas in that data center.

Fixes #16826

Reproducer submitted as https://github.com/scylladb/scylla-dtest/pull/3865
It fails without this fix and passes with it.

* Requires backport to live versions.  Issue hit in the filed with 2022.2.14

Closes scylladb/scylladb#16827

* github.com:scylladb/scylladb:
  repair: do_rebuild_replace_with_repair: use source_dc only when safe
  repair: replace_with_repair: pass the replace_node downstream
  repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
  repair: replace_rebuild_with_repair: pass ks_erms from caller
  nodetool: rebuild: add force option
  Add and use utils::optional_param to pass source_dc
2024-08-20 16:13:23 +02:00
Sergey Zolotukhin
13b3d3a795 raft: Ensure const correctness in applier_fiber.
Add 'const' to non mutable varibales in server_impl::applier_fiber() function.
2024-08-20 15:24:00 +02:00
Sergey Zolotukhin
c3e52ab942 raft: Invoke store_snapshot_descriptor with actually preserved items.
- raft_sys_table_storage::store_snapshot_descriptor now receives a number of
preserved items in the log, rather than _config.snapshot_trailing value;
- Incorrect check for truncated number of items in store_snapshot_descriptor
was removed.

Fixes scylladb/scylladb#16817
Fixes scylladb/scylladb#20080
2024-08-20 15:22:49 +02:00
Sergey Zolotukhin
922e035629 raft: Use raft_server_set_snapshot_thresholds in tests.
Replace raft_server_snapshot_reduce_threshold with raft_server_set_snapshot_thresholds in tests
as raft_server_set_snapshot_thresholds fully covers the functionality of raft_server_snapshot_reduce_threshold.
2024-08-20 15:08:49 +02:00
Sergey Zolotukhin
00a1d3e305 raft: Fix indentation in server.cc 2024-08-20 15:08:45 +02:00
Sergey Zolotukhin
b6de8230a9 raft: Add a test to check log size after truncation.
The test checks that snapshot_trailing_size parameter is taken
into consideration when the log system table is truncated.
Test for  scylladb#16817
2024-08-20 14:15:50 +02:00
Sergey Zolotukhin
9dfa041fe1 raft: Add raft_server_set_snapshot_thresholds injection.
Use error injection to allow overriding following snapshot threshold settings:
- snapshot_threshold
- snapshot_threshold_log_size
- snapshot_trailing
- snapshot_trailing_size
2024-08-20 14:15:50 +02:00
Sergey Zolotukhin
c5da0775f2 utils: Ensure const correctness of injection_handler::get().
Make utils::error_injection::injection_handler::get() method 'const' as it does not mutate object's state.
2024-08-20 14:15:50 +02:00
Botond Dénes
3ee0d7f2d1 Merge 'tools: Enhance scylla sstable shard-of to support tablets' from Kefu Chai
before this change, `scylla sstable shard-of` didn't support tablets,
because:

- with tablets enabled, data distribution uses the scheduler
- this replaces the previous method of mapping based on vnodes and shard numbers
- as a result, we can no longer deduce sstable mapping from token ranges

in this change, we:
- read `system.tablets` table to retrieve tablet information
- print the tablet's replica set (list of <host, shard> pairs)
- this helps users determine where a given sstable is hosted

This approach provides the closest equivalent functionality of
`shard-of` in the tablet era.

Fixes scylladb/scylladb#16488

---

no need to backport, it's an improvement, not a critical fix.

Closes scylladb/scylladb#20002

* github.com:scylladb/scylladb:
  tools: enhance `scylla sstable shard-of` to support tablets
  replica/tablets: extract tablet_replica_set_from_cell()
  tools: extract get_table_directory() out
  tools: extract read_mutation out
  build: split the list of source file across multiple line
  tools/scylla-sstable: print warning when running shard-of with tablets
2024-08-20 13:51:12 +03:00
Avi Kivity
e2b179a3d0 Merge 'Coroutinize sstable_directory registry garbage collecting method' from Pavel Emelyanov
null

Closes scylladb/scylladb#20172

* github.com:scylladb/scylladb:
  sstable_directory: Coroutinize inner lambdas
  sstable_directory: Fix indentation after previous patch
  sstable_directory: Coroutinize outer cotinuation chain
2024-08-20 12:50:09 +03:00
David Garcia
fea707033f docs: improve include flag directive
The include flag directive now treats missing content as info logs instead of warnings. This prevents build failures when the enterprise-specific content isn't yet available.

If the enterprise content is undefined, the directive automatically loads the open-source content. This ensures the end user has access to some content.

address comments

Closes scylladb/scylladb#19804
2024-08-20 12:21:39 +03:00
Kefu Chai
9a10c33734 build: cmake: do not build storage_proxy.o by default
in 5ce07e5d84, the target named "storage_proxy.o" was added for
training the build of clang. but the rule for building this target
has two flaws:

* it was added a dependency of the "all" target, but we don't need
  to build `storage_proxy.cc` twice when building the tree in the
  regular build job. we only need to build it when creating the
  profile for training the build of clang.
* it misses the include directory of abseil library. that's why we
  have following build failure when building the default target:

  ```
  [2024-08-18T14:58:37.494Z] /usr/local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/jenkins/workspace/scylla-master/scylla-ci/scylla -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/seastar/gen/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/seastar/gen/src -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/gen -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -U_FORTIFY_SOURCE -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT service/CMakeFiles/storage_proxy.o.dir/Debug/storage_proxy.cc.o -MF service/CMakeFiles/storage_proxy.o.dir/Debug/storage_proxy.cc.o.d -o service/CMakeFiles/storage_proxy.o.dir/Debug/storage_proxy.cc.o -c /jenkins/workspace/scylla-master/scylla-ci/scylla/service/storage_proxy.cc
  [2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/service/storage_proxy.cc:17:
  [2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/db/commitlog/commitlog.hh:19:
  [2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/db/commitlog/commitlog_entry.hh:15:
  [2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/mutation/frozen_mutation.hh:15:
  [2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/mutation/mutation_partition_view.hh:16:
  [2024-08-18T14:58:37.495Z] In file included from /jenkins/workspace/scylla-master/scylla-ci/scylla/build/gen/idl/mutation.dist.impl.hh:14:
  [2024-08-18T14:58:37.495Z] /jenkins/workspace/scylla-master/scylla-ci/scylla/serializer_impl.hh:20:10: fatal error: 'absl/container/btree_set.h' file not found
  [2024-08-18T14:58:37.495Z]    20 | #include <absl/container/btree_set.h>
  [2024-08-18T14:58:37.495Z]       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
  [2024-08-18T14:58:37.495Z] 1 error generated.
  ```
* if user only enables "dev" mode, we'd have:
  ```
  CMake Error at service/CMakeLists.txt:54 (add_library):
  No SOURCES given to target: storage_proxy.o
  ```

so, in this change, we

* exclude this target from "all"
* link this target against abseil header library, so it has access
  to the abseil library. please note, we don't need to build an
  executable in this case, so the header would suffice.
* add a proxy target to conditionally enable/disable this target.
  as CMake does not support generator expression in `add_dependencies()`
  yet at the time of writing.
  see https://gitlab.kitware.com/cmake/cmake/-/issues/19467

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20195
2024-08-19 21:30:34 +03:00
Avi Kivity
7eb3b15fff Merge 'utils/tagged_integer: remove conversion to underlying integer' from Laszlo Ersek
~~~
utils/tagged_integer: remove conversion to underlying integer

Silently converting a tagged (i.e., "dimension-ful") integer to a naked
("dimensionless") integer defeats the purpose of having tagged integers,
and is a source of practical bugs, such as
<https://github.com/scylladb/scylladb/issues/20080>.

We could make the conversion operator explicit, for enforcing

  static_cast<TAGGED_INTEGER_TYPE::value_type>(TAGGED_INTEGER_VALUE)

in every conversion location -- but that's a mouthful to write. Instead,
remove the conversion operator, and let clients call the (identically
behaving) value() member function.
~~~

No backport needed (refactoring).

The series is supposed to solve #20081.

Two patches in the series touch up code that is known to be (orthogonally) buggy; see
- `service/raft_sys_table_storage: tweak dead code` (#20080)
- `test/raft/replication: untag index_t in test_case::get_first_val()` (#20151)

Fixes for those (independent) issues will have to be rebased on this series, or this series will have to be rebased on those (due to context conflicts).

The series builds at every stage. The debug and release unit test suites pass at the end.

Closes scylladb/scylladb#20159

* github.com:scylladb/scylladb:
  utils/tagged_integer: remove conversion to underlying integer
  test/raft/randomized_nemesis_test: clean up remaining index_t usage
  test/raft/randomized_nemesis_test: clean up index_t usage in store_snapshot()
  test/raft/replication: clean up remaining index_t usage
  test/raft/replication: take an "index_t start_idx" in create_log()
  test/raft/replication: untag index_t in test_case::get_first_val()
  test/raft/etcd_test: tag index_t and term_t for comparisons and subtractions
  test/raft/fsm_test: tag index_t and term_t for comparisons and subtractions
  test/raft/helpers: tighten compare_log_entries() param types
  service/raft_sys_table_storage: tweak dead code
  service/raft_sys_table_storage: simplify (snap.idx - preserve_log_entries)
  service/raft_sys_table_storage: untag index_t and term_t for queries
  raft/server: clean up index_t usage
  raft/tracker: don't drop out of index_t space for subtraction
  raft/fsm: clean up index_t and term_t usage
  raft/log: clean up index_t usage
  db/system_keyspace: promise a tagged integer from increment_and_get_generation()
  gms/gossiper: return "strong_ordering" from compare_endpoint_startup()
  gms/gossiper: get "int32_t" value of "gms::version_type" explicitly
2024-08-19 19:52:54 +03:00
Benny Halevy
5f655e41e3 repair: do_rebuild_replace_with_repair: use source_dc only when safe
It is unsafe to restrict the sync nodes for repair to
the source data center if we cannot guarantee a quorum
in the data center with network-topology replication strategy.

This change restricts the usage of source_dc in the following cases:
1. For SimpleStrategy - source_dc is ignored since there is no guarantee
that it contains remaining replicas for all tokens.
2. For EverywhereStrategy - use source_dc if there are remaining
live nodes in the datacenter.
3. For NetworkTopologyStrategy:
a. It is considered unsafe to use source_dc if number of nodes
   lost in that DC (replaced/rebuilt node + additional ignored nodes)
   is greater than 1, or it has 1 lost node and rf <= 1 in the DC.

b. If the source_dc arg is forced, as with the new
   `nodetool rebuild --force <source_dc>` option,
   we use it anyway, even if it's considered to be unsafe.
   A warning is printed in this case.

c. If the source_dc arg is user-provided, (using nodetool rebuild),
   an error exception is thrown, advising to use an alternative dc,
   if available, omit source_dc to sync with all nodes, or use the
   --force option to use the given source_dc anyhow.

d. Otherwise, we look for an alternative source datacenter,
   that has not lost any node. If such datacenter is found
   we use it as source_dc for the keyspace, and log a warning.

e. If no alternative dc is found (and source_dc is implicit), then:
   log a warning and fall back to using replicas from all nodes in the cluster.

Fixes #16826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:23:51 +03:00
Benny Halevy
8665eef98c repair: replace_with_repair: pass the replace_node downstream
To be used by the next path to count how many nodes
are lost in each datacenter.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:23:33 +03:00
Benny Halevy
9729dd21c3 repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
The callers already pass ignore_nodes as host_id:s
and we translate them into inet_address only for repair
so delay the translation as much as posible,

Refs scylladb/scylladb#6403

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:22:01 +03:00
Benny Halevy
b5d0ab092c repair: replace_rebuild_with_repair: pass ks_erms from caller
The keyspaces replication maps must be in sync with the
token_metadata_ptr passed already to the functions,
so instead of getting it in the callee, let the caller
get the ks_erms along with retrieving the tmptr.

Note that it's already done on the rebuild path
for streaming based rebuild.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:20:27 +03:00
Benny Halevy
0419b1d522 nodetool: rebuild: add force option
To be used to force usage of source_dc, even
when it is unsafe for rebuild.

Update docs and add test/nodetool/test_rebuild.py

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:20:12 +03:00
Benny Halevy
8b1877f3ca Add and use utils::optional_param to pass source_dc
Clearly indicate if a source_dc is provided,
and if so, was it explicitly given by the user,
or was implicitly selected by scylla.

This will become useful in the next patches
that will use that to either reject the operation
if it's unsafe to use the source_dc and the dc was
explicitly given by the user, or whether
to fallback to using all nodes otherwise.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-19 17:13:54 +03:00
Anna Stuchlik
83d5cb04c2 doc: extract the info about tablets defaut to a separate file
This commit extracts the information about the default for tables in keyspace creation
to a separate file in the _common folder. The file is then included using
the scylladb_include_flag directive.

The purpose of this commit is to make it possible to include a different file
in the scylla-enterprise repo - with a different default.

Refs https://github.com/scylladb/scylla-enterprise/issues/4585

Closes scylladb/scylladb#20181
2024-08-19 16:16:18 +03:00
Kefu Chai
25b3c50f71 test/nodetool: print default value of options in help message
would be more helpful, if the output of "--help" command line can
include the default value of options.

so, in this change, we include the default values in it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20170
2024-08-19 16:15:24 +03:00
Botond Dénes
40d2a6f0b2 Merge 'test.py: use XPath for iterating in "TestSuite/TestSuite"' from Kefu Chai
before this change, we check for the existence of "TestSuite" node
under the root of XML tree, and then enumerating all "TestSuite" nodes
under this "TestSuite", this approach works. but it

* introduces unnecessary indent
* is not very readable

in this change, we just use "./TestSuite/TestSuite" for enumerating
all "TestSuite" nodes under "TestSuite". simpler this way.

---

it's a cleanup in the test driver script, hence no need to backport.

Closes scylladb/scylladb#20169

* github.com:scylladb/scylladb:
  test.py: fix the indent
  test.py: use XPath for iterating in "TestSuite/TestSuite"
2024-08-19 16:13:42 +03:00
Botond Dénes
6835f7e993 Merge 'Add CQL-based RBAC support to Alternator' from Piotr Smaron
Alternator already supports **authentication** - the ability to to sign each request as a particular user. The users that can be used are the different "roles" that are created by CQL "CREATE ROLE" commands. This series adds support for **authorization**, i.e., the ability to determine that only some of these roles are allowed to read or write particular tables, to create new tables, and so on.

The way we chose to do this in this series is to support CQL's existing role-based access control (RBAC) commands - GRANT and REVOKE - on Alternator tables. For example, an Alternator table "xyz" is visible to CQL as "alternator_xyz.xyz", so a `GRANT SELECT ON alternator_xyz.xyz TO myrole` will allow read commands (e.g., GetItem) on that table, and without this GRANT, a GetItem will fail with `AccessDeniedException`.

This series adds the necessary checks to all relevant Alternator operations, and also adds extensive functional testing for this feature - i.e., that certain DynamoDB API operations are not allowed without the appropriate GRANTs.

The following permissions are needed for the following Alternator API operations:

* **SELECT**:      `GetItem`, `Query`, `Scan`, `BatchGetItem`, `GetRecords`
* **MODIFY**:      `PutItem`, `DeleteItem`, `UpdateItem`, `BatchWriteItem`
* **CREATE**:      `CreateTable`
* **DROP**:        `DeleteTable`
* **ALTER**:       `UpdateTable`, `TagResource`, `UntagResource`, `UpdateTimeToLive`
* _none needed_: `ListTables`, `DescribeTable`, `DescribeEndpoints`, `ListTagsOfResource`, `DescribeTimeToLive`, `DescribeContinuousBackups`, `ListStreams`, `DescribeStream`, `GetShardIterator`

Currently, I decided that for consistency each operation requires one permission only. For example, PutItem only requires MODIFY permission. This is despite the fact that in some cases (namely, `ReturnValues=ALL_OLD`) it can also _read_ the item. We should perhaps discuss this decision - and compare how it was done in CQL - e.g., what happens in LWT writes that may return old values?

Different permissions can be granted for a base table, each of its views, and the CDC table (Alternator streams). This adds power - e.g., we can allow a role to read only a view but not the base table, or read the table but not its history. GRANTing permissions on views or CDC logs require knowing their names, which are somewhat ugly (e.g., the name of GSI "abc" in table "xyz" is `alternator_xyz.xyz:abc`). But usefully, the error message when permissions are denied contains the full name of the table that was lacking permissions and which permissions were lacking, so users can easily add them.

In addition to permissions checking, this series also correctly supports _auto-grant_ (except #19798): When a role has permissions to `CreateTable`, any table it creates will automatically be granted all permissions for this role, so this role will be able to use the new table and eventually delete it. `DeleteTable` does the opposite - it removes permissions from tables being deleted, so that if later a second user re-creates a table with the same name, the first user will not have permissions over the new table.

The already-existing configuration parameter `alternator_enforce_authorization` (off by default), which previously only enabled authentication, now also enables authorization. Users that upgrade to the new version and already had `alternator_enforce_authorization=true` should verify that the users they use to authenticate either have the appropriate permissions or the "superuser" flag. Roles used to authenticate must also have the "login" flag.

Please note that although the new RBAC support implements the access control feature we asked for in #5047, this implementation is _not compatible_ with DynamoDB. In DynamoDB, the access control is configured through IAM operations or through the new `PutResourcePolicy` - operation, not through CQL (obviously!). DynamoDB also offers finer access-control granularity than we support (Scylla's RBAC works on entire tables, DynamoDB allows setting permissions on key prefixes, on individual attributes, and more). Despite this non-compatibility, I believe this feature, as is, will already be useful to Alternator users.

Fixes #5047 (after closing that issue, a new clean issue should be opened about the DynamoDB-compatible APIs that we didn't do - just so we remember this wasn't done yet).

New feature, should not be backported.

Closes scylladb/scylladb#20135

* github.com:scylladb/scylladb:
  tests: disable test_alternator_enforce_authorization_true
  test, alternator: test for alternator_enforce_authorization config
  test/pylib: allow setting driver_connect() options in servers_add()
  test: fix test_localnodes_joining_nodes
  alternator, RBAC: reproducer for missing CDC auto-grant
  alternator: document the new RBAC support
  alternator: add RBAC enforcement to GetRecords
  test/alternator: additional tests for RBAC
  test/alternator: reduce permissions-validity-in-ms
  test/alternator: add test for BatchGetItem from multiple tables
  alternator: test for operations that do not need any permissions
  alternator: add RBAC enforcement to UpdateTimeToLive
  alternator: add RBAC enforcement to TagResource and UntagResource
  alternator: add RBAC enforcement to BatchGetItem
  alternator: add RBAC enforcement to BatchWriteItem
  alternator: add RBAC enforcement to UpdateTable
  alternator: add RBAC enforcement to Query and Scan
  alternator: add RBAC enforcement to CreateTable
  alternator: add RBAC enforcement to DeleteTable
  alternator: add RBAC enforcement to UpdateItem
  alternator: add RBAC enforcement to DeleteItem
  alternator: add RBAC enforcement to PutItem
  alternator: add RBAC enforcement to GetItem
  alternator: stop using an "internal" client_state
2024-08-19 16:09:53 +03:00
Tomasz Grabiec
c1de4859d8 Merge 'tablets: Fix race between repair and split' from Raphael "Raph" Carvalho
Consider the following:

```
T
0   split prepare starts
1                               repair starts
2   split prepare finishes
3                               repair adds unsplit sstables
4                               repair ends
5   split executes
```

If repair produces sstable after split prepare phase, the replica will not split that sstable later, as prepare phase is considered completed already. That causes split execution to fail as replicas weren't really prepared. This also can be triggered with load-and-stream which shares the same write (consumer) path.

The approach to fix this is the same employed to prevent a race between split and migration. If migration happens during prepare phase, it can happen source misses the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if underlying table is in split mode. That's implemented in replica::table for correctness, so if node crashes, the new sstable missing split is still split before added to the set.

Fixes #19378.
Fixes #19416.

**Please replace this line with justification for the backport/\* labels added to this PR**

Closes scylladb/scylladb#19427

* github.com:scylladb/scylladb:
  tablets: Fix race between repair and split
  compaction: Allow "offline" sstable to be split
2024-08-19 14:44:28 +02:00
Kefu Chai
151074240c utils: cached_file: use structured binding when appropriate
for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20184
2024-08-19 14:01:42 +03:00
Piotr Smaron
f773c76bfb codeowners: add appropriate reviewers to the cluster components 2024-08-19 12:39:47 +02:00
Anna Stuchlik
8fb746a5d2 doc: fix a link on the RBAC page
This commit fixes an external link on the Role Based Access Control page.

Fixes https://github.com/scylladb/scylladb/issues/20166

Closes scylladb/scylladb#20171
2024-08-19 12:56:38 +03:00
Piotr Smaron
cdc88cd06c tests: disable test_alternator_enforce_authorization_true
The test is flaky and needs to be fixed in order to not randomly break
our CI, OTOH can be commented out for the time being, so that we can
marge the feature.
2024-08-19 09:57:53 +02:00
Nadav Har'El
989dbef315 test, alternator: test for alternator_enforce_authorization config
This patch adds tests that demonstrates the current way that Alternator's
authentication and authorization are both enabled or disabled by the
option "alternator_enforce_authorization".

If in the future we decide to change this option or eliminate it (e.g.,
remain just with the "authenticator" and "authorizer" options), we can
easily update these tests to fit the new configuration parameters and
check they work as expected.

Because the new tests want to start Scylla instances with different
configuration parameters, they are written in the the "topology"
framework and not in the test/alternator framework. The test/alternator
framework still contains (test/alternator/test_cql_rbac.py) the vast
majority of the functional testing of the RBAC feature where all those
tests just assume that RBAC is enabled and needs to be tested.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
41418603e1 test/pylib: allow setting driver_connect() options in servers_add()
The manager.driver_connect() functions allows to pass parameters when
creating the connection (e.g., a special auth_provider), but unfortunately
right now the servers_add() function always calls driver_connect()
without parameters. So in this patch we just add a new optional
parameter to servers_add(), driver_connect_opts, that will be passed to
driver_connect().

In theory instead of the new option to driver_connect() a caller can
pass start=False to servers_add() and later call driver_connect()
manually with the right arguments. The problem is that start=False
avoids more than just calling driver_connect(), so it doesn't solve
the problem.

An example of using the new option is to run Scylla with authentication
enabled, and then connect to it using the correct default account
("cassandra"/"cassandra"):

    config = {
        'authenticator': 'PasswordAuthenticator',
        'authorizer': 'CassandraAuthorizer'
    }
    servers = await manager.servers_add(1, config=config,
        driver_connect_opts={'auth_provider':
            PlainTextAuthProvider(username='cassandra', password='cassandra')})

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
de20ac1a6d test: fix test_localnodes_joining_nodes
The existing test
topology_experimental_raft/test_alternator::test_localnodes_joining_nodes

Tried to create a second server but *not* wait for it to complete, but
the trick it used (cancelling the task) doesn't work since commit 2ee063c
makes a list of unwaited tasks and waits for them anyway. The test
*appears* to work because it is the last test in the file, but if we
ever add another test in the same file (like I plan to do in the next
patch), that other test will find a "BROKEN" ScyllaClusterManager and
report that it failed :-(

Other tricks I tried to use (like killing the servers) also didn't work
because of various limitations and complications of the test framework
and all its layers.

So not wanting to fight the fragile testing framework any more at this
point, I just gave up and the test will *wait* for the second server
to come up. This adds 120 seconds (!) to the test, but since this whole
test file already takes more than 500 seconds to complete, let's bite
this bullet. Maybe in the future when the test framework improves, we can
avoid this 120 second wait.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
79f9b3007e alternator, RBAC: reproducer for missing CDC auto-grant
This patch adds a reproducing (xfailing) test for issue #19798, which
shows that if a role is able to create an Alternator table, the role is
able to read the new table (this is known as "auto-grant"), but is NOT
able to read the CDC log (i.e., use Alternator Streams' "GetRecords").

Once we do fix this auto-grant bug, it's also important to also implement
auto-revoke - the permissions on a deleted table must be deleted as well
(otherwise the old owner of a deleted table will be able to read a new
table with the same name). This patch also adds a test verifying that
auto-revoke works. This test currently passes (because there is no auto-
grant, so nothing needs to be revoked...) but if we'll implement auto-grant
and forget auto-revoke, the second test will start to fail - so I added
this test as a precaution against a bad fix.

Refs #19798

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
7de6aedd47 alternator: document the new RBAC support
In docs/alternator/compatibility.md we said that although Alternator
supports authentication, it doesn't support authorization (access
control). Now it does, so the relevant text needs to be corrected
to fit what we have today.

It's still in the compatibility.md document because it's not the same
API as DynamoDB's, so users with existing applications may need to be
aware of this difference.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
f9ff475dfb alternator: add RBAC enforcement to GetRecords
This patch adds a requirement for the "SELECT" permission on a table to
run a GetRecords on it (the DynamoDB Streams API, i.e., CDC).

The grant is checked on the *CDC log table* - not on the base table,
which allows giving a role the ability to read the base but not is
change stream, or vice versa.

The operations ListStreams, DescribeStreams, GetShardIterators do not
require any permissions to run - they do not read any data, and are
(in my opinion) similar in spirit to DescribeTable, so I think it's fine
not to require any permissions for them.

A test is also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
0789841cf8 test/alternator: additional tests for RBAC
Additional tests for support for CQL Role-Based Access Control (RBAC)
in Alternator:

1. Check that even in an Alternator table whose name isn't valid as CQL
   table names (e.g., uses the dot character) the GRANT/REVOKE commands
   work as expected.

2. Check that superuser roles have full permissions, as expected.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
409fea5541 test/alternator: reduce permissions-validity-in-ms
We set in test/cql-pytest/run.py, affecting test/alternator/run, the
configuration permissions_validity_in_ms by default to 100ms. This means
that tests that need to check how GRANT or REVOKE work always need to
sleep for more than 100ms, which can make a test with a lot of these
operations very slow.

So let's just set this configuration value to 5ms. I checked that it
doesn't adversely affect the total running speed of test/alternator/run.

This change only affects running tests through test/alternator/run, which
is expected to be fast. I left the default for test.py as it was, 100ms,
the latency of individual tests is less important there.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
1b20a11dec test/alternator: add test for BatchGetItem from multiple tables
While working on the RBAC on BatchGetItem, I noticed that although
BatchGetItem may ask to read items from several tables, we don't have
a test covering this case! This patch fixes that testing oversight.

Note that for the write-side version of this operation, BatchWriteItem,
we do have tests that write to several tables in the same batch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
f827bd51d2 alternator: test for operations that do not need any permissions
Some operations, namely ListTables, DescribeTable, DescribeEndpoints,
ListTagsOfResource, DescribeTimeToLive and DescribeContinuousBackups
do not need any permissions to be GRANTed to a role.

Our rationale for this decision is that in CQL, "describe table" and
friends also do not require any permissions.

This patch includes a test that verifies that they really don't need
permissions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
9417cf8bcf alternator: add RBAC enforcement to UpdateTimeToLive
This patch adds a requirement for the "ALTER" permission on a table to
run a UpdateTimeToLive on it. UpdateTimeToLive is similar in purpose to
UpdateTable, so it makes sense to use the same permission "ALTER" as we
do for UpdateTable.

A tests is also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
e76316495c alternator: add RBAC enforcement to TagResource and UntagResource
This patch adds a requirement for the "ALTER" permission on a table to
run the TagResource or UntagResource operations on it. CQL does not
have an exact parallel of DynamoDB's tagging feature, but our usual
use of tags as an extension of UpdateTable to change non-standard options
(e.g., write isolation policy or tablets setup), so it makes sense to
require the same permissions we require for UpdateTable - namely "ALTER".

A test for both operations is also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:53 +02:00
Nadav Har'El
fda4a9fad8 alternator: add RBAC enforcement to BatchGetItem
This patch adds a requirement for the "SELECT" permission on a table to
run a BatchGetItem on it. A single batch may ask to write to several
different tables, so we fail the entire batch with AccessDeniedException
if any of the tables mentioned in the batch do not have SELECT permissions
for this role.

A tests is also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:57:51 +02:00
Nadav Har'El
b02288785f alternator: add RBAC enforcement to BatchWriteItem
This patch adds a requirement for the "MODIFY" permission on a table to
run a BatchWriteItem on it. A single batch may ask to write to several
different tables, so we fail the entire batch with AccessDeniedException
if any of the tables mentioned in the batch do not have MODIFY permissions
for this role.

A tests is also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:56:28 +02:00
Nadav Har'El
445a5d57cd alternator: add RBAC enforcement to UpdateTable
This patch adds a requirement for the "ALTER" permission on a table to
run a UpdateTable on it.

A tests is also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:45:22 +02:00
Nadav Har'El
b4484158e7 alternator: add RBAC enforcement to Query and Scan
This patch adds a requirement for the "SELECT" permission on a table to
run a Query or Scan on it.

Both Query and Scan operations call the same do_query() function, so the
permission checks are put there.

Note that Query can read from either the base table or one of its views,
and the permissions on the base and each of the views can be separate
(so we can allow a role to only read one view, for example).

Tests for all of the above are also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:45:22 +02:00
Nadav Har'El
82f7e55943 alternator: add RBAC enforcement to CreateTable
This patch adds a requirement for the "CREATE" permission on ALL
KEYSPACES to run a CreateTable operation.

The CreateTable operation also performs so-called "auto-grant": When a
role creates a table, it is automatically granted full permissions to
read, write, change or delete that new table.

A test for all these things is also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:45:22 +02:00
Nadav Har'El
79dfb7b7d5 alternator: add RBAC enforcement to DeleteTable
This patch adds a requirement for the "DROP" permission on a table to
run a DeleteTable on it.

Moreover, when a table and its views are deleted, any special permissions
previously GRANTed on this table are removed. This is necessary because
if a role creates a table it is automatically granted permissions on this
table (this is known as "auto-grant" - see the CreateTable patch for
details). If this role deletes this table and later a second role creates
a table with the same name, we don't want the first role to have
permissions on this new table.

Tests for permission enforcements and revocation on delete are also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:45:22 +02:00
Nadav Har'El
2ebc0501b8 alternator: add RBAC enforcement to UpdateItem
This patch adds a requirement for the "MODIFY" permission on a table to
run a UpdateItem on it.

Only the MODIFY permission is required, even if the operation may also
read the old value of the item, such as a read-modify-write operation
or even using ReturnValues='ALL_OLD'.

A test is also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:45:22 +02:00
Nadav Har'El
36d8aea654 alternator: add RBAC enforcement to DeleteItem
This patch adds a requirement for the "MODIFY" permission on a table to
run a DeleteItem on it.

Only the MODIFY permission is required, even if the operation may also
read the old value of the item (using ReturnValues='ALL_OLD').

A test is also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:45:22 +02:00
Nadav Har'El
34c975854a alternator: add RBAC enforcement to PutItem
This patch adds a requirement for the "MODIFY" permission on a table to
run a PutItem on it.

Only the MODIFY permission is required, even if the operation may also
read the old value of the item (using ReturnValues='ALL_OLD').

A test is also added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:45:22 +02:00
Nadav Har'El
3008b8416c alternator: add RBAC enforcement to GetItem
In this patch, we begin to add role-based access control (RBAC)
enforement to Alternator - in this patch only to GetItem.

After the preparation of client_state correctly in the previous patch,
the permission check itself in the get_item() function is very simple.
The bigger part of this patch is a full functional test in
test/alternator/test_cql_rbac.py. The test is quite self-explanatory
and heavily commented. Basically we check that a new role cannot
read with GetItem a pre-existing table, and we can add that ability
by GRANTing (in CQL) the new role the ability to SELECT the table,
the keyspace, all keyspaces, or add that ability to some other role
that this role inherits.

In the following patches, we will add role-based access control to
the Alternator operations, but the functional tests will be shorter -
we don't need to check the role inheritence, "all keyspaces" feature,
and so on, for every operation separately since they all use the
same underlying checking functions which handles these role inheritence
issues in exactly the same way.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:45:22 +02:00
Nadav Har'El
583f060bd8 alternator: stop using an "internal" client_state
Scylla uses a "client_state" object to encapsulate the information of
who the client is - its IP address, which user was authenticated, and so on.

For an unknown reason, Alternator created for each request an "internal"
client_state, meaning that supposedly the client for each request was
some sort of internal process (e.g., repair) rather than a real client.
This was wrong, and we even had a FIXME about not putting the client's
IP address in client_state.

So in this patch, we start using a normal "external" client_state
instead of an "internal" one. The client_state constructors are very
different in the two cases, so a few lines of code had to change.

I hope that this change will cause no functional changes. For example,
Alternator was already setting its own timeouts explicitly and not
relying on the default ones for external clients. However, we need to
fix this for the following patches which introduce permissions checks
(Role-Based Access Control - RBAC) - the client_state methods for
checking permissions become no-ops for *internal* clients (even if the
client_state contains an authenticated users). We need these functions
to do their job - so we need an *external* variant of client_state.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-08-19 09:45:22 +02:00
Tomasz Grabiec
ab7656a7be Merge 'replica: fix copy constructor of tablet_sstable_set' from Lakshmi Narayanan Sreethar
Commit 9f93dd9fa3 changed
`tablet_sstable_set::_sstable_sets` to be a `absl::flat_hash_map` and
in addition, `std::set<size_t> _sstable_set_ids` was
added. `_sstable_set_ids` is set up in the
`tablet_sstable_set(schema_ptr s, const storage_group_manager& sgm,
const locator::tablet_map& tmap)` constructor, but it is not copied in
`tablet_sstable_set(const tablet_sstable_set& o)`.

This affects the `tablet_sstable_set::tablet_sstable_set` method as it
depends on the copy constructor. Since sstable set can be cloned when
a new sstable set is added, the issue will cause ids not being copied
into the new sstable set. It's healed only after compaction, since the
sstable set is rebuilt from scratch there.

This PR fixes this issue by removing the existing copy constructor of
`tablet_sstable_set` to enable the implicit default copy constructor.

Fixes #19519

Closes scylladb/scylladb#20115

* github.com:scylladb/scylladb:
  boost/sstable_set_test: add testcase to test tablet_sstable_set copy constructor
  replica: fix copy constructor of tablet_sstable_set
2024-08-19 00:53:29 +02:00
Avi Kivity
390e01673b Merge 'Adding batch latency and batch size metrics to Alternator' from Amnon Heiman
This patch adds metrics for batch get_item and batch write_item.
The new metrics record summary and histogram for latencies and batch size.

Batch sizes are implemented as ever-growing counters. To get the average batch size divide the rate of
the batch size counter by the rate of the number of batch counter:
```rate(batch_get_item_batch_size)/rate(batch_get_item)```

Relates to #17615

New code, No need to backport

Closes scylladb/scylladb#20190

* github.com:scylladb/scylladb:
  Add tests for Alternator batch operation metrics
  alternator/executor: support batch latency and size metrics
  Add metrics for Alternator get and write batch operations
2024-08-18 21:22:39 +03:00
Amnon Heiman
63fdfb89cd Add tests for Alternator batch operation metrics
This patch adds unit tests to verify the correctness of the newly
introduced histogram metrics for get and write batch operation
latencies.

The test uses the existing latency test with the added metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-08-18 12:19:43 +03:00
Amnon Heiman
d20a333f51 alternator/executor: support batch latency and size metrics
This patch Updated the get and write batch operations in Alternator to
record latency using the newly added histogram metrics.
It adds logic to increment the counters with the number of items
processed in each batch.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-08-18 12:14:23 +03:00
Amnon Heiman
8bad4b44f8 Add metrics for Alternator get and write batch operations
Introduced histogram metrics to track latency for Alternator's get and
write batch operations.

Added counters to record the number of items processed in each batch
operation.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2024-08-18 12:09:46 +03:00
Lakshmi Narayanan Sreethar
ec47b50859 boost/sstable_set_test: add testcase to test tablet_sstable_set copy constructor
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-08-17 23:38:05 +05:30
Lakshmi Narayanan Sreethar
44583eed9e replica: fix copy constructor of tablet_sstable_set
Remove the existing copy constructor to enable the use of the implicit
copy constructor. This fixes the issue of `_sstable_set_ids` not being
copied in the current copy constructor.

Fixes #19519

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-08-17 23:37:58 +05:30
Kefu Chai
3d593ceeb1 perf/perf_sstable: add {crawling,partitioned}_streaming modes
for testing the load performance of load_and_stream operation.

Refs scylladb/scylladb#19989

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-17 14:43:54 +08:00
Kefu Chai
7806c72e49 test/perf/perf_sstable: use switch-case when appropriate
this change is a follow up of 06c60f6ab, which updated the
2nd step of the test to use switch-case, but missed the 1st step.
so this change updates the first step of the test as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-17 14:38:37 +08:00
Pavel Emelyanov
6a9b8ea135 sstable_directory: Coroutinize inner lambdas
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-16 10:45:27 +03:00
Pavel Emelyanov
7401c0ace2 sstable_directory: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-16 10:45:27 +03:00
Pavel Emelyanov
7422504d35 sstable_directory: Coroutinize outer cotinuation chain
Indentation is deliberately left broken

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-16 10:45:27 +03:00
Kefu Chai
e8f9f71ef3 test.py: fix the indent
and take this opportunity to fix a typo in comment.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-16 13:32:57 +08:00
Kefu Chai
e88166f7a4 test.py: use XPath for iterating in "TestSuite/TestSuite"
before this change, we check for the existence of "TestSuite" node
under the root of XML tree, and then enumerating all "TestSuite" nodes
under this "TestSuite", this approach works. but it

* introduces unnecessary indent
* is not very readable

in this change, we just use "./TestSuite/TestSuite" for enumerating
all "TestSuite" nodes under "TestSuite". simpler this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-16 13:32:33 +08:00
Kefu Chai
afee3924b3 s3/client: check for "Key" and "Value" tag in "Tag" XML tag
despite that the API document at
https://docs.aws.amazon.com/AmazonS3/latest/API/API_Tag.htm
claims that both these tags are "Required" in the "Tag" object returned
by S3 APIs, we still have to check them before dereferencing the pointer
of the child node, as we should not trust the output of an external API.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20160
2024-08-15 20:16:35 +03:00
Andrei Chekun
f24f5b7db2 test.py: Fix boost XML conversion to allure when XML file is empty
The method cannot find the TestSuite in the XML file and fails the whole job, however tests are passed. The issue was in incorrect understanding of boost summarization method. It creates one file for all modes, so there is no need to go through all modes to convert the XML file for allure.

Closes: https://github.com/scylladb/scylladb/issues/20161

Closes scylladb/scylladb#20165
2024-08-15 20:15:31 +03:00
Benny Halevy
52234214e5 schema_tables: calculate_schema_digest: filter the key earlier
Currently, each frozen mutation we get from
system_keyspace::query_mutations is unfrozen in whole
to a mutation and only then we check its key with
the provided `accept_keyspace` function.

This is wasteful, since they key can be processed
directly form the frozen mutation, before taking
the toll of unfreezing it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-15 12:33:34 +03:00
Benny Halevy
95a5fba0ea schema_tables: calculate_schema_digest: prevent stalls due to large mutations vector
With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls
when freed.

For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
 (inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
 (inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```

This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time,
and by that, reducing memory footprint and preventing
reactor stalls.

Fixes #18173

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-15 12:33:34 +03:00
Kefu Chai
c628fa4e9e tools: enhance scylla sstable shard-of to support tablets
before this change, `scylla sstable shard-of` didn't support tablets,
because:

- with tablets enabled, data distribution uses the scheduler
- this replaces the previous method of mapping based on vnodes and shard numbers
- as a result, we can no longer deduce sstable mapping from token ranges

in this change, we:
- read `system.tablets` table to retrieve tablet information
- print the tablet's replica set (list of <host, shard> pairs)
- this helps users determine where a given sstable is hosted

This approach provides the closest equivalent functionality of
`shard-of` in the tablet era.

Fixes scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-15 15:49:55 +08:00
Kefu Chai
4291033b14 replica/tablets: extract tablet_replica_set_from_cell()
so it can be reused to implement a low-level tool which reads tablets
data from sstables

Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-15 15:49:55 +08:00
Kefu Chai
e1162e0dae tools: extract get_table_directory() out
the `get_table_directory()` function will have applications
beyond its current use in `schema_loader.cc`. its ability to locate
the directory storing the sstables of given table could be valuable
in other subcommand(s) implementation.

so, in this change we extract it out into a dedicated source file,
so that it accept the primary_key and an optional clustering_key.

Refs scylladb/scylladb#16488

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-15 15:49:55 +08:00
Kefu Chai
a04e0b6c7d tools: extract read_mutation out
the `read_mutation_from_table_offline()` function will have applications
beyond its current use in `schema_loader.cc`. its ability to parser
mutation data from sstables could be valuable in other subcommand(s)
implementation.

so, in this change we extract it out into a dedicated source file,
so that it accept the primary_key and an optional clustering_key.

Refs scylladb/scylladb#16488

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-15 15:49:55 +08:00
Kefu Chai
74a670dd19 build: split the list of source file across multiple line
Split the extended list of source files across multiple lines.
This improves readability and makes future additions easier to
review in diffs.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-15 15:49:55 +08:00
Kefu Chai
3f8f1d7274 tools/scylla-sstable: print warning when running shard-of with tablets
the subcommand of "shard-of" does not support tablets yet. so let's
print out an error message, instead of printing the mapping assuming
that the sstables are distributed based on token only.

this commit also adds two more command line options to this subcommand,
so that user is required to specify either "--vnodes" or "--tablets"
to instruct the tool how the cluster distributes the tokens across nodes
and their shards. this helps to minimize the suprise of user.

this change prepares for the succeeding changes to implement the tablets
support.

the corresponding test is updated accordingly so that it only exercises
the "shard-of" subcommand without tablets. we will test it with tablets
enabled in a succeeding change.

Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-15 15:49:55 +08:00
Laszlo Ersek
baf6ec49ff utils/tagged_integer: remove conversion to underlying integer
Silently converting a tagged (i.e., "dimension-ful") integer to a naked
("dimensionless") integer defeats the purpose of having tagged integers,
and is a source of practical bugs, such as
<https://github.com/scylladb/scylladb/issues/20080>.

We could make the conversion operator explicit, for enforcing

  static_cast<TAGGED_INTEGER_TYPE::value_type>(TAGGED_INTEGER_VALUE)

in every conversion location -- but that's a mouthful to write. Instead,
remove the conversion operator, and let clients call the (identically
behaving) value() member function.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-15 02:12:58 +02:00
Laszlo Ersek
9aa7d232d6 test/raft/randomized_nemesis_test: clean up remaining index_t usage
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "test/raft/randomized_nemesis_test.cc":

- addition of tagged and untagged (both should be tagged)

- taking the minimum of an index difference and a container size (both
  should be untagged)

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 22:54:42 +02:00
Laszlo Ersek
1af3460a81 test/raft/randomized_nemesis_test: clean up index_t usage in store_snapshot()
With implicit conversion of tagged integers to untagged ones going away,
unpack and clean up the relatively complex

  first_to_remain = max(snap.idx + 1 - preserve_log_entries, 0)

calculation in persistence::store_snapshot().

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 22:54:42 +02:00
Laszlo Ersek
4dc2faa49a test/raft/replication: clean up remaining index_t usage
With implicit conversion of tagged integers to untagged ones going away,
explicitly untag the operands / arguments of the following operations, in
"test/raft/replication.hh":

- assignment to raft_cluster::_seen

- call to hasher_int::hash_range()

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 22:54:42 +02:00
Laszlo Ersek
3a32f3de81 test/raft/replication: take an "index_t start_idx" in create_log()
raft_cluster::get_states() passes a "start_idx" to create_log(), and
create_log() uses it as an "index_t" object. Match the type of "start_idx"
to its name.

This patch is best viewed with "git show -W".

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 22:54:42 +02:00
Laszlo Ersek
08e117aeb5 test/raft/replication: untag index_t in test_case::get_first_val()
In test_case::get_first_val(), the asssignment

  first_val = initial_snapshots[initial_leader].snap.idx;

*both* relies on implicit conversion of the tagged integer type "index_t"
to the underlying "uint64_t", *and* is a logic bug, as reported at
<https://github.com/scylladb/scylladb/issues/20151>.

For now, wean the buggy asssignment off the disappearing
tagged-to-untaggged conversion.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 22:54:42 +02:00
Laszlo Ersek
6254fca7f5 test/raft/etcd_test: tag index_t and term_t for comparisons and subtractions
Properly annotate index_t and term_t constants for use in
BOOST_CHECK_EQUAL() and BOOST_CHECK().

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 22:54:42 +02:00
Laszlo Ersek
bd4fc85bf0 test/raft/fsm_test: tag index_t and term_t for comparisons and subtractions
Properly annotate index_t and term_t constants for use in
BOOST_CHECK_EQUAL(), BOOST_CHECK(). Clean up the first args of
read_quorum() calls -- stay in term_t space.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 22:54:42 +02:00
Laszlo Ersek
265655473e test/raft/helpers: tighten compare_log_entries() param types
The "from" and "to" parameters of compare_log_entries() are raft log
indices; change them to raft::index_t, and update the callers.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 22:54:42 +02:00
Piotr Smaron
3e3858521d codeowners: add appropriate reviewers to the frontend components 2024-08-14 22:26:35 +02:00
Piotr Smaron
1b2e88b96a codeowners: fix codeowner names 2024-08-14 22:26:26 +02:00
Laszlo Ersek
5dcc627465 service/raft_sys_table_storage: tweak dead code
In raft_sys_table_storage::store_snapshot_descriptor(), the condition

  preserve_log_entries > snap.idx

*both* relies on implicit conversion of the tagged integer type "index_t"
to the underlying "uint64_t", *and* is a logic bug, as reported at
<https://github.com/scylladb/scylladb/issues/20080>.

Ticket#20080 explains that this condition always evaluates to false in
practice, and that the "else" branch handles all cases correctly anyway.
For now, wean the buggy expression off the disappearing
tagged-to-untaggged conversion.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 21:35:34 +02:00
Andrei Chekun
3407ae5d8f [test.py] Add Junit logger for boost test
Currently, boost tests aren't using Junit. Enable Junit report output and clean them from skipped test, since boost tests are executed by function name rather than filename. This allows including boost tests result to the Allure report.

Related: https://github.com/scylladb/qa-tasks/issues/1665

Closes scylladb/scylladb#19925
2024-08-14 22:18:31 +03:00
Avi Kivity
6d6f93e4b5 Merge 'test/nodetool: enable running nodetool tests under test/nodetool' from Kefu Chai
before this change, we assume user runs nodetool tests right under the root source directory. if user runs them under `test/nodetool`, the suppression rules are not applied. as the path is incorrect in that case.

after this change, the supression rules' path is deduced from the top src directory. so we can now run the nodetool test under `test/nodetool` .

---

no need to backport, this change improves developer's experience.

Closes scylladb/scylladb#20119

* github.com:scylladb/scylladb:
  test/nodetool: deduce subpression path from top srcdir
  test/nodetool: deduce path from top srcdir
2024-08-14 22:10:38 +03:00
Michał Jadwiszczak
f7eb74e31f cql3/statements/create_service_level: forbid creating SL starting with $
Tenant names starting with `$` are reserved for internal ones.
Forbid creating new service level which name starts with `$`
and log a warning for existing service levels with `$` prefix.

Closes scylladb/scylladb#20122
2024-08-14 21:25:31 +03:00
Kefu Chai
5ce07e5d84 build: cmake: add compiler-training target
`tools/toolchain/optimized_clang.sh` builds this target for creating
the profile in order to build clang optimized with this profile data.

so let's be compatible with `configure.py`, and add this target to
CMake building system as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20105
2024-08-14 21:21:33 +03:00
Ernest Zaslavsky
f5f65ead1e Add .clang-format, also add CLion build folder to the .gitignore file
Closes scylladb/scylladb#20123
2024-08-14 21:20:29 +03:00
Pavel Emelyanov
66d72e010c distributed_loader: Lock table via global table ptr
The lock_table() method needs database, ks and cf to find the table on
all shards. The same can be achieved with the help of global_table_ptr
thing that all the core callers already have at hand.

There's a test that doesn't have global table, but it can get one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20139
2024-08-14 20:53:21 +03:00
Pavel Emelyanov
7e3e5cfcad sstable_directory: Simplify special-purpose local-only constructor
Typically the sstable_directory is constructed out of a table object.
Some code, namely tests and schema-loader, don't have table at hand and
construct directory out of schema, sharder, path-to-sstables, etc. This
code doesn't work with any storage options other than local ones, so
there's no need (yet) to carry this argument over.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20138
2024-08-14 20:22:50 +03:00
Avi Kivity
28d3b91cce Merge 'test/perf/perf_sstables: use test_modes as the type of its option' from Kefu Chai
before this change, we look up for the mode using the command line
option as the key, but that's incorrect if the command line option does
not match with any of the known names. in that case, `test_mode` just
create another pair of <sstring, test_modes>, and return the second
component of this pair. and the second component is not what we expect.
we should have thrown an exception.

in this change

* the test_mode map is marked const.
* the overloads for parsing / formatting the `test_modes` type are
  added, so that boost::program_options can parse and format it.

after this change, we  print more user friendly error, like

```
/scylla perf-sstable --mode index-foo
error: the argument ('index-foo') for option '--mode' is invalid

Try --help.
```

instead of a bunch of output which is printed as if we passes the correct option as the argument of the `--mode` option.

---

it's an improvement of developer experience, hence no need to backport.

Closes scylladb/scylladb#20140

* github.com:scylladb/scylladb:
  test/perf/perf_sstable: use switch-case when appropriate
  test/perf/perf_sstables: use test_modes as the type of its option
2024-08-14 20:18:22 +03:00
Piotr Smaron
31cb5b132b codeowners: remove non contributors 2024-08-14 18:52:25 +02:00
Avi Kivity
3de4e8f91b Merge 'cql: process LIMIT for GROUP BY select queries' from Paweł Zakrzewski
This change fixes #17237, fixes #5361 and fixes #5362 by passing the limit value down the call chain in cql3. A test is also added.

fixes #17237
fixes #5361
fixes #5362

The regression happened in 5.4 as we changed the way GROUP BY is processed in 432cb02 - to force aggregation when it is used. The LIMIT value was not passed to aggregations and thus we failed to adhere to it.

W want to backport this fix to 5.4 and 6.0 to have continuous correct results for the test case from #17237

This patch consists of 4 commits:
- fa4225ea0fac2057b7a9976f57dc06bcbd900cd4 - cql3: respect the user-defined page size in aggregate queries - a precondition for this patch to be implementable
- 8fbe69e74dca16ed8832d9a90489ca47ba271d0b - cql3/select_statement: simplify the get_limit function - the `do_get_limit()` function did a lot of legwork that should not be associated with it. This change makes it trivial and makes its callers do additional checks (for unset guards, or for an aggregate query)
- 162828194a2b88c22fbee335894ff045dcc943c9 - cql3: process LIMIT for GROUP BY queries - pass the limit value down the chain and make use of it. This is the actual fix to #17237
- b3dc6de6d6cda8f5c09b01463bb52f827a6a00b4 - test/cql-pytest: Add test for GROUP BY queries with LIMIT - tests

Closes scylladb/scylladb#18842

* github.com:scylladb/scylladb:
  test/cql-pytest: Add test for GROUP BY queries with LIMIT
  cql3: process LIMIT for GROUP BY queries
  cql3/select_statement: simplify the get_limit function
  cql3: respect the user-defined page size in aggregate queries
2024-08-14 17:54:59 +03:00
Avi Kivity
8c257db283 Merge 'Native reverse pages over RPC' from Łukasz Paszkowski
Drop half-reversed (legacy) format of query::partition_slice.

The select query builds a fully reversed (native) slice for reversed queries and use it together with a reversed
schema to construct query::read_command that is further propagated to the database.

A cluster feature is added to support nodes that still operate on half-reversed slices. When the feature is turned off:
- query::read_command is transformed (to have table schema and half-reversed slices) before sending to other nodes
- query::read_command is transformed (to have query schema (reversed) and reversed slices) after receiving it from other nodes
- Similarly, mutations are transformed. They are reversed before being sent to other nodes or after receiving them from other nodes.

Additional manual tests were performed to test a mixed-node cluster:

1. 3-node cluster with one node upgraded: reverse read queries performed on an old node
2. 3-node cluster with one node upgraded: reverse read queries performed on a new node
3. 3-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on an old node
4. 3-node cluster with one node upgraded and all its sstable files deleted to trigger repair: reverse read queries performed on a new node

All reverse read queries above consists of:

- single-partition reverse reads with no clustering key restrictions, with single column restrictions and multi column restrictions both with and without paging turned on
- multi-partition reverse reads with range restrictions with optional partition limit and partial ordering

The exact same tests were also performed on a fully upgraded cluster.

Fixes https://github.com/scylladb/scylladb/issues/12557

Closes scylladb/scylladb#18864

* github.com:scylladb/scylladb:
  mutation_partition: drop reverse parameter in compact_for_query
  clustering_key_filter: unify get_ranges and get_native_ranges
  streamed_mutation_freezer: drop the reverse parameter
  reverse-reads.md: Drop legacy reverse format information
  Fix comments refering to half-reversed (legacy) slices
  select_statement::do_execute: Add tracing informaction
  query::trim_clustering_row_ranges_to: require reversed schema for native reversed ranges
  query-request: Drop half_reverse_slice as it is no longer used anywhere
  readers: Use reversed schema and native reversed slices
  database: accept reversed schema for reversed queries
  storage_proxy: Support reverse queries in native format
  query_pagers: Replace _schema with _query_schema
  query_pagers: Support reverse queries in native format
  select_statement: Execute reversed query in native format
  storage_proxy::remote: Add support for mixed-node clusters
  mutation_query: Add reversed function to reverse reconcilable_result
  query-request: Add reversed function to reverse read_command
  features: add native_reverse_queries
  kl::reader::make_reader: Unify interface with mx::reader::make_reader
  config: drop reversed_reads_auto_bypass_cache
  config: drop enable_optimized_reversed_reads
2024-08-14 17:51:56 +03:00
Anna Stuchlik
99be8de71e doc: set 6.1 as the latest stable version
This commit updates the configuration for ScyllaDB documentation so that:
- 6.1 is the latest version.
- 6.1 is removed from the list of unstable versions.

It must be merged when ScyllaDB 6.1 is released.

No backport is required.

Closes scylladb/scylladb#20041
2024-08-14 13:43:17 +02:00
Laszlo Ersek
d87d1ae29d service/raft_sys_table_storage: simplify (snap.idx - preserve_log_entries)
With conversion of tagged integers to untagged ones going away, replace

  static_cast<uint64_t>(snap.idx)

with

  snap.idx.value()

Furthermore, casting "preserve_log_entries" (of type "size_t") to
"uint64_t" is redundant (both "snap.idx" and "preserve_log_entries" carry
nonnegative values, and the mathematical difference is expected to be
nonnegative); remove the cast.

Finally, simplify the initialization syntax.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
Laszlo Ersek
e781046739 service/raft_sys_table_storage: untag index_t and term_t for queries
With implicit conversion of tagged integers to untagged ones going away,
explicitly untag index_t and term_t values in the following two contexts:

- when they are passed to CQL queries as int64_t,

- when they are default-constructed as fallbacks for int64_t fields
  missing from CQL result sets.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
Laszlo Ersek
4f1f207be1 raft/server: clean up index_t usage
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "raft/server.cc":

- addition of tagged and untagged (both should be tagged)

- subscripting an array by tagged (should be untagged)

- comparing a size-like threshold against tagged (should be untagged)

- exposing tagged via gauges (should be untagged)

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
Laszlo Ersek
1b134d52ac raft/tracker: don't drop out of index_t space for subtraction
Tagged integers support subtraction; use it.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
Laszlo Ersek
b6233209d9 raft/fsm: clean up index_t and term_t usage
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in "raft/fsm.cc":

- addition of tagged and untagged (both should be tagged)

- comparison (relop) between tagged an untagged (both should be tagged)

- subscripting or sizing an array by tagged (should be untagged)

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
Laszlo Ersek
5b9a4428c6 raft/log: clean up index_t usage
With implicit conversion of tagged integers to untagged ones going away,
explicitly tag (or untag, as necessary) the operands of the following
operations, in raft/log.{cc,h}:

- addition of tagged and untagged (both should be tagged)

- comparison (relop) between tagged an untagged (both should be tagged)

- subscripting an array, or offsetting an iterator, by tagged (should be
  untagged)

- comparing an array bound against tagged (should be untagged)

- subtracting tagged from an array bound (should be untagged)

Note: these files mix uniform initialization syntax (index_t{...}) with
constructor call syntax (index_t()), with the former being more frequent.
Stick with the former here too, for consistency.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
Laszlo Ersek
9e95f3a198 db/system_keyspace: promise a tagged integer from increment_and_get_generation()
Internally, increment_and_get_generation() produces a
"gms::generation_type" value.

In turn, all callers of increment_and_get_generation() -- namely
scylla_main() [main.cc] and single_node_cql_env::run_in_thread()
[test/lib/cql_test_env.cc] -- pass the resolved value to
storage_service::init_address_map() and storage_service::join_cluster(),
both of which take a "gms::generation_type".

Therefore it is pointless to "untag" the generation value temporarily
between the producer and the consumers. Correct the return type of
increment_and_get_generation().

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
Laszlo Ersek
baccbc09c5 gms/gossiper: return "strong_ordering" from compare_endpoint_startup()
The callers of gossiper::compare_endpoint_startup() need not (should not)
learn of any particular (tagged or untagged) difference of generations;
they only care about the ordering of generations. Change the return type
of compare_endpoint_startup() to "std::strong_ordering", and delegate the
comparison to tagged_tagged_integer::operator<=>.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
Laszlo Ersek
3bb608056c gms/gossiper: get "int32_t" value of "gms::version_type" explicitly
In do_sort(), we need to drop to "int32_t" temporarily, so that we can
call ::abs() on the version difference. Do that explicitly.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
Michał Chojnowski
4d77faa61e cql_test_env: ensure shutdown() before stop() for system_keyspace
If system_keyspace::stop() is called before system_keyspace::shutdown(),
it will never finish, because the uncleared shared pointers will keep
it alive indefinitely.

Currently this can happen if an exception is thrown before the construction
of the shutdown() defer. This patch moves the shutdown() call to immediately
before stop(). I see no reason why it should be elsewhere.

Fixes scylladb/scylla-enterprise#4380

Closes scylladb/scylladb#20089
2024-08-14 12:16:44 +03:00
Kefu Chai
06c60f6abe test/perf/perf_sstable: use switch-case when appropriate
instead of using a chain of `if-else`, use switch-case instead,
it's visually easier to follow than `if`-`else` blocks. and since
we never need to handle the `else` case, the `throw` statement
is removed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-14 17:14:42 +08:00
Kefu Chai
5141c6efe0 test/perf/perf_sstables: use test_modes as the type of its option
before this change, we look up for the mode using the command line
option as the key, but that's incorrect if the command line option does
not match with any of the known names. in that case, `test_mode` just
create another pair of <sstring, test_modes>, and return the second
component of this pair. and the second component is not what we expect.
we should have thrown an exception.

in this change

* the test_mode map is marked const.
* the overloads for parsing / formatting the `test_modes` type are
  added, so that boost::program_options can parse and format it.

after this change,

* we can print more user friendly error, like

```
/scylla perf-sstable --mode index-foo
error: the argument ('index-foo') for option '--mode' is invalid

Try --help.
```

instead of a bunch of output which is printed as if we passes the
correct option as the argument of the `--mode` option.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-14 17:14:42 +08:00
Dawid Medrek
4ba9cb0036 README: Update the version of C++ to C++23
Scylla has started being built with C++23. We update
the information in the relevant documents accordingly.

Closes scylladb/scylladb#20134
2024-08-14 12:06:23 +03:00
Kamil Braun
a3d53bd224 Merge 'Prevent ALTERing non-existing KS with tablets' from Piotr Smaron
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required to execute this req,
2. global topo req is executed by topo coordinator, which reads data attached to the req.

The KS name is among the data attached to the req. There's a time window between these steps where a to-be-altered KS could have been DROPped, which results in topo coordinator forever trying to ALTER a non-existing KS. In order to avoid it, the code has been changed to first check if a to-be-altered KS exists, and if it's not the case, it doesn't perform any schema/tablets mutations, but just removes the global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected changes, which is due to the fact that the code is written badly and needs to be refactored - an effort that's already planned under #19126
(I suggest to disable displaying whitespace differences when reviewing this PR).

Fixes: scylladb/scylladb#19576

Closes scylladb/scylladb#19666

* github.com:scylladb/scylladb:
  tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
  cql: refactor rf_change indentation
  Prevent ALTERing non-existing KS with tablets
2024-08-14 10:27:41 +02:00
Piotr Smaron
ddb5204929 tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
Using the error injection framework, we inject a sleep into the
processing path of ALTER tablets KS, so that the topology coordinator of
the leader node
sleeps after the rf_change event has been scheduled, but before it is
started to be executed. During that time the second node executes a DROP
KS statement, which is propagated to the leader node. Once leader node
wakes up and resumes processing of ALTER tablets KS, the KS won't exist
and the node cannot crash, which was the case before.
2024-08-13 21:51:51 +02:00
Pavel Emelyanov
05adee4c82 test: Add test for s3::client::bucket_lister
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-13 21:15:43 +03:00
Pavel Emelyanov
a02e65c649 s3_client: Add bucket lister
The lister resembles the directory_lister from util -- it returns
entries upon its .get() invocation, and should be .close()d at the end.

Internally the lister issues ListObjectsV2 request with provided prefix
and limits the server with the amount of entries returned not to consume
too much local memory (we don't have streaming XML parser for response).
If the result is indeed truncated, the subsequent calls include the
continuation token as per [1]

[1] https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-13 21:15:43 +03:00
Avi Kivity
d82fd8b5f0 Merge 'Relax sstable_directory::process_descriptor() call graph' from Pavel Emelyanov
The method logic is clean and simple -- load sstable from the descriptor and sort it into one of collections (local, shared, remote, unsorted). To achieve that there's a bunch of helper methods, but they duplicate functionality of each other. Squashing most of this code into process_descriptor() makes it easier to read and keeps sstable_directory private API much shorter.

Closes scylladb/scylladb#20126

* github.com:scylladb/scylladb:
  sstable_directory: Open-code load_sstable() into process_descriptor()
  sstable_directory: Squash sort_sstable() with process_descriptor()
  sstable_directory: Remove unused sstable_filename(desc) helper
  sstable_directory: Log sst->get_filename(), not sstable_filename(desc)
  sstable_directory: Keep loaded sst in local var
  sstable_directory: Remove unused helpers
  sstable_directory: Load sstable once when sorting
2024-08-13 16:42:52 +03:00
Pavel Emelyanov
d3870304a9 sstable_directory: Open-code load_sstable() into process_descriptor()
There are two load_sstable() overloads, and one of them is only used
inside process_descriptor(). What this loading helper does is, in fact,
processes given descriptor, so it's worth having it open-coded into its
caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-13 13:27:00 +03:00
Pavel Emelyanov
da4a5df339 sstable_directory: Squash sort_sstable() with process_descriptor()
The latter (caller) loads sstable, so does the former, so load it once
and then put it in either list/set, depending on flags and shard info.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-13 13:26:10 +03:00
Pavel Emelyanov
d8cb175fb7 sstable_directory: Remove unused sstable_filename(desc) helper
It's unused after previous patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-13 12:55:40 +03:00
Pavel Emelyanov
aa40aeb72f sstable_directory: Log sst->get_filename(), not sstable_filename(desc)
There are some places that log sstable Data file name via sstable
descriptor. After previous patching all those loggers have sstable at
hand and can use sstable::get_filename() instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-13 12:55:40 +03:00
Pavel Emelyanov
369f9111b8 sstable_directory: Keep loaded sst in local var
This will make next patch shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-13 12:55:40 +03:00
Pavel Emelyanov
ad3725fbbd sstable_directory: Remove unused helpers
After previous patch some wrappers around load_sstable() became unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-13 12:55:40 +03:00
Pavel Emelyanov
63f1969e08 sstable_directory: Load sstable once when sorting
In order to decide which list to put sstable into, the sort_sstable()
first calls get_shards_for_this_sstable() which loads the sstable
anyway. If loaded shards contain only the current one (which is the
common case) sstable is loaded again. In fact, if the sstable happens to
be remote it's loaded anyway to get its open info.

Fix that by loading sstable, then getting shards directly from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-13 12:55:16 +03:00
Łukasz Paszkowski
ba2f037af5 mutation_partition: drop reverse parameter in compact_for_query
The reverse parameter is no longer used with native reverse reads.
The row ranges are provided in native reverse order together with
a reversed schema, thus the reverse parameter remain false all the
time and can be droped.
2024-08-13 10:07:12 +02:00
Łukasz Paszkowski
43221bbeed clustering_key_filter: unify get_ranges and get_native_ranges
When a reverse slice is provided, it is given in the native reverse
format. Thus the ranges will be returned in the same order as stored
in the slice.

Therefore there is no need to distinguish between get_ranges and
get_native_ranges. The latter one gets dropped and get_ranges returns
ranges in the same order as stored in the slice.
2024-08-13 10:07:12 +02:00
Łukasz Paszkowski
8b5ec0e963 streamed_mutation_freezer: drop the reverse parameter
The reverse parameter is no longer used with native reverse reads.
A reversed schema is provided and thus the reverse parameter shall
remain false all the time.
2024-08-13 10:07:12 +02:00
Łukasz Paszkowski
f4ca734ccb reverse-reads.md: Drop legacy reverse format information 2024-08-13 10:07:12 +02:00
Łukasz Paszkowski
b3bf555036 Fix comments refering to half-reversed (legacy) slices 2024-08-13 10:07:12 +02:00
Łukasz Paszkowski
15a01c7111 select_statement::do_execute: Add tracing informaction
Add information on table and query schema versions to tracing.
2024-08-13 10:07:12 +02:00
Łukasz Paszkowski
158b994676 query::trim_clustering_row_ranges_to: require reversed schema for native reversed ranges
Simplify implementation and for clustering key ranges in native
reversed format, require a reversed table schema.

Trimming native reversed clustering key ranges requires a reversed
schema to be passed in. Thus, the reverse flag is no longer required
as it would always be set to false.
2024-08-13 10:07:10 +02:00
Łukasz Paszkowski
8d95d44027 query-request: Drop half_reverse_slice as it is no longer used anywhere 2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
da95f44adc readers: Use reversed schema and native reversed slices
The reconcilable_result is built as it would be constructed for
forward read queries for tables with reversed order.

Mutations constructed for reversed queries are consumed forward.

Drop overloaded reversed functions that reverse read_command and
reconcilable_result directly and keep only those requiring smart
pointers. They are not used any more.
2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
faa62310d9 database: accept reversed schema for reversed queries
Remove schema reversing in query() and query_mutations() methods.
Instead, a reversed schema shall be passed for reversed queries.
Rename a schema variable from s into query_schema for readability.
2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
df734e35a1 storage_proxy: Support reverse queries in native format
For reversed queries, query_result() method accepts a reversed table
schema and read_command with a query schema version and a slice in
native reversed format.

Support mixed-node clusters. In such a case, the feature flag
native_reverse_queries is disabled and the read_command in sent
to replicas in the old regacy format (stores table schema version
and a slice in the legacy reverse format).

After the reconciliation, for the read+repair case, un-reversed
mutations are sent to replicas, i.e. forward ones.
2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
d9e76a5295 query_pagers: Replace _schema with _query_schema
For readability purposes. As the constructor accepts a query schema,
let the varaible holding a schema be called _query_schema.
2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
0b2e5ff28f query_pagers: Support reverse queries in native format
For reversed queries, accept a reversed table schema and read_command
with a query schema version and a slice in native reversed format.
2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
309ba68692 select_statement: Execute reversed query in native format
Use a reversed schema and a native reversed slice when constructing
a read_command and executing a reversed select statement.

Such a created read_command is passed further down to query_pagers::pager
and storage::proxy::query_result that transform it to the format
they accept/know, i.e. lagacy.
2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
8c391a8ebe storage_proxy::remote: Add support for mixed-node clusters
In handle_read, detect whether a coming read_command is in the
legacy reversed format or native reversed format.

The result will be used to transform the read_command between format
as well as to transforms the results before they are send back to
the coordinator.
2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
fbd324b5cd mutation_query: Add reversed function to reverse reconcilable_result
The reconcilable_result is reversed by reversing mutations for all
paritions it holds. Reversing is asynchronous to avoid potential
stall.

Use for transitions between legacy and native formats and in order
to support mixed-nodes clusters.
2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
b91edbacf1 query-request: Add reversed function to reverse read_command
The read_command is reversed by reversing the schema version it
holds and transforming a slice from the legacy reversed format to
the native reversed format.

Use for trasition between format and to support mixed-nodes clusters
2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
9690785112 features: add native_reverse_queries
Enabled when all replicas support the native_reversed command slice
and return the result in reverse order in this case.
2024-08-13 10:03:42 +02:00
Łukasz Paszkowski
7b201e9165 kl::reader::make_reader: Unify interface with mx::reader::make_reader
Ensure both readers have the same interfaces to avoid mistakes as
both readers are used in sstable::make_reader. Less error prone.
2024-08-13 10:02:43 +02:00
Łukasz Paszkowski
b270097f1f config: drop reversed_reads_auto_bypass_cache
Reverse reads have already been with us for a while, thus this back
door option to bypass in-memory data cache for reversed queries can
be retired.
2024-08-13 10:02:42 +02:00
Łukasz Paszkowski
80df313f49 config: drop enable_optimized_reversed_reads
Reverse reads have already been with us for a while, thus this back
door option to read entire paritions forward and reversing them after
can be retired.
2024-08-13 10:02:42 +02:00
Pavel Emelyanov
6675bd8a5c s3_client: Encode query parameter value for query-string
When signing AWS query one need to prepare "query string" which is a
line looking like `encode(query_param)=encode(query_value)&...`. Encoded
are only the query parameter names and values. It was missing in current
code and so far worked because no encodable characters were used.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-13 10:59:31 +03:00
Raphael S. Carvalho
74612ad358 tablets: Fix race between repair and split
Consider the following:

T
0   split prepare starts
1                               repair starts
2   split prepare finishes
3                               repair adds unsplit sstables
4                               repair ends
5   split executes

If repair produces sstable after split prepare phase, the replica
will not split that sstable later, as prepare phase is considered
completed already. That causes split execution to fail as replicas
weren't really prepared. This also can be triggered with
load-and-stream which shares the same write (consumer) path.

The approach to fix this is the same employed to prevent a race
between split and migration. If migration happens during prepare
phase, it can happen source misses the split request, but the
tablet will still be split on the destination (if needed).
Similarly, the repair writer becomes responsible for splitting
the data if underlying table is in split mode. That's implemented
in replica::table for correctness, so if node crashes, the new
sstable missing split is still split before added to the set.

Fixes #19378.
Fixes #19416.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-08-12 17:28:51 -03:00
Raphael S. Carvalho
239344ab55 compaction: Allow "offline" sstable to be split
In order to fix the race between split and repair, we must introduce
the ability to split an "offline" sstable, one that wasn't added
to any of the table's sstable set yet.

It's not safe to split a sstable after adding it to the set, because
a failure to split can result in unsplit data left in the set, causing
split to fail down the road, since the coordinator thinks this replica
has only split data in the set.

Refs #19378.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-08-12 17:27:16 -03:00
Laszlo Ersek
607abe96e8 test/sstable: merge test_using_reusable_sst*()
All lambdas passed to test_using_reusable_sst() conform to the prototype

  void (test_env&, sstable_ptr)

All lambdas passed to test_using_reusable_sst_returning() conform to the
prototype

  NON_VOID (test_env&, sstable_ptr)

The common parameter list of both prototypes can be expressed with the
concept

  std::invocable<test_env&, sstable_ptr>

Once a "Func" template parameter (i.e., function type) satisfying this
concept is taken, then "Func"'s void or non-void return type can be
commonly expressed with

  std::invoke_result_t<Func, test_env&, sstable_ptr>

In turn, test_env::do_with_async_returning<...> can be instantiated with
this return type, even if it happens to be "void".

([stmt.return] specifies, "[a] return statement with an operand of type
void shall be used only in a function that has a cv void return type",
meaning that

  return func(env)

will do the right thing in the body of
test_env::do_with_async_returning<void>().)

Merge test_using_reusable_sst() and test_using_reusable_sst_returning()
into one. Preserve the function name from the former, and the
test_env::do_with_async_returning<...>() call from the latter.

Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>

Closes scylladb/scylladb#20090
2024-08-12 17:52:01 +03:00
Kefu Chai
db4654ca49 test/nodetool: deduce subpression path from top srcdir
there are chances that developer launch `pytest` right under
`test/nodetool`, in that case current working directory is not
the root directory of the project, so the path to suppression rules
does not point to a file.

to cater the needs to run the test under `test/nodetool`, let's
use the path deduced from the top_srcdir.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-12 22:50:18 +08:00
Kefu Chai
c817e13d63 test/nodetool: deduce path from top srcdir
add a helper to get path from top src dir, more readable this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-08-12 22:50:18 +08:00
Nikos Dragazis
90363ce802 test: Test the SSTable validation API against malformed SSTables
Unit testing for the SSTable validation API happens in
`sstable_validate_test`. Currently, this test checks the API against
some invalid SSTables with out-of-order clustering rows and out-of-order
partitions. However, both are types of content-level corruption that do
not trigger `malformed_sstable_exception` errors.

Extend the test to cover cases of file-level corruption as well, i.e.,
cases that would raise a `malformed_sstable_exception`. Construct an
SSTable with an invalid checksum to trigger this.

This is part of the effort to improve scrub to handle all kinds of
corruption.

Fixes scylladb/scylladb#19057

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#20096
2024-08-12 15:09:58 +03:00
Botond Dénes
fec57c83e6 Merge 'cell_locker: maybe_rehash: ignore allocation failures' from Benny Halevy
`maybe_rehash` is complimentary and is not strictly require to succeed. If it fails, it will retry on the next call, but there's no reason to throw an exception that will fail its caller, since `maybe_rehash` is called as the final step after the caller has already succeeded with its action.

Minor enhancement for the error path, no backport required.

Closes scylladb/scylladb#19910

* github.com:scylladb/scylladb:
  cell_locker: maybe_rehash: reindent
  cell_locker: maybe_rehash: ignore allocation failures
2024-08-12 10:54:56 +03:00
Kefu Chai
0ae04ee819 build: cmake: use $<CONFIG:cfgs> when appropriate
per
https://cmake.org/cmake/help/latest/manual/cmake-generator-expressions.7.html#genex:CONFIG,
`cfgs` can be a comma-separated list. this is supported by CMake
3.19 and up, and our minimum required CMake version is 3.27. so let's
switch over from the composition of `IN_LIST` and `CONFIG` generator
expressions to a single one. simpler this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20110
2024-08-11 21:28:38 +03:00
Avi Kivity
318278ff92 Merge 'tablets: reload only changed metadata' from Botond Dénes
Currently, each change to tablet metadata triggers a full metadata reload from disk. This is very wasteful, especially if the metadata change affects only a single row in the `system.tablets` table. This is the case when the tablet load balancer triggers a migration, this will affect a single row in the table, but today will trigger a full reload.
We expect tablet count to potentially grow to thousands and beyond and the overhead of this full reload can become significant.
This PR makes tablet metadata reload partial, instead of reloading all metadata on topology or schema changes, reload only the partitions that are affected by the change. Copy the rest from the in-memory state.
This is done with two passes: first the change mutations are scanned and a hint is produced. This hint is then passed down to the reload code, which will use it to only reload parts (rows/partitions) of the metadata that has actually changed.

The performance difference between full reload and partial reload is quite drastic:
```
INFO  2024-07-25 05:06:27,347 [shard 0:stat] testlog - Tablet metadata reload:
full      616.39ms
partial     0.18ms
```
This was measured with the modified (by this PR) `perf_tablets`, which creates 100 tables, each with 2K tablets. The test was modified to change a single tablet, then do a full and partial reload respectively, measuring the time it takes for reach.

Fixes: #15294

New feature, no backport needed.

Closes scylladb/scylladb#15541

* github.com:scylladb/scylladb:
  test/perf/perf_tablets: add tablet metadata reload perf measurement
  test/boost/tablets_test: add test for partial tablet metadata updates
  db/schema_tables: pass tablet hint to update_tablet_metadata()
  service/storage_service: load_tablet_metadata(): add hint parameter
  service/migration_listener: update_tablet_metadata(): add hint parameter
  service/raft/group0_state_machine: provide tablet change hint on topology change
  service/storage_service: topology_state_load(): allow providing change hint
  replica/tablets: add update_tablet_metadata()
  replica/tablets: fix indentation
  replica/tablets: extract tablet_metadata builder logic
  replica/tablets: add get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint()
  locator/tablets: add tablet_map::clear_tablet_transition_info()
  locator/tablets: make tablet_metadata cheap to copy
  mutation/canonical_mutation: add key()
2024-08-11 21:27:18 +03:00
Botond Dénes
2b2db510b7 test/perf/perf_tablets: add tablet metadata reload perf measurement
Measure reload perf of full reload vs. partial reload, after changing a
single tablet.
While at it, modify the `--tablets-per-table` parameter, so that it has
a default parameter which works OOTB. The previous default was both too
large (causing oversized commitlog entry errors) and not a power of two.
2024-08-11 09:53:19 -04:00
Botond Dénes
65eee200b2 test/boost/tablets_test: add test for partial tablet metadata updates 2024-08-11 09:53:19 -04:00
Botond Dénes
b886ed44a7 db/schema_tables: pass tablet hint to update_tablet_metadata()
Replace the has_tablet_mutations in `merge_tables_and_views()` with a
hint parameter, which is calculated in the caller, from the original
schema change mutations. This hint is then forwarded to the notifier's
`update_tablet_metadata()` so that subscribers can refresh only the
tablet partitions that changed.
2024-08-11 09:53:19 -04:00
Botond Dénes
5bff422b54 service/storage_service: load_tablet_metadata(): add hint parameter
Allowing for reloading only those parts of the tablet metadata that were
actually changed.
2024-08-11 09:53:19 -04:00
Botond Dénes
2cec0d8dd1 service/migration_listener: update_tablet_metadata(): add hint parameter
The hint contains information related to what exactly changed, allowing
listeners to do partial updates, instead of reloading all metadata on
each notification.
2024-08-11 09:53:19 -04:00
Botond Dénes
ca302d9e28 service/raft/group0_state_machine: provide tablet change hint on topology change
So that when reloading tablet state metadata from the disk, only the
changed parts are reloaded.
2024-08-11 09:53:19 -04:00
Botond Dénes
806ec3244a service/storage_service: topology_state_load(): allow providing change hint
So that when reloading state from disk, only changed parts are reloaded
instead of all. For now, only tablets have hints implemented.
2024-08-11 09:53:18 -04:00
Botond Dénes
bb1e733fe0 replica/tablets: add update_tablet_metadata()
Allows updateng tablet metadata in-place, according to the provided
hint, reading and updating only the parts that actually changed.
2024-08-11 09:52:37 -04:00
Botond Dénes
66292b4baa replica/tablets: fix indentation
Left broken from the previous patch.
2024-08-11 09:52:37 -04:00
Botond Dénes
aa378c458e replica/tablets: extract tablet_metadata builder logic
So it can be reused in a new method.
Indentation is left broken deliberately, to make the patch easier to
read.
2024-08-11 09:52:37 -04:00
Botond Dénes
f5976aa87b replica/tablets: add get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint()
Extract a hint of what a tablet mutation changed. The hint can be later
used to selectively reload only the changed parts from disk.
Two variants are added:
* get_tablet_metadata_change_hint() - extracts a hint from a list of
  tablet mutations
* update_tablet_metadata_change_hint() - updates an existing hint based
  on a single mutation, allowing for incremental hint extraction
2024-08-11 09:52:37 -04:00
Botond Dénes
54ea71f8a6 locator/tablets: add tablet_map::clear_tablet_transition_info() 2024-08-11 09:52:37 -04:00
Botond Dénes
0254cfc7d3 locator/tablets: make tablet_metadata cheap to copy
Keep lw_shared_ptr<tablet_map> in the tablet map and use COW semantics.
To prevent accidental changes to shared tablet_map instances, all
modifications to a tablet_map have to go through a new
`mutate_tablet_map()` method, which implements the copy-modify-swap
idiom.
2024-08-11 09:52:37 -04:00
Botond Dénes
fb0ab3c1fb mutation/canonical_mutation: add key()
Extracts the partition key without deserializing the entire mutation.
2024-08-11 09:52:37 -04:00
Calle Wilund
e18a855abe extensions: Add exception types for IO extensions and handle in memtable write path
Fixes #19960

Write path for sstables/commitlog need to handle the fact that IO extensions can
generate errors, some of which should be considered retry-able, and some that should,
similar to system IO errors, cause the node to go into isolate mode.

One option would of course be for extensions to simply generate std::system_errors,
with system_category and appropriate codes. But this is probably a bad idea, since
it makes it more muddy at which level an error happened, as well as limits the
expressibility of the error.

This adds three distinct types (sharing base) distinguishing permission, availabilty
and configuration errors. These are treated akin to EACCESS, ENOENT and EINVAL in
disk error handler and memtable write loop.

Tests updated to use and verify behaviour.

Closes scylladb/scylladb#19961
2024-08-11 13:52:35 +03:00
Raphael S. Carvalho
75829d75ec replica: Fix race between split compaction and migration
After removal of rwlock (53a6ec05ed), the race was introduced because the order that
compaction groups of a tablet are closed, is no longer deterministic.

Some background first:
Split compaction runs in main (unsplit) group, and adds sstable to left and right groups
on completion.

The race works as follow:
1) split compaction starts on main group of tablet X
2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel
3) left or right group are closed before main (more likely when only main has flush work to do)
4) split compaction completes, and adds sstable to left and right
5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that
happens in row cache update's execute(), node crashes.

The problem manifested as follow:
[shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0
...
[shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...]
...
[shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found),
at: ...
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(...
   --------
   seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::(anonymous namespace)::thread_wake_task
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(...
   seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu...

From the log above, it can be seen cache update failure happens under streaming sched group and
during compaction completion, which was good evidence to the cause.
Problem was reproduced locally with the help of tablet shuffling.

Fixes: #19873.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#19987
2024-08-11 11:00:19 +03:00
Botond Dénes
1f4b9a5300 Merge 'compaction: drop compaction executors' possibility to bypass task manager' from Aleksandra Martyniuk
If parent_info argument of compaction_manager::perform_compaction
is std::nullopt, then created compaction executor isn't tracked by task
manager. Currently, all compaction operations should by visible in task
manager.

Modify split methods to keep split executor in task manager. Get rid of
the option to bypass task manager.

Closes scylladb/scylladb#19995

* github.com:scylladb/scylladb:
  compaction: replace optional<task_info> with task_info param
  compaction: keep split executor in task manager
2024-08-11 10:26:43 +03:00
Botond Dénes
0bb1075a19 Merge 'tasks: fix task handler' from Aleksandra Martyniuk
There are some bugs missed in task handler:
- wait_for_task does not wait until virtual tasks are done, but returns the status immediately;
- wait_for_task suffers from use after return;
- get_status_recursively does not set the kind of task essentials.

Fix the aforementioned.

Closes scylladb/scylladb#19930

* github.com:scylladb/scylladb:
  test: add test to check that task handler is fixed
  tasks: fix task handler
2024-08-11 10:23:17 +03:00
Paweł Zakrzewski
9db272c949 test/cql-pytest: Add test for GROUP BY queries with LIMIT
Remove xfail from all tests for #5361, as the issue is fixed.

Remove xfail from test_group_by_clustering_prefix_with_limit
It references #5362, but is fixed by #17237.

Refs #17237
2024-08-11 09:08:44 +02:00
Paweł Zakrzewski
e7ae7f3662 cql3: process LIMIT for GROUP BY queries
Currently LIMIT not passed to the query executor at all and it was just
an accident that it worked for the case referenced in #17237. This
change passes the limit value down the chain.
2024-08-11 09:08:43 +02:00
Paweł Zakrzewski
3838ad64b3 cql3/select_statement: simplify the get_limit function
The get_limit() function performed tasks outside of its scope - for
example checked if the statement was an aggregate. This change moves the
onus of the check to the caller.
2024-08-11 09:08:43 +02:00
Paweł Zakrzewski
08f3219cb8 cql3: respect the user-defined page size in aggregate queries
The comment in the code already states that we should use the
user-defined page size if it's provided. To avoid OOM conditions we'll
use the internally defined limit as the upper bound or if no page size
is provided.

This change lays ground work for fixing #5362 and is necessary to pass
the test introduced in #19392 once it is implemented.
2024-08-11 09:08:43 +02:00
Michał Jadwiszczak
3745d0a534 gms/feature_service: allow to suppress features
This patch adds `suppress_features` error injection. It allows to revoke
support for some features and it can be used to simulate upgrade process
in test.py.

Features to suppress are passed as injection's value, separated by `;`.
Example: `PARALLELIZED_AGGREGATION;UDA_NATIVE_PARALLELIZED_AGGREGATION`

Fixes scylladb/scylladb#20034

Closes scylladb/scylladb#20055
2024-08-09 19:15:19 +02:00
Kefu Chai
a78f46aad7 s3/client: customize options for input_stream
before this change, we use the default options for
performing read on the input. and the default options
is like
```c++
struct file_input_stream_options {
    size_t buffer_size = 8192;    ///< I/O buffer size
    unsigned read_ahead = 0;      ///< Maximum number of extra read-ahead operations
};
```
which is not able to offer good throughput when
reading from disk, when we stream to S3.

so, in this change, we use options which allows better throughput.

Refs 061def001d
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20074
2024-08-09 11:52:30 +03:00
Dawid Medrek
e5d01d4000 db/hints: Make commitlog use commitlog IO scheduling group
Before these changes, we didn't specify which I/O scheduling
group commitlog instances in hinted handoff should use.
In this commit, we set it explicitly to the commitlog
scheduling group. The rationale for this choice is the fact
we don't want to cause a bottleneck on the write path
-- if hints are written too slowly, new incoming mutations
(NOT hints) might be rejected due to a too high number
of hints currently being written to disk; see
`storage_proxy::create_write_response_handler_helper()`
for more context.

Fixes scylladb/scylladb#18654

Closes scylladb/scylladb#19170
2024-08-08 16:14:07 +02:00
Piotr Dulikowski
b72906518f Merge 'service levels: update connections parameters automatically' from Michał Jadwiszczak
This patch makes all cql connections update theirs service level parameters automatically when:
- any service level is created or changed
- one role is granted to another
- any service level is attached to/detached from a role

First of all, the patch defines what a service level and an effective service level are 938aa10509. No new type of service levels are introduced, the commit only clarifies definitions and names what an effective service level is.
(Effective service level is created by merging all service levels which are attached to all roles granted to the user. It represents exact  values of connection's parameters.)

Previously, to find an effective service level of a user, it required O(n) internal queries: O(n) queries to recursively find all granted roles (`standard_role_manager::query_granted()`) and a query for each role to get its service level (`standard_role_manager::get_attribute()`, which sums to O(n) queries).

Because we want to reload SL parameters for all opened cql connections, we don't want to do O(n) queries for every connection, every time we create or change any service level/grant one role to another/attach or detach a service level to/from a role.

To speed it up, the patch adds another layer of service level controller cache, which stored `role_name -> effective_service_level` mapping. This way finding a effective service level for a role is only a lookup to a map.
Building the new cache requires only 2 queries: one to obtain all role hierarchy one to get all roles' service level.

Fixes scylladb/scylladb#12923

Closes scylladb/scylladb#19085

* github.com:scylladb/scylladb:
  test/auth_cluster/test_raft_service_levels: add test for automatic connection update
  api/cql_server_test: add CQL server testing API
  transport/cql_server: subscribe to sl effective cache reloaded
  transport/controller: coroutinize `subscribe_server` and `unsubscribe_server`
  transport/cql_server: add method to update service level params on all connections
  generic_server: use async function in `for_each_gently()`
  service/qos/sl_controller: use effective service levels cache
  service/qos/service_level_controller: notify subscribers on effective cache reloaded
  service/raft/group0_state_machine: update effective service levels cache
  service/topology_coordinator: migrate service levels before auth
  service/qos/service_level_controller: effective service levels cache
  utils/sorting: allow to pass any container as verticies
  service/qos/service_level_controller: replace shard check to assert
  service/qos: define effective service level
  service/qos/qos_common: use const reference in `init_effective_names()`
  service/qos/service_level_controller: remove unused field
  auth: return map of directly granted roles
  test/auth/test_auth_v2_migration: create sl1 in the test
2024-08-08 15:31:04 +02:00
Anna Stuchlik
a1b4357765 doc: update Raft info in 6.1
This commit updates the Raft information regarding the Raft verification procedure.
In 6.1, the procedure is no longer related to the upgrade.

Fixes https://github.com/scylladb/scylladb/issues/19932

Closes scylladb/scylladb#20040
2024-08-08 11:25:50 +02:00
PeterFlockhart
0f9c6d24cf Update SELECT grammar to define group_by_clause explicitly
Closes scylladb/scylladb#20046
2024-08-08 12:23:20 +03:00
Avi Kivity
12c68bcf75 Merge 'querier: include cell stats in page stats' from Botond Dénes
We have two mechanism to give visibility into reads having to process many tombstones:
* a warning in the logs, triggered if a read processed more the `tombstone_warn_threshold` dead rows/tombstones
* a trace message, which includes stats of the amount of rows in the page, including the amount of live and dead rows as well as tombstones

This series extends this to also include information on cells, so we have visibility into the case where a read has to process an excessive amount of cell tombstones (mainly because of collections).
A log line is now also logged if the amount of dead cells/tombstones in the page exceeds `tombstone_warn_threshold`. The trace message is also extended to contain cell stats.

The `tombstone_warn_threshold` log lines now receive a 10s rate-limit to avoid excessive log spamming. The rate-limit is separate for the row and cell logs.

Example of the new log line (`tombstone_warn_threshold=10` ):
```
WARN  2024-05-30 07:56:44,979 [shard 0:stmt] querier - Read 98 live cells and 126 dead cells/tombstones for system_schema.scylla_tables <partition-range-scan> (-inf, +inf) (see tombstone_warn_threshold)
```

Example of the new tracing message:
```
Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 13 cell(s) (1 live, 12 dead) [shard 0] | 2024-05-30 08:13:19.690803 | 127.0.0.1 |           6114 | 127.0.0.1
```

Fixes: https://github.com/scylladb/scylladb/issues/18996

Improvement, not a backport candidate.

Closes scylladb/scylladb#18997

* github.com:scylladb/scylladb:
  test/boost: mutation_test: add test for cell compaction stats
  mutation/compact_and_expire_result: drop operator bool()
  querier: consume_page(): add rate-limiting to tombstone warnings
  querier: consume_page(): add cell stats to page stats trace message
  querier: consume_page(): add tombstone warning for cell tombstones
  querier: consume_page(): extract code which logs tombstone warning
  mutation/mutation_compactor: collect and aggregate cell compaction stats
  mutation: row::compact_and_expire(): use compact_and_expire_result
  collection_mutation: compact_and_expire(): use compact_and_expire_result
  mutation: introduce compact_and_expire_result
2024-08-08 12:16:13 +03:00
Calle Wilund
d6742e9bce distributed_loader: Remove load_prio_keyspaces
Fixes #13334

All required code paths (see enterprise) now uses
extensions::is_extension_internal_keyspace.
The old mechanism can be removed. One less global var.

Closes scylladb/scylladb#20047
2024-08-08 12:10:27 +03:00
Avi Kivity
db77b5bd03 Merge 'convert the rest of test/boost/sstable_test.cc to co-routines and seastar::thread' from Laszlo Ersek
This is a followup to #19937, for #19803. See in particular [this comment](https://github.com/scylladb/scylladb/issues/19803#issuecomment-2258371923).

The primary conversion target is coroutines. However, while coroutines are the most convenient style, they are only infrequently usable in this case, for the following reasons:
- Wherever we have a `future::finally()` that calls a cleanup function that returns a future (which must be awaited), we cannot use `co_await`. We can only use `seastar::async()` with `deferred_close` or `defer()`.
- The code passes lots of lambdas, and `co_await` cannot be used in lambdas. First, I tried, and the compiler rejects it; second, a capturing lambda that is a coroutine is a trap [[1]](https://devblogs.microsoft.com/oldnewthing/20211103-00/?p=105870) [[2]](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rcoro-capture).

In most cases, I didn't have to use naked `seastar::async()`; there were specialized wrappers in place already. Thus, most of the changes target `seastar::thread` context under existent `seastar::async()` wrappers, and only a few functions end up as coroutines.

The last patch in the series (`test/sstable: remove useless variable from promoted_index_read()`) is an independent micro-cleanup, the opportunity for which I thought to have noticed while reading the code.

The tail of `test/boost/sstable_test.cc` (the stuff following `promoted_index_read()`) is already written as `seastar::thread`. That's already better (for readability) than future chaining; but could have I perhaps further converted those functions to coroutines? My answer was "no":
- Some of the candidate functions relied on deferred cleanups that might need to yield (all three variants of `count_rows()`).
- Some had been implemented by passing lambdas to wrappers of `seastar::async()` (`sub_partition_read()`, `sub_partitions_read()`).
- The test case `test_skipping_in_compressed_stream()` initially looked promising for co-routinization (from its starting point `seastar::async()`), because it seemed to employ no deferred cleanup (that might need to yield). However, the function uses three lambdas that must be able to yield internally, and one of those (`make_is()`) is even capturing.
- The rest (`test_empty_key_view_comparison()`, `test_parse_path_good()`, `test_parse_path_bad()`) was synchronous code to begin with.

```
 test/boost/sstable_test.cc | 188 +++++++++-----------
 1 file changed, 83 insertions(+), 105 deletions(-)
```

Refactoring; no backport needed.

Closes scylladb/scylladb#20011

* github.com:scylladb/scylladb:
  test/sstable: remove useless variable from promoted_index_read()
  test/sstable: rewrite promoted_index_read() with async()
  test/sstable: unfuturize lambda invocation in test_using_reusable_sst*()
  test/sstable: rewrite wrong_range() with async()
  test/sstable: simplify not_find_key_composite_bucket0() under test_using_reusable_sst()
  test/sstable: rewrite full_index_search() with async()
  test/sstable: simplify find_key*(), all_in_place() under test_using_reusable_sst()
  test/sstable: rewrite (un)compressed_random_access_read() with async()
  test/sstable: simplify write_and_validate_sst()
  test/sstable: simplify check_toc_func() under async()
  test/sstable: simplify check_statistics_func() under async()
  test/sstable: simplify check_summary_func() under async()
  test/sstable: coroutinize check_component_integrity()
  test/sstable: rewrite write_sst_info() with async()
  test/sstable: simplify missing_summary_first_last_sane()
  test/sstable: coroutinize summary_query_fail()
  test/sstable: rewrite summary_query() with async()
  test/sstable: coroutinize (simple/composite)_index_read()
  test/sstable: rewrite index_read() with async()
  test/sstable: rewrite test_using_reusable_sst() with async()
  test/sstable: rewrite test_using_working_sst() with async()
2024-08-08 11:55:37 +03:00
Michał Jadwiszczak
b62a8b747a test/auth_cluster/test_raft_service_levels: add test for automatic
connection update
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
870bdaa6b1 api/cql_server_test: add CQL server testing API
Add a CQL server testing API with and endpoint to dump
service level parameters of all CQL connections.

This endpoint will be later used to test functionality of
automated updating CQL connections parameters.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
c3e8778ad4 transport/cql_server: subscribe to sl effective cache reloaded
Make cql server (but not maintenance server) is subscribed to qos
configuration change.
Trigger update of connections' service level params on effective cache
reloaded event.

It's not done on maintenance server because it doesn't support role
hierarchy nor attaching service levels.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
b2f2288292 transport/controller: coroutinize subscribe_server and unsubscribe_server 2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
4af90726b6 transport/cql_server: add method to update service level params on all
connections

Trigger update of service level param on every cql connection.
In enterprise, the method needs also to update connections' scheduling
group.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
324b3c43c0 generic_server: use async function in for_each_gently()
In the following patch, we will add a method to update service levels
parameters for each cql connections.
To support this, this patch allows to pass async function as a parameter
to `for_each_gently()` method.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
93e6de0d04 service/qos/sl_controller: use effective service levels cache
Use cache to quickly access effective service level of a role.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
664a1913c6 service/qos/service_level_controller: notify subscribers on effective
cache reloaded

Add event representing reload of effective service level cache and
notify subscribers when the cache is reloaded.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
5f8132c13c service/raft/group0_state_machine: update effective service levels cache
Updates to `system.role_members` and `system.role_attributes` affect
effective service levels cache, so applying mutations to those tables
should reload the effective SL cache.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
7b28df9b4d service/topology_coordinator: migrate service levels before auth
Effective service level cache will be updated when mutations are applied to
some of the auth tables.
But the effective cache depends on first-level service levels cache, so
service levels data should be migrated before auth data.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
842573d0af service/qos/service_level_controller: effective service levels cache
Add a second layer of service_level_controller cache which contains
role name -> effective service level mapping.
To build the mapping, controller uses first cache layer (service level
name -> service level) and 2 queries to auth tables (one to `roles` and
one to `role_members`).
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
4922f87fed utils/sorting: allow to pass any container as verticies
The container containing all verticies doesn't have to be a vector.
Allowing to pass any container that meet conditions, will make to
function more flexible.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
619937c466 service/qos/service_level_controller: replace shard check to assert
The cache is only updated on shard 0, so doing assert is a better sanity
check.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
be4c83ad3c service/qos: define effective service level
Write down definitions of `service level` and `effective service level`
in service/qos/service_level_controller.hh.

Until now, effective service level was only used as result of
`LIST EFFECTIVE SERVICE LEVEL OF <role>`.
Now we want to have quick access to effective service level of
each role and introduce cache of effective sl to do it.
New definitions clarify things.

The commit also renames:
- `update_service_levels_from_distributed_data` -> `update_service_levels_cache`
  Later we will introduce effective_service_level_cache, so this change
  standarizes the names.
- `find_service_level` -> `find_effective_service_level`
  The function actualy returns effective service level.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
0da979e013 service/qos/qos_common: use const reference in init_effective_names()
`service_level_options::init_effective_names()` method's argument has no
reason to be mutable reference.
This commit converts it to const ref.
2024-08-08 10:42:09 +02:00
Michał Jadwiszczak
37cd998993 service/qos/service_level_controller: remove unused field 2024-08-08 10:42:08 +02:00
Michał Jadwiszczak
f9048de0ce auth: return map of directly granted roles
Returns multimap of directly granted roles for each role. Uses
only one query to create the map, instead of doing recursive queries
for each individual role.
2024-08-08 10:42:08 +02:00
Michał Jadwiszczak
d643d5637c test/auth/test_auth_v2_migration: create sl1 in the test
Test `test_auth_v2_migration` creates auth data where role `users`
has assigned service level `sl:fefe` but the service level isn't
actually created.

In following patches, we are going to introduce effective service levels
cache which depends on auth and is refreshed when mutations are applied
to v2 auth tables.

Without this changes, this test will fail because the service level
doesn't exist.
Also the name `sl:fefe` is change to `sl1`.
2024-08-08 10:42:08 +02:00
Avi Kivity
3fe60560d2 Merge 'Coroutinize view_builder::start()' from Pavel Emelyanov
It runs in the background and consists of two parts -- async() lambda and following .then()-s. This PR move the background running code into its own method and coroutinizes it in parts. With #19954 merged it finally looks really nice.

Closes scylladb/scylladb#20058

* github.com:scylladb/scylladb:
  view_builder: Restore indentation after previous patches
  view_builder: Coroutinize inner start_in_background() calls
  view_builder: Coroutinize outer start_in_background() calls
  view_builder: Add helper method for background start
2024-08-07 19:47:32 +03:00
Kamil Braun
4181a1c53e storage_service: raft topology: warn when raft_topology_cmd_handler fails due to abort
Currently we print an ERROR on all exceptions in
`raft_topology_cmd_handler`. This log level is too high, in some cases
exceptions are expected -- like during shutdown. And it causes dtest
failures.

Turn exceptions from aborts into WARN level.

Also improve logging by printing the command that failed.

Fixes scylladb/scylladb#19754

Closes scylladb/scylladb#19935
2024-08-07 17:57:23 +02:00
Tomasz Grabiec
1a4baa5f9e tablets: Do not allocate tablets on nodes being decommissioned
If tablet-based table is created concurrently with node being
decommissioned after tablets are already drained, the new table may be
permanently left with replicas on the node which is no longer in the
topology. That creates an immidiate availability risk because we are
running with one replica down.

This also violates invariants about replica placement and this state
cannot be fixed by topology operations.

One effect is that this will lead to load balancer failure which will
inhibit progress of any topology operations:

  load_balancer - Replica 154b0380-1dd2-11b2-9fdd-7156aa720e1a:0 of tablet 7e03dd40-537b-11ef-9fdd-7156aa720e1a:1 not found in topology, at:  ...

Fixes #20032

Closes scylladb/scylladb#20053
2024-08-07 18:52:58 +03:00
Dawid Medrek
96509c4cf7 db/hints: Make sync points be created for all hosts when not specified
Sync points are created, via POST HTTP requests, for a subset of nodes
in the cluster. Those nodes are specified in a request's parameter
`target_hosts`. When the parameter is empty, Scylla should assume
the user wants to create a sync point for ALL nodes.

Before these changes, sync points were created only for LIVE nodes.
If a node was dead but still part of the cluster and the user
requested creating a sync point leaving the parameter `target_hosts`
empty, the dead node was skipped during the creation of the sync point.
That was inconsistent with the guarantees the sync point API provides.

In this commit, we fix that issue and add a test verifying that
the changes have made the implementation compliant with the design
of the sync point API -- the test only passes after this commit.

Fixes scylladb/scylladb#9413

Closes scylladb/scylladb#19750
2024-08-07 13:15:20 +02:00
Pavel Emelyanov
63afbc0fcb view_builder: Restore indentation after previous patches
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-07 14:00:01 +03:00
Pavel Emelyanov
aa1a5d3201 view_builder: Coroutinize inner start_in_background() calls
One of the co_await-ed parts of this method is async() lambda. It can be
coroutinized too. One thing to care is the semaphore units -- its scope
should (?) terminate earlier than the whole start_in_background() so
release it explicitly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-07 14:00:01 +03:00
Pavel Emelyanov
167c6a9c5e view_builder: Coroutinize outer start_in_background() calls
The method consists of two parts -- one running in async() thread and
continuations to it. This patch turns the latter chain into co_await-s.
The mentioned chain is "guarded" by then_wrapped() catch of any
exception, which is turned into a plain try-catch block.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-07 14:00:01 +03:00
Pavel Emelyanov
10a87f5c5b view_builder: Add helper method for background start
The view_builder::start() happens in the background. It's good to have
explicit start_in_background() method and coroutinize it next.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-08-07 13:59:57 +03:00
Dawid Medrek
ec691a84a5 docs/hinted_handoff: Describe sync point HTTP API
In this commit, we describe the mechanism of sync point
in Hinted Handoff in the user documentation. We explain
the motivation for it and how to use it, as well as list
and describe all of the parameters involved in the process.
Errors that may appear and experienced by the user
are addressed in the article.

Fixes scylladb/scylladb#18500

Closes scylladb/scylladb#19686
2024-08-07 11:12:23 +02:00
Pavel Emelyanov
2fd60b0adc api: Move config-related endpoints from storage_service.cc
The get_all_data_file_locations and get_saved_caches_location get the
returned data from db::config and should be next other endpoints working
with config data.

refs: #2737

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#19958
2024-08-07 10:18:29 +03:00
Piotr Dulikowski
1963619803 Merge 'Use cross shard barrier to start view builder' from Pavel Emelyanov
When starting, view builder wants all shards to synchronize with each other in the middle of initialization. For that they all synchronize via shard-0's instance counter and a shared future. There's cross-shard barrier in utils/ that provides the same facility.

Closes scylladb/scylladb#19954

* github.com:scylladb/scylladb:
  view_builder: Drop unused members
  view_builder: Use cross-shard barrier on start
  view_builder: Add cross-shard barrier to its .start() method
2024-08-07 08:54:15 +02:00
Botond Dénes
78206a3fad test/boost: mutation_test: add test for cell compaction stats 2024-08-06 08:56:28 -04:00
Botond Dénes
259a59bd64 mutation/compact_and_expire_result: drop operator bool()
Having an operator bool() on this struct is counter-intuitive, so this
commit drops it and migrates any remaining users to bool is_live().
The purpose of this operator bool() was to help in incrementally replace
the previous bool return type with compact_and_expire_result in the
compact_and_expire() call stack. Now that this is done, it has served
its purpose.
2024-08-06 08:56:28 -04:00
Botond Dénes
f638c37c4b querier: consume_page(): add rate-limiting to tombstone warnings
These warnings can be logged once per query, which could result in
filling the logs with thousands of log lines.
Rate-limit to once per 10sec.
2024-08-06 08:56:11 -04:00
Botond Dénes
d69b16a51e querier: consume_page(): add cell stats to page stats trace message 2024-08-06 08:56:11 -04:00
Botond Dénes
98c599f73a querier: consume_page(): add tombstone warning for cell tombstones
Since it is really difficult to meaningfully aggregate cell tombstones
with row tombstones, there is two separate warning for them.
2024-08-06 08:56:11 -04:00
Botond Dénes
fa2ee6d545 querier: consume_page(): extract code which logs tombstone warning
Soon, we want to log a warning on too many cell tombstones as well.
Extract the logging code to allow reuse between the row and cell
tombstone warnings.
2024-08-06 08:56:11 -04:00
Botond Dénes
e403644c8b mutation/mutation_compactor: collect and aggregate cell compaction stats
row::compact_and_expire() now returns details cell stats. Collect and
aggregate these, using the existing compaction_stats::row_stats
structure.
2024-08-06 08:56:11 -04:00
Botond Dénes
0396db497c mutation: row::compact_and_expire(): use compact_and_expire_result
Collect, store and return stats about cells, via
compact_and_expire_result.
2024-08-06 08:56:11 -04:00
Botond Dénes
2c6d4e21e6 collection_mutation: compact_and_expire(): use compact_and_expire_result
Collect, store and return stats about cells, via
compact_and_expire_result.
2024-08-06 08:56:11 -04:00
Botond Dénes
e773a8eee6 mutation: introduce compact_and_expire_result
To hold cell stats, to be collected during row::compact_and_expire().
Users will come in the next patches.
2024-08-06 08:56:11 -04:00
Aleksandra Martyniuk
9ec8000499 test: add test to check that task handler is fixed 2024-08-06 13:15:33 +02:00
Aleksandra Martyniuk
811ca00cec tasks: fix task handler
There are some bugs missed in task handler:
- wait_for_task does not wait until virtual tasks are done, but
  returns the status immediately;
- wait_for_task suffers from use after return;
- get_status_recursively does not set the kind of task essentials.

Fix the aforementioned.
2024-08-06 13:15:13 +02:00
Anna Stuchlik
849856b964 doc: add post-installation configuration to the Web Installer page
This commit extracts the information about the configuration the user should do
right after installation (especially running scylla_setup) to a separate file.
The file is included in the relevant pages, i.e., installing with packages
and installing with Web Installer.

In addition, the examples on the Web Installer page are updated
with supported versions of ScyllaDB.

Fixes https://github.com/scylladb/scylladb/issues/19908

Closes scylladb/scylladb#20035
2024-08-06 13:49:09 +03:00
Kamil Braun
f348f33667 raft topology: improve logging
Add more logging for raft-based topology operations in INFO and DEBUG
levels.

Improve the existing logging, adding more details.

Fix a FIXME in test_coordinator_queue_management (by readding a log
message that was removed in the past -- probably by accident -- and
properly awaiting for it to appear in test).

Enable group0_state_machine logging at TRACE level in tests. These logs
are relatively rare (group 0 commands are used for metadata operations)
and relatively small, mostly consist of printing `system.group0_history`
mutation in the applied command, for example:
```
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}}
```
note that the mutation contains a human-readable description of the
command -- like "create system_distributed keyspace" above.

These logs might help debugging various issues (e.g. when `apply` hangs
waiting for read_apply mutex, or takes too long to apply a command).

Ref: scylladb/scylladb#19105
Ref: scylladb/scylladb#19945

Closes scylladb/scylladb#19998
2024-08-06 11:50:16 +03:00
Kamil Braun
aa9d5fe3f5 Merge 'doc: add the 6.0-to-6.1 upgrade guide' from Anna Stuchlik
This PR adds the 6.0-to-6.1 upgrade guide (including metrics) and removes the 5.4-to-6.0 upgrade guide.

Compared 5.4-to-6.0, the the 6.0-to-6.1 guide:

- Added the "Ensure Consistent Topology Changes Are Enabled" prerequisite.
- Removed the "After Upgrading Every Node" section. Both Raft-based schema changes and topology updates
  are mandatory in 6.1 and don't require any user action after upgrading to 6.1.
- Removed the "Validate Raft Setup" section. Raft was enabled in all 6.0 clusters (for schema management),
  so now there's no scenario that would require the user to follow the validation procedure.
- Removed the references to the Enable Consistent Topology Updates page (which was in version 6.0 and is removed with this PR) across the docs.

See the individual commits for more details.

Fixes https://github.com/scylladb/scylladb/issues/19853
Fixes https://github.com/scylladb/scylladb/issues/19933

This PR must be backported to branch-6.1 as it is critical in version 6.1.

Closes scylladb/scylladb#19983

* github.com:scylladb/scylladb:
  doc: remove the 5.4-to-6.0 upgrade guide
  doc: add the 6.0-to-6.1 upgrade guide
2024-08-06 10:23:18 +02:00
Andrei Chekun
cc428e8a36 [test.py] Increase pool size for CI
Currently, the resource utilization in CI is low. Increasing the number of clusters will increase how many tests are executed simultaneously. This will decrease the time it takes to execute and improve resource utilization.

Related: https://github.com/scylladb/qa-tasks/issues/1667

Closes scylladb/scylladb#19832
2024-08-06 11:20:36 +03:00
Botond Dénes
822d3b11d0 tool/scylla-nodetool: refresh: improve error-message on missing ks/tbl args
The command has a singl check for the missing keyspace and/or table
parameters and if the check fails, there is a combined error message.
Apparently this is confusing, so split the check so that missing
keyspace and missing table args have its own check and error message.

Fixes: scylladb/scylladb#19984

Closes scylladb/scylladb#20005
2024-08-05 22:36:05 +03:00
Anna Stuchlik
32fa5aa938 doc: remove the 5.4-to-6.0 upgrade guide
This commit removes the 5.4-to-6.0 upgrade guide and all references to it.
It mainly removes references to the Enable Consistent Topology Updates page,
which was added as enabling the feature was optional.
In rare cases, when a reference to that page is necessary,
the internal link is replaced with an external link to version 6.0.
Especially the Handling Cluster Membership Change Failures page was modified
for troubleshooting purposes rather than removed.
2024-08-05 20:13:48 +02:00
Kefu Chai
b1405da6ac s3/client: use div_ceil() defined by utils/div_ceil.hh
instead of reinventing the wheel, let's use the existing one.

in this change, we trade the `div_ceil()` implementated in s3/client.cc
for the existing one in utils/div_ceil.hh . because we are not using
`std::lldiv()` anymore, the corresponding `#include <cstdlib>` is dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20000
2024-08-05 15:35:18 +03:00
Kefu Chai
12a066ccdf sstable_directory: use return_exception_ptr() when appropriate
instead of using `std::rethrow_exception()`, use
`coroutine::return_exception_ptr()` which is a little bit  more
efficient.

See also 6cafd83e1c

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20001
2024-08-05 12:54:27 +03:00
Kefu Chai
0bc886d005 service: mark fmt::formatter<T>::format() as const
fmt 11 enforces the constness of `format()` member function, if it
is not marked with `const`, the tree fails to build with fmt 11, like:

```
/usr/include/fmt/base.h:1393:23: error: no matching member function for call to 'format'
 1393 |     ctx.advance_to(cf.format(*static_cast<qualified_type*>(arg), ctx));
      |                    ~~~^~~~~~
/usr/include/fmt/base.h:1374:21: note: in instantiation of function template specialization 'fmt::detail::value<fmt::context>::format_custom_arg<service::migration_badness, fmt::formatter<service::migration_badness>>' requested here
 1374 |     custom.format = format_custom_arg<
      |                     ^
/home/kefu/dev/scylladb/service/tablet_allocator.cc:170:14: note: in instantiation of function template specialization 'fmt::format_to<fmt::basic_appender<char>, const locator::global_tablet_id &, const locator::tablet_replica &, const locator::tablet_replica &, const service::migration_badness &, 0>' requested here
  170 |         fmt::format_to(ctx.out(), "{{tablet: {}, {} -> {}, badness: {}", candidate.tablet, candidate.src,
      |              ^
/home/kefu/dev/scylladb/service/tablet_allocator.cc:161:10: note: candidate function template not viable: 'this' argument has type 'const fmt::formatter<service::migration_badness>', but method is not marked const
  161 |     auto format(const service::migration_badness& badness, FormatContext& ctx) {
      |          ^
```

so, in this change, we mark these two `format()` member functions const.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20013
2024-08-05 12:53:42 +03:00
Piotr Dulikowski
a038a1fdef Merge 'db: coroutinize do_apply_counter_update' from Michael Litvak
rewrite the function as coroutine to make it easier to read and maintain, following lifetime issues we had and fixed in this function.

The second commit adds a test that drops a table while there is a counter update operation ongoing in the table.
The test reproduces issue https://github.com/scylladb/scylla-enterprise/issues/4475 and verifies it is fixed.

Follow-up to https://github.com/scylladb/scylladb/pull/19948
Doesn't require backport because the fix to the issue was already done and backported. This is just cleanup and a test.

Closes scylladb/scylladb#19982

* github.com:scylladb/scylladb:
  db: test counter update while table is dropped
  db: coroutinize do_apply_counter_update
2024-08-05 10:08:18 +02:00
Nadav Har'El
247b84715a test/cql-pytest: reproducers for key length bugs
Recently, some users have seen "Key size too large" errors in various
places. Cassandra and Scylla impose a 64KB length limit on keys, and
we have known about bugs in this area for a long time - and even had
some translated Cassandra unit tests that cover some of them. But these
tests did not cover all the corner cases and left us with partial and
fragmented knowledge of this problem, spread over many test files and
many issues.

In this patch, we add a single test file, test/cql-pytest/test_key_length.py
which attempts to rigourously explore the various bugs we have with
CQL key length limits. These test aim to reproduce all known bugs in
this area:

* Refs #3017 - CQL layer accepts set values too large to be written to
  an sstable
* Refs #10366 - Enforce Key-length limits during SELECT
* Refs #12247 - Better error reporting for oversized keys during INSERT
* Refs #16772 - Key length should be limited to exactly 65535, not less

The following less interesting bug is already covered by many tests so
I decided not to test it again:

* Refs #7745 - Length of map keys and set items are incorrectly limited
  to 64K in unprepared CQL

There's also a situation in materialized views and secondary indexes,
where a column that was _not_ a key, now becomes a key, and a length
limit needs to be enforced on it. We already have good test coverage
for this (in test/cql-pytest/test_secondary_index.py and in
test/cql-pytest/test_materialized_view.py), and we have an issue:

* Refs #8627 - Cleanly reject updates with indexed values where value > 64k

All 16 tests added here pass on Cassandra 5 except one that fails on
https://issues.apache.org/jira/browse/CASSANDRA-19270, but 11 of the
tests currently fail on Scylla (6 on #12247, 2 on #10366, 3 on #16772).

It is possible that our decision in #16772 will not be to fix Scylla
to match Cassandra but rather to declare that strict compatibility isn't
needed in this case or even that Cassandra is wrong. But even then,
having these tests which demonstrate the behavior of both Cassandra
and Scylla will be important.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#16779
2024-08-05 10:13:49 +03:00
Tzach Livyatan
861a1cedea Improve tombstone_compaction_interval description
Closes scylladb/scylladb#19072
2024-08-05 10:10:55 +03:00
Pavel Emelyanov
f0f28cf685 docs: Extend debugging with info about exploring ELF notes
When debugging coredumps some (small, but useful) information is hidden
in the notes of the core ELF file. Add some words about it exists, what
it includes and the thing that is always forgotten -- the way to get one

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#19962
2024-08-05 09:49:52 +03:00
Tzach Livyatan
858fd4d183 Update tracing.rst - fix table node_slow_log_time name
Closes scylladb/scylladb#19893
2024-08-05 09:47:27 +03:00
Botond Dénes
76b6e8c5aa Merge 'Drop datadir from keyspace::config' from Pavel Emelyanov
Commit ad0e6b79 (replica: Remove all_datadir from keyspace config) removed all_datadirs from keyspace config, now it's datadir turn. After this change keyspace no longer references any on-disk directories, only the sstables's storage driver attached to keyspace's tables does.

refs #12707

Closes scylladb/scylladb#19866

* github.com:scylladb/scylladb:
  replica: Remove keyspace::config::datadir
  sstables/storage: Evaluate path for keyspace directory in storage
  sstables/storage: Add sstables_manager arg to init_keyspace_storage()
2024-08-05 09:46:29 +03:00
Avi Kivity
2eff4b41ad repair: row_level: coroutinize working_row_hashes()
It uses do_with, so it allocates unconditionally. Might as well use
the allocation for a nice coroutine.

Closes scylladb/scylladb#19915
2024-08-05 08:55:34 +03:00
Anna Stuchlik
eca2dfd8c3 doc: add OS support for version 6.1
This commit adds OS support for version 6.1 and removes OS support for 5.4
(according to our support policy for versions).

Closes scylladb/scylladb#19992
2024-08-05 08:25:16 +03:00
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
Avi Kivity
cdee667170 alternator: destroy streamed json values gently
Large json return values are streamed to avoid memory pressure
and stalls, but are destroyed all at once. This in itself can cause
stalls [1].

Destroy them gently to avoid the stalls.

[1]

++[0#1/1 100%] addr=0x46880df total=514498 count=7004 avg=73:
|              seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}> at ./build/release/seastar.lto/./seastar/include/seastar/util/backtrace.hh:64
++           - addr=0x4680b35:
|              seastar::backtrace_buffer::append_backtrace_oneline at ./build/release/seastar.lto/./seastar/src/core/reactor.cc:839
|              (inlined by) seastar::print_with_backtrace at ./build/release/seastar.lto/./seastar/src/core/reactor.cc:858
++           - addr=0x46800f7:
|              seastar::internal::cpu_stall_detector::generate_trace at ./build/release/seastar.lto/./seastar/src/core/reactor.cc:1469
++           - addr=0x4680178:
|              seastar::internal::cpu_stall_detector::maybe_report at ./build/release/seastar.lto/./seastar/src/core/reactor.cc:1206
|              (inlined by) seastar::internal::cpu_stall_detector::on_signal at ./build/release/seastar.lto/./seastar/src/core/reactor.cc:1226
++           - addr=0x3dbaf: ?? ??:0
  ++[1#1/812 13%] addr=0x217b774 total=69336 count=990 avg=70:
  |               rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>::~GenericValue at /usr/include/rapidjson/document.h:721
  | ++[2#1/3 85%] addr=0x217b7db total=58974 count=842 avg=70:
  | |             rapidjson::GenericMember<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>::~GenericMember at /usr/include/rapidjson/document.h:71
  | |             (inlined by) rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>::~GenericValue at /usr/include/rapidjson/document.h:733
  | | ++[3#1/4 45%] addr=0x217b7db total=902102 count=12903 avg=70:
  | | |             rapidjson::GenericMember<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>::~GenericMember at /usr/include/rapidjson/document.h:71
  | | |             (inlined by) rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>::~GenericValue at /usr/include/rapidjson/document.h:733
  | | -> continued at addr=0x217b7db above
  | | |+[3#2/4 40%] addr=0x217b8b3 total=794219 count=11363 avg=70:
  | | |             rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>::~GenericValue at /usr/include/rapidjson/document.h:726
  | | | ++[4#1/1 100%] addr=0x217b7db total=909571 count=13012 avg=70:
  | | | |              rapidjson::GenericMember<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>::~GenericMember at /usr/include/rapidjson/document.h:71
  | | | |              (inlined by) rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>::~GenericValue at /usr/include/rapidjson/document.h:733
  | | | -> continued at addr=0x217b7db above
  | | |+[3#3/4 15%] addr=0x43d35a3 total=296768 count=4246 avg=70:
  | | |             seastar::shared_ptr_count_for<rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator> >::~shared_ptr_count_for at ././seastar/include/seastar/core/shared_ptr.hh:492
  | | |             (inlined by) seastar::shared_ptr_count_for<rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator> >::~shared_ptr_count_for at ././seastar/include/seastar/core/shared_ptr.hh:492
  | | | ++[4#1/2 98%] addr=0x43e7d06 total=289680 count=4144 avg=70:
  | | | |             seastar::shared_ptr<rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator> >::~shared_ptr at ././seastar/include/seastar/core/shared_ptr.hh:570
  | | | |             (inlined by) alternator::make_streamed(rapidjson::GenericValue<rapidjson::UTF8<char>, rjson::internal::throwing_allocator>&&)::$_0::operator() at ./alternator/executor.cc:127
  | | | ++          - addr=0x184e0a6:
  | | | |             std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/coroutine:240
  | | | |             (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose at ./build/release/seastar.lto/./seastar/include/seastar/core/coroutine.hh:125
  | | | |             (inlined by) seastar::reactor::run_tasks at ./build/release/seastar.lto/./seastar/src/core/reactor.cc:2651
  | | | |             (inlined by) seastar::reactor::run_some_tasks at ./build/release/seastar.lto/./seastar/src/core/reactor.cc:3114
  | | | | ++[5#1/1 100%] addr=0x2503b87 total=310677 count=4417 avg=70:
  | | | | |              seastar::reactor::do_run at ./build/release/seastar.lto/./seastar/src/core/reactor.cc:3283
  | | | |   ++[6#1/2 78%] addr=0x46a2898 total=400571 count=5450 avg=73:
  | | | |   |             seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0::operator() at ./build/release/seastar.lto/./seastar/src/core/reactor.cc:4501
  | | | |   |             (inlined by) std::__invoke_impl<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&> at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:61
  | | | |   |             (inlined by) std::__invoke_r<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&> at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/invoke.h:111
  | | | |   |             (inlined by) std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:290
  | | | |   ++          - addr=0x4673fda:
  | | | |   |             std::function<void ()>::operator() at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
  | | | |   |             (inlined by) seastar::posix_thread::start_routine at ./build/release/seastar.lto/./seastar/src/core/posix.cc:90
  | | | |   ++          - addr=0x8c946: ?? ??:0
  | | | |   ++          - addr=0x11296f: ?? ??:0
  | | | |   ++[6#2/2 22%] addr=0x2502c1e total=113613 count=1549 avg=73:
  | | | |   |             seastar::reactor::run at ./build/release/seastar.lto/./seastar/src/core/reactor.cc:3166
  | | | |   ++          - addr=0x22068e0:
  | | | |   |             seastar::app_template::run_deprecated at ./build/release/seastar.lto/./seastar/src/core/app-template.cc:276
  | | | |   ++          - addr=0x220630b:
  | | | |   |             seastar::app_template::run at ./build/release/seastar.lto/./seastar/src/core/app-template.cc:167
  | | | |   ++          - addr=0x22334bc:
  | | | |   |             scylla_main at ./main.cc:672
  | | | |   ++          - addr=0x20411cc:
  | | | |   |             std::function<int (int, char**)>::operator() at /usr/bin/../lib/gcc/x86_64-redhat-linux/13/../../../../include/c++/13/bits/std_function.h:591
  | | | |   |             (inlined by) main at ./main.cc:2072
  | | | |   ++          - addr=0x27b89: ?? ??:0
  | | | |   ++          - addr=0x27c4a: ?? ??:0
  | | | |   ++          - addr=0x28c8fb4:
  | | | |   |             _start at ??:?

Closes scylladb/scylladb#19968
2024-08-05 00:35:52 +03:00
Botond Dénes
c34127092d reader_concurrency_semaphore: test constructor: don't ignore metrics param
The for_tests constructor has a metrics parameter defaulted to
register_metrics::no, but when delegating to the other constructor, a
hard-coded register_metrics::no is passed. This makes no difference
currently, because all callers use the default and the hard-coded value
corresponds to it. Let's fix it nevertheless to avoid any future
surprises.

Closes scylladb/scylladb#20007
2024-08-04 21:14:42 +03:00
Laszlo Ersek
0933a52c0b test/sstable: remove useless variable from promoted_index_read()
The large_partition_schema() call returns a copy of the "schema_ptr"
object that points to an effectively statically initialized thread_local
"schema" object. The large_partition_schema() call has no bearing on
whether, or when, the "schema" object is constructed, and has no side
effects (other than copying an "lw_shared_ptr" object). Furthermore, the
return value of large_partition_schema() is not used for anything in
promoted_index_read().

This redundant call seems to date back to original commit 3dd079fb7a
("tests: add test for reading parts of a large partition", 2016-08-07).
Remove the call and the variable.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
bb58446258 test/sstable: rewrite promoted_index_read() with async()
For better readability, replace future::then() chaining with
future::get().

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
1f565626d4 test/sstable: unfuturize lambda invocation in test_using_reusable_sst*()
All lambdas passed to test_using_reusable_sst() and
test_using_reusable_sst_returning() have been converted to future::get()
calls (according to the seastar::thread context that they are now executed
in). None of the lambdas return futures anymore; they all directly return
void or non-void. Therefore, drop futurize_invoke(...).get() around the
lambda invocations in test_using_reusable_sst*().

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
8ea881ae04 test/sstable: rewrite wrong_range() with async()
For better readability, replace the future::then() chaining (and the
associated manual fiddling with object lifecycles) with future::get() (and
rely on seastar::thread's stack). We're already in seastar::thread
context.

Similarly, replace the future::finally() underlying with_closeable() with
deferred_close(); with the assumption that mutation_reader::close() never
fails (and is therefore safe to call in the "deferred_close" destructor).
This is actually guaranteed, as mutation_reader::close() is marked
"noexcept".

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
e7e9a0a696 test/sstable: simplify not_find_key_composite_bucket0() under test_using_reusable_sst()
According to early patch "test/sstable: rewrite test_using_reusable_sst()
with async" in this series, lambdas passed to test_using_reusable_sst()
are invoked:

(a) less importantly here, in seastar::thread context,

(b) more importantly here, futurized (temporarily so).

The test case not_find_key_composite_bucket0() doesn't chain futures;
therefore it needs no conversion to future::get() for purpose (a);
however, we can eliminate its empty future return. Fact (b) will cover for
that, until all such lambdas are converted to direct "void" returns (at
which point we can remove the futurization from
test_using_reusable_sst()).

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
95cf16708d test/sstable: rewrite full_index_search() with async()
For better readability, replace future::then() chaining with
future::get(). (We're already in seastar::thread context.)

This patch is best viewed with "git show -b".

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
2a27d5b344 test/sstable: simplify find_key*(), all_in_place() under test_using_reusable_sst()
According to early patch "test/sstable: rewrite test_using_reusable_sst()
with async" in this series, lambdas passed to test_using_reusable_sst()
are invoked:

(a) less importantly here, in seastar::thread context,

(b) more importantly here, futurized (temporarily so).

The test cases find_key_map(), find_key_set(), find_key_list(),
find_key_composite(), all_in_place() don't chain futures; therefore they
need no conversion to future::get() for purpose (a); however, we can
eliminate their empty future returns. Fact (b) will cover for that, until
all such lambdas are converted to direct "void" returns (at which point we
can remove the futurization from test_using_reusable_sst()).

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
d22bd93abb test/sstable: rewrite (un)compressed_random_access_read() with async()
For better readability, replace future::then() chaining with
future::get(). (We're already in seastar::thread context.)

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
6e35e584c8 test/sstable: simplify write_and_validate_sst()
All three lambdas passed to write_and_validate_sst() now use future::get()
rather than future::then() chaining; in other words, the future::get()
calls inside all these seastar::thread contexts have been pushed down to
the lambdas. Change all these lambdas' return types from future<> to void.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
8819b3f134 test/sstable: simplify check_toc_func() under async()
The lambda passed to write_and_validate_sst() already runs in
seastar::thread context; replace future::then() chaining with
future::get() calls.

We're going to eliminate the trailing "return make_ready_future<>()"
later.

This patch is best viewed with "git show -W -b".

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
de56883a17 test/sstable: simplify check_statistics_func() under async()
The lambda passed to write_and_validate_sst() already runs in
seastar::thread context; replace future::then() chaining with
future::get() calls.

We're going to eliminate the trailing "return make_ready_future<>()"
later.

This patch is best viewed with "git show -W -b".

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
1a85412f96 test/sstable: simplify check_summary_func() under async()
The lambda passed to write_and_validate_sst() already runs in
seastar::thread context; replace future::then() chaining with
future::get() calls.

We're going to eliminate the trailing "return make_ready_future<>()"
later.

This patch is best viewed with "git show -W -b".

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
7b21bce1ca test/sstable: coroutinize check_component_integrity()
check_component_integrity() does not rely on any deferred close or stop
operations; turn it into a coroutine therefore, for best readability.

This conversion demonstrates particularly well how much the stack eases
coding. We no longer need to artificially extend the lifetime of "tmp"
with a final

  .then([tmp] {})

future. Consequently, "tmp" no longer needs to be a shared pointer to an
on-heap "tmpdir" object; "tmp" can just be a "tmpdir" object on the stack.

While at it, eliminate the single-use local objects "s" and "gen", for
movability's sake. (We could use std::move() on these variables, but it
seems easier to just flatten the function calls that produce the
corresponding rvalues into the write_sst_info() argument list.)

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
caca13fe28 test/sstable: rewrite write_sst_info() with async()
For better readability, replace future::then() chaining with
future::get().

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
cfe92ee203 test/sstable: simplify missing_summary_first_last_sane()
The lambda passed to test_using_reusable_sst() is now invoked --
futurized, transitorily -- in seastar::thread context; stop returning an
explicit make_ready_future<>() from the lambda.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
10ebc0a2d2 test/sstable: coroutinize summary_query_fail()
summary_query_fail() does not rely on any deferred close or stop
operations; turn it into a coroutine therefore, for best readability.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
a403ad0703 test/sstable: rewrite summary_query() with async()
For better readability, replace future::then() chaining with
future::get(). (We're already in seastar::thread context.)

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
3a57a7cfea test/sstable: coroutinize (simple/composite)_index_read()
simple_index_read() and composite_index_read() do not rely on any deferred
close or stop operations; turn them into coroutines therefore, for best
readability.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
eeeab1110a test/sstable: rewrite index_read() with async()
For better readability, replace future::then() chaining with
future::get(). (We're already in seastar::thread context.)

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
17d4fac669 test/sstable: rewrite test_using_reusable_sst() with async()
Improve the readability of test_using_reusable_sst() by replacing
future::then() chaining with test_env::do_with_async() and future::get().

Unlike seastar::async(), test_env::do_with_async() restricts its input
lambda to returning "void". Because of this, introduce the variant
test_using_reusable_sst_returning(), based on
test_env::do_with_async_returning(), for lambdas returning non-void. Put
the latter to use in index_read() at once.

Subsequently, we'll gradually convert the lambdas passed to
test_using_reusable_sst() and test_using_reusable_sst_returning() from
returning futures to returning direct values. In order for
test_using_reusable_sst() and test_using_reusable_sst_returning() to cope
with both types of lambdas, wrap the lambdas into futurize_invoke().get().
In the seastar::thread context, future::get() will gracefully block on
genuine futures, and return immediately on direct values that were
futurized on the spot.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Laszlo Ersek
79a8a6c638 test/sstable: rewrite test_using_working_sst() with async()
Make test_using_working_sst() easier to read by:

(1) replacing test_env::do_with() with seastar::async(),
    seastar::defer(), and future::get();

(2) replacing seastar::async() and seastar::defer() with
    test_env::do_with_async().

Technically speaking, this change does not perfectly preserve exceptional
behavior. Namely, test_env::do_with() uses future::finally() to link
test_env::stop() to the chain of futures, and future::finally() permits
test_env::stop() itself to throw an exception -- potentially leading to a
seastar::nested_exception being thrown, which would carry both the
original exception and the one thrown by test_env::stop().

Contrarily, the test_env::stop() deferred with seastar::defer() runs in a
destructor, and therefore test_env::stop() had better not throw there.

However, we will assume that test_env::stop() does not throw, albeit not
marked "noexcept". Prior commits 8d704f2532 ("sstable_test_env:
Coroutinize and move to .cc test_env::stop()", 2023-10-31) and
2c78b46c78 ("sstables::test_env: Carry compaction manager on board",
2023-10-31) show that we've considered individual actions in
test_env::stop() not to throw before.

The 128KB stack of seastar::thread (which underlies seastar::async())
should be a tolerable cost in a test case, in exchange for the improved
readability.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-04 15:35:51 +02:00
Kefu Chai
0660675387 utils/div_ceil: add constraints to template arguments
to better reflect what we expect from the arguments.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20003
2024-08-04 15:32:01 +03:00
Aleksandra Martyniuk
2ab56b7f56 repair: use find_column_family in insert_repair_meta
repair_service::insert_repair_meta gets the reference to a table
and passes it to continuations. If the table is dropped in the meantime,
the reference becomes invalid.

Use find_column_family at each table occurrence in insert_repair_meta
instead.

Closes scylladb/scylladb#19953
2024-08-04 13:56:38 +03:00
Kefu Chai
571ae0ac96 docs: link to current document instead of the github wiki
before this change, the hyper link brings us to a GitHub wiki page,
which just points the reader to
https://docs.scylladb.com/operating-scylla/snitch/.
this is not a great user experience.

so, in this change, we just reference the document in the current build.
more efficient this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19952
2024-08-04 11:47:21 +03:00
Kefu Chai
f7556edc65 build: cmake: define SCYLLA_ENABLE_PREEMPTION_SOURCE for dev build
in fabab2f4, we introduced preemption_source, and added
`SCYLLA_ENABLE_PREEMPTION_SOURCE` preprocessor macro to enable
opt-in the pluggable preemption check.

but CMake building system was not updated accordingly.

so, in this change, let's sync the CMake building system with
`configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19951
2024-08-04 11:46:28 +03:00
Yaron Kaikov
8221a178d8 Revert "dist: support nonroot and offline mode for scylla-housekeeping"
This reverts commit c3bea539b6.

Since it breaking offline-installer artifact-tests. Also, it seems that we should have merged it in the first place since we don't need scylla-housekeeping checks for offline-installer

Closes scylladb/scylladb#19976
2024-08-04 10:55:26 +03:00
Aleksandra Martyniuk
c456a43173 compaction: replace optional<task_info> with task_info param
compaction_manager::perform_compaction does not create task manager
task for compaction if parent_info is set to std::nullopt. Currently,
we always want to create task manager task for compaction.

Remove optional from task info parameters which start compaction.
Track all compactions with task manager.
2024-08-02 14:38:46 +02:00
Aleksandra Martyniuk
108d0344b8 compaction: keep split executor in task manager
If perform_compaction gets std::nullopt as a parent info then
the executor won't be tracked by task manager.

Modify storage_group::split call so that it passes empty task_info
instead of nullopt to track split.
2024-08-02 12:45:32 +02:00
Wojciech Mitros
543dab9e88 mv: test the view update behavior
With the recently added mv admission control, we can now
test how are the view update backlogs updated and propagated
without relying just on the response delays that it was causing
until now.
This patch adds a test for it, replicating issues scylladb/scylladb#18461
and scylladb/scylladb#18783.
In the test, we start with an empty view update backlog, then perform
a write to it, increasing its backlog and saving the updated backlog
on coordinator, the backlog then drops back to 0, we wait 1s for the
backlog to be gossiped and we perform another write which should
succeed.
Due to scylladb/scylladb#18461, the test would fail because
in both gossip rounds before and after the write, the backlog was empty,
causing the write to be blocked by admission control indefinitely.
Due to scylladb/scylladb#18783, the test would fail because when
the backlog drops back to 0 after the write, the change is never
registered, causing all writes to be blocked as well.
2024-08-02 12:12:24 +02:00
Wojciech Mitros
795ac177c2 mv: add test for admission control
In this patch we add 2 tests for checking that the mv admission control works.
The first one simply checks whether, after increasing the backlog on one node
over the admission control threshold, the following request is rejected with
the error message corresponding to the admission control.
The second one checks whether, after triggering admission control, the entire
user request fails instead of just failing a replica write. This is done by
performing a number of writes, some of which trigger the admission control
and cause retries, then checking if the node that had a large view update backlog
received all the writes. Before, the writes would succeed on enough replicas,
reaching QUORUM, and allowing the user write to succeed and cause no retries,
even though on the replica with a high backlog the write got rejected due to
the backlog size.
2024-08-02 12:12:24 +02:00
Wojciech Mitros
a55b7688b6 storage_proxy: return overloaded_exception instead of throwing
To avoid an expensive stack unwind, instead of throwing an error,
we can just return it thanks to the boost::result type that the
affected methods use. The result with an exception needs to be
constructed not implicitly, but with boost::outcome_v2::failure,
because the exception, converted into coordinator_exception_container
can be then converted into both into a successful response_id_type
as well as into a failure.
2024-08-02 12:12:24 +02:00
Wojciech Mitros
5eaae05aaf mv: reject user requests by coordinator when a replica is overloaded by MVs
Currently, when a replica's view update backlog is full, the write is still
sent by the coordinator to all replicas. Because of the backlog, the write
fails on the replica, causing inconsistency that needs to be fixed by repair.
To avoid these inconsistencies, this patch adds a check on the coordinator
for overloaded replicas. As a result, a write may be rejected before being
sent to any replicas and later retried by the user, when the replica is no
longer overloaded.

Fixes scylladb/scylladb#17426
2024-08-02 12:12:19 +02:00
Piotr Dulikowski
39b49a41cc Merge 'mv: delete a partition in a single operation when applicable' from Michael Litvak
Currently when a partition is deleted from the base table, we generate a
row tombstone update for each one of the view rows in the partition.

When the partition key in the view is the same as the base, maybe in a
different order, this can be done more efficiently - The whole corresponding
view partition can be deleted with one partition tombstone update.

With this commit, when generating view updates, if the update mutation has a
partition tombstone then for the views which have the same partition key
we will generate a partition tombstone update, and skip the individual
row tombstone updates.

Fixes scylladb/scylladb#8199

Closes scylladb/scylladb#19338

* github.com:scylladb/scylladb:
  mv: skip reading rows when generating partition tombstone update
  mv: delete a partition in a single operation when applicable
  cql-pytest: move ScyllaMetrics to util file to allow reuse
2024-08-02 11:00:18 +02:00
Michael Litvak
0f5e8c52ad db: test counter update while table is dropped
Add a test that drops a table while there is a counter update operation
ongoing in the table.
The test reproduces issue scylladb/scylla-enterprise#4475 and verifies
it is fixed.
2024-08-01 22:23:17 +03:00
Avi Kivity
99d0aaa7d2 Merge 'tablets: load_balancer: Improve per-table balance' from Tomasz Grabiec
Tablet load balancer tries to equalize tablet load between shards by
moving tablets. Currently, the tablet load balancer assumes that each
tablet has the same hotness. This may not be true, and some tables may
be hotter than others. If some nodes end up getting more tablets of
the hot table, we can end up with request load imbalance and reduced
performance.

In 79d0711c7e we implemented a
mitigation for the problem by randomly choosing the table whose tablet
replica should be moved. This should improve fairness of
movement. However, this proved to not be enough to get a good
distribution of tablets.

This change improves candidate selection to not relay on randomness
but rather evaluating candidates with respect to the impact on load
imbalance.  Also, if there is no good candidate, we consider picking
other source shards, not the most-loaded one. This is helpful because
when finishing node drain we get just a few candidates per shard, all
of which may belong to a single table, and the destination may already
be overloaded with that table. Another shard may contain tablets of
another table which is not yet overloaded on the destination. And
shards may be of similar load, so it doesn't matter much which shard
we choose to unload.

We also consider other destinations, not the least-loaded one. This
helps when draining nodes and the source node has few shard
candidates. Shards on the destination may have similar load so there
is more than one good destinatin candidate. By limiting ourselves to a
single shard, we increase the chance that we're overload the table on
that shard.

The algorithm was evaluated using "scylla perf-load-balancing", which
simulates a sequeunce of 8 node bootstraps and decommissions for
different node and shard counts, RF, and tablet counts.

For example, for the following parameters:

  params: {iterations=8, nodes=5, tablets1=128 (2.4/sh), tablets2=512 (9.6/sh), rf1=3, rf2=3, shards=32}

The results are:

Before:

  Overcommit (old) : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
  Overcommit (old) : worst: {table1={shard=4.00 (best=1.25), node=1.81}, table2={shard=1.25 (best=1.04), node=1.11}}
  Overcommit (old) : last : {table1={shard=2.50 (best=1.25), node=1.41}, table2={shard=1.25 (best=1.04), node=1.05}}

After:

  Overcommit       : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
  Overcommit       : worst: {table1={shard=1.50 (best=1.25), node=1.02}, table2={shard=1.12 (best=1.04), node=1.01}}
  Overcommit       : last : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}

So worst shard overcommit for table1 was reduced from 4 to 1.5. Overcommit
of 4 means that the most-loaded shard has 4 times more tablets than
the average per-shard load in the cluster.

Also, node overcommit for table1 was reduced from 1.81 to 1.02.

The magnitude of improvement depends greatly on test configurtion, so on topology and tablet distribution.

The algorithm is not perfect, it finds a local optimum. In the above
test, overcommit of 1.5 is not the best possible (1.25).

One of the reason why the current algorithm doesn't achieve best
distribution is that it works with a single movement at a time and
replication constraints limit the choice of destinations. Viable
destinations for remaining candidates may by only on nodes which are
not least-loaded, and we won't be able to fill the least loaded
node. Doing so would require more complex movement involving moving a
tablet from one of the destination nodes which doesn't have a replica
on the least loaded node and then replacing it with the candidate from
the source node.

Another limitation is that the algorithm can only fix balance by
moving tablets away from most loaded nodes, and it does so due to
imbalance between nodes. So it cannot fix the imbalance which is
already present on the nodes if there is not much to move due to
similar load between nodes. It is designed to not make the imbalance
worse, so it works good if we started in a good shape.

Fixes https://github.com/scylladb/scylladb/issues/16824

Closes scylladb/scylladb#19779

* github.com:scylladb/scylladb:
  test: perf: tablet_load_balancing: Test with higher shard and tablet counts
  tablets: load_balancer: Avoid quadratic complexity when finding best candidate
  tablets: load_balancer: Maintain load sketch properly during intra-node migration
  tablets: load_balancer: Use "drained" flag
  test: perf: tablet_load_balancing: Report load balancer stats
  tablets: load_balancer: Move load_balancer_stats_manager to header file
  tablets: load_balancer: Split evaluate_candidate() into src and dst part
  tablets: load_balancer: Optimize evaluate_candidate()
  tablets: load_balancer: Add more statistics
  tablets: load_balancer: Track load per table on cluster level
  tablets: load_balancer: Track load per table on node level
  tablets: load_balancer: Use a single load sketch for tracking all nodes
  locator: load_sketch: Introduce populate_dc()
  tablets: load_balancer: Modify target load sketch only when emitting migration
  locator: load_sketch: Introduce get_most_loaded_shard()
  locator: load_sketch: Introduce get_least_loaded_shard()
  locator: load_sketch: Optimize pick()/unload()
  locator: load_sketch: Introduce load_type
  test: perf: tablet_load_balancing: Report total tablet counts
  test: perf: tablet_load_balancing: Print run parameters in the single simulation case too
  test: perf: tablet_load_balancing: Report time it took to schedule migrations
  tablets: load_balancer: Log table load stats after each migration
  tablets: load_balancer: Log per-shard load distribution in debug level
  tablets: load_balancer: Improve per-table balance
  tablets: load_balancer: Extract check_convergence()
  tablets: load_balancer: Extract nodes_by_load_cmp
  tablets: load_balancer: Maintain tablet count per table
  tablets: load_balancer: Reuse src_node_info
  test: perf: tablet_load_balancing: Print warnings about bad overcommit
  test: perf: tablet_load_balancing: Allow running a single simulation
  test: perf: tablet_load_balancing: Report best possible shard overcommit
  test: perf: tablet_load_balancing: Report global shard overcommit
2024-08-01 21:12:14 +03:00
Michael Litvak
22b282f5c5 db: coroutinize do_apply_counter_update
rewrite the function as coroutine to make it easier to read and maintain,
following lifetime issues we had and fixed in this function.
2024-08-01 19:09:04 +03:00
Anna Stuchlik
9972e50134 doc: add the 6.0-to-6.1 upgrade guide
This commit adds the 6.0-to-6.1 upgrade guide.

Compared to the previous upgrade guide:

- Added the "Ensure Consistent Topology Changes Are Enabled" prerequisite.
- Removed the "After Upgrading Every Node" section. Both Raft-based schema changes and topology updates
  are mandatory in 6.1 and don't require any user action after upgrading to 6.1.
- Removed the "Validate Raft Setup" section. Raft was enabled in all 6.0 clusters (for schema management),
  so now there's no scenario that would require the user to follow the validation procedure.
2024-08-01 14:58:14 +02:00
Piotr Smaron
0ea2128140 cql: refactor rf_change indentation 2024-08-01 14:37:53 +02:00
Piotr Smaron
5b089d8e10 Prevent ALTERing non-existing KS with tablets
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required
   to execute this req,
2. global topo req is executed by topo coordinator, which reads data
   attached to the req.

The KS name is among the data attached to the req.
There's a time window between these steps where a to-be-altered KS could
have been DROPped, which results in topo coordinator forever trying to
ALTER a non-existing KS. In order to avoid it, the code has been changed
to first check if a to-be-altered KS exists, and if it's not the case,
it doesn't perform any schema/tablets mutations, but just removes the
global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected
changes, which is due to the fact that the code is written badly and
needs to be refactored - an effort that's already planned under #19126

Fixes: #19576
2024-08-01 14:37:53 +02:00
Piotr Dulikowski
44f327675d Merge 'Remove gossiper argument from storage_service::join_cluster()' from Pavel Emelyanov
It's only needed to start hints via proxy, but proxy can do it without gossiper argument

Closes scylladb/scylladb#19894

* github.com:scylladb/scylladb:
  storage_service: Remote gossiper argument from join_cluster()
  proxy: Use remote gossiper to start hints resource manager
  hints: Const-ify gossiper references and anchor pointers
2024-08-01 10:18:14 +02:00
Michael Litvak
c944e28e43 db: fix waiting for counter update operations on table stop
When a table is dropped it should wait for all pending operations in the
table before the table is destroyed, because the operations may use the
table's resources.
With counter update operations, currently this is not the case. The
table may be destroyed while there is a counter update operation in
progress, causing an assert to be triggered due to a resource being
destroyed while it's in use.
The reason the operation is not waited for is a mistake in the lifetime
management of the object representing the write in progress. The commit
fixes it so the object lives for the duration of the entire counter
update operation, by moving it to the `do_with` list.

Fixes scylladb/scylla-enterprise#4475

Closes scylladb/scylladb#19948
2024-08-01 09:39:49 +02:00
Nadav Har'El
5411559a94 test/cql-pytest: test ALLOW FILTERING in intersection of two indexes
A user complained that ScyllaDB is incompatible with Cassandra when it
requires ALLOW FILTERING on a restriction like WHERE x=1 AND y=1 where
x and y are two columns with secondary indexes.

In the tests added in this patch we show that:

1. Scylla *is* compatible with Cassandra when the traditional "CREATE
   INDEX" is used - ALLOW FILTERING *is* required in this case in both
   Cassandra and Scylla.

2. If SAI is used in Cassandra (CREATE CUSTOM INDEX USING 'SAI'),
   indeed ALLOW FILTERING becomes optional. I believe this is incorrect
   so I opened CASSANDRA-19795.

These two tests combined show that we're not incompatible with Cassandra,
rather Cassandra's two index implementations are incompatible between
themselves, and Scylla is in fact compatible in this case with Cassadra's
traditional index and not with SAI.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19909
2024-07-31 14:01:29 +03:00
Laszlo Ersek
e67eb0ccc1 test/sstable: coroutinize do_write_sst()
Make do_write_sst() easier to read by coroutinizing it.

Closes #19803.

Suggested-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>

Closes scylladb/scylladb#19937
2024-07-31 13:59:26 +03:00
Kefu Chai
020333fcf1 sstables: fix a typo in comment
s/guranteed/guaranteed/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19946
2024-07-31 13:58:09 +03:00
Tomasz Grabiec
28de5231f4 test: perf: tablet_load_balancing: Test with higher shard and tablet counts
We have up to 200 shards in production, so test this to catch
performance issues.
2024-07-31 12:57:15 +02:00
Tomasz Grabiec
19b7fb3a4d tablets: load_balancer: Avoid quadratic complexity when finding best candidate
If the source and destination shards picked for migration based on
global tablet balance do not have a good candidate in terms of effect
on per-table balance, the algorithm explores other source shards and
destinations. This has quadratic complexity in terms of shard count in
the worst case, when there are no good candidates.

Since we can have up to ~200 shards, this can slow down scheduling
significantly. I saw total scheduling time of 5 min in the following run:

 scylla perf-load-balancing -c1 -m1G  --iterations=8 \
    --nodes=4 --tablets1=1024 --tablets2=8096 \
    --rf1=2 --rf2=3 --shards=256

To improve, change the apprach to first find the best source shard and
then best target shard, sequentially. So it's now linear in terms of
shard count.

After the change, the total scheduling time in that run is down to 4s.

Minimizing source and destination metrics piece-wise minimizes the
combined metric, so badness of the best candidate doesn't suffer after
this change.
2024-07-31 12:57:15 +02:00
Tomasz Grabiec
93df82032f tablets: load_balancer: Maintain load sketch properly during intra-node migration
Affects only intra-node migration. The code was recording destination
shard as taken and did not un-take it in case we skipped the migration
due to lack of candidates.

Noticed during code review. Impact is minor, since even if this leads
to suboptimal balance, the next scheduling round should fix it.

Also, the source shard was not unloaded, but that should have no
impact on decisions. But to be future-proof, better to maintain the
load accurately in case the algorithm is extended with more steps.
2024-07-31 12:57:15 +02:00
Tomasz Grabiec
88988ce0db tablets: load_balancer: Use "drained" flag
Cleanup / optimization.
2024-07-31 12:57:15 +02:00
Tomasz Grabiec
56801b7cb7 test: perf: tablet_load_balancing: Report load balancer stats 2024-07-31 12:57:15 +02:00
Tomasz Grabiec
90c9934099 tablets: load_balancer: Move load_balancer_stats_manager to header file
So that stats can be accessed outside tablet allocator.
2024-07-31 12:57:15 +02:00
Anna Stuchlik
ae28880fc8 doc: enable publishing docs for branch-6.1
This commit enables publishing documentation from branch-6.1. The docs will be published as UNSTABLE (the warning about version 6.1 being unstable will be displayed).

Fixes https://github.com/scylladb/scylladb/issues/19926

No backport is required.

Closes scylladb/scylladb#19931
2024-07-31 12:48:51 +02:00
Kamil Braun
c05e077a13 Merge 'raft: fix the shutdown phase being stuck' from Emil Maskovsky
Some of the calls inside the `raft_group0_client::start_operation()` method were missing the abort source parameter. This caused the repair test to be stuck in the shutdown phase - the abort source has been triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the ownership of the raft group semaphore (`get_units(semaphore)`) - these waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes scylladb/scylladb#19223

Closes scylladb/scylladb#19860

* github.com:scylladb/scylladb:
  raft: fix the shutdown phase being stuck
  raft: use the abort source reference in raft group0 client interface
2024-07-31 12:10:30 +02:00
Pavel Emelyanov
93ed978729 view_builder: Drop unused members
There's a counter and a shared future on board, that used to facilitate
start-time barrier synchronization. Now they are not needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-31 12:59:40 +03:00
Pavel Emelyanov
613161c7b9 view_builder: Use cross-shard barrier on start
When starting, view builder spawns an async background fibers, and upon
its completion each shard needs to wait for other shards to do the same.
This is exactly what cross-shard barrier is about, so instead of
synchronizing via v.b.'s shard-0 instance, use the barrier. This makes
the view_builder::start() shorder and earier to read.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-31 12:56:25 +03:00
Pavel Emelyanov
fb1b749445 view_builder: Add cross-shard barrier to its .start() method
The barrier will be used by next patch to synchronize shards with each
other. When passed to invoke_on_all() lambda like this, each lambda gets
its its copy of the barrier "handler" that maintains shared state across
shards.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-31 12:54:28 +03:00
Tomasz Grabiec
94cce4b7d3 tablets: load_balancer: Split evaluate_candidate() into src and dst part
Those parts will be used separately later.
2024-07-31 11:38:17 +02:00
Tomasz Grabiec
4df2abe47a tablets: load_balancer: Optimize evaluate_candidate()
Moves load computation out of the hot path by relying on
data structures maintained globally during plan making.
2024-07-31 11:38:17 +02:00
Tomasz Grabiec
5e7facd543 tablets: load_balancer: Add more statistics 2024-07-31 11:38:17 +02:00
Tomasz Grabiec
be055977c9 tablets: load_balancer: Track load per table on cluster level 2024-07-31 11:38:17 +02:00
Tomasz Grabiec
81fcee2040 tablets: load_balancer: Track load per table on node level 2024-07-31 11:38:17 +02:00
Tomasz Grabiec
e7ef7419dc tablets: load_balancer: Use a single load sketch for tracking all nodes
This is code simplification and optimization.

Avoids multiple passes of tablet metadata to consturct load sketch for
each target node.
2024-07-31 11:38:17 +02:00
Tomasz Grabiec
352b8e0ddd locator: load_sketch: Introduce populate_dc() 2024-07-31 11:38:17 +02:00
Tomasz Grabiec
9a7afd334b tablets: load_balancer: Modify target load sketch only when emitting migration
This avoids the need to unpick() a replica when the candidate is not
selected. Optimization.
2024-07-31 11:38:17 +02:00
Tomasz Grabiec
b78657ce7d locator: load_sketch: Introduce get_most_loaded_shard() 2024-07-31 11:38:17 +02:00
Tomasz Grabiec
de404471b7 locator: load_sketch: Introduce get_least_loaded_shard() 2024-07-31 11:38:17 +02:00
Tomasz Grabiec
8fbfd595bb locator: load_sketch: Optimize pick()/unload()
They are executed frequently during tablet scheduling. Currently, they
have time complexity of O(N*log(N)) in terms of shard count. With
large shard counts, that has significant overhead.

This patch optimizes them down to O(log(N)).
2024-07-31 11:38:17 +02:00
Tomasz Grabiec
d0b0f95849 locator: load_sketch: Introduce load_type 2024-07-31 11:38:17 +02:00
Tomasz Grabiec
8f3b623144 test: perf: tablet_load_balancing: Report total tablet counts 2024-07-31 11:38:17 +02:00
Tomasz Grabiec
662a0ff038 test: perf: tablet_load_balancing: Print run parameters in the single simulation case too 2024-07-31 11:38:16 +02:00
Tomasz Grabiec
a040404875 test: perf: tablet_load_balancing: Report time it took to schedule migrations 2024-07-31 11:38:16 +02:00
Tomasz Grabiec
ae7fd80554 tablets: load_balancer: Log table load stats after each migration 2024-07-31 11:38:16 +02:00
Tomasz Grabiec
b8996a0f59 tablets: load_balancer: Log per-shard load distribution in debug level 2024-07-31 11:38:16 +02:00
Tomasz Grabiec
469e2f3f90 tablets: load_balancer: Improve per-table balance
Tablet load balancer tries to equalize tablet load between shards by
moving tablets. Currently, the tablet load balancer assumes that each
tablet has the same hotness. This may not be true, and some tables may
be hotter than others. If some nodes end up getting more tablets of
the hot table, we can end up with request load imbalance and reduced
performance.

In 79d0711c7e we implemented a
mitigation for the problem by randomly choosing the table whose tablet
replica should be moved. This should improve fairness of
movement. However, this proved to not be enough to get a good
distribution of tablets.

This change improves candidate selection to not relay on randomness
but rather evaluating candidates with respect to the impact on load
imbalance.  Also, if there is no good candidate, we consider picking
other source shards, not the most-loaded one. This is helpful because
when finishing node drain we get just a few candidates per shard, all
of which may belong to a single table, and the destination may already
be overloaded with that table. Another shard may contain tablets of
another table which is not yet overloaded on the destination. And
shards may be of similar load, so it doesn't matter much which shard
we choose to unload.

We also consider other destinations, not the least-loaded one. This
helps when draining nodes and the source node has few shard
candidates. Shards on the destination may have similar load so there
is more than one good destinatin candidate. By limiting ourselves to a
single shard, we increase the chance that we're overload the table on
that shard.

The algorithm was evaluated using "scylla perf-load-balancing", which
simulates a sequeunce of 8 node bootstraps and decommissions for
different node and shard counts, RF, and tablet counts.

For example, for the following parameters:

  params: {iterations=8, nodes=5, tablets1=128 (2.4/sh), tablets2=512 (9.6/sh), rf1=3, rf2=3, shards=32}

The results are:

After:

  Overcommit       : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
  Overcommit       : worst: {table1={shard=1.50 (best=1.25), node=1.02}, table2={shard=1.12 (best=1.04), node=1.01}}
  Overcommit       : last : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}

Before:

  Overcommit (old) : init : {table1={shard=1.25 (best=1.25), node=1.00}, table2={shard=1.04 (best=1.04), node=1.00}}
  Overcommit (old) : worst: {table1={shard=4.00 (best=1.25), node=1.81}, table2={shard=1.25 (best=1.04), node=1.11}}
  Overcommit (old) : last : {table1={shard=2.50 (best=1.25), node=1.41}, table2={shard=1.25 (best=1.04), node=1.05}}

So shard overcommit for table1 was reduced from 4 to 1.5. Overcommit
of 4 means that the most-loaded shard has 4 times more tablets than
the average per-shard load in the cluster.

Also, node overcommit for table1 was reduced from 1.81 to 1.02.

The magnitude of improvement depends greatly on test configurtion, so on topology and tablet distribution.

The algorithm is not perfect, it finds a local optimum. In the above
test, overcommit of 1.5 is not the best possible (1.25).

One of the reason why the current algorithm doesn't achieve best
distribution is that it works with a single movement at a time and
replication constraints limit the choice of destinations. Viable
destinations for remaining candidates may by only on nodes which are
not least-loaded, and we won't be able to fill the least loaded
node. Doing so would require more complex movement involving moving a
tablet from one of the destination nodes which doesn't have a replica
on the least loaded node and then replacing it with the candidate from
the source node.

Another limitation is that the algorithm can only fix balance by
moving tablets away from most loaded nodes, and it does so due to
imbalance between nodes. So it cannot fix the imbalance which is
already present on the nodes if there is not much to move due to
similar load between nodes. It is designed to not make the imbalance
worse, so it works good if we started in a good shape.

Fixes #16824
2024-07-31 11:38:16 +02:00
Tomasz Grabiec
b7661aa6c9 tablets: load_balancer: Extract check_convergence()
Will be reused when evaluating different targets for migration in later
stages.

The refactoring drops updating of _stats.for_dc(dc).stop_no_candidates
and we update _stats.for_dc(dc).stop_load_inversion in both cases
where convergence check may fail. The reason is that stat updates must
be outside check_convergence(), since the new use case should not
update those stats (it doesn't stop balancing, just drops
candidates). Propagating the information for distinguishing the two
cases would be a burden. But it's not necessary, since both cases are
actually load inversion cases, one pre-migration the other
post-migration, so we don't need the distinction.

It's actually wrong to increment stop_no_candidates, since there may
still be candidates, it's the load which is inverted.
2024-07-31 11:26:11 +02:00
Tomasz Grabiec
41e643ddb9 tablets: load_balancer: Extract nodes_by_load_cmp
Will be reused in a different place.
2024-07-31 11:26:11 +02:00
Tomasz Grabiec
8a7257971d tablets: load_balancer: Maintain tablet count per table 2024-07-31 11:26:11 +02:00
Tomasz Grabiec
4e4f13ac9d tablets: load_balancer: Reuse src_node_info 2024-07-31 11:26:11 +02:00
Tomasz Grabiec
71b8d6b7aa test: perf: tablet_load_balancing: Print warnings about bad overcommit 2024-07-31 11:26:11 +02:00
Tomasz Grabiec
0d50a028a5 test: perf: tablet_load_balancing: Allow running a single simulation 2024-07-31 11:26:11 +02:00
Tomasz Grabiec
3f3660c3fe test: perf: tablet_load_balancing: Report best possible shard overcommit 2024-07-31 11:26:11 +02:00
Tomasz Grabiec
c89a320925 test: perf: tablet_load_balancing: Report global shard overcommit
Rather than maximum per-node shard overcommit. Global shard overcommit
is a better metric since we want to equalize global load not just
per-node load.
2024-07-31 11:26:11 +02:00
Emil Maskovsky
5dfc50d354 raft: fix the shutdown phase being stuck
Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to be stuck in the shutdown phase - the abort source has been
triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes scylladb/scylladb#19223
2024-07-31 09:18:54 +02:00
Emil Maskovsky
2dbe9ef2f2 raft: use the abort source reference in raft group0 client interface
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.
2024-07-31 09:18:54 +02:00
Benny Halevy
82333036f3 cell_locker: maybe_rehash: reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-31 10:06:07 +03:00
Benny Halevy
8853adea96 cell_locker: maybe_rehash: ignore allocation failures
`maybe_rehash` is complimentary and is not strictly required to succeed.
If it fails, it will retry on the next call, but there's no reason
to throw a bad_alloc exception that will fail its caller, since `maybe_rehash`
is called as the final step after the caller has already succeeded with
its action.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-31 10:06:06 +03:00
Pavel Emelyanov
9214aecbe7 storage_service: Remove orphan forward declaration of a method
The start_sys_dist_ks() itself was removed by bc051387c5

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#19928
2024-07-30 16:17:49 +03:00
Benny Halevy
e58ca8c44b service_level_controller: stop: always call subscription on_abort
We want to call `service_level_controller::do_abort()` in all cases.
The current code (introduced in
535e5f4ae7)
calls do_abort if abort was not requested, however, since
it does so by checking the subscription bool operator,
it would miss the case where abort was already requested
before the subscription took place (in service_level_controller
ctor).

With scylladb/seastar@470b539b1c and
scylladb/seastar@8ecce18c51
we can just unconditionally call the subscription `on_abort`
method, that ensures only-once semantics, even if abort
was already requested at subscription time.

Fixes scylladb/scylladb#19075

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#19929
2024-07-30 13:23:17 +03:00
Kefu Chai
35394c3f9a docs/dev: fix a typo
remove the extraneous "is".

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19902
2024-07-30 10:46:25 +03:00
Pavel Emelyanov
97154b0671 Merge 'mapreduce_service: complete coroutinization' from Avi Kivity
mapreduce_server was previously coroutinized, but only partially. This
series completes coroutinization and eliminates remaining continuation chains.

None of this code is performance sensitive as it runs at the super-coordinator level
and is amortized over a full scan of the entire table.

No backport needed as this is a cleanup.

Closes scylladb/scylladb#19913

* github.com:scylladb/scylladb:
  mapreduce_service: reindent
  mapreduce_service: coroutinize retrying_dispatcher::dispatch_to_node()
  mapreduce_service: coroutinize dispatch() inner lambda
2024-07-30 10:44:34 +03:00
Nadav Har'El
d293a5787f alternator: exclude CDC log table from ListTables
The Alternator command ListTables is supposed to list actual tables
created with CreateTable, and should list things like materialized views
(created for GSI or LSI) or CDC log tables.

We already properly excluded materialized views from the list - and
had the tests to prove it - but forgot both the exclusion and the testing
for CDC log tables - so creating a table xyz with streams enable would
cause ListTables to also list "xyz_scylla_cdc_log".

This patch fixes both oversights: It adds the code to exclude CDC logs
from the output of ListTables, add adds a test which reproduces the bug
before this fix, and verifies the fix works.

Fixes #19911.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19914
2024-07-30 10:43:29 +03:00
Nadav Har'El
ca8b91f641 test: increase timeouts for /localnodes test
In commit bac7c33313 we introduced a new
test for the Alternator "/localnodes" request, checking that a node
that is still joining does not get returned. The tests used what I
thought were "very high" timeouts - we had a timeout of 10 seconds
for starting a single node, and injected a 20 second sleep to leave
us 10 seconds after the first sleep.

But the test failed in one extremely slow run (a debug build on
aarch64), where starting just a single node took more than 15 seconds!

So in this patch I increase the timeouts significantly: We increase
the wait for the node to 60 seconds, and the sleeping injection to
120 seconds. These should definitely be enough for anyone (famous
last words...).

The test doesn't actually wait for these timeouts, so the ridiculously
high timeouts shouldn't affect the normal runtime of this test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19916
2024-07-30 10:41:48 +03:00
Avi Kivity
52ee6127dd Merge 'Use boto3 in object_store test to list bucket' from Pavel Emelyanov
There's a test in object_store suite that verifies the contents of a bucket. It does with the plain http request, but unfortunately this doesn't work -- even local minio uses restricted bucket and using plain http request results in 403(Forbidden) error code. Test doesn't check it and continues working with empty list of objects which, in turn, is what it expects to see.

The fix is in using boto3. With it, the acc/secret pair is picked up and listing the bucket finally works.

Closes scylladb/scylladb#19889

* github.com:scylladb/scylladb:
  test/object_store: Use boto3.resource to list bucket
  test/object_store: Add get_s3_resource() helper
2024-07-29 13:49:50 +03:00
Pavel Emelyanov
8b1a106b62 test/object_store: Use boto3.resource to list bucket
Instead of plain http request, use the power of boto3 package. The
recently added get_s3_resource() facilitates creating one

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-29 12:29:16 +03:00
Pavel Emelyanov
172e1cb0da test/object_store: Add get_s3_resource() helper
It creates boto3.resource object that points to endpoint maintained
by s3_server argument (that tests obtain via fixture). This allows
using boto3 to access S3 bucket from local minio server.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-29 12:25:57 +03:00
Kefu Chai
1094c71282 cql3/statement: use compile-time format string
instead of using fmt::runtime, use compile-time format string in
order to detect the bad format string, or missing format arguments,
or arguments which are not formattable at compile time.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19901
2024-07-28 21:54:43 +03:00
Benny Halevy
be880ab22c Update seastar submodule
* seastar 67065040...a7d81328 (30):
  > reactor: Initialize _aio_pollfd later
  > abortable_fifo: fix a typo in comment
  > net: Expose DNS error category
  > pollable_fd_state: use default-generated dtor
  > perftune: tune tcp_mem
  > scripts/perftune.py: clock source tweaking: special case Amazon and Google KVM virtualizations
  > abort_source: subscription: keep callback function alive after abort
  > github: disable ccache when building with C++ modules
  > github: add enable-ccache input to test.yaml
  > pollable_fd_state: Mark destructor protected and make non-virtual
  > reactor: Mark .configure() private
  > reactor: Set aio_nowait_supported once
  > reactor: Add .no_poll_aio to reactor_config
  > reactor: Move .max_poll_time on reactor_config
  > reactor: Move .task_quota on reactor_config
  > reactor: Move .strict_o_direct on reactor_config
  > reactor: Move .bypass_fsync on reactor_config
  > reactor: Move .max_task_backlog on reactor_config
  > reactor: Move .force_io_getevents_syscall on reactor_config
  > reactor: Move .have_aio_fsync on reactor_config
  > reactor: Move .kernel_page_cache on reactor_config
  > reactor: Move .handle_sigint on reactor_config
  > reactor_backend: Construct _polling_io from reactor config
  > reactor: Move config when constructing
  > reactor: Use designated initializers to set up reactor_config
  > native-stack: use queue::pop_eventually() in listener::accept()
  > abort_source: subscription: allow calling on_abort explicitly
  > file: document that close() returns the file object to uninitialized state
  > code-cleanup: do not include 'smp.hh' in 'reactor.hh'
  > code-cleanup: remove redundant includes of smp.hh

Closes scylladb/scylladb#19912
2024-07-28 21:04:45 +03:00
Kefu Chai
36f5032b2d db: correct the doxygen comment
the parameter names do not match with the ones we are using.
these comments were inherited from Origin, but we failed to update
them accordingly.

in this change, the comments are updated to reflect the function
signatures.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19900
2024-07-28 18:24:57 +03:00
Kefu Chai
67e07bee25 build: cmake: use per-mode build dir
The build_unified.sh script accepts a --build-dir option, which
specifies the directory used for storing temporary files extracted
from tarballs defined by the --pkgs option. When performing parallel
builds of multiple modes, it's crucial that each build uses a unique
build directory. Reusing the same build directory for different modes
can lead to conflicts, resulting in build failures or, more seriously,
the creation of tarballs containing corrupted files.

so, in this change, we specify a different directory for each mode,
so that they don't share the same one.

Refs scylladb/scylladb#2717
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19905
2024-07-28 18:11:37 +03:00
Avi Kivity
149a47088e mapreduce_service: reindent 2024-07-28 17:55:51 +03:00
Avi Kivity
0dd03789f3 mapreduce_service: coroutinize retrying_dispatcher::dispatch_to_node()
Simplify the function by converting it to a coroutine.

Note that while the final co_return co_await looks like a loop (and
therefore an await would introduce an O(n) allocation), it really isn't -
we retry at most once.
2024-07-28 17:54:01 +03:00
Avi Kivity
b019927a0e mapreduce_service: coroutinize dispatch() inner lambda
dispatch() is a coroutine, but the inner lambda that is executed per
node is still a continuation chain. Make it uniform by converting to
a coroutine.
2024-07-28 17:36:08 +03:00
Kefu Chai
ee80742c39 cql3: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19906
2024-07-28 17:29:07 +03:00
Benny Halevy
26abad23d9 sstable_directory: delete_atomically: allow sstables from multiple prefixes
Currently, delete_atomically can be called with
a list of sstables from mixed prefixes in two cases:
1. truncate: where we delete all the sstables in the table directory
2. tablet cleanup: similar to truncate but restricted to sstables in a
   single tablet replica

In both cases, it is possible that sstables in staging (or quarantine)
are mixed with sstables in the base directory.

Until a more comprehensive fix is in place,
(see https://github.com/scylladb/scylladb/pull/19555)
this change just lifts the ban on atomic deletion
of sstables from different prefixes, and acknowledging
that the implementation is not atomic across
prefixes.  This is better than crashing for now,
and can be backported more easily to branches
that support tablets so tablet migration can
be done safely in the presence of repair of
tables with views.

Refs scylladb/scylladb#18862

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#19816
2024-07-28 17:26:31 +03:00
Pavel Emelyanov
aaad2bbeaf storage_service: Remote gossiper argument from join_cluster()
This pointer was only needed to pull all the way down the hints resource
manager start() method. It's no longer needed for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-26 16:29:58 +03:00
Pavel Emelyanov
a1dbaba9e1 proxy: Use remote gossiper to start hints resource manager
By the time hinst resource manager is started, proxy already has its
remote part initialized. Remote returns const gossiper pointer, but
after previous change hints code can live with it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-26 16:29:03 +03:00
Pavel Emelyanov
dd7c7c301d hints: Const-ify gossiper references and anchor pointers
There are two places in hints code that need gossiper: hist_sender
calling gossiper::is_alive() and endpoint_downtime_not_bigger_than()
helper in manager. Both can live with const gossiper, so the dependency
references and anchor pointers can be restricted to const too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-26 16:28:54 +03:00
Lakshmi Narayanan Sreethar
27b305b9d1 boost/bloom_filter_test: wait for total memory reclaimed update
The testcase `test_bloom_filter_reclaim_during_reload` checks the
SSTable manager's `_total_memory_reclaimed` against an expected value to
verify that a Bloom filter was reloaded. However, it does not wait for
the manager to update the variable, causing the check to fail if the
update has not occurred yet. Fix it by making the testcase wait until
the variable is updated to the expected value.

Fixes #19879

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#19883
2024-07-26 08:15:11 +03:00
Tomasz Grabiec
851da230c8 Merge 'db/view: drop view updates to replaced node marked as left' from Piotr Dulikowski
When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address.

This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0.

As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas.

In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue.

Fixes: scylladb/scylladb#19439

This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there.

Closes scylladb/scylladb#19765

* github.com:scylladb/scylladb:
  test: regression test for MV crash with tablets during decommission
  db/view: drop view updates to replaced node marked as left
2024-07-25 11:47:14 +02:00
Michael Litvak
6f25f4b387 mv: skip reading rows when generating partition tombstone update
when deleting a base partition, in some cases we can update the view by
generating a single partition deletion update, instead of generating a
row deletion update for each of the partition rows.
If this is the case for all the affected views, and there are no other
updates besides deleting the partition, then we can skip reading and
iterating over all the rows, since this won't generate any additional
updates that are not covered already.
2024-07-25 11:12:58 +03:00
Michael Litvak
d0b02dc0d0 mv: delete a partition in a single operation when applicable
Currently when a partition is deleted from the base table, we generate a
row tombstone update for each one of the view rows in the partition.

When the partition key in the view is the same as the base, maybe in a
different order, this can be done more efficiently - The whole corresponding
view partition can be deleted with one partition tombstone update.

With this commit, when generating view updates, if the update mutation has a
partition tombstone then for the views which have the same partition key
we will generate a partition tombstone update, and skip the individual
row tombstone updates.

Fixes scylladb/scylladb#8199
2024-07-25 11:12:58 +03:00
Michael Litvak
98cc707c76 cql-pytest: move ScyllaMetrics to util file to allow reuse
ScyllaMetrics is a useful generic component for retrieving metrics in a
pytest.
The commit moves the implementation from test_shedding.py to util.py to
make it reusable in other tests in cql-pytest.
2024-07-25 11:12:58 +03:00
Botond Dénes
1bfe73c2ea Merge 'Order API endpoints registration in main' from Pavel Emelyanov
There are few api::set_foo()-s left in main that are placed in ~~random~~ legacy order. This PR fixes it and makes few more associated cleanups.

refs: #2737

Closes scylladb/scylladb#19682

* github.com:scylladb/scylladb:
  api: Unset cache_service endpoints on stop
  main: Don't ignore set_cache_service() future
  api: Move storage API few steps above
  api: Register token-metadata API next to token-metadata itsels
  api: Do not return zero local host-id
  api: Move snitch API registration next to snitch itself
2024-07-25 09:59:38 +03:00
Pavel Emelyanov
456dbc122b api: Unset cache_service endpoints on stop
They currently stay registered long after the dependent services get
stopped. There's a need for batch unsetting (scylladb/seastar#1620), so
currently only this explicit listing :(

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-24 18:51:32 +03:00
Pavel Emelyanov
61fb0ad996 main: Don't ignore set_cache_service() future
The call itself seem to be in wrong place -- there's no "cache service"
also the API uses database and snapshot_ctl to work on. So it deserves
more cleanup, but at least don't throw the returned future<> away.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-24 18:51:32 +03:00
Pavel Emelyanov
e1eb48f9c2 api: Move storage API few steps above
The sequence currently is

sharded<storage_service>.start()
sharded<query_processor>.invoke_on_all(start_remote)
api::set_server_storage_service()

The last two steps can be safely swapped to keep storage service API
next to its service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-24 18:51:32 +03:00
Pavel Emelyanov
6ae09cc6bf api: Register token-metadata API next to token-metadata itsels
Right now API registration happens quite late because it waits storage
service to register its "function" first. This can be done beforeheand
and the t.m. API can be moved to where it should be.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-24 18:51:32 +03:00
Pavel Emelyanov
10566256fd api: Do not return zero local host-id
The local host id is read from local token metadata and returned to the
caller as string. The t.m. itself starts with default-constructed host
id vlaue which is updated later. However, even such "unset" host id
value can be rendered as string without errors. This makes the correct
work of the API endpoint depend on the initialization sequence which may
(spoilter: it will) change in the future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-24 18:51:32 +03:00
Pavel Emelyanov
29738f0cb6 api: Move snitch API registration next to snitch itself
Once sharded<snitch> is started, it can register its handlers

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-24 18:51:07 +03:00
Pavel Emelyanov
6357755624 replica: Remove keyspace::config::datadir
It's finally no longer used. Now only sstables storage code "knows" that
keyspace may have its on-disk directory.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-24 17:45:51 +03:00
Pavel Emelyanov
f767e25c8b sstables/storage: Evaluate path for keyspace directory in storage
Currently the init_keyspace_storage() expects that the caller would
tell it where the ks directory is, but it's not nice as keyspace may
not necessarity keep its sstables in any directory.

This patch moves the directory path evaluation into storage code,
specifically to the lambda that is called for on-disk sstables. The
way directory is evaluated mirrors the one from make_keyspace_config()
that will be removed by next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-24 17:45:50 +03:00
Pavel Emelyanov
3ae41bd6f6 sstables/storage: Add sstables_manager arg to init_keyspace_storage()
Will be needed by next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-07-24 17:41:45 +03:00
Botond Dénes
6337372b9d test/boost/reader_concurrency_semaphore_test: un-flake test admission
The admission test has a section which tests admission when the
semaphore has inactive reads. This section (and therefore the enire
test) became flaky lately, after a seemingly unrelated seastar upgrade,
which improved timers.
The cause of the flakyness is the permit which is made inactive later:
this permit is created with 0 timeout (times out immediately). For some
time now, when the timeout timer of a permit fires, if the permit is
inactive, it is evicted. This is what makes the test fail: the inactive
read times out and ends up evicting this permit, which is not expected
for the test. The reason this was not a problem before, is that the test
finishes very quickly, usually, before the timer could even be polled by
the reactor. The recent seastar changes changed this and now the timer
sometimes get polled and fires, failing the test.

Fixes: #19801

Closes scylladb/scylladb#19859
2024-07-24 13:04:50 +03:00
Takuya ASADA
02b20089cb scylla_raid_setup: install update-initramfs when it's not available
scylla_raid_setup may fail on Ubuntu minimal image since it calls
update-initramfs without installing.

Closes scylladb/scylladb#19651
2024-07-24 11:55:16 +03:00
Pavel Emelyanov
b02d20d12d Merge 'Minor improvements around compaction groups' from Raphael "Raph" Carvalho
Minor changes, no backporting needed.

Closes scylladb/scylladb#19723

* github.com:scylladb/scylladb:
  replica: rename for_each_const_compaction_group()
  replica: Fix comment about compaction group
  replica: remove unused compaction_group_vector
2024-07-24 11:22:24 +03:00
Nadav Har'El
edc5bca6b1 alternator: do not allow authentication with a non-"login" role
Alternator allows authentication into the existing CQL roles, but
roles which have the flag "login=false" should be refused in
authentication, and this patch adds the missing check.

The patch also adds a regression test for this feature in the
test/alternator test framework, in a new test file
test/alternator/cql_rbac.py. This test file will later include more
tests of how the CQL RBAC commands (CREATE ROLE, GRANT, REVOKE)
affect authentication and authorization in Alternator.
In particular, these tests need to use not just the DynamoDB API but
also CQL, so this new test file includes the "cql" fixture that allows
us to run CQL commands, to create roles, to retrieve their secret keys,
and so on.

Fixes scylladb/scylladb#19735

Closes scylladb/scylladb#19740
2024-07-24 08:20:23 +02:00
Botond Dénes
84db147c58 Merge 'tasks: introduce virtual tasks' from Aleksandra Martyniuk
Introduce virtual tasks - task manager tasks which cover
cluster-wide operations.

Virtual tasks aren't kept in memory, instead their statuses
are retrieved from associated service when user requests
them with task manager API. From API users' perspective,
virtual tasks behave similarly to regular tasks, but they can
be queried from any node in a cluster.

Virtual tasks cannot have a parent task. They can have
children on each node in a cluster, but do not keep references
to them. So, if a direct child of a virtual task is unregistered
from task manager, it will no longer be shown in parent's
children vector.

virtual_task class corresponds to all virtual tasks in one
group. If users want to list all tasks in a module, a virtual_task
returns all recent supported operations; if they request virtual
task's status - info about the one specified operation is
presented. Time to live, number of tracked operations etc.
depend on the implementation of individual virtual_task.
All virtual_tasks are kept only on shard 0.

Refs: https://github.com/scylladb/scylladb/issues/15852

New feature, no backport needed.

Closes scylladb/scylladb#16374

* github.com:scylladb/scylladb:
  docs: describe virtual tasks
  db: node_ops: filter topology request entries
  test: add a topology suite for testing tasks
  node_ops: service: create streaming tasks
  node_ops: register node_ops_virtual_task in task manager
  service: node_ops: keep node ops module in storage service
  node_ops: implement node_ops_virtual_task methods
  db: service: modify methods to get topology_requests data
  db: service: add request type column to topology_requests
  node_ops: add task manager module and node_ops_virtual_task
  tasks: api: add virtual task support to get_task_status_recursively
  tasks: api: add virtual task support
  tasks: api: add virtual tasks support to get_tasks
  tasks: add task_handler to hide task and virtual_task differences from user
  tasks: modify invoke_on_task
  tasks: implement task_manager::virtual_task::impl::get_children
  tasks: keep virtual tasks in task manager
  tasks: introduce task_manager::virtual_task
2024-07-24 08:34:28 +03:00
Botond Dénes
0bb6413ea5 Merge 'github: disable scheduled workflow on forks' from Kefu Chai
as these workflows are scheduled periodically, and if they fail, notifications are sent to the repo's owner. to minimize the surprises to the contributors using github, let's disable these workflows on fork repos.

Closes scylladb/scylladb#19736

* github.com:scylladb/scylladb:
  github: do not run clang-tidy as a cron job
  github: disable scheduled workflow on forks
2024-07-24 07:50:39 +03:00
Avi Kivity
3c930a61c9 Merge 'test: scylla_cluster: support more test scenarios' from Patryk Jędrzejczak
We modify `ScyllaCluster.server_start` so that it changes seeds of the
starting node to all currently running nodes. This allows writing tests like
```python
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```
However, it disallows writing tests that start multiple clusters. To fix this,
we add the `seeds` parameter to `server_start`.

We also improve the logic in `ScyllaCluster.add_server` to allow writing
tests like
```python
await manager.server_add(expected_error="...")
await manager.server_add()
```

This PR only adds improvements to the `test.py` framework, no need
to backport it.

Closes scylladb/scylladb#19847

* github.com:scylladb/scylladb:
  test: scylla_cluster: improve expected_error in add_server
  test: scylla_cluster: support more test scenarios
  test: scylla_cluster: correctly change seeds in server_start
2024-07-23 22:05:31 +03:00
Patryk Jędrzejczak
02ccd2e3af test: scylla_cluster: improve expected_error in add_server
We make two changes:
- we lease the IP address of a node that failed to boot because of
  an expected error,
- we don't log "Cluster ... added ..." when a node fails to boot
  because of an expected error.
2024-07-23 14:35:09 +02:00
Patryk Jędrzejczak
4079cd1a7b test: scylla_cluster: support more test scenarios
Here are some examples of tests that don't work with no initial
nodes, but they should work:

1.
```
await manager.server_add(expected_error="...")
await manager.server_add()
```

2.
```
await manager.servers_add(2, expected_error="...")
await manager.servers_add(2)
```

3.
```
s1 = await manager.server_add(start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_add()
```

4.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.server_start(s1.server_id, expected_error="...")
await manager.server_start(s2.server_id, expected_error="...")
await manager.servers_add(2)
```

5.
```
s1 = await manager.server_add(start=False)
await manager.server_add()
await manager.server_start(s1.server_id)
```

6.
```
[s1, s2] = await manager.servers_add(2, start=False)
await manager.servers_add(2)
await manager.server_start(s1.server_id)
await manager.server_start(s2.server_id)
```

In this patch, we make a few improvements to make tests like the ones
presented above work. I tested all the examples above manually.

From now on, servers receive correct seeds if the first servers added
in the test didn't start or failed to boot.

Also, we remove the assertion preventing the creation of a second
cluster. This assertion failed the tests presented above. We could
weaken it to make these tests pass, but it would require some work.
Moreover, we have tests that intentionally create two clusters.
Therefore, we go for the easiest solution and accept that a single
`ScyllaCluster` may not correspond to a single Scylla cluster.
2024-07-23 14:35:09 +02:00
Patryk Jędrzejczak
e196c1727e test: scylla_cluster: correctly change seeds in server_start
We change seeds in `ScyllaCluster.server_start` to all currently
running nodes. The previous code only pretended that it did it.

After doing this change, writing tests that create multiple clusters
is impossible. To allow it, we add the `seeds` parameter to
`ManagerClient.server_start`. We use it to fix and simplify the only
test that creates two clusters - `test_different_group0_ids`.
2024-07-23 14:35:08 +02:00
Aleksandra Martyniuk
d04159e7de docs: describe virtual tasks 2024-07-23 13:35:02 +02:00
Aleksandra Martyniuk
c64cb98bcf db: node_ops: filter topology request entries
system_keyspace::get_topology_request_entries returns entries for
requests which are running or have finished after specified time.

In task manager node ops task set the time so that they are shown
for task_ttl seconds after they have finished.
2024-07-23 13:35:02 +02:00
Aleksandra Martyniuk
36b77c0592 test: add a topology suite for testing tasks
Add topology_tasks test suite for testing task manager's node ops
tasks. Add TaskManagerClient to topology_tasks for an easy usage
of task manager rest api.

Write a test for bootstrap, replace, rebuild, decommission and remove
top level tasks using the above.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
a903971a74 node_ops: service: create streaming tasks
Create tasks which cover streaming part of topology changes. These
tasks are children of respective node_ops_virtual_task.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
63e82764e1 node_ops: register node_ops_virtual_task in task manager 2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
8e56913fdf service: node_ops: keep node ops module in storage service
Keep task manager node ops module in storage service. It will be
used to create and manage tasks related to topology changes.

The module is created and registered in storage service constructor.
In storage_service::stop() the module is stopped and so all the remaining
tasks would be unregistered immediately after they are finished.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
b97a348361 node_ops: implement node_ops_virtual_task methods 2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
94282b5214 db: service: modify methods to get topology_requests data
Modify get_topology_request_state (and wait_for_topology_request_completion),
so that it doesn't call on_internal_error when request_id isn't
in the topology_requests table if require_entry == false.

Add other methods to get topology request entry.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
880058073b db: service: add request type column to topology_requests
topology_requests table will be used by task manager node ops tasks,
but it loses info about request type, which is required by tasks.

Add request_type column to topology_requests.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
91fbfbf98a node_ops: add task manager module and node_ops_virtual_task
Add task manager node ops module and node_ops_virtual_task.

Some methods will be implemented in later patches.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
d2e6010670 tasks: api: add virtual task support to get_task_status_recursively
Virtual tasks are supported by get_task_status_recursively. Currently
only local descendants' statuses are shown.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
5f7f403a15 tasks: api: add virtual task support
Virtual tasks are supported by get_task_status, abort_task and
wait_task.

Task status returned by get_task_status and wait_task:
- contains task_kind to indicate whether it's virtual (cluster) or
  regular (node) task;
- children list apart from task_id contains node address of the task.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
20ba7ceff9 tasks: api: add virtual tasks support to get_tasks
task_manager/list_module_tasks/{module} starts supporting virtual tasks,
which means that their stats will also be shown for users.

Additional task_kind param is added to indicate whether the task is
virutal (cluster-wide) or regular (node-wide).

Support in other paths will be added in following patches.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
1d85b319e0 tasks: add task_handler to hide task and virtual_task differences from user
Contrary to regular tasks, which are per-operation, virtual tasks
are associated with the whole group of operations. There may be many
operations of each group performed at the same time. Info about each
running operation will be shown to a user through the API.

For virtual tasks, task manager imitates a regular task covering
each operation, but task_manager::tasks aren't actually created
in the memory. Instead, information (e.g. status) about the operation
is retrieved from associated service and passed to a user.

To hide most of the differences from user, task_handler class is created.
Task handler performs appropriate actions depending on task's kind.

However, users need to stay conscious about the kind of task, because:
- get_task_status and wait_task do not unregister virtual tasks;
- time for which a virtual tasks stays in task manager depends
  on associated service and tasks' implementation;
- number of virtual task's children shown by get_tasks doesn't have
  to be monotonous.

API is modified to use task_handler.
API-specific classes are moved to task_handler.{cc,hh}.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
abde7ba271 tasks: modify invoke_on_task
Modify task_manager::invoke_on_task to also check virtual tasks.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
6029936665 tasks: implement task_manager::virtual_task::impl::get_children
Return a vector of task_identity of all children of a virtual task
in a cluster.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
9de8d4b5b0 tasks: keep virtual tasks in task manager
Virtual tasks are kept in task manager together with regular tasks.
All virtual tasks are stored on shard 0.

task_manager::module::make_task is modified to consider virtual
tasks as possible parents.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
00cfc49d18 tasks: introduce task_manager::virtual_task
A virtual task is a new kind of task supported by task manager,
which covers cluster-wide operations.

From users' perspective virtual tasks behave similarly
to task_manager::tasks. The API side of virtual tasks will be
covered in the following patches.

Contrary to task_manager::task, virtual task does not update
its fields proactively. Moreover, no object is kept in memory
for each individual virtual task's operation. Instead a service
(or services) is queried on API user's demand to learn about
the status of running operation. Hence the name.

task_manager::virtual_task is responsible for a whole group
of virtual tasks, i.e. for tracking and generating statuses
of all operations of similar type.

To enable tracking of some kind of operations, one needs to
override task_manager::virtual_task::impl and provide implementations
of the methods returning appropriate information about the operations.
task_manager::virtual_task must be kept on shard 0.

Similarly to task_manager::tasks, virtual tasks can have child tasks,
responsible for tracking suboperations' progress. But virtual tasks
cannot have parents - they are always roots in task trees.

Some methods and structs will be implemented in later patches.
2024-07-23 13:35:01 +02:00
Nadav Har'El
bac7c33313 alternator: fix "/localnodes" to not return nodes still joining
Alternator's "/localnodes" HTTP request is supposed to return the list of
nodes in the local DC to which the user can send requests.

The existing implementation incorrectly used gossiper::is_alive() to check
for which nodes to return - but "alive" nodes include nodes which are still
joining the cluster and not really usable. These nodes can remain in the
JOINING state for a long time while they are copying data, and an attempt
to send requests to them will fail.

The fix for this bug is trivial: change the call to is_alive() to a call
to is_normal().

But the hard part of this test is the testing:

1. An existing multi-node test for "/localnodes" assummed that right after
   a new node was created, it appears on "/localnodes". But after this
   patch, it may take a bit more time for the bootstrapping to complete
   and the new node to appear in /localnodes - so I had to add a retry loop.

2. I added a test that reproduces the bug fixed here, and verifies its
   fix. The test is in the multi-node topology framework. It adds an
   injection which delays the bootstrap, which leaves a new node in JOINING
   state for a long time. The test then verifies that the new node is
   alive (as checked by the REST API), but is not returned by "/localnodes".

3. The new injection for delaying the bootstrap is unfortunately not
   very pretty - I had to do it in three places because we have several
   code paths of how bootstrap works without repair, with repair, without
   Raft and with Raft - and I wanted to delay all of them.

Fixes #19694.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#19725
2024-07-23 13:51:16 +03:00
Pavel Emelyanov
65565a56c3 Merge 's3/client: add client::upload_file()' from Kefu Chai
this member function prepares for the backup feature, where the
object to be stored in the object storage is already persisted as a
file on local filesystem. this brings us two benefits:

- with the file, we don't need to accumulate the payloads in memory
  and send them in batch, as we do in upload_sink and in
  upload_jumbo_sink. this puts less pressure on the memory subsystem.
- with the file, we can read multiple parts in parallel if multpart
  upload applies to it, this helps to improve the throughput.

so, this new helper is introduced to help upload an sstable from local
filesystem to the object storage.

Fixes https://github.com/scylladb/scylladb/issues/16287

Closes scylladb/scylladb#16387

* github.com:scylladb/scylladb:
  s3/client: add client::upload_file()
  s3/client: move constants related to aws constraints out
2024-07-23 12:39:27 +03:00
Kefu Chai
061def001d s3/client: add client::upload_file()
this member function prepares for the backup feature, where the
object to be stored in the object storage is already persisted as a
file on local filesystem. this brings us two benefits:

- with the file, we don't need to accumulate the payloads in memory
  and send them in batch, as we do in upload_sink and in
  upload_jumbo_sink. this puts less pressure on the memory subsystem.
- with the file, we can read multiple parts in parallel if multpart
  upload applies to it, this helps to improve the throughput.

so, this new helper is introduced to help upload an sstable from local
filesystem to the object storage.

Fixes #16287
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-07-23 14:39:30 +08:00
Kefu Chai
6701ce50a5 s3/client: move constants related to aws constraints out
minimum_part_size and aws_maximum_parts_in_piece are
AWS S3 related constraints, they can be reused out of
client::upload_sink and client::upload_jumbo_sink, so
in this change

* extract them out.
* use the user-defined literal with IEC prefix for
  better readablity to define minimum_part_size
* add "aws_" prefix to `minimum_part_size` to be
  more consistent.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-07-23 14:33:54 +08:00
Takuya ASADA
c3bea539b6 dist: support nonroot and offline mode for scylla-housekeeping
Introduce support nonroot and offline mode for scylla-housekeeping.

Closes #13084

Closes scylladb/scylladb#13088
2024-07-23 07:57:32 +03:00
Aleksandra Martyniuk
dfe3af40ed test: tasks: adjust tests to new wait_task behavior
After c1b2b8cb2c /task_manager/wait_task/
does not unregister tasks anymore.

Delete the check if the task was unregistered from test_task_manager_wait.
Check task status in drain_module_tasks to ensure that the task
is removed from task manager.

Fixes: #19351.

Closes scylladb/scylladb#19834
2024-07-22 18:24:54 +03:00
Nadav Har'El
9eb47b3ef0 Merge 'config: round-trip boolean configuration variables' from Avi Kivity
When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted
on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in
both directions.

Not a regression, so a backport isn't strictly necessary.

Closes scylladb/scylladb#19792

* github.com:scylladb/scylladb:
  config: specialize from-string conversion for bool
  config: wrap boost::lexical_cast<> when converting from strings
2024-07-22 17:53:02 +03:00
Botond Dénes
d3135db457 Merge 'commitlog: Add optional max lifetime parameter to cl instance' from Calle Wilund
If set, any remaining segment that has data older than this threshold will request flushing, regardless of data pressure. I.e. even a system where nothing happends will after X seconds flush data to free up the commit log.

Related to  #15820

The functionality here is to prevent pathological/test cases where a silent system cannot fully process stuff like compaction, GC etc due to things like CL forcing smaller GC windows etc.

Closes scylladb/scylladb#15971

* github.com:scylladb/scylladb:
  commitlog: Make max data lifetime runtime-configurable
  db::config: Expose commitlog_max_data_lifetime_in_s parameter
  commitlog: Add optional max lifetime parameter to cl instance
2024-07-22 17:21:33 +03:00
Botond Dénes
3ff33e9c70 Update ./tools/java submodule
* ./tools/java dbaf7ba7...0b4accdd (1):
  > cassandra-stress: Make default repl. strategy NetworkTopologyStrategy

Closes scylladb/scylladb#19818
2024-07-22 17:12:09 +03:00
Kamil Braun
8ec90a0e60 docs: extend "forbidden operations" section for Raft-topology upgrade
The Raft-topology upgrade procedure must not be run concurrently with
version upgrade.

Closes scylladb/scylladb#19746
2024-07-22 12:45:38 +03:00
Botond Dénes
591876b44e Merge 'sstables: do not reload components of unlinked sstables' from Lakshmi Narayanan Sreethar
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.

The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.

Fixes https://github.com/scylladb/scylladb/issues/19722

Closes scylladb/scylladb#19717

* github.com:scylladb/scylladb:
  boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
  sstables: do not reload components of unlinked sstables
  sstables/sstables_manager: introduce on_unlink method
2024-07-22 12:08:25 +03:00
Avi Kivity
358147959e Merge 'keep table directory open for flushing' from Laszlo Ersek
`filesystem_storage` methods frequently call `sync_directory()`, for the sake of flushing (sync'ing) a directory. `sync_directory()` always brackets the sync with open and close, and given that most `sync_directory()` calls target the sstable base directory, those repeated opens and closes are considered wasteful. Rework the `filesystem_storage::_dir` member (from a mere pathname) so that it stand for an `opened_directory` object, which keeps the sstable base directory open, for the purpose of repeated sync'ing.

Resolves #2399.

Closes scylladb/scylladb#19624

* github.com:scylladb/scylladb:
  sstables/storage: synch "dst_dir" more leanly in create_links_common()
  sstables/storage: close previous directory asynchronously upon dir change
  sstables/storage: futurize change_dir_for_test()
  sstables/storage: sync through "opened_directory" in filesystem...::move()
  sstables/storage: sync through "opened_directory" in the "easy" cases
  sstables/storage: introduce "opened_directory" class
2024-07-21 17:07:44 +03:00
Yaron Kaikov
d3cbe04130 .github/mergify.yml: update conf to support 6.1
Modify Mergify configuation to support `6.1` instead of `5.2` which is
EOL

Closes scylladb/scylladb#19810
2024-07-21 17:02:19 +03:00
Łukasz Paszkowski
781eb7517c api/system: add highest_supported_sstable_format path
Current upgrade dtest rely on a ccm node function to
get_highest_supported_sstable_version() that looks for
r'Feature (.*)_SSTABLE_FORMAT is enabled' in the log files.

Starting from scylla-6.0 ME_SSTABLE_FORMAT is enabled by default
and there is no cluster feature for it. Thus get_highest_supported_sstable_version()
returns an empty list resulting in the upgrade tests failures.

This change introduces a seperate API path that returns the highest
supported sstable format (one of la, mc, md, me) by a scylla node.

Fixes scylladb/scylladb#19772

Backports to 6.0 and 6.1 required. The current upgrade test in dtest
checks scylla upgrades up to version 5.4 only. This patch is a
prerequisite to backport the upgrade tests fix in dtest.

Closes scylladb/scylladb#19787
2024-07-21 17:00:19 +03:00
Avi Kivity
36b57f3432 Merge 'token: inline optimizations' from Benny Halevy
This series contains several optimizations for dht::token
around its comparison functions as well as minimum_token and maximum_token definitions,
by moving them inline into dht/token.hh

This results in a nice improvement in perf-simple-query:
```
==> perf-simple-query.pre <== (21c67a5a64)
         throughput: mean=95774.01 standard-deviation=1129.83 median=96243.64 median-absolute-deviation=1090.08 maximum=96864.09 minimum=94471.19
instructions_per_op: mean=41813.68 standard-deviation=16.27 median=41809.29 median-absolute-deviation=7.02 maximum=41841.64 minimum=41799.41
  cpu_cycles_per_op: mean=22383.19 standard-deviation=331.01 median=22254.53 median-absolute-deviation=332.26 maximum=22744.11 minimum=21996.73

==> perf-simple-query.post.0 <== (token: move ordering operator inline)
         throughput: mean=96350.01 standard-deviation=640.10 median=96228.88 median-absolute-deviation=621.45 maximum=96988.16 minimum=95478.51
instructions_per_op: mean=41627.13 standard-deviation=37.55 median=41627.06 median-absolute-deviation=2.43 maximum=41679.44 minimum=41573.31
  cpu_cycles_per_op: mean=22184.65 standard-deviation=151.03 median=22163.05 median-absolute-deviation=120.83 maximum=22348.49 minimum=21967.30

==> perf-simple-query.post.1 <== (token: operator<=>: optimize the common case)
         throughput: mean=96778.29 standard-deviation=1719.34 median=97021.72 median-absolute-deviation=1059.56 maximum=98300.99 minimum=93893.75
instructions_per_op: mean=41590.25 standard-deviation=5.53 median=41589.50 median-absolute-deviation=4.17 maximum=41598.39 minimum=41584.57
  cpu_cycles_per_op: mean=22135.33 standard-deviation=471.98 median=21969.30 median-absolute-deviation=244.89 maximum=22905.24 minimum=21685.33

==> perf-simple-query.post.3 <== (token: always initialize data member)
         throughput: mean=98264.33 standard-deviation=998.49 median=98533.02 median-absolute-deviation=780.45 maximum=99075.40 minimum=96656.51
instructions_per_op: mean=41657.61 standard-deviation=22.53 median=41648.49 median-absolute-deviation=12.89 maximum=41696.81 minimum=41642.07
  cpu_cycles_per_op: mean=21808.57 standard-deviation=93.63 median=21794.56 median-absolute-deviation=75.41 maximum=21949.46 minimum=21719.55

==> perf-simple-query.post.4 <== (token: constexpr ctors, methods, and minimum/maximum_token)
         throughput: mean=98095.05 standard-deviation=1333.32 median=98930.22 median-absolute-deviation=906.80 maximum=99209.38 minimum=96194.25
instructions_per_op: mean=41572.28 standard-deviation=6.04 median=41574.49 median-absolute-deviation=4.76 maximum=41579.56 minimum=41564.72
  cpu_cycles_per_op: mean=21831.35 standard-deviation=169.56 median=21732.86 median-absolute-deviation=102.93 maximum=22091.66 minimum=21689.63

==> perf-simple-query.post.5 <== (token: initialize non-key tokens with min() value)
         throughput: mean=99502.32 standard-deviation=1003.70 median=99744.03 median-absolute-deviation=388.87 maximum=100482.95 minimum=97813.42
instructions_per_op: mean=41593.48 standard-deviation=17.27 median=41585.25 median-absolute-deviation=8.46 maximum=41619.41 minimum=41575.86
  cpu_cycles_per_op: mean=21545.90 standard-deviation=86.66 median=21578.01 median-absolute-deviation=43.17 maximum=21612.41 minimum=21395.42
```

Optimization only. No backport required

Closes scylladb/scylladb#19782

* github.com:scylladb/scylladb:
  token: initialize non-key tokens with min() value
  token: make kind-based ctor private
  token: constexpr ctors, methods, and minimum/maximum_token
  token: always initialize data member
  everywhere: use dht::token is_{minimum,maximum}
  token: operator<=>: optimize the common case
  token: move ordering operator inline
  partitioner_test: add more token-level tests
2024-07-21 15:07:36 +03:00
Benny Halevy
365e1fb1b9 token: initialize non-key tokens with min() value
We already have code to return min() for
the minimum and maximum tokens in long_token()
and raw(), so instead of using code to return
it, just make sure to set it in the _data member.

Note that although this change affect serialization,
the existing codebase ignores the deserialized bytes
and places a constant (0 before this patch, or min()
with it) in _data for non-key (minumum or maximum) tokens.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-20 21:21:42 +03:00
Benny Halevy
9f05072527 token: make kind-based ctor private
Users outside of the token module don't
need to mess with the token::kind.
They can only create key tokens.
Never, minimum or maximum tokens, with a particular
datya value.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-20 21:21:42 +03:00
Benny Halevy
6806112189 token: constexpr ctors, methods, and minimum/maximum_token
sizeof(dht::token) is only 16 bytes and therefore
it can be passed with 2 registers.

There is no sense in defining minimum_token
and maximum_token out of line, returning a token&
to statically allocated values that require memory
access/copy, while the only call sites that needs
to point to the static min/max tokens are in
dht::ring_position_view.
Instead, they can be defined inline as constexpr
functions and return their const values.

Respectively, define token ctors and methods
as constexpr where applicable (and noexcept while at it
where applicable)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-20 21:21:42 +03:00
Benny Halevy
e509ccd184 token: always initialize data member
Make sure to always initalize the _data member
to 0 for non-key (minimum or maximum) tokens.

This allows to simplify the equality operator
that now doesn't need to rely on `operator<=>`

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-20 21:21:42 +03:00
Benny Halevy
850f298ccd everywhere: use dht::token is_{minimum,maximum}
The is_minimum/is_maximum predicates are more
efficient than comparing the the m{minimum,maximum}_token
values, respectrively. since the is_* functions
need to check only the token kind.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-20 21:21:42 +03:00
Benny Halevy
5a60ba5c5f token: operator<=>: optimize the common case
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-20 21:21:42 +03:00
Benny Halevy
adc1d7f68f token: move ordering operator inline
Token comparisons are abundant.
The equality operator is defined inline
in dht/token.hh by calling `t1 <=> t2`,
and so is `tri_compare_raw`, which `operator<=>`
calls in the common path, but `operator<=>` itself
is defined out of line, losing the benefits of inlining.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-20 21:21:42 +03:00
Benny Halevy
7e745d31ed partitioner_test: add more token-level tests
Before changing how minimum and maximum
tokens are represented in memory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-20 21:21:37 +03:00
Kamil Braun
ad68a7f799 Merge 'test: raft: fix the flaky test_raft_recovery_stuck' from Emil Maskovsky
Use the rolling restart to avoid spurious driver reconnects.

This can be eventually reverted once the scylladb/python-driver#295 is fixed.

Fixes scylladb/scylladb#19154

Closes scylladb/scylladb#19771

* github.com:scylladb/scylladb:
  test: raft: fix the flaky `test_raft_recovery_stuck`
  test: raft: code cleanup in `test_raft_recovery_stuck`
2024-07-19 19:34:43 +02:00
Piotr Dulikowski
4571262e46 Merge 'Improve constness of functions schema code' from Marcin Maliszkiewicz
In v4 of scylladb/scylladb#19598 the last commit of the patch was replaced but this change missed merge so submitting it in a separate patch.

In the current patch, the original functions class correctly marks methods as const where appropriate, and the instance() method now returns a const object. This ensures protection against accidental modifications, as all changes must go through the change_batch object.

Since the functions_changer class was intended to serve the same purpose, it is now redundant. Therefore, we are reverting the commit that introduced it.

Relates scylladb/scylladb#19153

Closes scylladb/scylladb#19647

* github.com:scylladb/scylladb:
  cql3: functions: replace template with std::function in with_udf_iter()
  cql3: functions: improve functions class constness handling
  Revert "cql3: functions: make modification functions accessible only via batch class"
2024-07-19 19:23:11 +02:00
Emil Maskovsky
9ab25e5cbf test: raft: replace the use of read_barrier work-around
Replaced the old `read_barrier` helper from "test/pylib/util.py"
by the new helper from "test/pylib/rest_client.py" that is calling
the newly introduced direct REST API.

Replaced in all relevant tests and decommissioned the old helper.

Introduced a new helper `get_host_api_address` to retrieve the host API
address - which in come cases can be different from the host address
(e.g. if the RPC address is changed).

Fixes: scylladb/scylladb#19662

Closes scylladb/scylladb#19739
2024-07-19 19:20:44 +02:00
Laszlo Ersek
680403d2cd sstables/storage: synch "dst_dir" more leanly in create_links_common()
filesystem_storage::create_links_common() runs on directories that
generally differ from "_dir", thus, we can't replace its sync_directory()
calls with _dir.sync(). We can still use a common (temporary)
"opened_directory" object for synching "dst_dir" three times, saving two
open and two close operations.

This patch is best viewed with "git show -W".

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-07-19 15:46:31 +02:00
Laszlo Ersek
0057ee2431 sstables/storage: close previous directory asynchronously upon dir change
In "filesystem_storage", change_dir_for_test() and move() replace "_dir"
with "opened_directory(new_dir)" using the move assignment operator.
Consequently, the file descriptor underlying "_dir" is closed
synchronously as a part of object destruction.

Expose the async file::close() function through "opened_directory".
Introduce filesystem_storage::change_dir() as a common async workhorse for
both change_dir_for_test() and move(). In change_dir(), close the old
directory asynchronously.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-07-19 15:43:19 +02:00
Laszlo Ersek
6711574646 sstables/storage: futurize change_dir_for_test()
Currently change_dir_for_test() is synchronous. Make it return a future,
so that we can use async operations in change_dir_for_test() overrides.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-07-19 15:43:19 +02:00
Laszlo Ersek
ef446c4da0 sstables/storage: sync through "opened_directory" in filesystem...::move()
Near the end of filesystem_storage::move(), we sync both the old
directory, and the new directory, if "delay_commit" is null. At that
point, the new directory is just "_dir"; call _dir.sync() instead of
sync_directory().

This patch is best viewed with "git show -W".

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-07-19 15:14:46 +02:00
Laszlo Ersek
4d33640481 sstables/storage: sync through "opened_directory" in the "easy" cases
Replace

  sst.sstable_write_io_check(sync_directory, _dir.native())

with

  _dir.sync(sst._write_error_handler)

Also replace the explicit (but still relatively "easy")
open_checked_directory() + flush() + flush() operations in
filesystem_storage::seal() with two _dir.sync() calls.

Because filesystem_storage::create_links_common() is marked "const", we
need to declare "_dir" mutable.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-07-19 15:14:46 +02:00
Laszlo Ersek
2c01171a4d sstables/storage: introduce "opened_directory" class
"filesystem_storage::_dir" is currently of type "std::filesystem::path".
Introduce a new class called "opened_directory", and change the type of
"_dir" to the new class "opened_directory".

"opened_directory" keeps the directory open, and offers synchronization on
that open directory (i.e., without having to reopen the directory every
time). In subsequent patches, that will be put to use.

The opening and closing of the wrapped directory cannot easily be handled
explicitly in the "filesystem_storage" member functions.

(

  Namely, test::store() and test::rewrite_toc_without_scylla_component()
  -- both in "test/lib/sstable_utils.hh" -- perform "open -> ... -> seal"
  sequences, and such a sequence may be executed repeatedly. For example,
  sstable_directory_shared_sstables_reshard_correctly()
  [test/boost/sstable_directory_test.cc] does just that; it "reopens" the
  "filesystem_storage" object repeatedly.

)

Rather than trying to restrict the order of "filesystem_storage" member
function calls, replace the "opened_directory" object with a new one
whenever the directory pathname is re-set; namely in
filesystem_storage::change_dir_for_test() and filesystem_storage::move().

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-07-19 15:14:46 +02:00
Piotr Dulikowski
204a479e82 Merge 'db/hints: Test manager::too_many_in_flight_hints_for()' from Dawid Mędrek
In 6e79d64, the behavior of `manager::too_many_in_flight_hints_for()`
was accidentally modified. It remained unnoticed for some time
and then fixed. In this commit, we add a test verifying that
the concurrency of hints being written to disk is indeed limited
and the limitations are imposed properly.

Refs scylladb/scylladb#17636
Fixes scylladb/scylladb#17660

Closes scylladb/scylladb#19741

* github.com:scylladb/scylladb:
  db/hints: Verify that Scylla limits the concurrency of written hints
  db/hints: Coroutinize `hint_endpoint_manager::store_hint()`
  db/hints: Move a constant value to the TU it's used in
2024-07-19 13:26:34 +02:00
Lakshmi Narayanan Sreethar
0615c8a46b boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-19 13:15:57 +05:30
Lakshmi Narayanan Sreethar
31ff69a13c sstables: do not reload components of unlinked sstables
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.

The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.

Fixes #19722

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-19 13:15:57 +05:30
Lakshmi Narayanan Sreethar
dbf22848a8 sstables/sstables_manager: introduce on_unlink method
Added a new method, on_unlink() to the sstable_manager. This method is
now used by the sstable to notify the manager when it has been unlinked,
enabling the manager to update its bookkeeping as required. The
on_unlink method doesn't do anything yet but will be updated by the next
patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-19 13:15:55 +05:30
Kefu Chai
c52f49facb build: cmake: do not mark cqlsh noarch
in 3c7af287, cqlsh's reloc package was marked as "noarch", and its
filename was updated accordingly in `configure.py`, so let's update
the CMake building system accordingly.

this change should address the build failure of

```
08:48:14  [3325/4124] Generating ../Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14  FAILED: Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz /jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14  cd /jenkins/workspace/scylla-master/scylla-ci/scylla/build/dist && /usr/bin/cmake -E copy /jenkins/workspace/scylla-master/scylla-ci/scylla/tools/cqlsh/build/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz /jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz
08:48:14  Error copying file "/jenkins/workspace/scylla-master/scylla-ci/scylla/tools/cqlsh/build/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz" to "/jenkins/workspace/scylla-master/scylla-ci/scylla/build/Debug/dist/tar/scylla-cqlsh-6.1.0~dev-0.20240629.60955ead75ef.noarch.tar.gz".
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19710
2024-07-19 08:00:17 +03:00
Kefu Chai
34bf10050b build: cmake: bump up the minimal required fmt to 10.0.0
in cccec07581, we started using a featured introduced by {fmt} v10.
so we need to bump up the required version in CMake as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19709
2024-07-19 07:58:31 +03:00
Botond Dénes
79567c1c98 scripts/open-coredump.sh: allow complete bypass of S3 server
In some cases, the S3 server will not know about a certain build and
any attempt to open a coredump which was generated by this build will
fail, because the S3 server returns an empty/illegal response.
There is already a bypass for missing package-url in the S3 server
response, but this doesn't help in the case when the response is also
missing other metadata, like build-id and version info.
Extend this existig mechanism with a new --scylla-package-url flag,
which provides complete bypass. When provided, the S3 server will not be
queried at all, instead the package is downloaded from the link and
version metadata is extracted from the package itself.

Closes scylladb/scylladb#19769
2024-07-18 21:43:53 +03:00
Avi Kivity
58a8fd6f19 Update tools/python3 submodule (install umask, selinux)
* tools/python3 18fa79e...fbf12d0 (1):
  > install.sh: fix incorrect permission on strict umask

Ref https://github.com/scylladb/scylladb/issues/8589
Ref https://github.com/scylladb/scylladb/issues/19775
2024-07-18 21:36:50 +03:00
Avi Kivity
7984e595ce Update tools/java submodule (install selinux context)
* tools/java 33938ec16f...dbaf7ba7db (1):
  > install.sh: apply correct security context on offline installer

Ref https://github.com/scylladb/scylladb/issues/8589
2024-07-18 21:03:32 +03:00
Kefu Chai
4fbfecbb3e Update seastar submodule
* seastar 908ccd93...67065040 (44):
  > metrics: Use this_shard_id unconditionally
  > sstring: prevent fmt from formatting sstring as a sequence
  > coding style: allow lines up to 160 chars in length
  > src/core: remove unnecessary includes
  > when_all: stop using deprecated std::aligned_union_t
  > reactor: respect preempt requests in debug mode
  > core: fix -Wunused-but-set-variable
  > gate: add try_hold
  > sstring: declare nested type with typename
  > rpc: pass start time to `wait_for_reply()` which accepts `no_wait_type`
  > scripts/perftune.py: get rid of "SyntaxWarning: invalid escape sequence"
  > scripts/perftune.py: add support for tweaking VLAN interfaces
  > scripts/perftune.py: improve discovery of bond device slaves
  > scripts/perftune.py: refactor __learn_slaves() function
  > code-cleanup: add missing header guards
  > code-cleanup: remove redundant includes of 'reactor.hh'
  > code-cleanup: explicitly depend on io_desc.hh
  > scripts/perftune.py: aRFS should be disabled by default in non-MQ mode
  > code-cleanup: remove unneeded includes of fair_queue.hh
  > docker: fix mount of install-dependencies
  > code-cleanup: remove redundant includes of linux-aio.hh
  > fstream: reformat the doxygen comment of make_file_input_stream()
  > iostream: use new-style consumer to implement copy()
  > stall-analyser: use 0 for default value of --minimum
  > reactor: fix crash during metrics gathering
  > build: run socket test with linux-aio reactor backend
  > test: Add testing of connect()-ion abort ability
  > linux_perf_event: exclude_idle only on x86_64
  > linux_perf_event: add make_linux_perf_event
  > stall-analyser: gracefully handle empty input
  > shared_token_bucket: resolve FIXME
  > io_tester: ensure that file object is valid when closing it
  > tutorial.md: fix typo in Dan Kegel's name
  > test,rpc: Extend simple ping-pong case
  > rpc: Calculate delay and export it via metrics
  > rpc: Exchange handler duration with server responses
  > rpc: Track handler execution time
  > rpc: Fix hard-coded constants when sending unknown verb reply
  > reactor: Unfriend alien and smp queues
  > reactor: Add and use stopped() getter
  > reactor: Generalize wakeup() callers
  > file: Use lighter access to map of fs-info-s
  > file: Fix indentation after previous patch
  > file: Don't return chain of ready futures from make_file_impl

Closes scylladb/scylladb#19780
2024-07-18 20:00:15 +03:00
Avi Kivity
f7e24cf0b1 Update tools/jmx submodule (umask fix)
* tools/jmx 3328a22...89308b7 (1):
  > install.sh: fix incorrect permission on strict umask

Ref scylladb/scylladb#14383
Ref scylladb/scylladb#8589
2024-07-18 19:37:57 +03:00
Avi Kivity
c3b9e64713 Merge 'sstable::open_sstable: pass origin from the writer' from Lakshmi Narayanan Sreethar
Pass origin when opening the sstable from the writer and store it in the
sstable object. This will make the origin available for the entire write
path.

Closes scylladb/scylladb#19721

* github.com:scylladb/scylladb:
  sstables: use _origin in write path
  sstable::open_sstable: pass and store origin
2024-07-18 19:30:32 +03:00
Avi Kivity
926a02451e Merge 'sstables/index_reader: abort reading during shutdown' from Lakshmi Narayanan Sreethar
This PR adds support for aborting index reads from within `index_consume_entry_context::consume_input` when the server is being stopped. The abort source is now propagated down to the `index_consume_entry_context`, making it available for `consume_input` to check if an abort has been requested. If an abort is detected, `consume_input` will throw an exception to stop the index read operation.

Closes scylladb/scylladb#19453

* github.com:scylladb/scylladb:
  test/boost: test abort behaviour during index read
  sstables/index_reader: stop consuming index when abort has been requested
  sstables::index_consume_entry_context: store abort_source
  sstable: drop old filter only after the new filter is built during rebuild
  sstables/sstables_manager: store abort_source in sstable_manager
  replica/database: pass abort_source to database constructor
2024-07-18 19:26:22 +03:00
Avi Kivity
0780228aa2 config: specialize from-string conversion for bool
The yaml/json representation for bool is true/false, but boost::lexical_cast
is 1/0. Specialize bool conversion to accept true/false (for yaml/json
compatibilty) and 1/0 (for backward compatibility). This provides
round-trip conversion for bool configs in system.config.
2024-07-18 18:38:22 +03:00
Avi Kivity
33eaa61cdd config: wrap boost::lexical_cast<> when converting from strings
Configuration uses boost::lexical_cast to convert strings to native
values (e.g. bools/ints). However, boost::lexical_cast doesn't
recognize true/false for bool. Since we can't change boost::lexical_cast,
replace it with a wrapper that forwards directly to boost::lexical_cast.

In the next step, we'll specialize it for bool.
2024-07-18 18:38:19 +03:00
Piotr Dulikowski
5ec8c06561 test: regression test for MV crash with tablets during decommission
Regression test for scylladb/scylladb#19439.

Co-authored-by: Kamil Braun <kbraun@scylladb.com>
2024-07-18 16:00:26 +02:00
Anna Mikhlin
cd007123c3 Update ScyllaDB version to: 6.2.0-dev 2024-07-18 16:07:07 +03:00
Avi Kivity
47e99f4e04 Merge 'Fix lwt semaphore guard accounting' from Gleb Natapov
Currently the guard does not account correctly for ongoing operation if semaphore acquisition fails. It may signal a semaphore when it is not held.

Should be backported to all supported versions.

Closes scylladb/scylladb#19699

* github.com:scylladb/scylladb:
  test: add test to check that coordinator lwt semaphore continues functioning after locking failures
  paxos: do not signal semaphore if it was not acquired
2024-07-18 14:58:31 +03:00
Dawid Medrek
8b6e887e02 db/hints: Verify that Scylla limits the concurrency of written hints
In 6e79d64, the behavior of `manager::too_many_in_flight_hints_for()`
was accidentally modified. It remained unnoticed for some time
and then fixed. In this commit, we add a test verifying that
the concurrency of hints being written to disk is indeed limited
and the limitations are imposed properly.
2024-07-18 13:49:29 +02:00
Kefu Chai
db56af2e41 replication_strategy: mark fmt::formatter<..>::format() const
since fmt 11, it is required that the format() to be const, otherwise
its caller in fmt library would not be able to call it. and compile
would fail like:

```
/home/kefu/.local/bin/clang++ -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -I/home/kefu/dev/scylladb/build/gen -isystem /home/kefu/dev/scylladb/abseil -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT locator/CMakeFiles/scylla_locator.dir/RelWithDebInfo/abstract_replication_strategy.cc.o -MF locator/CMakeFiles/scylla_locator.dir/RelWithDebInfo/abstract_replication_strategy.cc.o.d -o locator/CMakeFiles/scylla_locator.dir/RelWithDebInfo/abstract_replication_strategy.cc.o -c /home/kefu/dev/scylladb/locator/abstract_replication_strategy.cc
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.cc:9:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:16:
In file included from /home/kefu/dev/scylladb/gms/inet_address.hh:11:
In file included from /usr/include/fmt/ostream.h:23:
In file included from /usr/include/fmt/chrono.h:23:
In file included from /usr/include/fmt/format.h:41:
/usr/include/fmt/base.h:1393:23: error: no matching member function for call to 'format'
 1393 |     ctx.advance_to(cf.format(*static_cast<qualified_type*>(arg), ctx));
      |                    ~~~^~~~~~
/usr/include/fmt/base.h:1374:21: note: in instantiation of function template specialization 'fmt::detail::value<fmt::context>::format_custom_arg<locator::vnode_effective_replication_map::factory_key, fmt::formatter<locator::vnode_effective_replication_map::factory_key>>' requested here
 1374 |     custom.format = format_custom_arg<
      |                     ^
/home/kefu/dev/scylladb/seastar/include/seastar/util/log.hh:299:33: note: in instantiation of function template specialization 'fmt::format_to<seastar::internal::log_buf::inserter_iterator &, locator::vnode_effective_replication_map::factory_key &, const void *, 0>' requested here
  299 |                     return fmt::format_to(it, fmt.format, std::forward<Args>(args)...);
      |                                 ^
/home/kefu/dev/scylladb/seastar/include/seastar/util/log.hh:428:9: note: in instantiation of function template specialization 'seastar::logger::log<locator::vnode_effective_replication_map::factory_key &, const void *>' requested here
  428 |         log(log_level::debug, std::move(fmt), std::forward<Args>(args)...);
      |         ^
/home/kefu/dev/scylladb/locator/abstract_replication_strategy.cc:561:18: note: in instantiation of function template specialization 'seastar::logger::debug<locator::vnode_effective_replication_map::factory_key &, const void *>' requested here
  561 |         rslogger.debug("create_effective_replication_map: found {} [{}]", key, fmt::ptr(erm.get()));
      |                  ^
/home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:471:10: note: candidate function template not viable: 'this' argument has type 'const fmt::formatter<locator::vnode_effective_replication_map::factory_key>', but method is not marked const
  471 |     auto format(const locator::vnode_effective_replication_map::factory_key& key, FormatContext& ctx) {
      |          ^
1 error generated.
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19768
2024-07-18 13:52:36 +03:00
Emil Maskovsky
a89facbc74 test: raft: fix the flaky test_raft_recovery_stuck
Use the rolling restart to avoid spurious driver reconnects.

This can be eventually reverted once the scylladb/python-driver#295 is
fixed.

Fixes scylladb/scylladb#19154
2024-07-17 09:16:06 +02:00
Emil Maskovsky
ef3393bd36 test: raft: code cleanup in test_raft_recovery_stuck
Cleaning up the imports.
2024-07-17 09:09:46 +02:00
Lakshmi Narayanan Sreethar
7b58fa2534 sstables: use _origin in write path
Now that the origin is available inside the sstable object, no need to
pass it to the methods called in the write path.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:44:28 +05:30
Lakshmi Narayanan Sreethar
b762a09dcd sstable::open_sstable: pass and store origin
Pass origin when opening the sstable from the writer and store it in the
sstable object. This will make the origin available for the entire write
path.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:43:30 +05:30
Lakshmi Narayanan Sreethar
7d0f3ace4a test/boost: test abort behaviour during index read
Added a new boost test, index_reader_test, with a testcase to verifyi
the abort behaviour during an index read using
index_consume_entry_context.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:42:50 +05:30
Lakshmi Narayanan Sreethar
64dadd5ec2 sstables/index_reader: stop consuming index when abort has been requested
When an abort is requested, stop further reading of the index file and
throw and exception from index_consume_entry_context::process_state().

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:42:50 +05:30
Lakshmi Narayanan Sreethar
c2524337a2 sstables::index_consume_entry_context: store abort_source
Store abort source inside sstables::index_consume_entry_context, so that
the next patch can implement cancelling the index read when abort is
requested.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:42:50 +05:30
Lakshmi Narayanan Sreethar
587da62686 sstable: drop old filter only after the new filter is built during rebuild
sstable::maybe_rebuild_filter_from_index drops the existing filter first
and then rebuilds the new filter as the method is only called before the
sstable is sealed. But to make the index read abortable, the old filter
can be dropped only after the new filter is built so that in case if the
index consumer gets aborted, we still have the old filter to write to
disk.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:42:47 +05:30
Lakshmi Narayanan Sreethar
6a3e7a5e7a sstables/sstables_manager: store abort_source in sstable_manager
Add a new member that stores the abort_source. This can later be used by
the sstables to check if an abort has been requested. Also implement
sstables_manager::get_abort_source() that returns a const reference to
the abort source.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:36:06 +05:30
Lakshmi Narayanan Sreethar
e2142974f8 replica/database: pass abort_source to database constructor
This is in preparation for the following patch that adds abort_source
variable to the sstables_manager.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:36:06 +05:30
Piotr Dulikowski
6af7882c59 db/view: drop view updates to replaced node marked as left
When a node that is permanently down is replaced, it is marked as "left"
but it still can be a replica of some tablets. We also don't keep IPs of
nodes that have left and the `node` structure for such node returns an
empty IP (all zeros) as the address.

This interacts badly with the view update logic. The base replica paired
with the left node might decide to generate a view update. Because
storage proxy still uses IPs and not host IDs, it needs to obtain the
view replica's IP and tell the storage proxy to write a view update to
that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to
write a hint towards this address - hinted handoff on the other hand
operates on host IDs and not IPs, so it attempts to translate the IP
back, which triggers an assertion as there is no replica with IP
0.0.0.0.

As a quick workaround for this issue just drop view updates towards
nodes which seem to have IPs that are all zeros. It would be more proper
to keep the view updates as hints and replay them later to the new
paired replica, but achieving this right now would require much more
significant changes. For now, fixing a crash is more important than
keeping views consistent with base replicas.

Fixes: scylladb/scylladb#19439
2024-07-16 15:50:11 +02:00
Gleb Natapov
4178589826 test: add test to check that coordinator lwt semaphore continues functioning after locking failures 2024-07-16 12:32:25 +03:00
Gleb Natapov
87beebeed0 paxos: do not signal semaphore if it was not acquired
The guard signals a semaphore during destruction if it is marked as
locked, but currently it may be marked as locked even if locking failed.
Fix this by using semaphore_units instead of managing the locked flag
manually.

Fixes: https://github.com/scylladb/scylladb/issues/19698
2024-07-16 12:32:25 +03:00
Kefu Chai
c911832ed9 github: do not run clang-tidy as a cron job
we already run it for every pull request, so no need to run it
periodically.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-07-15 19:19:49 +08:00
Kefu Chai
dc189c67a6 github: disable scheduled workflow on forks
as these workflows are scheduled periodically, and if they fail,
notifications are sent to the repo's owner. to minimize the surprises
to the contributors using github, let's disable these workflows on
fork repos.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-07-15 19:19:28 +08:00
Marcin Maliszkiewicz
395dec35c1 cql3: functions: replace template with std::function in with_udf_iter()
Templates are slower to compile and more difficult to read,
in this case generalization is not needed and can be replaced
by std::function.
2024-07-15 09:39:20 +02:00
Marcin Maliszkiewicz
85d38e013c cql3: functions: improve functions class constness handling
Declares getters as const methods. Makes instance() function
return const object so that it may only be modified via change_batch
class.
2024-07-15 09:39:20 +02:00
Marcin Maliszkiewicz
b9861c0bb7 Revert "cql3: functions: make modification functions accessible only via batch class"
This reverts commit 3f1c2fecc2.

This access control property will be implemented differently
(by using const) in subsequent commit hence revert.
2024-07-15 09:39:20 +02:00
Dawid Medrek
7301a96ff4 db/hints: Coroutinize hint_endpoint_manager::store_hint() 2024-07-15 04:15:25 +02:00
Raphael S. Carvalho
8df7f78969 replica: rename for_each_const_compaction_group()
use same name as non-const-qualified variant, by relying on
overloading.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-07-12 16:33:34 -03:00
Raphael S. Carvalho
518677d7f9 replica: Fix comment about compaction group
there's not a 1:1 relationship between compaction group count and
tablet count. a tablet replica has a storage group instance, which
may map to multiple compaction groups during split mode.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-07-12 16:24:51 -03:00
Raphael S. Carvalho
f139aa1df6 replica: remove unused compaction_group_vector
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-07-12 16:16:47 -03:00
Dawid Medrek
3e02e66ca8 db/hints: Move a constant value to the TU it's used in
Until now, the constant `HINT_FILE_WRITE_TIMEOUT` was
declared as a static member of `db::hints::manager`.
However, the constant is only ever used in one
translation unit, so it makes more sense to move it
there and not include boilerplate in a header.
2024-07-12 13:08:33 +02:00
Calle Wilund
8295980d14 commitlog: Make max data lifetime runtime-configurable 2024-07-09 12:30:49 +00:00
Calle Wilund
0c6679e55f db::config: Expose commitlog_max_data_lifetime_in_s parameter
To allow user control of commitlog time based expiry.
Set to 24h initially.
2024-07-09 12:30:48 +00:00
Calle Wilund
55d6afda6e commitlog: Add optional max lifetime parameter to cl instance
If set, any remaining segment that has data older than this threshold
will request flushing, regardless of data pressure. I.e. even a system
where nothing happends will after X seconds flush data to free up the
commit log.
2024-07-09 12:30:48 +00:00
1024 changed files with 36780 additions and 11465 deletions

225
.clang-format Normal file
View File

@@ -0,0 +1,225 @@
---
Language: Cpp
AccessModifierOffset: -4
AlignAfterOpenBracket: Align
AlignArrayOfStructures: None
AlignConsecutiveAssignments:
Enabled: false
AcrossEmptyLines: false
AcrossComments: false
AlignCompound: false
PadOperators: true
AlignConsecutiveBitFields:
Enabled: false
AcrossEmptyLines: false
AcrossComments: false
AlignCompound: false
PadOperators: false
AlignConsecutiveDeclarations:
Enabled: false
AcrossEmptyLines: false
AcrossComments: false
AlignCompound: false
PadOperators: false
AlignConsecutiveMacros:
Enabled: false
AcrossEmptyLines: false
AcrossComments: false
AlignCompound: false
PadOperators: false
AlignConsecutiveShortCaseStatements:
Enabled: false
AcrossEmptyLines: false
AcrossComments: false
AlignCaseColons: false
AlignEscapedNewlines: Right
AlignOperands: Align
AlignTrailingComments:
Kind: Always
OverEmptyLines: 0
AllowAllArgumentsOnNextLine: true
AllowAllParametersOfDeclarationOnNextLine: true
AllowShortBlocksOnASingleLine: Never
AllowShortCaseLabelsOnASingleLine: false
AllowShortEnumsOnASingleLine: true
AllowShortFunctionsOnASingleLine: InlineOnly
AllowShortIfStatementsOnASingleLine: Never
AllowShortLambdasOnASingleLine: All
AllowShortLoopsOnASingleLine: false
AlwaysBreakAfterDefinitionReturnType: None
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: false
AlwaysBreakTemplateDeclarations: Yes
AttributeMacros:
- __capability
BinPackArguments: false
BinPackParameters: false
BitFieldColonSpacing: Both
BraceWrapping:
AfterCaseLabel: false
AfterClass: false
AfterControlStatement: Never
AfterEnum: false
AfterExternBlock: false
AfterFunction: false
AfterNamespace: false
AfterObjCDeclaration: false
AfterStruct: false
AfterUnion: false
BeforeCatch: false
BeforeElse: false
BeforeLambdaBody: false
BeforeWhile: false
IndentBraces: false
SplitEmptyFunction: true
SplitEmptyRecord: true
SplitEmptyNamespace: true
BreakAfterAttributes: Never
BreakAfterJavaFieldAnnotations: false
BreakArrays: true
BreakBeforeBinaryOperators: None
BreakBeforeConceptDeclarations: Always
BreakBeforeBraces: Attach
BreakBeforeInlineASMColon: OnlyMultiline
BreakBeforeTernaryOperators: true
BreakConstructorInitializers: BeforeComma
BreakInheritanceList: BeforeColon
BreakStringLiterals: true
ColumnLimit: 160
CommentPragmas: '^ IWYU pragma:'
CompactNamespaces: false
ConstructorInitializerIndentWidth: 4
ContinuationIndentWidth: 4
Cpp11BracedListStyle: true
DerivePointerAlignment: false
DisableFormat: false
EmptyLineAfterAccessModifier: Never
EmptyLineBeforeAccessModifier: LogicalBlock
ExperimentalAutoDetectBinPacking: false
FixNamespaceComments: true
ForEachMacros:
- foreach
- Q_FOREACH
- BOOST_FOREACH
IfMacros:
- KJ_IF_MAYBE
IncludeBlocks: Preserve
IncludeCategories:
- Regex: '^"(llvm|llvm-c|clang|clang-c)/'
Priority: 2
SortPriority: 0
CaseSensitive: false
- Regex: '^(<|"(gtest|gmock|isl|json)/)'
Priority: 3
SortPriority: 0
CaseSensitive: false
- Regex: '.*'
Priority: 1
SortPriority: 0
CaseSensitive: false
IncludeIsMainRegex: '(Test)?$'
IncludeIsMainSourceRegex: ''
IndentAccessModifiers: false
IndentCaseBlocks: false
IndentCaseLabels: false
IndentExternBlock: AfterExternBlock
IndentGotoLabels: true
IndentPPDirectives: None
IndentRequiresClause: true
IndentWidth: 4
IndentWrappedFunctionNames: false
InsertBraces: false
InsertNewlineAtEOF: true
InsertTrailingCommas: None
IntegerLiteralSeparator:
Binary: 0
BinaryMinDigits: 0
Decimal: 0
DecimalMinDigits: 0
Hex: 0
HexMinDigits: 0
JavaScriptQuotes: Leave
JavaScriptWrapImports: true
KeepEmptyLinesAtTheStartOfBlocks: true
KeepEmptyLinesAtEOF: false
LambdaBodyIndentation: Signature
LineEnding: DeriveLF
MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 2
NamespaceIndentation: None
PackConstructorInitializers: NextLine
PenaltyBreakAssignment: 2
PenaltyBreakBeforeFirstCallParameter: 19
PenaltyBreakComment: 300
PenaltyBreakFirstLessLess: 120
PenaltyBreakOpenParenthesis: 0
PenaltyBreakString: 1000
PenaltyBreakTemplateDeclaration: 10
PenaltyExcessCharacter: 1000000
PenaltyIndentedWhitespace: 0
PenaltyReturnTypeOnItsOwnLine: 60
PointerAlignment: Left
PPIndentWidth: -1
QualifierAlignment: Leave
ReferenceAlignment: Pointer
ReflowComments: true
RemoveBracesLLVM: false
RemoveParentheses: Leave
RemoveSemicolon: false
RequiresClausePosition: OwnLine
RequiresExpressionIndentation: OuterScope
SeparateDefinitionBlocks: Leave
ShortNamespaceLines: 1
SortIncludes: CaseSensitive
SortJavaStaticImport: Before
SortUsingDeclarations: LexicographicNumeric
SpaceAfterCStyleCast: false
SpaceAfterLogicalNot: false
SpaceAfterTemplateKeyword: true
SpaceAroundPointerQualifiers: Default
SpaceBeforeAssignmentOperators: true
SpaceBeforeCaseColon: false
SpaceBeforeCpp11BracedList: false
SpaceBeforeCtorInitializerColon: true
SpaceBeforeInheritanceColon: true
SpaceBeforeJsonColon: false
SpaceBeforeParens: ControlStatements
SpaceBeforeParensOptions:
AfterControlStatements: true
AfterForeachMacros: true
AfterFunctionDefinitionName: false
AfterFunctionDeclarationName: false
AfterIfMacros: true
AfterOverloadedOperator: false
AfterRequiresInClause: false
AfterRequiresInExpression: false
BeforeNonEmptyParentheses: false
SpaceBeforeRangeBasedForLoopColon: true
SpaceBeforeSquareBrackets: false
SpaceInEmptyBlock: false
SpacesBeforeTrailingComments: 1
SpacesInAngles: Never
SpacesInContainerLiterals: true
SpacesInLineCommentPrefix:
Minimum: 1
Maximum: -1
SpacesInParens: Never
SpacesInParensOptions:
InCStyleCasts: false
InConditionalStatements: false
InEmptyParentheses: false
Other: false
SpacesInSquareBrackets: false
Standard: Latest
TabWidth: 8
UseTab: Never
VerilogBreakBetweenInstancePorts: true
WhitespaceSensitiveMacros:
- BOOST_PP_STRINGIZE
- CF_SWIFT_NAME
- NS_SWIFT_NAME
- PP_STRINGIZE
- STRINGIZE
...

26
.github/CODEOWNERS vendored
View File

@@ -1,5 +1,5 @@
# AUTH
auth/* @elcallio @vladzcloudius
auth/* @nuivall @ptrsmrn @KrzaQ
# CACHE
row_cache* @tgrabiec
@@ -7,9 +7,9 @@ row_cache* @tgrabiec
test/boost/mvcc* @tgrabiec
# CDC
cdc/* @kbr- @elcallio @piodul @jul-stas
test/cql/cdc_* @kbr- @elcallio @piodul @jul-stas
test/boost/cdc_* @kbr- @elcallio @piodul @jul-stas
cdc/* @kbr-scylla @elcallio @piodul
test/cql/cdc_* @kbr-scylla @elcallio @piodul
test/boost/cdc_* @kbr-scylla @elcallio @piodul
# COMMITLOG / BATCHLOG
db/commitlog/* @elcallio @eliransin
@@ -25,18 +25,18 @@ compaction/* @raphaelsc
transport/*
# CQL QUERY LANGUAGE
cql3/* @tgrabiec
cql3/* @tgrabiec @nuivall @ptrsmrn @KrzaQ
# COUNTERS
counters* @jul-stas
tests/counter_test* @jul-stas
counters* @nuivall @ptrsmrn @KrzaQ
tests/counter_test* @nuivall @ptrsmrn @KrzaQ
# DOCS
docs/* @annastuchlik @tzach
docs/alternator @annastuchlik @tzach @nyh @havaker @nuivall
docs/alternator @annastuchlik @tzach @nyh @nuivall @ptrsmrn @KrzaQ
# GOSSIP
gms/* @tgrabiec @asias
gms/* @tgrabiec @asias @kbr-scylla
# DOCKER
dist/docker/*
@@ -74,8 +74,8 @@ streaming/* @tgrabiec @asias
service/storage_service.* @tgrabiec @asias
# ALTERNATOR
alternator/* @havaker @nuivall
test/alternator/* @havaker @nuivall
alternator/* @nyh @nuivall @ptrsmrn @KrzaQ
test/alternator/* @nyh @nuivall @ptrsmrn @KrzaQ
# HINTED HANDOFF
db/hints/* @piodul @vladzcloudius @eliransin
@@ -94,8 +94,8 @@ test/boost/querier_cache_test.cc @denesb
test/cql-pytest/* @nyh
# RAFT
raft/* @kbr- @gleb-cloudius @kostja
test/raft/* @kbr- @gleb-cloudius @kostja
raft/* @kbr-scylla @gleb-cloudius @kostja
test/raft/* @kbr-scylla @gleb-cloudius @kostja
# HEAT-WEIGHTED LOAD BALANCING
db/heat_load_balance.* @nyh @gleb-cloudius

8
.github/mergify.yml vendored
View File

@@ -15,7 +15,7 @@ pull_request_rules:
- closed
actions:
delete_head_branch:
- name: Automate backport pull request 5.2
- name: Automate backport pull request 6.1
conditions:
- or:
- closed
@@ -23,11 +23,11 @@ pull_request_rules:
- or:
- base=master
- base=next
- label=backport/5.2 # The PR must have this label to trigger the backport
- label=backport/6.1 # The PR must have this label to trigger the backport
- label=promoted-to-master
actions:
copy:
title: "[Backport 5.2] {{ title }}"
title: "[Backport 6.1] {{ title }}"
body: |
{{ body }}
@@ -37,7 +37,7 @@ pull_request_rules:
Refs #{{number}}
branches:
- branch-5.2
- branch-6.1
assignees:
- "{{ author }}"
- name: Automate backport pull request 5.4

186
.github/scripts/auto-backport.py vendored Executable file
View File

@@ -0,0 +1,186 @@
#!/usr/bin/env python3
import argparse
import os
import re
import sys
import tempfile
import logging
from github import Github, GithubException
from git import Repo, GitCommandError
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
try:
github_token = os.environ["GITHUB_TOKEN"]
except KeyError:
print("Please set the 'GITHUB_TOKEN' environment variable")
sys.exit(1)
def is_pull_request():
return '--pull-request' in sys.argv[1:]
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--repo', type=str, required=True, help='Github repository name')
parser.add_argument('--base-branch', type=str, default='refs/heads/master', help='Base branch')
parser.add_argument('--commits', default=None, type=str, help='Range of promoted commits.')
parser.add_argument('--pull-request', type=int, help='Pull request number to be backported')
parser.add_argument('--head-commit', type=str, required=is_pull_request(), help='The HEAD of target branch after the pull request specified by --pull-request is merged')
return parser.parse_args()
def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr_title, commits, is_draft=False):
pr_body = f'{pr.body}\n\n'
for commit in commits:
pr_body += f'- (cherry picked from commit {commit})\n\n'
pr_body += f'Parent PR: #{pr.number}'
try:
backport_pr = repo.create_pull(
title=backport_pr_title,
body=pr_body,
head=f'scylladbbot:{new_branch_name}',
base=base_branch_name,
draft=is_draft
)
logging.info(f"Pull request created: {backport_pr.html_url}")
backport_pr.add_to_assignees(pr.user)
if is_draft:
backport_pr.add_to_labels("conflicts")
pr_comment = f"@{pr.user} - This PR was marked as draft because it has conflicts\n"
pr_comment += "Please resolve them and mark this PR as ready for review"
backport_pr.create_issue_comment(pr_comment)
logging.info(f"Assigned PR to original author: {pr.user}")
return backport_pr
except GithubException as e:
if 'A pull request already exists' in str(e):
logging.warning(f'A pull request already exists for {pr.user}:{new_branch_name}')
else:
logging.error(f'Failed to create PR: {e}')
def get_pr_commits(repo, pr, stable_branch, start_commit=None):
commits = []
if pr.merged:
merge_commit = repo.get_commit(pr.merge_commit_sha)
if len(merge_commit.parents) > 1: # Check if this merge commit includes multiple commits
commits.append(pr.merge_commit_sha)
else:
if start_commit:
promoted_commits = repo.compare(start_commit, stable_branch).commits
else:
promoted_commits = repo.get_commits(sha=stable_branch)
for commit in pr.get_commits():
for promoted_commit in promoted_commits:
commit_title = commit.commit.message.splitlines()[0]
# In Scylla-pkg and scylla-dtest, for example,
# we don't create a merge commit for a PR with multiple commits,
# according to the GitHub API, the last commit will be the merge commit,
# which is not what we need when backporting (we need all the commits).
# So here, we are validating the correct SHA for each commit so we can cherry-pick
if promoted_commit.commit.message.startswith(commit_title):
commits.append(promoted_commit.sha)
elif pr.state == 'closed':
events = pr.get_issue_events()
for event in events:
if event.event == 'closed':
commits.append(event.commit_id)
return commits
def create_pr_comment_and_remove_label(pr, comment_body):
labels = pr.get_labels()
pattern = re.compile(r"backport/\d+\.\d+$")
for label in labels:
if pattern.match(label.name):
print(f"Removing label: {label.name}")
comment_body += f'- {label.name}\n'
pr.remove_from_labels(label)
pr.create_issue_comment(comment_body)
def backport(repo, pr, version, commits, backport_base_branch):
new_branch_name = f'backport/{pr.number}/to-{version}'
backport_pr_title = f'[Backport {version}] {pr.title}'
repo_url = f'https://scylladbbot:{github_token}@github.com/{repo.full_name}.git'
fork_repo = f'https://scylladbbot:{github_token}@github.com/scylladbbot/{repo.name}.git'
with (tempfile.TemporaryDirectory() as local_repo_path):
try:
repo_local = Repo.clone_from(repo_url, local_repo_path, branch=backport_base_branch)
repo_local.git.checkout(b=new_branch_name)
is_draft = False
for commit in commits:
try:
repo_local.git.cherry_pick(commit, '-m1', '-x')
except GitCommandError as e:
logging.warning(f'Cherry-pick conflict on commit {commit}: {e}')
is_draft = True
repo_local.git.add(A=True)
repo_local.git.cherry_pick('--continue')
if not repo.private and not repo.has_in_collaborators(pr.user.login):
repo.add_to_collaborators(pr.user.login, permission="push")
comment = f':warning: @{pr.user.login} you have been added as collaborator to scylladbbot fork '
comment += f'Please check your inbox and approve the invitation, once it is done, please add the backport labels again'
create_pr_comment_and_remove_label(pr, comment)
return
repo_local.git.push(fork_repo, new_branch_name, force=True)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft=is_draft)
except GitCommandError as e:
logging.warning(f"GitCommandError: {e}")
def main():
args = parse_args()
base_branch = args.base_branch.split('/')[2]
promoted_label = 'promoted-to-master'
repo_name = args.repo
if 'scylla-enterprise' in args.repo:
promoted_label = 'promoted-to-enterprise'
stable_branch = base_branch
backport_branch = 'branch-'
backport_label_pattern = re.compile(r'backport/\d+\.\d+$')
g = Github(github_token)
repo = g.get_repo(repo_name)
closed_prs = []
start_commit = None
if args.commits:
start_commit, end_commit = args.commits.split('..')
commits = repo.compare(start_commit, end_commit).commits
for commit in commits:
match = re.search(rf"Closes .*#([0-9]+)", commit.commit.message, re.IGNORECASE)
if match:
pr_number = int(match.group(1))
pr = repo.get_pull(pr_number)
closed_prs.append(pr)
if args.pull_request:
start_commit = args.head_commit
pr = repo.get_pull(args.pull_request)
closed_prs = [pr]
for pr in closed_prs:
labels = [label.name for label in pr.labels]
backport_labels = [label for label in labels if backport_label_pattern.match(label)]
if promoted_label not in labels:
print(f'no {promoted_label} label: {pr.number}')
continue
if not backport_labels:
print(f'no backport label: {pr.number}')
continue
commits = get_pr_commits(repo, pr, stable_branch, start_commit)
logging.info(f"Found PR #{pr.number} with commit {commits} and the following labels: {backport_labels}")
for backport_label in backport_labels:
version = backport_label.replace('backport/', '')
backport_base_branch = backport_label.replace('backport/', backport_branch)
backport(repo, pr, version, commits, backport_base_branch)
if __name__ == "__main__":
main()

View File

@@ -16,13 +16,8 @@ def parser():
parser = argparse.ArgumentParser()
parser.add_argument('--repository', type=str, required=True,
help='Github repository name (e.g., scylladb/scylladb)')
parser.add_argument('--commit_before_merge', type=str, required=True, help='Git commit ID to start labeling from ('
'newest commit).')
parser.add_argument('--commit_after_merge', type=str, required=True,
help='Git commit ID to end labeling at (oldest '
'commit, exclusive).')
parser.add_argument('--update_issue', type=bool, default=False, help='Set True to update issues when backport was '
'done')
parser.add_argument('--commits', type=str, required=True, help='Range of promoted commits.')
parser.add_argument('--label', type=str, default='promoted-to-master', help='Label to use')
parser.add_argument('--ref', type=str, required=True, help='PR target branch')
return parser.parse_args()
@@ -53,10 +48,11 @@ def main():
target_branch = re.search(r'branch-(\d+\.\d+)', args.ref)
g = Github(github_token)
repo = g.get_repo(args.repository, lazy=False)
commits = repo.compare(head=args.commit_after_merge, base=args.commit_before_merge)
start_commit, end_commit = args.commits.split('..')
commits = repo.compare(start_commit, end_commit).commits
processed_prs = set()
# Print commit information
for commit in commits.commits:
for commit in commits:
print(f'Commit sha is: {commit.sha}')
match = pr_pattern.search(commit.commit.message)
if match:
@@ -66,13 +62,13 @@ def main():
if target_branch:
pr = repo.get_pull(pr_number)
branch_name = target_branch[1]
refs_pr = re.findall(r'Refs (?:#|https.*?)(\d+)', pr.body)
refs_pr = re.findall(r'Parent PR: (?:#|https.*?)(\d+)', pr.body)
if refs_pr:
print(f'branch-{target_branch.group(1)}, pr number is: {pr_number}')
# 1. change the backport label of the parent PR to note that
# we've merge the corresponding backport PR
# we've merged the corresponding backport PR
# 2. close the backport PR and leave a comment on it to note
# that it has been merged with a certain git commit,
# that it has been merged with a certain git commit.
ref_pr_number = refs_pr[0]
mark_backport_done(repo, ref_pr_number, branch_name)
comment = f'Closed via {commit.sha}'

View File

@@ -5,9 +5,10 @@ on:
branches:
- master
- branch-*.*
env:
DEFAULT_BRANCH: 'master'
- enterprise
pull_request_target:
types: [labeled]
branches: [master, next, enterprise]
jobs:
check-commit:
@@ -20,17 +21,51 @@ jobs:
env:
GITHUB_CONTEXT: ${{ toJson(github) }}
run: echo "$GITHUB_CONTEXT"
- name: Set Default Branch
id: set_branch
run: |
if [[ "${{ github.repository }}" == *enterprise* ]]; then
echo "DEFAULT_BRANCH=enterprise" >> $GITHUB_ENV
else
echo "DEFAULT_BRANCH=master" >> $GITHUB_ENV
fi
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
fetch-depth: 0 # Fetch all history for all tags and branches
- name: Set up Git identity
run: |
git config --global user.name "GitHub Action"
git config --global user.email "action@github.com"
git config --global merge.conflictstyle diff3
- name: Install dependencies
run: sudo apt-get install -y python3-github
run: sudo apt-get install -y python3-github python3-git
- name: Run python script
if: github.event_name == 'push'
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: python .github/scripts/label_promoted_commits.py --commit_before_merge ${{ github.event.before }} --commit_after_merge ${{ github.event.after }} --repository ${{ github.repository }} --ref ${{ github.ref }}
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/label_promoted_commits.py --commits ${{ github.event.before }}..${{ github.sha }} --repository ${{ github.repository }} --ref ${{ github.ref }}
- name: Run auto-backport.py when promotion completed
if: ${{ github.event_name == 'push' && github.ref == format('refs/heads/{0}', env.DEFAULT_BRANCH) }}
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --commits ${{ github.event.before }}..${{ github.sha }}
- name: Check if label starts with 'backport/' and contains digits
id: check_label
run: |
label_name="${{ github.event.label.name }}"
if [[ "$label_name" =~ ^backport/[0-9]+\.[0-9]+$ ]]; then
echo "Label matches backport/X.X pattern."
echo "backport_label=true" >> $GITHUB_OUTPUT
else
echo "Label does not match the required pattern."
echo "backport_label=false" >> $GITHUB_OUTPUT
fi
- name: Run auto-backport.py when label was added
if: ${{ github.event_name == 'pull_request_target' && steps.check_label.outputs.backport_label == 'true' && github.event.pull_request.state == 'closed' }}
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --pull-request ${{ github.event.pull_request.number }} --head-commit ${{ github.event.pull_request.base.sha }}

View File

@@ -22,5 +22,12 @@ jobs:
const regex = new RegExp(pattern);
if (!regex.test(body)) {
core.setFailed("PR body does not contain a valid 'Fixes' reference.");
const error = "PR body does not contain a valid 'Fixes' reference.";
core.setFailed(error);
await github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: `:warning: ${error}`
});
}

View File

@@ -13,10 +13,14 @@ on:
value: ${{ jobs.build.outputs.md5sum }}
jobs:
read-toolchain:
uses: ./.github/workflows/read-toolchain.yaml
build:
if: github.repository == 'scylladb/scylladb'
needs:
- read-toolchain
runs-on: ubuntu-latest
# be consistent with tools/toolchain/image
container: scylladb/scylla-toolchain:fedora-40-20240621
container: ${{ needs.read-toolchain.outputs.image }}
outputs:
md5sum: ${{ steps.checksum.outputs.md5sum }}
steps:

View File

@@ -7,7 +7,7 @@ on:
env:
# use the development branch explicitly
CLANG_VERSION: 19
CLANG_VERSION: 20
BUILD_DIR: build
permissions: {}
@@ -20,6 +20,7 @@ concurrency:
jobs:
clang-dev:
name: Build with clang nightly
if: github.repository == 'scylladb/scylladb'
runs-on: ubuntu-latest
container: fedora:40
strategy:

View File

@@ -10,9 +10,6 @@ on:
- 'docs/**'
- '.github/**'
workflow_dispatch:
schedule:
# only at 5AM Saturday
- cron: '0 5 * * SAT'
env:
BUILD_TYPE: RelWithDebInfo

View File

@@ -12,7 +12,8 @@ on:
- enterprise
paths:
- "docs/**"
- "db/config.hh"
- "db/config.cc"
jobs:
build:
runs-on: ubuntu-latest

View File

@@ -0,0 +1,27 @@
name: Mark PR as Ready When Conflicts Label is Removed
on:
pull_request_target:
types:
- unlabeled
env:
DEFAULT_BRANCH: 'master'
jobs:
mark-ready:
if: github.event.label.name == 'conflicts'
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
fetch-depth: 1
- name: Mark pull request as ready for review
run: gh pr ready "${{ github.event.pull_request.number }}"
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}

View File

@@ -19,6 +19,7 @@ jobs:
with:
build_mode: release
compare-checksum:
if: github.repository == 'scylladb/scylladb'
runs-on: ubuntu-latest
needs:
- build-a

4
.gitignore vendored
View File

@@ -3,6 +3,7 @@
.settings
build
build.ninja
cmake-build-*
build.ninja.new
cscope.*
/debian/
@@ -13,13 +14,14 @@ dist/ami/scylla_deploy.sh
Cql.tokens
.kdev4
*.kdev4
.idea
CMakeLists.txt.user
.cache
.tox
*.egg-info
__pycache__CMakeLists.txt.user
.gdbinit
resources
/resources
.pytest_cache
/expressions.tokens
tags

5
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui
@@ -9,9 +9,6 @@
[submodule "abseil"]
path = abseil
url = ../abseil-cpp
[submodule "scylla-jmx"]
path = tools/jmx
url = ../scylla-jmx
[submodule "scylla-tools"]
path = tools/java
url = ../scylla-tools-java

View File

@@ -2,8 +2,6 @@ cmake_minimum_required(VERSION 3.27)
project(scylla)
include(CTest)
list(APPEND CMAKE_MODULE_PATH
${CMAKE_CURRENT_SOURCE_DIR}/cmake
${CMAKE_CURRENT_SOURCE_DIR}/seastar/cmake)
@@ -55,20 +53,22 @@ set(Seastar_DEPRECATED_OSTREAM_FORMATTERS OFF CACHE BOOL "" FORCE)
set(Seastar_APPS ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_APPS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_EXCLUDE_TESTS_FROM_ALL ON CACHE BOOL "" FORCE)
set(Seastar_IO_URING OFF CACHE BOOL "" FORCE)
set(Seastar_SCHEDULING_GROUPS_COUNT 16 CACHE STRING "" FORCE)
set(Seastar_UNUSED_RESULT_ERROR ON CACHE BOOL "" FORCE)
add_subdirectory(seastar)
set(ABSL_PROPAGATE_CXX_STD ON CACHE BOOL "" FORCE)
find_package(Sanitizers QUIET)
set(sanitizer_cxx_flags
$<$<IN_LIST:$<CONFIG>,Debug;Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>>)
$<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_COMPILE_OPTIONS>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_COMPILE_OPTIONS>>)
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
set(ABSL_GCC_FLAGS ${sanitizer_cxx_flags})
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
set(ABSL_LLVM_FLAGS ${sanitizer_cxx_flags})
endif()
set(ABSL_DEFAULT_LINKOPTS
$<$<IN_LIST:$<CONFIG>,Debug;Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_LINK_LIBRARIES>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_LINK_LIBRARIES>>)
$<$<CONFIG:Debug,Sanitize>:$<TARGET_PROPERTY:Sanitizers::address,INTERFACE_LINK_LIBRARIES>;$<TARGET_PROPERTY:Sanitizers::undefined_behavior,INTERFACE_LINK_LIBRARIES>>)
add_subdirectory(abseil)
add_library(absl-headers INTERFACE)
target_include_directories(absl-headers SYSTEM INTERFACE
@@ -95,7 +95,7 @@ target_link_libraries(Boost::regex
find_package(Lua REQUIRED)
find_package(ZLIB REQUIRED)
find_package(ICU COMPONENTS uc i18n REQUIRED)
find_package(fmt 9.0.0 REQUIRED)
find_package(fmt 10.0.0 REQUIRED)
find_package(libdeflate REQUIRED)
find_package(libxcrypt REQUIRED)
find_package(Snappy REQUIRED)
@@ -138,6 +138,7 @@ target_sources(scylla-main
keys.cc
multishard_mutation_query.cc
mutation_query.cc
node_ops/task_manager_module.cc
partition_slice_builder.cc
querier.cc
query.cc
@@ -151,6 +152,7 @@ target_sources(scylla-main
serializer.cc
sstables_loader.cc
table_helper.cc
tasks/task_handler.cc
tasks/task_manager.cc
timeout_config.cc
unimplemented.cc
@@ -194,6 +196,8 @@ include(check_headers)
check_headers(check-headers scylla-main
GLOB ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)
add_custom_target(compiler-training)
add_subdirectory(api)
add_subdirectory(alternator)
add_subdirectory(db)
@@ -278,4 +282,9 @@ target_include_directories(scylla PRIVATE
"${CMAKE_CURRENT_SOURCE_DIR}"
"${scylla_gen_build_dir}")
add_custom_target(maybe-scylla
DEPENDS $<$<CONFIG:Dev>:$<TARGET_FILE:scylla>>)
add_dependencies(compiler-training
maybe-scylla)
add_subdirectory(dist)

View File

@@ -19,18 +19,18 @@ $ git submodule update --init --recursive
### Dependencies
Scylla is fairly fussy about its build environment, requiring a very recent
version of the C++20 compiler and numerous tools and libraries to build.
version of the C++23 compiler and numerous tools and libraries to build.
Run `./install-dependencies.sh` (as root) to use your Linux distributions's
package manager to install the appropriate packages on your build machine.
However, this will only work on very recent distributions. For example,
currently Fedora users must upgrade to Fedora 32 otherwise the C++ compiler
will be too old, and not support the new C++20 standard that Scylla uses.
will be too old, and not support the new C++23 standard that Scylla uses.
Alternatively, to avoid having to upgrade your build machine or install
various packages on it, we provide another option - the **frozen toolchain**.
This is a script, `./tools/toolchain/dbuild`, that can execute build or run
commands inside a Docker image that contains exactly the right build tools and
commands inside a container that contains exactly the right build tools and
libraries. The `dbuild` technique is useful for beginners, but is also the way
in which ScyllaDB produces official releases, so it is highly recommended.
@@ -43,6 +43,12 @@ $ ./tools/toolchain/dbuild ninja build/release/scylla
$ ./tools/toolchain/dbuild ./build/release/scylla --developer-mode 1
```
Note: do not mix environemtns - either perform all your work with dbuild, or natively on the host.
Note2: you can get to an interactive shell within dbuild by running it without any parameters:
```bash
$ ./tools/toolchain/dbuild
```
### Build system
**Note**: Compiling Scylla requires, conservatively, 2 GB of memory per native
@@ -116,6 +122,13 @@ Run all tests through the test execution wrapper with
$ ./test.py --mode={debug,release}
```
or, if you are using `dbuild`, you need to build the code and the tests and then you can run them at will:
```bash
$ ./tools/toolchain/dbuild ninja {debug,release,dev}-build
$ ./tools/toolchain/dbuild ./test.py --mode {debug,release,dev}
```
The `--name` argument can be specified to run a particular test.
Alternatively, you can execute the test executable directly. For example,

View File

@@ -15,7 +15,7 @@ For more information, please see the [ScyllaDB web site].
## Build Prerequisites
Scylla is fairly fussy about its build environment, requiring very recent
versions of the C++20 compiler and of many libraries to build. The document
versions of the C++23 compiler and of many libraries to build. The document
[HACKING.md](HACKING.md) includes detailed information on building and
developing Scylla, but to get Scylla building quickly on (almost) any build
machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md),
@@ -84,11 +84,11 @@ Documentation can be found [here](docs/dev/README.md).
Seastar documentation can be found [here](http://docs.seastar.io/master/index.html).
User documentation can be found [here](https://docs.scylladb.com/).
## Training
## Training
Training material and online courses can be found at [Scylla University](https://university.scylladb.com/).
The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling,
administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions,
Training material and online courses can be found at [Scylla University](https://university.scylladb.com/).
The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling,
administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions,
multi-datacenters and how Scylla integrates with third-party applications.
## Contributing to Scylla

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=6.1.0-dev
VERSION=6.2.4
if test -f version
then
@@ -104,7 +104,7 @@ else
fi
if [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then
GIT_COMMIT_FILE=$(cat "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" |cut -d . -f 3)
GIT_COMMIT_FILE=$(cat "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" | rev | cut -d . -f 1 | rev)
if [ "$GIT_COMMIT" = "$GIT_COMMIT_FILE" ]; then
exit 0
fi

View File

@@ -19,6 +19,7 @@
#include "alternator/executor.hh"
#include "cql3/selection/selection.hh"
#include "cql3/result_set.hh"
#include "types/types.hh"
#include <seastar/core/coroutine.hh>
namespace alternator {
@@ -31,11 +32,12 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::serv
dht::partition_range_vector partition_ranges{dht::partition_range(dht::decorate_key(*schema, pk))};
std::vector<query::clustering_range> bounds{query::clustering_range::make_open_ended_both_sides()};
const column_definition* salted_hash_col = schema->get_column_definition(bytes("salted_hash"));
if (!salted_hash_col) {
co_await coroutine::return_exception(api_error::unrecognized_client(format("Credentials cannot be fetched for: {}", username)));
const column_definition* can_login_col = schema->get_column_definition(bytes("can_login"));
if (!salted_hash_col || !can_login_col) {
co_await coroutine::return_exception(api_error::unrecognized_client(fmt::format("Credentials cannot be fetched for: {}", username)));
}
auto selection = cql3::selection::selection::for_columns(schema, {salted_hash_col});
auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id}, selection->get_query_options());
auto selection = cql3::selection::selection::for_columns(schema, {salted_hash_col, can_login_col});
auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id, can_login_col->id}, selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice,
proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
auto cl = auth::password_authenticator::consistency_for_user(username);
@@ -49,11 +51,18 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::serv
auto result_set = builder.build();
if (result_set->empty()) {
co_await coroutine::return_exception(api_error::unrecognized_client(format("User not found: {}", username)));
co_await coroutine::return_exception(api_error::unrecognized_client(fmt::format("User not found: {}", username)));
}
const managed_bytes_opt& salted_hash = result_set->rows().front().front(); // We only asked for 1 row and 1 column
const auto& result = result_set->rows().front();
bool can_login = result[1] && value_cast<bool>(boolean_type->deserialize(*result[1]));
if (!can_login) {
// This is a valid role name, but has "login=False" so should not be
// usable for authentication (see #19735).
co_await coroutine::return_exception(api_error::unrecognized_client(fmt::format("Role {} has login=false so cannot be used for login", username)));
}
const managed_bytes_opt& salted_hash = result.front();
if (!salted_hash) {
co_await coroutine::return_exception(api_error::unrecognized_client(format("No password found for user: {}", username)));
co_await coroutine::return_exception(api_error::unrecognized_client(fmt::format("No password found for user: {}", username)));
}
co_return value_cast<sstring>(utf8_type->deserialize(*salted_hash));
}

View File

@@ -42,12 +42,12 @@ comparison_operator_type get_comparison_operator(const rjson::value& comparison_
{"NOT_CONTAINS", comparison_operator_type::NOT_CONTAINS},
};
if (!comparison_operator.IsString()) {
throw api_error::validation(format("Invalid comparison operator definition {}", rjson::print(comparison_operator)));
throw api_error::validation(fmt::format("Invalid comparison operator definition {}", rjson::print(comparison_operator)));
}
std::string op = comparison_operator.GetString();
auto it = ops.find(op);
if (it == ops.end()) {
throw api_error::validation(format("Unsupported comparison operator {}", op));
throw api_error::validation(fmt::format("Unsupported comparison operator {}", op));
}
return it->second;
}
@@ -429,7 +429,7 @@ static bool check_BETWEEN(const T& v, const T& lb, const T& ub, bool bounds_from
if (cmp_lt()(ub, lb)) {
if (bounds_from_query) {
throw api_error::validation(
format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
fmt::format("BETWEEN operator requires lower_bound <= upper_bound, but {} > {}", lb, ub));
} else {
return false;
}
@@ -613,7 +613,7 @@ conditional_operator_type get_conditional_operator(const rjson::value& req) {
return conditional_operator_type::OR;
} else {
throw api_error::validation(
format("'ConditionalOperator' parameter must be AND, OR or missing. Found {}.", s));
fmt::format("'ConditionalOperator' parameter must be AND, OR or missing. Found {}.", s));
}
}

View File

@@ -9,10 +9,14 @@
#include <fmt/ranges.h>
#include <seastar/core/sleep.hh>
#include "alternator/executor.hh"
#include "auth/permission.hh"
#include "cdc/log.hh"
#include "auth/service.hh"
#include "db/config.hh"
#include "log.hh"
#include "schema/schema_builder.hh"
#include "exceptions/exceptions.hh"
#include "service/client_state.hh"
#include "timestamp.hh"
#include "types/map.hh"
#include "schema/schema.hh"
@@ -29,6 +33,7 @@
#include "conditions.hh"
#include "cql3/util.hh"
#include <optional>
#include "utils/assert.hh"
#include "utils/overloaded_functor.hh"
#include <seastar/json/json_elements.hh>
#include <boost/algorithm/cxx11/any_of.hpp>
@@ -84,7 +89,7 @@ static map_type attrs_type() {
static const column_definition& attrs_column(const schema& schema) {
const column_definition* cdef = schema.get_column_definition(bytes(executor::ATTRS_COLUMN_NAME));
assert(cdef);
SCYLLA_ASSERT(cdef);
return *cdef;
}
@@ -115,6 +120,7 @@ json::json_return_type make_streamed(rjson::value&& value) {
elogger.error("Exception during streaming HTTP response: {}", ex);
}
co_await los.close();
co_await rjson::destroy_gently(std::move(*lrs));
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
@@ -189,12 +195,12 @@ static std::string view_name(const std::string& table_name, std::string_view ind
}
if (!valid_table_name_chars(index_name)) {
throw api_error::validation(
format("IndexName '{}' must satisfy regular expression pattern: [a-zA-Z0-9_.-]+", index_name));
fmt::format("IndexName '{}' must satisfy regular expression pattern: [a-zA-Z0-9_.-]+", index_name));
}
std::string ret = table_name + delim + std::string(index_name);
if (ret.length() > max_table_name_length) {
throw api_error::validation(
format("The total length of TableName ('{}') and IndexName ('{}') cannot exceed {} characters",
fmt::format("The total length of TableName ('{}') and IndexName ('{}') cannot exceed {} characters",
table_name, index_name, max_table_name_length - delim.size()));
}
return ret;
@@ -249,7 +255,7 @@ schema_ptr executor::find_table(service::storage_proxy& proxy, const rjson::valu
validate_table_name(table_name.value());
throw api_error::resource_not_found(
format("Requested resource not found: Table: {} not found", *table_name));
fmt::format("Requested resource not found: Table: {} not found", *table_name));
}
}
@@ -303,7 +309,7 @@ get_table_or_view(service::storage_proxy& proxy, const rjson::value& request) {
validate_table_name(table_name);
throw api_error::resource_not_found(
format("Requested resource not found: Internal table: {}.{} not found", internal_ks_name, internal_table_name));
fmt::format("Requested resource not found: Internal table: {}.{} not found", internal_ks_name, internal_table_name));
}
}
@@ -317,7 +323,7 @@ get_table_or_view(service::storage_proxy& proxy, const rjson::value& request) {
type = table_or_view_type::gsi;
} else {
throw api_error::validation(
format("Non-string IndexName '{}'", rjson::to_string_view(*index_name)));
fmt::format("Non-string IndexName '{}'", rjson::to_string_view(*index_name)));
}
// If no tables for global indexes were found, the index may be local
if (!proxy.data_dictionary().has_schema(keyspace_name, table_name)) {
@@ -335,14 +341,14 @@ get_table_or_view(service::storage_proxy& proxy, const rjson::value& request) {
// does exist but the index does not (ValidationException).
if (proxy.data_dictionary().has_schema(keyspace_name, orig_table_name)) {
throw api_error::validation(
format("Requested resource not found: Index '{}' for table '{}'", index_name->GetString(), orig_table_name));
fmt::format("Requested resource not found: Index '{}' for table '{}'", index_name->GetString(), orig_table_name));
} else {
throw api_error::resource_not_found(
format("Requested resource not found: Table: {} not found", orig_table_name));
fmt::format("Requested resource not found: Table: {} not found", orig_table_name));
}
} else {
throw api_error::resource_not_found(
format("Requested resource not found: Table: {} not found", table_name));
fmt::format("Requested resource not found: Table: {} not found", table_name));
}
}
}
@@ -355,7 +361,7 @@ static std::string get_string_attribute(const rjson::value& value, std::string_v
if (!attribute_value)
return default_return;
if (!attribute_value->IsString()) {
throw api_error::validation(format("Expected string value for attribute {}, got: {}",
throw api_error::validation(fmt::format("Expected string value for attribute {}, got: {}",
attribute_name, value));
}
return std::string(attribute_value->GetString(), attribute_value->GetStringLength());
@@ -370,7 +376,7 @@ static bool get_bool_attribute(const rjson::value& value, std::string_view attri
return default_return;
}
if (!attribute_value->IsBool()) {
throw api_error::validation(format("Expected boolean value for attribute {}, got: {}",
throw api_error::validation(fmt::format("Expected boolean value for attribute {}, got: {}",
attribute_name, value));
}
return attribute_value->GetBool();
@@ -384,7 +390,7 @@ static std::optional<int> get_int_attribute(const rjson::value& value, std::stri
if (!attribute_value)
return {};
if (!attribute_value->IsInt()) {
throw api_error::validation(format("Expected integer value for attribute {}, got: {}",
throw api_error::validation(fmt::format("Expected integer value for attribute {}, got: {}",
attribute_name, value));
}
return attribute_value->GetInt();
@@ -432,7 +438,7 @@ static rjson::value generate_arn_for_table(const schema& schema) {
}
static rjson::value generate_arn_for_index(const schema& schema, std::string_view index_name) {
return rjson::from_string(format(
return rjson::from_string(fmt::format(
"arn:scylla:alternator:{}:scylla:table/{}/index/{}",
schema.ks_name(), schema.cf_name(), index_name));
}
@@ -545,6 +551,53 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(response)));
}
// Check CQL's Role-Based Access Control (RBAC) permission_to_check (MODIFY,
// SELECT, DROP, etc.) on the given table. When permission is denied an
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(
const service::client_state& client_state,
const schema_ptr& schema,
auth::permission permission_to_check) {
// Using exceptions for errors makes this function faster in the success
// path (when the operation is allowed). Using a continuation instead of
// co_await makes it faster (no allocation) in the happy path where
// permissions are cached and check_has_permissions() doesn't yield.
return client_state.check_has_permission(auth::command_desc(
permission_to_check,
auth::make_data_resource(schema->ks_name(), schema->cf_name()))).then(
[permission_to_check, &schema, &client_state] (bool allowed) {
if (!allowed) {
sstring username = "anonymous";
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
throw api_error::access_denied(format(
"{} access on table {}.{} is denied to role {}",
auth::permissions::to_string(permission_to_check),
schema->ks_name(), schema->cf_name(), username));
}
});
}
// Similar to verify_permission() above, but just for CREATE operations.
// Those do not operate on any specific table, so require permissions on
// ALL KEYSPACES instead of any specific table.
future<> verify_create_permission(const service::client_state& client_state) {
return client_state.check_has_permission(auth::command_desc(
auth::permission::CREATE,
auth::resource(auth::resource_kind::data))).then(
[&client_state] (bool allowed) {
if (!allowed) {
sstring username = "anonymous";
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
throw api_error::access_denied(format(
"CREATE access on ALL KEYSPACES is denied to role {}", username));
}
});
}
future<executor::request_return_type> executor::delete_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request) {
_stats.api_operations.delete_table++;
elogger.trace("Deleting table {}", request);
@@ -560,14 +613,15 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
schema_ptr schema = get_table(_proxy, request);
rjson::value table_description = fill_table_description(schema, table_status::deleting, _proxy);
co_await _mm.container().invoke_on(0, [&] (service::migration_manager& mm) -> future<> {
co_await verify_permission(client_state, schema, auth::permission::DROP);
co_await _mm.container().invoke_on(0, [&, cs = client_state.move_to_other_shard()] (service::migration_manager& mm) -> future<> {
// FIXME: the following needs to be in a loop. If mm.announce() below
// fails, we need to retry the whole thing.
auto group0_guard = co_await mm.start_group0_operation();
if (!p.local().data_dictionary().has_schema(keyspace_name, table_name)) {
throw api_error::resource_not_found(format("Requested resource not found: Table: {} not found", table_name));
std::optional<data_dictionary::table> tbl = p.local().data_dictionary().try_find_table(keyspace_name, table_name);
if (!tbl) {
throw api_error::resource_not_found(fmt::format("Requested resource not found: Table: {} not found", table_name));
}
auto m = co_await service::prepare_column_family_drop_announcement(_proxy, keyspace_name, table_name, group0_guard.write_timestamp(), service::drop_views::yes);
@@ -575,7 +629,29 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
std::move(m2.begin(), m2.end(), std::back_inserter(m));
co_await mm.announce(std::move(m), std::move(group0_guard), format("alternator-executor: delete {} table", table_name));
// When deleting a table and its views, we need to remove this role's
// special permissions in those tables (undoing the "auto-grant" done
// by CreateTable). If we didn't do this, if a second role later
// recreates a table with the same name, the first role would still
// have permissions over the new table.
// To make things more robust we just remove *all* permissions for
// the deleted table (CQL's drop_table_statement also does this).
// Unfortunately, there is an API mismatch between this code (which
// uses separate group0_guard and vector<mutation>) and the function
// revoke_all() which uses a combined "group0_batch" structure - so
// we need to do some ugly back-and-forth conversions between the pair
// to the group0_batch and back to the pair :-(
service::group0_batch mc(std::move(group0_guard));
mc.add_mutations(std::move(m));
co_await auth::revoke_all(*cs.get().get_auth_service(),
auth::make_data_resource(schema->ks_name(), schema->cf_name()), mc);
for (const view_ptr& v : tbl->views()) {
co_await auth::revoke_all(*cs.get().get_auth_service(),
auth::make_data_resource(v->ks_name(), v->cf_name()), mc);
}
std::tie(m, group0_guard) = co_await std::move(mc).extract();
co_await mm.announce(std::move(m), std::move(group0_guard), fmt::format("alternator-executor: delete {} table", table_name));
});
rjson::value response = rjson::empty_object();
@@ -595,7 +671,7 @@ static data_type parse_key_type(const std::string& type) {
}
}
throw api_error::validation(
format("Invalid key type '{}', can only be S, B or N.", type));
fmt::format("Invalid key type '{}', can only be S, B or N.", type));
}
@@ -605,7 +681,7 @@ static void add_column(schema_builder& builder, const std::string& name, const r
// second column with the same name. We should fix this, by renaming
// some column names which we want to reserve.
if (name == executor::ATTRS_COLUMN_NAME) {
throw api_error::validation(format("Column name '{}' is currently reserved. FIXME.", name));
throw api_error::validation(fmt::format("Column name '{}' is currently reserved. FIXME.", name));
}
for (auto it = attribute_definitions.Begin(); it != attribute_definitions.End(); ++it) {
const rjson::value& attribute_info = *it;
@@ -616,7 +692,7 @@ static void add_column(schema_builder& builder, const std::string& name, const r
}
}
throw api_error::validation(
format("KeySchema key '{}' missing in AttributeDefinitions", name));
fmt::format("KeySchema key '{}' missing in AttributeDefinitions", name));
}
// Parse the KeySchema request attribute, which specifies the column names
@@ -684,7 +760,7 @@ static schema_ptr get_table_from_arn(service::storage_proxy& proxy, std::string_
// A table name cannot contain a '/' - if it does, it's not a
// table ARN, it may be an index. DynamoDB returns a
// ValidationException in that case - see #10786.
throw api_error::validation(format("ResourceArn '{}' is not a valid table ARN", table_name));
throw api_error::validation(fmt::format("ResourceArn '{}' is not a valid table ARN", table_name));
}
// FIXME: remove sstring creation once find_schema gains a view-based interface
return proxy.data_dictionary().find_schema(sstring(keyspace_name), sstring(table_name));
@@ -719,7 +795,7 @@ static void validate_tags(const std::map<sstring, sstring>& tags) {
std::string_view value = it->second;
if (!allowed_write_isolation_values.contains(value)) {
throw api_error::validation(
format("Incorrect write isolation tag {}. Allowed values: {}", value, allowed_write_isolation_values));
fmt::format("Incorrect write isolation tag {}. Allowed values: {}", value, allowed_write_isolation_values));
}
}
}
@@ -754,7 +830,7 @@ void rmw_operation::set_default_write_isolation(std::string_view value) {
"See docs/alternator/alternator.md for instructions.");
}
if (!allowed_write_isolation_values.contains(value)) {
throw std::runtime_error(format("Invalid --alternator-write-isolation "
throw std::runtime_error(fmt::format("Invalid --alternator-write-isolation "
"setting '{}'. Allowed values: {}.",
value, allowed_write_isolation_values));
}
@@ -831,6 +907,7 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
if (tags->Size() < 1) {
co_return api_error::validation("The number of tags must be at least 1") ;
}
co_await verify_permission(client_state, schema, auth::permission::ALTER);
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
});
@@ -850,7 +927,7 @@ future<executor::request_return_type> executor::untag_resource(client_state& cli
}
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
co_await verify_permission(client_state, schema, auth::permission::ALTER);
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
});
@@ -901,7 +978,10 @@ static void verify_billing_mode(const rjson::value& request) {
// throws user-facing api_error::validation if it's not.
// In particular, verify that the same AttributeName doesn't appear more than
// once (Issue #13870).
static void validate_attribute_definitions(const rjson::value& attribute_definitions){
// Return the set of attribute names defined in AttributeDefinitions - this
// set is useful for later verifying that all of them are used by some
// KeySchema (issue #19784)
static std::unordered_set<std::string> validate_attribute_definitions(const rjson::value& attribute_definitions){
if (!attribute_definitions.IsArray()) {
throw api_error::validation("AttributeDefinitions must be an array");
}
@@ -916,7 +996,7 @@ static void validate_attribute_definitions(const rjson::value& attribute_definit
}
auto [it2, added] = seen_attribute_names.emplace(rjson::to_string_view(*attribute_name));
if (!added) {
throw api_error::validation(format("Duplicate AttributeName={} in AttributeDefinitions",
throw api_error::validation(fmt::format("Duplicate AttributeName={} in AttributeDefinitions",
rjson::to_string_view(*attribute_name)));
}
const rjson::value* attribute_type = rjson::find(*it, "AttributeType");
@@ -927,10 +1007,11 @@ static void validate_attribute_definitions(const rjson::value& attribute_definit
throw api_error::validation("AttributeType in AttributeDefinitions must be a string");
}
}
return seen_attribute_names;
}
static future<executor::request_return_type> create_table_on_shard0(tracing::trace_state_ptr trace_state, rjson::value request, service::storage_proxy& sp, service::migration_manager& mm, gms::gossiper& gossiper) {
assert(this_shard_id() == 0);
static future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, service::storage_proxy& sp, service::migration_manager& mm, gms::gossiper& gossiper) {
SCYLLA_ASSERT(this_shard_id() == 0);
// We begin by parsing and validating the content of the CreateTable
// command. We can't inspect the current database schema at this point
@@ -940,19 +1021,26 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
validate_table_name(table_name);
if (table_name.find(executor::INTERNAL_TABLE_PREFIX) == 0) {
co_return api_error::validation(format("Prefix {} is reserved for accessing internal tables", executor::INTERNAL_TABLE_PREFIX));
co_return api_error::validation(fmt::format("Prefix {} is reserved for accessing internal tables", executor::INTERNAL_TABLE_PREFIX));
}
std::string keyspace_name = executor::KEYSPACE_NAME_PREFIX + table_name;
const rjson::value& attribute_definitions = request["AttributeDefinitions"];
validate_attribute_definitions(attribute_definitions);
// Save the list of AttributeDefinitions in unused_attribute_definitions,
// and below remove each one as we see it in a KeySchema of the table or
// any of its GSIs or LSIs. If anything remains in this set at the end of
// this function, it's an error.
std::unordered_set<std::string> unused_attribute_definitions =
validate_attribute_definitions(attribute_definitions);
tracing::add_table_name(trace_state, keyspace_name, table_name);
schema_builder builder(keyspace_name, table_name);
auto [hash_key, range_key] = parse_key_schema(request);
add_column(builder, hash_key, attribute_definitions, column_kind::partition_key);
unused_attribute_definitions.erase(hash_key);
if (!range_key.empty()) {
add_column(builder, range_key, attribute_definitions, column_kind::clustering_key);
unused_attribute_definitions.erase(range_key);
}
builder.with_column(bytes(executor::ATTRS_COLUMN_NAME), attrs_type(), column_kind::regular_column);
@@ -965,7 +1053,6 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
// any table.
const rjson::value* gsi = rjson::find(request, "GlobalSecondaryIndexes");
std::vector<schema_builder> view_builders;
std::vector<sstring> where_clauses;
std::unordered_set<std::string> index_names;
if (gsi) {
if (!gsi->IsArray()) {
@@ -979,7 +1066,7 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
std::string_view index_name = rjson::to_string_view(*index_name_v);
auto [it, added] = index_names.emplace(index_name);
if (!added) {
co_return api_error::validation(format("Duplicate IndexName '{}', ", index_name));
co_return api_error::validation(fmt::format("Duplicate IndexName '{}', ", index_name));
}
std::string vname(view_name(table_name, index_name));
elogger.trace("Adding GSI {}", index_name);
@@ -993,6 +1080,7 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
add_column(builder, view_hash_key, attribute_definitions, column_kind::regular_column);
}
add_column(view_builder, view_hash_key, attribute_definitions, column_kind::partition_key);
unused_attribute_definitions.erase(view_hash_key);
if (!view_range_key.empty()) {
if (partial_schema->get_column_definition(to_bytes(view_range_key)) == nullptr) {
// A column that exists in a global secondary index is upgraded from being a map entry
@@ -1005,6 +1093,7 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
add_column(builder, view_range_key, attribute_definitions, column_kind::regular_column);
}
add_column(view_builder, view_range_key, attribute_definitions, column_kind::clustering_key);
unused_attribute_definitions.erase(view_range_key);
}
// Base key columns which aren't part of the index's key need to
// be added to the view nonetheless, as (additional) clustering
@@ -1017,12 +1106,6 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
}
// GSIs have no tags:
view_builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>());
sstring where_clause = format("{} IS NOT NULL", cql3::util::maybe_quote(view_hash_key));
if (!view_range_key.empty()) {
where_clause = format("{} AND {} IS NOT NULL", where_clause,
cql3::util::maybe_quote(view_range_key));
}
where_clauses.push_back(std::move(where_clause));
view_builders.emplace_back(std::move(view_builder));
}
}
@@ -1040,7 +1123,7 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
std::string_view index_name = rjson::to_string_view(*index_name_v);
auto [it, added] = index_names.emplace(index_name);
if (!added) {
co_return api_error::validation(format("Duplicate IndexName '{}', ", index_name));
co_return api_error::validation(fmt::format("Duplicate IndexName '{}', ", index_name));
}
std::string vname(lsi_name(table_name, index_name));
elogger.trace("Adding LSI {}", index_name);
@@ -1055,9 +1138,11 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
co_return api_error::validation("LocalSecondaryIndex hash key must match the base table hash key");
}
add_column(view_builder, view_hash_key, attribute_definitions, column_kind::partition_key);
unused_attribute_definitions.erase(view_hash_key);
if (view_range_key.empty()) {
co_return api_error::validation("LocalSecondaryIndex must specify a sort key");
}
unused_attribute_definitions.erase(view_range_key);
if (view_range_key == hash_key) {
co_return api_error::validation("LocalSecondaryIndex sort key cannot be the same as hash key");
}
@@ -1075,12 +1160,6 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
// Note above we don't need to add virtual columns, as all
// base columns were copied to view. TODO: reconsider the need
// for virtual columns when we support Projection.
sstring where_clause = format("{} IS NOT NULL", cql3::util::maybe_quote(view_hash_key));
if (!view_range_key.empty()) {
where_clause = format("{} AND {} IS NOT NULL", where_clause,
cql3::util::maybe_quote(view_range_key));
}
where_clauses.push_back(std::move(where_clause));
// LSIs have no tags, but Scylla's "synchronous_updates" feature
// (which an LSIs need), is actually implemented as a tag so we
// need to add it here:
@@ -1090,6 +1169,12 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
}
}
if (!unused_attribute_definitions.empty()) {
co_return api_error::validation(format(
"AttributeDefinitions defines spurious attributes not used by any KeySchema: {}",
unused_attribute_definitions));
}
// We don't yet support configuring server-side encryption (SSE) via the
// SSESpecifiction attribute, but an SSESpecification with Enabled=false
// is simply the default, and should be accepted:
@@ -1119,8 +1204,9 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
}
builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(tags_map));
co_await verify_create_permission(client_state);
schema_ptr schema = builder.build();
auto where_clause_it = where_clauses.begin();
for (auto& view_builder : view_builders) {
// Note below we don't need to add virtual columns, as all
// base columns were copied to view. TODO: reconsider the need
@@ -1131,8 +1217,7 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
}
}
const bool include_all_columns = true;
view_builder.with_view_info(*schema, include_all_columns, *where_clause_it);
++where_clause_it;
view_builder.with_view_info(*schema, include_all_columns, ""/*where clause*/);
}
// FIXME: the following needs to be in a loop. If mm.announce() below
@@ -1157,7 +1242,7 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
schema_mutations = service::prepare_new_keyspace_announcement(sp.local_db(), ksm, ts);
} catch (exceptions::already_exists_exception&) {
if (sp.data_dictionary().has_schema(keyspace_name, table_name)) {
co_return api_error::resource_in_use(format("Table {} already exists", table_name));
co_return api_error::resource_in_use(fmt::format("Table {} already exists", table_name));
}
}
if (sp.data_dictionary().try_find_table(schema->id())) {
@@ -1179,7 +1264,27 @@ static future<executor::request_return_type> create_table_on_shard0(tracing::tra
});
}
co_await mm.announce(std::move(schema_mutations), std::move(group0_guard), format("alternator-executor: create {} table", table_name));
// If a role is allowed to create a table, we must give it permissions to
// use (and eventually delete) the specific table it just created (and
// also the view tables). This is known as "auto-grant".
// Unfortunately, there is an API mismatch between this code (which uses
// separate group0_guard and vector<mutation>) and the function
// grant_applicable_permissions() which uses a combined "group0_batch"
// structure - so we need to do some ugly back-and-forth conversions
// between the pair to the group0_batch and back to the pair :-(
service::group0_batch mc(std::move(group0_guard));
mc.add_mutations(std::move(schema_mutations));
co_await auth::grant_applicable_permissions(
*client_state.get_auth_service(), *client_state.user(),
auth::make_data_resource(schema->ks_name(), schema->cf_name()), mc);
for (const schema_builder& view_builder : view_builders) {
co_await auth::grant_applicable_permissions(
*client_state.get_auth_service(), *client_state.user(),
auth::make_data_resource(view_builder.ks_name(), view_builder.cf_name()), mc);
}
std::tie(schema_mutations, group0_guard) = co_await std::move(mc).extract();
co_await mm.announce(std::move(schema_mutations), std::move(group0_guard), fmt::format("alternator-executor: create {} table", table_name));
co_await mm.wait_for_schema_agreement(sp.local_db(), db::timeout_clock::now() + 10s, nullptr);
rjson::value status = rjson::empty_object();
@@ -1192,9 +1297,9 @@ future<executor::request_return_type> executor::create_table(client_state& clien
_stats.api_operations.create_table++;
elogger.trace("Creating table {}", request);
co_return co_await _mm.container().invoke_on(0, [&, tr = tracing::global_trace_state_ptr(trace_state), request = std::move(request), &sp = _proxy.container(), &g = _gossiper.container()]
co_return co_await _mm.container().invoke_on(0, [&, tr = tracing::global_trace_state_ptr(trace_state), request = std::move(request), &sp = _proxy.container(), &g = _gossiper.container(), client_state_other_shard = client_state.move_to_other_shard()]
(service::migration_manager& mm) mutable -> future<executor::request_return_type> {
co_return co_await create_table_on_shard0(tr, std::move(request), sp.local(), mm, g.local());
co_return co_await create_table_on_shard0(client_state_other_shard.get(), tr, std::move(request), sp.local(), mm, g.local());
});
}
@@ -1219,7 +1324,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
verify_billing_mode(request);
}
co_return co_await _mm.container().invoke_on(0, [&p = _proxy.container(), request = std::move(request), gt = tracing::global_trace_state_ptr(std::move(trace_state))]
co_return co_await _mm.container().invoke_on(0, [&p = _proxy.container(), request = std::move(request), gt = tracing::global_trace_state_ptr(std::move(trace_state)), client_state_other_shard = client_state.move_to_other_shard()]
(service::migration_manager& mm) mutable -> future<executor::request_return_type> {
// FIXME: the following needs to be in a loop. If mm.announce() below
// fails, we need to retry the whole thing.
@@ -1232,7 +1337,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
// the ugly but harmless conversion to string_view here is because
// Seastar's sstring is missing a find(std::string_view) :-()
if (std::string_view(tab->cf_name()).find(INTERNAL_TABLE_PREFIX) == 0) {
co_await coroutine::return_exception(api_error::validation(format("Prefix {} is reserved for accessing internal tables", INTERNAL_TABLE_PREFIX)));
co_await coroutine::return_exception(api_error::validation(fmt::format("Prefix {} is reserved for accessing internal tables", INTERNAL_TABLE_PREFIX)));
}
schema_builder builder(tab);
@@ -1250,7 +1355,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
}
auto schema = builder.build();
co_await verify_permission(client_state_other_shard.get(), schema, auth::permission::ALTER);
auto m = co_await service::prepare_column_family_update_announcement(p.local(), schema, std::vector<view_ptr>(), group0_guard.write_timestamp());
co_await mm.announce(std::move(m), std::move(group0_guard), format("alternator-executor: update {} table", tab->cf_name()));
@@ -1339,7 +1444,7 @@ void validate_value(const rjson::value& v, const char* caller) {
}
} else if (type != "L" && type != "M" && type != "BOOL" && type != "NULL") {
// TODO: can do more sanity checks on the content of the above types.
throw api_error::validation(format("{}: unknown type {} for value {}", caller, type, v));
throw api_error::validation(fmt::format("{}: unknown type {} for value {}", caller, type, v));
}
}
@@ -1533,7 +1638,7 @@ rmw_operation::returnvalues rmw_operation::parse_returnvalues(const rjson::value
} else if (s == "UPDATED_NEW") {
return rmw_operation::returnvalues::UPDATED_NEW;
} else {
throw api_error::validation(format("Unrecognized value for ReturnValues: {}", s));
throw api_error::validation(fmt::format("Unrecognized value for ReturnValues: {}", s));
}
}
@@ -1552,7 +1657,7 @@ rmw_operation::parse_returnvalues_on_condition_check_failure(const rjson::value&
} else if (s == "ALL_OLD") {
return rmw_operation::returnvalues_on_condition_check_failure::ALL_OLD;
} else {
throw api_error::validation(format("Unrecognized value for ReturnValuesOnConditionCheckFailure: {}", s));
throw api_error::validation(fmt::format("Unrecognized value for ReturnValuesOnConditionCheckFailure: {}", s));
}
}
@@ -1676,7 +1781,7 @@ future<executor::request_return_type> rmw_operation::execute(service::storage_pr
}
} else if (_write_isolation != write_isolation::LWT_ALWAYS) {
std::optional<mutation> m = apply(nullptr, api::new_timestamp());
assert(m); // !needs_read_before_write, so apply() did not check a condition
SCYLLA_ASSERT(m); // !needs_read_before_write, so apply() did not check a condition
return proxy.mutate(std::vector<mutation>{std::move(*m)}, db::consistency_level::LOCAL_QUORUM, executor::default_timeout(), trace_state, std::move(permit), db::allow_per_partition_rate_limit::yes).then([this] () mutable {
return rmw_operation_return(std::move(_return_attributes));
});
@@ -1808,10 +1913,13 @@ future<executor::request_return_type> executor::put_item(client_state& client_st
auto op = make_shared<put_item_operation>(_proxy, std::move(request));
tracing::add_table_name(trace_state, op->schema()->ks_name(), op->schema()->cf_name());
const bool needs_read_before_write = op->needs_read_before_write();
co_await verify_permission(client_state, op->schema(), auth::permission::MODIFY);
if (auto shard = op->shard_for_execute(needs_read_before_write); shard) {
_stats.api_operations.put_item--; // uncount on this shard, will be counted in other shard
_stats.shard_bounce_for_lwt++;
return container().invoke_on(*shard, _ssg,
co_return co_await container().invoke_on(*shard, _ssg,
[request = std::move(*op).move_request(), cs = client_state.move_to_other_shard(), gt = tracing::global_trace_state_ptr(trace_state), permit = std::move(permit)]
(executor& e) mutable {
return do_with(cs.get(), [&e, request = std::move(request), trace_state = tracing::trace_state_ptr(gt)]
@@ -1824,7 +1932,7 @@ future<executor::request_return_type> executor::put_item(client_state& client_st
});
});
}
return op->execute(_proxy, client_state, trace_state, std::move(permit), needs_read_before_write, _stats).finally([op, start_time, this] {
co_return co_await op->execute(_proxy, client_state, trace_state, std::move(permit), needs_read_before_write, _stats).finally([op, start_time, this] {
_stats.api_operations.put_item_latency.mark(std::chrono::steady_clock::now() - start_time);
});
}
@@ -1897,10 +2005,13 @@ future<executor::request_return_type> executor::delete_item(client_state& client
auto op = make_shared<delete_item_operation>(_proxy, std::move(request));
tracing::add_table_name(trace_state, op->schema()->ks_name(), op->schema()->cf_name());
const bool needs_read_before_write = op->needs_read_before_write();
co_await verify_permission(client_state, op->schema(), auth::permission::MODIFY);
if (auto shard = op->shard_for_execute(needs_read_before_write); shard) {
_stats.api_operations.delete_item--; // uncount on this shard, will be counted in other shard
_stats.shard_bounce_for_lwt++;
return container().invoke_on(*shard, _ssg,
co_return co_await container().invoke_on(*shard, _ssg,
[request = std::move(*op).move_request(), cs = client_state.move_to_other_shard(), gt = tracing::global_trace_state_ptr(trace_state), permit = std::move(permit)]
(executor& e) mutable {
return do_with(cs.get(), [&e, request = std::move(request), trace_state = tracing::trace_state_ptr(gt)]
@@ -1913,7 +2024,7 @@ future<executor::request_return_type> executor::delete_item(client_state& client
});
});
}
return op->execute(_proxy, client_state, trace_state, std::move(permit), needs_read_before_write, _stats).finally([op, start_time, this] {
co_return co_await op->execute(_proxy, client_state, trace_state, std::move(permit), needs_read_before_write, _stats).finally([op, start_time, this] {
_stats.api_operations.delete_item_latency.mark(std::chrono::steady_clock::now() - start_time);
});
}
@@ -2078,10 +2189,11 @@ static future<> do_batch_write(service::storage_proxy& proxy,
future<executor::request_return_type> executor::batch_write_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request) {
_stats.api_operations.batch_write_item++;
rjson::value& request_items = request["RequestItems"];
auto start_time = std::chrono::steady_clock::now();
std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders;
mutation_builders.reserve(request_items.MemberCount());
uint batch_size = 0;
for (auto it = request_items.MemberBegin(); it != request_items.MemberEnd(); ++it) {
schema_ptr schema = get_table_from_batch_request(_proxy, it);
tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());
@@ -2089,7 +2201,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
1, primary_key_hash{schema}, primary_key_equal{schema});
for (auto& request : it->value.GetArray()) {
if (!request.IsObject() || request.MemberCount() != 1) {
return make_ready_future<request_return_type>(api_error::validation(format("Invalid BatchWriteItem request: {}", request)));
co_return api_error::validation(format("Invalid BatchWriteItem request: {}", request));
}
auto r = request.MemberBegin();
const std::string r_name = r->name.GetString();
@@ -2100,9 +2212,10 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
item, schema, put_or_delete_item::put_item{}));
auto mut_key = std::make_pair(mutation_builders.back().second.pk(), mutation_builders.back().second.ck());
if (used_keys.contains(mut_key)) {
return make_ready_future<request_return_type>(api_error::validation("Provided list of item keys contains duplicates"));
co_return api_error::validation("Provided list of item keys contains duplicates");
}
used_keys.insert(std::move(mut_key));
batch_size++;
} else if (r_name == "DeleteRequest") {
const rjson::value& key = (r->value)["Key"];
mutation_builders.emplace_back(schema, put_or_delete_item(
@@ -2110,21 +2223,28 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
auto mut_key = std::make_pair(mutation_builders.back().second.pk(),
mutation_builders.back().second.ck());
if (used_keys.contains(mut_key)) {
return make_ready_future<request_return_type>(api_error::validation("Provided list of item keys contains duplicates"));
co_return api_error::validation("Provided list of item keys contains duplicates");
}
used_keys.insert(std::move(mut_key));
batch_size++;
} else {
return make_ready_future<request_return_type>(api_error::validation(format("Unknown BatchWriteItem request type: {}", r_name)));
co_return api_error::validation(fmt::format("Unknown BatchWriteItem request type: {}", r_name));
}
}
}
return do_batch_write(_proxy, _ssg, std::move(mutation_builders), client_state, trace_state, std::move(permit), _stats).then([] () {
for (const auto& b : mutation_builders) {
co_await verify_permission(client_state, b.first, auth::permission::MODIFY);
}
_stats.api_operations.batch_write_item_batch_total += batch_size;
co_return co_await do_batch_write(_proxy, _ssg, std::move(mutation_builders), client_state, trace_state, std::move(permit), _stats).then([start_time, this] () {
// FIXME: Issue #5650: If we failed writing some of the updates,
// need to return a list of these failed updates in UnprocessedItems
// rather than fail the whole write (issue #5650).
rjson::value ret = rjson::empty_object();
rjson::add(ret, "UnprocessedItems", rjson::empty_object());
_stats.api_operations.batch_write_item_latency.mark(std::chrono::steady_clock::now() - start_time);
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
});
}
@@ -2231,13 +2351,13 @@ void attribute_path_map_add(const char* source, attribute_path_map<T>& map, cons
} else if(!p.has_operators()) {
// If p is top-level and we already have it or a part of it
// in map, it's a forbidden overlapping path.
throw api_error::validation(format(
throw api_error::validation(fmt::format(
"Invalid {}: two document paths overlap at {}", source, p.root()));
} else if (it->second.has_value()) {
// If we're here, it != map.end() && p.has_operators && it->second.has_value().
// This means the top-level attribute already has a value, and we're
// trying to add a non-top-level value. It's an overlap.
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p.root()));
throw api_error::validation(fmt::format("Invalid {}: two document paths overlap at {}", source, p.root()));
}
node* h = &it->second;
// The second step is to walk h from the top-level node to the inner node
@@ -2297,7 +2417,7 @@ void attribute_path_map_add(const char* source, attribute_path_map<T>& map, cons
if (it == map.end()) {
map.emplace(attr, node {std::move(value)});
} else {
throw api_error::validation(format(
throw api_error::validation(fmt::format(
"Invalid {}: Duplicate attribute: {}", source, attr));
}
}
@@ -2350,7 +2470,7 @@ static select_type parse_select(const rjson::value& request, table_or_view_type
}
return select_type::projection;
}
throw api_error::validation(format("Unknown Select value '{}'. Allowed choices: ALL_ATTRIBUTES, SPECIFIC_ATTRIBUTES, ALL_PROJECTED_ATTRIBUTES, COUNT",
throw api_error::validation(fmt::format("Unknown Select value '{}'. Allowed choices: ALL_ATTRIBUTES, SPECIFIC_ATTRIBUTES, ALL_PROJECTED_ATTRIBUTES, COUNT",
select));
}
@@ -2719,12 +2839,12 @@ static std::optional<rjson::value> action_result(
std::string v1_type = get_item_type_string(v1);
if (v1_type == "N") {
if (get_item_type_string(v2) != "N") {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
throw api_error::validation(fmt::format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = number_add(v1, v2);
} else if (v1_type == "SS" || v1_type == "NS" || v1_type == "BS") {
if (get_item_type_string(v2) != v1_type) {
throw api_error::validation(format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
throw api_error::validation(fmt::format("Incorrect operand type for operator or function. Expected {}: {}", v1_type, rjson::print(v2)));
}
result = set_sum(v1, v2);
} else {
@@ -2810,7 +2930,7 @@ static bool hierarchy_actions(
} else if (h.has_members()) {
if (type[0] != 'M' || !v.IsObject()) {
// A .something on a non-map doesn't work.
throw api_error::validation(format("UpdateExpression: document paths not valid for this item:{}", h));
throw api_error::validation(fmt::format("UpdateExpression: document paths not valid for this item:{}", h));
}
for (const auto& member : h.get_members()) {
std::string attr = member.first;
@@ -2997,7 +3117,7 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
std::string column_name = actions.first;
const column_definition* cdef = _schema->get_column_definition(to_bytes(column_name));
if (cdef && cdef->is_primary_key()) {
throw api_error::validation(format("UpdateItem cannot update key column {}", column_name));
throw api_error::validation(fmt::format("UpdateItem cannot update key column {}", column_name));
}
if (actions.second.has_value()) {
// An action on a top-level attribute column_name. The single
@@ -3021,7 +3141,7 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
}
const rjson::value *toplevel = rjson::find(*previous_item, column_name);
if (!toplevel) {
throw api_error::validation(format("UpdateItem cannot update document path: missing attribute {}",
throw api_error::validation(fmt::format("UpdateItem cannot update document path: missing attribute {}",
column_name));
}
rjson::value result = rjson::copy(*toplevel);
@@ -3058,7 +3178,7 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
validate_value(v2, "AttributeUpdates");
std::string v2_type = get_item_type_string(v2);
if (v2_type != "SS" && v2_type != "NS" && v2_type != "BS") {
throw api_error::validation(format("AttributeUpdates DELETE operation with Value only valid for sets, got type {}", v2_type));
throw api_error::validation(fmt::format("AttributeUpdates DELETE operation with Value only valid for sets, got type {}", v2_type));
}
if (v1) {
std::optional<rjson::value> result = set_diff(*v1, v2);
@@ -3104,7 +3224,7 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
std::string v1_type = get_item_type_string(*v1);
std::string v2_type = get_item_type_string(v2);
if (v2_type != v1_type) {
throw api_error::validation(format("Operand type mismatch in AttributeUpdates ADD. Expected {}, got {}", v1_type, v2_type));
throw api_error::validation(fmt::format("Operand type mismatch in AttributeUpdates ADD. Expected {}, got {}", v1_type, v2_type));
}
if (v1_type == "N") {
do_update(std::move(column_name), number_add(*v1, v2));
@@ -3123,7 +3243,7 @@ update_item_operation::apply(std::unique_ptr<rjson::value> previous_item, api::t
}
} else {
throw api_error::validation(
format("Unknown Action value '{}' in AttributeUpdates", action));
fmt::format("Unknown Action value '{}' in AttributeUpdates", action));
}
}
}
@@ -3161,10 +3281,13 @@ future<executor::request_return_type> executor::update_item(client_state& client
auto op = make_shared<update_item_operation>(_proxy, std::move(request));
tracing::add_table_name(trace_state, op->schema()->ks_name(), op->schema()->cf_name());
const bool needs_read_before_write = op->needs_read_before_write();
co_await verify_permission(client_state, op->schema(), auth::permission::MODIFY);
if (auto shard = op->shard_for_execute(needs_read_before_write); shard) {
_stats.api_operations.update_item--; // uncount on this shard, will be counted in other shard
_stats.shard_bounce_for_lwt++;
return container().invoke_on(*shard, _ssg,
co_return co_await container().invoke_on(*shard, _ssg,
[request = std::move(*op).move_request(), cs = client_state.move_to_other_shard(), gt = tracing::global_trace_state_ptr(trace_state), permit = std::move(permit)]
(executor& e) mutable {
return do_with(cs.get(), [&e, request = std::move(request), trace_state = tracing::trace_state_ptr(gt)]
@@ -3177,7 +3300,7 @@ future<executor::request_return_type> executor::update_item(client_state& client
});
});
}
return op->execute(_proxy, client_state, trace_state, std::move(permit), needs_read_before_write, _stats).finally([op, start_time, this] {
co_return co_await op->execute(_proxy, client_state, trace_state, std::move(permit), needs_read_before_write, _stats).finally([op, start_time, this] {
_stats.api_operations.update_item_latency.mark(std::chrono::steady_clock::now() - start_time);
});
}
@@ -3258,8 +3381,8 @@ future<executor::request_return_type> executor::get_item(client_state& client_st
auto attrs_to_get = calculate_attrs_to_get(request, used_attribute_names);
const rjson::value* expression_attribute_names = rjson::find(request, "ExpressionAttributeNames");
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "GetItem");
return _proxy.query(schema, std::move(command), std::move(partition_ranges), cl,
co_await verify_permission(client_state, schema, auth::permission::SELECT);
co_return co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), std::move(permit), client_state, trace_state)).then(
[this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = std::move(attrs_to_get), start_time = std::move(start_time)] (service::storage_proxy::coordinator_query_result qr) mutable {
_stats.api_operations.get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
@@ -3328,7 +3451,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
// the response size, as DynamoDB does.
_stats.api_operations.batch_get_item++;
rjson::value& request_items = request["RequestItems"];
auto start_time = std::chrono::steady_clock::now();
// We need to validate all the parameters before starting any asynchronous
// query, and fail the entire request on any parse error. So we parse all
// the input into our own vector "requests", each element a table_requests
@@ -3361,7 +3484,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
}
};
std::vector<table_requests> requests;
uint batch_size = 0;
for (auto it = request_items.MemberBegin(); it != request_items.MemberEnd(); ++it) {
table_requests rs(get_table_from_batch_request(_proxy, it));
tracing::add_table_name(trace_state, sstring(executor::KEYSPACE_NAME_PREFIX) + rs.schema->cf_name(), rs.schema->cf_name());
@@ -3375,9 +3498,15 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rs.add(key);
check_key(key, rs.schema);
}
batch_size += rs.requests.size();
requests.emplace_back(std::move(rs));
}
for (const table_requests& tr : requests) {
co_await verify_permission(client_state, tr.schema, auth::permission::SELECT);
}
_stats.api_operations.batch_get_item_batch_total += batch_size;
// If we got here, all "requests" are valid, so let's start the
// requests for the different partitions all in parallel.
std::vector<future<std::vector<rjson::value>>> response_futures;
@@ -3467,6 +3596,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
if (!some_succeeded && eptr) {
co_await coroutine::return_exception_ptr(std::move(eptr));
}
_stats.api_operations.batch_get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
if (is_big(response)) {
co_return make_streamed(std::move(response));
} else {
@@ -3776,7 +3906,7 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
}
static future<executor::request_return_type> do_query(service::storage_proxy& proxy,
schema_ptr schema,
schema_ptr table_schema,
const rjson::value* exclusive_start_key,
dht::partition_range_vector partition_ranges,
std::vector<query::clustering_range> ck_bounds,
@@ -3793,33 +3923,50 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
tracing::trace(trace_state, "Performing a database query");
// Reverse the schema and the clustering bounds as the underlying code expects
// reversed queries in the native reversed format.
auto query_schema = table_schema;
const bool reversed = custom_opts.contains<query::partition_slice::option::reversed>();
if (reversed) {
query_schema = table_schema->get_reversed();
std::reverse(ck_bounds.begin(), ck_bounds.end());
for (auto& bound : ck_bounds) {
bound = query::reverse(bound);
}
}
if (exclusive_start_key) {
partition_key pk = pk_from_json(*exclusive_start_key, schema);
partition_key pk = pk_from_json(*exclusive_start_key, table_schema);
auto pos = position_in_partition::for_partition_start();
if (schema->clustering_key_size() > 0) {
pos = pos_from_json(*exclusive_start_key, schema);
if (table_schema->clustering_key_size() > 0) {
pos = pos_from_json(*exclusive_start_key, table_schema);
}
old_paging_state = make_lw_shared<service::pager::paging_state>(pk, pos, query::max_partitions, query_id::create_null_id(), service::pager::paging_state::replicas_per_token_range{}, std::nullopt, 0);
}
co_await verify_permission(client_state, table_schema, auth::permission::SELECT);
auto regular_columns = boost::copy_range<query::column_id_vector>(
schema->regular_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.id; }));
table_schema->regular_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.id; }));
auto static_columns = boost::copy_range<query::column_id_vector>(
schema->static_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.id; }));
auto selection = cql3::selection::selection::wildcard(schema);
table_schema->static_columns() | boost::adaptors::transformed([] (const column_definition& cdef) { return cdef.id; }));
auto selection = cql3::selection::selection::wildcard(table_schema);
query::partition_slice::option_set opts = selection->get_query_options();
opts.add(custom_opts);
auto partition_slice = query::partition_slice(std::move(ck_bounds), std::move(static_columns), std::move(regular_columns), opts);
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, proxy.get_max_result_size(partition_slice),
auto command = ::make_lw_shared<query::read_command>(query_schema->id(), query_schema->version(), partition_slice, proxy.get_max_result_size(partition_slice),
query::tombstone_limit(proxy.get_tombstone_limit()));
elogger.trace("Executing read query (reversed {}): table schema {}, query schema {}", partition_slice.is_reversed(), table_schema->version(), query_schema->version());
auto query_state_ptr = std::make_unique<service::query_state>(client_state, trace_state, std::move(permit));
// FIXME: should be moved above, set on opts, so get_max_result_size knows it?
command->slice.options.set<query::partition_slice::option::allow_short_read>();
auto query_options = std::make_unique<cql3::query_options>(cl, std::vector<cql3::raw_value>{});
query_options = std::make_unique<cql3::query_options>(std::move(query_options), std::move(old_paging_state));
auto p = service::pager::query_pagers::pager(proxy, schema, selection, *query_state_ptr, *query_options, command, std::move(partition_ranges), nullptr);
auto p = service::pager::query_pagers::pager(proxy, query_schema, selection, *query_state_ptr, *query_options, command, std::move(partition_ranges), nullptr);
std::unique_ptr<cql3::result_set> rs = co_await p->fetch_page(limit, gc_clock::now(), executor::default_timeout());
if (!p->is_exhausted()) {
@@ -3829,7 +3976,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
bool has_filter = filter;
auto [items, size] = co_await describe_items(*selection, std::move(rs), std::move(attrs_to_get), std::move(filter));
if (paging_state) {
rjson::add(items, "LastEvaluatedKey", encode_paging_state(*schema, *paging_state));
rjson::add(items, "LastEvaluatedKey", encode_paging_state(*table_schema, *paging_state));
}
if (has_filter){
cql_stats.filtered_rows_read_total += p->stats().rows_read_total;
@@ -3843,7 +3990,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
}
static dht::token token_for_segment(int segment, int total_segments) {
assert(total_segments > 1 && segment >= 0 && segment < total_segments);
SCYLLA_ASSERT(total_segments > 1 && segment >= 0 && segment < total_segments);
uint64_t delta = std::numeric_limits<uint64_t>::max() / total_segments;
return dht::token::from_int64(std::numeric_limits<int64_t>::min() + delta * segment);
}
@@ -4001,7 +4148,7 @@ static query::clustering_range calculate_ck_bound(schema_ptr schema, const colum
// NOTICE(sarna): A range starting with given prefix and ending (non-inclusively) with a string "incremented" by a single
// character at the end. Throws for NUMBER instances.
if (!ck_cdef.type->is_compatible_with(*utf8_type)) {
throw api_error::validation(format("BEGINS_WITH operator cannot be applied to type {}", type_to_string(ck_cdef.type)));
throw api_error::validation(fmt::format("BEGINS_WITH operator cannot be applied to type {}", type_to_string(ck_cdef.type)));
}
return get_clustering_range_for_begins_with(std::move(raw_value), ck, schema, ck_cdef.type);
}
@@ -4082,13 +4229,13 @@ static std::string_view get_toplevel(const parsed::value& v,
used_attribute_names.emplace(column_name);
if (!expression_attribute_names) {
throw api_error::validation(
format("ExpressionAttributeNames missing, entry '{}' required by KeyConditionExpression",
fmt::format("ExpressionAttributeNames missing, entry '{}' required by KeyConditionExpression",
column_name));
}
const rjson::value* value = rjson::find(*expression_attribute_names, column_name);
if (!value || !value->IsString()) {
throw api_error::validation(
format("ExpressionAttributeNames missing entry '{}' required by KeyConditionExpression",
fmt::format("ExpressionAttributeNames missing entry '{}' required by KeyConditionExpression",
column_name));
}
column_name = rjson::to_string_view(*value);
@@ -4206,7 +4353,7 @@ calculate_bounds_condition_expression(schema_ptr schema,
}
if (f->_function_name != "begins_with") {
throw api_error::validation(
format("KeyConditionExpression function '{}' not supported",f->_function_name));
fmt::format("KeyConditionExpression function '{}' not supported",f->_function_name));
}
if (f->_parameters.size() != 2 || !f->_parameters[0].is_path() ||
!f->_parameters[1].is_constant()) {
@@ -4271,7 +4418,7 @@ calculate_bounds_condition_expression(schema_ptr schema,
ck_bounds.push_back(query::clustering_range(ck));
} else {
throw api_error::validation(
format("KeyConditionExpression condition on non-key attribute {}", key));
fmt::format("KeyConditionExpression condition on non-key attribute {}", key));
}
continue;
}
@@ -4279,10 +4426,10 @@ calculate_bounds_condition_expression(schema_ptr schema,
// are allowed *only* on the clustering key:
if (sstring(key) == pk_cdef.name_as_text()) {
throw api_error::validation(
format("KeyConditionExpression only '=' condition is supported on partition key {}", key));
fmt::format("KeyConditionExpression only '=' condition is supported on partition key {}", key));
} else if (!ck_cdef || sstring(key) != ck_cdef->name_as_text()) {
throw api_error::validation(
format("KeyConditionExpression condition on non-key attribute {}", key));
fmt::format("KeyConditionExpression condition on non-key attribute {}", key));
}
if (!ck_bounds.empty()) {
throw api_error::validation(
@@ -4306,7 +4453,7 @@ calculate_bounds_condition_expression(schema_ptr schema,
// begins_with() supported on bytes and strings (both stored
// in the database as strings) but not on numbers.
throw api_error::validation(
format("KeyConditionExpression begins_with() not supported on type {}",
fmt::format("KeyConditionExpression begins_with() not supported on type {}",
type_to_string(ck_cdef->type)));
} else if (raw_value.empty()) {
ck_bounds.push_back(query::clustering_range::make_open_ended_both_sides());
@@ -4439,8 +4586,10 @@ future<executor::request_return_type> executor::list_tables(client_state& client
auto tables = _proxy.data_dictionary().get_tables(); // hold on to temporary, table_names isn't a container, it's a view
auto table_names = tables
| boost::adaptors::filtered([] (data_dictionary::table t) {
return t.schema()->ks_name().find(KEYSPACE_NAME_PREFIX) == 0 && !t.schema()->is_view();
| boost::adaptors::filtered([this] (data_dictionary::table t) {
return t.schema()->ks_name().find(KEYSPACE_NAME_PREFIX) == 0 &&
!t.schema()->is_view() &&
!cdc::is_log_for_some_table(_proxy.local_db(), t.schema()->ks_name(), t.schema()->cf_name());
})
| boost::adaptors::transformed([] (data_dictionary::table t) {
return t.schema()->cf_name();
@@ -4532,7 +4681,7 @@ future<executor::request_return_type> executor::describe_continuous_backups(clie
validate_table_name(table_name);
throw api_error::table_not_found(
format("Table {} not found", table_name));
fmt::format("Table {} not found", table_name));
}
rjson::value desc = rjson::empty_object();
rjson::add(desc, "ContinuousBackupsStatus", "DISABLED");

View File

@@ -262,4 +262,9 @@ public:
// add more than a couple of levels in its own output construction.
bool is_big(const rjson::value& val, int big_size = 100'000);
// Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,
// SELECT, DROP, etc.) on the given table. When permission is denied an
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(const service::client_state&, const schema_ptr&, auth::permission);
}

View File

@@ -57,10 +57,10 @@ static Result parse(const char* input_name, std::string_view input, Func&& f) {
// TODO: displayRecognitionError could set a position inside the
// expressions_syntax_error in throws, and we could use it here to
// mark the broken position in 'input'.
throw expressions_syntax_error(format("Failed parsing {} '{}': {}",
throw expressions_syntax_error(fmt::format("Failed parsing {} '{}': {}",
input_name, input, e.what()));
} catch (...) {
throw expressions_syntax_error(format("Failed parsing {} '{}': {}",
throw expressions_syntax_error(fmt::format("Failed parsing {} '{}': {}",
input_name, input, std::current_exception()));
}
}
@@ -160,12 +160,12 @@ static std::optional<std::string> resolve_path_component(const std::string& colu
if (column_name.size() > 0 && column_name.front() == '#') {
if (!expression_attribute_names) {
throw api_error::validation(
format("ExpressionAttributeNames missing, entry '{}' required by expression", column_name));
fmt::format("ExpressionAttributeNames missing, entry '{}' required by expression", column_name));
}
const rjson::value* value = rjson::find(*expression_attribute_names, column_name);
if (!value || !value->IsString()) {
throw api_error::validation(
format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));
fmt::format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));
}
used_attribute_names.emplace(column_name);
return std::string(rjson::to_string_view(*value));
@@ -202,16 +202,16 @@ static void resolve_constant(parsed::constant& c,
[&] (const std::string& valref) {
if (!expression_attribute_values) {
throw api_error::validation(
format("ExpressionAttributeValues missing, entry '{}' required by expression", valref));
fmt::format("ExpressionAttributeValues missing, entry '{}' required by expression", valref));
}
const rjson::value* value = rjson::find(*expression_attribute_values, valref);
if (!value) {
throw api_error::validation(
format("ExpressionAttributeValues missing entry '{}' required by expression", valref));
fmt::format("ExpressionAttributeValues missing entry '{}' required by expression", valref));
}
if (value->IsNull()) {
throw api_error::validation(
format("ExpressionAttributeValues null value for entry '{}' required by expression", valref));
fmt::format("ExpressionAttributeValues null value for entry '{}' required by expression", valref));
}
validate_value(*value, "ExpressionAttributeValues");
used_attribute_values.emplace(valref);
@@ -708,7 +708,7 @@ rjson::value calculate_value(const parsed::value& v,
auto function_it = function_handlers.find(std::string_view(f._function_name));
if (function_it == function_handlers.end()) {
throw api_error::validation(
format("{}: unknown function '{}' called.", caller, f._function_name));
fmt::format("{}: unknown function '{}' called.", caller, f._function_name));
}
return function_it->second(caller, previous_item, f);
},

View File

@@ -143,17 +143,17 @@ static big_decimal parse_and_validate_number(std::string_view s) {
big_decimal ret(s);
auto [magnitude, precision] = internal::get_magnitude_and_precision(s);
if (magnitude > 125) {
throw api_error::validation(format("Number overflow: {}. Attempting to store a number with magnitude larger than supported range.", s));
throw api_error::validation(fmt::format("Number overflow: {}. Attempting to store a number with magnitude larger than supported range.", s));
}
if (magnitude < -130) {
throw api_error::validation(format("Number underflow: {}. Attempting to store a number with magnitude lower than supported range.", s));
throw api_error::validation(fmt::format("Number underflow: {}. Attempting to store a number with magnitude lower than supported range.", s));
}
if (precision > 38) {
throw api_error::validation(format("Number too precise: {}. Attempting to store a number with more significant digits than supported.", s));
throw api_error::validation(fmt::format("Number too precise: {}. Attempting to store a number with more significant digits than supported.", s));
}
return ret;
} catch (const marshal_exception& e) {
throw api_error::validation(format("The parameter cannot be converted to a numeric value: {}", s));
throw api_error::validation(fmt::format("The parameter cannot be converted to a numeric value: {}", s));
}
}
@@ -265,7 +265,7 @@ bytes get_key_column_value(const rjson::value& item, const column_definition& co
std::string column_name = column.name_as_text();
const rjson::value* key_typed_value = rjson::find(item, column_name);
if (!key_typed_value) {
throw api_error::validation(format("Key column {} not found", column_name));
throw api_error::validation(fmt::format("Key column {} not found", column_name));
}
return get_key_from_typed_value(*key_typed_value, column);
}
@@ -277,19 +277,26 @@ bytes get_key_column_value(const rjson::value& item, const column_definition& co
// mentioned in the exception message).
// If the type does match, a reference to the encoded value is returned.
static const rjson::value& get_typed_value(const rjson::value& key_typed_value, std::string_view type_str, std::string_view name, std::string_view value_name) {
if (!key_typed_value.IsObject() || key_typed_value.MemberCount() != 1 ||
!key_typed_value.MemberBegin()->value.IsString()) {
if (!key_typed_value.IsObject() || key_typed_value.MemberCount() != 1) {
throw api_error::validation(
format("Malformed value object for {} {}: {}",
fmt::format("Malformed value object for {} {}: {}",
value_name, name, key_typed_value));
}
auto it = key_typed_value.MemberBegin();
if (rjson::to_string_view(it->name) != type_str) {
throw api_error::validation(
format("Type mismatch: expected type {} for {} {}, got type {}",
fmt::format("Type mismatch: expected type {} for {} {}, got type {}",
type_str, value_name, name, it->name));
}
// We assume this function is called just for key types (S, B, N), and
// all of those always have a string value in the JSON.
if (!it->value.IsString()) {
throw api_error::validation(
fmt::format("Malformed value object for {} {}: {}",
value_name, name, key_typed_value));
}
return it->value;
}
@@ -395,16 +402,16 @@ position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema)
big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic) {
if (!v.IsObject() || v.MemberCount() != 1) {
throw api_error::validation(format("{}: invalid number object", diagnostic));
throw api_error::validation(fmt::format("{}: invalid number object", diagnostic));
}
auto it = v.MemberBegin();
if (it->name != "N") {
throw api_error::validation(format("{}: expected number, found type '{}'", diagnostic, it->name));
throw api_error::validation(fmt::format("{}: expected number, found type '{}'", diagnostic, it->name));
}
if (!it->value.IsString()) {
// We shouldn't reach here. Callers normally validate their input
// earlier with validate_value().
throw api_error::validation(format("{}: improperly formatted number constant", diagnostic));
throw api_error::validation(fmt::format("{}: improperly formatted number constant", diagnostic));
}
big_decimal ret = parse_and_validate_number(rjson::to_string_view(it->value));
return ret;
@@ -485,7 +492,7 @@ rjson::value set_sum(const rjson::value& v1, const rjson::value& v2) {
auto [set1_type, set1] = unwrap_set(v1);
auto [set2_type, set2] = unwrap_set(v2);
if (set1_type != set2_type) {
throw api_error::validation(format("Mismatched set types: {} and {}", set1_type, set2_type));
throw api_error::validation(fmt::format("Mismatched set types: {} and {}", set1_type, set2_type));
}
if (!set1 || !set2) {
throw api_error::validation("UpdateExpression: ADD operation for sets must be given sets as arguments");
@@ -513,7 +520,7 @@ std::optional<rjson::value> set_diff(const rjson::value& v1, const rjson::value&
auto [set1_type, set1] = unwrap_set(v1);
auto [set2_type, set2] = unwrap_set(v2);
if (set1_type != set2_type) {
throw api_error::validation(format("Set DELETE type mismatch: {} and {}", set1_type, set2_type));
throw api_error::validation(fmt::format("Set DELETE type mismatch: {} and {}", set1_type, set2_type));
}
if (!set1 || !set2) {
throw api_error::validation("UpdateExpression: DELETE operation can only be performed on a set");

View File

@@ -17,7 +17,10 @@
#include <seastar/util/short_streams.hh>
#include "seastarx.hh"
#include "error.hh"
#include "service/client_state.hh"
#include "service/qos/service_level_controller.hh"
#include "utils/assert.hh"
#include "timeout_config.hh"
#include "utils/rjson.hh"
#include "auth.hh"
#include <cctype>
@@ -211,7 +214,10 @@ protected:
sstring local_dc = topology.get_datacenter();
std::unordered_set<gms::inet_address> local_dc_nodes = topology.get_datacenter_endpoints().at(local_dc);
for (auto& ip : local_dc_nodes) {
if (_gossiper.is_alive(ip)) {
// Note that it's not enough for the node to be is_alive() - a
// node joining the cluster is also "alive" but not responsive to
// requests. We alive *and* normal. See #19694, #21538.
if (_gossiper.is_alive(ip) && _gossiper.is_normal(ip)) {
// Use the gossiped broadcast_rpc_address if available instead
// of the internal IP address "ip". See discussion in #18711.
rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(ip)));
@@ -257,7 +263,7 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
std::string_view authorization_header = authorization_it->second;
auto pos = authorization_header.find_first_of(' ');
if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {
throw api_error::invalid_signature(format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));
throw api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));
}
authorization_header.remove_prefix(pos+1);
std::string credential;
@@ -292,7 +298,7 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
std::vector<std::string_view> credential_split = split(credential, '/');
if (credential_split.size() != 5) {
throw api_error::validation(format("Incorrect credential information format: {}", credential));
throw api_error::validation(fmt::format("Incorrect credential information format: {}", credential));
}
std::string user(credential_split[0]);
std::string datestamp(credential_split[1]);
@@ -377,7 +383,7 @@ static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_
std::string buf;
tracing::add_session_param(trace_state, "alternator_op", op);
tracing::add_query(trace_state, truncated_content_view(query, buf));
tracing::begin(trace_state, format("Alternator {}", op), client_state.get_client_address());
tracing::begin(trace_state, seastar::format("Alternator {}", op), client_state.get_client_address());
if (!username.empty()) {
tracing::set_username(trace_state, auth::authenticated_user(username));
}
@@ -402,7 +408,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
++_executor._stats.requests_blocked_memory;
}
auto units = co_await std::move(units_fut);
assert(req->content_stream);
SCYLLA_ASSERT(req->content_stream);
chunked_content content = co_await util::read_entire_stream(*req->content_stream);
auto username = co_await verify_signature(*req, content);
@@ -413,7 +419,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
auto callback_it = _callbacks.find(op);
if (callback_it == _callbacks.end()) {
_executor._stats.unsupported_operations++;
co_return api_error::unknown_operation(format("Unsupported operation {}", op));
co_return api_error::unknown_operation(fmt::format("Unsupported operation {}", op));
}
if (_pending_requests.get_count() >= _max_concurrent_requests) {
_executor._stats.requests_shed++;
@@ -421,11 +427,11 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
}
_pending_requests.enter();
auto leave = defer([this] () noexcept { _pending_requests.leave(); });
//FIXME: Client state can provide more context, e.g. client's endpoint address
// We use unique_ptr because client_state cannot be moved or copied
executor::client_state client_state = username.empty()
? service::client_state{service::client_state::internal_tag()}
: service::client_state{service::client_state::internal_tag(), _auth_service, _sl_controller, username};
executor::client_state client_state(service::client_state::external_tag(),
_auth_service, &_sl_controller, _timeout_config.current_values(), req->get_client_address());
if (!username.empty()) {
client_state.set_login(auth::authenticated_user(username));
}
co_await client_state.maybe_update_per_service_level_params();
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);
@@ -472,6 +478,7 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
, _enforce_authorization(false)
, _enabled_servers{}
, _pending_requests{}
, _timeout_config(_proxy.data_dictionary().get_config())
, _callbacks{
{"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.create_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
@@ -636,7 +643,7 @@ future<> server::json_parser::stop() {
const char* api_error::what() const noexcept {
if (_what_string.empty()) {
_what_string = format("{} {}: {}", std::to_underlying(_http_code), _type, _msg);
_what_string = fmt::format("{} {}: {}", std::to_underlying(_http_code), _type, _msg);
}
return _what_string.c_str();
}

View File

@@ -42,6 +42,11 @@ class server {
bool _enforce_authorization;
utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;
gate _pending_requests;
// In some places we will need a CQL updateable_timeout_config object even
// though it isn't really relevant for Alternator which defines its own
// timeouts separately. We can create this object only once.
updateable_timeout_config _timeout_config;
alternator_callbacks_map _callbacks;
semaphore* _memory_limiter;

View File

@@ -67,6 +67,8 @@ stats::stats() : api_operations{} {
OPERATION_LATENCY(get_item_latency, "GetItem")
OPERATION_LATENCY(delete_item_latency, "DeleteItem")
OPERATION_LATENCY(update_item_latency, "UpdateItem")
OPERATION_LATENCY(batch_write_item_latency, "BatchWriteItem")
OPERATION_LATENCY(batch_get_item_latency, "BatchGetItem")
OPERATION(list_streams, "ListStreams")
OPERATION(describe_stream, "DescribeStream")
OPERATION(get_shard_iterator, "GetShardIterator")
@@ -94,6 +96,10 @@ stats::stats() : api_operations{} {
seastar::metrics::description("number of rows read and matched during filtering operations")),
seastar::metrics::make_total_operations("filtered_rows_dropped_total", [this] { return cql_stats.filtered_rows_read_total - cql_stats.filtered_rows_matched_total; },
seastar::metrics::description("number of rows read and dropped during filtering operations")),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchWriteItem")},
api_operations.batch_write_item_batch_total).set_skip_when_empty(),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchGetItem")},
api_operations.batch_get_item_batch_total).set_skip_when_empty(),
});
}

View File

@@ -26,6 +26,8 @@ public:
struct {
uint64_t batch_get_item = 0;
uint64_t batch_write_item = 0;
uint64_t batch_get_item_batch_total = 0;
uint64_t batch_write_item_batch_total = 0;
uint64_t create_backup = 0;
uint64_t create_global_table = 0;
uint64_t create_table = 0;
@@ -69,6 +71,8 @@ public:
utils::timed_rate_moving_average_summary_and_histogram get_item_latency;
utils::timed_rate_moving_average_summary_and_histogram delete_item_latency;
utils::timed_rate_moving_average_summary_and_histogram update_item_latency;
utils::timed_rate_moving_average_summary_and_histogram batch_write_item_latency;
utils::timed_rate_moving_average_summary_and_histogram batch_get_item_latency;
utils::timed_rate_moving_average_summary_and_histogram get_records_latency;
} api_operations;
// Miscellaneous event counters

View File

@@ -13,6 +13,7 @@
#include <seastar/json/formatter.hh>
#include "auth/permission.hh"
#include "db/config.hh"
#include "cdc/log.hh"
@@ -818,11 +819,13 @@ future<executor::request_return_type> executor::get_records(client_state& client
}
if (!schema || !base || !is_alternator_keyspace(schema->ks_name())) {
throw api_error::resource_not_found(fmt::to_string(iter.table));
co_return api_error::resource_not_found(fmt::to_string(iter.table));
}
tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());
co_await verify_permission(client_state, schema, auth::permission::SELECT);
db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;
partition_key pk = iter.shard.id.to_partition_key(*schema);
@@ -887,7 +890,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::tombstone_limit(_proxy.get_tombstone_limit()), query::row_limit(limit * mul));
return _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state)).then(
co_return co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state)).then(
[this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), start_time = std::move(start_time), limit, key_names = std::move(key_names), attr_names = std::move(attr_names), type, iter, high_ts] (service::storage_proxy::coordinator_query_result qr) mutable {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));

View File

@@ -26,6 +26,7 @@
#include "log.hh"
#include "gc_clock.hh"
#include "replica/database.hh"
#include "service/client_state.hh"
#include "service_permit.hh"
#include "timestamp.hh"
#include "service/storage_proxy.hh"
@@ -35,6 +36,7 @@
#include "mutation/mutation.hh"
#include "types/types.hh"
#include "types/map.hh"
#include "utils/assert.hh"
#include "utils/rjson.hh"
#include "utils/big_decimal.hh"
#include "cql3/selection/selection.hh"
@@ -97,6 +99,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
}
sstring attribute_name(v->GetString(), v->GetStringLength());
co_await verify_permission(client_state, schema, auth::permission::ALTER);
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {
if (enabled) {
if (tags_map.contains(TTL_TAG_KEY)) {
@@ -312,7 +315,7 @@ static size_t random_offset(size_t min, size_t max) {
// this range's primary node is down. For this we need to return not just
// a list of this node's secondary ranges - but also the primary owner of
// each of those ranges.
static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary_ranges(
static future<std::vector<std::pair<dht::token_range, gms::inet_address>>> get_secondary_ranges(
const locator::effective_replication_map_ptr& erm,
gms::inet_address ep) {
const auto& tm = *erm->get_token_metadata_ptr();
@@ -323,6 +326,7 @@ static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary
}
auto prev_tok = sorted_tokens.back();
for (const auto& tok : sorted_tokens) {
co_await coroutine::maybe_yield();
inet_address_vector_replica_set eps = erm->get_natural_endpoints(tok);
if (eps.size() <= 1 || eps[1] != ep) {
prev_tok = tok;
@@ -350,7 +354,7 @@ static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary
}
prev_tok = tok;
}
return ret;
co_return ret;
}
@@ -386,63 +390,63 @@ static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary
//
// FIXME: Check if this algorithm is safe with tablet migration.
// https://github.com/scylladb/scylladb/issues/16567
enum primary_or_secondary_t {primary, secondary};
template<primary_or_secondary_t primary_or_secondary>
class token_ranges_owned_by_this_shard {
// ranges_holder_primary holds just the primary ranges themselves
class ranges_holder_primary {
const dht::token_range_vector _token_ranges;
public:
ranges_holder_primary(const locator::vnode_effective_replication_map_ptr& erm, gms::gossiper& g, gms::inet_address ep)
: _token_ranges(erm->get_primary_ranges(ep)) {}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i];
}
bool should_skip(std::size_t i) const {
return false;
}
};
// ranges_holder<secondary> holds the secondary token ranges plus each
// range's primary owner, needed to implement should_skip().
class ranges_holder_secondary {
std::vector<std::pair<dht::token_range, gms::inet_address>> _token_ranges;
gms::gossiper& _gossiper;
public:
ranges_holder_secondary(const locator::effective_replication_map_ptr& erm, gms::gossiper& g, gms::inet_address ep)
: _token_ranges(get_secondary_ranges(erm, ep))
, _gossiper(g) {}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i].first;
}
// range i should be skipped if its primary owner is alive.
bool should_skip(std::size_t i) const {
return _gossiper.is_alive(_token_ranges[i].second);
}
};
// ranges_holder_primary holds just the primary ranges themselves
class ranges_holder_primary {
dht::token_range_vector _token_ranges;
public:
explicit ranges_holder_primary(dht::token_range_vector token_ranges) : _token_ranges(std::move(token_ranges)) {}
static future<ranges_holder_primary> make(const locator::vnode_effective_replication_map_ptr& erm, gms::inet_address ep) {
co_return ranges_holder_primary(co_await erm->get_primary_ranges(ep));
}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i];
}
bool should_skip(std::size_t i) const {
return false;
}
};
// ranges_holder<secondary> holds the secondary token ranges plus each
// range's primary owner, needed to implement should_skip().
class ranges_holder_secondary {
std::vector<std::pair<dht::token_range, gms::inet_address>> _token_ranges;
const gms::gossiper& _gossiper;
public:
explicit ranges_holder_secondary(std::vector<std::pair<dht::token_range, gms::inet_address>> token_ranges, const gms::gossiper& g)
: _token_ranges(std::move(token_ranges))
, _gossiper(g) {}
static future<ranges_holder_secondary> make(const locator::effective_replication_map_ptr& erm, gms::inet_address ep, const gms::gossiper& g) {
co_return ranges_holder_secondary(co_await get_secondary_ranges(erm, ep), g);
}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i].first;
}
// range i should be skipped if its primary owner is alive.
bool should_skip(std::size_t i) const {
return _gossiper.is_alive(_token_ranges[i].second);
}
};
template<class primary_or_secondary_t>
class token_ranges_owned_by_this_shard {
schema_ptr _s;
locator::effective_replication_map_ptr _erm;
// _token_ranges will contain a list of token ranges owned by this node.
// We'll further need to split each such range to the pieces owned by
// the current shard, using _intersecter.
using ranges_holder = std::conditional_t<
primary_or_secondary == primary_or_secondary_t::primary,
ranges_holder_primary,
ranges_holder_secondary>;
const ranges_holder _token_ranges;
const primary_or_secondary_t _token_ranges;
// NOTICE: _range_idx is used modulo _token_ranges size when accessing
// the data to ensure that it doesn't go out of bounds
size_t _range_idx;
size_t _end_idx;
std::optional<dht::selective_token_range_sharder> _intersecter;
public:
token_ranges_owned_by_this_shard(replica::database& db, gms::gossiper& g, schema_ptr s)
token_ranges_owned_by_this_shard(schema_ptr s, primary_or_secondary_t token_ranges)
: _s(s)
, _erm(s->table().get_effective_replication_map())
, _token_ranges(db.find_keyspace(s->ks_name()).get_vnode_effective_replication_map(),
g, _erm->get_topology().my_address())
, _token_ranges(std::move(token_ranges))
, _range_idx(random_offset(0, _token_ranges.size() - 1))
, _end_idx(_range_idx + _token_ranges.size())
{
@@ -498,6 +502,7 @@ struct scan_ranges_context {
bytes column_name;
std::optional<std::string> member;
service::client_state internal_client_state;
::shared_ptr<cql3::selection::selection> selection;
std::unique_ptr<service::query_state> query_state_ptr;
std::unique_ptr<cql3::query_options> query_options;
@@ -507,6 +512,7 @@ struct scan_ranges_context {
: s(s)
, column_name(column_name)
, member(member)
, internal_client_state(service::client_state::internal_tag())
{
// FIXME: don't read the entire items - read only parts of it.
// We must read the key columns (to be able to delete) and also
@@ -525,10 +531,9 @@ struct scan_ranges_context {
std::vector<query::clustering_range> ck_bounds{query::clustering_range::make_open_ended_both_sides()};
auto partition_slice = query::partition_slice(std::move(ck_bounds), {}, std::move(regular_columns), opts);
command = ::make_lw_shared<query::read_command>(s->id(), s->version(), partition_slice, proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
executor::client_state client_state{executor::client_state::internal_tag()};
tracing::trace_state_ptr trace_state;
// NOTICE: empty_service_permit is used because the TTL service has fixed parallelism
query_state_ptr = std::make_unique<service::query_state>(client_state, trace_state, empty_service_permit());
query_state_ptr = std::make_unique<service::query_state>(internal_client_state, trace_state, empty_service_permit());
// FIXME: What should we do on multi-DC? Will we run the expiration on the same ranges on all
// DCs or only once for each range? If the latter, we need to change the CLs in the
// scanner and deleter.
@@ -551,7 +556,7 @@ static future<> scan_table_ranges(
expiration_service::stats& expiration_stats)
{
const schema_ptr& s = scan_ctx.s;
assert (partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.
SCYLLA_ASSERT (partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.
auto p = service::pager::query_pagers::pager(proxy, s, scan_ctx.selection, *scan_ctx.query_state_ptr,
*scan_ctx.query_options, scan_ctx.command, std::move(partition_ranges), nullptr);
while (!p->is_exhausted()) {
@@ -724,7 +729,9 @@ static future<bool> scan_table(
expiration_stats.scan_table++;
// FIXME: need to pace the scan, not do it all at once.
scan_ranges_context scan_ctx{s, proxy, std::move(column_name), std::move(member)};
token_ranges_owned_by_this_shard<primary> my_ranges(db.real_database(), gossiper, s);
auto erm = db.real_database().find_keyspace(s->ks_name()).get_vnode_effective_replication_map();
auto my_address = erm->get_topology().my_address();
token_ranges_owned_by_this_shard my_ranges(s, co_await ranges_holder_primary::make(erm, my_address));
while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {
// Note that because of issue #9167 we need to run a separate
// query on each partition range, and can't pass several of
@@ -744,7 +751,7 @@ static future<bool> scan_table(
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
token_ranges_owned_by_this_shard<secondary> my_secondary_ranges(db.real_database(), gossiper, s);
token_ranges_owned_by_this_shard my_secondary_ranges(s, co_await ranges_holder_secondary::make(erm, my_address, gossiper));
while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {
expiration_stats.secondary_ranges_scanned++;
dht::partition_range_vector partition_ranges;

View File

@@ -7,6 +7,7 @@ set(swagger_files
api-doc/commitlog.json
api-doc/compaction_manager.json
api-doc/config.json
api-doc/cql_server_test.json
api-doc/endpoint_snitch_info.json
api-doc/error_injection.json
api-doc/failure_detector.json
@@ -46,6 +47,7 @@ target_sources(api
commitlog.cc
compaction_manager.cc
config.cc
cql_server_test.cc
endpoint_snitch.cc
error_injection.cc
authorization_cache.cc

View File

@@ -92,6 +92,14 @@
"type":"boolean",
"paramType":"query"
},
{
"name":"consider_only_existing_data",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"split_output",
"description":"true if the output of the major compaction should be split in several sstables",

View File

@@ -0,0 +1,26 @@
{
"apiVersion":"0.0.1",
"swaggerVersion":"1.2",
"basePath":"{{Protocol}}://{{Host}}",
"resourcePath":"/cql_server_test",
"produces":[
"application/json"
],
"apis":[
{
"path":"/cql_server_test/connections_params",
"operations":[
{
"method":"GET",
"summary":"Get service level params of each CQL connection",
"type":"connections_service_level_params",
"nickname":"connections_params",
"produces":[
"application/json"
],
"parameters":[]
}
]
}
]
}

View File

@@ -741,11 +741,123 @@
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"consider_only_existing_data",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/backup",
"operations":[
{
"method":"POST",
"summary":"Starts copying SSTables from a specified keyspace to a designated bucket in object storage",
"type":"string",
"nickname":"start_backup",
"produces":[
"application/json"
],
"parameters":[
{
"name":"endpoint",
"description":"ID of the configured object storage endpoint to copy sstables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"bucket",
"description":"Name of the bucket to backup sstables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Name of a keyspace to copy sstables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"snapshot",
"description":"Name of a snapshot to copy sstables from",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/restore",
"operations":[
{
"method":"POST",
"summary":"Starts copying SSTables from a designated bucket in object storage to a specified keyspace",
"type":"string",
"nickname":"start_restore",
"produces":[
"application/json"
],
"parameters":[
{
"name":"endpoint",
"description":"ID of the configured object storage endpoint to copy SSTables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"bucket",
"description":"Name of the bucket to read SSTables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"snapshot",
"description":"Name of a snapshot to copy SSTables from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Name of a keyspace to copy SSTables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Name of a table to copy SSTables to",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/keyspace_compaction/{keyspace}",
"operations":[
@@ -781,6 +893,14 @@
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"consider_only_existing_data",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -1891,6 +2011,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"force",
"description":"Enforce the source_dc option, even if it unsafe to use for rebuild",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}

View File

@@ -194,6 +194,21 @@
"parameters":[]
}
]
},
{
"path":"/system/highest_supported_sstable_version",
"operations":[
{
"method":"GET",
"summary":"Get highest supported sstable version",
"type":"string",
"nickname":"get_highest_supported_sstable_version",
"produces":[
"application/json"
],
"parameters":[]
}
]
}
]
}

View File

@@ -115,7 +115,7 @@
"parameters":[
{
"name":"task_id",
"description":"The uuid of a task to abort",
"description":"The uuid of a task to abort; if the task is not abortable, 403 status code is returned",
"required":true,
"allowMultiple":false,
"type":"string",
@@ -144,6 +144,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"timeout",
"description":"Timeout for waiting; if times out, 408 status code is returned",
"required":false,
"allowMultiple":false,
"type":"long",
"paramType":"query"
}
]
}
@@ -197,11 +205,60 @@
"paramType":"query"
}
]
},
{
"method":"GET",
"summary":"Get current ttl value",
"type":"long",
"nickname":"get_ttl",
"produces":[
"application/json"
],
"parameters":[
]
}
]
},
{
"path":"/task_manager/drain/{module}",
"operations":[
{
"method":"POST",
"summary":"Drain finished local tasks",
"type":"void",
"nickname":"drain_tasks",
"produces":[
"application/json"
],
"parameters":[
{
"name":"module",
"description":"The module to drain",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
}
]
}
]
}
],
"models":{
"task_identity":{
"id": "task_identity",
"description":"Id and node of a task",
"properties":{
"task_id":{
"type":"string",
"description":"The uuid of a task"
},
"node":{
"type":"string",
"description":"Address of a server on which a task is created"
}
}
},
"task_stats" :{
"id": "task_stats",
"description":"A task statistics object",
@@ -224,6 +281,14 @@
"type":"string",
"description":"The description of the task"
},
"kind":{
"type":"string",
"enum":[
"node",
"cluster"
],
"description":"The kind of a task"
},
"scope":{
"type":"string",
"description":"The scope of the task"
@@ -258,6 +323,14 @@
"type":"string",
"description":"The description of the task"
},
"kind":{
"type":"string",
"enum":[
"node",
"cluster"
],
"description":"The kind of a task"
},
"scope":{
"type":"string",
"description":"The scope of the task"
@@ -327,9 +400,9 @@
"children_ids":{
"type":"array",
"items":{
"type":"string"
"type":"task_identity"
},
"description":"Task IDs of children of this task"
"description":"Task identities of children of this task"
}
}
}

View File

@@ -10,6 +10,7 @@
#include <seastar/http/file_handler.hh>
#include <seastar/http/transformers.hh>
#include <seastar/http/api_docs.hh>
#include "cql_server_test.hh"
#include "storage_service.hh"
#include "token_metadata.hh"
#include "commitlog.hh"
@@ -73,6 +74,8 @@ future<> set_server_init(http_context& ctx) {
set_error_injection(ctx, r);
rb->register_function(r, "storage_proxy",
"The storage proxy API");
rb->register_function(r, "storage_service",
"The storage service API");
});
}
@@ -115,7 +118,7 @@ future<> unset_thrift_controller(http_context& ctx) {
}
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {
return register_api(ctx, "storage_service", "The storage service API", [&ss, &group0_client] (http_context& ctx, routes& r) {
return ctx.http_server.set_routes([&ctx, &ss, &group0_client] (routes& r) {
set_storage_service(ctx, r, ss, group0_client);
});
}
@@ -256,6 +259,10 @@ future<> set_server_cache(http_context& ctx) {
"The cache service API", set_cache_service);
}
future<> unset_server_cache(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_cache_service(ctx, r); });
}
future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& proxy) {
return register_api(ctx, "hinted_handoff",
"The hinted handoff API", [&proxy] (http_context& ctx, routes& r) {
@@ -323,6 +330,16 @@ future<> unset_server_task_manager_test(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_task_manager_test(ctx, r); });
}
future<> set_server_cql_server_test(http_context& ctx, cql_transport::controller& ctl) {
return register_api(ctx, "cql_server_test", "The CQL server test API", [&ctl] (http_context& ctx, routes& r) {
set_cql_server_test(ctx, r, ctl);
});
}
future<> unset_server_cql_server_test(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_cql_server_test(ctx, r); });
}
#endif
future<> set_server_tasks_compaction_module(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl) {

View File

@@ -246,7 +246,7 @@ public:
value = T{boost::lexical_cast<Base>(param)};
}
} catch (boost::bad_lexical_cast&) {
throw httpd::bad_param_exception(format("{} ({}): type error - should be {}", name, param, boost::units::detail::demangle(typeid(Base).name())));
throw httpd::bad_param_exception(fmt::format("{} ({}): type error - should be {}", name, param, boost::units::detail::demangle(typeid(Base).name())));
}
}

View File

@@ -119,6 +119,7 @@ future<> unset_server_stream_manager(http_context& ctx);
future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& p);
future<> unset_hinted_handoff(http_context& ctx);
future<> set_server_cache(http_context& ctx);
future<> unset_server_cache(http_context& ctx);
future<> set_server_compaction_manager(http_context& ctx);
future<> set_server_done(http_context& ctx);
future<> set_server_task_manager(http_context& ctx, sharded<tasks::task_manager>& tm, lw_shared_ptr<db::config> cfg);
@@ -131,5 +132,7 @@ future<> set_server_raft(http_context&, sharded<service::raft_group_registry>&);
future<> unset_server_raft(http_context&);
future<> set_load_meter(http_context& ctx, service::load_meter& lm);
future<> unset_load_meter(http_context& ctx);
future<> set_server_cql_server_test(http_context& ctx, cql_transport::controller& ctl);
future<> unset_server_cql_server_test(http_context& ctx);
}

View File

@@ -320,5 +320,50 @@ void set_cache_service(http_context& ctx, routes& r) {
});
}
void unset_cache_service(http_context& ctx, routes& r) {
cs::get_row_cache_save_period_in_seconds.unset(r);
cs::set_row_cache_save_period_in_seconds.unset(r);
cs::get_key_cache_save_period_in_seconds.unset(r);
cs::set_key_cache_save_period_in_seconds.unset(r);
cs::get_counter_cache_save_period_in_seconds.unset(r);
cs::set_counter_cache_save_period_in_seconds.unset(r);
cs::get_row_cache_keys_to_save.unset(r);
cs::set_row_cache_keys_to_save.unset(r);
cs::get_key_cache_keys_to_save.unset(r);
cs::set_key_cache_keys_to_save.unset(r);
cs::get_counter_cache_keys_to_save.unset(r);
cs::set_counter_cache_keys_to_save.unset(r);
cs::invalidate_key_cache.unset(r);
cs::invalidate_counter_cache.unset(r);
cs::set_row_cache_capacity_in_mb.unset(r);
cs::set_key_cache_capacity_in_mb.unset(r);
cs::set_counter_cache_capacity_in_mb.unset(r);
cs::save_caches.unset(r);
cs::get_key_capacity.unset(r);
cs::get_key_hits.unset(r);
cs::get_key_requests.unset(r);
cs::get_key_hit_rate.unset(r);
cs::get_key_hits_moving_avrage.unset(r);
cs::get_key_requests_moving_avrage.unset(r);
cs::get_key_size.unset(r);
cs::get_key_entries.unset(r);
cs::get_row_capacity.unset(r);
cs::get_row_hits.unset(r);
cs::get_row_requests.unset(r);
cs::get_row_hit_rate.unset(r);
cs::get_row_hits_moving_avrage.unset(r);
cs::get_row_requests_moving_avrage.unset(r);
cs::get_row_size.unset(r);
cs::get_row_entries.unset(r);
cs::get_counter_capacity.unset(r);
cs::get_counter_hits.unset(r);
cs::get_counter_requests.unset(r);
cs::get_counter_hit_rate.unset(r);
cs::get_counter_hits_moving_avrage.unset(r);
cs::get_counter_requests_moving_avrage.unset(r);
cs::get_counter_size.unset(r);
cs::get_counter_entries.unset(r);
}
}

View File

@@ -16,5 +16,6 @@ namespace api {
struct http_context;
void set_cache_service(http_context& ctx, seastar::httpd::routes& r);
void unset_cache_service(http_context& ctx, seastar::httpd::routes& r);
}

View File

@@ -15,6 +15,7 @@
#include <seastar/http/exception.hh>
#include "sstables/sstables.hh"
#include "sstables/metadata_collector.hh"
#include "utils/assert.hh"
#include "utils/estimated_histogram.hh"
#include <algorithm>
#include "db/system_keyspace.hh"
@@ -103,7 +104,7 @@ class autocompaction_toggle_guard {
replica::database& _db;
public:
autocompaction_toggle_guard(replica::database& db) : _db(db) {
assert(this_shard_id() == 0);
SCYLLA_ASSERT(this_shard_id() == 0);
if (!_db._enable_autocompaction_toggle) {
throw std::runtime_error("Autocompaction toggle is busy");
}
@@ -112,7 +113,7 @@ public:
autocompaction_toggle_guard(const autocompaction_toggle_guard&) = delete;
autocompaction_toggle_guard(autocompaction_toggle_guard&&) = default;
~autocompaction_toggle_guard() {
assert(this_shard_id() == 0);
SCYLLA_ASSERT(this_shard_id() == 0);
_db._enable_autocompaction_toggle = true;
}
};
@@ -1125,6 +1126,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
auto params = req_params({
std::pair("name", mandatory::yes),
std::pair("flush_memtables", mandatory::no),
std::pair("consider_only_existing_data", mandatory::no),
std::pair("split_output", mandatory::no),
});
params.process(*req);
@@ -1133,7 +1135,8 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
}
auto [ks, cf] = parse_fully_qualified_cf_name(*params.get("name"));
auto flush = params.get_as<bool>("flush_memtables").value_or(true);
apilog.info("column_family/force_major_compaction: name={} flush={}", req->get_path_param("name"), flush);
auto consider_only_existing_data = params.get_as<bool>("consider_only_existing_data").value_or(false);
apilog.info("column_family/force_major_compaction: name={} flush={} consider_only_existing_data={}", req->get_path_param("name"), flush, consider_only_existing_data);
auto keyspace = validate_keyspace(ctx, ks);
std::vector<table_info> table_infos = {table_info{
@@ -1143,10 +1146,10 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace
auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();
std::optional<flush_mode> fmopt;
if (!flush) {
if (!flush && !consider_only_existing_data) {
fmopt = flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), ctx.db, std::move(table_infos), fmopt);
auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), ctx.db, std::move(table_infos), fmopt, consider_only_existing_data);
co_await task->done();
co_return json_void();
});

View File

@@ -10,6 +10,7 @@
#include "api/config.hh"
#include "api/api-doc/config.json.hh"
#include "api/api-doc/storage_proxy.json.hh"
#include "api/api-doc/storage_service.json.hh"
#include "replica/database.hh"
#include "db/config.hh"
#include <sstream>
@@ -19,6 +20,7 @@
namespace api {
using namespace seastar::httpd;
namespace sp = httpd::storage_proxy_json;
namespace ss = httpd::storage_service_json;
template<class T>
json::json_return_type get_json_return_type(const T& val) {
@@ -183,6 +185,14 @@ void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx
return make_ready_future<json::json_return_type>(seastar::json::json_void());
});
ss::get_all_data_file_locations.set(r, [&cfg](const_req req) {
return container_to_vec(cfg.data_file_directories());
});
ss::get_saved_caches_location.set(r, [&cfg](const_req req) {
return cfg.saved_caches_directory();
});
}
void unset_config(http_context& ctx, routes& r) {
@@ -201,6 +211,8 @@ void unset_config(http_context& ctx, routes& r) {
sp::set_range_rpc_timeout.unset(r);
sp::get_truncate_rpc_timeout.unset(r);
sp::set_truncate_rpc_timeout.unset(r);
ss::get_all_data_file_locations.unset(r);
ss::get_saved_caches_location.unset(r);
}
}

70
api/cql_server_test.cc Normal file
View File

@@ -0,0 +1,70 @@
/*
* Copyright (C) 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#ifndef SCYLLA_BUILD_MODE_RELEASE
#include <seastar/core/coroutine.hh>
#include <boost/range/algorithm/transform.hpp>
#include "api/api-doc/cql_server_test.json.hh"
#include "cql_server_test.hh"
#include "transport/controller.hh"
#include "transport/server.hh"
#include "service/qos/qos_common.hh"
namespace api {
namespace cst = httpd::cql_server_test_json;
using namespace json;
using namespace seastar::httpd;
struct connection_sl_params : public json::json_base {
json::json_element<sstring> _role_name;
json::json_element<sstring> _workload_type;
json::json_element<sstring> _timeout;
connection_sl_params(const sstring& role_name, const sstring& workload_type, const sstring& timeout) {
_role_name = role_name;
_workload_type = workload_type;
_timeout = timeout;
register_params();
}
connection_sl_params(const connection_sl_params& params)
: connection_sl_params(params._role_name(), params._workload_type(), params._timeout()) {}
void register_params() {
add(&_role_name, "role_name");
add(&_workload_type, "workload_type");
add(&_timeout, "timeout");
}
};
void set_cql_server_test(http_context& ctx, seastar::httpd::routes& r, cql_transport::controller& ctl) {
cst::connections_params.set(r, [&ctl] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto sl_params = co_await ctl.get_connections_service_level_params();
std::vector<connection_sl_params> result;
boost::transform(std::move(sl_params), std::back_inserter(result), [] (const cql_transport::connection_service_level_params& params) {
auto nanos = std::chrono::duration_cast<std::chrono::nanoseconds>(params.timeout_config.read_timeout).count();
return connection_sl_params(
std::move(params.role_name),
sstring(qos::service_level_options::to_string(params.workload_type)),
to_string(cql_duration(months_counter{0}, days_counter{0}, nanoseconds_counter{nanos})));
});
co_return result;
});
}
void unset_cql_server_test(http_context& ctx, seastar::httpd::routes& r) {
cst::connections_params.unset(r);
}
}
#endif

29
api/cql_server_test.hh Normal file
View File

@@ -0,0 +1,29 @@
/*
* Copyright (C) 2024-present ScyllaDB
*/
/*
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#ifndef SCYLLA_BUILD_MODE_RELEASE
#pragma once
namespace cql_transport {
class controller;
}
namespace seastar::httpd {
class routes;
}
namespace api {
struct http_context;
void set_cql_server_test(http_context& ctx, seastar::httpd::routes& r, cql_transport::controller& ctl);
void unset_cql_server_test(http_context& ctx, seastar::httpd::routes& r);
}
#endif

View File

@@ -102,8 +102,8 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis
if (!req->query_parameters.contains("group_id")) {
// Read barrier on group 0 by default
co_await raft_gr.invoke_on(0, [timeout] (service::raft_group_registry& raft_gr) {
return raft_gr.group0_with_timeouts().read_barrier(nullptr, timeout);
co_await raft_gr.invoke_on(0, [timeout] (service::raft_group_registry& raft_gr) -> future<> {
co_await raft_gr.group0_with_timeouts().read_barrier(nullptr, timeout);
});
co_return json_void{};
}
@@ -111,12 +111,12 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis
raft::group_id gid{utils::UUID{req->get_query_param("group_id")}};
std::atomic<bool> found_srv{false};
co_await raft_gr.invoke_on_all([gid, timeout, &found_srv] (service::raft_group_registry& raft_gr) {
co_await raft_gr.invoke_on_all([gid, timeout, &found_srv] (service::raft_group_registry& raft_gr) -> future<> {
if (!raft_gr.find_server(gid)) {
return make_ready_future<>();
co_return;
}
found_srv = true;
return raft_gr.get_server_with_timeouts(gid).read_barrier(nullptr, timeout);
co_await raft_gr.get_server_with_timeouts(gid).read_barrier(nullptr, timeout);
});
if (!found_srv) {

View File

@@ -54,6 +54,7 @@
#include "locator/abstract_replication_strategy.hh"
#include "sstables_loader.hh"
#include "db/view/view_builder.hh"
#include "utils/user_provided_param.hh"
using namespace seastar::httpd;
using namespace std::chrono_literals;
@@ -489,10 +490,27 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
return make_ready_future<json::json_return_type>(json_void());
});
});
ss::start_restore.set(r, [&sst_loader] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto endpoint = req->get_query_param("endpoint");
auto keyspace = req->get_query_param("keyspace");
auto table = req->get_query_param("table");
auto bucket = req->get_query_param("bucket");
auto snapshot_name = req->get_query_param("snapshot");
if (table.empty()) {
// TODO: If missing, should restore all tables
throw httpd::bad_param_exception("The table name must be specified");
}
auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, endpoint, bucket, snapshot_name);
co_return json::json_return_type(fmt::to_string(task_id));
});
}
void unset_sstables_loader(http_context& ctx, routes& r) {
ss::load_new_ss_tables.unset(r);
ss::start_restore.unset(r);
}
void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb) {
@@ -610,14 +628,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
return ss.local().get_schema_version();
});
ss::get_all_data_file_locations.set(r, [&ctx](const_req req) {
return container_to_vec(ctx.db.local().get_config().data_file_directories());
});
ss::get_saved_caches_location.set(r, [&ctx](const_req req) {
return ctx.db.local().get_config().saved_caches_directory();
});
ss::get_range_to_endpoint_map.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req);
auto table = req->get_query_param("cf");
@@ -706,17 +716,19 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto& db = ctx.db;
auto params = req_params({
std::pair("flush_memtables", mandatory::no),
std::pair("consider_only_existing_data", mandatory::no),
});
params.process(*req);
auto flush = params.get_as<bool>("flush_memtables").value_or(true);
apilog.info("force_compaction: flush={}", flush);
auto consider_only_existing_data = params.get_as<bool>("consider_only_existing_data").value_or(false);
apilog.info("force_compaction: flush={} consider_only_existing_data={}", flush, consider_only_existing_data);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
std::optional<flush_mode> fmopt;
if (!flush) {
if (!flush && !consider_only_existing_data) {
fmopt = flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<global_major_compaction_task_impl>({}, db, fmopt);
auto task = co_await compaction_module.make_and_start_task<global_major_compaction_task_impl>({}, db, fmopt, consider_only_existing_data);
try {
co_await task->done();
} catch (...) {
@@ -733,19 +745,21 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
std::pair("keyspace", mandatory::yes),
std::pair("cf", mandatory::no),
std::pair("flush_memtables", mandatory::no),
std::pair("consider_only_existing_data", mandatory::no),
});
params.process(*req);
auto keyspace = validate_keyspace(ctx, *params.get("keyspace"));
auto table_infos = parse_table_infos(keyspace, ctx, params.get("cf").value_or(""));
auto flush = params.get_as<bool>("flush_memtables").value_or(true);
apilog.debug("force_keyspace_compaction: keyspace={} tables={}, flush={}", keyspace, table_infos, flush);
auto consider_only_existing_data = params.get_as<bool>("consider_only_existing_data").value_or(false);
apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
std::optional<flush_mode> fmopt;
if (!flush) {
if (!flush && !consider_only_existing_data) {
fmopt = flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt);
auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);
try {
co_await task->done();
} catch (...) {
@@ -884,7 +898,8 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto host_id = validate_host_id(req->get_query_param("host_id"));
std::vector<sstring> ignore_nodes_strs = utils::split_comma_separated_list(req->get_query_param("ignore_nodes"));
apilog.info("remove_node: host_id={} ignore_nodes={}", host_id, ignore_nodes_strs);
auto ignore_nodes = std::list<locator::host_id_or_endpoint>();
locator::host_id_or_endpoint_list ignore_nodes;
ignore_nodes.reserve(ignore_nodes_strs.size());
for (const sstring& n : ignore_nodes_strs) {
try {
auto hoep = locator::host_id_or_endpoint(n);
@@ -893,7 +908,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
}
ignore_nodes.push_back(std::move(hoep));
} catch (...) {
throw std::runtime_error(format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}: {}", ignore_nodes_strs, n, std::current_exception()));
throw std::runtime_error(fmt::format("Failed to parse ignore_nodes parameter: ignore_nodes={}, node={}: {}", ignore_nodes_strs, n, std::current_exception()));
}
}
return ss.local().removenode(host_id, std::move(ignore_nodes)).then([] {
@@ -1048,7 +1063,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::get_compaction_throughput_mb_per_sec.set(r, [&ctx](std::unique_ptr<http::request> req) {
int value = ctx.db.local().get_config().compaction_throughput_mb_per_sec();
int value = ctx.db.local().get_compaction_manager().throughput_mbs();
return make_ready_future<json::json_return_type>(value);
});
@@ -1096,7 +1111,16 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::rebuild.set(r, [&ss](std::unique_ptr<http::request> req) {
auto source_dc = req->get_query_param("source_dc");
utils::optional_param source_dc;
if (auto source_dc_str = req->get_query_param("source_dc"); !source_dc_str.empty()) {
source_dc.emplace(std::move(source_dc_str)).set_user_provided();
}
if (auto force_str = req->get_query_param("force"); !force_str.empty() && service::loosen_constraints(validate_bool(force_str))) {
if (!source_dc) {
throw bad_param_exception("The `source_dc` option must be provided for using the `force` option");
}
source_dc.set_force();
}
apilog.info("rebuild: source_dc={}", source_dc);
return ss.local().rebuild(std::move(source_dc)).then([] {
return make_ready_future<json::json_return_type>(json_void());
@@ -1439,12 +1463,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::reload_raft_topology_state.set(r,
[&ss, &group0_client] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
co_await ss.invoke_on(0, [&group0_client] (service::storage_service& ss) -> future<> {
apilog.info("Waiting for group 0 read/apply mutex before reloading Raft topology state...");
auto holder = co_await group0_client.hold_read_apply_mutex();
apilog.info("Reloading Raft topology state");
// Using topology_transition() instead of topology_state_load(), because the former notifies listeners
co_await ss.topology_transition();
apilog.info("Reloaded Raft topology state");
return ss.reload_raft_topology_state(group0_client);
});
co_return json_void();
});
@@ -1559,8 +1578,6 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::get_release_version.unset(r);
ss::get_scylla_release_version.unset(r);
ss::get_schema_version.unset(r);
ss::get_all_data_file_locations.unset(r);
ss::get_saved_caches_location.unset(r);
ss::get_range_to_endpoint_map.unset(r);
ss::get_pending_range_to_endpoint_map.unset(r);
ss::describe_ring.unset(r);
@@ -1776,6 +1793,21 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
co_return json::json_return_type(static_cast<int>(scrub_status::successful));
});
ss::start_backup.set(r, [&snap_ctl] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto endpoint = req->get_query_param("endpoint");
auto keyspace = req->get_query_param("keyspace");
auto bucket = req->get_query_param("bucket");
auto snapshot_name = req->get_query_param("snapshot");
if (snapshot_name.empty()) {
// TODO: If missing, snapshot should be taken by scylla, then removed
throw httpd::bad_param_exception("The snapshot name must be specified");
}
auto& ctl = snap_ctl.local();
auto task_id = co_await ctl.start_backup(std::move(endpoint), std::move(bucket), std::move(keyspace), std::move(snapshot_name));
co_return json::json_return_type(fmt::to_string(task_id));
});
cf::get_true_snapshots_size.set(r, [&snap_ctl] (std::unique_ptr<http::request> req) {
auto [ks, cf] = parse_fully_qualified_cf_name(req->get_path_param("name"));
return snap_ctl.local().true_snapshots_size(std::move(ks), std::move(cf)).then([] (int64_t res) {
@@ -1797,6 +1829,7 @@ void unset_snapshot(http_context& ctx, routes& r) {
ss::del_snapshot.unset(r);
ss::true_snapshots_size.unset(r);
ss::scrub.unset(r);
ss::start_backup.unset(r);
cf::get_true_snapshots_size.unset(r);
cf::get_all_true_snapshots_size.unset(r);
}

View File

@@ -10,6 +10,7 @@
#include "api/api-doc/system.json.hh"
#include "api/api-doc/metrics.json.hh"
#include "replica/database.hh"
#include "sstables/sstables_manager.hh"
#include <rapidjson/document.h>
#include <seastar/core/reactor.hh>
@@ -182,6 +183,11 @@ void set_system(http_context& ctx, routes& r) {
apilog.info("Profile dumped to {}", profile_dest);
return make_ready_future<json::json_return_type>(json::json_return_type(json::json_void()));
}) ;
hs::get_highest_supported_sstable_version.set(r, [&ctx] (const_req req) {
auto& table = ctx.db.local().find_column_family("system", "local");
return seastar::to_sstring(table.get_sstables_manager().get_highest_supported_format());
});
}
}

View File

@@ -14,6 +14,8 @@
#include "api/api.hh"
#include "api/api-doc/task_manager.json.hh"
#include "db/system_keyspace.hh"
#include "tasks/task_handler.hh"
#include "utils/overloaded_functor.hh"
#include <utility>
#include <boost/range/adaptors.hpp>
@@ -24,93 +26,57 @@ namespace tm = httpd::task_manager_json;
using namespace json;
using namespace seastar::httpd;
using task_variant = std::variant<tasks::task_manager::foreign_task_ptr, tasks::task_manager::task::task_essentials>;
inline bool filter_tasks(tasks::task_manager::task_ptr task, std::unordered_map<sstring, sstring>& query_params) {
return (!query_params.contains("keyspace") || query_params["keyspace"] == task->get_status().keyspace) &&
(!query_params.contains("table") || query_params["table"] == task->get_status().table);
}
struct full_task_status {
tasks::task_manager::task::status task_status;
std::string type;
tasks::task_manager::task::progress progress;
tasks::task_id parent_id;
tasks::is_abortable abortable;
std::vector<std::string> children_ids;
};
struct task_stats {
task_stats(tasks::task_manager::task_ptr task)
: task_id(task->id().to_sstring())
, state(task->get_status().state)
, type(task->type())
, scope(task->get_status().scope)
, keyspace(task->get_status().keyspace)
, table(task->get_status().table)
, entity(task->get_status().entity)
, sequence_number(task->get_status().sequence_number)
{ }
sstring task_id;
tasks::task_manager::task_state state;
std::string type;
std::string scope;
std::string keyspace;
std::string table;
std::string entity;
uint64_t sequence_number;
};
tm::task_status make_status(full_task_status status) {
auto start_time = db_clock::to_time_t(status.task_status.start_time);
auto end_time = db_clock::to_time_t(status.task_status.end_time);
tm::task_status make_status(tasks::task_status status) {
auto start_time = db_clock::to_time_t(status.start_time);
auto end_time = db_clock::to_time_t(status.end_time);
::tm st, et;
::gmtime_r(&end_time, &et);
::gmtime_r(&start_time, &st);
std::vector<tm::task_identity> tis{status.children.size()};
boost::transform(status.children, tis.begin(), [] (const auto& child) {
tm::task_identity ident;
ident.task_id = child.task_id.to_sstring();
ident.node = fmt::format("{}", child.node);
return ident;
});
tm::task_status res{};
res.id = status.task_status.id.to_sstring();
res.id = status.task_id.to_sstring();
res.type = status.type;
res.scope = status.task_status.scope;
res.state = status.task_status.state;
res.is_abortable = bool(status.abortable);
res.kind = status.kind;
res.scope = status.scope;
res.state = status.state;
res.is_abortable = bool(status.is_abortable);
res.start_time = st;
res.end_time = et;
res.error = status.task_status.error;
res.parent_id = status.parent_id.to_sstring();
res.sequence_number = status.task_status.sequence_number;
res.shard = status.task_status.shard;
res.keyspace = status.task_status.keyspace;
res.table = status.task_status.table;
res.entity = status.task_status.entity;
res.progress_units = status.task_status.progress_units;
res.error = status.error;
res.parent_id = status.parent_id ? status.parent_id.to_sstring() : "none";
res.sequence_number = status.sequence_number;
res.shard = status.shard;
res.keyspace = status.keyspace;
res.table = status.table;
res.entity = status.entity;
res.progress_units = status.progress_units;
res.progress_total = status.progress.total;
res.progress_completed = status.progress.completed;
res.children_ids = std::move(status.children_ids);
res.children_ids = std::move(tis);
return res;
}
future<full_task_status> retrieve_status(const tasks::task_manager::foreign_task_ptr& task) {
if (task.get() == nullptr) {
co_return coroutine::return_exception(httpd::bad_param_exception("Task not found"));
}
auto progress = co_await task->get_progress();
full_task_status s;
s.task_status = task->get_status();
s.type = task->type();
s.parent_id = task->get_parent_id();
s.abortable = task->is_abortable();
s.progress.completed = progress.completed;
s.progress.total = progress.total;
std::vector<std::string> ct = co_await task->get_children().map_each_task<std::string>([] (const tasks::task_manager::foreign_task_ptr& child) {
return child->id().to_sstring();
}, [] (const tasks::task_manager::task::task_essentials& child) {
return child.task_status.id.to_sstring();
});
s.children_ids = std::move(ct);
co_return s;
};
tm::task_stats make_stats(tasks::task_stats stats) {
tm::task_stats res{};
res.task_id = stats.task_id.to_sstring();
res.type = stats.type;
res.kind = stats.kind;
res.scope = stats.scope;
res.state = stats.state;
res.sequence_number = stats.sequence_number;
res.keyspace = stats.keyspace;
res.table = stats.table;
res.entity = stats.entity;
return res;
}
void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>& tm, db::config& cfg) {
tm::get_modules.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
@@ -119,23 +85,28 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
});
tm::get_tasks.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
using chunked_stats = utils::chunked_vector<task_stats>;
using chunked_stats = utils::chunked_vector<tasks::task_stats>;
auto internal = tasks::is_internal{req_param<bool>(*req, "internal", false)};
std::vector<chunked_stats> res = co_await tm.map([&req, internal] (tasks::task_manager& tm) {
chunked_stats local_res;
tasks::task_manager::module_ptr module;
std::optional<std::string> keyspace = std::nullopt;
std::optional<std::string> table = std::nullopt;
try {
module = tm.find_module(req->get_path_param("module"));
} catch (...) {
throw bad_param_exception(fmt::format("{}", std::current_exception()));
}
const auto& filtered_tasks = module->get_tasks() | boost::adaptors::filtered([&params = req->query_parameters, internal] (const auto& task) {
return (internal || !task.second->is_internal()) && filter_tasks(task.second, params);
});
for (auto& [task_id, task] : filtered_tasks) {
local_res.push_back(task_stats{task});
if (auto it = req->query_parameters.find("keyspace"); it != req->query_parameters.end()) {
keyspace = it->second;
}
return local_res;
if (auto it = req->query_parameters.find("table"); it != req->query_parameters.end()) {
table = it->second;
}
return module->get_stats(internal, [keyspace = std::move(keyspace), table = std::move(table)] (std::string& ks, std::string& t) {
return (!keyspace || keyspace == ks) && (!table || table == t);
});
});
std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {
@@ -148,8 +119,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
for (auto& v: res) {
for (auto& stats: v) {
co_await s.write(std::exchange(delim, ", "));
tm::task_stats ts;
ts = stats;
tm::task_stats ts = make_stats(stats);
co_await formatter::write(s, ts);
}
}
@@ -168,121 +138,70 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
tm::get_task_status.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
tasks::task_manager::foreign_task_ptr task;
tasks::task_status status;
try {
task = co_await tasks::task_manager::invoke_on_task(tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
if (task->is_complete()) {
task->unregister_task();
}
co_return std::move(task);
}));
auto task = tasks::task_handler{tm.local(), id};
status = co_await task.get_status();
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
}
auto s = co_await retrieve_status(task);
co_return make_status(s);
co_return make_status(status);
});
tm::abort_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
try {
co_await tasks::task_manager::invoke_on_task(tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
if (!task->is_abortable()) {
co_await coroutine::return_exception(std::runtime_error("Requested task cannot be aborted"));
}
task->abort();
});
auto task = tasks::task_handler{tm.local(), id};
co_await task.abort();
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
} catch (tasks::task_not_abortable& e) {
throw httpd::base_exception{e.what(), http::reply::status_type::forbidden};
}
co_return json_void();
});
tm::wait_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
tasks::task_manager::foreign_task_ptr task;
tasks::task_status status;
std::optional<std::chrono::seconds> timeout = std::nullopt;
if (auto it = req->query_parameters.find("timeout"); it != req->query_parameters.end()) {
timeout = std::chrono::seconds(boost::lexical_cast<uint32_t>(it->second));
}
try {
task = co_await tasks::task_manager::invoke_on_task(tm, id, std::function([] (tasks::task_manager::task_ptr task) {
return task->done().then_wrapped([task] (auto f) {
// done() is called only because we want the task to be complete before getting its status.
// The future should be ignored here as the result does not matter.
f.ignore_ready_future();
return make_foreign(task);
});
}));
auto task = tasks::task_handler{tm.local(), id};
status = co_await task.wait_for_task(timeout);
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
} catch (timed_out_error& e) {
throw httpd::base_exception{e.what(), http::reply::status_type::request_timeout};
}
auto s = co_await retrieve_status(task);
co_return make_status(s);
co_return make_status(status);
});
tm::get_task_status_recursively.set(r, [&_tm = tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& tm = _tm;
auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};
std::queue<task_variant> q;
utils::chunked_vector<full_task_status> res;
tasks::task_manager::foreign_task_ptr task;
try {
// Get requested task.
task = co_await tasks::task_manager::invoke_on_task(tm, id, std::function([] (tasks::task_manager::task_ptr task) -> future<tasks::task_manager::foreign_task_ptr> {
if (task->is_complete()) {
task->unregister_task();
auto task = tasks::task_handler{tm.local(), id};
auto res = co_await task.get_status_recursively(true);
std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {
auto s = std::move(os);
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& status: res) {
co_await s.write(std::exchange(delim, ", "));
co_await formatter::write(s, make_status(status));
}
co_return task;
}));
co_await s.write("]");
co_await s.close();
};
co_return f;
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
}
// Push children's statuses in BFS order.
q.push(co_await task.copy()); // Task cannot be moved since we need it to be alive during whole loop execution.
while (!q.empty()) {
auto& current = q.front();
co_await std::visit(overloaded_functor {
[&] (const tasks::task_manager::foreign_task_ptr& task) -> future<> {
res.push_back(co_await retrieve_status(task));
co_await task->get_children().for_each_task([&q] (const tasks::task_manager::foreign_task_ptr& child) -> future<> {
q.push(co_await child.copy());
}, [&] (const tasks::task_manager::task::task_essentials& child) {
q.push(child);
return make_ready_future();
});
},
[&] (const tasks::task_manager::task::task_essentials& task) -> future<> {
res.push_back(full_task_status{
.task_status = task.task_status,
.type = task.type,
.progress = task.task_progress,
.parent_id = task.parent_id,
.abortable = task.abortable,
.children_ids = boost::copy_range<std::vector<std::string>>(task.failed_children | boost::adaptors::transformed([] (auto& child) {
return child.task_status.id.to_sstring();
}))
});
for (auto& child: task.failed_children) {
q.push(child);
}
return make_ready_future();
}
}, current);
q.pop();
}
std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {
auto s = std::move(os);
auto res = std::move(r);
co_await s.write("[");
std::string delim = "";
for (auto& status: res) {
co_await s.write(std::exchange(delim, ", "));
co_await formatter::write(s, make_status(status));
}
co_await s.write("]");
co_await s.close();
};
co_return f;
});
tm::get_and_update_ttl.set(r, [&cfg] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
@@ -294,6 +213,37 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>
}
co_return json::json_return_type(ttl);
});
tm::get_ttl.set(r, [&cfg] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
uint32_t ttl = cfg.task_ttl_seconds();
co_return json::json_return_type(ttl);
});
tm::drain_tasks.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
co_await tm.invoke_on_all([&req] (tasks::task_manager& tm) -> future<> {
tasks::task_manager::module_ptr module;
try {
module = tm.find_module(req->get_path_param("module"));
} catch (...) {
throw bad_param_exception(fmt::format("{}", std::current_exception()));
}
const auto& local_tasks = module->get_local_tasks();
std::vector<tasks::task_id> ids;
ids.reserve(local_tasks.size());
std::transform(begin(local_tasks), end(local_tasks), std::back_inserter(ids), [] (const auto& task) {
return task.second->is_complete() ? task.first : tasks::task_id::create_null_id();
});
for (auto&& id : ids) {
if (id) {
module->unregister_task(id);
}
co_await maybe_yield();
}
});
co_return json_void();
});
}
void unset_task_manager(http_context& ctx, routes& r) {
@@ -304,6 +254,8 @@ void unset_task_manager(http_context& ctx, routes& r) {
tm::wait_task.unset(r);
tm::get_task_status_recursively.unset(r);
tm::get_and_update_ttl.unset(r);
tm::get_ttl.unset(r);
tm::drain_tasks.unset(r);
}
}

View File

@@ -13,6 +13,7 @@
#include "task_manager_test.hh"
#include "api/api-doc/task_manager_test.json.hh"
#include "tasks/test_module.hh"
#include "utils/overloaded_functor.hh"
namespace api {
@@ -61,8 +62,8 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man
auto module = tms.local().find_module("test");
id = co_await module->make_task<tasks::test_task_impl>(shard, id, keyspace, table, entity, data);
co_await tms.invoke_on(shard, [id] (tasks::task_manager& tm) {
auto it = tm.get_all_tasks().find(id);
if (it != tm.get_all_tasks().end()) {
auto it = tm.get_local_tasks().find(id);
if (it != tm.get_local_tasks().end()) {
it->second->start();
}
});
@@ -72,9 +73,16 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man
tmt::unregister_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = tasks::task_id{utils::UUID{req->query_parameters["task_id"]}};
try {
co_await tasks::task_manager::invoke_on_task(tm, id, [] (tasks::task_manager::task_ptr task) -> future<> {
tasks::test_task test_task{task};
co_await test_task.unregister_task();
co_await tasks::task_manager::invoke_on_task(tm, id, [] (tasks::task_manager::task_variant task_v) -> future<> {
return std::visit(overloaded_functor{
[] (tasks::task_manager::task_ptr task) -> future<> {
tasks::test_task test_task{task};
co_await test_task.unregister_task();
},
[] (tasks::task_manager::virtual_task_ptr task) {
return make_ready_future();
}
}, task_v);
});
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());
@@ -89,13 +97,20 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man
std::string error = fail ? it->second : "";
try {
co_await tasks::task_manager::invoke_on_task(tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_ptr task) -> future<> {
tasks::test_task test_task{task};
if (fail) {
co_await test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));
} else {
co_await test_task.finish();
}
co_await tasks::task_manager::invoke_on_task(tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_variant task_v) -> future<> {
return std::visit(overloaded_functor{
[fail, error = std::move(error)] (tasks::task_manager::task_ptr task) -> future<> {
tasks::test_task test_task{task};
if (fail) {
co_await test_task.finish_failed(std::make_exception_ptr(std::runtime_error(error)));
} else {
co_await test_task.finish();
}
},
[] (tasks::task_manager::virtual_task_ptr task) {
return make_ready_future();
}
}, task_v);
});
} catch (tasks::task_manager::task_not_found& e) {
throw bad_param_exception(e.what());

View File

@@ -21,6 +21,9 @@ using namespace json;
void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_token_metadata>& tm) {
ss::local_hostid.set(r, [&tm](std::unique_ptr<http::request> req) {
auto id = tm.local().get()->get_my_id();
if (!bool(id)) {
throw not_found_exception("local host ID is not yet set");
}
return make_ready_future<json::json_return_type>(id.to_sstring());
});
@@ -68,7 +71,7 @@ void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_to
ss::get_host_id_map.set(r, [&tm](const_req req) {
std::vector<ss::mapper> res;
return map_to_key_value(tm.local().get()->get_endpoint_to_host_id_map_for_reading(), res);
return map_to_key_value(tm.local().get()->get_endpoint_to_host_id_map(), res);
});
static auto host_or_broadcast = [&tm](const_req req) {

View File

@@ -25,6 +25,25 @@ enum class authentication_option {
options
};
}
template <>
struct fmt::formatter<auth::authentication_option> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const auth::authentication_option a, FormatContext& ctx) const {
using enum auth::authentication_option;
switch (a) {
case password:
return formatter<string_view>::format("PASSWORD", ctx);
case options:
return formatter<string_view>::format("OPTIONS", ctx);
}
std::abort();
}
};
namespace auth {
using authentication_option_set = std::unordered_set<authentication_option>;
using custom_options = std::unordered_map<sstring, sstring>;
@@ -46,18 +65,3 @@ public:
};
}
template <>
struct fmt::formatter<auth::authentication_option> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const auth::authentication_option a, FormatContext& ctx) const {
using enum auth::authentication_option;
switch (a) {
case password:
return formatter<string_view>::format("PASSWORD", ctx);
case options:
return formatter<string_view>::format("OPTIONS", ctx);
}
std::abort();
}
};

View File

@@ -76,7 +76,7 @@ auth::certificate_authenticator::certificate_authenticator(cql3::query_processor
continue;
} catch (std::out_of_range&) {
// just fallthrough
} catch (std::regex_error&) {
} catch (boost::regex_error&) {
std::throw_with_nested(std::invalid_argument(fmt::format("Invalid query expression: {}", map.at(cfg_query_attr))));
}
}
@@ -149,7 +149,7 @@ future<std::optional<auth::authenticated_user>> auth::certificate_authenticator:
co_return username;
}
}
throw exceptions::authentication_exception(format("Subject '{}'/'{}' does not match any query expression", subject, altname));
throw exceptions::authentication_exception(seastar::format("Subject '{}'/'{}' does not match any query expression", subject, altname));
}

View File

@@ -16,6 +16,7 @@
#include "mutation/canonical_mutation.hh"
#include "schema/schema_fwd.hh"
#include "timestamp.hh"
#include "utils/assert.hh"
#include "utils/exponential_backoff_retry.hh"
#include "cql3/query_processor.hh"
#include "cql3/statements/create_table_statement.hh"
@@ -68,10 +69,10 @@ static future<> create_legacy_metadata_table_if_missing_impl(
cql3::query_processor& qp,
std::string_view cql,
::service::migration_manager& mm) {
assert(this_shard_id() == 0); // once_among_shards makes sure a function is executed on shard 0 only
SCYLLA_ASSERT(this_shard_id() == 0); // once_among_shards makes sure a function is executed on shard 0 only
auto db = qp.db();
auto parsed_statement = cql3::query_processor::parse_statement(cql);
auto parsed_statement = cql3::query_processor::parse_statement(cql, cql3::dialect{});
auto& parsed_cf_statement = static_cast<cql3::statements::raw::cf_statement&>(*parsed_statement);
parsed_cf_statement.prepare_keyspace(meta::legacy::AUTH_KS);
@@ -121,7 +122,7 @@ static future<> announce_mutations_with_guard(
::service::raft_group0_client& group0_client,
std::vector<canonical_mutation> muts,
::service::group0_guard group0_guard,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
auto group0_cmd = group0_client.prepare_command(
::service::write_mutations{
@@ -137,7 +138,7 @@ future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
start_operation_func_t start_operation_func,
std::function<::service::mutations_generator(api::timestamp_type t)> gen,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
// account for command's overhead, it's better to use smaller threshold than constantly bounce off the limit
size_t memory_threshold = group0_client.max_command_size() * 0.75;
@@ -188,7 +189,7 @@ future<> announce_mutations(
::service::raft_group0_client& group0_client,
const sstring query_string,
std::vector<data_value_or_unset> values,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
auto group0_guard = co_await group0_client.start_operation(as, timeout);
auto timestamp = group0_guard.write_timestamp();

View File

@@ -80,7 +80,7 @@ future<> create_legacy_metadata_table_if_missing(
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
// Use this function when need to perform read before write on a single guard or if
// you have more than one mutation and potentially exceed single command size limit.
using start_operation_func_t = std::function<future<::service::group0_guard>(abort_source*)>;
using start_operation_func_t = std::function<future<::service::group0_guard>(abort_source&)>;
future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
// since we can operate also in topology coordinator context where we need stronger
@@ -88,7 +88,7 @@ future<> announce_mutations_with_batching(
// function here
start_operation_func_t start_operation_func,
std::function<::service::mutations_generator(api::timestamp_type t)> gen,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout);
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
@@ -97,7 +97,7 @@ future<> announce_mutations(
::service::raft_group0_client& group0_client,
const sstring query_string,
std::vector<data_value_or_unset> values,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout);
// Appends mutations to a collector, they will be applied later on all nodes via group0 mechanism.

View File

@@ -67,7 +67,7 @@ bool default_authorizer::legacy_metadata_exists() const {
}
future<bool> default_authorizer::legacy_any_granted() const {
static const sstring query = format("SELECT * FROM {}.{} LIMIT 1", meta::legacy::AUTH_KS, PERMISSIONS_CF);
static const sstring query = seastar::format("SELECT * FROM {}.{} LIMIT 1", meta::legacy::AUTH_KS, PERMISSIONS_CF);
return _qp.execute_internal(
query,
@@ -80,7 +80,7 @@ future<bool> default_authorizer::legacy_any_granted() const {
future<> default_authorizer::migrate_legacy_metadata() {
alogger.info("Starting migration of legacy permissions metadata.");
static const sstring query = format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
static const sstring query = seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
return _qp.execute_internal(
query,
@@ -163,7 +163,7 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
co_return permissions::NONE;
}
const sstring query = format("SELECT {} FROM {}.{} WHERE {} = ? AND {} = ?",
const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ? AND {} = ?",
PERMISSIONS_NAME,
get_auth_ks_name(_qp),
PERMISSIONS_CF,
@@ -188,7 +188,7 @@ default_authorizer::modify(
const resource& resource,
std::string_view op,
::service::group0_batch& mc) {
const sstring query = format("UPDATE {}.{} SET {} = {} {} ? WHERE {} = ? AND {} = ?",
const sstring query = seastar::format("UPDATE {}.{} SET {} = {} {} ? WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
PERMISSIONS_CF,
PERMISSIONS_NAME,
@@ -218,7 +218,7 @@ future<> default_authorizer::revoke(std::string_view role_name, permission_set s
}
future<std::vector<permission_details>> default_authorizer::list_all() const {
const sstring query = format("SELECT {}, {}, {} FROM {}.{}",
const sstring query = seastar::format("SELECT {}, {}, {} FROM {}.{}",
ROLE_NAME,
RESOURCE_NAME,
PERMISSIONS_NAME,
@@ -246,7 +246,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
future<> default_authorizer::revoke_all(std::string_view role_name, ::service::group0_batch& mc) {
try {
const sstring query = format("DELETE FROM {}.{} WHERE {} = ?",
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ?",
get_auth_ks_name(_qp),
PERMISSIONS_CF,
ROLE_NAME);
@@ -266,7 +266,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name, ::service::g
}
future<> default_authorizer::revoke_all_legacy(const resource& resource) {
static const sstring query = format("SELECT {} FROM {}.{} WHERE {} = ? ALLOW FILTERING",
static const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ? ALLOW FILTERING",
ROLE_NAME,
get_auth_ks_name(_qp),
PERMISSIONS_CF,
@@ -283,7 +283,7 @@ future<> default_authorizer::revoke_all_legacy(const resource& resource) {
res->begin(),
res->end(),
[this, res, resource](const cql3::untyped_result_set::row& r) {
static const sstring query = format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
static const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
PERMISSIONS_CF,
ROLE_NAME,
@@ -323,7 +323,7 @@ future<> default_authorizer::revoke_all(const resource& resource, ::service::gro
auto name = resource.name();
auto gen = [this, name] (api::timestamp_type t) -> ::service::mutations_generator {
const sstring query = format("SELECT {} FROM {}.{} WHERE {} = ? ALLOW FILTERING",
const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ? ALLOW FILTERING",
ROLE_NAME,
get_auth_ks_name(_qp),
PERMISSIONS_CF,
@@ -334,7 +334,7 @@ future<> default_authorizer::revoke_all(const resource& resource, ::service::gro
{name},
cql3::query_processor::cache_internal::no);
for (const auto& r : *res) {
const sstring query = format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
PERMISSIONS_CF,
ROLE_NAME,
@@ -346,7 +346,7 @@ future<> default_authorizer::revoke_all(const resource& resource, ::service::gro
{r.get_as<sstring>(ROLE_NAME), name});
if (muts.size() != 1) {
on_internal_error(alogger,
format("expecting single delete mutation, got {}", muts.size()));
seastar::format("expecting single delete mutation, got {}", muts.size()));
}
co_yield std::move(muts[0]);
}
@@ -357,7 +357,7 @@ future<> default_authorizer::revoke_all(const resource& resource, ::service::gro
void default_authorizer::revoke_all_keyspace_resources(const resource& ks_resource, ::service::group0_batch& mc) {
auto ks_name = ks_resource.name();
auto gen = [this, ks_name] (api::timestamp_type t) -> ::service::mutations_generator {
const sstring query = format("SELECT {}, {} FROM {}.{}",
const sstring query = seastar::format("SELECT {}, {} FROM {}.{}",
ROLE_NAME,
RESOURCE_NAME,
get_auth_ks_name(_qp),
@@ -374,7 +374,7 @@ void default_authorizer::revoke_all_keyspace_resources(const resource& ks_resour
// r doesn't represent resource related to ks_resource
continue;
}
const sstring query = format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
PERMISSIONS_CF,
ROLE_NAME,

View File

@@ -43,10 +43,14 @@ future<> maintenance_socket_role_manager::stop() {
return make_ready_future<>();
}
future<> maintenance_socket_role_manager::ensure_superuser_is_created() {
return make_ready_future<>();
}
template<typename T = void>
future<T> operation_not_supported_exception(std::string_view operation) {
return make_exception_future<T>(
std::runtime_error(format("role manager: {} operation not supported through maintenance socket", operation)));
std::runtime_error(fmt::format("role manager: {} operation not supported through maintenance socket", operation)));
}
future<> maintenance_socket_role_manager::create(std::string_view role_name, const role_config&, ::service::group0_batch&) {
@@ -73,6 +77,10 @@ future<role_set> maintenance_socket_role_manager::query_granted(std::string_view
return operation_not_supported_exception<role_set>("QUERY GRANTED");
}
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted() {
return operation_not_supported_exception<role_to_directly_granted_map>("QUERY ALL DIRECTLY GRANTED");
}
future<role_set> maintenance_socket_role_manager::query_all() {
return operation_not_supported_exception<role_set>("QUERY ALL");
}

View File

@@ -39,6 +39,8 @@ public:
virtual future<> stop() override;
virtual future<> ensure_superuser_is_created() override;
virtual future<> create(std::string_view role_name, const role_config&, ::service::group0_batch&) override;
virtual future<> drop(std::string_view role_name, ::service::group0_batch& mc) override;
@@ -51,6 +53,8 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_set> query_all() override;
virtual future<bool> exists(std::string_view role_name) override;

View File

@@ -75,7 +75,7 @@ static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
}
sstring password_authenticator::update_row_query() const {
return format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
return seastar::format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
get_auth_ks_name(_qp),
meta::roles_table::name,
SALTED_HASH,
@@ -90,7 +90,7 @@ bool password_authenticator::legacy_metadata_exists() const {
future<> password_authenticator::migrate_legacy_metadata() const {
plogger.info("Starting migration of legacy authentication metadata.");
static const sstring query = format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
static const sstring query = seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
return _qp.execute_internal(
query,
@@ -136,7 +136,7 @@ future<> password_authenticator::create_default_if_missing() {
plogger.info("Created default superuser authentication record.");
} else {
co_await announce_mutations(_qp, _group0_client, query,
{salted_pwd, _superuser}, &_as, ::service::raft_timeout{});
{salted_pwd, _superuser}, _as, ::service::raft_timeout{});
plogger.info("Created default superuser authentication record.");
}
}
@@ -223,7 +223,7 @@ future<authenticated_user> password_authenticator::authenticate(
// obsolete prepared statements pretty quickly.
// Rely on query processing caching statements instead, and lets assume
// that a map lookup string->statement is not gonna kill us much.
const sstring query = format("SELECT {} FROM {}.{} WHERE {} = ?",
const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ?",
SALTED_HASH,
get_auth_ks_name(_qp),
meta::roles_table::name,
@@ -280,7 +280,7 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
co_return;
}
const sstring query = format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
const sstring query = seastar::format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
get_auth_ks_name(_qp),
meta::roles_table::name,
SALTED_HASH,
@@ -299,7 +299,7 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
}
future<> password_authenticator::drop(std::string_view name, ::service::group0_batch& mc) {
const sstring query = format("DELETE {} FROM {}.{} WHERE {} = ?",
const sstring query = seastar::format("DELETE {} FROM {}.{} WHERE {} = ?",
SALTED_HASH,
get_auth_ks_name(_qp),
meta::roles_table::name,

View File

@@ -193,7 +193,7 @@ service_level_resource_view::service_level_resource_view(const resource &r) {
}
sstring encode_signature(std::string_view name, std::vector<data_type> args) {
return format("{}[{}]", name,
return seastar::format("{}[{}]", name,
fmt::join(args | boost::adaptors::transformed([] (const data_type t) {
return t->name();
}), "^"));
@@ -222,7 +222,7 @@ std::pair<sstring, std::vector<data_type>> decode_signature(std::string_view enc
// to the short form (int)
static sstring decoded_signature_string(std::string_view encoded_signature) {
auto [function_name, arg_types] = decode_signature(encoded_signature);
return format("{}({})", cql3::util::maybe_quote(sstring(function_name)),
return seastar::format("{}({})", cql3::util::maybe_quote(sstring(function_name)),
boost::algorithm::join(arg_types | boost::adaptors::transformed([] (data_type t) {
return t->cql3_type_name();
}), ", "));

View File

@@ -18,7 +18,6 @@
#include <unordered_set>
#include <fmt/core.h>
#include <seastar/core/print.hh>
#include <seastar/core/sstring.hh>
#include "auth/permission.hh"
@@ -33,7 +32,7 @@ namespace auth {
class invalid_resource_name : public std::invalid_argument {
public:
explicit invalid_resource_name(std::string_view name)
: std::invalid_argument(format("The resource name '{}' is invalid.", name)) {
: std::invalid_argument(fmt::format("The resource name '{}' is invalid.", name)) {
}
};
@@ -149,7 +148,7 @@ class resource_kind_mismatch : public std::invalid_argument {
public:
explicit resource_kind_mismatch(resource_kind expected, resource_kind actual)
: std::invalid_argument(
format("This resource has kind '{}', but was expected to have kind '{}'.", actual, expected)) {
fmt::format("This resource has kind '{}', but was expected to have kind '{}'.", actual, expected)) {
}
};

View File

@@ -48,14 +48,14 @@ public:
class role_already_exists : public roles_argument_exception {
public:
explicit role_already_exists(std::string_view role_name)
: roles_argument_exception(format("Role {} already exists.", role_name)) {
: roles_argument_exception(seastar::format("Role {} already exists.", role_name)) {
}
};
class nonexistant_role : public roles_argument_exception {
public:
explicit nonexistant_role(std::string_view role_name)
: roles_argument_exception(format("Role {} doesn't exist.", role_name)) {
: roles_argument_exception(seastar::format("Role {} doesn't exist.", role_name)) {
}
};
@@ -63,7 +63,7 @@ class role_already_included : public roles_argument_exception {
public:
role_already_included(std::string_view grantee_name, std::string_view role_name)
: roles_argument_exception(
format("{} already includes role {}.", grantee_name, role_name)) {
seastar::format("{} already includes role {}.", grantee_name, role_name)) {
}
};
@@ -71,11 +71,12 @@ class revoke_ungranted_role : public roles_argument_exception {
public:
revoke_ungranted_role(std::string_view revokee_name, std::string_view role_name)
: roles_argument_exception(
format("{} was not granted role {}, so it cannot be revoked.", revokee_name, role_name)) {
seastar::format("{} was not granted role {}, so it cannot be revoked.", revokee_name, role_name)) {
}
};
using role_set = std::unordered_set<sstring>;
using role_to_directly_granted_map = std::multimap<sstring, sstring>;
enum class recursive_role_query { yes, no };
@@ -105,6 +106,13 @@ public:
virtual future<> stop() = 0;
///
/// Ensure that superuser role exists.
///
/// \returns a future once it is ensured that the superuser role exists.
///
virtual future<> ensure_superuser_is_created() = 0;
///
/// \returns an exceptional future with \ref role_already_exists for a role that has previously been created.
///
@@ -144,6 +152,22 @@ public:
///
virtual future<role_set> query_granted(std::string_view grantee, recursive_role_query) = 0;
/// \returns map of directly granted roles for all roles
///
/// Example:
/// GRANT role2 TO role1
/// GRANT role3 TO role1
/// GRANT role3 TO role2
///
/// Will return map:
/// {
/// (role1, role2),
/// (role1, role3),
/// (role2, role3)
/// }
///
virtual future<role_to_directly_granted_map> query_all_directly_granted() = 0;
virtual future<role_set> query_all() = 0;
virtual future<bool> exists(std::string_view role_name) = 0;

View File

@@ -47,7 +47,7 @@ future<bool> default_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p,
std::optional<std::string> rolename) {
const sstring query = format("SELECT * FROM {}.{} WHERE {} = ?",
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE {} = ?",
get_auth_ks_name(qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
@@ -69,7 +69,7 @@ future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p,
std::optional<std::string> rolename) {
const sstring query = format("SELECT * FROM {}.{}", get_auth_ks_name(qp), meta::roles_table::name);
const sstring query = seastar::format("SELECT * FROM {}.{}", get_auth_ks_name(qp), meta::roles_table::name);
auto results = co_await qp.execute_internal(query, db::consistency_level::QUORUM
, internal_distributed_query_state(), cql3::query_processor::cache_internal::no

View File

@@ -36,6 +36,7 @@
#include "service/migration_manager.hh"
#include "service/raft/raft_group0_client.hh"
#include "timestamp.hh"
#include "utils/assert.hh"
#include "utils/class_registrator.hh"
#include "locator/abstract_replication_strategy.hh"
#include "data_dictionary/keyspace_metadata.hh"
@@ -77,7 +78,7 @@ private:
void on_update_function(const sstring& ks_name, const sstring& function_name) override {}
void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_update_tablet_metadata() override {}
void on_update_tablet_metadata(const locator::tablet_metadata_change_hint&) override {}
void on_drop_keyspace(const sstring& ks_name) override {
if (!legacy_mode(_qp)) {
@@ -194,7 +195,7 @@ service::service(
}
future<> service::create_legacy_keyspace_if_missing(::service::migration_manager& mm) const {
assert(this_shard_id() == 0); // once_among_shards makes sure a function is executed on shard 0 only
SCYLLA_ASSERT(this_shard_id() == 0); // once_among_shards makes sure a function is executed on shard 0 only
auto db = _qp.db();
while (!db.has_keyspace(meta::legacy::AUTH_KS)) {
@@ -212,7 +213,7 @@ future<> service::create_legacy_keyspace_if_missing(::service::migration_manager
try {
co_return co_await mm.announce(::service::prepare_new_keyspace_announcement(db.real_database(), ksm, ts),
std::move(group0_guard), format("auth_service: create {} keyspace", meta::legacy::AUTH_KS));
std::move(group0_guard), seastar::format("auth_service: create {} keyspace", meta::legacy::AUTH_KS));
} catch (::service::group0_concurrent_modification&) {
log.info("Concurrent operation is detected while creating {} keyspace, retrying.", meta::legacy::AUTH_KS);
}
@@ -256,6 +257,10 @@ future<> service::stop() {
});
}
future<> service::ensure_superuser_is_created() {
return _role_manager->ensure_superuser_is_created();
}
void service::update_cache_config() {
auto db = _qp.db();
@@ -632,7 +637,7 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
::service::query_state qs(cs, empty_service_permit());
auto rows = co_await qp.execute_internal(
format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, cf_name),
seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, cf_name),
db::consistency_level::ALL,
qs,
{},
@@ -681,7 +686,7 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
co_await announce_mutations_with_batching(g0,
start_operation_func,
std::move(gen),
&as,
as,
std::nullopt);
}

View File

@@ -131,6 +131,8 @@ public:
future<> stop();
future<> ensure_superuser_is_created();
void update_cache_config();
void reset_authorization_cache();

View File

@@ -21,12 +21,15 @@
#include <seastar/core/thread.hh>
#include "auth/common.hh"
#include "auth/role_manager.hh"
#include "auth/roles-metadata.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/consistency_level_type.hh"
#include "exceptions/exceptions.hh"
#include "log.hh"
#include "seastar/core/loop.hh"
#include "seastar/coroutine/maybe_yield.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/class_registrator.hh"
#include "service/migration_manager.hh"
@@ -46,7 +49,7 @@ namespace role_attributes_table {
constexpr std::string_view name{"role_attributes", 15};
static std::string_view creation_query() noexcept {
static const sstring instance = format(
static const sstring instance = seastar::format(
"CREATE TABLE {}.{} ("
" role text,"
" name text,"
@@ -86,7 +89,7 @@ static db::consistency_level consistency_for_role(std::string_view role_name) no
}
static future<std::optional<record>> find_record(cql3::query_processor& qp, std::string_view role_name) {
const sstring query = format("SELECT * FROM {}.{} WHERE {} = ?",
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE {} = ?",
get_auth_ks_name(qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
@@ -180,7 +183,7 @@ future<> standard_role_manager::create_default_role_if_missing() {
if (exists) {
co_return;
}
const sstring query = format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, true, true)",
const sstring query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, true, true)",
get_auth_ks_name(_qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
@@ -192,7 +195,7 @@ future<> standard_role_manager::create_default_role_if_missing() {
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {_superuser}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {_superuser}, _as, ::service::raft_timeout{});
}
log.info("Created default superuser role '{}'.", _superuser);
} catch(const exceptions::unavailable_exception& e) {
@@ -209,7 +212,7 @@ bool standard_role_manager::legacy_metadata_exists() {
future<> standard_role_manager::migrate_legacy_metadata() {
log.info("Starting migration of legacy user metadata.");
static const sstring query = format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
static const sstring query = seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
return _qp.execute_internal(
query,
@@ -238,35 +241,39 @@ future<> standard_role_manager::migrate_legacy_metadata() {
}
future<> standard_role_manager::start() {
return once_among_shards([this] {
return futurize_invoke([this] () {
if (legacy_mode(_qp)) {
return create_legacy_metadata_tables_if_missing();
}
return make_ready_future<>();
}).then([this] {
_stopped = auth::do_after_system_ready(_as, [this] {
return seastar::async([this] {
if (legacy_mode(_qp)) {
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get();
return once_among_shards([this] () -> future<> {
if (legacy_mode(_qp)) {
co_await create_legacy_metadata_tables_if_missing();
}
if (any_nondefault_role_row_satisfies(_qp, &has_can_login).get()) {
if (legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
}
auto handler = [this] () -> future<> {
const bool legacy = legacy_mode(_qp);
if (legacy) {
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
co_await _migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as);
return;
}
if (legacy_metadata_exists()) {
migrate_legacy_metadata().get();
return;
}
if (co_await any_nondefault_role_row_satisfies(_qp, &has_can_login)) {
if (legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
}
create_default_role_if_missing().get();
});
});
});
co_return;
}
if (legacy_metadata_exists()) {
co_await migrate_legacy_metadata();
co_return;
}
}
co_await create_default_role_if_missing();
if (!legacy) {
_superuser_created_promise.set_value();
}
};
_stopped = auth::do_after_system_ready(_as, handler);
co_return;
});
}
@@ -275,8 +282,13 @@ future<> standard_role_manager::stop() {
return _stopped.handle_exception_type([] (const sleep_aborted&) { }).handle_exception_type([](const abort_requested_exception&) {});;
}
future<> standard_role_manager::ensure_superuser_is_created() {
SCYLLA_ASSERT(this_shard_id() == 0);
return _superuser_created_promise.get_shared_future();
}
future<> standard_role_manager::create_or_replace(std::string_view role_name, const role_config& c, ::service::group0_batch& mc) {
const sstring query = format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, ?, ?)",
const sstring query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, ?, ?)",
get_auth_ks_name(_qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
@@ -323,7 +335,7 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
if (!u.is_superuser && !u.can_login) {
return make_ready_future<>();
}
const sstring query = format("UPDATE {}.{} SET {} WHERE {} = ?",
const sstring query = seastar::format("UPDATE {}.{} SET {} WHERE {} = ?",
get_auth_ks_name(_qp),
meta::roles_table::name,
build_column_assignments(u),
@@ -347,7 +359,7 @@ future<> standard_role_manager::drop(std::string_view role_name, ::service::grou
}
// First, revoke this role from all roles that are members of it.
const auto revoke_from_members = [this, role_name, &mc] () -> future<> {
const sstring query = format("SELECT member FROM {}.{} WHERE role = ?",
const sstring query = seastar::format("SELECT member FROM {}.{} WHERE role = ?",
get_auth_ks_name(_qp),
meta::role_members_table::name);
const auto members = co_await _qp.execute_internal(
@@ -379,7 +391,7 @@ future<> standard_role_manager::drop(std::string_view role_name, ::service::grou
};
// Delete all attributes for that role
const auto remove_attributes_of = [this, role_name, &mc] () -> future<> {
const sstring query = format("DELETE FROM {}.{} WHERE role = ?",
const sstring query = seastar::format("DELETE FROM {}.{} WHERE role = ?",
get_auth_ks_name(_qp),
meta::role_attributes_table::name);
if (legacy_mode(_qp)) {
@@ -391,7 +403,7 @@ future<> standard_role_manager::drop(std::string_view role_name, ::service::grou
};
// Finally, delete the role itself.
const auto delete_role = [this, role_name, &mc] () -> future<> {
const sstring query = format("DELETE FROM {}.{} WHERE {} = ?",
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ?",
get_auth_ks_name(_qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
@@ -418,7 +430,7 @@ standard_role_manager::legacy_modify_membership(
std::string_view role_name,
membership_change ch) {
const auto modify_roles = [this, role_name, grantee_name, ch] () -> future<> {
const auto query = format(
const auto query = seastar::format(
"UPDATE {}.{} SET member_of = member_of {} ? WHERE {} = ?",
get_auth_ks_name(_qp),
meta::roles_table::name,
@@ -435,7 +447,7 @@ standard_role_manager::legacy_modify_membership(
const auto modify_role_members = [this, role_name, grantee_name, ch] () -> future<> {
switch (ch) {
case membership_change::add: {
const sstring insert_query = format("INSERT INTO {}.{} (role, member) VALUES (?, ?)",
const sstring insert_query = seastar::format("INSERT INTO {}.{} (role, member) VALUES (?, ?)",
get_auth_ks_name(_qp),
meta::role_members_table::name);
co_return co_await _qp.execute_internal(
@@ -447,7 +459,7 @@ standard_role_manager::legacy_modify_membership(
}
case membership_change::remove: {
const sstring delete_query = format("DELETE FROM {}.{} WHERE role = ? AND member = ?",
const sstring delete_query = seastar::format("DELETE FROM {}.{} WHERE role = ? AND member = ?",
get_auth_ks_name(_qp),
meta::role_members_table::name);
co_return co_await _qp.execute_internal(
@@ -473,7 +485,7 @@ standard_role_manager::modify_membership(
co_return co_await legacy_modify_membership(grantee_name, role_name, ch);
}
const auto modify_roles = format(
const auto modify_roles = seastar::format(
"UPDATE {}.{} SET member_of = member_of {} ? WHERE {} = ?",
get_auth_ks_name(_qp),
meta::roles_table::name,
@@ -485,12 +497,12 @@ standard_role_manager::modify_membership(
sstring modify_role_members;
switch (ch) {
case membership_change::add:
modify_role_members = format("INSERT INTO {}.{} (role, member) VALUES (?, ?)",
modify_role_members = seastar::format("INSERT INTO {}.{} (role, member) VALUES (?, ?)",
get_auth_ks_name(_qp),
meta::role_members_table::name);
break;
case membership_change::remove:
modify_role_members = format("DELETE FROM {}.{} WHERE role = ? AND member = ?",
modify_role_members = seastar::format("DELETE FROM {}.{} WHERE role = ? AND member = ?",
get_auth_ks_name(_qp),
meta::role_members_table::name);
break;
@@ -583,8 +595,22 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
});
}
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted() {
const sstring query = seastar::format("SELECT * FROM {}.{}",
get_auth_ks_name(_qp),
meta::role_members_table::name);
role_to_directly_granted_map roles_map;
co_await _qp.query_internal(query, [&roles_map] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
roles_map.insert({row.get_as<sstring>("member"), row.get_as<sstring>("role")});
co_return stop_iteration::no;
});
co_return roles_map;
}
future<role_set> standard_role_manager::query_all() {
const sstring query = format("SELECT {} FROM {}.{}",
const sstring query = seastar::format("SELECT {} FROM {}.{}",
meta::roles_table::role_col_name,
get_auth_ks_name(_qp),
meta::roles_table::name);
@@ -628,7 +654,7 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
}
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
const sstring query = format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
get_auth_ks_name(_qp),
meta::role_attributes_table::name);
const auto result_set = co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
@@ -659,7 +685,7 @@ future<> standard_role_manager::set_attribute(std::string_view role_name, std::s
if (!co_await exists(role_name)) {
throw auth::nonexistant_role(role_name);
}
const sstring query = format("INSERT INTO {}.{} (role, name, value) VALUES (?, ?, ?)",
const sstring query = seastar::format("INSERT INTO {}.{} (role, name, value) VALUES (?, ?, ?)",
get_auth_ks_name(_qp),
meta::role_attributes_table::name);
if (legacy_mode(_qp)) {
@@ -674,7 +700,7 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
if (!co_await exists(role_name)) {
throw auth::nonexistant_role(role_name);
}
const sstring query = format("DELETE FROM {}.{} WHERE role = ? AND name = ?",
const sstring query = seastar::format("DELETE FROM {}.{} WHERE role = ? AND name = ?",
get_auth_ks_name(_qp),
meta::role_attributes_table::name);
if (legacy_mode(_qp)) {

View File

@@ -37,6 +37,7 @@ class standard_role_manager final : public role_manager {
future<> _stopped;
abort_source _as;
std::string _superuser;
shared_promise<> _superuser_created_promise;
public:
standard_role_manager(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
@@ -49,6 +50,8 @@ public:
virtual future<> stop() override;
virtual future<> ensure_superuser_is_created() override;
virtual future<> create(std::string_view role_name, const role_config&, ::service::group0_batch&) override;
virtual future<> drop(std::string_view role_name, ::service::group0_batch& mc) override;
@@ -61,6 +64,8 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_set> query_all() override;
virtual future<bool> exists(std::string_view role_name) override;

View File

@@ -4,5 +4,5 @@
# SPDX-License-Identifier: AGPL-3.0-or-later
here=$(dirname "$0")
exec "$here/../tools/cqlsh/bin/cqlsh" "$@"
exec "$here/../tools/cqlsh/bin/cqlsh.py" "$@"

View File

@@ -42,7 +42,7 @@ bytes from_hex(sstring_view s) {
auto half_byte1 = hex_to_int(s[i * 2]);
auto half_byte2 = hex_to_int(s[i * 2 + 1]);
if (half_byte1 == -1 || half_byte2 == -1) {
throw std::invalid_argument(format("Non-hex characters in {}", s));
throw std::invalid_argument(fmt::format("Non-hex characters in {}", s));
}
out[i] = (half_byte1 << 4) | half_byte2;
}

View File

@@ -11,6 +11,7 @@
#include <boost/range/iterator_range.hpp>
#include "bytes.hh"
#include "utils/assert.hh"
#include "utils/managed_bytes.hh"
#include <seastar/core/simple-stream.hh>
#include <seastar/core/loop.hh>
@@ -269,7 +270,7 @@ public:
// Call only when is_linearized()
bytes_view view() const {
assert(is_linearized());
SCYLLA_ASSERT(is_linearized());
if (!_current) {
return bytes_view();
}

View File

@@ -8,6 +8,7 @@
#pragma once
#include "utils/assert.hh"
#include <vector>
#include "row_cache.hh"
#include "mutation/mutation_fragment.hh"
@@ -121,6 +122,9 @@ class cache_mutation_reader final : public mutation_reader::impl {
gc_clock::time_point _read_time;
gc_clock::time_point _gc_before;
api::timestamp_type _max_purgeable_timestamp = api::missing_timestamp;
api::timestamp_type _max_purgeable_timestamp_shadowable = api::missing_timestamp;
future<> do_fill_buffer();
future<> ensure_underlying();
void copy_from_cache_to_buffer();
@@ -199,12 +203,17 @@ class cache_mutation_reader final : public mutation_reader::impl {
gc_clock::time_point get_gc_before(const schema& schema, dht::decorated_key dk, const gc_clock::time_point query_time) {
auto gc_state = _read_context.tombstone_gc_state();
if (gc_state) {
return gc_state->get_gc_before_for_key(schema.shared_from_this(), dk, query_time);
return gc_state->with_commitlog_check_disabled().get_gc_before_for_key(schema.shared_from_this(), dk, query_time);
}
return gc_clock::time_point::min();
}
bool can_gc(tombstone t, is_shadowable is) const {
const auto max_purgeable = is ? _max_purgeable_timestamp_shadowable : _max_purgeable_timestamp;
return t.timestamp < max_purgeable;
}
public:
cache_mutation_reader(schema_ptr s,
dht::decorated_key dk,
@@ -226,8 +235,19 @@ public:
, _read_time(get_read_time())
, _gc_before(get_gc_before(*_schema, dk, _read_time))
{
clogger.trace("csm {}: table={}.{}, reversed={}, snap={}", fmt::ptr(this), _schema->ks_name(), _schema->cf_name(), _read_context.is_reversed(),
fmt::ptr(&*_snp));
_max_purgeable_timestamp = ctx.get_max_purgeable(dk, is_shadowable::no);
_max_purgeable_timestamp_shadowable = ctx.get_max_purgeable(dk, is_shadowable::yes);
clogger.trace("csm {}: table={}.{}, dk={}, gc-before={}, max-purgeable-regular={}, max-purgeable-shadowable={}, reversed={}, snap={}",
fmt::ptr(this),
_schema->ks_name(),
_schema->cf_name(),
dk,
_gc_before,
_max_purgeable_timestamp,
_max_purgeable_timestamp_shadowable,
_read_context.is_reversed(),
fmt::ptr(&*_snp));
push_mutation_fragment(*_schema, _permit, partition_start(std::move(dk), _snp->partition_tombstone()));
}
cache_mutation_reader(schema_ptr s,
@@ -283,7 +303,7 @@ future<> cache_mutation_reader::process_static_row() {
return ensure_underlying().then([this] {
return (*_underlying)().then([this] (mutation_fragment_v2_opt&& sr) {
if (sr) {
assert(sr->is_static_row());
SCYLLA_ASSERT(sr->is_static_row());
maybe_add_to_cache(sr->as_static_row());
push_mutation_fragment(std::move(*sr));
}
@@ -382,7 +402,7 @@ future<> cache_mutation_reader::do_fill_buffer() {
if (_state == state::reading_from_underlying) {
return read_from_underlying();
}
// assert(_state == state::reading_from_cache)
// SCYLLA_ASSERT(_state == state::reading_from_cache)
return _lsa_manager.run_in_read_section([this] {
auto next_valid = _next_row.iterators_valid();
clogger.trace("csm {}: reading_from_cache, range=[{}, {}), next={}, valid={}, rt={}", fmt::ptr(this), _lower_bound,
@@ -785,23 +805,24 @@ void cache_mutation_reader::copy_from_cache_to_buffer() {
t.apply(range_tomb);
auto row_tomb_expired = [&](row_tombstone tomb) {
return (tomb && tomb.max_deletion_time() < _gc_before);
return (tomb && tomb.max_deletion_time() < _gc_before && can_gc(tomb.tomb(), tomb.is_shadowable()));
};
auto is_row_dead = [&](const deletable_row& row) {
auto& m = row.marker();
return (!m.is_missing() && m.is_dead(_read_time) && m.deletion_time() < _gc_before);
return (!m.is_missing() && m.is_dead(_read_time) && m.deletion_time() < _gc_before && can_gc(tombstone(m.timestamp(), m.deletion_time()), is_shadowable::no));
};
if (row_tomb_expired(t) || is_row_dead(row)) {
can_gc_fn always_gc = [&](tombstone) { return true; };
const schema& row_schema = _next_row.latest_row_schema();
_read_context.cache()._tracker.on_row_compacted();
auto mutation_can_gc = can_gc_fn([this] (tombstone t, is_shadowable is) { return can_gc(t, is); });
with_allocator(_snp->region().allocator(), [&] {
deletable_row row_copy(row_schema, row);
row_copy.compact_and_expire(row_schema, t.tomb(), _read_time, always_gc, _gc_before, nullptr);
row_copy.compact_and_expire(row_schema, t.tomb(), _read_time, mutation_can_gc, _gc_before, nullptr);
std::swap(row, row_copy);
});
remove_row = row.empty();
@@ -990,7 +1011,7 @@ void cache_mutation_reader::offer_from_underlying(mutation_fragment_v2&& mf) {
maybe_add_to_cache(mf.as_clustering_row());
add_clustering_row_to_buffer(std::move(mf));
} else {
assert(mf.is_range_tombstone_change());
SCYLLA_ASSERT(mf.is_range_tombstone_change());
auto& chg = mf.as_range_tombstone_change();
if (maybe_add_to_cache(chg)) {
add_to_buffer(std::move(mf).as_range_tombstone_change());

View File

@@ -23,7 +23,7 @@ const sstring cdc_partitioner::name() const {
}
static dht::token to_token(int64_t value) {
return dht::token(dht::token::kind::key, value);
return dht::token(value);
}
static dht::token to_token(bytes_view key) {

View File

@@ -8,6 +8,7 @@
#pragma once
#include "utils/assert.hh"
#include "mutation/mutation.hh"
/*
@@ -246,7 +247,7 @@ void inspect_mutation(const mutation& m, V& v) {
if (r.deleted_at()) {
auto t = r.deleted_at().tomb();
assert(t.timestamp != api::missing_timestamp);
SCYLLA_ASSERT(t.timestamp != api::missing_timestamp);
v.clustered_row_delete(cr.key(), t);
if (v.finished()) {
return;
@@ -255,7 +256,7 @@ void inspect_mutation(const mutation& m, V& v) {
}
for (auto& rt: p.row_tombstones()) {
assert(rt.tombstone().tomb.timestamp != api::missing_timestamp);
SCYLLA_ASSERT(rt.tombstone().tomb.timestamp != api::missing_timestamp);
v.range_delete(rt.tombstone());
if (v.finished()) {
return;

View File

@@ -26,6 +26,7 @@
#include "gms/inet_address.hh"
#include "gms/gossiper.hh"
#include "gms/feature_service.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
#include "utils/UUID_gen.hh"
#include "utils/to_string.hh"
@@ -107,8 +108,8 @@ stream_id::stream_id(dht::token token, size_t vnode_index)
copy_int_to_bytes(dht::token::to_int64(token), 0, _value);
copy_int_to_bytes(low_qword, sizeof(int64_t), _value);
// not a hot code path. make sure we did not mess up the shifts and masks.
assert(version() == version_1);
assert(index() == vnode_index);
SCYLLA_ASSERT(version() == version_1);
SCYLLA_ASSERT(index() == vnode_index);
}
stream_id::stream_id(bytes b)
@@ -126,7 +127,7 @@ bool stream_id::is_set() const {
}
static int64_t bytes_to_int64(bytes_view b, size_t offset) {
assert(b.size() >= offset + sizeof(int64_t));
SCYLLA_ASSERT(b.size() >= offset + sizeof(int64_t));
int64_t res;
std::copy_n(b.begin() + offset, sizeof(int64_t), reinterpret_cast<int8_t *>(&res));
return net::ntoh(res);
@@ -363,6 +364,9 @@ cdc::topology_description make_new_generation_description(
const noncopyable_function<std::pair<size_t, uint8_t>(dht::token)>& get_sharding_info,
const locator::token_metadata_ptr tmptr) {
const auto tokens = get_tokens(bootstrap_tokens, tmptr);
if (tokens.empty()) {
on_internal_error(cdc_log, "Attempted to create a CDC generation from an empty list of tokens");
}
utils::chunked_vector<token_range_description> vnode_descriptions;
vnode_descriptions.reserve(tokens.size());
@@ -411,7 +415,7 @@ future<cdc::generation_id> generation_service::legacy_make_new_generation(const
// Our caller should ensure that there are normal tokens in the token ring.
auto normal_token_owners = tmptr->count_normal_token_owners();
assert(normal_token_owners);
SCYLLA_ASSERT(normal_token_owners);
if (_feature_service.cdc_generations_v2) {
cdc_log.info("Inserting new generation data at UUID {}", uuid);
@@ -811,7 +815,7 @@ future<> generation_service::stop() {
}
generation_service::~generation_service() {
assert(_stopped);
SCYLLA_ASSERT(_stopped);
}
future<> generation_service::after_join(std::optional<cdc::generation_id>&& startup_gen_id) {
@@ -871,7 +875,7 @@ future<> generation_service::check_and_repair_cdc_streams() {
return;
}
if (!_gossiper.is_normal(addr)) {
throw std::runtime_error(format("All nodes must be in NORMAL or LEFT state while performing check_and_repair_cdc_streams"
throw std::runtime_error(fmt::format("All nodes must be in NORMAL or LEFT state while performing check_and_repair_cdc_streams"
" ({} is in state {})", addr, _gossiper.get_gossip_status(state)));
}
@@ -1110,7 +1114,9 @@ future<bool> generation_service::legacy_do_handle_cdc_generation(cdc::generation
auto sys_dist_ks = get_sys_dist_ks();
auto gen = co_await retrieve_generation_data(gen_id, _sys_ks.local(), *sys_dist_ks, { _token_metadata.get()->count_normal_token_owners() });
if (!gen) {
throw std::runtime_error(format(
// This may happen during raft upgrade when a node gossips about a generation that
// was propagated through raft and we didn't apply it yet.
throw generation_handling_nonfatal_exception(fmt::format(
"Could not find CDC generation {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
gen_id, db_clock::now()));

View File

@@ -121,7 +121,7 @@ public:
class no_generation_data_exception : public std::runtime_error {
public:
no_generation_data_exception(cdc::generation_id generation_ts)
: std::runtime_error(format("could not find generation data for timestamp {}", generation_ts))
: std::runtime_error(fmt::format("could not find generation data for timestamp {}", generation_ts))
{}
};

View File

@@ -98,7 +98,7 @@ public:
* that generation timestamp moved in as the `startup_gen_id` parameter.
* This passes the responsibility of managing generations from the node startup code to this service;
* until then, the service remains dormant.
* At the time of writing this comment, the startup code is in `storage_service::join_token_ring`, hence
* The startup code is in `storage_service::join_topology`, hence
* `after_join` should be called at the end of that function.
* Precondition: the node has completed bootstrapping and system_distributed_keyspace is initialized.
* Must be called on shard 0 - that's where the generation management happens.

View File

@@ -32,6 +32,7 @@
#include "cql3/statements/select_statement.hh"
#include "cql3/untyped_result_set.hh"
#include "log.hh"
#include "utils/assert.hh"
#include "utils/rjson.hh"
#include "utils/UUID_gen.hh"
#include "utils/managed_bytes.hh"
@@ -65,8 +66,8 @@ void cdc::stats::parts_touched_stats::register_metrics(seastar::metrics::metric_
namespace sm = seastar::metrics;
auto register_part = [&] (part_type part, sstring part_name) {
metrics.add_group(cdc_group_name, {
sm::make_total_operations(format("operations_on_{}_performed_{}", part_name, suffix), count[(size_t)part],
sm::description(format("number of {} CDC operations that processed a {}", suffix, part_name)),
sm::make_total_operations(seastar::format("operations_on_{}_performed_{}", part_name, suffix), count[(size_t)part],
sm::description(seastar::format("number of {} CDC operations that processed a {}", suffix, part_name)),
{})
});
};
@@ -148,7 +149,7 @@ public:
_ctxt._migration_notifier.register_listener(this);
}
~impl() {
assert(_stopped);
SCYLLA_ASSERT(_stopped);
}
future<> stop() {
@@ -455,7 +456,7 @@ schema_ptr get_base_table(const replica::database& db, sstring_view ks_name,std:
}
seastar::sstring base_name(std::string_view log_name) {
assert(is_log_name(log_name));
SCYLLA_ASSERT(is_log_name(log_name));
return sstring(log_name.data(), log_name.size() - cdc_log_suffix.size());
}
@@ -655,7 +656,7 @@ private:
template<>
void collection_iterator<std::pair<managed_bytes_view, managed_bytes_view>>::parse() {
assert(_rem > 0);
SCYLLA_ASSERT(_rem > 0);
_next = _v;
auto k = read_collection_key(_next);
auto v = read_collection_value_nonnull(_next);
@@ -664,7 +665,7 @@ void collection_iterator<std::pair<managed_bytes_view, managed_bytes_view>>::par
template<>
void collection_iterator<managed_bytes_view>::parse() {
assert(_rem > 0);
SCYLLA_ASSERT(_rem > 0);
_next = _v;
auto k = read_collection_key(_next);
_current = k;
@@ -672,7 +673,7 @@ void collection_iterator<managed_bytes_view>::parse() {
template<>
void collection_iterator<managed_bytes_view_opt>::parse() {
assert(_rem > 0);
SCYLLA_ASSERT(_rem > 0);
_next = _v;
auto k = read_collection_value_nonnull(_next);
_current = k;
@@ -1065,7 +1066,7 @@ struct process_row_visitor {
void update_row_state(const column_definition& cdef, managed_bytes_opt value) {
if (!_row_state) {
// static row always has a valid state, so this must be a clustering row missing
assert(_base_ck);
SCYLLA_ASSERT(_base_ck);
auto [it, _] = _clustering_row_states.try_emplace(*_base_ck);
_row_state = &it->second;
}
@@ -1496,12 +1497,12 @@ public:
}
void generate_image(operation op, const clustering_key* ck, const one_kind_column_set* affected_columns) {
assert(op == operation::pre_image || op == operation::post_image);
SCYLLA_ASSERT(op == operation::pre_image || op == operation::post_image);
// assert that post_image is always full
assert(!(op == operation::post_image && affected_columns));
// SCYLLA_ASSERT that post_image is always full
SCYLLA_ASSERT(!(op == operation::post_image && affected_columns));
assert(_builder);
SCYLLA_ASSERT(_builder);
const auto kind = ck ? column_kind::regular_column : column_kind::static_column;
@@ -1571,7 +1572,7 @@ public:
// TODO: is pre-image data based on query enough. We only have actual column data. Do we need
// more details like tombstones/ttl? Probably not but keep in mind.
void process_change(const mutation& m) override {
assert(_builder);
SCYLLA_ASSERT(_builder);
process_change_visitor v {
._touched_parts = _touched_parts,
._builder = *_builder,
@@ -1584,7 +1585,7 @@ public:
}
void end_record() override {
assert(_builder);
SCYLLA_ASSERT(_builder);
_builder->end_record();
}

View File

@@ -69,7 +69,7 @@ bool cdc::metadata::streams_available() const {
cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok) {
auto now = api::new_timestamp();
if (ts > now + get_generation_leeway().count()) {
throw exceptions::invalid_request_exception(format(
throw exceptions::invalid_request_exception(seastar::format(
"cdc: attempted to get a stream \"from the future\" ({}; current server time: {})."
" With CDC you cannot send writes with timestamps arbitrarily into the future, because we don't"
" know what streams will be used at that time.\n"
@@ -100,7 +100,7 @@ cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok)
// the generation under `it` because that generation was operating at `now - generation_leeway`.
bool is_previous_gen = it != _gens.end() && std::next(it) != _gens.end() && std::next(it)->first <= now;
if (it == _gens.end() || ts < it->first || is_previous_gen) {
throw exceptions::invalid_request_exception(format(
throw exceptions::invalid_request_exception(seastar::format(
"cdc: attempted to get a stream \"from the past\" ({}; current server time: {})."
" With CDC you cannot send writes with timestamps too far into the past, because that would break"
" consistency properties.\n"
@@ -112,7 +112,7 @@ cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok)
it = _gens.begin();
if (it == _gens.end() || ts < it->first) {
throw std::runtime_error(format(
throw std::runtime_error(fmt::format(
"cdc::metadata::get_stream: could not find any CDC stream for timestamp {}."
" Are we in the middle of a cluster upgrade?", format_timestamp(ts)));
}
@@ -129,7 +129,7 @@ cdc::stream_id cdc::metadata::get_stream(api::timestamp_type ts, dht::token tok)
// about the current generation in time. We won't be able to prevent it until we introduce transactions.
if (!it->second) {
throw std::runtime_error(format(
throw std::runtime_error(fmt::format(
"cdc: attempted to get a stream from a generation that we know about, but weren't able to retrieve"
" (generation timestamp: {}, write timestamp: {}). Make sure that the replicas which contain"
" this generation's data are alive and reachable from this node.", format_timestamp(it->first), format_timestamp(ts)));
@@ -186,7 +186,7 @@ bool cdc::metadata::prepare(db_clock::time_point tp) {
}
auto ts = to_ts(tp);
auto emplaced = _gens.emplace(to_ts(tp), std::nullopt).second;
auto [it, emplaced] = _gens.emplace(to_ts(tp), std::nullopt);
if (_last_stream_timestamp != api::missing_timestamp) {
auto last_correct_gen = gen_used_at(_last_stream_timestamp);
@@ -201,5 +201,5 @@ bool cdc::metadata::prepare(db_clock::time_point tp) {
}
}
return emplaced;
return !it->second;
}

View File

@@ -8,13 +8,19 @@
#pragma once
#include <exception>
#include <boost/intrusive/unordered_set.hpp>
#include "utils/assert.hh"
#include "utils/small_vector.hh"
#include "mutation/mutation_partition.hh"
#include "utils/xx_hasher.hh"
#include "db/timeout_clock.hh"
#include "log.hh"
extern logging::logger cell_locker_log;
class cells_range {
using ids_vector_type = utils::small_vector<column_id, 5>;
@@ -229,14 +235,24 @@ private:
static constexpr size_t compute_rehash_at_size(size_t bucket_count) {
return bucket_count * max_load_factor::num / max_load_factor::den;
}
// Try to rehash the set, if needed.
// The function may fail silently on bad_alloc (logging a warning).
// Rehashing would be retried at a later time on failure.
void maybe_rehash() {
if (_cell_count >= _rehash_at_size) {
auto new_bucket_count = std::min(_cells.bucket_count() * 2, _cells.bucket_count() + 1024);
auto buckets = std::make_unique<cells_type::bucket_type[]>(new_bucket_count);
_cells.rehash(cells_type::bucket_traits(buckets.get(), new_bucket_count));
_buckets = std::move(buckets);
try {
auto buckets = std::make_unique<cells_type::bucket_type[]>(new_bucket_count);
_cells.rehash(cells_type::bucket_traits(buckets.get(), new_bucket_count));
_buckets = std::move(buckets);
} catch (const std::bad_alloc&) {
cell_locker_log.warn("Could not rehash cell_locker partition cells set: bucket_count={} new_bucket_count={}: {}", _cells.bucket_count(), new_bucket_count, std::current_exception());
}
// Attempt rehash at the new size in both success and failure paths.
// On failure, we don't want to retry too soon since it may take some time
// for memory to free up.
_rehash_at_size = compute_rehash_at_size(new_bucket_count);
}
}
@@ -320,14 +336,24 @@ private:
static constexpr size_t compute_rehash_at_size(size_t bucket_count) {
return bucket_count * max_load_factor::num / max_load_factor::den;
}
// Try to rehash the set, if needed.
// The function may fail silently on bad_alloc (logging a warning).
// Rehashing would be retried at a later time on failure.
void maybe_rehash() {
if (_partition_count >= _rehash_at_size) {
auto new_bucket_count = std::min(_partitions.bucket_count() * 2, _partitions.bucket_count() + 64 * 1024);
auto buckets = std::make_unique<partitions_type::bucket_type[]>(new_bucket_count);
_partitions.rehash(partitions_type::bucket_traits(buckets.get(), new_bucket_count));
_buckets = std::move(buckets);
try {
auto buckets = std::make_unique<partitions_type::bucket_type[]>(new_bucket_count);
_partitions.rehash(partitions_type::bucket_traits(buckets.get(), new_bucket_count));
_buckets = std::move(buckets);
} catch (const std::bad_alloc&) {
cell_locker_log.warn("Could not rehash cell_locker partitions set: bucket_count={} new_bucket_count={}: {}", _partitions.bucket_count(), new_bucket_count, std::current_exception());
}
// Attempt rehash at the new size in both success and failure paths.
// On failure, we don't want to retry too soon since it may take some time
// for memory to free up.
_rehash_at_size = compute_rehash_at_size(new_bucket_count);
}
}
@@ -342,7 +368,7 @@ public:
{ }
~cell_locker() {
assert(_partitions.empty());
SCYLLA_ASSERT(_partitions.empty());
}
void set_schema(schema_ptr s) {

View File

@@ -8,6 +8,7 @@
#pragma once
#include "utils/assert.hh"
#include "schema/schema_fwd.hh"
#include "mutation/position_in_partition.hh"
#include <boost/icl/interval_set.hpp>
@@ -87,8 +88,8 @@ public:
}
};
static interval::type make_interval(const schema& s, const position_range& r) {
assert(r.start().has_clustering_key());
assert(r.end().has_clustering_key());
SCYLLA_ASSERT(r.start().has_clustering_key());
SCYLLA_ASSERT(r.end().has_clustering_key());
return interval::right_open(
position_in_partition_with_schema(s.shared_from_this(), r.start()),
position_in_partition_with_schema(s.shared_from_this(), r.end()));

View File

@@ -23,10 +23,6 @@ public:
clustering_key_filter_ranges(clustering_row_ranges&& ranges)
: _storage(std::make_move_iterator(ranges.begin()), std::make_move_iterator(ranges.end())), _ref(_storage) {}
struct reversed { };
clustering_key_filter_ranges(reversed, const clustering_row_ranges& ranges)
: _storage(ranges.rbegin(), ranges.rend()), _ref(_storage) { }
clustering_key_filter_ranges(clustering_key_filter_ranges&& other) noexcept
: _storage(std::move(other._storage))
, _ref(&other._ref.get() == &other._storage ? _storage : other._ref.get())
@@ -47,21 +43,9 @@ public:
const clustering_row_ranges& ranges() const { return _ref; }
// Returns all clustering ranges determined by `slice` inside partition determined by `key`.
// If the slice contains the `reversed` option, we assume that it is given in 'half-reversed' format
// (i.e. the ranges within are given in reverse order, but the ranges themselves are not reversed)
// with respect to the table order.
// The ranges will be returned in forward (increasing) order even if the slice is reversed.
// The ranges will be returned in the same order as stored in the slice. For a reversed slice
// a reverse schema shall be provided.
static clustering_key_filter_ranges get_ranges(const schema& schema, const query::partition_slice& slice, const partition_key& key) {
const query::clustering_row_ranges& ranges = slice.row_ranges(schema, key);
if (slice.is_reversed()) {
return clustering_key_filter_ranges(clustering_key_filter_ranges::reversed{}, ranges);
}
return clustering_key_filter_ranges(ranges);
}
// Returns all clustering ranges determined by `slice` inside partition determined by `key`.
// The ranges will be returned in the same order as stored in the slice.
static clustering_key_filter_ranges get_native_ranges(const schema& schema, const query::partition_slice& slice, const partition_key& key) {
const query::clustering_row_ranges& ranges = slice.row_ranges(schema, key);
return clustering_key_filter_ranges(ranges);
}

View File

@@ -10,6 +10,7 @@
#pragma once
#include "utils/assert.hh"
#include "schema/schema.hh"
#include "query-request.hh"
#include "mutation/mutation_fragment.hh"
@@ -249,7 +250,7 @@ public:
auto range_end = position_in_partition_view::for_range_end(rng);
if (!less(rt.position(), range_start) && !less(range_end, rt.end_position())) {
// Fully enclosed by this range.
assert(!first);
SCYLLA_ASSERT(!first);
return std::move(rt);
}
auto this_range_rt = rt;

View File

@@ -11,7 +11,8 @@ set(Seastar_DEFINITIONS_DEV
SCYLLA_BUILD_MODE=${scylla_build_mode_Dev}
DEVEL
SEASTAR_ENABLE_ALLOC_FAILURE_INJECTION
SCYLLA_ENABLE_ERROR_INJECTION)
SCYLLA_ENABLE_ERROR_INJECTION
SCYLLA_ENABLE_PREEMPTION_SOURCE)
foreach(definition ${Seastar_DEFINITIONS_DEV})
add_compile_definitions(
$<$<CONFIG:Dev>:${definition}>)

View File

@@ -16,10 +16,11 @@ set(scylla_build_mode_RelWithDebInfo "release")
add_compile_definitions(
$<$<CONFIG:RelWithDebInfo>:SCYLLA_BUILD_MODE=${scylla_build_mode_RelWithDebInfo}>)
set(clang_inline_threshold 2500)
set(Scylla_CLANG_INLINE_THRESHOLD "2500" CACHE STRING
"LLVM-specific inline threshold compilation parameter")
add_compile_options(
"$<$<AND:$<CONFIG:RelWithDebInfo>,$<CXX_COMPILER_ID:GNU>>:--param;inline-unit-growth=300>"
"$<$<AND:$<CONFIG:RelWithDebInfo>,$<CXX_COMPILER_ID:Clang>>:-mllvm;-inline-threshold=${clang_inline_threshold}>")
"$<$<AND:$<CONFIG:RelWithDebInfo>,$<CXX_COMPILER_ID:Clang>>:-mllvm;-inline-threshold=${Scylla_CLANG_INLINE_THRESHOLD}>")
# clang generates 16-byte loads that break store-to-load forwarding
# gcc also has some trouble: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103554
check_cxx_compiler_flag("-fno-slp-vectorize" _slp_vectorize_supported)

View File

@@ -83,7 +83,7 @@ function(get_padded_dynamic_linker_option output length)
set(${output} "${dynamic_linker_option}=${padded_dynamic_linker}" PARENT_SCOPE)
endfunction()
add_compile_options("-ffile-prefix-map=${CMAKE_SOURCE_DIR}=.")
add_compile_options("-ffile-prefix-map=${CMAKE_BINARY_DIR}=.")
default_target_arch(target_arch)
if(target_arch)

View File

@@ -6,6 +6,7 @@
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "utils/assert.hh"
#include "types/collection.hh"
#include "types/user.hh"
#include "concrete_types.hh"
@@ -166,15 +167,18 @@ collection_mutation_view_description::materialize(const abstract_type& type) con
return m;
}
bool collection_mutation_description::compact_and_expire(column_id id, row_tombstone base_tomb, gc_clock::time_point query_time,
compact_and_expire_result collection_mutation_description::compact_and_expire(column_id id, row_tombstone base_tomb, gc_clock::time_point query_time,
can_gc_fn& can_gc, gc_clock::time_point gc_before, compaction_garbage_collector* collector)
{
bool any_live = false;
compact_and_expire_result res{};
if (tomb) {
res.collection_tombstones++;
}
auto t = tomb;
tombstone purged_tomb;
if (tomb <= base_tomb.regular()) {
tomb = tombstone();
} else if (tomb.deletion_time < gc_before && can_gc(tomb)) {
} else if (tomb.deletion_time < gc_before && can_gc(tomb, is_shadowable::no)) { // The collection tombstone is never shadowable
purged_tomb = tomb;
tomb = tombstone();
}
@@ -184,10 +188,12 @@ bool collection_mutation_description::compact_and_expire(column_id id, row_tombs
for (auto&& name_and_cell : cells) {
atomic_cell& cell = name_and_cell.second;
auto cannot_erase_cell = [&] {
return cell.deletion_time() >= gc_before || !can_gc(tombstone(cell.timestamp(), cell.deletion_time()));
// Only row tombstones can be shadowable, (collection) cell tombstones aren't
return cell.deletion_time() >= gc_before || !can_gc(tombstone(cell.timestamp(), cell.deletion_time()), is_shadowable::no);
};
if (cell.is_covered_by(t, false) || cell.is_covered_by(base_tomb.shadowable().tomb(), false)) {
res.dead_cells++;
continue;
}
if (cell.has_expired(query_time)) {
@@ -198,22 +204,24 @@ bool collection_mutation_description::compact_and_expire(column_id id, row_tombs
losers.emplace_back(std::pair(
std::move(name_and_cell.first), atomic_cell::make_dead(cell.timestamp(), cell.deletion_time())));
}
res.dead_cells++;
} else if (!cell.is_live()) {
if (cannot_erase_cell()) {
survivors.emplace_back(std::move(name_and_cell));
} else if (collector) {
losers.emplace_back(std::move(name_and_cell));
}
res.dead_cells++;
} else {
any_live |= true;
survivors.emplace_back(std::move(name_and_cell));
res.live_cells++;
}
}
if (collector) {
collector->collect(id, collection_mutation_description{purged_tomb, std::move(losers)});
}
cells = std::move(survivors);
return any_live;
return res;
}
template <typename Iterator>
@@ -391,7 +399,7 @@ deserialize_collection_mutation(collection_mutation_input_stream& in, F&& read_k
ret.cells.push_back(read_kv(in));
}
assert(in.empty());
SCYLLA_ASSERT(in.empty());
return ret;
}

View File

@@ -12,6 +12,8 @@
#include "schema/schema_fwd.hh"
#include "gc_clock.hh"
#include "mutation/atomic_cell.hh"
#include "mutation/compact_and_expire_result.hh"
#include "compaction/compaction_garbage_collector.hh"
#include <iosfwd>
#include <forward_list>
@@ -34,7 +36,7 @@ struct collection_mutation_description {
// Expires cells based on query_time. Expires tombstones based on max_purgeable and gc_before.
// Removes cells covered by tomb or this->tomb.
bool compact_and_expire(column_id id, row_tombstone tomb, gc_clock::time_point query_time,
compact_and_expire_result compact_and_expire(column_id id, row_tombstone tomb, gc_clock::time_point query_time,
can_gc_fn&, gc_clock::time_point gc_before, compaction_garbage_collector* collector = nullptr);
// Packs the data to a serialized blob.

View File

@@ -28,7 +28,9 @@
#include <seastar/util/closeable.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/shard_id.hh>
#include <seastar/core/on_internal_error.hh>
#include "compaction/compaction_garbage_collector.hh"
#include "dht/i_partitioner.hh"
#include "sstables/exceptions.hh"
#include "sstables/sstables.hh"
@@ -46,12 +48,14 @@
#include "mutation_writer/partition_based_splitting_writer.hh"
#include "mutation/mutation_source_metadata.hh"
#include "mutation/mutation_fragment_stream_validator.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
#include "utils/pretty_printers.hh"
#include "readers/multi_range.hh"
#include "readers/compacting.hh"
#include "tombstone_gc.hh"
#include "replica/database.hh"
#include "timestamp.hh"
namespace sstables {
@@ -136,14 +140,38 @@ std::string_view to_string(compaction_type_options::scrub::quarantine_mode quara
static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_s, sstable_set::incremental_selector& selector,
const std::unordered_set<shared_sstable>& compacting_set, const dht::decorated_key& dk, uint64_t& bloom_filter_checks,
const api::timestamp_type compacting_max_timestamp) {
const api::timestamp_type compacting_max_timestamp, const bool gc_check_only_compacting_sstables, const is_shadowable is_shadowable) {
if (!table_s.tombstone_gc_enabled()) [[unlikely]] {
return api::min_timestamp;
}
auto timestamp = api::max_timestamp;
auto memtable_min_timestamp = table_s.min_memtable_timestamp();
// Use memtable timestamp if it contains data older than the sstables being compacted,
if (gc_check_only_compacting_sstables) {
// If gc_check_only_compacting_sstables is enabled, do not
// check memtables and other sstables not being compacted.
return timestamp;
}
api::timestamp_type memtable_min_timestamp;
if (is_shadowable) {
// For shadowable tombstones, check the minimum live row_marker timestamp
// as rows with timestamp larger than the tombstone's would shadow the tombstone,
// exposing all live cells in the row with timestamps potentially lower than
// the shadowable tombstone (and those are tracked in the min_memtable_live_timestamp).
// In contrast, a shadowable tombstone applies to rows with row_marker whose timestamp
// is less than or equal to the tombstone's timestamp, the same way as a regular tombstone would.
// See https://github.com/scylladb/scylladb/issues/20424
memtable_min_timestamp = table_s.min_memtable_live_row_marker_timestamp();
} else {
// For regular tombstones, check the minimum live data timestamp.
// Even if purgeable tombstones shadow dead data in the memtable, it's ok to purge them;
// since "resurrecting" the already-dead data will no have effect, as they are already dead.
// See https://github.com/scylladb/scylladb/issues/20423
memtable_min_timestamp = table_s.min_memtable_live_timestamp();
}
clogger.trace("memtable_min_timestamp={} compacting_max_timestamp={} memtable_has_key={} is_shadowable={} min_memtable_live_timestamp={} min_memtable_live_row_marker_timestamp={}",
memtable_min_timestamp, compacting_max_timestamp, table_s.memtable_has_key(dk), is_shadowable, table_s.min_memtable_live_timestamp(), table_s.min_memtable_live_row_marker_timestamp());
// Use memtable timestamp if it contains live data older than the sstables being compacted,
// and if the memtable also contains the key we're calculating max purgeable timestamp for.
// First condition helps to not penalize the common scenario where memtable only contains
// newer data.
@@ -155,9 +183,35 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
if (compacting_set.contains(sst)) {
continue;
}
api::timestamp_type min_timestamp = sst->get_stats_metadata().min_timestamp;
auto ts_stats = sst->get_ext_timestamp_stats();
if (!ts_stats.empty()) {
auto stat = is_shadowable ?
sstables::ext_timestamp_stats_type::min_live_row_marker_timestamp :
sstables::ext_timestamp_stats_type::min_live_timestamp;
auto it = ts_stats.find(stat);
if (it != ts_stats.end()) {
min_timestamp = it->second;
} else {
// Do not throw an exception in production, just use the legacy min_timestamp set above
on_internal_error_noexcept(clogger, format("Missing extended timestamp statstics: stat={} is_shadowable={}", int(stat), bool(is_shadowable)));
}
}
if (clogger.is_enabled(log_level::trace)) {
if (!hk) {
hk = sstables::sstable::make_hashed_key(*table_s.schema(), dk.key());
}
clogger.trace("get_max_purgeable_timestamp={}: min_timestamp={} timestamp={} filter_has_key={} is_shadowable={} stats.min_timestamp={} min_live_timestamp={} min_live_row_marker_timestamp={}: sst={}",
min_timestamp >= timestamp || !sst->filter_has_key(*hk) ? timestamp : min_timestamp,
min_timestamp, timestamp, sst->filter_has_key(*hk), is_shadowable,
sst->get_stats_metadata().min_timestamp,
ts_stats[sstables::ext_timestamp_stats_type::min_live_timestamp],
ts_stats[sstables::ext_timestamp_stats_type::min_live_row_marker_timestamp],
sst->get_filename());
}
// There's no point in looking up the key in the sstable filter if
// it does not contain data older than the minimum timestamp.
if (sst->get_stats_metadata().min_timestamp >= timestamp) {
if (min_timestamp >= timestamp) {
continue;
}
if (!hk) {
@@ -165,14 +219,15 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
}
if (sst->filter_has_key(*hk)) {
bloom_filter_checks++;
timestamp = sst->get_stats_metadata().min_timestamp;
timestamp = min_timestamp;
}
}
return timestamp;
}
static std::vector<shared_sstable> get_uncompacting_sstables(const table_state& table_s, std::vector<shared_sstable> sstables) {
auto all_sstables = boost::copy_range<std::vector<shared_sstable>>(*table_s.main_sstable_set().all());
auto sstable_set = table_s.sstable_set_for_tombstone_gc();
auto all_sstables = boost::copy_range<std::vector<shared_sstable>>(*sstable_set->all());
auto& compacted_undeleted = table_s.compacted_undeleted_sstables();
all_sstables.insert(all_sstables.end(), compacted_undeleted.begin(), compacted_undeleted.end());
boost::sort(all_sstables, [] (const shared_sstable& x, const shared_sstable& y) {
@@ -283,7 +338,7 @@ private:
utils::observer<> make_stop_request_observer(utils::observable<>& sro) {
return sro.observe([this] () mutable {
assert(!_unclosed_partition);
SCYLLA_ASSERT(!_unclosed_partition);
consume_end_of_stream();
});
}
@@ -476,6 +531,8 @@ protected:
// Garbage collected sstables that were added to SSTable set and should be eventually removed from it.
std::vector<shared_sstable> _used_garbage_collected_sstables;
utils::observable<> _stop_request_observable;
// optional tombstone_gc_state that is used when gc has to check only the compacting sstables to collect tombstones.
std::optional<tombstone_gc_state> _tombstone_gc_state_with_commitlog_check_disabled;
private:
// Keeps track of monitors for input sstable.
// If _update_backlog_tracker is set to true, monitors are responsible for adjusting backlog as compaction progresses.
@@ -524,6 +581,7 @@ protected:
, _owned_ranges(std::move(descriptor.owned_ranges))
, _sharder(descriptor.sharder)
, _owned_ranges_checker(_owned_ranges ? std::optional<dht::incremental_owned_ranges_checker>(*_owned_ranges) : std::nullopt)
, _tombstone_gc_state_with_commitlog_check_disabled(descriptor.gc_check_only_compacting_sstables ? std::make_optional(_table_s.get_tombstone_gc_state().with_commitlog_check_disabled()) : std::nullopt)
, _progress_monitor(progress_monitor)
{
std::unordered_set<run_id> ssts_run_ids;
@@ -722,6 +780,10 @@ private:
return _table_s.get_compaction_strategy().make_sstable_set(_schema);
}
const tombstone_gc_state& get_tombstone_gc_state() const {
return _tombstone_gc_state_with_commitlog_check_disabled ? _tombstone_gc_state_with_commitlog_check_disabled.value() : _table_s.get_tombstone_gc_state();
}
future<> setup() {
auto ssts = make_lw_shared<sstables::sstable_set>(make_sstable_set_for_input());
auto fully_expired = _table_s.fully_expired_sstables(_sstables, gc_clock::now());
@@ -755,7 +817,7 @@ private:
// for a better estimate for the number of partitions in the merged
// sstable than just adding up the lengths of individual sstables.
_estimated_partitions += sst->get_estimated_key_count();
sum_of_estimated_droppable_tombstone_ratio += sst->estimate_droppable_tombstone_ratio(gc_clock::now(), _table_s.get_tombstone_gc_state(), _schema);
sum_of_estimated_droppable_tombstone_ratio += sst->estimate_droppable_tombstone_ratio(gc_clock::now(), get_tombstone_gc_state(), _schema);
_compacting_data_file_size += sst->ondisk_data_size();
_compacting_max_timestamp = std::max(_compacting_max_timestamp, sst->get_stats_metadata().max_timestamp);
if (sst->originated_on_this_node().value_or(false) && sst_stats.position.shard_id() == this_shard_id()) {
@@ -787,7 +849,7 @@ private:
reader.consume_in_thread(std::move(cfc));
});
});
const auto& gc_state = _table_s.get_tombstone_gc_state();
const auto& gc_state = get_tombstone_gc_state();
return consumer(make_compacting_reader(setup_sstable_reader(), compaction_time, max_purgeable_func(), gc_state));
}
@@ -808,7 +870,7 @@ private:
using compact_mutations = compact_for_compaction_v2<compacted_fragments_writer, compacted_fragments_writer>;
auto cfc = compact_mutations(*schema(), now,
max_purgeable_func(),
_table_s.get_tombstone_gc_state(),
get_tombstone_gc_state(),
get_compacted_fragments_writer(),
get_gc_compacted_fragments_writer());
@@ -818,7 +880,7 @@ private:
using compact_mutations = compact_for_compaction_v2<compacted_fragments_writer, noop_compacted_fragments_consumer>;
auto cfc = compact_mutations(*schema(), now,
max_purgeable_func(),
_table_s.get_tombstone_gc_state(),
get_tombstone_gc_state(),
get_compacted_fragments_writer(),
noop_compacted_fragments_consumer());
reader.consume_in_thread(std::move(cfc));
@@ -875,14 +937,12 @@ private:
virtual std::string_view report_start_desc() const = 0;
virtual std::string_view report_finish_desc() const = 0;
std::function<api::timestamp_type(const dht::decorated_key&)> max_purgeable_func() {
max_purgeable_fn max_purgeable_func() {
if (!tombstone_expiration_enabled()) {
return [] (const dht::decorated_key& dk) {
return api::min_timestamp;
};
return can_never_purge;
}
return [this] (const dht::decorated_key& dk) {
return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk, _bloom_filter_checks, _compacting_max_timestamp);
return [this] (const dht::decorated_key& dk, is_shadowable is_shadowable) {
return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk, _bloom_filter_checks, _compacting_max_timestamp, _tombstone_gc_state_with_commitlog_check_disabled.has_value(), is_shadowable);
};
}

View File

@@ -171,6 +171,16 @@ struct compaction_descriptor {
// Denotes if this compaction task is comprised solely of completely expired SSTables
sstables::has_only_fully_expired has_only_fully_expired = has_only_fully_expired::no;
// If set to true, gc will check only the compacting sstables to collect tombstones.
// If set to false, gc will check the memtables, commit log and other uncompacting
// sstables to decide if a tombstone can be collected. Note that these checks are
// not perfect. W.r.to memtables and uncompacted SSTables, if their minimum timestamp
// is less than that of the tombstone and they contain the key, the tombstone will
// not be collected. No row-level, cell-level check takes place. W.r.to the commit
// log, there is currently no way to check if the key exists; only the minimum
// timestamp comparison, similar to memtables, is performed.
bool gc_check_only_compacting_sstables = false;
compaction_descriptor() = default;
static constexpr int default_level = 0;

View File

@@ -8,7 +8,24 @@
#pragma once
#include <seastar/util/bool_class.hh>
#include "mutation/tombstone.hh"
#include "schema/schema_fwd.hh"
#include "dht/i_partitioner_fwd.hh"
using is_shadowable = bool_class<struct is_shadowable_tag>;
// Determines whether tombstone may be GC-ed.
using can_gc_fn = std::function<bool(tombstone, is_shadowable)>;
extern can_gc_fn always_gc;
extern can_gc_fn never_gc;
using max_purgeable_fn = std::function<api::timestamp_type(const dht::decorated_key&, is_shadowable)>;
extern max_purgeable_fn can_always_purge;
extern max_purgeable_fn can_never_purge;
class atomic_cell;
class row_marker;

View File

@@ -22,6 +22,7 @@
#include <seastar/coroutine/maybe_yield.hh>
#include "sstables/exceptions.hh"
#include "sstables/sstable_directory.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
#include "utils/UUID_gen.hh"
#include "db/system_keyspace.hh"
@@ -187,7 +188,7 @@ unsigned compaction_manager::current_compaction_fan_in_threshold() const {
return 0;
}
auto largest_fan_in = std::ranges::max(_tasks | boost::adaptors::transformed([] (auto& task) {
return task->compaction_running() ? task->compaction_data().compaction_fan_in : 0;
return task.compaction_running() ? task.compaction_data().compaction_fan_in : 0;
}));
// conservatively limit fan-in threshold to 32, such that tons of small sstables won't accumulate if
// running major on a leveled table, which can even have more than one thousand files.
@@ -387,11 +388,26 @@ future<sstables::compaction_result> compaction_task_executor::compact_sstables_a
co_return res;
}
future<sstables::sstable_set> compaction_task_executor::sstable_set_for_tombstone_gc(table_state& t) {
auto compound_set = t.sstable_set_for_tombstone_gc();
// Compound set will be linearized into a single set, since compaction might add or remove sstables
// to it for incremental compaction to work.
auto new_set = sstables::make_partitioned_sstable_set(t.schema(), false);
co_await compound_set->for_each_sstable_gently([&] (const sstables::shared_sstable& sst) {
auto inserted = new_set.insert(sst);
if (!inserted) {
on_internal_error(cmlog, format("Unable to insert SSTable {} into set used for tombstone GC", sst->get_filename()));
}
});
co_return std::move(new_set);
}
future<sstables::compaction_result> compaction_task_executor::compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement& on_replace, compaction_manager::can_purge_tombstones can_purge,
sstables::offstrategy offstrategy) {
table_state& t = *_compacting_table;
if (can_purge) {
descriptor.enable_garbage_collection(t.main_sstable_set());
descriptor.enable_garbage_collection(co_await sstable_set_for_tombstone_gc(t));
}
descriptor.creator = [&t] (shard_id dummy) {
auto sst = t.make_sstable();
@@ -503,9 +519,10 @@ public:
major_compaction_task_executor(compaction_manager& mgr,
throw_if_stopping do_throw_if_stopping,
table_state* t,
tasks::task_id parent_id)
tasks::task_id parent_id,
bool consider_only_existing_data)
: compaction_task_executor(mgr, do_throw_if_stopping, t, sstables::compaction_type::Compaction, "Major compaction")
, major_compaction_task_impl(mgr._task_manager_module, tasks::task_id::create_random_id(), 0, "compaction group", t->schema()->ks_name(), t->schema()->cf_name(), "", parent_id)
, major_compaction_task_impl(mgr._task_manager_module, tasks::task_id::create_random_id(), 0, "compaction group", t->schema()->ks_name(), t->schema()->cf_name(), "", parent_id, flush_mode::compacted_tables, consider_only_existing_data)
{
_status.progress_units = "bytes";
}
@@ -540,6 +557,7 @@ protected:
table_state* t = _compacting_table;
sstables::compaction_strategy cs = t->get_compaction_strategy();
sstables::compaction_descriptor descriptor = cs.get_major_compaction_job(*t, _cm.get_candidates(*t));
descriptor.gc_check_only_compacting_sstables = _consider_only_existing_data;
auto compacting = compacting_sstable_registration(_cm, _cm.get_compaction_state(t), descriptor.sstables);
auto on_replace = compacting.update_on_sstable_replacement();
setup_new_compaction(descriptor.run_identifier);
@@ -575,22 +593,18 @@ requires std::is_base_of_v<compaction_task_executor, TaskExecutor> &&
requires (compaction_manager& cm, throw_if_stopping do_throw_if_stopping, Args&&... args) {
{TaskExecutor(cm, do_throw_if_stopping, std::forward<Args>(args)...)} -> std::same_as<TaskExecutor>;
}
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_compaction(throw_if_stopping do_throw_if_stopping, std::optional<tasks::task_info> parent_info, Args&&... args) {
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_compaction(throw_if_stopping do_throw_if_stopping, tasks::task_info parent_info, Args&&... args) {
auto task_executor = seastar::make_shared<TaskExecutor>(*this, do_throw_if_stopping, std::forward<Args>(args)...);
_tasks.push_back(task_executor);
auto unregister_task = defer([this, task_executor] {
_tasks.remove(task_executor);
_tasks.push_back(*task_executor);
auto unregister_task = defer([task_executor] {
task_executor->unlink();
task_executor->switch_state(compaction_task_executor::state::none);
});
if (parent_info) {
auto task = co_await get_task_manager_module().make_task(task_executor, parent_info.value());
task->start();
co_await task->done();
co_return task_executor->get_stats();
} else {
co_return co_await perform_task(std::move(task_executor), do_throw_if_stopping);
}
auto task = co_await get_task_manager_module().make_task(task_executor, parent_info);
task->start();
co_await task->done();
co_return task_executor->get_stats();
}
std::optional<gate::holder> compaction_manager::start_compaction(table_state& t) {
@@ -606,13 +620,13 @@ std::optional<gate::holder> compaction_manager::start_compaction(table_state& t)
return it->second.gate.hold();
}
future<> compaction_manager::perform_major_compaction(table_state& t, std::optional<tasks::task_info> info) {
future<> compaction_manager::perform_major_compaction(table_state& t, tasks::task_info info, bool consider_only_existing_data) {
auto gh = start_compaction(t);
if (!gh) {
co_return;
}
co_await perform_compaction<major_compaction_task_executor>(throw_if_stopping::no, info, &t, info.value_or(tasks::task_info{}).id).discard_result();
co_await perform_compaction<major_compaction_task_executor>(throw_if_stopping::no, info, &t, info.id, consider_only_existing_data).discard_result();
}
namespace compaction {
@@ -669,13 +683,13 @@ protected:
}
future<> compaction_manager::run_custom_job(table_state& t, sstables::compaction_type type, const char* desc, noncopyable_function<future<>(sstables::compaction_data&, sstables::compaction_progress_monitor&)> job, std::optional<tasks::task_info> info, throw_if_stopping do_throw_if_stopping) {
future<> compaction_manager::run_custom_job(table_state& t, sstables::compaction_type type, const char* desc, noncopyable_function<future<>(sstables::compaction_data&, sstables::compaction_progress_monitor&)> job, tasks::task_info info, throw_if_stopping do_throw_if_stopping) {
auto gh = start_compaction(t);
if (!gh) {
co_return;
}
co_return co_await perform_compaction<custom_compaction_task_executor>(do_throw_if_stopping, info, &t, info.value_or(tasks::task_info{}).id, type, desc, std::move(job)).discard_result();
co_return co_await perform_compaction<custom_compaction_task_executor>(do_throw_if_stopping, info, &t, info.id, type, desc, std::move(job)).discard_result();
}
future<> compaction_manager::update_static_shares(float static_shares) {
@@ -885,10 +899,10 @@ public:
explicit strategy_control(compaction_manager& cm) noexcept : _cm(cm) {}
bool has_ongoing_compaction(table_state& table_s) const noexcept override {
return std::any_of(_cm._tasks.begin(), _cm._tasks.end(), [&s = table_s.schema()] (const shared_ptr<compaction_task_executor>& task) {
return task->compaction_running()
&& task->compacting_table()->schema()->ks_name() == s->ks_name()
&& task->compacting_table()->schema()->cf_name() == s->cf_name();
return std::any_of(_cm._tasks.begin(), _cm._tasks.end(), [&s = table_s.schema()] (const compaction_task_executor& task) {
return task.compaction_running()
&& task.compacting_table()->schema()->ks_name() == s->ks_name()
&& task.compacting_table()->schema()->cf_name() == s->cf_name();
});
}
@@ -958,7 +972,7 @@ compaction_manager::compaction_manager(tasks::task_manager& tm)
compaction_manager::~compaction_manager() {
// Assert that compaction manager was explicitly stopped, if started.
// Otherwise, fiber(s) will be alive after the object is stopped.
assert(_state == state::none || _state == state::stopped);
SCYLLA_ASSERT(_state == state::none || _state == state::stopped);
}
future<> compaction_manager::update_throughput(uint32_t value_mbs) {
@@ -998,7 +1012,7 @@ void compaction_manager::register_metrics() {
}
void compaction_manager::enable() {
assert(_state == state::none || _state == state::disabled);
SCYLLA_ASSERT(_state == state::none || _state == state::disabled);
_state = state::enabled;
_compaction_submission_timer.arm_periodic(periodic_compaction_submission_interval());
_waiting_reevalution = postponed_compactions_reevaluation();
@@ -1052,7 +1066,7 @@ void compaction_manager::postpone_compaction_for_table(table_state* t) {
_postponed.insert(t);
}
future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason) {
future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason) noexcept {
// To prevent compaction from being postponed while tasks are being stopped,
// let's stop all tasks before the deferring point below.
for (auto& t : tasks) {
@@ -1060,14 +1074,16 @@ future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_e
t->stop_compaction(reason);
}
co_await coroutine::parallel_for_each(tasks, [] (auto& task) -> future<> {
auto unlink_task = deferred_action([task] { task->unlink(); });
try {
co_await task->compaction_done();
} catch (sstables::compaction_stopped_exception&) {
// swallow stop exception if a given procedure decides to propagate it to the caller,
// as it happens with reshard and reshape.
} catch (...) {
// just log any other errors as the callers have nothing to do with them.
cmlog.debug("Stopping {}: task returned error: {}", *task, std::current_exception());
throw;
co_return;
}
cmlog.debug("Stopping {}: done", *task);
});
@@ -1076,9 +1092,12 @@ future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_e
future<> compaction_manager::stop_ongoing_compactions(sstring reason, table_state* t, std::optional<sstables::compaction_type> type_opt) noexcept {
try {
auto ongoing_compactions = get_compactions(t).size();
auto tasks = boost::copy_range<std::vector<shared_ptr<compaction_task_executor>>>(_tasks | boost::adaptors::filtered([t, type_opt] (auto& task) {
return (!t || task->compacting_table() == t) && (!type_opt || task->compaction_type() == *type_opt);
}));
auto tasks = _tasks
| std::views::filter([t, type_opt] (const auto& task) {
return (!t || task.compacting_table() == t) && (!type_opt || task.compaction_type() == *type_opt);
})
| std::views::transform([] (auto& task) { return task.shared_from_this(); })
| std::ranges::to<std::vector<shared_ptr<compaction_task_executor>>>();
logging::log_level level = tasks.empty() ? log_level::debug : log_level::info;
if (cmlog.is_enabled(level)) {
std::string scope = "";
@@ -1092,15 +1111,19 @@ future<> compaction_manager::stop_ongoing_compactions(sstring reason, table_stat
}
return stop_tasks(std::move(tasks), std::move(reason));
} catch (...) {
return current_exception_as_future<>();
cmlog.error("Stopping ongoing compactions failed: {}. Ignored", std::current_exception());
}
return make_ready_future();
}
future<> compaction_manager::drain() {
cmlog.info("Asked to drain");
if (*_early_abort_subscription) {
_state = state::disabled;
_compaction_submission_timer.cancel();
co_await stop_ongoing_compactions("drain");
// Trigger a signal to properly exit from postponed_compactions_reevaluation() fiber
reevaluate_postponed_compactions();
}
cmlog.info("Drained");
}
@@ -1110,17 +1133,17 @@ future<> compaction_manager::stop() {
if (auto cm = std::exchange(_task_manager_module, nullptr)) {
co_await cm->stop();
}
if (_state != state::none) {
co_return co_await std::move(*_stop_future);
if (_stop_future) {
co_await std::exchange(*_stop_future, make_ready_future());
}
}
future<> compaction_manager::really_do_stop() {
future<> compaction_manager::really_do_stop() noexcept {
cmlog.info("Asked to stop");
// Reset the metrics registry
_metrics.clear();
co_await stop_ongoing_compactions("shutdown");
co_await coroutine::parallel_for_each(_compaction_state | boost::adaptors::map_values, [] (compaction_state& cs) -> future<> {
co_await coroutine::parallel_for_each(_compaction_state | std::views::values, [] (compaction_state& cs) -> future<> {
if (!cs.gate.is_closed()) {
co_await cs.gate.close();
}
@@ -1150,7 +1173,14 @@ void compaction_manager::do_stop() noexcept {
}
inline bool compaction_manager::can_proceed(table_state* t) const {
return (_state == state::enabled) && _compaction_state.contains(t) && !_compaction_state.at(t).compaction_disabled();
if (_state != state::enabled) {
return false;
}
auto found = _compaction_state.find(t);
if (found == _compaction_state.end()) {
return false;
}
return !found->second.compaction_disabled();
}
future<> compaction_task_executor::perform() {
@@ -1500,14 +1530,14 @@ protected:
}
future<bool> compaction_manager::perform_offstrategy(table_state& t, std::optional<tasks::task_info> info) {
future<bool> compaction_manager::perform_offstrategy(table_state& t, tasks::task_info info) {
auto gh = start_compaction(t);
if (!gh) {
co_return false;
}
bool performed;
co_await perform_compaction<offstrategy_compaction_task_executor>(throw_if_stopping::no, info, &t, info.value_or(tasks::task_info{}).id, performed);
co_await perform_compaction<offstrategy_compaction_task_executor>(throw_if_stopping::no, info, &t, info.id, performed);
co_return performed;
}
@@ -1553,11 +1583,16 @@ protected:
co_return stats;
}
virtual sstables::compaction_descriptor make_descriptor(const sstables::shared_sstable& sst) const {
static sstables::compaction_descriptor
make_descriptor(const sstables::shared_sstable& sst, const sstables::compaction_type_options& opt, owned_ranges_ptr owned_ranges = {}) {
auto sstable_level = sst->get_sstable_level();
auto run_identifier = sst->run_identifier();
return sstables::compaction_descriptor({ sst },
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, _options, _owned_ranges_ptr);
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, opt, owned_ranges);
}
virtual sstables::compaction_descriptor make_descriptor(const sstables::shared_sstable& sst) const {
return make_descriptor(sst, _options, _owned_ranges_ptr);
}
virtual future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) {
@@ -1573,9 +1608,6 @@ protected:
setup_new_compaction(descriptor.run_identifier);
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_cm._compaction_controller.backlog_of_shares(200), _cm.available_memory()));
_cm.register_backlog_tracker(user_initiated);
std::exception_ptr ex;
try {
sstables::compaction_result res = co_await compact_sstables_and_update_history(std::move(descriptor), _compaction_data, on_replace, _can_purge);
@@ -1610,19 +1642,30 @@ public:
std::move(sstables), std::move(compacting), compaction_manager::can_purge_tombstones::yes)
, _opt(options.as<sstables::compaction_type_options::split>())
{
if (utils::get_local_injector().is_enabled("split_sstable_rewrite")) {
_do_throw_if_stopping = throw_if_stopping::yes;
}
}
static bool sstable_needs_split(const sstables::shared_sstable& sst, const sstables::compaction_type_options::split& opt) {
return opt.classifier(sst->get_first_decorated_key().token()) != opt.classifier(sst->get_last_decorated_key().token());
}
static sstables::compaction_descriptor
make_descriptor(const sstables::shared_sstable& sst, const sstables::compaction_type_options::split& split_opt) {
auto opt = sstables::compaction_type_options::make_split(split_opt.classifier);
return rewrite_sstables_compaction_task_executor::make_descriptor(sst, std::move(opt));
}
private:
bool sstable_needs_split(const sstables::shared_sstable& sst) const {
return _opt.classifier(sst->get_first_decorated_key().token()) != _opt.classifier(sst->get_last_decorated_key().token());
return sstable_needs_split(sst, _opt);
}
protected:
sstables::compaction_descriptor make_descriptor(const sstables::shared_sstable& sst) const override {
auto desc = rewrite_sstables_compaction_task_executor::make_descriptor(sst);
desc.options = sstables::compaction_type_options::make_split(_opt.classifier);
return desc;
return make_descriptor(sst, _opt);
}
future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) override {
future<sstables::compaction_result> do_rewrite_sstable(const sstables::shared_sstable sst) {
if (sstable_needs_split(sst)) {
return rewrite_sstables_compaction_task_executor::rewrite_sstable(std::move(sst));
}
@@ -1635,6 +1678,20 @@ protected:
return sstables::compaction_result{};
});
}
future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) override {
co_await utils::get_local_injector().inject("split_sstable_rewrite", [this] (auto& handler) -> future<> {
cmlog.info("split_sstable_rewrite: waiting");
while (!handler.poll_for_message() && !_compaction_data.is_stop_requested()) {
co_await sleep(std::chrono::milliseconds(5));
}
cmlog.info("split_sstable_rewrite: released");
if (_compaction_data.is_stop_requested()) {
throw make_compaction_stopped_exception();
}
}, false);
co_return co_await do_rewrite_sstable(std::move(sst));
}
};
}
@@ -1642,7 +1699,7 @@ protected:
template<typename TaskType, typename... Args>
requires std::derived_from<TaskType, compaction_task_executor> &&
std::derived_from<TaskType, compaction_task_impl>
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_task_on_all_files(std::optional<tasks::task_info> info, table_state& t, sstables::compaction_type_options options, owned_ranges_ptr owned_ranges_ptr, get_candidates_func get_func, Args... args) {
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_task_on_all_files(tasks::task_info info, table_state& t, sstables::compaction_type_options options, owned_ranges_ptr owned_ranges_ptr, get_candidates_func get_func, Args... args) {
auto gh = start_compaction(t);
if (!gh) {
co_return std::nullopt;
@@ -1670,12 +1727,12 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_tas
if (sstables.empty()) {
co_return std::nullopt;
}
co_return co_await perform_compaction<TaskType>(throw_if_stopping::no, info, &t, info.value_or(tasks::task_info{}).id, std::move(options), std::move(owned_ranges_ptr), std::move(sstables), std::move(compacting), std::forward<Args>(args)...);
co_return co_await perform_compaction<TaskType>(throw_if_stopping::no, info, &t, info.id, std::move(options), std::move(owned_ranges_ptr), std::move(sstables), std::move(compacting), std::forward<Args>(args)...);
}
future<compaction_manager::compaction_stats_opt>
compaction_manager::rewrite_sstables(table_state& t, sstables::compaction_type_options options, owned_ranges_ptr owned_ranges_ptr,
get_candidates_func get_func, std::optional<tasks::task_info> info, can_purge_tombstones can_purge,
get_candidates_func get_func, tasks::task_info info, can_purge_tombstones can_purge,
sstring options_desc) {
return perform_task_on_all_files<rewrite_sstables_compaction_task_executor>(info, t, std::move(options), std::move(owned_ranges_ptr), std::move(get_func), can_purge, std::move(options_desc));
}
@@ -1744,14 +1801,14 @@ static std::vector<sstables::shared_sstable> get_all_sstables(table_state& t) {
return s;
}
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sstable_scrub_validate_mode(table_state& t, std::optional<tasks::task_info> info) {
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sstable_scrub_validate_mode(table_state& t, tasks::task_info info) {
auto gh = start_compaction(t);
if (!gh) {
co_return compaction_stats_opt{};
}
// All sstables must be included, even the ones being compacted, such that everything in table is validated.
auto all_sstables = get_all_sstables(t);
co_return co_await perform_compaction<validate_sstables_compaction_task_executor>(throw_if_stopping::no, info, &t, info.value_or(tasks::task_info{}).id, std::move(all_sstables));
co_return co_await perform_compaction<validate_sstables_compaction_task_executor>(throw_if_stopping::no, info, &t, info.id, std::move(all_sstables));
}
namespace compaction {
@@ -1816,7 +1873,7 @@ protected:
}
private:
future<> run_cleanup_job(sstables::compaction_descriptor descriptor) {
co_await coroutine::switch_to(_cm.compaction_sg());
co_await coroutine::switch_to(_cm.maintenance_sg());
// Releases reference to cleaned files such that respective used disk space can be freed.
using update_registration = compacting_sstable_registration::update_me;
@@ -1836,9 +1893,6 @@ private:
};
release_exhausted on_replace{_compacting, descriptor};
for (;;) {
compaction_backlog_tracker user_initiated(std::make_unique<user_initiated_backlog_tracker>(_cm._compaction_controller.backlog_of_shares(200), _cm.available_memory()));
_cm.register_backlog_tracker(user_initiated);
std::exception_ptr ex;
try {
setup_new_compaction(descriptor.run_identifier);
@@ -1917,7 +1971,7 @@ const std::unordered_set<sstables::shared_sstable>& compaction_manager::sstables
return cs.sstables_requiring_cleanup;
}
future<> compaction_manager::perform_cleanup(owned_ranges_ptr sorted_owned_ranges, table_state& t, std::optional<tasks::task_info> info) {
future<> compaction_manager::perform_cleanup(owned_ranges_ptr sorted_owned_ranges, table_state& t, tasks::task_info info) {
auto gh = start_compaction(t);
if (!gh) {
co_return;
@@ -1962,10 +2016,10 @@ future<> compaction_manager::perform_cleanup(owned_ranges_ptr sorted_owned_range
}
}
future<> compaction_manager::try_perform_cleanup(owned_ranges_ptr sorted_owned_ranges, table_state& t, std::optional<tasks::task_info> info) {
future<> compaction_manager::try_perform_cleanup(owned_ranges_ptr sorted_owned_ranges, table_state& t, tasks::task_info info) {
auto check_for_cleanup = [this, &t] {
return boost::algorithm::any_of(_tasks, [&t] (auto& task) {
return task->compacting_table() == &t && task->compaction_type() == sstables::compaction_type::Cleanup;
return task.compacting_table() == &t && task.compaction_type() == sstables::compaction_type::Cleanup;
});
};
if (check_for_cleanup()) {
@@ -2019,7 +2073,7 @@ future<> compaction_manager::try_perform_cleanup(owned_ranges_ptr sorted_owned_r
}
// Submit a table to be upgraded and wait for its termination.
future<> compaction_manager::perform_sstable_upgrade(owned_ranges_ptr sorted_owned_ranges, table_state& t, bool exclude_current_version, std::optional<tasks::task_info> info) {
future<> compaction_manager::perform_sstable_upgrade(owned_ranges_ptr sorted_owned_ranges, table_state& t, bool exclude_current_version, tasks::task_info info) {
auto get_sstables = [this, &t, exclude_current_version] {
std::vector<sstables::shared_sstable> tables;
@@ -2046,7 +2100,7 @@ future<> compaction_manager::perform_sstable_upgrade(owned_ranges_ptr sorted_own
return rewrite_sstables(t, sstables::compaction_type_options::make_upgrade(), std::move(sorted_owned_ranges), std::move(get_sstables), info).discard_result();
}
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_split_compaction(table_state& t, sstables::compaction_type_options::split opt, std::optional<tasks::task_info> info) {
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_split_compaction(table_state& t, sstables::compaction_type_options::split opt, tasks::task_info info) {
auto get_sstables = [this, &t] {
return make_ready_future<std::vector<sstables::shared_sstable>>(get_candidates(t));
};
@@ -2056,8 +2110,33 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_spl
return perform_task_on_all_files<split_compaction_task_executor>(info, t, std::move(options), std::move(owned_ranges_ptr), std::move(get_sstables));
}
future<std::vector<sstables::shared_sstable>>
compaction_manager::maybe_split_sstable(sstables::shared_sstable sst, table_state& t, sstables::compaction_type_options::split opt) {
if (!split_compaction_task_executor::sstable_needs_split(sst, opt)) {
co_return std::vector<sstables::shared_sstable>{sst};
}
std::vector<sstables::shared_sstable> ret;
// FIXME: indentation.
auto gate = get_compaction_state(&t).gate.hold();
sstables::compaction_progress_monitor monitor;
sstables::compaction_data info = create_compaction_data();
sstables::compaction_descriptor desc = split_compaction_task_executor::make_descriptor(sst, opt);
desc.creator = [&t] (shard_id _) {
return t.make_sstable();
};
desc.replacer = [&] (sstables::compaction_completion_desc d) {
std::move(d.new_sstables.begin(), d.new_sstables.end(), std::back_inserter(ret));
};
co_await sstables::compact_sstables(std::move(desc), info, t, monitor);
co_await sst->unlink();
co_return ret;
}
// Submit a table to be scrubbed and wait for its termination.
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sstable_scrub(table_state& t, sstables::compaction_type_options::scrub opts, std::optional<tasks::task_info> info) {
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sstable_scrub(table_state& t, sstables::compaction_type_options::scrub opts, tasks::task_info info) {
auto scrub_mode = opts.operation_mode;
if (scrub_mode == sstables::compaction_type_options::scrub::mode::validate) {
return perform_sstable_scrub_validate_mode(t, info);
@@ -2097,7 +2176,7 @@ void compaction_manager::add(table_state& t) {
}
}
future<> compaction_manager::remove(table_state& t) noexcept {
future<> compaction_manager::remove(table_state& t, sstring reason) noexcept {
auto& c_state = get_compaction_state(&t);
auto erase_state = defer([&t, &c_state, this] () noexcept {
c_state.backlog_tracker->disable();
@@ -2113,7 +2192,7 @@ future<> compaction_manager::remove(table_state& t) noexcept {
// and prevent new tasks from entering the gate.
if (!c_state.gate.is_closed()) {
auto close_gate = c_state.gate.close();
co_await stop_ongoing_compactions("table removal", &t);
co_await stop_ongoing_compactions(reason, &t);
co_await std::move(close_gate);
}
@@ -2121,11 +2200,11 @@ future<> compaction_manager::remove(table_state& t) noexcept {
auto found = false;
sstring msg;
for (auto& task : _tasks) {
if (task->compacting_table() == &t) {
if (task.compacting_table() == &t) {
if (!msg.empty()) {
msg += "\n";
}
msg += format("Found {} after remove", *task.get());
msg += format("Found {} after remove", task);
found = true;
}
}
@@ -2136,30 +2215,38 @@ future<> compaction_manager::remove(table_state& t) noexcept {
}
const std::vector<sstables::compaction_info> compaction_manager::get_compactions(table_state* t) const {
auto to_info = [] (const shared_ptr<compaction_task_executor>& task) {
auto to_info = [] (const compaction_task_executor& task) {
sstables::compaction_info ret;
ret.compaction_uuid = task->compaction_data().compaction_uuid;
ret.type = task->compaction_type();
ret.ks_name = task->compacting_table()->schema()->ks_name();
ret.cf_name = task->compacting_table()->schema()->cf_name();
ret.total_partitions = task->compaction_data().total_partitions;
ret.total_keys_written = task->compaction_data().total_keys_written;
ret.compaction_uuid = task.compaction_data().compaction_uuid;
ret.type = task.compaction_type();
ret.ks_name = task.compacting_table()->schema()->ks_name();
ret.cf_name = task.compacting_table()->schema()->cf_name();
ret.total_partitions = task.compaction_data().total_partitions;
ret.total_keys_written = task.compaction_data().total_keys_written;
return ret;
};
using ret = std::vector<sstables::compaction_info>;
return boost::copy_range<ret>(_tasks | boost::adaptors::filtered([t] (const shared_ptr<compaction_task_executor>& task) {
return (!t || task->compacting_table() == t) && task->compaction_running();
return boost::copy_range<ret>(_tasks | boost::adaptors::filtered([t] (const compaction_task_executor& task) {
return (!t || task.compacting_table() == t) && task.compaction_running();
}) | boost::adaptors::transformed(to_info));
}
bool compaction_manager::has_table_ongoing_compaction(const table_state& t) const {
return std::any_of(_tasks.begin(), _tasks.end(), [&t] (const shared_ptr<compaction_task_executor>& task) {
return task->compacting_table() == &t && task->compaction_running();
return std::any_of(_tasks.begin(), _tasks.end(), [&t] (const compaction_task_executor& task) {
return task.compacting_table() == &t && task.compaction_running();
});
};
bool compaction_manager::compaction_disabled(table_state& t) const {
return _compaction_state.contains(&t) && _compaction_state.at(&t).compaction_disabled();
if (auto it = _compaction_state.find(&t); it != _compaction_state.end()) {
return it->second.compaction_disabled();
} else {
cmlog.debug("compaction_disabled: {}:{} not in compaction_state", t.schema()->id(), t.get_group_id());
// Compaction is not strictly disabled, but it is not enabled either.
// The callers actually care about if it's enabled or not, not about the actual state of
// compaction_state::compaction_disabled()
return true;
}
}
future<> compaction_manager::stop_compaction(sstring type, table_state* table) {
@@ -2184,8 +2271,8 @@ future<> compaction_manager::stop_compaction(sstring type, table_state* table) {
void compaction_manager::propagate_replacement(table_state& t,
const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added) {
for (auto& task : _tasks) {
if (task->compacting_table() == &t && task->compaction_running()) {
task->compaction_data().pending_replacements.push_back({ removed, added });
if (task.compacting_table() == &t && task.compaction_running()) {
task.compaction_data().pending_replacements.push_back({ removed, added });
}
}
}

View File

@@ -85,6 +85,7 @@ public:
size_t available_memory = 0;
utils::updateable_value<float> static_shares = utils::updateable_value<float>(0);
utils::updateable_value<uint32_t> throughput_mb_per_sec = utils::updateable_value<uint32_t>(0);
std::chrono::seconds flush_all_tables_before_major = std::chrono::duration_cast<std::chrono::seconds>(std::chrono::days(1));
};
public:
@@ -93,8 +94,13 @@ public:
private:
shared_ptr<compaction::task_manager_module> _task_manager_module;
using compaction_task_executor_list_type = bi::list<
compaction_task_executor,
bi::base_hook<bi::list_base_hook<bi::link_mode<bi::auto_unlink>>>,
bi::constant_time_size<false>>;
// compaction manager may have N fibers to allow parallel compaction per shard.
std::list<shared_ptr<compaction::compaction_task_executor>> _tasks;
compaction_task_executor_list_type _tasks;
// Possible states in which the compaction manager can be found.
//
@@ -170,17 +176,15 @@ private:
// Return nullopt if compaction cannot be started
std::optional<gate::holder> start_compaction(table_state& t);
// parent_info set to std::nullopt means that task manager should not register this task executor.
// To create a task manager task with no parent, parent_info argument should contain empty task_info.
template<typename TaskExecutor, typename... Args>
requires std::is_base_of_v<compaction_task_executor, TaskExecutor> &&
std::is_base_of_v<compaction_task_impl, TaskExecutor> &&
requires (compaction_manager& cm, throw_if_stopping do_throw_if_stopping, Args&&... args) {
{TaskExecutor(cm, do_throw_if_stopping, std::forward<Args>(args)...)} -> std::same_as<TaskExecutor>;
}
future<compaction_manager::compaction_stats_opt> perform_compaction(throw_if_stopping do_throw_if_stopping, std::optional<tasks::task_info> parent_info, Args&&... args);
future<compaction_manager::compaction_stats_opt> perform_compaction(throw_if_stopping do_throw_if_stopping, tasks::task_info parent_info, Args&&... args);
future<> stop_tasks(std::vector<shared_ptr<compaction::compaction_task_executor>> tasks, sstring reason);
future<> stop_tasks(std::vector<shared_ptr<compaction::compaction_task_executor>> tasks, sstring reason) noexcept;
future<> update_throughput(uint32_t value_mbs);
// Return the largest fan-in of currently running compactions
@@ -229,7 +233,7 @@ private:
// similar-sized compaction.
void postpone_compaction_for_table(compaction::table_state* t);
future<compaction_stats_opt> perform_sstable_scrub_validate_mode(compaction::table_state& t, std::optional<tasks::task_info> info);
future<compaction_stats_opt> perform_sstable_scrub_validate_mode(compaction::table_state& t, tasks::task_info info);
future<> update_static_shares(float shares);
using get_candidates_func = std::function<future<std::vector<sstables::shared_sstable>>()>;
@@ -239,14 +243,14 @@ private:
template<typename TaskType, typename... Args>
requires std::derived_from<TaskType, compaction_task_executor> &&
std::derived_from<TaskType, compaction_task_impl>
future<compaction_manager::compaction_stats_opt> perform_task_on_all_files(std::optional<tasks::task_info> info, table_state& t, sstables::compaction_type_options options, owned_ranges_ptr owned_ranges_ptr, get_candidates_func get_func, Args... args);
future<compaction_manager::compaction_stats_opt> perform_task_on_all_files(tasks::task_info info, table_state& t, sstables::compaction_type_options options, owned_ranges_ptr owned_ranges_ptr, get_candidates_func get_func, Args... args);
future<compaction_stats_opt> rewrite_sstables(compaction::table_state& t, sstables::compaction_type_options options, owned_ranges_ptr, get_candidates_func, std::optional<tasks::task_info> info,
future<compaction_stats_opt> rewrite_sstables(compaction::table_state& t, sstables::compaction_type_options options, owned_ranges_ptr, get_candidates_func, tasks::task_info info,
can_purge_tombstones can_purge = can_purge_tombstones::yes, sstring options_desc = "");
// Stop all fibers, without waiting. Safe to be called multiple times.
void do_stop() noexcept;
future<> really_do_stop();
future<> really_do_stop() noexcept;
// Propagate replacement of sstables to all ongoing compaction of a given table
void propagate_replacement(compaction::table_state& t, const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added);
@@ -285,6 +289,10 @@ public:
return _cfg.throughput_mb_per_sec.get();
}
std::chrono::seconds flush_all_tables_before_major() const noexcept {
return _cfg.flush_all_tables_before_major;
}
void register_metrics();
// enable the compaction manager.
@@ -314,7 +322,7 @@ public:
// Submit a table to be off-strategy compacted.
// Returns true iff off-strategy compaction was required and performed.
future<bool> perform_offstrategy(compaction::table_state& t, std::optional<tasks::task_info> info);
future<bool> perform_offstrategy(compaction::table_state& t, tasks::task_info info);
// Submit a table to be cleaned up and wait for its termination.
//
@@ -323,9 +331,9 @@ public:
// Cleanup is about discarding keys that are no longer relevant for a
// given sstable, e.g. after node loses part of its token range because
// of a newly added node.
future<> perform_cleanup(owned_ranges_ptr sorted_owned_ranges, compaction::table_state& t, std::optional<tasks::task_info> info);
future<> perform_cleanup(owned_ranges_ptr sorted_owned_ranges, compaction::table_state& t, tasks::task_info info);
private:
future<> try_perform_cleanup(owned_ranges_ptr sorted_owned_ranges, compaction::table_state& t, std::optional<tasks::task_info> info);
future<> try_perform_cleanup(owned_ranges_ptr sorted_owned_ranges, compaction::table_state& t, tasks::task_info info);
// Add sst to or remove it from the respective compaction_state.sstables_requiring_cleanup set.
bool update_sstable_cleanup_state(table_state& t, const sstables::shared_sstable& sst, const dht::token_range_vector& sorted_owned_ranges);
@@ -333,19 +341,24 @@ private:
future<> on_compaction_completion(table_state& t, sstables::compaction_completion_desc desc, sstables::offstrategy offstrategy);
public:
// Submit a table to be upgraded and wait for its termination.
future<> perform_sstable_upgrade(owned_ranges_ptr sorted_owned_ranges, compaction::table_state& t, bool exclude_current_version, std::optional<tasks::task_info> info = std::nullopt);
future<> perform_sstable_upgrade(owned_ranges_ptr sorted_owned_ranges, compaction::table_state& t, bool exclude_current_version, tasks::task_info info);
// Submit a table to be scrubbed and wait for its termination.
future<compaction_stats_opt> perform_sstable_scrub(compaction::table_state& t, sstables::compaction_type_options::scrub opts, std::optional<tasks::task_info> info = std::nullopt);
future<compaction_stats_opt> perform_sstable_scrub(compaction::table_state& t, sstables::compaction_type_options::scrub opts, tasks::task_info info);
// Submit a table for major compaction.
future<> perform_major_compaction(compaction::table_state& t, std::optional<tasks::task_info> info = std::nullopt);
future<> perform_major_compaction(compaction::table_state& t, tasks::task_info info, bool consider_only_existing_data = false);
// Splits a compaction group by segregating all its sstable according to the classifier[1].
// [1]: See sstables::compaction_type_options::splitting::classifier.
// Returns when all sstables in the main sstable set are split. The only exception is shutdown
// or user aborted splitting using stop API.
future<compaction_stats_opt> perform_split_compaction(compaction::table_state& t, sstables::compaction_type_options::split opt, std::optional<tasks::task_info> info = std::nullopt);
future<compaction_stats_opt> perform_split_compaction(compaction::table_state& t, sstables::compaction_type_options::split opt, tasks::task_info info);
// Splits a single SSTable by segregating all its data according to the classifier.
// If SSTable doesn't need split, the same input SSTable is returned as output.
// If SSTable needs split, then output SSTables are returned and the input SSTable is deleted.
future<std::vector<sstables::shared_sstable>> maybe_split_sstable(sstables::shared_sstable sst, table_state& t, sstables::compaction_type_options::split opt);
// Run a custom job for a given table, defined by a function
// it completes when future returned by job is ready or returns immediately
@@ -354,7 +367,7 @@ public:
// parameter type is the compaction type the operation can most closely be
// associated with, use compaction_type::Compaction, if none apply.
// parameter job is a function that will carry the operation
future<> run_custom_job(compaction::table_state& s, sstables::compaction_type type, const char *desc, noncopyable_function<future<>(sstables::compaction_data&, sstables::compaction_progress_monitor&)> job, std::optional<tasks::task_info> info, throw_if_stopping do_throw_if_stopping);
future<> run_custom_job(compaction::table_state& s, sstables::compaction_type type, const char *desc, noncopyable_function<future<>(sstables::compaction_data&, sstables::compaction_progress_monitor&)> job, tasks::task_info info, throw_if_stopping do_throw_if_stopping);
class compaction_reenabler {
compaction_manager& _cm;
@@ -394,7 +407,7 @@ public:
// Remove a table from the compaction manager.
// Cancel requests on table and wait for possible ongoing compactions.
future<> remove(compaction::table_state& t) noexcept;
future<> remove(compaction::table_state& t, sstring reason = "table removal") noexcept;
const stats& get_stats() const {
return _stats;
@@ -462,7 +475,9 @@ public:
namespace compaction {
class compaction_task_executor : public enable_shared_from_this<compaction_task_executor> {
class compaction_task_executor
: public enable_shared_from_this<compaction_task_executor>
, public boost::intrusive::list_base_hook<boost::intrusive::link_mode<boost::intrusive::auto_unlink>> {
public:
enum class state {
none, // initial and final state
@@ -586,6 +601,8 @@ private:
future<compaction_manager::compaction_stats_opt> compaction_done() noexcept {
return _compaction_done.get_future();
}
future<sstables::sstable_set> sstable_set_for_tombstone_gc(::compaction::table_state& t);
public:
bool stopping() const noexcept {
return _compaction_data.abort.abort_requested();
@@ -603,10 +620,10 @@ public:
requires (compaction_manager& cm, throw_if_stopping do_throw_if_stopping, Args&&... args) {
{TaskExecutor(cm, do_throw_if_stopping, std::forward<Args>(args)...)} -> std::same_as<TaskExecutor>;
}
friend future<compaction_manager::compaction_stats_opt> compaction_manager::perform_compaction(throw_if_stopping do_throw_if_stopping, std::optional<tasks::task_info> parent_info, Args&&... args);
friend future<compaction_manager::compaction_stats_opt> compaction_manager::perform_compaction(throw_if_stopping do_throw_if_stopping, tasks::task_info parent_info, Args&&... args);
friend future<compaction_manager::compaction_stats_opt> compaction_manager::perform_task(shared_ptr<compaction_task_executor> task, throw_if_stopping do_throw_if_stopping);
friend fmt::formatter<compaction_task_executor>;
friend future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason);
friend future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason) noexcept;
friend sstables::test_env_compaction_manager;
};

View File

@@ -10,6 +10,7 @@
#pragma once
#include "utils/assert.hh"
#include "sstables/sstables.hh"
#include "size_tiered_compaction_strategy.hh"
#include "interval.hh"
@@ -311,7 +312,7 @@ public:
template <typename T>
static std::vector<sstables::shared_sstable> overlapping(const schema& s, const std::vector<sstables::shared_sstable>& candidates, const T& others) {
assert(!candidates.empty());
SCYLLA_ASSERT(!candidates.empty());
/*
* Picking each sstable from others that overlap one of the sstable of candidates is not enough
* because you could have the following situation:
@@ -350,7 +351,7 @@ public:
*/
template <typename T>
static std::vector<sstables::shared_sstable> overlapping(const schema& s, dht::token start, dht::token end, const T& sstables) {
assert(start <= end);
SCYLLA_ASSERT(start <= end);
std::vector<sstables::shared_sstable> overlapped;
auto range = ::wrapping_interval<dht::token>::make(start, end);
@@ -459,7 +460,7 @@ private:
* for prior failure), will return an empty list. Never returns null.
*/
candidates_info get_candidates_for(int level, const std::vector<std::optional<dht::decorated_key>>& last_compacted_keys) {
assert(!get_level(level).empty());
SCYLLA_ASSERT(!get_level(level).empty());
logger.debug("Choosing candidates for L{}", level);
@@ -517,7 +518,7 @@ public:
new_level = 0;
} else {
new_level = (minimum_level == maximum_level && can_promote) ? maximum_level + 1 : maximum_level;
assert(new_level > 0);
SCYLLA_ASSERT(new_level > 0);
}
return new_level;
}

View File

@@ -6,6 +6,7 @@
* SPDX-License-Identifier: AGPL-3.0-or-later
*/
#include "utils/assert.hh"
#include "sstables/sstables.hh"
#include "size_tiered_compaction_strategy.hh"
#include "cql3/statements/property_definitions.hh"
@@ -114,7 +115,7 @@ size_tiered_compaction_strategy::create_sstable_and_length_pairs(const std::vect
for(auto& sstable : sstables) {
auto sstable_size = sstable->data_size();
assert(sstable_size != 0);
SCYLLA_ASSERT(sstable_size != 0);
sstable_length_pairs.emplace_back(sstable, sstable_size);
}

View File

@@ -39,6 +39,7 @@ public:
virtual bool compaction_enforce_min_threshold() const noexcept = 0;
virtual const sstables::sstable_set& main_sstable_set() const = 0;
virtual const sstables::sstable_set& maintenance_sstable_set() const = 0;
virtual lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc() const = 0;
virtual std::unordered_set<sstables::shared_sstable> fully_expired_sstables(const std::vector<sstables::shared_sstable>& sstables, gc_clock::time_point compaction_time) const = 0;
virtual const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const noexcept = 0;
virtual sstables::compaction_strategy& get_compaction_strategy() const noexcept = 0;
@@ -48,6 +49,8 @@ public:
virtual sstables::shared_sstable make_sstable() const = 0;
virtual sstables::sstable_writer_config configure_writer(sstring origin) const = 0;
virtual api::timestamp_type min_memtable_timestamp() const = 0;
virtual api::timestamp_type min_memtable_live_timestamp() const = 0;
virtual api::timestamp_type min_memtable_live_row_marker_timestamp() const = 0;
virtual bool memtable_has_key(const dht::decorated_key& key) const = 0;
virtual future<> on_compaction_completion(sstables::compaction_completion_desc desc, sstables::offstrategy offstrategy) = 0;
virtual bool is_auto_compaction_disabled_by_user() const noexcept = 0;

View File

@@ -16,7 +16,6 @@
#include "sstables/sstable_directory.hh"
#include "utils/error_injection.hh"
#include "utils/pretty_printers.hh"
#include "db/config.hh"
using namespace std::chrono_literals;
@@ -131,7 +130,7 @@ distribute_reshard_jobs(sstables::sstable_directory::sstable_open_info_vector so
// A creator function must be passed that will create an SSTable object in the correct shard,
// and an I/O priority must be specified.
future<> reshard(sstables::sstable_directory& dir, sstables::sstable_directory::sstable_open_info_vector shared_info, replica::table& table,
sstables::compaction_sstable_creator_fn creator, compaction::owned_ranges_ptr owned_ranges_ptr, std::optional<tasks::task_info> parent_info)
sstables::compaction_sstable_creator_fn creator, compaction::owned_ranges_ptr owned_ranges_ptr, tasks::task_info parent_info)
{
// Resharding doesn't like empty sstable sets, so bail early. There is nothing
// to reshard in this shard.
@@ -330,24 +329,32 @@ tasks::is_abortable compaction_task_impl::is_abortable() const noexcept {
return tasks::is_abortable{!_parent_id};
}
static future<bool> maybe_flush_all_tables(sharded<replica::database>& db) {
auto interval = db.local().get_config().compaction_flush_all_tables_before_major_seconds();
if (interval) {
auto when = db_clock::now() - interval * 1s;
if (co_await replica::database::get_all_tables_flushed_at(db) <= when) {
co_await db.invoke_on_all([&] (replica::database& db) -> future<> {
co_await db.flush_all_tables();
});
co_return true;
static future<bool> maybe_flush_commitlog(sharded<replica::database>& db, bool force_flush) {
// flush commitlog if
// (a) force_flush == true (or)
// (b) flush_all_tables_before_major > 0s and the configured seconds have elapsed since last all tables flush
if (!force_flush) {
auto interval = db.local().get_compaction_manager().flush_all_tables_before_major();
if (interval <= 0s) {
co_return false;
}
auto when = db_clock::now() - interval;
if (co_await replica::database::get_all_tables_flushed_at(db) > when) {
co_return false;
}
}
co_return false;
co_await db.invoke_on_all([&] (replica::database& db) -> future<> {
co_await db.flush_commitlog();
});
co_return true;
}
future<> global_major_compaction_task_impl::run() {
bool flushed_all_tables = false;
if (_flush_mode == flush_mode::all_tables) {
flushed_all_tables = co_await maybe_flush_all_tables(_db);
flushed_all_tables = co_await maybe_flush_commitlog(_db, _consider_only_existing_data);
}
std::unordered_map<sstring, std::vector<table_info>> tables_by_keyspace;
@@ -364,7 +371,7 @@ future<> global_major_compaction_task_impl::run() {
flush_mode fm = flushed_all_tables ? flush_mode::skip : _flush_mode;
for (auto& [ks, table_infos] : tables_by_keyspace) {
auto task = co_await _module->make_and_start_task<major_keyspace_compaction_task_impl>(parent_info, ks, parent_info.id, _db, table_infos, fm,
&cv, &current_task);
_consider_only_existing_data, &cv, &current_task);
keyspace_tasks.emplace_back(std::move(task), ks, std::move(table_infos));
}
co_await run_keyspace_tasks(_db.local(), keyspace_tasks, cv, current_task, false);
@@ -380,14 +387,14 @@ future<> major_keyspace_compaction_task_impl::run() {
bool flushed_all_tables = false;
if (_flush_mode == flush_mode::all_tables) {
flushed_all_tables = co_await maybe_flush_all_tables(_db);
flushed_all_tables = co_await maybe_flush_commitlog(_db, _consider_only_existing_data);
}
flush_mode fm = flushed_all_tables ? flush_mode::skip : _flush_mode;
co_await _db.invoke_on_all([&] (replica::database& db) -> future<> {
tasks::task_info parent_info{_status.id, _status.shard};
auto& module = db.get_compaction_manager().get_task_manager_module();
auto task = co_await module.make_and_start_task<shard_major_keyspace_compaction_task_impl>(parent_info, _status.keyspace, _status.id, db, _table_infos, fm);
auto task = co_await module.make_and_start_task<shard_major_keyspace_compaction_task_impl>(parent_info, _status.keyspace, _status.id, db, _table_infos, fm, _consider_only_existing_data);
co_await task->done();
});
}
@@ -398,7 +405,7 @@ future<> shard_major_keyspace_compaction_task_impl::run() {
tasks::task_info parent_info{_status.id, _status.shard};
std::vector<table_tasks_info> table_tasks;
for (auto& ti : _local_tables) {
table_tasks.emplace_back(co_await _module->make_and_start_task<table_major_keyspace_compaction_task_impl>(parent_info, _status.keyspace, ti.name, _status.id, _db, ti, cv, current_task, _flush_mode), ti);
table_tasks.emplace_back(co_await _module->make_and_start_task<table_major_keyspace_compaction_task_impl>(parent_info, _status.keyspace, ti.name, _status.id, _db, ti, cv, current_task, _flush_mode, _consider_only_existing_data), ti);
}
co_await run_table_tasks(_db, std::move(table_tasks), cv, current_task, true);
@@ -408,8 +415,8 @@ future<> table_major_keyspace_compaction_task_impl::run() {
co_await wait_for_your_turn(_cv, _current_task, _status.id);
tasks::task_info info{_status.id, _status.shard};
replica::table::do_flush do_flush(_flush_mode != flush_mode::skip);
co_await run_on_table("force_keyspace_compaction", _db, _status.keyspace, _ti, [info, do_flush] (replica::table& t) {
return t.compact_all_sstables(info, do_flush);
co_await run_on_table("force_keyspace_compaction", _db, _status.keyspace, _ti, [info, do_flush, consider_only_existing_data = _consider_only_existing_data] (replica::table& t) {
return t.compact_all_sstables(info, do_flush, consider_only_existing_data);
});
}
@@ -467,7 +474,16 @@ future<> shard_cleanup_keyspace_compaction_task_impl::run() {
future<> table_cleanup_keyspace_compaction_task_impl::run() {
co_await wait_for_your_turn(_cv, _current_task, _status.id);
auto owned_ranges_ptr = compaction::make_owned_ranges_ptr(_db.get_keyspace_local_ranges(_status.keyspace));
// Note that we do not hold an effective_replication_map_ptr throughout
// the cleanup operation, so the topology might change.
// Since clenaup is an admin operation required for vnodes,
// it is the responsibility of the system operator to not
// perform additional incompatible range movements during cleanup.
auto get_owned_ranges = [&] (std::string_view ks_name) -> future<owned_ranges_ptr> {
const auto& erm = _db.find_keyspace(ks_name).get_vnode_effective_replication_map();
co_return compaction::make_owned_ranges_ptr(co_await _db.get_keyspace_local_ranges(erm));
};
auto owned_ranges_ptr = co_await get_owned_ranges(_status.keyspace);
co_await run_on_table("force_keyspace_cleanup", _db, _status.keyspace, _ti, [&] (replica::table& t) {
// skip the flush, as cleanup_keyspace_compaction_task_impl::run should have done this.
return t.perform_cleanup_compaction(owned_ranges_ptr, tasks::task_info{_status.id, _status.shard}, replica::table::do_flush::no);
@@ -531,8 +547,15 @@ future<> shard_upgrade_sstables_compaction_task_impl::run() {
future<> table_upgrade_sstables_compaction_task_impl::run() {
co_await wait_for_your_turn(_cv, _current_task, _status.id);
auto owned_ranges = _db.maybe_get_keyspace_local_ranges(_status.keyspace);
auto owned_ranges_ptr = owned_ranges ? compaction::make_owned_ranges_ptr(std::move(owned_ranges.value())) : nullptr;
auto get_owned_ranges = [&] (std::string_view keyspace_name) -> future<owned_ranges_ptr> {
const auto& ks = _db.find_keyspace(keyspace_name);
if (ks.get_replication_strategy().is_per_table()) {
co_return nullptr;
}
const auto& erm = ks.get_vnode_effective_replication_map();
co_return compaction::make_owned_ranges_ptr(co_await _db.get_keyspace_local_ranges(erm));
};
auto owned_ranges_ptr = co_await get_owned_ranges(_status.keyspace);
tasks::task_info info{_status.id, _status.shard};
co_await run_on_table("upgrade_sstables", _db, _status.keyspace, _ti, [&] (replica::table& t) -> future<> {
return t.parallel_foreach_table_state([&] (compaction::table_state& ts) -> future<> {

View File

@@ -62,9 +62,11 @@ public:
std::string table,
std::string entity,
tasks::task_id parent_id,
flush_mode fm = flush_mode::compacted_tables) noexcept
flush_mode fm = flush_mode::compacted_tables,
bool consider_only_existing_data = false) noexcept
: compaction_task_impl(module, id, sequence_number, std::move(scope), std::move(keyspace), std::move(table), std::move(entity), parent_id)
, _flush_mode(fm)
, _consider_only_existing_data(consider_only_existing_data)
{
// FIXME: add progress units
}
@@ -75,6 +77,7 @@ public:
protected:
flush_mode _flush_mode;
bool _consider_only_existing_data;
virtual future<> run() override = 0;
};
@@ -85,9 +88,10 @@ private:
public:
global_major_compaction_task_impl(tasks::task_manager::module_ptr module,
sharded<replica::database>& db,
std::optional<flush_mode> fm = std::nullopt) noexcept
std::optional<flush_mode> fm = std::nullopt,
bool consider_only_existing_data = false) noexcept
: major_compaction_task_impl(module, tasks::task_id::create_random_id(), module->new_sequence_number(), "global", "", "", "", tasks::task_id::create_null_id(),
fm.value_or(flush_mode::all_tables))
fm.value_or(flush_mode::all_tables), consider_only_existing_data)
, _db(db)
{}
protected:
@@ -109,12 +113,13 @@ public:
sharded<replica::database>& db,
std::vector<table_info> table_infos,
std::optional<flush_mode> fm = std::nullopt,
bool consider_only_existing_data = false,
seastar::condition_variable* cv = nullptr,
tasks::task_manager::task_ptr* current_task = nullptr) noexcept
: major_compaction_task_impl(module, tasks::task_id::create_random_id(),
parent_id ? 0 : module->new_sequence_number(),
"keyspace", std::move(keyspace), "", "", parent_id,
fm.value_or(flush_mode::all_tables))
fm.value_or(flush_mode::all_tables), consider_only_existing_data)
, _db(db)
, _table_infos(std::move(table_infos))
, _cv(cv)
@@ -134,8 +139,9 @@ public:
tasks::task_id parent_id,
replica::database& db,
std::vector<table_info> local_tables,
flush_mode fm) noexcept
: major_compaction_task_impl(module, tasks::task_id::create_random_id(), 0, "shard", std::move(keyspace), "", "", parent_id, fm)
flush_mode fm,
bool consider_only_existing_data) noexcept
: major_compaction_task_impl(module, tasks::task_id::create_random_id(), 0, "shard", std::move(keyspace), "", "", parent_id, fm, consider_only_existing_data)
, _db(db)
, _local_tables(std::move(local_tables))
{}
@@ -158,8 +164,9 @@ public:
table_info ti,
seastar::condition_variable& cv,
tasks::task_manager::task_ptr& current_task,
flush_mode fm) noexcept
: major_compaction_task_impl(module, tasks::task_id::create_random_id(), 0, "table", std::move(keyspace), std::move(table), "", parent_id, fm)
flush_mode fm,
bool consider_only_existing_data) noexcept
: major_compaction_task_impl(module, tasks::task_id::create_random_id(), 0, "table", std::move(keyspace), std::move(table), "", parent_id, fm, consider_only_existing_data)
, _db(db)
, _ti(std::move(ti))
, _cv(cv)

View File

@@ -17,7 +17,8 @@
#include <boost/range/algorithm/remove_if.hpp>
#include <boost/range/algorithm/min_element.hpp>
#include <boost/range/algorithm/partial_sort.hpp>
#include <boost/range/adaptor/reversed.hpp>
#include <ranges>
namespace sstables {
@@ -295,7 +296,8 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
// When trimming, let's keep sstables with overlapping time window, so as to reduce write amplification.
// For example, if there are N sstables spanning window W, where N <= 32, then we can produce all data for W
// in a single compaction round, removing the need to later compact W to reduce its number of files.
boost::partial_sort(multi_window, multi_window.begin() + max_sstables, [](const shared_sstable &a, const shared_sstable &b) {
auto sort_size = std::min(max_sstables, multi_window.size());
boost::partial_sort(multi_window, multi_window.begin() + sort_size, [](const shared_sstable &a, const shared_sstable &b) {
return a->get_stats_metadata().max_timestamp < b->get_stats_metadata().max_timestamp;
});
maybe_trim_job(multi_window, job_size, disjoint);
@@ -309,10 +311,9 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
auto all_disjoint = !single_window.empty() && is_disjoint(single_window);
auto all_buckets = get_buckets(single_window, _options);
single_window.clear();
for (auto& pair : all_buckets.first) {
auto ssts = std::move(pair.second);
for (auto& [bucket, ssts] : all_buckets.first) {
if (ssts.size() >= offstrategy_threshold) {
clogger.debug("time_window_compaction_strategy::get_reshaping_job: bucket={} bucket_size={}", pair.first, ssts.size());
clogger.debug("time_window_compaction_strategy::get_reshaping_job: bucket={} bucket_size={}", bucket, ssts.size());
if (all_disjoint) {
std::copy(ssts.begin(), ssts.end(), std::back_inserter(single_window));
continue;
@@ -417,13 +418,13 @@ time_window_compaction_strategy::get_next_non_expired_sstables(table_state& tabl
std::vector<shared_sstable>
time_window_compaction_strategy::get_compaction_candidates(table_state& table_s, strategy_control& control, std::vector<shared_sstable> candidate_sstables) {
auto& state = get_state(table_s);
auto p = get_buckets(std::move(candidate_sstables), _options);
auto [buckets, max_timestamp] = get_buckets(std::move(candidate_sstables), _options);
// Update the highest window seen, if necessary
state.highest_window_seen = std::max(state.highest_window_seen, p.second);
state.highest_window_seen = std::max(state.highest_window_seen, max_timestamp);
update_estimated_compaction_by_tasks(state, p.first, table_s.min_compaction_threshold(), table_s.schema()->max_compaction_threshold());
update_estimated_compaction_by_tasks(state, buckets, table_s.min_compaction_threshold(), table_s.schema()->max_compaction_threshold());
return newest_bucket(table_s, control, std::move(p.first), table_s.min_compaction_threshold(), table_s.schema()->max_compaction_threshold(),
return newest_bucket(table_s, control, std::move(buckets), table_s.min_compaction_threshold(), table_s.schema()->max_compaction_threshold(),
state.highest_window_seen);
}
@@ -463,7 +464,7 @@ struct fmt::formatter<std::map<sstables::timestamp_type, std::vector<sstables::s
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
auto format(const std::map<sstables::timestamp_type, std::vector<sstables::shared_sstable>>& buckets, fmt::format_context& ctx) const {
auto out = fmt::format_to(ctx.out(), " buckets = {{\n");
for (auto& [timestamp, sstables] : buckets | boost::adaptors::reversed) {
for (auto& [timestamp, sstables] : buckets | std::views::reverse) {
out = fmt::format_to(out, " key={}, size={}\n", timestamp, sstables.size());
}
return fmt::format_to(out, " }}\n");
@@ -478,10 +479,7 @@ time_window_compaction_strategy::newest_bucket(table_state& table_s, strategy_co
auto& state = get_state(table_s);
clogger.debug("time_window_compaction_strategy::newest_bucket:\n now {}\n{}", now, buckets);
for (auto&& key_bucket : buckets | boost::adaptors::reversed) {
auto key = key_bucket.first;
auto& bucket = key_bucket.second;
for (auto&& [key, bucket] : buckets | std::views::reverse) {
bool last_active_bucket = is_last_active_bucket(key, now);
if (last_active_bucket) {
state.recent_active_windows.insert(key);
@@ -536,10 +534,7 @@ void time_window_compaction_strategy::update_estimated_compaction_by_tasks(time_
int64_t n = 0;
timestamp_type now = state.highest_window_seen;
for (auto& task : tasks) {
const bucket_t& bucket = task.second;
timestamp_type bucket_key = task.first;
for (auto& [bucket_key, bucket] : tasks) {
switch (compaction_mode(state, bucket, bucket_key, now, min_threshold)) {
case bucket_compaction_mode::size_tiered:
n += size_tiered_compaction_strategy::estimated_pending_compactions(bucket, min_threshold, max_threshold, _stcs_options);

View File

@@ -14,6 +14,7 @@
#include <span>
#include <boost/range/iterator_range.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include "utils/assert.hh"
#include "utils/serialization.hh"
#include <seastar/util/backtrace.hh>
@@ -65,15 +66,15 @@ private:
for (auto&& val : values) {
using val_type = std::remove_cvref_t<decltype(val)>;
if constexpr (FragmentedView<val_type>) {
assert(val.size_bytes() <= std::numeric_limits<size_type>::max());
SCYLLA_ASSERT(val.size_bytes() <= std::numeric_limits<size_type>::max());
write<size_type>(out, size_type(val.size_bytes()));
write_fragmented(out, val);
} else if constexpr (std::same_as<val_type, managed_bytes>) {
assert(val.size() <= std::numeric_limits<size_type>::max());
SCYLLA_ASSERT(val.size() <= std::numeric_limits<size_type>::max());
write<size_type>(out, size_type(val.size()));
write_fragmented(out, managed_bytes_view(val));
} else {
assert(val.size() <= std::numeric_limits<size_type>::max());
SCYLLA_ASSERT(val.size() <= std::numeric_limits<size_type>::max());
write<size_type>(out, size_type(val.size()));
write_fragmented(out, single_fragmented_view(val));
}
@@ -135,7 +136,7 @@ public:
partial.reserve(values.size());
auto i = _types.begin();
for (auto&& component : values) {
assert(i != _types.end());
SCYLLA_ASSERT(i != _types.end());
partial.push_back((*i++)->decompose(component));
}
return serialize_value(partial);
@@ -256,7 +257,7 @@ public:
}
// Returns true iff given prefix has no missing components
bool is_full(managed_bytes_view v) const {
assert(AllowPrefixes == allow_prefixes::yes);
SCYLLA_ASSERT(AllowPrefixes == allow_prefixes::yes);
return std::distance(begin(v), end(v)) == (ssize_t)_types.size();
}
bool is_empty(managed_bytes_view v) const {

Some files were not shown because too many files have changed in this diff Show More