Compare commits

202 Commits

Author SHA1 Message Date
Nadav Har'El
bff9ddde12 Merge '[Backport 6.2] view_builder: write status to tables before starting to build' from null
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.

Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in an incorrect state, indicating
that the view building is not complete.
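
The ordering problem can be illustrated with a small Python sketch. This is not Scylla code: the `status_table` dict and the `add_view_*` helpers are hypothetical stand-ins, and the build is assumed to finish immediately so that the interleaving is deterministic.

```python
# Hypothetical model of the view_build_status table: view name -> status.

def build_view(status_table, view):
    # Assumption: the build completes right away and records SUCCESS.
    status_table[view] = "SUCCESS"


def add_view_build_first(status_table, view):
    # Old order: the build step starts before STARTED is written.
    build_view(status_table, view)
    status_table[view] = "STARTED"  # overwrites the SUCCESS status


def add_view_status_first(status_table, view):
    # Fixed order: write STARTED first, then start the build step.
    status_table[view] = "STARTED"
    build_view(status_table, view)


old_order, new_order = {}, {}
add_view_build_first(old_order, "ks.mv")
add_view_status_first(new_order, "ks.mv")
print(old_order["ks.mv"], new_order["ks.mv"])  # STARTED SUCCESS
```

With the old order the table is left at STARTED although the build finished; with the fixed order the last write is SUCCESS.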

Fixes #20638

The PR contains a few additional small fixes, in separate commits, related to the view build status table.

It addresses flakiness issues in tests that use the view build status table to determine when view building is complete. Because of these issues, the table may be in an incorrect state, with a row whose status is STARTED even though the view has actually finished building, which causes us to wait in `wait_for_view` until it times out.

For testing I used a test similar to `test_view_build_status_with_replace_node`, but it only creates the views and calls `wait_for_view`. Without these commits it failed in 4/1024 runs, and with the commits it passed 2048/2048.

Backport to fix the bugs that affect previous versions and to improve CI stability.

- (cherry picked from commit b1be2d3c41)

- (cherry picked from commit 1104411f83)

- (cherry picked from commit 7a6aec1a6c)

Parent PR: #22307

Closes scylladb/scylladb#22356

* github.com:scylladb/scylladb:
  view_builder: hold semaphore during entire startup
  view_builder: pass view name by value to write_view_build_status
  view_builder: write status to tables before starting to build
2025-01-19 15:36:44 +02:00
Michael Litvak
e8fde3b0a3 view_builder: hold semaphore during entire startup
Guard the whole view builder startup routine by holding the semaphore
until it's done instead of releasing it early, so that it's not
intercepted by migration notifications.

(cherry picked from commit 7a6aec1a6c)
2025-01-19 11:03:53 +02:00
Michael Litvak
a7f8842776 view_builder: pass view name by value to write_view_build_status
The function write_view_build_status takes two lambda functions and
chooses which of them to run depending on the upgrade state. It might
run both of them.

The parameters ks_name and view_name should be passed by value instead
of by reference because they are moved inside each lambda function.
Otherwise, if both lambdas are run, the second call operates on invalid
values that were moved.

(cherry picked from commit 1104411f83)
2025-01-19 11:03:53 +02:00
Piotr Dulikowski
b88e0f8f74 Merge '[Backport 6.2] main, view: Pair view builder drain with its start' from null
In this PR, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:

(a) The following things happened in order:

    1. The view builder would be constructed.
    2. Right after that, a deferred lambda would be created to stop the
       view builder during shutdown.
    3. group0_service would be started.
    4. A deferred lambda stopping group0_service would be created right
       after that.
    5. The view builder would be started.

(b) Because the view builder depends on group0_client, it couldn't be
    started before starting group0_service. On the other hand, other
    services depend on the view builder, e.g. the stream manager. That
    makes changing the order of initialization a difficult problem,
    so we want to avoid doing that unless we're sure it's the right
    choice.

(c) Since the view builder uses group0_client, there was a possibility
    of running into a segmentation fault issue in the following
    scenario:

    1. A call to `view_builder::mark_view_build_success()` is issued.
    2. We stop group0_service.
    3. `view_builder::mark_view_build_success()` calls
       `announce_with_raft()`, which leads to a use-after-free because
       group0_service has already been destroyed.

      This very scenario took place in scylladb/scylladb#20772.

Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534,
so we revert that patch. These changes are a second attempt
at the problem, this time solving it in a safer manner.

The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aforementioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.

Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.
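
The effect of the pairing on shutdown order can be sketched in Python, with `contextlib.ExitStack` standing in for the deferred lambdas in `main` (callbacks run in reverse registration order). The event names are illustrative labels only, not real Scylla calls.

```python
from contextlib import ExitStack


def run_startup():
    """Record startup steps and the order in which deferred actions fire."""
    events = []
    with ExitStack() as deferred:
        events.append("construct view_builder")
        deferred.callback(events.append, "view_builder.stop")

        events.append("start group0_service")
        deferred.callback(events.append, "group0_service.stop")

        events.append("start view_builder")
        # The new pairing: drain is deferred right after the start.
        deferred.callback(events.append, "view_builder.drain")
    return events


events = run_startup()
for event in events:
    print(event)
```

Because the drain is registered last, it runs first on shutdown, before `group0_service` is stopped, so in this model the view builder no longer touches group0 after it is gone.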

Backport: The patch I'm reverting made it to 6.2, so we want to backport this one there too.

Fixes scylladb/scylladb#20772
Fixes scylladb/scylladb#21534

- (cherry picked from commit a5715086a4)

- (cherry picked from commit 06ce976370)

- (cherry picked from commit d1f960eee2)

Parent PR: #21909

Closes scylladb/scylladb#22331

* github.com:scylladb/scylladb:
  test/topology_custom: Add test for Scylla with disabled view building
  main, view: Pair view builder drain with its start
  Revert "main,cql_test_env: start group0_service before view_builder"
2025-01-17 09:57:36 +01:00
Michael Litvak
c5bdc9e58f view_builder: write status to tables before starting to build
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.

Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in an incorrect state, indicating
that the view building is not complete.

Fixes scylladb/scylladb#20638

(cherry picked from commit b1be2d3c41)
2025-01-16 20:08:36 +00:00
Kamil Braun
cbef20e977 Merge '[Backport 6.2] Fix possible data corruption due to token keys clashing in read repair.' from Sergey
This update addresses an issue in the mutation diff calculation algorithm used during read repair. Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated based on the Murmur3 hash function, it could generate duplicate values for different partition keys, causing corruption in the affected rows' values.
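
A minimal Python sketch of the failure mode, under stated assumptions: `toy_token` is a deliberately collision-prone stand-in for the Murmur3-based token, and the two `diff_by_*` helpers are hypothetical simplifications of the diff hashmap, not the actual `storage_proxy` code.

```python
def toy_token(partition_key: str) -> int:
    # Stand-in for the Murmur3-based token; collides for permuted keys.
    return sum(partition_key.encode()) % 8


def diff_by_token(rows):
    # Old behaviour (simplified): entries keyed by token overwrite each other.
    diffs = {}
    for partition_key, value in rows:
        diffs[toy_token(partition_key)] = value
    return diffs


def diff_by_partition_key(rows):
    # Fixed behaviour (simplified): one entry per distinct partition key.
    diffs = {}
    for partition_key, value in rows:
        diffs[partition_key] = value
    return diffs


rows = [("ab", "value-1"), ("ba", "value-2")]
print(len(diff_by_token(rows)), len(diff_by_partition_key(rows)))  # 1 2
```

Keyed by token, the second row silently replaces the first, which is the kind of mix-up that can end up written back as "repaired" data.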

Fixes scylladb/scylladb#19101

Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2

- (cherry picked from commit e577f1d141)

- (cherry picked from commit 39785c6f4e)

- (cherry picked from commit 155480595f)

Parent PR: #21996

Closes scylladb/scylladb#22298

* github.com:scylladb/scylladb:
  storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function.
  storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap.
  test: Add test case for checking read repair diff calculation when having conflicting keys.
2025-01-16 17:13:12 +01:00
Dawid Mędrek
833ea91940 test/topology_custom: Add test for Scylla with disabled view building
Before this commit, there doesn't seem to have been a test verifying that
starting and shutting down Scylla behave correctly when the configuration
option `view_building` is set to false. In these changes, we add one.

(cherry picked from commit d1f960eee2)
2025-01-16 14:12:31 +01:00
Dawid Mędrek
1200d3b735 main, view: Pair view builder drain with its start
In these changes, we pair draining the view builder with its start.
To better understand what was done and why, let's first look at the
situation before this commit and the context of it:

(a) The following things happened in order:

    1. The view builder would be constructed.
    2. Right after that, a deferred lambda would be created to stop the
       view builder during shutdown.
    3. group0_service would be started.
    4. A deferred lambda stopping group0_service would be created right
       after that.
    5. The view builder would be started.

(b) Because the view builder depends on group0_client, it couldn't be
    started before starting group0_service. On the other hand, other
    services depend on the view builder, e.g. the stream manager. That
    makes changing the order of initialization a difficult problem,
    so we want to avoid doing that unless we're sure it's the right
    choice.

(c) Since the view builder uses group0_client, there was a possibility
    of running into a segmentation fault issue in the following
    scenario:

    1. A call to `view_builder::mark_view_build_success()` is issued.
    2. We stop group0_service.
    3. `view_builder::mark_view_build_success()` calls
       `announce_with_raft()`, which leads to a use-after-free because
       group0_service has already been destroyed.

      This very scenario took place in scylladb/scylladb#20772.

Initially, we decided to solve the issue by initializing
group0_service a bit earlier (scylladb/scylladb@7bad8378c7).
Unfortunately, it led to other issues described in scylladb/scylladb#21534.
We reverted that change in the previous commit. These changes are a
second attempt at the problem, this time solving it in a safer manner.

The solution we came up with is to pair the start of the view builder
with a deferred lambda that deinitializes it by calling
`view_builder::drain()`. No other component of the system should be
able to use the view builder anymore, so it's safe to do that.
Furthermore, that pairing makes the analysis of
initialization/deinitialization order much easier. We also solve the
aforementioned use-after-free issue because the view builder itself
will no longer attempt to use group0_client.

Note that we still pair a deferred lambda calling `view_builder::stop()`
with the construction of the view builder; that function will also call
`view_builder::drain()`. Another notable thing is `view_builder::drain()`
may be called earlier by `storage_service::do_drain()`. In other words,
these changes cover the situation when Scylla runs into a problem when
starting up.

Fixes scylladb/scylladb#20772

(cherry picked from commit 06ce976370)
2025-01-16 12:37:04 +01:00
Dawid Mędrek
84b774515b Revert "main,cql_test_env: start group0_service before view_builder"
The patch solved a problem related to an initialization order
(scylladb/scylladb#20772), but we ran into another one: scylladb/scylladb#21534.
After moving the initialization of group0_service, it ended up being destroyed
AFTER the CDC generation service would. Since CDC generations are accessed
in `storage_service::topology_state_load()`:

```
for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
    rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
    co_await _cdc_gens.local().handle_cdc_generation(gen_id);
```

we started getting the following failure:

```
Service &seastar::sharded<cdc::generation_service>::local() [Service = cdc::generation_service]: Assertion `local_is_initialized()' failed.
```

We're reverting the patch to go back to a more stable version of Scylla
and in the following commit, we'll solve the original issue in a more
systematic way.

This reverts commit 7bad8378c7.

(cherry picked from commit a5715086a4)
2025-01-16 12:36:41 +01:00
Sergey Zolotukhin
06a8956174 test: Include parent test name in ScyllaClusterManager log file names.
Add the test file name to `ScyllaClusterManager` log file names alongside the test function name.
This avoids race conditions when tests with the same function names are executed simultaneously.

Fixes scylladb/scylladb#21807

Backport: not needed since this is a fix in the testing scripts.

Closes scylladb/scylladb#22192

(cherry picked from commit 2f1731c551)

Closes scylladb/scylladb#22249
2025-01-14 16:33:27 +01:00
Sergey Zolotukhin
f0f833e8ab storage_proxy/read_repair: Remove redundant 'schema' parameter from data_read_resolver::resolve
function.

The `data_read_resolver` class inherits from `abstract_read_resolver`, which already includes the
`schema_ptr _schema` member. Therefore, using a separate function parameter in `data_read_resolver::resolve`
initialized with the same variable in `abstract_read_executor` is redundant.

(cherry picked from commit 155480595f)
2025-01-14 14:43:52 +01:00
Sergey Zolotukhin
b04b6aad9e storage_proxy/read_repair: Use partition_key instead of token key for mutation
diff calculation hashmap.

This update addresses an issue in the mutation diff calculation algorithm used during read repair.
Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated based on
the Murmur3 hash function, it could generate duplicate values for different partition keys, causing
corruption in the affected rows' values.

Fixes scylladb/scylladb#19101

(cherry picked from commit 39785c6f4e)
2025-01-14 14:37:47 +01:00
Sergey Zolotukhin
12ee41869a test: Add test case for checking read repair diff calculation when having
conflicting keys.

The test updates two rows with keys that result in a Murmur3 hash collision, which
is used to generate Scylla tokens. These tokens are involved in read repair diff
calculations. Due to the identical token values, a hash map key collision occurs.
Consequently, an incorrect value from the second row (with a different primary key)
is then sent for writing as 'repaired', causing data corruption.

(cherry picked from commit e577f1d141)
2025-01-13 22:05:32 +00:00
Kamil Braun
ef93c3a8d7 Merge '[Backport 6.2] cache_algorithm_test: fix flaky failures' from Michał Chojnowski
This series attempts to get rid of flakiness in cache_algorithm_test by solving two problems.

Problem 1:

The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.

Problem 2:

Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in the background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)

Fixes scylladb/scylladb#21536

(cherry picked from commit 1fffd976a4)
(cherry picked from commit 6caaead4ac)

Parent PR: scylladb/scylladb#21948

Closes scylladb/scylladb#22228

* github.com:scylladb/scylladb:
  cache_algorithm_test: harden against stats being confused by background activity
  cache_algorithm_test: fix a use of an uninitialized variable
2025-01-09 14:30:31 +01:00
Aleksandra Martyniuk
a59c4653fe repair: check tasks local to given shard
Currently, task_manager_module::is_aborted runs on the given shard but
checks the tasks local to the caller's shard.

Fix the method to check the task map local to the given shard.

Fixes: #22156.

Closes scylladb/scylladb#22161

(cherry picked from commit a91e03710a)

Closes scylladb/scylladb#22197
2025-01-08 13:07:38 +02:00
Yaron Kaikov
2e87e317d9 .github/scripts/auto-backport.py: Add comment to PR when conflicts apply
When we open a PR with conflicts, the PR owner gets a notification about the assignment but cannot tell whether the PR has conflicts (in Scylla this matters, since CI will not start on a draft PR).

Let's add a comment to notify the user that there are conflicts.

Closes scylladb/scylladb#21939

(cherry picked from commit 2e6755ecca)

Closes scylladb/scylladb#22190
2025-01-08 13:07:08 +02:00
Botond Dénes
f73f7c17ec Merge 'sstables_manager: do not reclaim unlinked sstables' from Lakshmi Narayanan Sreethar
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.

Added a testcase to verify the fix.

Fixes #21887

This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions.

Closes scylladb/scylladb#21895

* github.com:scylladb/scylladb:
  sstables_manager: reclaim memory from sstables on unlink
  sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
  sstables: introduce disable_component_memory_reload()
  sstables_manager: log sstable name when reclaiming components

(cherry picked from commit d4129ddaa6)

Closes scylladb/scylladb#21998
2025-01-08 13:06:24 +02:00
Michał Chojnowski
379f23d854 cache_algorithm_test: harden against stats being confused by background activity
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in the background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)

(cherry picked from commit 6caaead4ac)
2025-01-08 11:48:23 +01:00
Michał Chojnowski
10815d2599 cache_algorithm_test: fix a use of an uninitialized variable
The test needs to create some arbitrary partition keys of a given size.
It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass --
the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used.
But if some background task happens to overwrite the allocator slot during a
preemption, the keys used during "SELECT" will be different than the keys used
during "INSERT", and the test will fail due to extra cache misses.

(cherry picked from commit 1fffd976a4)
2025-01-08 11:48:17 +01:00
Patryk Jędrzejczak
a63a0eac1e [Backport 6.2] raft: improve logs for abort while waiting for apply
New logs allow us to easily distinguish two cases in which
waiting for apply times out:
- the node didn't receive the entry it was waiting for,
- the node received the entry but didn't apply it in time.

Distinguishing these cases simplifies reasoning about failures.
The first case indicates that something went wrong on the leader.
The second case indicates that something went wrong on the node
on which waiting for apply timed out.

As it turns out, many different bugs result in the `read_barrier`
(which calls `wait_for_apply`) timeout. This change should help
us in debugging bugs like these.

We want to backport this change to all supported branches so that
it helps us in all tests.

Fixes scylladb/scylladb#22160

Closes scylladb/scylladb#22159
2025-01-07 17:01:22 +01:00
Kamil Braun
afd588d4c7 Merge '[Backport 6.2] Do not reset quarantine list in non raft mode' from Gleb
The series contains small fixes to the gossiper, one of which fixes #21930. The others I noticed while debugging the issue.

Fixes: #21930

(cherry picked from commit 91cddcc17f)

Parent PR: #21956

Closes scylladb/scylladb#21991

* github.com:scylladb/scylladb:
  gossiper: do not reset _just_removed_endpoints in non raft mode
  gossiper: do not call apply for the node's old state
2025-01-03 16:28:08 +01:00
Abhinav
f5bce45399 Fix gossiper orphan node floating problem by adding a remover fiber
Currently, if a node crashes during startup after initiating gossip but before joining group0,
it keeps floating in the gossiper forever, because the raft-based gossiper purging logic is only effective
once the node joins group0. This orphan node prevents a successor node with the same IP from joining the cluster, since it collides
with it during the gossiper shadow round.

This commit fixes the issue by adding a background thread which periodically checks for such orphan entries in
the gossiper and removes them.

A test is also added to verify this logic. The test fails without this background thread enabled, hence
verifying the behavior.

Fixes: scylladb/scylladb#20082

Closes scylladb/scylladb#21600

(cherry picked from commit 6c90a25014)

Closes scylladb/scylladb#21822
2025-01-02 14:57:46 +01:00
Gleb Natapov
cda997fe59 gossiper: do not reset _just_removed_endpoints in non raft mode
By the time the function is called during start it may already be
populated.

Fixes: scylladb/scylladb#21930
(cherry picked from commit e318dfb83a)
2024-12-25 12:01:16 +02:00
Gleb Natapov
155a0462d5 gossiper: do not call apply for the node's old state
If a node changed its address, its old state may still be in the gossiper,
so ignore it.

(cherry picked from commit e80355d3a1)
2024-12-23 11:47:12 +02:00
Piotr Dulikowski
76b1173546 Merge 'service/topology_coordinator: migrate view builder only if all nodes are up' from Michał Jadwiszczak
The migration process performs a read with consistency level ALL,
which requires all nodes to be alive.

Fixes scylladb/scylladb#20754

The PR should be backported to 6.2, this version has view builder on group0.

Closes scylladb/scylladb#21708

* github.com:scylladb/scylladb:
  test/topology_custom/test_view_build_status: add reproducer
  service/topology_coordinator: migrate view builder only if all nodes are up

(cherry picked from commit def51e252d)

Closes scylladb/scylladb#21850
2024-12-19 14:10:55 +01:00
Piotr Dulikowski
e5a37d63c0 Merge 'transport/server: revert using async function in for_each_gently()' from Michał Jadwiszczak
This patch reverts 324b3c43c0 and adds synchronous versions of `service_level_controller::find_effective_service_level()` and `client_state::maybe_update_per_service_level_params()`.

It isn't safe to do asynchronous calls in `for_each_gently`, as the
connection may be disconnected while a call in callback preempts.

Fixes scylladb/scylladb#21801

Closes scylladb/scylladb#21761

* github.com:scylladb/scylladb:
  Revert "generic_server: use async function in `for_each_gently()`"
  transport/server: use synchronous calls in `for_each_gently` callback
  service/client_state: add synchronous method to update service level params
  qos/service_level_controller: add `find_cached_effective_service_level`

(cherry picked from commit c601f7a359)

Closes scylladb/scylladb#21849
2024-12-19 14:10:31 +01:00
Tomasz Grabiec
0851e3fba7 Merge '[Backport 6.2] utils: cached_file: Mark permit as awaiting on page miss' from ScyllaDB
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutilize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO (relies on https://github.com/scylladb/scylladb/pull/20522).
But there is IO during promoted index search (via cached_file).

Before the patch this workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or partition index IO involved
because that IO will signal read concurrency semaphore to invite more concurrency.

Fixes #21325

(cherry picked from commit 868f5b59c4)

(cherry picked from commit 0f2101b055)

Refs #21323

Closes scylladb/scylladb#21358

* github.com:scylladb/scylladb:
  utils: cached_file: Mark permit as awaiting on page miss
  utils: cached_file: Push resource_unit management down to cached_file
2024-12-16 19:55:00 +01:00
Michael Litvak
99f190f699 service/qos/service_level_controller: update cache on startup
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.

The cache update depends on the service level data accessor being set
and the auth service being initialized. Before the commit, it may happen that a
cache update is not triggered after the initialization. The commit adds
an explicit call to update the cache where it is guaranteed to be ready.

Fixes scylladb/scylladb#21763

Closes scylladb/scylladb#21773

(cherry picked from commit 373855b493)

Closes scylladb/scylladb#21893
2024-12-16 14:19:06 +01:00
Michael Litvak
04e8506cbb service/qos: increase timeout of internal get_service_levels queries
The function get_service_levels is used to retrieve all service levels
and it is called from multiple different contexts.
Importantly, it is called internally from the context of group0 state reload,
where it should be executed with a long timeout, similarly to other
internal queries, because a failure of this function affects the entire
group0 client, and a longer timeout can be tolerated.
The function is also called in the context of the user command LIST
SERVICE LEVELS, and perhaps other contexts, where a shorter timeout is
preferred.

The commit introduces a function parameter to indicate whether the
context is internal or not. For internal context, a long timeout is
chosen for the query. Otherwise, the timeout is shorter, the same as
before. When the distinction is not important, a default value is
chosen which maintains the same behavior.
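
As a rough Python sketch of the parameter described above: the function name, the constant names, and both timeout values are hypothetical (the commit message does not state the actual durations); only the shape of the choice is taken from the text.

```python
from datetime import timedelta

# Hypothetical values, chosen only for illustration.
USER_QUERY_TIMEOUT = timedelta(seconds=10)
INTERNAL_QUERY_TIMEOUT = timedelta(minutes=5)


def service_levels_query_timeout(internal: bool = False) -> timedelta:
    """Pick the query timeout based on the calling context.

    The default (internal=False) keeps the shorter, pre-existing timeout,
    so callers that do not care about the distinction see no change.
    """
    return INTERNAL_QUERY_TIMEOUT if internal else USER_QUERY_TIMEOUT


print(service_levels_query_timeout())               # user context
print(service_levels_query_timeout(internal=True))  # group0 state reload
```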

The main purpose is to fix the case where the timeout is too short and causes
a failure that propagates and fails the group0 client.

Fixes scylladb/scylladb#20483

Closes scylladb/scylladb#21748

(cherry picked from commit 53224d90be)

Closes scylladb/scylladb#21890
2024-12-16 14:15:26 +01:00
Yaron Kaikov
8e606a239f github: check if PR is closed instead of merge
In Scylla, a PR can be either `closed` or `merged`. Based on that state, we decide when to start the backport process if the label was added after the PR was closed (or merged).

In https://github.com/scylladb/scylladb/pull/21876, adding the proper backport label didn't trigger the backport automation. https://github.com/scylladb/scylladb/pull/21809/ caused this; we should have kept `state=closed` (which includes both closed and merged PRs).

Fixing it.

Closes scylladb/scylladb#21906

(cherry picked from commit b4b7617554)

Closes scylladb/scylladb#21922
2024-12-16 14:07:32 +02:00
Jenkins Promoter
8cdff8f52f Update ScyllaDB version to: 6.2.3 2024-12-15 15:55:40 +02:00
Kamil Braun
3ec741dbac Merge 'topology_coordinator: introduce reload_count in topology state and use it to prevent race' from Gleb Natapov
Topology request table may change between the code reading it and
calling cv::when(), since reading is a preemption point. In this
case cv::signal can be missed. Detect that there was no signal in between
reading and waiting by introducing reload_count, which is increased each
time the state is reloaded and signaled. If the counter is different
before and after reading, the state may have changed, so re-check it
instead of sleeping.
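
The counter check can be sketched with Python's `asyncio`, with `asyncio.Condition` standing in for the Seastar condition variable. `TopologyStateSketch`, `read_requests`, and `wait_for_requests` are hypothetical names for illustration, not the actual topology coordinator code.

```python
import asyncio


class TopologyStateSketch:
    def __init__(self):
        self.requests = []
        self.reload_count = 0
        self.changed = asyncio.Condition()

    async def reload(self, requests):
        # Reload the state, bump the counter, and signal waiters.
        self.requests = list(requests)
        self.reload_count += 1
        async with self.changed:
            self.changed.notify_all()


async def read_requests(state):
    snapshot = list(state.requests)
    await asyncio.sleep(0)  # preemption point: a reload may run here
    return snapshot


async def wait_for_requests(state):
    while True:
        count_before = state.reload_count
        requests = await read_requests(state)
        if requests:
            return requests
        if state.reload_count != count_before:
            continue  # reloaded while reading: re-check instead of sleeping
        async with state.changed:
            await state.changed.wait()


async def demo():
    state = TopologyStateSketch()
    waiter = asyncio.create_task(wait_for_requests(state))
    reloader = asyncio.create_task(state.reload(["join"]))
    result = await asyncio.wait_for(waiter, timeout=1)
    await reloader
    return result


result = asyncio.run(demo())
print(result)
```

In this sketch the reload lands exactly while the waiter is reading, so nobody is waiting when the signal fires. Without the `reload_count` comparison the waiter would go to sleep on a signal that has already been sent.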

Closes scylladb/scylladb#21713

* github.com:scylladb/scylladb:
  topology_coordinator: introduce reload_count in topology state and use it to prevent race
  storage_service: use conditional_variable::when in co-routines consistently

(cherry picked from commit 8f858325b6)

Closes scylladb/scylladb#21803
2024-12-12 15:45:31 +01:00
Anna Stuchlik
e3e7ac16e9 doc: remove wrong image upgrade info (5.2-to-2023.1)
This commit removes the information about the recommended way of upgrading
ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade
procedure is not supported (it was implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

Closes scylladb/scylladb#21876
Fixes https://github.com/scylladb/scylla-enterprise/issues/5041
Fixes https://github.com/scylladb/scylladb/issues/21898

(cherry picked from commit 98860905d8)
2024-12-12 15:22:26 +02:00
Tomasz Grabiec
81d6d88016 utils: cached_file: Mark permit as awaiting on page miss
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutilize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO. But there is IO during
promoted index search (via cached_file). Before the patch this
workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or index
file IO involved because that IO will signal read concurrency
semaphore to invite more concurrency.

(cherry picked from commit 0f2101b055)
2024-12-09 23:18:00 +01:00
Tomasz Grabiec
56f93dd434 utils: cached_file: Push resource_unit management down to cached_file
It saves us permit operations on the hot path when we hit in cache.

Also, it will lay the ground for marking the permit as awaiting later.

(cherry picked from commit 868f5b59c4)
2024-12-09 23:17:56 +01:00
Kefu Chai
28a32f9c50 github: do not nest ${{}} inside condition
In commit 2596d157, we added a condition to run auto-backport.py only
when the GitHub Action is triggered by a push to the default branch.
However, this introduced an unexpected error due to incorrect condition
handling.

Problem:
- `github.event.before` evaluates to an empty string
- GitHub Actions' single-pass expression evaluation system causes
  the step to always execute, regardless of `github.event_name`

Despite GitHub's documentation suggesting that ${{ }} can be omitted,
it recommends using explicit ${{}} expressions for compound conditions.

Changes:
- Use explicit ${{}} expression for compound conditions
- Avoid string interpolation in conditional statements

Root Cause:
The previous implementation failed because of how GitHub Actions
evaluates conditional expressions, leading to an unintended script
execution and a 404 error when attempting to compare commits.

Example Error:

```
  python .github/scripts/auto-backport.py --repo scylladb/scylladb --base-branch refs/heads/master --commits ..2b07d93beac7bc83d955dadc20ccc307f13f20b6
  shell: /usr/bin/bash -e {0}
  env:
    DEFAULT_BRANCH: master
    GITHUB_TOKEN: ***
Traceback (most recent call last):
  File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 201, in <module>
    main()
  File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 162, in main
    commits = repo.compare(start_commit, end_commit).commits
  File "/usr/lib/python3/dist-packages/github/Repository.py", line 888, in compare
    headers, data = self._requester.requestJsonAndCheck(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 353, in requestJsonAndCheck
    return self.__check(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 378, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/commits/commits#compare-two-commits", "status": "404"}
```
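
The `--commits` argument in the log above starts with `..`, which is what an empty `github.event.before` produces. As a minimal sketch (the helper `parse_commit_range` is hypothetical and is not part of auto-backport.py), such a range could be rejected before any API call is made:

```python
def parse_commit_range(commit_range: str) -> tuple[str, str]:
    """Split 'start..end' and reject a range with a missing endpoint."""
    start, sep, end = commit_range.partition("..")
    if not sep or not start or not end:
        raise ValueError(f"incomplete commit range: {commit_range!r}")
    return start, end
```

With this check, a range such as `..2b07d93beac7bc83d955dadc20ccc307f13f20b6` raises `ValueError` locally instead of surfacing later as a 404 from the compare endpoint.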

Fixes scylladb/scylladb#21808
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21809

(cherry picked from commit e04aca7efe)

Closes scylladb/scylladb#21820
2024-12-06 16:34:59 +02:00
Avi Kivity
20b2d5b7c9 Merge 'compaction: update maintenance sstable set on scrub compaction completion' from Lakshmi Narayanan Sreethar
Scrub compaction can pick up input sstables from the maintenance sstable set,
but on compaction completion it doesn't update the maintenance set,
leaving the original sstable in the set after it has been scrubbed. To fix
this, compaction completion has to update the maintenance sstable set if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.

Fixes #20030

This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.

Closes scylladb/scylladb#21582

* github.com:scylladb/scylladb:
  compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
  compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
  compaction_group: rename `update_main_sstable_list_on_compaction_completion`
  compaction_group: update maintenance sstable set on scrub compaction completion
  compaction_group: store table::sstable_list_builder::result in replacement_desc
  table::sstable_list_builder: remove old sstables only from current list
  table::sstable_list_builder: return removed sstables from build_new_list

(cherry picked from commit 58baeac0ad)

Closes scylladb/scylladb#21790
2024-12-06 10:36:46 +02:00
Michael Pedersen
f37deb7e98 docs: correct the storage size for n2-highmem-32 to 9000GB
Updated the storage size for n2-highmem-32 to 9000GB, as this is the default in SC.

Fixes scylladb/scylladb#21785
Closes scylladb/scylladb#21537

(cherry picked from commit 309f1606ae)

Closes scylladb/scylladb#21595
2024-12-05 09:51:11 +02:00
Tomasz Grabiec
933ec7c6ab utils: UUID: Make get_time_UUID() respect the clock offset
schema_change_test currently fails due to failure to start a cql test
env in unit tests after the point where this is called (in one of the
test cases):

   forward_jump_clocks(std::chrono::seconds(60*60*24*31));

The problem manifests with a failure to join the cluster due to
missing_column exception ("missing_column: done") being thrown from
system_keyspace::get_topology_request_state(). It's a symptom of
join request being missing in system.topology_requests. It's missing
because the row is expired.

When request is created, we insert the
mutations with intended TTL of 1 month. The actual TTL value is
computed like this:

  ttl_opt topology_request_tracking_mutation_builder::ttl() const {
      return std::chrono::duration_cast<std::chrono::seconds>(std::chrono::microseconds(_ts)) + std::chrono::months(1)
          - std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch());
  }

_ts comes from the request_id, which is supposed to be a timeuuid set
from current time when request starts. It's set using
utils::UUID_gen::get_time_UUID(). It reads the system clock without
adding the clock offset, so after forward_jump_clocks(), _ts and
gc_clock::now() may be far off. In some cases the accumulated offset
is larger than 1month and the ttl becomes negative, causing the
request row to expire immediately and failing the boot sequence.

The fix is to use db_clock, which respects offsets and is consistent
with gc_clock.
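
The arithmetic can be reproduced with a small sketch. It assumes an arbitrary wall-clock value and models only the quoted ttl() expression in whole seconds; `ttl_seconds` is a hypothetical stand-in, not ScyllaDB code. std::chrono::months(1) is 2,629,746 seconds, which is less than the 31-day jump of 2,678,400 seconds:

```python
# Seconds in std::chrono::months(1): 1/12 of the average Gregorian year.
MONTH_SECONDS = 2_629_746
# The jump applied by forward_jump_clocks(std::chrono::seconds(60*60*24*31)).
JUMP_SECONDS = 60 * 60 * 24 * 31


def ttl_seconds(ts_micros: int, gc_now_seconds: int) -> int:
    """Mirror of the ttl() arithmetic quoted above, in whole seconds."""
    return ts_micros // 1_000_000 + MONTH_SECONDS - gc_now_seconds


system_now = 1_700_000_000          # arbitrary wall-clock time (assumption)
gc_now = system_now + JUMP_SECONDS  # gc_clock includes the clock offset

# Before the fix: _ts is taken from the system clock, without the offset.
ttl_before = ttl_seconds(system_now * 1_000_000, gc_now)
# After the fix: _ts is taken from a clock that includes the offset.
ttl_after = ttl_seconds(gc_now * 1_000_000, gc_now)

print(ttl_before, ttl_after)  # -48654 2629746
```

Under these assumptions a single 31-day jump is already enough to make the TTL negative when _ts ignores the offset.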

The test doesn't fail in CI because there each test case runs in a
separate process, so there is no bootstrap attempt (by new cql test
env) after forward_jump_clocks().

Closes scylladb/scylladb#21558

(cherry picked from commit 1d0c6aa26f)

Closes scylladb/scylladb#21584

Fixes #21581
2024-12-04 14:18:16 +01:00
Kefu Chai
2b5cd10b66 docs: explain task status retention and one-time query behavior
Task status information from nodetool commands is not retained permanently:

- Status of completed tasks is only kept for `task_ttl_in_seconds`
- Status is removed after being queried, making it a one-time operation

This behavior is important for users to understand since subsequent
queries for the same completed task will not return any information.
Add documentation to make this clear to users.
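
The two rules can be illustrated with a toy in-memory model. This is only a sketch of the documented behavior, not how the task manager is implemented; `TaskStatusStore` and its methods are hypothetical names:

```python
import time


class TaskStatusStore:
    """Toy model: a finished task's status expires after a TTL and is dropped once read."""

    def __init__(self, task_ttl_in_seconds: float, clock=time.monotonic):
        self._ttl = task_ttl_in_seconds
        self._clock = clock
        self._finished = {}  # task_id -> (status, finish_time)

    def finish(self, task_id: str, status: str) -> None:
        self._finished[task_id] = (status, self._clock())

    def query(self, task_id: str):
        entry = self._finished.pop(task_id, None)  # reading removes the entry
        if entry is None:
            return None
        status, finished_at = entry
        if self._clock() - finished_at > self._ttl:
            return None  # retention period already elapsed
        return status
```

In this model the first `query()` of a finished task returns its status and every later `query()` for the same task returns `None`, as does a first query made after the retention period.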

Fixes scylladb/scylladb#21757
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21386

(cherry picked from commit afeff0a792)

Closes scylladb/scylladb#21759
2024-12-04 13:49:24 +02:00
Kefu Chai
bf47de9f7f test: topology_custom: ensure node visibility before keyspace creation
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.

Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency

Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions

Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios
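
The difference between the two approaches can be sketched as a predicate over each node's view of the cluster. The function name and the `seen_by` mapping are hypothetical and do not correspond to the test framework's API:

```python
def all_nodes_see_each_other(seen_by: dict[str, set[str]]) -> bool:
    """seen_by maps each node to the set of peers it currently reports as visible."""
    nodes = set(seen_by)
    # Every node must see every other node, not just the first one.
    return all(nodes - {node} <= peers for node, peers in seen_by.items())
```

A check limited to the first node would pass as soon as that node sees its peers, even if another node still has a partial view.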

Fixes scylladb/scylladb#21724

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21726

(cherry picked from commit 65949ce607)

Closes scylladb/scylladb#21734
2024-12-04 13:46:26 +02:00
Nadav Har'El
dc71a6a75e Merge 'test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime' from Dawid Mędrek
Before these changes, we didn't wait for the materialized views to
finish building before writing to the base table. That led to generating
an additional view update, which, in turn, led to test failures.

The scenario corresponding to the summary above looked like this:

1. The test creates an empty table and MVs on it.
2. The view builder starts, but it doesn't finish immediately.
3. The test performs mutations to the base table. Since the views
   already exist, view updates are generated.
4. Finally, the view builder finishes. It notices that the base
   table has a row, so it generates a view update for it because
   it doesn't notice that we already have data in the view.

We solve it by explicitly waiting for both views to finish building
and only then start writing to the base table.

Additionally, we also fix a lifetime issue of the row the test revolves
around, further stabilizing CI.

Fixes https://github.com/scylladb/scylladb/issues/20889

Backport: These changes have no semantic effect on the codebase,
but they stabilize CI, so we want to backport them to the maintained
versions of Scylla.

Closes scylladb/scylladb#21632

* github.com:scylladb/scylladb:
  test/boost/view_schema_test.cc: Increase TTL in test_view_update_generating_writetime
  test/boost/view_schema_test.cc: Wait for views to build in test_view_update_generating_writetime

(cherry picked from commit 733a4f94c7)

Closes scylladb/scylladb#21640
2024-12-04 13:44:33 +02:00
Aleksandra Martyniuk
f13f821b31 repair: implement tablet_repair_task_impl::release_resources
tablet_repair_task_impl keeps a vector of tablet_repair_task_meta,
each of which keeps an effective_replication_map_ptr. So, after
the task completes, the token metadata version will not change for
task_ttl seconds.

Implement tablet_repair_task_impl::release_resources method that clears
tablet_repair_task_meta vector when the task finishes.

Set task_ttl to 1h in test_tablet_repair to check whether the test
won't time out.

Fixes: #21503.

Closes scylladb/scylladb#21504

(cherry picked from commit 572b005774)

Closes scylladb/scylladb#21622
2024-12-04 13:43:40 +02:00
André LFA
74ad6f2fa3 Update report-scylla-problem.rst removing references to old Health Check Report
Closes scylladb/scylladb#21467

(cherry picked from commit 703e6f3b1f)

Closes scylladb/scylladb#21591
2024-12-04 13:41:00 +02:00
Abhinav
fc42571591 test: Parametrize 'replacement with inter-dc encryption' test to confirm behavior in zero token node cases.
In the current scenario, 'test_replace_with_encryption' only confirms the replacement with inter-dc encryption
for normal nodes. This commit increases the coverage of test by parametrizing the test to confirm behavior
for zero token node replacement as well. This test also implicitly provides
coverage for bootstrap with encryption of zero token nodes.

This PR increases coverage for existing code, so we need to backport it. Since only version 6.2 has zero
token node support, we only backport it to 6.2.

Fixes: scylladb/scylladb#21096

Closes scylladb/scylladb#21609

(cherry picked from commit acd643bd75)

Closes scylladb/scylladb#21764
2024-12-04 11:22:39 +01:00
Botond Dénes
c6ef055e9c Merge 'repair: fix task_manager_module::abort_all_repairs' from Aleksandra Martyniuk
Currently, task_manager_module::abort_all_repairs marks top-level repairs as aborted (but does not abort them) and aborts all existing shard tasks.

A running repair checks whether its id isn't contained in _aborted_pending_repairs and then proceeds to create shard tasks. If abort_all_repairs is executed after _aborted_pending_repairs is checked but before shard tasks are created, then those new tasks won't be aborted. The issue is most severe for tablet_repair_task_impl, which checks the _aborted_pending_repairs content from different shards that do not see the top-level task. Hence the repair isn't stopped, and it creates shard repair tasks on all shards except the one that initiated the repair.

Abort top-level tasks in abort_all_repairs. Fix the shard on which the task abort is checked.

Fixes: #21612.

Needs backport to 6.1 and 6.2 as they contain the bug.

Closes scylladb/scylladb#21616

* github.com:scylladb/scylladb:
  test: add test to check if repair is properly aborted
  repair: add shard param to task_manager_module::is_aborted
  repair: use task abort source to abort repair
  repair: drop _aborted_pending_repairs and utilize tasks abort mechanism
  repair: fix task_manager_module::abort_all_repairs

(cherry picked from commit 5ccbd500e0)

Closes scylladb/scylladb#21642
2024-11-21 06:33:31 +02:00
Nadav Har'El
6ba0253dd3 alternator: fix "/localnodes" to not return down nodes
Alternator's "/localnodes" HTTP request is supposed to return the list
of nodes in the local DC to which the user can send requests.

Before commit bac7c33313 we used the
gossiper is_alive() method to determine if a node should be returned.
That commit changed the check to is_normal() - because a node can be
alive but in non-normal (e.g., joining) state and not ready for
requests.

However, it turns out that checking is_normal() is not enough, because
if a node is stopped abruptly, other nodes will still consider it "normal",
but down (this is so-called "DN" state). So we need to check **both**
is_alive() and is_normal().
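
As a minimal sketch of the combined check (the `Node` type and `local_nodes` function are hypothetical; the two boolean fields merely stand in for the gossiper's is_alive() and is_normal()):

```python
from dataclasses import dataclass


@dataclass
class Node:
    address: str
    alive: bool   # stand-in for the gossiper's is_alive()
    normal: bool  # stand-in for the gossiper's is_normal()


def local_nodes(nodes: list[Node]) -> list[str]:
    """Return only nodes that are both alive and in the normal state."""
    return [n.address for n in nodes if n.alive and n.normal]


nodes = [
    Node("10.0.0.1", alive=True, normal=True),    # up and ready for requests
    Node("10.0.0.2", alive=True, normal=False),   # alive but still joining
    Node("10.0.0.3", alive=False, normal=True),   # stopped abruptly ("DN")
]
print(local_nodes(nodes))
```

Checking only `alive` would return the joining node, and checking only `normal` would return the node that was stopped abruptly.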

This patch also adds a test reproducing this case, where a node is
shut down abruptly. Before this patch, the test failed ("/localnodes"
continued to return the dead node), and after it it passes.

Fixes #21538

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21540

(cherry picked from commit 7607f5e33e)

Closes scylladb/scylladb#21634
2024-11-20 09:22:59 +02:00
Anna Stuchlik
d9eb502841 doc: add the 6.0-to-2024.2 upgrade guide-from-6
This commit adds an upgrade guide from ScyllaDB 6.0
to ScyllaDB Enterprise 2024.2.

Fixes https://github.com/scylladb/scylladb/issues/20063
Fixes https://github.com/scylladb/scylladb/issues/20062
Refs https://github.com/scylladb/scylla-enterprise/issues/4544

(cherry picked from commit 3d4b7e41ef)

Closes scylladb/scylladb#21620
2024-11-18 17:28:44 +02:00
Emil Maskovsky
0c7c6f85e0 test/topology_custom: fix the flaky test_raft_recovery_stuck
The test is only sending a subset of the running servers for the rolling
restart. The rolling restart is checking the visibility of the restarted
node against the other nodes, but if that set is incomplete some of the
running servers might not have seen the restarted node yet.

Improved the manager client rolling restart method to consider all the
running nodes for checking the restarted node visibility.

Fixes: scylladb/scylladb#19959

Closes scylladb/scylladb#21477

(cherry picked from commit 92db2eca0b)

Closes scylladb/scylladb#21556
2024-11-15 10:37:20 +02:00
Kefu Chai
2480decbc7 doc: import the new pub keys used to sign the package
before this change, when users follow the instructions, they'd get

```console
$ sudo apt-get update
Hit:1 http://us-east-1.ec2.archive.ubuntu.com/ubuntu noble InRelease
Hit:2 http://us-east-1.ec2.archive.ubuntu.com/ubuntu noble-updates InRelease
Hit:3 http://us-east-1.ec2.archive.ubuntu.com/ubuntu noble-backports InRelease
Hit:4 http://security.ubuntu.com/ubuntu noble-security InRelease
Get:5 https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease [7550 B]
Err:5 https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease
 The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A43E06657BAC99E3
Reading package lists... Done
W: GPG error: https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A43E06657BAC99E3
E: The repository 'https://downloads.scylladb.com/downloads/scylla/deb/debian-ubuntu/scylladb-6.2 stable InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
```

because the packages were signed with a different keyring.

in this change, we import the new pubkey, so that the package manager
can verify the new packages (2024.2+ and 6.2+) signed with the new key.

see also https://github.com/scylladb/scylla-ansible-roles/issues/399
and https://forum.scylladb.com/t/release-scylla-manager-3-3-1/2516
for the announcement on using the new key.

Fixes scylladb/scylladb#21557
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21524

(cherry picked from commit 1cedc45c35)

Closes scylladb/scylladb#21588
2024-11-15 10:36:44 +02:00
Botond Dénes
687a18db38 Merge 'scylla_raid_setup: fix failure on SELinux package installation' from Takuya ASADA
After merging 5a470b2bfb, we found that scylla_raid_setup fails on offline mode
installation.
This is because pkg_install() just prints an error and exits the script in offline mode instead of installing packages, since offline mode is not supposed to be able to connect to the
internet.
The failure seems to occur because of the missing "policycoreutils-python-utils"
package, which provides the "semanage" command.
So we need to implement the relabeling patch without using the command.

Fixes https://github.com/scylladb/scylladb/issues/21441

Also, since Amazon Linux 2 has a different package name for semanage, we need to
adjust the package name.

Fixes https://github.com/scylladb/scylladb/issues/21351

Closes scylladb/scylladb#21474

* github.com:scylladb/scylladb:
  scylla_raid_setup: support installing semanage on Amazon Linux 2
  scylla_raid_setup: fix failure on SELinux package installation

(cherry picked from commit 1c212df62d)

Closes scylladb/scylladb#21547
2024-11-14 15:51:06 +02:00
Botond Dénes
548170fb68 Merge '[Backport 6.2] compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors' from ScyllaDB
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them but continue with shutdown.

stop_ongoing_compactions, in particular, currently returns the status
of stopped compaction tasks from `stop_tasks`, but still all tasks
must be stopped after it, even if they failed, so assert that
and ignore the errors.

Fixes scylladb/scylladb#21159

* Needs backport to 6.2 and 6.1, as commit 8cc99973eb causes handles storage that might cause compaction tasks to fail and eventually terminate on shutdown when the exceptions are thrown in noexcept context in the deferred stop destructor body

(cherry picked from commit e942c074f2)

(cherry picked from commit d8500472b3)

(cherry picked from commit c08ba8af68)

(cherry picked from commit a7a55298ea)

(cherry picked from commit 6cce67bec8)

Refs #21299

Closes scylladb/scylladb#21434

* github.com:scylladb/scylladb:
  compaction_manager: stop: await _stop_future if engaged
  compaction_manager: really_do_stop:  assert that no tasks are left behind
  compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
  compaction/compaction_manager: stop_tasks(): unlink stopped tasks
  compaction/compaction_manager: make _tasks an intrusive list
2024-11-14 06:59:52 +02:00
Jenkins Promoter
75b79a30da Update ScyllaDB version to: 6.2.2 2024-11-13 23:22:52 +02:00
Benny Halevy
bdf31d7f54 compaction_manager: stop: await _stop_future if engaged
The current condition that consults the compaction manager
state for awaiting `_stop_future` works since _stop_future
is assigned after the state is set to `stopped`, but it is
incidental.  What matters is that `_stop_future` is engaged.

While at it, exchange _stop_future with a ready future
so that stop() can be safely called multiple times.
And dropped the superfluous co_return.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6cce67bec8)
2024-11-13 10:00:47 +02:00
Benny Halevy
3d915cd091 compaction_manager: really_do_stop: assert that no tasks are left behind
stop_ongoing_compactions now ignores any errors returned
by tasks, and it should leave no task left behind.
Assert that here, before the compaction_manager is destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a7a55298ea)
2024-11-13 09:59:57 +02:00
Benny Halevy
abb26ff913 compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them but continue with shutdown.

Leaked errors on the stop path may cause termination
on shutdown, when called in a deferred action destructor.
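
The pattern can be sketched as follows. This is a generic illustration with hypothetical names, not the compaction_manager code: every task is stopped, failures are logged rather than propagated, and the function asserts that nothing is left behind.

```python
import logging

log = logging.getLogger("stop_sketch")


def stop_tasks(tasks: list) -> None:
    """Stop every task; log failures instead of propagating them."""
    while tasks:
        task = tasks.pop()
        try:
            task.stop()
        except Exception as exc:
            # Nothing useful can be done with the error on the stop path.
            log.debug("ignoring error while stopping task: %s", exc)
    assert not tasks, "no task may be left behind after stop"
```

Because no exception escapes `stop_tasks`, it is safe to call from cleanup code that must not fail.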

Fixes scylladb/scylladb#21298

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c08ba8af68)
2024-11-13 09:56:42 +02:00
Botond Dénes
3f821b7f4f compaction/compaction_manager: stop_tasks(): unlink stopped tasks
Stopped tasks currently linger in _tasks until the fiber that created
the task is scheduled again and unlinks the task. This window between
stop and remove prevents reliable checks for empty _tasks list after all
tasks are stopped.
Unlink the task early so really_do_stop() can safely check for an empty
_tasks list (next patch).

(cherry picked from commit d8500472b3)
2024-11-13 09:56:21 +02:00
Botond Dénes
cab3b86240 compaction/compaction_manager: make _tasks an intrusive list
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.

Code using _tasks has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.
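
The O(1) removal property can be sketched with a small intrusive doubly linked list, where the links live inside the element itself. The classes below are hypothetical and only illustrate the data structure; they are not the list type used by the patch, and Python cannot show the saved allocation:

```python
class IntrusiveListHook:
    """Base class: the list links live inside the element itself."""

    def __init__(self):
        self._prev = None
        self._next = None

    def is_linked(self) -> bool:
        return self._prev is not None

    def unlink(self) -> None:
        """O(1): the element knows its neighbours, so no search is needed."""
        if self._prev is None:
            return
        self._prev._next = self._next
        self._next._prev = self._prev
        self._prev = self._next = None


class IntrusiveList:
    def __init__(self):
        self._head = IntrusiveListHook()  # sentinel node
        self._head._prev = self._head._next = self._head

    def push_back(self, item: IntrusiveListHook) -> None:
        last = self._head._prev
        last._next = item
        item._prev = last
        item._next = self._head
        self._head._prev = item

    def __iter__(self):
        node = self._head._next
        while node is not self._head:
            yield node
            node = node._next

    def empty(self) -> bool:
        return self._head._next is self._head


class Task(IntrusiveListHook):
    def __init__(self, name: str):
        super().__init__()
        self.name = name
```

A task can remove itself with `task.unlink()` without a reference to the list, and iterating the list yields the task objects directly rather than pointers to them.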

(cherry picked from commit e942c074f2)
2024-11-13 09:48:00 +02:00
Piotr Dulikowski
2fa4f3a9fc Merge 'main,cql_test_env: start group0_service before view_builder' from Michał Jadwiszczak
In scylladb/scylladb#19745, view_builder was migrated to group0 and since then it is dependent on group0_service.
Because of this, group0_service should be initialized/destroyed before/after view_builder.

This patch also adds error injection to `raft_server_with_timeouts::read_barrier`, which does 1s sleep before doing the read barrier. There is a new test which reproduces the use after free bug using the error injection.

Fixes scylladb/scylladb#20772

scylladb/scylladb#19745 is present in 6.2, so this fix should be backported to it.

Closes scylladb/scylladb#21471

* github.com:scylladb/scylladb:
  test/boost/secondary_index_test: add test for use after free
  api/raft: use `get_server_with_timeouts().read_barrier()` in coroutines
  main,cql_test_env: start group0_service before view_builder

(cherry picked from commit 7021efd6b0)

Closes scylladb/scylladb#21506
2024-11-12 14:36:06 +01:00
Yaron Kaikov
a3e69cc8fb ./github/workflows/add-label-when-promoted.yaml: Run auto-backport only on default branch
In https://github.com/scylladb/scylladb/pull/21496#event-15221789614
```
scylladbbot force-pushed the backport/21459/to-6.1 branch from 414691c to 59a4ccd Compare 2 days ago
```

Backport automation is triggered by `push`, but it should only start from the `master` branch (or the `enterprise` branch in Enterprise), so we need to verify this by also checking the default branch.

Fixes: https://github.com/scylladb/scylladb/issues/21514

Closes scylladb/scylladb#21515

(cherry picked from commit 2596d1577b)

Closes scylladb/scylladb#21531
2024-11-11 17:43:54 +02:00
Michał Chojnowski
876017efee mvcc_test: fix a benign failure of test_apply_to_incomplete_respects_continuity
For performance reasons, mutation_partition_v2::maybe_drop(), and by extension
also mutation_partition_v2::apply_monotonically(mutation_partition_v2&&)
can evict empty row entries, and hence change the continuity of the merged
entry.

For checking that apply_to_incomplete respects continuity,
test_apply_to_incomplete_respects_continuity obtains the continuity of
the partition entry before and after apply_to_incomplete by calling
e.squashed().get_continuity(). But squashed() uses apply_monotonically(),
so in some circumstances the result of squashed() can have smaller
continuity than the argument of squashed(), which messes with the thing
that the test is trying to check, and causes spurious failures.

This patch changes the method of calculating the continuity set,
so that it matches the entry exactly, fixing the test failures.

Fixes scylladb/scylladb#13757

Closes scylladb/scylladb#21459

(cherry picked from commit 35921eb67e)

Closes scylladb/scylladb#21497
2024-11-08 15:32:24 +01:00
Yaron Kaikov
9eed1d1cbd .github/scripts/auto-backport.py: update method to get closed prs
`commit.get_pulls()` in PyGithub returns pull requests that are directly associated with the given commit

Since in a closed PR the relevant commit is an event type, the backport
automation didn't get the PR info for backporting.

Ref: https://github.com/scylladb/scylladb/issues/18973

Closes scylladb/scylladb#21468

(cherry picked from commit ef104b7b96)

Closes scylladb/scylladb#21483
2024-11-08 10:26:10 +02:00
Yaron Kaikov
d33538bdd4 .github/script/auto-backport.py: push backport PR to scylladbbot fork
Since Scylla is a public repo, when we create a fork, it doesn't fork the team and permissions (unlike private repos where it does).

When we have a backport PR with conflicts, the developers need to be able to update the branch to fix the conflicts. To do so, we modified the logic of the backport automation as follows:

- Every backport PR (with and without conflicts) will be open directly on the `scylladbbot` fork repo
- When there are conflicts, an email will be sent to the original PR author with an invitation to become a contributor in the `scylladbbot` fork with `push` permissions. This will happen only once, if the author is not already a contributor.
- Together with sending the invite, all backport labels will be removed and a comment will be added to the original PR with instructions
- The PR author must add the backport labels after the invitation is accepted

Fixes: https://github.com/scylladb/scylladb/issues/18973

Closes scylladb/scylladb#21401

(cherry picked from commit 77604b4ac7)

Closes scylladb/scylladb#21466
2024-11-07 12:38:56 +02:00
Yaron Kaikov
073c9cbaa1 github: add script for backports automation instead of Mergify
Adding an auto-backport.py script to handle backport automation instead of Mergify.

The rules of backport are as follows:

* Merged or Closed PRs with any backport/x.y label (one or more) and promoted-to-master label
* Backport PR will be automatically assigned to the original PR author
* In case of conflicts the backport PR will be opened in the original author's fork in draft mode. This will give the PR owner the option to resolve conflicts and push those changes to the PR branch (Today in Scylla when we have conflicts, the developers are forced to open another PR and manually close the backport PR opened by Mergify)
* Fixes cherry-picking of the wrong commit SHA. With the new script, we always take the SHA from the stable branch
* Support backport for enterprise releases (from Enterprise branch)

Fixes: https://github.com/scylladb/scylladb/issues/18973
(cherry picked from commit f9e171c7af)

Closes scylladb/scylladb#21469
2024-11-07 06:57:05 +02:00
Tomasz Grabiec
a3a0ffbcd0 Merge 'tablet: Fix single-sstable split when attaching new unsplit sstables' from Raphael "Raph" Carvalho
To fix a race between split and repair here c1de4859d8, a new sstable
  generated during streaming can be split before being attached to the sstable
  set. That's to prevent an unsplit sstable from reaching the set after the
  tablet map is resized.

  So we can think this split is an extension of the sstable writer. A failure
  during split means the new sstable won't be added. Also, the duration of split
  is also adding to the time erm is held. For example, repair writer will only
  release its erm once the split sstable is added into the set.

  This single-sstable split is going through run_custom_job(), which serializes
  with other maintenance tasks. That was a terrible decision, since the split may
  have to wait for ongoing maintenance task to finish, which means holding erm
  for longer. Additionally, if split monitor decides to run split on the entire
  compaction group, it can cause single-sstable split to be aborted since the
  former wants to select all sstables, propagating a failure to the streaming
  writer.
  That results in new sstable being leaked and may cause problems on restart,
  since the underlying tablet may have moved elsewhere or multiple splits may
  have happened. We have some fragility today in cleaning up leaked sstables on
  streaming failure, but this single-sstable split made it worse since the
  failure can happen during normal operation, when there's e.g. no I/O error.

  It makes sense to kill run_custom_job() usage, since the single-sstable split
  is offline and an extension of sstable writing, therefore it makes no sense to
  serialize with maintenance tasks. It must also inherit the sched group of the
  process writing the new sstable. The inheritance happens today, but is fragile.

  Fixes #20626.

Closes scylladb/scylladb#20737

* github.com:scylladb/scylladb:
  tablet: Fix single-sstable split when attaching new unsplit sstables
  replica: Fix tablet split execute after restart

(cherry picked from commit bca8258150)

Ref scylladb/scylladb#21415
2024-11-06 15:01:35 +02:00
Botond Dénes
8bf76d6be7 Merge '[Backport 6.2] replica: Fix tombstone GC during tablet split preparation' from Raphael Raph Carvalho
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:

1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split

tombstone GC logic today only looks at underlying group, so compaction in step
2 will not consider the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes https://github.com/scylladb/scylladb/issues/20044.

Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.

(cherry picked from commit bcd358595f)

(cherry picked from commit 93815e0649)

Refs https://github.com/scylladb/scylladb/pull/20939

Closes scylladb/scylladb#21206

* github.com:scylladb/scylladb:
  replica: Fix tombstone GC during tablet split preparation
  service: Improve error handling for split
2024-11-06 09:55:47 +02:00
Raphael S. Carvalho
1e51ed88c6 replica: Fix tombstone GC during tablet split preparation
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split

tombstone GC logic today only looks at underlying group, so compaction in step
2 will not consider the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.
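
The scenario can be sketched with a heavily simplified purge check. It assumes a tombstone may be dropped only when no other sstable can hold older data, and it ignores key overlap and the grace period; `SSTable` and `can_purge` are hypothetical names, not ScyllaDB code:

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    name: str
    group: str          # "unsplit" or "split-ready" in this sketch
    min_timestamp: int  # oldest write timestamp stored in the sstable


def can_purge(tombstone_ts: int, others: list[SSTable]) -> bool:
    """A tombstone may be dropped only if no other sstable can hold older data."""
    return all(s.min_timestamp > tombstone_ts for s in others)


a = SSTable("A", group="unsplit", min_timestamp=10)  # holds the deleted data
tombstone_ts = 20                                    # tombstone carried by sstable B

# Old behavior: only sstables of the compacting (split-ready) group are consulted.
same_group_only = [s for s in [a] if s.group == "split-ready"]
# Fixed behavior: all uncompacting sstables of the replica are consulted.
whole_replica = [a]

print(can_purge(tombstone_ts, same_group_only))  # True: purged incorrectly
print(can_purge(tombstone_ts, whole_replica))    # False: tombstone is kept
```

With the group-limited view, sstable A is invisible to the check, so the tombstone is dropped while the data it shadows still exists.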

Fixes #20044.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 93815e0649)
2024-11-04 14:24:18 -03:00
Raphael S. Carvalho
ca5f938ed4 service: Improve error handling for split
Retry wasn't really happening since the loop was broken and sleep
part was skipped on error. Also, we were treating abort of split
during shutdown as if it were an actual error and that confused
longevity tests that parse for logs with error level. The fix is
about demoting the level of logs when we know the exception comes
from shutdown.
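
A retry loop with both properties might look like the sketch below. It is a generic illustration under assumptions: `ShutdownRequested`, `run_with_retry`, and the bounded attempt count are hypothetical and not taken from the patch.

```python
import logging
import time

log = logging.getLogger("split_sketch")


class ShutdownRequested(Exception):
    """Hypothetical marker for an abort caused by shutdown."""


def run_with_retry(operation, retry_delay: float, max_attempts: int,
                   sleep=time.sleep) -> bool:
    for attempt in range(1, max_attempts + 1):
        try:
            operation()
            return True
        except ShutdownRequested as exc:
            # Expected during shutdown, so it is logged below error level.
            log.debug("split aborted by shutdown: %s", exc)
            return False
        except Exception as exc:
            log.error("split failed (attempt %d): %s", attempt, exc)
            sleep(retry_delay)  # sleep on the error path, then loop again
    return False
```

The sleep sits inside the error branch and the loop continues afterwards, so a failure leads to a delayed retry rather than an early exit.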

Fixes #20890.

(cherry picked from commit bcd358595f)
2024-11-04 14:22:08 -03:00
Botond Dénes
fb20ea7de1 Merge '[Backport 6.2] tasks: fix virtual tasks children' from ScyllaDB
Fix how regular tasks that have a virtual parent are created
in task_manager::module::make_task: set sequence number
of a task and subscribe to module's abort source.

Fixes: #21278.

Needs backport to 6.2

(cherry picked from commit 1eb47b0bbf)

(cherry picked from commit 910a6fc032)

Refs #21280

Closes scylladb/scylladb#21332

* github.com:scylladb/scylladb:
  tasks: fix sequence number assignment
  tasks: fix abort source subscription of virtual task's child
2024-11-04 18:18:35 +02:00
Tzach Livyatan
d5eb12c25d Update os-support-info.rst - add CentOS
ScyllaDB supports RHEL 9 and derivatives, including CentOS 9.

Fix https://github.com/scylladb/scylladb/issues/21309

(cherry picked from commit 1878af9399)

Closes scylladb/scylladb#21331
2024-11-04 18:17:46 +02:00
Aleksandra Martyniuk
291f568585 test: repair: drop log checks from test_repair_succeeds_with_unitialized_bm
Currently, test_repair_succeeds_with_unitialized_bm checks whether
repair finishes successfully and the error is properly handled
if batchlog_manager isn't initialized. Error handling depends on
logs, making the test fragile to external conditions and flaky.

Drop the error handling check; a successful repair is a sufficient
passing condition.

Fixes: #21167.
(cherry picked from commit 85d9565158)

Closes scylladb/scylladb#21330
2024-11-04 18:16:55 +02:00
Botond Dénes
d5475fbc07 Merge '[Backport 6.2] repair: Fix finished ranges metrics for removenode' from ScyllaDB
The skipped ranges should be multiplied by the number of tables

Otherwise the finished ranges ratio will not reach 100%.

Fixes #21174

(cherry picked from commit cffe3dc49f)

(cherry picked from commit 1392a6068d)

(cherry picked from commit 9868ccbac0)

Refs #21252

Closes scylladb/scylladb#21313

* github.com:scylladb/scylladb:
  test: Add test_node_ops_metrics.py
  repair: Make the ranges more consistent in the log
  repair: Fix finished ranges metrics for removenode
2024-11-04 18:16:21 +02:00
Anna Stuchlik
6916dbe822 doc: remove the Cassandra references from nodetool
This PR removes the reference to Cassandra from the nodetool index,
as the native nodetool is no longer a fork.

In addition, it removes the Apache copyright.

Fixes https://github.com/scylladb/scylladb/issues/21238

(cherry picked from commit ef4bcf8b3f)

Closes scylladb/scylladb#21307
2024-11-04 18:15:36 +02:00
Michał Jadwiszczak
f51a8ed541 test/auth_cluster/test_raft_service_levels: match enterprise SL limit
Although OSS doesn't limit the number of created service levels, match the
enterprise limit to decrease divergence in the test between OSS and
enterprise.

Fixes scylladb/scylladb#21044

(cherry picked from commit 846d94134f)

Closes scylladb/scylladb#21282
2024-11-04 18:14:38 +02:00
Calle Wilund
127606f788 cql_test_env/gossip: Prevent double shutdown call crash
Fixes #21159

When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.

Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
    in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.

(cherry picked from commit c28a5173d9)

Closes scylladb/scylladb#21393
2024-11-04 16:52:42 +01:00
Benny Halevy
56a0fa922d storage_service: on_change: update_peer_info only if peer info changed
Return an optional peer_info from get_peer_info_for_update
when the `app_state_map` arg does not change peer_info,
so that we can skip calling update_peer_info, if it didn't
change.

Fixes scylladb/scylladb#20991
Refs scylladb/scylladb#16376

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21152

(cherry picked from commit 04d741bcbb)
2024-11-04 11:20:32 +02:00
Benny Halevy
c841a4a851 compaction_manager: compaction_disabled: return true if not in compaction_state
When a compaction_group is removed via `compaction_manager::remove`,
it is erased from `_compaction_state`, and therefore compaction
is definitely not enabled on it.

This triggers an internal error if tablets are cleaned up
during drop/truncate, which checks that compaction is disabled
in all compaction groups.

Note that the callers of `compaction_disabled` aren't really
interested in compaction being actively disabled on the
compaction_group, but rather if it's enabled or not.
A follow-up patch can be considered to reverse the logic
and expose `compaction_enabled` rather than `compaction_disabled`.

Fixes scylladb/scylladb#20060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 1c55747637)

Closes scylladb/scylladb#21404
2024-11-03 16:05:05 +02:00
Gleb Natapov
1a9721e93e topology coordinator: take a copy of a replication state in raft_topology_cmd_handler
Current code takes a reference and holds it past preemption points. And
while the state itself is not supposed to change, the reference may
become stale because the state is re-created on each raft topology
command.

Fix it by taking a copy instead. This is a slow path anyway.

Fixes: scylladb/scylladb#21220
(cherry picked from commit fb38bfa35d)

Closes scylladb/scylladb#21361
2024-10-30 14:11:17 +01:00
Kamil Braun
1dded7e52f Merge '[Backport 6.2] fix nodetool status to show zero-token nodes' from ScyllaDB
In the current scenario, the nodetool status doesn’t display information regarding zero token nodes. For example, if 5 nodes are spun up by the administrator, out of which, 2 nodes are zero token nodes, then nodetool status only shows information regarding the 3 non-zero token nodes.

This commit intends to fix this issue by leveraging the “/storage_service/host_id” API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.

A test is also added in nodetool/test_status.py to verify this logic. This test fails without this commit’s zero token node support logic, hence verifying the behavior.

This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions don't support zero token nodes.

Fixes: scylladb/scylladb#19849
Fixes: scylladb/scylladb#17857

(cherry picked from commit 72f3c95a63)

(cherry picked from commit 39dfd2d7ac)

(cherry picked from commit c00d40b239)

Refs scylladb/scylladb#20909

Closes scylladb/scylladb#21334

* github.com:scylladb/scylladb:
  fix nodetool status to show zero-token nodes
  test: move `wait_for_first_completed` to pylib/util.py
  token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
2024-10-29 10:50:35 +01:00
Abhinav
9082d66d8a fix nodetool status to show zero-token nodes
In the current scenario, the nodetool status doesn’t display information
regarding zero token nodes. For example, if 5 nodes are spun up by the
administrator, out of which, 2 nodes are zero token nodes, then nodetool
status only shows information regarding the 3 non-zero token nodes.

This commit intends to fix this issue by leveraging the “/storage_service/host_id”
API and adding appropriate logic in scylla-nodetool.cc to support zero token nodes.

Robust topology tests are added, which spin up scylla nodes and confirm nodetool
status output for various cases, providing good coverage.
A test is also added in nodetool/test_status.py to verify this logic. These tests fail
without this commit’s zero token node support logic, hence verifying the behavior.

The test `test_status_keyspace_joining_node` has been removed. This test is
based on case where host_id=None, which is impossible. Since we now use
host_id_map for node discovery in nodetool, the nodes with "host_id=None"
go undetected. Since this case is anyway impossible, we can get rid of this.

This PR fixes a bug. Hence we need to backport it. Backporting needs to be done only
to 6.2 version, since earlier versions don't support zero token nodes.

Fixes: scylladb/scylladb#19849
(cherry picked from commit c00d40b239)
2024-10-28 21:33:55 +00:00
Abhinav
c7a0876a73 test: move wait_for_first_completed to pylib/util.py
This function is needed in a new test added in the next commit and this
refactoring avoids code duplication.

(cherry picked from commit 39dfd2d7ac)
2024-10-28 21:33:55 +00:00
Abhinav
917d40e600 token_metadata: rename endpoint_to_host_id_map getter and add support for joining nodes
Rename host_id map getter, 'get_endpoint_to_host_id_map_for_reading' to 'get_endpoint_to_host_id_map_'
Also modify the getter to return information regarding joining nodes as well.

This getter will later be used for retrieving the nodes in nodetool status, hence it needs to show all nodes,
including joining ones.

The function name suffix `_for_reading` suggests that the function was used
in some other places in the past, and indeed if we need endpoints
"for reading" then we cannot show joining endpoints. But it was confirmed
that this function is currently only used by "/storage_service/host_id" endpoint,
hence it can be modified as required.

Fixes: scylladb/scylladb#17857
(cherry picked from commit 72f3c95a63)
2024-10-28 21:33:54 +00:00
Aleksandra Martyniuk
1fd60424d9 tasks: fix sequence number assignment
Currently, children of virtual tasks do not have sequence number
assigned. Fix it.

(cherry picked from commit 910a6fc032)
2024-10-28 21:32:49 +00:00
Aleksandra Martyniuk
af6ddebc7f tasks: fix abort source subscription of virtual task's child
Currently, if a regular task does not have a parent or its parent
is a virtual task then it subscribes to module's abort source
in task_manager::task::impl constructor. However, at this point
the kind of the task's parent isn't set. Due to that, children
of virtual tasks aren't aborted on shutdown.

Subscribe to module's abort source in task::impl::set_virtual_parent.

(cherry picked from commit 1eb47b0bbf)
2024-10-28 21:32:49 +00:00
Tomasz Grabiec
fa71b82da4 node-exporter: Disable hwmon collector
This collector reads nvme temperature sensor, which was observed to
cause bad performance on Azure cloud following the reading of the
sensor for ~6 seconds. During the event, we can see elevated system
time (up to 30%) and softirq time. CPU utilization is high, with
nvme_queue_rq taking several orders of magnitude more time than
normally. There are signs of contention, we can see
__pv_queued_spin_lock_slowpath in the perf profile, called. This
manifests as latency spikes and potentially also throughput drop due
to reduced CPU capacity.

By default, the monitoring stack queries it once every 60s.

(cherry picked from commit 93777fa907)

Closes scylladb/scylladb#21304
2024-10-28 15:05:06 +01:00
Asias He
1a5a6a0758 test: Add test_node_ops_metrics.py
It tests that the node_ops_metrics_done metric reaches 100% when a node
operation is done.

Refs: #21174
(cherry picked from commit 9868ccbac0)
2024-10-28 09:54:30 +00:00
Asias He
6ae5481de4 repair: Make the ranges more consistent in the log
Take the number of tables into account when logging the number of ranges,
making it more consistent with the log emitted when the ops starts.

(cherry picked from commit 1392a6068d)
2024-10-28 09:54:30 +00:00
Asias He
0bc22db3a9 repair: Fix finished ranges metrics for removenode
The skipped ranges should be multiplied by the number of tables.

Otherwise the finished ranges ratio will not reach 100%.
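
A small arithmetic sketch of why the multiplication matters. The accounting model here is an assumption made for illustration (work counted per range and per table, skipped ranges counted per range); the function name and parameters are hypothetical and do not come from the repair code:

```python
def finished_percentage(synced_units, skipped_ranges, nr_ranges, nr_tables,
                        scale_skipped=True):
    """Percentage of finished work, where one unit is one (range, table) pair."""
    total_units = nr_ranges * nr_tables
    # A skipped range is skipped for every table, so it covers nr_tables units.
    skipped_units = skipped_ranges * nr_tables if scale_skipped else skipped_ranges
    return 100.0 * (synced_units + skipped_units) / total_units


# 10 ranges, 3 tables, 4 ranges skipped, the other 6 ranges synced for all tables.
with_fix = finished_percentage(18, 4, nr_ranges=10, nr_tables=3)
without_fix = finished_percentage(18, 4, nr_ranges=10, nr_tables=3,
                                  scale_skipped=False)
```

With the skipped ranges scaled by the table count the ratio reaches 100%; without the scaling it stays below 100% even though all the work is done.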

Fixes #21174

(cherry picked from commit cffe3dc49f)
2024-10-28 09:54:30 +00:00
Botond Dénes
b78675270e streaming: stream-session: switch to tracking permit
The stream-session is the receiving end of streaming, it reads the
mutation fragment stream from an RPC stream and writes it onto the disk.
As such, this part does no disk IO and therefore, using a permit with
count resources is superfluous. Furthermore, after
d98708013c, the count resources on this
permit can cause a deadlock on the receiver end, via the
`db::view::check_view_update_path()`, which wants to read the content of
a system table and therefore has to obtain a permit of its own.

Switch to a tracking-only permit, primarily to resolve the deadlock, but
also because admission is not necessary for a read which does no IO.

Refs: scylladb/scylladb#20885 (partial fix, solves only one of the deadlocks)
Fixes: scylladb/scylladb#21264
(cherry picked from commit dbb26da2aa)

Closes scylladb/scylladb#21303
2024-10-28 08:07:05 +02:00
Jenkins Promoter
ea6fe4bfa1 Update ScyllaDB version to: 6.2.1 2024-10-27 12:06:35 +02:00
Botond Dénes
30a2ed7488 Merge '[Backport 6.2] cql/tablets: fix retrying ALTER tablets KEYSPACE' from Marcin Maliszkiewicz
ALTER tablets-enabled KEYSPACES (KS) may fail due to
group0_concurrent_modification, in which case it's repeated by a for
loop surrounding the code. But because raft's add_entry consumes the
raft's guard (by std::move'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the aforementioned for loop altogether and rethrow the exception, as the rf_change event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
Note: refactor is implemented in the follow-up commit.

Fixes: https://github.com/scylladb/scylladb/issues/21102

Should be backported to every 6.x branch, as it may lead to a crash.

(cherry picked from commit de511f56ac)

(cherry picked from commit 3f4c8a30e3)

(cherry picked from commit 522bede8ec)

Refs https://github.com/scylladb/scylladb/pull/21121

Closes scylladb/scylladb#21256

* github.com:scylladb/scylladb:
  test: topology: add disable_schema_agreement_wait utility function
  test: add UT to test retrying ALTER tablets KEYSPACE
  cql/tablets: fix indentation in `rf_change` event handler
  cql/tablets: fix retrying ALTER tablets KEYSPACE
2024-10-25 10:57:36 +03:00
Botond Dénes
dcddb1ff4a Merge '[Backport 6.2] multishard reader: make it safe to create with admitted permits' from ScyllaDB
Passing an admitted permit -- i.e. one with count resources on it -- to the multishard reader, will possibly result in a deadlock, because the permit of the multishard reader is destroyed after the permits of its child readers. Therefore its semaphore resources won't be automatically released until children acquire their own resources. This creates a dependency (an edge in the "resource allocation graph"), where the semaphore used by the multishard reader depends on the semaphores used by children. When such dependencies create a cycle, and permits are acquired by different reads in just the right order, a deadlock will happen.

Users of the multishard reader have to be aware of this gotcha -- and of course they aren't. This is small wonder, considering that not even the documentation on the multishard reader mentions this problem. To work around this, the user has to call `reader_permit::release_base_resources()` on the permit, before passing it to the multishard reader. On multiple occasions, developers (including the very author of the multishard reader) forgot or didn't know about this and this resulted in deadlocks down the line. This is a design flaw of the multishard reader, which is addressed in this PR. After it, it is safe to pass admitted or not admitted permits to the multishard reader; it will handle the call to `release_base_resources()` if needed.

After fixing the problem in the multishard reader, the existing calls to `release_base_resources()` on permits passed to multishard readers are removed. A test is added which reproduces the problem and ensures we don't regress.

Refs: https://github.com/scylladb/scylladb/issues/20885 (partial fix, there is another deadlock in that issue, which this PR doesn't fix)
Fixes: https://github.com/scylladb/scylladb/issues/21263

This fixes (indirectly) a regression introduced by d98708013c so it has to be backported to 6.2

(cherry picked from commit e1d8cddd09)

Refs scylladb/scylladb#21058

Closes scylladb/scylladb#21178

* github.com:scylladb/scylladb:
  test/boost/mutation_test: add test for multishard permit safety
  test/lib/reader_lifecycle_policy: add semaphore factory to constructor
  test/lib/reader_lifecycle_policy: rename factory_function
  repair/row_level: drop now unneeded release_base_resource() calls
  readers/multishard: make multishard reader safe to create with admitted permits
2024-10-25 09:32:03 +03:00
Piotr Dulikowski
4ca0e31415 test/test_view_build_status: properly wait for v2 in migration test
The test_view_build_status_migration_to_v2 test case creates a new view
(vt2) after performing the view_build_status -> view_build_status_v2
migration and waits until it is built by `wait_for_view_v2` function. It
works by waiting until a SELECT from view_build_status_v2 will return
the expected number of rows for a given view.

However, if the host parameter is unspecified, it will query only one
node on each attempt. Because `view_build_status_v2` is managed via
raft, queries always return data from the queried node only. It might
happen that `wait_for_view_v2` fetches expected results from one node
while a different node might be lagging behind the group0 coordinator
and might not have all data yet.

In case of test_view_build_status_migration_to_v2 this is a problem - it
first uses `wait_for_view_v2` to wait for view, later it queries
`view_build_status_v2` on a random node and asserts its state - and
might fail because that node didn't have the newest state yet.

Fix the issue by issuing `wait_for_view_v2` in parallel for all nodes in
the cluster and waiting until all nodes have the most recent state.
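
A minimal sketch of the "wait on every node in parallel" idea, using plain asyncio. It is not the test's actual helper: `wait_for_view_rows`, `wait_for_view_on_all_hosts` and `FakeCluster` are hypothetical names, and the fake cluster stands in for per-node SELECTs from view_build_status_v2:

```python
import asyncio


async def wait_for_view_rows(count_rows, host, expected, timeout=5.0, interval=0.01):
    """Poll one host until it reports at least `expected` rows, or time out."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while await count_rows(host) < expected:
        if loop.time() >= deadline:
            raise TimeoutError(f"{host} did not reach {expected} rows")
        await asyncio.sleep(interval)


async def wait_for_view_on_all_hosts(count_rows, hosts, expected):
    """Return only once every host has the expected state."""
    await asyncio.gather(*(wait_for_view_rows(count_rows, h, expected) for h in hosts))


class FakeCluster:
    """Stand-in for a cluster where some nodes lag behind by a number of polls."""

    def __init__(self, lag):
        self.lag = dict(lag)
        self.polls = {host: 0 for host in lag}

    async def count_rows(self, host):
        self.polls[host] += 1
        return 3 if self.polls[host] > self.lag[host] else 0
```

Waiting on a single node would return as soon as that node is up to date; gathering over all hosts means a later query to any node sees the newest state.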

Fixes: scylladb/scylladb#21060
(cherry picked from commit a380a2efd9)

Closes scylladb/scylladb#21129
2024-10-24 16:42:53 +03:00
Raphael S. Carvalho
363bc7424e locator: Always preserve balancing_enabled in tablet_metadata::copy()
When there are zero tablets, tablet_metadata::_balancing_enabled
is ignored in the copy.

The property not being preserved can result in balancer not
respecting user's wish to disable balancing when a replica is
created later on.

Fixes #21175.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit dfc217f99a)

Closes scylladb/scylladb#21190
2024-10-24 16:37:41 +03:00
Botond Dénes
a5b11a3189 test/boost/mutation_test: add test for multishard permit safety
Add a test checking that the multishard reader will not deadlock, when
created with an admitted permit, on a semaphore with a single count
resource.

(cherry picked from commit e1d8cddd09)
2024-10-24 09:18:11 -04:00
Botond Dénes
c0eba659f6 test/lib/reader_lifecycle_policy: add semaphore factory to constructor
Allowing callers to specify how the semaphore is created and stopped,
instead of doing so via boolean flags like it is done currently. This
method doesn't scale, so use a factory instead.

(cherry picked from commit 5a3fd69374)
2024-10-24 09:18:11 -04:00
Botond Dénes
dbb1dc872d test/lib/reader_lifecycle_policy: rename factory_function
To reader_factor_function. We are about to add a new factory function
parameter, so the current factory_function has to be renamed to
something more specific.

(cherry picked from commit c8598e21e8)
2024-10-24 09:18:11 -04:00
Botond Dénes
07b288b7d7 repair/row_level: drop now unneeded release_base_resource() calls
The multishard reader now does this itself, no need to do it here.

(cherry picked from commit 76a5ba2342)
2024-10-24 09:18:11 -04:00
Botond Dénes
41a44ddc12 readers/multishard: make multishard reader safe to create with admitted permits
Passing an admitted permit -- i.e. one with count resources on it -- to
the multishard reader, will possibly result in a deadlock, because the
permit of the multishard reader is destroyed after the permits of its
child readers. Therefore its semaphore resources won't be automatically
released until children acquire their own resources.
This creates a dependency (an edge in the "resource allocation graph"),
where the semaphore used by the multishard reader depends on the
semaphores used by children. When such dependencies create a cycle, and
permits are acquired by different reads in just the right order, a
deadlock will happen.

Users of the multishard reader have to be aware of this gotcha -- and of
course they aren't. This is small wonder, considering that not even the
documentation on the multishard reader mentions this problem.
To work around this, the user has to call
`reader_permit::release_base_resources()` on the permit, before passing
it to the multishard reader.
On multiple occasions, developers (including the very author of the
multishard reader) forgot or didn't know about this and this resulted
in deadlocks down the line.
This is a design flaw of the multishard reader, which is addressed in
this patch. After it, it is safe to pass admitted or not admitted
permits to the multishard reader; it will handle the call to
`release_base_resources()` if needed.

(cherry picked from commit 218ea449a5)
2024-10-24 09:18:11 -04:00
Lakshmi Narayanan Sreethar
3f04df55eb [Backport 6.2] replica/table: check memtable before discarding tombstone during read
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.

Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.
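
The rule in the last sentence can be restated as a small Python predicate. This is only a model of the sentence above, not the replica code: memtables are represented as plain dicts mapping a key to its minimum live timestamp, and `may_purge_tombstone` is a hypothetical name. It also assumes the tombstone is otherwise eligible for purging:

```python
def may_purge_tombstone(key, max_purgeable_ts, memtables):
    """Return False if any memtable holds `key` with a minimum live
    timestamp lower than the maximum purgeable timestamp."""
    for memtable in memtables:
        min_live_ts = memtable.get(key)  # None when the key is not in this memtable
        if min_live_ts is not None and min_live_ts < max_purgeable_ts:
            return False
    return True
```

For example, with a maximum purgeable timestamp of 100, a memtable entry for the key at timestamp 50 blocks the purge, while an entry at timestamp 150 or no entry at all does not.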

Fixes #20916

`perf-simple-query` stats before and after this fix :

`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59393 insns/op,   24029 cycles/op,        0 errors)
97551.14 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59376 insns/op,   23966 cycles/op,        0 errors)
96599.92 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59367 insns/op,   23998 cycles/op,        0 errors)
97774.91 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59370 insns/op,   23968 cycles/op,        0 errors)
97796.13 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59368 insns/op,   23947 cycles/op,        0 errors)

         throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
  cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19

// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59392 insns/op,   24058 cycles/op,        0 errors)
97311.48 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59375 insns/op,   24005 cycles/op,        0 errors)
98043.10 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59381 insns/op,   23941 cycles/op,        0 errors)
96750.31 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59396 insns/op,   24025 cycles/op,        0 errors)
93381.21 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59390 insns/op,   24097 cycles/op,        0 errors)

         throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
  cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```

This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.

Closes scylladb/scylladb#20985

* github.com:scylladb/scylladb:
  topology-custom: add test to verify tombstone gc in read path
  replica/table: check memtable before discarding tombstone during read
  compaction_group: track maximum timestamp across all sstables

(cherry picked from commit 519e167611)

Backported from #20985 to 6.2.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#21251
2024-10-24 15:33:39 +03:00
Marcin Maliszkiewicz
8c8f97c280 test: topology: add disable_schema_agreement_wait utility function
Code extracted from fa45fdf5f7 as it's being used by
test_alter_tablets_keyspace_concurrent_modification and we're
backporting it.
2024-10-24 11:24:56 +02:00
Piotr Smaron
a61ab7d02e test: add UT to test retrying ALTER tablets KEYSPACE
The newly added testcase is based on the already existing
`test_alter_dropped_tablets_keyspace`.
A new error injection is created, which stops the ALTER execution just
before the changes are submitted to RAFT. In the meantime, a new schema
change is performed using the 2nd node in the cluster, thus causing the
1st node to retry the ALTER statement.

(cherry picked from commit 522bede8ec)
2024-10-23 13:35:26 +00:00
Piotr Smaron
775578af59 cql/tablets: fix indentation in rf_change event handler
Just moved the code that previously was under a `for` loop by 1 tab, i.e. 4 spaces, to the left.

(cherry picked from commit 3f4c8a30e3)
2024-10-23 13:35:26 +00:00
Piotr Smaron
97f22f426f cql/tablets: fix retrying ALTER tablets KEYSPACE
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft's guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the aforementioned `for` loop altogether and rethrow the exception, as the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
`topology_coordinator::handle_topology_coordinator_error` handling the
case of `group0_concurrent_modification` has been extended with logging
in order not to write catch-log-throw boilerplate.
Note: refactor is implemented in the follow-up commit.

Fixes: scylladb/scylladb#21102
(cherry picked from commit de511f56ac)
2024-10-23 13:35:26 +00:00
Botond Dénes
55a9605687 Merge '[Backport 6.2] Check system.tablets update before putting it into the table' from ScyllaDB
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.

Fixes #20043

(cherry picked from commit f09fe4f351)

(cherry picked from commit e5bf376cbc)

(cherry picked from commit 1863ccd900)

Refs #21020

Closes scylladb/scylladb#21111

* github.com:scylladb/scylladb:
  tablets: Validate system.tablets update
  group0_client: Introduce change validation
  group0_client: Add shared_token_metadata dependency
2024-10-23 10:00:39 +03:00
Pavel Emelyanov
83cc3e4791 tablets: Validate system.tablets update
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.

If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.
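
A rough sketch of the check, in Python for brevity. How pending replicas are derived here (entries of "new_replicas" that are absent from "replicas") is an assumption made for this illustration, and both function names are hypothetical; the real check lives in the C++ validation path and mirrors `process_one_row`:

```python
def pending_replicas(replicas, new_replicas):
    """Replicas present in new_replicas but not in replicas (assumed definition)."""
    current = set(replicas)
    return [r for r in new_replicas if r not in current]


def validate_tablet_update(replicas, new_replicas):
    """Reject an update that would leave more than one pending replica."""
    pending = pending_replicas(replicas, new_replicas)
    if len(pending) > 1:
        raise ValueError(f"too many pending replicas: {pending}")
```

Rejecting the update before it is written means the table never holds a row that the loader would later refuse.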

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 14:45:51 +03:00
Pavel Emelyanov
aef7e7db0b group0_client: Introduce change validation
Add validate_change() methods (well, a template and an overload) that
are called by prepare_command() and are supposed to validate the
proposed change before it hits persistent storage

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 14:45:22 +03:00
Pavel Emelyanov
282cdfcfcc group0_client: Add shared_token_metadata dependency
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 14:45:12 +03:00
Daniel Reis
b661bc39df docs: fix redirect from cert-based auth to security/enable-auth page
(cherry picked from commit 28a265ccd8)

Closes scylladb/scylladb#21124
2024-10-22 09:10:04 +03:00
Botond Dénes
0805780064 Merge '[Backport 6.2] scylla_raid_setup: configure SELinux file context' from ScyllaDB
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump because the service only has write access with systemd_coredump_var_lib_t. To make it writable, we need to add file context rule for /var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.

Fixes #19325

(cherry picked from commit 56c971373c)

(cherry picked from commit 0ac450de05)

Refs #20528

Closes scylladb/scylladb#21211

* github.com:scylladb/scylladb:
  scylla_raid_setup: configure SELinux file context
  scylla_coredump_setup: fix SELinux configuration for RHEL9
2024-10-21 16:03:08 +03:00
Takuya ASADA
3de8885161 scylla_raid_setup: configure SELinux file context
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump
because the service only has write access with systemd_coredump_var_lib_t.
To make it writable, we need to add file context rule for
/var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.

Fixes #20573

(cherry picked from commit 0ac450de05)
2024-10-21 11:15:06 +00:00
Takuya ASADA
29a0ce3b0a scylla_coredump_setup: fix SELinux configuration for RHEL9
It seems that a specific version of the systemd package on RHEL9 has a bug in its
SELinux configuration: it introduced the "systemd-container-coredump" module
to provide rules for systemd-coredump, but did not enable it by default.
We have to manually load it, otherwise it causes a permission error.

Fixes #19325

(cherry picked from commit 56c971373c)
2024-10-21 11:15:06 +00:00
Benny Halevy
eebf97c545 view: check_needs_view_update_path: get token_metadata_ptr
check_needs_view_update_path is async and might yield
so the token_metadata reference passed to it must be kept
alive throughout the call.

Fixes scylladb/scylladb#20979

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit eaa3b774a6)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21038
2024-10-21 10:27:52 +02:00
Artsiom Mishuta
a728695d10 test.py: deselect remove_data_dir_of_dead_node event
deselect remove_data_dir_of_dead_node event from test_random_failures
due to issue #20751

(cherry picked from commit 9b0e15678e)

Closes scylladb/scylladb#21138
2024-10-17 11:38:35 +02:00
Piotr Smaron
82a34aa837 test: fix flaky test_multidc_alter_tablets_rf
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

(cherry picked from commit f8475915fb)

Closes scylladb/scylladb#21107
2024-10-15 09:26:28 +03:00
Piotr Dulikowski
d10c6a86cc SCYLLA-VERSION-GEN: correct the logic for skipping SCYLLA-*-FILE
The SCYLLA-VERSION-GEN file skips updating the SCYLLA-*-FILE files if
the commit hash from SCYLLA-RELEASE-FILE is the same. The original
reason for this was to prevent the date in the version string from
changing if multiple modes are built across midnight
(scylladb/scylla-pkg#826). However - intentionally or not - it serves
another purpose: it prevents an infinite loop in the build process.

If the build.ninja file needs to be rebuilt, the configure.py script
unconditionally calls ./SCYLLA-VERSION-GEN. On the other hand, if one
of the SCYLLA-*-FILE files is updated then this triggers rebuild
of build.ninja. Apparently, this is sufficient for ninja to enter an
infinite loop.

However, the check assumes that the RELEASE is in the format

  <build identifier>.<date>.<commit hash>

and assumes that none of the components have a dot inside - otherwise it
breaks and just works incorrectly. Specifically, when building a private
version, it is recommended to set the build identifier to
`count.yourname`.

Previously, before 85219e9, this problem wasn't noticed most likely
because reconfigure process was broken and stopped overwriting
the build.ninja file after the first iteration.

Fix the problem by fixing the logic that extracts the commit hash -
instead of looking at the third dot-separated field counting from the
left side, look at the last field.
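
A minimal sketch of the difference between the two extraction approaches,
written in Python for illustration only (SCYLLA-VERSION-GEN itself is a shell
script; the function names and the sample RELEASE string are hypothetical):

```python
def commit_hash_third_field(release: str) -> str:
    """Old logic: take the third dot-separated field counting from the left."""
    return release.split(".")[2]


def commit_hash_last_field(release: str) -> str:
    """Fixed logic: take the last dot-separated field."""
    return release.rsplit(".", 1)[-1]


# Hypothetical RELEASE value for a private build whose build identifier
# follows the `count.yourname` pattern and therefore contains a dot.
release = "1.yourname.20241015.64ca58125e"

print(commit_hash_third_field(release))  # picks up the date field
print(commit_hash_last_field(release))   # picks up the commit hash
```

With a build identifier that has no dot, both functions return the same field,
which is why the problem only shows up for private builds.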

Fixes: scylladb/scylladb#21027
(cherry picked from commit 64ca58125e)

Closes scylladb/scylladb#21103
2024-10-15 09:26:00 +03:00
Botond Dénes
554838691b Merge '[Backport 6.2] compaction: fix potential data resurrection with file-based migration' from Ferenc Szili
This is a manual backport of #20788

When tablets are migrated with file-based streaming, we can have a situation where a tombstone is garbage collected before the data it shadows lands. For instance, if we have a tablet replica with 3 sstables:

1. sstable containing an expired tombstone
2. sstable with additional data
3. sstable containing data which is shadowed by the expired tombstone in sstable 1

If this tablet is migrated, and the sstables are streamed in the order listed above, the first two sstables can be compacted before the third sstable arrives. In that case, the expired tombstone will be garbage collected, and data in the third sstable will be resurrected after it arrives at the pending replica.

This change fixes this problem by disabling tombstone garbage collection for pending replicas.
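
The scenario above can be reproduced with a toy model. This is only a sketch under simplifying assumptions: sstables are plain dicts, and `compact`, `read` and the cell layout are hypothetical names that do not correspond to the actual ScyllaDB code.

```python
def compact(sstables, now, tombstone_gc_enabled):
    """Merge sstables into one, keeping the newest cell per key.

    A cell is (timestamp, value, gc_after); value is None for a tombstone.
    """
    merged = {}
    for sst in sstables:
        for key, cell in sst.items():
            if key not in merged or cell[0] > merged[key][0]:
                merged[key] = cell
    if tombstone_gc_enabled:
        # Purge tombstones that are past their GC deadline.
        merged = {k: c for k, c in merged.items()
                  if not (c[1] is None and c[2] <= now)}
    return merged


def read(sstables, key):
    """Return the newest value for key, or None if deleted or absent."""
    cells = [sst[key] for sst in sstables if key in sst]
    return max(cells, key=lambda c: c[0])[1] if cells else None


sst1 = {"k": (10, None, 100)}      # expired tombstone for "k"
sst2 = {"other": (5, "x", None)}   # additional data
sst3 = {"k": (1, "old", None)}     # data shadowed by the tombstone in sst1

# sst1 and sst2 are compacted on the pending replica before sst3 arrives.
with_gc = [compact([sst1, sst2], now=200, tombstone_gc_enabled=True), sst3]
without_gc = [compact([sst1, sst2], now=200, tombstone_gc_enabled=False), sst3]

print(read(with_gc, "k"))      # the deleted value comes back
print(read(without_gc, "k"))   # the tombstone still shadows it
```

In this model, keeping tombstone GC disabled during the early compaction is what preserves the tombstone until the shadowed data has landed.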

This fixes a problem in Enterprise, but the change is in OSS in order to have as few differences as possible between OSS and Enterprise and to have a common infrastructure for disabling tombstone GC on pending replicas.

Fixes #21090

Closes scylladb/scylladb#21061

* github.com:scylladb/scylladb:
  test: test tombstone GC disabled on pending replica
  tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
  database::table: add tombstone_gc_enabled(locator::tablet_id)
2024-10-15 09:25:22 +03:00
Kefu Chai
b691dddf6b install.sh: install seastar/scripts/addr2line.py as well
seastar extracted the `addr2line` python module back in
e078d7877273e4a6698071dc10902945f175e8bc, but `install.sh` was
not updated accordingly. it still installs `seastar-addr2line`
without installing its new dependency. this leaves us with a
broken `seastar-addr2line` in the relocatable tarball.
```console
$ /opt/scylladb/scripts/seastar-addr2line
Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/seastar-addr2line", line 26, in <module>
    from addr2line import BacktraceResolver
ModuleNotFoundError: No module named 'addr2line'
```

in this change, we redistribute `addr2line.py` as well. this
should address the issue above.

Fixes scylladb/scylladb#21077

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit da433aad9d)

Closes scylladb/scylladb#21085
2024-10-14 09:52:21 +03:00
Botond Dénes
85b1c64a33 Merge '[Backport 6.2] storage_proxy: Add conditions checking to avoid UB in speculating read executors.' from ScyllaDB
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in filter_for_query(): the map is considered incorrect if the list of replicas contains a node from a data center whose replication factor is 0.

Please note: This PR does not fix the issue found in scylladb/scylladb#20282; it only adds condition checks to prevent undefined behavior in cases of inconsistent inputs.

Refs scylladb/scylladb#20625

As this issue applies to the released versions and can affect clients, we need backports to 6.0, 6.1, 6.2.

(cherry picked from commit 132358dc92)

(cherry picked from commit ae23d42889)

(cherry picked from commit ad93cf5753)

(cherry picked from commit 8db6d6bd57)

(cherry picked from commit c373edab2d)

Refs #20851

Closes scylladb/scylladb#21067

* github.com:scylladb/scylladb:
  Add conditions checking for get_read_executor
  Avoid an extra call to block_for in db::filter_for_query.
  Improve code readability in consistency_level.cc and storage_proxy.cc
  tools: Add build_info header with functions providing build type information
  tests: Add tests for alter table with RF=1 to RF=0
2024-10-14 09:51:50 +03:00
Benny Halevy
6e67a993ba storage_service: rebuild: warn about tablets-enabled keyspaces
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.

The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.

However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.

Refs scylladb/scylladb#17575

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ed1e9a1543)

Closes scylladb/scylladb#20722
2024-10-14 09:47:35 +03:00
Michał Chojnowski
b8a9fd4e49 reader_concurrency_semaphore: in stats, fix swapped count_resources and memory_resources
can_admit_read() returns reason::memory_resources when the permit is queued due
to lack of count resources, and it returns reason::count_resources when the
permit is queued due to lack of memory resources. It's supposed to be the other
way around.

This bug is causing the two counts to be swapped in the stat dumps printed to
the logs when semaphores time out.

(cherry picked from commit 6cf3747c5f)

Closes scylladb/scylladb#21030
2024-10-13 18:34:18 +03:00
Jenkins Promoter
363cf881d4 Update ScyllaDB version to: 6.2.0 2024-10-13 14:15:40 +03:00
Sergey Zolotukhin
68a55facdf Add conditions checking for get_read_executor
During the investigation of scylladb/scylladb#20282, it was discovered that
implementations of speculating read executors have undefined behavior
when called with an incorrect number of read replicas. This PR
introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in
  get_endpoints_for_reading(): the map is considered incorrect if the number of
  read replica nodes is higher than the replication factor. The check is
  applied only when built in non-release mode.

Please note: This PR does not fix the issue found in scylladb/scylladb#20282;
it only adds condition checks to prevent undefined behavior in cases of
inconsistent inputs.

Refs scylladb/scylladb#20625

(cherry picked from commit c373edab2d)
2024-10-11 18:20:43 +00:00
Sergey Zolotukhin
9010d0a22f Avoid an extra call to block_for in db::filter_for_query.
(cherry picked from commit 8db6d6bd57)
2024-10-11 18:20:43 +00:00
Sergey Zolotukhin
3c0f43b6eb Improve code readability in consistency_level.cc and storage_proxy.cc
Add const correctness and rename some variables to improve code readability.

(cherry picked from commit ad93cf5753)
2024-10-11 18:20:43 +00:00
Sergey Zolotukhin
a22e4476ac tools: Add build_info header with functions providing build type information
A new header provides `constexpr` functions to retrieve build
type information: `get_build_type()`, `is_release_build()`,
and `is_debug_build()`. These functions are useful when adding
changes that should be enabled at compile time only for
specific build types.

(cherry picked from commit ae23d42889)
2024-10-11 18:20:42 +00:00
Sergey Zolotukhin
14650257c0 tests: Add tests for alter table with RF=1 to RF=0
Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor
from 1 to 0 for one of two data centers. Tablet version fails due to issue described in
scylladb/scylladb#20625.

Test for scylladb/scylladb#20625

(cherry picked from commit 132358dc92)
2024-10-11 18:20:42 +00:00
Ferenc Szili
2a318817ba test: test tombstone GC disabled on pending replica
This tests if tombstone GC is disabled on pending replicas
2024-10-11 14:10:30 +02:00
Ferenc Szili
5f052a2b52 tablet_storage_group_manager: update tombstone_gc_enabled in compaction group
In order to avoid cases during tablet migrations where we garbage
collect tombstones before the data they shadow arrives, we will
disable tombstone GC on pending replicas.

To achieve this we added a tombstone_gc_enabled flag to compaction_group.
This flag is updated from the update_effective_replication_map method of the
tablet_storage_group_manager class.
2024-10-11 14:09:30 +02:00
David Garcia
e018b38a54 docs: Fix confgroup links
It was not possible to link to configuration parameters groups in docs/reference/configuration-parameters.rst if they contained a space.

(cherry picked from commit 2247bdbc8c)

Closes scylladb/scylladb#21037
2024-10-11 14:31:28 +03:00
Ferenc Szili
14ce5e14d0 database::table: add tombstone_gc_enabled(locator::tablet_id)
This change adds the flag tombstone_gc_enabled to compaction_group.
The value of this flag will be set in
tablet_storage_group_manager::update_effective_replication_map().
2024-10-11 13:29:30 +02:00
Piotr Smaron
d1a31460a0 cql/tablets: handle MVs in ALTER tablets KEYSPACE
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.

Fixes: #20240
(cherry picked from commit e0c1a51642)

Closes scylladb/scylladb#21022
2024-10-11 14:14:09 +03:00
Botond Dénes
9175cc528b Merge '[Backport 6.2] cql: improve validating RF's change in ALTER tablets KS' from ScyllaDB
This patch series fixes a couple of bugs around validating if RF is not changed by too much when performing ALTER tablets KS.
RF cannot change by more than 1 in total, because tablets load balancer cannot handle more work at once.

Fixes: #20039

Should be backported to 6.0 & 6.1 (wherever tablets feature is present), as this bug may break the cluster.

(cherry picked from commit 042825247f)

(cherry picked from commit adf453af3f)

(cherry picked from commit 9c5950533f)

(cherry picked from commit 47acdc1f98)

(cherry picked from commit 93d61d7031)

(cherry picked from commit 6676e47371)

(cherry picked from commit 2aabe7f09c)

(cherry picked from commit ee56bbfe61)

Refs #20208

Closes scylladb/scylladb#21009

* github.com:scylladb/scylladb:
  cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
  cql: join new and old KS options in ALTER tablets KS
  cql: fix validation of ALTERing RFs in tablets KS
  cql: harden `alter_keyspace_statement.cc::validate_rf_difference`
  cql: validate RF change for new DCs in ALTER tablets KS
  cql: extend test_alter_tablet_keyspace_rf
  cql: refactor test_tablets::test_alter_tablet_keyspace
  cql: remove unused helper function from test_tablets
2024-10-11 14:13:43 +03:00
Botond Dénes
18be4f454e Merge '[Backport 6.2] Node replace and remove operations: Add deprecate IP addresses usage warning.' from ScyllaDB
- As part of the deprecation of IP address usage, warning messages were added for when IP addresses are specified in the `ignore-dead-nodes` and `--ignore-dead-nodes-for-replace` options for scylla and nodetool.
- Slight optimizations for `utils::split_comma_separated_list`, `host_id_or_endpoint` lists and `storage_service` remove node operations, replacing `std::list` usage with `std::vector`.

Fixes scylladb/scylladb#19218

Backport: 6.2 as it's not yet released.

(cherry picked from commit 3b9033423d)

(cherry picked from commit a871321ecf)

(cherry picked from commit 9c692438e9)

(cherry picked from commit 6398b7548c)

Refs #20756

Closes scylladb/scylladb#20958

* github.com:scylladb/scylladb:
  config: Add a warning about use of IP address for join topology and replace operations.
  nodetool: Add IP address usage warning for 'ignore-dead-nodes'.
  tests: Fix incorrect UUIDs in test_nodeops
  utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
2024-10-11 14:12:51 +03:00
Botond Dénes
f35a083abe repair/row_level: remove reader timeout
This timeout was added to catch reader related deadlocks. We have not
seen such deadlocks for a long time, but we did see false-timeouts
caused by this, see explanation below. Since the cost now outweighs the
benefit, remove the timeout altogether.

The false timeout happens during mixed-shard repair. The
`reader_permit::set_timeout()` call is called on the top-level permit
which repair has a handle on. In the case of the mixed-shard repair,
this belongs to the multishard reader. Calling set_timeout() on the
multishard reader has no effect on the actual shard readers, except in
one case: when the shard reader is created, it inherits the multishard
reader's current timeout. As the shard reader can be alive for a long
time, this timeout is not refreshed and ultimately causes a timeout and
fails the repair.
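
The inheritance problem can be illustrated with a toy model. This is a sketch only: the `Permit` and `MultishardReader` classes below are hypothetical stand-ins, not the real reader_permit API, and deadlines are plain integers on a logical clock.

```python
class Permit:
    """Toy permit; `timeout` is an absolute deadline on a logical clock."""

    def __init__(self, timeout=None):
        self.timeout = timeout

    def set_timeout(self, deadline):
        self.timeout = deadline


class MultishardReader:
    """Toy multishard reader owning the top-level permit."""

    def __init__(self):
        self.permit = Permit()

    def create_shard_reader(self):
        # The shard reader copies the current timeout once, at creation time.
        return Permit(self.permit.timeout)


reader = MultishardReader()
reader.permit.set_timeout(100)        # repair sets a deadline on the top-level permit
shard = reader.create_shard_reader()  # shard reader inherits deadline 100
reader.permit.set_timeout(200)        # repair refreshes only the top-level permit

# The long-lived shard reader still carries the stale deadline.
print(reader.permit.timeout, shard.timeout)
```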

Refs: #18269
(cherry picked from commit 3ebb124eb2)

Closes scylladb/scylladb#20955
2024-10-11 14:11:03 +03:00
Anna Stuchlik
57affc7fad doc: document the option to run ScyllaDB in Docker on macOS
This commit adds a description of a workaround to create a multi-node ScyllaDB cluster
with Docker on macOS.

Refs https://github.com/scylladb/scylladb/issues/16806
See https://forum.scylladb.com/t/running-3-node-scylladb-in-docker/1057/4

(cherry picked from commit 7eb1dc2ae5)

Closes scylladb/scylladb#20931
2024-10-11 14:10:06 +03:00
Raphael S. Carvalho
927e526e2d replica: Fix schema change during migration cleanup
During migration cleanup, there's a small window in which the storage
group was stopped but not yet removed from the list. So concurrent
operations traversing the list could work with stopped groups.

During a test which emitted schema changes during migrations,
a failure happened when updating the compaction strategy of a table,
but since the group was stopped, the compaction manager was unable
to find the state for that group.

In order to fix it, we'll skip stopped groups when traversing the
list since they're unused at this stage of migration and going away
soon.

Fixes #20699.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit cf58674029)

Closes scylladb/scylladb#20899
2024-10-11 14:07:42 +03:00
Calle Wilund
b224665575 database: Also forced new schema commitlog segment on user initiated memtable flush
Refs #20686
Refs #15607

In #15060 we added a forced new commitlog segment on user-initiated flush,
mainly so that tests can verify tombstone gc and other compaction related
things, without having to wait for "organic" segment deletion.
Schema commitlog was not included, mainly because we did not have tests
featuring compaction checks of schema related tables, but also because
it was assumed to have lower general throughput.
There is however no real reason to not include it, and it will make some
testing much quicker and more predictable.

(cherry picked from commit 60f8a9f39d)

Closes scylladb/scylladb#20705
2024-10-11 14:03:17 +03:00
Gleb Natapov
9afb1afefa storage_proxy: make sure there is no end iterator in _live_iterators array
storage_proxy::cancellable_write_handlers_list::update_live_iterators
assumes that iterators in _live_iterators can be dereferenced, but
the code does not make any attempt to make sure this is the case. The
iterator can be the end iterator which cannot be dereferenced.

The patch makes sure that there is no end iterator in _live_iterators.

Fixes scylladb/scylladb#20874

(cherry picked from commit da084d6441)

Closes scylladb/scylladb#21003
2024-10-10 17:09:27 +03:00
Kefu Chai
72153cac96 auth: capture boost::regex_error not std::regex_error
in a3db5401, we introduced the TLS certificate authenticator, which is
configured using the `auth_certificate_role_queries` option. the
value of this option contains a regular expression. so there are
chances the regular expression is malformed. in that case,
when converting its value representing the regular expression to an
instance of `boost::regex`, Boost.Regex throws a `boost::regex_error`
exception, not `std::regex_error`.

since we decided to use Boost.Regex, let's catch `boost::regex_error`.

Refs a3db5401
Fixes scylladb/scylladb#20941
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 439c52c7c5)

Closes scylladb/scylladb#20952
2024-10-09 21:58:40 +03:00
Michał Chojnowski
f988980260 utils/rjson.cc: correct a comment about assert()
Commit aa1270a00c changed most uses
of `assert` in the codebase to `SCYLLA_ASSERT`.

But the comment fixed in this patch is talking specifically about
`assert`, and shouldn't have been changed. It doesn't make sense
after the change.

(cherry picked from commit da7edc3a08)

Closes scylladb/scylladb#20976
2024-10-09 21:50:26 +03:00
Anna Stuchlik
1d11adf766 doc: remove outdated JMX references
This commit removes references to JMX from the docs.

Context:
The JMX server has been dropped and removed from installation. The user can
install it manually if needed, as documented with https://github.com/scylladb/scylladb/issues/18687.

This commit removes the outdated information about JMX from other pages
in the documentation, including the docs for nodetool, the list of ports,
and the admin section.

Also, the no longer relevant JMX information is removed from
the Docker Hub docs.

Fixes https://github.com/scylladb/scylladb/issues/18687
Fixes https://github.com/scylladb/scylladb/issues/19575

(cherry picked from commit 4e43d542cd)

Closes scylladb/scylladb#20988
2024-10-09 20:57:49 +03:00
Jenkins Promoter
dae1d18145 Update ScyllaDB version to: 6.2.0-rc3 2024-10-09 15:10:48 +03:00
Kamil Braun
e9588a8a53 Merge '[Backport 6.2] Wait for all users of group0 server to complete before destroying it' from ScyllaDB
Group0 server is often used in asynchronous contexts, but we do not wait
for those operations to complete before destroying the server. We already have
a shutdown gate for it, so let's use it in those async functions.

Also make sure to signal group0 abort source if initialization fails.

Fixes scylladb/scylladb#20701

Backport to 6.2 since it contains af83c5e53e and it made the race easier to hit, so tests became flaky.

(cherry picked from commit ba22493a69)

(cherry picked from commit e642f0a86d)

Refs #20891

Closes scylladb/scylladb#21008

* github.com:scylladb/scylladb:
  group: hold group0 shutdown gate during async operations
  group0: Stop group0 if node initialization fails
2024-10-09 12:19:16 +02:00
Piotr Smaron
c73d0ffbaa cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
Tablets load balancer is unable to process more than a single pending
replica, thus ALTER tablets KS cannot accept an ALTER statement which
would result in creating 2+ pending replicas, hence it has to validate
if the sum of absolute differences of RFs specified in the statement is
not greater than 1.
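
A minimal sketch of such a check, in Python for illustration. The function name `validate_rf_change` and the dict-based inputs are assumptions; the sketch also assumes both maps are complete, so a DC missing from either one is treated as RF=0.

```python
def validate_rf_change(old_rfs: dict, new_rfs: dict) -> None:
    """Raise ValueError if the total RF change across all DCs exceeds 1."""
    dcs = set(old_rfs) | set(new_rfs)
    total = sum(abs(new_rfs.get(dc, 0) - old_rfs.get(dc, 0)) for dc in dcs)
    if total > 1:
        raise ValueError(f"total RF change of {total} exceeds the limit of 1")


# Allowed: a single DC changes by one replica.
validate_rf_change({"dc1": 3, "dc2": 3}, {"dc1": 3, "dc2": 4})

# Rejected: two DCs change by one each, i.e. two pending replicas.
try:
    validate_rf_change({"dc1": 3, "dc2": 3}, {"dc1": 4, "dc2": 4})
except ValueError as e:
    print(e)
```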

(cherry picked from commit ee56bbfe61)
2024-10-08 18:06:52 +00:00
Piotr Smaron
c7b5571766 cql: join new and old KS options in ALTER tablets KS
A bug has been discovered while trying to ALTER tablets KS and
specifying only 1 out of 2 DCs - the not specified DC's RF has been
zeroed. This is because ALTER tablets KS updated the KS only with the
RF-per-DC mapping specified in the ALTER tablets KS statement, so if a
DC was omitted, it was assigned a value of RF=0.
This commit fixes that plus additionally passes all the KS options, not
only the replication options, to the topology coordinator, where the KS
update is performed.
`initial_tablets` is a special case, which requires a special handling
in the source code, as we cannot simply update old initial_tablet's
settings with the new ones, because if only ` and TABLETS = {'enabled':
true}` is specified in the ALTER tablets KS statement, we should not zero the `initial_tablets`, but
rather keep the old value - this is tested by the
`test_alter_preserves_tablets_if_initial_tablets_skipped` testcase.
Other than that, the above mentioned testcase started to fail with
these changes, and it appeared to be an issue with the test not waiting
until ALTER is completed, and thus reading the old value, hence the
test's body has been modified to wait for ALTER to complete before
performing validation.
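
The RF part of this fix can be pictured as overlaying the options from the ALTER statement on top of the existing ones. The sketch below is illustrative only: `merge_replication_options` is a hypothetical name, the option maps are simplified to plain dicts, and the special handling of `initial_tablets` is not modeled.

```python
def merge_replication_options(old: dict, new: dict) -> dict:
    """Start from the existing options and overlay only what ALTER specified."""
    merged = dict(old)
    merged.update(new)
    return merged


old = {"class": "NetworkTopologyStrategy", "dc1": "3", "dc2": "3"}
new = {"dc1": "4"}  # the ALTER statement mentions only one of the two DCs

merged = merge_replication_options(old, new)
print(merged)  # dc2 keeps its old RF instead of being zeroed
```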

(cherry picked from commit 2aabe7f09c)
2024-10-08 18:06:48 +00:00
Piotr Smaron
92325073a9 cql: fix validation of ALTERing RFs in tablets KS
The validation has been corrected with:
1. Checking if a DC specified in ALTER exists.
2. Removing `REPLICATION_STRATEGY_CLASS_KEY` key from a map of RFs that
   needs their RFs to be validated.

(cherry picked from commit 6676e47371)
2024-10-08 18:06:47 +00:00
Piotr Smaron
f5c0969c06 cql: harden alter_keyspace_statement.cc::validate_rf_difference
This function assumed that strings passed as arguments will be of
integer types, but that wasn't the case, and we missed that because this
function didn't have any validation, so this change adds proper
validation and error logging.
Arguments passed to this function were forwarded from a call to
`ks_prop_defs::get_replication_options`, which, along with the rf-per-dc mapping, also returns the
`class:replication_strategy` pair. The second member of that pair was cast
to an `int` and somehow the code was still running fine; only
extra testing added later discovered the bug here.

(cherry picked from commit 93d61d7031)
2024-10-08 18:06:46 +00:00
Gleb Natapov
90ced080a8 group: hold group0 shutdown gate during async operations
Wait for all outstanding async work that uses group0 to complete before
destroying group0 server.

Fixes scylladb/scylladb#20701

(cherry picked from commit e642f0a86d)
2024-10-08 18:06:45 +00:00
Piotr Smaron
7674d80c31 cql: validate RF change for new DCs in ALTER tablets KS
ALTER tablets KS validated if RF is not changed by more than 1 for DCs
that already had replicas, but not for DCs that didn't have them yet, so
specifying an RF jump from 0 to 2 was possible when listing a new DC in
ALTER tablets KS statement, which violated internal invariants of
tablets load balancer.
This PR fixes that bug and adds multi-DC testcases to check if adding
replicas to a new DC and removing replicas from a DC is honoring the RF
change constraints.

Refs: #20039
(cherry picked from commit 47acdc1f98)
2024-10-08 18:06:45 +00:00
Gleb Natapov
06ceef34a7 group0: Stop group0 if node initialization fails
Commit af83c5e53e moved aborting of group0 into the storage service
drain function. But it is not called if node fails during initialization
(if it failed to join cluster for instance). So let's abort on both
paths (but only once).

(cherry picked from commit ba22493a69)
2024-10-08 18:06:44 +00:00
Piotr Smaron
ec83367b45 cql: extend test_alter_tablet_keyspace_rf
Added cases to also test decreasing RF and setting the same RF.
Also added extra explanatory comments.

(cherry picked from commit 9c5950533f)
2024-10-08 18:06:44 +00:00
Piotr Smaron
dfe2e20442 cql: refactor test_tablets::test_alter_tablet_keyspace
1. Renamed the testcase to emphasize that it only focuses on testing
   changing RF - there are other tests that test ALTER tablets KS
in general.
2. Fixed whitespaces according to PEP8

(cherry picked from commit adf453af3f)
2024-10-08 18:06:42 +00:00
Piotr Smaron
ad2191e84f cql: remove unused helper function from test_tablets
`change_default_rf` is not used anywhere, moreover it uses
`replication_factor` tag, which is forbidden in ALTER tablets KS
statement.

(cherry picked from commit 042825247f)
2024-10-08 18:06:41 +00:00
Sergey Zolotukhin
855abd7368 config: Add a warning about use of IP address for join topology and replace operations.

When the '--ignore-dead-nodes-for-replace' config option contains
IP addresses, a warning will be logged, notifying the user that
using IP addresses with this option is deprecated and will no
longer be supported in the next release.

Fixes scylladb/scylladb#19218

(cherry picked from commit 6398b7548c)
2024-10-03 14:10:30 +00:00
Sergey Zolotukhin
086dc6d53c nodetool: Add IP address usage warning for 'ignore-dead-nodes'.
Since we are deprecating the use of IP addresses, a warning message will be printed
if 'nodetool removenode --ignore-dead-nodes' is used with IP addresses.

(cherry picked from commit 9c692438e9)
2024-10-03 14:10:29 +00:00
Sergey Zolotukhin
09b0b3f7d6 tests: Fix incorrect UUIDs in test_nodeops
It was found that the UUIDs used in test_nodeops were
invalid. This update replaces those UUIDs with newly generated
random UUIDs.

(cherry picked from commit a871321ecf)
2024-10-03 14:10:28 +00:00
Sergey Zolotukhin
3bbb7a24b1 utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists
- utils::split_comma_separated_list now accepts a reference to sstring instead
  of a copy to avoid extra memory allocations. Additionally, the results of
  trimming are moved to the resulting vector instead of being copied.
- service/storage_service removenode, raft_removenode, find_raft_nodes_from_hoeps,
  parse_node_list and api/storage_service::set_storage_service were changed to use
  std::vector<host_id_or_endpoint> instead of std::list<host_id_or_endpoint> as
  std::vector is a more cache-friendly structure, resulting in better performance.

(cherry picked from commit 3b9033423d)
2024-10-03 14:10:27 +00:00
Pavel Emelyanov
b43454c658 cql: Check that CREATEing tablets/vnodes is consistent with the CLI
There are two bits that control whether replication strategy for a
keyspace will use tablets or not -- the configuration option and CQL
parameter. This patch tunes its parsing to implement the logic shown
below:

    if (strategy.supports_tablets) {
         if (cql.with_tablets) {
             if (cfg.enable_tablets) {
                 return create_keyspace_with_tablets();
             } else {
                 throw "tablets are not enabled";
             }
         } else if (cql.with_tablets == off) {
              return create_keyspace_without_tablets();
         } else { // cql.with_tablets is not specified
              if (cfg.enable_tablets) {
                  return create_keyspace_with_tablets();
              } else {
                  return create_keyspace_without_tablets();
              }
         }
     } else { // strategy doesn't support tablets
         if (cql.with_tablets == on) {
             throw "invalid cql parameter";
         } else if (cql.with_tablets == off) {
             return create_keyspace_without_tablets();
         } else { // cql.with_tablets is not specified
             return create_keyspace_without_tablets();
         }
     }
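
The same decision table can be written as a small runnable function. This is a Python sketch of the logic above, not the actual C++ implementation; the function name and the use of `None` for "not specified" are assumptions made for illustration.

```python
from typing import Optional


def keyspace_uses_tablets(strategy_supports_tablets: bool,
                          cql_with_tablets: Optional[bool],
                          cfg_enable_tablets: bool) -> bool:
    """Return True to create the keyspace with tablets, False without.

    cql_with_tablets is True/False when given in CQL, None when omitted.
    Raises ValueError for the two rejected combinations.
    """
    if strategy_supports_tablets:
        if cql_with_tablets is True:
            if not cfg_enable_tablets:
                raise ValueError("tablets are not enabled")
            return True
        if cql_with_tablets is False:
            return False
        # Not specified in CQL: follow the configuration option.
        return cfg_enable_tablets
    # Strategy doesn't support tablets.
    if cql_with_tablets is True:
        raise ValueError("invalid cql parameter")
    return False


print(keyspace_uses_tablets(True, None, True))
print(keyspace_uses_tablets(True, False, True))
```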

closes: #20088

In order to enable tablets "by default" for NetworkTopologyStrategy
there's explicit check near ks_prop_defs::get_initial_tablets(), that's
not very nice. It needs more care to fix it, e.g. provide feature
service reference to abstract_replication_strategy constructor. But
since ks_prop_defs code already hijacks options specifically for that
strategy type (see prepare_options() helper), it's OK for now.

There's also #20768 misbehavior that's preserved in this patch, but
should be fixed eventually as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit ebedc57300)

Closes scylladb/scylladb#20927
2024-10-03 17:09:49 +03:00
Jenkins Promoter
93700ff5d1 Update ScyllaDB version to: 6.2.0-rc2 2024-10-02 14:58:37 +03:00
Anna Stuchlik
5e2b4a0e80 doc: add metric updates from 6.1 to 6.2
This commit specifies metrics that are new in version 6.2 compared to 6.1,
as specified in https://github.com/scylladb/scylladb/issues/20176.

Fixes https://github.com/scylladb/scylladb/issues/20176

(cherry picked from commit a97db03448)

Closes scylladb/scylladb#20930
2024-10-02 12:07:06 +03:00
Calle Wilund
bb5dc0771c commitlog: Fix buffer_list_bytes not updated correctly
Fixes #20862

With the change in 60af2f3cb2 the bookkeeping
for buffer memory was changed subtly. The problem is that we would
shrink the buffer size before the point where, after the flush, we use that
buffer's size to decrement the buffer_list_bytes value, which was previously
incremented by the full, allocated size. I.e. we would slowly grow this value
instead of adjusting properly to actual used bytes.
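
The drift can be shown with a toy counter. This is only a sketch of the accounting pattern: the class and method names are hypothetical, and `flush_balanced` shows one possible way to keep the counter balanced, which is an assumption and not necessarily what the actual fix does.

```python
class BufferAccounting:
    """Toy model of a running total of bytes held in buffers."""

    def __init__(self):
        self.buffer_list_bytes = 0

    def allocate(self, allocated_size: int) -> None:
        # The counter is incremented by the full allocated size.
        self.buffer_list_bytes += allocated_size

    def flush_buggy(self, allocated_size: int, used_size: int) -> None:
        # The buffer was already shrunk to used_size, so the decrement
        # is smaller than the earlier increment.
        self.buffer_list_bytes -= used_size

    def flush_balanced(self, allocated_size: int, used_size: int) -> None:
        # Decrement by the same amount that was added on allocation.
        self.buffer_list_bytes -= allocated_size


buggy, balanced = BufferAccounting(), BufferAccounting()
for _ in range(3):
    buggy.allocate(128)
    buggy.flush_buggy(allocated_size=128, used_size=100)
    balanced.allocate(128)
    balanced.flush_balanced(allocated_size=128, used_size=100)

print(buggy.buffer_list_bytes, balanced.buffer_list_bytes)
```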

Test included.

(cherry picked from commit ee5e71172f)

Closes scylladb/scylladb#20902
2024-10-01 17:41:02 +03:00
Aleksandra Martyniuk
9ed8519362 node_ops: fix task_manager_module::get_nodes()
Currently, node ops virtual task gathers its children from all nodes contained
in a sum of service::topology::normal_nodes and service::topology::transition_nodes.
The maps may contain nodes that are down but weren't removed yet. So, if a user
requests the status of a node ops virtual task, the task's attempt to retrieve
its children list may fail with seastar::rpc::closed_error.

Filter out the nodes that are down in node_ops::task_manager_module::get_nodes.

Fixes: #20843.
(cherry picked from commit a558abeba3)

Closes scylladb/scylladb#20898
2024-10-01 14:52:11 +03:00
Avi Kivity
077d7c06a0 Merge '[Backport 6.2] sstables: Fix use-after-free on page cache buffer when parsing promoted index entries across pages' from ScyllaDB
This fixes a use-after-free bug when parsing clustering key across
pages.

Also includes a fix for allocating section retry, which is potentially not safe (not in practice yet).

Details of the first problem:

Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.

To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.

In 93482439, the promoted index cursor was optimized to avoid
a full page copy when parsing index blocks. Instead, the parser is
given a temporary_buffer which is a view on the page.

A bit earlier, in b1b5bda, the parser was changed to keep shared
fragments of the buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.

If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.

The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.

This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.
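
The failure mode described above can be illustrated with a small Python sketch. It is only an analogy, not ScyllaDB code: `PAGE_SIZE`, `parse_blob_across_pages` and the reused `bytearray` are hypothetical stand-ins for a page cache slot whose memory is overwritten between consume() calls. Keeping views of the page gives a result of the right length with wrong contents, while copying each fragment stays correct.

```python
PAGE_SIZE = 8  # tiny page size, for illustration only


def parse_blob_across_pages(blob: bytes, keep_views: bool) -> bytes:
    """Reassemble a blob that is delivered one page at a time.

    The same bytearray is reused for every page, which models a cache
    buffer that may be moved or evicted once a consume() call returns.
    """
    page = bytearray(PAGE_SIZE)
    fragments = []
    for start in range(0, len(blob), PAGE_SIZE):
        piece = blob[start:start + PAGE_SIZE]
        page[:len(piece)] = piece             # the next page lands in the same memory
        view = memoryview(page)[:len(piece)]  # a view, valid only for this iteration
        # Unsafe variant: remember the view. Safe variant: copy the bytes out.
        fragments.append(view if keep_views else bytes(view))
    return b"".join(bytes(fragment) for fragment in fragments)
```

With `blob = b"abcdefghijkl"`, the copying variant returns the blob unchanged, while the view-keeping variant returns `b"ijklefghijkl"`: the length is still 12, so nothing fails during parsing, but the first fragment now reflects the second page.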

Details of the solution:

We adapt page_view to a temporary_buffer-like API. For this, a new concept
is introduced called ContiguousSharedBuffer. We also change parsers so that
they can be templated on the type of the buffer they work with (page_view vs
temporary_buffer). This way we don't introduce indirection to existing algorithms.

We use page_view instead of temporary_buffer in the promoted
index parser which works with page cache buffers. page_view can be safely
shared via share() and stored across allocating sections. It keeps hold of the
LSA buffer even across allocating sections by means of cached_file::page_ptr.

Fixes #20766

(cherry picked from commit 8aca93b3ec)

(cherry picked from commit ac823b1050)

(cherry picked from commit 93bfaf4282)

(cherry picked from commit c0fa49bab5)

(cherry picked from commit 29498a97ae)

(cherry picked from commit c15145b71d)

(cherry picked from commit 7670ee701a)

(cherry picked from commit c09fa0cb98)

(cherry picked from commit 0279ac5faa)

(cherry picked from commit 8e54ecd38e)

(cherry picked from commit b5ae7da9d2)

Refs #20837

Closes scylladb/scylladb#20905

* github.com:scylladb/scylladb:
  sstables: bsearch_clustered_cursor: Add trace-level logging
  sstables: bsearch_clustered_cursor: Move definitions out of line
  test, sstables: Verify parsing stability when allocating section is retried
  test, sstables: Verify parsing stability when buffers cross page boundary
  sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
  cached_file: Adapt page_view to ContiguousSharedBuffer
  cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
  sstables, utils: Allow parsers to work with different buffer types
  sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
  sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
  sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried
2024-10-01 14:51:29 +03:00
Tomasz Grabiec
5a1575678b sstables: bsearch_clustered_cursor: Add trace-level logging
(cherry picked from commit b5ae7da9d2)
2024-10-01 01:38:48 +00:00
Tomasz Grabiec
2401f7f9ca sstables: bsearch_clustered_cursor: Move definitions out of line
In order to later use the formatter for the inner class
promoted_index_block, which is defined out of line after
cached_promoted_index class definition.

(cherry picked from commit 8e54ecd38e)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
906d085289 test, sstables: Verify parsing stability when allocating section is retried
(cherry picked from commit 0279ac5faa)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
34dd3a6daa test, sstables: Verify parsing stability when buffers cross page boundary
(cherry picked from commit c09fa0cb98)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
3afa8ee2ca sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
This fixes a use-after-free bug when parsing clustering key across
pages.

Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.

To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.

In b1b5bda, the parser was changed to keep shared fragments of the
buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.

If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.

The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.

This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.

The solution is to use page_view instead of temporary_buffer, which
can be safely shared via share() and stored across allocating
sections. The page_view maintains its hold on the LSA buffer even
across allocating sections.

Fixes #20766

(cherry picked from commit 7670ee701a)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
3347152ff9 cached_file: Adapt page_view to ContiguousSharedBuffer
(cherry picked from commit c15145b71d)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
ff7bd937e2 cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
Will be easier to implement ContiguousSharedBuffer API as the buffer
size will be equal to _size.

(cherry picked from commit 29498a97ae)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
50ea1dbe32 sstables, utils: Allow parsers to work with different buffer types
Currently, parsers work with temporary_buffer<char>. This is unsafe
when invoked by bsearch_clustered_cursor, which reuses some of the
parsers, and passes temporary_buffer<char> which is a view onto LSA
buffer which comes from the index file page cache. This view is stable
only around consume(). If parsing requires more than one page, it will
continue with a different input buffer. The old buffer will be
invalid, and it's unsafe for the parser to store and access
it. Unfortunately, the temporary_buffer API allows sharing the buffer
via the share() method, which shares the underlying memory area. This
is not correct when the underlying storage is managed by LSA, because
it may move. The parser uses this sharing when parsing blobs, e.g. clustering
key components. When parsing resumes in the next page, the parser will try
to access the stored shared buffers pointing to the previous page,
which may result in use-after-free on the memory area.

In preparation for fixing the problem, parametrize parsers to work with
different kinds of buffers. This will allow us to instantiate them
with a buffer kind which supports sharing of LSA buffers properly in a
safe way.

It's not purely mechanical work. Some parts of the parsing state
machine still work with temporary_buffer<char>, and allocate buffers
internally when reading into a linearized destination buffer. They used
to store this destination in the _read_bytes vector, the same field which is
used to store the shared buffers. This is no longer possible, since the shared
buffer type may differ from temporary_buffer<char>. So those
paths were changed to use a new field: _read_bytes_buf.

(cherry picked from commit c0fa49bab5)
2024-10-01 01:38:47 +00:00
Tomasz Grabiec
45125c4d7d sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
When reset() is called due to an allocating section retry, the parser can
theoretically be at an arbitrary point. So we should not assume that it
finished parsing and that its state was reset by the previous parse. We should
reset all the fields.
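
A minimal Python sketch of the idea, under stated assumptions: `BlockParser`, `RetryNeeded` and `parse_with_retry` are hypothetical names, not the real promoted_index_block_parser or allocating section API. The point is only that reset() clears every field, so a retry that interrupts parsing midway starts from a clean state.

```python
class RetryNeeded(Exception):
    """Hypothetical signal that the enclosing section must be re-run."""


class BlockParser:
    """Hypothetical stateful parser; not the real ScyllaDB class."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear every field, not only those a completed parse would clear.
        self.state = "START"
        self.pending = bytearray()
        self.result = None

    def consume(self, chunk: bytes):
        self.state = "IN_KEY"
        self.pending += chunk

    def finish(self) -> bytes:
        self.result = bytes(self.pending)
        self.state = "DONE"
        return self.result


def parse_with_retry(parser, chunks, fail_once_before=None):
    """Parse all chunks, restarting from scratch whenever a retry is requested."""
    attempt = 0
    while True:
        parser.reset()  # a retry may have interrupted parsing at any point
        try:
            for index, chunk in enumerate(chunks):
                if attempt == 0 and index == fail_once_before:
                    raise RetryNeeded  # simulate a retry in the middle of parsing
                parser.consume(chunk)
            return parser.finish()
        except RetryNeeded:
            attempt += 1
```

If reset() left `pending` untouched, the retried parse in this sketch would return the first chunk twice.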

(cherry picked from commit 93bfaf4282)
2024-10-01 01:38:46 +00:00
Tomasz Grabiec
9207f7823d sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
To unify logic which handles allocating section retry, and thus
improve safety.

(cherry picked from commit ac823b1050)
2024-10-01 01:38:46 +00:00
Tomasz Grabiec
711864687f sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried
Parser's state was not reset when allocating section was retried.

This doesn't cause problems in practice, because reserves are enough
to cover allocation demands of parsing clustering keys, which are at
most 64K in size. But it's still potentially unsafe and needs fixing.

(cherry picked from commit 8aca93b3ec)
2024-10-01 01:38:45 +00:00
Kamil Braun
faf11e5bc3 Merge '[Backport 6.2] Populate raft address map from gossiper on raft configuration change' from ScyllaDB
For each new node added to the raft config, populate its ID to IP mapping in the raft address map from the gossiper. The mapping may have expired if a node is added to the raft configuration long after it first appears in the gossiper.

Fixes scylladb/scylladb#20600

Backport to all supported versions since the bug may cause bootstrapping failure.

(cherry picked from commit bddaf498df)

(cherry picked from commit 9e4cd32096)

Refs #20601

Closes scylladb/scylladb#20847

* github.com:scylladb/scylladb:
  test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
  group0: make sure that address map has an entry for each new node in the raft configuration
2024-09-30 17:01:52 +02:00
Gleb Natapov
f9215b4d7e test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
(cherry picked from commit 9e4cd32096)
2024-09-26 21:13:34 +00:00
Gleb Natapov
469ac9976a group0: make sure that address map has an entry for each new node in the raft configuration
ID->IP mapping is added to the raft address map when the mapping first
appears in the gossiper, but it is added as an expiring entry. It becomes
non-expiring when the node is added to the raft configuration. But when a node
joins, those two events may be distant in time (since the node's request
may sit in the topology coordinator queue for a while) and the mapping may
already have expired from the map. This patch makes sure to transfer the
mapping from the gossiper for a node that is added to the raft
configuration instead of assuming that the mapping is already there.
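
The behaviour can be sketched in Python as follows. This is an illustration under assumptions, not the actual raft address map: `AddressMap`, `on_raft_config_change` and the dictionary standing in for the gossiper are all hypothetical.

```python
import time


class AddressMap:
    """Hypothetical ID -> IP map with expiring and non-expiring entries."""

    def __init__(self, ttl, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # node_id -> (ip, expiry time, or None if non-expiring)

    def add_expiring(self, node_id, ip):
        self._entries[node_id] = (ip, self._clock() + self._ttl)

    def set_nonexpiring(self, node_id, ip):
        self._entries[node_id] = (ip, None)

    def find(self, node_id):
        entry = self._entries.get(node_id)
        if entry is None:
            return None
        ip, expiry = entry
        if expiry is not None and self._clock() >= expiry:
            del self._entries[node_id]  # the entry has expired
            return None
        return ip


def on_raft_config_change(address_map, gossiper_addresses, added_ids):
    """Re-populate mappings from the gossiper instead of assuming they still exist."""
    for node_id in added_ids:
        ip = gossiper_addresses.get(node_id)
        if ip is not None:
            address_map.set_nonexpiring(node_id, ip)
```

In this sketch an entry added on first gossip can expire while the join request waits in the queue; the configuration-change hook then restores it as non-expiring.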

(cherry picked from commit bddaf498df)
2024-09-26 21:13:33 +00:00
Botond Dénes
d341f1ef1e Merge '[Backport 6.2] mark node as being replaced earlier' from ScyllaDB
Before 17f4a151ce the node was marked as
being replaced in the join_group0 state, before it actually joined group0,
so by the time it actually joined and started transferring the snapshot/log no
traffic was sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred, so
traffic may reach the node while it still has not caught up with the
leader, which may cause problems since its state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.

Fixes: https://github.com/scylladb/scylladb/issues/20629

Needs to be backported since this is a regression

(cherry picked from commit 644e7a2012)

(cherry picked from commit c0939d86f9)

(cherry picked from commit 1b4c255ffd)

Refs #20743

Closes scylladb/scylladb#20829

* github.com:scylladb/scylladb:
  test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
  topology coordinator:: mark node as being replaced earlier
  topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
2024-09-26 10:37:25 +03:00
Kamil Braun
07dfcd1f64 service: raft: fix rpc error message
What it called "leader" is actually the destination of the RPC.

Trivial fix, should be backported to all affected versions.

(cherry picked from commit 09c68c0731)

Closes scylladb/scylladb#20826
2024-09-26 10:33:50 +03:00
Anna Stuchlik
f8d63b5572 doc: add OS support for version 6.2
This commit adds the OS support for version 6.2.
In addition, it removes support for 6.0, as the policy is only to include
information for the supported versions, i.e., the two latest versions.

Fixes https://github.com/scylladb/scylladb/issues/20804

(cherry picked from commit 8145109120)

Closes scylladb/scylladb#20825
2024-09-26 10:29:08 +03:00
Anna Stuchlik
ca83da91d1 doc: add an intro to the Features page
This commit modifies the Features page in the following way:

- It adds a short introduction and descriptions to each listed feature.
- It hides the ToC (required to control and modify the information on the page,
  e.g., to add descriptions, have full control over what is displayed, etc.)
- Removes the info about Enterprise features (following the request not to include
  Enterprise info in the OSS docs)

Fixes https://github.com/scylladb/scylladb/issues/20617
Blocks https://github.com/scylladb/scylla-enterprise/pull/4711

(cherry picked from commit da8047a834)

Closes scylladb/scylladb#20811
2024-09-26 10:22:36 +03:00
Botond Dénes
f55081fb1a Merge '[Backport 6.2] Rename Alternator batch item count metrics' from ScyllaDB
This PR addresses multiple issues with alternator batch metrics:

1. Rename the metrics to scylla_alternator_batch_item_count with op=BatchGetItem/BatchWriteItem
2. The batch size calculation was wrong and didn't count all items in the batch.
3. Add a test to validate that the metrics values increase by the correct value (not just increase). This also requires an addition to the testing to validate ops of different metrics and an exact value change.

Needs backporting to allow the monitoring to use the correct metrics names.

Fixes #20571

(cherry picked from commit 515857a4a9)

(cherry picked from commit 905408f764)

(cherry picked from commit 4d57a43815)

(cherry picked from commit 8dec292698)

Refs #20646

Closes scylladb/scylladb#20758

* github.com:scylladb/scylladb:
  alternator:test_metrics test metrics for batch item count
  alternator:test_metrics Add validating the increased value
  alternator: Fix item counting in batch operations
  Alterntor rename batch item count metrics
2024-09-26 10:22:00 +03:00
Anna Stuchlik
aa8cdec5bd doc: fix a broken link
This commit fixes a link to the Manager by adding a missing underscore
to the external link.

(cherry picked from commit aa0c95c95c)

Closes scylladb/scylladb#20710
2024-09-26 10:18:59 +03:00
Anna Stuchlik
75a2484dba doc: update the unified installer instructions
This commit updates the unified installer instructions to avoid specifying a given version.
At the moment, we're technically unable to use variables in URLs, so we need to update
the page each release.

Fixes https://github.com/scylladb/scylladb/issues/20677

(cherry picked from commit 400a14eefa)

Closes scylladb/scylladb#20708
2024-09-26 10:04:35 +03:00
Gleb Natapov
37387135b4 test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
(cherry picked from commit 1b4c255ffd)
2024-09-26 03:45:50 +00:00
Gleb Natapov
ac24ab5141 topology coordinator:: mark node as being replaced earlier
Before 17f4a151ce the node was marked as
being replaced in the join_group0 state, before it actually joined group0,
so by the time it actually joined and started transferring the snapshot/log no
traffic was sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred, so
traffic may reach the node while it still has not caught up with the
leader, which may cause problems since its state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.

(cherry picked from commit c0939d86f9)
2024-09-26 03:45:50 +00:00
Gleb Natapov
729dc03e0c topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
During replace with the same IP a node may get queries that were intended
for the node it was replacing, since the new node declares itself UP
before it advertises that it is a replacement. But after the node
starts the replace procedure the old node is marked as "being replaced"
and queries are no longer sent there. It is important to do so before the
new node starts to get the raft snapshot, since the snapshot application is
not atomic and queries that run in parallel with it may see partial state
and fail in weird ways. Queries that are sent before that will fail
because the schema is empty, so they will not find any tables in the first
place. This is pre-existing and not addressed by this patch.

(cherry picked from commit 644e7a2012)
2024-09-26 03:45:50 +00:00
Kamil Braun
9d64ced982 test: fix topology_custom/test_raft_recovery_stuck flakiness
The test performs consecutive schema changes in RECOVERY mode. The
second change relies on the first. However the driver might route the
changes to different servers and we don't have group 0 to guarantee
linearizability. We must rely on the first change coordinator to push
the schema mutations to other servers before returning, but that only
happens when it sees other servers as alive when doing the schema
change. It wasn't guaranteed in the test. Fix this.

Fixes scylladb/scylladb#20791

Should be backported to all branches containing this test to reduce
flakiness.

(cherry picked from commit f390d4020a)

Closes scylladb/scylladb#20807
2024-09-25 15:11:10 +02:00
Abhinav
ea6349a6f5 raft topology: add error for removal of non-normal nodes
Currently, we check if a node being removed is normal
on the node initiating the removenode request. However, we don't have a
similar check on the topology coordinator. The node being removed could be
normal when we initiate the request, but it doesn't have to be normal when
the topology coordinator starts handling the request.
For example, the topology coordinator could have removed this node while handling
another removenode request that was added to the request queue earlier.

This commit intends to fix this issue by adding more checks in the enqueuing phase
and returning errors for duplicate node removal requests.
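
A rough Python sketch of such enqueue-time checks, under assumptions: `RemoveNodeError`, `enqueue_removenode`, the state dictionary and the list-based queue are hypothetical and do not reflect the topology coordinator's real data structures.

```python
class RemoveNodeError(Exception):
    """Raised when a removenode request cannot be accepted."""


def enqueue_removenode(node_states, request_queue, node_id):
    """Validate a removenode request before it is queued.

    node_states maps node IDs to a state string, and request_queue is the
    list of node IDs already waiting for removal.
    """
    if node_id in request_queue:
        raise RemoveNodeError(f"removal of {node_id} is already queued")
    if node_states.get(node_id) != "normal":
        raise RemoveNodeError(f"{node_id} is not in the normal state")
    request_queue.append(node_id)
```

Rejecting duplicates when the request is enqueued means the coordinator never has to handle a second request for a node that an earlier request has already removed.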

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#20271
(cherry picked from commit b25b8dccbd)

Closes scylladb/scylladb#20799
2024-09-25 11:34:20 +02:00
Benny Halevy
ed9122a84e time_window_compaction_strategy: get_reshaping_job: restrict sort of multi_window vector to its size
Currently the function calls boost::partial_sort with a middle
iterator that might be out of bounds and cause undefined behavior.

Check the vector size, and do a partial sort only if it's longer
than `max_sstables`; otherwise sort the whole vector.
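
The size check can be sketched in Python. This is an analogy rather than the C++ fix itself: Python has no partial_sort, so the hypothetical helper `sort_smallest_prefix` emulates it with `heapq.nsmallest` and falls back to a full sort when the list is not longer than `max_items`.

```python
import heapq


def sort_smallest_prefix(items, max_items, key=lambda x: x):
    """Reorder items in place so the first max_items entries are the smallest, in order.

    When the list has no more than max_items entries there is no valid
    "middle" position inside it, so the whole list is sorted instead.
    """
    if len(items) <= max_items:
        items.sort(key=key)
        return
    # Indices of the max_items smallest elements, in ascending key order.
    order = heapq.nsmallest(max_items, range(len(items)), key=lambda i: key(items[i]))
    chosen = set(order)
    rest = [items[i] for i in range(len(items)) if i not in chosen]
    items[:] = [items[i] for i in order] + rest
```

The elements after the sorted prefix are left in their original relative order, which is stricter than what a partial sort guarantees.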

Fixes scylladb/scylladb#20608

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#20609

(cherry picked from commit 39ce358d82)

Refs: scylladb/scylladb#20609
2024-09-23 16:02:40 +03:00
Amnon Heiman
c7d6b4a194 alternator:test_metrics test metrics for batch item count
This patch adds tests for the batch operations item count.

The tests validate that the metrics tracking the number of items
processed in a batch increase by the correct amount.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 8dec292698)
2024-09-23 11:02:55 +00:00
Amnon Heiman
a35e138b22 alternator:test_metrics Add validating the increased value
The `check_increases_operation` now allows overriding the checked metric.

Additionally, a custom validation value can now be passed, which makes it
possible to validate the amount by which a value has changed, rather
than just validating that the value increased.

The default behavior of validating that values have increased remains
unchanged, ensuring backward compatibility.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 4d57a43815)
2024-09-23 11:02:55 +00:00
Amnon Heiman
3db67faa8a alternator: Fix item counting in batch operations
This patch fixes the logic for counting items in batch operations.
Previously, the item count in requests was inaccurate: it counted the
number of tables in get_item and the request_items in write_items.

The new logic correctly counts each individual item in `BatchGetItem`
and `BatchWriteItem` requests.
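
As an illustration of counting individual items rather than tables, here is a Python sketch over the DynamoDB request shapes (`RequestItems` mapping each table to `Keys` for `BatchGetItem`, and to a list of put/delete requests for `BatchWriteItem`). The function names are hypothetical and this is not the Alternator implementation.

```python
def count_batch_get_items(request):
    """Count every key requested by a BatchGetItem call, across all tables."""
    tables = request.get("RequestItems", {})
    return sum(len(table.get("Keys", [])) for table in tables.values())


def count_batch_write_items(request):
    """Count every put or delete in a BatchWriteItem call, across all tables."""
    tables = request.get("RequestItems", {})
    return sum(len(operations) for operations in tables.values())
```

A request that reads three keys from one table and one key from another counts as four items, not as two tables.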

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 905408f764)
2024-09-23 11:02:55 +00:00
Amnon Heiman
6a12174e2d Alterntor rename batch item count metrics
This patch renames metrics tracking the total number of items in a batch
to `scylla_alternator_batch_item_count`.  It uses the existing `op` label to
differentiate between `BatchGetItem` and `BatchWriteItem` operations.

Ensures better clarity and distinction for batch operations in monitoring.

This is an example of what it looks like:
 # HELP scylla_alternator_batch_item_count The total number of items processed across all batches
 # TYPE scylla_alternator_batch_item_count counter
 scylla_alternator_batch_item_count{op="BatchGetItem",shard="0"} 4
 scylla_alternator_batch_item_count{op="BatchWriteItem",shard="0"} 4

(cherry picked from commit 515857a4a9)
2024-09-23 11:02:55 +00:00
Piotr Dulikowski
ca0096ccb8 Merge '[Backport 6.2] message/messaging_service: guard adding maintenance tenant under cluster feature' from Michał Jadwiszczak
In https://github.com/scylladb/scylladb/pull/18729, we introduced a new statement tenant, $maintenance, but the change wasn't protected by any cluster feature.
This wasn't a problem for OSS, since an unknown isolation cookie just uses the default scheduling group. However, in enterprise it leads to creating a service level on not-yet-upgraded nodes, which may end in an error if the user has already created the maximum number of service levels.

This patch adds a cluster feature to guard adding the new tenant. It is done in a way that handles two upgrade scenarios:

- version without $maintenance tenant -> version with $maintenance tenant guarded by a feature
- version with $maintenance tenant but not guarded by a feature -> version with $maintenance tenant guarded by a feature

The PR adds an `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create a connection, but it can be used to accept an incoming connection.
The $maintenance tenant is added to the config as disabled and it gets enabled once the corresponding feature is enabled.

Fixes https://github.com/scylladb/scylladb/issues/20070
Refs https://github.com/scylladb/scylla-enterprise/issues/4403

(cherry picked from commit d44844241d)

(cherry picked from commit 71a03ef6b0)

(cherry picked from commit b4b91ca364)

Refs https://github.com/scylladb/scylladb/pull/19802

Closes scylladb/scylladb#20690

* github.com:scylladb/scylladb:
  message/messaging_service: guard adding maintenance tenant under cluster feature
  message/messaging_service: add feature_service dependency
  message/messaging_service: add `enabled` flag to statement tenants
2024-09-23 09:48:12 +02:00
Jenkins Promoter
a71d4bc49c Update ScyllaDB version to: 6.2.0-rc1 2024-09-19 10:21:33 +03:00
Michał Jadwiszczak
749399e4b8 message/messaging_service: guard adding maintenance tenant under cluster feature
Set `enabled` flag for `$maintenance` tenant to false and
enable it when `MAINTENANCE_TENANT` feature is enabled.

(cherry-picked from b4b91ca364)
2024-09-18 19:10:24 +02:00
Michał Jadwiszczak
bdd97b2950 message/messaging_service: add feature_service dependency
(cherry-picked from 71a03ef6b0)
2024-09-18 19:09:46 +02:00
Michał Jadwiszczak
1a056f0cab message/messaging_service: add enabled flag to statement tenants
Adding a new tenant needs to be done under cluster feature protection.
However, this wasn't the case for adding the `$maintenance` statement tenant,
and to fix it we need to support an upgrade from a node which doesn't
know about the maintenance tenant at all and from one which uses it without
any cluster feature protection.

This commit adds `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create
a connection, but it can be used to accept an incoming connection.
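
The asymmetry between outgoing and incoming connections can be sketched in Python. Everything here is an assumption made for illustration: `Tenant`, `TenantRegistry`, the `"default"` fallback name and the choice to fall back to it for a disabled tenant are hypothetical, not the messaging_service code.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    enabled: bool = True


class TenantRegistry:
    """Hypothetical registry of statement tenants."""

    def __init__(self, tenants, default="default"):
        self._tenants = {tenant.name: tenant for tenant in tenants}
        self._default = default

    def for_outgoing(self, name):
        # A disabled tenant must not be used to create a connection.
        tenant = self._tenants.get(name)
        if tenant is not None and tenant.enabled:
            return tenant.name
        return self._default

    def for_incoming(self, name):
        # A known tenant is accepted on incoming connections even while disabled.
        if name in self._tenants:
            return name
        return self._default

    def enable(self, name):
        # Called once the guarding cluster feature becomes enabled.
        self._tenants[name].enabled = True
```

Accepting a disabled tenant on the incoming side covers peers that already use it without feature protection, while the outgoing side waits until the whole cluster supports it.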

(cherry-picked from d44844241d)
2024-09-18 19:09:06 +02:00
Tzach Livyatan
cf78a2caca Update client-node-encryption: OpsnSSL is FIPS *enabled*
Closes scylladb/scylladb#19705

(cherry picked from commit cb864b11d8)
2024-09-18 11:58:46 +03:00
Anna Mikhlin
cbc53f0e81 Update ScyllaDB version to: 6.2.0-rc0 2024-09-17 13:40:50 +03:00
203 changed files with 4628 additions and 1353 deletions

.github/scripts/auto-backport.py vendored Executable file

@@ -0,0 +1,186 @@
#!/usr/bin/env python3
import argparse
import os
import re
import sys
import tempfile
import logging
from github import Github, GithubException
from git import Repo, GitCommandError
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
try:
github_token = os.environ["GITHUB_TOKEN"]
except KeyError:
print("Please set the 'GITHUB_TOKEN' environment variable")
sys.exit(1)
def is_pull_request():
return '--pull-request' in sys.argv[1:]
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--repo', type=str, required=True, help='Github repository name')
parser.add_argument('--base-branch', type=str, default='refs/heads/master', help='Base branch')
parser.add_argument('--commits', default=None, type=str, help='Range of promoted commits.')
parser.add_argument('--pull-request', type=int, help='Pull request number to be backported')
parser.add_argument('--head-commit', type=str, required=is_pull_request(), help='The HEAD of target branch after the pull request specified by --pull-request is merged')
return parser.parse_args()
def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr_title, commits, is_draft=False):
pr_body = f'{pr.body}\n\n'
for commit in commits:
pr_body += f'- (cherry picked from commit {commit})\n\n'
pr_body += f'Parent PR: #{pr.number}'
try:
backport_pr = repo.create_pull(
title=backport_pr_title,
body=pr_body,
head=f'scylladbbot:{new_branch_name}',
base=base_branch_name,
draft=is_draft
)
logging.info(f"Pull request created: {backport_pr.html_url}")
backport_pr.add_to_assignees(pr.user)
if is_draft:
backport_pr.add_to_labels("conflicts")
pr_comment = f"@{pr.user} - This PR was marked as draft because it has conflicts\n"
pr_comment += "Please resolve them and mark this PR as ready for review"
backport_pr.create_issue_comment(pr_comment)
logging.info(f"Assigned PR to original author: {pr.user}")
return backport_pr
except GithubException as e:
if 'A pull request already exists' in str(e):
logging.warning(f'A pull request already exists for {pr.user}:{new_branch_name}')
else:
logging.error(f'Failed to create PR: {e}')
def get_pr_commits(repo, pr, stable_branch, start_commit=None):
commits = []
if pr.merged:
merge_commit = repo.get_commit(pr.merge_commit_sha)
if len(merge_commit.parents) > 1: # Check if this merge commit includes multiple commits
commits.append(pr.merge_commit_sha)
else:
if start_commit:
promoted_commits = repo.compare(start_commit, stable_branch).commits
else:
promoted_commits = repo.get_commits(sha=stable_branch)
for commit in pr.get_commits():
for promoted_commit in promoted_commits:
commit_title = commit.commit.message.splitlines()[0]
# In Scylla-pkg and scylla-dtest, for example,
# we don't create a merge commit for a PR with multiple commits,
# according to the GitHub API, the last commit will be the merge commit,
# which is not what we need when backporting (we need all the commits).
# So here, we are validating the correct SHA for each commit so we can cherry-pick
if promoted_commit.commit.message.startswith(commit_title):
commits.append(promoted_commit.sha)
elif pr.state == 'closed':
events = pr.get_issue_events()
for event in events:
if event.event == 'closed':
commits.append(event.commit_id)
return commits
def create_pr_comment_and_remove_label(pr, comment_body):
labels = pr.get_labels()
pattern = re.compile(r"backport/\d+\.\d+$")
for label in labels:
if pattern.match(label.name):
print(f"Removing label: {label.name}")
comment_body += f'- {label.name}\n'
pr.remove_from_labels(label)
pr.create_issue_comment(comment_body)
def backport(repo, pr, version, commits, backport_base_branch):
new_branch_name = f'backport/{pr.number}/to-{version}'
backport_pr_title = f'[Backport {version}] {pr.title}'
repo_url = f'https://scylladbbot:{github_token}@github.com/{repo.full_name}.git'
fork_repo = f'https://scylladbbot:{github_token}@github.com/scylladbbot/{repo.name}.git'
with (tempfile.TemporaryDirectory() as local_repo_path):
try:
repo_local = Repo.clone_from(repo_url, local_repo_path, branch=backport_base_branch)
repo_local.git.checkout(b=new_branch_name)
is_draft = False
for commit in commits:
try:
repo_local.git.cherry_pick(commit, '-m1', '-x')
except GitCommandError as e:
logging.warning(f'Cherry-pick conflict on commit {commit}: {e}')
is_draft = True
repo_local.git.add(A=True)
repo_local.git.cherry_pick('--continue')
if not repo.private and not repo.has_in_collaborators(pr.user.login):
repo.add_to_collaborators(pr.user.login, permission="push")
comment = f':warning: @{pr.user.login} you have been added as collaborator to scylladbbot fork '
comment += f'Please check your inbox and approve the invitation, once it is done, please add the backport labels again'
create_pr_comment_and_remove_label(pr, comment)
return
repo_local.git.push(fork_repo, new_branch_name, force=True)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft=is_draft)
except GitCommandError as e:
logging.warning(f"GitCommandError: {e}")
def main():
args = parse_args()
base_branch = args.base_branch.split('/')[2]
promoted_label = 'promoted-to-master'
repo_name = args.repo
if 'scylla-enterprise' in args.repo:
promoted_label = 'promoted-to-enterprise'
stable_branch = base_branch
backport_branch = 'branch-'
backport_label_pattern = re.compile(r'backport/\d+\.\d+$')
g = Github(github_token)
repo = g.get_repo(repo_name)
closed_prs = []
start_commit = None
if args.commits:
start_commit, end_commit = args.commits.split('..')
commits = repo.compare(start_commit, end_commit).commits
for commit in commits:
match = re.search(rf"Closes .*#([0-9]+)", commit.commit.message, re.IGNORECASE)
if match:
pr_number = int(match.group(1))
pr = repo.get_pull(pr_number)
closed_prs.append(pr)
if args.pull_request:
start_commit = args.head_commit
pr = repo.get_pull(args.pull_request)
closed_prs = [pr]
for pr in closed_prs:
labels = [label.name for label in pr.labels]
backport_labels = [label for label in labels if backport_label_pattern.match(label)]
if promoted_label not in labels:
print(f'no {promoted_label} label: {pr.number}')
continue
if not backport_labels:
print(f'no backport label: {pr.number}')
continue
commits = get_pr_commits(repo, pr, stable_branch, start_commit)
logging.info(f"Found PR #{pr.number} with commit {commits} and the following labels: {backport_labels}")
for backport_label in backport_labels:
version = backport_label.replace('backport/', '')
backport_base_branch = backport_label.replace('backport/', backport_branch)
backport(repo, pr, version, commits, backport_base_branch)
if __name__ == "__main__":
main()
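
As a rough, self-contained illustration of the two string-handling steps in `main()` above (finding the PR number in a `Closes ...#N` commit message, and turning a `backport/X.Y` label into a `branch-X.Y` target), here is a sketch that makes no GitHub API calls. The helper names `closed_pr_number` and `backport_targets` and the sample inputs are assumptions made for this example; only the two regular expressions and the `branch-` prefix come from the script.

```python
import re

# Patterns copied from the script above; everything else is illustrative.
CLOSES_PATTERN = re.compile(r"Closes .*#([0-9]+)", re.IGNORECASE)
BACKPORT_LABEL_PATTERN = re.compile(r"backport/\d+\.\d+$")


def closed_pr_number(commit_message):
    """Return the PR number from a 'Closes ...#N' line, or None (hypothetical helper)."""
    match = CLOSES_PATTERN.search(commit_message)
    return int(match.group(1)) if match else None


def backport_targets(labels):
    """Map each 'backport/X.Y' label to a 'branch-X.Y' base branch (hypothetical helper)."""
    return [
        label.replace("backport/", "branch-")
        for label in labels
        if BACKPORT_LABEL_PATTERN.match(label)
    ]
```

Labels that only resemble a backport label, such as a hypothetical `backport/6.2-done`, are skipped because the pattern is anchored with `$`.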

@@ -16,13 +16,8 @@ def parser():
parser = argparse.ArgumentParser()
parser.add_argument('--repository', type=str, required=True,
help='Github repository name (e.g., scylladb/scylladb)')
parser.add_argument('--commit_before_merge', type=str, required=True, help='Git commit ID to start labeling from ('
'newest commit).')
parser.add_argument('--commit_after_merge', type=str, required=True,
help='Git commit ID to end labeling at (oldest '
'commit, exclusive).')
parser.add_argument('--update_issue', type=bool, default=False, help='Set True to update issues when backport was '
'done')
parser.add_argument('--commits', type=str, required=True, help='Range of promoted commits.')
parser.add_argument('--label', type=str, default='promoted-to-master', help='Label to use')
parser.add_argument('--ref', type=str, required=True, help='PR target branch')
return parser.parse_args()
@@ -53,10 +48,11 @@ def main():
target_branch = re.search(r'branch-(\d+\.\d+)', args.ref)
g = Github(github_token)
repo = g.get_repo(args.repository, lazy=False)
commits = repo.compare(head=args.commit_after_merge, base=args.commit_before_merge)
start_commit, end_commit = args.commits.split('..')
commits = repo.compare(start_commit, end_commit).commits
processed_prs = set()
# Print commit information
for commit in commits.commits:
for commit in commits:
print(f'Commit sha is: {commit.sha}')
match = pr_pattern.search(commit.commit.message)
if match:
@@ -66,13 +62,13 @@ def main():
if target_branch:
pr = repo.get_pull(pr_number)
branch_name = target_branch[1]
refs_pr = re.findall(r'Refs (?:#|https.*?)(\d+)', pr.body)
refs_pr = re.findall(r'Parent PR: (?:#|https.*?)(\d+)', pr.body)
if refs_pr:
print(f'branch-{target_branch.group(1)}, pr number is: {pr_number}')
# 1. change the backport label of the parent PR to note that
# we've merge the corresponding backport PR
# we've merged the corresponding backport PR
# 2. close the backport PR and leave a comment on it to note
# that it has been merged with a certain git commit,
# that it has been merged with a certain git commit.
ref_pr_number = refs_pr[0]
mark_backport_done(repo, ref_pr_number, branch_name)
comment = f'Closed via {commit.sha}'
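
The hunk above switches the marker that links a backport PR to its parent from `Refs` to `Parent PR:`. The following is a minimal sketch of how the new pattern behaves on a few made-up PR bodies. The helper name `parent_pr_numbers` and the guard against a missing body are assumptions of this sketch, not part of the script.

```python
import re

# New pattern from the hunk above; the helper around it is illustrative.
PARENT_PR_PATTERN = re.compile(r"Parent PR: (?:#|https.*?)(\d+)")


def parent_pr_numbers(pr_body):
    """Return every parent PR number referenced in a PR body (hypothetical helper)."""
    # Treating a missing body as empty text is an addition of this sketch.
    return [int(number) for number in PARENT_PR_PATTERN.findall(pr_body or "")]
```

A body that still uses the old `Refs #N` form yields an empty list, so such a PR would no longer be treated as a backport of a parent PR.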

@@ -5,9 +5,10 @@ on:
branches:
- master
- branch-*.*
env:
DEFAULT_BRANCH: 'master'
- enterprise
pull_request_target:
types: [labeled]
branches: [master, next, enterprise]
jobs:
check-commit:
@@ -20,17 +21,51 @@ jobs:
env:
GITHUB_CONTEXT: ${{ toJson(github) }}
run: echo "$GITHUB_CONTEXT"
- name: Set Default Branch
id: set_branch
run: |
if [[ "${{ github.repository }}" == *enterprise* ]]; then
echo "DEFAULT_BRANCH=enterprise" >> $GITHUB_ENV
else
echo "DEFAULT_BRANCH=master" >> $GITHUB_ENV
fi
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
fetch-depth: 0 # Fetch all history for all tags and branches
- name: Set up Git identity
run: |
git config --global user.name "GitHub Action"
git config --global user.email "action@github.com"
git config --global merge.conflictstyle diff3
- name: Install dependencies
run: sudo apt-get install -y python3-github
run: sudo apt-get install -y python3-github python3-git
- name: Run python script
if: github.event_name == 'push'
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: python .github/scripts/label_promoted_commits.py --commit_before_merge ${{ github.event.before }} --commit_after_merge ${{ github.event.after }} --repository ${{ github.repository }} --ref ${{ github.ref }}
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/label_promoted_commits.py --commits ${{ github.event.before }}..${{ github.sha }} --repository ${{ github.repository }} --ref ${{ github.ref }}
- name: Run auto-backport.py when promotion completed
if: ${{ github.event_name == 'push' && github.ref == format('refs/heads/{0}', env.DEFAULT_BRANCH) }}
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --commits ${{ github.event.before }}..${{ github.sha }}
- name: Check if label starts with 'backport/' and contains digits
id: check_label
run: |
label_name="${{ github.event.label.name }}"
if [[ "$label_name" =~ ^backport/[0-9]+\.[0-9]+$ ]]; then
echo "Label matches backport/X.X pattern."
echo "backport_label=true" >> $GITHUB_OUTPUT
else
echo "Label does not match the required pattern."
echo "backport_label=false" >> $GITHUB_OUTPUT
fi
- name: Run auto-backport.py when label was added
if: ${{ github.event_name == 'pull_request_target' && steps.check_label.outputs.backport_label == 'true' && github.event.pull_request.state == 'closed' }}
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --pull-request ${{ github.event.pull_request.number }} --head-commit ${{ github.event.pull_request.base.sha }}

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=6.2.0-dev
VERSION=6.2.3
if test -f version
then
@@ -104,7 +104,7 @@ else
fi
if [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then
GIT_COMMIT_FILE=$(cat "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" |cut -d . -f 3)
GIT_COMMIT_FILE=$(cat "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" | rev | cut -d . -f 1 | rev)
if [ "$GIT_COMMIT" = "$GIT_COMMIT_FILE" ]; then
exit 0
fi
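
The second hunk above replaces `cut -d . -f 3` with `rev | cut -d . -f 1 | rev`, that is, it takes the last dot-separated field instead of the third. The sketch below mirrors both pipelines in Python to show where they diverge. The real layout of `SCYLLA-RELEASE-FILE` is not shown in this diff, so the sample strings in the usage note are made up, and both function names are hypothetical.

```python
def third_dot_field(text):
    """Roughly mirror `cut -d . -f 3` for a single line of text."""
    if "." not in text:
        return text  # cut prints delimiter-free lines unchanged
    fields = text.split(".")
    return fields[2] if len(fields) > 2 else ""


def last_dot_field(text):
    """Mirror `rev | cut -d . -f 1 | rev`: everything after the final dot."""
    return text.rsplit(".", 1)[-1]
```

For a made-up three-field value such as `0.20250119.abc123`, both functions return `abc123`. Once an extra dot-separated field appears earlier in the string, only `last_dot_field` still returns the trailing field.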

@@ -2195,7 +2195,6 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
mutation_builders.reserve(request_items.MemberCount());
uint batch_size = 0;
for (auto it = request_items.MemberBegin(); it != request_items.MemberEnd(); ++it) {
batch_size++;
schema_ptr schema = get_table_from_batch_request(_proxy, it);
tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());
std::unordered_set<primary_key, primary_key_hash, primary_key_equal> used_keys(
@@ -2216,6 +2215,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
co_return api_error::validation("Provided list of item keys contains duplicates");
}
used_keys.insert(std::move(mut_key));
batch_size++;
} else if (r_name == "DeleteRequest") {
const rjson::value& key = (r->value)["Key"];
mutation_builders.emplace_back(schema, put_or_delete_item(
@@ -2226,6 +2226,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
co_return api_error::validation("Provided list of item keys contains duplicates");
}
used_keys.insert(std::move(mut_key));
batch_size++;
} else {
co_return api_error::validation(fmt::format("Unknown BatchWriteItem request type: {}", r_name));
}
@@ -3483,7 +3484,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
}
};
std::vector<table_requests> requests;
uint batch_size = 0;
for (auto it = request_items.MemberBegin(); it != request_items.MemberEnd(); ++it) {
table_requests rs(get_table_from_batch_request(_proxy, it));
tracing::add_table_name(trace_state, sstring(executor::KEYSPACE_NAME_PREFIX) + rs.schema->cf_name(), rs.schema->cf_name());
@@ -3497,6 +3498,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rs.add(key);
check_key(key, rs.schema);
}
batch_size += rs.requests.size();
requests.emplace_back(std::move(rs));
}
@@ -3504,7 +3506,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
co_await verify_permission(client_state, tr.schema, auth::permission::SELECT);
}
_stats.api_operations.batch_get_item_batch_total += requests.size();
_stats.api_operations.batch_get_item_batch_total += batch_size;
// If we got here, all "requests" are valid, so let's start the
// requests for the different partitions all in parallel.
std::vector<future<std::vector<rjson::value>>> response_futures;
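
The `batch_get_item` hunks above change the value added to the metric from the number of tables in the request (`requests.size()`) to the number of keys summed over all tables (`batch_size`). A small sketch of that difference, using the `RequestItems` shape of a DynamoDB `BatchGetItem` request, follows. The function names and the sample request are assumptions made for illustration.

```python
def count_tables(request_items):
    """Old accounting: one unit per table named in the batch."""
    return len(request_items)


def count_items(request_items):
    """New accounting: one unit per requested key, summed over all tables."""
    return sum(len(table_request["Keys"]) for table_request in request_items.values())


# Hypothetical batch: three keys from one table, one key from another.
sample_request_items = {
    "table_a": {"Keys": [{"p": {"S": "k1"}}, {"p": {"S": "k2"}}, {"p": {"S": "k3"}}]},
    "table_b": {"Keys": [{"p": {"S": "k4"}}]},
}
```

For this sample, the table count is 2 and the item count is 4.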

@@ -216,8 +216,8 @@ protected:
for (auto& ip : local_dc_nodes) {
// Note that it's not enough for the node to be is_alive() - a
// node joining the cluster is also "alive" but not responsive to
// requests. We need the node to be in normal state. See #19694.
if (_gossiper.is_normal(ip)) {
// requests. We need the node to be alive *and* normal. See #19694, #21538.
if (_gossiper.is_alive(ip) && _gossiper.is_normal(ip)) {
// Use the gossiped broadcast_rpc_address if available instead
// of the internal IP address "ip". See discussion in #18711.
rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(ip)));

@@ -29,8 +29,6 @@ stats::stats() : api_operations{} {
seastar::metrics::description("Latency summary of an operation via Alternator API"), [this]{return to_metrics_summary(api_operations.name.summary());})(op(CamelCaseName)).set_skip_when_empty(),
OPERATION(batch_get_item, "BatchGetItem")
OPERATION(batch_write_item, "BatchWriteItem")
OPERATION(batch_get_item_batch_total, "BatchGetItemSize")
OPERATION(batch_write_item_batch_total, "BatchWriteItemSize")
OPERATION(create_backup, "CreateBackup")
OPERATION(create_global_table, "CreateGlobalTable")
OPERATION(create_table, "CreateTable")
@@ -98,6 +96,10 @@ stats::stats() : api_operations{} {
seastar::metrics::description("number of rows read and matched during filtering operations")),
seastar::metrics::make_total_operations("filtered_rows_dropped_total", [this] { return cql_stats.filtered_rows_read_total - cql_stats.filtered_rows_matched_total; },
seastar::metrics::description("number of rows read and dropped during filtering operations")),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchWriteItem")},
api_operations.batch_write_item_batch_total).set_skip_when_empty(),
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchGetItem")},
api_operations.batch_get_item_batch_total).set_skip_when_empty(),
});
}

@@ -102,8 +102,8 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis
if (!req->query_parameters.contains("group_id")) {
// Read barrier on group 0 by default
co_await raft_gr.invoke_on(0, [timeout] (service::raft_group_registry& raft_gr) {
return raft_gr.group0_with_timeouts().read_barrier(nullptr, timeout);
co_await raft_gr.invoke_on(0, [timeout] (service::raft_group_registry& raft_gr) -> future<> {
co_await raft_gr.group0_with_timeouts().read_barrier(nullptr, timeout);
});
co_return json_void{};
}
@@ -111,12 +111,12 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis
raft::group_id gid{utils::UUID{req->get_query_param("group_id")}};
std::atomic<bool> found_srv{false};
co_await raft_gr.invoke_on_all([gid, timeout, &found_srv] (service::raft_group_registry& raft_gr) {
co_await raft_gr.invoke_on_all([gid, timeout, &found_srv] (service::raft_group_registry& raft_gr) -> future<> {
if (!raft_gr.find_server(gid)) {
return make_ready_future<>();
co_return;
}
found_srv = true;
return raft_gr.get_server_with_timeouts(gid).read_barrier(nullptr, timeout);
co_await raft_gr.get_server_with_timeouts(gid).read_barrier(nullptr, timeout);
});
if (!found_srv) {

@@ -898,7 +898,8 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
auto host_id = validate_host_id(req->get_query_param("host_id"));
std::vector<sstring> ignore_nodes_strs = utils::split_comma_separated_list(req->get_query_param("ignore_nodes"));
apilog.info("remove_node: host_id={} ignore_nodes={}", host_id, ignore_nodes_strs);
auto ignore_nodes = std::list<locator::host_id_or_endpoint>();
locator::host_id_or_endpoint_list ignore_nodes;
ignore_nodes.reserve(ignore_nodes_strs.size());
for (const sstring& n : ignore_nodes_strs) {
try {
auto hoep = locator::host_id_or_endpoint(n);

@@ -71,7 +71,7 @@ void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_to
ss::get_host_id_map.set(r, [&tm](const_req req) {
std::vector<ss::mapper> res;
return map_to_key_value(tm.local().get()->get_endpoint_to_host_id_map_for_reading(), res);
return map_to_key_value(tm.local().get()->get_endpoint_to_host_id_map(), res);
});
static auto host_or_broadcast = [&tm](const_req req) {

@@ -76,7 +76,7 @@ auth::certificate_authenticator::certificate_authenticator(cql3::query_processor
continue;
} catch (std::out_of_range&) {
// just fallthrough
} catch (std::regex_error&) {
} catch (boost::regex_error&) {
std::throw_with_nested(std::invalid_argument(fmt::format("Invalid query expression: {}", map.at(cfg_query_attr))));
}
}

@@ -226,7 +226,8 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
}
static std::vector<shared_sstable> get_uncompacting_sstables(const table_state& table_s, std::vector<shared_sstable> sstables) {
auto all_sstables = boost::copy_range<std::vector<shared_sstable>>(*table_s.main_sstable_set().all());
auto sstable_set = table_s.sstable_set_for_tombstone_gc();
auto all_sstables = boost::copy_range<std::vector<shared_sstable>>(*sstable_set->all());
auto& compacted_undeleted = table_s.compacted_undeleted_sstables();
all_sstables.insert(all_sstables.end(), compacted_undeleted.begin(), compacted_undeleted.end());
boost::sort(all_sstables, [] (const shared_sstable& x, const shared_sstable& y) {

@@ -188,7 +188,7 @@ unsigned compaction_manager::current_compaction_fan_in_threshold() const {
return 0;
}
auto largest_fan_in = std::ranges::max(_tasks | boost::adaptors::transformed([] (auto& task) {
return task->compaction_running() ? task->compaction_data().compaction_fan_in : 0;
return task.compaction_running() ? task.compaction_data().compaction_fan_in : 0;
}));
// conservatively limit fan-in threshold to 32, such that tons of small sstables won't accumulate if
// running major on a leveled table, which can even have more than one thousand files.
@@ -388,11 +388,26 @@ future<sstables::compaction_result> compaction_task_executor::compact_sstables_a
co_return res;
}
future<sstables::sstable_set> compaction_task_executor::sstable_set_for_tombstone_gc(table_state& t) {
auto compound_set = t.sstable_set_for_tombstone_gc();
// Compound set will be linearized into a single set, since compaction might add or remove sstables
// to it for incremental compaction to work.
auto new_set = sstables::make_partitioned_sstable_set(t.schema(), false);
co_await compound_set->for_each_sstable_gently([&] (const sstables::shared_sstable& sst) {
auto inserted = new_set.insert(sst);
if (!inserted) {
on_internal_error(cmlog, format("Unable to insert SSTable {} into set used for tombstone GC", sst->get_filename()));
}
});
co_return std::move(new_set);
}
future<sstables::compaction_result> compaction_task_executor::compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement& on_replace, compaction_manager::can_purge_tombstones can_purge,
sstables::offstrategy offstrategy) {
table_state& t = *_compacting_table;
if (can_purge) {
descriptor.enable_garbage_collection(t.main_sstable_set());
descriptor.enable_garbage_collection(co_await sstable_set_for_tombstone_gc(t));
}
descriptor.creator = [&t] (shard_id dummy) {
auto sst = t.make_sstable();
@@ -580,9 +595,9 @@ requires (compaction_manager& cm, throw_if_stopping do_throw_if_stopping, Args&&
}
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_compaction(throw_if_stopping do_throw_if_stopping, tasks::task_info parent_info, Args&&... args) {
auto task_executor = seastar::make_shared<TaskExecutor>(*this, do_throw_if_stopping, std::forward<Args>(args)...);
_tasks.push_back(task_executor);
auto unregister_task = defer([this, task_executor] {
_tasks.remove(task_executor);
_tasks.push_back(*task_executor);
auto unregister_task = defer([task_executor] {
task_executor->unlink();
task_executor->switch_state(compaction_task_executor::state::none);
});
@@ -884,10 +899,10 @@ public:
explicit strategy_control(compaction_manager& cm) noexcept : _cm(cm) {}
bool has_ongoing_compaction(table_state& table_s) const noexcept override {
return std::any_of(_cm._tasks.begin(), _cm._tasks.end(), [&s = table_s.schema()] (const shared_ptr<compaction_task_executor>& task) {
return task->compaction_running()
&& task->compacting_table()->schema()->ks_name() == s->ks_name()
&& task->compacting_table()->schema()->cf_name() == s->cf_name();
return std::any_of(_cm._tasks.begin(), _cm._tasks.end(), [&s = table_s.schema()] (const compaction_task_executor& task) {
return task.compaction_running()
&& task.compacting_table()->schema()->ks_name() == s->ks_name()
&& task.compacting_table()->schema()->cf_name() == s->cf_name();
});
}
@@ -1051,7 +1066,7 @@ void compaction_manager::postpone_compaction_for_table(table_state* t) {
_postponed.insert(t);
}
future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason) {
future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason) noexcept {
// To prevent compaction from being postponed while tasks are being stopped,
// let's stop all tasks before the deferring point below.
for (auto& t : tasks) {
@@ -1059,14 +1074,16 @@ future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_e
t->stop_compaction(reason);
}
co_await coroutine::parallel_for_each(tasks, [] (auto& task) -> future<> {
auto unlink_task = deferred_action([task] { task->unlink(); });
try {
co_await task->compaction_done();
} catch (sstables::compaction_stopped_exception&) {
// swallow stop exception if a given procedure decides to propagate it to the caller,
// as it happens with reshard and reshape.
} catch (...) {
// just log any other errors as the callers have nothing to do with them.
cmlog.debug("Stopping {}: task returned error: {}", *task, std::current_exception());
throw;
co_return;
}
cmlog.debug("Stopping {}: done", *task);
});
@@ -1075,9 +1092,12 @@ future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_e
future<> compaction_manager::stop_ongoing_compactions(sstring reason, table_state* t, std::optional<sstables::compaction_type> type_opt) noexcept {
try {
auto ongoing_compactions = get_compactions(t).size();
auto tasks = boost::copy_range<std::vector<shared_ptr<compaction_task_executor>>>(_tasks | boost::adaptors::filtered([t, type_opt] (auto& task) {
return (!t || task->compacting_table() == t) && (!type_opt || task->compaction_type() == *type_opt);
}));
auto tasks = _tasks
| std::views::filter([t, type_opt] (const auto& task) {
return (!t || task.compacting_table() == t) && (!type_opt || task.compaction_type() == *type_opt);
})
| std::views::transform([] (auto& task) { return task.shared_from_this(); })
| std::ranges::to<std::vector<shared_ptr<compaction_task_executor>>>();
logging::log_level level = tasks.empty() ? log_level::debug : log_level::info;
if (cmlog.is_enabled(level)) {
std::string scope = "";
@@ -1091,8 +1111,9 @@ future<> compaction_manager::stop_ongoing_compactions(sstring reason, table_stat
}
return stop_tasks(std::move(tasks), std::move(reason));
} catch (...) {
return current_exception_as_future<>();
cmlog.error("Stopping ongoing compactions failed: {}. Ignored", std::current_exception());
}
return make_ready_future();
}
future<> compaction_manager::drain() {
@@ -1109,17 +1130,17 @@ future<> compaction_manager::stop() {
if (auto cm = std::exchange(_task_manager_module, nullptr)) {
co_await cm->stop();
}
if (_state != state::none) {
co_return co_await std::move(*_stop_future);
if (_stop_future) {
co_await std::exchange(*_stop_future, make_ready_future());
}
}
future<> compaction_manager::really_do_stop() {
future<> compaction_manager::really_do_stop() noexcept {
cmlog.info("Asked to stop");
// Reset the metrics registry
_metrics.clear();
co_await stop_ongoing_compactions("shutdown");
co_await coroutine::parallel_for_each(_compaction_state | boost::adaptors::map_values, [] (compaction_state& cs) -> future<> {
co_await coroutine::parallel_for_each(_compaction_state | std::views::values, [] (compaction_state& cs) -> future<> {
if (!cs.gate.is_closed()) {
co_await cs.gate.close();
}
@@ -1618,6 +1639,9 @@ public:
std::move(sstables), std::move(compacting), compaction_manager::can_purge_tombstones::yes)
, _opt(options.as<sstables::compaction_type_options::split>())
{
if (utils::get_local_injector().is_enabled("split_sstable_rewrite")) {
_do_throw_if_stopping = throw_if_stopping::yes;
}
}
static bool sstable_needs_split(const sstables::shared_sstable& sst, const sstables::compaction_type_options::split& opt) {
@@ -1633,13 +1657,12 @@ private:
bool sstable_needs_split(const sstables::shared_sstable& sst) const {
return sstable_needs_split(sst, _opt);
}
protected:
sstables::compaction_descriptor make_descriptor(const sstables::shared_sstable& sst) const override {
return make_descriptor(sst, _opt);
}
future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) override {
future<sstables::compaction_result> do_rewrite_sstable(const sstables::shared_sstable sst) {
if (sstable_needs_split(sst)) {
return rewrite_sstables_compaction_task_executor::rewrite_sstable(std::move(sst));
}
@@ -1652,6 +1675,20 @@ protected:
return sstables::compaction_result{};
});
}
future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) override {
co_await utils::get_local_injector().inject("split_sstable_rewrite", [this] (auto& handler) -> future<> {
cmlog.info("split_sstable_rewrite: waiting");
while (!handler.poll_for_message() && !_compaction_data.is_stop_requested()) {
co_await sleep(std::chrono::milliseconds(5));
}
cmlog.info("split_sstable_rewrite: released");
if (_compaction_data.is_stop_requested()) {
throw make_compaction_stopped_exception();
}
}, false);
co_return co_await do_rewrite_sstable(std::move(sst));
}
};
}
@@ -1979,7 +2016,7 @@ future<> compaction_manager::perform_cleanup(owned_ranges_ptr sorted_owned_range
future<> compaction_manager::try_perform_cleanup(owned_ranges_ptr sorted_owned_ranges, table_state& t, tasks::task_info info) {
auto check_for_cleanup = [this, &t] {
return boost::algorithm::any_of(_tasks, [&t] (auto& task) {
return task->compacting_table() == &t && task->compaction_type() == sstables::compaction_type::Cleanup;
return task.compacting_table() == &t && task.compaction_type() == sstables::compaction_type::Cleanup;
});
};
if (check_for_cleanup()) {
@@ -2077,8 +2114,10 @@ compaction_manager::maybe_split_sstable(sstables::shared_sstable sst, table_stat
}
std::vector<sstables::shared_sstable> ret;
co_await run_custom_job(t, sstables::compaction_type::Split, "Split SSTable",
[&] (sstables::compaction_data& info, sstables::compaction_progress_monitor& monitor) -> future<> {
// FIXME: indentation.
auto gate = get_compaction_state(&t).gate.hold();
sstables::compaction_progress_monitor monitor;
sstables::compaction_data info = create_compaction_data();
sstables::compaction_descriptor desc = split_compaction_task_executor::make_descriptor(sst, opt);
desc.creator = [&t] (shard_id _) {
return t.make_sstable();
@@ -2089,7 +2128,6 @@ compaction_manager::maybe_split_sstable(sstables::shared_sstable sst, table_stat
co_await sstables::compact_sstables(std::move(desc), info, t, monitor);
co_await sst->unlink();
}, tasks::task_info{}, throw_if_stopping::yes);
co_return ret;
}
@@ -2159,11 +2197,11 @@ future<> compaction_manager::remove(table_state& t, sstring reason) noexcept {
auto found = false;
sstring msg;
for (auto& task : _tasks) {
if (task->compacting_table() == &t) {
if (task.compacting_table() == &t) {
if (!msg.empty()) {
msg += "\n";
}
msg += format("Found {} after remove", *task.get());
msg += format("Found {} after remove", task);
found = true;
}
}
@@ -2174,30 +2212,38 @@ future<> compaction_manager::remove(table_state& t, sstring reason) noexcept {
}
const std::vector<sstables::compaction_info> compaction_manager::get_compactions(table_state* t) const {
auto to_info = [] (const shared_ptr<compaction_task_executor>& task) {
auto to_info = [] (const compaction_task_executor& task) {
sstables::compaction_info ret;
ret.compaction_uuid = task->compaction_data().compaction_uuid;
ret.type = task->compaction_type();
ret.ks_name = task->compacting_table()->schema()->ks_name();
ret.cf_name = task->compacting_table()->schema()->cf_name();
ret.total_partitions = task->compaction_data().total_partitions;
ret.total_keys_written = task->compaction_data().total_keys_written;
ret.compaction_uuid = task.compaction_data().compaction_uuid;
ret.type = task.compaction_type();
ret.ks_name = task.compacting_table()->schema()->ks_name();
ret.cf_name = task.compacting_table()->schema()->cf_name();
ret.total_partitions = task.compaction_data().total_partitions;
ret.total_keys_written = task.compaction_data().total_keys_written;
return ret;
};
using ret = std::vector<sstables::compaction_info>;
return boost::copy_range<ret>(_tasks | boost::adaptors::filtered([t] (const shared_ptr<compaction_task_executor>& task) {
return (!t || task->compacting_table() == t) && task->compaction_running();
return boost::copy_range<ret>(_tasks | boost::adaptors::filtered([t] (const compaction_task_executor& task) {
return (!t || task.compacting_table() == t) && task.compaction_running();
}) | boost::adaptors::transformed(to_info));
}
bool compaction_manager::has_table_ongoing_compaction(const table_state& t) const {
return std::any_of(_tasks.begin(), _tasks.end(), [&t] (const shared_ptr<compaction_task_executor>& task) {
return task->compacting_table() == &t && task->compaction_running();
return std::any_of(_tasks.begin(), _tasks.end(), [&t] (const compaction_task_executor& task) {
return task.compacting_table() == &t && task.compaction_running();
});
};
bool compaction_manager::compaction_disabled(table_state& t) const {
return _compaction_state.contains(&t) && _compaction_state.at(&t).compaction_disabled();
if (auto it = _compaction_state.find(&t); it != _compaction_state.end()) {
return it->second.compaction_disabled();
} else {
cmlog.debug("compaction_disabled: {}:{} not in compaction_state", t.schema()->id(), t.get_group_id());
// Compaction is not strictly disabled, but it is not enabled either.
// The callers actually care about if it's enabled or not, not about the actual state of
// compaction_state::compaction_disabled()
return true;
}
}
future<> compaction_manager::stop_compaction(sstring type, table_state* table) {
@@ -2222,8 +2268,8 @@ future<> compaction_manager::stop_compaction(sstring type, table_state* table) {
void compaction_manager::propagate_replacement(table_state& t,
const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added) {
for (auto& task : _tasks) {
if (task->compacting_table() == &t && task->compaction_running()) {
task->compaction_data().pending_replacements.push_back({ removed, added });
if (task.compacting_table() == &t && task.compaction_running()) {
task.compaction_data().pending_replacements.push_back({ removed, added });
}
}
}
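
The hunks above move `_tasks` from a list of shared pointers to an intrusive list: each task carries its own links and removes itself with `unlink()`. The diff calls `unlink()` both from the deferred action in `perform_compaction` and from `stop_tasks`, so a repeated call has to be harmless. The sketch below models that idea only. It is not how Boost.Intrusive is implemented, and the class names `TaskNode` and `TaskList` are hypothetical.

```python
class TaskNode:
    """A task that carries its own list links, like an intrusive list hook."""

    def __init__(self, name):
        self.name = name
        self._prev = None
        self._next = None

    def is_linked(self):
        return self._prev is not None

    def unlink(self):
        """Remove this task from its list; a no-op if it is not linked."""
        if not self.is_linked():
            return
        self._prev._next = self._next
        self._next._prev = self._prev
        self._prev = self._next = None


class TaskList:
    """Circular doubly linked list with a sentinel; it never owns the tasks."""

    def __init__(self):
        self._head = TaskNode("<sentinel>")
        self._head._prev = self._head._next = self._head

    def push_back(self, task):
        assert not task.is_linked(), "a task can be in only one list at a time"
        last = self._head._prev
        task._prev, task._next = last, self._head
        last._next = task
        self._head._prev = task

    def __iter__(self):
        node = self._head._next
        while node is not self._head:
            following = node._next  # read first so the caller may unlink `node`
            yield node
            node = following
```

Removal needs no search through the list and no reference to the list itself, which is why the deferred action in the diff no longer has to capture `this`.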

@@ -94,8 +94,13 @@ public:
private:
shared_ptr<compaction::task_manager_module> _task_manager_module;
using compaction_task_executor_list_type = bi::list<
compaction_task_executor,
bi::base_hook<bi::list_base_hook<bi::link_mode<bi::auto_unlink>>>,
bi::constant_time_size<false>>;
// compaction manager may have N fibers to allow parallel compaction per shard.
std::list<shared_ptr<compaction::compaction_task_executor>> _tasks;
compaction_task_executor_list_type _tasks;
// Possible states in which the compaction manager can be found.
//
@@ -179,7 +184,7 @@ private:
}
future<compaction_manager::compaction_stats_opt> perform_compaction(throw_if_stopping do_throw_if_stopping, tasks::task_info parent_info, Args&&... args);
future<> stop_tasks(std::vector<shared_ptr<compaction::compaction_task_executor>> tasks, sstring reason);
future<> stop_tasks(std::vector<shared_ptr<compaction::compaction_task_executor>> tasks, sstring reason) noexcept;
future<> update_throughput(uint32_t value_mbs);
// Return the largest fan-in of currently running compactions
@@ -245,7 +250,7 @@ private:
// Stop all fibers, without waiting. Safe to be called multiple times.
void do_stop() noexcept;
future<> really_do_stop();
future<> really_do_stop() noexcept;
// Propagate replacement of sstables to all ongoing compaction of a given table
void propagate_replacement(compaction::table_state& t, const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added);
@@ -470,7 +475,9 @@ public:
namespace compaction {
class compaction_task_executor : public enable_shared_from_this<compaction_task_executor> {
class compaction_task_executor
: public enable_shared_from_this<compaction_task_executor>
, public boost::intrusive::list_base_hook<boost::intrusive::link_mode<boost::intrusive::auto_unlink>> {
public:
enum class state {
none, // initial and final state
@@ -594,6 +601,8 @@ private:
future<compaction_manager::compaction_stats_opt> compaction_done() noexcept {
return _compaction_done.get_future();
}
future<sstables::sstable_set> sstable_set_for_tombstone_gc(::compaction::table_state& t);
public:
bool stopping() const noexcept {
return _compaction_data.abort.abort_requested();
@@ -614,7 +623,7 @@ public:
friend future<compaction_manager::compaction_stats_opt> compaction_manager::perform_compaction(throw_if_stopping do_throw_if_stopping, tasks::task_info parent_info, Args&&... args);
friend future<compaction_manager::compaction_stats_opt> compaction_manager::perform_task(shared_ptr<compaction_task_executor> task, throw_if_stopping do_throw_if_stopping);
friend fmt::formatter<compaction_task_executor>;
friend future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason);
friend future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason) noexcept;
friend sstables::test_env_compaction_manager;
};

@@ -39,6 +39,7 @@ public:
virtual bool compaction_enforce_min_threshold() const noexcept = 0;
virtual const sstables::sstable_set& main_sstable_set() const = 0;
virtual const sstables::sstable_set& maintenance_sstable_set() const = 0;
virtual lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc() const = 0;
virtual std::unordered_set<sstables::shared_sstable> fully_expired_sstables(const std::vector<sstables::shared_sstable>& sstables, gc_clock::time_point compaction_time) const = 0;
virtual const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const noexcept = 0;
virtual sstables::compaction_strategy& get_compaction_strategy() const noexcept = 0;

@@ -296,7 +296,8 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
// When trimming, let's keep sstables with overlapping time window, so as to reduce write amplification.
// For example, if there are N sstables spanning window W, where N <= 32, then we can produce all data for W
// in a single compaction round, removing the need to later compact W to reduce its number of files.
boost::partial_sort(multi_window, multi_window.begin() + max_sstables, [](const shared_sstable &a, const shared_sstable &b) {
auto sort_size = std::min(max_sstables, multi_window.size());
boost::partial_sort(multi_window, multi_window.begin() + sort_size, [](const shared_sstable &a, const shared_sstable &b) {
return a->get_stats_metadata().max_timestamp < b->get_stats_metadata().max_timestamp;
});
maybe_trim_job(multi_window, job_size, disjoint);

@@ -1470,7 +1470,7 @@ deps['test/boost/bytes_ostream_test'] = [
"test/lib/log.cc",
]
deps['test/boost/input_stream_test'] = ['test/boost/input_stream_test.cc']
deps['test/boost/UUID_test'] = ['utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'utils/hashers.cc', 'utils/on_internal_error.cc']
deps['test/boost/UUID_test'] = ['clocks-impl.cc', 'utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'utils/hashers.cc', 'utils/on_internal_error.cc']
deps['test/boost/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'test/boost/murmur_hash_test.cc']
deps['test/boost/allocation_strategy_test'] = ['test/boost/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['test/boost/log_heap_test'] = ['test/boost/log_heap_test.cc']

View File

@@ -11,6 +11,7 @@
#include <boost/range/algorithm.hpp>
#include <fmt/format.h>
#include <seastar/core/coroutine.hh>
#include <seastar/core/on_internal_error.hh>
#include <stdexcept>
#include "alter_keyspace_statement.hh"
#include "prepared_statement.hh"
@@ -43,18 +44,16 @@ future<> cql3::statements::alter_keyspace_statement::check_access(query_processo
return state.has_keyspace_access(_name, auth::permission::ALTER);
}
static bool validate_rf_difference(const std::string_view curr_rf, const std::string_view new_rf) {
auto to_number = [] (const std::string_view rf) {
int result;
// We assume the passed string view represents a valid decimal number,
// so we don't need the error code.
(void) std::from_chars(rf.begin(), rf.end(), result);
return result;
};
// We want to ensure that each DC's RF is going to change by at most 1
// because in that case the old and new quorums must overlap.
return std::abs(to_number(curr_rf) - to_number(new_rf)) <= 1;
static unsigned get_abs_rf_diff(const std::string& curr_rf, const std::string& new_rf) {
try {
return std::abs(std::stoi(curr_rf) - std::stoi(new_rf));
} catch (std::invalid_argument const& ex) {
on_internal_error(mylogger, fmt::format("get_abs_rf_diff expects integer arguments, "
"but got curr_rf:{} and new_rf:{}", curr_rf, new_rf));
} catch (std::out_of_range const& ex) {
on_internal_error(mylogger, fmt::format("get_abs_rf_diff expects integer arguments to fit into `int` type, "
"but got curr_rf:{} and new_rf:{}", curr_rf, new_rf));
}
}
void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, const service::client_state& state) const {
@@ -84,11 +83,24 @@ void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, c
auto new_ks = _attrs->as_ks_metadata_update(ks.metadata(), *qp.proxy().get_token_metadata_ptr(), qp.proxy().features());
if (ks.get_replication_strategy().uses_tablets()) {
const std::map<sstring, sstring>& current_rfs = ks.metadata()->strategy_options();
for (const auto& [new_dc, new_rf] : _attrs->get_replication_options()) {
auto it = current_rfs.find(new_dc);
if (it != current_rfs.end() && !validate_rf_difference(it->second, new_rf)) {
throw exceptions::invalid_request_exception("Cannot modify replication factor of any DC by more than 1 at a time.");
const std::map<sstring, sstring>& current_rf_per_dc = ks.metadata()->strategy_options();
auto new_rf_per_dc = _attrs->get_replication_options();
new_rf_per_dc.erase(ks_prop_defs::REPLICATION_STRATEGY_CLASS_KEY);
unsigned total_abs_rfs_diff = 0;
for (const auto& [new_dc, new_rf] : new_rf_per_dc) {
sstring old_rf = "0";
if (auto new_dc_in_current_mapping = current_rf_per_dc.find(new_dc);
new_dc_in_current_mapping != current_rf_per_dc.end()) {
old_rf = new_dc_in_current_mapping->second;
} else if (!qp.proxy().get_token_metadata_ptr()->get_topology().get_datacenters().contains(new_dc)) {
// This means that the DC listed in ALTER doesn't exist. This error will be reported later,
// during validation in abstract_replication_strategy::validate_replication_strategy.
// We can't report this error now, because it'd change the order of errors reported:
// first we need to report non-existing DCs, then if RFs aren't changed by too much.
continue;
}
if (total_abs_rfs_diff += get_abs_rf_diff(old_rf, new_rf); total_abs_rfs_diff >= 2) {
throw exceptions::invalid_request_exception("Only one DC's RF can be changed at a time and not by more than 1");
}
}
}
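Note: the new validation rule in this hunk can be restated as a small sketch. This is an illustrative Python model, not the ScyllaDB implementation; the function name `check_rf_change`, the `known_dcs` parameter, and the use of `ValueError` are assumptions made for the example, and the `'class'` key is assumed to be what `REPLICATION_STRATEGY_CLASS_KEY` refers to.

```python
def check_rf_change(current_rf_per_dc, new_rf_per_dc, known_dcs):
    """Model of the rule: the summed absolute RF change across all DCs must stay below 2."""
    # Assumption: the strategy class is stored under the 'class' key and is not an RF.
    new_rf_per_dc = {dc: rf for dc, rf in new_rf_per_dc.items() if dc != "class"}
    total_abs_rfs_diff = 0
    for dc, new_rf in new_rf_per_dc.items():
        if dc in current_rf_per_dc:
            old_rf = current_rf_per_dc[dc]
        elif dc not in known_dcs:
            # Unknown DC: skipped here so that a later validation step reports it first.
            continue
        else:
            # Known DC that had no replicas so far counts as RF 0.
            old_rf = "0"
        total_abs_rfs_diff += abs(int(old_rf) - int(new_rf))
        if total_abs_rfs_diff >= 2:
            raise ValueError("Only one DC's RF can be changed at a time and not by more than 1")
```

Under this model, changing `dc1` from 3 to 4 passes, while changing `dc1` from 3 to 5, or changing both `dc1` and `dc2` by 1 in the same statement, is rejected.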
@@ -118,6 +130,63 @@ bool cql3::statements::alter_keyspace_statement::changes_tablets(query_processor
return ks.get_replication_strategy().uses_tablets() && !_attrs->get_replication_options().empty();
}
namespace {
// These functions are used to flatten all the options in the keyspace definition into a single-level map<string, string>.
// (Currently options are stored in a nested structure that looks more like a map<string, map<string, string>>).
// Flattening is simply joining the keys of maps from both levels with a colon ':' character,
// or in other words: prefixing the keys in the output map with the option type, e.g. 'replication', 'storage', etc.,
// so that the output map contains entries like: "replication:dc1" -> "3".
// This is done to avoid key conflicts and to be able to de-flatten the map back into the original structure.
void add_prefixed_key(const sstring& prefix, const std::map<sstring, sstring>& in, std::map<sstring, sstring>& out) {
for (const auto& [in_key, in_value]: in) {
out[prefix + ":" + in_key] = in_value;
}
};
std::map<sstring, sstring> get_current_options_flattened(const shared_ptr<cql3::statements::ks_prop_defs>& ks,
bool include_tablet_options,
const gms::feature_service& feat) {
std::map<sstring, sstring> all_options;
add_prefixed_key(ks->KW_REPLICATION, ks->get_replication_options(), all_options);
add_prefixed_key(ks->KW_STORAGE, ks->get_storage_options().to_map(), all_options);
// if no tablet options are specified in the ALTER KS statement,
// we want to preserve the old ones and hence cannot overwrite them with defaults
if (include_tablet_options) {
auto initial_tablets = ks->get_initial_tablets(std::nullopt);
add_prefixed_key(ks->KW_TABLETS,
{{"enabled", initial_tablets ? "true" : "false"},
{"initial", std::to_string(initial_tablets.value_or(0))}},
all_options);
}
add_prefixed_key(ks->KW_DURABLE_WRITES,
{{sstring(ks->KW_DURABLE_WRITES), to_sstring(ks->get_boolean(ks->KW_DURABLE_WRITES, true))}},
all_options);
return all_options;
}
std::map<sstring, sstring> get_old_options_flattened(const data_dictionary::keyspace& ks, bool include_tablet_options) {
std::map<sstring, sstring> all_options;
using namespace cql3::statements;
add_prefixed_key(ks_prop_defs::KW_REPLICATION, ks.get_replication_strategy().get_config_options(), all_options);
add_prefixed_key(ks_prop_defs::KW_STORAGE, ks.metadata()->get_storage_options().to_map(), all_options);
if (include_tablet_options) {
add_prefixed_key(ks_prop_defs::KW_TABLETS,
{{"enabled", ks.metadata()->initial_tablets() ? "true" : "false"},
{"initial", std::to_string(ks.metadata()->initial_tablets().value_or(0))}},
all_options);
}
add_prefixed_key(ks_prop_defs::KW_DURABLE_WRITES,
{{sstring(ks_prop_defs::KW_DURABLE_WRITES), to_sstring(ks.metadata()->durable_writes())}},
all_options);
return all_options;
}
} // <anonymous> namespace
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, cql3::cql_warnings_vec>>
cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_processor& qp, service::query_state& state, const query_options& options, service::group0_batch& mc) const {
using namespace cql_transport;
@@ -130,11 +199,18 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
auto ks_md_update = _attrs->as_ks_metadata_update(ks_md, tm, feat);
std::vector<mutation> muts;
std::vector<sstring> warnings;
auto ks_options = _attrs->get_all_options_flattened(feat);
bool include_tablet_options = _attrs->get_map(_attrs->KW_TABLETS).has_value();
auto old_ks_options = get_old_options_flattened(ks, include_tablet_options);
auto ks_options = get_current_options_flattened(_attrs, include_tablet_options, feat);
ks_options.merge(old_ks_options);
auto ts = mc.write_timestamp();
auto global_request_id = mc.new_group0_state_id();
// we only want to run the tablets path if there are actually any tablets changes, not only schema changes
// TODO: the current `if (changes_tablets(qp))` is insufficient: someone may set the same RFs as before,
// and we'll unnecessarily trigger the processing path for ALTER tablets KS,
// when in reality nothing or only schema is being changed
if (changes_tablets(qp)) {
if (!qp.topology_global_queue_empty()) {
return make_exception_future<std::tuple<::shared_ptr<::cql_transport::event::schema_change>, cql3::cql_warnings_vec>>(
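Note: the flattening and merge steps introduced in this file can be modelled in a few lines. This is a hedged Python sketch of the behaviour described in the comments above, not the actual code; `merge_keeping_existing` is a hypothetical helper that mimics how `std::map::merge` only inserts keys that are not already present in the destination.

```python
def add_prefixed_key(prefix, options, out):
    """Flatten one option group into `out`, e.g. 'replication' + 'dc1' -> 'replication:dc1'."""
    for key, value in options.items():
        out[f"{prefix}:{key}"] = value


def merge_keeping_existing(new_options, old_options):
    """Keep every entry of new_options; take from old_options only keys that are missing."""
    merged = dict(old_options)
    merged.update(new_options)
    return merged


new_flat = {}
add_prefixed_key("replication", {"dc1": "4"}, new_flat)

old_flat = {}
add_prefixed_key("replication", {"dc1": "3", "dc2": "3"}, old_flat)
add_prefixed_key("durable_writes", {"durable_writes": "true"}, old_flat)

merged = merge_keeping_existing(new_flat, old_flat)
```

In this model the value given in the ALTER statement wins for `replication:dc1`, while `replication:dc2` and `durable_writes:durable_writes` are carried over from the existing keyspace.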

View File

@@ -139,28 +139,22 @@ data_dictionary::storage_options ks_prop_defs::get_storage_options() const {
return opts;
}
ks_prop_defs::init_tablets_options ks_prop_defs::get_initial_tablets(const sstring& strategy_class, bool enabled_by_default) const {
// FIXME -- this should be ignored somehow else
init_tablets_options ret{ .enabled = false, .specified_count = std::nullopt };
if (locator::abstract_replication_strategy::to_qualified_class_name(strategy_class) != "org.apache.cassandra.locator.NetworkTopologyStrategy") {
return ret;
}
std::optional<unsigned> ks_prop_defs::get_initial_tablets(std::optional<unsigned> default_value) const {
auto tablets_options = get_map(KW_TABLETS);
if (!tablets_options) {
return enabled_by_default ? init_tablets_options{ .enabled = true } : ret;
return default_value;
}
unsigned initial_count = 0;
auto it = tablets_options->find("enabled");
if (it != tablets_options->end()) {
auto enabled = it->second;
tablets_options->erase(it);
if (enabled == "true") {
ret = init_tablets_options{ .enabled = true, .specified_count = 0 }; // even if 'initial' is not set, it'll start with auto-detection
// nothing
} else if (enabled == "false") {
SCYLLA_ASSERT(!ret.enabled);
return ret;
return std::nullopt;
} else {
throw exceptions::configuration_exception(sstring("Tablets enabled value must be true or false; found: ") + enabled);
}
@@ -169,7 +163,7 @@ ks_prop_defs::init_tablets_options ks_prop_defs::get_initial_tablets(const sstri
it = tablets_options->find("initial");
if (it != tablets_options->end()) {
try {
ret = init_tablets_options{ .enabled = true, .specified_count = std::stol(it->second)};
initial_count = std::stol(it->second);
} catch (...) {
throw exceptions::configuration_exception(sstring("Initial tablets value should be numeric; found ") + it->second);
}
@@ -180,7 +174,7 @@ ks_prop_defs::init_tablets_options ks_prop_defs::get_initial_tablets(const sstri
throw exceptions::configuration_exception(sstring("Unrecognized tablets option ") + tablets_options->begin()->first);
}
return ret;
return initial_count;
}
std::optional<sstring> ks_prop_defs::get_replication_strategy_class() const {
@@ -191,32 +185,13 @@ bool ks_prop_defs::get_durable_writes() const {
return get_boolean(KW_DURABLE_WRITES, true);
}
std::map<sstring, sstring> ks_prop_defs::get_all_options_flattened(const gms::feature_service& feat) const {
std::map<sstring, sstring> all_options;
auto ingest_flattened_options = [&all_options](const std::map<sstring, sstring>& options, const sstring& prefix) {
for (auto& option: options) {
all_options[prefix + ":" + option.first] = option.second;
}
};
ingest_flattened_options(get_replication_options(), KW_REPLICATION);
ingest_flattened_options(get_storage_options().to_map(), KW_STORAGE);
ingest_flattened_options(get_map(KW_TABLETS).value_or(std::map<sstring, sstring>{}), KW_TABLETS);
ingest_flattened_options({{sstring(KW_DURABLE_WRITES), to_sstring(get_boolean(KW_DURABLE_WRITES, true))}}, KW_DURABLE_WRITES);
return all_options;
}
lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata(sstring ks_name, const locator::token_metadata& tm, const gms::feature_service& feat) {
auto sc = get_replication_strategy_class().value();
auto initial_tablets = get_initial_tablets(sc, feat.tablets);
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0
if (initial_tablets.enabled && !initial_tablets.specified_count) {
initial_tablets.specified_count = 0;
}
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0 for N.T.S. only
auto initial_tablets = get_initial_tablets(feat.tablets && locator::abstract_replication_strategy::to_qualified_class_name(sc) == "org.apache.cassandra.locator.NetworkTopologyStrategy" ? std::optional<unsigned>(0) : std::nullopt);
auto options = prepare_options(sc, tm, get_replication_options());
return data_dictionary::keyspace_metadata::new_keyspace(ks_name, sc,
std::move(options), initial_tablets.specified_count, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
std::move(options), initial_tablets, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
}
lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata_update(lw_shared_ptr<data_dictionary::keyspace_metadata> old, const locator::token_metadata& tm, const gms::feature_service& feat) {
@@ -229,13 +204,9 @@ lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata_u
sc = old->strategy_name();
options = old_options;
}
auto initial_tablets = get_initial_tablets(*sc, old->initial_tablets().has_value());
// if tablets options have not been specified, inherit them if it's tablets-enabled KS
if (initial_tablets.enabled && !initial_tablets.specified_count) {
initial_tablets.specified_count = old->initial_tablets();
}
return data_dictionary::keyspace_metadata::new_keyspace(old->name(), *sc, options, initial_tablets.specified_count, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
auto initial_tablets = get_initial_tablets(old->initial_tablets());
return data_dictionary::keyspace_metadata::new_keyspace(old->name(), *sc, options, initial_tablets, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
}

View File

@@ -49,21 +49,15 @@ public:
private:
std::optional<sstring> _strategy_class;
public:
struct init_tablets_options {
bool enabled;
std::optional<unsigned> specified_count;
};
ks_prop_defs() = default;
explicit ks_prop_defs(std::map<sstring, sstring> options);
void validate();
std::map<sstring, sstring> get_replication_options() const;
std::optional<sstring> get_replication_strategy_class() const;
init_tablets_options get_initial_tablets(const sstring& strategy_class, bool enabled_by_default) const;
std::optional<unsigned> get_initial_tablets(std::optional<unsigned> default_value) const;
data_dictionary::storage_options get_storage_options() const;
bool get_durable_writes() const;
std::map<sstring, sstring> get_all_options_flattened(const gms::feature_service& feat) const;
lw_shared_ptr<data_dictionary::keyspace_metadata> as_ks_metadata(sstring ks_name, const locator::token_metadata&, const gms::feature_service&);
lw_shared_ptr<data_dictionary::keyspace_metadata> as_ks_metadata_update(lw_shared_ptr<data_dictionary::keyspace_metadata> old, const locator::token_metadata&, const gms::feature_service&);
};

View File

@@ -54,7 +54,7 @@ list_service_level_statement::execute(query_processor& qp,
return make_ready_future().then([this, &state] () {
if (_describe_all) {
return state.get_service_level_controller().get_distributed_service_levels();
return state.get_service_level_controller().get_distributed_service_levels(qos::query_context::user);
} else {
return state.get_service_level_controller().get_distributed_service_level(_service_level);
}

View File

@@ -46,14 +46,14 @@ public:
protected:
std::optional<sstring> get_simple(const sstring& name) const;
std::optional<std::map<sstring, sstring>> get_map(const sstring& name) const;
void remove_from_map_if_exists(const sstring& name, const sstring& key) const;
public:
bool has_property(const sstring& name) const;
std::optional<value_type> get(const sstring& name) const;
std::optional<std::map<sstring, sstring>> get_map(const sstring& name) const;
sstring get_string(sstring key, sstring default_value) const;
// Return a property value, typed as a Boolean

View File

@@ -1132,7 +1132,12 @@ public:
write(out, uint64_t(0));
}
buf.remove_suffix(buf.size_bytes() - size);
auto to_remove = buf.size_bytes() - size;
// #20862 - we decrement the usage counter based on buf.size() below.
// Since we are shrinking the buffer here, we must also decrement
// the counter by the removed amount right away.
buf.remove_suffix(to_remove);
_segment_manager->totals.buffer_list_bytes -= to_remove;
// Build sector checksums.
auto id = net::hton(_desc.id);
@@ -3826,6 +3831,10 @@ uint64_t db::commitlog::get_total_size() const {
;
}
uint64_t db::commitlog::get_buffer_size() const {
return _segment_manager->totals.buffer_list_bytes;
}
uint64_t db::commitlog::get_completed_tasks() const {
return _segment_manager->totals.allocation_count;
}

View File

@@ -306,6 +306,7 @@ public:
future<> delete_segments(std::vector<sstring>) const;
uint64_t get_total_size() const;
uint64_t get_buffer_size() const;
uint64_t get_completed_tasks() const;
uint64_t get_flush_count() const;
uint64_t get_pending_tasks() const;

View File

@@ -1526,18 +1526,19 @@ future<> update_relabel_config_from_file(const std::string& name) {
co_return;
}
std::vector<sstring> split_comma_separated_list(sstring comma_separated_list) {
std::vector<sstring> split_comma_separated_list(const std::string_view comma_separated_list) {
std::vector<sstring> strs, trimmed_strs;
boost::split(strs, std::move(comma_separated_list), boost::is_any_of(","));
for (sstring n : strs) {
boost::split(strs, comma_separated_list, boost::is_any_of(","));
trimmed_strs.reserve(strs.size());
for (sstring& n : strs) {
std::replace(n.begin(), n.end(), '\"', ' ');
std::replace(n.begin(), n.end(), '\'', ' ');
boost::trim_all(n);
if (!n.empty()) {
trimmed_strs.push_back(n);
trimmed_strs.push_back(std::move(n));
}
}
return trimmed_strs;
}
}
} // namespace utils

View File

@@ -545,6 +545,6 @@ future<gms::inet_address> resolve(const config_file::named_value<sstring>&, gms:
*/
future<> update_relabel_config_from_file(const std::string& name);
std::vector<sstring> split_comma_separated_list(sstring comma_separated_list);
std::vector<sstring> split_comma_separated_list(std::string_view comma_separated_list);
}
} // namespace utils

View File

@@ -36,7 +36,7 @@ size_t quorum_for(const locator::effective_replication_map& erm) {
size_t local_quorum_for(const locator::effective_replication_map& erm, const sstring& dc) {
using namespace locator;
auto& rs = erm.get_replication_strategy();
const auto& rs = erm.get_replication_strategy();
if (rs.get_type() == replication_strategy_type::network_topology) {
const network_topology_strategy* nrs =
@@ -65,7 +65,7 @@ size_t block_for_local_serial(const locator::effective_replication_map& erm) {
size_t block_for_each_quorum(const locator::effective_replication_map& erm) {
using namespace locator;
auto& rs = erm.get_replication_strategy();
const auto& rs = erm.get_replication_strategy();
if (rs.get_type() == replication_strategy_type::network_topology) {
const network_topology_strategy* nrs =
@@ -260,7 +260,7 @@ filter_for_query(consistency_level cl,
size_t bf = block_for(erm, cl);
if (read_repair == read_repair_decision::DC_LOCAL) {
bf = std::max(block_for(erm, cl), local_count);
bf = std::max(bf, local_count);
}
if (bf >= live_endpoints.size()) { // RRD.DC_LOCAL + CL.LOCAL or CL.ALL

View File

@@ -35,8 +35,6 @@
#include <span>
#include <unordered_map>
class fragmented_temporary_buffer;
namespace utils {
class directories;
} // namespace utils

View File

@@ -741,8 +741,8 @@ system_distributed_keyspace::get_cdc_desc_v1_timestamps(context ctx) {
co_return res;
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_levels() const {
return qos::get_service_levels(_qp, NAME, SERVICE_LEVELS, db::consistency_level::ONE);
future<qos::service_levels_info> system_distributed_keyspace::get_service_levels(qos::query_context ctx) const {
return qos::get_service_levels(_qp, NAME, SERVICE_LEVELS, db::consistency_level::ONE, ctx);
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_level(sstring service_level_name) const {

View File

@@ -112,7 +112,7 @@ public:
future<db_clock::time_point> cdc_current_generation_timestamp(context);
future<qos::service_levels_info> get_service_levels() const;
future<qos::service_levels_info> get_service_levels(qos::query_context ctx) const;
future<qos::service_levels_info> get_service_level(sstring service_level_name) const;
future<> set_service_level(sstring service_level_name, qos::service_level_options slo) const;
future<> drop_service_level(sstring service_level_name) const;

View File

@@ -2044,7 +2044,6 @@ future<> view_builder::start_in_background(service::migration_manager& mm, utils
// the view build information.
fail.cancel();
co_await barrier.arrive_and_wait();
units.return_all();
co_await calculate_shard_build_step(vbi);
_mnotifier.register_listener(this);
@@ -2349,7 +2348,7 @@ static future<> announce_with_raft(
future<> view_builder::mark_view_build_started(sstring ks_name, sstring view_name) {
co_await write_view_build_status(
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
co_await utils::get_local_injector().inject("view_builder_pause_add_new_view",
[] (auto& handler) { return handler.wait_for_message(db::timeout_clock::now() + std::chrono::minutes(5)); });
const sstring query_string = format("INSERT INTO {}.{} (keyspace_name, view_name, host_id, status) VALUES (?, ?, ?, ?)",
@@ -2359,7 +2358,7 @@ future<> view_builder::mark_view_build_started(sstring ks_name, sstring view_nam
{std::move(ks_name), std::move(view_name), host_id.uuid(), "STARTED"},
"view builder: mark view build STARTED");
},
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
co_await utils::get_local_injector().inject("view_builder_pause_add_new_view",
[] (auto& handler) { return handler.wait_for_message(db::timeout_clock::now() + std::chrono::minutes(5)); });
co_await _sys_dist_ks.start_view_build(std::move(ks_name), std::move(view_name));
@@ -2369,7 +2368,7 @@ future<> view_builder::mark_view_build_started(sstring ks_name, sstring view_nam
future<> view_builder::mark_view_build_success(sstring ks_name, sstring view_name) {
co_await write_view_build_status(
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
co_await utils::get_local_injector().inject("view_builder_pause_mark_success",
[] (auto& handler) { return handler.wait_for_message(db::timeout_clock::now() + std::chrono::minutes(5)); });
const sstring query_string = format("UPDATE {}.{} SET status = ? WHERE keyspace_name = ? AND view_name = ? AND host_id = ?",
@@ -2379,7 +2378,7 @@ future<> view_builder::mark_view_build_success(sstring ks_name, sstring view_nam
{"SUCCESS", std::move(ks_name), std::move(view_name), host_id.uuid()},
"view builder: mark view build SUCCESS");
},
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
co_await utils::get_local_injector().inject("view_builder_pause_mark_success",
[] (auto& handler) { return handler.wait_for_message(db::timeout_clock::now() + std::chrono::minutes(5)); });
co_await _sys_dist_ks.finish_view_build(std::move(ks_name), std::move(view_name));
@@ -2389,14 +2388,14 @@ future<> view_builder::mark_view_build_success(sstring ks_name, sstring view_nam
future<> view_builder::remove_view_build_status(sstring ks_name, sstring view_name) {
co_await write_view_build_status(
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
const sstring query_string = format("DELETE FROM {}.{} WHERE keyspace_name = ? AND view_name = ?",
db::system_keyspace::NAME, db::system_keyspace::VIEW_BUILD_STATUS_V2);
co_await announce_with_raft(_qp, _group0_client, _as, std::move(query_string),
{std::move(ks_name), std::move(view_name)},
"view builder: delete view build status");
},
[&] () -> future<> {
[this, ks_name, view_name] () -> future<> {
co_await _sys_dist_ks.remove_view(std::move(ks_name), std::move(view_name));
}
);
@@ -2444,11 +2443,11 @@ view_builder::view_build_statuses(sstring keyspace, sstring view_name) const {
future<> view_builder::add_new_view(view_ptr view, build_step& step) {
vlogger.info0("Building view {}.{}, starting at token {}", view->ks_name(), view->cf_name(), step.current_token());
if (this_shard_id() == 0) {
co_await mark_view_build_started(view->ks_name(), view->cf_name());
}
co_await _sys_ks.register_view_for_building(view->ks_name(), view->cf_name(), step.current_token());
step.build_status.emplace(step.build_status.begin(), view_build_status{view, step.current_token(), std::nullopt});
auto f = this_shard_id() == 0 ? mark_view_build_started(view->ks_name(), view->cf_name()) : make_ready_future<>();
return when_all_succeed(
std::move(f),
_sys_ks.register_view_for_building(view->ks_name(), view->cf_name(), step.current_token())).discard_result();
}
static future<> flush_base(lw_shared_ptr<replica::column_family> base, abort_source& as) {
@@ -3154,16 +3153,16 @@ future<> view_builder::register_staging_sstable(sstables::shared_sstable sst, lw
return _vug.register_staging_sstable(std::move(sst), std::move(table));
}
future<bool> check_needs_view_update_path(view_builder& vb, const locator::token_metadata& tm, const replica::table& t, streaming::stream_reason reason) {
future<bool> check_needs_view_update_path(view_builder& vb, locator::token_metadata_ptr tmptr, const replica::table& t, streaming::stream_reason reason) {
if (is_internal_keyspace(t.schema()->ks_name())) {
return make_ready_future<bool>(false);
}
if (reason == streaming::stream_reason::repair && !t.views().empty()) {
return make_ready_future<bool>(true);
}
return do_with(t.views(), [&vb, &tm] (auto& views) {
return do_with(std::move(tmptr), t.views(), [&vb] (locator::token_metadata_ptr& tmptr, auto& views) {
return map_reduce(views,
[&vb, &tm] (const view_ptr& view) { return vb.check_view_build_ongoing(tm, view->ks_name(), view->cf_name()); },
[&] (const view_ptr& view) { return vb.check_view_build_ongoing(*tmptr, view->ks_name(), view->cf_name()); },
false,
std::logical_or<bool>());
});

View File

@@ -10,20 +10,17 @@
#include <seastar/core/future.hh>
#include "streaming/stream_reason.hh"
#include "locator/token_metadata_fwd.hh"
#include "seastarx.hh"
namespace replica {
class table;
}
namespace locator {
class token_metadata;
}
namespace db::view {
class view_builder;
future<bool> check_needs_view_update_path(view_builder& vb, const locator::token_metadata& tm, const replica::table& t,
future<bool> check_needs_view_update_path(view_builder& vb, locator::token_metadata_ptr tmptr, const replica::table& t,
streaming::stream_reason reason);
}

View File

@@ -40,6 +40,25 @@ if __name__ == '__main__':
help='enable compress on systemd-coredump')
args = parser.parse_args()
# A specific version of the systemd package on RHEL9 seems to have a bug in its
# SELinux configuration: it introduced the "systemd-container-coredump" module
# to provide rules for systemd-coredump, but the module is not enabled by default.
# We have to load it manually, otherwise it causes a permission error.
# (#19325)
if is_redhat_variant() and distro.major_version() == '9':
if not shutil.which('getenforce'):
pkg_install('libselinux-utils')
if not shutil.which('semodule'):
pkg_install('policycoreutils')
enforce = out('getenforce')
if enforce != "Disabled":
if os.path.exists('/usr/share/selinux/packages/targeted/systemd-container-coredump.pp.bz2'):
modules = out('semodule -l')
match = re.match(r'^systemd-container-coredump$', modules, re.MULTILINE)
if not match:
run('semodule -v -i /usr/share/selinux/packages/targeted/systemd-container-coredump.pp.bz2', shell=True, check=True)
run('semodule -v -e systemd-container-coredump', shell=True, check=True)
# abrt-ccpp.service needs to stop before enabling systemd-coredump,
# since both will try to install kernel coredump handler
# (This is only required for abrt < 2.14)
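Note: the module check in this hunk calls `re.match` with `re.MULTILINE`. In Python, `re.match` only matches at the start of the string even in multiline mode, so the check appears to succeed only when the module is the first line of the listing; `re.search` looks at every line. The sketch below shows the difference. The helper name `module_listed` and the sample listing are assumptions for illustration, and real `semodule -l` output may differ by version.

```python
import re


def module_listed(modules_output, name):
    """Return True if `name` appears as a whole line anywhere in the listing."""
    pattern = rf'^{re.escape(name)}$'
    return re.search(pattern, modules_output, re.MULTILINE) is not None


# Hypothetical listing: one module name per line, target module not first.
sample = "abrt\nsystemd-container-coredump\nzebra"

found_with_search = module_listed(sample, 'systemd-container-coredump')
found_with_match = re.match(r'^systemd-container-coredump$', sample, re.MULTILINE) is not None
```

With this sample, `found_with_search` is true and `found_with_match` is false, which suggests the install branch would run again even when the module is already loaded.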

View File

@@ -16,6 +16,7 @@ import sys
import stat
import logging
import pyudev
import psutil
from pathlib import Path
from scylla_util import *
from subprocess import run, SubprocessError
@@ -92,6 +93,15 @@ class UdevInfo:
def id_links(self):
return [l for l in self.device.device_links if l.startswith('/dev/disk/by-id')]
def is_selinux_enabled():
partitions = psutil.disk_partitions(all=True)
for p in partitions:
if p.fstype == 'selinuxfs':
if os.path.exists(p.mountpoint + '/enforce'):
return True
return False
if __name__ == '__main__':
if os.getuid() > 0:
print('Requires root permission.')
@@ -333,3 +343,43 @@ WantedBy=local-fs.target
LOGGER.error(f'Error detected, dumping udev env parameters on {fsdev}')
udev_info.verify()
udev_info.dump_variables()
if is_redhat_variant():
offline_skip_relabel = False
has_semanage = True
if not shutil.which('matchpathcon'):
offline_skip_relabel = True
pkg_install('libselinux-utils', offline_exit=False)
if not shutil.which('restorecon'):
offline_skip_relabel = True
pkg_install('policycoreutils', offline_exit=False)
if not shutil.which('semanage'):
if is_offline():
has_semanage = False
else:
pkg_install('policycoreutils-python-utils')
if is_offline() and offline_skip_relabel:
print('Unable to find SELinux tools, skip relabeling.')
sys.exit(0)
selinux_context = out('matchpathcon -n /var/lib/systemd/coredump')
selinux_type = selinux_context.split(':')[2]
if has_semanage:
run(f'semanage fcontext -a -t {selinux_type} "{root}/coredump(/.*)?"', shell=True, check=True)
else:
# without semanage, we need to update file_contexts directly,
# and compile it to binary format (.bin file)
try:
with open('/etc/selinux/targeted/contexts/files/file_contexts.local', 'a') as f:
spacer = ''
if f.tell() != 0:
spacer = '\n'
f.write(f'{spacer}{root}/coredump(/.*)? {selinux_context}\n')
except FileNotFoundError as e:
print('Unable to find SELinux policy files, skip relabeling.')
sys.exit(0)
run('sefcontext_compile /etc/selinux/targeted/contexts/files/file_contexts.local', shell=True, check=True)
if is_selinux_enabled():
run(f'restorecon -F -v -R {root}', shell=True, check=True)
else:
Path('/.autorelabel').touch(exist_ok=True)
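Note: the relabeling block above takes the third field of the `matchpathcon` output as the SELinux type and, when `semanage` is unavailable, appends a rule to `file_contexts.local`. A minimal sketch of those two string operations follows. The helper names and the sample context value are assumptions for illustration; the only fact relied on is that an SELinux context has the form `user:role:type:level`.

```python
def selinux_type(context):
    """Extract the type field from a 'user:role:type:level' context string."""
    return context.split(':')[2]


def fcontext_line(root, context, file_is_empty):
    """Build the line appended to file_contexts.local for the coredump directory."""
    # A leading newline separates the new rule from existing content, as in the hunk.
    spacer = '' if file_is_empty else '\n'
    return f'{spacer}{root}/coredump(/.*)? {context}\n'


# Illustrative value only; the real context comes from `matchpathcon -n`.
sample_context = 'system_u:object_r:systemd_coredump_var_lib_t:s0'
sample_type = selinux_type(sample_context)
sample_line = fcontext_line('/var/lib/scylla', sample_context, file_is_empty=True)
```

The `semanage fcontext` path only needs the type, while the fallback path writes the full context, which is why the hunk keeps both `selinux_type` and `selinux_context`.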

View File

@@ -293,13 +293,14 @@ def swap_exists():
swaps = out('swapon --noheadings --raw')
return True if swaps != '' else False
def pkg_error_exit(pkg):
def pkg_error_exit(pkg, offline_exit=True):
print(f'Package "{pkg}" required.')
sys.exit(1)
if offline_exit:
sys.exit(1)
def yum_install(pkg):
def yum_install(pkg, offline_exit=True):
if is_offline():
pkg_error_exit(pkg)
pkg_error_exit(pkg, offline_exit)
return run(f'yum install -y {pkg}', shell=True, check=True)
def apt_is_updated():
@@ -313,9 +314,9 @@ def apt_is_updated():
APT_GET_UPDATE_NUM_RETRY = 30
APT_GET_UPDATE_RETRY_INTERVAL = 10
def apt_install(pkg):
def apt_install(pkg, offline_exit=True):
if is_offline():
pkg_error_exit(pkg)
pkg_error_exit(pkg, offline_exit)
# The lock for update and install/remove are different, and
# DPkg::Lock::Timeout will only wait for install/remove lock.
@@ -344,14 +345,14 @@ def apt_install(pkg):
apt_env['DEBIAN_FRONTEND'] = 'noninteractive'
return run(f'apt-get -o DPkg::Lock::Timeout=300 install -y {pkg}', shell=True, check=True, env=apt_env)
def emerge_install(pkg):
def emerge_install(pkg, offline_exit=True):
if is_offline():
pkg_error_exit(pkg)
pkg_error_exit(pkg, offline_exit)
return run(f'emerge -uq {pkg}', shell=True, check=True)
def zypper_install(pkg):
def zypper_install(pkg, offline_exit=True):
if is_offline():
pkg_error_exit(pkg)
pkg_error_exit(pkg, offline_exit)
return run(f'zypper install -y {pkg}', shell=True, check=True)
def pkg_distro():
@@ -364,18 +365,20 @@ def pkg_distro():
else:
return distro.id()
pkg_xlat = {'cpupowerutils': {'debian': 'linux-cpupower', 'gentoo':'sys-power/cpupower', 'arch':'cpupower', 'suse': 'cpupower'}}
def pkg_install(pkg):
pkg_xlat = {'cpupowerutils': {'debian': 'linux-cpupower', 'gentoo':'sys-power/cpupower', 'arch':'cpupower', 'suse': 'cpupower'},
'policycoreutils-python-utils': {'amzn2': 'policycoreutils-python'}}
def pkg_install(pkg, offline_exit=True):
if pkg in pkg_xlat and pkg_distro() in pkg_xlat[pkg]:
pkg = pkg_xlat[pkg][pkg_distro()]
if is_redhat_variant():
return yum_install(pkg)
return yum_install(pkg, offline_exit)
elif is_debian_variant():
return apt_install(pkg)
return apt_install(pkg, offline_exit)
elif is_gentoo():
return emerge_install(pkg)
return emerge_install(pkg, offline_exit)
elif is_suse_variant():
return zypper_install(pkg)
return zypper_install(pkg, offline_exit)
else:
pkg_error_exit(pkg)
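Note: with `offline_exit=False`, `pkg_error_exit` prints the message and returns instead of exiting. As the install helpers are written in this hunk, control then appears to fall through to the install command even in offline mode. The sketch below models that flow; the `offline` parameter and the `commands` list are stand-ins (assumptions) for `is_offline()` and `subprocess.run`.

```python
import sys

commands = []  # stand-in for subprocess.run: records what would be executed


def pkg_error_exit(pkg, offline_exit=True):
    print(f'Package "{pkg}" required.')
    if offline_exit:
        sys.exit(1)


def yum_install(pkg, offline, offline_exit=True):
    if offline:
        # Exits when offline_exit is True, otherwise just prints and returns.
        pkg_error_exit(pkg, offline_exit)
    commands.append(f'yum install -y {pkg}')
    return commands[-1]
```

Calling `yum_install('policycoreutils', offline=True, offline_exit=False)` in this model still records the `yum install` command, whereas the default `offline_exit=True` raises `SystemExit` before reaching it.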

View File

@@ -1 +1 @@
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts"
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --no-collector.hwmon"

View File

@@ -2,7 +2,7 @@
{% for group in data %}
{% if group.value_status_count[value_status] > 0 %}
.. _confgroup_{{ group.name }}:
.. _confgroup_{{ group.name|lower|replace(" ", "_") }}:
{{ group.name }}
{{ '-' * (group.name|length) }}
@@ -13,7 +13,7 @@
{% for item in group.properties %}
{% if item.value_status == value_status %}
.. _confprop_{{ item.name }}:
.. _confprop_{{ item.name|lower|replace(" ", "_") }}:
.. confval:: {{ item.name }}
{% endif %}
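
The filter chain added to both label targets lowercases the name and replaces spaces with underscores. A rough Python equivalent, as a sketch only: `anchor_slug` and the sample names are hypothetical and are not part of the template.

```python
def anchor_slug(name):
    # Mirrors the Jinja filter chain `name|lower|replace(" ", "_")`.
    return name.lower().replace(" ", "_")

# Sample names are made up for illustration.
print("confgroup_" + anchor_slug("Ungrouped properties"))
print("confprop_" + anchor_slug("task_ttl_in_seconds"))
```

Names that are already lowercase and contain no spaces come out unchanged, so labels for such entries stay the same.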

View File

@@ -1,12 +1,13 @@
### a dictionary of redirections
#old path: new path
# Move up the Features section
# THESE REDIRECTIONS SHOULD BE UNCOMMENTED WHEN 6.2 IS RELEASED
# Before 6.2 documentation is available, these redirections result in 404
#/stable/troubleshooting/nodetool-memory-read-timeout.html: /stable/troubleshooting/index.html
# Move up the Features section
#/stable/using-scylla/features.html: /stable/features/index.html
#/stable/using-scylla/lwt.html: /stable/features/lwt.html
#/stable/using-scylla/secondary-indexes.html: /stable/features/secondary-indexes.html

View File

@@ -50,6 +50,13 @@ Which yields, for `/proc/sys/fs/aio-max-nr`:
$ docker run --name some-scylla --hostname some-scylla -d scylladb/scylla
```
If you're on macOS and plan to start a multi-node cluster (3 nodes or more), start ScyllaDB with
`reactor-backend=epoll` to override the default `linux-aio` reactor backend:
```console
$ docker run --name some-scylla --hostname some-scylla -d scylladb/scylla --reactor-backend=epoll
```
### Run `nodetool` utility
```console
@@ -77,6 +84,11 @@ cqlsh>
```console
$ docker run --name some-scylla2 --hostname some-scylla2 -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla)"
```
If you're on macOS, make sure to add the `reactor-backend=epoll` option when adding new nodes:
```console
$ docker run --name some-scylla2 --hostname some-scylla2 -d scylladb/scylla --reactor-backend=epoll --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla)"
```
#### Make a cluster with Docker Compose
@@ -344,90 +356,6 @@ The `--authenticator` command lines option allows to provide the authenticator c
The `--authorizer` command lines option allows to provide the authorizer class ScyllaDB will use. By default ScyllaDB uses the `AllowAllAuthorizer` which allows any action to any user. The second option is using the `CassandraAuthorizer` parameter, which stores permissions in `system.permissions` table.
**Since: 2.3**
### JMX parameters
JMX ScyllaDB service is initialized from the `/scylla-jmx-service.sh` on
container startup. By default the script uses `/etc/sysconfig/scylla-jmx`
to read the default configuration. It then can be overridden by setting
environmental parameters.
An example:
docker run -d -e "SCYLLA_JMX_ADDR=-ja 0.0.0.0" -e SCYLLA_JMX_REMOTE=-r --publish 7199:7199 scylladb/scylla
#### SCYLLA_JMX_PORT
Scylla JMX listening port.
Default value:
SCYLLA_JMX_PORT="-jp 7199"
#### SCYLLA_API_PORT
Scylla API port for JMX to connect to.
Default value:
SCYLLA_API_PORT="-p 10000"
#### SCYLLA_API_ADDR
Scylla API address for JMX to connect to.
Default value:
SCYLLA_API_ADDR="-a localhost"
#### SCYLLA_JMX_ADDR
JMX address to bind on.
Default value:
SCYLLA_JMX_ADDR="-ja localhost"
For example, it is possible to make JMX available to the outer world
by changing its bind address to `0.0.0.0`:
docker run -d -e "SCYLLA_JMX_ADDR=-ja 0.0.0.0" -e SCYLLA_JMX_REMOTE=-r --publish 7199:7199 scylladb/scylla
`cassandra-stress` requires direct access to the JMX.
#### SCYLLA_JMX_FILE
A JMX service configuration file path.
Example value:
SCYLLA_JMX_FILE="-cf /etc/scylla.d/scylla-user.cfg"
#### SCYLLA_JMX_LOCAL
The location of the JMX executable.
Example value:
SCYLLA_JMX_LOCAL="-l /opt/scylladb/jmx"
#### SCYLLA_JMX_REMOTE
Allow JMX to run remotely.
Example value:
SCYLLA_JMX_REMOTE="-r"
#### SCYLLA_JMX_DEBUG
Enable debugger.
Example value:
SCYLLA_JMX_DEBUG="-d"
### Related Links
* [Best practices for running ScyllaDB on docker](http://docs.scylladb.com/procedures/best_practices_scylla_on_docker/)

View File

@@ -194,7 +194,7 @@ Alternatively, you can explicitly install **all** the ScyllaDB packages for the
.. code-block:: console
sudo apt-get install scylla-enterprise{,-server,-jmx,-tools,-tools-core,-kernel-conf,-node-exporter,-conf,-python3}=2021.1.0-0.20210511.9e8e7d58b-1
sudo apt-get install scylla-enterprise{,-server,-tools,-tools-core,-kernel-conf,-node-exporter,-conf,-python3}=2021.1.0-0.20210511.9e8e7d58b-1
sudo apt-get install scylla-enterprise-machine-image=2021.1.0-0.20210511.9e8e7d58b-1 # only execute on AMI instance

View File

@@ -1,8 +1,11 @@
Features
========================
This document highlights ScyllaDB's key data modeling features.
.. toctree::
:maxdepth: 1
:hidden:
Lightweight Transactions </features/lwt/>
Global Secondary Indexes </features/secondary-indexes/>
@@ -12,6 +15,23 @@ Features
Change Data Capture </features/cdc/index>
Workload Attributes </features/workload-attributes>
`ScyllaDB Enterprise <https://enterprise.docs.scylladb.com/stable/overview.html#enterprise-only-features>`_
provides additional features, including Encryption at Rest,
workload prioritization, auditing, and more.
.. panel-box::
:title: ScyllaDB Features
:id: "getting-started"
:class: my-panel
* Secondary Indexes and Materialized Views provide efficient search mechanisms
on non-partition keys by creating an index.
* :doc:`Global Secondary Indexes </features/secondary-indexes/>`
* :doc:`Local Secondary Indexes </features/local-secondary-indexes/>`
* :doc:`Materialized Views </features/materialized-views/>`
* :doc:`Lightweight Transactions </features/lwt/>` provide conditional updates
through linearizability.
* :doc:`Counters </features/counters/>` are columns that only allow their values
to be incremented, decremented, read, or deleted.
* :doc:`Change Data Capture </features/cdc/index>` allows you to query the current
state and the history of all changes made to tables in the database.
* :doc:`Workload Attributes </features/workload-attributes>` assigned to your workloads
specify how ScyllaDB will handle requests depending on the workload.

View File

@@ -1,14 +1,14 @@
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_ on other x86_64 or aarch64 platforms, without any guarantees.
+----------------------------+--------------------+-------+---------------+
| Linux Distributions |Ubuntu | Debian| Rocky / |
| | | | RHEL |
| Linux Distributions |Ubuntu | Debian|Rocky / CentOS |
| | | |/ RHEL |
+----------------------------+------+------+------+-------+-------+-------+
| ScyllaDB Version / Version |20.04 |22.04 |24.04 | 11 | 8 | 9 |
+============================+======+======+======+=======+=======+=======+
| 6.1 | |v| | |v| | |v| | |v| | |v| | |v| |
| 6.2 | |v| | |v| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+------+-------+-------+-------+
| 6.0 | |v| | |v| | |v| | |v| | |v| | |v| |
| 6.1 | |v| | |v| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+------+-------+-------+-------+
* The recommended OS for ScyllaDB Open Source is Ubuntu 22.04.
@@ -18,4 +18,4 @@ Supported Architecture
-----------------------------
ScyllaDB Open Source supports x86_64 for all versions and AArch64 starting from ScyllaDB 4.6 and nightly build.
In particular, aarch64 support includes AWS EC2 Graviton.
In particular, aarch64 support includes AWS EC2 Graviton.

View File

@@ -175,7 +175,7 @@ Recommended instances types are `n1-highmem <https://cloud.google.com/compute/do
* - n2-highmem-32
- 32
- 256
- 6,000
- 9,000
* - n2-highmem-48
- 48
- 384

View File

@@ -46,7 +46,7 @@ Install ScyllaDB
.. code-block:: console
sudo gpg --homedir /tmp --no-default-keyring --keyring /etc/apt/keyrings/scylladb.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 491c93b9de7496a7
sudo gpg --homedir /tmp --no-default-keyring --keyring /etc/apt/keyrings/scylladb.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys a43e06657bac99e3
.. code-block:: console
:substitutions:

View File

@@ -78,7 +78,7 @@ Launching Instances from ScyllaDB AMI
* The ``scylla.yaml`` file: ``/etc/scylla/scylla.yaml``
* Data: ``/var/lib/scylla/``
To check that the ScyllaDB server and the JMX component are running, run:
To check that the ScyllaDB server is running, run:
.. code-block:: console

View File

@@ -77,7 +77,7 @@ Launching ScyllaDB on Azure
ssh -i ~/.ssh/ssh-key.pem scyllaadm@public-ip
To check that the ScyllaDB server and the JMX component are running, run:
To check that the ScyllaDB server is running, run:
.. code-block:: console

View File

@@ -63,7 +63,7 @@ Launching ScyllaDB on GCP
gcloud compute ssh scylla-node1
To check that the ScyllaDB server and the JMX component are running, run:
To check that the ScyllaDB server is running, run:
.. code-block:: console

View File

@@ -1,8 +1,3 @@
.. |SCYLLADB_VERSION| replace:: 5.2
.. update the version folder URL below (variables won't work):
https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.2/
====================================================
Install ScyllaDB Without root Privileges
====================================================
@@ -24,14 +19,17 @@ Note that if you're on CentOS 7, only root offline installation is supported.
Download and Install
-----------------------
#. Download the latest tar.gz file for ScyllaDB |SCYLLADB_VERSION| (x86 or ARM) from https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.2/.
#. Download the latest tar.gz file for ScyllaDB version (x86 or ARM) from ``https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-<version>/``.
Example for version 6.1: https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-6.1/
#. Uncompress the downloaded package.
The following example shows the package for ScyllaDB 5.2.4 (x86):
The following example shows the package for ScyllaDB 6.1.1 (x86):
.. code:: console
tar xvfz scylla-unified-5.2.4-0.20230623.cebbf6c5df2b.x86_64.tar.gz
tar xvfz scylla-unified-6.1.1-0.20240814.8d90b817660a.x86_64.tar.gz
#. Install OpenJDK 8 or 11.

View File

@@ -71,7 +71,7 @@ This will send ScyllaDB only logs to :code:`/var/log/scylla/scylla.log`
Logging on Docker
-----------------
Starting from ScyllaDB 1.3, `ScyllaDB Docker <https://hub.docker.com/r/scylladb/scylla/>`_, you should use :code:`docker logs` command to access ScyllaDB server and JMX proxy logs
Starting from ScyllaDB 1.3, `ScyllaDB Docker <https://hub.docker.com/r/scylladb/scylla/>`_, you should use :code:`docker logs` command to access ScyllaDB server logs.
.. include:: /rst_include/advance-index.rst

View File

@@ -26,13 +26,6 @@ By default, ScyllaDB runs as user ``scylla`` in group ``scylla``. The following
4. Edit ``/etc/systemd/system/multi-user.target.wants/node-exporter.service``
.. code-block:: sh
User=test
Group=test
5. Edit /usr/lib/systemd/system/scylla-jmx.service
.. code-block:: sh
User=test
@@ -51,5 +44,4 @@ At this point, all services should be started as test:test user:
.. code-block:: sh
test 8760 1 11 14:42 ? 00:00:01 /usr/bin/scylla --log-to-syslog 1 --log-to-std ...
test 8765 1 12 14:42 ? 00:00:01 /opt/scylladb/jmx/symlinks/scylla-jmx -Xmx256m ...
test 13638 1 0 14:30 ? 00:00:00 /usr/bin/node_exporter --collector.interrupts

View File

@@ -11,7 +11,7 @@ For example:
ScyllaDB uses available memory to cache your data. ScyllaDB knows how to dynamically manage memory for optimal performance, for example, if many clients connect to ScyllaDB, it will evict some data from the cache to make room for these connections, when the connection count drops again, this memory is returned to the cache.
To limit the memory usage you can start scylla with ``--memory`` parameter.
Alternatively, you can specify the amount of memory ScyllaDB should leave to the OS with ``--reserve-memory`` parameter. Keep in mind that the amount of memory left to the operating system needs to suffice external scylla modules, such as ``scylla-jmx``, which runs on top of JVM.
Alternatively, you can specify the amount of memory ScyllaDB should leave to the OS with ``--reserve-memory`` parameter. Keep in mind that the amount of memory left to the operating system needs to suffice external scylla modules.
On Ubuntu, edit the ``/etc/default/scylla-server``.

View File

@@ -14,8 +14,6 @@ Port Description Protocol
------ -------------------------------------------- --------
7001 SSL inter-node communication (RPC) TCP
------ -------------------------------------------- --------
7199 JMX management TCP
------ -------------------------------------------- --------
10000 ScyllaDB REST API TCP
------ -------------------------------------------- --------
9180 Prometheus API TCP

View File

@@ -146,9 +146,7 @@ The ScyllaDB ports are detailed in the table below. For ScyllaDB Manager ports,
.. include:: /operating-scylla/_common/networking-ports.rst
All ports above need to be open to external clients (CQL), external admin systems (JMX), and other nodes (RPC). REST API port can be kept closed for incoming external connections.
The JMX service, :code:`scylla-jmx`, runs on port 7199. It is required in order to manage ScyllaDB using :code:`nodetool` and other Apache Cassandra-compatible utilities. The :code:`scylla-jmx` process must be able to connect to port 10000 on localhost. The JMX service listens for incoming JMX connections on all network interfaces on the system.
All ports above need to be open to external clients (CQL) and other nodes (RPC). REST API port can be kept closed for incoming external connections.
Advanced networking
-------------------
@@ -223,10 +221,6 @@ Monitoring Stack
|mon_root|
JMX
---
ScyllaDB JMX is compatible with Apache Cassandra, exposing the relevant subset of MBeans.
.. REST
.. include:: /operating-scylla/rest.rst

View File

@@ -31,7 +31,7 @@ Parameter Descriptio
-------------------------------------------------------------------- -------------------------------------------------------------------------------------
-kc <ktlist>, --kc.list <ktlist> The list of Keyspaces to take snapshot
-------------------------------------------------------------------- -------------------------------------------------------------------------------------
-p <port> / --port <port> Remote jmx agent port number
-p <port> / --port <port> The port of the REST API of the ScyllaDB node.
-------------------------------------------------------------------- -------------------------------------------------------------------------------------
-sf / --skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
-------------------------------------------------------------------- -------------------------------------------------------------------------------------

View File

@@ -17,6 +17,16 @@ Nodetool tasks
Task manager is an API-based tool for tracking long-running background operations, such as repair or compaction,
which makes them observable and controllable. Task manager operates per node.
Task Status Retention
---------------------
* When a task completes, its status is temporarily stored on the executing node
* Status information is retained for up to :confval:`task_ttl_in_seconds` seconds
* The status information of a completed task is automatically removed after being queried with ``tasks status`` or ``tasks tree``
* ``tasks wait`` returns the status, but it does not remove the task information of the queried task
.. note:: Multiple status queries using ``tasks status`` and ``tasks tree`` for the same completed task will only receive a response for the first query, since the status is removed after being retrieved.
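
The retention rules above can be pictured with a toy in-memory model. This is a sketch for illustration only, not ScyllaDB's implementation: the class, its method names, and the explicit ``now`` argument are all hypothetical, and ``tasks tree`` is assumed to behave like ``status`` here.

```python
class CompletedTaskStore:
    """Toy model of completed-task status retention (not ScyllaDB code)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds  # stands in for task_ttl_in_seconds
        self._done = {}                 # task_id -> (status, completion_time)

    def complete(self, task_id, status, now):
        # A finished task's status is kept on the node that ran it.
        self._done[task_id] = (status, now)

    def _lookup(self, task_id, now):
        entry = self._done.get(task_id)
        if entry is None:
            return None
        status, finished_at = entry
        if now - finished_at > self.ttl_seconds:
            # Retention period elapsed: the status is gone.
            del self._done[task_id]
            return None
        return status

    def status(self, task_id, now):
        # Models `tasks status` / `tasks tree`: reading removes the entry.
        result = self._lookup(task_id, now)
        self._done.pop(task_id, None)
        return result

    def wait(self, task_id, now):
        # Models `tasks wait`: returns the status without removing it.
        return self._lookup(task_id, now)


store = CompletedTaskStore(ttl_seconds=10)
store.complete("t1", "done", now=0)
print(store.wait("t1", now=1))    # status is still retained afterwards
print(store.status("t1", now=2))  # first status query succeeds
print(store.status("t1", now=3))  # second status query finds nothing
```

In this model, only the first ``status`` call returns a result, while ``wait`` can be repeated until the retention period runs out.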
Supported tasks suboperations
-----------------------------

View File

@@ -61,26 +61,14 @@ Nodetool
nodetool-commands/viewbuildstatus
nodetool-commands/version
The ``nodetool`` utility provides a simple command-line interface to the following exposed operations and attributes. ScyllaDB's nodetool is a fork of `the Apache Cassandra nodetool <https://cassandra.apache.org/doc/latest/tools/nodetool/nodetool.html>`_ with the same syntax and a subset of the operations.
The ``nodetool`` utility provides a simple command-line interface to the following exposed operations and attributes.
.. _nodetool-generic-options:
Nodetool generic options
========================
All options are supported:
* ``-p <port>`` or ``--port <port>`` - Remote JMX agent port number.
* ``-pp`` or ``--print-port`` - Operate in 4.0 mode with hosts disambiguated by port number.
* ``-pw <password>`` or ``--password <password>`` - Remote JMX agent password.
* ``-pwf <passwordFilePath>`` or ``--password-file <passwordFilePath>`` - Path to the JMX password file.
* ``-u <username>`` or ``--username <username>`` - Remote JMX agent username.
* ``-p <port>`` or ``--port <port>`` - The port of the REST API of the ScyllaDB node.
* ``--`` - Separates command-line options from the list of arguments (useful when an argument might be mistaken for a command-line option).
Supported Nodetool operations
@@ -145,4 +133,4 @@ Operations that are not listed below are currently not available.
* :doc:`viewbuildstatus </operating-scylla/nodetool-commands/viewbuildstatus/>` - Shows the progress of a materialized view build.
* :doc:`version </operating-scylla/nodetool-commands/version>` - Print the DB version.
.. include:: /rst_include/apache-copyrights.rst

View File

@@ -41,7 +41,7 @@ With the recent addition of the `ScyllaDB Advisor <http://monitoring.docs.scylla
Install ScyllaDB Manager
------------------------
Install and use `ScyllaDB Manager <https://manager.docs.scylladb.com>` together with the `ScyllaDB Monitoring Stack <http://monitoring.docs.scylladb.com/>`_.
Install and use `ScyllaDB Manager <https://manager.docs.scylladb.com>`_ together with the `ScyllaDB Monitoring Stack <http://monitoring.docs.scylladb.com/>`_.
ScyllaDB Manager provides automated backups and repairs of your database.
ScyllaDB Manager can manage multiple ScyllaDB clusters and run cluster-wide tasks in a controlled and predictable way.
For example, with ScyllaDB Manager you can control the intensity of a repair, increasing it to speed up the process, or lower the intensity to ensure it minimizes impact on ongoing operations.

View File

@@ -22,6 +22,13 @@ To start a single ScyllaDB node instance in a Docker container, run:
docker run --name some-scylla -d scylladb/scylla
If you're on macOS and plan to start a multi-node cluster (3 nodes or more), start ScyllaDB with
``reactor-backend=epoll`` to override the default ``linux-aio`` reactor backend:
.. code-block:: console
docker run --name some-scylla -d scylladb/scylla --reactor-backend=epoll
The ``docker run`` command starts a new Docker instance in the background named some-scylla that runs the ScyllaDB server:
.. code-block:: console
@@ -95,6 +102,12 @@ With a single ``some-scylla`` instance running, joining new nodes to form a clu
docker run --name some-scylla2 -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla)"
If you're on macOS, make sure to add the ``reactor-backend=epoll`` option when adding new nodes:
.. code-block:: console
docker run --name some-scylla2 -d scylladb/scylla --reactor-backend=epoll --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla)"
To query when the node is up and running (and view the status of the entire cluster) use the ``nodetool status`` command:
.. code-block:: console

View File

@@ -6,8 +6,8 @@ ScyllaDB exposes a REST API to retrieve administrative information from a node a
administrative operations. For example, it allows you to check or update configuration,
retrieve cluster-level information, and more.
The :doc:`nodetool </operating-scylla/nodetool>` CLI tool interacts with a *scylla-jmx* process using JMX.
The process, in turn, uses the REST API to interact with the ScyllaDB process.
The :doc:`nodetool </operating-scylla/nodetool>` CLI tool uses the REST API
to interact with the ScyllaDB process.
You can interact with the REST API directly using :code:`curl`, ScyllaDB's CLI for REST API, or the Swagger UI.

View File

@@ -11,7 +11,7 @@ Procedure
#. Enable authentication
Enable authentication and define authorized roles in the cluster as described in the `Enable Authentication </operating-scylla/security/authentication/>`_ document.
Enable authentication and define authorized roles in the cluster as described in the :doc:`Enable Authentication </operating-scylla/security/authentication/>` document.
#. Enable CQL transport TLS using client certificate verification

View File

@@ -3,7 +3,7 @@ Encryption: Data in Transit Client to Node
Follow the procedures below to enable a client to node encryption.
Once enabled, all communication between the client and the node is transmitted over TLS/SSL.
The libraries used by ScyllaDB for OpenSSL are FIPS 140-2 certified.
The libraries used by ScyllaDB for OpenSSL are FIPS 140-2 enabled.
Workflow
^^^^^^^^

View File

@@ -10,7 +10,6 @@ Cluster and Node
Failed Decommission Problem </troubleshooting/failed-decommission/>
Cluster Timeouts </troubleshooting/timeouts>
Node Joined With No Data </troubleshooting/node-joined-without-any-data>
SocketTimeoutException </troubleshooting/nodetool-memory-read-timeout/>
NullPointerException </troubleshooting/nodetool-nullpointerexception/>
Failed Schema Sync </troubleshooting/failed-schema-sync/>
@@ -28,7 +27,6 @@ Cluster and Node
* :doc:`Failed Decommission Problem </troubleshooting/failed-decommission/>`
* :doc:`Cluster Timeouts </troubleshooting/timeouts>`
* :doc:`Node Joined With No Data </troubleshooting/node-joined-without-any-data>`
* :doc:`Nodetool fails with SocketTimeoutException 'Read timed out' </troubleshooting/nodetool-memory-read-timeout>`
* :doc:`Nodetool Throws NullPointerException </troubleshooting/nodetool-nullpointerexception>`
* :doc:`Failed Schema Sync </troubleshooting/failed-schema-sync>`

View File

@@ -1,112 +0,0 @@
Nodetool fails with SocketTimeoutException 'Read timed out'
===========================================================
This troubleshooting article describes what to do when Nodetool fails with a 'Read timed out' error.
Problem
^^^^^^^
When running any Nodetool command, users may see the following error:
.. code-block:: none
Failed to connect to '127.0.0.1:7199' - SocketTimeoutException: 'Read timed out'
Analysis
^^^^^^^^
Nodetool is a Java-based application which requires memory. ScyllaDB by default consumes 93% of the node's RAM (for MemTables + Cache) and leaves 7% for other applications, such as nodetool.
In cases where this is not enough memory (e.g. small instances with ~64GB RAM or lower), Nodetool may not be able to run due to insufficient memory. In this case an out of memory (OOM) error may appear and scylla-jmx will not run.
Example
-------
The error you will see is similar to:
.. code-block:: none
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000005c0000000,
671088640, 0) failed; error='Cannot allocate memory' (err no=12)
In order to check if the issue is scylla-jmx, use the following command (systemd-based Linux distribution) to check the status of the service:
.. code-block:: none
sudo systemctl status scylla-jmx
If the service is running you will see something similar to:
.. code-block:: none
sudo service scylla-jmx status
● scylla-jmx.service - ScyllaDB JMX
Loaded: loaded (/lib/systemd/system/scylla-jmx.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2018-07-18 20:59:08 UTC; 3s ago
Main PID: 256050 (scylla-jmx)
Tasks: 27
Memory: 119.5M
CPU: 1.959s
CGroup: /system.slice/scylla-jmx.service
└─256050 /usr/lib/scylla/jmx/symlinks/scylla-jmx -Xmx384m -XX:+UseSerialGC -Dcom.sun.management.jmxremote.auth
If it isn't, you will see an error similar to:
.. code-block:: none
sudo systemctl status scylla-jmx
● scylla-jmx.service - ScyllaDB JMX
Loaded: loaded (/usr/lib/systemd/system/scylla-jmx.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2018-05-10 10:34:15 EDT; 3min 47s ago
Process: 1417 ExecStart=/usr/lib/scylla/jmx/scylla-jmx $SCYLLA_JMX_PORT $SCYLLA_API_PORT $SCYLLA_API_ADDR $SCYLLA_JMX_ADDR
$SCYLLA_JMX_FILE $SCYLLA_JMX_LOCAL $SCYLLA_JMX_REMOTE $SCYLLA_JMX_DEBUG (code=exited, status=127)
Main PID: 1417 (code=exited, status=127)
or
.. code-block:: none
sudo service scylla-jmx status
● scylla-jmx.service
Loaded: not-found (Reason: No such file or directory)
Active: failed (Result: exit-code) since Wed 2018-07-18 20:38:58 UTC; 12min ago
Main PID: 141256 (code=exited, status=143)
You will need to restart the service or change the RAM allocation as per the Solution_ below.
Solution
^^^^^^^^
There are two ways to fix this problem, one is faster but may not permanently fix the issue and the other solution is more robust.
**The immediate solution**
.. code-block:: none
service scylla-jmx restart
.. note:: This is not a permanent fix as the problem might manifest again at a later time.
**The more robust solution**
1. Take the size of your node's RAM, calculate 7% of that size, increase it by another 40%, and use this new size as your RAM requirement.
For example: on a GCP n1-highmem-8 instance (52GB RAM)
* 7% would be ~3.6GB.
* Increasing it by ~40% means you need to increase your RAM ~5GB.
2. Open one of the following files (as per your OS platform):
* Ubuntu: ``/etc/default/scylla-server``.
* Red Hat/ CentOS: ``/etc/sysconfig/scylla-server``
3. In the file you are editing, add to the ``SCYLLA_ARGS`` statement ``--reserve-memory 5G`` (the amount you calculated above). Save and exit.
4. Restart ScyllaDB server
.. code-block:: none
sudo systemctl restart scylla-server
.. note:: If the initial calculation and reserve memory is not enough and problem persists and/or reappears, repeat the procedure from step 2 and increase the RAM in 1GB increments.
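
The arithmetic in step 1 can be written out as a short calculation. This is a sketch of the rule of thumb described above, not an official sizing tool; the function name and the rounding to whole gigabytes are assumptions.

```python
def reserve_memory_gb(total_ram_gb, default_reserve=0.07, headroom=0.40):
    """7% of RAM (the default left to the OS), increased by a further 40%."""
    return total_ram_gb * default_reserve * (1 + headroom)

ram_gb = 52  # GCP n1-highmem-8, as in the example above
needed = reserve_memory_gb(ram_gb)
print(f"{needed:.2f} GB -> --reserve-memory {round(needed)}G")
```

For 52GB this gives roughly 5.1GB, which matches the ``--reserve-memory 5G`` value used in step 3.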

View File

@@ -279,17 +279,12 @@ Once you have collected and compressed your reports, send them to ScyllaDB for a
curl -X PUT https://upload.scylladb.com/$report_uuid/yourfile -T yourfile
For example with the health check report and node health check report:
.. code-block:: shell
curl -X PUT https://upload.scylladb.com/$report_uuid/output_files.tgz -T output_files.tgz
For example with the Scylla Doctor's vitals:
.. code-block:: shell
curl -X PUT https://upload.scylladb.com/$report_uuid/192.0.2.0-health-check-report.txt -T 192.0.2.0-health-check-report.txt
curl -X PUT https://upload.scylladb.com/$report_uuid/my_cluster_123_vitals.tgz -T my_cluster_123_vitals.tgz
The **UUID** you generated replaces the variable ``$report_uuid`` at runtime. ``yourfile`` is any file you need to send to ScyllaDB support.

View File

@@ -21,8 +21,8 @@ The following metrics are new in ScyllaDB |NEW_VERSION|:
* - Metric
- Description
* -
-
* - scylla_alternator_batch_item_count
- The total number of items processed across all batches

View File

@@ -6,22 +6,14 @@ Upgrade from ScyllaDB Open Source to ScyllaDB Enterprise
:titlesonly:
:hidden:
ScyllaDB 6.0 to ScyllaDB Enterprise 2024.2 <upgrade-guide-from-6.0-to-2024.2/index>
ScyllaDB 5.4 to ScyllaDB Enterprise 2024.1 <upgrade-guide-from-5.4-to-2024.1/index>
ScyllaDB 5.2 to ScyllaDB Enterprise 2023.1 <upgrade-guide-from-5.2-to-2023.1/index>
.. raw:: html
<div class="panel callout radius animated">
<div class="row">
<div class="medium-3 columns">
<h5 id="getting-started">Upgrade to ScyllaDB Enterprise</h5>
</div>
<div class="medium-9 columns">
Procedures for upgrading from ScyllaDB Open Source to ScyllaDB Enterprise:
* :doc:`ScyllaDB 6.0 to ScyllaDB Enterprise 2024.2 </upgrade/upgrade-to-enterprise/upgrade-guide-from-6.0-to-2024.2/index>`
* :doc:`ScyllaDB 5.4 to ScyllaDB Enterprise 2024.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-5.4-to-2024.1/index>`
* :doc:`ScyllaDB 5.2 to ScyllaDB Enterprise 2023.1 </upgrade/upgrade-to-enterprise/upgrade-guide-from-5.2-to-2023.1/index>`

View File

@@ -162,54 +162,27 @@ Download and install the new release
.. group-tab:: EC2/GCP/Azure Ubuntu Image
Before upgrading, check what version you are running now using ``scylla --version``. You should use the same version as this version in case you want to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
Before upgrading, check what version you are running now using ``scylla --version``. You should use the same version as this version in case you want to |ROLLBACK|_ the upgrade. If you are not running a |SRC_VERSION|.x version, stop right here! This guide only covers |SRC_VERSION|.x to |NEW_VERSION|.y upgrades.
There are two alternative upgrade procedures: upgrading ScyllaDB and simultaneously updating 3rd party and OS packages - recommended if you
are running a ScyllaDB official image (EC2 AMI, GCP, and Azure images), which is based on Ubuntu 20.04, and upgrading ScyllaDB without updating
any external packages.
If you're using the ScyllaDB official image (recommended), see
the **Debian/Ubuntu** tab for upgrade instructions. If you're using your
own image and have installed ScyllaDB packages for Ubuntu or Debian,
you need to apply an extended upgrade procedure:
#. Update the ScyllaDB deb repo (see above).
#. Configure Java 1.8 (see above).
#. Install the new ScyllaDB version with the additional
``scylla-enterprise-machine-image`` package:
**To upgrade ScyllaDB and update 3rd party and OS packages (RECOMMENDED):**
Choosing this upgrade procedure allows you to upgrade your ScyllaDB version and update the 3rd party and OS packages using one command.
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Load the new repo:
.. code:: sh
sudo apt-get update
#. Run the following command to update the manifest file:
.. code:: sh
cat scylla-enterprise-packages-<version>-<arch>.txt | sudo xargs -n1 apt-get install -y
Where:
* ``<version>`` - The ScyllaDB Enterprise version to which you are upgrading ( |NEW_VERSION| ).
* ``<arch>`` - Architecture type: ``x86_64`` or ``aarch64``.
The file is included in the ScyllaDB Enterprise packages downloaded in the previous step. The file location is ``http://downloads.scylladb.com/downloads/scylla/aws/manifest/scylla-packages-<version>-<arch>.txt``
Example:
.. code:: sh
cat scylla-enterprise-packages-2022.2.0-x86_64.txt | sudo xargs -n1 apt-get install -y
.. note::
Alternatively, you can update the manifest file with the following command:
``sudo apt-get install $(awk '{print $1'} scylla-enterprise-packages-<version>-<arch>.txt) -y``
To upgrade ScyllaDB without updating any external packages, follow the :ref:`download and installation instructions for Debian/Ubuntu <upgrade-debian-ubuntu-5.2-to-enterprise-2023.1>`.
.. code::
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla-enterprise
sudo apt-get dist-upgrade scylla-enterprise-machine-image
#. Run ``scylla_setup`` without running ``io_setup``.
#. Run ``sudo /opt/scylladb/scylla-machine-image/scylla_cloud_io_setup``.
Start the node
--------------

View File

@@ -0,0 +1,15 @@
======================================================
Upgrade - ScyllaDB 6.0 to ScyllaDB Enterprise 2024.2
======================================================
.. toctree::
:maxdepth: 2
:hidden:
ScyllaDB <upgrade-guide-from-6.0-to-2024.2-generic>
Metrics <metric-update-6.0-to-2024.2>
* :doc:`Upgrade ScyllaDB from 6.0.x to 2024.2.y <upgrade-guide-from-6.0-to-2024.2-generic>`
* :doc:`ScyllaDB Metrics Update - ScyllaDB 6.0 to 2024.2 <metric-update-6.0-to-2024.2>`

View File

@@ -0,0 +1,41 @@
.. |SRC_VERSION| replace:: 6.0
.. |NEW_VERSION| replace:: 2024.2
=======================================================================================
ScyllaDB Metric Update - ScyllaDB |SRC_VERSION| to ScyllaDB Enterprise |NEW_VERSION|
=======================================================================================
ScyllaDB Enterprise |NEW_VERSION| Dashboards are available as part of the latest |mon_root|.
New Metrics
------------
The following metrics are new in ScyllaDB |NEW_VERSION|:
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric
- Description
* - scylla_rpc_compression_bytes_received
- Bytes read from RPC connections (after decompression).
* - scylla_rpc_compression_bytes_sent
- Bytes written to RPC connections (before compression).
* - scylla_rpc_compression_compressed_bytes_received
- Bytes read from RPC connections (before decompression).
* - scylla_rpc_compression_compressed_bytes_sent
- Bytes written to RPC connections (after compression).
* - scylla_rpc_compression_compression_cpu_nanos
- Nanoseconds spent on compression.
* - scylla_rpc_compression_decompression_cpu_nanos
- Nanoseconds spent on decompression.
* - scylla_rpc_compression_messages_received
- RPC messages received.
* - scylla_rpc_compression_messages_sent
- RPC messages sent.


@@ -0,0 +1,391 @@
.. |SCYLLA_NAME| replace:: ScyllaDB
.. |SRC_VERSION| replace:: 6.0
.. |NEW_VERSION| replace:: 2024.2
.. |DEBIAN_SRC_REPO| replace:: Debian
.. _DEBIAN_SRC_REPO: https://www.scylladb.com/download/?platform=debian-11&version=scylla-6.0
.. |UBUNTU_SRC_REPO| replace:: Ubuntu
.. _UBUNTU_SRC_REPO: https://www.scylladb.com/download/?platform=ubuntu-22.04&version=scylla-6.0
.. |SCYLLA_DEB_SRC_REPO| replace:: ScyllaDB deb repo (|DEBIAN_SRC_REPO|_, |UBUNTU_SRC_REPO|_)
.. |SCYLLA_RPM_SRC_REPO| replace:: ScyllaDB rpm repo
.. _SCYLLA_RPM_SRC_REPO: https://www.scylladb.com/download/?platform=centos&version=scylla-6.0
.. |DEBIAN_NEW_REPO| replace:: Debian
.. _DEBIAN_NEW_REPO: https://www.scylladb.com/customer-portal/?product=ent&platform=debian-11&version=stable-release-2024.2
.. |UBUNTU_NEW_REPO| replace:: Ubuntu
.. _UBUNTU_NEW_REPO: https://www.scylladb.com/customer-portal/?product=ent&platform=ubuntu-22.04&version=stable-release-2024.2
.. |SCYLLA_DEB_NEW_REPO| replace:: ScyllaDB deb repo (|DEBIAN_NEW_REPO|_, |UBUNTU_NEW_REPO|_)
.. |SCYLLA_RPM_NEW_REPO| replace:: ScyllaDB rpm repo
.. _SCYLLA_RPM_NEW_REPO: https://www.scylladb.com/customer-portal/?product=ent&platform=centos7&version=stable-release-2024.2
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: ./#rollback-procedure
.. |SCYLLA_METRICS| replace:: ScyllaDB Metrics Update - ScyllaDB 6.0 to ScyllaDB Enterprise 2024.2
.. _SCYLLA_METRICS: ../metric-update-6.0-to-2024.2
=============================================================================
Upgrade Guide - |SCYLLA_NAME| |SRC_VERSION| to |NEW_VERSION|
=============================================================================
This document is a step-by-step procedure for upgrading from |SCYLLA_NAME| |SRC_VERSION|
to |SCYLLA_NAME| Enterprise |NEW_VERSION|, and rollback to version |SRC_VERSION| if required.
This guide covers upgrading ScyllaDB on Red Hat Enterprise Linux (RHEL), CentOS, Debian,
and Ubuntu. See :doc:`OS Support by Platform and Version </getting-started/os-support>`
for information about supported versions.
This guide also applies when you're upgrading the ScyllaDB Enterprise official image on EC2,
GCP, or Azure.
Before You Upgrade ScyllaDB
================================
**Upgrade Your Driver**
If you're using a :doc:`ScyllaDB driver </using-scylla/drivers/cql-drivers/index>`,
upgrade the driver before you upgrade ScyllaDB. The latest two versions of each driver
are supported.
**Upgrade ScyllaDB Monitoring Stack**
If you're using the ScyllaDB Monitoring Stack, verify that your Monitoring Stack
version supports the ScyllaDB version to which you want to upgrade. See
`ScyllaDB Monitoring Stack Support Matrix <https://monitoring.docs.scylladb.com/stable/reference/matrix.html>`_.
We recommend upgrading the Monitoring Stack to the latest version.
**Check Feature Updates**
See the ScyllaDB Release Notes for the latest updates. The Release Notes are published
at the `ScyllaDB Community Forum <https://forum.scylladb.com/>`_.
.. note::
Unlike ScyllaDB 6.0, ScyllaDB Enterprise 2024.2 has **tablets disabled by
default**. This means that after you upgrade to 2024.2:
* Keyspaces that had tablets enabled in 6.0 will continue to work with tablets.
* Keyspaces created with default settings after upgrading to 2024.2 will have
tablets disabled.
To use tablets, create a new keyspace with the ``tablets = { 'enabled': true }``
option. For example:
.. code::
CREATE KEYSPACE my_keyspace
WITH replication = {
'class': 'NetworkTopologyStrategy',
'replication_factor': 3
} AND tablets = {
'enabled': true
};
All tables created in this keyspace will use tablets.
Note that ``NetworkTopologyStrategy`` is required when tablets are enabled.
See :doc:`Data Distribution with Tablets </architecture/tablets/>` for more information
about tablets.
Upgrade Procedure
=================
A ScyllaDB upgrade is a rolling procedure that does **not** require full cluster shutdown.
For each of the nodes in the cluster, you will:
* Check that the cluster's schema is synchronized
* Drain the node and backup the data
* Backup the configuration file
* Stop ScyllaDB
* Download and install new ScyllaDB packages
* Start ScyllaDB
* Validate that the upgrade was successful
.. caution::
Apply the procedure **serially** on each node. Do not move to the next node before
validating that the node you upgraded is up and running the new version.
**During** the rolling upgrade, it is highly recommended:
* Not to use the new |NEW_VERSION| features.
* Not to run administration functions, like repairs, refresh, rebuild, or add or remove
nodes. See `sctool <https://manager.docs.scylladb.com/stable/sctool/>`_ for suspending
ScyllaDB Manager's scheduled or running repairs.
* Not to apply schema changes.
Upgrade Steps
=============
Check the cluster schema
-------------------------
Make sure that all nodes have the schema synchronized before upgrade. The upgrade
procedure will fail if there is a schema disagreement between nodes.
.. code:: sh
nodetool describecluster
Drain the nodes and backup the data
-----------------------------------
Before any major procedure, like an upgrade, it is recommended to back up all
the data to an external device. In ScyllaDB, you can back up the data using
the ``nodetool snapshot`` command. For **each** node in the cluster, run
the following commands:
.. code:: sh
nodetool drain
nodetool snapshot
Take note of the directory name that nodetool gives you, and copy all the directories
having that name under ``/var/lib/scylla`` to a backup device.
When the upgrade is completed on all nodes, remove the snapshot with the
``nodetool clearsnapshot -t <snapshot>`` command to prevent running out of space.
Backup the configuration file
------------------------------
.. code:: sh
sudo cp -a /etc/scylla/scylla.yaml /etc/scylla/scylla.yaml.backup-src
Gracefully stop the node
------------------------
.. code:: sh
sudo service scylla-server stop
Download and install the new release
------------------------------------
Before upgrading, check which version you are currently running with ``scylla --version``.
Make a note of it: if you need to |ROLLBACK|_ the upgrade, you will have to reinstall
that same version.
.. tabs::
.. group-tab:: Debian/Ubuntu
#. Update the |SCYLLA_DEB_NEW_REPO| to |NEW_VERSION|.
#. Configure Java 1.8:
.. code-block:: console
sudo apt-get update
sudo apt-get install -y openjdk-8-jre-headless
sudo update-java-alternatives -s java-1.8.0-openjdk-amd64
#. Install the new ScyllaDB version:
.. code-block:: console
sudo apt-get clean all
sudo apt-get update
sudo apt-get remove scylla\*
sudo apt-get install scylla-enterprise
sudo systemctl daemon-reload
Answer y to the first two questions.
.. group-tab:: RHEL/CentOS
#. Update the |SCYLLA_RPM_NEW_REPO|_ to |NEW_VERSION|.
#. Install the new ScyllaDB version:
.. code:: sh
sudo yum clean all
sudo rm -rf /var/cache/yum
sudo yum remove scylla\*
sudo yum install scylla-enterprise
.. group-tab:: EC2/GCP/Azure Ubuntu Image
If you're using the ScyllaDB official image (recommended), see
the **Debian/Ubuntu** tab for upgrade instructions. If you're using your
own image and have installed ScyllaDB packages for Ubuntu or Debian,
you need to apply an extended upgrade procedure:
#. Update the ScyllaDB deb repo (see above).
#. Configure Java 1.8 (see above).
#. Install the new ScyllaDB version with the additional
``scylla-enterprise-machine-image`` package:
.. code::
sudo apt-get clean all
sudo apt-get update
sudo apt-get dist-upgrade scylla-enterprise
sudo apt-get dist-upgrade scylla-enterprise-machine-image
#. Run ``scylla_setup`` without running ``io_setup``.
#. Run ``sudo /opt/scylladb/scylla-machine-image/scylla_cloud_io_setup``.
Start the node
--------------
.. code:: sh
sudo service scylla-server start
Validate
--------
#. Check cluster status with ``nodetool status`` and make sure **all** nodes, including
the one you just upgraded, are in ``UN`` status.
#. Use ``curl -X GET "http://localhost:10000/storage_service/scylla_release_version"``
to check the ScyllaDB version. Validate that the version matches the one you upgraded to.
#. Check scylla-server log (using ``journalctl _COMM=scylla``) and ``/var/log/syslog``
to validate there are no new errors in the log.
#. Check again after two minutes to validate that no new issues are introduced.
Once you are sure the node upgrade was successful, move to the next node in the cluster.
Rollback Procedure
==================
.. warning::
The rollback procedure can only be applied if some nodes have **not** been upgraded
to |NEW_VERSION| yet. As soon as the last node in the rolling upgrade procedure is
started with |NEW_VERSION|, rollback becomes impossible. At that point, the only way
to restore a cluster to |SRC_VERSION| is by restoring it from backup.
The following procedure describes a rollback from |SCYLLA_NAME| |NEW_VERSION|.x to
|SRC_VERSION|.y. Apply this procedure if an upgrade from |SRC_VERSION| to |NEW_VERSION|
failed before completing on all nodes.
* Use this procedure only for nodes you upgraded to |NEW_VERSION|.
* Execute the commands one node at a time, moving to the next node
only after the rollback procedure is completed successfully.
ScyllaDB rollback is a rolling procedure that does **not** require a full cluster shutdown.
For each of the nodes you rollback to |SRC_VERSION|, you will:
* Drain the node and stop ScyllaDB
* Retrieve the old ScyllaDB packages
* Restore the configuration file
* Restore system tables
* Reload systemd configuration
* Restart ScyllaDB
* Validate the rollback success
Apply the procedure **serially** on each node. Do not move to the next node
before validating that the rollback was successful and the node is up and
running the old version.
Rollback Steps
==============
Drain and gracefully stop the node
----------------------------------
.. code:: sh
nodetool drain
sudo service scylla-server stop
Download and install the old release
------------------------------------
.. tabs::
.. group-tab:: Debian/Ubuntu
#. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/apt/sources.list.d/scylla.list
#. Update the |SCYLLA_DEB_SRC_REPO| to |SRC_VERSION|.
#. Install:
.. code-block::
sudo apt-get update
sudo apt-get remove scylla\* -y
sudo apt-get install scylla
Answer y to the first two questions.
.. group-tab:: RHEL/CentOS
#. Remove the old repo file.
.. code:: sh
sudo rm -rf /etc/yum.repos.d/scylla.repo
#. Update the |SCYLLA_RPM_SRC_REPO|_ to |SRC_VERSION|.
#. Install:
.. code:: console
sudo yum clean all
sudo yum remove scylla\*
sudo yum install scylla
.. note::
If you are running a ScyllaDB Enterprise official image (for EC2 AMI, GCP, or Azure), follow the instructions for Ubuntu.
Restore the configuration file
------------------------------
.. code:: sh
sudo rm -rf /etc/scylla/scylla.yaml
sudo cp -a /etc/scylla/scylla.yaml.backup-src /etc/scylla/scylla.yaml
Restore system tables
---------------------
Restore all tables of **system** and **system_schema** from the previous snapshot because
|NEW_VERSION| uses a different set of system tables.
See :doc:`Restore from a Backup and Incremental Backup </operating-scylla/procedures/backup-restore/restore/>`
for reference.
.. code:: console
cd /var/lib/scylla/data/keyspace_name/table_name-UUID/
sudo find . -maxdepth 1 -type f -exec sudo rm -f "{}" +
cd /var/lib/scylla/data/keyspace_name/table_name-UUID/snapshots/<snapshot_name>/
sudo cp -r * /var/lib/scylla/data/keyspace_name/table_name-UUID/
sudo chown -R scylla:scylla /var/lib/scylla/data/keyspace_name/table_name-UUID/
Reload systemd configuration
----------------------------
Reload the systemd configuration if the systemd unit file has changed.
.. code:: sh
sudo systemctl daemon-reload
Start the node
--------------
.. code:: sh
sudo service scylla-server start
Validate
--------
Check the upgrade instructions above for validation. Once you are sure the node rollback
is successful, move to the next node in the cluster.


@@ -38,12 +38,13 @@ connection::~connection()
_server._connections_list.erase(iter);
}
future<> server::for_each_gently(noncopyable_function<future<>(connection&)> fn) {
future<> server::for_each_gently(noncopyable_function<void(connection&)> fn) {
_gentle_iterators.emplace_front(*this);
std::list<gentle_iterator>::iterator gi = _gentle_iterators.begin();
return seastar::do_until([ gi ] { return gi->iter == gi->end; },
[ gi, fn = std::move(fn) ] {
return fn(*(gi->iter++));
fn(*(gi->iter++));
return make_ready_future<>();
}
).finally([ this, gi ] { _gentle_iterators.erase(gi); });
}


@@ -118,7 +118,7 @@ protected:
virtual future<> unadvertise_connection(shared_ptr<connection> conn);
future<> for_each_gently(noncopyable_function<future<>(connection&)>);
future<> for_each_gently(noncopyable_function<void(connection&)>);
};
}


@@ -143,6 +143,7 @@ public:
// whereas without it, it will fail the insert - i.e. for things like raft etc _all_ nodes should
// have it or none, otherwise we can get partial failures on writes.
gms::feature fragmented_commitlog_entries { *this, "FRAGMENTED_COMMITLOG_ENTRIES"sv };
gms::feature maintenance_tenant { *this, "MAINTENANCE_TENANT"sv };
// A feature just for use in tests. It must not be advertised unless
// the "features_enable_test_feature" injection is enabled.


@@ -356,31 +356,30 @@ future<> gossiper::handle_ack_msg(msg_addr id, gossip_digest_ack ack_msg) {
}
future<> gossiper::do_send_ack2_msg(msg_addr from, utils::chunked_vector<gossip_digest> ack_msg_digest) {
return futurize_invoke([this, from, ack_msg_digest = std::move(ack_msg_digest)] () mutable {
/* Get the state required to send to this gossipee - construct GossipDigestAck2Message */
std::map<inet_address, endpoint_state> delta_ep_state_map;
for (auto g_digest : ack_msg_digest) {
inet_address addr = g_digest.get_endpoint();
const auto es = get_endpoint_state_ptr(addr);
if (!es || es->get_heart_beat_state().get_generation() < g_digest.get_generation()) {
continue;
}
// Local generation for addr may have been increased since the
// current node sent an initial SYN. Comparing versions across
// different generations in get_state_for_version_bigger_than
// could result in losing some app states with smaller versions.
const auto version = es->get_heart_beat_state().get_generation() > g_digest.get_generation()
? version_type(0)
: g_digest.get_max_version();
auto local_ep_state_ptr = this->get_state_for_version_bigger_than(addr, version);
if (local_ep_state_ptr) {
delta_ep_state_map.emplace(addr, *local_ep_state_ptr);
}
/* Get the state required to send to this gossipee - construct GossipDigestAck2Message */
std::map<inet_address, endpoint_state> delta_ep_state_map;
for (auto g_digest : ack_msg_digest) {
inet_address addr = g_digest.get_endpoint();
const auto es = get_endpoint_state_ptr(addr);
if (!es || es->get_heart_beat_state().get_generation() < g_digest.get_generation()) {
continue;
}
gms::gossip_digest_ack2 ack2_msg(std::move(delta_ep_state_map));
logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
return ser::gossip_rpc_verbs::send_gossip_digest_ack2(&_messaging, from, std::move(ack2_msg));
});
// Local generation for addr may have been increased since the
// current node sent an initial SYN. Comparing versions across
// different generations in get_state_for_version_bigger_than
// could result in losing some app states with smaller versions.
const auto version = es->get_heart_beat_state().get_generation() > g_digest.get_generation()
? version_type(0)
: g_digest.get_max_version();
auto local_ep_state_ptr = get_state_for_version_bigger_than(addr, version);
if (local_ep_state_ptr) {
delta_ep_state_map.emplace(addr, *local_ep_state_ptr);
}
}
gms::gossip_digest_ack2 ack2_msg(std::move(delta_ep_state_map));
logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
co_await ser::gossip_rpc_verbs::send_gossip_digest_ack2(&_messaging, from, std::move(ack2_msg));
logger.debug("finished do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
}
// Depends on
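The generation check in `do_send_ack2_msg` reduces to a three-way decision. The following is a minimal sketch of that decision only, with plain integers standing in for the generation and version types. The function name `ack2_version_threshold` is hypothetical and is not part of the gossiper.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Returns the version above which local application states should be sent,
// or std::nullopt when the endpoint should be skipped entirely.
//  - local generation older than the digest's: nothing newer to send, skip.
//  - local generation newer than the digest's: versions are not comparable
//    across generations, so send everything (threshold 0).
//  - same generation: send only states newer than the digest's max version.
inline std::optional<std::int64_t> ack2_version_threshold(std::int64_t local_generation,
                                                          std::int64_t digest_generation,
                                                          std::int64_t digest_max_version) {
    if (local_generation < digest_generation) {
        return std::nullopt;
    }
    if (local_generation > digest_generation) {
        return std::int64_t{0};
    }
    return digest_max_version;
}
```

Resetting the threshold to 0 when the generations differ is what prevents application states with small version numbers from being dropped after a restart.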
@@ -683,6 +682,10 @@ future<> gossiper::apply_state_locally(std::map<inet_address, endpoint_state> ma
// If there is no host id in the new state there should be one locally
hid = get_host_id(ep);
}
if (hid == my_host_id()) {
logger.trace("Ignoring gossip for {} because it maps to local id, but is not local address", ep);
return make_ready_future<>();
}
if (_topo_sm->_topology.left_nodes.contains(raft::server_id(hid.uuid()))) {
logger.trace("Ignoring gossip for {} because it left", ep);
return make_ready_future<>();
@@ -2354,7 +2357,15 @@ future<> gossiper::do_stop_gossiping() {
// Take the semaphore makes sure existing gossip loop is finished
auto units = co_await get_units(_callback_running, 1);
co_await container().invoke_on_all([] (auto& g) {
return std::move(g._failure_detector_loop_done);
// #21159
// gossiper::shutdown can be called from more than one place - both
// storage_service::isolate and normal gossip service stop. The former is
// waited for in storage_service::stop, but if we, as was done in cql_test_env,
// call shutdown independently, we could still end up here twice, and not hit
// the _enabled guard (because we do waiting things before setting it, and setting it
// is also waiting). However, making sure we don't leave an invalid future
// here should ensure even if we reenter this method in such a way, we don't crash.
return std::exchange(g._failure_detector_loop_done, make_ready_future<>());
});
logger.info("Gossip is now stopped");
}
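The `std::exchange` change in `do_stop_gossiping` can be illustrated without Seastar. The sketch below is not the gossiper code: `one_shot` is a hypothetical stand-in for a future whose moved-from state is invalid, and `loop_owner` is a hypothetical holder. It shows why taking the member with `std::move` leaves an unusable object for a second caller, while `std::exchange` leaves a fresh valid one behind.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a future: it may be consumed only while valid(),
// and moving from it leaves the source invalid.
class one_shot {
    bool _valid = true;
public:
    one_shot() = default;
    one_shot(one_shot&& other) noexcept : _valid(std::exchange(other._valid, false)) {}
    one_shot& operator=(one_shot&& other) noexcept {
        _valid = std::exchange(other._valid, false);
        return *this;
    }
    bool valid() const { return _valid; }
};

struct loop_owner {
    one_shot loop_done;

    // Old shape: the member is left in the moved-from (invalid) state.
    one_shot take_with_move() { return std::move(loop_done); }

    // New shape: the member is replaced by a fresh valid object, so a caller
    // that reaches this point a second time still gets something usable.
    one_shot take_with_exchange() { return std::exchange(loop_done, one_shot{}); }
};
```

Calling `take_with_exchange()` twice yields two valid objects, which corresponds to the re-entrancy case described in the comment above.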


@@ -333,8 +333,10 @@ public:
void set_topology_state_machine(service::topology_state_machine* m) {
_topo_sm = m;
// In raft topology mode the coordinator maintains banned nodes list
_just_removed_endpoints.clear();
if (m) {
// In raft topology mode the coordinator maintains banned nodes list
_just_removed_endpoints.clear();
}
}
private:


@@ -574,7 +574,7 @@ PYSCRIPTS=$(find dist/common/scripts/ -maxdepth 1 -type f -exec grep -Pls '\A#!/
for i in $PYSCRIPTS; do
relocate_python3 "$rprefix"/scripts "$i"
done
for i in seastar/scripts/perftune.py seastar/scripts/seastar-addr2line; do
for i in seastar/scripts/{perftune.py,addr2line.py,seastar-addr2line}; do
relocate_python3 "$rprefix"/scripts "$i"
done
relocate_python3 "$rprefix"/scyllatop tools/scyllatop/scyllatop.py


@@ -39,7 +39,11 @@ abstract_replication_strategy::abstract_replication_strategy(
replication_strategy_params params,
replication_strategy_type my_type)
: _config_options(params.options)
, _my_type(my_type) {}
, _my_type(my_type) {
if (params.initial_tablets.has_value()) {
_uses_tablets = true;
}
}
abstract_replication_strategy::ptr_type abstract_replication_strategy::create_replication_strategy(const sstring& strategy_name, replication_strategy_params params) {
try {
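Taken together with the `validate_options` changes in the strategy files of this diff, the constructor change means the base class records whether tablets were requested, and each strategy that cannot use tablets rejects the request during validation. A minimal sketch of that split follows, assuming simplified hypothetical types (`strategy_params`, `strategy_base`, `vnode_only_strategy`) and `std::invalid_argument` in place of `exceptions::configuration_exception`:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>

// Hypothetical, reduced parameter set: only the field this sketch needs.
struct strategy_params {
    std::optional<std::size_t> initial_tablets;
};

class strategy_base {
protected:
    bool _uses_tablets = false;
public:
    // The flag is derived from the presence of initial_tablets, not its value,
    // so an explicit value of 0 still counts as "tablets requested".
    explicit strategy_base(const strategy_params& p)
        : _uses_tablets(p.initial_tablets.has_value()) {}
    virtual ~strategy_base() = default;
    virtual void validate_options() const = 0;
    bool uses_tablets() const { return _uses_tablets; }
};

// Stand-in for a strategy that supports vnodes only.
class vnode_only_strategy final : public strategy_base {
public:
    using strategy_base::strategy_base;
    void validate_options() const override {
        if (_uses_tablets) {
            throw std::invalid_argument("this strategy doesn't support tablet replication");
        }
    }
};
```

Setting the flag in the base constructor means every strategy sees it during validation, including those that never go through tablet-specific option processing.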


@@ -67,6 +67,7 @@ class vnode_effective_replication_map;
class effective_replication_map_factory;
class per_table_replication_strategy;
class tablet_aware_replication_strategy;
class effective_replication_map;
class abstract_replication_strategy : public seastar::enable_shared_from_this<abstract_replication_strategy> {
@@ -98,6 +99,9 @@ protected:
public:
using ptr_type = seastar::shared_ptr<abstract_replication_strategy>;
// Check that the read replica set does not exceed what's allowed by the schema.
[[nodiscard]] virtual sstring sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const = 0;
abstract_replication_strategy(
replication_strategy_params params,
replication_strategy_type my_type);


@@ -12,6 +12,7 @@
#include "locator/everywhere_replication_strategy.hh"
#include "utils/class_registrator.hh"
#include "locator/token_metadata.hh"
#include "exceptions/exceptions.hh"
namespace locator {
@@ -33,6 +34,21 @@ size_t everywhere_replication_strategy::get_replication_factor(const token_metad
return tm.sorted_tokens().empty() ? 1 : tm.count_normal_token_owners();
}
void everywhere_replication_strategy::validate_options(const gms::feature_service&) const {
if (_uses_tablets) {
throw exceptions::configuration_exception("EverywhereStrategy doesn't support tablet replication");
}
}
sstring everywhere_replication_strategy::sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const {
const auto replication_factor = erm.get_replication_factor();
if (read_replicas.size() > replication_factor) {
return seastar::format("everywhere_replication_strategy: the number of replicas for everywhere_replication_strategy is {}, cannot be higher than replication factor {}", read_replicas.size(), replication_factor);
}
return {};
}
using registry = class_registrator<abstract_replication_strategy, everywhere_replication_strategy, replication_strategy_params>;
static registry registrator("org.apache.cassandra.locator.EverywhereStrategy");
static registry registrator_short_name("EverywhereStrategy");


@@ -20,7 +20,7 @@ public:
virtual future<host_id_set> calculate_natural_endpoints(const token& search_token, const token_metadata& tm) const override;
virtual void validate_options(const gms::feature_service&) const override { /* noop */ }
virtual void validate_options(const gms::feature_service&) const override;
std::optional<std::unordered_set<sstring>> recognized_options(const topology&) const override {
// We explicitly allow all options
@@ -32,5 +32,7 @@ public:
virtual bool allow_remove_node_being_replaced_from_natural_endpoints() const override {
return true;
}
[[nodiscard]] sstring sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const override;
};
}


@@ -9,6 +9,7 @@
#include <algorithm>
#include "local_strategy.hh"
#include "utils/class_registrator.hh"
#include "exceptions/exceptions.hh"
namespace locator {
@@ -23,6 +24,9 @@ future<host_id_set> local_strategy::calculate_natural_endpoints(const token& t,
}
void local_strategy::validate_options(const gms::feature_service&) const {
if (_uses_tablets) {
throw exceptions::configuration_exception("LocalStrategy doesn't support tablet replication");
}
}
std::optional<std::unordered_set<sstring>> local_strategy::recognized_options(const topology&) const {
@@ -34,6 +38,13 @@ size_t local_strategy::get_replication_factor(const token_metadata&) const {
return 1;
}
sstring local_strategy::sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const {
if (read_replicas.size() > 1) {
return seastar::format("local_strategy: the number of replicas for local_strategy is {}, cannot be higher than 1", read_replicas.size());
}
return {};
}
using registry = class_registrator<abstract_replication_strategy, local_strategy, replication_strategy_params>;
static registry registrator("org.apache.cassandra.locator.LocalStrategy");
static registry registrator_short_name("LocalStrategy");


@@ -35,6 +35,8 @@ public:
virtual bool allow_remove_node_being_replaced_from_natural_endpoints() const override {
return false;
}
[[nodiscard]] sstring sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const override;
};
}


@@ -19,6 +19,8 @@
#include "locator/network_topology_strategy.hh"
#include "locator/load_sketch.hh"
#include <absl/container/flat_hash_map.h>
#include <boost/algorithm/string.hpp>
#include <boost/range/adaptors.hpp>
#include "exceptions/exceptions.hh"
@@ -554,6 +556,36 @@ tablet_replica_set network_topology_strategy::drop_tablets_in_dc(schema_ptr s, c
return filtered;
}
sstring network_topology_strategy::sanity_check_read_replicas(const effective_replication_map& erm,
const inet_address_vector_replica_set& read_replicas) const {
const auto& topology = erm.get_topology();
struct rf_node_count {
size_t replication_factor{0};
size_t node_count{0};
};
absl::flat_hash_map<sstring, rf_node_count> data_centers_replication_factor;
std::ranges::for_each(read_replicas, [&data_centers_replication_factor, &topology, this](const auto& node) {
auto res = data_centers_replication_factor.emplace(topology.get_datacenter(node), rf_node_count{0, 0});
if (res.second) {
// For new item add replication factor.
res.first->second.replication_factor = get_replication_factor(res.first->first);
}
++res.first->second.node_count;
});
for (const auto& [key, item] : data_centers_replication_factor) {
if (item.replication_factor < item.node_count) {
return seastar::format("network_topology_strategy: ERM inconsistency, Datacenter [{}] has higher count of read replicas (accounting for "
"current consistency level): [{}] than its replication factor [{}]",
key, item.node_count, item.replication_factor);
}
}
return {};
}
using registry = class_registrator<abstract_replication_strategy, network_topology_strategy, replication_strategy_params>;
static registry registrator("org.apache.cassandra.locator.NetworkTopologyStrategy");
static registry registrator_short_name("NetworkTopologyStrategy");
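The per-datacenter check in `network_topology_strategy::sanity_check_read_replicas` counts read replicas per datacenter and compares each count with that datacenter's replication factor. A self-contained sketch of the same idea follows, using standard containers instead of the topology and ERM types. The names `dc_of_node`, `rf_of_dc`, and `check_read_replicas` are hypothetical, and the message text is illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical inputs: node name -> datacenter, datacenter -> replication factor.
using dc_of_node = std::map<std::string, std::string>;
using rf_of_dc = std::map<std::string, std::size_t>;

// Returns an empty string when every datacenter has at most RF read replicas,
// otherwise a message naming the first offending datacenter (in key order).
// Throws std::out_of_range if a replica is missing from the node map.
inline std::string check_read_replicas(const std::vector<std::string>& read_replicas,
                                       const dc_of_node& dcs,
                                       const rf_of_dc& rfs) {
    std::map<std::string, std::size_t> count_per_dc;
    for (const auto& node : read_replicas) {
        ++count_per_dc[dcs.at(node)];
    }
    for (const auto& [dc, count] : count_per_dc) {
        const auto it = rfs.find(dc);
        // A datacenter without a configured factor is treated as RF 0.
        const std::size_t rf = (it == rfs.end()) ? 0 : it->second;
        if (rf < count) {
            return "datacenter " + dc + " has " + std::to_string(count) +
                   " read replicas, replication factor is " + std::to_string(rf);
        }
    }
    return {};
}
```

As in the diff, an empty result means the replica set is consistent, so callers can test the returned string with `empty()`.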


@@ -42,6 +42,8 @@ public:
return true;
}
[[nodiscard]] sstring sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const override;
public: // tablet_aware_replication_strategy
virtual effective_replication_map_ptr make_replication_map(table_id, token_metadata_ptr) const override;
virtual future<tablet_map> allocate_tablets_for_new_table(schema_ptr, token_metadata_ptr, unsigned initial_scale) const override;


@@ -70,12 +70,25 @@ void simple_strategy::validate_options(const gms::feature_service&) const {
throw exceptions::configuration_exception("SimpleStrategy requires a replication_factor strategy option.");
}
parse_replication_factor(it->second);
if (_uses_tablets) {
throw exceptions::configuration_exception("SimpleStrategy doesn't support tablet replication");
}
}
std::optional<std::unordered_set<sstring>>simple_strategy::recognized_options(const topology&) const {
return {{ "replication_factor" }};
}
sstring simple_strategy::sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const {
if (read_replicas.size() > _replication_factor) {
return seastar::format("ERM inconsistency, the read replica set for simple strategy has higher count of"
" read replicas [{}] than its replication factor [{}]",
read_replicas.size(),
_replication_factor);
}
return {};
}
using registry = class_registrator<abstract_replication_strategy, simple_strategy, replication_strategy_params>;
static registry registrator("org.apache.cassandra.locator.SimpleStrategy");
static registry registrator_short_name("SimpleStrategy");


@@ -26,6 +26,8 @@ public:
}
virtual future<host_id_set> calculate_natural_endpoints(const token& search_token, const token_metadata& tm) const override;
[[nodiscard]] sstring sanity_check_read_replicas(const effective_replication_map& erm, const inet_address_vector_replica_set& read_replicas) const override;
private:
size_t _replication_factor = 1;
};


@@ -200,9 +200,6 @@ future<> tablet_metadata::mutate_tablet_map_async(table_id id, noncopyable_funct
}
future<tablet_metadata> tablet_metadata::copy() const {
if (_tablets.empty()) {
co_return tablet_metadata{};
}
tablet_metadata copy;
for (const auto& e : _tablets) {
copy._tablets.emplace(e.first, co_await e.second.copy());
@@ -851,9 +848,8 @@ void tablet_aware_replication_strategy::validate_tablet_options(const abstract_r
void tablet_aware_replication_strategy::process_tablet_options(abstract_replication_strategy& ars,
replication_strategy_config_options& opts,
replication_strategy_params params) {
if (params.initial_tablets.has_value()) {
_initial_tablets = *params.initial_tablets;
ars._uses_tablets = true;
if (ars._uses_tablets) {
_initial_tablets = params.initial_tablets.value_or(0);
mark_as_per_table(ars);
}
}


@@ -164,7 +164,7 @@ public:
inet_address get_endpoint_for_host_id(host_id) const;
/** @return a copy of the endpoint-to-id map for read-only operations */
std::unordered_map<inet_address, host_id> get_endpoint_to_host_id_map_for_reading() const;
std::unordered_map<inet_address, host_id> get_endpoint_to_host_id_map() const;
void add_bootstrap_token(token t, host_id endpoint);
@@ -565,19 +565,18 @@ inet_address token_metadata_impl::get_endpoint_for_host_id(host_id host_id) cons
}
}
std::unordered_map<inet_address, host_id> token_metadata_impl::get_endpoint_to_host_id_map_for_reading() const {
std::unordered_map<inet_address, host_id> token_metadata_impl::get_endpoint_to_host_id_map() const {
const auto& nodes = _topology.get_nodes_by_endpoint();
std::unordered_map<inet_address, host_id> map;
map.reserve(nodes.size());
for (const auto& [endpoint, node] : nodes) {
// Restrict to members
if (!node->is_member()) {
if (node->left() || node->is_none()) {
continue;
}
if (const auto& host_id = node->host_id()) {
map[endpoint] = host_id;
} else {
tlogger.info("get_endpoint_to_host_id_map_for_reading: endpoint {} has null host_id: state={}", endpoint, node->get_state());
tlogger.info("get_endpoint_to_host_id_map: endpoint {} has null host_id: state={}", endpoint, node->get_state());
}
}
return map;
@@ -1044,8 +1043,8 @@ token_metadata::get_endpoint_for_host_id(host_id host_id) const {
}
std::unordered_map<inet_address, host_id>
token_metadata::get_endpoint_to_host_id_map_for_reading() const {
return _impl->get_endpoint_to_host_id_map_for_reading();
token_metadata::get_endpoint_to_host_id_map() const {
return _impl->get_endpoint_to_host_id_map();
}
void
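The filter in `get_endpoint_to_host_id_map` now skips only nodes that have left or have no state, instead of restricting the map to members. The sketch below shows that filtering rule on simplified data. The `node_state` enum, the `node_info` struct, and the use of an empty string for a null host ID are assumptions made for illustration and do not mirror the real topology types.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, reduced set of node states.
enum class node_state { none, joining, normal, left };

struct node_info {
    std::string endpoint;
    std::string host_id;   // an empty string stands in for a null host ID
    node_state state;
};

// Builds endpoint -> host ID, skipping nodes that left or have no state,
// and nodes whose host ID is not known.
inline std::map<std::string, std::string> endpoint_to_host_id(const std::vector<node_info>& nodes) {
    std::map<std::string, std::string> result;
    for (const auto& n : nodes) {
        if (n.state == node_state::left || n.state == node_state::none) {
            continue;
        }
        if (n.host_id.empty()) {
            continue;   // the real code logs this case instead of silently skipping
        }
        result[n.endpoint] = n.host_id;
    }
    return result;
}
```

With this rule a node that is still joining appears in the map, which a members-only filter would have excluded.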


@@ -77,6 +77,12 @@ struct host_id_or_endpoint {
gms::inet_address resolve_endpoint(const token_metadata&) const;
};
using host_id_or_endpoint_list = std::vector<host_id_or_endpoint>;
[[nodiscard]] inline bool check_host_ids_contain_only_uuid(const auto& host_ids) {
return std::ranges::none_of(host_ids, [](const auto& node_str) { return locator::host_id_or_endpoint{node_str}.has_endpoint(); });
}
class token_metadata_impl;
struct topology_change_info;
@@ -230,7 +236,7 @@ public:
inet_address get_endpoint_for_host_id(locator::host_id host_id) const;
/** @return a copy of the endpoint-to-id map for read-only operations */
std::unordered_map<inet_address, host_id> get_endpoint_to_host_id_map_for_reading() const;
std::unordered_map<inet_address, host_id> get_endpoint_to_host_id_map() const;
/// Returns host_id of the local node.
host_id get_my_id() const;
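`check_host_ids_contain_only_uuid` is a "none of the entries is an endpoint" test over a list of strings. A rough standalone sketch of the same pattern is shown here. It assumes, for illustration only, that an entry counts as a host ID when it has the canonical 8-4-4-4-12 hexadecimal UUID shape; the real `host_id_or_endpoint` parser is not reproduced, and both helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Assumption for this sketch: canonical textual UUID, e.g.
// 0d2e8f2c-7a1b-4c3d-9e5f-123456789abc (36 characters, dashes at 8/13/18/23).
inline bool looks_like_uuid(const std::string& s) {
    if (s.size() != 36) {
        return false;
    }
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool dash_position = (i == 8 || i == 13 || i == 18 || i == 23);
        if (dash_position) {
            if (s[i] != '-') {
                return false;
            }
        } else if (!std::isxdigit(static_cast<unsigned char>(s[i]))) {
            return false;
        }
    }
    return true;
}

// True when no entry fails the UUID shape test (vacuously true for an empty list).
inline bool only_uuids(const std::vector<std::string>& ids) {
    return std::none_of(ids.begin(), ids.end(),
                        [](const std::string& s) { return !looks_like_uuid(s); });
}
```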

main.cc

@@ -1389,7 +1389,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
scfg.statement_tenants = {
{dbcfg.statement_scheduling_group, "$user"},
{default_scheduling_group(), "$system"},
{dbcfg.streaming_scheduling_group, "$maintenance"}
{dbcfg.streaming_scheduling_group, "$maintenance", false}
};
scfg.streaming = dbcfg.streaming_scheduling_group;
scfg.gossip = dbcfg.gossip_scheduling_group;
@@ -1404,7 +1404,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
}
// Delay listening messaging_service until gossip message handlers are registered
messaging.start(mscfg, scfg, creds).get();
messaging.start(mscfg, scfg, creds, std::ref(feature_service)).get();
auto stop_ms = defer_verbose_shutdown("messaging service", [&messaging] {
messaging.invoke_on_all(&netw::messaging_service::stop).get();
});
@@ -1511,7 +1511,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
// group0 client exists only on shard 0.
// The client has to be created before `stop_raft` since during
// destruction it has to exist until raft_gr.stop() completes.
service::raft_group0_client group0_client{raft_gr.local(), sys_ks.local(), maintenance_mode_enabled{cfg->maintenance_mode()}};
service::raft_group0_client group0_client{raft_gr.local(), sys_ks.local(), token_metadata.local(), maintenance_mode_enabled{cfg->maintenance_mode()}};
service::raft_group0 group0_service{
stop_signal.as_local_abort_source(), raft_gr.local(), messaging,
@@ -1944,6 +1944,13 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
ss.local().uninit_address_map().get();
});
// Need to make sure the storage service has stopped using group0 before running group0_service.abort().
// Normally this is done in storage_service::do_drain(), but if startup fails we need to do it
// here as well.
auto stop_group0_usage_in_storage_service = defer_verbose_shutdown("group 0 usage in local storage", [&ss] {
ss.local().wait_for_group0_stop().get();
});
// Setup group0 early in case the node is bootstrapped already and the group exists.
// Need to do it before allowing incoming messaging service connections since
// storage proxy's and migration manager's verbs may access group0.
@@ -2012,6 +2019,11 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
api::unset_server_authorization_cache(ctx).get();
});
// update the service level cache after the SL data accessor and auth service are initialized.
if (sl_controller.local().is_v2()) {
sl_controller.local().update_cache(qos::update_both_cache_levels::yes).get();
}
sl_controller.invoke_on_all([&lifecycle_notifier] (qos::service_level_controller& controller) {
lifecycle_notifier.local().register_subscriber(&controller);
}).get();
@@ -2083,6 +2095,9 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
if (cfg->view_building()) {
view_builder.invoke_on_all(&db::view::view_builder::start, std::ref(mm), utils::cross_shard_barrier()).get();
}
auto drain_view_builder = defer_verbose_shutdown("draining view builders", [&] {
view_builder.invoke_on_all(&db::view::view_builder::drain).get();
});
api::set_server_view_builder(ctx, view_builder).get();
auto stop_vb_api = defer_verbose_shutdown("view builder API", [&ctx] {


@@ -119,6 +119,7 @@
#include "idl/mapreduce_request.dist.impl.hh"
#include "idl/storage_service.dist.impl.hh"
#include "idl/join_node.dist.impl.hh"
#include "gms/feature_service.hh"
namespace netw {
@@ -232,9 +233,9 @@ future<> messaging_service::unregister_handler(messaging_verb verb) {
return _rpc->unregister_handler(verb);
}
messaging_service::messaging_service(locator::host_id id, gms::inet_address ip, uint16_t port)
messaging_service::messaging_service(locator::host_id id, gms::inet_address ip, uint16_t port, gms::feature_service& feature_service)
: messaging_service(config{std::move(id), ip, ip, port},
scheduling_config{{{{}, "$default"}}, {}, {}}, nullptr)
scheduling_config{{{{}, "$default"}}, {}, {}}, nullptr, feature_service)
{}
static
@@ -419,13 +420,14 @@ void messaging_service::do_start_listen() {
}
}
messaging_service::messaging_service(config cfg, scheduling_config scfg, std::shared_ptr<seastar::tls::credentials_builder> credentials)
messaging_service::messaging_service(config cfg, scheduling_config scfg, std::shared_ptr<seastar::tls::credentials_builder> credentials, gms::feature_service& feature_service)
: _cfg(std::move(cfg))
, _rpc(new rpc_protocol_wrapper(serializer { }))
, _credentials_builder(credentials ? std::make_unique<seastar::tls::credentials_builder>(*credentials) : nullptr)
, _clients(PER_SHARD_CONNECTION_COUNT + scfg.statement_tenants.size() * PER_TENANT_CONNECTION_COUNT)
, _scheduling_config(scfg)
, _scheduling_info_for_connection_index(initial_scheduling_info())
, _feature_service(feature_service)
{
_rpc->set_logger(&rpc_logger);
@@ -434,7 +436,8 @@ messaging_service::messaging_service(config cfg, scheduling_config scfg, std::sh
// which in turn relies on _connection_index_for_tenant to be initialized.
_connection_index_for_tenant.reserve(_scheduling_config.statement_tenants.size());
for (unsigned i = 0; i < _scheduling_config.statement_tenants.size(); ++i) {
_connection_index_for_tenant.push_back({_scheduling_config.statement_tenants[i].sched_group, i});
auto& tenant_cfg = _scheduling_config.statement_tenants[i];
_connection_index_for_tenant.push_back({tenant_cfg.sched_group, i, tenant_cfg.enabled});
}
register_handler(this, messaging_verb::CLIENT_ID, [this] (rpc::client_info& ci, gms::inet_address broadcast_address, uint32_t src_cpu_id, rpc::optional<uint64_t> max_result_size, rpc::optional<utils::UUID> host_id) {
@@ -457,6 +460,7 @@ messaging_service::messaging_service(config cfg, scheduling_config scfg, std::sh
});
init_local_preferred_ip_cache(_cfg.preferred_ips);
init_feature_listeners();
}
msg_addr messaging_service::get_source(const rpc::client_info& cinfo) {
@@ -679,16 +683,22 @@ messaging_service::get_rpc_client_idx(messaging_verb verb) const {
return idx;
}
// A statement or statement-ack verb
const auto curr_sched_group = current_scheduling_group();
for (unsigned i = 0; i < _connection_index_for_tenant.size(); ++i) {
if (_connection_index_for_tenant[i].sched_group == curr_sched_group) {
// i == 0: the default tenant maps to the default client indexes belonging to the interval
// [PER_SHARD_CONNECTION_COUNT, PER_SHARD_CONNECTION_COUNT + PER_TENANT_CONNECTION_COUNT).
idx += i * PER_TENANT_CONNECTION_COUNT;
break;
if (_connection_index_for_tenant[i].enabled) {
// i == 0: the default tenant maps to the default client indexes belonging to the interval
// [PER_SHARD_CONNECTION_COUNT, PER_SHARD_CONNECTION_COUNT + PER_TENANT_CONNECTION_COUNT).
idx += i * PER_TENANT_CONNECTION_COUNT;
break;
} else {
// If the tenant is disabled, immediately return the current index to
// use the $system tenant.
return idx;
}
}
}
return idx;
}
@@ -793,6 +803,22 @@ void messaging_service::cache_preferred_ip(gms::inet_address ep, gms::inet_addre
remove_rpc_client(msg_addr(ep));
}
void messaging_service::init_feature_listeners() {
_maintenance_tenant_enabled_listener = _feature_service.maintenance_tenant.when_enabled([this] {
enable_scheduling_tenant("$maintenance");
});
}
void messaging_service::enable_scheduling_tenant(std::string_view name) {
for (size_t i = 0; i < _scheduling_config.statement_tenants.size(); ++i) {
if (_scheduling_config.statement_tenants[i].name == name) {
_scheduling_config.statement_tenants[i].enabled = true;
_connection_index_for_tenant[i].enabled = true;
return;
}
}
}
gms::inet_address messaging_service::get_public_endpoint_for(const gms::inet_address& ip) const {
auto i = _preferred_to_endpoint.find(ip);
return i != _preferred_to_endpoint.end() ? i->second : ip;
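
The tenant lookup changed in `get_rpc_client_idx` above can be illustrated in isolation. The following is a minimal standalone sketch, not the ScyllaDB implementation: `tenant_slot`, `client_index_for`, `enable_tenant` and the two connection-count constants are hypothetical stand-ins, and an `int` replaces `seastar::scheduling_group`. It only shows the idea that a disabled tenant keeps the unshifted base index until it is enabled by name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical constants; the real values live in messaging_service.
constexpr unsigned per_shard_connection_count = 2;
constexpr unsigned per_tenant_connection_count = 3;

// Hypothetical stand-in for a tenant entry; `sched_group` is an int here
// instead of a seastar::scheduling_group.
struct tenant_slot {
    int sched_group;
    std::string name;
    bool enabled = true;
};

// Returns the client index for the current scheduling group.
// `base_idx` is the index inside the default tenant's interval.
unsigned client_index_for(const std::vector<tenant_slot>& tenants,
                          int current_group, unsigned base_idx) {
    assert(!tenants.empty()); // the config must contain at least one tenant
    for (std::size_t i = 0; i < tenants.size(); ++i) {
        if (tenants[i].sched_group == current_group) {
            if (!tenants[i].enabled) {
                return base_idx; // disabled tenant: keep the unshifted index
            }
            return base_idx + static_cast<unsigned>(i) * per_tenant_connection_count;
        }
    }
    return base_idx; // unknown group: default tenant
}

// Enables the tenant with the given name; returns false if it is not found.
bool enable_tenant(std::vector<tenant_slot>& tenants, const std::string& name) {
    for (auto& t : tenants) {
        if (t.name == name) {
            t.enabled = true;
            return true;
        }
    }
    return false;
}
```

In this sketch, a tenant created with `enabled = false` shares connections with the default interval until `enable_tenant` is called, which mirrors how the feature listener flips the `$maintenance` tenant once the cluster feature is enabled.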


@@ -45,6 +45,7 @@ namespace gms {
class gossip_digest_ack2;
class gossip_get_endpoint_states_request;
class gossip_get_endpoint_states_response;
class feature_service;
}
namespace db {
@@ -299,6 +300,7 @@ public:
struct tenant {
scheduling_group sched_group;
sstring name;
bool enabled = true;
};
// Must have at least one element. No two tenants should have the same
// scheduling group. [0] is the default tenant, that all unknown
@@ -319,6 +321,7 @@ private:
struct tenant_connection_index {
scheduling_group sched_group;
unsigned cliend_idx;
bool enabled;
};
private:
config _cfg;
@@ -337,6 +340,7 @@ private:
scheduling_config _scheduling_config;
std::vector<scheduling_info_for_connection_index> _scheduling_info_for_connection_index;
std::vector<tenant_connection_index> _connection_index_for_tenant;
gms::feature_service& _feature_service;
struct connection_ref;
std::unordered_multimap<locator::host_id, connection_ref> _host_connections;
@@ -351,8 +355,8 @@ private:
public:
using clock_type = lowres_clock;
messaging_service(locator::host_id id, gms::inet_address ip, uint16_t port);
messaging_service(config cfg, scheduling_config scfg, std::shared_ptr<seastar::tls::credentials_builder>);
messaging_service(locator::host_id id, gms::inet_address ip, uint16_t port, gms::feature_service& feature_service);
messaging_service(config cfg, scheduling_config scfg, std::shared_ptr<seastar::tls::credentials_builder>, gms::feature_service& feature_service);
~messaging_service();
future<> start();
@@ -544,6 +548,12 @@ public:
std::vector<messaging_service::scheduling_info_for_connection_index> initial_scheduling_info() const;
unsigned get_rpc_client_idx(messaging_verb verb) const;
static constexpr std::array<std::string_view, 3> _connection_types_prefix = {"statement:", "statement-ack:", "forward:"}; // "forward" is the old name for "mapreduce"
void init_feature_listeners();
private:
std::any _maintenance_tenant_enabled_listener;
void enable_scheduling_tenant(std::string_view name);
};
} // namespace netw


@@ -215,6 +215,7 @@ public:
mutation_partition as_mutation_partition(const schema&) const;
private:
// Erases the entry if it's safe to do so without changing the logical state of the partition.
// (It's allowed to evict empty row entries, though).
rows_type::iterator maybe_drop(const schema&, cache_tracker*, rows_type::iterator, mutation_application_stats&);
void insert_row(const schema& s, const clustering_key& key, deletable_row&& row);
void insert_row(const schema& s, const clustering_key& key, const deletable_row& row);


@@ -14,6 +14,7 @@
#include "utils/assert.hh"
#include "utils/coroutine.hh"
#include "real_dirty_memory_accounter.hh"
#include "clustering_interval_set.hh"
static void remove_or_mark_as_unique_owner(partition_version* current, mutation_cleaner* cleaner)
{
@@ -638,6 +639,15 @@ mutation_partition_v2 partition_entry::squashed_v2(const schema& to, is_evictabl
return mp;
}
clustering_interval_set partition_entry::squashed_continuity(const schema& s)
{
clustering_interval_set result;
for (auto&& v : _version->all_elements()) {
result.add(s, v.partition().as_mutation_partition(*v.get_schema()).get_continuity(s));
}
return result;
}
mutation_partition partition_entry::squashed(const schema& s, is_evictable evictable)
{
return squashed_v2(s, evictable).as_mutation_partition(s);


@@ -682,6 +682,7 @@ public:
}
mutation_partition_v2 squashed_v2(const schema& to, is_evictable);
clustering_interval_set squashed_continuity(const schema&);
mutation_partition squashed(const schema&, is_evictable);
tombstone partition_tombstone() const;


@@ -186,6 +186,8 @@ std::set<gms::inet_address> task_manager_module::get_nodes() const noexcept {
_ss._topology_state_machine._topology.transition_nodes
) | boost::adaptors::transformed([&ss = _ss] (auto& node) {
return ss.host2ip(locator::host_id{node.first.uuid()});
}) | boost::adaptors::filtered([&ss = _ss] (auto& ip) {
return ss._gossiper.is_alive(ip);
})
);
}


@@ -589,7 +589,9 @@ future<> server_impl::wait_for_entry(entry_id eid, wait_type type, seastar::abor
check_not_aborted();
if (as && as->abort_requested()) {
throw request_aborted(format("Abort requested before waiting for entry with idx: {}, term: {}", eid.idx, eid.term));
throw request_aborted(format(
"Abort requested before waiting for entry with idx: {}, term: {}; last committed entry: {}, last applied entry: {}",
eid.idx, eid.term, _fsm->commit_idx(), _applied_idx));
}
auto& container = type == wait_type::committed ? _awaited_commits : _awaited_applies;
@@ -637,9 +639,11 @@ future<> server_impl::wait_for_entry(entry_id eid, wait_type type, seastar::abor
}
SCYLLA_ASSERT(inserted);
if (as) {
it->second.abort = as->subscribe([it = it, &container] noexcept {
it->second.abort = as->subscribe([this, it = it, &container] noexcept {
it->second.done.set_exception(
request_aborted(format("Abort requested while waiting for entry with idx: {}, term: {}", it->first, it->second.term)));
request_aborted(format(
"Abort requested while waiting for entry with idx: {}, term: {}; last committed entry: {}, last applied entry: {}",
it->first, it->second.term, _fsm->commit_idx(), _applied_idx)));
container.erase(it);
});
SCYLLA_ASSERT(it->second.abort);
@@ -1451,7 +1455,9 @@ term_t server_impl::get_current_term() const {
future<> server_impl::wait_for_apply(index_t idx, abort_source* as) {
if (as && as->abort_requested()) {
throw request_aborted(format("Aborted before waiting for applying entry: {}, last applied entry: {}", idx, _applied_idx));
throw request_aborted(format(
"Aborted before waiting for applying entry: {}, last committed entry: {}, last applied entry: {}",
idx, _fsm->commit_idx(), _applied_idx));
}
check_not_aborted();
@@ -1463,7 +1469,9 @@ future<> server_impl::wait_for_apply(index_t idx, abort_source* as) {
if (as) {
it->second.abort = as->subscribe([this, it] noexcept {
it->second.promise.set_exception(
request_aborted(format("Aborted while waiting to apply entry: {}, last applied entry: {}", it->first, _applied_idx)));
request_aborted(format(
"Aborted while waiting to apply entry: {}, last committed entry: {}, last applied entry: {}",
it->first, _fsm->commit_idx(), _applied_idx)));
_awaited_indexes.erase(it);
});
SCYLLA_ASSERT(it->second.abort);


@@ -1367,7 +1367,7 @@ reader_concurrency_semaphore::can_admit_read(const reader_permit::impl& permit)
}
if (!has_available_units(permit.base_resources())) {
auto reason = _resources.memory >= permit.base_resources().memory ? reason::memory_resources : reason::count_resources;
auto reason = _resources.memory >= permit.base_resources().memory ? reason::count_resources : reason::memory_resources;
if (_inactive_reads.empty()) {
return {can_admit::no, reason};
} else {
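
The swapped ternary in the hunk above is easier to check with a small model. This is a hedged sketch under simplified assumptions: `resources`, `wait_reason` and `why_blocked` are hypothetical names, and the real `reader_concurrency_semaphore` tracks more state. The point it illustrates is that when enough memory is available, the shortage must be in the count, so the reason is `count_resources`.

```cpp
#include <cassert>

// Hypothetical simplified resource pair (count units and memory bytes).
struct resources {
    int count;
    long memory;
};

enum class wait_reason { none, count_resources, memory_resources };

// Reports why a read needing `needed` cannot be admitted given `available`.
wait_reason why_blocked(const resources& available, const resources& needed) {
    const bool has_units = available.count >= needed.count
                        && available.memory >= needed.memory;
    if (has_units) {
        return wait_reason::none;
    }
    // Memory is sufficient, so the count must be the missing resource;
    // otherwise memory is what the read is waiting for.
    return available.memory >= needed.memory ? wait_reason::count_resources
                                             : wait_reason::memory_resources;
}
```

Like the one-line test in the diff, the sketch reports `memory_resources` when both resources are short, because the memory comparison is evaluated first.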


@@ -1081,6 +1081,20 @@ multishard_combining_reader_v2::multishard_combining_reader_v2(
mutation_reader::forwarding fwd_mr)
: impl(std::move(s), std::move(permit)), _keep_alive_sharder(std::move(keep_alive_sharder)), _sharder(sharder) {
// The permit of the multishard reader is destroyed after the permits of its child readers.
// Therefore its semaphore resources won't be automatically released
// until children acquire their own resources.
//
// This creates a dependency (an edge in the "resource allocation graph"),
// where the semaphore used by the multishard reader depends on the semaphores used by children.
// When such dependencies create a cycle, and permits are acquired by different reads
// in just the right order, a deadlock will happen.
//
// One way to prevent the deadlock is to avoid the resource dependency by ensuring
// that the resources of multishard reader are released before the children attempt to acquire theirs.
// We do this here.
_permit.release_base_resources();
on_partition_range_change(pr);
_shard_readers.reserve(_sharder.shard_count());
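
The dependency cycle described in the constructor comment above can be reproduced with a toy model. This is only a sketch: `unit_pool` and `both_children_admitted` are hypothetical, each pool holds a single unit, and the interleaving is fixed rather than concurrent. Two reads use pools A and B in opposite roles (parent on one, child on the other); if the parents keep their units, neither child can be admitted.

```cpp
#include <cassert>

// Hypothetical counting pool standing in for a reader concurrency semaphore.
class unit_pool {
    int _free;
    int _capacity;
public:
    explicit unit_pool(int capacity) : _free(capacity), _capacity(capacity) {}
    bool try_acquire() {
        if (_free == 0) {
            return false;
        }
        --_free;
        return true;
    }
    void release() {
        assert(_free < _capacity); // releasing more than was acquired is a bug
        ++_free;
    }
};

// Read 1: parent on A, child on B. Read 2: parent on B, child on A.
// Returns true if both parents and both children obtain a unit.
bool both_children_admitted(bool release_parent_first) {
    unit_pool a(1), b(1);
    const bool parent1 = a.try_acquire();
    const bool parent2 = b.try_acquire();
    if (release_parent_first) {
        // Mirrors the idea of releasing the parent's base resources
        // before the children try to acquire theirs.
        a.release();
        b.release();
    }
    const bool child1 = b.try_acquire();
    const bool child2 = a.try_acquire();
    return parent1 && parent2 && child1 && child2;
}
```

With `release_parent_first` set to `false` the function returns `false`, which is the cycle in miniature; releasing first removes the dependency edge and both children are admitted.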


@@ -446,7 +446,6 @@ void repair::task_manager_module::start(repair_uniq_id id) {
void repair::task_manager_module::done(repair_uniq_id id, bool succeeded) {
_pending_repairs.erase(id.uuid());
_aborted_pending_repairs.erase(id.uuid());
if (succeeded) {
_status.erase(id.id);
} else {
@@ -536,21 +535,23 @@ size_t repair::task_manager_module::nr_running_repair_jobs() {
return count;
}
bool repair::task_manager_module::is_aborted(const tasks::task_id& uuid) {
return _aborted_pending_repairs.contains(uuid);
future<bool> repair::task_manager_module::is_aborted(const tasks::task_id& uuid, shard_id shard) {
return get_task_manager().container().invoke_on(shard, [name = get_name(), uuid] (tasks::task_manager& tm) {
auto module = tm.find_module(name);
auto it = module->get_local_tasks().find(uuid);
return it != module->get_local_tasks().end() && it->second->abort_requested();
});
}
void repair::task_manager_module::abort_all_repairs() {
_aborted_pending_repairs = _pending_repairs;
for (auto& x : _repairs) {
auto it = get_local_tasks().find(x.second);
for (auto& id : _pending_repairs) {
auto it = get_local_tasks().find(id);
if (it != get_local_tasks().end()) {
auto& impl = dynamic_cast<repair::shard_repair_task_impl&>(*it->second->_impl);
// If the task is aborted, its state will change to failed. One can wait for this with task_manager::task::done().
impl.abort();
it->second->abort();
}
}
rlogger.info0("Started to abort repair jobs={}, nr_jobs={}", _aborted_pending_repairs, _aborted_pending_repairs.size());
rlogger.info0("Started to abort repair jobs={}, nr_jobs={}", _pending_repairs, _pending_repairs.size());
}
float repair::task_manager_module::report_progress() {
@@ -1328,7 +1329,7 @@ future<> repair::user_requested_repair_task_impl::run() {
auto id = get_repair_uniq_id();
return module->run(id, [this, &rs, &db, id, keyspace = _status.keyspace, germs = std::move(_germs),
&cfs = _cfs, &ranges = _ranges, hosts = std::move(_hosts), data_centers = std::move(_data_centers), ignore_nodes = std::move(_ignore_nodes)] () mutable {
&cfs = _cfs, &ranges = _ranges, hosts = std::move(_hosts), data_centers = std::move(_data_centers), ignore_nodes = std::move(_ignore_nodes), &task_as = _as] () mutable {
auto uuid = node_ops_id{id.uuid().uuid()};
auto start_time = std::chrono::steady_clock::now();
@@ -1382,9 +1383,7 @@ future<> repair::user_requested_repair_task_impl::run() {
}
});
if (rs.get_repair_module().is_aborted(id.uuid())) {
throw abort_requested_exception();
}
task_as.check();
auto ranges_parallelism = _ranges_parallelism;
bool small_table_optimization = _small_table_optimization;
@@ -1493,7 +1492,7 @@ future<> repair::data_sync_repair_task_impl::run() {
auto id = get_repair_uniq_id();
rlogger.info("repair[{}]: sync data for keyspace={}, status=started", id.uuid(), keyspace);
co_await module->run(id, [this, &rs, id, &db, keyspace, germs = std::move(germs), &ranges = _ranges, &neighbors = _neighbors, reason = _reason] () mutable {
co_await module->run(id, [this, &rs, id, &db, keyspace, germs = std::move(germs), &ranges = _ranges, &neighbors = _neighbors, reason = _reason, &task_as = _as] () mutable {
auto cfs = list_column_families(db, keyspace);
_cfs_size = cfs.size();
if (cfs.empty()) {
@@ -1503,9 +1502,7 @@ future<> repair::data_sync_repair_task_impl::run() {
auto table_ids = get_table_ids(db, keyspace, cfs);
std::vector<future<>> repair_results;
repair_results.reserve(smp::count);
if (rs.get_repair_module().is_aborted(id.uuid())) {
throw abort_requested_exception();
}
task_as.check();
for (auto shard : boost::irange(unsigned(0), smp::count)) {
auto f = rs.container().invoke_on(shard, [keyspace, table_ids, id, ranges, neighbors, reason, germs, parent_data = get_repair_uniq_id().task_info] (repair_service& local_repair) mutable -> future<> {
auto data_centers = std::vector<sstring>();
@@ -1732,7 +1729,7 @@ future<> repair_service::bootstrap_with_repair(locator::token_metadata_ptr tmptr
}
auto nr_ranges = desired_ranges.size();
sync_data_using_repair(keyspace_name, erm, std::move(desired_ranges), std::move(range_sources), reason, nullptr).get();
rlogger.info("bootstrap_with_repair: finished with keyspace={}, nr_ranges={}", keyspace_name, nr_ranges);
rlogger.info("bootstrap_with_repair: finished with keyspace={}, nr_ranges={}", keyspace_name, nr_ranges * nr_tables);
}
rlogger.info("bootstrap_with_repair: finished with keyspaces={}", ks_erms | boost::adaptors::map_keys);
});
@@ -1914,12 +1911,12 @@ future<> repair_service::do_decommission_removenode_with_repair(locator::token_m
}
temp.clear_gently().get();
if (reason == streaming::stream_reason::decommission) {
container().invoke_on_all([nr_ranges_skipped] (repair_service& rs) {
rs.get_metrics().decommission_finished_ranges += nr_ranges_skipped;
container().invoke_on_all([nr_ranges_skipped, nr_tables] (repair_service& rs) {
rs.get_metrics().decommission_finished_ranges += nr_ranges_skipped * nr_tables;
}).get();
} else if (reason == streaming::stream_reason::removenode) {
container().invoke_on_all([nr_ranges_skipped] (repair_service& rs) {
rs.get_metrics().removenode_finished_ranges += nr_ranges_skipped;
container().invoke_on_all([nr_ranges_skipped, nr_tables] (repair_service& rs) {
rs.get_metrics().removenode_finished_ranges += nr_ranges_skipped * nr_tables;
}).get();
}
if (is_removenode) {
@@ -1928,7 +1925,7 @@ future<> repair_service::do_decommission_removenode_with_repair(locator::token_m
auto nr_ranges_synced = ranges.size();
sync_data_using_repair(keyspace_name, erm, std::move(ranges), std::move(range_sources), reason, ops).get();
rlogger.info("{}: finished with keyspace={}, leaving_node={}, nr_ranges={}, nr_ranges_synced={}, nr_ranges_skipped={}",
op, keyspace_name, leaving_node, nr_ranges_total, nr_ranges_synced, nr_ranges_skipped);
op, keyspace_name, leaving_node, nr_ranges_total, nr_ranges_synced * nr_tables, nr_ranges_skipped * nr_tables);
}
rlogger.info("{}: finished with keyspaces={}, leaving_node={}", op, ks_erms | boost::adaptors::map_keys, leaving_node);
});
@@ -2148,7 +2145,7 @@ future<> repair_service::do_rebuild_replace_with_repair(std::unordered_map<sstri
}
auto nr_ranges = ranges.size();
sync_data_using_repair(keyspace_name, erm, std::move(ranges), std::move(range_sources), reason, nullptr).get();
rlogger.info("{}: finished with keyspace={}, source_dc={}, nr_ranges={}", op, keyspace_name, source_dc_for_keyspace, nr_ranges);
rlogger.info("{}: finished with keyspace={}, source_dc={}, nr_ranges={}", op, keyspace_name, source_dc_for_keyspace, nr_ranges * nr_tables);
}
rlogger.info("{}: finished with keyspaces={}, source_dc={}", op, ks_erms | boost::adaptors::map_keys, source_dc);
});
@@ -2376,6 +2373,16 @@ future<> repair_service::repair_tablets(repair_uniq_id rid, sstring keyspace_nam
auto task = co_await _repair_module->make_and_start_task<repair::tablet_repair_task_impl>({}, rid, keyspace_name, table_names, streaming::stream_reason::repair, std::move(task_metas), ranges_parallelism);
}
void repair::tablet_repair_task_impl::release_resources() noexcept {
_metas_size = _metas.size();
_metas = {};
_tables = {};
}
size_t repair::tablet_repair_task_impl::get_metas_size() const noexcept {
return _metas.size() > 0 ? _metas.size() : _metas_size;
}
future<> repair::tablet_repair_task_impl::run() {
auto m = dynamic_pointer_cast<repair::task_manager_module>(_module);
auto& rs = m->get_repair_service();
@@ -2434,8 +2441,8 @@ future<> repair::tablet_repair_task_impl::run() {
}
});
rs.container().invoke_on_all([&idx, id, metas = _metas, parent_data, reason = _reason, tables = _tables, ranges_parallelism = _ranges_parallelism] (repair_service& rs) -> future<> {
auto parent_shard = this_shard_id();
rs.container().invoke_on_all([&idx, id, metas = _metas, parent_data, reason = _reason, tables = _tables, ranges_parallelism = _ranges_parallelism, parent_shard] (repair_service& rs) -> future<> {
std::exception_ptr error;
for (auto& m : metas) {
if (m.master_shard_id != this_shard_id()) {
@@ -2450,10 +2457,13 @@ future<> repair::tablet_repair_task_impl::run() {
continue;
}
auto erm = t->get_effective_replication_map();
if (rs.get_repair_module().is_aborted(id.uuid())) {
if (co_await rs.get_repair_module().is_aborted(id.uuid(), parent_shard)) {
throw abort_requested_exception();
}
co_await utils::get_local_injector().inject("repair_tablet_repair_task_impl_run",
[] (auto& handler) { return handler.wait_for_message(db::timeout_clock::now() + 10s); });
std::unordered_map<dht::token_range, repair_neighbors> neighbors;
neighbors[m.range] = m.neighbors;
dht::token_range_vector ranges = {m.range};
@@ -2508,12 +2518,12 @@ future<> repair::tablet_repair_task_impl::run() {
}
future<std::optional<double>> repair::tablet_repair_task_impl::expected_total_workload() const {
auto sz = _metas.size();
auto sz = get_metas_size();
co_return sz ? std::make_optional<double>(sz) : std::nullopt;
}
std::optional<double> repair::tablet_repair_task_impl::expected_children_number() const {
return _metas.size();
return get_metas_size();
}
node_ops_cmd_category categorize_node_ops_cmd(node_ops_cmd cmd) noexcept {
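
The `release_resources()` / `get_metas_size()` pair added above keeps the workload estimate stable after the task's metadata is freed. A minimal standalone sketch of that pattern follows; `task_sketch` is a hypothetical class and `std::vector<int>` stands in for the real metadata container.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical task that remembers how many metadata entries it had
// even after the entries themselves are released.
class task_sketch {
    std::vector<int> _metas;
    std::size_t _metas_size = 0;
public:
    explicit task_sketch(std::vector<int> metas) : _metas(std::move(metas)) {}

    // Frees the metadata but records its size first.
    void release_resources() noexcept {
        _metas_size = _metas.size();
        std::vector<int>().swap(_metas); // drop the storage, not just the elements
    }

    // Live size while the metadata exists, cached size afterwards.
    std::size_t get_metas_size() const noexcept {
        return _metas.size() > 0 ? _metas.size() : _metas_size;
    }
};
```

Reading the size through the accessor rather than the container directly is what lets progress reporting keep returning the original total once the task has finished and released its data.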
