Compare commits

...

299 Commits

Author SHA1 Message Date
Kefu Chai
3146a09638 dist: systemd: use default KillMode
before this change, we specify the KillMode of the scylla-service
service unit explicitly to "process". according to
according to
https://www.freedesktop.org/software/systemd/man/latest/systemd.kill.html,

> If set to process, only the main process itself is killed (not recommended!).

and the document suggests use "control-group" over "process".
but scylla server is not a multi-process server, it is a multi-threaded
server. so it should not make any difference even if we switch to
the recommended "control-group".

in the light that we've been seeing "defunct" scylla process after
stopping the scylla service using systemd. we are wondering if we should
try to change the `KillMode` to "control-group", which is the default
value of this setting.

in this change, we just drop the setting so that the systemd stops the
service by stopping all processes in the control group of this unit
are stopped.

Fixes scylladb/scylladb#21507

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21508

(cherry picked from commit 961a53f716)

Closes scylladb/scylladb#23177
2025-04-04 17:56:15 +03:00
Botond Dénes
b567d60624 Update seastar submodule
* seastar 882ed7ac...af8ae075 (1):
  > util/backtrace: Optimize formatter to reduce memory allocation overhead
2025-02-11 10:27:00 +02:00
Jenkins Promoter
50c4e91d4e Update ScyllaDB version to: 6.1.6 2025-02-10 11:56:06 +02:00
Botond Dénes
7b80816721 service: query_pager: fix last-position for filtering queries
On short-pages, cut short because of a tombstone prefix.
When page-results are filtered and the filter drops some rows, the
last-position is taken from the page visitor, which does the filtering.
This means that last partition and row position will be that of the last
row the filter saw. This will not match the last position of the
replica, when the replica cut the page due to tombstones.
When fetching the next page, this means that all the tombstone suffix of
the last page, will be re-fetched. Worse still: the last position of the
next page will not match that of the saved reader left on the replica, so
the saved reader will be dropped and a new one created from scratch.
This wasted work will show up as elevated tail latencies.
Fix by always taking the last position from raw query results.

Fixes: #22620

Closes scylladb/scylladb#22622

(cherry picked from commit 7ce932ce01)

Closes scylladb/scylladb#22717
2025-02-06 13:30:18 +02:00
Avi Kivity
acd5bd924f Update seastar submodule (hwloc failure on some AWS instances)
* seastar 908ccd936a...882ed7ac3c (1):
  > resource: fallback to sysconf when failed to detect memory size from hwloc

Fixes #22382.
2025-02-04 16:32:05 +02:00
Michael Litvak
3337ca35e4 view_builder: fix loop in view builder when tokens are moved
The view builder builds a view by going over the entire token ring,
consuming the base table partitions, and generating view updates for
each partition.

A view is considered as built when we complete a full cycle of the
token ring. Suppose we start to build a view at a token F. We will
consume all partitions with tokens starting at F until the maximum
token, then go back to the minimum token and consume all partitions
until F, and then we detect that we pass F and complete building the
view. This happens in the view builder consumer in
`check_for_built_views`.

The problem is that we check if we pass the first token F with the
condition `_step.current_token() >= it->first_token` whenever we consume
a new partition or the current_token goes back to the minimum token.
But suppose that we don't have any partitions with a token greater than
or equal to the first token (this could happen if the partition with
token F was moved to another node for example), then this condition will never be
satisfied, and we don't detect correctly when we pass F. Instead, we
go back to the minimum token, building the same token ranges again,
in a possibly infinite loop.

To fix this we add another step when reaching the end of the reader's
stream. When this happens it means we don't have any more fragments to
consume until the end of the range, so we advance the current_token to
the end of the range, simulating a partition, and check for built views
in that range.

Fixes scylladb/scylladb#21829

Closes scylladb/scylladb#22493

(cherry picked from commit 6d34125eb7)

Closes scylladb/scylladb#22605
2025-02-03 19:22:01 +01:00
Avi Kivity
b94b6bce4b seatar: point submodule at scylla-seastar.git
This allows backporting commits to seastar.
2025-01-31 19:49:58 +02:00
Aleksandra Martyniuk
003e3f212e repair: add repair_service gate
In main.cc storage_service is started before and stopped after
repair_service. storage_service keeps a reference to sharded
repair_service and calls its methods, but nothing ensures that
repair_service's local instance would be alive for the whole
execution of the method.

Add a gate to repair_service and enter it in storage_service
before executing methods on local instances of repair_service.

Fixes: #21964.

Closes scylladb/scylladb#22145

(cherry picked from commit 32ab58cdea)

Closes scylladb/scylladb#22317
2025-01-30 11:40:22 +02:00
Aleksandra Martyniuk
9c035f810f repair: check tasks local to given shard
Currently task_manager_module::is_aborted checks the tasks local
to caller's shard on a given shard.

Fix the method to check the task map local to the given shard.

Fixes: #22156.

Closes scylladb/scylladb#22161

(cherry picked from commit a91e03710a)

Closes scylladb/scylladb#22196
2025-01-30 07:46:40 +02:00
Botond Dénes
753c603f40 tools/scylla-sstable: dump-statistics: fix handling of {min,max}_column_names
Said fields in statistics are of type
`disk_array<uint32_t, disk_string<uint16_t>>` and currently are handled
as array of regular strings. However these fields store exploded
clustering keys, so the elements store binary data and converting to
string can yield invalid UTF-8 characters that certain JSON parsers (jq,
or python's json) can choke on. Fix this by treating them as binary and
using `to_hex()` to convert them to string. This requires some massaging
of the json_dumper: passing field offset to all visit() methods and
using a caller-provided disk-string to sstring converter to convert disk
strings to sstring, so in the case of statistics, these fields can be
intercepted and properly handled.

While at it, the type of these fields is also fixed in the
documentation.

Before:

    "min_column_names": [
      "��Z���\u0011�\u0012ŷ4^��<",
      "�2y\u0000�}\u007f"
    ],
    "max_column_names": [
      "��Z���\u0011�\u0012ŷ4^��<",
      "}��B\u0019l%^"
    ],

After:

    "min_column_names": [
      "9dd55a92bc8811ef12c5b7345eadf73c",
      "80327900e2827d7f"
    ],
    "max_column_names": [
      "9dd55a92bc8811ef12c5b7345eadf73c",
      "7df79242196c255e"
    ],

Fixes: #22078

Closes scylladb/scylladb#22225

(cherry picked from commit f899f0e411)

Closes scylladb/scylladb#22295
2025-01-29 20:27:31 +02:00
Botond Dénes
a1ab46d54d replica: remove noexcept from token -> tablet resolution path
The methods to resolve a key/token/range to a table are all noexcept.
Yet the method below all of these, `storage_group_for_id()` can throw.
This means that if due to any mistake a tablet without local replica is
attempted to be looked up, it will result in a crash, as the exception
bubbles up into the noexcept methods.
There is no value in pretending that looking up the tablet replica is
noexcept, remove the noexcept specifiers so that any bad lookup only
fails the operation at hand and doesn't crash the node. This is
especially relevant to replace, which still has a window where writes
can arrive for tablets that don't (yet) have a local replica. Currently,
this results in a crash. After this patch, this will only fail the
writes and the replace can move on.

Fixes: #21480

Closes scylladb/scylladb#22251

(cherry picked from commit 55963f8f79)

Closes scylladb/scylladb#22378
2025-01-29 20:26:23 +02:00
Kefu Chai
4f8d2f48af compress: fix compressor initialization order by making namespace_prefix a function
Fixes a race condition where COMPRESSOR_NAME in zstd.cc could be
initialized before compressor::namespace_prefix due to undefined
global variable initialization order across translation units. This
was causing ZstdCompressor to be unregistered in release builds,
making it impossible to create tables with Zstd compression.

Replace the global namespace_prefix variable with a function that
returns the fully qualified compressor name. This ensures proper
initialization order and fixes the registration of the ZstdCompressor.

Fixes scylladb/scylladb#22444
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22451

(cherry picked from commit 4a268362b9)

Closes scylladb/scylladb#22509
2025-01-29 20:25:53 +02:00
Avi Kivity
6d7ce6890a Merge '[Backport 6.1] repair: handle no_such_keyspace in repair preparation phase' from null
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.

Skip the keyspace repair if no_such_keyspace is thrown during preparations.

Fixes: #22073.

Requires backport to 6.1 and 6.2 as they contain the bug

- (cherry picked from commit bfb1704afa)

- (cherry picked from commit 54e7f2819c)

Parent PR: #22473

Closes scylladb/scylladb#22540

* github.com:scylladb/scylladb:
  test: add test to check if repair handles no_such_keyspace
  repair: handle keyspace dropped
2025-01-29 19:53:06 +02:00
Michael Litvak
b395d35c6c cdc: fix handling of new generation during raft upgrade
During raft upgrade, a node may gossip about a new CDC generation that
was propagated through raft. The node that receives the generation by
gossip may have not applied the raft update yet, and it will not find
the generation in the system tables. We should consider this error
non-fatal and retry to read until it succeeds or becomes obsolete.

Another issue is when we fail with a "fatal" exception and not retrying
to read, the cdc metadata is left in an inconsistent state that causes
further attempts to insert this CDC generation to fail.

What happens is we complete preparing the new generation by calling `prepare`,
we insert an empty entry for the generation's timestamp, and then we fail. The
next time we try to insert the generation, we skip inserting it because we see
that it already has an entry in the metadata and we determine that
there's nothing to do. But this is wrong, because the entry is empty,
and we should continue to insert the generation.

To fix it, we change `prepare` to return `true` when the entry already
exists but it's empty, indicating we should continue to insert the
generation.

Fixes scylladb/scylladb#21227

Closes scylladb/scylladb#22093

(cherry picked from commit 4f5550d7f2)

Closes scylladb/scylladb#22544
2025-01-29 19:50:30 +02:00
Aleksandra Martyniuk
044841ef9c test: add test to check if repair handles no_such_keyspace
(cherry picked from commit 54e7f2819c)
2025-01-28 21:49:47 +00:00
Aleksandra Martyniuk
8fbfabaac4 repair: handle keyspace dropped
Currently, data sync repair handles most no_such_keyspace exceptions,
but it omits the preparation phase, where the exception could be thrown
during make_global_effective_replication_map.

Skip the keyspace repair if no_such_keyspace is thrown during preparations.

(cherry picked from commit bfb1704afa)
2025-01-28 21:49:46 +00:00
Kamil Braun
7bdccd8b49 Merge '[Backport 6.1] raft: Handle non-critical config update errors in when changing voter status.' from Sergey Z
When a node is bootstrapped and joined a cluster as a non-voter and changes it's role to a voter, errors can occur while committing a new Raft record, for instance, if the Raft leader changes during this time. These errors are not critical and should not cause a node crash, as the action can be retried.

Fixes scylladb/scylladb#20814

Backport: This issue occurs frequently and disrupts the CI workflow to some extent. Backports are needed for versions 6.1 and 6.2.

- (cherry picked from commit 775411ac56)

- (cherry picked from commit 16053a86f0)

- (cherry picked from commit 8c48f7ad62)

- (cherry picked from commit 3da4848810)

- (cherry picked from commit 228a66d030)

Parent PR: #22253

Closes scylladb/scylladb#22357

* github.com:scylladb/scylladb:
  raft: refactor `remove_from_raft_config` to use a timed `modify_config` call.
  raft: Refactor functions using `modify_config` to use a common wrapper for retrying.
  raft: Handle non-critical config update errors in when changing status to voter.
  test: Add test to check that a node does not fail on unknown commit status error when starting up.
  raft: Add run_op_with_retry in raft_group0.
2025-01-24 17:07:03 +01:00
Sergey Zolotukhin
7f75a5c7d8 raft: refactor remove_from_raft_config to use a timed modify_config call.
To avoid potential hangs during the `remove_from_raft_config` operation, use a timed `modify_config` call.
This ensures the operation doesn't get stuck indefinitely.

(cherry picked from commit 228a66d030)
2025-01-22 09:41:29 +01:00
Sergey Zolotukhin
dfc8559bea raft: Refactor functions using modify_config to use a common wrapper
for retrying.

There are several places in `raft_group0` where almost identical code is
used for retrying `modify_config` in case of `commit_status_unknown`
error. To avoid code duplication all these places were changed to
use a new wrapper `run_op_with_retry`.

(cherry picked from commit 3da4848810)
2025-01-22 09:41:26 +01:00
Kefu Chai
56644f1a22 docs: fix monospace formatting for rm command
Add missing space before `rm` to ensure proper rendering
in monospace font within documentation.

Fixes scylladb/scylladb#22255
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21576

(cherry picked from commit 6955b8238e)

Closes scylladb/scylladb#22256
2025-01-20 11:27:42 +02:00
Michael Litvak
c847806182 view_builder: write status to tables before starting to build
When adding a new view for building, first write the status to the
system tables and then add the view building step that will start
building it.

Otherwise, if we start building it before the status is written to the
table, it may happen that we complete building the view, write the
SUCCESS status, and then overwrite it with the STARTED status. The
view_build_status table will remain in incorrect state indicating the
view building is not complete.

Fixes scylladb/scylladb#20638

(cherry picked from commit b1be2d3c41)

Closes scylladb/scylladb#22355
2025-01-19 18:22:07 +02:00
Sergey Zolotukhin
81169eda19 raft: Handle non-critical config update errors in when changing status
to voter.

When a node is bootstrapped and joins a cluster as a non-voter, errors can occur while committing
a new Raft record, for instance, if the Raft leader changes during this time. These errors are not
critical and should not cause a node crash, as the action can be retried.

Fixes scylladb/scylladb#20814

(cherry picked from commit 8c48f7ad62)
2025-01-16 20:09:04 +00:00
Sergey Zolotukhin
76be0f8a1e test: Add test to check that a node does not fail on unknown commit status
error when starting up.

Test that a node is starting successfully if while joining a cluster and becoming a voter, it
receives an unknown commit status error.

Test for scylladb/scylladb#20814

(cherry picked from commit 16053a86f0)
2025-01-16 20:09:04 +00:00
Sergey Zolotukhin
4c962bbc54 raft: Add run_op_with_retry in raft_group0.
Since when calling `modify_config` it's quite often we need to do
retries, to avoid code duplication, a function wrapper that allows
a function to be called with automatic retries in case of failures
was added.

(cherry picked from commit 775411ac56)
2025-01-16 20:09:03 +00:00
Kamil Braun
116857c7ed Merge '[Backport 6.1] Fix possible data corruption due to token keys clashing in read repair.' from Sergey
This update addresses an issue in the mutation diff calculation algorithm used during read repair. Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on the Murmur3 hash function, it could generate duplicate values for different partition keys, causing corruption in the affected rows' values.

Fixes scylladb/scylladb#19101

Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2

- (cherry picked from commit e577f1d141)

- (cherry picked from commit 39785c6f4e)

- (cherry picked from commit 155480595f)

Parent PR: #21996

Closes scylladb/scylladb#22297

* github.com:scylladb/scylladb:
  storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function.
  storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap.
  test: Add test case for checking read repair diff calculation when having conflicting keys.
2025-01-16 17:15:09 +01:00
Sergey Zolotukhin
8f8b8e902c test: Include parent test name in ScyllaClusterManager log file names.
Add the test file name to `ScyllaClusterManager` log file names alongside the test function name.
This avoids race conditions when tests with the same function names are executed simultaneously.

Fixes scylladb/scylladb#21807

Backport: not needed since this is a fix in the testing scripts.

Closes scylladb/scylladb#22192

(cherry picked from commit 2f1731c551)

Closes scylladb/scylladb#22248
2025-01-14 16:34:15 +01:00
Sergey Zolotukhin
d5dd364b49 storage_proxy/read_repair: Remove redundant 'schema' parameter from data_read_resolver::resolve
function.

The `data_read_resolver` class inherits from `abstract_read_resolver`, which already includes the
`schema_ptr _schema` member. Therefore, using a separate function parameter in `data_read_resolver::resolve`
initialized with the same variable in `abstract_read_executor` is redundant.

(cherry picked from commit 1554805)
2025-01-14 14:48:25 +01:00
Sergey Zolotukhin
e65d1a3665 storage_proxy/read_repair: Use partition_key instead of token key for mutation
diff calculation hashmap.

This update addresses an issue in the mutation diff calculation algorithm used during read repair.
Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on
the Murmur3 hash function, it could generate duplicate values for different partition keys, causing
corruption in the affected rows' values.

Fixes scylladb/scylladb#19101

(cherry picked from commit 39785c6)
2025-01-14 11:25:49 +01:00
Sergey Zolotukhin
63d58022a6 test: Add test case for checking read repair diff calculation when having
conflicting keys.

The test updates two rows with keys that result in a Murmur3 hash collision, which
is used to generate Scylla tokens. These tokens are involved in read repair diff
calculations. Due to the identical token values, a hash map key collision occurs.
Consequently, an incorrect value from the second row (with a different primary key)
is then sent for writing as 'repaired', causing data corruption.

(cherry picked from commit e577f1d141)
2025-01-13 22:05:06 +00:00
Kamil Braun
52a09a2f2d Merge '[Backport 6.1] cache_algorithm_test: fix flaky failures' from Michał Chojnowski
This series attempts to get read of flakiness in cache_algorithm_test by solving two problems.

Problem 1:

The test needs to create some arbitrary partition keys of a given size. It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form: 0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass -- the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used. But if some background task happens to overwrite the allocator slot during a preemption, the keys used during "SELECT" will be different than the keys used during "INSERT", and the test will fail due to extra cache misses.

Problem 2:

Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)

Fixes scylladb/scylladb#21536

(cherry picked from commit 1fffd976a4)
(cherry picked from commit 6caaead4ac)

Parent PR: scylladb/scylladb#21948

Closes scylladb/scylladb#22227

* github.com:scylladb/scylladb:
  cache_algorithm_test: harden against stats being confused by background activity
  cache_algorithm_test: fix a use of an uninitialized variable
2025-01-09 14:30:54 +01:00
Anna Stuchlik
98dfb50c99 doc: add troubleshooting removal with --autoremove-ubuntu
This commit adds a troubleshooting article on removing ScyllaDB
with the --autoremove option.

Fixes https://github.com/scylladb/scylladb/issues/21408

Closes scylladb/scylladb#21697

(cherry picked from commit 8d824a564f)

Closes scylladb/scylladb#22230
2025-01-08 13:11:28 +02:00
Yaron Kaikov
03a19d586e .github/scripts/auto-backport.py: Add comment to PR when conflicts apply
When we open a PR with conflicts, the PR owner gets a notification about the assignment but has no idea if this PR is with conflicts or not (in Scylla it's important since CI will not start on draft PR)

Let's add a comment to notify the user we have conflicts

Closes scylladb/scylladb#21939

(cherry picked from commit 2e6755ecca)

Closes scylladb/scylladb#22189
2025-01-08 13:11:00 +02:00
Botond Dénes
af2cb66cfc Merge 'sstables_manager: do not reclaim unlinked sstables' from Lakshmi Narayanan Sreethar
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.

Added a testcase to verify the fix.

Fixes #21887

This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions.

Closes scylladb/scylladb#21895

* github.com:scylladb/scylladb:
  sstables_manager: reclaim memory from sstables on unlink
  sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
  sstables: introduce disable_component_memory_reload()
  sstables_manager: log sstable name when reclaiming components

(cherry picked from commit d4129ddaa6)

Closes scylladb/scylladb#21997
2025-01-08 13:10:30 +02:00
Michał Chojnowski
e10def2f2a cache_algorithm_test: harden against stats being confused by background activity
Cache stats are global, so there's no good way to reliably
verify that e.g. a given read causes 0 cache misses,
because something done by Scylla in a background can trigger a cache miss.

This can cause the test to fail spuriously.

With how the test framework and the cache are designed, there's probably
no good way to test this properly. It would require ensuring that cache
stats are per-read, or at least per-table, and that Scylla's background
activity doesn't cause enough memory pressure to evict the tested rows.

This patch tries to deal with the flakiness without deleting the test
altogether by letting it retry after a failure if it notices that it
can be explained by a read which wasn't done by the test.
(Though, if the test can't be written well, maybe it just shouldn't be written...)

(cherry picked from commit 6caaead4ac)
2025-01-08 11:43:15 +01:00
Michał Chojnowski
f0f2749c5c cache_algorithm_test: fix a use of an uninitialized variable
The test needs to create some arbitrary partition keys of a given size.
It intends to create keys of the form:
0x0000000000000000000000000000000000000000...
0x0100000000000000000000000000000000000000...
0x0200000000000000000000000000000000000000...
But instead, unintentionally, it creates partially initialized keys of the form:
0x0000000000000000garbagegarbagegarbagegar...
0x0100000000000000garbagegarbagegarbagegar...
0x0200000000000000garbagegarbagegarbagegar...

Each of these keys is created several times and -- for the test to pass --
the result must be the same each time.
By coincidence, this is usually the case, since the same allocator slots are used.
But if some background task happens to overwrite the allocator slot during a
preemption, the keys used during "SELECT" will be different than the keys used
during "INSERT", and the test will fail due to extra cache misses.

(cherry picked from commit 1fffd976a4)
2025-01-08 11:43:04 +01:00
Patryk Jędrzejczak
c5f28d8099 [Backport 6.1] raft: improve logs for abort while waiting for apply
New logs allow us to easily distinguish two cases in which
waiting for apply times out:
- the node didn't receive the entry it was waiting for,
- the node received the entry but didn't apply it in time.

Distinguishing these cases simplifies reasoning about failures.
The first case indicates that something went wrong on the leader.
The second case indicates that something went wrong on the node
on which waiting for apply timed out.

As it turns out, many different bugs result in the `read_barrier`
(which calls `wait_for_apply`) timeout. This change should help
us in debugging bugs like these.

We want to backport this change to all supported branches so that
it helps us in all tests.

Fixes scylladb/scylladb#22160

Closes scylladb/scylladb#22157
2025-01-07 17:03:17 +01:00
Kamil Braun
9618e9b0d3 Merge '[Backport 6.1] Do not reset quarantine list in non raft mode' from Gleb
The series contains small fixes to the gossiper one of which fixes #21930. Others I noticed while debugged the issue.

Fixes: #21930

- (cherry picked from commit 91cddcc17f)

Parent PR: #21956

Closes scylladb/scylladb#21990

* github.com:scylladb/scylladb:
  gossiper: do not reset _just_removed_endpoints in non raft mode
  gossiper: do not call apply for the node's old state
2025-01-07 17:00:35 +01:00
Abhinav
1e54ee19ce Fix gossiper orphan node floating problem by adding a remover fiber
In the current scenario, if during startup, a node crashes after initiating gossip and before joining group0,
then it keeps floating in the gossiper forever because the raft based gossiper purging logic is only effective
once node joins group0. This orphan node hinders the successor node from same ip to join cluster since it collides
with it during gossiper shadow round.

This commit intends to fix this issue by adding a background thread which periodically checks for such orphan entries in
gossiper and removes them.

A test is also added in to verify this logic. This test fails without this background thread enabled, hence
verifying the behavior.

Fixes: scylladb/scylladb#20082

Closes scylladb/scylladb#21600

(cherry picked from commit 6c90a25014)

Closes scylladb/scylladb#21821
2025-01-02 14:59:28 +01:00
Gleb Natapov
2163839c6d gossiper: do not reset _just_removed_endpoints in non raft mode
By the time the function is called during start it may already be
populated.

Fixes: scylladb/scylladb#21930
(cherry picked from commit e318dfb83a)
2024-12-25 11:47:26 +02:00
Gleb Natapov
951bfd4203 gossiper: do not call apply for the node's old state
If a nodes changed its address an old state may be still in a gossiper,
so ignore it.

(cherry picked from commit e80355d3a1)
2024-12-23 11:46:47 +02:00
Piotr Dulikowski
e3b1216cba Merge '[Backport 6.1] service_levels: increase timeout of internal queries and update cache on startup' from Michael Litvak
Backport of two service level related fixes:

service/qos/service_level_controller: update cache on startup
Fixes scylladb/scylladb#21763
Parent PR: scylladb/scylladb#21773

service/qos: increase timeout of internal get_service_levels queries
Fixes scylladb/scylladb#20483
Parent PR: scylladb/scylladb#21748

Closes scylladb/scylladb#21889

* github.com:scylladb/scylladb:
  service/qos/service_level_controller: update cache on startup
  service/qos: increase timeout of internal get_service_levels queries
2024-12-17 11:21:19 +01:00
Yaron Kaikov
dae61b51a8 github: check if PR is closed instead of merge
In Scylla, we can have either `closed` or `merged` PRs. Based on that we decide when to start the backport process when the label was added after the PR is closed (or merged),

In https://github.com/scylladb/scylladb/pull/21876 even when adding the proper backport label didn't trigger the backport automation. Https://github.com/scylladb/scylladb/pull/21809/ caused this, we should have left the `state=closed` (this includes both closed and merged PR)

Fixing it

Closes scylladb/scylladb#21906

(cherry picked from commit b4b7617554)

Closes scylladb/scylladb#21921
2024-12-16 14:08:03 +02:00
Anna Stuchlik
8b3f5d277b doc: remove wrong image upgrade info (5.2-to-2023.1)
This commit removes the information about the recommended way of upgrading
ScyllaDB images - by updating ScyllaDB and OS packages in one step. This upgrade
procedure is not supported (it was implemented, but then reverted).

Refs https://github.com/scylladb/scylladb/issues/15733

Closes scylladb/scylladb#21876
Fixes https://github.com/scylladb/scylla-enterprise/issues/5041
Fixes https://github.com/scylladb/scylladb/issues/21898

(cherry picked from commit 98860905d8)
2024-12-12 15:23:30 +02:00
Michael Litvak
39186f76c7 service/qos/service_level_controller: update cache on startup
Update the service level cache in the node startup sequence, after the
service level and auth service are initialized.

The cache update depends on the service level data accessor being set
and the auth service being initialized. Before the commit, it may happen that a
cache update is not triggered after the initialization. The commit adds
an explicit call to update the cache where it is guaranteed to be ready.

Fixes scylladb/scylladb#21763

Closes scylladb/scylladb#21773

(cherry picked from commit 373855b493)
2024-12-11 16:55:38 +02:00
Michael Litvak
93e3e256c1 service/qos: increase timeout of internal get_service_levels queries
The function get_service_levels is used to retrieve all service levels
and it is called from multiple different contexts.
Importantly, it is called internally from the context of group0 state reload,
where it should be executed with a long timeout, similarly to other
internal queries, because a failure of this function affects the entire
group0 client, and a longer timeout can be tolerated.
The function is also called in the context of the user command LIST
SERVICE LEVELS, and perhaps other contexts, where a shorter timeout is
preferred.

The commit introduces a function parameter to indicate whether the
context is internal or not. For internal context, a long timeout is
chosen for the query. Otherwise, the timeout is shorter, the same as
before. When the distinction is not important, a default value is
chosen which maintains the same behavior.

The main purpose is to fix the case where the timeout is too short and causes
a failure that propagates and fails the group0 client.

Fixes scylladb/scylladb#20483

Closes scylladb/scylladb#21748

(cherry picked from commit 53224d90be)
2024-12-11 15:23:53 +02:00
Tomasz Grabiec
6ce18dca32 Merge '[Backport 6.1] utils: cached_file: Mark permit as awaiting on page miss' from ScyllaDB
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO (relies on https://github.com/scylladb/scylladb/pull/20522).
But there is IO during promoted index search (via cached_file).

Before the patch this workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or partition index IO involved
because that IO will signal read concurrency semaphore to invite more concurrency.

Fixes #21325

(cherry picked from commit 868f5b59c4)

(cherry picked from commit 0f2101b055)

 Refs #21323

Closes scylladb/scylladb#21359

* github.com:scylladb/scylladb:
  utils: cached_file: Mark permit as awaiting on page miss
  utils: cached_file: Push resource_unit management down to cached_file
2024-12-09 22:32:01 +01:00
Tomasz Grabiec
86ebca4621 utils: cached_file: Mark permit as awaiting on page miss
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO. But there is IO during
promoted index search (via cached_file). Before the patch this
workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or index
file IO involved because that IO will signal read concurrency
semaphore to invite more concurrency.

(cherry picked from commit 0f2101b055)
2024-12-09 17:45:04 +01:00
Tomasz Grabiec
6e2f5c2bd9 utils: cached_file: Push resource_unit management down to cached_file
It saves us permit operations on the hot path when we hit in cache.

Also, it will lay the ground for marking the permit as awaiting later.

(cherry picked from commit 868f5b59c4)
2024-12-09 17:45:02 +01:00
Kefu Chai
b7dcf7420a github: do not nest ${{}} inside condition
In commit 2596d157, we added a condition to run auto-backport.py only
when the GitHub Action is triggered by a push to the default branch.
However, this introduced an unexpected error due to incorrect condition
handling.

Problem:
- `github.event.before` evaluates to an empty string
- GitHub Actions' single-pass expression evaluation system causes
  the step to always execute, regardless of `github.event_name`

Despite GitHub's documentation suggesting that ${{ }} can be omitted,
it recommends using explicit ${{}} expressions for compound conditions.

Changes:
- Use explicit ${{}} expression for compound conditions
- Avoid string interpolation in conditional statements

Root Cause:
The previous implementation failed because of how GitHub Actions
evaluates conditional expressions, leading to an unintended script
execution and a 404 error when attempting to compare commits.

Example Error:

```
  python .github/scripts/auto-backport.py --repo scylladb/scylladb --base-branch refs/heads/master --commits ..2b07d93beac7bc83d955dadc20ccc307f13f20b6
  shell: /usr/bin/bash -e {0}
  env:
    DEFAULT_BRANCH: master
    GITHUB_TOKEN: ***
Traceback (most recent call last):
  File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 201, in <module>
    main()
  File "/home/runner/work/scylladb/scylladb/.github/scripts/auto-backport.py", line 162, in main
    commits = repo.compare(start_commit, end_commit).commits
  File "/usr/lib/python3/dist-packages/github/Repository.py", line 888, in compare
    headers, data = self._requester.requestJsonAndCheck(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 353, in requestJsonAndCheck
    return self.__check(
  File "/usr/lib/python3/dist-packages/github/Requester.py", line 378, in __check
    raise self.__createException(status, responseHeaders, output)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/commits/commits#compare-two-commits", "status": "404"}
```

Fixes scylladb/scylladb#21808
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21809

(cherry picked from commit e04aca7efe)

Closes scylladb/scylladb#21819
2024-12-06 16:35:30 +02:00
Avi Kivity
4be2d3d6c0 Merge 'compaction: update maintenance sstable set on scrub compaction completion' from Lakshmi Narayanan Sreethar
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.

Fixes #20030

This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.

Closes scylladb/scylladb#21582

* github.com:scylladb/scylladb:
  compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
  compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
  compaction_group: rename `update_main_sstable_list_on_compaction_completion`
  compaction_group: update maintenance sstable set on scrub compaction completion
  compaction_group: store table::sstable_list_builder::result in replacement_desc
  table::sstable_list_builder: remove old sstables only from current list
  table::sstable_list_builder: return removed sstables from build_new_list

(cherry picked from commit 58baeac0ad)

Closes scylladb/scylladb#21789
2024-12-06 10:37:23 +02:00
Michael Pedersen
2bbb0859e1 docs: correct the storage size for n2-highmem-32 to 9000GB
updated storage size for n2-highmem-32 to 9000GB as this is default in SC

Fixes scylladb/scylladb#21785
Closes scylladb/scylladb#21537

(cherry picked from commit 309f1606ae)

Closes scylladb/scylladb#21594
2024-12-05 09:51:51 +02:00
Avi Kivity
4c6ddcf6c1 Merge 'sstables: Fix use-after-free on page cache buffer when parsing promoted index entries across pages' from Tomasz Grabiec
This fixes a use-after-free bug when parsing clustering key across
pages.

Also includes a fix for allocating section retry, which is potentially not safe (not in practice yet).

Details of the first problem:

Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.

To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.

In 93482439, the promoted index cursor was optimized to avoid
fully page copy when parsing index blocks. Instead, parser is
given a temporary_buffer which is a view on the page.

A bit earlier, in b1b5bda, the parser was changed to keep shared
fragments of the buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.

If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.

The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.

This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.

Details of the solution:

We adapt page_view to a temporary_buffer-like API. For this, a new concept
is introduced called ContiguousSharedBuffer. We also change parsers so that
they can be templated on the type of the buffer they work with (page_view vs
temporary_buffer). This way we don't introduce indirection to existing algorithms.

We use page_view instead of temporary_buffer in the promoted
index parser which works with page cache buffers. page_view can be safely
shared via share() and stored across allocating sections. It keeps hold to the
LSA buffer even across allocating sections by the means of cached_file::page_ptr.

Fixes #20766

Closes scylladb/scylladb#20837

* github.com:scylladb/scylladb:
  sstables: bsearch_clustered_cursor: Add trace-level logging
  sstables: bsearch_clustered_cursor: Move definitions out of line
  test, sstables: Verify parsing stability when allocating section is retried
  test, sstables: Verify parsing stability when buffers cross page boundary
  sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
  cached_file: Adapt page_view to ContiguousSharedBuffer
  cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
  sstables, utils: Allow parsers to work with different buffer types
  sstables: promoted_index_block_parser: Make reset() always bring parser to initial state
  sstables: bsearch_clustered_cursor: Switch read_block_offset() to use the read() method
  sstables: bsearch_clustered_cursor: Fix parsing when allocating section is retried

(cherry picked from commit fb8743b2d6)

Closes scylladb/scylladb#20906
2024-12-05 09:50:07 +02:00
Tomasz Grabiec
159c1b0847 utils: UUID: Make get_time_UUID() respect the clock offset
schema_change_test currently fails due to failure to start a cql test
env in unit tests after the point where this is called (in one of the
test cases):

   forward_jump_clocks(std::chrono::seconds(60*60*24*31));

The problem manifests with a failure to join the cluster due to
missing_column exception ("missing_column: done") being thrown from
system_keyspace::get_topology_request_state(). It's a symptom of
join request being missing in system.topology_requests. It's missing
because the row is expired.

When request is created, we insert the
mutations with intended TTL of 1 month. The actual TTL value is
computed like this:

  ttl_opt topology_request_tracking_mutation_builder::ttl() const {
      return std::chrono::duration_cast<std::chrono::seconds>(std::chrono::microseconds(_ts)) + std::chrono::months(1)
          - std::chrono::duration_cast<std::chrono::seconds>(gc_clock::now().time_since_epoch());
  }

_ts comes from the request_id, which is supposed to be a timeuuid set
from current time when request starts. It's set using
utils::UUID_gen::get_time_UUID(). It reads the system clock without
adding the clock offset, so after forward_jump_clocks(), _ts and
gc_clock::now() may be far off. In some cases the accumulated offset
is larger than 1month and the ttl becomes negative, causing the
request row to expire immediately and failing the boot sequence.

The fix is to use db_clock, which respects offsets and is consistent
with gc_clock.

The test doesn't fail in CI becuase there each test case runs in a
separate process, so there is no bootstrap attempt (by new cql test
env) after forward_jump_clocks().

Closes scylladb/scylladb#21558

(cherry picked from commit 1d0c6aa26f)

Closes scylladb/scylladb#21583

Fixes #21581
2024-12-04 14:19:47 +01:00
Aleksandra Martyniuk
b487931396 repair: implement tablet_repair_task_impl::release_resources
tablet_repair_task_impl keeps a vector of tablet_repair_task_meta,
each of which keeps an effective_replication_map_ptr. So, after
the task completes, the token metadata version will not change for
task_ttl seconds.

Implement tablet_repair_task_impl::release_resources method that clears
tablet_repair_task_meta vector when the task finishes.

Set task_ttl to 1h in test_tablet_repair to check whether the test
won't time out.

Fixes: #21503.

Closes scylladb/scylladb#21504

(cherry picked from commit 572b005774)

Closes scylladb/scylladb#21621
2024-12-04 13:58:16 +02:00
Kefu Chai
cf71fd3977 test: topology_custom: ensure node visibility before keyspace creation
Building upon commit 69b47694, this change addresses a subtle synchronization
weakness in node visibility checks during recovery mode testing.

Previous Approach:
- Waited only for the first node to see its peers
- Insufficient to guarantee full cluster consistency

Current Solution:
1. Implement comprehensive node visibility verification
2. Ensure all nodes mutually recognize each other
3. Prevent potential schema propagation race conditions

Key Improvements:
- Robust cluster state validation before keyspace creation
- Eliminate partial visibility scenarios

Fixes scylladb/scylladb#21724

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21726

(cherry picked from commit 65949ce607)

Closes scylladb/scylladb#21733
2024-12-04 13:57:55 +02:00
André LFA
9cd356d66c Update report-scylla-problem.rst removing references to old Health Check Report
Closes scylladb/scylladb#21467
Fixes scylladb/scylladb#21599

(cherry picked from commit 703e6f3b1f)

Closes scylladb/scylladb#21590
2024-12-04 13:55:31 +02:00
Jenkins Promoter
dd9dcb28a3 Update ScyllaDB version to: 6.1.5 2024-12-01 15:58:56 +02:00
Botond Dénes
3771405482 Merge 'repair: fix task_manager_module::abort_all_repairs' from Aleksandra Martyniuk
Currently, task_manager_module::abort_all_repairs marks top-level repairs as aborted (but does not abort them) and aborts all existing shard tasks.

A running repair checks whether its id isn't contained in _aborted_pending_repairs and then proceeds to create shard tasks. If abort_all_repairs is executed after _aborted_pending_repairs is checked but before shard tasks are created, then those new tasks won't be aborted. The issue is the most severe for tablet_repair_task_impl that checks the _aborted_pending_repairs content from different shards, that do not see the top-level task. Hence the repair isn't stopped but it creates shard repair tasks on all shards but the one that initialized repair.

Abort top-level tasks in abort_all_repairs. Fix the shard on which the task abort is checked.

Fixes: #21612.

Needs backport to 6.1 and 6.2 as they contain the bug.

Closes scylladb/scylladb#21616

* github.com:scylladb/scylladb:
  test: add test to check if repair is properly aborted
  repair: add shard param to task_manager_module::is_aborted
  repair: use task abort source to abort repair
  repair: drop _aborted_pending_repairs and utilize tasks abort mechanism
  repair: fix task_manager_module::abort_all_repairs

(cherry picked from commit 5ccbd500e0)

Closes scylladb/scylladb#21641
2024-11-25 11:01:12 +02:00
Nadav Har'El
506b366e5d alternator: fix "/localnodes" to not return down nodes
Alternator's "/localnodes" HTTP requests is supposed to return the list
of nodes in the local DC to which the user can send requests.

Before commit bac7c33313 we used the
gossiper is_alive() method to determine if a node should be returned.
That commit changed the check to is_normal() - because a node can be
alive but in non-normal (e.g., joining) state and not ready for
requests.

However, it turns out that checking is_normal() is not enough, because
if node is stopped abruptly, other nodes will still consider it "normal",
but down (this is so-called "DN" state). So we need to check **both**
is_alive() and is_normal().

This patch also adds a test reproducing this case, where a node is
shut down abruptly. Before this patch, the test failed ("/localnodes"
continued to return the dead node), and after it it passes.

Fixes #21538

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#21540

(cherry picked from commit 7607f5e33e)

Closes scylladb/scylladb#21633
2024-11-21 08:50:44 +02:00
Anna Stuchlik
f2bed0f362 doc: add the 6.0-to-2024.2 upgrade guide-from-6
This commit adds an upgrade guide from ScyllDB 6.0
to ScyllaDB Enterprise 2024.2.

Fixes https://github.com/scylladb/scylladb/issues/20063
Fixes https://github.com/scylladb/scylladb/issues/20062
Refs https://github.com/scylladb/scylla-enterprise/issues/4544

(cherry picked from commit 3d4b7e41ef)

Closes scylladb/scylladb#21619
2024-11-18 17:22:12 +02:00
Raphael S. Carvalho
b0bb40e8d4 replica: Fix schema change during migration cleanup
During migration cleanup, there's a small window in which the storage
group was stopped but not yet removed from the list. So concurrent
operations traversing the list could work with stopped groups.

During a test which emitted schema changes during migrations,
a failure happened when updating the compaction strategy of a table,
but since the group was stopped, the compaction manager was unable
to find the state for that group.

In order to fix it, we'll skip stopped groups when traversing the
list since they're unused at this stage of migration and going away
soon.

Fixes #20699.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b8d6f864bc)

Closes scylladb/scylladb#21203
2024-11-15 10:40:21 +02:00
Calle Wilund
5058d6af41 cql_test_env/gossip: Prevent double shutdown call crash
Fixes #21159

When an exception is thrown in sstable write etc such that
storage_manager::isolate is initiated, we start a shutdown chain
for message service, gossip etc. These are synced (properly) in
storage_manager::stop, but if we somehow call gossiper::shutdown
outside the normal service::stop cycle, we can end up running the
method simultaneously, intertwined (missing the guard because of
the state change between check and set). We then end up co_awaiting
an invalid future (_failure_detector_loop_done) - a second wait.

Fixed by
a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added
    in 20496ed, ages ago. However, it should not be needed nowadays.
b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure.

(cherry picked from commit c28a5173d9)

Closes scylladb/scylladb#21394
2024-11-15 10:40:04 +02:00
Emil Maskovsky
730d39df40 test/topology_custom: fix the flaky test_raft_recovery_stuck
The test is only sending a subset of the running servers for the rolling
restart. The rolling restart is checking the visibility of the restarted
node agains the other nodes, but if that set is incomplete some of the
running servers might not have seen the restarted node yet.

Improved the manager client rolling restart method to consider all the
running nodes for checking the restarted node visibility.

Fixes: scylladb/scylladb#19959

Closes scylladb/scylladb#21477

(cherry picked from commit 92db2eca0b)

Closes scylladb/scylladb#21555
2024-11-15 10:39:18 +02:00
Botond Dénes
78ad345f7f Merge 'scylla_raid_setup: fix failure on SELinux package installation' from Takuya ASADA
After merged 5a470b2bfb, we found that scylla_raid_setup fails on offline mode
installation.
This is because pkg_install() just print error and exit script on offline mode, instead of installing packages since offline mode not supposed able to connect
internet.
Seems like it occur because of missing "policycoreutils-python-utils"
package, which is the package for "semange" command.
So we need to implement the relabeling patch without using the command.

Fixes https://github.com/scylladb/scylladb/issues/21441

Also, since Amazon Linux 2 has different package name for semange, we need to
adjust package name.

Fixes https://github.com/scylladb/scylladb/issues/21351

Closes scylladb/scylladb#21474

* github.com:scylladb/scylladb:
  scylla_raid_setup: support installing semanage on Amazon Linux 2
  scylla_raid_setup: fix failure on SELinux package installation

(cherry picked from commit 1c212df62d)

Closes scylladb/scylladb#21546
2024-11-14 15:57:47 +02:00
Botond Dénes
4610dde4da streaming: stream-session: switch to tracking permit
The stream-session is the receiving end of streaming, it reads the
mutation fragment stream from an RPC stream and writes it onto the disk.
As such, this part does no disk IO and therefore, using a permit with
count resources is superfluous. Furthermore, after
d98708013c, the count resources on this
permit can cause a deadlock on the receiver end, via the
`db::view::check_view_update_path()`, which wants to read the content of
a system table and therefore has to obtain a permit of its own.

Switch to a tracking-only permit, primarily to resolve the deadlock, but
also because admission is not necessary for a read which does no IO.

Refs: scylladb/scylladb#20885 (partial fix, solves only one of the deadlocks)
Fixes: scylladb/scylladb#21264
Fixes: scylladb/scylladb#21570

Closes scylladb/scylladb#21059

(cherry picked from commit 7c75fc599f)

Closes scylladb/scylladb#21571
2024-11-14 12:45:03 +02:00
Botond Dénes
ecb9cb374e Merge '[Backport 6.1] compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors' from ScyllaDB
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them by continue with shutdown.

stop_ongoing_compactions, in particular, currently returns the status
of stopped compaction tasks from `stop_tasks`, but still all tasks
must be stopped after it, even if they failed, so assert that
and ignore the errors.

Fixes scylladb/scylladb#21159

* Needs backport to 6.2 and 6.1, as commit 8cc99973eb causes handles storage that might cause compaction tasks to fail and eventually terminate on shudown when the exceptions are thrown in noexcept context in the deferred stop destructor body

(cherry picked from commit e942c074f2)

(cherry picked from commit d8500472b3)

(cherry picked from commit c08ba8af68)

(cherry picked from commit a7a55298ea)

(cherry picked from commit 6cce67bec8)

 Refs #21299

Closes scylladb/scylladb#21435

* github.com:scylladb/scylladb:
  compaction_manager: stop: await _stop_future if engaged
  compaction_manager: really_do_stop:  assert that no tasks are left behind
  compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
  compaction/compaction_manager: stop_tasks(): unlink stopped tasks
  compaction/compaction_manager: make _tasks an intrusive list
2024-11-14 07:00:28 +02:00
Benny Halevy
5f9b3b08f4 compaction_manager: stop: await _stop_future if engaged
The current condition that consults the compaction manager
state for awaiting `_stop_future` works since _stop_future
is assigned after the state is set to `stopped`, but it is
incidental.  What matters is that `_stop_future` is engaged.

While at it, exchange _stop_future with a ready future
so that stop() can be safely called multiple times.
And dropped the superfluous co_return.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6cce67bec8)
2024-11-12 15:21:04 +02:00
Benny Halevy
fe03c9b724 compaction_manager: really_do_stop: assert that no tasks are left behind
stop_ongoing_compactions now ignores any errors returned
by tasks, and it should leave no task left behind.
Assert that here, before the compaction_manager is destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a7a55298ea)
2024-11-12 15:21:00 +02:00
Benny Halevy
cbddf18727 compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors
stop() methods, like destructors must always succeed,
and returning errors from them is futile as there is
nothing else we can do with them but continue with shutdown.

Leaked errors on the stop path may cause termination
on shutdown, when called in a deferred action destructor.

Fixes scylladb/scylladb#21298

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c08ba8af68)
2024-11-12 15:14:21 +02:00
Botond Dénes
2a32e2ae82 compaction/compaction_manager: stop_tasks(): unlink stopped tasks
Stopped tasks currently linger in _tasks until the fiber that created
the task is scheduled again and unlinks the task. This window between
stop and remove prevents reliable checks for empty _tasks list after all
tasks are stopped.
Unlink the task early so really_do_stop() can safely check for an empty
_tasks list (next patch).

(cherry picked from commit d8500472b3)
2024-11-12 15:13:32 +02:00
Botond Dénes
d63b9efa7e compaction/compaction_manager: make _tasks an intrusive list
_tasks is currently std::list<shared_ptr<compaction_task_executor>>, but
it has no role in keeping the instances alive, this is done by the
fibers which create the task (and pin a shared ptr instance).
This lends itself to an intrusive list, avoiding that extra
allocation upon push_back().
Using an intrusive list also makes it simpler and much cheaper (O(1) vs.
O(N)) to remove tasks from the _tasks list. This will be made use of in
the next patch.

Code using _task has to be updated because the value_type changes from
shared_ptr<compaction_task_executor> to compaction_task_executor&.

(cherry picked from commit e942c074f2)
2024-11-12 11:42:34 +02:00
Yaron Kaikov
a1fea6b225 ./github/workflows/add-label-when-promoted.yaml: Run auto-backport only on default branch
In https://github.com/scylladb/scylladb/pull/21496#event-15221789614
```
scylladbbot force-pushed the backport/21459/to-6.1 branch from 414691c to 59a4ccd Compare 2 days ago
```

Backport automation triggered by `push` but also should either start from `master` branch (or `enterprise` branch from Enterprise), we need to verify it by checking also the default branch.

Fixes: https://github.com/scylladb/scylladb/issues/21514

Closes scylladb/scylladb#21515

(cherry picked from commit 2596d1577b)

Closes scylladb/scylladb#21530
2024-11-11 17:44:41 +02:00
Michał Chojnowski
04b3d96259 mvcc_test: fix a benign failure of test_apply_to_incomplete_respects_continuity
For performance reasons, mutation_partition_v2::maybe_drop(), and by extension
also mutation_partition_v2::apply_monotonically(mutation_partition_v2&&)
can evict empty row entries, and hence change the continuity of the merged
entry.

For checking that apply_to_incomplete respects continuity,
test_apply_to_incomplete_respects_continuity obtains the continuity of
the partition entry before and after apply_to_incomplete by calling
e.squashed().get_continuity(). But squashed() uses apply_monotonically(),
so in some circumstances the result of squashed() can have smaller
continuity than the argument of squashed(), which messes with the thing
that the test is trying to check, and causes spurious failures.

This patch changes the method of calculating the continuity set,
so that it matches the entry exactly, fixing the test failures.

Fixes scylladb/scylladb#13757

Closes scylladb/scylladb#21459

(cherry picked from commit 35921eb67e)

Closes scylladb/scylladb#21496
2024-11-08 15:33:20 +01:00
Yaron Kaikov
236b235a89 .github/scripts/auto-backport.py: update method to get closed prs
`commit.get_pulls()` in PyGithub returns pull requests that are directly associated with the given commit

Since in closed PR. the relevant commit is an event type, the backport
automation didn't get the PR info for backporting

Ref: https://github.com/scylladb/scylladb/issues/18973

Closes scylladb/scylladb#21468

(cherry picked from commit ef104b7b96)

Closes scylladb/scylladb#21482
2024-11-08 10:26:44 +02:00
Yaron Kaikov
3ddb61c90e .github/script/auto-backport.py: push backport PR to scylladbbot fork
Since Scylla is a public repo, when we create a fork, it doesn't fork the team and permissions (unlike private repos where it does).

When we have a backport PR with conflicts, the developers need to be able to update the branch to fix the conflicts. To do so, we modified the logic of the backport automation as follows:

- Every backport PR (with and without conflicts) will be open directly on the `scylladbbot` fork repo
- When there are conflicts, an email will be sent to the original PR author with an invitation to become a contributor in the `scylladbbot` fork with `push` permissions. This will happen only once if Auther is not a contributor.
- Together with sending the invite, all backport labels will be removed and a comment will be added to the original PR with instructions
- The PR author must add the backport labels after the invitation is accepted

Fixes: https://github.com/scylladb/scylladb/issues/18973

Closes scylladb/scylladb#21401

(cherry picked from commit 77604b4ac7)

Closes scylladb/scylladb#21465
2024-11-07 15:05:56 +02:00
Yaron Kaikov
160823ccaf github: add script for backports automation instead of Mergify
Adding an auto-backport.py script to handle backport automation instead of Mergify.

The rules of backport are as follows:

* Merged or Closed PRs with any backport/x.y label (one or more) and promoted-to-master label
* Backport PR will be automatically assigned to the original PR author
* In case of conflicts the backport PR will be open in the original autoor fork in draft mode. This will give the PR owner the option to resolve conflicts and push those changes to the PR branch (Today in Scylla when we have conflicts, the developers are forced to open another PR and manually close the backport PR opened by Mergify)
* Fixing cherry-pick the wrong commit SHA. With the new script, we always take the SHA from the stable branch
* Support backport for enterprise releases (from Enterprise branch)

Fixes: https://github.com/scylladb/scylladb/issues/18973
(cherry picked from commit f9e171c7af)

Closes scylladb/scylladb#21470
2024-11-07 06:58:16 +02:00
Jenkins Promoter
9ff31c6c4e Update ScyllaDB version to: 6.1.4 2024-11-06 16:08:17 +02:00
Botond Dénes
6a66faab41 Merge '[Backport 6.1] repair: Fix finished ranges metrics for removenode' from ScyllaDB
The skipped ranges should be multiplied by the number of tables

Otherwise the finished ranges ratio will not reach 100%.

Fixes #21174

(cherry picked from commit cffe3dc49f)

(cherry picked from commit 1392a6068d)

(cherry picked from commit 9868ccbac0)

 Refs #21252

Closes scylladb/scylladb#21314

* github.com:scylladb/scylladb:
  test: Add test_node_ops_metrics.py
  repair: Make the ranges more consistent in the log
  repair: Fix finished ranges metrics for removenode
2024-11-05 09:44:29 +02:00
Tzach Livyatan
c1e42cacac Update os-support-info.rst - add CentOS
ScyllaDB support RHEL 9 and derivatives, including CentOS 9.

Fix https://github.com/scylladb/scylladb/issues/21309

(cherry picked from commit 1878af9399)

Closes scylladb/scylladb#21333
2024-11-05 09:43:51 +02:00
Benny Halevy
baa4d1a6e7 compaction_manager: compaction_disabled: return true if not in compaction_state
When a compaction_group is removed via `compaction_manager::remove`,
it is erase from `_compaction_state`, and therefore compaction
is definitely not enabled on it.

This triggers an internal error if tablets are cleaned up
during drop/truncate, which checks that compaction is disabled
in all compaction groups.

Note that the callers of `compaction_disabled` aren't really
interested in compaction being actively disabled on the
compaction_group, but rather if it's enabled or not.
A follow-up patch can be consider to reverse the logic
and expose `compaction_enabled` rather than `compaction_disabled`.

Fixes scylladb/scylladb#20060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 78ceaeabca)

Closes scylladb/scylladb#21405
2024-11-05 09:42:01 +02:00
Kamil Braun
b057168dd0 Merge '[Backport 6.1] cql/tablets: fix retrying ALTER tablets KEYSPACE' from Marcin Maliszkiewicz
ALTER tablets-enabled KEYSPACES (KS) may fail due to
group0_concurrent_modification, in which case it's repeated by a for
loop surrounding the code. But because raft's add_entry consumes the
raft's guard (by std::move'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned for loop altogether and rethrow the exception, as the rf_change event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
Note: refactor is implemented in the follow-up commit.

Fixes: https://github.com/scylladb/scylladb/issues/21102

Should be backported to every 6.x branch, as it may lead to a crash.

(cherry picked from commit de511f56ac)

(cherry picked from commit 3f4c8a30e3)

(cherry picked from commit 522bede8ec)

Refs https://github.com/scylladb/scylladb/pull/21121

Closes scylladb/scylladb#21340

* github.com:scylladb/scylladb:
  test: topology: add disable_schema_agreement_wait utility function
  test: add UT to test retrying ALTER tablets KEYSPACE
  cql/tablets: fix indentation in `rf_change` event handler
  cql/tablets: fix retrying ALTER tablets KEYSPACE
2024-11-04 12:23:47 +01:00
Benny Halevy
7dbe39a9a5 storage_service: on_change: update_peer_info only if peer info changed
Return an optional peer_info from get_peer_info_for_update
when the `app_state_map` arg does not change peer_info,
so that we can skip calling update_peer_info, if it didn't
change.

Fixes scylladb/scylladb#20991
Refs scylladb/scylladb#16376

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21152

(cherry picked from commit 04d741bcbb)
2024-11-04 11:44:05 +02:00
Tomasz Grabiec
eec3e22c6a node-exporter: Disable hwmon collector
This collector reads nvme temperature sensor, which was observed to
cause bad performance on Azure cloud following the reading of the
sensor for ~6 seconds. During the event, we can see elevated system
time (up to 30%) and softirq time. CPU utilization is high, with
nvm_queue_rq taking several orders of magnitude more time than
normally. There are signs of contention, we can see
__pv_queued_spin_lock_slowpath in the perf profile, called. This
manifests as latency spikes and potentially also throughput drop due
to reduced CPU capacity.

By default, the monitoring stack queries it once every 60s.

(cherry picked from commit 93777fa907)

Closes scylladb/scylladb#21305
2024-10-31 14:05:38 +01:00
Marcin Maliszkiewicz
7d87f744ea test: topology: add disable_schema_agreement_wait utility function
Code extracted from fa45fdf5f7 as it's being used by
test_alter_tablets_keyspace_concurrent_modification and we're
backporting it.
2024-10-30 16:57:19 +01:00
Piotr Smaron
d8e36873cf test: add UT to test retrying ALTER tablets KEYSPACE
The newly added testcase is based on the already existing
`test_alter_dropped_tablets_keyspace`.
A new error injection is created, which stops the ALTER execution just
before the changes are submitted to RAFT. In the meantime, a new schema
change is performed using the 2nd node in the cluster, thus causing the
1st node to retry the ALTER statement.

(cherry picked from commit 522bede8ec)
2024-10-30 16:49:33 +01:00
Piotr Smaron
1dddd2a8ca cql/tablets: fix indentation in rf_change event handler
Just moved the code that previously was under a `for` loop by 1 tab, i.e. 4 spaces, to the left.

(cherry picked from commit 3f4c8a30e3)
2024-10-30 16:49:33 +01:00
Piotr Smaron
ab333f2453 cql/tablets: fix retrying ALTER tablets KEYSPACE
ALTER tablets-enabled KEYSPACES (KS) may fail due to
`group0_concurrent_modification`, in which case it's repeated by a `for`
loop surrounding the code. But because raft's `add_entry` consumes the
raft's guard (by `std::move`'ing the guard object), retries of ALTER KS
will use a moved-from guard object, which is UB, potentially a crash.
The fix is to remove the before mentioned `for` loop altogether and rethrow the exception, as the `rf_change` event
will be repeated by the topology state machine if it receives the
concurrent modification exception, because the event will remain present
in the global requests queue, hence it's going to be executed as the
very next event.
`topology_coordinator::handle_topology_coordinator_error` handling the
case of `group0_concurrent_modification` has been extended with logging
in order not to write catch-log-throw boilerplate.
Note: refactor is implemented in the follow-up commit.

Fixes: scylladb/scylladb#21102
(cherry picked from commit de511f56ac)
2024-10-30 16:49:33 +01:00
Gleb Natapov
0b502a2610 topology coordinator: take a copy of a replication state in raft_topology_cmd_handler
Current code takes a reference and holds it past preemption points. And
while the state itself is not suppose to change the reference may
become stale because the state is re-created on each raft topology
command.

Fix it by taking a copy instead. This is a slow path anyway.

Fixes: scylladb/scylladb#21220
(cherry picked from commit fb38bfa35d)

Closes scylladb/scylladb#21373
2024-10-30 14:12:44 +01:00
Kamil Braun
51f7ff8697 Merge '[Backport 6.1] storage_proxy: Add conditions checking to avoid UB in speculating read executors.' from ScyllaDB
During the investigation of scylladb/scylladb#20282, it was discovered that implementations of speculating read executors have undefined behavior when called with an incorrect number of read replicas. This PR introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in  filter_for_query(): the map is considered incorrect if the list  of replicas contains a node from a data center whose replication factor is 0.

 Please note: This PR does not fix the issue found in scylladb/scylladb#20282;   it only adds condition checks to prevent undefined behavior in cases of  inconsistent inputs.

Refs scylladb/scylladb#20625

As this issue applies to the releases versions and can affect clients, we need backports to 6.0, 6.1, 6.2.

(cherry picked from commit 132358dc92)

(cherry picked from commit ae23d42889)

(cherry picked from commit ad93cf5753)

(cherry picked from commit 8db6d6bd57)

(cherry picked from commit c373edab2d)

 Refs #20851

Closes scylladb/scylladb#21068

* github.com:scylladb/scylladb:
  Add conditions checking for get_read_executor
  Avoid an extra call to block_for in db::filter_for_query.
  Improve code readability in consistency_level.cc and storage_proxy.cc
  tools: Add build_info header with functions providing build type information
  tests: Add tests for alter table with RF=1 to RF=0
2024-10-29 12:32:48 +01:00
Asias He
9fdc596ff7 test: Add test_node_ops_metrics.py
It tests the node_ops_metrics_done metric reaches 100% when a node ops
is done.

Refs: #21174
(cherry picked from commit 9868ccbac0)
2024-10-28 09:54:30 +00:00
Asias He
5a2196b94a repair: Make the ranges more consistent in the log
Consider the number of tables for the number of ranges logging. Make it
more consistent with the log when the ops starts.

(cherry picked from commit 1392a6068d)
2024-10-28 09:54:30 +00:00
Asias He
34cb594dd5 repair: Fix finished ranges metrics for removenode
The skipped ranges should be multiplied by the number of tables.

Otherwise the finished ranges ratio will not reach 100%.

Fixes #21174

(cherry picked from commit cffe3dc49f)
2024-10-28 09:54:30 +00:00
Lakshmi Narayanan Sreethar
91c693bf93 [Backport 6.1] replica/table: check memtable before discarding tombstone during read
On the read path, the compacting reader is applied only to the sstable
reader. This can cause an expired tombstone from an sstable to be purged
from the request before it has a chance to merge with deleted data in
the memtable leading to data resurrection.

Fix this by checking the memtables before deciding to purge tombstones
from the request on the read path. A tombstone will not be purged if a
key exists in any of the table's memtables with a minimum live timestamp
that is lower than the maximum purgeable timestamp.

Fixes #20916

`perf-simple-query` stats before and after this fix :

`build/Dev/scylla perf-simple-query --smp=1 --flush` :
```
// Before this Fix
// ---------------
94941.79 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59393 insns/op,   24029 cycles/op,        0 errors)
97551.14 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59376 insns/op,   23966 cycles/op,        0 errors)
96599.92 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59367 insns/op,   23998 cycles/op,        0 errors)
97774.91 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59370 insns/op,   23968 cycles/op,        0 errors)
97796.13 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59368 insns/op,   23947 cycles/op,        0 errors)

         throughput: mean=96932.78 standard-deviation=1215.71 median=97551.14 median-absolute-deviation=842.13 maximum=97796.13 minimum=94941.79
instructions_per_op: mean=59374.78 standard-deviation=10.78 median=59369.59 median-absolute-deviation=6.36 maximum=59393.12 minimum=59367.02
  cpu_cycles_per_op: mean=23981.67 standard-deviation=32.29 median=23967.76 median-absolute-deviation=16.33 maximum=24029.38 minimum=23947.19

// After this Fix
// --------------
95313.53 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59392 insns/op,   24058 cycles/op,        0 errors)
97311.48 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59375 insns/op,   24005 cycles/op,        0 errors)
98043.10 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59381 insns/op,   23941 cycles/op,        0 errors)
96750.31 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59396 insns/op,   24025 cycles/op,        0 errors)
93381.21 tps ( 71.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   59390 insns/op,   24097 cycles/op,        0 errors)

         throughput: mean=96159.93 standard-deviation=1847.88 median=96750.31 median-absolute-deviation=1151.55 maximum=98043.10 minimum=93381.21
instructions_per_op: mean=59386.60 standard-deviation=8.78 median=59389.55 median-absolute-deviation=6.02 maximum=59396.40 minimum=59374.73
  cpu_cycles_per_op: mean=24025.13 standard-deviation=58.39 median=24025.17 median-absolute-deviation=32.67 maximum=24096.66 minimum=23941.22
```

This PR fixes a regression introduced in ce96b472d3 and should be backported to older versions.

Closes scylladb/scylladb#20985

* github.com:scylladb/scylladb:
  topology-custom: add test to verify tombstone gc in read path
  replica/table: check memtable before discarding tombstone during read
  compaction_group: track maximum timestamp across all sstables

(cherry picked from commit 519e167611)

Backported from #20985 to 6.1.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#21250
2024-10-25 11:13:54 +03:00
Piotr Dulikowski
77f0533a01 SCYLLA-VERSION-GEN: correct the logic for skipping SCYLLA-*-FILE
The SCYLLA-VERSION-GEN file skips updating the SCYLLA-*-FILE files if
the commit hash from SCYLLA-RELEASE-FILE is the same. The original
reason for this was to prevent the date in the version string from
changing if multiple modes are built across midnight
(scylladb/scylla-pkg#826). However - intentionally or not - it serves
another purpose: it prevents an infinite loop in the build process.

If the build.ninja file needs to be rebuilt, the configure.py script
unconditionally calls ./SCYLLA-VERSION-GEN. On the other hand, if one
of the SCYLLA-*-FILE files is updated then this triggers rebuild
of build.ninja. Apparently, this is sufficient for ninja to enter an
infinite loop.

However, the check assumes that the RELEASE is in the format

  <build identifier>.<date>.<commit hash>

and assumes that none of the components have a dot inside - otherwise it
breaks and just works incorrectly. Specifically, when building a private
version, it is recommended to set the build identifier to
`count.yourname`.

Previously, before 85219e9, this problem wasn't noticed most likely
because reconfigure process was broken and stopped overwriting
the build.ninja file after the first iteration.

Fix the problem by fixing the logic that extracts the commit hash -
instead of looking at the third dot-separated field counting from the
left side, look at the last field.

Fixes: scylladb/scylladb#21027
(cherry picked from commit 64ca58125e)

Closes scylladb/scylladb#21104
2024-10-25 11:09:51 +03:00
Benny Halevy
145230e032 storage_service: rebuild: warn about tablets-enabled keyspaces
Until we automatically support rebuild for tablets-enabled
keyspaces, warn the user about them.

The reason this is not an error, is that after
increasing RF in a new datacenter, the current procedure
is to run `nodetool rebuild` on all nodes in that dc
to rebuild the new vnode replicas.
This is not required for tablets, since the additional
replicas are rebuilt automatically as part of ALTER KS.

However, `nodetool rebuild` is also run after local
data loss (e.g. due to corruption and removal of sstables).
In this case, rebuild is not supported for tablets-enabled
keyspaces, as tablet replicas that had lost data may have
already been migrated to other nodes, and rebuilding the
requested node will not know about it.
It is advised to repair all nodes in the datacenter instead.

Refs scylladb/scylladb#17575

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ed1e9a1543)

Closes scylladb/scylladb#20723
2024-10-25 11:06:38 +03:00
Tomasz Grabiec
39c1a448f6 Merge '[Backport 6.1] replica: Fix tombstone GC during tablet split preparation' from Raphael Raph Carvalho
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:

sstable B is split first, and moved from main (unsplit) group to a
split-ready group
now compaction runs in split-ready group before sstable A is split
tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes https://github.com/scylladb/scylladb/issues/20044.

Please replace this line with justification for the backport/* labels added to this PR
Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed.

(cherry picked from commit bcd358595f)

(cherry picked from commit 93815e0649)

Refs https://github.com/scylladb/scylladb/pull/20939

Closes scylladb/scylladb#21205

* github.com:scylladb/scylladb:
  replica: Fix tombstone GC during tablet split preparation
  service: Improve error handling for split
2024-10-23 11:41:36 +02:00
Botond Dénes
03f370e971 Merge '[Backport 6.1] Check system.tablets update before putting it into the table' from ScyllaDB
Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load.

fixes #20043

(cherry picked from commit f09fe4f351)

(cherry picked from commit e5bf376cbc)

(cherry picked from commit 1863ccd900)

 Refs #21020

Closes scylladb/scylladb#21110

* github.com:scylladb/scylladb:
  tablets: Validate system.tablets update
  group0_client: Introduce change validation
  group0_client: Add shared_token_metadata dependency
  replica/tablets: Add to_tablet_metadata_(row_)?key helpers
  replica/tablets: extract tablet_replica_set_from_cell()
2024-10-23 10:02:13 +03:00
Pavel Emelyanov
c52e5a8a87 tablets: Validate system.tablets update
Implement change validation for raft topology_change command. For now
the only check is that the "pending replicas" contains at most one
entry. The check mirrors similar one in `process_one_row` function.

If not passed, this prevents system.tablets from being updated with the
mutation(s) that will not be loaded later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 13:17:00 +03:00
Pavel Emelyanov
337c777635 group0_client: Introduce change validation
Add validate_change() methods (well, a template and an overload) that
are called by prepare_command() and are supposed to validate the
proposed change before it hits persistent storage

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 13:16:56 +03:00
Pavel Emelyanov
881ec8600f group0_client: Add shared_token_metadata dependency
It will be needed later to get tablet_metadata from.
The dependency is "OK", shared_token_metadata is low-level sharded
service. Client already references db::system_keyspace, which in turn
references replica::database which, finally, references token_metadata

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 13:16:52 +03:00
Pavel Emelyanov
4bed029b56 replica/tablets: Add to_tablet_metadata_(row_)?key helpers
Extraceted from larger patch f5976aa87b (replica/tablets: add
get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint())
by Botond. The helpers are needed to decode mutations with tablets
update to validate them later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-22 13:16:47 +03:00
Kefu Chai
751f1fda16 replica/tablets: extract tablet_replica_set_from_cell()
so it can be reused to implement a low-level tool which reads tablets
data from sstables

Refs scylladb/scylladb#16488
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 13:16:44 +03:00
Botond Dénes
0d41447e1a Merge '[Backport 6.1] atomic_delete: allow deletion of sstables from several prefixes' from ScyllaDB
Allow create_pending_deletion_log to delete a bunch of sstables
potentially resides in different prefixes (e.g. in the base directory
and under staging/).

The motivation arises from table::cleanup_tablet that calls compaction_group::cleanup on all cg:s via cleanup_compaction_groups.  Cleanup, in turn, calls delete_sstables_atomically on all sstables in the compaction_group, in all states, including the normal state as well as staging - hence the requirement to support deleting sstables in different sub-directories.

Also, apparently truncate calls delete_atomically for all sstables too, via table::discard_sstables, so if it happened to be executed during view update generation, i.e. when there are sstables in staging, it should hit the assertion failure reported in https://github.com/scylladb/scylladb/issues/18862 as well (although I haven't seen it yet, but I see no reason why it would happen). So the issue was apparently present since the initial implementation of the pending_delete_log. It's just that with tablet migration it is more likely to be hit.

Fixes scylladb/scylladb#18862

Needs backport to 6.0 since tablets require this capability

(cherry picked from commit a7b92d7b6f)

(cherry picked from commit 027e64876a)

(cherry picked from commit 44bd183187)

(cherry picked from commit f47b5e60bc)

 Refs #19555

Closes scylladb/scylladb#20644

* github.com:scylladb/scylladb:
  sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
  sstables: storage: keep base directory in base class
  sstables: storage: define opened_directory in header file
  sstable_directory: use only dirlog
2024-10-22 09:17:26 +03:00
Benny Halevy
71d90b2fbc view: check_needs_view_update_path: get token_metadata_ptr
check_needs_view_update_path is async and might yield
so the token_metadata reference passed to it must be kept
alive throughout the call.

Fixes scylladb/scylladb#20979

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d34878e96c)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#21039
2024-10-22 09:16:40 +03:00
Daniel Reis
a22486a6d3 docs: fix redirect from cert-based auth to security/enable-auth page
(cherry picked from commit 28a265ccd8)

Closes scylladb/scylladb#21123
2024-10-22 09:13:42 +03:00
Raphael S. Carvalho
5106d40577 replica: Fix tombstone GC during tablet split preparation
During split prepare phase, there will be more than 1 compaction group with
overlapping token range for a given replica.

Assume tablet 1 has sstable A containing deleted data, and sstable B containing
a tombstone that shadows data in A.

Then split starts:
1) sstable B is split first, and moved from main (unsplit) group to a
split-ready group
2) now compaction runs in split-ready group before sstable A is split

tombstone GC logic today only looks at underlying group, so compaction is step
2 will discard the deleted data in A, since it belongs to another group (the
unsplit one), and so the tombstone can be purged incorrectly.

To fix it, compaction will now work with all uncompacting sstables that belong
to the same replica, since tombstone GC requires all sstables that possibly
contain shadowed data to be available for correct decision to be made.

Fixes #20044.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 93815e0649)
2024-10-20 20:44:44 -03:00
Benny Halevy
a8e472178f sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
To be able to atomically delete sstables both in
base table directory and in its sub-directories,
like `staging/`, use a shared pending_delete_dir
under under the base directory.

Note that this requires loading and processing
the base directory first.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit f47b5e60bc)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

# Conflicts:
#	sstables/sstable_directory.hh
2024-10-20 09:10:47 +03:00
Benny Halevy
8c646c2942 sstables: storage: keep base directory in base class
so we can use the base (table) directory for
e.g. pending_delete logs, in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 44bd183187)
2024-10-20 09:09:06 +03:00
Benny Halevy
334d56fcfd sstables: storage: define opened_directory in header file
So it can be used outside the storage module
in the following patches.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 027e64876a)
2024-10-20 09:09:00 +03:00
Benny Halevy
e141e97f2d sstable_directory: use only dirlog
Currently, there are leftover log messages using
sstlog rather than dirlog, that was introduced
in aebd965f0e,
and that makes debugging harder.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit a7b92d7b6f)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

# Conflicts:
#	sstables/sstable_directory.cc
2024-10-20 09:08:49 +03:00
Botond Dénes
7367544ea2 Merge '[Backport 6.1] tablet: Fix single-sstable split when attaching new unsplit sstables' from ScyllaDB
To fix a race between split and repair here c1de4859d8, a new sstable
  generated during streaming can be split before being attached to the sstable
  set. That's to prevent an unsplit sstable from reaching the set after the
  tablet map is resized.

  So we can think this split is an extension of the sstable writer. A failure
  during split means the new sstable won't be added. Also, the duration of split
  is also adding to the time erm is held. For example, repair writer will only
  release its erm once the split sstable is added into the set.

  This single-sstable split is going through run_custom_job(), which serializes
  with other maintenance tasks. That was a terrible decision, since the split may
  have to wait for ongoing maintenance task to finish, which means holding erm
  for longer. Additionally, if split monitor decides to run split on the entire
  compaction group, it can cause single-sstable split to be aborted since the
  former wants to select all sstables, propagating a failure to the streaming
  writer.
  That results in new sstable being leaked and may cause problems on restart,
  since the underlying tablet may have moved elsewhere or multiple splits may
  have happened. We have some fragility today in cleaning up leaked sstables on
  streaming failure, but this single-sstable split made it worse since the
  failure can happen during normal operation, when there's e.g. no I/O error.

  It makes sense to kill run_custom_job() usage, since the single-sstable split
  is offline and an extension of sstable writing, therefore it makes no sense to
  serialize with maintenance tasks. It must also inherit the sched group of the
  process writing the new sstable. The inheritance happens today, but is fragile.

  Fixes #20626.

(cherry picked from commit 999f1f1318)

(cherry picked from commit 38ce2c605d)

 Refs #20737

Closes scylladb/scylladb#20802

* github.com:scylladb/scylladb:
  tablet: Fix single-sstable split when attaching new unsplit sstables
  replica: Fix tablet split execute after restart
2024-10-17 19:36:47 +03:00
Piotr Smaron
f8d6215242 test: fix flaky test_multidc_alter_tablets_rf
The testcase is flaky due to a known python driver issue:
https://github.com/scylladb/python-driver/issues/317.
This issue causes the `CREATE KEYSPACE` statement to be sometimes
executed twice in a row, and the 2nd CREATE statement causes the test to
fail.
In order to work around it, it's enough to add `if not exists` when
creating a ks.

Fixes: #21034

Needs to be backported to all 6.x branches, as the PR introducing this flakiness is backported to every 6.x branch.

(cherry picked from commit 3969ffb39f)

Closes scylladb/scylladb#21106
2024-10-17 10:59:52 +03:00
Piotr Smaron
750ff26371 cql/tablets: handle MVs in ALTER tablets KEYSPACE
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.

Fixes: #20240
(cherry picked from commit 5ac16e29e6)

Closes scylladb/scylladb#21023
2024-10-16 10:39:07 +03:00
Kefu Chai
e22d8a3de3 install.sh: install seastar/scripts/addr2line.py as well
seastar extracted `addr2line` python module out back in
e078d7877273e4a6698071dc10902945f175e8bc. but `install.sh` was
not updated accordingly. it still installs `seastar-addr2line`
without installing its new dependency. this leaves us with a
broken `seastar-addr2line` in the relocatable tarball.
```console
$ /opt/scylladb/scripts/seastar-addr2line
Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/seastar-addr2line", line 26, in <module>
    from addr2line import BacktraceResolver
ModuleNotFoundError: No module named 'addr2line'
```

in this change, we redistribute `addr2line.py` as well. this
should address the issue above.

Fixes scylladb/scylladb#21077

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit da433aad9d)

Closes scylladb/scylladb#21087
2024-10-14 13:31:17 +03:00
Sergey Zolotukhin
6e15c244ec Add conditions checking for get_read_executor
During the investigation of scylladb/scylladb#20282, it was discovered that
implementations of speculating read executors have undefined behavior
when called with an incorrect number of read replicas. This PR
introduces two levels of condition checking:

- Condition checking in speculating read executors for the number of replicas.
- Checking the consistency of the Effective Replication Map in
  get_endpoints_for_reading(): the map is considered incorrect the number of
  read replica nodes is higher than replication factor. The check is
  applied only when built in non release mode.

Please note: This PR does not fix the issue found in scylladb/scylladb#20282;
it only adds condition checks to prevent undefined behavior in cases of
inconsistent inputs.

Refs scylladb/scylladb#20625

(cherry picked from commit c373edab2d)
2024-10-11 18:20:43 +00:00
Sergey Zolotukhin
5357e492ca Avoid an extra call to block_for in db::filter_for_query.
(cherry picked from commit 8db6d6bd57)
2024-10-11 18:20:43 +00:00
Sergey Zolotukhin
09330bf597 Improve code readability in consistency_level.cc and storage_proxy.cc
Add const correctness and rename some variables to improve code readability.

(cherry picked from commit ad93cf5753)
2024-10-11 18:20:42 +00:00
Sergey Zolotukhin
116661a05b tools: Add build_info header with functions providing build type information
A new header provides `constexpr` functions to retrieve build
type information: `get_build_type()`, `is_release_build()`,
and `is_debug_build()`. These functions are useful when adding
changes that should be enabled at compile time only for
specific build types.

(cherry picked from commit ae23d42889)
2024-10-11 18:20:42 +00:00
Sergey Zolotukhin
52955e940a tests: Add tests for alter table with RF=1 to RF=0
Adding Vnodes and Tablets tests for alter keyspace operation that decreases replication factor
from 1 to 0 for one of two data centers. Tablet version fails due to issue described in
scylladb/scylladb#20625.

Test for scylladb/scylladb#20625

(cherry picked from commit 132358dc92)
2024-10-11 18:20:42 +00:00
Michał Chojnowski
9f0b19b7f7 reader_concurrency_semaphore: in stats, fix swapped count_resources and memory_resources
can_admit_read() returns reason::memory_resources when the permit is queued due
to lack of count resources, and it returns reason::count_resources when the
permit is queued due to lack of memory resources. It's supposed to be the other
way around.

This bug is causing the two counts to be swapped in the stat dumps printed to
the logs when semaphores time out.

(cherry picked from commit c2ba300f1c)

Closes scylladb/scylladb#21031
2024-10-11 14:45:31 +03:00
Botond Dénes
1e847d0253 Merge '[Backport 6.1] cql: improve validating RF's change in ALTER tablets KS' from ScyllaDB
This patch series fixes a couple of bugs around validating if RF is not changed by too much when performing ALTER tablets KS.
RF cannot change by more than 1 in total, because tablets load balancer cannot handle more work at once.

Fixes: #20039

Should be backported to 6.0 & 6.1 (wherever tablets feature is present), as this bug may break the cluster.

(cherry picked from commit 042825247f)

(cherry picked from commit adf453af3f)

(cherry picked from commit 9c5950533f)

(cherry picked from commit 47acdc1f98)

(cherry picked from commit 93d61d7031)

(cherry picked from commit 6676e47371)

(cherry picked from commit 2aabe7f09c)

(cherry picked from commit ee56bbfe61)

 Refs #20208

Closes scylladb/scylladb#21010

* github.com:scylladb/scylladb:
  cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
  cql: join new and old KS options in ALTER tablets KS
  cql: fix validation of ALTERing RFs in tablets KS
  cql: harden `alter_keyspace_statement.cc::validate_rf_difference`
  cql: validate RF change for new DCs in ALTER tablets KS
  cql: extend test_alter_tablet_keyspace_rf
  cql: refactor test_tablets::test_alter_tablet_keyspace
  cql: remove unused helper function from test_tablets
2024-10-11 14:44:48 +03:00
Botond Dénes
b32304bdda repair/row_level: remove reader timeout
This timeout was added to catch reader related deadlocks. We have not
seen such deadlocks for a long time, but we did see false-timeouts
caused by this, see explanation below. Since the cost now outweight the
benefit, remove the timeout altogether.

The false timeout happens during mixed-shard repair. The
`reader_permit::set_timeout()` call is called on the top-level permit
which repair has a handle on. In the case of the mixed-shard repair,
this belongs to the multishard reader. Calling set_timeout() on the
multishard reader has no effect on the actual shard readers, except in
one case: when the shard reader is created, it inherits the multishard
reader's current timeout. As the shard reader can be alive for a long
time, this timeout is not refreshed and ultimately causes a timeout and
fails the repair.

Refs: #18269
(cherry picked from commit 3ebb124eb2)

Closes scylladb/scylladb#20956
2024-10-11 14:42:06 +03:00
Kefu Chai
ef549dbeac auth: capture boost::regex_error not std::regex_error
in a3db5401, we introduced the TLS certi authenticator, which is
configured using `auth_certificate_role_queries` option . the
value of this option contains a regular expression. so there are
chances the regular expression is malformatted. in that case,
when converting its value presenting the regular expression to an
instance of `boost::regex`, Boost.Regex throws a `boost::regex_error`
exception, not `std::regex_error`.

since we decided to use Boost.Regex, let's catch `boost::regex_error`.

Refs a3db5401
Fixes scylladb/scylladb#20941
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit c7eafc4dc1)

Closes scylladb/scylladb#20953
2024-10-11 14:38:28 +03:00
Anna Stuchlik
65c047f911 doc: document the option to run ScyllaDB in Docker on macOS
This commit adds a description of a workaround to create a multi-node ScyllaDB cluster
with Docker on macOS.

Refs https://github.com/scylladb/scylladb/issues/16806
See https://forum.scylladb.com/t/running-3-node-scylladb-in-docker/1057/4

(cherry picked from commit 7eb1dc2ae5)

Closes scylladb/scylladb#20932
2024-10-11 14:37:55 +03:00
Calle Wilund
6ea4e4a289 database: Also forced new schema commitlog segment on user initiated memtable flush
Refs #20686
Refs #15607

In #15060 we added forced new commitlog segment on user initated flush,
mainly so that tests can verify tombstone gc and other compaction related
things, without having to wait for "organic" segment deletion.
Schema commitlog was not included, mainly because we did not have tests
featuring compaction checks of schema related tables, but also because
it was assumed to be lower general througput.
There is however no real reason to not include it, and it will make some
testing much quicker and more predictable.

(cherry picked from commit 60f8a9f39d)

Closes scylladb/scylladb#20704
2024-10-11 14:36:26 +03:00
Avi Kivity
e31d6c278f Merge '[Backport 6.1] scylla_raid_setup: configure SELinux file context' from ScyllaDB
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump because the service only have write acess with systemd_coredump_var_lib_t. To make it writable, we need to add file context rule for /var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.

Fixes #19325

(cherry picked from commit 56c971373c)

(cherry picked from commit 0ac450de05)

 Refs #20528

Closes scylladb/scylladb#20871

* github.com:scylladb/scylladb:
  scylla_raid_setup: configure SELinux file context
  scylla_coredump_setup: fix SELinux configuration for RHEL9
2024-10-10 19:01:40 +03:00
Gleb Natapov
592d925516 storage_proxy: make sure there is no end iterator in _live_iterators array
storage_proxy::cancellable_write_handlers_list::update_live_iterators
assumes that iterators in _live_iterators can be dereferenced, but
the code does not make any attempt to make sure this is the case. The
iterator can be the end iterator which cannot be dereferenced.

The patch makes sure that there is no end iterator in _live_iterators.

Fixes scylladb/scylladb#20874

(cherry picked from commit da084d6441)

Closes scylladb/scylladb#21004
2024-10-09 20:16:53 +03:00
Piotr Smaron
08165851fb cql: sum of abs RFs diffs cannot exceed 1 in ALTER tablets KS
Tablets load balancer is unable to process more than a single pending
replica, thus ALTER tablets KS cannot accept an ALTER statement which
would result in creating 2+ pending replicas, hence it has to validate
if the sum of absoulte differences of RFs specified in the statement is
not greter than 1.

(cherry picked from commit ee56bbfe61)
2024-10-08 18:06:54 +00:00
Piotr Smaron
1f6befe16d cql: join new and old KS options in ALTER tablets KS
A bug has been discovered while trying to ALTER tablets KS and
specifying only 1 out of 2 DCs - the not specified DC's RF has been
zeroed. This is because ALTER tablets KS updated the KS only with the
RF-per-DC mapping specified in the ALTER tablets KS statement, so if a
DC was ommitted, it was assigned a value of RF=0.
This commit fixes that plus additionally passes all the KS options, not
only the replication options, to the topology coordinator, where the KS
update is performed.
`initial_tablets` is a special case, which requires a special handling
in the source code, as we cannot simply update old initial_tablet's
settings with the new ones, because if only ` and TABLETS = {'enabled':
true}` is specified in the ALTER tablets KS statement, we should not zero the `initial_tablets`, but
rather keep the old value - this is tested by the
`test_alter_preserves_tablets_if_initial_tablets_skipped` testcase.
Other than that, the above mentioned testcase started to fail with
these changes, and it appeared to be an issue with the test not waiting
until ALTER is completed, and thus reading the old value, hence the
test's body has been modified to wait for ALTER to complete before
performing validation.

(cherry picked from commit 2aabe7f09c)
2024-10-08 18:06:53 +00:00
Piotr Smaron
97b37fbbd0 cql: fix validation of ALTERing RFs in tablets KS
The validation has been corrected with:
1. Checking if a DC specified in ALTER exists.
2. Removing `REPLICATION_STRATEGY_CLASS_KEY` key from a map of RFs that
   needs their RFs to be validated.

(cherry picked from commit 6676e47371)
2024-10-08 18:06:47 +00:00
Piotr Smaron
7c837837eb cql: harden alter_keyspace_statement.cc::validate_rf_difference
This function assumed that strings passed as arguments will be of
integer types, but that wasn't the case, and we missed that because this
function didn't have any validation, so this change adds proper
validation and error logging.
Arguments passed to this function were forwarded from a call to
`ks_prop_defs::get_replication_options`, which, among rf-per-dc mapping, returns also
`class:replication_strategy` pair. Second pair's member has been casted
into an `int` type and somehow the code was still running fine, but only
extra testing added later discovered a bug in here.

(cherry picked from commit 93d61d7031)
2024-10-08 18:06:47 +00:00
Piotr Smaron
0e0fe4d756 cql: validate RF change for new DCs in ALTER tablets KS
ALTER tablets KS validated if RF is not changed by more than 1 for DCs
that already had replicas, but not for DCs that didn't have them yet, so
specifying an RF jump from 0 to 2 was possible when listing a new DC in
ALTER tablets KS statement, which violated internal invariants of
tablets load balancer.
This PR fixes that bug and adds a multi-dc testcases to check if adding
replicas to a new DC and removing replicas from a DC is honoring the RF
change constraints.

Refs: #20039
(cherry picked from commit 47acdc1f98)
2024-10-08 18:06:46 +00:00
Piotr Smaron
78bf036419 cql: extend test_alter_tablet_keyspace_rf
Added cases to also test decreasing RF and setting the same RF.
Also added extra explanatory comments.

(cherry picked from commit 9c5950533f)
2024-10-08 18:06:45 +00:00
Piotr Smaron
4fc45b6fa6 cql: refactor test_tablets::test_alter_tablet_keyspace
1. Renamed the testcase to emphasize that it only focuses on testing
   changing RF - there are other tests that test ALTER tablets KS
in general.
2. Fixed whitespaces according to PEP8

(cherry picked from commit adf453af3f)
2024-10-08 18:06:44 +00:00
Piotr Smaron
dbb912c8dd cql: remove unused helper function from test_tablets
`change_default_rf` is not used anywhere, moreover it uses
`replication_factor` tag, which is forbidden in ALTER tablets KS
statement.

(cherry picked from commit 042825247f)
2024-10-08 18:06:42 +00:00
Raphael S. Carvalho
684b16d709 service: Improve error handling for split
Retry wasn't really happening since the loop was broken and sleep
part was skipped on error. Also, we were treating abort of split
during shutdown as if it were an actual error and that confused
longevity tests that parse for logs with error level. The fix is
about demoting the level of logs when we know the exception comes
from shutdown.

Fixes #20890.

(cherry picked from commit bcd358595f)
2024-10-04 11:17:37 +00:00
Pavel Emelyanov
190385ee2b cql: Check that CREATEing tablets/vnodes is consistent with the CLI
There are two bits that control whenter replication strategy for a
keyspace will use tablets or not -- the configuration option and CQL
parameter. This patch tunes its parsing to implement the logic shown
below:

    if (strategy.supports_tablets) {
         if (cql.with_tablets) {
             if (cfg.enable_tablets) {
                 return create_keyspace_with_tablets();
             } else {
                 throw "tablets are not enabled";
             }
         } else if (cql.with_tablets = off) {
              return create_keyspace_without_tablets();
         } else { // cql.with_tablets is not specified
              if (cfg.enable_tablets) {
                  return create_keyspace_with_tablets();
              } else {
                  return create_keyspace_without_tablets();
              }
         }
     } else { // strategy doesn't support tablets
         if (cql.with_tablets == on) {
             throw "invalid cql parameter";
         } else if (cql.with_tablets == off) {
             return create_keyspace_without_tablets();
         } else { // cql.with_tablets is not specified
             return create_keyspace_without_tablets();
         }
     }

closes: #20088

In order to enable tablets "by default" for NetworkTopologyStrategy
there's explicit check near ks_prop_defs::get_initial_tablets(), that's
not very nice. It needs more care to fix it, e.g. provide feature
service reference to abstract_replication_strategy constructor. But
since ks_prop_defs code already highjacks options specifically for that
strategy type (see prepare_options() helper), it's OK for now.

There's also #20768 misbehavior that's preserved in this patch, but
should be fixed eventually as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20928
2024-10-03 17:09:21 +03:00
Calle Wilund
4a1e83d6be commitlog: Fix buffer_list_bytes not updated correctly
Fixes #20862

With the change in 60af2f3cb2 the bookkeep
for buffer memory was changed subtly, the problem here that we would
shrink buffer size before we after flush use said buffer's size to
decrement the buffer_list_bytes value, previously inc:ed by the full,
allocated size. I.e. we would slowly grow this value instead of adjusting
properly to actual used bytes.

Test included.

(cherry picked from commit ee5e71172f)

Closes scylladb/scylladb#20914
2024-10-03 09:11:40 +03:00
Kamil Braun
a96654bea3 Merge '[Backport 6.1] Populate raft address map from gossiper on raft configuration change' from ScyllaDB
For each new node added to the raft config populate it's ID to IP mapping in raft address map from the gossiper. The mapping may have expired if a node is added to the raft configuration long after it first appears in the gossiper.

Fixes scylladb/scylladb#20600

Backport to all supported versions since the bug may cause bootstrapping failure.

(cherry picked from commit bddaf498df)

(cherry picked from commit 9e4cd32096)

 Refs #20601

Closes scylladb/scylladb#20848

* github.com:scylladb/scylladb:
  test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
  group0: make sure that address map has an entry for each new node in the raft configuration
2024-09-30 17:03:03 +02:00
Takuya ASADA
295993d7f9 scylla_raid_setup: configure SELinux file context
On RHEL9, systemd-coredump fails to coredump on /var/lib/scylla/coredump
because the service only have write acess with systemd_coredump_var_lib_t.
To make it writable, we need to add file context rule for
/var/lib/scylla/coredump, and run restorecon on /var/lib/scylla.

Fixes #20573

(cherry picked from commit 0ac450de05)
2024-09-29 13:23:03 +00:00
Takuya ASADA
bd7e1cfc5f scylla_coredump_setup: fix SELinux configuration for RHEL9
Seems like specific version of systemd pacakge on RHEL9 has a bug on
SELinux configuration, it introduced "systemd-container-coredump" module
to provide rule for systemd-coredump, but not enabled by default.
We have to manually load it, otherwise it causes permission error.

Fixes #19325

(cherry picked from commit 56c971373c)
2024-09-29 13:23:03 +00:00
Kamil Braun
79119f58e8 Merge '[Backport 6.1] mark node as being replaced earlier' from Gleb Natapov
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.

Fixes: https://github.com/scylladb/scylladb/issues/20629

Need to be backported since this is a regression

(cherry picked from commit 644e7a2012)

(cherry picked from commit c0939d86f9)

(cherry picked from commit 1b4c255ffd)

Closes scylladb/scylladb#20834

* github.com:scylladb/scylladb:
  test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
  topology coordinator:: mark node as being replaced earlier
  topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
2024-09-27 16:10:07 +02:00
Andrei Chekun
392d95d2cd test.py: Increase workers for cluster cleaning
Increase workers for that used in method async_rmtree() that is used for
cleaning directories. This should help to reduce flakiness.
Increasing the workers count was introduced in f54b7f5427
but there is no need to backport the whole commit.

Closes scylladb/scylladb#20795
2024-09-27 14:47:08 +02:00
Kamil Braun
be76d6f9d9 service: raft: fix rpc error message
What it called "leader" is actually the destination of the RPC.

Trivial fix, should be backported to all affected versions.

(cherry picked from commit 84dd0e922b)

Closes scylladb/scylladb#20827
2024-09-27 11:22:02 +02:00
Gleb Natapov
39a8203160 test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join
(cherry picked from commit 9e4cd32096)
2024-09-26 21:13:39 +00:00
Gleb Natapov
d2d1ed92c2 group0: make sure that address map has an entry for each new node in the raft configuration
ID->IP mapping is added to the raft address map when the mapping first
appears in the gossiper, but it is added as expiring entry. It becomes
non expiring when a node is added to raft configuration. But when a node
joins those two events may be distant in time (since the node's request
may sit in the topology coordinator queue for a while) and mappings may
expire already from the map. This patch makes sure to transfer the
mapping from the gossiper for a node that is added to the raft
configuration instead of assuming that the mapping is already there.

(cherry picked from commit bddaf498df)
2024-09-26 21:13:39 +00:00
Gleb Natapov
c7be05cc50 test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts
(cherry picked from commit 1b4c255ffd)
2024-09-26 12:34:18 +03:00
Gleb Natapov
88712782de topology coordinator:: mark node as being replaced earlier
Before 17f4a151ce the node was marked as
been replaced in join_group0 state, before it actually joins the group0,
so by the time it actually joins and starts transferring snapshot/log no
traffic is sent to it. The commit changed this to mark the node as
being replaced after the snapshot/log is already transferred so we can
get the traffic to the node while it sill did not caught up with a
leader and this may causes problems since the state is not complete.
Mark the node as being replaced earlier, but still add the new node to
the topology later as the commit above intended.

(cherry picked from commit c0939d86f9)
2024-09-26 12:34:04 +03:00
Gleb Natapov
eaade2f0ef topology coordinator: do metadata barrier before calling finish_accepting_node() during replace
During replace with the same IP a node may get queries that were intended
for the node it was replacing since the new node declares itself UP
before it advertises that it is a replacement. But after the node
starts replacing procedure the old node is marked as "being replaced"
and queries no longer sent there. It is important to do so before the
new node start to get raft snapshot since the snapshot application is
not atomic and queries that run parallel with it may see partial state
and fail in weird ways. Queries that are sent before that will fail
because schema is empty, so they will not find any tables in the first
place. The is pre-existing and not addressed by this patch.

(cherry picked from commit 644e7a2012)
2024-09-26 12:33:06 +03:00
Kefu Chai
ef32ba704d docs: explain precedence of configure options
to explain for instance which setting takes effect if both
command line options and `scylla.yaml` configures the same parameter.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 1aa030a8cd)

Closes scylladb/scylladb#20775
2024-09-26 10:47:42 +03:00
Anna Stuchlik
10d71d2f4b doc: update the unified installer instructions
This commit updates the unified installer instructions to avoid specifying a given version.
At the moment, we're technically unable to use variables in URLs, so we need to update
the page each release.

Fixes https://github.com/scylladb/scylladb/issues/20677

(cherry picked from commit 400a14eefa)

Closes scylladb/scylladb#20709
2024-09-26 10:45:53 +03:00
Anna Stuchlik
9afb3daf98 doc: fix a broken link
This commit fixes a link to the Manager by adding a missing underscore
to the external link.

(cherry picked from commit aa0c95c95c)

Closes scylladb/scylladb#20707
2024-09-26 10:45:17 +03:00
Tzach Livyatan
82e7cb5bf5 Update client-node-encryption: OpsnSSL is FIPS *enabled*
(cherry picked from commit cb864b11d8)

Closes scylladb/scylladb#20651
2024-09-26 10:42:12 +03:00
Lakshmi Narayanan Sreethar
58da8fdbbc [Backport 6.1]: database::get_all_tables_flushed_at: fix return value
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.

Fixes #20301

Closes scylladb/scylladb#20471

* github.com:scylladb/scylladb:
  cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config
  database::get_all_tables_flushed_at: fix return value

(cherry picked from commit 0e5b444777)

Backported from #20471 to 6.1.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#20581
2024-09-26 10:40:48 +03:00
Kamil Braun
92156e7930 test: fix topology_custom/test_raft_recovery_stuck flakiness
The test performs consecutive schema changes in RECOVERY mode. The
second change relies on the first. However the driver might route the
changes to different servers and we don't have group 0 to guarantee
linearizability. We must rely on the first change coordinator to push
the schema mutations to other servers before returning, but that only
happens when it sees other servers as alive when doing the schema
change. It wasn't guaranteed in the test. Fix this.

Fixes scylladb/scylladb#20791

Should be backported to all branches containing this test to reduce
flakiness.

(cherry picked from commit f390d4020a)

Closes scylladb/scylladb#20809
2024-09-25 15:11:50 +02:00
Abhinav
33b50a9d3a raft topology: add error for removal of non-normal nodes
In the current scenario, We check if a node being removed is normal
on the node initiating the removenode request. However, we don't have a
similar check on the topology coordinator. The node being removed could be
normal when we initiate the request, but it doesn't have to be normal when
the topology coordinator starts handling the request.
For example, the topology coordinator could have removed this node while handling
another removenode request that was added to the request queue earlier.

This commit intends to fix this issue by adding more checks in the enqueuing phase
and return errors for duplicate requests for node removal.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#20271
(cherry picked from commit b25b8dccbd)

Closes scylladb/scylladb#20800
2024-09-25 11:35:27 +02:00
Raphael S. Carvalho
153a54626b tablet: Fix single-sstable split when attaching new unsplit sstables
To fix a race between split and repair here c1de4859d8, a new sstable
generated during streaming can be split before being attached to the sstable
set. That's to prevent an unsplit sstable from reaching the set after the
tablet map is resized.

So we can think this split is an extension of the sstable writer. A failure
during split means the new sstable won't be added. Also, the duration of split
is also adding to the time erm is held. For example, repair writer will only
release its erm once the split sstable is added into the set.

This single-sstable split is going through run_custom_job(), which serializes
with other maintenance tasks. That was a terrible decision, since the split may
have to wait for ongoing maintenance task to finish, which means holding erm
for longer. Additionally, if split monitor decides to run split on the entire
compaction group, it can cause single-sstable split to be aborted since the
former wants to select all sstables, propagating a failure to the streaming
writer.
That results in new sstable being leaked and may cause problems on restart,
since the underlying tablet may have moved elsewhere or multiple splits may
have happened. We have some fragility today in cleaning up leaked sstables on
streaming failure, but this single-sstable split made it worse since the
failure can happen during normal operation, when there's e.g. no I/O error.

It makes sense to kill run_custom_job() usage, since the single-sstable split
is offline and an extension of sstable writing, therefore it makes no sense to
serialize with maintenance tasks. It must also inherit the sched group of the
process writing the new sstable. The inheritance happens today, but is fragile.

Fixes #20626.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 38ce2c605d)
2024-09-25 02:13:42 +00:00
Raphael S. Carvalho
c0b2e89d35 replica: Fix tablet split execute after restart
let's assume there are 2 nodes, n1, n2. n1 is the coordinator.

1) n1 emits split
2) n1 and n2 complete split work
3) n1 becomes aware all replicas are ready for split
4) n2 restarts, but places split sstable into main group[1]
5) n1 executes split
6) n2 handles split completion, but see the main group is not empty

[1]: During split, main group should only contain unsplit sstables.
If all sstables are split, main must be empty.

This is a result of replica not setting storage group to split mode on restart
(using tablet map) and therefore sstables are incorrectly placed on main group.

The fix is about looking at tablet map and setting group to split mode before
sstables are populated into it.

Refs #20626.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 999f1f1318)
2024-09-25 02:13:42 +00:00
Gleb Natapov
43f9b3b997 test: skip test_lwt_semaphore::test_cas_semaphore in aarch64 debug mode
The test configures write timeout to much smaller value to make the test
run faster since for some writes sleep is inserted to hit the timeout,
but it makes aarch64 debug flaky since timeout happens when it should
not because of a natural slowness.

(cherry picked from commit 71a5b1c6dd)

Closes scylladb/scylladb#20777
2024-09-24 15:20:09 +02:00
Botond Dénes
7ed2f87414 Merge '[Backport 6.1] cql3: add option to not unify bind variables with the same' from Avi Kivity
Bind variables in CQL have two formats: positional (?) where a variable is referred to by its relative position in the statement, and named (:var), where the user is expected to supply a name->value mapping.

In 19a6e69001 we identified the case where a named bind variable appears twice in a query, and collapsed it to a single entry in the statement metadata. Without this, a driver using the named variable syntax cannot disambiguate which variable is referred to.

However, it turns out that users can use the positional call form even with the named variable syntax, by using the positional API of the driver. To support this use case, we add a configuration variable to disable the same-variable detection.

Because the detection has to happen when the entire statement is visible, we have to supply the configuration to the parser. We call it the dialect and pass it from all callers. The alternative would be to add a pre-prepare call similar to fill_prepare_context that rewrites all expressions in a statement to deduplicate variables.

A unit test is added.

Fixes https://github.com/scylladb/scylladb/issues/15559

This may be useful to users transitioning from Cassandra, so merits a backport.

(cherry picked from commit f9322799af)

(cherry picked from commit d69bf4f010)

(cherry picked from commit ea8441dfa3)

Refs https://github.com/scylladb/scylladb/pull/19493

Closes scylladb/scylladb#20590

* github.com:scylladb/scylladb:
  cql3: add option to not unify bind variables with the same name
  cql3: introduce dialect infrastructure
  cql3: prepared_statement_cache: drop cache key default constructor
  Merge 'config: round-trip boolean configuration variables' from Avi Kivity
2024-09-24 15:15:05 +03:00
Jenkins Promoter
f4ad3436cb Update ScyllaDB version to: 6.1.3 2024-09-24 15:07:23 +03:00
Benny Halevy
d13c77e1eb time_window_compaction_strategy: get_reshaping_job: restrict sort of multi_window vector to its size
Currently the function calls boost::partial_sort with a middle
iterator that might be out of bound and cause undefined behavior.

Check the vector size, and do a partial sort only if its longer
than `max_sstables`, otherwise sort the whole vector.

Fixes scylladb/scylladb#20608

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 39ce358d82)

Closes scylladb/scylladb#20663
2024-09-23 15:38:35 +03:00
Piotr Dulikowski
bf6dd16071 Merge '[Backport 6.1] message/messaging_service: guard adding maintenance tenant under cluster feature' from Michał Jadwiszczak
In https://github.com/scylladb/scylladb/pull/18729, we introduced a new statement tenant $maintenance, but the change wasn't protected by any cluster feature.
This wasn't a problem for OSS, since unknown isolation cookie just uses default scheduling group. However, in enterprise that leads to creating a service level on not-upgraded nodes, which may end up in an error if user create maximum number of service levels.

This patch adds a cluster feature to guard adding the new tenant. It's done in the way to handle two upgrade scenarios:

version without $maintenance tenant -> version with $maintenance tenant guarded by a feature
version with $maintenance tenant but not guarded by a feature -> version with $maintenance tenant guarded by a feature
The PR adds enabled flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create a connection, but it can be used to accept an incoming connection.
The $maintenance tenant is added to the config as disabled and it gets enabled once the corresponding feature is enabled.

Fixes https://github.com/scylladb/scylladb/issues/20070
Refs https://github.com/scylladb/scylla-enterprise/issues/4403

(cherry picked from commit d44844241d)

(cherry picked from commit 71a03ef6b0)

(cherry picked from commit b4b91ca364)

Refs https://github.com/scylladb/scylladb/pull/19802

Closes scylladb/scylladb#20674

* github.com:scylladb/scylladb:
  message/messaging_service: guard adding maintenance tenant under cluster feature
  message/messaging_service: add feature_service dependency
  message/messaging_service: add `enabled` flag to statement tenants
2024-09-23 13:18:45 +02:00
Botond Dénes
f987afb2e1 Merge '[Manual Backport 6.1] generic_server: convert connection tracking to seastar::gate' from Laszlo Ersek
This is a manual backport of #20212 to 6.1, superseding #20345 (which had run into conflicts).

Please see the individual commit messages for backport notes.

Fixes #10305

Closes scylladb/scylladb#20355

* github.com:scylladb/scylladb:
  generic_server: make server::stop() idempotent
  generic_server: coroutinize server::shutdown()
  generic_server: make server::shutdown() idempotent
  test/generic_server: add test case
  configure, cmake: sort the lists of boost unit tests
  generic_server: convert connection tracking to seastar::gate
2024-09-18 15:52:32 +03:00
Michał Jadwiszczak
7e14df5ba7 message/messaging_service: guard adding maintenance tenant under cluster feature
Set `enabled` flag for `$maintenance` tenant to false and
enable it when `MAINTENANCE_TENANT` feature is enabled.

(cherry-picked from b4b91ca364)
2024-09-18 11:31:26 +02:00
Michał Jadwiszczak
d11df0fcbc message/messaging_service: add feature_service dependency
(cherry-picked from 71a03ef6b0)
2024-09-18 11:26:56 +02:00
Michał Jadwiszczak
f928bb7967 message/messaging_service: add enabled flag to statement tenants
Adding a new tenant needs to be done under cluster feature protection.
However it wasn't the case for adding `$maintenance` statement tenant
and to fix it we need to support an upgrade from node which doesn't
know about maintenance tenant at all and from one which uses it without
any cluster feature protection.

This commit adds `enabled` flag to statement tenants.
This way, when the tenant is disabled, it cannot be used to create
a connection, but it can be used to accept an incoming connection.

(cherry-picked from d44844241d)
2024-09-18 11:23:02 +02:00
Tomasz Grabiec
edea822bd7 Merge '[Backport 6.1] tablets: Fix race between repair and split' from Raphael "Raph" Carvalho
Consider the following:

```
T
0   split prepare starts
1                               repair starts
2   split prepare finishes
3                               repair adds unsplit sstables
4                               repair ends
5   split executes
```
If repair produces sstable after split prepare phase, the replica will not split that sstable later, as prepare phase is considered completed already. That causes split execution to fail as replicas weren't really prepared. This also can be triggered with load-and-stream which shares the same write (consumer) path.

The approach to fix this is the same employed to prevent a race between split and migration. If migration happens during prepare phase, it can happen source misses the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if underlying table is in split mode. That's implemented in replica::table for correctness, so if node crashes, the new sstable missing split is still split before added to the set.

Fixes https://github.com/scylladb/scylladb/issues/19378.
Fixes https://github.com/scylladb/scylladb/issues/19416.

Please replace this line with justification for the backport/* labels added to this PR

(cherry picked from commit 239344ab55)

(cherry picked from commit 74612ad358)

Refs https://github.com/scylladb/scylladb/pull/19427

Closes scylladb/scylladb#20595

* github.com:scylladb/scylladb:
  tablets: Fix race between repair and split
  compaction: Allow "offline" sstable to be split
2024-09-17 13:25:03 +02:00
Avi Kivity
fb98d6f832 Merge '[Backport 6.1] replica: ignore cleanup of deallocated storage group' from Aleksandra Martyniuk
Cleanup of a deallocated tablet throws an exception.
Since failed cleanup is retried, we end up in an infinite loop.

Ignore cleanup of deallocated storage groups.

Fixes: https://github.com/scylladb/scylladb/issues/19752.

Needs to be backported to all branches with tablets (6.0 and later)

(cherry picked from commit 20d6cf55f2)

(cherry picked from commit 2c4b1d6b45)

Refs https://github.com/scylladb/scylladb/pull/20584

Closes scylladb/scylladb#20627

* github.com:scylladb/scylladb:
  test: check if cleanup of deallocated sg is ignored
  replica: ignore cleanup of deallocated storage group
2024-09-17 12:22:00 +03:00
Gleb Natapov
d2e9007442 paxos_state: release semaphore units before checking if a semaphore can be dropped
To drop a semaphore it should not be held by anyone, so we need to
release out units before checking if a semaphore can be dropped.

Fixes: scylladb/scylladb#20602
(cherry picked from commit 9cc54932ae)

Closes scylladb/scylladb#20621
2024-09-16 22:08:45 +03:00
Aleksandra Martyniuk
032c9146d5 test: check if cleanup of deallocated sg is ignored
(cherry picked from commit 2c4b1d6b45)
2024-09-16 16:22:29 +02:00
Aleksandra Martyniuk
120ff5aeb8 replica: ignore cleanup of deallocated storage group
Currently, attempt to cleanup deallocated storage group throws
an exception. Failed tablet cleanup is retried, stucking
in an endless loop.

Ignore cleanup of deallocated storage group.

(cherry picked from commit 20d6cf55f2)
2024-09-16 12:44:36 +00:00
Raphael S. Carvalho
fe56fa39c0 tablets: Fix race between repair and split
Consider the following:

T
0   split prepare starts
1                               repair starts
2   split prepare finishes
3                               repair adds unsplit sstables
4                               repair ends
5   split executes

If repair produces sstable after split prepare phase, the replica
will not split that sstable later, as prepare phase is considered
completed already. That causes split execution to fail as replicas
weren't really prepared. This also can be triggered with
load-and-stream which shares the same write (consumer) path.

The approach to fix this is the same employed to prevent a race
between split and migration. If migration happens during prepare
phase, it can happen source misses the split request, but the
tablet will still be split on the destination (if needed).
Similarly, the repair writer becomes responsible for splitting
the data if underlying table is in split mode. That's implemented
in replica::table for correctness, so if node crashes, the new
sstable missing split is still split before added to the set.

Fixes #19378.
Fixes #19416.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 74612ad358)
2024-09-13 21:32:01 -03:00
Avi Kivity
8ddfd0d70d cql3: add option to not unify bind variables with the same name
Bind variables in CQL have two formats: positional (`?`) where a
variable is referred to by its relative position in the statement,
and named (`:var`), where the user is expected to supply a
name->value mapping.

In 19a6e69001 we identified the case where a named bind variable
appears twice in a query, and collapsed it to a single entry in the
statement metadata. Without this, a driver using the named variable
syntax cannot disambiguate which variable is referred to.

However, it turns out that users can use the positional call form
even with the named variable syntax, by using the positional
API of the driver. To support this use case, we add a configuration
variable to disable the same-variable detection.

Because the detection has to happen when the entire statement is
visible, we have to supply the configuration to the parser. We
call it the `dialect` and pass it from all callers. The alternative
would be to add a pre-prepare call similar to fill_prepare_context that
rewrites all expressions in a statement to deduplicate variables.

A unit test is added.

Fixes #15559

(cherry picked from commit ea8441dfa3)
(cherry picked from commit edb3068ecf)
2024-09-13 18:17:15 +03:00
Avi Kivity
92dd47c6d6 cql3: introduce dialect infrastructure
A dialect is a different way to interpret the same CQL statement.

Examples:
 - how duplicate bind variable names are handled (later in this series)
 - whether `column = NULL` in LWT can return true (as is now) or
   whether it always returns NULL (as in SQL)

Currently, dialect is an empty structure and will be filled in later.
It is passed to query_processor methods that also accept a CQL string,
and from there to the parser. It is part of the prepared statement cache
key, so that if the dialect is changed online, previous parses of the
statement are ignored and the statement is prepared again.

The patch is careful to pick up the dialect at the entry point (e.g.
CQL protocol server) so that the dialect doesn't change while a statement
is parsed, prepared, and cached.

(cherry picked from commit d69bf4f010)
2024-09-13 18:11:11 +03:00
Avi Kivity
4bf81f54b4 cql3: prepared_statement_cache: drop cache key default constructor
It's unnecessary, and interferes with the following patch where
we change the cache key type.

(cherry picked from commit f9322799af)
2024-09-13 17:56:06 +03:00
Nadav Har'El
d9ba5423bb Merge 'config: round-trip boolean configuration variables' from Avi Kivity
When you SELECT a boolean from system.config, it reads as true/false, but this isn't accepted
on UPDATE (instead, we accept 1/0). This is surprising and annoying, so accept true/false in
both directions.

Not a regression, so a backport isn't strictly necessary.

Closes scylladb/scylladb#19792

* github.com:scylladb/scylladb:
  config: specialize from-string conversion for bool
  config: wrap boost::lexical_cast<> when converting from strings

(cherry picked from commit 9eb47b3ef0)
2024-09-13 17:54:37 +03:00
Piotr Smaron
b60f9ef4c2 cql: fix exception when validating KS in CREATE TABLE
c70f321c6f added an extra check if KS
exists. This check can throw `data_dictionary::no_such_keyspace`
exception, which is supposed to be caught and a more user-friendly
exception should be thrown instead.
This commit fixes the above problem and adds a testcase to validate it
doesn't appear ever again.
Also, I moved the check for the keyspace outside of the `for` loop, as
it doesn't need to be checked repeatedly.
Additionally, I added an extra comment to both `no_such_keyspace` and
`no_such_column_family` exceptions explaining they should not be
returned directly to the caller, as they lack error code, which may not
trigger correct exceptions handling mechanisms on the driver side.

Fixes: #20097
(cherry picked from commit f1e8976fbe)

Closes scylladb/scylladb#20553
2024-09-13 11:36:51 +03:00
Piotr Dulikowski
00e96d4b70 Merge '[Backport 6.1]: hints: send hints with CL=ALL if target is leaving' from Piotr Dulikowski
Currently, when attempting to send a hint, we might choose its recipients in one of two ways:

- If the original destination is a natural endpoint of the hint, we only send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.

There is a problem when we decommission a node: while data is streamed away from that node, it is still considered to be a natural endpoint of the data that it used to own. Because of that, it might happen that a hint is sent directly to it but streaming will miss it, effectively resulting in the hint being discarded.

As sending the hint _only_ to the leaving replica is a rather bad idea, send the hint to all replicas also in the case when the original destination of the hint is leaving.

Note that this is a conservative fix written only with the decommission + vnode-based keyspaces combo in mind. In general, such "data loss" can occur in other situations where the replica set is changing and we go through a streaming phase, i.e. other topology operations in case of vnodes and tablet load balancing. However, the consistency guarantees of hinted handoff in the face of topology changes are not defined and it is not clear what they should be, if there should be any at all. The picture is further complicated by the fact that hints are used by materialized views, and sending view updates to more replicas than necessary can introduce inconsistencies in the form of "ghost rows". This fix was developed in response to a failing test which checked the hint replay + decommission scenario, and it makes it work again.

Fixes scylladb/scylladb#20558
Fixes scylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835

This is a backport of the original PR without the tests, done avoid the need of resolving merge conflicts in that area.

Closes scylladb/scylladb#20557

* github.com:scylladb/scylladb:
  hints: send hints with CL=ALL if target is leaving
  hints: inline do_send_one_mutation
2024-09-13 09:39:36 +02:00
Abhi
848054079b raft: Add descriptions for requested abort errors
Fixes: scylladb/scylladb#18902

This PR only improves error messages, no need to backport it.

(cherry picked from commit 9b09439065)

Closes scylladb/scylladb#20526
2024-09-13 10:13:49 +03:00
Botond Dénes
c80cefe422 docs/cql/ddl.rst: fix description of sstable_compression
ScyllaDB doesn't support custom compressors. The available compressors
are the only available ones, not the default ones.
Adjust the text to reflect this.

(cherry picked from commit 08f109724b)

Closes scylladb/scylladb#20524
2024-09-13 10:12:59 +03:00
Takuya ASADA
b07c74a65c install.sh: fix more incorrect permission on strict umask
Even after 13caac7, we still have more files incorrect permission, since
we use "cp -r" and creating new file with redirect.

To fix this, we need to replace "cp -r" with "cp -pr", and "chmod <perm>" on
newly created files.

Fixes #14383
Related #19775

(cherry picked from commit 9d7fed40b5)

Closes scylladb/scylladb#20432
2024-09-13 10:12:22 +03:00
Piotr Dulikowski
2556c7a0dc hints: send hints with CL=ALL if target is leaving
Currently, when attempting to send a hint, we might choose its
recipients in one of two ways:

- If the original destination is a natural endpoint of the hint, we only
  send the hint to that node and none other,
- Otherwise, we send the hint to all current replicas of the mutation.

There is a problem when we decommission a node: while data is streamed
away from that node, it is still considered to be a natural endpoint of
the data that it used to own. Because of that, it might happen that a
hint is sent directly to it but streaming will miss it, effectively
resulting in the hint being discarded.

As sending the hint _only_ to the leaving replica is a rather bad idea,
send the hint to all replicas also in the case when the original
destiantion of the hint is leaving.

Note that this is a conservative fix written only with the decommission
+ vnode-based keyspaces combo in mind. In general, such "data loss" can
occur in other situations where the replica set is changing and we go
through a streaming phase, i.e. other topology operations in case of
vnodes and tablet load balancing. However, the consistency guarantees of
hinted handoff in the face of topology changes are not defined and it is
not clear what they should be, if there should be any at all. The
picture is further complicated by the fact that hints are used by
materialized views, and sending view updates to more replicas than
necessary can introduce inconsistencies in the form of "ghost rows".
This fix was developed in response to a failing test which checked the
hint replay + decommission scenario, and it makes it work again.

Fixes scylladb/scylla-dtest#4582
Refs scylladb/scylladb#19835

(cherry picked from commit 61ac0a336d)
2024-09-12 10:55:29 +02:00
Piotr Dulikowski
132d77f447 hints: inline do_send_one_mutation
It's a small method and it is only used once in send_one_mutation.
Inlining it lets us get rid of its declaration in the header - now, if
one needs to change the variables passed from one function to another,
it is no longer necessary to change the header.

(cherry picked from commit 8abb06ab82)
2024-09-12 10:55:21 +02:00
Gleb Natapov
bb9249f055 db/consistency_level: do not use result from hit weighted load balancer if it contains duplicates
Because of https://github.com/scylladb/scylladb/issues/9285 hit weighted
load balancer may sometimes return same node twice. It may cause wrong
data to be read or unexpected errors to be returned to a client. Since
the original bug is not easy to fix and it is rare lets introduce a
workaround. We will check for duplicates and will use non HWLB one if
one is found.

(cherry picked from commit e06a772b87)

Closes scylladb/scylladb#20468
2024-09-10 17:18:47 +03:00
Kamil Braun
e4a18b0858 test: test_raft_no_quorum: increase raft timeout in debug mode
The test cases in this file use an error injection to reduce raft group
0 timeouts (from the default 1 minute), in order to speed up the tests;
the scenarios expect these timeouts to happen, so we want them to happen
as quick as possible, but we don't want to reduce timeouts so much that
it will make other operations fail when we don't expect them to (e.g.
when the test wants to add a node to the cluster).

Unfortunately the selected 5 seconds in debug mode was not enough and
made the tests flaky: scylladb/scylladb#20111.

Increase it to 10 seconds. This unfortunately will slow down these tests
as they have to sometimes wait for 10 seconds for the timeout to happen.
But better to have this than a flaky test.

Fixes: scylladb/scylladb#20111
(cherry picked from commit 52fdf5b4c9)

Closes scylladb/scylladb#20477
2024-09-10 08:48:06 +03:00
Kefu Chai
105293b2ab docs: do not install scylla/ppa repo when perform upgrade
for following reasons:

1. the ppa in question does not provide the build for the latest ubuntu's LTS release. it only builds for trusty, xenial, bionic and jammy. according to https://wiki.ubuntu.com/Releases, the latest LTS release is ubuntu noble at the time of writing.
2. the ppa in question does not provide the packages used in production. it does provides the package for *building* scylla
3. after we introduced the relocatable package, there is no need to provide extra user space dependencies apart from scylla packages.

so, in this change, we remove all references to enabling the Scylla/PPA repository.

Fixes scylladb/scylladb#20449

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit fe0e961856)

Closes scylladb/scylladb#20453
2024-09-10 08:46:47 +03:00
Nadav Har'El
ad47c0e2f9 alternator ttl: fix use-after-free
The Alternator TTL scanning code uses an object "scan_ranges_context"
to hold the scanning context. One of the members of this object is
a service::query_state, and that in turn holds a reference to a
service::client_state. The existing constructor created a temporary
client_state object and saved a reference to it - which can result
in use after free as the temporary object is freed as soon as the
constructor ends.

The fix is to save a client_state in the scan_ranges_context object,
instead of a temporary object.

Fixes #19988

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 15f8046fcb)

Closes scylladb/scylladb#20436
2024-09-10 08:43:14 +03:00
Kefu Chai
0eb66cbee5 sstables: correct the debugging message printed when removing temp dir
in 372a4d1b79, we introduced a change
which was for debugging the logging message. but the logging message
intended for printing the temp_dir not prints an `optional<int>`. this
is both confusing, and more importantly, it hurts the debuggability.

in this change, the related change is reverted.

Fixes scylladb/scylladb#20408

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit d26bb9ae30)

Closes scylladb/scylladb#20434
2024-09-10 08:42:29 +03:00
Kefu Chai
a2458f07d7 dist: drop %pretrans section
before this change, if user does not have `/bin/sh` around, when
installing scylla packages, the script in `%pretrans" is executed,
and fails due to missing `/bin/sh`. per
https://docs.fedoraproject.org/en-US/packaging-guidelines/Scriptlets/#pretrans

> Note that the %pretrans scriptlet will, in the particular case of
> system installation, run before anything at all has been installed.
> This implies that it cannot have any dependencies at all. For this
> reason, %pretrans is best avoided, but if used it MUST (by necessity)
> be written in Lua. See
> https://rpm-software-management.github.io/rpm/manual/lua.html for more
> information.

but we were trying to warn users upgrading from scylla < 1.7.3, which
was released 7 years ago at the time of writing.

in this change, we drop the `%pretrans` section. hopefuly they will
find their way out if they still exist.

Fixes scylladb/scylladb#20321

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
(cherry picked from commit 6970c502c9)

Closes scylladb/scylladb#20384
2024-09-10 08:40:11 +03:00
Avi Kivity
b484effcad docs: cql: document ZstdCompressor for CREATE TABLE
Adjust the wording slightly to be less awkward.

(cherry picked from commit 60acfd8c08)

Closes scylladb/scylladb#20380
2024-09-10 08:39:08 +03:00
Raphael S. Carvalho
4c4d1cce14 storage_service: avoid processing same table unnecessarily in split monitor
If there's a token metadata for a given table, and it is in split mode,
it will be registered such that split monitor can look at it, for
example, to start split work, or do nothing if table completed it.

during topology change, e.g. drain, split is stalled since it cannot
take over the state machine.
It was noticed that the log is being spammed with a message saying the
table completed split work, since every tablet metadata update, means
waking up the monitor on behalf of a table. So it makes sense to
demote the logging level to debug. That persists until drain completes
and split can finally complete.

Another thing that was noticed is that during drain, a table can be
submitted for processing faster than the monitor can handle, so the
candidate queue may end up with multiple duplicated entries for same
table, which means unnecessary work. That is fixed by using a
sequenced set, which keeps the current FIFO behavior.

Fixes #20339.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 26facd807e)

Closes scylladb/scylladb#20343
2024-09-10 08:37:20 +03:00
Botond Dénes
c64ae3f839 Merge '[Backport 6.1] repair: throw if batchlog manager isn't initialized' from ScyllaDB
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.

Throw if batchlog manager isn't initialized.

Fixes:  #20236.

Needs backport to 6.0 and 6.1 as they suffer from the uninitialized bm access.

(cherry picked from commit d8e4393418)

(cherry picked from commit f38bb6483a)

 Refs #20251

Closes scylladb/scylladb#20351

* github.com:scylladb/scylladb:
  test: add test to ensure repair won't fail with uninitialized bm
  repair: throw if batchlog manager isn't initialized
2024-09-04 07:02:18 +03:00
Kamil Braun
f77686cefb Merge '[Backport 6.1] Fix node replace with inter-dc encryption enabled.' from Gleb Natapov
Currently if a coordinator and a node being replaced are in the same DC
while inter-dc encryption is enabled (connections between nodes in the
same DC should not be encrypted) the replace operation will fail. It
fails because a coordinator uses non encrypted connection to push raft
data to the new node, but the new node will not accept such connection
until it knows which DC the coordinator belongs to and for that the raft
data needs to be transferred.

The series adds the test for this scenario and the fix for the
chicken&egg problem above.

The series (or at least the fix itself) is needs to be backported because
this is a serious regression.

Fixes: https://github.com/scylladb/scylladb/issues/19025

(cherry picked from commit 84757a4ed3)

(cherry picked from commit b98282a976)

(cherry picked from commit 2f1b1fd45e)

(cherry picked from commit 17f4a151ce)

(cherry picked from commit 32a59ba98f)

Refs https://github.com/scylladb/scylladb/pull/20290

Closes scylladb/scylladb#20374

* github.com:scylladb/scylladb:
  topology coordinator: fix indentation after the last patch
  topology coordinator: do not add replacing node without a ring to topology
  test: add test for replace in clusters with encryption enabled
  test.py: add server encryption support to cluster manager
  .gitignore: fix pattern for resources to match only one specific directory
2024-09-02 16:14:37 +02:00
Gleb Natapov
d6a1a55d6c topology coordinator: fix indentation after the last patch
(cherry picked from commit 32a59ba98f)
2024-09-01 11:57:34 +03:00
Gleb Natapov
9db819763b topology coordinator: do not add replacing node without a ring to topology
When only inter dc encryption is enabled a non encrypted connection
between two nodes is allowed only if both nodes are in the same dc.
If a nodes that initiates the connection knows that dst is in the same
dc and hence use non encrypted connection, but the dst not yet knows the
topology of the src such connection will not be allowed since dst cannot
guaranty that dst is in the same dc.

Currently, when topology coordinator is used, a replacing node will
appear in the coordinator's topology immediately after it is added to the
group0. The coordinator will try to send raft message to the new node
and (assuming only inter dc encryption is enabled and replacing node and
the coordinator are in the same dc) it will try to open regular, non encrypted,
connection to it. But the replacing node will not have the coordinator
in it's topology yet (it needs to sync the raft state for that). so it
will reject such connection.

To solve the problem the patch does not add a replacing node that was
just added to group0 to the topology. It will be added later, when
tokens will be assigned to it. At this point a replacing node will
already make sure that its topology state is up-to-date (since it will
execute a raft barrier in join_node_response_params handler) and it knows
coordinator's topology. This aligns replace behaviour with bootstrap
since bootstrap also does not add a node without a ring to the topology.

The patch effectively reverts b8ee8911ca

Fixes: scylladb/scylladb#19025
(cherry picked from commit 17f4a151ce)
2024-09-01 11:57:25 +03:00
Gleb Natapov
4769e694d1 test: add test for replace in clusters with encryption enabled
(cherry picked from commit 2f1b1fd45e)
2024-09-01 11:56:37 +03:00
Gleb Natapov
74012c562a test.py: add server encryption support to cluster manager
(cherry picked from commit b98282a976)
2024-09-01 11:56:25 +03:00
Gleb Natapov
51215fb7f7 .gitignore: fix pattern for resources to match only one specific directory
(cherry picked from commit 84757a4ed3)
2024-09-01 11:54:42 +03:00
Laszlo Ersek
370bf14872 generic_server: make server::stop() idempotent
After server::shutdown(), make server::stop() more robust too, by allowing
callers (internal or external) to call it several times (not concurrently
though, just yet; see
<https://github.com/scylladb/scylladb/issues/20309>).

Suggested-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 49bff3b1ab)
2024-08-30 16:17:44 +02:00
Laszlo Ersek
860a1872bc generic_server: coroutinize server::shutdown()
By turning server::shutdown() into a coroutine, we need not dynamically
allocate "nr_conn".

Verified as follows:

(1) In terminal #1:

    build/Dev/scylla --overprovisioned --developer-mode=yes \
        --memory=2G --smp=1 --default-log-level error \
        --logger-log-level cql_server=debug:cql_server_controller=debug

> INFO  [...] cql_server_controller - Starting listening for CQL clients
>                                     on 127.0.0.1:9042 (unencrypted,
>                                     non-shard-aware)
> INFO  [...] cql_server_controller - Starting listening for CQL clients
>                                     on 127.0.0.1:19042 (unencrypted,
>                                     shard-aware)

(2) In terminals #2 and #3:

    tools/cqlsh/bin/cqlsh.py

(3) Press ^C in terminal #1:

> DEBUG [...] cql_server - abort accept nr_total=2
> DEBUG [...] cql_server - abort accept 1 out of 2 done
> DEBUG [...] cql_server - abort accept 2 out of 2 done
> DEBUG [...] cql_server - shutdown connection nr_total=4
> DEBUG [...] cql_server - shutdown connection 1 out of 4 done
> DEBUG [...] cql_server - shutdown connection 2 out of 4 done
> DEBUG [...] cql_server - shutdown connection 3 out of 4 done
> DEBUG [...] cql_server - shutdown connection 4 out of 4 done
> INFO  [...] cql_server_controller - CQL server stopped

This patch is best viewed with "git show --word-diff=color".

Suggested-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 1138347e7e)
2024-08-30 16:17:44 +02:00
Laszlo Ersek
9e224136ab generic_server: make server::shutdown() idempotent
Make server::shutdown() more robust by allowing callers (internal or
external) to call it several times (not concurrently though, just yet; see
<https://github.com/scylladb/scylladb/issues/20309>).

Suggested-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 2216275ebd)
2024-08-30 16:17:44 +02:00
Laszlo Ersek
16321fc243 test/generic_server: add test case
Check whether we can stop a generic server without first asking it to
listen.

The test fails currently; the failure mode is a hang, which triggers the 5
minute timeout set in the test:

> unknown location(0): fatal error: in "stop_without_listening":
> seastar::timed_out_error: timedout
> seastar/src/testing/seastar_test.cc(43): last checkpoint
> test/boost/generic_server_test.cc(34): Leaving test case
> "stop_without_listening"; testing time: 300097447us

Backport notes for 6.1:

- Replace

    #include "utils/assert.hh"
    SCYLLA_ASSERT(false);

  with

    #include <cassert>
    assert(false);

  due to 6.1 lacking commit aa1270a00c ("treewide: change assert() to
  SCYLLA_ASSERT()", 2024-08-05). The header file "utils/assert.hh"
  wouldn't be difficult to backport, but separating it from the treewide
  changes in commit aa1270a00c might not be the best idea.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit dbc0ca6354)
2024-08-30 16:17:44 +02:00
Laszlo Ersek
8f0f362a30 configure, cmake: sort the lists of boost unit tests
Both lists were obviously meant to be sorted originally, but by today
we've introduced many instances of disorder -- thus, inserting a new test
in the proper place leaves the developer scratching their head. Sort both
lists.

Backport notes for 6.1:

- Conflicts in "configure.py", unsurprisingly. For the backport, I sorted
  the boost unit test list manually, from scratch.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 931f2f8d73)
2024-08-30 16:16:53 +02:00
Laszlo Ersek
a8131a99ed generic_server: convert connection tracking to seastar::gate
If we call server::stop() right after "server" construction, it hangs:

With the server never listening (never accepting connections and never
serving connections), nothing ever calls server::maybe_stop().
Consequently,

    co_await _all_connections_stopped.get_future();

at the end of server::stop() deadlocks.

Such a server::stop() call does occur in controller::do_start_server()
[transport/controller.cc], when

- cserver->start() (sharded<cql_server>::start()) constructs a
  "server"-derived object,

- start_listening_on_tcp_sockets() throws an exception before reaching
  listen_on_all_shards() (for example because it fails to set up client
  encryption -- certificate file is inaccessible etc.),

- the "deferred_action"

      cserver->stop().get();

  is invoked during cleanup.

(The cserver->stop() call exposing the connection tracking problem dates
back to commit ae4d5a60ca ("transport::controller: Shut down distributed
object on startup exception", 2020-11-25), and it's been triggerable
through the above code path since commit 6b178f9a4a
("transport/controller: split configuring sockets into separate
functions", 2024-02-05).)

Tracking live connections and connection acceptances seems like a good fit
for "seastar::gate", so rewrite the tracking with that. "seastar::gate"
can be closed (and the returned future can be waited for) without anyone
ever having entered the gate.

NOTE: this change makes it quite clear that neither server::stop() nor
server::shutdown() must be called multiple times. The permitted sequences
are:

- server::shutdown() + server::stop()

- or just server::stop().

Fixes #10305

Backport notes for 6.1:

- Conflict in "generic_server.hh", due to 6.1 not having commit
  324b3c43c0 ("generic_server: use async function in
  `for_each_gently()`", 2024-08-08), which is part of the feature series
  "service levels: update connections parameters automatically"
  <https://github.com/scylladb/scylladb/pull/19085>.

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
(cherry picked from commit 5a04743663)
2024-08-30 16:03:51 +02:00
Aleksandra Martyniuk
93fbe3af12 test: add test to ensure repair won't fail with uninitialized bm
(cherry picked from commit f38bb6483a)
2024-08-30 13:55:48 +00:00
Aleksandra Martyniuk
b164ea4a68 repair: throw if batchlog manager isn't initialized
repair_service::repair_flush_hints_batchlog_handler may access batchlog
manager while it is uninitialized.

Batchlog manager cannot be initialized before repair as we have the
dependencies chain:
repair_service -> storage_service::join_cluster -> batchlog_manager.

Throw if batchlog manager isn't initialized. That won't cause repair
to fail.

(cherry picked from commit d8e4393418)
2024-08-30 13:55:48 +00:00
Jenkins Promoter
2db808e364 Update ScyllaDB version to: 6.1.2 2024-08-29 15:13:24 +03:00
Botond Dénes
e6d2d29dd1 Merge '[Backport 6.1] repair: do_rebuild_replace_with_repair: use source_dc only when safe' from ScyllaDB
It is unsafe to restrict the sync nodes for repair to the source data center if it has too low replication factor in network_topology_replication_strategy, or if other nodes in that DC are ignored.

Also, this change restricts the usage of source_dc to `network_topology` and `everywhere_topology`
strategies, as with simple replication strategy
there is no guarantee that there would be any
more replicas in that data center.

Fixes #16826

Reproducer submitted as https://github.com/scylladb/scylla-dtest/pull/3865
It fails without this fix and passes with it.

* Requires backport to live versions.  Issue hit in the filed with 2022.2.14

(cherry picked from commit 8b1877f3ca)

(cherry picked from commit 0419b1d522)

(cherry picked from commit b5d0ab092c)

(cherry picked from commit 9729dd21c3)

(cherry picked from commit 8665eef98c)

(cherry picked from commit 5f655e41e3)

 Refs #16827

Closes scylladb/scylladb#20228

* github.com:scylladb/scylladb:
  raft_rebuild: propagate source_dc force option to rebuild_option
  repair: do_rebuild_replace_with_repair: use source_dc only when safe
  repair: replace_with_repair: pass the replace_node downstream
  repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
  repair: replace_rebuild_with_repair: pass ks_erms from caller
  nodetool: rebuild: add force option
  Add and use utils::optional_param to pass source_dc
2024-08-29 07:35:05 +03:00
Lakshmi Narayanan Sreethar
01661e1eaa test/pylib: fix keyspace_compaction method
The `keyspace_compaction` method incorrectly appends the column family
parameter to the URL using a regular string, `"?cf={table}"`, instead of
an f-string, `f"?cf={table}"`. As a result, the column family name is
sent as `{table}` to the server, causing the compaction request to fail.
Fix this issue by passing the parameter to the POST request using a
dictionary instead of appending it to the URL.

Fixes #20264

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit dc5c45e803)

Closes scylladb/scylladb#20273
2024-08-28 20:08:58 +03:00
Botond Dénes
6232982772 Merge '[Backport 6.1] select from mutation_fragments() + tablets: handle reads for non-owned partitions' from ScyllaDB
Attempting to read a partition via `SELECT * FROM MUTATION_FRAGMENTS()`, which the node doesn't own, from a table using tablets causes a crash.
This is because when using tablets, the replica side simply doesn't handle requests for un-owned tokens and this triggers a crash.
We should probably improve how this is handled (an exception is better than a crash), but this is outside the scope of this PR.
This PR fixes this and also adds a reproducer test.

Fixes: https://github.com/scylladb/scylladb/issues/18786

Fixes a regression introduced in 6.0, so needs backport to 6.0 and 6.1

(cherry picked from commit de5329157c)

(cherry picked from commit 46563d719f)

(cherry picked from commit 4e2d7aa2a2)

 Refs #20109

Closes scylladb/scylladb#20313

* github.com:scylladb/scylladb:
  test/tablets: Test that reading tablets' mutations from MUTATION_FRAGMENTS works
  replica/mutation_dump: enfore pinning of effective replication map
  replica/mutation_dump: handle un-owned tokens (with tablets)
2024-08-28 06:23:45 +03:00
Botond Dénes
6418787ee0 Merge '[Backport 6.1] Make Summary support histogram with infinite bucket vlaues' from ScyllaDB
This series fixes an issue where histogram Summaries return an infinite value.

It updated the quantile calculation logic to address cases where values fall into the infinite bucket of a histogram.
Now, instead of returning infinite (max int), the calculation will return the last bucket limit, ensuring finite outputs in all cases.

The series adds a test for summaries with a specific test case for this scenario.

Fixes #20255
Need backport to 6.0, 6.1 and 2023.1 and above

(cherry picked from commit 011aa91a8c)

(cherry picked from commit 644e6f0121)

 Refs #20257

Closes scylladb/scylladb#20303

* github.com:scylladb/scylladb:
  test/estimated_histogram_test Add summary tests
  utils/histogram.hh: Make summary support inifinite bucket.
2024-08-28 06:23:03 +03:00
Botond Dénes
06d6cf5608 Merge '[Backport 6.1] abstract_replication_strategy: make get_ranges async' from ScyllaDB
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.

Fixes #19757

(cherry picked from commit d385219a12)

(cherry picked from commit 333c0d7c88)

(cherry picked from commit b2abbae24b)

(cherry picked from commit 824bdf99d2)

(cherry picked from commit ea5a0cca10)

(cherry picked from commit 2bbbe2a8bc)

(cherry picked from commit 686a8f2939)

 Refs #19758

Closes scylladb/scylladb#20297

* github.com:scylladb/scylladb:
  abstract_replication_strategy: make get_ranges async
  database: get_keyspace_local_ranges: get vnode_effective_replication_map_ptr param
  compaction: task_manager_module: open code maybe_get_keyspace_local_ranges
  alternator: ttl: token_ranges_owned_by_this_shard: let caller make the ranges_holder
  alternator: ttl: can pass const gms::gossiper& to ranges_holder
  alternator: ttl: ranges_holder_primary: unconstify _token_ranges member
  alternator: ttl: refactor token_ranges_owned_by_this_shard
2024-08-28 06:22:33 +03:00
Botond Dénes
1f8d8fd3db Merge '[Backport 6.1] replica: fix copy constructor of tablet_sstable_set' from ScyllaDB
Commit 9f93dd9fa3 changed `tablet_sstable_set::_sstable_sets` to be a `absl::flat_hash_map` and in addition, `std::set<size_t> _sstable_set_ids` was added. `_sstable_set_ids` is set up in the `tablet_sstable_set(schema_ptr s, const storage_group_manager& sgm, const locator::tablet_map& tmap)` constructor, but it is not copied in `tablet_sstable_set(const tablet_sstable_set& o)`.

This affects the `tablet_sstable_set::tablet_sstable_set` method as it depends on the copy constructor. Since sstable set can be cloned when a new sstable set is added, the issue will cause ids not being copied into the new sstable set. It's healed only after compaction, since the sstable set is rebuilt from scratch there.

This PR fixes this issue by removing the existing copy constructor of `tablet_sstable_set` to enable the implicit default copy constructor.

Fixes #19519

(cherry picked from commit 44583eed9e)

(cherry picked from commit ec47b50859)

 Refs #20115

Closes scylladb/scylladb#20201

* github.com:scylladb/scylladb:
  boost/sstable_set_test: add testcase to test tablet_sstable_set copy constructor
  replica: fix copy constructor of tablet_sstable_set
2024-08-28 06:20:12 +03:00
Pavel Emelyanov
bc03d13c76 test/tablets: Test that reading tablets' mutations from MUTATION_FRAGMENTS works
Currently it doesn't, one of the node crashes with std::out_of_range
exception and meaningless calltrace

[Botond]: this test checks the case of reading a partition via
MUTATION_FRAGMENTS from a node which doesn't own said partition.

refs: #18786

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 4e2d7aa2a2)
2024-08-27 23:43:14 +00:00
Botond Dénes
4b4dbc1112 replica/mutation_dump: enfore pinning of effective replication map
By making it a required argument, making sure the topology version is
pinned for the duration of the query. This is needed because mutation
dump queries bypass the storage proxy, where this pinning usually takes
place. So it has to be enforced here.

(cherry picked from commit 46563d719f)
2024-08-27 23:43:14 +00:00
Botond Dénes
739be17801 replica/mutation_dump: handle un-owned tokens (with tablets)
When using tablets, the replica-side doesn't handle un-owned tokens.
table::shard_for_reads() will just return 0 for un-owned tokens, and a
later attempt at calling table::storage_group_for_token() with said
un-owned token will cause a crash (std::terminate due to
std::out_of_range thrown in noexcept context).
The replicas rely on the coordinator to not send stray requests, but for
select from mutation_fragments(table) queries, there is no coordinator
side who could do the correct dispatching. So do this in
mutation_dump(), just creating empty readers for un-owned tokens.

(cherry picked from commit de5329157c)
2024-08-27 23:43:14 +00:00
Tomasz Grabiec
7fc15ce200 Merge '[Backport 6.1] schema_tables: calculate_schema_digest: prevent stalls due to large m…' from ScyllaDB
…utations vector

With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls when freed.

For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
 (inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
 (inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```

This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time, and by that, reducing memory footprint and preventing reactor stalls.

Fixes #18173

(cherry picked from commit 95a5fba0ea)

(cherry picked from commit 52234214e5)

 Refs #18174

Closes scylladb/scylladb#20246

* github.com:scylladb/scylladb:
  schema_tables: calculate_schema_digest: filter the key earlier
  schema_tables: calculate_schema_digest: prevent stalls due to large mutations vector
2024-08-27 21:42:35 +02:00
Benny Halevy
164d58b0d5 raft_rebuild: propagate source_dc force option to rebuild_option
Currently, the `force` property of the `source_dc` rebuild option
is lost and `raft_topology_cmd_handler` has no way to know
if it was given or not.

This in turn can cause rebuild to fail, even when `--force`
is set by the user, where it would succeed with gossip
topology changes, based on the source_dc --force semantics.

\Fixes scylladb/scylladb#20242

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

\Closes scylladb/scylladb#20249

(cherry picked from commit 18c45f7502)

Closes scylladb/scylladb#20311
2024-08-27 22:20:48 +03:00
Aleksandra Martyniuk
0839df3dbf replica: add/remove table atomically
Currently, database::tables_metadata::add_table needs to hold a write
lock before adding a table. So, if we update other classes keeping
track of tables before calling add_table, and the method yields,
table's metadata will be inconsistent.

Set all table-related info in tables_metadata::add_table_helper (called
by add_table) so that the operation is atomic.

Analogically for remove_table.

Fixes: #19833.
(cherry picked from commit 483d89ed6d)

Closes scylladb/scylladb#20244
2024-08-27 20:46:48 +03:00
Amnon Heiman
64befbca61 test/estimated_histogram_test Add summary tests
This patch adds tests for summary calculation. It adds two tests, the
first is a basic calculation for P50, P95, P99 by adding 100 elements
into 20 buckets.

The second test look that if elements are found in the infinite bucket,
the result would be the lower limit (33s) and not infinite.

Relates to #20255

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 644e6f0121)
2024-08-27 12:12:39 +00:00
Amnon Heiman
8ee09f4353 utils/histogram.hh: Make summary support inifinite bucket.
This patch handles an edge cases related to The infinite bucket  
limit.

Summaries are the P50, P95, and P99 quantiles.

The quantiles are calculated from a histogram; we find the bucket and
return its upper limit.

In classic histograms, there is a notion of the infinite bucket;
anything that does not fall into the last bucket is considered to be
infinite;

with quantile, it does not make sense. So instead of reporting infinite
we'll report the bucket lower limit.

Fixes #20255

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 011aa91a8c)
2024-08-27 12:12:39 +00:00
Botond Dénes
e84d8b1205 Merge '[Backport 6.1] cql: process LIMIT for GROUP BY select queries' from ScyllaDB
This change fixes #17237, fixes #5361 and fixes #5362 by passing the limit value down the call chain in cql3. A test is also added.

fixes: #17237
fixes: #5361
fixes: #5362

The regression happened in 5.4 as we changed the way GROUP BY is processed in 432cb02 - to force aggregation when it is used. The LIMIT value was not passed to aggregations and thus we failed to adhere to it.

W want to backport this fix to 5.4 and 6.0 to have continuous correct results for the test case from #17237

This patch consists of 4 commits:
- fa4225ea0fac2057b7a9976f57dc06bcbd900cd4 - cql3: respect the user-defined page size in aggregate queries - a precondition for this patch to be implementable
- 8fbe69e74dca16ed8832d9a90489ca47ba271d0b - cql3/select_statement: simplify the get_limit function - the `do_get_limit()` function did a lot of legwork that should not be associated with it. This change makes it trivial and makes its callers do additional checks (for unset guards, or for an aggregate query)
- 162828194a2b88c22fbee335894ff045dcc943c9 - cql3: process LIMIT for GROUP BY queries - pass the limit value down the chain and make use of it. This is the actual fix to #17237
- b3dc6de6d6cda8f5c09b01463bb52f827a6a00b4 - test/cql-pytest: Add test for GROUP BY queries with LIMIT - tests

(cherry picked from commit 08f3219cb8)

(cherry picked from commit 3838ad64b3)

(cherry picked from commit e7ae7f3662)

(cherry picked from commit 9db272c949)

 Refs: #18842

Closes scylladb/scylladb#20154

* github.com:scylladb/scylladb:
  test/cql-pytest: Add test for GROUP BY queries with LIMIT
  cql3: process LIMIT for GROUP BY queries
  cql3/select_statement: simplify the get_limit function
  cql3: respect the user-defined page size in aggregate queries
2024-08-27 14:52:18 +03:00
Benny Halevy
6692c1702d abstract_replication_strategy: make get_ranges async
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.

Fixes #19757

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 686a8f2939)
2024-08-26 21:50:39 +00:00
Benny Halevy
415bdf3160 database: get_keyspace_local_ranges: get vnode_effective_replication_map_ptr param
Prepare for making the function async.
Then, it will need to hold on to the erm while getting
the token_ranges asynchronously.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2bbbe2a8bc)
2024-08-26 21:50:39 +00:00
Benny Halevy
6b2d0f5934 compaction: task_manager_module: open code maybe_get_keyspace_local_ranges
It is used only here and can be simplified by
checking if the keyspace replication strategy
is per table by the caller.

Prepare for making get_keyspace_local_ranges async.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ea5a0cca10)
2024-08-26 21:50:39 +00:00
Benny Halevy
0f990a8dc5 alternator: ttl: token_ranges_owned_by_this_shard: let caller make the ranges_holder
Add static `make` methods to ranges_holder_{primary,secondary}
and use them to make the ranges objects and pass them
to `token_ranges_owned_by_this_shard`, rather than letting
token_ranges_owned_by_this_shard invoke the right constructor
of the ranges_holder class.

Prepare for making `make` async.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 824bdf99d2)
2024-08-26 21:50:39 +00:00
Benny Halevy
5f8b199253 alternator: ttl: can pass const gms::gossiper& to ranges_holder
There's no need to pass a mutable reference to
the gossiper.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b2abbae24b)
2024-08-26 21:50:38 +00:00
Benny Halevy
2288f98d83 alternator: ttl: ranges_holder_primary: unconstify _token_ranges member
To allow the class to be nothrow_move_constructable.
Prepare for returning it as a future value.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 333c0d7c88)
2024-08-26 21:50:38 +00:00
Benny Halevy
3ed214a728 alternator: ttl: refactor token_ranges_owned_by_this_shard
Rather than holding a variant member (and defining
both ranges_holder_{primary,secondary} in both
specilizations of the class, just make the internal
ranges_holder class first-class citizens
and parameterize the `token_ranges_owned_by_this_shard`
template by this class type.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit d385219a12)
2024-08-26 21:50:38 +00:00
Michał Jadwiszczak
b7e6f22999 cql3/statements/create_service_level: forbid creating SL starting with $
Tenant names starting with `$` are reserved for internal ones.
Forbid creating new service level which name starts with `$`
and log a warning for existing service levels with `$` prefix.

(cherry picked from commit d729d1b272)

Closes scylladb/scylladb#20156
2024-08-26 13:03:16 +03:00
Benny Halevy
31f3ff37f4 schema_tables: calculate_schema_digest: filter the key earlier
Currently, each frozen mutation we get from
system_keyspace::query_mutations is unfrozen in whole
to a mutation and only then we check its key with
the provided `accept_keyspace` function.

This is wasteful, since they key can be processed
directly form the frozen mutation, before taking
the toll of unfreezing it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 52234214e5)
2024-08-22 09:06:26 +00:00
Benny Halevy
828595786a schema_tables: calculate_schema_digest: prevent stalls due to large mutations vector
With a large number of table the schema mutations
vector might get big enoug to cause reactor stalls
when freed.

For example, the following stall was hit on
2023.1.0~rc1-20230208.fe3cc281ec73 with 5000 tables:
```
 (inlined by) ~vector at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_vector.h:730
 (inlined by) db::schema_tables::calculate_schema_digest(seastar::sharded<service::storage_proxy>&, enum_set<super_enum<db::schema_feature, (db::schema_feature)0, (db::schema_feature)1, (db::schema_feature)2, (db::schema_feature)3, (db::schema_feature)4, (db::schema_feature)5, (db::schema_feature)6, (db::schema_feature)7> >, seastar::noncopyable_function<bool (std::basic_string_view<char, std::char_traits<char> >)>) at ./db/schema_tables.cc:799
```

This change returns a mutations generator from
the `map` lambda coroutine so we can process them
one at a time, destroy the mutations one at a time,
and by that, reducing memory footprint and preventing
reactor stalls.

Fixes #18173

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 95a5fba0ea)
2024-08-22 09:06:25 +00:00
Benny Halevy
fdbb0cdef3 repair: do_rebuild_replace_with_repair: use source_dc only when safe
It is unsafe to restrict the sync nodes for repair to
the source data center if we cannot guarantee a quorum
in the data center with network-topology replication strategy.

This change restricts the usage of source_dc in the following cases:
1. For SimpleStrategy - source_dc is ignored since there is no guarantee
that it contains remaining replicas for all tokens.
2. For EverywhereStrategy - use source_dc if there are remaining
live nodes in the datacenter.
3. For NetworkTopologyStrategy:
a. It is considered unsafe to use source_dc if number of nodes
   lost in that DC (replaced/rebuilt node + additional ignored nodes)
   is greater than 1, or it has 1 lost node and rf <= 1 in the DC.

b. If the source_dc arg is forced, as with the new
   `nodetool rebuild --force <source_dc>` option,
   we use it anyway, even if it's considered to be unsafe.
   A warning is printed in this case.

c. If the source_dc arg is user-provided, (using nodetool rebuild),
   an error exception is thrown, advising to use an alternative dc,
   if available, omit source_dc to sync with all nodes, or use the
   --force option to use the given source_dc anyhow.

d. Otherwise, we look for an alternative source datacenter,
   that has not lost any node. If such datacenter is found
   we use it as source_dc for the keyspace, and log a warning.

e. If no alternative dc is found (and source_dc is implicit), then:
   log a warning and fall back to using replicas from all nodes in the cluster.

Fixes #16826

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5f655e41e3)
2024-08-21 16:09:25 +03:00
Benny Halevy
912c46e07f repair: replace_with_repair: pass the replace_node downstream
To be used by the next path to count how many nodes
are lost in each datacenter.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 8665eef98c)
2024-08-21 15:49:39 +03:00
Benny Halevy
e80c587da3 repair: replace_with_repair: pass ignore_nodes as a set of host_id:s
The callers already pass ignore_nodes as host_id:s
and we translate them into inet_address only for repair
so delay the translation as much as posible,

Refs scylladb/scylladb#6403

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9729dd21c3)
2024-08-21 15:41:42 +03:00
Benny Halevy
485a508cb3 repair: replace_rebuild_with_repair: pass ks_erms from caller
The keyspaces replication maps must be in sync with the
token_metadata_ptr passed already to the functions,
so instead of getting it in the callee, let the caller
get the ks_erms along with retrieving the tmptr.

Note that it's already done on the rebuild path
for streaming based rebuild.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit b5d0ab092c)
2024-08-21 14:42:09 +03:00
Anna Stuchlik
1683b07d2e doc: extract the info about tablets defaut to a separate file
This commit extracts the information about the default for tables in keyspace creation
to a separate file in the _common folder. The file is then included using
the scylladb_include_flag directive.

The purpose of this commit is to make it possible to include a different file
in the scylla-enterprise repo - with a different default.

Refs https://github.com/scylladb/scylla-enterprise/issues/4585

(cherry picked from commit 107708434c)

Closes scylladb/scylladb#20220
2024-08-21 11:07:19 +03:00
David Garcia
853d2ec76f docs: improve include flag directive
The include flag directive now treats missing content as info logs instead of warnings. This prevents build failures when the enterprise-specific content isn't yet available.

If the enterprise content is undefined, the directive automatically loads the open-source content. This ensures the end user has access to some content.

address comments

(cherry picked from commit 30887d096f)

Closes scylladb/scylladb#20226
2024-08-21 10:20:21 +03:00
Botond Dénes
0b1dbb3a64 Update tools/java submodule
* tools/java 33938ec1...27999135 (1):
  > cassandra-stress: Make default repl. strategy NetworkTopologyStrategy

Fixes: scylladb/scylla-tools-java#400

Closes scylladb/scylladb#20199
2024-08-21 10:02:59 +03:00
Benny Halevy
e13d5ee834 nodetool: rebuild: add force option
To be used to force usage of source_dc, even
when it is unsafe for rebuild.

Update docs and add test/nodetool/test_rebuild.py

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0419b1d522)
2024-08-21 09:37:14 +03:00
Benny Halevy
505cad64ad Add and use utils::optional_param to pass source_dc
Clearly indicate if a source_dc is provided,
and if so, was it explicitly given by the user,
or was implicitly selected by scylla.

This will become useful in the next patches
that will use that to either reject the operation
if it's unsafe to use the source_dc and the dc was
explicitly given by the user, or whether
to fallback to using all nodes otherwise.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 8b1877f3ca)
2024-08-21 09:35:13 +03:00
Raphael S. Carvalho
d65961d8cf compaction: Allow "offline" sstable to be split
In order to fix the race between split and repair, we must introduce
the ability to split an "offline" sstable, one that wasn't added
to any of the table's sstable set yet.

It's not safe to split a sstable after adding it to the set, because
a failure to split can result in unsplit data left in the set, causing
split to fail down the road, since the coordinator thinks this replica
has only split data in the set.

Refs #19378.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 239344ab55)
2024-08-20 10:38:36 +00:00
Anna Stuchlik
4b88ec4722 doc: fix a link on the RBAC page
This commit fixes an external link on the Role Based Access Control page.

Fixes https://github.com/scylladb/scylladb/issues/20166

(cherry picked from commit c56c3ce469)

Closes scylladb/scylladb#20202
2024-08-19 15:29:54 +03:00
Lakshmi Narayanan Sreethar
13aa97a00f boost/sstable_set_test: add testcase to test tablet_sstable_set copy constructor
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit ec47b50859)
2024-08-19 12:11:50 +00:00
Lakshmi Narayanan Sreethar
c336ee63a3 replica: fix copy constructor of tablet_sstable_set
Remove the existing copy constructor to enable the use of the implicit
copy constructor. This fixes the issue of `_sstable_set_ids` not being
copied in the current copy constructor.

Fixes #19519

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 44583eed9e)
2024-08-19 12:11:50 +00:00
Dawid Medrek
8d90b81766 db/hints: Make commitlog use commitlog IO scheduling group
Before these changes, we didn't specify which I/O scheduling
group commitlog instances in hinted handoff should use.
In this commit, we set it explicitly to the commitlog
scheduling group. The rationale for this choice is the fact
we don't want to cause a bottleneck on the write path
-- if hints are written too slowly, new incoming mutations
(NOT hints) might be rejected due to a too high number
of hints currently being written to disk; see
`storage_proxy::create_write_response_handler_helper()`
for more context.

(cherry picked from commit 6a7fb18b52)

Closes scylladb/scylladb#20093
2024-08-14 22:14:43 +03:00
Raphael S. Carvalho
bc0097688f replica: Fix race between split compaction and migration
After removal of rwlock (53a6ec05ed), the race was introduced because the order that
compaction groups of a tablet are closed, is no longer deterministic.

Some background first:
Split compaction runs in main (unsplit) group, and adds sstable to left and right groups
on completion.

The race works as follow:
1) split compaction starts on main group of tablet X
2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel
3) left or right group are closed before main (more likely when only main has flush work to do)
4) split compaction completes, and adds sstable to left and right
5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that
happens in row cache update's execute(), node crashes.

The problem manifested as follow:
[shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0
...
[shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...]
...
[shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found),
at: ...
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(...
   --------
   seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::(anonymous namespace)::thread_wake_task
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(...
   seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu...

From the log above, it can be seen cache update failure happens under streaming sched group and
during compaction completion, which was good evidence to the cause.
Problem was reproduced locally with the help of tablet shuffling.

Fixes: #19873.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 5af1f41ecd)

Closes scylladb/scylladb#20107
2024-08-14 22:13:53 +03:00
Aleksandra Martyniuk
69c1a0e2ca repair: use find_column_family in insert_repair_meta
repair_service::insert_repair_meta gets the reference to a table
and passes it to continuations. If the table is dropped in the meantime,
the reference becomes invalid.

Use find_column_family at each table occurrence in insert_repair_meta
instead.

Fixes: #20057

(cherry picked from commit 719999b34c)

Refs #19953

Closes scylladb/scylladb#20076
2024-08-14 20:54:12 +03:00
Avi Kivity
c382e19e5e Merge '[Backport 6.1] Prevent ALTERing non-existing KS with tablets' from ScyllaDB
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required to execute this req,
2. global topo req is executed by topo coordinator, which reads data attached to the req.

The KS name is among the data attached to the req. There's a time window between these steps where a to-be-altered KS could have been DROPped, which results in topo coordinator forever trying to ALTER a non-existing KS. In order to avoid it, the code has been changed to first check if a to-be-altered KS exists, and if it's not the case, it doesn't perform any schema/tablets mutations, but just removes the global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected changes, which is due to the fact that the code is written badly and needs to be refactored - an effort that's already planned under #19126
(I suggest to disable displaying whitespace differences when reviewing this PR).

Fixes: #19576

Requires 6.0 backport

(cherry picked from commit 5b089d8e10)

(cherry picked from commit 0ea2128140)

(cherry picked from commit ddb5204929)

 Refs #19666

Closes scylladb/scylladb#20143

* github.com:scylladb/scylladb:
  tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
  cql: refactor rf_change indentation
  Prevent ALTERing non-existing KS with tablets
2024-08-14 20:16:55 +03:00
Michał Chojnowski
b786e6a39a cql_test_env: ensure shutdown() before stop() for system_keyspace
If system_keyspace::stop() is called before system_keyspace::shutdown(),
it will never finish, because the uncleared shared pointers will keep
it alive indefinitely.

Currently this can happen if an exception is thrown before the construction
of the shutdown() defer. This patch moves the shutdown() call to immediately
before stop(). I see no reason why it should be elsewhere.

Fixes scylladb/scylla-enterprise#4380

(cherry picked from commit eeaf4c3443)

Closes scylladb/scylladb#20145
2024-08-14 20:16:29 +03:00
Paweł Zakrzewski
3286c14d76 test/cql-pytest: Add test for GROUP BY queries with LIMIT
Remove xfail from all tests for #5361, as the issue is fixed.

Remove xfail from test_group_by_clustering_prefix_with_limit
It references #5362, but is fixed by #17237.

Refs #17237

(cherry picked from commit 9db272c949)
2024-08-14 16:56:20 +00:00
Paweł Zakrzewski
1773dd5632 cql3: process LIMIT for GROUP BY queries
Currently LIMIT not passed to the query executor at all and it was just
an accident that it worked for the case referenced in #17237. This
change passes the limit value down the chain.

(cherry picked from commit e7ae7f3662)
2024-08-14 16:56:20 +00:00
Paweł Zakrzewski
c1292c69cf cql3/select_statement: simplify the get_limit function
The get_limit() function performed tasks outside of its scope - for
example checked if the statement was an aggregate. This change moves the
onus of the check to the caller.

(cherry picked from commit 3838ad64b3)
2024-08-14 16:56:20 +00:00
Paweł Zakrzewski
f27edaa19c cql3: respect the user-defined page size in aggregate queries
The comment in the code already states that we should use the
user-defined page size if it's provided. To avoid OOM conditions we'll
use the internally defined limit as the upper bound or if no page size
is provided.

This change lays ground work for fixing #5362 and is necessary to pass
the test introduced in #19392 once it is implemented.

(cherry picked from commit 08f3219cb8)
2024-08-14 16:56:19 +00:00
Piotr Smaron
706761d8ec tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
Using the error injection framework, we inject a sleep into the
processing path of ALTER tablets KS, so that the topology coordinator of
the leader node
sleeps after the rf_change event has been scheduled, but before it is
started to be executed. During that time the second node executes a DROP
KS statement, which is propagated to the leader node. Once leader node
wakes up and resumes processing of ALTER tablets KS, the KS won't exist
and the node cannot crash, which was the case before.

(cherry picked from commit ddb5204929)
2024-08-14 10:37:25 +00:00
Piotr Smaron
41e4c39087 cql: refactor rf_change indentation
(cherry picked from commit 0ea2128140)
2024-08-14 10:37:24 +00:00
Piotr Smaron
d5bdef9ee5 Prevent ALTERing non-existing KS with tablets
ALTER tablets KS executes in 2 steps:
1. ALTER KS's cql handler forms a global topo req, and saves data required
   to execute this req,
2. global topo req is executed by topo coordinator, which reads data
   attached to the req.

The KS name is among the data attached to the req.
There's a time window between these steps where a to-be-altered KS could
have been DROPped, which results in topo coordinator forever trying to
ALTER a non-existing KS. In order to avoid it, the code has been changed
to first check if a to-be-altered KS exists, and if it's not the case,
it doesn't perform any schema/tablets mutations, but just removes the
global topo req from the coordinator's queue.
BTW. just adding this extra check resulted in broader than expected
changes, which is due to the fact that the code is written badly and
needs to be refactored - an effort that's already planned under #19126

Fixes: #19576
(cherry picked from commit 5b089d8e10)
2024-08-14 10:37:24 +00:00
Jenkins Promoter
a4dcf3956e Update ScyllaDB version to: 6.1.1 2024-08-14 12:28:43 +03:00
Anna Stuchlik
858fa914b1 doc: update Raft info in 6.1
This commit updates the Raft information regarding the Raft verification procedure.
In 6.1, the procedure is no longer related to the upgrade.

Fixes https://github.com/scylladb/scylladb/issues/19932

(cherry picked from commit 705e53d223)

Closes scylladb/scylladb#20083
2024-08-11 11:37:05 +03:00
Kamil Braun
ec923171a6 storage_service: raft topology: warn when raft_topology_cmd_handler fails due to abort
Currently we print an ERROR on all exceptions in
`raft_topology_cmd_handler`. This log level is too high, in some cases
exceptions are expected -- like during shutdown. And it causes dtest
failures.

Turn exceptions from aborts into WARN level.

Also improve logging by printing the command that failed.

Fixes scylladb/scylladb#19754

(cherry picked from commit 7506709573)

Closes scylladb/scylladb#20071
2024-08-08 18:13:53 +02:00
Tomasz Grabiec
0144549cd6 tablets: Do not allocate tablets on nodes being decommissioned
If tablet-based table is created concurrently with node being
decommissioned after tablets are already drained, the new table may be
permanently left with replicas on the node which is no longer in the
topology. That creates an immidiate availability risk because we are
running with one replica down.

This also violates invariants about replica placement and this state
cannot be fixed by topology operations.

One effect is that this will lead to load balancer failure which will
inhibit progress of any topology operations:

  load_balancer - Replica 154b0380-1dd2-11b2-9fdd-7156aa720e1a:0 of tablet 7e03dd40-537b-11ef-9fdd-7156aa720e1a:1 not found in topology, at:  ...

Fixes #20032

(cherry picked from commit f5c74a5df2)

Closes scylladb/scylladb#20066
2024-08-08 11:56:13 +03:00
Kamil Braun
0f246bfbc9 raft topology: improve logging
Add more logging for raft-based topology operations in INFO and DEBUG
levels.

Improve the existing logging, adding more details.

Fix a FIXME in test_coordinator_queue_management (by readding a log
message that was removed in the past -- probably by accident -- and
properly awaiting for it to appear in test).

Enable group0_state_machine logging at TRACE level in tests. These logs
are relatively rare (group 0 commands are used for metadata operations)
and relatively small, mostly consist of printing `system.group0_history`
mutation in the applied command, for example:
```
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}}
```
note that the mutation contains a human-readable description of the
command -- like "create system_distributed keyspace" above.

These logs might help debugging various issues (e.g. when `apply` hangs
waiting for read_apply mutex, or takes too long to apply a command).

Ref: scylladb/scylladb#19105
Ref: scylladb/scylladb#19945
(cherry picked from commit e8d5974961)

Closes scylladb/scylladb#20048
2024-08-07 13:39:30 +02:00
Anna Stuchlik
1a1583a5b6 doc: add post-installation configuration to the Web Installer page
This commit extracts the information about the configuration the user should do
right after installation (especially running scylla_setup) to a separate file.
The file is included in the relevant pages, i.e., installing with packages
and installing with Web Installer.

In addition, the examples on the Web Installer page are updated
with supported versions of ScyllaDB.

Fixes https://github.com/scylladb/scylladb/issues/19908

(cherry picked from commit 849856b964)

Closes scylladb/scylladb#20050
2024-08-07 10:14:13 +03:00
Botond Dénes
f78b88b59b Merge '[Backport 6.1] db/view: drop view updates to replaced node marked as left' from ScyllaDB
When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address.

This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0.

As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas.

In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue.

Fixes: scylladb/scylladb#19439

This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there.

(cherry picked from commit 6af7882c59)

(cherry picked from commit 5ec8c06561)

 Refs #19765

Closes scylladb/scylladb#19895

* github.com:scylladb/scylladb:
  test: regression test for MV crash with tablets during decommission
  db/view: drop view updates to replaced node marked as left
2024-08-07 09:18:26 +03:00
Tzach Livyatan
73d46ec548 Improve tombstone_compaction_interval description
(cherry picked from commit 861a1cedea)

Closes scylladb/scylladb#20025
2024-08-07 09:06:56 +03:00
Tzach Livyatan
dcee7839d4 Update tracing.rst - fix table node_slow_log_time name
(cherry picked from commit 858fd4d183)

Closes scylladb/scylladb#20023
2024-08-07 09:05:50 +03:00
Anna Stuchlik
75477f5661 doc: add OS support for version 6.1
This commit adds OS support for version 6.1 and removes OS support for 5.4
(according to our support policy for versions).

(cherry picked from commit eca2dfd8c3)

Closes scylladb/scylladb#20019
2024-08-07 09:04:13 +03:00
Nadav Har'El
78d7c953b0 test: increase timeouts for /localnodes test
In commit bac7c33313 we introduced a new
test for the Alternator "/localnodes" request, checking that a node
that is still joining does not get returned. The tests used what I
thought were "very high" timeouts - we had a timeout of 10 seconds
for starting a single node, and injected a 20 second sleep to leave
us 10 seconds after the first sleep.

But the test failed in one extremely slow run (a debug build on
aarch64), where starting just a single node took more than 15 seconds!

So in this patch I increase the timeouts significantly: We increase
the wait for the node to 60 seconds, and the sleeping injection to
120 seconds. These should definitely be enough for anyone (famous
last words...).

The test doesn't actually wait for these timeouts, so the ridiculously
high timeouts shouldn't affect the normal runtime of this test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit ca8b91f641)

Closes scylladb/scylladb#19940
2024-08-07 08:55:23 +03:00
Nadav Har'El
753fc87efa alternator: exclude CDC log table from ListTables
The Alternator command ListTables is supposed to list actual tables
created with CreateTable, and should list things like materialized views
(created for GSI or LSI) or CDC log tables.

We already properly excluded materialized views from the list - and
had the tests to prove it - but forgot both the exclusion and the testing
for CDC log tables - so creating a table xyz with streams enable would
cause ListTables to also list "xyz_scylla_cdc_log".

This patch fixes both oversights: It adds the code to exclude CDC logs
from the output of ListTables, add adds a test which reproduces the bug
before this fix, and verifies the fix works.

Fixes #19911.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit d293a5787f)

Closes scylladb/scylladb#19938
2024-08-07 08:54:08 +03:00
Benny Halevy
c75dbc1f9c sstable_directory: delete_atomically: allow sstables from multiple prefixes
Currently, delete_atomically can be called with
a list of sstables from mixed prefixes in two cases:
1. truncate: where we delete all the sstables in the table directory
2. tablet cleanup: similar to truncate but restricted to sstables in a
   single tablet replica

In both cases, it is possible that sstables in staging (or quarantine)
are mixed with sstables in the base directory.

Until a more comprehensive fix is in place,
(see https://github.com/scylladb/scylladb/pull/19555)
this change just lifts the ban on atomic deletion
of sstables from different prefixes, and acknowledging
that the implementation is not atomic across
prefixes.  This is better than crashing for now,
and can be backported more easily to branches
that support tablets so tablet migration can
be done safely in the presence of repair of
tables with views.

Refs scylladb/scylladb#18862

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 26abad23d9)

Closes scylladb/scylladb#19919
2024-08-06 16:27:57 +03:00
Lakshmi Narayanan Sreethar
96e5ebe28c boost/bloom_filter_test: wait for total memory reclaimed update
The testcase `test_bloom_filter_reclaim_during_reload` checks the
SSTable manager's `_total_memory_reclaimed` against an expected value to
verify that a Bloom filter was reloaded. However, it does not wait for
the manager to update the variable, causing the check to fail if the
update has not occurred yet. Fix it by making the testcase wait until
the variable is updated to the expected value.

Fixes #19879

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 27b305b9d1)

Closes scylladb/scylladb#19897
2024-08-06 16:26:36 +03:00
Takuya ASADA
c45e92142e scylla_raid_setup: install update-initramfs when it's not available
scylla_raid_setup may fail on Ubuntu minimal image since it calls
update-initramfs without installing.

(cherry picked from commit 02b20089cb)

Closes scylladb/scylladb#19869
2024-08-06 16:24:27 +03:00
Aleksandra Martyniuk
d69f0e529a test: tasks: adjust tests to new wait_task behavior
After c1b2b8cb2c /task_manager/wait_task/
does not unregister tasks anymore.

Delete the check if the task was unregistered from test_task_manager_wait.
Check task status in drain_module_tasks to ensure that the task
is removed from task manager.

Fixes: #19351.
(cherry picked from commit dfe3af40ed)

Closes scylladb/scylladb#19839
2024-08-06 16:23:02 +03:00
Łukasz Paszkowski
86ff3c2aa3 api/system: add highest_supported_sstable_format path
Current upgrade dtest rely on a ccm node function to
get_highest_supported_sstable_version() that looks for
r'Feature (.*)_SSTABLE_FORMAT is enabled' in the log files.

Starting from scylla-6.0 ME_SSTABLE_FORMAT is enabled by default
and there is no cluster feature for it. Thus get_highest_supported_sstable_version()
returns an empty list resulting in the upgrade tests failures.

This change introduces a seperate API path that returns the highest
supported sstable format (one of la, mc, md, me) by a scylla node.

Fixes scylladb/scylladb#19772

Backports to 6.0 and 6.1 required. The current upgrade test in dtest
checks scylla upgrades up to version 5.4 only. This patch is a
prerequisite to backport the upgrade tests fix in dtest.

(cherry picked from commit 781eb7517c)

Closes scylladb/scylladb#19814
2024-08-06 16:21:48 +03:00
Avi Kivity
efac73109e Merge '[Backport 6.1] doc: add the 6.0-to-6.1 upgrade guide' from ScyllaDB
This PR adds the 6.0-to-6.1 upgrade guide (including metrics) and removes the 5.4-to-6.0 upgrade guide.

Compared 5.4-to-6.0, the the 6.0-to-6.1 guide:

- Added the "Ensure Consistent Topology Changes Are Enabled" prerequisite.
- Removed the "After Upgrading Every Node" section. Both Raft-based schema changes and topology updates
  are mandatory in 6.1 and don't require any user action after upgrading to 6.1.
- Removed the "Validate Raft Setup" section. Raft was enabled in all 6.0 clusters (for schema management),
  so now there's no scenario that would require the user to follow the validation procedure.
- Removed the references to the Enable Consistent Topology Updates page (which was in version 6.0 and is removed with this PR) across the docs.

See the individual commits for more details.

Fixes https://github.com/scylladb/scylladb/issues/19853
Fixes https://github.com/scylladb/scylladb/issues/19933

This PR must be backported to branch-6.1 as it is critical in version 6.1.

(cherry picked from commit 9972e50134)

(cherry picked from commit 32fa5aa938)

 Refs #19983

Closes scylladb/scylladb#20038

* github.com:scylladb/scylladb:
  doc: remove the 5.4-to-6.0 upgrade guide
  doc: add the 6.0-to-6.1 upgrade guide
2024-08-06 13:28:24 +03:00
Anna Stuchlik
8c975712d3 doc: remove the 5.4-to-6.0 upgrade guide
This commit removes the 5.4-to-6.0 upgrade guide and all references to it.
It mainly removes references to the Enable Consistent Topology Updates page,
which was added as enabling the feature was optional.
In rare cases, when a reference to that page is necessary,
the internal link is replaced with an external link to version 6.0.
Especially the Handling Cluster Membership Change Failures page was modified
for troubleshooting purposes rather than removed.

(cherry picked from commit 32fa5aa938)
2024-08-06 10:20:09 +00:00
Anna Stuchlik
1fdfe11bb0 doc: add the 6.0-to-6.1 upgrade guide
This commit adds the 6.0-to-6.1 upgrade guide.

Compared to the previous upgrade guide:

- Added the "Ensure Consistent Topology Changes Are Enabled" prerequisite.
- Removed the "After Upgrading Every Node" section. Both Raft-based schema changes and topology updates
  are mandatory in 6.1 and don't require any user action after upgrading to 6.1.
- Removed the "Validate Raft Setup" section. Raft was enabled in all 6.0 clusters (for schema management),
  so now there's no scenario that would require the user to follow the validation procedure.

(cherry picked from commit 9972e50134)
2024-08-06 10:20:09 +00:00
Botond Dénes
58c06819d7 Update ./tools/python3 submodule
* ./tools/python3 18fa79ee...ea49f0ca (1):
  > install.sh: fix incorrect permission on strict umask

Fixes: https://github.com/scylladb/scylladb/issues/19775

Closes scylladb/scylladb#20022
2024-08-06 10:02:07 +03:00
Michael Litvak
5b604509ce db: fix waiting for counter update operations on table stop
When a table is dropped it should wait for all pending operations in the
table before the table is destroyed, because the operations may use the
table's resources.
With counter update operations, currently this is not the case. The
table may be destroyed while there is a counter update operation in
progress, causing an assert to be triggered due to a resource being
destroyed while it's in use.
The reason the operation is not waited for is a mistake in the lifetime
management of the object representing the write in progress. The commit
fixes it so the object lives for the duration of the entire counter
update operation, by moving it to the `do_with` list.

Fixes scylladb/scylla-enterprise#4475

Closes scylladb/scylladb#20018
2024-08-05 12:54:19 +02:00
Jenkins Promoter
abbf0b24a6 Update ScyllaDB version to: 6.1.0 2024-08-04 14:31:47 +03:00
Kamil Braun
347857e5e5 Merge '[Backport 6.1] raft: fix the shutdown phase being stuck' from ScyllaDB
Some of the calls inside the `raft_group0_client::start_operation()` method were missing the abort source parameter. This caused the repair test to be stuck in the shutdown phase - the abort source has been triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the ownership of the raft group semaphore (`get_units(semaphore)`) - these waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes scylladb/scylladb#19223

(cherry picked from commit 2dbe9ef2f2)

(cherry picked from commit 5dfc50d354)

 Refs #19860

Closes scylladb/scylladb#19970

* github.com:scylladb/scylladb:
  raft: fix the shutdown phase being stuck
  raft: use the abort source reference in raft group0 client interface
2024-08-02 11:24:34 +02:00
Emil Maskovsky
cd2ca5ef57 raft: fix the shutdown phase being stuck
Some of the calls inside the `raft_group0_client::start_operation()`
method were missing the abort source parameter. This caused the repair
test to be stuck in the shutdown phase - the abort source has been
triggered, but the operations were not checking it.

This was in particular the case of operations that try to take the
ownership of the raft group semaphore (`get_units(semaphore)`) - these
waits should be cancelled when the abort source is triggered.

This should fix the following tests that were failing in some percentage
of dtest runs (about 1-3 of 100):
* TestRepairAdditional::test_repair_kill_1
* TestRepairAdditional::test_repair_kill_3

Fixes scylladb/scylladb#19223

(cherry picked from commit 5dfc50d354)
2024-07-31 20:52:23 +00:00
Emil Maskovsky
5a4065ecd5 raft: use the abort source reference in raft group0 client interface
Most callers of the raft group0 client interface are passing a real
source instance, so we can use the abort source reference in the client
interface. This change makes the code simpler and more consistent.

(cherry picked from commit 2dbe9ef2f2)
2024-07-31 20:52:23 +00:00
Kamil Braun
ed4f2ecca4 docs: extend "forbidden operations" section for Raft-topology upgrade
The Raft-topology upgrade procedure must not be run concurrently with
version upgrade.

(cherry picked from commit bb0c3cdc65)

Closes scylladb/scylladb#19836
2024-07-29 16:52:40 +02:00
Jenkins Promoter
8f80a84e93 Update ScyllaDB version to: 6.1.0-rc2 2024-07-29 15:50:26 +03:00
Piotr Dulikowski
95abb6d4a7 test: regression test for MV crash with tablets during decommission
Regression test for scylladb/scylladb#19439.

Co-authored-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit 5ec8c06561)
2024-07-26 14:02:51 +00:00
Piotr Dulikowski
30b0cb4f5d db/view: drop view updates to replaced node marked as left
When a node that is permanently down is replaced, it is marked as "left"
but it still can be a replica of some tablets. We also don't keep IPs of
nodes that have left and the `node` structure for such node returns an
empty IP (all zeros) as the address.

This interacts badly with the view update logic. The base replica paired
with the left node might decide to generate a view update. Because
storage proxy still uses IPs and not host IDs, it needs to obtain the
view replica's IP and tell the storage proxy to write a view update to
that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to
write a hint towards this address - hinted handoff on the other hand
operates on host IDs and not IPs, so it attempts to translate the IP
back, which triggers an assertion as there is no replica with IP
0.0.0.0.

As a quick workaround for this issue just drop view updates towards
nodes which seem to have IPs that are all zeros. It would be more proper
to keep the view updates as hints and replay them later to the new
paired replica, but achieving this right now would require much more
significant changes. For now, fixing a crash is more important than
keeping views consistent with base replicas.

Fixes: scylladb/scylladb#19439
(cherry picked from commit 6af7882c59)
2024-07-26 14:02:50 +00:00
Nadav Har'El
97ae704f99 alternator: do not allow authentication with a non-"login" role
Alternator allows authentication into the existing CQL roles, but
roles which have the flag "login=false" should be refused in
authentication, and this patch adds the missing check.

The patch also adds a regression test for this feature in the
test/alternator test framework, in a new test file
test/alternator/cql_rbac.py. This test file will later include more
tests of how the CQL RBAC commands (CREATE ROLE, GRANT, REVOKE)
affect authentication and authorization in Alternator.
In particular, these tests need to use not just the DynamoDB API but
also CQL, so this new test file includes the "cql" fixture that allows
us to run CQL commands, to create roles, to retrieve their secret keys,
and so on.

Fixes #19735

(cherry picked from commit 14cd7b5095)

Closes scylladb/scylladb#19863
2024-07-25 12:45:27 +03:00
Nadav Har'El
738e4c3681 alternator: fix "/localnodes" to not return nodes still joining
Alternator's "/localnodes" HTTP request is supposed to return the list of
nodes in the local DC to which the user can send requests.

The existing implementation incorrectly used gossiper::is_alive() to check
for which nodes to return - but "alive" nodes include nodes which are still
joining the cluster and not really usable. These nodes can remain in the
JOINING state for a long time while they are copying data, and an attempt
to send requests to them will fail.

The fix for this bug is trivial: change the call to is_alive() to a call
to is_normal().

But the hard part of this test is the testing:

1. An existing multi-node test for "/localnodes" assummed that right after
   a new node was created, it appears on "/localnodes". But after this
   patch, it may take a bit more time for the bootstrapping to complete
   and the new node to appear in /localnodes - so I had to add a retry loop.

2. I added a test that reproduces the bug fixed here, and verifies its
   fix. The test is in the multi-node topology framework. It adds an
   injection which delays the bootstrap, which leaves a new node in JOINING
   state for a long time. The test then verifies that the new node is
   alive (as checked by the REST API), but is not returned by "/localnodes".

3. The new injection for delaying the bootstrap is unfortunately not
   very pretty - I had to do it in three places because we have several
   code paths of how bootstrap works without repair, with repair, without
   Raft and with Raft - and I wanted to delay all of them.

Fixes #19694.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0d1aa399f9)

Closes scylladb/scylladb#19855
2024-07-24 11:04:54 +03:00
Lakshmi Narayanan Sreethar
ee74fe4e0e [Backport 6.1] sstables: do not reload components of unlinked sstables
The SSTable is removed from the reclaimed memory tracking logic only
when its object is deleted. However, there is a risk that the Bloom
filter reloader may attempt to reload the SSTable after it has been
unlinked but before the SSTable object is destroyed. Prevent this by
removing the SSTable from the reclaimed list maintained by the manager
as soon as it is unlinked.

The original logic that updated the memory tracking in
`sstables_manager::deactivate()` is left in place as (a) the variables
have to be updated only when the SSTable object is actually deleted, as
the memory used by the filter is not freed as long as the SSTable is
alive, and (b) the `_reclaimed.erase(*sst)` is still useful during
shutdown, for example, when the SSTable is not unlinked but just
destroyed.

Fixes https://github.com/scylladb/scylladb/issues/19722

Closes scylladb/scylladb#19717

* github.com:scylladb/scylladb:
  boost/bloom_filter_test: add testcase to verify unlinked sstables are not reloaded
  sstables: do not reload components of unlinked sstables
  sstables/sstables_manager: introduce on_unlink method

(cherry picked from commit 591876b44e)

Backported from #19717 to 6.1

Closes scylladb/scylladb#19828
2024-07-24 09:03:52 +03:00
Jenkins Promoter
b2ea946837 Update ScyllaDB version to: 6.1.0-rc1 2024-07-23 10:33:48 +03:00
Avi Kivity
92e725c467 Merge '[Backport 6.1] Fix lwt semaphore guard accounting' from ScyllaDB
Currently the guard does not account correctly for ongoing operation if semaphore acquisition fails. It may signal a semaphore when it is not held.

Should be backported to all supported versions.

(cherry picked from commit 87beebeed0)

(cherry picked from commit 4178589826)

 Refs #19699

Closes scylladb/scylladb#19819

* github.com:scylladb/scylladb:
  test: add test to check that coordinator lwt semaphore continues functioning after locking failures
  paxos: do not signal semaphore if it was not acquired
2024-07-22 17:41:30 +03:00
Kamil Braun
e57d48253f Merge '[Backport 6.1] test: raft: fix the flaky test_raft_recovery_stuck' from ScyllaDB
Use the rolling restart to avoid spurious driver reconnects.

This can be eventually reverted once the scylladb/python-driver#295 is fixed.

Fixes scylladb/scylladb#19154

(cherry picked from commit ef3393bd36)

(cherry picked from commit a89facbc74)

 Refs #19771

Closes scylladb/scylladb#19820

* github.com:scylladb/scylladb:
  test: raft: fix the flaky `test_raft_recovery_stuck`
  test: raft: code cleanup in `test_raft_recovery_stuck`
2024-07-22 14:12:26 +02:00
Emil Maskovsky
47df9f9b05 test: raft: fix the flaky test_raft_recovery_stuck
Use the rolling restart to avoid spurious driver reconnects.

This can be eventually reverted once the scylladb/python-driver#295 is
fixed.

Fixes scylladb/scylladb#19154

(cherry picked from commit a89facbc74)
2024-07-22 09:17:05 +00:00
Emil Maskovsky
193dc87bd0 test: raft: code cleanup in test_raft_recovery_stuck
Cleaning up the imports.

(cherry picked from commit ef3393bd36)
2024-07-22 09:17:04 +00:00
Gleb Natapov
11d1950957 test: add test to check that coordinator lwt semaphore continues functioning after locking failures
(cherry picked from commit 4178589826)
2024-07-22 09:01:34 +00:00
Gleb Natapov
6317325ed5 paxos: do not signal semaphore if it was not acquired
The guard signals a semaphore during destruction if it is marked as
locked, but currently it may be marked as locked even if locking failed.
Fix this by using semaphore_units instead of managing the locked flag
manually.

Fixes: https://github.com/scylladb/scylladb/issues/19698
(cherry picked from commit 87beebeed0)
2024-07-22 09:01:34 +00:00
Anna Mikhlin
14222ad205 Update ScyllaDB version to: 6.1.0-rc0 2024-07-18 16:05:23 +03:00
311 changed files with 7236 additions and 2376 deletions

186
.github/scripts/auto-backport.py vendored Executable file
View File

@@ -0,0 +1,186 @@
#!/usr/bin/env python3
import argparse
import os
import re
import sys
import tempfile
import logging
from github import Github, GithubException
from git import Repo, GitCommandError
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
try:
github_token = os.environ["GITHUB_TOKEN"]
except KeyError:
print("Please set the 'GITHUB_TOKEN' environment variable")
sys.exit(1)
def is_pull_request():
return '--pull-request' in sys.argv[1:]
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--repo', type=str, required=True, help='Github repository name')
parser.add_argument('--base-branch', type=str, default='refs/heads/master', help='Base branch')
parser.add_argument('--commits', default=None, type=str, help='Range of promoted commits.')
parser.add_argument('--pull-request', type=int, help='Pull request number to be backported')
parser.add_argument('--head-commit', type=str, required=is_pull_request(), help='The HEAD of target branch after the pull request specified by --pull-request is merged')
return parser.parse_args()
def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr_title, commits, is_draft=False):
pr_body = f'{pr.body}\n\n'
for commit in commits:
pr_body += f'- (cherry picked from commit {commit})\n\n'
pr_body += f'Parent PR: #{pr.number}'
try:
backport_pr = repo.create_pull(
title=backport_pr_title,
body=pr_body,
head=f'scylladbbot:{new_branch_name}',
base=base_branch_name,
draft=is_draft
)
logging.info(f"Pull request created: {backport_pr.html_url}")
backport_pr.add_to_assignees(pr.user)
if is_draft:
backport_pr.add_to_labels("conflicts")
pr_comment = f"@{pr.user} - This PR was marked as draft because it has conflicts\n"
pr_comment += "Please resolve them and mark this PR as ready for review"
backport_pr.create_issue_comment(pr_comment)
logging.info(f"Assigned PR to original author: {pr.user}")
return backport_pr
except GithubException as e:
if 'A pull request already exists' in str(e):
logging.warning(f'A pull request already exists for {pr.user}:{new_branch_name}')
else:
logging.error(f'Failed to create PR: {e}')
def get_pr_commits(repo, pr, stable_branch, start_commit=None):
commits = []
if pr.merged:
merge_commit = repo.get_commit(pr.merge_commit_sha)
if len(merge_commit.parents) > 1: # Check if this merge commit includes multiple commits
commits.append(pr.merge_commit_sha)
else:
if start_commit:
promoted_commits = repo.compare(start_commit, stable_branch).commits
else:
promoted_commits = repo.get_commits(sha=stable_branch)
for commit in pr.get_commits():
for promoted_commit in promoted_commits:
commit_title = commit.commit.message.splitlines()[0]
# In Scylla-pkg and scylla-dtest, for example,
# we don't create a merge commit for a PR with multiple commits,
# according to the GitHub API, the last commit will be the merge commit,
# which is not what we need when backporting (we need all the commits).
# So here, we are validating the correct SHA for each commit so we can cherry-pick
if promoted_commit.commit.message.startswith(commit_title):
commits.append(promoted_commit.sha)
elif pr.state == 'closed':
events = pr.get_issue_events()
for event in events:
if event.event == 'closed':
commits.append(event.commit_id)
return commits
def create_pr_comment_and_remove_label(pr, comment_body):
labels = pr.get_labels()
pattern = re.compile(r"backport/\d+\.\d+$")
for label in labels:
if pattern.match(label.name):
print(f"Removing label: {label.name}")
comment_body += f'- {label.name}\n'
pr.remove_from_labels(label)
pr.create_issue_comment(comment_body)
def backport(repo, pr, version, commits, backport_base_branch):
new_branch_name = f'backport/{pr.number}/to-{version}'
backport_pr_title = f'[Backport {version}] {pr.title}'
repo_url = f'https://scylladbbot:{github_token}@github.com/{repo.full_name}.git'
fork_repo = f'https://scylladbbot:{github_token}@github.com/scylladbbot/{repo.name}.git'
with (tempfile.TemporaryDirectory() as local_repo_path):
try:
repo_local = Repo.clone_from(repo_url, local_repo_path, branch=backport_base_branch)
repo_local.git.checkout(b=new_branch_name)
is_draft = False
for commit in commits:
try:
repo_local.git.cherry_pick(commit, '-m1', '-x')
except GitCommandError as e:
logging.warning(f'Cherry-pick conflict on commit {commit}: {e}')
is_draft = True
repo_local.git.add(A=True)
repo_local.git.cherry_pick('--continue')
if not repo.private and not repo.has_in_collaborators(pr.user.login):
repo.add_to_collaborators(pr.user.login, permission="push")
comment = f':warning: @{pr.user.login} you have been added as collaborator to scylladbbot fork '
comment += f'Please check your inbox and approve the invitation, once it is done, please add the backport labels again'
create_pr_comment_and_remove_label(pr, comment)
return
repo_local.git.push(fork_repo, new_branch_name, force=True)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft=is_draft)
except GitCommandError as e:
logging.warning(f"GitCommandError: {e}")
def main():
args = parse_args()
base_branch = args.base_branch.split('/')[2]
promoted_label = 'promoted-to-master'
repo_name = args.repo
if 'scylla-enterprise' in args.repo:
promoted_label = 'promoted-to-enterprise'
stable_branch = base_branch
backport_branch = 'branch-'
backport_label_pattern = re.compile(r'backport/\d+\.\d+$')
g = Github(github_token)
repo = g.get_repo(repo_name)
closed_prs = []
start_commit = None
if args.commits:
start_commit, end_commit = args.commits.split('..')
commits = repo.compare(start_commit, end_commit).commits
for commit in commits:
match = re.search(rf"Closes .*#([0-9]+)", commit.commit.message, re.IGNORECASE)
if match:
pr_number = int(match.group(1))
pr = repo.get_pull(pr_number)
closed_prs.append(pr)
if args.pull_request:
start_commit = args.head_commit
pr = repo.get_pull(args.pull_request)
closed_prs = [pr]
for pr in closed_prs:
labels = [label.name for label in pr.labels]
backport_labels = [label for label in labels if backport_label_pattern.match(label)]
if promoted_label not in labels:
print(f'no {promoted_label} label: {pr.number}')
continue
if not backport_labels:
print(f'no backport label: {pr.number}')
continue
commits = get_pr_commits(repo, pr, stable_branch, start_commit)
logging.info(f"Found PR #{pr.number} with commit {commits} and the following labels: {backport_labels}")
for backport_label in backport_labels:
version = backport_label.replace('backport/', '')
backport_base_branch = backport_label.replace('backport/', backport_branch)
backport(repo, pr, version, commits, backport_base_branch)
if __name__ == "__main__":
main()

View File

@@ -16,13 +16,8 @@ def parser():
parser = argparse.ArgumentParser()
parser.add_argument('--repository', type=str, required=True,
help='Github repository name (e.g., scylladb/scylladb)')
parser.add_argument('--commit_before_merge', type=str, required=True, help='Git commit ID to start labeling from ('
'newest commit).')
parser.add_argument('--commit_after_merge', type=str, required=True,
help='Git commit ID to end labeling at (oldest '
'commit, exclusive).')
parser.add_argument('--update_issue', type=bool, default=False, help='Set True to update issues when backport was '
'done')
parser.add_argument('--commits', type=str, required=True, help='Range of promoted commits.')
parser.add_argument('--label', type=str, default='promoted-to-master', help='Label to use')
parser.add_argument('--ref', type=str, required=True, help='PR target branch')
return parser.parse_args()
@@ -53,10 +48,11 @@ def main():
target_branch = re.search(r'branch-(\d+\.\d+)', args.ref)
g = Github(github_token)
repo = g.get_repo(args.repository, lazy=False)
commits = repo.compare(head=args.commit_after_merge, base=args.commit_before_merge)
start_commit, end_commit = args.commits.split('..')
commits = repo.compare(start_commit, end_commit).commits
processed_prs = set()
# Print commit information
for commit in commits.commits:
for commit in commits:
print(f'Commit sha is: {commit.sha}')
match = pr_pattern.search(commit.commit.message)
if match:
@@ -66,13 +62,13 @@ def main():
if target_branch:
pr = repo.get_pull(pr_number)
branch_name = target_branch[1]
refs_pr = re.findall(r'Refs (?:#|https.*?)(\d+)', pr.body)
refs_pr = re.findall(r'Parent PR: (?:#|https.*?)(\d+)', pr.body)
if refs_pr:
print(f'branch-{target_branch.group(1)}, pr number is: {pr_number}')
# 1. change the backport label of the parent PR to note that
# we've merge the corresponding backport PR
# we've merged the corresponding backport PR
# 2. close the backport PR and leave a comment on it to note
# that it has been merged with a certain git commit,
# that it has been merged with a certain git commit.
ref_pr_number = refs_pr[0]
mark_backport_done(repo, ref_pr_number, branch_name)
comment = f'Closed via {commit.sha}'

View File

@@ -5,9 +5,10 @@ on:
branches:
- master
- branch-*.*
env:
DEFAULT_BRANCH: 'master'
- enterprise
pull_request_target:
types: [labeled]
branches: [master, next, enterprise]
jobs:
check-commit:
@@ -20,17 +21,51 @@ jobs:
env:
GITHUB_CONTEXT: ${{ toJson(github) }}
run: echo "$GITHUB_CONTEXT"
- name: Set Default Branch
id: set_branch
run: |
if [[ "${{ github.repository }}" == *enterprise* ]]; then
echo "DEFAULT_BRANCH=enterprise" >> $GITHUB_ENV
else
echo "DEFAULT_BRANCH=master" >> $GITHUB_ENV
fi
- name: Checkout repository
uses: actions/checkout@v4
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}
token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
fetch-depth: 0 # Fetch all history for all tags and branches
- name: Set up Git identity
run: |
git config --global user.name "GitHub Action"
git config --global user.email "action@github.com"
git config --global merge.conflictstyle diff3
- name: Install dependencies
run: sudo apt-get install -y python3-github
run: sudo apt-get install -y python3-github python3-git
- name: Run python script
if: github.event_name == 'push'
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: python .github/scripts/label_promoted_commits.py --commit_before_merge ${{ github.event.before }} --commit_after_merge ${{ github.event.after }} --repository ${{ github.repository }} --ref ${{ github.ref }}
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/label_promoted_commits.py --commits ${{ github.event.before }}..${{ github.sha }} --repository ${{ github.repository }} --ref ${{ github.ref }}
- name: Run auto-backport.py when promotion completed
if: ${{ github.event_name == 'push' && github.ref == format('refs/heads/{0}', env.DEFAULT_BRANCH) }}
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --commits ${{ github.event.before }}..${{ github.sha }}
- name: Check if label starts with 'backport/' and contains digits
id: check_label
run: |
label_name="${{ github.event.label.name }}"
if [[ "$label_name" =~ ^backport/[0-9]+\.[0-9]+$ ]]; then
echo "Label matches backport/X.X pattern."
echo "backport_label=true" >> $GITHUB_OUTPUT
else
echo "Label does not match the required pattern."
echo "backport_label=false" >> $GITHUB_OUTPUT
fi
- name: Run auto-backport.py when label was added
if: ${{ github.event_name == 'pull_request_target' && steps.check_label.outputs.backport_label == 'true' && github.event.pull_request.state == 'closed' }}
env:
GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}
run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --pull-request ${{ github.event.pull_request.number }} --head-commit ${{ github.event.pull_request.base.sha }}

2
.gitignore vendored
View File

@@ -19,7 +19,7 @@ CMakeLists.txt.user
*.egg-info
__pycache__CMakeLists.txt.user
.gdbinit
resources
/resources
.pytest_cache
/expressions.tokens
tags

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=6.1.0-dev
VERSION=6.1.6
if test -f version
then
@@ -104,7 +104,7 @@ else
fi
if [ -f "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" ]; then
GIT_COMMIT_FILE=$(cat "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" |cut -d . -f 3)
GIT_COMMIT_FILE=$(cat "$OUTPUT_DIR/SCYLLA-RELEASE-FILE" | rev | cut -d . -f 1 | rev)
if [ "$GIT_COMMIT" = "$GIT_COMMIT_FILE" ]; then
exit 0
fi

View File

@@ -19,6 +19,7 @@
#include "alternator/executor.hh"
#include "cql3/selection/selection.hh"
#include "cql3/result_set.hh"
#include "types/types.hh"
#include <seastar/core/coroutine.hh>
namespace alternator {
@@ -31,11 +32,12 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::serv
dht::partition_range_vector partition_ranges{dht::partition_range(dht::decorate_key(*schema, pk))};
std::vector<query::clustering_range> bounds{query::clustering_range::make_open_ended_both_sides()};
const column_definition* salted_hash_col = schema->get_column_definition(bytes("salted_hash"));
if (!salted_hash_col) {
const column_definition* can_login_col = schema->get_column_definition(bytes("can_login"));
if (!salted_hash_col || !can_login_col) {
co_await coroutine::return_exception(api_error::unrecognized_client(format("Credentials cannot be fetched for: {}", username)));
}
auto selection = cql3::selection::selection::for_columns(schema, {salted_hash_col});
auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id}, selection->get_query_options());
auto selection = cql3::selection::selection::for_columns(schema, {salted_hash_col, can_login_col});
auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id, can_login_col->id}, selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice,
proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
auto cl = auth::password_authenticator::consistency_for_user(username);
@@ -51,7 +53,14 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::serv
if (result_set->empty()) {
co_await coroutine::return_exception(api_error::unrecognized_client(format("User not found: {}", username)));
}
const managed_bytes_opt& salted_hash = result_set->rows().front().front(); // We only asked for 1 row and 1 column
const auto& result = result_set->rows().front();
bool can_login = result[1] && value_cast<bool>(boolean_type->deserialize(*result[1]));
if (!can_login) {
// This is a valid role name, but has "login=False" so should not be
// usable for authentication (see #19735).
co_await coroutine::return_exception(api_error::unrecognized_client(format("Role {} has login=false so cannot be used for login", username)));
}
const managed_bytes_opt& salted_hash = result.front();
if (!salted_hash) {
co_await coroutine::return_exception(api_error::unrecognized_client(format("No password found for user: {}", username)));
}

View File

@@ -9,6 +9,7 @@
#include <fmt/ranges.h>
#include <seastar/core/sleep.hh>
#include "alternator/executor.hh"
#include "cdc/log.hh"
#include "db/config.hh"
#include "log.hh"
#include "schema/schema_builder.hh"
@@ -4439,8 +4440,10 @@ future<executor::request_return_type> executor::list_tables(client_state& client
auto tables = _proxy.data_dictionary().get_tables(); // hold on to temporary, table_names isn't a container, it's a view
auto table_names = tables
| boost::adaptors::filtered([] (data_dictionary::table t) {
return t.schema()->ks_name().find(KEYSPACE_NAME_PREFIX) == 0 && !t.schema()->is_view();
| boost::adaptors::filtered([this] (data_dictionary::table t) {
return t.schema()->ks_name().find(KEYSPACE_NAME_PREFIX) == 0 &&
!t.schema()->is_view() &&
!cdc::is_log_for_some_table(_proxy.local_db(), t.schema()->ks_name(), t.schema()->cf_name());
})
| boost::adaptors::transformed([] (data_dictionary::table t) {
return t.schema()->cf_name();

View File

@@ -211,7 +211,10 @@ protected:
sstring local_dc = topology.get_datacenter();
std::unordered_set<gms::inet_address> local_dc_nodes = topology.get_datacenter_endpoints().at(local_dc);
for (auto& ip : local_dc_nodes) {
if (_gossiper.is_alive(ip)) {
// Note that it's not enough for the node to be is_alive() - a
// node joining the cluster is also "alive" but not responsive to
// requests. We alive *and* normal. See #19694, #21538.
if (_gossiper.is_alive(ip) && _gossiper.is_normal(ip)) {
// Use the gossiped broadcast_rpc_address if available instead
// of the internal IP address "ip". See discussion in #18711.
rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(ip)));

View File

@@ -26,6 +26,7 @@
#include "log.hh"
#include "gc_clock.hh"
#include "replica/database.hh"
#include "service/client_state.hh"
#include "service_permit.hh"
#include "timestamp.hh"
#include "service/storage_proxy.hh"
@@ -312,7 +313,7 @@ static size_t random_offset(size_t min, size_t max) {
// this range's primary node is down. For this we need to return not just
// a list of this node's secondary ranges - but also the primary owner of
// each of those ranges.
static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary_ranges(
static future<std::vector<std::pair<dht::token_range, gms::inet_address>>> get_secondary_ranges(
const locator::effective_replication_map_ptr& erm,
gms::inet_address ep) {
const auto& tm = *erm->get_token_metadata_ptr();
@@ -323,6 +324,7 @@ static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary
}
auto prev_tok = sorted_tokens.back();
for (const auto& tok : sorted_tokens) {
co_await coroutine::maybe_yield();
inet_address_vector_replica_set eps = erm->get_natural_endpoints(tok);
if (eps.size() <= 1 || eps[1] != ep) {
prev_tok = tok;
@@ -350,7 +352,7 @@ static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary
}
prev_tok = tok;
}
return ret;
co_return ret;
}
@@ -386,63 +388,63 @@ static std::vector<std::pair<dht::token_range, gms::inet_address>> get_secondary
//
// FIXME: Check if this algorithm is safe with tablet migration.
// https://github.com/scylladb/scylladb/issues/16567
enum primary_or_secondary_t {primary, secondary};
template<primary_or_secondary_t primary_or_secondary>
class token_ranges_owned_by_this_shard {
// ranges_holder_primary holds just the primary ranges themselves
class ranges_holder_primary {
const dht::token_range_vector _token_ranges;
public:
ranges_holder_primary(const locator::vnode_effective_replication_map_ptr& erm, gms::gossiper& g, gms::inet_address ep)
: _token_ranges(erm->get_primary_ranges(ep)) {}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i];
}
bool should_skip(std::size_t i) const {
return false;
}
};
// ranges_holder<secondary> holds the secondary token ranges plus each
// range's primary owner, needed to implement should_skip().
class ranges_holder_secondary {
std::vector<std::pair<dht::token_range, gms::inet_address>> _token_ranges;
gms::gossiper& _gossiper;
public:
ranges_holder_secondary(const locator::effective_replication_map_ptr& erm, gms::gossiper& g, gms::inet_address ep)
: _token_ranges(get_secondary_ranges(erm, ep))
, _gossiper(g) {}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i].first;
}
// range i should be skipped if its primary owner is alive.
bool should_skip(std::size_t i) const {
return _gossiper.is_alive(_token_ranges[i].second);
}
};
// ranges_holder_primary holds just the primary ranges themselves
class ranges_holder_primary {
dht::token_range_vector _token_ranges;
public:
explicit ranges_holder_primary(dht::token_range_vector token_ranges) : _token_ranges(std::move(token_ranges)) {}
static future<ranges_holder_primary> make(const locator::vnode_effective_replication_map_ptr& erm, gms::inet_address ep) {
co_return ranges_holder_primary(co_await erm->get_primary_ranges(ep));
}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i];
}
bool should_skip(std::size_t i) const {
return false;
}
};
// ranges_holder<secondary> holds the secondary token ranges plus each
// range's primary owner, needed to implement should_skip().
class ranges_holder_secondary {
std::vector<std::pair<dht::token_range, gms::inet_address>> _token_ranges;
const gms::gossiper& _gossiper;
public:
explicit ranges_holder_secondary(std::vector<std::pair<dht::token_range, gms::inet_address>> token_ranges, const gms::gossiper& g)
: _token_ranges(std::move(token_ranges))
, _gossiper(g) {}
static future<ranges_holder_secondary> make(const locator::effective_replication_map_ptr& erm, gms::inet_address ep, const gms::gossiper& g) {
co_return ranges_holder_secondary(co_await get_secondary_ranges(erm, ep), g);
}
std::size_t size() const { return _token_ranges.size(); }
const dht::token_range& operator[](std::size_t i) const {
return _token_ranges[i].first;
}
// range i should be skipped if its primary owner is alive.
bool should_skip(std::size_t i) const {
return _gossiper.is_alive(_token_ranges[i].second);
}
};
template<class primary_or_secondary_t>
class token_ranges_owned_by_this_shard {
schema_ptr _s;
locator::effective_replication_map_ptr _erm;
// _token_ranges will contain a list of token ranges owned by this node.
// We'll further need to split each such range to the pieces owned by
// the current shard, using _intersecter.
using ranges_holder = std::conditional_t<
primary_or_secondary == primary_or_secondary_t::primary,
ranges_holder_primary,
ranges_holder_secondary>;
const ranges_holder _token_ranges;
const primary_or_secondary_t _token_ranges;
// NOTICE: _range_idx is used modulo _token_ranges size when accessing
// the data to ensure that it doesn't go out of bounds
size_t _range_idx;
size_t _end_idx;
std::optional<dht::selective_token_range_sharder> _intersecter;
public:
token_ranges_owned_by_this_shard(replica::database& db, gms::gossiper& g, schema_ptr s)
token_ranges_owned_by_this_shard(schema_ptr s, primary_or_secondary_t token_ranges)
: _s(s)
, _erm(s->table().get_effective_replication_map())
, _token_ranges(db.find_keyspace(s->ks_name()).get_vnode_effective_replication_map(),
g, _erm->get_topology().my_address())
, _token_ranges(std::move(token_ranges))
, _range_idx(random_offset(0, _token_ranges.size() - 1))
, _end_idx(_range_idx + _token_ranges.size())
{
@@ -498,6 +500,7 @@ struct scan_ranges_context {
bytes column_name;
std::optional<std::string> member;
service::client_state internal_client_state;
::shared_ptr<cql3::selection::selection> selection;
std::unique_ptr<service::query_state> query_state_ptr;
std::unique_ptr<cql3::query_options> query_options;
@@ -507,6 +510,7 @@ struct scan_ranges_context {
: s(s)
, column_name(column_name)
, member(member)
, internal_client_state(service::client_state::internal_tag())
{
// FIXME: don't read the entire items - read only parts of it.
// We must read the key columns (to be able to delete) and also
@@ -525,10 +529,9 @@ struct scan_ranges_context {
std::vector<query::clustering_range> ck_bounds{query::clustering_range::make_open_ended_both_sides()};
auto partition_slice = query::partition_slice(std::move(ck_bounds), {}, std::move(regular_columns), opts);
command = ::make_lw_shared<query::read_command>(s->id(), s->version(), partition_slice, proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
executor::client_state client_state{executor::client_state::internal_tag()};
tracing::trace_state_ptr trace_state;
// NOTICE: empty_service_permit is used because the TTL service has fixed parallelism
query_state_ptr = std::make_unique<service::query_state>(client_state, trace_state, empty_service_permit());
query_state_ptr = std::make_unique<service::query_state>(internal_client_state, trace_state, empty_service_permit());
// FIXME: What should we do on multi-DC? Will we run the expiration on the same ranges on all
// DCs or only once for each range? If the latter, we need to change the CLs in the
// scanner and deleter.
@@ -724,7 +727,9 @@ static future<bool> scan_table(
expiration_stats.scan_table++;
// FIXME: need to pace the scan, not do it all at once.
scan_ranges_context scan_ctx{s, proxy, std::move(column_name), std::move(member)};
token_ranges_owned_by_this_shard<primary> my_ranges(db.real_database(), gossiper, s);
auto erm = db.real_database().find_keyspace(s->ks_name()).get_vnode_effective_replication_map();
auto my_address = erm->get_topology().my_address();
token_ranges_owned_by_this_shard my_ranges(s, co_await ranges_holder_primary::make(erm, my_address));
while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {
// Note that because of issue #9167 we need to run a separate
// query on each partition range, and can't pass several of
@@ -744,7 +749,7 @@ static future<bool> scan_table(
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
token_ranges_owned_by_this_shard<secondary> my_secondary_ranges(db.real_database(), gossiper, s);
token_ranges_owned_by_this_shard my_secondary_ranges(s, co_await ranges_holder_secondary::make(erm, my_address, gossiper));
while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {
expiration_stats.secondary_ranges_scanned++;
dht::partition_range_vector partition_ranges;

View File

@@ -1891,6 +1891,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"force",
"description":"Enforce the source_dc option, even if it unsafe to use for rebuild",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}

View File

@@ -194,6 +194,21 @@
"parameters":[]
}
]
},
{
"path":"/system/highest_supported_sstable_version",
"operations":[
{
"method":"GET",
"summary":"Get highest supported sstable version",
"type":"string",
"nickname":"get_highest_supported_sstable_version",
"produces":[
"application/json"
],
"parameters":[]
}
]
}
]
}

View File

@@ -54,6 +54,7 @@
#include "locator/abstract_replication_strategy.hh"
#include "sstables_loader.hh"
#include "db/view/view_builder.hh"
#include "utils/user_provided_param.hh"
using namespace seastar::httpd;
using namespace std::chrono_literals;
@@ -1096,7 +1097,16 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
});
ss::rebuild.set(r, [&ss](std::unique_ptr<http::request> req) {
auto source_dc = req->get_query_param("source_dc");
utils::optional_param source_dc;
if (auto source_dc_str = req->get_query_param("source_dc"); !source_dc_str.empty()) {
source_dc.emplace(std::move(source_dc_str)).set_user_provided();
}
if (auto force_str = req->get_query_param("force"); !force_str.empty() && service::loosen_constraints(validate_bool(force_str))) {
if (!source_dc) {
throw bad_param_exception("The `source_dc` option must be provided for using the `force` option");
}
source_dc.set_force();
}
apilog.info("rebuild: source_dc={}", source_dc);
return ss.local().rebuild(std::move(source_dc)).then([] {
return make_ready_future<json::json_return_type>(json_void());

View File

@@ -10,6 +10,7 @@
#include "api/api-doc/system.json.hh"
#include "api/api-doc/metrics.json.hh"
#include "replica/database.hh"
#include "sstables/sstables_manager.hh"
#include <rapidjson/document.h>
#include <seastar/core/reactor.hh>
@@ -182,6 +183,11 @@ void set_system(http_context& ctx, routes& r) {
apilog.info("Profile dumped to {}", profile_dest);
return make_ready_future<json::json_return_type>(json::json_return_type(json::json_void()));
}) ;
hs::get_highest_supported_sstable_version.set(r, [&ctx] (const_req req) {
auto& table = ctx.db.local().find_column_family("system", "local");
return seastar::to_sstring(table.get_sstables_manager().get_highest_supported_format());
});
}
}

View File

@@ -76,7 +76,7 @@ auth::certificate_authenticator::certificate_authenticator(cql3::query_processor
continue;
} catch (std::out_of_range&) {
// just fallthrough
} catch (std::regex_error&) {
} catch (boost::regex_error&) {
std::throw_with_nested(std::invalid_argument(fmt::format("Invalid query expression: {}", map.at(cfg_query_attr))));
}
}

View File

@@ -71,7 +71,7 @@ static future<> create_legacy_metadata_table_if_missing_impl(
assert(this_shard_id() == 0); // once_among_shards makes sure a function is executed on shard 0 only
auto db = qp.db();
auto parsed_statement = cql3::query_processor::parse_statement(cql);
auto parsed_statement = cql3::query_processor::parse_statement(cql, cql3::dialect{});
auto& parsed_cf_statement = static_cast<cql3::statements::raw::cf_statement&>(*parsed_statement);
parsed_cf_statement.prepare_keyspace(meta::legacy::AUTH_KS);
@@ -121,7 +121,7 @@ static future<> announce_mutations_with_guard(
::service::raft_group0_client& group0_client,
std::vector<canonical_mutation> muts,
::service::group0_guard group0_guard,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
auto group0_cmd = group0_client.prepare_command(
::service::write_mutations{
@@ -137,7 +137,7 @@ future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
start_operation_func_t start_operation_func,
std::function<::service::mutations_generator(api::timestamp_type t)> gen,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
// account for command's overhead, it's better to use smaller threshold than constantly bounce off the limit
size_t memory_threshold = group0_client.max_command_size() * 0.75;
@@ -188,7 +188,7 @@ future<> announce_mutations(
::service::raft_group0_client& group0_client,
const sstring query_string,
std::vector<data_value_or_unset> values,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
auto group0_guard = co_await group0_client.start_operation(as, timeout);
auto timestamp = group0_guard.write_timestamp();

View File

@@ -80,7 +80,7 @@ future<> create_legacy_metadata_table_if_missing(
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
// Use this function when need to perform read before write on a single guard or if
// you have more than one mutation and potentially exceed single command size limit.
using start_operation_func_t = std::function<future<::service::group0_guard>(abort_source*)>;
using start_operation_func_t = std::function<future<::service::group0_guard>(abort_source&)>;
future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
// since we can operate also in topology coordinator context where we need stronger
@@ -88,7 +88,7 @@ future<> announce_mutations_with_batching(
// function here
start_operation_func_t start_operation_func,
std::function<::service::mutations_generator(api::timestamp_type t)> gen,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout);
// Execute update query via group0 mechanism, mutations will be applied on all nodes.
@@ -97,7 +97,7 @@ future<> announce_mutations(
::service::raft_group0_client& group0_client,
const sstring query_string,
std::vector<data_value_or_unset> values,
seastar::abort_source* as,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout);
// Appends mutations to a collector, they will be applied later on all nodes via group0 mechanism.

View File

@@ -136,7 +136,7 @@ future<> password_authenticator::create_default_if_missing() {
plogger.info("Created default superuser authentication record.");
} else {
co_await announce_mutations(_qp, _group0_client, query,
{salted_pwd, _superuser}, &_as, ::service::raft_timeout{});
{salted_pwd, _superuser}, _as, ::service::raft_timeout{});
plogger.info("Created default superuser authentication record.");
}
}

View File

@@ -681,7 +681,7 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
co_await announce_mutations_with_batching(g0,
start_operation_func,
std::move(gen),
&as,
as,
std::nullopt);
}

View File

@@ -192,7 +192,7 @@ future<> standard_role_manager::create_default_role_if_missing() {
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await announce_mutations(_qp, _group0_client, query, {_superuser}, &_as, ::service::raft_timeout{});
co_await announce_mutations(_qp, _group0_client, query, {_superuser}, _as, ::service::raft_timeout{});
}
log.info("Created default superuser role '{}'.", _superuser);
} catch(const exceptions::unavailable_exception& e) {

View File

@@ -1110,7 +1110,9 @@ future<bool> generation_service::legacy_do_handle_cdc_generation(cdc::generation
auto sys_dist_ks = get_sys_dist_ks();
auto gen = co_await retrieve_generation_data(gen_id, _sys_ks.local(), *sys_dist_ks, { _token_metadata.get()->count_normal_token_owners() });
if (!gen) {
throw std::runtime_error(format(
// This may happen during raft upgrade when a node gossips about a generation that
// was propagated through raft and we didn't apply it yet.
throw generation_handling_nonfatal_exception(format(
"Could not find CDC generation {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
gen_id, db_clock::now()));

View File

@@ -186,7 +186,7 @@ bool cdc::metadata::prepare(db_clock::time_point tp) {
}
auto ts = to_ts(tp);
auto emplaced = _gens.emplace(to_ts(tp), std::nullopt).second;
auto [it, emplaced] = _gens.emplace(to_ts(tp), std::nullopt);
if (_last_stream_timestamp != api::missing_timestamp) {
auto last_correct_gen = gen_used_at(_last_stream_timestamp);
@@ -201,5 +201,5 @@ bool cdc::metadata::prepare(db_clock::time_point tp) {
}
}
return emplaced;
return !it->second;
}

View File

@@ -172,7 +172,8 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_
}
static std::vector<shared_sstable> get_uncompacting_sstables(const table_state& table_s, std::vector<shared_sstable> sstables) {
auto all_sstables = boost::copy_range<std::vector<shared_sstable>>(*table_s.main_sstable_set().all());
auto sstable_set = table_s.sstable_set_for_tombstone_gc();
auto all_sstables = boost::copy_range<std::vector<shared_sstable>>(*sstable_set->all());
auto& compacted_undeleted = table_s.compacted_undeleted_sstables();
all_sstables.insert(all_sstables.end(), compacted_undeleted.begin(), compacted_undeleted.end());
boost::sort(all_sstables, [] (const shared_sstable& x, const shared_sstable& y) {

View File

@@ -187,7 +187,7 @@ unsigned compaction_manager::current_compaction_fan_in_threshold() const {
return 0;
}
auto largest_fan_in = std::ranges::max(_tasks | boost::adaptors::transformed([] (auto& task) {
return task->compaction_running() ? task->compaction_data().compaction_fan_in : 0;
return task.compaction_running() ? task.compaction_data().compaction_fan_in : 0;
}));
// conservatively limit fan-in threshold to 32, such that tons of small sstables won't accumulate if
// running major on a leveled table, which can even have more than one thousand files.
@@ -387,11 +387,26 @@ future<sstables::compaction_result> compaction_task_executor::compact_sstables_a
co_return res;
}
future<sstables::sstable_set> compaction_task_executor::sstable_set_for_tombstone_gc(table_state& t) {
auto compound_set = t.sstable_set_for_tombstone_gc();
// Compound set will be linearized into a single set, since compaction might add or remove sstables
// to it for incremental compaction to work.
auto new_set = sstables::make_partitioned_sstable_set(t.schema(), false);
co_await compound_set->for_each_sstable_gently([&] (const sstables::shared_sstable& sst) {
auto inserted = new_set.insert(sst);
if (!inserted) {
on_internal_error(cmlog, format("Unable to insert SSTable {} into set used for tombstone GC", sst->get_filename()));
}
});
co_return std::move(new_set);
}
future<sstables::compaction_result> compaction_task_executor::compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement& on_replace, compaction_manager::can_purge_tombstones can_purge,
sstables::offstrategy offstrategy) {
table_state& t = *_compacting_table;
if (can_purge) {
descriptor.enable_garbage_collection(t.main_sstable_set());
descriptor.enable_garbage_collection(co_await sstable_set_for_tombstone_gc(t));
}
descriptor.creator = [&t] (shard_id dummy) {
auto sst = t.make_sstable();
@@ -577,9 +592,9 @@ requires (compaction_manager& cm, throw_if_stopping do_throw_if_stopping, Args&&
}
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_compaction(throw_if_stopping do_throw_if_stopping, std::optional<tasks::task_info> parent_info, Args&&... args) {
auto task_executor = seastar::make_shared<TaskExecutor>(*this, do_throw_if_stopping, std::forward<Args>(args)...);
_tasks.push_back(task_executor);
auto unregister_task = defer([this, task_executor] {
_tasks.remove(task_executor);
_tasks.push_back(*task_executor);
auto unregister_task = defer([task_executor] {
task_executor->unlink();
task_executor->switch_state(compaction_task_executor::state::none);
});
@@ -885,10 +900,10 @@ public:
explicit strategy_control(compaction_manager& cm) noexcept : _cm(cm) {}
bool has_ongoing_compaction(table_state& table_s) const noexcept override {
return std::any_of(_cm._tasks.begin(), _cm._tasks.end(), [&s = table_s.schema()] (const shared_ptr<compaction_task_executor>& task) {
return task->compaction_running()
&& task->compacting_table()->schema()->ks_name() == s->ks_name()
&& task->compacting_table()->schema()->cf_name() == s->cf_name();
return std::any_of(_cm._tasks.begin(), _cm._tasks.end(), [&s = table_s.schema()] (const compaction_task_executor& task) {
return task.compaction_running()
&& task.compacting_table()->schema()->ks_name() == s->ks_name()
&& task.compacting_table()->schema()->cf_name() == s->cf_name();
});
}
@@ -1052,7 +1067,7 @@ void compaction_manager::postpone_compaction_for_table(table_state* t) {
_postponed.insert(t);
}
future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason) {
future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason) noexcept {
// To prevent compaction from being postponed while tasks are being stopped,
// let's stop all tasks before the deferring point below.
for (auto& t : tasks) {
@@ -1060,14 +1075,16 @@ future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_e
t->stop_compaction(reason);
}
co_await coroutine::parallel_for_each(tasks, [] (auto& task) -> future<> {
auto unlink_task = deferred_action([task] { task->unlink(); });
try {
co_await task->compaction_done();
} catch (sstables::compaction_stopped_exception&) {
// swallow stop exception if a given procedure decides to propagate it to the caller,
// as it happens with reshard and reshape.
} catch (...) {
// just log any other errors as the callers have nothing to do with them.
cmlog.debug("Stopping {}: task returned error: {}", *task, std::current_exception());
throw;
co_return;
}
cmlog.debug("Stopping {}: done", *task);
});
@@ -1076,9 +1093,12 @@ future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_e
future<> compaction_manager::stop_ongoing_compactions(sstring reason, table_state* t, std::optional<sstables::compaction_type> type_opt) noexcept {
try {
auto ongoing_compactions = get_compactions(t).size();
auto tasks = boost::copy_range<std::vector<shared_ptr<compaction_task_executor>>>(_tasks | boost::adaptors::filtered([t, type_opt] (auto& task) {
return (!t || task->compacting_table() == t) && (!type_opt || task->compaction_type() == *type_opt);
}));
auto tasks = _tasks
| std::views::filter([t, type_opt] (const auto& task) {
return (!t || task.compacting_table() == t) && (!type_opt || task.compaction_type() == *type_opt);
})
| std::views::transform([] (auto& task) { return task.shared_from_this(); })
| std::ranges::to<std::vector<shared_ptr<compaction_task_executor>>>();
logging::log_level level = tasks.empty() ? log_level::debug : log_level::info;
if (cmlog.is_enabled(level)) {
std::string scope = "";
@@ -1092,8 +1112,9 @@ future<> compaction_manager::stop_ongoing_compactions(sstring reason, table_stat
}
return stop_tasks(std::move(tasks), std::move(reason));
} catch (...) {
return current_exception_as_future<>();
cmlog.error("Stopping ongoing compactions failed: {}. Ignored", std::current_exception());
}
return make_ready_future();
}
future<> compaction_manager::drain() {
@@ -1110,17 +1131,17 @@ future<> compaction_manager::stop() {
if (auto cm = std::exchange(_task_manager_module, nullptr)) {
co_await cm->stop();
}
if (_state != state::none) {
co_return co_await std::move(*_stop_future);
if (_stop_future) {
co_await std::exchange(*_stop_future, make_ready_future());
}
}
future<> compaction_manager::really_do_stop() {
future<> compaction_manager::really_do_stop() noexcept {
cmlog.info("Asked to stop");
// Reset the metrics registry
_metrics.clear();
co_await stop_ongoing_compactions("shutdown");
co_await coroutine::parallel_for_each(_compaction_state | boost::adaptors::map_values, [] (compaction_state& cs) -> future<> {
co_await coroutine::parallel_for_each(_compaction_state | std::views::values, [] (compaction_state& cs) -> future<> {
if (!cs.gate.is_closed()) {
co_await cs.gate.close();
}
@@ -1553,11 +1574,16 @@ protected:
co_return stats;
}
virtual sstables::compaction_descriptor make_descriptor(const sstables::shared_sstable& sst) const {
static sstables::compaction_descriptor
make_descriptor(const sstables::shared_sstable& sst, const sstables::compaction_type_options& opt, owned_ranges_ptr owned_ranges = {}) {
auto sstable_level = sst->get_sstable_level();
auto run_identifier = sst->run_identifier();
return sstables::compaction_descriptor({ sst },
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, _options, _owned_ranges_ptr);
sstable_level, sstables::compaction_descriptor::default_max_sstable_bytes, run_identifier, opt, owned_ranges);
}
virtual sstables::compaction_descriptor make_descriptor(const sstables::shared_sstable& sst) const {
return make_descriptor(sst, _options, _owned_ranges_ptr);
}
virtual future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) {
@@ -1610,19 +1636,30 @@ public:
std::move(sstables), std::move(compacting), compaction_manager::can_purge_tombstones::yes)
, _opt(options.as<sstables::compaction_type_options::split>())
{
if (utils::get_local_injector().is_enabled("split_sstable_rewrite")) {
_do_throw_if_stopping = throw_if_stopping::yes;
}
}
static bool sstable_needs_split(const sstables::shared_sstable& sst, const sstables::compaction_type_options::split& opt) {
return opt.classifier(sst->get_first_decorated_key().token()) != opt.classifier(sst->get_last_decorated_key().token());
}
static sstables::compaction_descriptor
make_descriptor(const sstables::shared_sstable& sst, const sstables::compaction_type_options::split& split_opt) {
auto opt = sstables::compaction_type_options::make_split(split_opt.classifier);
return rewrite_sstables_compaction_task_executor::make_descriptor(sst, std::move(opt));
}
private:
bool sstable_needs_split(const sstables::shared_sstable& sst) const {
return _opt.classifier(sst->get_first_decorated_key().token()) != _opt.classifier(sst->get_last_decorated_key().token());
return sstable_needs_split(sst, _opt);
}
protected:
sstables::compaction_descriptor make_descriptor(const sstables::shared_sstable& sst) const override {
auto desc = rewrite_sstables_compaction_task_executor::make_descriptor(sst);
desc.options = sstables::compaction_type_options::make_split(_opt.classifier);
return desc;
return make_descriptor(sst, _opt);
}
future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) override {
future<sstables::compaction_result> do_rewrite_sstable(const sstables::shared_sstable sst) {
if (sstable_needs_split(sst)) {
return rewrite_sstables_compaction_task_executor::rewrite_sstable(std::move(sst));
}
@@ -1635,6 +1672,20 @@ protected:
return sstables::compaction_result{};
});
}
future<sstables::compaction_result> rewrite_sstable(const sstables::shared_sstable sst) override {
co_await utils::get_local_injector().inject("split_sstable_rewrite", [this] (auto& handler) -> future<> {
cmlog.info("split_sstable_rewrite: waiting");
while (!handler.poll_for_message() && !_compaction_data.is_stop_requested()) {
co_await sleep(std::chrono::milliseconds(5));
}
cmlog.info("split_sstable_rewrite: released");
if (_compaction_data.is_stop_requested()) {
throw make_compaction_stopped_exception();
}
}, false);
co_return co_await do_rewrite_sstable(std::move(sst));
}
};
}
@@ -1965,7 +2016,7 @@ future<> compaction_manager::perform_cleanup(owned_ranges_ptr sorted_owned_range
future<> compaction_manager::try_perform_cleanup(owned_ranges_ptr sorted_owned_ranges, table_state& t, std::optional<tasks::task_info> info) {
auto check_for_cleanup = [this, &t] {
return boost::algorithm::any_of(_tasks, [&t] (auto& task) {
return task->compacting_table() == &t && task->compaction_type() == sstables::compaction_type::Cleanup;
return task.compacting_table() == &t && task.compaction_type() == sstables::compaction_type::Cleanup;
});
};
if (check_for_cleanup()) {
@@ -2056,6 +2107,31 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_spl
return perform_task_on_all_files<split_compaction_task_executor>(info, t, std::move(options), std::move(owned_ranges_ptr), std::move(get_sstables));
}
future<std::vector<sstables::shared_sstable>>
compaction_manager::maybe_split_sstable(sstables::shared_sstable sst, table_state& t, sstables::compaction_type_options::split opt) {
if (!split_compaction_task_executor::sstable_needs_split(sst, opt)) {
co_return std::vector<sstables::shared_sstable>{sst};
}
std::vector<sstables::shared_sstable> ret;
// FIXME: indentation.
auto gate = get_compaction_state(&t).gate.hold();
sstables::compaction_progress_monitor monitor;
sstables::compaction_data info = create_compaction_data();
sstables::compaction_descriptor desc = split_compaction_task_executor::make_descriptor(sst, opt);
desc.creator = [&t] (shard_id _) {
return t.make_sstable();
};
desc.replacer = [&] (sstables::compaction_completion_desc d) {
std::move(d.new_sstables.begin(), d.new_sstables.end(), std::back_inserter(ret));
};
co_await sstables::compact_sstables(std::move(desc), info, t, monitor);
co_await sst->unlink();
co_return ret;
}
// Submit a table to be scrubbed and wait for its termination.
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sstable_scrub(table_state& t, sstables::compaction_type_options::scrub opts, std::optional<tasks::task_info> info) {
auto scrub_mode = opts.operation_mode;
@@ -2121,11 +2197,11 @@ future<> compaction_manager::remove(table_state& t) noexcept {
auto found = false;
sstring msg;
for (auto& task : _tasks) {
if (task->compacting_table() == &t) {
if (task.compacting_table() == &t) {
if (!msg.empty()) {
msg += "\n";
}
msg += format("Found {} after remove", *task.get());
msg += format("Found {} after remove", task);
found = true;
}
}
@@ -2136,30 +2212,38 @@ future<> compaction_manager::remove(table_state& t) noexcept {
}
const std::vector<sstables::compaction_info> compaction_manager::get_compactions(table_state* t) const {
auto to_info = [] (const shared_ptr<compaction_task_executor>& task) {
auto to_info = [] (const compaction_task_executor& task) {
sstables::compaction_info ret;
ret.compaction_uuid = task->compaction_data().compaction_uuid;
ret.type = task->compaction_type();
ret.ks_name = task->compacting_table()->schema()->ks_name();
ret.cf_name = task->compacting_table()->schema()->cf_name();
ret.total_partitions = task->compaction_data().total_partitions;
ret.total_keys_written = task->compaction_data().total_keys_written;
ret.compaction_uuid = task.compaction_data().compaction_uuid;
ret.type = task.compaction_type();
ret.ks_name = task.compacting_table()->schema()->ks_name();
ret.cf_name = task.compacting_table()->schema()->cf_name();
ret.total_partitions = task.compaction_data().total_partitions;
ret.total_keys_written = task.compaction_data().total_keys_written;
return ret;
};
using ret = std::vector<sstables::compaction_info>;
return boost::copy_range<ret>(_tasks | boost::adaptors::filtered([t] (const shared_ptr<compaction_task_executor>& task) {
return (!t || task->compacting_table() == t) && task->compaction_running();
return boost::copy_range<ret>(_tasks | boost::adaptors::filtered([t] (const compaction_task_executor& task) {
return (!t || task.compacting_table() == t) && task.compaction_running();
}) | boost::adaptors::transformed(to_info));
}
bool compaction_manager::has_table_ongoing_compaction(const table_state& t) const {
return std::any_of(_tasks.begin(), _tasks.end(), [&t] (const shared_ptr<compaction_task_executor>& task) {
return task->compacting_table() == &t && task->compaction_running();
return std::any_of(_tasks.begin(), _tasks.end(), [&t] (const compaction_task_executor& task) {
return task.compacting_table() == &t && task.compaction_running();
});
};
bool compaction_manager::compaction_disabled(table_state& t) const {
return _compaction_state.contains(&t) && _compaction_state.at(&t).compaction_disabled();
if (auto it = _compaction_state.find(&t); it != _compaction_state.end()) {
return it->second.compaction_disabled();
} else {
cmlog.debug("compaction_disabled: {}:{} not in compaction_state", t.schema()->id(), t.get_group_id());
// Compaction is not strictly disabled, but it is not enabled either.
// The callers actually care about if it's enabled or not, not about the actual state of
// compaction_state::compaction_disabled()
return true;
}
}
future<> compaction_manager::stop_compaction(sstring type, table_state* table) {
@@ -2184,8 +2268,8 @@ future<> compaction_manager::stop_compaction(sstring type, table_state* table) {
void compaction_manager::propagate_replacement(table_state& t,
const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added) {
for (auto& task : _tasks) {
if (task->compacting_table() == &t && task->compaction_running()) {
task->compaction_data().pending_replacements.push_back({ removed, added });
if (task.compacting_table() == &t && task.compaction_running()) {
task.compaction_data().pending_replacements.push_back({ removed, added });
}
}
}

View File

@@ -93,8 +93,13 @@ public:
private:
shared_ptr<compaction::task_manager_module> _task_manager_module;
using compaction_task_executor_list_type = bi::list<
compaction_task_executor,
bi::base_hook<bi::list_base_hook<bi::link_mode<bi::auto_unlink>>>,
bi::constant_time_size<false>>;
// compaction manager may have N fibers to allow parallel compaction per shard.
std::list<shared_ptr<compaction::compaction_task_executor>> _tasks;
compaction_task_executor_list_type _tasks;
// Possible states in which the compaction manager can be found.
//
@@ -180,7 +185,7 @@ private:
}
future<compaction_manager::compaction_stats_opt> perform_compaction(throw_if_stopping do_throw_if_stopping, std::optional<tasks::task_info> parent_info, Args&&... args);
future<> stop_tasks(std::vector<shared_ptr<compaction::compaction_task_executor>> tasks, sstring reason);
future<> stop_tasks(std::vector<shared_ptr<compaction::compaction_task_executor>> tasks, sstring reason) noexcept;
future<> update_throughput(uint32_t value_mbs);
// Return the largest fan-in of currently running compactions
@@ -246,7 +251,7 @@ private:
// Stop all fibers, without waiting. Safe to be called multiple times.
void do_stop() noexcept;
future<> really_do_stop();
future<> really_do_stop() noexcept;
// Propagate replacement of sstables to all ongoing compaction of a given table
void propagate_replacement(compaction::table_state& t, const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added);
@@ -347,6 +352,11 @@ public:
// or user aborted splitting using stop API.
future<compaction_stats_opt> perform_split_compaction(compaction::table_state& t, sstables::compaction_type_options::split opt, std::optional<tasks::task_info> info = std::nullopt);
// Splits a single SSTable by segregating all its data according to the classifier.
// If SSTable doesn't need split, the same input SSTable is returned as output.
// If SSTable needs split, then output SSTables are returned and the input SSTable is deleted.
future<std::vector<sstables::shared_sstable>> maybe_split_sstable(sstables::shared_sstable sst, table_state& t, sstables::compaction_type_options::split opt);
// Run a custom job for a given table, defined by a function
// it completes when future returned by job is ready or returns immediately
// if manager was asked to stop.
@@ -462,7 +472,9 @@ public:
namespace compaction {
class compaction_task_executor : public enable_shared_from_this<compaction_task_executor> {
class compaction_task_executor
: public enable_shared_from_this<compaction_task_executor>
, public boost::intrusive::list_base_hook<boost::intrusive::link_mode<boost::intrusive::auto_unlink>> {
public:
enum class state {
none, // initial and final state
@@ -586,6 +598,8 @@ private:
future<compaction_manager::compaction_stats_opt> compaction_done() noexcept {
return _compaction_done.get_future();
}
future<sstables::sstable_set> sstable_set_for_tombstone_gc(::compaction::table_state& t);
public:
bool stopping() const noexcept {
return _compaction_data.abort.abort_requested();
@@ -606,7 +620,7 @@ public:
friend future<compaction_manager::compaction_stats_opt> compaction_manager::perform_compaction(throw_if_stopping do_throw_if_stopping, std::optional<tasks::task_info> parent_info, Args&&... args);
friend future<compaction_manager::compaction_stats_opt> compaction_manager::perform_task(shared_ptr<compaction_task_executor> task, throw_if_stopping do_throw_if_stopping);
friend fmt::formatter<compaction_task_executor>;
friend future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason);
friend future<> compaction_manager::stop_tasks(std::vector<shared_ptr<compaction_task_executor>> tasks, sstring reason) noexcept;
friend sstables::test_env_compaction_manager;
};

View File

@@ -39,6 +39,7 @@ public:
virtual bool compaction_enforce_min_threshold() const noexcept = 0;
virtual const sstables::sstable_set& main_sstable_set() const = 0;
virtual const sstables::sstable_set& maintenance_sstable_set() const = 0;
virtual lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc() const = 0;
virtual std::unordered_set<sstables::shared_sstable> fully_expired_sstables(const std::vector<sstables::shared_sstable>& sstables, gc_clock::time_point compaction_time) const = 0;
virtual const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const noexcept = 0;
virtual sstables::compaction_strategy& get_compaction_strategy() const noexcept = 0;

View File

@@ -467,7 +467,16 @@ future<> shard_cleanup_keyspace_compaction_task_impl::run() {
future<> table_cleanup_keyspace_compaction_task_impl::run() {
co_await wait_for_your_turn(_cv, _current_task, _status.id);
auto owned_ranges_ptr = compaction::make_owned_ranges_ptr(_db.get_keyspace_local_ranges(_status.keyspace));
// Note that we do not hold an effective_replication_map_ptr throughout
// the cleanup operation, so the topology might change.
// Since clenaup is an admin operation required for vnodes,
// it is the responsibility of the system operator to not
// perform additional incompatible range movements during cleanup.
auto get_owned_ranges = [&] (std::string_view ks_name) -> future<owned_ranges_ptr> {
const auto& erm = _db.find_keyspace(ks_name).get_vnode_effective_replication_map();
co_return compaction::make_owned_ranges_ptr(co_await _db.get_keyspace_local_ranges(erm));
};
auto owned_ranges_ptr = co_await get_owned_ranges(_status.keyspace);
co_await run_on_table("force_keyspace_cleanup", _db, _status.keyspace, _ti, [&] (replica::table& t) {
// skip the flush, as cleanup_keyspace_compaction_task_impl::run should have done this.
return t.perform_cleanup_compaction(owned_ranges_ptr, tasks::task_info{_status.id, _status.shard}, replica::table::do_flush::no);
@@ -531,8 +540,15 @@ future<> shard_upgrade_sstables_compaction_task_impl::run() {
future<> table_upgrade_sstables_compaction_task_impl::run() {
co_await wait_for_your_turn(_cv, _current_task, _status.id);
auto owned_ranges = _db.maybe_get_keyspace_local_ranges(_status.keyspace);
auto owned_ranges_ptr = owned_ranges ? compaction::make_owned_ranges_ptr(std::move(owned_ranges.value())) : nullptr;
auto get_owned_ranges = [&] (std::string_view keyspace_name) -> future<owned_ranges_ptr> {
const auto& ks = _db.find_keyspace(keyspace_name);
if (ks.get_replication_strategy().is_per_table()) {
co_return nullptr;
}
const auto& erm = ks.get_vnode_effective_replication_map();
co_return compaction::make_owned_ranges_ptr(co_await _db.get_keyspace_local_ranges(erm));
};
auto owned_ranges_ptr = co_await get_owned_ranges(_status.keyspace);
tasks::task_info info{_status.id, _status.shard};
co_await run_on_table("upgrade_sstables", _db, _status.keyspace, _ti, [&] (replica::table& t) -> future<> {
return t.parallel_foreach_table_state([&] (compaction::table_state& ts) -> future<> {

View File

@@ -295,7 +295,8 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<shared_sstable> i
// When trimming, let's keep sstables with overlapping time window, so as to reduce write amplification.
// For example, if there are N sstables spanning window W, where N <= 32, then we can produce all data for W
// in a single compaction round, removing the need to later compact W to reduce its number of files.
boost::partial_sort(multi_window, multi_window.begin() + max_sstables, [](const shared_sstable &a, const shared_sstable &b) {
auto sort_size = std::min(max_sstables, multi_window.size());
boost::partial_sort(multi_window, multi_window.begin() + sort_size, [](const shared_sstable &a, const shared_sstable &b) {
return a->get_stats_metadata().max_timestamp < b->get_stats_metadata().max_timestamp;
});
maybe_trim_job(multi_window, job_size, disjoint);

View File

@@ -14,7 +14,9 @@
#include "exceptions/exceptions.hh"
#include "utils/class_registrator.hh"
const sstring compressor::namespace_prefix = "org.apache.cassandra.io.compress.";
sstring compressor::make_name(std::string_view short_name) {
return seastar::format("org.apache.cassandra.io.compress.{}", short_name);
}
class lz4_processor: public compressor {
public:
@@ -66,7 +68,7 @@ compressor::ptr_type compressor::create(const sstring& name, const opt_getter& o
return {};
}
qualified_name qn(namespace_prefix, name);
qualified_name qn(make_name(""), name);
for (auto& c : { lz4, snappy, deflate }) {
if (c->name() == static_cast<const sstring&>(qn)) {
@@ -91,9 +93,9 @@ shared_ptr<compressor> compressor::create(const std::map<sstring, sstring>& opti
return {};
}
thread_local const shared_ptr<compressor> compressor::lz4 = ::make_shared<lz4_processor>(namespace_prefix + "LZ4Compressor");
thread_local const shared_ptr<compressor> compressor::snappy = ::make_shared<snappy_processor>(namespace_prefix + "SnappyCompressor");
thread_local const shared_ptr<compressor> compressor::deflate = ::make_shared<deflate_processor>(namespace_prefix + "DeflateCompressor");
thread_local const shared_ptr<compressor> compressor::lz4 = ::make_shared<lz4_processor>(make_name("LZ4Compressor"));
thread_local const shared_ptr<compressor> compressor::snappy = ::make_shared<snappy_processor>(make_name("SnappyCompressor"));
thread_local const shared_ptr<compressor> compressor::deflate = ::make_shared<deflate_processor>(make_name("DeflateCompressor"));
const sstring compression_parameters::SSTABLE_COMPRESSION = "sstable_compression";
const sstring compression_parameters::CHUNK_LENGTH_KB = "chunk_length_in_kb";

View File

@@ -69,7 +69,7 @@ public:
static thread_local const ptr_type snappy;
static thread_local const ptr_type deflate;
static const sstring namespace_prefix;
static sstring make_name(std::string_view short_name);
};
template<typename BaseType, typename... Args>

View File

@@ -431,8 +431,6 @@ modes = {
scylla_tests = set([
'test/boost/UUID_test',
'test/boost/pretty_printers_test',
'test/boost/cdc_generation_test',
'test/boost/aggregate_fcts_test',
'test/boost/allocation_strategy_test',
'test/boost/alternator_unit_test',
@@ -443,7 +441,9 @@ scylla_tests = set([
'test/boost/batchlog_manager_test',
'test/boost/big_decimal_test',
'test/boost/bloom_filter_test',
'test/boost/bptree_test',
'test/boost/broken_sstable_test',
'test/boost/btree_test',
'test/boost/bytes_ostream_test',
'test/boost/cache_algorithm_test',
'test/boost/cache_mutation_reader_test',
@@ -452,13 +452,15 @@ scylla_tests = set([
'test/boost/canonical_mutation_test',
'test/boost/cartesian_product_test',
'test/boost/castas_fcts_test',
'test/boost/cdc_generation_test',
'test/boost/cdc_test',
'test/boost/cell_locker_test',
'test/boost/checksum_utils_test',
'test/boost/chunked_vector_test',
'test/boost/chunked_managed_vector_test',
'test/boost/chunked_vector_test',
'test/boost/clustering_ranges_walker_test',
'test/boost/column_mapping_test',
'test/boost/commitlog_cleanup_test',
'test/boost/commitlog_test',
'test/boost/compaction_group_test',
'test/boost/compound_test',
@@ -468,102 +470,124 @@ scylla_tests = set([
'test/boost/counter_test',
'test/boost/cql_auth_query_test',
'test/boost/cql_auth_syntax_test',
'test/boost/cql_query_test',
'test/boost/cql_functions_test',
'test/boost/cql_query_group_test',
'test/boost/cql_query_large_test',
'test/boost/cql_query_like_test',
'test/boost/cql_query_group_test',
'test/boost/cql_functions_test',
'test/boost/cql_query_test',
'test/boost/crc_test',
'test/boost/data_listeners_test',
'test/boost/database_test',
'test/boost/commitlog_cleanup_test',
'test/boost/dirty_memory_manager_test',
'test/boost/double_decker_test',
'test/boost/duration_test',
'test/boost/dynamic_bitset_test',
'test/boost/enum_option_test',
'test/boost/enum_set_test',
'test/boost/extensions_test',
'test/boost/error_injection_test',
'test/boost/estimated_histogram_test',
'test/boost/exception_container_test',
'test/boost/exceptions_fallback_test',
'test/boost/exceptions_optimized_test',
'test/boost/expr_test',
'test/boost/extensions_test',
'test/boost/filtering_test',
'test/boost/mutation_reader_another_test',
'test/boost/flush_queue_test',
'test/boost/fragmented_temporary_buffer_test',
'test/boost/frozen_mutation_test',
'test/boost/generic_server_test',
'test/boost/gossiping_property_file_snitch_test',
'test/boost/group0_cmd_merge_test',
'test/boost/group0_test',
'test/boost/hash_test',
'test/boost/hashers_test',
'test/boost/hint_test',
'test/boost/idl_test',
'test/boost/index_with_paging_test',
'test/boost/input_stream_test',
'test/boost/intrusive_array_test',
'test/boost/json_cql_query_test',
'test/boost/json_test',
'test/boost/keys_test',
'test/boost/large_paging_state_test',
'test/boost/recent_entries_map_test',
'test/boost/like_matcher_test',
'test/boost/limiting_data_source_test',
'test/boost/linearizing_input_stream_test',
'test/boost/lister_test',
'test/boost/loading_cache_test',
'test/boost/locator_topology_test',
'test/boost/log_heap_test',
'test/boost/estimated_histogram_test',
'test/boost/summary_test',
'test/boost/logalloc_test',
'test/boost/logalloc_standard_allocator_segment_pool_backend_test',
'test/boost/managed_vector_test',
'test/boost/logalloc_test',
'test/boost/managed_bytes_test',
'test/boost/intrusive_array_test',
'test/boost/managed_vector_test',
'test/boost/map_difference_test',
'test/boost/memtable_test',
'test/boost/multishard_combining_reader_as_mutation_source_test',
'test/boost/multishard_mutation_query_test',
'test/boost/murmur_hash_test',
'test/boost/mutation_fragment_test',
'test/boost/mutation_query_test',
'test/boost/mutation_reader_another_test',
'test/boost/mutation_reader_test',
'test/boost/multishard_combining_reader_as_mutation_source_test',
'test/boost/mutation_test',
'test/boost/mutation_writer_test',
'test/boost/mvcc_test',
'test/boost/network_topology_strategy_test',
'test/boost/token_metadata_test',
'test/boost/tablets_test',
'test/boost/sessions_test',
'test/boost/nonwrapping_interval_test',
'test/boost/observable_test',
'test/boost/partitioner_test',
'test/boost/per_partition_rate_limit_test',
'test/boost/pretty_printers_test',
'test/boost/querier_cache_test',
'test/boost/query_processor_test',
'test/boost/wrapping_interval_test',
'test/boost/radix_tree_test',
'test/boost/range_tombstone_list_test',
'test/boost/reusable_buffer_test',
'test/boost/restrictions_test',
'test/boost/rate_limiter_test',
'test/boost/reader_concurrency_semaphore_test',
'test/boost/recent_entries_map_test',
'test/boost/repair_test',
'test/boost/restrictions_test',
'test/boost/result_utils_test',
'test/boost/reusable_buffer_test',
'test/boost/role_manager_test',
'test/boost/row_cache_test',
'test/boost/rust_test',
'test/boost/s3_test',
'test/boost/schema_change_test',
'test/boost/schema_changes_test',
'test/boost/schema_loader_test',
'test/boost/schema_registry_test',
'test/boost/secondary_index_test',
'test/boost/tracing_test',
'test/boost/index_with_paging_test',
'test/boost/serialization_test',
'test/boost/serialized_action_test',
'test/boost/service_level_controller_test',
'test/boost/sessions_test',
'test/boost/small_vector_test',
'test/boost/snitch_reset_test',
'test/boost/sorting_test',
'test/boost/sstable_3_x_test',
'test/boost/sstable_compaction_test',
'test/boost/sstable_conforms_to_mutation_source_test',
'test/boost/sstable_datafile_test',
'test/boost/sstable_directory_test',
'test/boost/sstable_generation_test',
'test/boost/sstable_move_test',
'test/boost/sstable_mutation_test',
'test/boost/sstable_partition_index_cache_test',
'test/boost/schema_changes_test',
'test/boost/sstable_conforms_to_mutation_source_test',
'test/boost/sstable_compaction_test',
'test/boost/sstable_resharding_test',
'test/boost/sstable_directory_test',
'test/boost/sstable_set_test',
'test/boost/sstable_test',
'test/boost/sstable_move_test',
'test/boost/stall_free_test',
'test/boost/statement_restrictions_test',
'test/boost/storage_proxy_test',
'test/boost/string_format_test',
'test/boost/summary_test',
'test/boost/tablets_test',
'test/boost/tagged_integer_test',
'test/boost/token_metadata_test',
'test/boost/top_k_test',
'test/boost/tracing_test',
'test/boost/transport_test',
'test/boost/types_test',
'test/boost/user_function_test',
@@ -571,39 +595,16 @@ scylla_tests = set([
'test/boost/utf8_test',
'test/boost/view_build_test',
'test/boost/view_complex_test',
'test/boost/view_schema_test',
'test/boost/view_schema_pkey_test',
'test/boost/view_schema_ckey_test',
'test/boost/view_schema_pkey_test',
'test/boost/view_schema_test',
'test/boost/vint_serialization_test',
'test/boost/virtual_reader_test',
'test/boost/virtual_table_mutation_source_test',
'test/boost/virtual_table_test',
'test/boost/wasm_test',
'test/boost/wasm_alloc_test',
'test/boost/bptree_test',
'test/boost/btree_test',
'test/boost/radix_tree_test',
'test/boost/double_decker_test',
'test/boost/stall_free_test',
'test/boost/sstable_set_test',
'test/boost/reader_concurrency_semaphore_test',
'test/boost/service_level_controller_test',
'test/boost/schema_loader_test',
'test/boost/lister_test',
'test/boost/group0_test',
'test/boost/exception_container_test',
'test/boost/result_utils_test',
'test/boost/rate_limiter_test',
'test/boost/per_partition_rate_limit_test',
'test/boost/expr_test',
'test/boost/exceptions_optimized_test',
'test/boost/exceptions_fallback_test',
'test/boost/s3_test',
'test/boost/locator_topology_test',
'test/boost/string_format_test',
'test/boost/tagged_integer_test',
'test/boost/group0_cmd_merge_test',
'test/boost/sorting_test',
'test/boost/wasm_test',
'test/boost/wrapping_interval_test',
'test/manual/ec2_snitch_test',
'test/manual/enormous_table_scan_test',
'test/manual/gce_snitch_test',
@@ -1452,7 +1453,7 @@ deps['test/boost/bytes_ostream_test'] = [
"test/lib/log.cc",
]
deps['test/boost/input_stream_test'] = ['test/boost/input_stream_test.cc']
deps['test/boost/UUID_test'] = ['utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'utils/hashers.cc', 'utils/on_internal_error.cc']
deps['test/boost/UUID_test'] = ['clocks-impl.cc', 'utils/UUID_gen.cc', 'test/boost/UUID_test.cc', 'utils/uuid.cc', 'utils/dynamic_bitset.cc', 'utils/hashers.cc', 'utils/on_internal_error.cc']
deps['test/boost/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'test/boost/murmur_hash_test.cc']
deps['test/boost/allocation_strategy_test'] = ['test/boost/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc']
deps['test/boost/log_heap_test'] = ['test/boost/log_heap_test.cc']

View File

@@ -68,6 +68,7 @@ options {
#include "cql3/statements/ks_prop_defs.hh"
#include "cql3/selection/raw_selector.hh"
#include "cql3/selection/selectable-expr.hh"
#include "cql3/dialect.hh"
#include "cql3/keyspace_element_name.hh"
#include "cql3/constants.hh"
#include "cql3/operation_impl.hh"
@@ -148,6 +149,8 @@ using uexpression = uninitialized<expression>;
listener_type* listener;
dialect _dialect;
// Keeps the names of all bind variables. For bind variables without a name ('?'), the name is nullptr.
// Maps bind_index -> name.
std::vector<::shared_ptr<cql3::column_identifier>> _bind_variable_names;
@@ -171,9 +174,14 @@ using uexpression = uninitialized<expression>;
return s;
}
void set_dialect(dialect d) {
_dialect = d;
}
bind_variable new_bind_variables(shared_ptr<cql3::column_identifier> name)
{
if (name && _named_bind_variables_indexes.contains(*name)) {
if (_dialect.duplicate_bind_variable_names_refer_to_same_variable
&& name && _named_bind_variables_indexes.contains(*name)) {
return bind_variable{_named_bind_variables_indexes[*name]};
}
auto marker = bind_variable{_bind_variable_names.size()};

View File

@@ -449,7 +449,8 @@ sstring maybe_quote(const sstring& identifier) {
// many keywords but allow keywords listed as "unreserved keywords".
// So we can use any of them, for example cident.
try {
cql3::util::do_with_parser(identifier, std::mem_fn(&cql3_parser::CqlParser::cident));
// In general it's not a good idea to use the default dialect, but for parsing an identifier, it's okay.
cql3::util::do_with_parser(identifier, dialect{}, std::mem_fn(&cql3_parser::CqlParser::cident));
return identifier;
} catch(exceptions::syntax_exception&) {
// This alphanumeric string is not a valid identifier, so fall

34
cql3/dialect.hh Normal file
View File

@@ -0,0 +1,34 @@
// Copyright (C) 2024-present ScyllaDB
// SPDX-License-Identifier: AGPL-3.0-or-later
#pragma once
#include <fmt/core.h>
namespace cql3 {
struct dialect {
bool duplicate_bind_variable_names_refer_to_same_variable = true; // if :a is found twice in a query, the two references are to the same variable (see #15559)
bool operator==(const dialect&) const = default;
};
inline
dialect
internal_dialect() {
return dialect{
.duplicate_bind_variable_names_refer_to_same_variable = true,
};
}
}
template <>
struct fmt::formatter<cql3::dialect> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <typename FormatContext>
auto format(const cql3::dialect& d, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "cql3::dialect{{duplicate_bind_variable_names_refer_to_same_variable={}}}",
d.duplicate_bind_variable_names_refer_to_same_variable);
}
};

View File

@@ -14,6 +14,7 @@
#include "utils/hash.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/column_specification.hh"
#include "cql3/dialect.hh"
namespace cql3 {
@@ -37,14 +38,17 @@ class prepared_cache_key_type {
public:
// derive from cql_prepared_id_type so we can customize the formatter of
// cache_key_type
struct cache_key_type : public cql_prepared_id_type {};
struct cache_key_type : public cql_prepared_id_type {
cache_key_type(cql_prepared_id_type&& id, cql3::dialect d) : cql_prepared_id_type(std::move(id)), dialect(d) {}
cql3::dialect dialect; // Not part of hash, but we don't expect collisions because of that
bool operator==(const cache_key_type& other) const = default;
};
private:
cache_key_type _key;
public:
prepared_cache_key_type() = default;
explicit prepared_cache_key_type(cql_prepared_id_type cql_id) : _key(std::move(cql_id)) {}
explicit prepared_cache_key_type(cql_prepared_id_type cql_id, dialect d) : _key(std::move(cql_id), d) {}
cache_key_type& key() { return _key; }
const cache_key_type& key() const { return _key; }
@@ -176,7 +180,7 @@ struct hash<cql3::prepared_cache_key_type> final {
template <> struct fmt::formatter<cql3::prepared_cache_key_type::cache_key_type> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
auto format(const cql3::prepared_cache_key_type::cache_key_type& p, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{{cql_id: {}}}", static_cast<const cql3::cql_prepared_id_type&>(p));
return fmt::format_to(ctx.out(), "{{cql_id: {}, dialect: {}}}", static_cast<const cql3::cql_prepared_id_type&>(p), p.dialect);
}
};

View File

@@ -566,10 +566,10 @@ query_processor::execute_maybe_with_guard(service::query_state& query_state, ::s
}
future<::shared_ptr<result_message>>
query_processor::execute_direct_without_checking_exception_message(const sstring_view& query_string, service::query_state& query_state, query_options& options) {
query_processor::execute_direct_without_checking_exception_message(const sstring_view& query_string, service::query_state& query_state, dialect d, query_options& options) {
log.trace("execute_direct: \"{}\"", query_string);
tracing::trace(query_state.get_trace_state(), "Parsing a statement");
auto p = get_statement(query_string, query_state.get_client_state());
auto p = get_statement(query_string, query_state.get_client_state(), d);
auto statement = p->statement;
const auto warnings = std::move(p->warnings);
if (statement->get_bound_terms() != options.get_values_count()) {
@@ -653,18 +653,21 @@ query_processor::process_authorized_statement(const ::shared_ptr<cql_statement>
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, service::query_state& query_state) {
query_processor::prepare(sstring query_string, service::query_state& query_state, cql3::dialect d) {
auto& client_state = query_state.get_client_state();
return prepare(std::move(query_string), client_state);
return prepare(std::move(query_string), client_state, d);
}
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, const service::client_state& client_state) {
query_processor::prepare(sstring query_string, const service::client_state& client_state, cql3::dialect d) {
using namespace cql_transport::messages;
return prepare_one<result_message::prepared::cql>(
std::move(query_string),
client_state,
compute_id,
d,
[d] (std::string_view query_string, std::string_view keyspace) {
return compute_id(query_string, keyspace, d);
},
prepared_cache_key_type::cql_id);
}
@@ -676,13 +679,14 @@ static std::string hash_target(std::string_view query_string, std::string_view k
prepared_cache_key_type query_processor::compute_id(
std::string_view query_string,
std::string_view keyspace) {
return prepared_cache_key_type(md5_hasher::calculate(hash_target(query_string, keyspace)));
std::string_view keyspace,
dialect d) {
return prepared_cache_key_type(md5_hasher::calculate(hash_target(query_string, keyspace)), d);
}
std::unique_ptr<prepared_statement>
query_processor::get_statement(const sstring_view& query, const service::client_state& client_state) {
std::unique_ptr<raw::parsed_statement> statement = parse_statement(query);
query_processor::get_statement(const sstring_view& query, const service::client_state& client_state, dialect d) {
std::unique_ptr<raw::parsed_statement> statement = parse_statement(query, d);
// Set keyspace for statement that require login
auto cf_stmt = dynamic_cast<raw::cf_statement*>(statement.get());
@@ -696,7 +700,7 @@ query_processor::get_statement(const sstring_view& query, const service::client_
}
std::unique_ptr<raw::parsed_statement>
query_processor::parse_statement(const sstring_view& query) {
query_processor::parse_statement(const sstring_view& query, dialect d) {
try {
{
const char* error_injection_key = "query_processor-parse_statement-test_failure";
@@ -706,7 +710,7 @@ query_processor::parse_statement(const sstring_view& query) {
}
});
}
auto statement = util::do_with_parser(query, std::mem_fn(&cql3_parser::CqlParser::query));
auto statement = util::do_with_parser(query, d, std::mem_fn(&cql3_parser::CqlParser::query));
if (!statement) {
throw exceptions::syntax_exception("Parsing failed");
}
@@ -722,9 +726,9 @@ query_processor::parse_statement(const sstring_view& query) {
}
std::vector<std::unique_ptr<raw::parsed_statement>>
query_processor::parse_statements(std::string_view queries) {
query_processor::parse_statements(std::string_view queries, dialect d) {
try {
auto statements = util::do_with_parser(queries, std::mem_fn(&cql3_parser::CqlParser::queries));
auto statements = util::do_with_parser(queries, d, std::mem_fn(&cql3_parser::CqlParser::queries));
if (statements.empty()) {
throw exceptions::syntax_exception("Parsing failed");
}
@@ -797,7 +801,7 @@ query_options query_processor::make_internal_options(
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string) {
auto& p = _internal_statements[query_string];
if (p == nullptr) {
auto np = parse_statement(query_string)->prepare(_db, _cql_stats);
auto np = parse_statement(query_string, internal_dialect())->prepare(_db, _cql_stats);
np->statement->raw_cql_statement = query_string;
p = std::move(np); // inserts it into map
}
@@ -903,7 +907,8 @@ query_processor::execute_internal(
auto p = prepare_internal(query_string);
return execute_with_params(std::move(p), cl, query_state, values);
} else {
auto p = parse_statement(query_string)->prepare(_db, _cql_stats);
// For internal queries, we want the default dialect, not the user provided one
auto p = parse_statement(query_string, dialect{})->prepare(_db, _cql_stats);
p->statement->raw_cql_statement = query_string;
auto checked_weak_ptr = p->checked_weak_from_this();
return execute_with_params(std::move(checked_weak_ptr), cl, query_state, values).finally([p = std::move(p)] {});

View File

@@ -21,6 +21,7 @@
#include "cql3/authorized_prepared_statements_cache.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/cql_statement.hh"
#include "cql3/dialect.hh"
#include "exceptions/exceptions.hh"
#include "service/migration_listener.hh"
#include "timestamp.hh"
@@ -137,10 +138,11 @@ public:
static prepared_cache_key_type compute_id(
std::string_view query_string,
std::string_view keyspace);
std::string_view keyspace,
dialect d);
static std::unique_ptr<statements::raw::parsed_statement> parse_statement(const std::string_view& query);
static std::vector<std::unique_ptr<statements::raw::parsed_statement>> parse_statements(std::string_view queries);
static std::unique_ptr<statements::raw::parsed_statement> parse_statement(const std::string_view& query, dialect d);
static std::vector<std::unique_ptr<statements::raw::parsed_statement>> parse_statements(std::string_view queries, dialect d);
query_processor(service::storage_proxy& proxy, data_dictionary::database db, service::migration_notifier& mn, memory_config mcfg, cql_config& cql_cfg, utils::loading_cache_config auth_prep_cache_cfg, lang::manager& langm);
@@ -249,10 +251,12 @@ public:
execute_direct(
const std::string_view& query_string,
service::query_state& query_state,
dialect d,
query_options& options) {
return execute_direct_without_checking_exception_message(
query_string,
query_state,
d,
options)
.then(cql_transport::messages::propagate_exception_as_future<::shared_ptr<cql_transport::messages::result_message>>);
}
@@ -263,6 +267,7 @@ public:
execute_direct_without_checking_exception_message(
const std::string_view& query_string,
service::query_state& query_state,
dialect d,
query_options& options);
future<::shared_ptr<cql_transport::messages::result_message>>
@@ -397,10 +402,10 @@ public:
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare(sstring query_string, service::query_state& query_state);
prepare(sstring query_string, service::query_state& query_state, dialect d);
future<::shared_ptr<cql_transport::messages::result_message::prepared>>
prepare(sstring query_string, const service::client_state& client_state);
prepare(sstring query_string, const service::client_state& client_state, dialect d);
future<> stop();
@@ -443,7 +448,8 @@ public:
std::unique_ptr<statements::prepared_statement> get_statement(
const std::string_view& query,
const service::client_state& client_state);
const service::client_state& client_state,
dialect d);
friend class migration_subscriber;
@@ -527,14 +533,15 @@ private:
prepare_one(
sstring query_string,
const service::client_state& client_state,
dialect d,
PreparedKeyGenerator&& id_gen,
IdGetter&& id_getter) {
return do_with(
id_gen(query_string, client_state.get_raw_keyspace()),
std::move(query_string),
[this, &client_state, &id_getter](const prepared_cache_key_type& key, const sstring& query_string) {
return _prepared_cache.get(key, [this, &query_string, &client_state] {
auto prepared = get_statement(query_string, client_state);
[this, &client_state, &id_getter, d](const prepared_cache_key_type& key, const sstring& query_string) {
return _prepared_cache.get(key, [this, &query_string, &client_state, d] {
auto prepared = get_statement(query_string, client_state, d);
auto bound_terms = prepared->statement->get_bound_terms();
if (bound_terms > std::numeric_limits<uint16_t>::max()) {
throw exceptions::invalid_request_exception(

View File

@@ -503,10 +503,12 @@ selection::collect_metadata(const schema& schema, const std::vector<prepared_sel
}
result_set_builder::result_set_builder(const selection& s, gc_clock::time_point now,
std::vector<size_t> group_by_cell_indices)
std::vector<size_t> group_by_cell_indices,
uint64_t limit)
: _result_set(std::make_unique<result_set>(::make_shared<metadata>(*(s.get_result_metadata()))))
, _selectors(s.new_selectors())
, _group_by_cell_indices(std::move(group_by_cell_indices))
, _limit(limit)
, _last_group(_group_by_cell_indices.size())
, _group_began(false)
, _now(now)
@@ -577,8 +579,10 @@ void result_set_builder::flush_selectors() {
// handled by process_current_row
return;
}
_result_set->add_row(_selectors->get_output_row());
_selectors->reset();
if (_result_set->size() < _limit) {
_result_set->add_row(_selectors->get_output_row());
_selectors->reset();
}
}
void result_set_builder::complete_row() {
@@ -790,6 +794,10 @@ int32_t result_set_builder::ttl_of(size_t idx) {
return _ttls[idx];
}
size_t result_set_builder::result_set_size() const {
return _result_set->size();
}
bytes_opt result_set_builder::get_value(data_type t, query::result_atomic_cell_view c) {
return {c.value().linearize()};
}

View File

@@ -172,6 +172,7 @@ private:
std::unique_ptr<result_set> _result_set;
std::unique_ptr<selectors> _selectors;
const std::vector<size_t> _group_by_cell_indices; ///< Indices in \c current of cells holding GROUP BY values.
const uint64_t _limit; ///< Maximum number of rows to return.
std::vector<managed_bytes_opt> _last_group; ///< Previous row's group: all of GROUP BY column values.
bool _group_began; ///< Whether a group began being formed.
public:
@@ -236,7 +237,8 @@ public:
};
result_set_builder(const selection& s, gc_clock::time_point now,
std::vector<size_t> group_by_cell_indices = {});
std::vector<size_t> group_by_cell_indices = {},
uint64_t limit = std::numeric_limits<uint64_t>::max());
void add_empty();
void add(bytes_opt value);
void add(const column_definition& def, const query::result_atomic_cell_view& c);
@@ -246,6 +248,7 @@ public:
std::unique_ptr<result_set> build();
api::timestamp_type timestamp_of(size_t idx);
int32_t ttl_of(size_t idx);
size_t result_set_size() const;
// Implements ResultVisitor concept from query.hh
template<typename Filter = nop_filter>

View File

@@ -11,6 +11,7 @@
#include <boost/range/algorithm.hpp>
#include <fmt/format.h>
#include <seastar/core/coroutine.hh>
#include <seastar/core/on_internal_error.hh>
#include <stdexcept>
#include "alter_keyspace_statement.hh"
#include "prepared_statement.hh"
@@ -43,18 +44,16 @@ future<> cql3::statements::alter_keyspace_statement::check_access(query_processo
return state.has_keyspace_access(_name, auth::permission::ALTER);
}
static bool validate_rf_difference(const std::string_view curr_rf, const std::string_view new_rf) {
auto to_number = [] (const std::string_view rf) {
int result;
// We assume the passed string view represents a valid decimal number,
// so we don't need the error code.
(void) std::from_chars(rf.begin(), rf.end(), result);
return result;
};
// We want to ensure that each DC's RF is going to change by at most 1
// because in that case the old and new quorums must overlap.
return std::abs(to_number(curr_rf) - to_number(new_rf)) <= 1;
static unsigned get_abs_rf_diff(const std::string& curr_rf, const std::string& new_rf) {
try {
return std::abs(std::stoi(curr_rf) - std::stoi(new_rf));
} catch (std::invalid_argument const& ex) {
on_internal_error(mylogger, fmt::format("get_abs_rf_diff expects integer arguments, "
"but got curr_rf:{} and new_rf:{}", curr_rf, new_rf));
} catch (std::out_of_range const& ex) {
on_internal_error(mylogger, fmt::format("get_abs_rf_diff expects integer arguments to fit into `int` type, "
"but got curr_rf:{} and new_rf:{}", curr_rf, new_rf));
}
}
void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, const service::client_state& state) const {
@@ -84,11 +83,24 @@ void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, c
auto new_ks = _attrs->as_ks_metadata_update(ks.metadata(), *qp.proxy().get_token_metadata_ptr(), qp.proxy().features());
if (ks.get_replication_strategy().uses_tablets()) {
const std::map<sstring, sstring>& current_rfs = ks.metadata()->strategy_options();
for (const auto& [new_dc, new_rf] : _attrs->get_replication_options()) {
auto it = current_rfs.find(new_dc);
if (it != current_rfs.end() && !validate_rf_difference(it->second, new_rf)) {
throw exceptions::invalid_request_exception("Cannot modify replication factor of any DC by more than 1 at a time.");
const std::map<sstring, sstring>& current_rf_per_dc = ks.metadata()->strategy_options();
auto new_rf_per_dc = _attrs->get_replication_options();
new_rf_per_dc.erase(ks_prop_defs::REPLICATION_STRATEGY_CLASS_KEY);
unsigned total_abs_rfs_diff = 0;
for (const auto& [new_dc, new_rf] : new_rf_per_dc) {
sstring old_rf = "0";
if (auto new_dc_in_current_mapping = current_rf_per_dc.find(new_dc);
new_dc_in_current_mapping != current_rf_per_dc.end()) {
old_rf = new_dc_in_current_mapping->second;
} else if (!qp.proxy().get_token_metadata_ptr()->get_topology().get_datacenters().contains(new_dc)) {
// This means that the DC listed in ALTER doesn't exist. This error will be reported later,
// during validation in abstract_replication_strategy::validate_replication_strategy.
// We can't report this error now, because it'd change the order of errors reported:
// first we need to report non-existing DCs, then if RFs aren't changed by too much.
continue;
}
if (total_abs_rfs_diff += get_abs_rf_diff(old_rf, new_rf); total_abs_rfs_diff >= 2) {
throw exceptions::invalid_request_exception("Only one DC's RF can be changed at a time and not by more than 1");
}
}
}
@@ -118,6 +130,63 @@ bool cql3::statements::alter_keyspace_statement::changes_tablets(query_processor
return ks.get_replication_strategy().uses_tablets() && !_attrs->get_replication_options().empty();
}
namespace {
// These functions are used to flatten all the options in the keyspace definition into a single-level map<string, string>.
// (Currently options are stored in a nested structure that looks more like a map<string, map<string, string>>).
// Flattening is simply joining the keys of maps from both levels with a colon ':' character,
// or in other words: prefixing the keys in the output map with the option type, e.g. 'replication', 'storage', etc.,
// so that the output map contains entries like: "replication:dc1" -> "3".
// This is done to avoid key conflicts and to be able to de-flatten the map back into the original structure.
void add_prefixed_key(const sstring& prefix, const std::map<sstring, sstring>& in, std::map<sstring, sstring>& out) {
for (const auto& [in_key, in_value]: in) {
out[prefix + ":" + in_key] = in_value;
}
};
std::map<sstring, sstring> get_current_options_flattened(const shared_ptr<cql3::statements::ks_prop_defs>& ks,
bool include_tablet_options,
const gms::feature_service& feat) {
std::map<sstring, sstring> all_options;
add_prefixed_key(ks->KW_REPLICATION, ks->get_replication_options(), all_options);
add_prefixed_key(ks->KW_STORAGE, ks->get_storage_options().to_map(), all_options);
// if no tablet options are specified in ATLER KS statement,
// we want to preserve the old ones and hence cannot overwrite them with defaults
if (include_tablet_options) {
auto initial_tablets = ks->get_initial_tablets(std::nullopt);
add_prefixed_key(ks->KW_TABLETS,
{{"enabled", initial_tablets ? "true" : "false"},
{"initial", std::to_string(initial_tablets.value_or(0))}},
all_options);
}
add_prefixed_key(ks->KW_DURABLE_WRITES,
{{sstring(ks->KW_DURABLE_WRITES), to_sstring(ks->get_boolean(ks->KW_DURABLE_WRITES, true))}},
all_options);
return all_options;
}
std::map<sstring, sstring> get_old_options_flattened(const data_dictionary::keyspace& ks, bool include_tablet_options) {
std::map<sstring, sstring> all_options;
using namespace cql3::statements;
add_prefixed_key(ks_prop_defs::KW_REPLICATION, ks.get_replication_strategy().get_config_options(), all_options);
add_prefixed_key(ks_prop_defs::KW_STORAGE, ks.metadata()->get_storage_options().to_map(), all_options);
if (include_tablet_options) {
add_prefixed_key(ks_prop_defs::KW_TABLETS,
{{"enabled", ks.metadata()->initial_tablets() ? "true" : "false"},
{"initial", std::to_string(ks.metadata()->initial_tablets().value_or(0))}},
all_options);
}
add_prefixed_key(ks_prop_defs::KW_DURABLE_WRITES,
{{sstring(ks_prop_defs::KW_DURABLE_WRITES), to_sstring(ks.metadata()->durable_writes())}},
all_options);
return all_options;
}
} // <anonymous> namespace
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, cql3::cql_warnings_vec>>
cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_processor& qp, service::query_state& state, const query_options& options, service::group0_batch& mc) const {
using namespace cql_transport;
@@ -130,11 +199,18 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
auto ks_md_update = _attrs->as_ks_metadata_update(ks_md, tm, feat);
std::vector<mutation> muts;
std::vector<sstring> warnings;
auto ks_options = _attrs->get_all_options_flattened(feat);
bool include_tablet_options = _attrs->get_map(_attrs->KW_TABLETS).has_value();
auto old_ks_options = get_old_options_flattened(ks, include_tablet_options);
auto ks_options = get_current_options_flattened(_attrs, include_tablet_options, feat);
ks_options.merge(old_ks_options);
auto ts = mc.write_timestamp();
auto global_request_id = mc.new_group0_state_id();
// we only want to run the tablets path if there are actually any tablets changes, not only schema changes
// TODO: the current `if (changes_tablets(qp))` is insufficient: someone may set the same RFs as before,
// and we'll unnecessarily trigger the processing path for ALTER tablets KS,
// when in reality nothing or only schema is being changed
if (changes_tablets(qp)) {
if (!qp.topology_global_queue_empty()) {
return make_exception_future<std::tuple<::shared_ptr<::cql_transport::event::schema_change>, cql3::cql_warnings_vec>>(

View File

@@ -384,7 +384,8 @@ std::pair<schema_builder, std::vector<view_ptr>> alter_table_statement::prepare_
auto new_where = util::rename_column_in_where_clause(
view->view_info()->where_clause(),
column_identifier::raw(view_from->text(), true),
column_identifier::raw(view_to->text(), true));
column_identifier::raw(view_to->text(), true),
cql3::dialect{});
builder.with_view_info(view->view_info()->base_id(), view->view_info()->base_name(),
view->view_info()->include_all_columns(), std::move(new_where));

View File

@@ -7,6 +7,7 @@
*/
#include "auth/service.hh"
#include "exceptions/exceptions.hh"
#include "seastarx.hh"
#include "cql3/statements/create_service_level_statement.hh"
#include "service/qos/service_level_controller.hh"
@@ -38,6 +39,10 @@ create_service_level_statement::execute(query_processor& qp,
service::query_state &state,
const query_options &,
std::optional<service::group0_guard> guard) const {
if (_service_level.starts_with('$')) {
throw exceptions::invalid_request_exception("Names starting with '$' are reserved for internal tenants. Use a different name.");
}
service::group0_batch mc{std::move(guard)};
qos::service_level_options slo = _slo.replace_defaults(qos::service_level_options{});
auto& sl = state.get_service_level_controller();

View File

@@ -192,6 +192,13 @@ std::unique_ptr<prepared_statement> create_table_statement::raw_statement::prepa
auto stmt = ::make_shared<create_table_statement>(*_cf_name, _properties.properties(), _if_not_exists, _static_columns, _properties.properties()->get_id());
bool ks_uses_tablets;
try {
ks_uses_tablets = db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets();
} catch (const data_dictionary::no_such_keyspace& e) {
throw exceptions::invalid_request_exception("Cannot create a table in a non-existent keyspace: " + keyspace());
}
std::optional<std::map<bytes, data_type>> defined_multi_cell_columns;
for (auto&& entry : _definitions) {
::shared_ptr<column_identifier> id = entry.first;
@@ -201,7 +208,7 @@ std::unique_ptr<prepared_statement> create_table_statement::raw_statement::prepa
throw exceptions::invalid_request_exception("Cannot set default_time_to_live on a table with counters");
}
if (db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets() && pt.is_counter()) {
if (ks_uses_tablets && pt.is_counter()) {
throw exceptions::invalid_request_exception(format("Cannot use the 'counter' type for table {}.{}: Counters are not yet supported with tablets", keyspace(), cf_name));
}

View File

@@ -138,28 +138,22 @@ data_dictionary::storage_options ks_prop_defs::get_storage_options() const {
return opts;
}
ks_prop_defs::init_tablets_options ks_prop_defs::get_initial_tablets(const sstring& strategy_class, bool enabled_by_default) const {
// FIXME -- this should be ignored somehow else
init_tablets_options ret{ .enabled = false, .specified_count = std::nullopt };
if (locator::abstract_replication_strategy::to_qualified_class_name(strategy_class) != "org.apache.cassandra.locator.NetworkTopologyStrategy") {
return ret;
}
std::optional<unsigned> ks_prop_defs::get_initial_tablets(std::optional<unsigned> default_value) const {
auto tablets_options = get_map(KW_TABLETS);
if (!tablets_options) {
return enabled_by_default ? init_tablets_options{ .enabled = true } : ret;
return default_value;
}
unsigned initial_count = 0;
auto it = tablets_options->find("enabled");
if (it != tablets_options->end()) {
auto enabled = it->second;
tablets_options->erase(it);
if (enabled == "true") {
ret = init_tablets_options{ .enabled = true, .specified_count = 0 }; // even if 'initial' is not set, it'll start with auto-detection
// nothing
} else if (enabled == "false") {
assert(!ret.enabled);
return ret;
return std::nullopt;
} else {
throw exceptions::configuration_exception(sstring("Tablets enabled value must be true or false; found: ") + enabled);
}
@@ -168,7 +162,7 @@ ks_prop_defs::init_tablets_options ks_prop_defs::get_initial_tablets(const sstri
it = tablets_options->find("initial");
if (it != tablets_options->end()) {
try {
ret = init_tablets_options{ .enabled = true, .specified_count = std::stol(it->second)};
initial_count = std::stol(it->second);
} catch (...) {
throw exceptions::configuration_exception(sstring("Initial tablets value should be numeric; found ") + it->second);
}
@@ -179,7 +173,7 @@ ks_prop_defs::init_tablets_options ks_prop_defs::get_initial_tablets(const sstri
throw exceptions::configuration_exception(sstring("Unrecognized tablets option ") + tablets_options->begin()->first);
}
return ret;
return initial_count;
}
std::optional<sstring> ks_prop_defs::get_replication_strategy_class() const {
@@ -190,32 +184,13 @@ bool ks_prop_defs::get_durable_writes() const {
return get_boolean(KW_DURABLE_WRITES, true);
}
std::map<sstring, sstring> ks_prop_defs::get_all_options_flattened(const gms::feature_service& feat) const {
std::map<sstring, sstring> all_options;
auto ingest_flattened_options = [&all_options](const std::map<sstring, sstring>& options, const sstring& prefix) {
for (auto& option: options) {
all_options[prefix + ":" + option.first] = option.second;
}
};
ingest_flattened_options(get_replication_options(), KW_REPLICATION);
ingest_flattened_options(get_storage_options().to_map(), KW_STORAGE);
ingest_flattened_options(get_map(KW_TABLETS).value_or(std::map<sstring, sstring>{}), KW_TABLETS);
ingest_flattened_options({{sstring(KW_DURABLE_WRITES), to_sstring(get_boolean(KW_DURABLE_WRITES, true))}}, KW_DURABLE_WRITES);
return all_options;
}
lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata(sstring ks_name, const locator::token_metadata& tm, const gms::feature_service& feat) {
auto sc = get_replication_strategy_class().value();
auto initial_tablets = get_initial_tablets(sc, feat.tablets);
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0
if (initial_tablets.enabled && !initial_tablets.specified_count) {
initial_tablets.specified_count = 0;
}
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0 for N.T.S. only
auto initial_tablets = get_initial_tablets(feat.tablets && locator::abstract_replication_strategy::to_qualified_class_name(sc) == "org.apache.cassandra.locator.NetworkTopologyStrategy" ? std::optional<unsigned>(0) : std::nullopt);
auto options = prepare_options(sc, tm, get_replication_options());
return data_dictionary::keyspace_metadata::new_keyspace(ks_name, sc,
std::move(options), initial_tablets.specified_count, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
std::move(options), initial_tablets, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
}
lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata_update(lw_shared_ptr<data_dictionary::keyspace_metadata> old, const locator::token_metadata& tm, const gms::feature_service& feat) {
@@ -228,13 +203,9 @@ lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata_u
sc = old->strategy_name();
options = old_options;
}
auto initial_tablets = get_initial_tablets(*sc, old->initial_tablets().has_value());
// if tablets options have not been specified, inherit them if it's tablets-enabled KS
if (initial_tablets.enabled && !initial_tablets.specified_count) {
initial_tablets.specified_count = old->initial_tablets();
}
return data_dictionary::keyspace_metadata::new_keyspace(old->name(), *sc, options, initial_tablets.specified_count, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
auto initial_tablets = get_initial_tablets(old->initial_tablets());
return data_dictionary::keyspace_metadata::new_keyspace(old->name(), *sc, options, initial_tablets, get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
}

View File

@@ -49,21 +49,15 @@ public:
private:
std::optional<sstring> _strategy_class;
public:
struct init_tablets_options {
bool enabled;
std::optional<unsigned> specified_count;
};
ks_prop_defs() = default;
explicit ks_prop_defs(std::map<sstring, sstring> options);
void validate();
std::map<sstring, sstring> get_replication_options() const;
std::optional<sstring> get_replication_strategy_class() const;
init_tablets_options get_initial_tablets(const sstring& strategy_class, bool enabled_by_default) const;
std::optional<unsigned> get_initial_tablets(std::optional<unsigned> default_value) const;
data_dictionary::storage_options get_storage_options() const;
bool get_durable_writes() const;
std::map<sstring, sstring> get_all_options_flattened(const gms::feature_service& feat) const;
lw_shared_ptr<data_dictionary::keyspace_metadata> as_ks_metadata(sstring ks_name, const locator::token_metadata&, const gms::feature_service&);
lw_shared_ptr<data_dictionary::keyspace_metadata> as_ks_metadata_update(lw_shared_ptr<data_dictionary::keyspace_metadata> old, const locator::token_metadata&, const gms::feature_service&);
};

View File

@@ -54,7 +54,7 @@ list_service_level_statement::execute(query_processor& qp,
return make_ready_future().then([this, &state] () {
if (_describe_all) {
return state.get_service_level_controller().get_distributed_service_levels();
return state.get_service_level_controller().get_distributed_service_levels(qos::query_context::user);
} else {
return state.get_service_level_controller().get_distributed_service_level(_service_level);
}

View File

@@ -46,14 +46,14 @@ public:
protected:
std::optional<sstring> get_simple(const sstring& name) const;
std::optional<std::map<sstring, sstring>> get_map(const sstring& name) const;
void remove_from_map_if_exists(const sstring& name, const sstring& key) const;
public:
bool has_property(const sstring& name) const;
std::optional<value_type> get(const sstring& name) const;
std::optional<std::map<sstring, sstring>> get_map(const sstring& name) const;
sstring get_string(sstring key, sstring default_value) const;
// Return a property value, typed as a Boolean

View File

@@ -283,33 +283,44 @@ select_statement::make_partition_slice(const query_options& options) const
std::reverse(bounds.begin(), bounds.end());
++_stats.reverse_queries;
}
const uint64_t per_partition_limit = get_inner_loop_limit(get_limit(options, _per_partition_limit),
_selection->is_aggregate());
return query::partition_slice(std::move(bounds),
std::move(static_columns), std::move(regular_columns), _opts, nullptr, get_per_partition_limit(options));
std::move(static_columns), std::move(regular_columns), _opts, nullptr, per_partition_limit);
}
uint64_t select_statement::do_get_limit(const query_options& options,
const std::optional<expr::expression>& limit,
const expr::unset_bind_variable_guard& limit_unset_guard,
uint64_t default_limit) const {
if (!limit.has_value() || limit_unset_guard.is_unset(options) || _selection->is_aggregate()) {
return default_limit;
}
auto val = expr::evaluate(*limit, options);
if (val.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value of limit");
select_statement::get_limit_result select_statement::get_limit(
const query_options& options, const std::optional<expr::expression>& limit) const
{
if (!limit.has_value()) {
return bo::success(query::max_rows);
}
try {
auto val = expr::evaluate(*limit, options);
if (val.is_null()) {
return bo::failure(exceptions::invalid_request_exception("Invalid null value of limit"));
}
auto l = val.view().validate_and_deserialize<int32_t>(*int32_type);
if (l <= 0) {
throw exceptions::invalid_request_exception("LIMIT must be strictly positive");
return bo::failure(exceptions::invalid_request_exception("LIMIT must be strictly positive"));
}
return l;
return bo::success(l);
} catch (const marshal_exception& e) {
throw exceptions::invalid_request_exception("Invalid limit value");
return bo::failure(exceptions::invalid_request_exception("Invalid limit value"));
} catch (const exceptions::invalid_request_exception& e) {
return bo::failure(e);
}
}
uint64_t select_statement::get_inner_loop_limit(const select_statement::get_limit_result& limit, bool is_aggregate)
{
if (!limit.has_value() || is_aggregate) {
return query::max_rows;
}
return limit.value();
}
bool select_statement::needs_post_query_ordering() const {
// We need post-query ordering only for queries with IN on the partition key and an ORDER BY.
return _restrictions->key_is_in_relation() && !_parameters->orderings().empty();
@@ -358,7 +369,8 @@ select_statement::do_execute(query_processor& qp,
validate_for_read(cl);
uint64_t limit = get_limit(options);
const auto parsed_limit = get_limit(options, _limit);
const uint64_t inner_loop_limit = get_inner_loop_limit(parsed_limit, _selection->is_aggregate());
auto now = gc_clock::now();
_stats.filtered_reads += _restrictions_need_filtering;
@@ -380,7 +392,7 @@ select_statement::do_execute(query_processor& qp,
std::move(slice),
max_result_size,
query::tombstone_limit(qp.proxy().get_tombstone_limit()),
query::row_limit(limit),
query::row_limit(inner_loop_limit),
query::partition_limit(query::max_partitions),
now,
tracing::make_trace_info(state.get_trace_state()),
@@ -393,14 +405,13 @@ select_statement::do_execute(query_processor& qp,
_stats.unpaged_select_queries(_ks_sel) += page_size <= 0;
// An aggregation query will never be paged for the user, but we always page it internally to avoid OOM.
// If we user provided a page_size we'll use that to page internally (because why not), otherwise we use our default
// Note that if there are some nodes in the cluster with a version less than 2.0, we can't use paging (CASSANDRA-6707).
// An aggregation query may not be paged for the user, but we always page it internally to avoid OOM.
// If the user provided a page_size we'll use that to page internally (because why not), otherwise we use our default
// Also note: all GROUP BY queries are considered aggregation.
const bool aggregate = _selection->is_aggregate() || has_group_by();
const bool nonpaged_filtering = _restrictions_need_filtering && page_size <= 0;
if (aggregate || nonpaged_filtering) {
page_size = internal_paging_size;
page_size = page_size <= 0 ? internal_paging_size : std::min(page_size, internal_paging_size);
}
auto key_ranges = _restrictions->get_partition_key_ranges(options);
@@ -438,7 +449,9 @@ select_statement::do_execute(query_processor& qp,
*command, key_ranges))) {
f = execute_without_checking_exception_message_non_aggregate_unpaged(qp, command, std::move(key_ranges), state, options, now);
} else {
f = execute_without_checking_exception_message_aggregate_or_paged(qp, command, std::move(key_ranges), state, options, now, page_size, aggregate, nonpaged_filtering);
f = execute_without_checking_exception_message_aggregate_or_paged(qp, command,
std::move(key_ranges), state, options, now, page_size, aggregate,
nonpaged_filtering, parsed_limit.has_value() ? parsed_limit.value() : query::max_rows);
}
if (!tablet_info.has_value()) {
@@ -454,7 +467,8 @@ select_statement::do_execute(query_processor& qp,
future<::shared_ptr<cql_transport::messages::result_message>>
select_statement::execute_without_checking_exception_message_aggregate_or_paged(query_processor& qp,
lw_shared_ptr<query::read_command> command, dht::partition_range_vector&& key_ranges, service::query_state& state,
const query_options& options, gc_clock::time_point now, int32_t page_size, bool aggregate, bool nonpaged_filtering) const {
const query_options& options, gc_clock::time_point now, int32_t page_size, bool aggregate, bool nonpaged_filtering,
uint64_t limit) const {
command->slice.options.set<query::partition_slice::option::allow_short_read>();
auto timeout_duration = get_timeout(state.get_client_state(), options);
auto timeout = db::timeout_clock::now() + timeout_duration;
@@ -462,8 +476,11 @@ select_statement::execute_without_checking_exception_message_aggregate_or_paged(
state, options, command, std::move(key_ranges), _restrictions_need_filtering ? _restrictions : nullptr);
if (aggregate || nonpaged_filtering) {
auto builder = cql3::selection::result_set_builder(*_selection, now, *_group_by_cell_indices);
coordinator_result<void> result_void = co_await utils::result_do_until([&p] {return p->is_exhausted();},
auto builder = cql3::selection::result_set_builder(*_selection, now, *_group_by_cell_indices, limit);
coordinator_result<void> result_void = co_await utils::result_do_until(
[&p, &builder, limit] {
return p->is_exhausted() || (limit < builder.result_set_size());
},
[&p, &builder, page_size, now, timeout] {
return p->fetch_page_result(builder, page_size, now, timeout);
}
@@ -586,7 +603,7 @@ indexed_table_select_statement::prepare_command_for_base_query(query_processor&
std::move(slice),
qp.proxy().get_max_result_size(slice),
query::tombstone_limit(qp.proxy().get_tombstone_limit()),
query::row_limit(get_limit(options)),
query::row_limit(get_inner_loop_limit(get_limit(options, _limit), _selection->is_aggregate())),
query::partition_limit(query::max_partitions),
now,
tracing::make_trace_info(state.get_trace_state()),
@@ -1368,7 +1385,8 @@ indexed_table_select_statement::find_index_partition_ranges(query_processor& qp,
using value_type = std::tuple<dht::partition_range_vector, lw_shared_ptr<const service::pager::paging_state>>;
auto now = gc_clock::now();
auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
return read_posting_list(qp, options, get_limit(options), state, now, timeout, false).then(utils::result_wrap(
const uint64_t limit = get_inner_loop_limit(get_limit(options, _limit), _selection->is_aggregate());
return read_posting_list(qp, options, limit, state, now, timeout, false).then(utils::result_wrap(
[this, &options] (::shared_ptr<cql_transport::messages::result_message::rows> rows) {
auto rs = cql3::untyped_result_set(rows);
dht::partition_range_vector partition_ranges;
@@ -1417,7 +1435,8 @@ indexed_table_select_statement::find_index_clustering_rows(query_processor& qp,
using value_type = std::tuple<std::vector<indexed_table_select_statement::primary_key>, lw_shared_ptr<const service::pager::paging_state>>;
auto now = gc_clock::now();
auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
return read_posting_list(qp, options, get_limit(options), state, now, timeout, true).then(utils::result_wrap(
const uint64_t limit = get_inner_loop_limit(get_limit(options, _limit), _selection->is_aggregate());
return read_posting_list(qp, options, limit, state, now, timeout, true).then(utils::result_wrap(
[this, &options] (::shared_ptr<cql_transport::messages::result_message::rows> rows) {
auto rs = cql3::untyped_result_set(rows);
@@ -1683,6 +1702,7 @@ schema_ptr mutation_fragments_select_statement::generate_output_schema(schema_pt
future<exceptions::coordinator_result<service::storage_proxy_coordinator_query_result>>
mutation_fragments_select_statement::do_query(
locator::effective_replication_map_ptr erm_keepalive,
locator::host_id this_node,
service::storage_proxy& sp,
schema_ptr schema,
@@ -1690,7 +1710,7 @@ mutation_fragments_select_statement::do_query(
dht::partition_range_vector partition_ranges,
db::consistency_level cl,
service::storage_proxy_coordinator_query_options optional_params) const {
auto res = co_await replica::mutation_dump::dump_mutations(sp.get_db(), schema, _underlying_schema, partition_ranges, *cmd, optional_params.timeout(sp));
auto res = co_await replica::mutation_dump::dump_mutations(sp.get_db(), std::move(erm_keepalive), schema, _underlying_schema, partition_ranges, *cmd, optional_params.timeout(sp));
service::replicas_per_token_range last_replicas;
if (this_node) {
last_replicas.emplace(dht::token_range::make_open_ended_both_sides(), std::vector<locator::host_id>{this_node});
@@ -1704,7 +1724,7 @@ mutation_fragments_select_statement::do_execute(query_processor& qp, service::qu
auto cl = options.get_consistency();
uint64_t limit = get_limit(options);
const uint64_t limit = get_inner_loop_limit(get_limit(options, _limit), _selection->is_aggregate());
auto now = gc_clock::now();
_stats.filtered_reads += _restrictions_need_filtering;
@@ -1762,7 +1782,7 @@ mutation_fragments_select_statement::do_execute(query_processor& qp, service::qu
if (!aggregate && !_restrictions_need_filtering && (page_size <= 0
|| !service::pager::query_pagers::may_need_paging(*_schema, page_size,
*command, key_ranges))) {
return do_query({}, qp.proxy(), _schema, command, std::move(key_ranges), cl,
return do_query(erm_keepalive, {}, qp.proxy(), _schema, command, std::move(key_ranges), cl,
{timeout, state.get_permit(), state.get_client_state(), state.get_trace_state(), {}, {}})
.then(wrap_result_to_error_message([this, erm_keepalive, now, slice = command->slice] (service::storage_proxy_coordinator_query_result&& qr) mutable {
cql3::selection::result_set_builder builder(*_selection, now);
@@ -1801,8 +1821,8 @@ mutation_fragments_select_statement::do_execute(query_processor& qp, service::qu
std::move(key_ranges),
_restrictions_need_filtering ? _restrictions : nullptr,
[this, erm_keepalive, this_node] (service::storage_proxy& sp, schema_ptr schema, lw_shared_ptr<query::read_command> cmd, dht::partition_range_vector partition_ranges,
db::consistency_level cl, service::storage_proxy_coordinator_query_options optional_params) {
return do_query(this_node, sp, std::move(schema), std::move(cmd), std::move(partition_ranges), cl, std::move(optional_params));
db::consistency_level cl, service::storage_proxy_coordinator_query_options optional_params) mutable {
return do_query(std::move(erm_keepalive), this_node, sp, std::move(schema), std::move(cmd), std::move(partition_ranges), cl, std::move(optional_params));
});
if (_selection->is_trivial() && !_restrictions_need_filtering && !_per_partition_limit) {
@@ -2561,7 +2581,9 @@ std::unique_ptr<cql3::statements::raw::select_statement> build_select_statement(
if (!where_clause.empty()) {
out << " WHERE " << where_clause << " ALLOW FILTERING";
}
return do_with_parser(out.str(), std::mem_fn(&cql3_parser::CqlParser::selectStatement));
// In general it's not a good idea to use the default dialect, but here the database is talking to
// itself, so we can hope the dialects are mutually compatible here.
return do_with_parser(out.str(), dialect{}, std::mem_fn(&cql3_parser::CqlParser::selectStatement));
}
}

View File

@@ -128,7 +128,7 @@ public:
future<::shared_ptr<cql_transport::messages::result_message>> execute_without_checking_exception_message_aggregate_or_paged(query_processor& qp,
lw_shared_ptr<query::read_command> cmd, dht::partition_range_vector&& partition_ranges, service::query_state& state,
const query_options& options, gc_clock::time_point now, int32_t page_size, bool aggregate, bool nonpaged_filtering) const;
const query_options& options, gc_clock::time_point now, int32_t page_size, bool aggregate, bool nonpaged_filtering, uint64_t limit) const;
struct primary_key {
@@ -152,13 +152,10 @@ public:
db::timeout_clock::duration get_timeout(const service::client_state& state, const query_options& options) const;
protected:
uint64_t do_get_limit(const query_options& options, const std::optional<expr::expression>& limit, const expr::unset_bind_variable_guard& unset_guard, uint64_t default_limit) const;
uint64_t get_limit(const query_options& options) const {
return do_get_limit(options, _limit, _limit_unset_guard, query::max_rows);
}
uint64_t get_per_partition_limit(const query_options& options) const {
return do_get_limit(options, _per_partition_limit, _per_partition_limit_unset_guard, query::partition_max_rows);
}
using get_limit_result = bo::result<uint64_t, exceptions::invalid_request_exception>;
get_limit_result get_limit(const query_options& options, const std::optional<expr::expression>& limit) const;
static uint64_t get_inner_loop_limit(const select_statement::get_limit_result& limit, bool is_aggregate);
bool needs_post_query_ordering() const;
virtual void update_stats_rows_read(int64_t rows_read) const {
_stats.rows_read += rows_read;
@@ -338,6 +335,7 @@ public:
private:
future<exceptions::coordinator_result<service::storage_proxy_coordinator_query_result>>
do_query(
locator::effective_replication_map_ptr erm_keepalive,
locator::host_id this_node,
service::storage_proxy& sp,
schema_ptr schema,

View File

@@ -20,7 +20,7 @@ void __sanitizer_finish_switch_fiber(void* fake_stack_save, const void** stack_b
namespace cql3::util {
static void do_with_parser_impl_impl(const sstring_view& cql, noncopyable_function<void (cql3_parser::CqlParser& parser)> f) {
static void do_with_parser_impl_impl(const sstring_view& cql, dialect d, noncopyable_function<void (cql3_parser::CqlParser& parser)> f) {
cql3_parser::CqlLexer::collector_type lexer_error_collector(cql);
cql3_parser::CqlParser::collector_type parser_error_collector(cql);
cql3_parser::CqlLexer::InputStreamType input{reinterpret_cast<const ANTLR_UINT8*>(cql.begin()), ANTLR_ENC_UTF8, static_cast<ANTLR_UINT32>(cql.size()), nullptr};
@@ -29,13 +29,14 @@ static void do_with_parser_impl_impl(const sstring_view& cql, noncopyable_functi
cql3_parser::CqlParser::TokenStreamType tstream(ANTLR_SIZE_HINT, lexer.get_tokSource());
cql3_parser::CqlParser parser{&tstream};
parser.set_error_listener(parser_error_collector);
parser.set_dialect(d);
f(parser);
}
#ifndef DEBUG
void do_with_parser_impl(const sstring_view& cql, noncopyable_function<void (cql3_parser::CqlParser& parser)> f) {
return do_with_parser_impl_impl(cql, std::move(f));
void do_with_parser_impl(const sstring_view& cql, dialect d, noncopyable_function<void (cql3_parser::CqlParser& parser)> f) {
return do_with_parser_impl_impl(cql, d, std::move(f));
}
#else
@@ -47,6 +48,7 @@ void do_with_parser_impl(const sstring_view& cql, noncopyable_function<void (cql
struct thunk_args {
// arguments to do_with_parser_impl_impl
const sstring_view& cql;
dialect d;
noncopyable_function<void (cql3_parser::CqlParser&)>&& func;
// Exceptions can't be returned from another stack, so store
// any thrown exception here
@@ -70,7 +72,7 @@ static void thunk(int p1, int p2) {
// Complete stack switch started in do_with_parser_impl()
__sanitizer_finish_switch_fiber(nullptr, &san.stack_bottom, &san.stack_size);
try {
do_with_parser_impl_impl(args->cql, std::move(args->func));
do_with_parser_impl_impl(args->cql, args->d, std::move(args->func));
} catch (...) {
args->ex = std::current_exception();
}
@@ -79,11 +81,12 @@ static void thunk(int p1, int p2) {
setcontext(&args->caller_stack);
};
void do_with_parser_impl(const sstring_view& cql, noncopyable_function<void (cql3_parser::CqlParser& parser)> f) {
void do_with_parser_impl(const sstring_view& cql, dialect d, noncopyable_function<void (cql3_parser::CqlParser& parser)> f) {
static constexpr size_t stack_size = 1 << 20;
static thread_local std::unique_ptr<char[]> stack = std::make_unique<char[]>(stack_size);
thunk_args args{
.cql = cql,
.d = d,
.func = std::move(f),
};
ucontext_t uc;
@@ -92,7 +95,7 @@ void do_with_parser_impl(const sstring_view& cql, noncopyable_function<void (cql
if (stack.get() <= (char*)&uc && (char*)&uc < stack.get() + stack_size) {
// We are already running on the large stack, so just call the
// parser directly.
return do_with_parser_impl_impl(cql, std::move(f));
return do_with_parser_impl_impl(cql, d, std::move(f));
}
uc.uc_stack.ss_sp = stack.get();
uc.uc_stack.ss_size = stack_size;
@@ -136,12 +139,12 @@ sstring relations_to_where_clause(const expr::expression& e) {
return boost::algorithm::join(expressions, " AND ");
}
expr::expression where_clause_to_relations(const sstring_view& where_clause) {
return do_with_parser(where_clause, std::mem_fn(&cql3_parser::CqlParser::whereClause));
expr::expression where_clause_to_relations(const sstring_view& where_clause, dialect d) {
return do_with_parser(where_clause, d, std::mem_fn(&cql3_parser::CqlParser::whereClause));
}
sstring rename_column_in_where_clause(const sstring_view& where_clause, column_identifier::raw from, column_identifier::raw to) {
std::vector<expr::expression> relations = boolean_factors(where_clause_to_relations(where_clause));
sstring rename_column_in_where_clause(const sstring_view& where_clause, column_identifier::raw from, column_identifier::raw to, dialect d) {
std::vector<expr::expression> relations = boolean_factors(where_clause_to_relations(where_clause, d));
std::vector<expr::expression> new_relations;
new_relations.reserve(relations.size());

View File

@@ -21,18 +21,19 @@
#include "cql3/CqlParser.hpp"
#include "cql3/error_collector.hh"
#include "cql3/statements/raw/select_statement.hh"
#include "cql3/dialect.hh"
namespace cql3 {
namespace util {
void do_with_parser_impl(const sstring_view& cql, noncopyable_function<void (cql3_parser::CqlParser& p)> func);
void do_with_parser_impl(const sstring_view& cql, dialect d, noncopyable_function<void (cql3_parser::CqlParser& p)> func);
template <typename Func, typename Result = cql3_parser::unwrap_uninitialized_t<std::invoke_result_t<Func, cql3_parser::CqlParser&>>>
Result do_with_parser(const sstring_view& cql, Func&& f) {
Result do_with_parser(const sstring_view& cql, dialect d, Func&& f) {
std::optional<Result> ret;
do_with_parser_impl(cql, [&] (cql3_parser::CqlParser& parser) {
do_with_parser_impl(cql, d, [&] (cql3_parser::CqlParser& parser) {
ret.emplace(f(parser));
});
return std::move(*ret);
@@ -40,9 +41,9 @@ Result do_with_parser(const sstring_view& cql, Func&& f) {
sstring relations_to_where_clause(const expr::expression& e);
expr::expression where_clause_to_relations(const sstring_view& where_clause);
expr::expression where_clause_to_relations(const sstring_view& where_clause, dialect d);
sstring rename_column_in_where_clause(const sstring_view& where_clause, column_identifier::raw from, column_identifier::raw to);
sstring rename_column_in_where_clause(const sstring_view& where_clause, column_identifier::raw from, column_identifier::raw to, dialect d);
/// build a CQL "select" statement with the desired parameters.
/// If select_all_columns==true, all columns are selected and the value of

View File

@@ -1100,7 +1100,12 @@ public:
write(out, uint64_t(0));
}
buf.remove_suffix(buf.size_bytes() - size);
auto to_remove = buf.size_bytes() - size;
// #20862 - we decrement usage counter based on buf.size() below.
// Since we are shrinking buffer here, we need to also decrement
// counter already
buf.remove_suffix(to_remove);
_segment_manager->totals.buffer_list_bytes -= to_remove;
// Build sector checksums.
auto id = net::hton(_desc.id);
@@ -3221,6 +3226,10 @@ uint64_t db::commitlog::get_total_size() const {
;
}
uint64_t db::commitlog::get_buffer_size() const {
return _segment_manager->totals.buffer_list_bytes;
}
uint64_t db::commitlog::get_completed_tasks() const {
return _segment_manager->totals.allocation_count;
}

View File

@@ -297,6 +297,7 @@ public:
future<> delete_segments(std::vector<sstring>) const;
uint64_t get_total_size() const;
uint64_t get_buffer_size() const;
uint64_t get_completed_tasks() const;
uint64_t get_flush_count() const;
uint64_t get_pending_tasks() const;

View File

@@ -99,6 +99,21 @@ error_injection_list_to_json(const std::vector<db::config::error_injection_at_st
return value_to_json("error_injection_list");
}
template <>
bool
config_from_string(std::string_view value) {
// boost::lexical_cast doesn't accept true/false, which are our output representations
// for bools. We want round-tripping, so we need to accept true/false. For backward
// compatibility, we also accept 1/0. #19791.
if (value == "true" || value == "1") {
return true;
} else if (value == "false" || value == "0") {
return false;
} else {
throw boost::bad_lexical_cast(typeid(std::string_view), typeid(bool));
}
}
template <>
const config_type config_type_for<bool> = config_type("bool", value_to_json<bool>);
@@ -177,7 +192,7 @@ struct convert<seastar::log_level> {
if (!convert<std::string>::decode(node, tmp)) {
return false;
}
rhs = boost::lexical_cast<seastar::log_level>(tmp);
rhs = utils::config_from_string<seastar::log_level>(tmp);
return true;
}
};
@@ -1057,6 +1072,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Make the system.config table UPDATEable.")
, enable_parallelized_aggregation(this, "enable_parallelized_aggregation", liveness::LiveUpdate, value_status::Used, true,
"Use on a new, parallel algorithm for performing aggregate queries.")
, cql_duplicate_bind_variable_names_refer_to_same_variable(this, "cql_duplicate_bind_variable_names_refer_to_same_variable", liveness::LiveUpdate, value_status::Used, true,
"A bind variable that appears twice in a CQL query refers to a single variable (if false, no name matching is performed).")
, alternator_port(this, "alternator_port", value_status::Used, 0, "Alternator API port.")
, alternator_https_port(this, "alternator_https_port", value_status::Used, 0, "Alternator API HTTPS port.")
, alternator_address(this, "alternator_address", value_status::Used, "0.0.0.0", "Alternator API listening address.")

View File

@@ -399,6 +399,7 @@ public:
named_value<bool> enable_optimized_reversed_reads;
named_value<bool> enable_cql_config_updates;
named_value<bool> enable_parallelized_aggregation;
named_value<bool> cql_duplicate_bind_variable_names_refer_to_same_variable;
named_value<uint16_t> alternator_port;
named_value<uint16_t> alternator_https_port;

View File

@@ -36,7 +36,7 @@ size_t quorum_for(const locator::effective_replication_map& erm) {
size_t local_quorum_for(const locator::effective_replication_map& erm, const sstring& dc) {
using namespace locator;
auto& rs = erm.get_replication_strategy();
const auto& rs = erm.get_replication_strategy();
if (rs.get_type() == replication_strategy_type::network_topology) {
const network_topology_strategy* nrs =
@@ -65,7 +65,7 @@ size_t block_for_local_serial(const locator::effective_replication_map& erm) {
size_t block_for_each_quorum(const locator::effective_replication_map& erm) {
using namespace locator;
auto& rs = erm.get_replication_strategy();
const auto& rs = erm.get_replication_strategy();
if (rs.get_type() == replication_strategy_type::network_topology) {
const network_topology_strategy* nrs =
@@ -260,7 +260,7 @@ filter_for_query(consistency_level cl,
size_t bf = block_for(erm, cl);
if (read_repair == read_repair_decision::DC_LOCAL) {
bf = std::max(block_for(erm, cl), local_count);
bf = std::max(bf, local_count);
}
if (bf >= live_endpoints.size()) { // RRD.DC_LOCAL + CL.LOCAL or CL.ALL
@@ -334,7 +334,13 @@ filter_for_query(consistency_level cl,
if (!old_node && ht_max - ht_min > 0.01) { // if there is old node or hit rates are close skip calculations
// local node is always first if present (see storage_proxy::get_endpoints_for_reading)
unsigned local_idx = erm.get_topology().is_me(epi[0].first) ? 0 : epi.size() + 1;
live_endpoints = boost::copy_range<inet_address_vector_replica_set>(miss_equalizing_combination(epi, local_idx, remaining_bf, bool(extra)));
auto weighted = boost::copy_range<inet_address_vector_replica_set>(miss_equalizing_combination(epi, local_idx, remaining_bf, bool(extra)));
// Workaround for https://github.com/scylladb/scylladb/issues/9285
auto last = std::adjacent_find(weighted.begin(), weighted.end());
if (last == weighted.end()) {
// No duplicates, so use the result based on hit rates
live_endpoints = std::move(weighted);
}
}
}

View File

@@ -20,7 +20,9 @@
#include "utils/sorting.hh"
static ::shared_ptr<cql3::cql3_type::raw> parse_raw(const sstring& str) {
return cql3::util::do_with_parser(str,
// In general it's a bad idea to use the default dialect, but type parsing
// should be dialect-agnostic.
return cql3::util::do_with_parser(str, cql3::dialect{},
[] (cql3_parser::CqlParser& parser) {
return parser.comparator_type(true);
});

View File

@@ -167,6 +167,7 @@ future<db::commitlog> hint_endpoint_manager::add_store() noexcept {
return io_check([name = _hints_dir.c_str()] { return recursive_touch_directory(name); }).then([this] () {
commitlog::config cfg;
cfg.sched_group = _shard_manager.local_db().commitlog()->active_config().sched_group;
cfg.commit_log_location = _hints_dir.c_str();
cfg.commitlog_segment_size_in_mb = resource_manager::hint_segment_size_in_mb;
cfg.commitlog_total_space_in_mb = resource_manager::max_hints_per_ep_size_mb;

View File

@@ -76,23 +76,6 @@ future<timespec> hint_sender::get_last_file_modification(const sstring& fname) {
});
}
future<> hint_sender::do_send_one_mutation(frozen_mutation_and_schema m, locator::effective_replication_map_ptr ermp, const inet_address_vector_replica_set& natural_endpoints) {
return futurize_invoke([this, m = std::move(m), ermp = std::move(ermp), &natural_endpoints] () mutable -> future<> {
// The fact that we send with CL::ALL in both cases below ensures that new hints are not going
// to be generated as a result of hints sending.
const auto& tm = ermp->get_token_metadata();
const auto maybe_addr = tm.get_endpoint_for_host_id_if_known(end_point_key());
if (maybe_addr && boost::range::find(natural_endpoints, *maybe_addr) != natural_endpoints.end()) {
manager_logger.trace("Sending directly to {}", end_point_key());
return _proxy.send_hint_to_endpoint(std::move(m), std::move(ermp), *maybe_addr);
} else {
manager_logger.trace("Endpoints set has changed and {} is no longer a replica. Mutating from scratch...", end_point_key());
return _proxy.send_hint_to_all_replicas(std::move(m));
}
});
}
bool hint_sender::can_send() noexcept {
if (stopping() && !draining()) {
return false;
@@ -274,11 +257,30 @@ void hint_sender::start() {
}
future<> hint_sender::send_one_mutation(frozen_mutation_and_schema m) {
auto erm = _db.find_column_family(m.s).get_effective_replication_map();
auto ermp = _db.find_column_family(m.s).get_effective_replication_map();
auto token = dht::get_token(*m.s, m.fm.key());
inet_address_vector_replica_set natural_endpoints = erm->get_natural_endpoints(std::move(token));
inet_address_vector_replica_set natural_endpoints = ermp->get_natural_endpoints(std::move(token));
return do_send_one_mutation(std::move(m), std::move(erm), std::move(natural_endpoints));
return futurize_invoke([this, m = std::move(m), ermp = std::move(ermp), &natural_endpoints] () mutable -> future<> {
// The fact that we send with CL::ALL in both cases below ensures that new hints are not going
// to be generated as a result of hints sending.
const auto& tm = ermp->get_token_metadata();
const auto maybe_addr = tm.get_endpoint_for_host_id_if_known(end_point_key());
if (maybe_addr && boost::range::find(natural_endpoints, *maybe_addr) != natural_endpoints.end() && !tm.is_leaving(end_point_key())) {
manager_logger.trace("Sending directly to {}", end_point_key());
return _proxy.send_hint_to_endpoint(std::move(m), std::move(ermp), *maybe_addr);
} else {
if (manager_logger.is_enabled(log_level::trace)) {
if (tm.is_leaving(end_point_key())) {
manager_logger.trace("The original target endpoint {} is leaving. Mutating from scratch...", end_point_key());
} else {
manager_logger.trace("Endpoints set has changed and {} is no longer a replica. Mutating from scratch...", end_point_key());
}
}
return _proxy.send_hint_to_all_replicas(std::move(m));
}
});
}
future<> hint_sender::send_one_hint(lw_shared_ptr<send_one_file_ctx> ctx_ptr, fragmented_temporary_buffer buf, db::replay_position rp, gc_clock::duration secs_since_file_mod, const sstring& fname) {

View File

@@ -233,18 +233,14 @@ private:
/// \return
const column_mapping& get_column_mapping(lw_shared_ptr<send_one_file_ctx> ctx_ptr, const frozen_mutation& fm, const hint_entry_reader& hr);
/// \brief Perform a single mutation send attempt.
/// \brief Send one mutation out.
///
/// If the original destination end point is still a replica for the given mutation - send the mutation directly
/// to it, otherwise execute the mutation "from scratch" with CL=ALL.
///
/// \param m mutation to send
/// \param ermp points to the effective_replication_map used to obtain \c natural_endpoints
/// \param natural_endpoints current replicas for the given mutation
/// \return future that resolves when the operation is complete
future<> do_send_one_mutation(frozen_mutation_and_schema m, locator::effective_replication_map_ptr ermp, const inet_address_vector_replica_set& natural_endpoints);
/// \brief Send one mutation out.
/// The mutation will be sent with CL=ALL semantics to all current replicas also in case if the original destination
/// is leaving the cluster - otherwise the hint might be applied only on the leaving node and streaming might
/// miss it.
///
/// \param m mutation to send
/// \return future that resolves when the mutation sending processing is complete.

View File

@@ -34,8 +34,6 @@
#include <span>
#include <unordered_map>
class fragmented_temporary_buffer;
namespace utils {
class directories;
} // namespace utils

View File

@@ -779,40 +779,35 @@ redact_columns_for_missing_features(mutation&& m, schema_features features) {
*/
future<table_schema_version> calculate_schema_digest(distributed<service::storage_proxy>& proxy, schema_features features, noncopyable_function<bool(std::string_view)> accept_keyspace)
{
auto map = [&proxy, features, accept_keyspace = std::move(accept_keyspace)] (sstring table) mutable -> future<std::vector<mutation>> {
using mutations_generator = coroutine::experimental::generator<mutation>;
auto map = [&proxy, features, accept_keyspace = std::move(accept_keyspace)] (sstring table) mutable -> mutations_generator {
auto& db = proxy.local().get_db();
auto rs = co_await db::system_keyspace::query_mutations(db, NAME, table);
auto s = db.local().find_schema(NAME, table);
std::vector<mutation> mutations;
for (auto&& p : rs->partitions()) {
auto mut = co_await unfreeze_gently(p.mut(), s);
auto partition_key = value_cast<sstring>(utf8_type->deserialize(mut.key().get_component(*s, 0)));
auto partition_key = value_cast<sstring>(utf8_type->deserialize(::partition_key(p.mut().key()).get_component(*s, 0)));
if (!accept_keyspace(partition_key)) {
continue;
}
mut = redact_columns_for_missing_features(std::move(mut), features);
mutations.emplace_back(std::move(mut));
}
co_return mutations;
};
auto reduce = [features] (auto& hash, auto&& mutations) {
for (const mutation& m : mutations) {
feed_hash_for_schema_digest(hash, m, features);
auto mut = co_await unfreeze_gently(p.mut(), s);
co_yield redact_columns_for_missing_features(std::move(mut), features);
}
};
auto hash = md5_hasher();
auto tables = all_table_names(features);
{
for (auto& table: tables) {
auto mutations = co_await map(table);
if (diff_logger.is_enabled(logging::log_level::trace)) {
for (const mutation& m : mutations) {
auto gen_mutations = map(table);
while (auto mut_opt = co_await gen_mutations()) {
auto& m = *mut_opt;
feed_hash_for_schema_digest(hash, m, features);
if (diff_logger.is_enabled(logging::log_level::trace)) {
md5_hasher h;
feed_hash_for_schema_digest(h, m, features);
diff_logger.trace("Digest {} for {}, compacted={}", h.finalize(), m, compact_for_schema_digest(m));
}
}
reduce(hash, mutations);
}
co_return utils::UUID_gen::get_name_UUID(hash.finalize());
}
@@ -1948,7 +1943,9 @@ static shared_ptr<cql3::functions::user_aggregate> create_aggregate(replica::dat
bytes_opt initcond = std::nullopt;
if (initcond_str) {
auto expr = cql3::util::do_with_parser(*initcond_str, std::mem_fn(&cql3_parser::CqlParser::term));
// In general using the default dialect is wrong, but here the database is communicating with itself,
// not the user, so any dialect should work.
auto expr = cql3::util::do_with_parser(*initcond_str, cql3::dialect{}, std::mem_fn(&cql3_parser::CqlParser::term));
auto dummy_ident = ::make_shared<cql3::column_identifier>("", true);
auto column_spec = make_lw_shared<cql3::column_specification>("", "", dummy_ident, state_type);
auto raw = cql3::expr::evaluate(prepare_expression(expr, db.as_data_dictionary(), "", nullptr, {column_spec}), cql3::query_options::DEFAULT);

View File

@@ -756,8 +756,8 @@ system_distributed_keyspace::get_cdc_desc_v1_timestamps(context ctx) {
co_return res;
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_levels() const {
return qos::get_service_levels(_qp, NAME, SERVICE_LEVELS, db::consistency_level::ONE);
future<qos::service_levels_info> system_distributed_keyspace::get_service_levels(qos::query_context ctx) const {
return qos::get_service_levels(_qp, NAME, SERVICE_LEVELS, db::consistency_level::ONE, ctx);
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_level(sstring service_level_name) const {

View File

@@ -112,7 +112,7 @@ public:
future<db_clock::time_point> cdc_current_generation_timestamp(context);
future<qos::service_levels_info> get_service_levels() const;
future<qos::service_levels_info> get_service_levels(qos::query_context ctx) const;
future<qos::service_levels_info> get_service_level(sstring service_level_name) const;
future<> set_service_level(sstring service_level_name, qos::service_level_options slo) const;
future<> drop_service_level(sstring service_level_name) const;

View File

@@ -1673,7 +1673,22 @@ get_view_natural_endpoint(
return {};
}
auto replica = view_endpoints[base_it - base_endpoints.begin()];
return view_topology.get_node(replica).endpoint();
// https://github.com/scylladb/scylladb/issues/19439
// With tablets, a node being replaced might transition to "left" state
// but still be kept as a replica. In such case, the IP of the replaced
// node will be lost and `endpoint()` will return an empty IP here.
// As of writing this, storage proxy was not migrated to host IDs yet
// (#6403) and hints are not prepared to handle nodes that are left
// but are still replicas. Therefore, there is no other sensible option
// right now but to give up attempt to send the update or write a hint
// to the paired, permanently down replica.
const auto ep = view_topology.get_node(replica).endpoint();
if (ep != gms::inet_address{}) {
return ep;
} else {
return std::nullopt;
}
}
static future<> apply_to_remote_endpoints(service::storage_proxy& proxy, locator::effective_replication_map_ptr ermp,
@@ -2210,11 +2225,11 @@ view_builder::view_build_statuses(sstring keyspace, sstring view_name) const {
future<> view_builder::add_new_view(view_ptr view, build_step& step) {
vlogger.info0("Building view {}.{}, starting at token {}", view->ks_name(), view->cf_name(), step.current_token());
if (this_shard_id() == 0) {
co_await _sys_dist_ks.start_view_build(view->ks_name(), view->cf_name());
}
co_await _sys_ks.register_view_for_building(view->ks_name(), view->cf_name(), step.current_token());
step.build_status.emplace(step.build_status.begin(), view_build_status{view, step.current_token(), std::nullopt});
auto f = this_shard_id() == 0 ? _sys_dist_ks.start_view_build(view->ks_name(), view->cf_name()) : make_ready_future<>();
return when_all_succeed(
std::move(f),
_sys_ks.register_view_for_building(view->ks_name(), view->cf_name(), step.current_token())).discard_result();
}
static future<> flush_base(lw_shared_ptr<replica::column_family> base, abort_source& as) {
@@ -2541,6 +2556,12 @@ public:
_step.build_status.pop_back();
}
}
// before going back to the minimum token, advance current_key to the end
// and check for built views in that range.
_step.current_key = {_step.prange.end().value_or(dht::ring_position::max()).value().token(), partition_key::make_empty()};
check_for_built_views();
_step.current_key = {dht::minimum_token(), partition_key::make_empty()};
for (auto&& vs : _step.build_status) {
vs.next_token = dht::minimum_token();
@@ -2705,16 +2726,16 @@ future<> view_builder::register_staging_sstable(sstables::shared_sstable sst, lw
return _vug.register_staging_sstable(std::move(sst), std::move(table));
}
future<bool> check_needs_view_update_path(view_builder& vb, const locator::token_metadata& tm, const replica::table& t, streaming::stream_reason reason) {
future<bool> check_needs_view_update_path(view_builder& vb, locator::token_metadata_ptr tmptr, const replica::table& t, streaming::stream_reason reason) {
if (is_internal_keyspace(t.schema()->ks_name())) {
return make_ready_future<bool>(false);
}
if (reason == streaming::stream_reason::repair && !t.views().empty()) {
return make_ready_future<bool>(true);
}
return do_with(t.views(), [&vb, &tm] (auto& views) {
return do_with(std::move(tmptr), t.views(), [&vb] (locator::token_metadata_ptr& tmptr, auto& views) {
return map_reduce(views,
[&vb, &tm] (const view_ptr& view) { return vb.check_view_build_ongoing(tm, view->ks_name(), view->cf_name()); },
[&] (const view_ptr& view) { return vb.check_view_build_ongoing(*tmptr, view->ks_name(), view->cf_name()); },
false,
std::logical_or<bool>());
});

View File

@@ -10,20 +10,17 @@
#include <seastar/core/future.hh>
#include "streaming/stream_reason.hh"
#include "locator/token_metadata_fwd.hh"
#include "seastarx.hh"
namespace replica {
class table;
}
namespace locator {
class token_metadata;
}
namespace db::view {
class view_builder;
future<bool> check_needs_view_update_path(view_builder& vb, const locator::token_metadata& tm, const replica::table& t,
future<bool> check_needs_view_update_path(view_builder& vb, locator::token_metadata_ptr tmptr, const replica::table& t,
streaming::stream_reason reason);
}

View File

@@ -40,6 +40,25 @@ if __name__ == '__main__':
help='enable compress on systemd-coredump')
args = parser.parse_args()
# Seems like specific version of systemd pacakge on RHEL9 has a bug on
# SELinux configuration, it introduced "systemd-container-coredump" module
# to provide rule for systemd-coredump but not enabled by default.
# We have to manually load it, otherwise it causes permission errror.
# (#19325)
if is_redhat_variant() and distro.major_version() == '9':
if not shutil.which('getenforce'):
pkg_install('libselinux-utils')
if not shutil.which('semodule'):
pkg_install('policycoreutils')
enforce = out('getenforce')
if enforce != "Disabled":
if os.path.exists('/usr/share/selinux/packages/targeted/systemd-container-coredump.pp.bz2'):
modules = out('semodule -l')
match = re.match(r'^systemd-container-coredump$', modules, re.MULTILINE)
if not match:
run('semodule -v -i /usr/share/selinux/packages/targeted/systemd-container-coredump.pp.bz2', shell=True, check=True)
run('semodule -v -e systemd-container-coredump', shell=True, check=True)
# abrt-ccpp.service needs to stop before enabling systemd-coredump,
# since both will try to install kernel coredump handler
# (This will only requires for abrt < 2.14)

View File

@@ -16,6 +16,7 @@ import sys
import stat
import logging
import pyudev
import psutil
from pathlib import Path
from scylla_util import *
from subprocess import run, SubprocessError
@@ -92,6 +93,15 @@ class UdevInfo:
def id_links(self):
return [l for l in self.device.device_links if l.startswith('/dev/disk/by-id')]
def is_selinux_enabled():
partitions = psutil.disk_partitions(all=True)
for p in partitions:
if p.fstype == 'selinuxfs':
if os.path.exists(p.mountpoint + '/enforce'):
return True
return False
if __name__ == '__main__':
if os.getuid() > 0:
print('Requires root permission.')
@@ -325,9 +335,51 @@ WantedBy=local-fs.target
os.chown(dpath, uid, gid)
if is_debian_variant():
if not shutil.which('update-initramfs'):
pkg_install('initramfs-tools')
run('update-initramfs -u', shell=True, check=True)
if not udev_info.uuid_link:
LOGGER.error(f'Error detected, dumping udev env parameters on {fsdev}')
udev_info.verify()
udev_info.dump_variables()
if is_redhat_variant():
offline_skip_relabel = False
has_semanage = True
if not shutil.which('matchpathcon'):
offline_skip_relabel = True
pkg_install('libselinux-utils', offline_exit=False)
if not shutil.which('restorecon'):
offline_skip_relabel = True
pkg_install('policycoreutils', offline_exit=False)
if not shutil.which('semanage'):
if is_offline():
has_semanage = False
else:
pkg_install('policycoreutils-python-utils')
if is_offline() and offline_skip_relabel:
print('Unable to find SELinux tools, skip relabeling.')
sys.exit(0)
selinux_context = out('matchpathcon -n /var/lib/systemd/coredump')
selinux_type = selinux_context.split(':')[2]
if has_semanage:
run(f'semanage fcontext -a -t {selinux_type} "{root}/coredump(/.*)?"', shell=True, check=True)
else:
# without semanage, we need to update file_contexts directly,
# and compile it to binary format (.bin file)
try:
with open('/etc/selinux/targeted/contexts/files/file_contexts.local', 'a') as f:
spacer = ''
if f.tell() != 0:
spacer = '\n'
f.write(f'{spacer}{root}/coredump(/.*)? {selinux_context}\n')
except FileNotFoundError as e:
print('Unable to find SELinux policy files, skip relabeling.')
sys.exit(0)
run('sefcontext_compile /etc/selinux/targeted/contexts/files/file_contexts.local', shell=True, check=True)
if is_selinux_enabled():
run(f'restorecon -F -v -R {root}', shell=True, check=True)
else:
Path('/.autorelabel').touch(exist_ok=True)

View File

@@ -293,13 +293,14 @@ def swap_exists():
swaps = out('swapon --noheadings --raw')
return True if swaps != '' else False
def pkg_error_exit(pkg):
def pkg_error_exit(pkg, offline_exit=True):
print(f'Package "{pkg}" required.')
sys.exit(1)
if offline_exit:
sys.exit(1)
def yum_install(pkg):
def yum_install(pkg, offline_exit=True):
if is_offline():
pkg_error_exit(pkg)
pkg_error_exit(pkg, offline_exit)
return run(f'yum install -y {pkg}', shell=True, check=True)
def apt_is_updated():
@@ -313,9 +314,9 @@ def apt_is_updated():
APT_GET_UPDATE_NUM_RETRY = 30
APT_GET_UPDATE_RETRY_INTERVAL = 10
def apt_install(pkg):
def apt_install(pkg, offline_exit=True):
if is_offline():
pkg_error_exit(pkg)
pkg_error_exit(pkg, offline_exit)
# The lock for update and install/remove are different, and
# DPkg::Lock::Timeout will only wait for install/remove lock.
@@ -344,14 +345,14 @@ def apt_install(pkg):
apt_env['DEBIAN_FRONTEND'] = 'noninteractive'
return run(f'apt-get -o DPkg::Lock::Timeout=300 install -y {pkg}', shell=True, check=True, env=apt_env)
def emerge_install(pkg):
def emerge_install(pkg, offline_exit=True):
if is_offline():
pkg_error_exit(pkg)
pkg_error_exit(pkg, offline_exit)
return run(f'emerge -uq {pkg}', shell=True, check=True)
def zypper_install(pkg):
def zypper_install(pkg, offline_exit=True):
if is_offline():
pkg_error_exit(pkg)
pkg_error_exit(pkg, offline_exit)
return run(f'zypper install -y {pkg}', shell=True, check=True)
def pkg_distro():
@@ -364,18 +365,20 @@ def pkg_distro():
else:
return distro.id()
pkg_xlat = {'cpupowerutils': {'debian': 'linux-cpupower', 'gentoo':'sys-power/cpupower', 'arch':'cpupower', 'suse': 'cpupower'}}
def pkg_install(pkg):
pkg_xlat = {'cpupowerutils': {'debian': 'linux-cpupower', 'gentoo':'sys-power/cpupower', 'arch':'cpupower', 'suse': 'cpupower'},
'policycoreutils-python-utils': {'amzn2': 'policycoreutils-python'}}
def pkg_install(pkg, offline_exit=True):
if pkg in pkg_xlat and pkg_distro() in pkg_xlat[pkg]:
pkg = pkg_xlat[pkg][pkg_distro()]
if is_redhat_variant():
return yum_install(pkg)
return yum_install(pkg, offline_exit)
elif is_debian_variant():
return apt_install(pkg)
return apt_install(pkg, offline_exit)
elif is_gentoo():
return emerge_install(pkg)
return emerge_install(pkg, offline_exit)
elif is_suse_variant():
return zypper_install(pkg)
return zypper_install(pkg, offline_exit)
else:
pkg_error_exit(pkg)

View File

@@ -1 +1 @@
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts"
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --no-collector.hwmon"

View File

@@ -20,7 +20,6 @@ ExecStart=/usr/bin/scylla $SCYLLA_ARGS $SEASTAR_IO $DEV_MODE $CPUSET $MEM_CONF
ExecStopPost=+/opt/scylladb/scripts/scylla_stop
TimeoutStartSec=1y
TimeoutStopSec=900
KillMode=process
Restart=on-abnormal
User=scylla
OOMScoreAdjust=-950

View File

@@ -158,33 +158,6 @@ Obsoletes: scylla-server < 1.1
%description conf
This package contains the main scylla configuration file.
# we need to refuse upgrade if current scylla < 1.7.3 && commitlog remains
%pretrans conf
ver=$(rpm -qi scylla-server | grep Version | awk '{print $3}')
if [ -n "$ver" ]; then
ver_fmt=$(echo $ver | awk -F. '{printf "%d%02d%02d", $1,$2,$3}')
if [ $ver_fmt -lt 10703 ]; then
# for <scylla-1.2
if [ ! -f /opt/scylladb/lib/scylla/scylla_config_get.py ]; then
echo
echo "Error: Upgrading from scylla-$ver to scylla-%{version} is not supported."
echo "Please upgrade to scylla-1.7.3 or later, before upgrade to %{version}."
echo
exit 1
fi
commitlog_directory=$(/opt/scylladb/lib/scylla/scylla_config_get.py -g commitlog_directory)
commitlog_files=$(ls $commitlog_directory | wc -l)
if [ $commitlog_files -ne 0 ]; then
echo
echo "Error: Upgrading from scylla-$ver to scylla-%{version} is not supported when commitlog is not clean."
echo "Please upgrade to scylla-1.7.3 or later, before upgrade to %{version}."
echo "Also make sure $commitlog_directory is empty."
echo
exit 1
fi
fi
fi
%files conf
%defattr(-,root,root)
%attr(0755,root,root) %dir %{_sysconfdir}/scylla

View File

@@ -1,6 +1,10 @@
import os
from sphinx.directives.other import Include
from sphinx.util import logging
from docutils.parsers.rst import directives
LOGGER = logging.getLogger(__name__)
class IncludeFlagDirective(Include):
option_spec = Include.option_spec.copy()
option_spec['base_path'] = directives.unchanged
@@ -8,11 +12,18 @@ class IncludeFlagDirective(Include):
def run(self):
env = self.state.document.settings.env
base_path = self.options.get('base_path', '_common')
file_path = self.arguments[0]
if env.app.tags.has('enterprise'):
self.arguments[0] = base_path + "_enterprise/" + self.arguments[0]
enterprise_path = os.path.join(base_path + "_enterprise", file_path)
_, enterprise_abs_path = env.relfn2path(enterprise_path)
if os.path.exists(enterprise_abs_path):
self.arguments[0] = enterprise_path
else:
LOGGER.info(f"Enterprise content not found: Skipping inclusion of {file_path}")
return []
else:
self.arguments[0] = base_path + "/" + self.arguments[0]
self.arguments[0] = os.path.join(base_path, file_path)
return super().run()
def setup(app):

View File

@@ -123,10 +123,6 @@ the secret key is the `salted_hash`, i.e., the secret key can be found by
<!--- REMOVE IN FUTURE VERSIONS - Remove the note below in version 6.1 -->
(Note: If you upgraded from version 5.4 to version 6.0 without
[enabling consistent topology updates](../upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology.rst),
the table name is `system_auth.roles`.)
By default, authorization is not enforced at all. It can be turned on
by providing an entry in Scylla configuration:
`alternator_enforce_authorization: true`

View File

@@ -1,3 +0,0 @@
If you upgraded from 5.4, you must perform a manual action in order to enable
consistent topology changes.
See :doc:`the guide for enabling consistent topology changes</upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology>` for more details.

View File

@@ -60,9 +60,8 @@ In summary, Raft makes schema changes safe, but it requires that a quorum of nod
Verifying that the Raft upgrade procedure finished successfully
========================================================================
You may need to perform the following procedure on upgrade if you explicitly
disabled the Raft-based schema changes feature in the previous ScyllaDB
version. Please consult the upgrade guide.
You may need to perform the following procedure as part of
the :ref:`manual recovery procedure <recovery-procedure>`.
The Raft upgrade procedure requires **full cluster availability** to correctly setup the Raft algorithm; after the setup finishes, Raft can proceed with only a majority of nodes, but this initial setup is an exception.
An unlucky event, such as a hardware failure, may cause one of your nodes to fail. If this happens before the Raft upgrade procedure finishes, the procedure will get stuck and your intervention will be required.
@@ -173,8 +172,6 @@ gossip-based topology.
The feature is automatically enabled in new clusters.
.. scylladb_include_flag:: consistent-topology-with-raft-upgrade-info.rst
Verifying that Raft is Enabled
----------------------------------

View File

@@ -0,0 +1,3 @@
By default, a keyspace is created with tablets enabled. The ``tablets`` option
is used to opt out a keyspace from tablets-based distribution; see :ref:`Enabling Tablets <tablets-enable-tablets>`
for details.

View File

@@ -62,7 +62,7 @@ The following options are available for all compaction strategies.
=====
``tombstone_compaction_interval`` (default: 86400s (1 day))
An SSTable that is suitable for single SSTable compaction, according to tombstone_threshold will not be compacted if it is newer than tombstone_compaction_interval.
*tombstone_compaction_interval* is lower-bound for when a new tombstone compaction can start. If an SSTable was compacted at a time *X*, the earliest time it will be considered for tombstone compaction again is *X + tombstone_compaction_interval*. This does not guarantee that sstables will be considered for compaction immediately after tombstone_compaction_interval time has elapsed after the last compaction.
=====

View File

@@ -377,6 +377,20 @@ FINALFUNC final_fct
INITCOND (0, 0);
```
### Behavior of bind variables references with the same name
If a bind variable is referred to twice (example: `WHERE aa = :var AND bb = :var`; `:var`
is referenced twice), ScyllaDB and Cassandra treat it differently:
- Cassandra ignores the double reference and treats the two as two separate variables. They
can have different types, and occupy two slots in the bind variable metadata (used by
drivers when the user provides a bind variable tuple rather than a map)
- ScyllaDB treats the two references as referring to the same variable. The two references
must have the same type, and occupy one slot in the bind variable metadata.
ScyllaDB can revert to the Cassandra treatment by setting the configuration item
`cql_duplicate_bind_variable_names_refer_to_same_variable` to `false`.
### Lists elements for filtering
Subscripting a list in a WHERE clause is supported as are maps.

View File

@@ -116,7 +116,7 @@ name kind mandatory default description
details below).
``durable_writes`` *simple* no true Whether to use the commit log for updates on this keyspace
(disable this option at your own risk!).
``tablets`` *map* no Enables or disables tablets for the keyspace (see :ref:`tablets<tablets>`)
``tablets`` *map* no Enables or disables tablets for the keyspace (see :ref:`tablets <tablets>`)
=================== ========== =========== ========= ===================================================================
The ``replication`` property is mandatory and must at least contains the ``'class'`` sub-option, which defines the
@@ -232,9 +232,7 @@ sub-option type description
``'initial'`` int The number of tablets to start with
===================================== ====== =============================================
By default, a keyspace is created with tablets enabled. The ``tablets`` option
is used to opt out a keyspace from tablets-based distribution; see :ref:`Enabling Tablets <tablets-enable-tablets>`
for details.
.. scylladb_include_flag:: tablets-default.rst
A good rule of thumb to calculate initial tablets is to divide the expected total storage used
by tables in this keyspace by (``replication_factor`` * 5GB). For example, if you expect a 30TB
@@ -759,10 +757,8 @@ available:
========================= =============== =============================================================================
Option Default Description
========================= =============== =============================================================================
``sstable_compression`` LZ4Compressor The compression algorithm to use. Default compressors are
LZ4Compressor, SnappyCompressor, and DeflateCompressor.
A custom compressor can be provided by specifying the full class
name as a “string constant”:#constants.
``sstable_compression`` LZ4Compressor The compression algorithm to use. Available compressors are
LZ4Compressor, SnappyCompressor, DeflateCompressor, and ZstdCompressor.
``chunk_length_in_kb`` 4 On disk SSTables are compressed by block (to allow random reads). This
defines the size (in KB) of the block. Bigger values may improve the
compression rate, but increases the minimum size of data to be read from disk

View File

@@ -48,6 +48,13 @@ to calculate the proper value is:
$ docker run --name some-scylla --hostname some-scylla -d scylladb/scylla
```
If you're on macOS and plan to start a multi-node cluster (3 nodes or more), start ScyllaDB with
`reactor-backend=epoll` to override the default `linux-aio` reactor backend:
```console
$ docker run --name some-scylla --hostname some-scylla -d scylladb/scylla --reactor-backend=epoll
```
### Run `nodetool` utility
```console
@@ -75,6 +82,11 @@ cqlsh>
```console
$ docker run --name some-scylla2 --hostname some-scylla2 -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla)"
```
If you're on macOS, ensure to add the `reactor-backend=epoll` option when adding new nodes:
```console
$ docker run --name some-scylla2 --hostname some-scylla2 -d scylladb/scylla --reactor-backend=epoll --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' some-scylla)"
```
#### Make a cluster with Docker Compose

View File

@@ -1,14 +1,14 @@
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_ on other x86_64 or aarch64 platforms, without any guarantees.
+----------------------------+--------------------+-------+---------------+
| Linux Distributions |Ubuntu | Debian| Rocky / |
| | | | RHEL |
| Linux Distributions |Ubuntu | Debian|Rocky / CentOS |
| | | |/ RHEL |
+----------------------------+------+------+------+-------+-------+-------+
| ScyllaDB Version / Version |20.04 |22.04 |24.04 | 11 | 8 | 9 |
+============================+======+======+======+=======+=======+=======+
| 6.0 | |v| | |v| | |v| | |v| | |v| | |v| |
| 6.1 | |v| | |v| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+------+-------+-------+-------+
| 5.4 | |v| | |v| | |x| | |v| | |v| | |v| |
| 6.0 | |v| | |v| | |v| | |v| | |v| | |v| |
+----------------------------+------+------+------+-------+-------+-------+
* The recommended OS for ScyllaDB Open Source is Ubuntu 22.04.
@@ -18,4 +18,4 @@ Supported Architecture
-----------------------------
ScyllaDB Open Source supports x86_64 for all versions and AArch64 starting from ScyllaDB 4.6 and nightly build.
In particular, aarch64 support includes AWS EC2 Graviton.
In particular, aarch64 support includes AWS EC2 Graviton.

View File

@@ -0,0 +1,54 @@
Configure and Run ScyllaDB
-------------------------------
#. Configure the following parameters in the ``/etc/scylla/scylla.yaml`` configuration file.
* ``cluster_name`` - The name of the cluster. All the nodes in the cluster must have the same
cluster name configured.
* ``seeds`` - The IP address of the first node. Other nodes will use it as the first contact
point to discover the cluster topology when joining the cluster.
* ``listen_address`` - The IP address that ScyllaDB uses to connect to other nodes in the cluster.
* ``rpc_address`` - The IP address of the interface for CQL client connections.
#. Run the ``scylla_setup`` script to tune the system settings and determine the optimal configuration.
.. code-block:: console
sudo scylla_setup
* The script invokes a set of :ref:`scripts <system-configuration-scripts>` to configure several operating system settings; for example, it sets
RAID0 and XFS filesystem.
* The script runs a short (up to a few minutes) benchmark on your storage and generates the ``/etc/scylla.d/io.conf``
configuration file. When the file is ready, you can start ScyllaDB. ScyllaDB will not run without XFS
or ``io.conf`` file.
* You can bypass this check by running ScyllaDB in :doc:`developer mode </getting-started/installation-common/dev-mod>`.
We recommend against enabling developer mode in production environments to ensure ScyllaDB's maximum performance.
#. Run ScyllaDB as a service (if not already running).
.. code-block:: console
sudo systemctl start scylla-server
Now you can start using ScyllaDB. Here are some tools you may find useful.
Run nodetool:
.. code-block:: console
nodetool status
Run cqlsh:
.. code-block:: console
cqlsh
Run cassandra-stress:
.. code-block:: console
cassandra-stress write -mode cql3 native

View File

@@ -175,7 +175,7 @@ Recommended instances types are `n1-highmem <https://cloud.google.com/compute/do
* - n2-highmem-32
- 32
- 256
- 6,000
- 9,000
* - n2-highmem-48
- 48
- 384

View File

@@ -154,59 +154,7 @@ Install ScyllaDB
sudo yum install scylla-5.2.3
Configure and Run ScyllaDB
-------------------------------
#. Configure the following parameters in the ``/etc/scylla/scylla.yaml`` configuration file.
* ``cluster_name`` - The name of the cluster. All the nodes in the cluster must have the same
cluster name configured.
* ``seeds`` - The IP address of the first node. Other nodes will use it as the first contact
point to discover the cluster topology when joining the cluster.
* ``listen_address`` - The IP address that ScyllaDB uses to connect to other nodes in the cluster.
* ``rpc_address`` - The IP address of the interface for CQL client connections.
#. Run the ``scylla_setup`` script to tune the system settings and determine the optimal configuration.
.. code-block:: console
sudo scylla_setup
* The script invokes a set of :ref:`scripts <system-configuration-scripts>` to configure several operating system settings; for example, it sets
RAID0 and XFS filesystem.
* The script runs a short (up to a few minutes) benchmark on your storage and generates the ``/etc/scylla.d/io.conf``
configuration file. When the file is ready, you can start ScyllaDB. ScyllaDB will not run without XFS
or ``io.conf`` file.
* You can bypass this check by running ScyllaDB in :doc:`developer mode </getting-started/installation-common/dev-mod>`.
We recommend against enabling developer mode in production environments to ensure ScyllaDB's maximum performance.
#. Run ScyllaDB as a service (if not already running).
.. code-block:: console
sudo systemctl start scylla-server
Now you can start using ScyllaDB. Here are some tools you may find useful.
Run nodetool:
.. code-block:: console
nodetool status
Run cqlsh:
.. code-block:: console
cqlsh
Run cassandra-stress:
.. code-block:: console
cassandra-stress write -mode cql3 native
.. include:: /getting-started/_common/setup-after-install.rst
Next Steps
------------

View File

@@ -12,7 +12,7 @@ Prerequisites
Ensure that your platform is supported by the ScyllaDB version you want to install.
See :doc:`OS Support by Platform and Version </getting-started/os-support/>`.
Installing ScyllaDB with Web Installer
Install ScyllaDB with Web Installer
---------------------------------------
To install ScyllaDB with Web Installer, run:
@@ -40,22 +40,24 @@ options to install a different version or ScyllaDB Enterprise:
You can run the command with the ``-h`` or ``--help`` flag to print information about the script.
Examples
---------
===========
Installing ScyllaDB Open Source 4.6.1:
Installing ScyllaDB Open Source 6.0.1:
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version 4.6.1
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version 6.0.1
Installing the latest patch release for ScyllaDB Open Source 4.6:
Installing the latest patch release for ScyllaDB Open Source 6.0:
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version 4.6
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-version 6.0
Installing ScyllaDB Enterprise 2021.1:
Installing ScyllaDB Enterprise 2024.1:
.. code:: console
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-product scylla-enterprise --scylla-version 2021.1
curl -sSf get.scylladb.com/server | sudo bash -s -- --scylla-product scylla-enterprise --scylla-version 2024.1
.. include:: /getting-started/_common/setup-after-install.rst

View File

@@ -1,8 +1,3 @@
.. |SCYLLADB_VERSION| replace:: 5.2
.. update the version folder URL below (variables won't work):
https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.2/
====================================================
Install ScyllaDB Without root Privileges
====================================================
@@ -24,14 +19,17 @@ Note that if you're on CentOS 7, only root offline installation is supported.
Download and Install
-----------------------
#. Download the latest tar.gz file for ScyllaDB |SCYLLADB_VERSION| (x86 or ARM) from https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.2/.
#. Download the latest tar.gz file for ScyllaDB version (x86 or ARM) from ``https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-<version>/``.
Example for version 6.1: https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-6.1/
#. Uncompress the downloaded package.
The following example shows the package for ScyllaDB 5.2.4 (x86):
The following example shows the package for ScyllaDB 6.1.1 (x86):
.. code:: console
tar xvfz scylla-unified-5.2.4-0.20230623.cebbf6c5df2b.x86_64.tar.gz
tar xvfz scylla-unified-6.1.1-0.20240814.8d90b817660a.x86_64.tar.gz
#. Install OpenJDK 8 or 11.

View File

@@ -430,8 +430,8 @@ The content is dumped in JSON, using the following schema:
"estimated_tombstone_drop_time": $STREAMING_HISTOGRAM,
"sstable_level": Uint,
"repaired_at": Uint64,
"min_column_names": [Uint, ...],
"max_column_names": [Uint, ...],
"min_column_names": [String, ...],
"max_column_names": [String, ...],
"has_legacy_counter_shards": Bool,
"columns_count": Int64, // >= MC only
"rows_count": Int64, // >= MC only

View File

@@ -1,8 +1,17 @@
Nodetool rebuild
================
**rebuild** ``[<src-dc-name>]`` - This command rebuilds a node's data by streaming data from other nodes in the cluster (similarly to bootstrap).
Rebuild operates on multiple nodes in a ScyllaDB cluster. It streams data from a single source replica when rebuilding a token range. When executing the command, ScyllaDB first figures out which ranges the local node (the one we want to rebuild) is responsible for. Then which node in the cluster contains the same ranges. Finally, ScyllaDB streams the data to the local node.
**rebuild** ``[[--force] <source-dc-name>]`` - This command rebuilds a node's data by streaming data from other nodes in the cluster (similarly to bootstrap).
When executing the command, ScyllaDB first figures out which ranges the local node (the one we want to rebuild) is responsible for.
Then which node in the cluster contains the same ranges.
If ``source-dc-name`` is provided, ScyllaDB will stream data only from nodes in that datacenter, when safe to do so.
Otherwise, an alternative datacenter that lost no nodes will be considered, and if none exist, all datacenters will be considered.
Use the ``--force`` option to enforce rebuild using the source datacenter, even if it is unsafe to do so.
When ``rebuild`` is enabled in :doc:`Repair Based Node Operations (RBNO) </operating-scylla/procedures/cluster-management/repair-based-node-operation>`,
data is rebuilt using repair-based-rebuild by reading all source replicas in each token range and repairing any discrepancies between them.
Otherwise, data is streamed from a single source replica when rebuilding each token range.
When :doc:`adding a new data-center into an existing ScyllaDB cluster </operating-scylla/procedures/cluster-management/add-dc-to-existing-dc/>` use the rebuild command.
@@ -14,6 +23,6 @@ For Example:
.. code-block:: shell
nodetool rebuild <src-dc-name>
nodetool rebuild <source-dc-name>
.. include:: nodetool-index.rst

View File

@@ -1,7 +1,10 @@
.. note::
This page only applies to clusters where consistent topology updates are not enabled.
This page only applies to clusters where consistent topology updates are not enabled.
Consistent topology updates are mandatory, so **this page serves troubleshooting purposes**.
The page does NOT apply if you:
* Created a cluster with ScyllaDB 6.0 (consistent topology updates are automatically enabled).
* Upgraded from ScyllaDB 5.4 and :doc:`manually enabled consistent topology updates </upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology>`.
* Created a cluster with ScyllaDB 6.0 or later (consistent topology updates are automatically enabled).
* `Manually enabled consistent topology updates <https://opensource.docs.scylladb.com/branch-6.0/upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology.html>`_
after upgrading to 6.0 or before upgrading to 6.1 (required).

View File

@@ -1,3 +0,0 @@
(Note: If you upgraded from version 5.4 without
:doc:`enabling consistent topology updates </upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology>`,
you must additionally alter the ``system_auth`` keyspace.)

View File

@@ -1,3 +0,0 @@
.. note::
If you upgraded your cluster from version 5.4, see :ref:`After Upgrading from 5.4 <add-dc-upgrade-info>`.

View File

@@ -1,3 +0,0 @@
.. note::
If you upgraded your cluster from version 5.4, see :ref:`After Upgrading from 5.4 <add-new-node-upgrade-info>`.

View File

@@ -1,3 +0,0 @@
.. note::
If you upgraded your cluster from version 5.4, see :ref:`After Upgrading from 5.4 <remove-node-upgrade-info>`.

View File

@@ -1,3 +0,0 @@
.. note::
If you upgraded your cluster from version 5.4, see :ref:`After Upgrading from 5.4 <replace-node-upgrade-info>`.

View File

@@ -1,24 +0,0 @@
After Upgrading from 5.4
----------------------------
The procedure described above applies to clusters where consistent topology updates
are enabled. The feature is automatically enabled in new clusters.
If you've upgraded an existing cluster from version 5.4, ensure that you
:doc:`manually enabled consistent topology updates </upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology>`.
Without consistent topology updates enabled, you must consider the following
limitations while applying the procedure:
* You can only bootstrap one node at a time. You need to wait until the status
of one new node becomes UN (Up Normal) before adding another new node.
* If the node starts bootstrapping but fails in the middle, for example, due to
a power loss, you can retry bootstrap by restarting the node. If you don't want to
retry, or the node refuses to boot on subsequent attempts, consult the
:doc:`Handling Membership Change Failures </operating-scylla/procedures/cluster-management/handling-membership-change-failures>`
document.
* The ``system_auth`` keyspace has not been upgraded to ``system``.
As a result, if ``authenticator`` is set to ``PasswordAuthenticator``, you must
increase the replication factor of the ``system_auth`` keyspace. It is
recommended to set ``system_auth`` replication factor to the number of nodes
in each DC.

View File

@@ -1,21 +0,0 @@
After Upgrading from 5.4
----------------------------
The procedure described above applies to clusters where consistent topology updates
are enabled. The feature is automatically enabled in new clusters.
If you've upgraded an existing cluster from version 5.4, ensure that you
:doc:`manually enabled consistent topology updates </upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology>`.
Without consistent topology updates enabled, you must consider the following
limitations while applying the procedure:
* Its essential to ensure the removed node will **never** come back to the cluster,
which might adversely affect your data (data resurrection/loss). To prevent the removed
node from rejoining the cluster, remove that node from the cluster network or VPC.
* You can only remove one node at a time. You need to verify that the node has
been removed before removing another one.
* If ``nodetool decommission`` starts executing but fails in the middle, for example,
due to a power loss, consult the
:doc:`Handling Membership Change Failures </operating-scylla/procedures/cluster-management/handling-membership-change-failures>`
document.

View File

@@ -1,23 +0,0 @@
----------------------------
After Upgrading from 5.4
----------------------------
The procedure described above applies to clusters where consistent topology updates
are enabled. The feature is automatically enabled in new clusters.
If you've upgraded an existing cluster from version 5.4, ensure that you
:doc:`manually enabled consistent topology updates </upgrade/upgrade-opensource/upgrade-guide-from-5.4-to-6.0/enable-consistent-topology>`.
Without consistent topology updates enabled, you must consider the following
limitations while applying the procedure:
* Its essential to ensure the replaced (dead) node will never come back to the cluster,
which might lead to a split-brain situation. Remove the replaced (dead) node from
the cluster network or VPC.
* You can only replace one node at a time. You need to wait until the status
of the new node becomes UN (Up Normal) before replacing another new node.
* If the new node starts and begins the replace operation but then fails in the middle,
for example, due to a power loss, you can retry the replace by restarting the node.
If you dont want to retry, or the node refuses to boot on subsequent attempts, consult the
:doc:`Handling Membership Change Failures </operating-scylla/procedures/cluster-management/handling-membership-change-failures>`
document.

View File

@@ -1,8 +1,6 @@
Adding a New Data Center Into an Existing ScyllaDB Cluster
***********************************************************
.. scylladb_include_flag:: upgrade-note-add-new-dc.rst
The following procedure specifies how to add a Data Center (DC) to a live ScyllaDB Cluster, in a single data center, :ref:`multi-availability zone <faq-best-scenario-node-multi-availability-zone>`, or multi-datacenter. Adding a DC out-scales the cluster and provides higher availability (HA).
The procedure includes:
@@ -164,8 +162,6 @@ Add New DC
* Keyspace created by the user (which needed to replicate to the new DC).
* System: ``system_distributed``, ``system_traces``, for example, replicate the data to three nodes in the new DC.
.. scylladb_include_flag:: system-auth-alter-info.rst
For example:
Before
@@ -234,7 +230,3 @@ Additional Resources for Java Clients
* `DCAwareRoundRobinPolicy.Builder <https://java-driver.docs.scylladb.com/scylla-3.10.2.x/api/com/datastax/driver/core/policies/DCAwareRoundRobinPolicy.Builder.html>`_
* `DCAwareRoundRobinPolicy <https://java-driver.docs.scylladb.com/scylla-3.10.2.x/api/com/datastax/driver/core/policies/DCAwareRoundRobinPolicy.html>`_
.. _add-dc-upgrade-info:
.. scylladb_include_flag:: upgrade-warning-add-new-node-or-dc.rst

View File

@@ -2,8 +2,6 @@
Adding a New Node Into an Existing ScyllaDB Cluster (Out Scale)
=================================================================
.. scylladb_include_flag:: upgrade-note-add-new-node.rst
When you add a new node, other nodes in the cluster stream data to the new node. This operation is called bootstrapping and may
be time-consuming, depending on the data size and network bandwidth. If using a :ref:`multi-availability-zone <faq-best-scenario-node-multi-availability-zone>`, make sure they are balanced.
@@ -100,7 +98,3 @@ Procedure
#. If you are using ScyllaDB Monitoring, update the `monitoring stack <https://monitoring.docs.scylladb.com/stable/install/monitoring_stack.html#configure-scylla-nodes-from-files>`_ to monitor it. If you are using ScyllaDB Manager, make sure you install the `Manager Agent <https://manager.docs.scylladb.com/stable/install-scylla-manager-agent.html>`_, and Manager can access it.
.. _add-new-node-upgrade-info:
.. scylladb_include_flag:: upgrade-warning-add-new-node-or-dc.rst

Some files were not shown because too many files have changed in this diff Show More