During upgrade to raft topology, information about service levels is
copied from the legacy tables in system_distributed to the raft-managed
tables of group 0. system_distributed has RF=3, so if the cluster has
only one or two nodes we should use a lower consistency level than ALL.
The current procedure does exactly that: it selects QUORUM in case
of two nodes and ONE in case of only one node. The cluster size is
determined based on the call to _gossiper.num_endpoints().
Despite its name, gossiper::num_endpoints() does not necessarily return
the number of nodes in the cluster but rather the number of endpoint
states in gossiper (this behavior is documented in a comment near the
declaration of this function). In some cases, e.g. after gossiper-based
nodetool remove, the state might be kept for some time after removal (3
days in this case).
The consequence of this is that gossiper::num_endpoints() might return
more than the current number of nodes during upgrade, and that in turn
might cause migration of data from one table to another to fail,
causing the upgrade procedure to get stuck if there are only one or two
nodes in the cluster.
In order to fix this, use token_metadata::get_all_endpoints() as a
measure of the cluster size.
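The selection rule described above can be sketched as follows. This is a minimal illustration, not the actual code: `consistency_level` and `pick_read_cl()` are hypothetical names, and the node count is assumed to come from the token metadata rather than from the gossiper.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the real consistency level type.
enum class consistency_level { ONE, QUORUM, ALL };

// Pick the consistency level for reading the legacy RF=3 tables,
// given the number of nodes that are really part of the cluster.
constexpr consistency_level pick_read_cl(std::size_t node_count) {
    if (node_count <= 1) {
        return consistency_level::ONE;
    }
    if (node_count == 2) {
        return consistency_level::QUORUM;
    }
    return consistency_level::ALL;
}
```

With a stale endpoint state still counted, a two-node cluster would be treated as three nodes and get ALL, which matches the stuck-upgrade scenario described above.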
Fixes: scylladb/scylladb#18198
before this change, `partition_version` uses a hand-crafted move
constructor. but it triggers a warning from clang-tidy, which
believes there is a use-after-move issue, as the inner instance of
its parent class is constructed using
`anchorless_list_base_hook(std::move(pv))`, and its other member
variables are initialized like `_partition(std::move(pv._partition))`.
`std::move(pv)` does not move anything by itself, but *indicates* that
`pv` may be moved from. and what is actually moved away is only the
part belonging to its parent class. so this issue is benign.
but it's still annoying, as we need to tell the genuine issues
reported by clang-tidy from the false alarms. so we have a few
options:
- stop using clang-tidy
- ignore this warning
- silence this warning using a NOLINT directive in a comment
- use another way to implement the move constructor
in this change, we just cast the moved instance to its
base class and move it instead. this should appease
clang-tidy.
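A minimal sketch of the idea, using hypothetical stand-in types (`hook_base` and `version` are not the real `anchorless_list_base_hook` and `partition_version`): only the base subobject is named in the first move, so reading the members of `other` afterwards is not a use of a moved-from object.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for the base hook class.
struct hook_base {
    int link = 0;
    hook_base() = default;
    hook_base(hook_base&& other) noexcept : link(std::exchange(other.link, 0)) {}
};

// Hypothetical stand-in for the derived class with its own members.
struct version : hook_base {
    std::string payload;
    explicit version(std::string p) : payload(std::move(p)) {}
    // Move the base subobject through an explicit cast, then move each
    // member. The whole object is never passed to std::move().
    version(version&& other) noexcept
        : hook_base(std::move(static_cast<hook_base&>(other)))
        , payload(std::move(other.payload)) {}
};
```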
Fixes #18354
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18359
We move consistent cluster management out of experimental and
make it the default for new clusters in 6.0. In code, we make the
`consistent-topology-changes` flag unused and assumed to be true.
In 6.0, the topology upgrade procedure will be manual and
voluntary, so some clusters will still be using the gossip-based
topology even though they support the raft-based topology.
Therefore, we need to continue testing the gossip-based topology.
This is possible by using the `force-gossip-topology-changes` flag
introduced in scylladb/scylladb#18284.
Ref scylladb/scylladb#17802
Closes scylladb/scylladb#18285
* github.com:scylladb/scylladb:
docs: raft.rst: update after removing consistent-topology-changes
treewide: fix indentation after the previous patch
db: config: make consistent-topology-changes unused
test: lib: single_node_cql_env: restart a node in noninitial run_in_thread calls
test: test_read_required_hosts: run with force-gossip-topology-changes
storage_service: join_cluster: replace force_gossip_based_join with force-gossip-topology-changes
storage_service: join_token_ring: fix finish_setup_after_join calls
in `set_repair()`, even though the repair is performed asynchronously,
we check the options specified by the client immediately, and throw
`std::runtime_error` if any of them is not supported.
before this change, these unhandled exceptions are translated to an HTTP
500 error by the underlying HTTP router. but this is misleading, as
these errors are caused by the client, not the server.
in this change, we handle the `runtime_error`, and translate them
into `httpd::bad_param_exception`, so that the client can have
HTTP 400 (Bad Request) instead of HTTP 500 (Internal Server Error),
and with informative error message.
for instance, if we apply repair with "small_table_optimization" enabled
on a keyspace with tablets enabled, we should have an HTTP error 400
with "The small_table_optimization option is not supported for tablet repair"
as the body of the error. this would be much more helpful.
Closes scylladb/scylladb#18389
* github.com:scylladb/scylladb:
api/storage_service: convert runtime_error from repair to http error
repair: change runtime_error to invalid_argument in do_repair_start()
api/storage_service: coroutinize set_repair()
The entry in the repair history map that is used to track repair status
internally for each repair job should be removed after the repair job is
done. We do the same for vnode repairs.
This patch adds the automatic history cleanup code that was
missing from the initial tablet repair support in commit 54239514af,
which did not support repair history updates back then.
Refs #17046
Closes scylladb/scylladb#18434
The populate_views() and generate_and_propagate_view_updates() both naturally belong to view_update_generator -- they don't need anything special from table itself, but rather depend on some internals of the v.u.generator itself.
Moving them there lets us remove the view concurrency semaphore from keyspace and table, thus reducing the cross-component dependencies.
Closes scylladb/scylladb#18421
* github.com:scylladb/scylladb:
replica: Do not carry view concurrency semaphore pointer around
view: Get concurrency semaphore via database, not table
view_update_generator: Mark mutate_MV() private
view: Move view_update_generator methods' code
view: Move table::generate_and_propagate_view_updates into view code
view: Move table::populate_views() into view_update_generator class
dclocal_read_repair_chance and read_repair_chance have been removed in Cassandra 3.11 and 4.x, see
https://issues.apache.org/jira/browse/CASSANDRA-13910.
if we expose these properties via DDL, Cassandra would fail to consume the CQL statement creating the table when performing migration from Scylla to Cassandra 4.x, as the latter does not understand these properties anymore.
currently the default values of `dc_local_read_repair_chance` and `read_repair_chance` are both "0". so they are practically disabled, unless the user deliberately sets them to a value greater than 0.
also, as a side effect, Cassandra 4.x has better support of Python3. the cqlsh shipped along with Cassandra 3.11.16 only supports python2.7, see
https://github.com/apache/cassandra/blob/cassandra-3.11.16/bin/cqlsh.py
it errors out if the system only provides python3 with the error of
```
No appropriate python interpreter found.
```
but modern linux systems do not provide python2 anymore.
so, in this change, we deprecate these two options.
Fixes #3502
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18087
* github.com:scylladb/scylladb:
docs: drop documents related to {,dclocal_}read_repair_chance
treewide: remove {dclocal_,}read_repair_chance options
in `set_repair()`, even though the repair is performed asynchronously,
we check the options specified by the client immediately, and throw
`std::runtime_error` if any of them is not supported.
before this change, these unhandled exceptions are translated to an HTTP
500 error by the underlying HTTP router. but this is misleading, as
these errors are caused by the client, not the server. and the error message
is missing in the HTTP error message when performing the translation.
in this change, we handle the `runtime_error`, and translate them
into `httpd::bad_param_exception`, so that the client can have
HTTP 400 (Bad Request) instead of HTTP 500 (Internal Server Error),
and with informative error message.
for instance, if we apply repair with "small_table_optimization" enabled
on a keyspace with tablets enabled, we should have an HTTP error 400
with "The small_table_optimization option is not supported for tablet repair"
as the body of the error. this would be much more helpful.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
if an error is caused by an option provided by the user, it would be better
to throw an `std::invalid_argument` instead of `std::runtime_error`,
so that the caller can make a better decision when handling the
thrown exceptions.
so, in this change, we change the exceptions raised directly in
`repair_service::do_repair_start()` from `std::runtime_error` to
`std::invalid_argument`. please note, in the lambda named `host2ip`,
the hostname is not provided by the user, so we are not changing
the exception type in that lambda.
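To illustrate why the more specific exception type helps the caller, here is a minimal sketch. All names in it (`http_reply`, `check_options()`, `handle_repair_request()`) are hypothetical; the real code uses the HTTP server's own exception types rather than a plain struct.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical, simplified reply type.
struct http_reply {
    int status;
    std::string body;
};

// Hypothetical validation step standing in for the option checks.
void check_options(bool small_table_optimization, bool uses_tablets) {
    if (small_table_optimization && uses_tablets) {
        throw std::invalid_argument(
            "The small_table_optimization option is not supported for tablet repair");
    }
}

http_reply handle_repair_request(bool small_table_optimization, bool uses_tablets) {
    try {
        check_options(small_table_optimization, uses_tablets);
        return {200, "ok"};
    } catch (const std::invalid_argument& e) {
        // Bad input from the client: report 400 and keep the message.
        return {400, e.what()};
    } catch (const std::exception&) {
        // Anything else is treated as a server-side failure.
        return {500, "Internal Server Error"};
    }
}
```

Because `std::invalid_argument` is caught separately, other runtime failures still map to a server error.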
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, `set_repair()` uses a lambda for handling
the client-side requests. and this works great. but the underlying
`repair_start()` throws if any of the given options is not sane.
and we don't handle any of these thrown exceptions in `set_repair()`.
from the client's point of view, it would get an HTTP 500 error code,
which implies an "Internal Server Error". but actually, we should
blame the client for the error, not the server.
so, to prepare the error handling, let's take the opportunity to
coroutinize the lambda handling the request, so that we can handle
the exception in a more elegant way.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* tools/java b810e8b00e...4ee15fd9ea (1):
> install.sh: don't install nodetool into /usr/bin
Add a bin/nodetool and install it to bin/ in install.sh. This script
simply forwards to scylla nodetool and it is the replacement for the
Java nodetool, which is dropped from the java-tools's install.sh, in the
submodule update also included in this patch.
With this change, we now hardwire the usage of the native nodetool, as
*the* nodetool, with the intermediary nodetool wrapper script removed
from the picture.
Bash completion was copied from the java tools repository and it is now
installed by the scylla package, together with nodetool.
The Java nodetool is still available as a fall-back, in case the
native nodetool has problems, at the path of
/opt/scylladb/share/cassandra/bin/nodetool.
Testing
I tested upgrades on a DEB and RPM distro: Ubuntu and Fedora.
First I installed scylla-5.4, then I installed the packages for this PR.
On Ubuntu, I had to use dpkg -i --auto-deconfigure, otherwise, dpkg would
refuse to install the new packages because they break the old ones. No
extra flags were required on Fedora.
In both cases, /usr/bin/nodetool was changed from a thunk calling the
Java nodetool (from 5.4) to the native launcher script from this PR.
/opt/scylladb/share/cassandra/bin/nodetool remained in place and still
works after the upgrade.
I also verified that --nonroot installs also work. Nodetool works both
when called with an absolute path, or when ~/scylladb/bin is added to
$PATH.
Fixes: #18226
Fixes: #17412
Closes scylladb/scylladb#18255
[avi: reset submodule to actual hash we ended up with]
Repair may miss some tablets that migrated across nodes.
So if tombstones expire after some timeout, then we can
have data resurrection.
Set default tombstone_gc mode to "repair" for tables which
use tablets (if repair is required).
Fixes: #16627.
Closes scylladb/scylladb#18013
* github.com:scylladb/scylladb:
test: check default value of tombstone_gc
test: topology: move some functions to util.py
cql3: statements: change default tombstone_gc mode for tablets
When reclaiming memory from bloom filters, do not remove them from
_recognised_components, as that leads to the on-disk filter component
being left behind on disk when the SSTable is deleted.
Fixes #18398
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#18400
The FIXME was added back then because we thought the interface of
compaction_group_for_sstable might have to be adjusted if a sstable
were allowed to temporarily span multiple tablets until it's split,
but we have taken a different path.
If a sstable's key range incorrectly spans more than one tablet,
that will be considered a bug and an exception is thrown.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#18410
We have a concurrent modification conflict in tests and suspect
duplicated requests, but since we don't log successful requests
we have no way to verify if that's the case. The get_mutations_internal log
will help to tell which nodes are trying to push auth or
service levels mutations into raft.
Refs scylladb/scylladb#18319
Closes scylladb/scylladb#18413
Today with the backport automation, the developer adds the relevant backport label, but without any explanation of why.
This adds a PR template with a placeholder for the developer to explain their decision on whether to backport.
The placeholder is marked as a task, so once the explanation is added, the task must be checked as completed.
Also, adding another check to the PR summary makes it clear to the maintainer/reviewer whether the developer has explained the backport decision.
Closes scylladb/scylladb#18275
* github.com:scylladb/scylladb:
[github] add action to verify PR tasks was completed
[github] add PR template
There are two places that get total:live stats for a table snapshot --
database::get_snapshot_details() and table::get_snapshot_details(). Both
do pretty much the same thing -- walk the table/snapshots/ directory, then
each of the found sub-directories, and accumulate the found files' sizes
into the snapshot details structure.
Both try to tell total from live sizes by checking whether an sstable
component found in snapshots is present in the table datadir. The
database code does it in a more correct way -- it not only checks the
file's presence, but also checks whether it is a hardlink to the snapshot
file, while the table code just checks if a file of the same name exists.
This patch does both -- makes both database and table call the same
helper method for a single snapshot's details, and makes the generalized
version use the more elaborate collision check, thus fixing the per-table
details getting behavior.
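The difference between the two checks can be sketched with the standard library. This is only an illustration under assumptions: `shares_storage_with_datadir()` is a hypothetical helper, and the real code works on Seastar's asynchronous file API rather than `std::filesystem`.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Returns true when `snapshot_file` still shares its storage with a file of
// the same name in `datadir`, i.e. the two names are hardlinks to one file.
bool shares_storage_with_datadir(const fs::path& snapshot_file, const fs::path& datadir) {
    const fs::path candidate = datadir / snapshot_file.filename();
    std::error_code ec;
    // equivalent() compares the underlying file identity, not just the name,
    // so a different file that merely has the same name is not a match.
    return fs::exists(candidate, ec) && fs::equivalent(candidate, snapshot_file, ec);
}
```

A name-only check would report a match for any same-named file in the data directory, even one that is not a hardlink to the snapshot file.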
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#18347
The event is used in a loop.
Found by clang-tidy:
```
streaming/stream_result_future.cc:80:49: warning: 'event' used after it was moved [bugprone-use-after-move]
listener->handle_stream_event(std::move(event));
^
streaming/stream_result_future.cc:80:39: note: move occurred here
listener->handle_stream_event(std::move(event));
^
streaming/stream_result_future.cc:80:49: note: the use happens in a later loop iteration than the move
listener->handle_stream_event(std::move(event));
^
```
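One way to avoid this class of bug is to not move inside the loop at all. The sketch below uses hypothetical types (`listener` with a string event) and is not necessarily how the actual fix is written.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical listener that takes its event by value.
struct listener {
    std::string last;
    void handle_stream_event(std::string event) { last = std::move(event); }
};

// Each listener gets its own copy; nothing is moved inside the loop,
// so later iterations still see the original event.
void notify_all(std::vector<listener>& listeners, const std::string& event) {
    for (auto& l : listeners) {
        l.handle_stream_event(event);
    }
}
```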
Fixes #18332
Closes scylladb/scylladb#18333
It's pretty straightforward, but prior to that, exception handling needs some care.
Closes scylladb/scylladb#18378
* github.com:scylladb/scylladb:
view-builder: Coroutinize stop()
view_builder: Do not try to handle step join exceptions on stop
we are using `fmt::ostream_formatter` which was introduced in
{fmt} v9.0.0, see https://github.com/fmtlib/fmt/releases/tag/9.0.0 .
before this change, we depend on Seastar to find {fmt}. but
the minimal version of {fmt} required by Seastar is 5.0.0, which
cannot fulfill the needs to build scylladb.
in this change, we find {fmt} package in scylla, and specify the
minimal required version of 9.0.0, so the build can fail at the
configuration time. {fmt} v8 could still be in use by some users.
for instance, ubuntu:jammy comes with libfmt-dev 8.1.1. and
ubuntu:jammy is EOL in Apr 2027, see
https://ubuntu.com/about/release-cycle .
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18386
Metric family config lets a user configure the metric family aggregate labels.
This patch modifies the existing relabel-config from file to accept
metric family config.
Similar to the existing relabel_config, it adds a metric_family_configs
section. For example, the following configuration demonstrates changing
aggregate labels by name and regular expression.
```
metric_family_configs:
  - name: storage_service
    aggregate_labels: [shard]
  - regex: (storage_proxy.*)
    aggregate_labels: [shard, scheduling_group_name]
```
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Closes scylladb/scylladb#18339
We make the `consistent-topology-changes` experimental feature
unused and assumed to be true in 6.0. We remove code branches that
executed if `consistent-topology-changes` was disabled.
In the following commit, we make the `consistent-topology-changes`
experimental feature unused. Then, all unit tests in the boost suite
will start using the raft-based topology by default. Unfortunately,
tests with multiple `single_node_cql_env::run_in_thread` calls
(usually coming from the `do_with_cql_env_thread` calls) would fail.
In a noninitial `run_in_thread` call, a node is started as if it
booted for the first time. On the other hand, it has its persistent
state from previous boots. Hence, the node can behave strangely and
unexpectedly. In particular, `SYSTEM.TOPOLOGY` is not empty and the
assertion that expects it to be empty when we boot for the first
time fails.
We fix this issue by making noninitial `run_in_thread` calls
behave as normal restarts.
After this change,
`test_schema_digest_does_not_change_with_disabled_features` starts
failing. This test copies the data directory before booting for the
first time, so the new
`_sys_ks.local().build_bootstrap_info().get();` makes the node
incorrectly think it restarts. Then, after noticing it is not a part
of group 0, the node would start the raft upgrade procedure if we
didn't run it in the raft RECOVERY mode. This procedure would get
stuck because it depends on messaging being enabled even if the node
communicates only with itself and messaging is disabled in boost tests.
In one of the following commits, we make the
`consistent-topology-changes` experimental feature unused. Then,
all unit tests in the boost suite will start using the raft-based
topology by default. Unfortunately, some tests would start failing
and `test_read_required_hosts` is one of them.
`tablet_cql_test_config` in `tablets_test.cc` doesn't use
`consistent-topology-changes`, so all test cases in this file
run incorrectly with the gossip-based topology changes. With
`consistent-topology-changes`, only `test_read_required_hosts`
fails. The failure happens on `auto table2 = add_table(e).get();`:
```
ERROR 2024-04-17 11:14:16,083 [shard 0:main] load_balancer -
Replica 9b94d710-fbfb-11ee-9c4f-448617b47e11:0 of tablet
9b94d713-fbfb-11ee-9c4f-448617b47e11:0 not found in topology
```
This test case needs to be investigated and rewritten so that
it passes with the raft-based topology. However, we don't want
this issue to block the process of making the
`consistent-topology-changes` experimental feature unused. We
leave a FIXME and we will open a new issue to track it.
The `force_gossip_based_join` error injection does exactly what we
expect from `force-gossip-topology-changes` so we can do a simple
replacement.
We prefer a flag over an error injection because we will use it
a lot in CI jobs' configurations, some tests, manual testing etc.
It's much more convenient.
Moreover, the flag can be used in the release mode, so we re-enable
all tests that were disabled in release mode only because of using
the `force_gossip_based_join` error injection.
The name of the `force-gossip-topology-changes` flag suggests that
using it should always successfully force the gossip-based topology
or, if forcing is not possible, the booting should fail. We don't
want a node with `force-gossip-topology-changes=true` that silently
boots in the raft-topology mode. We achieve it by throwing a
runtime error from `join_cluster` in two cases:
- the node is restarting in the cluster that is using raft topology
- the node is joining the cluster that is using raft topology
The `topology_change_enabled` parameter of `finish_setup_after_join`
is used underneath to enable pulling raft topology snapshots in two
cases:
- when the node joins the cluster that uses the raft-based topology,
- when the SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES feature is enabled.
The first case happens in the first changed call.
`_raft_experimental_topology` always equals true there. The second
call was incorrect as it could enable pulling snapshots before
SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES was enabled. It could cause
problems during rolling upgrade to 6.0. For more information see
07aba3abc4.
The API endpoint in question calls table::get_snapshot_detail() which just walks the table/snapshots/ directory. This can clash with creating a new snapshot. The database-wide walk is guarded with snapshot-ctl's locking, and so should the per-table API be.
Closes scylladb/scylladb#18414
* github.com:scylladb/scylladb:
snapshot: Get per-table snapshot size under snapshot lock
snapshot: Move per-table snap API to other snapshot endpoints
in `partition_entry::apply_to_incomplete()`, we pass `*dst_snp` and
`std::move(dst_snp)` to build the capture variable list of a lambda,
but the order of evaluation of these variables is unspecified.
fortunately, we haven't run into any issues at this moment. but this
is not future-proof. so, let's avoid this by storing a reference
of the dereferenced smart pointer, and use it later on.
this issue is identified by clang-tidy:
```
/home/kefu/dev/scylladb/mutation/partition_version.cc:500:53: warning: 'dst_snp' used after it was moved [bugprone-use-after-move]
500 | cur = partition_snapshot_row_cursor(s, *dst_snp),
| ^
/home/kefu/dev/scylladb/mutation/partition_version.cc:502:23: note: move occurred here
502 | dst_snp = std::move(dst_snp),
| ^
/home/kefu/dev/scylladb/mutation/partition_version.cc:500:53: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
500 | cur = partition_snapshot_row_cursor(s, *dst_snp),
| ^
/home/kefu/dev/scylladb/mutation/partition_version.cc:501:57: warning: 'src_snp' used after it was moved [bugprone-use-after-move]
501 | src_cur = partition_snapshot_row_cursor(s, *src_snp, can_move),
| ^
/home/kefu/dev/scylladb/mutation/partition_version.cc:504:23: note: move occurred here
504 | src_snp = std::move(src_snp),
| ^
/home/kefu/dev/scylladb/mutation/partition_version.cc:501:57: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
501 | src_cur = partition_snapshot_row_cursor(s, *src_snp, can_move),
| ^
```
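The approach described above, binding a reference to the pointee before building the capture list, can be sketched like this. `cursor` and `make_task()` are hypothetical stand-ins, with `std::unique_ptr<int>` in place of the real snapshot pointer.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical cursor that keeps a reference to the object it iterates.
struct cursor {
    const int& target;
    explicit cursor(const int& t) : target(t) {}
};

auto make_task(std::unique_ptr<int> snp) {
    // Bind the pointee first, in its own statement, so that no single
    // expression both dereferences and moves `snp`.
    int& snp_ref = *snp;
    return [cur = cursor(snp_ref), snp = std::move(snp)] {
        // `snp` keeps the pointee alive; `cur` still refers to it.
        assert(snp.get() == &cur.target);
        return cur.target;
    };
}
```

Moving a `std::unique_ptr` does not relocate the pointee, so the reference taken beforehand stays valid for the lifetime of the closure.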
Fixes #18360
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18361
The _view_update_concurrency_sem field on database propagates itself via
keyspace config down to table config and view_update_generator then
grabs one via table:: helper. That's overkill: view_update_generator
has a direct reference to the database and can get this semaphore from
there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the two methods belong to another class, move the code itself
to db/view, where the class itself resides.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to populate_views() method, this one also naturally belongs to
view_update_generator class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question has little to do with table, effectively it only
needs stats and concurrency semaphore. And the semaphore in question is
obtained from table indirectly, it really resides on database. On the
other hand, the method carries lots of bits from db::view, e.g. the
view_update_builder class, memory_usage_of() helper and a bit more.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
since "read_repair_chance" and "dclocal_read_repair_chance" are
removed and not supported anymore, let's stop documenting them.
Refs #3502
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
dclocal_read_repair_chance and read_repair_chance have been removed
in Cassandra 3.11 and 4.x, see
https://issues.apache.org/jira/browse/CASSANDRA-13910.
if we expose the properties via DDL, Cassandra would fail to consume
the CQL statement creating the table when performing migration
from Scylla to Cassandra 4.x, as the latter does not understand
these properties anymore.
currently the default values of `dc_local_read_repair_chance` and
`read_repair_chance` are both "0". so they are practically disabled,
unless the user deliberately sets them to a value greater than 0.
also, as a side effect, Cassandra 4.x has better support of
Python3. the cqlsh shipped along with Cassandra 3.11.16 only
supports python2.7, see
https://github.com/apache/cassandra/blob/cassandra-3.11.16/bin/cqlsh.py
it errors out if the system only provides python3 with the error
of
```
No appropriate python interpreter found.
```
but modern linux systems do not provide python2 anymore.
so, in this change, we deprecate these two options.
Fixes #3502
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When issuing warnings about partitions with the number of rows above a configured threshold, the large partitions handler does not take into consideration the number of range tombstone markers in the total rows count. This fix adds the number of range tombstone markers to the total number of rows and saves this total in system.large_partitions.rows (if it is above the threshold). It also adds a new column range_tombstones to the system.large_partitions table which only contains the number of range tombstone markers for the given partition.
This PR fixes the first part of issue #13968
It does not cover distinguishing between live and dead rows. A subsequent PR will handle that.
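The new accounting can be sketched as below. The struct and function names are hypothetical and only mirror the description: the stored row count includes range tombstone markers, and the markers are also reported on their own.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical per-partition counters collected while writing an sstable.
struct partition_counts {
    std::uint64_t rows;
    std::uint64_t range_tombstones;
};

// Hypothetical shape of what would be stored in system.large_partitions.
struct large_partition_entry {
    std::uint64_t rows;              // rows plus range tombstone markers
    std::uint64_t range_tombstones;  // range tombstone markers only
};

// Returns an entry only when the combined count is above the threshold.
std::optional<large_partition_entry> maybe_record(const partition_counts& c,
                                                  std::uint64_t rows_threshold) {
    const std::uint64_t total = c.rows + c.range_tombstones;
    if (total <= rows_threshold) {
        return std::nullopt;
    }
    return large_partition_entry{total, c.range_tombstones};
}
```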
Closes scylladb/scylladb#18346
* github.com:scylladb/scylladb:
sstables: add docs changes for system.large_partitions
sstable: large data handler needs to count range tombstones as rows
before this change, if we generate the build system with plain
`Ninja`, instead of `Ninja Multi-Config` using cmake, the build
fails, because `${scylla_build_mode_${CMAKE_BUILD_TYPE}}` is not
defined. so the profile used for building the rust library would be
"rust-", which does not match any of the profiles defined by
`Cargo.toml`.
in this change, we use `$CMAKE_BUILD_TYPE` instead of "$config", as
the former is defined for non-multi-config generators, while the latter
is not. see https://cmake.org/cmake/help/latest/generator/Ninja%20Multi-Config.html
with this change, we are able to generate the build system properly
with the "Ninja" generator. if we just want to run some static analyzer
against the source tree or just want to build scylladb with a single
configuration, the "Ninja" generator is a good fit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18353
Walking the per-table snapshot directory without a lock is racy. There's
snapshot-ctl locking that's used to get db-wide snapshot details; it
should be used to get per-table snapshot details too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
So that they are collected in one place and to facilitate next patch
that's going to use snapshot-ctl for per-table API too
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit adds the description and usage instructions of Scylla Doctor
to the "How to Report a ScyllaDB Problem" page.
Scylla Doctor replaces Health Check Report, so the description of
and references to the latter are removed with this commit.
Fixes https://github.com/scylladb/scylladb/issues/16276
Closes scylladb/scylladb#17617