Commit Graph

6336 Commits

Author SHA1 Message Date
Petr Gusev
1d6caa42b9 join_cluster: move was_decommissioned check earlier
Before the patch if a decommissioned node tries
to restart, it calls _group0->discover_group0 first
in join_cluster, which hangs since decommissioned
nodes are banned and other nodes don't respond
to their discovering requests.

We fix the problem by checking was_decommissioned()
flag before calling discover_group0.

fixes scylladb/scylladb#17282

Closes scylladb/scylladb#17358
2024-02-18 22:07:28 +02:00
Avi Kivity
43f1c3df2e Merge 'repair: Update repair history for tablet repair' from Asias He
This patch wires up tombstone_gc repair with tablet repair. The flush
hints logic from the vnode table repair is reused. The way to mark the
finish of the repair is also adjusted for tablet repair because it only
has one shard per tablet token range instead of smp::count shards.

Fixes: #17046
Tests: test_tablet_repair_history

Closes scylladb/scylladb#17047

* github.com:scylladb/scylladb:
  repair: Update repair history for tablet repair
  repair: Extract flush hints code
2024-02-18 19:21:54 +02:00
Avi Kivity
d257cc5003 Merge 'scylla-nodetool: implement the repair command' from Botond Dénes
As usual, the new command is covered with tests, which pass with both the legacy and the new native implementation.

Refs: #15588

Closes scylladb/scylladb#17368

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: implement the repair command
  test/nodetool: utils: add check_nodetool_fails_with_error_contains()
  test/nodetool: util: replace flags with custom matcher
2024-02-18 19:21:54 +02:00
Petr Gusev
4ef5d92f50 gossiping_property_file_snitch_test: modernize + fix potential race
This is mostly a refactoring commit to make the test
more readable, as a byproduct of
scylladb/scylladb#17369 investigation.

We add the check for specific type of exceptions that
can be thrown (bad_property_file_error).

We also fix the potential race - the test may write
to res from multiple cores with no locks.

Closes scylladb/scylladb#17371
2024-02-18 19:21:53 +02:00
Patryk Wrobel
3842bf18a7 storage_service/range_to_endpoint_map: allow API to properly handle tablets
This API endpoint was failing when tablets were enabled
because of usage of get_vnode_effective_replication_map().
Moreover, it was providing an error message that was not
user-friendly.

This change extends the handler to properly service the incoming requests.
Furthermore, it introduces two new test cases that verify the behavior of
storage_service/range_to_endpoint_map API. It also adjusts the test case
of this endpoint for vnodes to succeed when tablets are enabled by default.

The new logic is as follows:
 - when tablets are disabled then users may query endpoints
   for a keyspace or for a given table in a keyspace
 - when tablets are enabled then users have to provide
   table name, because effective replication map is per-table

When user does not provide table name when tablets are enabled
for a given keyspace, then BAD_REQUEST is returned with a
meaningful error message.

Fixes: scylladb#17343

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>

Closes scylladb/scylladb#17372
2024-02-18 19:21:53 +02:00
Botond Dénes
b11213e547 tools/scylla-nodetool: implement the upgradesstables command
Refs: #15588

Closes scylladb/scylladb#17370
2024-02-18 19:21:53 +02:00
Avi Kivity
9bb4482ad0 Merge 'cdc: metadata: allow sending writes to the previous generations' from Patryk Jędrzejczak
Before this PR, writes to the previous CDC generations would
always be rejected. After this PR, they will be accepted if the
write's timestamp is greater than `now - generation_leeway`.

This change was proposed around 3 years ago. The motivation was
to improve user experience. If a client generates timestamps by
itself and its clock is desynchronized with the clock of the node
the client is connected to, there could be a period during
generation switching when writes fail. We didn't consider this
problem critical because the client could simply retry a failed
write with a higher timestamp. Eventually, it would succeed. This
approach is safe because these failed writes cannot have any side
effects. However, it can be inconvenient. Writing to previous
generations was proposed to improve it.

The idea was rejected 3 years ago. Recently, it turned out that
there is a case when the client cannot retry a write with the
increased timestamp. It happens when a table uses CDC and LWT,
which makes timestamps permanent. Once Paxos commits an entry
with a given timestamp, Scylla will keep trying to apply that entry
until it succeeds, with the same timestamp. Applying the entry
involves writing to the CDC log table. If it fails, we get stuck.
It's a major bug with an unknown perfect solution.

Allowing writes to previous generations for `generation_leeway` is
a probabilistic fix that should solve the problem in practice.

Apart from this change, this PR adds tests for it and updates
the documentation.

This PR is sufficient to enable writes to the previous generations
only in the gossiper-based topology. The Raft-based topology
needs some adjustments in loading and cleaning CDC generations.
These changes won't interfere with the changes introduced in this
PR, so they are left for a follow-up.

Fixes scylladb/scylladb#7251
Fixes scylladb/scylladb#15260

Closes scylladb/scylladb#17134

* github.com:scylladb/scylladb:
  docs: using-scylla: cdc: remove info about failing writes to old generations
  docs: dev: cdc: document writing to previous CDC generations
  test: add test_writes_to_previous_cdc_generations
  cdc: generation: allow increasing generation_leeway through error injection
  cdc: metadata: allow sending writes to the previous generations
2024-02-18 19:21:53 +02:00
Asias He
796044be1c repair: Update repair history for tablet repair
This patch wires up tombstone_gc repair with tablet repair. The flush
hints logic from the vnode table repair is reused. The way to mark the
finish of the repair is also adjusted for tablet repair because it only
has one shard per tablet token range instead of smp::count shards.

Fixes: #17046
Tests: test_tablet_repair_history
2024-02-18 10:21:58 +08:00
Kefu Chai
47fec0428a tools/scylla-nodetool: return 1 when viewbuild not succeeds
this change introduces a new exception which carries the status code
so that an operation can return a non-zero exit code without printing
any errors. this mimics the behavior of "viewbuildstatus" command of
C* nodetool.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17359
2024-02-16 13:53:33 +02:00
Botond Dénes
8d8ea12862 tools/scylla-nodetool: implement the repair command 2024-02-16 04:42:08 -05:00
Botond Dénes
48e8435466 test/nodetool: utils: add check_nodetool_fails_with_error_contains()
Checks that at least one error snippet is contained in the error output.
2024-02-16 04:40:31 -05:00
Botond Dénes
190c9a7239 test/nodetool: util: replace flags with custom matcher
_do_check_nodetool_fails_with() currently has a `match_all` flag to
control how the match is checked. Now we need yet another way to control
how matching is done. Instead of adding yet another flag (and who knows
how many more), jut replace the flag and the errors input with a matcher
functor, which gets the stdout and stderr and is delegated to do any
checks it wants. This method will scale much better going forward.
2024-02-16 04:40:31 -05:00
Kamil Braun
50ebce8acc Merge 'Purge old ip on change' from Petr Gusev
When a node changes IP address we need to remove its old IP from `system.peers` and gossiper.

We do this in `sync_raft_topology_nodes` when the new IP is saved into `system.peers` to avoid losing the mapping if the node crashes between deleting and saving the new IP. We also handle the possible duplicates in this case by dropping them on the read path when the node is restarted.

The PR also fixes the problem with old IPs getting resurrected when a node changes its IP address.
The following scenario is possible: a node `A` changes its IP from `ip1` to `ip2` with restart, other nodes are not yet aware of `ip2` so they keep gossiping `ip1`. After restart `A` receives `ip1` in a gossip message and calls `handle_major_state_change` since it considers it as a new node. Then `on_join` event is called on the gossiper notification handlers, we receive such event in `raft_ip_address_updater` and reverts the IP of the node A back to ip1.

To fix this we ensure that the new gossiper generation number is used when a node registers its IP address in `raft_address_map` at startup.

The `test_change_ip` is adjusted to ensure that the old IPs are properly removed in all cases, even if the node crashes.

Fixes #16886
Fixes #16691
Fixes #17199

Closes scylladb/scylladb#17162

* github.com:scylladb/scylladb:
  test_change_ip: improve the test
  raft_ip_address_updater: remove stale IPs from gossiper
  raft_address_map: add my ip with the new generation
  system_keyspace::update_peer_info: check ep and host_id are not empty
  system_keyspace::update_peer_info: make host_id an explicit parameter
  system_keyspace::update_peer_info: remove any_set flag optimisation
  system_keyspace: remove duplicate ips for host_id
  system_keyspace: peers table: use coroutines
  storage_service::raft_ip_address_updater: log gossiper event name
  raft topology: ip change: purge old IP
  on_endpoint_change: coroutinize the lambda around sync_raft_topology_nodes
2024-02-15 17:40:29 +01:00
Avi Kivity
5df5714331 Merge 'api: storage_service/natural_endpoints: add tablets support' from Botond Dénes
This API endpoint currently returns with status 500 if attempted to be called for a table which uses tablets. This series adds tablet support. No change in usage semantics is required, the endpoint already has a table parameter.
This endpoint is the backend of `nodetool getendpoints` which should now work, after this PR.

Fixes: #17313

Closes scylladb/scylladb#17316

* github.com:scylladb/scylladb:
  service/storage_service: get_natural_endpoints(): add tablets support
  replica/database: keyspace: add uses_tablets()
  service/storage_service: remove token overload of get_natural_endpoints()
2024-02-15 13:36:56 +02:00
Botond Dénes
811e931b09 Merge 'tools/scylla-nodetool: implement compactionstats and viewbuildstatus' from Kefu Chai
Refs #15588

Closes scylladb/scylladb#17344

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: implement viewbuildstatus
  tools/scylla-nodetool: implement compactionstats
2024-02-15 12:44:05 +02:00
Petr Gusev
c4140678ba test_change_ip: improve the test
In this commit we refactor test_change_ip to improve
it in several ways:
  * We inject failure before old IP is removed and verify
    that after restart the node sees the proper peers - the
    new IP for node2 and old IP for node3, which is not restarted
    yet.
  * We introduce the lambda wait_proper_ips, which checks not only the
    system.peers table, but also gossiper and token_metadata.
  * We call this lambda for all nodes, not only the first node;
    this allows to validate that the node that has changed its
    IP has the proper IP of itself in the data structures above.

Note that we need to inject an additional delay ip-change-raft-sync-delay
before old IP is removed. Otherwise the problem stop reproducing - other
nodes remove the old IP before it's send back to the just restarted node.
2024-02-15 13:26:02 +04:00
Petr Gusev
4b33ba2894 raft_address_map: add my ip with the new generation
The following scenario is possible: a node A changes its IP
from ip1 to ip2 with restart, other nodes are not yet aware of ip2
so they keep gossiping ip1, after restart A receives
ip1 in a gossip message and calls handle_major_state_change
since it considers it as a new node. Then on_join event is
called on the gossiper notification handles, we receive
such event in raft_ip_address_updater and reverts the IP
of the node A back to ip1.

The essence of the problem is that we don't pass the proper
generation when we add ip2 as a local IP during initialization
when node A restarts, so the zero generation is used
in raft_address_map::add_or_update_entry and the gossiper
message owerwrites ip2 to ip1.

In this commit we fix this problem by passing the new generation.
To do that we move the increment_and_get_generation call
from join_token_ring to scylla_main, so that we have a new generation
value before init_address_map is called.

Also we remove the load_initial_raft_address_map function from
raft_group0 since it's redundant. The comment above its call site
says that it's needed to not miss gossiper updates, but
the function storage_service::init_address_map where raft_address_map
is now initialized is called before gossiper is started. This
function does both - it load the previously persisted host_id<->IP
mappings from system.local and subscribes to gossiper notifications,
so there is no room for races.

Note that this problem reproduces less likely with the
'raft topology: ip change: purge old IP' commit - other
nodes remove the old IP before it's send back to the
just restarted node. This is also the reason why this
problem doesn't occur in gossiper mode.

fixes scylladb/scylladb#17199
2024-02-15 13:21:04 +04:00
Kefu Chai
f9d19a61ff tools/scylla-nodetool: implement viewbuildstatus
Refs 15588

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-15 16:54:16 +08:00
Nadav Har'El
28db187756 alternator, tablets: return error if enabling TTL with tablets
Alternator TTL doesn't yet work on tables using tablets (this is
issue #16567). Before this patch, it can be enabled on a table with
tablets, and the result is a lot of log spam and nothing will get expired.

So let's make the attempt to enable TTL on a table that uses tablets
into a clear error. The error message points to the issue, and also
suggests how to create a table that uses vnodes, not tablets.

This patch also adds a test that verifies that trying to enable TTL
with tablets is an error. Obviously, this test should be removed
once the issue is solved and TTL begins working with tablets.

Refs #16567
Refs #16807

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17306
2024-02-15 10:47:06 +02:00
Nadav Har'El
b97ded5c4a test/topology: tests for setting tombstone_gc on materialized view
A user asked on the ScyllaDB forum several questions on whether
tombstone_gc works on materialized views. This patch includes two
tests that confirm the following:

1. The tombstone_gc may be set on a view - either during its creation
   with CREATE MATERIALIZED VIEW or later with ALTER MATERIALIZED VIEW.

2. The tombstone_gc setting is correctly shown - for both base tables
   and views - by the "DESC" statement.

3. The tombstone_gc setting is NOT inherited from a base table to a new
   view - if you want this option on a view, you need to set it
   separately.

Unfortunately, this test could not be a single-node cql-pytest because
we forbid tombstone_gc=repair when RF=1, and since recently, we forbid
setting RF>1 on a single-node setup. So the new tests are written in
the test/topology framework - which may run multiple tests against
a single three-node cluster run multiple tests against it.

To write tests over a shared cluster, we need functions which create
temporary keyspaces, tables and views, which are deleted automatically
as soon as a test ends. The test/topology framework was lacking such
functions, so this tests includes them - currently inside the test
file, but if other people find them useful they can be moved to a more
central location.

The new functions, net_test_keyspace(), new_test_table() and
new_materialized_view() are inspired by the identically-named
functions in test/cql-pytest/util.py, but the implementation is
different: Importantly, the new functions here are *async*
context managers, used via "async with", to fit with the rest
of the asynchronous code used in the topology test framework.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17345
2024-02-15 09:43:30 +02:00
Botond Dénes
ca13ff10ea service/storage_service: get_natural_endpoints(): add tablets support
Also add a unit test for this API endpoint, testing it with both tablets
and vnodes.
2024-02-15 02:07:18 -05:00
Kefu Chai
68795eb8fa tools/scylla-nodetool: implement gossipinfo
Refs #15588

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17317
2024-02-15 08:41:39 +02:00
Kefu Chai
a7abaa457b tools/scylla-nodetool: implement compactionstats
Refs #15588

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-15 12:29:10 +08:00
Kamil Braun
7e9e10186f Merge 'change the way ignored nodes are handled by the topology coordinator' from Gleb
This series makes several changes to how ignored nodes list is treated
by the topology coordinator. First the series makes it global and not
part of a single topology operation, second it extends the list at the
time of removenode/replace invocation and third it bans all nodes in
the list from contacting the cluster ever again.

The main motivation is to have a way to unblock tablet migration in case
of a node failure. Tablet migration knows how to avoid nodes in ignored
nodes list and this patch series provides a way to extend it without
performing any topology operation (which is not possible while tables
migration runs).

Fixes scylladb/scylladb#16108

* 'gleb/ignore-nodes-handling-v2' of github.com:scylladb/scylla-dev:
  test: add test for the new ignore nodes behaviour
  topology coordinator: cleanup node_state::decommissioning state handling code
  topology coordinator: ban ignored nodes just like we ban nodes that are left
  storage_service: topology coordinator: validate ignore dead nodes parameters in removenode/replace
  topology coordinator: add removed/replaced nodes to ignored_nodes list at the request invocation time
  topology coordinator: make ignored_nodes list global and permanent
  topology_coordinator: do not cancel rebuild just because some other nodes are dead
  topology coordinator: throw more specific error from wait_for_ip() function in case of a timeout
  raft_group0: add make_nonvoters function that can make multiple node non voters simultaneously
2024-02-14 16:36:01 +01:00
Kefu Chai
d43c418f72 tools/scylla-nodetool: implement getendpoints
Refs #15588

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17332
2024-02-14 11:20:52 +02:00
Gleb Natapov
7802c206c7 test: add test for the new ignore nodes behaviour
The test checks that once a node is specified in ignored node list by
one topology operation the information is carried over to the next
operation as well.
2024-02-14 10:35:11 +02:00
Gleb Natapov
363af9e664 topology coordinator: ban ignored nodes just like we ban nodes that are left
Since now a node that is at one point was marked as dead, either via
--ignore-dead-nodes parameter or by been a target for removenode or
replace, can no longer be made "undead" we need to make sure that they
cannot rejoin the cluster any longer. Do that by banning them on a
messaging layer just like we do for nodes that are left.

Not that the removenode failure test had to be altered since it restarted
a node after removenode failure (which now will not work). Also, since
the check for liveness was removed from the topology coordinator (because
the node is already banned by then), the test case that triggers the
removed code is removed as well.
2024-02-14 10:35:06 +02:00
Nadav Har'El
5d4c60aee3 test/cql-pytest: avoid spurious guardrail warnings
All cql-pytest tests use one node, and unsuprisingly most use RF=1.
By default, as part of the "guardrails" feature, we print a warning
when creating a keyspace with RF=1. This warning gets printed on
every cql-pytest run, which creates a "boy who cried wolf" effect
whereby developers get used to seeing these warnings, and won't care
if new warnings start appearing.

The fix is easy - in run.py start Scylla with minimum-replication-factor-
warn-threshold set to -1 instead of the default 3.

Note that we do have cql-pytest tests for this guardrail, but those don't
rely on the default setting of this variable (they can't, cql-pytest
tests can also be run on a Scylla instance run manually by a developer).
Those tests temporarily set the threshold during the test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17274
2024-02-13 17:44:20 +02:00
Botond Dénes
120442231f Merge 'row_cache: test cache consistency during multi-partition cache updates' from Michał Chojnowski
Adds a test reproducing https://github.com/scylladb/scylladb/issues/16759, and the instrumentation needed for it.

Closes scylladb/scylladb#17208

* github.com:scylladb/scylladb:
  row_cache_test: test cache consistency during memtable-to-cache merge
  row_cache: use preemption_source in update()
  utils: preempt: add preemption_source
2024-02-13 17:37:06 +02:00
Nadav Har'El
dce47a81b0 alternator, tablets: return error if enabling Streams with tablets
Alternator Streams doesn't yet work on tables using tablets (this is
issue #16317). Before this patch, an attempt to enable it results in
an unsightly InternalServerError, which isn't terrible - but we can
do better.

So in this patch, we make the attempt to enable Streams and tablets
together into a clear error. The error message points to the open issue,
and also suggests how to create a table that uses vnodes, not tablets.

Unfortunately, there are slightly two different code paths and error
messages for two cases: One case is the creation of a new table (where
the validation happens before the keyspace is actually created), and
the other case is an attempt to enable streams on an existing table
with an existing keyspace (which already might or might not be using
tablets).

This patch also adds a test that verifies that trying to enable Streams
with tablets is an error - in both cases (table creation and update).
Obviously, this test - and the validation code - should be removed once
the issue is solved and Alternator Streams begins working with tablets.

Fixes #16497
Refs #16807

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17311
2024-02-13 16:42:35 +02:00
Piotr Dulikowski
314fd9a11f test: test_topology_recovery_basic: add missing driver reconnect
Unfortunately, scylladb/python-driver#230 is not fixed yet, so it is
necessary for the sake of our CI's stability to re-create the driver
session after all nodes in the cluster are restarted.

There is one place in test_topology_recovery_basic where all nodes are
restarted but the driver session is not re-created. Even though nodes
are not restarted at once but rather sequentially, we observed a failure
with similar symptoms in a CI run for scylla-enterprise.

Add the missing driver reconnect as a workaround for the issue.

Fixes: scylladb/scylladb#17277

Closes scylladb/scylladb#17278
2024-02-13 12:28:30 +01:00
Botond Dénes
3f2d7e8b25 tree: remove unnecessary yields around for_each_tablet()
Commit 904bafd069 consolidated the two
existing for_each_tablet() overloads, to the one which has a future<>
returning callback. It also added yields to the bodies of said
callbacks. This is unnecessary, the loop in for_each_tablet() already
has a yield per tablet, which should be enough to prevent stalls.

This patch is a follow-up to #17118

Closes scylladb/scylladb#17284
2024-02-12 17:10:25 +01:00
Kefu Chai
7baee379de sstable/storage: pass fs::path to storage::create_links()
this change is a follow-up of 637dd730. the goal is to use
std::filesystem::path for manipulating paths, and to avoid the
converting between sstring and fs::path back and forth.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17257
2024-02-12 13:26:11 +02:00
Pavel Emelyanov
b9721bd397 test/tablets: Decommissioning node below RF is not allowed
When a node is decommissioned, all tablet replicas need to be moved away
from it. In some cases it may not be possible. If the number of node in
the cluster equals the keysapce RF, one cannot decommission any node
because it's not possible to find nodes for every replica.

The new test case validates this constraint is satisfied.

refs: #16195

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17248
2024-02-12 13:21:47 +02:00
Nadav Har'El
21e7deafeb alternator, mv: fix case of two new key columns in GSI
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.

We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!

We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.

The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.

Fixes #17119

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17172
2024-02-12 13:17:29 +02:00
Nadav Har'El
341af86167 test/cql-pytest: reproducer for GROUP BY regression
This patch adds a simple reproducer for a regression in Scylla 5.4 caused
by commit 432cb02, breaking LIMIT support in GROUP BY.

Refs #17237

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17275
2024-02-12 13:09:52 +02:00
Kamil Braun
7d73c40125 Merge 'test.py: tablets: Fix flakiness of test_tablet_missing_data_repair' from Tomasz Grabiec
Reimplements stop/start sequence using rolling_restart() which is safe
with regards to UP status propagation and not prone to sudden
connection drop which may cause later CQL queries to time out. It also
ensures that CQL is up on all the remaining nodes when the with_down
callback is executed.

The test was observed to fail in CI like this:

```
  cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.157.135.26:9042 datacenter1>: ConnectionException('Pool for 127.157.135.26:9042 is shutdown')})
  ...
      @pytest.mark.repair
      @pytest.mark.asyncio
      async def test_tablet_missing_data_repair(manager: ManagerClient):
  ...
          for idx in range(0,3):
              s = servers[idx].server_id
              await manager.server_stop_gracefully(s, timeout=120)
  >           await check()
```

Hopefully: Fixes #17107

Closes scylladb/scylladb#17252

* github.com:scylladb/scylladb:
  test: py: tablets: Fix flakiness of test_tablet_missing_data_repair
  test: pylib: manager_client: Wait for driver to catch up in rolling_restart()
  test: pylib: manager_client: Accept callback in rolling_restart() to execute with node down
2024-02-12 11:52:09 +01:00
Botond Dénes
f068d1a6fa query: do not kill unpaged queries when they reach the tombstone-limit
The reason we introduced the tombstone-limit
(query_tombstone_page_limit), was to allow paged queries to return
incomplete/empty pages in the face of large tombstone spans. This works
by cutting the page after the tombstone-limit amount of tombstones were
processed. If the read is unpaged, it is killed instead. This was a
mistake. First, it doesn't really make sense, the reason we introduced
the tombstone limit, was to allow paged queries to process large
tombstone-spans without timing out. It does not help unpaged queries.
Furthermore, the tombstone-limit can kill internal queries done on
behalf of user queries, because all our internal queries are unpaged.
This can cause denial of service.

So in this patch we disable the tombstone-limit for unpaged queries
altogether, they are allowed to continue even after having processed the
configured limit of tombstones.

Fixes: #17241

Closes scylladb/scylladb#17242
2024-02-12 12:34:04 +02:00
Patryk Wrobel
9fccd968d3 test_tablets.py: implement test_tablet_count_metric_per_shard
This change introduces a new test that verifies the
functionality related to tablet_count metric.

It checks if tablet_count metric is correctly reported
and updated when new tables are created, when tables
are dropped and when `move_tablet` is executed.

Refs: scylladb#16131
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>

Closes scylladb/scylladb#17165
2024-02-12 11:49:38 +02:00
Kefu Chai
54995fcac0 test/manual: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17255
2024-02-12 11:49:38 +02:00
Patryk Jędrzejczak
e64162e8f6 test: add test_writes_to_previous_cdc_generations
In one of the previous patches, we allowed writing to the previous
CDC generations for `generation_leeway`. Now, we add tests for this
change.
2024-02-12 10:14:00 +01:00
Nadav Har'El
13e16475fa cql-pytest: fix skipping of tests on Cassandra or old Scylla
Recently we added a trick to allow running cql-pytests either with or
without tablets. A single fixture test_keyspace uses two separate
fixtures test_keyspace_tablets or test_keyspace_vnodes, as requested.

The problem is that even if test_keyspace doesn't use its
test_keyspace_tablets fixture (it doesn't, if the test isn't
parameterized to ask for tablets explicitly), it's still a fixture,
and it causes the test to be skipped. This causes every test to be
skipped when running on Cassandra or old Scylla which doesn't support
tablets.

The fix is simple - the internal fixture test_keyspace_tablets should
yield None instead of skipping. It is the caller, test_keyspace, which
now skips the test if tablets are requested but test_keyspace_tablets
is None.

Fixes #17266

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17267
2024-02-11 21:03:25 +02:00
Kefu Chai
f990ea9678 tools/scylla-nodetool: implement describecluster
Refs #15588
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17240
2024-02-11 20:21:07 +02:00
Benny Halevy
2ed29e31db gms: inet_address: make constructors explicit
In particular, `inet_address(const sstring& addr)` is
dangerous, since a function like
`topology::get_datacenter(inet_address ep)`
might accidentally convert a `sstring` argument
into an `inet_address` (which would most likely
throw an obscure std::invalid_argument if the datacenter
name does not look like an inet_address).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#17260
2024-02-11 15:44:13 +02:00
Tomasz Grabiec
1eedc85990 test: py: tablets: Fix flakiness of test_tablet_missing_data_repair
Reimplement stop/start sequence using rolling_restart() which is safe
with regards to UP status propagation and not prone to sudden
connection drop which may cause later CQL queries to time out. It also
ensures that CQL is up on all the remaining nodes when the with_down
callback is executed.

Hopefully: Fixes #17107
2024-02-09 20:37:06 +01:00
Tomasz Grabiec
27ed2d94fc test: pylib: manager_client: Wait for driver to catch up in rolling_restart()
For sanity of the developers who want to execute CQL queries after
rolling restarts.
2024-02-09 20:35:41 +01:00
Tomasz Grabiec
3ce4ec796a test: pylib: manager_client: Accept callback in rolling_restart() to execute with node down 2024-02-09 20:35:41 +01:00
Petr Gusev
4554653ad9 storage_proxy: add a test for stop_remote
This patch adds a reproducer test for an issue #16382.
See scylladb/seastar#2044 for details of the problem.

The test is enabled only in dev mode since it requires
error injection mechanism. The patch adds a new injection
into storage_proxy::handle_read to simulate the problem
scenario - the node is shutting down and there are some
unfinished pending replica requests.

Closes scylladb/scylladb#16776
2024-02-09 17:23:13 +01:00
Raphael S. Carvalho
daa82f406c test_tablets: Enable table debug log in split test
If the test fails, it's helpful to see how split completion was
handled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#17236
2024-02-09 14:38:24 +02:00
Kamil Braun
e9e24f47ec Merge 'raft topology: implement upgrade and recovery procedure' from Piotr Dulikowski
This PR implements a procedure that upgrades existing clusters to use
raft-based topology operations. The procedure does not start
automatically, it must be triggered manually by the administrator after
making sure that no topology operations are currently running.

Upgrade is triggered by sending `POST
/storage_service/raft_topology/upgrade` request. This causes the
topology coordinator to start who drives the rest of the process: it
builds the `system.topology` state based on information observed in
gossip and tells all nodes to switch to raft mode. Then, topology
coordinator runs normally.

Upgrade progress is tracked in a new static column `upgrade_state` in
`system.topology`.

The procedure also serves as an extension to the current recovery
procedure on raft. The current recovery procedure requires restarting
nodes in a special mode which disables raft, perform `nodetool
removenode` on the dead nodes, clean up some state on the nodes and
restart them so that they automatically rebuild the group 0. Raft
topology fits into existing procedure by falling back to legacy topology
operations after disabling raft. After rebuilding the group 0, upgrade
needs to be triggered again.

Because upgrade is manual and it might not be convenient for
administrators to run it right after upgrading the cluster, we allow the
cluster to operate in legacy topology operations mode until upgrade,
which includes allowing new nodes to join. In order to allow it, nodes
now ask the cluster about the mode they should use to join before
proceeding by using a new `JOIN_NODE_QUERY` RPC.

The procedure is explained in more detail in `topology-over-raft.md`.

Fixes: https://github.com/scylladb/scylladb/issues/15008

Closes scylladb/scylladb#17077

* github.com:scylladb/scylladb:
  test/topology_custom: upgrade/recovery tests for topology on raft
  cdc/generation_service: in legacy mode, fall back to raft tables
  system_keyspace: add read_cdc_generation_opt
  cdc/generation_service: turn off gossip notifications in raft topo mode
  cql_test_env: move raft_topology_change_enabled var earlier
  group0_state_machine: pull snapshot after raft topology feature enabled
  storage_service: disable persistent feature enabler on upgrade
  storage_service: replicate raft features to system.peers
  storage_service: gossip tokens and cdc generation in raft topology mode
  API: add api for triggering and monitoring topology-on-raft upgrade
  storage_service: infer which topology operations to use on startup
  storage_service: set the topology kind value based on group 0 state
  raft_group0: expose link to the upgrade doc in the header
  feature_service: fall back to checking legacy features on startup
  storage_service: add fiber for tracking the topology upgrade progress
  gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES
  topology_coordinator: implement core upgrade logic
  topology_coordinator: extract top-level error handling logic
  storage_service: initialize discovery leader's state earlier
  topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data
  topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data
  topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data
  topology_state_machine: introduce upgrade_state
  storage_service: disallow topology ops when upgrade is in progress
  raft_group0_client: add in_recovery method
  storage_service: introduce join_node_query verb
  raft_group0: make discover_group0 public
  raft_group0: filter current node's IP in discover_group0
  raft_group0: remove my_id arg from discover_group0
  storage_service: make _raft_topology_change_enabled more advanced
  docs: document raft topology upgrade and recovery
2024-02-09 11:54:53 +01:00