Commit Graph

168 Commits

Author SHA1 Message Date
Avi Kivity
cdae15ced9 Merge '[Backport 6.0] db/view: drop view updates to replaced node marked as left' from ScyllaDB
When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address.

This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0.

As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas.
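
The workaround described above can be sketched roughly as follows. This is only an illustration, not the actual patch: `ipv4`, `view_update`, `is_unspecified` and `drop_updates_to_left_nodes` are hypothetical names, and an address is modelled as a plain array of four bytes.

```c++
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an IPv4 address. All zeros models the empty
// address returned for a node that has left.
using ipv4 = std::array<std::uint8_t, 4>;

// Hypothetical view update: a payload plus the paired view replica's address.
struct view_update {
    std::string mutation;
    ipv4 target;
};

// True when every byte of the address is zero (0.0.0.0).
bool is_unspecified(const ipv4& addr) {
    return std::all_of(addr.begin(), addr.end(),
                       [](std::uint8_t b) { return b == 0; });
}

// Removes updates whose target is 0.0.0.0 and returns how many were dropped.
std::size_t drop_updates_to_left_nodes(std::vector<view_update>& updates) {
    const auto before = updates.size();
    updates.erase(std::remove_if(updates.begin(), updates.end(),
                      [](const view_update& u) { return is_unspecified(u.target); }),
                  updates.end());
    return before - updates.size();
}
```

The dropped updates are simply lost in this sketch, which matches the trade-off stated above: avoiding the crash takes priority over view consistency.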

In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue.

Fixes: scylladb/scylladb#19439

This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there.

(cherry picked from commit 6af7882c59)

(cherry picked from commit 5ec8c06561)

 Refs #19765

Closes scylladb/scylladb#19896

* github.com:scylladb/scylladb:
  test: regression test for MV crash with tablets during decommission
  db/view: drop view updates to replaced node marked as left
2024-08-14 22:32:07 +03:00
Piotr Smaron
b10bf17df7 tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist
Using the error injection framework, we inject a sleep into the
processing path of ALTER tablets KS, so that the topology coordinator of
the leader node sleeps after the rf_change event has been scheduled, but
before it starts executing. During that time the second node executes a
DROP KS statement, which is propagated to the leader node. Once the leader
node wakes up and resumes processing of ALTER tablets KS, the KS no longer
exists; the node must not crash, which was the case before.

(cherry picked from commit ddb5204929)
2024-08-14 10:37:59 +00:00
Piotr Dulikowski
dec02b38ae test: regression test for MV crash with tablets during decommission
Regression test for scylladb/scylladb#19439.

Co-authored-by: Kamil Braun <kbraun@scylladb.com>
(cherry picked from commit 5ec8c06561)
2024-07-26 14:02:51 +00:00
Emil Maskovsky
62c9709f4a test: raft: fix the flaky test_raft_recovery_stuck
Use the rolling restart to avoid spurious driver reconnects.

This can eventually be reverted once scylladb/python-driver#295 is
fixed.

Fixes scylladb/scylladb#19154

(cherry picked from commit a89facbc74)
2024-07-20 02:17:50 +00:00
Emil Maskovsky
64d414f10a test: raft: code cleanup in test_raft_recovery_stuck
Cleaning up the imports.

(cherry picked from commit ef3393bd36)
2024-07-20 02:17:50 +00:00
Gleb Natapov
c437c8be36 test: add test to check that coordinator lwt semaphore continues functioning after locking failures
(cherry picked from commit 4178589826)
2024-07-18 15:34:17 +00:00
Michael Litvak
815a707b0a storage_proxy: remove response handler if no targets
When writing a mutation, it might happen that there are no live targets
to send the mutation to, yet the request can be satisfied. For example,
when writing with CL=ANY to a dead node, the request is completed by
storing a local hint.

Currently, in that case, a write response handler is created for the
request and it remains active until it times out because it is not
removed anywhere, even though the write is completed successfully after
storing the hint. The response handler is usually removed when
responses have been received from all targets, but in this case there
are no targets to trigger the removal.

In this commit we check if we don't have live targets to send the
mutation to. If so, we remove the response handler immediately.
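
A minimal sketch of the idea, under the assumption that handlers live in a registry keyed by an id. The names `write_response_handler`, `handler_registry` and `send_to_live_targets` are hypothetical and do not mirror the real storage proxy interfaces.

```c++
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical handler: only tracks which live targets it is waiting for.
struct write_response_handler {
    std::vector<int> live_targets;
};

// Hypothetical registry of active write response handlers.
class handler_registry {
    std::unordered_map<std::uint64_t, write_response_handler> _handlers;
    std::uint64_t _next_id = 0;

public:
    std::uint64_t register_handler(write_response_handler handler) {
        const auto id = _next_id++;
        _handlers.emplace(id, std::move(handler));
        return id;
    }

    // Models the send step. With no live targets nobody will ever respond,
    // so the handler is removed right away instead of waiting for a timeout.
    void send_to_live_targets(std::uint64_t id) {
        const auto it = _handlers.find(id);
        if (it == _handlers.end()) {
            return;
        }
        if (it->second.live_targets.empty()) {
            _handlers.erase(it);
        }
        // Otherwise the handler stays registered until all targets respond.
    }

    std::size_t active_handlers() const { return _handlers.size(); }
};
```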

Fixes scylladb/scylladb#19529

(cherry picked from commit a9fdd0a93a)

Closes scylladb/scylladb#19680
2024-07-15 08:24:18 +02:00
Michael Litvak
ad6eb1cadf view: drain view builder before database
The view builder performs write operations on the database.
In order for the view builder to shut down gracefully without errors, we
need to ensure the database can still handle writes while the view builder is drained.
The commit changes the drain order, so that view builder is drained
before the database shuts down.

Fixes scylladb/scylladb#18929

(cherry picked from commit 9d9318c564)

Closes scylladb/scylladb#19636
2024-07-08 19:16:26 +02:00
Gleb Natapov
724ec62e22 test: add test that checks that local address cannot expire between join request placement and its processing
(cherry picked from commit 3f136cf2eb)
2024-07-01 10:44:31 +00:00
Botond Dénes
b18d9e5d0d Merge '[Backport 6.0] make enable_compacting_data_for_streaming_and_repair truly live-update' from ScyllaDB
This config item is propagated to the table object via `table::config`. Although the field in `table::config` used to propagate the value was a `utils::updateable_value<T>`, it was assigned a constant, so the live-update chain was broken.
This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure around the failure injection API, namely the ability to read out the value of an internal variable. This infrastructure is also added in this series.
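
The difference between a broken and a working live-update chain can be illustrated with a toy model. The `live_value` class below is hypothetical and much simpler than `utils::updateable_value<T>`; it only shows why a value built from a constant never observes later changes.

```c++
#include <memory>
#include <utility>

// Hypothetical, simplified observable value. It reads through a shared cell,
// so it sees updates only if the cell is shared with the config source.
template <typename T>
class live_value {
    std::shared_ptr<T> _cell;

    explicit live_value(std::shared_ptr<T> cell) : _cell(std::move(cell)) {}

public:
    // Built from a constant: owns a private snapshot, so the chain is broken.
    static live_value detached(T constant) {
        return live_value(std::make_shared<T>(std::move(constant)));
    }

    // Built from the source cell: later writes to the source are visible.
    static live_value attached(std::shared_ptr<T> source) {
        return live_value(std::move(source));
    }

    const T& get() const { return *_cell; }
};
```

In this model, a `detached` value keeps returning the value it was created with after the source changes, while an `attached` one follows the source.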

Fixes: https://github.com/scylladb/scylladb/issues/18674

- [x] This patch has to be backported because it fixes broken functionality

(cherry picked from commit dbccb61636)

(cherry picked from commit 4590026b38)

(cherry picked from commit feea609e37)

(cherry picked from commit 0c61b1822c)

(cherry picked from commit 8ef4fbdb87)

 Refs #18705

Closes scylladb/scylladb#19240

* github.com:scylladb/scylladb:
  test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
  test/pylib: rest_client: add get_injection()
  api/error_injection: add getter for error_injection
  utils/error_injection: add set_parameter()
  replica/database: fix live-update enable_compacting_data_for_streaming_and_repair
2024-06-13 12:45:23 +03:00
Pavel Emelyanov
2306c3b522 test: Reduce failure detector timeout for failed tablets migration test
This test spends most of its time waiting for a node to die. Reducing the failure detector timeout makes it almost 3x faster.

Was
  real	9m21,950s
  user	1m11,439s
  sys	1m26,022s

Now
  real	3m37,780s
  user	0m58,439s
  sys	1m13,698s

refs: #17764

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit a4e8f9340a)

Closes scylladb/scylladb#19233
2024-06-12 10:02:45 +03:00
Botond Dénes
0d13c51dd4 test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
Prevent the live-update feature of this config item from breaking
silently.

(cherry picked from commit 8ef4fbdb87)
2024-06-11 17:32:37 +00:00
Raphael S. Carvalho
d4c3a43b34 replica: Refresh mutation source when allocating tablet replicas
Consider the following:

1) table A has N tablets and views
2) migration starts for a tablet of A from node 1 to 2.
3) migration is at write_both_read_old stage
4) coordinator will push writes to both nodes (pending and leaving)
5) A has view, so writes to it will also result in reads (table::push_view_replica_updates())
6) the tablet's update_effective_replication_map() does not refresh the tablet sstable set (for the new tablet migrating in)
7) so the read in step 5 cannot find the sstable set for the tablet migrating in

Causes the following error:
"tablets - SSTable set wasn't found for tablet 21 of table mview.users"

which means loss of write on pending replica.

The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot.
It's not a problem to refresh the cache snapshot as long as the logical
state of the data hasn't changed, which is true when allocating new
tablet replicas. That's also done in the context of compactions for example.

Fixes #19052.
Fixes #19033.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 7b41630299)

Closes scylladb/scylladb#19229
2024-06-11 18:12:43 +03:00
Gleb Natapov
c53cd98a41 test: add test of bootstrap where the coordinator crashes just before storing IP mapping
On the next boot there is no host ID to IP mapping which causes node to
crash again with "No mapping for :: in the passed effective replication map"
assertion.

(cherry picked from commit 27445f5291)
2024-06-05 13:55:28 +00:00
Pavel Emelyanov
62a23fd86a config: Remove experimental TABLETS feature
... and replace it with boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.

The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 83d491af02)

Closes scylladb/scylladb#19012
2024-06-03 12:16:41 +03:00
Wojciech Mitros
3c47ab9851 mv: handle different ERMs for base and view table
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.

Until now, we assumed that the ERMs were synchronized when
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.

This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.

Fixes: #17786
Fixes: #18709

(cherry-picked from 519317dc58)

This commit also includes the follow-up patch that removes the
flakiness from the test that is introduced by the commit above.
The flakiness was caused by enabling the
delay_before_get_view_natural_endpoint injection on a node
and not disabling it before the node is shut down. The patch
removes the enabling of the injection on the node in the first
place.
By squashing the commits, we won't introduce a place in the
commit history where a potential bisect could mistakenly fail.

Fixes: https://github.com/scylladb/scylladb/issues/18941

(cherry-picked from 0de3a5f3ff)

Closes scylladb/scylladb#18974
2024-05-30 09:13:31 +02:00
Pavel Emelyanov
57d267a97e test: Do not check tablets mutations on nodes that don't have them
The check is performed by selecting from mutation_fragments(table), but
it's known that this query crashes Scylla when there's no tablet replica
on that node.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-30 08:33:26 +03:00
Pavel Emelyanov
5b8523273b test: Fix the way tablets RF-change test parses mutation_fragments
When the test changes RF from 2 to 3, the extra node executes "rebuild"
transition which means that it streams tablets replicas from two other
peers. When doing it, the node receives two sets of sstables with
mutations from the given tablet. The test part that checks if the extra
node received the mutations notices two mutation fragments on the new
replica and erroneously fails by seeing that RF=3 is not equal to the
number of mutations found, which is 4.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-30 08:33:26 +03:00
Pavel Emelyanov
6497ed68ed test/tablets: Unmark RF-changing test with xfail
Now the scaling works and the test must check that it does

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-05-30 08:33:26 +03:00
Pavel Emelyanov
da816bf50c test/tablets: Check that after RF change data is replicated properly
There's a test that checks system.tablets contents to see that after
changing ks replication factor via ALTER KEYSPACE the tablet map is
updated properly. This patch extends this test to also validate that
mutations themselves are replicated according to the desired replication
factor.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18644
2024-05-30 08:31:48 +03:00
Piotr Dulikowski
68eca3778c Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros
This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed.

See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions.

This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh.

The existing mechanism works in the following way:

* Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes
* Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking
* We keep track of the percent of consumed units on each node, this is called `view update backlog`.
* Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level.

This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates.

To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`.

The new algorithm of view update generation looks something like this:
```c++
for(;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379

Closes scylladb/scylladb#16819

* github.com:scylladb/scylladb:
  test: add test for bad_allocs during large mv queries
  mv: throttle view update generation for large queries
  exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
  db/view: extract view throttling delay calculation to a global function
  view_update_generator: add get_storage_proxy()
  storage_proxy: make view backlog getters public
2024-05-16 08:22:54 +02:00
Wojciech Mitros
485eb7a64c test: add test for bad_allocs during large mv queries
This patch adds a test for reproducing issue #12379, which is
being fixed in #16819.
The test case works by creating a table with a materialized
view, and then performing a partition delete query on it.
At the same time, it uses injections to limit the memory
to a level lower than usual, in order to increase the
consistency of the test, and to limit its runtime.
Before #16819, the test would exceed the limit and fail,
and now the next allocation is throttled using a sleep.
2024-05-13 18:16:39 +02:00
Aleksandra Martyniuk
51fdda4199 test: add test for back and forth tablets migration
2024-05-10 15:08:56 +02:00
Avi Kivity
37d32a5f8b Merge 'Cleanup inactive reads on tablet migration' from Botond Dénes
When a tablet is migrated away, any inactive read which might be reading from said tablet, has to be dropped. Otherwise these inactive reads can prevent sstables from being removed and these sstables can potentially survive until the tablet is migrated back and resurrect data.
This series introduces the fix as well as a reproducer test.
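
The cleanup step can be sketched as filtering inactive reads by table and token range. This is an assumption-laden illustration: `token_range`, `inactive_read` and `evict_inactive_reads_for_tablet` are hypothetical names, and ranges are modelled as inclusive integer intervals rather than real token ranges.

```c++
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical inclusive token range.
struct token_range {
    std::int64_t first;
    std::int64_t last;
};

// Two inclusive ranges overlap when neither ends before the other starts.
bool overlaps(const token_range& a, const token_range& b) {
    return a.first <= b.last && b.first <= a.last;
}

// Hypothetical inactive read: remembers its table and the range it reads.
struct inactive_read {
    std::string table;
    token_range range;
};

// Evicts every inactive read of `table` that touches the migrated tablet's
// range, and returns the number of evicted reads.
std::size_t evict_inactive_reads_for_tablet(std::vector<inactive_read>& reads,
                                            const std::string& table,
                                            const token_range& tablet_range) {
    const auto before = reads.size();
    reads.erase(std::remove_if(reads.begin(), reads.end(),
                    [&](const inactive_read& r) {
                        return r.table == table && overlaps(r.range, tablet_range);
                    }),
                reads.end());
    return before - reads.size();
}
```

Reads on other tables, or on ranges outside the migrated tablet, are left in place in this sketch.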

Fixes: https://github.com/scylladb/scylladb/issues/18110

Closes scylladb/scylladb#18179

* github.com:scylladb/scylladb:
  test: add test for cleaning up cached querier on tablet migration
  querier: allow injecting cache entry ttl by error injector
  replica/table: cleanup_tablet(): clear inactive reads for the tablet
  replica/database: introduce clear_inactive_reads_for_tablet()
  replica/database: introduce foreach_reader_concurrency_semaphore
  reader_concurrency_semaphore: add range param to evict_inactive_reads_for_table()
  reader_concurrency_semaphore: allow storing a range with the inactive reader
  reader_concurrency_semaphore: avoid detach() in inactive_read_handle::abandon()
2024-05-09 17:34:49 +03:00
Kamil Braun
4dcae66380 Merge 'test: {auth,topology}: use manager.rolling_restart' from Piotr Dulikowski
Instead of performing a rolling restart by calling `restart` in a loop over every node in the cluster, use the dedicated
`manager.rolling_restart` function. This method waits until all other nodes see the currently processed node as up or down before proceeding to the next step. Not doing so may lead to surprising behavior.

In particular, in scylladb/scylladb#18369, a test failed shortly after restarting three nodes. Because nodes were restarted one after another too fast, when the third node was restarted it didn't send a notification to the second node because it still didn't know that the second node was alive. This led the second node to notice that the third node restarted by observing that it incremented its generation in gossip (it restarted too fast to be marked as down by the failure detector). In turn, this caused the second node to send "third node down" and "third node up" notifications to the driver in a quick succession, causing it to drop and reestablish all connections to that node. However, this happened _after_ rolling upgrade finished and _after_ the test logic confirmed that all nodes were alive. When the notifications were sent to the driver, the test was executing some statements necessary for the test to pass - as they broke, the test failed.

Fixes: scylladb/scylladb#18369

Closes scylladb/scylladb#18379

* github.com:scylladb/scylladb:
  test: get rid of server-side server_restart
  test: util: get rid of the `restart` helper
  test: {auth,topology}: use manager.rolling_restart
2024-05-08 09:45:08 +02:00
Kamil Braun
03818c4aa9 direct_failure_detector: increase ping timeout and make it tunable
The direct failure detector design is simplistic. It sends pings
sequentially and times out listeners that reached the threshold (i.e.
didn't hear from a given endpoint for too long) in-between pings.

Given the sequential nature, the previous ping must finish so the next
ping can start. We time out pings that take too long. The timeout was
hardcoded and set to 300ms. This is too low for wide-area setups --
latencies across the Earth can indeed go up to 300ms. 3 subsequent timed
out pings to a given node were sufficient for the Raft listener to "mark
server as down" (the listener used a threshold of 1s).

Increase the ping timeout to 600ms which should be enough even for
pinging the opposite side of Earth, and make it tunable.

Increase the Raft listener threshold from 1s to 2s. Without the
increased threshold, one timed out ping would be enough to mark the
server as down. Increasing it to 2s requires 3 timed out pings which
makes it more robust in presence of transient network hiccups.

In the future we'll most likely want to decrease the Raft listener
threshold again, if we use Raft for data path -- so leader elections
start quickly after leader failures. (Faster than 2s). To do that we'll
have to improve the design of the direct failure detector.

Ref: scylladb/scylladb#16410
Fixes: scylladb/scylladb#16607

---

I tested the change manually using `tc qdisc ... netem delay`, setting
network delay on local setup to ~300ms with jitter. Without the change,
the result is as observed in scylladb/scylladb#16410: interleaving
```
raft_group_registry - marking Raft server ... as dead for Raft groups
raft_group_registry - marking Raft server ... as alive for Raft groups
```
happening once every few seconds. The "marking as dead" happens whenever
we get 3 subsequent failed pings, which happens with a certain (high)
probability depending on the latency jitter. Then as soon as we get a
successful ping, we mark the server back as alive.

With the change, the phenomenon no longer appears.

Closes scylladb/scylladb#18443
2024-05-07 23:40:23 +02:00
Piotr Dulikowski
8de2bda7ae test: util: get rid of the restart helper
We already have `ManagerClient.server_restart`, which can be used in its
place.
2024-05-06 12:24:40 +02:00
Patryk Jędrzejczak
f61c50baa4 test: test_raft_snapshot_request: improve the last assertion
The last assertion in the test is very sensitive to changes. The
constant has already been increased from 0 to 1 due to flakiness.
The old comment explains it.

In the following patch, we change the CDC generation publisher so
that it clears the obsolete CDC generations earlier. This change
would make this assertion flaky again. After restarting the servers,
the new topology coordinator could remove the first generation if it
became obsolete. This operation appends a new entry to the log. If
it happened after triggering snapshot, the assertion could fail
with `2 <= 1`.

We could increase the constant again to unflake the test, but we
better improve it once and for all. We change the assertion so
that it's not sensitive to changes in the code based on Raft. The
explanation is in the new comment.
2024-05-02 12:46:33 +02:00
Patryk Jędrzejczak
44791a849e test: test_raft_snapshot_request: find raft leader after restart
Finding the new Raft leader after restart simplifies the test
and makes it easier to reason about. There are two improvements:
- we only need to wait until the leader appends a command, so
  the read barrier becomes unnecessary,
- we only need to trigger snapshot on the leader.

We also use the knowledge about the leader in the following patch.
2024-05-02 12:46:33 +02:00
Patryk Jędrzejczak
41198998c5 test: test_raft_snapshot_request: simplify appended_command
We shorten the code and remove the unused `log_size` variable.
2024-05-02 12:46:31 +02:00
Botond Dénes
0ace90ad04 test: add test for cleaning up cached querier on tablet migration
Check that a cached querier, which exists prior to a migration, will be
cleaned up afterwards. This reproduces #18110.
The test fails before the fix for the above and passes afterwards.
2024-04-30 01:47:16 -04:00
Kamil Braun
d8313dda43 Merge 'db: config: move consistent-topology-changes out of experimental and make it the default for new clusters' from Patryk Jędrzejczak
We move consistent cluster management out of experimental and
make it the default for new clusters in 6.0. In code, we make the
`consistent-topology-changes` flag unused and assumed to be true.

In 6.0, the topology upgrade procedure will be manual and
voluntary, so some clusters will still be using the gossip-based
topology even though they support the raft-based topology.
Therefore, we need to continue testing the gossip-based topology.
This is possible by using the `force-gossip-topology-changes` flag
introduced in scylladb/scylladb#18284.

Ref scylladb/scylladb#17802

Closes scylladb/scylladb#18285

* github.com:scylladb/scylladb:
  docs: raft.rst: update after removing consistent-topology-changes
  treewide: fix indentation after the previous patch
  db: config: make consistent-topology-changes unused
  test: lib: single_node_cql_env: restart a node in noninitial run_in_thread calls
  test: test_read_required_hosts: run with force-gossip-topology-changes
  storage_service: join_cluster: replace force_gossip_based_join with force-gossip-topology-changes
  storage_service: join_token_ring: fix finish_setup_after_join calls
2024-04-26 14:45:29 +02:00
Patryk Jędrzejczak
3a100cd16c test: test_raft_recovery_stuck: ensure raft upgrade procedure failed
We have log browsing in test.py now, so we can fix this TODO easily.

Closes scylladb/scylladb#18425
2024-04-26 10:16:49 +02:00
Patryk Jędrzejczak
3a34bb18cd db: config: make consistent-topology-changes unused
We make the `consistent-topology-changes` experimental feature
unused and assumed to be true in 6.0. We remove code branches that
executed if `consistent-topology-changes` was disabled.
2024-04-25 14:33:21 +02:00
Patryk Jędrzejczak
213f2f6882 storage_service: join_cluster: replace force_gossip_based_join with force-gossip-topology-changes
The `force_gossip_based_join` error injection does exactly what we
expect from `force-gossip-topology-changes` so we can do a simple
replacement.

We prefer a flag over an error injection because we will use it
a lot in CI jobs' configurations, some tests, manual testing etc.
It's much more convenient.

Moreover, the flag can be used in the release mode, so we re-enable
all tests that were disabled in release mode only because of using
the `force_gossip_based_join` error injection.

The name of the `force-gossip-topology-changes` flag suggests that
using it should always successfully force the gossip-based topology
or, if forcing is not possible, the booting should fail. We don't
want a node with `force-gossip-topology-changes=true` that silently
boots in the raft-topology mode. We achieve it by throwing a
runtime error from `join_cluster` in two cases:
- the node is restarting in the cluster that is using raft topology
- the node is joining the cluster that is using raft topology
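
The two failure cases can be sketched as a small guard. The function name, the `boot_kind` enum and the error messages are hypothetical; the real check lives in `join_cluster` and may look quite different.

```c++
#include <stdexcept>

// Hypothetical description of how the node is entering the cluster.
enum class boot_kind { restart, join };

// Throws when force-gossip-topology-changes is set but cannot be honoured
// because the cluster already uses the raft-based topology.
void check_force_gossip_topology_changes(bool force_gossip_topology_changes,
                                         boot_kind kind,
                                         bool cluster_uses_raft_topology) {
    if (!force_gossip_topology_changes || !cluster_uses_raft_topology) {
        return;
    }
    if (kind == boot_kind::restart) {
        throw std::runtime_error(
            "force-gossip-topology-changes is set, but the node is restarting "
            "in a cluster that uses raft topology");
    }
    throw std::runtime_error(
        "force-gossip-topology-changes is set, but the node is joining "
        "a cluster that uses raft topology");
}
```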
2024-04-25 14:33:21 +02:00
Petr Gusev
bc98774f83 test_replace_reuse_ip: add data plane load
In this commit we enhance test_replace_reuse_ip
to reproduce #17421. We create a test table and run
insert queries on it while the first node is
being replaced. In this form the test fails
without the fix from the previous commit. Some
insert requests fail with [Unavailable exception]
"Cannot achieve consistency level for cl QUORUM...".
2024-04-24 16:59:24 +03:00
Botond Dénes
162c9ad6f6 Merge 'gossiper: lock local endpoint when updating heart_beat' from Kamil Braun
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in the first
  phase it copies the updated `local_state` to all shards into a
  separate map. In the second phase the values from the separate map are
  used to overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the first phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  than the ones we used for states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.
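
The race and the effect of the fix can be replayed in a small deterministic model. Everything here is a simplification made up for illustration: `endpoint_state` carries a single application state, `peer_view` stands in for node A's filtering by maximum seen version, and `status_seen_by_peer` replays the steps above in a fixed order instead of using real concurrency.

```c++
#include <algorithm>
#include <string>

// Hypothetical model of node B's gossiped state: one application state
// (the status) plus the heart beat, each carrying a version.
struct endpoint_state {
    int heart_beat_version;
    int status_version;
    std::string status;
};

// Monotonic counter standing in for the version generator.
struct version_generator {
    int current = 0;
    int next() { return ++current; }
};

// Node A's view of node B: it remembers the highest version seen so far and
// only applies application states newer than that.
struct peer_view {
    int max_version_seen = 0;
    std::string status;

    void apply(const endpoint_state& s) {
        if (s.status_version > max_version_seen) {
            status = s.status;
        }
        max_version_seen = std::max({max_version_seen, s.heart_beat_version, s.status_version});
    }
};

// Replays the scenario and returns the status of B as observed by A.
std::string status_seen_by_peer(bool heart_beat_under_lock) {
    version_generator gen;
    endpoint_state published{gen.next(), gen.next(), "joining"};
    peer_view a;
    a.apply(published);

    // add_local_application_state: copy the state and bump the version.
    endpoint_state local_copy = published;
    local_copy.status_version = gen.next();
    local_copy.status = "NORMAL";

    if (!heart_beat_under_lock) {
        // The gossiper loop runs during the yield in replicate(), bumps the
        // heart beat to version X, and A learns X before the update lands.
        published.heart_beat_version = gen.next();
        a.apply(published);
    }

    // Second phase of replicate(): overwrite the published state.
    published = local_copy;

    if (heart_beat_under_lock) {
        // With the lock, the heart beat bump waits until replicate() is done.
        published.heart_beat_version = gen.next();
    }

    a.apply(published);
    return a.status;
}
```

In this model, `status_seen_by_peer(false)` leaves A stuck on `"joining"`, while `status_seen_by_peer(true)` lets A observe `"NORMAL"`.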

The PR also adds a regression test.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720

Closes scylladb/scylladb#18184

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
2024-04-16 06:46:24 +03:00
Pavel Emelyanov
b06b85c270 test: Tune up tablet-transition test to check del_replica
For that the test case is modified to have 3 nodes and 2 replicas on
start. Existing test cases are changed slightly in the way "from" host
is detected.

Also, the final check for data presence is modified to check that hosts
in "replicas" have data and other hosts don't have it.
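
A minimal sketch of such a check, assuming the set of hosts that hold data has already been collected; `find_presence_mismatches` and the host names are hypothetical and not taken from the test itself.

```python
def find_presence_mismatches(all_hosts, replicas, hosts_with_data):
    """Return hosts whose data presence disagrees with the replica set."""
    mismatches = []
    for host in all_hosts:
        expected = host in replicas          # replicas must hold the data
        actual = host in hosts_with_data     # what was actually observed
        if expected != actual:
            mismatches.append(host)
    return mismatches

# Example: three nodes, two replicas, data found exactly on the replicas.
hosts = ["host1", "host2", "host3"]
ok = find_presence_mismatches(hosts, {"host1", "host3"}, {"host1", "host3"})
# Example: data unexpectedly left behind on a non-replica.
stale = find_presence_mismatches(hosts, {"host1", "host3"},
                                 {"host1", "host2", "host3"})
```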

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-15 16:31:07 +03:00
Kamil Braun
72955093eb test: reproducer for missing gossiper updates
Regression test for scylladb/scylladb#17493.
2024-04-04 18:47:01 +02:00
Pavel Emelyanov
c7908c319f test: Check data presense as well
Other than making sure that system.tablets is updated with the correct
replica set, it's also good to check that the data is present on the
respective nodes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-04 18:01:24 +03:00
Pavel Emelyanov
590f0329ae test: Test how tablets are copied between nodes
This extends the previously introduced test by adding the 'action'
test parameter and tweaking the final assertions on the tablet
replicas read from system.tablets.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-04 09:22:57 +03:00
Pavel Emelyanov
28964ba5fe test: Add sanity test for tablet migration
It just checks that after an API call to move_tablet the resulting replica
is in the expected state. This test will later be expanded to check the
rebuild transition.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-04 09:22:31 +03:00
Andrei Chekun
0752ef1481 test: remove skip annotation for multi-DC test with 5 DCs with one node in each
As a follow-up to https://github.com/scylladb/scylladb/pull/17503, remove the skip annotation from the multi-DC test, now that the number of DCs used in it is reduced from 30 to 5.

Closes scylladb/scylladb#17898
2024-03-27 13:13:13 +02:00
Gleb Natapov
ed534fde8f test: add test for initial_token parameter
Test that the configured tokens are used and that token collisions are detected.
2024-03-26 18:43:31 +02:00
Botond Dénes
f0ff23492f Merge 'Sanitize topology suites' skiplists' from Pavel Emelyanov
There are skip_in_<mode> lists in the suite yaml that tell test.py not to run the tests listed in them. This PR sanitizes these lists in two ways.

First, to skip pytests the skip-decorators are much more convenient, e.g. because they show the reason why the test is skipped.

Also, a test that should run in only one mode is currently opted out in the skip lists of all the other modes. The suite has a run_in_<mode> list for that purpose.

Closes scylladb/scylladb#17964

* github.com:scylladb/scylladb:
  test: Do not duplicate test name in several skip-lists
  test: Mark tests with skip_mode instead of suite skip-list
2024-03-26 08:24:57 +02:00
Pavel Emelyanov
16343b3edc test: Do not duplicate test name in several skip-lists
Some tests are only run in dev mode for some reason. For such tests
there's the run_in_dev list, so there is no need to put them in all the
non-dev skip_in_... lists.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-03-25 14:56:37 +03:00
Pavel Emelyanov
90dfcec86b test: Mark tests with skip_mode instead of suite skip-list
There are many tests that are skipped in release mode because they rely
on error-injection machinery which doesn't work in release mode. Most of
those tests are listed in the suite's skip_in_release, but that's not very
handy, mainly because it's not clear why a test is there. The
skip_mode decoration is much more convenient.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-03-25 14:56:37 +03:00
Kamil Braun
69bf962522 Merge 'allow changing snitch with topology over raft' from Gleb
Fixes scylladb/scylladb#17513

* 'gleb/raft-snitch-change-v3' of github.com:scylladb/scylla-dev:
  doc: amend snitch changing procedure to work with raft
  test: add test to check that snitch change takes effect.
  raft topology: update rack/dc info in topology state on reboot if changed
2024-03-25 10:41:39 +01:00
Gleb Natapov
d7adf26a56 test: add test to check that snitch change takes effect.
The test creates a two-node cluster with the default snitch (SimpleSnitch)
and checks that the dc and rack names are as expected. Then it changes the
config to use GossipingPropertyFileSnitch with different names, restarts
the nodes and checks that the peers table now has the new names.
2024-03-25 10:41:49 +02:00
Piotr Dulikowski
f23f8f81bf Merge 'Raft-based service levels' from Michał Jadwiszczak
This patch introduces raft-based service levels.

The differences from the current behavior are:
- service levels are stored in `system.service_levels_v2`
- reads are executed with `LOCAL_ONE`
- writes are done via raft group0 operation

Service levels are migrated to v2 in topology upgrade.
After the service levels are migrated, `key: service_level_v2_status; value: data_migrated` is written to the `system.scylla_local` table. If this row is present, the raft data accessor is created from the beginning and it handles the recovery mode procedure (service levels will then be read from the v2 table even if consistent topology is disabled).
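
The startup decision described above can be sketched as follows. This is a simplified model, not the actual code: `choose_accessor` is a hypothetical helper, a dict stands in for `system.scylla_local`, and the returned strings merely name the two accessor kinds.

```python
MIGRATION_KEY = "service_level_v2_status"
MIGRATED_VALUE = "data_migrated"

def choose_accessor(scylla_local, consistent_topology_enabled):
    """Pick the service level data accessor to create at startup.

    scylla_local: dict modelling the key/value rows of system.scylla_local.
    consistent_topology_enabled: False models recovery mode.
    """
    if scylla_local.get(MIGRATION_KEY) == MIGRATED_VALUE:
        # Data was migrated: always read from the v2 table, even in
        # recovery mode where consistent topology is disabled.
        return "raft"
    # Not migrated yet: keep the standard accessor until the topology
    # upgrade migrates the service levels.
    return "standard"

after_migration = choose_accessor({MIGRATION_KEY: MIGRATED_VALUE},
                                  consistent_topology_enabled=False)
before_migration = choose_accessor({}, consistent_topology_enabled=True)
```

Note that `consistent_topology_enabled` is deliberately not consulted once the row is present; that is the point of recording the migration status.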

Fixes #17926

Closes scylladb/scylladb#16585

* github.com:scylladb/scylladb:
  test: test service levels v2 works in recovery mode
  test: add test for service levels migration
  test: add test for service levels snapshot
  test:topology: extract `trigger_snapshot` to utils
  main: create raft dda if sl data was migrated
  service:qos: store information about sl data migration
  service:qos: service levels migration
  main: assign standard service level DDA before starting group0
  service:qos: fix `is_v2()` method
  service:qos: add a method to upgrade data accessor
  test: add unit_test_raft_service_levels_accessor
  service:storage_service: add support for service levels raft snapshot
  service:qos: add abort_source for group0 operations
  service:qos: raft service level distributed data accessor
  service:qos: use group0_guard in data accessor
  cql3:statements: run service level statements on shard0 with raft guard
  test: fix overrides in unit_test_service_levels_accessor
  service:qos: fix indentation
  service:qos: coroutinize some of the methods
  db:system_keyspace: add `SERVICE_LEVELS_V2` table
  service:qos: extract common service levels' table functions
2024-03-22 11:51:53 +01:00