scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-22 17:40:34 +00:00

Author	SHA1	Message	Date
Avi Kivity	cdae15ced9	Merge '[Backport 6.0] db/view: drop view updates to replaced node marked as left' from ScyllaDB When a node that is permanently down is replaced, it is marked as "left" but it still can be a replica of some tablets. We also don't keep IPs of nodes that have left and the `node` structure for such node returns an empty IP (all zeros) as the address. This interacts badly with the view update logic. The base replica paired with the left node might decide to generate a view update. Because storage proxy still uses IPs and not host IDs, it needs to obtain the view replica's IP and tell the storage proxy to write a view update to that node - so, it chooses 0.0.0.0. Apparently, storage proxy decides to write a hint towards this address - hinted handoff on the other hand operates on host IDs and not IPs, so it attempts to translate the IP back, which triggers an assertion as there is no replica with IP 0.0.0.0. As a quick workaround for this issue just drop view updates towards nodes which seem to have IPs that are all zeros. It would be more proper to keep the view updates as hints and replay them later to the new paired replica, but achieving this right now would require much more significant changes. For now, fixing a crash is more important than keeping views consistent with base replicas. In addition to the fix, this PR also includes a regression test heavily based on the test that @kbr-scylla prepared during his investigation of the issue. Fixes: scylladb/scylladb#19439 This issue can cause multiple nodes to crash at once and the fix is quite small, so I think this justifies backporting it to all affected versions. 6.0 and 6.1 are affected. No need to backport to 5.4 as this issue only happens with tablets, and tablets are experimental there. (cherry picked from commit `6af7882c59`) (cherry picked from commit `5ec8c06561`) Refs #19765 Closes scylladb/scylladb#19896 * github.com:scylladb/scylladb: test: regression test for MV crash with tablets during decommission db/view: drop view updates to replaced node marked as left	2024-08-14 22:32:07 +03:00
Piotr Smaron	b10bf17df7	tests: ensure ALTER tablets KS doesn't crash if KS doesn't exist Using the error injection framework, we inject a sleep into the processing path of ALTER tablets KS, so that the topology coordinator of the leader node sleeps after the rf_change event has been scheduled, but before it is started to be executed. During that time the second node executes a DROP KS statement, which is propagated to the leader node. Once leader node wakes up and resumes processing of ALTER tablets KS, the KS won't exist and the node cannot crash, which was the case before. (cherry picked from commit `ddb5204929`)	2024-08-14 10:37:59 +00:00
Piotr Dulikowski	dec02b38ae	test: regression test for MV crash with tablets during decommission Regression test for scylladb/scylladb#19439. Co-authored-by: Kamil Braun <kbraun@scylladb.com> (cherry picked from commit `5ec8c06561`)	2024-07-26 14:02:51 +00:00
Emil Maskovsky	62c9709f4a	test: raft: fix the flaky `test_raft_recovery_stuck` Use the rolling restart to avoid spurious driver reconnects. This can be eventually reverted once the scylladb/python-driver#295 is fixed. Fixes scylladb/scylladb#19154 (cherry picked from commit `a89facbc74`)	2024-07-20 02:17:50 +00:00
Emil Maskovsky	64d414f10a	test: raft: code cleanup in `test_raft_recovery_stuck` Cleaning up the imports. (cherry picked from commit `ef3393bd36`)	2024-07-20 02:17:50 +00:00
Gleb Natapov	c437c8be36	test: add test to check that coordinator lwt semaphore continues functioning after locking failures (cherry picked from commit `4178589826`)	2024-07-18 15:34:17 +00:00
Michael Litvak	815a707b0a	storage_proxy: remove response handler if no targets When writing a mutation, it might happen that there are no live targets to send the mutation to, yet the request can be satisfied. For example, when writing with CL=ANY to a dead node, the request is completed by storing a local hint. Currently, in that case, a write response handler is created for the request and it remains active until it timeouts because it is not removed anywhere, even though the write is completed successfuly after storing the hint. The response handler should be removed usually when receiving responses from all targets, but in this case there are no targets to trigger the removal. In this commit we check if we don't have live targets to send the mutation to. If so, we remove the response handler immediately. Fixes scylladb/scylladb#19529 (cherry picked from commit `a9fdd0a93a`) Closes scylladb/scylladb#19680	2024-07-15 08:24:18 +02:00
Michael Litvak	ad6eb1cadf	view: drain view builder before database The view builder is doing write operations to the database. In order for the view builder to shutdown gracefully without errors, we need to ensure the database can handle writes while it is drained. The commit changes the drain order, so that view builder is drained before the database shuts down. Fixes scylladb/scylladb#18929 (cherry picked from commit `9d9318c564`) Closes scylladb/scylladb#19636	2024-07-08 19:16:26 +02:00
Gleb Natapov	724ec62e22	test: add test that checks that local address cannot expire between join request placemen and its processing (cherry picked from commit `3f136cf2eb`)	2024-07-01 10:44:31 +00:00
Botond Dénes	b18d9e5d0d	Merge '[Backport 6.0] make enable_compacting_data_for_streaming_and_repair truly live-update' from ScyllaDB This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken. This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure, around the failure injection api, namely the ability to exfiltrate the value of internal variable. This infrastructure is also added in this series. Fixes: https://github.com/scylladb/scylladb/issues/18674 - [x] This patch has to be backported because it fixes broken functionality (cherry picked from commit `dbccb61636`) (cherry picked from commit `4590026b38`) (cherry picked from commit `feea609e37`) (cherry picked from commit `0c61b1822c`) (cherry picked from commit `8ef4fbdb87`) Refs #18705 Closes scylladb/scylladb#19240 * github.com:scylladb/scylladb: test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update test/pylib: rest_client: add get_injection() api/error_injection: add getter for error_injection utils/error_injection: add set_parameter() replica/database: fix live-update enable_compacting_data_for_streaming_and_repair	2024-06-13 12:45:23 +03:00
Pavel Emelyanov	2306c3b522	test: Reduce failure detector timeout for failed tablets migration test Most of the time this test spends waiting for a node to die. Helps 3x times Was real 9m21,950s user 1m11,439s sys 1m26,022s Now real 3m37,780s user 0m58,439s sys 1m13,698s refs: #17764 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `a4e8f9340a`) Closes scylladb/scylladb#19233	2024-06-12 10:02:45 +03:00
Botond Dénes	0d13c51dd4	test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update Avoid this the live-update feature of this config item breaking silently. (cherry picked from commit `8ef4fbdb87`)	2024-06-11 17:32:37 +00:00
Raphael S. Carvalho	d4c3a43b34	replica: Refresh mutation source when allocating tablet replicas Consider the following: 1) table A has N tablets and views 2) migration starts for a tablet of A from node 1 to 2. 3) migration is at write_both_read_old stage 4) coordinator will push writes to both nodes (pending and leaving) 5) A has view, so writes to it will also result in reads (table::push_view_replica_updates()) 6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in) 7) so read on step 5 is not being able to find sstable set for tablet migrating in Causes the following error: "tablets - SSTable set wasn't found for tablet 21 of table mview.users" which means loss of write on pending replica. The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot. It's not a problem to refresh the cache snapshot as long as the logical state of the data hasn't changed, which is true when allocating new tablet replicas. That's also done in the context of compactions for example. Fixes #19052. Fixes #19033. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `7b41630299`) Closes scylladb/scylladb#19229	2024-06-11 18:12:43 +03:00
Gleb Natapov	c53cd98a41	test: add test of bootstrap where the coordinator crashes just before storing IP mapping On the next boot there is no host ID to IP mapping which causes node to crash again with "No mapping for :: in the passed effective replication map" assertion. (cherry picked from commit `27445f5291`)	2024-06-05 13:55:28 +00:00
Pavel Emelyanov	62a23fd86a	config: Remove experimental TABLETS feature ... and replace it with boolean enable_tablets option. All the places in the code are patched to check the latter option instead of the former feature. The option is OFF by default, but the default scylla.yaml file sets this to true, so that newly installed clusters turn tablets ON. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `83d491af02`) Closes scylladb/scylladb#19012	2024-06-03 12:16:41 +03:00
Wojciech Mitros	3c47ab9851	mv: handle different ERMs for base and view table When calculating the base-view mapping while the topology is changing, we may encounter a situation where the base table noticed the change in its effective replication map while the view table hasn't, or vice-versa. This can happen because the ERM update may be performed during the preemption between taking the base ERM and view ERM, or, due to `f2ff701`, the update may have just been performed partially when we are taking the ERMs. Until now, we assumed that the ERMs are synchronized while calling finding the base-view endpoint mapping, so in particular, we were using the topology from the base's ERM to check the datacenters of all endpoints. Now that the ERMs are more likely to not be the same, we may try to get the datacenter of a view endpoint that doesn't exist in the base's topology, causing us to crash. This is fixed in this patch by using the view table's topology for endpoints coming from the view ERM. The mapping resulting from the call might now be a temporary mapping between endpoints in different topologies, but it still maps base and view replicas 1-to-1. Fixes: #17786 Fixes: #18709 (cherry-picked from `519317dc58`) This commit also includes the follow-up patch that removes the flakiness from the test that is introduced by the commit above. The flakiness was caused by enabling the delay_before_get_view_natural_endpoint injection on a node and not disabling it before the node is shut down. The patch removes the enabling of the injection on the node in the first place. By squashing the commits, we won't introduce a place in the commit history where a potential bisect could mistakenly fail. Fixes: https://github.com/scylladb/scylladb/issues/18941 (cherry-picked from `0de3a5f3ff`) Closes scylladb/scylladb#18974	2024-05-30 09:13:31 +02:00
Pavel Emelyanov	57d267a97e	test: Do not check tablets mutations on nodes that don't have them The check is performed by selecting from mutation_fragments(table), but it's known that this query crashes Scylla when there's no tablet replica on that node. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-30 08:33:26 +03:00
Pavel Emelyanov	5b8523273b	test: Fix the way tablets RF-change test parses mutation_fragments When the test changes RF from 2 to 3, the extra node executes "rebuild" transition which means that it streams tablets replicas from two other peers. When doing it, the node receives two sets of sstables with mutations from the given tablet. The test part that checks if the extra node received the mutations notices two mutation fragments on the new replica and errorneously fails by seeing, that RF=3 is not equal to the number of mutations found, which is 4. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-30 08:33:26 +03:00
Pavel Emelyanov	6497ed68ed	test/tablets: Unmark RF-changing test with xfail Now the scailing works and test must check it does Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-30 08:33:26 +03:00
Pavel Emelyanov	da816bf50c	test/tablets: Check that after RF change data is replicated properly There's a test that checks system.tablets contents to see that after changing ks replication factor via ALTER KEYSPACE the tablet map is updated properly. This patch extends this test that also validates that mutations themselves are replicated according to the desired replication factor. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18644	2024-05-30 08:31:48 +03:00
Piotr Dulikowski	68eca3778c	Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed. See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions. This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh. The existing mechanism works in the following way: * Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes * Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking * We keep track of the percent of consumed units on each node, this is called `view update backlog`. * Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level. This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates. To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`. The new algorithm of view update generation looks something like this: ```c++ for(;;) { auto updates = generate_updates_batch_with_max_100_rows(); co_await seastar::sleep(calculate_sleep_time_from_backlogs()); spawn_background_tasks_for_updates(updates); } ``` Fixes: https://github.com/scylladb/scylladb/issues/12379 Closes scylladb/scylladb#16819 * github.com:scylladb/scylladb: test: add test for bad_allocs during large mv queries mv: throttle view update generation for large queries exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception db/view: extract view throttling delay calculation to a global function view_update_generator: add get_storage_proxy() storage_proxy: make view backlog getters public	2024-05-16 08:22:54 +02:00
Wojciech Mitros	485eb7a64c	test: add test for bad_allocs during large mv queries This patch adds a test for reproducing issue #12379, which is being fixed in #16819. The test case works by creating a table with a materialized view, and then performing a partition delete query on it. At the same time, it uses injections to limit the memory to a level lower than usual, in order to increase the consistency of the test, and to limit its runtime. Before #16819, the test would exceed the limit and fail, and now the next allocation is throttled using a sleep.	2024-05-13 18:16:39 +02:00
Aleksandra Martyniuk	51fdda4199	test: add test for back and forth tablets migration	2024-05-10 15:08:56 +02:00
Avi Kivity	37d32a5f8b	Merge 'Cleanup inactive reads on tablet migration' from Botond Dénes When a tablet is migrated away, any inactive read which might be reading from said tablet, has to be dropped. Otherwise these inactive reads can prevent sstables from being removed and these sstables can potentially survive until the tablet is migrated back and resurrect data. This series introduces the fix as well as a reproducer test. Fixes: https://github.com/scylladb/scylladb/issues/18110 Closes scylladb/scylladb#18179 * github.com:scylladb/scylladb: test: add test for cleaning up cached querier on tablet migration querier: allow injecting cache entry ttl by error injector replica/table: cleanup_tablet(): clear inactive reads for the tablet replica/database: introduce clear_inactive_reads_for_tablet() replica/database: introduce foreach_reader_concurrency_semaphore reader_concurrency_semaphore: add range param to evict_inactive_reads_for_table() reader_concurrency_semaphore: allow storing a range with the inactive reader reader_concurrency_semaphore: avoid detach() in inactive_read_handle::abandon()	2024-05-09 17:34:49 +03:00
Kamil Braun	4dcae66380	Merge 'test: {auth,topology}: use manager.rolling_restart' from Piotr Dulikowski Instead of performing a rolling restart by calling `restart` in a loop over every node in the cluster, use the dedicated `manager.rolling_restart` function. This method waits until all other nodes see the currently processed node as up or down before proceeding to the next step. Not doing so may lead to surprising behavior. In particular, in scylladb/scylladb#18369, a test failed shortly after restarting three nodes. Because nodes were restarted one after another too fast, when the third node was restarted it didn't send a notification to the second node because it still didn't know that the second node was alive. This led the second node to notice that the third node restarted by observing that it incremented its generation in gossip (it restarted too fast to be marked as down by the failure detector). In turn, this caused the second node to send "third node down" and "third node up" notifications to the driver in a quick succession, causing it to drop and reestablish all connections to that node. However, this happened _after_ rolling upgrade finished and _after_ the test logic confirmed that all nodes were alive. When the notifications were sent to the driver, the test was executing some statements necessary for the test to pass - as they broke, the test failed. Fixes: scylladb/scylladb#18369 Closes scylladb/scylladb#18379 * github.com:scylladb/scylladb: test: get rid of server-side server_restart test: util: get rid of the `restart` helper test: {auth,topology}: use manager.rolling_restart	2024-05-08 09:45:08 +02:00
Kamil Braun	03818c4aa9	direct_failure_detector: increase ping timeout and make it tunable The direct failure detector design is simplistic. It sends pings sequentially and times out listeners that reached the threshold (i.e. didn't hear from a given endpoint for too long) in-between pings. Given the sequential nature, the previous ping must finish so the next ping can start. We timeout pings that take too long. The timeout was hardcoded and set to 300ms. This is too low for wide-area setups -- latencies across the Earth can indeed go up to 300ms. 3 subsequent timed out pings to a given node were sufficient for the Raft listener to "mark server as down" (the listener used a threshold of 1s). Increase the ping timeout to 600ms which should be enough even for pinging the opposite side of Earth, and make it tunable. Increase the Raft listener threshold from 1s to 2s. Without the increased threshold, one timed out ping would be enough to mark the server as down. Increasing it to 2s requires 3 timed out pings which makes it more robust in presence of transient network hiccups. In the future we'll most likely want to decrease the Raft listener threshold again, if we use Raft for data path -- so leader elections start quickly after leader failures. (Faster than 2s). To do that we'll have to improve the design of the direct failure detector. Ref: scylladb/scylladb#16410 Fixes: scylladb/scylladb#16607 --- I tested the change manually using `tc qdisc ... netem delay`, setting network delay on local setup to ~300ms with jitter. Without the change, the result is as observed in scylladb/scylladb#16410: interleaving ``` raft_group_registry - marking Raft server ... as dead for Raft groups raft_group_registry - marking Raft server ... as alive for Raft groups ``` happening once every few seconds. The "marking as dead" happens whenever we get 3 subsequent failed pings, which is happens with certain (high) probability depending on the latency jitter. Then as soon as we get a successful ping, we mark server back as alive. With the change, the phenomenon no longer appears. Closes scylladb/scylladb#18443	2024-05-07 23:40:23 +02:00
Piotr Dulikowski	8de2bda7ae	test: util: get rid of the `restart` helper We already have `ManagerClient.server_restart`, which can be used in its place.	2024-05-06 12:24:40 +02:00
Patryk Jędrzejczak	f61c50baa4	test: test_raft_snapshot_request: improve the last assertion The last assertion in the test is very sensitive to changes. The constant has already been increased from 0 to 1 due to flakiness. The old comment explains it. In the following patch, we change the CDC generation publisher so that it clears the obsolete CDC generations earlier. This change would make this assertion flaky again. After restarting the servers, the new topology coordinator could remove the first generation if it became obsolete. This operation appends a new entry to the log. If it happened after triggering snapshot, the assertion could fail with `2 <= 1`. We could increase the constant again to unflake the test, but we better improve it once and for all. We change the assertion so that it's not sensitive to changes in the code based on Raft. The explanation is in the new comment.	2024-05-02 12:46:33 +02:00
Patryk Jędrzejczak	44791a849e	test: test_raft_snapshot_request: find raft leader after restart Finding the new Raft leader after restart simplifies the test and makes it easier to reason about. There are two improvements: - we only need to wait until the leader appends a command, so the read barrier becomes unnecessary, - we only need to trigger snapshot on the leader. We also use the knowledge about the leader in the following patch.	2024-05-02 12:46:33 +02:00
Patryk Jędrzejczak	41198998c5	test: test_raft_shanpshot_request: simplify appended_command We shorten the code and remove the unused `log_size` variable.	2024-05-02 12:46:31 +02:00
Botond Dénes	0ace90ad04	test: add test for cleaning up cached querier on tablet migration Check that a cached querier, which exists prior to a migration, will be cleaned up afterwards. This reproduces #18110. The test fails before the fix for the above and passes afterwards.	2024-04-30 01:47:16 -04:00
Kamil Braun	d8313dda43	Merge 'db: config: move consistent-topology-changes out of experimental and make it the default for new clusters' from Patryk Jędrzejczak We move consistent cluster management out of experimental and make it the default for new clusters in 6.0. In code, we make the `consistent-topology-changes` flag unused and assumed to be true. In 6.0, the topology upgrade procedure will be manual and voluntary, so some clusters will still be using the gossip-based topology even though they support the raft-based topology. Therefore, we need to continue testing the gossip-based topology. This is possible by using the `force-gossip-topology-changes` flag introduced in scylladb/scylladb#18284. Ref scylladb/scylladb#17802 Closes scylladb/scylladb#18285 * github.com:scylladb/scylladb: docs: raft.rst: update after removing consistent-topology-changes treewide: fix indentation after the previous patch db: config: make consistent-topology-changes unused test: lib: single_node_cql_env: restart a node in noninitial run_in_thread calls test: test_read_required_hosts: run with force-gossip-topology-changes storage_service: join_cluster: replace force_gossip_based_join with force-gossip-topology-changes storage_service: join_token_ring: fix finish_setup_after_join calls	2024-04-26 14:45:29 +02:00
Patryk Jędrzejczak	3a100cd16c	test: test_raft_recovery_stuck: ensure raft upgrade procedure failed We have log browsing in test.py now, so we can fix this TODO easily. Closes scylladb/scylladb#18425	2024-04-26 10:16:49 +02:00
Patryk Jędrzejczak	3a34bb18cd	db: config: make consistent-topology-changes unused We make the `consistent-topology-changes` experimental feature unused and assumed to be true in 6.0. We remove code branches that executed if `consistent-topology-changes` was disabled.	2024-04-25 14:33:21 +02:00
Patryk Jędrzejczak	213f2f6882	storage_service: join_cluster: replace force_gossip_based_join with force-gossip-topology-changes The `force_gossip_based_join` error injection does exactly what we expect from `force-gossip-topology-changes` so we can do a simple replacement. We prefer a flag over an error injection because we will use it a lot in CI jobs' configurations, some tests, manual testing etc. It's much more convenient. Moreover, the flag can be used in the release mode, so we re-enable all tests that were disabled in release mode only because of using the `force_gossip_based_join` error injection. The name of the `force-gossip-topology-changes` flag suggests that using it should always succesfully force the gossip-based topology or, if forcing is not possible, the booting should fail. We don't want a node with `force-gossip-topology-changes=true` that silently boots in the raft-topology mode. We achieve it by throwing a runtime error from `join_cluster` in two cases: - the node is restarting in the cluster that is using raft topology - the node is joining the cluster that is using raft topology	2024-04-25 14:33:21 +02:00
Petr Gusev	bc98774f83	test_replace_reuse_ip: add data plane load In this commit we enhance test_replace_reuse_ip to reproduce #17421. We create a test table and run insert queries on it while the first node is being replaced. In this form the test fails without the fix from the previous commit. Some insert requests fail with [Unavailable exception] "Cannot achieve consistency level for cl QUORUM...".	2024-04-24 16:59:24 +03:00
Botond Dénes	162c9ad6f6	Merge 'gossiper: lock local endpoint when updating heart_beat' from Kamil Braun In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example: - in scylladb/scylladb#16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line. - in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip. I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause. The following scenario can happen, 2 nodes are sufficient, let's call them A and B: - Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status. - `add_local_application_state` takes a copy of the endpoint_state, and updates the copy: ``` auto local_state = ep_state_before; for (auto& p : states) { auto& state = p.first; auto& value = p.second; value = versioned_value::clone_with_higher_version(value); local_state.add_application_state(state, value); } ``` `clone_with_higher_version` bumps `version` inside gms/version_generator.cc. - `add_local_application_state` calls `gossiper.replicate(...)` - `replicate` works in 2 phases to achieve exception safety: in first phase it copies the updated `local_state` to all shards into a separate map. In second phase the values from separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the 1 phase, there is a yield before the second phase. During this yield* the following happens: - `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version then the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above. - now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains: - old application states (NOT including the update described above, because `replicate` is still sleeping before phase 2), - but bumped heart_beat == X from `gossiper::run()` loop, and sends the message. - node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X. - `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X. - now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B. If I understand correctly, this is a regression introduced in `38c2347a3c`, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it. With the description above, it's easy to make a consistent reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that. The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. The PR also adds a regression test. Fixes: scylladb/scylladb#15393 Fixes: scylladb/scylladb#15602 Fixes: scylladb/scylladb#16668 Fixes: scylladb/scylladb#16902 Fixes: scylladb/scylladb#17493 Fixes: scylladb/scylladb#18118 Ref: scylladb/scylla-enterprise#3720 Closes scylladb/scylladb#18184 * github.com:scylladb/scylladb: test: reproducer for missing gossiper updates gossiper: lock local endpoint when updating heart_beat	2024-04-16 06:46:24 +03:00
Pavel Emelyanov	b06b85c270	test: Tune up tablet-transition test to check del_replica For that the test case is modified to have 3 nodes and 2 replicas on start. Existing test cases are changed slightly in the way "from" host is detected. Also, the final check for data presense is modified to check that hosts in "replicas" have data and other hosts don't have it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-15 16:31:07 +03:00
Kamil Braun	72955093eb	test: reproducer for missing gossiper updates Regression test for scylladb/scylladb#17493.	2024-04-04 18:47:01 +02:00
Pavel Emelyanov	c7908c319f	test: Check data presense as well Other than making sure that system.tablets is updated with correct replica set, it's also good to check that the data is present on the repsective nodes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-04 18:01:24 +03:00
Pavel Emelyanov	590f0329ae	test: Test how tablets are copied between nodes This patches the previously introduced test by introducing the 'action' test paramter and tweaking the final checking assertions around tablet replicas read from system.tablets Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-04 09:22:57 +03:00
Pavel Emelyanov	28964ba5fe	test: Add sanity test for tablet migration It just checks that after api call to move_tablet the resulting replica is in expected state. This test will be later expanded to check for rebuild transition. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-04 09:22:31 +03:00
Andrei Chekun	0752ef1481	test: remove skip annotation for multi-DC test with 5 DCs with one node in each As a follow-up of the https://github.com/scylladb/scylladb/pull/17503 remove skip annotation for the multi-DC test with a reduced amount of the DC used in it: from 30 DCs to 5 DCs Closes scylladb/scylladb#17898	2024-03-27 13:13:13 +02:00
Gleb Natapov	ed534fde8f	test: add test for initial_token parameter Test that configured tokens are used and tokens collision is detected.	2024-03-26 18:43:31 +02:00
Botond Dénes	f0ff23492f	Merge 'Sanitize topology suites' skiplists' from Pavel Emelyanov There are skip_in_<mode> lists in suite yaml that tells test.py not to run the test from it. This PR sanitizes these lists in two ways. First, to skip pytests the skip-decorators are much more convenient, e.g. because they show the reason why the test is skipped. Also, if a test wants to be opt-in-ed for some mode only, it's opt-out-ed in all other lists instead. There's run_in_<mode> list in suite for that. Closes scylladb/scylladb#17964 * github.com:scylladb/scylladb: test: Do not duplicate test name in several skip-lists test: Mark tests with skip_mode instead of suite skip-list	2024-03-26 08:24:57 +02:00
Pavel Emelyanov	16343b3edc	test: Do not duplicate test name in several skip-lists Some tests are only run in dev mode for some reason. For such tests there's run_in_dev list, no need in putting it in all the non-dev skip_in_... ones. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-03-25 14:56:37 +03:00
Pavel Emelyanov	90dfcec86b	test: Mark tests with skip_mode instead of suite skip-list There are many tests that are skipped in release mode becuase they rely on error-injection machinery which doesn't work in release mode. Most of those tests are listed in suite's skip_in_release, but it's not very handy, mainly because it's not clear why the test is there. The skip_mode decoration is much more convenient. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-03-25 14:56:37 +03:00
Kamil Braun	69bf962522	Merge 'allow changing snitch with topology over raft' from Gleb Fixes scylladb/scylladb#17513 * 'gleb/raft-snitch-change-v3' of github.com:scylladb/scylla-dev: doc: amend snitch changing procedure to work with raft test: add test to check that snitch change takes effect. raft topology: update rack/dc info in topology state on reboot if changed	2024-03-25 10:41:39 +01:00
Gleb Natapov	d7adf26a56	test: add test to check that snitch change takes effect. The test creates two node cluster with default snitch (SimpleSnitch) and checks that dc and rack names are as expected. Then it changes the config to use GossipingPropertyFileSnitch with different names, restart nodes and check that now peers table has new names.	2024-03-25 10:41:49 +02:00
Piotr Dulikowski	f23f8f81bf	Merge 'Raft-based service levels' from Michał Jadwiszczak This patch introduces raft-based service levels. The difference to the current method of working is: - service levels are stored in `system.service_levels_v2` - reads are executed with `LOCAL_ONE` - writes are done via raft group0 operation Service levels are migrated to v2 in topology upgrade. After the service levels are migrated, `key: service_level_v2_status; value: data_migrated` is written to `system.scylla_local` table. If this row is present, raft data accessor is created from the beginning and it handles recovery mode procedure (service levels will be read from v2 table even if consistent topology is disabled then) Fixes #17926 Closes scylladb/scylladb#16585 * github.com:scylladb/scylladb: test: test service levels v2 works in recovery mode test: add test for service levels migration test: add test for service levels snapshot test:topology: extract `trigger_snapshot` to utils main: create raft dda if sl data was migrated service:qos: store information about sl data migration service:qos: service levels migration main: assign standard service level DDA before starting group0 service:qos: fix `is_v2()` method service:qos: add a method to upgrade data accessor test: add unit_test_raft_service_levels_accessor service:storage_service: add support for service levels raft snapshot service:qos: add abort_source for group0 operations service:qos: raft service level distributed data accessor service:qos: use group0_guard in data accessor cql3:statements: run service level statements on shard0 with raft guard test: fix overrides in unit_test_service_levels_accessor service:qos: fix indentation service:qos: coroutinize some of the methods db:system_keyspace: add `SERVICE_LEVELS_V2` table service:qos: extract common service levels' table functions	2024-03-22 11:51:53 +01:00

1 2 3 4

168 Commits