Add a get_non_coordinator_host() function that returns the
ServerInfo of the first host that is not a coordinator,
or None if there is no such host.
Also rework get_coordinator_host() to not fail if some
of the hosts don't have a host id.
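A minimal sketch of what such a helper could look like, assuming `manager.running_servers()` lists live nodes and `get_coordinator_host()` has the reworked, non-failing behavior; the attribute and helper names are assumptions, not the exact implementation:
```python
from typing import Optional

async def get_non_coordinator_host(manager) -> Optional["ServerInfo"]:
    """Return ServerInfo for the first live host that is not the topology
    coordinator, or None if no such host exists."""
    coordinator = await get_coordinator_host(manager)  # assumed helper
    for srv in await manager.running_servers():
        if coordinator is None or srv.server_id != coordinator.server_id:
            return srv
    return None
```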
In the following commit, we modify `test_topology_recovery_basic`
to test the recovery mode in the presence of live zero-token nodes.
Unfortunately, it requires a somewhat ugly workaround. The Python
driver ignores zero-token nodes if it also connects to other nodes,
because their token column in the `system.peers` table is empty. In that
test, we must connect to a zero-token node to enter the recovery
mode and purge the Raft data. Hence, we use different CQL sessions
for different nodes.
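For illustration, the per-node session could be created along these lines with the Python driver; the node address variable and the exact statement used to enter recovery are assumptions here, not the test's literal code:
```python
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import WhiteListRoundRobinPolicy

# A dedicated session pinned to the zero-token node; the shared session,
# which also talks to token-owning nodes, never routes requests to it.
profile = ExecutionProfile(
    load_balancing_policy=WhiteListRoundRobinPolicy([zero_token_ip]))
with Cluster(contact_points=[zero_token_ip],
             execution_profiles={EXEC_PROFILE_DEFAULT: profile}) as cluster:
    session = cluster.connect()
    # Hypothetical recovery-mode entry point; the real test performs its
    # recovery steps and Raft data purge through this pinned session.
    session.execute("UPDATE system.scylla_local SET value = 'recovery' "
                    "WHERE key = 'group0_upgrade_state'")
```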
In the future, we may change the Python driver behavior and revert
this workaround. Moreover, the recovery tests will be removed or
significantly changed when we implement the manual recovery tool.
Therefore, we shouldn't worry about this workaround too much.
Before we use `check_system_topology_and_cdc_generations_v3_consistency`
in a test with a zero-token node, we must ensure it doesn't fail
because of zero tokens in a row of the `system.topology` table.
In one of the following patches, we reuse the helper functions from
`test_topology_ops` in a new test, so we move them to `util.py`.
Also, we add the `cl` parameter to `start_writes`, as the new test
will use `cl=2`.
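A rough sketch of how the `cl` parameter might be threaded into the background INSERTs, assuming the harness's awaitable `run_async` session wrapper and a pre-created `ks.t` table (names and shapes are assumptions):
```python
import asyncio
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

async def start_writes(cql, cl: int = ConsistencyLevel.ONE, concurrency: int = 3):
    """Start background INSERTs at consistency level `cl`; returns a
    callback that stops the writers and waits for them to finish."""
    stmt = SimpleStatement("INSERT INTO ks.t (pk, v) VALUES (%s, %s)",
                           consistency_level=cl)
    stop = asyncio.Event()

    async def do_writes(worker: int) -> None:
        key = worker
        while not stop.is_set():
            await cql.run_async(stmt, (key, key))
            key += concurrency

    tasks = [asyncio.create_task(do_writes(i)) for i in range(concurrency)]

    async def finish() -> None:
        stop.set()
        await asyncio.gather(*tasks)

    return finish
```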
Replaced the old `read_barrier` helper from `test/pylib/util.py`
with the new helper in `test/pylib/rest_client.py`, which calls
the newly introduced direct REST API.
Updated all relevant tests and removed the old helper.
Introduced a new helper, `get_host_api_address`, to retrieve the host API
address, which in some cases can differ from the host address
(e.g. if the RPC address is changed).
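A hedged sketch of how the two pieces might fit together; the method names and where they live on the manager/REST client are assumptions:
```python
async def barrier_on(manager, server) -> None:
    # The REST API may listen on a different address than the RPC address,
    # so resolve it first (hypothetical helper introduced by this patch).
    api_addr = await manager.get_host_api_address(server)
    # Hypothetical REST-client call wrapping the direct read-barrier endpoint.
    await manager.api.read_barrier(api_addr)
```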
Fixes: scylladb/scylladb#19662
Closes scylladb/scylladb#19739
Fetching only the first page is not the intuitive behavior expected by users.
This causes flakiness in some tests that generate a variable number of
keys depending on execution speed and later verify, using a single
SELECT statement, that all keys were written. When the number of keys
grows larger than the page size, the test fails.
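With the plain Python driver the distinction looks like this; draining the result set, rather than looking at the first page only, is what the verifying SELECT needs:
```python
from cassandra.query import SimpleStatement

stmt = SimpleStatement("SELECT pk FROM ks.t", fetch_size=1000)
rs = session.execute(stmt)
first_page = len(rs.current_rows)   # only the first page; may undercount
all_rows = list(rs)                 # iteration transparently fetches the rest
assert len(all_rows) >= first_page
```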
Fixes #18774
Closes scylladb/scylladb#19004
This patch allows us to restart writing (to the same table with
CDC enabled) with a new CQL session. It is useful when we want to
continue writing after closing the first CQL session, which
happens during the `reconnect_driver` call. We must stop writing
before calling `reconnect_driver`. If a write started just before
the first CQL session was closed, it would time out on the client.
We rename `finish_and_verify` to `stop_and_verify`, which is a
better name after introducing `restart`.
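For illustration, the writer could be shaped roughly like this; the class and table names are hypothetical, and the awaitable `run_async` session wrapper is an assumption:
```python
import asyncio

class CdcWriter:
    """Background writer that can be stopped before reconnect_driver()
    closes the CQL session and restarted with the new session."""
    def __init__(self, cql):
        self._cql = cql
        self._stop = asyncio.Event()
        self._task = None
        self._next_key = 0

    def start(self) -> None:
        self._stop.clear()
        self._task = asyncio.create_task(self._write_loop())

    async def _write_loop(self) -> None:
        while not self._stop.is_set():
            await self._cql.run_async(
                "INSERT INTO ks.t (pk, v) VALUES (%s, %s)",
                (self._next_key, self._next_key))
            self._next_key += 1

    async def stop(self) -> None:
        self._stop.set()
        await self._task

    def restart(self, new_cql) -> None:
        # Continue writing to the same CDC-enabled table with a fresh
        # session, e.g. after reconnect_driver() replaced the driver.
        self._cql = new_cql
        self.start()

    async def stop_and_verify(self) -> None:
        await self.stop()
        rows = list(await self._cql.run_async("SELECT pk FROM ks.t"))
        assert len(rows) == self._next_key
```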
If the coordinator node is killed, restarted, or becomes inoperable
during a topology operation, a new coordinator should be elected,
the operation should be aborted, and the cluster should be rolled back.
Error injection will be used to kill the coordinator before streaming
starts.
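For illustration, the kill could be wired up roughly like this; the injection name is hypothetical, while `enable_injection` and `server_add` follow the harness's usual shape:
```python
coordinator = await get_coordinator_host(manager)
# One-shot injection that makes the coordinator crash right before
# streaming begins (hypothetical injection name).
await manager.api.enable_injection(coordinator.ip_addr,
                                   "crash_coordinator_before_stream",
                                   one_shot=True)
# Start a topology operation that streams data; a new coordinator should
# take over, abort the operation and roll the cluster back.
await manager.server_add()
```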
Closes scylladb/scylladb#16197
test.py uses `ring_delay_ms = 0` by default. CDC computes a new generation's
timestamp by adding `ring_delay_ms` to the current time.
In this test, nodes are learning about new generations (introduced by upgrade
procedure and then by node bootstrap) concurrently with doing writes that
should go to these generations.
Because of `ring_delay_ms = 0`, a generation could be committed only at a
point when it should already have been in use.
This can be seen in the following logs from a node:
```
ERROR 2024-03-22 12:29:55,431 [shard 0:strm] cdc - just learned about a CDC generation newer than the one used the last time streams were retrieved. This generation, or some newer one, should have been used instead (new generation's timestamp: 2024/03/22 12:29:54, last time streams were retrieved: 2024/03/22 12:29:55). The new generation probably arrived too late due to a network partition and we've made a write using the wrong set streams.
```
Writes made during such a generation can be assigned a wrong generation
or can fail. A failure may occur if a write hits the short time window when
`generation_service::handle_cdc_generation(cdc::generation_id_v2)` has executed
`svc._cdc_metadata.prepare(...)` but `_cdc_metadata.insert(...)` has not yet
been executed. With a nonzero `ring_delay_ms` this is not a problem, because
during this time window the generation should not yet be in use.
A write can fail with the following response from a node:
```
cdc: attempted to get a stream from a generation that we know about, but weren't able to retrieve (generation timestamp: 2024/03/22 12:29:54, write timestamp: 2024/03/22 12:29:55). Make sure that the replicas which contain this generation's data are alive and reachable from this node.
```
Set `ring_delay_ms` to 15000 in the debug mode and 5000 in other modes.
Wait for the last generation to be in use, then sleep one second to make sure
there are writes to the CDC table in this generation.
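In the harness this amounts to passing a per-mode `ring_delay_ms` when starting servers; a short sketch, where the `mode` variable and config plumbing are assumptions:
```python
# Use a larger ring delay in the slow debug mode, a smaller one elsewhere.
ring_delay_ms = 15000 if mode == 'debug' else 5000
servers = [await manager.server_add(config={'ring_delay_ms': ring_delay_ms})
           for _ in range(3)]
```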
Fixes#17977
In topology on raft, management of CDC generations is moved to the topology coordinator.
We need to verify that CDC keeps working correctly during the upgrade to the Raft-based topology.
A similar change will be made in the topology recovery test. It will reuse
the `start_writes_to_cdc_table` function.
Ref #17409
Closes scylladb/scylladb#17828
Tests that verify upgrading to the raft-based topology
(`test_topology_upgrade`, `test_topology_recovery_basic`,
`test_topology_recovery_majority_loss`) have flaky
`check_system_topology_and_cdc_generations_v3_consistency` calls.
`assert topo_results[0] == topo_res` can fail because of different
`unpublished_cdc_generations` on different nodes.
The upgrade procedure creates a new CDC generation, which is later
published by the CDC generation publisher. However, this can happen
after the upgrade procedure finishes. In tests, if publishing
happens just before querying `system.topology` in
`check_system_topology_and_cdc_generations_v3_consistency`, we can
observe different `unpublished_cdc_generations` on different nodes.
It is an expected and temporary inconsistency.
For the same reasons,
`check_system_topology_and_cdc_generations_v3_consistency` can
fail after adding a new node.
To make the tests not flaky, we wait until the CDC generation
publisher finishes its job. Then, all nodes should always have
equal (and empty) `unpublished_cdc_generations`.
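A minimal sketch of such a wait, polling `system.topology` until `unpublished_cdc_generations` is drained; the query shape, helper name, and awaitable `run_async` wrapper are assumptions:
```python
import asyncio
import time

async def wait_until_cdc_generations_are_published(cql, deadline: float) -> None:
    """Wait until the CDC generation publisher has finished its job on the
    queried node, i.e. unpublished_cdc_generations is empty."""
    while True:
        rows = await cql.run_async(
            "SELECT unpublished_cdc_generations FROM system.topology")
        if all(not r.unpublished_cdc_generations for r in rows):
            return
        assert time.time() < deadline, "CDC generation publisher did not finish"
        await asyncio.sleep(0.1)
```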
Fixes scylladb/scylladb#17587
Fixes scylladb/scylladb#17600
Fixes scylladb/scylladb#17621
Closes scylladb/scylladb#17622
We extend `test_cdc_generation_clearing`. Now, it also tests the
clean-up of `TOPOLOGY.committed_cdc_generations` added in the
previous patch.
In the implementation, we harden the already existing
`check_system_topology_and_cdc_generations_v3_consistency`. After
the previous patch, data of every generation present in
`committed_cdc_generations` should be present in CDC_GENERATIONS_V3.
In other words, `committed_cdc_generations` should always be a
subset of the set of generations stored in CDC_GENERATIONS_V3.
Before the previous patch, this wasn't true after the clearing, so
the new version of `test_cdc_generation_clearing` wouldn't pass
back then.
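The hardened check boils down to a subset assertion along these lines; the column shapes (in particular the (timestamp, id) tuples in `committed_cdc_generations`) and table layout are assumptions based on the names in the text:
```python
rows = await cql.run_async(
    "SELECT committed_cdc_generations FROM system.topology")
committed = {gen_id for r in rows if r.committed_cdc_generations
             for _, gen_id in r.committed_cdc_generations}
stored = {r.id for r in await cql.run_async(
    "SELECT id FROM system.cdc_generations_v3")}
# Every committed generation must still have its data in CDC_GENERATIONS_V3.
assert committed <= stored
```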
When we create a CDC generation and ring-delay is non-zero, the
timestamp of the new generation is in the future. Hence, we can
have multiple generations that can be written to. However, if we
add a new node to the cluster with the Raft-based topology, it
receives only the last committed generation. So, this node will
be rejecting writes considered correct by the other nodes until
the last committed generation starts operating.
In scylladb/scylladb#17134, we have allowed sending writes to the
previous CDC generations. So, the situation became even more
complicated. We need to adjust the Raft-based topology to ensure
all required generations are loaded into memory and their data
isn't cleared too early.
This patch is the first step of the adjustment. We replace
`current_cdc_generation_{uuid, timestamp}` with the set containing
IDs of all committed generations - `committed_cdc_generations`.
This set is sorted by timestamps, just like
`unpublished_cdc_generations`.
This patch is mostly refactoring. The last generation in
`committed_cdc_generations` is the equivalent of the previous
`current_cdc_generation_{uuid, timestamp}`. The other generations
are irrelevant for now. They will be used in the following patches.
After introducing `committed_cdc_generations`, a newly committed
generation is also unpublished (it was current and unpublished
before the patch). We introduce `add_new_committed_cdc_generation`,
which updates both sets of generations so that we don't have to
call `add_committed_cdc_generation` and
`add_unpublished_cdc_generation` together. It's easy to forget
that both of them are necessary. Before this patch, there was
no call to `add_unpublished_cdc_generation` in
`topology_coordinator::build_coordinator_state`. It was a bug
reported in scylladb/scylladb#17288. This patch fixes it.
This patch also removes "the current generation" notion from the
Raft-based topology. For the Raft-based topology, the current
generation was the last committed generation. However, for the
`cdc::metadata`, it was the generation operating now. These two
generations could be different, which was confusing. For the
`cdc::metadata`, the current generation is relevant as it is
handled differently, but for the Raft-based topology, it isn't.
Therefore, we change only the Raft-based topology. The generation
called "current" is called "the last committed" from now.
Adds three tests for the new upgrade procedure:
- test_topology_upgrade - upgrades a cluster operating in legacy mode to
use raft topology operations,
- test_topology_recovery_basic - performs recovery on a three-node
cluster, no node removal is done,
- test_topology_majority_loss - simulates a majority loss scenario, i.e.
removes two nodes out of three, performs recovery to rebuild the
raft topology state and re-add two nodes back.
The issues mentioned in the comment before are already fixed.
Unfortunately, there is another, opposite issue that this function can
be used to work around. The previous issue was about the existing driver
session not reconnecting; the current issue is about the existing driver
session reconnecting too much (and in the middle of queries).
We move all used util functions from topology_raft_disabled to
topology before we remove topology_raft_disabled. After this
change, util.py in topology will be the single util file for all
topology tests.
Some util functions in topology_raft_disabled aren't used anymore.
We don't move such functions and remove them instead.
The `reconnect_driver` function will be useful outside the
`topology_raft_disabled` test suite - namely, for cluster feature tests
in `topology`. The best course of action for this function would be to
put it into pylib utils; however, the function depends on ManagerClient,
which is defined in `test.pylib.manager_client`, which in turn depends on
`test.pylib.utils`; therefore we cannot put it there, as it would cause
an import cycle. The `topology.utils` module sounds like the next best
thing.
In addition, the docstring comment is updated to reflect that this
function will now be used to work around another issue as well.
`RandomTables.verify_schema` is often called in topology tests after
performing a schema change. It compares the schema tables fetched from
some node to the expected latest schema stored by the `RandomTables`
object.
However, there's no guarantee that the latest schema change has already
propagated to the node which we query. We could have performed the
schema change on a different node and the change may not have been
applied yet on all nodes.
To fix that, pick a specific node and perform a read barrier on it, then
use that node to fetch the schema tables.
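A sketch of the fix, assuming the harness's usual helpers for pinning a host and issuing a read barrier on it; the helper names, the `host=` routing parameter, and the awaitable `run_async` wrapper are assumptions:
```python
import time

async def fetch_schema_from(manager, cql, server):
    # Pin a specific node and make sure it has caught up with group0
    # before reading its schema tables.
    host = (await wait_for_cql_and_get_hosts(cql, [server], time.time() + 60))[0]
    await read_barrier(cql, host)  # assumed barrier helper targeting that host
    return await cql.run_async(
        "SELECT * FROM system_schema.tables", host=host)
    # The caller compares the returned rows against the latest schema
    # stored by the RandomTables object.
```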
Fixes #13788
Closes #13789
Raft replication doesn't guarantee that all replicas see
identical Raft state at all times; it only guarantees the
same order of events on all replicas.
When comparing raft state with gossip state on a node, first
issue a read barrier to ensure the node has the latest raft state.
To issue a read barrier it is sufficient to run a DDL statement that
alters non-existent state: in order to validate the DDL, the node needs
to sync with the leader and fetch the latest group0 state.
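One way such a barrier can be issued from the tests, sketched under the assumption of the harness's awaitable `run_async` wrapper and a `host` routing parameter; the exact no-op statement is illustrative:
```python
async def read_barrier(cql, host) -> None:
    """Make `host` catch up with the latest group0 state. Validating a
    schema-altering statement forces a sync with the Raft leader, even
    if the statement itself changes nothing."""
    # Illustrative no-op DDL: the table can never exist, so nothing is
    # dropped, but validation still performs the barrier on `host`.
    await cql.run_async("DROP TABLE IF EXISTS nosuchkeyspace.nosuchtable",
                        host=host)
```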
Fixes#13518 (flaky topology test).
Closes#13756