Adds a util for measuring the disk usage of the given table on the given
node.
Will be used in a follow-up patch for testing that sstables are freed
properly.
(cherry picked from commit 7741491b47)
This PR resolves an issue with double counting of topology test results; they will no longer appear twice in the consolidated report.
Another fix provides a better view of which test failed by enriching the test case name in the report with the mode and run id, making the names unique across the run.
The scope of this change is:
1. Modify the test name to include the run id
2. Add handlers that collect the test.py and pytest logs related to a single test into one file, rather than into the log of the full suite
3. Stop aggregating topology tests at the suite level in the JUnit results
4. Add a link to the logs of a failed test in the JUnit results, making it easier to navigate to all logs related to that test
5. Gather logs related to a failed test into one directory for easier investigation
Ref: scylladb/scylladb#17851
Closes scylladb/scylladb#18277
introduce ManagerClient.test_finished_event
to block access to the REST client object from the test once
the ManagerClient.after_test method has been called
(i.e. test teardown has started)
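A minimal sketch of the idea (only test_finished_event and after_test come from this commit; the guarding property and the exception are illustrative):
```python
import asyncio


class ManagerClient:
    """Sketch only: the real class lives in the test framework."""

    def __init__(self) -> None:
        # Set once the test body has finished and teardown has begun.
        self.test_finished_event = asyncio.Event()
        self._rest_client = None        # created elsewhere in the real code

    @property
    def client(self):
        # Refuse to hand out the REST client once after_test() has run, so a
        # stray task cannot talk to a cluster that is already being torn down.
        if self.test_finished_event.is_set():
            raise RuntimeError("test already finished, manager REST client unavailable")
        return self._rest_client

    async def after_test(self) -> None:
        self.test_finished_event.set()
        # ... rest of the teardown ...
```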
Restarting a node amounts to just shutting it down and then starting
again. There is no good reason to have a dedicated endpoint in the
ScyllaClusterManager for restarting when it can be implemented by
calling two endpoints in a sequence: stop and start - it's just code
duplication.
Remove the server_restart endpoint in ScyllaClusterManager and
reimplement it as two endpoint calls in the ManagerClient.
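On the ManagerClient side the replacement boils down to something like this (a sketch; the endpoint paths and helper names are assumptions based on the description above):
```python
class ManagerClient:
    # ... sketch, other members elided ...

    async def server_stop(self, server_id: int) -> None:
        await self.client.put_json(f"/cluster/server/{server_id}/stop")

    async def server_start(self, server_id: int) -> None:
        await self.client.put_json(f"/cluster/server/{server_id}/start")

    async def server_restart(self, server_id: int) -> None:
        # No dedicated /restart endpoint anymore: a restart is simply a
        # stop followed by a start, issued from the client.
        await self.server_stop(server_id)
        await self.server_start(server_id)
```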
In this PR we add timeouts support to raft groups registry. We introduce
the `raft_server_with_timeouts` class, which wraps the `raft::server`
and exposes its interface with an additional `raft_timeout` parameter. If
it is set, the wrapper aborts the operation via the `abort_source` after a
certain amount of time. The value of the timeout can be specified either
in the `raft_timeout` parameter, or the default value can be set in the
`raft_server_with_timeouts` class constructor.
The `raft_group_registry` interface is extended with
`group0_with_timeouts()` method. It returns an instance of
`raft_server_with_timeouts` for group0 raft server. The timeout value
for it is configured in `create_server_for_group0`. It's one minute by
default and can be overridden for tests with the
`group0-raft-op-timeout-in-ms` parameter.
The new api allows the client to decide whether to use timeouts or not.
In this PR we review all the group0 call sites and add
`raft_timeout` where it makes sense. The general principle is that if the
code is handling a client request and the client expects a potential
error, we use timeouts. We don't use timeouts for background fibers
(such as topology coordinator), since they wouldn't add much value. The
only thing the background fiber can do with a timeout is to retry, and
this will have the same end effect as not having a timeout at all.
Fixes scylladb/scylladb#16604
Closes scylladb/scylladb#17590
* github.com:scylladb/scylladb:
migration_manager: use raft_timeout{}
storage_service::join_node_response_handler: use raft_timeout{}
storage_service::start_upgrade_to_raft_topology: use raft_timeout{}
storage_service::set_tablet_balancing_enabled: use raft_timeout{}
storage_service::move_tablet: use raft_timeout{}
raft_check_and_repair_cdc_streams: use raft_timeout{}
raft_timeout: test that node operations fail properly
raft_rebuild: use raft_timeout{}
do_cluster_cleanup: use raft_timeout{}
raft_initialize_discovery_leader: use raft_timeout{}
update_topology_with_local_metadata: use with_timeout{}
raft_decommission: use raft_timeout{}
raft_removenode: use raft_timeout{}
join_node_request_handler: add raft_timeout to make_nonvoters and add_entry
raft_group0: make_raft_config_nonvoter: add raft_timeout parameter
raft_group0: make_raft_config_nonvoter: add abort_source parameter
manager_client: server_add with start=false shouldn't call driver_connect
scylla_cluster: add seeds parameter to the add_server and servers_add
raft_server_with_timeouts: report the lost quorum
join_node_request_handler: add raft_timeout{} for start_operation
skip_mode: add platform_key
auth: use raft_timeout{}
raft_group0_client: add raft_timeout parameter
raft_group_registry: add group0_with_timeouts
utils: add composite_abort_source.hh
error_injection: move api registration to set_server_init
error_injection: add inject_parameter method
error_injection: move injection_name string into injection_shared_data
error_injection: pass injection parameters at startup
Fixes #16912
By default, ScyllaDB stores the maintenance socket in the workdir. Test.py by default places the ScyllaDB workdir at testlog/{mode}/scylla-#. The usual location for cloning the repo is the user's home folder. In some cases, this can make the socket path too long and tests start to fail. The simple fix is to move the maintenance socket to the /tmp folder to eliminate this possibility.
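For context, Unix domain socket paths on Linux are limited to about 108 bytes (the size of sockaddr_un.sun_path), so a deeply nested workdir can push the socket path over the limit while a /tmp path stays short. An illustrative comparison (the paths here are made up):
```python
import os

SUN_PATH_MAX = 108   # sizeof(sockaddr_un.sun_path) on Linux

# Hypothetical workdir-based socket path vs. a /tmp-based one.
workdir_socket = os.path.expanduser(
    "~/work/scylladb/testlog/debug/scylla-42/maintenance_socket")
tmp_socket = "/tmp/scylla-test-42/maintenance_socket"

print(len(workdir_socket), len(tmp_socket))
assert len(tmp_socket) < SUN_PATH_MAX
```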
Closes scylladb/scylladb#17941
If the server is not started, there is no point
in starting the driver; it would fail because there
are no nodes to connect to. On the other hand, we
should connect the driver in server_start()
if it's not connected yet.
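Roughly, the intended flow is as follows (a sketch; only server_add, server_start, driver_connect and get_cql appear in the surrounding commits, the private helpers are illustrative):
```python
class ManagerClient:
    # ... sketch, other members elided ...

    async def server_add(self, start: bool = True, **kwargs):
        server = await self._request_server_add(**kwargs)    # illustrative helper
        if start:
            await self.server_start(server.server_id)
        # With start=False we deliberately do not touch the driver: there is
        # nothing to connect to yet, so driver_connect() would only fail.
        return server

    async def server_start(self, server_id) -> None:
        await self._request_server_start(server_id)          # illustrative helper
        if self.get_cql() is None:                           # illustrative check
            # Connect the driver lazily once at least one node is running.
            await self.driver_connect()
```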
If this parameter is set, we use its value for
the scylla.yaml of the new node, otherwise we
use IPs of all running nodes as before.
We'll need this parameter in subsequent commits to
restrict the communication between nodes.
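A sketch of the resulting behaviour (only the seeds parameter comes from this commit; the helper is illustrative):
```python
def effective_seeds(seeds: list[str] | None, running_ips: list[str]) -> list[str]:
    # Explicit seeds win; otherwise fall back to the previous behaviour of
    # seeding the new node with the IPs of all currently running nodes.
    return list(seeds) if seeds else list(running_ips)


# e.g. the value written into the new node's scylla.yaml seed list:
#   ",".join(effective_seeds(seeds, [srv.ip_addr for srv in cluster.running]))
```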
We remove default values for _create_server_add_data parameters
since they are redundant - in the two call sites we pass all
of them.
In the test, we use the group0-raft-op-timeout-in-ms parameter to
reduce the timeout to one second so as not to waste time.
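How the parameter reaches the node is not spelled out here; one plausible shape in a topology test is an extra start-up option, e.g. (hypothetical snippet, exact plumbing may differ):
```python
# Shrink the group0 operation timeout to one second so the failing request
# times out quickly instead of after the default one minute.
await manager.server_add(cmdline=["--group0-raft-op-timeout-in-ms", "1000"])
```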
The join_node_request_handler method contains other group0 calls
which should have timeouts (make_nonvoters and add_entry). They
will be handled in a separate commit.
The initial version used a redundant method, and it did not cover all
cases, which led to flakiness in the test that used this method.
Switching to the cluster_con() method removes the flakiness since it is
written more robustly.
Fixes scylladb/scylladb#17914
Closes scylladb/scylladb#17932
Fix writing cassandra-rackdc.properties in the correct properties format instead of YAML
Add a parameter to overwrite the RF for a specific DC
Add the possibility to connect CQL to a specific node
In this PR, 4 tests were added to test multi-DC functionality. One comes from the initial commit in which the multi-DC possibility was introduced, but it was never committed. The other three are migrated from dtest and will be deleted there later. To be able to execute the migrated tests, additional functionality is added: the ability to connect CQL to a specific node in the cluster instead of using a pooled connection, and the possibility to overwrite the replication factor for a specific DC. To be able to use multi-DC in test.py, the issue with the incorrect format of the properties file is fixed in this PR.
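A sketch of the properties-format fix: cassandra-rackdc.properties is a plain key=value properties file, so it should be written line by line rather than serialized with yaml.dump() (the helper name below is illustrative):
```python
def write_rackdc_properties(path: str, dc: str, rack: str) -> None:
    # Plain properties format, e.g.:
    #   dc=dc1
    #   rack=rack1
    with open(path, "w") as f:
        f.write(f"dc={dc}\n")
        f.write(f"rack={rack}\n")
```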
Closes scylladb/scylladb#17503
Fix aiohttp usage issue in python 3.12:
"Timeout context manager should be used inside a task"
This occurs because the UnixRESTClient is created in one event loop (created
inside pytest) but used in another (created by the rewritten event_loop
fixture); it is now fixed by recreating the UnixRESTClient object for every
new loop.
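A minimal sketch of the fix (UnixRESTClient comes from the test framework; the caching attributes and helper name are illustrative):
```python
import asyncio

class ManagerClient:
    # ... sketch, other members elided ...

    def _ensure_client(self):
        loop = asyncio.get_event_loop()
        if self._client is None or self._client_loop is not loop:
            # aiohttp objects are tied to the loop they were created in, so a
            # client built under pytest's own loop cannot be used from the
            # per-test event_loop fixture; rebuild it for the current loop.
            self._client = UnixRESTClient(self.sock_path)
            self._client_loop = loop
        return self._client
```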
Closes scylladb/scylladb#17760
This is a speculative fix as the problem is observed only on CI.
When run_async is called right after driver_connect and get_cql
it fails with ConnectionException('Host has been marked down or
removed').
If the approach proves to be successful we can start to deprecate
the base get_cql in favor of get_ready_cql. It's better to have robust
testing helper libraries than try to take care of it in every test
case separately.
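A sketch of what a get_ready_cql helper could look like, assuming a utility that waits until the driver sees the given hosts as up (the helper name and timeout are assumptions):
```python
import time

async def get_ready_cql(manager, servers):
    # driver_connect()/get_cql() alone can race with the driver's host-state
    # bookkeeping and fail with "Host has been marked down or removed";
    # wait until the driver actually sees the hosts before returning.
    await manager.driver_connect()
    cql = manager.get_cql()
    hosts = await wait_for_cql_and_get_hosts(cql, servers, time.time() + 60)
    return cql, hosts
```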
Fixes #17713
Closes scylladb/scylladb#17772
With auth-v2 we can log in even if quorum is lost. So the test
which checks that an error occurs in such a situation is deleted,
and the opposite test, which checks that logging in works, is
added.
Add error handling to rebuild instead of retrying it until it succeeds.
* 'gleb/rebuild-fail-v2' of github.com:scylladb/scylla-dev:
test: add test for rebuild failure
test: add expected_error to rebuild_node operation
topology_coordinator: Propagate rebuild failure to the initiator
Similar to the existing update_config(). Updates the command-line
arguments of the specified nodes, merging the new options into the
existing ones. Needs a restart to take effect.
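Hypothetical usage, mirroring update_config() (the method name and argument shape are assumptions, since the commit message does not spell them out):
```python
# Merge an extra command-line option into the node's existing ones;
# it only takes effect after the node is restarted.
await manager.server_update_cmdline(server.server_id,
                                    ["--logger-log-level", "raft=trace"])
await manager.server_restart(server.server_id)
```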
When stopping the ManagerClient, it would be better to close
all connected connectors, otherwise aiohttp complains like:
```
13:57:53.763 ERROR> Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7f939d2ca5f0>, 96672.211256817)]']
connector: <aiohttp.connector.UnixConnector object at 0x7f939d2da890>
```
This warning message is printed to the console, and it is distracting
when testing manually.
So, in this change, let's close the client connected to the Unix domain
socket.
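The fix is essentially to close the aiohttp session (and with it the UnixConnector) when the client is stopped; a sketch:
```python
import aiohttp

class UnixRESTClient:
    # Sketch only: the real client lives in the test framework.
    def __init__(self, sock_path: str) -> None:
        self._session = aiohttp.ClientSession(
            connector=aiohttp.UnixConnector(path=sock_path))

    async def close(self) -> None:
        # Closing the session closes its connector as well, which silences
        # the "Unclosed connector" warning on shutdown.
        await self._session.close()
```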
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16675
The removenode operation is defined to succeed only if the node
being removed is dead. Currently, we reject this operation on the
initiator side (in `storage_service::raft_removenode`) when the
failure detector considers the node being removed alive. However,
it is possible that even if the initiator considers the node dead,
the topology coordinator will consider it alive when handling the
topology request. For example, the topology coordinator can use
a bigger failure detector timeout, or the node being removed can
suddenly resurrect.
This PR makes the topology coordinator reject removenode if the
node being removed is considered alive. It also adds
`test_remove_alive_node` that verifies this change.
Fixes scylladb/scylladb#16109
Closes scylladb/scylladb#16584
* github.com:scylladb/scylladb:
test: add test_remove_alive_node
topology_coordinator: reject removenode if the removed node is alive
test: ManagerClient: remove unused wait_for_host_down
test: remove_node: wait until the node being removed is dead
The previous commit removed the only call to wait_for_host_down.
Moreover, this function is identical to server_not_sees_other_server.
We can safely remove it.
In the following commits, we make the topology coordinator reject
removenode requests if the node being removed is considered alive
by the gossiper. Before making this change, we need to adapt the
testing framework so that we don't have flaky removenode operations
that fail because the node being removed hasn't been marked as dead
yet. We achieve this by waiting until all other running nodes see
the node being removed as dead in all removenode operations.
Some tests are simplified after this change because they don't have
to call server_not_sees_other_server anymore.
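In the test framework this amounts to something like the following before issuing removenode (a sketch; the signature of server_not_sees_other_server is assumed):
```python
import asyncio

async def wait_until_removed_node_is_dead(manager, running_servers, removed):
    # Every other running node must report the node being removed as down;
    # otherwise the coordinator may still consider it alive and the
    # removenode operation becomes flaky.
    await asyncio.gather(*(
        manager.server_not_sees_other_server(srv.ip_addr, removed.ip_addr)
        for srv in running_servers if srv.server_id != removed.server_id))
```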
After restarting each node, we should wait for other nodes to notice
the node is UP before restarting the next server. Otherwise, the next
node we restart may not send the shutdown notification to the
previously restarted node if it still sees it as down when we
initiate its shutdown. In this case, the node will learn about the
restart from gossip later, possibly when we have already started CQL
requests. When a node learns that some node restarted while it
considers it UP, it will close connections to that node. This will
fail RPCs sent to that node, which will cause CQL requests to time out.
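A sketch of the rolling-restart loop this implies (the "sees other server as up" helper is hypothetical, a positive counterpart of server_not_sees_other_server):
```python
async def rolling_restart(manager, servers):
    for i, srv in enumerate(servers):
        await manager.server_restart(srv.server_id)
        # Before touching the next node, make sure everyone already sees
        # this one as UP again, so later shutdown notifications are not lost.
        for other in servers[:i] + servers[i + 1:]:
            await manager.server_sees_other_server(other.ip_addr, srv.ip_addr)  # hypothetical helper
```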
Fixes #14746
Closes scylladb/scylladb#16010
Scylla can be configured to use different IPs for the internode communication
and client connections. This test allocates and configures unique IP addresses
for the client connections (`rpc_address`) for a 2-node cluster.
Two scenarios are tested:
1) Change RPC IPs sequentially
2) Change RPC IPs simultaneously
Closes scylladb/scylladb#15965
We add a new function - servers_add - that allows adding multiple
servers concurrently to a cluster. It makes use of a concurrent
bootstrap now supported in the raft-based topology.
servers_add doesn't have the replace_cfg parameter. The reason is
that we don't support concurrent replace operations, at least for
now.
There is an implementation detail in ScyllaCluster.add_servers. We
cannot simply do multiple calls to add_server concurrently. If we
did that in an empty cluster, every node would take itself as the
only seed and start a new cluster. To solve this, we introduce a
new field - initial_seed. It is used to choose one of the servers
as a seed for all servers added concurrently to an empty cluster.
Note that the add_server calls in asyncio.gather in add_servers
cannot race with each other when setting initial_seed because
there is only one thread.
In the future, we will also start all initial servers concurrently
in ScyllaCluster.install_and_start. The changes in this commit were
designed in a way that will make changing install_and_start easy.
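A condensed sketch of the seed-selection logic described above (the private helpers are illustrative):
```python
import asyncio

class ScyllaCluster:
    # ... sketch, other members elided ...
    def __init__(self) -> None:
        self.running: dict = {}                 # server_id -> running server
        self.initial_seed: str | None = None

    async def add_server(self, seeds: list[str] | None = None):
        ip = self._allocate_ip()                # illustrative helper
        if not seeds:
            if self.running:
                seeds = [srv.ip_addr for srv in self.running.values()]
            else:
                # Empty cluster: all servers added concurrently must share one
                # seed, otherwise each would bootstrap its own cluster. The
                # first add_server call claims the role; asyncio.gather runs
                # on a single thread, so this check-and-set cannot race.
                if self.initial_seed is None:
                    self.initial_seed = ip
                seeds = [self.initial_seed]
        return await self._create_and_start(ip, seeds)   # illustrative helper

    async def add_servers(self, servers_num: int):
        return await asyncio.gather(*(self.add_server() for _ in range(servers_num)))
```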
In the following commits, we make the topology coordinator reject
join requests if the node being replaced is considered alive by the
gossiper. Before making this change, we need to adapt the testing
framework so that we don't have flaky replace operations that fail
because the node being replaced hasn't been marked as dead yet. We
achieve this by waiting until all other running nodes see the node
being replaced as dead in all replace operations.
After this change, if we try to add a server and it fails with an
expected error, the add_server function will not throw. Also, the
server will be correctly installed and stopped.
Two issues motivate this feature.
The first one is that if we want to add a server while expecting
an error, we have to do it in two steps:
- call server_add with the start parameter set to False,
- call server_start with the expected_error parameter.
It is quite inconvenient.
The second one is that we want to be able to test the replace
operation when it is considered incorrect, for example when we try
to replace an alive node. To do this, we would have to remove
some assertions from ScyllaCluster.add_server. However, we should
not remove them because they give us clear information when we
write an incorrect test. After adding the expected_error parameter,
we can ignore these assertions only when we expect an error. In
this way, we enable testing failing replace operations without
sacrificing the testing framework's protection.
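A simplified sketch of the resulting control flow in ScyllaCluster.add_server (the helper names are illustrative):
```python
async def add_server(self, replace_cfg=None, expected_error: str | None = None):
    server = await self._install_server(replace_cfg)        # illustrative helper
    try:
        await server.start()
    except Exception as e:
        if expected_error is None or expected_error not in str(e):
            raise
        # The failure was anticipated: swallow it and leave the server
        # installed but stopped, so the test can inspect or restart it later.
        await server.stop()
        return server
    assert expected_error is None, \
        f"server started although an error matching '{expected_error}' was expected"
    return server
```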
This commit adds the auth_cluster test suite to test a custom scenario
involving password authentication:
- create a cluster of 2 nodes with password authentication
- down one node
- the other node should refuse login stating that it couldn't reach
QUORUM
References ScyllaDB OSS #2339
You can now pass `expected_error` to `ManagerClient.decommission_node`
and `ManagerClient.remove_node`. Useful in combination with error
injections, for example.
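Hypothetical usage (the error text is made up):
```python
# The decommission is expected to fail (e.g. due to a previously enabled
# error injection); the call succeeds only if the reported error contains
# the given text.
await manager.decommission_node(server.server_id,
                                expected_error="connection is closed")
```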
Closes scylladb/scylladb#15650
In 20ff2ae5e1 mutating endpoints were
changed to use PUT. But some of them return a response, and I forgot to
provide the `response_type` parameter to `put_json` (which causes
`RESTClient` to actually obtain the response). These endpoints now
return `None`.
Fix this.
Closes scylladb/scylladb#15674
Some endpoint handlers return JSON, some return text, some return empty
responses.
Reduce the number of different handler types by making the text case a
subcase of the JSON case. This also simplifies some code on the
`ManagerClient` side, which would have to deserialize data from text
(because some endpoint handlers would serialize data into text for no
particular reason). And it will allow reducing boilerplate in later
commits even further.
`ScyllaClusterManager` registers a bunch of HTTP endpoints which
`ManagerClient` uses to perform operations on a cluster during
a topology test.
The endpoints were using HTTP verbs inconsistently, e.g. using GET for
endpoints that have side effects. Use PUT for these.
This reverts commit 628e6ffd33, reversing
changes made to 45ec76cfbf.
The test included with this PR is flaky and often breaks CI.
Revert while a fix is found.
Fixes: #15371
Add a class that handles log file browsing with the following features:
* mark: returns "a mark" pointing to the current position in the log.
* wait_for: asynchronously checks if the log contains the given message.
* grep: returns a list of lines matching the regular expression in the log.
Add a new endpoint in `ManagerClient` to obtain the scylla logfile path.
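A minimal sketch of such a class (the method names come from the list above; the class name and polling details are illustrative):
```python
import asyncio
import re

class ScyllaLogFile:
    def __init__(self, path: str) -> None:
        self.path = path

    def mark(self) -> int:
        # "A mark" is simply the current end-of-file offset.
        with open(self.path, "rb") as f:
            f.seek(0, 2)                        # SEEK_END
            return f.tell()

    def grep(self, expr: str, from_mark: int = 0) -> list[str]:
        pattern = re.compile(expr)
        with open(self.path, "rb") as f:
            f.seek(from_mark)
            return [line for line in (raw.decode(errors="replace") for raw in f)
                    if pattern.search(line)]

    async def wait_for(self, expr: str, from_mark: int = 0, timeout: float = 600) -> None:
        # Poll the log until a matching line appears or the timeout expires.
        async def poll() -> None:
            while not self.grep(expr, from_mark):
                await asyncio.sleep(0.1)
        await asyncio.wait_for(poll(), timeout)
```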
Fixes #14782
Closes #14834
We add support for `--ignore-dead-nodes` in `raft_removenode` and
`--ignore-dead-nodes-for-replace` in `raft_replace`. For now, we allow
passing only host ids of the ignored nodes. Supporting IPs is currently
impossible because `raft_address_map` doesn't provide a mapping from IP
to a host id.
The main steps of the implementation are as follows:
- add the `ignore_nodes` column to `system.topology`,
- set the `ignore_nodes` value of the topology mutation in `raft_removenode` and `raft_replace`,
- extend `service::request_param` with alternative types that allow storing a set of ids of the ignored nodes,
- load `ignore_nodes` from `system.topology` into `request_param` in `system_keyspace::load_topology_state`,
- add `ignore_nodes` to `exclude_nodes` in `topology_coordinator::exec_global_command`,
- pass `ignore_nodes` to `replace_with_repair` and `remove_with_repair` in `storage_service::raft_topology_cmd_handler`.
Additionally, we add `test_raft_ignore_nodes.py` with two tests that verify the added changes.
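On the test side this enables calls along these lines (a sketch; exact signatures may differ):
```python
# Remove a dead node while telling the coordinator to ignore another dead
# node, identified by its host id (IPs are not supported, see above).
ignored_host_id = await manager.get_host_id(other_dead.server_id)
await manager.remove_node(initiator.server_id, removed.server_id,
                          ignore_dead=[ignored_host_id])
```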
Fixes #15025
Closes #15113
* github.com:scylladb/scylladb:
test: add test_raft_ignore_nodes
test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
raft topology: pass ignore_nodes to {replace, remove}_with_repair
raft topology: exec_global_command: add ignore_nodes to exclude_nodes
raft topology: exec_global_command: change type of exclude_nodes
topology_state_machine: extend request_param with a set of raft ids
raft topology: set ignore_nodes in raft_removenode and raft_replace
utils: introduce split_comma_separated_list
raft topology: add the ignore_nodes column to system.topology