scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-29 04:37:00 +00:00

Author	SHA1	Message	Date
Botond Dénes	e55f475db1	Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun Recently we enabled RBNO by default in all topology operations. This made the operations a bit slower (repair-based topology ops are a bit slower than classic streaming - they do more work), and in debug mode with a large number of concurrent tests running, they might time out. The timeout for bootstrap was already increased before; do the same for decommission/removenode. The previously used timeout was 300 seconds (this is the default used by the aiohttp library when it makes HTTP requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which is 1000 seconds. Closes #12765 * github.com:scylladb/scylladb: test/pylib: use larger timeout for decommission/removenode test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT	2023-02-13 16:30:24 +02:00
Nadav Har'El	2653865b34	Merge 'test.py: improve test failure handling' from Kamil Braun Improve logging by printing the cluster at the end of each test. Stop performing operations like attempting queries or dropping keyspaces on dirty clusters. Dirty clusters might be completely dead and these operations would only cause more "errors" to happen after a failed test, making it harder to find the real cause of failure. Mark cluster as dirty when a test that uses it fails - after a failed test, we shouldn't assume that the cluster is in a usable state, so we shouldn't reuse it for another test. Rely on the `is_dirty` flag in `PythonTest`s and `CQLApprovalTest`s, similarly to what `TopologyTest`s do. Closes #12652 * github.com:scylladb/scylladb: test.py: rely on ScyllaCluster.is_dirty flag for recycling clusters test/topology: don't drop random_tables keyspace after a failed test test/pylib: mark cluster as dirty after a failed test test: pylib, topology: don't perform operations after test on a dirty cluster test/pylib: print cluster at the end of test	2023-02-12 12:13:25 +02:00
Kamil Braun	54f85c641d	test/pylib: use larger timeout for decommission/removenode Recently we enabled RBNO by default in all topology operations. This made the operations a bit slower (repair-based topology ops are a bit slower than classic streaming - they do more work), and in debug mode with a large number of concurrent tests running, they might time out. The timeout for bootstrap was already increased before; do the same for decommission/removenode. The previously used timeout was 300 seconds (this is the default used by the aiohttp library when it makes HTTP requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which is 1000 seconds.	2023-02-10 15:56:31 +01:00
Kamil Braun	fde6ad5fc0	test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT Use a more generic name since the constant will also be used as timeout for decommission and removenode.	2023-02-10 15:56:31 +01:00
Kamil Braun	ca4db9bb72	Merge 'test/raft: test snapshot threshold' from Alecco Force snapshot with schema changes while server down. Then verify schema when bringing back up the server. Closes #12726 * github.com:scylladb/scylladb: pytest/topology: check snapshot transfer raft conf error injection for snapshot test/pylib: one-shot error injection helper	2023-02-10 15:24:46 +01:00
Asias He	fc60484422	test: Increase START_TIMEOUT It is observed that CI machine is slow to run the test. Increase the timeout of adding servers.	2023-02-03 21:15:08 +08:00
Alejo Sanchez	9ceb6aba81	test/pylib: one-shot error injection helper Existing helper with async context manager only worked for non one-shot error injections. Fix it and add another helper for one-shot without a context manager. Fix tests using the previous helper. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-02-02 16:37:21 +01:00
Kamil Braun	a9dbd89478	test/pylib: mark cluster as dirty after a failed test We don't expect the cluster to be functioning at all after a failed test. The whole cluster might have crashed, for example. In these situations the framework would report multiple errors (one for the actual failure, another for a failed post-condition check because the cluster was down) which would only obscure the report and make debugging harder. It's also not safe in general to reuse the cluster in another test - if the previous test failed, we should not assume that it's in a valid state. Therefore, mark the cluster as dirty after a failed test. This will let us recycle the cluster based on the dirty flag and it will disable the post-condition check after a failed test (which is only done on non-dirty clusters). To implement this in topology tests, we use the `pytest_runtest_makereport` hook which executes after a test finishes but before fixtures finish. There we store a test-failed flag in a stash provided by pytest, then access the flag in the `manager` fixture.	2023-02-02 16:35:55 +01:00
Kamil Braun	977375d13f	test: pylib, topology: don't perform operations after test on a dirty cluster `after_test` would count keyspaces and check that the number is the same as before the test started. The `random_tables` fixture after a test would drop the keyspace that it created before the test. These steps are done to ensure that the cluster is ready to be reused for the next steps. If the cluster is dirty, it cannot be reused anyway, so the steps are unnecessary. They might also be impossible in general - a dirty cluster might be completely dead. For example, the attempts to drop a keyspace from `random_tables` would cause confusing errors if a test failed when it tried to restart a node while all nodes were down, making it harder to find the 'real' failure. Therefore don't perform these operations if the cluster is dirty.	2023-02-02 15:59:02 +01:00
Kamil Braun	f4b56cddde	test/pylib: print cluster at the end of test - print the cluster used by the test in `after_test` - if cluster setup fails in `before_test`, print the cluster together with the exception (`after_test` is not executed if `before_test` fails)	2023-02-02 15:59:02 +01:00
Asias He	6d7b4a896e	test: Increase max-networking-io-control-blocks The number is too low in the test and we saw an "rpc: Connection is closed" error. Increase the number to the default 1000.	2023-02-02 11:11:22 +08:00
Nadav Har'El	132af20057	Merge 'test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests' from Kamil Braun `ScyllaClusterManager` is used to run a sequence of test cases from a single test file. Between two consecutive tests, if the previous test left the cluster 'dirty', meaning the cluster cannot be reused, it would free up space in the pool (using `steal`), stop the cluster, then get a new cluster from the pool. Between the `steal` and the `get`, a concurrent test run (with its own instance of `ScyllaClusterManager`) would start, because there was free space in the pool. This resulted in undesirable behavior when we ran tests with `--repeat X` for a large `X`: we would start with e.g. 4 concurrent runs of a test file, because the pool size was 4. As soon as one of the runs freed up space in the pool, we would start another concurrent run. Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so on. We would have a large number of concurrent runs, even though the original 4 runs didn't finish yet. All of these concurrent runs would compete waiting on the pool, and waiting for space in the pool would take longer and longer (the duration is linear w.r.t number of concurrent competing runs). Tests would then time out because they would have to wait too long. Fix that by using the new `replace_dirty` function introduced to the pool. This function frees up space by returning a dirty cluster and then immediately takes it away to be used for a new cluster. Thanks to this, we will only have at most as many concurrent runs as the pool size. For example with --repeat 8 and pool size 4, we would run 4 concurrent runs and start the 5th run only when one of the original 4 runs finishes, then the 6th run when a second run finishes and so on. 
The fix is preceded by a refactor that replaces `steal` with `put(is_dirty=True)` and a `destroy` function passed to the pool (now the pool is responsible for stopping the cluster and releasing its IPs). Fixes #11757 Closes #12549 * github.com:scylladb/scylladb: test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests test/pylib: pool: introduce `replace_dirty` test/pylib: pool: replace `steal` with `put(is_dirty=True)`	2023-02-01 12:37:39 +02:00
Nadav Har'El	681a066923	test/pylib: put UNIX-domain socket in /tmp The "cluster manager" used by the topology test suite uses a UNIX-domain socket to communicate between the cluster manager and the individual tests. The socket is currently located in the test directory but there is a problem: In Linux the length of the path used as a UNIX-domain socket address is limited to just a little over 100 bytes. In Jenkins run, the test directory names are very long, and we sometimes go over this length limit and the result is that test.py fails creating this socket. In this patch we simply put the socket in /tmp instead of the test directory. We only need to do this change in one place - the cluster manager, as it already passes the socket path to the individual tests (using the "--manager-api" option). Tested by cloning Scylla in a very long directory name. A test like ./test.py --mode=dev test_concurrent_schema fails before this patch, and passes with it. Fixes #12622 Closes #12678	2023-02-01 12:37:35 +03:00
Kamil Braun	5eadea301e	Merge 'pytest: start after ungraceful stop' from Alecco If a server is stopped suddenly (i.e. not graceful), schema tables might be in inconsistent state. Add a test case and enable Scylla configuration option (force_schema_commit_log) to handle this. Fixes #12218 Closes #12630 * github.com:scylladb/scylladb: pytest: test start after ungraceful stop test.py: enable force_schema_commit_log	2023-01-26 12:08:33 +01:00
Kamil Braun	3eabe04f5d	test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests `ScyllaClusterManager` is used to run a sequence of test cases from a single test file. Between two consecutive tests, if the previous test left the cluster 'dirty', meaning the cluster cannot be reused, it would put the old cluster to the pool with `is_dirty=True`, then get a new cluster from the pool. Between the `put` and the `get`, a concurrent test run (with its own instance of `ScyllaClusterManager`) would start, because there was free space in the pool. This resulted in undesirable behavior when we ran tests with `--repeat X` for a large `X`: we would start with e.g. 4 concurrent runs of a test file, because the pool size was 4. As soon as one of the runs freed up space in the pool, we would start another concurrent run. Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so on. We would have a large number of concurrent runs, even though the original 4 runs didn't finish yet. All of these concurrent runs would compete waiting on the pool, and waiting for space in the pool would take longer and longer (the duration is linear w.r.t number of concurrent competing runs). Tests would then time out because they would have to wait too long. Fix that by using the new `replace_dirty` function introduced to the pool. This function frees up space by returning a dirty cluster and then immediately takes it away to be used for a new cluster. Thanks to this, we will only have at most as many concurrent runs as the pool size. For example with --repeat 8 and pool size 4, we would run 4 concurrent runs and start the 5th run only when one of the original 4 runs finishes, then the 6th run when a second run finishes and so on. Fixes #11757	2023-01-26 11:58:00 +01:00
Kamil Braun	b5ef57ecc2	test/pylib: pool: introduce `replace_dirty` Used to atomically return a dirty object to the pool and then use the space freed by this object to get another object. Unlike `put(is_dirty=True)` followed by `get`, a concurrent waiter cannot take away our space from us. A piece of `get` was refactored to a private function `_build_and_get`, this piece is also used in `replace_dirty`.	2023-01-26 11:58:00 +01:00
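
The difference between `put(is_dirty=True)` followed by `get` and a single `replace_dirty` call can be sketched with a small asyncio pool. This is a simplified illustration and not the test/pylib implementation: the `total` counter, the condition variable and the `demo` helper are assumptions made for the example; only the names `get`, `put`, `replace_dirty`, `_build_and_get`, `build` and `destroy` come from the commit messages.

```python
import asyncio
from typing import Awaitable, Callable, Generic, List, TypeVar

T = TypeVar("T")


class Pool(Generic[T]):
    """Bounded async pool; a simplified sketch, not the test/pylib code."""

    def __init__(self, max_size: int,
                 build: Callable[[], Awaitable[T]],
                 destroy: Callable[[T], Awaitable[None]]) -> None:
        self.max_size = max_size
        self.build = build
        self.destroy = destroy
        self.free: List[T] = []
        self.total = 0  # slots in use: free + borrowed + being built
        self.cond = asyncio.Condition()

    async def _build_and_get(self) -> T:
        # The caller has already reserved a slot in self.total.
        try:
            return await self.build()
        except Exception:
            async with self.cond:
                self.total -= 1
                self.cond.notify()
            raise

    async def get(self) -> T:
        async with self.cond:
            await self.cond.wait_for(
                lambda: bool(self.free) or self.total < self.max_size)
            if self.free:
                return self.free.pop()
            self.total += 1  # reserve the slot before building
        return await self._build_and_get()

    async def put(self, obj: T, is_dirty: bool = False) -> None:
        if is_dirty:
            await self.destroy(obj)
        async with self.cond:
            if is_dirty:
                self.total -= 1  # the slot becomes visible to waiters
            else:
                self.free.append(obj)
            self.cond.notify()

    async def replace_dirty(self, obj: T) -> T:
        # Destroy the dirty object but keep its slot reserved: self.total is
        # never decremented, so a concurrent get() cannot grab the space.
        await self.destroy(obj)
        return await self._build_and_get()


async def demo() -> tuple:
    counter = iter(range(100))
    destroyed = []

    async def build() -> int:
        return next(counter)

    async def destroy(obj: int) -> None:
        destroyed.append(obj)

    pool = Pool(1, build, destroy)
    first = await pool.get()
    waiter = asyncio.ensure_future(pool.get())  # blocks: the pool is full
    await asyncio.sleep(0)
    second = await pool.replace_dirty(first)
    await asyncio.sleep(0)
    still_waiting = not waiter.done()
    await pool.put(second)
    third = await waiter
    return first, second, third, still_waiting, destroyed
```

In `demo`, the concurrent waiter stays blocked across `replace_dirty` and is served only once the replacement object is returned with `put`, which is the property the commit relies on to cap the number of concurrent runs at the pool size.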
Kamil Braun	858803cc2c	test/pylib: pool: replace `steal` with `put(is_dirty=True)` The pool usage was kind of awkward previously: if the user of a pool decided that a previously borrowed object should no longer be used, it was their responsibility to destroy the object (releasing associated resources and so on) and then call `steal()` on the pool to free space for a new object. Change the interface. Now the `Pool` constructor obtains a `destroy` function in addition to the `build` function. The user calls the function `put` to return both objects that are still usable and those that aren't. For the latter, they set `is_dirty=True`. The pool will 'destroy' the object with the provided function, which could mean e.g. releasing associated resources. For example, instead of:
```
if self.cluster.is_dirty:
    self.clusters.stop()
    self.clusters.release_ips()
    self.clusters.steal()
else:
    self.clusters.put(self.cluster)
```
we can now use:
```
self.clusters.put(self.cluster, is_dirty=self.cluster.is_dirty)
```
(assuming that `self.clusters` is a pool constructed with a `destroy` function that stops the cluster and releases its IPs.) Also extend the interface of the context manager obtained by `instance()` - the user must now pass a flag `dirty_on_exception`. If the context manager exits due to an exception and that flag was `True`, the object will be considered dirty. The dirty flag can also be set manually on the context manager. For example:
```
async with (cm := pool.instance(dirty_on_exception=True)) as server:
    cm.dirty = await run_test(test, server)
    # It will also be considered dirty if run_test throws an exception
```
	2023-01-26 11:58:00 +01:00
Kamil Braun	a0ff33e777	test/pylib: scylla_cluster: don't leak server if stopping it fails `ScyllaCluster.server_stop` had this piece of code:
```
server = self.running.pop(server_id)
if gracefully:
    await server.stop_gracefully()
else:
    await server.stop()
self.stopped[server_id] = server
```
We observed `stop_gracefully()` failing due to a server hanging during shutdown. We then ended up in a state where neither `self.running` nor `self.stopped` had this server. Later, when releasing the cluster and its IPs, we would release that server's IP - but the server might have still been running (all servers in `self.running` are killed before releasing IPs, but this one wasn't in `self.running`). Fix this by popping the server from `self.running` only after `stop_gracefully`/`stop` finishes. Make an analogous fix in `server_start`: put `server` into `self.running` before we actually start it. If the start fails, the server will be considered "running" even though it isn't necessarily, but that is OK - if it isn't running, then trying to stop it later will simply do nothing; if it is actually running, we will kill it (which we should do) when clearing after the cluster; and we don't leak it. Closes #12613	2023-01-25 16:58:02 +02:00
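
The ordering fix described above (pop from `self.running` only after the stop has finished) might look roughly like the following sketch. `FakeServer`, `Cluster` and `demo` are hypothetical stand-ins invented for the example; the real `ScyllaCluster.server_stop` also handles graceful stops, which are left out here.

```python
import asyncio
from typing import Dict


class FakeServer:
    """Hypothetical stand-in for a server whose shutdown may fail."""

    def __init__(self, fail_on_stop: bool = False) -> None:
        self.fail_on_stop = fail_on_stop

    async def stop(self) -> None:
        if self.fail_on_stop:
            raise RuntimeError("server hung during shutdown")


class Cluster:
    def __init__(self) -> None:
        self.running: Dict[int, FakeServer] = {}
        self.stopped: Dict[int, FakeServer] = {}

    async def server_stop(self, server_id: int) -> None:
        server = self.running[server_id]  # look up, but do not pop yet
        await server.stop()               # may raise
        # Only reached when the stop succeeded.
        self.stopped[server_id] = self.running.pop(server_id)


async def demo() -> tuple:
    cluster = Cluster()
    cluster.running = {1: FakeServer(), 2: FakeServer(fail_on_stop=True)}
    await cluster.server_stop(1)
    try:
        await cluster.server_stop(2)
    except RuntimeError:
        pass
    return sorted(cluster.running), sorted(cluster.stopped)
```

After `demo`, server 2 is still tracked in `running`, so a later cleanup pass that kills every running server before releasing IPs would still see it.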
Alejo Sanchez	ccbd89f0cd	test.py: enable force_schema_commit_log To handle start after ungraceful stop, enable separate schema commit log from server start. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-01-25 14:49:27 +01:00
Alejo Sanchez	f236d518c6	test.py: manual cluster handling for PythonSuite Instead of complex async with logic, use manual cluster pool handling. Revert the discard() logic in Pool from a recent commit. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-01-24 11:38:17 +01:00
Nadav Har'El	ccc2c6b5dd	Merge 'test/pylib: scylla_cluster: improve server startup check' from Kamil Braun Don't use a range scan, which is very inefficient, to perform a query for checking CQL availability. Improve logging when waiting for server startup times out. Provide details about the failure: whether we managed to obtain the Host ID of the server and whether we managed to establish a CQL connection. Closes #12588 * github.com:scylladb/scylladb: test/pylib: scylla_cluster: better logging for timeout on server startup test/pylib: scylla_cluster: use less expensive query to check for CQL availability	2023-01-23 17:00:52 +02:00
Kamil Braun	8a1ea6c49f	test/pylib: scylla_cluster: better logging for timeout on server startup Waiting for server startup is a multi-step procedure: after we start the actual process, we will: - try to obtain the Host ID (by querying a REST API endpoint) - then try to connect a CQL session - then try to perform a CQL query The steps are repeated every .1 second until we reach a timeout (the Host ID step is skipped if we previously managed to obtain it). On timeout we'd only get a generic "failed to start server" message, it wouldn't say what we managed to do and what not. For example, on one of the failed jobs on Jenkins I observed this timeout error. Looking at the logs of the server, it turned out that the server printed the "initialization completed" message more than 2 minutes before the actual timeout happened. So for 2 minutes, the test framework either couldn't obtain the Host ID, or couldn't establish a CQL connection, or couldn't perform a CQL query, but I wasn't able to determine fully which one of these was the case. Improve the code by printing whether we managed to get the Host ID of the server and if so - whether we managed to connect to CQL.	2023-01-23 15:59:42 +01:00
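
A polling loop that remembers how far the startup check got could be sketched as below. The three callables are hypothetical placeholders for the REST query, the CQL connection attempt and the CQL query; the function name, the error wording and the use of `TimeoutError` are assumptions for the example and not the test/pylib code.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


async def wait_for_startup(get_host_id: Callable[[], Awaitable[Optional[str]]],
                           connect_cql: Callable[[], Awaitable[bool]],
                           query_cql: Callable[[], Awaitable[bool]],
                           timeout: float, period: float = 0.1) -> str:
    host_id: Optional[str] = None
    connected = False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if host_id is None:            # skipped once the Host ID is known
            host_id = await get_host_id()
        if host_id is not None:
            connected = await connect_cql()
            if connected and await query_cql():
                return host_id
        await asyncio.sleep(period)
    # On timeout, report the last step that succeeded.
    if host_id is None:
        detail = "could not obtain the Host ID"
    elif not connected:
        detail = f"got Host ID {host_id} but could not connect to CQL"
    else:
        detail = f"got Host ID {host_id} and connected to CQL, but the query failed"
    raise TimeoutError(f"failed to start server: {detail}")
```

The point of tracking `host_id` and `connected` outside the loop is that the timeout message can then name the step that never succeeded, instead of a generic failure.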
Kamil Braun	0e591606a5	test/pylib: scylla_cluster: use less expensive query to check for CQL availability The previous CQL query used a range scan which is very inefficient, even for local tables. Also add a comment explaining why we need this query.	2023-01-23 15:59:05 +01:00
Nadav Har'El	54f174a1f4	Merge 'test.py: handle broken clusters for Python suite' from Alecco If the after test check fails (is_after_test_ok is False), discard the cluster and raise exception so context manager (pool) does not recycle it. Ignore exception re-raised by the context manager. Fixes #12360 Closes #12569 * github.com:scylladb/scylladb: test.py: handle broken clusters for Python suite test.py: Pool discard method	2023-01-22 19:58:12 +02:00
Alejo Sanchez	c886a05b37	test.py: Pool discard method Add a context manager discard() method to tell it to discard the object. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-01-19 21:43:45 +01:00
Kamil Braun	2f84e820fd	test/pylib: scylla_cluster: return error details from test framework endpoints If an endpoint handler throws an exception, the details of the exception are not returned to the client. Normally this is desirable so that information is not leaked, but in this test framework we do want to return the details to the client so it can log a useful error message. Do it by wrapping every handler into a catch clause that returns the exception message. Also modify a bit how HTTPErrors are rendered so it's easier to discern the actual body of the error from other details (such as the params used to make the request etc.) Before:
```
E test.pylib.rest_client.HTTPError: HTTP error 500: 500 Internal Server Error
E
E Server got itself in trouble, params None, json None, uri http+unix://api/cluster/before-test/test_stuff
```
After:
```
E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/before-test/test_stuff, params: None, json: None, body:
E Failed to start server at host 127.155.129.1.
E Check the log files:
E /home/kbraun/dev/scylladb/testlog/test.py.dev.log
E /home/kbraun/dev/scylladb/testlog/dev/scylla-1.log
```
Closes #12563	2023-01-19 17:47:13 +02:00
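
Wrapping every handler in a catch clause that returns the exception message can be sketched with a decorator. This is a framework-free illustration: the `(status, body)` tuple stands in for a real HTTP response object, and `return_error_details` and the sample `before_test` handler are hypothetical names, not the ones used in test/pylib.

```python
import asyncio
import functools
from typing import Awaitable, Callable, Tuple

# (HTTP status, body); a stand-in for a real response type.
Response = Tuple[int, str]


def return_error_details(handler: Callable[..., Awaitable[Response]]):
    """Turn any exception raised by the handler into a 500 with its message."""
    @functools.wraps(handler)
    async def wrapper(*args, **kwargs) -> Response:
        try:
            return await handler(*args, **kwargs)
        except Exception as exc:
            return 500, str(exc)
    return wrapper


@return_error_details
async def before_test(test_name: str) -> Response:
    # Hypothetical handler that always fails, to show the wrapped result.
    raise RuntimeError(f"Failed to start server for {test_name}")
```

Awaiting `before_test("test_stuff")` then yields a 500 response whose body carries the original message, which the client can log.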
Kamil Braun	3ed3966f13	test/pylib: scylla_cluster: release cluster IPs when stopping ScyllaClusterManager When we obtained a new cluster for a test case after the previous test case left a dirty cluster, we would release the old cluster's used IP addresses (`_before_test` function). However, we would not release the last cluster's IP after the last test case. We would run out of IPs with sufficiently many test files or `--repeat` runs. Fix this. Also reorder the operations a bit: stop the cluster (and release its IPs) before freeing up space in the cluster pool (i.e. call `self.cluster.stop()` before `self.clusters.steal()`). This reduces concurrency a bit - fewer Scyllas running at the same time, which is good (the pool size gives a limit on the desired max number of concurrently running clusters). Killing a cluster is quick so it won't make a significant difference for the next guy waiting on the pool. Closes #12564	2023-01-19 17:46:46 +02:00
Kamil Braun	147dd73996	test/pylib: scylla_cluster: mark cluster as dirty if it fails to boot If a cluster fails to boot, it saves the exception in `self.start_exception` variable; the exception will be rethrown when a test tries to start using this cluster. As explained in `before_test`:
```
def before_test(self, name) -> None:
    """Check that the cluster is ready for a test. If there was a start error, throw it here - the server is running when it's added to the pool, which can't be attributed to any specific test, throwing it here would stop a specific test."""
```
It's arguable whether we should blame some random test for a failure that it didn't cause, but nevertheless, there's a problem here: the `start_exception` will be rethrown and the test will fail, but then the cluster will be simply returned to the pool and the next test will attempt to use it... and so on. Prevent this by marking the cluster as dirty the first time we rethrow the exception. Closes #12560	2023-01-19 14:26:57 +02:00
Avi Kivity	9029b8dead	test: disable commitlog O_DSYNC, preallocation Commitlog O_DSYNC is intended to make Raft and schema writes durable in the face of power loss. To make O_DSYNC performant, we preallocate the commitlog segments, so that the commitlog writes only change file data and not file metadata (which would require the filesystem to commit its own log). However, in tests, this causes each ScyllaDB instance to write 384MB of commitlog segments. This overloads the disks and slows everything down. Fix this by disabling O_DSYNC (and therefore preallocation) during the tests. They can't survive power loss, and run with --unsafe-bypass-fsync anyway. Closes #12542	2023-01-19 11:14:05 +01:00
Avi Kivity	aab5954cfb	Merge 'reader_concurrency_semaphore: add more layers of defense against OOM' from Botond Dénes The reader concurrency semaphore has no mechanism to limit the memory consumption of already admitted reads. Once the collective memory consumption of all the admitted reads is above the limit, all it can do is to not admit any more. Sometimes this is not enough and the memory consumption of the already admitted reads balloons to the point of OOMing the node. This pull-request offers a solution to this: it introduces two more layers of defense above this: a soft and a hard limit. Both are multipliers applied on the semaphore's normal memory limit. When the soft limit threshold is surpassed, all readers but one are blocked via a new blocking `request_memory()` call which is used by the `tracking_file_impl`. The reader to be allowed to proceed is chosen at random, it is the first reader which happens to request memory after the limit is surpassed. This is both very simple and should avoid situations where the algorithm choosing the reader to be allowed to proceed chooses a reader which will then always time out. When the hard limit threshold is surpassed, `reader_concurrency_semaphore::consume()` starts throwing `std::bad_alloc`. This again will result in eliminating whichever reader was unlucky enough to request memory at the right moment. With this, the semaphore is now effectively enforcing an upper bound for memory consumption, defined by the hard limit. 
Refs: https://github.com/scylladb/scylladb/issues/11927 Closes #11955 * github.com:scylladb/scylladb: test: reader_concurrency_semaphore_test: add tests for semaphore memory limits reader_permit: expose operator<<(reader_permit::state) reader_permit: add id() accessor reader_concurrency_semaphore: add foreach_permit() reader_concurrency_semaphore: document the new memory limits reader_concurrency_semaphore: add OOM killer reader_concurrency_semaphore: make consume() and signal() private test: stop using reader_concurrency_semaphore::{consume,signal}() directly reader_concurrency_semaphore: move consume() out-of-line reader_permit: consume(): make it exception-safe reader_permit: resource_units::reset(): only call consume() if needed reader_concurrency_semaphore: tracked_file_impl: use request_memory() reader_concurrency_semaphore: add request_memory() reader_concurrency_semaphore: wrap wait list reader_concurrency_semaphore: add {serialize,kill}_limit_multiplier parameters test/boost/reader_concurrency_semaphore_test: dummy_file_impl: don't use hardoced buffer size reader_permit: add make_new_tracked_temporary_buffer() reader_permit: add get_state() accessor reader_permit: resource_units: add constructor for already consumed res reader_permit: resource_units: remove noexcept qualifier from constructor db/config: introduce reader_concurrency_semaphore_{serialize,kill}_limit_multiplier scylla-gdb.py: scylla-memory: extract semaphore stats formatting code scylla-gdb.py: fix spelling of "graphviz"	2023-01-18 17:02:55 +02:00
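
The relationship between the normal limit and the two multipliers can be modelled with a few lines of Python. This is only a toy model of the thresholds described above, not the C++ semaphore: the class name, the state labels and the multiplier values used in the usage note are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SemaphoreLimits:
    memory_limit: int            # the semaphore's normal memory limit, in bytes
    serialize_multiplier: float  # soft limit = memory_limit * serialize_multiplier
    kill_multiplier: float       # hard limit = memory_limit * kill_multiplier

    def classify(self, consumed: int) -> str:
        """Return which layer of defense applies at this memory consumption."""
        if consumed > self.memory_limit * self.kill_multiplier:
            return "kill"          # consume() would throw std::bad_alloc
        if consumed > self.memory_limit * self.serialize_multiplier:
            return "serialize"     # all readers but one block in request_memory()
        if consumed > self.memory_limit:
            return "no-admission"  # admitted reads continue, new ones are not admitted
        return "admit"
```

With made-up values such as `SemaphoreLimits(100, 2, 4)`, consumption of 150 only stops admission, 250 serializes the readers, and 450 crosses the hard limit.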
Tomasz Grabiec	563998b69a	Merge 'raft: improve group 0 reconfiguration failure handling' from Kamil Braun Make it so that failures in `removenode`/`decommission` don't lead to reduced availability, and any leftovers in group 0 can be removed by `removenode`: - In `removenode`, make the node a non-voter before removing it from the token ring. This removes the possibility of having a group 0 voting member which doesn't correspond to a token ring member. We can still be left with a non-voter, but that doesn't reduce the availability of group 0. - As above but for `decommission`. - Make it possible to remove group 0 members that don't correspond to token ring members from group 0 using `removenode`. - Add an API to query the current group 0 configuration. Fixes #11723. Closes #12502 * github.com:scylladb/scylladb: test: test_topology: test for removing garbage group 0 members test/pylib: move some utility functions to util.py db: system_keyspace: add a virtual table with raft configuration db: system_keyspace: improve system.raft_snapshot_config schema service: storage_service: better error handling in `decommission` service: storage_service: fix indentation in removenode service: storage_service: make `removenode` work for group 0 members which are not token ring members service/raft: raft_group0: perform read_barrier in wait_for_raft service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove service/raft: raft_group0: link to Raft docs where appropriate service/raft: raft_group0: more logging service/raft: raft_group0: separate function for checking and waiting for Raft	2023-01-17 21:23:15 +01:00
Kamil Braun	d134c458e5	test/pylib: increase timeout when waiting for cluster before test Increase the timeout from default 5 minutes to 10 minutes. Sent as a workaround for #12546 to unblock next promotions. Closes #12547	2023-01-17 21:03:09 +02:00
Kamil Braun	c959ec455a	test/pylib: move some utility functions to util.py They were used in test_raft_upgrade, but we want to use them in other test files too.	2023-01-17 12:28:00 +01:00
Botond Dénes	7eb093899a	db/config: introduce reader_concurrency_semaphore_{serialize,kill}_limit_multiplier Will be propagated to reader concurrency semaphores. Not wired in yet.	2023-01-16 02:05:27 -05:00
Benny Halevy	90faeedb77	test: test_topology: test replace using host_id Add test cases exercising the --replace-node-first-boot option by replacing nodes using their host_id rather than ip address. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:36:09 +02:00
Benny Halevy	7d0d9e28f1	test: pylib: ServerInfo: add host_id Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:36:07 +02:00
Kamil Braun	79712185d5	test/pylib: additional logging during cluster setup This would have saved me a lot of debugging time.	2023-01-11 10:09:42 +01:00
Kamil Braun	4f7e5ee963	test/pylib: prefix cluster/manager logs with the current test name The log file produced by test.py combines logs coming from multiple concurrent test runs. Each test has its own log file as well, but this "global" log file is useful when debugging problems with topology tests, since many events related to managing clusters are stored there. Make the logs easier to read by including information about the test case that's currently performing operations such as adding new servers to clusters and so on. This includes the mode, test run name and the name of the test case. We do this by using custom `Logger` objects (instead of calling `logging.info` etc. which uses the root logger) with `LoggerAdapter`s that include the prefixes. A bit of boilerplate 'plumbing' through function parameters is required but it's mostly straightforward. This doesn't apply to all events, e.g. boost test cases which don't setup a "real" Scylla cluster. These events don't have additional prefixes. Example:
```
17:41:43.531 INFO> [dev/topology.test_topology.1] Cluster ScyllaCluster(name: 7a414ffc-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(1, 127.40.246.1, 29c4ec73-8912-45ca-ae19-8bfda701a6b5), ScyllaServer(4, 127.40.246.4, 75ae2afe-ff9b-4760-9e19-cd0ed8d052e7), ScyllaServer(7, 127.40.246.7, 67a27df4-be63-4b4c-a70c-aeac0506304f), stopped: ) adding server...
17:41:43.531 INFO> [dev/topology.test_topology.1] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-10...
17:41:43.603 INFO> [dev/topology.test_topology.1] starting server at host 127.40.246.10 in scylla-10...
17:41:43.614 INFO> [dev/topology.test_topology.2] Cluster ScyllaCluster(name: 7a497fce-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(2, 127.40.246.2, f59d3b1d-efbb-4657-b6d5-3fa9e9ef786e), ScyllaServer(5, 127.40.246.5, 9da16633-ce53-4d32-8687-e6b4d27e71eb), ScyllaServer(9, 127.40.246.9, e60c69cd-212d-413b-8678-dfd476d7faf5), stopped: ) adding server...
17:41:43.614 INFO> [dev/topology.test_topology.2] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-11...
17:41:43.670 INFO> [dev/topology.test_topology.2] starting server at host 127.40.246.11 in scylla-11...
```
	2023-01-11 10:09:39 +01:00
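
Prefixing messages through a `LoggerAdapter` rather than the root logger can be sketched with the standard `logging` module alone. `PrefixAdapter`, `make_test_logger`, the logger name `pylib_sketch` and the `demo` helper are hypothetical names for this example; the actual adapter and log format in test/pylib may differ.

```python
import io
import logging


class PrefixAdapter(logging.LoggerAdapter):
    """Prepend a fixed prefix (taken from `extra`) to every message."""

    def process(self, msg, kwargs):
        return f"[{self.extra['prefix']}] {msg}", kwargs


def make_test_logger(base: logging.Logger, mode: str,
                     run_name: str) -> logging.LoggerAdapter:
    return PrefixAdapter(base, {"prefix": f"{mode}/{run_name}"})


def demo() -> str:
    stream = io.StringIO()
    base = logging.getLogger("pylib_sketch")  # a dedicated logger, not the root one
    base.setLevel(logging.INFO)
    base.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s> %(message)s"))
    base.addHandler(handler)
    log = make_test_logger(base, "dev", "topology.test_topology.1")
    log.info("starting server at host %s", "127.40.246.10")
    base.removeHandler(handler)
    return stream.getvalue()
```

Because the adapter has to be handed to whatever code emits the messages, this approach explains the "plumbing through function parameters" mentioned in the commit message.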
Kamil Braun	2bda0f9830	test/pylib: pool: pass *args and **kwargs to the build function from get() This will be used to specify a custom logger when building new clusters before starting tests, making it easy to pinpoint which tests are waiting for clusters to be built and what's happening to these particular clusters.	2023-01-10 17:41:54 +01:00
Kamil Braun	822410c49b	test/pylib: scylla_cluster: release IPs when cluster is no longer needed With sufficiently many test cases we would eventually run out of IP addresses, because IPs (which are leased from a global host registry) would only be released at the end of an entire test suite. In fact we already hit this during next promotions, causing much pain indeed. Release IPs when a cluster, after being marked dirty, is stopped and thrown away. Closes #12482	2023-01-10 06:59:41 +02:00
Alejo Sanchez	d632e1aa7a	test/pytest: add missing import, remove unused import Add missed import time and remove unused name import. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #12446	2023-01-08 17:38:46 +02:00
Kamil Braun	09da661eeb	Merge 'raft: replace experimental raft option with dedicated flag' from Gleb Natapov Unlike other experimental features, we want Raft to be opt-in even after it leaves experimental mode. For that we need to have a separate option to enable it. The patch adds the binary option "consistent-cluster-management" for that. * 'consistent-cluster-management-flag' of github.com:scylladb/scylla-dev: raft: replace experimental raft option with dedicated flag main: move supervisor notification about group registry start where it actually starts	2023-01-05 15:21:35 +01:00
Kamil Braun	4268b1bbc2	Merge 'raft: raft_group0, register RPC verbs on all shards' from Gusev Petr raft_group0 used to register RPC verbs only on shard 0. This worked on clusters with the same --smp setting on all nodes, since RPCs in this case are processed on the same shard as the calling code, and raft_group0 methods only run on shard 0. A new test test_nodes_with_different_smp was added to identify the problem. Since --smp can only be specified via the command line, a corresponding parameter was added to the ManagerClient.server_add method. It allows to override the default parameters set by the SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting individual items. Fixes: #12252 Closes #12374 * github.com:scylladb/scylladb: raft: raft_group0, register RPC verbs on all shards raft: raft_append_entries, copy entries to the target shard test.py, allow to specify the node's command line in test	2023-01-04 11:11:21 +01:00
Petr Gusev	1c23390f12	test.py, allow to specify the node's command line in test An optional parameter cmdline has been added to the ManagerClient.server_add method. It allows you to override the default parameters set by the SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting individual items. To change or add a parameter, specify its name and value one after the other. To remove a parameter, use the special keyword __remove__ as its value. To set a parameter without a value (such as --overprovisioned), use the special keyword __missing__ as its value.	2023-01-03 15:24:54 +03:00
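The override rules described in commit 1c23390f12 can be sketched as a small merge function. This is an illustration under stated assumptions, not the implementation in test.py: the function names `parse_options` and `merge_cmdline` are hypothetical, the defaults are assumed to be a flat list of strings, and `--example-flag` and `--other-flag` are made-up option names.

```python
from typing import Dict, List, Optional

REMOVE = "__remove__"    # special value: drop the option
MISSING = "__missing__"  # special value: option takes no value


def parse_options(args: List[str]) -> Dict[str, Optional[str]]:
    """Turn a flat argument list into an ordered name -> value mapping."""
    opts: Dict[str, Optional[str]] = {}
    i = 0
    while i < len(args):
        name = args[i]
        # Assumption: a following item not starting with "--" is this option's value.
        if i + 1 < len(args) and not args[i + 1].startswith("--"):
            opts[name] = args[i + 1]
            i += 2
        else:
            opts[name] = None
            i += 1
    return opts


def merge_cmdline(defaults: List[str], overrides: List[str]) -> List[str]:
    """Apply name/value override pairs on top of the default options."""
    if len(overrides) % 2 != 0:
        raise ValueError("overrides must be a list of name/value pairs")
    opts = parse_options(defaults)
    for name, value in zip(overrides[0::2], overrides[1::2]):
        if value == REMOVE:
            opts.pop(name, None)
        elif value == MISSING:
            opts[name] = None
        else:
            opts[name] = value
    result: List[str] = []
    for name, value in opts.items():
        result.append(name)
        if value is not None:
            result.append(value)
    return result


defaults = ["--smp", "2", "--overprovisioned", "--example-flag", "x"]
merged = merge_cmdline(
    defaults,
    ["--smp", "4", "--example-flag", REMOVE, "--other-flag", MISSING],
)
print(merged)
```

Using sentinel strings rather than `None` keeps the override list a flat sequence of strings, which is presumably easier to pass through an HTTP API between the test and the cluster manager.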
Gleb Natapov	1688163233	raft: replace experimental raft option with dedicated flag Unlike other experimental features, we want raft to be optional even after it leaves experimental mode. For that we need a separate option to enable it. The patch adds the binary option "consistent-cluster-management" for that.	2023-01-03 11:15:11 +02:00
Alejo Sanchez	d408b711e3	test/python: increase CQL connection timeouts In very slow debug builds the default driver timeouts are too low and tests might fail. Bump up the values to more reasonable time. These timeout values are the same as used in topology tests. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #12405	2022-12-28 10:06:33 +02:00
Alejo Sanchez	1bfe234133	test/pylib: API get/set logger level of Scylla server Provide helpers to get and set logger level for Scylla servers. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #12394	2022-12-25 13:58:43 +02:00
Kamil Braun	3cd035d1b9	test/pylib: scylla_cluster: remove `ScyllaCluster.decommissioned` field The field was not used for anything. We can keep decommissioned server in `stopped` field. In fact it caused us a problem: since recently, we're using `ScyllaCluster.uninstall` to clean-up servers after test suite finishes (previously we were using `ScyllaServer.uninstall` directly). But `ScyllaCluster.uninstall` didn't look into the `decommissioned` field, so if a server got decommissioned, we wouldn't uninstall it, and it left us some unnecessary artifacts even for successful tests. This is now fixed. Closes #12163	2022-12-01 19:07:26 +02:00
Alejo Sanchez	f7aa08ef25	test.py: don't stop cluster's site if not started The site member is created in ScyllaCluster.start(), for startup failure this might not be initialized, so check it's present before stop()ing it. And delete it as it's not running and proper initialization should call ScyllaCluster.start(). Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #11939	2022-11-30 13:47:18 +02:00
Kamil Braun	2f60550ff3	test/pylib: scylla_cluster: support node replace operation The `add_server` function now takes an optional `ReplaceConfig` struct (implemented using `NamedTuple`), which specifies the ID of the replaced server and whether to reuse the IP address. If we want to reuse the IP address, we don't allocate one using the host registry. Since now multiple servers can have the same IP, introduce a `leased_ips` set to `ScyllaCluster` which is used when `uninstall`ing the cluster - to make sure we don't `release_host` the same host twice.	2022-11-24 16:26:23 +01:00
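The replace bookkeeping described in commit 2f60550ff3 can be sketched as follows. This is a simplified illustration, not the code from scylla_cluster: `ReplaceConfig` and `leased_ips` are named in the commit message, but the field names `replaced_id` and `reuse_ip_addr`, the classes `HostRegistrySketch` and `ClusterSketch`, and the IP numbering are all assumptions.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class ReplaceConfig(NamedTuple):
    replaced_id: int     # ID of the server being replaced (assumed field name)
    reuse_ip_addr: bool  # whether the new server takes over the old IP (assumed)


class HostRegistrySketch:
    """Stand-in for the global host registry that leases IP addresses."""

    def __init__(self) -> None:
        self._next = 1
        self.released: List[str] = []

    def lease_host(self) -> str:
        ip = f"127.0.0.{self._next}"
        self._next += 1
        return ip

    def release_host(self, ip: str) -> None:
        self.released.append(ip)


class ClusterSketch:
    def __init__(self, registry: HostRegistrySketch) -> None:
        self.registry = registry
        self.servers: Dict[int, str] = {}  # server ID -> IP address
        self.leased_ips: Set[str] = set()  # each leased IP appears once
        self._next_id = 1

    def add_server(self, replace_cfg: Optional[ReplaceConfig] = None) -> int:
        if replace_cfg is not None and replace_cfg.reuse_ip_addr:
            # Reuse the replaced server's IP; nothing new is leased.
            ip = self.servers[replace_cfg.replaced_id]
        else:
            ip = self.registry.lease_host()
            self.leased_ips.add(ip)
        server_id = self._next_id
        self._next_id += 1
        self.servers[server_id] = ip
        return server_id

    def uninstall(self) -> None:
        # Iterate over the set, not the servers, so a shared IP is released once.
        for ip in sorted(self.leased_ips):
            self.registry.release_host(ip)
        self.leased_ips.clear()


registry = HostRegistrySketch()
cluster = ClusterSketch(registry)
first = cluster.add_server()
second = cluster.add_server()
replacement = cluster.add_server(ReplaceConfig(replaced_id=first, reuse_ip_addr=True))
cluster.uninstall()
print(registry.released)
```

Releasing from the set rather than walking the server list is what prevents the double `release_host` call once two servers can share one address.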

169 Commits