Commit Graph

169 Commits

Author SHA1 Message Date
Botond Dénes
e55f475db1 Merge 'test/pylib: use larger timeout for decommission/removenode' from Kamil Braun
Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming - they do more work), and in debug mode
with large number of concurrent tests running, they might timeout.

The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.

Closes #12765

* github.com:scylladb/scylladb:
  test/pylib: use larger timeout for decommission/removenode
  test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT
2023-02-13 16:30:24 +02:00
Nadav Har'El
2653865b34 Merge 'test.py: improve test failure handling' from Kamil Braun
Improve logging by printing the cluster at the end of each test.

Stop performing operations like attempting queries or dropping keyspaces on dirty clusters. Dirty clusters might be completely dead and these operations would only cause more "errors" to happen after a failed test, making it harder to find the real cause of failure.

Mark cluster as dirty when a test that uses it fails - after a failed test, we shouldn't assume that the cluster is in a usable state, so we shouldn't reuse it for another test.

Rely on the `is_dirty` flag in `PythonTest`s and `CQLApprovalTest`s, similarly to what `TopologyTest`s do.

Closes #12652

* github.com:scylladb/scylladb:
  test.py: rely on ScyllaCluster.is_dirty flag for recycling clusters
  test/topology: don't drop random_tables keyspace after a failed test
  test/pylib: mark cluster as dirty after a failed test
  test: pylib, topology: don't perform operations after test on a dirty cluster
  test/pylib: print cluster at the end of test
2023-02-12 12:13:25 +02:00
Kamil Braun
54f85c641d test/pylib: use larger timeout for decommission/removenode
Recently we enabled RBNO by default in all topology operations. This
made the operations a bit slower (repair-based topology ops are a bit
slower than classic streaming - they do more work), and in debug mode
with large number of concurrent tests running, they might timeout.

The timeout for bootstrap was already increased before, do the same for
decommission/removenode. The previously used timeout was 300 seconds
(this is the default used by aiohttp library when it makes HTTP
requests), now use the TOPOLOGY_TIMEOUT constant from ScyllaServer which
is 1000 seconds.
2023-02-10 15:56:31 +01:00
Kamil Braun
fde6ad5fc0 test/pylib: scylla_cluster: rename START_TIMEOUT to TOPOLOGY_TIMEOUT
Use a more generic name since the constant will also be used as timeout
for decommission and removenode.
2023-02-10 15:56:31 +01:00
Kamil Braun
ca4db9bb72 Merge 'test/raft: test snapshot threshold' from Alecco
Force snapshot with schema changes while server down. Then verify schema when bringing back up the server.

Closes #12726

* github.com:scylladb/scylladb:
  pytest/topology: check snapshot transfer
  raft conf error injection for snapshot
  test/pylib: one-shot error injection helper
2023-02-10 15:24:46 +01:00
Asias He
fc60484422 test: Increase START_TIMEOUT
It is observed that CI machine is slow to run the test. Increase the
timeout of adding servers.
2023-02-03 21:15:08 +08:00
Alejo Sanchez
9ceb6aba81 test/pylib: one-shot error injection helper
Existing helper with async context manager only worked for non one-shot
error injections. Fix it and add another helper for one-shot without a
context manager.

Fix tests using the previous helper.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-02-02 16:37:21 +01:00
Kamil Braun
a9dbd89478 test/pylib: mark cluster as dirty after a failed test
We don't expect the cluster to be functioning at all after a failed
test. The whole cluster might have crashed, for example. In these
situations the framework would report multiple errors (one for the
actual failure, another for a failed post-condition check because the
cluster was down) which would only obscure the report and make debugging
harder. It's also not safe in general to reuse the cluster in another
test - if the test previous failed, we should not assume that it's in a
valid state.

Therefore, mark the cluster as dirty after a failed test. This will let
us recycle the cluster based on the dirty flag and it will disable
post-condition check after a failed test (which is only done on
non-dirty clusters).

To implement this in topology tests, we use the
`pytest_runtest_makereport` hook which executes after a test finishes
but before fixtures finish. There we store a test-failed flag in a stash
provided by pytest, then access the flag in the `manager` fixture.
2023-02-02 16:35:55 +01:00
Kamil Braun
977375d13f test: pylib, topology: don't perform operations after test on a dirty cluster
`after_test` would count keyspaces and check that the number is the same
as before the test started. The `random_tables` fixture after a test
would drop the keyspace that it created before the test.

These steps are done to ensure that the cluster is ready to be reused
for the next steps. If the cluster is dirty, it cannot be reused anyway,
so the steps are unnecessary. They might also be impossible in general
- a dirty cluster might be completely dead. For example, the attempts to
drop a keyspace from `random_tables` would cause confusing errors
if a test failed when it tried to restart a node while all nodes
were down,  making it harder to find the 'real' failure.

Therefore don't perform these operations if the cluster is dirty.
2023-02-02 15:59:02 +01:00
Kamil Braun
f4b56cddde test/pylib: print cluster at the end of test
- print the cluster used by the test in `after_test`
- if cluster setup fails in `before_test`, print the cluster together
  with the exception (`after_test` is not executed if `before_test`
  fails)
2023-02-02 15:59:02 +01:00
Asias He
6d7b4a896e test: Increase max-networking-io-control-blocks
The number is too low in the test and we saw

rpc: Connection is closed error

Inrease the number to the default 1000.
2023-02-02 11:11:22 +08:00
Nadav Har'El
132af20057 Merge 'test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests' from Kamil Braun
`ScyllaClusterManager` is used to run a sequence of test cases from
a single test file. Between two consecutive tests, if the previous test
left the cluster 'dirty', meaning the cluster cannot be reused, it would
free up space in the pool (using `steal`), stop the cluster, then get a
new cluster from the pool.

Between the `steal` and the `get`, a concurrent test run (with its own
instance of `ScyllaClusterManager` would start, because there was free
space in the pool.

This resulted in undesirable behavior when we ran tests with
`--repeat X` for a large `X`: we would start with e.g. 4 concurrent
runs of a test file, because the pool size was 4. As soon as one of the
runs freed up space in the pool, we would start another concurrent run.
Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so
on. We would have a large number of concurrent runs, even though the
original 4 runs didn't finish yet. All of these concurrent runs would
compete waiting on the pool, and waiting for space in the pool would
take longer and longer (the duration is linear w.r.t number of
concurrent competing runs). Tests would then time out because they would
have to wait too long.

Fix that by using the new `replace_dirty` function introduced to the
pool. This function frees up space by returning a dirty cluster and then
immediately takes it away to be used for a new cluster. Thanks to this,
we will only have at most as many concurrent runs as the pool size. For
example with --repeat 8 and pool size 4, we would run 4 concurrent runs
and start the 5th run only when one of the original 4 runs finishes,
then the 6th run when a second run finishes and so on.

The fix is preceded by a refactor that replaces `steal` with `put(is_dirty=True)`
and a `destroy` function passed to the pool (now the pool is responsible
for stopping the cluster and releasing its IPs).

Fixes #11757

Closes #12549

* github.com:scylladb/scylladb:
  test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests
  test/pylib: pool: introduce `replace_dirty`
  test/pylib: pool: replace `steal` with `put(is_dirty=True)`
2023-02-01 12:37:39 +02:00
Nadav Har'El
681a066923 test/pylib: put UNIX-domain socket in /tmp
The "cluster manager" used by the topology test suite uses a UNIX-domain
socket to communicate between the cluster manager and the individual tests.
The socket is currently located in the test directory but there is a
problem: In Linux the length of the path used as a UNIX-domain socket
address is limited to just a little over 100 bytes. In Jenkins run, the
test directory names are very long, and we sometimes go over this length
limit and the result is that test.py fails creating this socket.

In this patch we simply put the socket in /tmp instead of the test
directory. We only need to do this change in one place - the cluster
manager, as it already passes the socket path to the individual tests
(using the "--manager-api" option).

Tested by cloning Scylla in a very long directory name.
A test like ./test.py --mode=dev test_concurrent_schema fails before
this patch, and passes with it.

Fixes #12622

Closes #12678
2023-02-01 12:37:35 +03:00
Kamil Braun
5eadea301e Merge 'pytest: start after ungraceful stop' from Alecco
If a server is stopped suddenly (i.e. not graceful), schema tables might
be in inconsistent state. Add a test case and enable Scylla
configuration option (force_schema_commit_log) to handle this.

Fixes #12218

Closes #12630

* github.com:scylladb/scylladb:
  pytest: test start after ungraceful stop
  test.py: enable force_schema_commit_log
2023-01-26 12:08:33 +01:00
Kamil Braun
3eabe04f5d test/pylib: scylla_cluster: ensure there's space in the cluster pool when running a sequence of tests
`ScyllaClusterManager` is used to run a sequence of test cases from
a single test file. Between two consecutive tests, if the previous test
left the cluster 'dirty', meaning the cluster cannot be reused, it would
put the old cluster to the pool with `is_dirty=True`, then get a new
cluster from the pool.

Between the `put` and the `get`, a concurrent test run (with its own
instance of `ScyllaClusterManager`) would start, because there was free
space in the pool.

This resulted in undesirable behavior when we ran tests with
`--repeat X` for a large `X`: we would start with e.g. 4 concurrent
runs of a test file, because the pool size was 4. As soon as one of the
runs freed up space in the pool, we would start another concurrent run.
Soon we'd end up with 8 concurrent runs. Then 16 concurrent runs. And so
on. We would have a large number of concurrent runs, even though the
original 4 runs didn't finish yet. All of these concurrent runs would
compete waiting on the pool, and waiting for space in the pool would
take longer and longer (the duration is linear w.r.t number of
concurrent competing runs). Tests would then time out because they would
have to wait too long.

Fix that by using the new `replace_dirty` function introduced to the
pool. This function frees up space by returning a dirty cluster and then
immediately takes it away to be used for a new cluster. Thanks to this,
we will only have at most as many concurrent runs as the pool size. For
example with --repeat 8 and pool size 4, we would run 4 concurrent runs
and start the 5th run only when one of the original 4 runs finishes,
then the 6th run when a second run finishes and so on.

Fixes #11757
2023-01-26 11:58:00 +01:00
Kamil Braun
b5ef57ecc2 test/pylib: pool: introduce replace_dirty
Used to atomically return a dirty object to the pool and then use the
space freed by this object to get another object. Unlike
`put(is_dirty=True)` followed by `get`, a concurrent waiter cannot take
away our space from us.

A piece of `get` was refactored to a private function `_build_and_get`,
this piece is also used in `replace_dirty`.
2023-01-26 11:58:00 +01:00
Kamil Braun
858803cc2c test/pylib: pool: replace steal with put(is_dirty=True)
The pool usage was kind of awkward previously: if the user of a pool
decided that a previously borrowed object should no longer be used,
it was their responsibility to destroy the object (releasing associated
resources and so on) and then call `steal()` on the pool to free space
for a new object.

Change the interface. Now the `Pool` constructor obtains a `destroy`
function additionally to the `build` function. The user calls the
function `put` to return both objects that are still usable and those
aren't. For the latter, they set `is_dirty=True`. The pool will
'destroy' the object with the provided function, which could mean e.g.
releasing associated resources.

For example, instead of:
```
if self.cluster.is_dirty:
    self.clusters.stop()
    self.clusters.release_ips()
    self.clusters.steal()
else:
    self.clusters.put(self.cluster)
```
we can now use:
```
self.clusters.put(self.cluster, is_dirty=self.cluster.is_dirty)
```
(assuming that `self.clusters` is a pool constructed with a `destroy`
function that stops the cluster and releases its IPs.)

Also extend the interface of the context manager obtained by
`instance()` - the user must now pass a flag `dirty_on_exception`. If
the context manager exists due to an exception and that flag was `True`,
the object will be considered dirty. The dirty flag can also be set
manually on the context manager. For example:
```
async with (cm := pool.instance(dirty_on_exception=True)) as server:
    cm.dirty = await run_test(test, server)
    # It will also be considered dirty if run_test throws an exception
```
2023-01-26 11:58:00 +01:00
Kamil Braun
a0ff33e777 test/pylib: scylla_cluster: don't leak server if stopping it fails
`ScyllaCluster.server_stop` had this piece of code:
```
        server = self.running.pop(server_id)
        if gracefully:
            await server.stop_gracefully()
        else:
            await server.stop()
        self.stopped[server_id] = server
```

We observed `stop_gracefully()` failing due to a server hanging during
shutdown. We then ended up in a state where neither `self.running` nor
`self.stopped` had this server. Later, when releasing the cluster and
its IPs, we would release that server's IP - but the server might have
still been running (all servers in `self.running` are killed before
releasing IPs, but this one wasn't in `self.running`).

Fix this by popping the server from `self.running` only after
`stop_gracefully`/`stop` finishes.

Make an analogous fix in `server_start`: put `server` into
`self.running` *before* we actually start it. If the start fails, the
server will be considered "running" even though it isn't necessarily,
but that is OK - if it isn't running, then trying to stop it later will
simply do nothing; if it is actually running, we will kill it (which we
should do) when clearing after the cluster; and we don't leak it.

Closes #12613
2023-01-25 16:58:02 +02:00
Alejo Sanchez
ccbd89f0cd test.py: enable force_schema_commit_log
To handle start after ungraceful stop, enable separate schema commit log
from server start.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-01-25 14:49:27 +01:00
Alejo Sanchez
f236d518c6 test.py: manual cluster handling for PythonSuite
Instead of complex async with logic, use manual cluster pool handling.

Revert the discard() logic in Pool from a recent commit.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-01-24 11:38:17 +01:00
Nadav Har'El
ccc2c6b5dd Merge 'test/pylib: scylla_cluster: improve server startup check' from Kamil Braun
Don't use a range scan, which is very inefficient, to perform a query for checking CQL availability.

Improve logging when waiting for server startup times out. Provide details about the failure: whether we managed to obtain the Host ID of the server and whether we managed to establish a CQL connection.

Closes #12588

* github.com:scylladb/scylladb:
  test/pylib: scylla_cluster: better logging for timeout on server startup
  test/pylib: scylla_cluster: use less expensive query to check for CQL availability
2023-01-23 17:00:52 +02:00
Kamil Braun
8a1ea6c49f test/pylib: scylla_cluster: better logging for timeout on server startup
Waiting for server startup is a multi-step procedure: after we start the
actual process, we will:
- try to obtain the Host ID (by querying a REST API endpoint)
- then try to connect a CQL session
- then try to perform a CQL query

The steps are repeated every .1 second until we reach a timeout (the
Host ID step is skipped if we previously managed to obtain it).

On timeout we'd only get a generic "failed to start server" message, it
wouldn't say what we managed to do and what not.

For example, on one of the failed jobs on Jenkins I observed this
timeout error. Looking at the logs of the server, it turned out that the
server printed the "initialization completed" message more than 2
minutes before the actual timeout happened. So for 2 minutes, the test
framework either couldn't obtain the Host ID, or couldn't establish a
CQL connection, or couldn't perform a CQL query, but I wasn't able to
determine fully which one of these was the case.

Improve the code by printing whether we managed to get the Host ID of
the server and if so - whether we managed to connect to CQL.
2023-01-23 15:59:42 +01:00
Kamil Braun
0e591606a5 test/pylib: scylla_cluster: use less expensive query to check for CQL availability
The previous CQL query used a range scan which is very inefficient, even
for local tables.

Also add a comment explaining why we need this query.
2023-01-23 15:59:05 +01:00
Nadav Har'El
54f174a1f4 Merge 'test.py: handle broken clusters for Python suite' from Alecco
If the after test check fails (is_after_test_ok is False), discard the cluster and raise exception so context manager (pool) does not recycle it.

Ignore exception re-raised by the context manager.

Fixes #12360

Closes #12569

* github.com:scylladb/scylladb:
  test.py: handle broken clusters for Python suite
  test.py: Pool discard method
2023-01-22 19:58:12 +02:00
Alejo Sanchez
c886a05b37 test.py: Pool discard method
Add a context manager discard() method to tell it to discard the object.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2023-01-19 21:43:45 +01:00
Kamil Braun
2f84e820fd test/pylib: scylla_cluster: return error details from test framework endpoints
If an endpoint handler throws an exception, the details of the exception
are not returned to the client. Normally this is desirable so that
information is not leaked, but in this test framework we do want to
return the details to the client so it can log a useful error message.

Do it by wrapping every handler into a catch clause that returns
the exception message.

Also modify a bit how HTTPErrors are rendered so it's easier to discern
the actual body of the error from other details (such as the params used
to make the request etc.)

Before:
```
E test.pylib.rest_client.HTTPError: HTTP error 500: 500 Internal Server Error
E
E Server got itself in trouble, params None, json None, uri http+unix://api/cluster/before-test/test_stuff
```

After:
```
E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/before-test/test_stuff, params: None, json: None, body:
E Failed to start server at host 127.155.129.1.
E Check the log files:
E /home/kbraun/dev/scylladb/testlog/test.py.dev.log
E /home/kbraun/dev/scylladb/testlog/dev/scylla-1.log
```

Closes #12563
2023-01-19 17:47:13 +02:00
Kamil Braun
3ed3966f13 test/pylib: scylla_cluster: release cluster IPs when stopping ScyllaClusterManager
When we obtained a new cluster for a test case after the previous test
case left a dirty cluster, we would release the old cluster's used IP
addresses (`_before_test` function). However, we would not release the
last cluster's IP after the last test case. We would run out of IPs with
sufficiently many test files or `--repeat` runs. Fix this.

Also reorder the operations a bit: stop the cluster (and release its
IPs) before freeing up space in the cluster pool (i.e. call
`self.cluster.stop()` before `self.clusters.steal()`). This reduces
concurrency a bit - fewer Scyllas running at the same time, which is
good (the pool size gives a limit on the desired max number of
concurrently running clusters). Killing a cluster is quick so it won't
make a significant difference for the next guy waiting on the pool.

Closes #12564
2023-01-19 17:46:46 +02:00
Kamil Braun
147dd73996 test/pylib: scylla_cluster: mark cluster as dirty if it fails to boot
If a cluster fails to boot, it saves the exception in
`self.start_exception` variable; the exception will be rethrown when
a test tries to start using this cluster. As explained in `before_test`:
```
    def before_test(self, name) -> None:
        """Check that  the cluster is ready for a test. If
        there was a start error, throw it here - the server is
        running when it's added to the pool, which can't be attributed
        to any specific test, throwing it here would stop a specific
        test."""
```
It's arguable whether we should blame some random test for a failure
that it didn't cause, but nevertheless, there's a problem here: the
`start_exception` will be rethrown and the test will fail, but then the
cluster will be simply returned to the pool and the next test will
attempt to use it... and so on.

Prevent this by marking the cluster as dirty the first time we rethrow
the exception.

Closes #12560
2023-01-19 14:26:57 +02:00
Avi Kivity
9029b8dead test: disable commitlog O_DSYNC, preallocation
Commitlog O_DSYNC is intended to make Raft and schema writes durable
in the face of power loss. To make O_DSYNC performant, we preallocate
the commitlog segments, so that the commitlog writes only change file
data and not file metadata (which would require the filesystem to commit
its own log).

However, in tests, this causes each ScyllaDB instance to write 384MB
of commitlog segments. This overloads the disks and slows everything
down.

Fix this by disabling O_DSYNC (and therefore preallocation) during
the tests. They can't survive power loss, and run with
--unsafe-bypass-fsync anyway.

Closes #12542
2023-01-19 11:14:05 +01:00
Avi Kivity
aab5954cfb Merge 'reader_concurrency_semaphore: add more layers of defense against OOM' from Botond Dénes
The reader concurrency semaphore has no mechanism to limit the memory consumption of already admitted read. Once memory collective memory consumption of all the admitted reads is above the limit, all it can do is to not admit any more. Sometimes this is not enough and the memory consumption of the already admitted reads balloons to the point of OOMing the node. This pull-request offers a solution to this: it introduces two more layers of defense above this: a soft and a hard limit. Both are multipliers applied on the semaphores normal memory limit.
When the soft limit threshold is surpassed, all readers but one are blocked via a new blocking `request_memory()` call which is used by the `tracking_file_impl`. The reader to be allowed to proceed is chosen at random, it is the first reader which happens to request memory after the limit is surpassed. This is both very simple and should avoid situations where the algorithm choosing the reader to be allowed to proceed chooses a reader which will then always time out.
When the hard limit threshold is surpassed, `reader_concurrency_semaphore::consume()` starts throwing `std::bad_alloc`. This again will result in eliminating whichever reader was unlucky enough to request memory at the right moment.

With this, the semaphore is now effectively enforcing an upper bound for memory consumption, defined by the hard limit.

Refs: https://github.com/scylladb/scylladb/issues/11927

Closes #11955

* github.com:scylladb/scylladb:
  test: reader_concurrency_semaphore_test: add tests for semaphore memory limits
  reader_permit: expose operator<<(reader_permit::state)
  reader_permit: add id() accessor
  reader_concurrency_semaphore: add foreach_permit()
  reader_concurrency_semaphore: document the new memory limits
  reader_concurrency_semaphore: add OOM killer
  reader_concurrency_semaphore: make consume() and signal() private
  test: stop using reader_concurrency_semaphore::{consume,signal}() directly
  reader_concurrency_semaphore: move consume() out-of-line
  reader_permit: consume(): make it exception-safe
  reader_permit: resource_units::reset(): only call consume() if needed
  reader_concurrency_semaphore: tracked_file_impl: use request_memory()
  reader_concurrency_semaphore: add request_memory()
  reader_concurrency_semaphore: wrap wait list
  reader_concurrency_semaphore: add {serialize,kill}_limit_multiplier parameters
  test/boost/reader_concurrency_semaphore_test: dummy_file_impl: don't use hardoced buffer size
  reader_permit: add make_new_tracked_temporary_buffer()
  reader_permit: add get_state() accessor
  reader_permit: resource_units: add constructor for already consumed res
  reader_permit: resource_units: remove noexcept qualifier from constructor
  db/config: introduce reader_concurrency_semaphore_{serialize,kill}_limit_multiplier
  scylla-gdb.py: scylla-memory: extract semaphore stats formatting code
  scylla-gdb.py: fix spelling of "graphviz"
2023-01-18 17:02:55 +02:00
Tomasz Grabiec
563998b69a Merge 'raft: improve group 0 reconfiguration failure handling' from Kamil Braun
Make it so that failures in `removenode`/`decommission` don't lead to reduced availability, and any leftovers in group 0 can be removed by `removenode`:
- In `removenode`, make the node a non-voter before removing it from the token ring. This removes the possibility of having a group 0 voting member which doesn't correspond to a token ring member. We can still be left with a non-voter, but that's doesn't reduce the availability of group 0.
- As above but for `decommission`.
- Make it possible to remove group 0 members that don't correspond to token ring members from group 0 using `removenode`.
- Add an API to query the current group 0 configuration.

Fixes #11723.

Closes #12502

* github.com:scylladb/scylladb:
  test: test_topology: test for removing garbage group 0 members
  test/pylib: move some utility functions to util.py
  db: system_keyspace: add a virtual table with raft configuration
  db: system_keyspace: improve system.raft_snapshot_config schema
  service: storage_service: better error handling in `decommission`
  service: storage_service: fix indentation in removenode
  service: storage_service: make `removenode` work for group 0 members which are not token ring members
  service/raft: raft_group0: perform read_barrier in wait_for_raft
  service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode
  test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove
  service/raft: raft_group0: link to Raft docs where appropriate
  service/raft: raft_group0: more logging
  service/raft: raft_group0: separate function for checking and waiting for Raft
2023-01-17 21:23:15 +01:00
Kamil Braun
d134c458e5 test/pylib: increase timeout when waiting for cluster before test
Increase the timeout from default 5 minutes to 10 minutes.
Sent as a workaround for #12546 to unblock next promotions.

Closes #12547
2023-01-17 21:03:09 +02:00
Kamil Braun
c959ec455a test/pylib: move some utility functions to util.py
They were used in test_raft_upgrade, but we want to use them in other
test files too.
2023-01-17 12:28:00 +01:00
Botond Dénes
7eb093899a db/config: introduce reader_concurrency_semaphore_{serialize,kill}_limit_multiplier
Will be propagated to reader concurrency semaphores. Not wired in yet.
2023-01-16 02:05:27 -05:00
Benny Halevy
90faeedb77 test: test_topology: test replace using host_id
Add test cases exercising the --replace-node-first-boot option
by replacing nodes using their host_id rather
than ip address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:36:09 +02:00
Benny Halevy
7d0d9e28f1 test: pylib: ServerInfo: add host_id
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:36:07 +02:00
Kamil Braun
79712185d5 test/pylib: additional logging during cluster setup
This would have saved me a lot of debugging time.
2023-01-11 10:09:42 +01:00
Kamil Braun
4f7e5ee963 test/pylib: prefix cluster/manager logs with the current test name
The log file produced by test.py combines logs coming from multiple
concurrent test runs. Each test has its own log file as well, but this
"global" log file is useful when debugging problems with topology tests,
since many events related to managing clusters are stored there.

Make the logs easier to read by including information about the test case
that's currently performing operations such as adding new servers to
clusters and so on. This includes the mode, test run name and the name
of the test case.

We do this by using custom `Logger` objects (instead of calling
`logging.info` etc. which uses the root logger) with `LoggerAdapter`s
that include the prefixes. A bit of boilerplate 'plumbing' through
function parameters is required but it's mostly straightforward.

This doesn't apply to all events, e.g. boost test cases which don't
setup a "real" Scylla cluster. These events don't have additional
prefixes.

Example:
```

17:41:43.531 INFO> [dev/topology.test_topology.1] Cluster ScyllaCluster(name: 7a414ffc-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(1, 127.40.246.1, 29c4ec73-8912-45ca-ae19-8bfda701a6b5), ScyllaServer(4, 127.40.246.4, 75ae2afe-ff9b-4760-9e19-cd0ed8d052e7), ScyllaServer(7, 127.40.246.7, 67a27df4-be63-4b4c-a70c-aeac0506304f), stopped: ) adding server...
17:41:43.531 INFO> [dev/topology.test_topology.1] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-10...
17:41:43.603 INFO> [dev/topology.test_topology.1] starting server at host 127.40.246.10 in scylla-10...
17:41:43.614 INFO> [dev/topology.test_topology.2] Cluster ScyllaCluster(name: 7a497fce-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(2, 127.40.246.2, f59d3b1d-efbb-4657-b6d5-3fa9e9ef786e), ScyllaServer(5, 127.40.246.5, 9da16633-ce53-4d32-8687-e6b4d27e71eb), ScyllaServer(9, 127.40.246.9, e60c69cd-212d-413b-8678-dfd476d7faf5), stopped: ) adding server...
17:41:43.614 INFO> [dev/topology.test_topology.2] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-11...
17:41:43.670 INFO> [dev/topology.test_topology.2] starting server at host 127.40.246.11 in scylla-11...
```
2023-01-11 10:09:39 +01:00
Kamil Braun
2bda0f9830 test/pylib: pool: pass *args and **kwargs to the build function from get()
This will be used to specify a custom logger when building new clusters
before starting tests, allowing to easily pinpoint which tests are
waiting for clusters to be built and what's happening to these
particular clusters.
2023-01-10 17:41:54 +01:00
Kamil Braun
822410c49b test/pylib: scylla_cluster: release IPs when cluster is no longer needed
With sufficiently many test cases we would eventually run out of IP
addresses, because IPs (which are leased from a global host registry)
would only be released at the end of an entire test suite.

In fact we already hit this during next promotions, causing much pain
indeed.

Release IPs when a cluster, after being marked dirty, is stopped and
thrown away.

Closes #12482
2023-01-10 06:59:41 +02:00
Alejo Sanchez
d632e1aa7a test/pytest: add missing import, remove unused import
Add missed import time and remove unused name import.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12446
2023-01-08 17:38:46 +02:00
Kamil Braun
09da661eeb Merge 'raft: replace experimental raft option with dedicated flag' from Gleb Natapov
Unlike other experimental feature we want to raft to be opt in even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.

* 'consistent-cluster-management-flag' of github.com:scylladb/scylla-dev:
  raft: replace experimental raft option with dedicated flag
  main: move supervisor notification about group registry start where it actually starts
2023-01-05 15:21:35 +01:00
Kamil Braun
4268b1bbc2 Merge 'raft: raft_group0, register RPC verbs on all shards' from Gusev Petr
raft_group0 used to register RPC verbs only on shard 0. This worked on
clusters with the same --smp setting on all nodes, since RPCs in this
case are processed on the same shard as the calling code, and
raft_group0 methods only run on shard 0.

A new test test_nodes_with_different_smp was added to identify the
problem. Since --smp can only be specified via the command line, a
corresponding parameter was added to the ManagerClient.server_add
method.  It allows to override the default parameters set by the
SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting
individual items.

Fixes: #12252

Closes #12374

* github.com:scylladb/scylladb:
  raft: raft_group0, register RPC verbs on all shards
  raft: raft_append_entries, copy entries to the target shard
  test.py, allow to specify the node's command line in test
2023-01-04 11:11:21 +01:00
Petr Gusev
1c23390f12 test.py, allow to specify the node's command line in test
An optional parameter cmdline has been added to
the ManagerClient.server_add method.
It allows you to override the default parameters
set by the SCYLLA_CMDLINE_OPTIONS variable
by changing, adding or deleting individual
items. To change or add a parameter just specify
its name and value one after the other.
To remove parameter use the special keyword
__remove__ as a value. To set a parameter
without a value (such as --overprovisioned)
use the special keyword __missing__ as the value.
2023-01-03 15:24:54 +03:00
Gleb Natapov
1688163233 raft: replace experimental raft option with dedicated flag
Unlike other experimental feature we want to raft to be optional even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option "consistent-cluster-management"
for that.
2023-01-03 11:15:11 +02:00
Alejo Sanchez
d408b711e3 test/python: increase CQL connection timeouts
In very slow debug builds the default driver timeouts are too low and
tests might fail. Bump up the values to more reasonable time.

These timeout values are the same as used in topology tests.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12405
2022-12-28 10:06:33 +02:00
Alejo Sanchez
1bfe234133 test/pylib: API get/set logger level of Scylla server
Provide helpers to get and set logger level for Scylla servers.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12394
2022-12-25 13:58:43 +02:00
Kamil Braun
3cd035d1b9 test/pylib: scylla_cluster: remove ScyllaCluster.decommissioned field
The field was not used for anything. We can keep decommissioned server
in `stopped` field.

In fact it caused us a problem: since recently, we're using
`ScyllaCluster.uninstall` to clean-up servers after test suite finishes
(previously we were using `ScyllaServer.uninstall` directly). But
`ScyllaCluster.uninstall` didn't look into the `decommissioned` field,
so if a server got decommissioned, we wouldn't uninstall it, and it left
us some unnecessary artifacts even for successful tests. This is now
fixed.

Closes #12163
2022-12-01 19:07:26 +02:00
Alejo Sanchez
f7aa08ef25 test.py: don't stop cluster's site if not started
The site member is created in ScyllaCluster.start(), for startup failure
this might not be initialized, so check it's present before stop()ing
it. And delete it as it's not running and proper initialization should
call ScyllaCluster.start().

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #11939
2022-11-30 13:47:18 +02:00
Kamil Braun
2f60550ff3 test/pylib: scylla_cluster: support node replace operation
The `add_server` function now takes an optional `ReplaceConfig` struct
(implemented using `NamedTuple`), which specifies the ID of the replaced
server and whether to reuse the IP address.

If we want to reuse the IP address, we don't allocate one using the host
registry.

Since now multiple servers can have the same IP, introduce a
`leased_ips` set to `ScyllaCluster` which is used when `uninstall`ing
the cluster - to make sure we don't `release_host` the same host twice.
2022-11-24 16:26:23 +01:00