Instead of using assigned IP addresses, use an internal server id.
Define types to distinguish local server id, host ID (UUID), and IP
address.
This is needed to test servers changing IP address and for node replace
(host UUID).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The code explicitly manages an IP as string, make it explicit in the
variable name.
Define its type and test for set in the instance instead of using an
empty string as placeholder.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
When initializing a ScyllaServer, try to get the host id instead of only
checking the REST API is up.
Use the existing aiohttp session from ScyllaCluster.
In case of HTTP error check the status was not an internal error (500+).
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Move the REST api client to ScyllaCluster. This will allow the cluster
to query its own servers.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The aiohttp.web.Application only needs to be passed, so don't store a
reference in ScyllaCluster object.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Simplify REST helper by doing requests without a session.
Reusing an aiohttp.ClientSession causes knock-on effects on
`rest_api/test_task_manager` due to handling exceptions outside of an
async with block.
Requests for cluster management and Scylla REST API don't need session,
anyway.
Raise HTTPError with status code, text reason, params, and json.
In ScyllaCluster.install_and_start() instead of adding one more custom
exception, just catch all exceptions as they will be re-raised later.
While there avoid code duplication and improve sanity, type checking,
and lint score.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
The test runs remove_node command with background ddl workload.
It was written in an attempt to reproduce scylladb#11228 but seems to have
value on its own.
The if_exists parameter has been added to the add_table
and drop_table functions, since the driver could retry
the request sent to a removed node, but that request
might have already been completed.
Function wait_for_host_known waits until the information
about the node reaches the destination node. Since we add
new nodes at each iteration in main, this can take some time.
A number of abort-related options was added
SCYLLA_CMDLINE_OPTIONS as it simplifies
nailing down problems.
Closes#11734
The tests are checking the upgrade procedure and recovery from failure
in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.
Added some new functionalities to `ScyllaRESTAPIClient` like injecting
errors and obtaining gossip generation numbers.
The `removenode` operation normally requires the removing node to
contact every node in the cluster except the one that is being removed.
But if more than 1 node is down it's possible to specify a list of nodes
to ignore for the operation; the `/storage_service/remove_node` endpoint
accepts an `ignore_nodes` param which is a comma-separated list of IPs.
Extend `ScyllaRESTAPIClient`, `ScyllaClusterManager` and `ManagerClient`
so it's possible to pass the list of ignored nodes.
We also modify the `/cluster/remove-node` Manager endpoint to use
`put_json` instead of `get_text` and pass all parameters except the
initiator IP (the IP of the node who coordinates the `removenode`
operation) through JSON. This simplifies the URL greatly (it was already
messy with 3 parameters) and more closely resembles Scylla's endpoint.
Provide a helper client for Scylla REST requests. Use it on both
ScyllaClusterManager (e.g. remove node, test.py process) and
ManagerClient (e.g. get uuid, pytest process).
For now keep using IPs as key in ScyllaCluster, but this will be changed
to UUID -> IP in the future. So, for now, pass both independently. Note
the UUID must be obtained from the server before stopping it.
Refresh client driver connection when decommissioning or removing
a node.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
In ScyllaCluster currently servers are tracked by the host IP. This is
not the host id (UUID). Fix the variable name accordingly
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Split aiohttp client to a shared helper file.
While there, move aiohttp session setup back to constructors. When there
were teardown issues it looked it could be caused by aiohttp session
being created outside a coroutine. But this is proven not to be the case
after recent fixes. So move it back to the ManagerClient constructor.
On th other hand, create a close() coroutine to stop the aiohttp session.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
cql connection
When there are topology changes, the driver needs to be updated. Instead
of passing the CassandraCluster.Connection, pass the ManagerClient
instance which manages the driver connection inside of it.
Remove workaround for test_raft_upgrade.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Print the cluster name and stopped servers in addition to the running
servers.
Fix a logging call which tried to print a server in place of a cluster
and even at that it failed (the server didn't have a hostname yet so it
printed as an empty string). Add another logging call.
Use `_before_test` calls to track the current test case name.
Concatenate it with the unique test name like this:
`test_topology.1::test_add_server_add_column`, and print it
instead of the test case name.
We pass the test case name to `after_test` - so make it consistent.
Arguably, the test case name is more useful (as it's more precise) than
the test name.
The `server_remove` function did a very weird thing: it shut down a
server and made the framework 'forget' about it. From the point of view
of the Scylla cluster and the driver the server was still there.
Replace the function's body with `raise NotImplementedError`. In the
future it can be replaced with an implementation that calls
`removenode` on the Scylla cluster.
Remove `test_remove_server_add_column` from `test_topology`. It
effectively does the same thing as `test_stop_server_add_column`, except
that the framework also 'forgets' about the stopped server. This could
lead to weird situations because the forgotten server's IP could be
reused in another test that was running concurrently with this test.
Closes#11657
Fix the type of `create_server`, rename `topology_for_class` to `get_cluster_factory`, simplify the suite definitions and parameters passed to `get_cluster_factory`
Closes#11590
* github.com:scylladb/scylladb:
test.py: replace `topology` with `cluster_size` in Topology tests
test.py: rename `topology_for_class` to `get_cluster_factory`
test/pylib: ScyllaCluster: fix create_server parameter type
The test was disabled due to a bug in the Python driver which caused the
driver not to reconnect after a node was restarted (see
scylladb/python-driver#170).
Introduce a workaround for that bug: we simply create a new driver
session after restarting the nodes. Reenable the test.
Closes#11641
The only usage of `ScyllaCluster` constructor passed a `create_server`
function which expected a `List[str]` for the second parameter, while
the constructor specified that the function should expect an
`Optional[List[str]]`. There was no reason for the latter, we can easily
fix this type error.
Also give a type hint for `create_cluster` function in
`PythonTestSuite.topology_for_class`. This is actually what catched the
type error.
Check existing is_running member to avoid re-starting.
While there, set it to false after stopping.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Start ScyllaClusterManager within error handling so the ScyllaCluster
logs are available in case of error starting up.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Some tests mark clusters as 'dirty', which makes them non-reusable by
later tests; we don't want to return them to the pool of clusters.
This use-case was covered by the `add_one` function in the `Pool` class.
However, it had the unintended side effect of creating extra clusters
even if there were no more tests that were waiting for new clusters.
Rewrite the implementation of `Pool` so it provides 3 interface
functions:
- `get` borrows an object, building it first if necessary
- `put` returns a borrowed object
- `steal` is called by a borrower to free up space in the pool;
the borrower is then responsible for cleaning up the object.
Both `put` and `steal` wake up any outstanding `get` calls. Objects are
built only in `get`, so no objects are built if none are needed.
Closes#11558
teardown..."
This reverts commit df1ca57fda.
In order to prevent timeouts on teardown queries, the previous commit
added functionality to restart servers that were down. This issue is
fixed in fc0263fc9b so there's no longer need to restart stopped servers
on test teardown.
For ManagerClient request API, don't return status, raise an exception.
Server side errors are signaled by status 500, not text body.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Pool.get() might have waiting callers, so if an item is not returned
to the pool after use, tell the pool to add a new one and tell the pool
an entry was taken (used for total running entries, i.e. clusters).
Use it when a ScyllaCluster is dirty and not returned.
While there improve logging and docstrings.
Issue reported by @kbr-.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Closes#11546
`_request` performed a GET request and extracted a text body out of the
response.
Split it into `_get`, which only performs the request, and `_get_text`,
which calls `_get` and extracts the body as text.
Also extract a `_resource_uri` function which will be used for other
request types.
Previously we used a formattable string to represent the configuration;
values in the string were substituted by Python's formatting mechanism
and the resulting string was stored to obtain the config file.
This approach had some downsides, e.g. it required boilerplate work to
extend: to add a new config options, you would have to modify this
template string.
Instead we can represent the configuration as a Python dictionary. Dicts
are easy to manipulate, for example you can sum two dicts; if a key
appears in both, the second dict 'wins':
```
{1:1} | {1:2} == {1:2}
```
This makes the configuration easy to extend without having to write
boilerplate: if the user of `ScyllaServer` wants to add or override a
config option, they can simply add it to the `config_options` dict and
that's it - no need to modify any internal template strings in
`ScyllaServer` implementation like before. The `config_options` dict is
simply summed with the 'base' config dict of `ScyllaServer`
(`config_options` is the right summand so anything in there overrides
anything in the base dict).
An example of this extensibility is the `authenticator` and `authorizer`
options which no longer appear in `scylla_cluster.py` module after this
change, they only appear in the suite.yaml file.
Also, use "workdir" option instead of specifying data dir, commitlog
dir etc. separately.
Previously the code extracted `authenticator` and `authorizer` keys from
the config options and stored them.
Store the entire dict instead. The new code is easier to extend if we
want to make more options configurable.
Otherwise it crashes some python versions.
The cast was there before a2dd64f68f
explicitly dropped one while moving the code between files.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11511
Broadcast tables are tables for which all statements are strongly
consistent (linearizable), replicated to every node in the cluster and
available as long as a majority of the cluster is available. If a user
wants to store a “small” volume of metadata that is not modified “too
often” but provides high resiliency against failures and strong
consistency of operations, they can use broadcast tables.
The main goal of the broadcast tables project is to solve problems which
need to be solved when we eventually implement general-purpose strongly
consistent tables: designing the data structure for the Raft command,
ensuring that the commands are idempotent, handling snapshots correctly,
and so on.
In this MVP (Minimum Viable Product), statements are limited to simple
SELECT and UPDATE operations on the built-in table. In the future, other
statements and data types will be available but with this PR we can
already work on features like idempotent commands or snapshotting.
Snapshotting is not handled yet which means that restarting a node or
performing too many operations (which would cause a snapshot to be
created) will give incorrect results.
In a follow-up, we plan to add end-to-end Jepsen tests
(https://jepsen.io/). With this PR we can already simulate operations on
lists and test linearizability in linear complexity. This can also test
Scylla's implementation of persistent storage, failure detector, RPC,
etc.
Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharingCloses#11164
* github.com:scylladb/scylladb:
raft: broadcast_tables: add broadcast_kv_store test
raft: broadcast_tables: add returning query result
raft: broadcast_tables: add execution of intermediate language
raft: broadcast_tables: add compilation of cql to intermediate language
raft: broadcast_tables: add definition of intermediate language
db: system_keyspace: add broadcast_kv_store table
db: config: add BROADCAST_TABLES feature flag