Adds a util for measuring the disk usage of the given table on the given
node.
Will be used in a follow-up patch for testing that sstables are freed
properly.
(cherry picked from commit 7741491b47)
This PR resolves an issue with double counting of topology test results; they will no longer appear twice in the consolidated report.
Another fix provides a better view of which test failed by enriching the test case name in the report with the mode and run id, making the names unique across the run.
The scope of this change is:
1. Modify the test name to include the run id
2. Add handlers that collect the test.py and pytest logs related to a single test into one file, rather than into the log of the full suite
3. Stop aggregating topology tests at the suite level in the JUnit results
4. Add a link to the logs of a failed test in the JUnit results, making it easier to navigate to all logs related to that test
5. Gather logs related to a failed test into one directory for easier investigation
Ref: scylladb/scylladb#17851
Closes scylladb/scylladb#18277
introduce ManagerClient.test_finished_event
to block access to the REST client object from the test once
the ManagerClient.after_test method has been called
(i.e. test teardown has started)
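A minimal sketch of the idea (only test_finished_event and after_test come from this commit; the guarding property and the exception are illustrative):
```python
import asyncio


class ManagerClient:
    """Sketch only: the real class lives in the test framework."""

    def __init__(self) -> None:
        # Set once the test body has finished and teardown has begun.
        self.test_finished_event = asyncio.Event()
        self._rest_client = None        # created elsewhere in the real code

    @property
    def client(self):
        # Refuse to hand out the REST client once after_test() has run, so a
        # stray task cannot talk to a cluster that is already being torn down.
        if self.test_finished_event.is_set():
            raise RuntimeError("test already finished, manager REST client unavailable")
        return self._rest_client

    async def after_test(self) -> None:
        self.test_finished_event.set()
        # ... rest of the teardown ...
```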
Restarting a node amounts to just shutting it down and then starting
again. There is no good reason to have a dedicated endpoint in the
ScyllaClusterManager for restarting when it can be implemented by
calling two endpoints in a sequence: stop and start - it's just code
duplication.
Remove the server_restart endpoint in ScyllaClusterManager and
reimplement it as two endpoint calls in the ManagerClient.
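On the ManagerClient side the replacement boils down to something like this (a sketch; the endpoint paths and helper names are assumptions based on the description above):
```python
class ManagerClient:
    # ... sketch, other members elided ...

    async def server_stop(self, server_id: int) -> None:
        await self.client.put_json(f"/cluster/server/{server_id}/stop")

    async def server_start(self, server_id: int) -> None:
        await self.client.put_json(f"/cluster/server/{server_id}/start")

    async def server_restart(self, server_id: int) -> None:
        # No dedicated /restart endpoint anymore: a restart is simply a
        # stop followed by a start, issued from the client.
        await self.server_stop(server_id)
        await self.server_start(server_id)
```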
In this PR we add timeouts support to raft groups registry. We introduce
the `raft_server_with_timeouts` class, which wraps the `raft::server`
and exposes its interface with an additional `raft_timeout` parameter. If
it is set, the wrapper aborts the operation via the `abort_source` after a
certain amount of time. The value of the timeout can be specified either
in the `raft_timeout` parameter, or the default value can be set in the
`raft_server_with_timeouts` class constructor.
The `raft_group_registry` interface is extended with
`group0_with_timeouts()` method. It returns an instance of
`raft_server_with_timeouts` for group0 raft server. The timeout value
for it is configured in `create_server_for_group0`. It's one minute by
default and can be overridden for tests with the
`group0-raft-op-timeout-in-ms` parameter.
The new api allows the client to decide whether to use timeouts or not.
In this PR we review all the group0 call sites and add
`raft_timeout` where it makes sense. The general principle is that if the
code is handling a client request and the client expects a potential
error, we use timeouts. We don't use timeouts for background fibers
(such as topology coordinator), since they wouldn't add much value. The
only thing the background fiber can do with a timeout is to retry, and
this will have the same end effect as not having a timeout at all.
Fixes scylladb/scylladb#16604
Closes scylladb/scylladb#17590
* github.com:scylladb/scylladb:
migration_manager: use raft_timeout{}
storage_service::join_node_response_handler: use raft_timeout{}
storage_service::start_upgrade_to_raft_topology: use raft_timeout{}
storage_service::set_tablet_balancing_enabled: use raft_timeout{}
storage_service::move_tablet: use raft_timeout{}
raft_check_and_repair_cdc_streams: use raft_timeout{}
raft_timeout: test that node operations fail properly
raft_rebuild: use raft_timeout{}
do_cluster_cleanup: use raft_timeout{}
raft_initialize_discovery_leader: use raft_timeout{}
update_topology_with_local_metadata: use with_timeout{}
raft_decommission: use raft_timeout{}
raft_removenode: use raft_timeout{}
join_node_request_handler: add raft_timeout to make_nonvoters and add_entry
raft_group0: make_raft_config_nonvoter: add raft_timeout parameter
raft_group0: make_raft_config_nonvoter: add abort_source parameter
manager_client: server_add with start=false shouldn't call driver_connect
scylla_cluster: add seeds parameter to the add_server and servers_add
raft_server_with_timeouts: report the lost quorum
join_node_request_handler: add raft_timeout{} for start_operation
skip_mode: add platform_key
auth: use raft_timeout{}
raft_group0_client: add raft_timeout parameter
raft_group_registry: add group0_with_timeouts
utils: add composite_abort_source.hh
error_injection: move api registration to set_server_init
error_injection: add inject_parameter method
error_injection: move injection_name string into injection_shared_data
error_injection: pass injection parameters at startup
Fixes #16912
By default, ScyllaDB stores the maintenance socket in the workdir. Test.py by default places the ScyllaDB workdir at testlog/{mode}/scylla-#. The usual location for cloning the repo is the user's home folder. In some cases, this can make the socket path too long and tests start to fail. The simple fix is to move the maintenance socket to the /tmp folder to eliminate this possibility.
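For context, Unix domain socket paths on Linux are limited to about 108 bytes (the size of sockaddr_un.sun_path), so a deeply nested workdir can push the socket path over the limit while a /tmp path stays short. An illustrative comparison (the paths here are made up):
```python
import os

SUN_PATH_MAX = 108   # sizeof(sockaddr_un.sun_path) on Linux

# Hypothetical workdir-based socket path vs. a /tmp-based one.
workdir_socket = os.path.expanduser(
    "~/work/scylladb/testlog/debug/scylla-42/maintenance_socket")
tmp_socket = "/tmp/scylla-test-42/maintenance_socket"

print(len(workdir_socket), len(tmp_socket))
assert len(tmp_socket) < SUN_PATH_MAX
```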
Closes scylladb/scylladb#17941
If the server is not started, there is no point
in starting the driver; it would fail because there
are no nodes to connect to. On the other hand, we
should connect the driver in server_start()
if it's not connected yet.
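Roughly, the intended flow is as follows (a sketch; only server_add, server_start, driver_connect and get_cql appear in the surrounding commits, the private helpers are illustrative):
```python
class ManagerClient:
    # ... sketch, other members elided ...

    async def server_add(self, start: bool = True, **kwargs):
        server = await self._request_server_add(**kwargs)    # illustrative helper
        if start:
            await self.server_start(server.server_id)
        # With start=False we deliberately do not touch the driver: there is
        # nothing to connect to yet, so driver_connect() would only fail.
        return server

    async def server_start(self, server_id) -> None:
        await self._request_server_start(server_id)          # illustrative helper
        if self.get_cql() is None:                           # illustrative check
            # Connect the driver lazily once at least one node is running.
            await self.driver_connect()
```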
If this parameter is set, we use its value for
the scylla.yaml of the new node, otherwise we
use IPs of all running nodes as before.
We'll need this parameter in subsequent commits to
restrict the communication between nodes.
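A sketch of the resulting behaviour (only the seeds parameter comes from this commit; the helper is illustrative):
```python
def effective_seeds(seeds: list[str] | None, running_ips: list[str]) -> list[str]:
    # Explicit seeds win; otherwise fall back to the previous behaviour of
    # seeding the new node with the IPs of all currently running nodes.
    return list(seeds) if seeds else list(running_ips)


# e.g. the value written into the new node's scylla.yaml seed list:
#   ",".join(effective_seeds(seeds, [srv.ip_addr for srv in cluster.running]))
```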
We remove default values for _create_server_add_data parameters
since they are redundant - in the two call sites we pass all
of them.
In the test, we use the group0-raft-op-timeout-in-ms parameter to
reduce the timeout to one second so as not to waste time.
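How the parameter reaches the node is not spelled out here; one plausible shape in a topology test is an extra start-up option, e.g. (hypothetical snippet, exact plumbing may differ):
```python
# Shrink the group0 operation timeout to one second so the failing request
# times out quickly instead of after the default one minute.
await manager.server_add(cmdline=["--group0-raft-op-timeout-in-ms", "1000"])
```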
The join_node_request_handler method contains other group0 calls
which should have timeouts (make_nonvoters and add_entry). They
will be handled in a separate commit.
The initial version used a redundant method, and it did not cover all
cases, which led to flakiness in the test that used this method.
Switching to the cluster_con() method removes the flakiness since it is
written more robustly.
Fixes scylladb/scylladb#17914
Closes scylladb/scylladb#17932
Fix writing cassandra-rackdc.properties in the correct properties format instead of YAML
Add a parameter to overwrite the RF for a specific DC
Add the possibility to connect CQL to a specific node
In this PR, 4 tests were added to test multi-DC functionality. One comes from the initial commit in which the multi-DC possibility was introduced, but it was never committed. The other three are migrated from dtest and will be deleted there later. To be able to execute the migrated tests, additional functionality is added: the ability to connect CQL to a specific node in the cluster instead of using a pooled connection, and the possibility to overwrite the replication factor for a specific DC. To be able to use multi-DC in test.py, the issue with the incorrect format of the properties file is fixed in this PR.
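A sketch of the properties-format fix: cassandra-rackdc.properties is a plain key=value properties file, so it should be written line by line rather than serialized with yaml.dump() (the helper name below is illustrative):
```python
def write_rackdc_properties(path: str, dc: str, rack: str) -> None:
    # Plain properties format, e.g.:
    #   dc=dc1
    #   rack=rack1
    with open(path, "w") as f:
        f.write(f"dc={dc}\n")
        f.write(f"rack={rack}\n")
```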
Closes scylladb/scylladb#17503
Fix aiohttp usage issue in python 3.12:
"Timeout context manager should be used inside a task"
This occurs because the UnixRESTClient is created in one event loop (created
inside pytest) but used in another (created by the rewritten event_loop
fixture); it is now fixed by recreating the UnixRESTClient object for every
new loop.
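A minimal sketch of the fix (UnixRESTClient comes from the test framework; the caching attributes and helper name are illustrative):
```python
import asyncio

class ManagerClient:
    # ... sketch, other members elided ...

    def _ensure_client(self):
        loop = asyncio.get_event_loop()
        if self._client is None or self._client_loop is not loop:
            # aiohttp objects are tied to the loop they were created in, so a
            # client built under pytest's own loop cannot be used from the
            # per-test event_loop fixture; rebuild it for the current loop.
            self._client = UnixRESTClient(self.sock_path)
            self._client_loop = loop
        return self._client
```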
Closes scylladb/scylladb#17760
This is a speculative fix as the problem is observed only on CI.
When run_async is called right after driver_connect and get_cql
it fails with ConnectionException('Host has been marked down or
removed').
If the approach proves to be successful we can start to deprecate
the base get_cql in favor of get_ready_cql. It's better to have robust
testing helper libraries than try to take care of it in every test
case separately.
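A sketch of what a get_ready_cql helper could look like, assuming a utility that waits until the driver sees the given hosts as up (the helper name and timeout are assumptions):
```python
import time

async def get_ready_cql(manager, servers):
    # driver_connect()/get_cql() alone can race with the driver's host-state
    # bookkeeping and fail with "Host has been marked down or removed";
    # wait until the driver actually sees the hosts before returning.
    await manager.driver_connect()
    cql = manager.get_cql()
    hosts = await wait_for_cql_and_get_hosts(cql, servers, time.time() + 60)
    return cql, hosts
```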
Fixes #17713
Closes scylladb/scylladb#17772
With auth-v2 we can log in even if quorum is lost. So the test
which checks that an error occurs in such a situation is deleted,
and the opposite test, which checks that logging in works, is
added.
Add error handling to rebuild instead of retrying it until it succeeds.
* 'gleb/rebuild-fail-v2' of github.com:scylladb/scylla-dev:
test: add test for rebuild failure
test: add expected_error to rebuild_node operation
topology_coordinator: Propagate rebuild failure to the initiator
Similar to the existing update_config(). Updates the command-line
arguments of the specified nodes, merging the new options into the
existing ones. Needs a restart to take effect.
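Hypothetical usage, mirroring update_config() (the method name and argument shape are assumptions, since the commit message does not spell them out):
```python
# Merge an extra command-line option into the node's existing ones;
# it only takes effect after the node is restarted.
await manager.server_update_cmdline(server.server_id,
                                    ["--logger-log-level", "raft=trace"])
await manager.server_restart(server.server_id)
```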
When stopping the ManagerClient, it would be better to close
all connected connectors, otherwise aiohttp complains like:
```
13:57:53.763 ERROR> Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7f939d2ca5f0>, 96672.211256817)]']
connector: <aiohttp.connector.UnixConnector object at 0x7f939d2da890>
```
This warning message is printed to the console, and it is distracting
when testing manually.
So, in this change, let's close the client connected to the Unix domain
socket.
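The fix is essentially to close the aiohttp session (and with it the UnixConnector) when the client is stopped; a sketch:
```python
import aiohttp

class UnixRESTClient:
    # Sketch only: the real client lives in the test framework.
    def __init__(self, sock_path: str) -> None:
        self._session = aiohttp.ClientSession(
            connector=aiohttp.UnixConnector(path=sock_path))

    async def close(self) -> None:
        # Closing the session closes its connector as well, which silences
        # the "Unclosed connector" warning on shutdown.
        await self._session.close()
```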
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16675
The removenode operation is defined to succeed only if the node
being removed is dead. Currently, we reject this operation on the
initiator side (in `storage_service::raft_removenode`) when the
failure detector considers the node being removed alive. However,
it is possible that even if the initiator considers the node dead,
the topology coordinator will consider it alive when handling the
topology request. For example, the topology coordinator can use
a bigger failure detector timeout, or the node being removed can
suddenly resurrect.
This PR makes the topology coordinator reject removenode if the
node being removed is considered alive. It also adds
`test_remove_alive_node` that verifies this change.
Fixes scylladb/scylladb#16109
Closes scylladb/scylladb#16584
* github.com:scylladb/scylladb:
test: add test_remove_alive_node
topology_coordinator: reject removenode if the removed node is alive
test: ManagerClient: remove unused wait_for_host_down
test: remove_node: wait until the node being removed is dead
The previous commit removed the only call to wait_for_host_down.
Moreover, this function is identical to server_not_sees_other_server.
We can safely remove it.
In the following commits, we make the topology coordinator reject
removenode requests if the node being removed is considered alive
by the gossiper. Before making this change, we need to adapt the
testing framework so that we don't have flaky removenode operations
that fail because the node being removed hasn't been marked as dead
yet. We achieve this by waiting until all other running nodes see
the node being removed as dead in all removenode operations.
Some tests are simplified after this change because they don't have
to call server_not_sees_other_server anymore.
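In the test framework this amounts to something like the following before issuing removenode (a sketch; the signature of server_not_sees_other_server is assumed):
```python
import asyncio

async def wait_until_removed_node_is_dead(manager, running_servers, removed):
    # Every other running node must report the node being removed as down;
    # otherwise the coordinator may still consider it alive and the
    # removenode operation becomes flaky.
    await asyncio.gather(*(
        manager.server_not_sees_other_server(srv.ip_addr, removed.ip_addr)
        for srv in running_servers if srv.server_id != removed.server_id))
```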
After restarting each node, we should wait for other nodes to notice
the node is UP before restarting the next server. Otherwise, the next
node we restart may not send the shutdown notification to the
previously restarted node if it still sees it as down when we
initiate its shutdown. In this case, the node will learn about the
restart from gossip later, possibly when we have already started CQL
requests. When a node learns that some node restarted while it
considers it UP, it will close connections to that node. This will
fail RPCs sent to that node, which will cause CQL requests to time out.
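A sketch of the rolling-restart loop this implies (the "sees other server as up" helper is hypothetical, a positive counterpart of server_not_sees_other_server):
```python
async def rolling_restart(manager, servers):
    for i, srv in enumerate(servers):
        await manager.server_restart(srv.server_id)
        # Before touching the next node, make sure everyone already sees
        # this one as UP again, so later shutdown notifications are not lost.
        for other in servers[:i] + servers[i + 1:]:
            await manager.server_sees_other_server(other.ip_addr, srv.ip_addr)  # hypothetical helper
```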
Fixes #14746
Closes scylladb/scylladb#16010
Scylla can be configured to use different IPs for the internode communication
and client connections. This test allocates and configures unique IP addresses
for the client connections (`rpc_address`) for a 2-node cluster.
Two scenarios are tested:
1) Change RPC IPs sequentially
2) Change RPC IPs simultaneously
Closes scylladb/scylladb#15965
We add a new function - servers_add - that allows adding multiple
servers concurrently to a cluster. It makes use of a concurrent
bootstrap now supported in the raft-based topology.
servers_add doesn't have the replace_cfg parameter. The reason is
that we don't support concurrent replace operations, at least for
now.
There is an implementation detail in ScyllaCluster.add_servers. We
cannot simply do multiple calls to add_server concurrently. If we
did that in an empty cluster, every node would take itself as the
only seed and start a new cluster. To solve this, we introduce a
new field - initial_seed. It is used to choose one of the servers
as a seed for all servers added concurrently to an empty cluster.
Note that the add_server calls in asyncio.gather in add_servers
cannot race with each other when setting initial_seed because
there is only one thread.
In the future, we will also start all initial servers concurrently
in ScyllaCluster.install_and_start. The changes in this commit were
designed in a way that will make changing install_and_start easy.
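A condensed sketch of the seed-selection logic described above (the private helpers are illustrative):
```python
import asyncio

class ScyllaCluster:
    # ... sketch, other members elided ...
    def __init__(self) -> None:
        self.running: dict = {}                 # server_id -> running server
        self.initial_seed: str | None = None

    async def add_server(self, seeds: list[str] | None = None):
        ip = self._allocate_ip()                # illustrative helper
        if not seeds:
            if self.running:
                seeds = [srv.ip_addr for srv in self.running.values()]
            else:
                # Empty cluster: all servers added concurrently must share one
                # seed, otherwise each would bootstrap its own cluster. The
                # first add_server call claims the role; asyncio.gather runs
                # on a single thread, so this check-and-set cannot race.
                if self.initial_seed is None:
                    self.initial_seed = ip
                seeds = [self.initial_seed]
        return await self._create_and_start(ip, seeds)   # illustrative helper

    async def add_servers(self, servers_num: int):
        return await asyncio.gather(*(self.add_server() for _ in range(servers_num)))
```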
In the following commits, we make the topology coordinator reject
join requests if the node being replaced is considered alive by the
gossiper. Before making this change, we need to adapt the testing
framework so that we don't have flaky replace operations that fail
because the node being replaced hasn't been marked as dead yet. We
achieve this by waiting until all other running nodes see the node
being replaced as dead in all replace operations.
After this change, if we try to add a server and it fails with an
expected error, the add_server function will not throw. Also, the
server will be correctly installed and stopped.
Two issues motivate this feature.
The first one is that if we want to add a server while expecting
an error, we have to do it in two steps:
- call server_add with the start parameter set to False,
- call server_start with the expected_error parameter.
It is quite inconvenient.
The second one is that we want to be able to test the replace
operation when it is considered incorrect, for example when we try
to replace an alive node. To do this, we would have to remove
some assertions from ScyllaCluster.add_server. However, we should
not remove them because they give us clear information when we
write an incorrect test. After adding the expected_error parameter,
we can ignore these assertions only when we expect an error. In
this way, we enable testing failing replace operations without
sacrificing the testing framework's protection.
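A simplified sketch of the resulting control flow in ScyllaCluster.add_server (the helper names are illustrative):
```python
async def add_server(self, replace_cfg=None, expected_error: str | None = None):
    server = await self._install_server(replace_cfg)        # illustrative helper
    try:
        await server.start()
    except Exception as e:
        if expected_error is None or expected_error not in str(e):
            raise
        # The failure was anticipated: swallow it and leave the server
        # installed but stopped, so the test can inspect or restart it later.
        await server.stop()
        return server
    assert expected_error is None, \
        f"server started although an error matching '{expected_error}' was expected"
    return server
```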
This commit adds the auth_cluster test suite to test a custom scenario
involving password authentication:
- create a cluster of 2 nodes with password authentication
- down one node
- the other node should refuse login stating that it couldn't reach
QUORUM
References ScyllaDB OSS #2339
You can now pass `expected_error` to `ManagerClient.decommission_node`
and `ManagerClient.remove_node`. Useful in combination with error
injections, for example.
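Hypothetical usage (the error text is made up):
```python
# The decommission is expected to fail (e.g. due to a previously enabled
# error injection); the call succeeds only if the reported error contains
# the given text.
await manager.decommission_node(server.server_id,
                                expected_error="connection is closed")
```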
Closes scylladb/scylladb#15650
In 20ff2ae5e1 mutating endpoints were
changed to use PUT. But some of them return a response, and I forgot to
provide the `response_type` parameter to `put_json` (which causes
`RESTClient` to actually obtain the response). These endpoints now
return `None`.
Fix this.
Closes scylladb/scylladb#15674
Some endpoint handlers return JSON, some return text, some return empty
responses.
Reduce the number of different handler types by making the text case a
subcase of the JSON case. This also simplifies some code on the
`ManagerClient` side, which would have to deserialize data from text
(because some endpoint handlers would serialize data into text for no
particular reason). And it will allow reducing boilerplate in later
commits even further.
`ScyllaClusterManager` registers a bunch of HTTP endpoints which
`ManagerClient` uses to perform operations on a cluster during
a topology test.
The endpoints were using HTTP verbs inconsistently, e.g. using GET for
endpoints that have side effects. Use PUT for these.
This reverts commit 628e6ffd33, reversing
changes made to 45ec76cfbf.
The test included with this PR is flaky and often breaks CI.
Revert while a fix is found.
Fixes: #15371
Add a class that handles log file browsing with the following features:
* mark: returns "a mark" pointing to the current position in the log.
* wait_for: asynchronously checks if the log contains the given message.
* grep: returns a list of lines matching the regular expression in the log.
Add a new endpoint in `ManagerClient` to obtain the scylla logfile path.
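A minimal sketch of such a class (the method names come from the list above; the class name and polling details are illustrative):
```python
import asyncio
import re

class ScyllaLogFile:
    def __init__(self, path: str) -> None:
        self.path = path

    def mark(self) -> int:
        # "A mark" is simply the current end-of-file offset.
        with open(self.path, "rb") as f:
            f.seek(0, 2)                        # SEEK_END
            return f.tell()

    def grep(self, expr: str, from_mark: int = 0) -> list[str]:
        pattern = re.compile(expr)
        with open(self.path, "rb") as f:
            f.seek(from_mark)
            return [line for line in (raw.decode(errors="replace") for raw in f)
                    if pattern.search(line)]

    async def wait_for(self, expr: str, from_mark: int = 0, timeout: float = 600) -> None:
        # Poll the log until a matching line appears or the timeout expires.
        async def poll() -> None:
            while not self.grep(expr, from_mark):
                await asyncio.sleep(0.1)
        await asyncio.wait_for(poll(), timeout)
```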
Fixes #14782
Closes #14834
We add support for `--ignore-dead-nodes` in `raft_removenode` and
`--ignore-dead-nodes-for-replace` in `raft_replace`. For now, we allow
passing only host ids of the ignored nodes. Supporting IPs is currently
impossible because `raft_address_map` doesn't provide a mapping from IP
to a host id.
The main steps of the implementation are as follows:
- add the `ignore_nodes` column to `system.topology`,
- set the `ignore_nodes` value of the topology mutation in `raft_removenode` and `raft_replace`,
- extend `service::request_param` with alternative types that allow storing a set of ids of the ignored nodes,
- load `ignore_nodes` from `system.topology` into `request_param` in `system_keyspace::load_topology_state`,
- add `ignore_nodes` to `exclude_nodes` in `topology_coordinator::exec_global_command`,
- pass `ignore_nodes` to `replace_with_repair` and `remove_with_repair` in `storage_service::raft_topology_cmd_handler`.
Additionally, we add `test_raft_ignore_nodes.py` with two tests that verify the added changes.
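On the test side this enables calls along these lines (a sketch; exact signatures may differ):
```python
# Remove a dead node while telling the coordinator to ignore another dead
# node, identified by its host id (IPs are not supported, see above).
ignored_host_id = await manager.get_host_id(other_dead.server_id)
await manager.remove_node(initiator.server_id, removed.server_id,
                          ignore_dead=[ignored_host_id])
```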
Fixes #15025
Closes #15113
* github.com:scylladb/scylladb:
test: add test_raft_ignore_nodes
test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
raft topology: pass ignore_nodes to {replace, remove}_with_repair
raft topology: exec_global_command: add ignore_nodes to exclude_nodes
raft topology: exec_global_command: change type of exclude_nodes
topology_state_machine: extend request_param with a set of raft ids
raft topology: set ignore_nodes in raft_removenode and raft_replace
utils: introduce split_comma_separated_list
raft topology: add the ignore_nodes column to system.topology