Commit Graph

175 Commits

Author SHA1 Message Date
Nadav Har'El
3734afe193 test/cluster: document that add/start waits for all ports to be ready
Add docstrings to server_add(), server_start(), and servers_add() explaining
that they wait for ServerUpState.SERVING before returning, which means Scylla
has finished listening on all configured ports (including non-default ones).
Note that server_add() and server_start() accept expected_server_up_state to
return earlier if needed, while servers_add() always waits for SERVING.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 18:56:32 +03:00
Nadav Har'El
e014521565 test/cluster: make server_start() default to ServerUpState.SERVING
For the same reason server_add() was changed to default to SERVING
(see previous commit), server_start() had the same bug: after restarting
a node that listens on non-default ports, the polling of the hardcoded
CQL/Alternator ports could succeed before the custom ports were ready,
causing intermittent failures.

Apply the same fix to server_start() in manager_client.py,
ScyllaCluster.server_start(), and the _cluster_server_start HTTP handler.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 18:18:32 +03:00
Nadav Har'El
f91525c5df test/cluster: make server_add() default to ServerUpState.SERVING
server_add() was defaulting to ServerUpState.CQL_ALTERNATOR_QUERIED,
which polls the standard CQL and Alternator ports to determine when the
server is ready. This is wrong when a test configures Scylla to listen
on non-default ports: the polling succeeds on the default ports while the
custom ports may not yet be ready, making such tests intermittently flaky.

The correct behavior is ServerUpState.SERVING, which waits for Scylla's
sd_notify("READY=1") signal. This signal is sent only after all
configured listeners — including custom ports — are fully open, so it
is the right readiness signal regardless of the port configuration.

Up to now, the fix for each affected test was to pass
expected_server_up_state=ServerUpState.SERVING explicitly once the
flakiness was noticed (e.g. #29737). Change the default so that all
future tests get the correct behavior automatically.

Changed in manager_client.server_add(), ScyllaCluster.add_server(), and
the _cluster_server_add HTTP handler. The multi-server servers_add() path
already inherits the new default through add_server().

Fixes SCYLLADB-1822

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 18:18:32 +03:00
Piotr Szymaniak
d5efd1f676 test/cluster: wait for Alternator readiness in server startup
server_add() only waits for CQL readiness before returning. The
Alternator HTTP port may not be listening yet, causing
ConnectionRefused with Alternator tests.

Extend the ServerUpState enum and startup loop to also check Alternator
port readiness when configured. Whenever Alternator port(s) is/are
configured, each is verified if connectable and queryable,
similar to how CQL ports are probed.

Fixes SCYLLADB-1701

Closes scylladb/scylladb#29625
2026-04-25 16:35:44 +03:00
Dario Mirovic
f77ff28081 test: manager_client: use safe_driver_shutdown for exclusive_clusters
Using cluster.shutdown() is an incorrect way to shut down a Cassandra Cluster.
The correct way is using safe_driver_shutdown.

Fixes SCYLLADB-1434

Closes scylladb/scylladb#29390
2026-04-19 21:31:18 +03:00
Marcin Maliszkiewicz
e78e6cd584 test: filter_errors: support list[list[str]] error groups
Accept both list[str] (from distinct_errors=True) and
list[list[str]] (from distinct_errors=False) in filter_errors(),
matching against the first line of each error group. This allows
tests that call grep_for_errors() with default arguments to
pipe results directly through filter_errors().
2026-04-13 18:33:29 +02:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Dario Mirovic
821f8696a7 test: pylib: shut down exclusive cql connections in ManagerClient
get_cql_exclusive() creates a Cluster object per call, but never
records it. driver_close() cannot shut it down. The cluster's
internal scheduler thread then tries to submit work to an already
shut down executor. This causes RuntimeError:

RuntimeError: cannot schedule new futures after shutdown

Fix this by tracking every exclusive Cluster in a list and shutting
them all down in driver_close().

Refs SCYLLADB-573
2026-03-19 16:12:13 +01:00
Dario Mirovic
8367509b3b test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
This patch allows ManagerClient.get_cql_exclusive to accept AuthProvider
as parameter. This will be used in a follow up patch which migrates
audit test suite to test/cluster and requires this functionality for
some tests.

Refs SCYLLADB-573
2026-03-19 15:35:24 +01:00
Andrei Chekun
c36df5ecf4 test.py: eliminite drivers exception
There is a race condition in driver that raises the RuntimeException.
This pollutes the output, so this PR is just silencing this exception.

Fixes: SCYLLADB-900

Closes scylladb/scylladb#28957
2026-03-10 14:31:36 +02:00
Andrei Chekun
8acba40c84 test.py: fix unawaited ScyllaLogFile.grep() coroutines
Fixed several places where ScyllaLogFile.grep() was called without
await, resulting in checking coroutine objects for truthiness instead
of actual log matches.

Fixes: SCYLLADB-903
2026-03-09 19:41:07 +01:00
Dario Mirovic
0e5ddec2a8 test: pylib: fix connect_driver handling when adding and starting server
When connect_driver=False, the expected server up state should be
capped to HOST_ID_QUERIED. This is to avoid waiting for CQL readiness,
which requires a superuser to be present.

This logic was only in ScyllaCluster.server_start. ManagerClient.server_add
with start=True and connect_driver=False would still wait for CQL and hang
if no superuser is present. The workaround was to call
ManagerClient.server_add(start=False, connect_driver=False) followed by
ManagerClient.server_start(connect_driver=False).

This patch moves the capping from ScyllaCluster.server_start to
ManagerClient.server_add and ManagerClient.server_start, where connect_driver
is processed. ScyllaCluster only receives the already resolved
expected_server_up_state value.

Refs SCYLLADB-409
2026-03-03 23:42:25 +01:00
Patryk Jędrzejczak
67045b5f17 Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

Fixes #21452

Closes scylladb/scylladb#24129

* https://github.com/scylladb/scylladb:
  docs: Document parallel decommission and removenode and relevant task API
  test: Add tests for parallel decommission/removenode
  test: util: Introduce ensure_group0_leader_on()
  test: tablets: Check that there are no migrations scheduled on draining nodes
  test: lib: topology_builder: Introduce add_draining_request()
  topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
  tablets: topology_coordinator: Refactor to propagate reason for migration rollback
  tablet_allocator: Skip co-location on draining nodes
  node_ops: task_manager_module: Populate entity field also for active requests
  tasks: node_ops: Put node id in the entity field
  tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
  topology: Protect against empty cancelation reason
  tasks, topology: Make pending node operations abortable
  doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
  raft_topology, tablets: Drain tablets in parallel with other topology operations
  virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
  locator: topology: Add "draining" flag to a node
  topology_coordinator: Extract generate_cancel_request_update()
  storage_service: Drop dependency in topology_state_machine.hh in the header
  locator: Extract common code in assert_rf_rack_valid_keyspace()
  topology_coordinator, storage_service: Validate node removal/decommission at request submission time
2026-01-22 13:06:53 +01:00
Aleksandra Martyniuk
f0dbf6135d test: add test for enforce_rack_list option 2026-01-20 10:01:15 +01:00
Tomasz Grabiec
5c93e12373 test: util: Introduce ensure_group0_leader_on()
Many tests want to assume that group0 leader runs on a particualr
server, typically the first server in the list.

And they cannot be easily made to work with arbitrary leader, becuase
they setup a particular topology and then stop particular nodes, and
want to assume the leader is stable. They open leader's log and
expect things to appear in that log.

It's much easier to ensure the leader, than to prepare tests to
handle failovers.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
c8098e07c9 test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
It's a global operation, so we can use any server.

It's not only convenient. The call via api.disable_tablet_balancing()
confuse people to think that it's a per-server operation. This leads
to proliferation of code which does it needlessly on all servers.
2026-01-13 00:38:00 +01:00
Avi Kivity
50a3460441 test: pylib/log_browsing, cluster/test_tablets: don't look for expected crashes
test_tablets.test_orphaned_sstables_on_startup verifies that an
on_internal_error("Unable to load SSTable...") is generated when
an sstable outside a tablet boundary is found on startup.

The test indeed finds the error, but then proceeds to hang in
find_backtraces(), or fail if find_backtraces() is fixed, since
it finds an unexpected (for it) crash.

Fix this by not looking for crashes if a new option expected_crash
is set. Set it for this test.
2025-12-25 20:22:17 +02:00
Cezar Moise
95d0782f89 test: add crash detection during tests
After tests end, an extra check if performed, looking into node logs.
By default, it only searches for critical errors and scans for coredumps.
If the test has the fixture `check_nodes_for_errors`, it will search for all errors.
Both checks can be ignored by setting `ignore_cores_log_patterns` and `ignore_log_patterns`.
If any of the above are found, the test will fail with an error.
2025-12-18 16:28:13 +02:00
Cezar Moise
7c8ab3d3d3 test.py: add pid to ServerInfo
Adding pid info to servers allows matching coredumps with servers

Other improvements:
- When replacing just some fields of ServerInfo, use `_replace` instead of
building a new object. This way it is agnostic to changes to the Object
- When building ServerInfo from a list, the types defined for its fields are
not enforced, so ServerInfo(*list) works fine and does not need to be changed if
fields are added or removed.
2025-12-12 15:11:03 +02:00
Artsiom Mishuta
696596a9ef test.py: shutdown ManagerClient only in current loop In python 3.14 there is stricter policy regarding asyncio loops. This leads that we can not close clients from different loops. This change ensures that we are closing only client in the current loop.
Closes scylladb/scylladb#26911
2025-11-16 19:19:46 +02:00
Petr Gusev
33e9ea4a0f test.py: add universalasync_typed_wrap
The universalasync.wrap function doesn't preserve the
type information, which confuses the VS Code Pylance
plugin and makes code navigation hard.

In this commit we fix the problem by adding a typed
wrapped around universalasync.wrap.

Fixes: scylladb/scylladb#26639
2025-10-22 11:32:37 +02:00
Nadav Har'El
aa8d6e9e74 test/pylib: add the ability to stop currently-starting servers
Some tests need the ability to abruptly stop a server in the test
cluster before it fully booted - e.g., because the test knows (and
perhaps even expects) that the boot is hung. But before this patch,
manager.server_stop() could only kill servers in "running" state.

This patch adds to pylib tracking of "starting" servers - servers which
we are starting but haven't finished booting - their list can be
returned by the manager.starting_servers(). The manage.server_stop
function can now kill a server which is just starting - not just
"running" servers.

To avoid breaking existing tests, manager.all_servers() continues to
return just running and stopped servers - not "starting" servers.
By the way, when a starting server is killed, it is not listed as stopped -
it just behaves as a normal failure to add the server, and not as a
server which successfully joined the cluster but was later stopped.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-09-25 14:00:16 +03:00
Petr Gusev
49b036cf2b pylib: extract upgrade helpers from test_sstable_compression_dictionaries_upgrade.py
We want to reuse them to test upgade for LWT fencing
2025-09-15 12:34:45 +02:00
Patryk Jędrzejczak
31372843e4 test: manager_client: allow removing a config option
Currently, there is no simple way to remove an option from the server's
config file in tests. One example when this is needed is removing the
`recovery_leader` option on all servers during the recovery procedure.

In this commit, we add a new method to `ManagerClient` that removes
an option from the given server's config file.
2025-08-07 11:20:00 +02:00
Patryk Jędrzejczak
ce26896704 test: manager_client: add docstring to server_update_config 2025-08-07 11:19:54 +02:00
Sergey Zolotukhin
4f63e1df58 test: Set request_timeout_on_shutdown_in_seconds to request_timeout_in_ms,
decrease request timeout.

In debug mode, queries may sometimes take longer than the default 30 seconds.
To address this, the timeout value `request_timeout_on_shutdown_in_seconds`
during tests is aligned with other request timeouts.
Change request timeout for tests from 180s to 90s since we must keep the request
timeout during shutdown significantly lower than the graceful shutdown timeout(2m),
or else a request timeout would cause a graceful shutdown timeout and fail a test.
2025-07-29 15:37:47 +02:00
Pavel Emelyanov
4d4406c5bc Merge 'test.py: dtest: port next_gating tests from auth_test.py' from Evgeniy Naydanov
Copy `auth_test.py` from scylla-dtest test suite, remove all not next_gating tests from it, and make it works with `test.py`

As a part of the porting process, remove unused imports and markers, remove non-next_gating tests and tests marked with `required_features("!consistent-topology-changes")` marker.

Remove `test_permissions_caching` test because it's too flaky when running using test.py

Also, make few time execution optimizations:
  - remove redundant `time.sleep(10)`
  - use smaller timeouts for CQL sessions

Enable the test in `suite.yaml` (run in dev mode only.)

Additional modifications to test.py/dtest shim code:

- Modify ManagerClient.server_update_config() method to change multiple config options in one call in addition to one `key: value` pair.
- Implement the method using slightly modified `set_configuration_options()` method of `ScyllaCluster`.
- Copy generate_cluster_topology() function from tools/cluster_topology.py module.
- Add support for `bootstrap` parameter for `new_node()` function.
- Rework `wait_for_any_log()` function.

Closes scylladb/scylladb#24648

* github.com:scylladb/scylladb:
  test.py: dtest: make auth_test.py run using test.py
  test.py: dtest: rework wait_for_any_log()
  test.py: dtest: add support for bootstrap parameter for new_node
  test.py: dtest: add generate_cluster_topology() function
  test.py: dtest: add ScyllaNode.set_configuration_options() method
  test.py: pylib/manager_client: support batch config changes
  test.py: dtest: copy unmodified auth_test.py
  test.py: dtest: add missed markers to pytest.ini
2025-07-04 10:51:52 +03:00
Patryk Jędrzejczak
8d925b5ab4 test: increase the default timeout of graceful shutdown
Multiple tests are currently flaky due to graceful shutdown
timing out when flushing tables takes more than a minute. We still
don't understand why flushing is sometimes so slow, but we suspect
it is an issue with new machines spider9 and spider11 that CI runs
on. All observed failures happened on these machines, and most of
them on spider9.

In this commit, we increase the timeout of graceful shutdown as
a temporary workaround to improve CI stability. When we get to
the bottom of the issue and fix it, we will revert this change.

Ref #12028

It's a temporary workaround to improve CI stability, we don't
have to backport it.

Closes scylladb/scylladb#24802
2025-07-04 10:43:38 +03:00
Michael Litvak
6bfb82844f test/pylib/tablets: fix test api to read tablet replicas from base table
When reading tablet replicas from system.tablets, we need to refer to
the base table partition, if any.

We fix and simplify the test api for reading tablet replicas to read
from the base table.
2025-07-01 13:20:19 +03:00
Evgeniy Naydanov
a1ce3aed44 test.py: pylib/manager_client: support batch config changes
Modify ManagerClient.server_update_config() method to change
multiple config options in one call in addition to one `key: value`
pair.  All internal machinery converted to get a values dict as a
parameter.  Type hints were adjusted too.
2025-06-30 10:16:36 +00:00
Pavel Emelyanov
4c0154f156 Merge 'test.py: enhance allure reporting' from Andrei Chekun
Add run ID for process output file to be not overwritten in the next case: first run failed, second passed. They are using the same name, so the second run will overwrite and delete the file. This will help to investigate in case of C++ test fails
Add attaching Scylla log files to allure report in case test failed. This is an alternative for link in JUnit report that exists in CI. That change will help to investigate the cluster tests fails. Example can be found in the failed [job](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2980/allure/).

Backport is not needed, this is only framework enhancements

Closes scylladb/scylladb#24677

* github.com:scylladb/scylladb:
  test.py: Attach node logs in allure report in case of fail
  test.py: Add run id to the boost output file
2025-06-27 16:22:03 +03:00
Andrei Chekun
2c726c5074 test.py: Attach node logs in allure report in case of fail
Currently, allure report have no nodes logs in case of fail, this will
allow to view the logs in one place without going anywhere else.
2025-06-26 15:37:33 +02:00
Marcin Maliszkiewicz
a3bb679f49 test: pylib: add ability to specify default authenticator during server_start
Sometimes we may not want to use default cassandra role for
control connection, especially when we test dropping default role.
2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz
d9ec746c6d test: pylib: allow rolling restart without waiting for cql
Waiting for CQL requires default superuser being present
in db. In some cases we may delete it and still want to do
rolling restart. Additionally if we need CQL we may want to
wait after restart is complete (once, and not for each node).
2025-06-26 12:28:08 +02:00
Michał Chojnowski
5da19ff6a6 pylib/manager_client: add server_switch_executable
Add an util for switching the Scylla executable during the test.
Will be used for upgrade tests.
2025-06-02 15:03:08 +02:00
Michał Chojnowski
1ff7e09edc test/pylib: in add_server, give a way to specify the executable and version-specific config
This will be used for upgrade tests.
The cluster will be started with an older executable and without configs
specific to newer versions.
2025-06-02 15:03:08 +02:00
Evgeniy Naydanov
ac1551892b test.py: initial implementation of dtest/ccm shim
Use universalasync library to make test.py async code compatible
with synchronous code of dtest/ccm

Also, copied unmodified error_example_test.py from dtest as an example.

Run the test in `dev` mode only.
2025-05-19 12:27:31 +00:00
Evgeniy Naydanov
2cb640f95c test.py: manager: add server_get_returncode() method
The method return None if Scylla process is still running or returncode.
If there is no Scylla process launched then raise NoSuchProcess exception.
2025-05-19 11:50:55 +00:00
Evgeniy Naydanov
d874beb17f test.py: manager: change CLI and env options on a node start
Add parameters to server_start() method to provide ability to
change Scylla' CLI and env options on a node start.

Also, add `expected_server_up_state` parameter as we have for
server_add() method.
2025-05-19 11:50:55 +00:00
Andrzej Jackowski
1f1e4f09cd test: add get_cql_exclusive to manager_client.py
This commit adds to ManagerClient a get_cql_exclusive function that
allows creating a cql connection with WhiteListRoundRobinPolicy for
a single server. Such connection is useful in tests that kill nodes to
make sure that the live node handles the queries. Before this commit,
some tests used cluster_con from test/cluster/conftest.py, and after
this commit test can start to use a method from MangerClient.

This change:
 - Extend ManagerClient con_gen type to allow LoadBalancingPolicy arg
 - Implement get_cql_exclusive()
2025-04-23 09:29:47 +02:00
Botond Dénes
0fdf2a2090 Merge 'test/pylib: servers_add: support list of property_files' from Benny Halevy
So that a multi-dc/multi-rack cluster can be populated
in a single call.

* Enhancement, no backport required

Closes scylladb/scylladb#23341

* github.com:scylladb/scylladb:
  test/pylib: servers_add: add auto_rack_dc parameter
  test/pylib: servers_add: support list of property_files
2025-04-01 09:14:20 +03:00
Benny Halevy
a4aa4d74c1 test/pylib: servers_add: add auto_rack_dc parameter
To quickly populate nodes in a single dc,
each node in its own rack.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-30 19:23:40 +03:00
Benny Halevy
c4dbb11c87 test/pylib: servers_add: support list of property_files
So that a multi-dc/multi-rack cluster can be populated
in a single call.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-03-30 19:12:39 +03:00
Evgeniy Naydanov
9cb0ec2b42 test.py: topology: run tests using bare pytest command
Run ScyllaClusterManager using pytest fixture if `--manager-api`
option is not provided.

On this stage we're trying to be as close to test.py as possible.
test.py runs tests file-by-file, so, effectively, scopes `session`,
`package`, and `module` are pretty same.  Also, test.py starts
ScyllaClusterManager for every test module and this is the reason
why fixture `manager_api_sock_path` has scope=`module`.  And, in
result, we need to change scope for fixture `manager_internal` too.
2025-03-30 03:19:29 +00:00
Artsiom Mishuta
20777d7fc6 test.py: introduce prepare_3_nodes_cluster marker
prepare_3_nodes_cluster marker will allow preparing non-dirty 3 nodes cluster
that can be reused between tests
2025-03-04 10:32:43 +01:00
Andrzej Jackowski
e70ba7e3ed test: implement connect_driver argument in ManagerClient::server_add
This commit introduces connect_driver argument in
ManagerClient::server_add. The argument allow skipping CQL driver
initialization part during server start. Starting a server without
the driver is necessary to implement some test scenarios related
to system initialization.

After stopping a server, ManagerClient::server_start can be used to
start the server again, so connect_driver argument is also added here to
allow preventing connecting the driver after a server restart.

This change:
 - Implement connect_driver argument in ManagerClient::server_add
 - Implement connect_driver argument in ManagerClient::server_start
2025-02-06 10:30:55 +01:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Emil Maskovsky
92db2eca0b test/topology_custom: fix the flaky test_raft_recovery_stuck
The test is only sending a subset of the running servers for the rolling
restart. The rolling restart is checking the visibility of the restarted
node agains the other nodes, but if that set is incomplete some of the
running servers might not have seen the restarted node yet.

Improved the manager client rolling restart method to consider all the
running nodes for checking the restarted node visibility.

Fixes: scylladb/scylladb#19959

Closes scylladb/scylladb#21477
2024-11-12 16:38:28 +01:00
Benny Halevy
0c1e85b6e3 test/pylib: ServerInfo: add datacenter and rack attributes
Set to "DEFAULT_DC" and "DEFAULT_RACK" by default.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-11-04 14:11:30 +02:00
Piotr Smaron
e0c1a51642 cql/tablets: handle MVs in ALTER tablets KEYSPACE
ALTERing tablets-enabled KEYSPACES (KS) didn't account for materialized
views (MV), and only produced tablets mutations changing tables.
With this patch we're producing tablets mutations for both tables and
MVs, hence when e.g. we change the replication factor (RF) of a KS, both the
tables' RFs and MVs' RFs are updated along with tablets replicas.
The `test_tablet_rf_change` testcase has been extended to also verify
that MVs' tablets replicas are updated when RF changes.

Fixes: #20240

Closes scylladb/scylladb#21007
2024-10-09 10:51:18 +02:00