Commit Graph

926 Commits

Author SHA1 Message Date
Avi Kivity
85374207ca Merge 'test.py: rewrite gather metrics' from Andrei Chekun
Rewrite gather metrics to be able to gather metrics for python tests correctly.
Python tests require different handling of metrics gathering from cgroup than C++ tests. pytest does not execute each Python test in a separate process, so a single test cannot be placed in its own cgroup to collect its metrics.
The idea is to put the whole pytest process into the cgroup and collect the metrics from it. This works because pytest runs its threads as completely separate processes, and each of them runs its tests consecutively.
Additionally, to simplify the code, the system resource monitor is moved to the pytest main thread.
Change the behavior of metrics gathering. Starting with this PR, some data is collected even with `--no-gather-metrics`. This data needs no configuration and is only test metadata: test name, execution time, and test status. When `--gather-metrics` is provided, the cgroup memory data for each specific test and the system CPU/RAM utilization are written as well.

Backport is not needed, because it's a framework change only.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-575

~Blocked by: https://github.com/scylladb/scylladb/pull/27618~

Python tests, together with their own Scylla instances, now have metrics gathered from the cgroups as well.
```bash
$ sqlite3 --header testlog/sqlite_af8cb.db 'select tst.path, tst.file, tst.test_name, user_sec,system_sec,usage_sec,memory_peak /1024/1024 as memory_peak_mb from test_metrics join tests as tst where tst.id = test_metrics.test_id order by memory_peak_mb desc limit 10;'
path|file|test_name|user_sec|system_sec|usage_sec|memory_peak_mb
test/cluster/dtest|limits_test.py|test_max_cells|489.468174|27.6638949999999|517.132069|4241
test/cluster/dtest|rebuild_test.py|test_rebuild_stream_abort_repro|93.6400869999998|28.9843249999999|122.624412|4241
test/cluster/dtest|schema_management_test.py|test_prepared_statements_work_after_node_restart_after_altering_schema_without_changing_columns|6.8933219999999|3.63569899999993|10.5290209999994|4241
test/cluster/dtest|schema_management_test.py|test_dropping_keyspace_with_many_columns|1.31770999999981|0.754742999999962|2.07245299999977|4241
test/cluster/dtest|schema_management_test.py|test_multiple_create_table_in_parallel|5.48435300000028|2.72915200000011|8.21350499999971|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[write]|80.687293|18.5562|99.2434920000005|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[read]|79.1984790000001|18.0969829999999|97.2954609999997|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[mixed]|85.332915|18.9321070000001|104.265022|4241
test/cluster/dtest|schema_management_test.py|test_update_schema_while_node_is_killed[create_table]|10.5875369999999|5.67954400000008|16.267081|4241
test/cluster/dtest|schema_management_test.py|test_update_schema_while_node_is_killed[alter_table]|11.3801709999998|6.54689099999996|17.9270630000001|4241
```
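
As a rough illustration of where numbers like these could come from, the sketch below converts the contents of a cgroup v2 `cpu.stat` file and a `memory.peak` value into the columns shown in the query output. The function names and the sample input are assumptions made for illustration; this is not the code from the PR.

```python
def parse_cpu_stat(text: str) -> dict[str, float]:
    """Convert the microsecond counters of a cgroup v2 cpu.stat file to seconds."""
    wanted = {
        "usage_usec": "usage_sec",
        "user_usec": "user_sec",
        "system_usec": "system_sec",
    }
    result: dict[str, float] = {}
    for line in text.splitlines():
        parts = line.split()
        # Each relevant line has the form "<key> <integer>".
        if len(parts) == 2 and parts[0] in wanted:
            result[wanted[parts[0]]] = int(parts[1]) / 1_000_000
    return result


def memory_peak_mb(memory_peak_text: str) -> int:
    """Convert the byte count stored in memory.peak to whole mebibytes."""
    return int(memory_peak_text.strip()) // (1024 * 1024)


# Made-up sample content, shaped like a real cpu.stat file.
sample_cpu_stat = """usage_usec 517132069
user_usec 489468174
system_usec 27663895
nr_periods 0
"""
metrics = parse_cpu_stat(sample_cpu_stat)
peak_mb = memory_peak_mb("4447010816\n")
```

In a real run the two strings would be read from the `cpu.stat` and `memory.peak` files of the cgroup that holds the pytest process.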

Closes scylladb/scylladb#28206

* github.com:scylladb/scylladb:
  test.py: Add host hardware info
  test.py: rewrite resource gather
2026-05-18 20:35:14 +03:00
Nadav Har'El
5dbd0d71d5 Merge 'test/pylib: test/pylib: Cached Scylla package resolver' from Alex Dathskovsky
This series adds a shared helper for resolving, downloading, unpacking, and
installing Scylla relocatable packages for test.py.

The first patch introduces `version_fetch_utils`, which can resolve public
Scylla artifacts from the downloads bucket by version, architecture, package
variant, or direct URL. It also centralizes the local cache/install flow using
retry handling, marker files, and file locking so repeated or concurrent test
runs can safely reuse an existing installation.

The second patch wires this helper into the existing Scylla executable setup
paths. This removes the hard-coded 2025.1 package URL and replaces the local
download/unpack/install logic in `scylla_cluster.py` with the shared resolver.
It also makes `--exe-url` use the same cached installer path.
Together, these changes make upgrade-test executable selection less brittle,
avoid duplicated install logic, and provide a reusable foundation for fetching
other Scylla versions in test.py.
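
The cache/install flow described above (marker file plus file lock) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: `ensure_installed`, the `.installed` marker name, and the `.lock` file name are hypothetical and are not taken from `version_fetch_utils`; the real helper also handles downloads and retries, which are omitted here. It relies on `fcntl`, so it is Unix-only.

```python
import fcntl
import tempfile
from pathlib import Path
from typing import Callable


def ensure_installed(install_dir: Path, install: Callable[[Path], None]) -> bool:
    """Run install(install_dir) at most once; return True if this call did the work."""
    install_dir.parent.mkdir(parents=True, exist_ok=True)
    marker = install_dir / ".installed"                          # hypothetical marker name
    lock_path = install_dir.parent / (install_dir.name + ".lock")  # hypothetical lock file
    with open(lock_path, "w") as lock_file:
        # Blocks until any concurrent installer of the same directory has finished.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if marker.exists():
                return False  # a previous run completed the installation
            install_dir.mkdir(parents=True, exist_ok=True)
            install(install_dir)
            marker.touch()  # written last, so a crashed install is retried
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


calls: list[Path] = []
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "scylla-example"
    first = ensure_installed(target, calls.append)
    second = ensure_installed(target, calls.append)
```

Writing the marker only after the install step succeeds is what lets a later run tell a complete installation from an interrupted one.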

Closes scylladb/scylladb#29855

* github.com:scylladb/scylladb:
  test/pylib: use version fetcher for Scylla executable setup
  test/pylib: add cached Scylla package installer
2026-05-18 16:32:47 +03:00
Andrei Chekun
a03c4fd754 test.py: Add host hardware info
Gather additional information about the running host for better metrics analysis
2026-05-18 12:23:40 +02:00
Andrei Chekun
6414c48fc2 test.py: rewrite resource gather
Python tests require different handling of metrics gathering from
cgroup than C++ tests. pytest does not execute each Python test in
a separate process, so a single test cannot be placed in its own
cgroup to collect its metrics.
The idea is to put the whole pytest process into the cgroup and collect
the metrics from it. This works because pytest runs its threads as
completely separate processes, and each of them runs its tests
consecutively.
Additionally, to simplify the code, the system resource monitor is
moved to the pytest main thread.
2026-05-18 12:23:40 +02:00
Evgeniy Naydanov
39a10d6d67 test: remove dead suite subclasses and legacy execution pipeline
After all test suites migrated to test_config.yaml with type: Python,
the specialized suite classes (Topology, CQLApproval, Run, Tool) and
the legacy execution pipeline (find_tests, run_test, TestSuite.run,
Test.run) became unreachable. Remove all this dead code.

Deleted files:
- suite/topology.py, suite/cql_approval.py, suite/run.py, suite/tool.py

Simplified:
- base.py: remove run_test(), read_log(), TestSuite.run(),
  add_test_list(), build_test_list(), all_tests(), test_count(),
  SUITE_CONFIG_FILENAME, disabled/flaky test tracking, and dead
  Test attributes (args, core_args, valid_exit_codes, allure_dir,
  is_flaky, is_cancelled, etc.)
- python.py: remove PythonTestSuite.run(), PythonTest.run(),
  _prepare_pytest_params(), pattern, test_file_ext, xmlout,
  server_log, scylla_env setup, and shlex import.
  Simplify run_ctx() to take no parameters.
- runner.py: remove --scylla-log-filename option,
  print_scylla_log_filename fixture, SUITE_CONFIG_FILENAME import,
  and suite.yaml probe in TestSuiteConfig.from_pytest_node().
- __init__.py: remove re-exports of deleted classes.
- test_config.yaml: Topology -> Python, Approval -> Python.
- conftest files: run_ctx(options=...) -> run_ctx().
- docs/dev/testing.md: update to reflect current pytest-based
  architecture, log paths, and removed features.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>

Closes scylladb/scylladb#29613
2026-05-17 22:16:31 +03:00
Alex
176dbf12c2 test/pylib: use version fetcher for Scylla executable setup
Replace the hard-coded 2025.1 archive download and local install logic with the
shared Scylla package fetch/install helper. This keeps upgrade-test executable
resolution and `--exe-url` handling on the same cached installer path.
2026-05-17 17:43:56 +03:00
Alex
1efe9a7243 test/pylib: add cached Scylla package installer
Add utilities to resolve relocatable Scylla artifacts from the public downloads
bucket by version, architecture, package variant, or direct URL. Download,
unpack, and install the selected archive into the test.py cache with retry
handling, marker files, and file locking so repeated or concurrent test runs can
reuse the same installation safely.
2026-05-17 17:43:56 +03:00
Andrei Chekun
a09fdfc46a test.py: fix issue that C++ tests' logs are deleted
Skip deleting the log file when a C++ test fails.

Closes scylladb/scylladb#29859
2026-05-13 21:31:03 +03:00
Avi Kivity
f2ab911a46 Merge 'test/cluster: fix server-starting functions to wait for all ports' from Nadav Har'El
This series fixes a recurring source of flaky tests in the cluster test suite.

When a test configures Scylla to listen on non-default ports (e.g. a custom Alternator port, proxy-protocol port or shard-aware port), server_add() and server_start() would declare the server ready by polling the hardcoded standard CQL and Alternator ports. Those ports can become available slightly before the custom ports finish binding, so the test could start using the custom port before it was open — causing intermittent failures.

The fix for each affected test was to pass `expected_server_up_state=ServerUpState.SERVING` explicitly, which waits for Scylla's sd_notify("STATUS=serving") signal instead. That signal is sent only after all configured listeners are fully open, so it is always the right readiness signal regardless of the port configuration. This workaround was applied again in PR #29737 and will keep being needed for every new test that uses a non-default port.

This series makes ServerUpState.SERVING the default at every level of the server start/add call stack so no test needs to remember it:

* Make server_add(), servers_add(), server_start() et al. all default to ServerUpState.SERVING.
* Document that server_add/server_start wait for all ports to be ready, so future test authors understand what the functions guarantee.
* Remove now-redundant expected_server_up_state=SERVING from existing tests.
* A small optimization: Fix check_serving_notification() returning False on first completion. When the sd_notify future completed, the function correctly updated _received_serving but still returned False, wasting one 100ms polling interval. Return self._received_serving directly.

Closes scylladb/scylladb#29758

* github.com:scylladb/scylladb:
  test/pylib: fix missing protocol_version=4 on control_cluster
  scylla_cluster: guard poll_status() set_result() calls against cancelled future
  test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING
  test/cluster: fix check_serving_notification() inefficiency
  test/cluster: remove now-redundant expected_server_up_state=SERVING
  test/cluster: document that add/start waits for all ports to be ready
  test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING
  test/cluster: fix server_add/server_start hanging when starting in maintenance mode
  main: notify "entering maintenance mode" after the maintenance CQL server is ready
  test/cluster: make server_start() default to ServerUpState.SERVING
  test/cluster: make server_add() default to ServerUpState.SERVING
2026-05-13 21:23:18 +03:00
Piotr Smaron
0fcae72530 test: bootstrap tombstone gc repair cluster sequentially
Avoid concurrent topology changes in the tombstone GC repair setup, where debug-mode nodes running hinted handoff and materialized view startup work can time out while applying Raft entries before the test starts.

Keep the sequential path opt-in so unrelated repair tests still exercise concurrent bootstrap behavior.

Closes scylladb/scylladb#29829
2026-05-13 13:58:44 +03:00
Yaniv Michael Kaul
5d6f160129 test: update get_scylla_2025_1_executable() to use 2025.1.12
Update the hardcoded 2025.1.0 binary URL to the latest 2025.1.12
release for upgrade tests.

The 2025.1.12 binary now supports and enforces the
rf_rack_valid_keyspaces option which the test harness enables by
default. Since test_sstable_compression_dictionaries_upgrade creates
a 2-node cluster in a single rack with RF=2, it violates the
constraint. Disable the option explicitly for this test.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29714
2026-05-12 23:20:55 +02:00
Botond Dénes
e95eb21a16 Merge 'Tablet-aware restore' from Pavel Emelyanov
The mechanics of the restore are as follows:

- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
  - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
  - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against current node and tablet being handled
  - Downloading and attaching the filtered sstables

This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.

This is the first step towards SCYLLADB-197 and still lacks many things. In particular:
- the API only works for single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking the API on restore will re-download everything again)
- not re-attachable (if the API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via another node)
- nodes download sstables in the maintenance/streaming sched group (should be moved to maintenance/backup)

Other follow-up items:
- have an actual swagger object specification for `backup_location`

Closes #28436
Closes #28657
Closes #28773

Closes scylladb/scylladb#28763

* github.com:scylladb/scylladb:
  docs: Update topology_over_raft.md with `restore` transition kind
  test: Add test for backup vs migration race
  test: Restore resilience test
  sstables_loader: Fail tablet-restore task if not all sstables were downloaded
  sstables_loader: mark sstables as downloaded after attaching
  sstables_loader: return shared_sstable from attach_sstable
  db: add update_sstable_download_status method
  db: add downloaded column to snapshot_sstables
  db: extract snapshot_sstables TTL into class constant
  test: Add a test for tablet-aware restore
  tablets: Implement tablet-aware cluster-wide restore
  messaging: Add RESTORE_TABLET RPC verb
  sstables_loader: Add method to download and attach sstables for a tablet
  tablets: Add restore_config to tablet_transition_info
  sstables_loader: Add restore_tablets task skeleton
  test: Add rest_client helper to kick newly introduced API endpoint
  api: Add /storage_service/tablets/restore endpoint skeleton
  sstables_loader: Add keyspace and table arguments to manfiest loading helper
  sstables_loader_helpers: just reformat the code
  sstables_loader_helpers: generalize argument and variable names
  sstables_loader_helpers: generalize get_sstables_for_tablet
  sstables_loader_helpers: add token getters for tablet filtering
  sstables_loader_helpers: remove underscores from struct members
  sstables_loader: move download_sstable and get_sstables_for_tablet
  sstables_loader: extract single-tablet SST filtering
  sstables_loader: make download_sstable static
  sstables_loader: fix formating of the new `download_sstable` function
  sstables_loader: extract single SST download into a function
  sstables_loader: add shard_id to minimal_sst_info
  sstables_loader: add function for parsing backup manifests
  split utility functions for creating test data from database_test
  export make_storage_options_config from lib/test_services
  rjson: Add helpers for conversions to dht::token and sstable_id
  Add system_distributed_keyspace.snapshot_sstables
  add get_system_distributed_keyspace to cql_test_env
  code: Add system_distributed_keyspace dependency to sstables_loader
  storage_service: Export export handle_raft_rpc() helper
  storage_service: Export do_tablet_operation()
  storage_service: Split transit_tablet() into two
  tablets: Add braces around tablet_transition_kind::repair switch
2026-05-12 16:24:13 +03:00
Pavel Emelyanov
150345cc52 Merge 'test: per-bucket isolation for S3/GCS object storage tests' from Ernest Zaslavsky
This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions.

New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully.

A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion.

A `client::make` overload that accepts a custom `retry_strategy`, used in tests with a fast 1ms retry delay instead of exponential backoff, significantly reducing test runtime for transient errors during bucket lifecycle operations.

Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness.
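
A bucket name derived from a pytest node name could be built along the following lines. This is only a sketch: `make_bucket_name` and its exact sanitization rules are assumptions for illustration, not the implementation of `create_test_bucket()`. It targets the usual S3 constraints (3 to 63 characters, lowercase letters, digits and hyphens, alphanumeric at both ends).

```python
import re
import uuid


def make_bucket_name(node_name: str, max_len: int = 63) -> str:
    """Build a bucket name from a pytest node name plus a short UUID suffix."""
    suffix = uuid.uuid4().hex[:8]
    # Replace every run of characters outside [a-z0-9-] with a single hyphen.
    base = re.sub(r"[^a-z0-9-]+", "-", node_name.lower()).strip("-")
    # Collapse hyphen runs left over from adjacent replacements.
    base = re.sub(r"-{2,}", "-", base)
    # Leave room for "-<suffix>" and make sure the base is never empty.
    base = base[: max_len - len(suffix) - 1].rstrip("-") or "test"
    return f"{base}-{suffix}"


name = make_bucket_name("test_backup[s3-True]")
other = make_bucket_name("test_backup[s3-True]")
long_name = make_bucket_name("x" * 200)
```

The random suffix keeps two invocations of the same test, for example from concurrent `test.py` processes, from colliding on one bucket.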

Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. Old class names are preserved as aliases for backward compatibility.

| Test Name                                                    | Test-specific retry strategy (ms) | Original (ms) |   Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy      |          19,261 |       61,395 | −42,134  | **3.2×** |
| test_client_upload_file_multi_part_without_remainder_proxy   |          16,901 |       53,688 | −36,787  | **3.2×** |
| test_client_upload_file_single_part_proxy                    |           3,478 |        6,789 |  −3,311  | **2.0×** |
| test_client_multipart_copy_upload_proxy                      |           1,303 |        1,619 |    −316  | 1.2×    |
| test_client_put_get_object_proxy                             |             150 |          365 |    −215  | **2.4×** |
| test_client_readable_file_stream_proxy                       |             125 |          327 |    −202  | **2.6×** |
| test_small_object_copy_proxy                                 |             205 |          389 |    −184  | 1.9×    |
| test_client_put_get_tagging_proxy                            |             181 |          350 |    −169  | 1.9×    |
| test_client_multipart_upload_proxy                           |           1,252 |        1,416 |    −164  | 1.1×    |
| test_client_list_objects_proxy                               |             729 |          881 |    −152  | 1.2×    |
| test_chunked_download_data_source_with_delays_proxy          |             830 |          960 |    −130  | 1.2×    |
| test_client_readable_file_proxy                              |             148 |          279 |    −131  | 1.9×    |
| test_client_upload_file_multi_part_with_remainder_minio      |           3,358 |        3,170 |    +188  | 0.9×    |
| test_client_upload_file_multi_part_without_remainder_minio   |           3,131 |        2,929 |    +202  | 0.9×    |
| test_client_upload_file_single_part_minio                    |             519 |          421 |     +98  | 0.8×    |
| test_download_data_source_proxy                              |             180 |          237 |     −57  | 1.3×    |
| test_client_list_objects_incomplete_proxy                     |             590 |          641 |     −51  | 1.1×    |
| test_large_object_copy_proxy                                 |             952 |          991 |     −39  | 1.0×    |
| test_client_multipart_upload_fallback_proxy                  |             148 |          185 |     −37  | 1.3×    |
| test_client_multipart_copy_upload_minio                      |             641 |          674 |     −33  | 1.1×    |

No backport needed — this is a test infrastructure improvement with no production code impact beyond the new `s3::client` methods.

Closes scylladb/scylladb#29508

* github.com:scylladb/scylladb:
  test: extract object storage helpers to test/pylib/object_storage.py
  test: add per-test bucket isolation to object_store fixtures
  s3: add client::make overload with custom retry strategy
  test: add s3_test_fixture and migrate tests to per-bucket isolation
  s3: add create_bucket and delete_bucket to client
2026-05-12 12:38:24 +03:00
Pavel Emelyanov
dcd490666b test: Add rest_client helper to kick newly introduced API endpoint
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Nadav Har'El
df8c9b17b8 Merge 'alternator: Graduate Alternator Streams from experimental' from Piotr Szymaniak
As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental.
So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature.
Finally, stop providing the config flag in test configs.

Fixes SCYLLADB-1680
Fixes #16367

The documentation work tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462 still remains.

This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport.

Closes scylladb/scylladb#29604

* github.com:scylladb/scylladb:
  test: Stop providing alternator-streams experimental flag
  alternator: Graduate Alternator Streams from experimental
2026-05-10 22:10:03 +03:00
Nadav Har'El
19555bc2cf test/pylib: fix missing protocol_version=4 on control_cluster
get_cql_up_state() creates two Cluster instances: a short-lived one
used to probe CQL readiness, and a persistent control_cluster kept
alive for the lifetime of the server.  The probe cluster was created
with protocol_version=4 (the highest version Scylla supports), but
the control_cluster was not, causing the driver to do a superfluous
version-negotiation round-trip on every server start.

Fix by extracting the shared constructor arguments into a cluster_kwargs
dict and using **cluster_kwargs for both calls, so the two Cluster
instances are created with identical parameters. This deduplication can
help avoid more instances of this bug, where someone modifies the
options in one call but forgets to change the options in the other
call.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 20:57:49 +03:00
Nadav Har'El
f977621e40 scylla_cluster: guard poll_status() set_result() calls against cancelled future
The poll_status() background thread resolves `serving_signal` by
scheduling `f.set_result(...)` on the event loop via
`call_soon_threadsafe`.  In parallel, `_cleanup_notify_socket()` can
cancel `serving_signal` at any time - for example when a server fails
to start and `stop()` -> `shutdown_control_connection()` is called while
the thread is still blocked in `recv()` (the socket close unblocks the
`recv()` with an exception, sending it down the error path).

When that race fires the scheduled `f.set_result(...)` callback runs
after `cancel()` has already put the future into the *cancelled* state,
raising `asyncio.InvalidStateError: Result is not allowed in cancelled
state`.

This bug predates the SERVING work, but the original
CQL_ALTERNATOR_QUERIED default meant the notify socket was torn down
quickly most of the time, making the window very narrow.  Now that
SERVING is the default the socket stays open throughout the full startup
wait, widening the race significantly.

Fix: replace every bare `f.set_result(v)` call with
`lambda: f.done() or f.set_result(v)`, which is a no-op when the
future is already done (cancelled, or resolved by another path).
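
The guard can be reproduced in isolation with plain asyncio. In the sketch below, `resolve_from_thread` and `demo` are hypothetical names; only the `f.done() or f.set_result(v)` pattern comes from the commit. A future is cancelled first, then a worker thread schedules the guarded callback, and the loop's exception handler records whether anything was raised.

```python
import asyncio


def resolve_from_thread(loop: asyncio.AbstractEventLoop, fut: asyncio.Future, value) -> None:
    """Schedule fut.set_result(value) on the loop, skipping futures that are already done."""
    loop.call_soon_threadsafe(lambda: fut.done() or fut.set_result(value))


async def demo():
    loop = asyncio.get_running_loop()
    errors = []
    # Exceptions raised inside scheduled callbacks end up in this handler.
    loop.set_exception_handler(lambda _loop, context: errors.append(context))

    cancelled = loop.create_future()
    cancelled.cancel()  # simulates the cleanup path winning the race
    await asyncio.to_thread(resolve_from_thread, loop, cancelled, True)

    pending = loop.create_future()
    await asyncio.to_thread(resolve_from_thread, loop, pending, True)
    result = await asyncio.wait_for(pending, timeout=5)

    await asyncio.sleep(0)  # give any remaining callbacks a chance to run
    return errors, cancelled.cancelled(), result


errors, still_cancelled, result = asyncio.run(demo())
```

With a bare `fut.set_result(value)` in the callback, the first case would report an `InvalidStateError` through the exception handler instead.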

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 20:34:58 +03:00
Nadav Har'El
ff33440c6c test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING
With ServerUpState.SERVING now the default, server_add() and server_start()
wait for sd_notify readiness after CQL is already up. During that window
the startup polling loop was calling get_cql_alternator_up_state() on every
iteration (every 100ms). Each successful call recreated self.control_cluster
and self.control_connection without closing the previous ones, leaking driver
connections and adding unnecessary CQL load to a node that was already known
to be queryable.

Fix in two places:

- Startup loop: skip the get_cql_alternator_up_state() call once
  server_up_state has reached CQL_ALTERNATOR_QUERIED. After that point only
  the cheap non-blocking check_serving_notification() is needed.

- get_cql_up_state(): guard control_cluster/control_connection creation with
  `if self.control_connection is None` so the persistent driver connection is
  only established once, even if the function is called multiple times.
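
The second fix amounts to a create-once guard. The sketch below uses a hypothetical `ServerProbe` class with an injected connection factory in place of the real driver objects, so it only illustrates the guard itself.

```python
class ServerProbe:
    """Hypothetical stand-in for the server object; connect is an injected factory."""

    def __init__(self, connect):
        self._connect = connect
        self.control_connection = None

    def get_cql_up_state(self) -> bool:
        # Create the persistent connection only on the first successful probe.
        if self.control_connection is None:
            self.control_connection = self._connect()
        return True


opened = []
probe = ServerProbe(lambda: opened.append("conn") or object())
for _ in range(5):
    probe.get_cql_up_state()
```

Five calls open a single connection here; without the guard each call would create a new one and drop the reference to the previous one.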

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 20:29:18 +03:00
Nadav Har'El
417b4e0765 test/cluster: fix check_serving_notification() inefficiency
When the sd_notify future completed, check_serving_notification() correctly
updated _received_serving to True but still returned False on that same call.
The SERVING state was only recognized on the next polling iteration, 100ms
later, for no reason.

Return self._received_serving instead of False after updating it.
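
A minimal sketch of the corrected behavior, assuming a hypothetical `NotifyWatcher` class that wraps the sd_notify future (the real method lives in the test library and has more context around it):

```python
import asyncio


class NotifyWatcher:
    """Hypothetical holder of the future resolved by the sd_notify listener."""

    def __init__(self, serving_signal: asyncio.Future):
        self.serving_signal = serving_signal
        self._received_serving = False

    def check_serving_notification(self) -> bool:
        signal = self.serving_signal
        if not self._received_serving and signal.done() and not signal.cancelled():
            self._received_serving = bool(signal.result())
        # Return the freshly updated flag instead of a hardcoded False.
        return self._received_serving


async def demo():
    fut = asyncio.get_running_loop().create_future()
    watcher = NotifyWatcher(fut)
    before = watcher.check_serving_notification()
    fut.set_result(True)
    after = watcher.check_serving_notification()
    return before, after


before, after = asyncio.run(demo())
```

The first call after the future completes already reports readiness, so the caller does not spend an extra polling interval.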
2026-05-05 18:56:37 +03:00
Nadav Har'El
3734afe193 test/cluster: document that add/start waits for all ports to be ready
Add docstrings to server_add(), server_start(), and servers_add() explaining
that they wait for ServerUpState.SERVING before returning, which means Scylla
has finished listening on all configured ports (including non-default ones).
Note that server_add() and server_start() accept expected_server_up_state to
return earlier if needed, while servers_add() always waits for SERVING.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 18:56:32 +03:00
Nadav Har'El
90eef72794 test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING
ScyllaServer.install_and_start() and ScyllaServer.start() still had
ServerUpState.CQL_ALTERNATOR_QUERIED as their default for
expected_server_up_state. In practice these defaults are never reached -
both call sites in ScyllaCluster always pass the value explicitly,
forwarding it from the higher-level add_server() and server_start()
whose defaults were already fixed.

Update them to SERVING anyway for consistency, so that the low-level
methods agree with the policy established at the higher layers and won't
silently revert to the wrong behavior if a new call site is added without
an explicit argument.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 18:51:19 +03:00
Nadav Har'El
af03f0e8c4 test/cluster: fix server_add/server_start hanging when starting in maintenance mode
When Scylla starts in maintenance mode it sends sd_notify("STATUS=entering
maintenance mode") instead of sd_notify("STATUS=serving"), and does not
open the standard CQL port. This caused two independent bugs after the
default was changed to ServerUpState.SERVING:

1. poll_status() resolved serving_signal to False on the maintenance
   notification, so check_serving_notification() would never return True,
   and start() would time out waiting for SERVING.

2. The readiness check in start() was guarded by
   `server_up_state >= CQL_ALTERNATOR_QUERIED`, which is never reached in
   maintenance mode (the standard CQL port is not open). Even if bug 1
   were fixed, SERVING would never be recognized.

Fix both:

- Treat STATUS=entering maintenance mode as a successful readiness signal
  in poll_status(), resolving serving_signal to True just like
  STATUS=serving. Both mean "all configured ports are now open".

- Remove the CQL_ALTERNATOR_QUERIED precondition from the
  check_serving_notification() call in start(). The sd_notify signal is
  authoritative: Scylla sends it only when fully ready, regardless of
  which ports it opened. No CQL precondition is needed.
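
The first fix boils down to accepting two status strings as readiness signals. A small sketch, assuming a hypothetical helper `is_ready_notification` that receives one raw notify datagram (newline-separated `KEY=value` lines):

```python
# Both strings are quoted from the commit message above.
READY_STATUSES = {"STATUS=serving", "STATUS=entering maintenance mode"}


def is_ready_notification(datagram: bytes) -> bool:
    """Return True if any line of the datagram is a known readiness status."""
    lines = datagram.decode("utf-8", errors="replace").splitlines()
    return any(line.strip() in READY_STATUSES for line in lines)


serving = is_ready_notification(b"STATUS=serving")
maintenance = is_ready_notification(b"STATUS=entering maintenance mode")
starting = is_ready_notification(b"STATUS=starting")
```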

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 18:51:18 +03:00
Nadav Har'El
e014521565 test/cluster: make server_start() default to ServerUpState.SERVING
For the same reason server_add() was changed to default to SERVING
(see previous commit), server_start() had the same bug: after restarting
a node that listens on non-default ports, the polling of the hardcoded
CQL/Alternator ports could succeed before the custom ports were ready,
causing intermittent failures.

Apply the same fix to server_start() in manager_client.py,
ScyllaCluster.server_start(), and the _cluster_server_start HTTP handler.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 18:18:32 +03:00
Nadav Har'El
f91525c5df test/cluster: make server_add() default to ServerUpState.SERVING
server_add() was defaulting to ServerUpState.CQL_ALTERNATOR_QUERIED,
which polls the standard CQL and Alternator ports to determine when the
server is ready. This is wrong when a test configures Scylla to listen
on non-default ports: the polling succeeds on the default ports while the
custom ports may not yet be ready, making such tests intermittently flaky.

The correct behavior is ServerUpState.SERVING, which waits for Scylla's
sd_notify("READY=1") signal. This signal is sent only after all
configured listeners — including custom ports — are fully open, so it
is the right readiness signal regardless of the port configuration.

Up to now, the fix for each affected test was to pass
expected_server_up_state=ServerUpState.SERVING explicitly once the
flakiness was noticed (e.g. #29737). Change the default so that all
future tests get the correct behavior automatically.

Changed in manager_client.server_add(), ScyllaCluster.add_server(), and
the _cluster_server_add HTTP handler. The multi-server servers_add() path
already inherits the new default through add_server().

Fixes SCYLLADB-1822

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 18:18:32 +03:00
Evgeniy Naydanov
96d3f13245 test: add --keep-duplicates and assign RUN_ID via shared cache
Add --keep-duplicates CLI argument to bypass deduplication and forward
to pytest, allowing duplicate test file arguments to be collected
multiple times.

Move RUN_ID assignment from pytest_collect_file to modify_pytest_item.
All File collectors for the same source file share a single run_ids
dict (via RUN_ID_CACHE stash key), so items from duplicate collection
arguments (e.g. with --keep-duplicates) automatically get unique IDs.
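
The idea of a per-file shared dict can be sketched without pytest. Everything below is an assumption for illustration: `run_id_cache` stands in for the stash entry behind RUN_ID_CACHE, and `assign_run_id` for the logic in modify_pytest_item.

```python
from collections import defaultdict

# One dict of counters per source file, shared by every collector of that file.
run_id_cache: dict[str, dict[str, int]] = defaultdict(dict)


def assign_run_id(file_path: str, test_name: str) -> int:
    """Return the next unused run ID for this test within its source file."""
    run_ids = run_id_cache[file_path]
    run_ids[test_name] = run_ids.get(test_name, 0) + 1
    return run_ids[test_name]


# The same file collected twice, as could happen with duplicate arguments.
first_pass = assign_run_id("test/boost/example_test.cc", "test_a")
second_pass = assign_run_id("test/boost/example_test.cc", "test_a")
other_file = assign_run_id("test/boost/other_test.cc", "test_a")
```

Because both collections of the file consult the same dict, the second item gets a different ID than the first.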

Remove CppFile.run_id cached_property — CppTestCase now reads RUN_ID
from its own item stash, which is set during modify_pytest_item.

Fix --repeat option default from string "1" to int 1 — argparse only
applies type= to CLI-parsed values, not defaults.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
2026-04-29 02:36:05 +00:00
Evgeniy Naydanov
497bd6b6c9 test/pylib/runner: fix disabled file collection
Return a DisabledFile collector instead of an empty list when all modes
are disabled for a file.  Returning an empty list caused subsequent
files to not get their stash items set because file_path was never
removed from REPEATING_FILES.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
2026-04-29 02:36:05 +00:00
Evgeniy Naydanov
05f2c53931 Revert "test.py: fix test collection bug"
This reverts commit 92c09d106d.
2026-04-29 02:35:00 +00:00
Botond Dénes
a7e9c0e6d2 Merge 'test.py: fix test collection bug' from Andrei Chekun
In certain circumstances the current way of collecting tests is error-prone. Collection can stop when the first file is skipped in the mode, leaving the rest of the files given on the CLI uncollected.
Another issue is that if a file is specified twice, once via its directory and once explicitly, an incorrect CppFile ends up in the stash, causing a KeyError.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714

No backport, test framework bug fix only.

Closes scylladb/scylladb#29634

* github.com:scylladb/scylladb:
  test.py: fix framework test
  test.py: fix test collection bug
2026-04-28 11:52:35 +03:00
Andrei Chekun
f2f4915e09 test.py: fix framework test
The framework test was not skipping the unit directory, where the C++
tests are located. After the collection bug fix, the test started to
fail. Ignore this directory as well.
2026-04-25 18:04:55 +02:00
Piotr Szymaniak
d5efd1f676 test/cluster: wait for Alternator readiness in server startup
server_add() only waits for CQL readiness before returning. The
Alternator HTTP port may not be listening yet, causing
ConnectionRefused with Alternator tests.

Extend the ServerUpState enum and startup loop to also check Alternator
port readiness when configured. Whenever one or more Alternator ports
are configured, each is checked for being connectable and queryable,
similar to how CQL ports are probed.
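
A minimal sketch of the "connectable" half of such a probe, assuming a
plain TCP check. wait_for_port is a hypothetical helper, not the actual
startup loop, and it does not cover the "queryable" check.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.1) -> bool:
    """Return True once a TCP connection succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller would only report the server as up once both the CQL check and
this kind of check on each configured Alternator port have passed.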

Fixes SCYLLADB-1701

Closes scylladb/scylladb#29625
2026-04-25 16:35:44 +03:00
Andrei Chekun
92c09d106d test.py: fix test collection bug
Under certain circumstances the current collection logic is error-prone.
Collection can stop when the first file is skipped in the current mode,
leaving the remaining files given on the CLI uncollected.
Another issue: if a file is specified twice, once via its directory and
once explicitly, an incorrect CppFile ends up in the stash, causing a
KeyError.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714
2026-04-24 17:57:11 +02:00
Botond Dénes
b49cf6247f test: fix flaky test_read_repair_with_trace_logging by reading tracing with CL=ALL
Tracing events are written to system_traces.events with CL=ANY, so they
are only guaranteed to be present on the local node of the query
coordinator. Reading them back with the driver default (CL=LOCAL_ONE)
may route the query to a replica that has not yet received all events,
causing the assertion on 'digest mismatch, starting read repair' to fail
intermittently.

Fix execute_with_tracing() to read tracing via the ResponseFuture API
with query_cl=ConsistencyLevel.ALL, so events from all replicas are
merged before the caller inspects them.

Fixes: SCYLLADB-1633

Closes scylladb/scylladb#29566
2026-04-23 16:57:29 +02:00
Piotr Szymaniak
9a86044c63 test: Stop providing alternator-streams experimental flag
Now that alternator-streams is no longer an experimental feature,
stop passing it in test configurations.
2026-04-22 15:25:37 +02:00
Botond Dénes
eb3326b417 Merge 'test.py: migrate all bare skips to typed skip markers' from Artsiom Mishuta
should be merged after #29235

Complete the typed skip markers migration started in the plugin PR.
Every bare `@pytest.mark.skip` decorator and `pytest.skip()` runtime call
across the test suite is replaced with a typed equivalent, making skip
reasons machine-readable in JUnit XML and Allure reports.

**62 files changed** across 8 commits, covering ~127 skip sites in total.

Bare `pytest.skip` provides only a free-text reason string. CI dashboards
(JUnit, Allure) cannot distinguish between a test skipped due to a known
bug, a missing feature, a slow test, or an environment limitation. This
makes it hard to track skip debt, prioritize fixes, or filter dashboards
by skip category.

The typed markers (`skip_bug`, `skip_not_implemented`, `skip_slow`,
`skip_env`) introduced by the `skip_reason_plugin` solve this by embedding
a `skip_type` field into every skip report entry.

| Type | Count | Files | Description |
|------|-------|-------|-------------|
| `skip_bug` | 24 | 16 | Skip reason references a known bug/issue |
| `skip_not_implemented` | 10 | 5 | Feature not yet implemented in Scylla |
| `skip_slow` | 4 | 3 | Test too slow for regular CI runs |
| `skip_not_implemented` (bare) | 2 | 1 | Bare `@pytest.mark.skip` with no reason (COMPACT STORAGE, #3882) |

| Type | Count | Files | Description |
|------|-------|-------|-------------|
| `skip_env` | ~85 | 34 | Feature/config/topology not available at runtime |
| `skip_bug` | 2 | 2 | Known bugs: Streams on tablets (#23838), coroutine task not found (#22501) |

- **Comments**: 7 comments/docstrings across 5 files updated from `pytest.skip()` to `skip()`
- **Plugin hardened**: `warnings.warn()` → `pytest.UsageError` for bare `@pytest.mark.skip` at collection time — bare skips are now a hard error, not a warning
- **Guard tests**: New `test/pylib_test/test_no_bare_skips.py` with 3 tests that prevent regression:
  - AST scan for bare `@pytest.mark.skip` decorators
  - AST scan for bare `pytest.skip()` runtime calls
  - Real `pytest --collect-only` against all Python test directories

Runtime skip sites use the convenience wrappers from `test.pylib.skip_types`:
```python
from test.pylib.skip_types import skip_env
```

Usage:
```python
skip_env("Tablets not enabled")
```

1. **test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs** — 24 decorator sites, 16 files
2. **test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented** — 10 decorator sites, 5 files
3. **test: migrate @pytest.mark.skip to @pytest.mark.skip_slow** — 4 decorator sites, 3 files
4. **test: migrate bare @pytest.mark.skip to skip_not_implemented** — 2 bare decorators, 1 file
5. **test: migrate runtime pytest.skip() to typed skip_env()** — ~85 sites, 34 files
6. **test: migrate runtime pytest.skip() to typed skip_bug()** — 2 sites, 2 files
7. **test: update comments referencing pytest.skip() to skip()** — 7 comments, 5 files
8. **test/pylib: reject bare pytest.mark.skip and add codebase guards** — plugin hardening + 3 guard tests

- All 60 plugin + guard tests pass (`test/pylib_test/`)
- No bare `@pytest.mark.skip` or `pytest.skip()` calls remain in the codebase
- `pytest --collect-only` succeeds across all test directories with the hardened plugin

SCYLLADB-1349

Closes scylladb/scylladb#29305

* github.com:scylladb/scylladb:
  test/alternator: replace bare pytest.skip() with typed skip helpers
  test: migrate new bare skips introduced by upstream after rebase
  test/pylib: reject bare pytest.mark.skip and add codebase guards
  test: update comments referencing pytest.skip() to skip_env()
  test: migrate runtime pytest.skip() to typed skip_bug()
  test: migrate runtime pytest.skip() to typed skip_env()
  test: migrate bare @pytest.mark.skip to skip_not_implemented
  test: migrate @pytest.mark.skip to @pytest.mark.skip_slow
  test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented
  test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs
2026-04-22 15:48:27 +03:00
Artsiom Mishuta
183c6d120e test: exclude pylib_test from default test runs
Add pylib_test to norecursedirs in pytest.ini so it is not collected
during ./test.py or pytest test/ runs, but can still be run directly
via 'pytest test/pylib_test'.

Also fix pytest log cleanup: worker log files (pytest_gw*) were not
being deleted on success because cleanup was restricted to the main
process only. Now each process (main and workers) cleans up its own
log file on success.

Closes scylladb/scylladb#29551
2026-04-22 11:38:40 +02:00
Ernest Zaslavsky
9faaf1f09c test: extract object storage helpers to test/pylib/object_storage.py
Move S3/GCS server classes (S3Server, MinioWrapper, GSFront, GSServer),
factory functions (create_s3_server, create_gs_server), CQL helpers
(format_tuples, keyspace_options), bucket naming (_make_bucket_name),
and the s3_server fixture from test/cluster/object_store/conftest.py
into a shared module at test/pylib/object_storage.py.
The conftest.py is now a thin wrapper that re-exports symbols and
defines only the fixtures specific to the object_store suite
(object_storage, s3_storage).  All external importers are updated.
Old class names (S3_Server, GSServer) are kept as aliases for
backward compatibility.
2026-04-21 19:08:57 +03:00
Ernest Zaslavsky
e175088db5 test: add s3_test_fixture and migrate tests to per-bucket isolation
Add s3_test_fixture, an RAII class that creates a unique S3 bucket
on construction and tears down everything (delete all objects, delete
bucket, close client) on destruction. Bucket names are derived from
the Boost test name, pid, and a counter to guarantee uniqueness
across concurrent test processes. Names are sanitized to comply with
S3 bucket naming rules (lowercase, hyphens, 3-63 chars).
Migrate all S3 tests that create objects to use the fixture, removing
manual bucket name construction, deferred_delete_object cleanup, and
per-test deferred_close calls. The fixture owns the client lifecycle.
Tests with special semaphore requirements (broken semaphore for
fallback test, small semaphore for abort test, 1MiB for memory
test) create the fixture with a separate normal-sized semaphore and
use their own constrained client for the test operation.
The upload_file tests are converted from SEASTAR_TEST_CASE
(coroutine) to SEASTAR_THREAD_TEST_CASE since the fixture requires
thread context for .get() calls.
Broaden the minio policy to allow the test user to create and delete
arbitrary buckets (s3:CreateBucket, s3:DeleteBucket, s3:ListAllMyBuckets
on arn:aws:s3:::*), and operate on objects in any bucket.
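
The bucket naming scheme described above (test name, pid, counter,
sanitized to S3 rules) can be illustrated with the sketch below. The
real fixture is C++; this Python version only shows the idea, and
make_bucket_name is a hypothetical helper, not part of the test suite.

```python
import itertools
import os
import re

_counter = itertools.count()


def make_bucket_name(test_name: str, max_len: int = 63) -> str:
    """Build a per-test bucket name from test name, pid and a counter."""
    suffix = f"-{os.getpid()}-{next(_counter)}"
    # Lowercase, then replace every run of characters outside [a-z0-9]
    # with a single hyphen so the name satisfies S3 naming rules.
    base = re.sub(r"[^a-z0-9]+", "-", test_name.lower()).strip("-") or "test"
    # Truncate only the test-name part so the unique suffix survives.
    base = base[: max_len - len(suffix)].rstrip("-")
    return base + suffix
```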
2026-04-21 19:08:57 +03:00
Dario Mirovic
f77ff28081 test: manager_client: use safe_driver_shutdown for exclusive_clusters
Using cluster.shutdown() is an incorrect way to shut down a Cassandra Cluster.
The correct way is using safe_driver_shutdown.

Fixes SCYLLADB-1434

Closes scylladb/scylladb#29390
2026-04-19 21:31:18 +03:00
Artsiom Mishuta
9c4d3ce097 test/pylib: reject bare pytest.mark.skip and add codebase guards
Harden the skip_reason_plugin to reject bare @pytest.mark.skip at
collection time with pytest.UsageError instead of warnings.warn().

Add test/pylib_test/test_no_bare_skips.py with three guard tests:
- AST scan for bare @pytest.mark.skip decorators
- AST scan for bare pytest.skip() runtime calls
- Real pytest --collect-only against all Python test directories
2026-04-19 17:34:31 +02:00
Artsiom Mishuta
8a80e2c3be test: migrate runtime pytest.skip() to typed skip_env()
Migrate runtime pytest.skip() calls across 34 files to use the typed
skip_env() wrapper from test.pylib.skip_types.

These sites skip at runtime because a required feature, config option,
library version, build mode, or runtime topology is not available.

Also fixes 'raise pytest.skip(...)' in test_audit.py — skip_env()
already raises internally, so the explicit raise was incorrect.

Each file gains one new import:
  from test.pylib.skip_types import skip_env
2026-04-19 11:09:29 +02:00
Botond Dénes
fbcfe3f88f test: use uuid4 for DockerizedServer container names to avoid collisions
Container names were generated as {name}-{pid}-{counter}, where the
counter is a per-process itertools.count. This scheme breaks across CI
runs on the same host: if a prior job was killed abruptly (SIGKILL,
cancellation) its containers are left running since --rm only removes
containers on exit. A subsequent run whose worker inherits the same PID
(common in containerized CI with small PID namespaces) and reaches the
same counter value will collide with the orphaned container.

Replace pid+counter with uuid.uuid4(), which generates a random UUID,
making names unique across processes, hosts, and time without any shared
state or leaking host identifiers.
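
A minimal sketch of the new naming scheme, assuming the name is simply
the service name followed by the UUID (container_name is a hypothetical
helper, not the DockerizedServer code):

```python
import uuid


def container_name(service: str) -> str:
    """Return a container name that needs no pid or counter state."""
    # uuid4() is random, so two calls collide only with negligible
    # probability, regardless of process, host, or time.
    return f"{service}-{uuid.uuid4()}"
```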

Fixes: SCYLLADB-1540

Closes scylladb/scylladb#29509
2026-04-17 11:56:51 +02:00
Avi Kivity
cad3c0de94 test: write minio log to testlog dir for Jenkins artifact collection
Write the MinIO server log directly to tempdir_base (testlog/<arch>/)
instead of the per-server temp directory that gets destroyed on
shutdown. This preserves the log for Jenkins artifact collection,
helping debug S3-related flaky test failures like the
stcs_reshape_overlapping_s3_test hang (SCYLLADB-1481).

Closes scylladb/scylladb#29458
2026-04-17 12:51:55 +03:00
Botond Dénes
facb50cbf9 Merge 'test.py: refactor test.py' from Andrei Chekun
With the latest changes, a lot of the code in test.py is redundant. This PR cleans that code up.
It also narrows the use of dynamic scope for fixtures to test/alternator and test/cqlpy. All other tests will have module scope by default.
test.py will mostly be a wrapper around pytest for CI use. For now, test.py still has the important job of calculating the number of threads to start pytest with, which is not possible to do in pytest itself.

No backport needed, framework enhancement only.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-666

Closes scylladb/scylladb#28852

* github.com:scylladb/scylladb:
  test.py: remove testpy_test_fixture_scope
  test.py: add logger for 3rd party service
  test.py: delete dead code in test.py
2026-04-17 12:51:14 +03:00
Andrei Chekun
745debe9ec test.py: remove testpy_test_fixture_scope
With the migration to pytest this fixture is useless. Remove it and set
the fixture scope to module for most of the tests.
Add a dynamic_scope function to support running alternator fixtures in
session scope while Test and TestSuite are not yet deleted. This is for
the migration period; later on this function should be deleted.
2026-04-16 22:08:33 +02:00
Andrei Chekun
21addb2173 test.py: add logger for 3rd party service
With environment preparation and the startup of 3rd party services
migrated to pytest, these services output their logs to the terminal.
This PR binds them to their own log file to avoid polluting the terminal.
2026-04-16 22:08:33 +02:00
Andrei Chekun
13770ab394 test.py: delete dead code in test.py
With the latest changes, a lot of the code in test.py is redundant.
This PR cleans that code up.
Changes in other files are related to removing code from test.py,
especially the redundant --test-py-init parameter and moving
prepare_environment to pytest itself.
2026-04-16 22:08:31 +02:00
Avi Kivity
999e108139 Merge 'test: lib: fix broken retry in start_docker_service' from Dario Mirovic
The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism.

Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries.

Fixes SCYLLADB-1542

This is a CI stability issue and should be backported.

Closes scylladb/scylladb#29504

* github.com:scylladb/scylladb:
  test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
  test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
  test: fix proc_utils.cc formatting from previous commit
  test: lib: use unique container name per retry attempt
  test: lib: fix broken retry in start_docker_service
2026-04-16 21:48:25 +03:00
Botond Dénes
c355df4461 Merge 'test: Lower default log level from DEBUG to INFO' from Artsiom Mishuta
1. test.py — Removed --log-level=DEBUG flag from pytest args
2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments

Plus a minor fix:

[test/pylib: save logs on success only during teardown phase](0ede308a04)
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown) in 3 files. Restrict it to
only the teardown phase, which contains all 3 in case of test success,
to avoid redundant log entries.

Closes scylladb/scylladb#29086

* github.com:scylladb/scylladb:
  test/pylib: save logs on success only during teardown phase
  test: Lower default log level from DEBUG to INFO
2026-04-16 12:46:11 +03:00
Dario Mirovic
50e498ac0d test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
Fix assorted typos in comments, strings, and identifiers:
- path_preprend -> path_prepend (proc_utils.hh, proc_utils.cc)
- laúnch -> launch (proc_utils.cc)
- hand/fail -> hang/fail (dockerized_service.py)
- inconvinient -> inconvenient (dockerized_service.py)
- priviledges -> privileges (gcs_fixture.hh)
- remove double semicolon (gcs_fixture.cc)

Refs SCYLLADB-1542
2026-04-16 10:58:55 +02:00
Botond Dénes
00d8470554 Merge 'test: filter benign shutdown errors in tests that grep logs directly' from Marcin Maliszkiewicz
Tests that call grep_for_errors() directly and assert no errors
can fail spuriously due to benign RPC errors during graceful
shutdown (e.g. "connection dropped: Semaphore broken"), which
are already filtered by the after_test hook via filter_errors().
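
The filtering idea can be sketched as follows. This is an illustration
only: drop_benign and BENIGN are hypothetical names, not the real
filter_errors() signature, and the rule that a group is dropped when any
of its lines matches is an assumption.

```python
import re

# Example pattern taken from the message above; a real list would be longer.
BENIGN = [re.compile(r"connection dropped: Semaphore broken")]


def drop_benign(errors: list) -> list:
    """Remove benign entries from a list of lines or a list of line groups."""
    def is_benign(entry) -> bool:
        # Treat a single string as a one-line group.
        lines = [entry] if isinstance(entry, str) else entry
        return any(p.search(line) for p in BENIGN for line in lines)

    return [entry for entry in errors if not is_benign(entry)]
```

A test that greps the logs itself would then assert on the filtered
result instead of on the raw matches.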

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464
Backport: no, tests fix (we may decide to backport later if it occurs on release branches)

Closes scylladb/scylladb#29463

* github.com:scylladb/scylladb:
  test: filter benign errors in tests that grep logs during shutdown
  test: filter_errors: support list[list[str]] error groups
2026-04-15 14:40:15 +03:00