scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 07:42:16 +00:00

Author	SHA1	Message	Date
Avi Kivity	85374207ca	Merge 'test.py: rewrite gather metrics' from Andrei Chekun Rewrite metrics gathering so that metrics for Python tests are gathered correctly. Python tests require different handling of cgroup metrics gathering than C++ tests. pytest does not execute each Python test in a separate process, so we can't put a single test into a cgroup and get its metrics. The idea is to put the whole pytest process into the cgroup and get the metrics from it. This works because pytest runs the threads as completely separate processes, and inside each thread it runs the tests consecutively. Additionally, to simplify things, the system resource monitor is moved to the pytest main thread. Change the behavior of metrics gathering: from this PR on, some data is collected even with `--no-gather-metrics`. This data does not need any configuration and is just test metadata: test name, execution time, and test status. When `--gather-metrics` is provided, the memory data gathered from the cgroups for each specific test and the system CPU/RAM utilization are written as well. Backport is not needed, because it's a framework change only. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-575 ~Blocked by: https://github.com/scylladb/scylladb/pull/27618~ Now Python tests, together with their own Scylla instances, have metrics gathered from the cgroups as well.
```bash
$ sqlite3 --header testlog/sqlite_af8cb.db 'select tst.path, tst.file, tst.test_name, user_sec,system_sec,usage_sec,memory_peak /1024/1024 as memory_peak_mb from test_metrics join tests as tst where tst.id = test_metrics.test_id order by memory_peak_mb desc limit 10;'
path|file|test_name|user_sec|system_sec|usage_sec|memory_peak_mb
test/cluster/dtest|limits_test.py|test_max_cells|489.468174|27.6638949999999|517.132069|4241
test/cluster/dtest|rebuild_test.py|test_rebuild_stream_abort_repro|93.6400869999998|28.9843249999999|122.624412|4241
test/cluster/dtest|schema_management_test.py|test_prepared_statements_work_after_node_restart_after_altering_schema_without_changing_columns|6.8933219999999|3.63569899999993|10.5290209999994|4241
test/cluster/dtest|schema_management_test.py|test_dropping_keyspace_with_many_columns|1.31770999999981|0.754742999999962|2.07245299999977|4241
test/cluster/dtest|schema_management_test.py|test_multiple_create_table_in_parallel|5.48435300000028|2.72915200000011|8.21350499999971|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[write]|80.687293|18.5562|99.2434920000005|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[read]|79.1984790000001|18.0969829999999|97.2954609999997|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[mixed]|85.332915|18.9321070000001|104.265022|4241
test/cluster/dtest|schema_management_test.py|test_update_schema_while_node_is_killed[create_table]|10.5875369999999|5.67954400000008|16.267081|4241
test/cluster/dtest|schema_management_test.py|test_update_schema_while_node_is_killed[alter_table]|11.3801709999998|6.54689099999996|17.9270630000001|4241
```
Closes scylladb/scylladb#28206 * github.com:scylladb/scylladb: test.py: Add host hardware info test.py: rewrite resource gather	2026-05-18 20:35:14 +03:00
Nadav Har'El	5dbd0d71d5	Merge 'test/pylib: test/pylib: Cached Scylla package resolver' from Alex Dathskovsky This series adds a shared helper for resolving, downloading, unpacking, and installing Scylla relocatable packages for test.py. The first patch introduces `version_fetch_utils`, which can resolve public Scylla artifacts from the downloads bucket by version, architecture, package variant, or direct URL. It also centralizes the local cache/install flow using retry handling, marker files, and file locking so repeated or concurrent test runs can safely reuse an existing installation. The second patch wires this helper into the existing Scylla executable setup paths. This removes the hard-coded 2025.1 package URL and replaces the local download/unpack/install logic in `scylla_cluster.py` with the shared resolver. It also makes `--exe-url` use the same cached installer path. Together, these changes make upgrade-test executable selection less brittle, avoid duplicated install logic, and provide a reusable foundation for fetching other Scylla versions in test.py. Closes scylladb/scylladb#29855 * github.com:scylladb/scylladb: test/pylib: use version fetcher for Scylla executable setup test/pylib: add cached Scylla package installer	2026-05-18 16:32:47 +03:00
Andrei Chekun	a03c4fd754	test.py: Add host hardware info Gather additional information about the running host for better metrics analysis	2026-05-18 12:23:40 +02:00
Andrei Chekun	6414c48fc2	test.py: rewrite resource gather Python tests require different handling of cgroup metrics gathering than C++ tests. pytest does not execute each Python test in a separate process, so we can't put a single test into a cgroup and get its metrics. The idea is to put the whole pytest process into the cgroup and get the metrics from it. This works because pytest runs the threads as completely separate processes, and inside each thread it runs the tests consecutively. Additionally, to simplify things, the system resource monitor is moved to the pytest main thread.	2026-05-18 12:23:40 +02:00
Evgeniy Naydanov	39a10d6d67	test: remove dead suite subclasses and legacy execution pipeline After all test suites migrated to test_config.yaml with type: Python, the specialized suite classes (Topology, CQLApproval, Run, Tool) and the legacy execution pipeline (find_tests, run_test, TestSuite.run, Test.run) became unreachable. Remove all this dead code. Deleted files: - suite/topology.py, suite/cql_approval.py, suite/run.py, suite/tool.py Simplified: - base.py: remove run_test(), read_log(), TestSuite.run(), add_test_list(), build_test_list(), all_tests(), test_count(), SUITE_CONFIG_FILENAME, disabled/flaky test tracking, and dead Test attributes (args, core_args, valid_exit_codes, allure_dir, is_flaky, is_cancelled, etc.) - python.py: remove PythonTestSuite.run(), PythonTest.run(), _prepare_pytest_params(), pattern, test_file_ext, xmlout, server_log, scylla_env setup, and shlex import. Simplify run_ctx() to take no parameters. - runner.py: remove --scylla-log-filename option, print_scylla_log_filename fixture, SUITE_CONFIG_FILENAME import, and suite.yaml probe in TestSuiteConfig.from_pytest_node(). - __init__.py: remove re-exports of deleted classes. - test_config.yaml: Topology -> Python, Approval -> Python. - conftest files: run_ctx(options=...) -> run_ctx(). - docs/dev/testing.md: update to reflect current pytest-based architecture, log paths, and removed features. Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com> Closes scylladb/scylladb#29613	2026-05-17 22:16:31 +03:00
Alex	176dbf12c2	test/pylib: use version fetcher for Scylla executable setup Replace the hard-coded 2025.1 archive download and local install logic with the shared Scylla package fetch/install helper. This keeps upgrade-test executable resolution and `--exe-url` handling on the same cached installer path.	2026-05-17 17:43:56 +03:00
Alex	1efe9a7243	test/pylib: add cached Scylla package installer Add utilities to resolve relocatable Scylla artifacts from the public downloads bucket by version, architecture, package variant, or direct URL. Download, unpack, and install the selected archive into the test.py cache with retry handling, marker files, and file locking so repeated or concurrent test runs can reuse the same installation safely.	2026-05-17 17:43:56 +03:00
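The marker-file and file-locking flow described in the commit above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in `version_fetch_utils`: the function name `install_once`, the `.installed` and `.lock` file names, and the `install` callback are all hypothetical, and `fcntl` limits the sketch to Unix-like systems.

```python
import fcntl
import tempfile
from pathlib import Path


def install_once(cache_dir: Path, name: str, install) -> Path:
    """Run install(target) at most once per cache entry, even across processes.

    Hypothetical helper: the marker and lock file names are assumptions.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    marker = cache_dir / f"{name}.installed"
    lock_path = cache_dir / f"{name}.lock"

    with open(lock_path, "w") as lock_file:
        # Exclusive lock: a concurrent caller blocks here until we are done.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not marker.exists():
                target.mkdir(exist_ok=True)
                install(target)
                # The marker is written last, so it only ever marks a
                # completed installation.
                marker.touch()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target


# Usage: the second call finds the marker and reuses the installation.
calls = []
with tempfile.TemporaryDirectory() as tmp:
    first = install_once(Path(tmp), "scylla-pkg", calls.append)
    second = install_once(Path(tmp), "scylla-pkg", calls.append)
```

Writing the marker only after `install()` returns means an interrupted install is retried on the next run instead of being mistaken for a usable one.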
Andrei Chekun	a09fdfc46a	test.py: fix issue that C++ tests' logs are deleted Skip deletion of the log file when a C++ test fails. Closes scylladb/scylladb#29859	2026-05-13 21:31:03 +03:00
Avi Kivity	f2ab911a46	Merge 'test/cluster: fix server-starting functions to wait for all ports' from Nadav Har'El This series fixes a recurring source of flaky tests in the cluster test suite. When a test configures Scylla to listen on non-default ports (e.g. a custom Alternator port, proxy-protocol port or shard-aware port), server_add() and server_start() would declare the server ready by polling the hardcoded standard CQL and Alternator ports. Those ports can become available slightly before the custom ports finish binding, so the test could start using the custom port before it was open, causing intermittent failures. The fix for each affected test was to pass `expected_server_up_state=ServerUpState.SERVING` explicitly, which waits for Scylla's sd_notify("STATUS=serving") signal instead. That signal is sent only after all configured listeners are fully open, so it is always the right readiness signal regardless of the port configuration. This workaround was applied again in PR #29737 and will keep being needed for every new test that uses a non-default port. This series makes ServerUpState.SERVING the default at every level of the server start/add call stack so no test needs to remember it: * Make server_add(), servers_add(), server_start() et al. all default to ServerUpState.SERVING. * Document that server_add/server_start wait for all ports to be ready, so future test authors understand what the functions guarantee. * Remove now-redundant expected_server_up_state=SERVING from existing tests. * A small optimization: Fix check_serving_notification() returning False on first completion. When the sd_notify future completed, the function correctly updated _received_serving but still returned False, wasting one 100ms polling interval. Return self._received_serving directly. 
Closes scylladb/scylladb#29758 * github.com:scylladb/scylladb: test/pylib: fix missing protocol_version=4 on control_cluster scylla_cluster: guard poll_status() set_result() calls against cancelled future test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING test/cluster: fix check_serving_notification() inefficiency test/cluster: remove now-redundant expected_server_up_state=SERVING test/cluster: document that add/start waits for all ports to be ready test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING test/cluster: fix server_add/server_start hanging when starting in maintenance mode main: notify "entering maintenance mode" after the maintenance CQL server is ready test/cluster: make server_start() default to ServerUpState.SERVING test/cluster: make server_add() default to ServerUpState.SERVING	2026-05-13 21:23:18 +03:00
Piotr Smaron	0fcae72530	test: bootstrap tombstone gc repair cluster sequentially Avoid concurrent topology changes in the tombstone GC repair setup, where debug-mode nodes running hinted handoff and materialized view startup work can time out while applying Raft entries before the test starts. Keep the sequential path opt-in so unrelated repair tests still exercise concurrent bootstrap behavior. Closes scylladb/scylladb#29829	2026-05-13 13:58:44 +03:00
Yaniv Michael Kaul	5d6f160129	test: update get_scylla_2025_1_executable() to use 2025.1.12 Update the hardcoded 2025.1.0 binary URL to the latest 2025.1.12 release for upgrade tests. The 2025.1.12 binary now supports and enforces the rf_rack_valid_keyspaces option which the test harness enables by default. Since test_sstable_compression_dictionaries_upgrade creates a 2-node cluster in a single rack with RF=2, it violates the constraint. Disable the option explicitly for this test. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#29714	2026-05-12 23:20:55 +02:00
Botond Dénes	e95eb21a16	Merge 'Tablet-aware restore' from Pavel Emelyanov The mechanics of the restore is like this - A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet - The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas - Each replica handles the RPC verb by - Reading the snapshot_sstables table - Filtering the read sstable infos against current node and tablet being handled - Downloading and attaching the filtered sstables This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code. This is first step towards SCYLLADB-197 and lacks many things. 
In particular - the API only works for single-DC cluster - the caller needs to "lock" tablet boundaries with min/max tablet count - not abortable - no progress tracking - sub-optimal (re-kicking API on restore will re-download everything again) - not re-attachable (if API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via other node) - nodes download sstables in maintenance/streaming sched group (should be moved to maintenance/backup) Other follow-up items: - have an actual swagger object specification for `backup_location` Closes #28436 Closes #28657 Closes #28773 Closes scylladb/scylladb#28763 * github.com:scylladb/scylladb: docs: Update topology_over_raft.md with `restore` transition kind test: Add test for backup vs migration race test: Restore resilience test sstables_loader: Fail tablet-restore task if not all sstables were downloaded sstables_loader: mark sstables as downloaded after attaching sstables_loader: return shared_sstable from attach_sstable db: add update_sstable_download_status method db: add downloaded column to snapshot_sstables db: extract snapshot_sstables TTL into class constant test: Add a test for tablet-aware restore tablets: Implement tablet-aware cluster-wide restore messaging: Add RESTORE_TABLET RPC verb sstables_loader: Add method to download and attach sstables for a tablet tablets: Add restore_config to tablet_transition_info sstables_loader: Add restore_tablets task skeleton test: Add rest_client helper to kick newly introduced API endpoint api: Add /storage_service/tablets/restore endpoint skeleton sstables_loader: Add keyspace and table arguments to manfiest loading helper sstables_loader_helpers: just reformat the code sstables_loader_helpers: generalize argument and variable names sstables_loader_helpers: generalize get_sstables_for_tablet sstables_loader_helpers: add token getters for tablet filtering sstables_loader_helpers: remove underscores from struct members sstables_loader: move 
download_sstable and get_sstables_for_tablet sstables_loader: extract single-tablet SST filtering sstables_loader: make download_sstable static sstables_loader: fix formating of the new `download_sstable` function sstables_loader: extract single SST download into a function sstables_loader: add shard_id to minimal_sst_info sstables_loader: add function for parsing backup manifests split utility functions for creating test data from database_test export make_storage_options_config from lib/test_services rjson: Add helpers for conversions to dht::token and sstable_id Add system_distributed_keyspace.snapshot_sstables add get_system_distributed_keyspace to cql_test_env code: Add system_distributed_keyspace dependency to sstables_loader storage_service: Export export handle_raft_rpc() helper storage_service: Export do_tablet_operation() storage_service: Split transit_tablet() into two tablets: Add braces around tablet_transition_kind::repair switch	2026-05-12 16:24:13 +03:00
Pavel Emelyanov	150345cc52	Merge 'test: per-bucket isolation for S3/GCS object storage tests' from Ernest Zaslavsky This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions. New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully. A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion. A `client::make` overload that accepts a custom `retry_strategy`, used in tests with a fast 1ms retry delay instead of exponential backoff, significantly reducing test runtime for transient errors during bucket lifecycle operations. Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness. Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. 
Old class names are preserved as aliases for backward compatibility.

| Test Name | Execution time with test-specific retry strategy (ms) | Original execution time (ms) | Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy | 19,261 | 61,395 | −42,134 | 3.2× |
| test_client_upload_file_multi_part_without_remainder_proxy | 16,901 | 53,688 | −36,787 | 3.2× |
| test_client_upload_file_single_part_proxy | 3,478 | 6,789 | −3,311 | 2.0× |
| test_client_multipart_copy_upload_proxy | 1,303 | 1,619 | −316 | 1.2× |
| test_client_put_get_object_proxy | 150 | 365 | −215 | 2.4× |
| test_client_readable_file_stream_proxy | 125 | 327 | −202 | 2.6× |
| test_small_object_copy_proxy | 205 | 389 | −184 | 1.9× |
| test_client_put_get_tagging_proxy | 181 | 350 | −169 | 1.9× |
| test_client_multipart_upload_proxy | 1,252 | 1,416 | −164 | 1.1× |
| test_client_list_objects_proxy | 729 | 881 | −152 | 1.2× |
| test_chunked_download_data_source_with_delays_proxy | 830 | 960 | −130 | 1.2× |
| test_client_readable_file_proxy | 148 | 279 | −131 | 1.9× |
| test_client_upload_file_multi_part_with_remainder_minio | 3,358 | 3,170 | +188 | 0.9× |
| test_client_upload_file_multi_part_without_remainder_minio | 3,131 | 2,929 | +202 | 0.9× |
| test_client_upload_file_single_part_minio | 519 | 421 | +98 | 0.8× |
| test_download_data_source_proxy | 180 | 237 | −57 | 1.3× |
| test_client_list_objects_incomplete_proxy | 590 | 641 | −51 | 1.1× |
| test_large_object_copy_proxy | 952 | 991 | −39 | 1.0× |
| test_client_multipart_upload_fallback_proxy | 148 | 185 | −37 | 1.3× |
| test_client_multipart_copy_upload_minio | 641 | 674 | −33 | 1.1× |

No backport needed: this is a test infrastructure improvement with no production code impact beyond 
the new `s3::client` methods. Closes scylladb/scylladb#29508 * github.com:scylladb/scylladb: test: extract object storage helpers to test/pylib/object_storage.py test: add per-test bucket isolation to object_store fixtures s3: add client::make overload with custom retry strategy test: add s3_test_fixture and migrate tests to per-bucket isolation s3: add create_bucket and delete_bucket to client	2026-05-12 12:38:24 +03:00
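The per-test bucket naming described in the merge above (a name sanitized from the pytest node name plus a short UUID suffix) could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual `create_test_bucket()` code: the function name `make_bucket_name`, the 8-character suffix, and the `test` fallback are hypothetical. The only external fact relied on is the S3 naming rule of 3 to 63 characters drawn from lowercase letters, digits, and hyphens (dots are also allowed but avoided here).

```python
import re
import uuid


def make_bucket_name(node_name: str, suffix_len: int = 8) -> str:
    """Derive an S3-compatible, unique bucket name from a pytest node name.

    Hypothetical helper: name, suffix length, and fallback are assumptions.
    """
    # Collapse every run of characters outside [a-z0-9] into one hyphen.
    base = re.sub(r"[^a-z0-9]+", "-", node_name.lower()).strip("-")
    suffix = uuid.uuid4().hex[:suffix_len]
    # Leave room for "-<suffix>" within the 63-character limit.
    max_base = 63 - len(suffix) - 1
    base = base[:max_base].rstrip("-") or "test"
    return f"{base}-{suffix}"


# Usage: parametrized test ids contain brackets, which get sanitized away.
name_a = make_bucket_name("test_alter_table[write]")
name_b = make_bucket_name("test_alter_table[write]")
long_name = make_bucket_name("x" * 200)
```

The random suffix is what keeps two runs of the same test, possibly in different `test.py` processes, from colliding on one bucket.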
Pavel Emelyanov	dcd490666b	test: Add rest_client helper to kick newly introduced API endpoint Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 10:40:23 +03:00
Nadav Har'El	df8c9b17b8	Merge 'alternator: Graduate Alternator Streams from experimental' from Piotr Szymaniak As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental. So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature. Finally, stop providing the config flag in test configs. Fixes SCYLLADB-1680 Fixes #16367 The documentation work tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462 still remains. This PR needs to hit 2026.2, so we'd need to backport it only if 2026.2 branches before the PR is merged to `master`. Closes scylladb/scylladb#29604 * github.com:scylladb/scylladb: test: Stop providing alternator-streams experimental flag alternator: Graduate Alternator Streams from experimental	2026-05-10 22:10:03 +03:00
Nadav Har'El	19555bc2cf	test/pylib: fix missing protocol_version=4 on control_cluster get_cql_up_state() creates two Cluster instances: a short-lived one used to probe CQL readiness, and a persistent control_cluster kept alive for the lifetime of the server. The probe cluster was created with protocol_version=4 (the highest version Scylla supports), but the control_cluster was not, causing the driver to do a superfluous version-negotiation round-trip on every server start. Fix by extracting the shared constructor arguments into a cluster_kwargs dict and using **cluster_kwargs for both calls, so the two Cluster instances are created with identical parameters. This deduplication can help avoid more instances of this bug, where someone modifies the options in one call but forgets to change the options in the other call. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 20:57:49 +03:00
Nadav Har'El	f977621e40	scylla_cluster: guard poll_status() set_result() calls against cancelled future The poll_status() background thread resolves `serving_signal` by scheduling `f.set_result(...)` on the event loop via `call_soon_threadsafe`. In parallel, `_cleanup_notify_socket()` can cancel `serving_signal` at any time - for example when a server fails to start and `stop()` -> `shutdown_control_connection()` is called while the thread is still blocked in `recv()` (the socket close unblocks the `recv()` with an exception, sending it down the error path). When that race fires the scheduled `f.set_result(...)` callback runs after `cancel()` has already put the future into the cancelled state, raising `asyncio.InvalidStateError: Result is not allowed in cancelled state`. This bug predates the SERVING work, but the original CQL_ALTERNATOR_QUERIED default meant the notify socket was torn down quickly most of the time, making the window very narrow. Now that SERVING is the default the socket stays open throughout the full startup wait, widening the race significantly. Fix: replace every bare `f.set_result(v)` call with `lambda: f.done() or f.set_result(v)`, which is a no-op when the future is already done (cancelled, or resolved by another path). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 20:34:58 +03:00
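The race and the guard described in the commit above can be reproduced in isolation. The following is a minimal sketch using only `asyncio`. The helper name `guarded_set_result` is hypothetical, while the expression `f.done() or f.set_result(v)` is the one quoted in the commit message.

```python
import asyncio


def guarded_set_result(future: asyncio.Future, value) -> None:
    """Resolve the future unless it is already done (cancelled or resolved)."""
    # `or` short-circuits when done() is True, so set_result() is skipped.
    future.done() or future.set_result(value)


async def demo() -> dict:
    loop = asyncio.get_running_loop()
    outcome = {}

    # Case 1: a bare set_result() on a cancelled future raises.
    cancelled = loop.create_future()
    cancelled.cancel()
    try:
        cancelled.set_result(True)
        outcome["bare_raises"] = False
    except asyncio.InvalidStateError:
        outcome["bare_raises"] = True

    # Case 2: the guarded call is a no-op on the same cancelled future.
    guarded_set_result(cancelled, True)
    outcome["still_cancelled"] = cancelled.cancelled()

    # Case 3: a pending future is still resolved normally when the guarded
    # callback is scheduled the way a background thread would schedule it.
    pending = loop.create_future()
    loop.call_soon_threadsafe(lambda: guarded_set_result(pending, True))
    outcome["pending_result"] = await pending
    return outcome


RESULT = asyncio.run(demo())
```

Because the check and the `set_result()` call both run inside one event-loop callback, no cancellation can slip in between them.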
Nadav Har'El	ff33440c6c	test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING With ServerUpState.SERVING now the default, server_add() and server_start() wait for sd_notify readiness after CQL is already up. During that window the startup polling loop was calling get_cql_alternator_up_state() on every iteration (every 100ms). Each successful call recreated self.control_cluster and self.control_connection without closing the previous ones, leaking driver connections and adding unnecessary CQL load to a node that was already known to be queryable. Fix in two places: - Startup loop: skip the get_cql_alternator_up_state() call once server_up_state has reached CQL_ALTERNATOR_QUERIED. After that point only the cheap non-blocking check_serving_notification() is needed. - get_cql_up_state(): guard control_cluster/control_connection creation with `if self.control_connection is None` so the persistent driver connection is only established once, even if the function is called multiple times. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 20:29:18 +03:00
Nadav Har'El	417b4e0765	test/cluster: fix check_serving_notification() inefficiency When the sd_notify future completed, check_serving_notification() correctly updated _received_serving to True but still returned False on that same call. The SERVING state was only recognized on the next polling iteration, 100ms later, for no reason. Return self._received_serving instead of False after updating it.	2026-05-05 18:56:37 +03:00
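The one-iteration delay fixed by the commit above is easy to see in a reduced model. This sketch is not the `scylla_cluster.py` code: the class `ServingWatcher` and the attribute `_serving_signal` are hypothetical, and a `concurrent.futures.Future` stands in for the real sd_notify future so the example runs without an event loop. Only `check_serving_notification()` and `_received_serving` are names taken from the commit message.

```python
from concurrent.futures import Future


class ServingWatcher:
    """Hypothetical stand-in for the object that tracks the sd_notify future."""

    def __init__(self, serving_signal: Future) -> None:
        self._serving_signal = serving_signal
        self._received_serving = False

    def check_serving_notification(self) -> bool:
        # Non-blocking: only read the future once it has completed normally.
        signal = self._serving_signal
        if not self._received_serving and signal.done() and not signal.cancelled():
            self._received_serving = bool(signal.result())
        # Returning the flag instead of a literal False reports SERVING on
        # the same call that observes the completed future.
        return self._received_serving


# Usage: the first check after completion already returns True.
signal = Future()
watcher = ServingWatcher(signal)
before = watcher.check_serving_notification()
signal.set_result(True)
after = watcher.check_serving_notification()
```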
Nadav Har'El	3734afe193	test/cluster: document that add/start waits for all ports to be ready Add docstrings to server_add(), server_start(), and servers_add() explaining that they wait for ServerUpState.SERVING before returning, which means Scylla has finished listening on all configured ports (including non-default ones). Note that server_add() and server_start() accept expected_server_up_state to return earlier if needed, while servers_add() always waits for SERVING. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:56:32 +03:00
Nadav Har'El	90eef72794	test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING ScyllaServer.install_and_start() and ScyllaServer.start() still had ServerUpState.CQL_ALTERNATOR_QUERIED as their default for expected_server_up_state. In practice these defaults are never reached - both call sites in ScyllaCluster always pass the value explicitly, forwarding it from the higher-level add_server() and server_start() whose defaults were already fixed. Update them to SERVING anyway for consistency, so that the low-level methods agree with the policy established at the higher layers and won't silently revert to the wrong behavior if a new call site is added without an explicit argument. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:51:19 +03:00
Nadav Har'El	af03f0e8c4	test/cluster: fix server_add/server_start hanging when starting in maintenance mode When Scylla starts in maintenance mode it sends sd_notify("STATUS=entering maintenance mode") instead of sd_notify("STATUS=serving"), and does not open the standard CQL port. This caused two independent bugs after the default was changed to ServerUpState.SERVING: 1. poll_status() resolved serving_signal to False on the maintenance notification, so check_serving_notification() would never return True, and start() would time out waiting for SERVING. 2. The readiness check in start() was guarded by `server_up_state >= CQL_ALTERNATOR_QUERIED`, which is never reached in maintenance mode (the standard CQL port is not open). Even if bug 1 were fixed, SERVING would never be recognized. Fix both: - Treat STATUS=entering maintenance mode as a successful readiness signal in poll_status(), resolving serving_signal to True just like STATUS=serving. Both mean "all configured ports are now open". - Remove the CQL_ALTERNATOR_QUERIED precondition from the check_serving_notification() call in start(). The sd_notify signal is authoritative: Scylla sends it only when fully ready, regardless of which ports it opened. No CQL precondition is needed. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:51:18 +03:00
Nadav Har'El	e014521565	test/cluster: make server_start() default to ServerUpState.SERVING For the same reason server_add() was changed to default to SERVING (see previous commit), server_start() had the same bug: after restarting a node that listens on non-default ports, the polling of the hardcoded CQL/Alternator ports could succeed before the custom ports were ready, causing intermittent failures. Apply the same fix to server_start() in manager_client.py, ScyllaCluster.server_start(), and the _cluster_server_start HTTP handler. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:18:32 +03:00
Nadav Har'El	f91525c5df	test/cluster: make server_add() default to ServerUpState.SERVING server_add() was defaulting to ServerUpState.CQL_ALTERNATOR_QUERIED, which polls the standard CQL and Alternator ports to determine when the server is ready. This is wrong when a test configures Scylla to listen on non-default ports: the polling succeeds on the default ports while the custom ports may not yet be ready, making such tests intermittently flaky. The correct behavior is ServerUpState.SERVING, which waits for Scylla's sd_notify("READY=1") signal. This signal is sent only after all configured listeners — including custom ports — are fully open, so it is the right readiness signal regardless of the port configuration. Up to now, the fix for each affected test was to pass expected_server_up_state=ServerUpState.SERVING explicitly once the flakiness was noticed (e.g. #29737). Change the default so that all future tests get the correct behavior automatically. Changed in manager_client.server_add(), ScyllaCluster.add_server(), and the _cluster_server_add HTTP handler. The multi-server servers_add() path already inherits the new default through add_server(). Fixes SCYLLADB-1822 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:18:32 +03:00
Evgeniy Naydanov	96d3f13245	test: add --keep-duplicates and assign RUN_ID via shared cache Add --keep-duplicates CLI argument to bypass deduplication and forward to pytest, allowing duplicate test file arguments to be collected multiple times. Move RUN_ID assignment from pytest_collect_file to modify_pytest_item. All File collectors for the same source file share a single run_ids dict (via RUN_ID_CACHE stash key), so items from duplicate collection arguments (e.g. with --keep-duplicates) automatically get unique IDs. Remove CppFile.run_id cached_property — CppTestCase now reads RUN_ID from its own item stash, which is set during modify_pytest_item. Fix --repeat option default from string "1" to int 1 — argparse only applies type= to CLI-parsed values, not defaults. Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>	2026-04-29 02:36:05 +00:00
Evgeniy Naydanov	497bd6b6c9	test/pylib/runner: fix disabled file collection Return a DisabledFile collector instead of an empty list when all modes are disabled for a file. Returning an empty list caused subsequent files to not get their stash items set because file_path was never removed from REPEATING_FILES. Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>	2026-04-29 02:36:05 +00:00
Evgeniy Naydanov	05f2c53931	Revert "test.py: fix test collection bug" This reverts commit `92c09d106d`.	2026-04-29 02:35:00 +00:00
Botond Dénes	a7e9c0e6d2	Merge 'test.py: fix test collection bug' from Andrei Chekun In certain circumstances the current way of collecting tests is error-prone. Collection can stop when the first file is skipped in the mode, leaving the rest of the files given on the CLI uncollected. Another issue is that if a file is specified twice, via its directory and explicitly as a file, an incorrect CppFile ends up in the stash, causing a KeyError. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714 No backport, test framework bug fix only. Closes scylladb/scylladb#29634 * github.com:scylladb/scylladb: test.py: fix framework test test.py: fix test collection bug	2026-04-28 11:52:35 +03:00
Andrei Chekun	f2f4915e09	test.py: fix framework test The framework test was not skipping the unit directory where the C++ tests are located. With the collection bug fixed, the test started to fail. Ignore this directory as well.	2026-04-25 18:04:55 +02:00
Piotr Szymaniak	d5efd1f676	test/cluster: wait for Alternator readiness in server startup server_add() only waits for CQL readiness before returning. The Alternator HTTP port may not be listening yet, causing ConnectionRefused with Alternator tests. Extend the ServerUpState enum and startup loop to also check Alternator port readiness when configured. Whenever Alternator port(s) is/are configured, each is verified if connectable and queryable, similar to how CQL ports are probed. Fixes SCYLLADB-1701 Closes scylladb/scylladb#29625	2026-04-25 16:35:44 +03:00
Andrei Chekun	92c09d106d	test.py: fix test collection bug In certain circumstances the current way of collecting can be error-prone. Collection can stop when the first file is skipped in the mode, leaving the rest of the files given on the CLI uncollected. Another issue is that if a file is specified twice, once via its directory and once explicitly, it produces an incorrect CppFile in the stash, causing a KeyError. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714	2026-04-24 17:57:11 +02:00
Botond Dénes	b49cf6247f	test: fix flaky test_read_repair_with_trace_logging by reading tracing with CL=ALL Tracing events are written to system_traces.events with CL=ANY, so they are only guaranteed to be present on the local node of the query coordinator. Reading them back with the driver default (CL=LOCAL_ONE) may route the query to a replica that has not yet received all events, causing the assertion on 'digest mismatch, starting read repair' to fail intermittently. Fix execute_with_tracing() to read tracing via the ResponseFuture API with query_cl=ConsistencyLevel.ALL, so events from all replicas are merged before the caller inspects them. Fixes: SCYLLADB-1633 Closes scylladb/scylladb#29566	2026-04-23 16:57:29 +02:00
Piotr Szymaniak	9a86044c63	test: Stop providing alternator-streams experimental flag Now that alternator-streams is no longer an experimental feature, stop passing it in test configurations.	2026-04-22 15:25:37 +02:00
Botond Dénes	eb3326b417	Merge 'test.py: migrate all bare skips to typed skip markers' from Artsiom Mishuta should be merged after #29235 Complete the typed skip markers migration started in the plugin PR. Every bare `@pytest.mark.skip` decorator and `pytest.skip()` runtime call across the test suite is replaced with a typed equivalent, making skip reasons machine-readable in JUnit XML and Allure reports. 62 files changed across 8 commits, covering ~127 skip sites in total. Bare `pytest.skip` provides only a free-text reason string. CI dashboards (JUnit, Allure) cannot distinguish between a test skipped due to a known bug, a missing feature, a slow test, or an environment limitation. This makes it hard to track skip debt, prioritize fixes, or filter dashboards by skip category. The typed markers (`skip_bug`, `skip_not_implemented`, `skip_slow`, `skip_env`) introduced by the `skip_reason_plugin` solve this by embedding a `skip_type` field into every skip report entry. \| Type \| Count \| Files \| Description \| \|------\|-------\|-------\|-------------\| \| `skip_bug` \| 24 \| 16 \| Skip reason references a known bug/issue \| \| `skip_not_implemented` \| 10 \| 5 \| Feature not yet implemented in Scylla \| \| `skip_slow` \| 4 \| 3 \| Test too slow for regular CI runs \| \| `skip_not_implemented` (bare) \| 2 \| 1 \| Bare `@pytest.mark.skip` with no reason (COMPACT STORAGE, #3882) \| \| Type \| Count \| Files \| Description \| \|------\|-------\|-------\|-------------\| \| `skip_env` \| ~85 \| 34 \| Feature/config/topology not available at runtime \| \| `skip_bug` \| 2 \| 2 \| Known bugs: Streams on tablets (#23838), coroutine task not found (#22501) \| - Comments: 7 comments/docstrings across 5 files updated from `pytest.skip()` to `skip()` - Plugin hardened: `warnings.warn()` → `pytest.UsageError` for bare `@pytest.mark.skip` at collection time — bare skips are now a hard error, not a warning - Guard tests: New `test/pylib_test/test_no_bare_skips.py` with 3 tests that 
prevent regression: - AST scan for bare `@pytest.mark.skip` decorators - AST scan for bare `pytest.skip()` runtime calls - Real `pytest --collect-only` against all Python test directories Runtime skip sites use the convenience wrappers from `test.pylib.skip_types`: ```python from test.pylib.skip_types import skip_env ``` Usage: ```python skip_env("Tablets not enabled") ``` 1. test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs — 24 decorator sites, 16 files 2. test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented — 10 decorator sites, 5 files 3. test: migrate @pytest.mark.skip to @pytest.mark.skip_slow — 4 decorator sites, 3 files 4. test: migrate bare @pytest.mark.skip to skip_not_implemented — 2 bare decorators, 1 file 5. test: migrate runtime pytest.skip() to typed skip_env() — ~85 sites, 34 files 6. test: migrate runtime pytest.skip() to typed skip_bug() — 2 sites, 2 files 7. test: update comments referencing pytest.skip() to skip() — 7 comments, 5 files 8. 
test/pylib: reject bare pytest.mark.skip and add codebase guards — plugin hardening + 3 guard tests - All 60 plugin + guard tests pass (`test/pylib_test/`) - No bare `@pytest.mark.skip` or `pytest.skip()` calls remain in the codebase - `pytest --collect-only` succeeds across all test directories with the hardened plugin SCYLLADB-1349 Closes scylladb/scylladb#29305 * github.com:scylladb/scylladb: test/alternator: replace bare pytest.skip() with typed skip helpers test: migrate new bare skips introduced by upstream after rebase test/pylib: reject bare pytest.mark.skip and add codebase guards test: update comments referencing pytest.skip() to skip_env() test: migrate runtime pytest.skip() to typed skip_bug() test: migrate runtime pytest.skip() to typed skip_env() test: migrate bare @pytest.mark.skip to skip_not_implemented test: migrate @pytest.mark.skip to @pytest.mark.skip_slow test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs	2026-04-22 15:48:27 +03:00
Artsiom Mishuta	183c6d120e	test: exclude pylib_test from default test runs Add pylib_test to norecursedirs in pytest.ini so it is not collected during ./test.py or pytest test/ runs, but can still be run directly via 'pytest test/pylib_test'. Also fix pytest log cleanup: worker log files (pytest_gw*) were not being deleted on success because cleanup was restricted to the main process only. Now each process (main and workers) cleans up its own log file on success. Closes scylladb/scylladb#29551	2026-04-22 11:38:40 +02:00
Ernest Zaslavsky	9faaf1f09c	test: extract object storage helpers to test/pylib/object_storage.py Move S3/GCS server classes (S3Server, MinioWrapper, GSFront, GSServer), factory functions (create_s3_server, create_gs_server), CQL helpers (format_tuples, keyspace_options), bucket naming (_make_bucket_name), and the s3_server fixture from test/cluster/object_store/conftest.py into a shared module at test/pylib/object_storage.py. The conftest.py is now a thin wrapper that re-exports symbols and defines only the fixtures specific to the object_store suite (object_storage, s3_storage). All external importers are updated. Old class names (S3_Server, GSServer) are kept as aliases for backward compatibility.	2026-04-21 19:08:57 +03:00
Ernest Zaslavsky	e175088db5	test: add s3_test_fixture and migrate tests to per-bucket isolation Add s3_test_fixture, an RAII class that creates a unique S3 bucket on construction and tears down everything (delete all objects, delete bucket, close client) on destruction. Bucket names are derived from the Boost test name, pid, and a counter to guarantee uniqueness across concurrent test processes. Names are sanitized to comply with S3 bucket naming rules (lowercase, hyphens, 3-63 chars). Migrate all S3 tests that create objects to use the fixture, removing manual bucket name construction, deferred_delete_object cleanup, and per-test deferred_close calls. The fixture owns the client lifecycle. Tests with special semaphore requirements (broken semaphore for fallback test, small semaphore for abort test, 1MiB for memory test) create the fixture with a separate normal-sized semaphore and use their own constrained client for the test operation. The upload_file tests are converted from SEASTAR_TEST_CASE (coroutine) to SEASTAR_THREAD_TEST_CASE since the fixture requires thread context for .get() calls. Broaden the minio policy to allow the test user to create and delete arbitrary buckets (s3:CreateBucket, s3:DeleteBucket, s3:ListAllMyBuckets on arn:aws:s3:::*), and operate on objects in any bucket.	2026-04-21 19:08:57 +03:00
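The bucket-name sanitization mentioned in e175088db5 (lowercase, hyphens, 3-63 chars) can be sketched as follows. The actual fixture is C++ and its exact rules are not shown in the commit message; this Python version, including the name `sanitize_bucket_name` and the zero-padding of short names, is an assumption made purely for illustration.

```python
import re


def sanitize_bucket_name(raw: str) -> str:
    """Map an arbitrary test name to a string that fits S3 bucket naming limits.

    Hypothetical sketch: lowercase everything, turn any run of characters other
    than a-z, 0-9 and '-' into a single hyphen, cap the length at 63, and pad
    names shorter than 3 characters.
    """
    name = re.sub(r"[^a-z0-9-]+", "-", raw.lower())
    name = re.sub(r"-{2,}", "-", name)   # collapse repeated hyphens
    name = name[:63].strip("-")          # bucket names must not start or end with '-'
    return name.ljust(3, "0")            # assumed padding rule for very short names
```

Uniqueness across concurrent processes would still have to come from the inputs (test name, pid, counter), as the commit message describes; sanitization alone does not provide it.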
Dario Mirovic	f77ff28081	test: manager_client: use safe_driver_shutdown for exclusive_clusters Using cluster.shutdown() is an incorrect way to shut down a Cassandra Cluster. The correct way is using safe_driver_shutdown. Fixes SCYLLADB-1434 Closes scylladb/scylladb#29390	2026-04-19 21:31:18 +03:00
Artsiom Mishuta	9c4d3ce097	test/pylib: reject bare pytest.mark.skip and add codebase guards Harden the skip_reason_plugin to reject bare @pytest.mark.skip at collection time with pytest.UsageError instead of warnings.warn(). Add test/pylib_test/test_no_bare_skips.py with three guard tests: - AST scan for bare @pytest.mark.skip decorators - AST scan for bare pytest.skip() runtime calls - Real pytest --collect-only against all Python test directories	2026-04-19 17:34:31 +02:00
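An AST scan of the kind 9c4d3ce097 describes might look roughly like the sketch below. It is not the content of test_no_bare_skips.py; the function name `find_bare_skip_calls` is hypothetical, and the sketch only recognizes calls spelled literally as `pytest.skip(...)`.

```python
import ast


def find_bare_skip_calls(source: str) -> list[int]:
    """Return the line numbers of `pytest.skip(...)` calls found in the source text.

    Hypothetical guard helper: aliased imports such as `from pytest import skip`
    are not detected by this simple check.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if (isinstance(func, ast.Attribute) and func.attr == "skip"
                and isinstance(func.value, ast.Name) and func.value.id == "pytest"):
            hits.append(node.lineno)
    return sorted(hits)
```

A guard test could run this over every Python test file and assert that the combined result is empty.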
Artsiom Mishuta	8a80e2c3be	test: migrate runtime pytest.skip() to typed skip_env() Migrate runtime pytest.skip() calls across 34 files to use the typed skip_env() wrapper from test.pylib.skip_types. These sites skip at runtime because a required feature, config option, library version, build mode, or runtime topology is not available. Also fixes 'raise pytest.skip(...)' in test_audit.py — skip_env() already raises internally, so the explicit raise was incorrect. Each file gains one new import: from test.pylib.skip_types import skip_env	2026-04-19 11:09:29 +02:00
Botond Dénes	fbcfe3f88f	test: use uuid4 for DockerizedServer container names to avoid collisions Container names were generated as {name}-{pid}-{counter}, where the counter is a per-process itertools.count. This scheme breaks across CI runs on the same host: if a prior job was killed abruptly (SIGKILL, cancellation) its containers are left running since --rm only removes containers on exit. A subsequent run whose worker inherits the same PID (common in containerized CI with small PID namespaces) and reaches the same counter value will collide with the orphaned container. Replace pid+counter with uuid.uuid4(), which generates a random UUID, making names unique across processes, hosts, and time without any shared state or leaking host identifiers. Fixes: SCYLLADB-1540 Closes scylladb/scylladb#29509	2026-04-17 11:56:51 +02:00
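The naming change in fbcfe3f88f amounts to swapping the pid+counter suffix for a random UUID. A minimal sketch, assuming a hypothetical helper called `make_container_name` (the real DockerizedServer code is not shown here):

```python
import uuid


def make_container_name(name: str) -> str:
    """Build a container name that depends on neither the PID nor a per-process counter.

    Hypothetical helper: uuid4() is random, so two runs that happen to share a PID
    and counter value still get different names.
    """
    return f"{name}-{uuid.uuid4()}"
```

Because no state is shared between calls, an orphaned container left over from an earlier, abruptly killed run cannot block a later run by holding the same name.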
Avi Kivity	cad3c0de94	test: write minio log to testlog dir for Jenkins artifact collection Write the MinIO server log directly to tempdir_base (testlog/<arch>/) instead of the per-server temp directory that gets destroyed on shutdown. This preserves the log for Jenkins artifact collection, helping debug S3-related flaky test failures like the stcs_reshape_overlapping_s3_test hang (SCYLLADB-1481). Closes scylladb/scylladb#29458	2026-04-17 12:51:55 +03:00
Botond Dénes	facb50cbf9	Merge 'test.py: refactor test.py' from Andrei Chekun With the latest changes, a lot of the code in test.py is redundant. This PR just cleans up this code. Also, it narrows the use of dynamic scope for fixtures to test/alternator and test/cqlpy. All the rest will have module scope by default. test.py will be a wrapper for pytest, mostly for CI use. For now, test.py still has the important job of calculating the number of threads to start pytest with. This is not possible to do in pytest itself. No backport needed, framework enhancement only. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-666 Closes scylladb/scylladb#28852 * github.com:scylladb/scylladb: test.py: remove testpy_test_fixture_scope test.py: add logger for 3rd party service test.py: delete dead code in test.py	2026-04-17 12:51:14 +03:00
Andrei Chekun	745debe9ec	test.py: remove testpy_test_fixture_scope With the migration to pytest this fixture is useless. Remove it and set the scope to module for most of the tests. Add a dynamic_scope function to support running alternator fixtures in session scope while Test and TestSuite are not yet deleted. This is for the migration period; later on this function should be deleted.	2026-04-16 22:08:33 +02:00
Andrei Chekun	21addb2173	test.py: add logger for 3rd party service With the migration of environment preparation and 3rd party service startup to pytest, these services output their logs to the terminal. This PR binds each of them to its own log file to avoid polluting the terminal.	2026-04-16 22:08:33 +02:00
Andrei Chekun	13770ab394	test.py: delete dead code in test.py With the latest changes, a lot of the code in test.py is redundant. This PR just cleans up this code. Changes in other files are related to cleaning up code from test.py, especially the redundant parameter --test-py-init, and to moving prepare_environment to pytest itself.	2026-04-16 22:08:31 +02:00
Avi Kivity	999e108139	Merge 'test: lib: fix broken retry in start_docker_service' from Dario Mirovic The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism. Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries. Fixes SCYLLADB-1542 This is a CI stability issue and should be backported. Closes scylladb/scylladb#29504 * github.com:scylladb/scylladb: test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server" test: fix proc_utils.cc formatting from previous commit test: lib: use unique container name per retry attempt test: lib: fix broken retry in start_docker_service	2026-04-16 21:48:25 +03:00
Botond Dénes	c355df4461	Merge 'test: Lower default log level from DEBUG to INFO' from Artsiom Mishuta 1. test.py: removed the --log-level=DEBUG flag from pytest args 2. test/pytest.ini: changed log_level to INFO (it was set to DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments. Plus a minor fix, [test/pylib: save logs on success only during teardown phase](`0ede308a04`): previously, when --save-log-on-success was enabled, logs were saved for every test phase (setup, call, teardown) in 3 files. Restrict it to only the teardown phase, which contains all 3 when the test succeeds, to avoid redundant log entries. Closes scylladb/scylladb#29086 * github.com:scylladb/scylladb: test/pylib: save logs on success only during teardown phase test: Lower default log level from DEBUG to INFO	2026-04-16 12:46:11 +03:00
Dario Mirovic	50e498ac0d	test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service Fix assorted typos in comments, strings, and identifiers: - path_preprend -> path_prepend (proc_utils.hh, proc_utils.cc) - laúnch -> launch (proc_utils.cc) - hand/fail -> hang/fail (dockerized_service.py) - inconvinient -> inconvenient (dockerized_service.py) - priviledges -> privileges (gcs_fixture.hh) - remove double semicolon (gcs_fixture.cc) Refs SCYLLADB-1542	2026-04-16 10:58:55 +02:00
Botond Dénes	00d8470554	Merge 'test: filter benign shutdown errors in tests that grep logs directly' from Marcin Maliszkiewicz Tests that call grep_for_errors() directly and assert no errors can fail spuriously due to benign RPC errors during graceful shutdown (e.g. "connection dropped: Semaphore broken"), which are already filtered by the after_test hook via filter_errors(). Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464 Backport: no, tests fix (we may decide to backport later if it occurs on release branches) Closes scylladb/scylladb#29463 * github.com:scylladb/scylladb: test: filter benign errors in tests that grep logs during shutdown test: filter_errors: support list[list[str]] error groups	2026-04-15 14:40:15 +03:00


926 Commits