Rewrite metrics gathering so that metrics are gathered correctly for Python tests.
Python tests require different handling of cgroup metrics gathering than C++ tests. pytest does not execute each Python test in a separate process, so we can't put each test in its own cgroup and read the metrics from it.
The idea is to put the whole pytest process into the cgroup and get the metrics from it. This works because pytest runs its worker threads as completely separate processes, and each one runs its tests sequentially.
Additionally, to simplify things, the system resource monitor has been moved to the pytest main thread.
Change the behavior of metrics gathering. Starting with this PR, some data is collected even with `--no-gather-metrics`. This data needs no configuration and is just test metadata: test name, execution time, and test status. When `--gather-metrics` is provided, the per-test memory data gathered from the cgroups and the system CPU/RAM utilization are written as well.
Backport is not needed, because it's a framework change only.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-575
~Blocked by: https://github.com/scylladb/scylladb/pull/27618~
Python tests, together with their own Scylla instances, now have metrics gathered from the cgroups as well.
```bash
$ sqlite3 --header testlog/sqlite_af8cb.db 'select tst.path, tst.file, tst.test_name, user_sec,system_sec,usage_sec,memory_peak /1024/1024 as memory_peak_mb from test_metrics join tests as tst where tst.id = test_metrics.test_id order by memory_peak_mb desc limit 10;'
path|file|test_name|user_sec|system_sec|usage_sec|memory_peak_mb
test/cluster/dtest|limits_test.py|test_max_cells|489.468174|27.6638949999999|517.132069|4241
test/cluster/dtest|rebuild_test.py|test_rebuild_stream_abort_repro|93.6400869999998|28.9843249999999|122.624412|4241
test/cluster/dtest|schema_management_test.py|test_prepared_statements_work_after_node_restart_after_altering_schema_without_changing_columns|6.8933219999999|3.63569899999993|10.5290209999994|4241
test/cluster/dtest|schema_management_test.py|test_dropping_keyspace_with_many_columns|1.31770999999981|0.754742999999962|2.07245299999977|4241
test/cluster/dtest|schema_management_test.py|test_multiple_create_table_in_parallel|5.48435300000028|2.72915200000011|8.21350499999971|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[write]|80.687293|18.5562|99.2434920000005|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[read]|79.1984790000001|18.0969829999999|97.2954609999997|4241
test/cluster/dtest|schema_management_test.py|test_alter_table_in_parallel_to_read_and_write[mixed]|85.332915|18.9321070000001|104.265022|4241
test/cluster/dtest|schema_management_test.py|test_update_schema_while_node_is_killed[create_table]|10.5875369999999|5.67954400000008|16.267081|4241
test/cluster/dtest|schema_management_test.py|test_update_schema_while_node_is_killed[alter_table]|11.3801709999998|6.54689099999996|17.9270630000001|4241
```
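As a rough illustration of where columns such as `user_sec`, `system_sec`, and `usage_sec` could come from, the sketch below parses the text of a cgroup v2 `cpu.stat` file and computes per-test deltas in seconds. It assumes the counters are sampled before and after each test; the function names `parse_cpu_stat` and `cpu_seconds` are hypothetical and are not taken from test.py.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def cpu_seconds(before: dict[str, int], after: dict[str, int]) -> dict[str, float]:
    """Return usage/user/system CPU time consumed between two samples, in seconds."""
    return {
        name: (after[f"{name}_usec"] - before[f"{name}_usec"]) / 1_000_000
        for name in ("usage", "user", "system")
    }


# Hypothetical samples taken right before and right after one test.
before = parse_cpu_stat("usage_usec 1000000\nuser_usec 700000\nsystem_usec 300000\n")
after = parse_cpu_stat("usage_usec 2500000\nuser_usec 1700000\nsystem_usec 800000\n")
delta = cpu_seconds(before, after)
```

Because the whole pytest process sits in one cgroup and tests run one after another, a before/after delta like this is attributable to a single test.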
Closes scylladb/scylladb#28206
* github.com:scylladb/scylladb:
test.py: Add host hardware info
test.py: rewrite resource gather
This series adds a shared helper for resolving, downloading, unpacking, and
installing Scylla relocatable packages for test.py.
The first patch introduces `version_fetch_utils`, which can resolve public
Scylla artifacts from the downloads bucket by version, architecture, package
variant, or direct URL. It also centralizes the local cache/install flow using
retry handling, marker files, and file locking so repeated or concurrent test
runs can safely reuse an existing installation.
The second patch wires this helper into the existing Scylla executable setup
paths. This removes the hard-coded 2025.1 package URL and replaces the local
download/unpack/install logic in `scylla_cluster.py` with the shared resolver.
It also makes `--exe-url` use the same cached installer path.
Together, these changes make upgrade-test executable selection less brittle,
avoid duplicated install logic, and provide a reusable foundation for fetching
other Scylla versions in test.py.
Closes scylladb/scylladb#29855
* github.com:scylladb/scylladb:
test/pylib: use version fetcher for Scylla executable setup
test/pylib: add cached Scylla package installer
Python tests require different handling of metrics gathering from
cgroup than C++ tests. pytest does not execute each Python test in
a separate process, so we can't put each test in its own cgroup and get
the metrics.
The idea is to put the whole pytest process into the cgroup and get the
metrics. This works because pytest runs its worker threads as completely
separate processes, and inside each one it runs tests sequentially.
Additionally, to simplify things, the system resource monitor is moved
to the pytest main thread.
After all test suites migrated to test_config.yaml with type: Python,
the specialized suite classes (Topology, CQLApproval, Run, Tool) and
the legacy execution pipeline (find_tests, run_test, TestSuite.run,
Test.run) became unreachable. Remove all this dead code.
Deleted files:
- suite/topology.py, suite/cql_approval.py, suite/run.py, suite/tool.py
Simplified:
- base.py: remove run_test(), read_log(), TestSuite.run(),
add_test_list(), build_test_list(), all_tests(), test_count(),
SUITE_CONFIG_FILENAME, disabled/flaky test tracking, and dead
Test attributes (args, core_args, valid_exit_codes, allure_dir,
is_flaky, is_cancelled, etc.)
- python.py: remove PythonTestSuite.run(), PythonTest.run(),
_prepare_pytest_params(), pattern, test_file_ext, xmlout,
server_log, scylla_env setup, and shlex import.
Simplify run_ctx() to take no parameters.
- runner.py: remove --scylla-log-filename option,
print_scylla_log_filename fixture, SUITE_CONFIG_FILENAME import,
and suite.yaml probe in TestSuiteConfig.from_pytest_node().
- __init__.py: remove re-exports of deleted classes.
- test_config.yaml: Topology -> Python, Approval -> Python.
- conftest files: run_ctx(options=...) -> run_ctx().
- docs/dev/testing.md: update to reflect current pytest-based
architecture, log paths, and removed features.
Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
Closes scylladb/scylladb#29613
Replace the hard-coded 2025.1 archive download and local install logic with the
shared Scylla package fetch/install helper. This keeps upgrade-test executable
resolution and `--exe-url` handling on the same cached installer path.
Add utilities to resolve relocatable Scylla artifacts from the public downloads
bucket by version, architecture, package variant, or direct URL. Download,
unpack, and install the selected archive into the test.py cache with retry
handling, marker files, and file locking so repeated or concurrent test runs can
reuse the same installation safely.
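The marker-file plus file-lock pattern described above can be sketched as follows. This is only an illustration under assumptions: the name `install_once`, the `.installed` and `.lock` file names, and the `install` callback are hypothetical, and retry handling is left out. It relies on `fcntl.flock`, so it is Unix-only.

```python
import fcntl
from pathlib import Path
from typing import Callable


def install_once(cache_dir, install: Callable[[Path], None]) -> bool:
    """Run install(cache_dir) at most once; return True if this call did the work."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    marker = cache_dir / ".installed"   # hypothetical marker file name
    if marker.exists():                 # fast path: a previous run already finished
        return False
    with open(cache_dir / ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks while another process installs
        try:
            if marker.exists():         # re-check: someone finished while we waited
                return False
            install(cache_dir)
            marker.touch()              # written last, so it implies a complete install
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Writing the marker only after the install step succeeds means a crashed or interrupted run leaves no marker behind, so the next run redoes the installation instead of reusing a partial one.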
This series fixes a recurring source of flaky tests in the cluster test suite.
When a test configures Scylla to listen on non-default ports (e.g. a custom Alternator port, proxy-protocol port or shard-aware port), server_add() and server_start() would declare the server ready by polling the hardcoded standard CQL and Alternator ports. Those ports can become available slightly before the custom ports finish binding, so the test could start using the custom port before it was open — causing intermittent failures.
The fix for each affected test was to pass `expected_server_up_state=ServerUpState.SERVING` explicitly, which waits for Scylla's sd_notify("STATUS=serving") signal instead. That signal is sent only after all configured listeners are fully open, so it is always the right readiness signal regardless of the port configuration. This workaround was applied again in PR #29737 and will keep being needed for every new test that uses a non-default port.
This series makes ServerUpState.SERVING the default at every level of the server start/add call stack so no test needs to remember it:
* Make server_add(), servers_add(), server_start() et al. all default to ServerUpState.SERVING.
* Document that server_add/server_start wait for all ports to be ready, so future test authors understand what the functions guarantee.
* Remove now-redundant expected_server_up_state=SERVING from existing tests.
* A small optimization: Fix check_serving_notification() returning False on first completion. When the sd_notify future completed, the function correctly updated _received_serving but still returned False, wasting one 100ms polling interval. Return self._received_serving directly.
Closes scylladb/scylladb#29758
* github.com:scylladb/scylladb:
test/pylib: fix missing protocol_version=4 on control_cluster
scylla_cluster: guard poll_status() set_result() calls against cancelled future
test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING
test/cluster: fix check_serving_notification() inefficiency
test/cluster: remove now-redundant expected_server_up_state=SERVING
test/cluster: document that add/start waits for all ports to be ready
test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING
test/cluster: fix server_add/server_start hanging when starting in maintenance mode
main: notify "entering maintenance mode" after the maintenance CQL server is ready
test/cluster: make server_start() default to ServerUpState.SERVING
test/cluster: make server_add() default to ServerUpState.SERVING
Avoid concurrent topology changes in the tombstone GC repair setup, where debug-mode nodes running hinted handoff and materialized view startup work can time out while applying Raft entries before the test starts.
Keep the sequential path opt-in so unrelated repair tests still exercise concurrent bootstrap behavior.
Closes scylladb/scylladb#29829
Update the hardcoded 2025.1.0 binary URL to the latest 2025.1.12
release for upgrade tests.
The 2025.1.12 binary now supports and enforces the
rf_rack_valid_keyspaces option which the test harness enables by
default. Since test_sstable_compression_dictionaries_upgrade creates
a 2-node cluster in a single rack with RF=2, it violates the
constraint. Disable the option explicitly for this test.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closes scylladb/scylladb#29714
The mechanics of the restore are as follows:
- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
- First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
- Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against current node and tablet being handled
  - Downloading and attaching the filtered sstables
This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.
This is the first step towards SCYLLADB-197 and lacks many things. In particular:
- the API only works for single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking the API on restore will re-download everything)
- not re-attachable (if the API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via another node)
- nodes download sstables in the maintenance/streaming sched group (should be moved to maintenance/backup)
Other follow-up items:
- have an actual swagger object specification for `backup_location`
Closes #28436
Closes #28657
Closes #28773
Closes scylladb/scylladb#28763
* github.com:scylladb/scylladb:
docs: Update topology_over_raft.md with `restore` transition kind
test: Add test for backup vs migration race
test: Restore resilience test
sstables_loader: Fail tablet-restore task if not all sstables were downloaded
sstables_loader: mark sstables as downloaded after attaching
sstables_loader: return shared_sstable from attach_sstable
db: add update_sstable_download_status method
db: add downloaded column to snapshot_sstables
db: extract snapshot_sstables TTL into class constant
test: Add a test for tablet-aware restore
tablets: Implement tablet-aware cluster-wide restore
messaging: Add RESTORE_TABLET RPC verb
sstables_loader: Add method to download and attach sstables for a tablet
tablets: Add restore_config to tablet_transition_info
sstables_loader: Add restore_tablets task skeleton
test: Add rest_client helper to kick newly introduced API endpoint
api: Add /storage_service/tablets/restore endpoint skeleton
sstables_loader: Add keyspace and table arguments to manfiest loading helper
sstables_loader_helpers: just reformat the code
sstables_loader_helpers: generalize argument and variable names
sstables_loader_helpers: generalize get_sstables_for_tablet
sstables_loader_helpers: add token getters for tablet filtering
sstables_loader_helpers: remove underscores from struct members
sstables_loader: move download_sstable and get_sstables_for_tablet
sstables_loader: extract single-tablet SST filtering
sstables_loader: make download_sstable static
sstables_loader: fix formating of the new `download_sstable` function
sstables_loader: extract single SST download into a function
sstables_loader: add shard_id to minimal_sst_info
sstables_loader: add function for parsing backup manifests
split utility functions for creating test data from database_test
export make_storage_options_config from lib/test_services
rjson: Add helpers for conversions to dht::token and sstable_id
Add system_distributed_keyspace.snapshot_sstables
add get_system_distributed_keyspace to cql_test_env
code: Add system_distributed_keyspace dependency to sstables_loader
storage_service: Export export handle_raft_rpc() helper
storage_service: Export do_tablet_operation()
storage_service: Split transit_tablet() into two
tablets: Add braces around tablet_transition_kind::repair switch
This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions.
New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully.
A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion.
A `client::make` overload that accepts a custom `retry_strategy`, used in tests with a fast 1ms retry delay instead of exponential backoff, significantly reducing test runtime for transient errors during bucket lifecycle operations.
Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness.
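A minimal sketch of how a bucket name could be derived from a pytest node name with a short UUID suffix is shown below. The function name `make_bucket_name` and the exact sanitization rules are assumptions for illustration, not the code from this series; the constraints applied are the general S3 bucket naming rules (lowercase letters, digits and hyphens, at most 63 characters, starting and ending with a letter or digit).

```python
import re
import uuid
from typing import Optional


def make_bucket_name(node_name: str, suffix: Optional[str] = None) -> str:
    """Turn a pytest node name into an S3-compatible, unique bucket name."""
    suffix = suffix or uuid.uuid4().hex[:8]          # short random suffix for uniqueness
    # Collapse every run of characters outside [a-z0-9] into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", node_name.lower()).strip("-")
    # Leave room for "-<suffix>" within the 63-character limit.
    slug = slug[: 63 - len(suffix) - 1].strip("-") or "test"
    return f"{slug}-{suffix}"
```

For example, `make_bucket_name("test_alter_table[write]", "abcd1234")` yields `test-alter-table-write-abcd1234`.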
Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. Old class names are preserved as aliases for backward compatibility.
| Test Name | new test specific retry strategy execution time (ms) | original execution time (ms) | Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy | 19,261 | 61,395 | −42,134 | **3.2×** |
| test_client_upload_file_multi_part_without_remainder_proxy | 16,901 | 53,688 | −36,787 | **3.2×** |
| test_client_upload_file_single_part_proxy | 3,478 | 6,789 | −3,311 | **2.0×** |
| test_client_multipart_copy_upload_proxy | 1,303 | 1,619 | −316 | 1.2× |
| test_client_put_get_object_proxy | 150 | 365 | −215 | **2.4×** |
| test_client_readable_file_stream_proxy | 125 | 327 | −202 | **2.6×** |
| test_small_object_copy_proxy | 205 | 389 | −184 | 1.9× |
| test_client_put_get_tagging_proxy | 181 | 350 | −169 | 1.9× |
| test_client_multipart_upload_proxy | 1,252 | 1,416 | −164 | 1.1× |
| test_client_list_objects_proxy | 729 | 881 | −152 | 1.2× |
| test_chunked_download_data_source_with_delays_proxy | 830 | 960 | −130 | 1.2× |
| test_client_readable_file_proxy | 148 | 279 | −131 | 1.9× |
| test_client_upload_file_multi_part_with_remainder_minio | 3,358 | 3,170 | +188 | 0.9× |
| test_client_upload_file_multi_part_without_remainder_minio | 3,131 | 2,929 | +202 | 0.9× |
| test_client_upload_file_single_part_minio | 519 | 421 | +98 | 0.8× |
| test_download_data_source_proxy | 180 | 237 | −57 | 1.3× |
| test_client_list_objects_incomplete_proxy | 590 | 641 | −51 | 1.1× |
| test_large_object_copy_proxy | 952 | 991 | −39 | 1.0× |
| test_client_multipart_upload_fallback_proxy | 148 | 185 | −37 | 1.3× |
| test_client_multipart_copy_upload_minio | 641 | 674 | −33 | 1.1× |
No backport needed — this is a test infrastructure improvement with no production code impact beyond the new `s3::client` methods.
Closes scylladb/scylladb#29508
* github.com:scylladb/scylladb:
test: extract object storage helpers to test/pylib/object_storage.py
test: add per-test bucket isolation to object_store fixtures
s3: add client::make overload with custom retry strategy
test: add s3_test_fixture and migrate tests to per-bucket isolation
s3: add create_bucket and delete_bucket to client
As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental.
So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature.
Finally, stop providing the config flag in test configs.
Fixes SCYLLADB-1680
Fixes #16367
The documentation work tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462 still remains.
This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport.
Closes scylladb/scylladb#29604
* github.com:scylladb/scylladb:
test: Stop providing alternator-streams experimental flag
alternator: Graduate Alternator Streams from experimental
get_cql_up_state() creates two Cluster instances: a short-lived one
used to probe CQL readiness, and a persistent control_cluster kept
alive for the lifetime of the server. The probe cluster was created
with protocol_version=4 (the highest version Scylla supports), but
the control_cluster was not, causing the driver to do a superfluous
version-negotiation round-trip on every server start.
Fix by extracting the shared constructor arguments into a cluster_kwargs
dict and using **cluster_kwargs for both calls, so the two Cluster
instances are created with identical parameters. This deduplication can
help avoid more instances of this bug, where someone modifies the
options in one call but forgets to change the options in the other
call.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The poll_status() background thread resolves `serving_signal` by
scheduling `f.set_result(...)` on the event loop via
`call_soon_threadsafe`. In parallel, `_cleanup_notify_socket()` can
cancel `serving_signal` at any time - for example when a server fails
to start and `stop()` -> `shutdown_control_connection()` is called while
the thread is still blocked in `recv()` (the socket close unblocks the
`recv()` with an exception, sending it down the error path).
When that race fires the scheduled `f.set_result(...)` callback runs
after `cancel()` has already put the future into the *cancelled* state,
raising `asyncio.InvalidStateError: Result is not allowed in cancelled
state`.
This bug predates the SERVING work, but the original
CQL_ALTERNATOR_QUERIED default meant the notify socket was torn down
quickly most of the time, making the window very narrow. Now that
SERVING is the default the socket stays open throughout the full startup
wait, widening the race significantly.
Fix: replace every bare `f.set_result(v)` call with
`lambda: f.done() or f.set_result(v)`, which is a no-op when the
future is already done (cancelled, or resolved by another path).
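The guard can be reproduced in isolation. The sketch below is a standalone illustration, not the scylla_cluster.py code: `resolve_if_pending` is a hypothetical named equivalent of the lambda, and a plain thread stands in for poll_status().

```python
import asyncio
import threading


def resolve_if_pending(future: asyncio.Future, value) -> None:
    """Equivalent of `lambda: f.done() or f.set_result(v)`: no-op on a finished future."""
    if not future.done():
        future.set_result(value)


async def demo():
    loop = asyncio.get_running_loop()
    cancelled = loop.create_future()
    pending = loop.create_future()
    cancelled.cancel()  # simulates the cleanup path cancelling the future first

    def worker():
        # Background thread hands the results to the event loop, as poll_status() does.
        loop.call_soon_threadsafe(resolve_if_pending, cancelled, True)
        loop.call_soon_threadsafe(resolve_if_pending, pending, True)

    thread = threading.Thread(target=worker)
    thread.start()
    value = await pending      # resolved by the second guarded callback
    thread.join()
    return cancelled.cancelled(), value


result = asyncio.run(demo())
```

With a bare `set_result()` the first callback would raise `asyncio.InvalidStateError` inside the event loop; with the guard it is silently skipped and the cancelled future stays cancelled.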
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
With ServerUpState.SERVING now the default, server_add() and server_start()
wait for sd_notify readiness after CQL is already up. During that window
the startup polling loop was calling get_cql_alternator_up_state() on every
iteration (every 100ms). Each successful call recreated self.control_cluster
and self.control_connection without closing the previous ones, leaking driver
connections and adding unnecessary CQL load to a node that was already known
to be queryable.
Fix in two places:
- Startup loop: skip the get_cql_alternator_up_state() call once
server_up_state has reached CQL_ALTERNATOR_QUERIED. After that point only
the cheap non-blocking check_serving_notification() is needed.
- get_cql_up_state(): guard control_cluster/control_connection creation with
`if self.control_connection is None` so the persistent driver connection is
only established once, even if the function is called multiple times.
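The startup-loop half of the fix can be sketched roughly as below. The `UpState` enum, `wait_until_serving`, and the two callbacks are simplified stand-ins invented for this illustration; the real loop also sleeps between iterations and handles timeouts.

```python
import enum
from typing import Callable


class UpState(enum.IntEnum):
    # Simplified stand-in for ServerUpState; only the ordering matters here.
    STARTING = 0
    CQL_ALTERNATOR_QUERIED = 1
    SERVING = 2


def wait_until_serving(probe_cql: Callable[[], UpState],
                       check_serving: Callable[[], bool],
                       max_iterations: int) -> UpState:
    """Poll until the serving notification arrives, probing CQL only while needed."""
    state = UpState.STARTING
    for _ in range(max_iterations):
        # Expensive probe: skipped once the node is known to be queryable.
        if state < UpState.CQL_ALTERNATOR_QUERIED:
            state = probe_cql()
        # Cheap non-blocking check: performed on every iteration.
        if check_serving():
            return UpState.SERVING
    return state
```

Once `state` reaches `CQL_ALTERNATOR_QUERIED`, the probe callback is never invoked again, so no further driver connections are created while waiting for the notification.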
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When the sd_notify future completed, check_serving_notification() correctly
updated _received_serving to True but still returned False on that same call.
The SERVING state was only recognized on the next polling iteration, 100ms
later, for no reason.
Return self._received_serving instead of False after updating it.
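A stripped-down model of the corrected function is sketched below. The `ServingWatcher` class is a hypothetical stand-in that keeps only the two attributes mentioned above; it is not the actual ScyllaServer code.

```python
import asyncio


class ServingWatcher:
    """Minimal stand-in holding the sd_notify future and the cached flag."""

    def __init__(self, serving_signal: asyncio.Future):
        self.serving_signal = serving_signal
        self._received_serving = False

    def check_serving_notification(self) -> bool:
        if self._received_serving:
            return True
        if self.serving_signal.done() and not self.serving_signal.cancelled():
            self._received_serving = bool(self.serving_signal.result())
        # Return the freshly updated flag instead of a hard-coded False,
        # so the caller sees SERVING on the same polling iteration.
        return self._received_serving


loop = asyncio.new_event_loop()
signal = loop.create_future()
watcher = ServingWatcher(signal)
first = watcher.check_serving_notification()   # future not completed yet
signal.set_result(True)
second = watcher.check_serving_notification()  # completed: reported immediately
loop.close()
```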
Add docstrings to server_add(), server_start(), and servers_add() explaining
that they wait for ServerUpState.SERVING before returning, which means Scylla
has finished listening on all configured ports (including non-default ones).
Note that server_add() and server_start() accept expected_server_up_state to
return earlier if needed, while servers_add() always waits for SERVING.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
ScyllaServer.install_and_start() and ScyllaServer.start() still had
ServerUpState.CQL_ALTERNATOR_QUERIED as their default for
expected_server_up_state. In practice these defaults are never reached -
both call sites in ScyllaCluster always pass the value explicitly,
forwarding it from the higher-level add_server() and server_start()
whose defaults were already fixed.
Update them to SERVING anyway for consistency, so that the low-level
methods agree with the policy established at the higher layers and won't
silently revert to the wrong behavior if a new call site is added without
an explicit argument.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When Scylla starts in maintenance mode it sends sd_notify("STATUS=entering
maintenance mode") instead of sd_notify("STATUS=serving"), and does not
open the standard CQL port. This caused two independent bugs after the
default was changed to ServerUpState.SERVING:
1. poll_status() resolved serving_signal to False on the maintenance
notification, so check_serving_notification() would never return True,
and start() would time out waiting for SERVING.
2. The readiness check in start() was guarded by
`server_up_state >= CQL_ALTERNATOR_QUERIED`, which is never reached in
maintenance mode (the standard CQL port is not open). Even if bug 1
were fixed, SERVING would never be recognized.
Fix both:
- Treat STATUS=entering maintenance mode as a successful readiness signal
in poll_status(), resolving serving_signal to True just like
STATUS=serving. Both mean "all configured ports are now open".
- Remove the CQL_ALTERNATOR_QUERIED precondition from the
check_serving_notification() call in start(). The sd_notify signal is
authoritative: Scylla sends it only when fully ready, regardless of
which ports it opened. No CQL precondition is needed.
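The first fix amounts to accepting two status strings as readiness signals. A minimal sketch of that decision, assuming the notification arrives as newline-separated `KEY=VALUE` pairs as in the sd_notify protocol, could look like this; `is_ready_notification` and `READY_STATUSES` are hypothetical names.

```python
# Both statuses mean "all configured ports are now open".
READY_STATUSES = {"serving", "entering maintenance mode"}


def is_ready_notification(datagram: bytes) -> bool:
    """Return True if an sd_notify datagram carries a readiness STATUS."""
    for line in datagram.decode("utf-8", errors="replace").splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "STATUS" and value in READY_STATUSES:
            return True
    return False
```

Any other status line, such as an intermediate startup message, is ignored and the caller keeps waiting.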
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
For the same reason server_add() was changed to default to SERVING
(see previous commit), server_start() had the same bug: after restarting
a node that listens on non-default ports, the polling of the hardcoded
CQL/Alternator ports could succeed before the custom ports were ready,
causing intermittent failures.
Apply the same fix to server_start() in manager_client.py,
ScyllaCluster.server_start(), and the _cluster_server_start HTTP handler.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
server_add() was defaulting to ServerUpState.CQL_ALTERNATOR_QUERIED,
which polls the standard CQL and Alternator ports to determine when the
server is ready. This is wrong when a test configures Scylla to listen
on non-default ports: the polling succeeds on the default ports while the
custom ports may not yet be ready, making such tests intermittently flaky.
The correct behavior is ServerUpState.SERVING, which waits for Scylla's
sd_notify("READY=1") signal. This signal is sent only after all
configured listeners — including custom ports — are fully open, so it
is the right readiness signal regardless of the port configuration.
Up to now, the fix for each affected test was to pass
expected_server_up_state=ServerUpState.SERVING explicitly once the
flakiness was noticed (e.g. #29737). Change the default so that all
future tests get the correct behavior automatically.
Changed in manager_client.server_add(), ScyllaCluster.add_server(), and
the _cluster_server_add HTTP handler. The multi-server servers_add() path
already inherits the new default through add_server().
Fixes SCYLLADB-1822
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add --keep-duplicates CLI argument to bypass deduplication and forward
to pytest, allowing duplicate test file arguments to be collected
multiple times.
Move RUN_ID assignment from pytest_collect_file to modify_pytest_item.
All File collectors for the same source file share a single run_ids
dict (via RUN_ID_CACHE stash key), so items from duplicate collection
arguments (e.g. with --keep-duplicates) automatically get unique IDs.
Remove CppFile.run_id cached_property — CppTestCase now reads RUN_ID
from its own item stash, which is set during modify_pytest_item.
Fix --repeat option default from string "1" to int 1 — argparse only
applies type= to CLI-parsed values, not defaults.
Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
Return a DisabledFile collector instead of an empty list when all modes
are disabled for a file. Returning an empty list caused subsequent
files to not get their stash items set because file_path was never
removed from REPEATING_FILES.
Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
In certain circumstances the current way of collecting tests can be error-prone. Collection can stop when the first file is skipped in the mode, leaving the rest of the files given on the CLI uncollected.
Another issue is that if a file is specified twice, once via its directory and once explicitly, an incorrect CppFile ends up in the stash, causing a KeyError.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714
No backport, test framework bug fix only.
Closes scylladb/scylladb#29634
* github.com:scylladb/scylladb:
test.py: fix framework test
test.py: fix test collection bug
server_add() only waits for CQL readiness before returning. The
Alternator HTTP port may not be listening yet, causing
ConnectionRefused with Alternator tests.
Extend the ServerUpState enum and startup loop to also check Alternator
port readiness when configured. Whenever one or more Alternator ports
are configured, each is verified to be connectable and queryable,
similar to how CQL ports are probed.
Fixes SCYLLADB-1701
Closes scylladb/scylladb#29625
In certain circumstances the current way of collecting tests can be
error-prone. Collection can stop when the first file is skipped in the
mode, leaving the rest of the files given on the CLI uncollected.
Another issue is that if a file is specified twice, once via its
directory and once explicitly, an incorrect CppFile ends up in the
stash, causing a KeyError.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714
Tracing events are written to system_traces.events with CL=ANY, so they
are only guaranteed to be present on the local node of the query
coordinator. Reading them back with the driver default (CL=LOCAL_ONE)
may route the query to a replica that has not yet received all events,
causing the assertion on 'digest mismatch, starting read repair' to fail
intermittently.
Fix execute_with_tracing() to read tracing via the ResponseFuture API
with query_cl=ConsistencyLevel.ALL, so events from all replicas are
merged before the caller inspects them.
Fixes: SCYLLADB-1633
Closes scylladb/scylladb#29566
should be merged after #29235
Complete the typed skip markers migration started in the plugin PR.
Every bare `@pytest.mark.skip` decorator and `pytest.skip()` runtime call
across the test suite is replaced with a typed equivalent, making skip
reasons machine-readable in JUnit XML and Allure reports.
**62 files changed** across 8 commits, covering ~127 skip sites in total.
Bare `pytest.skip` provides only a free-text reason string. CI dashboards
(JUnit, Allure) cannot distinguish between a test skipped due to a known
bug, a missing feature, a slow test, or an environment limitation. This
makes it hard to track skip debt, prioritize fixes, or filter dashboards
by skip category.
The typed markers (`skip_bug`, `skip_not_implemented`, `skip_slow`,
`skip_env`) introduced by the `skip_reason_plugin` solve this by embedding
a `skip_type` field into every skip report entry.
| Type | Count | Files | Description |
|------|-------|-------|-------------|
| `skip_bug` | 24 | 16 | Skip reason references a known bug/issue |
| `skip_not_implemented` | 10 | 5 | Feature not yet implemented in Scylla |
| `skip_slow` | 4 | 3 | Test too slow for regular CI runs |
| `skip_not_implemented` (bare) | 2 | 1 | Bare `@pytest.mark.skip` with no reason (COMPACT STORAGE, #3882) |
| Type | Count | Files | Description |
|------|-------|-------|-------------|
| `skip_env` | ~85 | 34 | Feature/config/topology not available at runtime |
| `skip_bug` | 2 | 2 | Known bugs: Streams on tablets (#23838), coroutine task not found (#22501) |
- **Comments**: 7 comments/docstrings across 5 files updated from `pytest.skip()` to `skip()`
- **Plugin hardened**: `warnings.warn()` → `pytest.UsageError` for bare `@pytest.mark.skip` at collection time — bare skips are now a hard error, not a warning
- **Guard tests**: New `test/pylib_test/test_no_bare_skips.py` with 3 tests that prevent regression:
- AST scan for bare `@pytest.mark.skip` decorators
- AST scan for bare `pytest.skip()` runtime calls
- Real `pytest --collect-only` against all Python test directories
Runtime skip sites use the convenience wrappers from `test.pylib.skip_types`:
```python
from test.pylib.skip_types import skip_env
```
Usage:
```python
skip_env("Tablets not enabled")
```
1. **test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs** — 24 decorator sites, 16 files
2. **test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented** — 10 decorator sites, 5 files
3. **test: migrate @pytest.mark.skip to @pytest.mark.skip_slow** — 4 decorator sites, 3 files
4. **test: migrate bare @pytest.mark.skip to skip_not_implemented** — 2 bare decorators, 1 file
5. **test: migrate runtime pytest.skip() to typed skip_env()** — ~85 sites, 34 files
6. **test: migrate runtime pytest.skip() to typed skip_bug()** — 2 sites, 2 files
7. **test: update comments referencing pytest.skip() to skip()** — 7 comments, 5 files
8. **test/pylib: reject bare pytest.mark.skip and add codebase guards** — plugin hardening + 3 guard tests
- All 60 plugin + guard tests pass (`test/pylib_test/`)
- No bare `@pytest.mark.skip` or `pytest.skip()` calls remain in the codebase
- `pytest --collect-only` succeeds across all test directories with the hardened plugin
SCYLLADB-1349
Closes scylladb/scylladb#29305
* github.com:scylladb/scylladb:
test/alternator: replace bare pytest.skip() with typed skip helpers
test: migrate new bare skips introduced by upstream after rebase
test/pylib: reject bare pytest.mark.skip and add codebase guards
test: update comments referencing pytest.skip() to skip_env()
test: migrate runtime pytest.skip() to typed skip_bug()
test: migrate runtime pytest.skip() to typed skip_env()
test: migrate bare @pytest.mark.skip to skip_not_implemented
test: migrate @pytest.mark.skip to @pytest.mark.skip_slow
test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented
test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs
Add pylib_test to norecursedirs in pytest.ini so it is not collected
during ./test.py or pytest test/ runs, but can still be run directly
via 'pytest test/pylib_test'.
Also fix pytest log cleanup: worker log files (pytest_gw*) were not
being deleted on success because cleanup was restricted to the main
process only. Now each process (main and workers) cleans up its own
log file on success.
Closes scylladb/scylladb#29551
Move S3/GCS server classes (S3Server, MinioWrapper, GSFront, GSServer),
factory functions (create_s3_server, create_gs_server), CQL helpers
(format_tuples, keyspace_options), bucket naming (_make_bucket_name),
and the s3_server fixture from test/cluster/object_store/conftest.py
into a shared module at test/pylib/object_storage.py.
The conftest.py is now a thin wrapper that re-exports symbols and
defines only the fixtures specific to the object_store suite
(object_storage, s3_storage). All external importers are updated.
Old class names (S3_Server, GSServer) are kept as aliases for
backward compatibility.
Add s3_test_fixture, an RAII class that creates a unique S3 bucket
on construction and tears down everything (delete all objects, delete
bucket, close client) on destruction. Bucket names are derived from
the Boost test name, pid, and a counter to guarantee uniqueness
across concurrent test processes. Names are sanitized to comply with
S3 bucket naming rules (lowercase, hyphens, 3-63 chars).
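The naming scheme can be sketched in Python for illustration. The fixture itself is C++ and its exact algorithm is not shown here, so the function name, the `"test"` fallback prefix, and the truncation strategy below are all assumptions; the sketch only demonstrates one way to satisfy the stated rules (lowercase, hyphens, 3-63 characters) while keeping the pid and counter suffix intact.

```python
import re

MAX_BUCKET_LEN = 63  # upper bound from the S3 bucket naming rules


def sketch_bucket_name(test_name: str, pid: int, counter: int) -> str:
    """Build a bucket name from test name, pid, and counter (hypothetical helper)."""
    # The suffix carries the uniqueness, so it is never truncated.
    suffix = f"-{pid}-{counter}"
    # Lowercase, then collapse every run of disallowed characters into one hyphen.
    prefix = re.sub(r"[^a-z0-9]+", "-", test_name.lower()).strip("-")
    # Trim the prefix so that prefix + suffix fits into 63 characters.
    prefix = prefix[: MAX_BUCKET_LEN - len(suffix)].rstrip("-")
    # Fall back to a fixed prefix if nothing usable is left (assumption).
    return (prefix or "test") + suffix


print(sketch_bucket_name("S3_Upload/Test::case", 1234, 0))
```

Truncating only the test-name part, rather than the whole string, keeps names from two concurrent processes distinct even when the test name is very long.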
Migrate all S3 tests that create objects to use the fixture, removing
manual bucket name construction, deferred_delete_object cleanup, and
per-test deferred_close calls. The fixture owns the client lifecycle.
Tests with special semaphore requirements (broken semaphore for
fallback test, small semaphore for abort test, 1MiB for memory
test) create the fixture with a separate normal-sized semaphore and
use their own constrained client for the test operation.
The upload_file tests are converted from SEASTAR_TEST_CASE
(coroutine) to SEASTAR_THREAD_TEST_CASE since the fixture requires
thread context for .get() calls.
Broaden the minio policy to allow the test user to create and delete
arbitrary buckets (s3:CreateBucket, s3:DeleteBucket, s3:ListAllMyBuckets
on arn:aws:s3:::*), and operate on objects in any bucket.
Using cluster.shutdown() is an incorrect way to shut down a Cassandra Cluster.
The correct way is using safe_driver_shutdown.
Fixes SCYLLADB-1434
Closes scylladb/scylladb#29390
Harden the skip_reason_plugin to reject bare @pytest.mark.skip at
collection time with pytest.UsageError instead of warnings.warn().
Add test/pylib_test/test_no_bare_skips.py with three guard tests:
- AST scan for bare @pytest.mark.skip decorators
- AST scan for bare pytest.skip() runtime calls
- Real pytest --collect-only against all Python test directories
Migrate runtime pytest.skip() calls across 34 files to use the typed
skip_env() wrapper from test.pylib.skip_types.
These sites skip at runtime because a required feature, config option,
library version, build mode, or runtime topology is not available.
Also fixes 'raise pytest.skip(...)' in test_audit.py — skip_env()
already raises internally, so the explicit raise was incorrect.
Each file gains one new import:
from test.pylib.skip_types import skip_env
Container names were generated as {name}-{pid}-{counter}, where the
counter is a per-process itertools.count. This scheme breaks across CI
runs on the same host: if a prior job was killed abruptly (SIGKILL,
cancellation) its containers are left running since --rm only removes
containers on exit. A subsequent run whose worker inherits the same PID
(common in containerized CI with small PID namespaces) and reaches the
same counter value will collide with the orphaned container.
Replace pid+counter with uuid.uuid4(), which generates a random UUID,
making names unique across processes, hosts, and time without any shared
state or leaking host identifiers.
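A minimal sketch of the new naming scheme, assuming a helper of roughly this shape (the function name and the `"minio"` argument are illustrative, not taken from the patch):

```python
import uuid


def make_container_name(name: str) -> str:
    """Return a container name that needs no shared state to be unique."""
    # uuid4() is random, so a restarted worker that reuses an old PID
    # still gets a name that differs from any orphaned container.
    return f"{name}-{uuid.uuid4()}"


first = make_container_name("minio")
second = make_container_name("minio")
print(first != second)
```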
Fixes: SCYLLADB-1540
Closes scylladb/scylladb#29509
Write the MinIO server log directly to tempdir_base (testlog/<arch>/)
instead of the per-server temp directory that gets destroyed on
shutdown. This preserves the log for Jenkins artifact collection,
helping debug S3-related flaky test failures like the
stcs_reshape_overlapping_s3_test hang (SCYLLADB-1481).
Closes scylladb/scylladb#29458
With the latest changes, a lot of the code in test.py is redundant. This PR removes it.
It also restricts dynamic fixture scope to test/alternator and test/cqlpy. All other suites default to module scope.
test.py will mostly be a wrapper around pytest for CI use. For now, test.py still has the important job of calculating the number of threads to start pytest with, which cannot be done in pytest itself.
No backport needed, framework enhancement only.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-666
Closes scylladb/scylladb#28852
* github.com:scylladb/scylladb:
test.py: remove testpy_test_fixture_scope
test.py: add logger for 3rd party service
test.py: delete dead code in test.py
With the migration to pytest this fixture is useless. Remove it and set
the scope to module for most of the tests.
Add dynamic_scope function to support running alternator fixtures in
session scope while Test and TestSuite still exist. This is for the
migration period; later on this function should be deleted.
With environment preparation and the startup of 3rd party services
migrated to pytest, these services write their logs to the terminal.
This PR gives each of them its own log file to avoid polluting the
terminal.
With the latest changes, a lot of the code in test.py is redundant.
This PR removes it.
Changes in other files are related to this cleanup, especially the
redundant --test-py-init parameter and moving prepare_environment to
pytest itself.
The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism.
Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries.
Fixes SCYLLADB-1542
This is a CI stability issue and should be backported.
Closes scylladb/scylladb#29504
* github.com:scylladb/scylladb:
test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
test: fix proc_utils.cc formatting from previous commit
test: lib: use unique container name per retry attempt
test: lib: fix broken retry in start_docker_service
1. test.py — Removed --log-level=DEBUG flag from pytest args
2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments
+minor fix
[test/pylib: save logs on success only during teardown phase](0ede308a04)
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown) in 3 files. Restrict it to
only the teardown phase, which contains all 3 in case of test success,
to avoid redundant log entries.
Closes scylladb/scylladb#29086
* github.com:scylladb/scylladb:
test/pylib: save logs on success only during teardown phase
test: Lower default log level from DEBUG to INFO
Tests that call grep_for_errors() directly and assert no errors
can fail spuriously due to benign RPC errors during graceful
shutdown (e.g. "connection dropped: Semaphore broken"), which
are already filtered by the after_test hook via filter_errors().
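The idea can be sketched as follows. This is not the real `filter_errors()` from the test library; the function name, its signature, the pattern list, and the sample log lines are hypothetical. The sketch assumes that each error group is a list of log lines (an error line plus its context) and that a group is dropped when any of its lines matches a benign pattern.

```python
import re

# Pattern built from the benign message quoted above; the real list may differ.
BENIGN_PATTERNS = [re.compile(r"connection dropped: Semaphore broken")]


def drop_benign_groups(groups: list[list[str]],
                       benign: list[re.Pattern] = BENIGN_PATTERNS) -> list[list[str]]:
    """Keep only the error groups in which no line matches a benign pattern."""
    return [
        group for group in groups
        if not any(p.search(line) for p in benign for line in group)
    ]


# Made-up sample data, not real Scylla log output.
sample = [
    ["ERROR rpc - connection dropped: Semaphore broken", "  context line"],
    ["ERROR storage - disk failure"],
]
remaining = drop_benign_groups(sample)
print(remaining)
```

A test that greps the logs itself would then assert on the filtered result instead of the raw grep output.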
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464
Backport: no, tests fix (we may decide to backport later if it occurs on release branches)
Closes scylladb/scylladb#29463
* github.com:scylladb/scylladb:
test: filter benign errors in tests that grep logs during shutdown
test: filter_errors: support list[list[str]] error groups