Python tests require different handling of metrics gathering from
cgroups than C++ tests. pytest does not execute each Python test in
a separate process, so we can't put a single test into a cgroup and
get its metrics.
The idea is to put the whole pytest process into the cgroup and get the
metrics. This works because pytest runs the threads as completely
separate processes, and inside each thread it runs tests sequentially.
Additionally, to simplify things, the system resource monitor is moved
to the pytest main thread.
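A minimal sketch of the idea, assuming a cgroup v2 hierarchy. The helper
names are hypothetical and not taken from the patch; `cgroup.procs`,
`cpu.stat` and `memory.peak` are standard cgroup v2 interface files, with
`memory.peak` treated as optional since older kernels lack it:

```python
from pathlib import Path


def add_process_to_cgroup(cgroup_dir: Path, pid: int) -> None:
    """Move a process into a cgroup v2 by writing its PID to cgroup.procs."""
    (cgroup_dir / "cgroup.procs").write_text(f"{pid}\n")


def read_cgroup_metrics(cgroup_dir: Path) -> dict:
    """Collect CPU time and peak memory accounted to the whole cgroup."""
    metrics = {}
    # cpu.stat holds "key value" pairs, e.g. "usage_usec 12345".
    for line in (cgroup_dir / "cpu.stat").read_text().splitlines():
        key, _, value = line.partition(" ")
        if key in ("usage_usec", "user_usec", "system_usec"):
            metrics[key] = int(value)
    # memory.peak is missing on older kernels, so it is optional here.
    peak = cgroup_dir / "memory.peak"
    if peak.exists():
        metrics["memory_peak_bytes"] = int(peak.read_text().strip())
    return metrics
```

With this shape, the pytest process would be added once (for example with
`os.getpid()`) and the metrics read after its tests finish.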
After stopping scylla server processes, the FUSE daemon
(fuse2fs) may still be processing file handle closures.
An immediate fusermount3 -u can fail with 'device busy',
causing spurious test failures on teardown.
Retry the unmount up to 10 times with 0.5s delay between
attempts, and capture stderr for diagnostics.
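The retry described above can be sketched roughly as follows. The function
name and the injectable `run` parameter are assumptions made for
illustration, not the helper used in the test suite:

```python
import subprocess
import time


def unmount_with_retry(mount_point, attempts=10, delay=0.5, run=subprocess.run):
    """Try `fusermount3 -u` until it succeeds or the attempts run out."""
    last_stderr = ""
    for attempt in range(attempts):
        # Capture stderr so the final error explains why the unmount failed.
        result = run(["fusermount3", "-u", mount_point],
                     capture_output=True, text=True)
        if result.returncode == 0:
            return
        last_stderr = result.stderr.strip()
        if attempt < attempts - 1:
            time.sleep(delay)  # give the FUSE daemon time to close handles
    raise RuntimeError(
        f"failed to unmount {mount_point} after {attempts} attempts: {last_stderr}"
    )
```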
Fixes: SCYLLADB-2049
Closes scylladb/scylladb#29920
The stream_mutation_fragments RPC handler did not check
is_in_critical_disk_utilization_mode before accepting incoming mutation
fragments. This meant load-and-stream (nodetool refresh --load-and-stream)
could push data onto a node at critical disk utilization, potentially
filling the disk completely.
Add a critical disk utilization check in the get_next_mutation_fragment
lambda, throwing critical_disk_utilization_exception when the node is in
critical mode. This mirrors the existing protection in stream_blob.cc.
Also remove the xfail marker from the corresponding test added in the
previous commit.
Add `test_load_and_stream_rejected_on_critical_disk` which verifies
that `nodetool refresh --load-and-stream` is rejected when the target
node reaches critical disk utilization during streaming. The test is
marked xfail because the stream_mutation_fragments handler does not
yet check whether the node is in the critical disk utilization mode
(introduced in the next patch).
The test sets up a 3-node cluster, writes data and snapshots SSTables
on one node, wipes another node's data, and copies the snapshot to its
upload directory. It then starts load-and-stream and uses the
`write_components_writer_created` error injection to pause SSTable writing.
While paused, the test fills the disk past the critical threshold, then
releases the injection. The next streamed mutation fragment is rejected
with critical_disk_utilization_exception.
The test verifies that:
- The operation fails with the expected error.
- No data is persisted on the target node.
- Partial SSTable files created during streaming are deleted (via the
implicit mark-for-deletion mechanism in the SSTable lifecycle).
Python warns that the sequence "\(" is an invalid escape and
might be rejected in the future. Protect against that by using
a raw string.
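For illustration only (the pattern and input below are made up, not the
ones touched by the patch), a raw string passes the backslash through to
the regex engine unchanged:

```python
import re

# r"\(" is the two characters backslash + "(", same as the escaped "\\(".
pattern = re.compile(r"\((\d+)\)")

match = pattern.search("exit status (42)")
value = match.group(1)
```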
Closes scylladb/scylladb#29334
Since commit 509f2af8db, gate_closed_exception can be triggered for an ongoing split during shutdown. The commit is correct, but it causes a split failure on shutdown to be logged as an error, which causes CI instability. Previously, aborted_exception would be triggered instead, which is logged as a warning. Let's do the same.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.
Fixes https://github.com/scylladb/scylladb/issues/24850.
Only 2026.1 is affected.
Closes scylladb/scylladb#29032
* github.com:scylladb/scylladb:
replica: Demote log level on split failure during shutdown
service: Demote log level on split failure during shutdown
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to apply only to a selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the table before the checks for expected data, so that if the checks fail, the data in the table is known.
Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)
Backport: test related improvement, no backport
Closes scylladb/scylladb#28899
* github.com:scylladb/scylladb:
test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
replica/database: consolidate the two database_apply error injections
service/storage_proxy: add name of table to error message for write errors
Dtest failed with:
table - Failed to load SSTable .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db
of origin memtable due to std::runtime_error (Cannot split
.../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db because manager has compaction
disabled, reason might be out of space prevention), it will be unlinked...
The reason is that the error above is being triggered when the cause is
shutdown, not out of space prevention. Let's distinguish between the two
cases and log the error with warning level on shutdown.
Fixes https://github.com/scylladb/scylladb/issues/24850.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:
1. After creating/removing the blob file that simulates disk pressure,
the tests immediately checked derived state (e.g., "compaction_manager
- Drained") without first confirming the disk space monitor had detected
the utilization change. Fix: explicitly wait for "Reached/Dropped below
critical disk utilization level" right after creating/removing the blob
file, before checking downstream effects.
2. Several tests called `manager.driver_connect()` or omitted reconnection
entirely after `server_restart()` / `server_start()`. The pre-existing
driver session can silently reconnect multiple times, causing subsequent
CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
to ensure all connection pools are established.
3. Some log assertions used marks captured before a restart, so they could
match pre-restart messages or miss messages emitted in the correct post-restart
window. Fix: refresh marks at the right points.
Apart from that, the patch fixes a typo: autotoogle -> autotoggle.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655
Closes scylladb/scylladb#28626
Merge the database_apply and database_apply_wait error injections into a
single database_apply one. Add three parameters:
* ks_name and cf_name to filter the tables to be affected
* what - what to do: throw or wait
This leads to a smaller footprint in the code and improved filtering by
table name, at the cost of some extra error injection params in the
tests.
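As a rough Python model of the filtering described above (the real
injection lives in the C++ database code; the function name here is
hypothetical and only the parameter names come from this message):

```python
from typing import Optional


def database_apply_action(params: dict, ks_name: str, cf_name: str) -> Optional[str]:
    """Return "throw" or "wait" if the write targets the selected table."""
    # Writes to any other table (system tables, traces, ...) are untouched.
    if params.get("ks_name") != ks_name or params.get("cf_name") != cf_name:
        return None
    what = params.get("what")
    if what not in ("throw", "wait"):
        raise ValueError(f"unsupported 'what' parameter: {what!r}")
    return what
```

A test would then pass `ks_name`, `cf_name` and `what` when enabling the
injection, so only writes to its own table are affected.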
Move the storage test suite from test/storage/ to test/cluster/storage/
to consolidate related cluster-based tests. This removes the standalone
test/storage/suite.yaml, as the tests will use the cluster's test configuration.
Initially these tests were in cluster, but they were moved outside in the
first iteration in order to use unshare. Now that they handle volumes
another way, without unshare, they should be back in cluster.
Closes scylladb/scylladb#28634