Commit Graph

9 Commits

Author SHA1 Message Date
Łukasz Paszkowski
e5bd2f8679 test/storage: harden out-of-space prevention tests around restart and disk-utilization transitions
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:

1. After creating/removing the blob file that simulates disk pressure,
   the tests immediately checked derived state (e.g., "compaction_manager
   - Drained") without first confirming the disk space monitor had detected
   the utilization change. Fix: explicitly wait for "Reached/Dropped below
   critical disk utilization level" right after creating/removing the blob
   file, before checking downstream effects.

2. Several tests called `manager.driver_connect()` or omitted reconnection
   entirely after `server_restart()` / `server_start()`. The pre-existing
   driver session can silently reconnect multiple times, causing subsequent
   CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
   Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
   to ensure all connection pools are established.

3. Some log assertions used marks captured before a restart, so they could
   match pre-restart messages or miss messages emitted in the correct post-restart
   window. Fix: refresh marks at the right points.

Apart from that, the patch fixes a typo: autotoogle -> autotoggle.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655

Closes scylladb/scylladb#28626

(cherry picked from commit 826fd5d6c3)

Closes scylladb/scylladb#28967

Closes scylladb/scylladb#29197
2026-04-03 22:23:17 +03:00
Łukasz Paszkowski
960c952eb5 storage/test_out_of_space_prevention.py: Fix async/await bugs
- Add missing await keywords for async operations on s2_log.wait_for()
  and coord_log.wait_for()
- Fix incorrect regex: "compaction .* Split {cf}" → "compaction.*Split {cf}"
- The commit https://github.com/scylladb/scylladb/commit/f7324a4 demoted
  compaction start/end log messages to debug level. Hence add
  compaction=debug log messages to the following tests:
    test_split_compaction_not_triggered
    test_node_restart_while_tablet_split
    test_repair_failure_on_split_rejection

Fixes https://github.com/scylladb/scylladb/issues/27931

Closes scylladb/scylladb#27932

(cherry picked from commit 76b84b71d1)

Closes scylladb/scylladb#27949
2026-01-31 20:59:03 +02:00
Łukasz Paszkowski
c04007c755 database: Log message after critical_disk_utilization mode is set
This is a follow-up of the previous fix: https://github.com/scylladb/scylladb/pull/26030

The test test_user_writes_rejection starts a 3-node cluster and
creates a large file on one of the nodes, to trigger the out-of-space
prevention mechanism, which should reject writes on that node.

It waits for the log message 'Setting critical disk utilization mode: true'
and then executes a write expecting the node to reject it.

Currently, the message is logged before the `_critical_disk_utilization`
variable is actually updated. This causes the test to fail sporadically
if it runs quickly enough.

The fix splits the logging into two steps:
1. "Asked to set critical disk utilization mode" - logged before any action
2) "Set critical disk utilization mode" - logged after `_critical_disk_utilization` has been updated

The tests are updated to wait for the second message.

Fixes https://github.com/scylladb/scylladb/issues/26004

Closes scylladb/scylladb#26392

(cherry picked from commit 7ec369b900)

Closes scylladb/scylladb#26626
2026-01-19 06:42:01 +02:00
Łukasz Paszkowski
d70c049e07 test_user_writes_rejection: Fix test flakiness caused by typo and non-local CL=ONE reads
The current code:
```
try:
   cql.execute(f"INSERT INTO {cf} (pk, t) VALUES (-1, 'x')", host=host[0], execution_profile=cl_one_profile).result()
except Exception:
   pass
```

contains a typo: `host=host[0]` which throws an exception becase Host
object is not subscriptable. The test does not fail because the except
block is too broad and suppresses all exceptions.

Fixing the typo alone is insufficient. The write still succeeds because
the remaining nodes are UP and the query uses CL=ONE, so no failure
should be expected.

Another source of flakiness is data verification:
```
SELECT * FROM {cf} WHERE pk = 0;
```

Even when a coordinator is explicitly provided, using CL=ONE does not
guarantee a local read. The coordinator may forward the read request to
another replica, causing the verification to fail nondeterministically.

This patch rewrites the tests to address these issues:
- Fix the typo: `host[0]` to `hosts[0]`
- Verify data using `MUTATION_FRAGMENTS({cf})` which guarantees a local
  read on the coordinator node
- Reconnect the driver after node restart

Fixes https://github.com/scylladb/scylladb/issues/27933

Closes scylladb/scylladb#27934

(cherry picked from commit 7bf26ece4d)

Closes scylladb/scylladb#28094
2026-01-12 14:15:10 +01:00
Łukasz Paszkowski
566dbd0b19 test_user_writes_rejection: Disable speculative retries
This test starts a 3-node cluster and creates a large blob file so that one
node reaches critical disk utilization, triggering write rejections on that
node. The test then writes data with CL=QUORUM and validates that the data:
- did not reach the critically utilized node
- did reach the remaining two nodes

By default, tables use speculative retries to determine when coordinators may
query additional replicas.

Since the validation uses CL=ONE, it is possible that an additional request
is sent to satisfy the consistency level. As a result:
- the first check may fail if the additional request is sent to a node that
  already contains data, making it appear as if data reached the critically
  utilized node
- the second check may fail if the additional request is sent to the critically
  utilized node, making it appear as if data did not reach the healthy node

The patch fixes the flakiness by disabling the speculative retries.

Fixes https://github.com/scylladb/scylladb/issues/27212

Closes scylladb/scylladb#27488

(cherry picked from commit 2cb9bb8f3a)

Closes scylladb/scylladb#27773
2025-12-21 17:48:32 +02:00
Łukasz Paszkowski
62e27e0f77 test_out_of_space_prevention.py: Fix flaky test_node_restart_while_tablet_split test
The test starts a 3-node cluster and immediately creates a big file
on the first nodes in order to trigger the out of space prevention to
disable compaction, including the SPLIT compaction.

In order to trigger a SPLIT compaction, a keyspace with 1 initial tablet
is created followed by alter statement with `tablets = {'min_tablet_count': 2}`.
This triggers a resize decision that should not finalize due to
disabled compaction on the first node.

The test is flaky because, the keyspace is created with RF=1 and there
is no guarantee that the tablet replica will be located on the first node
with critical disk utilization. If that is not the case, the split
is finalized and the test fails, because it expect the split to be
blocked.

Change to RF=3. This ensures there is exactly one tablet replica on
each node, including the one with critical disk utilization. So SPLIT
is blocked until the disk utilization on the first node, drops below
the critical level.

Fixes: https://github.com/scylladb/scylladb/issues/25861

Closes scylladb/scylladb#26225
2025-09-25 11:54:48 +03:00
Łukasz Paszkowski
5f6df4eb97 test/storage: Properly mount/clear volumes
Due to a missing functionality in PythonTest, `unshare` is never used
to mount volumes. As a consequence:
+ volumes are created with sudo which is undesired
+ they are not cleared automatically

Even having the missing support in place, the approach with mounting
volumes with `unshare` would not work as http server, a pool of clusters,
and scylla cluster manager are started outside of the new namespace.
Thus cluster would have no access to volumes created with `unshare`.

The new approach that works with and without dbuild and does not require
sudo, uses the following three commands to mount a volume:

truncate -s 100M /tmp/mydevice.img
mkfs.ext4 /tmp/mydevice.img
fuse2fs /tmp/mydevice.img test/

Additionally, a proper cleanup is performed, i.e. servers are stopped
gracefully and and volumes are unmounted after the tests using them are
completed.

Fixes: https://github.com/scylladb/scylladb/issues/25906

Closes scylladb/scylladb#26065
2025-09-25 11:05:50 +03:00
Łukasz Paszkowski
29de947851 test_out_of_space_prevention.py: Fix flaky test_user_writes_rejection test
The test starts a 3-node cluster and immediately creates a big file
on one of the nodes, to trigger the out of space prevention to start
rejecting writes on this node. Then a write is executed and checked it
did not reach the node with critical disk utilization but reached
the remaining nodes (it should, RF=3 is set)

However, when not specified, a default LOCAL_ONE consistency level
is used. This means that only one node is required to acknowledge the
write.

After the write, the test checks if the write
+ did NOT reach the node with critical disk utilization (works)
+ did reach the remaning nodes

This can cause the test to fail sporadically as the write might not
yet be on the last node.

Use CL=QUORUM instead.

Fixes: https://github.com/scylladb/scylladb/issues/26004

Closes scylladb/scylladb#26030
2025-09-25 08:05:45 +03:00
Łukasz Paszkowski
e34deea50e tests/cluster: Add new storage tests
The storage submodule contains tests that require mounted volumes
to be executed. The volumes are created automatically with the
`volumes_factory` fixture.

The tests in this suite are executed with the custom launcher
`unshare -mr pytest`

Test scenarios (when one node reaches critical disk utilization):
1. Reject user table writes
2. Disable/Enabled compaction
3. Reject split compactions
4. New split compactions not triggered
5. Abort tablet repair
6. Disable/Enabled incoming tablet migrations
7. Restart a node while a tablet split is triggered
2025-08-29 14:56:13 +02:00