The estimate() function in the size_estimates virtual reader only
considered sstables local to the shard that happened to own the
keyspace's partition key token. Since sstables are distributed across
shards, this caused partition count estimates to be approximately
1/smp_count of the actual value.
This bug has been present since the virtual reader was introduced in
225648780d.
Use db.container().map_reduce0() to aggregate sstable estimates
across all shards. Each shard contributes its local count and
estimated_histogram, which are then merged to produce the correct
total.
Also fix the `test_partitions_estimate_full_overlap` test which becomes
flaky (xpassing ~1% of runs) because autocompaction could merge the
two overlapping sstables before the size estimate was read. Wrap the
test body in nodetool.no_autocompaction_context to prevent this race.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1179
Refs https://github.com/scylladb/scylladb/issues/9083Closesscylladb/scylladb#29286
Python warns that the sequence "\(" is an invalid escape and
might be rejected in the future. Protect against that by using
a raw string.
Closesscylladb/scylladb#29334
Convert auth_test.cc to coroutines for improved readability. Each test is converted in its own commit. Some
are trivial.
Indentation is left broken in some commits to reduce the diff, then fixed up in the last commit.
Code cleanup, so no backport.
Closesscylladb/scylladb#29336
* github.com:scylladb/scylladb:
auth_test: fix whitespace
auth_test: coroutinize test_try_describe_schema_with_internals_and_passwords_as_anonymous_user
auth_test: coroutinize test_try_login_after_creating_roles_with_hashed_password
auth_test: coroutinize test_create_roles_with_hashed_password_and_log_in
auth_test: coroutinize test_try_create_role_with_hashed_password_as_anonymous_user
auth_test: coroutinize test_try_to_create_role_with_password_and_hashed_password
auth_test: coroutinize test_try_to_create_role_with_hashed_password_and_password
auth_test: coroutinize test_alter_with_workload_type
auth_test: coroutinize test_alter_with_timeouts
auth_test: coroutinize role_permissions_table_is_protected
auth_test: coroutinize role_members_table_is_protected
auth_test: coroutinize roles_table_is_protected
auth_test: coroutinize test_password_authenticator_operations
auth_test: coroutinize test_password_authenticator_attributes
auth_test: coroutinize test_default_authenticator
When an SSTable was encrypted with a KMS host that is not present in
scylla.yaml, the error thrown was:
std::invalid_argument (No such host: <host-name>)
This message is very obscure in general, and especially confusing when
encountered while using the scylla-sstable tool: it gives no indication
that the SSTable is encrypted, that a KMS host lookup is involved, or
what the user needs to do to fix the problem.
Replace it with a message that names the missing host and points
directly to the relevant scylla.yaml section:
Encryption host "<host-name>" is not defined in scylla.yaml.
Make sure it is listed under the "kmip_hosts" section.
The wording is intentionally kept neutral (not framed as an SSTable tool
problem) because the same code path is exercised by production ScyllaDB
when a node's configuration no longer contains a host referenced by an
existing data file (e.g. after a config rollback or when restoring data
from a different cluster). The production use-case takes precedence, but
the message is equally actionable from the tool.
Closesscylladb/scylladb#29228
This series updates the storage abstraction and extends the compaction tests to support object‑storage backends (S3 and GCS), while tightening several parts of the test environment.
The changes include:
- New exists/object_exists helpers across storage backends and clock fixes in the S3 client to make signature generation stable under test conditions.
- A new get_storage_for_tests accessor and adjustments to the test environment to avoid premature teardown of the sstable registry.
- Refactoring of compaction tests to remove direct sstable access, ensure proper schema setup, and avoid use of moved‑from objects.
- Extraction of test_env‑based logic into reusable functions and addition of S3/GCS variants of the compaction tests.
Not all tests were converted to be backend‑agnostic yet, and a few require further investigation before they can run cleanly against S3/GCS backends. These will be addressed in follow‑up work.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-704 however, followup is needed
No backport needed since this change targeting future feature
Closesscylladb/scylladb#28790
* github.com:scylladb/scylladb:
compaction_test: fix formatting after previous patches
compaction_test: add S3/GCS variations to tests
compaction_test: extract test_env-based tests into functions
compaction_test: replace file_exists with storage::exists
compaction_test: initialize tables with schema via make_table_for_tests
compaction_test: use sstable APIs to manipulate component files
compaction_test: fix use-after-move issue
sstable_utils: add `get_storage` and `open_file` helpers
test_env: delay unplugging sstable registry
storage: add `exists` method to storage abstraction
s3_client: use lowres_system_clock for aws_sigv4
s3_client: add `object_exists` helper
gcs_client: add `object_exists` helper
The fix for SCYLLADB-1373 (b4f652b7c1) changed get_session() to use
the default timeout=30 for the retry loop in patient_*_cql_connection
(previously timeout=0.1). This correctly allowed retrying transient
NoHostAvailable errors during node startup, but introduced a new
flakiness in test_login and other auth tests.
The failure chain:
1. test_login connects with bad credentials (e.g. user="doesntexist")
2. get_session() calls patient_exclusive_cql_connection(), which calls
retry_till_success() with bypassed_exception=NoHostAvailable
3. The first attempt correctly fails: the server rejects the credentials
with AuthenticationFailed, wrapped in NoHostAvailable
4. retry_till_success() catches NoHostAvailable indiscriminately and
retries, not distinguishing between transient errors (node not ready)
and permanent errors (bad credentials)
5. A subsequent retry attempt times out (connect_timeout=5), producing
OperationTimedOut wrapped in NoHostAvailable
6. After 30 seconds, the last NoHostAvailable is raised -- now wrapping
OperationTimedOut instead of the original AuthenticationFailed
7. The assertion `isinstance(..., AuthenticationFailed)` fails
With the old timeout=0.1, the deadline was already exceeded after the
first attempt, so the original AuthenticationFailed propagated.
Fix: Add a `should_retry` predicate parameter to retry_till_success()
and use it in patient_cql_connection() and
patient_exclusive_cql_connection() to immediately re-raise
NoHostAvailable when it wraps AuthenticationFailed. Retrying
authentication failures is never useful since the credentials won't
change between attempts.
Fixes: SCYLLADB-1382
Closesscylladb/scylladb#29348
get_session() was passing timeout=0.1 to patient_exclusive_cql_connection
and patient_cql_connection, leaving only 0.1 seconds for the retry loop
in retry_till_success(). Since each connection attempt can take up to 5
seconds (connect_timeout=5), the retry loop effectively got only one
attempt with no chance to retry on transient NoHostAvailable errors.
Use the default timeout=30 seconds, consistent with all other callers.
Fixes: SCYLLADB-1373
Closesscylladb/scylladb#29332
Flatten continuation chains (.then()) into linear thread-style code
with .get() calls for improved readability. Remove the now-unused
require_throws helper template.
Replace direct filesystem checks (file_exists) with the
storage-agnostic exists() method in unsealed_sstable_compaction,
sstable_clone_leaving_unsealed_dest_sstable, and
failure_when_adding_new_sstable tests, making them compatible
with object-storage backends (S3, GCS).
Start using `table_for_tests::make_default_schema` so test tables are
created with a real schema. This is required for object-storage
backends, which cannot operate correctly without proper schema
initialization.
Switch tests to use sstable member functions for file manipulation
instead of opening files directly on the filesystem. This affects the
helpers that emulate sstable corruption: we now overwrite the entire
component file rather than just the first few kilobytes, which is
sufficient for producing a corrupted sstable.
Add a non-const `get_storage` accessor to expose underlying storage,
and an `open_file` helper to access sstable component files directly.
These are needed so compaction tests can read and write sstable
components.
Unplugging the mock sstable_registry happened too early in the test
environment. During sstable destruction, components may still need
access to the registry, so the unplugging is moved to a later stage.
When test_exception_safety_of_update_from_memtable was converted from
manual fail_after()/catch to with_allocation_failures() in 74db08165d,
the populate_range() call ended up inside the failure injection scope
without a scoped_critical_alloc_section guard. The other two tests
converted in the same commit (test_exception_safety_of_transitioning...
and test_exception_safety_of_partition_scan) were correctly guarded.
Without the guard, the allocation failure injector can sometimes
target an allocation point inside the cleanup path of populate_range().
In a rare corner case, this triggers a bad_alloc in a noexcept context
(reader_concurrency_semaphore::stop()), causing std::terminate.
Fixes SCYLLADB-1346
Closesscylladb/scylladb#29321
Verify that upgrading from 2025.1 to master does not silently drop DDL
auditing for table-scoped audit configurations (SCYLLADB-1155).
Test time in dev: 4s
Refs: SCYLLADB-1155
Fixes: SCYLLADB-1305
The old execute_and_validate_audit_entry required every caller to
pass audit_settings so it could decide internally whether to expect
an entry. A test added later in this series needs to simply assert
an entry was produced, without specifying audit_settings at all.
Split into two methods:
- execute_and_validate_new_audit_entry: unconditionally expects an
audit entry.
- execute_and_validate_if_category_enabled: checks audit_settings
to decide whether to expect an entry or assert absence.
Local wrapper functions and **kwargs forwarding are removed in favor
of explicit arguments at each call site, and expected-error cases are
handled inline with assert_invalid + assert_entries_were_added.
AuditTester uses self.manager throughout but never declares it.
The attribute is only assigned in the CQLAuditTester subclass
__init__, so the type checker reports 'Attribute "manager" is
unknown' on every self.manager reference in the base class.
Add an __init__ to AuditTester that accepts and stores the manager
instance, and update CQLAuditTester to forward it via super().__init__
instead of assigning self.manager directly.
Fixes: SCYLLADB-1106
* Small fix in scylla_cluster - remove debug print
* Fix GSServer::unpublish so it does not except if publish was not called beforehand
* Improve dockerized_server so mock server logs echo to the test log to help diagnose CI failures (because we don't collect log files from mocks etc, and in any case correlation will be much easier).
No backport needed.
Closesscylladb/scylladb#29112
* github.com:scylladb/scylladb:
dockerized_service: Convert log reader to pipes and push to test log
test::cluster::conftest::GSServer: Fix unpublish for when publish was not called
scylla_cluster: Use thread safe future signalling
scylla_cluster: Remove left-over debug printout
The test was failing because the call to:
await log.wait_for('Stopping.*ongoing compactions')
was missing the 'from_mark=log_mark' argument. The log mark was updated
(line: log_mark = await log.mark()) immediately after detecting
'splitting_mutation_writer_switch_wait: waiting', and just before
launching the shutdown task. However, the wait_for call on the following
line was scanning from the beginning of the log, not from that mark.
As a result, the search immediately matched old 'Stopping N tasks for N
ongoing compactions for table system.X due to table removal' messages
emitted during initial server bootstrap (for system.large_partitions,
system.large_rows, system.large_cells), rather than waiting for the
shutdown to actually stop the user-table split compaction.
This caused the test to prematurely send the message to the
'splitting_mutation_writer_switch_wait' injection. The split compaction
was unblocked before the shutdown had aborted it, so it completed
successfully. Since the split succeeded, 'Failed to complete splitting
of table' was never logged.
Meanwhile, 'storage_service_drain_wait' was blocking do_drain() waiting
for a message. With the split already done, the test was stuck waiting
for the expected failure log that would never come (600s timeout). At
the same time, after 60s the 'storage_service_drain_wait' injection
timed out internally, triggering on_internal_error() which -- with
--abort-on-internal-error=1 -- crashed the server (exit code -6).
Fix: pass from_mark=log_mark to the wait_for('Stopping.*ongoing
compactions') call so it only matches messages that appear after the
shutdown has started, ensuring the test correctly synchronizes with the
shutdown aborting the user-table split compaction before releasing the
injection.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1319.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29311
Replace the random port selection with an OS-assigned port. We open
a temporary TCP socket, bind it to (ip, 0) with SO_REUSEADDR, read back
the port number the OS selected, then close the socket before launching
rest_api_mock.py.
Add reuse_address=True and reuse_port=True to TCPSite in rest_api_mock.py
so the server itself can also reclaim a TIME_WAIT port if needed.
Fixes: SCYLLADB-1275
Closesscylladb/scylladb#29314
On slow/overloaded CI machines the lowres_clock timer may not have
fired after the fixed 2x sleep, causing the assertion on
get_abort_exception() to fail. Replace the fixed sleep with
sleep(1x) + eventually_true() which retries with exponential backoff,
matching the pattern already used in test_time_based_cache_eviction.
Fixes: SCYLLADB-1311
Closesscylladb/scylladb#29299
make sure the driver is stopped even though cluster
teardown throws and avoid potential stale driver
connections entering infinite reconnect loops which
exhaust cpu resources.
Fixes: SCYLLADB-1189
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#29230
Debug mode shuffles task position in the queue. So the following is possible:
1) shard 1 calls manual_clock::advance(). This expires timers on shard 1 and queues a background smp call to shard 0 which will expire timers there
2) the smp::submit_to(0, ...) from shard 1 called by the test sumbits the call
3) shard 0 creates tasks for both calls, but (2) is run first, and preempts the reactor
4) shard 1 sees the completion, completes m_svc.invoke_on(1, ..)
5) shard 0 inserts the completion from (4) before task from (1)
6) the check on shard 0: m.find(id1) fails because the timer is not expired yet
To fix that, wait for timer expiration on shard 0, so that the test
doesn't depend on task execution order.
Note: I was not able to reproduce the problem locally using test.py --mode
debug --repeat 1000.
It happens in jenkins very rarely. Which is expected as the scenario which
leads to this is quite unlikely.
Fixes SCYLLADB-1265
Closesscylladb/scylladb#29290
The test exercises all five node operations (bootstrap, replace, rebuild,
removenode, decommission) and by the end only one node out of four
remains alive. The CQL driver session, however, still holds stale
references to the dead hosts in its connection pool and load-balancing
policy state.
When the new_test_keyspace context manager exits and attempts
DROP KEYSPACE, the driver routes the query to the dead hosts first,
gets ConnectionShutdown from each, and throws NoHostAvailable before
ever trying the single live node.
Fix by calling driver_connect() after the decommission step, which
closes the old session and creates a fresh one connected only to the
servers the test manager reports as running.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1313.
Closesscylladb/scylladb#29306
Sends a search via the raw LDAP handle (bypassing _msgid_to_promise
registration), then triggers poll_results() through the public API
to exercise the unregistered-ID branch.
Refs: SCYLLADB-1344
The multishard_query_test/fuzzy_test was timing out (SIGKILL after
15 minutes) in release mode CI.
In release mode the test generates up to 64 partitions with up to
1000 clustering rows and 1000 range tombstones each. With deeply
nested randomly-generated types (e.g. frozen<map<varint,
frozen<map<frozen<tuple<...>>>>>>), this volume of data can exceed
the 15-minute CI timeout.
Reduce the release-mode clustering-row and range-tombstone
distributions from 0-1000 to 0-200. This caps the worst case at
~12,800 rows -- still 2x the devel-mode maximum (0-100) and
sufficient to exercise multi-partition paged scanning with many
pages.
Fixes: SCYLLADB-1270
No need to backport for now, only appeared on master.
Closesscylladb/scylladb#29293
* github.com:scylladb/scylladb:
test: clean up fuzzy_test_config and add comments
test: fix fuzzy_test timeout in release mode
get_audit_partitions_for_operation() returns None when no audit log
rows are found. In _test_insert_failure_doesnt_report_success_assign_nodes,
this None is passed to set(), causing TypeError: 'NoneType' object is
not iterable.
The audit log entry may not yet be visible immediately after executing
the INSERT, so use wait_for() from test.pylib.util with exponential
backoff to poll until the entry appears. Import it as wait_for_async
to avoid shadowing the existing wait_for from test.cluster.dtest.dtest_class,
which has a different signature (timeout vs deadline).
Fixes SCYLLADB-1330
Closesscylladb/scylladb#29289