In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.
This change:
1. Adds a check after locking the map entry: if a gossip ACK update
does not contain a host ID, we verify that an entry with that host ID
still exists in the gossiper’s _endpoint_state_map.
2. Removes xfail from the test_gossiper_race test since the issue is now
fixed.
3. Adds exception handling in `do_shadow_round` to skip responses from
nodes that sent an empty host ID.
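The check in item 1 follows a general pattern: anything observed before a suspension point has to be re-validated once the lock is held. Below is a minimal Python `asyncio` sketch of that pattern, not the actual C++ gossiper code. `EndpointMap`, `apply_state`, and the dictionary layout are hypothetical names used only for illustration.

```python
import asyncio


class EndpointMap:
    """Toy stand-in for an endpoint state map with per-entry locks."""

    def __init__(self):
        self.states = {}  # endpoint -> {"host_id": ..., ...}
        self.locks = {}   # endpoint -> asyncio.Lock

    def lock_for(self, endpoint):
        return self.locks.setdefault(endpoint, asyncio.Lock())


async def apply_state(emap, endpoint, update):
    """Apply `update` to `endpoint`; return False if the update is skipped."""
    async with emap.lock_for(endpoint):
        # Acquiring the lock may suspend this task, so the entry may have
        # been removed in the meantime. Re-check before touching the map.
        if update.get("host_id") is None and endpoint not in emap.states:
            return False  # would otherwise insert an entry with no host ID
        emap.states.setdefault(endpoint, {}).update(update)
        return True
```

An update that carries no host ID is applied only while the entry still exists; otherwise it is dropped instead of creating a half-initialized entry.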
This re-applies the commit 13392a40d4 that
was reverted in 46aa59fe49, after fixing
the issues that caused the CI to fail.
Fixes: scylladb/scylladb#25702
Fixes: scylladb/scylladb#25621
Ref: scylladb/scylla-enterprise#5613
(cherry picked from commit f08df7c9d7)
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.
There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.
This re-applies the commit 5dac4b38fb that
was reverted in dc44fca67c, but modified
to relax the check from `on_internal_error` to just a warning log. The
stricter check can be re-introduced later, once we are sure that all
remaining problems are resolved and that it will not break the CI.
Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721
(cherry picked from commit 28e0f42a83)
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.
This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common case).
2) `passwords::check` is moved to a dedicated alien thread.
Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it is not a good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.
Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce frequent lock contention, but in practice each shard handles only a few hundred new connections per second, even during storms. Moreover, `_conns_cpu_concurrency_semaphore` in `generic_server` already limits the number of concurrent connection handlers.
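The idea of moving a CPU-heavy hash off the event loop can be sketched in Python as follows. This is an analogy only: the real change uses a Seastar alien thread and `crypt_r`, whereas this sketch assumes a single-worker `ThreadPoolExecutor` (whose internal queue plays the role of the mutex) and uses PBKDF2 purely as a stand-in hash. `hash_password` and `check_password` are hypothetical names.

```python
import asyncio
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

# One shared worker thread, analogous to a single dedicated hashing thread.
_hash_worker = ThreadPoolExecutor(max_workers=1, thread_name_prefix="pw-hash")


def hash_password(password: str, salt: bytes) -> bytes:
    """CPU-heavy hash; PBKDF2 is only a stand-in for the real scheme."""
    return hashlib.pbkdf2_hmac("sha512", password.encode(), salt, 50_000)


async def check_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Run the expensive hash off the event loop, then compare digests."""
    loop = asyncio.get_running_loop()
    digest = await loop.run_in_executor(_hash_worker, hash_password, password, salt)
    return hmac.compare_digest(digest, expected)
```

While the worker thread computes the hash, the event loop stays free to serve other connections, which is the property the change is after.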
Fixes https://github.com/scylladb/scylladb/issues/24524
Backport not needed, as it is a new feature.
Closes scylladb/scylladb#24924
* github.com:scylladb/scylladb:
main: utils: add thread names to alien workers
auth: move passwords::check call to alien thread
test: wait for 3 clients with given username in test_service_level_api
auth: refactor password checking in password_authenticator
auth: make SHA-512 the only password hashing scheme for new passwords
auth: whitespace change in identify_best_supported_scheme()
auth: require scheme as parameter for `generate_salt`
auth: check password hashing scheme support on authenticator start
(cherry picked from commit c762425ea7)
Fixes #25683
Once a table drop is complete, there should be no reason to retain
truncation records for it: any replay should skip its mutations
anyway (no CF), and if we somehow resurrect a dropped table,
replay-resurrected data is the least of our problems.
Adds a prune phase to the startup drop_truncation_rp_records run,
which skips updating and instead deletes records for non-existent
tables (which should also clean up any existing servers with lingering
data).
Also does an explicit delete of records on actual table DROP, to
ensure we don't grow this table more than needed even in long
uptime nodes.
Small unit test included.
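The prune phase amounts to deleting every record whose table is no longer known. A minimal Python sketch of that idea, assuming the records are held in a plain dictionary keyed by table ID (the real records live in a system table, and `prune_truncation_records` is a hypothetical name):

```python
def prune_truncation_records(records: dict, live_tables: set) -> list:
    """Delete records whose table no longer exists; return the removed IDs."""
    stale = [table_id for table_id in records if table_id not in live_tables]
    for table_id in stale:
        del records[table_id]
    return stale
```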
Closes scylladb/scylladb#25699
(cherry picked from commit bc20861afb)
Closes scylladb/scylladb#25815
Consider the following scenario:
- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash
Fixes: #25706
This needs to be backported to all supported versions with tablets
- (cherry picked from commit a0934cf80d)
- (cherry picked from commit 1b8a44af75)
Parent PR: #25708
Closes scylladb/scylladb#25785
* github.com:scylladb/scylladb:
test: reproducer and test for drop with concurrent cleanup
truncate: check for closed storage group's gate in discard_sstables
Currently, run executes pytest twice without modifying the path of the
JUnit XML report. As a result, the second pytest execution
overwrites the report of the first. This PR fixes the issue so that both
reports are stored.
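One way to keep both reports is to derive a distinct report path for each execution. The sketch below only illustrates that idea and is not the actual `test.py` code. `junit_report_path`, `pytest_args`, and the run names are hypothetical; `--junit-xml` is pytest's option for the JUnit XML output path.

```python
from pathlib import Path


def junit_report_path(report_dir: str, run_name: str) -> Path:
    """Return a per-run report path so a second run cannot overwrite the first."""
    return Path(report_dir) / f"pytest_{run_name}.xml"


def pytest_args(report_dir: str, run_name: str, extra: list) -> list:
    """Build the pytest argument list for one of the executions."""
    return [f"--junit-xml={junit_report_path(report_dir, run_name)}", *extra]
```

Calling `pytest_args("out", "first", ["-q"])` and `pytest_args("out", "second", ["-q"])` yields two different report paths, so neither execution clobbers the other's report.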
Closes scylladb/scylladb#25726
(cherry picked from commit e55c8a9936)
Closes scylladb/scylladb#25778
This patch fixes an error-path bug in the base-64 decoding code in
utils/base64.cc, which among other things is used in Alternator to decode
blobs in JSON requests.
The base-64 decoding code has a lookup table, which was wrongly sized 255
bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF)
was included in an invalid base-64 string, instead of detecting that this
is an invalid byte (since the only valid bytes in a base-64 string are
A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a
nonsense 6-bit part, or even crash on an out-of-bounds read.
Besides the trivial fix, this patch also includes a reproducing test,
which tries to write a blob as a supposedly base-64 encoded string with
a 0xFF byte in it. The test fails before this patch (the write succeeds,
unexpectedly), and passes after this patch (the write fails as
expected). The test also passes on DynamoDB.
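The sizing issue can be illustrated with a small Python sketch of a byte-indexed lookup table. It is not the code from utils/base64.cc, and the names `build_decode_table` and `is_valid_base64_byte` are hypothetical. A byte takes 256 distinct values (0 through 255), so the table needs 256 entries for index 0xFF to be in range.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
INVALID = -1


def build_decode_table(size: int = 256) -> list:
    """Map every possible byte value to its 6-bit value, or INVALID."""
    table = [INVALID] * size
    for value, char in enumerate(ALPHABET):
        table[ord(char)] = value
    return table


DECODE_TABLE = build_decode_table()


def is_valid_base64_byte(byte: int) -> bool:
    """Padding '=' is accepted here; a real decoder also checks its position."""
    return byte == ord("=") or DECODE_TABLE[byte] != INVALID
```

With a 255-entry table, `table[0xFF]` is one past the end. Python would raise `IndexError` there, while C++ reads whatever happens to follow the array, which matches the "nonsense 6-bit part or crash" behavior described above.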
Fixes #25701
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#25705
(cherry picked from commit ff91027eac)
Closes scylladb/scylladb#25767
Move management over effective service levels from `service_level_controller`
to a new dedicated type -- `auth_integration`.
Before these changes, it was possible for the service level controller to try
to access `auth::service` after it was deinitialized. For instance, it could
happen when reloading the cache. That HAS happened as described in the following
issue: scylladb/scylladb#24792.
Although the problem might have been mitigated or even resolved in
scylladb/scylladb@10214e13bd, it's not clear
how the service will be used in the future. It's better to prevent similar bugs
than to try to fix them later on.
The logic responsible for preventing access to an uninitialized `auth::service`
was also either non-existent, complex, or insufficient.
To prevent accessing `auth::service` by the service level controller, we extract
the relevant portion of the code to a separate entity -- `auth_integration`.
It's an internal helper type whose sole purpose is to manage effective service
levels.
Thanks to that, we were able to nest the lifetime of `auth_integration` within
the lifetime of `auth::service`. It's now impossible to attempt to dereference
it while it's uninitialized.
If a bug related to an invalid access is spotted again, though, it might also
be easier to debug it now.
There should be no visible change to the users of the interface of the service
level controller. We strived to make the patch minimal, and the only affected
part of the logic should be related to how `auth::service` is accessed.
The relevant portion of the initialization and deinitialization flow:
(a) Before the changes:
1. Initialize `service_level_controller`. Pass a reference to an uninitialized
`auth::service` to it.
2. Initialize other services.
3. Initialize and start `auth::service`.
4. (work)
5. Stop and deinitialize `auth::service`.
6. Deinitialize other services.
7. Deinitialize `service_level_controller`.
(b) After the changes:
1. Initialize `service_level_controller`. Pass a reference to an uninitialized
`auth::service` to it. (*)
2. Initialize other services.
3. Initialize and start `auth::service`.
4. Initialize `auth_integration`. Register it in `service_level_controller`.
5. (work)
6. Unregister `auth_integration` in `service_level_controller` and deinitialize
it.
7. Stop and deinitialize `auth::service`.
8. Deinitialize other services.
9. Deinitialize `service_level_controller`.
(*):
The reference to `auth::service` in `service_level_controller` is still
necessary. We need to access the service when dropping a distributed
service level.
Although it would be best to cut that link between the service level
controller and `auth::service` too, effectively separating the entities,
it would require more work, so we leave it as-is for now.
It shouldn't prove problematic as far as accessing an uninitialized service
goes. Trying to drop a service level at the point when we're de-initializing
auth should be impossible.
For more context, see the function `drop_distributed_service_level` in
`service_level_controller`.
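The nesting in flow (b) can be modeled with Python context managers, where teardown happens in reverse order of setup. This is only an illustration of the lifetime ordering, not of the actual Seastar service code. The `service` helper and `run` function are hypothetical.

```python
from contextlib import ExitStack, contextmanager

events = []


@contextmanager
def service(name):
    """Record initialization on entry and deinitialization on exit."""
    events.append(f"init {name}")
    try:
        yield name
    finally:
        events.append(f"deinit {name}")


def run(work):
    # ExitStack tears services down in reverse order of initialization, so
    # auth_integration can never outlive auth::service.
    with ExitStack() as stack:
        stack.enter_context(service("service_level_controller"))
        stack.enter_context(service("other services"))
        stack.enter_context(service("auth::service"))
        stack.enter_context(service("auth_integration"))
        work()
```

Because `auth_integration` is entered last, it is always torn down first, which is the property that makes a dangling access to `auth::service` impossible in this model.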
A trivial test has been included in the PR. Although its value is questionable
as we only try to reload the service level cache at a specific moment, it's
probably the best we can deliver to provide a reproducer of the issue this patch
is resolving.
Fixes scylladb/scylladb#24792
Backport: The impact of the bug was minimal as it only affected the shutdown.
However, since CI is failing because of it, let's backport the change to all
supported versions.
- (cherry picked from commit 7d0086b093)
- (cherry picked from commit 34afb6cdd9)
- (cherry picked from commit e929279d74)
- (cherry picked from commit dd5a35dc67)
- (cherry picked from commit fc1c41536c)
Parent PR: #25478
Closes scylladb/scylladb#25753
* github.com:scylladb/scylladb:
service/qos: Move effective SL cache to auth_integration
service/qos: Add auth::service to auth_integration
service/qos: Reload effective SL cache conditionally
service/qos: Add gate to auth_integration
service/qos: Introduce auth_integration
This PR implements the solution proposed in scylladb/scylladb#24481.
Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections.
The updated shutdown process is as follows:
1. Initial Shutdown Phase
* Close the accept gate to block new incoming connections.
* Abort all accept() calls.
* For all active connections:
* Close only the input side of the connection to prevent new requests.
* Keep the output side open to allow responses to be sent.
2. Drain Phase
* Wait for all in-progress requests to either complete or fail.
3. Final Shutdown Phase
* Fully close all connections.
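The three phases above can be sketched with Python `asyncio`. The sketch models only the per-connection part (input closed, drain, full close) and leaves out the accept gate and `accept()` handling. `Connection`, `shutdown`, and `demo` are hypothetical names, not the `generic_server` API.

```python
import asyncio


class Connection:
    """Toy connection that tracks its in-flight requests."""

    def __init__(self):
        self.input_open = True
        self.output_open = True
        self.in_flight = set()

    def submit(self, coro):
        if not self.input_open:
            coro.close()  # avoid a "never awaited" warning
            raise ConnectionError("input side is closed")
        task = asyncio.ensure_future(coro)
        self.in_flight.add(task)
        task.add_done_callback(self.in_flight.discard)
        return task


async def shutdown(connections, drain_timeout):
    # Phase 1: stop accepting new requests but keep the output side open.
    for conn in connections:
        conn.input_open = False
    # Phase 2: wait for in-progress requests to complete or fail.
    pending = [task for conn in connections for task in conn.in_flight]
    if pending:
        await asyncio.wait(pending, timeout=drain_timeout)
    # Phase 3: fully close every connection.
    for conn in connections:
        conn.output_open = False


async def demo():
    conn = Connection()
    task = conn.submit(asyncio.sleep(0.01, result="done"))
    await shutdown([conn], drain_timeout=1.0)
    return task.result(), conn.input_open, conn.output_open
```

In `demo`, the request submitted before shutdown still completes, because only the input side is closed while the drain phase runs.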
Fixes scylladb/scylladb#24481
- (cherry picked from commit 122e940872)
- (cherry picked from commit 3848d10a8d)
- (cherry picked from commit 3610cf0bfd)
- (cherry picked from commit 27b3d5b415)
- (cherry picked from commit 061089389c)
- (cherry picked from commit 7334bf36a4)
- (cherry picked from commit ea311be12b)
- (cherry picked from commit 4f63e1df58)
Parent PR: #24499
Closes scylladb/scylladb#25519
* github.com:scylladb/scylladb:
test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`, decrease request timeout.
generic_server: Two-step connection shutdown.
transport: consmetic change, remove extra blanks.
transport: Handle sleep aborted exception in sleep_until_timeout_passes
generic_server: replace empty destructor with `= default`
generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output`
generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class.
test: Add test for query execution during CQL server shutdown
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.
There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.
Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721
Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.
Closes scylladb/scylladb#25685
(cherry picked from commit 5dac4b38fb)
Closes scylladb/scylladb#25781
Users with single-column partition keys that contain colon characters
were unable to use certain REST APIs and 'nodetool' commands, because the
API split the key by colon regardless of the partition key schema.
Affected commands:
- 'nodetool getendpoints'
- 'nodetool getsstables'
Affected endpoints:
- '/column_family/sstables/by_key'
- '/storage_service/natural_endpoints'
Refs: #16596 - This does not fully fix the issue, as users with compound
keys will face the issue if any column of the partition key contains
a colon character.
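The schema-aware behavior can be sketched as follows. This is an illustration in Python rather than the actual API code, and `parse_partition_key` is a hypothetical name.

```python
def parse_partition_key(raw_key: str, key_column_count: int) -> list:
    """Split `raw_key` into components according to the key schema."""
    if key_column_count == 1:
        # A single-column key is taken verbatim, colons included.
        return [raw_key]
    # Compound keys are still split on every colon, so a colon inside a
    # component remains ambiguous (the limitation noted in #16596).
    return raw_key.split(":")
```

For example, `parse_partition_key("a:b", 1)` keeps the key whole, while `parse_partition_key("a:b", 2)` splits it into two components.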
Closes scylladb/scylladb#24829
Closes scylladb/scylladb#25565
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.
To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.
If the option is turned off, we will also log all of the
RF-rack-invalid keyspaces at start-up.
We provide validation tests.
Fixes scylladb/scylladb#23330
Backport: we'd like to encourage the user to abide by the restriction
even when they don't enforce it to make it easier in the future to
adjust the schema when there's no way to disable it anymore. Because
of that, we'd like to backport it to all relevant versions, starting with 2025.1.
- (cherry picked from commit 60ea22d887)
- (cherry picked from commit af8a3dd17b)
- (cherry picked from commit 837d267cbf)
Parent PR: #24785
Closes scylladb/scylladb#25635
* github.com:scylladb/scylladb:
main: Log RF-rack-invalid keyspaces at startup
cql3/statements: Fix indentation
cql3: Warn when creating RF-rack-invalid keyspace
To avoid the situation where the port is already occupied on localhost, use a unique
hostname for Minio.
(cherry picked from commit c6c3e9f492)
Closes scylladb/scylladb#24775
Several parameters that `test.py` should pass to pytest->boost were missing. This PR adds handling of these parameters: `--random-seed` and `--x-log2-compaction-groups`.
Since the code in 2025.3 is affected by this issue and this is a framework-only change, a backport to that version is needed.
Fixes: https://github.com/scylladb/scylladb/issues/24927
- (cherry picked from commit 71b875c932)
- (cherry picked from commit f7c7877ba6)
Parent PR: #24928
Closes scylladb/scylladb#25035
* github.com:scylladb/scylladb:
test.py: add bypassing x_log2_compaction_groups to boost tests
test.py: add bypassing random seed to boost tests
Fixes #25709
If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).
Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.
To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).
Closes scylladb/scylladb#25719
(cherry picked from commit cc9eb321a1)
Closes scylladb/scylladb#25756
The new service, `auth_integration`, has taken over the responsibility
over managing effective service levels from `service_level_controller`.
However, before these changes, it still accessed `auth::service` via
the service level controller. Let's change that.
Note that we also remove a check that `auth::service` has been
initialized. It's not necessary anymore because the lifetime of
`auth_integration` is strictly nested within the lifetime of `auth::service`.
In actuality, `service_level_controller` should lose its reference to
`auth::service` completely. All of the management over effective service
levels has already been moved to `auth_integration`. However, the
reference is still needed when dropping a distributed service level
because we need to update the corresponding attribute for relevant
roles.
That should not lead to invalid accesses, though. Dropping a service level
should not be possible when `auth::service` is not initialized.
(cherry picked from commit dd5a35dc67)
Since `service_level_controller` outlives `auth_integration`, it may
happen that we try to access it when it has already been deinitialized.
To prevent that, we only try to reload or clear the effective service
level cache when the object is still alive.
These changes solve an existing problem with an invalid memory access.
For more context, see issue scylladb/scylladb#24792.
We provide a reproducer test that consistently fails before these
changes but passes after them.
Fixes scylladb/scylladb#24792
(cherry picked from commit e929279d74)
We introduce a new type, `auth_integration`, that will be used internally
by `service_level_controller`. Its purpose is to take over the responsibility
over managing effective service levels.
The main problem of the current implementation of service level controller
is its dependency on `auth::service` whose lifetime is strictly nested
within the lifetime of service level controller. That may lead, and already has
led, to invalid memory accesses; for an example, see issue
scylladb/scylladb#24792.
Our strategy is to split service level controller into smaller parts and
ensure that we access `auth::service` only when it's valid to do so.
This commit is the first step towards that.
We don't change anything in the logic yet, just add the new type. Further
adjustments will be made in following commits.
(cherry picked from commit 7d0086b093)
When the configuration option `rf_rack_valid_keyspaces` is enabled and there
is an RF-rack-invalid keyspace, starting a node fails. However, when the
configuration option is disabled, but there still is a keyspace that violates
the condition, we'd like Scylla to print a warning informing the user about
the fact. That's what happens in this commit.
We provide a validation test.
(cherry picked from commit 837d267cbf)
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.
To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.
We provide a validation test.
(cherry picked from commit 60ea22d887)
The test test_tombstone_gc_disabled_on_pending_replica was added when
we fixed (#20788) the potential problem with data resurrection during
file based streaming. The issue was occurring only in Enterprise, but
we added the fix in OSS to limit code divergence. This test was added
together with the fix in OSS with the idea to guard this change in OSS.
The real reproducer and test for this fix was added later, after the
fix was ported into Enterprise.
It is in: test/cluster/test_resurrection.py
Since Enterprise has been merged into OSS, there is no more need to
keep the test test_tombstone_gc_disabled_on_pending_replica. Also,
it is flaky with very low probability of failure, making it difficult
to investigate the cause of failure.
Fixes: #22182
Refs: scylladb/scylladb#25448
Closes scylladb/scylladb#25134
(cherry picked from commit 7ce96345bf)
Closes scylladb/scylladb#25573
The PR fixes a test flakiness issue in test_mv_backlog related to reading metrics.
The first commit fixes a more general issue in the ScyllaMetrics helper class: when a specific shard is requested, it stops after the first matching line instead of returning the value of all matching lines.
The second commit fixes a test issue where it expects exactly one write to be throttled, not taking into account other internal writes that may be executed during this time.
Fixes https://github.com/scylladb/scylladb/issues/23139
backport to improve CI stability - test only change
- (cherry picked from commit 5c28cffdb4)
- (cherry picked from commit 276a09ac6e)
Parent PR: #25279
Closes scylladb/scylladb#25475
* github.com:scylladb/scylladb:
test: test_mv_backlog: fix to consider internal writes
test/pylib/rest_client: fix ScyllaMetrics filtering
The test creates all driver sessions by itself. As a consequence, all
sessions use the default request timeout of 10s. This can be too low for
the debug mode, as observed in scylladb/scylla-enterprise#5601.
In this commit, we change the test to use `cluster_con`, so that the
sessions have the request timeout set to 200s from now on.
Fixes scylladb/scylla-enterprise#5601
This commit changes only the test and is a CI stability improvement,
so it should be backported all the way to 2024.2. 2024.1 doesn't have
this test.
Closes scylladb/scylladb#25510
(cherry picked from commit 03cc34e3a0)
Closes scylladb/scylladb#25547
decrease request timeout.
In debug mode, queries may sometimes take longer than the default 30 seconds.
To address this, the timeout value `request_timeout_on_shutdown_in_seconds`
during tests is aligned with other request timeouts.
Change the request timeout for tests from 180s to 90s, since we must keep the request
timeout during shutdown significantly lower than the graceful shutdown timeout (2m),
or else a request timeout would cause a graceful shutdown timeout and fail a test.
(cherry picked from commit 4f63e1df58)
When shutting down in `generic_server`, connections are now closed in two steps.
First, only the RX (receive) side is shut down. Then, after all ongoing requests
are completed or a timeout occurs, the connections are fully closed.
Fixes scylladb/scylladb#24481
(cherry picked from commit ea311be12b)
The test_base_partition_deletion_with_metrics test case (and the batch
variant) uses the metric of view updates done during its runtime to check
if we didn't perform too many of them. The test runs in the cqlpy suite,
which runs all test cases sequentially on one Scylla instance. Because
of this, if another test case starts a process which generates view
updates and doesn't wait for it to finish before it exits, we may
observe too many view updates in test_base_partition_deletion_with_metrics
and fail the test.
In all test cases we make sure that all tables that were created
during the test are dropped at the end. However, that doesn't
stop the view building process immediately, so the issue can happen
even if we drop the view. I confirmed it by adding a test just before
test_base_partition_deletion_with_metrics which builds a big
materialized view and drops it at the end - the metrics check still failed.
The issue could be caused by any of the existing test cases where we create
a view and don't wait for it to be built. Note that even if we start adding
rows after creating the view, some of them may still be included in the view
building, as the view building process is started asynchronously. In such
a scenario, the view building also doesn't cause any issues with the data in
these tests: writes performed after view creation generate view updates
synchronously when they're local (and we're running a single Scylla server), so
the corresponding view updates generated during view building are redundant.
Because we have many test cases which could be causing this issue, instead
of waiting for the view building to finish in every single one of them, we
move the susceptible test cases to be run on separate Scylla instances, in
the "cluster" suite. There, no other test cases will influence the results.
Fixes https://github.com/scylladb/scylladb/issues/20379
Closes scylladb/scylladb#25209
(cherry picked from commit 2ece08ba43)
Closes scylladb/scylladb#25504
The RAFT_TEST_CASE macro creates 2 test cases, one of them with random 20% packet
loss, named name_drops. The framework makes hard-coded assumptions about the
leader which don't hold in case of packet loss.
This short term fix disables the packet drop variant of the specified test.
It should be safe to re-enable it once the whole framework is re-worked to
remove these hard coded assumptions.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#23816
Closes scylladb/scylladb#25489
(cherry picked from commit a0ee5e4b85)
Closes scylladb/scylladb#25528
This test simulates a scenario where a query is being executed while
the query coordinator begins shutting down the CQL server and client
connections. The shutdown process should wait until the query execution
is either completed or timed out.
Test for scylladb/scylladb#24481
(cherry picked from commit 122e940872)
The test executes a single write, fetching metrics before and after the
write, and expects the total throttled writes count to be increased
exactly by one.
However, other internal writes (compaction for example) may be executed
during this time and be throttled, causing the metrics to be increased
by more than expected.
To address this, we filter the metrics by the scheduling group label of
the user write, to filter out the compaction writes that run in the
compaction scheduling group.
Fixes scylladb/scylladb#23139
(cherry picked from commit 276a09ac6e)
In the ScyllaMetrics `get` function, when requesting the value for a
specific shard, it is expected to return the sum of all values of
metrics for that shard that match the labels.
However, it would return the value of the first matching line it finds
instead of summing all matching lines.
For example, if we have two lines for one shard like:
some_metric{scheduling_group_name="compaction",shard="0"} 1
some_metric{scheduling_group_name="sl:default",shard="0"} 2
The result of this call would be 1 instead of 3:
get('some_metric', shard="0")
We fix this to sum all matching lines.
The filtering of lines by labels is fixed to allow specifying only some
of the labels. Previously, for a line to match the filter, either the
filter had to be empty, or all the labels in the metric line had to be
specified in the filter parameter and match its values, which is
unexpected and breaks when more labels are added.
We also simplify the function signature and the implementation - instead
of having the shard as a separate parameter, it can be specified as a
label, like any other label.
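The intended semantics (sum every matching line, match on a subset of labels, treat the shard as an ordinary label) can be sketched as below. This is a simplified stand-alone function, not the actual `ScyllaMetrics.get` implementation. The regular expressions and the behavior of returning `0.0` when nothing matches are assumptions made for the example.

```python
import re

LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

SAMPLE = '''some_metric{scheduling_group_name="compaction",shard="0"} 1
some_metric{scheduling_group_name="sl:default",shard="0"} 2'''


def get(metrics_text: str, name: str, **labels) -> float:
    """Sum all samples of `name` whose labels include every given label."""
    total = 0.0
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None or match.group("name") != name:
            continue
        line_labels = dict(LABEL_RE.findall(match.group("labels")))
        # Partial match: labels not mentioned in the filter are ignored.
        if all(line_labels.get(key) == value for key, value in labels.items()):
            total += float(match.group("value"))
    return total
```

With the two sample lines, `get(SAMPLE, "some_metric", shard="0")` sums both values, while adding `scheduling_group_name="sl:default"` to the filter narrows the result to the second line only.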
(cherry picked from commit 5c28cffdb4)
The previous way of executing repeats was to launch pytest for each repeat.
That was resource consuming, since each time pytest was doing discovery
of the tests. Now all repeats are done inside one pytest process.
A backport to 2025.3 is needed, since this functionality is framework-only and 2025.3 is affected by these slow repeats as well.
Fixes: https://github.com/scylladb/scylladb/issues/25391
- (cherry picked from commit cc75197efd)
- (cherry picked from commit 557293995b)
- (cherry picked from commit 853bdec3ec)
- (cherry picked from commit d0e4045103)
Parent PR: #25073
Closes scylladb/scylladb#25392
* github.com:scylladb/scylladb:
test.py: add repeats in pytest
test.py: add directories and filename to the log files
test.py: rename log sink file for boost tests
test.py: better error handling in boost facade
This patch sets, for alternator test suite, all 'alternator-*' loggers and 'paxos' logger to trace level. This should significantly ease debugging of failed tests, while it has no effect on test time and increases log size only by 7%.
This affects running alternator tests only with `test.py`, not with `test/alternator/run`.
Closes #24645
Closes scylladb/scylladb#25327
(cherry picked from commit eb11485969)
Closes scylladb/scylladb#25383
Audit tests are vulnerable to noise from LOGIN queries (because AUTH
audit logs can appear at any time). Most tests already use the
`filter_out_noise` mechanism to remove this noise, but tests
focused on AUTH verification did not, leading to sporadic failures.
This change adds a filter to ignore AUTH logs generated by the default
"cassandra" user, so tests only verify logs from the user created
specifically for each test.
Additionally, this PR:
- Adds missing `nonlocal new_rows` statement that prevented some checks from being called
- Adds a testcase for audit logs of `cassandra` user
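The effect of a missing `nonlocal` statement can be shown with a generic Python example. It is not taken from the audit tests, and the function names are hypothetical. Without `nonlocal`, an assignment inside a nested function creates a new local variable, so the enclosing function never sees the new value and any check that depends on it is skipped or runs on stale data.

```python
def collect_without_nonlocal():
    new_rows = []

    def refresh():
        new_rows = ["row"]  # binds a new local name; the outer list is untouched

    refresh()
    return new_rows


def collect_with_nonlocal():
    new_rows = []

    def refresh():
        nonlocal new_rows
        new_rows = ["row"]  # rebinds the variable of the enclosing function

    refresh()
    return new_rows
```

The first function returns an empty list, so a loop over its result would run no checks at all; the second returns the refreshed rows.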
Fixes: https://github.com/scylladb/scylladb/issues/25069
These test changes should be backported to 2025.3. 2025.2 and earlier don't have `./cluster/dtest/audit_test.py`.
- (cherry picked from commit e634a2cb4f)
- (cherry picked from commit daf1c58e21)
- (cherry picked from commit aef6474537)
- (cherry picked from commit 21aedeeafb)
Parent PR: #25111
Closes scylladb/scylladb#25140
* github.com:scylladb/scylladb:
test: audit: add cassandra user test case
test: audit: ignore cassandra user audit logs in AUTH tests
test: audit: change names of `filter_out_noise` parameters
The previous way of executing repeats was to launch pytest for each repeat.
That was resource consuming, since each time pytest was doing discovery
of the tests. Now all repeats are done inside one pytest process.
(cherry picked from commit d0e4045103)
Currently, only the test function name is used for output and log files. For better
clarity, prepend the test file's path relative to the test directory, without
the extension, to these file names.
Before:
test_aggregate_avg.1.log
test_aggregate_avg_stdout.1.log
After:
boost.aggregate_fcts_test.test_aggregate_avg.1.log
boost.aggregate_fcts_test.test_aggregate_avg_stdout.3.log
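The naming scheme can be sketched with `pathlib`. The helper below is hypothetical and not the `test.py` implementation. The `test` directory and the `.cc` extension used in the usage note are assumptions for the example.

```python
from pathlib import Path


def log_file_name(test_dir: Path, test_file: Path, test_name: str,
                  run_id: int, suffix: str = "") -> str:
    """Prefix the log name with the test file's relative path, dot-separated."""
    relative = test_file.relative_to(test_dir).with_suffix("")
    return ".".join([*relative.parts, f"{test_name}{suffix}", str(run_id), "log"])
```

For instance, `log_file_name(Path("test"), Path("test/boost/aggregate_fcts_test.cc"), "test_aggregate_avg", 1)` produces a name of the same shape as the first "After" example.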
(cherry picked from commit 853bdec3ec)
If a test was not executed for some reason, for example because an unknown parameter was passed to it, but the boost framework was able to finish correctly, the log file will contain data that parses to an empty list. This raises an exception in the pytest execution instead of producing test output. This change handles that situation.
(cherry picked from commit cc75197efd)
TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised.
The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands.
This commit changes the on_internal_error() into a warning log entry. It also outputs the keyspace/table names and the offending replay positions which caused the check to fail.
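The relaxed check can be modeled in a few lines. This is a Python sketch of the logic described above, not the C++ code from the patch; `ReplayPosition`, its fields, and `check_truncate_replay_positions` are hypothetical names.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("truncate")


@dataclass(frozen=True, order=True)
class ReplayPosition:
    # Hypothetical model: ordered by (segment_id, offset).
    segment_id: int
    offset: int


def check_truncate_replay_positions(keyspace, table, flushed_rp, discarded_rp):
    """Return True if no concurrent write was detected.

    A discarded sstable with a replay position higher than the flushed
    memtable's means data was flushed independently during truncate.
    Instead of raising (which would kill a caller that does not expect
    it), log a warning with the names and positions involved.
    """
    if discarded_rp > flushed_rp:
        log.warning(
            "Writes during truncate of %s.%s: discarded sstable replay "
            "position %s is higher than flushed memtable position %s",
            keyspace, table, discarded_rp, flushed_rp,
        )
        return False
    return True
```

Returning a flag instead of raising keeps callers such as a DROP TABLE path running, which is the point of the change.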
This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that:
- all data written before TRUNCATE starts is deleted
- none of the data after TRUNCATE completes is deleted
Fixes: #25173
Fixes: #25013
Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3, 2025.2, 2025.1
- (cherry picked from commit 268ec72dc9)
- (cherry picked from commit 33488ba943)
Parent PR: #25174
Closes scylladb/scylladb#25350
* github.com:scylladb/scylladb:
truncate: add test for truncate with concurrent writes
truncate: change check for write during truncate into a log warning
Audit tests use the `filter_out_noise` function to remove noise from
audit logs generated by user authentication. As a result, none of the
existing tests covered audit logs for the default `cassandra` user.
This change adds a test case for that user.
Refs: scylladb/scylladb#25069
(cherry picked from commit 21aedeeafb)
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
`scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
first).
These steps are not intuitive and more complicated than they could be.
In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.
Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.
Fixes scylladb/scylladb#25015
The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.
- (cherry picked from commit ec69028907)
- (cherry picked from commit 445a15ff45)
- (cherry picked from commit 23f59483b6)
- (cherry picked from commit ba5b5c7d2f)
- (cherry picked from commit 9e45e1159b)
- (cherry picked from commit f408d1fa4f)
Parent PR: #25032
Closes scylladb/scylladb#25335
* https://github.com/scylladb/scylladb:
docs: document the option to set recovery_leader later
test: delay setting recovery_leader in the recovery procedure tests
gossip: add recovery_leader to gossip_digest_syn
db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
db/config, gms/gossiper: change recovery_leader to UUID
db/config, utils: allow using UUID as a config option
`kmip_test_helper()` is a utility function to spawn a dedicated PyKMIP
server for a particular Boost test case. The function runs the server as
an external process and uses a thread to parse the port from the
server's logs. The thread communicates the port to the main thread via
a promise.
The current implementation has a bug where the thread may set a value
to the promise after its destruction, causing a segfault. This happens
when the server does not start within 20 seconds, in which case the port
future throws and the stack unwinding machinery destroys the port
promise before the thread that writes to it.
Fix the bug by declaring the promise before the cleanup action.
The bug has been encountered in CI runs on slow machines, where the
PyKMIP server takes too long to create its internal tables (due to slow
fdatasync calls from SQLite). This patch does not improve CI stability -
it only ensures that the error condition is properly reflected in the
test output.
This patch is not a backport. The same bug has been fixed in master as
part of a larger rewrite of the `kmip_test_helper()` (see 722e2bce96).
Refs #24747, #24842.
Fixes #24574.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closes scylladb/scylladb#25030
This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired, resulting in a no-op renewal that returned the existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm.
To resolve this, we remove the expiration check (and any other checks) from the `reload` method, on the assumption that the caller knows a reload is needed.
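The renewal storm can be reproduced with a small model. This is a Python sketch under stated assumptions, not the actual `s3_creds` code: the class `Provider`, the methods `reload_if_expired` and `reload`, and the function `renewal_delay` are hypothetical; only the one-hour lead time and the conditional versus unconditional reload come from the description above.

```python
from dataclasses import dataclass

RENEWAL_LEAD = 3600.0  # renew one hour before expiration


@dataclass
class Credentials:
    token: str
    expires_at: float


class Provider:
    """Hypothetical credentials provider driven by an injected clock."""

    def __init__(self, fetch, now):
        self._fetch = fetch
        self._now = now
        self.creds = fetch()

    def reload_if_expired(self):
        # Old behaviour: a no-op unless the credentials have expired.
        if self._now() < self.creds.expires_at:
            return self.creds
        self.creds = self._fetch()
        return self.creds

    def reload(self):
        # New behaviour: always fetch; the caller decides when.
        self.creds = self._fetch()
        return self.creds


def renewal_delay(creds, now):
    """Seconds until the timer should fire next (0 means 'right now')."""
    return max(0.0, creds.expires_at - RENEWAL_LEAD - now)
```

When the timer fires one hour before expiration, `reload_if_expired` leaves the credentials unchanged, so `renewal_delay` stays at 0 and the timer fires again immediately. With the unconditional `reload`, the new expiry pushes the next renewal into the future.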
Fixes: https://github.com/scylladb/scylladb/issues/25044
Should be backported to 2025.3, since this fix is needed for restore.
- (cherry picked from commit 68855c90ca)
- (cherry picked from commit e4ebe6a309)
- (cherry picked from commit 837475ec6f)
Parent PR: #24961Closesscylladb/scylladb#25347
* github.com:scylladb/scylladb:
s3_creds: code cleanup
s3_creds: Make `reload` unconditional
s3_creds: Add test exposing credentials renewal issue