Populate the local state during gossiper initialization in start_gossiping, preventing an empty state from being added to _endpoint_state_map and returned in get_endpoint_states responses, that was causing an 'empty host id issue' on the other nodes during nodes restart.
Check for a race condition in do_apply_state_locally In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash.
This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map.
Fixes https://github.com/scylladb/scylladb/issues/25831
Fixes https://github.com/scylladb/scylladb/issues/25803
Fixes https://github.com/scylladb/scylladb/issues/25702
Fixes https://github.com/scylladb/scylladb/issues/25621
Ref https://github.com/scylladb/scylla-enterprise/issues/5613
Backport: The issue affects all current releases(2025.x), therefore this PR needs to be backported to all 2025.1-2025.3.
- (cherry picked from commit 28e0f42a83)
- (cherry picked from commit f08df7c9d7)
- (cherry picked from commit 775642ea23)
- (cherry picked from commit b34d543f30)
Parent PR: #25849Closesscylladb/scylladb#25897
* https://github.com/scylladb/scylladb:
gossiper: fix empty initial local node state
gossiper: add test for a race condition in start_gossiping
gossiper: check for a race condition in `do_apply_state_locally`
test/gossiper: add reproducible test for race condition during node decommission
This change removes the addition of an empty state to `_endpoint_state_map`.
Instead, a new state is created locally and then published via replicate,
avoiding the issue of an empty state existing in `_endpoint_state_map`
before the preemption point. Since this resolves the issue tested in
`test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed.
Fixes: scylladb/scylladb#25831
(cherry picked from commit b34d543f30)
This change adds a test for a race condition in `start_gossiping` that
can lead to an empty self state sent in `gossip_get_endpoint_states_response`.
Test for scylladb/scylladb#25831
(cherry picked from commit 775642ea23)
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.
This change
1. adds a check after locking the map entry: if a gossip ACK update
does not contain a host ID, we verify that an entry with that host ID
still exists in the gossiper’s _endpoint_state_map.
2. Removes xfail from the test_gossiper_race test since the issue is now
fixed.
3. Adds exception handling in `do_shadow_round` to skip responses from
nodes that sent an empty host ID.
This re-applies the commit 13392a40d4 that
was reverted in 46aa59fe49, after fixing
the issues that caused the CI to fail.
Fixes: scylladb/scylladb#25702Fixes: scylladb/scylladb#25621
Ref: scylladb/scylla-enterprise#5613
(cherry picked from commit f08df7c9d7)
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.
There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.
This re-applies the commit 5dac4b38fb that
was reverted in dc44fca67c, but modified
to relax the check from "on_internal_error" to a just warning log. The
more strict can be re-introduced later once we are sure that all
remaining problems are resolved and it will not break the CI.
Refs: scylladb/scylladb#25621Fixes: scylladb/scylladb#25721
(cherry picked from commit 28e0f42a83)
Fixes#25683
Once a table drop is complete, there should be no reason to retain
truncation records for it, as any replay should skip mutations
anyway (no CF), and iff we somehow resurrect a dropped table,
this replay-resurrected data is the least problem anyway.
Adds a prune phase to the startup drop_truncation_rp_records run,
which ignores updating, and instead deletes records for non-existant
tables (which should patch any existing servers with lingering data
as well).
Also does an explicit delete of records on actual table DROP, to
ensure we don't grow this table more than needed even in long
uptime nodes.
Small unit test included.
Closesscylladb/scylladb#25699
(cherry picked from commit bc20861afb)
Closesscylladb/scylladb#25813
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.
This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.
Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.
Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.
Fixes https://github.com/scylladb/scylladb/issues/24524
Backport not needed, as it is a new feature.
Closesscylladb/scylladb#24924
* github.com:scylladb/scylladb:
main: utils: add thread names to alien workers
auth: move passwords::check call to alien thread
test: wait for 3 clients with given username in test_service_level_api
auth: refactor password checking in password_authenticator
auth: make SHA-512 the only password hashing scheme for new passwords
auth: whitespace change in identify_best_supported_scheme()
auth: require scheme as parameter for `generate_salt`
auth: check password hashing scheme support on authenticator start
(cherry picked from commit c762425ea7)
Consider the following scenario:
- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash
Fixes: #25706
This needs to be backported to all supported versions with tablets
- (cherry picked from commit a0934cf80d)
- (cherry picked from commit 1b8a44af75)
Parent PR: #25708Closesscylladb/scylladb#25784
* github.com:scylladb/scylladb:
test: reproducer and test for drop with concurrent cleanup
truncate: check for closed storage group's gate in discard_sstables
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.
There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.
Refs: scylladb/scylladb#25621Fixes: scylladb/scylladb#25721
Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.
Closesscylladb/scylladb#25685
(cherry picked from commit 5dac4b38fb)
Closesscylladb/scylladb#25780
The gossiper can call `storage_service::on_change` frequently (see scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues.
This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL.
This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620.
Fixes scylladb/scylladb#25660
backport: this patch needs to be backported to all supported versions (2025.1/2/3).
- (cherry picked from commit 91c633371e)
- (cherry picked from commit de5dc4c362)
- (cherry picked from commit 4b907c7711)
Parent PR: #25658Closesscylladb/scylladb#25764
* github.com:scylladb/scylladb:
storage_service: move get_host_id_to_ip_map to system_keyspace
system_keyspace: use peers cache in get_ip_from_peers_table
storage_service: move get_ip_from_peers_table to system_keyspace
Consider the following scenario:
- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the
tablet with its gate closed. This fails, because
table::parallel_foreach_compaction_group() ultimately calls
storage_group_manager::parallel_foreach_storage_group() which will not
disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction
has been disabled, and because it hasn't, it then runs
on_internal_error() with "compaction not disabled on table ks.cf during
TRUNCATE" which causes a crash
This patch makes dicard_sstables check if the storage group's gate is
closed whend checking for disabled compaction.
(cherry picked from commit a0934cf80d)
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.
Alternator's Scan or Query operation return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.
In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running
test/alternator/run --runveryslow \
test_query.py::test_query_large_page_small_rows
reports in the log:
oversized allocation: 573440 bytes.
After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.
The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.
Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.
Fixes#23535
(cherry picked from commit 2385fba4b6)
Closesscylladb/scylladb#25654
Reimplemented the function to use the peers cache. It could be replaced
with get_ip_from_peers_table, but that would create a coroutine frame for
each call.
(cherry picked from commit 4b907c7711)
The storage_service::on_change method can be called quite often
by the gossiper, see scylladb/scylla-enterprise#5613. In this commit
we introduce a temporal cache for system.peers so that we don't have
to go to the storage each time we need to resolve host_id -> ip.
We keep the cache only for a small amount of time to handle the
(unlikely) scenario when the user wants to update system.peers table
from CQL.
Fixesscylladb/scylladb#25660
(cherry picked from commit de5dc4c362)
We plan to add a cache to get_ip_from_peers_table in upcoming commits.
It's more convenient to do this from system_keyspace, since the only two
methods that mutate system.peers (remove_endpoint and update_peers_info)
are already there.
(cherry picked from commit 91c633371e)
This patch fixes an error-path bug in the base-64 decoding code in
utils/base64.cc, which among other things is used in Alternator to decode
blobs in JSON requests.
The base-64 decoding code has a lookup table, which was wrongly sized 255
bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF)
was included in an invalid base-64 string, instead of detecting that this
is an invalid byte (since the only valid bytes in a base-64 string are
A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a
nonsense 6-bit part, or even crash on an out-of-bounds read.
Besides the trivial fix, this patch also includes a reproducing test,
which tries to write a blob as a supposedly base-64 encoded string with
a 0xFF byte in it. The test fails before this patch (the write succeeds,
unexpectedly), and passes after this patch (the write fails as
expected). The test also passes on DynamoDB.
Fixes#25701
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#25705
(cherry picked from commit ff91027eac)
Closesscylladb/scylladb#25765
Fixes#25682
Refs scylla-enterprise#5580
If the truncation table is large in entries, we might create a
huge parallel execution, quite possibly consuming loads of resources
doing something quite trivial.
Limit concurrency to a small-ish number
Closesscylladb/scylladb#25678
(cherry picked from commit 2eccd17e70)
Closesscylladb/scylladb#25749
Previously, `maybe_reconnect_to_preferred_ip()` retrieved the host ID
using `gossiper::get_host_id()`. Since the host ID is already available
in the calling function, we now pass it directly as a parameter.
This change simplifies the code and eliminates a potential race condition
where `gossiper::get_host_id()` could fail, as described in scylladb/scylladb#25621.
Refs: scylladb/scylladb#25621Fixes: scylladb/scylladb#25715
Backport: Recommended for 2025.x release branches to avoid potential issues
from unnecessary calls to `gossiper::get_host_id()` in subscribers.
(cherry picked from commit cfc87746b6)
Closesscylladb/scylladb#25717
Fixes#25709
If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).
Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.
To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).
Closesscylladb/scylladb#25719
(cherry picked from commit cc9eb321a1)
Closesscylladb/scylladb#25755
Currently, in repair_service::repair_tablets a shard that initiates
the repair keeps repair_tablet_metas of all tablets that have a replica
on this node (on any shard). This may lead to oversized allocations.
Modify tablet_repair_task_impl to repair only the tablets which replicas
are kept on this shard. Modify repair_service::repair_tablets to gather
repair_tablet_metas only on local shard. repair_tablets is invoked on
all shards.
Add a new legacy_tablet_repair_task_impl that covers tablet repair started with
async_repair. A user can use sequence number of this task to manage the repair
using storage_service API.
In a test that reproduced this, we have seen 11136 tablets and 5636096 bytes
allocation failure. If we had a node with 250 shards, 100 tablets each, we could
reach 12MB kept on one shard for the whole repair time.
Fixes: https://github.com/scylladb/scylladb/issues/23632
Needs backport to all live branches as they are all vulnerable to such crashes.
Closesscylladb/scylladb#25352
* github.com:scylladb/scylladb:
repair: distribute tablet_repair_task_meta among shards
repair: do not keep erm in tablet_repair_task_meta
token_range_vector is a sequence of intervals of tokens. It is used
to describe vnodes or token ranges owned by shards.
Since tokens are bloated (16 bytes instead of 8), and intervals are bloated
(40 byte of overhead instead of 8), and since we have plenty of token ranges,
such vectors can exceed our allocation unit of 128 kB and cause allocation stalls.
This series fixes that by first generalizing some helpers and then changing
token_range_vector to use chunked_vector.
Although this touches IDL, there is no compatibility problem since the encoding
for vector and chunked_vector are identical.
There is no performance concern since token_range_vector is never used on
any hot path (hot paths always contain a partition key).
Fixes#3335.
Fixes#24115.
Fixes#24156Closesscylladb/scylladb#25659
* github.com:scylladb/scylladb:
dht: fragment token_range_vector
partition_range_compat: generalize wrap/unwrap helpers
utils: chunked_vector: add swap() method
utils: chunked_vector: add range insert() overloads
Users with single-column partition keys that contain colon characters
were unable to use certain REST APIs and 'nodetool' commands, because the
API split key by colon regardless of the partition key schema.
Affected commands:
- 'nodetool getendpoints'
- 'nodetool getsstables'
Affected endpoints:
- '/column_family/sstables/by_key'
- '/storage_service/natural_endpoints'
Refs: #16596 - This does not fully fix the issue, as users with compound
keys will face the issue if any column of the partition key contains
a colon character.
Closesscylladb/scylladb#24829Closesscylladb/scylladb#25564
token_range_vector is a linear vector containing intervals
of tokens. It can grow quite large in certain places
and so cause stalls.
Convert it to utils::chunked_vector, which prevents allocation
stalls.
It is not used in any hot path, as it usually describes
vnodes or similar things.
Fixes#3335.
(cherry picked from commit 844a49ed6e)
These helpers convert vectors of wrapped intervals to
vectors of unwrapped intervals and vice versa.
Generalize them to work on any sequence type. This is in
preparation of moving from vectors to chunked_vectors.
(cherry picked from commit 83c2a2e169)
Following std::vector(), we implement swap(). It's a simple matter
of swapping all the contents.
A unit test is added.
(cherry picked from commit 13a75ff835)
Inserts an iterator range at some position.
Again we insert the range at the end and use std::rotate() to
move the newly inserted elements into place, forgoing possible
optimizations.
Unit tests are added.
(cherry picked from commit 24e0d17def)
Currently, in repair_service::repair_tablets a shard that initiates
the repair keeps tablet_repair_task_meta of all tablets that have a replica
on this node (on any shard). This may lead to oversized allocations.
Add remote_metas class which takes care of distributing tablet_repair_task_meta
among different shards. An additional class remote_metas_builder was
added in order to ensure safety and separate writes and reads to meta
vectors.
Fixes: #23632
(cherry picked from commit 132e6495a3)
Do not keep erm in tablet_repair_task_meta to avoid non-owner shared
pointer access when metas will be distributes among shards.
Pass std::chunked_vector of erms to tablet_repair_task_impl to
preserve safety.
(cherry picked from commit 603a2dbb10)
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.
To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.
If the option is turned off, we will also log all of the
RF-rack-invalid keyspaces at start-up.
We provide validation tests.
Fixes scylladb/scylladb#23330
Backport: we'd like to encourage the user to abide by the restriction
even when they don't enforce it to make it easier in the future to
adjust the schema when there's no way to disable it anymore. Because
of that, we'd like to backport it to all relevant versions, starting with 2025.1.
- (cherry picked from commit 60ea22d887)
- (cherry picked from commit af8a3dd17b)
- (cherry picked from commit 837d267cbf)
Parent PR: #24785Closesscylladb/scylladb#25634
* github.com:scylladb/scylladb:
main: Log RF-rack-invalid keyspaces at startup
cql3/statements: Fix indentation
cql3: Warn when creating RF-rack-invalid keyspace
When the configuration option `rf_rack_valid_keyspaces` is enabled and there
is an RF-rack-invalid keyspace, starting a node fails. However, when the
configuration option is disabled, but there still is a keyspace that violates
the condition, we'd like Scylla to print a warning informing the user about
the fact. That's what happens in this commit.
We provide a validation test.
(cherry picked from commit 837d267cbf)
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.
To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.
We provide a validation test.
(cherry picked from commit 60ea22d887)
Before these changes, the logs in hinted handoff often didn't provide
crucial information like the identifier of the node that hints were
being sent to. Also, some of the logs were misleading and referred to
other places in the code than the one where an exception or some other
situation really occurred.
We modify those logs, extending them by more valuable information
and fixing existing issues. What's more, all of the logs in
`hint_endpoint_manager` and `hint_sender` follow a consistent format
now:
```
<class_name>[<destination host ID>]:<function_name>: <message>
```
This way, we should always have AT LEAST the basic information.
Fixes scylladb/scylladb#25466
Backport:
There is no risk in backporting these changes. They only have
impact on the logs. On the other hand, they might prove helpful
when debugging an issue in hinted handoff.
- (cherry picked from commit 2327d4dfa3)
- (cherry picked from commit d7bc9edc6c)
- (cherry picked from commit 6f1fb7cfb5)
Parent PR: #25470Closesscylladb/scylladb#25537
* github.com:scylladb/scylladb:
db/hints: Add new logs
db/hints: Adjust log levels
db/hints: Improve logs
This PR implements solution proposed in scylladb/scylladb#24481
Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections.
The updated shutdown process is as follows:
1. Initial Shutdown Phase
* Close the accept gate to block new incoming connections.
* Abort all accept() calls.
* For all active connections:
* Close only the input side of the connection to prevent new requests.
* Keep the output side open to allow responses to be sent.
2. Drain Phase
* Wait for all in-progress requests to either complete or fail.
3. Final Shutdown Phase
* Fully close all connections.
Fixes scylladb/scylladb#24481
- (cherry picked from commit 122e940872)
- (cherry picked from commit 3848d10a8d)
- (cherry picked from commit 3610cf0bfd)
- (cherry picked from commit 27b3d5b415)
- (cherry picked from commit 061089389c)
- (cherry picked from commit 7334bf36a4)
- (cherry picked from commit ea311be12b)
- (cherry picked from commit 4f63e1df58)
Parent PR: #24499Closesscylladb/scylladb#25518
* github.com:scylladb/scylladb:
test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`, decrease request timeout.
generic_server: Two-step connection shutdown.
transport: consmetic change, remove extra blanks.
generic_server: replace empty destructor with `= default`
generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output`
generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class.
test: Add test for query execution during CQL server shutdown
The test_base_partition_deletion_with_metrics test case (and the batch
variant) uses the metric of view updates done during its runtime to check
if we didn't perform too many of them. The test runs in the cqlpy suite,
which runs all test cases sequentially on one Scylla instance. Because
of this, if another test case starts a process which generates view
updates and doesn't wait for it to finish before it exists, we may
observe too many view updates in test_base_partition_deletion_with_metrics
and fail the test.
In all test cases we make sure that all tables that were created
during the test are dropped at the end. However, that doesn't
stop the view building process immediately, so the issue can happen
even if we drop the view. I confirmed it by adding a test just before
test_base_partition_deletion_with_metrics which builds a big
materialized view and drops it at the end - the metrics check still failed.
The issue could be caused by any of the existing test cases where we create
a view and don't wait for it to be built. Note that even if we start adding
rows after creating the view, some of them may still be included in the view
building, as the view building process is started asynchronously. In such
a scenario, the view building also doesn't cause any issues with the data in
these tests - writes performed after view creation generate view updates
synchronously when they're local (and we're running a single Scylla server),
the corresponding view udpates generated during view building are redundant.
Because we have many test cases which could be causing this issue, instead
of waiting for the view building to finish in every single one of them, we
move the susceptible test cases to be run on separate Scylla instances, in
the "cluster" suite. There, no other test cases will influence the results.
Fixes https://github.com/scylladb/scylladb/issues/20379Closesscylladb/scylladb#25209
(cherry picked from commit 2ece08ba43)
Closesscylladb/scylladb#25503
The PR fixes a test flakiness issue in test_mv_backlog related to reading metrics.
The first commit fixes a more general issue in the ScyllaMetrics helper class where it doesn't return the value of all matching lines when a specific shard is requested, but it breaks after the first match.
The second commit fixes a test issue where it expects exactly one write to be throttled, not taking into account other internal writes that may be executed during this time.
Fixes https://github.com/scylladb/scylladb/issues/23139
backport to improve CI stability - test only change
- (cherry picked from commit 5c28cffdb4)
- (cherry picked from commit 276a09ac6e)
Parent PR: #25279Closesscylladb/scylladb#25474
* github.com:scylladb/scylladb:
test: test_mv_backlog: fix to consider internal writes
test/pylib/rest_client: fix ScyllaMetrics filtering
We're providing additional information in error messages when throwing
an exception related to data corruption: when a segment is truncated
and when it's content is invalid. That might prove helpful when debugging.
Closesscylladb/scylladb#25190
(cherry picked from commit 408b45fa7e)
Closesscylladb/scylladb#25460
The test test_tombstone_gc_disabled_on_pending_replica was added when
we fixed (#20788) the potential problem with data resurrection during
file based streaming. The issue was occurring only in Enterprise, but
we added the fix in OSS to limit code divergence. This test was added
together with the fix in OSS with the idea to guard this change in OSS.
The real reproducer and test for this fix was added later, after the
fix was ported into Enterprise.
It is in: test/cluster/test_resurrection.py
Since Enterprise has been merged into OSS, there is no more need to
keep the test test_tombstone_gc_disabled_on_pending_replica. Also,
it is flaky with very low probability of failure, making it difficult
to investigate the cause of failure.
Fixes: #22182
Refs: scylladb/scylladb#25448Closesscylladb/scylladb#25134
(cherry picked from commit 7ce96345bf)
Closesscylladb/scylladb#25572
The test creates all driver sessions by itself. As a consequence, all
sessions use the default request timeout of 10s. This can be too low for
the debug mode, as observed in scylladb/scylla-enterprise#5601.
In this commit, we change the test to use `cluster_con`, so that the
sessions have the request timeout set to 200s from now on.
Fixesscylladb/scylla-enterprise#5601
This commit changes only the test and is a CI stability improvement,
so it should be backported all the way to 2024.2. 2024.1 doesn't have
this test.
Closesscylladb/scylladb#25510
(cherry picked from commit 03cc34e3a0)
Closesscylladb/scylladb#25546
We're adding new logs in just a few places that may however prove
important when debugging issues in hinted handoff in the future.
(cherry picked from commit 6f1fb7cfb5)