In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). `wait_for_token_ring_and_group0_consistency` doesn't
handle such a case; it only handles cases where the token ring is
updated later. We fix this in this commit.
We rely on the new implementation of
`wait_for_token_ring_and_group0_consistency` in the following commit to
fix flakiness of some tests.
We also update the obsolete docstring in this commit.
Add another level of verbosity: quiet.
Previously, this level was the default, but it does not provide enough information.
These changes should be coupled with the pytest-sugar plugin to get the intended output at each level.
Invoke pytest as a module, instead of as a separate process, to get access to the terminal and be able to use it interactively.
Framework change only, so backporting to 2025.3.
Fixes: #25403
Closes scylladb/scylladb#25698
* github.com:scylladb/scylladb:
test.py: add additional level of verbosity for output
test.py: start pytest as a module instead of subprocess
This patch introduces a new `incremental_mode` parameter to the tablet
repair REST API, providing more fine-grained control over the
incremental repair process.
Previously, incremental repair was on and could not be turned off. This
change allows users to select from three distinct modes:
- `regular`: This is the default mode. It performs a standard
incremental repair, processing only unrepaired sstables and skipping
those that are already repaired. The repair state (`repaired_at`,
`sstables_repaired_at`) is updated.
- `full`: This mode forces the repair to process all sstables, including
those that have been previously repaired. This is useful when a full
data validation is needed without disabling the incremental repair
feature. The repair state is updated.
- `disabled`: This mode completely disables the incremental repair logic
for the current repair operation. It behaves like a classic
(pre-incremental) repair, and it does not update any incremental
repair state (`repaired_at` in sstables or `sstables_repaired_at` in
the system.tablets table).
The implementation includes:
- Adding the `incremental_mode` parameter to the
`/storage_service/repair/tablet` API endpoint.
- Updating the internal repair logic to handle the different modes.
- Adding a new test case to verify the behavior of each mode.
- Updating the API documentation and developer documentation.
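As an illustration of how a client might select one of the three modes, here is a minimal Python sketch that builds the request URL. Only the endpoint path and the `incremental_mode` parameter come from the description above; the helper name and the `ks`, `table` and `tokens` query parameters are assumptions made for the example.

```python
from urllib.parse import urlencode

# The three mode names are taken from the description above.
INCREMENTAL_MODES = ("regular", "full", "disabled")


def tablet_repair_url(base, keyspace, table, incremental_mode="regular"):
    """Build a URL for the tablet repair endpoint.

    The `ks`, `table` and `tokens` parameter names are assumptions of this
    sketch; only `incremental_mode` is taken from the commit message.
    """
    if incremental_mode not in INCREMENTAL_MODES:
        raise ValueError(f"unknown incremental_mode: {incremental_mode!r}")
    query = urlencode({
        "ks": keyspace,
        "table": table,
        "tokens": "all",
        "incremental_mode": incremental_mode,
    })
    return f"{base}/storage_service/repair/tablet?{query}"
```

Rejecting unknown modes on the client side is a design choice of the sketch, not something the patch is stated to do.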
Fixes #25605
Closes scylladb/scylladb#25693
Populate the local state during gossiper initialization in start_gossiping. This prevents an empty state from being added to _endpoint_state_map and returned in get_endpoint_states responses, which caused an 'empty host id' issue on other nodes during node restarts.
Check for a race condition in do_apply_state_locally: a race condition can occur if a task is suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash.
This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map.
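The check can be modelled with a small Python sketch. This is a toy model under stated assumptions, not the gossiper code: the map is a plain dict, an update is a dict that may lack a `host_id` key, and the function name is hypothetical.

```python
def apply_state_locally(endpoint_state_map, endpoint, update):
    """Apply a gossip update to a dict standing in for _endpoint_state_map.

    Returns True if the update was applied and False if it was dropped.
    """
    # In the real code the entry is locked here, after a possible
    # preemption point during which the endpoint may have been removed.
    existing = endpoint_state_map.get(endpoint)
    if "host_id" not in update and existing is None:
        # Applying this update would create an entry without a host ID.
        return False
    endpoint_state_map[endpoint] = {**(existing or {}), **update}
    return True
```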
Fixes https://github.com/scylladb/scylladb/issues/25831
Fixes https://github.com/scylladb/scylladb/issues/25803
Fixes https://github.com/scylladb/scylladb/issues/25702
Fixes https://github.com/scylladb/scylladb/issues/25621
Ref https://github.com/scylladb/scylla-enterprise/issues/5613
Backport: The issue affects all current releases (2025.x), so this PR needs to be backported to 2025.1 through 2025.3.
Closes scylladb/scylladb#25849
* github.com:scylladb/scylladb:
gossiper: fix empty initial local node state
gossiper: add test for a race condition in start_gossiping
gossiper: check for a race condition in `do_apply_state_locally`
test/gossiper: add reproducible test for race condition during node decommission
Under CPU pressure, the `_locks` structure in paxos might grow and cause
oversized allocations and performance drops. We reserve memory ahead of
time.
Fixes #25559
Closes scylladb/scylladb#25874
The --help text says about --large-memory-allocation-warning-threshold:
"Warn about memory allocations above this size; set to zero to disable."
That's half-true: setting the value to zero spams logs with warnings about
allocations of any size, as seastar treats a zero threshold literally.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#25850
The files object has already been moved from by the time it is logged when
the stream finishes. The files were already logged when the stream started,
so skip logging them at the end of streaming.
Fixes #25830
Closes scylladb/scylladb#25835
This change removes the addition of an empty state to `_endpoint_state_map`.
Instead, a new state is created locally and then published via replicate,
avoiding the issue of an empty state existing in `_endpoint_state_map`
before the preemption point. Since this resolves the issue tested in
`test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed.
Fixes: scylladb/scylladb#25831
This change adds a test for a race condition in `start_gossiping` that
can lead to an empty self state sent in `gossip_get_endpoint_states_response`.
Test for scylladb/scylladb#25831
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.
This change
1. adds a check after locking the map entry: if a gossip ACK update
does not contain a host ID, we verify that an entry with that host ID
still exists in the gossiper’s _endpoint_state_map.
2. Removes xfail from the test_gossiper_race test since the issue is now
fixed.
3. Adds exception handling in `do_shadow_round` to skip responses from
nodes that sent an empty host ID.
This re-applies the commit 13392a40d4 that
was reverted in 46aa59fe49, after fixing
the issues that caused the CI to fail.
Fixes: scylladb/scylladb#25702
Fixes: scylladb/scylladb#25621
Ref: scylladb/scylla-enterprise#5613
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.
There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.
This re-applies the commit 5dac4b38fb that
was reverted in dc44fca67c, but modified
to relax the check from "on_internal_error" to just a warning log. The
stricter check can be re-introduced later once we are sure that all
remaining problems are resolved and it will not break the CI.
Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721
Enabling the configuration option should have no negative impact on how the tool
behaves. There is no topology and we do not create any keyspaces (except for
trivial ones using `SimpleStrategy` and RF=1), only their metadata. Thanks to
that, we don't go through validation logic that could fail in presence of an
RF-rack-invalid keyspace.
On the other hand, enabling `rf_rack_valid_keyspaces` lets the tool access code
hidden behind that option. While that might not be of any consequence right now,
in the future it might be crucial (for instance, see: scylladb/scylladb#23030).
Note that other tools don't need an adjustment:
* scylla-types: it uses schema_builder, but it doesn't reuse any other
relevant part of Scylla.
* nodetool: it manages Scylla instances but is not an instance itself, and it
does not reuse any codepaths.
* local-file-key-generator: it has nothing to do with Scylla's logic.
Other files in the `tools` directory are auxiliary and are instructed with an
already created instance of `db::config`. Hence, no need to modify them either.
Fixes scylladb/scylladb#25792
Closes scylladb/scylladb#25794
When triggering the backport process, check for P0 and P1 labels; if present, add them to the backport PR together with the force_on_cloud label.
Implemented first in pkg to test the process; it will be moved to scylladb later.
Fixes: PKG-62
Closes scylladb/scylladb#25856
Previously, the script attempted to assign GitHub Actions expressions directly within a Bash string using '${{ ... }}', which is invalid syntax in shell scripts. This caused the label JSON to be treated as a literal string instead of actual data, leading to parsing failures and incorrect backport readiness checks.
This update ensures the label data is passed correctly via the LABELS_JSON environment variable, allowing jq to properly evaluate label names and conditions.
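The workflow itself uses Bash and jq; the following Python sketch only illustrates the same idea of reading the label data from the environment instead of interpolating it into the script text. The function name and the assumed shape of LABELS_JSON (a JSON array of objects with a "name" field) are assumptions of the example.

```python
import json
import os


def label_names(env=os.environ):
    """Return the label names found in the LABELS_JSON environment variable.

    Assumes LABELS_JSON holds a JSON array such as '[{"name": "P0"}]';
    that shape is an assumption of this sketch.
    """
    labels = json.loads(env.get("LABELS_JSON", "[]"))
    return [label["name"] for label in labels]
```

Because the JSON arrives as data rather than as part of the script source, quoting inside label names cannot break the parsing.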
Fixes: PKG-74
Closes scylladb/scylladb#25858
Sometimes the `vector_store_client_test_ann_request` test hangs. It is hard to
reproduce.
It seems that the problem is that tests are unreliable in case of stalled
requests. This patch attaches a timer to the abort_source to ensure that
the test will finish with a timeout at least.
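The pattern of bounding a possibly stalled request with a timer can be sketched by analogy in Python's asyncio. This is not the Seastar code from the patch; `run_with_deadline` is a hypothetical helper and `asyncio.wait_for` stands in for the timer attached to the abort_source.

```python
import asyncio


async def run_with_deadline(request, timeout_s):
    """Await `request`, cancelling it if it stalls longer than `timeout_s`."""
    try:
        return await asyncio.wait_for(request, timeout=timeout_s)
    except asyncio.TimeoutError:
        # The stalled request was cancelled; report it instead of hanging.
        return "timed out"
```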
Fixes: VECTOR-150
Fixes: #25234
Closes scylladb/scylladb#25301
Consider the following scenario:
- Current replica set is [A, B, C]
- write succeeds on [A, B], and a hint is logged for node C
- before the hint is replayed, D bootstraps and the token migrates from C to D
- hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done
- C is cleaned up, replayed data is lost, and D has a stale copy until next repair.
In the scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets,
as it can happen for every tablet migration.
This issue is particularly detrimental to materialized views. View updates use hints by default and a specific
view update may be sent to just one view replica (when a single base replica has a different row state due to
reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency
between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due
to a lost row update. Such inconsistencies can't be fixed by repairing either the view or the base table.
To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original
target is still alive.
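A minimal Python sketch of the target selection described above, assuming replicas are identified by simple names; `hint_targets` is a hypothetical helper, not a function from the patch.

```python
def hint_targets(original_target, pending_replicas):
    """Return the replicas a hint is sent to, without duplicates.

    The original target is always kept, even if it is still alive, and
    every pending replica for the hint's token is added after it.
    """
    targets = [original_target]
    for replica in pending_replicas:
        if replica not in targets:
            targets.append(replica)
    return targets
```

In the scenario above, `hint_targets("C", ["D"])` yields both C and D, so D receives the write even though streaming for that token has already finished.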
This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't
be too common either. The scenarios for them are:
1. managing to send the hint to the source of a migrating replica before its token is streamed - the write will
arrive on the pending replica anyway during streaming
2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to
the actual source of the migration, the pending replica will get it during streaming
3. sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending
replica for the hint so we'll send it multiple times
4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the
pending replica, so we need to retry the entire write
This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid
sending the hint to pending replicas if the hint target is not the source of the migration, which would allow us to
avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are
in the same rack.
We also add a test case reproducing the issue.
Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/19835
Closes scylladb/scylladb#25590
In S3 client both read and write metrics have three counters -- number
of requests made, number of bytes processed and request latency. In most
of the cases all three counters are updated at once -- upon response
arrival.
However, in case of chunked download source this way of accounting
metrics is misleading. In this code the request is made once, and then
the obtained bytes are consumed eventually as the data arrive.
Currently, each time a new portion of data is read from the socket the
number of read requests is incremented. That's wrong, the request is
made once, and this counter should also be incremented once, not for
every data buffer that arrived in response.
Same for read request latency -- it's "added" for every data buffer that
arrives, but receiving the data is a lengthy process; the _request_ latency
should be accounted once per response. Maybe later we'll want to have "data
latency" metrics as well, but for what we have now it's request latency.
The number of read bytes is accounted properly, so not touched here.
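The intended accounting for a chunked download can be sketched in Python as follows. The `ReadStats` class and the function name are hypothetical; the sketch only shows which counters move once per request and which move per buffer.

```python
from dataclasses import dataclass


@dataclass
class ReadStats:
    requests: int = 0
    bytes_read: int = 0
    latency_s: float = 0.0


def account_chunked_download(stats, request_latency_s, buffers):
    """Account one chunked download whose body arrives as several buffers."""
    # The request is made once, so count it and its latency once.
    stats.requests += 1
    stats.latency_s += request_latency_s
    # Bytes are accounted for every buffer that arrives.
    for buf in buffers:
        stats.bytes_read += len(buf)
```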
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#25770
In 41880bc893 ("cql3: statement_restrictions: forbid
querying a single-column inequality restriction on a
multi-column restriction"), we removed the ability to constrain
a single column on a tuple inequality, on the grounds that it
isn't used and can't be used.
Here, we extend this to remove the ability to constrain a
single column on a tuple equality, on the grounds that it isn't used
and hampers further refactoring.
CQL supports multi-column equality restrictions in the form
(ck1, ck2, ck3) = (:v1, :v2, :v3)
This restriction shape is only allowed on clustering keys, and
is translated into a partition_slice allowing the primary index
to efficiently select the part of the partition that satisfies the
restriction.
The possible_lhs_values() function allows extracting
single-column restrictions from this and similar tuple restrictions.
For example, the multi-column restriction
(ck1, ck2, ck3) = (:v1, :v2, :v3)
implies that ck2 = :v2. If we have an index on ck2, and if we don't
further have a restriction on the partition key, then it is
advantageous to use the index to select rows, and then filter
on ck1 and ck3 to satisfy the full restriction.
However, we never actually do that. The following sequence
```cql
CREATE TABLE ks.t1 (
    pk int,
    ck1 int,
    ck2 int,
    PRIMARY KEY (pk, ck1, ck2)
);
CREATE INDEX ON ks.t1(ck1);
SELECT *
    FROM ks.t1
    WHERE (ck1, ck2) = (1, 2);
```
could have been used to query a single partition via the index, but instead
is used for a full table scan, using the partition slice to skip through
unselected rows.
We can't easily start using a new query plan via the index, since
switching plans mid-query (due to paging and moving from one coordinator
to another during upgrade) would cause the sort order to change, therefore
causing some rows to be omitted and some rows to be returned twice.
Similarly, we cannot extract a token restriction from a tuple, since
the grammar doesn't allow for
```cql
WHERE (token(pk)) = (:var1)
```
Since it's not used, remove it.
This code was first introduced in d33053b841 ("cql3/restrictions: Add
free functions over new classes")
It does not directly correspond to pre-expression code.
Closes scylladb/scylladb#25757
Closes scylladb/scylladb#25821
Interval's copy and move constructors are full of branches since the two payload values of type T are
optional and therefore have to be optionally-constructed. This can be eliminated for
trivially copyable types (like dht::token) by eliminating interval's user-defined special member
functions (constructors etc) in that special case.
In turn, this enables optimizations in the standard library (and our own containers) that
convert moves/copies of spans of such types into memcpy().
Minor optimization, not a candidate to backport.
Closes scylladb/scylladb#25841
* github.com:scylladb/scylladb:
test: nonwrapping_interval_test: verify an interval of tokens is trivial
interval: specialize interval_data<T> for trivial types
interval: split data members into new interval_data class
C++ data movement algorithms (std::uninitialized_copy() and friends)
and the containers that use them optimize for trivially copyable
and destructible types by calling memcpy instead of using a loop
around constructors/destructors. Make intervals of trivially
copyable and destructible types also trivially copyable and
destructible by specializing interval_data<T> not to have
user-defined special member functions. This requires that T have
a default constructor since we can't skip construction when
!_start_exists or !_end_exists.
To choose whether we specialize or not, we look at default
constructibility (see above) and trivial destructibility. This is
wider than trivial copyability (a user-defined copy constructor
can exist) but is still beneficial, since the generated copy
constructor for interval_data<T> will be branch-free.
We don't implement the poison words in debug mode; nor are they
necessary, since we no longer manage the lifetime of _start_value
and _end_value manually but let the compiler do that for us.
Note [1] prevents full conversion to memcpy for now, but we still
get branch free code.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121789
Prepare for specialized handling of trivial types by extracting
the data members of wrapping_interval<T> and the special member
functions (constructors/destructors/assignment) into a new
interval_data<T> template.
To avoid having to refer to data members with a this-> prefix,
add using declarations in wrapping_interval<T>.
The memory usage is tracked with the help of a semaphore, so just export
its "consumed" units.
One tricky place here is the need to skip metrics registration for
the scylla-sstable tool. The thing is that the tool starts the storage
manager and sstables manager on start and then some of the tool's operations
may want to start both managers again (via cql environment), causing
a double metrics registration exception.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#25769
The previous version had a problem: Fork PRs didn't pass the Jira credentials to the main code, which updates the Jira key status.
No need for backport. This is not the Scylla code, but a fix to GitHub Actions.
Closes scylladb/scylladb#25833
* github.com:scylladb/scylladb:
Change pull_request event to pull_request_target - ready for merge
Update workflow to use pull_request_target event - in review
Change pull_request event to pull_request_target - in progress
Add another level of verbosity: quiet.
Previously, this level was the default, but it does not provide enough
information. These changes should be coupled with the pytest-sugar plugin to
get the intended output at each level.
Before the patch, a user with CREATE access could create a table with CDC, or alter a table to enable CDC, but could not run a SELECT on the CDC table they created.
This was because the SELECT permission was checked on the CDC log, and later on its "parent" - the keyspace, but not on the base table, on which the user had SELECT permission automatically granted on CREATE.
This patch matches the behavior of querying the CDC log to the one implemented for Materialized Views:
1. No new permissions are granted on CREATE.
2. When querying SELECT, the permissions on base table SELECT are checked.
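The permission lookup can be sketched in Python as below. The `_scylla_cdc_log` suffix for CDC log table names and the helper name are assumptions of the example, not details stated in this message.

```python
# Assumed naming convention for CDC log tables in this sketch.
CDC_LOG_SUFFIX = "_scylla_cdc_log"


def table_to_authorize(table_name, cdc_enabled_bases):
    """Return the table whose SELECT permission should be checked.

    For a CDC log of a known base table this is the base table, matching
    the behavior described for materialized views; otherwise it is the
    queried table itself.
    """
    if table_name.endswith(CDC_LOG_SUFFIX):
        base = table_name[: -len(CDC_LOG_SUFFIX)]
        if base in cdc_enabled_bases:
            return base
    return table_name
```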
Fixes: https://github.com/scylladb/scylladb/issues/19798
Fixes: VECTOR-151
Closes scylladb/scylladb#25797
* github.com:scylladb/scylladb:
cqlpy/test_permissions: run the reproducer tests for #19798
select_statement: check for access to CDC base table
The pre-scrub snapshot is taken in the middle of parsing options from the request. If the post-snapshot part of the parsing throws (it can do so if the "quarantine_mode" value is not recognized), the snapshot remains on disk, but the API call fails.
The fix is to move snapshot taking out of the parse_scrub_options() helper. It could be moved at the end of it, but the helper name doesn't tell that it also takes a snapshot, so no. After the fix the helper in question can be simplified further.
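The ordering after the fix can be sketched in Python: parse and validate everything first, and only then perform the side effect. The function names mirror the text, but the accepted `quarantine_mode` values and the callback-based snapshot are assumptions of the sketch.

```python
# Accepted values are an assumption of this sketch.
QUARANTINE_MODES = ("include", "exclude", "only")


def parse_scrub_options(params):
    """Validate request parameters; raises ValueError on a bad value."""
    mode = params.get("quarantine_mode", "include")
    if mode not in QUARANTINE_MODES:
        raise ValueError(f"unknown quarantine_mode: {mode!r}")
    return {"quarantine_mode": mode}


def scrub(params, take_snapshot):
    # Parsing may raise, but nothing has been written to disk yet.
    options = parse_scrub_options(params)
    # The snapshot is taken only after all options were accepted.
    take_snapshot()
    return options
```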
The issue exists in older versions, but likely doesn't reveal itself for real, so it doesn't look worthwhile to backport it.
Closes scylladb/scylladb#25824
* github.com:scylladb/scylladb:
api: Simplify parse_scrub_options() helper
api: Take snapshot after parsing scrub options
Parsing scrub options may throw after a snapshot is taken, thus leaving
it on disk even though the operation is reported as "failed". Probably
not critical, but not nice either.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It was not enabled due to some cqlsh dependency missing.
After 3 years it's hard to say if the thing is fixed or not,
but anyway we don't need another big dependency while we already
have the python driver used extensively in tests. We use a simple
wrapper file, exec_cql.py, shared with the auth_conns workload, to
conveniently read the needed preparation statements from the file.
Additionally, we switch tablets off as counters don't support
them yet.
It uses some derived roles and permissions
to exercise auth code paths and also creates a new
connection with each stress request to also exercise
the transport/server.cc connection handling code.
Before this change, executing `DESCRIBE MATERIALIZED VIEW` on the underlying
materialized view of a secondary index would produce a `CREATE INDEX` statement.
It was not only confusing, but it also prevented the user from learning
the definition of the view. The only way to do so was to query system tables.
We change that behavior and produce a `CREATE MATERIALIZED VIEW` statement
instead. The statement is printed as a comment to implicitly convey that
the user should not attempt to execute it to restore the view. A short comment
is provided to make it clearer.
Before this commit:
```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE MATERIALIZED VIEW ks.i;
CREATE INDEX i ON ks.t(v);
```
After this commit:
```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE MATERIALIZED VIEW ks.i;
/* Do NOT execute this statement! It's only for informational purposes.
This materialized view is the underlying materialized view of a secondary
index. It can be restored via restoring the index.
CREATE MATERIALIZED VIEW ks.i_index [...];
*/
```
Note that describing the base table has not been affected and still works
as follows:
```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE TABLE ks.t;
CREATE TABLE ks.t (
    p int,
    v int,
    PRIMARY KEY (p)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'IncrementalCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE'
    AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'};
CREATE INDEX i ON ks.t(v);
```
We also provide two reproducers of scylladb/scylladb#24610.
Fixes scylladb/scylladb#24610
Closes scylladb/scylladb#25697
An instance of `cdc::topology_description` can be quite big. The vector
it consists of stores as many `token_range_description`s as there are
vnodes, and the size of each `token_range_description` is O(#shards).
Because of that, copying an instance of the type can lead to reactor
stalls. To prevent that, we introduce an asynchronous function copying
the contents of the object.
Reactor stalls were detected in the call to `map_reduce` in
`generation_service::legacy_do_handle_cdc_generation`, so let's start
using the new function there.
A similar scenario occurs in `generation_service::handle_cdc_generation`,
so we modify it too.
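The idea of an asynchronous copy that gives other tasks a chance to run can be sketched by analogy in Python's asyncio. This is not the Seastar implementation; `clone_entries` and the `yield_every` parameter are hypothetical, and nested lists stand in for the vector of `token_range_description`s.

```python
import asyncio


async def clone_entries(entries, yield_every=128):
    """Copy a list of per-vnode entries, yielding to the loop periodically."""
    copy = []
    for i, entry in enumerate(entries, start=1):
        # Each entry holds O(#shards) items, so copy it individually.
        copy.append(list(entry))
        if i % yield_every == 0:
            # Let other tasks run instead of stalling the event loop.
            await asyncio.sleep(0)
    return copy
```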
Unfortunately, it doesn't seem viable to provide a reproducer of said
problem.
Fixes scylladb/scylladb#24522
Backport: none. Reactor stalls are not critical.
Closes scylladb/scylladb#25730
* github.com:scylladb/scylladb:
cdc/generation: Delete copy constructors of topology_description
cdc/generation: Clone topology_description asynchronously
Before the patch, a user with CREATE access could create a table
with CDC, or alter a table to enable CDC, but could not run
a SELECT on the CDC table they created.
This was because the SELECT permission was checked on
the CDC log, and later on its "parent" - the keyspace,
but not on the base table, on which the user had SELECT permission
automatically granted on CREATE.
This patch matches the behaviour of querying the CDC log
to the one implemented for Materialized Views:
1. No new permissions are granted on CREATE.
2. When querying SELECT, the permissions on base table
SELECT are checked.
Fixes: #19798
Determine the progress of compaction tasks that have
children.
The progress of a compaction task is calculated using the default
get_progress method. If the expected_total_workload method is
implemented, the default progress is computed as:
(sum of child task progresses) / (expected total workload)
If expected_total_workload is not defined, progress is estimated based
on children progresses. However, in this case, the total progress may
increase over time as the task executes.
All compaction tasks, except for reshape tasks, implement the
expected_children_number method. To compute expected_total_workload,
iterate over all SSTables covered by the task and sum their sizes. Note
that expected_total_workload is just an approximation and the real workload
may differ if the set of SSTables for the keyspace/table/compaction group changes.
Reshape tasks are an exception, as their scope is determined during
execution. Hence, for these tasks expected_total_workload isn't defined
and their progress (both total and completed) is determined based
on currently created children.
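The two cases can be sketched in Python as follows. The function name and the representation of each child as a (completed, total) pair are assumptions of the example.

```python
def task_progress(children, expected_total_workload=None):
    """Return (completed, total) for a task that has children.

    `children` is a list of (completed, total) pairs, one per child task.
    """
    completed = sum(done for done, _ in children)
    if expected_total_workload is not None:
        # Total is known up front, e.g. the summed SSTable sizes.
        total = expected_total_workload
    else:
        # Reshape-like case: total grows as more children are created.
        total = sum(child_total for _, child_total in children)
    return completed, total
```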
Fixes: https://github.com/scylladb/scylladb/issues/8392.
Fixes: https://github.com/scylladb/scylladb/issues/6406.
Fixes: https://github.com/scylladb/scylladb/issues/7845.
New feature, no backport needed
Closes scylladb/scylladb#15158
* github.com:scylladb/scylladb:
test: add compaction task progress test
compaction: set progress unit for compaction tasks
compaction: find expected workload for reshard tasks
compaction: find expected workload for global cleanup compaction tasks
compaction: find expected workload for global major compaction tasks
compaction: find expected workload for keyspace compaction tasks
compaction: find expected workload for shard compaction tasks
compaction: find expected workload for table compaction tasks
compaction: return empty progress when compaction_size isn't set
compaction: update compaction_data::compaction_size at once
tasks: do not check expected workload for done task