The cqlpy test test_materialized_view.py::test_view_in_system_tables
checks that the system table "system.built_views" can inform us that
a view has been built. This test was flaky, starting to fail quite
often recently, and this patch fixes the problem in the test.
For historical reasons this test began by calling a utility function
wait_for_view_built() - which uses a different system table,
system_distributed.view_build_status - to wait until the view was built.
The test then immediately tried to verify that system.built_views
also lists this view.
But there is no real reason why we can assume - or want to assume -
that these two tables are updated in this order, or how much time
passes between the two tables being changed. The authors of this
test already acknowledged there is a problem - they included a hack
purporting to be a "read barrier" that claimed to solve this exact
problem - but it seems it doesn't, or at least no longer does after
recent changes to the view builder's implementation.
The solution is simple - just remove the call to wait_for_view_built()
and the "hack" after it. We should just wait in a loop (until a timeout)
for the system table that we really wanted to check - system.built_views.
It's as simple as that. No need for any other assumptions or hacks.
Fixes #27296
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#27626
(cherry picked from commit ccacea621f)
Closes scylladb/scylladb#27670
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.
The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.
Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.
Fixes scylladb/scylladb#27612
Closes scylladb/scylladb#27622
(cherry picked from commit b9ec1180f5)
Closes scylladb/scylladb#27671
Add pull_request_target event with unlabeled type to trigger-scylla-ci
workflow. This allows automatic CI triggering when the 'conflicts' label
is removed from a PR, in addition to the existing manual trigger via
comment.
The workflow now runs when:
- A user posts a comment with '@scylladbbot trigger-ci' (existing)
- The 'conflicts' label is removed from a PR (new)
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84
Closes scylladb/scylladb#27521
(cherry picked from commit f7ffa395a8)
Closes scylladb/scylladb#27602
There is no 'regular' incremental mode anymore.
The example seems to have meant 'disabled'.
Fixes #27587
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5fae4cdf80)
This commit removes the now redundant driver pages from
the ScyllaDB documentation. Instead, links to the pages
where the driver information was moved are added.
Also, the links are updated across the ScyllaDB manual.
Redirections are added for all the removed pages.
Fixes https://github.com/scylladb/scylladb/issues/26871
Closes scylladb/scylladb#27277
(cherry picked from commit c5580399a8)
Closes scylladb/scylladb#27442
The rf_rack_valid_keyspaces option needs to be turned on in order to
allow creating materialized views in tablet keyspaces with numeric RF
per DC. This is also necessary for secondary indexes because they use
materialized views underneath. However, this option is _not_ necessary
for vector store indexes because those use the external vector store
service for querying the list of keys to fetch from the main table, they
do not create a materialized view. The rf_rack_valid_keyspaces option
was, by accident, required for vector indexes, too.
Remove the restriction for vector store indexes as it is completely
unnecessary.
Fixes: SCYLLADB-81
Closes scylladb/scylladb#27447
(cherry picked from commit bb6e41f97a)
Closes scylladb/scylladb#27455
When a vector store node becomes unreachable, a client request sent
before the keep-alive timer fires would hang until the CQL query
timeout was reached.
This occurred because the HTTP request writes to the TCP buffer and then
waits for a response. While data is in the buffer, TCP retransmissions
prevent the keep-alive timer from detecting the dead connection.
This patch resolves the issue by setting the `TCP_USER_TIMEOUT` socket
option, which applies an effective timeout to TCP retransmissions,
allowing the connection to fail faster.
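For illustration, here is a minimal sketch of the socket option at the plain POSIX level; the actual patch goes through Seastar's connection setup, and the function name and 10-second value here are assumptions, not the configured values:
```c++
#include <netinet/tcp.h>
#include <sys/socket.h>

// Illustrative only: bound TCP retransmissions so a dead peer is detected
// after ~10 seconds instead of hanging until the CQL query timeout.
// 'fd' is a connected TCP socket; the value is in milliseconds.
bool set_tcp_user_timeout(int fd, unsigned timeout_ms = 10000) {
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms)) == 0;
}
```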
Closes scylladb/scylladb#27388
(cherry picked from commit a54bf50290)
Closes scylladb/scylladb#27423
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:
```c++
// [1, 2, 3, 0] --> [0, 1, 3, 6]
std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```
However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is of uint64_t.
As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.
An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57
The fix is simple: the initial value is passed as uint64_t instead of int.
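A minimal sketch of the corrected call (illustrative, not the literal patch); the only change that matters is passing a uint64_t initial value so the accumulator is 64-bit:
```c++
#include <cstdint>
#include <numeric>
#include <vector>

int main() {
    std::vector<uint64_t> global_offset = {1, 2, 3, 0};
    // uint64_t{0} (rather than a plain int 0) makes the accumulator 64-bit,
    // so the prefix sum can no longer wrap around at 2^31.
    std::exclusive_scan(global_offset.begin(), global_offset.end(),
                        global_offset.begin(), uint64_t{0}, std::plus<>());
    // global_offset is now {0, 1, 3, 6}
}
```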
Fixes https://github.com/scylladb/scylladb/issues/27417
Closes scylladb/scylladb#27418
(cherry picked from commit 9696ee64d0)
Even though `view_building_coordinator::work_on_view_building` has
an `if` at the very beginning which checks whether the currently
processed base table is set, it only prints a message and continues
executing the rest of the function regardless of the result of the
check. However, some of the logic in the function assumes that the
currently processed base table field is set and tries to access the
value of the field. This can lead to the view building coordinator
accessing a disengaged optional, which is undefined behavior.
Fix the function by adding the clearly missing `co_return` to the check.
A regression test is added which checks that the view building state
observer - a different fiber which used to print a weird message due to
erroneous view building coordinator behavior - does not print a warning.
Fixes: scylladb/scylladb#27363
Closes scylladb/scylladb#27373
(cherry picked from commit 654ac9099b)
Closes scylladb/scylladb#27406
This PR introduces two key improvements to the robustness and resource management of vector search:
Proper Abort on CQL Timeout: Previously, when a CQL query involving a vector search timed out, the underlying ANN query to the vector store was not aborted and would continue to run. This has been fixed by ensuring the abort source is correctly signaled, terminating the ANN request when its parent CQL query expires and preventing unnecessary resource consumption.
Faster Failure Detection: The connection and keep-alive timeouts for vector store nodes were excessively long (2 and 11 minutes, respectively), causing significant delays in detecting and recovering from unreachable nodes. These timeouts are now aligned with the request_timeout_in_ms setting, allowing for much faster failure detection and improving high availability by failing over from unresponsive nodes more quickly.
Fixes: SCYLLADB-76
This issue affects the 2025.4 branch, where similar HA recovery delays have been observed.
- (cherry picked from commit b6afacfc1e)
- (cherry picked from commit 086c6992f5)
Parent PR: #27377
Closes scylladb/scylladb#27391
* github.com:scylladb/scylladb:
vector_search: Fix ANN query abort on CQL timeout
vector_search: Reduce connection and keep-alive timeouts
When a CQL vector search request timed out, the underlying ANN query was
not aborted and continued to run. This happened because the abort source
was not being signaled upon request expiration.
This commit ensures the ANN query is aborted when the CQL request times out,
preventing unnecessary resource consumption.
The connection timeout was 2 minutes and the keep-alive
timeout was 11 minutes. If a vector store node became unreachable, these
long timeouts caused significant delays before the system could recover,
negatively impacting high availability.
This change aligns both timeouts with the `request_timeout`
configuration, which defaults to 10 seconds. This allows for much
faster failure detection and recovery, ensuring that unresponsive nodes
are failed over from more quickly.
Consider this:
1) n1 is the topology coordinator
2) n1 schedules and executes a tablet repair with session id s1 for a
tablet on n3 and n4.
3) n3 and n4 take the lock and store it in _rs._repair_compaction_locks[s1]
4) n1 steps down before it executes
locator::tablet_transition_stage::end_repair
5) n2 becomes the new topology coordinator
6) n2 runs locator::tablet_transition_stage::repair again
7) n3 and n4 try to take the lock again and hang since the lock is
already taken.
To avoid the deadlock, we can throw in step 7 so that n2 will
proceed to end_repair stage and release the lock. After that, the
scheduler could schedule the tablet repair request again.
Fixes #26346
Closes scylladb/scylladb#27163
(cherry picked from commit da5cc13e97)
Closes scylladb/scylladb#27337
This commit fixes the information about object storage:
- Object storage configuration is no longer marked as experimental.
- Redundant information has been removed from the description.
- Information related to object storage for SSTables has been removed
as the feature is not working.
Fixes https://github.com/scylladb/scylladb/issues/26985
Closes scylladb/scylladb#26987
(cherry picked from commit 724dc1e582)
Closes scylladb/scylladb#27211
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assignment operator to call the destructor like it should.
Fixes https://github.com/scylladb/scylladb/issues/27141
- (cherry picked from commit 9f97c376f1)
- (cherry picked from commit 5dcdaa6f66)
Parent PR: #27140
Closes scylladb/scylladb#27276
* https://github.com/scylladb/scylladb:
test: test that expired erm that held for too long triggers notification
token_metadata: fix notification about expiring erm held for to long
We currently ignore the `_excluded` field in `node::clone()` and the verbose
formatter of `locator::node`. The first one is a bug that can have
unpredictable consequences on the system. The second one can be a minor
inconvenience during debugging.
We fix both places in this PR.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72
This PR is a bugfix that should be backported to all supported branches.
- (cherry picked from commit 4160ae94c1)
- (cherry picked from commit 287c9eea65)
Parent PR: #27265
Closes scylladb/scylladb#27291
* https://github.com/scylladb/scylladb:
locator/node: include _excluded in verbose formatter
locator/node: preserve _excluded in clone()
Before this series, Alternator's CreateTable operation defaulted to creating a table replicated with vnodes, not tablets. The reasons for this default included missing support for LWT, Materialized Views, Alternator TTL and Alternator Streams if tablets are used. But today, all of these (except the still-experimental Alternator Streams) are fully available with tablets, so we are finally ready to switch Alternator to use tablets by default in new tables.
We will use the same configuration parameter that CQL uses, tablets_mode_for_new_keyspaces, to determine whether new keyspaces use tablets by default. If set to `enabled`, tablets are used by default on new tables. If set to `disabled`, tablets will not be used by default (i.e., vnodes will be used, as before). A third value, `enforced` is similar to `enabled` but forbids overriding the default to vnodes when creating a table.
As before, the user can set a tag during the CreateTable operation to override the default choice of tablets or vnodes (unless in `enforced` mode). This tag is now named `system:initial_tablets` - whereas before this patch it was called `experimental:initial_tablets`. The rules stay the same as with the earlier, experimental:initial_tablets tag: when supplied with a numeric value, the table will use tablets. When supplied with something else (like a string "none"), the table will use vnodes.
Fixes https://github.com/scylladb/scylladb/issues/22463
Backport to 2025.4, it's important not to delay phasing out vnodes.
- (cherry picked from commit 403068cb3d)
- (cherry picked from commit af00b59930)
- (cherry picked from commit 376a2f2109)
- (cherry picked from commit 35216d2f01)
- (cherry picked from commit 7466325028)
- (cherry picked from commit c7de7e76f4)
- (cherry picked from commit 63897370cb)
- (cherry picked from commit 274d0b6d62)
- (cherry picked from commit 345747775b)
- (cherry picked from commit a659698c6d)
- (cherry picked from commit eeb3a40afb)
- (cherry picked from commit b34f28dae2)
- (cherry picked from commit 25439127c8)
- (cherry picked from commit c03081eb12)
- (cherry picked from commit 65ed678109)
Parent PR: #26836
Closes scylladb/scylladb#26949
* github.com:scylladb/scylladb:
test/cluster: modify test to not fail on 2025.4 branch
Fix backport conflicts
test,alternator: use 3-rack clusters in tests
alternator: improve error in tablets_mode_for_new_keyspaces=enforced
config: make tablets_mode_for_new_keyspaces live-updatable
alternator: improve comment about non-hidden system tags
alternator: Fix test_ttl_expiration_streams()
alternator: Fix test_scan_paging_missing_limit()
alternator: Don't require vnodes for TTL tests
alternator: Remove obsolete test from test_table.py
alternator: Fix tag name to request vnodes
alternator: Fix test name clash in test_tablets.py
alternator: test_tablets.py handles new policy reg. tablets
alternator: Update doc regarding tablets support
alternator: Support `tablets_mode_for_new_keyspaces` config flag
Fix incorrect hint for tablets_mode_for_new_keyspaces
Fix comment for tablets_mode_for_new_keyspaces
Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability.
With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead.
In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments.
Fixes https://github.com/scylladb/scylladb/issues/26311
This patch needs to be backported to 2025.4.
- (cherry picked from commit 6d853c8f11)
- (cherry picked from commit 08974e1d50)
- (cherry picked from commit eb04af5020)
- (cherry picked from commit 24d69b4005)
- (cherry picked from commit fb8cbf1615)
- (cherry picked from commit fe9581f54c)
Parent PR: #26897
Closes scylladb/scylladb#27266
* github.com:scylladb/scylladb:
docs/dev/view-building-coordinator: update the docs after recent changes
db/view/view_building: send coordinator's term in the RPC
db/view/view_building_state: replace task's state with `aborted` flag
db/view/view_building_coordinator: batch finished tasks reporting
db/view/view_building_worker: change internal implementation
db/view/view_building_coordinator: change `work_on_tasks` RPC return type
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.
(cherry picked from commit 4160ae94c1)
The purpose of the test
cluster/test_alternator::test_alternator_ttl_scheduling_group
is to verify that during TTL expiration scans and deletions, all of the
CPU is used in the "streaming" scheduling group, not in the statement
scheduling group ("sl:default") as we had in the past due to bugs.
It appears that in branch 2025.4 we have a new bug - which doesn't exist
in master - that causes some tablets-related work which I couldn't
identify to be done in sl:default, and causes this test to fail.
The simple fix is to sleep for 5 seconds after writing the data, and
it seems that by that time, the sl:default work is done.
This change doesn't make the Alternator TTL test any weaker, so we
need to make this change to allow Alternator to go forward.
Sadly, it does mean that the only test we have for this apparent
bug (which has nothing to do with Alternator) will be gone.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
With tablets enabled, we can't create an Alternator table on a three-
node cluster with a single rack, since Scylla refuses RF=3 with just
one rack and we get the error:
An error occurred (InternalServerError) when calling the CreateTable
operation: ... Replication factor 3 exceeds the number of racks (1) in
dc datacenter1
So in test/cluster/test_alternator.py we need to use the incantation
"auto_rack_dc='dc1'" every time that we create a three-node cluster.
Before this patch, several tests in test/cluster/test_alternator.py
failed on this error, with this patch all of them pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 65ed678109)
When in tablets_mode_for_new_keyspaces=enforced mode, Alternator is
supposed to fail when CreateTable asks explicitly for vnodes. Before
this patch, this error was an ugly "Internal Server Error" (an
exception thrown from deep inside the implementation), this patch
checks for this case in the right place, to generate a proper
ValidationException with a proper error message.
We also enable the test test_tablets_tag_vs_config which should have
caught this error, but didn't because it was marked xfail because
tablets_mode_for_new_keyspaces had not been live-updatable. Now that
it is, we can enable the test. I also improved the test to be slightly
faster (no need to change the configuration so many times) and also
check the ordinary case - where the schema chooses neither
vnodes nor tablets explicitly and we should just use the default.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit c03081eb12)
We have a configuration option "tablets_mode_for_new_keyspaces" which
determines whether new keyspaces should use tablets or vnodes.
For some reason, this configuration parameter was never marked live-
updatable, so in this patch we add the flag. No other changes are needed -
the existing code that uses this flag always uses it through the
up-to-date configuration.
In the previous patches we started to honor tablets_mode_for_new_keyspaces
also in Alternator CreateTable, and we wanted to test this but couldn't
do this in test/alternator because the option was not live-updatable.
Now that it will be, we'll be able to test this feature in
test/alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 25439127c8)
The previous patches added a somewhat misleading comment in front of
system:initial_tablets, which this patch improves.
That tag is NOT where Alternator "stores" table properties like the
existing comment claimed. In fact, the whole point is that it's the
opposite - Alternator never writes to this tag - it's a user-writable
tag which Alternator *reads*, to configure the new table. And this is
why it obviously can't be hidden from the user.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit b34f28dae2)
With tablets, the test began failing. The failure was correlated with
the number of initial tablets, which when kept at default, equals
4 tablets per shard in release build and 2 tablets per shard in dev
build.
In this patch we split the test into two - one with more data in
the table to check the original purpose of this test - that Scan
doesn't return the entire table in one page if "Limit" is missing.
The other test reproduces issue #10327 - that when the table is
small, Scan's page size isn't strictly limited to 1MB as it is in
DynamoDB.
Experimentally, 8000 KB of data (compared to 6000 KB before this patch)
is enough when we have up to 4 initial tablets per shard (so 8 initial
tablets on a two-shard node as we typically run in tests).
Original patch by Piotr Szymaniak <piotr.szymaniak@scylladb.com>
modified by Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit a659698c6d)
Since #23662 Alternator supports TTL with tablets too. Let's clear some
leftovers causing Alternator to test TTL with vnodes instead of with
what is the default for Alternator (tablets or vnodes).
(cherry picked from commit 345747775b)
Since Alternator is capable of running with tablets according to the
flag in config, remove the obsolete test that is making sure
that Alternator runs with vnodes.
(cherry picked from commit 274d0b6d62)
The tag was recently renamed from `experimental:initial_tablets` to
`system:initial_tablets`. This commit fixes both the tests and the
exceptions sent to the user instructing how to create a table with
vnodes.
(cherry picked from commit 63897370cb)
Adjust the tests so they are in line with the config flag
`tablets_mode_for_new_keyspaces` that Alternator learned to honour.
(cherry picked from commit 7466325028)
Reflect that Alternator now honours the value of the config flag
`tablets_mode_for_new_keyspaces`, as well as the renaming of the tag
`experimental:initial_tablets` to `system:initial_tablets`.
(cherry picked from commit 35216d2f01)
Until now, tablets in Alternator were an experimental feature enabled only
when a TAG "experimental:initial_tablets" was present when creating a
table and associated with a numeric value.
After this patch, Alternator honours the value of
`tablets_mode_for_new_keyspaces` config flag.
Each table can be overridden to use tablets or not by supplying a new TAG
"system:initial_tablets". The rules stay the same as with the earlier,
experimental tag: when supplied with a numeric value, the table will use
tablets (as long as they are supported). When supplied with something
else (like a string "none"), the table will use vnodes, provided that
tablets are not `enforced` by the config flag.
Fixes #22463
(cherry picked from commit 376a2f2109)
The comment was not listing all 3 possible values correctly,
even though the explanation just below covers all 3 values.
(cherry picked from commit 403068cb3d)
To avoid the case where an old coordinator (which hasn't been stopped yet)
dictates what should be done, add the raft term to the `work_on_view_building_tasks`
RPC.
The worker needs to check if the term matches the current term from raft
server, and deny the request when the term is bad.
(cherry picked from commit fb8cbf1615)
After the previous commits, we can drop the entire task state and replace it
with a single boolean flag, which determines whether a task was aborted.
Once a task has been aborted, it cannot be resurrected to a normal state.
(cherry picked from commit 24d69b4005)
In the previous implementation, to execute view building tasks the
coordinator first needed to set their states to `STARTED`,
and then remove them before it could start the next ones.
This logic required a lot of group0 commits, especially in large
clusters with a high number of nodes and a big tablet count.
After the previous commit to the view building worker, the coordinator
can start view building tasks without setting the `STARTED` state
and without deleting finished tasks.
This patch adjusts the coordinator to save finished tasks locally,
so it can continue to execute the next ones, and the finished tasks are
periodically removed from group0 by `finished_task_gc_fiber()`.
(cherry picked from commit eb04af5020)
Commit 6e4803a750 broke notification about expired erms held for too
long since it resets the tracker without calling its destructor (where
notification is triggered). Fix the assignment operator to call the destructor.
(cherry picked from commit 9f97c376f1)
When we delete a table in alternator, the schema change is performed on shard 0.
However, we actually use the storage_proxy from the shard that is handling the
delete_table command. This can lead to problems because some information is
stored only on shard 0 and using storage_proxy from another shard may make
us miss it.
In this patch we fix this by using the storage_proxy from shard 0 instead.
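The general pattern looks roughly like the sketch below; the names here are illustrative stand-ins, not the actual Alternator code. The idea is to hop to shard 0 with Seastar's sharded<>::invoke_on and do the work against that shard's instance:
```c++
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/sstring.hh>

// Hypothetical sharded service standing in for storage_proxy.
struct proxy_service {
    seastar::future<> drop_table(seastar::sstring name) {
        // ... perform the schema change using this shard's state ...
        return seastar::make_ready_future<>();
    }
};

// Always run the schema change on shard 0, where the relevant state lives,
// instead of on whichever shard happened to receive the DeleteTable request.
seastar::future<> delete_table(seastar::sharded<proxy_service>& proxy, seastar::sstring name) {
    return proxy.invoke_on(0, [name] (proxy_service& p) {
        return p.drop_table(name);
    });
}
```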
Fixes https://github.com/scylladb/scylladb/issues/27223
Closes scylladb/scylladb#27224
(cherry picked from commit 3c376d1b64)
Closes scylladb/scylladb#27260
This commit doesn't change the logic behind the view building worker but
it changes how the worker executes view building tasks.
Previously, the worker had state only on shard0 and it reacted to
changes in the group0 state. When it noticed some tasks were moved to
the `STARTED` state, the worker created a batch for them in the shard0
state.
The RPC call was used only to start the batch and to get its result.
Now, the main logic of batch management has been moved to the RPC call
handler.
The worker has a local state on each shard and the state
contains:
- unique ptr to the batch
- set of completed tasks
- information for which views the base table was flushed
So currently, each batch lives exclusively on the shard where it has its
work to do. This eliminates the need for synchronization between
shard0 and the working shard, which was a pain point in the previous
implementation.
The worker still reacts to changes in the group0 view building state, but
currently this is only used to observe whether any view building task was
aborted by setting the `ABORTED` state.
To prepare for further changes to drop the view building task state,
the worker ignores `IDLE` and `STARTED` states completely.
(cherry picked from commit 08974e1d50)
During the initial implementation of the view building coordinator,
we decided that if a view building task fails locally on the worker
(example reason: a view update's target replica is not available),
the worker will retry this work instead of reporting a failure to the
coordinator.
However, we left the return type of the RPC, which indicated whether a task
finished successfully or was aborted.
But the worker doesn't need to report that a task was aborted, because
it's the coordinator who decides to abort a task.
So, this commit changes the return type to a list of UUIDs of completed
tasks.
Previously, the length of the returned vector needed to be the same as the
length of the vector sent in the request.
Now we can drop this restriction and the RPC handler returns a list of UUIDs
of completed tasks (a subset of the vector sent in the request).
This change is required to drop the `STARTED` state in the next commits.
Since Scylla 2025.4 hasn't been released yet and we're going to merge this
patch before releasing, no RPC versioning or cluster feature is needed.
(cherry picked from commit 6d853c8f11)
This backport additionally includes commit 7e201eea from
scylladb/scylla#26543 which causes the tombstone_gc property to be added
to secondary indexes when they are created. This fix was necessary to
resolve the merge conflict. The original commit message of
scylladb/scylladb#27120 follows:
With the introduction of colocated tables, all the tablet transitions
now operate on groups of colocated tablets instead of individual
tablets. This includes tablet migration, and also tablet repair.
The tablet repair currently doesn't work on individual tablets due to
the limitations in the tablet map being shared. The way it was
implemented to work on a group of colocated tablets is by repairing all
the colocated tablets together, using a dedicated rpc, and setting a
shared repair_time in the shared tablet map. It was implemented this
way because we wanted to have some way to repair the tablets of a
colocated table.
However, we want to change this in the next release so that it will be
possible to repair the tablets of a colocated table individually. In
order to simplify and prepare for the future change, we prefer until
then to not repair colocated tables at all. Otherwise, we will need to
support both the shared repair and individual repair together for a long
time, and the upgrade will be more complicated.
We change the handling of the tablet 'repair' transition to repair only
the base table's tablets. It means it will not be possible to request
tablet repair for a non-base colocated table such as local MV, CDC and
paxos table. This restriction will be temporary until a later release
where we will support repairing colocated tablets.
This is a reasonable restriction because repair for these kinds of tables
is not required or as important as for normal tables.
Fixes https://github.com/scylladb/scylladb/issues/27119
Backport to 2025.4 since we must change it in the same version it was introduced, before it's released.
- (cherry picked from commit 273f664496)
- (cherry picked from commit 005807ebb8)
- (cherry picked from commit 7e201eea1a)
- (cherry picked from commit 868ac42a8b)
Parent PR: #27120
Closes scylladb/scylladb#27256
* github.com:scylladb/scylladb:
tombstone_gc: don't use 'repair' mode for colocated tables
index: Set tombstone_gc when creating secondary index
Revert "storage service: add repair colocated tablets rpc"
topology_coordinator: don't repair colocated tablets
Vector indexes are going to be supported only for tablets (see VECTOR-322).
As a result, tests using vector indexes will fail when run with vnodes.
This change ensures tests using vector indexes run exclusively with tablets.
Fixes: VECTOR-49
Closes scylladb/scylladb#27233
For tables of special types that can be colocated: MV, CDC, and paxos
tables, we should not use tombstone_gc=repair mode because colocated
tablets are never repaired, hence they will not have repair_time set and
will never be GC'd using 'repair' mode.
(cherry picked from commit 868ac42a8b)
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. That
was a bug and we fix it now.
Two reproducer tests are added for validation. They reproduce the problem
and don't pass before this commit.
Fixes scylladb/scylladb#26542
This reverts commit 11f045bb7c.
The rpc was added together with colocated tablets in 2025.4 to support a
"shared repair" operation of a group of colocated tablets that repairs
all of them and allows also for special behavior as opposed to repairing
a single specific tablet.
It is not used anymore because we decided to not repair all colocated
tablets in a single shared operation, but to repair only the base table,
and in a later release support repairing colocated tables individually.
We can remove the rpc in 2025.4 because it is introduced in the same
version.
(cherry picked from commit 005807ebb8)
With the introduction of colocated tables, all the tablet transitions
now operate on groups of colocated tablets instead of individual
tablets. This includes tablet migration, and also tablet repair.
The tablet repair currently doesn't work on individual tablets due to
the limitations in the tablet map being shared. The way it was
implemented to work on a group of colocated tablets is by repairing all
the colocated tablets together, using a dedicated rpc, and setting a
shared repair_time in the shared tablet map. It was implemented this
way because we wanted to have some way to repair the tablets of a
colocated table.
However, we want to change this in the next release so that it will be
possible to repair the tablets of a colocated table individually. In
order to simplify and prepare for the future change, we prefer until
then to not repair colocated tables at all. Otherwise, we will need to
support both the shared repair and individual repair together for a long
time, and the upgrade will be more complicated.
We change the handling of the tablet 'repair' transition to repair only
the base table's tablets. It means it will not be possible to request
tablet repair for a non-base colocated table such as local MV, CDC and
paxos table. This restriction will be temporary until a later release
where we will support repairing colocated tablets.
This is a reasonable restriction because repair for these kinds of tables
is not required or as important as for normal tables.
Fixes scylladb/scylladb#27119
(cherry picked from commit 273f664496)
When streaming SSTables by tablet range, the original implementation of tablet_sstable_streamer::stream may break out of the loop too early when encountering a non-overlapping SSTable. As a result, subsequent SSTables that should be classified as partially contained are skipped entirely.
Tablet range: [4, 5]
SSTable ranges:
[0,5]
[0, 3] <--- is considered exhausted, and causes skip to next tablet
[2, 5] <--- is missed for range [4, 5]
The loop uses if (!overlaps) break; semantics, which conflated “no overlap” with “done scanning.” This caused premature termination when an SSTable did not overlap but the following one did.
Correct logic should be (see the sketch below):
before(sst_last) → skip and continue.
after(sst_first) → break (no further SSTables can overlap).
Otherwise → `contains` to classify as full or partial.
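A rough sketch of that corrected classification, assuming the SSTables are sorted by their first token; field and function names here are illustrative, not the actual tablet_sstable_streamer code:
```c++
#include <vector>

struct range { long first; long last; };   // stand-ins for token bounds

void classify(const std::vector<range>& sstables, const range& tablet,
              std::vector<range>& full, std::vector<range>& partial) {
    for (const auto& sst : sstables) {
        if (sst.last < tablet.first) {
            continue;   // entirely before the tablet range: skip, keep scanning
        }
        if (sst.first > tablet.last) {
            break;      // entirely after: no later sstable can overlap
        }
        if (sst.first >= tablet.first && sst.last <= tablet.last) {
            full.push_back(sst);      // fully contained in the tablet range
        } else {
            partial.push_back(sst);   // overlaps only partially
        }
    }
}
```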
The impact is missing SSTables in streaming, and potential data loss or incomplete streaming in repair/streaming operations.
1. Correct the loop termination logic that previously caused certain SSTables to be prematurely excluded, resulting in lost mutations. This change ensures all relevant SSTables are properly streamed and their mutations preserved.
2. Refactor the loop to use before() and after() checks explicitly, and only break when the SSTable is entirely after the tablet range
3. Add a pytest covering this case, exercising the full streaming flow by means of `restore`
4. Add boost tests to test the new refactored function
This data corruption fix should be ported back to 2024.2, 2025.1, 2025.2, 2025.3 and 2025.4
Fixes: https://github.com/scylladb/scylladb/issues/26979
- (cherry picked from commit 656ce27e7f)
- (cherry picked from commit dedc8bdf71)
Parent PR: #26980
Closes scylladb/scylladb#27156
* github.com:scylladb/scylladb:
streaming: fix loop break condition in tablet_sstable_streamer::stream
streaming: add pytest case to reproduce mutation loss issue
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.
(cherry picked from commit dedc8bdf71)
Introduce a test that demonstrates mutation loss caused by premature
loop termination in tablet_sstable_streamer::stream. The code broke
out of the SSTable iteration when encountering a non-overlapping range,
which skipped subsequent SSTables that should have been partially
contained. This test showcases the problem only.
Example:
Tablet range: [4, 5]
SSTable ranges:
[0,5]
[0, 3] <--- is considered exhausted, and causes skip to next tablet
[2, 5] <--- is missed for range [4, 5]
(cherry picked from commit 656ce27e7f)
The primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
the system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable: if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes #26865
- (cherry picked from commit ed8d127457)
- (cherry picked from commit 4a85ea8eb2)
- (cherry picked from commit f83c4ffc68)
Parent PR: #26941
Closes scylladb/scylladb#27189
* github.com:scylladb/scylladb:
address_map: Use barrier() to wait for replication
address_map: Use more efficient and reliable replication method
utils: Introduce helper for replicated data structures
This PR extends the restore API so that it accepts primary_replica_only as a parameter and combines the concepts of primary-replica-only with scoped streaming so that with:
- `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only
- `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only.
- `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself)
- `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense.
The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope.
Fixes #26584
- (cherry picked from commit 965a16ce6f)
- (cherry picked from commit 136b45d657)
- (cherry picked from commit 83aee954b4)
- (cherry picked from commit c1b3fe30be)
- (cherry picked from commit d4e43bd34c)
- (cherry picked from commit 817fdadd49)
- (cherry picked from commit a04ebb829c)
Parent PR: #26609
Closes scylladb/scylladb#27011
* github.com:scylladb/scylladb:
Add cluster tests for checking scoped primary_replica_only streaming
Improve choice distribution for primary replica
Refactor cluster/object_store/test_backup
nodetool restore: add primary-replica-only option
nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
Enable scoped primary replica only streaming
Support primary_replica_only for native restore API
More efficient than 100 pings.
There was one ping in the test which was done "so this shard notices the
clock advance". It's not necessary, since observing a completed SMP
call implies that the local shard sees the clock advancement done within it.
(cherry picked from commit f83c4ffc68)
The primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
the system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable: if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes #26865
Fixes #26835
(cherry picked from commit 4a85ea8eb2)
Key goals:
- efficient (batching updates)
- reliable (no lost updates)
Will be used in data structures maintained on one designated owning
shard and replicated to other shards.
(cherry picked from commit ed8d127457)
vector_search: Add HTTPS support for vector store connections
This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file
To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.
Fixes: VECTOR-327
Backport to 2025.4 as this feature is expected to be available in 2025.4.
- (cherry picked from commit c40b3ba4b3)
- (cherry picked from commit 58456455e3)
Parent PR: #26935
Closes scylladb/scylladb#27180
* github.com:scylladb/scylladb:
test: vector_search: Ensure all clients are stopped on shutdown
vector_search: Add HTTPS support for vector store connections
A flaky test revealed that after `clients::stop()` was called,
the `old_clients` collection was sometimes not empty,
indicating that some clients were not being stopped correctly.
This resulted in sanitizer errors when objects went out of scope at the end of the test.
This patch modifies `stop()` to ensure all clients, including those in `old_clients`,
are stopped, guaranteeing a clean shutdown.
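A minimal sketch of the idea, with hypothetical member names and Seastar coroutine style: stop() walks both the active clients and old_clients so nothing is left running at shutdown.
```c++
#include <memory>
#include <vector>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Hypothetical client type standing in for the real vector-store client.
struct client {
    seastar::future<> stop() { return seastar::make_ready_future<>(); }
};

struct clients {
    std::vector<std::unique_ptr<client>> _active;
    std::vector<std::unique_ptr<client>> _old_clients;

    // Stop every client, including ones already rotated into _old_clients,
    // so none of them outlives shutdown.
    seastar::future<> stop() {
        for (auto& c : _active) {
            co_await c->stop();
        }
        for (auto& c : _old_clients) {
            co_await c->stop();
        }
        _active.clear();
        _old_clients.clear();
    }
};
```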
This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file
To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.
Fixes: VECTOR-327
vector_search: Fix error handling and status parsing
This change addresses two issues in the vector search client that caused
validator test failures: incorrect handling of 5xx server errors and
faulty status response parsing.
1. 5xx Error Handling:
Previously, a 5xx response (e.g., 503 Service Unavailable) from the
underlying vector store for an `/ann` search request was incorrectly
interpreted as a node failure. This would cause the node to be marked
as down, even for transient issues like an index scan being in progress.
This change ensures that 5xx errors are treated as transient search
failures, not node failures, preventing nodes from being incorrectly
marked as down.
2. Status Response Parsing:
The logic for parsing status responses from the vector store was
flawed. This has been corrected to ensure proper parsing.
Fixes: SCYLLADB-50
Backport to 2025.4 as this problem is present on this branch.
- (cherry picked from commit 05b9cafb57)
- (cherry picked from commit 366ecef1b9)
- (cherry picked from commit 9563d87f74)
Parent PR: #27111
Closes scylladb/scylladb#27145
* github.com:scylladb/scylladb:
vector_search: Don't mark nodes as down on 5xx server errors
test: vector_search: Move unavailable_server to dedicated file
vector_search: Fix status response parsing
For an `/ann` search request, a 5xx server response does not
indicate that the node is down. It can signify a transient state, such
as the index full scan being in progress.
Previously, treating a 503 error as a node fault would cause the node
to be incorrectly marked as down, for example, when a new index was
being created. This commit ensures that such errors are treated as
transient search failures, not node failures.
The response was incorrectly parsed as a plain string and compared
directly with a C++ string. However, the body contains a JSON string,
which includes escaped quotes that caused comparison failures.
This change adds support for secondary vector store clients, typically
located in different availability zones. Secondary clients serve as
fallback targets when all primary clients are unavailable.
New configuration option allows specifying secondary client addresses
and ports.
Fixes: VECTOR-187
Closes scylladb/scylladb#27159
A Vector Store node is now considered down if it returns an HTTP 500
server error. This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.
The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is explicitly 'SERVING'.
Fixes: VECTOR-187
Backport to 2025.4 as this feature is expected to be available in 2025.4.
- (cherry picked from commit ee3b83c9b0)
- (cherry picked from commit f665564537)
- (cherry picked from commit cb654d2286)
- (cherry picked from commit 4bbba099d7)
- (cherry picked from commit 5c30994bc5)
- (cherry picked from commit 7f45f15237)
Parent PR: #26413
Closes scylladb/scylladb#27087
* github.com:scylladb/scylladb:
vector_search: Improve vector-store health checking
vector_search: Move response_content_to_sstring to utils.hh
vector_search: Add unit tests for client error handling
vector_search: Enable mocking of status requests
vector_search: Extract abort_source_timeout and repeat_until
vector_search: Move vs_mock_server to dedicated files
This commit updates the section for n2-highmem instances
on the Cloud Instance Recommendations page:
- Added missing support for n2-highmem-96
- Updated the reference to n2 instances in the Google Cloud docs.
- Added the missing information about processors for this instance
type (Ice Lake and Cascade Lake).
(cherry picked from commit dab74471cc)
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key read resumes, but doesn't find the sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes
It's fine for abandoned (= timed out) reads to fail, since the
coordinator is gone.
For active reads (non timed out), the barrier will wait for them
since their coordinator holds erm.
This solution consists of failing reads whose underlying tablet
replica has been cleaned up, by just converting the internal error
to a plain exception.
Fixes #26229.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#27078
(cherry picked from commit 74ecedfb5c)
Closes scylladb/scylladb#27158
Fixes #27077
Multiple points can be clarified relating to:
* Names of each sub-procedure could be clearer
* Requirements of each sub-procedure could be clearer
* Clarify which keyspaces are relevant and how to check them
* Fix typos in keyspace name
Closes scylladb/scylladb#26855
(cherry picked from commit a0734b8605)
Closes scylladb/scylladb#27157
Adds a new documentation page for the incremental repair feature.
The page covers:
- What incremental repair is and its benefits over the standard repair process.
- How it works at a high level by tracking the repair status of SSTables.
- The prerequisite of using the tablets architecture.
- The different user-configurable modes: 'regular', 'full', and 'disabled'.
Fixes #25600
Closes scylladb/scylladb#26221
(cherry picked from commit 3cf1225ae6)
Closes scylladb/scylladb#27149
A Vector Store node is now considered down if it returns an HTTP 5xx status.
This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.
The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is 'SERVING'.
Move the response_content_to_sstring utility function from
vector_store_client.cc to utils.hh to enable reuse across
multiple files.
This refactoring prepares for the upcoming `client.cc` implementation
that will also need this functionality.
Introduce dedicated unit tests for the client class to verify existing
functionality and serve as regression tests.
These tests ensure that invalid client requests do not cause nodes to
be marked as down.
Extend the mock server to allow inspecting incoming status requests and
configuring their responses.
This enables client unit tests to simulate various server behaviors,
such as handling node failures and backoff logic.
The `abort_source_timeout` and `repeat_until` functions are moved to
the shared utility header `test/vector_search/utils.hh`.
This allows them to be reused by upcoming `client` unit tests, avoiding
code duplication.
Increase the value of the max-networking-io-control-blocks option
for the cpp tests as it is too low and causes flakiness
as seen in vector_search.vector_store_client_test.vector_store_client_single_status_check_after_concurrent_failures:
```
seastar/src/core/reactor_backend.cc:342: void seastar::aio_general_context::queue(linux_abi::iocb *): Assertion `last < end` failed.
```
See also https://github.com/scylladb/seastar/issues/976
Fixes #27056
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#27117
(cherry picked from commit fd81333181)
Closes scylladb/scylladb#27135
vector_search: Add backoff for failed nodes
Introduces logic to mark nodes that fail to answer an ANN request as
"down". Down nodes are omitted from further requests until they
successfully respond to a health check.
Health checks for down nodes are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
Client list management is moved to separate files (clients.cc/clients.hh)
to improve code organization and modularity.
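A minimal sketch of such a retry-delay policy; this is illustrative, not the exact implementation. The delay doubles with each consecutive failure, starting at 100ms and capped at 20s:
```c++
#include <algorithm>
#include <chrono>

// Delay before the next background health check of a down node.
std::chrono::milliseconds backoff_delay(unsigned consecutive_failures) {
    using namespace std::chrono_literals;
    // 100ms, 200ms, 400ms, ... ; the shift is clamped so it cannot overflow.
    auto delay = 100ms * (1u << std::min(consecutive_failures, 8u));
    return std::min<std::chrono::milliseconds>(delay, 20s);
}
```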
Fixes: VECTOR-187.
Backport to 2025.4 as this feature is expected to be available in 2025.4.
- (cherry picked from commit 62f8b26bd7)
- (cherry picked from commit 49a177b51e)
- (cherry picked from commit 190459aefa)
- (cherry picked from commit 009d3ea278)
- (cherry picked from commit 940ed239b2)
- (cherry picked from commit 097c0f9592)
- (cherry picked from commit 1972fb315b)
Parent PR: #26308
Closes scylladb/scylladb#27054
* github.com:scylladb/scylladb:
vector_search: Set max backoff delay to 2x read request timeout
vector_search: Report status check exception via on_internal_error_noexcept
vector_search: Extract client management into dedicated class
vector_search: Add backoff for failed clients
vector_search: Make endpoint available
vector_search: Use std::expected for low-level client errors
vector_search: Extract client class
The maximum backoff delay for status checking now depends on the
`read_request_timeout_in_ms` configuration option. The delay is set
to twice the value of this parameter.
This exception should only occur due to internal errors, not client or external issues.
If triggered, it indicates an internal problem. Therefore, we notify about this exception
using on_internal_error_noexcept.
Introduces logic to mark clients that fail to answer an ANN request as
"down". Down clients are omitted from further requests until they
successfully respond to a health check.
Health checks for down clients are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
In preparation for a new feature, the tests need the ability to make
an endpoint that was previously unavailable, available again.
This is achieved by adding an `unavailable_server::take_socket` method.
This method allows transferring the listening socket from the
`unavailable_server` to the `mock_vs_server`, ensuring they both
operate on the same endpoint.
To unify error handling, the low-level client methods now return
`std::expected` instead of throwing exceptions. This allows for
consistent and explicit error propagation from the client up to the
caller.
The relevant error types have been moved to a new `vector_search/error.hh`
header to centralize their definitions.
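A minimal sketch of the pattern, with illustrative type and function names rather than the actual error.hh contents:
```c++
#include <expected>
#include <string>

// Stand-in for the error enumeration centralized in vector_search/error.hh.
enum class vs_error { service_unavailable, service_error, reply_format_error };

// Low-level calls return std::expected instead of throwing, so the caller
// decides explicitly whether to retry, mark the node down, or surface the error.
std::expected<std::string, vs_error> parse_reply(int http_status, std::string body) {
    if (http_status == 503) {
        return std::unexpected(vs_error::service_unavailable);
    }
    if (http_status >= 500) {
        return std::unexpected(vs_error::service_error);
    }
    return body;
}
```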
This refactoring extracts low-level client logic into a new, dedicated
`client` class. The new class is responsible for connecting to the
server and serializing requests.
This change prepares for extending the `vector_store_client` to check
node status via the `api/v1/status` endpoint.
`/ann` response deserialization remains in the `vector_store_client` as it
is schema-dependent.
Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.
This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many datasets
is a guaranteed OOM.
Fix that, and add a test that catches this.
Fixes scylladb/scylladb#27082
Closes scylladb/scylladb#27083
(cherry picked from commit d8e299dbb2)
Closes scylladb/scylladb#27122
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.
scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.
Fixes #27055
Closes scylladb/scylladb#27075
(cherry picked from commit e35ba974ce)
Closes scylladb/scylladb#27110
This patch adds the missing warning about the inability
to return the similarity distance. This will be added in the next
iteration.
Fixes #27086
It has to be backported to 2025.4 as this is the limitation in 2025.4.
Closes scylladb/scylladb#27096
(cherry picked from commit f714876eaf)
Closes scylladb/scylladb#27105
This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset the 'cleanup needed' flag at the end), but the new '--global' option allows running cleanup simultaneously on all nodes that need it.
Fixes https://github.com/scylladb/scylladb/issues/26866
Backport to all supported versions, since the current automatic cleanup behaviour may create load during cluster resizing that the operator does not expect.
- (cherry picked from commit e872f9cb4e)
- (cherry picked from commit 0f0ab11311)
Parent PR: #26868
Closes scylladb/scylladb#27095
* github.com:scylladb/scylladb:
cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
cleanup: Add RESTful API to allow reset cleanup needed flag
Refs #26822
Fixes #27062
AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and actual endpoint, doing exponential backoff.
Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe.
- (cherry picked from commit 190e3666cb)
- (cherry picked from commit d22e0acf0b)
Parent PR: #26934
Closes scylladb/scylladb#27064
* github.com:scylladb/scylladb:
encryption::kms_host: Add exponential backoff-retry for 503 errors
encryption::kms_host: Include http error code in kms_error
We currently allow creating multiple vector indexes on one column.
This doesn't make much sense as we do not support picking one when
making ann queries.
To make this less confusing and to make our behavior similar
to Cassandra we disallow the creation of multiple vector indexes
on one column.
We also add a test that checks this behavior.
Fixes: VECTOR-254
Fixes: #26672
Closes scylladb/scylladb#26508
(cherry picked from commit 46589bc64c)
Closes scylladb/scylladb#27057
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.
However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:
1. On the topology coordinator, the join completes and the joining node
becomes normal, but the joining node's state lags behind. Since it's
not synchronized by the barrier, it could be in an old state such as
`write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
new replica.
3. The new node applies the base mutation but doesn't generate a view
update for it, because it calculates the base-view pairing according
to its own state and replication map, and determines that it doesn't
participate in the base-view pairing.
Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.
Fixes https://github.com/scylladb/scylladb/issues/26976
Backport to previous versions since it fixes a bug in MV with vnodes.
- (cherry picked from commit 13d94576e5)
- (cherry picked from commit b925e047be)
Parent PR: #27008
Closes scylladb/scylladb#27040
* github.com:scylladb/scylladb:
test: add mv write during node join test
topology_coordinator: include joining node in barrier
The view building coordinator manages the process by sending RPC
requests to all nodes in the cluster, instructing them what to do.
If processing that message fails, the coordinator decides if it
wants to retry it or (temporarily) abandon the work.
An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.
Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.
To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.
We provide a reproducer test.
Fixes scylladb/scylladb#26686
Backport: this impacts users, so we should backport it to 2025.4.
- (cherry picked from commit 4a5b1ab40a)
- (cherry picked from commit acd9120181)
- (cherry picked from commit 393f1ca6e6)
Parent PR: #26729
Closes scylladb/scylladb#27027
* github.com:scylladb/scylladb:
test/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc
db/view/view_building_coordinator: Rate limit logging failed RPC
db/view: Add backoff when RPC fails
Currently we do not support paging for vector search queries.
When we get such a query with paging enabled we ignore the paging
and return the entire result. This behavior can be confusing for users,
as there is no warning about paging not working with vector search.
This patch fixes that by adding a warning to the result of ANN queries
with paging enabled.
Closesscylladb/scylladb#26384
(cherry picked from commit 7646dde25b)
Closesscylladb/scylladb#27016
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot.
(After Scylla dies in the middle of an sstable write.)
But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.
The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.
Also, extend a related test so that it would catch the problem before the fix.
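To illustrate the multimap pitfall described above with standard C++ (the names and the simplified key type are illustrative, not the real sstable_directory code): erasing by key removes every entry for that generation, while the intended fix erases only the TemporaryHashes.db entry.
```cpp
#include <cassert>
#include <string>
#include <unordered_map>

int main() {
    // generation -> component file name: a simplified stand-in for the
    // generations_found multimap; the real key/value types differ.
    std::unordered_multimap<int, std::string> found = {
        {1, "TemporaryHashes.db"}, {1, "Data.db"}, {1, "TOC.txt.tmp"},
    };

    // Buggy: erase(key) drops *every* entry for generation 1, so the sibling
    // components are forgotten and never cleaned up on boot.
    // found.erase(1);

    // Intended: erase only the entry that refers to TemporaryHashes.db.
    for (auto it = found.begin(); it != found.end(); ) {
        if (it->first == 1 && it->second == "TemporaryHashes.db") {
            it = found.erase(it);
        } else {
            ++it;
        }
    }
    assert(found.size() == 2);   // the sibling components are still tracked
}
```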
Fixes scylladb/scylladb#26393
Bugfix, needs backport to 2025.4.
- (cherry picked from commit 16cb223d7f)
- (cherry picked from commit 6efb807c1a)
Parent PR: #26394
Closes scylladb/scylladb#26409
* github.com:scylladb/scylladb:
sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db
test/boost/database_test: fix two no-op distributed loader tests
This commit adds tests checking various scenarios of restoring
via load and stream with primary_replica_only and a scope specified.
The tests check that, in a few topologies, a mutation is replicated
the correct number of times given primary_replica_only, and that
streaming happens according to the scope rule passed.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit a04ebb829c)
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset "cleanup needed" flag afterwards) and it adds "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.
(cherry picked from commit 0f0ab11311)
Cleaning up a node using the per-keyspace/table interface does not reset the
'cleanup needed' flag in the topology. The assumption was that running cleanup on
an already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if the operator has already cleaned up the node
manually.
(cherry picked from commit e872f9cb4e)
I noticed during tests that `maybe_get_primary_replica`
would not distribute the choice of primary replica uniformly,
because `info.replicas` would be ordered one way on some shards and
differently on others, making the function choose the same node as
primary replica multiple times when it clearly could have
chosen different nodes.
This patch sorts the replica set before passing it through the
scope filter.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit 817fdadd49)
This PR splits the support code from test_backup.py
into multiple functions so that new tests using it
produce less duplicated code. It also makes it a bit
easier to understand.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit d4e43bd34c)
Add --primary-replica-only and update docs page for
nodetool restore.
The relationship with the scope parameter is:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is effectively a no-op; it gets the only
replica in the rack (rf=#racks)
- scope=node primary_replica_only=true is not allowed
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit c1b3fe30be)
So far it was not allowed to pass a scope when using
the primary_replica_only option, this patch enables
it because the concepts are now combined so that:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is effectively a no-op; it gets the only
replica in the rack (rf=#racks)
- scope=node primary_replica_only=true is not allowed
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit 83aee954b4)
This patch removes the restriction on streaming
to the primary replica only within a scope.
Node-scope streaming to the primary replica is disallowed.
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit 136b45d657)
Current native restore does not support primary_replica_only; it is
hard-coded to disabled, and this may lead to data amplification issues.
This patch extends the restore REST API to accept a
primary_replica_only parameter and propagates it to
sstables_loader so it gets correctly passed to
load_and_stream.
Fixes#26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit 965a16ce6f)
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).
Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla will fail an
assertion and crash.
A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
Some accesses to `auth::service` are not affected and we do not modify
those.
Fixes scylladb/scylladb#26816
Backport: yes. This is a fix of a regression.
- (cherry picked from commit c0f7622d12)
- (cherry picked from commit 222eab45f8)
- (cherry picked from commit 394207fd69)
- (cherry picked from commit b357c8278f)
Parent PR: #26856
Closes scylladb/scylladb#27043
* github.com:scylladb/scylladb:
test/cluster/test_maintenance_mode.py: Wait for initialization
test: Disable maintenance mode correctly in test_maintenance_mode.py
test: Fix keyspace in test_maintenance_mode.py
service/qos: Do not crash Scylla if auth_integration absent
This PR fixes staging sstables handling by the view building coordinator in the case of intra-node tablet migration or tablet merge.
To support tablet merge, the worker stores the sstables grouped only by `table_id`, instead of by `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting the relevant ones for each `process_staging` task is fine.
For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to clean them up on the source shard.
The patch should be backported to 2025.4
Fixes https://github.com/scylladb/scylladb/issues/26244
- (cherry picked from commit 2e8c096930)
- (cherry picked from commit c99231c4c2)
- (cherry picked from commit 4bc6361766)
- (cherry picked from commit 9345c33d27)
Parent PR: #26454
Closes scylladb/scylladb#27058
* github.com:scylladb/scylladb:
service/storage_service: migrate staging sstables in view building worker during intra-node migration
db/view/view_building_worker: support sstables intra-node migration
db/view_building_worker: fix indent
db/view/view_building_worker: don't organize staging sstables by last token
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Unify the handlers of synchronous and asynchronous cleanup, major
compaction, and upgrade_sstables.
Fixes: https://github.com/scylladb/scylladb/issues/26715.
Requires backports to all live versions
- (cherry picked from commit 12dabdec66)
- (cherry picked from commit 044b001bb4)
- (cherry picked from commit fdd623e6bc)
Parent PR: #26746
Closes scylladb/scylladb#26890
* github.com:scylladb/scylladb:
api: storage_service: tasks: unify upgrade_sstable
api: storage_service: tasks: force_keyspace_cleanup
api: storage_service: tasks: unify force_keyspace_compaction
Refs #26822
AWS says to treat 503 errors, at least in the case of ec2 metadata
query, as backoff-retry (generally, we do _not_ retry on provider
level, but delegate this to higher levels). This patch adds special
treatment for 503:s (service unavailable) for both ec2 meta and
actual endpoint, doing exponential backoff.
Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing
(doh!). Should try to set up a mock ec2 meta with injected errors
maybe.
v2:
* Use utils::exponential_backoff_retry
(cherry picked from commit d22e0acf0b)
Currently, tablet_sstable_streamer::get_primary_endpoints is out of
sync with tablet_map::get_primary_replica. The get_primary_replica
optimizes the choice of the replica so that the work is fairly
distributed among nodes. Meanwhile, get_primary_endpoints always
chooses the first replica.
Use get_primary_replica for get_primary_endpoints.
Fixes: https://github.com/scylladb/scylladb/issues/21883.
Closesscylladb/scylladb#26385
(cherry picked from commit 910cd0918b)
Closesscylladb/scylladb#27020
service/storage_service: migrate staging sstables in view building worker during intra-node migration
Use the methods introduced in the previous commit to:
- load staging sstables to the view building worker on the target
shard, at the end of `streaming` stage
- clear migrated staging sstables on source shard in `cleanup` stage
This patch also removes skip mark in `test_staging_sstables_with_tablet_merge`.
Fixesscylladb/scylladb#26244
(cherry picked from commit 9345c33d27)
We need to be able to load sstables on the target shard during
intra-node tablet migration and to clean up migrated sstables on the
source shard.
(cherry picked from commit 4bc6361766)
There was a problem with staging sstables after tablet merge.
Let's say there were 2 tablets and tablet 1 (lower last token)
had a staging sstable. Then a tablet merge occurred, so there is only
one tablet now (higher last token).
But entries in `_staging_sstables`, which are grouped by last token, are
never adjusted.
Since there shouldn't be thousands of sstables, we can just hold a list of
sstables per table and filter the necessary entries when doing the
`process_staging` view building task.
(cherry picked from commit 2e8c096930)
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
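A rough sketch of the idea with standard `<chrono>` (the 5-second margin and the function name are assumptions for illustration, not the values used by Scylla):
```cpp
#include <chrono>
#include <cstdint>

// Pick the drop timestamp of a CDC log column a few seconds ahead of "now",
// so writes racing with the drop still carry smaller timestamps.
int64_t cdc_column_drop_timestamp_us(std::chrono::seconds margin = std::chrono::seconds(5)) {
    using namespace std::chrono;
    return duration_cast<microseconds>(
        (system_clock::now() + margin).time_since_epoch()).count();
}
```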
Fixes https://github.com/scylladb/scylladb/issues/26340
the issue affects all previous releases, backport to improve stability
- (cherry picked from commit eefae4cc4e)
- (cherry picked from commit 48298e38ab)
- (cherry picked from commit 039323d889)
- (cherry picked from commit e85051068d)
Parent PR: #26533
Closes scylladb/scylladb#27041
* github.com:scylladb/scylladb:
test: test concurrent writes with column drop with cdc preimage
cdc: check if recreating a column too soon
cdc: set column drop timestamp in the future
migration_manager: pass timestamp to pre_create
This patch fixes 2 issues at one go:
First, currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.
Second, printing token values would generate invalid JSON,
as they are currently printed as binary bytes; they
should be printed simply as numbers, as we do elsewhere,
for example for the first and last keys.
Fixes#26982
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#26991
(cherry picked from commit f9ce98384a)
Closesscylladb/scylladb#27042
add a test that writes to a table concurrently with dropping a column,
where the table has CDC enabled with preimage.
the test reproduces issue #26340 where this results in a malformed
sstable.
(cherry picked from commit e85051068d)
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.
To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
(cherry picked from commit 039323d889)
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:
```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```
To avoid that, we should wait until initialization is complete.
(cherry picked from commit b357c8278f)
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.
(cherry picked from commit 394207fd69)
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode, in which that will never
happen; another is when the connection occurs before Scylla is fully
initialized.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:
* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.
For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.
The other uses of `auth_integration` within the service level controller
are not problematic:
* `find_effective_service_level`,
* `find_cached_effective_service_level`.
They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.
Fixesscylladb/scylladb#26816
(cherry picked from commit c0f7622d12)
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
Fixesscylladb/scylladb#26340
(cherry picked from commit 48298e38ab)
Add a test that reproduces the issue scylladb/scylladb#26976.
The test adds a new node with delayed group0 apply, and does writes with
MV updates right after the join completes on the coordinator and while
the joining node's state is behind.
The test fails before fixing the issue and passes after.
(cherry picked from commit b925e047be)
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.
However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:
1. On the topology coordinator, the join completes and the joining node
becomes normal, but the joining node's state lags behind. Since it's
not synchronized by the barrier, it could be in an old state such as
`write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
new replica.
3. The new node applies the base mutation but doesn't generate a view
update for it, because it calculates the base-view pairing according
to its own state and replication map, and determines that it doesn't
participate in the base-view pairing.
Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.
Fixesscylladb/scylladb#26976
(cherry picked from commit 13d94576e5)
The view building coordinator sends tasks in the form of RPC messages
to other nodes in the cluster. If processing that RPC fails, the
coordinator logs the error.
However, since tasks are per replica (so per shard), it may happen
that we end up with a large number of similar messages, e.g. if the
target node has died, because every shard will fail to process its
RPC message. It might become even worse in the case of a network
partition.
To mitigate that, we rate limit the logging to once per second.
We extend the test `test_backoff_when_node_fails_task_rpc` so that
it allows the view building coordinator to have multiple tablet
replica targets. Without rate limiting of the warning messages,
we would start getting more of them, potentially leading to
a test failure.
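A minimal stand-alone sketch of per-message rate limiting like the one described above (illustrative C++, not Scylla's logger API):
```cpp
#include <chrono>
#include <iostream>
#include <string>

// Drops warnings that arrive within one second of the previously emitted one,
// and reports how many were suppressed when logging resumes.
class rate_limited_logger {
    std::chrono::steady_clock::time_point _last{};
    unsigned _suppressed = 0;
public:
    void warn(const std::string& msg) {
        auto now = std::chrono::steady_clock::now();
        if (now - _last < std::chrono::seconds(1)) {
            ++_suppressed;                       // within the window: suppress
            return;
        }
        if (_suppressed) {
            std::cerr << "(" << _suppressed << " similar warnings suppressed)\n";
            _suppressed = 0;
        }
        std::cerr << "WARN: " << msg << "\n";
        _last = now;
    }
};
```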
(cherry picked from commit acd9120181)
The view building coordinator manages the process of view building
by sending RPC requests to all nodes in the cluster, instructing them
what to do. If processing that message fails, the coordinator decides
if it wants to retry it or (temporarily) abandon the work.
An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.
Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.
To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.
We provide a reproducer test: it fails before this commit and succeeds
with it.
Fixesscylladb/scylladb#26686
(cherry picked from commit 4a5b1ab40a)
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Unify the handlers of /storage_service/keyspace_upgrade_sstables/{keyspace}
and /tasks/compaction/keyspace_upgrade_sstables/{keyspace}.
(cherry picked from commit fdd623e6bc)
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Unify the handlers of /storage_service/keyspace_cleanup/{keyspace}
and /tasks/compaction/keyspace_cleanup/{keyspace}.
(cherry picked from commit 044b001bb4)
Batches that fail on the initial send are retried later, until they
succeed. These retries happen with CL=ALL, regardless of what the
original CL of the batch was. This is unnecessarily strict. We tried to
follow Cassandra here, but Cassandra has a big caveat in their use of
CL=ALL for batches. They accept saving just a hint for any/all of the
endpoints, so a batch which was just logged in hints is good enough for
them.
We do not plan on replicating this usage of hints at this time, so as a
middle ground, the CL is changed to EACH_QUORUM.
Fixes: scylladb/scylladb#25432
Closes scylladb/scylladb#26304
(cherry picked from commit d9c3772e20)
Closesscylladb/scylladb#26930
cql3: Fix std::bad_cast when deserializing vectors of collections
This PR fixes a bug where attempting to INSERT a vector containing collections (e.g., `vector<set<int>,1>`) would fail. On the client side, this manifested as a `ServerError: std::bad_cast`.
The cause was "type slicing" issue in the reserialize_value function. When retrieving the vector's element type, the result was being assigned by value (using auto) instead of by reference.
This "sliced" the polymorphic abstract_type object, stripping it of its actual derived type information. As a result, a subsequent dynamic_cast would fail, even if the underlying type was correct.
To prevent this entire class of bugs from happening again, I've made the polymorphic base class `abstract_type` explicitly uncopyable.
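The following self-contained snippet reproduces the slicing pattern with simplified stand-in types (`abstract_type`/`set_type` here are illustrative, not the real class definitions):
```cpp
#include <cassert>
#include <memory>

// Simplified stand-ins for the polymorphic type hierarchy (illustrative names).
struct abstract_type { virtual ~abstract_type() = default; };
struct set_type : abstract_type {};

int main() {
    std::shared_ptr<abstract_type> element_type = std::make_shared<set_type>();

    // Buggy pattern: `auto` copies the pointee by value, slicing it down to
    // abstract_type, so a later dynamic_cast to the derived type fails.
    auto sliced = *element_type;
    assert(dynamic_cast<set_type*>(&sliced) == nullptr);

    // Fixed pattern: bind a reference instead; the dynamic type is preserved.
    const auto& ref = *element_type;
    assert(dynamic_cast<const set_type*>(&ref) != nullptr);
}
```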
Fixes: #26704
This fix needs to be backported as these releases are affected: `2025.4` , `2025.3`.
- (cherry picked from commit 960fe3da60)
- (cherry picked from commit 77da4517d2)
Parent PR: #26740
Closes scylladb/scylladb#26998
* github.com:scylladb/scylladb:
cql3: Make abstract_type explicitly noncopyable
cql3: Fix std::bad_cast when deserializing vectors of collections
Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why this is the case, please see scylladb/scylladb#17265. If rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow it in that case.
Fixes: scylladb/scylladb#26273
Marked for backport to 2025.4, as MVs are no longer experimental there.
- (cherry picked from commit 189ad96728)
- (cherry picked from commit 359ed964e3)
- (cherry picked from commit a8d92f2abd)
Parent PR: #26278
Closes scylladb/scylladb#26632
* github.com:scylladb/scylladb:
test: mv: add a test for tablet merge
tablet_allocator, tests: remove allow_tablet_merge_with_views injection
tablet_allocator: allow merges in base tables if rf-rack-valid=true
This patch series re-enables support for speculative retry values `0` and `100`. These values were supported some time ago, before [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration #21879](https://github.com/scylladb/scylladb/pull/21879). When that PR prevented using invalid `101PERCENTILE` values, the valid `100PERCENTILE` and `0PERCENTILE` values were prevented too.
The reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) no longer reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported.
Documentation is updated to give more information to users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit.
Fixes #26369
This is a bug fix. If at any time a client tries to use a value >= 99.5 and < 100, the Raft error will happen. Backport is needed. The code which introduced the inconsistency was introduced in 2025.2, so no backporting to 2025.1.
- (cherry picked from commit da2ac90bb6)
- (cherry picked from commit 5d1913a502)
- (cherry picked from commit aba4c006ba)
- (cherry picked from commit 85f059c148)
- (cherry picked from commit 7ec9e23ee3)
Parent PR: #26909
Closes scylladb/scylladb#27015
* github.com:scylladb/scylladb:
test: cqlpy: add test case for non-numeric PERCENTILE value
schema: speculative_retry: update exception type for sstring ops
docs: cql: ddl.rst: update speculative-retry-options
test: cqlpy: add test for valid speculative_retry values
schema: speculative_retry: allow 0 and 100 PERCENTILE values
Add a test case for a non-numeric PERCENTILE value, which raises an error
different from the one for out-of-range invalid values. The regex in the test
test_invalid_percentile_speculative_retry_values is expanded.
Refs #26369
(cherry picked from commit 7ec9e23ee3)
Change speculative_retry::to_sstring and speculative_retry::from_sstring
to throw exceptions::configuration_exception instead of std::invalid_argument.
These errors can be triggered by CQL, so the appropriate CQL exception should be
used.
Reference: https://github.com/scylladb/scylladb/issues/24748#issuecomment-3025213304
Refs #26369
(cherry picked from commit 85f059c148)
Clarify how the value of `XPERCENTILE` is handled:
- Values 0 and 100 are supported
- The percentile value is rounded to the nearest 0.1 (1 decimal place)
Refs #26369
(cherry picked from commit aba4c006ba)
test_valid_percentile_speculative_retry_values is introduced to test that
valid values for speculative_retry are properly accepted.
Some of the values are moved from the
test_invalid_percentile_speculative_retry_values test, because
the previous commit added support for them.
Refs #26369
(cherry picked from commit 5d1913a502)
This patch allows specifying 0 and 100 PERCENTILE values in speculative_retry.
It was possible to specify these values before #21825. #21825 prevented specifying
invalid values, like -1 and 101, but also prevented using 0 and 100.
On top of that, the speculative_retry::to_sstring function did rounding when
formatting the string, which introduced an inconsistency.
Fixes#26369
(cherry picked from commit da2ac90bb6)
The polymorphic abstract_type class serves as an interface and should not be copied.
To prevent accidental and unsafe copies, make it explicitly uncopyable.
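A minimal sketch of that change on a simplified stand-in class:
```cpp
// Illustrative stand-in showing the pattern: delete the copy operations on the
// polymorphic base so accidental by-value copies (and thus slicing) no longer compile.
struct abstract_type {
    abstract_type() = default;
    abstract_type(const abstract_type&) = delete;
    abstract_type& operator=(const abstract_type&) = delete;
    virtual ~abstract_type() = default;
};
```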
(cherry picked from commit 77da4517d2)
When deserializing a vector whose elements are collections (e.g., set, list),
the operation raises a `std::bad_cast` exception.
This was caused by type slicing due to an incorrect assignment of a
polymorphic type by value instead of by reference. This resulted in a
failed `dynamic_cast` even when the underlying type was correct.
(cherry picked from commit 960fe3da60)
_last_key is a multi-fragment buffer.
Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.
The job of write_last_key() is to find the middle fragments,
(containing the range `[_last_key_mismatch, needed_prefix)`)
trim the first and last of the middle fragments appropriately,
and feed them to the trie writer.
But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).
Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.
Fix that.
We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.
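The trimming logic described above can be sketched as follows (illustrative code, not the actual trie writer); it reproduces the `aaa|bbb|ccc` example and skips the empty leading fragment:
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Given a key split into fragments and the byte range [mismatch, needed),
// return the trimmed middle fragments, skipping the empty leading fragment that
// appears when `mismatch` falls exactly on a fragment boundary.
std::vector<std::string> middle_fragments(const std::vector<std::string>& frags,
                                          std::size_t mismatch, std::size_t needed) {
    std::vector<std::string> out;
    std::size_t pos = 0;   // offset of the current fragment within the whole key
    for (const auto& f : frags) {
        std::size_t begin = std::max(pos, mismatch);
        std::size_t end = std::min(pos + f.size(), needed);
        if (begin < end) {                       // only emit non-empty pieces
            out.push_back(f.substr(begin - pos, end - begin));
        }
        pos += f.size();
    }
    return out;
}

int main() {
    // Key "aaa|bbb|ccc", mismatch = 3, needed = 7  ->  "bbb", "c" (no empty fragment)
    auto out = middle_fragments({"aaa", "bbb", "ccc"}, 3, 7);
    assert((out == std::vector<std::string>{"bbb", "c"}));
}
```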
Fixes scylladb/scylladb#26819
Closes scylladb/scylladb#26839
(cherry picked from commit b82c2aec96)
Closesscylladb/scylladb#26903
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
To do the check, storage_service::raft_topology_change_enabled() was
used, but the topology kind is only available/set on shard 0, so it caused
the synchronization to be bypassed when load-and-stream runs on
any shard other than 0.
The reason the reproducer didn't catch it is that it was restricted
to a single CPU. It will now run with multiple CPUs and catch the
problem observed.
Fixes#22707
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#26730
(cherry picked from commit 7f34366b9d)
Closesscylladb/scylladb#26851
Fixes: #26440
1. Added a description of the primary-replica-only option
2. Fixed the text to better reflect the constraints checked in the code
itself, namely: that both primary replica only and scope may be
applied only if load and stream is applied too, and that they are mutually
exclusive with each other.
Note: when https://github.com/scylladb/scylladb/issues/26584 is
implemented (with #26609) the docs will need to be aligned as
well - namely, primary-replica-only and scope will no longer be
mutually exclusive.
Signed-off-by: Ran Regev <ran.regev@scylladb.com>
Closesscylladb/scylladb#26480
(cherry picked from commit aaf53e9c42)
Closesscylladb/scylladb#26906
An Alternator user was recently "bit" when switching `alternator_enforce_authorization` from "false" to "true": after the configuration change, all application requests suddenly failed because, unbeknownst to the user, their application used incorrect secret keys.
This series introduces a solution for users who want to **safely** switch `alternator_enforce_authorization` from "false" to "true": Before switching from "false" to "true", the user can temporarily switch a new option, `alternator_warn_authorization`, to true. In this "warn" mode, authentication and authorization errors are counted in metrics (`scylla_alternator_authentication_failures` and `scylla_alternator_authorization_failures`) and logged as WARNings, but the user's application continues to work. The user can use these metrics or log messages to learn of errors in their application's setup, fix them, and only do the switch of `alternator_enforce_authorization` when the metrics or log messages show there are no more errors.
The first patch is the implementation of the feature - the new configuration option, the metrics and the log messages; the second patch is a test for the new feature; and the third patch is documentation recommending how to use the warn mode and the associated metrics or log messages to safely switch `alternator_enforce_authorization` from false to true.
Fixes#25308
This is a feature that users need, so it should probably be backported to live branches.
Closesscylladb/scylladb#25457
* github.com:scylladb/scylladb:
docs/alternator: explain alternator_warn_authorization
test/alternator: tests for new auth failure metrics and log messages
alternator: add alternator_warn_authorization config
(cherry picked from commit 59019bc9a9)
Closesscylladb/scylladb#26894
In [#26408](https://github.com/scylladb/scylladb/pull/26408) a `write_handler_destroy_promise` class was introduced to wait for the destruction of `abstract_write_response_handler` instances. We strived to minimize the memory footprint of `abstract_write_response_handler`; with `write_handler_destroy_promise`-es we required only a single additional int. It turned out that in some cases a lot of write handlers can be scheduled for deletion at the same time, and in such cases the vector can become big and cause 'oversized allocation' seastar warnings.
Another concern with `write_handler_destroy_promise`-es [was that they were more complicated than they were worth](https://github.com/scylladb/scylladb/pull/26408#pullrequestreview-3361001103).
In this commit we replace `write_handler_destroy_promise` with simple gates. One or more gates can be attached to an `abstract_write_response_handler` to wait for its destruction. We use `utils::small_vector` to store the attached gates. The limit 2 was chosen because we expect two gates at the same time in most cases. One is `storage_proxy::_write_handlers_gate`, which is used to wait for all handlers in `cancel_all_write_response_handlers`. The other can be attached by a caller of `cancel_write_handlers`. Nothing stops several cancel_write_handlers calls from happening at the same time, but it should be rare.
The `sizeof(utils::small_vector) == 40`; this is a `40.0 / 488 * 100 ~ 8%` increase in `sizeof(abstract_write_response_handler)`, which seems acceptable.
Fixes [scylladb/scylladb#26788](https://github.com/scylladb/scylladb/issues/26788)
backport: need to backport to 2025.4 (LWT for tablets release)
- (cherry picked from commit 4578304b76)
- (cherry picked from commit 5bda226ff6)
Parent PR: #26827
Closes scylladb/scylladb#26893
* github.com:scylladb/scylladb:
storage_proxy: use coroutine::maybe_yield();
storage_proxy: use gates to track write handlers destruction
When we build a materialized view we read the entire base table from start to
end to generate all required view updates. If a view is created while another view
is being built on the same base table, this is optimized - we start generating
view updates for the new view from the base table rows that we're currently
reading, and we read the missed initial range again after the previous view
finishes building.
The view building progress is only updated after generating view updates for
some read partitions. However, there are scenarios where we'll generate no
view updates for the entire read range. If this was not handled we could
end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293
To handle this, we mark the view as built if the reader generated no partitions.
However, this is not always the correct conclusion. Another scenario where
the reader won't encounter any partitions is when view building is interrupted,
and then we perform a reshard. In this scenario, we set the reader for all
shards to the last unbuilt token for an existing partition before the reshard.
However, this partition may not exist on a shard after reshard, and if there
are also no partitions with higher tokens, the reader will generate no partitions
even though it hasn't finished view building.
Additionally, we already have a check that prevents infinite view building loops
without taking the partitions generated by the reader into account. At the end
of stream, before looping back to the start, we advance current_key to the end
of the built range and check for built views in that range. This handles the case
where the entire range is empty - the conditions for a built view are:
1. the "next_token" is no greater than "first_token" (the view building process
looped back, so we've built all tokens above "first_token")
2. the "current_token" is no less than "first_token" (after looping back, we've
built all tokens below "first_token")
If the range is empty, we'll pass these conditions on an empty range after advancing
"current_key" to the end because:
1. after looping back, "next_token" will be set to `dht::minimum_token`
2. "current_key" will be set to `dht::ring_position::max()`
In this patch we remove the check for partitions generated by the reader. This fixes
the issue with resharding and it does not resurrect the issue with infinite view building
that the check was introduced for.
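For reference, the "built" condition quoted above can be sketched over plain integer tokens (an illustrative simplification, not the real token type):
```cpp
#include <cstdint>

// "Built" check over plain integer tokens: after looping back to the ring start
// (next_token <= first_token) and advancing current_key past the starting point
// (current_token >= first_token), the whole ring has been covered, even if the
// scanned range contained no partitions at all.
bool view_built(int64_t first_token, int64_t next_token, int64_t current_token) {
    return next_token <= first_token && current_token >= first_token;
}
```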
Fixes https://github.com/scylladb/scylladb/issues/26523
Closes scylladb/scylladb#26635
(cherry picked from commit 0a22ac3c9e)
Closesscylladb/scylladb#26889
It is useful to check time spent on tablet repair. It can be used to
compare incremental repair and non-incremental repair. The time does not
include the time waiting for the tablet scheduler to schedule the tablet
repair task.
Fixes #26505
Closes scylladb/scylladb#26502
(cherry picked from commit dbeca7c14d)
Closesscylladb/scylladb#26881
When a tablet migration is started, we abort the corresponding view
building tasks (i.e. we change the state of those tasks to "ABORTED").
However, we don't change the host and shard of these tasks until the
migration successfully completes. When for some reason we have to
rollback the migration, that means the migration didn't finish and
the aborted task still has the host and shard of the migration
source. So when we recreate tasks that should no longer be aborted
due to a rolled-back migration, we should look at the aborted tasks
of the source (leaving) replica. But we don't do it and we look at
the aborted tasks of the target replica.
In this patch we adjust the rollback mechanism to recreate tasks
for the migration source instead of destination. We also fix the
test that should have detected this issue - the injection that
the test was using didn't make us rollback, but we simply retried
a stage of the tablet migration. By using one_shot=False and adding
a second injection, we can now guarantee that the migration will
eventually fail and we'll continue to the 'cleanup_target' and
'revert_migration' stages.
Fixes https://github.com/scylladb/scylladb/issues/26691
Closes scylladb/scylladb#26825
(cherry picked from commit 977fa91e3d)
Closesscylladb/scylladb#26879
`sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is LZ4Compressor (inherited from Cassandra).
Make LZ4WithDictsCompressor the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios.
If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using a listener callback.
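Conceptually, the default selection reduces to something like the following (illustrative only; the real code reacts to the cluster feature via a listener callback rather than a plain bool):
```cpp
#include <string>

// Pick the effective default from the cluster feature: plain LZ4 while the
// dictionary feature is still disabled (e.g. mid-upgrade), dictionary-aware
// compression once it is enabled.
std::string default_sstable_compressor(bool dict_feature_enabled) {
    return dict_feature_enabled ? "LZ4WithDictsCompressor" : "LZ4Compressor";
}
```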
Fixes#26610.
- (cherry picked from commit d95ebe7058)
- (cherry picked from commit 96e727d7b9)
- (cherry picked from commit 2fc812a1b9)
- (cherry picked from commit a0bf932caa)
Parent PR: #26697
Closes scylladb/scylladb#26830
* github.com:scylladb/scylladb:
test/cluster: Add test for default SSTable compressor
db/config: Change default SSTable compressor to LZ4WithDictsCompressor
db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl
boost/cql_query_test: Get expected compressor from config
This patch implements the changes required by the Vector Store authorization, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/107085899/Vector+Store+Authentication+And+Authorization+To+ScyllaDB, that is:
- adding a new permission VECTOR_SEARCH_INDEXING, grantable only on ALL KEYSPACES
- allowing users with that permission to perform SELECT queries, but only on tables with a vector index
- increasing the number of scheduling groups by one to allow users to create a service level for a vector store user
- adjusting the tests and documentation
These changes are needed, as the vector indexes are managed by the external service, Vector Store, which needs to read the tables to create the indexes in its memory. We would like to limit the privileges of that service to a minimum to maintain the principle of least privilege; therefore we add a new permission, one that allows SELECTs conditional on the existence of a vector index on the table.
Fixes: VECTOR-201
Fixes: https://github.com/scylladb/scylladb/issues/26804
Backport reasoning:
Backport to 2025.4 required, as this can make upgrading clusters more difficult if we add it in 2026.1. As of now, Scylla Cloud requires version 2025.4 to enable vector search, and the permission is set by the orchestrator, so there is no chance that someone will try to add this permission during an upgrade. In 2026.1 it will be more difficult.
- (cherry picked from commit ae86bfadac)
- (cherry picked from commit 3025a35aa6)
- (cherry picked from commit 6a69bd770a)
- (cherry picked from commit e8fb745965)
- (cherry picked from commit 3db2e67478)
Parent PR: #25976
Closes scylladb/scylladb#26805
* github.com:scylladb/scylladb:
docs: adjust docs for VS auth changes
test: add tests for VECTOR_SEARCH_INDEXING permission
cql: allow VECTOR_SEARCH_INDEXING users to select
auth: add possibility to check for any permission in set
auth: add a new permission VECTOR_SEARCH_INDEXING
This is a backport of the fix for https://github.com/scylladb/scylladb/issues/26040 and the related test (https://github.com/scylladb/scylladb/pull/26589) to 2025.4.
Before this change, unauthorized connections stayed in the main
scheduling group. This is not ideal; in such a case sl:default
should be used instead, to have consistent behavior with a scenario
where a user is authenticated but there is no service level assigned
to the user.
This commit adds a call to update_scheduling_group at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to sl:default.
Fixes: https://github.com/scylladb/scylladb/issues/26040
Fixes: https://github.com/scylladb/scylladb/issues/26581
(cherry picked from commit 278019c328)
(cherry picked from commit 8642629e8e)
No backport, as it's already a backport
Closesscylladb/scylladb#26815
* github.com:scylladb/scylladb:
test: add test_anonymous_user to test_raft_service_levels
transport: call update_scheduling_group for non-auth connections
In #26408 a write_handler_destroy_promise class was introduced to
wait for the destruction of abstract_write_response_handler instances. We
strived to minimize the memory footprint of
abstract_write_response_handler; with write_handler_destroy_promise-es
we required only a single additional int. It turned out that in some
cases a lot of write handlers can be scheduled for deletion
at the same time, and in such cases the
vector<write_handler_destroy_promise> can become big and cause
'oversized allocation' seastar warnings.
Another concern with write_handler_destroy_promise-es was that they
were more complicated than they were worth.
In this commit we replace write_handler_destroy_promise with simple
gates. One or more gates can be attached to an
abstract_write_response_handler to wait for its destruction. We use
utils::small_vector<gate::holder, 2> to store the attached gates.
The limit 2 was chosen because we expect two gates at the same time
in most cases. One is storage_proxy::_write_handlers_gate,
which is used to wait for all handlers in
cancel_all_write_response_handlers. Another one can be attached by
a caller of cancel_write_handlers. Nothing stops several
cancel_write_handlers calls from happening at the same time, but it should be
rare.
The sizeof(utils::small_vector<gate::holder, 2>) == 40, this is
40.0 / 488 * 100 ~ 8% increase in
sizeof(abstract_write_response_handler), which seems acceptable.
Fixesscylladb/scylladb#26788
(cherry picked from commit 4578304b76)
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Add consider_only_existing_data parameter to /tasks/compaction/keyspace_compaction/{keyspace},
to match the synchronous version of the api (/storage_service/keyspace_compaction/{keyspace}).
Unify the handlers of both apis.
(cherry picked from commit 12dabdec66)
The previous patch made the default compressor dependent on the
SSTABLE_COMPRESSION_DICTS feature:
* LZ4Compressor if the feature is disabled
* LZ4WithDictsCompressor if the feature is enabled
Add a test to verify that the cluster uses the right default in every
case.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit a0bf932caa)
`sstable_compression_user_table_options` allows configuring a
node-global SSTable compression algorithm for user tables via
scylla.yaml. The current default is `LZ4Compressor` (inherited from
Cassandra).
Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets
in the field have shown significant improvements in compression ratios.
If the dictionary compression feature is not enabled in the cluster
(e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the
feature is enabled, flip the default back to the dictionary compressor
using a listener callback.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 2fc812a1b9)
The option is a knob that allows rejecting dictionary-aware compressors
in the validation stage of CREATE/ALTER statements, and in the
validation of `sstable_compression_user_table_options`. It was
introduced in 7d26d3c7cb to allow the admins of Scylla Cloud to
selectively enable it in certain clusters. For more details, check:
https://github.com/scylladb/scylla-enterprise/issues/5435
As of this series, we want to start offering dictionary compression as
the default option in all clusters, i.e., treat it as a generally
available feature. This makes the knob redundant.
Additionally, making dictionary compression the default choice in
`sstable_compression_user_table_options` creates an awkward dependency
with the knob (disabling the knob should cause
`sstable_compression_user_table_options` to fall back to a non-dict
compressor as default). That may not be very clear to the end user.
For these reasons, mark the option as "Deprecated", remove all relevant
tests, and adjust the business logic as if dictionary compression is
always available.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 96e727d7b9)
- Workload: N workers perform CAS updates
UPDATE … SET s{i}=new WHERE pk=? IF (∀j≠i: s{j}>=guard_j) AND s{i}=prev
at CL=LOCAL_QUORUM / SERIAL=LOCAL_SERIAL. Non-apply without timeout is treated
as contention; “uncertainty” timeouts are resolved via LOCAL_SERIAL read.
- Enable balancing and increase min_tablet_count to force split,
flush and lower min_tablet_count to merge.
- “Uncertainty” timeouts (write timeout due to uncertainty) are resolved via a
LOCAL_SERIAL read to determine whether the CAS actually applied.
- Invariants: after the run, for every pk and column s{i}, the stored value
equals the number of confirmed CAS by worker i (no lost or phantom updates)
despite ongoing tablet moves.
Closes scylladb/scylladb#26113
refs: scylladb/qa-tasks#1918
Refs #18068
Fixes #24502 (to satisfy backport rules)
(cherry picked from commit 99dc31e71a)
Closesscylladb/scylladb#26790
use utils::chunked_vector instead of std::vector to store cdc stream
sets for tablets.
a cdc stream set usually represents all streams for a specific table and
timestamp, and has a stream id per each tablet of the table. each stream
id is represented by 16 bytes. thus the vector could require quite large
contiguous allocations for a table that has many tablets. change it to
chunked_vector to avoid large contiguous allocations.
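A quick back-of-the-envelope for the contiguous allocation being avoided (the tablet count is an arbitrary example, not a measured value):
```cpp
#include <cstddef>
#include <cstdio>

int main() {
    constexpr std::size_t stream_id_bytes = 16;   // one 16-byte stream id per tablet
    constexpr std::size_t tablets = 1 << 20;      // e.g. a table with ~1M tablets
    std::printf("std::vector would need one contiguous chunk of ~%zu KiB\n",
                tablets * stream_id_bytes / 1024); // ~16 MiB in a single allocation
}
```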
Fixes scylladb/scylladb#26791
Closes scylladb/scylladb#26792
(cherry picked from commit e7dbccd59e)
Closesscylladb/scylladb#26828
This check is incorrect: the current shard may be looking at
the old version of the tablets map:
* an accept RPC comes to replica shard 0, which is already at write_both_read_new
* the new shard is shard 1, so paxos_state::accept is called on shard 1
* shard 1 is still at "streaming" -> shards_ready_for_reads() returns old
shard 0
Fixes scylladb/scylladb#26801
Closes scylladb/scylladb#26809
(cherry picked from commit 88765f627a)
Closesscylladb/scylladb#26832
Since 5b6570be52, the default SSTable compression algorithm for user
tables is no longer hardcoded; it can be configured via the
`sstable_compression_user_table_options.sstable_compression` option in
scylla.yaml.
Modify the `test_table_compression` test to get the expected value from
the configuration.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit d95ebe7058)
The primary goal of this test is to reproduce scylladb/scylladb#26040
so the fix (278019c328) can be backported
to older branches.
Scenario: connect via CQL as an anonymous user and verify that the
`sl:default` scheduling group is used. Before the fix for #26040
`main` scheduling group was incorrectly used instead of `sl:default`.
Control connections may legitimately use `sl:driver`, so the test
accepts those occurrences while still asserting that regular anonymous
queries use `sl:default`.
This adds explicit coverage on master. After scylladb#24411 was
implemented, some other tests started to fail when scylladb#26040
was unfixed. However, none of the tests asserted this exact behavior.
Refs: scylladb/scylladb#26040
Refs: scylladb/scylladb#26581
Closes scylladb/scylladb#26589
(cherry picked from commit 8642629e8e)
Before this change, unauthorized connections stayed in the `main`
scheduling group. This is not ideal; in such a case `sl:default`
should be used instead, to have consistent behavior with a scenario
where a user is authenticated but there is no service level assigned
to the user.
This commit adds a call to `update_scheduling_group` at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to `sl:default`.
Fixes: scylladb/scylladb#26040
(cherry picked from commit 278019c328)
We adjust the documentation to include the new
VECTOR_SEARCH_INDEXING permission and its usage
and also to reflect the changes in the maximal
amount of service levels.
(cherry picked from commit 3db2e67478)
This commit adds tests to verify the expected
behavior of the VECTOR_SEARCH_INDEXING permission,
that is, allowing GRANTing this permission only on
ALL KEYSPACES and allowing SELECT queries only on tables
with vector indexes when the user has this permission
(cherry picked from commit e8fb745965)
This patch allows users with the VECTOR_SEARCH_INDEXING permission
to perform SELECT queries on tables that have a vector index.
This is needed for the Vector Store service, which
reads the vector-indexed tables, but does not require
the full SELECT permission.
(cherry picked from commit 6a69bd770a)
This commit adds a new version of the command_desc struct
that contains a set of permissions instead of a single
permission. When this struct is passed to ensure/check_has_permission,
we check if the user has any of the included permissions on the resource.
(cherry picked from commit 3025a35aa6)
This patch adds a new permission: VECTOR_SEARCH_INDEXING,
that is grantable only for ALL KEYSPACES. It will allow selecting
from tables with vector search indexes. It is meant to be used
by the Vector Store service to allow it to build indexes without
having full SELECT permissions on the tables.
(cherry picked from commit ae86bfadac)
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.
the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.
Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.
We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.
This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.
Fixes https://github.com/scylladb/scylladb/issues/26732
backport to 2025.4 where cdc with tablets is introduced
- (cherry picked from commit 8743422241)
- (cherry picked from commit 4cc0a80b79)
Parent PR: #26160
Closes scylladb/scylladb#26798
* github.com:scylladb/scylladb:
test: cdc: extend cdc with tablets tests
cdc: improve cdc metadata loading
extend and improve the tests of virtual tables for cdc with tablets.
split the existing virtual tables test to one test that validates the
virtual tables against the internal cdc tables, and triggering some
tablet splits in order to create entries in the cdc_streams_history
table, and add another test with basic validation of the virtual tables
when there are multiple cdc tables.
(cherry picked from commit 4cc0a80b79)
When loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.
The CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. On tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table via a group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.
Previously, on every update, we would read all of the
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.
We improve this now by reading only new entries from cdc_streams_history
and appending them to the existing map. We can do this because we only
append new entries to cdc_streams_history with a higher timestamp than all
previous entries.
This makes the reloading more efficient and predictable: previously
we would read a number of entries that depends on the number
of tablet splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.
Fixesscylladb/scylladb#26732
(cherry picked from commit 8743422241)
The test test_mv_tablets_replace verifies that merging tablets of both a
view and its base table is allowed if rf-rack-valid-keyspaces option is
enabled (and it is enabled by default in the test suite).
(cherry picked from commit a8d92f2abd)
The `allow_tablet_merge_with_views` error injection was previously used
to allow merging tablets in a table which has materialized views
attached to it. Now, the error injection is not needed because this is
allowed under the rf-rack-valid condition, which is enabled by default
in tests.
Remove the error injection from the code and adjust the tests not to use
it.
(cherry picked from commit 359ed964e3)
Sometimes file::list_directory() returns entries without the type set. In
that case the lister calls file_type() on the entry name to get it. If
the call returns a disengaged type, the code assumes that some error
occurred and resolves into an exception.
That's not correct. The file_type() method returns a disengaged type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed between readdir and stat. In
that case it's not "some error happened"; the entry should just be
skipped. If some error had happened, file_type() would resolve into an
exceptional future on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26595
(cherry picked from commit d9bfbeda9a)
Closesscylladb/scylladb#26767
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.
Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.
Fixes #19959
Closes scylladb/scylladb#26638
(cherry picked from commit 5321720853)
Closesscylladb/scylladb#26763
When a tablet is migrated between shards on the same node, during the write_both_read_new state we begin switching reads to the new shard. Until the corresponding global barrier completes, some requests may still use write_both_read_old erm, while others already use the write_both_read_new erm. To ensure mutual exclusion between these two types of requests, we must acquire locks on both the old and new shards. Once the global barrier completes, no requests remain on the old shard, so we can safely switch to acquiring locks only on the new shard.
The idea came from the similar locking problem in the [counters for tablets PR](https://github.com/scylladb/scylladb/pull/26636#discussion_r2463932395).
Fixes scylladb/scylladb#26727
backport: need to backport to 2025.4
- (cherry picked from commit 5ab2db9613)
- (cherry picked from commit 478f7f545a)
Parent PR: #26719
Closes scylladb/scylladb#26748
* github.com:scylladb/scylladb:
paxos_state: use shards_ready_for_reads
paxos_state: inline shards_for_writes into get_replica_lock
Introduce helper functions that can be used for garbage collecting old
CDC streams for tablets-based keyspaces.
Add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams of tablets keyspaces that
can be garbage collected.
The garbage collection works by finding the newest CDC timestamp that has been
closed for more than the configured CDC TTL, and removing all information from
the CDC internal tables about CDC timestamps and streams up to this timestamp.
In general it should be safe to remove information about these streams because
they have been closed for more than the TTL, therefore all rows that were written to these streams
with the configured TTL should be dead.
The exception is if the TTL is altered to a smaller value, and then we may remove information
about streams that still have live rows that were written with the longer TTL.
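A rough sketch of the GC rule under simplified assumptions: an in-memory, time-ordered history map stands in for the real internal CDC tables, `get_new_base_for_gc` mirrors the helper named later in this series, and the rest is illustrative only.

```cpp
#include <chrono>
#include <map>
#include <optional>
#include <string>
#include <vector>

using clock_type = std::chrono::system_clock;
using stream_set = std::vector<std::string>;

// Illustrative: CDC timestamps keyed by the time each one was closed
// (i.e. superseded by a newer timestamp).
using cdc_history = std::map<clock_type::time_point, stream_set>;

// Find the newest timestamp that has been closed for longer than the
// configured TTL; everything up to and including it can be removed,
// because all rows written to those streams with that TTL are dead.
std::optional<clock_type::time_point>
get_new_base_for_gc(const cdc_history& history,
                    std::chrono::seconds ttl,
                    clock_type::time_point now) {
    std::optional<clock_type::time_point> base;
    for (const auto& [closed_at, streams] : history) {
        if (now - closed_at > ttl) {
            base = closed_at;   // keep advancing to the newest eligible entry
        } else {
            break;              // the map is ordered, later entries are newer
        }
    }
    return base;
}

// The real code then builds mutations removing everything up to the base
// (get_cdc_stream_gc_mutations); here we just erase from the map.
void garbage_collect(cdc_history& history, std::chrono::seconds ttl) {
    if (auto base = get_new_base_for_gc(history, ttl, clock_type::now())) {
        history.erase(history.begin(), history.upper_bound(*base));
    }
}
```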
Fixes https://github.com/scylladb/scylladb/issues/26669
- (cherry picked from commit 440caeabcb)
- (cherry picked from commit 6109cb66be)
Parent PR: #26410
Closes scylladb/scylladb#26728
* github.com:scylladb/scylladb:
cdc: garbage collect CDC streams periodically
cdc: helpers for garbage collecting old streams for tablets
A `CreateTable` request creates GSI/LSI together with the base table;
the base table is empty, so we don't need to actually build the view.
In tablet-based keyspaces we can simply not create view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).
Vnode-based keyspaces will use the "old" view builder logic, which
will process the view and mark it as built.
Fixes scylladb/scylladb#26615
This fix should be backported to 2025.4.
- (cherry picked from commit 8fbf122277)
- (cherry picked from commit bdab455cbb)
- (cherry picked from commit 34503f43a1)
Parent PR: #26657
Closes scylladb/scylladb#26670
* github.com:scylladb/scylladb:
test/alternator/test_tablets: add test for GSI backfill with tablets
test/alternator/test_tablets: add reproducer for GSI with tablets
alternator/executor: instantly mark view as built when creating it with base table
Acquiring locks on both shards for the entire tablet migration period
is redundant. In most cases, locking only the old shard or only the new
shard is sufficient. Using shards_ready_for_reads reduces the
situations in which we need to lock both shards to:
* intra-node migrations only
* only during the write_both_read_new state
Once the global barrier completes in the write_both_read_new state, no
requests remain on the old shard, so we can safely acquire locks
only on the new shard.
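A simplified, standard-library model of the decision (the lock type, helper name `shards_to_lock`, and boolean flags are illustrative; the real code deals with per-shard locks and effective replication maps, not std::mutex):

```cpp
#include <mutex>
#include <vector>

// Stand-in for a per-shard lock; the real locks are not std::mutex.
struct shard_lock { std::mutex m; };

enum class transition_stage { write_both_read_old, write_both_read_new };

// Which shards must be locked for a request during an intra-node tablet
// migration, mirroring the idea behind shards_for_writes().
std::vector<shard_lock*> shards_to_lock(shard_lock& old_shard,
                                        shard_lock& new_shard,
                                        transition_stage stage,
                                        bool read_barrier_completed) {
    if (stage == transition_stage::write_both_read_new && !read_barrier_completed) {
        // Some requests may still run with the old ERM and some with the
        // new one, so mutual exclusion needs the locks of both shards.
        return {&old_shard, &new_shard};
    }
    if (stage == transition_stage::write_both_read_new && read_barrier_completed) {
        // After the global barrier no requests remain on the old shard;
        // the new shard's lock is enough.
        return {&new_shard};
    }
    // write_both_read_old: reads still go to the old shard.
    return {&old_shard};
}
```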
Fixesscylladb/scylladb#26727
(cherry picked from commit 478f7f545a)
No need to have two functions since both callers of get_replica_lock()
use shards_for_writes() to compute the shards where the locks
must be acquired.
Also, while at it, inline the acquire() lambda in get_replica_lock()
and replace it with a loop over the shards. This makes the code
more straightforward.
(cherry picked from commit 5ab2db9613)
Add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams of tablets keyspaces that
can be garbage collected.
(cherry picked from commit 6109cb66be)
Introduce helper functions that can be used for garbage collecting old
CDC streams for tablets-based keyspaces.
- get_new_base_for_gc: finds a new base timestamp given a TTL, such that
all older timestamps and streams can be removed.
- get_cdc_stream_gc_mutations: given new base timestamp and streams,
builds mutations that update the internal cdc tables and remove the
older streams.
- garbage_collect_cdc_streams_for_table: combines the two functions
above to find a new base and build mutations to update it for a
specific table
- garbage_collect_cdc_streams: builds gc mutations for all cdc tables
(cherry picked from commit 440caeabcb)
Group0 tombstone GC considers only the current group 0 members
while computing the group 0 tombstone GC time. It's not enough
because in the Raft-based recovery procedure, there can be nodes
that haven't joined the current group 0 yet, but they have belonged
to a different group 0 and thus have a non-empty group 0 state ID.
The current code can cause a data resurrection in group 0 tables.
We fix this issue in this PR and add a regression test.
This issue was uncovered by `test_raft_recovery_entry_loss`, which
became flaky recently. We skipped this test for now. We will unskip
it in a following PR because it's skipped only on master, while we
want to backport this PR.
Fixes#26534
This PR contains an important bugfix, so we should backport it
to all branches with the Raft-based recovery procedure (2025.2
and newer).
- (cherry picked from commit 1d09b9c8d0)
- (cherry picked from commit 6b2e003994)
- (cherry picked from commit c57f097630)
Parent PR: #26612
Closes scylladb/scylladb#26682
* https://github.com/scylladb/scylladb:
test: test group0 tombstone GC in the Raft-based recovery procedure
group0_state_id_handler: remove unused group0_server_accessor
group0_state_id_handler: consider state IDs of all non-ignored topology members
The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the tablet_id field (_tablet), which means the guard can no longer correctly protect ongoing operations from tablet migrations.
The problem is specific to LWT, since tablet_metadata_guard is otherwise used mostly for heavy topology operations, which are mutually exclusive with split and merge. The guard was used for LWT as an optimization -- we don't need to block topology operations or migrations of unrelated tablets. In the future, we could use the guard for regular reads/writes as well (via the token_metadata_guard wrapper).
Fixes https://github.com/scylladb/scylladb/issues/26437
backports: need to backport to 2025.4 since the bug is relevant to LWT over tablets.
(cherry picked from commit e1667afa50)
(cherry picked from commit 6f4558ed4b)
(cherry picked from commit 64ba427b85)
(cherry picked from commit ec6fba35aa)
(cherry picked from commit b23f2a2425)
(cherry picked from commit 33e9ea4a0f)
(cherry picked from commit 03d6829783)
Parent PR: https://github.com/scylladb/scylladb/pull/26619
Closes scylladb/scylladb#26700
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_tablets_merge_waits_for_lwt
test.py: add universalasync_typed_wrap
tablet_metadata_guard: fix split/merge handling
tablet_metadata_guard: add debug logs
paxos_state: shards_for_writes: improve the error message
storage_service: barrier_and_drain – change log level to info
topology_coordinator: fix log message
The universalasync.wrap function doesn't preserve the
type information, which confuses the VS Code Pylance
plugin and makes code navigation hard.
In this commit we fix the problem by adding a typed
wrapper around universalasync.wrap.
Fixes: scylladb/scylladb#26639
(cherry picked from commit 33e9ea4a0f)
The guard should stop refreshing the ERM when the number of tablets
changes. Tablet splits or merges invalidate the tablet_id field
(_tablet), which means the guard can no longer correctly protect
ongoing operations from tablet migrations.
Fixesscylladb/scylladb#26437
(cherry picked from commit b23f2a2425)
Add the current token and tablet info, remove 'this_shard_id'
since it's always written by the logging infrastructure.
(cherry picked from commit 64ba427b85)
Debugging global barrier issues is difficult without these logs.
Since barriers do not occur frequently, increasing the log level should not produce excessive output.
(cherry picked from commit 6f4558ed4b)
The test proceeds like this:
- run a long DNS refresh process
- request hostname resolution with a short abort_source timer - the result
should be an empty list, because the request was aborted
The test sometimes finishes the long DNS refresh before the abort_source fires, and then
the result list is not empty.
There are two issues. First, as.reset() changes the abort_source timeout. The
patch adds a get() method to the abort_source_timeout class, so there is no
change in the abort_source timeout. Second, a sleep can be unreliable. The
patch changes the long sleep inside the DNS refresh lambda into
condition_variable handling, to properly signal the end of the DNS refresh
process.
Fixes: #26561
Fixes: VECTOR-268
It needs to be backported to 2025.4
Closesscylladb/scylladb#26566
(cherry picked from commit 10208c83ca)
Closesscylladb/scylladb#26598
`shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`.
A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`.
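A minimal, standard-library model of the fix, assuming a simple counter of live handlers (the real code uses Seastar futures and the `_response_handlers` map, not std::promise; the names below are illustrative):

```cpp
#include <future>

// Count live write-response handlers and fulfil a promise when the last
// one is destroyed, even if lmutate/rmutate lambdas keep shared_ptr copies
// alive after the handler was already removed from the handlers map.
struct handler_tracker {
    int live = 0;
    std::promise<void> destroy_promise;
    std::future<void> all_destroyed = destroy_promise.get_future();
};

struct write_response_handler {
    handler_tracker& tracker;
    explicit write_response_handler(handler_tracker& t) : tracker(t) { ++tracker.live; }
    ~write_response_handler() {
        if (--tracker.live == 0) {
            tracker.destroy_promise.set_value();  // last pending handler is gone
        }
    }
};

// Shutdown path: after cancelling handlers, wait for stragglers that are
// still captured by in-flight lambdas before tearing down dependent state
// (paxos_store in the real code).
void drain(handler_tracker& tracker) {
    if (tracker.live > 0) {
        tracker.all_destroyed.wait();
    }
}
```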
Fixes scylladb/scylladb#26355
backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets
- (cherry picked from commit bf2ac7ee8b)
- (cherry picked from commit b269f78fa6)
- (cherry picked from commit bbcf3f6eff)
- (cherry picked from commit 8925f31596)
Parent PR: #26408
Closes scylladb/scylladb#26658
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_lwt_shutdown
storage_proxy: wait for write handler destruction
storage_proxy: coroutinize cancel_write_handlers
storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler
It's not enough to consider only the current group 0 members. In the
Raft-based recovery procedure, there can be nodes that haven't joined
the current group 0 yet, but they have belonged to a different group 0
and thus have a non-empty group 0 state ID.
We fix this issue in this commit by considering topology members
instead.
As an optimization, we don't consider ignored nodes. When some nodes are
dead, the group 0 state ID handler won't have to wait until all these
nodes leave the cluster. It will only have to wait until all these nodes
are ignored, which happens at the beginning of the first
removenode/replace. As a result, tombstones of group 0 tables will be
purged much sooner.
We don't rename the `group0_members` variable to keep the change
minimal. There seems to be no precise and succinct name for the used set
of nodes anyway.
We use `std::ranges::join_view` in one place (a short sketch follows this list) because:
- `std::ranges::concat` will become available in C++26,
- `boost::range::join` is not a good option, as there is an ongoing
effort to minimize external dependencies in Scylla.
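For illustration, a minimal example of the `std::views::join` pattern over two existing containers (the element type and values are made up; the real code joins ranges of Raft server IDs):

```cpp
#include <array>
#include <cstdint>
#include <iostream>
#include <ranges>
#include <vector>

int main() {
    // Current group 0 members plus topology members that haven't joined yet.
    std::vector<uint64_t> group0_members{1, 2, 3};
    std::vector<uint64_t> not_yet_joined{4, 5};

    // std::views::join flattens a range of ranges; std::ranges::concat
    // (C++26) would express the same thing more directly once available.
    std::array both{std::views::all(group0_members), std::views::all(not_yet_joined)};
    for (uint64_t id : std::views::join(both)) {
        std::cout << id << ' ';   // prints: 1 2 3 4 5
    }
    std::cout << '\n';
}
```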
(cherry picked from commit 1d09b9c8d0)
Rewrite wait_for_first_completed to return only the first completed task and to guarantee
awaiting (and thus disappearance) of all cancelled and finished tasks.
Use wait_for_first_completed to avoid falsely passing tests in the future and issues
like #26148.
Use gather_safely to await tasks, removing the warning that a coroutine was
not awaited.
Closesscylladb/scylladb#26435
(cherry picked from commit 24d17c3ce5)
Closesscylladb/scylladb#26663
When requesting repair for tablets of a colocated table, the request
fails with an error. Improve the error message to show the table names
instead of table IDs, because the table names are more useful for users.
Fixes scylladb/scylladb#26567
Closes scylladb/scylladb#26568
(cherry picked from commit b808d84d63)
Closesscylladb/scylladb#26624
Load-and-stream is broken when running concurrently to the finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes the barrier and succeeds
3) load-and-stream runs now, starts writing an sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing the sstable
6) the sstable cannot be loaded since it spans two tablets
Two possible fixes (maybe both):
1) load-and-stream waits for the topology to quiesce
2) perform split compaction on an sstable that spans both sibling tablets
This patch implements #1. By waiting for the topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance the coordinator is handling some topology operation like
split finalization.
Fixes https://github.com/scylladb/scylladb/issues/26455.
- (cherry picked from commit 3abc66da5a)
- (cherry picked from commit 4654cdc6fd)
Parent PR: #26456
Closes scylladb/scylladb#26651
* github.com:scylladb/scylladb:
test: Add reproducer for l-a-s and split synchronization issue
sstables_loader: Synchronize tablet split and load-and-stream
The test should pass without the fix for scylladb/scylladb#26615,
because `executor::updata_table()` uses
`service::prepare_new_view_announcement()`, which creates view building
tasks for the view.
But it's better to add this test anyway.
(cherry picked from commit 34503f43a1)
A `CreateTable` request creates GSI/LSI together with the base table;
the base table is empty, so we don't need to actually build the view.
In tablet-based keyspaces we can simply not create view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).
Vnode-based keyspaces will use the "old" view builder logic, which
will process the view and mark it as built.
Fixesscylladb/scylladb#26615
(cherry picked from commit 8fbf122277)
Apply two main changes to the s3_client error handling:
1. Add a loop to s3_client's `make_request` for the case where the retry strategy will not help because the request itself has to be updated, for example on authentication token expiration or a stale timestamp in the request header.
2. Refine the way we handle exceptions in the `chunked_download_source` background fiber: we now carry the original `exception_ptr` and also wrap EVERY exception in `filler_exception`, to prevent the retry strategy from retrying the request altogether.
Fixes: https://github.com/scylladb/scylladb/issues/26483
Should be backported to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions.
- (cherry picked from commit 55fb2223b6)
- (cherry picked from commit db1ca8d011)
- (cherry picked from commit 185d5cd0c6)
- (cherry picked from commit 116823a6bc)
- (cherry picked from commit 43acc0d9b9)
- (cherry picked from commit 58a1cff3db)
- (cherry picked from commit 1d34657b14)
- (cherry picked from commit 4497325cd6)
- (cherry picked from commit fdd0d66f6e)
Parent PR: #26527
Closes scylladb/scylladb#26650
* github.com:scylladb/scylladb:
s3_client: tune logging level
s3_client: add logging
s3_client: improve exception handling for chunked downloads
s3_client: fix indentation
s3_client: add max for client level retries
s3_client: remove `s3_retry_strategy`
s3_client: support high-level request retries
s3_client: just reformat `make_request`
s3_client: unify `make_request` implementation
Currently, the data returned by `database::get_tables_metadata()` and
`database::get_token_metadata()` may not be consistent. Specifically,
the tables metadata may contain some tablet-based tables before their
tablet maps appear in the token metadata. This is going to be fixed
after issue scylladb/scylladb#24414 is closed, but for the time being
work around it by accessing the token metadata via
`table`->effective_replication_map() - that token metadata is guaranteed
to have the tablet map of the `table`.
Fixes: scylladb/scylladb#26403
Closes scylladb/scylladb#26588
(cherry picked from commit f76917956c)
Closesscylladb/scylladb#26631
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.
Fix this by using `seastar::shared_ptr` for the state variant
alternatives(`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.
The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.
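A simplified, standard-library sketch of the ownership change (std::shared_ptr and made-up state fields stand in for the Seastar types; only the shape of the fix is shown):

```cpp
#include <memory>
#include <variant>

// Illustrative stand-ins for the real per-strategy state types.
struct leveled_state { int highest_level = 0; };
struct twcs_state { long newest_window = 0; };

using leveled_state_ptr = std::shared_ptr<leveled_state>;
using twcs_state_ptr = std::shared_ptr<twcs_state>;

// The wrapper is still passed around by reference, but the variant
// alternatives are shared_ptrs, so a running compaction can keep its own
// copy of the state.
struct compaction_strategy_state {
    std::variant<leveled_state_ptr, twcs_state_ptr> state =
        std::make_shared<leveled_state>();
};

void run_compaction(compaction_strategy_state& wrapper) {
    // Copy the shared_ptr up front; even if ALTER TABLE replaces
    // wrapper.state mid-compaction, this state object stays alive until
    // the compaction finishes.
    auto my_state = std::get<leveled_state_ptr>(wrapper.state);
    // ... long-running compaction using *my_state ...
    (void)my_state;
}
```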
The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.
Fixes#25913
Issue probably started after 9d3755f276, so backport to 2025.4
- (cherry picked from commit 1cd43bce0e)
- (cherry picked from commit 35159e5b02)
- (cherry picked from commit 18c071c94b)
Parent PR: #26593
Closes scylladb/scylladb#26625
* github.com:scylladb/scylladb:
compaction: fix use after free when strategy is altered during compaction
compaction/twcs: pass compaction_strategy_state to internal methods
compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction
shared_ptr<abstract_write_response_handler> instances are captured in
the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result,
an abstract_write_response_handler object may outlive its removal from
the _response_handlers map. We use write_handler_destroy_promise to
wait for such pending instances in cancel_write_handlers() and
cancel_all_write_response_handlers() to prevent use-after-free.
A better long-term solution might be to replace shared_ptr with
unique_ptr for abstract_write_response_handler and use a separate gate
to track the lmutate/rmutate lambdas. We do not actually need to wait
for these lambdas to finish before sending a timeout or error response
to the client, as we currently do in ~abstract_write_response_handler.
Fixesscylladb/scylladb#26355
(cherry picked from commit bbcf3f6eff)
The cancel_write_handlers() method was assumed to be called in a thread
context, likely because it was first used from gossiper events, where a
thread context already existed. Later, this method was reused in
abort_view_writes() and abort_batch_writes(), where threads are created
on the fly and appear redundant.
The drain_on_shutdown() method also used a thread, justified by some
"delicate lifetime issues", but it is unclear what that actually means.
It seems that a straightforward co_await should work just fine.
(cherry picked from commit b269f78fa6)
A strong pointer was held for the duration of thread::yield(),
preventing abstract_write_response_handler destruction and possibly
delaying the sending of timeout or error responses to the client.
This commit removes the strong pointer. Instead, we compute the
next iterator before calling timeout_cb(), so if the handler is
destroyed inside timeout_cb(), we already have a valid next iterator.
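The underlying iteration pattern, shown here with a plain std::map (this assumes the callback erases at most the current entry, which matches the handler-destroys-itself case described above):

```cpp
#include <functional>
#include <iterator>
#include <map>

// When the callback may erase the current element (invalidating its
// iterator), advance first and keep going from the precomputed position
// instead of pinning the element with a strong pointer.
void cancel_handlers(std::map<int, std::function<void(int)>>& handlers) {
    for (auto it = handlers.begin(); it != handlers.end(); ) {
        auto next = std::next(it);   // still valid if *it is erased below
        it->second(it->first);       // may erase *it from the map
        it = next;
    }
}
```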
(cherry picked from commit bf2ac7ee8b)
Load-and-stream is broken when running concurrently to the
finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements #1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes#26455.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 3abc66da5a)
Change all logging related to errors in the `chunked_download_source` background download fiber to the `info` level, to make them visible right away in the logs.
(cherry picked from commit fdd0d66f6e)
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.
(cherry picked from commit 1d34657b14)
To prevent the client from retrying time-skew and authentication errors indefinitely, add `max_attempts` to `client::make_request`.
(cherry picked from commit 43acc0d9b9)
It never worked as intended, so the credentials handling is moved to the same place where we handle time skew, since we have to re-authenticate the request.
(cherry picked from commit 116823a6bc)
Add an option to retry S3 requests at the highest level, including
reinitializing headers and reauthenticating. This addresses cases
where retrying the same request fails, such as when the S3 server
rejects a timestamp older than 15 minutes.
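A schematic sketch of the high-level retry loop; every type and helper here (build_request, sign_request, send, needs_fresh_request) is hypothetical and only illustrates the control flow, not the actual s3_client API:

```cpp
#include <string>

// Hypothetical request/response types and helpers, for illustration only.
struct http_request { std::string headers; };
struct http_response { int status = 200; };

http_request build_request() { return {}; }                // rebuilds headers (fresh timestamp)
http_request sign_request(http_request r) { return r; }    // re-authenticates the request
http_response send(const http_request&) { return {403}; }  // low-level send, has its own retry strategy
bool needs_fresh_request(const http_response& r) { return r.status == 403; }

// High-level retry: the request itself has to be rebuilt and re-signed,
// so retrying the identical bytes (what the low-level strategy does)
// cannot help. max_attempts keeps time-skew/auth errors from looping forever.
http_response make_request_with_retries(unsigned max_attempts) {
    for (unsigned attempt = 1; ; ++attempt) {
        auto resp = send(sign_request(build_request()));
        if (!needs_fresh_request(resp) || attempt >= max_attempts) {
            return resp;
        }
        // Otherwise fall through: rebuild and re-authenticate, then retry.
    }
}
```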
(cherry picked from commit 185d5cd0c6)
Refactor `make_request` to use a single core implementation that
handles authentication and issues the HTTP request. All overloads now
delegate to this unified method.
(cherry picked from commit 55fb2223b6)
Tablet merge of base tables is only safe if there is at most one replica
in each rack. For more details on why that is the case, please see
scylladb/scylladb#17265. If rf-rack-valid-keyspaces is turned on,
this condition is satisfied, so allow the merge in that case.
Fixes: scylladb/scylladb#26273
(cherry picked from commit 189ad96728)
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.
Fix this by using `seastar::shared_ptr` for the state variant
alternatives(`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.
The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.
Fixes#25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 18c071c94b)
During TWCS compaction, multiple methods independently fetch the
compaction_strategy_state using get_state(). This can lead to
inconsistencies if the compaction strategy is ALTERed while the
compaction is in progress.
This patch fixes a part of this issue by passing down the state to the
lower level methods as parameters instead of fetching it repeatedly.
Refs #25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 35159e5b02)
The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.
Refs #26546
Refs #25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 1cd43bce0e)
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.
We fix this issue and add a regression test in this PR.
Fixes#26569
This is an important bug fix, so it should be backported to all branches
with the Raft-based recovery procedure (2025.2 and newer branches).
- (cherry picked from commit ec3a35303d)
- (cherry picked from commit da8748e2b1)
- (cherry picked from commit 71de01cd41)
Parent PR: #26572
Closes scylladb/scylladb#26599
* github.com:scylladb/scylladb:
test: test_raft_recovery_entry_loss: fix the typo in the test case name
test: verify that schema pulls are disabled in the Raft-based recovery procedure
raft topology: disable schema pulls in the Raft-based recovery procedure
This series of patches improves test vector_store_client_test stability. The primary issue with flaky connections was discovered while working on PR #26308.
Key Changes:
- Fixes premature connection closures in the mock server:
The mock HTTP server was not consuming request payloads, causing it to close connections immediately after a response. Subsequent tests attempting to reuse these closed connections would fail intermittently, leading to flakiness. The server has been updated to handle payloads correctly.
- Removes a retry workaround:
With the underlying connection issue resolved, the retry logic in the vector_store_client_test_ann_request test is no longer needed and has been removed.
- Mocks the DNS resolver in tests:
The vector_store_client_uri_update_to_invalid test has been corrected to mock DNS lookups, preventing it from making real network requests.
- Corrects request timeout handling:
A bug has been fixed where the request timeout was not being reset between consecutive requests.
- Unifies test timeouts:
Timeouts have been standardized across the test suite for consistency.
Fixes: #26468
It is recommended to backport this series to the 2025.4 branch. Since these changes only affect test code and do not alter any production logic, the backport is safe. Addressing this test flakiness will improve the stability of the CI pipeline and prevent it from blocking unrelated patches.
- (cherry picked from commit ac5e9c34b6)
- (cherry picked from commit 2eb752e582)
- (cherry picked from commit d99a4c3bad)
- (cherry picked from commit 0de1fb8706)
- (cherry picked from commit 62deea62a4)
Parent PR: #26374
Closes scylladb/scylladb#26551
* github.com:scylladb/scylladb:
vector_search: Unify test timeouts
vector_search: Fix missing timeout reset
vector_search: Refactor ANN request test
vector_search: Fix flaky connection in tests
vector_search: Fix flaky test by mocking DNS queries
The test uses CQL tracing to check which files were read by a query.
This is flaky if the coordinator and the replica are different shards,
because the Python driver only waits for the coordinator, and not
for replicas, to finish writing their traces.
(So it might happen that the Python driver returns a result
with only coordinator events and no replica events).
Let's just dodge the issue by using --smp=1.
Fixes scylladb/scylladb#26432
Closes scylladb/scylladb#26434
(cherry picked from commit c35b82b860)
Closesscylladb/scylladb#26492
Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change; they still point to the old open-source docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these.
Fixes: scylladb/scylladb#26320
Broken documentation link fix for the tool help output, needs backport to all live source-available versions.
- (cherry picked from commit 5a69838d06)
- (cherry picked from commit 15a4a9936b)
- (cherry picked from commit fe73c90df9)
Parent PR: #26322
Closes scylladb/scylladb#26390
* github.com:scylladb/scylladb:
tools/scylla-sstable: fix doc links
release: adjust doc_link() for the post source-available world
tools/scylla-nodetool: remove trailing " from doc urls
In test_two_tablets_concurrent_repair_and_migration_repair_writer_level
safe_rolling_restart returns a ready cql session. However, get_all_tablet_replicas
uses the cql reference from the manager, which isn't ready. Wait for cql.
Fixes: #26328
Closes scylladb/scylladb#26349
(cherry picked from commit 0e73ce202e)
Closesscylladb/scylladb#26362
sstable::compute_shards_for_this_sstable() has a temporary of type
std::vector<dht::token_range> (aka dht::partition_range_vector), which
allocates a contiguous 300k when loading an sstable from disk. This
causes large allocation warnings (it doesn't really stress the allocator
since this typically happens during startup, but best to clear the warning
anyway).
Fix this by changing the container to be a chunked_vector. It is passed
to dht::ring_position_range_vector_sharder, but since we're the only
user, we can change that class to accept the new type.
Fixes#24198.
Closesscylladb/scylladb#26353
(cherry picked from commit 7230a04799)
Closesscylladb/scylladb#26360
We do this at the end of `test_raft_recovery_entry_loss`. It's not worth
adding a separate regression test, as tests of the recovery procedure
are complicated and have a long running time. Also, we choose
`test_raft_recovery_entry_loss` out of all tests of the recovery
procedure because it does some schema changes.
(cherry picked from commit da8748e2b1)
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.
The old gossip-based recovery procedure doesn't have this problem
because we disable schema pulls after completing the upgrade-to-group0
procedure, which is a part of the old recovery procedure.
Fixes#26569
(cherry picked from commit ec3a35303d)
* seastar c8a3515f9...37983cd04 (2):
> iotune: fix very long warm up duration on systems with high cpu count
> Merge '[Backport 2025.4] iotune: Add warmup period to measurements' from Robert Bindar
iotune: Ignore measurements during warmup period
iotune: Fix warmup calculation bug and botched rebase
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes #26530
Closes scylladb/scylladb#26583
Introduce a counter metric to monitor instances where the background
filling fiber is blocked due to insufficient memory in the S3 client.
Closesscylladb/scylladb#26466
(cherry picked from commit 413739824f)
Closesscylladb/scylladb#26555
It turns out that Boost assertions are thread-unsafe
(they can't be used from multiple threads concurrently).
This sometimes causes the test to fail with cryptic log corruptions.
Fix that by switching to thread-safe checks.
Fixes scylladb/scylladb#24982
Closes scylladb/scylladb#26472
(cherry picked from commit 7c6e84e2ec)
Closesscylladb/scylladb#26554
The test previously used separate timeouts for requests (5s) and the
overall test case (10s).
This change unifies both timeouts to 10 seconds.
(cherry picked from commit 62deea62a4)
The `vector_store_client_test` could be flaky because the request timeout
was not consistently reset in all code paths. This could lead to a
timeout from a previous operation firing prematurely and failing the
test.
The fix ensures `abort_source_timeout` is reset before each request.
The implementation is also simplified by changing
`abort_source_timeout::reset` so that it combines the reset and arm
operations into a single invocation.
(cherry picked from commit 0de1fb8706)
Refactor the `vector_store_client_test_ann_request` test to use the
`vs_mock_server` class, unifying the structure of the test cases.
This change also removes retry logic that waited for the server to be ready.
This is no longer necessary because the handler now exists for all index names
and consumes the entire request payload, preventing connection closures.
Previously, the server did not handle requests for unconfigured
indexes, which caused the connection to close. This could lead to a
race condition where the client would attempt to reuse a closed
connection.
(cherry picked from commit d99a4c3bad)
The vector store mock server was not reading the ANN request body,
which could cause it to prematurely close the connection.
This could lead to a race condition where the client attempts to reuse a
closed connection from its pool, resulting in a flaky test.
The fix is to always read the request body in the mock server.
(cherry picked from commit 2eb752e582)
The `vector_store_client_uri_update_to_invalid` test was flaky because
it performed real DNS lookups, making it dependent on the network
environment.
This commit replaces the live DNS queries with a mock to make the test
hermetic and prevent intermittent failures.
The `vector_search_metrics_test` test did not call configure{vs};
as a consequence the test performed real DNS queries, which made it
flaky.
The refreshes counter increment has been moved before the call to the resolver:
in tests the resolver is mocked, so with the old order the increment in the production code was never reached.
Without this change, there is no way to test DNS counter increments.
The change also simplifies the test, making it more readable.
(cherry picked from commit ac5e9c34b6)
The `describe_multi_item` function treated the last reference-captured
argument as the number of used RCU half units. The caller
`batch_get_item`, however, expected this parameter to hold an item size.
This RCU value was then passed to
`rcu_consumed_capacity_counter::get_half_units`, treating the
already-calculated RCU integer as if it were a size in bytes.
This caused a second conversion that undercounted the true RCU. During
conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH`
(=4KB), so the double conversion divided the number of bytes by 16 MB.
The fix removes the second conversion in `describe_multi_item` and
changes the API of `describe_multi_item`.
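A small worked example of the undercounting, assuming RCU half-units are computed as ceil(bytes / 4 KB) (the helper name and exact rounding are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// One half-unit per started 4 KB block (RCU_BLOCK_SIZE_LENGTH in the
// commit message). Applying the conversion twice effectively divides the
// item size by 4096 * 4096 = 16 MB, so almost everything rounds down to
// the minimum charge.
constexpr uint64_t rcu_block = 4096;
constexpr uint64_t half_units(uint64_t bytes) {
    return (bytes + rcu_block - 1) / rcu_block;
}

int main() {
    uint64_t item_size = 1'000'000;                   // ~1 MB item
    assert(half_units(item_size) == 245);             // correct charge
    assert(half_units(half_units(item_size)) == 1);   // double conversion undercounts
}
```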
Fixes: https://github.com/scylladb/scylladb/pull/25847
Closes scylladb/scylladb#25842
(cherry picked from commit a55c5e9ec7)
Closesscylladb/scylladb#26539
The current description is not accurate: the function doesn't throw
an exception if there's an invalid materialized view. Instead, it
simply logs the keyspaces that violate the requirement.
Furthermore, the experimental feature `views-with-tablets` is no longer
necessary for considering a materialized view as valid. It was dropped
in scylladb/scylladb@b409e85c20. The
replacement for it is the cluster feature `VIEWS_WITH_TABLETS`.
Fixes scylladb/scylladb#26420
Closes scylladb/scylladb#26421
(cherry picked from commit a9577e4d52)
Closesscylladb/scylladb#26476
Pass an appropriate query state for auth queries called from the service
level cache reload. We use the function qos_query_state to select a
query_state based on the caller context - for internal queries, we set a
very long timeout.
The service level cache reload is called from group0 reload. We want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have a strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.
Fixes https://github.com/scylladb/scylladb/issues/25290
backport possible to improve stability
- (cherry picked from commit a1161c156f)
- (cherry picked from commit 3c3dd4cf9d)
- (cherry picked from commit ad1a5b7e42)
Parent PR: #26180
Closes scylladb/scylladb#26479
* github.com:scylladb/scylladb:
service/qos: set long timeout for auth queries on SL cache update
auth: add query_state parameter to query functions
auth: refactor query_all_directly_granted
We noticed during work on scylladb/seastar#2802 that on the i7i family
(later shown to hold for the i4i family as well),
the disks report their physical sector size incorrectly
as 512 bytes, while we showed we can achieve much better write IOPS with
4096 bytes.
This is not the case on the AWS i3en family, where the reported 512-byte
physical sector size is also the size with which we achieve the best write IOPS.
This patch works around this issue by changing `scylla_io_setup` to parse
the instance type out of `/sys/devices/virtual/dmi/id/product_name`
and run iotune with the correct request size based on the instance type.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#25315
(cherry picked from commit 2c74a6981b)
Closesscylladb/scylladb#26474
Expecting the group 0 read barrier to succeed with a timeout of 1s, just
after restarting 3 out of 5 voters, turned out to be flaky. In some
unlikely scenarios, such as multiple vote splits, the Raft leader
election could finish after the read barrier times out.
To deflake the test, we increase the timeout of Raft operations back to
300s for read barriers we expect to succeed.
Fixes #26457
Closes scylladb/scylladb#26489
(cherry picked from commit 5f68b9dc6b)
Closesscylladb/scylladb#26522
Using the name regular as the incremental mode could be confusing, since
regular might be interpreted as the non-incremental repair. It is better
to use incremental directly.
Before:
- regular (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
After:
- incremental (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
Fixes #26503
Closes scylladb/scylladb#26504
(cherry picked from commit 13dd88b010)
Closesscylladb/scylladb#26521
Materialized views are currently in the experimental phase and using them
in tablet-based keyspaces requires starting Scylla with an experimental feature,
`views-with-tablets`. Any attempts to create a materialized view or secondary
index when it's not enabled will fail with an appropriate error.
After considerable effort, we're drawing close to bringing views out of the
experimental phase, and the experimental feature will no longer be needed.
However, materialized views in tablet-based keyspaces will still be restricted,
and creating them will only be possible after enabling the configuration option
`rf_rack_valid_keyspaces`. That's what we do in this PR.
In this patch, we adjust existing tests in the tree to work with the new
restriction. That shouldn't have been necessary because we've already seemingly
adjusted all of them to work with the configuration option, but some tests hid
well. We fix that mistake now.
After that, we introduce the new restriction. What's more, when starting Scylla,
we verify that there is no materialized view that would violate the contract.
If there are some that do, we list them, notify the user, and refuse to start.
High-level implementation strategy:
1. Name the restrictions in form of a function.
2. Adjust existing tests.
3. Restrict materialized views by both the experimental feature
and the configuration option. Add validation test.
4. Drop the requirement for the experimental feature. Adjust the added test
and add a new one.
5. Update the user documentation.
Fixes scylladb/scylladb#23030
Backport: 2025.4, as we are aiming to support materialized views for tablets from that version.
- (cherry picked from commit a1254fb6f3)
- (cherry picked from commit d6fcd18540)
- (cherry picked from commit 994f09530f)
- (cherry picked from commit 6322b5996d)
- (cherry picked from commit 71606ffdda)
- (cherry picked from commit 00222070cd)
- (cherry picked from commit 288be6c82d)
- (cherry picked from commit b409e85c20)
Parent PR: #25802
Closes scylladb/scylladb#26416
* github.com:scylladb/scylladb:
view: Stop requiring experimental feature
db/view: Verify valid configuration for tablet-based views
db/view: Require rf_rack_valid_keyspaces when creating view
test/cluster/random_failures: Skip creating secondary indexes
test/cluster/mv: Mark test_mv_rf_change as skipped
test/cluster: Adjust MV tests to RF-rack-validity
test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
db/view: Name requirement for views with tablets
There was a race between loop in `view_building_worker::run_view_building_state_observer()`
and a moment when a batch was finishing its work (`.finally()` callback
in `view_building_worker::batch::start()`).
State observer waits on `_vb_state_machine.event` CV and when it's
awoken, it takes group0 read apply mutex and updates its state. While
updating the state, the observer looks at `batch::state` field and
reacts to it accordingly.
On the other hand, when a batch finishes its work, it sets `state` field
to `batch_state::finished` and does a broadcast on
`_vb_state_machine.event` CV.
So if the batch executes the callback in `.finally()` while the
observer is updating its state, the observer may miss the event on the
CV and never notice that the batch finished.
This patch fixes this by adding a `some_batch_finished` flag. Even if
the worker doesn't see an event on the CV, it will notice that the flag
was set and will do the next iteration.
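The same lost-wakeup pattern, modeled here with std::condition_variable instead of the Seastar one (a minimal sketch of the flag-plus-signal idea, not the actual view_building_worker code):

```cpp
#include <condition_variable>
#include <mutex>

// The batch sets a flag *and* signals the CV. If the observer is busy
// updating its state when the broadcast happens, it still sees the flag
// on its next iteration instead of sleeping forever.
struct vb_state {
    std::mutex m;
    std::condition_variable event;
    bool some_batch_finished = false;
};

void on_batch_finished(vb_state& s) {
    {
        std::lock_guard lk(s.m);
        s.some_batch_finished = true;
    }
    s.event.notify_all();
}

void observer_iteration(vb_state& s) {
    std::unique_lock lk(s.m);
    s.event.wait(lk, [&] { return s.some_batch_finished; });
    s.some_batch_finished = false;
    // ... take the group0 read-apply mutex and react to finished batches ...
}
```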
Fixes scylladb/scylladb#26204
Closes scylladb/scylladb#26289
(cherry picked from commit 8d0d53016c)
Closesscylladb/scylladb#26500
db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.
This method needs to be run only once during startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase the debuggability of any further issues
related to it (like https://github.com/scylladb/scylladb/issues/26403).
Fixes https://github.com/scylladb/scylladb/issues/26417
The patch should be backported to 2025.4
- (cherry picked from commit 575dce765e)
- (cherry picked from commit 84e4e34d81)
Parent PR: #26446
Closes scylladb/scylladb#26501
* github.com:scylladb/scylladb:
db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
db/view/view_building_worker: futurize and rename `start_background_fibers()`
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.
This method needs to be run only once during startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase the debuggability of any further issues
related to it (like scylladb/scylladb#26403).
Fixesscylladb/scylladb#26417
(cherry picked from commit 84e4e34d81)
Next commit will move `discover_existing_staging_sstables()`
to the foreground, so to prepare for this we need to futurize
`start_background_fibers()` method and change its name to better reflect
its purpose.
(cherry picked from commit 575dce765e)
In the Raft-based recovery procedure, we create a new group 0 and add
live nodes to it one by one. This means that for some time there are
nodes which belong to the topology, but not to the new group 0. The
voter handler running on the recovery leader incorrectly considers these
nodes while choosing voters.
The consequences:
- misleading logs, for example, "making servers {<ID of a non-member>}
voters", where the non-member won't become a voter anyway,
- increased chance of majority loss during the recovery procedure, for
example, all 3 nodes that first joined the new group 0 are in the same
dc and rack, but only one of them becomes a voter because the voter
handler tries to make non-members in other dcs/racks voters.
Fixes #26321
Closes scylladb/scylladb#26327
(cherry picked from commit 67d48a459f)
Closesscylladb/scylladb#26428
Pass an appropriate query state for auth queries called from the service
level cache reload. We use the function qos_query_state to select a
query_state based on the caller context - for internal queries, we set a
very long timeout.
The service level cache reload is called from group0 reload. We want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have a strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.
Fixesscylladb/scylladb#25290
(cherry picked from commit ad1a5b7e42)
Add a query_state parameter to several auth functions that execute
internal queries. Currently the queries use the
internal_distributed_query_state() query state, and we keep this as the
default, but we also want to be able to pass a query state from the
caller.
In particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when they are executed in some
other context.
(cherry picked from commit 3c3dd4cf9d)
Rewrite query_all_directly_granted to use execute_internal instead of
query_internal, in a style that is more consistent with the rest of the
module.
This will also be useful for a later change because execute_internal
accepts an additional query_state parameter.
(cherry picked from commit a1161c156f)
In one of the constructors of `named_value`, the `allowed_values`
argument isn't used.
(This means that if some config entry uses this constructor,
the values aren't validated on the config layer,
and might give some lower layer a bad surprise).
Fix that.
Fixes scylladb/scylladb#26371
Closes scylladb/scylladb#26196
(cherry picked from commit 3b338e36c2)
Closesscylladb/scylladb#26425
BYPASS CACHE is implemented for `bti_index_reader` by
giving it its own private `cached_file` wrappers over
Partitions.db and Rows.db, instead of passing it
the shared `cached_file` owned by the sstable.
But due to an oversight, the private `cached_file`s aren't
constructed on top of the raw Partitions.db and Rows.db
files, but on top of `cached_file_impl` wrappers around
those files. Which means that BYPASS CACHE doesn't
actually do its job.
Tests based on `scylla_index_page_cache_*` metrics
and on CQL tracing still see the reads from the private
files as "cache misses", but those misses are served
from the shared cached files anyway, so the tests don't see
the problem. In this commit we extend `test_bti_index.py`
with a check that looks at reactor's `io_queue` metrics
instead, and catches the problem.
Fixes scylladb/scylladb#26372
Closes scylladb/scylladb#26373
(cherry picked from commit dbddba0794)
Closesscylladb/scylladb#26424
We modify the requirements for using materialized views in tablet-based
keyspaces. Before, it was necessary to enable the configuration option
`rf_rack_valid_keyspaces`, having the cluster feature `VIEWS_WITH_TABLETS`
enabled, and using the experimental feature `views-with-tablets`.
We drop the last requirement.
We adjust code to that change and provide a new validation test.
We also update the user documentation to reflect the changes.
Fixesscylladb/scylladb#23030
(cherry picked from commit b409e85c20)
Creating a materialized view or a secondary index in a tablet-based
keyspace requires that the user enabled two options:
* experimental feature `views-with-tablets`,
* configuration option `rf_rack_valid_keyspaces`.
Because the latter has only become a necessity recently (in this series),
it's possible that there are already existing materialized views that
violate it.
We add a new check at start-up that iterates over existing views and
makes sure that that is not the case. Otherwise, Scylla notifies the user
of the problem.
(cherry picked from commit 288be6c82d)
We extend the requirements for being able to create materialized views
and secondary indexes in tablet-based keyspaces. It's now necessary to
enable the configuration option `rf_rack_valid_keyspaces`. This is
a stepping stone towards bringing materialized views and secondary
indexes with tablets out of the experimental phase.
We add a validation test to verify the changes.
Refs scylladb/scylladb#23030
(cherry picked from commit 00222070cd)
Materialized views are going to require the configuration option
`rf_rack_valid_keyspaces` when being created in tablet-based keyspaces.
Since random-failure tests still haven't been adjusted to work with it,
and because it's not trivial, we skip the cases when we end up creating
or dropping an index.
(cherry picked from commit 71606ffdda)
The test will not work with `rf_rack_valid_keyspaces`. Since the option
is going to become a requirement for using views with tablets, the test
will need to be rewritten to take that into consideration. Since that
adjustment doesn't seem trivial, we mark the test as skipped for the
time being.
(cherry picked from commit 6322b5996d)
Some of the new tests covering materialized views explicitly disabled
the configuration option `rf_rack_valid_keyspaces`. It's going to become
a new requirement for views with tablets, so we adjust those tests and
enable the option. There is one exception, the test:
`cluster/mv/test_mv_topology_change.py::test_mv_rf_change`
We handle it separately in the following commit.
(cherry picked from commit 994f09530f)
The test cases in the file aren't run via an existing interface like
`do_with_cql_env`, but they rely on a more direct approach -- calling
one of the schema loader tools. Because of that, they manage the
`db::config` object on their own and don't enable the configuration
option `rf_rack_valid_keyspaces`.
That hasn't been a problem so far since the test doesn't attempt to
create RF-rack-invalid keyspaces anyway. However, in an upcoming commit,
we're going to further restrict views with tablets and require that the
option is enabled.
To prepare for that, we enable the option in all test cases. It's only
necessary in a small subset of them, but it won't hurt to enforce it
everywhere, so let's do that.
Refs scylladb/scylladb#23958
(cherry picked from commit d6fcd18540)
We add a named requirement, a function, for materialized views with tablets.
It decides whether we can create views and secondary indexes in a given
keyspace. It's a stepping stone towards modifying the requirements for it.
This way, we keep the code in one place, so it's not possible to forget
to modify it somewhere. It also makes it more organized and concise.
(cherry picked from commit a1254fb6f3)
Backport motivation:
scylla-dtest PR [limits_test.py: remove tests already ported to scylladb repo](https://github.com/scylladb/scylla-dtest/pull/6232) that removes migrated tests got merged before branch-2025.4 separation
scylladb PR [test: dtest: test_limits.py: migrate from dtest](https://github.com/scylladb/scylladb/pull/26077) got merged after branch-2025.4 separation
This caused the tests to be fully removed from branch-2025.4. This backport PR makes sure the tests are present in scylladb branch-2025.4.
This PR migrates limits tests from dtest to this repository.
One reason is that there is an ongoing effort to migrate tests from dtest to here.
Debug logs are enabled on `test_max_cells` for `lsa-timing` logger, to have more information about memory reclaim operation times and memory chunk sizes. This will allow analysis of their value distributions, which can be helpful with debugging if the issue reoccurs.
Also, scylladb keeps sql files with metrics which, with some modifications, can be used to track metrics over time for some tests. This would show if there are pauses and spikes or the test performance is more or less consistent over time.
scylla-dtest PR that removes migrated tests:
[limits_test.py: remove tests already ported to scylladb repo #6232](https://github.com/scylladb/scylla-dtest/pull/6232)
Fixes#25097
- (cherry picked from commit 82e9623911)
- (cherry picked from commit 70128fd5c7)
- (cherry picked from commit 554fd5e801)
- (cherry picked from commit b3347bcf84)
Parent PR: #26077
Closes scylladb/scylladb#26359
* github.com:scylladb/scylladb:
test: dtest: limits_test.py: test_max_cells log level
test: dtest: limits_test.py: make the tests work
test: dtest: test_limits.py: remove test that are not being migrated
test: dtest: copy unmodified limits_test.py
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot.
(After Scylla dies in the middle of a sstable write).
But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.
The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.
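The underlying pitfall is generic C++ `std::multimap` behaviour rather than anything sstable-specific: erasing by key removes every entry with that key, while erasing through an iterator removes only that single entry. A minimal sketch of the difference, with a hypothetical generation-to-component multimap standing in for the real bookkeeping:

```c++
#include <cassert>
#include <map>
#include <string>

int main() {
    // Hypothetical stand-in for "generation -> component file" bookkeeping;
    // not the actual Scylla data structure.
    std::multimap<int, std::string> generations_found{
        {7, "TemporaryHashes.db"},
        {7, "Data.db"},
        {7, "Index.db"},
    };

    // Buggy pattern: erasing by key drops *all* components of generation 7,
    // so the siblings of TemporaryHashes.db are forgotten too.
    // generations_found.erase(7);

    // Correct pattern: locate the one entry we mean and erase only its iterator.
    auto [first, last] = generations_found.equal_range(7);
    for (auto it = first; it != last; ++it) {
        if (it->second == "TemporaryHashes.db") {
            generations_found.erase(it);
            break;
        }
    }

    assert(generations_found.size() == 2); // Data.db and Index.db are still tracked.
}
```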
Fixes scylladb/scylladb#26393
(cherry picked from commit 6efb807c1a)
There are two tests which effectively check nothing.
They intend to check that distributed loader removes "leftover" sstable
files. So they create some incomplete sstables, run the test env
on the directory, and check that the files disappear.
But the test env completely clears the test directory before
the distributed loader looks at the files, so the tests succeed trivially.
Fix that by adding a config knob to the test env which instructs it
not to clear the directory before the test.
(cherry picked from commit 16cb223d7f)
The doc links in scylla-sstable help output are static, so they always
point to the documentation of the latest stable release, not to the
documentation of the release the tool binary is from. On top of that,
the links point to old open-source documentation, which is now EOL.
Fix both problems: point the links at the new source-available documentation
pages and make them version-aware.
(cherry picked from commit fe73c90df9)
Set `lsa-timing` logger log level to `debug`. This will help with
the analysis of the whole spectrum of memory reclaim operation
times and memory sizes.
Refs #25097
(cherry picked from commit b3347bcf84)
Remove unused imports and markers.
Remove Apache license header.
Enable the test in suite.yaml for `dev` and `debug` modes.
Refs #25097
(cherry picked from commit 554fd5e801)
Copy limits_test.py from scylla-dtest to test/cluster/dtest/limits_test.py.
Add license header.
Disable it for `debug`, `dev`, and `release` modes.
Refs #25097
(cherry picked from commit 82e9623911)
co_return api_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
throw api_error::validation(format("Tag {} containing non-numerical value requests vnodes, but vnodes are forbidden by configuration option `tablets_mode_for_new_keyspaces: enforced`", INITIAL_TABLETS_TAG_KEY));
}
initial_tablets = std::nullopt;
elogger.trace("Following {} tag containing non-numerical value, Alternator will attempt to create a keyspace {} with vnodes.", INITIAL_TABLETS_TAG_KEY, keyspace_name);
}
} else {
// No per-table tag present, use the value from config
elogger.trace("Following the `tablets_mode_for_new_keyspaces` flag from the settings, Alternator will attempt to create a keyspace {} with tablets.", keyspace_name);
} else {
initial_tablets = std::nullopt;
elogger.trace("Following the `tablets_mode_for_new_keyspaces` flag from the settings, Alternator will attempt to create a keyspace {} with vnodes.", keyspace_name);
slogger.warn("alternator_warn_authorization=true: {} for user {}, client address {}", e.what(), user, client_address);
}
throw std::move(e);
} else {
if (warn_authorization) {
slogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {} for user {}, client address {}", e.what(), user, client_address);
seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("ProjectionExpression")).aggregate(aggregate_labels).set_skip_when_empty()
});
// Only register the following metrics for the global metrics, not per-table
"description":"Load the sstables and stream to the primary replica node within the scope, if one is specified. If not, stream to the global primary replica.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -984,7 +992,7 @@
]
},
{
"path":"/storage_service/cleanup_all",
"path":"/storage_service/cleanup_all/",
"operations":[
{
"method":"POST",
@@ -994,6 +1002,30 @@
"produces":[
"application/json"
],
"parameters":[
{
"name":"global",
"description":"true if cleanup of entire cluster is requested",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/mark_node_as_clean",
"operations":[
{
"method":"POST",
"summary":"Mark the node as clean. After that the node will not be considered as needing cleanup during automatic cleanup which is triggered by some topology operations",
"type":"void",
"nickname":"reset_cleanup_needed",
"produces":[
"application/json"
],
"parameters":[]
}
]
@@ -2924,7 +2956,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'regular', or 'full'. 'regular': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
throw exceptions::invalid_request_exception(format("Cannot add column {} because a column with the same name was dropped too recently. Please retry after {} seconds",
# A comma-separated list of secondary vector store node URIs. These nodes are used as a fallback when all primary nodes are unavailable, and are typically located in a different availability zone for high availability.
# Options for encrypted connections to the vector store. These options are used for HTTPS URIs in vector_store_primary_uri and vector_store_secondary_uri.
throw exceptions::configuration_exception(sstring("Missing sub-option '") + compression_parameters::SSTABLE_COMPRESSION + "' for the '" + KW_COMPRESSION + "' option.");
throw exceptions::invalid_request_exception(format("index names shouldn't be more than {:d} characters long (got \"{}\")", schema::NAME_LENGTH, _index_name.c_str()));
, enable_sstables_mc_format(this, "enable_sstables_mc_format", value_status::Unused, true, "Enable SSTables 'mc' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
, enable_sstables_md_format(this, "enable_sstables_md_format", value_status::Unused, true, "Enable SSTables 'md' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
"* chunk_length_in_kb: (Default: 4) The size of chunks to compress in kilobytes. Allowed values are powers of two between 1 and 128.\n"
"* crc_check_chance: (Default: 1.0) Not implemented (option value is ignored).\n"
"* compression_level: (Default: 3) Compression level for ZstdCompressor and ZstdWithDictsCompressor. Higher levels provide better compression ratios at the cost of speed. Allowed values are integers between 1 and 22.")
, alternator_port(this, "alternator_port", value_status::Used, 0, "Alternator API port.")
, alternator_https_port(this, "alternator_https_port", value_status::Used, 0, "Alternator API HTTPS port.")
, alternator_address(this, "alternator_address", value_status::Used, "0.0.0.0", "Alternator API listening address.")
, alternator_enforce_authorization(this, "alternator_enforce_authorization", value_status::Used, false, "Enforce checking the authorization header for every request in Alternator.")
, alternator_enforce_authorization(this, "alternator_enforce_authorization", liveness::LiveUpdate, value_status::Used, false, "Enforce checking the authorization header for every request in Alternator.")
, alternator_warn_authorization(this, "alternator_warn_authorization", liveness::LiveUpdate, value_status::Used, false, "Count and log warnings about failed authentication or authorization")
, alternator_write_isolation(this, "alternator_write_isolation", value_status::Used, "", "Default write isolation policy for Alternator.")
, alternator_streams_time_window_s(this, "alternator_streams_time_window_s", value_status::Used, 10, "CDC query confidence window for alternator streams.")
"Allow writing to system tables using the .scylla.alternator.system prefix")
, alternator_max_expression_cache_entries_per_shard(this, "alternator_max_expression_cache_entries_per_shard", liveness::LiveUpdate, value_status::Used, 2000, "Maximum number of cached parsed request expressions, per shard.")
, vector_store_primary_uri(this, "vector_store_primary_uri", liveness::LiveUpdate, value_status::Used, "", "A comma-separated list of vector store node URIs. If not set, vector search is disabled.")
, vector_store_primary_uri(
this, "vector_store_primary_uri", liveness::LiveUpdate, value_status::Used, "", "A comma-separated list of primary vector store node URIs. These nodes are preferred for vector search operations.")
"A comma-separated list of secondary vector store node URIs. These nodes are used as a fallback when all primary nodes are unavailable, and are typically located in a different availability zone for high availability.")
"Options for encrypted connections to the vector store. These options are used for HTTPS URIs in vector_store_primary_uri and vector_store_secondary_uri. The available options are:\n"
"* truststore: (Default: <not set. use system truststore>) Location of the truststore containing the trusted certificate for authenticating remote servers.")
, abort_on_ebadf(this, "abort_on_ebadf", value_status::Used, true, "Abort the server on incorrect file descriptor access. Throws exception when disabled.")
, error_injections_at_startup(this, "error_injections_at_startup", error_injection_value_status, {}, "List of error injections that should be enabled on startup.")
, topology_barrier_stall_detector_threshold_seconds(this, "topology_barrier_stall_detector_threshold_seconds", value_status::Used, 2, "Report sites blocking topology barrier if it takes longer than this.")
, enable_tablets(this, "enable_tablets", value_status::Used, false, "Enable tablets for newly created keyspaces. (deprecated)")
, tablets_mode_for_new_keyspaces(this, "tablets_mode_for_new_keyspaces", value_status::Used, tablets_mode_t::mode::unset, "Control tablets for new keyspaces. Can be set to the following values:\n"
, tablets_mode_for_new_keyspaces(this, "tablets_mode_for_new_keyspaces", liveness::LiveUpdate, value_status::Used, tablets_mode_t::mode::unset, "Control tablets for new keyspaces. Can be set to the following values:\n"
"\tdisabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option\n"
"\tenabled: New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option\n"
"\tenabled: New keyspaces use tablets by default, unless disabled by the tablets={'enabled':false} option\n"
"\tenforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option")
static const sstring query = format("SELECT table_id, timestamp, stream_state, stream_id FROM {}.{} WHERE table_id = ?", NAME, CDC_STREAMS_HISTORY);
static const sstring query_all = format("SELECT table_id, timestamp, stream_state, stream_id FROM {}.{} WHERE table_id = ?", NAME, CDC_STREAMS_HISTORY);
static const sstring query_from = format("SELECT table_id, timestamp, stream_state, stream_id FROM {}.{} WHERE table_id = ? AND timestamp > ?", NAME, CDC_STREAMS_HISTORY);
vbc_logger.warn("Work on tasks {} on replica {}, failed with error: {}", tasks, replica, std::current_exception());
vbc_logger.log(log_level::warn, rate_limit, "Work on tasks {} on replica {}, failed with error: {}",
tasks, replica, std::current_exception());
rpc_failed = true;
}
if (rpc_failed) {
co_await seastar::sleep(backoff_duration);
_vb_sm.event.broadcast();
co_return std::nullopt;
}
if (tasks.size() != remote_results.size()) {
on_internal_error(vbc_logger, fmt::format("Number of tasks ({}) and results ({}) do not match for replica {}", tasks.size(), remote_results.size(), replica));
}
// In `view_building_coordinator::work_on_view_building()` we made sure that
// each replica has its own entry in `_finished_tasks`, so now we can just take a shared lock
// and insert the IDs of finished tasks into this replica's bucket, as there is at most one instance of this method for each replica.
throwstd::runtime_error(fmt::format("Tried to start view building tasks ({}) on shard {} but the shard is busy",_state.tasks_map[id]->tasks,_state.tasks_map[id]->replica.shard,_state.tasks_map[id]->tasks));
throwstd::runtime_error(fmt::format("Task {} was not found for base table {} on replica {}",id,*building_state.currently_processed_base_table,my_replica));
- `process_staging` - process (generate view updates and move to base directory)
all staging sstables associated with the tablet replica of the base table
The state of an alive task can be either:
- IDLE
- STARTED
- ABORTED
If the task doesn't exist in the state, this means it was finished or aborted.
View building tasks are created when:
- `build_range` tasks:
- a view/index is created
@@ -38,12 +31,29 @@ View building tasks are created when:
- `process_staging` tasks:
- a staging sstable was registered to `view_building_worker`
A task might be aborted in two ways: by deleting it or by setting its state to `ABORTED`.
If a view/keyspace is dropped, then its tasks are aborted by deleting them as they are no longer needed.
On the other hand, at the beginning of a tablet operation (migration/resize/RF change), relevant view building tasks are aborted using the `ABORTED` state.
This intermediate state is needed so that new tasks can be created at the end of the operation, or in case of failure and rollback (aborted tasks are also deleted then).
### Lifetime of a task
The view building coordinator starts a task only if its tablet is not in transition (`tablet_map.get_tablet_transition_info(tid) == nullptr`).
The group0 state stores only tasks that haven't been completed yet or were aborted but haven't been cleaned up yet.
When a task is created, it is stored in group0 state (`system.view_building_tasks`) to be processed in the future.
Then at some point, the view building coordinator will decide to process it by sending a [`work_on_view_building_tasks` RPC](#rpc) to a worker.
Unless the task was aborted, the worker will eventually reply that the task was finished. After the coordinator gets the response from the worker,
it temporarily saves the list of IDs of finished tasks and removes those tasks from group0 state (permanently marking them as finished) in 200ms intervals. (*)
This batching of the removal of finished tasks is done to reduce the number of generated group0 operations.
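As a rough, self-contained illustration of this batching idea (plain C++ with a hypothetical `remove_tasks_from_group0()` standing in for the real group0 mutation; not the coordinator's actual code):

```c++
#include <chrono>
#include <cstdio>
#include <thread>
#include <unordered_set>
#include <vector>

using task_id = unsigned;

// Hypothetical stand-in for the single group0 operation that permanently
// marks a batch of tasks as finished by deleting their rows.
static void remove_tasks_from_group0(const std::vector<task_id>& ids) {
    std::printf("removing %zu finished tasks in one group0 operation\n", ids.size());
}

int main() {
    std::unordered_set<task_id> finished; // filled as RPC responses arrive
    finished = {1, 2, 3, 4, 5};           // pretend five tasks just completed

    using namespace std::chrono_literals;
    // Instead of one group0 operation per finished task, accumulate IDs
    // and flush them periodically.
    for (int tick = 0; tick < 3; ++tick) {
        std::this_thread::sleep_for(200ms); // the ~200ms batching interval
        if (!finished.empty()) {
            std::vector<task_id> batch(finished.begin(), finished.end());
            finished.clear();
            remove_tasks_from_group0(batch); // one operation for the whole batch
        }
    }
}
```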
On the other hand, view building tasks can also be aborted for two main reasons:
- a keyspace/view was dropped
- tablet operations (see [tablet operations section](#tablet-operations))
In the first case we simply delete the relevant view building tasks as they are no longer needed.
But if a task needs to be aborted due to a tablet operation, we first set its `aborted` flag to true. We need to do this because we need the task information
to create new, adjusted tasks (if the operation succeeded) or to roll them back (if the operation failed).
Once a task is aborted by setting the flag, the abort cannot be revoked, so rolling back a task means creating a duplicate of it and removing the original task.
(*) - Because there is a time gap between when the coordinator learns that a task is finished (from the RPC response) and when the task is marked as completed,
it is possible that the coordinator may lose this information (e.g. due to Raft leader change). But each view building worker keeps track of finished tasks locally,
so when a new coordinator sends an RPC with the same view building tasks, the worker will immediately respond that those tasks were completed.
In the worst case, when both the coordinator and worker nodes are restarted, we can completely lose that information and will have to redo the work.
However, view building tasks are idempotent.
View building task struct:
```c++
@@ -53,14 +63,9 @@ struct view_building_task {
process_staging,
};
enum class task_state {
idle,
started,
aborted
};
utils::UUID id;
task_type type;
task_state state;
bool aborted;
table_id base_id;
table_id view_id; // is default value when `type == task_type::process_staging`
@@ -73,13 +78,25 @@ State machine:
```mermaid
stateDiagram-v2
[*] --> IDLE
IDLE --> [*]: aborted due to drop view
IDLE --> STARTED: vbc chooses to work on the task
STARTED --> ABORTED: aborted due to tablet operation
STARTED --> [*]: done or aborted due to drop view
IDLE --> ABORTED: aborted due to tablet operation
ABORTED --> [*]
state "the task is alive" as NORMAL
state "aborted flag is set to true" as ABORTED
[*] --> NORMAL: task is created
NORMAL --> work_on_view_building_tasks: view building coordinator sends an RPC
work_on_view_building_tasks --> [*]: the coordinator gets a response from the RPC call if the task wasn't aborted in the meantime
NORMAL --> [*]: aborted due to keyspace/view drop
NORMAL --> ABORTED: aborted due to tablet operation
ABORTED --> [*]: new adjusted task is created
ABORTED --> [*]: the task is rolled back with a new ID
state work_on_view_building_tasks {
state "view building worker is executing the task" as EXECUTE
state "the worker saves information that the task was finished locally" as DONE
@@ -263,3 +263,16 @@ To list all available CDC streams for a tablets-based keyspace:
...
Query all streams to read the entire CDC log.
Garbage collection of CDC streams metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For tablets-based keyspaces, Scylla periodically performs garbage collection of old CDC streams metadata in the ``system.cdc_timestamps`` and ``system.cdc_streams`` tables.
This process removes information about streams that are no longer needed, helping to prevent unbounded growth of the metadata tables.
The garbage collection process runs periodically in the background and examines streams that have been closed.
It removes information about a stream if the stream's close timestamp is older than the configured TTL of the CDC table.
Since the stream has been closed for longer than the TTL, this means that all rows in this stream have also exceeded their TTL and expired, unless the table's TTL was altered to a smaller value after some rows have been written.
.. warning::
   When altering the TTL of a CDC table to a smaller value, you can lose information about streams that still contain live rows. Make sure to read all the information you need from the ``system.cdc_timestamps`` and ``system.cdc_streams`` tables before performing such alterations.
ScyllaDB's standard repair process scans and processes all the data on a node, regardless of whether it has changed since the last repair. This operation can be resource-intensive and time-consuming. The Incremental Repair feature provides a much more efficient and lightweight alternative for maintaining data consistency.
The core idea of incremental repair is to repair only the data that has been written or changed since the last repair was run. It intelligently skips data that has already been verified, dramatically reducing the time, I/O, and CPU resources required for the repair operation.
How It Works
------------
ScyllaDB keeps track of the repair status of its data files (SSTables). When new data is written or existing data is modified, it is considered "unrepaired." When you run an incremental repair, the process works as follows:
1. ScyllaDB identifies and selects only the SSTables containing unrepaired data.
2. It then synchronizes this data across the replica nodes.
3. Once the data is successfully synchronized, the corresponding SSTables are marked as "repaired."
Subsequent incremental repairs will skip these marked SSTables, focusing only on new data that has arrived since. To ensure data integrity, ScyllaDB's compaction process handles repaired and unrepaired SSTables separately.
This approach is highly efficient because it allows entire SSTables to be skipped, avoiding the overhead of reading and processing unchanged data.
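As a schematic sketch of this bookkeeping (illustrative only; the ``sstable_info`` struct and its fields are invented for the example and are not ScyllaDB's internal types):

.. code-block:: cpp

   #include <cstdint>
   #include <vector>

   // Illustrative convention: repaired_at == 0 means the sstable still holds
   // unrepaired data; any other value is the time of the repair that covered it.
   struct sstable_info {
       uint64_t id;
       uint64_t repaired_at = 0;
   };

   // Incremental repair selects only sstables not yet marked as repaired.
   std::vector<sstable_info*> select_for_incremental_repair(std::vector<sstable_info>& sstables) {
       std::vector<sstable_info*> selected;
       for (auto& sst : sstables) {
           if (sst.repaired_at == 0) {
               selected.push_back(&sst);
           }
       }
       return selected;
   }

   // Once the selected data has been synchronized across replicas, the
   // participating sstables are stamped as repaired so the next run skips them.
   void mark_repaired(std::vector<sstable_info*>& selected, uint64_t repair_time) {
       for (auto* sst : selected) {
           sst->repaired_at = repair_time;
       }
   }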
Prerequisites
-------------
Incremental Repair is only supported for tables that use the tablets architecture. It is not available for legacy vnode-based tables.
Incremental Repair Modes
------------------------
Incremental repair is currently disabled by default. You can control its behavior for a given repair operation using the ``incremental_mode`` parameter.
This is useful for enabling incremental repair, or in situations where you might need to force a full data validation.
The available modes are:
* ``incremental``: Performs a standard incremental repair. It processes only unrepaired data and skips SSTables that are already marked as repaired. The repair status is updated after the operation.
* ``full``: Forces the repair to process **all** SSTables, including those that have been previously repaired. This is useful when a complete data validation is required. The repair status is updated upon completion.
* ``disabled``: Completely disables the incremental repair logic for the current operation. The repair behaves like a classic, non-incremental repair, and it does not read or update any incremental repair status markers.
The ``incremental_mode`` parameter can be specified using ``nodetool cluster repair``, e.g., ``nodetool cluster repair --incremental-mode incremental``. It can also be specified with the REST API, e.g., ``curl -X POST "http://127.0.0.1:10000/storage_service/tablets/repair?ks=ks1&table=tb1&tokens=all&incremental_mode=incremental"``.
Benefits of Incremental Repair
------------------------------
* **Faster Repairs:** By targeting only new or changed data, repair operations complete in a fraction of the time.
* **Reduced Resource Usage:** Consumes significantly less CPU, I/O, and network bandwidth compared to a full repair.
* **More Frequent Repairs:** The efficiency of incremental repair allows you to run it more frequently, ensuring a higher level of data consistency across your cluster at all times.
Notes
-----
With the incremental repair feature, the repaired and unrepaired SSTables are compacted separately. After incremental repair, unrepaired SSTables become repaired SSTables, allowing them to be compacted together. A shorter repair interval is therefore recommended to mitigate potential space amplification resulting from these separate compactions.
@@ -208,8 +208,8 @@ Pick a zone where Haswell CPUs are found. Local SSD performance offers, accordin
An image with an NVMe disk interface is recommended.
(`More info <https://cloud.google.com/compute/docs/disks/local-ssd>`_)
Recommended instance types are `z3-highmem-highlssd and z3-highmem-standardlssd <https://cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types>`_,
`n1-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n1_machines>`_, and `n2-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n2_machines>`_.
Recommended instance types are `z3-highmem-highlssd, z3-highmem-standardlssd <https://cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types>`_,
and `n2-highmem <https://cloud.google.com/compute/docs/general-purpose-machines#n2_series>`_.
.. list-table::
@@ -274,38 +274,9 @@ Recommended instances types are `z3-highmem-highlssd and z3-highmem-standardlssd
- 1,406
- 36,000
.. list-table::
   :widths: 30 20 20 30
   :header-rows: 1
* - Model
- vCPU
- Mem (GB)
- Storage (GB)
* - n1-highmem-2
- 2
- 13
- 375
* - n1-highmem-4
- 4
- 26
- 750
* - n1-highmem-8
- 8
- 52
- 1,500
* - n1-highmem-16
- 16
- 104
- 3,000
* - n1-highmem-32
- 32
- 208
- 6,000
* - n1-highmem-64
- 64
- 416
- 9,000
* Storage: Each Z3 instance supports up to 36,000 GiB (~36 TiB) of local
Titanium SSD (or up to 72,000 GiB on bare-metal).
* Z3 Processor: Sapphire Rapids
.. list-table::
   :widths: 30 20 20 30
@@ -347,9 +318,15 @@ Recommended instances types are `z3-highmem-highlssd and z3-highmem-standardlssd
- 80
- 640
- 9,000
* - n2-highmem-96
- 96
- 768
- 9,000
Storage: each instance can support a maximum of 24 local SSD of 375 GB partitions each for a total of `9 TB per instance <https://cloud.google.com/compute/docs/disks>`_
* Storage: Each instance can support a maximum of 24 local SSD of 375 GB
partitions each for a total of `9 TB per instance <https://cloud.google.com/compute/docs/disks>`_.
* Processors: Ice Lake (the default for larger machine types) and Cascade Lake
* `Get Started Lesson on ScyllaDB University <https://university.scylladb.com/courses/scylla-essentials-overview/lessons/quick-wins-install-and-run-scylla/>`_
@@ -28,8 +28,16 @@ The command syntax is as follows:
scylla sstable <operation> <path to SStable>
You can specify more than one SSTable.
You can specify more than one SSTable. Additionally, the path to SSTable can point to an S3 fully qualified path in the form of s3://bucket-name/prefix/of/your/sstable/sstable-TOC.txt. To use this feature, you need to have AWS credentials set up in your environment. For more information, see :ref:`Configuring AWS S3 access <aws-s3-configuration>`. Additionally, you must ensure the tool is able to load the correct Scylla YAML file, which can be done using the --scylla-yaml-file parameter or by placing the YAML file in one of the default locations the tool checks.
Additionally, the path to SSTable can point to an S3 fully qualified path in
the form of s3://bucket-name/prefix/of/your/sstable/sstable-TOC.txt. To use
this feature, you need to have AWS credentials set up in your environment.
For more information, see :ref:`Configuring Object Storage <object-storage-configuration>`.
Additionally, you must ensure the tool is able to load the correct scylla.yaml
file, which can be done using the ``--scylla-yaml-file`` parameter or by
placing the YAML file in one of the default locations the tool checks.
Refer to the :doc:`Drivers Page </using-scylla/drivers/index>` for more drivers.
Refer to `ScyllaDB Drivers <https://docs.scylladb.com/stable/drivers/index.html>`_ for more drivers.
.. _internode-compression:
@@ -309,49 +262,18 @@ ScyllaDB uses experimental flags to expose non-production-ready features safely.
In recent ScyllaDB versions, these features are controlled by the ``experimental_features`` list in scylla.yaml, allowing one to choose which experimental features to enable.
Use ``scylla --help`` to get the list of experimental features.
.. _admin-keyspace-storage-options:
Keyspace storage options
------------------------
..
This section must be moved to Data Definition> CREATE KEYSPACE
when support for object storage is GA.
By default, SSTables of a keyspace are stored in a local directory.
As an alternative, you can configure your keyspace to be stored
on Amazon S3 or another S3-compatible object store.
Support for object storage is experimental and must be explicitly
enabled in the ``scylla.yaml`` configuration file by specifying
the ``keyspace-storage-options`` option:
.. code-block:: yaml

   experimental_features:
     - keyspace-storage-options
Before creating keyspaces with object storage, you also need to
:ref:`configure <object-storage-configuration>` the object storage
credentials and endpoint.
.. _admin-views-with-tablets:
Views with Tablets
------------------
By default, Materialized Views (MV) and Secondary Indexes (SI)
are disabled in keyspaces that use tablets.
Support for MV and SI with tablets is experimental and must be explicitly
enabled in the ``scylla.yaml`` configuration file by specifying
the ``views-with-tablets`` option:
Materialized Views (MV) and Secondary Indexes (SI) are enabled in keyspaces that use tablets
only when :term:`RF-rack-valid keyspaces <RF-rack-valid keyspace>` are enforced. That can be
done in the ``scylla.yaml`` configuration file by specifying
@@ -20,6 +20,8 @@ To clean up the data of a specific node and specific keyspace, use this command:
nodetool -h <host name> cleanup <keyspace>
To clean up the entire cluster, see :doc:`nodetool cluster cleanup </operating-scylla/nodetool-commands/cluster/cleanup/>`.
.. warning::
   Make sure there are no topology changes before running cleanup. To validate, run ``nodetool status``, all nodes should be in status Up Normal (``UN``).
**cluster cleanup** - A process that runs in the background and removes data no longer owned by nodes. Used for non-tablet (vnode-based) tables only.
Running ``cluster cleanup`` on a **single node** cleans up all non-tablet tables on all nodes in the cluster (tablet-enabled tables are cleaned up automatically).
For example:
::
nodetool cluster cleanup
See also `ScyllaDB Manager <https://manager.docs.scylladb.com/>`_.
- ``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'regular', or 'full'. 'regular': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular.
- ``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled'.
The Load and Stream feature extends nodetool refresh.
The ``--load-and-stream`` option loads arbitrary sstables into the cluster by reading the sstable data and streaming each partition to the replica(s) that owns it. In addition, the ``--scope`` and ``--primary-replica-only`` options are applied to filter the set of target replicas for each partition. For example, say the old cluster has 6 nodes and the new cluster has 3 nodes. One can copy the sstables from the old cluster to any of the new nodes and trigger refresh with load and stream.
The Load and Stream feature extends nodetool refresh. The new ``-las`` option loads arbitrary sstables that do not belong to a node into the cluster. It loads the sstables from the disk and calculates the data's owning nodes, and streams automatically.
For example, say the old cluster has 6 nodes and the new cluster has 3 nodes. We can copy the sstables from the old cluster to any of the new nodes and trigger the load and stream process.
Load and Stream makes restores and migrations much easier:
* You can place sstables from any node on any node
* No need to run nodetool cleanup to remove unused data
With the --primary-replica-only (or -pro) option, only the primary replica of each partition in an sstable will be used as the target.
--primary-replica-only must be applied together with --load-and-stream.
--primary-replica-only cannot be used with --scope, they are mutually exclusive.
--primary-replica-only requires repair to be run after the load and stream operation is completed.
Scope
-----