This test reproduces scylladb/scylladb#27353 using two injection
points. First, the test triggers an intra-node tablet migration and
suspends it at the streaming stage using the
intranode_migration_streaming_wait injection. Next, it enables the
alternator_executor_batch_write_wait injection, which suspends a
batch write after its cas_shard has already been created.
The test then issues several batch writes and waits until one of them
hits this injection on the destination shard. At this point, the
cas_shard.erm for that write is still in the streaming state,
meaning the executor would need to jump back to the source shard.
The test then resumes the suspended tablet migration, allowing it to
update the ERM on the source shard to write_both_read_new. After that,
the test releases the suspended batch write and expects it to perform
two shard jumps: first from the destination to the source shard, and
then again back to the source shard.
This commit adds the alternator_executor_batch_write_wait injection to
alternator/executor.cc. Coroutines are intentionally avoided in the
parallel_for_each lambda to prevent unnecessary coroutine-frame
allocations.
(cherry picked from commit e60bcd0011)
This commit is an optimization: avoiding destruction of
foreign objects on the wrong shard. Releasing objects allocated on a
different shard causes their ::free calls to be executed remotely,
which adds unnecessary load to the SMP subsystem.
Before this patch, a std::vector could be moved
to another shard. When the vector was eventually destroyed,
its ::free had to be marshalled back to the shard where the memory had
originally been allocated. This change avoids that overhead by passing
the vector by const reference instead.
The referenced objects lifetime correctness reasoning:
* the put_or_delete_item refs usages in put_or_delete_item_cas_request
are bound to its lifetime
* cas_request lifetime is bound to storage_proxy::cas future
* we don't release put_or_delete_item-s untill all storage_proxy::cas
calls are done.
(cherry picked from commit f00f7976c1)
The rf_rack_valid_keyspaces option needs to be turned on in order to
allow creating materialized views in tablet keyspaces with numeric RF
per DC. This is also necessary for secondary indexes because they use
materialized views underneath. However, this option is _not_ necessary
for vector store indexes because those use the external vector store
service for querying the list of keys to fetch from the main table, they
do not create a materialized view. The rf_rack_valid_keyspaces was, by
accident, required for vector indexes, too.
Remove the restriction for vector store indexes as it is completely
unnecessary.
Fixes: SCYLLADB-81
Closesscylladb/scylladb#27447
(cherry picked from commit bb6e41f97a)
Closesscylladb/scylladb#27455
When a vector store node becomes unreachable, a client request sent
before the keep-alive timer fires would hang until the CQL query
timeout was reached.
This occurred because the HTTP request writes to the TCP buffer and then
waits for a response. While data is in the buffer, TCP retransmissions
prevent the keep-alive timer from detecting the dead connection.
This patch resolves the issue by setting the `TCP_USER_TIMEOUT` socket
option, which applies an effective timeout to TCP retransmissions,
allowing the connection to fail faster.
Closesscylladb/scylladb#27388
(cherry picked from commit a54bf50290)
Closesscylladb/scylladb#27423
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:
```c++
// [1, 2, 3, 0] --> [0, 1, 3, 6]
std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```
However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is of uint64_t.
As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.
An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57
The fix is simple: the initial value is passed as uint64_t instead of int.
Fixes https://github.com/scylladb/scylladb/issues/27417Closesscylladb/scylladb#27418
(cherry picked from commit 9696ee64d0)
Even though that `view_building_coordinator::work_on_view_building` has
an `if` at the very beginning which checks whether the currently
processed base table is set, it only prints a message and continues
executing the rest of the function regardless of the result of the
check. However, some of the logic in the function assumes that the
currently processed base table field is set and tries to access the
value of the field. This can lead to the view building coordinator
accessing a disengaged optional, which is undefined behavior.
Fix the function by adding the clearly missing `co_await` to the check.
A regression test is added which checks that the view building state
observer - a different fiber which used to print a weird message due to
erroneus view building coordinator behavior - does not print a warning.
Fixes: scylladb/scylladb#27363Closesscylladb/scylladb#27373
(cherry picked from commit 654ac9099b)
Closesscylladb/scylladb#27406
This PR introduces two key improvements to the robustness and resource management of vector search:
Proper Abort on CQL Timeout: Previously, when a CQL query involving a vector search timed out
, the underlying ANN query to the vector store was not aborted and would continue to run. This has been fixed by ensuring the abort source is correctly signaled, terminating the ANN request when its parent CQL query expires and preventing unnecessary resource consumption.
Faster Failure Detection: The connection and keep-alive timeouts for vector store nodes were excessively long (2 and 11 minutes, respectively), causing significant delays in detecting and recovering from unreachable nodes. These timeouts are now aligned with the request_timeout_in_ms setting, allowing for much faster failure detection and improving high availability by failing over from unresponsive nodes more quickly.
Fixes: SCYLLADB-76
This issue affects the 2025.4 branch, where similar HA recovery delays have been observed.
- (cherry picked from commit b6afacfc1e)
- (cherry picked from commit 086c6992f5)
Parent PR: #27377Closesscylladb/scylladb#27391
* github.com:scylladb/scylladb:
vector_search: Fix ANN query abort on CQL timeout
vector_search: Reduce connection and keep-alive timeouts
When a CQL vector search request timed out, the underlying ANN query was
not aborted and continued to run. This happened because the abort source
was not being signaled upon request expiration.
This commit ensures the ANN query is aborted when the CQL request times out
preventing unnecessary resource consumption.
The connection timeout was 2 minutes and the keep-alive
timeout was 11 minutes. If a vector store node became unreachable, these
long timeouts caused significant delays before the system could recover,
negatively impacting high availability.
This change aligns both timeouts with the `request_timeout`
configuration, which defaults to 10 seconds. This allows for much
faster failure detection and recovery, ensuring that unresponsive nodes
are failed over from more quickly.
Consider this:
1) n1 is the topology coordinator
2) n1 schedules and executes a tablet repair with session id s1 for a
tablet on n3 an n4.
3) n3 and n4 take and store the in _rs._repair_compaction_locks[s1]
4) n1 steps down before it executes
locator::tablet_transition_stage::end_repair
5) n2 becomes the new topology coordinator
6) n2 runs locator::tablet_transition_stage::repair again
7) n3 and n4 try to take the lock again and hangs since the lock is
already taken.
To avoid the deadlock, we can throw in step 7 so that n2 will
proceed to end_repair stage and release the lock. After that, the
scheduler could schedule the tablet repair request again.
Fixes#26346Closesscylladb/scylladb#27163
(cherry picked from commit da5cc13e97)
Closesscylladb/scylladb#27337
This commit fixes the information about object storage:
- Object storage configuration is no longer marked as experimental.
- Redundant information has been removed from the description.
- Information related to object storage for SStabels has been removed
as the feature is not working.
Fixes https://github.com/scylladb/scylladb/issues/26985Closesscylladb/scylladb#26987
(cherry picked from commit 724dc1e582)
Closesscylladb/scylladb#27211
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.
Fixes https://github.com/scylladb/scylladb/issues/27141
- (cherry picked from commit 9f97c376f1)
- (cherry picked from commit 5dcdaa6f66)
Parent PR: #27140Closesscylladb/scylladb#27276
* https://github.com/scylladb/scylladb:
test: test that expired erm that held for too long triggers notification
token_metadata: fix notification about expiring erm held for to long
We currently ignore the `_excluded` field in `node::clone()` and the verbose
formatter of `locator::node`. The first one is a bug that can have
unpredictable consequences on the system. The second one can be a minor
inconvenience during debugging.
We fix both places in this PR.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72
This PR is a bugfix that should be backported to all supported branches.
- (cherry picked from commit 4160ae94c1)
- (cherry picked from commit 287c9eea65)
Parent PR: #27265Closesscylladb/scylladb#27291
* https://github.com/scylladb/scylladb:
locator/node: include _excluded in verbose formatter
locator/node: preserve _excluded in clone()
Before this series, Alternator's CreateTable operation defaults to creating a table replicated with vnodes, not tablets. The reasons for this default included missing support for LWT, Materialized Views, Alternator TTL and Alternator Streams if tablets are used. But today, all of these (except the still-experimental Alternator Streams) are now fully available with tablets, so we are finally ready to switch Alternator to use tablets by default in new tables.
We will use the same configuration parameter that CQL uses, tablets_mode_for_new_keyspaces, to determine whether new keyspaces use tablets by default. If set to `enabled`, tablets are used by default on new tables. If set to `disabled`, tablets will not be used by default (i.e., vnodes will be used, as before). A third value, `enforced` is similar to `enabled` but forbids overriding the default to vnodes when creating a table.
As before, the user can set a tag during the CreateTable operation to override the default choice of tablets or vnodes (unless in `enforced` mode). This tag is now named `system:initial_tablets` - whereas before this patch it was called `experimental:initial_tablets`. The rules stay the same as with the earlier, experimental:initial_tablets tag: when supplied with a numeric value, the table will use tablets. When supplied with something else (like a string "none"), the table will use vnodes.
Fixes https://github.com/scylladb/scylladb/issues/22463
Backport to 2025.4, it's important not to delay phasing out vnodes.
- (cherry picked from commit 403068cb3d)
- (cherry picked from commit af00b59930)
- (cherry picked from commit 376a2f2109)
- (cherry picked from commit 35216d2f01)
- (cherry picked from commit 7466325028)
- (cherry picked from commit c7de7e76f4)
- (cherry picked from commit 63897370cb)
- (cherry picked from commit 274d0b6d62)
- (cherry picked from commit 345747775b)
- (cherry picked from commit a659698c6d)
- (cherry picked from commit eeb3a40afb)
- (cherry picked from commit b34f28dae2)
- (cherry picked from commit 25439127c8)
- (cherry picked from commit c03081eb12)
- (cherry picked from commit 65ed678109)
Parent PR: #26836Closesscylladb/scylladb#26949
* github.com:scylladb/scylladb:
test/cluster: modify test to not fail on 2025.4 branch
Fix backport conflicts
test,alternator: use 3-rack clusters in tests
alternator: improve error in tablets_mode_for_new_keyspaces=enforced
config: make tablets_mode_for_new_keyspaces live-updatable
alternator: improve comment about non-hidden system tags
alternator: Fix test_ttl_expiration_streams()
alternator: Fix test_scan_paging_missing_limit()
alternator: Don't require vnodes for TTL tests
alternator: Remove obsolete test from test_table.py
alternator: Fix tag name to request vnodes
alternator: Fix test name clash in test_tablets.py
alternator: test_tablets.py handles new policy reg. tablets
alternator: Update doc regarding tablets support
alternator: Support `tablets_mode_for_new_keyspaces` config flag
Fix incorrect hint for tablets_mode_for_new_keyspaces
Fix comment for tablets_mode_for_new_keyspaces
Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability.
With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead.
In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments.
Fixes https://github.com/scylladb/scylladb/issues/26311
This patch needs to be backported to 2025.4.
- (cherry picked from commit 6d853c8f11)
- (cherry picked from commit 08974e1d50)
- (cherry picked from commit eb04af5020)
- (cherry picked from commit 24d69b4005)
- (cherry picked from commit fb8cbf1615)
- (cherry picked from commit fe9581f54c)
Parent PR: #26897Closesscylladb/scylladb#27266
* github.com:scylladb/scylladb:
docs/dev/view-building-coordinator: update the docs after recent changes
db/view/view_building: send coordinator's term in the RPC
db/view/view_building_state: replace task's state with `aborted` flag
db/view/view_building_coordinator: batch finished tasks reporting
db/view/view_building_worker: change internal implementation
db/view/view_building_coordinator: change `work_on_tasks` RPC return type
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.
(cherry picked from commit 4160ae94c1)
The purpose of the test
cluster/test_alternator::test_alternator_ttl_scheduling_group
Is to verify that during TTL expiration scans and deletions, all of the
CPU is used in the "streaming" scheduling group, not in the statement
scheduling group ("sl:default") as we had in the past due to bugs.
It appears that in branch 2025.4 we have a new bug - which doesn't exist
in master - that causes some tablets-related work which I couldn't
identify to be done in sl:default, and cause this test to fail.
The simple fix is to sleep for 5 seconds after writing the data, and
it seems that by that time, the sl:default work is done.
This change doesn't make the Alternator TTL test any weaker, so we
need to make this change to allow Alternator to go forward.
Sadly, it does mean that the only test we have for this apparent
bug (which has nothing to do with Alternator) will be gone.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
With tablets enabled, we can't create an Alternator table on a three-
node cluster with a single rack, since Scylla refuses RF=3 with just
one rack and we get the error:
An error occurred (InternalServerError) when calling the CreateTable
operation: ... Replication factor 3 exceeds the number of racks (1) in
dc datacenter1
So in test/cluster/test_alternator.py we need to use the incantation
"auto_rack_dc='dc1'" every time that we create a three-node cluster.
Before this patch, several tests in test/cluster/test_alternator.py
failed on this error, with this patch all of them pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 65ed678109)
When in tablets_mode_for_new_keyspaces=enforced mode, Alternator is
supposed to fail when CreateTable asks explicitly for vnodes. Before
this patch, this error was an ugly "Internal Server Error" (an
exception thrown from deep inside the implementation), this patch
checks for this case in the right place, to generate a proper
ValidationException with a proper error message.
We also enable the test test_tablets_tag_vs_config which should have
caught this error, but didn't because it was marked xfail because
tablets_mode_for_new_keyspaces had not been live-updatable. Now that
it is, we can enable the test. I also improved the test to be slightly
faster (no need to change the configuration so many times) and also
check the ordinary case - where the schema doesn't choose neither
vnodes nor tablets explicitly and we should just use the default.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit c03081eb12)
We have a configuration option "tablets_mode_for_new_keyspaces" which
determines whether new keyspaces should use tablets or vnodes.
For some reason, this configuration parameter was never marked live-
updatable, so in this patch we add flag. No other changes are needed -
the existing code that uses this flag always uses it through the
up-to-date configuration.
In the previous patches we start to honor tablets_mode_for_new_keyspaces
also in Alternator CreateTable, and we wanted to test this but couldn't
do this in test/alternator because the option was not live-updatable.
Now that it will be, we'll be able to test this feature in
test/alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 25439127c8)
The previous patches added a somewhat misleading comment in front of
system:initial_tablets, which this patch improves.
That tag is NOT where Alternator "stores" table properties like the
existing comment claimed. In fact, the whole point is that it's the
opposite - Alternator never writes to this tag - it's a user-writable
tag which Alternator *reads*, to configure the new table. And this is
why it obviously can't be hidden from the user.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit b34f28dae2)
With tablets, the test begun failing. The failure was correlated with
the number of initial tablets, which when kept at default, equals
4 tablets per shard in release build and 2 tablets per shard in dev
build.
In this patch we split the test into two - one with a more data in
the table to check the original purpose of this test - that Scan
doesn't return the entire table in one page if "Limit" is missing.
The other test reproduces issue #10327 - that when the table is
small, Scan's page size isn't strictly limited to 1MB as it is in
DynamoDB.
Experimentally, 8000 KB of data (compared to 6000 KB before this patch)
is enough when we have up to 4 initial tablets per shard (so 8 initial
tablets on a two-shard node as we typically run in tests).
Original patch by Piotr Szymaniak <piotr.szymaniak@scylladb.com>
modified by Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit a659698c6d)
Since #23662 Alternator supports TTL with tablets too. Let's clear some
leftovers causing Alternator to test TTL with vnodes instead of with
what is default for Alternator (tablets or vnodes).
(cherry picked from commit 345747775b)
Since Alternator is capable of runnng with tablets according to the
flag in config, remove the obsolete test that is making sure
that Alternator runs with vnodes.
(cherry picked from commit 274d0b6d62)
The tag was lately renamed from `experimental:initial_tablets` to
`system::initial_tablets`. This commit fixes both the tests as well as
the exceptions sent to the user instructing how to create table with
vnodes.
(cherry picked from commit 63897370cb)
Adjust the tests so they are in-line with the config flag
'tablets_mode_for_new_keyspaces` that the Alternator learned to honour.
(cherry picked from commit 7466325028)
Reflect honouring by Alternator the value of the config flag
`tablets_mode_for_new_keyspaces`, as well as renaming of the tag
`experimental:initial_tablets` into `system:initial_tablets`.
(cherry picked from commit 35216d2f01)
Until now, tablets in Alternator were experimental feature enabled only
when a TAG "experimental:initial_tablets" was present when creating a
table and associated with a numeric value.
After this patch, Alternator honours the value of
`tablets_mode_for_new_keyspaces` config flag.
Each table can be overriden to use tablets or not by supplying a new TAG
"system:initial_tablets". The rules stay the same as with the earlier,
experimental tag: when supplied with a numeric value, the table will use
tablets (as long as they are supported). When supplied with something
else (like a string "none"), the table will use vnodes, provided that
tablets are not `enforced` by the config flag.
Fixes#22463
(cherry picked from commit 376a2f2109)
The comment was not listing all the 3 possible values correctly,
despite an explanation just below covers all 3 values.
(cherry picked from commit 403068cb3d)
To avoid case when an old coordinator (which hasn't been stopped yet)
dictates what should be done, add raft term to the `work_on_view_building_tasks`
RPC.
The worker needs to check if the term matches the current term from raft
server, and deny the request when the term is bad.
(cherry picked from commit fb8cbf1615)
After previous commits, we can drop entire task's state and replace it
with single boolean flag, which determines if a task was aborted.
Once a task was aborted, it cannot get resurrected to a normal state.
(cherry picked from commit 24d69b4005)
In previous implementation to execute view building tasks, the
coordinator needed to firstly set their states to `STARTED`
and then it needed to remove them before it could start the next ones.
This logic required a lot of group0 commits, especially in large
clusters with higher number of nodes and big tablet count.
After previous commit to the view building worker, the coordinator
can start view building tasks without setting the `STARTED` state
and deleting finished tasks.
This patch adjusts the coordinator to save finished tasks locally,
so it can continue to execute next ones and the finished tasks are
periodically removed from the group0 by `finished_task_gc_fiber()`.
(cherry picked from commit eb04af5020)
Commit 6e4803a750 broke notification about expired erms held for too
long since it resets the tracker without calling its destructor (where
notification is triggered). Fix assign operator to call destructor.
(cherry picked from commit 9f97c376f1)
When we delete a table in alternator, the schema change is performed on shard 0.
However, we actually use the storage_proxy from the shard that is handling the
delete_table command. This can lead to problems because some information is
stored only on shard 0 and using storage_proxy from another shard may make
us miss it.
In this patch we fix this by using the storage_proxy from shard 0 instead.
Fixes https://github.com/scylladb/scylladb/issues/27223Closesscylladb/scylladb#27224
(cherry picked from commit 3c376d1b64)
Closesscylladb/scylladb#27260
This commit doesn't change the logic behind the view building worker but
it changes how the worker is executing view building tasks.
Previously, the worker had a state only on shard0 and it was reacting to
changes in group0 state. When it noticed some tasks were moved to
`STARTED` state, the worker was creating a batch for it on the shard0
state.
The RPC call was used only to start the batch and to get its result.
Now, the main logic of batch management was moved to the RPC call
handler.
The worker has a local state on each shard and the state
contains:
- unique ptr to the batch
- set of completed tasks
- information for which views the base table was flushed
So currently, each batch lives on a shard where it has its work to do
exclusively. This eliminates a need to do a synchronization between
shard0 and work shard, which was a painful point in previous
implementation.
The worker still reacts to changes in group0 view building state, but
currently it's only used to observe whether any view building tasks was
aborted by setting `ABORTED` state.
To prepare for further changes to drop the view building task state,
the worker ignores `IDLE` and `STARTED` states completely.
(cherry picked from commit 08974e1d50)