Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.
If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause the data
resurrection.
If any batch replay failed, we cannot update repair_time as we risk the
data resurrection.
If replay of any batch needs to be retried, run the whole repair but
fail at the very end, so that the repair_time for it won't be updated.
Currently, we skip batch replay if less than batch_log_timeout passed
from the moment the batch was written. batch_log_timeout value can
be configured. If it is large, it won't be replayed for a long time.
If the tombstone will be GC'd before the batch is replayed, then we
risk the data resurrection.
To ensure safety we can skip only the batches that won't be GC'd.
In this patch we skip replay of the batches for which:
now() < written_at + min(timeout + propagation_delay)
repair_time is set as a start of batchlog replay, so at the moment
of the check we will have:
repair_time <= now()
So we know that:
repair_time < written_at + propagation_delay
With this condition we are sure that GC won't happen.
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
batchlog_manager::replay_all_failed_batches skips batches that have
unknown or incorrect version. Next round will process these batches
again.
Such batches will probably be skipped everytime, so there is no point
in keeping them. Even if at some point the version becomes correct,
we should not replay the batch - it might be old and this may lead
to data resurrection.
The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the `tablet_id` field (`_tablet`), which means the guard can no longer correctly protect ongoing operations from tablet migrations.
The problem is specific to LWT, since `tablet_metadata_guard` is used mostly for heavy topology operations, which exclude with split and merge. The guard was used for LWT as an optimization -- we don't need to block topology operations or migrations of unrelated tablets. In the future, we could use the guard for regular reads/writes as well (via the `token_metadata_guard` wrapper).
Fixes [scylladb/scylladb#26437](https://github.com/scylladb/scylladb/issues/26437)
backports: need to backport to 2025.4 since the bug is relevant to LWT over tablets.
Closesscylladb/scylladb#26619
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_tablets_merge_waits_for_lwt
test.py: add universalasync_typed_wrap
tablet_metadata_guard: fix split/merge handling
tablet_metadata_guard: add debug logs
paxos_state: shards_for_writes: improve the error message
storage_service: barrier_and_drain – change log level to info
topology_coordinator: fix log message
The patch e34deb72f9 (repair: Rename incremental mode name)
missed one place that references the removed regular mode name.
Fixes#26503Closesscylladb/scylladb#26660
Group0 tombstone GC considers only the current group 0 members
while computing the group 0 tombstone GC time. It's not enough
because in the Raft-based recovery procedure, there can be nodes
that haven't joined the current group 0 yet, but they have belonged
to a different group 0 and thus have a non-empty group 0 state ID.
The current code can cause a data resurrection in group 0 tables.
We fix this issue in this PR and add a regression test.
This issue was uncovered by `test_raft_recovery_entry_loss`, which
became flaky recently. We skipped this test for now. We will unskip
it in a following PR because it's skipped only on master, while we
want to backport this PR.
Fixes#26534
This PR contains an important bugfix, so we should backport it
to all branches with the Raft-based recovery procedure (2025.2
and newer).
Closesscylladb/scylladb#26612
* github.com:scylladb/scylladb:
test: test group0 tombstone GC in the Raft-based recovery procedure
group0_state_id_handler: remove unused group0_server_accessor
group0_state_id_handler: consider state IDs of all non-ignored topology members
The universalasync.wrap function doesn't preserve the
type information, which confuses the VS Code Pylance
plugin and makes code navigation hard.
In this commit we fix the problem by adding a typed
wrapped around universalasync.wrap.
Fixes: scylladb/scylladb#26639
The guard should stop refreshing the ERM when the number of tablets
changes. Tablet splits or merges invalidate the tablet_id field
(_tablet), which means the guard can no longer correctly protect
ongoing operations from tablet migrations.
Fixesscylladb/scylladb#26437
Debugging global barrier issues is difficult without these logs.
Since barriers do not occur frequently, increasing the log level should not produce excessive output.
Among other things, the merge includes the patch "http: add "Connection:
close" header to final server response.". This Fixes#26298: A missing
response header meant that a test's client code sometimes didn't notice
that the server closed the connection (since the client didn't need to
use the connection again), which made one test flaky.
* seastar bd74b3fa...63900e03 (6):
> Merge 'Rework output_stream::slow_write()' from Pavel Emelyanov
output_stream: Fix indentation of the slow_write() method
output_stream: Remove pointless else
output_stream: Replace std::swap with std::exchange
output_stream: Unify some code-paths of slow_write()
> Merge 'Deprecate in/out streams move-assignment operator' from Pavel Emelyanov
iostream: Deprecate input/output stream default constructor and move-assignment operator
test: Sub-split test-cases
test: Don't reuse output_stream in file demo
test: Keep input_/output_stream as optional
util: Construct file_data_source in with_file_input_stream()
websocket: Construct in/out in initializer list
rpc: Wrap socket and buffers
> scripts/perftune.py: detect corrupted NUMA topology information
> Merge 'memory, smp: support more than 256 shards' from Avi Kivity
reactor, smp: allocate smp queues across all shards
memory: increase maximum shard count
memory: make cpu_id_shift and related mask dynamic
resource, memory: move memory limit calculation to memory.cc
resource: don't error if --overprovisioned and asking for more vcpus than available
> Merge 'Update perf_test text output, make columns selectable' from Travis Downs
perf_tests: enhance text output
perf_test_tests: add some check_output tests
`CreateTable` request creates GSI/LSI together with the base table,
the base table is empty and we don't need to actually build the view.
In tablet-based keyspaces we can just don't create view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).
Vnode-based keyspaces will use the "old" logic of view builder, which
will process the view and mark it as built.
Fixesscylladb/scylladb#26615
This fix should be backported to 2025.4.
Closesscylladb/scylladb#26657
* github.com:scylladb/scylladb:
test/alternator/test_tablets: add test for GSI backfill with tablets
test/alternator/test_tablets: add reproducer for GSI with tablets
alternator/executor: instantly mark view as built when creating it with base table
Other than patching Scylla sinks to implement new data_sink_impl::put(std::span<temporary_buffer>) overload, the PR changes transport write_response() method to stop using output_stream::write(scattered_message) because it's also gone.
Using newer seastar API, no need to backport
Closesscylladb/scylladb#26592
* github.com:scylladb/scylladb:
code: Fix indentation after previous patch
code: Switch to seastar API level 9
transport: Open-code invoke_with_counting into counting_data_sink::put
transport: Don't use scattered_message
utils: Implement memory_data_sink::put(net::packet)
The test should pass without the fix for scylladb/scylladb#26615,
because the `executor::updata_table()` uses
`service::prepare_new_view_announcement()`, which creates view building
tasks for the view.
But it's better to add this test.
Rewrite wait_for first_completed to return only first completed task guarantee
of awaiting(disappearing) all cancelled and finished tasks
Use wait_for_first_completed to avoid false pass tests in the future and issues
like #26148
Use gather_safely to await tasks and removing warning that coroutine was
not awaited
Closesscylladb/scylladb#26435
There are some environment which has corrupted NUMA topology
information, such as some instance types on AWS EC2 with specific Linux
kernel images.
On such environment, we cannot get HW information correctly from hwloc,
so we cannot proceed optimization on perftune.
To avoid causing script error, check NUMA topology information and skip
running perftune if the information corrupted.
Related scylladb/seastar#2925Closesscylladb/scylladb#26344
`CreateTable` request creates GSI/LSI together with the base table,
the base table is empty and we don't need to actually build the view.
In tablet-based keyspaces we can just don't create view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).
Vnode-based keyspaces will use the "old" logic of view builder, which
will process the view and mark it as built.
Fixesscylladb/scylladb#26615
`shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`.
A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`.
Fixes scylladb/scylladb#26355
backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets
Closesscylladb/scylladb#26408
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_lwt_shutdown
storage_proxy: wait for write handler destruction
storage_proxy: coroutinize cancel_write_handlers
storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler
This patch removes the dependence of vector search module
on the cql3 module by moving the contents of cql3/type_json.hh
to types/json_utils.hh and removing the usage of cql3 primary_key
object in vector_store_client. We also make the needed adjustments
to files that were previously using the afformentioned type_json.hh
file.
This fixes the circular dependency cql3 <-> vector_search.
Closesscylladb/scylladb#26482
The recursive call to alter_system_schema() was missing the await
keyword, which meant the coroutine was never actually executed and
the test wasn't doing what it was supposed to do.
Not backporting: Test fix only.
Closesscylladb/scylladb#26623
In clang 21, the -fextend-variable-liveness option was made
default [1] with -Og. It helps reduce "optimized out" problems while
debugging.
However, it conflicts [2] with coroutines.
To prevent problems during the upgrade to Clang 21, disable the option.
[1] 36af7345df
[2] https://github.com/llvm/llvm-project/issues/163007Closesscylladb/scylladb#26573
Apply two main changes to the s3_client error handling
1. Add a loop to s3_client's `make_request` for the case whe the retry strategy will not help since the request itself have to be updated. For example, authentication token expiration or timestamp on the request header
2. Refine the way we handle exceptions in the `chunked_download_source` background fiber, now we carry the original `exception_ptr` and also we wrap EVERY exception in `filler_exception` to prevent retry strategy trying to retry the request altogether
Fixes: https://github.com/scylladb/scylladb/issues/26483
Should be ported back to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions
Closesscylladb/scylladb#26527
* github.com:scylladb/scylladb:
s3_client: tune logging level
s3_client: add logging
s3_client: improve exception handling for chunked downloads
s3_client: fix indentation
s3_client: add max for client level retries
s3_client: remove `s3_retry_strategy`
s3_client: support high-level request retries
s3_client: just reformat `make_request`
s3_client: unify `make_request` implementation
Load-and-stream is broken when running concurrently to the finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements # 1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes https://github.com/scylladb/scylladb/issues/26455.
Closesscylladb/scylladb#26456
* github.com:scylladb/scylladb:
test: Add reproducer for l-a-s and split synchronization issue
sstables_loader: Synchronize tablet split and load-and-stream
In 380f243986 we added support for rack
lists in replication options. Drivers which are not prepared to parse
that (as of now, all of them), will not create metadata object for
that keyspace. This breaks, for example, the "copy to/from" cqlsh
command. Potentially other things too.
To fix that, keep the "replication" column in the old format, and
store numeric RF there, which corresponds to the number of
replicas. Accurate options in the new format are put in
"replication_v2".
We set replication_v2 in the schema only when it differs from the old
"replication" so that the new column is not set during upgrade,
otherwise downgrade would fail. Partition tombstone is added to ensure
that pre-alter replication_v2 value is deleted on alters which change
replication to a value which is the same as the post-alter
"replication" value.
Fixes#26415Closesscylladb/scylladb#26429
Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why it is the case please see scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow it in that case.
Fixes: scylladb/scylladb#26273
Marked for backport to 2025.4 as MVs are getting un-experimentaled there.
Closesscylladb/scylladb#26278
* github.com:scylladb/scylladb:
test: mv: add a test for tablet merge
tablet_allocator, tests: remove allow_tablet_merge_with_views injection
tablet_allocator: allow merges in base tables if rf-rack-valid=true
Load-and-stream is broken when running concurrently to the
finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements #1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes#26455.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, the data returned by `database::get_tables_metadata()` and
`database::get_token_metadata()` may not be consistent. Specifically,
the tables metadata may contain some tablet-based tables before their
tablet maps appear in the token metadata. This is going to be fixed
after issue scylladb/scylladb#24414 is closed, but for the time being
work around it by accessing the token metadata via
`table`->effective_replication_map() - that token metadata is guaranteed
to have the tablet map of the `table`.
Fixes: scylladb/scylladb#26403Closesscylladb/scylladb#26588
shared_ptr<abstract_write_response_handler> instances are captured in
the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result,
an abstract_write_response_handler object may outlive its removal from
the _response_handlers map. We use write_handler_destroy_promise to
wait for such pending instances in cancel_write_handlers() and
cancel_all_write_response_handlers() to prevent use-after-free.
A better long-term solution might be to replace shared_ptr with
unique_ptr for abstract_write_response_handler and use a separate gate
to track the lmutate/rmutate lambdas. We do not actually need to wait
for these lambdas to finish before sending a timeout or error response
to the client, as we currently do in ~abstract_write_response_handler.
Fixesscylladb/scylladb#26355
The cancel_write_handlers() method was assumed to be called in a thread
context, likely because it was first used from gossiper events, where a
thread context already existed. Later, this method was reused in
abort_view_writes() and abort_batch_writes(), where threads are created
on the fly and appear redundant.
The drain_on_shutdown() method also used a thread, justified by some
"delicate lifetime issues", but it is unclear what that actually means.
It seems that a straightforward co_await should work just fine.
A strong pointer was held for the duration of thread::yield(),
preventing abstract_write_response_handler destruction and possibly
delaying the sending of timeout or error responses to the client.
This commit removes the strong pointer. Instead, we compute the
next iterator before calling timeout_cb(), so if the handler is
destroyed inside timeout_cb(), we already have a valid next iterator.
UserIdentity is a map of two fields in GetRecords responses, which
always has the same value. It may be missing, or contain a constant
object with value `{"type": "Service", "principalId":
"dynamodb.amazonaws.com"}`. Currently, the latter is set only for
`REMOVE`s triggered by TTL.
This commit introduces two new CDC operation types: `service_row_delete`
and `service_partition_delete`, emitted in place of `row_delete` and
`partition_delete`. Alternator Streams treats them as regular `REMOVE`s,
but in addition adds the `userIdentity` field to the record.
This change may break existing Scylla libraries for reading raw CDC
tables, but we doubt that anybody has this use case.
Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26121
Fixes https://github.com/scylladb/scylladb/issues/11523Closesscylladb/scylladb#26460
This patch implements the changes required by the Vector Store authorization, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/107085899/Vector+Store+Authentication+And+Authorization+To+ScyllaDB, that is:
- adding a new permission VECTOR_SEARCH_INDEXING, grantable only on ALL KEYSPACES
- allowing users with that permission to perform SELECT queries, but only on tables with a vector index
- increasing the number of scheduling groups by one to allow users to create a service level for a vector store user
- adjusting the tests and documentation
These changes are needed, as the vector indexes are managed by the external service, Vector Store, which needs to read the tables to create the indexes in its memory. We would like to limit the privileges of that service to a minimum to maintain the principle of least privilege, therefore a new permission, one that allows the SELECTs conditional on the existence of a vector_index on the table.
Fixes: VECTOR-201
Backport reasoning:
Backport to 2025.4 required as this can make upgrading clusters more difficult if we add it in 2026.1. As for now Scylla Cloud requires version 2025.4 to enable vector search and permission is set by orchestrator so there is no chance that someone will try to add this permission during upgrade. In 2026.1 it will be more difficult.
Closesscylladb/scylladb#25976
* github.com:scylladb/scylladb:
docs: adjust docs for VS auth changes
test: add tests for VECTOR_SEARCH_INDEXING permission
cql: allow VECTOR_SEARCH_INDEXING users to select
auth: add possibilty to check for any permission in set
auth: add a new permission VECTOR_SEARCH_INDEXING
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.
It never worked as intended, so the credentials handling is moving to the same place where we handle time skew, since we have to reauthenticate the request