In Alternator's HTTP API, response headers can dominate bandwidth for
small payloads. The Server, Date, and Content-Type headers were sent on
every response but many clients never use them.
This patch introduces three Alternator config options:
- alternator_http_response_server_header,
- alternator_http_response_disable_date_header,
- alternator_http_response_disable_content_type_header,
which allow customizing or suppressing the respective HTTP response
headers. All three options support live update (no restart needed).
The Server header is no longer sent by default; the Date and
Content-Type defaults preserve the existing behavior.
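To make the bandwidth argument concrete, here is a minimal sketch that estimates what fraction of a small response is spent on the status line and headers. The header values and the helper names (`response_head`, `header_overhead`) are illustrative assumptions, not taken from Alternator.

```python
def response_head(headers: dict[str, str], status_line: str = "HTTP/1.1 200 OK") -> bytes:
    """Serialize the status line and headers roughly as they appear on the wire."""
    lines = [status_line] + [f"{name}: {value}" for name, value in headers.items()]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("latin-1")


def header_overhead(headers: dict[str, str], body: bytes) -> float:
    """Fraction of the response bytes spent on the status line and headers."""
    head = len(response_head(headers))
    return head / (head + len(body))


# Hypothetical header values, for illustration only.
full = {
    "Content-Length": "2",
    "Content-Type": "application/x-amz-json-1.0",
    "Date": "Mon, 01 Jan 2024 00:00:00 GMT",
    "Server": "ExampleServer/1.0",
}
trimmed = {"Content-Length": "2"}
body = b"{}"

print(f"{header_overhead(full, body):.0%} -> {header_overhead(trimmed, body):.0%}")
```

For a two-byte body, nearly all bytes are headers either way, but the absolute number of bytes per response drops noticeably once the optional headers are suppressed.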
The Server and Date header suppression uses Seastar's
set_server_header() and set_generate_date_header() APIs added in
https://github.com/scylladb/seastar/pull/3217. This patch also
fixes deprecation warnings from older Seastar HTTP APIs.
Tests are in test/alternator/test_http_headers.py.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-70
Closes scylladb/scylladb#28288
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce two parents, making them incompatible with
Alternator Streams. This series blocks tablet merges when streams are
active on a tablet table.
For CreateTable, a freshly created table has no pending merges, so
streams are enabled immediately with tablet merges blocked.
For UpdateTable on an existing table, stream enablement is deferred:
the user's intent is stored via `enable_requested`, tablet merges are
blocked (new merge decisions are suppressed and any active merge
decision is revoked), and the topology coordinator finalizes enablement
once no in-flight merges remain.
The topology coordinator is woken promptly on error injection release
and tablet split completion, reducing finalization latency from ~60s
to seconds.
`test_parent_children_merge` is marked xfail (merges are now blocked),
and downward (merge) steps are removed from `test_parent_filtering` and
`test_get_records_with_alternating_tablets_count`.
Not addressed here: using a topology request to preempt long-running
operations like repair (tracked in SCYLLADB-1304).
Refs SCYLLADB-461
Closes scylladb/scylladb#29224
* github.com:scylladb/scylladb:
topology: Wake coordinator promptly for stream enablement lifecycle
test/cluster: Test deferred stream enablement on tablet tables
alternator/streams: Block tablet merges when Alternator Streams are enabled
The topology coordinator sleeps on a condition variable between
iterations. Several events relevant to Alternator stream enablement
did not wake it, causing delays of up to 60s (the periodic load
stats refresh interval) at each step:
1. Error injection release: when a test disables the
delay_cdc_stream_finalization injection, the coordinator was
not notified. Add an on_disable callback mechanism to the error
injection framework (register_on_disable / unregister_on_disable)
so subsystems can react when an injection is released. The
topology coordinator uses this to broadcast its event.
2. Tablet split completion: after all local storage groups for a
table finish splitting, split_ready_seq_number is set but the
coordinator only discovered this via the periodic stats refresh.
Add an on_tablet_split_ready callback to topology_state_machine
that the coordinator sets to trigger_load_stats_refresh(). The
split monitor in storage_service calls it when all compaction
groups are split-ready, giving the coordinator fresh stats
immediately so it can finalize the resize.
These changes reduce test_deferred_stream_enablement_on_tablets
from ~120s to ~13s and fix a production issue where Alternator
stream enablement could be delayed by up to 60s at each step of
the lifecycle (error injection release, split completion).
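The on_disable mechanism described above can be sketched roughly as follows. This is a simplified Python model, not the actual C++ error injection framework: the class names, the handle-based registration, and the callback signature are assumptions made for illustration.

```python
from typing import Callable


class ErrorInjections:
    """Toy model of an injection registry with on-disable callbacks."""

    def __init__(self) -> None:
        self._enabled: set[str] = set()
        self._on_disable: dict[int, Callable[[str], None]] = {}
        self._next_handle = 0

    def enable(self, name: str) -> None:
        self._enabled.add(name)

    def register_on_disable(self, callback: Callable[[str], None]) -> int:
        """Register a callback; the returned handle is used to unregister it."""
        handle = self._next_handle
        self._next_handle += 1
        self._on_disable[handle] = callback
        return handle

    def unregister_on_disable(self, handle: int) -> None:
        self._on_disable.pop(handle, None)

    def disable(self, name: str) -> None:
        """Release an injection and notify every registered subscriber."""
        if name in self._enabled:
            self._enabled.remove(name)
            for callback in list(self._on_disable.values()):
                callback(name)


class Coordinator:
    """Stand-in for the topology coordinator; records each wakeup."""

    def __init__(self) -> None:
        self.wakeups: list[str] = []

    def on_injection_disabled(self, name: str) -> None:
        # The real coordinator would broadcast its condition variable here.
        self.wakeups.append(name)


injections = ErrorInjections()
coordinator = Coordinator()
handle = injections.register_on_disable(coordinator.on_injection_disabled)
injections.enable("delay_cdc_stream_finalization")
injections.disable("delay_cdc_stream_finalization")
```

Without the callback, the coordinator in this model would only notice the release on its next periodic iteration.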
Async cluster test exercising the deferred enablement lifecycle:
ENABLING -> ENABLED -> disabled, verifying tablet merge blocking
and unblocking at each stage. Uses delay_cdc_stream_finalization
error injection and CQL ALTER TABLE with tablet count constraints.
Also adds tablet scheduler config to test_config.yaml (fast refresh
interval, scale factor 1) for reliable tablet count changes.
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce 2 parents, which is incompatible. When streams
are requested on a tablet table, block tablet merges via
tablet_merge_blocked (the allocator suppresses new merge decisions and
revokes any active merge decision).
add_stream_options() sets tablet_merge_blocked=true alongside
enabled=true, so CreateTable needs no special handling — the flag
is inert on vnode tables and immediately effective on tablet tables.
For UpdateTable, CDC enablement is deferred: store the user's intent
via enable_requested, and let the topology coordinator finalize
enablement once no in-progress merges remain. A new helper,
defer_enabling_streams_block_tablet_merges(), amends the CDC options
to this deferred state.
Disabling streams clears all flags, immediately re-allowing merges.
The tablet allocator accesses the merge-blocked flag through a
schema::tablet_merges_forbidden() accessor rather than reaching into
CDC options directly.
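The flag transitions described above can be modelled with a small sketch. The flag names (`enabled`, `enable_requested`, `tablet_merge_blocked`) come from this message; the `StreamOptions` class, the Python function shapes, and the assumption that finalization clears `enable_requested` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamOptions:
    enabled: bool = False
    enable_requested: bool = False
    tablet_merge_blocked: bool = False


def create_table_with_streams() -> StreamOptions:
    """CreateTable: no pending merges, so streams are enabled immediately."""
    return StreamOptions(enabled=True, tablet_merge_blocked=True)


def defer_enabling_streams_block_tablet_merges() -> StreamOptions:
    """UpdateTable: record the intent and block merges, but do not enable yet."""
    return StreamOptions(enable_requested=True, tablet_merge_blocked=True)


def finalize_if_ready(opts: StreamOptions, merges_in_progress: int) -> StreamOptions:
    """Coordinator step: finalize once no in-progress merges remain."""
    if opts.enable_requested and merges_in_progress == 0:
        # Assumption: finalization clears the request and keeps merges blocked.
        return StreamOptions(enabled=True, tablet_merge_blocked=True)
    return opts


def disable_streams() -> StreamOptions:
    """Disabling clears all flags, which re-allows merges immediately."""
    return StreamOptions()


def tablet_merges_forbidden(opts: StreamOptions) -> bool:
    """Accessor used by the allocator instead of reading the options directly."""
    return opts.tablet_merge_blocked
```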
Mark test_parent_children_merge as xfail and remove downward
(merge) steps from tablet_multipliers in test_parent_filtering and
test_get_records_with_alternating_tablets_count.
This PR exposes vnodes-to-tablets migrations through the task manager API via a virtual task. This allows users to list, query status, and wait on ongoing migrations through a standard interface, consistent with how other global operations, such as tablet operations and topology requests, are already exposed.
The virtual task exposes all migrations that are currently in progress. Each migrating keyspace appears as a separate task, identified by a deterministic name-based (v3) UUID derived from the keyspace name. Progress is reported as the number of nodes that have switched to tablets vs. the total. The number increases on the forward path and decreases on rollback.
The task is not abortable - rolling back a migration requires a manual procedure.
The `wait` API blocks until the migration either completes (returning `done`) or is rolled back (returning `suspended`).
Example output:
```
$ scylla nodetool tasks list vnodes_to_tablets_migration
task_id                              type                        kind    scope    state   sequence_number keyspace table entity shard start_time end_time
1747b573-6cd6-312d-abb1-9b66c1c2d81f vnodes_to_tablets_migration cluster keyspace running 0               ks                    0
$ scylla nodetool tasks status 1747b573-6cd6-312d-abb1-9b66c1c2d81f
id: 1747b573-6cd6-312d-abb1-9b66c1c2d81f
type: vnodes_to_tablets_migration
kind: cluster
scope: keyspace
state: running
is_abortable: false
start_time:
end_time:
error:
parent_id: none
sequence_number: 0
shard: 0
keyspace: ks
table:
entity:
progress_units: nodes
progress_total: 3
progress_completed: 0
```
Fixes SCYLLADB-1150.
New feature, no backport needed.
Closes scylladb/scylladb#29256
* github.com:scylladb/scylladb:
test: cluster: Verify vnodes-to-tablets migration virtual task
distributed_loader: Link resharding tasks to migration virtual task
distributed_loader: Make table_populator aware of migration rollbacks
service: Add virtual task for vnodes-to-tablets migrations
storage_service: Guard migration status against uninitialized group0
compaction: Add parent_id to table_resharding_compaction_task_impl
storage_service: Add keyspace-level migration status function
storage_service: Replace migration status string with enum
utils: Add UUID::is_name_based()
* seastar 4d268e0e...22a5aa13 (36):
> apps/httpd: replace deprecated reply::done() with write_body()
> missing header(s)
> net: Fix missing throw for runtime_error in create_native_net_device
> tests/io_queue: account for token bucket refill granularity in bandwidth checks
> Merge 'iovec: fix iovec_trim_front infinite loop on zero-length iovecs' from Travis Downs
tests: add regression tests for zero-length iovec handling
iovec: fix iovec_trim_front infinite loop on zero-length iovecs
> util/process: graduate process management API from experimental
> cooking: don't register ready.txt as a build output
> sstring: make make_sstring not static
> Add SparkyLinux to debian list in install-dependencies.sh
> http: allow control over default response headers
> Merge 'chunked_fifo: make cached chunk retention configurable' from Brandon Allard
tests/perf: add chunked_fifo microbenchmarks
chunked_fifo: set the default free chunk retention to 0
chunked_fifo: make free chunk retention configurable
> Merge 'reactor_backend: fix pollable_fd_state_completion reuse in io_uring' from Kefu Chai
tests: add regression test for pollable_fd_state_completion reuse
reactor_backend: use reset() in AIO and epoll poll paths
reactor_backend: fix pollable_fd_state_completion reuse after co_await in io_uring
> Merge 'coroutine: Generator cleanups' from Kefu Chai
coroutine/generator: extract schedule_or_resume helper
coroutine/generator: remove unused next_awaiter classes
coroutine/generator: remove write-only _started field
coroutine/generator: assert on unreachable path in buffered await_resume
coroutine/generator: add elements_of tag and #include <ranges>
coroutine/generator: add empty() to bounded_container concept
> cmake: bump minimum Boost version to 1.79.0
> seastar_test: remove unnecessary headers
> cmake: bump minimum GnuTLS version to 3.7.4
> Merge 'reactor: add get_all_io_queues() method' from Travis Downs
tests: add unit test for reactor::get_all_io_queues()
reactor: add get_all_io_queues() method
reactor: move get_io_queue and try_get_io_queue to .cc file
> http: deprecate reply::done(), remove _response_line dead field
> core: Deprecate scattered_message
> ci: add workflow dispatch to tests workflow
> perf_tests: exit non-zero when -t pattern matches no tests
> Replace duplicate SEGV_MAPERR check in sigsegv_action() with SEGV_ACCERR.
> perf_tests: add total runtime to json output
> Merge 'Relax large allocation error originating from json_list_template' from Robert Bindar
implement move assignment operator for json_list_template
json_list_template copy assignment operator reserves capacity upfront
> perf_tests: add --no-perf-counters option
> Merge 'Fix to_human_readable_value() ability to work with large values' from Pavel Emelyanov
memory: Add compile-time test for value-to-human-readable conversion
memory: Extend list of suffixes to have peta-s
memory: Fix off-by-one in suffix calculation
memory: Mark to_human_readable_value() and others constexpr
> http: Improve writing of response_line() into the output
> Merge 'websocket: add template parameter for text/binary frame mode and implement client-side WebSocket' from wangyuwei
websocket: add template parameter for text/binary frame mode
websocket: impl client side websocket function
> file: Fix checks for file being read-only
> reactor: Make do_dump_task_queue a task_queue method
> Merge 'Implement fully mixed mode for output_stream-s' from Pavel Emelyanov
tests/output_stream: sample type patterns in sanitizer builds
tests/output_stream: extend invariant test to cover mixed write modes
iostream: allow unrestricted mixing of buffered and zero-copy writes
tests/output_stream: remove obsolete ad-hoc splitting tests
tests/output_stream: add invariant-based splitting tests
iostream: rename output_stream::_size to ::_buffer_size
> reactor_backend: replace virtual bool methods with const bool_class members
> resource: Avoid copying CPU vector to break it into groups
> perf_tests: increase overhead column precision to 3 decimal places
> Merge 'Move reactor::fdatasync() into posix_file_impl' from Pavel Emelyanov
reactor: Deprecate fdatasync() method
file: Do fdatasync() right in the posix_file_impl::flush()
file: Propagate aio_fdatasync to posix_file_impl
reactor: Move reactor::fdatasync() code to file.cc
reactor,file: Make full use of file_open_options::durable bit
file: Add file_open_options::durable boolean
file: Account io_stats::fsyncs in posix_file_impl::flush()
reactor: Move _fsyncs counter onto io_stats
> http: Remove connection::write_body()
Fix cdc writing unnecessary entries to its log, for example when Alternator deletes an item that does not actually exist.
Originally @wps0 tackled this issue. This patch is an extension of his work. His work involved adding a `should_skip` function to cdc, which would process a `mutation` object and decide whether the changes in the object should be added to the cdc log or not.
The issue with his approach is that a `mutation` object might contain changes for more than one row. If, for example, the `mutation` object contains two changes, a delete of a non-existing row and a create of a non-existing row, the `should_skip` function will detect changes in the second item and allow the whole `mutation` (BOTH items) to be added. For example, running this (using Python's boto3) on an empty table:
```
with table.batch_writer() as batch:
    batch.put_item({'p': 'p', 'c': 'c0'})
    batch.delete_item(Key={'p': 'p', 'c': 'c1'})
```
will emit two events (a "put" event and a "delete" event), even though the item with `c` set to `c1` does not exist (and thus can't be deleted). Note that both entries in the batch write must use the same partition key; otherwise the upper layer will split them into separate `mutation` objects and the issue will not happen.
The solution is to do similar processing, but consider each change separately from the others. This is tricky to implement due to the way cdc works. When cdc processes a `mutation` object (containing X changes), it emits cdc entries in phases. Phase 1: emit the `preimage` (old state) for each change (if requested). Phase 2: for each change, emit the actual "diff" (update, delete and so on). Phase 3: emit the `postimage` (new state).
We only know whether a change needs to be skipped during phase 2. By that time phase 1 is completed and the preimage for the change has been emitted. At that moment we set a flag that the change (identified by its clustering key value) needs to be skipped: we add the clustering key to an `ignore-rows` set (the `_alternator_clustering_keys_to_ignore` variable) and continue normally. Once all phases finish, we run a `postprocess` phase (the `clean_up_noop_rows` function). It goes through the generated cdc mutations and skips all modifications whose clustering key is in the `ignore-rows` set. After skipping we need a "cleanup" operation: each generated cdc mutation contains an index (incremented by one), and if we skipped some parts the index is no longer consecutive, so we reindex the final changes.
There's a special case worth mentioning: Alternator tables without clustering keys. In that case the `mutation` object passed to cdc can contain exactly one change, since different partition keys are split by upper layers and Alternator will never emit a `mutation` object containing two (or more) changes with the same primary key. Here, when we decide the change is to be skipped, we add an empty `bytes` object to the `ignore-rows` set. When checking the `ignore-rows` set, we only check whether it is empty or not (we don't check for the presence of the empty `bytes` object).
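The postprocess step can be illustrated with a simplified Python model. It is only a sketch of the filter-and-reindex idea: the `LogRow` type, its field names, and the `has_clustering_key` parameter are assumptions, and the real `clean_up_noop_rows` operates on cdc log `mutation` objects rather than plain lists.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LogRow:
    batch_seq_no: int       # index of the row within the generated cdc log mutation
    clustering_key: bytes   # clustering key of the base-table row this entry describes
    operation: str          # e.g. "preimage", "put", "delete", "postimage"


def clean_up_noop_rows(rows, keys_to_ignore, has_clustering_key=True):
    """Drop log rows for ignored clustering keys and renumber what is left."""
    if not keys_to_ignore:
        return list(rows)
    if not has_clustering_key:
        # Only one change is possible here, so a non-empty set means "drop it".
        return []
    kept = [row for row in rows if row.clustering_key not in keys_to_ignore]
    # Reindex so that batch_seq_no is consecutive again after the removals.
    return [replace(row, batch_seq_no=i) for i, row in enumerate(kept)]


rows = [
    LogRow(0, b"c1", "delete"),  # delete of a row that never existed
    LogRow(1, b"c0", "put"),
]
cleaned = clean_up_noop_rows(rows, keys_to_ignore={b"c1"})
```

In this example only the "put" entry for `c0` survives, and it is renumbered to index 0.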
Note: there might be some confusion between this patch and the #28452 patch. Both started from the same error observation and use similar tests for validation, as both are easily triggered by BatchWrite commands (both need the `mutation` object passed to cdc to contain more than one change). This issue, though, is about wrong data written to the cdc log and is fixed in cdc, whereas #28452 is about incorrect parsing of correct cdc data and is fixed on the Alternator side. Note that we need #28452 to truly verify this fix (otherwise we will emit correct cdc entries, but Alternator will parse them incorrectly).
Note: to benefit from / notice this patch you need the `alternator_streams_increased_compatibility` flag turned on.
Note: the rework is quite "broad" and covers a lot of ground: every operation that might result in no change to the database state should be tested. An additional test was added: trying to remove a column from a non-existing item, as well as trying to remove a non-existing column from an existing item.
Fixes: #28368
Fixes: SCYLLADB-1528
Fixes: SCYLLADB-538
Closes scylladb/scylladb#28544
* github.com:scylladb/scylladb:
alternator: remove unnecesary code
alternator: fix Alternator writing unnecesary cdc entries
alternator: add failing tests for Streams
Implements the necessary changes for Streams to work with tablet-based tables.
- add utility functions to `system_keyspace` that help read cdc content from cdc log tables for tablet-based base tables (similar API to the ones for vnodes)
- remove anti-tablet `if` checks, update tests that fail / skip if tablets are selected
- add two tests to extensively test the tablet-based version, especially while manipulating stream count
Fixes #23838
Fixes SCYLLADB-463
Closes scylladb/scylladb#28500
* github.com:scylladb/scylladb:
alternator: add streams with tablets tests
alternator: remove antitablet guards when using Streams
alternator: implement streams for tablets
treewide: add cdc helper functions to system_keyspace
alternator: add system_keyspace reference
The `stream_arn` object holds a full ARN as a `std::string` and two
`std::string_view` fields (`table_name_` and `keyspace_name_`) pointing
into the ARN itself. This prevents the object from being safely copied
(as in that case both `table_name_` and `keyspace_name_` will point into
the original object's ARN). A similar issue might happen with a move when
the ARN contains a string short enough for the small string optimization to
kick in (although in practice this is not possible, as the ARN has
requirements which make its minimal length above 15 characters,
the current limit for small string optimization in the most popular string
libraries).
The patch drops the `std::string_view` objects in favor of integer offsets
and sizes. An offset equal to 0 means the beginning of the ARN string. The API
is preserved: both the `table_name` and `keyspace_name` functions still
return a `std::string_view`, reconstructed on the fly.
Closes scylladb/scylladb#29507
When a table is loaded on startup during a vnodes-to-tablets migration
(forward or rollback), the `table_populator` runs a resharding
compaction.
Set the migration virtual task as parent of the resharding task. This
enables users to easily find all node-local resharding tasks related to
a particular migration.
Make `migration_virtual_task::make_task_id()` public so that the
`distributed_loader` can compute the migration's task ID.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `table_populator` uses a `migrate_to_tablets` flag to distinguish
normal tables from tables under vnodes-to-tablets migration (forward
path), since the two require different resharding.
The next patch will set the parent info of migration-related resharding
compaction tasks so they appear as children of the migration virtual
task. For that, the table populator needs to recognize not only
migrations in the forward path, but rollbacks as well.
Replace the flag with a tri-state `migration_direction` enum (none,
forward, rollback).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add a virtual task that exposes in-progress vnodes-to-tablets migrations
through the task manager API.
The task is synthesized from the current migration state, so completed
migrations are not shown. Progress is reported as the number of nodes
that currently use tablets: it increases on the forward path and
decreases on rollback. For simplicity, per-node storage modes are not
exposed in the task status; callers that need them should use the
migration status REST endpoint.
Unlike regular tasks that use time-based UUIDs, this task uses
deterministic named UUIDs derived from the keyspace names. This keeps
the implementation simple (no need to persist them) and gives each
keyspace a stable task ID. The downside is that the start time of each
task is unknown and repeated migrations of the same keyspace
(migration -> rollback -> new migration) cannot be distinguished.
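The idea of a deterministic, name-based task ID can be sketched with Python's standard `uuid` module. The namespace below is a made-up placeholder and the exact derivation used by `make_task_id()` is not described here, so this shows the property (same keyspace name, same version 3 UUID), not the actual values.

```python
import uuid

# Hypothetical namespace, chosen only for this illustration.
MIGRATION_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "vnodes-to-tablets.example")


def make_task_id(keyspace: str) -> uuid.UUID:
    """Derive a stable version 3 (name-based, MD5) UUID from the keyspace name."""
    return uuid.uuid3(MIGRATION_NAMESPACE, keyspace)


first = make_task_id("ks")
second = make_task_id("ks")
other = make_task_id("another_ks")
```

Nothing needs to be persisted: any node can recompute the ID from the keyspace name, which is also why repeated migrations of the same keyspace share one ID.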
Introduce a new task manager module to keep them separate from other
tasks.
Add support for `wait()`. While its practical value is debatable
(migration is a manual procedure, rolling restart will interrupt it), it
keeps the task consistent with the task manager interface.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`storage_service::get_tablets_migration_status()` reads a group0 virtual
table, so it requires group0 to be initialized.
When invoked via the migration REST API, this condition is satisfied
since the API is only available after joining group0. However, once this
function is integrated into the task API later in this series, the
assumption will no longer hold, as the task API is exposed earlier in
the startup process.
Add a guard to detect this condition and return a clear error message.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`storage_service::get_tablets_migration_status()` returns the
keyspace-level migration status, indicating whether migration has not
started, is in progress, or has completed, and for migrating keyspaces
also returns per-node migration statuses. Rename it to
`get_tablets_migration_status_with_node_details()` and introduce a new
`get_tablets_migration_status()` that returns only the keyspace-level
status.
This prepares the function for reuse in the next patches, which will add
a virtual task for vnodes-to-tablets migrations. Several task-manager
paths will only need the keyspace-level migration state, not per-node
information.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Using a string was sufficient while this status was only exposed through
the REST API, but the next patches will also consume it internally.
Use an enum for the internal representation and convert it back to the
existing string values in the REST API.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The UUID class already provides `is_timestamp()` for identifying
time-based (version 1) UUIDs. Add the analogous `is_name_based()`
predicate for version 3 (name-based) UUIDs, along with a test.
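As a rough illustration of what such predicates check, the version of an RFC 4122 UUID lives in the top four bits of the third group. The Python functions below mirror the names `is_timestamp()` and `is_name_based()` but are a sketch, not the C++ implementation.

```python
import uuid


def uuid_version(u: uuid.UUID) -> int:
    """Extract the 4-bit version field (bits 76-79 of the 128-bit value)."""
    return (u.int >> 76) & 0xF


def is_timestamp(u: uuid.UUID) -> bool:
    return uuid_version(u) == 1


def is_name_based(u: uuid.UUID) -> bool:
    return uuid_version(u) == 3


# The task ID shown in the example output of this series.
task_id = uuid.UUID("1747b573-6cd6-312d-abb1-9b66c1c2d81f")
```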
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add tests for Streams when the table uses tablets underneath.
One test verifies filtering using the CHILD_SHARDS feature.
The other makes sure we read all the data while the table
undergoes a tablet count change.
Add `--tablet-load-stats-refresh-interval-in-seconds=1`
to the `alternator/run` script, as otherwise the newly added tests will fail.
The setting changes how often scylla refreshes tablet metadata.
This can't be done using `scylla_config_temporary`, as
1) the default is 60 seconds
2) scylla will wait the full timeout (60s) before reading the configuration variable again.
Remove the `if` condition that prevented tables with tablets
from working with Streams.
Remove a test that verifies that Alternator rejects
enabling the Streams feature on tables with tablets underneath.
Update a few tests that were expected to fail on tablets to enable their
normal execution.
Add helper functions to the `system_keyspace` object that deal
with reading cdc content for tablet-based tables.
`read_cdc_for_tablets_current_generation_timestamp` reads the current
generation's timestamp.
`read_cdc_for_tablets_versioned_streams` builds a
timestamp -> `cdc::streams_version` map, similar to how
`system_distributed_keyspace::cdc_get_versioned_streams` works.
We're adding these helper functions because their siblings in
`system_distributed_keyspace` work only when the base table is backed
by vnodes. The new additions work only when the base table is backed
by tablets.
Add a reference to the `system_keyspace` object to the `executor` object in
alternator. The reference is needed because a future commit
will add there (and use) helper functions that read `cdc_log` tables
for tablet-based tables, similar to the existing siblings
for vnodes living in `system_distributed_keyspace`.
This patch is the result of two bugs: a spurious MODIFY event when
a column removal is used in `update_item` on a non-existing item, and
spurious events when a batch write mixes no-op operations with
operations involving actual changes (the former would still emit
cdc log entries).
The latter issue required a rework of Piotr Wieczorek's algorithm,
which fixed the former issue as well.
Piotr Wieczorek previously wrote checks that should
prevent unnecessary cdc events from being written. His implementation
missed the fact that a single `mutation` object passed to the cdc code
to be analysed for cdc log entries can contain modifications for
multiple rows (with the same timestamp, for example as a result
of a BatchWriteItem call). His code tries to skip the whole `mutation`,
which in such a case is not possible, because BatchWriteItem might have
one item that does nothing and a second item that makes a modification
(this is the reason for the second bug).
His algorithm was extended and moved. Originally it worked
as follows: the user would send a `mutation` object with some changes to
be "augmented". The cdc would process those changes and build a set of
cdc log changes based on them, which would be added to the cdc log table.
Piotr added a `should_skip` function, which processed the user changes and
tried to determine whether they should all be dropped or not.
The new version, instead of trying to skip adding rows to the
cdc log `mutation` object, builds a rows-to-ignore set.
After the whole cdc log `mutation` object is completed, it
goes through it row by row. Any row that was previously added to
the `rows_to_ignore` set is removed. The remaining rows are written to
a new cdc log `mutation` with a new clustering key
(`cdc$batch_seq_no` index value should probably be consecutive -
we just want to be safe here), and the new `mutation` object is returned to
be sent to the cdc log table.
The first bug is fixed as a side effect of the new algorithm,
which contains more precise checks detecting whether a given
mutation actually made a difference.
Fixes: #28368
Fixes: SCYLLADB-538
Fixes: SCYLLADB-1528
Refs: #28452
Garbage collected sstables created during incremental compaction are
deleted only at the end of the compaction, which increases the memory
footprint. This is inefficient, especially considering that the related
input sstables are released regularly during compaction.
This commit implements incremental release of GC sstables after each
output sstable is sealed. Unlike regular input sstables, GC sstables
use a different exhaustion predicate: a GC sstable is only released
when its token range no longer overlaps with any remaining input
sstable. This is because GC sstables hold tombstones that may shadow
data in still-alive overlapping input sstables; releasing them
prematurely would cause data resurrection.
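The exhaustion predicate for GC sstables can be sketched as an interval-overlap check. The `SSTable` type, integer tokens, and function names below are simplifying assumptions for illustration and do not mirror the compaction code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    first_token: int
    last_token: int


def overlaps(a: SSTable, b: SSTable) -> bool:
    """True if the two closed token ranges share at least one token."""
    return a.first_token <= b.last_token and b.first_token <= a.last_token


def releasable_gc_sstables(gc_sstables, remaining_inputs):
    """GC sstables whose token range overlaps no remaining input sstable."""
    return [
        gc for gc in gc_sstables
        if not any(overlaps(gc, inp) for inp in remaining_inputs)
    ]


gc_sstables = [SSTable("gc-1", 0, 100), SSTable("gc-2", 150, 300)]
remaining_inputs = [SSTable("in-3", 200, 400)]
released = releasable_gc_sstables(gc_sstables, remaining_inputs)
```

Here `gc-2` must be kept because its tombstones may still shadow data in `in-3`, while `gc-1` no longer overlaps any live input and can be released.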
Fixes #5563
Closes scylladb/scylladb#28984
Add failing tests for Streams functionality.
Trying to remove a column from a non-existing item produces
a MODIFY event (while it should produce none).
Doing a batch write with operations on the same partition,
where one operation has no side effects and the second does,
produces events for both operations, even though the first changes nothing.
The first test has two versions: with and without a clustering key.
The second has only the clustering-key version, as without one we can't produce
a batch write with two items for the same partition:
a batch write can't use the same primary key more than once in a single call.
We also add a test for a batch write where one of three operations
has no observable side effects and should not show up in the Streams
output, but in the current version of Scylla it does.
ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: estimated_histogram_with_max. It is both CPU- and
memory-efficient, and it can be exported as Prometheus native histograms.
Its main limitation (which also has benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.
The end goal is to replace all uses of estimated_histogram in the codebase.
That migration requires a few small API adjustments, so it is done in steps.
This PR replaces estimated_histogram for CAS contention.
The PR includes a patch that adds functionality to the base approx_exponential_histogram, which will be used by the API.
The specific histograms are defined in a single place and cover the range 1-100; this makes future changes easy.
**New feature, no need to backport**
Closes scylladb/scylladb#29017
* github.com:scylladb/scylladb:
storage_proxy: migrate CAS contention histograms to estimated_histogram_with_max
estimated_histogram.hh: Add bucket offset and count to approx_exponential_histogram
After stopping the topology coordinator, a new topology coordinator
may not yet be started when get_coordinator_host() is called. Make
the function always retry via wait_for so that every caller is
protected against this race.
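The retry pattern can be sketched as below. This is a hypothetical synchronous helper written for illustration; it is not the test framework's actual `wait_for`, and `retry_until_available`, `fake_probe`, and the host name are made up.

```python
import time


def retry_until_available(probe, timeout_s: float, period_s: float = 0.1):
    """Call probe() until it returns a non-None value or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("no topology coordinator found before the deadline")
        time.sleep(period_s)


# Simulated lookup: the new coordinator only appears on the third attempt.
answers = iter([None, None, "host-1"])


def fake_probe():
    return next(answers)


coordinator_host = retry_until_available(fake_probe, timeout_s=5.0, period_s=0.01)
```

Putting the retry inside the lookup function, rather than at each call site, is what protects every caller against the window between the old coordinator stopping and the new one starting.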
Fixes SCYLLADB-1553
Closes scylladb/scylladb#29489
Container names were generated as {name}-{pid}-{counter}, where the
counter is a per-process itertools.count. This scheme breaks across CI
runs on the same host: if a prior job was killed abruptly (SIGKILL,
cancellation) its containers are left running since --rm only removes
containers on exit. A subsequent run whose worker inherits the same PID
(common in containerized CI with small PID namespaces) and reaches the
same counter value will collide with the orphaned container.
Replace pid+counter with uuid.uuid4(), which generates a random UUID,
making names unique across processes, hosts, and time without any shared
state or leaking host identifiers.
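A minimal sketch of the naming scheme, assuming the random UUID is simply appended to the base name (the exact name format and the `container_name` helper are assumptions):

```python
import uuid


def container_name(name: str) -> str:
    """Build a container name that does not depend on the PID or a per-process counter."""
    return f"{name}-{uuid.uuid4()}"


first = container_name("minio")
second = container_name("minio")
```

Two calls with the same base name produce different names even within the same process, so an orphaned container from a killed run cannot collide with a new one in practice.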
Fixes: SCYLLADB-1540
Closes scylladb/scylladb#29509
The PR serves two purposes.
First, it makes the flag usage consistent across the multiple ways of loading sstable components. For example, sstable::load_metadata() doesn't set it (unlike .load(), which does), and so may refuse to load "corrupted" components that the flag is meant to tolerate.
Second, it removes the fan-out of db.get_config().ignore_component_digest_mismatch() across the code. It is called pretty much everywhere to initialize sstable_open_config, while the option in question is a "scylla state" parameter, not an "sstable opening" one.
Code cleanup, not backporting
Closes scylladb/scylladb#29513
* github.com:scylladb/scylladb:
sstables: Remove ignore_component_digest_mismatch from sstable_open_config
sstables: Move ignore_component_digest_mismatch initialization to constructor
sstables: Add ignore_component_digest_mismatch to sstables_manager config
Write the MinIO server log directly to tempdir_base (testlog/<arch>/)
instead of the per-server temp directory that gets destroyed on
shutdown. This preserves the log for Jenkins artifact collection,
helping debug S3-related flaky test failures like the
stcs_reshape_overlapping_s3_test hang (SCYLLADB-1481).
Closes scylladb/scylladb#29458
With the latest changes, a lot of code in test.py is redundant. This PR removes it.
It also restricts the dynamic fixture scope to test/alternator and test/cqlpy. Everything else will have module scope by default.
test.py will remain a wrapper around pytest, mostly for CI use. For now, test.py still has the important job of calculating the number of threads to start pytest with, which cannot be done in pytest itself.
No backport needed, framework enhancement only.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-666
Closes scylladb/scylladb#28852
* github.com:scylladb/scylladb:
test.py: remove testpy_test_fixture_scope
test.py: add logger for 3rd party service
test.py: delete dead code in test.py
Users should be able to create a local index for Vector Search
based on only a part of the partition key. This commit provides this by
removing the 'full partition key only' requirement for custom local indexes.
The commit updates the docs to explain that a local vector index can use only a part
of the partition key.
The commit adds a cqlpy test to check the fixed functionality.
Fixes: SCYLLADB-953
Needs to be backported to 2026.1 as it is a fix for local vector indexes.
Closes scylladb/scylladb#28931
Decrease the default connection timeout to 3s to better align with the
default CQL query timeout of 10s.
The previous timeout allowed only one failover request in high availability
scenario before hitting the CQL query timeout.
By decreasing the timeout to 3s, we can perform up to three failover requests
within the CQL query timeout, which significantly improves the chances of
successfully completing the query in high availability scenarios.
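The arithmetic behind this can be sketched as follows. This is only an illustration: the helper name is hypothetical, and it assumes the worst case where every failed attempt consumes the full connection timeout.

```python
def max_failover_attempts(query_timeout_s: float, connect_timeout_s: float) -> int:
    """Upper bound on sequential connection attempts that fit in one query timeout.

    Assumes each failed attempt takes the whole connect timeout (worst case).
    """
    if connect_timeout_s <= 0:
        raise ValueError("connect timeout must be positive")
    return int(query_timeout_s // connect_timeout_s)


# 10s CQL query timeout, 3s connection timeout (the values named above).
print(max_failover_attempts(10, 3))
```

With a 10s query timeout and a 3s connection timeout, three full attempts fit and one second is left over.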
Fixes: SCYLLADB-95
Add option `vector_store_unreachable_node_detection_time_in_ms` to
control parameters related to detecting unreachable vector store nodes.
This parameter is used to set the TCP connect timeout, keepalive
parameters, and TCP_USER_TIMEOUT. By configuring these parameters,
we can detect unreachable vector store nodes faster and trigger
failover mechanisms in a timely manner.
The audit table had caching enabled by default, which provides no
value since audit data is write-heavy and rarely read back through
the cache. This wastes cache space that could be used for more
important user data.
Disable caching by setting keys and rows_per_partition to NONE and
enabled to false, consistent with get_disabled_caching_options()
and other system tables such as system.batchlog,
system.large_partitions, and CDC log tables.
Closes scylladb/scylladb#29506
This series adds support for vector search in Alternator based on the existing implementation in CQL.
The series adds APIs for `CreateTable` and `UpdateTable` to add or remove vector indexes to Alternator tables, `DescribeTable` to list them and check the indexing status, and `Query` to perform a vector search - which contacts the vector store for the actual ANN (approximate nearest neighbor) search.
Correct functionality of these features depends on some features of the vector store that were already implemented (see https://github.com/scylladb/vector-store/pull/394).
This initial implementation is fully functional, and can already be useful, but we do not yet support all the features we hope to eventually support. Here are things that we have **not** done yet, and plan to do later in follow-up pull requests:
1. Support a new optimized vector type ("V") - in addition to the "list of numbers" type supported in this version.
2. Allow choosing a different similarity function when creating an index, by SimilarityFunction in VectorIndex definition.
3. Allow choosing quantization (f32/f16/bf16/i8/b1) to ask the vector index to compress stored vectors.
4. Support oversampling and rescoring, defined per-index and per-query.
5. Support HNSW tuning parameters — maximum_node_connections, construction_beam_width, search_beam_width.
6. Support pre-filtering over key columns, which are available at the vector store, by sending the filter to the vector store (translated from DynamoDB filter syntax to the vector store's filter syntax). A decision still needs to be made whether this will use KeyConditionExpression or FilterExpression. This version supports only post-filtering (with `FilterExpression`).
7. Support projecting non-key attributes into the index (Projection=INCLUDE and Projection=ALL), and then 1. pre-filtering using these attributes, and 2. efficiently return these attributes (using Select=ALL_PROJECTED_ATTRIBUTES, which today returns just the key columns).
8. Optimize the performance of `Query`, which today is inefficient for Select=ALL_ATTRIBUTES because it serially retrieves the matching items one at a time.
9. Returning the similarity scores with the items (the design proposes ReturnVectorSearchSimilarity).
10. Add more vector-search-specific metrics, beyond the metric we already have counting Query requests. For example separate latency and request-count metrics for vector-search Queries (distinct from GSI/LSI queries), and a metric accumulating the total Limit (K) across all vector search queries.
11. Consider how (and if at all) we want to run the tests in test/alternator/test_vector.py that need the vector store in the CI. Currently they are skipped in CI and only run manually (with `test/alternator/run --vs test_vector`).
12. UpdateTable 'Update' operation to modify index parameters. Only some can be modified, e.g., Oversampling.
13. Support for "local index" (separate index for each partition).
14. Make sure that vector search and Streams can be enabled concurrently on the same table - both need CDC but we need to verify that one doesn't confuse the other or disables options that the other needs. We can only do this after we have Alternator Streams running on tablets (since vector store requires tablets).
Testing the new Alternator vector search end-to-end requires running both Scylla and the vector store together. We will have such end-to-end tests in the vector store repository (see https://github.com/scylladb/vector-store/pull/392), but we also add in this pull request many end-to-end tests written in Python, that can be run with the command "test/alternator/run --vs test_vector.py". The "--vs" option tells the run script to run both Scylla and the vector store (currently assumed to be in `.../vector-store/target/release/vector-store`). About 65% of the tests in this pull request check supported syntax and error paths so can run without the vector store, while about 35% of the tests do perform actual Query operations and require the vector store to be running. Currently, the tests that do require the vector store will not get run by CI, but can be easily re-run manually with `test/alternator/run --vs test_vector.py`.
In total, this series includes 78 functional tests in 2200 lines of Python code.
This series also includes documentation for the new Alternator feature and the new APIs introduced. You can see a more detailed design document here: https://docs.google.com/document/d/1cxLI7n-AgV5hhH1DTyU_Es8_f-t8Acql-1f58eQjZLY/edit
Two patches in this series split the huge alternator/executor.cc, after this series continued to grow it and it reached a whopping 7,000 lines. These patches are just a reorganization of code, with no functional changes. But it's time that we finally do this (Refs #5783), we can't just continue to grow executor.cc with no end...
Closes scylladb/scylladb#29046
* github.com:scylladb/scylladb:
test/alternator: add option to "run" script to run with vector search
alternator: document vector search
test/alternator: fix retries in new_dynamodb_session
test/alternator: test for allowed characters in attribute names
test/alternator: tests for vector index support
alternator, vector: add validation of non-finite numbers in Query
alternator: Query: improve error message when VectorSearch is missing
alternator: add per-table metrics for vector query
alternator: clean up duplicated code
alternator: fix default Select of Query
alternator: split executor.cc even more
alternator: split alternator/executor.cc
alternator: validate vector index attribute values on write
alternator: DescribeTable for vector index: add IndexStatus and Backfilling
alternator: implement Query with a vector index
alternator: fix bug in describe_multi_item()
alternator: prevent adding GSI conflicting with a vector index
alternator: implement UpdateTable with a vector index
alternator: implement DescribeTable with a vector index
alternator: implement CreateTable with a vector index
alternator: reject empty attribute names
cdc: fix on_pre_create_column_families to create CDC log for vector search
This PR removes the power-of-two token constraint from vnodes-to-tablets migrations, allowing clusters with randomly generated tokens to migrate without manual token reassignment.
Previously, migrations required vnode tokens to be a power of two and aligned. In practice, these conditions are not met with Scylla's default random token assignment, so the constraint is a blocker for real-world use. Since PR #28459, the tablet layer supports arbitrary tablet boundaries. This PR builds on that capability to allow arbitrary vnode tokens during migration.
When the highest vnode token does not coincide with the end of the token ring, the vnode wraps around, but tablets do not support that. This is handled by splitting it into two tablets: one covering the tail end of the ring and one covering the beginning.
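The wrap-around handling can be sketched roughly as below. This is not the storage_service code: the function name, the inclusive `(first, last)` range representation, and the tiny ring bounds in the example are all assumptions made for illustration. It assumes the usual convention that a vnode with token t owns the range ending at t.

```python
def tablet_ranges(vnode_tokens, min_token, max_token):
    """Map vnode tokens to inclusive (first_token, last_token) tablet ranges.

    Every vnode token becomes the last token of one tablet. If the highest
    vnode token is not the end of the ring, an extra tail tablet is added,
    so the wrap-around vnode ends up covered by two tablets: the first one
    (beginning of the ring) and the last one (tail end of the ring).
    """
    tokens = sorted(set(vnode_tokens))
    if not tokens:
        raise ValueError("at least one vnode token is required")
    last_tokens = list(tokens)
    if last_tokens[-1] != max_token:
        last_tokens.append(max_token)  # tail tablet for the wrapped range
    ranges = []
    first = min_token
    for last in last_tokens:
        ranges.append((first, last))
        first = last + 1
    return ranges


# Hypothetical small ring [-100, 100] with three vnodes.
for r in tablet_ranges([-50, 0, 40], min_token=-100, max_token=100):
    print(r)
```

In this example three vnodes produce four tablets: the vnode with token -50 owns both the tail `(41, 100)` and the head `(-100, -50)` of the ring.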
Testing has been updated accordingly: existing cluster tests now use randomly generated tokens instead of precomputed power-of-two values, and a new Boost test validates the wrap-around tablet boundary logic.
Fixes SCYLLADB-724.
New feature, no backport is needed.
Closes scylladb/scylladb#29319
* github.com:scylladb/scylladb:
test: Use arbitrary tokens in vnodes->tablets migration tests
test: boost: Add test for wrap-around vnodes
storage_service: Support vnodes->tablets migrations w/ arbitrary tokens
storage_service: Hoist migration precondition
With the migration to pytest this fixture is useless. Removing it and setting
the scope to module for most of the tests.
Add dynamic_scope function to support running alternator fixtures in
session scope, while Test and TestSuite are not deleted. This is for
the migration period; later on this function should be deleted.
With the migration of environment preparation and the starting of 3rd party
services to pytest, these services output their logs to the terminal. So this
PR binds each of them to its own log file to avoid polluting the terminal.
With the latest changes, there is a lot of redundant code in
test.py. This PR just cleans this code up.
Changes in other files are related to cleaning code from the test.py,
especially with redundant parameter --test-py-init and moving
prepare_environment to pytest itself.
The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism.
Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries.
Fixes SCYLLADB-1542
This is a CI stability issue and should be backported.
Closes scylladb/scylladb#29504
* github.com:scylladb/scylladb:
test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
test: fix proc_utils.cc formatting from previous commit
test: lib: use unique container name per retry attempt
test: lib: fix broken retry in start_docker_service
Add a `CHILD_SHARDS` filter to `DescribeStream` command.
When used, the user needs to pass a parent stream shard id in the
JSON's ShardFilter.ShardId field. DescribeStream will
then return only the list of stream shards that are direct
descendants of the passed parent stream shard.
Each stream shard covers a consecutive part of the token space.
A stream shard Q is considered to be a child of stream shard W
when at least one token belongs to the token ranges of both shards.
The filtering algorithm itself is somewhat complicated - more details
in comments in streams.cc.
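The parent/child definition above can be illustrated with a naive overlap check. This is a simplified sketch, not the algorithm in streams.cc: the `Shard` type, the inclusive token ranges, and the assumption that the candidates are exactly the shards of the generation following the parent's are all hypothetical.

```python
from typing import List, NamedTuple


class Shard(NamedTuple):
    shard_id: str
    first_token: int  # inclusive
    last_token: int   # inclusive


def overlaps(a: Shard, b: Shard) -> bool:
    """True when at least one token belongs to both token ranges."""
    return a.first_token <= b.last_token and b.first_token <= a.last_token


def child_shards(parent: Shard, next_generation: List[Shard]) -> List[Shard]:
    """Return the shards of the next generation that share a token with parent."""
    return [s for s in next_generation if overlaps(parent, s)]


parent = Shard("W", 0, 99)
candidates = [Shard("Q1", 0, 49), Shard("Q2", 50, 99), Shard("Q3", 100, 199)]
print([s.shard_id for s in child_shards(parent, candidates)])
```

Here Q1 and Q2 are children of W, while Q3 is not because it shares no token with W.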
CHILD_SHARDS is an Amazon feature and is required by KCL.
Add unit tests.
Fixes: #25160
Closes scylladb/scylladb#28189
Increase the non-AWS wait in the TTL streams test to reduce vnode CI flakes caused by delayed expiration visibility.
Fixes SCYLLADB-1556
Closes scylladb/scylladb#29516
This will make debugging of stalled tablet transitions easier. We saw
several issues when topology state machine was blocked by active
tablet migrations, which was not obvious at first glance of the
logs. Now it will be easy to tell if tablet transitions are blocking
progress and which transitions are stuck.
Closes scylladb/scylladb#28616
drain() signals the postponed_reevaluation condition variable to terminate
the postponed_compactions_reevaluation() coroutine but does not await its
completion. When enable() is called afterwards, it overwrites
_waiting_reevalution with a new coroutine, orphaning the old one. During
shutdown, really_do_stop() only awaits the latest coroutine via
_waiting_reevalution, leaving the orphaned coroutine still alive. After
sharded::stop() destroys the compaction_manager, the orphaned coroutine
resumes and reads freed memory (is_disabled() accesses _state).
Fix by introducing stop_postponed_compactions(), awaiting the reevaluation coroutine in
both drain() and stop() after signaling it, if postponed_compactions_reevaluation() is running.
It uses an std::optional<future<>> for _waiting_reevalution and std::exchange to leave
_waiting_reevalution disengaged when postponed_compactions_reevaluation() is not running.
This prevents a race between drain() and stop().
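The shape of the fix can be imitated in Python with asyncio, as an analogy only. The real code uses Seastar futures and a condition variable; here an `asyncio.Event` and an optional `asyncio.Task` stand in for them, and the class and counter names are hypothetical.

```python
import asyncio
from typing import Optional


class CompactionManagerSketch:
    def __init__(self) -> None:
        self._signal = asyncio.Event()
        self._disabled = True
        # Engaged only while the reevaluation loop is running.
        self._waiting_reevaluation: Optional[asyncio.Task] = None
        self.finished_loops = 0

    async def _postponed_reevaluation(self) -> None:
        while True:
            await self._signal.wait()
            self._signal.clear()
            if self._disabled:
                self.finished_loops += 1
                return

    def enable(self) -> None:
        self._disabled = False
        self._waiting_reevaluation = asyncio.create_task(self._postponed_reevaluation())

    async def _stop_postponed_compactions(self) -> None:
        # Take the task and leave the member disengaged (like std::exchange),
        # so a concurrent or repeated caller finds nothing left to await.
        task, self._waiting_reevaluation = self._waiting_reevaluation, None
        if task is None:
            return
        self._disabled = True
        self._signal.set()
        await task  # wait for the loop to really finish, do not orphan it

    async def drain(self) -> None:
        await self._stop_postponed_compactions()

    async def stop(self) -> None:
        await self._stop_postponed_compactions()


async def demo() -> int:
    manager = CompactionManagerSketch()
    manager.enable()
    await manager.drain()   # first loop is awaited here
    manager.enable()        # starts a fresh loop, nothing is overwritten
    await manager.stop()
    await manager.stop()    # no-op: member is already disengaged
    return manager.finished_loops


print(asyncio.run(demo()))
```

Because drain() awaits the loop before returning, a later enable() never replaces a still-running task, which is the orphaning scenario described above.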
While at it, fix typo in _waiting_reevalution -> _waiting_reevaluation.
Fixes: SCYLLADB-1463
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#29443
Add to test/alternator/run the option "-vs" which runs alongside with
Scylla a vector store, to allow running Alternator tests with vector
indexing.
To get the vector store, do
git clone git@github.com:scylladb/vector-store.git
cargo build --release
"run -vs" looks for an executable in ../vector-store/target/*/vector-store
but can also be overridden by the VECTOR_STORE environment variable.
test/alternator/run runs the vector store exactly like it runs Scylla -
in a temporary directory, on a temporary IP address in the localhost
subnet (127.0.0.0/8), killing it when the test ends, and showing the output
of both programs (Scylla and vector store). These transient runs of
Scylla and vector store are configured to be able to communicate to
each other.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a new document, docs/alternator/vector-search.md, on the
new vector search feature in Alternator. It introduces this feature, and
the DynamoDB APIs that we extended to support it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The new_dynamodb_session() function had a bug which we never noticed
because we hardly used it, but it became more noticeable when the new
test/alternator/test_vector.py started to use it:
By default, boto3 retries a request up to 9 times when it encounters
a retriable error (such as an Internal Server Error). We don't want such
retries in our tests - it makes failures slower, but more importantly
it can hide "flaky" bugs by retrying 9 times until it happens to succeed.
The new_dynamodb_session() had code (copied from the dynamodb fixture)
to set boto3's "max_attempts" configuration to 0, to disable this retry.
But this code had an incorrect "if" to only be done if we're testing on
"localhost". This is wrong: We almost never use "localhost" as the
target of the test; both test/cqlpy/run and test.py pick an IP address
in the localhost subnet (127/8) and use that IP address - not the string
"localhost".
This bug only existed in new_dynamodb_session() - the more commonly used
"dynamodb" fixture didn't have this bug.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
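To see why the "localhost" condition in new_dynamodb_session() almost never matched, consider the small sketch below. The exact form of the original "if" is not shown in this message, so the string check here is an assumed stand-in, and `is_local_endpoint` is a hypothetical helper rather than the fix that was applied.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """True for "localhost" and for any address in the loopback subnet."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a DNS name other than "localhost"


url = "http://127.1.2.3:8000"    # the kind of address the run scripts pick
print("localhost" in url)        # assumed old-style string check
print(is_local_endpoint(url))    # loopback-aware check
```

The string check misses the 127/8 address that the test scripts actually use, so the retry-disabling code was skipped.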
One of the tests in the previous patch checked that strange characters
are allowed in attribute names used for vector indexing. It turns out
we never had a test that verifies that regardless of vector indexes -
any character whatsoever is allowed in attribute names. This is
different from table names which are much more limited.
So this patch adds the missing test.
As usual, the new test also passes on DynamoDB, showing that these
strange characters in attribute names are also allowed by DynamoDB.
In this patch we add a large collection of basic functional tests for the
vector index support, covering the CreateTable, UpdateTable, DescribeTable
and Query operations and the various ways in which those are allowed to
work - or expected to fail. These tests were written in parallel with
writing the code so they (hopefully) cover all the corner cases considered
during development, and make sure these corner cases are all handled
correctly and will not regress in the future.
Some of these tests do not involve querying of the index and focus on
the structure of requests and the kind of syntax allowed. But other tests
are end-to-end, requiring the vector store to be running and trying to
index Alternator data and query it. These tests are marked
"needs_vector_store", and are immediately skipped if Scylla is not
configured to connect to a vector store. In a later patch we'll add
an option to test/alternator/run to be able to run these end-to-end
tests by automatically running both Scylla and the Vector Store.
We'll have additional end-to-end tests in the vector-store repository.
Note that vector search is a new API feature that doesn't exist in DynamoDB,
so we are adding new parameters and outputs to existing operations. The AWS
SDKs don't normally allow doing that, so the test added here begins by
teaching the Python SDK to use the new APIs we added. This piece of code
can also be used by end-users to use vector search (at least in Python...)
before we officially add this support to ScyllaDB's SDK wrappers.
Non-finite numbers (Inf, NaN) don't make sense in vector search, and
are also not allowed in the DynamoDB API as numbers. But the parsing code
in Query's QueryVector accepted "Inf" and "NaN" and then failed to
send the request to the vector store, resulting in a strange error
message. Let's fix it in the parsing code.
We have a test (test_query_vectorsearch_queryvector_bad_number_string)
that verifies this fix.
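A minimal sketch of this kind of parse-time check is shown below. It is not the Alternator code: the function name is hypothetical, it assumes the vector elements arrive as number strings, and a Python ValueError stands in for the ValidationException returned to the client.

```python
import math
from typing import List


def parse_query_vector(number_strings: List[str]) -> List[float]:
    """Parse vector elements, rejecting anything that is not a finite number."""
    values = []
    for text in number_strings:
        try:
            value = float(text)
        except ValueError:
            raise ValueError(f"QueryVector element {text!r} is not a number") from None
        # float() happily accepts "Inf" and "NaN", so check explicitly.
        if not math.isfinite(value):
            raise ValueError(f"QueryVector element {text!r} is not finite")
        values.append(value)
    return values


print(parse_query_vector(["1", "2.5", "-3e2"]))
```

Rejecting the value while parsing gives the user a clear error instead of a failure when the request is sent to the vector store.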
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, if we attempt a Query where IndexName is a vector index
but forget a "VectorSearch" parameter, the error is misleading: The code
expects a GSI or LSI, and when it can't find a GSI or LSI with that name,
it reports that the index is missing. But this is not helpful. So in this
patch we produce a more helpful message: That the index does exist, and
is a vector index, so a "VectorSearch" parameter is mandatory and is
missing.
The per-table metrics for Query were not incremented for the
vector variant of the Query operations, only the global metrics were
incremented. This patch fixes this oversight, and adds a test that
reproduces it (the new test fails before this patch, and passes after).
De-duplicate some code introduced in earlier patches, such as two
nearly-identical loops over the indexes (one to check if there is a
vector index, the second to get its dimensions), and two nearly-
identical chunks of code to get the item contents when there is or
there isn't a clustering key.
There should be no functional changes in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In earlier patches, when Query'ing a vector index, we set the default
Select to ALL_ATTRIBUTES. However, according to the DynamoDB documentation
for Query,
"If neither Select nor ProjectionExpression are specified, DynamoDB
defaults to ALL_ATTRIBUTES when accessing a table, and
ALL_PROJECTED_ATTRIBUTES when accessing an index."
This default should also apply to vector index, so this patch fixes this.
The new behavior is not only more compatible with DynamoDB, it is also
much more efficient by default, as ALL_PROJECTED_ATTRIBUTES does not need
to read from the base table - it returns the results that the vector store
returned. Of course, if the user needs the less efficient ALL_ATTRIBUTES
this option is still available - it's just no longer the default.
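The defaulting rule quoted above can be written down as a small function. This is a sketch with hypothetical names, not the executor code; the SPECIFIC_ATTRIBUTES branch follows the DynamoDB rule for requests that pass a ProjectionExpression without Select.

```python
from typing import Optional


def effective_select(index_name: Optional[str],
                     select: Optional[str] = None,
                     projection_expression: Optional[str] = None) -> str:
    """Return the Select value a Query should use after applying defaults."""
    if select is not None:
        return select
    if projection_expression is not None:
        return "SPECIFIC_ATTRIBUTES"
    # Neither Select nor ProjectionExpression: the default depends on whether
    # the query targets the base table or an index (GSI, LSI or vector index).
    return "ALL_PROJECTED_ATTRIBUTES" if index_name else "ALL_ATTRIBUTES"


print(effective_select(None))
print(effective_select("my_vector_index"))
print(effective_select("my_vector_index", select="ALL_ATTRIBUTES"))
```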
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch continues the effort to split the huge executor.cc (5000
lines before this patch) even more.
In this patch we introduce a new source file, executor_util.cc, for
various utility functions that are used for many different operations
and therefore are useful to have in a header file. These utility
functions will now be in executor_util.cc and executor_util.hh -
instead of executor.cc and executor.hh.
Various source files, including executor.cc, the executor_read.cc
introduced in the previous patch, as well as older source files such
as streams.cc, ttl.cc and serialization.cc, use the new header file.
This patch removes over 700 lines of code from executor.cc, and
also removes a large number of utility function declarations from
executor.hh. Originally, executor.hh was meant to be about the
interface that the Alternator server needs to *execute* the different
DynamoDB API operations - and after this patch it returns closer to
this original goal.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Already six years ago, in #5783, we noticed that alternator/executor.cc
has grown too large. The previous patches added hundreds more lines
to it to implement vector search, and it reached a whopping 7,000 lines
of code. This is too much.
This patch splits from executor.cc two major chunks:
1. The implementation of **read** requests - GetItem, BatchGetItem,
Query (base table, GSI/LSI, and vector-search), and Scan - was
moved to a new source file alternator/executor_read.cc.
The new file has 2,000 lines.
2. Moved 250 lines of template functions dealing with attribute paths
and maps of them to a new header file, attribute_path.hh.
These utilities are used for many different operations - various
read operations use them for ProjectionExpression, and UpdateItem
uses them for modifications to nested attributes, so we need the
new header file from both executor.cc and executor_read.cc
The remaining executor.cc is still pretty big, 5,000 lines, and
contains write operations (PutItem, UpdateItem, DeleteItem,
BatchWriteItem) as well as various table and other operations, and
also many utility functions used by many types of operations, so
we can later continue this refactoring effort.
Refs #5783
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Extend system_info_encryption to encrypt system.raft SSTables.
system.raft contains the Raft log, which may hold sensitive user data
(e.g. batched mutations), so it warrants the same treatment as
system.batchlog and system.paxos.
During upgrade, existing unencrypted system.raft SSTables remain
readable. Existing data is rewritten encrypted via compaction, or
immediately via nodetool upgradesstables -a.
Update the operator-facing system_info_encryption description to
mention system.raft and add a focused test that verifies the schema
extension is present on system.raft.
Fixes: CUSTOMER-268
Backport: 2026.1 - closes an encryption-at-rest coverage gap: system.raft may persist sensitive user-originated data unencrypted; backport to the current LTS.
Closes scylladb/scylladb#29242
There's a bunch of db::config options that are used by cql3/statements/ code. For that they use data_dictionary/database as a proxy to get a db::config reference. This PR moves most of these options onto cql_config.
Options migrated to cql_config:
1. select_internal_page_size
2. strict_allow_filtering
3. enable_parallelized_aggregation
4. batch_size_warn_threshold_in_kb
5. batch_size_fail_threshold_in_kb
6. 7 keyspace replication restriction options
7. 2 TWCS restriction options
8. restrict_future_timestamp
9. strict_is_not_null_in_views (with view_restrictions struct)
10. enable_create_table_with_compact_storage
Some options need special treatment and are still abused via database, namely:
1. enable_logstor
2. cluster_name
3. partitioner
4. endpoint_snitch
Fixing components inter-dependencies, not backporting
Closes scylladb/scylladb#29424
* github.com:scylladb/scylladb:
cql3: Move enable_create_table_with_compact_storage to cql_config
cql3: Move strict_is_not_null_in_views to cql_config
cql3: Move restrict_future_timestamp to cql_config
cql3: Move TWCS restriction options to cql_config
cql3: Move keyspace restriction options to cql_config
cql3: Move batch_size_fail_threshold_in_kb to cql_config
cql3: Move batch_size_warn_threshold_in_kb to cql_config
cql3: Move enable_parallelized_aggregation to cql_config
cql3: Move strict_allow_filtering to cql_config
cql3: Move select_internal_page_size to cql_config
test: Fix cql_test_env to use updateable cql_config from db::config
cql3: Add cql_config parameter to parsed_statement::prepare()
`system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables.
This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_*` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades.
When the cluster feature is enabled, each node drops the old system large_* tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables.
Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records.
1. **keys: move key_to_str() to keys/keys.hh** — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable
2. **sstables: add LargeDataRecords metadata type (tag 13)** — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation
3. **large_data_handler: rename partition_above_threshold to above_threshold_result** — generalize the struct for reuse
4. **large_data_handler: return above_threshold_result from maybe_record_large_cells** — separate booleans for cell size vs collection elements thresholds
5. **sstables: populate LargeDataRecords from writer** — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable`
6. **test: add LargeDataRecords round-trip unit tests** — verify write/read, top-N bounding, below-threshold behavior
7. **db: call initialize_virtual_tables from shard 0 only** — preparatory refactoring to enable cross-shard coordination
8. **db: implement large_data virtual tables with feature flag gating** — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection
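The bounded min-heap idea from step 5 can be sketched as follows. This is an illustration in Python, not the SSTable writer code: the class name is hypothetical, and `limit` plays the role of `compaction_large_data_records_per_sstable`. The writer keeps one such heap per large_data_type.

```python
import heapq
from typing import List, Tuple


class TopNRecords:
    """Keep only the `limit` largest records seen, using a bounded min-heap."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._heap: List[Tuple[int, int, str]] = []  # (size, sequence, key)
        self._sequence = 0  # tie-breaker so keys are never compared

    def add(self, size: int, key: str) -> None:
        if self._limit <= 0:
            return
        entry = (size, self._sequence, key)
        self._sequence += 1
        if len(self._heap) < self._limit:
            heapq.heappush(self._heap, entry)
        elif size > self._heap[0][0]:
            # The heap root is the smallest kept record; evict it.
            heapq.heapreplace(self._heap, entry)

    def largest_first(self) -> List[Tuple[int, str]]:
        return [(size, key) for size, _, key in sorted(self._heap, reverse=True)]


top = TopNRecords(limit=2)
for size, key in [(5, "a"), (9, "b"), (1, "c"), (7, "d")]:
    top.add(size, key)
print(top.largest_first())
```

Memory stays bounded by `limit` no matter how many large records the writer sees, and each insertion costs O(log limit).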
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276
* Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport
Closes scylladb/scylladb#29257
* github.com:scylladb/scylladb:
db: implement large_data virtual tables with feature flag gating
db: call initialize_virtual_tables from shard 0 only
test: add LargeDataRecords round-trip unit tests
sstables: populate LargeDataRecords from writer
large_data_handler: return above_threshold_result from maybe_record_large_cells
large_data_handler: rename partition_above_threshold to above_threshold_result
sstables: add LargeDataRecords metadata type (tag 13)
sstables: add fmt::formatter for large_data_type
keys: move key_to_str() to keys/keys.hh
The ignore_component_digest_mismatch flag is now initialized at sstable construction
time from sstables_manager::config (which is populated from db::config at boot time).
Remove the flag from sstable_open_config struct and all call sites that were setting
it explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Initialize the ignore_component_digest_mismatch flag from sstables_manager::config
in the sstable constructor initializer list instead of in load(). This ensures the
flag value is set at construction time when the manager config is available, rather
than at load time. Mark the member const to reflect its immutability after construction.
Fixes the bootstrap path which now correctly reads the flag from manager config
initialized from db::config at boot time, instead of using the default value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy the ignore_component_digest_mismatch flag from db::config to sstables_manager::config
during database initialization. This makes the flag available early in the boot process,
before SSTables are loaded, enabling later commits to move the flag initialization from
load-time to construction-time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a table has a vector index, writes to the indexed attribute
(via PutItem, UpdateItem, or BatchWriteItem) must supply a value that
is a vector of the appropriate length: It must be a list of exactly the
declared number of elements, where each element is a numeric type ("N")
representable as a 32-bit float. Before this patch, invalid values were
silently accepted and the item was simply not indexed (it was skipped
by the vector store when it read this item). Now these writes are
rejected with a ValidationException.
This is analogous to the existing validation of GSI/LSI key attribute
values - in DynamoDB after a certain attribute becomes the key of a
GSI or LSI, the user is no longer allowed to write a value of any other type to it.
The implementation we add here is also analogous to the implementation
of the GSI/LSI key validation. The GSI/LSI key validation is done
by validate_value_if_index_key / si_key_attributes, and in this
patch we add the vector-index parallels: vector_index_attributes()
collects the attribute name and declared dimensions for every vector
index in the schema, and validate_value_if_vector_index_attribute()
enforces the type limitations.
For efficiency in the common case where a table has no vector indexes
and no GSIs/LSIs, both validation functions are out-of-line and each
call site guards the call with an explicit empty() check, so no
function-call overhead is incurred when there is nothing to validate.
For UpdateItem, the map of vector index attributes is cached in
update_item_operation (alongside the existing _key_attributes cache)
to avoid recomputing it on every call to update_attribute().
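The validation rule described above can be sketched in Python over DynamoDB's JSON encoding of values. This is not the C++ implementation: the function name is hypothetical, ValueError stands in for ValidationException, and "representable as a 32-bit float" is interpreted here as "finite and within float32 range", which is an assumption.

```python
import math
import struct


def validate_vector_value(value: dict, dimensions: int) -> None:
    """Raise ValueError unless value is {"L": [{"N": "..."}, ...]} of the right length."""
    elements = value.get("L") if isinstance(value, dict) else None
    if not isinstance(elements, list) or len(elements) != dimensions:
        raise ValueError(f"expected a list of exactly {dimensions} numbers")
    for element in elements:
        if not (isinstance(element, dict) and set(element) == {"N"}):
            raise ValueError("every vector element must have type N")
        try:
            number = float(element["N"])
            if not math.isfinite(number):
                raise OverflowError
            struct.pack("<f", number)  # raises OverflowError outside float32 range
        except (TypeError, ValueError, OverflowError):
            raise ValueError(f"{element['N']!r} does not fit in a 32-bit float") from None


validate_vector_value({"L": [{"N": "1"}, {"N": "2.5"}, {"N": "-3"}]}, dimensions=3)
print("valid vector accepted")
```

A call site would guard this with an emptiness check on the table's vector index attributes, as described above, so tables without vector indexes pay nothing.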
Add to DescribeTable's output for VectorIndexes two fields - IndexStatus
and Backfilling - which are intended to exactly mirror these two fields
that exist for GlobalSecondaryIndexes:
When a vector index is added, IndexStatus is "CREATING" before the index
is usable, and "ACTIVE" when it is finally usable for a Query. During
"CREATING" phase, "Backfilling" may be set to true when the index is
currently being backfilled (the table is scanned and an index is built).
A user is expected to call DescribeTable in a loop after creating a
vector index (via either CreateTable or UpdateTable) and only call
Query on the index after the IndexStatus is finally ACTIVE. Calling
Query earlier, while IndexStatus is still CREATING, will result in an
error.
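The polling loop a user is expected to write might look like the sketch below. The function name and the exact response layout (a "VectorIndexes" list under "Table", with an "IndexName" in each entry) are assumptions for illustration; `describe_table` is any callable with the shape of an SDK client's DescribeTable method.

```python
import time


def wait_for_vector_index(describe_table, table_name, index_name,
                          timeout_s=60.0, poll_s=1.0, sleep=time.sleep):
    """Poll DescribeTable until the named vector index reports ACTIVE."""
    deadline = time.monotonic() + timeout_s
    while True:
        table = describe_table(TableName=table_name)["Table"]
        for index in table.get("VectorIndexes", []):
            if index.get("IndexName") == index_name and index.get("IndexStatus") == "ACTIVE":
                return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"vector index {index_name} did not become ACTIVE")
        sleep(poll_s)


# Fake DescribeTable for demonstration: CREATING twice, then ACTIVE.
calls = []

def fake_describe_table(TableName):
    calls.append(TableName)
    status = "ACTIVE" if len(calls) >= 3 else "CREATING"
    return {"Table": {"VectorIndexes": [{"IndexName": "vidx", "IndexStatus": status}]}}


wait_for_vector_index(fake_describe_table, "tbl", "vidx", poll_s=0, sleep=lambda s: None)
print(len(calls))
```

Since every DescribeTable call on such a table contacts the vector store, a real caller would likely keep the polling interval at a second or more.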
In the current implementation, Alternator does not track the state of the
vector index, so it needs to contact the vector store to inquire about
the state of the index - using a new function introduced in this patch
that uses an existing vector-store API. This makes DescribeTable slower
on tables that have vector indexes, because the vector store is contacted
on every DescribeTable call.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We introduce to the Query request a new "VectorSearch" parameter, which
takes a mandatory "QueryVector" (a value which must be a numeric vector
of the right length) and "Limit".
The "Limit" of a vector search (Query with VectorSearch) determines the
number of nearest neighbors to return, and does not allow pagination
(ExclusiveStartKey is not allowed). ConsistentRead=True is also not
allowed on a vector search query.
The "Select"/"ProjectionExpression"/"AttributesToGet" parameters are
also supported, requesting which attributes to fetch. Using Select=
ALL_PROJECTED_ATTRIBUTES means read only the attributes found in the
vector index - currently only the key columns - so it is significantly
faster than ALL_ATTRIBUTES because it doesn't require reading the items
from the base table.
The "FilterExpression" parameter is also supported. Like in DynamoDB's
traditional Query, this does post-filtering, i.e., removing some of the
results returned by the vector index that don't match the filter, and
as a result fewer than Limit results may be returned.
Pre-filtering (done on the vector store, and always returns Limit
results) is not yet implemented.
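As a rough illustration of the request rules above, the sketch below
builds a Query request dict and rejects the options that a vector search
does not allow. The helper name is hypothetical, "Limit" is assumed to
live inside "VectorSearch", and the vector is written as plain Python
numbers rather than in the typed wire encoding.

```python
def build_vector_query(table_name, index_name, query_vector, limit, **extra):
    """Build a Query request dict for a vector search.

    The exact request layout is an assumption of this sketch; only the
    restrictions themselves come from the description above.
    """
    is_number = lambda x: isinstance(x, (int, float)) and not isinstance(x, bool)
    if not query_vector or not all(is_number(x) for x in query_vector):
        raise ValueError("QueryVector must be a non-empty numeric vector")
    if "ExclusiveStartKey" in extra:
        raise ValueError("pagination is not allowed on a vector search")
    if extra.get("ConsistentRead") is True:
        raise ValueError("ConsistentRead=True is not allowed on a vector search")
    request = {
        "TableName": table_name,
        "IndexName": index_name,
        "VectorSearch": {"QueryVector": list(query_vector), "Limit": limit},
    }
    # Select, ProjectionExpression, FilterExpression, etc. pass through as-is.
    request.update(extra)
    return request
```

Because FilterExpression is applied after the vector index returns its
Limit nearest neighbors, a caller should be prepared to receive fewer
than Limit items.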
In commit a55c5e9ec7, the function
describe_multi_item() got a new item_callback parameter, that can
be used to calculate the size of the item. This new parameter
has a default, an empty noncopyable_function. But an empty
noncopyable_function shouldn't be called - exactly like std::function,
it throws std::bad_function_call if called when empty.
So describe_multi_item() should only call this item_callback if
it's not empty.
This became a problem in the next patch, implementing vector search
query, which called describe_multi_item with the default item_callback.
But in general, the function should be usable with the default parameter
(or we shouldn't have defined a default value for this parameter!).
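The same pattern, transposed to Python purely for illustration (the real
code is C++ and uses noncopyable_function; the function below is a
hypothetical stand-in, not describe_multi_item() itself):

```python
def describe_items(items, item_callback=None):
    """Describe each item, invoking item_callback only if one was supplied."""
    described = []
    for item in items:
        # Guard the call: the default means "no callback", and calling
        # an empty callback is an error.
        if item_callback is not None:
            item_callback(item)
        described.append(dict(item))
    return described
```

With the guard in place, callers that rely on the default parameter work
the same way as callers that pass a real callback.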
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All the "indexes" we implement in Alternator - GSI, LSI and the new
vector index - share the same IndexName namespace, which we'll use in
Query to refer to the index. In the previous patch we already prevented
adding a vector index with the same name as an existing GSI or LSI.
In this patch we also prevent the reverse - adding a GSI with the name
of an existing vector index.
Additionally, one cannot add a GSI on a key that is already the key of
a vector index: The types conflict: The key of a vector index must be a
vector column, while the key of a GSI must have a standard key type
(string, binary or number).
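The two checks can be sketched as follows. This is a simplified model,
not the Alternator code: the function name and the dict representation
of vector indexes (index name mapped to its vector attribute) are
hypothetical.

```python
def check_new_gsi(name, key_attr, gsi_names, lsi_names, vector_indexes):
    """Raise ValueError if a new GSI would clash with an existing index.

    vector_indexes maps each vector index name to the attribute it
    indexes (a representation chosen only for this sketch).
    """
    # GSI, LSI and vector indexes share one IndexName namespace.
    taken = set(gsi_names) | set(lsi_names) | set(vector_indexes)
    if name in taken:
        raise ValueError(f"index name {name!r} is already in use")
    # A vector column cannot also serve as a GSI key: the types conflict.
    if key_attr in vector_indexes.values():
        raise ValueError(
            f"attribute {key_attr!r} is already the key of a vector index")
```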
We have tests for this later, in the big test patch.
After an earlier patch allowed CreateTable to create vector indexes
together with a table, in this patch we add to UpdateTable the ability
to add a new vector index to an existing table, as well as the ability
to delete a vector index from an existing table.
The implementation is inspired by DynamoDB's syntax for GSI - just like
GSI has GlobalSecondaryIndexUpdates with "Create" and "Delete" operations,
for vector indexes we have VectorIndexUpdates supporting Create and
Delete. "Update" is not yet supported - we have not yet implemented any
parameter that can be updated - but we can easily implement it in the
future.
ScyllaDB supports the "vector search" feature in CQL.
In this patch we start the path to adding vector search support also to
Alternator.
In this patch, we implement CreateTable support - allowing the user to enable
vector search in a new table. The following patches will enable additional
operations like UpdateTable (adding a vector index to an existing table or
deleting a vector index from an existing table) and DescribeTable.
Extensive tests for all these features will come at the end of the series.
Those tests were written in parallel with writing this implementation so they
cover (hopefully) every nook and cranny of the implementation.
Alternator has a function validate_attr_name_length() used to validate an
attribute name passed in different operations like PutItem, UpdateItem,
GetItem, etc. It fails the request if the attribute name is longer than
65535 characters.
It turns out that we forgot to check if the attribute name length isn’t 0 -
which should be forbidden as well!
This patch fixes the validation code, and also adds a test that confirms
that after this patch empty attribute names are rejected - just like DynamoDB
does - whereas before this patch they were silently accepted.
We want to fix this issue now, because in a later patch we intend to use
the same validation function also for vector indexes - and want it to be
accurate.
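The corrected check amounts to enforcing both bounds, roughly as in the
sketch below. It is a Python model of the C++ function, with the length
measured in characters as in the description above; the exception type
is an arbitrary choice for the example.

```python
MAX_ATTR_NAME_LENGTH = 65535


def validate_attr_name_length(name):
    """Reject attribute names that are empty or longer than the maximum."""
    if len(name) == 0:
        # This is the case that was previously accepted silently.
        raise ValueError("attribute name must not be empty")
    if len(name) > MAX_ATTR_NAME_LENGTH:
        raise ValueError(
            f"attribute name is longer than {MAX_ATTR_NAME_LENGTH}")
```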
Fixes SCYLLADB-1069.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The vector-search feature, which is already supported in CQL, introduced
the somewhat confusing feature of enabling CDC without explicitly enabling
CDC: When a vector index is enabled on a table, CDC is "enabled" for it
even if the user didn't ask to enable CDC.
For this, some code in cdc/log.cc began to use cdc_enabled() instead of
checking schema.cdc_options.enabled() directly. This cdc_enabled()
function checks if either this enabled() is true, or has_vector_index()
is true.
But there's another twist to this story: To write with CDC, we also need
to create the CDC log table:
1. In CQL, a vector index can only be added on an existing table (with
CREATE INDEX), so the hook on_before_update_column_family() is the
one that noticed that a vector index was added, and created the CDC
log table.
2. But in Alternator, a vector index can be created up-front with a
brand-new table (in CreateTable), so the hook for a new table -
on_pre_create_column_families(), also needs to create the CDC log
table. It already did, but incorrectly checked just the explicit
CDC-enabled flag instead of the new cdc_enabled() function that
also allows vector index.
So this patch just fixes on_pre_create_column_families to use cdc_enabled().
Before this patch, when a vector index was created in Alternator with
CreateTable, an attempt to write to the table (PutItem) failed
because it tried to write to the CDC log, which wasn't created.
After this patch, it works. The reproducing test is
test_putitem_vectorindex_createtable (introduced in a later patch).
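A toy model of the fix, for illustration only. The Schema class and the
create_log_table callback are hypothetical stand-ins for the C++ schema
and the CDC log-table creation code; only the predicate mirrors the
description above.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    name: str
    cdc_explicitly_enabled: bool = False
    has_vector_index: bool = False


def cdc_enabled(schema):
    """CDC is on if explicitly enabled or implied by a vector index."""
    return schema.cdc_explicitly_enabled or schema.has_vector_index


def on_pre_create_column_families(schemas, create_log_table):
    """Create a CDC log table for every new table that needs one."""
    for schema in schemas:
        # Before the fix this tested only schema.cdc_explicitly_enabled,
        # so a table created together with a vector index got no log table.
        if cdc_enabled(schema):
            create_log_table(schema)
```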
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Pin all external GitHub Actions to full commit SHAs and upgrade to
their latest major versions to reduce supply chain attack surface:
- actions/checkout: v3/v4/v5 -> v6.0.2
- actions/github-script: v7 -> v8.0.0
- actions/setup-python: v5 -> v6.2.0
- actions/upload-artifact: v4 -> v7.0.0
- astral-sh/setup-uv: v6 -> v8.0.0
- mheap/github-action-required-labels: v5.5.2 (pinned)
- redhat-plumbers-in-action/differential-shellcheck: v5.5.6 (pinned)
- codespell-project/actions-codespell: v2.2 (pinned, was @master)
Set FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true in all 21 workflows that
use JavaScript-based actions to opt into the Node.js 24 runtime now.
This resolves the deprecation warning:
"Node.js 20 actions are deprecated. Please check if updated versions
of these actions are available that support Node.js 24. Actions will
be forced to run with Node.js 24 by default starting June 2nd,
2026. Node.js 20 will be removed from the runner on September 16th,
2026."
See: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
scylladb/github-automation references are intentionally left at @main
as they are org-internal reusable workflows.
Fixes: SCYLLADB-1410
Backport: Backport is required for live branches that run GH actions:
2026.1, 2025.4, 2025.1 and 2024.1
Closes scylladb/scylladb#29421
The PR removes the create_ks_and_cf() helper from the backup test and patches all callers to create the keyspace and table and populate them with standard explicit facilities. While patching, it turned out that one test doesn't need to populate the table, so it even becomes a tiny bit shorter and faster.
Enhancing test, not backporting
Closes scylladb/scylladb#29417
* github.com:scylladb/scylladb:
test_backup: Remove create_ks_and_cf helper Test
test_backup: Replace create_ks_and_cf with async patterns Test
test_backup: Add if-True blocks for indentation Test
The migration tests used to start nodes with pre-computed power-of-two
tokens. This was required because the migration itself only supported
power-of-two aligned tokens. Now that arbitrary tokens are supported,
switch the tests to use Scylla's default random token assignment.
Switching to arbitrary tokens makes the tests non-deterministic, but the
migration aspects that are affected by the token distribution
(resharding, wrap-around vnode split) are out of scope for these tests
and covered by dedicated tests.
Add a `get_all_vnode_tokens()` helper that queries system.topology at
runtime to discover the actual token layout, and derive expected tablet
counts from that.
Also account for the possible extra wrap-around tablet when the last
vnode token does not coincide with MAX_TOKEN.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add a Boost test to verify that `prepare_for_tablets_migration()`
produces the correct tablet boundaries when a wrap-around vnode exists.
Tablets cannot wrap around the token ring as vnodes do; the last token
of the last tablet must always be MAX_TOKEN. When the last vnode token
does not coincide with MAX_TOKEN, the wrap-around vnode must be split
into two tablets.
The test is parameterized over both cases: unaligned (split expected)
and aligned (no split expected).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
1. test.py — Removed --log-level=DEBUG flag from pytest args
2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments
+minor fix
[test/pylib: save logs on success only during teardown phase](0ede308a04)
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown) in 3 files. Restrict it to only
the teardown phase, which contains all 3 in case of test success,
to avoid redundant log entries.
Closes scylladb/scylladb#29086
* github.com:scylladb/scylladb:
test/pylib: save logs on success only during teardown phase
test: Lower default log level from DEBUG to INFO
The vnodes-to-tablets migration creates tablet maps that mirror the
vnode layout: one tablet per vnode, preserving token boundaries and
replica placement. However, due to tablet restrictions, the migration
requires vnode tokens to be a power of two and uniformly distributed
across the token ring.
In practice, this restriction is too limiting. Real clusters use
randomly generated tokens and a node's token assignment is immutable.
To solve this problem, prior work (01fb97ee78) has been done to relax
the tablet constraints by allowing arbitrary tablet boundaries, removing
the requirement for power-of-two sizing and uniform distribution.
This patch leverages the relaxed tablet constraints to enable tablet map
creation from arbitrary vnode tokens:
* Removes all token-related constraints.
* Handles wrap-around vnodes. If a vnode wraps (i.e., the highest vnode
token is not `dht::token::last()`), it is split into two tablets:
- (last_vnode_token, dht::token::last()]
- [dht::token::first(), first_vnode_token]
The migration ops guide has been updated to remove the power-of-two
constraint.
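The boundary rule above can be sketched with plain integers. This is an
illustrative model, not the migration code: the function name is
hypothetical, and FIRST_TOKEN / LAST_TOKEN are assumed stand-ins for
dht::token::first() and dht::token::last().

```python
FIRST_TOKEN = -(2**63) + 1  # assumed stand-in for dht::token::first()
LAST_TOKEN = 2**63 - 1      # assumed stand-in for dht::token::last()


def tablet_ranges_from_vnodes(vnode_tokens):
    """Return inclusive (first, last) token pairs, one per tablet."""
    tokens = sorted(set(vnode_tokens))
    if not tokens:
        raise ValueError("at least one vnode token is required")
    ranges = []
    start = FIRST_TOKEN
    for token in tokens:
        # Each vnode ends at its own token; the next one starts right after.
        ranges.append((start, token))
        start = token + 1
    if tokens[-1] != LAST_TOKEN:
        # Second half of the split wrap-around vnode:
        # (last_vnode_token, LAST_TOKEN].
        ranges.append((start, LAST_TOKEN))
    return ranges
```

With N distinct vnode tokens this yields N tablets when the highest
token equals LAST_TOKEN, and N + 1 tablets otherwise.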
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`prepare_for_tablets_migration()` is idempotent; it filters out tables
that already have tablet maps and returns early if no tablet maps need
to be created. However, this precondition is currently misplaced. Move
it higher to skip extra work.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
When initializing streaming sources in tablet_stream_files_handler we
use a reference to the table. We should hold the table while doing so,
because otherwise the table may be dropped and destroyed when we yield.
Use the table.stream_in_progress() phaser to hold the table while we
access it.
For sstable file streaming we can release the table after the snapshot
is initialized, and the table may be dropped safely because the files
are held by the snapshot and we don't access the table anymore. There
was a single access to the table for logging but it is replaced by a
pre-calculated variable.
For logstor segment streaming, currently it doesn't support discarding
the segments while they are streamed - when the table is dropped it
discards the segments by overwriting and freeing them, so they shouldn't
be accessed after that. Therefore, in that case continue to hold the
table until streaming is completed.
Fixes [SCYLLADB-1533](https://scylladb.atlassian.net/browse/SCYLLADB-1533)
It's a pre-existing use-after-free issue in sstable file streaming so should be backported to all releases.
It's also made worse with the recent changes of logstor, and affects also non-logstor tables, so the logstor fixes should be in the same release (2026.2).
[SCYLLADB-1533]: https://scylladb.atlassian.net/browse/SCYLLADB-1533?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Closes scylladb/scylladb#29488
* github.com:scylladb/scylladb:
test: test drop table during streaming
streaming: stream_blob: hold table for streaming
The GCS fixture's fake-gcs-server container was named "local-kms",
copy-pasted from the AWS KMS fixture. It happened when both were
refactored to use the shared start_docker_service helper (bc544eb08e).
Rename to "fake-gcs-server" to match the Python-side naming and avoid
confusion in logs.
Refs SCYLLADB-1542
The container name is generated once before the retry loop, so
all retry attempts reuse the same name. Move the name generation
inside the loop so each attempt gets a fresh name via the
incrementing counter, consistent with the comment "publish port
ephemeral, allows parallel instances".
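The shape of the fix, shown as a small Python model (the real code is
C++ in test/lib/proc_utils.cc; the names and the exception type below
are hypothetical):

```python
import itertools

_name_counter = itertools.count(1)


def start_with_retries(start_container, base_name, max_attempts=3):
    """Start a container, using a fresh name for every attempt."""
    last_error = None
    for _ in range(max_attempts):
        # Generated inside the loop, so a retry never reuses a name that
        # a previous, failed attempt may still be holding.
        name = f"{base_name}-{next(_name_counter)}"
        try:
            return start_container(name)
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error
```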
Formatting changes (indentation) of lines 208-225 in test/lib/proc_utils.cc
will be fixed in the next commit.
Refs SCYLLADB-1542
Previously, config_updater used a serialized_action to trigger update_config() when object_storage_endpoints changed. Because serialized_action::trigger() always schedules the action as a new reactor task (via semaphore::wait().then()), there was a window between the config value becoming visible to the REST API and update_config() actually running. This allowed a concurrent CREATE KEYSPACE to see the new endpoint via is_known_endpoint() before storage_manager had registered it in _object_storage_endpoints.
Now config observers run synchronously in a reactor turn and must not suspend. Split the previous monolithic async update_config() coroutine into two phases:
- Sync (in the observer, never suspends): storage_manager::_object_storage_endpoints is updated in place; for already-instantiated clients, update_config_sync swaps the new config atomically
- Async (per-client gate): background fibers finish the work that can't run in the observer — S3 refreshes credentials under _creds_sem; GCS drains and closes the replaced client.
Config reloads triggered by SIGHUP are applied on shard 0 and then broadcast to all other shards. An rwlock has also been introduced to make sure that the configuration has been propagated to all cores. This guarantees that a client requesting a config via the REST API will see a consistent snapshot.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-757
Fixes: [28141](https://github.com/scylladb/scylladb/issues/28141)
Closes scylladb/scylladb#28950
* github.com:scylladb/scylladb:
test/object_store: verify object storage client creation and live reconfiguration
sstables/utils/s3: split config update into sync and async parts
test_config: improve logging for wait_for_config API
db: introduce read-write lock to synchronize config updates with REST API
Split `log_record` into a `log_record_header` type that holds the record
metadata fields, and the mutation as a separate field which is the actual
record data:
    struct log_record {
        log_record_header header;
        canonical_mutation mut;
    };
Both the header and mutation have variable serialized size. When a
record is serialized in a write_buffer, we first put a small
`record_header` that has the header size and data size, then the
serialized header and data follow. The `log_location` of a record points
to the beginning of the `record_header`, and the size includes the
`record_header`.
This allows us to read a record header without reading the data when
it's not needed and avoid deserializing it:
* on recovery, when scanning all segments, we read only the record
headers.
* on compaction, we read the record header first to determine if the
record is alive, if yes then we read the data.
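A byte-level sketch of this layout, for illustration. The field widths
(two little-endian 32-bit sizes) and all function names are assumptions
of the example, not the actual logstor on-disk format.

```python
import struct

# Small fixed-size prefix: header_size, data_size (widths are assumed).
RECORD_HEADER = struct.Struct("<II")


def serialize_record(header, data):
    """Lay out: record_header | serialized header | serialized data."""
    return RECORD_HEADER.pack(len(header), len(data)) + header + data


def read_header(buf, location):
    """Read only the serialized header of the record at `location`."""
    header_size, _data_size = RECORD_HEADER.unpack_from(buf, location)
    start = location + RECORD_HEADER.size
    return buf[start:start + header_size]


def read_data(buf, location):
    """Read the record data, skipping over the serialized header."""
    header_size, data_size = RECORD_HEADER.unpack_from(buf, location)
    start = location + RECORD_HEADER.size + header_size
    return buf[start:start + data_size]
```

A recovery-style scan would call only read_header() for each record,
while compaction would call read_data() only for records whose header
shows they are still alive.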
Closes scylladb/scylladb#29457
The commitlog replayer groups segments by shard using a
std::unordered_multimap, then iterates per-shard segments via
equal_range(). However, equal_range() does not guarantee iteration
order for elements with the same key, so segments could be replayed
out of order within a shard.
Correct segment ordering is required for:
- Fragmented entry reconstruction, which accumulates fragments across
segments and depends on ascending order for efficient processing.
- Commitlog-based storage used by the strongly consistent tables
feature, which relies on replayed raft items being stored in order.
Fix by changing the data structure from
std::unordered_multimap<unsigned, commitlog::descriptor>
to
std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>>
Since the descriptors are inserted from a std::set ordered by ID, the
vector preserves insertion (and thus ID) order. The per-shard iteration
now simply iterates the vector, guaranteeing correct replay order.
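The grouping idea, modeled in Python for illustration (the actual code
uses std::unordered_map and utils::chunked_vector; the function name and
the (segment_id, shard) tuple representation are hypothetical):

```python
from collections import defaultdict


def group_segments_by_shard(descriptors):
    """Group (segment_id, shard) pairs into per-shard lists.

    Iterating in ascending ID order, as when walking the ordered set,
    means each per-shard list is appended to in ascending ID order.
    """
    by_shard = defaultdict(list)
    for segment_id, shard in sorted(descriptors):
        by_shard[shard].append(segment_id)
    return dict(by_shard)
```

Replay then walks each shard's list front to back, so the order no
longer depends on how a hash container arranges equal keys.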
Fixes: SCYLLADB-1411
Backport: It looks like this issue doesn't cause any trouble, and the fix is required only by the strongly consistent tables feature, so no backporting is required.
Closes scylladb/scylladb#29372
* github.com:scylladb/scylladb:
commitlog: add test to verify segment replay order
commitlog: fix replay order by using ordered map per shard
Move enable_create_table_with_compact_storage option from db::config to
cql_config. This improves separation of concerns by consolidating CQL-specific
table creation policies in the cql_config structure. Update the CREATE TABLE
statement prepare() function to use the new location for the configuration check.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move strict_is_not_null_in_views option from db::config to cql_config via
new view_restrictions sub-struct. This improves separation of concerns by
keeping view-specific validation policies with other CQL configuration.
Update prepare_view() to take view_restrictions reference instead of reaching
into db::config, and update all callsites to pass the sub-struct.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move restrict_future_timestamp option from db::config to cql_config. This
improves separation of concerns as timestamp validation is part of CQL query
execution behavior. Update validate_timestamp() function signature to take
cql_config reference instead of db::config, and update all callsites in
modification_statement and batch_statement to pass cql_config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move twcs_max_window_count and restrict_twcs_without_default_ttl options
from db::config to cql_config via new twcs_restrictions sub-struct. This
improves separation of concerns by keeping TWCS-specific validation policies
with other CQL configuration. Update check_restricted_table_properties()
to remove unused db parameter and take twcs_restrictions reference instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce replication_restrictions, a sub-struct of cql_config, to hold
the seven keyspace-level policy options that govern how CREATE/ALTER
KEYSPACE statements are validated:
- restrict_replication_simplestrategy
- replication_strategy_warn_list / replication_strategy_fail_list
- minimum/maximum_replication_factor_warn/fail_threshold
Pass replication_restrictions into check_against_restricted_replication_strategies()
instead of having it reach into db::config directly (via both
qp.db().get_config() and qp.proxy().data_dictionary().get_config()).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the physical system.large_partitions, system.large_rows, and
system.large_cells CQL tables with virtual tables that read from
LargeDataRecords stored in SSTable scylla metadata (tag 13).
The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster
feature flag:
- Before the feature is enabled: the old physical tables remain in
all_tables(), CQL writes are active, no virtual tables are registered.
This ensures safe rollback during rolling upgrades.
- After the feature is enabled: old physical tables are dropped from
disk via legacy_drop_table_on_all_shards(), virtual tables are
registered on all shards, and CQL writes are skipped via
skip_cql_writes() in cql_table_large_data_handler.
Key implementation details:
- Three virtual table classes (large_partitions_virtual_table,
large_rows_virtual_table, large_cells_virtual_table) extend
streaming_virtual_table with cross-shard record collection.
- generate_legacy_id() gains a version parameter; virtual tables
use version 1 to get different UUIDs than the old physical tables.
- compaction_time is derived from SSTable generation UUID at display
time via UUID_gen::unix_timestamp().
- Legacy SSTables without LargeDataRecords emit synthetic summary
rows based on above_threshold > 0 in LargeDataStats.
- The activation logic uses two paths: when the feature is already
enabled (test env, restart), it runs as a coroutine; when not yet
enabled, it registers a when_enabled callback that runs inside
seastar::async from feature_service::enable().
- sstable_3_x_test updated to use a simplified large_data_test_handler
and validate LargeDataRecords in SSTable metadata directly.
Move the smp::invoke_on_all dispatch from the callers into
initialize_virtual_tables() itself, so the function is called
once from shard 0 and internally distributes the per-shard
virtual table setup to all shards.
This simplifies the callers and allows a single place to add
cross-shard coordination logic (e.g. feature-gated table
registration) in future commits.
Add three new test cases to sstable_3_x_test.cc that verify the
LargeDataRecords metadata written by the SSTable writer can be read
back after open_data():
- test_large_data_records_round_trip: verifies partition_size, row_size,
and cell_size records are written with correct field semantics when
thresholds are exceeded
- test_large_data_records_top_n_bounded: verifies the bounded min-heap
keeps only the top-N largest entries per type
- test_large_data_records_none_when_below_threshold: verifies no records
are written when data is below all thresholds
Also wire large_data_records_per_sstable from db_config into the test
env's sstables_manager::config so that config changes propagate through
the updateable_value chain to configure_writer().
During compaction (SSTable writing), maintain bounded min-heaps (one per
large_data_type) that collect the top-N above-threshold records. On
stream end, drain all five heaps into a single LargeDataRecords array
and write it into the SSTable's scylla metadata component.
Five separate heaps are used:
- partition_size, row_size, cell_size: ordered by value (size bytes)
- rows_in_partition, elements_in_collection: ordered by elements_count
A new config option 'compaction_large_data_records_per_sstable' (default
10) controls the maximum number of records kept per type.
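A bounded min-heap of this kind can be sketched with Python's heapq.
This is a generic illustration of the technique, not the SSTable writer
code; the class name is hypothetical, and the caller is assumed to add
only records that already exceed the relevant threshold.

```python
import heapq
import itertools


class BoundedTopN:
    """Keep the N largest (value, record) pairs seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                 # min-heap: smallest kept value on top
        self._seq = itertools.count()   # tie-breaker, records never compared

    def add(self, value, record):
        if self.capacity <= 0:
            return
        entry = (value, next(self._seq), record)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif value > self._heap[0][0]:
            # Evict the smallest kept entry in favor of the larger one.
            heapq.heapreplace(self._heap, entry)

    def drain(self):
        """Return kept entries, largest first, and empty the heap."""
        out = [(v, r) for v, _, r in sorted(self._heap, reverse=True)]
        self._heap.clear()
        return out
```

One such heap per record type, all drained at stream end, gives a single
combined list with at most N entries per type.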
Change maybe_record_large_cells to return above_threshold_result with
separate booleans for cell size (.size) and collection elements
(.elements) thresholds. This allows the writer to track above_threshold
counts for cell_size and elements_in_collection independently.
Rename partition_above_threshold to above_threshold_result and its
'rows' field to 'elements', making it a generic struct that can be
reused for other large data types (e.g., cells with collection
elements).
Use designated initializers for clarity.
Add a new scylla metadata component LargeDataRecords (tag 13) that
stores per-SSTable top-N large data records. Each record carries:
- large_data_type (partition_size, row_size, cell_size, etc.)
- binary serialized partition key and clustering key
- column name (for cell records)
- value (size in bytes)
- element count (rows or collection elements, type-dependent)
- range tombstones and dead rows (partition records only)
The struct uses disk_string<uint32_t> for key/name fields and is
serialized via the existing describe_type framework into the SSTable
Scylla metadata component.
Add JSON support in scylla-sstable and format documentation.
Add a fmt::formatter specialization for sstables::large_data_type and
use it in scylla-sstable.cc instead of the local to_string() overload,
which is removed.
Move the key_to_str() template function from a file-local static in
db/large_data_handler.cc to keys/keys.hh so it can be reused by:
- large_data_handler.cc for log messages
- virtual tables (db/virtual_tables.cc) for converting binary keys
to human-readable CQL display
- scylla-sstable for JSON output of LargeDataRecords
No functional change.
The batch_size_fail_threshold_in_kb option controls the batch size at
which an oversized batch error is returned to the client. It belongs in
cql_config rather than db::config as it directly governs CQL batch
statement behavior.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The batch_size_warn_threshold_in_kb option controls the batch size at
which a client warning is emitted during batch execution. It belongs in
cql_config rather than db::config as it directly governs CQL batch
statement behavior.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The enable_parallelized_aggregation option controls whether aggregation
queries are fanned out across shards for parallel execution. It belongs
in cql_config rather than db::config as it directly governs CQL query
behavior at prepare time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The strict_allow_filtering option controls whether queries that require
ALLOW FILTERING are silently accepted, warned about, or rejected. It
belongs in cql_config rather than db::config as it directly governs CQL
query behavior at prepare time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The select_internal_page_size option controls CQL query execution
behavior (internal paging for aggregate/filtered SELECTs) and belongs
in cql_config rather than being read directly from db::config at
execution time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test environment was creating cql_config with hardcoded default values that
were never updated when system.config was modified via CQL. This broke tests
that dynamically change configuration values (e.g., TWCS tests).
Fix by creating cql_config from db::config using sharded_parameter, which
ensures updateable_value fields track the actual db::config sources and reflect
changes made during test execution.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Pass cql_config to prepare() so that statement preparation can use
CQL-specific configuration rather than reaching into db::config
directly.
Callers that use default_cql_config:
- db/view/view.cc: builds a SELECT statement internally to compute view
restrictions, not in response to a user query
- cql3/statements/create_view_statement.cc: same -- parses the view's
WHERE clause as a synthetic SELECT to extract restrictions
- tools/schema_loader.cc: offline schema loading tool, no runtime
config available
- tools/scylla-sstable.cc: offline sstable inspection tool, no runtime
config available
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Several tests in test_select_from_mutation_fragments.py assume that all
mutations end up in a single SSTable. This assumption can be violated
by background memtable flushes triggered by commitlog disk pressure.
Since the Scylla node is taken from a pool, it may carry unflushed data
from prior tests that prevents closed segments from being recycled,
thereby increasing the commitlog disk usage. A main source of such
pressure is keyspace-level flushes from earlier tests in this module,
which rotate commitlog segments without flushing system tables (e.g.,
`system.compaction_history`), leaving closed segments dirty.
Additionally, prior tests in the same module may have left unflushed
data on the shared test table (`test_table` fixture), keeping commitlog
segments dirty on its behalf as well. When commitlog disk usage exceeds
its threshold, the system flushes the test table to reclaim those
segments, potentially splitting a running test's mutations across
multiple SSTables.
This was observed in CI, where test_paging failed because its data was
split across two SSTables, resulting in more mutation fragments than the
hardcoded expected count.
This patch fixes the affected tests in two ways:
1. Where possible, tests are reworked to not assume a single SSTable:
- test_paging
- test_slicing_rows
- test_many_partition_scan
2. Where rework is impractical, major compaction is added after writes
and before validation to ensure that only one SSTable will exist:
- test_smoke
- test_count
- test_metadata_and_value
- test_slicing_range_tombstone_changes
Fixes SCYLLADB-1375.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closes scylladb/scylladb#29389
Add a test that drops a table while tablet streaming is running for the
table. The table is dropped after taking the storage snapshot and
initializing streaming sources - after that streaming should be able
to complete or abort correctly if the table is dropped. We want to
verify there is no incorrect access to the destroyed table.
The test tests both types of streaming in stream_blob - sstables
and logstor segments.
When initializing streaming sources in tablet_stream_files_handler we
use a reference to the table. We should hold the table while doing so,
because otherwise the table may be dropped and destroyed when we yield.
Use the table.stream_in_progress() phaser to hold the table while we
access it.
For sstable file streaming we can release the table after the snapshot
is initialized, and the table may be dropped safely because the files
are held by the snapshot and we don't access the table anymore. There
was a single access to the table for logging but it is replaced by a
pre-calculated variable.
For logstor segment streaming, currently it doesn't support discarding
the segments while they are streamed - when the table is dropped it
discards the segments by overwriting and freeing them, so they shouldn't
be accessed after that. Therefore, in that case continue to hold the
table until streaming is completed.
Fixes SCYLLADB-1533
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any token, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
We can also split and merge individual tablets incrementally.
Currently, we do it for the whole table or nothing, which makes
splits and merges take longer and cause wide swings of the count.
This is not implemented in this PR yet, we still split/merge the whole table.
Another reason is vnode to tablets migration. We now could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from CQL-coordinator point of view.
Tablet count is still a power-of-two by default for newly created tables.
It may be different if tablet map is created by non-standard means,
or if per-table tablet option "pow2_count" is set to "false".
build/release/scylla perf-tablets:
Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%)
Before:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 57456 KiB
Copied in 0.014346 [ms]
Cleared in 0.002698 [ms]
Saved in 1234.685303 [ms]
Read in 445.577881 [ms]
Read mutations in 299.596313 [ms] 128 mutations
Read required hosts in 247.482742 [ms]
Size of canonical mutations: 33.945053 [MiB]
Disk space used by system.tablets: 1.456761 [MiB]
Tablet metadata reload:
full 407.69ms
partial 2.65ms
```
After:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 59504 KiB
Copied in 0.032475 [ms]
Cleared in 0.002965 [ms]
Saved in 1093.877441 [ms]
Read in 387.027100 [ms]
Read mutations in 255.752121 [ms] 128 mutations
Read required hosts in 211.202805 [ms]
Size of canonical mutations: 33.954453 [MiB]
Disk space used by system.tablets: 1.450162 [MiB]
Tablet metadata reload:
full 354.50ms
partial 2.19ms
```
Closes scylladb/scylladb#28459
* github.com:scylladb/scylladb:
test: boost: tablets: Add test for merge with arbitrary tablet count
tablets, database: Advertise 'arbitrary' layout in snapshot manifest
tablets: Introduce pow2_count per-table tablet option
tablets: Prepare for non-power-of-two tablet count
tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
tablets: Prepare resize_decision to hold data in decisions
tablets: table: Make storage_group handle arbitrary merge boundaries
tablets: Make stats update post-merge work with arbitrary merge boundaries
locator: tablets: Support arbitrary tablet boundaries
locator: tablets: Introduce tablet_map::get_split_token()
dht: Introduce get_uniform_tokens()
Commit c17c4806a1 removed check_for_endpoint_collision() from
the fresh bootstrap path, which was the only code path that
called do_shadow_round() for new nodes. Since the gossip shadow
round is no longer executed during bootstrap, remove the
stop_during_gossip_shadow_round error injection from the test.
The entry is marked as REMOVED_ rather than deleted to preserve
the shuffle order for seed-based test reproducibility.
The injection point in gms/gossiper.cc is also removed since it
is no longer used by any test.
Fixes: SCYLLADB-1466
Closes scylladb/scylladb#29460
The retry loop in start_docker_service passes the parse callbacks
via std::move into create_handler on each iteration. After the
first iteration, the moved-from std::function objects are empty.
All subsequent retries skip output parsing entirely and
immediately treat the service as successfully started. This
defeats the entire purpose of the retry mechanism.
Fix by passing the callbacks by copy instead of move, so the
original callbacks remain valid across retries.
Fixes SCYLLADB-1542
Use async tablet repair task flow to avoid a race where client timeout
returns while server-side repair continues after injections are
disabled.
Start repair with await_completion=false, assert it does not complete
within timeout under injection, abort/wait the task, then verify
sstables_repaired_at is unchanged.
Fixes SCYLLADB-1184
Closes scylladb/scylladb#29452
Config observers run synchronously in a reactor turn and must not
suspend. Split the previous monolithic async update_config() coroutine
into two phases:
Sync (runs in the observer, never suspends):
- S3: atomically swap _cfg (lw_shared_ptr) and set a credentials
refresh flag.
- GCS: install a freshly constructed client; stash the old one for
async cleanup.
- storage_manager: update _object_storage_endpoints and fire the
async cleanup via a gate-guarded background fiber.
Async (gate-guarded background fiber):
- S3: acquire _creds_sem, invalidate and rearm credentials only if
the refresh flag is set.
- GCS: drain and close stashed old clients.
Config is reloaded from SIGHUP on shard 0 and broadcast to all shards
under a write lock. REST API callers reading find_config_id acquire a
read lock via value_as_json_string_for_name() and are guaranteed a
consistent snapshot even when a reload is in progress.
Currently cancellation is logged in get_next_task, but the function is
called by tablets code as well where we do not act upon its result, only
yield to the topology coordinator. But the topology coordinator will not
necessarily perform the cancellation either, since it can be busy with tablet
migration. As a result cancellation is logged but not done, which is
confusing. Fix it by logging cancellation when it actually happens.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1409
Closes scylladb/scylladb#29471
This is an attempt (mostly suggested and implemented by AI, but with a few hours of human babysitting...), to somewhat reduce compilation time by picking one template, named_value<T>, which is used in more than a hundred source files through the config.hh header, and making it use external instantiation: The different methods of named_value<T> for various T are instantiated only once (in config.cc), and the individual translation units don't need to compile them a hundred times.
The resulting saving is a little underwhelming: The total object-file size goes down about 1% (from 346,200 before the patch to 343,488 after the patch), and previous experience shows that this object-file size is proportional to the compilation time, most of which involves code generation. But I haven't been able to measure speedup of the build itself.
1% is not nothing, but not a huge saving either. Though arguably, with 50 more of these patches, we can make the build twice as fast :-)
Refs #1.
Closes scylladb/scylladb#28992
* github.com:scylladb/scylladb:
config: move named_value<T> method bodies out-of-line
config: suppress named_value<T> instantiation in every source file
Tests that call grep_for_errors() directly and assert no errors
can fail spuriously due to benign RPC errors during graceful
shutdown (e.g. "connection dropped: Semaphore broken"), which
are already filtered by the after_test hook via filter_errors().
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464
Backport: no, tests fix (we may decide to backport later if it occurs on release branches)
Closes scylladb/scylladb#29463
* github.com:scylladb/scylladb:
test: filter benign errors in tests that grep logs during shutdown
test: filter_errors: support list[list[str]] error groups
When a replica disconnects during a digest read (e.g., during
decommission), the speculating_read_executor now immediately fires
the pending speculative retry instead of waiting for the timer.
On DISCONNECT, the digest_read_resolver invokes an _on_disconnect
callback set by the executor. The callback cancels the speculate
timer and rearms it to clock_type::now() (lowres_clock::now() =
thread-local memory read, no syscall). The existing timer callback
fires on the next reactor poll with all its logic intact — checking
is_completed(), calling add_wait_targets(1), sending the request,
and incrementing speculative_digest_reads/speculative_data_reads.
The notification is fire-and-forget: on_error() does NOT absorb the
DISCONNECT. The existing error arithmetic in digest_read_resolver
already handles this correctly because _target_count_for_cl accounts
for the speculative target.
For never_speculating_read_executor (no spare target) and
always_speculating_read_executor (all requests sent upfront),
_on_disconnect is never set — no behavior change.
Fixes scylladb/scylladb#26307
Closes scylladb/scylladb#29428
all_sibling_tablet_replicas_colocated was using committed ti.replicas to
decide whether sibling tablets are co-located and merge can be finalized.
This caused a false non-co-located window when a co-located pair was moved
by the load balancer: as both tablets migrate together, their del_transition
commits may land in different Raft rounds. After the first commit, ti.replicas
diverge temporarily (one tablet shows the new position, the other the old),
causing all_sibling_tablet_replicas_colocated to return false. This clears
finalize_resize, allowing the load balancer to start new cascading migrations
that delay merge finalization by tens of seconds.
Fix this by using the optimistic replica view (trinfo->next when transitioning,
ti.replicas otherwise) — the same view the load balancer uses for load
accounting — so finalize_resize stays populated throughout an in-flight
migration and no spurious cascades are triggered.
Steps that lead to the problem:
1. Merge is triggered. The load balancer generates co-location migrations
for all sibling pairs that are not yet on the same shard. Some pairs
finish co-location before others.
2. Once all pairs are co-located in committed state,
all_sibling_tablet_replicas_colocated returns true and finalize_resize
is set. Meanwhile the load balancer may have already started a regular
LB migration on one co-located pair (both tablets are stable and the
load balancer is free to move them).
3. The LB migration moves both tablets together (colocated_tablets). Their
two del_transition commits land in separate Raft rounds. After the first
commit, ti.replicas[t1] = new position but ti.replicas[t2] = old position.
4. In this window, all_sibling_tablet_replicas_colocated sees the pair as
NOT co-located, clears finalize_resize, and the load balancer generates
new migrations for other tablets to rebalance the load that the pair
move created.
5. Those new migrations can take tens of seconds to stream, keeping the
coordinator in handle_tablet_migration mode and preventing
maybe_start_tablet_resize_finalization from being called. The merge
finalization is delayed until all those cascaded migrations complete.
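The difference between the committed view and the optimistic view can be sketched in Python as follows. This is a toy model, not the actual C++ code: `TabletInfo`, `Transition`, `optimistic_replicas` and `siblings_colocated` are hypothetical names, and a replica is simplified to a (host, shard) pair.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

Replica = Tuple[str, int]  # (host, shard); a simplification for this sketch


@dataclass(frozen=True)
class TabletInfo:
    replicas: FrozenSet[Replica]  # committed replica set (ti.replicas)


@dataclass(frozen=True)
class Transition:
    next: FrozenSet[Replica]  # replica set once the migration completes


def optimistic_replicas(ti: TabletInfo, trinfo: Optional[Transition]):
    """Return trinfo.next while a transition is in flight, else ti.replicas."""
    return trinfo.next if trinfo is not None else ti.replicas


def siblings_colocated(t1, tr1, t2, tr2, optimistic=True):
    """Compare sibling replica sets using either the optimistic or committed view."""
    if optimistic:
        return optimistic_replicas(t1, tr1) == optimistic_replicas(t2, tr2)
    return t1.replicas == t2.replicas


old = frozenset({("node1", 0)})
new = frozenset({("node2", 3)})

# Window from step 3: t1's del_transition has committed, t2's has not yet.
t1, tr1 = TabletInfo(new), None
t2, tr2 = TabletInfo(old), Transition(next=new)

print(siblings_colocated(t1, tr1, t2, tr2, optimistic=False))  # committed view
print(siblings_colocated(t1, tr1, t2, tr2, optimistic=True))   # optimistic view
```

In this model the committed view reports the pair as not co-located (False) during the window, while the optimistic view keeps reporting it as co-located (True).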
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-821.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1459.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes scylladb/scylladb#29465
Currently the statement returns the cluster, partitioner and snitch names by accessing the global db::config via database. As part of an effort to detach components from the global db::config, this PR tweaks the statement handler to get the cluster information from another source. The needed cluster information is currently stored in different components, but they are all under the storage_service umbrella, which seems to be a good central source of truth. Unit test included.
Cleaning components inter-dependencies, not backporting
Closes scylladb/scylladb#29429
* github.com:scylladb/scylladb:
test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation
describe_statement: Get cluster info from storage_service
storage_service: Add describe_cluster() method
query_processor: Expose storage_service accessor
Left over from primordial times, when reader_concurrency_semaphore was
a base class for extensions in the separate enterprise repository.
Also remove the now unneeded virtual marker from the destructor.
Closes scylladb/scylladb#29399
This series allows creating multiple vector indexes on the same column so users can rebuild an index without losing query availability.
The intended flow is:
1. Create a new vector index on a column that already has one.
2. Keep serving ANN queries from the old index while the new one is being built.
3. Verify the new index is ready.
4. Automatically switch to the remaining index.
5. Drop the old index.
To make that deterministic, `index_version` is changed from the base table schema version to a real creation timeuuid. When multiple vector indexes exist on the same column, ANN query planning now picks the index according to the routing implemented in Vector Store (newest serving index). This keeps queries on the old index until the new one is up and ready.
This patch also removes the create-time restriction that rejected a second vector index on the same column. Name collisions are still rejected as before.
Test coverage is updated accordingly:
- Scylla now verifies that two vector indexes can coexist on the same column.
- Cassandra/SAI behavior is still covered and is still expected to reject duplicate indexes on the same column.
Fixes: VECTOR-610
Closes scylladb/scylladb#29407
* github.com:scylladb/scylladb:
docs: document vector index metadata and duplicate handling
test/cqlpy: cover vector index duplicate creation rules
vector_index: allow multiple named indexes on one column
vector_index: store `index_version` as creation timeuuid
Commit 234f905 (sstables: scylla_metadata: add schema member) added a
new Schema subcomponent (tag 11) to scylla_metadata. Document it in the
sstable Scylla format reference:
- Add schema to the subcomponent grammar enumeration
- Add a summary entry describing the subcomponent (tag 11) and its purpose
- Add a detailed ## schema subcomponent section with the binary grammar,
covering table_id, table_schema_version, keyspace_name, table_name and
the column_description array (column_kind, column_name, column_type)
Fixes https://github.com/scylladb/scylladb/issues/27960
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes scylladb/scylladb#28983
An RF change of a tablet keyspace starts tablet rebuilds. Even if any of the rebuilds is rolled back (because the pending replica was excluded), the RF change request finishes successfully. In this case we end up with a replica state that isn't compatible with the expected keyspace replication.
Modify the topology coordinator so that, when it would otherwise be idle, it checks whether any replicas are missing. If so, it moves to transition_state::tablet_migration and runs the required rebuilds.
If a new RF change request encounters an invalid replica state, it fails. The state will be fixed later, after which the analogous ALTER KEYSPACE statement will be allowed.
Fixes: SCYLLADB-109.
Requires backport to all versions with tablet keyspace rf change.
Closes scylladb/scylladb#28709
* github.com:scylladb/scylladb:
test: add test_failed_tablet_rebuild_is_retried_on_alter
test: add a test to ensure that failed rebuilds are retried
service: fail ALTER KEYSPACE if replicas do not satisfy the replication
service: retry failed tablet rebuilds
service: maybe_start_tablet_migration returns std::optional<group0_guard>
Related scylladb/scylladb-docs-homepage#153.
make multiversion failed under Sphinx 8+ with:
```
sphinx-build: error: argument --tag/-t: expected one argument
subprocess.CalledProcessError: Command '(..., '-m', 'sphinx', '-t', '-D', 'smv_metadata_path=...', ..., 'manual')' returned non-zero exit status 2.
make: *** [multiversion] Error 1
```
sphinx-multiversion's arg forwarding splits `-t manual`, sending `-t` into the options slot and `manual` to the trailing FILENAMES positional.
Sphinx 7 silently tolerated the dangling `-t`; Sphinx 8+'s stricter
argparse CLI rejects it. Instead, it now reads FLAGS from an env variable.
How to test:
````
make multiversion
make FLAG=opensource multiversion
````
Both complete and switch variants correctly.
chore: rm empty lines
Closes scylladb/scylladb#29472
Three bugs fixed in segment_manager.cc:
1. write_to_separator(): captured [&index] where index was a local
coroutine-frame reference. The future is stored in
buf.pending_updates and resolved later in flush_separator_buffer(),
by which time the enclosing coroutine frame is destroyed, making
&index a dangling pointer. This is a use-after-free that manifests
as a segfault. Fix: capture index_ptr (raw pointer by value) instead.
2. add_segment_to_compaction_group(): same dangling [&index] pattern
inside the for_each_live_record lambda during recovery. Same fix
applied.
3. write(): local 'auto loc = seg->allocate(...)' shadowed the outer
'log_location loc', causing the function to always return a
zero-initialized log_location{}. Fix: remove 'auto' so the
assignment targets the outer variable.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes scylladb/scylladb#29451
The poll loop used condition_variable::wait(timeout) to sleep between
iterations. On every normal timeout expiry, this threw a
condition_variable_timed_out exception, which incremented the C++
exception metric and triggered false alerts for support.
Replace the timed wait with a seastar::timer that broadcasts the
condition variable on expiry, combined with an untimed wait(). The
timer is cancelled automatically on scope exit when the wait is woken
early by trigger_poll() or abort.
Fixes SCYLLADB-1477
Closes scylladb/scylladb#29438
The enable_logstor configuration option is redundant with the 'logstor'
experimental feature flag. Consolidate to a single gate: use the
experimental feature to control both whether logstor is available for
table creation and whether it is initialized at database startup.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes scylladb/scylladb#29427
This metric is used to catch execution of scans which go via the row
cache, which can have a bad effect on performance.
Since f344bd0aaa, aggregate queries go
via a new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.
Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.
No backport: enhancement
Closes scylladb/scylladb#29422
* github.com:scylladb/scylladb:
cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric
test: cluster: dtest: Fix double-counting of metrics
`LDAPRoleManager` interpolated usernames directly into `ldap_url_template`,
allowing LDAP filter injection and URL structure manipulation via crafted
usernames.
This PR adds two layers of encoding when substituting `{USER}`:
1. **RFC 4515 filter escaping** — neutralises `*`, `(`, `)`, `\`, NUL
2. **URL percent-encoding** — prevents `%`, `?`, `#` from breaking
`ldap_url_parse`'s component splitting or undoing the filter escaping
It also adds `validate_query_template()` at startup to reject templates
that place `{USER}` outside the filter component (e.g. in the host or
base DN), where filter escaping would be the wrong defense.
Fixes: SCYLLADB-1309
Compatibility note:
Templates with `{USER}` in the host, base DN, attributes, or extensions
were previously silently accepted. They are now rejected at startup with
a descriptive error. Only templates with `{USER}` in the filter component
(after the third `?`) are valid.
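The two encoding layers and the template check can be illustrated with a short Python sketch. It is not the actual `LDAPRoleManager` code (which is C++); `escape_ldap_filter`, `substitute_user` and `user_only_in_filter` are hypothetical helper names, and rejecting templates that contain no `{USER}` at all is an assumption of this sketch.

```python
from urllib.parse import quote

# Characters that RFC 4515 requires to be escaped inside a filter value.
_FILTER_SPECIAL = {"*", "(", ")", "\\", "\x00"}


def escape_ldap_filter(value: str) -> str:
    """Replace each special character with a backslash and its two hex digits."""
    return "".join(
        "\\%02x" % ord(ch) if ch in _FILTER_SPECIAL else ch for ch in value
    )


def substitute_user(template: str, username: str) -> str:
    """Filter-escape first, then percent-encode, then substitute {USER}."""
    encoded = quote(escape_ldap_filter(username), safe="")
    return template.replace("{USER}", encoded)


def user_only_in_filter(template: str) -> bool:
    """Accept a template only if {USER} appears solely in the filter component.

    An LDAP URL has the shape scheme://host/dn?attributes?scope?filter?extensions,
    so the filter is the fourth '?'-separated field (after the third '?').
    """
    fields = template.split("?")
    if len(fields) < 4 or "{USER}" not in fields[3]:
        return False  # assumption: a template without {USER} in the filter is rejected
    return all("{USER}" not in f for i, f in enumerate(fields) if i != 3)


template = "ldap://h/dc=x?cn?sub?(uid={USER})"
print(substitute_user(template, "*)(uid=*"))
```

With the crafted username `*)(uid=*`, every filter metacharacter is first turned into a backslash-hex escape and the backslashes and `=` are then percent-encoded, so the substituted value can no longer alter the filter or the URL structure.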
Due to its severity, this should be backported to all maintained versions.
Closes scylladb/scylladb#29388
* github.com:scylladb/scylladb:
auth: sanitize {USER} substitution in LDAP URL templates
test/ldap: add LDAP filter-injection reproducers
Add an explicit wait_for_schema_agreement() call after CREATE KEYSPACE
in create_new_test_keyspace to ensure all nodes have applied the schema
before proceeding.
Closes scylladb/scylladb#29371
The alter_table case has a known failure where point lookups at QUORUM
return 0 rows after node2 restarts, even though:
- the schema was correctly synced (ALTER TABLE received from cluster)
- the data commitlog was replayed (21 mutations, 0 skipped)
- all 3 nodes were alive, so QUORUM (2/3) should be satisfiable by
node1+node3 regardless of node2's state
The LIMIT 1 table scan succeeds (data is present somewhere), but
specific key lookups return empty. This points to a bug in how node2,
acting as coordinator after restart, routes single-partition reads —
most likely stale tablet routing metadata.
Add diagnostics to help distinguish data loss from a coordinator/routing
bug on the next failure:
- log which key is missing
- dump all rows visible at QUORUM
- query each node individually at ONE consistency for the missing key
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes scylladb/scylladb#29350
The real keyspace name of an Alternator table T is "alternator_T".
Expand the "alternator.T" format used in the audit_tables config flag
to the real keyspace name at parse time, so users don't need to spell
out the internal "alternator_T.T" form.
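The expansion can be sketched in a few lines of Python. This is only an illustration of the mapping described above, not the actual parser; `expand_audit_table` is a hypothetical name.

```python
def expand_audit_table(entry: str) -> str:
    """Map 'alternator.T' to 'alternator_T.T'; leave other entries unchanged."""
    keyspace, sep, table = entry.partition(".")
    if sep and keyspace == "alternator" and table:
        # The real keyspace of Alternator table T is "alternator_T".
        return f"alternator_{table}.{table}"
    return entry


print(expand_audit_table("alternator.users"))
print(expand_audit_table("ks.tbl"))
```

Entries for ordinary CQL tables, such as `ks.tbl`, pass through unchanged in this sketch.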
Add tests for the unhappy path of Alternator audit logging:
- Category filtering: operations are not logged when their category
(DML, QUERY, DDL) is excluded from audit_categories.
- Keyspace filtering: operations on a keyspace not listed in
audit_keyspaces are not logged.
- Error entries: a failed operation (thrown exception after audit_info
is set) produces an audit entry with error=true.
- Empty-keyspace bypass: global operations like ListTables and
DescribeEndpoints are logged regardless of audit_keyspaces because
should_log() short-circuits on an empty keyspace.
Prepare the audit API for auditing Alternator.
The API provides externally callable `inspect()` functions
for both CQL and Alternator.
Both variants unpack their parameters and then call a common
`maybe_log()`, which in turn calls `log()` when the logging
conditions are met.
Also, while at it, (const) references are now favoured over raw
pointers.
The Alternator audit_info subclass (audit_info_alternator) carries an
optional consistency level — only data read/write operations have a
meaningful CL, while DDL and metadata queries store an empty string
in the audit table and syslog (matching the existing write_login
behavior). The storage helpers are updated accordingly.
Add a will_log(category, keyspace, table) method that checks whether
an operation should be audited (category check AND keyspace/table
filtering) without requiring a constructed audit_info object.
should_log() delegates to will_log().
Currently, the manifest advertises "powof2", which is wrong for
arbitrary count and boundaries.
Introduce a new kind of layout called "arbitrary", and produce it if
the tablet map doesn't conform to "powof2" layout.
We should also produce tablet boundaries in this case, but that's
worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525
By default it's true, in which case the tablet count of the table is
rounded up to a power of two. Setting it to false lifts this restriction,
so the count can be arbitrary. This will allow testing the
logic of arbitrary tablet count.
This is a step towards more flexibility in managing tablets. It is a
prerequisite for splitting individual tablets, isolating hot
partitions, and evening out tablet sizes by shifting boundaries.
After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
still be a power of two.
We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.
One of the reasons we don't give up on power-of-two by default yet is
that an arbitrary count creates an issue with merges. If the tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.
Example (3 tablets):
[0] owns 1/3 of tokens
[1] owns 1/3 of tokens
[2] owns 1/3 of tokens
After merge:
[0] owns 2/3 of tokens
[1] owns 1/3 of tokens
What we would like instead:
Step 1 (split [1]):
[0] owns 1/3 of tokens
[1] old 1.left, owns 1/6 of tokens
[2] old 1.right, owns 1/6 of tokens
[3] owns 1/3 of tokens
Step 2 (merge):
[0] owns 1/2 of tokens
[1] owns 1/2 of tokens
To do that, we need to be able to split individual tablets, but we're
not there yet.
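The arithmetic of the example can be checked with a small Python sketch using exact fractions. It only models token ownership; `merge_pairs` and `split` are hypothetical helpers, and the sketch assumes that a merge combines adjacent pairs and leaves a trailing tablet without a sibling unmerged.

```python
from fractions import Fraction


def merge_pairs(owned):
    """Merge adjacent pairs of tablets; a trailing tablet without a sibling stays."""
    result = []
    for i in range(0, len(owned), 2):
        result.append(sum(owned[i:i + 2], Fraction(0)))
    return result


def split(owned, index):
    """Split the tablet at `index` into two halves of equal token ownership."""
    half = owned[index] / 2
    return owned[:index] + [half, half] + owned[index + 1:]


third = Fraction(1, 3)
tablets = [third, third, third]

# Plain merge of 3 tablets: the last one has no sibling.
print(merge_pairs(tablets))

# Split [1] first, then merge the resulting 4 tablets.
print(merge_pairs(split(tablets, 1)))
```

The plain merge yields ownership of 2/3 and 1/3, while splitting the middle tablet first yields two tablets owning 1/2 each, matching the example.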
We only assume that new tablets have boundaries which are equal
to some boundaries of old tablets.
In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
The test test/cluster/test_ttl_row.py::test_row_ttl_scheduling_group wants to
verify that the new CQL per-row TTL feature does all its work (expiration
scanning, deletion of expired items) on all nodes in the "streaming"
scheduling group, not in the statement scheduling group.
As originally written, the test couldn't require that it uses exactly zero
time in the statement scheduling group - because some things do happen
there - specifically the ALTER TABLE request we use to enable TTL.
So the test checked that the time in the "wrong" group is less than 0.2
of the total time, not zero.
But in one CI run, we got to exactly 0.2 and the test failed. Running
this test locally, I see the margin is pretty narrow: The test almost
always fails if I set the threshold ratio to 0.1.
The solution in this patch is to move the ALTER TABLE work to a different
scheduling group (by using an additional service level). After doing that
the CPU usage in sl:default goes down to exactly zero - not close to zero
but exactly zero.
However, it seems that there is always some rare background work in
sl:default, and in debug builds it can come out to more than 0ms (e.g., in
one test we saw 1ms), so we keep checking that sl:default is much
lower than sl:stream rather than exactly zero.
Incidentally, I converted the serial loop adding the 200 rows in the
test's setup to a parallel loop, to make the test setup slightly faster.
I also added to the test a sanity check that sl:default, the scheduling
group in which we verify that TTL does zero work, is actually the scheduling
group that normal writes run in (to avoid the risk of having a test that
verifies that some irrelevant scheduling group is unsurprisingly getting
zero usage...).
Fixes SCYLLADB-1495.
Closes scylladb/scylladb#29447
We only assume that new tablets share boundaries with some old tablets.
In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any point, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
Another reason is vnode-to-tablet migration. We could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from the CQL-coordinator's point of view.
Implementation details:
We store a vector of tokens which represent tablet boundaries in the
tablet_id_map. tablet_id keeps its meaning: it's an index into the vector
of tablets. To avoid logarithmic lookup of tablet_id from the token,
we introduce a lookup structure with power-of-two aligned buckets, and
store the tablet_id of the tablet which owns the first token in the
bucket. This way, lookup needs to consider tablet id range which
overlaps with one bucket. If boundaries are more or less aligned,
there are around 1-2 tablets overlapping with a bucket, and the lookup
is still O(1).
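The bucket-based lookup can be sketched in Python as follows. This is a simplified model under stated assumptions, not the actual tablet_id_map: tokens are unsigned integers in a small `token_bits`-wide space, each boundary is the last token owned by a tablet, and `TabletIdMap`, `bucket_first` and `tablet_of` are hypothetical names.

```python
import bisect


class TabletIdMap:
    def __init__(self, last_tokens, token_bits=16, bucket_bits=4):
        # last_tokens[i] is the last token owned by tablet i (strictly increasing);
        # the final tablet must end at the maximum token.
        assert list(last_tokens) == sorted(set(last_tokens))
        assert last_tokens[-1] == (1 << token_bits) - 1
        self.last_tokens = list(last_tokens)
        # Each bucket covers 2**shift consecutive tokens (power-of-two aligned).
        self.shift = token_bits - bucket_bits
        # For every bucket, remember the tablet that owns the bucket's first token.
        self.bucket_first = [
            bisect.bisect_left(self.last_tokens, b << self.shift)
            for b in range(1 << bucket_bits)
        ]

    def tablet_of(self, token):
        # Start from the tablet owning the first token of the bucket and scan
        # forward; only tablets overlapping this bucket are visited.
        tid = self.bucket_first[token >> self.shift]
        while self.last_tokens[tid] < token:
            tid += 1
        return tid


m = TabletIdMap([999, 30000, 30001, 50000, 65535])
print(m.tablet_of(1000), m.tablet_of(30001), m.tablet_of(65535))
```

The binary search runs only when the map is built. A lookup costs one shift plus a scan over the tablets that overlap the bucket, which stays short when boundaries are roughly aligned with bucket edges.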
Amount of memory used increased, but not significantly relative to old
size (because tablet_info is currently fat):
For 131'072 tablets:
Before:
Size of tablet_metadata in memory: 57456 KiB
After:
Size of tablet_metadata in memory: 59504 KiB
And reimplement existing split-related methods around it.
This way we avoid calling dht::compaction_group_of(), and
assuming anything about tablet boundaries or tablet count
being a power of two.
This will make later refactoring easier.
Currently, hints that are sent to tablet replicas which are leaving due to an RF reduction can be lost, because `hint_sender` only checks if the destination host is leaving. To avoid this, we add a new method `effective_replication_map::is_leaving(host, token)` which checks if the tablet identified by the given token is leaving the host. `hint_sender` calls this method to decide whether the hint should be sent only to the destination host or to all the replicas. This improves consistency. For vnode-based ERMs, `is_leaving()` calls `token_metadata::is_leaving(host)`.
Fixes: SCYLLADB-287
This is an improvement, and backport is not needed.
Closes scylladb/scylladb#28770
* github.com:scylladb/scylladb:
test: verify hints are delivered during tablet RF reduction
hint_sender: use per-tablet is_leaving() to avoid losing hints on RF reduction
erm: add is_leaving() to effective_replication_map
Fixes a race condition where tablet split can crash the server during truncation.
`truncate_table_on_all_shards()` disables compaction on all existing compaction groups, then later calls `discard_sstables()` which asserts that compaction is disabled. Between these two points, tablet split can call `set_split_mode()`, which creates new compaction groups via `make_empty_group()` — these start with `compaction_disabled_counter == 0`. When `discard_sstables()` checks its assertion, it finds these new groups and fires `on_internal_error`, aborting the server.
In `storage_group::set_split_mode()`, before creating new compaction groups, check whether the main compaction group has compaction disabled. If it does, bail out early and return `false` (not ready). This is safe because the split will be retried once truncation completes and re-enables compaction.
A new regression test `test_split_emitted_during_truncate` reproduces the
exact interleaving using two error injection points:
- **`database_truncate_wait`** — pauses truncation after compaction is disabled but before `discard_sstables()` runs.
- **`tablet_split_monitor_wait`** (new, in `service/storage_service.cc`) — pauses the split monitor at the start of `process_tablet_split_candidate()`.
The test creates a single-tablet table, triggers both operations, uses the injection points to force the problematic ordering, then verifies that truncation completes successfully and the split finishes afterward.
Fixes: SCYLLADB-1035
This needs to be backported to all currently supported versions.
Closes scylladb/scylladb#29250
* github.com:scylladb/scylladb:
test: add test_split_emitted_during_truncate
table: fix race between tablet split and truncate
In partition_snapshot_row_cursor::maybe_refresh(), the !is_in_latest_version()
path calls lower_bound(_position) on the latest version's rows to find the
cursor's position in that version. When lower_bound returns null (the cursor
is positioned above all entries in the latest version in table order), the code
unconditionally sets _background_continuity = true and allows the subsequent
if(!it) block to erase the latest version's entry from the heap.
This is correct for forward traversal: null means there are no more entries
ahead, so removing the version from the heap is safe.
However, in reversed mode, null from lower_bound means the cursor is above
all entries in table order -- those entries are BELOW the cursor in query
order and will be visited LATER during reversed traversal. Erasing the heap
entry permanently loses them, causing live rows to be skipped.
The fix mirrors what prepare_heap() already does correctly: when lower_bound
returns null in reversed mode, use std::prev(rows.end()) to keep the last
entry in the heap instead of erasing it.
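The null-iterator handling can be modelled with a short Python sketch. It covers only the case discussed above, with a sorted list standing in for the latest version's rows and `bisect_left` standing in for lower_bound; `refresh_position` is a hypothetical name and the real cursor logic is considerably more involved.

```python
import bisect


def refresh_position(rows, position, reversed_mode):
    """Return the index in `rows` to keep in the heap, or None to drop the version.

    `rows` is sorted in table order; lower_bound is modelled with bisect_left.
    """
    idx = bisect.bisect_left(rows, position)
    if idx < len(rows):
        return idx
    # lower_bound returned "null": the cursor is above every entry in table order.
    if reversed_mode and rows:
        # Those entries are still ahead in query order, so keep the last one
        # (mirrors std::prev(rows.end())).
        return len(rows) - 1
    # Forward traversal: nothing is left ahead, so the version can be dropped.
    return None


print(refresh_position([1, 2, 3], 10, reversed_mode=False))
print(refresh_position([1, 2, 3], 10, reversed_mode=True))
```

In forward mode the sketch drops the version (None), while in reversed mode it keeps the last entry (index 2) so the rows below the cursor are still visited.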
Add test_reversed_maybe_refresh_keeps_latest_version_entry to mvcc_test,
alongside the existing reversed cursor tests. The test creates a two-version
partition snapshot (v0 with range tombstones, v1 with a live row positioned
below all v0 entries in table order), and
traverses in reverse calling maybe_refresh() at each step -- directly
exercising the buggy code path. The test fails without the fix.
The bug was introduced by 6b7473be53 ("Handle non-evictable snapshots",
2022-11-21), which added null-iterator handling for non-evictable snapshots
(memtable snapshots lack the trailing dummy entry that evictable snapshots
have). prepare_heap() got correct reversed-mode handling at that time, but
maybe_refresh() received only forward-mode logic.
The bug is intermittent because multiple mechanisms cause iterators_valid()
to return false, forcing maybe_refresh() to take the full rebuild path via
prepare_heap() (which handles reversed mode correctly):
- Mutation cleaner merging versions in the background (changes change_mark)
- LSA segment compaction during reserve() (invalidates references)
- B-tree rebalancing on partition insertion (invalidates references)
- Debug mode's always-true need_preempt() creating many multi-version
partitions via preempted apply_monotonically()
A dtest reproducer confirmed the same root cause: with 100K overlapping range
tombstones creating a massively multi-version memtable partition (287K preemption
events), the reversed scan's latest_iterator was observed jumping discontinuously
during a version transition -- the latest version's heap entry was erased --
causing the query to walk the entire partition without finding the live row.
Fixes: SCYLLADB-1253
Closes scylladb/scylladb#29368
execute_batch_without_checking_exception_message() inserted entries
into the authorized prepared cache before verifying that
check_access() succeeded. A failed BATCH therefore left behind
cached 'authorized' entries that later let a direct EXECUTE of the
same prepared statement skip the authorization check entirely.
Move the cache insertion after the access check so that entries are
only cached on success. This matches the pattern already used by
do_execute_prepared() for individual EXECUTE requests.
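The ordering matters because a cache hit skips the access check. A minimal Python model of the fixed ordering follows. All names here (`execute_batch`, `execute_prepared`, the set-based cache and permission table) are hypothetical stand-ins, not the cql3 code.

```python
class Unauthorized(Exception):
    pass


def execute_batch(statements, user, permissions, authorized_cache):
    """Check access for every statement first; cache only on success."""
    for stmt in statements:
        if (user, stmt) not in permissions:
            raise Unauthorized(stmt)
    # Reached only when every statement passed the check.
    for stmt in statements:
        authorized_cache.add((user, stmt))


def execute_prepared(stmt, user, permissions, authorized_cache):
    """A cache hit skips the access check, so the cache must be trustworthy."""
    if (user, stmt) in authorized_cache:
        return "ok"
    if (user, stmt) not in permissions:
        raise Unauthorized(stmt)
    authorized_cache.add((user, stmt))
    return "ok"
```

In this model a failed batch leaves the cache empty, so a later direct execution by the same user is still checked and rejected.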
Introduced in 98f5e49ea8
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1221
Backport: all supported versions
Closes scylladb/scylladb#29432
* github.com:scylladb/scylladb:
test/cqlpy: add reproducer for BATCH prepared auth cache bypass
cql3: fix authorization bypass via BATCH prepared cache poisoning
Add parametrized integration test that verifies DESCRIBE CLUSTER returns correct
values in both normal and maintenance modes.
The parametrization keeps the validation logic (CQL queries and assertions)
identical for both modes, while the setup phase is mode-specific. This ensures
the same assertions apply to both cluster states:
- partitioner is org.apache.cassandra.dht.Murmur3Partitioner
- snitch is org.apache.cassandra.locator.SimpleSnitch
- cluster name matches system.local cluster_name
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update cluster_describe_statement::describe() to retrieve cluster metadata
from storage_service::describe_cluster() instead of directly from db::config
or gossiper.
The storage_service provides a centralized API for accessing cluster metadata
(cluster_name, partitioner, snitch_name) that works in both normal and
maintenance modes, improving separation of concerns.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add cluster_info struct containing cluster_name, partitioner, and snitch_name.
Implement describe_cluster() method to provide cluster metadata by combining
data from gossiper (cluster_name, partitioner) and snitch (snitch_name).
It will be used by the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously, Alternator did not stick to the standard when emitting
Amazon ARNs. After our attempt to run KCL with Scylla we discovered a
few issues.
Amazon's ARN looks like this:
arn:partition:service:region:account-id:resource-type/resource-id
for example:
arn:aws:dynamodb:us-west-2:111122223333:table/TestTable/stream/2015-05-11T21:21:33.291
KCL checks for:
- ARNs returned by Alternator calls must fit the basic Amazon ARN
pattern shown above,
- region consisting only of lowercase letters and `-`, with no
underscore character
- account-id being only digits (exactly 12)
- service being `dynamodb`
- partition starting with `aws`
The patch updates our code handling ARNs to match those findings.
1. Split the `stream_arn` object into `stream_arn` - ARN for streams only - and
`stream_shard_id` - id value for stream shards. The latter receives the original
implementation. The former emits and parses ARNs in Amazon style.
for example:
2. Update new `stream_arn` class to encode keyspace and table together
separating them by `@`. New ARN looks like this:
arn:aws:dynamodb:us-east-1:000000000000:table/TestKeyspace@TestTable/stream/2015-05-11T21:21:33.291
3. Hardcode `dynamodb` as service, `aws` as partition, `us-east-1` as
region and `000000000000` as account-id (must have 12 digits)
4. Update code handling ARNs for tags manipulation to be able to parse
Amazon-style ARNs. Emitting code is left intact - the parser is now
capable of parsing both styles.
5. Added unit tests.
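A rough Python sketch of the new ARN layout, for illustration only. The fixed prefix and the `@` separator come from the list above; the function names and the `KCL_LIKE` pattern are assumptions, and the pattern only approximates the checks listed earlier (it is not KCL's actual regex). Digits are allowed in the region because the example regions contain them.

```python
import re

# Fixed partition, service, region and account-id named in the list above.
PREFIX = "arn:aws:dynamodb:us-east-1:000000000000:table/"

# Approximation of the KCL checks described above; not KCL's real pattern.
KCL_LIKE = re.compile(
    r"^arn:aws[^:]*:dynamodb:[a-z0-9-]+:\d{12}:table/[^/]+/stream/.+$"
)


def format_stream_arn(keyspace, table, label):
    """Encode keyspace and table together, separated by '@'."""
    return f"{PREFIX}{keyspace}@{table}/stream/{label}"


def parse_stream_arn(arn):
    """Return (keyspace, table, label) or raise ValueError."""
    if not arn.startswith(PREFIX):
        raise ValueError("not a stream ARN")
    resource, sep, label = arn[len(PREFIX):].partition("/stream/")
    keyspace, at, table = resource.partition("@")
    if not (sep and at and keyspace and table and label):
        raise ValueError("malformed stream ARN")
    return keyspace, table, label
```

Formatting ("TestKeyspace", "TestTable", "2015-05-11T21:21:33.291") yields the example ARN shown in item 2, and parsing it returns the same three parts.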
Fixes #28350
Fixes: SCYLLADB-539
Fixes: #28142
Closes scylladb/scylladb#28187
This series makes result metadata handling for auth LIST statements consistent and adds coverage for the driver-visible behavior.
The first patch makes the result-column metadata construction shared across the affected statements, so the metadata shape used for PREPARE and EXECUTE stays uniform and easier to reason about.
The second patch adds regression coverage for both sides of the metadata-id flow:
- a Python auth-cluster test verifies that prepared LIST ROLES OF returns a non-empty result metadata id and that a later EXECUTE reuses it without METADATA_CHANGED
- a Boost transport test covers the recovery path where the client sends an empty request metadata id and the server responds with METADATA_CHANGED and the full metadata
Together these patches tighten the implementation and protect the prepared-metadata-id behavior exposed to drivers.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1218
backport: this change should be backported to all active branches to help the driver operation
Closes scylladb/scylladb#29347
Add a boost test that verifies commitlog segments are replayed in
ascending segment ID order within each shard. The test creates
multiple segments, triggers replay via commitlog_replayer, and
captures the "Replaying" debug log messages to verify the order.
Correct segment ordering is required by the strongly consistent
tables feature, particularly commitlog-based storage that relies
on replayed raft items being stored in order.
Ref SCYLLADB-1411.
The commitlog replayer groups segments by shard using a
std::unordered_multimap, then iterates per-shard segments via
equal_range(). However, equal_range() does not guarantee iteration
order for elements with the same key, so segments could be replayed
out of order within a shard. This can increase memory and disk
consumption during fragmented entry reconstruction, which accumulates
fragments across segments and benefits from ascending ID order.
This is also required by the strongly consistent tables feature,
particularly commitlog-based storage that relies on replayed raft
items being stored in order.
Fix by changing the data structure from
std::unordered_multimap<unsigned, commitlog::descriptor>
to
std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>>
Since the descriptors are inserted from a std::set ordered by ID, the
vector preserves insertion (and thus ID) order. The per-shard iteration
now simply iterates the vector, guaranteeing correct replay order.
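The idea carries over to any language: appending to a per-key list keeps insertion order, so input sorted by ID stays sorted within each shard. A small Python illustration, where the (segment_id, shard) tuples and the `group_by_shard` name are made up for the example:

```python
from collections import defaultdict


def group_by_shard(descriptors):
    """Group segment IDs by shard, keeping the input order within each shard.

    descriptors: iterable of (segment_id, shard) pairs, already sorted by
    segment_id, like the std::set the replayer reads from.
    """
    per_shard = defaultdict(list)
    for segment_id, shard in descriptors:
        per_shard[shard].append(segment_id)
    return per_shard
```

Each per-shard list comes out in ascending segment ID order without any extra sorting step.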
Fixes SCYLLADB-1411.
Add a GitHub Actions workflow that runs scripts/compare_build_systems.py
on PRs touching build system files (configure.py, **/CMakeLists.txt,
cmake/**).
This prevents future deviations between the two build systems by
catching mismatches early in the CI pipeline.
Closes scylladb/scylladb#29426
The CMake build had -fsanitize-address-use-after-scope (enable) when
it should have been -fno-sanitize-address-use-after-scope (disable).
The comment on lines 24-25 of cql3/CMakeLists.txt explains the intent:
the use-after-scope sanitizer uses too much stack space on CqlParser
and overflows the stack. The Python-ninja path in configure.py:2801-2802
correctly had -fno-sanitize-address-use-after-scope.
Found by black-box comparison of compiler flags between the Python-ninja
and CMake build paths (ninja -nv output, debug mode, CqlParser.o):
Python-ninja: -fno-sanitize-address-use-after-scope (correct: disable)
CMake: -fsanitize-address-use-after-scope (wrong: enable)
Closes scylladb/scylladb#29439
The test waited for two "Finished tablet repair" log messages on the
coordinator, expecting one per tablet. But there are two log sources
that emit messages matching this pattern:
repair module (repair/repair.cc:2329):
"Finished tablet repair for table=..."
topology coordinator (topology_coordinator.cc:2083):
"Finished tablet repair host=..."
When the coordinator is also a repair replica (always the case with
RF=3 and 3 nodes), both messages appear in the coordinator log for the
same tablet within 1ms of each other. The test consumed both, thinking
both tablets were done, while the second tablet repair was still running.
From the CI failure logs:
04:08:09.658 Found: repair[...]: Finished tablet repair for table=...
global_tablet_id=e42fd650-3542-11f1-9756-85403784a622:0
04:08:09.660 Found: raft_topology - Finished tablet repair host=...
tablet=e42fd650-3542-11f1-9756-85403784a622:0
Both messages are for tablet :0. Tablet :1 repair had not finished yet.
The test then wrote keys 20-29 while the second tablet repair was still
in progress. That repair flushed the memtable (via
prepare_sstables_for_incremental_repair), including keys 20-29 in the
repair scan, and mark_sstable_as_repaired set repaired_at=2 on the
resulting sstable. This caused the assertion failure on servers[0]:
"should not have post-repair keys in repaired sstables, got:
{20, 21, 22, 23, 24, 25, 26, 27, 28, 29}"
Fix by matching "Finished tablet repair host=" which is unique to the
topology coordinator message and avoids the ambiguity.
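The ambiguity is easy to reproduce with plain regex matching. The log lines below are shortened, made-up versions of the two message shapes quoted above, and `count_matches` is a hypothetical helper, not the test framework's log API.

```python
import re

# Abbreviated stand-ins for the two message shapes, for tablets :0 and :1.
LOG = [
    "repair - Finished tablet repair for table=ks.t global_tablet_id=T:0",
    "raft_topology - Finished tablet repair host=h1 tablet=T:0",
    "repair - Finished tablet repair for table=ks.t global_tablet_id=T:1",
    "raft_topology - Finished tablet repair host=h1 tablet=T:1",
]


def count_matches(pattern, lines):
    """Count the lines in which the pattern occurs."""
    return sum(1 for line in lines if re.search(pattern, line))
```

On the first two lines alone, the broad pattern "Finished tablet repair" already counts two matches although only tablet :0 is done, while "Finished tablet repair host=" counts one.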
Also fix an incorrect comment that said being_repaired=null when at that
point in the test being_repaired is still set to the session_id (the
delay_end_repair_update injection prevents end_repair from running).
Fixes: SCYLLADB-1478
Closes scylladb/scylladb#29444
In commit 727f68e0f5 we added the ability to SELECT:
* Individual elements of a map: `SELECT map_col[key]`.
* Individual elements of a set: `SELECT set_col[key]` returns key if the key exists in the set, or null if it doesn't, which makes it possible to check whether the element exists in the set.
* Individual pieces of a UDT: `SELECT udt_col.field`.
But at the time, we didn't provide any way to retrieve the **meta-data** for this value, namely its timestamp and TTL. We did not support `SELECT TIMESTAMP(collection[key])`, or `SELECT TIMESTAMP(udt.field)`.
Users have requested support for such SELECTs in the past (see issue #15427), and Cassandra 5.0 added support for this feature for maps, sets and UDTs, so we also need this feature for compatibility. This feature was also requested recently by vector-search developers, who wanted to read Alternator columns, which are stored as map elements rather than individual columns, with their WRITETIME information.
The first four patches in this series add the feature (in four smaller patches instead of one big one), the fifth and sixth patches add tests (cqlpy and boost tests, respectively). The seventh patch adds documentation.
All the new tests pass on Cassandra 5, failed on Scylla before the present fix, and pass with it.
The fix was surprisingly difficult. Our existing implementation (from 727f68e0f5 building on earlier machinery) doesn't just "read" `map_col[key]` and allow us to return just its timestamp. Rather, the implementation reads the entire map, serializes it in some temporary format that does **not** include the timestamps and ttls, and then takes the subscript key, at which point we no longer have the timestamp or ttl of the element. So the fix had to cross all these layers of the implementation.
While adding support for UDT fields in a pre-existing grammar nonterminal "subscriptExpr", we unintentionally added support for UDT fields also in LWT expressions (which used this nonterminal). LWT missing support for UDT fields was a long-time known compatibility issue (#13624) so we unintentionally fixed it :-) Actually, to completely fix it we needed another small change in the expression implementation, so the eighth patch in this series does this.
Fixes #15427
Fixes #13624
Closes scylladb/scylladb#29134
* github.com:scylladb/scylladb:
cql3: support UDT fields in LWT expressions
cql3: document WRITETIME() and TTL() for elements of map, set or UDT
test/boost: test WRITETIME() and TTL() on map collection elements
test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT
cql3: prepare and evaluate WRITETIME/TTL on collection elements and UDT fields
cql3: parse per-element timestamps/TTLs in the selection layer
cql3: add extended wire format for per-element timestamps and TTLs
cql3: extend WRITETIME/TTL grammar to accept collection and UDT elements
Document the new vector index behavior in the user-facing and developer
docs.
Describe `index_version` as a creation timeuuid stored in
`system_schema.indexes`, clarify that recreating an index changes it
while ALTER TABLE does not, and document that Scylla allows multiple
named vector indexes on the same column while still rejecting unnamed
duplicates.
Add cqlpy tests for the current CREATE INDEX behavior of vector indexes.
Cover named and unnamed duplicates, IF NOT EXISTS, coexistence of
multiple named vector indexes on the same column, interactions between
named and unnamed indexes, and the same-name-on-different-table case.
An unprivileged user could bypass authorization checks by exploiting
the BATCH prepared statement cache:
1. Prepare an INSERT on a table the user has no access to
2. Execute it inside a BATCH — gets Unauthorized
3. Execute the same prepared INSERT directly — succeeds
Apply filter_errors() to grep_for_errors() results in
test_split_stopped_on_shutdown and
test_group0_apply_while_node_is_being_shutdown. Without filtering,
benign RPC errors like 'connection dropped: Semaphore broken' that
occur during graceful shutdown cause spurious test failures.
Accept both list[str] (from distinct_errors=True) and
list[list[str]] (from distinct_errors=False) in filter_errors(),
matching against the first line of each error group. This allows
tests that call grep_for_errors() with default arguments to
pipe results directly through filter_errors().
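The shape handling can be pictured as follows. This is a hypothetical sketch, not the test framework's filter_errors(): the name `filter_errors_sketch`, the `benign_patterns` parameter and the regex matching are all assumptions made for illustration.

```python
import re


def filter_errors_sketch(errors, benign_patterns):
    """Drop errors whose first line matches any benign pattern.

    errors may be a list[str] or a list[list[str]]; for an error group,
    only the first line of the group is matched.
    """
    kept = []
    for err in errors:
        if isinstance(err, str):
            first_line = err
        else:
            first_line = err[0] if err else ""
        if not any(re.search(p, first_line) for p in benign_patterns):
            kept.append(err)
    return kept
```

Both input shapes go through the same loop, and the surviving entries keep whatever shape they came in with.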
Precompute the expected metadata-id hashes for the prepared LIST auth and
service-level statements and verify that PREPARE returns them while EXECUTE
reuses the prepared metadata without METADATA_CHANGED. Run all cases in a
single auth-cluster test after preparing the cluster, role, and service level
once through the regular manager fixture.
execute_batch_without_checking_exception_message() inserted entries
into the authorized prepared cache before verifying that
check_access() succeeded. A failed BATCH therefore left behind
cached 'authorized' entries that later let a direct EXECUTE of the
same prepared statement skip the authorization check entirely.
Move the cache insertion after the access check so that entries are
only cached on success. This matches the pattern already used by
do_execute_prepared() for individual EXECUTE requests.
Introduced in 98f5e49ea8
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1221
Prepared LIST statements did not calculate metadata in the PREPARE path and sent an empty-string hash to the client, so the metadata id was not recalculated correctly.
This patch moves metadata construction into get_result_metadata() for the affected LIST statements and reuses that metadata when building the result set.
This gives PREPARE a stable metadata id for LIST ROLES, LIST USERS, LIST PERMISSIONS and the service-level variants.
This patch also adds a new boost test that verifies that when an EXECUTE request carries an empty result metadata id while the server has a real metadata id for the result set, the response is marked METADATA_CHANGED and includes the full result metadata plus the server metadata id.
This covers the recovery path for clients that send an empty or otherwise unusable metadata id instead of a matching cached one.
Allow creating multiple named vector indexes on the same column while
still rejecting duplicate unnamed ones.
`index_metadata::equals_noname()` now ignores `index_version`,
which is unique for every vector index creation, so duplicate detection
keeps working for unnamed vector indexes.
CREATE INDEX keeps using structural duplicate detection for regular
indexes and unnamed vector indexes, but named vector indexes are checked
by name only.
The explicit name check is also needed for IF NOT EXISTS when the same
index name already exists on a different table in the same keyspace,
because vector indexes have no backing view table to catch that case.
Add a regression test that reproduces the race between tablet split and
truncation. The test:
1. Creates a single-tablet table and inserts data.
2. Triggers truncation and pauses it (via database_truncate_wait) after
compaction is disabled but before discard_sstables() runs.
3. Triggers tablet split and pauses it (via tablet_split_monitor_wait)
at the start of process_tablet_split_candidate().
4. Releases split so set_split_mode() creates new compaction groups.
5. Waits for the set_split_mode log confirming the groups exist.
6. Releases truncation so discard_sstables() encounters the new groups.
7. Verifies truncation completes and split finishes.
Adds a tablet_split_monitor_wait error injection point in
process_tablet_split_candidate() to allow pausing the split monitor
before it enters the split loop.
Tablet split can call set_split_mode() between the point where
truncate_table_on_all_shards() disables compaction on all existing
compaction groups and the point where discard_sstables() checks that
compaction is disabled. The new split-ready compaction groups created
by set_split_mode() won't have compaction disabled, causing
discard_sstables() to fire on_internal_error.
Fix by preventing set_split_mode() from creating new compaction groups
when compaction is disabled on the main group. If truncation has
already disabled compaction, split will simply report not-ready rather
than creating groups which have compaction enabled.
This is safe because split will be retried once truncation completes
and re-enables compaction.
In an earlier patch, we used the CQL grammar's "subscriptExpr" in
the rule for WRITETIME() and TTL(). But since we also wanted these
to support UDT fields (x.a), not just collection subscripts (x[3]),
we expanded subscriptExpr to also support the field syntax.
But LWT expressions already used this subscriptExpr, which meant
that LWT expressions unintentionally gained support for UDT fields.
Missing support for UDT fields in LWT is a long-standing known
Cassandra-compatibility bug (#13624), and now our grammar finally
supports the missing syntax.
But supporting the syntax is not enough for correct implementation
of this feature - we also need to fix the expression handling:
Two bugs prevented expressions like `v.a = 0` from working in LWT IF
clauses, where `v` is a column of user-defined type.
The first bug was in get_lhs_receiver() in prepare_expr.cc: it lacked
a handler for field_selection nodes, causing an "unexpected expression"
internal error when preparing a condition like `IF v.a = 0`. The fix
adds a handler that returns a column_specification whose type is taken
from the prepared field_selection's type field.
The second bug was in search_and_replace() in expression.cc: when
recursing into a field_selection node it reconstructed it with only
`structure` and `field`, silently dropping the `field_idx` and `type`
fields that are set during preparation. As a result, any transformation
that uses search_and_replace() on a prepared expression containing a
field_selection — such as adjust_for_collection_as_maps() called from
column_condition_prepare() — would zero out those fields. At evaluation
time, type_of() on the field_selection returned a null data_type
pointer, causing a segmentation fault when the comparison operator tried
to call ->equal() through it. The fix preserves field_idx and type when
reconstructing the node.
Fixes #13624.
Add to the SELECT documentation (docs/cql/dml/select.rst) documentation
of the new ability to select WRITETIME() and TTL() of a single element
of map, set or UDT.
Also in the TTL documentation (docs/cql/time-to-live.rst), which already
had a section on "TTL for a collection", add a mention of the ability
to read a single element's TTL(), and an example.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add tests in test/boost/expr_test.cc for the low-level implementation
of writetime() and ttl() on a map element.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds many tests verifying the behavior of WRITETIME() and
TTL() on individual elements of maps, sets and UDTs, serving as a
regression test for issue #15427. We also add tests verifying our
understanding of related issues like WRITETIME() and TTL() of entire
collections and of individual elements of *frozen* collections.
All new tests pass on Cassandra 5.0, helping to verify that our
implementation is compatible with Cassandra. They also pass on
ScyllaDB after the previous patch (most didn't before that patch).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Complete the implementation of SELECT WRITETIME(col[key])/TTL(col[key])
and WRITETIME(col.field)/TTL(col.field), building on the grammar (commit 1),
wire format (commit 2), and selection-layer (commit 3) changes in the
preceding patches.
* prepare_column_mutation_attribute() (prepare_expr.cc) now handles the
subscript and field_selection nodes that the grammar produces:
- For subscripts, it validates that the inner column is a non-frozen
map or set and checks the 'writetime_ttl_individual_element' feature
flag so the feature is rejected during rolling upgrades.
- For field selections, it validates that the inner column is a
non-frozen UDT, with the same feature-flag check.
* do_evaluate(column_mutation_attribute) (expression.cc) handles the
same two cases. For a field selection it serializes the field index as
a key and looks it up in collection_element_metadata; for a subscript
it evaluates the subscript key and looks it up in the same map.
A missing key (element not found or expired) returns NULL, matching
Cassandra behavior.
Together with the preceding three patches, this finally fixes #15427.
The next three patches will add tests and documentation for the new
feature, and the final eighth patch will fix the implementation of
UDT fields in LWT expressions - which the first patch made the grammar
allow but is still not implemented correctly.
Wire up the selection and result-set infrastructure to consume the
extended collection wire format introduced in the previous patch and
expose per-element timestamps and TTLs to the expression evaluator.
* Add collection_cell_metadata: maps from raw element-key bytes to
timestamp and remaining TTL, one entry per collection or UDT cell.
Add a corresponding collection_element_metadata span to
evaluation_inputs so that evaluators can access it.
* Add a flag _collect_collection_timestamps to selection (selection.hh/cc).
When any selected expression contains a WRITETIME(col[key])/TTL(col[key])
or WRITETIME(col.field)/TTL(col.field) attribute, the flag is set and
the send_collection_timestamps partition-slice option is enabled,
causing storage nodes to use the extended wire format from the
previous patch.
* Implement result_set_builder::add_collection() (selection.cc): when
_collect_collection_timestamps is set, parse the extended format,
decode per-element timestamps and remaining TTLs (computed from the
stored expiry time and the query time), and store them in
_collection_element_metadata indexed by column position. When the
flag is not set, the existing plain-bytes path is unchanged.
After this patch, the new selection feature is still not available to
the end-user because the prepare step still forbids it. The next patch
will finally complete the expression preparation and evaluation.
It will read the new collection_element_metadata and return the correct
timestamp or TTL value.
Introduce the infrastructure needed to transport per-element timestamps
and TTL expiry times from replicas to coordinators, required for
WRITETIME(col[key]) / TTL(col[key]) and WRITETIME(col.field) /
TTL(col.field).
* Add a 'writetime_ttl_individual_element' cluster feature flag that
guards usage of the new wire format during rolling upgrades: the
extended format is only emitted and consumed when every node in the
cluster supports it.
* Implement serialize_for_cql_with_timestamps() (types/types.cc), a
variant of serialize_for_cql() that appends a per-element section to
the regular CQL bytes, listing each live element's serialized key,
timestamp, and expiry. The format is:
[uint32 cql_len][cql bytes]
[int32 entry_count]
[per entry: (int32 key_len)(key bytes)(int64 timestamp)(int64 expiry)]
expiry is -1 when the element has no TTL.
* Add partition_slice::option::send_collection_timestamps and modify
write_cell() (mutation_partition.cc) to use the new function
serialize_for_cql_with_timestamps() when this option is available.
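To make the layout above concrete, here is a small Python encoder/decoder for it. It is a sketch only: the field order and widths follow the format description above, but big-endian byte order is an assumption (the commit message does not state it), and `encode`/`decode` are illustrative names, not the C++ functions.

```python
import struct


def encode(cql_bytes, entries):
    """entries: list of (key_bytes, timestamp, expiry); expiry -1 = no TTL."""
    # Byte order ">" (big-endian) is an assumption of this sketch.
    out = struct.pack(">I", len(cql_bytes)) + cql_bytes
    out += struct.pack(">i", len(entries))
    for key, timestamp, expiry in entries:
        out += struct.pack(">i", len(key)) + key
        out += struct.pack(">qq", timestamp, expiry)
    return out


def decode(buf):
    """Inverse of encode(): return (cql_bytes, entries)."""
    (cql_len,) = struct.unpack_from(">I", buf, 0)
    pos = 4
    cql_bytes = buf[pos:pos + cql_len]
    pos += cql_len
    (count,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    entries = []
    for _ in range(count):
        (key_len,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        key = buf[pos:pos + key_len]
        pos += key_len
        timestamp, expiry = struct.unpack_from(">qq", buf, pos)
        pos += 16
        entries.append((key, timestamp, expiry))
    return cql_bytes, entries
```

A reader that only wants the regular CQL value can stop after the first cql_len bytes; the per-element section is pure trailing data.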
This commit stands alone with no user-visible effect: nothing yet sets
the new partition-slice option. The next patch adds the selection-layer
code that sets the option and parses the extended response.
Previously, WRITETIME() and TTL() only accepted a simple column name
(cident), so WRITETIME(m['key']) or WRITETIME(x.a) was a syntax error.
This patch begins to implement support for applying WRITETIME() and
TTL() to individual elements of a non-frozen map, set or UDT, as
requested in issue #15427.
On its own this commit only changes the parser (Cql.g). The prepare
step still rejects subscript and field-selection nodes with an
invalid_request_exception, so there is no user-visible behavior change
yet - just that a syntax error is replaced by a different error.
Upcoming patches add the extended wire format for per-element timestamps
(commit 2), the selection layer that consumes it (commit 3), and the
prepare/evaluate logic that ties everything together (commit 4), after
which WRITETIME() and TTL() for collection or UDT elements
will finally be fully functional.
The parser change in this patch expands the subscriptExpr rule to
support the col.field syntax, not only col[key]. This change also
allows the UDT field syntax to be used in LWT conditions, which is
another long-standing missing feature (#13624), but to correctly
support this feature we'll need an additional patch to fix a couple
of remaining bugs - this will be the eighth commit in this series.
Remove the const specifier from the result_set_row._cells member to make
the class nothrow_move_constructible and nothrow_move_assignable.
To be used later in query result_set and friends.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
LDAPRoleManager interpolated usernames directly into ldap_url_template.
That allowed LDAP filter metacharacters to change the query, and URL
metacharacters such as %, ?, and # to change how ldap_url_parse()
split the URL.
Apply two layers of encoding when substituting {USER}:
1. RFC 4515 filter escaping -- neutralises filter operators.
2. URL percent-encoding -- prevents ldap_url_parse from
misinterpreting %-sequences, ? delimiters, or # fragments.
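The two layers compose as sketched below in Python. This illustrates the encoding order only and is not the LDAPRoleManager code: `escape_filter_value` and `substitute_user` are hypothetical names, and the escape table covers the characters RFC 4515 requires to be escaped in a filter value.

```python
from urllib.parse import quote

# RFC 4515: these characters must be written as a backslash plus two hex digits.
_FILTER_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\0": r"\00",
}


def escape_filter_value(value):
    """Layer 1: neutralise LDAP filter operators."""
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


def substitute_user(template, username):
    """Layer 2: percent-encode the escaped value, then substitute {USER}."""
    encoded = quote(escape_filter_value(username), safe="")
    return template.replace("{USER}", encoded)
```

The order matters: filter escaping introduces backslashes, and percent-encoding afterwards turns them (and any %, ? or # in the username) into sequences the URL parser decodes back to the escaped filter text.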
Add validate_query_template() (called from start()) which uses a
sentinel round-trip through ldap_url_parse to reject templates
that place {USER} outside the filter component. Templates that
previously placed {USER} in the host or base DN were silently
accepted; they are now rejected at startup with a descriptive
error.
Change parse_url() to take const sstring& instead of string_view
to enforce the null-termination requirement of ldap_url_parse()
at the type level.
Add regression coverage for %2a, ?, #, and invalid {USER}
placement in the base DN, host, attributes, and extensions.
Update LDAP authorization docs to document the escaping behavior
and the {USER} placement restriction.
Fixes: SCYLLADB-1309
Vector indexes currently store the base table schema version in
`index_version`. That value is name-based, not time-based,
so it does not represent when the index was created.
Store a timeuuid instead and change the relevant interfaces from
`table_schema_version` to `utils::UUID`. This is a prerequisite
for supporting multiple vector indexes on the same column where
the oldest index must be selected deterministically via routing
implemented in Vector Store.
Update the cqlpy tests to check the new semantics directly:
recreating the index changes `index_version`, while ALTER TABLE does not.
This metric is used to catch execution of scans which go via the row
cache, which can have a bad effect on performance.
Since f344bd0aaa, aggregate queries go
via a new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.
Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.
get_node_metrics() in test/cluster/dtest/tools/metrics.py used
re.search(metric_name, metric) to match Prometheus metric lines. The
metric name select_partition_range_scan is a substring of
select_partition_range_scan_no_bypass_cache. So when querying for
select_partition_range_scan, the regex matched both Prometheus lines:
scylla_cql_select_partition_range_scan{shard="0",...} 1
scylla_cql_select_partition_range_scan_no_bypass_cache{shard="0",...} 1
And because the code does metrics_res[metric_name] += val, it summed
both values, making it look like the counter was incremented
by 2 when it was actually incremented by 1. The fix appends r"[\s{]"
to the regex so the metric name must be followed by { (labels) or
whitespace (value), preventing substring matches.
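The substring problem and the fix can be reproduced in isolation. The sketch below is not get_node_metrics() itself: `sum_metric` is a hypothetical helper, and the two lines are simplified versions of the Prometheus lines quoted above.

```python
import re

LINES = [
    'scylla_cql_select_partition_range_scan{shard="0"} 1',
    'scylla_cql_select_partition_range_scan_no_bypass_cache{shard="0"} 1',
]


def sum_metric(metric_name, lines, anchored):
    """Sum the values of all lines whose metric name matches.

    With anchored=True the name must be followed by '{' or whitespace,
    which rules out longer metric names that share the same prefix.
    """
    pattern = metric_name + (r"[\s{]" if anchored else "")
    return sum(
        float(line.rsplit(" ", 1)[1])
        for line in lines
        if re.search(pattern, line)
    )
```

Querying select_partition_range_scan without the suffix sums both lines; with the suffix only the exact metric is counted.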
Remove the create_ks_and_cf() helper function and its now-unused import
of format_tuples(). All callers have been converted to use the new
async patterns with new_test_keyspace() and cql.run_async().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace all 6 calls to create_ks_and_cf() with new async patterns:
- Use new_test_keyspace() context manager for keyspace creation
- Use cql.run_async() for CREATE TABLE statement
- Use asyncio.gather() with cql.run_async() for data insertion
The test_restore_with_non_existing_sstable only needs the ks:table
structure to exist; it doesn't use the pre-populated data.
This change makes the code more explicit and maintains proper async
semantics throughout.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add if-True blocks to wrap code that uses create_ks_and_cf() in all 6
test functions. This is a mechanical change to set up the next step
where the helper will be replaced with new async patterns. All code
after the create_ks_and_cf() call until the end of each test is now
indented under the if-True block.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add tests that reproduce LDAP filter injection via unescaped {USER}
substitution (SCYLLADB-1309). A wildcard username ('*') matches
every group entry, and a parenthesis payload (")(uid=*") breaks the
search filter.
Extend the LDAP test fixture (ldap_server.py, slapd.conf) with
memberUid attributes and the NIS schema so the new tests can
exercise direct filter-value substitution.
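
A rough illustration of why the two payloads are dangerous, and of RFC 4515 value escaping as one common mitigation. The filter template and helper names are hypothetical, and this is not necessarily how the actual fix is implemented.

```python
# Hypothetical filter template; the real one comes from site configuration.
FILTER_TEMPLATE = "(memberUid={USER})"

def escape_filter_value(value):
    """Escape the characters that RFC 4515 reserves inside a filter value."""
    replacements = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}
    return "".join(replacements.get(ch, ch) for ch in value)

def build_filter(user, escape):
    value = escape_filter_value(user) if escape else user
    return FILTER_TEMPLATE.replace("{USER}", value)

# Unescaped: '*' turns the filter into a match-everything wildcard, and the
# parenthesis payload closes the filter early and appends a second clause.
wildcard_raw = build_filter("*", escape=False)
paren_raw = build_filter(")(uid=*", escape=False)

# Escaped: both payloads stay literal values inside a single clause.
wildcard_safe = build_filter("*", escape=True)
paren_safe = build_filter(")(uid=*", escape=True)
```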
Add test_hint_to_leaving_when_reducing_rf which verifies that mutations
stored as hints are delivered to the correct replicas when a tablet is
removed due to RF reduction. The test sets up a 3-node cluster with
RF=2, drops the hint for one replica via error injection, then reduces
RF to 1 while hints are pending. It asserts that the mutation is
readable after the topology change completes.
Also adds a "drop_hint_for_host" error injection point in
hint_endpoint_manager to selectively drop hints for a specific host.
hint_sender decides whether to send a hint directly to its destination
or to re-mutate from scratch based on token_metadata::is_leaving(),
which only checks whether the *host* is leaving the cluster. When a
tablet is dropped from a host due to RF reduction (RF--), the host
is still alive and is_leaving() returns false, so hint_sender sends
directly to a replica that will no longer own the data -- effectively
losing the hint.
Switch to the new ermp->is_leaving(host, token) which is tablet-aware.
When the destination's tablet is being migrated away *and* there are
pending endpoints, send directly (the pending endpoints will receive
the data via streaming); otherwise fall through to the re-mutate path
so all current replicas receive the mutation.
token_metadata::is_leaving() only knows whether a *host* is leaving the
cluster, which is insufficient for tablets -- a tablet can be migrated
away from a host (e.g. during RF reduction) without the host itself
leaving.
Add a pure virtual is_leaving(host, token) to effective_replication_map
so callers can ask per-token questions. The vnode implementation
delegates to token_metadata::is_leaving() (host-level, as before). The
tablet implementation checks whether the tablet owning the token has a
transition whose leaving replica matches the given host.
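
The difference between the two implementations can be modelled with a toy sketch. This is a Python model, not the C++ code; the class names, the token-to-tablet lookup, and the data layout are all assumptions made for illustration.

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class VnodeMap:
    """Host-level model: the token argument is ignored."""
    leaving_hosts: set = field(default_factory=set)

    def is_leaving(self, host, token):
        return host in self.leaving_hosts

@dataclass
class TabletMap:
    """Per-tablet model: tablet i owns tokens up to and including last_tokens[i]."""
    last_tokens: list                            # sorted last token of each tablet
    leaving: dict = field(default_factory=dict)  # tablet index -> leaving host

    def is_leaving(self, host, token):
        tablet = bisect.bisect_left(self.last_tokens, token)
        # True only if the tablet owning the token has a transition
        # whose leaving replica is this host.
        return self.leaving.get(tablet) == host

# Host "B" is alive, but tablet 1 is being migrated away from it.
vnodes = VnodeMap()
tablets = TabletMap(last_tokens=[100, 200, 300], leaving={1: "B"})
```

In this model `vnodes.is_leaving("B", 150)` is false because the host itself is not leaving, while `tablets.is_leaving("B", 150)` is true because token 150 falls in the tablet that is moving off "B".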
An RF change of a tablet keyspace starts tablet rebuilds. Even if some of
the rebuilds are rolled back (because the pending replica was excluded),
the RF change request finishes successfully, leaving the tablet with too
few replicas. A subsequent RF change request handler would then
generate a rebuild of two replicas of the same tablet. Such a transition
cannot be applied, because multiple pending replicas are not allowed.
An exception would be thrown and the request would be retried indefinitely,
blocking the topology coordinator.
Throw and fail the RF change request if there are not enough replicas.
The request should be retried later, after the issue is fixed
by the mechanism introduced in previous changes.
An RF change of a tablet keyspace starts tablet rebuilds. Even if some
of the rebuilds are rolled back (because the pending replica was excluded),
the RF change request finishes successfully. In this case the resulting
replica state does not match the expected keyspace replication.
After this change, if topology_coordinator has nothing to do, it
proceeds to check whether the state of replicas reflects the keyspace
replication. If there are any mismatches, tablet rebuilds are
scheduled. All required rebuilds of a single keyspace are scheduled
together without respecting the node's load (just as happens
in the case of a keyspace RF change).
maybe_start_tablet_migration takes ownership of group0_guard and
does not give it back, even if no work was done.
In the following patches, we will proceed with different operations
if there are no migrations to be started, so the guard will be needed.
Return group0_guard from maybe_start_tablet_migration if no work
was done.
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown) in 3 files. Restrict it to only
the teardown phase, which contains all 3 in case of test success,
to avoid redundant log entries.
1. test.py: removed the --log-level=DEBUG flag from the pytest args
2. test/pytest.ini: changed log_level to INFO (it was previously set to DEBUG by test.py), changed log_file_level from DEBUG to INFO, and added clarifying comments
Replace CAS contention histograms in storage proxy stats with
estimated_histogram_with_max<128> and switch metrics/API aggregation to the
new histogram path.
Introduce a dedicated cas_contention_histogram alias and use it for
cas_read_contention and cas_write_contention.
Update API histogram reduction to merge the new histogram type via
estimated_histogram_with_max_merge.
Convert API JSON serialization to explicit offsets/counts using
get_buckets_offsets() and get_buckets_counts().
Export CAS contention metrics with to_metrics_histogram(...) instead of the
legacy get_histogram(1, 8) path for consistent bucket handling.
Add utility accessors to approx_exponential_histogram to export bucket
boundaries and bucket counts in a form suitable for display/tests when
Min < Precision causes repeated integer limits.
Add MAX compile-time constant alias for the template Max parameter.
Add get_buckets_offsets() to return bucket lower limits with duplicate
adjacent limits removed.
Add get_buckets_counts() to return counts aligned with the deduplicated
limits, merging counts from buckets that share the same lower limit.
Keep existing histogram behavior unchanged.
This new functionality is intended for API use and not for
performance-critical paths.
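
The deduplication described above can be sketched as follows. This is a Python model of the idea, not the C++ accessors; the function names and the sample data are assumptions.

```python
def dedup_offsets(lower_limits):
    """Return bucket lower limits with adjacent duplicates removed."""
    out = []
    for limit in lower_limits:
        if not out or out[-1] != limit:
            out.append(limit)
    return out

def merged_counts(lower_limits, counts):
    """Return counts aligned with dedup_offsets(), summing buckets that
    share the same lower limit."""
    out = []
    prev = None
    for limit, count in zip(lower_limits, counts):
        if out and limit == prev:
            out[-1] += count   # same lower limit: fold into the previous bucket
        else:
            out.append(count)  # new lower limit: start a new bucket
        prev = limit
    return out

# Hypothetical integer limits that repeat, as can happen when Min < Precision.
limits = [1, 1, 2, 2, 3, 4]
counts = [5, 1, 0, 2, 7, 3]
offsets = dedup_offsets(limits)
merged = merged_counts(limits, counts)
```

The two results have the same length, so they can be serialized side by side as offsets and counts.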
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The previous commit added extern template declarations to suppress
named_value<T> instantiation in every translation unit, but those only
suppress non-inline members. All method bodies defined inside the class
body were inline and thus exempt from extern template, so they were
still emitted as weak symbols in every TU that used them.
Fix this by moving all named_value<T> method definitions out of the class
body in config_file.hh and into config_file_impl.hh as out-of-line template
definitions. Since config_file_impl.hh is included only by db/config.cc,
utils/config_file.cc, sstables/compressor.cc, and
ent/encryption/encryption_config.cc, the method bodies are now compiled
in only those four TUs.
Also add the two missing explicit instantiation pairs that caused linker
errors:
- named_value<vector<object_storage_endpoint_param>> in db/config.cc
- named_value<encryption_config::string_string_map> in encryption_config.cc
config.hh is included by a large fraction of the codebase. It pulls in
utils/config_file.hh, whose named_value<T> template has its method
bodies defined in config_file_impl.hh. Those bodies depend on three of
the heaviest Boost headers – boost/program_options.hpp,
boost/lexical_cast.hpp, and boost/regex.hpp – as well as yaml-cpp.
Because the method bodies are reachable from config.hh, every
translation unit that includes config.hh was silently instantiating all
of named_value<T>'s methods (for each distinct T) and compiling that
Boost/yaml-cpp machinery from scratch.
Fix this by adding extern template struct declarations for all 32
distinct named_value<T> specialisations used by db::config:
- the 14 primitive/stdlib types go into utils/config_file.hh
- the 18 db-specific types (enum_option<…>, seed_provider_type, etc.)
go into db/config.hh
Matching explicit template struct instantiation definitions are added in
db/config.cc, which is already the only translation unit that includes
config_file_impl.hh. As a result the Boost/yaml-cpp template machinery
is compiled exactly once (in config.o) instead of being re-instantiated
in every including TU.
One subtlety: named_value<seed_provider_type> has an explicit member
specialisation of add_command_line_option. Per [temp.expl.spec], such
a specialisation must be declared before any extern template declaration
of the enclosing class template, so a forward declaration of the
specialisation is added to config.hh ahead of the extern template line.
Also, for some of the types we explicitly instantiated in db/config.cc,
the named_value<T> constructor calls config_type_for<T>(), which we
also need to provide explicit specializations - some of them we already
had but some were missing.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 11:30:39 +02:00
379 changed files with 19008 additions and 5297 deletions
# supposed to be processed by idl-compiler.py, so we don't check them using the cleaner
CLEANER_DIRS:test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
// ARN format is `arn:<partition>:<service>:<region>:<account-id>:<resource-type>/<resource-id>/<postfix>`
// we ignore partition, service and account-id
// resource-type must be string "table"
// resource-id will be returned as table_name
// region will be returned as keyspace_name
// postfix is a string after resource-id and will be returned as is (whole), including separator.
struct arn_parts {
    std::string_view keyspace_name;
    std::string_view table_name;
    std::string_view postfix;
};
// arn - arn to parse
// arn_field_name - identifier of the ARN, used only when reporting an error (in error messages), for example "Incorrect resource identifier `<arn_field_name>`"
// type_name - used only when reporting an error (in error messages), for example "... is not a valid <type_name> ARN ..."
// expected_postfix - optional filter of postfix value (part of ARN after resource-id, including separator, see comments for struct arn_parts).
// If is empty - then postfix value must be empty as well
// if not empty - postfix value must start with expected_postfix, but might be longer
on_internal_error(slogger, fmt::format("parent end token not present in children tokens and no child with greater token exists, for parent shard id {}, got parent shards [{}] and children shards [{}]",
// sanity check - we should never get here as there is if above (`shard_filter && prev == e` => `continue`)
if (prev == e) {
on_internal_error(slogger, fmt::format("Could not find parent generation for shard id {}, got generations [{}]", shard_filter->id, fmt::join(topologies | std::ranges::views::keys, "; ")));
on_internal_error(expr_logger, fmt::format("evaluating column_mutation_attribute field_selection: inner expression is not a column: {}", fs->structure));
throw exceptions::configuration_exception(format("The experimental feature 'logstor' must be enabled in order to use the 'logstor' storage engine."));
}
if (!db.get_config().enable_logstor()) {
throw exceptions::configuration_exception(format("The configuration option 'enable_logstor' must be set to true in the configuration in order to use the 'logstor' storage engine."));
}
} else {
throw exceptions::configuration_exception(format("Illegal value for '{}'", KW_STORAGE_ENGINE));