In Alternator's HTTP API, response headers can dominate bandwidth for
small payloads. The Server, Date, and Content-Type headers were sent on
every response but many clients never use them.
This patch introduces three Alternator config options:
- alternator_http_response_server_header,
- alternator_http_response_disable_date_header,
- alternator_http_response_disable_content_type_header,
which allow customizing or suppressing the respective HTTP response
headers. All three options support live update (no restart needed).
The Server header is no longer sent by default; the Date and
Content-Type defaults preserve the existing behavior.
The Server and Date header suppression uses Seastar's
set_server_header() and set_generate_date_header() APIs added in
https://github.com/scylladb/seastar/pull/3217. This patch also
fixes deprecation warnings from older Seastar HTTP APIs.
Tests are in test/alternator/test_http_headers.py.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-70Closesscylladb/scylladb#28288
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce two parents, making them incompatible with
Alternator Streams. This series blocks tablet merges when streams are
active on a tablet table.
For CreateTable, a freshly created table has no pending merges, so
streams are enabled immediately with tablet merges blocked.
For UpdateTable on an existing table, stream enablement is deferred:
the user's intent is stored via `enable_requested`, tablet merges are
blocked (new merge decisions are suppressed and any active merge
decision is revoked), and the topology coordinator finalizes enablement
once no in-flight merges remain.
The topology coordinator is woken promptly on error injection release
and tablet split completion, reducing finalization latency from ~60s
to seconds.
`test_parent_children_merge` is marked xfail (merges are now blocked),
and downward (merge) steps are removed from `test_parent_filtering` and
`test_get_records_with_alternating_tablets_count`.
Not addressed here: using a topology request to preempt long-running
operations like repair (tracked in SCYLLADB-1304).
Refs SCYLLADB-461
Closesscylladb/scylladb#29224
* github.com:scylladb/scylladb:
topology: Wake coordinator promptly for stream enablement lifecycle
test/cluster: Test deferred stream enablement on tablet tables
alternator/streams: Block tablet merges when Alternator Streams are enabled
The topology coordinator sleeps on a condition variable between
iterations. Several events relevant to Alternator stream enablement
did not wake it, causing delays of up to 60s (the periodic load
stats refresh interval) at each step:
1. Error injection release: when a test disables the
delay_cdc_stream_finalization injection, the coordinator was
not notified. Add an on_disable callback mechanism to the error
injection framework (register_on_disable / unregister_on_disable)
so subsystems can react when an injection is released. The
topology coordinator uses this to broadcast its event.
2. Tablet split completion: after all local storage groups for a
table finish splitting, split_ready_seq_number is set but the
coordinator only discovered this via the periodic stats refresh.
Add an on_tablet_split_ready callback to topology_state_machine
that the coordinator sets to trigger_load_stats_refresh(). The
split monitor in storage_service calls it when all compaction
groups are split-ready, giving the coordinator fresh stats
immediately so it can finalize the resize.
These changes reduce test_deferred_stream_enablement_on_tablets
from ~120s to ~13s and fix a production issue where Alternator
stream enablement could be delayed by up to 60s at each step of
the lifecycle (error injection release, split completion).
Async cluster test exercising the deferred enablement lifecycle:
ENABLING -> ENABLED -> disabled, verifying tablet merge blocking
and unblocking at each stage. Uses delay_cdc_stream_finalization
error injection and CQL ALTER TABLE with tablet count constraints.
Also adds tablet scheduler config to test_config.yaml (fast refresh
interval, scale factor 1) for reliable tablet count changes.
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce 2 parents, which is incompatible. When streams
are requested on a tablet table, block tablet merges via
tablet_merge_blocked (the allocator suppresses new merge decisions and
revokes any active merge decision).
add_stream_options() sets tablet_merge_blocked=true alongside
enabled=true, so CreateTable needs no special handling — the flag
is inert on vnode tables and immediately effective on tablet tables.
For UpdateTable, CDC enablement is deferred: store the user's intent
via enable_requested, and let the topology coordinator finalize
enablement once no in-progress merges remain. A new helper,
defer_enabling_streams_block_tablet_merges(), amends the CDC options
to this deferred state.
Disabling streams clears all flags, immediately re-allowing merges.
The tablet allocator accesses the merge-blocked flag through a
schema::tablet_merges_forbidden() accessor rather than reaching into
CDC options directly.
Mark test_parent_children_merge as xfail and remove downward
(merge) steps from tablet_multipliers in test_parent_filtering and
test_get_records_with_alternating_tablets_count.
This PR exposes vnodes-to-tablets migrations through the task manager API via a virtual task. This allows users to list, query status, and wait on ongoing migrations through a standard interface, consistent with other global operations such as tablet operations and topology requests are already exposed.
The virtual task exposes all migrations that are currently in progress. Each migrating keyspace appears as a separate task, identified by a deterministic name-based (v3) UUID derived from the keyspace name. Progress is reported as the number of nodes that have switched to tablets vs. the total. The number increases on the forward path and decreases on rollback.
The task is not abortable - rolling back a migration requires a manual procedure.
The `wait` API blocks until the migration either completes (returning `done`) or is rolled back (returning `suspended`).
Example output:
```
$ scylla nodetool tasks list vnodes_to_tablets_migration
task_id type kind scope state sequence_number keyspace table entity shard start_time end_time
1747b573-6cd6-312d-abb1-9b66c1c2d81f vnodes_to_tablets_migration cluster keyspace running 0 ks 0
$ scylla nodetool tasks status 1747b573-6cd6-312d-abb1-9b66c1c2d81f
id: 1747b573-6cd6-312d-abb1-9b66c1c2d81f
type: vnodes_to_tablets_migration
kind: cluster
scope: keyspace
state: running
is_abortable: false
start_time:
end_time:
error:
parent_id: none
sequence_number: 0
shard: 0
keyspace: ks
table:
entity:
progress_units: nodes
progress_total: 3
progress_completed: 0
```
Fixes SCYLLADB-1150.
New feature, no backport needed.
Closesscylladb/scylladb#29256
* github.com:scylladb/scylladb:
test: cluster: Verify vnodes-to-tablets migration virtual task
distributed_loader: Link resharding tasks to migration virtual task
distributed_loader: Make table_populator aware of migration rollbacks
service: Add virtual task for vnodes-to-tablets migrations
storage_service: Guard migration status against uninitialized group0
compaction: Add parent_id to table_resharding_compaction_task_impl
storage_service: Add keyspace-level migration status function
storage_service: Replace migration status string with enum
utils: Add UUID::is_name_based()
* seastar 4d268e0e...22a5aa13 (36):
> apps/httpd: replace deprecated reply::done() with write_body()
> missing header(s)
> net: Fix missing throw for runtime_error in create_native_net_device
> tests/io_queue: account for token bucket refill granularity in bandwidth checks
> Merge 'iovec: fix iovec_trim_front infinite loop on zero-length iovecs' from Travis Downs
tests: add regression tests for zero-length iovec handling
iovec: fix iovec_trim_front infinite loop on zero-length iovecs
> util/process: graduate process management API from experimental
> cooking: don't register ready.txt as a build output
> sstring: make make_sstring not static
> Add SparkyLinux to debian list in install-dependencies.sh
> http: allow control over default response headers
> Merge 'chunked_fifo: make cached chunk retention configurable' from Brandon Allard
tests/perf: add chunked_fifo microbenchmarks
chunked_fifo: set the default free chunk retention to 0
chunked_fifo: make free chunk retention configurable
> Merge 'reactor_backend: fix pollable_fd_state_completion reuse in io_uring' from Kefu Chai
tests: add regression test for pollable_fd_state_completion reuse
reactor_backend: use reset() in AIO and epoll poll paths
reactor_backend: fix pollable_fd_state_completion reuse after co_await in io_uring
> Merge 'coroutine: Generator cleanups' from Kefu Chai
coroutine/generator: extract schedule_or_resume helper
coroutine/generator: remove unused next_awaiter classes
coroutine/generator: remove write-only _started field
coroutine/generator: assert on unreachable path in buffered await_resume
coroutine/generator: add elements_of tag and #include <ranges>
coroutine/generator: add empty() to bounded_container concept
> cmake: bump minimum Boost version to 1.79.0
> seastar_test: remove unnecessary headers
> cmake: bump minimum GnuTLS version to 3.7.4
> Merge 'reactor: add get_all_io_queues() method' from Travis Downs
tests: add unit test for reactor::get_all_io_queues()
reactor: add get_all_io_queues() method
reactor: move get_io_queue and try_get_io_queue to .cc file
> http: deprecate reply::done(), remove _response_line dead field
> core: Deprecate scattered_message
> ci: add workflow dispatch to tests workflow
> perf_tests: exit non-zero when -t pattern matches no tests
> Replace duplicate SEGV_MAPERR check in sigsegv_action() with SEGV_ACCERR.
> perf_tests: add total runtime to json output
> Merge 'Relax large allocation error originating from json_list_template' from Robert Bindar
implement move assignment operator for json_list_template
json_list_template copy assignment operator reserves capacity upfront
> perf_tests: add --no-perf-counters option
> Merge 'Fix to_human_readable_value() ability to work with large values' from Pavel Emelyanov
memory: Add compile-time test for value-to-human-readable conversion
memory: Extend list of suffixes to have peta-s
memory: Fix off-by-one in suffix calculation
memory: Mark to_human_readable_value() and others constexpr
> http: Improve writing of response_line() into the output
> Merge 'websocket: add template parameter for text/binary frame mode and implement client-side WebSocket' from wangyuwei
websocket: add template parameter for text/binary frame mode
websocket: impl client side websocket function
> file: Fix checks for file being read-only
> reactor: Make do_dump_task_queue a task_queue method
> Merge 'Implement fully mixed mode for output_stream-s' from Pavel Emelyanov
tests/output_stream: sample type patterns in sanitizer builds
tests/output_stream: extend invariant test to cover mixed write modes
iostream: allow unrestricted mixing of buffered and zero-copy writes
tests/output_stream: remove obsolete ad-hoc splitting tests
tests/output_stream: add invariant-based splitting tests
iostream: rename output_stream::_size to ::_buffer_size
> reactor_backend: replace virtual bool methods with const bool_class members
> resource: Avoid copying CPU vector to break it into groups
> perf_tests: increase overhead column precision to 3 decimal places
> Merge 'Move reactor::fdatasync() into posix_file_impl' from Pavel Emelyanov
reactor: Deprecate fdatasync() method
file: Do fdatasync() right in the posix_file_impl::flush()
file: Propagate aio_fdatasync to posix_file_impl
reactor: Move reactor::fdatasync() code to file.cc
reactor,file: Make full use of file_open_options::durable bit
file: Add file_open_options::durable boolean
file: Account io_stats::fsyncs in posix_file_impl::flush()
reactor: Move _fsyncs counter onto io_stats
> http: Remove connection::write_body()
Fix cdc writing unnecesary entries to it's log, like for example when Alternator deletes an item which in reality doesn't exist.
Originally @wps0 tackled this issue. This patch is an extension of his work. His work involved adding `should_skip` function to cdc, which would process a `mutation` object and decide, wherever changes in the object should be added to cdc log or not.
The issue with his approach is that `mutation` object might contain changes for more than one row. If - for example - the `mutation` object contains two changes, delete of non-existing row and create of non-existing row, `should_skip` function will detect changes in second item and allow whole `mutation` (BOTH items) to be added. For example (using python's boto3) running this on empty table:
```
with table.batch_writer() as batch:
batch.put_item({'p': 'p', 'c': 'c0'})
batch.delete_item(Key={'p': 'p', 'c': 'c1'})
```
will emit two events ("put" event and "delete" event), even though the item with `c` set to `c1` does not exist (thus can't be deleted). Note, that both entries in batch write must use the same partition key, otherwise upper layer with split them into separate `mutation` objects and the issue will not happen.
The solution is to do similar processing, but consider each change separated from others. This is tricky to implement due to a way cdc works. When cdc processes `mutation` object (containing X changes), it emits cdc entries in phases. Phase 1 - emit `preimage` (old state) for each change (if requested). Phase 2 - for each change emit actual "diff" (update / delete and so on). Phase 3 - emit `postimage` (new state).
We will know if change needs to be skipped during phase 2. By that time phase 1 is completed and preimage for the change is emited. At that moment we set a flag that the change (identified by clustering key value) needs to be skipped - we add a clustering key to a `ignore-rows` set (`_alternator_clustering_keys_to_ignore` variable) and continue normally. Once all phases finish we add a `postprocess` phase (`clean_up_noop_rows` function). It will go through generated cdc mutations and skip all modifications, for which clustering key is in `ignore-rows` set. After skipping we need to do a "cleanup" operation - each generated cdc mutation contain index (incremented by one), if we skipped some parts, the index is not consecutive anymore, so we reindex final changes.
There's a special case worth mentioning - Alternator tables without clustering keys. At that point `mutation` object passed to cdc can contain exactly one change (since different partition keys are splitted by upper layers and Alternator will never emit `mutation` object containing two (or more) changes with the same primary key. Here, when we decide the change is to be skipped we add empty `bytes` object to `ignore-rows` set. When checking `ignore-rows` set, we check if it's empty or not (we don't check for presence of empty `bytes` object).
Note: there might be some confusion between this patch and #28452 patch. Both started from the same error observation and use similar tests for validation, as both are easily triggered by BatchWrite commands (both needs `mutation` object passed to cdc to contain more than one single change). This issue tho is about wrong data written in cdc log and is fixed at cdc, where #28452 is about wrong way of parsing correct cdc data and is fixed at Alternator side of things. Note, that we need #28452 to truly verify (otherwise we will emit correct cdc entries, but Alternator will incorrectly parse them).
Note: to benefit / notice this patch you need `alternator_streams_increased_compatibility` flag turned on.
Note: rework is quite "broad" and covers a lot of ground - every operation, that might result in a no-change to the database state should be tested. An additional test was added - trying to remove a column from non-existing item, as well as trying to remove non-existing column from existing item.
Fixes: #28368
Fixes: SCYLLADB-1528
Fixes: SCYLLADB-538
Closesscylladb/scylladb#28544
* github.com:scylladb/scylladb:
alternator: remove unnecesary code
alternator: fix Alternator writing unnecesary cdc entries
alternator: add failing tests for Streams
Implements neccesary changes for Streams to work with tablet based tables.
- add utility functions to `system_keyspace` that helps reading cdc content from cdc log tables for tablet based base tables (similar api to ones for vnodes)
- remove antitablet `if` checks, update tests that fail / skip if tablets are selected
- add two tests to extensively test tablet based version, especially while manipulating stream count
Fixes#23838
Fixes SCYLLADB-463
Closesscylladb/scylladb#28500
* github.com:scylladb/scylladb:
alternator: add streams with tablets tests
alternator: remove antitablet guards when using Streams
alternator: implement streams for tablets
treewide: add cdc helper functions to system_keyspace
alternator: add system_keyspace reference
`stream_arn` object holds a full ARN as `std::string` and two
`std::string_view` fields (`table_name_` and `keyspace_name_`) pointing
into ARN itself. This prevents object from being safely copied
(as in that case both `table_name_` and `keyspace_name_` will point into
original object's ARN). Similar issue might happen with move, when
ARN contains string short enough for small string optimization to
kick in (although in practice this is not possible, as ARN has
requirements which make it's minimal length above 15 characteres -
current limit for small string optimizations in most popular string
libraries).
The patch drops `std::string_view` objects in favor of integer offsets
and sizes. The offset equal to 0 means beginning of ARN string. The api
is preserved - both `table_name` and `keyspace_name` function will
return `std::string_view` reconstructed on the fly.
Closesscylladb/scylladb#29507
When a table is loaded on startup during a vnodes-to-tablets migration
(forward or rollback), the `table_populator` runs a resharding
compaction.
Set the migration virtual task as parent of the resharding task. This
enables users to easily find all node-local resharding tasks related to
a particular migration.
Make `migration_virtual_task::make_task_id()` public so that the
`distributed_loader` can compute the migration's task ID.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `table_populator` uses a `migrate_to_tablets` flag to distinguish
normal tables from tables under vnodes-to-tablets migration (forward
path), since the two require different resharding.
The next patch will set the parent info of migration-related resharding
compaction tasks so they appear as children of the migration virtual
task. For that, the table populator needs to recognize not only
migrations in the forward path, but rollbacks as well.
Replace the flag with a tri-state `migration_direction` enum (none,
forward, rollback).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add a virtual task that exposes in-progress vnodes-to-tablets migrations
through the task manager API.
The task is synthesized from the current migration state, so completed
migrations are not shown. Progress is reported as the number of nodes
that currently use tablets: it increases on the forward path and
decreases on rollback. For simplicity, per-node storage modes are not
exposed in the task status; callers that need them should use the
migration status REST endpoint.
Unlike regular tasks that use time-based UUIDs, this task uses
deterministic named UUIDs derived from the keyspace names. This keeps
the implementation simple (no need to persist them) and gives each
keyspace a stable task ID. The downside is that the start time of each
task is unknown and repeated migrations of the same keyspace
(migration -> rollback -> new migration) cannot be distinguished.
Introduce a new task manager module to keep them separate from other
tasks.
Add support for `wait()`. While its practical value is debatable
(migration is a manual procedure, rolling restart will interrupt it), it
keeps the task consistent with the task manager interface.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`storage_service::get_tablets_migration_status()` reads a group0 virtual
table, so it requires group0 to be initialized.
When invoked via the migration REST API, this condition is satisfied
since the API is only available after joining group0. However, once this
function is integrated into the task API later in this series, the
assumption will no longer hold, as the task API is exposed earlier in
the startup process.
Add a guard to detect this condition and return a clear error message.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`storage_service::get_tablets_migration_status()` returns the
keyspace-level migration status, indicating whether migration has not
started, is in progress, or has completed, and for migrating keyspaces
also returns per-node migration statuses. Rename it to
`get_tablets_migration_status_with_node_details()` and introduce a new
`get_tablets_migration_status()` that returns only the keyspace-level
status.
This prepares the function for reuse in the next patches, which will add
a virtual task for vnodes-to-tablets migrations. Several task-manager
paths will only need the keyspace-level migration state, not per-node
information.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Using a string was sufficient while this status was only exposed through
the REST API, but the next patches will also consume it internally.
Use an enum for the internal representation and convert it back to the
existing string values in the REST API.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The UUID class already provides `is_timestamp()` for identifying
time-based (version 1) UUIDs. Add the analogous `is_name_based()`
predicate for version 3 (name-based) UUIDs, along with a test.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add tests for Streams, when table uses tablets underneath.
One test verifies filtering using CHILD_SHARDS feature.
Other one makes sure we get read all data while the table
undergoes tablet count change.
Add `--tablet-load-stats-refresh-interval-in-seconds=1`
to `alternator/run` script, as otherwise newly added tests will fail.
The setting changes how often scylla refreshes tablet metadata.
This can't be done using `scylla_config_temporary`, as
1) default is 60 seconds
2) scylla will wait full timeout (60s) to read configuration variable again.
Remove `if` condition, that prevented tables with tablets
working with Streams.
Remove a test, that verifies, that Alternator will reject
tables with tablets underneath working with Streams feature enabled
on them.
Update few tests, that were expected to fail on tablets to enable their
normal execution.
Add helper functions to `system_keyspace` object, that deal
with reading cdc content for tablet based table's.
`read_cdc_for_tablets_current_generation_timestamp` will read current
generation's timestamp.
`read_cdc_for_tablets_versioned_streams` will build
timestamp -> `cdc::streams_version` map similar to how
`system_distributed_keyspace::cdc_get_versioned_streams` works.
We're adding those helper functions, because their siblings in
`system_distributed_keyspace` work only, when base table is backed up
by vnodes. New additions work only, when base table is backed up
by tablets.
Add a reference to `system_keyspace` object to `executor` object in
alternator. The reference is needed, because in future commit
we will add there (and use) helper functions that read `cdc_log` tables
for tablet based tables similarly to already existing siblings
for vnodes living in `system_distributed_keyspace`.
Work in this patch is a result of two bugs - spurious MODIFY event, when
remove column is used in `update_item` on non-existing item and
spurious events, when batch write item mixed noop operations with
operations involving actual changes (the former would still emit
cdc log entries).
The latter issue required rework of Piotr Wieczorek's algorithm,
which fixed former issue as well.
Piotr Wieczorek previously wrote checks, that should
prevent unnecesary cdc events from being written. His implementation
missed the fact, that a single `mutation` object passed to cdc code
to be analysed for cdc log entries can contain modifications for
multiple rows (with the same timestamp - for example as a result
to BatchWriteItem call). His code tries to skip whole `mutation`,
which in such case is not possible, because BatchWriteItem might have
one item that does nothing and second item that does modification
(this is the reason for the second bug).
His algorithm was extended and moved. Originally it was working
as follows - user would sent a `mutation` object with some changes to
be "augmented". The cdc would process those changes and built a set of
cdc log changes based on them, that would be added to cdc log table.
Piotr added a `should_skip` function, which processes user changes and
tried to determine if they all should be dropped or not.
New version, instead of trying to skip adding rows to
cdc log `mutation` object, builds a rows-to-ignore set.
After whole cdc log `mutation` object is completed, it processes it
and go through it row by row. Any row that was previously added to
a `rows_to_ignore` set will now be removed. Remaining rows are written to
new cdc log `mutation` with new clustering key
(`cdc$batch_seq_no` index value should probably be consecutive -
we just want to be safe here) and returns new `mutation` object to
be sent to cdc log table.
The first bug is fixed as a side effect of new algorithm,
which contains more precise checks detecting, if given
mutation actually made a difference.
Fixes: #28368
Fixes: SCYLLADB-538
Fixes: SCYLLADB-1528
Refs: #28452
Garbage collected sstables created during incremental compaction are
deleted only at the end of the compaction, which increases the memory
footprint. This is inefficient, especially considering that the related
input sstables are released regularly during compaction.
This commit implements incremental release of GC sstables after each
output sstable is sealed. Unlike regular input sstables, GC sstables
use a different exhaustion predicate: a GC sstable is only released
when its token range no longer overlaps with any remaining input
sstable. This is because GC sstables hold tombstones that may shadow
data in still-alive overlapping input sstables; releasing them
prematurely would cause data resurrection.
Fixes#5563Closesscylladb/scylladb#28984
Add failing tests for Streams functionality.
Trying to remove column from non-existing item is producing
a MODIFY event (while it should none).
Doing batch write with operations working on the same partition,
where one operation is without side effects and second with
will produce events for both operations, even though first changes nothing.
First test has two versions - with and without clustering key.
Second has only with clustering key, as we can't produce
batch write with two items for the same partition -
batch write can't use primary key more than once in single call.
We also add a test for batch write, where one of three operations
has no observable side effects and should not show up in Streams
output, but in current scylla's version it does show.
ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: estimated_histogram_with_max. It is both CPU- and
memory-efficient, and it can be exported as Prometheus native histograms.
Its main limitation (which also has benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.
The end goal is to replace all uses of estimated_histogram in the codebase.
That migration requires a few small API adjustments, so it is done in steps.
This PR replaces estimated_histogram for CAS contention.
The PR includes a patch that adds functionality to the base approx_exponential_histogram, which will be used by the API.
The specific histograms are defined in a single place and cover the range 1-100; this makes future changes easy.
**New feature, no need to backport**
Closesscylladb/scylladb#29017
* github.com:scylladb/scylladb:
storage_proxy: migrate CAS contention histograms to estimated_histogram_with_max
estimated_histogram.hh: Add bucket offset and count to approx_exponential_histogram
After stopping the topology coordinator, a new topology coordinator
may not yet be started when get_coordinator_host() is called. Make
the function always retry via wait_for so that every caller is
protected against this race.
Fixes SCYLLADB-1553
Closesscylladb/scylladb#29489
Container names were generated as {name}-{pid}-{counter}, where the
counter is a per-process itertools.count. This scheme breaks across CI
runs on the same host: if a prior job was killed abruptly (SIGKILL,
cancellation) its containers are left running since --rm only removes
containers on exit. A subsequent run whose worker inherits the same PID
(common in containerized CI with small PID namespaces) and reaches the
same counter value will collide with the orphaned container.
Replace pid+counter with uuid.uuid4(), which generates a random UUID,
making names unique across processes, hosts, and time without any shared
state or leaking host identifiers.
Fixes: SCYLLADB-1540
Closesscylladb/scylladb#29509
The PR serves two purposes.
First, it makes the flag usage be consistent across multiple ways to load sstables components. For example, the sstable::load_metadata() doesn't set it (like .load() does) thus potentially refusing to load "corrupted" components, as the flag assumes.
Second, it removes the fanout of db.get_config().ignore_component_digest_mismatch() over the code. This thing is called pretty much everywhere to initialize the sstable_open_config, while the option in question is "scylla state" parameter, not "sstable opening" one.
Code cleanup, not backporting
Closesscylladb/scylladb#29513
* github.com:scylladb/scylladb:
sstables: Remove ignore_component_digest_mismatch from sstable_open_config
sstables: Move ignore_component_digest_mismatch initialization to constructor
sstables: Add ignore_component_digest_mismatch to sstables_manager config
Write the MinIO server log directly to tempdir_base (testlog/<arch>/)
instead of the per-server temp directory that gets destroyed on
shutdown. This preserves the log for Jenkins artifact collection,
helping debug S3-related flaky test failures like the
stcs_reshape_overlapping_s3_test hang (SCYLLADB-1481).
Closesscylladb/scylladb#29458
With the latest changes, there are a lot of code that is redundant in the test.py. This PR just cleans this code.
Also, it narrows using dynamic scope for fixtures to test/alternator and test/cqlpy. All the rest by default will have module scope.
test.py will be a wrapper for pytest mostly for CI use. As for now test.py have important part of calculating the number of threads to start pytest with. This is not possible to do in pytest itself.
No backport needed, framework enhancement only.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-666Closesscylladb/scylladb#28852
* github.com:scylladb/scylladb:
test.py: remove testpy_test_fixture_scope
test.py: add logger for 3rd party service
test.py: delete dead code in test.py
Users ought to have possibility to create the local index for Vector Search
based only on a part of the partition key. This commits provides this by
removing requirements of 'full partition key only' for custom local index.
The commit updates docs to explain that local vector index can use only a part
of the partition key.
The commit implements cqlpy test to check fixed functionality.
Fixes: SCYLLADB-953
Needs to be backported to 2026.1 as it is a fix for local vector indexes.
Closesscylladb/scylladb#28931
Decrease the default connection timeout to 3s to better align with the
default CQL query timeout of 10s.
The previous timeout allowed only one failover request in high availability
scenario before hitting the CQL query timeout.
By decreasing the timeout to 3s, we can perform up to three failover requests
within the CQL query timeout, which significantly improves the chances of
successfully completing the query in high availability scenarios.
Fixes: SCYLLADB-95
Add option `vector_store_unreachable_node_detection_time_in_ms` to
control parameters related to detecting unreachable vector store nodes.
This parameter is used to set the TCP connect timeout, keepalive
parameters, and TCP_USER_TIMEOUT. By configuring these parameters,
we can detect unreachable vector store nodes faster and trigger
failover mechanisms in a timely manner.
The audit table had caching enabled by default, which provides no
value since audit data is write-heavy and rarely read back through
the cache. This wastes cache space that could be used for more
important user data.
Disable caching by setting keys and rows_per_partition to NONE and
enabled to false, consistent with get_disabled_caching_options()
and other system tables such as system.batchlog,
system.large_partitions, and CDC log tables.
Closesscylladb/scylladb#29506
This series adds support for vector search in Alternator based on the existing implementation in CQL.
The series adds APIs for `CreateTable` and `UpdateTable` to add or remove vector indexes to Alternator tables, `DescribeTable` to list them and check the indexing status, and `Query` to perform a vector search - which contacts the vector store for the actual ANN (approximate nearest neighbor) search.
Correct functionality of these features depend on some features of the the vector store, that were already done (see https://github.com/scylladb/vector-store/pull/394).
This initial implementation is fully functional, and can already be useful, but we do not yet support all the features we hope to eventually support. Here are things that we have **not** done yet, and plan to do later in follow-up pull requests:
1. Support a new optimized vector type ("V") - in addition to the "list of numbers" type supported in this version.
2. Allow choosing a different similarity function when creating an index, by SimilarityFunction in VectorIndex definition.
3. Allow choosing quantization (f32/f16/bf16/i8/b1) to ask the vector index to compress stored vectors.
4. Support oversampling and rescoring, defined per-index and per-query.
5. Support HNSW tuning parameters — maximum_node_connections, construction_beam_width, search_beam_width.
6. Support pre-filtering over key columns, which are available at the vector store, by sending the filter to the vector store (translated from DynamoDB filter syntax to the vector's store's filter syntax). A decision still need to be made if this will use KeyConditionExpression or FilterExpression. This version supports only post-filtering (with `FilterExpression`).
7. Support projecting non-key attributes into the index (Projection=INCLUDE and Projection=ALL), and then 1. pre-filtering using these attributes, and 2. efficiently return these attributes (using Select=ALL_PROJECTED_ATTRIBUTES, which today returns just the key columns).
8. Optimize the performance of `Query`, which today is inefficient for Select=ALL_ATTRIBUTES because it serially retrieves the matching items one at a time.
9. Returning the similarity scores with the items (the design proposes ReturnVectorSearchSimilarity).
10. Add more vector-search-specific metrics, beyond the metric we already have counting Query requests. For example separate latency and request-count metrics for vector-search Queries (distinct from GSI/LSI queries), and a metric accumulating the total Limit (K) across all vector search queries.
11. Consider how (and if at all) we want to run the tests in test/alternator/test_vector.py that need the vector store in the CI. Currently they are skipped in CI and only run manually (with `test/alternator/run --vs test_vector`).
12. UpdateTable 'Update' operation to modify index parameters. Only some can be modified, e.g., Oversampling.
13. Support for "local index" (separate index for each partition).
14. Make sure that vector search and Streams can be enabled concurrently on the same table - both need CDC but we need to verify that one doesn't confuse the other or disables options that the other needs. We can only do this after we have Alternator Streams running on tablets (since vector store requires tablets).
Testing the new Alternator vector search end-to-end requires running both Scylla and the vector store together. We will have such end-to-end tests in the vector store repository (see https://github.com/scylladb/vector-store/pull/392), but we also add in this pull request many end-to-end tests written in Python, that can be run with the command "test/alternator/run --vs test_vector.py". The "--vs" option tells the run script to run both Scylla and the vector store (currently assumed to be in `.../vector-store/target/release/vector-store`). About 65% of the tests in this pull request check supported syntax and error paths so can run without the vector store, while about 35% of the tests do perform actual Query operations and require the vector store to be running. Currently, the tests that do require the vector store will not get run by CI, but can be easily re-run manually with `test/alternator/run --vs test_vector.py`.
In total, this series includes 78 functional tests in 2200 lines of Python code.
This series also includes documentation for the new Alternator feature and the new APIs introduced. You can see a more detailed design document here: https://docs.google.com/document/d/1cxLI7n-AgV5hhH1DTyU_Es8_f-t8Acql-1f58eQjZLY/edit
Two patches in this series split the huge alternator/executor.cc, after this series continued to grow it and it reached a whoppng 7,000 lines. These patches are just reorganization of code, no functional changes. But it's time that we finally do this (Refs #5783), we can't just continue to grow executor.cc with no end...
Closesscylladb/scylladb#29046
* github.com:scylladb/scylladb:
test/alternator: add option to "run" script to run with vector search
alternator: document vector search
test/alternator: fix retries in new_dynamodb_session
test/alternator: test for allowed characters in attribute names
test/alternator: tests for vector index support
alternator, vector: add validation of non-finite numbers in Query
alternator: Query: improve error message when VectorSearch is missing
alternator: add per-table metrics for vector query
alternator: clean up duplicated code
alternator: fix default Select of Query
alternator: split executor.cc even more
alternator: split alternator/executor.cc
alternator: validate vector index attribute values on write
alternator: DescribeTable for vector index: add IndexStatus and Backfilling
alternator: implement Query with a vector index
alternator: fix bug in describe_multi_item()
alternator: prevent adding GSI conflicting with a vector index
alternator: implement UpdateTable with a vector index
alternator: implement DescribeTable with a vector index
alternator: implement CreateTable with a vector index
alternator: reject empty attribute names
cdc: fix on_pre_create_column_families to create CDC log for vector search
This PR removes the power-of-two token constraint from vnodes-to-tablets migrations, allowing clusters with randomly generated tokens to migrate without manual token reassignment.
Previously, migrations required vnode tokens to be a power of two and aligned. In practice, these conditions are not met with Scylla's default random token assignment, so the constraint is a blocker for real-world use. With the introduction of arbitrary tablet boundaries in PR #28459, the tablet layer can now support arbitrary tablet boundaries. This PR builds on that capability to allow arbitrary vnode tokens during migration.
When the highest vnode token does not coincide with the end of the token ring, the vnode wraps around, but tablets do not support that. This is handled by splitting it into two tablets: one covering the tail end of the ring and one covering the beginning.
Testing has been updated accordingly: existing cluster tests now use randomly generated tokens instead of precomputed power-of-two values, and a new Boost test validates the wrap-around tablet boundary logic.
Fixes SCYLLADB-724.
New feature, no backport is needed.
Closesscylladb/scylladb#29319
* github.com:scylladb/scylladb:
test: Use arbitrary tokens in vnodes->tablets migration tests
test: boost: Add test for wrap-around vnodes
storage_service: Support vnodes->tablets migrations w/ arbitrary tokens
storage_service: Hoist migration precondition
With migration to pyest this fixture is useless. Removing and setting
the session to the module for the most of the tests.
Add dynamic_scope function to support running alternator fixtures in
session scope, while Test and TestSuite are not deleted. This is for
migration period, later on this function should be deleted.
With migration of preparation environment and starting 3rd party services
to the pytest, they're output the logs to the terminal. So this PR
binds them their own log file to avoid polluting the terminal.
With the latest changes, there are a lot of code that is redundant in
the test.py. This PR just cleans this code.
Changes in other files are related to cleaning code from the test.py,
especially with redundant parameter --test-py-init and moving
prepare_environment to pytest itself.
The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism.
Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries.
Fixes SCYLLADB-1542
This is a CI stability issue and should be backported.
Closesscylladb/scylladb#29504
* github.com:scylladb/scylladb:
test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
test: fix proc_utils.cc formatting from previous commit
test: lib: use unique container name per retry attempt
test: lib: fix broken retry in start_docker_service
Add a `CHILD_SHARDS` filter to `DescribeStream` command.
When used, user need to pass a parent stream shard id as
json's ShardFilter.ShardId field. DescribeStream will
then return only list of stream shards, that are direct
descendants of passed parent stream shard.
Each stream shard cover a consecutive part of token space.
A stream shard Q is considered to be a child of stream shard W,
when at least one token belongs to token spaces from both streams.
The filtering algorithm itself is somewhat complicated - more details
in comments in streams.cc.
CHILD_SHARDS is a Amazon's functionality and is required by KCL.
Add unit tests.
Fixes: #25160Closesscylladb/scylladb#28189
Increase the non-AWS wait in the TTL streams test to reduce vnode CI flakes caused by delayed expiration visibility.
Fixes SCYLLADB-1556
Closesscylladb/scylladb#29516
This will make debugging of stalled tablet transitions easier. We saw
several issues when topology state machine was blocked by active
tablet migrations, which was not obvious at first glance of the
logs. Now it will be east to tell if tablet transitions are blocking
progress and which transitions are stuck.
Closesscylladb/scylladb#28616
drain() signals the postponed_reevaluation condition variable to terminate
the postponed_compactions_reevaluation() coroutine but does not await its
completion. When enable() is called afterwards, it overwrites
_waiting_reevalution with a new coroutine, orphaning the old one. During
shutdown, really_do_stop() only awaits the latest coroutine via
_waiting_reevalution, leaving the orphaned coroutine still alive. After
sharded::stop() destroys the compaction_manager, the orphaned coroutine
resumes and reads freed memory (is_disabled() accesses _state).
Fix by introducing stop_postponed_compactions(), awaiting the reevaluation coroutine in
both drain() and stop() after signaling it, if postponed_compactions_reevaluation() is running.
It uses an std::optional<future<>> for _waiting_reevalution and std::exchange to leave
_waiting_reevalution disengaged when postponed_compactions_reevaluation() is not running.
This prevents a race between drain() and stop().
While at it, fix typo in _waiting_reevalution -> _waiting_reevaluation.
Fixes: SCYLLADB-1463
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#29443
Add to test/alternator/run the option "-vs" which runs alongside with
Scylla a vector store, to allow running Alternator tests with vector
indexing.
To get the vector store, do
git clone git@github.com:scylladb/vector-store.git
cargo build --release
"run -vs" looks for an executable in ../vector-store/target/*/vector-store
but can also be overridden by the VECTOR_STORE environment variable.
test/alternator/run runs the vector store exactly like it runs Scylla -
in a temporary directory, on a temporary IP address in the localhost
subnet (127.0.0/8), killing it when the test end, and showing the output
of both programs (Scylla and vector store). These transient runs of
Scylla and vector store are configured to be able to communicate to
each other.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a new document, docs/alternator/vector-search.md, on the
new vector search feature in Alternator. It introduces this feature, and
the DynamoDB APIs that we extended to support it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The new_dynamodb_session() function had a bug which we never noticed
because we hardly used it, but it became more noticable when the new
test/alternator/test_vector.py started to use it:
By default, boto3 retries a request up to 9 times when it encounters
a retriable error (such as an Internal Server Error). We don't want such
retries in our tests - it makes failures slower, but more importantly
it can hide "flaky" bugs by retrying 9 times until it happens to succeed.
The new_dynamodb_session() had code (copied from the dynamodb fixture)
to set boto3's "max_attempts" configuration to 0, to disable this retry.
But this code had an incorrect "if" to only be done if we're testing on
"localhost". This is wrong: We almost never use "localhost" as the
target of the test; Both test/cqlpy/run and test.py pick an IP address
in the localhost subnet (127/8) and uses that IP address - not the string
"localhost".
This bug only existed in new_dynamodb_session() - the more commonly used
"dynamodb" fixture didn't have this bug.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
One of the tests in the previous patch checked that strange characters
are allowed in attribute names used for vector indexing. It turns out
we never had a test that verifies that regardless of vector indexes -
any character whatsoever is allowed in attribute names. This is
different from table names which are much more limited.
So this patch adds the missing test.
As usual, the new test also passes on DynamoDB, showing that these
stange characters in attribute names are also allowed by DynamoDB.
In this patch we add a large collection of basic functional tests for the
vector index support, covering the CreateTable, UpdateTable, DescribeTable
and Query operations and the various ways in which those are allowed to
work - or expected to fail. These tests were written in parallel with
writing the code so they (hopefully) cover all the corner cases considered
during development, and make sure these corner cases are all handled
correctly and will not regress in the future.
Some of these tests do not involve querying of the index and focus on
the structure of requests and the kind of syntax allowed. But other tests
are end-to-end, requiring the vector store to be running and trying to
index Alternator data and query it. These tests are marked
"needs_vector_store", and are immediately skipped in Scylla is not
configured to connect to a vector store. In a later patch we'll add a
an option to test/alternator/run to be able to run these end-to-end
tests by automatically running both Scylla and the Vector Store.
We'll have additional end-to-end tests in the vector-store repository.
Note that vector search is a new API feature that doesn't exist in DynamoDB,
so we are adding new parameters and outputs to existing operations. The AWS
SDKs don't normally allow doing that, so the test added here begins by
teaching the Python SDK to use the new APIs we added. This piece of code
can also be used by end-users to use vector search (at least in Python...)
before we officially add this support to ScyllaDB's SDK wrappers.
Non-finite numbers (Inf, NaN) don't make sense in vector search, and
also not allowed in the DynamoDB API as numbers. But the parsing code
in Query's QueryVector accepted "Inf" and "NaN" and then failed to
send the request to the vector store, resulting in a strange error
message. Let's fix it in the parsing code.
We have a test (test_query_vectorsearch_queryvector_bad_number_string)
that verifies this fix.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, if we attempt a Query with IndexName is a vector index
but forget a "VectorSearch" parameter, the error is misleading: The code
expects a GSI or LSI, and when it can't find a GSI or LSI with that name,
it reports that the index is missing. But this is not helpful. So in this
patch we produce a more helpful message: That the index does exist, and
is a vector index, so a "VectorSearch" parameter is mandatory and is
missing.
The per-table metrics for Query were not incremented for the
vector variant of the Query operations, only the global metrics were
incremented. This patch fixes this oversight, and add a test that
reproduces it (the new test fails before this patch, and passes after).
De-duplicate some code introduced in earlier patches, such a two
nearly-identical loops over the indexes (one to check if there is a
vector index, the second to get its dimensions), and two nearly-
identical chunks of code to get the item contents when there is or
there isn't a clustering key.
There should be no functional changes in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In earlier patches, when Query'ing a vector index, we set the default
Select to ALL_ATTRIBUTES. However, according to the DynamoDB documentation
for Query,
"If neither Select nor ProjectionExpression are specified, DynamoDB
defaults to ALL_ATTRIBUTES when accessing a table, and
ALL_PROJECTED_ATTRIBUTES when accessing an index."
This default should also apply to vector index, so this patch fixes this.
The new behavior is not only more compatible with DynamoDB, it is also
much more efficient by default, as ALL_PROJECTED_ATTRIBUTES does not need
to read from the base table - it returns the results that the vector store
returned. Of course, if the user needs the more efficient ALL_ATTRIBUTES
this option is still available - it's just no longer the default.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch continues the effort to split the huge executor.cc (5000
lines before this patch) even more.
In this patch we introduce a new source file, executor_util.cc, for
various utility functions that are used for many different operations
and therefore are useful to have in a header file. These utility
functions will now be in executor_util.cc and executor_util.hh -
instead of executor.cc and executor.hh.
Various source files, including executor.cc, the executor_read.cc
introduced in the previous patch, as well as older source files like
as streams.cc, ttl.cc and serialization.cc, use the new header file.
This patch removes over 700 lines of code from executor.cc, and
also removes a large amount of utility functions declerations from
executor.hh. Originally, executor.hh was meant to be about the
interface that the Alternator server needs to *execute* the different
DynamoDB API operations - and after this patch it returns closer to
this original goal.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Already six years ago, in #5783, we noticed that alternator/executor.cc
has grown too large. The previous patches added hundreds of more lines
to it to implement vector search, and it reached a whopping 7,000 lines
of code. This is too much.
This patch splits from executor.cc two major chunks:
1. The implementation of **read** requests - GetItem, BatchGetItem,
Query (base table, GSI/LSI, and vector-search), and Scan - was
moved to a new source file alternator/executor_read.cc.
The new file has 2,000 lines.
2. Moved 250 lines of template functions dealing with attribute paths
and maps of them to a new header file, attribute_path.hh.
These utilities are used for many different operations - various
read operations use them for ProjectionExpression, and UpdateItem
uses them for modifications to nested attributes, so we need the
new header file from both executor.cc and executor_read.cc
The remaining executor.cc is still pretty big, 5,000 lines, and
contains write operations (PutItem, UpdateItem, DeleteItem,
BatchWriteItem) as well as various table and other operations, and
also many utility functions used by many types of operations, so
we can later continue this refactoring effort.
Refs #5783
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Extend system_info_encryption to encrypt system.raft SSTables.
system.raft contains the Raft log, which may hold sensitive user data
(e.g. batched mutations), so it warrants the same treatment as
system.batchlog and system.paxos.
During upgrade, existing unencrypted system.raft SSTables remain
readable. Existing data is rewritten encrypted via compaction, or
immediately via nodetool upgradesstables -a.
Update the operator-facing system_info_encryption description to
mention system.raft and add a focused test that verifies the schema
extension is present on system.raft.
Fixes: CUSTOMER-268
Backport: 2026.1 - closes an encryption-at-rest coverage gap: system.raft may persist sensitive user-originated data unencrypted; backport to the current LTS.
Closesscylladb/scylladb#29242
There's a bunch of db::config options that are used by cql3/statements/ code. For that they use data_dictionary/database as a proxy to get db::config reference. This PR moves most of these accessed options onto cql_config
Options migrated to cql_config:
1. select_internal_page_size
2. strict_allow_filtering
3. enable_parallelized_aggregation
4. batch_size_warn_threshold_in_kb
5. batch_size_fail_threshold_in_kb
6. 7 keyspace replication restriction options
7. 2 TWCS restriction options
8. restrict_future_timestamp
9. strict_is_not_null_in_views (with view_restrictions struct)
10. enable_create_table_with_compact_storage
Some options need special treatment and are still abused via database, namely:
1. enable_logstor
2. cluster_name
3. partitioner
4. endpoint_snitch
Fixing components inter-dependencies, not backporting
Closesscylladb/scylladb#29424
* github.com:scylladb/scylladb:
cql3: Move enable_create_table_with_compact_storage to cql_config
cql3: Move strict_is_not_null_in_views to cql_config
cql3: Move restrict_future_timestamp to cql_config
cql3: Move TWCS restriction options to cql_config
cql3: Move keyspace restriction options to cql_config
cql3: Move batch_size_fail_threshold_in_kb to cql_config
cql3: Move batch_size_warn_threshold_in_kb to cql_config
cql3: Move enable_parallelized_aggregation to cql_config
cql3: Move strict_allow_filtering to cql_config
cql3: Move select_internal_page_size to cql_config
test: Fix cql_test_env to use updateable cql_config from db::config
cql3: Add cql_config parameter to parsed_statement::prepare()
`system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables.
This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_*` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades.
When the cluster feature is enabled, each node drops the old system large_* tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables.
Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records.
1. **keys: move key_to_str() to keys/keys.hh** — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable
2. **sstables: add LargeDataRecords metadata type (tag 13)** — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation
3. **large_data_handler: rename partition_above_threshold to above_threshold_result** — generalize the struct for reuse
4. **large_data_handler: return above_threshold_result from maybe_record_large_cells** — separate booleans for cell size vs collection elements thresholds
5. **sstables: populate LargeDataRecords from writer** — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable`
6. **test: add LargeDataRecords round-trip unit tests** — verify write/read, top-N bounding, below-threshold behavior
7. **db: call initialize_virtual_tables from shard 0 only** — preparatory refactoring to enable cross-shard coordination
8. **db: implement large_data virtual tables with feature flag gating** — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276
* Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport
Closesscylladb/scylladb#29257
* github.com:scylladb/scylladb:
db: implement large_data virtual tables with feature flag gating
db: call initialize_virtual_tables from shard 0 only
test: add LargeDataRecords round-trip unit tests
sstables: populate LargeDataRecords from writer
large_data_handler: return above_threshold_result from maybe_record_large_cells
large_data_handler: rename partition_above_threshold to above_threshold_result
sstables: add LargeDataRecords metadata type (tag 13)
sstables: add fmt::formatter for large_data_type
keys: move key_to_str() to keys/keys.hh
The ignore_component_digest_mismatch flag is now initialized at sstable construction
time from sstables_manager::config (which is populated from db::config at boot time).
Remove the flag from sstable_open_config struct and all call sites that were setting
it explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Initialize the ignore_component_digest_mismatch flag from sstables_manager::config
in the sstable constructor initializer list instead of in load(). This ensures the
flag value is set at construction time when the manager config is available, rather
than at load time. Mark the member const to reflect its immutability after construction.
Fixes the bootstrap path which now correctly reads the flag from manager config
initialized from db::config at boot time, instead of using the default value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy the ignore_component_digest_mismatch flag from db::config to sstables_manager::config
during database initialization. This makes the flag available early in the boot process,
before SSTables are loaded, enabling later commits to move the flag initialization from
load-time to construction-time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a table has a vector index, writes to the indexed attribute
(via PutItem, UpdateItem, or BatchWriteItem) must supply a value that
is a vector of the appropriate length: It must be a list of exactly the
declared number of elements, where each element is a numeric type ("N")
representable as a 32-bit float. Before this patch, invalid values were
silently accepted and the item was simply not indexed (it was skipped
by the vector store when it read this item). Now these writes are
rejected with a ValidationException.
This is analogous to the existing validation of GSI/LSI key attribute
values - in DynamoDB after a certain attribute becomes the key of a
GSI or LSI, the user is no longer allowed to write the same type.
The implementation we add here is also analogous to the implementation
of the GSI/LSI key validation. The GSI/LSI key validation is done
by validate_value_if_index_key / si_key_attributes, and in this
patch we add the vector-index parallels: vector_index_attributes()
collects the attribute name and declared dimensions for every vector
index in the schema, and validate_value_if_vector_index_attribute()
enforces the type limitations.
For efficiency in the common case where a table has no vector indexes
and no GSIs/LSIs, both validation functions are out-of-line and each
call site guards the call with an explicit empty() check, so no
function-call overhead is incurred when there is nothing to validate.
For UpdateItem, the map of vector index attributes is cached in
update_item_operation (alongside the existing _key_attributes cache)
to avoid recomputing it on every call to update_attribute().
Add to DescribeTable's output for VectorIndexes two fields - IndexStatus
and Backfilling - which are intended to exactly mirror these two fields
that exist for GlobalSecondaryIndexes:
When a vector index is added, IndexStatus is "CREATING" before the index
is usable, and "ACTIVE" when it is finally usable for a Query. During
"CREATING" phase, "Backfilling" may be set to true when the index is
currently being backfilled (the table is scaned and an index is built).
A user is expected to call DescribeTable in a loop after creating a
vector index (via either CreateTable and UpdateTable) and only call
Query on the index after the IndexStatus is finally ACTIVE. Calling
Query earlier, while IndexStatus is still CREATING, will result in an
error.
In the current implementation, Alternator does not track the state of the
vector index, so it needs to contact the vector store to inquire about
the state of the index - using a new function introduced in this patch
that uses an existing vector-store API. This makes DescribeTable slower
on tables that have vector indexes, because the vector store is contacted
on every DescribeTable call.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We introduce to the Query request a new "VectorSearch" parameter, which
take a mandatory "QueryVector" (a value which must be a numeric vector
of the right length) and "Limit".
The "Limit" of a vector search (Query with VectorSearch) determines the
number of nearest neighbors to return, and does not allow pagination
(ExclusiveKeyStart is not allowed). ConsistentRead=True is also not
allowed on a vector search query.
The "Select"/"ProjectionExpression"/"AttributesToGet" parameters are
also supported, requesting which attributes to fetch. Using Select=
ALL_PROJECTED_ATTRIBUTES means read only the attributes found in the
vector index - currently only the key columns - so it is significantly
faster than ALL_ATTRIBUTES because it doesn't require reading the items
from the base table.
The "FilterExpression" parameter is also supported. Like in DynamoDB's
traditional Query, this does post-filtering, i.e., removing some of the
results returned by the vector index that don't match the filter, and
as a result fewer than Limit results may be returned.
Pre-filtering (done on the vector store, and always returns Limit
results) is not yet implemented.
In commit a55c5e9ec7, the function
describe_multi_item() got a new item_callback parameter, that can
be used to calculate the size of the item. This new parameter
has a default, an empty noncopyable_function. But an empty
noncopyable_function shouldn't be called - exactly like std::function,
it throws std::bad_function_call if called when empty.
So describe_multi_item() should only call this item_callback if
it's not empty.
This became a problem in the next patch, implementing vector search
query, which called describe_multi_item with the default item_callback.
But in general, the function should be usable with the default parameter
(or we shouldn't have defined a default value for this parameter!).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
All the "indexes" we implement in Alternator - GSI, LSI and the new
vector index - share the same IndexName namespace, which we'll use in
Query to refer to the index. In the previous patch we already prevented
adding a vector index with the same name as an existing GSI or LSI.
In this patch we also prevent the reverse - adding a GSI with the name
of an existing vector index.
Additionally, one cannot add a GSI on a key that is already the key of
a vector index: The types conflict: The key of a vector index must be a
vector column, while the key of a GSI must have a standard key type
(string, binary or number).
We have tests for this later, this the big test patch.
After an earlier patch allowed CreateTable to create vector indexes
together with a table, in this patch we add to UpdateTable the ability
to add a new vector index to an existing table, as well as the ability
to delete a vector index from an existing table.
The implementation is inspired by DynamoDB's syntax for GSI - just like
GSI has GlobalSecondaryIndexUpdates with "Create" and "Delete" operations,
for vector indexes we have VectorIndexUpdates supporting Create and
Delete. "Update" is not yet supported - we didn't implement yet any
parameter that can be updated - but we can easily implement it in the
future.
ScyllaDB supports the "vector search" feature in CQL.
In this patch we start the path to adding vector search support also to
Alternator.
In this patch, we implement CreateTable support - allowing the user to enable
vector search in a new table. The following patches will enable additional
operations like UpdateTable (adding a vector index to an existing table or
deleting a vector index to an existing table) and DescribeTable.
Extensive tests for all these features will come at the end of the series.
Those tests were written in parallel with writing this implementation so cover
(hopefully) every nook and cranny of the imlementation.
Alternator has a function validate_attr_name_length() used to validate an
attribute name passed in different operations like PutItem, UpdateItem,
GetItem, etc. It fails the request if the attribute name is longer than
65535 characters.
It turns out that we forgot to check if the attribute name length isn’t 0 -
which should be forbidden as well!
This patch fixes the validation code, and also adds a test that confirms
that after this patch empty attribute names are rejected - just like DynamoDB
does - whereas before this patch they were silently accepted.
We want to fix this issue now, because in a later patch we intend to use
the same validation function also for vector indexes - and want it to be
accurate.
Fixes SCYLLADB-1069.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The vector-search feature, which is already supported in CQL, introduced
the somewhat confusing feature of enabling CDC without explicitly enabling
CDC: When a vector index is enabled on a table, CDC is "enabled" for it
even if the user didn't ask to enable CDC.
For this, some code in cdc/log.cc began to use cdc_enabled() instead of
checking schema.cdc_options.enabled() directly. This cdc_enabled()
function checks if either this enabled() is true, or has_vector_index()
is true.
But there's another twist to this story: To write with CDC, we also need
to create the CDC log table:
1. In CQL, a vector index can only be added on an existing table (with
CREATE INDEX), so the hook on_before_update_column_family() is the
one that noticed that a vector index was added, and created the CDC
log table.
2. But in Alternator, a vector index can be created up-front with a
brand-new table (in CreateTable), so the hook for a new table -
on_pre_create_column_families(), also needs to create the CDC log
table. It already did, but incorrectly checked just the explicit
CDC-enabled flag instead of the new cdc_enabled() function that
also allows vector index.
So this patch just fixes on_pre_create_column_families to use cdc_enabled().
Before this patch, when a vector index will be created in Alternator with
CreateTable, an attempt to write to the table (PutItem) will fail
because it will try to write to the CDC log, which wasn't created.
After this patch, it works. The reproducing test is
test_putitem_vectorindex_createtable (introduced in a later patch).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Pin all external GitHub Actions to full commit SHAs and upgrade to
their latest major versions to reduce supply chain attack surface:
- actions/checkout: v3/v4/v5 -> v6.0.2
- actions/github-script: v7 -> v8.0.0
- actions/setup-python: v5 -> v6.2.0
- actions/upload-artifact: v4 -> v7.0.0
- astral-sh/setup-uv: v6 -> v8.0.0
- mheap/github-action-required-labels: v5.5.2 (pinned)
- redhat-plumbers-in-action/differential-shellcheck: v5.5.6 (pinned)
- codespell-project/actions-codespell: v2.2 (pinned, was @master)
Set FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true in all 21 workflows that
use JavaScript-based actions to opt into the Node.js 24 runtime now.
This resolves the deprecation warning:
"Node.js 20 actions are deprecated. Please check if updated versions
of these actions are available that support Node.js 24. Actions will
be forced to run with Node.js 24 by default starting June 2nd,
2026. Node.js 20 will be removed from the runner on September 16th,
2026."
See: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/
scylladb/github-automation references are intentionally left at @main
as they are org-internal reusable workflows.
Fixes: SCYLLADB-1410
Backport: Backport is required for live branches that run GH actions:
2026.1, 2025.4, 2025.1 and 2024.1
Closesscylladb/scylladb#29421
The PR removed the create_and_ks() helper from backup test and patches all callers to create keyspace, table and populate them with standard explicit facilities. While patching it turned out that one test doesn't need to populate the table, so it even becomes tiny bit shorter and faster
Enhancing test, not backporting
Closesscylladb/scylladb#29417
* github.com:scylladb/scylladb:
test_backup: Remove create_ks_and_cf helper Test
test_backup: Replace create_ks_and_cf with async patterns Test
test_backup: Add if-True blocks for indentation Test
The migration tests used to start nodes with pre-computed power-of-two
tokens. This was required because the migration itself only supported
power-of-two aligned tokens. Now that arbitrary tokens are supported,
switch the tests to use Scylla's default random token assignment.
Switching to arbitrary tokens makes the tests non-deterministic, but the
migration aspects that are affected by the token distribution
(resharding, wrap-around vnode split) are out of scope for these tests
and covered by dedicated tests.
Add a `get_all_vnode_tokens()` helper that queries system.topology at
runtime to discover the actual token layout, and derive expected tablet
counts from that.
Also account for the possible extra wrap-around tablet when the last
vnode token does not coincide with MAX_TOKEN.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add a Boost test to verify that `prepare_for_tablets_migration()`
produces the correct tablet boundaries when a wrap-around vnode exists.
Tablets cannot wrap around the token ring as vnodes do; the last token
of the last tablet must always be MAX_TOKEN. When the last vnode token
does not coincide with MAX_TOKEN, the wrap-around vnode must be split
into two tablets.
The test is parameterized over both cases: unaligned (split expected)
and aligned (no split expected).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
1. test.py — Removed --log-level=DEBUG flag from pytest args
2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments
+minor fix
[test/pylib: save logs on success only during teardown phase](0ede308a04)
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown)in 3 files. Restrict it to only
the teardown phase, that contains all 3 in case of test success,
to avoid redundant log entries.
Closesscylladb/scylladb#29086
* github.com:scylladb/scylladb:
test/pylib: save logs on success only during teardown phase
test: Lower default log level from DEBUG to INFO
The vnodes-to-tablets migration creates tablet maps that mirror the
vnode layout: one tablet per vnode, preserving token boundaries and
replica placement. However, due to tablet restrictions, the migration
requires vnode tokens to be a power of two and uniformly distributed
across the token ring.
In practice, this restriction is too limiting. Real clusters use
randomly generated tokens and a node's token assignment is immutable.
To solve this problem, prior work (01fb97ee78) has been done to relax
the tablet constraints by allowing arbitrary tablet boundaries, removing
the requirement for power-of-two sizing and uniform distribution.
This patch leverages the relaxed tablet constraints to enable tablet map
creation from arbitrary vnode tokens:
* Removes all token-related constraints.
* Handles wrap-around vnodes. If a vnode wraps (i.e., the highest vnode
token is not `dht::token::last()`), it is split into two tablets:
- (last_vnode_token, dht::token::last()]
- [dht::token::first(), first_vnode_token]
The migration ops guide has been updated to remove the power-of-two
constraint.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`prepare_for_tablets_migration()` is idempotent; it filters out tables
that already have tablet maps and returns early if no tablet maps need
to be created. However, this precondition is currently misplaced. Move
it higher to skip extra work.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
When initializing streaming sources in tablet_stream_files_handler we
use a reference to the table. We should hold the table while doing so,
because otherwise the table may be dropped and destroyed when we yield.
Use the table.stream_in_progress() phaser to hold the table while we
access it.
For sstable file streaming we can release the table after the snapshot
is initialized, and the table may be dropped safely because the files
are held by the snapshot and we don't access the table anymore. There
was a single access to the table for logging but it is replaced by a
pre-calculated variable.
For logstor segment streaming, currently it doesn't support discarding
the segments while they are streamed - when the table is dropped it
discard the segments by overwriting and freeing them, so they shouldn't
be accessed after that. Therefore, in that case continue to hold the
table until streaming is completed.
Fixes [SCYLLADB-1533](https://scylladb.atlassian.net/browse/SCYLLADB-1533)
It's a pre-existing use-after-free issue in sstable file streaming so should be backported to all releases.
It's also made worse with the recent changes of logstor, and affects also non-logstor tables, so the logstor fixes should be in the same release (2026.2).
[SCYLLADB-1533]: https://scylladb.atlassian.net/browse/SCYLLADB-1533?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#29488
* github.com:scylladb/scylladb:
test: test drop table during streaming
streaming: stream_blob: hold table for streaming
The GCS fixture's fake-gcs-server container was named "local-kms",
copy-pasted from the AWS KMS fixture. It happened when both were
refactored to use the shared start_docker_service helper (bc544eb08e).
Rename to "fake-gcs-server" to match the Python-side naming and avoid
confusion in logs.
Refs SCYLLADB-1542
The container name is generated once before the retry loop, so
all retry attempts reuse the same name. Move the name generation
inside the loop so each attempt gets a fresh name via the
incrementing counter, consistent with the comment "publish port
ephemeral, allows parallel instances".
Formatting changes (indentation) of lines 208-225 in test/lib/proc_utils.cc
will be fixed in the next commit.
Refs SCYLLADB-1542
Previously, config_updater used a serialized_action to trigger update_config() when object_storage_endpoints changed. Because serialized_action::trigger() always schedules the action as a new reactor task (via semaphore::wait().then()), there was a window between the config value becoming visible to the REST API and update_config() actually running. This allowed a concurrent CREATE KEYSPACE to see the new endpoint via is_known_endpoint() before storage_manager had registered it in _object_storage_endpoints.
Now config observers run synchronously in a reactor turn and must not suspend. Split the previous monolithic async update_config() coroutine into two phases:
- Sync (in the observer, never suspends): storage_manager::_object_storage_endpoints is updated in place; for already-instantiated clients, update_config_sync swaps the new config atomically
- Async (per-client gate): background fibers finish the work that can't run in the observer — S3 refreshes credentials under _creds_sem; GCS drains and closes the replaced client.
Config reloads triggered by SIGHUP are applied on shard 0 and then broadcast to all other shards. An rwlock has been also introduced to make sure that the configuration has been propagated to all cores. This guarantees that a client requesting a config via the REST API will see a consistent snapshot
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-757
Fixes: [28141](https://github.com/scylladb/scylladb/issues/28141)
Closesscylladb/scylladb#28950
* github.com:scylladb/scylladb:
test/object_store: verify object storage client creation and live reconfiguration
sstables/utils/s3: split config update into sync and async parts
test_config: improve logging for wait_for_config API
db: introduce read-write lock to synchronize config updates with REST API
Split the `log_record` to `log_record_header` type that has the record
metadata fields and the mutation as a separate field which is the actual
record data:
struct log_record {
log_record_header header;
canonical_mutation mut;
};
Both the header and mutation have variable serialized size. When a
record is serialized in a write_buffer, we first put a small
`record_header` that has the header size and data size, then the
serialized header and data follow. The `log_location` of a record points
to the beginning of the `record_header`, and the size includes the
`record_header`.
This allows us to read a record header without reading the data when
it's not needed and avoid deserializing it:
* on recovery, when scanning all segments, we read only the record
headers.
* on compaction, we read the record header first to determine if the
record is alive, if yes then we read the data.
Closesscylladb/scylladb#29457
The commitlog replayer groups segments by shard using a
std::unordered_multimap, then iterates per-shard segments via
equal_range(). However, equal_range() does not guarantee iteration
order for elements with the same key, so segments could be replayed
out of order within a shard.
Correct segment ordering is required for:
- Fragmented entry reconstruction, which accumulates fragments across
segments and depends on ascending order for efficient processing.
- Commitlog-based storage used by the strongly consistent tables
feature, which relies on replayed raft items being stored in order.
Fix by changing the data structure from
std::unordered_multimap<unsigned, commitlog::descriptor>
to
std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>>
Since the descriptors are inserted from a std::set ordered by ID, the
vector preserves insertion (and thus ID) order. The per-shard iteration
now simply iterates the vector, guaranteeing correct replay order.
Fixes: SCYLLADB-1411
Backport: It looks like this issue doesn't cause any trouble, and is required only by the strong consistent tables, so no backporting required.
Closesscylladb/scylladb#29372
* github.com:scylladb/scylladb:
commitlog: add test to verify segment replay order
commitlog: fix replay order by using ordered map per shard
Move enable_create_table_with_compact_storage option from db::config to
cql_config. This improves separation of concerns by consolidating CQL-specific
table creation policies in the cql_config structure. Update the CREATE TABLE
statement prepare() function to use the new location for the configuration check.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move strict_is_not_null_in_views option from db::config to cql_config via
new view_restrictions sub-struct. This improves separation of concerns by
keeping view-specific validation policies with other CQL configuration.
Update prepare_view() to take view_restrictions reference instead of reaching
into db::config, and update all callsites to pass the sub-struct.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move restrict_future_timestamp option from db::config to cql_config. This
improves separation of concerns as timestamp validation is part of CQL query
execution behavior. Update validate_timestamp() function signature to take
cql_config reference instead of db::config, and update all callsites in
modification_statement and batch_statement to pass cql_config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move twcs_max_window_count and restrict_twcs_without_default_ttl options
from db::config to cql_config via new twcs_restrictions sub-struct. This
improves separation of concerns by keeping TWCS-specific validation policies
with other CQL configuration. Update check_restricted_table_properties()
to remove unused db parameter and take twcs_restrictions reference instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce replication_restrictions, a sub-struct of cql_config, to hold
the seven keyspace-level policy options that govern how CREATE/ALTER
KEYSPACE statements are validated:
- restrict_replication_simplestrategy
- replication_strategy_warn_list / replication_strategy_fail_list
- minimum/maximum_replication_factor_warn/fail_threshold
Pass replication_restrictions into check_against_restricted_replication_strategies()
instead of having it reach into db::config directly (via both
qp.db().get_config() and qp.proxy().data_dictionary().get_config()).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the physical system.large_partitions, system.large_rows, and
system.large_cells CQL tables with virtual tables that read from
LargeDataRecords stored in SSTable scylla metadata (tag 13).
The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster
feature flag:
- Before the feature is enabled: the old physical tables remain in
all_tables(), CQL writes are active, no virtual tables are registered.
This ensures safe rollback during rolling upgrades.
- After the feature is enabled: old physical tables are dropped from
disk via legacy_drop_table_on_all_shards(), virtual tables are
registered on all shards, and CQL writes are skipped via
skip_cql_writes() in cql_table_large_data_handler.
Key implementation details:
- Three virtual table classes (large_partitions_virtual_table,
large_rows_virtual_table, large_cells_virtual_table) extend
streaming_virtual_table with cross-shard record collection.
- generate_legacy_id() gains a version parameter; virtual tables
use version 1 to get different UUIDs than the old physical tables.
- compaction_time is derived from SSTable generation UUID at display
time via UUID_gen::unix_timestamp().
- Legacy SSTables without LargeDataRecords emit synthetic summary
rows based on above_threshold > 0 in LargeDataStats.
- The activation logic uses two paths: when the feature is already
enabled (test env, restart), it runs as a coroutine; when not yet
enabled, it registers a when_enabled callback that runs inside
seastar::async from feature_service::enable().
- sstable_3_x_test updated to use a simplified large_data_test_handler
and validate LargeDataRecords in SSTable metadata directly.
Move the smp::invoke_on_all dispatch from the callers into
initialize_virtual_tables() itself, so the function is called
once from shard 0 and internally distributes the per-shard
virtual table setup to all shards.
This simplifies the callers and allows a single place to add
cross-shard coordination logic (e.g. feature-gated table
registration) in future commits.
Add three new test cases to sstable_3_x_test.cc that verify the
LargeDataRecords metadata written by the SSTable writer can be read
back after open_data():
- test_large_data_records_round_trip: verifies partition_size, row_size,
and cell_size records are written with correct field semantics when
thresholds are exceeded
- test_large_data_records_top_n_bounded: verifies the bounded min-heap
keeps only the top-N largest entries per type
- test_large_data_records_none_when_below_threshold: verifies no records
are written when data is below all thresholds
Also wire large_data_records_per_sstable from db_config into the test
env's sstables_manager::config so that config changes propagate through
the updateable_value chain to configure_writer().
During compaction (SSTable writing), maintain bounded min-heaps (one per
large_data_type) that collect the top-N above-threshold records. On
stream end, drain all five heaps into a single LargeDataRecords array
and write it into the SSTable's scylla metadata component.
Five separate heaps are used:
- partition_size, row_size, cell_size: ordered by value (size bytes)
- rows_in_partition, elements_in_collection: ordered by elements_count
A new config option 'compaction_large_data_records_per_sstable' (default
10) controls the maximum number of records kept per type.
Change maybe_record_large_cells to return above_threshold_result with
separate booleans for cell size (.size) and collection elements
(.elements) thresholds. This allows the writer to track above_threshold
counts for cell_size and elements_in_collection independently.
Rename partition_above_threshold to above_threshold_result and its
'rows' field to 'elements', making it a generic struct that can be
reused for other large data types (e.g., cells with collection
elements).
Use designated initializers for clarity.
Add a new scylla metadata component LargeDataRecords (tag 13) that
stores per-SSTable top-N large data records. Each record carries:
- large_data_type (partition_size, row_size, cell_size, etc.)
- binary serialized partition key and clustering key
- column name (for cell records)
- value (size in bytes)
- element count (rows or collection elements, type-dependent)
- range tombstones and dead rows (partition records only)
The struct uses disk_string<uint32_t> for key/name fields and is
serialized via the existing describe_type framework into the SSTable
Scylla metadata component.
Add JSON support in scylla-sstable and format documentation.
Add a fmt::formatter specialization for sstables::large_data_type and
use it in scylla-sstable.cc instead of the local to_string() overload,
which is removed.
Move the key_to_str() template function from a file-local static in
db/large_data_handler.cc to keys/keys.hh so it can be reused by:
- large_data_handler.cc for log messages
- virtual tables (db/virtual_tables.cc) for converting binary keys
to human-readable CQL display
- scylla-sstable for JSON output of LargeDataRecords
No functional change.
The batch_size_fail_threshold_in_kb option controls the batch size at
which an oversized batch error is returned to the client. It belongs in
cql_config rather than db::config as it directly governs CQL batch
statement behavior.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The batch_size_warn_threshold_in_kb option controls the batch size at
which a client warning is emitted during batch execution. It belongs in
cql_config rather than db::config as it directly governs CQL batch
statement behavior.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The enable_parallelized_aggregation option controls whether aggregation
queries are fanned out across shards for parallel execution. It belongs
in cql_config rather than db::config as it directly governs CQL query
behavior at prepare time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The strict_allow_filtering option controls whether queries that require
ALLOW FILTERING are silently accepted, warned about, or rejected. It
belongs in cql_config rather than db::config as it directly governs CQL
query behavior at prepare time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The select_internal_page_size option controls CQL query execution
behavior (internal paging for aggregate/filtered SELECTs) and belongs
in cql_config rather than being read directly from db::config at
execution time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test environment was creating cql_config with hardcoded default values that
were never updated when system.config was modified via CQL. This broke tests
that dynamically change configuration values (e.g., TWCS tests).
Fix by creating cql_config from db::config using sharded_parameter, which
ensures updateable_value fields track the actual db::config sources and reflect
changes made during test execution.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Pass cql_config to prepare() so that statement preparation can use
CQL-specific configuration rather than reaching into db::config
directly.
Callers that use default_cql_config:
- db/view/view.cc: builds a SELECT statement internally to compute view
restrictions, not in response to a user query
- cql3/statements/create_view_statement.cc: same -- parses the view's
WHERE clause as a synthetic SELECT to extract restrictions
- tools/schema_loader.cc: offline schema loading tool, no runtime
config available
- tools/scylla-sstable.cc: offline sstable inspection tool, no runtime
config available
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Several tests in test_select_from_mutation_fragments.py assume that all
mutations end up in a single SSTable. This assumption can be violated
by background memtable flushes triggered by commitlog disk pressure.
Since the Scylla node is taken from a pool, it may carry unflushed data
from prior tests that prevents closed segments from being recycled,
thereby increasing the commitlog disk usage. A main source of such
pressure is keyspace-level flushes from earlier tests in this module,
which rotate commitlog segments without flushing system tables (e.g.,
`system.compaction_history`), leaving closed segments dirty.
Additionally, prior tests in the same module may have left unflushed
data on the shared test table (`test_table` fixture), keeping commitlog
segments dirty on its behalf as well. When commitlog disk usage exceeds
its threshold, the system flushes the test table to reclaim those
segments, potentially splitting a running test's mutations across
multiple SSTables.
This was observed in CI, where test_paging failed because its data was
split across two SSTables, resulting in more mutation fragments than the
hardcoded expected count.
This patch fixes the affected tests in two ways:
1. Where possible, tests are reworked to not assume a single SSTable:
- test_paging
- test_slicing_rows
- test_many_partition_scan
2. Where rework is impractical, major compaction is added after writes
and before validation to ensure that only one SSTable will exist:
- test_smoke
- test_count
- test_metadata_and_value
- test_slicing_range_tombstone_changes
Fixes SCYLLADB-1375.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#29389
Add a test that drops a table while tablet streaming is running for the
table. The table is dropped after taking the storage snapshot and
initializating streaming sources - after that streaming should be able
to complete or abort correctly if the table is dropped. We want to
verify there is no incorrect access to the destroyed table.
The test tests both types of streaming in stream_blob - sstables
and logstor segments.
When initializing streaming sources in tablet_stream_files_handler we
use a reference to the table. We should hold the table while doing so,
because otherwise the table may be dropped and destroyed when we yield.
Use the table.stream_in_progress() phaser to hold the table while we
access it.
For sstable file streaming we can release the table after the snapshot
is initialized, and the table may be dropped safely because the files
are held by the snapshot and we don't access the table anymore. There
was a single access to the table for logging but it is replaced by a
pre-calculated variable.
For logstor segment streaming, currently it doesn't support discarding
the segments while they are streamed - when the table is dropped it
discard the segments by overwriting and freeing them, so they shouldn't
be accessed after that. Therefore, in that case continue to hold the
table until streaming is completed.
Fixes SCYLLADB-1533
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any token, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
We can also split and merge incrementally individual tablets.
Currently, we do it for the whole table or nothing, which makes
splits and merges take longer and cause wide swings of the count.
This is not implemented in this PR yet, we still split/merge the whole table.
Another reason is vnode to tablets migration. We now could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from CQL-coordinator point of view.
Tablet count is still a power-of-two by default for newly created tables.
It may be different if tablet map is created by non-standard means,
or if per-table tablet option "pow2_count" is set to "false".
build/release/scylla perf-tablets:
Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%)
Before:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 57456 KiB
Copied in 0.014346 [ms]
Cleared in 0.002698 [ms]
Saved in 1234.685303 [ms]
Read in 445.577881 [ms]
Read mutations in 299.596313 [ms] 128 mutations
Read required hosts in 247.482742 [ms]
Size of canonical mutations: 33.945053 [MiB]
Disk space used by system.tablets: 1.456761 [MiB]
Tablet metadata reload:
full 407.69ms
partial 2.65ms
```
After:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 59504 KiB
Copied in 0.032475 [ms]
Cleared in 0.002965 [ms]
Saved in 1093.877441 [ms]
Read in 387.027100 [ms]
Read mutations in 255.752121 [ms] 128 mutations
Read required hosts in 211.202805 [ms]
Size of canonical mutations: 33.954453 [MiB]
Disk space used by system.tablets: 1.450162 [MiB]
Tablet metadata reload:
full 354.50ms
partial 2.19ms
```
Closesscylladb/scylladb#28459
* github.com:scylladb/scylladb:
test: boost: tablets: Add test for merge with arbitrary tablet count
tablets, database: Advertise 'arbitrary' layout in snapshot manifest
tablets: Introduce pow2_count per-table tablet option
tablets: Prepare for non-power-of-two tablet count
tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
tablets: Prepare resize_decision to hold data in decisions
tablets: table: Make storage_group handle arbitrary merge boundaries
tablets: Make stats update post-merge work with arbitrary merge boundaries
locator: tablets: Support arbitrary tablet boundaries
locator: tablets: Introduce tablet_map::get_split_token()
dht: Introduce get_uniform_tokens()
Commit c17c4806a1 removed check_for_endpoint_collision() from
the fresh bootstrap path, which was the only code path that
called do_shadow_round() for new nodes. Since the gossip shadow
round is no longer executed during bootstrap, remove the
stop_during_gossip_shadow_round error injection from the test.
The entry is marked as REMOVED_ rather than deleted to preserve
the shuffle order for seed-based test reproducibility.
The injection point in gms/gossiper.cc is also removed since it
is no longer used by any test.
Fixes: SCYLLADB-1466
Closesscylladb/scylladb#29460
The retry loop in start_docker_service passes the parse callbacks
via std::move into create_handler on each iteration. After the
first iteration, the moved-from std::function objects are empty.
All subsequent retries skip output parsing entirely and
immediately treat the service as successfully started. This
defeats the entire purpose of the retry mechanism.
Fix by passing the callbacks by copy instead of move, so the
original callbacks remain valid across retries.
Fixes SCYLLADB-1542
Use async tablet repair task flow to avoid a race where client timeout
returns while server-side repair continues after injections are
disabled.
Start repair with await_completion=false, assert it does not complete
within timeout under injection, abort/wait the task, then verify
sstables_repaired_at is unchanged.
Fixes SCYLLADB-1184
Closesscylladb/scylladb#29452
Config observers run synchronously in a reactor turn and must not
suspend. Split the previous monolithic async update_config() coroutine
into two phases:
Sync (runs in the observer, never suspends):
- S3: atomically swap _cfg (lw_shared_ptr) and set a credentials
refresh flag.
- GCS: install a freshly constructed client; stash the old one for
async cleanup.
- storage_manager: update _object_storage_endpoints and fire the
async cleanup via a gate-guarded background fiber.
Async (gate-guarded background fiber):
- S3: acquire _creds_sem, invalidate and rearm credentials only if
the refresh flag is set.
- GCS: drain and close stashed old clients.
Config is reloaded from SIGHUP on shard 0 and broadcast to all shards
under a write lock. REST API callers reading find_config_id acquire a
read lock via value_as_json_string_for_name() and are guaranteed a
consistent snapshot even when a reload is in progress.
Currently cancellation is logged in get_next_task, but the function is
called by tablets code as well where we do not act upon its result, only
yield to the topology coordinator. But the topology coordinator will not
necessary do the cancellation as well since it can be busy with tablets
migration. As a result cancellation is logged, but not done which is
confusing. Fix it by logging cancellation when it is actually happens.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1409Closesscylladb/scylladb#29471
This is an attempt (mostly suggested and implemented by AI, but with a few hours of human babysitting...), to somewhat reduce compilation time by picking one template, named_value<T>, which is used in more than a hundred source files through the config.hh header, and making it use external instantiation: The different methods of named_value<T> for various T are instantiated only once (in config.cc), and the individual translation units don't need to compile them a hundred times.
The resulting saving is a little underwhelming: The total object-file size goes down about 1% (from 346,200 before the patch to 343,488 after the patch), and previous experience shows that this object-file size is proportional to the compilation time, most of which involves code generation. But I haven't been able to measure speedup of the build itself.
1% is not nothing, but not a huge saving either. Though arguably, with 50 more of these patches, we can make the build twice faster :-)
Refs #1.
Closesscylladb/scylladb#28992
* github.com:scylladb/scylladb:
config: move named_value<T> method bodies out-of-line
config: suppress named_value<T> instantiation in every source file
Tests that call grep_for_errors() directly and assert no errors
can fail spuriously due to benign RPC errors during graceful
shutdown (e.g. "connection dropped: Semaphore broken"), which
are already filtered by the after_test hook via filter_errors().
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464
Backport: no, tests fix (we may decide to backport later if it occurs on release branches)
Closesscylladb/scylladb#29463
* github.com:scylladb/scylladb:
test: filter benign errors in tests that grep logs during shutdown
test: filter_errors: support list[list[str]] error groups
When a replica disconnects during a digest read (e.g., during
decommission), the speculating_read_executor now immediately fires
the pending speculative retry instead of waiting for the timer.
On DISCONNECT, the digest_read_resolver invokes an _on_disconnect
callback set by the executor. The callback cancels the speculate
timer and rearms it to clock_type::now() (lowres_clock::now() =
thread-local memory read, no syscall). The existing timer callback
fires on the next reactor poll with all its logic intact — checking
is_completed(), calling add_wait_targets(1), sending the request,
and incrementing speculative_digest_reads/speculative_data_reads.
The notification is fire-and-forget: on_error() does NOT absorb the
DISCONNECT. The existing error arithmetic in digest_read_resolver
already handles this correctly because _target_count_for_cl accounts
for the speculative target.
For never_speculating_read_executor (no spare target) and
always_speculating_read_executor (all requests sent upfront),
_on_disconnect is never set — no behavior change.
Fixesscylladb/scylladb#26307Closesscylladb/scylladb#29428
all_sibling_tablet_replicas_colocated was using committed ti.replicas to
decide whether sibling tablets are co-located and merge can be finalized.
This caused a false non-co-located window when a co-located pair was moved
by the load balancer: as both tablets migrate together, their del_transition
commits may land in different Raft rounds. After the first commit, ti.replicas
diverge temporarily (one tablet shows the new position, the other the old),
causing all_sibling_tablet_replicas_colocated to return false. This clears
finalize_resize, allowing the load balancer to start new cascading migrations
that delay merge finalization by tens of seconds.
Fix this by using the optimistic replica view (trinfo->next when transitioning,
ti.replicas otherwise) — the same view the load balancer uses for load
accounting — so finalize_resize stays populated throughout an in-flight
migration and no spurious cascades are triggered.
Steps that lead to the problem:
1. Merge is triggered. The load balancer generates co-location migrations
for all sibling pairs that are not yet on the same shard. Some pairs
finish co-location before others.
2. Once all pairs are co-located in committed state,
all_sibling_tablet_replicas_colocated returns true and finalize_resize
is set. Meanwhile the load balancer may have already started a regular
LB migration on one co-located pair (both tablets are stable and the
load balancer is free to move them).
3. The LB migration moves both tablets together (colocated_tablets). Their
two del_transition commits land in separate Raft rounds. After the first
commit, ti.replicas[t1] = new position but ti.replicas[t2] = old position.
4. In this window, all_sibling_tablet_replicas_colocated sees the pair as
NOT co-located, clears finalize_resize, and the load balancer generates
new migrations for other tablets to rebalance the load that the pair
move created.
5. Those new migrations can take tens of seconds to stream, keeping the
coordinator in handle_tablet_migration mode and preventing
maybe_start_tablet_resize_finalization from being called. The merge
finalization is delayed until all those cascaded migrations complete.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-821.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1459.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29465
Currently the statement returns cluster, partitioner and snitch names by accessing global db::config via database. As the part of an effort to detach components from global db::config, this PR tweaks the statement handler to get the cluster information from some other source. Currently the needed cluster information is stored in different components, but they are all under storage_service umbrella which seems to be a good central source of this truth. Unit test included.
Cleaning components inter-dependencies, not backporting
Closesscylladb/scylladb#29429
* github.com:scylladb/scylladb:
test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation
describe_statement: Get cluster info from storage_service
storage_service: Add describe_cluster() method
query_processor: Expose storage_service accessor
Left over from primordial times, when reader_concurrency_semaphore was
baseclass for extensions in the separate enterprise repository.
Also remove the now unneded virtual marker from the destructor.
Closesscylladb/scylladb#29399
This series allows creating multiple vector indexes on the same column so users can rebuild an index without losing query availability.
The intended flow is:
1. Create a new vector index on a column that already has one.
2. Keep serving ANN queries from the old index while the new one is being built.
3. Verify the new index is ready.
4. Automatically switch to the remaining index.
5. Drop the old index.
To make that deterministic, `index_version` is changed from the base table schema version to a real creation timeuuid. When multiple vector indexes exist on the same column, ANN query planning now picks the index according to the routing implemented in Vector Store (newest serving index). This keeps queries on the old index until it the new one is up and ready.
This patch also removes the create-time restriction that rejected a second vector index on the same column. Name collisions are still rejected as before.
Test coverage is updated accordingly:
- Scylla now verifies that two vector indexes can coexist on the same column.
- Cassandra/SAI behavior is still covered and is still expected to reject duplicate indexes on the same column.
Fixes: VECTOR-610
Closesscylladb/scylladb#29407
* github.com:scylladb/scylladb:
docs: document vector index metadata and duplicate handling
test/cqlpy: cover vector index duplicate creation rules
vector_index: allow multiple named indexes on one column
vector_index: store `index_version` as creation timeuuid
Commit 234f905 (sstables: scylla_metadata: add schema member) added a
new Schema subcomponent (tag 11) to scylla_metadata. Document it in the
sstable Scylla format reference:
- Add schema to the subcomponent grammar enumeration
- Add a summary entry describing the subcomponent (tag 11) and its purpose
- Add a detailed ## schema subcomponent section with the binary grammar,
covering table_id, table_schema_version, keyspace_name, table_name and
the column_description array (column_kind, column_name, column_type)
Fixes https://github.com/scylladb/scylladb/issues/27960
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#28983
RF change of tablet keyspace starts tablet rebuilds. Even if any of the rebuilds is rolled back (because pending replica was excluded), rf change request finishes successfully. In this case we end up with the state of the replicas that isn't compatible with the expected keyspace replication.
Modify topology coordinator so that if it were to be idle, it starts checking if there are any missing replicas. It moves to transition_state::tablet_migration and run required rebuilds.
If a new RF change request encounters invalid state of replicas it fails. The state will be fixed later and the analogical ALTER KEYSPACE statement will be allowed.
Fixes: SCYLLADB-109.
Requires backport to all versions with tablet keyspace rf change.
Closesscylladb/scylladb#28709
* github.com:scylladb/scylladb:
test: add test_failed_tablet_rebuild_is_retried_on_alter
test: add a test to ensure that failed rebuilds are retried
service: fail ALTER KEYSPACE if replicas do not satisfy the replication
service: retry failed tablet rebuilds
service: maybe_start_tablet_migration returns std::optional<group0_guard>
Related scylladb/scylladb-docs-homepage#153.
make multiversion failed under Sphinx 8+ with:
```
sphinx-build: error: argument --tag/-t: expected one argument
subprocess.CalledProcessError: Command '(..., '-m', 'sphinx', '-t', '-D', 'smv_metadata_path=...', ..., 'manual')' returned non-zero exit status 2.
make: *** [multiversion] Error 1
```
sphinx-multiversion's arg forwarding splits `-t manual`, sending `-t` into the options slot and `manual` to the trailing FILENAMES positional.
Sphinx 7 silently tolerated the dangling `-t`; Sphinx 8+'s stricter
argparse CLI rejects it. Instead, it now reads FLAGS from an env variable.
How to test:
````
make multiversion
make FLAG=opensource multiversion
````
Both complete and switch variants correctly.
chore: rm empty lines
Closesscylladb/scylladb#29472
Three bugs fixed in segment_manager.cc:
1. write_to_separator(): captured [&index] where index was a local
coroutine-frame reference. The future is stored in
buf.pending_updates and resolved later in flush_separator_buffer(),
by which time the enclosing coroutine frame is destroyed, making
&index a dangling pointer. This is a use-after-free that manifests
as a segfault. Fix: capture index_ptr (raw pointer by value) instead.
2. add_segment_to_compaction_group(): same dangling [&index] pattern
inside the for_each_live_record lambda during recovery. Same fix
applied.
3. write(): local 'auto loc = seg->allocate(...)' shadowed the outer
'log_location loc', causing the function to always return a
zero-initialized log_location{}. Fix: remove 'auto' so the
assignment targets the outer variable.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29451
The poll loop used condition_variable::wait(timeout) to sleep between
iterations. On every normal timeout expiry, this threw a
condition_variable_timed_out exception, which incremented the C++
exception metric and triggered false alerts for support.
Replace the timed wait with a seastar::timer that broadcasts the
condition variable on expiry, combined with an untimed wait(). The
timer is cancelled automatically on scope exit when the wait is woken
early by trigger_poll() or abort.
Fixes SCYLLADB-1477
Closesscylladb/scylladb#29438
The enable_logstor configuration option is redundant with the 'logstor'
experimental feature flag. Consolidate to a single gate: use the
experimental feature to control both whether logstor is available for
table creation and whether it is initialized at database startup.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29427
This metric is used to catch execution of scans which go via row
cache, which can have bad effect on performance.
Since f344bd0aaa, aggregate queries go
via new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.
Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.
No backport: enhancement
Closesscylladb/scylladb#29422
* github.com:scylladb/scylladb:
cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric
test: cluster: dtest: Fix double-counting of metrics
`LDAPRoleManager` interpolated usernames directly into `ldap_url_template`,
allowing LDAP filter injection and URL structure manipulation via crafted
usernames.
This PR adds two layers of encoding when substituting `{USER}`:
1. **RFC 4515 filter escaping** — neutralises `*`, `(`, `)`, `\`, NUL
2. **URL percent-encoding** — prevents `%`, `?`, `#` from breaking
`ldap_url_parse`'s component splitting or undoing the filter escaping
It also adds `validate_query_template()` at startup to reject templates
that place `{USER}` outside the filter component (e.g. in the host or
base DN), where filter escaping would be the wrong defense.
Fixes: SCYLLADB-1309
Compatibility note:
Templates with `{USER}` in the host, base DN, attributes, or extensions
were previously silently accepted. They are now rejected at startup with
a descriptive error. Only templates with `{USER}` in the filter component
(after the third `?`) are valid.
Fixes: SCYLLADB-1309
Due to severeness, should be backported to all maintained versions.
Closesscylladb/scylladb#29388
* github.com:scylladb/scylladb:
auth: sanitize {USER} substitution in LDAP URL templates
test/ldap: add LDAP filter-injection reproducers
Add an explicit wait_for_schema_agreement() call after CREATE KEYSPACE
in create_new_test_keyspace to ensure all nodes have applied the schema
before proceeding.
Closesscylladb/scylladb#29371
The alter_table case has a known failure where point lookups at QUORUM
return 0 rows after node2 restarts, even though:
- the schema was correctly synced (ALTER TABLE received from cluster)
- the data commitlog was replayed (21 mutations, 0 skipped)
- all 3 nodes were alive, so QUORUM (2/3) should be satisfiable by
node1+node3 regardless of node2's state
The LIMIT 1 table scan succeeds (data is present somewhere), but
specific key lookups return empty. This points to a bug in how node2,
acting as coordinator after restart, routes single-partition reads —
most likely stale tablet routing metadata.
Add diagnostics to help distinguish data loss from a coordinator/routing
bug on the next failure:
- log which key is missing
- dump all rows visible at QUORUM
- query each node individually at ONE consistency for the missing key
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29350
The real keyspace name of an Alternator table T is "alternator_T".
Expand the "alternator.T" format used in the audit_tables config flag
to the real keyspace name at parse time, so users don't need to spell
out the internal "alternator_T.T" form.
Add tests for the unhappy path of Alternator audit logging:
- Category filtering: operations are not logged when their category
(DML, QUERY, DDL) is excluded from audit_categories.
- Keyspace filtering: operations on a keyspace not listed in
audit_keyspaces are not logged.
- Error entries: a failed operation (thrown exception after audit_info
is set) produces an audit entry with error=true.
- Empty-keyspace bypass: global operations like ListTables and
DescribeEndpoints are logged regardless of audit_keyspaces because
should_log() short-circuits on an empty keyspace.
Prepare API in audit for auditing Alternator.
The API provides an externally-callable functions `inspect()`,
for both CQL and Alternator.
Both variants of the function would unpack parameters and merge into
calling a common `maybe_log()`, which can then call `log()` when
conditions are met.
Also, while I was at it, (const) references were favoured over raw
pointers.
The Alternator audit_info subclass (audit_info_alternator) carries an
optional consistency level — only data read/write operations have a
meaningful CL, while DDL and metadata queries store an empty string
in the audit table and syslog (matching the existing write_login
behavior). The storage helpers are updated accordingly.
Add a will_log(category, keyspace, table) method that checks whether
an operation should be audited (category check AND keyspace/table
filtering) without requiring a constructed audit_info object.
should_log() delegates to will_log().
Currently, the manifest advertises "powof2", which is wrong for
arbitrary count and boundaries.
Introduce a new kind of layout called "arbitrary", and produce it if
the tablet map doesn't conform to "powof2" layout.
We should also produce tablet boundaries in this case, but that's
worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525
By default it's true, in which case tablet count of the table is
rounded up to a power of two. This option allows lifting this, in
which case the count can be arbitrary. This will allow testing the
logic of arbitrary tablet count.
This is a step towards more flexibility in managing tablets. A
prerequisite before we can split individual tablets, isolating hot
partitions, and evening-out tablet sizes by shifting boundaries.
After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
be still a power of two.
We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.
One of the reasons we don't give up on power-of-two by default yet is
that it creates an issue with merges. If tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.
Example (3 tablets):
[0] owns 1/3 of tokens
[1] owns 1/3 of tokens
[2] owns 1/3 of tokens
After merge:
[0] owns 2/3 of tokens
[1] owns 1/3 of tokens
What we would like instead:
Step 1 (split [1]):
[0] owns 1/3 of tokens
[1] old 1.left, owns 1/6 of tokens
[2] old 1.right, owns 1/6 of tokens
[3] owns 1/3 of tokens
Step 2 (merge):
[0] owns 1/2 of tokens
[1] owns 1/2 of tokens
To do that, we need to be able to split individual tablets, but we're
not there yet.
We only assume that new tablets have boundaries which are equal
to some boundaries of old tablets.
In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
The test test/cluster/test_ttl_row.py::test_row_ttl_scheduling_group wants to
verify that the new CQL per-row TTL feature does all its work (expiration
scanning, deletion of expired items) on all nodes in the "streaming"
scheduling group, not in the statement scheduling group.
As originally written, the test couldn't require that it uses exactly zero
time in the statement scheduling group - because some things do happen
there - specifically the ALTER TABLE request we use to enable TTL.
So the test checked that the time in the "wrong" group is less than 0.2
of the total time, not zero.
But in one CI run, we got to exactly 0.2 and the test failed. Running
this test locally, I see the margin is pretty narrow: The test almost
always fails if I set the threshold ratio to 0.1.
The solution in this patch is to move the ALTER TABLE work to a different
scheduling group (by using an additional service level). After doing that
the CPU usage in sl:default goes down to exactly zero - not close to zero
but exactly zero.
However, it seems that there is always some rare background work in
sl:default and debug builds it can come out more than 0ms (e.g., in
one test we saw 1ms), so we keep checking that sl:default is much
lower than sl:stream - not exactly zero.
Incidentally, I converted the serial loop adding the 200 rows in the
test's setup to a parallel loop, to make the test setup slightly faster.
I also added to the test a sanity check that the scheduling group sl:default
that we are measuring that TTL does zero work in, is actually the scheduling
group that normal writes work in (to avoid the risk of having a test that
verifies that some irrelevant scheduling group is unsurprisingly getting
zero usage...).
Fixes SCYLLADB-1495.
Closesscylladb/scylladb#29447
We only assume that new tablets share boundaries with some old tablets.
In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any points, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
Another reason is vnode-to-tablet migration. We could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from the CQL-coordinator's point of view.
Implementation details:
We store a vector of tokens which represent tablet boundaries in the
tablet_id_map. tablet_id keeps its meaning, it's an index into vector
of tablets. To avoid logarithmic lookup of tablet_id from the token,
we introduce a lookup structure with power-of-two aligned buckets, and
store the tablet_id of the tablet which owns the first token in the
bucket. This way, lookup needs to consider tablet id range which
overlaps with one bucket. If boundaries are more or less aligned,
there are around 1-2 tablets overlapping with a bucket, and the lookup
is still O(1).
Amount of memory used increased, but not significantly relative to old
size (because tablet_info is currently fat):
For 131'072 tablets:
Before:
Size of tablet_metadata in memory: 57456 KiB
After:
Size of tablet_metadata in memory: 59504 KiB
And reimplement existing split-related methods around it.
This way we avoid calling dht::compaction_group_of(), and
assuming anything about tablet boundaries or tablet count
being a power of two.
This will make later refactoring easier.
Currently, hints that are sent to tablet replicas which are leaving due to RF-- can be lost, because `hint_sender` only checks if the destination host is leaving. To avoid this, we add a new method `effective_replication_map::is_leaving(host, token)` which checks if the tablet identified by the given token is leaving the host. This method is called by the `hint_sender` to check if the hint should be sent only to the destination host, or to all the replicas. This way, we increase consistency. For v-node based ERPs, `is_leaving()` calls `token_metadata::is_leaving(host)`.
Fixes: SCYLLADB-287
This is an improvement, and backport is not needed.
Closesscylladb/scylladb#28770
* github.com:scylladb/scylladb:
test: verify hints are delivered during tablet RF reduction
hint_sender: use per-tablet is_leaving() to avoid losing hints on RF reduction
erm: add is_leaving() to effective_replication_map
Fixes a race condition where tablet split can crash the server during truncation.
`truncate_table_on_all_shards()` disables compaction on all existing compaction groups, then later calls `discard_sstables()` which asserts that compaction is disabled. Between these two points, tablet split can call `set_split_mode()`, which creates new compaction groups via `make_empty_group()` — these start with `compaction_disabled_counter == 0`. When `discard_sstables()` checks its assertion, it finds these new groups and fires `on_internal_error`, aborting the server.
In `storage_group::set_split_mode()`, before creating new compaction groups, check whether the main compaction group has compaction disabled. If it does, bail out early and return `false` (not ready). This is safe because the split will be retried once truncation completes and re-enables compaction.
A new regression test `test_split_emitted_during_truncate` reproduces the
exact interleaving using two error injection points:
- **`database_truncate_wait`** — pauses truncation after compaction is disabled but before `discard_sstables()` runs.
- **`tablet_split_monitor_wait`** (new, in `service/storage_service.cc`) — pauses the split monitor at the start of `process_tablet_split_candidate()`.
The test creates a single-tablet table, triggers both operations, uses the injection points to force the problematic ordering, then verifies that truncation completes successfully and the split finishes afterward.
Fixes: SCYLLADB-1035
This needs to be backported to all currently supported version.
Closesscylladb/scylladb#29250
* github.com:scylladb/scylladb:
test: add test_split_emitted_during_truncate
table: fix race between tablet split and truncate
In partition_snapshot_row_cursor::maybe_refresh(), the !is_in_latest_version()
path calls lower_bound(_position) on the latest version's rows to find the
cursor's position in that version. When lower_bound returns null (the cursor
is positioned above all entries in the latest version in table order), the code
unconditionally sets _background_continuity = true and allows the subsequent
if(!it) block to erase the latest version's entry from the heap.
This is correct for forward traversal: null means there are no more entries
ahead, so removing the version from the heap is safe.
However, in reversed mode, null from lower_bound means the cursor is above
all entries in table order -- those entries are BELOW the cursor in query
order and will be visited LATER during reversed traversal. Erasing the heap
entry permanently loses them, causing live rows to be skipped.
The fix mirrors what prepare_heap() already does correctly: when lower_bound
returns null in reversed mode, use std::prev(rows.end()) to keep the last
entry in the heap instead of erasing it.
Add test_reversed_maybe_refresh_keeps_latest_version_entry to mvcc_test,
alongside the existing reversed cursor tests. The test creates a two-version
partition snapshot (v0 with range tombstones, v1 with a live row positioned
below all v0 entries in table order), and
traverses in reverse calling maybe_refresh() at each step -- directly
exercising the buggy code path. The test fails without the fix.
The bug was introduced by 6b7473be53 ("Handle non-evictable snapshots",
2022-11-21), which added null-iterator handling for non-evictable snapshots
(memtable snapshots lack the trailing dummy entry that evictable snapshots
have). prepare_heap() got correct reversed-mode handling at that time, but
maybe_refresh() received only forward-mode logic.
The bug is intermittent because multiple mechanisms cause iterators_valid()
to return false, forcing maybe_refresh() to take the full rebuild path via
prepare_heap() (which handles reversed mode correctly):
- Mutation cleaner merging versions in the background (changes change_mark)
- LSA segment compaction during reserve() (invalidates references)
- B-tree rebalancing on partition insertion (invalidates references)
- Debug mode's always-true need_preempt() creating many multi-version
partitions via preempted apply_monotonically()
A dtest reproducer confirmed the same root cause: with 100K overlapping range
tombstones creating a massively multi-version memtable partition (287K preemption
events), the reversed scan's latest_iterator was observed jumping discontinuously
during a version transition -- the latest version's heap entry was erased --
causing the query to walk the entire partition without finding the live row.
Fixes: SCYLLADB-1253
Closesscylladb/scylladb#29368
execute_batch_without_checking_exception_message() inserted entries
into the authorized prepared cache before verifying that
check_access() succeeded. A failed BATCH therefore left behind
cached 'authorized' entries that later let a direct EXECUTE of the
same prepared statement skip the authorization check entirely.
Move the cache insertion after the access check so that entries are
only cached on success. This matches the pattern already used by
do_execute_prepared() for individual EXECUTE requests.
Introduced in 98f5e49ea8
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1221
Backport: all supported versions
Closesscylladb/scylladb#29432
* github.com:scylladb/scylladb:
test/cqlpy: add reproducer for BATCH prepared auth cache bypass
cql3: fix authorization bypass via BATCH prepared cache poisoning
Add parametrized integration test that verifies DESCRIBE CLUSTER returns correct
values in both normal and maintenance modes:
The parametrization keeps the validation logic (CQL queries and assertions)
identical for both modes, while the setup phase is mode-specific. This ensures
the same assertions apply to both cluster states:
- partitioner is org.apache.cassandra.dht.Murmur3Partitioner
- snitch is org.apache.cassandra.locator.SimpleSnitch
- cluster name matches system.local cluster_name
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update cluster_describe_statement::describe() to retrieve cluster metadata
from storage_service::describe_cluster() instead of directly from db::config
or gossiper.
The storage_service provides a centralized API for accessing cluster metadata
(cluster_name, partitioner, snitch_name) that works in both normal and
maintenance modes, improving separation of concerns.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add cluster_info struct containing cluster_name, partitioner, and snitch_name.
Implement describe_cluster() method to provide cluster metadata by combining
data from gossiper (cluster_name, partitioner) and snitch (snitch_name).
It will be used by next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously Alternator, when emit Amazon's ARN would not stick to the
standard. After our attempt to run KCL with scylla we discovered few
issues.
Amazon's ARN looks like this:
arn:partition:service:region:account-id:resource-type/resource-id
for example:
arn:aws:dynamodb:us-west-2:111122223333:table/TestTable/stream/2015-05-11T21:21:33.291
KCL checks for:
- ARN provided from Alternator calls must fit with basic Amazon's ARN
pattern shown above,
- region constisting only of lower letter alphabets and `-`, no
underscore character
- account-id being only digits (exactly 12)
- service being `dynamodb`
- partition starting with `aws`
The patch updates our code handling ARNs to match those findings.
1. Split `stream_arn` object into `stream_arn` - ARN for streams only and
`stream_shard_id` - id value for stream shards. The latter receives original
implementation. The former emits and parses ARN in a Amazon style.
for example:
2. Update new `stream_arn` class to encode keyspace and table together
separating them by `@`. New ARN looks like this:
arn:aws:dynamodb:us-east-1:000000000000:table/TestKeyspace@TestTable/stream/2015-05-11T21:21:33.291
3. hardcode `dynamodb` as service, `aws` as partition, `us-east-1` as
region and `000000000000` as account-id (must have 12 digits)
4. Update code handling ARNs for tags manipulation to be able to parse
Amazon's style ARNs. Emiting code is left intact - the parser is now
capable of parsing both styles.
5. Added unit tests.
Fixes#28350
Fixes: SCYLLADB-539
Fixes: #28142Closesscylladb/scylladb#28187
This series makes result metadata handling for auth LIST statements consistent and adds coverage for the driver-visible behavior.
The first patch makes the result-column metadata construction shared across the affected statements, so the metadata shape used for PREPARE and EXECUTE stays uniform and easier to reason about.
The second patch adds regression coverage for both sides of the metadata-id flow:
- a Python auth-cluster test verifies that prepared LIST ROLES OF returns a non-empty result metadata id and that a later EXECUTE reuses it without METADATA_CHANGED
- a Boost transport test covers the recovery path where the client sends an empty request metadata id and the server responds with METADATA_CHANGED and the full metadata
Together these patches tighten the implementation and protect the prepared-metadata-id behavior exposed to drivers.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1218
backport: this change should be backported to all active branches to help the driver operation
Closesscylladb/scylladb#29347
Add a boost test that verifies commitlog segments are replayed in
ascending segment ID order within each shard. The test creates
multiple segments, triggers replay via commitlog_replayer, and
captures the "Replaying" debug log messages to verify the order.
Correct segment ordering is required by the strongly consistent
tables feature, particularly commitlog-based storage that relies
on replayed raft items being stored in order.
Ref SCYLLADB-1411.
The commitlog replayer groups segments by shard using a
std::unordered_multimap, then iterates per-shard segments via
equal_range(). However, equal_range() does not guarantee iteration
order for elements with the same key, so segments could be replayed
out of order within a shard. This can increase memory and disk
consumption during fragmented entry reconstruction, which accumulates
fragments across segments and benefits from ascending ID order.
This is also required by the strongly consistent tables feature,
particularly commitlog-based storage that relies on replayed raft
items being stored in order.
Fix by changing the data structure from
std::unordered_multimap<unsigned, commitlog::descriptor>
to
std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>>
Since the descriptors are inserted from a std::set ordered by ID, the
vector preserves insertion (and thus ID) order. The per-shard iteration
now simply iterates the vector, guaranteeing correct replay order.
Fixes SCYLLADB-1411.
Add a GitHub Actions workflow that runs scripts/compare_build_systems.py
on PRs touching build system files (configure.py, **/CMakeLists.txt,
cmake/**).
This prevents future deviations between the two build systems by
catching mismatches early in the CI pipeline.
Closesscylladb/scylladb#29426
The CMake build had -fsanitize-address-use-after-scope (enable) when
it should have been -fno-sanitize-address-use-after-scope (disable).
The comment on lines 24-25 of cql3/CMakeLists.txt explains the intent:
the use-after-scope sanitizer uses too much stack space on CqlParser
and overflows the stack. The Python-ninja path in configure.py:2801-2802
correctly had -fno-sanitize-address-use-after-scope.
Found by black-box comparison of compiler flags between the Python-ninja
and CMake build paths (ninja -nv output, debug mode, CqlParser.o):
Python-ninja: -fno-sanitize-address-use-after-scope (correct: disable)
CMake: -fsanitize-address-use-after-scope (wrong: enable)
Closesscylladb/scylladb#29439
The test waited for two "Finished tablet repair" log messages on the
coordinator, expecting one per tablet. But there are two log sources
that emit messages matching this pattern:
repair module (repair/repair.cc:2329):
"Finished tablet repair for table=..."
topology coordinator (topology_coordinator.cc:2083):
"Finished tablet repair host=..."
When the coordinator is also a repair replica (always the case with
RF=3 and 3 nodes), both messages appear in the coordinator log for the
same tablet within 1ms of each other. The test consumed both, thinking
both tablets were done, while the second tablet repair was still running.
From the CI failure logs:
04:08:09.658 Found: repair[...]: Finished tablet repair for table=...
global_tablet_id=e42fd650-3542-11f1-9756-85403784a622:0
04:08:09.660 Found: raft_topology - Finished tablet repair host=...
tablet=e42fd650-3542-11f1-9756-85403784a622:0
Both messages are for tablet :0. Tablet :1 repair had not finished yet.
The test then wrote keys 20-29 while the second tablet repair was still
in progress. That repair flushed the memtable (via
prepare_sstables_for_incremental_repair), including keys 20-29 in the
repair scan, and mark_sstable_as_repaired set repaired_at=2 on the
resulting sstable. This caused the assertion failure on servers[0]:
"should not have post-repair keys in repaired sstables, got:
{20, 21, 22, 23, 24, 25, 26, 27, 28, 29}"
Fix by matching "Finished tablet repair host=" which is unique to the
topology coordinator message and avoids the ambiguity.
Also fix an incorrect comment that said being_repaired=null when at that
point in the test being_repaired is still set to the session_id (the
delay_end_repair_update injection prevents end_repair from running).
Fixes: SCYLLADB-1478
Closesscylladb/scylladb#29444
In commit 727f68e0f5 we added the ability to SELECT:
* Individual elements of a map: `SELECT map_col[key]`.
* Individual elements of a set: `SELECT set_col[key]` returns key if the key exists in the set, or null if it doesn't, allowing to check if the element exists in the set.
* Individual pieces of a UDT: `SELECT udt_col.field`.
But at the time, we didn't provide any way to retrieve the **meta-data** for this value, namely its timestamp and TTL. We did not support `SELECT TIMESTAMP(collection[key])`, or `SELECT TIMESTAMP(udt.field)`.
Users requested to support such SELECTs in the past (see issue #15427), and Cassandra 5.0 added support for this feature - for both maps and sets and udts - so we also need this feature for compatibility. This feature was also requested recently by vector-search developers, who wanted to read Alternator columns - stored as map elements, not individual columns - with their WRITETIME information.
The first four patches in this series adds the feature (in four smaller patches instead one big one), the fifth and sixth patches add tests (cqlpy and boost tests, respectively). The seventh patch adds documentation.
All the new tests pass on Cassandra 5, failed on Scylla before the present fix, and pass with it.
The fix was surprisingly difficult. Our existing implementation (from 727f68e0f5 building on earlier machinery) doesn't just "read" `map_col[key]` and allow us to return just its timestamp. Rather, the implementation reads the entire map, serializes it in some temporary format that does **not** include the timestamps and ttls, and then takes the subscript key, at which point we no longer have the timestamp or ttl of the element. So the fix had to cross all these layers of the implementation.
While adding support for UDT fields in a pre-existing grammar nonterminal "subscriptExpr", we unintentionally added support for UDT fields also in LWT expressions (which used this nonterminal). LWT missing support for UDT fields was a long-time known compatibility issue (#13624) so we unintentionally fixed it :-) Actually, to completely fix it we needed another small change in the expression implementation, so the eighth patch in this series does this.
Fixes#15427Fixes#13624Closesscylladb/scylladb#29134
* github.com:scylladb/scylladb:
cql3: support UDT fields in LWT expressions
cql3: document WRITETIME() and TTL() for elements of map, set or UDT
test/boost: test WRITETIME() and TTL() on map collection elements
test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT
cql3: prepare and evaluate WRITETIME/TTL on collection elements and UDT fields
cql3: parse per-element timestamps/TTLs in the selection layer
cql3: add extended wire format for per-element timestamps and TTLs
cql3: extend WRITETIME/TTL grammar to accept collection and UDT elements
Document the new vector index behavior in the user-facing and developer
docs.
Describe `index_version` as a creation timeuuid stored in
`system_schema.indexes`, clarify that recreating an index changes it
while ALTER TABLE does not, and document that Scylla allows multiple
named vector indexes on the same column while still rejecting unnamed
duplicates.
Add cqlpy tests for the current CREATE INDEX behavior of vector indexes.
Cover named and unnamed duplicates, IF NOT EXISTS, coexistence of
multiple named vector indexes on the same column, interactions between
named and unnamed indexes, and the same-name-on-different-table case.
An unprivileged user could bypass authorization checks by exploiting
the BATCH prepared statement cache:
1. Prepare an INSERT on a table the user has no access to
2. Execute it inside a BATCH — gets Unauthorized
3. Execute the same prepared INSERT directly — succeeds
Apply filter_errors() to grep_for_errors() results in
test_split_stopped_on_shutdown and
test_group0_apply_while_node_is_being_shutdown. Without filtering,
benign RPC errors like 'connection dropped: Semaphore broken' that
occur during graceful shutdown cause spurious test failures.
Accept both list[str] (from distinct_errors=True) and
list[list[str]] (from distinct_errors=False) in filter_errors(),
matching against the first line of each error group. This allows
tests that call grep_for_errors() with default arguments to
pipe results directly through filter_errors().
Precompute the expected metadata-id hashes for the prepared LIST auth and
service-level statements and verify that PREPARE returns them while EXECUTE
reuses the prepared metadata without METADATA_CHANGED. Run all cases in a
single auth-cluster test after preparing the cluster, role, and service level
once through the regular manager fixture.
execute_batch_without_checking_exception_message() inserted entries
into the authorized prepared cache before verifying that
check_access() succeeded. A failed BATCH therefore left behind
cached 'authorized' entries that later let a direct EXECUTE of the
same prepared statement skip the authorization check entirely.
Move the cache insertion after the access check so that entries are
only cached on success. This matches the pattern already used by
do_execute_prepared() for individual EXECUTE requests.
Introduced in 98f5e49ea8
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1221
Prepared LIST statements were not calculating metadata in PREPARE path, and sent empty string hash to client causing problematic behaviour where metadat_id was not recalculated correctly.
This patch moves metadata construction into get_result_metadata() for the affected LIST statements and reuse that metadata when building the result set.
This gives PREPARE a stable metadata id for LIST ROLES, LIST USERS, LIST PERMISSIONS and the service-level variants.
This patch also adds a new boost test that verifies that when an EXECUTE request carries an empty result metadata id while the server has a real metadata id for the result set, the response is marked METADATA_CHANGED and includes the full result metadata plus the server metadata id.
This covers the recovery path for clients that send an empty or otherwise unusable metadata id instead of a matching cached one.
Allow creating multiple named vector indexes on the same column while
still rejecting duplicate unnamed ones.
`index_metadata::equals_noname()` now ignores `index_version`,
which is unique for every vector index creation, so duplicate detection
keeps working for unnamed vector indexes.
CREATE INDEX keeps using structural duplicate detection for regular
indexes and unnamed vector indexes, but named vector indexes are checked
by name only.
The explicit name check is also needed for IF NOT EXISTS when the same
index name already exists on a different table in the same keyspace,
because vector indexes have no backing view table to catch that case.
Add a regression test that reproduces the race between tablet split and
truncation. The test:
1. Creates a single-tablet table and inserts data.
2. Triggers truncation and pauses it (via database_truncate_wait) after
compaction is disabled but before discard_sstables() runs.
3. Triggers tablet split and pauses it (via tablet_split_monitor_wait)
at the start of process_tablet_split_candidate().
4. Releases split so set_split_mode() creates new compaction groups.
5. Waits for the set_split_mode log confirming the groups exist.
6. Releases truncation so discard_sstables() encounters the new groups.
7. Verifies truncation completes and split finishes.
Adds a tablet_split_monitor_wait error injection point in
process_tablet_split_candidate() to allow pausing the split monitor
before it enters the split loop.
Tablet split can call set_split_mode() between the point where
truncate_table_on_all_shards() disables compaction on all existing
compaction groups and the point where discard_sstables() checks that
compaction is disabled. The new split-ready compaction groups created
by set_split_mode() won't have compaction disabled, causing
discard_sstables() to fire on_internal_error.
Fix by preventing set_split_mode() from creating new compaction groups
when compaction is disabled on the main group. If truncation has
already disabled compaction, split will simply report not-ready rather
than creating groups which have compaction enabled.
This is safe because split will be retried once truncation completes
and re-enables compaction.
implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine.
* tablet merge simply merges the histograms of segments of one compaction group with another.
* for tablet split we take the segments from the source compaction group, read them and write all live records to separate segments according to the split classifier, and move separated segments to the target compaction groups.
* for tablet migration we use stream_blob, similarly to file streaming of sstables. we add a new op type for streaming a logstor segment. on the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it.
* we also do some improvements for recovery and loading of segments. we add a segment header that contains useful information for non-mixed segments, such as the table and token range.
Refs SCYLLADB-770
no backport - still a new and experimental feature
Closesscylladb/scylladb#29207
* github.com:scylladb/scylladb:
test: logstor: additional logstor tests
docs/dev: add logstor on-disk format section
logstor: add version and crc to buffer header
test: logstor: tablet split/merge and migration
logstor: enable tablet balancing
logstor: streaming of logstor segments using stream_blob
logstor: add take_logstor_snapshot
logstor: segment input/output stream
logstor: implement compaction_group::cleanup
logstor: tablet split
logstor: tablet merge
logstor: add compaction reenabler
logstor: add segment header
logstor: serialize writes to active segment
replica: extend compaction_group functions for logstor
replica: add compaction_group_for_logstor_segment
logstor: code cleanup
In an earlier patch, we used the CQL grammar's "subscriptExpr" in
the rule for WRITETIME() and TTL(). But since we also wanted these
to support UDT fields (x.a), not just collection subscripts (x[3]),
we expanded subscriptExpr to also support the field syntax.
But LWT expressions already used this subscriptExpr, which meant
that LWT expressions unintentionally gained support for UDT fields.
Missing support for UDT fields in LWT is a long-standing known
Cassandra-compatibility bug (#13624), and now our grammar finally
supports the missing syntax.
But supporting the syntax is not enough for correct implementation
of this feature - we also need to fix the expression handling:
Two bugs prevented expressions like `v.a = 0` from working in LWT IF
clauses, where `v` is a column of user-defined type.
The first bug was in get_lhs_receiver() in prepare_expr.cc: it lacked
a handler for field_selection nodes, causing an "unexpected expression"
internal error when preparing a condition like `IF v.a = 0`. The fix
adds a handler that returns a column_specification whose type is taken
from the prepared field_selection's type field.
The second bug was in search_and_replace() in expression.cc: when
recursing into a field_selection node it reconstructed it with only
`structure` and `field`, silently dropping the `field_idx` and `type`
fields that are set during preparation. As a result, any transformation
that uses search_and_replace() on a prepared expression containing a
field_selection — such as adjust_for_collection_as_maps() called from
column_condition_prepare() — would zero out those fields. At evaluation
time, type_of() on the field_selection returned a null data_type
pointer, causing a segmentation fault when the comparison operator tried
to call ->equal() through it. The fix preserves field_idx and type when
reconstructing the node.
Fixes#13624.
Add to the SELECT documentation (docs/cql/dml/select.rst) documentation
of the new ability to select WRITETIME() and TTL() of a single element
of map, set or UDT.
Also in the TTL documentation (docs/cql/time-to-live.rst), which already
had a section on "TTL for a collection", add a mention of the ability
to read a single element's TTL(), and an example.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add tests in test/boost/expr_test.cc for the low-level implementation
of writetime() and ttl() on a map element.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds many tests verifying the behavior of WRITETIME() and
TTL() on individual elements of maps, sets and UDTs, serving as a
regression test for issue #15427. We also add tests verifying our
understanding of related issues like WRITETIME() and TTL() of entire
collections and of individual elements of *frozen* collections.
All new tests pass on Cassandra 5.0, helping to verify that our
implementation is compatible with Cassandra. They also pass on
ScyllaDB after the previous patch (most didn't before that patch).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Complete the implementation of SELECT WRITETIME(col[key])/TTL(col[key])
and WRITETIME(col.field)/TTL(col.field), building on the grammar (commit 1),
wire format (commit 2), and selection-layer (commit 3) changes in the
preceding patches.
* prepare_column_mutation_attribute() (prepare_expr.cc) now handles the
subscript and field_selection nodes that the grammar produces:
- For subscripts, it validates that the inner column is a non-frozen
map or set and checks the 'writetime_ttl_individual_element' feature
flag so the feature is rejected during rolling upgrades.
- For field selections, it validates that the inner column is a
non-frozen UDT, with the same feature-flag check.
* do_evaluate(column_mutation_attribute) (expression.cc) handles the
same two cases. For a field selection it serializes the field index as
a key and looks it up in collection_element_metadata; for a subscript
it evaluates the subscript key and looks it up in the same map.
A missing key (element not found or expired) returns NULL, matching
Cassandra behavior.
Together with the preceding three patches, this finally fixes#15427.
The next three patches will add tests and documentation for the new
feature, and the final eighth patch will fix the implementation of
UDT fields in LWT expressions - which the first patch made the grammar
allow but is still not implemented correctly.
Wire up the selection and result-set infrastructure to consume the
extended collection wire format introduced in the previous patch and
expose per-element timestamps and TTLs to the expression evaluator.
* Add collection_cell_metadata: maps from raw element-key bytes to
timestamp and remaining TTL, one entry per collection or UDT cell.
Add a corresponding collection_element_metadata span to
evaluation_inputs so that evaluators can access it.
* Add a flag _collect_collection_timestamps to selection (selection.hh/cc).
When any selected expression contains a WRITETIME(col[key])/TTL(col[key])
or WRITETIME(col.field)/TTL(col.field) attribute, the flag is set and
the send_collection_timestamps partition-slice option is enabled,
causing storage nodes to use the extended wire format from the
previous patch.
* Implement result_set_builder::add_collection() (selection.cc): when
_collect_collection_timestamps is set, parse the extended format,
decode per-element timestamps and remaining TTLs (computed from the
stored expiry time and the query time), and store them in
_collection_element_metadata indexed by column position. When the
flag is not set, the existing plain-bytes path is unchanged.
After this patch, the new selection feature is still not available to
the end-user because the prepare step still forbids it. The next patch
will finally complete the expression preparation and evaluation.
It will read the new collection_element_metadata and return the correct
timestamp or TTL value.
Introduce the infrastructure needed to transport per-element timestamps
and TTL expiry times from replicas to coordinators, required for
WRITETIME(col[key]) / TTL(col[key]) and WRITETIME(col.field) /
TTL(col.field).
* Add a 'writetime_ttl_individual_element' cluster feature flag that
guards usage of the new wire format during rolling upgrades: the
extended format is only emitted and consumed when every node in the
cluster supports it.
* Implement serialize_for_cql_with_timestamps() (types/types.cc), a
variant of serialize_for_cql() that appends a per-element section to
the regular CQL bytes, listing each live element's serialized key,
timestamp, and expiry. The format is:
[uint32 cql_len][cql bytes]
[int32 entry_count]
[per entry: (int32 key_len)(key bytes)(int64 timestamp)(int64 expiry)]
expiry is -1 when the element has no TTL.
* Add partition_slice::option::send_collection_timestamps and modify
write_cell() (mutation_partition.cc) to use the new function
serialize_for_cql_with_timestamps() when this option is available.
This commit stands alone with no user-visible effect: nothing yet sets
the new partition-slice option. The next patch adds the selection-layer
code that sets the option and parses the extended response.
Previously, WRITETIME() and TTL() only accepted a simple column name
(cident), so WRITETIME(m['key']) or WRITETIME(x.a) was a syntax error.
This patch begins to implements support for applying WRITETIME() and
TTL() to individual elements of a non-frozen map, set or UDT, as
requested in issue #15427.
On its own this commit only changes the parser (Cql.g). The prepare
step still rejects subscript and field-selection nodes with an
invalid_request_exception, so there is no user-visible behavior change
yet - just that a syntax error is replaced by a different error.
Upcoming patches add the extended wire format for per-element timestamps
(commit 2), the selection layer that consumes it (commit 3), and the
prepare/evaluate logic that ties everything together (commit 4), after
which WRITETIME() and TTL(col[key]) for collection or UDT elements
will finally be fully functional.
The parser change in this patch expands the subscriptExpr rule to
support the col.field syntax, not only col[key]. This change also
allows the UDT field syntax to be used in LWT conditions, which is
another long-standing missing feature (#13624); But to correctly
support this feature we'll need an additional patch to fix a couple
of remaining bugs - this will be the eighth commit in this series.
Remove const specifier from result_set_row._cells member to make
the class nothrow_move_constructible and nothrow_move_assignable
To be used later in query result_set and friends.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since we do no longer support upgrade from versions that do not support
v2 of "view building status" code (building status is managed by raft) we can remove v1 code and upgrade code and make sure we do not boot with old "builder status" version.
v2 version was introduced by 8d25a4d678 which is included in scylla-2025.1.0.
No backport needed since this is code removal.
Closesscylladb/scylladb#29105
* github.com:scylladb/scylladb:
view: drop unused v1 builder code
view: remove upgrade to raft code
Replace the range scan in read_verify_workload() with individual
single-partition queries, using the keys returned by
prepare_write_workload() instead of hard-coding them.
The range scan was previously observed to time out in debug mode after
a hard cluster restart. Single-partition reads are lighter on the
cluster and less likely to time out under load.
The new verification is also stricter: instead of merely checking that
the expected number of rows is returned, it verifies that each written
key is individually readable, catching any data-loss or key-identity
mismatch that the old count-only check would have missed.
This is the second attemp at stabilizing this test, after the recent
854c374ebf. That fix made sure that the
cluster has converged on topology and nodes see each other before running
the verify workload.
Fixes: SCYLLADB-1331
Closesscylladb/scylladb#29313
The supergroup replaces streaming (a.k.a. maintenance as well) group, inherits 200 shares from it and consists of four sub-groups (all have equal shares of 200 withing the new supergroup)
* maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it
* backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there
* maintenance. It's a new "visible" name, everything that was called "maintenance" in the code ran in "streaming" group. Now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why)
* `tablet_allocator::balance_tablets()`
* `sstables_manager::components_reclaim_reload_fiber()`
* `tablet_storage_group_manager::merge_completion_fiber()`
* metrics exporting http server altogether
* streaming. This is purely existing streaming group that just moves under the new supergroup. Everything else that was run there, continues doing so, including
* hints sender
* all view building related components (update generator, builder, workers)
* repair
* stream_manager
* messaging service (except for verb handlers that switch groups)
* join_cluster() activity
* REST API
* ... something else I forgot
The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility.
All new sched groups inherit `request_class::maintenance` (however, "backup" seem not to make any requests yet).
Moving more activities from "streaming" into "maintenance" (or its own group) is possible, but one will need to take care of RPC group switching. The thing is that when a client makes an RPC call, the server may switch to one of pre-negotiated scheduling groups. Verbs for existing activities that run in "streaming" group are routed through RPC index that negotiates "streaming" group on the server side. If any of that client code moves to some other group, server will still run the handlers in "streaming" which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. Restoring code uses load-and-stream and corresponding RPCs, so it cannot be just moved into its own new group.
Fixes SCYLLADB-351
New feature, not backporting
Closesscylladb/scylladb#28542
* github.com:scylladb/scylladb:
code: Add maintenance/maintenance group
backup: Add maintenance/backup group
compaction: Add maintenance/maintenance_compaction group
main: Introduce maintenance supergroup
main: Move all maintenance sched group into streaming one
database: Use local variable for current_scheduling_group
code: Live-update IO throughputs from main
shared_tombstone_gc_state::update_repair_time() uses copy-on-write
semantics: each call copies the entire per_table_history_maps and the
per-table repair_history_map. repair_service::load_history() called
this once per history entry, making the load O(N²) in both time and
memory.
Introduce batch_update_repair_time() which performs a single
copy-on-write for any number of entries belonging to the same table.
Restructure load_history() to collect entries into batches of up to
1000 and flush each batch in one call, keeping peak memory bounded.
The batch size limit is intentional: the repair history table currently
has no bound on the number of entries and can grow large. Note that
this does not cause a problem in the in-memory history map itself:
entries are coalesced internally and only the latest repair time is
kept per range. The unbounded entry count only makes the batched
update during load expensive.
Fixes: SCYLLADB-104
Closesscylladb/scylladb#29326
Include non-primary key restrictions (e.g. regular column filters) in
the filter JSON sent to the Vector Store service. Previously only
partition key and clustering column restrictions were forwarded, so
filtering on regular columns was silently ignored.
Add get_nonprimary_key_restrictions() getter to statement_restrictions.
Add unit tests for non-primary key equality, range, and bind marker
restrictions in filter_test.
Fixes: SCYLLADB-970
Closesscylladb/scylladb#29019
LDAPRoleManager interpolated usernames directly into ldap_url_template.
That allowed LDAP filter metacharacters to change the query, and URL
metacharacters such as %, ?, and # to change how ldap_url_parse()
split the URL.
Apply two layers of encoding when substituting {USER}:
1. RFC 4515 filter escaping -- neutralises filter operators.
2. URL percent-encoding -- prevents ldap_url_parse from
misinterpreting %-sequences, ? delimiters, or # fragments.
Add validate_query_template() (called from start()) which uses a
sentinel round-trip through ldap_url_parse to reject templates
that place {USER} outside the filter component. Templates that
previously placed {USER} in the host or base DN were silently
accepted; they are now rejected at startup with a descriptive
error.
Change parse_url() to take const sstring& instead of string_view
to enforce the null-termination requirement of ldap_url_parse()
at the type level.
Add regression coverage for %2a, ?, #, and invalid {USER}
placement in the base DN, host, attributes, and extensions.
Update LDAP authorization docs to document the escaping behavior
and the {USER} placement restriction.
Fixes: SCYLLADB-1309
Vector indexes currently store the base table schema version in
`index_version`. That value is name-based, not time-based,
so it does not represent when the index was created.
Store a timeuuid instead and change the relevant interfaces from
`table_schema_version` to `utils::UUID`. This is a prerequisite
for supporting multiple vector indexes on the same column where
the oldest index must be selected deterministically via routing
implemented in Vector Store.
Update the cqlpy tests to check the new semantics directly:
recreating the index changes `index_version`, while ALTER TABLE does not.
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.
A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.
We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.
This improves the performance and size of counter cells by having less
unique counter IDs and less counter shards in a counter cell.
Previously the number of counter shards was the number of different
host_id's that updated the counter, which can be typically the number of
nodes in the cluster and continue growing indefinitely when nodes are
replaced. with the rack-based counter id the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).
Fixes SCYLLADB-356
backport not needed - an enhancement
Closesscylladb/scylladb#28901
* github.com:scylladb/scylladb:
docs/dev: add counters doc
counters: reuse counter IDs by rack
Replace move_to_shard()/move_to_host() with as_bounce()/target_shard()/
target_host() to clarify the interface after bounce was extended to
support cross-node bouncing.
- Add virtual as_bounce() returning const bounce* to the base class
(nullptr by default, overridden in bounce to return this), replacing
the virtual move_to_shard() which conflated bounce detection with
shard access
- Rename move_to_shard() -> target_shard() (now non-virtual, returns
unsigned directly) and move_to_host() -> target_host() on bounce
- Replace dynamic_pointer_cast with static_pointer_cast at call sites
that already checked as_bounce()
- Move forward declarations of message types before the virtual
methods so as_bounce() can reference bounce
Fixes: SCYLLADB-1066
Closesscylladb/scylladb#29367
Motivation
----------
Since strongly consistent tables are based on the concept of Raft
groups, operations on them can get stuck for indefinite amounts of
time. That may be problematic, and so we'd like to implement a way
to cancel those operations at suitable times.
Description of solution
-----------------------
The situations we focus on are the following:
* Timed-out queries
* Leader changes
* Tablet migrations
* Table drops
* Node shutdowns
We handle each of them and provide validation tests.
Implementation strategy
-----------------------
1. Auxiliary commits.
2. Abort operations on timeout.
3. Abort operations on tablet removal.
4. Extend `client_state`.
5. Abort operation on shutdown.
6. Help `state_machine` be aborted as soon as possible.
Tests
-----
We provide tests that validate the correctness of the solution.
The total time spent on `test_strong_consistency.py`
(measured on my local machine, dev mode):
Before:
```
real 0m31.809s
user 1m3.048s
sys 0m21.812s
```
After:
```
real 0m34.523s
user 1m10.307s
sys 0m27.223s
```
The incremental differences in time can be found in the commit messages.
Fixes SCYLLADB-429
Backport: not needed. This is an enhancement to an experimental feature.
Closesscylladb/scylladb#28526
* github.com:scylladb/scylladb:
service: strong_consistency: Abort state_machine::apply when aborting server
service: strong_consistency: Abort ongoing operations when shutting down
service: client_state: Extend with abort_source
service: strong_consistency: Handle abort when removing Raft group
service: strong_consistency: Abort Raft operations on timeout
service: strong_consistency: Use timeout when mutating
service: strong_consistency: Fix indentation
service: strong_consistency: Enclose coordinator methods with try-catch
service: strong_consistency: Crash at unexpected exception
test: cluster: Extract default config & cmdline in test_strong_consistency.py
This reverts commit 8b4a91982b.
Two commits independently added rolling_max_tracker_test to test/boost/CMakeLists.txt:
8b4a919 cmake: add missing rolling_max_tracker_test and symmetric_key_test
f3a91df test/cmake: add missing tests to boost test suite
The second was merged two days after the first. They didn't conflict on
code-level and applied cleanly resulting in a duplicate add_scylla_test()
entries that breaks the CMake build:
CMake Error: add_executable cannot create target
"test_boost_rolling_max_tracker_test" because another target
with the same name already exists.
Remove the duplicate.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reported-by: Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>
The PR contains more code cleanups, mostly in gossiper. Dropping more gossiper state leaving only NORMAL and SHUTDOWN. All other states are checked against topology state. Those two are left because SHUTDOWN state is propagated through gossiper only and when the node is not in SHUTDOWN it should be in some other state.
No need to backport. Cleanups.
Closesscylladb/scylladb#29129
* https://github.com/scylladb/scylladb:
storage_service: cleanup unused code
storage_service: simplify get_peer_info_for_update
gossiper: send shutdown notifications in parallel
gms: remove unused code
virtual_tables: no need to call gossiper if we already know that the node is in shutdown
gossiper: print node state from raft topology in the logs
gossiper: use is_shutdown instead of code it manually
gossiper: mark endpoint_state(inet_address ip) constructor as explicit
gossiper: remove unused code
gossiper: drop last use of LEFT state and drop the state
gossiper: drop unused STATUS_BOOTSTRAPPING state
gossiper: rename is_dead_state to is_left since this is all that the function checks now.
gossiper: use raft topology state instead of gossiper one when checking node's state
storage_service: drop check_for_endpoint_collision function
storage_service: drop is_first_node function
gossiper: remove unused REMOVED_TOKEN state
gossiper: remove unused advertise_token_removed function
Cassandra's native vector index type is StorageAttachedIndex (SAI). Libraries such as CassIO, LangChain, and LlamaIndex generate `CREATE CUSTOM INDEX` statements using the SAI class name. Previously, ScyllaDB rejected these with "Non-supported custom class".
This PR adds compatibility so that SAI-style CQL statements work on ScyllaDB without modification.
1. **test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests**
Enables the `SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS` Cassandra system property so that `search_beam_width` tests pass against Cassandra 5.0.7.
2. **test: modernize vector index test comments and fix xfail**
Updates test comments from "Reproduces" to "Validates fix for" for clarity, and converts the `test_ann_query_with_pk_restriction` xfail into a stripped-down CREATE INDEX syntax test (removing unused INSERT/SELECT lines). Removes the redundant `test_ann_query_with_non_pk_restriction` test.
3. **cql: add Cassandra SAI (StorageAttachedIndex) compatibility**
Core implementation: the SAI class name is detected and translated to ScyllaDB's native `vector_index`. The fully-qualified class name (`org.apache.cassandra.index.sai.StorageAttachedIndex`) requires exact case; short names (`StorageAttachedIndex`, `sai`) are matched case-insensitively — matching Cassandra's behavior. Non-vector and multi-column SAI targets are rejected with clear errors. Adds `skip_on_scylla_vnodes` fixture, SAI compatibility docs, and the Cassandra compatibility table entry (split into "SAI general" vs "SAI for vector search").
4. **cql: accept source_model option for Cassandra SAI compatibility**
The `source_model` option is a Cassandra SAI property used by Cassandra libraries (e.g., CassIO) to tag vector indexes with the name of the embedding model. ScyllaDB accepts it for compatibility but does not use it — the validator is a no-op lambda. The option is preserved in index metadata and returned in DESCRIBE INDEX output.
- `cql3/statements/create_index_statement.cc`: SAI class detection and rewriting logic
- `index/secondary_index_manager.cc`: case-insensitive class name lookup (lowercasing restored before `classes.find()`)
- `index/vector_index.cc`: `source_model` accepted as a valid option with no-op validator
- `docs/cql/secondary-indexes.rst`: SAI compatibility documentation with `source_model` table row
- `docs/using-scylla/cassandra-compatibility.rst`: SAI entry split into general (not supported) and vector search (supported)
- `test/cqlpy/conftest.py`: `scylla_with_tablets` renamed to `skip_on_scylla_vnodes`
- `test/cqlpy/test_vector_index.py`: SAI tests inlined (no constants), `check_bad_option()` helper for numeric validation, uppercase class name test, merged `source_model` tests with DESCRIBE check
| Backend | Passed | Skipped | Failed |
|--------------------|--------|---------|--------|
| ScyllaDB (dev) | 42 | 0 | 0 |
| Cassandra 5.0.7 | 16 | 26 | 0 |
None: new feature.
Fixes: SCYLLADB-239
Closesscylladb/scylladb#28645
* github.com:scylladb/scylladb:
cql: accept source_model option and show options in DESCRIBE
cql: add Cassandra SAI (StorageAttachedIndex) compatibility
test: modernize vector index test comments and fix xfail
test: enable SAI_VECTOR_ALLOW_CUSTOM_PARAMETERS for Cassandra tests
This metric is used to catch execution of scans which go via row
cache, which can have bad effect on performance.
Since f344bd0aaa, aggreagte queries go
via new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.
Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.
get_node_metrics() in test/cluster/dtest/tools/metrics.py used
re.search(metric_name, metric) to match Prometheus metric lines. The
metric name select_partition_range_scan is a substring of
select_partition_range_scan_no_bypass_cache. So when querying for
select_partition_range_scan, the regex matched both Prometheus lines:
scylla_cql_select_partition_range_scan{shard="0",...} 1
scylla_cql_select_partition_range_scan_no_bypass_cache{shard="0",...} 1
And because the code does metrics_res[metric_name] += val, it summed
both values, making it look like the counter was incremented
by 2 when it was actually incremented by 1. The fix appends r"[\s{]"
to the regex so the metric name must be followed by { (labels) or
whitespace (value), preventing substring matches.
Add system.tablets to the set of system resources that can be
accessed with the VECTOR_SEARCH_INDEXING permission.
Fixes: VECTOR-605
Closesscylladb/scylladb#29397
Remove the create_ks_and_cf() helper function and its now-unused import
of format_tuples(). All callers have been converted to use the new
async patterns with new_test_keyspace() and cql.run_async().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace all 6 calls to create_ks_and_cf() with new async patterns:
- Use new_test_keyspace() context manager for keyspace creation
- Use cql.run_async() for CREATE TABLE statement
- Use asyncio.gather() with cql.run_async() for data insertion
The test_restore_with_non_existing_sstable only needs the ks:table
structure to exist; it doesn't use the pre-populated data.
This change makes the code more explicit and maintains proper async
semantics throughout.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add if-True blocks to wrap code that uses create_ks_and_cf() in all 6
test functions. This is a mechanical change to set up the next step
where the helper will be replaced with new async patterns. All code
after the create_ks_and_cf() call until the end of each test is now
indented under the if-True block.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Accept the Cassandra SAI 'source_model' option for vector indexes.
This option is used by Cassandra libraries (e.g., CassIO, LangChain)
to tag vector indexes with the name of the embedding model that
produced the vectors.
ScyllaDB does not use the source_model value but stores it and
includes it in the DESCRIBE INDEX output for Cassandra compatibility.
Additionally, extend vector_index::describe() to emit a
WITH OPTIONS = {...} clause containing all user-provided index options
(filtering out system keys: target, class_name, index_version).
This makes options like similarity_function, source_model, etc.
visible in DESCRIBE output.
Libraries such as CassIO, LangChain, and LlamaIndex create vector
indexes using Cassandra's StorageAttachedIndex (SAI) class name.
This commit lets ScyllaDB accept these statements without modification.
When a CREATE CUSTOM INDEX statement specifies an SAI class name on a
vector column, ScyllaDB automatically rewrites it to the native
vector_index implementation. Accepted class names (case-insensitive):
- org.apache.cassandra.index.sai.StorageAttachedIndex
- StorageAttachedIndex
- sai
SAI on non-vector columns is rejected with a clear error directing
users to a secondary index instead.
The SAI detection and rewriting logic is extracted into a dedicated
static function (maybe_rewrite_sai_to_vector_index) to keep the
already-long validate_while_executing method manageable.
Multi-column (local index) targets and nonexistent columns are
skipped with continue — the former are treated as filtering columns
by vector_index::check_target(), and the latter are caught later by
vector_index::validate().
Tests that exercise features common to both backends (basic creation,
similarity_function, IF NOT EXISTS, bad options, etc.) now use the
SAI class name with the skip_on_scylla_vnodes fixture so they run
against both ScyllaDB and Cassandra. ScyllaDB-specific tests continue
to use USING 'vector_index' with scylla_only.
- Change 'Reproduces' to 'Validates fix for' in test comments to
reflect that the referenced issues are already fixed.
- Condense the VECTOR-179 comment to two lines.
- Replace the xfailed test_ann_query_with_restriction_works_only_on_pk
with a focused test (test_ann_query_with_pk_restriction) that creates
a vector index on a table with a PK column restriction, validating
the VECTOR-374 fix.
Every time someone modifies the build system — adding a source file, changing a compilation flag, or wiring a new test — the change tends to land in only one of our two build systems (configure.py or CMake). Over time this causes three classes of problems:
1. **CMake stops compiling entirely.** Missing defines, wrong sanitizer flags, or misplaced subdirectory ordering cause hard build failures that are only discovered when someone tries to use CMake (e.g. for IDE integration).
2. **Missing build targets.** Tests or binaries present in configure.py are never added to CMake, so `cmake --build` silently skips them. This PR fixes several such cases (e.g. `symmetric_key_test`, `auth_cache_test`, `sstable_tablet_streaming`).
3. **Missing compilation units in targets.** A `.cc` file is added to a test binary in one system but not the other, causing link errors or silently omitted test coverage.
To fix the existing drift and prevent future divergence, this series:
**Adds a build-system comparison script**
(`scripts/compare_build_systems.py`) that configures both systems into a temporary directory, parses their generated `build.ninja` files, and compares per-file compilation flags, link target sets, and per-target libraries. configure.py is treated as the baseline; CMake must match it. The script supports a `--ci` mode suitable for gating PRs that touch
build files.
**Fixes all current mismatches** found by the script:
- Mode flag alignment in `mode.common.cmake` and `mode.Coverage.cmake`
(sanitizer flags, `-fno-lto`, stack-usage warnings, coverage defines).
- Global define alignment (`SEASTAR_NO_EXCEPTION_HACK`, `XXH_PRIVATE_API`,
`BOOST_ALL_DYN_LINK`, `SEASTAR_TESTING_MAIN` placement).
- Seastar build configuration (shared vs static per mode, coverage
sanitizer link options).
- Abseil sanitizer flags (`-fno-sanitize=vptr`).
- Missing test targets in `test/boost/CMakeLists.txt`.
- Redundant per-test flags now covered by global settings.
- Lua library resolution via a custom `cmake/FindLua.cmake` using
pkg-config, matching configure.py's approach.
**Adds documentation** (`docs/dev/compare-build-systems.md`) describing how to run the script and interpret its output.
No backport needed — this is build infrastructure improvement only.
Closesscylladb/scylladb#29273
* github.com:scylladb/scylladb:
scripts: remove lua library rename workaround from comparison script
cmake: add custom FindLua using pkg-config to match configure.py
test/cmake: add missing tests to boost test suite
test/cmake: remove per-test LTO disable
cmake: add BOOST_ALL_DYN_LINK and strip per-component defines
cmake: move SEASTAR_TESTING_MAIN after seastar and abseil subdirs
cmake: add -fno-sanitize=vptr for abseil sanitizer flags
cmake: align Seastar build configuration with configure.py
cmake: align global compile defines and options with configure.py
cmake: fix Coverage mode in mode.Coverage.cmake
cmake: align mode.common.cmake flags with configure.py
configure.py: add sstable_tablet_streaming to combined_tests
docs: add compare-build-systems.md
scripts: add compare_build_systems.py to compare ninja build files
Add .set_skip_when_empty() to all error-path metrics in the tracing
module. Tracing itself is not a commonly used feature, making all of
these metrics almost always zero:
Tier 1 (very rare - corruption/schema issues):
- tracing_keyspace_helper::bad_column_family_errors: tracing schema
missing or incompatible, should never happen post-bootstrap
- tracing::trace_errors: internal error building trace parameters
Tier 2 (overload - tracing backend saturated):
- tracing::dropped_sessions: too many pending sessions
- tracing::dropped_records: too many pending records
Tier 3 (general tracing write errors):
- tracing_keyspace_helper::tracing_errors: errors during writes to
system_traces keyspace
Since tracing is an opt-in feature that most deployments rarely use,
all five metrics are almost always zero and create unnecessary
reporting overhead.
AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#29346
Add a documentation of the counters feature implementation in
docs/dev/counters.md.
The documentation is taken from the wiki and updated according to the
current state of the code - legacy details are removed, and a section
about the counter id is added.
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.
A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.
We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.
This improves the performance and size of counter cells by having less
unique counter IDs and less counter shards in a counter cell.
Previously the number of counter shards was the number of different
host_id's that updated the counter, which can be typically the number of
nodes in the cluster and continue growing indefinitely when nodes are
replaced. with the rack-based counter id the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).
Fixes SCYLLADB-356
Add .set_skip_when_empty() to four metrics in replica/database.cc that
are only incremented on very rare error paths and are almost always zero:
- database::dropped_view_updates: view updates dropped due to overload.
NOTE: this metric appears to never be incremented in the current
codebase and may be a candidate for removal.
- database::multishard_query_failed_reader_stops: documented as a 'hard
badness counter' that should always be zero. NOTE: no increment site
was found in the current codebase; may be a candidate for removal.
- database::multishard_query_failed_reader_saves: documented as a 'hard
badness counter' that should always be zero.
- database::total_writes_rejected_due_to_out_of_space_prevention: only
fires when disk utilization is critical and user table writes are
disabled, a very rare operational state.
These metrics create unnecessary reporting overhead when they are
perpetually zero. set_skip_when_empty() suppresses them from metrics
output until they become non-zero.
AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#29345
After obtaining the CQL response, check if its actual size exceeds the initially acquired memory permit. If so, acquire additional semaphore units and adopt them into the permit, ensuring accurate memory accounting for large responses.
Additionally, move the permit into a .then() continuation so that the semaphore units are kept alive until write_message finishes, preventing premature release of memory permit. This is especially important with slow networks and big responses when buffers can accumulate and deplete a node's memory.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1306
Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Backport: all supported versions
Closesscylladb/scylladb#29288
* github.com:scylladb/scylladb:
transport: add per-service-level pending response memory metric
transport: hold memory permit until response write completes
transport: account for response size exceeding initial memory estimate
Add .set_skip_when_empty() to four metrics in the db module that are
only incremented on very rare error paths and are almost always zero:
- cache::pinned_dirty_memory_overload: described as 'should sit
constantly at 0, nonzero is indicative of a bug'
- corrupt_data::entries_reported: only fires on actual data corruption
- hints::corrupted_files: only fires on on-disk hint file corruption
- rate_limiter::failed_allocations: only fires when the rate limiter
hash table is completely full and gives up allocating, requiring
extreme cardinality pressure
These metrics create unnecessary reporting overhead when they are
perpetually zero. set_skip_when_empty() suppresses them from metrics
output until they become non-zero.
AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#29344
get_live_members function called is_shutdown which inet_address
argument, which caused temporary endpoint_state to be created. Fix
it by prohibiting implicit conversion and calling the correct
is_shutdown function instead.
The decommission sets left gossiper state only to prevent shutdown
notification be issued by the node during shutdown. Since the
notification code now checks the state in raft topology this is no
longer needed.
The state machine used by strongly consistent tablets may block on a
read barrier if the local schema is insufficient to resolve pending
mutations [1]. To deal with that, we perform a read barrier that may
block for a long time.
When a strongly consistent tablet is being removed, we'd like to cancel
all ongoing executions of `state_machine::apply`: the shard is no
longer responsible for the tablet, so it doesn't matter what the outcome
is.
---
In the implementation, we abort the operations by simply throwing
an exception from `state_machine::apply` and not doing anything.
That's a red flag considering that it may lead to the instance
being killed on the spot [2].
Fortunately for us, strongly consistent tables use the default Raft
server implementation, i.e. `raft::server_impl`, which actually
handles one type of an exception thrown by the method: namely,
`abort_requested_exception`, which is the default exception thrown
by `seastar::abort_source` [3]. We leverage this property.
---
Unfortunately, `raft::server_impl::abort` isn't perfectly suited for
us. If we look into its code, we'll see that the relevant portion of
the procedure boils down to three steps:
1. Prevent scheduling adding new entries.
2. Wait for the applier fiber.
3. Abort the state machine.
Since aborting the state machine happens only after the applier fiber
has already finished, there will no longer be anything to abort. Either
all executions of `state_machine::apply` have already finished, or they
are hanging and we cannot do anything.
That's a pre-existing problem that we won't be solving here (even
though it's possible). We hope the problem will be solved, and it seems
likely: the code suggests that the behavior is not intended. For more
details, see e.g. [4].
---
We provide two validation tests. They simulate the abortion of
`state_machine::apply` in two different scenarios:
* when the table is dropped (which should also cover the case of tablet
migration),
* when the node is shutting down.
The value of the tests isn't high since they don't ensure that the
state of the group is still valid (though it should be), nor do they
perform any other check. Instead, we rely on the testing framework to
spot any anomalies or errors. That's probably the best we can do at
the moment.
Unfortunately, both tests are marked as skipped becuause of the current
limitations of `raft::server_impl::abort` described above and in [4].
References:
[1] 4c8dba1
[2] See the description of `raft::state_machine` in `raft/raft.hh`.
[3] See `server_impl::applier_fiber` in `raft/server.cc`.
[4] SCYLLADB-1056
These changes are complementary to those from a recent commit where we
handled aborting ongoing operations during tablet events, such as
tablet migration. In this commit, we consider the case of shutting down
a node.
When a node is shutting down, we eventually close the connections. When
the client can no longer get a response from the server, it makes no
sense to continue with the queries. We'd like to cancel them at that
point.
We leverage the abort source passed down via `client_state` down to
the strongly consistent coordinator. This way, the transport layer can
communicate with it and signal that the queries should be canceled.
The abort source is triggered by the CQL server (cf.
`generic_server::server::{stop,shutdown}`).
---
Note that this is not an optional change. In fact, if we don't abort
those requests, we might hang for an indefinite amount of time when
executing the following code in `main.cc`:
```
// Register at_exit last, so that storage_service::drain_on_shutdown will be called first
auto do_drain = defer_verbose_shutdown("local storage", [&ss] {
ss.local().drain_on_shutdown().get();
});
```
The problem boils down to the fact that `generic_server::server::stop`
will wait for all connections to be closed, but that won't happen until
all ongoing operations (at least those to strongly consistent tables)
are finished.
It's important to highlight that even though we hang on this, the
client can no longer get any response. Thus, it's crucial that at that
point we simply abort ongoing operations to proceed with the rest of
shutdown.
---
Two tests are added to verify that the implementation is correct:
one focusing on local operations, the other -- on a forwarded write.
Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:
Before:
```
real 0m31.775s
user 1m4.475s
sys 0m22.615s
```
After:
```
real 0m32.024s
user 1m10.751s
sys 0m23.871s
```
Individual runs of the added tests:
test_queries_when_shutting_down:
```
real 0m12.818s
user 0m36.726s
sys 0m4.577s
```
test_abort_forwarded_write_upon_shutdown:
```
real 0m12.930s
user 0m36.622s
sys 0m4.752s
```
We make `client_state` store a pointer to an `abort_source`. This will
be useful in the following commit that will implement aborting ongoing
requests to strongly consistent tables upon connection shutdowns.
It might also be useful in some other places in the code in the future.
We set the abort source for client states in relevant places.
When a strongly consistent Raft group is being removed, it means one of
the following cases:
(A) The node is shutting down and it's simply part of the the shutdown
procedure.
(B) The tablet is somehow leaving the replica. For example, due to:
- Tablet migration
- Tablet split/merge
- Tablet removal (e.g. because the table is dropped)
In this commit, we focus on case (A). Case (B) will be handled in the
following one.
---
The changes in the code are literally none, and there's a reason to it.
First, let's note that we've already implemented abortion of timed-out
requests. There is a limit to how long a query can run and sooner or
later it will finish, regardless of what we do.
Second, we need to ask ourselves if the cases we're considering in this
commit (i.e. case (B)) is a situation where we'd like to speed up the
process. The answer is no.
Tablet migrations are effectively internal operations that are invisible
to the users. User requests are, quite obviously, the opposite of that.
Because of that, we want to patiently wait for the queries to finish or
time out, even though it's technically possible to lead to an abort
earlier.
Lastly, the changes in the code that actually appear in this commit are
not completely irrelevant either. We consider the important case of
the `leader_info_updater` fiber and argue that it's safe to not pass
any abort source to the Raft methods used by it.
---
Unfortunately, we don't have tablet migrations implemented yet [1],
so our testing capabilities are limited. Still, we provide a new test
that corresponds to case (B) described above. We simulate a tablet
migration by dropping a table and observe how reads and writes behave
in such a situation. There's no extremely careful validation involved
there, but that's what we can have for the time being.
Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:
Before:
```
real 0m30.841s
user 1m3.294s
sys 0m21.091s
```
After:
```
real 0m31.775s
user 1m4.475s
sys 0m22.615s
```
The time spent on the new test only:
```
real 0m5.264s
user 0m34.646s
sys 0m3.374s
```
References:
[1] SCYLLADB-868
If a query, either a write, or a read to a strongly consistent table,
times out, we immediately abort the operation and throw an exception.
Unfortunately, due to the inconsistency in exception types thrown
on timeout by the many methods we use in the code, it results in
pretty messy `try-catch` clauses. Perhaps there's a better alternative
to this, but it's beyond the scope of this work, so we leave it as-is.
We provide a validation test that consists of three cases corresponding
to reads, writes, and waiting for the leader. They verify that the code
works as expected in all affected places.
A comparison of time spent on the whole `test_strong_consistency.py` on
my local machine, in dev mode:
Before:
```
real 0m32.185s
user 0m55.391s
sys 0m15.745s
```
After:
```
real 0m30.841s
user 1m3.294s
sys 0m21.091s
```
The time spent on the new test only:
```
real 0m7.077s
user 0m35.359s
sys 0m3.717s
```
- Document Alternator (DynamoDB-compatible API) auditing support in
the operator-facing auditing guide (docs/operating-scylla/security/auditing.rst)
- Cover operation-to-category mapping, operation field format,
keyspace/table filtering, and audit log examples
- Document the audit_tables=alternator.<table> shorthand format
- Minor wording improvements throughout (Scylla -> ScyllaDB,
clarify default audit backend)
Closesscylladb/scylladb#29231
We remove the inconsistency between reads and writes to strongly
consistent tables. Before the commit, only reads used a timeout.
Now, writes do as well.
Although the parameter isn't used yet, that will change in the following
commit. This is a prerequisite for it.
We enclose `coordinator::{mutate,query}` with `try-catch` clauses. They
do nothing at the moment, but we'll use them later. We do this now to
avoid noise in the upcoming commits.
We'll fix the indentation in the following commit.
The loop shouldn't throw any other exception than the ones already
covered by the `catch` claues. Crash, at least when
`abort_on_internal_error` is set, if we catch any other type since
that may be a sign of a bug.
All used configs and cmdlines share the same values. Let's extract them
to avoid repeating them every time a new test is written. Those options
should be enabled for all tests in the file anyway.
Fixes#29043 with the following docs changes:
- docs/dev/system-keyspaces.md: Added a new file that documents all keyspaces created internally
Closesscylladb/scylladb#29044
Spreading db::config around and making all services depend on it is not nice. Most other service that need configuration provide their own config that's populated from db::config in main.cc/cql_test_env.cc and use it, not the global config.
This PR does the same for repair_service.
Enhancing components dependencies, not backporting
Closesscylladb/scylladb#29153
* github.com:scylladb/scylladb:
repair: Remove db/config.hh from repair/*.cc files
repair: Move repair_multishard_reader options onto repair_service::config
repair: Move critical_disk_utilization_level onto repair_service::config
repair: Move repair_partition_count_estimation_ratio onto repair_service::config
repair: Move repair_hints_batchlog_flush_cache_time_in_ms onto repair_service::config
repair: Move enable_small_table_optimization_for_rbno onto repair_service::config
repair: Introduce repair_service::config
While working on benchmarks for strong consistency we noticed that the raft logic attempted to take snapshots during the benchmark. Snapshot transfer is not implemented for strong consistency yet and the methods that take or transfer snapshots throw exceptions. This causes the raft groups to stop working completely.
While implementing snapshot transfers is out of scope, we can implement some mitigations now to stop the tests from breaking:
- The first commit adjusts the configuration options. First, it disables periodic snapshotting (i.e. creating a snapshot every X log entries). Second, it increases the memory threshold for the raft log before which a snapshot is created from 2MB to 10MB.
- The second commit relaxes the take snapshot / drop snapshot methods and makes it possible to actually use them - they are no-ops. It is still forbidden to transfer snapshots.
I am including both commits because applying only the first one didn't completely prevent the issue from occurring when testing locally.
Refs: SCYLLADB-1115
Strong consistency is experimental, no need for backport.
Closesscylladb/scylladb#29189
* github.com:scylladb/scylladb:
strong_consistency: fake taking and dropping snapshots
strong_consistency: adjust limits for snapshots
This commit removes references ScyllaDB versions ("Since x.y")
from the ScyllaDB documentation on Docker Hub, as they are
redundant and confusing (some versions are super ancient).
Fixes SCYLLADB-1212
Closesscylladb/scylladb#29204
The code responds ealry with READY message, but lack some necessary set up, namely:
* update_scheduling_group(): without it, the connection runs under the default scheduling group instead of the one mapped to the user's service level.
* on_connection_ready(): without it, the connection never releases its slot in the uninitialized-connections concurrency semaphore (acquired at connection creation), leaking one unit per cert-authenticated connection for the lifetime of the connection.
* _authenticating = false / _ready = true: without them, system.clients reports connection_stage = AUTHENTICATING forever instead of READY (not critical, but not nice either)
The PR fixes it and adds a regression test, that (for sanity) also covers AllowAll and Password authrticators
Fixes SCYLLADB-1226
Present since 2025.1, probably worth backporting
Closesscylladb/scylladb#29220
* github.com:scylladb/scylladb:
transport: fix process_startup cert-auth path missing connection-ready setup
transport: test that connection_stage is READY after auth via all process_startup paths
During incremental repair, each tablet replica holds three SSTable views:
UNREPAIRED, REPAIRING, and REPAIRED. The repair lifecycle is:
1. Replicas snapshot unrepaired SSTables and mark them REPAIRING.
2. Row-level repair streams missing rows between replicas.
3. mark_sstable_as_repaired() runs on all replicas, rewriting the
SSTables with repaired_at = sstables_repaired_at + 1 (e.g. N+1).
4. The coordinator atomically commits sstables_repaired_at=N+1 and
the end_repair stage to Raft, then broadcasts
repair_update_compaction_ctrl which calls clear_being_repaired().
The bug lives in the window between steps 3 and 4. After step 3, each
replica has on-disk SSTables with repaired_at=N+1, but sstables_repaired_at
in Raft is still N. The classifier therefore sees:
is_repaired(N, sst{repaired_at=N+1}) == false
sst->being_repaired == null (lost on restart, or not yet set)
and puts them in the UNREPAIRED view. If a new write arrives and is
flushed (repaired_at=0), STCS minor compaction can fire immediately and
merge the two SSTables. The output gets repaired_at = max(N+1, 0) = N+1
because compaction preserves the maximum repaired_at of its inputs.
Once step 4 commits sstables_repaired_at=N+1, the compacted output is
classified REPAIRED on the affected replica even though it contains data
that was never part of the repair scan. Other replicas, which did not
experience this compaction, classify the same rows as UNREPAIRED. This
divergence is never healed by future repairs because the repaired set is
considered authoritative. The result is data resurrection: deleted rows
can reappear after the next compaction that merges unrepaired data with the
wrongly-promoted repaired SSTable.
The fix has two layers:
Layer 1 (in-memory, fast path): mark_sstable_as_repaired() now also calls
mark_as_being_repaired(session) on the new SSTables it writes. This keeps
them in the REPAIRING view from the moment they are created until
repair_update_compaction_ctrl clears the flag after step 4, covering the
race window in the normal (no-restart) case.
Layer 2 (durable, restart-safe): a new is_being_repaired() helper on
tablet_storage_group_manager detects the race window even after a node
restart, when being_repaired has been lost from memory. It checks:
sst.repaired_at == sstables_repaired_at + 1
AND tablet transition kind == tablet_transition_kind::repair
Both conditions survive restarts: repaired_at is on-disk in SSTable
metadata, and the tablet transition is persisted in Raft. Once the
coordinator commits sstables_repaired_at=N+1 (step 4), is_repaired()
returns true and the SSTable naturally moves to the REPAIRED view.
The classifier in make_repair_sstable_classifier_func() is updated to call
is_being_repaired(sst, sstables_repaired_at) in place of the previous
sst->being_repaired.uuid().is_null() check.
A new test, test_incremental_repair_race_window_promotes_unrepaired_data,
reproduces the bug by:
- Running repair round 1 to establish sstables_repaired_at=1.
- Injecting delay_end_repair_update to hold the race window open.
- Running repair round 2 so all replicas complete mark_sstable_as_repaired
(repaired_at=2) but the coordinator has not yet committed step 4.
- Writing post-repair keys to all replicas and flushing servers[1] to
create an SSTable with repaired_at=0 on disk.
- Restarting servers[1] so being_repaired is lost from memory.
- Waiting for autocompaction to merge the two SSTables on servers[1].
- Asserting that the merged SSTable contains post-repair keys (the bug)
and that servers[0] and servers[2] do not see those keys as repaired.
NOTE FOR MAINTAINER: Copilot initially only implemented Layer 1 (the
in-memory being_repaired guard), missing the restart scenario entirely.
I pointed out that being_repaired is lost on restart and guided Copilot
to add the durable Layer 2 check. I also polished the implementation:
moving is_being_repaired into tablet_storage_group_manager so it can
reuse the already-held _tablet_map (avoiding an ERM lookup and try/catch),
passing sstables_repaired_at in from the classifier to avoid re-reading it,
and using compaction_group_for_sstable inside the function rather than
threading a tablet_id parameter through the classifier.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1239.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29244
This series fixes two related inconsistencies around secondary-index
names.
1. `DESCRIBE INDEX ... WITH INTERNALS` returned the backing
materialized-view name in the `name` column instead of the logical
index name.
2. The snapshot REST API accepted backing table names for MV-backed
secondary indexes, but not the logical index names exposed to users.
The snapshot side now resolves logical secondary-index names to backing
table names where applicable, reports logical index names in snapshot
details, rejects vector index names with HTTP 400, and keeps multi-keyspace
DELETE atomic by resolving all keyspaces before deleting anything.
The tests were also extended accordingly, and the snapshot test helper
was fixed to clean up multi-table snapshots using one DELETE per table.
Fixes: SCYLLADB-1122
Minor bugfix, no need to backport.
Closesscylladb/scylladb#29083
* github.com:scylladb/scylladb:
cql3: fix DESCRIBE INDEX WITH INTERNALS name
test: add snapshot REST API tests for logical index names
test: fix snapshot cleanup helper
api: clarify snapshot REST parameter descriptions
api: surface no_such_column_family as HTTP 400
db: fix clear_snapshot() atomicity and use C++23 lambda form
db: normalize index names in get_snapshot_details()
db: add resolve_table_name() to snapshot_ctl
To create `process_staging` view building tasks, we firstly need to collect informations about them on shard0, create necessary mutations, commit them to group0 and move staging sstables objects to their original shards.
But there is a possible race after committing the group0 command and before moving the staging sstables to their shards. Between those two events, the coordinator may schedule freshly created tasks and dispatch them to the worker but the worker won't have the sstables objects because they weren't moved yet.
This patch fixes the race by holding `_staging_sstables_mutex` locks from all necessary shards when executing `create_staging_sstable_tasks()`. With this, even if the task will be scheduled and dispatched quickly, the worker will wait with executing it until the sstables objects are moved and the locks are released.
Fixes SCYLLADB-816
This PR should be backported to all versions containing view building coordinator (2025.4 and newer).
Closesscylladb/scylladb#29174
* github.com:scylladb/scylladb:
db/view/view_building_worker: fix indentation
db/view/view_building_worker: lock staging sstables mutex for necessary shards when creating tasks
Add tests that reproduce LDAP filter injection via unescaped {USER}
substitution (SCYLLADB-1309). A wildcard username ('*') matches
every group entry, and a parenthesis payload (")(uid=*") breaks the
search filter.
Extend the LDAP test fixture (ldap_server.py, slapd.conf) with
memberUid attributes and the NIS schema so the new tests can
exercise direct filter-value substitution.
DESCRIBE INDEX ... WITH INTERNALS returned the name of
the backing materialized view in the name column instead
of the logical index name.
Return the logical index name from schema::describe()
for index schemas so all callers observe the
user-facing name consistently.
Fixes: SCYLLADB-1122
Add focused REST coverage for logical secondary-index
names in snapshot creation, deletion, and details
output.
Also cover vector-index rejection and verify
multi-keyspace delete resolves all keyspaces before
deleting anything so mixed index kinds cannot cause
partial removal.
The snapshot REST helper cleaned up multi-table
snapshots with a single DELETE request that passed a
comma-separated cf filter, but the API accepts only one
table name there.
Delete each table snapshot separately so existing tests
that snapshot multiple tables use the API as
documented.
Document the current /storage_service/snapshots behavior
more accurately.
For DELETE, cf is a table filter applied independently
in each keyspace listed in kn. If cf is omitted or
empty, snapshots for all tables are eligible, and
secondary indexes can be addressed by their logical
index name.
Snapshot requests that name a non-existent table or a
non-snapshotable logical index currently surface an
internal server error.
Translate no_such_column_family into a bad request so
callers get a client-facing error that matches the
invalid input.
clear_snapshot() applies a table filter independently in
each keyspace, so logical index names must be resolved
per keyspace on the delete path as well.
Resolve all keyspaces before deleting anything so a later
failure cannot partially remove a snapshot, and use the
explicit-object-parameter coroutine lambda form for the
asynchronous implementation.
Snapshot details exposed backing secondary-index view
names instead of logical index names.
Normalize index entries in get_snapshot_details() so the
REST API reports the user-facing name, and update the
existing REST test to assert that behavior directly.
The snapshot REST API accepted backing secondary-index
table names, but not logical index names.
Introduce resolve_table_name() so snapshot creation can
translate a logical index name to the backing table when
the index is materialized as a view.
Currently we don't support 'local' consistency, which would
imply maintaining separate raft group for each dc. What we
support is actually 'global' consistency -- one raft group
per tablet replica set. We don't plan to support local
consistency for the first GA.
Closesscylladb/scylladb#29221
Hi, thanks for Scylla!
We found a small issue in tracker::set_configuration() during joint consensus and put together a fix.
When a server is demoted from voter to non-voter, set_configuration processes the current config first (can_vote=false), then the previous config. But when it finds the server already in the progress map (tracker.cc:118), it hits `continue` without updating can_vote. So the server's follower_progress::can_vote stays false even though it's still a voter in the previous config.
This causes broadcast_read_quorum (fsm.cc:1055) to skip the demoted server, reducing the pool of responders. Since committed() correctly includes the server in _previous_voters for quorum calculation, read barriers can stall if other servers are slow.
The fix is to use configuration::can_vote() in tracker::set_configuration.
We included a reproduction unit test (test_tracker_voter_demotion_joint_config) that extracts the set_configuration algorithm and demonstrates the mismatch. We weren't able to build the full Scylla test suite to add an in-tree test, so we kept it as a standalone file for reference.
No backport: the bug is non-critical and the change needs some soak time in master.
Closesscylladb/scylladb#29226
* https://github.com/scylladb/scylladb:
fix: use is_voter::yes instead of true in test assertions
test: add tracker voter demotion test to fsm_test.cc
fix: use configuration::can_vote() in tracker::set_configuration
The test_create_index_synchronous_updates test in test_secondary_index_properties.py
was intermittently failing with 'assert found_wanted_trace' because the expected
trace event 'Forcing ... view update to be synchronous' was missing from the
trace events returned by get_query_trace().
Root cause: trace events are written asynchronously to system_traces.events.
The Python driver's populate() method considers a trace complete once the
session row in system_traces.sessions has duration IS NOT NULL, then reads
events exactly once. Since the session row and event rows are written as
separate mutations with no transactional guarantee, the driver can read an
incomplete set of events.
Evidence from the failed CI run logs:
- The entire test (CREATE TABLE through DROP TABLE) completed in ~300ms
(01:38:54,859 - 01:38:55,157)
- The INSERT with tracing happened in a ~50ms window between the second
CREATE INDEX completing (01:38:55,108) and DROP TABLE starting
(01:38:55,157)
- The 'Forcing ... synchronous' trace message is generated during the
INSERT write path (db/view/view.cc:2061), so it was produced, but
not yet flushed to system_traces.events when the driver read them
- This matches the known limitation documented in test/alternator/
test_tracing.py: 'we have no way to know whether the tracing events
returned is the entire trace'
Fix: replace the single-shot trace.events read with a retry loop that
directly queries system_traces.events until the expected event appears
(with a 30s timeout). Use ConsistencyLevel.ONE since system_traces has
RF=2 and cqlpy tests run on a single-node cluster.
The same race condition pattern exists in test_mv_synchronous_updates in
test_materialized_view.py (which this test was modeled after), so the
same fix is proactively applied there as well.
Fixes SCYLLADB-1314
Closesscylladb/scylladb#29374
Previously, the timestamp decoded from a timeuuid was printed using the
local timezone via datetime.fromtimestamp(), which produces different
output depending on the machine's locale settings.
ScyllaDB logs are emitted in UTC by default. Printing the decoded date
in UTC makes it straightforward to correlate SSTable identifiers with
log entries without having to mentally convert timezones.
Also fix the embedded pytest assertion, which was accidentally correct
only on machines in UTC+8 — it now uses an explicit UTC-aware datetime.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29253
Add explicit permissions block (contents: read, pull-requests: write,
statuses: write) matching the requirements of the called reusable
workflow which checks out code, posts PR comments, and sets commit
statuses. Fixes code scanning alert #172.
Closesscylladb/scylladb#29183
Replace strict case-sensitive '== "True"' check with strcasecmp(..., "true")
so that Python's str(True) -> "True" is properly recognized. Accepts any
case variation of "true" ("True", "TRUE", etc.), with empty string
defaulting to false.
Maintains backward compatibility with out-of-tree tests that rely on
Python's bool stringification.
The goal is to reduce the number of distinct ways API handlers use to
convert string http query parameters into bool variables. This place is the
only one that simply compares param to "True".
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29236
Switch to branch lts_2026_01_07, which is exactly equal to
upstream now.
There were no notable changes in the release notes, but the
new versions are more friendly to newer compilers (specifically,
in include hygiene).
configure.py needs a few library updates; cmake works without
change.
scylla-gdb.py updated for new hash table layout (by Claude Opus 4.6).
* abseil d7aaad83...255c84da (1179):
> Abseil LTS branch, Jan 2026, Patch 1 (#2007)
> Cherry-picks for LTS 20260107 (#1990)
> Apply LTS transformations for 20260107 LTS branch (#1989)
> Mark legacy Mutex methods and MutexLock pointer constructors as deprecated
> `cleanup`: specify that it's safe to use the class in a signal handler.
> Suppress bugprone-use-after-move in benign cases
> StrFormat: format scientific notation without heap allocation
> Introduce a legacy copy of GetDebugStackTraceHook API.
> Report 1ns instead of 0ns for probe_benchmarks. Some tools incorrectly assume that benchmark was not run if 0ns reported.
> Add absl::chunked_queue
> `CRC32` version of `CombineContiguous` for length <= 32.
> Add `absl::down_cast`
> Fix FixedArray iterator constructor, which should require input_iterator, not forward_iterator
> Add a latency benchmark for hashing a pair of integers.
> Delete absl::strings_internal::STLStringReserveAmortized()
> As IsAtLeastInputIterator helper
> Use StringAppendAndOverwrite() in CEscapeAndAppendInternal()
> Add support for absl::(u)int128 in FastIntToBuffer()
> absl/strings: Prepare helper for printing objects to string representations.
> Use SimpleAtob() for parsing bool flags
> No-op changes to relative timeout support code.
> Adjust visibility of heterogeneous_lookup_testing.h
> Remove -DUNORDERED_SET_CXX17 since the macro no longer exists
> [log] Prepare helper for streaming container contents to strings.
> Restrict the visibility of some internal testing utilities
> Add absl::linked_hash_set and absl::linked_hash_map
> [meta] Add constexpr testing helper.
> BUILD file reformatting.
> `absl/meta`: Add C++17 port of C++20 `requires` expression for internal use
> Remove the implementation of `absl::string_view`, which was only needed prior to C++17. `absl::string_view` is now an alias for `std::string_view`. It is recommended that clients simply use `std::string_view`.
> No public description
> absl:🎏 Stop echoing file content in flagfile parsing errors Modified ArgsList::ReadFromFlagfile to redact the content of unexpected lines from error messages. \
> Refactor the declaration of `raw_hash_set`/`btree` to omit default template parameters from the subclasses.
> Import of CCTZ from GitHub.
> Add ABSL_ATTRIBUTE_LIFETIME_BOUND to Flag help generator
> Correct `Mix4x16Vectors` comment.
> Special implementation for string hash with sizes greater than 64.
> Reorder function parameters so that hash state is the first argument.
> Search more aggressively for open slots in absl::internal_stacktrace::BorrowedFixupBuffer
> Implement SpinLockHolder in terms of std::lock_guard.
> No public description
> Avoid discarding test matchers.
> Import of CCTZ from GitHub.
> Automated rollback of commit 9f40d6d6f3cfc1fb0325dd8637eb65f8299a4b00.
> Enable clang-specific warnings on the clang-cl build instead of just trying to be MSVC
> Enable clang-specific warnings on the clang-cl build instead of just trying to be MSVC
> Make AnyInvocable remember more information
> Add further diagnostics under clang for string_view(nullptr)
> Import of CCTZ from GitHub.
> Document the differing trimming behavior of absl::Span::subspan() and std::span::subspan()
> Special implementation for string hash with sizes in range [33, 64].
> Add the deleted string_view(std::nullptr_t) constructor from C++23
> CI: Use a cached copy of GoogleTest in CMake builds if possible to minimize the possibility of errors downloading from GitHub
> CI: Enable libc++ hardening in the ASAN build for even more checks https://libcxx.llvm.org/Hardening.html
> Call the common case of AllocateBackingArray directly instead of through the function pointer.
> Change AlignedType to have a void* array member so that swisstable backing arrays end up in the pointer-containing partition for heap partitioning.
> base: Discourage use of ABSL_ATTRIBUTE_PACKED
> Revert: Add an attribute to HashtablezInfo which performs a bitwise XOR on all hashes. The purposes of this attribute is to identify if identical hash tables are being created. If we see a large number of identical tables, it's likely the code can be improved by using a common table as opposed to keep rebuilding the same one.
> Import of CCTZ from GitHub.
> Record insert misses in hashtable profiling.
> Add absl::StatusCodeToStringView.
> Add a missing dependency on str_format that was being pulled in transitively
> Pico-optimize `SkipWhitespace` to use `StripLeadingAsciiWhitespace`.
> absl::string_view: Upgrade the debug assert on the single argument char* constructor to ABSL_HARDENING_ASSERT
> Use non-stack storage for stack trace buffers
> Fixed incorrect include for ABSL_NAMESPACE_BEGIN
> Add ABSL_REFACTOR_INLINE to separate the inliner directive from the deprecated directive so that we can give users a custom deprecation message.
> Reduce stack usage when unwinding without fixups
> Reduce stack usage when unwinding from 170 to 128 on x64
> Rename RecordInsert -> RecordInsertMiss.
> PR #1968: Use std::move_backward within InlinedVector's Storage::Insert
> Use the new absl::StringResizeAndOverwrite() in CUnescape()
> Explicitly instantiate common `raw_hash_set` backing array functions.
> Rollback reduction of maximum load factor. Now it is back to 28/32.
> Export Mutex::Dtor from shared libraries in NDEBUG mode
> Allow `IsOkAndHolds` to rely on duck typing for matching `StatusOr` like types instead of uniquely `absl::StatusOr`, e.g. `google::cloud::StatusOr`.
> Fix typo in macro and add missing static_cast for WASM builds.
> windows(cmake): add abseil_test_dll to target link libraries when required
> Handle empty strings in `SimpleAtof` after stripping whitespace
> Avoid using a thread_local in an inline function since this causes issues on some platforms.
> (Roll forward) Change Abseil's SpinLock adaptive_spin_count to a class static variable that can be set by tcmalloc friend classes.
> Change Abseil's SpinLock adaptive_spin_count to a class static variable that can be set by tcmalloc friend classes.
> Change Abseil's SpinLock adaptive_spin_count to a class static variable that can be set by tcmalloc friend classes.
> Fixes for String{Resize|Append}AndOverwrite - StringAppendAndOverwrite() should always call StringResizeAndOverwrite() with at least capacity() in case the standard library decides to shrink the buffer (Fixes#1965) - Small refactor to make the minimum growth an addition for clarity and to make it easier to test 1.5x growth in the future - Turn an ABSL_HARDENING_ASSERT into a ThrowStdLengthError - Add a missing std::move
> Correct the supported features of Status Matchers
> absl/time: Use "memory order acquire" for loads, which would allow for the safe removal of the data memory barrier.
> Use the new absl::StringResizeAndOverwrite() in string escaping utilities
> Add an internal-only helper StringAppendAndOverwrite() similar to StringResizeAndOverwrite() but optimized for repeated appends, using exponential growth to ensure amortized complexity of increasing a string size by a small amount is O(1).
> Release `ABSL_EXPECT_OK` and `ABSL_ASSERT_OK`.
> Fix the CHECK_XX family of macros to not print `char*` arguments as C-strings if the comparison happened as pointers. Printing as pointers is more relevant to the result of the comparison.
> Rollback StringAppendAndOverwrite() - the problem is that StringResizeAndOverwrite has MSAN testing of the entire string. This causes quadratic MSAN verification on small appends.
> Add an internal-only helper StringAppendAndOverwrite() similar to StringResizeAndOverwrite() but optimized for repeated appends, using exponential growth to ensure amortized complexity of increasing a string size by a small amount is O(1).
> PR #1961: Fix Clang warnings on powerpc
> Use the new absl::StringResizeAndOverwrite() in string escaping utilities
> Use the new absl::StringResizeAndOverwrite() in string escaping utilities
> macOS CI: Move the Bazel vendor_dir to ${HOME} to workaround a Bazel issue where it does not work when it is in ${TMP} and also fix the quoting which was causing it to incorrectly receive the argument
> Use __msan_check_mem_is_initialized for detailed MSan report
> Optimize stack unwinding by reducing `AddressIsReadable` calls.
> Add internal API to allow bypassing stack trace fixups when needed
> absl::StrFormat: improve test coverage with scientific exponent test cases
> Add throughput and latency benchmarks for `absl::ToDoubleXYZ` functions.
> CordzInfo: Use absl::NoDestructor to remove a global destructor. Chromium requires no global destructors.
> string_view: Enable std::view and std::borrowed_range
> cleanup: s/logging_internal/log_internal/ig for consistency
> Use the new absl::StringResizeAndOverwrite() in string escaping utilities
> Use the new absl::StringResizeAndOverwrite() in string escaping utilities
> Use the new absl::StringResizeAndOverwrite() in absl::AsciiStrTo{Lower|Upper}
> Use the new absl::StringResizeAndOverwrite() in absl::StrJoin()
> Use the new absl::StringResizeAndOverwrite() in absl::StrCat()
> string_view: Fix include order
> Don't pass nullptr as the 1st arg of `from_chars`
> absl/types: format code with clang-format.
> Validate absl::StringResizeAndOverwrite op has written bytes as expected.
> Skip the ShortStringCollision test on WASM.
> Rollback `absl/types`: format code with clang-format.
> Remove usage of the WasmOffsetConverter for Wasm / Emscripten stack-traces.
> Use the new absl::StringResizeAndOverwrite() in absl::CordCopyToString()
> Remove an undocumented behavior of --vmodule and absl::SetVLogLevel that could set a module_pattern to defer to the global vlog threshold.
> Update to rules_cc 0.2.9
> Avoid redefine warnings with ntstatus constants
> PR #1944: Use same element-width for non-temporal loads and stores on Arm
> absl::StringResizeAndOverwrite(): Add the requirement that the only value that can be written to buf[size] is the terminator character.
> absl/types: format code with clang-format.
> Minor formatting changes.
> Remove `IntIdentity` and `PtrIdentity` from `raw_hash_set_probe_benchmark`.
> Automated rollback of commit cad60580dba861d36ed813564026d9774d9e4e2b.
> FlagStateInterface implementors need only support being restored once.
> Clarify the post-condition of `reserve()` in Abseil hash containers.
> Clarify the post-condition of `reserve()` in Abseil hash containers.
> Represent dropped samples in hashtable profile.
> Add lifetimebound to absl::implicit_cast and make it work for rvalue references as it already does with lvalue references
> Clean up a doc example where we had `absl_nonnull` and `= nullptr;`
> Change Cordz to synchronize tracked cords with Snapshots / DeleteQueue
> Minor refactor to `num_threads` in deadlock test
> Rename VLOG macro parameter to match other uses of this pseudo type.
> `time`: Fix indentation
> Automated Code Change
> Adds `absl::StringResizeAndOverwrite` as a polyfill for C++23's `std::basic_string<CharT,Traits,Allocator>::resize_and_overwrite`
> Internal-only change
> absl/time: format code with clang-format.
> No public description
> Expose typed releasers of externally appended memory.
> Fix __declspec support for ABSL_DECLARE_FLAG()
> Annotate absl::AnyInvocable as an owner type via [[gsl::Owner]] and absl_internal_is_view = std::false_type
> Annotate absl::FunctionRef as a view type via [[gsl::Pointer]] and absl_internal_is_view
> Remove unnecessary dep on `core_headers` from the `nullability` cc_library
> type_traits: Add type_identity and type_traits_t backfills
> Refactor raw_hash_set range insertion to call private insert_range function.
> Fix bug in absl::FunctionRef conversions from non-const to const
> PR #1937: Simplify ConvertSpecialToEmptyAndFullToDeleted
> Improve absl::FunctionRef compatibility with C++26
> Add a workaround for unused variable warnings inside of not-taken if-constexpr codepaths in older versions of GCC
> Annotate ABSL_DIE_IF_NULL's return type with `absl_nonnull`
> Move insert index computation into `PrepareInsertLarge` in order to reduce inlined part of insert/emplace operations.
> Automated Code Change
> PR #1939: Add missing rules_cc loads
> Expose (internally) a LogMessage constructor taking file as a string_view for (internal, upcoming) FFI integration.
> Fixed up some #includes in mutex.h
> Make absl::FunctionRef support non-const callables, aligning it with std::function_ref from C++26
> Move capacity update in `Grow1To3AndPrepareInsert` after accessing `common.infoz()` to prevent assertion failure in `control()`.
> Fix check_op(s) compilation failures on gcc 8 which eagerly tries to instantiate std::underlying_type for non-num types.
> Use `ABSL_ATTRIBUTE_ALWAYS_INLINE`for lambda in `find_or_prepare_insert_large`.
> Mark the implicit floating operators as constexpr for `absl::int128` and `absl::uint128`
> PR #1931: raw_hash_set: fix instantiation for recursive types on MSVC with /Zc:__cplusplus
> Add std::pair specializations for IsOwner and IsView
> Cast ABSL_MIN_LOG_LEVEL to absl::LogSeverityAtLeast instead of absl::LogSeverity.
> Fix a corner case in the aarch64 unwinder
> Fix inconsistent nullability annotation in ReleasableMutexLock
> Remove support for Native Client
> Rollback f040e96b93dba46e8ed3ca59c0444cbd6c0a0955
> When printing CHECK_XX failures and both types are unprintable, don't bother printing " (UNPRINTABLE vs. UNPRINTABLE)".
> PR #1929: Fix shorten-64-to-32 warning in stacktrace_riscv-inl.inc
> Refactor `find_or_prepare_insert_large` to use a single return statement using a lambda.
> Use possible CPUs to identify NumCPUs() on Linux.
> Fix incorrect nullability annotation of `absl::Cord::InlineRep::set_data()`.
> Move SetCtrl* family of functions to cc file.
> Change absl::InlinedVector::clear() so that it does not deallocate any allocated space. This allows allocations to be reused and matches the behavior specification of std::vector::clear().
> Mark Abseil container algorithms as `constexpr` for C++20.
> Fix `CHECK_<OP>` ambiguous overload for `operator<<` in older versions of GCC when C-style strings are compared
> stacktrace_test: avoid spoiling errno in the test signal handler.
> Optimize `CRC32AcceleratedX86ARMCombinedMultipleStreams::Extend` by interleaving the `CRC32_u64` calls at a lower level.
> stacktrace_test: avoid spoiling errno in the test signal handler.
> stacktrace_test: avoid spoiling errno in the test signal handler.
> std::multimap::find() is not guaranteed to return the first entry with the requested key. Any may be returned if many exist.
> Mark `/`, `%`, and `*` operators as constexpr when intrinsics are available.
> Add the C++20 string_view contructor that uses iterators
> Implement absl::erase_if for absl::InlinedVector
> Adjust software prefetch to fetch 5 cachelines ahead, as benchmarking suggests this should perform better.
> Reduce maximum load factor to 27/32 (from 28/32).
> Remove unused include
> Remove unused include statement
> PR #1921: Fix ABSL_BUILD_DLL mode (absl_make_dll) with mingw
> PR #1922: Enable mmap for WASI if it supports the mman header
> Rollback C++20 string_view constructor that uses iterators due to broken builds
> Add the C++20 string_view contructor that uses iterators
> Bump versions of dependencies in MODULE.bazel
> Automated Code Change
> PR #1918: base: add musl + ppc64le fallback for UnscaledCycleClock::Frequency
> Optimize crc32 Extend by removing obsolete length alignment.
> Fix typo in comment of `ABSL_ATTRIBUTE_UNUSED`.
> Mark AnyInvocable as being nullability compatible.
> Ensure stack usage remains low when unwinding the stack, to prevent stack overflows
> Shrink #if ABSL_HAVE_ATTRIBUTE_WEAK region sizes in stacktrace_test.cc
> <filesystem> is not supported for XTENSA. Disable it in //absl/hash/internal/hash.h.
> Use signal-safe dynamic memory allocation for stack traces when necessary
> PR #1915: Fix SYCL Build Compatibility with Intel LLVM Compiler on Windows for abseil
> Import of CCTZ from GitHub.
> Tag tests that currently fail on ios_sim_arm64 with "no_test_ios_sim_arm64"
> Automated Code Change
> Automated Code Change
> Import of CCTZ from GitHub.
> Move comment specific to pointer-taking MutexLock variant to its definition.
> Add lifetime annotations to MutexLock, SpinLockHolder, etc.
> Add lifetimebound annotations to absl::MakeSpan and absl::MakeConstSpan to detect dangling references
> Remove comment mentioning deferenceability.
> Add referenceful MutexLock with Condition overload.
> Mark SpinLock camel-cased methods as ready for inlining.
> Whitespace change
> In logging tests that write expectations against `ScopedMockLog::Send`, suppress the default behavior that forwards to `ScopedMockLog::Log` so that unexpected logs are printed with full metadata. Many of these tests are poking at those metadata, and a failure message that doesn't include them is unhelpful.
> Add ABSL_ATTRIBUTE_LIFETIME_BOUND to absl::ClippedSubstr
> Inline internal usages of Mutex::Lock, etc. in favor of lock.
> Inline internal usages of pointerful SpinLockHolder/MutexLock.
> Remove wrong comment in Cord::Unref
> Update the crc32 dynamic dispatch table with newer platforms.
> PR #1914: absl/base/internal/poison.cc: Minor build fix
> Accept references on SpinLockHolder/MutexLock
> Import of CCTZ from GitHub.
> Fix typos in comments.
> Inline SpinLock Lock->lock, Unlock->unlock internal to Abseil.
> Rename Mutex methods to use the typical C++ lower case names.
> Rename SpinLock methods to use the typical C++ lower case names.
> Add an assert that absl::StrSplit is not called with a null char* argument.
> Fix sign conversion warning
> PR #1911: Fix absl_demangle_test on ppc64
> Disallow using a hash function whose return type is smaller than size_t.
> Optimize CRC-32C extension by zeroes
> Deduplicate stack trace implementations in stacktrace.cc
> Align types of location_table_ and mapping_table_ keys (-Wshorten-64-to-32).
> Move SigSafeArena() out to absl/base/internal/low_level_alloc.h
> Allow CHECK_<OP> variants to be used with unprintable types.
> Import of CCTZ from GitHub.
> Adds required load statements for C++ rules to BUILD and bzl files.
> Disable sanitizer bounds checking in ComputeZeroConstant.
> Roll back NDK weak symbol mode for backtrace() due to internal test breakage
> Add converter for extracting SwissMap profile information into a https://github.com/google/pprof suitable format for inspection.
> Allocate memory for frames and sizes during stack trace fix-up when no memory is provided
> Support NDK weak symbol mode for backtrace() on Android.
> Change skip_empty_or_deleted to not use groups.
> Fix bug of dereferencing invalidated iterator in test case.
> Refactor: split erase_meta_only into large and small versions.
> Fix a TODO to use std::is_nothrow_swappable when it became available.
> Clean up the testing of alternate options that were removed in previous changes
> Only use generic stacktrace when ABSL_HAVE_THREAD_LOCAL.
> Automated Code Change
> Add triviality tests for absl::Span
> Loosen the PointerAlignment test to allow up to 5 stuck bits to avoid flakiness.
> Prevent conversion constructions from absl::Span to itself
> Skip flaky expectations in waiter_test for MSVC.
> Refactor: call AssertIsFull from iterator::assert_is_full to avoid passing the same arguments repeatedly.
> In AssertSameContainer, remove the logic checking for whether the iterators are from SOO tables or not since we don't use it to generate a more informative debug message.
> Remove unused NonIterableBitMask::HighestBitSet function.
> Refactor: move iterator unchecked_* members before data members to comply with Google C++ style guide.
> Mix pointers once instead of twice now that we've improved mixing on 32-bit platforms and improved the kMul constant.
> Remove unused utility functions/constants.
> Revert a change for breaking downstream third party libs
> Remove unneeded include from cord_rep_btree_navigator.h
> Refactor: move find_first_non_full into raw_hash_set.cc.
> Perform stronger mixing on 32-bit platforms and enable the LowEntropyStrings test.
> Include deallocated caller-provided size in delete hooks.
> Roll back one more time: In debug mode, assert that the probe sequence isn't excessively long.
> Allow a `std::move` of `delimiter_` to happen in `ByString::ByString(ByString&&)`. Right now the move ctor is making a copy because the source object is `const`.
> Assume that control bytes don't alias CommonFields.
> Consistently use [[maybe_unused]] in raw_hash_set.h for better compiler warning compatibility.
> Roll forward: In debug mode, assert that the probe sequence isn't excessively long.
> Add a new test for hash collisions for short strings when PrecombineLengthMix has low quality.
> Refactor: define CombineRawImpl for repeated `Mix(state ^ value, kMul)` operations.
> Automated Code Change
> Mark hash_test as large so that the timeout is increased.
> Change the value of kMul to have higher entropy and prevent collisions when keys are aligned integers or pointers.
> Fix LIFETIME annotations for op*/op->/value operators for reference types.
> Update StatusOr to support lvalue reference value types.
> Rollback debug assertion that the probe sequence isn't excessively long.
> AnyInvocable: Fix operator==/!= comments
> In debug mode, assert that the probe sequence isn't excessively long.
> Improve NaN handling in absl::Duration arithmetic.
> Change PrecombineLengthMix to sample data from kStaticRandomData.
> Fix includes and fuse constructors of SpinLock.
> Enable `operator==` for `StatusOr` only if the contained type is equality-comparable
> Enable SIMD memcpy-crc on ARM cores.
> Improve mixing on 32-bit platforms.
> Change DurationFromDouble to return -InfiniteDuration() for all NaNs.
> Change return type of hash internal `Seed` to `size_t` from `uint64_t`
> CMake: Add a fatal error when the compiler defaults to or is set to a C++ language standard prior to C++17.
> Make bool true hash be ~size_t{} instead of 1 so that all bits are different between true/false instead of only one.
> Automated Code Change
> Pass swisstable seed as seed to absl::Hash so we can save an XOR in H1.
> Add support for scoped enumerations in CHECK_XX().
> Revert no-inline on Voidify::operator&&() -- caused unexpected binary size growth
> Mark Voidify::operator&&() as no-inline. This improves stack trace for `LOG(FATAL)` with optimization on.
> Refactor long strings hash computations and move `len <= PiecewiseChunkSize()` out of the line to keep only one function call in the inlined hash code.
> rotr/rotl: Fix undefined behavior when passing INT_MIN as the number of positions to rotate by
> Reorder members of MixingHashState to comply with Google C++ style guide ordering of type declarations, static constants, ctors, non-ctor functions.
> Delete unused function ShouldSampleHashtablezInfoOnResize.
> Remove redundant comments that just name the following symbol without providing additional information.
> Remove unnecessary modification of growth info in small table case.
> Suppress CFI violation on VDSO call.
> Replace WeakMix usage with Mix and change H2 to use the most significant 7 bits - saving 1 cycle in H1.
> Fix -Wundef warning
> Fix conditional constexpr in ToInt64{Nano|Micro|Milli}seconds under GCC7 and GCC8 using an else clause as a workaround
> Enable CompressedTupleTest.NestedEbo test case.
> Lift restriction on using EBCO[1] for nested CompressedTuples. The current implementation of CompressedTuple explicitly disallows EBCO for cases where CompressedTuples are nested. This is because the implentation for a tuple with EBCO-compatible element T inherits from Storage<T, I>, where I is the index of T in the tuple, and
> absl::string_view: assert against (data() == nullptr && size() != 0)
> Fix a false nullability warning in [Q]CHECK_OK by replacing nullptr with an empty char*
> Make `combine_contiguous` to mix length in a weak way by adding `size << 24`, so that we can avoid a separate mixing of size later. The empty range is mixing 0x57 byte.
> Add a test case that -1.0 and 1.0 have different hashes.
> Update CI to a more recent Clang on Linux x86-64
> `absl::string_view`: Add a debug assert to the single-argument constructor that the argument is not `nullptr`.
> Fix CI on macOS Sequoia
> Use Xcode 16.3 for testing
> Use a proper fix instead of a workaround for a parameter annotated absl_nonnull since the latest Clang can see through the workaround
> Assert that SetCtrl isn't called on small tables - there are no control bytes in such cases.
> Use `MaskFullOrSentinel` in `skip_empty_or_deleted`.
> Reduce flakiness in MockDistributions.Examples test case.
> Rename PrepareInsertNonSoo to PrepareInsertLarge now that it's no longer used in all non-SOO cases.
> PR #1895: use c++17 in podspec
> Avoid hashing the key in prefetch() for small tables.
> Remove template alias nullability annotations.
> Add `Group::MaskFullOrSentinel` implementation without usage.
> Move `hashtable_control_bytes` tests into their own file.
> Simplify calls to `EqualElement` by introducing `equal_to` helper function.
> Do `common.increment_size()` directly in SmallNonSooPrepareInsert if inserting to reserved 1 element table.
> Import of CCTZ from GitHub.
> Small cleanup of `infoz` processing to get the logic out of the line or removed.
> Extract the entire PrepareInsert to Small non SOO table out of the line.
> Take `get_hash` implementation out of the SwissTable class to minimize number of instantiations.
> Change kEmptyGroup to kDefaultIterControl now that it's only used for default-constructed iterators.
> [bits] Add tests for return types
> Avoid allocating control bytes in capacity==1 swisstables.
> PR #1888: Adjust Table.GrowExtremelyLargeTable to avoid OOM on i386
> Avoid mixing after `Hash64` calls for long strings by passing `state` instead of `Seed` to low level hash.
> Indent absl container examples consistently
> Revert- Doesn't actually work because SWIG doesn't use the full preprocessor
> Add tags to skip some tests under UBSAN.
> Avoid subtracting `it.control()` and `table.control()` in single element table during erase.
> Remove the `salt` parameter from low level hash and use a global constant. That may potentially remove some loads.
> In SwissTable, don't hash the key when capacity<=1 on insertions.
> Remove the "small" size designation for thread_identity_test, which causes the test to timeout after 60s.
> Add comment explaining math behind expressions.
> Exclude SWIG from ABSL_DEPRECATED and ABSL_DEPRECATE_AND_INLINE
> stacktrace_x86: Handle nested signals on altstack
> Import of CCTZ from GitHub.
> Simplify MixingHashState::Read9To16 to not depend on endianness.
> Delete deprecated `absl::Cord::Get` and its remaining call sites.
> PR #1884: Remove duplicate dependency
> Remove relocatability test that is no longer useful
> Import of CCTZ from GitHub.
> Fix a bug of casting sizeof(slot_type) to uint16_t instead of uint32_t.
> Rewrite `WideToUtf8` for improved readability.
> Avoid requiring default-constructability of iterator type in algorithms that use ContainerIterPairType
> Added test cases for invalid surrogates sequences.
> Use __builtin_is_cpp_trivially_relocatable to implement absl::is_trivially_relocatable in a way that is compatible with PR2786 in the upcoming C++26.
> Remove dependency on `wcsnlen` for string length calculation.
> Stop being strict about validating the "clone" part of mangled names
> Add support for logging wide strings in `absl::log`.
> Deprecate `ABSL_HAVE_STD_STRING_VIEW`.
> Change some nullability annotations in absl::Span to absl_nullability_unknown to workaround a bug that makes nullability checks trigger in foreach loops, while still fixing the -Wnullability-completeness warnings.
> Linux CI update
> Fix new -Wnullability-completeness warnings found after upgrading the Clang version used in the Linux ARM CI to Clang 19.
> Add __restrict for uses of PolicyFunctions.
> Use Bazel vendor mode to cache external dependencies on Windows and macOS
> Move PrepareInsertCommon from header file to cc file.
> Remove the explicit from the constructor to a test allocator in hash_policy_testing.h. This is rejected by Clang when using the libstdc++ that ships with GCC15
> Extract `WideToUtf8` helper to `utf8.h`.
> Updates the documentation for `CHECK` to make it more explicit that it is used to require that a condition is true.
> Add PolicyFunctions::soo_capacity() so that the compiler knows that soo_capacity() is always 0 or 1.
> Expect different representations of pointers from the Windows toolchain.
> Add set_no_seed_for_testing for use in GrowExtremelyLargeTable test.
> Update GoogleTest dependency to 1.17.0 to support GCC15
> Assume that frame pointers inside known stack bounds are readable.
> Remove fallback code in absl/algorithm/container.h
> Fix GCC15 warning that <ciso646> is deprecated in C++17
> Fix misplaced closing brace
> Remove unused include.
> Automated Code Change
> Type erase copy constructor.
> Refactor to use hash_of(key) instead of hash_ref()(key).
> Create Table.Prefetch test to make sure that it works.
> Remove NOINLINE on the constructor with buckets.
> In SwissTable, don't hash the key in find when capacity<=1.
> Use 0x57 instead of Seed() for weakly mixing of size.
> Use absl::InsecureBitGen in place of std::random_device in Abseil tests.
> Remove unused include.
> Use large 64 bits kMul for 32 bits platforms as well.
> Import of CCTZ from GitHub.
> Define `combine_weakly_mixed_integer` in HashSelect::State in order to allow `friend auto AbslHashValue` instead of `friend H AbslHashValue`.
> PR #1878: Fix typos in comments
> Update Abseil dependencies in preparation for release
> Use weaker mixing for absl::Hash for types that mix their sizes.
> Update comments on UnscaledCycleClock::Now.
> Use alignas instead of the manual alignment for the Randen entropy pool.
> Document nullability annotation syntax for array declarations (not many people may know the syntax).
> Import of CCTZ from GitHub.
> Release tests for ABSL_RAW_DCHECK and ABSL_RAW_DLOG.
> Adjust threshold for stuck bits to avoid flaky failures.
> Deprecate template type alias nullability annotations.
> Add more probe benchmarks
> PR #1874: Simplify detection of the powerpc64 ELFv1 ABI
> Make `absl::FunctionRef` copy-assignable. This brings it more in line with `std::function_ref`.
> Remove unused #includes from absl/base/internal/nullability_impl.h
> PR #1870: Retry SymInitialize on STATUS_INFO_LENGTH_MISMATCH
> Prefetch from slots in parallel with reading from control.
> Migrate template alias nullability annotations to macros.
> Improve dependency graph in `TryFindNewIndexWithoutProbing` hot path evaluation.
> Add latency benchmarks for Hash for strings with size 3, 5 and 17.
> Exclude UnwindImpl etc. from thread sanitizer due to false positives.
> Use `GroupFullEmptyOrDeleted` inside of `transfer_unprobed_elements_to_next_capacity_fn`.
> PR #1863: [minor] Avoid variable shadowing for absl btree
> Extend stack-frame walking functionality to allow dynamic fixup
> Fix "unsafe narrowing" in absl for Emscripten
> Roll back change to address breakage
> Extend stack-frame walking functionality to allow dynamic fixup
> Introduce `absl::Cord::Distance()`
> Avoid aliasing issues in growth information initialization.
> Make `GrowSooTableToNextCapacityAndPrepareInsert` in order to initialize control bytes all at once and avoid two function calls on growth right after SOO.
> Simplify `SingleGroupTableH1` since we do not need to mix all bits anymore. Per table seed has a good last bit distribution.
> Use `NextSeed` instead of `NextSeedBaseNumber` and make the result type to be `uint16_t`. That avoids unnecessary bit twiddling and simplify the code.
> Optimize `GrowthToLowerBoundCapacity` in order to avoid division.
> [base] Make :endian internal to absl
> Fully qualify absl names in check macros to avoid invalid name resolution when the user scope has those names defined.
> Fix memory sanitization in `GrowToNextCapacityAndPrepareInsert`.
> Define and use `ABSL_SWISSTABLE_ASSERT` in cc file since a lot of logic moved there.
> Remove `ShouldInsertBackwards` functionality. It was used for additional order randomness in debug mode. It is not necessary anymore with introduction of separate per table `seed`.
> Fast growing to the next capacity based on carbon hash table ideas.
> Automated Code Change
> Refactor CombinePiecewiseBuffer test case to (a) call PiecewiseChunkSize() to get the chunk size and (b) use ASSERT for expectation in a loop.
> PR #1867: Remove global static in stacktrace_win32-inl.inc
> Mark Abseil hardening assert in AssertIsValidForComparison as slow.
> Roll back a problematic change.
> Add absl::FastTypeId<T>()
> Automated Code Change
> Update TestIntrinsicInt128 test to print the indices with the conflicting hashes.
> Code simplification: we don't need XOR and kMul when mixing large string hashes into hash state.
> Refactor absl::CUnescape() to use direct string output instead of pointer/size.
> Rename `policy.transfer` to `policy.transfer_n`.
> Optimize `ResetCtrl` for small tables with `capacity < Group::KWidth * 2` (<32 if SSE enabled and <16 if not).
> Use 16 bits of per-table-seed so that we can save an `and` instruction in H1.
> Fully annotate nullability in headers where it is partially annotated.
> Add note about sparse containers to (flat|node)_hash_(set|map).
> Make low_level_alloc compatible with -Wthread-safety-pointer
> Add missing direct includes to enable the removal of unused includes from absl/base/internal/nullability_impl.h.
> Add tests for macro nullability annotations analogous to existing tests for type alias annotations.
> Adds functionality to return stack frame pointers during stack walking, in addition to code addresses
> Use even faster reduction algorithm in FinalizePclmulStream()
> Add nullability annotations to some very-commonly-used APIs.
> PR #1860: Add `unsigned` to character buffers to ensure they can provide storage (https://eel.is/c++draft/intro.object#3)
> Release benchmarks for absl::Status and absl::StatusOr
> Use more efficient reduction algorithm in FinalizePclmulStream()
> Add a test case to make it clear that `--vmodule=foo/*=1` does match any children and grandchildren and so on under `foo/`.
> Gate use of clang nullability qualifiers through absl nullability macros on `nullability_on_classes`.
> Mark `absl::StatusOr::status()` as ABSL_MUST_USE_RESULT
> Cleanups related to benchmarks * Fix many benchmarks to be cc_binary instead of cc_test * Add a few benchmarks for StrFormat * Add benchmarks for Substitute * Add benchmarks for Damerau-Levenshtein distance used in flags
> Add a log severity alias `DO_NOT_$UBMIT` intended for logging during development
> Avoid relying on true and false tokens in the preprocessor macros used in any_invocable.h
> Avoid relying on true and false tokens in the preprocessor macros used in absl/container
> Refactor to make it clear that H2 computation is not repeated in each iteration of the probe loop.
> Turn on C++23 testing for GCC and Clang on Linux
> Fix overflow of kSeedMask on 32 bits platform in `generate_new_seed`.
> Add a workaround for std::pair not being trivially copyable in C++23 in some standard library versions
> Refactor WeakMix to include the XOR of the state with the input value.
> Migrate ClearPacBits() to a more generic implementation and location
> Annotate more Abseil container methods with [[clang::lifetime_capture_by(...)]] and make them all forward to the non-captured overload
> Make PolicyFunctions always be the second argument (after CommonFields) for type-erased functions.
> Move GrowFullSooTableToNextCapacity implementation with some dependencies to cc file.
> Optimize btree_iterator increment/decrement to avoid aliasing issues by using local variables instead of repeatedly writing to `this`.
> Add constexpr conversions from absl::Duration to int64_t
> PR #1853: Add support for QCC compiler
> Fix documentation for key requirements of flat_hash_set
> Use `extern template` for `GrowFullSooTableToNextCapacity` since we know the most common set of paramenters.
> C++23: Fix log_format_test to match the stream format for volatile pointers
> C++23: Fix compressed_tuple_test.
> Implement `btree::iterator::+=` and `-=`.
> Stop calling `ABSL_ANNOTATE_MEMORY_IS_INITIALIZED` for threadlocal counter.
> Automated Code Change
> Introduce seed stored in the hash table inside of the size.
> Replace ABSL_ATTRIBUTE_UNUSED with [[maybe_unused]]
> Minor consistency cleanups to absl::BitGen mocking.
> Restore the empty CMake targets for bad_any_cast, bad_optional_access, and bad_variant_access to allow clients to migrate.
> bits.h: Add absl::endian and absl::byteswap polyfills
> Use absl::NoDestructor an absl::Mutex instance in the flags library to prevent some exit-time destructor warnings
> Add thread GetEntropyFromRandenPool test
> Update nullability annotation documentation to focus on macro annotations.
> Simplify some random/internal types; expose one function to acquire entropy.
> Remove pre-C++17 workarounds for lack of std::launder
> UBSAN: Use -fno-sanitize-recover
> int128_test: Avoid testing signed integer overflow
> Remove leading commas in `Describe*` methods of `StatusIs` matcher.
> absl::StrFormat: Avoid passing null to memcpy
> str_cat_test: Avoid using invalid enum values
> hash_generator_testing: Avoid using invalid enum values
> absl::Cord: Avoid passing null to memcpy and memset
> graphcycles_test: Avoid applying a non-zero offset to a null pointer
> Make warning about wrapping empty std::function in AnyInvocable stronger.
> absl/random: Convert absl::BitGen / absl::InsecureBitGen to classes from aliases.
> Fix buffer overflow the internal demangling function
> Avoid calling `ShouldRehashForBugDetection` on the first two inserts to the table.
> Remove the polyfill implementations for many type traits and alias them to their std equivalents. It is recommended that clients now simple use the std equivalents.
> ROLLBACK: Limit slot_size to 2^16-1 and maximum table size to 2^43-1.
> Limit `slot_size` to `2^16-1` and maximum table size to `2^43-1`.
> Use C++17 [[nodiscard]] instead of the deprecated ABSL_MUST_USE_RESULT
> Remove the polyfills for absl::apply and absl::make_from_tuple, which were only needed prior to C++17. It is recommended that clients simply use std::apply and std::make_from_tuple.
> PR #1846: Fix build on big endian
> Bazel: Move environment variables to --action_env
> Remove the implementation of `absl::variant`, which was only needed prior to C++17. `absl::variant` is now an alias for `std::variant`. It is recommended that clients simply use `std::variant`.
> MSVC: Fix warnings c4244 and c4267 in the main library code
> Update LowLevelHashLenGt16 to be LowLevelHashLenGt32 now that the input is guaranteed to be >32 in length.
> Xtensa does not support thread_local. Disable it in absl/base/config.h.
> Add support for 8-bit and 16-bit integers to absl::SimpleAtoi
> CI: Update Linux ARM latest container
> Add time hash tests
> `any_invocable`: Update comment that refer to C++17 and C++11
> `check_test_impl.inc`: Use C++17 features unconditionally
> Remove the implementation of `absl::optional`, which was only needed prior to C++17. `absl::optional` is now an alias for `std::optional`. It is recommended that clients simply use `std::optional`.
> Move hashtable control bytes manipulation to a separate file.
> Fix a use-after-free bug in which the string passed to `AtLocation` may be referenced after it is destroyed. While the string does live until the end of the full statement, logging (previously occurred) in the destructor of the `LogMessage` which may be constructed before the temporary string (and thus destroyed after the temporary string's destructor).
> `internal/layout`: Delete pre-C++17 out of line definition of constexpr class member
> Extract slow path for PrepareInsertNonSoo to a separate function `PrepareInsertNonSooSlow`.
> Minor code cleanups
> `internal/log_message`: Use `if constexpr` instead of SFINAE for `operator<<`
> [absl] Use `std::min` in `constexpr` contexts in `absl::string_view`
> Remove the implementation of `absl::any`, which was only needed prior to C++17. `absl::any` is now an alias for `std::any`. It is recommended that clients simply use `std::any`.
> Remove ABSL_INTERNAL_NEED_REDUNDANT_CONSTEXPR_DECL which is longer needed with the C++17 floor
> Make `OptimalMemcpySizeForSooSlotTransfer` ready to work with MaxSooSlotSize upto `3*sizeof(size_t)`.
> `internal/layout`: Replace SFINAE with `if constexpr`
> PR #1830: C++17 improvement: use if constexpr in internal/hash.h
> `absl`: Deprecate `ABSL_HAVE_CLASS_TEMPLATE_ARGUMENT_DEDUCTION`
> Add a verification for access of being destroyed table. Also enabled access after destroy check in ASAN optimized mode.
> Store `CharAlloc` in SwissTable in order to simplify type erasure of functions accepting allocator as `void*`.
> Introduce and use `SetCtrlInLargeTable`, when we know that table is at least one group. Similarly to `SetCtrlInSingleGroupTable`, we can save some operations.
> Make raw_hash_set::slot_type private.
> Delete absl/utility/internal/if_constexpr.h
> `internal/any_invocable`: Use `if constexpr` instead of SFINAE when initializing storage accessor
> Depend on string_view directly
> Optimize and slightly simplify `PrepareInsertNonSoo`.
> PR #1833: Make ABSL_INTERNAL_STEP_n macros consistent in crc code
> `internal/any_invocable`: Use alias `RawT` consistently in `InitializeStorage`
> Move the implementation of absl::ComputeCrc32c to the header file, to facilitate inlining.
> Delete absl/base/internal/inline_variable.h
> Add lifetimebound to absl::StripAsciiWhitespace
> Revert: Random: Use target attribute instead of -march
> Add return for opt mode in AssertNotDebugCapacity to make sure that code is not evaluated in opt mode.
> `internal/any_invocable`: Delete TODO, improve comment and simplify pragma in constructor
> Split resizing routines and type erase similar instructions.
> Random: Use target attribute instead of -march
> `internal/any_invocable`: Use `std::launder` unconditionally
> `internal/any_invocable`: Remove suppresion of false positive -Wmaybe-uninitialized on GCC 12
> Fix feature test for ABSL_HAVE_STD_OPTIONAL
> Support C++20 iterators in raw_hash_map's random-access iterator detection
> Fix mis-located test dependency
> Disable the DestroyedCallsFail test on GCC due to flakiness.
> `internal/any_invocable`: Implement invocation using `if constexpr` instead of SFINAE
> PR #1835: Bump deployment_target version and add visionos to podspec
> PR #1828: Fix spelling of pseudorandom in README.md
> Make raw_hash_map::key_arg private.
> `overload`: Delete obsolete macros for undefining `absl::Overload` when C++ < 17
> `absl/base`: Delete `internal/invoke.h` and `invoke_test.cc`
> Remove `WORKSPACE.bazel`
> `absl`: Replace `base_internal::{invoke,invoke_result_t,is_invocable_r}` with `std` equivalents
> Allow C++20 forward iterators to use fast paths
> Factor out some iterator traits detection code
> Type erase IterateOverFullSlots to decrease code size.
> `any_invocable`: Delete pre-C++17 workarounds for `noexcept` and guaranteed copy elision
> Make raw_hash_set::key_arg private.
> Rename nullability macros to use new lowercase spelling.
> Fix bug where ABSL_REQUIRE_EXPLICIT_INIT did not actually result in a linker error
> Make Randen benchmark program use runtime CPU detection.
> Add CI for the C++20/Clang/libstdc++ combination
> Move Abseil to GoogleTest 1.16.0
> `internal/any_invocable`: Use `if constexpr` instead of SFINAE in `InitializeStorage`
> More type-erasing of InitializeSlots by removing the Alloc and AlignOfSlot template parameters.
> Actually use the hint space instruction to strip PAC bits for return addresses in stack traces as the comment says
> `log/internal`: Replace `..._ATTRIBUTE_UNUSED_IF_STRIP_LOG` with C++17 `[[maybe_unused]]`
> `attributes`: Document `ABSL_ATTRIBUTE_UNUSED` as deprecated
> `internal/any_invocable`: Initialize using `if constexpr` instead of ternary operator, enum, and templates
> Fix flaky tests due to sampling by introducing utility to refresh sampling counters for the current thread.
> Minor reformatting in raw_hash_set: - Add a clear_backing_array member to declutter calls to ClearBackingArray. - Remove some unnecessary `inline` keywords on functions. - Make PoisonSingleGroupEmptySlots static.
> Update CI for linux_gcc-floor to use GCC9, Bazel 7.5, and CMake 3.31.5.
> `internal/any_invocable`: Rewrite `IsStoredLocally` type trait into a simpler constexpr function
> Add ABSL_REQUIRE_EXPLICIT_INIT to Abseil to enable enforcing explicit field initializations
> Require C++17
> Minimize number of `InitializeSlots` with respect to SizeOfSlot.
> Leave the call to `SampleSlow` only in type erased InitializeSlots.
> Update comments for Read4To8 and Read1To3.
> PR #1819: fix compilation with AppleClang
> Move SOO processing inside of InitializeSlots and move it once.
> PR #1816: Random: use getauxval() via <sys/auxv.h>
> Optimize `InitControlBytesAfterSoo` to have less writes and make them with compile time known size.
> Remove stray plus operator in cleanup_internal::Storage
> Include <cerrno> to fix compilation error in chromium build.
> Adjust internal logging namespacing for consistency s/ABSL_LOGGING_INTERNAL_/ABSL_LOG_INTERNAL_/
> Rewrite LOG_EVERY_N (et al) docs to clarify that the first instance is logged. Also, deliberately avoid giving exact numbers or examples since IRL behavior is not so exact.
> ABSL_ASSUME: Use a ternary operator instead of do-while in the implementations that use a branch marked unreachable so that it is usable in more contexts.
> Simplify the comment for raw_hash_set::erase.
> Remove preprocessors for now unsupported compilers.
> `absl::ScopedMockLog`: Explicitly document that it captures logs emitted by all threads
> Fix potential integer overflow in hash container create/resize
> Add lifetimebound to StripPrefix/StripSuffix.
> Random: Rollforward support runtime dispatch on AArch64 macOS
> Crc: Only test non_temporal_store_memcpy_avx on AVX targets
> Provide information about types of all flags.
> Deprecate the precomputed hash find() API in swisstable.
> Import of CCTZ from GitHub.
> Adjust whitespace
> Expand documentation for absl::raw_hash_set::erase to include idiom example of iterator post-increment.
> Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
> Crc: Remove the __builtin_cpu_supports path for SupportsArmCRC32PMULL
> Use absl::NoDestructor for some absl::Mutex instances in the flags library to prevent some exit-time destructor warnings
> Update the WORKSPACE dependency of rules_cc to 0.1.0
> Rollback support runtime dispatch on AArch64 macOS for breaking some builds
> Downgrade to rules_cc 0.0.17 because 0.1.0 was yanked
> Use unused set in testing.
> Random: Support runtime dispatch on AArch64 macOS
> crc: Use absl::nullopt when returning absl::optional
> Annotate absl::FixedArray to warn when unused.
> PR #1806: Fix undefined symbol: __android_log_write
> Move ABSL_HAVE_PTHREAD_CPU_NUMBER_NP to the file where it is needed
> Use rbit instruction on ARM rather than rev.
> Debugging: Report the CPU we are running on under Darwin
> Add a microbenchmark for very long int/string tuples.
> Crc: Detect support for pmull and crc instructions on Apple AArch64 With a newer clang, we can use __builtin_cpu_supports which caches all the feature bits.
> Add special handling for hashing integral types so that we can optimize Read1To3 and Read4To8 for the strings case.
> Use unused FixedArray instances.
> Minor reformatting
> Avoid flaky expectation in WaitDurationWoken test case in MSVC.
> Use Bazel rules_cc for many compiler-specific rules instead of our custom ones from before the Bazel rules existed.
> Mix pointers twice in absl::Hash.
> New internal-use-only classes `AsStructuredLiteralImpl` and `AsStructuredValueImpl`
> Annotate some Abseil container methods with [[clang::lifetime_capture_by(...)]]
> Faster copy from inline Cords to inline Strings
> Add new benchmark cases for hashing string lengths 1,2,4,8.
> Move the Arm implementation of UnscaledCycleClock::Now() into the header file, like the x86 implementation, so it can be more easily inlined.
> Minor include cleanup in absl/random/internal
> Import of CCTZ from GitHub.
> Use Bazel Platforms to support AES-NI compile options for Randen
> In HashState::Create, require that T is a subclass of HashStateBase in order to discourage users from defining their own HashState types.
> PR #1801: Remove unncessary <iostream> includes
> New class StructuredProtoField
> Mix pointers twice in TSan and MSVC to avoid flakes in the PointerAlignment test.
> Add a test case that type-erased absl::HashState is consistent with absl::HashOf.
> Mix pointers twice in build modes in which the PointerAlignment test is flaky if we mix once.
> Increase threshold for stuck bits in PointerAlignment test on android.
> Use hashing ideas from Carbon's hashtable in absl hashing: - Use byte swap instead of mixing pointers twice. - Change order of branches to check for len<=8 first. - In len<=16 case, do one multiply to mix the data instead of using the logic from go/absl-hash-rl (reinforcement learning was used to optimize the instruction sequence). - Add special handling for len<=32 cases in 64-bit architectures.
> Test that using a table that was moved-to from a moved-from table fails in sanitizer mode.
> Remove a trailing comma causing an issue for an OSS user
> Add missing includes in hash.h.
> Use the public implementation rule for "@bazel_tools//tools/cpp:clang-cl"
> Import of CCTZ from GitHub.
> Change the definition of is_trivially_relocatable to be a bit less conservative.
> Updates to CI to support newer versions of tools
> Check if ABSL_HAVE_INTRINSIC_INT128 is defined
> Print hash expansions in the hash_testing error messages.
> Avoid flakiness in notification_test on MSVC.
> Roll back: Add more debug capacity validation checks on moves.
> Add more debug capacity validation checks on moves.
> Add macro versions of nullability annotations.
> Improve fork-safety by opening files with `O_CLOEXEC`.
> Move ABSL_HARDENING_ASSERTs in constexpr methods to their own lines.
> Add test cases for absl::Hash: - That hashes are consistent for the same int value across different int types. - That hashes of vectors of strings are unequal even when their concatenations are equal. - That FragmentedCord hashes works as intended for small Cords.
> Skip the IterationOrderChangesOnRehash test case in ASan mode because it's flaky.
> Add missing includes in absl hash.
> Try to use file descriptors in the 2000+ range to avoid mis-behaving client interference.
> Add weak implementation of the __lsan_is_turned_off in Leak Checker
> Fix a bug where EOF resulted in infinite loop.
> static_assert that absl::Time and absl::Duration are trivially destructible.
> Move Duration ToInt64<unit> functions to be inline.
> string_view: Add defaulted copy constructor and assignment
> Use `#ifdef` to avoid errors when `-Wundef` is used.
> Strip PAC bits for return addresses in stack traces
> PR #1794: Update cpu_detect.cc fix hw crc32 and AES capability check, fix undefined
> PR #1790: Respect the allocator's .destroy method in ~InlinedVector
> Cast away nullability in the guts of CHECK_EQ (et al) where Clang doesn't see that the nullable string returned by Check_EQImpl is statically nonnull inside the loop.
> string_view: Correct string_view(const char*, size_type) docs
> Add support for std::string_view in StrCat even when absl::string_view != std::string_view.
> Misc. adjustments to unit tests for logging.
> Use local_config_cc from rules_cc and make it a dev dependency
> Add additional iteration order tests with reservation. Reserved tables have a different way of iteration randomization compared to gradually resized tables (at least for small tables).
> Use all the bits (`popcount`) in `FindFirstNonFullAfterResize` and `PrepareInsertAfterSoo`.
> Mark ConsumePrefix, ConsumeSuffix, StripPrefix, and StripSuffix as constexpr since they are all pure functions.
> PR #1789: Add missing #ifdef pp directive to the TypeName() function in the layout.h
> PR #1788: Fix warning for sign-conversion on riscv
> Make StartsWith and EndsWith constexpr.
> Simplify logic for growing single group table.
> Document that absl::Time and absl::Duration are trivially destructible.
> Change some C-arrays to std::array as this enables bounds checking in some hardened standard library builds
> Replace outdated select() on --cpu with platform API equivalent.
> Take failure_message as const char* instead of string_view in LogMessageFatal and friends.
> Mention `c_any_of` in the function comment of `absl::c_linear_search`.
> Import of CCTZ from GitHub.
> Rewrite some string_view methods to avoid a -Wunreachable-code warning
> IWYU: Update includes and fix minor spelling mistakes.
> Add comment on how to get next element after using erase.
> Add ABSL_ATTRIBUTE_LIFETIME_BOUND and a doc note about absl::LogAsLiteral to clarify its intended use.
> Import of CCTZ from GitHub.
> Reduce memory consumption of structured logging proto encoding by passing tag value
> Remove usage of _LIBCPP_HAS_NO_FILESYSTEM_LIBRARY.
> Make Span's relational operators constexpr since C++20.
> distributions: support a zero max value in Zipf.
> PR #1786: Fix typo in test case.
> absl/random: run clang-format.
> Add some nullability annotations in logging and tidy up some NOLINTs and comments.
> CMake: Change the default for ABSL_PROPAGATE_CXX_STD to ON
> Delete UnvalidatedMockingBitGen
> PR #1783: [riscv][debugging] Fix a few warnings in RISC-V inlines
> Add conversion operator to std::array for StrSplit.
> Add a comment explaining the extra comparison in raw_hash_set::operator==. Also add a small optimization to avoid the extra comparison in sets that use hash_default_eq as the key_equal functor.
> Add benchmark for absl::HexStringToBytes
> Avoid installing options.h with the other headers
> Add ABSL_ATTRIBUTE_LIFETIME_BOUND to absl::Span constructors.
> Annotate absl::InlinedVector to warn when unused.
> Make `c_find_first_of`'s `options` parameter a const reference to allow temporaries.
> Disable Elf symbols for Xtensa
> PR #1775: Support symbolize only on WINAPI_PARTITION_DESKTOP
> Require through an internal presubmit that .h|.cc|.inc files contain either the string ABSL_NAMESPACE_BEGIN or SKIP_ABSL_INLINE_NAMESPACE_CHECK
> Xtensa supports mmap, enable it in absl/base/config.h
> PR #1777: Avoid std::ldexp in `operator double(int128)`.
> Marks absl::Span as view and borrowed_range, like std::span.
> Mark inline functions with only a simple comparison in strings/ascii.h as constexpr.
> Add missing Abseil inline namespace and fix includes
> Fix bug where the high bits of `__int128_t`/`__uint128_t` might go unused in the hash function. This fix increases the hash quality of these types.
> Add a test to verify bit casting between signed and unsigned int128 works as expected
> Add suggestions to enable sanitizers for asserts when doing so may be helpful.
> Add nullability attributes to nullability type aliases.
> Refactor swisstable moves.
> Improve ABSL_ASSERT performance by guaranteeing it is optimized away under NDEBUG in C++20
> Mark Abseil hardening assert in AssertSameContainer as slow.
> Add workaround for q++ 8.3.0 (QNX 7.1) compiler by making sure MaskedPointer is trivially copyable and copy constructible.
> Small Mutex::Unlock optimization
> Optimize `CEscape` and `CEscapeAndAppend` by up to 40%.
> Fix the conditional compilation of non_temporal_store_memcpy_avx to verify that AVX can be forced via `gnu::target`.
> Delete TODOs to move functors when moving hashtables and add a test that fails when we do so.
> Fix benchmarks in `escaping_benchmark.cc` by properly calling `benchmark::DoNotOptimize` on both inputs and outputs and by removing the unnecessary and wrong `ABSL_RAW_CHECK` condition (`check != 0`) of `BM_ByteStringFromAscii_Fail` benchmark.
> It seems like commit abc9b916a94ebbf251f0934048295a07ecdbf32a did not work as intended.
> Fix a bug in `absl::SetVLogLevel` where a less generic pattern incorrectly removed a more generic one.
> Remove the side effects between tests in vlog_is_on_test.cc
> Attempt to fix flaky Abseil waiter/sleep tests
> Add an explicit tag for non-SOO CommonFields (removing default ctor) and add a small optimization for early return in AssertNotDebugCapacity.
> Make moved-from swisstables behave the same as empty tables. Note that we may change this in the future.
> Tag tests that currently fail on darwin_arm64 with "no_test_darwin_arm64"
> add gmock to cmake defs for no_destructor_test
> Optimize raw_hash_set moves by allowing some members of CommonFields to be uninitialized when moved-from.
> Add more debug capacity validation checks on iteration/size.
> Add more debug capacity validation checks on copies.
> constinit -> constexpr for DisplayUnits
> LSC: Fix null safety issues diagnosed by Clang’s `-Wnonnull` and `-Wnullability`.
> Remove the extraneous variable creation in Match().
> Import of CCTZ from GitHub.
> Add more debug capacity validation checks on merge/swap.
> Add `absl::` namespace to c_linear_search implementation in order to avoid ADL
> Distinguish the debug message for the case of self-move-assigned swiss tables.
> Update LowLevelHash comment regarding number of hash state variables.
> Add an example for the `--vmodule` flag.
> Remove first prefetch.
> Add moved-from validation for the case of self-move-assignment.
> Allow slow and fast abseil hardening checks to be enabled independently.
> Update `ABSL_RETIRED_FLAG` comment to reflect `default_value` is no longer used.
> Add validation against use of moved-from hash tables.
> Provide file-scoped pragma behind macro ABSL_POINTERS_DEFAULT_NONNULL to indicate the default nullability. This is a no-op for now (not understood by checkers), but does communicate intention to human readers.
> Add stacktrace config for android using the generic implementation
> Fix nullability annotations in ABSL code.
> Replace CHECKs with ASSERTs and EXPECTs -- no reason to crash on failure.
> Remove ABSL_INTERNAL_ATTRIBUTE_OWNER and ABSL_INTERNAL_ATTRIBUTE_VIEW
> Migrate ABSL_INTERNAL_ATTRIBUTE_OWNER and ABSL_INTERNAL_ATTRIBUTE_VIEW to ABSL_ATTRIBUTE_OWNER and ABSL_ATTRIBUTE_VIEW
> Disable ABSL_ATTRIBUTE_OWNER and ABSL_ATTRIBUTE_VIEW prior to Clang-13 due to false positives.
> Make ABSL_ATTRIBUTE_VIEW and ABSL_ATTRIBUTE_OWNER public
> Optimize raw_hash_set::AssertHashEqConsistent a bit to avoid having as much runtime overhead.
> PR #1728: Workaround broken compilation against NDK r25
> Add validation against use of destroyed hash tables.
> Do not truncate `ABSL_RAW_LOG` output at null bytes
> Use several unused cord instances in tests and benchmarks.
> Add comments about ThreadIdentity struct allocation behavior.
> Refactoring followup for reentrancy validation in swisstable.
> Add debug mode checks that element constructors/destructors don't make reentrant calls to raw_hash_set member functions.
> Add tagging for cc_tests that are incompatible with Fuchsia
> Add GetTID() implementation for Fuchsia
> PR #1738: Fix shell option group handling in pkgconfig files
> Disable weak attribute when absl compiled as windows DLL
> Remove `CharIterator::operator->`.
> Mark non-modifying container algorithms as constexpr for C++20.
> PR #1739: container/internal: Explicitly include <cstdint>
> Don't match -Wnon-virtual-dtor in the "flags are needed to suppress warnings in headers". It should fall through to the "don't impose our warnings on others" case. Do this by matching on "-Wno-*" instead of "-Wno*".
> PR #1732: Fix build on NVIDIA Jetson board. Fix#1665
> Update GoogleTest dependency to 1.15.2
> Enable AsciiStrToLower and AsciiStrToUpper overloads for rvalue references.
> PR #1735: Avoid `int` to `bool` conversion warning
> Add `absl::swap` functions for `*_hash_*` to avoid calling `std::swap`
> Change internal visibility
> Remove resolved issue.
> Increase test timeouts to support running on Fuchsia emulators
> Add tracing annotations to absl::Notification
> Suppress compiler optimizations which may break container poisoning.
> Disable ABSL_INTERNAL_HAVE_DEBUGGING_STACK_CONSUMPTION for Fuchsia
> Add tracing annotations to absl::BlockingCounter
> Add absl_vlog_is_on and vlog_is_on to ABSL_INTERNAL_DLL_TARGETS
> Update swisstable swap API comments to no longer guarantee that we don't move/swap individual elements.
> PR #1726: cmake: Fix RUNPATH when using BUILD_WITH_INSTALL_RPATH=True
> Avoid unnecessary copying when upper-casing or lower-casing ASCII string_view
> Add weak internal tracing API
> Fix LINT.IfChange syntax
> PR #1720: Fix spelling mistake: occurrance -> occurrence
> Add missing include for Windows ASAN configuration in poison.cc
> Delete absl/strings/internal/has_absl_stringify.h now that the GoogleTest version we depend on uses the public file
> Update versions of dependencies in preparation for release
> PR #1699: Add option to build with MSVC static runtime
> Remove unneeded 'be' from comment.
> PR #1715: Generate options.h using CMake only once
> Small type fix in absl/log/internal/log_impl.h
> PR #1709: Handle RPATH CMake configuration
> PR #1710: fixup! PR #1707: Fixup absl_random compile breakage in Apple ARM64 targets
> PR #1695: Fix time library build for Apple platforms
> Remove cyclic cmake dependency that breaks in cmake 3.30.0
> Roll forward poisoned pointer API and fix portability issues.
> Use GetStatus in IsOkAndHoldsMatcher
> PR #1707: Fixup absl_random compile breakage in Apple ARM64 targets
> PR #1706: Require CMake version 3.16
> Add an MSVC implementation of ABSL_ATTRIBUTE_LIFETIME_BOUND
> Mark c_min_element, c_max_element, and c_minmax_element as constexpr in C++17.
> Optimize the absl::GetFlag cost for most non built-in flag types (including string).
> Encode some additional metadata when writing protobuf-encoded logs.
> Replace signed integer overflow, since that's undefined behavior, with unsigned integer overflow.
> Make mutable CompressedTuple::get() constexpr.
> vdso_support: support DT_GNU_HASH
> Make c_begin, c_end, and c_distance conditionally constexpr.
> Add operator<=> comparison to absl::Time and absl::Duration.
> Deprecate `ABSL_ATTRIBUTE_NORETURN` in favor of the `[[noreturn]]` standardized in C++11
> Rollback new poisoned pointer API
> Static cast instead of reinterpret cast raw hash set slots as casting from void* to T* is well defined
> Fix absl::NoDestructor documentation about its use as a global
> Declare Rust demangling feature-complete.
> Split demangle_internal into a tree of smaller libraries.
> Decode Rust Punycode when it's not too long.
> Add assertions to detect reentrance in `IterateOverFullSlots` and `absl::erase_if`.
> Decoder for Rust-style Punycode encodings of bounded length.
> Add `c_contains()` and `c_contains_subrange()` to `absl/algorithm/container.h`.
> Three-way comparison spaceship <=> operators for Cord.
> internal-only change
> Remove erroneous preprocessor branch on SGX_SIM.
> Add an internal API to get a poisoned pointer.
> optimization.h: Add missing <utility> header for C++
> Add a compile test for headers that require C compatibility
> Fix comment typo
> Expand documentation for SetGlobalVLogLevel and SetVLogLevel.
> Roll back 6f972e239f668fa29cab43d7968692cd285997a9
> PR #1692: Add missing `<utility>` include
> Remove NOLINT for `#include <new>` for __cpp_lib_launder
> Remove not used after all kAllowRemoveReentrance parameter from IterateOverFullSlots.
> Create `absl::container_internal::c_for_each_fast` for SwissTable.
> Disable flaky test cases in kernel_timeout_internal_test.
> Document that swisstable and b-tree containers are not exception-safe.
> Add `ABSL_NULLABILITY_COMPATIBLE` attribute.
> LSC: Move expensive variables on their last use to avoid copies.
> Add ABSL_INTERNAL_ATTRIBUTE_VIEW and ABSL_INTERNAL_ATTRIBUTE_OWNER attributes to more types in Abseil
> Drop std:: qualification from integer types like uint64_t.
> Increase slop time on MSVC in PerThreadSemTest.Timeouts again due to continued flakiness.
> Turn on validation for out of bounds MockUniform in MockingBitGen
> Use ABSL_UNREACHABLE() instead of equivalent
> If so configured, report which part of a C++ mangled name didn't parse.
> Sequence of 1-to-4 values with prefix sum to support Punycode decoding.
> Add the missing inline namespace to the nullability files
> Add ABSL_INTERNAL_ATTRIBUTE_VIEW and ABSL_INTERNAL_ATTRIBUTE_OWNER attributes to types in Abseil
> Disallow reentrance removal in `absl::erase_if`.
> Fix implicit conversion of temporary bitgen to BitGenRef
> Use `IterateOverFullSlots` in `absl::erase_if` for hash table.
> UTF-8 encoding library to support Rust Punycode decoding.
> Disable negative NaN float ostream format checking on RISC-V
> PR #1689: Minor: Add missing quotes in CMake string view library definition
> Demangle template parameter object names, TA <template-arg>.
> Demangle sr St <simple-id> <simple-id>, a dubious encoding found in the wild.
> Try not to lose easy type combinators in S::operator const int*() and the like.
> Demangle fixed-width floating-point types, DF....
> Demangle _BitInt types DB..., DU....
> Demangle complex floating-point literals.
> Demangle <extended-qualifier> in types, e.g., U5AS128 for address_space(128).
> Demangle operator co_await (aw).
> Demangle fully general vendor extended types (any <template-args>).
> Demangle transaction-safety notations GTt and Dx.
> Demangle C++11 user-defined literal operator functions.
> Demangle C++20 constrained friend names, F (<source-name> | <operator-name>).
> Demangle dependent GNU vector extension types, Dv <expression> _ <type>.
> Demangle elaborated type names, (Ts | Tu | Te) <name>.
> Add validation that hash/eq functors are consistent, meaning that `eq(k1, k2) -> hash(k1) == hash(k2)`.
> Demangle delete-expressions with the global-scope operator, gs (dl | da) ....
> Demangle new-expressions with braced-init-lists.
> Demangle array new-expressions, [gs] na ....
> Demangle object new-expressions, [gs] nw ....
> Demangle preincrement and predecrement, pp_... and mm_....
> Demangle throw and rethrow (tw... and tr).
> Remove redundant check of is_soo() while prefetching heap blocks.
> Demangle ti... and te... expressions (typeid).
> Demangle nx... syntax for noexcept(e) as an expression in a dependent signature.
> Demangle alignof expressions, at... and az....
> Demangle C++17 structured bindings, DC...E.
> Demangle modern _ZGR..._ symbols.
> Remove redundant check of is_soo() while prefetching heap blocks.
> Demangle sizeof...(pack captured from an alias template), sP ... E.
> Demangle types nested under vendor extended types.
> Demangle il ... E syntax (braced list other than direct-list-initialization).
> Avoid signed overflow for Ed <number> _ manglings with large <number>s.
> Remove redundant check of is_soo() while prefetching heap blocks.
> Remove obsolete TODO
> Clarify function comment for `erase` by stating that this idiom only works for "some" standard containers.
> Move SOVERSION to global CMakeLists, apply SOVERSION to DLL
> Set ABSL_HAVE_THREAD_LOCAL to 1 on all platforms
> Demangle constrained auto types (Dk <type-constraint>).
> Parse <discriminator> more accurately.
> Demangle lambdas in class member functions' default arguments.
> Demangle unofficial <unresolved-qualifier-level> encodings like S0_IT_E.
> Do not make std::filesystem::path hash available for macOS <10.15
> Include flags in DLL build (non-Windows only)
> Enable building monolithic shared library on macOS and Linux.
> Demangle Clang's last-resort notation _SUBSTPACK_.
> Demangle C++ requires-expressions with parameters (rQ ... E).
> Demangle Clang's encoding of __attribute__((enable_if(condition, "message"))).
> Demangle static_cast and friends.
> Demangle decltype(expr)::nested_type (NDT...E).
> Optimize GrowIntoSingleGroupShuffleControlBytes.
> Demangle C++17 fold-expressions.
> Demangle thread_local helper functions.
> Demangle lambdas with explicit template arguments (UlTy and similar forms).
> Demangle &-qualified function types.
> Demangle valueless literals LDnE (nullptr) and LA<number>_<type>E ("foo").
> Correctly demangle the <unresolved-name> at the end of dt and pt (x.y, x->y).
> Add missing targets to ABSL_INTERNAL_DLL_TARGETS
> Build abseil_test_dll with ABSL_BUILD_TESTING
> Demangle C++ requires-expressions without parameters (rq ... E).
> overload: make the constructor constexpr
> Update Abseil CI Docker image to use Clang 19, GCC 14, and CMake 3.29.3
> Workaround symbol resolution bug in Clang 19
> Workaround bogus GCC14 -Wmaybe-uninitialized warning
> Silence a bogus GCC14 -Warray-bounds warning
> Forbid absl::Uniform<absl::int128>(gen)
> Use IN_LIST to replace list(FIND) + > -1
> Recognize C++ vendor extended expressions (e.g., u9__is_same...E).
> `overload_test`: Remove a few unnecessary trailing return types
> Demangle the C++ this pointer (fpT).
> Stop eating an extra E in ParseTemplateArg for some L<type><value>E literals.
> Add ABSL_INTERNAL_ATTRIBUTE_VIEW and ABSL_INTERNAL_ATTRIBUTE_OWNER attributes to Abseil.
> Demangle C++ direct-list-initialization (T{1, 2, 3}, tl ... E).
> Demangle the C++ spaceship operator (ss, operator<=>).
> Demangle C++ sZ encodings (sizeof...(pack)).
> Demangle C++ so ... E encodings (typically array-to-pointer decay).
> Recognize dyn-trait-type in Rust demangling.
> Rework casting in raw_hash_set's IsFull().
> Remove test references to absl::SharedBitGen, which was never part of the open source release. This was only used in tests that never ran as part in the open source release.
> Recognize fn-type and lifetimes in Rust demangling.
> Support int128/uint128 in validated MockingBitGen
> Recognize inherent-impl and trait-impl in Rust demangling.
> Recognize const and array-type in Rust mangled names.
> Remove Asylo from absl.
> Recognize generic arguments containing only types in Rust mangled names.
> Fix missing #include <random> for std::uniform_int_distribution
> Move `prepare_insert` out of the line as type erased `PrepareInsertNonSoo`.
> Revert: Add -Wdead-code-aggressive to ABSL_LLVM_FLAGS
> Add (unused) validation to absl::MockingBitGen
> Support `AbslStringify` with `DCHECK_EQ`.
> PR #1672: Optimize StrJoin with tuple without user defined formatter
> Give ReturnAddresses and N<uppercase> namespaces separate stacks for clarity.
> Demangle Rust backrefs.
> Use Nt for struct and trait names in Rust demangler test inputs.
> Allow __cxa_demangle on MIPS
> Add a `string_view` overload to `absl::StrJoin`
> Demangle Rust's Y<type><path> production for passably simple <type>s.
> `convert_test`: Delete obsolete condition around ASSERT_EQ in TestWithMultipleFormatsHelper
> `any_invocable`: Clean up #includes
> Resynchronize absl/functional/CMakeLists.txt with BUILD.bazel
> `any_invocable`: Add public documentation for undefined behavior when invoking an empty AnyInvocable
> `any_invocable`: Delete obsolete reference to proposed standard type
> PR #1662: Replace shift with addition in crc multiply
> Doc fix.
> `convert_test`: Extract loop over tested floats from helper function
> Recognize some simple Rust mangled names in Demangle.
> Use __builtin_ctzg and __builtin_clzg in the implementations of CountTrailingZeroesNonzero16 and CountLeadingZeroes16 when they are available.
> Remove the forked absl::Status matchers implementation in statusor_test
> Add comment hack to fix copybara reversibility
> Add GoogleTest matchers for absl::Status
> [random] LogUniform: Document as a discrete distribution
> Enable Cord tests with Crc.
> Fix order of qualifiers in `absl::AnyInvocable` documentation.
> Guard against null pointer dereference in DumpNode.
> Apply ABSL_MUST_USE_RESULT to try lock functions.
> Add public aliases for default hash/eq types in hash-based containers
> Import of CCTZ from GitHub.
> Remove the hand-rolled CordLeaker and replace with absl::NoDestructor to test the after-exit behavior
> `convert_test`: Delete obsolete `skip_verify` parameter in test helper
> overload: allow using the underlying type with CTAD directly.
> PR #1653: Remove unnecessary casts when calling CRC32_u64
> PR #1652: Avoid C++23 deprecation warnings from float_denorm_style
> Minor cleanup for `absl::Cord`
> PR #1651: Implement ABSL_INTERNAL_DISABLE_DEPRECATED_DECLARATION_WARNING for MSVC compiler
> Add `operator<=>` support to `absl::int128` and `absl::uint128`
> [absl] Re-use the existing `std::type_identity` backfill instead of redefining it again
> Add `absl::AppendCordToString`
> `str_format/convert_test`: Delete workaround for [glibc bug](https://sourceware.org/bugzilla/show_bug.cgi?id=22142)
> `absl/log/internal`: Document conditional ABSL_ATTRIBUTE_UNUSED, add C++17 TODO
> `log/internal/check_op`: Add ABSL_ATTRIBUTE_UNUSED to CHECK macros when STRIP_LOG is enabled
> log_benchmark: Add VLOG_IS_ON benchmark
> Restore string_view detection check
> Remove an unnecessary ABSL_ATTRIBUTE_UNUSED from a logging macro
< Abseil LTS Branch, Jan 2024, Patch 2 (#1650)
> In example code, add missing template parameter.
> Optimize crc32 V128_From2x64 on Arm
> Annotate that Mutex should warn when unused.
> Add ABSL_ATTRIBUTE_LIFETIME_BOUND to Cord::Flatten/TryFlat
> Deprecate `absl::exchange`, `absl::forward` and `absl::move`, which were only useful before C++14.
> Temporarily revert dangling std::string_view detection until dependent is fixed
> Use _decimal_ literals for the CivilDay example.
> Fix bug in BM_EraseIf.
> Add internal traits to absl::string_view for lifetimebound detection
> Add internal traits to absl::StatusOr for lifetimebound detection
> Add internal traits to absl::Span for lifetimebound detection
> Add missing dependency for log test build target
> Add internal traits for lifetimebound detection
> Use local decoding buffer in HexStringToBytes
> Only check if the frame pointer is inside a signal stack with known bounds
> Roll forward: enable small object optimization in swisstable.
> Optimize LowLevelHash by breaking dependency between final loads and previous len/ptr updates.
> Fix the wrong link.
> Optimize InsertMiss for tables without kDeleted slots.
> Use GrowthInfo without applying any optimizations based on it.
> Disable small object optimization while debugging some failing tests.
> Adjust conditonal compilation in non_temporal_memcpy.h
> Reformat log/internal/BUILD
> Remove deprecated errno constants from the absl::Status mapping
> Introduce GrowthInfo with tests, but without usage.
> Enable small object optimization in swisstable.
> Refactor the GCC unintialized memory warning suppression in raw_hash_set.h.
> Respect `NDEBUG_SANITIZER`
> Revert integer-to-string conversion optimizations pending more thorough analysis
> Fix a bug in `Cord::{Append,Prepend}(CordBuffer)`: call `MaybeRemoveEmptyCrcNode()`. Otherwise appending a `CordBuffer` an empty Cord with a CRC node crashes (`RemoveCrcNode()` which increases the refcount of a nullptr child).
> Add `BM_EraseIf` benchmark.
> Record sizeof(key_type), sizeof(value_type) in hashtable profiles.
> Fix ClangTidy warnings in btree.h.
> LSC: Move expensive variables on their last use to avoid copies.
> PR #1644: unscaledcycleclock: remove RISC-V support
> Reland: Make DLOG(FATAL) not understood as [[noreturn]]
> Separate out absl::StatusOr constraints into statusor_internal.h
> Use Layout::WithStaticSizes in btree.
> `layout`: Delete outdated comments about ElementType alias not being used because of MSVC
> Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
> `layout_benchmark`: Replace leftover comment with intended call to MyAlign
> Remove absl::aligned_storage_t
> Delete ABSL_ANNOTATE_MEMORY_IS_INITIALIZED under Thread Sanitizer
> Remove vestigial variables in the DumpNode() helper in absl::Cord
> Do hashtablez sampling on the first insertion into an empty SOO hashtable.
> Add explicit #include directives for <tuple>, "absl/base/config.h", and "absl/strings/string_view.h".
> Add a note about the cost of `VLOG` in non-debug builds.
> Fix flaky test failures on MSVC.
> Add template keyword to example comment for Layout::WithStaticSizes.
> PR #1643: add xcprivacy to all subspecs
> Record sampling stride in cord profiling to facilitate unsampling.
> Fix a typo in a comment.
> [log] Correct SetVLOGLevel to SetVLogLevel in comments
> Add a feature to container_internal::Layout that lets you specify some array sizes at compile-time as template parameters. This can make offset and size calculations faster.
> `layout`: Mark parameter of Slices with ABSL_ATTRIBUTE_UNUSED, remove old workaround
> `layout`: Use auto return type for functions that explicitly instantiate std::tuple in return statements
> Remove redundant semicolons introduced by macros
> [log] Make :vlog_is_on/:absl_vlog_is_on public in BUILD.bazel
> Add additional checks for size_t overflows
> Replace //visibility:private with :__pkg__ for certain targets
> PR #1603: Disable -Wnon-virtual-dtor warning for CommandLineFlag implementations
> Add several missing includes in crc/internal
> Roll back extern template instatiations in swisstable due to binary size increases in shared libraries.
> Add nodiscard to SpinLockHolder.
> Test that rehash(0) reduces capacity to minimum.
> Add extern templates for common swisstable types.
> Disable ubsan for benign unaligned access in crc_memcpy
> Make swisstable SOO support GDB pretty printing and still compile in OSS.
> Fix OSX support with CocoaPods and Xcode 15
> Fix GCC7 C++17 build
> Use UnixEpoch and ZeroDuration
> Make flaky failures much less likely in BasicMocking.MocksNotTriggeredForIncorrectTypes test.
> Delete a stray comment
> Move GCC uninitialized memory warning suppression into MaybeInitializedPtr.
> Replace usages of absl::move, absl::forward, and absl::exchange with their std:: equivalents
> Fix the move to itself
> Work around an implicit conversion signedness compiler warning
> Avoid MSan: use-of-uninitialized-value error in find_non_soo.
> Fix flaky MSVC test failures by using longer slop time.
> Add ABSL_ATTRIBUTE_UNUSED to variables used in an ABSL_ASSUME.
> Implement small object optimization in swisstable - disabled for now.
> Document and test ability to use absl::Overload with generic lambdas.
> Extract `InsertPosition` function to be able to reuse it.
> Increase GraphCycles::PointerMap size
> PR #1632: inlined_vector: Use trivial relocation for `erase`
> Create `BM_GroupPortable_Match`.
> [absl] Mark `absl::NoDestructor` methods with `absl::Nonnull` as appropriate
> Automated Code Change
> Rework casting in raw_hash_set's `IsFull()`.
> Adds ABSL_ATTRIBUTE_LIFETIME_BOUND to absl::BitGenRef
> Workaround for NVIDIA C++ compiler being unable to parse variadic expansions in range of range-based for loop
> Rollback: Make DLOG(FATAL) not understood as [[noreturn]]
> Make DLOG(FATAL) not understood as [[noreturn]]
> Optimize `absl::Duration` division and modulo: Avoid repeated redundant comparisons in `IDivFastPath`.
> Optimize `absl::Duration` division and modulo: Allow the compiler to inline `time_internal::IDivDuration`, by splitting the slow path to a separate function.
> Fix typo in example code snippet.
> Automated Code Change
> Add braces for conditional statements in raw_hash_map functions.
> Optimize `prepare_insert`, when resize happens. It removes single unnecessary probing before resize that is beneficial for small tables the most.
> Add noexcept to move assignment operator and swap function
> Import of CCTZ from GitHub.
> Minor documentation updates.
> Change find_or_prepare_insert to return std::pair<iterator, bool> to match return type of insert.
> PR #1618: inlined_vector: Use trivial relocation for `SwapInlinedElements`
> Improve raw_hash_set tests.
> Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
> Use const_cast to avoid duplicating the implementation of raw_hash_set::find(key).
> Import of CCTZ from GitHub.
> Performance improvement for absl::AsciiStrToUpper() and absl::AsciiStrToLower()
> Annotate that SpinLock should warn when unused.
> PR #1625: absl::is_trivially_relocatable now respects assignment operators
> Introduce `Group::MaskNonFull` without usage.
> `demangle`: Parse template template and C++20 lambda template param substitutions
> PR #1617: fix MSVC 32-bit build with -arch:AVX
> Minor documentation fix for `absl::StrSplit()`
> Prevent overflow in `absl::CEscape()`
> `demangle`: Parse optional single template argument for built-in types
> PR #1412: Filter out `-Xarch_` flags from pkg-config files
> `demangle`: Add complexity guard to `ParseQRequiresExpr`
< Prepare 20240116.1 patch for Apple Privacy Manifest (#1623)
> Remove deprecated symbol absl::kuint128max
> Add ABSL_ATTRIBUTE_WARN_UNUSED.
> `demangle`: Parse `requires` clauses on template params, before function return type
> On Apple, implement absl::is_trivially_relocatable with the fallback.
> `demangle`: Parse `requires` clauses on functions
> Make `begin()` to return `end()` on empty tables.
> `demangle`: Parse C++20-compatible template param declarations, except those with `requires` expressions
> Add the ABSL_DEPRECATE_AND_INLINE() macro
> Span: Fixed comment referencing std::span as_writable_bytes() as as_mutable_bytes().
> Switch rank structs to be consistent with written guidance in go/ranked-overloads
> Avoid hash computation and `Group::Match` in small tables copy and use `IterateOverFullSlots` for iterating for all tables.
> Optimize `absl::Hash` by making `LowLevelHash` faster.
> Add -Wdead-code-aggressive to ABSL_LLVM_FLAGS
< Backport Apple Privacy Manifest (#1613)
> Stop using `std::basic_string<uint8_t>` which relies on a non-standard generic `char_traits<>` implementation, recently removed from `libc++`.
> Add absl_container_hash-based HashEq specialization
> `demangle`: Implement parsing for simplest constrained template arguments
> Roll forward 9d8588bfc4566531c4053b5001e2952308255f44 (which was rolled back in 146169f9ad357635b9cd988f976b38bcf83476e3) with fix.
> Add a version of absl::HexStringToBytes() that returns a bool to validate that the input was actually valid hexadecimal data.
> Enable StringLikeTest in hash_function_defaults_test
> Fix a typo.
> Minor changes to the BUILD file for absl/synchronization
> Avoid static initializers in case of ABSL_FLAGS_STRIP_NAMES=1
> Rollback 9d8588bfc4566531c4053b5001e2952308255f44 for breaking the build
> No public description
> Decrease the precision of absl::Now in x86-64 debug builds
> Optimize raw_hash_set destructor.
> Add ABSL_ATTRIBUTE_UNINITIALIZED macros for use with clang and GCC's `uninitialized`
> Optimize `Cord::Swap()` for missed compiler optimization in clang.
> Type erased hash_slot_fn that depends only on key types (and hash function).
> Replace `testonly = 1` with `testonly = True` in abseil BUILD files.
> Avoid extra `& msbs` on every iteration over the mask for GroupPortableImpl.
> Missing parenthesis.
> Early return from destroy_slots for trivially destructible types in flat_hash_{*}.
> Avoid export of testonly target absl::test_allocator in CMake builds
> Use absl::NoDestructor for cordz global queue.
> Add empty WORKSPACE.bzlmod
> Introduce `RawHashSetLayout` helper class.
> Fix a corner case in SpyHashState for exact boundaries.
> Add nullability annotations
> Use absl::NoDestructor for global HashtablezSampler.
> Always check if the new frame pointer is readable.
> PR #1604: Add privacy manifest
< Disable ABSL_ATTRIBUTE_TRIVIAL_ABI in open-source builds (#1606)
> Remove code pieces for no longer supported GCC versions.
> Disable ABSL_ATTRIBUTE_TRIVIAL_ABI in open-source builds
> Prevent brace initialization of AlphaNum
> Remove code pieces for no longer supported MSVC versions.
> Added benchmarks for smaller size copy constructors.
> Migrate empty CrcCordState to absl::NoDestructor.
> Add protected copy ctor+assign to absl::LogSink, and clarify thread-safety requirements to apply to the interface methods.
< Apply LTS transformations for 20240116 LTS branch (#1599)
Closesscylladb/scylladb#28756
The rack option was fully implemented in the code but omitted from
both docs/operating-scylla/admin.rst and conf/scylla.yaml comments.
Closesscylladb/scylladb#29239
In a multi-declarator declaration, the & ref-qualifier is part of each
individual declarator, not the shared type specifier. So:
const auto& a = x(), b = y();
declares 'a' as a reference but 'b' as a value, silently copying y().
The same applies to:
const T& a = v[i], b = v[j];
Both operator== lines had this pattern, causing an unnecessary copy of
the column vector and an unnecessary copy of each entry on every call.
Fix by repeating & on the second declarator in both lines.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29213
clear_gently() (introduced in 322aa2f8b5) clears all token_metadata_impl
members using co_await to avoid reactor stalls on large data structures.
_topology_change_info (introduced in 10bf8c7901) was added later and not
included in clear_gently().
update_topology_change_info() already uses utils::clear_gently() when
replacing the value, so it looks reasonable to apply the same pattern
in clear_gently().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29210
Some on_internal_error() calls have the selector argument to a format
string with no placeholder for it in the format string.
"While at it", disambiguate selector type in the message text.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29208
There's only one test left that uses it, and it can be patched to use standard ks/cf creation helpers from pylib. This patch does so and drops the lengthy create_dataset() helper
Tests improvements, no need to backport
Closesscylladb/scylladb#29176
* github.com:scylladb/scylladb:
test/backup: drop create_dataset helper
test/backup: use new_test_keyspace in test_restore_primary_replica
The enable_tablets(false) was added when LWT wasn't supported for tablets, now it's, so no need in this attribute are more.
The test covers behavior which should work in similar way for both vnodes and tablets -> it doesn't seem it would benefit much from running it in both enable_tablets(true) and enable_tablets(false) modes.
Closesscylladb/scylladb#29167
In order to apply fsult-injected delay, there's the inject(duration)
overload. Results in shorter code
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29168
The endpoint in question has some places worth fixing, in particular
- the keyspace parameter is not validated
- the validated table name is resolved into table_id, but the id is unused
- two ugly static helpers to stream obtained token ranges into json
Improving the API code flow, not backporting
Closesscylladb/scylladb#29154
* github.com:scylladb/scylladb:
api: Inline describe_ring JSON handling
storage_service: Make describe_ring_for_table() take table_id
Add skip_reason_plugin.py — a framework-agnostic pytest plugin that
provides typed skip markers (skip_bug, skip_not_implemented, skip_slow,
skip_env) so that the reason a test is skipped is machine-readable in
JUnit XML and Allure reports. Bare untyped pytest.mark.skip now
triggers a warning (to become an error after full migration). Runtime
skips via skip() are also enriched by parsing the [type] prefix from
the skip message.
The plugin is a class (SkipReasonPlugin) that receives the concrete
SkipType enum and an optional report_callback from conftest.py, keeping
it decoupled from allure and project-specific types.
Extract SkipType enum and convenience runtime skip wrappers (skip_bug,
skip_env, etc.) into test/pylib/skip_types.py so callers only need a
single import instead of importing both SkipType and skip() separately.
conftest.py imports SkipType from the new module and registers the
plugin instance unconditionally (for all test runners).
New files:
- test/pylib/skip_reason_plugin.py: core plugin — typed marker
processing, bare-skip warnings, JUnit/Allure report enrichment
(including runtime skip() parsing via _parse_skip_type helper)
- test/pylib/skip_types.py: SkipType enum and convenience wrappers
(skip_bug, skip_not_implemented, skip_slow, skip_env)
- test/pylib_test/test_skip_reason_plugin.py: 17 pytester-based
test functions (51 cases across 3 build modes) covering markers,
warnings, reports, callbacks, and skip_mode interaction
Infrastructure changes:
- test/conftest.py: import SkipType from skip_types, register
SkipReasonPlugin with allure report callback
- test/pylib/runner.py: set SKIP_TYPE_KEY/SKIP_REASON_KEY stash keys
for skip_mode so the report hook can enrich JUnit/Allure with
skip_type=mode without longrepr parsing
- test/pytest.ini: register typed marker definitions (required for
--strict-markers even when plugin is not loaded)
Migrated test files (representative samples):
- test/cluster/test_tablet_repair_scheduler.py:
skip -> skip_bug (#26844), skip -> skip_not_implemented
- test/cqlpy/.../timestamp_test.py: skip -> skip_slow
- test/cluster/dtest/schema_management_test.py: skip -> skip_not_implemented
- test/cluster/test_change_replication_factor_1_to_0.py: skip -> skip_bug (#20282)
- test/alternator/conftest.py: skip -> skip_env
- test/alternator/test_https.py: use skip_env() wrapper
Fixes SCYLLADB-79
Closesscylladb/scylladb#29235
This patch series implements `object_storage_base::clone`, which was previously a stub that aborted at runtime. Clone creates a copy of an sstable under a new generation and is used during compaction.
The implementation uses server-side object copies (S3 CopyObject / GCS Objects: rewrite) and mirrors the filesystem clone semantics: TemporaryTOC is written first to mark the operation as in-progress, component objects are copied, and TemporaryTOC is removed to commit (unless the caller requested the destination be left unsealed).
The first two patches fix pre-existing bugs in the underlying storage clients that were exposed by the new clone code path:
- GCS `copy_object` used the wrong HTTP method (PUT instead of POST) and sent an invalid empty request body.
- S3 `copy_object` silently ignored the abort_source parameter.
1. **gcp_client: fix copy_object request method and body** — Fix two bugs in the GCS rewrite API call.
2. **s3_client: pass through abort_source in copy_object** — Stop ignoring the abort_source parameter.
3. **object_storage: add copy_object to object_storage_client** — New interface method with S3 and GCS implementations.
4. **storage: add make_object_name overload with generation** — Helper for building destination object names with a different generation.
5. **storage: make delete_object const** — Needed by the const clone method.
6. **storage: implement object_storage_base::clone** — The actual clone implementation plus a copy_object wrapper.
7. **test/boost: enable sstable clone tests for S3 and GCS** — Re-enable the previously skipped tests.
A test similar to `sstable_clone_leaving_unsealed_dest_sstable` was added to properly test the sealed/unsealed states for object storage. Works for both S3 and GCS.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1045
Prerequisite: https://github.com/scylladb/scylladb/pull/28790
No need to backport since this code targets future feature
Closesscylladb/scylladb#29166
* github.com:scylladb/scylladb:
compaction_test: enable sstable clone tests for S3 and GCS
storage: implement object_storage_base::clone
storage: make delete_object const in object_storage_base
storage: add make_object_name overload with generation
sstables: add get_format() accessor to sstable
object_storage: add copy_object to object_storage_client
s3_client: pass through abort_source in copy_object
gcp_client: fix copy_object request method and body
The vector-search feature introduced the somewhat confusing feature of
enabling CDC without explicitly enabling CDC: When a vector index is
enabled on a table, CDC is "enabled" for it even if the user didn't
ask to enable CDC.
For this, write-path code began to use a new cdc_enabled() function
instead of checking schema.cdc_options.enabled() directly. This
cdc_enabled() function checks if either this enabled() is true, or
has_vector_index() is true.
Unfortunately, LWT writes continued to use cdc_options.enabled() instead
of the new cdc_enabled(). This means that if a vector index is used and
a vector is written using an LWT write, the new value is not indexed.
This patch fixes this bug. It also adds a regression test that fails
before this patch and passes afterwards - the new test verifies that
when a table has a vector index (but no explicit CDC enabled), the CDC
log is updated both after regular writes and after successful LWT writes.
This patch was also tested in the context of the upcoming vector-search-
for-Alternator pull request, which has a test reproducing this bug
(Alternator uses LWT frequently, so this is very important there).
It will also be tested by the vector-store test suite ("validator").
Fixes SCYLLADB-1342
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#29300
In the unregistered-ID branch, ldap_msgfree() was called on a result
already owned by an RAII ldap_msg_ptr, causing a double-free on scope
exit. Remove the redundant manual free.
Fixes: SCYLLADB-1344
Backport: 2026.1, 2025.4, 2025.1 - it's a memory corruption, with a one-line fix, so better backport it everywhere.
Closesscylladb/scylladb#29302
* github.com:scylladb/scylladb:
test: ldap: add regression test for double-free on unregistered message ID
ldap: fix double-free of LDAPMessage in poll_results()
Now that object_storage_base::clone is implemented,
remove the early-return skips and re-enable the
sstable_clone_leaving_unsealed_dest_sstable tests for
both S3 and GCS storage backends.
Implement the clone method for object_storage_base, which creates
a copy of an sstable with a new generation using server-side object
copies. Also add a const copy_object convenience wrapper, similar
to the existing put_object and delete_object wrappers.
A dedicated test for the new object storage clone path will be
added in the following commit. The preexisting local-filesystem
clone is already covered by the sstable_clone_leaving_unsealed_dest_sstable
test.
Add a make_object_name overload that accepts a target
generation parameter for constructing object names with
a generation different from the source sstable's own.
Refactor the original make_object_name to delegate to
the new overload, eliminating code duplication.
This is needed by clone to build destination object
names for the new generation.
Add a public get_format() accessor for the _format member, following
the same pattern as the existing get_version(). This allows storage
implementations to access the sstable format without reaching into
private members, and is needed by the upcoming object_storage_base::clone
to construct entry_descriptor for the sstables registry.
Add a copy_object method to the object_storage_client
interface for server-side object copies, with
implementations for both S3 and GCS wrappers.
The S3 wrapper delegates to s3::client::copy_object.
The GCS wrapper delegates to gcp::storage::client's
cross-bucket copy_object overload.
This is a prerequisite for implementing sstable clone
on object storage.
The abort_source parameter in s3::client::copy_object
was ignored — the function accepted it but always passed
nullptr to the underlying copy_s3_object. Forward it
properly so callers can cancel in-progress copies.
The GCP copy_object (rewrite API) had two bugs:
1. The request body was an empty string, but the GCP
rewrite endpoint always parses it as JSON metadata.
An empty string is not valid JSON, resulting in
400 "Metadata in the request couldn't decode".
Fix: send "{}" (empty JSON object) as the body.
2. The HTTP method was PUT, but the GCP Objects: rewrite
API requires POST per the documentation.
Fix: use POST.
Test coverage in a follow-up patch
When `BatchWriteItem` operates on multiple items sharing the same partition key in `always_use_lwt` write isolation mode, all CDC log entries are emitted under a single timestamp. The previous `get_records` parsing algorithm in `alternator/streams.cc` assumed that all CDC log entries sharing the same timestamp correspond to a single DynamoDB item change. As a result, it would incorrectly squash multiple distinct item changes into a single Streams record — producing wrong event data (e.g., one INSERT instead of four, with mismatched key/attribute values).
Note: the bug is specific to `always_use_lwt` mode because only in LWT mode does the entire batch share a single timestamp. In non-LWT modes, each item in the batch receives a separate timestamp, so the entries naturally stay separate.
**Commit 1: alternator: add BatchWriteItem Streams test**
- Adds new tests `test_streams_batchwrite_no_clustering_deletes_non_existing_items` and `test_streams_batchwrite_no_clustering_deletes_existing_items` that cover the corner cases of batch-deleting a existing and non-existing item in a table without a clustering key. CDC tables without clustering keys are handled differently, and this path was previously untested for delete operations.
- Adds a new test `test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data`, that is a simple way to trigger a bug.
- Adds a new test `test_streams_batchwrite_into_the_same_partition_deletes_existing_items`, that validates various combinations of puts and deletes in a single BatchWrite against the same partition.
- Adds a new `test_table_ss_new_and_old_images_write_isolation_always` fixture and extends `create_table_ss` to accept `additional_tags`, enabling tests with a specific write isolation mode.
**Commit 2: alternator: fix BatchWriteItem squashed Streams entries**
The core fix rewrites the CDC log entry parsing in `get_records` to distinguish items by their clustering key:
- Introduces `managed_bytes_ptr_hash` and `managed_bytes_ptr_equal` helper structs for pointer-based hash map lookups on `managed_bytes`.
- Replaces the single `record`/`dynamodb` pair with a `std::unordered_map<const managed_bytes*, Record, ...>` (`records_map`) keyed by the base table's clustering key value from each CDC log row. For tables without a clustering key, all entries map to a single sentinel key.
- Adds a validation that Alternator tables have at most one clustering key column (as required by the DynamoDB data model).
- On end-of-record (`eor`), flushes all accumulated per-clustering-key records into the output, each with a unique `eventID` (the `event_id` format now includes an index suffix).
- Adjusts the limit check: since a single CDC timestamp bucket can now produce multiple output records, the limit may be slightly exceeded to avoid breaking mid-batch.
Fixes#28439
Fixes: SCYLLADB-540
Closesscylladb/scylladb#28452
* github.com:scylladb/scylladb:
alternator/test: explain why 'always' write isolation mode is used in tests
alternator/test: add scylla_only to always write isolation fixture
alternator: fix BatchWriteItem squashed Streams entries
alternator: add BatchWriteItem test (failing)
Queries against local vector indexes were failing with the error:
```ANN ordering by vector requires the column to be indexed using 'vector_index'```
This was a regression introduced by 15788c3734, which incorrectly
assumed the first column in the targets list is always the vector column.
For local vector indexes, the first column is the partition key, causing
the failure.
Previously, serialization logic for the target index option was shared
between vector and secondary indexes. This is no longer viable due to
the introduction of local vector indexes and vector indexes with filtering
columns, which have different target format.
This commit introduces a dedicated JSON-based serialization format for
vector index targets, identifying the target column (tc), filtering
columns (fc), and partition key columns (pk). This ensures unambiguous
serialization and deserialization for all vector index types.
This change is backward compatible for regular vector indexes. However,
it breaks compatibility for local vector indexes and vector indexes with
filtering columns created in version 2026.1.0. To mitigate this, usage
of these specific index types will be blocked in the 2026.1.0 release
by failing ANN queries against them in vector-store service.
Fixes: SCYLLADB-895
Backport to 2026.1 is required as this issue occurs also on this branch.
Closesscylladb/scylladb#28862
* github.com:scylladb/scylladb:
index: fix DESC INDEX for vector index
vector_search: test: refactor boilerplate setup
vector_search: fix SELECT on local vector index
index: test: vector index target option serialization test
index: test: secondary index target option serialization test
To create `process_staging` view building tasks, we firstly need to
collect informations about them on shard0, create necessary mutations,
commit them to group0 and move staging sstables objects to their
original shards.
But there is a possible race after committing the group0 command
and before moving the staging sstables to their shards.
Between those two events, the coordinator may schedule freshly created
tasks and dispatch them to the worker but the worker won't have the
sstables objects because they weren't moved yet.
This patch fixes the race by holding `_staging_sstables_mutex` locks
from necessary shards when executing `create_staging_sstable_tasks()`.
With this, even if the task will be scheduled and dispatched quickly,
the worker will wait with executing it until the sstables objects are
moved and the locks are released.
Fixes SCYLLADB-816
Add a test that verifies filesystem_storage::clone preserves the sstable
state: an sstable in staging is cloned to a new generation, the clone is
re-loaded from the staging directory, and its state is asserted to still
be staging.
The change proves that https://scylladb.atlassian.net/browse/SCYLLADB-1205
is invalid, and can be closed.
* No functional change and no backport needed
Closesscylladb/scylladb#29209
* github.com:scylladb/scylladb:
test: add test_sstable_clone_preserves_staging_state
test: derive sstable state from directory in test_env::make_sstable
sstables: log debug message in filesystem_storage::clone
`data_value::to_parsable_string()` crashes with a null pointer dereference when called on a `null` data_value. Return `"null"` instead.
Added tests after the fix. Manually checked that tests fail without the fix.
Fixes SCYLLADB-1350
This is a fix that prevents format crash. No known occurrence in production, but backport is desirable.
Closesscylladb/scylladb#29262
* github.com:scylladb/scylladb:
test: boost: test null data value to_parsable_string
cql3: fix null handling in data_value formatting
Extend test_basic to run with both RF=1 and RF=3 to verify that
object storage works correctly with multiple replicas. The test now
starts one server per replica (each on its own rack), flushes all
nodes, validates tablet replica counts for RF>1, and restarts all
servers before verifying data is still readable.
Fixes: SCYLLADB-546
Closesscylladb/scylladb#28583
As noted in issue #5027 and issue #29138, Alternator's support for
ReturnConsumedCapacity is lacking in a two areas:
1. While ReturnConsumedCapacity is supported for most relevant
operations, it's not supported in two operations: Query and Scan.
2. While ReturnConsumedCapacity=TOTAL is supported, INDEXES is not
supported at all.
This patch adds extensive tests for all these cases. All these tests
pass on DynamoDB but fail on Alternator, so are marked with "xfail".
The tests for ReturnConsumedCapacity=INDEXES are deliberately split
into two: First, we test the case where the table has no indexes, so
INDEXES is almost the same as TOTAL and should be very easy to
implement. A second test checks the cases where there are indexes,
and different operations increment the capacity of the base table
and/or indexes differently - it will require significantly more work
to make the second test pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#29188
When a table has a vector index, cdc::cdc_enabled() returns true because
vector index writes are implemented via the CDC augmentation path. However,
register_cdc_operation_result_tracker() was checking only
cdc_options().enabled(), which is false for tables that have a vector index
but not traditional CDC.
As a result, the operation_result_tracker was never attached to write
response handlers for vector-indexed tables. This tracker was added in
commit 1b92cbe, and its job is to update metrics of CDC operations,
and since vector search really does use CDC under the hood, these
metrics could be useful when diagnosing problems.
Fix by using cdc::cdc_enabled() instead of cdc_options().enabled(), which
covers both traditional CDC and vector-indexed tables.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#29343
Clang 22 verifies [[nodiscard]] for co_await,
causing compilation failures where return values of expected<> were
silently discarded.
These call sites were discarding the return value of client::request()
and vector_store_client::ann(), both of which return expected<> types
marked [[nodiscard]]. Rather than suppressing the warning with (void)
casts, properly check the return values using the established test
patterns: BOOST_CHECK(result) where the call is expected to succeed,
and BOOST_CHECK(!result) where the call is expected to fail.
Closesscylladb/scylladb#29297
This issue adds the upgrade guide for all patch releases within 2026.x major release.
In addition, it fixes the link to Upgrade Policy in the 2025.x-to-2026.1 upgrade guide.
Fixes SCYLLADB-1247
Closesscylladb/scylladb#29307
This PR introduces the vnodes-to-tablets migration procedure, which enables converting an existing vnode-based keyspace to tablets.
The migration is implemented as a manual, operator-driven process executed in several stages. The core idea is to first create tablet maps with the same token boundaries and replica hosts as the vnodes, and then incrementally convert the storage of each node to the tablets layout. At a high level, the procedure is the following:
1. Create tablet maps for all tables in the keyspace.
2. Sequentially upgrade all nodes from vnodes to tablets:
1. Mark a node for upgrade in the topology state.
2. Restart the node. During startup, while the node is offline, it reshards the SSTables on vnode boundaries and switches to a tablet ERM.
3. Wait for the node to return online before proceeding to the next node.
4. Finalize the migration:
1. Update the keyspace schema to mark it as tablet-based.
2. Clear the group0 state related to the migration.
From the client's perspective, the migration is online; the cluster can still serve requests on that keyspace, although performance may be temporarily degraded.
During the migration, some nodes use vnode ERMs while others use tablet ERMs. Cluster-level algorithms such as load balancing will treat the keyspace's tables as vnode-based. Once migration is finalized, the keyspace is permanently switched to tablets and cannot be reverted back to vnodes. However, a rollback procedure is available before finalization.
The patch series consists of:
* Load balancer adjustments to ignore tablets belonging to a migrating keyspace.
* A new vnode-based resharding mode, where SSTables are segregated on vnode boundaries rather than with the static sharder.
* A new per-node `intended_storage_mode` column in `system.topology`. Represents migration intent (whether migration should occur on restart) and direction.
* Four new REST endpoints for driving the migration (start, node upgrade/downgrade, finalize, status), along with `nodetool` wrappers. The finalization is implemented as a global topology request.
* Wiring of the migration process into the startup logic: the `distributed_loader` determines a migrating table's ERM flavor from the `intended_storage_mode` and the ERM flavor determines the `table_populator`'s resharding mode. Token metadata changes have been adjusted to preserve the ERM flavor.
* Cluster tests for the migration process.
Fixes SCYLLADB-722.
Fixes SCYLLADB-723.
Fixes SCYLLADB-725.
Fixes SCYLLADB-779.
Fixes SCYLLADB-948.
New feature, no backport is needed.
Closesscylladb/scylladb#29065
* github.com:scylladb/scylladb:
docs: Add ops guide for vnodes-to-tablets migration
test: cluster: Add test for migration of multiple keyspaces
test: cluster: Add test for error conditions
test: cluster: Add vnodes->tablets migration test (rollback)
test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes)
test: cluster: Add vnodes->tablets migration test (1 table, 1 node)
scylla-nodetool: Add migrate-to-tablets subcommand
api: Add REST endpoint for vnode-to-tablet migration status
api: Add REST endpoint for migration finalization
topology_coordinator: Add `finalize_migration` request
database: Construct migrating tables with tablet ERMs
api: Add REST endpoint for upgrading nodes to tablets
api: Add REST endpoint for starting vnodes-to-tablets migration
topology_state_machine: Add intended_storage_mode to system.topology
distributed_loader: Wire vnode-based resharding into table populator
replica: Pick any compaction group for resharding
compaction: resharding_compaction: add vnodes_resharding option
storage_service: Preserve ERM flavor of migrating tables
tablet_allocator: Exclude migrating tables from load balancing
feature_service: Add vnodes_to_tablets_migrations feature
The estimate() function in the size_estimates virtual reader only
considered sstables local to the shard that happened to own the
keyspace's partition key token. Since sstables are distributed across
shards, this caused partition count estimates to be approximately
1/smp_count of the actual value.
This bug has been present since the virtual reader was introduced in
225648780d.
Use db.container().map_reduce0() to aggregate sstable estimates
across all shards. Each shard contributes its local count and
estimated_histogram, which are then merged to produce the correct
total.
Also fix the `test_partitions_estimate_full_overlap` test which becomes
flaky (xpassing ~1% of runs) because autocompaction could merge the
two overlapping sstables before the size estimate was read. Wrap the
test body in nodetool.no_autocompaction_context to prevent this race.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1179
Refs https://github.com/scylladb/scylladb/issues/9083Closesscylladb/scylladb#29286
As discussed with @ScyllaPiotr in
https://github.com/scylladb/scylladb/pull/29232, the doc about to be
removed is just:
> Looking at history, I think this audit.md is a design doc: scylladb/scylla-enterprise@87a5c19, for which the feature has been implemented differently, eventually, and was created around the time when design docs, apparently, where stored within the repository itself. So for me it's some trash (sorry for strong language) that can be safely removed.
Closesscylladb/scylladb#29316
Python warns that the sequence "\(" is an invalid escape and
might be rejected in the future. Protect against that by using
a raw string.
Closesscylladb/scylladb#29334
Convert auth_test.cc to coroutines for improved readability. Each test is converted in its own commit. Some
are trivial.
Indentation is left broken in some commits to reduce the diff, then fixed up in the last commit.
Code cleanup, so no backport.
Closesscylladb/scylladb#29336
* github.com:scylladb/scylladb:
auth_test: fix whitespace
auth_test: coroutinize test_try_describe_schema_with_internals_and_passwords_as_anonymous_user
auth_test: coroutinize test_try_login_after_creating_roles_with_hashed_password
auth_test: coroutinize test_create_roles_with_hashed_password_and_log_in
auth_test: coroutinize test_try_create_role_with_hashed_password_as_anonymous_user
auth_test: coroutinize test_try_to_create_role_with_password_and_hashed_password
auth_test: coroutinize test_try_to_create_role_with_hashed_password_and_password
auth_test: coroutinize test_alter_with_workload_type
auth_test: coroutinize test_alter_with_timeouts
auth_test: coroutinize role_permissions_table_is_protected
auth_test: coroutinize role_members_table_is_protected
auth_test: coroutinize roles_table_is_protected
auth_test: coroutinize test_password_authenticator_operations
auth_test: coroutinize test_password_authenticator_attributes
auth_test: coroutinize test_default_authenticator
When an SSTable was encrypted with a KMS host that is not present in
scylla.yaml, the error thrown was:
std::invalid_argument (No such host: <host-name>)
This message is very obscure in general, and especially confusing when
encountered while using the scylla-sstable tool: it gives no indication
that the SSTable is encrypted, that a KMS host lookup is involved, or
what the user needs to do to fix the problem.
Replace it with a message that names the missing host and points
directly to the relevant scylla.yaml section:
Encryption host "<host-name>" is not defined in scylla.yaml.
Make sure it is listed under the "kmip_hosts" section.
The wording is intentionally kept neutral (not framed as an SSTable tool
problem) because the same code path is exercised by production ScyllaDB
when a node's configuration no longer contains a host referenced by an
existing data file (e.g. after a config rollback or when restoring data
from a different cluster). The production use-case takes precedence, but
the message is equally actionable from the tool.
Closesscylladb/scylladb#29228
start - end will result in negative length, rejected by the python
runtime. Use the correct end - start to calculate length.
Closesscylladb/scylladb#29249
data_dictionary::database was converted to replica::database in two
places, just to call find_keyspace(), then call
get_replication_strategy() on the returned keyspace. This is not
necessary, data_dictionary::database already has find_keyspace() and the
returned data_dictionary::keyspace also has get_replication_strategy().
This patch removes a small layering violation but more importantly, it
is necessary for the sstable tool to be able to load schemas from disk,
when said schema has tombstone_gc props.
Closesscylladb/scylladb#29279
This series updates the storage abstraction and extends the compaction tests to support object‑storage backends (S3 and GCS), while tightening several parts of the test environment.
The changes include:
- New exists/object_exists helpers across storage backends and clock fixes in the S3 client to make signature generation stable under test conditions.
- A new get_storage_for_tests accessor and adjustments to the test environment to avoid premature teardown of the sstable registry.
- Refactoring of compaction tests to remove direct sstable access, ensure proper schema setup, and avoid use of moved‑from objects.
- Extraction of test_env‑based logic into reusable functions and addition of S3/GCS variants of the compaction tests.
Not all tests were converted to be backend‑agnostic yet, and a few require further investigation before they can run cleanly against S3/GCS backends. These will be addressed in follow‑up work.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-704 however, followup is needed
No backport needed since this change targeting future feature
Closesscylladb/scylladb#28790
* github.com:scylladb/scylladb:
compaction_test: fix formatting after previous patches
compaction_test: add S3/GCS variations to tests
compaction_test: extract test_env-based tests into functions
compaction_test: replace file_exists with storage::exists
compaction_test: initialize tables with schema via make_table_for_tests
compaction_test: use sstable APIs to manipulate component files
compaction_test: fix use-after-move issue
sstable_utils: add `get_storage` and `open_file` helpers
test_env: delay unplugging sstable registry
storage: add `exists` method to storage abstraction
s3_client: use lowres_system_clock for aws_sigv4
s3_client: add `object_exists` helper
gcs_client: add `object_exists` helper
View building worker was breaking semaphores without holding their locks.
This lead to races like SCYLLADB-844 and SCYLLADB-543,
where a new batch was started after `view_building_worker::state` was cleared in the `drain()` process.
This patch fix the race by:
- taking a lock of the mutex before breaking it
- distinguishing between `state::clear()`(can happen multiple times) and `state::drain()`(can be called only once during shutdown)
- asserting that the state is not doing any new work after it was drained
Fixes SCYLLADB-844
Fixes SCYLLADB-543
This PR should be backported to all versions containing view building coordinator (2025.4 and newer).
Closesscylladb/scylladb#29303
* github.com:scylladb/scylladb:
view_building_worker: extract starting a new batch to state's method
view_building_worker: distinguish between state's `clear()` and `drain()`
view_building_worker: lock mutexes before breaking them in `drain()`
view_building_worker: execute drain() once
The fix for SCYLLADB-1373 (b4f652b7c1) changed get_session() to use
the default timeout=30 for the retry loop in patient_*_cql_connection
(previously timeout=0.1). This correctly allowed retrying transient
NoHostAvailable errors during node startup, but introduced a new
flakiness in test_login and other auth tests.
The failure chain:
1. test_login connects with bad credentials (e.g. user="doesntexist")
2. get_session() calls patient_exclusive_cql_connection(), which calls
retry_till_success() with bypassed_exception=NoHostAvailable
3. The first attempt correctly fails: the server rejects the credentials
with AuthenticationFailed, wrapped in NoHostAvailable
4. retry_till_success() catches NoHostAvailable indiscriminately and
retries, not distinguishing between transient errors (node not ready)
and permanent errors (bad credentials)
5. A subsequent retry attempt times out (connect_timeout=5), producing
OperationTimedOut wrapped in NoHostAvailable
6. After 30 seconds, the last NoHostAvailable is raised -- now wrapping
OperationTimedOut instead of the original AuthenticationFailed
7. The assertion `isinstance(..., AuthenticationFailed)` fails
With the old timeout=0.1, the deadline was already exceeded after the
first attempt, so the original AuthenticationFailed propagated.
Fix: Add a `should_retry` predicate parameter to retry_till_success()
and use it in patient_cql_connection() and
patient_exclusive_cql_connection() to immediately re-raise
NoHostAvailable when it wraps AuthenticationFailed. Retrying
authentication failures is never useful since the credentials won't
change between attempts.
Fixes: SCYLLADB-1382
Closesscylladb/scylladb#29348
Following the previous commit, a new batch cannot be started if the
state was already drained.
This commit also adds a check that only one batch is running at a time.
While both of this methods do the same (abort current batch, clear
data), we can clear the state multiple times during view_building_worker
lifetime (for instance when processing base table is changed) but
`view_building_worker::state::drain()` should be called only once and
after this no other work on the state should be done.
Not doing this may lead to races like SCYLLADB-844.
If some consumer is holding a lock of a mutex and `drain()`
is just braking the mutex without locking it beforehand,
then the consumer may process its code which should be aborted.
An example of the race is SCYLLADB-844, where `work_on_tasks()` is
holding `_state._mutex` while it is broken by `drain()`.
This causes a new batch is started after the `_state` is cleared.
get_session() was passing timeout=0.1 to patient_exclusive_cql_connection
and patient_cql_connection, leaving only 0.1 seconds for the retry loop
in retry_till_success(). Since each connection attempt can take up to 5
seconds (connect_timeout=5), the retry loop effectively got only one
attempt with no chance to retry on transient NoHostAvailable errors.
Use the default timeout=30 seconds, consistent with all other callers.
Fixes: SCYLLADB-1373
Closesscylladb/scylladb#29332
Flatten continuation chains (.then()) into linear thread-style code
with .get() calls for improved readability. Remove the now-unused
require_throws helper template.
When a Scylla node starts, the scylla-image-setup.service invokes the
`scylla_swap_setup` script to provision swap. This script allocates a
swap file and creates a swap systemd unit to delegate control to
systemd. By default, systemd injects a Before=swap.target dependency
into every swap unit, allowing other services to use swap.target to wait
for swap to be enabled.
On Azure, this doesn't work so well because we store the swap file on
the ephemeral disk [1] which has network dependencies (`_netdev` mount
option, configured by cloud-init [2]). This makes the swap.target
indirectly depend on the network, leading to dependency cycles such as:
swap.target -> mnt-swapfile.swap -> mnt.mount -> network-online.target
-> network.target -> systemd-resolved.service -> tmp.mount -> swap.target
This patch breaks the cycle by removing the swap unit from swap.target
using DefaultDependencies=no. The swap unit will still be activated via
WantedBy=multi-user.target, just not during early boot.
Although this problem is specific to Azure, this patch applies the fix
to all clouds to keep the code simple.
Fixes#26519.
Fixes SCYLLADB-1257
[1] https://github.com/scylladb/scylla-machine-image/pull/426
[2] https://github.com/canonical/cloud-init/pull/1213#issuecomment-1026065501
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#28504
Replace direct filesystem checks (file_exists) with the
storage-agnostic exists() method in unsealed_sstable_compaction,
sstable_clone_leaving_unsealed_dest_sstable, and
failure_when_adding_new_sstable tests, making them compatible
with object-storage backends (S3, GCS).
Start using `table_for_tests::make_default_schema` so test tables are
created with a real schema. This is required for object-storage
backends, which cannot operate correctly without proper schema
initialization.
Switch tests to use sstable member functions for file manipulation
instead of opening files directly on the filesystem. This affects the
helpers that emulate sstable corruption: we now overwrite the entire
component file rather than just the first few kilobytes, which is
sufficient for producing a corrupted sstable.
Add a non-const `get_storage` accessor to expose underlying storage,
and an `open_file` helper to access sstable component files directly.
These are needed so compaction tests can read and write sstable
components.
Unplugging the mock sstable_registry happened too early in the test
environment. During sstable destruction, components may still need
access to the registry, so the unplugging is moved to a later stage.
Add an `exists` method to the storage abstraction to allow S3, GCS,
and local storage implementations to check whether an sstable
component is present.
Switch aws_sigv4 to lowres_system_clock since it is not affected by
time offsets often introduced in tests, which can skew db_clock. S3
requests cannot represent time shifts greater than 15 minutes from
server time, so a stable clock is required.
When test_exception_safety_of_update_from_memtable was converted from
manual fail_after()/catch to with_allocation_failures() in 74db08165d,
the populate_range() call ended up inside the failure injection scope
without a scoped_critical_alloc_section guard. The other two tests
converted in the same commit (test_exception_safety_of_transitioning...
and test_exception_safety_of_partition_scan) were correctly guarded.
Without the guard, the allocation failure injector can sometimes
target an allocation point inside the cleanup path of populate_range().
In a rare corner case, this triggers a bad_alloc in a noexcept context
(reader_concurrency_semaphore::stop()), causing std::terminate.
Fixes SCYLLADB-1346
Closesscylladb/scylladb#29321
Verify that upgrading from 2025.1 to master does not silently drop DDL
auditing for table-scoped audit configurations (SCYLLADB-1155).
Test time in dev: 4s
Refs: SCYLLADB-1155
Fixes: SCYLLADB-1305
The old execute_and_validate_audit_entry required every caller to
pass audit_settings so it could decide internally whether to expect
an entry. A test added later in this series needs to simply assert
an entry was produced, without specifying audit_settings at all.
Split into two methods:
- execute_and_validate_new_audit_entry: unconditionally expects an
audit entry.
- execute_and_validate_if_category_enabled: checks audit_settings
to decide whether to expect an entry or assert absence.
Local wrapper functions and **kwargs forwarding are removed in favor
of explicit arguments at each call site, and expected-error cases are
handled inline with assert_invalid + assert_entries_were_added.
AuditTester uses self.manager throughout but never declares it.
The attribute is only assigned in the CQLAuditTester subclass
__init__, so the type checker reports 'Attribute "manager" is
unknown' on every self.manager reference in the base class.
Add an __init__ to AuditTester that accepts and stores the manager
instance, and update CQLAuditTester to forward it via super().__init__
instead of assigning self.manager directly.
Fixes: SCYLLADB-1106
* Small fix in scylla_cluster - remove debug print
* Fix GSServer::unpublish so it does not except if publish was not called beforehand
* Improve dockerized_server so mock server logs echo to the test log to help diagnose CI failures (because we don't collect log files from mocks etc, and in any case correlation will be much easier).
No backport needed.
Closesscylladb/scylladb#29112
* github.com:scylladb/scylladb:
dockerized_service: Convert log reader to pipes and push to test log
test::cluster::conftest::GSServer: Fix unpublish for when publish was not called
scylla_cluster: Use thread safe future signalling
scylla_cluster: Remove left-over debug printout
The test was failing because the call to:
await log.wait_for('Stopping.*ongoing compactions')
was missing the 'from_mark=log_mark' argument. The log mark was updated
(line: log_mark = await log.mark()) immediately after detecting
'splitting_mutation_writer_switch_wait: waiting', and just before
launching the shutdown task. However, the wait_for call on the following
line was scanning from the beginning of the log, not from that mark.
As a result, the search immediately matched old 'Stopping N tasks for N
ongoing compactions for table system.X due to table removal' messages
emitted during initial server bootstrap (for system.large_partitions,
system.large_rows, system.large_cells), rather than waiting for the
shutdown to actually stop the user-table split compaction.
This caused the test to prematurely send the message to the
'splitting_mutation_writer_switch_wait' injection. The split compaction
was unblocked before the shutdown had aborted it, so it completed
successfully. Since the split succeeded, 'Failed to complete splitting
of table' was never logged.
Meanwhile, 'storage_service_drain_wait' was blocking do_drain() waiting
for a message. With the split already done, the test was stuck waiting
for the expected failure log that would never come (600s timeout). At
the same time, after 60s the 'storage_service_drain_wait' injection
timed out internally, triggering on_internal_error() which -- with
--abort-on-internal-error=1 -- crashed the server (exit code -6).
Fix: pass from_mark=log_mark to the wait_for('Stopping.*ongoing
compactions') call so it only matches messages that appear after the
shutdown has started, ensuring the test correctly synchronizes with the
shutdown aborting the user-table split compaction before releasing the
injection.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1319.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29311
Replace the random port selection with an OS-assigned port. We open
a temporary TCP socket, bind it to (ip, 0) with SO_REUSEADDR, read back
the port number the OS selected, then close the socket before launching
rest_api_mock.py.
Add reuse_address=True and reuse_port=True to TCPSite in rest_api_mock.py
so the server itself can also reclaim a TIME_WAIT port if needed.
Fixes: SCYLLADB-1275
Closesscylladb/scylladb#29314
On slow/overloaded CI machines the lowres_clock timer may not have
fired after the fixed 2x sleep, causing the assertion on
get_abort_exception() to fail. Replace the fixed sleep with
sleep(1x) + eventually_true() which retries with exponential backoff,
matching the pattern already used in test_time_based_cache_eviction.
Fixes: SCYLLADB-1311
Closesscylladb/scylladb#29299
Track the total memory consumed by responses waiting to be
written to the socket, exposed as a per-scheduling-group gauge
(cql_pending_response_memory). This complements the response
memory accounting added in the previous commits by giving
visibility into how much memory each service level is holding
in unsent response buffers.
make sure the driver is stopped even though cluster
teardown throws and avoid potential stale driver
connections entering infinite reconnect loops which
exhaust cpu resources.
Fixes: SCYLLADB-1189
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#29230
Debug mode shuffles task position in the queue. So the following is possible:
1) shard 1 calls manual_clock::advance(). This expires timers on shard 1 and queues a background smp call to shard 0 which will expire timers there
2) the smp::submit_to(0, ...) from shard 1 called by the test sumbits the call
3) shard 0 creates tasks for both calls, but (2) is run first, and preempts the reactor
4) shard 1 sees the completion, completes m_svc.invoke_on(1, ..)
5) shard 0 inserts the completion from (4) before task from (1)
6) the check on shard 0: m.find(id1) fails because the timer is not expired yet
To fix that, wait for timer expiration on shard 0, so that the test
doesn't depend on task execution order.
Note: I was not able to reproduce the problem locally using test.py --mode
debug --repeat 1000.
It happens in jenkins very rarely. Which is expected as the scenario which
leads to this is quite unlikely.
Fixes SCYLLADB-1265
Closesscylladb/scylladb#29290
The test exercises all five node operations (bootstrap, replace, rebuild,
removenode, decommission) and by the end only one node out of four
remains alive. The CQL driver session, however, still holds stale
references to the dead hosts in its connection pool and load-balancing
policy state.
When the new_test_keyspace context manager exits and attempts
DROP KEYSPACE, the driver routes the query to the dead hosts first,
gets ConnectionShutdown from each, and throws NoHostAvailable before
ever trying the single live node.
Fix by calling driver_connect() after the decommission step, which
closes the old session and creates a fresh one connected only to the
servers the test manager reports as running.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1313.
Closesscylladb/scylladb#29306
Sends a search via the raw LDAP handle (bypassing _msgid_to_promise
registration), then triggers poll_results() through the public API
to exercise the unregistered-ID branch.
Refs: SCYLLADB-1344
The multishard_query_test/fuzzy_test was timing out (SIGKILL after
15 minutes) in release mode CI.
In release mode the test generates up to 64 partitions with up to
1000 clustering rows and 1000 range tombstones each. With deeply
nested randomly-generated types (e.g. frozen<map<varint,
frozen<map<frozen<tuple<...>>>>>>), this volume of data can exceed
the 15-minute CI timeout.
Reduce the release-mode clustering-row and range-tombstone
distributions from 0-1000 to 0-200. This caps the worst case at
~12,800 rows -- still 2x the devel-mode maximum (0-100) and
sufficient to exercise multi-partition paged scanning with many
pages.
Fixes: SCYLLADB-1270
No need to backport for now, only appeared on master.
Closesscylladb/scylladb#29293
* github.com:scylladb/scylladb:
test: clean up fuzzy_test_config and add comments
test: fix fuzzy_test timeout in release mode
In the unregistered-ID branch, ldap_msgfree() was called on a result
already owned by an RAII ldap_msg_ptr, causing a double-free on scope
exit. Remove the redundant manual free.
Fixes: SCYLLADB-1344
This commit improves how test.py chohoses the default number of
parallele jobs.
This update keeps logic of selecting number of jobs from memory and cpu limits
but simplifies the heuristic so it is smoother, easier to reason about.
This avoids discontinuities such as neighboring machine sizes producing
unexpectedly different job counts, and behaves more predictably on asymmetric
machines where CPU and RAM do not scale together.
Compared to the current threshold-based version, this approach:
- avoids hard jumps around memory cutoffs
- avoids bucketed debug scaling based on CPU count
- keeps CPU and memory as separate constraints and combines them in one place
- avoids double-penalizing debug mode
- is easier to tune later by adjusting a few constants instead of rewriting branching logic
Closesscylladb/scylladb#28904
get_audit_partitions_for_operation() returns None when no audit log
rows are found. In _test_insert_failure_doesnt_report_success_assign_nodes,
this None is passed to set(), causing TypeError: 'NoneType' object is
not iterable.
The audit log entry may not yet be visible immediately after executing
the INSERT, so use wait_for() from test.pylib.util with exponential
backoff to poll until the entry appears. Import it as wait_for_async
to avoid shadowing the existing wait_for from test.cluster.dtest.dtest_class,
which has a different signature (timeout vs deadline).
Fixes SCYLLADB-1330
Closesscylladb/scylladb#29289
implement tablet migration for logstor tables by streaming segments
using stream_blob, similar to file streaming of sstables.
take a snapshot of the logstor segments and create a stream_blob_info
vector with entry for each segment with the input stream that reads the
segment and an op of type file_ops::stream_logstor_segments.
the stream_blob_handler creates a logstor sink that allocates a segment
on the target shard and creates an output stream that writes to it. when
the sink is closed it loads the segment.
add the function table::take_logstor_snapshot that is similar to
take_storage_snapshot for sstables.
given a token range, for each storage group in the range, it flushes the
separator buffers and then makes a snapshot of all segments in the sg's
compaction groups while disabling compaction.
the segment snapshot holds a reference to the segment so that it won't
be freed by compaction, and it provides an input stream for reading the
segment.
this will be used for tablet migration to stream the segments.
add functions for creating segment input and output streams, that will
be used for segment streaming.
the segment input stream creates a file input stream that reads a given
segment.
the segment output stream allocates a new local segment and creates an
output stream that writes to the segment, and when closed it loads the
segment and adds it to the compaction group.
implement compaction group cleanup by clearing the range in the index
and discarding the segments of the compaction group.
segments are discarded by overwriting the segment header to indicate the
segment is empty while preserving the segment generation number in order
to not resurrect old data in the segment.
implement tablet split for logstor.
flush the separator and then perform split as a new type of compaction:
take a batch of segments from the source compaction group, read them and
write all live records into left/right write buffers according to the
split classifier, flush them to the compaction group, and free the old
segments. segments that fit in a single target compaction group are
removed from the source and added to the correct target group.
implement tablet merge with logstor.
disable compaction for the new compaction group, then merge the merging
compaction groups by merging their logstor segments set into the new cg
- simply merging the segment histogram.
add a function that stops and disabled compaction for a compaction group
and returns a compaction reenabler object, similarly to the normal
compaction manager.
this will be useful for disabling compaction while doing operations on
the compaction group's logstor segment set.
we have two types of segments. the active segment is "mixed" because we
can write to it multiple write_buffers, each write buffer having records
from different tables and tablets. in constrast, the separator and
compaction write "full" segments - they write a single write_buffer that
has records from a single tablet and storage group.
for "full" segments, we add a segment header the contains additional
useful metadata such as the table and token range in the segment.
the write buffer header contains the type of the buffer, mixed or full.
if it's full then it has a segment header placed after the write buffer
header.
previously when writing to the active segment, the allocation was
serialized but multiple writes could proceed concurrently to different
offsets. change it instead to serialize the entire write.
we prefer to write larger buffers sequentially instead of multiple
buffers concurrently. it is also better that we don't have "holes" in
the segment.
we also change the buffered_writer to send a single flushing buffer at a
time. it has a ring of buffers, new writes are written to the head
buffer, and a single consumer flushes the tail buffer.
extend compaction_group functions such as disk size calculation and
empty() to account also for the logstor segments that the compaction
group owns.
reuse the sstable_add_gate when there is a write in process to a
compaction group, in order for the compaction group to be considered not
empty.
add the function table::compaction_group_for_logstor_segment that we use
when recovering a segment to find the compaction group for a segment
based on its token range, similarly to compaction_group_for_sstable for
sstables.
extract the common logic from compaction_group_for_sstable to a common
function compaction_group_for_token_range that finds a compaction group
for a token range.
Switch _promoted_indexes storage in partition_index_page from
managed_vector to chunked_managed_vector to avoid large contiguous
allocations.
Avoid allocation failure (or crashes with --abort-on-internal-error)
when large partitions have enough promoted index entries to trigger a
large allocation with managed_vector.
Fixes: SCYLLADB-1315
Closesscylladb/scylladb#29283
Remove the unused timeout field from fuzzy_test_config. It was
declared, initialized per build mode, and logged, but never actually
enforced anywhere.
Document the intentionally small max_size (1024 bytes) passed to
read_partitions_with_paged_scan in run_fuzzy_test_scan: it forces
many pages per scan to stress the paging and result-merging logic.
The multishard_query_test/fuzzy_test was timing out (SIGKILL after
15 minutes) in release mode CI.
In release mode the test generates up to 64 partitions with up to
1000 clustering rows and 1000 range tombstones each. With deeply
nested randomly-generated types (e.g. frozen<map<varint,
frozen<map<frozen<tuple<...>>>>>>), this volume of data can exceed
the 15-minute CI timeout.
Reduce the release-mode clustering-row and range-tombstone
distributions from 0-1000 to 0-200. This caps the worst case at
~12,800 rows -- still 2x the devel-mode maximum (0-100) and
sufficient to exercise multi-partition paged scanning with many
pages.
Fixes: SCYLLADB-1270
The test was flaky because it stopped dc2_node immediately after an
LWT write, before cross-DC replication could complete. The LWT commit
uses LOCAL_QUORUM, which only guarantees persistence in the
coordinator's DC. Replication to the remote DC is async background
work, and CAS mutations don't store hints. Stopping dc2_node could
drop in-flight RPCs, leaving DC1 without the mutation.
Fix by polling both live DC1 nodes after the write to confirm
cross-DC replication completed before stopping dc2_node. Both nodes
must have the data so that the later ConsistentRead=True
(LOCAL_QUORUM) read on restarted node1 is guaranteed to succeed.
Fixes SCYLLADB-1267
Closesscylladb/scylladb#29287
The approach taken in 1ae2ae50a6 turned
out to be incorrect. The Raft member requesting a read barrier could
incorrectly advance its commit_idx and break linearizability. We revert that
commit in this PR.
We also remake the read barrier optimization with a completely new approach.
We make the leader replicate to the non-voting requester of a read barrier if
its `commit_idx` is behind.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-998
No backport: the issue is present only in master.
Closesscylladb/scylladb#29216
* github.com:scylladb/scylladb:
raft: speed up read barrier requested by non-voters
Revert "raft: read_barrier: update local commit_idx to read_idx when it's safe"
Capture the memory permit in the leave lambda's .finally()
continuation so that the semaphore units are kept alive until
write_response finishes, preventing premature release of
memory accounting.
This is especially important with slow network and big responses
when buffers can accumulate and deplete node's memory.
Fix two independent race conditions in the syslog audit test that cause intermittent `assert 2 <= 1` failures in `assert_entries_were_added`.
**Datagram ordering race:**
`UnixSockerListener` used `ThreadingUnixDatagramServer`, where each datagram spawns a new thread. The notification barrier in `get_lines()` assumes FIFO handling, but the notification thread can win the lock before an audit entry thread, so `clear_audit_logs()` misses entries that arrive moments later. Fix: switch to sequential `UnixDatagramServer`.
**Config reload race:**
The live-update path used `wait_for_config` (REST API poll on shard 0) which can return before `broadcast_to_all_shards()` completes. Fix: wait for `"completed re-reading configuration file"` in the server log after each SIGHUP, which guarantees all shards have the new config.
Fixes SCYLLADB-1277
This is CI improvement for the latest code. No need for backport.
Closesscylladb/scylladb#29282
* github.com:scylladb/scylladb:
test: cluster: wait for full config reload in audit live-update path
test: cluster: fix syslog listener datagram ordering race
`test_limit_concurrent_requests` could create far more tables than intended
because worker threads looped indefinitely and only the probe path terminated
the test. In practice, workers often hit `RequestLimitExceeded` first, but the
test kept running and creating tables, increasing memory pressure and causing
flakiness due to bad_alloc errors in logs.
Fix by replacing the old probe-driven termination with worker-driven
termination. Workers now run until any worker sees
`RequestLimitExceeded`.
Fixes SCYLLADB-1181
Closesscylladb/scylladb#29270
A joining node hung forever if the topology coordinator added it to the
group 0 configuration before the node reached `post_server_start`. In
that case, `server->get_configuration().contains(my_id)` returned true
and the node broke out of the join loop early, skipping
`post_server_start`. `_join_node_group0_started` was therefore never set,
so the node's `join_node_response` RPC handler blocked indefinitely.
Meanwhile the topology coordinator's `respond_to_joining_node` call
(which has no timeout) hung forever waiting for the reply that never came.
Fix by only taking the early-break path when not starting as a follower
(i.e. when the node is the discovery leader or is restarting). A joining
node must always reach `post_server_start`.
We also provide a regression test. It takes 6s in dev mode.
Fixes SCYLLADB-959
Closesscylladb/scylladb#29266
After obtaining the CQL response, check if its actual size exceeds
the initially acquired memory permit. If so, take semaphore units
and adopt them into the permit (non blocking).
This doesn't fully prevent from allocating too much memory as
size is known when buffer is already allocated but improves
memory accounting for big responses.
Add test_hint_to_leaving_when_reducing_rf which verifies that mutations
stored as hints are delivered to the correct replicas when a tablet is
removed due to RF reduction. The test sets up a 3-node cluster with
RF=2, drops the hint for one replica via error injection, then reduces
RF to 1 while hints are pending. It asserts that the mutation is
readable after the topology change completes.
Also adds a "drop_hint_for_host" error injection point in
hint_endpoint_manager to selectively drop hints for a specific host.
_apply_config_to_running_servers used wait_for_config (REST API poll)
to confirm live config updates. The REST API reads from shard 0 only,
so it can return before broadcast_to_all_shards() completes — other
shards may still have stale audit config, generating unexpected entries.
Additionally, server_remove_config_option for absent keys sent separate
SIGHUPs before server_update_config, and the single wait_for_config at
the end could match a completion from an earlier SIGHUP.
Wait for "completed re-reading configuration file" in the server log
after each SIGHUP-producing operation. This message is logged only
after both read_config() and broadcast_to_all_shards() finish,
guaranteeing all shards have the new config. Each operation gets its
own mark+wait so no stale completion is matched.
Fixes SCYLLADB-1277
UnixSockerListener used ThreadingUnixDatagramServer, which spawns a
new thread per datagram. The notification barrier in get_lines() relies
on all prior datagrams being handled before the notification. With
threading, the notification handler can win the lock before an audit
entry handler, so get_lines() returns before the entry is appended.
clear_audit_logs() then clears an incomplete buffer, and the late
entry leaks into the next test's before/after diff.
Switch to sequential UnixDatagramServer. The server thread now handles
datagrams in kernel FIFO order, so the notification is always processed
after all preceding audit entries.
Refs SCYLLADB-1277
The `DESC INDEX` command returned incorrect results for local vector
indexes and for vector indexes that included filtering columns.
This patch corrects the implementation to ensure `DESCRIBE INDEX`
accurately reflects the index configuration.
This was a pre-existing issue, not a regression from recent
serialization schema changes for vector index target options.
Queries against local vector indexes were failing with the error:
"ANN ordering by vector requires the column to be indexed using 'vector_index'"
This was a regression introduced by 15788c3734, which incorrectly
assumed the first column in the targets list is always the vector column.
For local vector indexes, the first column is the partition key, causing
the failure.
Previously, serialization logic for the target index option was shared
between vector and secondary indexes. This is no longer viable due to
the introduction of local vector indexes and vector indexes with filtering
columns, which have different target format.
This commit introduces a dedicated JSON-based serialization format for
vector index targets, identifying the target column (tc), filtering
columns (fc), and partition key columns (pk). This ensures unambiguous
serialization and deserialization for all vector index types.
This change is backward compatible for regular vector indexes. However,
it breaks compatibility for local vector indexes and vector indexes with
filtering columns created in version 2026.1.0. To mitigate this, usage
of these specific index types will be blocked in the 2026.1.0 release
by failing ANN queries against them in vector-store service.
Fixes: SCYLLADB-895
This test ensures that the serialization format for vector index target
options remains stable. Maintaining backward compatibility is critical
because the index is restored from this property on startup.
Any unintended changes to the serialization schema could break existing
indexes after an upgrade.
This option is also an interface for the vector-store service,
which uses it to identify the indexed column.
Target option serialization must remain stable for backward compatibility.
The index is restored from this property on startup, so unintentional
changes to the serialization schema can break indexes after upgrade.
We achieve this by making the leader replicate to the non-voting requester
of a read barrier if its commit_idx is behind.
There are some corner cases where the new `replicate_to(*opt_progress, true);`
call will be a no-op, while the corresponding call in `tick_leader()` would
result in sending the AppendEntries RPC to the follower. These cases are:
- `progress.state == follower_progress::state::PROBE && progress.probe_sent`,
- `progress.state == follower_progress::state::PIPELINE
&& progress.in_flight == follower_progress::max_in_flight`.
We could try to improve the optimization by including some of the cases above,
but it would only complicate the code without noticeable benefits (at least
for group0).
Note: this is the second attempt for this optimization. The first approach
turned out to be incorrect and was reverted in the previous commit. The
performance improvement is the same as in the previous case.
hint_sender decides whether to send a hint directly to its destination
or to re-mutate from scratch based on token_metadata::is_leaving(),
which only checks whether the *host* is leaving the cluster. When a
tablet is dropped from a host due to RF reduction (RF--), the host
is still alive and is_leaving() returns false, so hint_sender sends
directly to a replica that will no longer own the data -- effectively
losing the hint.
Switch to the new ermp->is_leaving(host, token) which is tablet-aware.
When the destination's tablet is being migrated away *and* there are
pending endpoints, send directly (the pending endpoints will receive
the data via streaming); otherwise fall through to the re-mutate path
so all current replicas receive the mutation.
token_metadata::is_leaving() only knows whether a *host* is leaving the
cluster, which is insufficient for tablets -- a tablet can be migrated
away from a host (e.g. during RF reduction) without the host itself
leaving.
Add a pure virtual is_leaving(host, token) to effective_replication_map
so callers can ask per-token questions. The vnode implementation
delegates to token_metadata::is_leaving() (host-level, as before). The
tablet implementation checks whether the tablet owning the token has a
transition whose leaving replica matches the given host.
Use get_cql_exclusive(node1) so the driver only connects to node1 and
never attempts to contact the stopped node2. The test was flaky because
the driver received `Host has been marked down or removed` from node2.
Fixes: SCYLLADB-1227
Closesscylladb/scylladb#29268
The test was using time.sleep(1) (a blocking call) to wait after
scheduling the stop_compaction task, intending to let it register on
the server before releasing the sstable_cleanup_wait injection point.
However, time.sleep() blocks the asyncio event loop entirely, so the
asyncio.create_task(stop_compaction) task never gets to run during the
sleep. After the sleep, the directly-awaited message_injection() runs
first, releasing the injection point before stop_compaction is even
sent. By the time stop_compaction reaches Scylla, the cleanup has
already completed successfully -- no exception is raised and the test
fails.
Fix by replacing time.sleep(1) with await asyncio.sleep(1), which
yields control to the event loop and allows the stop_compaction task
to actually send its HTTP request before message_injection is called.
Fixes: SCYLLADB-834
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29202
The vnodes-to-tablets migration is a manual procedure, so instructions
need to be provided to the users.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Now that cmake/FindLua.cmake uses pkg-config (matching configure.py),
both build systems resolve to the same 'lua' library name. Remove the
lua/lua-5.4 entries from _KNOWN_LIB_ASYMMETRIES and add 'm' (math
library) as a known transitive dependency that configure.py gets via
pkg-config for lua.
CMake's built-in FindLua resolves to the versioned library file
(e.g. liblua-5.4.so) instead of the unversioned symlink (liblua.so),
causing a library name mismatch between the two build systems.
Add a custom cmake/FindLua.cmake that uses pkg-config — matching
configure.py's approach — and find_library(NAMES lua) to find the
unversioned symlink. This also mirrors the pattern used by other
Find modules in cmake/ (FindxxHash, Findlz4, etc.).
Add symmetric_key_test (standalone, links encryption library) and
auth_cache_test to the combined_tests binary. These tests already
exist in configure.py; this aligns the CMake build.
The per-test -fno-lto link option is now redundant since -fno-lto
was added globally in mode.common.cmake. LTO-enabled targets
(the scylla binary in RelWithDebInfo) override it via enable_lto().
Match configure.py's Boost handling:
- Add BOOST_ALL_DYN_LINK when using shared Boost libraries.
- Strip per-component defines (BOOST_UNIT_TEST_FRAMEWORK_DYN_LINK,
BOOST_REGEX_DYN_LINK, etc.) that CMake's Boost package config
adds on imported targets. configure.py only uses the umbrella
BOOST_ALL_DYN_LINK define.
Place add_compile_definitions(SEASTAR_TESTING_MAIN) after both
add_subdirectory(seastar) and add_subdirectory(abseil) are processed.
This matches configure.py's global define without leaking into
seastar's subdirectory build (which would cause a duplicate main
symbol in seastar_testing).
Remove the now-redundant per-test SEASTAR_TESTING_MAIN compile
definition from test/CMakeLists.txt.
Match configure.py line 2192: abseil gets sanitizer flags with
-fno-sanitize=vptr to exclude vptr checks which are incompatible
with abseil's usage of type-punning patterns.
- Set BUILD_SHARED_LIBS based on build type to match configure.py's
build_seastar_shared_libs: Debug and Dev build Seastar as a shared
library, all other modes build it static.
- Add sanitizer link options on the seastar target for Coverage
mode. Seastar's CMake only activates sanitizer targets for
Debug/Sanitize configs, but Coverage mode needs them too since
configure.py's seastar_libs_coverage carries -fsanitize flags.
- Disable CMake's automatic -fcolor-diagnostics injection for
Clang+Ninja (CMake 3.24+), matching configure.py which does not
add any color diagnostics flags.
- Add SEASTAR_NO_EXCEPTION_HACK and XXH_PRIVATE_API as global
defines (previously SEASTAR_NO_EXCEPTION_HACK was only on the
seastar target as PRIVATE; it needs to be project-wide).
- Add -fpch-validate-input-files-content to check precompiled
header content when timestamps don't match.
Fix multiple deviations from configure.py's coverage mode:
- Remove -fprofile-list from CMAKE_CXX_FLAGS_COVERAGE. That flag
belongs in COVERAGE_INST_FLAGS applied to other modes, not to
coverage mode itself.
- Replace incorrect defines (DEBUG, SANITIZE, DEBUG_LSA_SANITIZER,
SCYLLA_ENABLE_ERROR_INJECTION) with the correct Seastar debug
defines (SEASTAR_DEBUG, SEASTAR_DEFAULT_ALLOCATOR, etc.) that
configure.py's pkg-config query produces for coverage mode.
- Add sanitizer and stack-clash-protection compile flags for
Coverage config, matching the flags that Seastar's pkg-config
--cflags output includes for debug builds.
- Change CMAKE_STATIC_LINKER_FLAGS_COVERAGE to
CMAKE_EXE_LINKER_FLAGS_COVERAGE. Coverage flags need to reach
the executable linker, not the static archiver.
Add three flag-alignment changes:
- -Wno-error=stack-usage= alongside the stack-usage threshold flag,
preventing hard errors from stack-usage warnings (matching
configure.py behavior).
- -fno-lto global link option. configure.py adds -fno-lto to all
binaries; LTO-enabled targets override it via enable_lto().
- Sanitizer link flags (-fsanitize=address, -fsanitize=undefined) for
Debug/Sanitize configs, matching configure.py's cxx_ld_flags.
Document the purpose, usage, and examples for
scripts/compare_build_systems.py which compares the configure.py
and CMake build systems by parsing their ninja build files.
Add a script that compares configure.py and CMake build systems by
parsing their generated build.ninja files. The script checks:
- Per-file compilation flags (defines, warnings, optimization)
- Link target sets (detect missing/extra targets)
- Per-target linker flags and libraries
configure.py is treated as the baseline. CMake should match it.
Both systems are always configured into a temporary directory so the
user's build tree is never touched.
Usage:
scripts/compare_build_systems.py -m dev # single mode
scripts/compare_build_systems.py # all modes
scripts/compare_build_systems.py --ci # CI mode (strict)
Fix the ordering of the concurrency limit check in the Alternator HTTP server so it happens before memory acquisition, and reduce test pressure to avoid LSA exhaustion on the memory-constrained test node.
The patch moves the concurrency check to right after the content-length early-out, before any memory acquisition or I/O. The check was originally placed before memory acquisition but was inadvertently moved after it during a refactoring. This allowed unlimited requests to pile up consuming memory, reading bodies, verifying signatures, and decompressing — all before being rejected. Restores the original ordering and mirrors the CQL transport (`transport/server.cc`).
Lowers `concurrent_requests_limit` from 5 to 3 and the thread multiplier from 5 to 2 (6 threads instead of 25). This is still sufficient to reliably trigger RequestLimitExceeded, while keeping flush pressure within what 512MB per shard can sustain.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1248
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1181
The test started to fail quite recently. It affects master only. No backport is needed. We might want to consider backporting a commit moving the concurrency check earlier.
Closesscylladb/scylladb#29272
* github.com:scylladb/scylladb:
test: reduce concurrent-request-limit test pressure to avoid LSA exhaustion
alternator: check concurrency limit before memory acquisition
The test_limit_concurrent_requests dtest uses concurrent CreateTable
requests to verify Alternator's concurrency limiting. Each admitted
CreateTable triggers Raft consensus, schema mutations, and memtable
flushes—all of which consume LSA memory. On the 1 GB test node
(2 SMP × 512 MB), the original settings (limit=5, 25 threads) created
enough flush pressure to exhaust the LSA emergency reserve, producing
logalloc::bad_alloc errors in the node log. The test was always
marginal under these settings and became flaky as new system tables
increased baseline LSA usage over time.
Lower concurrent_requests_limit from 5 to 3 and the thread multiplier
from 5 to 2 (6 threads total). This is still well above the limit and
sufficient to reliably trigger RequestLimitExceeded, while keeping flush
pressure within what 512 MB per shard can sustain.
The concurrency limit check in the Alternator server was positioned after
memory acquisition (get_units), request body reading (read_entire_stream),
signature verification, and decompression. This allowed unlimited requests
to pile up consuming memory before being rejected, exhausting LSA memory
and causing logalloc::bad_alloc errors that cascade into Raft applier
and topology coordinator failures, breaking subsequent operations.
Without this fix, test_limit_concurrent_requests on a 1GB node produces
50 logalloc::bad_alloc errors and cascading failures: reads from
system.scylla_local fail, the Raft applier fiber stops, the topology
coordinator stops, and all subsequent CreateTable operations fail with
InternalServerError (500). With this fix, the cascade is eliminated --
admitted requests may still cause LSA pressure on a memory-constrained
node, but the server remains functional.
Move the concurrency check to right after the content-length early-out,
before any memory acquisition or I/O. This mirrors the CQL transport
which correctly checks concurrency before memory acquisition
(transport/server.cc).
The concurrency check was originally added in 1b8c946ad7 (Sep 2020)
*before* memory acquisition, which at the time lived inside with_gate
(after the concurrency gate). The ordering was inverted by f41dac2a3a
(Mar 2021, "avoid large contiguous allocation for request body"), which
moved get_units() earlier in the function to reserve memory before
reading the newly-introduced content stream -- but inadvertently also
moved it before the concurrency check. c3593462a4 (Mar 2025) further
worsened the situation by adding a 16MB fallback reservation for
requests without Content-Length and ungzip/deflate decompression steps
-- all before the concurrency check -- greatly increasing the memory
consumed by requests that would ultimately be rejected.
RF change of tablet keyspace starts tablet rebuilds. Even if any of
the rebuilds is rolled back (because pending replica was excluded),
rf change request finishes successfully. Yet, we are left with not
enough replicas. Then, a next new rf change request handler would
generate a rebuild of two replicas of the same tablet. Such a transition
would not be applied, as we don't allow many pending replicas.
An exception would be thrown and the request would be retried infinitely,
blocking the topology coordinator.
Throw and fail rf change request if there is not enough replicas.
The request should be retried later, after the issue is fixed
by the mechanism introduced in previous changes.
RF change of tablet keyspace starts tablet rebuilds. Even if any
of the rebuilds is rolled back (because pending replica was excluded),
rf change request finishes successfully. In this case we end up with
the state of the replicas that isn't compatible with the expected
keyspace replication.
After this change, if topology_coordinator has nothing to do, it
proceeds to check if the state of replicas reflects the keyspace
replication. If there are any mismatches, the tablet rebuilds are
scheduled. All required rebuilds of a single keyspace are scheduled
together without respecting the node's load (just as it happens
in case of keyspace rf change).
maybe_start_tablet_migration takes an ownership of group0_guard and
does not give it back, even if no work was done.
In the following patches, we will proceed with different operations,
if there are no migrations to be started. Thus, the guard would be needed.
Return group0_guard from maybe_start_tablet_migration is no work
was done.
**The Bug**
Assertion failure: `SCYLLA_ASSERT(res.second)` in `raft/server.cc`
when creating a snapshot transfer for a destination that already had a
stale in-flight transfer.
**Root Cause**
If a node loses leadership and later becomes leader again before the next
`io_fiber` iteration, the old transfer from the previous term can remain
in `_snapshot_transfers` while `become_leader()` resets progress state.
When the new term emits `install_snapshot(dst)`, `send_snapshot(dst)`
tries to create a new entry for the same destination and can hit the
assertion.
**The Fix**
Abort all in-flight snapshot transfers in `process_fsm_output()` when
`term_and_vote` is persisted. A term/vote change marks existing transfers
as stale, so we clean them up before dispatching messages from that batch
and before any new snapshot transfer is started.
With cross-term cleanup moved to the term-change path, `send_snapshot()`
now asserts the within-term invariant that there is at most one in-flight
transfer per destination.
Fixes: SCYLLADB-862
Backport: The issue is reproducible in master, but is present in all
active branches.
Closesscylladb/scylladb#29092
This reverts commit c30607d80b.
With the default configuration, enabling DDL has no effect because
no `audit_keyspaces` or `audit_tables` are specified. Including DDL
in the default categories can be misleading for some customers, and
ideally we would like to avoid it.
However, DDL has been one of the default audit categories for years,
and removing it risks silently breaking existing deployments that
depend on it. Therefore, the recent change to disable DDL by default
is reverted.
Fixes: SCYLLADB-1155
Closesscylladb/scylladb#29169
test_reboot uses a custom restart function that SIGKILLs and restarts
nodes sequentially. After all nodes are back up, the test proceeded
directly to reads after wait_for_cql_and_get_hosts(), which only
confirms CQL reachability.
While a node is restarted, other nodes might execute global token
metadata barriers, which advance the topology fence version. The
restarted node has to learn about the new version before it can send
reads/writes to the other nodes. The test issues reads as soon as the
CQL port is opened, which might happen before the last restarted node
learns of the latest topology version. If this node acts as a
coordinator for reads/write before this happens, these will fail as the
other nodes will reject the ops with the outdated topology fence
version.
Fix this by replacing wait_for_cql_and_get_hosts() on the abrupt-restart
path with the more robus get_ready_cql(), which makes sure servers see
each other before refreshing the cql connection. This should ensure that
nodes have exchanged gossip and converged on topology state before any
reads are executed. The rolling_restart() path is unaffected as it
handles this internally.
Fixes: SCYLLADB-557
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29211
`test_crashed_node_substitution` intermittently failed:
```python
assert len(gossiper_eps) == (len(server_eps) + 1)
```
The test crashed the node right after a single ACK2 handshake (`finished do_send_ack2_msg`), assuming the node state was visible to all peers. However, since gossip is eventually consistent, the update may not have propagated yet, so some nodes did not see the failed node.
This change: Wait until the gossiper state is visible on peers before continuing the test and asserting.
Fixes: [SCYLLADB-1256](https://scylladb.atlassian.net/browse/SCYLLADB-1256).
backport: this issue may affect CI for all branches, so should be backported to all versions.
[SCYLLADB-1256]: https://scylladb.atlassian.net/browse/SCYLLADB-1256?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#29254
* github.com:scylladb/scylladb:
test: test_crashed_node_substitution: add docstring and fix whitespace
test: fix race condition in test_crashed_node_substitution
If lwt_workload() sends an update immediately after a
rolling restart, the coordinator might still see a replica as
down due to gossip lagging behind. Concurrently restarting another
node leaves only one available replica, failing the
LOCAL_QUORUM requirement for learn or eventually consistent
sp::query() in sp::cas() and resulting in
a mutation_write_failure_exception.
We fix this problem by waiting for the restarted server
to see 2 other peers. The server_change_version
doesn't do that by default -- it passes
wait_others=0 to server_start().
Fixes SCYLLADB-1136
Closesscylladb/scylladb#29234
`test_crashed_node_substitution` intermittently failed:
```
assert len(gossiper_eps) == (len(server_eps) + 1)
```
The test crashed the node right after a single ACK2 handshake
("finished do_send_ack2_msg"), assuming the node state was
visible to all peers. However, since gossip is eventually
consistent, the update may not have propagated yet, so some
nodes did not see the failed node.
This change: Wait until the gossiper state is visible on
peers before continuing the test and asserting.
Fixes: SCYLLADB-1256.
SSTable unlinking is async, so in some cases it may happen that
the upload dir is not empty immediately after refresh is done.
This patch adjusts test_refresh_deletes_uploaded_sstables so
it waits with a timeout till the upload dir becomes empty
instead of just assuming the API will sync on sstables being
gone.
Fixes SCYLLADB-1190
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#29215
This test runs the vnodes-to-tablets migration for a single table on a
single-node cluster. The node has multiple shards and multiple
power-of-two aligned vnodes, so resharding is triggered.
More details in the docstring.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The vnodes-to-tablets migration is a manual procedure, so orchestration
must be done via nodetool.
This patch adds the following new commands:
* nodetool migrate-to-tablets start {ks}
* nodetool migrate-to-tablets upgrade
* nodetool migrate-to-tablets downgrade
* nodetool migrate-to-tablets status {ks}
* nodetool migrate-to-tablets finalize {ks}
The commands are just wrappers over the REST API.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
If the keyspace is migrating, it reports the intended and actual storage
mode for each node.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Remove unused `pytest.mark.single_node` marker from `TestCQLAudit`.
Rename `TestCQLAudit` to `CQLAuditTester` to reflect that it is a test helper, not a test class. This avoids accidental pytest collection and subsequent warning about `__init__`.
Logs before the fixes:
```
test/cluster/test_audit.py:514: 14 warnings
/home/dario/dev/scylladb/test/cluster/test_audit.py:514: PytestCollectionWarning: cannot collect test class 'TestCQLAudit' because it has a __init__ constructor (from: cluster/test_audit.py)
@pytest.mark.single_node
```
Fixes SCYLLADB-1237
This is an addition to the latest master code. No backport needed.
Closesscylladb/scylladb#29237
* github.com:scylladb/scylladb:
test: audit: rename TestCQLAudit to CQLAuditTester
test: audit: remove unused pytest.mark.single_node
Improve test comments for test_streams_batchwrite_into_the_same_partition_deletes_existing_items
and test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data to explain why
'always' write isolation mode is required: in always_use_lwt mode all items in a batch get the
same CDC timestamp, which triggers the squashing bug. In other modes each item gets a separate
timestamp so the bug doesn't manifest.
Also fix the example in the second test comment to use cleaner key values and correct event type
(INSERT, not MODIFY, since items are inserted into an empty table), and fix the issue reference
from #28452 (the PR) to #28439 (the issue).
pytest tries to collect tests for execution in several ways.
One is to pick all classes that start with 'Test'. Those classes
must not have custom '__init__' constructor. TestCQLAudit does.
TestCQLAudit after migration from test/cluster/dtest is not a test
class anymore, but rather a helper class. There are two ways to fix
this:
1. Add __init__ = False to the TestCQLAudit class
2. Rename it to not start with 'Test'
Option 2 feels better because the new name itself does not convey
the wrong message about its role.
Fixes SCYLLADB-1237
Remove unused pytest.mark.single_node in TestCQLAudit class.
This is a leftover from audit tests migration from
test/cluster/dtest to test/cluster.
Refs SCYLLADB-1237
Add scylla_only fixture dependency to the
test_table_ss_new_and_old_images_write_isolation_always fixture.
This ensures all tests using the 'always' write isolation mode
are skipped when running against DynamoDB (--aws), since the
system:write_isolation tag is a Scylla-only feature.
BatchWriteItem with items for the same partition (and write isolation
set to always) will trigger LWT and run different cdc code path, which
will result in wrong Streams data being returned to the user -
changes will be randomly squashed together.
For example batch write:
batch.put_item(Item={'p': 'p', 'c': 'c0'})
batch.put_item(Item={'p': 'p', 'c': 'c1'})
batch.put_item(Item={'p': 'p', 'c': 'c2'})
instead of producing 3 modify / insert events will produce one:
type=INSERT, key={'c': {'S': 'c0'}, 'p': {'S': 'p'}},
old_image=None, new_image={'c': {'S': 'c2'}, 'p': {'S': 'p'}}
with `new_image` having different `c` key from `key` field.
This happens because BatchWriteItem (when using LWT) emits it's changes
to cdc under the same timestamp. This results in in all log entries
being put in single cdc "bucket" (under the same cdc$timestamp key).
Previous parsing algorithm would interpret those changes as a change
to a single item and squash them together.
The patch rewrites algorithm to use `std::unordered_map` for records
based on value of clustering key, that is added to every cdc log entry.
This allows rebuilding all item modifications.
Fixes#28439
Fixes: SCYLLADB-540
Add additional BatchWriteItem tests (some failing):
- `test_streams_batchwrite_no_clustering_deletes_non_existing_items`
`test_streams_batchwrite_no_clustering_deletes_existing_items` -
those tests pass, we add it here for completness, as non clustering
tables trigger different paths.
- `test_streams_batchwrite_into_the_same_partition_deletes_existing_items` -
failing test, that checks combinations of puts and deletes in a single
batch write (so for example 3 items, 2 puts and 1 delete).
- `test_streams_batchwrite_into_the_same_partition_will_report_wrong_stream_data` -
failing simple test.
Tests fail, because current implementation, when writing cdc log
entries will squash all changes done to the same partition together.
The data is still there, but when GetRecords is called and we parse
cdc log entries, we don't correctly recover it (see issue #28439 for
more details).
Previously, the result of when_all was discarded. when_all stores
exceptions in the returned futures rather than throwing, so the outer
catch(in_use&) could never trigger. Now we capture the when_all result
and inspect each future individually to properly detect in_use from
either stream.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1216Closesscylladb/scylladb#29219
Change wait_for() defaults from period=1s/no backoff to period=0.1s
with 1.5x backoff capped at 1.0s. This catches fast conditions in
100ms instead of 1000ms, benefiting ~100 call sites automatically.
Add completion logging with elapsed time and iteration count.
Tested local with test/cluster/test_fencing.py::test_fence_hints (dev mode),
log output:
wait_for(at_least_one_hint_failed) completed in 0.83s (4 iterations)
wait_for(exactly_one_hint_sent) completed in 1.34s (5 iterations)
Fixes SCYLLADB-738
Closesscylladb/scylladb#29173
The test was flaky. The scenario looked like this:
1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.
It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.
Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.
The new scenario looks like this:
1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.
This will make the test perform consistently.
---
I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.
Before:
```
real 0m47.570s
user 0m41.631s
sys 0m8.634s
real 0m50.495s
user 0m42.499s
sys 0m8.607s
real 0m50.375s
user 0m41.832s
sys 0m8.789s
```
After:
```
real 0m50.509s
user 0m43.535s
sys 0m9.715s
real 0m50.857s
user 0m44.185s
sys 0m9.811s
real 0m50.873s
user 0m44.289s
sys 0m9.737s
```
Fixes SCYLLADB-1137
Backport: The test is present on all supported branches, and so we
should backport these changes to them.
Closesscylladb/scylladb#29218
* github.com:scylladb/scylladb:
test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces
test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py
This PR contains two small improvements to `test_incremental_repair.py`
motivated by the sporadic failure of
`test_tablet_incremental_repair_and_scrubsstables_abort`.
The test fails with `assert 3 == 2` on `len(sst_add)` in the second
repair round. The extra SSTable has `repaired_at=0`, meaning scrub
unexpectedly produced more unrepaired SSTables than anticipated. Since
scrub (and compaction in general) logs at DEBUG level and the test did
not enable debug logging, the existing logs do not contain enough
information to determine the root cause.
**Commit 1** fixes a long-standing typo in the helper function name
(`preapre` -> `prepare`).
**Commit 2** enables `compaction=debug` for the Scylla nodes started by
`do_tablet_incremental_repair_and_ops`, which covers all
`test_tablet_incremental_repair_and_*` variants. This will capture full
compaction/scrub activity on the next reproduction, making the failure
diagnosable.
Refs: SCYLLADB-1086
Backport: test improvement, no backport
Closesscylladb/scylladb#29175
* https://github.com/scylladb/scylladb:
test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops
test/cluster/test_incremental_repair.py: fix typo preapre -> prepare
When authenticate() returns a user directly (certificate-based auth,
introduced in 20e9619bb1), process_startup was missing the same
post-authentication bookkeeping that the no-auth and SASL paths perform:
- update_scheduling_group(): without it, the connection runs under the
default scheduling group instead of the one mapped to the user's
service level.
- _authenticating = false / _ready = true: without them,
system.clients reports connection_stage = AUTHENTICATING forever
instead of READY.
- on_connection_ready(): without it, the connection never releases its
slot in the uninitialized-connections concurrency semaphore (acquired
at connection creation), leaking one unit per cert-authenticated
connection for the lifetime of the connection.
The omission was introduced when on_connection_ready() was added to the
else and SASL branches in 474e84199c but the cert-auth branch was missed.
Fixes: 20e9619bb1 ("auth: support certificate-based authentication")
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The cert-auth path in process_startup (introduced in 20e9619bb1) was
missing _ready = true, _authenticating = false, update_scheduling_group()
and on_connection_ready(). The result is that connections authenticated
via certificate show connection_stage = AUTHENTICATING in system.clients
forever, run under the wrong service-level scheduling group, and hold
the uninitialized-connections semaphore slot for the lifetime of the
connection.
Add a parametrized cluster test that verifies all three process_startup
branches result in connection_stage = READY:
- allow_all: AllowAllAuthenticator (no-auth path)
- password: PasswordAuthenticator (SASL/process_auth_response path)
- cert_bypass: CertificateAuthenticator with transport_early_auth_bypass
error injection (cert-auth path -- the buggy one)
The injection is added to certificate_authenticator::authenticate() so
tests can bypass actual TLS certificate parsing while still exercising
the cert-auth code path in process_startup.
The cert_bypass case is marked xfail until the bug is fixed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of always passing sstable_state::normal, infer the state from
the last component of the directory path by comparing against the known
state subdirectory constants (staging_dir, upload_dir, quarantine_dir).
Any unrecognized path component (the common case for normal-state
sstables) maps to sstable_state::normal.
When a non-normal state is detected, strip the state subdirectory from
dir so that the base table directory is passed to storage.
Two issues prevented the precompiled header from compiling
successfully when using CMake directly (rather than the
configure.py + ninja build system):
a) Propagate build flags to Rust binding targets reusing the
PCH. The wasmtime_bindings and inc targets reuse the PCH
from scylla-precompiled-header, which is compiled with
Seastar's flags (including sanitizer flags in
Debug/Sanitize modes). Without matching compile options,
the compiler rejects the PCH due to flag mismatch (e.g.,
-fsanitize=address). Link these targets against
Seastar::seastar to inherit the required compile options.
Closesscylladb/scylladb#28941
The test was flaky. The scenario looked like this:
1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.
It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.
Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.
The new scenario looks like this:
1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.
This will make the test perform consistently.
---
I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.
Before:
```
real 0m47.570s
user 0m41.631s
sys 0m8.634s
real 0m50.495s
user 0m42.499s
sys 0m8.607s
real 0m50.375s
user 0m41.832s
sys 0m8.789s
```
After:
```
real 0m50.509s
user 0m43.535s
sys 0m9.715s
real 0m50.857s
user 0m44.185s
sys 0m9.811s
real 0m50.873s
user 0m44.289s
sys 0m9.737s
```
Fixes SCYLLADB-1137
ERMs created in `calculate_vnode_effective_replication_map` have RF computed based
on the old token metadata during a topology change. The reading replicas, however,
are computed based on the new token metadata (`target_token_metadata`) when
`read_new` is true. That can create a mismatch for EverywhereStrategy during some
topology changes - RF can be equal to the number of reading replicas +-1. During
bootstrap, this can cause the
`everywhere_replication_strategy::sanity_check_read_replicas` check to fail in
debug mode.
We fix the check in this commit by allowing one more reading replica when
`read_new` is true.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1147Closesscylladb/scylladb#29150
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.
To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.
We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.
As a bonus, we improve the test in general too. We explicitly express
the preconditions we rely on, as well as bump the log level. If the
test fails in the future, it might be very difficult do debug it
without this additional information.
Fixes SCYLLADB-1133
Backport: The test is present on all supported branches. To avoid
running into more failures, we should backport these changes
to them.
Closesscylladb/scylladb#29191
* github.com:scylladb/scylladb:
test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints
test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints
test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints
test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints
test: cluster: Introduce auxiliary function keyspace_has_tablets
test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints
The endpoint is the following:
POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}/finalization
When called, it issues a `finalize_migration` topology request and waits
for its completion.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Vnodes-to-tablets migration needs a finalization step to finish or
rollback the migration. Finishing the migration involves switching the
keyspace schema to tablets and clearing the `intended_storage_mode` from
system.topology. Rolling back the migration involves deleting the tablet
maps and clearing the `intended_storage_mode`.
The finalization needs to be done as a topology request to exclude with
other operations such as repair and TRUNCATE.
This patch introduces the `finalize_migration` global topology request
for this purpose. The request takes a keyspace name as an argument.
The direction of the finalization (i.e., forward path vs rollback) is
inferred from the `intended_storage_mode` of all nodes (not ideal,
should be made explicit).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Extend `database::add_column_family()` with a `storage_mode` argument.
If the table is under vnodes-to-tablets migration and the storage mode
is "tablets", create a tablet ERM.
Make the distributed loader determine the storage mode from topology
(`intended_storage_mode` column in system.topology).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The endpoint is the following:
POST /storage_service/vnode_tablet_migrations/node/storage_mode?intended_mode={tablets,vnodes}
This endpoint is part of the vnodes-to-tablets migration process and
controls a node's intended_storage_mode in system.topology. The storage
mode represents the node-local data distribution model, i.e., how data
are organized across shards. The node will apply the intended storage
mode to migrating tables upon next restart by resharding their SSTables
(either on vnode boundaries if intended_mode=tablets, or with the static
sharder if intended_mode=vnodes).
Note that this endpoint controls the intended_storage_mode of the local
node only. This has the nice benefit that once the API call returns, the
change has not only been committed to group0 but also applied to the
local node's state machine. This guarantees that the change is part of
the node's local copy upon next restart; no additional read barrier is
needed.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The endpoint is the following:
POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}
Its purpose is to start the migration of a whole keyspace from vnodes to
tablets.
When called, Scylla will synchronously create a tablet map for each
table in the specified keyspace. The tablet maps of all tables are
identical and they mirror the vnode layout; they contain one tablet per
vnode and each tablet uses the same replica hosts and token boundaries
as the corresponding vnode.
The only difference from vnodes lies in the sharding approach. Tablets
are assigned to a single shard - using a round-robin strategy in this
patch - whereas vnodes are distributed evenly across all shards. If the
tablet count per shard is low and tablet sizes are uneven, or some
shards have more tablets than others, performance may degrade during the
migration process. For example, a cluster with i8g.48xlarge (192 vCPUs),
256 vnodes per node and RF=3 will have 256 * 3 / 192 vCPUs = 4 tablet
replicas per shard during the migration. One additional tablet or a
double-sized tablet would cause 25% overcommit.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Reduce the timeout for one test to 60 minutes. The longest test we had
so far was ~10-15 minutes. So reducing this timeout is pretty safe and
should help with hanging tests.
Closesscylladb/scylladb#29212
Part of the vnodes-to-tablets migration is to reshard the SSTables of
each node on vnode boundaries. Resharding is a heavy operation that
runs on startup while the node is offline. Since nodes can restart
for unexpected reasons, we need a flag to do it in a controllable way.
We also need the ability to roll back the migration, which requires
resharding in the opposite direction. This means a node must be aware of
the intended migration direction.
To address both requirements, this patch introduces a new column,
intended_storage_mode, in system.topology. A non-null value indicates
that a node should perform a migration and specifies the migration
direction.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Make the table populator migration-aware. If a table is migrating to
tablets, switch from normal resharding to vnode-based resharding.
Vnode-based resharding requires passing a vector of "owned ranges" upon
which resharding will segregate the SSTables. Compute it from the tablet
map. We could also compute them from the vnodes, since tablets are
identical to vnodes during the migration, but in the future we may
switch to a different model (multiple tablets per vnode).
Let the distributed loader decide if a table is migrating or not and
communicate that to the table populator. A table is migrating if the
keyspace replication strategy uses vnodes but the table replication
strategy uses tablets.
Currently, tables cannot enter this "migrating" state; support for this
will be introduced in the next patches.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In the previous patch, reshard compaction was extended with a special
operation mode where SSTables from vnode-based tables are segregated on
vnode boundaries and not with the static sharder. This will later be
wired into vnodes-to-tablets migration.
The problem is that resharding requires a compaction group. With a
vnode-based table, there is only one compaction group per shard, and
this is what the current code utilizes
(`try_get_compaction_group_view_with_static_sharding()`). But the new
operation mode will apply to migrating tables, which use a
`tablet_storage_group_manager`, which creates one compaction group for
each tablet. Some compaction group needs to be selected.
Pick any compaction group that is available on the current shard.
Reshard compaction is an operation that happens early in the startup
process; compaction groups do not own any SSTables yet, so all
compaction groups are equivalent.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In this mode, the output sstables generated by resharding
compaction are segregated by token range, based on the keyspace
vnode-based owned token ranges vector.
A basic unit test was also added to sstable_directory_test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a table is migrating from vnodes to tablets, the cluster is in a
mixed state where some nodes use vnode ERMs and others use tablet ERMs.
The ERM flavor is a node-local property that expresses the node's
storage organization.
Preserve the flavor across token metadata changes. The flavor needs to
be on par with storage, but the storage can change only on startup, as
it requires resharding all SSTables to conform with the flavor.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The tablet load balancer operates on all tablet-based tables that appear
in the tablet metadata.
With the introduction of the vnodes-to-tablets migration procedure later
in this series, migrating tables will also appear in the tablet
metadata, but they need to be treated as vnode tables until migration is
finished. This patch excludes such tables from load balancing.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Vnodes-to-tablets migrations require cluster-level support: the REST API
and the group0 state need to be supported by all nodes.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Migrate audit tests from test/cluster/dtest to test/cluster. Optimize their execution time through cluster reuse.
The audit test suite is heavy. There are more than 70 test executions. Environment preparation is a significant part of each test case execution time.
This PR:
1. Copies audit tests from test/cluster/dtest to test/cluster, refactoring and enabling them
2. Groups tests functions by non-live cluster configuration variations to enable cluster reuse between them
- Execution time reduced from 4m 29s to 2m 47s, which is ~38% execution time decrease
3. Removes the old audit tests from test/cluster/dtest
Includes two supporting changes:
- Allow specifying `AuthProvider` in `ManagerClient.get_cql_exclusive`
- Fix server log file handling for clean clusters
Refs [SCYLLADB-573](https://scylladb.atlassian.net/browse/SCYLLADB-573)
This PR is an improvement and does not require a backport.
[SCYLLADB-573]: https://scylladb.atlassian.net/browse/SCYLLADB-573?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#28650
* github.com:scylladb/scylladb:
test: cluster: fix log clear race condition in test_audit.py
test: pylib: shut down exclusive cql connections in ManagerClient
test: cluster: fix multinode audit entry comparison in test_audit.py
test: cluster: dtest: remove old audit tests
test: cluster: group migrated audit tests for cluster reuse
test: cluster: enable migrated audit tests and make them work
test: pylib: manager_client: specify AuthProvider in get_cql_exclusive
test: pylib: scylla cluster after_test log fix
test: audit: copy audit test from dtest
Maintenance socket path used for PGO is in the node workdir.
When the node workdir path is too long, the maintenance socket path
(workdir/cql.m) can exceed the Unix domain socket sun_path limit
and failing the PGO training pipeline.
To prevent this:
- pass an explicit --maintenance-socket override
pointing to a short determinitic path in /tmp derived from the MD5
hash of the workdir maintenance socket path
- update maintenance_socket_path to return the matching short path
so that exec_cql.py connects to the right socket
The short path socket files are cleaned up after the cluster stops.
The path is using MD5 hash of the workdir path, so it is deterministic.
Fixes SCYLLADB-1070
Closesscylladb/scylladb#29149
Commit faa0ee9844 accidentally broke the way split snapshot mutation was
frozen -- instead of appending the sub-mutation `m` the commit kept the
old variable name of `mut` which in the new code corresponds to "old"
non-split mutation
Fixes#29051
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29052
There's a flaw in table::query() -- calling querier_opt->close() can dereferences a disengaged std::optional. The fix pretty simple. Once fixed, there are two if-s checking for querier_opt being engaged or not that are worth being merged.
The problem doesn't really shows itself becase table::query() is not called with null saved_querier, so the de-facto if is always correct. However, better to be on safe-side.
The problem doesn't show itself for real, not worth backporting
Closesscylladb/scylladb#29142
* github.com:scylladb/scylladb:
table: merge adjacent querier_opt checks in query()
table: don't close a disengaged querier in query()
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/147.
To fix the problem, add an explicit `permissions:` block to the workflow
(either at the top level or inside the `trigger-jenkins` job) that
constrains the `GITHUB_TOKEN` to the minimal necessary privileges. This
codifies least-privilege in the workflow itself instead of relying on
repository or organization defaults.
The best minimal, non‑breaking change is to define a root‑level
`permissions:` block with read‑only contents access because the job does
not perform any write operations to the repository, nor does it interact
with issues, pull requests, or other GitHub resources. A conservative,
widely accepted baseline is `contents: read`. If later steps require more
permissions, they can be added explicitly, but for this snippet, no such
need is visible.
Concretely, in `.github/workflows/trigger_jenkins.yaml`, insert:
```yaml
permissions:
contents: read
```
between the `name:` block and the `on:` block (e.g., after line 2).
No additional methods, imports, or definitions are needed since this is
a pure YAML configuration change and does not alter runtime behavior of
the existing shell steps.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27815
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/169.
In general, the fix is to add an explicit `permissions:` block to the
workflow (at the root level or per job) so that the `GITHUB_TOKEN` has
only the minimal scopes needed. Since this job only reads event data and
uses secrets to talk to Jenkins, we can restrict `GITHUB_TOKEN` to
read‑only repository contents.
The single best fix here is to add a top‑level `permissions:` block
right under the `name:` (and before `on:`) in
`.github/workflows/trigger-scylla-ci.yaml`, setting `contents: read`.
This applies to all jobs in the workflow, including `trigger-jenkins`,
and does not alter any existing steps or logic. No additional imports or
methods are needed, as this is purely a YAML configuration change for
GitHub Actions.
Concretely, edit `.github/workflows/trigger-scylla-ci.yaml` to insert:
```yaml
permissions:
contents: read
```
after line 1. No other lines in the file need to change.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27812
We increase the log level of `hints_manager` to TRACE in the test.
If it fails, it may be incredibly difficult to debug it without any
additional information.
The test relies on the assumption that mutations will be distributed
more or less uniformly over the nodes. Although in practice this should
not be possible, theoretically it's possible that there's only one
tablet allocated for the table.
To clearly indicate this precondition, we explicitly set the property
`min_tablet_count` when creating the table. This way, we have a gurantee
that the table has multiple tablets. The load balancer should now take
care of distributing them over the nodes equally. Thanks to that,
`servers[1]` will have some tablets, and so it'll be the target for some
of the mutations we perform.
The context manager is the de-facto standard in the test suite. It will
also allow us for a prettier way to conditionally enable per-table
tablet options in the following commit.
The function is adapted from its counterpart in the cqlpy test suite:
cqlpy/util.py::keyspace_has_tablets. We will use it in a commit in this
series to conditionally set tablet properties when creating a table.
It might also be useful in general.
Before these changes, we would send mutations to the node and
immediately query the metrics to see how many hints had been written.
However, that could lead to random failures of the test: even if the
mutations have finished executing, hints are stored asynchronously, so
we don't have a guarantee they have already been processed.
To prevent such failures, we rewrite the check: we will perform multiple
checks against the metrics until we have confirmed that the hints have
indeed been written or we hit the timeout.
We're generous with the timeout: we give the test 60 seconds. That
should be enough time to avoid flakiness even on super slow machines,
and if the test does fail, we will know something is really wrong.
Fixes SCYLLADB-1133
Snapshots are not implemented yet for strong consistency - attempting to
take, transfer or drop a snapshot results in an exception. However, the
logic of our state machine forces snapshot transfer even if there are no
lagging replicas - every raft::server::configuration::snapshot_threshold
log entries. We have actually encountered an issue in our benchmarks
where snapshots were being taken even though the cluster was not under
any disruption, and this is one of the possible causes.
It turns out that we can safely allow for taking snapshots right now -
we can just implement it as a no-op and return a random UUID.
Conversely, dropping a snapshot can also be a no-op. This is safe
because snapshot transfer still throws an exception - as long as the
taken/recovered snapshots are never attempted to be transferred.
Raft snapshots are not implemented yet for strong consistency. Adjust
the current raft group config to make them much less likely to occur:
- snapshot_threshold config option decides how many log entries need to
be applied after the last snapshot. Set it to the maximum value for
size_t in order to effectively disable it.
- snapshot_threshold_log_size defines a threshold for the log memory
usage over which a snapshot is created. Increase it from the default
2MB to 10MB.
- max_log_size defines the threshold for the log memory usage over which
requests are stopped to be admitted until the log is shrunk back by a
snapshot. Set it to 20MB, as this option is recommended to be at least
twice as much as snapshot_threshold_log_size.
Refs: SCYLLADB-1115
Fixtures previously ran GDB once (module scope) to find live objects
(sstables, tasks, schemas) and stored their addresses. Tests then
reused those addresses in separate GDB invocations. Sometimes these
addresses would become stale and the test would step on use-after-free
(e.g. sstables compacted away between invocations).
Fix by dropping the fixtures. The helper functions used by the fixtures
to obtain the required objects are converted to gdb convenience
functions, which can be used in the same expression as the test command
invocation. Thus, the object is aquired on-demand at the moment it is
used, so it is guaranteed to be fresh and relevant.
Fixes: SCYLLADB-1020
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#28999
The version of fmt installed on my machine refuses to work with
`std::filesystem::path` directly. Add `.string()` calls in places that
attempt to print paths directly in order to make them work.
Closesscylladb/scylladb#29148
When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS.
In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call.
Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely.
A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128
Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side
Closesscylladb/scylladb#29110
* github.com:scylladb/scylladb:
encryption: fix deadlock in encrypted_data_source::get()
test_lib: mark `limiting_data_source_impl` as not `final`
Fix formatting after previous patch
Fix indentation after previous patch
test_lib: make limiting_data_source_impl available to tests
Replace create_dataset + manual DROP/CREATE KEYSPACE with two sequential
new_test_keyspace context manager blocks, matching the pattern used by
do_test_streaming_scopes. The first block covers backup, the second
covers restore. Keyspace lifecycle is now automatic.
The streaming directions validation loop is moved outside of the second
context block, since it only parses logs and has no dependency on the
keyspace being alive.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test sporadically fails because scrub produces an unexpected number
of SSTables. Compaction logs are needed to diagnose why, but were not
captured since scrub runs at DEBUG level. Enable compaction=debug for
the servers started by do_tablet_incremental_repair_and_ops so the next
reproduction provides enough information to root-cause the issue.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
And move some activities from streaming group into it, namely
- tablet_allocator background group
- sstables_manager-s components reclaimer
- tablet storage group manager merge completion fiber
- prometheus
All other activity that was in streaming group remains there, but can be
moved to this group (or to new maintenance subgroup) later.
All but prometheus are patched here, prometheus still uses the
maintenance_sched_group variable in main.cc, so it transparently
moves into new group
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The snapshot_ctl::backup_task_impl runs in configured scheduling group.
Now it's streaming one. This patch introduces the maintenance/backup
group and re-configures backup task with it.
The group gets its --backup_io_throughput_mb_per_sec option that
controls bandwidth limit for this sub-group only.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Compaction manager tells compaction_sched_group from
maintenance_compaction_sched_group. The latter, however, is set to be
"streaming" group. This patch adds real maintenance_compaction group
under the maintenance supergroup and makes compaction manager use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And just move streaming group inside it. Next patches will populate this
supergroup further.
The new supergroup gets its --maintenance-io-throughput-mb-per-sec
option that controls supergroup-wide IO bandwidth applied to it. If not
configured, the supergroup gets the throughput from streaming to be
backward compatible.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The main.cc code uses two variables to reference streaming scheduling.
This patch stops using the maintenance_sched_group one, because it's in
fact streaming group, and real "maintenance" will appear later in this
set.
One place is deliberately not patched -- prometheus code starts before
dbcfg.streaming_scheduling_group appears, so it still sits uses the
maintenance_sched_group variable. This fact will be used in one of the
next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The classify_request() helper captures current scheduling group into
local variable and compares it with groups from db_config to decide
which "class" it belongs to.
One if uses current_scheduling_group(), while it could use the local
variable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently we have two live-updateable IO-throughput options -- one for
streaming and one for compaction. Both are observed and the changed
value is applied to the corresponding scheduling_group by the relevant
serice -- respectively, stream_manager and compaction_manager.
Both observe/react/apply places use pretty heavy boilerplate code for
such simple task. Next patches will make things worse by adding two more
options to control IO throughput of some other groups.
Said that, the proposal is to hold the updating code in main.cc with the
help of a wrapper class. In there all the needed bits are at hand, and
classes can get their IO updates applied easily.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Without this include the file would not compile on its own. The issue
was most likely masked by the use of precompiled headers in our CI.
Closesscylladb/scylladb#29170
The 'scylla sstable-summary' GDB command crashes with
'ValueError: Argument "count" should be greater than zero' when
inspecting ms-format (trie-based) sstables. This happens because
ms-format sstables don't populate the traditional summary structure,
leaving all fields zeroed out, which causes gdb.read_memory() to be
called with a zero count.
Fix by:
- Adding zero-length guards to sstring.to_hex() and sstring.as_bytes()
to return early when the data length is zero, consistent with the
existing guard in managed_bytes.get().
- Adding the same guard to scylla_sstable_summary.to_hex().
- Detecting ms-format sstables (version == 5) early in
scylla_sstable_summary.invoke() and printing an informative message
instead of attempting to read the unpopulated summary.
Fixes: SCYLLADB-1180
Closesscylladb/scylladb#29162
In test_index_requires_rf_rack_valid_keyspace, the create_table call
for a plain tablet-based table can fail with 'Unable to reach schema
agreement' after the server's 10s timeout is exceeded. This happens
when schema gossip propagation across the 4-node cluster takes longer
than expected after a sequence of rapid schema changes earlier in the
test.
Add a retry (up to 2 attempts) on schema agreement errors for this
specific create_table call rather than increasing the server-side
timeout.
Fixes: SCYLLADB-1135
Closesscylladb/scylladb#29132
Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/139,
To fix the problem, explicitly restrict the `GITHUB_TOKEN` permissions
for this workflow/job so it has only what is needed. The script reads PR
data and repository info (which is covered by `contents: read`/default
read scopes) and posts a comment via `github.rest.issues.createComment`,
which requires `issues: write`. No other write scopes (e.g., `contents:
write`, `pull-requests: write`) are necessary.
The best fix without changing functionality is to add a `permissions`
block scoped to this job (or at the workflow root). Since we only see a
single job here, we’ll add it under `check-fixes-prefix`. Concretely, in
`.github/workflows/backport-pr-fixes-validation.yaml`, between the
`runs-on: ubuntu-latest` line (line 10) and `steps:` (line 11), add:
```yaml
permissions:
contents: read
issues: write
```
This keeps the token minimally privileged while still allowing the script
to create issue/PR comments.
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Closesscylladb/scylladb#27810
`read_barrier(session2)` was supposed to ensure `node2` has caught up on schema
before a CL=ALL write. But `patient_cql_connection(node2)` creates a
cluster-aware driver session `(TokenAwarePolicy(DCAwareRoundRobinPolicy()))`
that can route the barrier CQL statement to any node — not necessarily `node2`.
If the barrier runs on `node1` or `node3` (which already have the new schema),
it's a no-op, and `node2` remains stale, thus the observed `WriteFailure`.
The fix is to switch to `patient_exclusive_cql_connection(node2)`,
which uses `WhiteListRoundRobinPolicy([node2_ip])` to pin all CQL to `node2`.
This is already the established pattern used by other tests in the same file.
Fixes: SCYLLADB-1139
No need to backport yet, appeared only on master.
Closesscylladb/scylladb#29151
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.
Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.
Fixes: SCYLLADB-1152
Closesscylladb/scylladb#29152
Use add_new_sstable_and_update_cache() when attaching SSTables
downloaded by the node-scoped local loader.
This is the correct variant for new SSTables: it can unlink the
SSTable on failure to add it, and it can split the SSTable if a
tablet split is in progress. The older
add_sstable_and_update_cache() helper is intended for preexisting
SSTables that are already stable on disk.
Additionally, downloaded SSTables are now left unsealed (TemporaryTOC)
until they are successfully added to the table's SSTable set. The
download path (download_fully_contained_sstables) passes
leave_unsealed=true to create_stream_sink, and attach_sstable opens
the SSTable with unsealed_sstable=true and seals it only inside the
on_add callback — matching the pattern used by stream_blob.cc and
storage_service.cc for tablet streaming.
This prevents a data-resurrection hazard: previously, if the process
crashed between download and attach_sstable, or if attach_sstable
failed mid-loop, sealed (TOC) SSTables would remain in the table
directory and be reloaded by distributed_loader on restart. With
TemporaryTOC, sstable_directory automatically cleans them up on
restart instead.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1085.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#29072
The native Scylla nodetool reports ECONNREFUSED as 'Connection refused',
not as 'ConnectException' (which is the Java nodetool format). Add
'Connection refused' to the valid_errors list so that transient
connection failures during concurrent decommission/bootstrap topology
changes are properly tolerated.
Fixes SCYLLADB-1167
Closesscylladb/scylladb#29156
There are two helpers for describe_ring endpoint. Both can be squashed
together for code brevity.
Also, while at it, the "keyspace" parameter is not properly validated by
the endpoint.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All callers already have it. It makes no difference for the method
itself with which table identifier to work, but will help to simplify
the flow in API handler (next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This actually uses two interconnected options:
repair_multishard_reader_buffer_hint_size and
repair_multishard_reader_enable_read_ahead.
Both are propagated through repair_service::config and pass their
values to repair_reader/make_reader at construction time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Most other services have their configs, rpair still uses global
db::config.
Add an empty config struct to repair_service to carry db::config options
the repair service needs.
Subsequent patches will populate the struct with options.
The config is created in main.cc as sharded_parameter because all future
options are live-updateable and should capture theirs source from
db::config on correct shard.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After the previous fix both guarding if-s start with 'if (querier_opt &&'.
Merge them into a single outer 'if (querier_opt)' block to avoid the
redundant check and make the structure easier to follow.
No functional change.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.
This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.
Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.
Fixes: SCYLLADB-832
Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness.
Closesscylladb/scylladb#29031
* github.com:scylladb/scylladb:
vector_search: test: fix flaky test
vector_search: fix race condition on connection timeout
The condition guarding querier_opt->close() was:
When saved_querier is null the short-circuit makes the whole condition true
regardless of whether querier_opt is engaged. If partition_ranges is empty,
query_state::done() is true before the while-loop body ever runs, so querier_opt
is never created. Calling querier_opt->close() then dereferences a disengaged
std::optional — undefined behaviour.
Fix by checking querier_opt first:
This preserves all existing semantics (close when not saving, or when saving
wouldn't be useful) while making the no-querier path safe.
Why this doesn't surface today: the sole production call site, database::query(),
in practice. The API header documents nullptr as valid ("Pass nullptr when
queriers are not saved"), so the bug is real but latent.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The code in upload_file std::move()-s vector of names into
merge_objects() method, then iterates over this vector to delete
objects. The iteration is apparently a no-op on moved-from vector.
The fix is to make merge_objects() helper get vector of names by const
reference -- the method doesn't modify the names collection, the caller
keeps one in stable storage.
Fixes#29060
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29061
Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to
utils::directory_lister while preserving the previous hidden-entry
behavior.
Make do_verify_subpath use lister::filter_type directly so the
verification helper can pass it straight into directory_lister, and keep
a single yielding iteration loop for directory traversal.
Minus one scan_dir user twards scan_dir removal from code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29064
The update_credentials_and_rearm() may get "empty" credentials from
_creds_provider_chain.get_aws_credentials() -- it doesn't throw, but
returns default-initialized value. In that case the expires_at will be
set to time_point::min, and it's probably not a good idea to arm the
refresh timer and, even worse idea, to subtract 1h from it.
Fixes#29056
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29057
The format string had two {} placeholders but three arguments, the
_upload_id one is skipped from formatting
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29053
The test_restore_primary_replica_same_domain and test_restore_primary_replica_different_domain tests have very much in common. Previously both tests were also split each into two, so we have four tests, and now we have two that can also be squashed, the lines-of-code savings still worth it.
This is the continuation of #28569
Tests improvement, not backporting
Closesscylladb/scylladb#28994
* github.com:scylladb/scylladb:
test: Replace a bunch of ternary operators with an if-else block
test: Squash test_restore_primary_replica_same|different_domain tests
test: Use the same regexp in test_restore_primary_replica_different|same_domain-s
The do_test_backup_abort() fetched the node's workdir and resolved cf_dir
solely to construct a unique-ish backup prefix:
prefix = f'{cf_dir}/backup'
The comment already acknowledged this was only "unique(ish)" — relying
on the UUID-derived cf_dir name as a uniqueness source is roundabout.
unique_name() is already imported and used for exactly this purpose
elsewhere in the file.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29030
The endpoint URL remains intact. Having it next to another toppartitions
endpoint (the /column_family/toppartitions one) is natural.
This endpoint only needs sharded<replica::database>&, grabs it from
http_context and doesn't use any other service. In column_family.cc the
database reference is already available as a parameter. Once more user
of http_context.db is gone.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#28996
The test in question uses several helpers from the backup sute, but it doesn't really need them -- the operations it want to perform can be performed with standard pylib methods. "While at it" also collect some dangling effectively unused local variables from this test (these were apparently left from backup tests this one was copied-and-reworked from)
Enhancing tests, not backporting
Closesscylladb/scylladb#29130
* github.com:scylladb/scylladb:
test/refresh: Simplify refresh invocation
test/refresh: Remove r_servers alias for servers
test/refresh: Replace check_mutation_replicas with a plain CQL SELECT
test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables
test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables
test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests
test/refresh: Remove unused wait_for_cql_and_get_hosts import
Two issues found in the lister returned by gs_client_wrapper::make_object_lister()
Lister can report EOF too early in case filter is active, another one is potential vector out-of-bounds access
Fixes#29058
The code appeared in 2026.1, worth fixing it there as well
Closesscylladb/scylladb#29059
* github.com:scylladb/scylladb:
sstables: Fix object storage lister not resetting position in batch vector
sstables: Fix object storage lister skipping entries when filter is active
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.
Also, graceful shutdown will be blocked on it.
Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.
Reason for deadlock:
When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.
If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this
Initial state:
cg0: main
cg1: main
cg2: main
cg3: main
After 1st merge:
cg0': main [locked], merging_groups=[cg0.main, cg1.main]
cg1': main [locked], merging_groups=[cg2.main, cg3.main]
After 2nd merge:
cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]
merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.
The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.
Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.
Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.
Fixes SCYLLADB-928
Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.
Closesscylladb/scylladb#29007
* github.com:scylladb/scylladb:
tablets: Fix deadlock in background storage group merge fiber
replica: table: Propagate old erm to storage group merge
test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
storage_service: Extract local_topology_barrier()
This patch series introduces a new documentation for exiting guardrails.
Moreover:
- Warning / failure messages of recently added write CL guardrails (SCYLLADB-259) are rephrased, so all guardrails have similar messages.
- Some new tests are added, to help verify the correctness of the documentation and avoid situations where the documentation and implementation diverge.
Fixes: [SCYLLADB-257](https://scylladb.atlassian.net/browse/SCYLLADB-257)
No backport, just new docs and tests.
[SCYLLADB-257]: https://scylladb.atlassian.net/browse/SCYLLADB-257?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#29011
* github.com:scylladb/scylladb:
test: add new guardrail tests matching documentation scenarios
test: add metric assertions to guardrail replication strategy tests
test: use regex matching in guardrail replication strategy tests
test: extract ks_opts helper in test_guardrail_replication_strategy
docs: document CQL guardrails
cql: improve write consistency level guardrail messages
This PR enables the node_exporter systemd collector and configures the unit whitelist to include scylla-server.service and systemd-coredump services.
**Motivation**: We currently lack visibility into system-level service states, which is critical for diagnosing stability issues.
This configuration enables two specific use cases:
- Detecting Coredump Loops: We encounter scenarios where ScyllaDB enters a restart loop. To pinpoint SIGSEGV (coredumps) as the root cause, we need to track when the systemd-coredump service becomes active, indicating a dump is being processed.
- Identifying Startup Failures: We need to detect when the scylla-server unit enters a failed state. This is essential for catching unrecoverable errors (e.g., corrupted commitlogs or configuration bugs) that prevent the server from starting.
example of promql queries:
- `node_systemd_unit_state{name=~"systemd-coredump@.*", state="active"} == 1`
- `node_systemd_unit_state{name="scylla-server.service", state="failed"} == 1`
Closes#28402
reader_concurrency_semaphore::signal() guards against available
resources exceeding the initial limit after a signal, which would
indicate a bug such as double-returning resources. It reports the
issue via on_internal_error_noexcept and clamps resources back to
the initial values. However, before this commit there were no tests
that verified this behavior, so bugs like SCYLLADB-1014 went
undetected.
Add a test that artificially signals resources that were never
consumed and verifies that signal() detects the negative leak and
clamps available resources back to the initial limit.
Refs: SCYLLADB-1014
Fixes: SCYLLADB-1031
Closesscylladb/scylladb#28993
Since 7564a56dc8, all tables default to
repair-mode tombstone-gc, which is identical to immediate-mode for RF=1
tables. Consequently the tombstones written by the tests in this test
file are immediately collectible and with some unlucky timing, some of
them can be collected before the end of the test, failing the empty-page
prefix check because the empty pages prefix will be smaller than
expected based on the number of tombstones written.
Disable tombstone-gc to remove this source of flakyness.
Fixes: SCYLLADB-1062
Closesscylladb/scylladb#29077
The test has expectation w.r.t which write makes it to which nodes:
* inserts make it to all nodes
* delete makes it to all-1 (QUORUM) node
However, this was not expressed with CL, and the default CL=ONE allowed
for some nodes missing the writes and this violating the tests
expectations on what data is persent on which nodes. This resulted on
the test being flaky and failing on the data checks.
Use explicit CL for the ingestion to prevent this.
The improvements to the test introduced in
a8dd13731f was of great help in
investigating this: traces are now available and the check happens after
the data was dumped to logs.
Fixes: SCYLLADB-870
Fixes: SCYLLADB-812
Fixes: SCYLLADB-1102
Closesscylladb/scylladb#29128
Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables.
Main flows and components:
* The storage is composed of 32MB files, each file divided to segments of size 128k. We write to them sequentially records that contain a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks.
* The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable.
* On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO.
* On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record.
* We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage.
* The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments.
* Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group.
Currently this mode is experimental and requires an experimental flag to be enabled.
Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl.
to use, add to config:
```
enable_logstor: true
experimental_features:
- logstor
```
create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```
INSERT, SELECT, DELETE work as expected
UPDATE not supported yet
no backport - new feature
Closesscylladb/scylladb#28706
* github.com:scylladb/scylladb:
logstor: trigger separator flush for buffers that hold old segments
docs/dev: add logstor documentation
logstor: recover segments into compaction groups
logstor: range read
logstor: change index to btree by token per table
logstor: move segments to replica::compaction_group
db: update dirty mem limits dynamically
logstor: track memory usage
logstor: logstor stats api
logstor: compaction buffer pool
logstor: separator: flush buffer when full
logstor: hold segment until index updates
logstor: truncate table
logstor: enable/disable compaction per table
logstor: separator buffer pool
test: logstor: add separator and compaction tests
logstor: segment and separator barrier
logstor: separator debt controller
logstor: compaction controller
logstor: recovery: recover mixed segments using separator
logstor: wait for pending reads in compaction
logstor: separator
logstor: compaction groups
logstor: cache files for read
logstor: recovery: initial
logstor: add segment generation
logstor: reserve segments for compaction
logstor: index: buckets
logstor: add buffer header
logstor: add group_id
logstor: record generation
logstor: generation utility
logstor: use RIPEMD-160 for index key
test: add test_logstor.py
api: add logstor compaction trigger endpoint
replica: add logstor to db
schema: add logstor cf property
logstor: initial commit
db: disable tablet balancing with logstor
db: add logstor experimental feature flag
Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes.
Make the new format a new default for new clusters by naming ms in the default scylla.yaml.
New functionality. No backport needed.
This PR is basically Michał's one https://github.com/scylladb/scylladb/pull/26377, Jakub's https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()` and one test fix.
Closesscylladb/scylladb#28960
* github.com:scylladb/scylladb:
db/config: announce ms format as highest supported
db/config: enable `ms` sstable format by default
cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format
api/system: add /system/chosen_sstable_version
test/cluster/dtest: reduce num_tokens to 16
take_snapshot return values were unused so drop them. do_refresh was a
thin wrapper around load_new_sstables that added no logic; inline it
directly into the gather expression.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The goal of test_refresh_deletes_uploaded_sstables is to verify that
sstables are removed from the upload directory after refresh. The replica
check was just a sanity guard; a simple SELECT of all keys is sufficient
and much lighter.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Replace create_dataset() with explicit keyspace creation via new_test_keyspace,
inline CREATE TABLE, and direct cql.run_async inserts — matching the pattern
used in do_test_streaming_scopes. This removes the last dependency on backup
helpers for dataset setup and makes the test self-contained.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wrap the test body under if True: to pre-indent it, making the subsequent
patch that introduces new_test_keyspace a pure content change with no
whitespace noise.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace create_cluster() from object_store/test_backup.py with a plain
manager.servers_add(2) call. The test does not use object storage, so
there is no need to pull in the backup helper along with its config and
logging knobs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown)in 3 files. Restrict it to only
the teardown phase, that contains all 3 in case of test success,
to avoid redundant log entries.
1. test.py — Removed --log-level=DEBUG flag from pytest args
2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments
This PR fixes the Installation page:
- Replaces `http `with `https `in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).
Fixes https://github.com/scylladb/scylladb/issues/29087
Fixes https://github.com/scylladb/scylladb/issues/29087
This update affects all supported versions and should be backported as a bug fix.
Closesscylladb/scylladb#29088
* github.com:scylladb/scylladb:
doc: remove the Open Source Example from Installation
doc: replace http with https in the installation instructions
assert_entries_were_added:
- takes a "before" snapshot of the audit log
- yields to execute a statement
- takes an "after" snapshot of the audit log
- computes new rows by diffing "after" minus "before"
If an audit entry generated by prepare() arrives between the snapshot
and the diff, it inflates the new row count and the test fails with
assert 2 <= 1.
Fix by:
- Adding clear_audit_logs() at the end of prepare(), after all setup
- Waiting for the "completed re-reading configuration file" log message
after server_update_config
- Draining pending syslog lines before clearing the buffer
Refs SCYLLADB-573
get_cql_exclusive() creates a Cluster object per call, but never
records it. driver_close() cannot shut it down. The cluster's
internal scheduler thread then tries to submit work to an already
shut down executor. This causes RuntimeError:
RuntimeError: cannot schedule new futures after shutdown
Fix this by tracking every exclusive Cluster in a list and shutting
them all down in driver_close().
Refs SCYLLADB-573
assert_entries_were_added computes new audit rows by slicing the "after"
list at the length of the "before" list: rows_after[len(rows_before):].
This assumes new rows always appear at the tail of the combined sorted
list. In a multinode setup, each node generates its own event_time
timestamps. A new row from node A can sort before an old row from node
B, breaking the tail assumption. The assertion "new rows are not the
last rows in the audit table" then fires.
Fix this by splitting the before/after lists per node and computing the
new rows tail independently for each node. This guarantees that per node
ordering, which is monotonic, is respected, and the combined new rows
are sorted afterwards.
Refs SCYLLADB-573
Since audit tests have been migrated to test/cluster/test_audit.py,
old tests in test/cluster/dtest/audit_test.py have to be removed.
Refs SCYLLADB-573
This patch reorganizes the execution flow of the test functions.
They are grouped to enable cluster reuse between specific test
functions. One of the main contributors to the test execution time
is the cluster preparation. This patch significantly reduces the
total test execution time by having way less new cluster preparation
calls and more cluster reuse.
Performance increase on the developer machine is around 38%:
- before: 4m 29s
- after: 2m 47s
Fixes SCYLLADB-573
Make audit tests from test/cluster/dtest to test/cluster.
test/cluster environment has less overhead, and audit tests
are heavy, their execution taking lots of time. This patch
is part of an effort to improve audit test suite performance.
This patch refactors the tests so that they execute correctly,
as well as enables them. A follow up patch will remove the
audit tests in test/cluster/dtest.
All the tests are confirmed to be running after the change.
No dead code present.
Test test_audit_categories_invalid is not parametrized anymore.
It never used the parametrized helper class, so it just ran
the same logic three times. This is why there are now 74,
and not 76, test executions.
Refs SCYLLADB-573
The default 'cassandra' superuser was removed from ScyllaDB, which
broke PGO training. exec_cql.py relied on username/password auth
('cassandra'/'cassandra') to execute setup CQL scripts like auth.cql
and counters.cql.
Switch exec_cql.py to connect via the Unix domain maintenance socket
instead. The maintenance socket bypasses authentication, no credentials
are needed. Additionally, create the 'cassandra' superuser via the
maintenance socket during the populate phase, so that cassandra-stress
keeps working. cassandra-stress hardcodes user=cassandra password=cassandra.
Changes:
- exec_cql.py: replace host/port/username/password arguments with a
single --socket argument; add connect_maintenance_socket() with
wait ready logic
- pgo.py: add maintenance_socket_path() helper; update
populate_auth_conns() and populate_counters() to pass the socket
path to exec_cql.py
Fixes SCYLLADB-1070
Closesscylladb/scylladb#29081
This patch allows ManagerClient.get_cql_exclusive to accept AuthProvider
as parameter. This will be used in a follow up patch which migrates
audit test suite to test/cluster and requires this functionality for
some tests.
Refs SCYLLADB-573
Before any test, a pool of ScyllaCluster objects is created.
At the beginning of a test suite, a ScyllaClusterManager is created,
and given a reference to the pool.
At the end of a test suite, the ScyllaClusterManager is destroyed.
Before each test case:
- ManagerClient is constructed and connected to the ScyllaClusterManager
of that test suite
- A ScyllaCluster object is fetched from the pool
- If the pool is empty, a new ScyllaCluster object is created
- If the pool is not empty, a cached ScyllaCluster object is returned
After each test case:
- Return ScyllaCluster object from ManagerClient to the pool
- If the cluster is dirty, the pool destroys it
- If the cluster is clean, the pool caches it
- ManagerClient is destroyed
Many actions mark a cluster as dirty. Normal test execution will always
make the cluster be destroyed upon returning to the pool.
ManagerClient.mark_clean is not used in the tests. When it is used,
the flow with cluster reuse happens.
The bug is that the log file is closed even if cluster is not dirty.
This causes an error when trying to log to a reused cluster server.
The solution in this patch is to not close the log file if the cluster
is not dirty. Upon cluster reuse the log file will be open and functional.
Another approach would be to reopen the log file if closed, but this
approach seems more clean.
Refs SCYLLADB-573
This patch just copies the audit test suite from dtest and
disables it in the test config file. Later patches will
update the code and enable the test suite.
Refs SCYLLADB-573
Add tests for RF guardrails (min/max warn/fail, RF=0 bypass,
threshold=-1 disable, ALTER KEYSPACE) and write consistency level
guardrails to cover all scenarios described in guardrails.rst.
Test runtime (dev):
test_guardrail_replication_strategy - 6s
test_guardrail_write_consistency_level - 5s
Refs: SCYLLADB-257
Replace loose substring assertions with regex-based matching against
the exact server message formats. Add regex constants for all
guardrail messages and rewrite create_ks_and_assert_warnings_and_errors()
to verify count and content of warnings and failures.
Refs: SCYLLADB-257
The Web Installer page includes instructions to install the old pre-2025.1 Enterprise versions,
which are no longer supported (since we released 2026.1).
This commit removes those redundant and misleading instructions.
Fixes https://github.com/scylladb/scylladb/issues/29099Closesscylladb/scylladb#29103
This patchset migrates: query_all_directly_granted, query_all,
get_attribute, query_attribute_for_all functions to use cache
instead of doing CQL queries. It also includes some preparatory
work which fixes cache update order and triggering.
Main motivation behind this is to make sure that all calls
from service_level_controller::auth_integration are cached,
which we achieve here.
Alternative implementation could move the whole auth_integration
data into auth cache but since auth_integration manages also lifetime
and contains service levels specific logic such solution would be
too complex for little (if any) gain.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159
Backport: no, not a bug
Closesscylladb/scylladb#28791
* github.com:scylladb/scylladb:
auth: switch query_attribute_for_all to use cache
auth: switch get_attribute to use cache
auth: cache: add heterogeneous map lookups
auth: switch query_all to use cache
auth: switch query_all_directly_granted to use cache
auth: cache: add ability to go over all roles
raft: service: reload auth cache before service levels
service: raft: move update_service_levels_effective_cache check
`storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id.
The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup.
This series fixes that by:
1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed.
2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active.
3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand.
Improvements. No backport is required.
Closesscylladb/scylladb#28963
* github.com:scylladb/scylladb:
replica/table: avoid computing token range side in storage_group_of() on hot path
replica/compaction_group: add lazy select_compaction_group() overloads
locator/tablets: add tablet_map::get_tablet_range_side()
When encrypted_data_source::get() caches a trailing block in
_next, the next call takes it directly — bypassing
input_stream::read(), which checks _eof. It then calls
input_stream::read_exactly() on the already-drained stream.
Unlike read(), read_up_to(), and consume(), read_exactly()
does not check _eof when the buffer is empty, so it calls
_fd.get() on a source that already returned EOS.
In production this manifested as stuck encrypted SSTable
component downloads during tablet restore: the underlying
chunked_download_source hung forever on the post-EOS get(),
causing 4 tablets to never complete. The stuck files were
always block-aligned sizes (8k, 12k) where _next gets
populated and the source is fully consumed in the same call.
Fix by checking _input.eof() before calling read_exactly().
When the stream already reached EOF, buf2 is known to be
empty, so the call is skipped entirely.
A comprehensive test is added that uses a strict_memory_source
which fails on post-EOS get(), reproducing the exact code
path that caused the production deadlock.
Factor out ks_opts() to build keyspace options with tablets handling
and use it across all existing replication strategy guardrail tests.
No behavioral changes.
This facilitates further modification of the tests later in this
patch series.
Refs: SCYLLADB-257
After repair, the test does a major to compact all sstables into a
single one, so the results can be simply checked by a select from
mutation_fragments() query. Sometimes off-strategy happens parallel to
this major, so after the major there are still 2 sstables, resulting in
the test failing when checking that the query returns just a single row.
To fix, just use tablets for the test table, tablets don't use
off-strategy anymore.
Fixes: SCYLLADB-940
Closesscylladb/scylladb#29071
Previously the test test_interrupt_view_build_shard_registration stopped
the node ungracefully and used commitlog periodic mode to persist the
view build progress in a not very reliable way.
It can happen that due to timing issues, the view build progress is not
persisted, or some of it is persisted in a different ordering than
expected.
To make the test more reliable we change it to stop the node gracefully,
so the commitlog is persisted in a graceful and consistent way, without
using the periodic mode delay. We need to also change the injection for
the shutdown to not get stuck.
Fixes SCYLLADB-1005
Closesscylladb/scylladb#29008
As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown.
For example, see backtrace below:
```
seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57
directory_lister::~directory_lister() at ./utils/lister.cc:77
replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247
(inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129
seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695
(inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201
seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185
(inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160
scylla_main(int, char**) at ./main.cc:756
```
Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013)
* Requires backport to 2026.1 since the leak exists since 004c08f525
[SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#29084
* github.com:scylladb/scylladb:
test/boost/database_test: add test_snapshot_ctl_details_exception_handling
table: get_snapshot_details: fix indentation inside try block
table: per-snapshot get_snapshot_details: fix typo in comment
table: per-snapshot get_snapshot_details: always close lister using try/catch
table: get_snapshot_details: always close lister using deferred_close
Run keyspace compaction asynchronously in
`test_tombstone_gc_correctness_during_tablet_split` and only await it
after `split_sstable_rewrite` is disabled.
The problem is that `keyspace_compaction()` starts with a flush, and that
flush can take around five seconds. During that window the split
compaction is stopped before major compaction is retried. The stop aborts
the in-flight major compaction attempt, then the split proceeds far enough
to enter the `split_sstable_rewrite` injection point.
At that point the test used to wait synchronously for major compaction to
finish, but major compaction cannot finish yet: when it retries, it needs
the same semaphore that is still effectively tied up behind the blocked
split rewrite. So the test waits for major compaction, while the split
waits for the injection to be released, and the code that would release
that injection never runs.
Starting major compaction as a task breaks that cycle. The test can first
disable `split_sstable_rewrite`, let the split get out of the way, and
only then wait for major compaction to complete.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-827.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#29066
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus, finished request does not imply that a node is removed from
the topology.
Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.
Modify the get_children method to ignore the exception and warn
about the failure instead.
Keep token_metadata_ptr in get_children to prevent topology from changing.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867
Needs backports to all versions
Closesscylladb/scylladb#29035
* github.com:scylladb/scylladb:
tasks: fix indentation
tasks: do not fail the wait request if rpc fails
tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
Document that `SELECT ... WHERE` clause currently accepts only conjunctions
of relations joined by `AND` (`OR` is not supported), and that
parentheses cannot be used to group boolean subexpressions.
Add an unsupported query example and point readers to equivalent `IN`
rewrites when applicable.
This problem has been raised by one of our users in
https://forum.scylladb.com/t/error-parsing-query-or-unsupported-statement/5299,
and while one could infer answer to user's question by looking at the
syntax of the `SELECT ... WHERE`, it's not immediately obvious to
non-advanced users, so clarifying these concepts is justified.
Fixes: SCYLLADB-1116
Closesscylladb/scylladb#29100
A compaction group has a separator buffer that holds the mixed segments
alive until the separator buffer is flushed. A mixed segment can be
freed only after all separator buffers that hold writes from the segment
are flushed.
Typically a separator buffer is flushed when it becomes full. However
it's possible for example that one compaction groups is filled slower
than others and holds many segments.
To fix this we trigger a separator flush periodically for separator
buffers that hold old segments. We track the active segment sequence
number and for each separator buffer the oldest sequence number it
holds.
Fix the logstor recovery to work with compaction groups. When recovering
a segment find its token range and add it to the appropriate compaction
groups. if it doesn't fit in a single compaction group then write each
record to its compaction group's separator buffer.
Change the primary index to be a btree that is ordered by token,
similarly to a memtable, and create a index per-table instead of a
single global index.
Add a segment_set member to replica::compaction_group that manages the
logstor segments that belong to the compaction group, similarly to how
it manages sstables. Add also a separator buffer in each compaction
group.
When writing a mutation to a compaction group, the mutation is written
to the active segment and to the separator buffer of the compaction
group, and when the separator buffer is flushed the segment is added to
the compaction_group's segment set.
when logstor is enabled, update the db dirty memory limits dynamically.
previously the threshold is set to 0.5 of the available memory, so 0.5
goes to memtables and 0.5 to others (cache).
when logstor is enabled, we calculate the available memory excluding
logstor, and divide it evenly between memtables and cache.
add a write gate to write_buffer. when writing a record to the write
buffer, the gate is held and passed back to the caller, and the caller
holds the gate until the write operation is complete, including
follow-up operations such as updating the index after the write.
in particular, when writing a mutation in logstor::write, the write
buffer is held open until the write is completed and updated in the
index.
when writing the write buffer to the active segment, we write the buffer
and then wait for the write buffer gate to close, i.e. we wait for all
index updates to complete before proceeding. the segment is held open
until all the write operations and index updates are complete.
this property is useful for correctness: when a segment is closed we
know that all the writes to it are updated in the index. this is needed
in compaction for example, where we take closed segments and check
which records in them are alive by looking them up in the index. if the
index is not updated yet then it will be wrong.
implement freeing all segments of a table for table truncate.
first do barrier to flush all active and mixed segments and put all the
table's data in compaction groups, then stop compaction for the table,
then free the table's segments and remove the live entries from the
index.
add barrier operation that forces switch of the active segment and
separator, and waits for all existing segments to close and all
separators to flush.
add tracking of the total separator debt - writes that were written to a
separator and waiting to be flushed, and add flow control to keep the
debt in control by delaying normal writes.
on recovery we may find mixed segments. recover them by adding them to a
separator, reading all their records and writing them to the separator,
and flush the separator.
we free a segment from compaction after updating all live records in the
segment to point to new locations in the index. we need to ensure they
are no running operations that use the old locations before we free the
segment.
initial implementation of the separator. it replaces "mixed" segments -
segments that have records from different groups, to segments by group.
every write is written to the active segment and to a buffer in the
active separator. the active separator has in-memory buffers by group.
at some threshold number of segments we switch the active segment and
separator atomically, and start flushing the separator.
the separator is flushed by writing the buffers into new non-mixed
segments, adding them to a compaction group, and frees the mixed
segments.
initial and basic recovery implementation.
* find all files, read their segments and populate the index with the
newest record for each key.
* find which segments are used and build the usage histogram
add segment generation number that is incremented when the segment is
reused, and it's written to every buffer that is written to the segment.
this is useful for recovery.
reserve segments for compaction so it always has enough segments to run
and doesn't get stuck.
do the compaction writes into full new segments instead of the active
segment.
add group_id value to each log record that is passed with the mutation
when writing it.
the group_id will be used to group log records in segments, such that a
segment will contain records only from a single group.
this will be useful for tablet migration. we want for each tablet to
have their own segments with all their records, so we can migrate them
efficiently by copying these segments.
the group_id value is set to a value equivalent to the tablet id.
basic utility for generation numbers that will be useful next. a
generation number is an unsigned integer that can be incremented and
compared even if it wraparounds, assuming the values we compare were
written around the same time.
initial implementation of the logstor storage engine for key-value
tables that supports writes, reads and basic compaction.
main components:
* logstor: this is the main interface to users that supports writing and
reading back mutations, and manages the internal components.
* index: the primary index in-memory that maps a key to a location on
disk.
* write buffer: writes go initially to a write buffer. it accumulates
multiple records in a buffer and writes them to the segment manager in
4k sized blocks.
* segment manager: manages the storage - files, segments, compaction. it
manages file and segment allocation, and writes 4k aligned buffers to
the active segment sequentially. it tracks the used space in each
segment. the compaction finds segment with low space usage and writes
them to new segments, and frees the old segments.
GRANT/REVOKE fails on the maintenance socket connections, because maintenance_auth_service uses allow_all_authorizer. allow_all_authorizer allows all operations, but not GRANT/REVOKE, because they make no sense in its context.
This has been observed during PGO run failure in operations from ./pgo/conf/auth.cql file.
This patch introduces maintenance_socket_authorizer that supports the capabilities of default_authorizer ('CassandraAuthorizer') without needing authorization.
Refs SCYLLADB-1070
This is an improvement, no need for backport.
Closesscylladb/scylladb#29080
* github.com:scylladb/scylladb:
test: use NetworkTopologyStrategy in maintenance socket tests
test: use cleanup fixture in maintenance socket auth tests
auth: add maintenance_socket_authorizer
Make the pattern static const so it is compiled once at first call rather
than on every Content-Range header parse.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29054
Since we do no longer support upgrade from versions that do not support
v2 of view building code we can remove upgrade code and make sure we do
not boot with old builder version.
It handles 0, and could generate better code for that. On Broadwell
architecture, it translates to a single instruction (LZCNT). We're
still on Westmere, so it translates to BSR with a conditional move.
Also, drop unnecessary casts and bit arithmetic, which saves a few
instructions.
Move to header so that it's inlined in parsers.
This change reduces the cost of partition index page construction and
LSA migration. This is achieved by several things working together:
- index entries don't store keys as separate small objects (managed_bytes)
They are written into one managed_bytes fragmented storage, entries
hold offset into it.
Before, we paid 16 bytes for managed_bytes plus LSA descriptor for
the storage (1 byte) plus back-reference in the storage (8 bytes),
so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16
bytes, that's a reduction from 31 bytes to 20 bytes per key.
- index entries and key storage are now trivially moveable, so LSA
migration can use memcpy() which amortizes the cost per key.
memcpy().
LSA eviction is now trivial and constant time for the whole page
regardless of the number of entries. Page eviction dropped from
14 us to 1 us.
This improves throughput in a CPU-bound miss-heavy read workload where
the partition index doesn't fit in memory.
scylla perf-simple-query -c1 -m200M --partitions=1000000
Before:
15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors)
15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors)
15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors)
15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors)
15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors)
15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors)
15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors)
After:
21620.18 tps (148.4 allocs/op, 13.4 logallocs/op, 43.7 tasks/op, 176817 insns/op, 153183 cycles/op, 0 errors)
20644.03 tps (149.8 allocs/op, 13.5 logallocs/op, 44.3 tasks/op, 187941 insns/op, 160409 cycles/op, 0 errors)
20588.06 tps (150.1 allocs/op, 13.5 logallocs/op, 44.5 tasks/op, 188090 insns/op, 160818 cycles/op, 0 errors)
20789.29 tps (149.5 allocs/op, 13.5 logallocs/op, 44.2 tasks/op, 186495 insns/op, 159382 cycles/op, 0 errors)
20977.89 tps (149.5 allocs/op, 13.4 logallocs/op, 44.2 tasks/op, 183969 insns/op, 158140 cycles/op, 0 errors)
21125.34 tps (149.1 allocs/op, 13.4 logallocs/op, 44.1 tasks/op, 183204 insns/op, 156925 cycles/op, 0 errors)
21244.42 tps (148.6 allocs/op, 13.4 logallocs/op, 43.8 tasks/op, 181276 insns/op, 155973 cycles/op, 0 errors)
Mostly because the index now fits in memory.
When it doesn't, the benefits are still visible due to lower LSA overhead.
It's shorter, and is supposed to be optimized for trivially-moveable
types.
Important for managed_vector<index_entry>, which can have lots of
elements.
Densely populated pages have no promoted index (small partitions), so
we can save space in such workloads by keeping promoted index in a
separate vector.
For workloads which do have a promoted index, pages have only one
partition. There aren't many such pages and they are long-lived, so
the extra allocation of the vector is amortized.
promoted_index class is removed, and replaced with equivalent
parsed_promoted_index_entry for simplicity. Because it's removed,
make_cursor() is moved into the index_reader class.
Reducing the size of index_entry is important for performence if pages
are densly populated. It helps to reduce LSA allocator pressure and
compaction/eviction speed.
This change, combined with the earlier change "Shave-off 16 bytes from
index_entry by using raw_token", gives significant improvement in
throughput in perf_simple_query run where the index doesn't fit in
memory:
scylla perf-simple-query -c1 -m200M --partitions=1000000
Before:
9714.78 tps (170.9 allocs/op, 16.9 logallocs/op, 55.3 tasks/op, 494788 insns/op, 343920 cycles/op, 0 errors)
9603.13 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 502358 insns/op, 348344 cycles/op, 0 errors)
9621.43 tps (171.9 allocs/op, 17.0 logallocs/op, 55.8 tasks/op, 500612 insns/op, 347508 cycles/op, 0 errors)
9597.75 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 501428 insns/op, 348604 cycles/op, 0 errors)
9615.54 tps (171.6 allocs/op, 16.9 logallocs/op, 55.6 tasks/op, 501313 insns/op, 347935 cycles/op, 0 errors)
9577.03 tps (171.8 allocs/op, 17.0 logallocs/op, 55.7 tasks/op, 503283 insns/op, 349251 cycles/op, 0 errors)
After:
15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors)
15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors)
15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors)
15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors)
15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors)
15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors)
15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors)
The std::optional<> adds 8 bytes.
And dht::token adds 8 bytes due to _kind, which in this case is always
kind::key.
The size changd from 56 to 48 bytes.
During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus, finished request does not imply that a node is removed from
the topology.
Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.
Modify the get_children method to ignore the exception and warn
about the failure instead.
In get_children we get the vector of alive nodes with get_nodes.
Yet, between this and sending rpc to those nodes there might be
a preemption. Currently, the liveness of a node is checked once
again before the rpcs (only with gossiper not in topology - unlike
get_nodes).
Modify get_children, so that it keeps a token_metadata_ptr,
preventing topology from changing between get_nodes and rpcs.
Remove test_get_children as it checked if the get_children method
won't fail if a node is down after get_nodes - which cannot happen
currently.
Fixes: SCYLLADB-942
Adds an injection signal _from_ table::seal_active_memtable to allow us to
reliably wait for flushing. And does so.
Closesscylladb/scylladb#29070
Since commit 509f2af8db, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.
Fixes https://github.com/scylladb/scylladb/issues/24850.
Only 2026.1 is affected.
Closesscylladb/scylladb#29032
* github.com:scylladb/scylladb:
replica: Demote log level on split failure during shutdown
service: Demote log level on split failure during shutdown
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range only the front one's token
is accounted. This looks like a misprint.
The limiter was introduced in cc9a2ad41f
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29050
A followup of the merge of two test cases that happened in the previous
patch. Both used `foo = N if domain == bar else M` to evaluate the
parameters for topology. Using if-else block makes it immediately obvious
which topology and scope apply for each domain value without having to
evaluate multiple inline conditionals.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The two tests differ only in the way they set up the topology for the
cluster and the post-restore checks against the resulting streams.
The merge happens with the help of a "scope_is_same" boolean parameter
and corresponding updates in the topology setup and post-checks.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The one in "different domain" test is simpler because the test performs
less checks. Next patch will merge both tests and making regexp-s look
identical makes the merge even smother.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Update test running instructions to reflect unified pytest-based runner.
The test.py now requires full test paths with file extensions for both
C++ and Python tests.
No backport: The change is only relevant for recent test.py changes in
master.
Closesscylladb/scylladb#29062
The auth::cache::includes_table function also covers role_members and
role_attributes. The existing check was removed because it blocked these
tables from triggering necessary cache updates.
While previously non-critical (due to unused attributes and table coupling),
maintaining a correct cache is essential for upcoming changes.
Verify that the directory listers opened by get_snapshot_details
are properly closed when handling an (injected) exception.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The comment says the snapshot directory may contain a `schema.sql` file,
but the code treats `schema.cql` as the special-case schema file.
Reported-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since this is a coroutine, we cannot just use deferred_close,
but rather we need to catch an error, close the lister, and then
return the error, is applicable.
Fixes: SCYLLADB-1013
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
NetworkTopologyStrategy is the preferred choice. We should not
use SimpleStrategy anymore. This patch changes the topology strategy
for all the maintenance socket tests.
Refs SCYLLADB-1070
Add a cql_clusters pytest fixture that tracks CQL driver Cluster
objects and shuts them down automatically after test completion.
This replaces manual shutdown() calls at the end of each test.
Also consolidate shutdown() calls in retry helpers into finally
blocks for consistent cleanup.
Refs SCYLLADB-1070
GRANT/REVOKE fails on the maintenance socket connections,
because maintenance_auth_service uses allow_all_authorizer.
allow_all_authorizer allows all operations, but not GRANT/REVOKE,
because they make no sense in its context.
This has been observed during PGO run failure in operations from
./pgo/conf/auth.cql file.
This patch introduces maintenance_socket_authorizer that supports
the capabilities of default_authorizer ('CassandraAuthorizer')
without needing authorization.
Refs SCYLLADB-1070
This patch is mostly for the purpose of running pgo CI job.
We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.
In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071
Backport: 2026.1 - it failed on CI for that build
Closesscylladb/scylladb#29063
* github.com:scylladb/scylladb:
perf: add abort_source support to wait-for-port loops
perf-alternator: wait for alternator port before running workload
- fix s3::range max value for object size which is 50TiB and not 5.
- refactor constants to make it accessible for all interested parties, also reuse these constants in tests
No need to backport, doubt we will encounter an object larger than 5TiB
Closesscylladb/scylladb#28601
* github.com:scylladb/scylladb:
s3_client: reorganize tests in part_size_calculation_test
s3_client: switch using s3 limits constants in tests
s3_client: fix the s3::range max object size
s3_client: remove "aws" prefix from object limits constants
s3_client: make s3 object limits accessible
This commit replaces the Open Soruce example from the Installation section for CentOS.
We updated the example for Ubuntu, but not for CentOS.
We don't want to have any Open Source information in the docs.
Fixes https://github.com/scylladb/scylladb/issues/29087
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the datable, before the checks for expected data, so if checks fail, the data in the table is known.
Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)
Backport: test related improvement, no backport
Closesscylladb/scylladb#28899
* github.com:scylladb/scylladb:
test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
replica/database: consolidate the two database_apply error injections
service/storage_proxy: add name of table to error message for write errors
Previously, all stream-table fixtures in test_streams.py used scope="function",
forcing a fresh table to be created for every test, slowing down the test a bit
(though not much), and discouraging writing small new tests.
This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before the true
stream head, causing leftover events from a previous test to appear in the next
test's reads.
The first two tests in this series fix small problems that turn up once we start
sharing test tables in test_streams.py. The final patch fixes the "LATEST" problem
and enables sharing the test table by using "module" scope fixtures instead of
"function".
After this series, test_streams.py run time went down a bit, from 20.2 seconds to 17.7 seconds.
Closesscylladb/scylladb#28972
* github.com:scylladb/scylladb:
test/alternator: speed up test_streams.py by using module-scope fixtures
test/alternator: test_streams.py don't use fixtures in 4 tests
test/alternator: fix do_test() in test_streams.py
This test relies on the cache entry being evicted after 200ms past the
TTL. This may not happen on a busy CI machine. Make the test less
reliant on timing by using eventually_true().
Simplify the test by dropping the second entry, it doesn't add anything
to the test.
Fixes: SCYLLADB-811
Closesscylladb/scylladb#28958
This fixtures starts the mock server and immediately connects to it to
setup the expected requests. The connection attempt might be too early,
so there is a retry loop with a timeout. The loop currently checks for
requests.exception.ConnectionError. We've seen a case where the
connection is successful but the request fails with 404. The mock
started the server but didn't setup the routes yet. Add a retry for http
404 to handle this.
Fixes: SCYLLADB-966
Closesscylladb/scylladb#29003
When handling `repair_stream_cmd::end_of_current_rows`, passing the
foreign list directly to `put_row_diff_handler` triggered a massive
synchronous deep copy on the destination shard. Additionally, destroying
the list triggered a synchronous deallocation on the source shard. This
blocked the reactor and triggered the CPU stall detector.
This commit fixes the issue by introducing `clone_gently()` to copy the
list elements one by one, and leveraging the existing
`utils::clear_gently()` to destroy them. Both utilize
`seastar::coroutine::maybe_yield()` to allow the reactor to breathe
during large cross-shard transfers and cleanups.
Fixes SCYLLADB-403
Closesscylladb/scylladb#28979
vector
The lister loop in get() pre-fetches records in batches and keeps them
in a _info vector, iterating over it with the help of _pos cursor. When
the vector is re-read, the cursor must be reset too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The lister loop in get() method looks weird. It uses do-while(false)
loop and calls continue; inside when filter asks to skip a entry.
Skipping, thus, aborts the whole thing and EOF-s, which is not what's
supposed to happen.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series fixes a metrics visibility gap in Alternator and adds regression coverage.
Until now, BatchGetItem and BatchWriteItem updated global latency histograms but did not consistently update per-table latency histograms. As a result, table-level latency dashboards could miss batch traffic.
It updates the batch read/write paths to compute request duration once and record it in both global and per-table latency metrics.
Add the missing tests, including a metric-agnostic helper and a dedicated per-table latency test that verifies latency counters increase for item and batch operations.
This change is metrics-only (no API/behavior change for requests) and improves observability consistency between global and per-table views.
Fixes#28721
**We assume the alternator per-table metrics exist, but the batch ones are not updated**
Closesscylladb/scylladb#28732
* github.com:scylladb/scylladb:
test(alternator): add per-table latency coverage for item and batch ops
alternator: track per-table latency for batch get/write operations
Remove outdated references to filtering on columns provided in the
index definition, and remove the note about equal relations (= and IN)
being the only supported operations. Vector search filtering currently
supports WHERE clauses on primary key columns only.
Closesscylladb/scylladb#28949
Permits in the `waiting_for_memory` state represent already-executing reads that are blocked on memory allocation. Preemptively aborting them is wasteful -- these reads have already consumed resources and made progress, so they should be allowed to complete.
Restrict the preemptive abort check in maybe_admit_waiters() to only apply to permits in the `waiting_for_admission` state, and tighten the state validation in `on_preemptive_aborted()` accordingly.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016
Backport not needed. The commit introducing replica load shedding is not part of 2026.1
Closesscylladb/scylladb#29025
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory
reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit
Check abort_source on each retry iteration in
wait_for_alternator and wait_for_cql so the
wait can be interrupted on shutdown.
Didn't use sleep_abortable as the sleep is very short
anyway.
Fixes: SCYLLADB-244
Disables snapshot control such that any active ops finish/fail
before proceeding with decommission.
Note: snapshot control provided as argument, not member ref
due to storage_service being used from both main and cql_test_env.
(The latter has no snapshot_ctl to provide).
Could do the snapshot lockout on API level, but want to do
pre-checks before this.
Note: this just disables backup/snapshot fully. Could re-enable
after decommission, but this seems somewhat pointless.
v2:
* Add log message to snapshot shutdown
* Make test use log waiting instead of timeouts
Closesscylladb/scylladb#28980
This patch is mostly for the purpose of running pgo CI job.
We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.
In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.
Some tests, when create a cluster, configure nodes with the rf-rack-valid option, because sometimes they want to have it OFF. For that the option is explicitly carried around, but the cluster creating helper can guess this option itself -- out of the provided topology and replication factor.
Removing this option simplifies the code and (which a nicer outcome) the test "signature" that's used e.g. in command-line to run a specific test.
Improving tests, not backporting
Closesscylladb/scylladb#28860
* github.com:scylladb/scylladb:
test: Relax topology_rf_validity parameter for some tests
test: Auto detect rf-rack-valid option in create_cluster()
Dtest failed with:
table - Failed to load SSTable .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db
of origin memtable due to std::runtime_error (Cannot split
.../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db because manager has compaction
disabled, reason might be out of space prevention), it will be unlinked...
The reason is that the error above is being triggered when the cause is
shutdown, not out of space prevention. Let's distinguish between the two
cases and log the error with warning level on shutdown.
Fixes https://github.com/scylladb/scylladb/issues/24850.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).
Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.
A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913Closesscylladb/scylladb#28998
Since commit 509f2af8db, gate_closed_exception can be triggered
for ongoing split during shutdown. The commit is correct, but it
causes split failure on shutdown to log an error, which causes
CI instability. Previously, aborted_exception would be triggered
instead which is logged as warning. Let's do the same.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Replaced multiple per-action workflow jobs with a single consolidated
call to main_pr_events_jira_sync.yml. Added 'edited' event trigger.
This makes CI actions in PRs more readable and workflow execution faster.
Fixes:PM-253
Closesscylladb/scylladb#29042
changes in this commit:
1)rename class from 'TestContext' to 'Context' so pytest will not consider this class as a test
2)extend pytest filterwarnings list to ignore warnings from external libs
3) use datetime.datetime.now(datetime.UTC) unstead datetime.datetime.utcnow()
4) use ResultSet.one() instead ResultSet[0]
Fixes SCYLLADB-904
Fixes SCYLLADB-908
Related SCYLLADB-902
Closesscylladb/scylladb#28956
The test assumes that the sleep duration will be at least the value of
the sleep parameter. However, the actual sleep time can be slightly less
than requested (e.g., a 100ms sleep request might result in a 99ms
sleep).
This commit adjusts the test's time comparison to be more lenient,
preventing test flakiness.
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.
This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.
Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.
Fixes: SCYLLADB-832
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.
The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)
For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.
This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.
Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)
[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#27517
* github.com:scylladb/scylladb:
test: add tests for CQL forwarding
transport: enable CQL forwarding for strong consistency statements
transport: add remote statement preparation for CQL forwarding
transport: handle redirect responses in CQL forwarding
transport: add exception handling for forwarded CQL requests
transport: add basic CQL request forwarding
idl: add a representation of client_state for forwarding
cql_server: handle query, execute, batch in one case
transport: inline process_on_shard in cql_server::process
transport: extract process() to cql_server
transport: add messaging_service to cql_server
transport: add response reconstruction helpers for forwarding
transport: generalize the bounce result message for bouncing to other nodes
strong consistency: redirect requests to live replicas from the same rack
transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
transport: extract the error handling from process_request_one
transport: move error response helpers from connection to cql_server
Update warn and fail messages for the write_consistency_levels_warned
and write_consistency_levels_disallowed guardrails to include the
configuration option name and actionable guidance. The main motivation
is to make the messages follow the conventions of other guardrails.
Refs: SCYLLADB-257
When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage.
Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages.
For tablet size collection, we should include all the tablets which are currently taking up disk space. This means:
- on leaving replica, include all tablets until the `cleanup` stage
- on pending replica, include all tablets starting with the `write_both_read_new` and later stages
While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1
Fixes: SCYLLADB-829
Closesscylladb/scylladb#28587
* github.com:scylladb/scylladb:
load_stats: add filtering for tablet sizes
load_stats: move tablet filtering for table size computation
load_stats: bring the comment and code in sync
Tests that call create_cluster() helper no longer need to carry the
rf-validity parameter. This simplifies the code and test signature.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper accepts its as boolean argument, but it can easily estimate
one from the provided topology.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before b59b3d4 the migration code checked that service level controller
is on v2 version before migration and the check also implicitly checked
that _sl_data_accessor field is already initialized, but now that the
check is gone the migration can start before service level controller is
fully initialized. Re add the check, but to a different place.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1049Closesscylladb/scylladb#29021
Permits in the `waiting_for_memory` state represent already-executing
reads that are blocked on memory allocation. Preemptively aborting
them is wasteful -- these reads have already consumed resources and
made progress, so they should be allowed to complete.
Restrict the preemptive abort check in maybe_admit_waiters() to only
apply to permits in the `waiting_for_admission` state, and tighten
the state validation in `on_preemptive_aborted()` accordingly.
Adjust the following tests:
+ test_reader_concurrency_semaphore_abort_preemptively_aborted_permit
no longer relies on requesting memory
+ test_reader_concurrency_semaphore_preemptive_abort_requested_memory_leak
adjusted to the fix
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016
- Added VECTOR to the comma-separated list of Jira project keys in `call_sync_milestone_to_jira.yml`.
- The `jira_project_keys` value changed from `SCYLLADB,CUSTOMER,SMI,RELENG` to `SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR`.
- The VECTOR project needs to sync with scylladb.git milestones, so that when a GitHub milestone is created or closed in scylladb/scylladb, the corresponding Jira release is also created or released in the VECTOR project.
- Previously only SCYLLADB, CUSTOMER, SMI, and RELENG projects were synced.
Fixes:PM-220
Closesscylladb/scylladb#29014
This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions components digests are also validated duriung scrub in validate mode
Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception.
Depends on https://github.com/scylladb/scylladb/pull/28338
Backport is not required, this is new feature
Fixes https://github.com/scylladb/scylladb/issues/20103Closesscylladb/scylladb#28761
* github.com:scylladb/scylladb:
test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade
sstables: propagate ignore_component_digest_mismatch config to all load sites
sstables: add option to ignore component digest mismatches
sstable_compaction_test: Add scrub validate test for corrupted index
sstables: add tests for component digest validation on corrupted SSTables
sstables: validate index components digests during SSTable scrub in validate mode
sstables: verify component digests on SSTable load
sstables: add digest_file_random_access_reader for CRC32 digest computation
Fix several test cases that did not await async tasks:
- test_restart_leaving_replica_during_cleanup
- test_restart_in_cleanup_stage_after_cleanup
- test_tablet_back_and_forth_migration
- test_staging_backlog_is_preserved_with_file_based_streaming
Fixes SCYLLADB-910
* Minor fixes, no backport needed
Closesscylladb/scylladb#28908
* github.com:scylladb/scylladb:
test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather
test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet
test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task
test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task
test_tablets_migration: drop unused imports from cassandra.query
Fixes#25084
Add slirp4netns and use for nested containers. This will allow nested container port aliasing, helping CI stability.
Note: this contains and updated Dockerfile for dbuild image, but since chicken and eggs, right now will force install slirp4netns before anything in dbuild script.
Updates the mock server handling to use ephemeral ports and query from container, ensuring we don't get port collisions. (boost as well as pytest).
Includes a timeout up, and a tweak to our scylla_cluster handling, ensuring we don't deadlock when pipe size is less than requires for our sys notify messages.
Closesscylladb/scylladb#28727
* github.com:scylladb/scylladb:
gcs_fixture: Change to use docker helper
aws_kms_fixture: Modify to use docker helper
test/lib/proc_util: Add docker helper
pytest: use ephemeral port publish for docker mock servers
dbuild: Use container network in dbuild nested containers
scylla_cluster: Read notify sock in background to prevent deadlock
A few days ago, in commit 7b30a39 we added to pytest.ini the option
xfail_strict. This option causes every time a test XPASSes, i.e., an xfail
test actually passes - to be considered an error and fail the test.
But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:
test/cluster/test_alternator.py::test_alternator_concurrent_rmw_same_partition_different_server
This test reproduces a timing-related bug (if you do an LWT write to
one partition on to two different coordinators "at the same time", you
can get a failure), but only most of the time, not 100% of the time.
The solution is to add "strict=False" for the xfail marker on this specific
test. This undoes the xfail_strict for this specific test, accepting that
this specific test can either pass or fail. Note that this does NOT make
this test worthless - we still see this test failing most of the time, and
when a developer finally fixes this issue, the test will begin to pass all
the time.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-941
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#29016
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.
Also, graceful shutdown will be blocked on it.
Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.
Reason for deadlock:
When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.
If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this
Initial state:
cg0: main
cg1: main
cg2: main
cg3: main
After 1st merge:
cg0': main [locked], merging_groups=[cg0.main, cg1.main]
cg1': main [locked], merging_groups=[cg2.main, cg3.main]
After 2nd merge:
cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]
merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.
The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.
Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.
Tablet boost unit tests which trigger merge need to be adjusted to
call the barrier, otherwise they will be vulnerable to the deadlock.
Two cluster tests were removed because they assumed that merge happens
in the backgournd. Now that it happens as part of merge finalization,
and blocks topology state machine, those tests deadlock because they
are unable to make topology changes (node bootstrap) while background
merge is blocked.
The test "test_tablets_merge_waits_for_lwt" needed to be adjusted. It
assumed that merge finalization doesn't wait for the erm held by the
LWT operation, and triggered tablet movement afterwards, and assumed
that this migration will issue a barrier which will block on the LWT
operation. After this commit, it's the barrier in merge finalization
which is blocked. The test was adjusted to use an earlier log mark
when waiting for "Got raft_topology_cmd::barrier_and_drain", which
will catch the barrier in merge finalization.
Fixes SCYLLADB-928
Needs to be ordered before split finalization, because storage_group
must be in split mode already at finalization time. There must be
split-ready compaction groups, otherwise finalization fails with this
error:
Found 0 split ready compaction groups, but expected 2 instead.
Exposed by increased split activity in tests.
Will be called in tests. It does the local part of the global topology
barrier.
The comment:
// We capture the topology version right after the checks
// above, before any yields. This is crucial since _topology_state_machine._topology
// might be altered concurrently while this method is running,
// which can cause the fence command to apply an invalid fence version.
was dropped, because it's no longer true after
fad6c41cee, and it doesn't make sense in
the context of local_topology_barrier(). We'd have to propagate the
version to local_topology_barrier(), but it's pointless. The fence
version is decided before calling the local barrier, and it will be
valid even if local version moves ahead.
This is short cleanup after recent removal of creating default cassandra superuser and auth-v1 code removal.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1036
Backport: no, just code cleanup
Closesscylladb/scylladb#29004
* github.com:scylladb/scylladb:
auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD
auth: use configurable default_superuser in describe_roles
auth: move default_superuser to common, remove _superuser member
auth: use LOCAL_ONE for all auth queries
auth: remove get_auth_ks_name indirection
can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused.
No need to backport, code removal,
Closesscylladb/scylladb#29002
* github.com:scylladb/scylladb:
service level: make maybe_update_per_service_level_params synchronous
service level: remove unused get_user_scheduling_group function
service level: drop async find_effective_service_level
service level: remove remnants of version 1 service level
Introduced by 54bddeb3b5, the yield was
added to write_cell(), to also help the general case where there is no
collection. Arguably this was unnecessary and this patch moves the yield
to write_collection(), to the cell write loop instead, so regular cells
don't have to poll the preempt flag.
Closesscylladb/scylladb#29013
This PR shortens two sleeps from 1s to 100ms to speed up bootstrap in tests.
The changed sleeps are:
- the pause duration in group0 discovery,
- the retry period in `wait_for_cql`.
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-918
No backport: performance improvements mostly relevant to tests.
Closesscylladb/scylladb#29020
* github.com:scylladb/scylladb:
test: pylib: util: wait for CQL being ready with a shorter period
group0: discovery: shorten the pause duration
Add basic cluster tests for CQL forwarding.
The test cases include:
- basic reads and writes
- prepared statements with binds
- forwarding from a non-replica
- exception passthrough during forwarding (using an injection)
- re-preparing a statement on the target node, even if the user
query is also an EXECUTE request on a prepared statement
- verification metric updates
The existing test_basic_write_read was modified so that a few extra
cases could be validated on the same cluster.
We enable CQL forwarding by starting to return the bounce_to_node
result message in redirect_statement() instead of throwing. The
forwarding code introduced in the preceding patches reacts to these
messages, allowing the requests to be forwarded.
With the update, some tests assuming that requests can't be forwarded
need to be adjusted, so we do that as well.
During forwarding of CQL EXECUTE requests, the target node may
not have the prepared statement in its cache. If we do have this
statement as a coordinator, instead of returning PREPARED NOT FOUND
to the client, we want to prepare the statement ourselves on target
node.
For that, we add a new FORWARD_CQL_PREPARE RPC. We use the new RPC
after gettting the prepared_not_found status during forwarding. When
we try to forward a request, we always have the query string (we
decide whether to forward based on this query), so we can always use
the new RPC when getting the prepared_not_found status.
After receiving the response, we try forwarding the EXECUTE request
again.
During CQL forwarding, when the target node can't handle the request,
it will find another node which can execute the request or which knows
where the request can be executed. We return this information in
responses to CQL forwarding, and in this patch, we add handling of
this kind of a response.
After getting a redirect response, we retry forwarding to the returned
host/shard until success or timeout. This can happen many times during
a single request, when we first forward to a replica and later to the
coordinator, or when a replica/coordinator migrated while we were
performing the forwarding
When a forwarded request fails on the remote node, we can't use the
exception handling that happens in process_request_one because we
don't go through this code path. Instead, we use the previously
extracted cql_server::handle_exception handler, which performs
all accounting on the forwarded-to node, and which prepares the
response. For the read_failure_exception_with_timeout exception,
we need to perform the sleep on the source node, so we return the
timeout in the forwarding response and use it on the source node
to know how long to sleep without any extra calculations.
The handle_forward_execute() method is extracted from the inline handler
lambda to make the error catching wrapper cleaner.
Add the infrastructure for forwarding CQL requests to other nodes.
When a process() call results in a node bounce (as opposed to a shard
bounce), the coordinator serializes the request and sends it via the
FORWARD_CQL_EXECUTE RPC verb to the target node.
In this patch we omit several features that allow handling more
scenarios that can happen when trying to forward a CQL request,
but the RPC request and response are already prepared for them.
They will be handled in the following commits.
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse. The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.
The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.
In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one
Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938
Backport: no, new feature, but we may reconsider if some customer needs it
Closesscylladb/scylladb#28919
* github.com:scylladb/scylladb:
cql3: track CQL parsing memory cost and use it for admission control
utils: add rolling max tracker
In the following patches, when we start allowing to forward CQL
requests to other nodes, we'll need to use the same client state
for executing the request on the destination node as we had on the
source. client_state contains many fields and we need to create
a new instance of it when we start handling the forwarded request,
so to prepare for the forwarding RPC, we add a serializable format
of the client_state as an IDL struct. The new class is missing some
fields that are not used while executing requests, and some whose
value is determined by the fact that the client state is used for
a forwarded request.
These include:
- driver name, driver version, client options - not used for executing
requests. Instead, we use these as data sources for the virtual
"clients" system table.
- auth_state - must be READY - we reached a bounce message, so we were
able to try executing the request locally
- _control_connection - used for altering a cql_server::connection, which
we don't have on the target node
- _default_timeout_config - used when updating service levels, also only
per-connection
- workload_type - used for deciding whether to allow shedding at the
start of processing the request, and for getting per-connection service
level params (for an API)
Currently we perform the same steps when handling query, execute
and batch CQL requests. So instead of creating multiple functions
performing these steps, we can handle them all in one fallthrough
case in cql_server::connection::process_request_one.
The process_on_shard method is relatively short, it's only used
in the process() method and the Process concept that is uses
is as long as the function itself. This area will be made more
complex by the following patches for cql forwarding, so we simplify
it by inlining process_on_shard in cql_server::process.
Move process() and process_on_shard() from cql_server::connection to
cql_server. The process() method is no longer a template - instead, it
takes an opcode parameter and uses get_process_fn_for_opcode() to select
the appropriate internal processing function.
The process_query, process_execute, and process_batch wrappers on
connection now delegate to _server.process() with the appropriate opcode.
This refactoring is preparation for CQL request forwarding, where
process() will need to be called from a context other than connection
- the forwarding RPC handler).
The messaging service will be used by cql_server to register RPC
handlers for forwarding CQL requests between nodes.
We pass it through the controller to cql_server.
Expose response::flags() and response::extract_body(), and a new constructor.
It will be needed for creating a cql_transport::response from the response body returned
during CQL forwarding.
In the following patches, we'll start allowing forwarding requests to strongly
consistent tables so that they'll get executed on the suitable tablet Raft group
members. For that we'll reuse the approach that we already have for bouncing
requests to other shards - we'll try to execute a request locally, and the
result of that will be a bounce message with another replica as the target.
In this patch we generalize the former bounce_to_shard result message so that
it will be able to specify the target of the bounce as another shard or specific
replica.
We also rename it to result_message::bounce so that it stops implying that only
another shard may be its target.
Aside from the host_id and the shard, the new message also includes the timeout,
because in the service handling the forwarding we won't have the access to it,
and it's needed for specifying how long we should wait for the forwarded
requests. It also includes an information whether this is a write request
to return correct timeout response in case the deadline is exceeded.
We will return other hosts in the new bounce message when executing requests to
strongly consistent tables when we can't handle the request because we aren't
a suitable replica. We can't handle this message yet, so we don't return it
anywhere and we still assume that every bounce message is a bounce to the same
host.
Forwarding CQL requests is not implemented yet, but we're already
prepared to return the target to forward to when trying to execute
strongly consistent requests. Currently, if we're not a replica
of the affected tablet, we redirect the request to the first replica
in the list.
This is not optimal, because this replica may be down or it may be
in another rack, making us perform cross-rack requests during forwarding.
Instead, we should forward the request to the replica from the same
rack and handle the case where the replica is down.
In this patch we change the replica selection for forwarding strongly
consistent requests, so that when the coordinator isn't a replica, it
redirects the request to the replica from the same rack.
If the replica from the same rack is down, or there is no replica in
our rack, we choose the next closest replica (preferring same-DC replicas
over other DCs). If no replica is alive, the query fails - the driver
should retry when some replica comes back up.
A permit in `waiting_for_memory` state can be preemptively aborted by
maybe_admit_waiters(). This is wrong: such permits have already been
admitted and are actively processing a read — they are merely blocked
waiting for memory under serialize-limit pressure.
When `on_preemptive_aborted()` fires on a `waiting_for_memory` permit,
it does not clear `_requested_memory`. A subsequent `request_memory()`
call accumulatesa on top of the stale value, causing `on_granted_memory()`
to consume more than resource_units tracks.
This commit adds a test that confirms that scenario by counting
internal_errors.
The test_raft_voters_multidc_kill_dc scenario had become weaker after group0 voter count was made always odd.
In particular, the old num_nodes == 1 case (dc1=2, dc2=1, dc3=1) could pass even without the intended balancing logic, because with 3 voters total we naturally get one voter per DC.
This change restores coverage of the original intent:
- Replace num_nodes parametrization with explicit DC triples.
- Use (3, 1, 1) to force a meaningful asymmetric topology where voter placement logic is required.
- Keep a larger topology case (6, 3, 3) for broader coverage.
- Mark (6, 3, 3) as skip_mode(debug) with reason:
larger topology case is too slow in debug on minipcs.
Also updated comments/docstring to match the new setup.
Fixes: SCYLLADB-794
backport: None, it is done to deflake minipcs that will start working only on master
Closesscylladb/scylladb#29000
Change sleep_until_timeout_passes() to accept a foreign_ptr<std::unique_ptr<response>>.
We can easily create the foreign_ptr for the responses created in the CQL server,
but we'll need this when we get responses when forwarding CQL statements - the responses
may come from other shards.
We also move it from cql_server::connection to cql_server, because for forwarded CQL
requests, we'll need to handle it at the cql_server level.
The method also loses its const qualifier - the abort_source that we pass into
sleep_abortable needs to be non-const. Apparently, we could still use it in a const
method of cql_server::connection because we passed it as _server._abort_source which
caused the const qualifier to be lost.
After stop() moved _reaper, in-flight with_connection() callbacks could
still call reap(), which accessed the moved-from future causing a
SIGSEGV in future_base::detach_promise(). Add a seastar::gate so
stop() waits for all in-flight operations before moving _reaper.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1043Closesscylladb/scylladb#29015
`wait_for_cql` is used in hundreds, if not thousands, of places in tests.
We shouldn't waste up to 1s for every call.
Also, the 1s period is clearly too long compared to the bootstrap time,
which is usually 0-3s in dev mode.
The following test speeds up from 50s to 42s with the change:
```
for _ in range(10):
servers = await manager.servers_add(3)
await manager.get_ready_cql(servers)
```
Nodes currently pause group0 discovery for 1s. This case is always hit while
adding multiple nodes in parallel to an empty cluster by all nodes except the
one that becomes the group0 leader.
This is fine in production, but in tests, the slowdown is quite significant.
Every `manager.servers_add(n)` call for n > 1 becomes 1s slower when the
cluster is empty. Many cluster tests are affected.
In this commit, we decrease the sleep duration from 1s to 100ms to speed up
tests. The consequence of this change is that nodes might perform more steps
in group0 discovery, but the increase in CPU usage and network traffic should
be negligible.
Currently the test iterates on all servers and calls manager.api.disable_injection
but it doesn't await those calls.
Use asyncio.gather to await all calls in parallel.
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Replace CAS contention histograms in storage proxy stats with
estimated_histogram_with_max<128> and switch metrics/API aggregation to the
new histogram path.
Introduce a dedicated cas_contention_histogram alias and use it for
cas_read_contention and cas_write_contention.
Update API histogram reduction to merge the new histogram type via
estimated_histogram_with_max_merge.
Convert API JSON serialization to explicit offsets/counts using
get_buckets_offsets() and get_buckets_counts().
Export CAS contention metrics with to_metrics_histogram(...) instead of the
legacy get_histogram(1, 8) path for consistent bucket handling.
Add utility accessors to approx_exponential_histogram to export bucket
boundaries and bucket counts in a form suitable for display/tests when
Min < Precision causes repeated integer limits.
Add MAX compile-time constant alias for the template Max parameter.
Add get_buckets_offsets() to return bucket lower limits with duplicate
adjacent limits removed.
Add get_buckets_counts() to return counts aligned with the deduplicated
limits, merging counts from buckets that share the same lower limit.
Keep existing histogram behavior unchanged.
This new functionality is intended for API use and not for
performance-critical paths.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
serialize_collection_mutation() copies the serialized collection into
the returned collection_mutation object. Change to move to avoid the
copy.
Fixes: SCYLLADB-1041
Closesscylladb/scylladb#29010
can_use_effective_service_level_cache() always returns true now, so the
function can be dropped entirely and all the code that assumes it may
return false can be dropped as well.
The Alternator test test_compressed_request.py::test_gzip_request_oversized
checks that a very large request that compresses to a small size is still
rejected. This test passed on Alternator, but used to fail on DynamoDB
because DynamoDB didn't reject this case. This was a bug in DynamoDB
(a "decompression bomb" vulnerability), and after I reported it, it
was fixed.
So now this test does pass on DynamoDB (after a small modification to
allow for different error codes). So remove its scylla_only marker,
and make the comment true to the current state.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28820
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse. The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.
The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.
In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
In this patch we replace every single use of SCYLLA_ASSERT(), abort() and assert() in the cql3/ directory by throwing_assert().
The problem with SCYLLA_ASSERT()/abort()/assert() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again.
All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure.
I reviewed all the changes that I did in these patches to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely.
throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT()/abort() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, and abort() often prints no message at all. But throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message.
Fixes#13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)
Closesscylladb/scylladb#28847
* github.com:scylladb/scylladb:
cql3: remove unnecessary assert()
cql3: replace abort() by throwing_assert()
cql3: Replace SCYLLA_ASSERT by throwing_assert
The `test/cqlpy/cassandra_tests/validation/entities/json_test.py::testJsonOrdering` was failing because of differences between Cassandra and Scylla in printing
JSON floating point values - e.g. Cassandra prints 30.0, where Scylla prints 30.
Both are valid, so in this patch, instead of comparing strings, we compare parsed JSON using `EquivalentJson`.
Fixes#28467Closesscylladb/scylladb#28924
Replace the hardcoded meta::DEFAULT_SUPERUSER_NAME comparison with
default_superuser(_qp) which reads from the auth_superuser_name
config option. This makes the IF NOT EXISTS clause in DESCRIBE
output correct for clusters with a non-default superuser name.
_"A journey of a thousand miles begins with a single step" Lao Tzu_
ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: approx_exponential_histogram. It is both CPU and
memory-efficient and can be exported as Prometheus native histograms.
Its main limitation (which has its benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.
The end goal is to replace all uses of estimated_histogram in the codebase.
That migration needs a few small API adjustments, so I am splitting the work
into steps for easier review.
This series is the first step. It introduces a base template for fixed-size
estimated histograms, and switches the Alternator's estimated_histogram with the template.
This change is self-contained and valuable on its own, while keeping the scope limited.
Minor adjustments were made to the code and tests so that the tests would pass.
Follow-up PRs will apply the same pattern to the rest of the code.
**New feature no need to backport**
Closesscylladb/scylladb#28987
* github.com:scylladb/scylladb:
alternator: migrate to operation_size_kb histograms
test/alternator/test_metrics.py: Update the bucket in the histogram search
alternator: Use batch_histogram for batch size histograms
estimated_histogram.hh: adds estimated_histogram_with_max
When we forward CQL statements, we'll need to handle the errors
on the destination node. Only for read_failure_exception_with_timeout
exception, we'll still need to wait until timeout passes on the
source node.
For that we extract the exception handling to a separate method.
Additionally, we separate the waiting and all other handling,
so that all handling aside from waiting will be reusable after
forwarding, and we'll also be able to sleep on the source node
if necessary.
These methods are used only in the error handler in the cql server,
and outside of 3 cases, they don't need any information from the
cql_server::connection. We move them from cql_server::connection
to cql_server, so that they can be used in the following patches
for methods for CQL request forwarding where we'll have no instance
of cql_server::connection on the node forwarded to.
After the change the methods require no access to the server's
or connection's fields, so we also make them static methods.
Switch Alternator operation-size metrics from the legacy estimated
histogram implementation to estimated_histogram_with_max<512> and export
them through the native approx-exponential histogram path.
Add a dedicated operation-size histogram type alias based on
estimated_histogram_with_max<512>.
Replace all per-operation size histograms (GetItem/PutItem/DeleteItem/
UpdateItem/BatchGetItem/BatchWriteItem) with the new type.
Remove the custom legacy histogram-to-metrics adapter and use
to_metrics_histogram() for operation size metrics, aligning export
behavior with other approx-exponential histograms.
Update Alternator metrics tests to compute expected le bucket boundaries using
approx-exponential bucket math (including deduplication of equal
bounds), so assertions match the new exported histogram schema.
Update bucket helper signatures to use (max, precision) parameters and keep
+Inf handling unchanged.
Replace byte-to-KB ceiling conversion with plain integer division (bytes
/ 1024): histogram export already reports each bucket by its upper bound
(le), so rounding input values up before bucketing is unnecessary and
would over-shift borderline samples into higher buckets.
Move default_superuser() to auth::meta in common.{hh,cc} and remove the
cached _superuser member from both standard_role_manager and
password_authenticator. The superuser name comes from config which is
immutable at runtime, so caching it is unnecessary.
Removes auth-v1 hack for cassandra superuser as auth-v1
code no longer exists.
Also CL is not really used when quering raft replicated
tables (like auth ones), but LOCAL_ONE is the least confusing
one.
Replace get_auth_ks_name(qp) with db::system_keyspace::NAME directly.
The function always returned the constant "system" and its qp
parameter was unused.
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.
Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.
If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.
Fixes: SCYLLADB-568.
Closesscylladb/scylladb#28739
What changed
* Added closed to milestone event types in
call_sync_milestone_to_jira.yml (types: [created] -> types: [created, closed])
* Added VECTOR to the list of Jira project keys being synced
(jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG -> jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR)
Why (Requirements Summary)
* The call_sync_milestone_to_jira.yml workflow only triggered on
milestone creation. When a GitHub milestone is closed, the
corresponding Jira versions (in SCYLLADB, CUSTOMER, SMI, RELENG
projects) should be marked as released. Adding the closed trigger
enables the called workflow (main_sync_milestone_to_jira_release.yml
in github-automation) to handle both creating and releasing Jira
versions from GitHub milestone events.
* Added the VECTOR project so its Jira versions are also created/released
when milestones are created or closed in scylladb.git.
* This is consistent with the same change already applied to the staging
and scylla-machine-image repos.
Fixes:PM-216
Update call_sync_milestone_to_jira.yml in scylladb.git - add close trigger and VECTOR project sync
Closesscylladb/scylladb#28981
This patch adds estimated_histogram_with_max template that will be a
based for specific estimated_histograms, eventually replacing the current
struct implementation.
Introduce estimated_histogram_with_max<Max> as a reusable wrapper around
approx_exponential_histogram<1, Max, 4>, providing merge support and the
same add helpers used by existing estimated_histogra type.
Add estimated_histogram_with_max_merge()
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Fix an invalid condition, when searching for a parent shard, when table
is based on vnodes. Shards have associated with them `last token` -
token, than marks the end of the range of tokens they consume (inclusive).
An additional assumptions are whole token space is used and
(for vnodes) token space wraps around.
Previously code looked like this:
auto pid = std::upper_bound(..., [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (pid != pids.begin()) {
pid = std::prev(pid);
}
An `upper_bound` call with `t < id.token()` means it is looking for
an iterator, for which value `t < id.token()` changed to true,
which effectively means a position, where iterator is bigger
then searched value. Then we move iterator backward once if possible.
Assuming token space <-2, 2> and parents [0, 2], when we search for:
- -1 -> we will get 0, it's first, so we can't move backward, so 0 (ok)
- 0 -> we will get 2, it's not first, so we go back and we return 0 (ok)
- 1 -> we will get 2, it's not first, so we go back and we return 0
(not ok - should be 2)
The fix is to replace it with `std::lower_bound` and remove conditional
backward motion. Since we've a guarantees that whole token space is used
if `std::lower_bound` ends with `end()` value, then we have a wrap
around case and we need to pick `begin()` as result.
Fixes#28354
Fixes: SCYLLADB-537
Closesscylladb/scylladb#28382
Changes dockerized_service to use ephermal port publish, and
query the published port from podman/docker.
Modifies client code to use slightly changed usage syntax.
The previous commit added extern template declarations to suppress
named_value<T> instantiation in every translation units, but those only
suppress non-inline members. All method bodies defined inside the class
body were inline and thus exempt from extern template, so they were
still emitted as weak symbols in every TU that used them.
Fix this by moving all named_value<T> method definitions out of the class
body in config_file.hh and into config_file_impl.hh as out-of-line template
definitions. Since config_file_impl.hh is included only by db/config.cc,
utils/config_file.cc, sstables/compressor.cc, and
ent/encryption/encryption_config.cc, the method bodies are now compiled
in only those four TUs.
Also add the two missing explicit instantiation pairs that caused linker
errors:
- named_value<vector<object_storage_endpoint_param>> in db/config.cc
- named_value<encryption_config::string_string_map> in encryption_config.cc
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.
This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.
Test coverage was extended in test/cluster/test_prepare_race.py:
- reproduces the invalidation-during-prepare window with injection,
- verifies prepare completes successfully,
- then invalidates again and executes the same stale client prepared object,
- confirms the driver transparently re-requests/re-prepares and execution succeeds.
This change introduces:
- no behavior change for normal prepare flow besides stronger lifetime guarantees,
- no new protocol semantics,
- preserves existing cache invalidation logic,
- adds explicit cluster-level regression coverage for both the race and driver reprepare path.
- pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
Fixes: https://github.com/scylladb/scylladb/issues/27657
Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.
Closesscylladb/scylladb#28952
* github.com:scylladb/scylladb:
transport/messages: hold pinned prepared entry in PREPARE result
cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
Remove the host network setting, ensuring we use private networks
(slirp4netns). This will allow nested container port aliasing,
helping CI stability (can use ephemeral ports and container
introspection).
This also makes the nested podman setup non-conditional,
since we only run podman containers inside dbuild, and need
the setup regardless if host container is docker or not.
Starts a thread to process scylla notify messages (NOTIFY_SOCKET)
instead of just processing inline, non-blocking. This because it
is possible for the pipe created to be to small to hold enough
messages for us to reach the point where we otherwise even read
from said pipe, allowing other end (scylla) to proceed.
This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller.
Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling send_group0_read_barrier_to_live_members after committing will cause the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and waits for them to complete. This is best effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user.
Auth and service levels write paths are switched to use this new mechanism.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650
Backport: no, new feature
Closesscylladb/scylladb#28731
* https://github.com/scylladb/scylladb:
test: add tests for global group0_batch barrier feature
qos: switch service levels write paths to use global group0_batch barrier
auth: switch write paths to use global group0_batch barrier
raft: add function to broadcast read barrier request
raft: add gossiper dependency to raft_group0_client
raft: add read barrier RPC
config.hh is included by a large fraction of the codebase. It pulls in
utils/config_file.hh, whose named_value<T> template has its method
bodies defined in config_file_impl.hh. Those bodies depend on three of
the heaviest Boost headers – boost/program_options.hpp,
boost/lexical_cast.hpp, and boost/regex.hpp – as well as yaml-cpp.
Because the method bodies are reachable from config.hh, every
translation unit that includes config.hh was silently instantiating all
of named_value<T>'s methods (for each distinct T) and compiling that
Boost/yaml-cpp machinery from scratch.
Fix this by adding extern template struct declarations for all 32
distinct named_value<T> specialisations used by db::config:
- the 14 primitive/stdlib types go into utils/config_file.hh
- the 18 db-specific types (enum_option<…>, seed_provider_type, etc.)
go into db/config.hh
Matching explicit template struct instantiation definitions are added in
db/config.cc, which is already the only translation unit that includes
config_file_impl.hh. As a result the Boost/yaml-cpp template machinery
is compiled exactly once (in config.o) instead of being re-instantiated
in every including TU.
One subtlety: named_value<seed_provider_type> has an explicit member
specialisation of add_command_line_option. Per [temp.expl.spec], such
a specialisation must be declared before any extern template declaration
of the enclosing class template, so a forward declaration of the
specialisation is added to config.hh ahead of the extern template line.
Also, for some of the types we explicitly instantiated in db/config.cc,
the named_value<T> constructor calls config_type_for<T>(), which we
also need to provide explicit specializations - some of them we already
had but some were missing.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is till not upgraded to raft topology. Both of those are not supported any more.
No need to backport since we remove functionality here.
Closesscylladb/scylladb#28841
* github.com:scylladb/scylladb:
service level: remove version 1 service level code
features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
migration_manager: remove unused forward definitions
test: remove unused code
auth: drop auth_migration_listener since it does nothing now
schema: drop schema_registry_entry::maybe_sync() function
schema: drop make_table_deleting_mutations since it should not be needed with raft
schema: remove calculate_schema_digest function
schema: drop recalculate_schema_version function and its uses
migration_manager: drop check for group0_schema_versioning feature
cdc: drop usage of cdc_local table and v1 generation definition
storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
storage_service: remove unused functions
group0: drop with_raft() function from group0_guard since it always returns true now
gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
gossiper: drop tokens from loaded_endpoint_state
gossiper: remove unused functions
storage_service: do not pass loaded_peer_features to join_topology()
storage_service: remove unused fields from replacement_info
gossiper: drop is_safe_for_restart() function and its use
storage_service: remove unused variables from join_topology
gossiper: remove the code that was only used in gossiper topology
storage_service: drop the check for raft mode from recovery code
cdc: remove legacy code
test: remove unused injection points
auth: remove legacy auth mode and upgrade code
treewide: remove schema pull code since we never pull schema any more
raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
group0: hoist the checks for an illegal upgrade into main.cc
api: drop get_topology_upgrade_state and always report upgrade status as done
service_level_controller: drop service level upgrade code
test: drop run_with_raft_recovery parameter to cql_test_env
group0: get rid of group0_upgrade_state
storage_service: drop topology_change_kind as it is no longer needed
storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
service_storage: remove unused functions
storage_service: remove non raft rebuild code
storage_service: set topology change kind only once
group0: drop in_recovery function and its uses
group0: rename use_raft to maintenance_mode and make it sync
The `service_error` struct: 6dc2c42f8b/service/vector_store_client.hh (L64)
currently stores just the error status code. For this reason whenever the HTTP error occurs, only the error code can be forwarded to the client. For example see here: 6dc2c42f8b/service/vector_store_client.cc (L580)
For this reason in the output of the drivers full description of the error is missing which forces user to take a look into Scylla server logs.
The objective of this PR is to extend the support for HTTP errors in Vector Store client to handle messages as well.
Moreover, it removes the quadratic reallocation in response_content_to_sstring() helper function that is used for getting the response in case of error.
Fixes: VECTOR-189
Closesscylladb/scylladb#26139
* github.com:scylladb/scylladb:
vector_search: Avoid quadratic reallocation in response_content_to_sstring
vector_store_client: Return HTTP error description, not just code
Refs: SCYLLADB-557
We should use full replication in KS/CF creation and population,
for at least two reasons:
1.) Ensure we wait fully for and write to all nodes
2.) Make test more "real", behaving like a proper cluster
Closesscylladb/scylladb#28959
In cql3/, there was one call to assert() (not SCYLLA_ASSERT or
throwing_assert), and it was:
const auto shard_num = smp::count;
assert(shard_num > 0)
Rather than converting this assert() to throwing_assert() as I did in
previous patches, I decided to outright remove it: Seastar guarantees
that smp::count is not zero. Many other places in the code use
smp::count assuming that it is correct, no other place bothers to assert
it isn't zero.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
After the previous patch replaced all SCYLLA_ASSERT() calls by
throwing_assert(), this patch also replaces all calls to abort().
All these abort() calls are supposedly cases that can never happen,
but if they ever do happen because of a bug, in none of these places
we absolutely need to crash - and exception that aborts the current
operation should be enough.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In this patch we replace every single use of SCYLLA_ASSERT() in the cql3/
directory by throwing_assert().
The problem with SCYLLA_ASSERT() is that when it fails, it crashes Scylla.
This is almost always a bad idea (see #7871 discussing why), but it's even
riskier in front-end code like cql3/: In front-end code, there is a risk
that due to a bug in our code, a specific user request can cause Scylla
to crash. A malicious user can send this query to all nodes and crash
the entire cluster. When the user is not malicious, it causes a small
problem (a failing request) to become a much worse crash - and worse,
the user has no idea which request is causing this crash and the crash
will repeat if the same request is tried again.
All of this is solved by using the new throwing_assert(), which is the
same as SCYLLA_ASSERT() but throws an exception (using on_internal_error())
instead of crashing. The exception will prevent the code path with the
invalid assumption from continuing, but will result in only the current
user request being aborted, with a clear error message reporting the
internal server error due to an assertion failure.
I reviewed all the changes that I did in this patch to check that (to the
best of my understanding) none of the assertions in cql3/ involve the
sort of serious corruption that might require crashing the Scylla node
entirely.
throwing_assert() also improves logging of assertion failures compared
to the original SCYLLA_ASSERT() - SCYLLA_ASSERT() printed a message to
stderr which in many installations is lost, whereas throwing_assert()
uses Scylla's standard logger, and also includes a backtrace in the
log message.
Fixes#13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
To restore how streaming scopes work there are two tests that greatly duplicate each other -- test_restore_with_streaming_scopes from cluster/object_store suite and test_refresh_with_streaming_scopes from cluster suite.
This patch generalizes both into a do_test_streaming_scopes() non-test function
Closesscylladb/scylladb#28874
* github.com:scylladb/scylladb:
test: Re-sort comments around do_test_streaming_scopes()
test: Split do_load_sstables()
test: Drop load_fn argument from do_load_sstables()
test: Re-use do_test_streaming_scopes() in refresh test
test: Introduce SSTablesOnLocalStorage
test: Introduce SSTablesOnObjectStorage
test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()
Document how to migrate a ScyllaDB cluster to different instance
types using the add-and-replace node cycling approach.
Closes: QAINFRA-42
Closesscylladb/scylladb#28458
Recently, in commit 7b30a39, we added to pytest.ini the option xfail_strict.
This option causes every time a test XPASSes, i.e., an XFAIL test actually
passes, to be considered an error and fail the test.
While this has some benefits, it's a big problem when running tests
against a reference implementation like DynamoDB or Cassandra: We
typically mark a test "xfail" if the test shows a known bug - i.e., if
the test fails on Scylla but passes on the reference system (DynamoDB
or Cassandra). This means that when running "test/cqlpy/run-cassandra"
or "test/alternator/run --aws", we expect to see many tests XPASS,
and now this will cause these runs to "fail".
So in this patch we add the xfail_strict=false to cqlpy/run-cassandra
and alternator/run --aws. This option is not added to cqlpy/run or
to alternator/run without --aws, and also doesn't affect test.py or
Jenkins.
P.S. This is another nail in the coffin of doing "cd test/alternator;
pytest --aws". You should get used to running Alternator tests through
test/alternator/run, even if you don't need to run Scylla (the "--aws"
option doesn't run Scylla).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28973
During tests I noticed that if the number of tablets is very small,
say 2, and the number of nodes is 3 (2 shards per node), using the
number of storage groups on each shard, a shard may end up holding 0 groups,
whilst the other holds 1 group. And in some nodes even both shards have
0 groups.
Taking the minimum among shards here was showing in manifests a tablet
count of 0 for all 3 nodes, which is incorrect.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closesscylladb/scylladb#28978
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.
However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.
Wait until repair is done for all tablets in the table.
Fixes: https://github.com/scylladb/scylladb/issues/28202
Backport to 2026.1 needed as it contains the change introducing the issue d51b1fea94Closesscylladb/scylladb#28323
* github.com:scylladb/scylladb:
service: fix indentation
test: add test_tablet_repair_wait
service: remove status_helper::tablets
service: tasks: scan all tablets in tablet_virtual_task::wait
A bug or some bad operator intervention can lead to a sstable existing in a
node after the tablet replica was moved to a different node.
This will result sstable loading during boot failing, requiring operator
intervention.
The log today just dumps the name of the "orphaned" sstable, but one
investigating it might want to know which process (repair, memtable, whatever)
generated that sstable, if the sstable was created locally or remotely,
and the current replica set of the underlying tablet.
From the original identifier, we can know the exact time the sstable was
created on its original node. From the current id, we know the time it
was created on the current node.
All this info can help the investigator to correlate with events in other nodes
(includes actions from the coordinator) to get closer to the root cause.
The new log will look like this:
"Unable to load SSTable .../me-3gyg_1fsw_2u0u826b00b71vc46o-big-Data.db
(originated from compaction with id 913f41c0-18c2-11f1-8f08-cb8521b3f330
on host e483238c-2287-4022-8bc4-b4f1c4cb2b0d)
of tablet 6 (replica set: [e483238c-2287-4022-8bc4-b4f1c4cb2b0d:0])"
Refs https://scylladb.atlassian.net/browse/SCYLLADB-788.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28921
Vector deserialization is an operation which performance is critical for
vector similarity search feature because it is frequently executed during
rescoring operation. Some of the identified performance bottlenecks
for it include:
1. Per-element virtual dispatch in deserialize(): each of the N elements
went through visit() which switches on ~28 type variants. For a
1024-dimension float vector, that's 1024 redundant type switches when
the element type is the same for all of them.
2. Redundant work in split_fragmented(): value_length_if_fixed() was
called inside the loop (N virtual calls), and no reserve() was done
on the output vector causing repeated reallocations.
This series fixes both:
- Introduce deserialize_vector_visitor that dispatches on the element
type once for the entire vector, then loops inside the resolved
handler. Simple numeric types (float, int, etc.) call
deserialize_value() directly with no virtual dispatch per element.
String types (ascii, utf8) get a dedicated handler that skips
make_empty() (sstring has no empty_t constructor). Complex types
(list, map, tuple, etc.) fall back to per-element dispatch.
- In split_fragmented(), reserve the output vector to _dimension and
cache value_length_if_fixed() before the loop.
Benchmark results (1024-dim float vector, release build, -O3 -flto):
deserialize: 15.73 us -> 11.70 us (1.34x, 26% faster)
split_fragmented: 10.34 us -> 7.45 us (1.39x, 28% faster)
References: SCYLLADB-471
Backport: none, unless we observe some critical performance improvement for quantization.
Closesscylladb/scylladb#28618
* github.com:scylladb/scylladb:
types: optimize reading vector fragments
types: optimize vector deserialization for high-dimensional vectors
Similarly to LWTs, we reject queries with user-provided timestamps
when they target strongly consistent tables.
Such statements could force us to rewrite history, and that contradicts
the philosophy of linearizability we aim for.
Fixes SCYLLADB-879
Closesscylladb/scylladb#28867
* seastar d2953d2a...4d268e0e (32):
> Merge 'prometheus: support multiple __name__ filters and prefixed names' from Travis Downs
doc: update prometheus.md with __name__ filter enhancements
prometheus: support prefixed names in __name__ filter
prometheus: add benchmarks for name filter performance
prometheus: support multiple __name__ query parameters
prometheus: move write_body_args to header
> fair_queue: Subtract from _queued_capacity on pop_front()
> memory: expose cumulative allocated bytes statistic
> Merge 'Add ability to configure IO bandwidth limit for supergroup' from Pavel Emelyanov
test: IO bandiwdth throttler unit tests
code: Add ability to configure IO bandwidth limit for supergroup
io_queue: Have more than one throttler par class
io_queue: Introduce bandwidth_throttler helper class
io_queue: Nest io_group::priotiy_class_data-s
io_queue: Update class bandwidth on group's class data
io_queue: Make io_group::priority_class_data::tokens() static
fair_queue: Introduce group (un)plugging
> Fix _shard_to_numa_node_mapping double population
> Use exception parameter in log_timer_callback_exception()
> Fix wakeup_granularity() fallback debug-fs reading
> test_fixture: Fix SEASTAR_FIXTURE_THREAD_TEST_CASE thread not propagated
> build: support tuning -ffile-prefix-map
> test: Remove unused C::dup() method of testing class
> src/core/reactor: introduce reactor::get_backend_name()
> util/process: add pid() accessor
> Merge 'Add source location to task and tasktrace object' from Radosław Cybulski
coroutine.hh: disable source_location for GCC to avoid ICE
reactor: improve do_dump_task_queue reporting
Use source_location in `do_dump_task_queue`
Update backtrace with source locations of resume points
Add calls to update resume_point
Add a std::source_location (resume_point) to task object.
> Merge 'Refine posix file .dup() implementation' from Pavel Emelyanov
file: Templatize posix_file_handle_impl
file: Don't dup() non-read-only files
file: Split ..._impl::dup() implementations
test: Add a simple test for dup()
> Merge 'Deprecate reactor::make_pollable_fd(socket_address, int)' from Pavel Emelyanov
reactor: Deprecate make_pollable_fd()
net/posix: Create file_desc for sockets in-place
reactor,net: Keep sock_need_nonblock boolean on posix_network_stack
net/posix: Re-format constructor initializer lists
> Merge 'test: add fuzz testing infrastructure and sstring fuzzer' from Travis Downs
test: add fuzz tests to CI workflow
test: add sstring differential fuzzer
test: add fuzz testing infrastructure
> Introduce "integrated queue length" metrics and use it for IO classes (#3210)
> reactor: Remove get_sg_data(unsigned) overload
> memcached: Stop using scattered_message
> reactor: Mark uptime() method const
> alien: Remove deprecated run_on and submit_to calls
> file: make open_flags and access_flags constexpr
> scheduling: Unfriend some methods from scheduling_group
> reactor: Move _dying bit to epoll backend
> file: coroutinize the with_file templates
> configure: validate --cook ingredient names
> fix trailing whitespace
> Merge 'Estimate timing overhead, allow failing if it is too high' from Travis Downs
perf_tests: document overhead column and threshold options
perf_tests: add measurement overhead tracking and warnings
perf_tests: remove inline/hot attributes from time_measurement methods
perf_tests: move time_measurement class to implementation file
perf_tests: move perf counters into time_measurement singleton
> rpc: log handler type
> Merge 'Add pre-commit with trailing whitespace hook' from Travis Downs
Add GitHub Actions workflow for pre-commit enforcement
Add pre-commit setup documentation to HACKING.md
Add pre-commit configuration with trailing-whitespace hook
Remove trailing whitespace from source files
> posix-stack: Make internal::posix_connect() resolve exceptions into futures
> sstring: fix npos to be size_t for consistency with std::string
Closesscylladb/scylladb#28954
There was a redundant work in split_fragmented(): value_length_if_fixed() was
called inside the loop (N virtual calls), and no reserve() was done
on the output vector causing repeated reallocations.
This patch reserves the output vector to _dimension and
caches value_length_if_fixed() before the loop.
Additionally, split read_vector_element() into two specialized functions:
read_vector_element_fixed() and read_vector_element_variable(), and hoist
the branch on fixed_len outside the loop in split_fragmented() and
deserialize_loop(). This avoids a conditional branch per element in the
hot path.
Benchmark results (1024-dim float vector, release build, -O3 -flto):
10.34 us -> 7.45 us (1.39x, 28% faster)
Verify that scylla sstable upgrade fails when an sstable has a
corrupted Statistics component digest, and succeeds when the
--ignore-component-digest-mismatch flag is provided.
Add ignore_component_digest_mismatch option to db::config (default false).
When set, sstable loading logs a warning instead of throwing on component
digest mismatches, allowing a node to start up despite corrupted non-vital
components or bugs in digest calculation.
Propagate the config to all production sstable load paths:
- distributed_loader (node startup, upload dir processing)
- storage_service (tablet storage cloning)
- sstables_loader (load-and-stream, download tasks, attach)
- stream_blob (tablet streaming)
Add `ignore_component_digest_mismatch` option to `sstable_open_config`
that logs a warning instead of throwing `malformed_sstable_exception`
on component digest mismatch. This is useful for recovering sstables
with corrupted non-vital components or working around bugs in digest
calculation.
Expose the option in scylla-sstable via the
`--ignore-component-digest-mismatch` flag for the upgrade operation.
Generalize corrupt_sstable() and scrub_validate_corrupted_file() to
accept a component_type parameter, defaulting to Data, so they can be
reused for corrupting other components.
Add tests that verify SSTable component digest validation detects
corruption on load. Each test writes an SSTable, corrupts a specific
component file by flipping a bit, then asserts that reloading the
SSTable throws malformed_sstable_exception with the expected digest
mismatch message.
Add integrity verification for SSTable component files by validating
their CRC32 digests against the expected values stored in Scylla
metadata during SSTable loading.
The following components are validated on load: TOC, Scylla metadata,
CompressionInfo, Statistics, Summary, and Filter.
Add a new reader that wraps file_random_access_reader and computes a
running CRC32 digest over the data as it is read. The digest accumulates
across sequential read_exactly() calls and is reset on seek(), since a
non-sequential read invalidates the running checksum.
One of the performance bottlenecks while deserializing vectors was
per-element virtual dispatch in deserialize(): each of the N elements
went through visit() which switches on ~28 type variants. For a
1024-dimension float vector, that's 1024 redundant type switches when
the element type is the same for all of them.
This patch introduces deserialize_vector_visitor that dispatches on the element
type once for the entire vector, then loops inside the resolved
handler. Simple numeric types (float, int, etc.) call
deserialize_value() directly with no virtual dispatch per element.
String types (ascii, utf8) get a dedicated handler that skips
make_empty() (sstring has no empty_t constructor). Complex types
(list, map, tuple, etc.) fall back to per-element dispatch.
Benchmark results (1024-dim float vector, release build, -O3 -flto):
15.73 us -> 11.70 us (1.34x, 26% faster)
`e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection
against resources that are "made up" of thin air to
`reader_concurrency_semaphore`. If there are more `_resources` than
the `_initial_resources`, it means there is a negative leak, and
`on_internal_error_noexcept` is called. In addition to it,
`_resources` is set to `std::max(_resources, _initial_resources)`.
However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34`
states the opposite: "The detection also clamps the
_resources to _initial_resources, to prevent any damage".
Before this commit, the protection mechanism doesn't clamp
`_resources` to `_initial_resources` but instead keeps `_resources` high,
possibly even indefinitely growing. This commit changes `std::max` to
`std::min` to make the code behave as intended.
Refs: SCYLLADB-163
Closesscylladb/scylladb#28982
Pre-compute the total size and allocate a single uninitialized sstring
before copying the buffers, following the pattern from Seastar's
read_entire_stream_contiguous(). This avoids iterative reallocation
which is O(n^2) for large responses.
This simple patch adds support for storing the HTTP error description
that Vector Store client receives from vector store. Until now it was
just printed to the log but it was not returned. For this reason it
was not forwarded to the drivers which forced users to access ScyllaDB
server logs to understand what is wrong with Vector Store.
This patch also updates formatter to print the message next to the
error code.
Fixes: VECTOR-189
Previously, all stream-table fixtures in this test file used
scope="function", forcing a fresh table to be created for every test,
slowing down the test a bit (though not much), and discouraging writing
small new tests.
This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before
the true stream head, causing leftover events from a previous test to
appear in the next test's reads.
We fix this by draining the stream inside latest_iterators() and
shards_and_latest_iterators() after obtaining the LATEST iterators:
fetch records in a loop until two consecutive polling rounds both return
empty, guaranteeing the iterators are positioned past all pre-existing
events before the caller writes anything. With this guarantee in place,
all stream-table fixtures can safely use scope="module".
After this patch, test_streams.py continues to pass on DynamoDB.
On Alternator, the test file's run time went down a bit, from
20.2 seconds to 17.7 seconds.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the next patch, we plan to make the fixtures in test_streams.py
shared between tests. Most tests work well with shared tables, but two
(test_streams_trim_horizon and test_streams_starting_sequence_number)
were written to expect a new table with an empty history, and two
other (test_streams_closed_read and test_streams_disabled_stream) want
to disable streaming and would break a shared table.
So this patch we modify these four tests to create their own new table
instead of using a fixture.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
Several test cases where introduced to verify expected behaviour.
Additionally, this PR adds new rewrite component mechanism for safe sstable component rewriting.
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.
However, with component digests stored in scylla_metadata (#20100),
replacing a component like Statistics requires atomically updating both the component
and scylla_metadata with the new digest - impossible with POSIX rename.
The new mechanism creates a clone sstable with a fresh generation:
- Hard-links all components from the source except the component being rewritten and scylla_metadata
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component along with updated scylla_metadata containing the new digest
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction
This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure durning the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.
Backport is not required, it is a new feature
Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453Closesscylladb/scylladb#28338
* github.com:scylladb/scylladb:
docs: document components_digests subcomponent and trailing digest in Scylla.db
sstable_compaction_test: Add tests for perform_component_rewrite
sstable_test: add verification testcases of SSTable components digests persistance
sstables: store digest of all sstable components in scylla metadata
sstables: replace rewrite_statistics with new rewrite component mechanism
sstables: add new rewrite component mechanism for safe sstable component rewriting
compaction: add compaction_group_view method to specify sstable version
sstables: add null_data_sink and serialized_checksum for checksum-only calculation
sstables: extract default write open flags into a constant
sstables: Add write_simple_with_digest for component checksumming
sstables: Extract file writer closing logic into separate methods
sstables: Implement CRC32 digest-only writer
Currently, status_helper::tablets, which keeps a vector of processed
tablet ids, is used only in tablet_virtual_task::get_status_helper,
so there is no point in returning it. Also, in get_status_helper,
it is used only to determine if any tablets are processed.
Remove status_helper::tablets. Use a flag instead of the vector
in get_status_helper.
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.
However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.
Wait until repair is done for all tablets in the table.
There is a race condition in driver that raises the RuntimeException.
This pollutes the output, so this PR is just silencing this exception.
Fixes: SCYLLADB-900
Closesscylladb/scylladb#28957
result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent
processing.
- update result_message::prepared::cql constructor to accept pinned entry
- construct weak view from owned pinned entry inside the message
- pass pinned cache entry from query_processor::prepare() into the message constructor
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak
handle could no longer be promoted and the prepare path could fail nondeterministically.
This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.
Test coverage was extended in test/cluster/test_prepare_race.py:
- reproduces the invalidation-during-prepare window with injection,
- verifies prepare completes successfully,
- then invalidates again and executes the same stale client prepared object,
- confirms the driver transparently re-requests/re-prepares and execution succeeds.
This change introduces:
- no behavior change for normal prepare flow besides stronger lifetime guarantees,
- no new protocol semantics,
- preserves existing cache invalidation logic,
- adds explicit cluster-level regression coverage for both the race and driver reprepare path.
- pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
Fixed several places where ScyllaLogFile.grep() was called without await, resulting in checking coroutine objects for truthiness instead of actual log matches.
Fixes: SCYLLADB-903
No backport, framework fix and one test fix.
Closesscylladb/scylladb#28909
* github.com:scylladb/scylladb:
test.py: fix unawaited ScyllaLogFile.grep() coroutines
tests: fix test_group0_recovers_after_partial_command_application
When dropping a table, make_drop_table_or_view_mutations() creates
a point tombstone in system_schema.columns for every column in the table.
The clustering key of system_schema.columns is (table_name, column_name).
A clustering key with only the table_name component acts as a prefix
tombstone. That tombstone covers all columns belonging to that table.
This approach is already used by make_table_deleting_mutations() during
CREATE TABLE.
Apply the same prefix tombstone approach to DROP TABLE for the columns,
view_virtual_columns, computed_columns, and dropped_columns schema tables.
This reduces tombstone accumulation in schema table sstables.
In test_max_cells test case, which repeatedly creates and drops a table
with 32768 columns, overall test time improved from ~180s to ~157s, which
is ~12.7% improvement.
Refs SCYLLADB-815
Closesscylladb/scylladb#28976
The function checks that the node's state is not left or removed in
gossiper during restart, but with raft topology a removed node will
not be able to contact the cluster to get this information since it will
be banned.
Also remove test_auth_raft_command_split test which is irrelevant since 5ba7d1b116
because it does not use the function that injects max sized command after the
commit.
Schema pull was used by legacy schema code which is not supported for a
long time now and during legacy recovery which is no longer supported as
well. It can be dropped now.
Simplify code by getting rid of group0_upgrade_state since upgrade is no
longer supported, so no need to track its state. The none upgraded node
will simply not boot and to detect that the patch checks the state
directly from the system table.
The only support mode is topology_change_kind::raft, so always set it in
storage_service::join_cluster during join or regular boot. Drop the check
for legacy mode from raft_group0::setup_group0_if_exist since the mode
will not be set at this point any longer. The wrong upgrade will still
be detected in storage_service::join_cluster where topology.upgrade_state
is checked directly.
group0_upgrade_state::recovery is now used only in maintenance mode
so rename the function to indicate it. Also there is no preemption point
in the function any more and it can be a regular function, not a
co-routine.
The test description of refreshing test is very elaborated and it's
worth having it as the description of the streaming scopes test itself.
Callers of the helper can go with smaller descriptions.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper does two things -- sorts sstables per server according to
scope in use and calls sstables_storage.restore(). The code looks better
if the sorting of sstables stays in a helper and the call for .restore()
is moved to the caller.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now it's possible to replace the whole body of the
test_refresh_with_streaming_scopes() test by calling the corresponding
helper function from backup/restore test module. This helper does
exactly the same, and the SSTablesOnLocalStorage class provides the
necessary save/restore implementations.
One more thing to mention -- the refreshing test for some reason only
wants to run with restored min-tablet-count equal to the original one.
The do_test_streaming_scopes() needs to account for that, as it runs the
tests for more options.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This class implements some of the sstables manipulations performed by
test_refresh_with_streaming_scopes(). It's here to facilitate next patch
that will use it to call do_test_streaming_scopes() helper.
This patch moves two blocks of code out of the test into this new class.
The shutil.rmtree(tmpbackup) is seemingly lost, but it really isn't --
the tmpbackup variable holds a name of a _subdir_ inside servers'
workdirs. This path doesn't really exist on disk on its own, so removing
it is a no-op.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The class in question performs two operations for
do_test_streaming_scopes(): saves sstables and restores them. Current
caller of the helper is the test_restore_with_streaming_scopes() test
that need to backup sstables on object storage and restore them from
there with the restoration API. The SSTablesOnObjectStorage class does
exactly that.
The change in do_load_sstables() that checks for sstables_storage to be
non None is needed to keep test_refresh_with_streaming_scopes() work --
that test doesn't provide sstables_storage (yet) and the function in
question will call the load_fn callback. Next patch will eliminate it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The body of this test is duplicated by
test_refresh_with_streaming_scopes() test from other module. Keeping it
in a non-test top-level function will help generalizing these two tests.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A few days ago, in commit 7b30a3981b we added to pytest.ini the option
xfail_strict. This option causes every time a test XPASSes, i.e., an
xfail test actually passes - to be considered an error and fail the test.
But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:
test/cqlpy/test_secondary_index.py::test_unbuilt_index_not_used
This test reproduces a timing-related bug (if you read from a secondary
index "too quickly" you can get wrong results), but only about 90% of the
time, not 100% of the time.
The solution is to add "strict=False" for the xfail marker on this
specific test. This undoes the xfail_strict for this specific test,
accepting that this specific test can either pass or fail. Note that
this does NOT make this test worthless - we still see this test failing
most of the time, and when a developer finally fixes this issue, the
test will begin to pass all the time.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-956
(we'll probably need to follow up this fix with the same fix for other
xfail tests that can sometime pass).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28942
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808Closesscylladb/scylladb#28839
* github.com:scylladb/scylladb:
mv: allow skipping view updates when a collection is unmodified
mv: allow skipping view updates if an empty collection remains unset
Instead of dht::partition_ranges_vector, which is an std::vector<> and
have been seen to cause large allocations when calculating ranges to be
invalidated after compaction:
seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
(inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840
seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903
(inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910
(inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533
(inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679
seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698
(inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440
(inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const*) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151
(inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203
(inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614
(inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387
(inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79
dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347
compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619
(inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719
(inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362
compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021
compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960
(inlined by) compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63
(inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98
(inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920
(inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935
(inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930
(inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267
(inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:138
seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224
(inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318
dht::partition_ranges_vector is used on the hot path, so just convert
the problematic user -- cache invalidation -- to use
utils::chunked_vector<dht::partition_range> instead.
Fixes: SCYLLADB-121
Closesscylladb/scylladb#28855
Fixed several places where ScyllaLogFile.grep() was called without
await, resulting in checking coroutine objects for truthiness instead
of actual log matches.
Fixes: SCYLLADB-903
Many tests in test/alternator/test_streams.py use a do_test() function
which performs a user-defined function that runs some write requests,
and then verifies that the expected output appears on the stream.
Because DynamoDB drops do-nothing changes from the stream - such as
writing to an item a value that it already has - these tests need to
write to a different item each time, so do_test() invents a random key
and passes it to the user-defined function to use. But... we had a bug,
the random number generation was done only once, instead of every time.
The fix is to do the random number generation on every call.
We never noticed this bug when each test used a brand new table. But the
next patch will make the tests share the test table, and tests start
to fail. It's especially visible if you run the same test twice against
DynamoDB, e.g.,
test/alternator/run --count 2 --aws \
test_streams.py::test_streams_putitem_keys_only
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
`storage_group_of()` is on the replica-side token lookup hot path but
used `tablet_map::get_tablet_id_and_range_side()`, which computes both
tablet id and post-split range side.
Most callers only need the storage group id. Switch `storage_group_of()`
to use `get_tablet_id()` via `tablet_id_for_token()`, and select the
compaction group via new overloads that compute the range side only
when splitting mode is active.
Change `storage_group::select_compaction_group()` to accept a token
(and tablet_map) and compute the tablet range side only when
splitting_mode() is active.
Add an overload for selecting the compaction group for an sstable
spanning a token range.
Add `tablet_map::get_tablet_range_side(token)` to compute the
post-split range side without computing the tablet id.
Pure addition, no behavior change.
Trie-based sstable indexes are supposed to be (hopefully)
a better default than the old BIG indexes.
Make them the new default.
If we change our mind, this change can be reverted later.
Trie-based indexes and older indexes have a difference in metrics,
and the test uses the metrics to check for bypass cache.
To choose the right metrics, it uses highest_supported_sstable_format,
which is inappropriate, because the sstable format chosen for writes
by Scylla might be different than highest_supported_sstable_format.
Use chosen_sstable_format instead.
Returns the sstable version currently chosen for use in for new sstables.
We are adding it because some tests want to know what format they are
writing (tests using upgradesstable, tests which check stats that only
apply to one of the index types, etc).
(Currently they are using `highest_supported_sstable_format` for this
purpose, which is inappropriate, and will become invalid if a non-latest
format is the default).
cluster.dtest_alternator_tests.test_slow_query_logging performs
a bootstrap with 768 token ranges.
It works with `me` sstables, which have 2 open file descriptors
per open sstable, but with `ms` sstables, which have 3 open
file descriptors per open sstable, it fails with EMFILE.
To avoid this problem, let's just decrease the number of vnodes
for in the test suite. It's appropriate anyway, because it avoids some
unneeded work without weakening the tests.
(Note: pylib-based have been setting `num_tokens` to 16 for a long time too).
This breaks `bypass_cache_test`, which is written in a way that expects
a certain number of token ranges. We adjust the relevant parameter
accordingly.
This function ensures that all alive nodes executed
read barrier.
It will be usefull for the following commits which would
eventually delay returning response to the user until
mutations are applied on other nodes so that the user
may perceive better data consistency accross nodes.
When a user tries to use ALTER TABLE on a materialized view,
the resulting error message is `Cannot use ALTER TABLE on Materialized View`.
The intention behind this error is that ALTER MATERIALIZED VIEW should
be used instead.
But we observed that some users interpret this error message as a general
"You cannot do any ALTER on this thing".
This patch enhances the error message (and others similar to it)
to prevent the confusion.
Closesscylladb/scylladb#28831
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.
Treat the tablet operation as successful if a table is dropped.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-494
Needs backport to all live releases
Closesscylladb/scylladb#28933
* github.com:scylladb/scylladb:
test: add test_tablet_repair_wait_with_table_drop
service: tasks: return successful status if a table was dropped
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.
This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.
This PR also adds warning logs to improve future troubleshooting of connections to the vector-store nodes.
Fixes: VECTOR-528
Backports to 2025.04 and 2026.01 are required, as these branches are also affected.
Closesscylladb/scylladb#28637
* github.com:scylladb/scylladb:
vector_search: fix TLS server name with IP
vector_search: add warn log for failed ann requests
The first node in the cluster is not guaranteed to be the coordinator
node. Hardcoding node 0 as the coordinator causes test flakiness. This
patch dynamically finds the actual coordinator node and targets it for
error injection, log checking, and restarts.
Additionally, inject `tablet_force_tablet_count_decrease_once` across
all servers to force the tablet merge process to trigger once.
Fixes SCYLLADB-865
Closesscylladb/scylladb#28945
Mention that role and permission changes are durable but may
not be immediately visible on other nodes due to asynchronous
replication.
Fixes: SCYLLADB-651
Closesscylladb/scylladb#28900
Currently, the view_update_generator::mutate_MV function acquires a
reference to the keyspace relevant to the operation, then it calls
max_concurrent_for_each and uses that reference inside the lambda passed
to that function. max_concurrent_for_each can preempt and there is no
mechanism that makes sure that the keyspace is alive until the view
updates are generated, so it is possible that the keyspace is freed by
the time the reference is used.
Fix the issue by precomputing the necessary information based on the
keyspace reference right away, and then passing that information by
value to the other parts of the code. It turns out that we only need to
know whether the keyspace uses tablets and whether it uses a network
topology strategy.
Fixes: scylladb/scylladb#28925Closesscylladb/scylladb#28928
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:
1. After creating/removing the blob file that simulates disk pressure,
the tests immediately checked derived state (e.g., "compaction_manager
- Drained") without first confirming the disk space monitor had detected
the utilization change. Fix: explicitly wait for "Reached/Dropped below
critical disk utilization level" right after creating/removing the blob
file, before checking downstream effects.
2. Several tests called `manager.driver_connect()` or omitted reconnection
entirely after `server_restart()` / `server_start()`. The pre-existing
driver session can silently reconnect multiple times, causing subsequent
CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
to ensure all connection pools are established.
3. Some log assertions used marks captured before a restart, so they could
match pre-restart messages or miss messages emitted in the correct post-restart
window. Fix: refresh marks at the right points.
Apart from that, the patch fixes a typo: autotoogle -> autotoggle.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655Closesscylladb/scylladb#28626
Fixes: SCYLLADB-915
Test was quite broken; Not waiting for coro:s, as well as a bunch
of checks no longer even close to valid (this is a ported dtest, and
not a very good one).
Closesscylladb/scylladb#28887
vector_search: small improvements
This PR addresses several minor code quality issues and style inconsistencies within the vector_search module.
No backport is needed as these improvements are not visible to the end user.
Closesscylladb/scylladb#28718
* github.com:scylladb/scylladb:
vector_search: fix names of private members
vector_search: remove unused global variable
Tests use `start_writes` as a simple write workload to test that writes
succeed when they should (e.g., there is no availability loss), but not to
test performance. There is no reason to overload the CPU, which can lead to
test failures.
I suspect this function to be the cause of SCYLLADB-929, where the failures
of `test_raft_recovery_user_data` (that creates multiple write workloads
with `start_writes`) indicated that the machine was overloaded.
The relevant observations:
- two runs failed at the same time in debug mode,
- there were many reactor stalls and RPC timeouts in the logs (leading to
unexpected events like servers marking each other down and group0
leader changes).
I didn't prove that `start_writes` really caused this, but adding this sleep
should be a good change, even if I'm wrong.
The number of writes performed by the test decreases 30-50 times with the
sleep.
Note that some other util functions like `start_writes_to_cdc_table` have
such a sleep.
This PR also contains some minor updates to `test_raft_recovery_user_data`.
Fixes SCYLLADB-929
No backport:
- the failures were observed only in master CI,
- no proof that the change fixes the issue, so backports could be a waste
of time.
Closesscylladb/scylladb#28917
* github.com:scylladb/scylladb:
test: test_raft_recovery_user_data: replace asyncio.gather with gather_safely
test: test_raft_recovery_user_data: use the exclude_node API
test: test_raft_recovery_user_data: drop tablet_load_stats_cfg
test: cluster: util: sleep for 0.01s between writes in do_writes
This patch adds a test file, test/alternator/test_encoding.py, testing
how Alternator stores its data in the underlying CQL database. We test
how tables are named, and how attributes of different types are encoded
into CQL.
The test, which begins with a long comment, also doubles as developer-
oriented *documention* on how Alternator encodes its data in the CQL
database. This documentation is not intended for end-users - we do
not want to officially support reading or writing Alternator tables
through CQL. But once in a while, this information can come in handy
for developers.
More importantly, this test will also serve as a regression test,
verifying that Alternator's encoding doesn't change unintentionally.
If we make an unintentional change to the way that Alternator stores
its data, this can break upgrades: The new code might not be able to
read or write the old table with its old encoding. So it's important
to make sure we never make such unintentional changes to the encoding
of Alternator's data. If we ever do make *intentional* changes to
Alternator's data encoding, we will need to fix the relevant test;
But also not forget to make sure that the new code is able to read
the old encoding as well.
The new tests use both "dynamodb" (Alternator) and "cql" fixtures,
to test how CQL sees the Alternator tables. So naturally are these
tests are marked "scylla_only" and skipped on DynamoDB.
Fixes#19770.
Closesscylladb/scylladb#28866
The motivations for this patch are as follows:
- Guardrails should follow similar conventions, e.g. for config names,
metrics names, testing. Keeping guardrails together makes it easier
to find and compare existing guardrails when new guardrails are
implemented.
- The configuration is used to auto-generate the documentation
(particularly, the `configuration-parameters` page). Currently,
the order of parameters in the documentation is inconsistent (e.g.
`minimum_replication_factor_fail_threshold` before
`minimum_replication_factor_warn_threshold` but
`maximum_replication_factor_fail_threshold` after
`maximum_replication_factor_warn_threshold`), which can be confusing
to customers.
Fixes: SCYLLADB-256
Closesscylladb/scylladb#28932
This series improves timeout handling consistency across the test framework and makes build-mode effects explicit in LWT tests. (starting with LWT test that got flaky)
1. Centralize timeout scaling
Introduce scale_timeout(timeout) fixture in runner.py to provide a single, consistent mechanism for scaling test timeouts based on build mode.
Previously, timeout adjustments were done in an ad-hoc manner across different helpers and tests. Centralizing the logic:
Ensures consistent behavior across the test suite
Simplifies maintenance and reasoning about timeout behavior
Reduces duplication and per-test scaling logic
This becomes increasingly important as tests run on heterogeneous hardware configurations, where different build modes (especially debug) can significantly impact execution time.
2. Make scale_timeout explicit in LWT helpers
Propagate scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes.
Additionally:
Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() for consistent polling behavior across modes
Update all LWT test call sites to pass build_mode explicitly
Increase default timeout values, as the previous defaults were too short and prone to flakiness, particularly under slower configurations such as debug builds
Overall, this series improves determinism, reduces flakiness, and makes the interaction between build mode and test timing explicit and maintainable.
backport: not required just an enhansment for test.py infra
Closesscylladb/scylladb#28840
* https://github.com/scylladb/scylladb:
test/auth_cluster: align service-level timeout expectations with scaled config
test/lwt: propagate scale_timeout through LWT helpers; scale resize waits Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes. Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes. Adjust LWT test call sites to pass scale_timeout explicitly. Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).
test/pylib: introduce scale_timeout fixture helper
This patch fixes 2 issues within strong consistency state machine:
- it might happen that apply is called before the schema is delivered to the node
- on the other hand, the apply may be called after the schema was changed and purged from the schema registry
The first problem is fixed by doing `group0.read_barrier()` before applying the mutations.
The second one is solved by upgrading the mutations using column mappings in case the version of the mutations' schema is older.
Fixes SCYLLADB-428
Strong consistency is in experimental phase, no need to backport.
Closesscylladb/scylladb#28546
* https://github.com/scylladb/scylladb:
test/cluster/test_strong_consistency: add reproducer for old schema during apply
test/cluster/test_strong_consistency: add reproducer for missing schema during apply
test/cluster/test_strong_consistency: extract common function
raft_group_registry: allow to drop append entries requests for specific raft group
strong_consistency/state_machine: find and hold schemas of applying mutations
strong_consistency/state_machine: pull necessary dependencies
db/schema_tables: add `get_column_mapping_if_exists()`
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
# Complete CQL handshake
await do_cql_handshake(reader, writer)
# Now query system.clients using the driver to see our connection
cql = manager.get_cql()
rows = list(cql.execute(
f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
))
# We should find our connection with the fake source address and port
> assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E assert 0 > 0
E + where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
that polls via cql.run_async() until the expected row count is reached
(30 s timeout, 100 ms period).
Fixes: SCYLLADB-819
The flaky test is present on master and in previous release, so backporting only there.
Closesscylladb/scylladb#28849
* github.com:scylladb/scylladb:
test_proxy_protocol: introduce extra logging to aid debugging
test_proxy_protocol: fix flaky system.clients visibility checks
Move all ${{ }} expression interpolations into env: blocks so they are
passed as environment variables instead of being expanded directly into
shell scripts. This prevents an attacker from escaping the heredoc in
the Validate Comment Trigger step and executing arbitrary commands on
the runner.
The Verify Org Membership step is hardened in the same way for
defense-in-depth.
Refs: GHSA-9pmq-v59g-8fxp
Fixes: SCYLLADB-954
Closesscylladb/scylladb#28935
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808
Currently, when we generate view updates, we skip the view update if
all columns selected by the view are unchanged in the base table update.
However, this does not apply for collection columns - if the base table
has a collection regular column, we never allow skipping generating
view updates and the reason for that is missing implementation.
We can easily relax this for the case where the collection was missing
before and after the update - in this commit we move the check for
collections after the check for missing cells.
The ini-level strict_config was removed/never existed as a config key in pytest 8 — it's only a command-line flag(and back in pytest 9)
In pytest 8.3.5, the equivalent is the --strict-config CLI flag, not an ini option
Fixes SCYLLADB-955
Closesscylladb/scylladb#28939
Document the new `components_digests` subcomponent (tag 12) added to the
Scylla.db metadata component, which stores CRC32 digests of all checksummed
SSTable component files. Also document the trailing CRC32 digest that
stores digest of the scylla metadata itself.
Add two test cases to verify the correctness of the perform_component_rewrite
functionality:
- test_perform_component_rewrite_single_sstable: Tests rewriting the Statistics
component of a single sstable
- test_perform_component_rewrite_multiple_sstables: Tests rewriting 5 out of 10
sstables
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. Those covers all checksummed components test cases.
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in scylla metadata component.
This also extends new rewrite component mechanism,
to rewrite metadata with updated digest together with the component.
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`.
The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.
The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.
I tested the performance impact of this change with the following test:
```python
for _ in range(10):
await manager.servers_add(3)
```
It consistently takes 44-45s with the change and 50-51s without the change
in dev mode.
No backport:
- non-critical performance improvement mostly relevant in tests,
- the change requires some soak time in master.
Closesscylladb/scylladb#28891
* github.com:scylladb/scylladb:
raft: server: fix the repeating typo
raft: clarify the comment about read_barrier_reply
raft: read_barrier: update local commit_idx to read_idx when it's safe
raft: log: clarify the specification of term_for
In case of an error, we want to see the contents of the system.clients
table to have a better understanding of what happened - whether the
row(s) are really missing or maybe they are there, but 1 digit doesn't
match or the row is half-written.
We'll therefore query for the whole table on the CQL side, and then
filter out the rows we want to later proceed with on the python side.
This way we can dump the contents of the whole system.clients table if
something goes south.
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
# Complete CQL handshake
await do_cql_handshake(reader, writer)
# Now query system.clients using the driver to see our connection
cql = manager.get_cql()
rows = list(cql.execute(
f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
))
# We should find our connection with the fake source address and port
> assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E assert 0 > 0
E + where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper
that polls via cql.run_async() until the expected row count is reached
(30 s timeout, 100 ms period).
Fixes: SCYLLADB-819
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.
Treat the tablet operation as successful if a table is dropped.
The one accepts long list of arguments, some of those is not really needed. Also some callers can be relaxed not to provide default values for arguments with such.
Improving tests, not backporting
Closesscylladb/scylladb#28861
* github.com:scylladb/scylladb:
test: Remove passing default "expected_replicas" to check_mutation_replicas()
test: Remove scope and primary-replica-only arguments from check_mutation_replicas() helper
Previous code performed endian conversion by bulk-copying raw bytes
into a std::vector<float> and then iterating over it via a
reinterpret_cast<uint32_t*> pointer. Accessing float storage through a
uint32_t* violates C++ strict aliasing rules, giving the compiler
freedom to reorder or elide the stores, causing undefined behavior.
Replace the two-pass approach with a single-pass loop using
seastar::consume_be<uint32_t>() and std::bit_cast<float>(), which is
both well-defined and auto-vectorizable.
Follow-up #28754Closesscylladb/scylladb#28912
Add support for non_gating, the opposite of gating in dtest terminology, tests in test.py
codebase
This test will/should not be run by any current gating job (ci/next/nightly)
Closesscylladb/scylladb#28902
HostRegistry initialized in several places in the framework, this can
lead to the overlapping IP, even though the possibility is low it's not
zero. This PR makes host registry initialized once for the master
thread and pytest. To avoid communication between with workers, each
worker will get its own subnet that it can use solely for its own goals.
This simplifies the solution while providing the way to avoid overlapping IP's.
Closesscylladb/scylladb#28520
This PR disables running FXAIL tests on ci run to speed it up.
tests will continue run on "nightly" job and FAIL on unexpected pass
and will continue run on "NEXT" job and NOT FAIL on unexpected pass
Closesscylladb/scylladb#28886
The purpose of `add_column_for_post_processing` is to add columns that are required for processing of a query,
but are not part of SELECT clause and shouldn't be returned. They are added to the final result set, but later are not serialized.
Mainly it is used for filtering and grouping columns, with a special case of `WHERE primary_key IN ... ORDER BY ...` when the whole result set needs additional final sorting,
and ordering columns must be added as well.
There was a bug that manifested in #9435, #8100 and was actually identified in #22061.
In case of selection with processing (e.g functions involved), result set row is formed in two stages.
Initially it is a list of columns fetched from replicas - on which filtering and grouping is performed.
After that the actual selection is resolved and the final number of columns can change.
Ordering is performed on this final shape, but the ordering column index returned by `add_column_for_post_processing` refereed to initial shape.
If selection refereed to the same column twice (e.g. `v, TTL(v)` as in #9435) final row was longer than initial and ordering refereed to incorrect column.
If a function in selection refereed to multiple columns (e.g. as_json(.., ..) which #8100 effectively uses) the final row was shorter
and ordering tried to use a non-existing column.
This patch fixes the problem by making sure that column index of the final result set is used for ordering.
The previously crashing test `cassandra_tests/validation/entities/json_test.py::testJsonOrdering` doesn't have to be skipped, but now it is failing on issue #28467.
Fixes#9435Fixes#8100Fixes#22061Closesscylladb/scylladb#28472
Tests use `start_writes` as a simple write workload to test that writes
succeed when they should (e.g., there is no availability loss), but not to
test performance. There is no reason to overload the CPU, which can lead to
test failures.
I suspect this function to be the cause of SCYLLADB-929, where the failures
of `test_raft_recovery_user_data` (that creates multiple write workloads
with `start_writes`) indicated that the machine was overloaded.
The relevant observations:
- two runs failed at the same time in debug mode,
- there were many reactor stalls and RPC timeouts in the logs (leading to
unexpected events like servers marking each other down and group0
leader changes).
I didn't prove that `start_writes` really caused this, but adding this sleep
should be a good change, even if I'm wrong.
The number of writes performed by the test decreases 30-50 times with the
sleep.
Note that some other util functions like `start_writes_to_cdc_table` have
such a sleep.
Fixes SCYLLADB-929
Similar to `raft_drop_incoming_append_entries`, the new error injection
`raft_drop_incoming_append_entries_for_specified_group` skips handler
for `raft_append_entries` RPC but it allows to specify id of raft group
for which the requests should be dropped.
The id of a raft group should be passed in error injection parameters
under `value` key.
It might happen that a strong consistency command will arrive to a node:
- before it knows about the schema
- after the schema was changes and the old version was removed from the
memory
To fix the first case, it's enough to perform a read barrier on group0.
In case of the second one, we can use column mapping the upgrade the
mutation to newer schema.
Also, we should hold pointers to schemas until we finish `_db.apply()`,
so the schema is valid for the whole time.
And potentially we should hold multiple pointers because commands passed
to `state_machine::apply()` may contain mutations to different schema
versions.
This commit relies on a fact that the tablet raft group and its state
machine is created only after the table is created locally on the node.
Fixes SCYLLADB-428
Set enable_schema_commitlog for each group0 tables.
Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.
Needs backport to all live releases as all are vulnerable
Closesscylladb/scylladb#28876
* github.com:scylladb/scylladb:
test: add test_group0_tables_use_schema_commitlog
db: service: remove group0 tables from schema commitlog schema initializer
service: ensure that tables updated via group0 use schema commitlog
db: schema: remove set_is_group0_table param
Consider this:
- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder
This is observed in the test:
```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]
...
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
....
scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```
The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.
To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.
The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.
Fixes#27365Closesscylladb/scylladb#28823
* github.com:scylladb/scylladb:
repair: Fix rwlock in compaction_state and lock holder lifecycle
repair: Prevent repair lock holder leakage after table drop
The comment could be misleading. It could suggest that the returned index is
already safe to read. That's not necessarily true. The entry with the
returned index could, for example, be dropped by the leader if the leader's
entry with this index had a different term.
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`. The argument for
safety is in the new comment above `maybe_update_commit_idx_for_read`.
The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.
The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.
Writing a test for this change would be difficult, so we trust the nemesis
tests to do the job. They have already found consistency issues in read
barriers. See #10578.
Both migration manager and system keyspace will be used in next commit.
The first one is needed to execute group0 read barrier and we need
system keyspace to get column mappings.
Use scale_timeout_by_mode() in make_scylla_conf() to derive
request_timeout_in_ms in test/pylib/scylla_cluster.py.
Update test_connections_parameters_auto_update in
test/cluster/auth_cluster/test_raft_service_levels.py to expect the
mode-specific timeout string returned by the REST endpoint after this
scaling change.
Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes.
Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes.
Adjust LWT test call sites to pass scale_timeout explicitly.
Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).
Introduce scale_timeout(mode) to centralize test timeout scaling logic based on build mode, the function will return a callable that will handle the timeout by mode.
This ensures consistent timeout behavior across test helpers and eliminates ad-hoc per-test scaling adjustments.
Centralizing the logic improves maintainability and makes timeout behavior easier to reason about.
This becomes increasingly important as we run tests on heterogeneous hardware configurations.
Different build modes (especially debug) can significantly affect execution time, and having a single scaling mechanism helps keep test stability predictable across environments.
No functional change beyond unifying existing timeout scaling behavior.
This commit updates the documentation for the unified installer.
- The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS).
- The info about CentOS 7 is removed (no longer supported).
- Java 8 is removed.
- The example for cassandra-stress is removed (as it was already removed on other installation pages).
Fixes https://github.com/scylladb/scylladb/issues/28150Closesscylladb/scylladb#28152
In scenarios where we want to firsty check if a column mapping exists
and if we don't want do flow control with exception, it is very wasteful
to do
```
if (column_mapping_exists()) {
get_column_mapping();
}
```
especially in a hot path like `state_machine::apply()` becase this will
execute 2 internal queries.
This commit introduces `get_column_mapping_if_exists()` function,
which simply wrapps result of `get_column_mapping()` in optional and
doesn't throw an exception if the mapping doesn't exist.
this commit enables 3 strict pytest options:
strict_config - if any warnings encountered while parsing the pytest section of the configuration file will raise errors.
xfail_strict - if markers not registered in the markers section of the configuration file will raise errors.
strict-markers - if tests marked with @pytest.mark.xfail that actually succeed will by default fail the test suite
and fix errors that occur after enabling these options
Closesscylladb/scylladb#28859
Currently, repair-mode tombstone-gc cannot be used on tables with RF=1. We want to make repair-mode the default for all tablet tables (and more, see https://github.com/scylladb/scylladb/issues/22814), but currently a keyspace created with RF=1 and later altered to RF>1 will end up using timeout-mode tombstone gc. This is because the repair-mode tombstone-gc code relies on repair history to determine the gc-before time for keys/ranges. RF=1 tables cannot run repairs so they will have empty repair history and consequently won't be able to purge tombstones.
This PR solves this by keeping a registry of RF=1 tables and consulting this registry when creating `tombstone_gc_state` objects. If the table is RF=1, tombstone-gc will work as if the table used immediate-mode tombstone-gc. The registry is updated on each replication update. As soon as the table is not RF=1 anymore, the tombstone-gc reverts to the natural repair-mode behaviour.
After this PR, tombstone-gc defaults to repair-mode for all tables, regardless of RF and tablets/vnodes.
Fixes: SCYLLADB-106.
New feature, no backport required.
Closesscylladb/scylladb#22945
* github.com:scylladb/scylladb:
test/{boost,cluster}: add test for tombstone gc mode=repair with RF=1
tombstone_gc: allow use of repair-mode for RF=1 tables
replica/table: update rf=1 table registry in shared tombstone-gc state
tombstone_gc: tombstone_gc_before_getter: consider RF when getting gc before time
tombstone_gc: unpack per_table_history_maps
tombstone_gc: extract _group0_gc_time from per_table_history_map
tombstone_gc: drop tombstone_gc_state(nullptr) ctor and operator bool()
test/lib/random_schema: use timeout-mode tombstone_gc
tombstone_gc_options: add C++ friendly constructor
test: move away from tombstone_gc_state(nullptr) ctor
treewide: move away from tombstone_gc_state(nullptr) ctor
sstable: move away from tombstone_gc_mode::operator bool()
replica/table: add get_tombstone_gc_state()
compaction: use tombstone_gc_state with value semantics
db/row_cache: use tombstone_gc_state with value semantics
tombstone_gc: introduce tombstone_gc_state::for_tests()
The test test_mv_merge_allowed asserts in two places that the tablet
count is 2. It does so by calling an async function but, mistakenly, the
returned coroutine was not awaited. The coroutine is, apparently, truthy
so the assertions always passed.
Fix the test to properly await the coroutines in the assertions.
Fixes: SCYLLADB-905
Closesscylladb/scylladb#28875
This PR solves a series of similar problems related to executing
methods on an already aborted `raft::server`. They materialize
in various ways:
* For `add_entry` and `modify_config`, a `raft::not_a_leader` with
a null ID will be returned IF forwarding is disabled. This wasn't
a big problem because forwarding has always been enabled for group0,
but it's something that's nice to fix. It's might be relevant for
strong consistency that will heavily rely on this code.
* For `wait_for_leader` and `wait_for_state_change`, the calls may
hang and never resolve. A more detailed scenario is provided in a
commit message.
For the last two methods, we also extend their descriptions to indicate
the new possible exception type, `raft::stopped_error`. This change is
correct since either we enter the functions and throw the exception
immediately (if the server has already been aborted), or it will be
thrown upon the call to `raft::server::abort`.
We fix both issues. A few reproducer tests have been included to verify
that the calls finish and throw the appropriate errors.
Fixes SCYLLADB-841
Backport: Although the hanging problems haven't been spotted so far
(at least to the best of my knowledge), it's best to avoid
running into a problem like that, so let's backport the
changes to all supported versions. They're small enough.
Closesscylladb/scylladb#28822
* https://github.com/scylladb/scylladb:
raft: Make methods throw stopped_error if server aborted
raft: Throw stopped_error if server aborted
test: raft: Introduce get_default_cluster
Into a single database_apply one. Add three parameters:
* ks_name and cf_name to filter the tables to be affected
* what - what to do: throw or wait
This leads to smaller footprint in the code and improved filtering for
table names at the cost of some extra error injection params in the
tests.
This patch series implements `write_consistency_levels_warned` and `write_consistency_levels_disallowed` guardrails, allowing the configuration of which consistency levels are unwanted for writes. The motivation for these guardrails is to forbid writing with consistency levels that don't provide high durability guarantees (like CL=ANY, ONE, or LOCAL_ONE).
Neither guardrail is enabled by default, so as not to disrupt clusters that are currently using any of the CLs for writes. The warning guardrail may seem harmless, as it only adds a warning to the CQL response; however, enabling it can significantly increase network traffic (as a warning message is added to each response) and also decrease throughput due to additional allocations required to prepare the warning. Therefore, both guardrails should be enabled with care. The newly added `writes_per_consistency_level` metric, which is incremented unconditionally, can help decide whether a guardrail can be safely enabled in an existing cluster.
This commit adds additional `if` instructions on the critical path. However, based on the `perf_simple_query` benchmark for writes, the difference is marginal (~40 additional instructions, which is a relative difference smaller than 0.001).
BEFORE:
```
291443.35 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48067 insns/op, 18885 cycles/op, 0 errors)
throughput:
mean= 289743.07 standard-deviation=6075.60
median= 291424.69 median-absolute-deviation=1702.56
maximum=292498.27 minimum=261920.06
instructions_per_op:
mean= 48072.30 standard-deviation=21.15
median= 48074.49 median-absolute-deviation=12.07
maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
mean= 18884.09 standard-deviation=56.43
median= 18877.33 median-absolute-deviation=14.71
maximum=19155.48 minimum=18821.57
```
AFTER:
```
290108.83 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48121 insns/op, 18988 cycles/op, 0 errors)
throughput:
mean= 289105.08 standard-deviation=3626.58
median= 290018.90 median-absolute-deviation=1072.25
maximum=291110.44 minimum=274669.98
instructions_per_op:
mean= 48117.57 standard-deviation=18.58
median= 48114.51 median-absolute-deviation=12.08
maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
mean= 18953.43 standard-deviation=28.76
median= 18945.82 median-absolute-deviation=20.84
maximum=19023.93 minimum=18916.46
```
Fixes: SCYLLADB-259
Refs: SCYLLADB-739
No backport, it's a new feature
Closesscylladb/scylladb#28570
* github.com:scylladb/scylladb:
scylla.yaml: add write CL guardrails to scylla.yaml
scylla.yaml: reorganize guardrails config to be in one place
test: add cluster tests for write CL guardrails
test: implement test_guardrail_write_consistency_level
cql3: start using write CL guardrails
cql3/query_processor: implement metrics to track CL of writes
db: cql3/query_processor: add write_consistency_levels enum_sets
config: add write_consistency_levels_* guardrails configuration
The test/alternator/run script currently fails, Scylla fails to boot
complaining that "--alternator-ttl-period-in-seconds" is specified
twice (which is, unfortunately, not allowed). The problem is that
recently we started to set this option in test/cqlpy/run.py, for
CQL's new row-level TTL, so now it is no longer needed in
test/alternator/run - and in fact not allowed and we must remove it.
This patch only affects the script test/alternator/run, and has no
affect on running tests through test.py or Jenkins.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28868
This test forgot to await its check() calls, which is the pass-condition
of the test. Once the await was added, the test started failing. Turns
out, the test was broken, but this was never discovered, because due to
the missing await, the errors were not propagated.
This patch adds the missing await and fixes the discovered problems:
* Use cql.run_async() instead of cql.execute()
* Fix json path for timestamp
* Missing flush/compact
Fixes: SCYLLADB-911
Closesscylladb/scylladb#28883
Recently we started to rely on the options "--auth-superuser-name"
and "--auth-superuser-salted-password" to ensure that a
cassandra/cassandra user exists for tests - without those options
a default superuser no longer exists.
This broke "test/cqlpy/run --release" for old releases, earlier
than 5.4 (in the enterprise stream, 2024.1 or earlier), because
those old release didn't have this option.
So in this patch we fix the "--release" logic that removes these
options from the command line when running these old versions.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28894
The helper in question duplicates the functionality of `take_snapshot()` one from the same file. The only difference is that it additionally creates keyspace:table with yet another helper, but that helper is also going to be removed (as continuation of #28600 and #28608)
Enhancing tests, not backporting
Closesscylladb/scylladb#28834
* github.com:scylladb/scylladb:
test_backup: Remove prepare_snapshot_for_backup()
test_backup: Patch test_simple_backup_and_restore to use take_snapshot()
test_backup: Patch backup tests to use take_snapshot()
test_backup: Add helper to take snapshot on a single server
8e9c7397c5 made `rs` a reference, which can
lead to use-after-free. The `normal_nodes` map containing the referenced
value can be destroyed before the last use of `rs` when the topology state
is reloaded after a context switch on some `co_await`. The following move
assignment in `storage_service::topology_state_load` causes this:
```
_topology_state_machine._topology = co_await _sys_ks.local().load_topology_state(tablet_hosts);
```
This issue has been discovered in next-2026.1 CI after queueing the
backport of #28558. `test_truncate_during_topology_change` failed after
ASan reported a heap-use-after-free in
```
co_await _repair.local().bootstrap_with_repair(get_token_metadata_ptr(), rs.ring.value().tokens, session);
```
This test enables `delay_bootstrap_120s`, which makes the bug much more
likely to reproduce, but it could happen elsewhere.
No backport needed, as the only backport of #28558 hasn't been merged yet.
The backport PR will cherry-pick this commit.
Closesscylladb/scylladb#28772
Set enable_schema_commitlog for each group0 tables.
Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.
set_is_group0_table takes an enabled flag, based on which it decides
whether it's a group0 table. The method is called only with enabled = true.
Drop the param. For not group0 tables nothing should be set.
After the previous changes in `raft::server::{add_entry, modify_config}`
(cf. SCYLLADB-841), we also go through other methods of `raft::server`
and verify that they handle the aborted state properly.
I found two methods that do not:
(A) `wait_for_leader`
(B) `wait_for_state_change`
What happened before these changes?
In case (A), the dangerous scenario occurred when `_leader_promise` was
empty on entering the function. In that case, we would construct the
promise and wait on the corresponding future. However, if the server
had been already aborted before the call, the future would never
resolve and we'd be effectively stuck.
Case (B) is fully analogous: instead of `_leader_promise`, we'd work
with `_stte_change_promise`.
There's probably a more proper solution to this problem, but since I'm
not familiar with the internal code of Raft, I fix it this way. We can
improve it further in the future.
We provide two simple validation tests. They verify that after aborting
a `raft::server`, the calls:
* do not hang (the tests would time out otherwise),
* throw raft::stopped_error.
Fixes SCYLLADB-841
Before the change, calling `add_entry` or `modify_config` on an already
aborted Raft server could result in an error `not_a_leader` containing
a null server ID. It was possible precisely when forwarding was
disabled in the server configuration.
`not_a_leader` is supposed to return the ID of the current leader,
so that was wrong. Furthermore, the description of the function
specified that if a server is aborted, then it should throw
`stopped_error`.
We fix that issue. A few small reproducer tests were provided to verify
that the functions behave correctly with and without forwarding enabled.
Refs SCYLLADB-841
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.
Additionally, increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.
Fixes: [SCYLLADB-909](https://scylladb.atlassian.net/browse/SCYLLADB-909)
Backport to 2026.1, as the test is also there.
[SCYLLADB-909]: https://scylladb.atlassian.net/browse/SCYLLADB-909?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#28877
* github.com:scylladb/scylladb:
test: increase timeouts in test_client_routes.py
test: add missing awaits in test_client_routes_upgrade
The main loops iterating over vector components were not vectorized due to:
- "cannot prove it is safe to reorder floating-point operations"
- "Cannot vectorize early exit loop with more than one early exit"
The first issue is fixed with adding `#pragma clang fp contract(fast) reassociate(on)`, which allows compiler to optimize floating point operations.
The second issue is solved by refactoring the operations in the affected loop.
Additionally using float operations instead of double increases throughput and numerical accuracy is not the main consideration in vector search scenarios.
Performance measured:
- scylla built using dbuild
- using https://github.com/zilliztech/VectorDBBench (modified to call `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...` without ANN search):
- client concurrency 20
before: ~2250 QPS
`float` operations: ~2350 QPS
`compute_cosine_similarity` vectorization: ~2500QPS
`extract_float_vector` vectorization: ~3000QPS
Follow-up https://github.com/scylladb/scylladb/pull/28615
Ref https://scylladb.atlassian.net/browse/SCYLLADB-764Closesscylladb/scylladb#28754
Increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.
Test execution time is unaffected; the entire suite in `dev` takes ~30s
before and after this change.
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.
Fixes: SCYLLADB-909
The comparator used to sort per-IP client rows was not a strict-weak-ordering (it could return true in both directions for some pairs), which makes `std::ranges::sort` behavior undefined. A concrete pair that breaks it (and is realistic in system.clients):
a = (port=9042, client_type="cql")
b = (port=10000, client_type="alternator")
With the current comparator:
cmp(a,b) = (9042 < 10000) || ("cql" < "alternator") = true || false = true
cmp(b,a) = (10000 < 9042) || ("alternator" < "cql") = false || true = true
So both directions are true, meaning there is no valid ordering that sort can achieve.
The fix is to sort lexicographically by (port, client_type) to match the table's clustering key and ensure deterministic ordering.
Closesscylladb/scylladb#28844
This series closes a gap in the approx_exponential_histogram implementation to
cover integer values starting from small Min values.
While the original implementation was focused on durations, where this limitation
was not an issue, over time, there has been a growing need for histograms that
cover smaller values, such as the number of SSTables or the number of items in a
batch.
The reason for the original limitation is inherent to the exponential histogram
math. The previous code required Min to be at least Precision to avoid negative
bit shifts in the exponential calculations.
After this series, approx_exponential_histogram allows Min to be smaller than
Precision by scaling values during indexing. The value is shifted left by
log2 Precision minus log2 Min or zero whichever is larger, and the existing
exponential math is applied. Bucket limits are then scaled back to the original
units. This keeps insertion and retrieval O(1) without runtime branching, at the
cost of repeated bucket limits for some values in the Min to Precision range.
Additional tests cover the new behavior.
Relates to #2785
** New feature, no need to backport. **
Closesscylladb/scylladb#28371
* github.com:scylladb/scylladb:
estimated_histogram_test.cc: add to_metrics_histogram test
histogram_metrics_helper.hh: Support Min < Precision
estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision
estimated_histogram.hh: support Min less than Precision histograms
`isclose` function checks if returned similarity floats are close enough to expected value, but it doesn't `assert` by itself.
Several tests missed that `assert`, effectively always passing.
With this patch similarity values checks are wrapped in helper function `assert_similarity` with predefined tolerance.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-877Closesscylladb/scylladb#28748
This patch series removes creation of default 'cassandra:cassandra' superuser on system start.
Disable creation of a superuser with default 'cassandra:cassandra' credentials to improve security. The current flow requires clients to create another superuser and then drop the default `cassandra:cassandra' role. For those who do, there is a time window where the default credentials exist. For those who do not, that role stays. We want to improve security by forcing the client to either use config to specify default values for default superuser name and password or use cqlsh over maintenance socket connection to explicitly create/alter a superuser role.
The patch series:
- Enable role modification over the maintenance socket
- Stop using default 'cassandra' value for default superuser, skipping creation instead
Design document: https://scylladb.atlassian.net/wiki/spaces/RND/pages/165773327/Drop+default+cassandra+superuserFixesscylladb/scylla-enterprise#5657
This is an improvement. It does not need a backport.
Closesscylladb/scylladb#27215
* github.com:scylladb/scylladb:
config: enable maintenance socket in workdir by default
docs: auth: do not specify password with -p option
docs: update documentation related to default superuser
test: maintenance socket role management
test: cluster: add logs to test_maintenance_socket.py
test: pylib: fix connect_driver handling when adding and starting server
auth: do not create default 'cassandra:cassandra' superuser
auth: remove redundant DEFAULT_USER_NAME from password authenticator
auth: enable role management operations via maintenance socket
client_state: add has_superuser method
client_state: add _bypass_auth_checks flag
auth: let maintenance_socket_role_manager know if node is in maintenance mode
auth: remove class registrator usage
auth: instantiate auth service with factory functors
auth: add service constructor with factory functors
auth: add transitional.hh file
service: qos: handle special scheduling group case for maintenance socket
service: qos: use _auth_integration as condition for using _auth_integration
The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called from directly by hint sender, which already runs in streaming group, or via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming group as well.
To be sure, add a debugging check for current group being the expected one.
Code cleanup, not backporting
Closesscylladb/scylladb#28545
* github.com:scylladb/scylladb:
hint: Don't switch group in database::apply_hint()
hint_sender: Switch to sender group on stop either
This change is a bit more careful, as the test collects files from
snapshot directory several times. Before patching it to use the helper,
it collected _all_ the files. Now the helper only provides TOC-s, but
that's fine -- the only check that relies on that may also re-collect
TOC-s and compare new set with old set.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some of those tests need to update the hard-coded 'backup' snapshot name
to use the one provided by take_snapshot() helper. Other than that, the
patching is pretty straightforward.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The take_snapshot() helper returns a dict(server: list[string]). When
there's only one server to work with, it's more handy to just get a
single list of sstables.
Next patches will make use of that helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Modify the methods which calculate the default gc mode as well as that
which validates whether repair-mode can be used at all, so both accepts
use of repair-mode on RF=1 tables.
This de-facto changes the default tombstone-gc to repair-mode for all
tables. Documentation is updated accordingly.
Some tests need adjusting:
* cqlpy/test_select_from_mutation_fragments.py: disable GC for some test
cases because this patch makes tombstones they write subject to GC
when using defaults.
* test/cluster/test_mv.py::test_mv_tombstone_gc_not_inherited used
repair-mode as a non-default for the base table and expected the MV to
revert to default. Another mode has to be used as the non-default
(immediate).
* test/cqlpy/test_tools.py::test_scylla_sstable_dump_schema: don't
compare tombstone_gc schema extension when comparing dumped schema vs.
original. The tool's schema loader doesn't have access to the keyspace
definition so it will come up with different defaults for
tombstone-gc.
* test/boost/row_cache_test.cc::test_populating_cache_with_expired_and_nonexpired_tombstones
sets tombstone expiry assuming the tombstone-gc timeout-mode default.
Change the CREATE TABLE statement to set the expected mode.
Most of the functionality is tested in cqlpy tests located in
`test_guardrail_write_consistency_level.py`. Add two tests
that require the cluster framework:
- `test_invalid_write_cl_guardrail_config` checks the node startup
path when incorrect `write_consistency_levels_warned` and
`write_consistency_levels_disallowed` values are used.
- `test_write_cl_default` checks the behavior of the default
configuration using a multi-node cluster.
Tests execution time:
- Dev: 10s
- Debug: 18s
Refs: SCYLLADB-259
Implement basic tests for write consistency level guardrails,
verifying that they work for each type of write request (inserts,
updates, deletes, logged batches, unlogged batches, conditional batches,
and counter operations).
All tests are marked as Scylla-only because they currently don't
pass with Cassandra due to differences in handling superusers (see:
SCYLLADB-882).
Tests execution time:
- Dev: 3s
- Debug: 14s
Refs: SCYLLADB-259
Refs: SCYLLADB-882
In this series we introduce new system tables and use them for storing the raft metadata
for strongly consistent tables. In contrast to the previously used raft group0 tables, the
new tables can store data on any shard. The tables also allow specifying the shard where
each partition should reside, which enables the tablets of strongly consistent tables to have
their raft group metadata co-located on the same shard as the tablet replica.
The new tables have almost the same schemas as the raft group0 tables. However, they
have an additional column in their partition keys. The additional column is the shard
that specifies where the data should be located. While a tablet and its corresponding
raft group server resides on some shard, it now writes and reads all requests to the
metadata tables using its shard in addition to the group_id.
The extra partition key column is used by the new partitioner and sharder which allow
this special shard routing. The partitioner encodes the shard in the token and the
sharder decodes the shard from the token. This approach for routing avoids any
additional lookups (for the tablet mapping) during operations on the new tables
and it also doesn't require keeping any state. It also doesn't interact negatively
with resharding - as long as tablets (and their corresponding raft metadata) occupy
some shard, we do not allow starting the node with a shard count lower than the
id of this shard. When increasing the shard count, the routing does not change,
similarly to how tablet allocation doesn't change.
To use the new tables, a new implementation of `raft::persistence` is added. Currently,
it's almost an exact copy of the `raft_sys_table_storage` which just uses the new tables,
but in the future we can modify it with changes specific to metadata (or mutation)
storage for strongly consistent tables. The new storage is used in the `groups_manager`,
which combined with the removal of some `this_shard_id() == 0` checks, allows strongly
consistent tables to be used on all shards.
This approach for making sure that the reads/writes to the new tables end up on the correct shards
won in the balance of complexity/usability/performance against a few other approaches we've considered.
They include:
1. Making the Raft server read/write directly to the database, skipping the sharder, on its shard, while using
the default partitioner/sharder. This approach could let us avoid changing the schema and there should be
no problems for reads and writes performed by the Raft server. However, in this approach we would input
data in tables conflicting with the placement determined by the sharder. As a result, any read going through
the sharder could miss the rows it was supposed to read. Even when reading all shards to find a specific value,
there is a risk of polluting the cache - the rows loaded on incorrect shards may persist in the cache for an unknown
amount of time. The cache may also mistakenly remember that a row is missing, even though it's actually present,
just on an incorrect shard.
Some of the issues with this approach could be worked around using another sharder which always returns
this_shard_id() when asked about a shard. It's not clear how such a sharder would implement a method like
`token_for_next_shard`, and how much simpler it would be compared to the current "identity" sharder.
2. Using a sharder depending on the current allocation of tablets on the node. This approach relies on the
knowledge of group_id -> shard mapping at any point in time in the cluster. For this approach we'd also
need to either add a custom partitioner which encodes the group_id in the token, or we'd need to track the
token(group_id) -> shard mapping. This approach has the benefit over the one used in the series of keeping
the partition key as just group_id. However, it requires more logic, and the access to the live state of the node
in the sharder, and it's not static - the same token may be sharded differently depending on the state of the
node - it shouldn't occur in practice, but if we changed the state of the node before adjusting the table data,
we would be unable to access/fix the stale data without artificially also changing the state of the node.
3. Using metadata tables co-located to the strongly consistent tables. This approach could simplify the
metadata migrations in the future, however it would require additional schema management of all co-located
metadata tables, and it's not even obvious what could be used as the partition key in these tables - some
metadata is per-raft-group, so we couldn't reuse the partition key of the strongly consistent table for it. And
finding and remembering a partition key that is routed to a specific shard is not a simple task. Finally, splits
and merges will most likely need special handling for metadata anyway, so we wouldn't even make use of
co-located table's splits and merges.
Fixes [SCYLLADB-361](https://scylladb.atlassian.net/browse/SCYLLADB-361)
[SCYLLADB-361]: https://scylladb.atlassian.net/browse/SCYLLADB-361?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#28509
* github.com:scylladb/scylladb:
docs: add strong consistency doc
test/cluster: add tests for strongly-consistent tables' metadata persistence
raft: enable multi-shard raft groups for strongly consistent tablets
test/raft: add unit tests for raft_groups_storage
raft: add raft_groups_storage persistence class
db: add system tables for strongly consistent tables' raft groups
dht: add fixed_shard_partitioner and fixed_shard_sharder
raft: add group_id -> shard mapping to raft_group_registry
schema: add with_sharder overload accepting static_sharder reference
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.
This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.
Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826
Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well.
Closesscylladb/scylladb#28846
* github.com:scylladb/scylladb:
vector_search: test: include ANN error in assertion
vector_search: test: fix HTTPS client test flakiness
When we create a materialized view, we consider 2 cases:
1. the view's primary key contains a column that is not
in the primary key of the base table
2. the view's primary key doesn't contain such a column
In the 2nd case, we add all columns from the base table
to the schema of the view (as virtual columns). As a result,
all of these columns are effectively "selected" in
view_updates::can_skip_view_updates. Same thing happens when
we add new columns to the base table using ALTER.
Because of this, we can never have !column_is_selected and
!has_base_non_pk_columns_in_view_pk at the same time. And
thus, the check (!column_is_selected
&& _base_info.has_base_non_pk_columns_in_view_pk) is always
the same as (!column_is_selected).
Because we immediately return after this check, the tail of
this function is also never reached - all checks after the
(column_is_selected) are affected by this. Also, the condition
(!column_is_selected && base_has_nonexpiring_marker) is always
false at the point it is called. And this in turn makes the
`base_has_nonexpiring_marker` unused, so we delete it as well.
It's worth considering, why did we even have
`base_has_nonexpiring_marker` if it's effectively unused. We
initially introduced it in bd52e05ae2 and we (incorrectly)
used it to allow skipping view updates even if the liveness of
virtual columns changed. Soon after, in 5f85a7a821, we
started categorizing virtual columns as column_is_selected == true
and we moved the liveness checks for virtual columns to the
`if (column_is_selected)` clause, before the `base_has_nonexpiring_marker`
check. We changed this because even if we have a nonexpiring marker
right now, it may be changed in the future, in which case the liveness
of the view row will depend on liveness of the virtual columns and
we'll need to have the view updates from the time the row marker was
nonexpiring.
Closesscylladb/scylladb#28838
Fixes issue #12818 with the following docs changes:
docs/dev/system_keyspace.md: Added missing system tables, added table of contents (TOC), added categories
Closesscylladb/scylladb#27789
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).
Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.
As additional measure we double the timeout.
The fix is now cherry-picked to master as sometimes
test fails there too.
(cherry picked from commit 1f1fc2c2ac)
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-795
backport: 2026.1, already on other stable branches
Closesscylladb/scylladb#28848
* github.com:scylladb/scylladb:
test: add more logs to test_startup_no_auth_response
test: decrease strain in test_startup_response
Enable verification of write consistency level guardrails in
`modification_statement` and `batch_statement`.
Neither guardrail is enabled by default, so as not to disrupt clusters
that are currently using any of the CLs for writes. The warning
guardrail may seem harmless, as it only adds a warning to the CQL
response; however, enabling it can significantly increase network
traffic (as a warning message is added to each response) and also
decrease throughput due to additional allocations required to prepare
the warning. Therefore, both guardrails should be enabled with care.
The newly added `writes_per_consistency_level` metric, which is
incremented unconditionally, can help decide whether a guardrail can
be safely enabled in an existing cluster.
This commit adds additional `if` instructions on the critical path.
However, based on the `perf_simple_query` benchmark for writes,
the difference is marginal (~40 additional instructions, which is
a relative difference smaller than 0.001).
BEFORE:
```
291443.35 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48067 insns/op, 18885 cycles/op, 0 errors)
throughput:
mean= 289743.07 standard-deviation=6075.60
median= 291424.69 median-absolute-deviation=1702.56
maximum=292498.27 minimum=261920.06
instructions_per_op:
mean= 48072.30 standard-deviation=21.15
median= 48074.49 median-absolute-deviation=12.07
maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
mean= 18884.09 standard-deviation=56.43
median= 18877.33 median-absolute-deviation=14.71
maximum=19155.48 minimum=18821.57
```
AFTER:
```
290108.83 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48121 insns/op, 18988 cycles/op, 0 errors)
throughput:
mean= 289105.08 standard-deviation=3626.58
median= 290018.90 median-absolute-deviation=1072.25
maximum=291110.44 minimum=274669.98
instructions_per_op:
mean= 48117.57 standard-deviation=18.58
median= 48114.51 median-absolute-deviation=12.08
maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
mean= 18953.43 standard-deviation=28.76
median= 18945.82 median-absolute-deviation=20.84
maximum=19023.93 minimum=18916.46
```
Fixes: SCYLLADB-259
Consider this:
- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder
This is observed in the test:
```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]
...
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
....
scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```
The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.
To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.
The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.
Fixes#27365
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Prevent repair lock holder from being leaked in repair_service when table
is dropped midway.
The leakage might result in use-after-free later, since the repair lock
itself will be gone after table drop.
The RPC verb that removes the lock on success path will not be called
by coordinator after table was dropped.
Refs #27365.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-896.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We want to enable maintenance socket by default.
This will prevent users from having to reboot a server to enable it.
Also, there is little point in having maintenance socket that is turned off,
and we want users to use it. After this patch series, they will have
to use it. Note that while config seeding exists, we do not encourage it
for production deployments.
This patch changes default maintenance_socket value from ignore to workdir.
This enables maintenance socket without specifying an explicit path.
Refs SCYLLADB-409
Specifying password with -p option is considered unsafe.
The password will be saved in bash history.
The preferred approach is to enter the password when prompted.
Any approach that passes the password via command line arguments
makes that password visible in process options (ps command), no matter
if the password is passed directly or as an environment variable.
Refs SCYLLADB-409
Update create superuser procedure:
- Remove notes about default `cassandra` superuser
- Add create superuser using existing superuser section
- Update create superuser by using `scylla.yaml` config
- Add create superuser using maintenance socket
Update password reset procedure:
- Add maintenance socket approach
- Remove the old approach with deleting all the roles
Update enabling authentication with downtime and during runtime:
- Mention creating new superuser over the maintenance socket
- Remove default superuser usage
Update enable authorization:
- Mention creating new superuser over the maintenance socket
- Remove mention of default superuser
Reasoning for deletion of the old approach:
- [old] Needs cluster downtime, removes all roles, needs recreation of roles,
needs maintenance socket anyways, if config values are not used for superuser
- [new] No cluster downtime, possibly one node restart to enable maintenance
socket, faster
Refs SCYLLADB-409
Introduce a test that cover:
- Server startup without credentials config seeding with no roles created
- Await maintenance socket role management to be enabled
- `CREATE ROLE`, `ALTER ROLE`, and `DROP ROLE` statement execution success
All the tests in the test_maintenance_socket.py module take 2-3 seconds
to execute.
Explicitly shut down Cluster objects to prevent 'RuntimeError: cannot
schedule new futures after shutdown'.
Refs SCYLLADB-409
Add logs to test_maintenance_socket.py test test_maintenance_socket.
This approach offers additional visibility in case of test failure.
Such logs will be added to new tests in a follow up patch in this
patch series.
Refs SCYLLADB-409
When connect_driver=False, the expected server up state should be
capped to HOST_ID_QUERIED. This is to avoid waiting for CQL readiness,
which requires a superuser to be present.
This logic was only in ScyllaCluster.server_start. ManagerClient.server_add
with start=True and connect_driver=False would still wait for CQL and hang
if no superuser is present. The workaround was to call
ManagerClient.server_add(start=False, connect_driver=False) followed by
ManagerClient.server_start(connect_driver=False).
This patch moves the capping from ScyllaCluster.server_start to
ManagerClient.server_add and ManagerClient.server_start, where connect_driver
is processed. ScyllaCluster only receives the already resolved
expected_server_up_state value.
Refs SCYLLADB-409
Changes the behavior of default superuser creation.
Previously, without configuration 'cassandra:cassandra' credentials
were used. Now default superuser creation is skipped if not configured.
The two ways to create default superuser are:
- Config file - auth_superuser_name and auth_superuser_salted_password fields
- Maintenance socket - connect over maintenance socket and CREATE/ALTER ROLE ...
Behavior changes:
Old behavior:
- No config - 'cassandra:cassandra' created
- auth_superuser_name only - <name>:cassandra created
- auth_superuser_salted_password only - 'cassandra:<password>' created
- Both specified - '<name>:<password>' created
New behavior:
- No config - no default superuser
- Requires maintenance socket setup
- auth_superuser_name only - '<name>:' created WITHOUT password
- Requires maintenance socket setup
- auth_superuser_salted_password only - no default superuser
- Both specified - '<name>:<password>' created
Fixes SCYLLADB-409
Introduce maintenance_socket_authenticator and rework
maintenance_socket_role_manager to support role management operations.
Maintenance auth service uses allow_all_authenticator. To allow
role modification statements over the maintenance socket connections,
we need to treat the maintenance socket connections as superusers and
give them proper access rights.
Possible approaches are:
1. Modify allow_all_authenticator with conditional logic that
password_authenticator already does
2. Modify password_authenticator with conditional logic specific
for the maintenance socket connections
3. Extend password_authenticator, overriding the methods that differ
Option 3 is chosen: maintenance_socket_authenticator extends
password_authenticator with authentication disabled.
The maintenance_socket_role_manager is reworked to lazily create a
standard_role_manager once the node joins the cluster, delegating role
operations to it. In maintenance mode role operations remain disabled.
Refs SCYLLADB-409
Encapsulate the superuser check in client_state so that it
respects _bypass_auth_checks. Connections that bypass auth
(internal callers and the maintenance socket) are always
considered superusers.
Migrate existing call sites from auth::has_superuser(service, user)
to client_state.has_superuser(). Also add _bypass_auth_checks
handling to ensure_not_anonymous().
Refs SCYLLADB-409
Authorization checks were previously skipped based on the
_is_internal flag. This couples two concerns: marking client
state as internal and bypassing authorization.
Introduce _bypass_auth_checks to handle only the authorization
bypass. Internal client state sets it to true, preserving current
behavior. External client state accepts it as a constructor
parameter, defaulting to false.
This will allow maintenance socket connections to skip
authorization without being marked as internal.
Refs SCYLLADB-409
This patch is part of preparations for dropping 'cassandra::cassandra'
default superuser. When that is implemented, maintenance_socket_role_manager
will have two modes of work:
1. in maintenance mode, where role operations are forbidden
2. in normal mode, where role operations are allowed
To execute the role operations, the node has to join a cluster.
In maintenance mode the node does not join a cluster.
This patch lets maintenance_socket_role_manager know if it works under
maintenance mode and returns appropriate error message when role
operations execution is requested.
Refs SCYLLADB-409
This patch removes class registrator usage in auth module.
It is not used after switching to factory functor initialization
of auth service.
Several role manager, authenticator, and authorizer name variables
are returned as well, and hardcoded inside qualified_java_name method,
since that is the only place they are ever used.
Refs SCYLLADB-409
Auth service is instantiated with the constructor that accepts
service_config, which then uses class registrator to instantiate
authorizer, authenticator, and role manager.
This patch switches to instantiating auth service via the constructor
that accepts factory functors. This is a step towards removing
usage of class registrator.
Refs SCYLLADB-409
Auth service can be initialized:
- [current] by passing instantiated authorizer, authenticator, role manager
- [current] by passing service_config, which then uses class registrator to instantiate authorizer, authenticator, role manager
- This approach is easy to use with sharded services
- [new] by passing factory functors which instantiate authorizer, authenticator, role manager
- This approach is also easy to use with sharded services
Refs SCYLLADB-409
In a follow-up patch in this patch series class registrator will be removed.
Adding transitional.hh file will be necessary to expose the authenticator and authorizer.
Refs SCYLLADB-409
service_level_controller has special handling for maintenance socket connections.
If the current user is not a named user, it should use the default scheduling group.
The reason is that the maintenance socket can communicate with Scylla before
auth_integration is registered.
The guard is already present, but it was omitted in get_cached_user_scheduling_group.
This also fixes flakiness in test_maintenance_socket.py tests.
Refs SCYLLADB-409
Maintenance socket connections can be established before _auth_integration is
initialized. The fix introduced with scylladb/scylladb#26856 PR check for
the value of user variable. For maintenance socket connections it will be an
anonymous user, and will fall back to using default scheduling group.
This patch changes the criteria for using default scheduling group from
the user variable to checking the _auth_integration variable itself:
- If _auth_integration is not initialized, use default scheduling group
- If _auth_integration is initialized, let it choose the scheduling group
Refs SCYLLADB-409
Add `write_consistency_levels_disallowed_violations` and
`write_consistency_levels_warned_violations` metrics to track
violations of write_consistency_levels guardrails.
Add `writes_per_consistency_level` to track what CL is used by
writes, regardless of the guardrails configuration.
Data gathered by this metric can be used to decide whether enabling
a particular write consistency level guardrail in a particular
existing cluster is safe.
Refs: SCYLLADB-259
Add enum_sets to query_processor that track the configuration
values of `write_consistency_levels_warned` and
`write_consistency_levels_disallowed`.
Refs: SCYLLADB-259
We introduce a function creating a Raft cluster with parameters usually
used by the tests. This will avoid code duplication, especially after
introducing new tests in the following commits.
Note that the test `test_aborting_wait_for_state_change` has changed:
the previous complex configuration was unnecessary for it (I wrote it).
These two are only used to print into logs on error. However, their
values can be found from previous logs and test execution context.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the test fails, the assertion message does not include
the error from the ANN request.
This change enhances the assertion to include the specific ANN error,
making it easier to diagnose test failures.
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.
This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.
Fixes: VECTOR-547, SCYLLADB-802
On every update of the ERM, update the state of the current table in the
registry of RF=1 tables in shared tombstone gc state.
Ensures that tombstone gc stops collection of tombstones in immediate
mode as soon as the table starts transitioning away from RF=1.
Currently we cannot use repair-mode tombstone gc on RF=1 tables, because
such tables don't need repair and so there won't be repair history to
use to produce gc_before times.
Introduce shared_tombstone_gc_state::_rf_one_tables which will keep a
registry of RF=1 tables. Keeping this up to date is left to outside code
(table.cc). Consult the registry to determine whether a table is RF=1 or
not, so the repair history check can be ellided for rf=1 tables.
Not wired in yet into the table code.
Doesn't belong there. Also, having it as a separate member of
shared_tombstone_gc_state makes updating _group0_gc_time cheaper, as the
update doesn't have to do a copy-mutate-swap of the history maps.
This is the current de-facto default for all tests using random schema
and some are apparently relying on this. Make this explicit to avoid
upsetting tests, by the impending change of this default to repair.
It is ambigous, use the appropriate no-gc or gc-all factories instead,
as appropriate.
A special note for mutation::compacted(): according to the comment above
it, it doesn't drop expired tombstones but as it is currently, it
actually does. Change the tombstone gc param for the underlying call to
compact_for_compaction() to uphold the comment. This is used in tests
mostly, so no fallout expected.
Tests are handled in the next commit, to reduce noise.
Two tests in mutation_test.cc have to be updated:
* test_compactor_range_tombstone_spanning_many_pages
has to be updated in this commit, as it uses
mutation_partition::compact_for_query() as well as
compact_for_query(). The test passes default constructed
tombstone_gc() to the latter while the former now uses no-gc
creating a mismatch in tombstone gc behaviour, resulting in test
failure. Update the test to also pass no-gc to compact_for_query().
* test_query_digest similarly uses mutation_partition::query_mutation()
and another compaction method, having to match the no-gc now used in
query_mutation().
It is ambiguous, use tombstone_gc_mode::is_gc_enabled() instead.
Note that the two has slightly different meanings, operator bool()
returned true when repair-history related functionality was enabled.
This is fine, because the only two users are logs, where the two
meanings are close enough. All other users were eliminated or migrated
already, taking the change in meaning into account.
Instead of keeping a pointer to it. Replace nullptr with
tombstone_gc_state::no_gc().
This object is now designed to be used as a value-type, after recent
refactoring.
To replace the usage of tombstone_gc_state(nullptr) usage in tests
specifically. This more verbose factory method will hopefully convey
that this is not to be used in production code. The nullptr constructor
doesn't convey this and in fact it was used in production code
here-and-there.
This commit introduces pure pytest logging into a file
Previously, test.py used pytest as a script(not a framework) and just captured pytest stdout and logged this data by itself
This commit sets up the log files format that additionaly display Python processName, threadName adn taskName because test.py test cases use them, and now it is so hard to investigate issues that are connected with parallelism inside test case themselve
In addition, commit splits the logging of different pytest workers(xdist) into different files. If pytest workers have ho failed test - log file for these workers will be deleted
There is also additional logging for failures that will contain a separate file per test failure and contain the error itself (stacktrace) and all capture logs from stdout, stderr during the test run. With --save-log-on-success it will be a separate file per test on pass as well
All this new functionality works with the new xdit scheduler (--test-py-init=True)
Fixes SCYLLADB-713
Closesscylladb/scylladb#28705
Introduced a new max_tablet_count tablet option that caps the maximum number of tablets a table can have. This feature is designed primarily for backup and restore workflows.
During backup, when load balancing is disabled for snapshot consistency, the current tablet count is recorded in the backup manifest.
During restore, max_tablet_count is set to this recorded value, ensuring the restored table's tablet count never exceeds the original snapshot's tablet distribution.
This guarantee enables efficient file-based SSTable streaming during restore, as each SSTable remains fully contained within a single tablet boundary.
Closesscylladb/scylladb#28450
The secondary index mechanism is currently used to determine the target column.
This mechanism works incorrectly for vector indexes with filtering because
it returns the last specified column as the target (vectors) column.
However, the syntax for a vector index requires the first column to be the target:
```
CREATE CUSTOM INDEX ON t(vectors, users) USING 'vector_index';
```
This discrepancy eventually leads to the following exception when performing an
ANN search on a vector index with filtering columns:
````
ANN ordering by vector requires the column to be indexed using 'vector_index'
````
This commit fixes the issue by introducing dedicated logic for vector indexes
to correctly identify the target(vectors) column.
Fixes: SCYLLADB-635
Closesscylladb/scylladb#28740
Split input sstable(s) into multiple output sstables based on the provided
token boundaries. The input sstable(s) are divided according to the specified
split tokens, creating one output sstable per token range.
Fixes: SCYLLADB-10
Closesscylladb/scylladb#28741
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).
Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.
As additional measure we double the timeout.
The fix is now cherry-picked to master as sometimes
test fails there too.
(cherry picked from commit 1f1fc2c2ac)
The query (and in certain modes the write) operations uses virtual table facility inside `cql_test_env`. The schema of the sstable is created as a table in `cql_test_env`. This involves registering all UDTs with the keyspace, so they are available for lookups.
This was done with a flat loop over all column types, but this is not enough. UDTs might be nested in other types, like collections. One has to do a traversal of the type tree and register every UDT on the way.
This PR changes the flat loop to a recursive traversal of the type tree. The query operation now works with UDTs, no matter how deeply nested they are.
Backport: Implements missing functionality of a tool, no backport.
Closesscylladb/scylladb#28798
* github.com:scylladb/scylladb:
tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively
tools/scylla-sstable: generalize dump_if_user_type
tools/scylla-sstable: move dump_if_user_type() definition
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.
Fixes: https://github.com/scylladb/scylladb/issues/27886
Needs backport to 2026.1, to fix upgrades for clusters using batches
Closesscylladb/scylladb#28736
* github.com:scylladb/scylladb:
test/boost/batchlog_manager_test: add tests for v1 batchlog
test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
test/boost/batchlog_manager_test: fix indentation
test/boost/batchlog_manager_test: extract prepare_batches() method
test/lib/cql_assertions: is_rows(): add dump parameter
tools/scylla-sstable: extract query result printers
tools/scylla-sstable: add std::ostream& arg to query result printers
repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
db/batchlog_manager: re-add v1 support
db/batchlog_manager: return all_replayed from process_batch()
db/batchlog_manager: process_bath() fix indentation
db/batchlog_manager: make batch() a standalone function
db/batchlog_manager: make structs stats public
db/batchlog_manager: allocate limiter on the stack
db/batchlog_manager: add feature_service dependency
gms/feature_service: add batchlog_v2 feature
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```
Fixes SCYLLADB-805
Closesscylladb/scylladb#28829
Following becb48b586 it seems we have a regression with trigger CI logic
The Verify Org Membership step used gh api /orgs/scylladb/members/$AUTHOR
with GITHUB_TOKEN to check if the user is an org member. However,
GITHUB_TOKEN does not have read:org scope, so the API call fails for all
users — even actual scylladb org members — causing CI triggers to be
silently skipped.
Replace the API call with the author_association field from the GitHub
event payload, which is set by GitHub itself and requires no special
token permissions. This allows any scylladb org member (MEMBER or OWNER)
to trigger CI via comment, regardless of whether they authored the PR.
Closesscylladb/scylladb#28837
remove the test since it's not relevant anymore, it's not testing what
it's supposed to test and it's unstable.
the purpose of the test was to reproduce an issue in the legacy view
builder where a view starts to build at token T2 and then all tokens
[T1, end) with T1<T2 migrate to another node while it's still building,
exposing an issue when the view builder wraparounds the token ring.
this is not relevant anymore because now view building with tablets is
done via the view building coordinator for tablets, and all views start
to build from the first token with no wraparound.
besides, the test is unstable due to relying too much on specific
timing, which was useful for investigating and fixing the original issue
but not anymore.
Fixes SCYLLADB-842
Closesscylladb/scylladb#28842
The issue was fixed by commit cc03f5c89d
("cql3: support literals and bind variables in selectors"), so the
xfail marker is no longer needed.
Closesscylladb/scylladb#28776
The test uses create_dataset helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities.
Also the PR simplifies the 3-level nested loops used to combine several sets of restoration parameters by using itertools.product facility.
Continuation of #28600.
Cleaning tests, not backporting
Closesscylladb/scylladb#28608
* github.com:scylladb/scylladb:
test/object_store: Use itertools.product() for deeply nested loops
test/object_store: Replace dataset creation usage with standard methods
test/object_store: Shift indentation right for test_restore_with_streaming_scopes
The perf-simple-query tests were not restricted on CPU count,
so on a 96-CPU machine, they would run on 96 CPUs, and time out
in debug mode.
All restrict memory usage and add --overprovisioned so that
pinning is disabled. Apply that to all tests.
Closesscylladb/scylladb#28821
The helper is very simple yet generic -- it takes a snapshot of a keyspace on all servers and collects the resulting sstables from workdirs. Re-using it in all test cases saves some lines of code. Also, the method is "sequential", making it "parallel" reduces the waiting time a bit.
Will help generalizing existing backup/restore tests to support clustered snapshot/backup/restore API (see #28525) later.
Cleaning up tests, not backporting.
Closesscylladb/scylladb#28660
* github.com:scylladb/scylladb:
test/backup: Run keyspace flush and snapshot taking API in parallel
test/backup: Re-use take_snapshot() helper in do_abort_restore()
test/backup: Move take_snapshot() helper up
Split the chained inject_parameter().value_or() expression into two
separate named variables for clarity.
Use condition_variable::when() instead of wait(). when() is the
coroutine-native API, designed specifically for co_await contexts —
it avoids the heap allocation of a promise_waiter that wait() uses,
and instead uses a stack-based awaiter.
Closesscylladb/scylladb#28824
The PR removes most of the code that assumes that group0 and raft topology is not enabled. It also makes sure that joining a cluster in no raft mode or upgrading a node in a cluster that not yet uses raft topology to this version will fail.
Refs #15422
No backport needed since this removes functionality.
Closesscylladb/scylladb#28514
* https://github.com/scylladb/scylladb:
group0: fix indentation after previous patch
raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more
raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream
raft_group0: remove unused code from raft_group0
node_ops: remove topology over node ops code
topology: fix indentation after the previous patch
topology: drop topology_change_enabled parameter from raft_group0 code
storage_service: remove unused handle_state_* functions
gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option
storage_service: fix indentation after the last patch
storage_service: remove gossiper bootstrapping code
storage_service: drop get_group_server_if_raft_topolgy_enabled
storage_service: drop is_topology_coordinator_enabled and its uses
storage_service: drop run_with_api_lock_in_gossiper_mode_only
topology: remove code that assumes raft_topology_change_enabled() may return false
test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode
test: schema_change_test: drop schema tests relevant for no raft mode only
topology: remove upgrade to raft topology code
group0: remove upgrade to group0 code
group0: refuse to boot if a cluster is still is not in a raft topology mode
storage_service: refuse to join a cluster in legacy mode
This commit adds the upgrade guide for version 2026.1.
According to the new upgrade policy, the user can now upgrade to the major version (2026.1)
from any previous minor version.
So instead of adding a separate guide form 2025.4 to 2026.1, we need a guide from 2025.x to 2026.1.
In addition, this commit:
- Updates the upgrade policy for reflect the above change.
- Removes the upgrade guides for the previous version.
Fixes https://github.com/scylladb/scylladb/issues/28533
Fixes https://github.com/scylladb/scylladb/issues/28532Closesscylladb/scylladb#28789
assert(), and SCYLLA_ASSERT() are evil (Refs #7871) because they can cause the entire Scylla cluster to crash mysteriously instead of cleanly failing the specific request that encountered a serious problem of failed pre-requisite.
In this two-patch series, in the first patch we introduce a new macro throwing_assert(), a convenient drop-in replacement for SCYLLA_ASSERT() but which has all the benefits of on_internal_error() instead of the dangers of SCYLLA_ASSERT().
In the second patch we use the new function to replace every call to SCYLLA_ASSERT() in Alternator by the new throwing_assert().
Here is an example from the second patch to demonstrate the power of this approach: The Alternator code uses the attrs_column() function to retrieve the ":attrs" column of a schema. Since every Alternator table always has an ":attrs" column in its schema, we felt safe to SCYLLA_ASSERT() that this column exists. However, imagine that one day because of a bug, one Alternator table is missing this column. Or maybe not a bug - maybe a malicious user on a shared cluster found a way to deliberately delete this column (e.g, with a CQL command!) and this check fails. Before this patch, the entire Scylla node will crash. If the same request is sent to all nodes - the entire cluster will crash. The user might not even know which request caused this crash. In contrast, after this patch, the specific operation - e.g., PutItem - will get an exception. Only this operation, and nothing else, will be aborted, and the user who sent this request will even get an "Internal Server Error" with the assertion-failure message, alerting them that this specific query is causing problems, while other queries might work normally.
There's no need to backport this patch - unless it becomes annoying that other branches don't have the throwing_assert() function and we want it to ease other backports.
Fixes#28308.
Closesscylladb/scylladb#28445
* github.com:scylladb/scylladb:
alternator: replace SCYLLA_ASSERT with throwing_assert
utils: introduce throwing_assert(), a safe replacement for assert
In nonroot installations, the install.sh script was hardcoding the
api_ui_dir and api_doc_dir paths to /opt/scylladb/ in scylla.yaml,
even though the actual files were installed to a different location
(typically ~/scylladb). This caused REST API endpoints like
/api-doc/failure_detector/ to fail with "transfer closed with
outstanding read data remaining" error because Scylla couldn't find
the API documentation files at the configured paths.
Fix this by using the $prefix variable instead of hardcoded
/opt/scylladb/ paths. This ensures that:
- In regular installations: $prefix = /opt/scylladb (no change)
- In nonroot installations: $prefix = ~/scylladb (paths now correct)
Fixes: SCYLLADB-721
Backport: The hardcoded paths in install.sh have been present since
the nonroot installation feature was introduced, making REST API
endpoints non-functional in all nonroot installations across all
live versions of Scylla.
Closesscylladb/scylladb#28805
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.
Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.
It mainly fixes problem when setting uninitialized_connections_semaphore_cpu_concurrency
config value to 1 would result in not being able to process connections, as only 1 out of 2
listeners got the semaphore.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-762
Backport: no, it's a minor problem
Closesscylladb/scylladb#28747
* github.com:scylladb/scylladb:
test: add test_uninitialized_conns_semaphore
generic_server: fix waiters count in shed log
generic_server: scale connection concurrency semaphore by listener count
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed
to the final name after sealing. This allows crash recovery by simply removing the temporary file on startup.
However, this approach won't work once component digests are stored in scylla_metadata,
as replacing a component like Statistics will require atomically updating both the component
and scylla_metadata with the new digest—impossible with POSIX rename.
The new mechanism creates a clone sstable with a fresh generation:
- Hard-links all components from the source except the component being rewritten and scylla metadata if update_sstable_id is true
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component. If update_sstable_id is true, reads scylla metadata, generates new sstable_id and rewrites it.
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction
This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure during the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.
This prepares the infrastructure for component digests. Once digests are introduced in scylla_metadata
this mechanism will be extended to also rewrite scylla metadata with the updated digest alongside the modified component, ensuring atomic updates of both.
Add make_sstable() overload that accepts sstable_version_types parameter
to compaction_group_view interface and all implementations.
This will be useful in rewrite component mechanism, as we
need to preserve sstable version when creating the new one for the replacement.
Introduce a null_data_sink and make_digest_calculator implementation that discards
all writes, enabling checksum calculation without file I/O. This allows the
existing checksummed_file_writer to be used for digest computation
without writing data to disk.
This will be used in a future commit to calculate the scylla metadata
component checksum before writing it to disk, allowing the component
to store its own checksum.
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
co_await cstate.compaction_done.wait([..] {
return num_runs_for_compaction() <= threshold
|| !can_perform_regular_compaction(t);
});
```
to a while loop with a predicated wait:
```
while (can_perform_regular_compaction(t)
&& co_await num_runs_for_compaction() > threshold) {
co_await cstate.compaction_done.wait([this, &t] {
return !can_perform_regular_compaction(t);
});
}
```
This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).
However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.
This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.
Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610Closesscylladb/scylladb#28801
This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature.
The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users.
In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.:
```cql
CREATE TABLE tab (
id int PRIMARY KEY,
t text,
expiration timestamp TTL
);
```
Or after the table already exists with:
```cql
ALTER TABLE tab TTL expiration
```
Expiration can also be disabled, with:
```cql
ALTER TABLE tab TTL NULL
```
The new per-row TTL feature has two features that users have been asking for:
1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row.
2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row.
To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; Until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too.
The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them.
The expiration thread keeps the same metrics as it did for Alternator:
* `scylla_expiration_scan_passes`
* `scylla_expiration_scan_table`
* `scylla_expiration_items_deleted`
* `scylla_expiration_secondary_ranges_scanned`
The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinary for CQL) and finally 30 tests (single-node and multi-node tests) and documentation.
This series is a new feature, so traditionally would not be backported. However, I wouldn't be surprised if we will be requested to backport it so that customers will not need to wait for a new major release.
Fixes#13000Closesscylladb/scylladb#28320
* github.com:scylladb/scylladb:
test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY
docs/cql: document the new CQL per-row TTL feature
test/cluster: tests for the new CQL per-row TTL feature
test/cqlpy: tests for the new CQL per-row TTL feature
test: set low alternator_ttl_period_in_seconds in CQL tests
cql ttl: fix ALTER TABLE to disable TTL if column is dropped
cql ttl: add setting/unsetting of TTL column to ALTER TABLE
cql ttl: add TTL column support to CREATE TABLE and DESC TABLE
ttl: add CQL support to Alternator's TTL expiration service
alternator ttl: move TTL_TAG_KEY to a header file
alternator ttl: remove unnecessary check of feature flag
cql: add "cql_row_ttl" cluster feature
alternator: fix error message if UpdateTimeToLive is not supported
This patch adds filtering for tablet sizes collected in load_stats.
This is needed to improve the chances that the balancer will have all
the tablet sizes for the node, and that way avoid having the node
ignored during balancing.
Switch vector dimension handling to fixed-width `uint32_t` type,
update parsing/validation, and add boundary tests.
The dimension is parsed as `unsigned long` at first which is guaranteed
to be **at least** 32-bit long, which is safe to downcast to `uint32_t`.
Move `MAX_VECTOR_DIMENSION` from `cql3_type::raw_vector` to `cql3_type`
to ensure public visibility for checks outside the class.
Add tests to verify the type boundaries.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-223
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>
Closesscylladb/scylladb#28762
This patch moves the table size tablet filtering code from a lambda in
storage_service::load_stats_for_tablet_based_tables() to the code
section where it will be used:
tablet_storage_group_manager::table_load_stats()
This is needed to better accomodate the next commit which will add
code for filtering tablets for tablet sizes.
Table sizes are collected in load stats, and are filtered according to
the migration stage, so as to avoid double accounting of tablet sizes.
The comment which explains the cut-off migration stage (cleanup) and
the cut-off stage in the code (streaming) are not the same.
This patch fixes that.
The test test_build_view_with_large_row creates a materialized view and
expects the view to be built with a timeout of 5 seconds. It was
observed to fail because the timeout is too short on slow machines.
Increase the timeout to 60 seconds to make the test less flaky on slow
machines. Similarly for the other tests in the file that have a timeout
for view build, increase the timeout to 60 seconds to be consistent and
safer.
Fixes SCYLLADB-769
Closesscylladb/scylladb#28817
We have observed a bug that caused Scylla to crash due to metrics double
registration. This bug is really difficult to reproduce and was seen
only once in the wild. We think that it may be caused by a request
in-flight keeping a reference to the stats object, making it not
deregister when the index is dropped, which casues a double registration
when we recreate the index, however we are not 100% sure.
This patch makes it so the metrics always get deregistered when we drop the index, which should fix the double registration bug.
Fixes: #27252Closesscylladb/scylladb#28655
Audit tests have been slow. They rely on wait_for function. This function first sleeps for the duration of the time step specified, and then calls the given function. The audit tests need 0.02-0.03 seconds for the given function, but the operation lasts around 1.02-1.03 seconds, since step is 1 second.
This patch modifies wait_for dtest function so it first executes the given function, and afterwards calls time.sleep(step). This reduces time needed for the given function from 1.03 to 0.03 seconds.
Total audit tests suite speedup is 3x. On the developer machine the time is reduced from 13+ minutes to 4 minutes.
This patch also improves performance of some alternator tests that use the same wait_for dtest function.
`wait_for` in dtest framework has default time step reduced to make the environment more responsive and test execution faster.
Refs SCYLLADB-573
This is a performance improvement of testing framework. No need to backport.
Closesscylladb/scylladb#28590
* github.com:scylladb/scylladb:
dtest: shorten default sleep step in wait_for
dtest: wait_for speedup
Add a test that exercises to_metrics_histogram when Min is smaller than
Precision.
The test verifies duplicate integer bounds are collapsed,
counts remain cumulative, and native histogram metadata is still present
with the expected schema and min id.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
to_metrics_histogram now collapses duplicate integer bucket bounds
caused by Min less than Precision scaling while always keeping native
histogram metadata.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Add three test cases to verify the hybrid linear/exponential bucketing:
- test_histogram_min_1_bucket_limits: Validates bucket lower limits
- test_histogram_min_1_basic: Tests value insertion and bucket distribution
- test_histogram_min_1_statistics: Tests min(), max(), quantile(), and mean()
- test_histogram_min_2_precision_4: Test min == 2 and precision 4.
These tests cover the new Min<Precision mode with Precision=4, verifying both the
linear range and exponential range.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
approx_exponential_histogram is a pseudo exponential histogram implementation
that can insert and retrieve values into and from buckets in O 1 time.
The implementation uses power of two ranges and splits them linearly into
buckets. The number of buckets per power of two range is called Precision.
The original implementation aimed at covering large value ranges had a
limitation. The histogram Min value had to be greater than or equal to
Precision. As a result code that needs histograms for small integer values
could not use this implementation efficiently.
This change addresses that gap by handling the case where Min is less than
Precision. For Min smaller than Precision the value is scaled by a power of
two factor during indexing so the existing exponential math can be reused
without runtime branching. Bucket limits are scaled back to the original
units which can lead to repeated bucket limits in the Min to Precision
range for integer values.
Example with Min 2 and Precision 4
Buckets 2 2 3 3 4 5 6 7 8 10 12 14 and so on
Implementation details
Introduce SHIFT based on log2 Precision minus log2 Min when positive
Scale Min and Max by SHIFT for all exponential calculations
Compute NUM_BUCKETS using the standard log2 Max over Min formula
Use scaled value in find_bucket_index to avoid fractional bucket steps
Return bucket limits by scaling back to original units
Constraint relaxed from Min greater or equal to Precision to allow any Min
less than Max still power of two
This change maintains backward compatibility with existing histograms
while enabling efficient tracking of small integer values.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Add missing tests for per-table Alternator latency metrics to ensure recent
per-table latency accounting is actually validated.
Changes in this patch:
Refactor latency assertion helper into check_sets_latency_by_metric(),
parameterized by metric name.
Keep existing behavior by implementing check_sets_latency() as a wrapper
over scylla_alternator_op_latency.
Add test_item_latency_per_table() to verify
scylla_alternator_table_op_latency_count increases for:
PutItem, GetItem, DeleteItem, UpdateItem, BatchWriteItem,
and BatchGetItem.
This closes a test gap where only global latency metrics were checked, while
per-table latency metrics were not covered.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Batch operations were updating only global latency histograms, which left
table-level latency metrics incomplete.
This change computes request duration once at the end of each operation and
reuses it to update both global and per-table latency stats:
Latencies are stored per table used,
This aligns batch read/write metric behavior with other operations and improves
per-table observability.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This workflow calls the reusable backport-with-jira workflow from
scylladb/github-automation to enable automatic backport PR creation with
Jira sub-issue integration.
The workflow triggers on:
- Push to master/next-*/branch-* branches (for promotion events)
- PR labeled with backport/X.X pattern (for manual backport requests)
- PR closed/merged on version branches (for chain backport processing)
Features enabled by calling the shared workflow:
- Creates Jira sub-issues under the main issue for each backport version
- Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2)
- Cherry-picks from previous version branch to avoid repeated conflicts
- On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR
Closesscylladb/scylladb#28804
While adding the new syntax "TTL" to CREATE TABLE, I noticed that the
parser actually allows a column to be defined as "STATIC PRIMARY KEY".
So I add here a small test to verify that this is not really allowed:
The syntax "c int STATIC PRIMARY KEY" is accepted, but then rejected
by a later check. The syntax "c int PRIMARY KEY STATIC" is rejected
as a syntax error.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add user-facing documentation for the new CQL per-row TTL feature,
in docs/cql/cql-extensions.md.
Also mention (and link) the new alternative TTL feature in a few
relevant documents about the old (per-write) TTL, about CDC,
and about the CREATE TABLE and ALTER TABLE commands.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patch added single-node functional tests (in test/cqlpy)
for everything which was possible to test on a single node. In this
patch we add four tests that we couldn't test on a single node, using
the test/cluster test framework:
1. Test that the TTL expiration work - both the scanning threads and
the actual deletion work on all nodes - happens on the "streaming"
scheduling group.
2. Test that even if one of the cluster's nodes is down, still all
the items get expired - another node "takes over" the dead node's
work.
3. Test that rolling upgrade works as designed for the CQL per-row TTL
feature: Before every single node in the cluster is upgraded to
support this feature, a TTL column cannot be enabled on a table.
And as soon as the last node of the cluster is upgraded, the TTL
feature begins to work completely (you don't need to reboot all
the nodes again).
4. Test that expiration works correctly on a multi-DC setup. The test
doesn't check the efficiency of this process - i.e., that today each
DC scans part of the data, reading with LOCAL_QUORUM, and writing
the deletions across the entire cluster. Rather, the test only
verifies the correctness - that expired rows do get deleted -
for the usual case the data across the DCs is consistent.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch contains 27 functional tests (in the test/cqlpy framework)
for the new CQL per-row TTL feature. The tests cover the TTL column
configuration statements (CREATE TABLE, ALTER TABLE) as well as the
actual item expiration or non-expiration depending on the value of
the expiration-time column - and also CDC events generated on expiration
and the metrics generated by the expiration process.
These tests were written together with the code, as in "test-driven
development", so they aim to cover every corner case considered during
the development, and they reproduce every bug and misstep seen during
the development process. As a result, they hopefully achieve very high
code coverage - but since we don't have a working code-coverage tool,
I can't report any specific code coverage numbers.
These tests check everything which we can check on single-node cluster.
The next patch will add additional multi-node tests for things we can't
check here with a single node - such as the scheduling group used by the
distributed work, the effect of dead nodes on the TTL functionality, and
the process of rolling upgrade.
The tests in this patch do NOT try to stress the background expiration
scanning threads, or to check how they handle topology changes, large
amounts of data or clusters spanning multiple DCs. These tests also don't
test the performance impact of these scanning threads. Because the
expiration scanning thread is identical to the one already used by
Alternator TTL, we assume that many of these aspects were already tested
for Alternator TTL and did not change when the same implementation is
used for the new CQL feature.
All new tests pass on ScyllaDB. Because the per-row TTL feature is
a new ScyllaDB feature that does not exist on Cassandra, all these
tests are skipped on Cassandra.
Because some of these tests involve waiting for expiration, they can't
be very quick. Still, because we set alternator_ttl_period_in_seconds
to 0.5 seconds in the test framework, all 27 tests running sequentially
finish in roughly 6 seconds total, which we consider acceptable.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In test/alternator/run we set alternator_ttl_period_in_seconds to a very
low number (0.5 seconds) to allow TTL tests to expire items very quickly
and finish quickly.
Until now, we didn't need to do this for CQL tests, because they weren't
using this Alternator-only feature. Now that CQL uses the same expiration
feature with its original configuration parameter, we need to set it in
CQL tests too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If "ALTER TABLE tab DROP x" is done to delete column x, and column x was
the designated TTL column, then the per-row TTL feature should be disabled
on this table.
If we don't do this, the expiration scanner will continue to scan the
table trying to read the dropped column - which will be wasteful or
worse.
A test for this case is also included in test/cqlpy/test_ttl_row.py
in a later patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patch added the ability in CREATE TABLE to designate one
of the regular columns as a "TTL column", to be used by the per-row TTL
feature (Refs #13000). In this patch we add to ALTER TABLE the ability
to enable per-row TTL on an existing table with a given column as the
TTL column:
ALTER TABLE tab TTL colname
and also the ability to disable per-row TTL with
ALTER TABLE tab TTL NULL
as in CREATE TABLE, the designated TTL column must be a regular column
(it can't be a primary key column or a static column), and must have
the types timestamp, bigint or int.
You can't enable per-row TTL if already enabled, or disable it if
already disabled. To change the TTL column on an existing table,
you must first disable TTL, and then re-enable it with the new column.
A large collection of functional tests (in test/cqlpy), for every detail
of this patch, will come in a later patch in this series.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch enables the per-row TTL feature in CQL (Refs #13000).
This patch allows the user to create a new table with one of its columns
designated as the TTL column with a syntax like:
CREATE TABLE tab (
id int PRIMARY KEY,
t text,
expiration timestamp TTL
);
The column marked "TTL" must have the "timestamp", "bigint" or "int"
types (the choice of these types was explained in the previous patch),
and there can only be one such column. We decided not to allow a column
to be both a primary key column and a TTL column - although it would
have worked (it's supported in Alternator), I considered this non-useful
and confusing, and decided not to allow it in CQL. A TTL column also
can't be a static column.
We save the information of which column is the TTL column in a tag which
is read by the "expiration service" - originally a part of Alternator's
TTL implementation. After the previous patch, the expiration service is
running and knows how to understand CQL tables, so the CQL per-row TTL
feature will start to work.
This patch also implements DESC TABLE, printing the word "TTL" in the
right place of the output.
This patch doesn't yet implement ALTER TABLE that should allow enabling
or disabling the TTL column setting on an existing table - we'll do that
in the next patch.
A large collection of functional tests (in test/cqlpy), for every detail
of this feature will be added in a later patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The Alternator TTL feature uses an "expiration service", a background
thread on each shard which periodically scans for expired items and
deletes them. When writing the expiration service, we already
anticipated that the day will come that we'll want to use it for CQL
too. Well, now that we want to use it for CQL, we only need to make
two changes:
1. Before this patch, the expiration service was only started if
Alternator was enabled. Now we need to start it unconditionally,
as both Alternator and CQL will need to use it.
The performance impact of the new background threads, when not
needed, should be minimal: These threads will wake up every
alternator_ttl_period_in_seconds (by default - once a day) and
just check if any table has per-row TTL enabled, and if not, do
nothing.
2. Before this patch, the expiration-time column had to be of type
"decimal" - a variable-precision floating-point type. This made
sense in Alternator - where all numbers are of this type, but CQL
offers better and more efficient types for this purpose. In this
patch we add support for two additional types for the expiration
time column: The "timestamp" type (which uses millisecond precision,
which our implementation truncates to whole seconds) and for the
"bigint" type storing a number of seconds since the UNIX epoch.
We also support the smaller "int" type for compatibility with
existing data, but it is not recommended because a signed
32-bit integer counting time from 1970 will break in 2038.
After this patch, the expiration service supports CQL tables, but there
is nothing yet that can enable it on CQL tables - i.e., nothing that
sets the appropriate tag on the table to tell the expiration service
which column is the expiration-time column. We'll add new syntax to
do this in the next patch.
At the moment, we leave the expiration service implementation in
its existing location - alternator/ttl.cc. This is despite the fact
that we now start it and use it also for CQL. For better modularity,
we should probably later move the expiration service implementation
to a separate module (directory).
Similarly, the expiration service's period is still configured via
alternator_ttl_period_in_seconds, which is now a misnomer because it
also affects CQL. Later we can rename this configuration parameter,
or alternatively, consider different scan periods for different tables
and table types, and have separate configuration for Alternator TTL
and CQL per-row TTL.
The metrics kept by the expiration service are the same metrics existing
for Alternator TTL, and fortunately do not have the name "alternator" in
their name:
* scylla_expiration_scan_passes
* scylla_expiration_scan_table
* scylla_expiration_items_deleted
* scylla_expiration_secondary_ranges_scanned
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
TTL_TAG_KEY stores the name of the tag in which we store the name of the
table's expiration-time column, for Alternator's TTL feature.
We already need this name in two source files, and soon we'll need it
in more files - as we want to use the same implementation also for for
a new per-row TTL feature in CQL. So it's time to move the declaration
of this variable to a new header file - alternator/ttl_tag.hh.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Every node that supports the Alternator TTL feature should start its
background expiration-checking thread, *without* checking if other
nodes support this feature. This patch removes the unnecessary check.
Indeed, until all other nodes enable this feature, the background thread
will have nothing to do. but when finally all nodes have this feature -
we need this thread to already be on - without requiring another reboot
of all nodes to start this thread.
In practice, this change won't change anything on modern installations
because this feature is already three years old and always enabled on
modern clusters. But I don't want to repeat the same mistake for the new
CQL per-row TTL feature, so better fix it in Alternator too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a new cluster feature "CQL_ROW_TTL", for the new CQL
per-row TTL feature.
With this patch, this node reports supporting this feature, but the CQL
per-row TTL feature can only be used once all the nodes in the cluster
supports the feature. In other words, user requests to enable per-row TTL
on a table should check this feature flag (on the whole cluster) before
proceeding.
This is needed because the implementation of the per-row-TTL expiration
requires the cooperation of all nodes to participate in scanning for
expired items, so the feature can't be trusted until all nodes participate
in it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Since commit 2dedb5ea75, the Alternator
TTL feature is no longer experimental. It is still a "cluster feature"
meaning it cannot be used on a partially-upgraded cluster until the entire
cluster supports this feature.
The error message we printed when the cluster doesn't support this
feature was outdated, referring to the no-longer-existing experimental
feature. So this patch fixes the error message.
Since this feature is already three years old, nobody is likely to ever
see this error message (it can be seen only by someone upgrading an
even older cluster, during the rolling upgrade), but better not have
wrong error messages in the code, even if it's not seen by users.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Replace all calls to SCYLLA_ASSSERT() in Alternator by the better and
safer throwing_assert() introduced in the previous patch.
As a result of this patch, if one of the call sites for these asserts
is buggy and ever fails, only the involved operation will be killed
by an exception, instead of crashing the whole server - and often the
entire cluster (as the same buggy request reaches all nodes and
crashes them all).
Additionally, this patch replaces a few existing uses in Alternator
of on_internal_error() with a non-interesting message with a
more-or-less equivalent, but shorter, throwing_assert(). The idea is
to convert the verbose idiom:
if (!condition) {
on_internal_error(logger, "some error message")
}
With the shorter
throwing_assert(condition)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch introduces throwing_assert(cond), a better and safer
replacement for assert(cond) or SCYLLA_ASSERT(cond). It aims to
eventually replace all assertions in Scylla and provide a real solution
to issue #7871 ("exorcise assertions from Scylla").
throwing_assert() is based on the existing on_internal_error() and
inherits all its benefits, but brings with it the *convenience* of
assert() and SCYLLA_ASSERT(): No need for a separate if(), new strings,
etc. For example, you can do write just one line of throwing_assert():
throwing_assert(p != nullptr);
Instead of much more verbose on_internal_error:
if (p == nullptr) {
utils::on_internal_error("assertion failed: p != nullptr")
}
Like assert() and SCYLLA_ASSERT(), in our tests throwing_assert() dumps
core on failure. But its advantage over the other assertion functions
like becomes clear in production:
* assert() is compiled-out in release builds. This means that the
condition is not checked, and the code after the failed condition
continues to run normally, potentially to disasterous consequences.
In contrast, throwing_assert() continues to check the condition even in
release builds, and if the condition is false it throws an exception.
This ensures that the code following the condition doesn't run.
* SCYLLA_ASSERT() in release builds checks the condition and *crashes*
Scylla if the condition is not met.
In contrast, throwing_assert() doesn't crash, but throws an exception.
This means that the specific operation that encountered the error
is aborted, instead of the entire server. It often also means that
the user of this operation will see this error somehow and know
which operation failed - instead of encountering a mysterious
server (or even whole-cluster crash) without any indication which
operation caused it.
Another benefit of throwing_assert() is that it logs the error message
(and also a backtrace!) to Scylla's usual logging mechanisms - not to
stderr like assert and SCYLLA_ASSERT write, where users sometimes can't
see what is written.
Fixes#28308.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add a new docs/dev document for the strongly consistent tables feature.
For now, it only contains information about the Raft metadata persistence,
but it should be updated as more of the strong-consistency components are
added.
In this patch we add various tests for checking how strongly consistent
tables work while allowing their tablets to reside on non-0 shards and
while using the new persistent storage for their raft metadata.
The tests verify that:
- strongly consistent tables' tablets can be allocated on different shards
and we can write/read from them
- the raft metadata is persistent across restarts even with disruptions
- the sharder correctly routes metadata queries to specified shards
- we can correctly perform multi-shard reads from the metadata tables
- we can read using just the group_id (without shard) using ALLOW FILTERING
For the tests we add logging to the sharder and partitioner and we add
some extra logs for observability.
In this patch we allow strongly consistent tables to have tablets on
shards different than 0.
For that, we remove the checks for shard 0 for the non-group0 raft
groups, and we allow the tablet allocator to place tablets of
strongly consistent tables on shards different than 0.
We also start using the new storage (raft::persistence) for strongly
consistent tables, added in the preceding commits.
Most functions of the new storage for raft groups for strongly
consistent tables are the same as for the system raft table
storage, so we reuse the tests for them to test the new storage.
We add additional tests for checking the new raft groups partitioner
and sharder, and for verifying that writes using storages for different
shards do not affect the data read on different shards.
We also add a test for checking the snapshot_descriptor present after
the storage bootstrap - for both system and strongly consistent storages
we check that the storage contains the initial descriptor.
Add raft_groups_storage, a raft::persistence implementation for
strongly consistent tablet groups.
Currently, it's almost an exact copy of the raft_sys_table_storage that
uses the new raft tables for strongly consistent tables (raft_groups,
raft_groups_snapshots, raft_groups_snapshot_config) which have
a (shard, group_id) partition key.
In the future, the mutation, term and commit_idx data will be stored
differently for for strongly consistent tables than for group0, which
will differentiate this class from the original raft_sys_table_storage.
The storage is created for each raft group server and it takes a shard
parameter at construction time to ensure all queries target the correct
partition (and thus shard).
Add three new system tables for storing raft state for strongly
consistent tablets, corresponding to the tables for group0:
- system.raft_groups: Stores the raft log, term/vote, snapshot_id,
and commit_idx for each tablet's raft group.
- system.raft_groups_snapshots: Stores snapshot descriptors
(index, term) for each group.
- system.raft_groups_snapshot_config: Stores the raft configuration
(current and previous voters) for each snapshot.
These tables use a (shard, group_id) composite partition key with
the newly added raft_groups_partitioner and raft_groups_sharder, ensuring
data is co-located with the tablet replica that owns the raft group.
The tables are only created when the STRONGLY_CONSISTENT_TABLES experimental
feature is enabled.
Add a custom partitioner and sharder that will be used for Raft tables
for strongly consistent tables. These tables will have partition keys
of the form (shard, group_id) and the partitioner creates tokens that
encode the target shard in the high 16 bits.
Token layout:
[shard: 16 bits][partition key hash: 48 bits]
This encoding guarantees that raft group data will be located on the
same shard as the tablet replica corresponding to that raft group as long
we use the tablet replica's shard as the value in the partition key.
Storing the shard directly in the partition key avoids additional lookups
for request routing to the incoming new raft tables.
For even more simplicity, we avoid biasing between uint64_t and int64_t
by limiting the acceptable shard ids up to 32767 (leaving the top bit 0),
which results in the same value of the token when interpreting either as
uint64_t or int64_t.
The sharder decodes the shard by extracting the high bits, which is
shard-count independent. This allows the partition key:shard mapping
to remain the same even during smp changes (only increases are allowed,
the same limitation as for tablets).
This series closes a gap in how CQL request and response sizes are reported.
Previously, request_size and response_size were tracked as simple counters,
providing only cumulative totals per shard. This made it difficult to understand
the distribution of message sizes and identify potential issues with very large
or very small requests.
After this series, the CQL transport reports detailed histogram metrics showing
the distribution of request and response sizes. These histograms are tracked
per-instance, per-type (per ops), and per-scheduling-group, providing
much better visibility into CQL traffic patterns.
The histograms are collected for QUERY, EXECUTE, and BATCH operations, which are
the primary data path operations where message size distribution is most relevant.
This data can help identify:
- Clients sending unexpectedly large requests
- Operations with oversized result sets
- Scheduling group differences in traffic patterns
To support this, the series extends the approx_exponential_histogram template to
handle accurate sum, adds a bytes_histogram type alias optimized for byte-range measurements (1KB to 1GB).
The existing per-shard counter metrics are maintained for backward compatibility.
Metrics example:
```
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
**The field sees it as an important issue**
Fixes#14850Closesscylladb/scylladb#28419
* github.com:scylladb/scylladb:
test/boost/estimated_histogram_test.cc: Switch to real Sum
transport/server: to bytes_histogram
approx_exponential_histogram: Add sum() method for accurate value tracking
utils/estimated_histogram.hh: Add bytes_histogram
When running unit tests, there's a visible ~1-second sleep
when gossip exits the failure detector loop.
Improve this by adding a condition variable for exiting the loop
and signaling it when any of the exit conditions are satisfied:
the abort_source is pulled, the gossiper is shut down, or the sleep
is complete. We can't just use the abort_source because gossip can be
shut down independently of the rest of the system.
To see the improvement, I ran cql_query_test in dev mode:
Before:
$ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2 > /dev/null 2>&1
real 2m26.904s
user 0m24.307s
sys 0m13.402s
After:
$ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2 > /dev/null 2>&1
real 0m26.579s
user 0m24.671s
sys 0m13.636s
Two minutes of real-time saved.
Real-life improvement in test.py will be lower, because of the overhead
of launching pytest for each test case.
Closesscylladb/scylladb#28649
When a permit is preemptively aborted, store the corresponding
exception in permit's member: `reader_permit::impl::_ex`.
This makes preemptively-aborted permits consistently report aborted()
and prevents them from being treated as eligible for inactive
registration in `register_inactive_read()`, avoiding assertion
failures on unexpected permit state.
Closesscylladb/scylladb#28591
scylladb/scylla container image doesn't include systemctl binary, while it
is used by perftune.py script shipped within the same image.
Scylla Operator runs this script to tune Scylla nodes/containers,
expecting its all dependencies to be available in the container's PATH.
Without systemctl, the script fails on systems that run irqbalance
(e.g., on EKS nodes) as the script tries to reconfigure irqbalance and
restart it via systemctl afterwards.
Fixes: scylladb/scylla-operator#3080Closesscylladb/scylladb#28567
Refs: SCYLLADB-193
Adds a "snapshot_table" topology operation and associated data structure/table columns to support dispatching a snapshot operation as a topo coordinator op.
Logic is similar, and thus broken out and semi-shared with, truncation.
Also adds optional tablet metadata to manifest, listing all tablets present in a given snapshot, as well as
tablet sstable ownership, repair status, and token ranges.
As per description in SCYLLADB-193, the alternative snapshot mechanism is in
a separate namespace under 'tablets', which while dubious is the desired destination.
The API is accessed via `nodetool cluster snapshot`, which more or less mirrors `nodetool snapshot`, but using topo op.
TTL is added to message propagation as a separate patch here, since it is not (yet) used from API (or nodetool).
Requires a syntax for both API and command line.
Closesscylladb/scylladb#28525
* github.com:scylladb/scylladb:
topology::snapshot: Add expiry (ttl) to RPC/topo op
test_snapshot_with_tablets: Extend test to check manifest content
table::manifest: Add tablet info to manifest.json
test::test_snapshot_with_tablets: Add small test for topo coordinated snapshot
scylla-nodetool: Add "cluster snapshot" command
api::storage_service: Add tablets/snapshots command for cluster level snapshot
db::snapshot-ctl: Add method to do snapshot using topo coordinator
storage_proxy: Add snapshot_keyspace method
topology_coordinator: Add handler for snapshot_tables
storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS
messaging_service: Add SNAPSHOT_WITH_TABLETS verb
feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature
topology_mutation: Add setter for snapshot part of row
system_keyspace::topology_requests_entry: Add snapshot info to table
topology_state_machine: Add snapshot_tables operation
topology_coordinator: Break out logic from handle_truncate_table
storage_proxy: Break out logic from request_truncate_with_tablets
test/object_store: Remove create_ks_and_cf() helper
test/object_store: Replace create_ks_and_cf() usage with standard methods
test/object_store: Shift indentation right for test cases
Migrate cluster tests directory to be handled by pytest. This is the next step in process of unification of the tests and migration to the pytest.
With this PR cluster test will be executed with the full path to the file instead of `suite/test` paradigm.
Backport is not needed because it framework enhancement.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-46Closesscylladb/scylladb#27618
* github.com:scylladb/scylladb:
test.py: remove setsid from the framework
test.py: rename suite.yaml to test_config.yaml
test.py: add cluster tests to be executed by pytest
test.py: add random seed for topology tests reproducibility
test.py: add explicit default values to pytest options
test.py: replace SCYLLA env var with build_mode fixture
Recently we suffered a regression on how Alternator TTL behaves when a node goes down when tablets are used.
Usually, expiration of data in a particular tablet are handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expiration until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same the expiration of that tablet will never be done.
It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.
Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.
So this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix):
1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica()
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason).
Fixes SCYLLADB-777.
Closesscylladb/scylladb#28771
* github.com:scylladb/scylladb:
test: add unit test for tablet_map::get_secondary_replica()
test, alternator: add test for TTL expiration with a node down
locator: fix get_secondary_replica() to match get_primary_replica()
The path removes the code protected by !raft_topology_change_enabled()
since it is no longer reachable. Drop test_lwt_for_tablets_is_not_supported_without_raft
since not raft mode is no longer supported.
It is not enough to go over all column types and register the UDTs. UDTs
might be nested in other types, like collections. One has to do a
traversal of the type tree and register every UDT on the way. That is
what this patch does.
This function is used by the query and write operations, which should
now both work with nested UDTs.
Add a test which fails before and passes after this patch.
Audit tests have been slow. They rely on wait_for function.
This function first sleeps for the duration of the time step
specified, and then calls the given function. The audit tests
need 0.02-0.03 seconds for the given function, but the operation
lasts around 1.02-1.03 seconds, since step is 1 second.
This patch modifies wait_for dtest function so it first executes
the given function, and afterwards calls time.sleep(step). This
reduces time needed for the given function from 1.03 to 0.03 seconds.
Total audit tests suite speedup is 3x. On the developer machine
the time is reduced from 13+ minutes to 4 minutes.
This patch also improves performance of some alternator tests that
use the same wait_for dtest function.
Refs SCYLLADB-573
The current way of checking the boost's stdout can have a race
condition when pytest will try to read the file before it was really
flushed. So this PR should eliminate this possibility.
Closesscylladb/scylladb#28783
PR #28703 was merged into master but not with the latest version of the
changes. This patch is an incremental fix for this.
Currently, the elements of the tablet_sizes_per_shard vector are
incremented in separate shards. This is prone to false sharing of cache
lines, and ping-pong of memory, which leads to reduced performance.
In this patch, in order to avoid cache line collisions while updating
the sum of tablet sizes per shard, we align the counter to 64 bytes.
Fixes: SCYLLADB-678
Closesscylladb/scylladb#28757
This pr adds await for each one of the tasks to wait for the MV schema to be added successfully
and then to start the server shutdown
With this change we dont need will not get the shutdown races.
Closesscylladb/scylladb#28774
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.
Fixes SCYLLADB-778
Closesscylladb/scylladb#28781
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.
`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.
Fixes: scylladb/scylladb#28204Closesscylladb/scylladb#28746
For a while, we have seen coroutine related tests (those that use the
coroutine_task fixture) fail occasionally, because no coroutine frame is
found. Multiple attempts were made to make this problem self-diagnosing
and dump enough information to be able to debug this post-mortem. To no
avail so far. A lot of time was invested into this this benign issue:
See the long discussion at https://github.com/scylladb/scylladb/issues/22501.
It is not known if the bug is in gdb, or the gdb script trying to find
the coroutine frame. In any case, both are only used for debugging, so
we can tolerate occasional failures -- we are forced to do so when
working with gdb anyway.
Instead of piling on more effor there, just skip these tests when the
problem occurs. This solves the CI flakyness.
Fixes: #22501Closesscylladb/scylladb#28745
Add --continue-after-error true to perf-cql-raw and perf-alternator
tests, and --stop-on-error false to perf-simple-query test, so that
tests don't abort on the first error.
Reason for this is that tests are flaky with example failure:
Perf test failed: std::runtime_error (server returned ERROR to EXECUTE)
When CPU is starved on CI we can return timeouts and/or other errors.
The change should make tests more robust on the expense of smaller test
scope. But those tests were written mostly to test startup sequence
as it differs from Scylla's starup.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-759Closesscylladb/scylladb#28767
With previous architecture, scylla servers were handled by test.py and
if pytest fails, test.py was responsible for stopping scylla processes.
Now with only pytest handling, there is no such mechanism, that's why
I'm removing the setsid, so when the parent pytest process closes it
will automatically close all child including any started process during
testing. This will allow to not leave any scylla process in case pytest
was killed.
cluster tests are now executed by pytest also
Run pytest in an executor to avoid blocking the event loop, allowing
resource monitoring to run concurrently
Logic for passing the arguments to pytest changed due to the fact that
almost all directories now executed by pytest and directories that are
not handled excluded in pytest.ini
Modify the threads count for debug mode, because with the default
logic CI agents die
Set TOPOLOGY_RANDOM_FAILURES_TEST_SHUFFLE_SEED environment variable
during pytest configuration to enable to ensure that all xdist workers will
discover the same scope of the tests. This is a known limitation of the
xdist plugi where the discovered tests should be consistenta across
master and workers.
Add explicit default values to pytest command line options to prevent
issues when running tests with pytest's parallel execution where
options are not present on upper conftest, so they're just not set at all.
Replace direct usage of SCYLLA environment variable with the build_mode
pytest fixture and path_to helper function. This makes tests more
flexible and consistent with the test framework. Also this allows to use
tests with xdist, where environment variable can be left in the master
process and will not be set in the workers
Add using the fixture to get the scylla binary from the suite, this will
align with getting relocatable Scylla exe.
Lua doesn't have separate integer and floating point numbers,
so we check if a number can fit in an integer and if so convert
it to an integer.
The conversion routine invokes undefined behavior (and even
acknowledges it!). More recent compilers changed their behavior
when casting infinities, breaking test_user_function_double_return
which tests this conversion.
Fix by tightening the conversion to not invoke undefined behavior.
Closesscylladb/scylladb#28503
This patchset:
- ensures the loading semaphore is acquired in cross-shard callbacks
- fixes iterator invalidation problem when reloading all cached permissions
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-780
Backport: no, affected code not released yet
Closesscylladb/scylladb#28766
* github.com:scylladb/scylladb:
auth: cache: fix permissions iterator invalidation in reload_all_permissions
auth/cache: acquire _loading_sem in cross-shard callbacks
The hostent::addr_list is deprecated in favor of address_entry::addr
field that contains the very same addresses.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28565
- add an overload to the rest http client to accept retry strategy instance as an argument
- remove hand rolled error handling from object storage client and replace with common machinery that supports handling and retrying when appropriate
No backport neede since it is only refactoring
Closesscylladb/scylladb#28161
* github.com:scylladb/scylladb:
object_storage: add retryable machinery to object storage
rest_client: add `simple_send` overload
To handle RPC from other nodes, we need to be able to redirect the
requests for each raft group to the shard that owns it. We need to
be able to do the redirection on all shards, so to achieve that, on
all shards we need to store the information about which shard is
occupied by each Raft group server.
For that we add a group_id -> shard mapping to the raft_group_registry.
The mapping is filled out when starting raft servers, it's emptied
when we abort raft servers. We use it when registering RPC verb handlers,
so that regardless of the shard handling the RPC, the work on the raft
group can be performed on the corresponding shard.
This patch adds a unit test for tablet_map::get_secondary_replica().
It was never officially defined how the "primary" and "secondary"
replicas were chosen, and their implementation changed over time,
but the one invariant that this test verifies is that the secondary
replica and the primary replica must be a different node.
This test reproduces issue SCYLLADB-777, where we discovered that
the get_primary_replica() changed without a corresponding change to
get_primary_replica(). So before the previous patch, this test failed,
and after the previous patch - it passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We have many single-node functional tests for Alternator TTL in
test/alternator/test_ttl.py. This patch adds a multi-node test in
test/cluster/test_alternator.py. The new test verifies that:
1. Even though Alternator TTL splits the work of scanning and expiring
items between nodes, all the items get correctly expired.
2. When one node is down, all the items still expire because the
"secondary" owner of each token range takes over expiring the
items in this range while the "primary" owner is down.
This new test is actually a port of a test we already had in dtest
(alternator_ttl_tests.py::test_multinode_expiration). This port is
faster and smaller then the original (fewer nodes, fewer rows), but it
still found a regression (SCYLLADB-777) that dtest missed - the new test
failed when running with tablets and in release build mode.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The function tablet_map::get_secondary_replica() is used by Alternator
TTL to choose a node different from get_primary_replica(). Unfortunately,
recently (commits 817fdad and d88037d) the implementation of the latter
function changed, without changing the former. So this patch changes
the former to match.
The next two patches will have two tests that fail before this patch,
and pass with it:
1. A unit test that checks that get_secondary_replica() returns a
different node than get_primary_replica().
2. An Alternator TTL test that checks that when a node is down,
expirations still happen because the secondary replica takes over
the primary replica's work.
Fixes SCYLLADB-777
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Move the storage test suite from test/storage/ to test/cluster/storage/
to consolidate related cluster-based tests.This removes the standalone
test/storage/suite.yaml as the tests will use the cluster's test configuration.
Initially these tests were in cluster, but to use unshare at first
iteration they were moved outside. Now they are using another way to
handle volumes without unshare, they should be in cluster
Closesscylladb/scylladb#28634
They were running in recovery to reuse existing system tables without
group0 id, but since we want to remove recovery mode we need to
re-generate the tables.
This patch removes ability of a cluster to upgrade from not having
group0 to having one. This ability is used in gossiper based recovery
procedure that is deprecated and removed in this version. Also remove
tests that uses the procedure.
We are going to drop legacy topology mode (gossiper mode) and no longer
allow ScyllaDB to start in this mode. This patch refuses to boot if a
cluster is not in raft topology mode yet. It may happen if a node of a
cluster that is not yet in a raft topology is upgraded to a newer
version. If this happens the node has to be downgraded. Raft topology
has to be enabled on a cluster and then the node can be upgraded again.
We are going to drop legacy topology mode (gossiper mode) and no longer
allow ScyllaDB to start in this mode. This patch disallows a node to
join a cluster that is still in legacy mode. A cluster needs to enable
raft mode first.
The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour.
To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results.
Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.
Fixes: SCYLLADB-456
Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there.
Closesscylladb/scylladb#28609
* github.com:scylladb/scylladb:
test/vector_search: add reproducer for rescoring with zero vectors
vector_search: return NaN for similarity_cosine with all-zero vectors
This patch series moves `test/cluster/dtest/guardrails_test.py`
to `test/cluster/test_guardrails.py`, and migrates it from `cluster/dtest/`
to `cluster/` framework.
There are two motivations for moving the test:
- Execution time reduction (from 12s to 9s in 'dev' in my env)
- Facilitate adding new tests to the `guardrails_test.py` file
No backport, `dtest/guardrails_test.py` is only on master
Closesscylladb/scylladb#28737
* github.com:scylladb/scylladb:
test: move dtest/guardrails_test.py to test_guardrails.py
test: prepare guardrails_test.py to be moved to test/cluster/
The inner loops in reload_all_permissions iterate role's permissions
and _anonymous_permissions maps across yield points. Concurrent
load_permissions calls (which don't hold _loading_sem) can emplace
into those same maps during a yield, potentially triggering a rehash
that invalidates the active iterator.
We want to avoid adding semaphore acquire in load_permissions
because it's on a common path (get_permissions).
Fixing by snapshotting the keys into a vector before iterating with
yields, so no long-lived map iterator is held across suspension
points.
This series hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks and aligning update handling with the same async dispatch style used by create/drop.
Patch 1 refactors on_update_view to use a dedicated coroutine dispatcher (dispatch_update_view), keeping update logic serialized under the existing view-builder lock and consistent with the callback architecture already used for create/drop paths.
Patch 2 adds explicit callback lifetime coordination in view_builder:
- introduce a seastar::gate member
- acquire _ops_gate.hold() when launching detached create/update/drop dispatch futures
- keep the hold alive until each detached future resolves
- close the gate during view_builder::drain() so shutdown waits for in-flight callback work before final teardown
Together, these changes reduce shutdown race exposure in MV event handling while preserving existing behavior for normal operation.
Testing:
- pytest --test-py-init test/cluster/mv (47 passed, 7 skipped)
backport: not required started happening in master
fixes: SCYLLADB-687
Closesscylladb/scylladb#28648
* github.com:scylladb/scylladb:
db/view: gate detached view-builder callbacks during shutdown
db:view: refactor on_update_view to use coroutine dispatcher
Takes set of ks->tables tuples and issues snapshot for each.
If feature is enabled, keyspace is non-local, and uses tablets,
will issue topo coordinator call across cluster.
Keyspaces not fitting the above will just go to "normal" (node
local) snapshot.
Makes request_truncate_with_tablets use a parameterized helper,
because eventually we will want to use almost identical logic
for other ops, like snapshot.
To create a keyspace theres new_test_keyspace helper
Table is created with a single cql.run_async with explicit schema
Dataset is populated with a single parallel INSERT as well
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is preparational patch. Next will need to replace
foo()
bar()
with
with something() as s:
foo()
bar()
Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
distribute_role() modifies _roles on non-zero shards via
invoke_on_others() without holding _loading_sem. Similarly, load_all()'s
invoke_on_others() callback calls prune_all() without the semaphore.
When these run concurrently with reload_all_permissions(), which
iterates _roles across yield points, an insertion can trigger
absl::flat_hash_map::resize(), freeing the backing storage while
an iterator still references it.
Fix by acquiring _loading_sem on the target shard in both
distribute_role()'s and load_all()'s invoke_on_others callbacks,
serializing all _roles mutations with coroutines that iterate
the map.
The test_restore_with_streaming_scopes want to run some loop body for
all (almost) combinations of scope, primary-replica-only and min tablet
count. For that three nested loops are used. Using itertools.product()
makes the code shorter, less indented and more explicit.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Two places are fixed
1. The call to create_dataset() is replaced with three "library"
methods. This makes it explicit which options and schema are used
for that. Eventually, the large and bulky create_dataset will be
removed
2. The part that restores data into a fresh new table calls some CQLs by
hand, and partially re-uses variables obtained from previous call to
create_dataset(). Using the same "library" methods to re-create an
empty table makes this part much simpler
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is preparational patch. Next will need to replace
foo()
bar()
with
with something() as s:
foo()
bar()
Effectively -- only add the `with something()` line. Not to shift the
whole file right together with that future change, do it here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The take_snapshot() helper runs these API sequentially for every server.
Running them with asyncio.gather() slightly reduces the wait-time thus
improving the total runtime.
Before:
CPU utilization: 2.1%
real 0m33,871s
user 0m22,500s
sys 0m13,207s
After:
CPU utilization: 2.4%
real 0m29,532s
user 0m22,351s
sys 0m12,890s
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test in question does _exactly_ what this helper does, but in a
longer way. The only difference is that it uses server_id as key to dict
with sstable components, but it's easy to tune.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
So that it's not in the middle of tests themselves, but near other
"helper" functions in the .py file
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
remove hand rolled error handling from object storage client
and replace with common machinery that supports exception
handling and retrying when appropriate
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.
Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.
This commit moves `guardrails_test.py`, prepared in the previous
commit of this patch series, to `test/cluster/test_guardrails.py`.
It also cleans up `suite.yaml`.
Disable `test/cluster/dtest/guardrails_test.py` in `suite.yaml` and
make it compatible with the `test/cluster/` framework. This will
allow moving this file from `test/cluster/dtest/` to `test/cluster/`
in the next commit of this patch series.
There are two motivations for moving the test:
- Execution time reduction (from 12s to 9s in 'dev' in my env)
- Facilitate adding new tests to the `guardrails_test.py` file
Make the actual table name a parameter and add logic to adapt to the
variant used.
Also add dump_to_log::yes to is_rows() invokation to help debuging
tests.
When set to true, the query results will be logged by the testlog logger
with debug level. A huge help when debugging failures around cql
assertions: seeing the actual query result is often enough to
immediately understand why the test failed.
system.batchlog will still have to be used while the cluster is
upgrading from an older version, which doesn't know v2 yet.
Re-add support for replaying v1 batchlogs. The switch to v2 will happen
after the BATCHLOG_V2 cluster feature is enabled.
The only external user -- storage_proxy -- only needs a minor
adjustment: switch between the table names. The rest is handled
transparently by the db/batchlog.hh interface and the batchlog_manager.
process_batch() currently returns stop_iteration::no from all control
paths. This is not useful. Return the all_replayed output param instead.
This requires making the batch() lambda a coroutine, but considering the
amount of work process_batch() does (send multiple writes), this should
be inconsequential.
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.
This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.
Fixes: VECTOR-528
In order to simplify troubleshooting connection problems, this patch
adds an extra warn log that prints the error for the vector search
request whenever it fails.
in s3::Range class start using s3 global constant for two reasons:
1) uniformity, no need to introduce semantically same constant in each class
2) the value was wrong
Detached migration callbacks (on_create_view, on_update_view, on_drop_view)
can race with view_builder::drain() teardown.
Add a lifetime gate to view_builder and wire callback launches through
_ops_gate.hold() so each detached dispatch future is tracked until it
completes (finally keeps the hold alive). During shutdown, drain()
now waits for all tracked callback work with _ops_gate.close().
This ensures drain does not proceed past callback lifetime while shutdown is in
progress, and ignores only gate_closed_exception at callback entry as the
expected shutdown path.
on_update_view() currently runs its serialized logic inline via with_semaphore()
from a detached callback path, while create/drop already use dedicated async
dispatchers.
Refactor update handling to follow the same pattern:
- add dispatch_update_view(sstring ks_name, sstring view_name)
- move update logic into that coroutine
- acquire the existing view-builder lock via get_or_adopt_view_builder_lock()
- keep existing behavior for missing base/view state
- keep background invocation semantics from on_update_view()
This aligns update/create/drop flow and keeps async lifecycle handling and a first step to fix shutdown issue.
Extract the commonly used `open_flags::wo | open_flags::create |
open_flags::exclusive` into a reusable constant
`sstable_write_open_flags` to reduce duplication.
Introduce new methods to write SSTable components while calculating
and returning their CRC32 checksums. This adds:
- make_digests_component_file_writer(): creates a crc32_digest_file_writer
for component writing with checksum tracking
- write_simple_with_digest() and do_write_simple_with_digest(): write
components and return the full checksum value
Introduce template parameter to checksummed file writer to support
digest-only calculation without storing chunk checksums.
This will be needed for future to calculate digest of other components.
The ANN vector queries with all-zero vectors are allowed even on vector
indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring
calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception
as all-zero vectors were not allowed matching Cassandra's behaviour.
To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass,
but return the NaN as the cosine similarity for zero vectors is mathematically incorrect.
We decided not to use arbitrary values contrary to USearch, for which the distance
(not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while
supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5,
which does not support mathematical reasoning why that would be more similar than
for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact,
therefore we return NaN and eliminate them from best results.
Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.
Fixes: SCYLLADB-456
Add a schema_builder::with_sharder() overload that accepts a const
reference to dht::static_sharder. This allows schemas to use custom
sharder instances instead of only static sharder configurations.
This is needed to support tables that use custom partitioning and
sharding strategies, such as the incoming raft metadata tables for
strongly consistent tables.
The method is called from storage_proxy::mutate_hint() which is in turn
called from hint_mutation::apply_locally(). The latter is either called
from directly by hint sender, which already runs in streaming group, or
via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming
group as well.
To be sure, add a debugging check for current group being the expected one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently sender only switches group for hints sending on start. It's
worth doing the same on stop too for consistency. There's nothing to
compete with at this point.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the sum function in the histogram uses true values instead of
an estimate, the test should reflect that.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch replaces simple counters with bytes_histogram for tracking
CQL request and response sizes, enabling better visibility into message
size distribution.
Changes:
- Replace request_size and response_size metrics with bytes_histogram in
cql_sg_stats::request_kind_stats
- Per-shard metrics continue to be reported as before
- QUERY, EXECUTE, and BATCH operations now report per-node, per-scheduling-group
histograms of bytes sent and received, providing detailed insight into these
operations
Other CQL operations (e.g., PREPARE, OPTIONS) are not included in per-node
histogram reporting as they are less performance-critical, but can be added
in the future if proven useful.
Metrics example:
```
# HELP scylla_transport_cql_request_bytes Counts the total number of received bytes in CQL messages of a specific kind.
# TYPE scylla_transport_cql_request_bytes counter
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
# HELP scylla_transport_cql_request_histogram_bytes A histogram of received bytes in CQL messages of a specific kind and specific scheduling group.
# TYPE scylla_transport_cql_request_histogram_bytes histogram
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
Previously, histogram sums were estimated by multiplying bucket offsets
by their counts, which produces inaccurate results - typically too high
when using upper limits or too low when using lower limits.
This patch adds accurate sum tracking to approx_exponential_histogram:
- Adds a _sum member variable to track the actual sum of all values
- Implements sum() method to return the accumulated total
- Updates add() to increment _sum for each value
- Modifies to_metrics_histogram() helper to use the new sum() method
This change is important as histograms will be used instead of counters for
byte statistics, where accurate totals are essential for metrics reporting.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
For various use cases, we need to report byte histograms, such as for
request and reply message sizes.
This patch introduce bytes_histogram as a type alias for
approx_exponential_histogram configured to track byte values from 1KB to
1GB with power-of-2 buckets (Precision=1).
This provides a convenient, performance-efficient histogram for
measuring message sizes, payload sizes, and other byte-based metrics.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-28 13:31:39 +02:00
4321 changed files with 65202 additions and 33927 deletions
# supposed to be processed by idl-compiler.py, so we don't check them using the cleaner
CLEANER_DIRS:test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
@@ -12,20 +12,48 @@ The terms "**You**" or "**Licensee**" refer to any individual accessing or using
**Grant of License**
* **Software Definitions:** Software means the ScyllaDB software provided by Licensor, including the source code, object code, and any accompanying documentation or tools, or any part thereof, as made available under this Agreement.
* **Grant of License:** Subject to the terms and conditions of this Agreement, Licensor grants You a limited, non-exclusive, revocable, non-sublicensable, non-transferable, royalty free license to Use the Software, in each case solely for the purposes of:
* **Definitions:**
1.**Software:** Software means the ScyllaDB software provided by Licensor, including the source code, object code, and any accompanying documentation or tools, or any part thereof, as made available under this Agreement.
2.**Commercial Customer**: means any legal entity (including its Affiliates) that has entered into a transaction with Licensor, or an authorized reseller/distributor, for the provision of any ScyllaDB products or services. This includes, without limitation: (a) Scope of Service: Any paid subscription, enterprise license, "BYOA" or Database-as-a-Service (DBaaS) offering, technical support, professional services, consulting, or training. (b) Scale and Volume: Any deployment regardless of size, capacity, or performance metrics (c) Payment Method: Any compensation model, including but not limited to, fixed-fee, consumption-based (On-Demand), committed spend, third-party marketplace credits (e.g., AWS, GCP, Azure), or promotional credits and discounts.
* **Grant of License:** Subject to the terms and conditions of this Agreement, including the Eligibility and Exclusive Use Restrictions clause, Licensor grants You a limited, non-exclusive, revocable, non-sublicensable, non-transferable, royalty free license to Use the Software, in each case solely for the purposes of:
1) Copying, distributing, evaluating (including performing benchmarking or comparative tests or evaluations , subject to the limitations below) and improving the Software and ScyllaDB; and
2) create a modified version of the Software (each, a "**Licensed Work**"); provided however, that each such Licensed Work keeps all or substantially all of the functions and features of the Software, and/or using all or substantially all of the source code of the Software. You hereby agree that all the Licensed Work are, upon creation, considered Licensed Work of the Licensor, shall be the sole property of the Licensor and its assignees, and the Licensor and its assignees shall be the sole owner of all rights of any kind or nature, in connection with such Licensed Work. You hereby irrevocably and unconditionally assign to the Licensor all the Licensed Work and any part thereof. This License applies separately for each version of the Licensed Work, which shall be considered "Software" for the purpose of this Agreement.
* **Eligibility and Exclusive Use Restrictions**
**License Limitations, Restrictions and Obligations:** The license grant above is subject to the following limitations, restrictions, and obligations. If Licensee’s Use of the Software does not comply with the above license grant or the terms of this section (including exceeding the Usage Limit set forth below), Licensee must: (i) refrain from any Use of the Software; and (ii) purchase a [commercial paid license](https://www.scylladb.com/scylladb-proprietary-software-license-agreement/) from the Licensor.
i. Restricted to "Never Customers" Only. The license granted under this Agreement is strictly limited to Never Customers. For purposes of this Agreement, a "Never Customer" is an entity (including its Affiliates) that does not have, and has not had within the previous twelve (12) months, a paid commercial subscription, professional services agreement, or any other commercial relationship with Licensor. Satisfaction of the Never Customer criteria is a strict condition precedent to the effectiveness of this License.
ii. Total Prohibition for Existing Commercial Customers. If You (or any of Your Affiliates) are an existing Commercial Customer of Licensor within the last twelve (12) months, no license is deemed to have been offered or extended to You, and any download or installation of the Software by You is unauthorized. This prohibition applies to all deployments, including but not limited to:
(a) existing commercial workloads;
(b) any new use cases, new applications, or new workloads
iii. **No "Hybrid" Usage**. Licensee is expressly prohibited from combining free tier usage under this Agreement with paid commercial units.
If You are a Commercial Customer, all use of the Software across Your entire organization (and any of your Affiliates) must be governed by a valid, paid commercial agreement. Use of the Software under this license by a Commercial Customer (which is not a "Never Customer") shall:
(a) Void this license *ab initio*;
(b) Be deemed a material breach of both this Agreement and any existing commercial terms; and
(c) Entitle Licensor to invoice Licensee for such unauthorized usage at Licensor's standard list prices, retroactive to the date of first use.
Notwithstanding anything to the contrary in the Eligibility or License Limitations sections above a Commercial Customer may use the Software exclusively for non-production purposes, including Continuous Integration (CI), automated testing, and quality assurance environments, provided that such use at all times remains compliant with the Usage Limit.
iv. **Verification**. Licensor reserves the right to audit Licensee's environment and corporate identity to ensure compliance with these eligibility criteria.
For the purposes of this Agreement an "**Affiliate**" means any entity that directly or indirectly controls, is controlled by, or is under common control with a party, where "control" means ownership of more than 50% of the voting stock or decision-making authority
**License Limitations, Restrictions and Obligations:** The license grant above is subject to the following limitations, restrictions, and obligations. If Licensee’s Use of the Software does not comply with the above license grant or the terms of this section (including exceeding the Usage Limit set forth below), Licensee must: (i) refrain from any Use of the Software; and (ii) unless Licensee is a Never Customer, purchase a [commercial paid license](https://www.scylladb.com/scylladb-proprietary-software-license-agreement/) from the Licensor.
* **Updates:** You shall be solely responsible for providing all equipment, systems, assets, access, and ancillary goods and services needed to access and Use the Software. Licensor may modify or update the Software at any time, without notification, in its sole and absolute discretion. After the effective date of each such update, Licensor shall bear no obligation to run, provide or support legacy versions of the Software.
* **"Usage Limit":** Licensee's total overall available storage across all deployments and clusters of the Software and the Licensed Work under this License shall not exceed 10TB and/or an upper limit of 50 VCPUs (hyper threads).
* **IP Markings:** Licensee must retain all copyright, trademark, and other proprietary notices contained in the Software. You will not modify, delete, alter, remove, or obscure any intellectual property, including without limitations licensing, copyright, trademark, or any other notices of Licensor in the Software.
* **License Reproduction:** You must conspicuously display this Agreement on each copy of the Software. If You receive the Software from a third party, this Agreement still applies to Your Use of the Software. You will be responsible for any breach of this Agreement by any such third-party.
* Distribution of any Licensed Works is permitted, provided that: (i) You must include in any Licensed Work prominent notices stating that You have modified the Software, (ii) You include a copy of this Agreement with the Licensed Work, and (iii) You clearly identify all modifications made in the Licensed Work and provides attribution to the Licensor as the original author(s) of the Software.
* **Commercial Use Restrictions:** Licensee may not offer the Software as a software-as-a-service (SaaS) or commercial database-as-as-service (dBaaS) offering. Licensee may not use the Software to compete with Licensor's existing or future products or services. If your Use of the Software does not comply with the requirements currently in effect as described in this License, you must purchase a commercial license from the Licensor, its affiliated entities, or you must refrain from using the Software and all Licensed Work. Furthermore, if You make any written claim of patent infringement relating to the Software, Your patent license for the Software granted under this Agreement terminates immediately.
* **Commercial Use Restrictions:** Licensee may not offer the Software as a software-as-a-service (SaaS) or commercial database-as-as-service (dBaaS) offering. Licensee may not use the Software to compete with Licensor's existing or future products or services. If your Use of the Software does not comply with the requirements currently in effect as described in this License, you must purchase a commercial license from the Licensor, its Affiliated entities, or you must refrain from using the Software and all Licensed Work. Furthermore, if You make any written claim of patent infringement relating to the Software, Your patent license for the Software granted under this Agreement terminates immediately.
* Notwithstanding anything to the contrary, under the License granted hereunder, You shall not and shall not permit others to: (i) transfer the Software or any portions thereof to any other party except as expressly permitted herein; (ii) attempt to circumvent or overcome any technological protection measures incorporated into the Software; (iii) incorporate the Software into the structure, machinery or controls of any aircraft, other aerial device, military vehicle, hovercraft, waterborne craft or any medical equipment of any kind; or (iv) use the Software or any part thereof in any unlawful, harmful or illegal manner, or in a manner which infringes third parties’ rights in any way, including intellectual property rights.
**Monitoring; Audit**
@@ -41,14 +69,14 @@ The terms "**You**" or "**Licensee**" refer to any individual accessing or using
**Indemnity; Disclaimer; Limitation of Liability**
* **Indemnity:** Licensee hereby agrees to indemnify, defend and hold harmless Licensor and its affiliates from any losses or damages incurred due to a third party claim arising out of: (i) Licensee’s breach of this Agreement; (ii) Licensee’s negligence, willful misconduct or violation of law, or (iii) Licensee’s products or services.
* **Indemnity:** Licensee hereby agrees to indemnify, defend and hold harmless Licensor and its Affiliates from any losses or damages incurred due to a third party claim arising out of: (i) Licensee’s breach of this Agreement; (ii) Licensee’s negligence, willful misconduct or violation of law, or (iii) Licensee’s products or services.
* DISCLAIMER OF WARRANTIES: LICENSEE AGREES THAT LICENSOR HAS MADE NO EXPRESS WARRANTIES REGARDING THE SOFTWARE AND THAT THE SOFTWARE IS BEING PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. LICENSOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THE SOFTWARE, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE; TITLE; MERCHANTABILITY; OR NON-INFRINGEMENT OF THIRD PARTY RIGHTS. LICENSOR DOES NOT WARRANT THAT THE SOFTWARE WILL OPERATE UNINTERRUPTED OR ERROR FREE, OR THAT ALL ERRORS WILL BE CORRECTED. LICENSOR DOES NOT GUARANTEE ANY PARTICULAR RESULTS FROM THE USE OF THE SOFTWARE, AND DOES NOT WARRANT THAT THE SOFTWARE IS FIT FOR ANY PARTICULAR PURPOSE.
* LIMITATION OF LIABILITY: TO THE FULLEST EXTENT PERMISSIBLE UNDER APPLICABLE LAW, IN NO EVENT WILL LICENSOR AND/OR ITS AFFILIATES, EMPLOYEES, OFFICERS AND DIRECTORS BE LIABLE TO LICENSEE FOR (I) ANY LOSS OF USE OR DATA; INTERRUPTION OF BUSINESS; OR ANY INDIRECT; SPECIAL; INCIDENTAL; OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING LOST PROFITS); AND (II) ANY DIRECT DAMAGES EXCEEDING THE TOTAL AMOUNT OF ONE THOUSAND US DOLLARS ($1,000). THE FOREGOING PROVISIONS LIMITING THE LIABILITY OF LICENSOR SHALL APPLY REGARDLESS OF THE FORM OR CAUSE OF ACTION, WHETHER IN STRICT LIABILITY, CONTRACT OR TORT.
**Proprietary Rights; No Other Rights**
* **Ownership:** Licensor retains sole and exclusive ownership of all rights, interests and title in the Software and any scripts, processes, techniques, methodologies, inventions, know-how, concepts, formatting, arrangements, visual attributes, ideas, database rights, copyrights, patents, trade secrets, and other intellectual property related thereto, and all derivatives, enhancements, modifications and improvements thereof. Except for the limited license rights granted herein, Licensee has no rights in or to the Software and/ or Licensor’s trademarks, logo, or branding and You acknowledge that such Software, trademarks, logo, or branding is the sole property of Licensor.
* **Feedback:** Licensee is not required to provide any suggestions, enhancement requests, recommendations or other feedback regarding the Software ("Feedback"). If, notwithstanding this policy, Licensee submits Feedback, Licensee understands and acknowledges that such Feedback is not submitted in confidence and Licensor assumes no obligation, expressed or implied, by considering it. All right in any trademark or logo of Licensor or its affiliates and You shall make no claim of right to the Software or any part thereof to be supplied by Licensor hereunder and acknowledges that as between Licensor and You, such Software is the sole proprietary, title and interest in and to Licensor.such Feedback shall be assigned to, and shall become the sole and exclusive property of, Licensor upon its creation.
* **Feedback:** Licensee is not required to provide any suggestions, enhancement requests, recommendations or other feedback regarding the Software ("Feedback"). If, notwithstanding this policy, Licensee submits Feedback, Licensee understands and acknowledges that such Feedback is not submitted in confidence and Licensor assumes no obligation, expressed or implied, by considering it. All right in any trademark or logo of Licensor or its Affiliates and You shall make no claim of right to the Software or any part thereof to be supplied by Licensor hereunder and acknowledges that as between Licensor and You, such Software is the sole proprietary, title and interest in and to Licensor.such Feedback shall be assigned to, and shall become the sole and exclusive property of, Licensor upon its creation.
* Except for the rights expressly granted to You under this Agreement, You are not granted any other licenses or rights in the Software or otherwise. This Agreement constitutes the entire agreement between You and the Licensor with respect to the subject matter hereof and supersedes all prior or contemporaneous communications, representations, or agreements, whether oral or written.
* **Third-Party Software:** Customer acknowledges that the Software may contain open and closed source components (“OSS Components”) that are governed separately by certain licenses, in each case as further provided by Company upon request. Any applicable OSS Component license is solely between Licensee and the applicable licensor of the OSS Component and Licensee shall comply with the applicable OSS Component license.
* If any provision of this Agreement is held to be invalid or unenforceable, such provision shall be struck and the remaining provisions shall remain in full force and effect.
@@ -56,7 +84,7 @@ The terms "**You**" or "**Licensee**" refer to any individual accessing or using
**Miscellaneous**
* **Miscellaneous:** This Agreement may be modified at any time by Licensor, and constitutes the entire agreement between the parties with respect to the subject matter hereof. Licensee may not assign or subcontract its rights or obligations under this Agreement. This Agreement does not, and shall not be construed to create any relationship, partnership, joint venture, employer-employee, agency, or franchisor-franchisee relationship between the parties.
* **Modifications**: Licensor reserves the right to modify this Agreement at any time. Changes will be effective upon posting to the Website or within the Software repository. Continued use of the Software after such changes constitutes acceptance.
* **Governing Law & Jurisdiction:** This Agreement shall be governed and construed in accordance with the laws of Israel, without giving effect to their respective conflicts of laws provisions, and the competent courts situated in Tel Aviv, Israel, shall have sole and exclusive jurisdiction over the parties and any conflict and/or dispute arising out of, or in connection to, this Agreement
// ARN format is `arn:<partition>:<service>:<region>:<account-id>:<resource-type>/<resource-id>/<postfix>`
// we ignore partition, service and account-id
// resource-type must be string "table"
// resource-id will be returned as table_name
// region will be returned as keyspace_name
// postfix is a string after resource-id and will be returned as is (whole), including separator.
structarn_parts{
std::string_viewkeyspace_name;
std::string_viewtable_name;
std::string_viewpostfix;
};
// arn - arn to parse
// arn_field_name - identifier of the ARN, used only when reporting an error (in error messages), for example "Incorrect resource identifier `<arn_field_name>`"
// type_name - used only when reporting an error (in error messages), for example "... is not a valid <type_name> ARN ..."
// expected_postfix - optional filter of postfix value (part of ARN after resource-id, including separator, see comments for struct arn_parts).
// If is empty - then postfix value must be empty as well
// if not empty - postfix value must start with expected_postfix, but might be longer
co_returnapi_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}",_pending_requests.get_count()));
co_returnapi_error::request_limit_exceeded(format("too many in-flight requests (configured via max_concurrent_requests_per_shard): {}",_pending_requests.get_count()));
seastar::metrics::make_histogram("batch_item_count_histogram",seastar::metrics::description("Histogram of the number of items in a batch request"),labels,
seastar::metrics::make_histogram("batch_item_count_histogram",seastar::metrics::description("Histogram of the number of items in a batch request"),labels,
co_returnapi_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");
co_returnapi_error::unknown_operation("UpdateTimeToLive not yet supported. Upgrade all nodes to a version that supports it.");
// Should never happen - we verified the column's type
// before starting the scan.
[[unlikely]]
on_internal_error(tlogger,format("expiration scanner value of unsupported type {} in column {}",meta[*expiration_column]->type->cql3_type_name(),scan_ctx.column_name));
"description":"The snapshot tag to delete. If omitted, all snapshots are removed.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -751,7 +751,7 @@
},
{
"name":"kn",
"description":"Comma-separated keyspaces name that their snapshot will be deleted",
"description":"Comma-separated list of keyspace names to delete snapshots from. If omitted, snapshots are deleted from all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -759,7 +759,7 @@
},
{
"name":"cf",
"description":"an optional table name that its snapshot will be deleted",
"description":"A table name used to filter which table's snapshots are deleted. If omitted or empty, snapshots for all tables are eligible. When provided together with 'kn', the table is looked up in each listed keyspace independently. For secondary indexes, the logical index name (e.g. 'myindex') can be used and is resolved automatically.",
"required":false,
"allowMultiple":false,
"type":"string",
@@ -1295,6 +1295,45 @@
}
]
},
{
"path":"/storage_service/logstor_compaction",
"operations":[
{
"method":"POST",
"summary":"Trigger compaction of the key-value storage",
"type":"void",
"nickname":"logstor_compaction",
"produces":[
"application/json"
],
"parameters":[
{
"name":"major",
"description":"When true, perform a major compaction",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/logstor_flush",
"operations":[
{
"method":"POST",
"summary":"Trigger flush of logstor storage",
"type":"void",
"nickname":"logstor_flush",
"produces":[
"application/json"
],
"parameters":[]
}
]
},
{
"path":"/storage_service/active_repair/",
"operations":[
@@ -3085,6 +3124,125 @@
}
]
},
{
"path":"/storage_service/tablets/snapshots",
"operations":[
{
"method":"POST",
"summary":"Takes the snapshot for the given keyspaces/tables. A snapshot name must be specified.",
"type":"void",
"nickname":"take_cluster_snapshot",
"produces":[
"application/json"
],
"parameters":[
{
"name":"tag",
"description":"the tag given to the snapshot",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"keyspace",
"description":"Keyspace(s) to snapshot. Multiple keyspaces can be provided using a comma-separated list. If omitted, snapshot all keyspaces.",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Table(s) to snapshot. Multiple tables (in a single keyspace) can be provided using a comma-separated list. If omitted, snapshot all tables in the given keyspace(s).",
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.