For indexed queries, statement_restrictions calculates _view_schema,
which is passed via get_view_schema() to indexed_select_statement(),
which passes it right back to statement_restrictions via one of three
functions to calculate clustering ranges.
Avoid the back-and-forth and use the stored value. Using a different
value would be broken.
This change allows unifying the signatures of the four functions that
get clustering ranges.
Prevent copying/moving, that can change the address, and instead enforce
using shared_ptr. Most of the code is already using shared_ptr, so the
changes aren't very large.
To forbid non-shared_ptr construction, the constructors are annotated
with a private_tag tag class.
In preparation for refactoring statement_restrictions, add a simple
and an exhaustive regression test, encoding the index selection
algorithm into the test. We cannot change the index selection algorithm
because then mixed-node clusters will alter the sorting key mid-query
(if paging takes place).
Because the exhaustive space has such a large stack frame, and
because Address Santizer bloats the stack frame, increase it
for debug builds.
Garbage collected sstables created during incremental compaction are
deleted only at the end of the compaction, which increases the memory
footprint. This is inefficient, especially considering that the related
input sstables are released regularly during compaction.
This commit implements incremental release of GC sstables after each
output sstable is sealed. Unlike regular input sstables, GC sstables
use a different exhaustion predicate: a GC sstable is only released
when its token range no longer overlaps with any remaining input
sstable. This is because GC sstables hold tombstones that may shadow
data in still-alive overlapping input sstables; releasing them
prematurely would cause data resurrection.
Fixes#5563Closesscylladb/scylladb#28984
After stopping the topology coordinator, a new topology coordinator
may not yet be started when get_coordinator_host() is called. Make
the function always retry via wait_for so that every caller is
protected against this race.
Fixes SCYLLADB-1553
Closesscylladb/scylladb#29489
Container names were generated as {name}-{pid}-{counter}, where the
counter is a per-process itertools.count. This scheme breaks across CI
runs on the same host: if a prior job was killed abruptly (SIGKILL,
cancellation) its containers are left running since --rm only removes
containers on exit. A subsequent run whose worker inherits the same PID
(common in containerized CI with small PID namespaces) and reaches the
same counter value will collide with the orphaned container.
Replace pid+counter with uuid.uuid4(), which generates a random UUID,
making names unique across processes, hosts, and time without any shared
state or leaking host identifiers.
Fixes: SCYLLADB-1540
Closesscylladb/scylladb#29509
Write the MinIO server log directly to tempdir_base (testlog/<arch>/)
instead of the per-server temp directory that gets destroyed on
shutdown. This preserves the log for Jenkins artifact collection,
helping debug S3-related flaky test failures like the
stcs_reshape_overlapping_s3_test hang (SCYLLADB-1481).
Closesscylladb/scylladb#29458
With the latest changes, there are a lot of code that is redundant in the test.py. This PR just cleans this code.
Also, it narrows using dynamic scope for fixtures to test/alternator and test/cqlpy. All the rest by default will have module scope.
test.py will be a wrapper for pytest mostly for CI use. As for now test.py have important part of calculating the number of threads to start pytest with. This is not possible to do in pytest itself.
No backport needed, framework enhancement only.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-666Closesscylladb/scylladb#28852
* github.com:scylladb/scylladb:
test.py: remove testpy_test_fixture_scope
test.py: add logger for 3rd party service
test.py: delete dead code in test.py
Users ought to have possibility to create the local index for Vector Search
based only on a part of the partition key. This commits provides this by
removing requirements of 'full partition key only' for custom local index.
The commit updates docs to explain that local vector index can use only a part
of the partition key.
The commit implements cqlpy test to check fixed functionality.
Fixes: SCYLLADB-953
Needs to be backported to 2026.1 as it is a fix for local vector indexes.
Closesscylladb/scylladb#28931
Decrease the default connection timeout to 3s to better align with the
default CQL query timeout of 10s.
The previous timeout allowed only one failover request in high availability
scenario before hitting the CQL query timeout.
By decreasing the timeout to 3s, we can perform up to three failover requests
within the CQL query timeout, which significantly improves the chances of
successfully completing the query in high availability scenarios.
Fixes: SCYLLADB-95
Add option `vector_store_unreachable_node_detection_time_in_ms` to
control parameters related to detecting unreachable vector store nodes.
This parameter is used to set the TCP connect timeout, keepalive
parameters, and TCP_USER_TIMEOUT. By configuring these parameters,
we can detect unreachable vector store nodes faster and trigger
failover mechanisms in a timely manner.
This series adds support for vector search in Alternator based on the existing implementation in CQL.
The series adds APIs for `CreateTable` and `UpdateTable` to add or remove vector indexes to Alternator tables, `DescribeTable` to list them and check the indexing status, and `Query` to perform a vector search - which contacts the vector store for the actual ANN (approximate nearest neighbor) search.
Correct functionality of these features depend on some features of the the vector store, that were already done (see https://github.com/scylladb/vector-store/pull/394).
This initial implementation is fully functional, and can already be useful, but we do not yet support all the features we hope to eventually support. Here are things that we have **not** done yet, and plan to do later in follow-up pull requests:
1. Support a new optimized vector type ("V") - in addition to the "list of numbers" type supported in this version.
2. Allow choosing a different similarity function when creating an index, by SimilarityFunction in VectorIndex definition.
3. Allow choosing quantization (f32/f16/bf16/i8/b1) to ask the vector index to compress stored vectors.
4. Support oversampling and rescoring, defined per-index and per-query.
5. Support HNSW tuning parameters — maximum_node_connections, construction_beam_width, search_beam_width.
6. Support pre-filtering over key columns, which are available at the vector store, by sending the filter to the vector store (translated from DynamoDB filter syntax to the vector's store's filter syntax). A decision still need to be made if this will use KeyConditionExpression or FilterExpression. This version supports only post-filtering (with `FilterExpression`).
7. Support projecting non-key attributes into the index (Projection=INCLUDE and Projection=ALL), and then 1. pre-filtering using these attributes, and 2. efficiently return these attributes (using Select=ALL_PROJECTED_ATTRIBUTES, which today returns just the key columns).
8. Optimize the performance of `Query`, which today is inefficient for Select=ALL_ATTRIBUTES because it serially retrieves the matching items one at a time.
9. Returning the similarity scores with the items (the design proposes ReturnVectorSearchSimilarity).
10. Add more vector-search-specific metrics, beyond the metric we already have counting Query requests. For example separate latency and request-count metrics for vector-search Queries (distinct from GSI/LSI queries), and a metric accumulating the total Limit (K) across all vector search queries.
11. Consider how (and if at all) we want to run the tests in test/alternator/test_vector.py that need the vector store in the CI. Currently they are skipped in CI and only run manually (with `test/alternator/run --vs test_vector`).
12. UpdateTable 'Update' operation to modify index parameters. Only some can be modified, e.g., Oversampling.
13. Support for "local index" (separate index for each partition).
14. Make sure that vector search and Streams can be enabled concurrently on the same table - both need CDC but we need to verify that one doesn't confuse the other or disables options that the other needs. We can only do this after we have Alternator Streams running on tablets (since vector store requires tablets).
Testing the new Alternator vector search end-to-end requires running both Scylla and the vector store together. We will have such end-to-end tests in the vector store repository (see https://github.com/scylladb/vector-store/pull/392), but we also add in this pull request many end-to-end tests written in Python, that can be run with the command "test/alternator/run --vs test_vector.py". The "--vs" option tells the run script to run both Scylla and the vector store (currently assumed to be in `.../vector-store/target/release/vector-store`). About 65% of the tests in this pull request check supported syntax and error paths so can run without the vector store, while about 35% of the tests do perform actual Query operations and require the vector store to be running. Currently, the tests that do require the vector store will not get run by CI, but can be easily re-run manually with `test/alternator/run --vs test_vector.py`.
In total, this series includes 78 functional tests in 2200 lines of Python code.
This series also includes documentation for the new Alternator feature and the new APIs introduced. You can see a more detailed design document here: https://docs.google.com/document/d/1cxLI7n-AgV5hhH1DTyU_Es8_f-t8Acql-1f58eQjZLY/edit
Two patches in this series split the huge alternator/executor.cc, after this series continued to grow it and it reached a whoppng 7,000 lines. These patches are just reorganization of code, no functional changes. But it's time that we finally do this (Refs #5783), we can't just continue to grow executor.cc with no end...
Closesscylladb/scylladb#29046
* github.com:scylladb/scylladb:
test/alternator: add option to "run" script to run with vector search
alternator: document vector search
test/alternator: fix retries in new_dynamodb_session
test/alternator: test for allowed characters in attribute names
test/alternator: tests for vector index support
alternator, vector: add validation of non-finite numbers in Query
alternator: Query: improve error message when VectorSearch is missing
alternator: add per-table metrics for vector query
alternator: clean up duplicated code
alternator: fix default Select of Query
alternator: split executor.cc even more
alternator: split alternator/executor.cc
alternator: validate vector index attribute values on write
alternator: DescribeTable for vector index: add IndexStatus and Backfilling
alternator: implement Query with a vector index
alternator: fix bug in describe_multi_item()
alternator: prevent adding GSI conflicting with a vector index
alternator: implement UpdateTable with a vector index
alternator: implement DescribeTable with a vector index
alternator: implement CreateTable with a vector index
alternator: reject empty attribute names
cdc: fix on_pre_create_column_families to create CDC log for vector search
This PR removes the power-of-two token constraint from vnodes-to-tablets migrations, allowing clusters with randomly generated tokens to migrate without manual token reassignment.
Previously, migrations required vnode tokens to be a power of two and aligned. In practice, these conditions are not met with Scylla's default random token assignment, so the constraint is a blocker for real-world use. With the introduction of arbitrary tablet boundaries in PR #28459, the tablet layer can now support arbitrary tablet boundaries. This PR builds on that capability to allow arbitrary vnode tokens during migration.
When the highest vnode token does not coincide with the end of the token ring, the vnode wraps around, but tablets do not support that. This is handled by splitting it into two tablets: one covering the tail end of the ring and one covering the beginning.
Testing has been updated accordingly: existing cluster tests now use randomly generated tokens instead of precomputed power-of-two values, and a new Boost test validates the wrap-around tablet boundary logic.
Fixes SCYLLADB-724.
New feature, no backport is needed.
Closesscylladb/scylladb#29319
* github.com:scylladb/scylladb:
test: Use arbitrary tokens in vnodes->tablets migration tests
test: boost: Add test for wrap-around vnodes
storage_service: Support vnodes->tablets migrations w/ arbitrary tokens
storage_service: Hoist migration precondition
With migration to pyest this fixture is useless. Removing and setting
the session to the module for the most of the tests.
Add dynamic_scope function to support running alternator fixtures in
session scope, while Test and TestSuite are not deleted. This is for
migration period, later on this function should be deleted.
With migration of preparation environment and starting 3rd party services
to the pytest, they're output the logs to the terminal. So this PR
binds them their own log file to avoid polluting the terminal.
With the latest changes, there are a lot of code that is redundant in
the test.py. This PR just cleans this code.
Changes in other files are related to cleaning code from the test.py,
especially with redundant parameter --test-py-init and moving
prepare_environment to pytest itself.
The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism.
Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries.
Fixes SCYLLADB-1542
This is a CI stability issue and should be backported.
Closesscylladb/scylladb#29504
* github.com:scylladb/scylladb:
test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
test: fix proc_utils.cc formatting from previous commit
test: lib: use unique container name per retry attempt
test: lib: fix broken retry in start_docker_service
Add a `CHILD_SHARDS` filter to `DescribeStream` command.
When used, user need to pass a parent stream shard id as
json's ShardFilter.ShardId field. DescribeStream will
then return only list of stream shards, that are direct
descendants of passed parent stream shard.
Each stream shard cover a consecutive part of token space.
A stream shard Q is considered to be a child of stream shard W,
when at least one token belongs to token spaces from both streams.
The filtering algorithm itself is somewhat complicated - more details
in comments in streams.cc.
CHILD_SHARDS is a Amazon's functionality and is required by KCL.
Add unit tests.
Fixes: #25160Closesscylladb/scylladb#28189
Increase the non-AWS wait in the TTL streams test to reduce vnode CI flakes caused by delayed expiration visibility.
Fixes SCYLLADB-1556
Closesscylladb/scylladb#29516
Add to test/alternator/run the option "-vs" which runs alongside with
Scylla a vector store, to allow running Alternator tests with vector
indexing.
To get the vector store, do
git clone git@github.com:scylladb/vector-store.git
cargo build --release
"run -vs" looks for an executable in ../vector-store/target/*/vector-store
but can also be overridden by the VECTOR_STORE environment variable.
test/alternator/run runs the vector store exactly like it runs Scylla -
in a temporary directory, on a temporary IP address in the localhost
subnet (127.0.0/8), killing it when the test end, and showing the output
of both programs (Scylla and vector store). These transient runs of
Scylla and vector store are configured to be able to communicate to
each other.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The new_dynamodb_session() function had a bug which we never noticed
because we hardly used it, but it became more noticable when the new
test/alternator/test_vector.py started to use it:
By default, boto3 retries a request up to 9 times when it encounters
a retriable error (such as an Internal Server Error). We don't want such
retries in our tests - it makes failures slower, but more importantly
it can hide "flaky" bugs by retrying 9 times until it happens to succeed.
The new_dynamodb_session() had code (copied from the dynamodb fixture)
to set boto3's "max_attempts" configuration to 0, to disable this retry.
But this code had an incorrect "if" to only be done if we're testing on
"localhost". This is wrong: We almost never use "localhost" as the
target of the test; Both test/cqlpy/run and test.py pick an IP address
in the localhost subnet (127/8) and uses that IP address - not the string
"localhost".
This bug only existed in new_dynamodb_session() - the more commonly used
"dynamodb" fixture didn't have this bug.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
One of the tests in the previous patch checked that strange characters
are allowed in attribute names used for vector indexing. It turns out
we never had a test that verifies that regardless of vector indexes -
any character whatsoever is allowed in attribute names. This is
different from table names which are much more limited.
So this patch adds the missing test.
As usual, the new test also passes on DynamoDB, showing that these
stange characters in attribute names are also allowed by DynamoDB.
In this patch we add a large collection of basic functional tests for the
vector index support, covering the CreateTable, UpdateTable, DescribeTable
and Query operations and the various ways in which those are allowed to
work - or expected to fail. These tests were written in parallel with
writing the code so they (hopefully) cover all the corner cases considered
during development, and make sure these corner cases are all handled
correctly and will not regress in the future.
Some of these tests do not involve querying of the index and focus on
the structure of requests and the kind of syntax allowed. But other tests
are end-to-end, requiring the vector store to be running and trying to
index Alternator data and query it. These tests are marked
"needs_vector_store", and are immediately skipped in Scylla is not
configured to connect to a vector store. In a later patch we'll add a
an option to test/alternator/run to be able to run these end-to-end
tests by automatically running both Scylla and the Vector Store.
We'll have additional end-to-end tests in the vector-store repository.
Note that vector search is a new API feature that doesn't exist in DynamoDB,
so we are adding new parameters and outputs to existing operations. The AWS
SDKs don't normally allow doing that, so the test added here begins by
teaching the Python SDK to use the new APIs we added. This piece of code
can also be used by end-users to use vector search (at least in Python...)
before we officially add this support to ScyllaDB's SDK wrappers.
The per-table metrics for Query were not incremented for the
vector variant of the Query operations, only the global metrics were
incremented. This patch fixes this oversight, and add a test that
reproduces it (the new test fails before this patch, and passes after).
In earlier patches, when Query'ing a vector index, we set the default
Select to ALL_ATTRIBUTES. However, according to the DynamoDB documentation
for Query,
"If neither Select nor ProjectionExpression are specified, DynamoDB
defaults to ALL_ATTRIBUTES when accessing a table, and
ALL_PROJECTED_ATTRIBUTES when accessing an index."
This default should also apply to vector index, so this patch fixes this.
The new behavior is not only more compatible with DynamoDB, it is also
much more efficient by default, as ALL_PROJECTED_ATTRIBUTES does not need
to read from the base table - it returns the results that the vector store
returned. Of course, if the user needs the more efficient ALL_ATTRIBUTES
this option is still available - it's just no longer the default.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Extend system_info_encryption to encrypt system.raft SSTables.
system.raft contains the Raft log, which may hold sensitive user data
(e.g. batched mutations), so it warrants the same treatment as
system.batchlog and system.paxos.
During upgrade, existing unencrypted system.raft SSTables remain
readable. Existing data is rewritten encrypted via compaction, or
immediately via nodetool upgradesstables -a.
Update the operator-facing system_info_encryption description to
mention system.raft and add a focused test that verifies the schema
extension is present on system.raft.
Fixes: CUSTOMER-268
Backport: 2026.1 - closes an encryption-at-rest coverage gap: system.raft may persist sensitive user-originated data unencrypted; backport to the current LTS.
Closesscylladb/scylladb#29242
There's a bunch of db::config options that are used by cql3/statements/ code. For that they use data_dictionary/database as a proxy to get db::config reference. This PR moves most of these accessed options onto cql_config
Options migrated to cql_config:
1. select_internal_page_size
2. strict_allow_filtering
3. enable_parallelized_aggregation
4. batch_size_warn_threshold_in_kb
5. batch_size_fail_threshold_in_kb
6. 7 keyspace replication restriction options
7. 2 TWCS restriction options
8. restrict_future_timestamp
9. strict_is_not_null_in_views (with view_restrictions struct)
10. enable_create_table_with_compact_storage
Some options need special treatment and are still abused via database, namely:
1. enable_logstor
2. cluster_name
3. partitioner
4. endpoint_snitch
Fixing components inter-dependencies, not backporting
Closesscylladb/scylladb#29424
* github.com:scylladb/scylladb:
cql3: Move enable_create_table_with_compact_storage to cql_config
cql3: Move strict_is_not_null_in_views to cql_config
cql3: Move restrict_future_timestamp to cql_config
cql3: Move TWCS restriction options to cql_config
cql3: Move keyspace restriction options to cql_config
cql3: Move batch_size_fail_threshold_in_kb to cql_config
cql3: Move batch_size_warn_threshold_in_kb to cql_config
cql3: Move enable_parallelized_aggregation to cql_config
cql3: Move strict_allow_filtering to cql_config
cql3: Move select_internal_page_size to cql_config
test: Fix cql_test_env to use updateable cql_config from db::config
cql3: Add cql_config parameter to parsed_statement::prepare()
`system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables.
This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_*` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades.
When the cluster feature is enabled, each node drops the old system large_* tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables.
Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records.
1. **keys: move key_to_str() to keys/keys.hh** — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable
2. **sstables: add LargeDataRecords metadata type (tag 13)** — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation
3. **large_data_handler: rename partition_above_threshold to above_threshold_result** — generalize the struct for reuse
4. **large_data_handler: return above_threshold_result from maybe_record_large_cells** — separate booleans for cell size vs collection elements thresholds
5. **sstables: populate LargeDataRecords from writer** — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable`
6. **test: add LargeDataRecords round-trip unit tests** — verify write/read, top-N bounding, below-threshold behavior
7. **db: call initialize_virtual_tables from shard 0 only** — preparatory refactoring to enable cross-shard coordination
8. **db: implement large_data virtual tables with feature flag gating** — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276
* Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport
Closesscylladb/scylladb#29257
* github.com:scylladb/scylladb:
db: implement large_data virtual tables with feature flag gating
db: call initialize_virtual_tables from shard 0 only
test: add LargeDataRecords round-trip unit tests
sstables: populate LargeDataRecords from writer
large_data_handler: return above_threshold_result from maybe_record_large_cells
large_data_handler: rename partition_above_threshold to above_threshold_result
sstables: add LargeDataRecords metadata type (tag 13)
sstables: add fmt::formatter for large_data_type
keys: move key_to_str() to keys/keys.hh
Alternator has a function validate_attr_name_length() used to validate an
attribute name passed in different operations like PutItem, UpdateItem,
GetItem, etc. It fails the request if the attribute name is longer than
65535 characters.
It turns out that we forgot to check if the attribute name length isn’t 0 -
which should be forbidden as well!
This patch fixes the validation code, and also adds a test that confirms
that after this patch empty attribute names are rejected - just like DynamoDB
does - whereas before this patch they were silently accepted.
We want to fix this issue now, because in a later patch we intend to use
the same validation function also for vector indexes - and want it to be
accurate.
Fixes SCYLLADB-1069.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The PR removed the create_and_ks() helper from backup test and patches all callers to create keyspace, table and populate them with standard explicit facilities. While patching it turned out that one test doesn't need to populate the table, so it even becomes tiny bit shorter and faster
Enhancing test, not backporting
Closesscylladb/scylladb#29417
* github.com:scylladb/scylladb:
test_backup: Remove create_ks_and_cf helper Test
test_backup: Replace create_ks_and_cf with async patterns Test
test_backup: Add if-True blocks for indentation Test
The migration tests used to start nodes with pre-computed power-of-two
tokens. This was required because the migration itself only supported
power-of-two aligned tokens. Now that arbitrary tokens are supported,
switch the tests to use Scylla's default random token assignment.
Switching to arbitrary tokens makes the tests non-deterministic, but the
migration aspects that are affected by the token distribution
(resharding, wrap-around vnode split) are out of scope for these tests
and covered by dedicated tests.
Add a `get_all_vnode_tokens()` helper that queries system.topology at
runtime to discover the actual token layout, and derive expected tablet
counts from that.
Also account for the possible extra wrap-around tablet when the last
vnode token does not coincide with MAX_TOKEN.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Add a Boost test to verify that `prepare_for_tablets_migration()`
produces the correct tablet boundaries when a wrap-around vnode exists.
Tablets cannot wrap around the token ring as vnodes do; the last token
of the last tablet must always be MAX_TOKEN. When the last vnode token
does not coincide with MAX_TOKEN, the wrap-around vnode must be split
into two tablets.
The test is parameterized over both cases: unaligned (split expected)
and aligned (no split expected).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
1. test.py — Removed --log-level=DEBUG flag from pytest args
2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments
+minor fix
[test/pylib: save logs on success only during teardown phase](0ede308a04)
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown)in 3 files. Restrict it to only
the teardown phase, that contains all 3 in case of test success,
to avoid redundant log entries.
Closesscylladb/scylladb#29086
* github.com:scylladb/scylladb:
test/pylib: save logs on success only during teardown phase
test: Lower default log level from DEBUG to INFO
When initializing streaming sources in tablet_stream_files_handler we
use a reference to the table. We should hold the table while doing so,
because otherwise the table may be dropped and destroyed when we yield.
Use the table.stream_in_progress() phaser to hold the table while we
access it.
For sstable file streaming we can release the table after the snapshot
is initialized, and the table may be dropped safely because the files
are held by the snapshot and we don't access the table anymore. There
was a single access to the table for logging but it is replaced by a
pre-calculated variable.
For logstor segment streaming, currently it doesn't support discarding
the segments while they are streamed - when the table is dropped it
discard the segments by overwriting and freeing them, so they shouldn't
be accessed after that. Therefore, in that case continue to hold the
table until streaming is completed.
Fixes [SCYLLADB-1533](https://scylladb.atlassian.net/browse/SCYLLADB-1533)
It's a pre-existing use-after-free issue in sstable file streaming so should be backported to all releases.
It's also made worse with the recent changes of logstor, and affects also non-logstor tables, so the logstor fixes should be in the same release (2026.2).
[SCYLLADB-1533]: https://scylladb.atlassian.net/browse/SCYLLADB-1533?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#29488
* github.com:scylladb/scylladb:
test: test drop table during streaming
streaming: stream_blob: hold table for streaming
The GCS fixture's fake-gcs-server container was named "local-kms",
copy-pasted from the AWS KMS fixture. It happened when both were
refactored to use the shared start_docker_service helper (bc544eb08e).
Rename to "fake-gcs-server" to match the Python-side naming and avoid
confusion in logs.
Refs SCYLLADB-1542
The container name is generated once before the retry loop, so
all retry attempts reuse the same name. Move the name generation
inside the loop so each attempt gets a fresh name via the
incrementing counter, consistent with the comment "publish port
ephemeral, allows parallel instances".
Formatting changes (indentation) of lines 208-225 in test/lib/proc_utils.cc
will be fixed in the next commit.
Refs SCYLLADB-1542
Previously, config_updater used a serialized_action to trigger update_config() when object_storage_endpoints changed. Because serialized_action::trigger() always schedules the action as a new reactor task (via semaphore::wait().then()), there was a window between the config value becoming visible to the REST API and update_config() actually running. This allowed a concurrent CREATE KEYSPACE to see the new endpoint via is_known_endpoint() before storage_manager had registered it in _object_storage_endpoints.
Now config observers run synchronously in a reactor turn and must not suspend. Split the previous monolithic async update_config() coroutine into two phases:
- Sync (in the observer, never suspends): storage_manager::_object_storage_endpoints is updated in place; for already-instantiated clients, update_config_sync swaps the new config atomically
- Async (per-client gate): background fibers finish the work that can't run in the observer — S3 refreshes credentials under _creds_sem; GCS drains and closes the replaced client.
Config reloads triggered by SIGHUP are applied on shard 0 and then broadcast to all other shards. An rwlock has been also introduced to make sure that the configuration has been propagated to all cores. This guarantees that a client requesting a config via the REST API will see a consistent snapshot
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-757
Fixes: [28141](https://github.com/scylladb/scylladb/issues/28141)
Closesscylladb/scylladb#28950
* github.com:scylladb/scylladb:
test/object_store: verify object storage client creation and live reconfiguration
sstables/utils/s3: split config update into sync and async parts
test_config: improve logging for wait_for_config API
db: introduce read-write lock to synchronize config updates with REST API
The commitlog replayer groups segments by shard using a
std::unordered_multimap, then iterates per-shard segments via
equal_range(). However, equal_range() does not guarantee iteration
order for elements with the same key, so segments could be replayed
out of order within a shard.
Correct segment ordering is required for:
- Fragmented entry reconstruction, which accumulates fragments across
segments and depends on ascending order for efficient processing.
- Commitlog-based storage used by the strongly consistent tables
feature, which relies on replayed raft items being stored in order.
Fix by changing the data structure from
std::unordered_multimap<unsigned, commitlog::descriptor>
to
std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>>
Since the descriptors are inserted from a std::set ordered by ID, the
vector preserves insertion (and thus ID) order. The per-shard iteration
now simply iterates the vector, guaranteeing correct replay order.
Fixes: SCYLLADB-1411
Backport: It looks like this issue doesn't cause any trouble, and is required only by the strong consistent tables, so no backporting required.
Closesscylladb/scylladb#29372
* github.com:scylladb/scylladb:
commitlog: add test to verify segment replay order
commitlog: fix replay order by using ordered map per shard
Replace the physical system.large_partitions, system.large_rows, and
system.large_cells CQL tables with virtual tables that read from
LargeDataRecords stored in SSTable scylla metadata (tag 13).
The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster
feature flag:
- Before the feature is enabled: the old physical tables remain in
all_tables(), CQL writes are active, no virtual tables are registered.
This ensures safe rollback during rolling upgrades.
- After the feature is enabled: old physical tables are dropped from
disk via legacy_drop_table_on_all_shards(), virtual tables are
registered on all shards, and CQL writes are skipped via
skip_cql_writes() in cql_table_large_data_handler.
Key implementation details:
- Three virtual table classes (large_partitions_virtual_table,
large_rows_virtual_table, large_cells_virtual_table) extend
streaming_virtual_table with cross-shard record collection.
- generate_legacy_id() gains a version parameter; virtual tables
use version 1 to get different UUIDs than the old physical tables.
- compaction_time is derived from SSTable generation UUID at display
time via UUID_gen::unix_timestamp().
- Legacy SSTables without LargeDataRecords emit synthetic summary
rows based on above_threshold > 0 in LargeDataStats.
- The activation logic uses two paths: when the feature is already
enabled (test env, restart), it runs as a coroutine; when not yet
enabled, it registers a when_enabled callback that runs inside
seastar::async from feature_service::enable().
- sstable_3_x_test updated to use a simplified large_data_test_handler
and validate LargeDataRecords in SSTable metadata directly.
Move the smp::invoke_on_all dispatch from the callers into
initialize_virtual_tables() itself, so the function is called
once from shard 0 and internally distributes the per-shard
virtual table setup to all shards.
This simplifies the callers and allows a single place to add
cross-shard coordination logic (e.g. feature-gated table
registration) in future commits.
Add three new test cases to sstable_3_x_test.cc that verify the
LargeDataRecords metadata written by the SSTable writer can be read
back after open_data():
- test_large_data_records_round_trip: verifies partition_size, row_size,
and cell_size records are written with correct field semantics when
thresholds are exceeded
- test_large_data_records_top_n_bounded: verifies the bounded min-heap
keeps only the top-N largest entries per type
- test_large_data_records_none_when_below_threshold: verifies no records
are written when data is below all thresholds
Also wire large_data_records_per_sstable from db_config into the test
env's sstables_manager::config so that config changes propagate through
the updateable_value chain to configure_writer().
The test environment was creating cql_config with hardcoded default values that
were never updated when system.config was modified via CQL. This broke tests
that dynamically change configuration values (e.g., TWCS tests).
Fix by creating cql_config from db::config using sharded_parameter, which
ensures updateable_value fields track the actual db::config sources and reflect
changes made during test execution.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Several tests in test_select_from_mutation_fragments.py assume that all
mutations end up in a single SSTable. This assumption can be violated
by background memtable flushes triggered by commitlog disk pressure.
Since the Scylla node is taken from a pool, it may carry unflushed data
from prior tests that prevents closed segments from being recycled,
thereby increasing the commitlog disk usage. A main source of such
pressure is keyspace-level flushes from earlier tests in this module,
which rotate commitlog segments without flushing system tables (e.g.,
`system.compaction_history`), leaving closed segments dirty.
Additionally, prior tests in the same module may have left unflushed
data on the shared test table (`test_table` fixture), keeping commitlog
segments dirty on its behalf as well. When commitlog disk usage exceeds
its threshold, the system flushes the test table to reclaim those
segments, potentially splitting a running test's mutations across
multiple SSTables.
This was observed in CI, where test_paging failed because its data was
split across two SSTables, resulting in more mutation fragments than the
hardcoded expected count.
This patch fixes the affected tests in two ways:
1. Where possible, tests are reworked to not assume a single SSTable:
- test_paging
- test_slicing_rows
- test_many_partition_scan
2. Where rework is impractical, major compaction is added after writes
and before validation to ensure that only one SSTable will exist:
- test_smoke
- test_count
- test_metadata_and_value
- test_slicing_range_tombstone_changes
Fixes SCYLLADB-1375.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#29389
Add a test that drops a table while tablet streaming is running for the
table. The table is dropped after taking the storage snapshot and
initializating streaming sources - after that streaming should be able
to complete or abort correctly if the table is dropped. We want to
verify there is no incorrect access to the destroyed table.
The test tests both types of streaming in stream_blob - sstables
and logstor segments.
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any token, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
We can also split and merge incrementally individual tablets.
Currently, we do it for the whole table or nothing, which makes
splits and merges take longer and cause wide swings of the count.
This is not implemented in this PR yet, we still split/merge the whole table.
Another reason is vnode to tablets migration. We now could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from CQL-coordinator point of view.
Tablet count is still a power-of-two by default for newly created tables.
It may be different if tablet map is created by non-standard means,
or if per-table tablet option "pow2_count" is set to "false".
build/release/scylla perf-tablets:
Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%)
Before:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 57456 KiB
Copied in 0.014346 [ms]
Cleared in 0.002698 [ms]
Saved in 1234.685303 [ms]
Read in 445.577881 [ms]
Read mutations in 299.596313 [ms] 128 mutations
Read required hosts in 247.482742 [ms]
Size of canonical mutations: 33.945053 [MiB]
Disk space used by system.tablets: 1.456761 [MiB]
Tablet metadata reload:
full 407.69ms
partial 2.65ms
```
After:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 59504 KiB
Copied in 0.032475 [ms]
Cleared in 0.002965 [ms]
Saved in 1093.877441 [ms]
Read in 387.027100 [ms]
Read mutations in 255.752121 [ms] 128 mutations
Read required hosts in 211.202805 [ms]
Size of canonical mutations: 33.954453 [MiB]
Disk space used by system.tablets: 1.450162 [MiB]
Tablet metadata reload:
full 354.50ms
partial 2.19ms
```
Closesscylladb/scylladb#28459
* github.com:scylladb/scylladb:
test: boost: tablets: Add test for merge with arbitrary tablet count
tablets, database: Advertise 'arbitrary' layout in snapshot manifest
tablets: Introduce pow2_count per-table tablet option
tablets: Prepare for non-power-of-two tablet count
tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
tablets: Prepare resize_decision to hold data in decisions
tablets: table: Make storage_group handle arbitrary merge boundaries
tablets: Make stats update post-merge work with arbitrary merge boundaries
locator: tablets: Support arbitrary tablet boundaries
locator: tablets: Introduce tablet_map::get_split_token()
dht: Introduce get_uniform_tokens()
Commit c17c4806a1 removed check_for_endpoint_collision() from
the fresh bootstrap path, which was the only code path that
called do_shadow_round() for new nodes. Since the gossip shadow
round is no longer executed during bootstrap, remove the
stop_during_gossip_shadow_round error injection from the test.
The entry is marked as REMOVED_ rather than deleted to preserve
the shuffle order for seed-based test reproducibility.
The injection point in gms/gossiper.cc is also removed since it
is no longer used by any test.
Fixes: SCYLLADB-1466
Closesscylladb/scylladb#29460