In a lambda returned from make_streaming_consumer() there's a check for
current scheudling group being streaming one. It came from #17090 where
streaming code was launched in wrong sched group thus affecting user
groups in a bad way.
The check is nice and useful, but it abuses replica::database by getting
unrelated information from it.
To preserve the check and to stop using database as provider of configs,
keep the streaming scheduling group handle in the debug namespace. This
emphasises that this global variable is purely for debugging purposes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28410
One of the best features of the pytest framework is "assertion
rewriting": If your test does for example "assert a + 1 == b", the
assertion is "rewritten" so that if it fails it tells you not only
that "a+1" and "b" are not equal, what the non-equal values are,
how they are not equal (e.g., find different elements of arrays) and
how each side of the equality was calculated.
But pytest can only "rewrite" assertion that it sees. If you call a
utility function checksomething() from another module and that utility
function calls assert, it will not be able to rewrite it, and you'll
get ugly, hard-to-debug, assertion failures.
This problem is especially noticable in tests we translated from
Cassandra, in test/cqlpy/cassandra_tests. Those tests use a bunch of
assertion-performing utility functions like assertRows() et al.
Those utility functions are defined in a separate source file,
porting.py, so by default do not get their assertions rewritten.
We had a solution for this: test/cqlpy/cassandra_test/__init__.py had:
pytest.register_assert_rewrite("cassandra_tests.porting")
This tells pytest to rewrite assertions in porting.py the first time
that it is imported.
It used to work well, but recently it stopped working. This is because
we change the module paths recently, and it should be written as
test.cqlpy.cassandra_tests.porting.
I verified by editing one of the cassandra_tests to make a bad check
that indeed this statement stopped working, and fixing the module
path in this way solves it, and makes assertion rewriting work
again.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28411
During test.py run, noticed this warning:
```
10:38:22 test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: 32 warnings
10:38:22 /jenkins/workspace/releng-testing/scylla-ci/scylla/test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: test.cqlpy.cassandra_tests.porting
10:38:22 pytest.register_assert_rewrite('test.cqlpy.cassandra_tests.porting')
```
The insert_update_if_condition_test.py was calling
pytest.register_assert_rewrite() for the porting module, but this
registration is already handled by cassandra_tests/__init__.py which
is automatically loaded before any test runs.
Closesscylladb/scylladb#28409
Enhance the skip_mode marker to accept either a single mode string
or a list of modes, allowing tests to be skipped across multiple
build configurations with a single marker.
Before:
@pytest.mark.skip_mode("dev", reason="...")
@pytest.mark.skip_mode("debug", reason="...")
After:
@pytest.mark.skip_mode(["dev", "debug"], reason="...")
This reduces duplication when the same skip condition applies
to multiple build modes.
Closesscylladb/scylladb#28406
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.
We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.
Fixes#28385Closesscylladb/scylladb#28388
This PR reduces the runtime of `test_out_of_space_prevention.py` by addressing two main sources of overhead: slow “critical utilization” setup and delayed tablet load stats propagation. Combined, these changes cut the module’s total execution time from 324s to 185s.
Improvements. No backup is required.
Closesscylladb/scylladb#28396
* github.com:scylladb/scylladb:
test/storage: speed up out-of-space prevention tests by using smaller volumes
test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests
As a next step of migration to the pytest runner, this PR moves
responsibility for nodetool tests execution solely to the pytest.
Closesscylladb/scylladb#28348
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.
Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request
Fixes: CUSTOMER-96
Fixes: CUSTOMER-139
Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3
Closesscylladb/scylladb#27891
* github.com:scylladb/scylladb:
connection_factory: includes cleanup
dns_connection_factory: refine the move constructor
connection_factory: retry on failure
connection_factory: introduce TTL timer
connection_factory: get rid of shared_future in dns_connection_factory
connection_factory: extract connection logic into a member
connection_factory: remove unnecessary `else`
connection_factory: use all resolved DNS addresses
s3_test: remove client double-close
Tests in test_out_of_space_prevention.py spend a large fraction of
time creating a random “blob” file to cross the 0.8 critical disk
utilization threshold. With 100MB volumes this requires writing
~70–80MB of data, which is slow inside Docker/Podman-backed volumes.
Most tests only use ~11MB of data, so large volumes are unnecessary.
Reduce the test volume size to 20MB so the critical threshold is
reached at ~16MB and the blob file is much smaller.
This cuts ~5–6s per test.
Set `--tablet-load-stats-refresh-interval-in-seconds=1` for this module’s
clusters applicable to all tests. This significantly reduces runtime
for the slowest cases:
- test_reject_split_compaction: 75.62s -> 23.04s
- test_split_compaction_not_triggered: 69.36s -> 22.98s
consistent_cluster_management is deprecated since scylla-5.2 and no
longer used by Scylladb, so it should not be used by test either.
Closesscylladb/scylladb#28340
These two streams mostly play together. The former provides an input_stream from read from in-memory temporary buffers, the latter wraps it to limit the size of provided temporary buffers. Both are used to test contiguous data consumer, also the buffer_input_stream has a caller in sstables reversing reader.
This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib.
Enanching testing code, not backporting
Closesscylladb/scylladb#28352
* github.com:scylladb/scylladb:
code: Move limiting data source to test/lib
util: Simplify limiting_data_source API
util: Remove buffer_input_stream
test: Use seastar::util::temporary_buffer_data_source in data consumer test
sstables: Use seastar::util::as_input_stream() in mx reader
This compaction group testing is useless because the machinery for it
to work was removed. This was useful in the early tablet days, where
we wanted to test compaction groups directly. Today groups are stressed
and tested on every tablet test.
I see a ~40% reduction time after this patch, since database_test is
one of the most (if not the most) time consuming in boost suite.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28324
Filter the content of sstable(s), including or excluding the specified partitions. Partitions can be provided on the command line via `--partition`, or in a file via `--partitions-file`. Produces one output sstable per input sstable -- if the filter selects at least one partition in the respective input sstable. Output sstables are placed in the path provided via `--oputput-dir`. Use `--merge` to filter all input sstables combined, producing one output sstable.
Fixes: #13076
New functionality, no backport.
Closesscylladb/scylladb#27836
* github.com:scylladb/scylladb:
tools/scylla-sstable: introduce filter command
tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir
tools/scylla-sstable: make partition_set ordered
tools/scylla-stable: remove unused boost/algorithm/string.hpp include
Generally square brackets are non allowed in URI, while pytest uses it
the test name to show that there were additional parameters for the same
test. When such a test fail it shows the directory correctly in Jenkins,
however attempt to download only this will fail, because of the square
brackets in URI. This change substitute the square brackets with round
brackets.
Closesscylladb/scylladb#28226
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).
Ref #18066.
Closesscylladb/scylladb#28336
Only two tests use it now -- the limit-data-source-test iself and a test
that validates continuous_data_consumer template.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The source maintains "limit generator" -- a function that returns the
maximum size of bytes to return from the next buffer.
Currently all callers just return constant numbers from it. Passing a
function that returns non-constant one can, probably, be used for a
fuzzy test, but even the limiting-data-source-test itself doesn't do it,
so what's the point...
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test creates buffer_data_source_impl and wraps it with limiting data
source. The former data_source duplicates the functionality of the
existing seastar temporary_buffer_data_source.
This patch makes the test code use seastar facility. The
buffer_data_source_impl will be removed soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This PR refactors the streaming subsystem to support direct download of fully contained sstables. Instead of streaming these files, they are downloaded and attached directly to their corresponding tables. This approach reduces overhead, simplifies logic, and improves efficiency. Expected node scope restore performance improvement: ~4 times faster in best case scenario when all sstables are fully contained.
1. Add storage options field to sstable Introduce a data member to store storage options, enabling distinction between local and object storage types.
2. Add method to create component source Extend the storage interface with a public method to create a data_source for any sstable component.
3. Inline streamer instance creation Remove make_sstable_streamer and inline its usage to allow different sets of arguments at call sites.
4. Skip streaming empty sstable sets Avoid unnecessary streaming calls when the sstable set is empty.
5. Enable direct download of contained sstables Replace streaming of fully contained sstables with direct download, attaching them to their corresponding table.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-200
Refs: https://github.com/scylladb/scylladb/issues/23908
No need to backport as this code targets 2026.2 release (for tablet-aware restore)
Closesscylladb/scylladb#26834
* github.com:scylladb/scylladb:
tests: reuse test_backup_broken_streaming
streaming: enable direct download of contained sstables
storage: add method to create component source
streaming: keep sharded database reference on tablet_sstable_streamer
streaming: skip streaming empty sstable sets
streaming: inline streamer instance creation
tests: fix incorrect backup/restore test flow
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.
Closesscylladb/scylladb#28240
In this PR, we fix two bugs present in `boost_test_tree_lister` that
affected the output of `--list_json_content` added in
scylladb/scylladb@afde5f668a:
* The labels test units use were duplicated in the output.
* If a test suite or a test file didn't contain any tests, it wasn't
listed in the output.
Refs scylladb/scylladb#25415
Backport: not needed. The code hasn't been used anywhere yet.
Closesscylladb/scylladb#28255
* github.com:scylladb/scylladb:
test/lib/boost_test_tree_lister.cc: Record empty test suites
test/lib/boost_test_tree_lister.cc: Deduplicate labels
Finishing the deprecation of the skip_mode function in favor of
pytest.mark.skip_mode. This PR is only cleaning and migrating leftover tests
that are still used and old way of skip_mode.
Closesscylladb/scylladb#28299
Instead of streaming fully contained sstables, download them directly
and attach them to their corresponding table. This simplifies the
process and avoids unnecessary streaming overhead.
There are two checks for live endpoints performed in test_gossiper.py,
but one of those sits in test_gossiper_unreachable_endpoints somehow.
This patch moves live endpoints check into live endpoints test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28224
This series implements rescoring algorithm.
Index options allowing to enable this functionality were introduced in earlier PR https://github.com/scylladb/scylladb/pull/28165.
When Vector Index has enabled quantization, Vector Store uses reduced vector representation to save memory, but it may degrade correctness of ANN queries. For quantized index we can enable rescoring algorithm, which recalculates similarity score from full vector representation stored in Scylla and reorder returned result set.
It works also with oversampling - we fetch more candidates from Vector Store, rescore them at Scylla and return only requested number of results.
Example:
Creating a Vector Index with Rescoring
```sql
-- Create a table with a vector column
CREATE TABLE ks.products (
id int PRIMARY KEY,
embedding vector<float, 128>
);
-- Create a vector index with rescoring enabled
CREATE INDEX products_embedding_idx ON ks.products (embedding)
USING 'vector_index'
WITH OPTIONS = {
'similarity_function': 'cosine',
'quantization': 'i8',
'oversampling': '2.0',
'rescoring': 'true'
};
```
1. **Quantization** (`i8`) compresses vectors in the index, reducing memory usage but introducing precision loss in distance calculations
2. **Oversampling** (`2.0`) retrieves 2× more candidates than requested from the vector store (e.g., `LIMIT 10` fetches 20 candidates)
3. **Rescoring** (`true`) recalculates similarity scores using full-precision (`f32`) vectors from the base table and re-ranks results
Query example:
```sql
-- Find 10 most similar products
SELECT id, similarity_cosine(embedding, [0.1, 0.2, ...]) AS score
FROM ks.products
ORDER BY embedding ANN OF [0.1, 0.2, ...]
LIMIT 10;
```
With rescoring enabled, the query:
1. Fetches 20 candidates from the quantized index (due to oversampling=2.0)
2. Reads full-precision embeddings from the base table
3. Recalculates similarity scores with full precision
4. Re-ranks and returns the top 10 results
In this implementation we use CQL similarity function implementation to calculate new score values and use them in post query ordering. We add that column manually to selection, but it has to be removed from the final response.
Follow-up https://github.com/scylladb/scylladb/pull/28165
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83
New feature - doesn't need backport.
Closesscylladb/scylladb#27769
* github.com:scylladb/scylladb:
vector_index: rescoring: Fetch oversampled rows
vector_index: rescoring: Sort by similarity column
select_statement: Modify `needs_post_query_ordering` condition
vector_index: rescoring: Add hidden similarity score column
vector_index: Refactor extracting ANN query information
The storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky.
Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the HTTP request that releases paxos_state_learn_after_mutate is guaranteed to complete before the node is shut down.
Fixesscylladb/scylladb#28260
backport: 2025.4, the `test_lwt_shutdown` test was introduced in this version
Closesscylladb/scylladb#28315
* https://github.com/scylladb/scylladb:
storage_proxy: drop stop() method
test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
It was observed twice that the test times out in debug mode.
Fix by increasing the timeout.
The test never expects a timeout, so increasing it won't increase
the test duration.
Fixes#28028Closesscylladb/scylladb#28272
- Pass pytest request fixture into coro_task (used for scylla_tmp_dir
and core dump path)
- Rename duplicate `test_sstable_summary` that runs sstable-index-cache
to `test_sstable_index_cache` so both tests are collected
Refs https://github.com/scylladb/scylladb/issues/22501Closesscylladb/scylladb#28286
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.
The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
Fixes https://github.com/scylladb/scylladb/issues/28214
backport to 2025.4 to add RF-rack-valid enforcements in alternator
Closesscylladb/scylladb#28154
* github.com:scylladb/scylladb:
locator: document the exception type of assert_rf_rack_valid_keyspace
alternator: don't require rf_rack flag for indexes, validate instead
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.
This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.
The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)
and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.
Fixes: scylladb/scylladb#27893
A minor change, no need to backport.
Closesscylladb/scylladb#27894
* github.com:scylladb/scylladb:
db: fail reads and writes with local consistencty level to a DC with RF=0
db: consistency_level: split `local_quorum_for()`
db: consistency_level: fix nrs -> nts abbreviation
storage_proxy::stop() is not called by main (it is commented out due to #293),
so the corresponding message injection is never hit. When the test releases
paxos_state_learn_after_mutate, shutdown may already be in progress or even
completed by the time we try to trigger the storage_proxy::stop injection,
which makes the test flaky.
Fix this by completely removing the storage_proxy::stop injection.
The injection is not required for test correctness. Shutdown must wait for the
background LWT learn to finish, which is released via the
paxos_state_learn_after_mutate injection.
The shutdown process blocks on in-flight api HTTP requests through
seastar::httpd::http_server::stop and its _task_gate, so the
shutdown will not prevent the HTTP request that released the
paxos_state_learn_after_mutate from completing successfully.
Fixesscylladb/scylladb#28260
During rewrite --extra-scylla-cmdline-options was missed and it was not passed to the tests that are using pytest. The issue that there were no possibility to pass these parameters via cmd to the Scylla, while tests were not affected because they were using the parameters from the yaml file.
This PR fixes this issue so it will be easier to modify the Scylla start parameters without modifying code.
No backport needed, only framework enhancement.
Closesscylladb/scylladb#28156
* github.com:scylladb/scylladb:
test.py: do not crash when there is no boost log
test.py: pass correctly extra cmd line arguments
Before this change, the test function `_verify_tasks_processed_metrics`
verified that after service level reconfiguration, a given number of
`scylla_scheduler_tasks_processed` were processed by a given scheduling
group. Moreover, the check verified that another scheduling group
didn't process a high number of requests. The second check was vulnerable
to flakiness, because sometimes additional load caused extensive work
in the second scheduling group (e.g. password hashing in `sl:driver`
due to new connections being created).
To avoid test failures, this commit changes which metric is verified:
instead of `scylla_scheduler_tasks_processed`, the metric
`scylla_transport_cql_requests_count` is checked. This prevents similar
problems, because there is no reason for a high number of
requests to be processed by the second scheduling group. Moreover,
it allows decreasing the number of requests that are sent for
verification, and thus speeds up the test.
Fixes: scylladb/scylladb#27715Closesscylladb/scylladb#28318
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.
Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.
The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.
Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)
One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
- strongly-consistent-tables
```
cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```
Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56
backport: no need
Closesscylladb/scylladb#27614
* https://github.com/scylladb/scylladb:
test_encryption: capture stderr
test/cluster: add test_strong_consistency.py
raft_group_registry: disable metrics for non-0 groups
strong consistency: implement select_statement::do_execute()
cql: add select_statement.cc
strong consistency: implement coordinator::query()
cql: add modification_statement
cql: add statement_helpers
strong consistency: implement coordinator::mutate()
raft.hh: make server::wait_for_leader() public
strong_consistency: add coordinator
modification_statement: make get_timeout public
strong_consistency: add groups_manager
strong_consistency: add state_machine and raft_command
table: add get_max_timestamp_for_tablet
tablets: generate raft group_id-s for new table
tablet_replication_strategy: add consistency field
tablets: add raft_group_id
modification_statement: remove virtual where it's not needed
modification_statement: inline prepare_statement()
system_keyspace: disable tablet_balancing for strongly_consistent_tables
cql: rename strongly_consistent statements to broadcast statements
Filter the content of sstable(s), including or excluding the specified
partitions. Partitions can be provided on the command line via
`--partition`, or in a file via `--partitions-file`.
Produces one output sstable per input sstable -- if the filter selects
at least one partition in the respective input sstable.
Output sstables are placed in the path provided via `--oputput-dir`.
Use `--merge` to filter all input sstables combined, producing one
output sstable.
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.
The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
So far with oversampling the extended set of keys was returned from VS,
but query to the base table was still limited by the query `limit`.
Now for rescoring we want to fetch rows for all the keys returned from VS.
However later we need to restore the command limit, to trim result_set accordingly.
For non-rescoring scenarios we trim directly keys set returned from VS if it happens to exceed query limit.
With this change rescoring validation tests (except `no_nulls_in_rescored_results`) pass fully.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83
Our plan for rescoring is to use the existing post-query ordering mechanism to sort (and trim) result_set by similarity column.
For general SELECT case this ordering is permitted only for queries with IN on the partition key and an ORDER BY, which is checked in `needs_post_query_ordering`.
Recently this check was overriden for ANN queries in https://github.com/scylladb/scylladb/pull/28109 to enable IN queries handled by VS without excessive post-processing.
In this patch we revert that change - ANN case will be handled by general check.
However we change the condition - we will enable post processing anytime `_ordering_comparator` is set.
In current implementation `_ordering_comparator` is created only in `select_statement::prepare` with `get_ordering_comparator`,
only for the same conditions as were checked in `needs_post_query_ordering`, so this change should be transparent for general SELECT.
For ANN query it is also not set (yet), so it will not influence ANN filtering, but we confirm that this functionality still works
by adding filtering test: `test/vector_search/filter_test.cc::vector_store_client_test_filtering_ann_cql`.
Rescoring ordering for ANN queries will be enabled when we add `_ordering_comparator` in following patch.
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.
Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.
Flow of decommission/removenode request:
1) pending and paused, has tablet replicas on target node.
Tablet scheduler will start draining tablets.
2) No tablets on target node, request is pending but not paused
3) Request is scheduled, node is in transition
4) Request is done
Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.
When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).
Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.
Fixes#21452Closesscylladb/scylladb#24129
* https://github.com/scylladb/scylladb:
docs: Document parallel decommission and removenode and relevant task API
test: Add tests for parallel decommission/removenode
test: util: Introduce ensure_group0_leader_on()
test: tablets: Check that there are no migrations scheduled on draining nodes
test: lib: topology_builder: Introduce add_draining_request()
topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
tablets: topology_coordinator: Refactor to propagate reason for migration rollback
tablet_allocator: Skip co-location on draining nodes
node_ops: task_manager_module: Populate entity field also for active requests
tasks: node_ops: Put node id in the entity field
tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
topology: Protect against empty cancelation reason
tasks, topology: Make pending node operations abortable
doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
raft_topology, tablets: Drain tablets in parallel with other topology operations
virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
locator: topology: Add "draining" flag to a node
topology_coordinator: Extract generate_cancel_request_update()
storage_service: Drop dependency in topology_state_machine.hh in the header
locator: Extract common code in assert_rf_rack_valid_keyspace()
topology_coordinator, storage_service: Validate node removal/decommission at request submission time
This flag was added to operations which have an --output-dir
command-line arguments. These operations write sstables and need a
directory where to write them. Back in the numeric-generation world this
posed a problem: if the directory contained any sstable, generation
clash was almost guaranteed, because each scylla-sstable command
invokation would start output generations from 1. To avoid this, empty
output directory was a requirement, with the
--unsafe-accept-nonempty-output-dir allowing for a force-override.
Now in the timeuuid generation days, all this is not necessary anymore:
generations are unique, so it is not a problem if the output directory
already contains sstables: the probability of generation clash is almost
0. Even if it happens, the tool will just simply fail to write the new
sstable with the clashing generation.
Remove this historic relic of a flag and the related logic, it is just a
pointless nuissance nowadays.
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.
This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.
The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)
and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.
Fixes: scylladb/scylladb#27893
When one request is super slow and req/s high
in theory we have a collision on id, this patch
avoids that by reusing id and aborting when there
is no free one (unlikely).
This commit avoids leaking seastar::async future from two benchmark
tools: perf-alternator and perf-cql-raw. Additionally it adds
abort_source for fast and clean shutdown.