Compare commits

...

196 Commits

Author SHA1 Message Date
Botond Dénes
edda66886e Merge 'Replace sstable_stream_source_impl's internal sink and source with seastar utilities' from Pavel Emelyanov
The class in question has internal implementations of in-memory data_sink_impl and data_source_impl. In seastar there's generic implementation of the same facilities. From the "code re-use" perspective it makes sense to use both. TODO-s in Scylla code supports that.

Using newer seastar facilities, not backporting.

Closes scylladb/scylladb#28321

* github.com:scylladb/scylladb:
  sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
  sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
2026-01-23 16:55:18 +02:00
Pavel Emelyanov
3e09d3cc97 test: Keep test_gossiper_live_endpoints checks togethger
There are two checks for live endpoints performed in test_gossiper.py,
but one of those sits in test_gossiper_unreachable_endpoints somehow.
This patch moves live endpoints check into live endpoints test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28224
2026-01-23 16:53:48 +02:00
Piotr Dulikowski
3ec4f67407 Merge 'vector_index: Implement rescoring' from Szymon Malewski
This series implements rescoring algorithm.

Index options allowing to enable this functionality were introduced in earlier PR https://github.com/scylladb/scylladb/pull/28165.

When Vector Index has enabled quantization, Vector Store uses reduced vector representation to save memory, but it may degrade correctness of ANN queries. For quantized index we can enable rescoring algorithm, which recalculates similarity score from full vector representation stored in Scylla and reorder returned result set.
It works also with oversampling - we fetch more candidates from Vector Store, rescore them at Scylla and return only requested number of results.

Example:

Creating a Vector Index with Rescoring

```sql
-- Create a table with a vector column
CREATE TABLE ks.products (
    id int PRIMARY KEY,
    embedding vector<float, 128>
);

-- Create a vector index with rescoring enabled
CREATE INDEX products_embedding_idx ON ks.products (embedding)
    USING 'vector_index'
    WITH OPTIONS = {
        'similarity_function': 'cosine',
        'quantization': 'i8',
        'oversampling': '2.0',
        'rescoring': 'true'
    };
```

1. **Quantization** (`i8`) compresses vectors in the index, reducing memory usage but introducing precision loss in distance calculations
2. **Oversampling** (`2.0`) retrieves 2× more candidates than requested from the vector store (e.g., `LIMIT 10` fetches 20 candidates)
3. **Rescoring** (`true`) recalculates similarity scores using full-precision (`f32`) vectors from the base table and re-ranks results

Query example:

```sql
-- Find 10 most similar products
SELECT id, similarity_cosine(embedding, [0.1, 0.2, ...]) AS score
FROM ks.products
ORDER BY embedding ANN OF [0.1, 0.2, ...]
LIMIT 10;
```

With rescoring enabled, the query:
1. Fetches 20 candidates from the quantized index (due to oversampling=2.0)
2. Reads full-precision embeddings from the base table
3. Recalculates similarity scores with full precision
4. Re-ranks and returns the top 10 results

In this implementation we use CQL similarity function implementation to calculate new score values and use them in post query ordering. We add that column manually to selection, but it has to be removed from the final response.

Follow-up https://github.com/scylladb/scylladb/pull/28165
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83

New feature - doesn't need backport.

Closes scylladb/scylladb#27769

* github.com:scylladb/scylladb:
  vector_index: rescoring: Fetch oversampled rows
  vector_index: rescoring: Sort by similarity column
  select_statement: Modify `needs_post_query_ordering` condition
  vector_index: rescoring: Add hidden similarity score column
  vector_index: Refactor extracting ANN query information
2026-01-23 15:20:10 +01:00
Patryk Jędrzejczak
a41d7a9240 Merge 'test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection' from Petr Gusev
The storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the HTTP request that releases paxos_state_learn_after_mutate is guaranteed to complete before the node is shut down.

Fixes scylladb/scylladb#28260

backport: 2025.4, the `test_lwt_shutdown` test was introduced in this version

Closes scylladb/scylladb#28315

* https://github.com/scylladb/scylladb:
  storage_proxy: drop stop() method
  test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
2026-01-23 15:18:17 +01:00
Avi Kivity
30d6f3b8e0 test: test_proxy_protocol: bump timeout
It was observed twice that the test times out in debug mode.
Fix by increasing the timeout.

The test never expects a timeout, so increasing it won't increase
the test duration.

Fixes #28028

Closes scylladb/scylladb#28272
2026-01-23 15:37:00 +02:00
Łukasz Paszkowski
09fde82a33 test/scylla_gdb: fix coro_task request usage, rename duplicate test
- Pass pytest request fixture into coro_task (used for scylla_tmp_dir
  and core dump path)
- Rename duplicate `test_sstable_summary` that runs sstable-index-cache
  to `test_sstable_index_cache` so both tests are collected

Refs https://github.com/scylladb/scylladb/issues/22501

Closes scylladb/scylladb#28286
2026-01-23 15:25:58 +02:00
Pavel Emelyanov
4a307d931a sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
The former accumulates sstable writer writes into a vector of temporary
buffers. In seastar there's a generic memory data sink that provides a
sink to accumulate stream of bytes into any container.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-23 14:22:22 +03:00
Pavel Emelyanov
97b1340a68 sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
The latter is used to wrap vector of buffers into an input_stream.
Seastar already provides the very same functionality with the
convenience as_input_stream() helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-23 14:21:14 +03:00
Piotr Dulikowski
fe9237fdc9 Merge 'alternator: don't require rf_rack flag for indexes, validate instead' from Michael Litvak
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.

The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.

Fixes https://github.com/scylladb/scylladb/issues/28214

backport to 2025.4 to add RF-rack-valid enforcements in alternator

Closes scylladb/scylladb#28154

* github.com:scylladb/scylladb:
  locator: document the exception type of assert_rf_rack_valid_keyspace
  alternator: don't require rf_rack flag for indexes, validate instead
2026-01-23 11:49:02 +01:00
Botond Dénes
c4c2f87be7 Merge 'db: fail reads and writes with local consistency level to a DC with RF=0' from null
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893

A minor change, no need to backport.

Closes scylladb/scylladb#27894

* github.com:scylladb/scylladb:
  db: fail reads and writes with local consistencty level to a DC with RF=0
  db: consistency_level: split `local_quorum_for()`
  db: consistency_level: fix nrs -> nts abbreviation
2026-01-23 12:36:20 +02:00
Petr Gusev
c45244b235 storage_proxy: drop stop() method
It's not called by main.cc and can be confusing.
2026-01-23 11:22:03 +01:00
Petr Gusev
f5ed3e9fea test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
storage_proxy::stop() is not called by main (it is commented out due to #293),
so the corresponding message injection is never hit. When the test releases
paxos_state_learn_after_mutate, shutdown may already be in progress or even
completed by the time we try to trigger the storage_proxy::stop injection,
which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection.
The injection is not required for test correctness. Shutdown must wait for the
background LWT learn to finish, which is released via the
paxos_state_learn_after_mutate injection.

The shutdown process blocks on in-flight api HTTP requests through
seastar::httpd::http_server::stop and its _task_gate, so the
shutdown will not prevent the HTTP request that released the
paxos_state_learn_after_mutate from completing successfully.

Fixes scylladb/scylladb#28260
2026-01-23 11:20:36 +01:00
Botond Dénes
35c9a00275 Merge 'test.py: pass correctly extra cmd line arguments' from Andrei Chekun
During rewrite --extra-scylla-cmdline-options was missed and it was not passed to the tests that are using pytest. The issue that there were no possibility to pass these parameters via cmd to the Scylla, while tests were not affected because they were using the parameters from the yaml file.
This PR fixes this issue so it will be easier to modify the Scylla start parameters without modifying code.

No backport needed, only framework enhancement.

Closes scylladb/scylladb#28156

* github.com:scylladb/scylladb:
  test.py: do not crash when there is no boost log
  test.py: pass correctly extra cmd line arguments
2026-01-23 11:26:01 +02:00
Andrzej Jackowski
c493a66668 test: check cql_requests_count instead of tasks_processed in SL
Before this change, the test function `_verify_tasks_processed_metrics`
verified that after service level reconfiguration, a given number of
`scylla_scheduler_tasks_processed` were processed by a given scheduling
group. Moreover, the check verified that another scheduling group
didn't process a high number of requests. The second check was vulnerable
to flakiness, because sometimes additional load caused extensive work
in the second scheduling group (e.g. password hashing in `sl:driver`
due to new connections being created).

To avoid test failures, this commit changes which metric is verified:
instead of `scylla_scheduler_tasks_processed`, the metric
`scylla_transport_cql_requests_count` is checked. This prevents similar
problems, because there is no reason for a high number of
requests to be processed by the second scheduling group. Moreover,
it allows decreasing the number of requests that are sent for
verification, and thus speeds up the test.

Fixes: scylladb/scylladb#27715

Closes scylladb/scylladb#28318
2026-01-23 10:19:29 +01:00
Patryk Jędrzejczak
4e984139b2 Merge 'strongly consistent tables: basic implementation' from Petr Gusev
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.

Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.

The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.

Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)

One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
  - strongly-consistent-tables
```

cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```

Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56

backport: no need

Closes scylladb/scylladb#27614

* https://github.com/scylladb/scylladb:
  test_encryption: capture stderr
  test/cluster: add test_strong_consistency.py
  raft_group_registry: disable metrics for non-0 groups
  strong consistency: implement select_statement::do_execute()
  cql: add select_statement.cc
  strong consistency: implement coordinator::query()
  cql: add modification_statement
  cql: add statement_helpers
  strong consistency: implement coordinator::mutate()
  raft.hh: make server::wait_for_leader() public
  strong_consistency: add coordinator
  modification_statement: make get_timeout public
  strong_consistency: add groups_manager
  strong_consistency: add state_machine and raft_command
  table: add get_max_timestamp_for_tablet
  tablets: generate raft group_id-s for new table
  tablet_replication_strategy: add consistency field
  tablets: add raft_group_id
  modification_statement: remove virtual where it's not needed
  modification_statement: inline prepare_statement()
  system_keyspace: disable tablet_balancing for strongly_consistent_tables
  cql: rename strongly_consistent statements to broadcast statements
2026-01-23 09:52:33 +01:00
Anna Stuchlik
c681b3363d doc: remove an outdated KB
This page was created for very outdated versions of ScyllaDB (5.1 (ScyllaDB Open Source) and 2022.2 (ScyllaDB Enterprise) to ensure smooth upgrade to that versions.

We no longer need that document.

Fixes https://github.com/scylladb/scylladb/issues/28264

Closes scylladb/scylladb#28266
2026-01-22 18:27:14 +01:00
Szymon Wasik
927aebef37 Add vector search documentation links to CQL docs
This patch adds links to the Vector Search documentation that is hosted
together with Scylla Cloud docs to the CQL documentation.
It also make the note about supported capabilities consistent and
removes the experimental label as the feature is GAed.

Fixes: SCYLLADB-371

Closes scylladb/scylladb#28312
2026-01-22 16:46:38 +01:00
Michael Litvak
d5009882c6 locator: document the exception type of assert_rf_rack_valid_keyspace
The function assert_rf_rack_valid_keyspace uses the exception type
std::invalid_argument when the RF-rack validation fails. Document it and
change all callers to catch this specific exception type when checking
for RF-rack validation failures, so that other exception types can be
propagated properly.
2026-01-22 16:11:35 +01:00
Michael Litvak
1f7a65904e alternator: don't require rf_rack flag for indexes, validate instead
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.

The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.
2026-01-22 16:11:35 +01:00
Szymon Malewski
29d090845a vector_index: rescoring: Fetch oversampled rows
So far with oversampling the extended set of keys was returned from VS,
but query to the base table was still limited by the query `limit`.
Now for rescoring we want to fetch rows for all the keys returned from VS.
However later we need to restore the command limit, to trim result_set accordingly.
For non-rescoring scenarios we trim directly keys set returned from VS if it happens to exceed query limit.

With this change rescoring validation tests (except `no_nulls_in_rescored_results`) pass fully.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83
2026-01-22 15:38:44 +01:00
Szymon Malewski
0bc95bcf87 vector_index: rescoring: Sort by similarity column
This patch implements second part of rescoring - ordering results by similarity column added in earlier patch.
For this purpose in this patch we define `_ordering_comparator`, which enables pre-existing post-query ordering functionality.

However, none additional test passes yet, as they include ovesampling, which will be the subject of following patches.
2026-01-22 15:38:44 +01:00
Szymon Malewski
57e7a4fa4f select_statement: Modify needs_post_query_ordering condition
Our plan for rescoring is to use the existing post-query ordering mechanism to sort (and trim) result_set by similarity column.
For general SELECT case this ordering is permitted only for queries with IN on the partition key and an ORDER BY, which is checked in `needs_post_query_ordering`.
Recently this check was overriden for ANN queries in https://github.com/scylladb/scylladb/pull/28109 to enable IN queries handled by VS without excessive post-processing.

In this patch we revert that change - ANN case will be handled by general check.
However we change the condition - we will enable post processing anytime `_ordering_comparator` is set.

In current implementation `_ordering_comparator` is created only in `select_statement::prepare` with `get_ordering_comparator`,
only for the same conditions as were checked in `needs_post_query_ordering`, so this change should be transparent for general SELECT.
For ANN query it is also not set (yet), so it will not influence ANN filtering, but we confirm that this functionality still works
by adding filtering test: `test/vector_search/filter_test.cc::vector_store_client_test_filtering_ann_cql`.

Rescoring ordering for ANN queries will be enabled when we add `_ordering_comparator` in following patch.
2026-01-22 15:38:44 +01:00
Szymon Malewski
c89957b725 vector_index: rescoring: Add hidden similarity score column
Rescoring consist of recalculating similarity score and reordering results based on it.
In this patch we add calculation of similarity score as a hidden (non-serialized) column and following patch will add reordering.
Normal ordering uses `add_column_for_post_processing`, however this works only for regular columns, not function.
So we create it together with user requested columns (this also forces the use of `selection_with_processing`) and hide the column later.
This also requires special handling for 'SELECT *' case - we need to manually add all columns before adding similarity column.

In case user already asks for similarity score in the SELECT clause, this value will be calculated twice - is should be optimized in future patches.
2026-01-22 15:38:40 +01:00
Anna Stuchlik
0aa881f190 doc: add the info about Alternator ports to the Admin Guide
Fixes https://github.com/scylladb/scylladb/issues/23706

Closes scylladb/scylladb#27724
2026-01-22 16:10:58 +03:00
Israel Fruchter
6b3ce5fdcc dist: scylla_coredump_setup: force unmount /var/lib/systemd/coredump before setup
When setting up coredump handling, if there are old mounts in a deleted state (e.g. from an older installation),
systemd might fail to activate the new `.mount` unit properly because it assumes the path is already mounted.

Explicitly unmount `/var/lib/systemd/coredump` before proceeding with the setup to ensure a clean state.

Fix: scylladb/scylla-enterprise#5692

Closes scylladb/scylladb#28300
2026-01-22 14:35:26 +02:00
Pavel Emelyanov
cb6ee05391 Merge 'Extend snapshot manifest.json with tablet-aware metadata' from Benny Halevy
This series extends the json manifest file we create when taking snapshots.
It adds the following metadata:
- manifesr version and scope
- snapshot name
- created_at and expires_at timestamps (#24061)
- node metadata (host_id, dc, rack)
- keyspace and table metadat
- tablet_count (#26352)
- per-sstable metadata (#26352)

Fixes [SCYLLADB-189](https://scylladb.atlassian.net/browse/SCYLLADB-189)
Fixes [SCYLLADB-195](https://scylladb.atlassian.net/browse/SCYLLADB-195)
Fixes [SCYLLADB-196](https://scylladb.atlassian.net/browse/SCYLLADB-196)

* Enhancement, no backport needed

[SCYLLADB-189]: https://scylladb.atlassian.net/browse/SCYLLADB-189?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-195]: https://scylladb.atlassian.net/browse/SCYLLADB-195?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-196]: https://scylladb.atlassian.net/browse/SCYLLADB-196?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27945

* github.com:scylladb/scylladb:
  snapshot: keep per-sstable metadata in manifest.json
  snapshot: add table info and tablet_count to manifest.json
  snapshot: add basic support for snapshot ttl in manifest.json
  table: snapshot_on_all_shards: take snapshot_options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
  snapshot: add snapshot name in manifest.json
  test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
  snapshot: add node info to manifest.json
  snapshot: add manifest info to manifest.json
  test: database_test: snapshot_works: add validate_manifest
2026-01-22 15:19:11 +03:00
Patryk Jędrzejczak
67045b5f17 Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

Fixes #21452

Closes scylladb/scylladb#24129

* https://github.com/scylladb/scylladb:
  docs: Document parallel decommission and removenode and relevant task API
  test: Add tests for parallel decommission/removenode
  test: util: Introduce ensure_group0_leader_on()
  test: tablets: Check that there are no migrations scheduled on draining nodes
  test: lib: topology_builder: Introduce add_draining_request()
  topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
  tablets: topology_coordinator: Refactor to propagate reason for migration rollback
  tablet_allocator: Skip co-location on draining nodes
  node_ops: task_manager_module: Populate entity field also for active requests
  tasks: node_ops: Put node id in the entity field
  tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
  topology: Protect against empty cancelation reason
  tasks, topology: Make pending node operations abortable
  doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
  raft_topology, tablets: Drain tablets in parallel with other topology operations
  virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
  locator: topology: Add "draining" flag to a node
  topology_coordinator: Extract generate_cancel_request_update()
  storage_service: Drop dependency in topology_state_machine.hh in the header
  locator: Extract common code in assert_rf_rack_valid_keyspace()
  topology_coordinator, storage_service: Validate node removal/decommission at request submission time
2026-01-22 13:06:53 +01:00
Patryk Jędrzejczak
e11450ccca test: test_raft_recovery_user_data: prevent repeated ALTER KEYSPACE request
The test is currently flaky. With `remove_dead_nodes_with == "remove"`,
it sends several ALTER KEYSPACE requests. The request performed just
after adding 3 new nodes can unexpectedly be sent twice to two
different nodes by the driver. The second receiver rejects the request
through the new guardrail added in 2e7ba1f8ce,
and the test fails.

This has been acknowledged as a bug in the Python driver. It shouldn't
retry non-idempotent requests with the default retry policy. There could
be one more bug in the driver, as it looks like the driver decides to
resend the request after it disconnects from the first receiver. The
first receiver has just bootstrapped, so the driver shouldn't disconnect.

We deflake the test by reconnecting the driver before performing the
problematic ALTER KEYSPACE request.

The change has been tested in byo, as the failure reproduces only in CI.
Without the change, the test fails once in ~250 runs in dev mode. With
the change, more than 1000 runs passed.

Fixes #27862

No backport needed as 2e7ba1f8ce is only
in master.

Closes scylladb/scylladb#28290
2026-01-22 14:13:42 +03:00
Szymon Malewski
e0cc6ca7e6 vector_index: Refactor extracting ANN query information
For the purpose of rescoring we will need information if the query is an ANN query
and the access to index option earlier in the `select_statement::prepare` than it happened before.
This patch refactors extracting this information to new helper structure `ann_ordering_info`
and is consistently using it.
2026-01-22 10:00:47 +01:00
Botond Dénes
7d2e6c0170 Merge 'config: add enforce_rack_list option' from Aleksandra Martyniuk
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.

Mark rf_rack_valid_keyspaces as deprecated.

Fixes: https://github.com/scylladb/scylladb/issues/26399.

New feature; no backport needed

Closes scylladb/scylladb#28084

* github.com:scylladb/scylladb:
  test: add test for enforce_rack_list option
  db: mark rf_rack_valid_keyspaces as deprecated
  config: add enforce_rack_list option
  Revert "alternator: require rf_rack_valid_keyspaces when creating index"
2026-01-22 10:27:35 +02:00
Benny Halevy
d6557764b9 snapshot: keep per-sstable metadata in manifest.json
Adds a "sstables" array member to manifest.json.
For each sstables, keep the following metadata:
id - a uuid for the sstable (the sstable identifier
if the use-sstable-identifier option was used, otherwise
the sstable uuid generation)
toc_name - the name of the TOC.txt file
data_size and index_size - in bytes
first_token and last_token - of the sstable first and last keys.

Fixes: SCYLLADB-196

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:42:52 +02:00
Benny Halevy
dc9093303d snapshot: add table info and tablet_count to manifest.json
Add a table member to manifest.json with the keyspace_name,
table_name, table_id, tablets_type, and, for tablets-enabled tables, get
tablet_count on each shard and write the minimum to manifest.json.
For vnodes-based tables, tablet_count=0.

For now, `tablets_type` may be either `none` for vnodes tables, or
`powof2` for tablets tables.  In the future, when we support arbitrary
tablt boundaries, this will be reflected here, and it is likely we
would backup the whole tablets map sperately to get all tablet boundaries.

Fixes SCYLLADB-195

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:36:52 +02:00
Benny Halevy
91df129e21 snapshot: add basic support for snapshot ttl in manifest.json
Store the snapshot `created_at` time and an optional
`expires_at` time.

Fixes SCYLLADB-189

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
5e90fbb9d2 table: snapshot_on_all_shards: take snapshot_options
And keep the options for now in the local_snapshot_writer.
The options will be used by following patches to pass
extra metadata like the snapshot creation time, expiration time, etc.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
49a3e0914d db: snapshot_ctl: move skip_flush to struct snapshot_options
So we can easily extend it and add more options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
d9fc3b1c11 snapshot: add snapshot name in manifest.json
Store the snapshot tag in the manifest file.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
0f014e2916 test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
If tablets are enabled via db::config add the `tablet = {'enabled': true}'
option when creating a keyspace, even if `cql_test_config.initial_tablets`
is disengaged.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
0d82e56078 snapshot: add node info to manifest.json
Add metadata about the node: host_id, datacenter, and rack.
This enables dc- or rack- aware restore.
Today this information is "encoded" into the snapshot hierarchy
prefixes, but if all manifest files would be stored in a flat
directory, we'd need to encode that metadata in the object name,
but it'd be better for the manifest contents to be self descriptive.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
24040efc54 snapshot: add manifest info to manifest.json
Add metadata about the manifest itself:
A version and the manifest scope (currently "node",
but in the future, may also be "shard", or "tablet")

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
9e0f5410ae test: database_test: snapshot_works: add validate_manifest
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Patryk Jędrzejczak
e503340efc test: test_raft_recovery_during_join: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await coordinator_log.wait_for("delay_node_bootstrap: waiting for message")`
above guarantees that the node has submitted the join topology request, which
happens after starting the REST API.

Fixes #28227

Closes scylladb/scylladb#28233
2026-01-22 06:55:16 +02:00
Botond Dénes
4281d18c2e Merge 'schema: Apply sstable_compression_user_table_options to CQL aux and Alternator tables' from Nikos Dragazis
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).

This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).

This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.

Fixes #26914.

Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.

Closes scylladb/scylladb#27204

* github.com:scylladb/scylladb:
  db/config: Update sstable_compression_user_table_options description
  schema: Add initializer for compression defaults
  schema: Generalize static configurators into schema initializers
  schema: Initialize static properties eagerly
  db: config: Add accessor for sstable_compression_user_table_options
  test: Check that CQL and Alternator tables respect compression config
2026-01-22 06:50:48 +02:00
Nadav Har'El
5c1e525618 Merge 'vector_search: cache restrictions JSON at prepare time ' from Dawid Pawlik
Add `prepared_filter` class which handles the preparation, construction, and caching of Vector Search filtering compatible JSON object.

If no bind markers found in primary key restrictions, the JSON object will be built once at prepare time and cached for use during execution calls.

Additionally, this patch moves the filter functions from `cql3::restrictions` to `vector_search` namespace and does some renaming to make the purpose of those functions clear.

Follow-up: https://github.com/scylladb/scylladb/pull/28109
Fixes: [SCYLLADB-299](https://scylladb.atlassian.net/browse/SCYLLADB-299)

[SCYLLADB-299]: https://scylladb.atlassian.net/browse/SCYLLADB-299?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28276

* github.com:scylladb/scylladb:
  test/vector_search: add filter tests with bind variables
  vector_search: cache restrictions JSON at prepare time
  refactor: vector_search: move filter logic to vector_search namespace
2026-01-21 19:03:35 +02:00
Patryk Jędrzejczak
1f0f694c9e test: test_zero_token_nodes_multidc: properly handle reads with CL=LOCAL_ONE
The test is currently flaky. It incorrectly assumes that a read with
CL=LOCAL_ONE will see the data inserted by a preceding write with
CL=LOCAL_ONE in the same datacenter with RF=2.

The same issue has already been fixed for CL=ONE in
21edec1ace. The difference is that
for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1.

We fix the issue for CL=LOCAL_ONE by skipping the check for dc1.

Fixes #28253

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#28274
2026-01-21 15:17:42 +01:00
Petr Gusev
843da2bbb8 test_encryption: capture stderr
The check_output function doesn't capture stderr by default, making
it impossible to debug if something goes wrong.
2026-01-21 14:56:01 +01:00
Petr Gusev
925d86fefc test/cluster: add test_strong_consistency.py
Add a basic test that creates a strongly consistent keyspace and table,
writes some data, and verifies that the same data can be read back.

Since Scylla-side request proxying is not yet implemented, writes are
handled only on the leader node. The test uses the existing
`/raft/leader_host` REST endpoint to determine the leader of the tablets
Raft group.
2026-01-21 14:56:01 +01:00
Petr Gusev
59a876cebb raft_group_registry: disable metrics for non-0 groups
The `raft::server` registers metrics using the `server_id` label. When
both a group0 Raft server and the tablets Raft server are created on
the same node/shard, duplicate metrics cause conflicts.

This commit temporarily disables metrics for non-0 groups. A proper fix
will likely require adding a `group_id` label in the future.
2026-01-21 14:56:01 +01:00
Petr Gusev
1f170d2566 strong consistency: implement select_statement::do_execute() 2026-01-21 14:56:01 +01:00
Petr Gusev
a5d611866e cql: add select_statement.cc 2026-01-21 14:56:01 +01:00
Petr Gusev
493ebe2add strong consistency: implement coordinator::query() 2026-01-21 14:56:01 +01:00
Petr Gusev
ccf90cfde8 cql: add modification_statement
We use decoration instead of inheritance, since inheritance already
serves to differentiate statement types (modification_statement has
update_statement and delete_statement as descendants). A better
solution would likely involve refactoring modification_statement and
extracting the mutation-generation logic into a reusable component
shared by both eventual and strongly consistent statements.
2026-01-21 14:56:01 +01:00
Petr Gusev
989566e8a3 cql: add statement_helpers
Introduce two helper methods that will be used for strongly consistent
select_statement and modification_statement.

redirect_statement() forwards the request to another shard or node.
Currently, only shard forwarding is implemented; node-level proxying
will be added in follow-up PRs.

is_strongly_consistent() will be used in the prepare() method of raw
statements to determine whether a strongly consistent statement should
be created for the given CQL statement.
2026-01-21 14:56:01 +01:00
Petr Gusev
b94fd11939 strong consistency: implement coordinator::mutate()
To guarantee monotonic mutation timestamps, we compute the maximum
timestamp used so far for the current tablet. This is done by calling
read_barrier() on the tablet’s Raft group server and extracting the
maximum timestamp from the local database via
table::get_max_timestamp_for_tablet().

Because read_barrier() may take a while, we perform it proactively in a
dedicated fiber, leader_info_updater, rather than during the mutation
request. This fiber is started when the Raft group server starts for a
tablet. It reacts to wait_for_state_change(), computes the maximum
timestamp, and stores it per term.

The new groups_manager::begin_mutate() function checks whether the
maximum timestamp has already been computed for the current term. If
not, it asks the client to wait. This two-step interface (synchronous
begin_mutate() + asynchronous wait on the need_wait_for_leader future)
is needed because the term can change at any asynchronous point.
If begin_mutate() were asynchronous, the client would need to recheck
the term after `co_await begin_mutate()`.

We currently do not handle raft::commit_status_unknown. We rethrow it to
the CQL client, which must check whether the command was applied and
retry if necessary. Handling this inside Scylla would require persisting
a deduplication key after applying the mutation, which introduces write
amplification. Additionally, connection breaks between Scylla and the
driver can always occur, so the client must be prepared to verify the
command status regardless.
2026-01-21 14:56:01 +01:00
Petr Gusev
801bf82a34 raft.hh: make server::wait_for_leader() public
When a strongly consistent request arrives at a node, we
need to know which replica is the leader, since such requests
are generally executed only on the leader. If a leader has
not yet been elected, we must wait. This commit exposes
wait_for_leader() so it can be used for that purpose.

We cannot rely solely on wait_for_state_change(), because it does not
trigger when some other node becomes a leader.
2026-01-21 14:56:01 +01:00
Petr Gusev
7d111f2396 strong_consistency: add coordinator
Add the `coordinator` class, which will be responsible for coordinating
reads and writes to strongly consistent tables. This commit includes
only the boilerplate; the methods will be implemented in separate
commits.
2026-01-21 14:56:01 +01:00
Petr Gusev
4413142f25 modification_statement: make get_timeout public
We'll need to access this method in a new
strong_consistency/modification_statement class.
2026-01-21 14:56:00 +01:00
Petr Gusev
4902186ede strong_consistency: add groups_manager
This class is reponsible for managing raft groups for
strongly-consistent tablets.
2026-01-21 14:56:00 +01:00
Petr Gusev
4eee5bc273 strong_consistency: add state_machine and raft_command
These commands will be used by strongly consistent tablets to submit
mutations to Raft. A simple state_machine implementation is introduced
to apply these commands.

We apply commands in batches to reduce commitlog I/O overhead. The
batched variant of database::apply has known atomicity issues. For
example, it does not guarantee atomicity under memory pressure: some
mutations may be published to the memtable while others are blocked in
run_when_memory_available. We will address these issues later.
2026-01-21 14:56:00 +01:00
Petr Gusev
a8350b274e table: add get_max_timestamp_for_tablet
Strongly consistent writes require knowing the maximum timestamp of
locally applied mutations to guarantee monotonically increasing
timestamps for subsequent writes.

This commit adds a function that returns the maximum timestamp for a
given tablet.

Why it is safe to use this function with deleted cells:
* Tombstones are included in memtable.get_max_timestamp() calculations.
* The maximum timestamp of a memtable is used to initialize the maximum
  timestamp of the resulting sstable.
* During compaction, a new sstable’s maximum timestamp is initialized as
  the maximum of the contributing sstables.
2026-01-21 14:56:00 +01:00
Petr Gusev
dc4896b69b tablets: generate raft group_id-s for new table
We assign shard to zero because raft is currently only supported on the
zero shard.
2026-01-21 14:56:00 +01:00
Petr Gusev
6dd74be684 tablet_replication_strategy: add consistency field
This commit adds a `consistency` field to `tablet_replication_strategy`.
In upcoming commits we'll use this field to determine if a
`raft_group_id` should be generated for a new table.
2026-01-21 14:56:00 +01:00
Petr Gusev
53f93eb830 tablets: add raft_group_id
Add a `raft_group_id` column to `system.tablets` and to the `tablet_map`
class. The column is populated only when the
`strongly_consistent_tables` feature is enabled.

This feature is currently disabled by default and is enabled only when
the user sets the `STRONGLY_CONSISTENT_TABLES` experimental flag.

The `raft_group_id` column is added to `system.tablets` only when this
flag is set. This allows the schema to evolve freely while the feature
is experimental, without requiring complex migrations.
2026-01-21 14:56:00 +01:00
Petr Gusev
cab3e1eea5 modification_statement: remove virtual where it's not needed
This is a refactoring/simplification commit.
2026-01-21 14:56:00 +01:00
Petr Gusev
9015bed794 modification_statement: inline prepare_statement()
This is a refactoring/simplification commit.

There are many 'prepare' functions in this class that don't
meaningfully differ from each other. The prepare_statement() adds
accidental complexity by adding a level of indirection -- the reader
has to jump between the call site and the function body to reconstruct
the full picture.
2026-01-21 14:56:00 +01:00
Petr Gusev
998ee5b7fb system_keyspace: disable tablet_balancing for strongly_consistent_tables
We don't yet support tablet balancing for strongly consistent tables.
2026-01-21 14:56:00 +01:00
Petr Gusev
6b0d757f28 cql: rename strongly_consistent statements to broadcast statements
In preparation for upcoming work on strongly consistent queries in
Scylla, this commit renames the existing `strongly_consistent`
statements to `broadcast_statements` to avoid confusion.

The old code paths are kept temporarily, as they may be useful for
reference or for copying parts during the implementation of the new
strongly consistent statements.
2026-01-21 14:56:00 +01:00
Pavel Emelyanov
18b5a49b0c Populate all sl:* groups into dedicated top-level supergroup
This patch changes the layout of user-facing scheduling groups from

/
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)

into

/
`- user (supergroup)
   `- statement
   `- sl:default
   `- sl:*
`- other groups (compaction, streaming, etc.)

The new supergroup has 1000 static shares and is name-less, in a sense
that it only have a variable in the code to refer to and is not exported
via metrics (should be fixed in seastar if we want to).

The moved groups don't change their names or shares, only move inside
the scheduling hierarchy.

The goal of the change is to improve resource consumption of sl:*
groups. Right now activities in low-shares service levels are scheduled
on-par with e.g. streaming activity, which is considered to be low-prio
one. By moving all sl:* groups into their own supergroup with 1000
shares changes the meaning of sl:* shares. From now on these shares
values describe preirities of service level between each-other, and the
user activities compete with the rest of the system with 1000 shares,
regardless of how many service levels are there.

Unit tests keep their user groups under root supergroup (for simplicity)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28235
2026-01-21 14:14:48 +02:00
Botond Dénes
cf4133c62d Merge 'service: pass topology guard to RBNO' from Aleksandra Martyniuk
Currently, raft-based node operations with streaming use topology guards, but repair-based don't.

Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759

No topology_guard in any version; needs backport to all versions

Closes scylladb/scylladb#27839

* github.com:scylladb/scylladb:
  service: use session variable for streaming
  service: pass topology guard to RBNO
2026-01-21 12:47:28 +02:00
Robert Bindar
ea8a661119 reduce test_backup.py and test_refresh.py datasets
backup and restore tests. This made the testing times explode
with both cluster/object_store/test_backup.py and
cluster/test_refresh.py taking more than an hour each to complete
under test.py and around 14min under pytest directly.
This was painful especially in CI because it runs tests under test.py which
suffers from the issue of not being able to run test cases from within
the same file in parallel (a fix is attempted in 27618).

This patch reduces the dataset of these tests to the minimum and
gets rid of one of the tested topology as it was redundant.
The test times are reduced to 2min under pytest and 14 mins under
test.py.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#28280
2026-01-21 10:47:36 +02:00
Nadav Har'El
8962093d90 Merge 'vector_index: Introduce rescoring index option' from Szymon Malewski
This series introduces `rescoring` index option.
There is no rescoring algorithm implementation yet.
This series prepares it by:
- adding new index option
- adding documentation
- adding tests for option handling
- adding tests for rescoring implementation - at this point they report errors and are marked that this is expected, because rescoring is not implemented.

Follow-up https://github.com/scylladb/scylladb/pull/27677
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294

No backporting - it is a new feature.

Closes scylladb/scylladb#28165

* github.com:scylladb/scylladb:
  vector_search: Add more rescoring validation tests
  vector_search: Add rescoring validation test
  vector_search: doc: Document new index option
  vector_search: test: Add `rescoring` index option test
  vector_index: introduce rescoring option
  vector_index: improve options validation
2026-01-21 10:46:22 +02:00
Avi Kivity
ecb6fb00f0 streamed_mutation_freezer: use chunked_vector instead of std::deque for clustering rows
The streamed_mutation_freezer class uses a deque to avoid large
allocations, but fails as seen in the referenced issue when the
vector backing the deque grows too large. This may be a problem
in itself, but the issue doesn't provide enough information to tell.

Fix the immediate problem by switching to chunked_vector, which
is better in avoiding large allocations. We do lose some early-free
in serialize_mutation_fragments(), but since most of the memory should
be in the clustering row itself, not in the deque/chunked_vector holding
it, it should not be a problem.

Fixes #28275

Closes scylladb/scylladb#28281
2026-01-21 10:13:44 +02:00
Botond Dénes
09d3b6c98b Merge 'auth: use paged internal queries during migration' from Piotr Smaron
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.

Fixes: https://github.com/scylladb/scylladb/issues/27577

A minor enhancement, no need to backport.

Closes scylladb/scylladb#25395

* github.com:scylladb/scylladb:
  auth: use paged internal queries during migration
  auth: move some code in migrate_to_auth_v2 up
  auth: re-align pieces of migrate_to_auth_v2
  cql: extend `query_internal` with `query_state` param
2026-01-21 09:58:06 +02:00
Raphael S. Carvalho
d16f9c821d Revert "api: storage_service/tablets/repair: disable incremental repair by default"
This reverts commit c8cff94a5a.

Re-enabling incremental repair on master with "Aborting on shard 0 during
scaleout + repair #26041" and "Failure to attach sstables in streaming consumer
leaves sealed sstables on disk #27414" fixed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28120
2026-01-21 08:50:13 +02:00
Dawid Pawlik
4a7e20953a test/cqlpy: remove the xfail mark from already passing tests
Since #28109 was merged, those tests started to pass as we allow
the filtering on primary key columns within ANN vector queries.

Closes scylladb/scylladb#28231
2026-01-21 08:47:20 +02:00
Botond Dénes
7a3f51b304 Update seastar submodule
* seastar e00f1513...f55dc7eb (8):
  > build: detect io_uring correctly
  > ioinfo: report actual I/O latency goal instead of configured value
  > core/loop: repeater,repeat_until_value: don't rethrow exceptions
  > seastar::test: Add exception interception to extract nested info
  > bool_class: add operator<=>
  > Merge 'Perftune.py: generic virtual device with lower-level slave network devices support' from Vladislav Zolotarov
    scripts/perftune.py: NetPerfTuner: rename "hardware interface" methods to "tunable interface" methods
    scripts/perftune.py: NetPerfTuner: refactor NIC handling to support any virtual interface with slaves
    scripts/perftune.py: NetPerfTuner: rename methods to private for better encapsulation
  > scripts/perftune.py: add fast path queues IRQ filtering and indexing for Microsoft Azure Network Adapter (MANA) NICs
  > file: Finally override dma_read_bulk() in posix_file_impl

Closes scylladb/scylladb#28262
2026-01-21 08:44:20 +02:00
Szymon Malewski
d6226500f6 vector_search: Add more rescoring validation tests
Adding tests for specific cases of rescoring processing:
- wildcard selection - "SELECT * ..." is a case with slightly different path of rescoring processing. We want to confirm that it is handled correctly.
- calculating similarity with other vectors in SELECT clause should not influence ANN ordering.
- NULL handling - results that for any reason have NULL in a score should be filtered out.

As rescoring is not implemented yet, the tests use boost::unit_test::expected_failures
to indicate that the test reports errors.
2026-01-20 21:01:45 +01:00
Karol Nowacki
376c70be75 vector_search: Add rescoring validation test
Verify that vector store results will be correctly rescored and reordered
according to the rescoring algorithm.
As rescoring is not implemented yet, the tests use `boost::unit_test::expected_failures`
to indicate that they report errors.

First test checks rescoring with a simple selection list.
Second makes sure that rescoring is not triggered for quantization=f32 - full representation of vectors.
Third repeats the first one, but adds to it returning of similarity score value.
2026-01-20 21:01:45 +01:00
Karol Nowacki
41dcf12463 vector_search: doc: Document new index option
Adds documentation for the rescoring option for vector search indexes.
2026-01-20 21:01:45 +01:00
Karol Nowacki
b268eda67e vector_search: test: Add rescoring index option test
Add tests to validate `rescoring` index options.
It also improves tests for related `oversampling` option validation.
2026-01-20 21:01:45 +01:00
Szymon Malewski
c5945b1ef4 vector_index: introduce rescoring option
This patch adds vector index option allowing to enable rescoring - recalculation of similarity metric and re-ranking of quantized VS candidates.
Quantization is a necessary condition to run rescoring - checked in convenience function `is_rescoring_enabled`.
Rescoring itself is not implemented - it will come in following patches.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294
2026-01-20 21:01:45 +01:00
Szymon Malewski
262a8cef0b vector_index: improve options validation
In this patch we enhance validation of option by:
- giving context (option name) in error messages
- listing supported values in error messages of enumerated options
- avoiding using templates

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293
Follow-up: https://github.com/scylladb/scylladb/pull/27677
2026-01-20 21:01:41 +01:00
Dawid Pawlik
f27ef79d0d test/vector_search: add filter tests with bind variables
Add tests that check if preparation of the filter  does work
and not produce cache when the restrictions consist of bind variables.
2026-01-20 18:17:46 +01:00
Dawid Pawlik
e62cb29b7d vector_search: cache restrictions JSON at prepare time
Add `prepared_filter` class which handles the preparation, construction
and caching of Vector Search filtering compatible JSON object.

If no bind markers found in SELECT statement, the JSON object will be built
once at prepare time and cached for use during execution calls.

Adjust tests accordingly to use prepared filters.

Follow-up: #28109
Fixes: SCYLLADB-299
2026-01-20 17:15:52 +01:00
Andrei Chekun
91e2b027ce test.py: do not crash when there is no boost log
Small fix to not crash the whole process when boost tests failed to
start and do not produced the log file that can be parsed.
2026-01-20 15:52:40 +01:00
Andrei Chekun
58d3052ad4 test.py: pass correctly extra cmd line arguments
During rewrite --extra-scylla-cmdline-options was missed and it was not
passed to the tests that are using pytest. The issue that there were no
possibility to pass these parameters via cmd to the Scylla, while tests
were not affected because they were using the parameters from the yaml
file. This PR fixes this issue so it will be easier to modify the Scylla
start parameters without modifying code.
2026-01-20 15:52:40 +01:00
Anna Stuchlik
84e9b94503 doc: fix the default compaction strategy for Materialized Views
Fixes https://github.com/scylladb/scylladb/issues/24483

Closes scylladb/scylladb#27725
2026-01-20 15:41:03 +01:00
Dawid Pawlik
f54a4010c0 refactor: vector_search: move filter logic to vector_search namespace
Move Vector Search filter functions from `cql3::restrictions` to
`vector_search` namespace as it's a better place according to
it's purpose.
The effective name has now changed from `cql3::restrictions::to_json`
to `vector_search::to_json` which clearly mentions that the JSON
object will be used for Vector Search.

Rename the auxilary functions to use `to_json` suffix instead of
variety of verbs as those functions logic focuses on building JSON
object from different structures. The late naming emphasized too
much distinction between those functions, while they do pretty much
the same thing.

Follow-up: #28109
2026-01-20 13:13:43 +01:00
Botond Dénes
a53f989d2f db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.

A reproducer is included which fails before and passes after the fix.

Fixes: #26847

Closes scylladb/scylladb#28163
2026-01-20 12:34:37 +01:00
Avi Kivity
c7dda5500c database: simplify apply_counter_update exception handling
Use coroutine::try_future to exit the coroutine immediately on
error instead of explict checks.

Closes scylladb/scylladb#28257
2026-01-20 11:13:49 +02:00
Wojciech Mitros
fc2aecea69 idl: don't redefine bound_weight and partition_region in paging_state.idl.hh
Bound_weight and partition_region are defined in both paging_state.idl.hh and
position_in_partition.idl.hh. This isn't currently causing any issues, but if
a future RPC uses both the paging_state and position_in_partition, after
including both files we'll get a duplicate error.
In this patch we prevent this by removing the definitions from paging_state.idl.hh
and including position_in_partition.idl.hh in their place.

Closes scylladb/scylladb#28228
2026-01-20 11:12:47 +02:00
Aleksandra Martyniuk
2be5ee9f9d service: use session variable for streaming
Use session that was retrieved at the beginning of the handler for
node operations with streaming to ensure that the session id won't
change in between.
2026-01-20 10:06:34 +01:00
Aleksandra Martyniuk
3fe596d556 service: pass topology guard to RBNO
Currently, raft-based node operations with streaming use topology
guards, but repair-based don't.

Topology guards ensure that if a respective session is closed
(the operation has finished), each leftover operation being a part
of this session fails. Thanks to that we won't incorrectly assume
that e.g. the old rpc received late belongs to the newly started
operation. This is especially important if the operation involves
writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair
tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759
2026-01-20 10:06:34 +01:00
Aleksandra Martyniuk
f0dbf6135d test: add test for enforce_rack_list option 2026-01-20 10:01:15 +01:00
Aleksandra Martyniuk
7dc371f312 db: mark rf_rack_valid_keyspaces as deprecated
Mark rf_rack_valid_keyspaces option as deprecated. User should
use enforce_rack_list option instead.

The option can still be used and it does not change it's behavior.

Docs is updated accordingly.
2026-01-20 09:58:57 +01:00
Aleksandra Martyniuk
761ace4f05 config: add enforce_rack_list option
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
2026-01-20 09:58:51 +01:00
Michael Litvak
e7ec87382e Revert "alternator: require rf_rack_valid_keyspaces when creating index"
This reverts commit 4b26a86cb0.

The rf_rack_valid_keyspaces option is now not required for creating MVs.
2026-01-20 09:56:48 +01:00
Pavel Emelyanov
8ecd4d73ac test: Update cluster/object_store/ tests to use new S3 config format
Currently the suite generates config in old format, and only a single
test validates that using new format "works".

This change updates the suite (mainly the MinioServer::create_conf()
method) to generate endpoint confit in new format.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28113
2026-01-20 10:53:34 +02:00
Pavel Emelyanov
5e369c0439 audit: Stop using deprecated seastar UDP sending API
The datagram_channel::send() method that sends net::packet-s is
deprecated in favor of using span<temporary_buffer> one. Auditing code
still uses the former one -- it constructs a packet by using formatted
string by copying the string into the packet's fragment, then sends it.
This patch releases string into temporary_buffer and then passes
one-element span to send().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28198
2026-01-20 10:51:23 +02:00
Avi Kivity
874322f95e multishard_query: simplify do_query() coroutine/continuation complexity
do_query() is a coroutine but uses some continuations to take
advantage of exceptions being propagated via future::then() without
being thrown. We can accomplish the same thing with a nested coroutine
and coroutine::try_future(), simplifying the code.

While this area isn't performance intensive, we're not adding allocations.
The coroutine frame may add an allocation, but since read_page()
certainly does not return immediately, the following then() will allocate
as well. Since we eliminated that then(), the change is at least neutral
allocation-wise.

Closes scylladb/scylladb#28258
2026-01-20 10:45:10 +02:00
Łukasz Paszkowski
e07fe2536e test/pylib/util.py: Add retries and additional logging to start_writes()
Consider the following scenario:
1. Let nodes A,B,C form a cluster with RF=3
2. Write query with CL=QUORUM is submitted and is acknowledged by
   nodes B,C
3. Follow-up read query with CL=QUORUM is sent to verify the write
   from the previous step
4. Coordinator sends data/digest requests to the nodes A,B. Since the
   node A is missing data, digest mismatches and data reconciliation
   is triggered
5. The node A or B fails, becomes unavailable, etc
6. During reconciliation, data requests are sent to node A,B and fail
   failing the entire read query

When the above scenario happens, the tests using `start_writes()` fail
with the following stacktrace:
```
...

>           await finish_writes()

test/cluster/test_tablets_migration.py:259:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/util.py:241: in finish
    await asyncio.gather(*tasks)
test/pylib/util.py:227: in do_writes
    raise e
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

worker_id = 1

...

>                   rows = await cql.run_async(rd_stmt, [pk])
E                   cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
```

Note that when a node failure happens before/during a read query,
there is no test failure as the speculative retries are enabled
by default. Hence an additional data/digest read is sent to the third
remaining node.

However, the same speculative read is cancelled the moment, the read
query reaches CL which may trigger a read-repair.

This change:
- Retries the verification read in start_writes() on failure to mitigate
  races between reads and node failures
- Adds additional logging to correlate Python exceptions with Scylla logs

Fixes https://github.com/scylladb/scylladb/issues/27478
Fixes https://github.com/scylladb/scylladb/issues/27974
Fixes https://github.com/scylladb/scylladb/issues/27494
Fixes https://github.com/scylladb/scylladb/issues/23529

Note that this change test flakiness observed during tablet transitions.
However, it serves as a workaround for a higher-level issue
https://github.com/scylladb/scylladb/issues/28125

Closes scylladb/scylladb#28140
2026-01-20 10:38:20 +02:00
Piotr Smaron
6084f250ae auth: use paged internal queries during migration
Auth v2 migration uses non-paged queries via `execute_internal` API.
This commit changes it to use `query_internal` instead, which uses
paging under the hood.

Fixes: scylladb/scylladb#27577
2026-01-20 09:32:21 +01:00
Piotr Smaron
345458c2d8 auth: move some code in migrate_to_auth_v2 up
Just move the touched code above so the next commit is more readable.
But this has a drawback: previously, if the returned rows were empty,
this code was not executed, but now is independently of the query
results. This shouldn't be a big deal, though, as auth shouldn't be
empty.
2026-01-20 09:15:53 +01:00
Piotr Smaron
97fe3f2a2c auth: re-align pieces of migrate_to_auth_v2
This is needed to make next commits more readable.
Just moved the touched code 4 spaces to the right.
2026-01-20 09:15:53 +01:00
Piotr Smaron
3ca6b59f80 cql: extend query_internal with query_state param
This later is going to be used to pass a query timeout via `qs` to
`query_internal`.
2026-01-20 09:15:48 +01:00
Nadav Har'El
70b3cd0540 Merge 'vector_index: introduce quantization and oversampling options' from Szymon Malewski
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.

In the current implementation, get_oversampling allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.

`test/vector_search/rescoring_test.cc` implements basic tests of added functionality.
New options are documented in `docs/cql/secondary-indexes.rst`.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83

New feature - no backporting

Closes scylladb/scylladb#27677

* github.com:scylladb/scylladb:
  vector_search: doc: Document new index options
  vector_search: test: Test oversampling
  vector_search: test: Add rescoring index options test
  vector_search: test: Extract Configure utility to shared header
  vector_index: introduce `quantization` and `oversampling` options
2026-01-20 08:50:46 +02:00
Avi Kivity
36347c3ce9 Merge 'db/system_keyspace: remove namespace v3' from Botond Dénes
Cassandra changed their system tables in 3.0. We migrated to the new system table layout in 2017, in ScyllaDB 2.0.
System tables introduced in Cassandra 3.0, as well as the 3.0 variant of pre-existing system tables were added to the db::system_table::v3 namespace.
We ended up adding some new ScyllaDB-only system tables to this namespace as well.

As the dust settled, most of the v3 system tables ended up being either simple aliases to non-v3 tables, or new tables.
Either way, the codebase uses just one variant of each table for a long time now the v3:: distinction is pointless.

Remove the v3 namespace and unify the table listing under the top-level db::system_keyspace scope.

Code cleanup, no backport

Closes scylladb/scylladb#28146

* github.com:scylladb/scylladb:
  db/system_keyspace: move remining tables out of v3 keyspace
  db/system_keyspace: relocate truncated() and commitlog_cleanups()
  db/system_keyspace: drop v3::local()
  db/system_keyspace: remove duplicate table names from v3
2026-01-19 20:54:38 +02:00
Nikos Dragazis
4cde34f6f2 storage_service: Remove redundant yields
The loops in `ongoing_rf_change()` perform explicit yields, but they
also perform coroutine operations which can yield implicitly. The
explicit yields are redundant.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-19 16:18:49 +02:00
Tomasz Grabiec
7977c97694 Merge 'db: add effective_capacity to load_per_node virtual table' from Ferenc Szili
`effective_capacity` is a value used in size based load balancing. It contains the sum of available disk space of a node and all the tablet sizes.
This change adds this value to the virtual table `system.load_per_node`. This can be useful for debugging size based load balancing.

Size based load balancing is currently only on master, so no backport is needed.

Closes scylladb/scylladb#28220

* github.com:scylladb/scylladb:
  docs: add effective_capacity to system keyspace docs
  virtual_table: add effective_capacity to load_per_node
2026-01-19 13:17:28 +01:00
Marcin Maliszkiewicz
be8a30230b Merge 'test/cluster/dtest: import scrub_test.py' from Botond Dénes
This test has to be adjusted in lock-step with scylladb.git, due to changes in https://github.com/scylladb/scylladb/pull/27836. It is simpler to just take the time and import it, so https://github.com/scylladb/scylladb/pull/27836 can patch all the affected tests, including this one.
All code is imported verbatim, then patched later, such that the series remains bisectable.

dtest import, no backport needed

Closes scylladb/scylladb#28085

* github.com:scylladb/scylladb:
  test/cluster/dtest: remove is_win() and users
  test/cluster/dtest/scrub_test.py: add license blurb
  test/cluster/dtest: import scrub_test.py
  test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al
  test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()
2026-01-19 12:14:08 +01:00
Botond Dénes
2e4d0e42f0 test/cluster/dtest: remove is_win() and users
ScyllaDB and its tests never run on windows, this function is not
needed, patch it out.
2026-01-19 12:56:57 +02:00
Botond Dénes
8953a143e5 test/cluster/dtest/scrub_test.py: add license blurb
The original scrub test was done by the Cassandra project, hence there
is two Licenses notices: one for the original work by Cassandra
(2015) and one for our modifications on top (2021).
2026-01-19 12:55:59 +02:00
Botond Dénes
d2c266eb47 test/cluster/dtest: import scrub_test.py
Import the test verbatim. Requires adding is_win() to ccmlib/common.py,
with a dummy implementation.
2026-01-19 12:52:44 +02:00
Botond Dénes
99e8a92aef test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al
To work in the local test.py context.
2026-01-19 12:52:44 +02:00
Botond Dénes
807da53583 test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()
And dependencies: get_sstables() and __gather_sstables().
Code is importend verbatim, but doesn't work yet (no users yet either).
Will be patched to work in the next commit.
2026-01-19 12:52:44 +02:00
Botond Dénes
e01041d3ee db/system_keyspace: move remining tables out of v3 keyspace
The last remining tables in the v3 keyspace are those that are genuinely
distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0.
Move these out of the v3 keyspace too, with this the v3 keyspace is
defunct and removed.
2026-01-19 12:32:21 +02:00
Botond Dénes
ce57ef94bd db/system_keyspace: relocate truncated() and commitlog_cleanups()
The name variables of these tables is outside the v3 namespace but the
method defining their schema is in the v3 namespace. Relocate the
methods out from the v3 namespace, to the scope where the name variables
live.
The methods are moved to the private: part of system_keyspace, as they
don't have external users currently.
2026-01-19 12:32:21 +02:00
Botond Dénes
2ccb8ff666 db/system_keyspace: drop v3::local()
It is unused, the non-v3 variant is used instead.
2026-01-19 12:32:21 +02:00
Botond Dénes
b52a3f3a43 db/system_keyspace: remove duplicate table names from v3
Those table names that are effectively just an alias of the their
counterpart outside of the v3 namespace (struct).

scylla_local() is made public. Currently it is private, but it has
external users, working around the private designation by using the
public v3::scylla_local() alias. This change just makes the existing
status clear.
2026-01-19 12:32:21 +02:00
Karol Nowacki
324b829263 vector_search: doc: Document new index options
Adds documentation for the `quantization` and `oversampling`
options for vector search indexes.
2026-01-19 10:28:46 +01:00
Karol Nowacki
bca17290f4 vector_search: test: Test oversampling
Add test to verify that Scylla correctly oversamples the limit
according to the oversampling option.
2026-01-19 10:28:46 +01:00
Karol Nowacki
e347f6d0d4 vector_search: test: Add rescoring index options test
Add tests to validate quantization and oversampling index options.
2026-01-19 10:28:44 +01:00
Karol Nowacki
24b037e8e3 vector_search: test: Extract Configure utility to shared header
Move Configure test utility to dedicated file for reuse across test suites.
2026-01-19 10:21:44 +01:00
Szymon Malewski
b8e91ee6ae vector_index: introduce quantization and oversampling options
This patch adds vector index options allowing to enable quantization and oversampling.
Specific quantization value will be used internally by vector store.

In the current implementation, `get_oversampling` allows us to decide how many times more candidates
to retrieve from vector store - final response is still trimmed to the given limit.
It is a first step to allow rescoring - recalculation of similarity metric and re-ranking.
Without rescoring oversampling will be also further optimized to happen internally in vector store.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82
Ref https://scylladb.atlassian.net/browse/SCYLLADB-83
2026-01-19 10:21:43 +01:00
Tomasz Grabiec
dd0fc35c63 lsa: Export metrics for reclaim/evict/compact time
Currently, we only know about long reclaims from lsa-timing stall
reports. Shorter reclaims can go under the radar.

Those metrics will help to asses increase in LSA activity, which
translates to higher CPU cost of a workload.

reclaim tracks memory which goes to the standard allocator, e.g. when
entering and allocating_section or in the background reclaimer.

evict/compact count activity towrads building LSA reserve, in
allocating_section entry, or naked LSA allocation.

Closes scylladb/scylladb#27774
2026-01-19 12:08:16 +03:00
Nadav Har'El
3e270a49f7 test/cqlpy: remove test_describe.py from cluster reuse blacklist
The way that test.py runs test/cqlpy tests requires that tests end their
session with all keyspaces deleted. If we forget to delete a keyspace,
test.py suspects some test fails and reports a failure. As reported in
issue #26291, the test file test/cqlpy/test_describe.py caused this check
to trigger, so this file was added to the blacklist "dirties_cluster"
in suite.yaml to force test.py to ignore this problem.

I believe the cause of the problem was as follows: test_describe.py
didn't really leave any undeleted keyspace. Rather, test_describe.py had
one test which used "USE" and this broke DESC KEYSPACES (Refs #26334) -
which test.py used to see which keyspaces remained.

We solved this problem not just once, but twice:
1. In pull request #26345, I fixed the test not to use "USE" on the main
   CQL session.
2. In pull request #27971, I fixed DESC KEYSPACES implementation so even
   if "USE" was in effect, it will return the correct results.

I checked manually, and after removing test_describe.py from the
dirties_cluster blacklist, all cqlpy tests now pass, without
spurious failures in the test following test_describe.py. So it's time
to remove it from the blacklist.

Fixes #26291

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27973
2026-01-19 12:02:00 +03:00
Dario Mirovic
823d1b9c03 audit: fix start_audit init sequence placement
Commit d54d409 (audit: write out to both table and syslog) unified
create_audit and start_audit, which moved the audit service creation later
in the startup sequence. This broke startup when audit is enabled because
view_builder prepares CQL queries before start_audit runs, and
query preparation calls audit_instance().local_is_initialized()
which crashes on the non-existent sharded service.

Move start_audit to run before view_builder::start() and other components
that may prepare CQL queries during their initialization.

Fixes SCYLLADB-252

Closes scylladb/scylladb#28139
2026-01-19 11:57:39 +03:00
Botond Dénes
6f5f42305a docs: make the glossary more tablet inclusive
Our glossary is stuck in the past, still discussing token ownership in
terms of vnodes and cluster synchronization in terms of gossip.
This patch tries to improve this a bit, although much more work needs to
be done.
The term `Tablet` is added and the definition of `Token` and `Token
Range` is rephrased to be tablet inclusive.
The term `Cluster` is changed to mention raft as the synchronization
mechanism instead of gossip.

One oustanding problem is that our general architecture page describing
the ring acrhitecture is still Vnode only. We have a seprate Tablets
page, but the two don't link to each other and most documentation refer
only to the former. A casual reader might be able to spend a a lot of
time on our documentation page, without even seeing the word: tablets.

Closes scylladb/scylladb#28170
2026-01-19 11:50:13 +03:00
Ernest Zaslavsky
829bd9b598 aws_error: fix nested exception handling
The loop that unwraps nested exception, rethrows nested exception and saves pointer to the temporary std::exception& inner on stack, then continues. This pointer is, thus, pointing to a released temporary

Closes scylladb/scylladb#28143
2026-01-19 11:41:47 +03:00
Botond Dénes
b7bc48e7b7 reader_concurrency_semaphore: improve handling of base resources
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources aquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk for
deadlock, like it is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits blockings each other from admission.

Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanism: it marks its permits as
inactive and later it also uses release_base_resources(). This partice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.

Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
  reader_concurrency_semaphore::signal()
* all places where base resources are released now call
  release_base_resources()

A reproducer unit test is added, which fails before and passes after the
fix.

Fixes: #28083

Closes scylladb/scylladb#28155
2026-01-19 11:37:51 +03:00
Nadav Har'El
d86d5b33aa test/cqlpy: translate Cassandra's unit tests for LWT
This is a translation of Cassandra's CQL unit test source file
validation/operations/InsertUpdateIfConditionTest.java into our cqlpy
framework.

This test file checks various LWT conditional updates. After that
file became too big, the Cassandra developers split parts from it -
moving tests for LWT with collections, UDTs, and static columns to
separate test files - which I already translated (pull request #13663).
This patch translates the remaining, main, LWT tests.

Strangely, this test file also has, in the middle of the file, several
tests for conditional schema changes, like CREATE KEYSPACE IF NOT EXISTS,
a feature which has *nothing* to do with LWT so really didn't belong in
this file. But I translated those as well.

These new tests all pass on both ScyllaDB and Cassandra, and have not
uncovered any new bug.

However these tests do demonstrate yet again something that users and
developers of ScyllaDB's LWT must be aware of: Whereas usually
ScyllaDB's goal has been compatiblity with Cassandra's CQL, in LWT
this has *not* been the case: ScyllaDB deviated from Cassandra's
behavior in its LWT implementation in several places. These intentional
deviations were documented in docs/kb/lwt-differences.rst.

Accordingly, the tests here include almost a hundred (!) modificatons
(search for "if is_scylla") to allow the same test to pass on both
ScyllaDB and Cassandra, as well as many comments explaining the types
of differences we're seeing.

Although these deviations from Cassandra compatibility are known and
intentional, it's worth listing here the ones re-discovered by these
new tests:

1. On a successful conditional write, Cassandra returns just true, Scylla
   also returns the old contents of the row.

2. Similarly, in an IF EXISTS write that failed (the row did not exist),
   Cassandra returns just false, Scylla also returns extra null values for
   each and every column of the row.

3. Cassandra allows in "IF v IN (?, ?)" to bind individual values to
   UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659.

4. When there are static columns, Scylla's LWT response returns the static
   column first, Cassandra returns the modified column first. Since both
   also say which columns they return, neither is more correct than the other,
   a normally users will address specific columns by name, not by position.

5. docs/kb/lwt-differences.rst explains that "the returned result set
   contains an old row for every conditional statement in the batch".
   Beyond this different, actually non-conditional updates in the batch will
   also get a row in Scylla's result. Refs #27955.

6. For batch statement, ScyllaDB allows mixing `IF EXISTS`, `IF NOT EXISTS`,
   and other conditions for the same row. Cassandra doesn't, so checks that
   these combinations are not allowed were commented out.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27961
2026-01-19 09:46:04 +02:00
Botond Dénes
19efd7f6f9 Merge 'The system_replicated_keys should be mark as a system keyspace' from Amnon Heiman
This PR marks system_replicated_keys as a system keyspace.
It was missing when the keyspace was added.

A side effect of that is that metrics that are not supposed to be reported are.
Fixes #27903

Closes scylladb/scylladb#27954

* github.com:scylladb/scylladb:
  distributed_loader: system_replicated_keys as system keyspace
  replicated_key_provider: make KSNAME public
2026-01-19 09:37:41 +02:00
Aleksandra Martyniuk
65cba0c3e7 service: node_ops: remove coroutine::lambda wrappers
In storage_service::raft_topology_cmd_handler we pass a lambda
wrapped in coroutine::lambda to a function that creates streaming_task_impl.
The lambda is kept in streaming_task_impl that invokes it in its run
method.

The lambda captures may be destroyed before the lambda is called, leading
to use after free.

Do not wrap a lambda passed to streaming_task_impl into coroutine::lambda.
Use this auto dissociate the lambda lifetime from the calling statement.

Fixes: https://github.com/scylladb/scylladb/issues/28200.

Closes scylladb/scylladb#28201
2026-01-19 09:19:53 +02:00
Botond Dénes
c8811387e1 Merge 'service: do not change the schema while pausing the rf change ' from Aleksandra Martyniuk
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.

Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).

Fixes: https://github.com/scylladb/scylladb/issues/28167

No backport needed; changes that introduced a bug are only on master

Closes scylladb/scylladb#28168

* github.com:scylladb/scylladb:
  service: fin indentation
  test: add test_numeric_rf_to_rack_list_conversion_abort
  service: tasks: fix type of global_topology_request_virtual_task
  service: do not change the schema while pausing the rf change
2026-01-19 09:15:20 +02:00
Botond Dénes
7d637b14e8 erge 'test/cluster/test_internode_compression: Transpose test from dtest' from Calle Wilund
Refs #27429

Re-implement the dtest with same name as a scylla pytest, using a python level network proxy instead of tcpdump etc.
Both to avoid sudo and also to ensure we don't race.

Juggles different listen_address and broadcast_address values to insert a proxy measuring RPC traffic.

Note: the measuring relies on python network IO not splitting data chunks, since we don't really have packet-level view of the connections.

Note that a scylla change is required to make the ip address magic work, otherwise topology mechanism gets
confused. This should maybe at some point be looked into more, since we should be more resilient against various services in scylla binding to different addresses.

When this test is merged, we can drop the flaky test from dtest. And hope no new flakiness comes from this one...

Closes scylladb/scylladb#28133

* github.com:scylladb/scylladb:
  test/cluster/test_internode_compression: Transpose test from dtest
  gossiper/main: Extend special treatment of node ID resolve for rpc_address
2026-01-19 08:34:31 +02:00
Ferenc Szili
1136a3f398 docs: add effective_capacity to system keyspace docs
This adds the description of effective_capacity to the documentation
of the system keyspace.
2026-01-18 16:57:08 +01:00
Ferenc Szili
3e0362ec67 virtual_table: add effective_capacity to load_per_node
This change adds effective_capacity to the virtual table load_per_node.
This value can be useful for debugging size based load balancing.
2026-01-18 16:52:13 +01:00
Nadav Har'El
3e138a2685 test/cqlpy: Add our copyright/license to translated Cassandra tests
All the tests under test/cqlpy/cassandra_tests/ were translated from
Cassandra's unit tests originally written in Java into our own test
framework, and accordinly carry a clear mention of their origin and
original license.

However, we did modify these original tests - even if the modification
was slight and mostly straightforward. Therefore I was asked to also
mention our own copyright (and license) for these modifications.

So this patch adds to every file in test/cqlpy/cassandra_tests/ text like:

   # Modifications: Copyright 2026-present ScyllaDB
   # SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

with the appropriate year instead of 2026.

Fixes #28215

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28216
2026-01-18 16:25:28 +01:00
Tomasz Grabiec
3b0df29ceb docs: Document parallel decommission and removenode and relevant task API 2026-01-18 15:36:08 +01:00
Tomasz Grabiec
85140cdf7e test: Add tests for parallel decommission/removenode 2026-01-18 15:36:08 +01:00
Tomasz Grabiec
5c93e12373 test: util: Introduce ensure_group0_leader_on()
Many tests want to assume that group0 leader runs on a particualr
server, typically the first server in the list.

And they cannot be easily made to work with arbitrary leader, becuase
they setup a particular topology and then stop particular nodes, and
want to assume the leader is stable. They open leader's log and
expect things to appear in that log.

It's much easier to ensure the leader, than to prepare tests to
handle failovers.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
478b8f09df test: tablets: Check that there are no migrations scheduled on draining nodes
In case of decommission, it's not desirable because it's less urgent.

In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.

Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
e082e32cc7 test: lib: topology_builder: Introduce add_draining_request() 2026-01-18 15:36:07 +01:00
Tomasz Grabiec
baea12c9cb topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
Reaching critical disk utilization on destination means the draining
either caused it, or at least works against reliveing it. So it's
better to cancel those requests. In case of decommission, if critical
disk utilization was caused by it due to not enough capacity, aborting
decomission will bring capacity back to the system and rebalancing
will relieve critical disk utlization.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
1b784e98f3 tablets: topology_coordinator: Refactor to propagate reason for migration rollback
Will be easier to implement later change to cancel topology request,
where we need to give a reason for doing so.
2026-01-18 15:36:07 +01:00
Tomasz Grabiec
2d954f4b19 tablet_allocator: Skip co-location on draining nodes
In case of decommission, it's not desirable because it's less
urgent.

In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.

Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
2026-01-18 15:36:06 +01:00
Tomasz Grabiec
d9e1a6006f node_ops: task_manager_module: Populate entity field also for active requests 2026-01-18 15:36:06 +01:00
Tomasz Grabiec
bbd293d440 tasks: node_ops: Put node id in the entity field
If we have many node requests active at a time, it's useful to know
which requets works on which node.

Fixes #27208
2026-01-18 15:36:06 +01:00
Tomasz Grabiec
576ebcdd30 tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
They should return the same, so extract the common logic.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
629d6d98fa topology: Protect against empty cancelation reason
Request would be deemed successful, which is counter to the intention
of cancelation and effect on the system.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
7446eb7e8d tasks, topology: Make pending node operations abortable
We want to be able to cancel decommission when it's still in the
tablet draining phase. Such a request is in a pending and paused
state, and can be safely canceled. We set the node's "draining" flag
back to false.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
091ed4d54b doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
Since 288e75fe22
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
a009644c7d raft_topology, tablets: Drain tablets in parallel with other topology operations
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

The test case test_explicit_tablet_movement_during_decommission is
removed. It verifies that tablet move API works during tablet draining
transition. After this PR, we no longer enter this transition, so the
test doesn't work. It loses its purpose, because movement during
normal tablet balancing is not special and tested elsewhere.
2026-01-18 15:36:05 +01:00
Tomasz Grabiec
e38ee160fc virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
It gives a more accurate picture of what happens in the cluster.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
1c2e47e059 locator: topology: Add "draining" flag to a node
They are being drained of tablet replicas, tablet scheduler works to
move replicas away from such nodes. This state is set at the
beginning of decommission and removenode operations.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
a37b1ce832 topology_coordinator: Extract generate_cancel_request_update() 2026-01-18 15:36:04 +01:00
Tomasz Grabiec
77bd00bf9f storage_service: Drop dependency in topology_state_machine.hh in the header
To reduce compilation time.
2026-01-18 15:36:04 +01:00
Tomasz Grabiec
a24c3fc229 locator: Extract common code in assert_rf_rack_valid_keyspace() 2026-01-18 15:36:04 +01:00
Tomasz Grabiec
d3ee82ea51 topology_coordinator, storage_service: Validate node removal/decommission at request submission time
After parallel tablet draining, the validation at the time request
starts executing is too late, tablets will be already drained.

This trips tests which expect validation failure, but get tablet
draining failure instead.

Also, in case of decommission, it's a waste to go through draining
only to discover that the operation has to be rolled back due to
validation.

So avoid submitting a request altogether if it's invalid.
The validation at request execution start remains, for extra sefety.

validate_removing_node() was extracted out of topology_coordinator,
so that it can be called by storage_service on non-coordinator.

Some tests need adjusting for the fact that after failed removenode
the node may still not be marked as excluded, so we need to explicitly
exclude it or add to the list of ignored nodes in the next removenode
operation.
2026-01-18 15:36:04 +01:00
Nadav Har'El
34d28475d9 Merge 'Implement Vector Search filtering API' from Dawid Pawlik
Since Vector Store service filtering API has been implemented (scylladb/vector-store#334), there is a need for the implementation of Scylla side part.
This patch should implement a `statement_restrictions` parsing into Vector Store filtering API compatible JSON objects.
Those objects should be added to ANN query vector POST requests as `filter` object.

After this patch, the subset of all operations ([Vector Search Filtering Milestone 1](https://scylladb.atlassian.net/wiki/spaces/RND/pages/156729450/Vector+Search+Filtering+Design+Document#Milestone-1)) happy path should be completed, allowing users to filter on primary key columns with single column `=` and `IN` or multiple column `()=()` and `() IN ()`.
The restrictions for other operations should be implemented in a PR on Vector Store service side.

---

This PR implements parsing the `statement_restrictions` into Vector Store filtering API compatible JSON objects.
The JSON objects are created and used in ANN vector queries with filtering.
It closes the Scylla side implementation of Vector Search filtering milestone 1.

Unit tests for `statement_restrictions` parsing are added. Integration tests will be added on Vector Store service side PR.

---

Fixes: SCYLLADB-249

New feature, should land into 2026.1

Closes scylladb/scylladb#28109

* github.com:scylladb/scylladb:
  docs: update documentation on filtering with vector queries
  test/vector_search: add test for filtered ANN with VS mock
  test/vector_search: add restriction to JSON conversion unit tests
  vector_search: cql: construct and use filter in ANN vector queries
  select_statement: do not require post query ordering for vector queries
  vector_search: add `statement_restrictions` to JSON parsing
2026-01-18 16:11:29 +02:00
Ernest Zaslavsky
eb76858369 Update seastar submodule
seastar dd46b6f..e00f1513
```
e00f1513 Merge 'net: Add DNS TTL to the net::hostent' from Ernest Zaslavsky
8a69e1f4 net: extract common implementation of inet_address::find_all
cb469fd1 net: deprecate the addr_list in hostent
1d59c0ca net: expose DNS TTL via net::hostent
3c6d919f http: add virtual close() to connection_factory
bbd0001a Revert "net: expose DNS TTL via net::hostent"
```

Closes scylladb/scylladb#28147
2026-01-18 15:00:48 +02:00
Andrzej Jackowski
6eca7e4ff6 transport: unify lambda capture lifetime for control connections
Workload prioritization was added in scylladb/scylladb#22031.
The functionality of updating service levels was implemented as
a lambda coroutine, leaving room for the lambda coroutine fiasco.

The problem was noticed and addressed in scylladb/scylladb#26404.
There are currently three functions that call switch_tenant:
 - update_user_scheduling_group_v1 and update_user_scheduling_group_v2
   use the deducing this (this auto self) to ensure the proper
   lifecycle of the lambda capture.
 - update_control_connection_scheduling_group doesn’t use the deducing
   this, but the lambda captures only `this`, which is used before
   the first possible coroutine preemption. Therefore, it doesn’t seem
   that any memory corruption or undefined behavior is possible here.

Nevertheless, it seems better to start using the deducing this in
update_control_connection_scheduling_group as well, to avoid problems
in the future if someone modifies the code and forgets to add it.

Fixes: SCYLLADB-284

Closes scylladb/scylladb#28158
2026-01-17 20:36:31 +02:00
Nikos Dragazis
8aca7b0eb9 test: database_test: Fix serialization of partition key
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:

```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```

Fixes #28195.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28197
2026-01-17 20:32:06 +02:00
Botond Dénes
1e09a34686 replica: add abort polling to memtable and cache readers
Continuing the read once it is aborted (e.g. due to timeout) is a waste
of resources, as the produced results will be discarded.
Poll the permit's abort exception in the memtable and cache reader's
fill_buffer(). This results in one poll per buffer filled (8KB of data).

We already have similar poll for sstable readers, as disk reads are
usually much heavier and therefore it is more important to stop them
ASAP after abort. Cache and memtable reads are usually quick but not
always, hence it is important to also have polling in the cache and
memtable readers.

Refs: #11469
Fixes: #28148

Closes scylladb/scylladb#28149
2026-01-16 18:03:04 +01:00
Ferenc Szili
0aebc17c4c docs: correct spelling errors in size based balancing docs
0ede8d154b introduced the dev doc for size
based load balancing, but also added spelling errors.
This PR fixed these errors.

Closes scylladb/scylladb#28196
2026-01-16 17:41:57 +02:00
Aleksandra Martyniuk
ad2381923f service: fin indentation 2026-01-16 11:38:10 +01:00
Aleksandra Martyniuk
504290902c test: add test_numeric_rf_to_rack_list_conversion_abort
Add regression test that checks whether aborted rf change leaves
the system_schema.keyspaces unchanged.
2026-01-16 11:36:21 +01:00
Aleksandra Martyniuk
3ed8701301 service: tasks: fix type of global_topology_request_virtual_task
Currently, the type of global_topology_request_virtual_task isn't
taken out of std::variant before printing, which results with
a task of type variant(actual_type).

Retrieve the type from the variant before passing it to task type.
2026-01-16 11:36:21 +01:00
Aleksandra Martyniuk
580dfd63e5 service: do not change the schema while pausing the rf change
Currently, if a rf change request is paused, it immediately changes
the system_schema.keyspaces to use rack list for this keyspace.
If the request is aborted, the co-location might not be finished.
Hence, we can end up with inconsistent schema and tablet replica state.

Update the system_schema.keyspaces only after the co-location is done (and
not when it's started).
2026-01-16 11:36:15 +01:00
Patryk Jędrzejczak
eb7be9010d Merge 'topology_coordinator: Refresh load stats after table is created or altered' from Tomasz Grabiec
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats, but also per-tablet stats.

Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).

This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.

Fixes https://github.com/scylladb/scylladb/issues/27921

No backport becuse only master is vulnerable (size-based load balancing).

Closes scylladb/scylladb#27926

* https://github.com/scylladb/scylladb:
  test: cluster: Add reproducer for missed notification in topology coordinator
  topology_coordinator: Wake up the state machine after stats refresh
  topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
  topology_coordinator: Fix potential missed notification
  topology_coordinator: Refresh load stats after table is created or altered
  tablets: Do a group0 read barrier on tablet load stats refresh
  topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
  test: Use ManagerClient.{disable,enable}_tablet_balancing()
  test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
  test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
2026-01-16 11:34:57 +01:00
Dawid Pawlik
383f9e6e56 docs: update documentation on filtering with vector queries
Add a description of available filtering options with ANN vector queries.
Provide an example of such query and a reference to `WHERE` clause restrictions.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
67d3454d2b test/vector_search: add test for filtered ANN with VS mock
Implement a test using Vector Store mock to check if end-to-end
integration works with filtered ANN query.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
a54be82536 test/vector_search: add restriction to JSON conversion unit tests
Add unit tests for conversion of CQL restrictions to Vector Store filtering API
compatible JSON objects. The tests include:
- empty restriction
- `ALLOW FILTERING` in restriction
- single column restrictions
    - `=`, `<`, `>`, `<=`, `>=`, `IN`
- multiple column restrictions
    - `()=()`, `()<()`, `()>()`, `()<=()`, `()>=()`, `() IN ()`
- multiple restrictions conjunction
- `TEXT` and `BOOLEAN` column restrictions
2026-01-16 11:18:23 +01:00
Dawid Pawlik
2a38794b8e vector_search: cql: construct and use filter in ANN vector queries
Add `filter` option in `ann()` function to write the filter JSON
object as the POST request in ANN vector queries.

Adjust existing `vector_store_client_test` tests accordingly.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
304e908e3b select_statement: do not require post query ordering for vector queries
As there is only one `ORDER BY` clause with `ANN OF` ordering supported
in ANN vector queries, there is no need to require post query ordering
for the ANN vector queries. The standard ordering is not allowed here.
In fact the ordering is done on the Vector Store service side within
the ANN search, so that the returned primary keys are already sorted
accordingly.

If left unchanged, the filtering with `IN` clauses would cause
a `bad_function_call` server error as the filtering with `IN` clauses
require the post query ordering in a standard case.
2026-01-16 11:18:23 +01:00
Dawid Pawlik
a84d1361db vector_search: add statement_restrictions to JSON parsing
Add a module parsing the statement restrictions into Vector Store
filtering API compatible JSON objects.

The API was defined in: scylladb/vector-store#334

Examplary JSON object compatible with the API:
```
{
 "restrictions": [
     { "type": "==", "lhs": "pk", "rhs": 1 },
     { "type": "IN", "lhs": "pk", "rhs": [2, 3] },
     { "type": "<", "lhs": "ck", "rhs": 4 },
     { "type": "<=", "lhs": "ck", "rhs": 5 },
     { "type": ">", "lhs": "pk", "rhs": 6 },
     { "type": ">=", "lhs": "pk", "rhs": 7 },
     { "type": "()==()", "lhs": ["pk", "ck"], "rhs": [10, 20] },
     { "type": "()IN()", "lhs": ["pk", "ck"], "rhs": [[100, 200], [300, 400]] },
     { "type": "()<()", "lhs": ["pk", "ck"], "rhs": [30, 40] },
     { "type": "()<=()", "lhs": ["pk", "ck"], "rhs": [50, 60] },
     { "type": "()>()", "lhs": ["pk", "ck"], "rhs": [70, 80] },
     { "type": "()>=()", "lhs": ["pk", "ck"], "rhs": [90, 0] }
 ],
 "allow_filtering": true
}
```
2026-01-16 11:18:23 +01:00
Tomasz Grabiec
3fb7719277 topology_coordinator: Update load stats in case rebuilding with no live replica
Such rebuild has no read_from replica, but we know the tablet size will be 0.
If we don't, stats will be incomplete until the next refresh.

This is important for test cases which do removenode or replace while
all replicas are down. So for example test_replace from
test_tablets_removenode.py, which uses RF=1 and replaces a node.

Without this, the test waits for 60s needlessly after the first round
of rebuilding migrations before scheduling more migrations. This can
cause the test to time out.

Fixes #28115

Closes scylladb/scylladb#28121
2026-01-16 11:19:01 +02:00
Calle Wilund
2fd6ca4c46 test/cluster/test_internode_compression: Transpose test from dtest
Refs #27429

re-implement the dtest with same name as a scylla pytest, using
a python level network proxy instead of tcpdump etc. Both to avoid
sudo and also to ensure we don't race.

v2:
* Included positive test (mode=all)
2026-01-14 10:53:34 +01:00
Nikos Dragazis
7fa1f87355 db/config: Update sstable_compression_user_table_options description
Clarify what "user table" means.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
1e37781d86 schema: Add initializer for compression defaults
In PR 5b6570be52 we introduced the config option
`sstable_compression_user_table_options` to allow adjusting the default
compression settings for user tables. However, the new option was hooked
into the CQL layer and applied only to CQL base tables, not to the whole
spectrum of user tables: CQL auxiliary tables (materialized views,
secondary indexes, CDC log tables), Alternator base tables, Alternator
auxiliary tables (GSIs, LSIs, Streams).

Fix this by moving the logic into the `schema_builder` via a schema
initializer. This ensures that the default compression settings are
applied uniformly regardless of how the table is created, while also
keeping the logic in a central place.

Register the initializer at startup in all executables where schemas are
being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`).

Finally, remove the ad-hoc logic from `create_table_statement`
(redundant as of this patch), remove the xfail markers from the relevant
tests and adjust `test_describe_cdc_log_table_create_statement` to
expect LZ4WithDicts as the default compressor.

Fixes #26914.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
d5ec66bc0c schema: Generalize static configurators into schema initializers
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.

As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.

Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
5b4aa4b6a6 schema: Initialize static properties eagerly
Schemas maintain a set of so-called "static properties". These are not
user-visible schema properties; they are internal values carried by
in-memory `schema` objects for convenience (349bc1a9b6,
https://github.com/scylladb/scylladb/pull/13170#issuecomment-1469848086).

Currently, the initialization of these properties happens when a
`schema_builder` builds a schema (`schema_builder::build()`), by
invoking all registered "static configurators".

This patch moves the initialization of static properties into the
`schema_builder` constructor. With this change, the builder initializes
the properties once, stores them in a data member, and reuses them for
all schema objects that it builds. This doesn't affect correctness as
the values produced by static configurators are "static" by
nature; they do not depend on runtime state.

In the next patch, we will replace the "static configurator" pattern
with a more general pattern that also supports initialization of regular
schema properties, not just static ones. Regular properties cannot be
initialized in `build()` because users may have already explicitly set
values via setters, and there is no way to distinguish between default
values and explicitly assigned ones.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:55 +02:00
Nikos Dragazis
76b2d0f961 db: config: Add accessor for sstable_compression_user_table_options
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.

In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.

Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).

As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:30:38 +02:00
Nikos Dragazis
4ec7a064a9 test: Check that CQL and Alternator tables respect compression config
In patches 11f6a25d44 and 7b9428d8d7 we added tests to verify that
auxiliary tables for both CQL and Alternator have the same default
compression settings as their base tables. These tests do not check
where these defaults originate from; they just verify that they are
consistent.

Add some more tests to verify the actual source of the defaults, which
is expected to be the `sstable_compression_user_table_options`
from the configuration. Unlike the previous tests, these tests require
dedicated Scylla instances with custom configuration, so they must be
placed under `test/cluster/`.

Mark them as xfail-ing. The marker will be removed later in this series.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:16:08 +02:00
Calle Wilund
da17e8b18b gossiper/main: Extend special treatment of node ID resolve for rpc_address
Refs #27429

If running with broadcast_address != listen/cql/rpc address, topology
gets confused about the varying addresses. Need to special case
resolve both addresses as "self". I.e. extend broadcast_address
treatment to cql_address as well.

Added export of this via gossiper for symmetry.
2026-01-13 14:12:19 +01:00
Tomasz Grabiec
8dc20e6aaf test: cluster: Add reproducer for missed notification in topology coordinator 2026-01-13 00:40:23 +01:00
Tomasz Grabiec
7a04dd2d22 topology_coordinator: Wake up the state machine after stats refresh
Otherwise, coordinator may not react to changing stats after explicit
calls to trigger_load_stats_refresh() done on node replace or table
creation, if stats take longer to refresh than it takes the
coordinator to go idle.

The periodic refresh does wake up the topology coordinator, so the
issue is not dramatic in production, but it's annoying in tests, which
take longer because of that.

Fixes #25163
2026-01-13 00:40:23 +01:00
Tomasz Grabiec
d910e6ea63 topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier
Refreshing stats will signal _topo_sm.event, so do it before waiting
for the event, to avoiding busy looping in the coordinator.

This will produce lots of logs in test cases which enable debug-level
logging in the raft logger.

Refs #28086
2026-01-13 00:40:16 +01:00
Tomasz Grabiec
e5dee2aab8 topology_coordinator: Fix potential missed notification
Checking for work is not atomic, so there is room for missed
notification. Especially that notifications are not always triggered
from fibers which take the group0 guard.

Fix by subscribing for the event before checking for work.

Fixes #27958
2026-01-13 00:39:01 +01:00
Tomasz Grabiec
2b7aa3211d topology_coordinator: Refresh load stats after table is created or altered
We switched to the size-based load balancing, which now has more
strict requirements for load stats. We no longer need only per-node
stats (capacity), but also per-tablet stats.

Bootstrapping a node triggers stats refresh, but allocating tablets on
table creation didn't. So after creating a table, load balancer
couldn't make progress for up to 60s (stats refresh period).

This makes tests take longer, and can even cause failures if tests are
using a low-enough timeout.

Fixes #27921
2026-01-13 00:38:59 +01:00
Tomasz Grabiec
663831ebd7 tablets: Do a group0 read barrier on tablet load stats refresh
Stats refresh will be triggered on topology coordinator by events like
allocating new tablets on table creation. For refresh to be effective,
all replicas must see the new tablets, otherwise stats will be
incomplete.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
c4c5ed5aba topology_coordinator: Ensure stats are refreshed in the gossip scheduling group
Refresh can be triggered from different places, but it should run in
the gossip scheduling group, like group0 operations.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
5e6935f276 test: Use ManagerClient.{disable,enable}_tablet_balancing() 2026-01-13 00:38:00 +01:00
Tomasz Grabiec
6936704677 test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API
If a test tries to move a tablet, it assumes the tablets are stable.
This fixes flakiness exposed by size-based load-balancing and a later
change to refresh stats sooner.
2026-01-13 00:38:00 +01:00
Tomasz Grabiec
c8098e07c9 test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()
It's a global operation, so we can use any server.

It's not only convenient. The call via api.disable_tablet_balancing()
confuse people to think that it's a per-server operation. This leads
to proliferation of code which does it needlessly on all servers.
2026-01-13 00:38:00 +01:00
Amnon Heiman
c6d1c63ddb distributed_loader: system_replicated_keys as system keyspace
This patch adds system_replicated_keys to the list of known system
keyspaces.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-02 16:41:47 +02:00
Amnon Heiman
83c1103917 replicated_key_provider: make KSNAME public
Move KSNAME constant from internal static to public member of
replicated_key_provider_factory class.

It will be used to identify it as a system keyspace.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-02 16:39:51 +02:00
282 changed files with 8894 additions and 1629 deletions

View File

@@ -17,6 +17,7 @@
#include "auth/service.hh"
#include "db/config.hh"
#include "db/view/view_build_status.hh"
#include "locator/tablets.hh"
#include "mutation/tombstone.hh"
#include "locator/abstract_replication_strategy.hh"
#include "utils/log.hh"
@@ -1875,23 +1876,34 @@ future<executor::request_return_type> executor::create_table_on_shard0(service::
auto ts = group0_guard.write_timestamp();
utils::chunked_vector<mutation> schema_mutations;
auto ksm = create_keyspace_metadata(keyspace_name, _proxy, _gossiper, ts, tags_map, _proxy.features(), tablets_mode);
locator::replication_strategy_params params(ksm->strategy_options(), ksm->initial_tablets(), ksm->consistency_option());
const auto& topo = _proxy.local_db().get_token_metadata().get_topology();
auto rs = locator::abstract_replication_strategy::create_replication_strategy(ksm->strategy_name(), params, topo);
// Alternator Streams doesn't yet work when the table uses tablets (#23838)
if (stream_specification && stream_specification->IsObject()) {
auto stream_enabled = rjson::find(*stream_specification, "StreamEnabled");
if (stream_enabled && stream_enabled->IsBool() && stream_enabled->GetBool()) {
locator::replication_strategy_params params(ksm->strategy_options(), ksm->initial_tablets(), ksm->consistency_option());
const auto& topo = _proxy.local_db().get_token_metadata().get_topology();
auto rs = locator::abstract_replication_strategy::create_replication_strategy(ksm->strategy_name(), params, topo);
if (rs->uses_tablets()) {
co_return api_error::validation("Streams not yet supported on a table using tablets (issue #23838). "
"If you want to use streams, create a table with vnodes by setting the tag 'system:initial_tablets' set to 'none'.");
}
}
}
// Creating an index in tablets mode requires the rf_rack_valid_keyspaces option to be enabled.
// GSI and LSI indexes are based on materialized views which require this option to avoid consistency issues.
if (!view_builders.empty() && ksm->uses_tablets() && !_proxy.data_dictionary().get_config().rf_rack_valid_keyspaces()) {
co_return api_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
// Creating an index in tablets mode requires the keyspace to be RF-rack-valid.
// GSI and LSI indexes are based on materialized views which require RF-rack-validity to avoid consistency issues.
if (!view_builders.empty() || _proxy.data_dictionary().get_config().rf_rack_valid_keyspaces()) {
try {
locator::assert_rf_rack_valid_keyspace(keyspace_name, _proxy.local_db().get_token_metadata_ptr(), *rs);
} catch (const std::invalid_argument& ex) {
if (!view_builders.empty()) {
co_return api_error::validation(fmt::format("GlobalSecondaryIndexes and LocalSecondaryIndexes on a table "
"using tablets require the number of racks in the cluster to be either 1 or 3"));
} else {
co_return api_error::validation(fmt::format("Cannot create table '{}' with tablets: the configuration "
"option 'rf_rack_valid_keyspaces' is enabled, which enforces that tables using tablets can only be created in clusters "
"that have either 1 or 3 racks", table_name));
}
}
}
try {
schema_mutations = service::prepare_new_keyspace_announcement(_proxy.local_db(), ksm, ts);
@@ -2114,9 +2126,12 @@ future<executor::request_return_type> executor::update_table(client_state& clien
co_return api_error::validation(fmt::format(
"LSI {} already exists in table {}, can't use same name for GSI", index_name, table_name));
}
if (p.local().local_db().find_keyspace(keyspace_name).get_replication_strategy().uses_tablets() &&
!p.local().data_dictionary().get_config().rf_rack_valid_keyspaces()) {
co_return api_error::validation("GlobalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
try {
locator::assert_rf_rack_valid_keyspace(keyspace_name, p.local().local_db().get_token_metadata_ptr(),
p.local().local_db().find_keyspace(keyspace_name).get_replication_strategy());
} catch (const std::invalid_argument& ex) {
co_return api_error::validation(fmt::format("GlobalSecondaryIndexes on a table "
"using tablets require the number of racks in the cluster to be either 1 or 3"));
}
elogger.trace("Adding GSI {}", index_name);

View File

@@ -3051,7 +3051,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
"required":false,
"allowMultiple":false,
"type":"string",

View File

@@ -2016,12 +2016,14 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("cf"), ",");
auto sfopt = req->get_query_param("sf");
auto sf = db::snapshot_ctl::skip_flush(strcasecmp(sfopt.c_str(), "true") == 0);
db::snapshot_options opts = {
.skip_flush = strcasecmp(sfopt.c_str(), "true") == 0,
};
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
try {
if (column_families.empty()) {
co_await snap_ctl.local().take_snapshot(tag, keynames, sf);
co_await snap_ctl.local().take_snapshot(tag, keynames, opts);
} else {
if (keynames.empty()) {
throw httpd::bad_param_exception("The keyspace of column families must be specified");
@@ -2029,7 +2031,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, sf);
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, opts);
}
co_return json_void();
} catch (...) {
@@ -2064,7 +2066,8 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
auto info = parse_scrub_options(ctx, std::move(req));
if (!info.snapshot_tag.empty()) {
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, db::snapshot_ctl::skip_flush::no);
db::snapshot_options opts = {.skip_flush = false};
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, opts);
}
compaction::compaction_stats stats;

View File

@@ -146,7 +146,8 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::
auto info = parse_scrub_options(ctx, std::move(req));
if (!info.snapshot_tag.empty()) {
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, db::snapshot_ctl::skip_flush::no);
db::snapshot_options opts = {.skip_flush = false};
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, opts);
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

View File

@@ -53,10 +53,10 @@ static std::string json_escape(std::string_view str) {
}
future<> audit_syslog_storage_helper::syslog_send_helper(const sstring& msg) {
future<> audit_syslog_storage_helper::syslog_send_helper(temporary_buffer<char> msg) {
try {
auto lock = co_await get_units(_semaphore, 1, std::chrono::hours(1));
co_await _sender.send(_syslog_address, net::packet{msg.data(), msg.size()});
co_await _sender.send(_syslog_address, std::span(&msg, 1));
}
catch (const std::exception& e) {
auto error_msg = seastar::format(
@@ -90,7 +90,7 @@ future<> audit_syslog_storage_helper::start(const db::config& cfg) {
co_return;
}
co_await syslog_send_helper("Initializing syslog audit backend.");
co_await syslog_send_helper(temporary_buffer<char>::copy_of("Initializing syslog audit backend."));
}
future<> audit_syslog_storage_helper::stop() {
@@ -120,7 +120,7 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,
audit_info->table(),
username);
co_await syslog_send_helper(msg);
co_await syslog_send_helper(std::move(msg).release());
}
future<> audit_syslog_storage_helper::write_login(const sstring& username,
@@ -139,7 +139,7 @@ future<> audit_syslog_storage_helper::write_login(const sstring& username,
client_ip,
username);
co_await syslog_send_helper(msg.c_str());
co_await syslog_send_helper(std::move(msg).release());
}
}

View File

@@ -26,7 +26,7 @@ class audit_syslog_storage_helper : public storage_helper {
net::datagram_channel _sender;
seastar::semaphore _semaphore;
future<> syslog_send_helper(const sstring& msg);
future<> syslog_send_helper(seastar::temporary_buffer<char> msg);
public:
explicit audit_syslog_storage_helper(cql3::query_processor&, service::migration_manager&);
virtual ~audit_syslog_storage_helper();

View File

@@ -876,22 +876,6 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
continue; // some tables might not have been created if they were not used
}
// use longer than usual timeout as we scan the whole table
// but not infinite or very long as we want to fail reasonably fast
const auto t = 5min;
const timeout_config tc{t, t, t, t, t, t, t};
::service::client_state cs(::service::client_state::internal_tag{}, tc);
::service::query_state qs(cs, empty_service_permit());
auto rows = co_await qp.execute_internal(
seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, cf_name),
db::consistency_level::ALL,
qs,
{},
cql3::query_processor::cache_internal::no);
if (rows->empty()) {
continue;
}
std::vector<sstring> col_names;
for (const auto& col : schema->all_columns()) {
col_names.push_back(col.name_as_cql_string());
@@ -900,30 +884,51 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_
for (size_t i = 1; i < col_names.size(); ++i) {
val_binders_str += ", ?";
}
for (const auto& row : *rows) {
std::vector<data_value_or_unset> values;
for (const auto& col : schema->all_columns()) {
if (row.has(col.name_as_text())) {
values.push_back(
col.type->deserialize(row.get_blob_unfragmented(col.name_as_text())));
} else {
values.push_back(unset_value{});
std::vector<mutation> collected;
// use longer than usual timeout as we scan the whole table
// but not infinite or very long as we want to fail reasonably fast
const auto t = 5min;
const timeout_config tc{t, t, t, t, t, t, t};
::service::client_state cs(::service::client_state::internal_tag{}, tc);
::service::query_state qs(cs, empty_service_permit());
co_await qp.query_internal(
seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, cf_name),
db::consistency_level::ALL,
{},
1000,
[&qp, &cf_name, &col_names, &val_binders_str, &schema, ts, &collected] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
std::vector<data_value_or_unset> values;
for (const auto& col : schema->all_columns()) {
if (row.has(col.name_as_text())) {
values.push_back(
col.type->deserialize(row.get_blob_unfragmented(col.name_as_text())));
} else {
values.push_back(unset_value{});
}
}
}
auto muts = co_await qp.get_mutations_internal(
seastar::format("INSERT INTO {}.{} ({}) VALUES ({})",
db::system_keyspace::NAME,
cf_name,
fmt::join(col_names, ", "),
val_binders_str),
internal_distributed_query_state(),
ts,
std::move(values));
if (muts.size() != 1) {
on_internal_error(log,
format("expecting single insert mutation, got {}", muts.size()));
}
co_yield std::move(muts[0]);
auto muts = co_await qp.get_mutations_internal(
seastar::format("INSERT INTO {}.{} ({}) VALUES ({})",
db::system_keyspace::NAME,
cf_name,
fmt::join(col_names, ", "),
val_binders_str),
internal_distributed_query_state(),
ts,
std::move(values));
if (muts.size() != 1) {
on_internal_error(log,
format("expecting single insert mutation, got {}", muts.size()));
}
collected.push_back(std::move(muts[0]));
co_return stop_iteration::no;
},
std::move(qs));
for (auto& m : collected) {
co_yield std::move(m);
}
}
co_yield co_await sys_ks.make_auth_version_mutation(ts,

View File

@@ -725,7 +725,9 @@ raft_tests = set([
vector_search_tests = set([
'test/vector_search/vector_store_client_test',
'test/vector_search/load_balancer_test',
'test/vector_search/client_test'
'test/vector_search/client_test',
'test/vector_search/filter_test',
'test/vector_search/rescoring_test'
])
vector_search_validator_bin = 'vector-search-validator/bin/vector-search-validator'
@@ -1034,6 +1036,9 @@ scylla_core = (['message/messaging_service.cc',
'cql3/functions/aggregate_fcts.cc',
'cql3/functions/castas_fcts.cc',
'cql3/functions/error_injection_fcts.cc',
'cql3/statements/strong_consistency/modification_statement.cc',
'cql3/statements/strong_consistency/select_statement.cc',
'cql3/statements/strong_consistency/statement_helpers.cc',
'cql3/functions/vector_similarity_fcts.cc',
'cql3/statements/cf_prop_defs.cc',
'cql3/statements/cf_statement.cc',
@@ -1059,8 +1064,8 @@ scylla_core = (['message/messaging_service.cc',
'cql3/statements/raw/parsed_statement.cc',
'cql3/statements/property_definitions.cc',
'cql3/statements/update_statement.cc',
'cql3/statements/strongly_consistent_modification_statement.cc',
'cql3/statements/strongly_consistent_select_statement.cc',
'cql3/statements/broadcast_modification_statement.cc',
'cql3/statements/broadcast_select_statement.cc',
'cql3/statements/delete_statement.cc',
'cql3/statements/prune_materialized_view_statement.cc',
'cql3/statements/batch_statement.cc',
@@ -1351,6 +1356,9 @@ scylla_core = (['message/messaging_service.cc',
'lang/wasm.cc',
'lang/wasm_alien_thread_runner.cc',
'lang/wasm_instance_cache.cc',
'service/strong_consistency/groups_manager.cc',
'service/strong_consistency/coordinator.cc',
'service/strong_consistency/state_machine.cc',
'service/raft/group0_state_id_handler.cc',
'service/raft/group0_state_machine.cc',
'service/raft/group0_state_machine_merger.cc',
@@ -1380,6 +1388,7 @@ scylla_core = (['message/messaging_service.cc',
'vector_search/dns.cc',
'vector_search/client.cc',
'vector_search/clients.cc',
'vector_search/filter.cc',
'vector_search/truststore.cc'
] + [Antlr3Grammar('cql3/Cql.g')] \
+ scylla_raft_core
@@ -1489,6 +1498,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/hinted_handoff.idl.hh',
'idl/storage_proxy.idl.hh',
'idl/sstables.idl.hh',
'idl/strong_consistency/state_machine.idl.hh',
'idl/group0_state_machine.idl.hh',
'idl/mapreduce_request.idl.hh',
'idl/replica_exception.idl.hh',
@@ -1784,6 +1794,8 @@ deps['test/raft/discovery_test'] = ['test/raft/discovery_test.cc',
deps['test/vector_search/vector_store_client_test'] = ['test/vector_search/vector_store_client_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/load_balancer_test'] = ['test/vector_search/load_balancer_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/client_test'] = ['test/vector_search/client_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/filter_test'] = ['test/vector_search/filter_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/rescoring_test'] = ['test/vector_search/rescoring_test.cc'] + scylla_tests_dependencies
boost_tests_prefixes = ["test/boost/", "test/vector_search/", "test/raft/", "test/manual/", "test/ldap/"]

View File

@@ -47,6 +47,9 @@ target_sources(cql3
functions/aggregate_fcts.cc
functions/castas_fcts.cc
functions/error_injection_fcts.cc
statements/strong_consistency/select_statement.cc
statements/strong_consistency/modification_statement.cc
statements/strong_consistency/statement_helpers.cc
functions/vector_similarity_fcts.cc
statements/cf_prop_defs.cc
statements/cf_statement.cc
@@ -72,8 +75,8 @@ target_sources(cql3
statements/raw/parsed_statement.cc
statements/property_definitions.cc
statements/update_statement.cc
statements/strongly_consistent_modification_statement.cc
statements/strongly_consistent_select_statement.cc
statements/broadcast_modification_statement.cc
statements/broadcast_select_statement.cc
statements/delete_statement.cc
statements/prune_materialized_view_statement.cc
statements/batch_statement.cc

View File

@@ -48,8 +48,10 @@ const std::chrono::minutes prepared_statements_cache::entry_expiry = std::chrono
struct query_processor::remote {
remote(service::migration_manager& mm, service::mapreduce_service& fwd,
service::storage_service& ss, service::raft_group0_client& group0_client)
service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& _sc_coordinator)
: mm(mm), mapreducer(fwd), ss(ss), group0_client(group0_client)
, sc_coordinator(_sc_coordinator)
, gate("query_processor::remote")
{}
@@ -57,6 +59,7 @@ struct query_processor::remote {
service::mapreduce_service& mapreducer;
service::storage_service& ss;
service::raft_group0_client& group0_client;
service::strong_consistency::coordinator& sc_coordinator;
seastar::named_gate gate;
};
@@ -514,9 +517,16 @@ query_processor::~query_processor() {
}
}
std::pair<std::reference_wrapper<service::strong_consistency::coordinator>, gate::holder>
query_processor::acquire_strongly_consistent_coordinator() {
auto [remote_, holder] = remote();
return {remote_.get().sc_coordinator, std::move(holder)};
}
void query_processor::start_remote(service::migration_manager& mm, service::mapreduce_service& mapreducer,
service::storage_service& ss, service::raft_group0_client& group0_client) {
_remote = std::make_unique<struct remote>(mm, mapreducer, ss, group0_client);
service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& sc_coordinator) {
_remote = std::make_unique<struct remote>(mm, mapreducer, ss, group0_client, sc_coordinator);
}
future<> query_processor::stop_remote() {
@@ -860,6 +870,7 @@ struct internal_query_state {
sstring query_string;
std::unique_ptr<query_options> opts;
statements::prepared_statement::checked_weak_ptr p;
std::optional<service::query_state> qs;
bool more_results = true;
};
@@ -867,10 +878,14 @@ internal_query_state query_processor::create_paged_state(
const sstring& query_string,
db::consistency_level cl,
const data_value_list& values,
int32_t page_size) {
int32_t page_size,
std::optional<service::query_state> qs) {
auto p = prepare_internal(query_string);
auto opts = make_internal_options(p, values, cl, page_size);
return internal_query_state{query_string, std::make_unique<cql3::query_options>(std::move(opts)), std::move(p), true};
if (!qs) {
qs.emplace(query_state_for_internal_call());
}
return internal_query_state{query_string, std::make_unique<cql3::query_options>(std::move(opts)), std::move(p), std::move(qs), true};
}
bool query_processor::has_more_results(cql3::internal_query_state& state) const {
@@ -893,9 +908,8 @@ future<> query_processor::for_each_cql_result(
future<::shared_ptr<untyped_result_set>>
query_processor::execute_paged_internal(internal_query_state& state) {
state.p->statement->validate(*this, service::client_state::for_internal_calls());
auto qs = query_state_for_internal_call();
::shared_ptr<cql_transport::messages::result_message> msg =
co_await state.p->statement->execute(*this, qs, *state.opts, std::nullopt);
co_await state.p->statement->execute(*this, *state.qs, *state.opts, std::nullopt);
class visitor : public result_message::visitor_base {
internal_query_state& _state;
@@ -1202,8 +1216,9 @@ future<> query_processor::query_internal(
db::consistency_level cl,
const data_value_list& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f) {
auto query_state = create_paged_state(query_string, cl, values, page_size);
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f,
std::optional<service::query_state> qs) {
auto query_state = create_paged_state(query_string, cl, values, page_size, std::move(qs));
co_return co_await for_each_cql_result(query_state, std::move(f));
}

View File

@@ -44,6 +44,10 @@ class query_state;
class mapreduce_service;
class raft_group0_client;
namespace strong_consistency {
class coordinator;
}
namespace broadcast_tables {
struct query;
}
@@ -155,7 +159,8 @@ public:
~query_processor();
void start_remote(service::migration_manager&, service::mapreduce_service&,
service::storage_service& ss, service::raft_group0_client&);
service::storage_service& ss, service::raft_group0_client&,
service::strong_consistency::coordinator&);
future<> stop_remote();
data_dictionary::database db() {
@@ -174,6 +179,9 @@ public:
return _proxy;
}
std::pair<std::reference_wrapper<service::strong_consistency::coordinator>, gate::holder>
acquire_strongly_consistent_coordinator();
cql_stats& get_cql_stats() {
return _cql_stats;
}
@@ -322,6 +330,7 @@ public:
* page_size - maximum page size
* f - a function to be run on each row of the query result,
* if the function returns stop_iteration::yes the iteration will stop
* qs - optional query state (default: std::nullopt)
*
* \note This function is optimized for convenience, not performance.
*/
@@ -330,7 +339,8 @@ public:
db::consistency_level cl,
const data_value_list& values,
int32_t page_size,
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f);
noncopyable_function<future<stop_iteration>(const cql3::untyped_result_set_row&)> f,
std::optional<service::query_state> qs = std::nullopt);
/*
* \brief iterate over all cql results using paging
@@ -499,7 +509,8 @@ private:
const sstring& query_string,
db::consistency_level,
const data_value_list& values,
int32_t page_size);
int32_t page_size,
std::optional<service::query_state> qs = std::nullopt);
/*!
* \brief run a query using paging

View File

@@ -46,6 +46,13 @@ void metadata::add_non_serialized_column(lw_shared_ptr<column_specification> nam
_column_info->_names.emplace_back(std::move(name));
}
void metadata::hide_last_column() {
if (_column_info->_column_count == 0) {
utils::on_internal_error("Trying to hide a column when there are no columns visible.");
}
_column_info->_column_count--;
}
void metadata::set_paging_state(lw_shared_ptr<const service::pager::paging_state> paging_state) {
_flags.set<flag::HAS_MORE_PAGES>();
_paging_state = std::move(paging_state);

View File

@@ -73,6 +73,7 @@ public:
uint32_t value_count() const;
void add_non_serialized_column(lw_shared_ptr<column_specification> name);
void hide_last_column();
public:
void set_paging_state(lw_shared_ptr<const service::pager::paging_state> paging_state);

View File

@@ -225,10 +225,9 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
// The second hyphen is not really true because currently topological changes can
// disturb it (see scylladb/scylladb#23345), but we ignore that.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
} catch (const std::invalid_argument& e) {
if (replica::database::enforce_rf_rack_validity_for_keyspace(qp.db().get_config(), *ks_md)) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
// wrap the exception manually here in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
} else {
// Even when RF-rack-validity is not enforced for the keyspace, we'd

View File

@@ -9,7 +9,7 @@
*/
#include "cql3/statements/strongly_consistent_modification_statement.hh"
#include "cql3/statements/broadcast_modification_statement.hh"
#include <optional>
@@ -28,11 +28,11 @@
namespace cql3 {
static logging::logger logger("strongly_consistent_modification_statement");
static logging::logger logger("broadcast_modification_statement");
namespace statements {
strongly_consistent_modification_statement::strongly_consistent_modification_statement(
broadcast_modification_statement::broadcast_modification_statement(
uint32_t bound_terms,
schema_ptr schema,
broadcast_tables::prepared_update query)
@@ -43,7 +43,7 @@ strongly_consistent_modification_statement::strongly_consistent_modification_sta
{ }
future<::shared_ptr<cql_transport::messages::result_message>>
strongly_consistent_modification_statement::execute(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
broadcast_modification_statement::execute(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
return execute_without_checking_exception_message(qp, qs, options, std::move(guard))
.then(cql_transport::messages::propagate_exception_as_future<shared_ptr<cql_transport::messages::result_message>>);
}
@@ -63,7 +63,7 @@ evaluate_prepared(
}
future<::shared_ptr<cql_transport::messages::result_message>>
strongly_consistent_modification_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
broadcast_modification_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
if (this_shard_id() != 0) {
co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0, cql3::computed_function_values{});
}
@@ -103,11 +103,11 @@ strongly_consistent_modification_statement::execute_without_checking_exception_m
), result);
}
uint32_t strongly_consistent_modification_statement::get_bound_terms() const {
uint32_t broadcast_modification_statement::get_bound_terms() const {
return _bound_terms;
}
future<> strongly_consistent_modification_statement::check_access(query_processor& qp, const service::client_state& state) const {
future<> broadcast_modification_statement::check_access(query_processor& qp, const service::client_state& state) const {
auto f = state.has_column_family_access(_schema->ks_name(), _schema->cf_name(), auth::permission::MODIFY);
if (_query.value_condition.has_value()) {
f = f.then([this, &state] {
@@ -117,7 +117,7 @@ future<> strongly_consistent_modification_statement::check_access(query_processo
return f;
}
bool strongly_consistent_modification_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
bool broadcast_modification_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return _schema->ks_name() == ks_name && (!cf_name || _schema->cf_name() == *cf_name);
}

View File

@@ -27,13 +27,13 @@ struct prepared_update {
}
class strongly_consistent_modification_statement : public cql_statement_opt_metadata {
class broadcast_modification_statement : public cql_statement_opt_metadata {
const uint32_t _bound_terms;
const schema_ptr _schema;
const broadcast_tables::prepared_update _query;
public:
strongly_consistent_modification_statement(uint32_t bound_terms, schema_ptr schema, broadcast_tables::prepared_update query);
broadcast_modification_statement(uint32_t bound_terms, schema_ptr schema, broadcast_tables::prepared_update query);
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const override;

View File

@@ -9,7 +9,7 @@
*/
#include "cql3/statements/strongly_consistent_select_statement.hh"
#include "cql3/statements/broadcast_select_statement.hh"
#include <seastar/core/future.hh>
#include <seastar/core/on_internal_error.hh>
@@ -24,7 +24,7 @@ namespace cql3 {
namespace statements {
static logging::logger logger("strongly_consistent_select_statement");
static logging::logger logger("broadcast_select_statement");
static
expr::expression get_key(const cql3::expr::expression& partition_key_restrictions) {
@@ -58,7 +58,7 @@ bool is_selecting_only_value(const cql3::selection::selection& selection) {
selection.get_columns()[0]->name() == "value";
}
strongly_consistent_select_statement::strongly_consistent_select_statement(schema_ptr schema, uint32_t bound_terms,
broadcast_select_statement::broadcast_select_statement(schema_ptr schema, uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
@@ -73,7 +73,7 @@ strongly_consistent_select_statement::strongly_consistent_select_statement(schem
_query{prepare_query()}
{ }
broadcast_tables::prepared_select strongly_consistent_select_statement::prepare_query() const {
broadcast_tables::prepared_select broadcast_select_statement::prepare_query() const {
if (!is_selecting_only_value(*_selection)) {
throw service::broadcast_tables::unsupported_operation_error("only 'value' selector is allowed");
}
@@ -94,7 +94,7 @@ evaluate_prepared(
}
future<::shared_ptr<cql_transport::messages::result_message>>
strongly_consistent_select_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
broadcast_select_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
if (this_shard_id() != 0) {
co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0, cql3::computed_function_values{});
}

View File

@@ -25,12 +25,12 @@ struct prepared_select {
}
class strongly_consistent_select_statement : public select_statement {
class broadcast_select_statement : public select_statement {
const broadcast_tables::prepared_select _query;
broadcast_tables::prepared_select prepare_query() const;
public:
strongly_consistent_select_statement(schema_ptr schema,
broadcast_select_statement(schema_ptr schema,
uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,

View File

@@ -123,10 +123,9 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chun
// We hold a group0_guard, so it's correct to check this here.
// The topology or schema cannot change while we're performing this query.
locator::assert_rf_rack_valid_keyspace(_name, tmptr, *rs);
} catch (const std::exception& e) {
} catch (const std::invalid_argument& e) {
if (replica::database::enforce_rf_rack_validity_for_keyspace(cfg, *ksm)) {
// There's no guarantee what the type of the exception will be, so we need to
// wrap it manually here in a type that can be passed to the user.
// wrap the exception in a type that can be passed to the user.
throw exceptions::invalid_request_exception(e.what());
} else {
// Even when RF-rack-validity is not enforced for the keyspace, we'd

View File

@@ -31,8 +31,6 @@
#include "db/config.hh"
#include "compaction/time_window_compaction_strategy.hh"
bool is_internal_keyspace(std::string_view name);
namespace cql3 {
namespace statements {
@@ -124,10 +122,6 @@ void create_table_statement::apply_properties_to(schema_builder& builder, const
addColumnMetadataFromAliases(cfmd, Collections.singletonList(valueAlias), defaultValidator, ColumnDefinition.Kind.COMPACT_VALUE);
#endif
if (!_properties->get_compression_options() && !is_internal_keyspace(keyspace())) {
builder.set_compressor_params(db.get_config().sstable_compression_user_table_options());
}
_properties->apply_to_builder(builder, _properties->make_schema_extensions(db.extensions()), db, keyspace(), true);
}

View File

@@ -98,6 +98,7 @@ static locator::replication_strategy_config_options prepare_options(
const sstring& strategy_class,
const locator::token_metadata& tm,
bool rf_rack_valid_keyspaces,
bool enforce_rack_list,
locator::replication_strategy_config_options options,
const locator::replication_strategy_config_options& old_options,
bool rack_list_enabled,
@@ -107,7 +108,7 @@ static locator::replication_strategy_config_options prepare_options(
auto is_nts = locator::abstract_replication_strategy::to_qualified_class_name(strategy_class) == "org.apache.cassandra.locator.NetworkTopologyStrategy";
auto is_alter = !old_options.empty();
const auto& all_dcs = tm.get_datacenter_racks_token_owners();
auto auto_expand_racks = uses_tablets && rf_rack_valid_keyspaces && rack_list_enabled;
auto auto_expand_racks = uses_tablets && rack_list_enabled && (rf_rack_valid_keyspaces || enforce_rack_list);
logger.debug("prepare_options: {}: is_nts={} auto_expand_racks={} rack_list_enabled={} old_options={} new_options={} all_dcs={}",
strategy_class, is_nts, auto_expand_racks, rack_list_enabled, old_options, options, all_dcs);
@@ -417,7 +418,7 @@ lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata(s
auto initial_tablets = get_initial_tablets(default_initial_tablets, cfg.enforce_tablets());
bool uses_tablets = initial_tablets.has_value();
bool rack_list_enabled = utils::get_local_injector().enter("create_with_numeric") ? false : feat.rack_list_rf;
auto options = prepare_options(sc, tm, cfg.rf_rack_valid_keyspaces(), get_replication_options(), {}, rack_list_enabled, uses_tablets);
auto options = prepare_options(sc, tm, cfg.rf_rack_valid_keyspaces(), cfg.enforce_rack_list(), get_replication_options(), {}, rack_list_enabled, uses_tablets);
return data_dictionary::keyspace_metadata::new_keyspace(ks_name, sc,
std::move(options), initial_tablets, get_consistency_option(), get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
}
@@ -434,7 +435,7 @@ lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata_u
auto sc = get_replication_strategy_class();
bool rack_list_enabled = utils::get_local_injector().enter("create_with_numeric") ? false : feat.rack_list_rf;
if (sc) {
options = prepare_options(*sc, tm, cfg.rf_rack_valid_keyspaces(), get_replication_options(), old_options, rack_list_enabled, uses_tablets);
options = prepare_options(*sc, tm, cfg.rf_rack_valid_keyspaces(), cfg.enforce_rack_list(), get_replication_options(), old_options, rack_list_enabled, uses_tablets);
} else {
sc = old->strategy_name();
options = old_options;

View File

@@ -11,7 +11,7 @@
#include "utils/assert.hh"
#include "cql3/cql_statement.hh"
#include "cql3/statements/modification_statement.hh"
#include "cql3/statements/strongly_consistent_modification_statement.hh"
#include "cql3/statements/broadcast_modification_statement.hh"
#include "cql3/statements/raw/modification_statement.hh"
#include "cql3/statements/prepared_statement.hh"
#include "cql3/expr/expr-utils.hh"
@@ -29,6 +29,8 @@
#include "cql3/query_processor.hh"
#include "service/storage_proxy.hh"
#include "service/broadcast_tables/experimental/lang.hh"
#include "cql3/statements/strong_consistency/modification_statement.hh"
#include "cql3/statements/strong_consistency/statement_helpers.hh"
#include <boost/lexical_cast.hpp>
@@ -546,7 +548,7 @@ modification_statement::process_where_clause(data_dictionary::database db, expr:
}
}
::shared_ptr<strongly_consistent_modification_statement>
::shared_ptr<broadcast_modification_statement>
modification_statement::prepare_for_broadcast_tables() const {
// FIXME: implement for every type of `modification_statement`.
throw service::broadcast_tables::unsupported_operation_error{};
@@ -554,24 +556,27 @@ modification_statement::prepare_for_broadcast_tables() const {
namespace raw {
::shared_ptr<cql_statement_opt_metadata>
modification_statement::prepare_statement(data_dictionary::database db, prepare_context& ctx, cql_stats& stats) {
::shared_ptr<cql3::statements::modification_statement> statement = prepare(db, ctx, stats);
if (service::broadcast_tables::is_broadcast_table_statement(keyspace(), column_family())) {
return statement->prepare_for_broadcast_tables();
} else {
return statement;
}
}
std::unique_ptr<prepared_statement>
modification_statement::prepare(data_dictionary::database db, cql_stats& stats) {
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
auto meta = get_prepare_context();
auto statement = prepare_statement(db, meta, stats);
auto statement = std::invoke([&] -> shared_ptr<cql_statement> {
auto result = prepare(db, meta, stats);
if (strong_consistency::is_strongly_consistent(db, schema->ks_name())) {
return ::make_shared<strong_consistency::modification_statement>(std::move(result));
}
if (service::broadcast_tables::is_broadcast_table_statement(keyspace(), column_family())) {
return result->prepare_for_broadcast_tables();
}
return result;
});
auto partition_key_bind_indices = meta.get_partition_key_bind_indexes(*schema);
return std::make_unique<prepared_statement>(audit_info(), std::move(statement), meta, std::move(partition_key_bind_indices));
return std::make_unique<prepared_statement>(audit_info(), std::move(statement), meta,
std::move(partition_key_bind_indices));
}
::shared_ptr<cql3::statements::modification_statement>

View File

@@ -30,7 +30,7 @@ class operation;
namespace statements {
class strongly_consistent_modification_statement;
class broadcast_modification_statement;
namespace raw { class modification_statement; }
@@ -113,15 +113,15 @@ public:
virtual void add_update_for_key(mutation& m, const query::clustering_range& range, const update_parameters& params, const json_cache_opt& json_cache) const = 0;
virtual uint32_t get_bound_terms() const override;
uint32_t get_bound_terms() const override;
virtual const sstring& keyspace() const;
const sstring& keyspace() const;
virtual const sstring& column_family() const;
const sstring& column_family() const;
virtual bool is_counter() const;
bool is_counter() const;
virtual bool is_view() const;
bool is_view() const;
int64_t get_timestamp(int64_t now, const query_options& options) const;
@@ -129,12 +129,12 @@ public:
std::optional<gc_clock::duration> get_time_to_live(const query_options& options) const;
virtual future<> check_access(query_processor& qp, const service::client_state& state) const override;
future<> check_access(query_processor& qp, const service::client_state& state) const override;
// Validate before execute, using client state and current schema
void validate(query_processor&, const service::client_state& state) const override;
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
void add_operation(::shared_ptr<operation> op);
@@ -256,7 +256,9 @@ public:
virtual json_cache_opt maybe_prepare_json_cache(const query_options& options) const;
virtual ::shared_ptr<strongly_consistent_modification_statement> prepare_for_broadcast_tables() const;
virtual ::shared_ptr<broadcast_modification_statement> prepare_for_broadcast_tables() const;
db::timeout_clock::duration get_timeout(const service::client_state& state, const query_options& options) const;
protected:
/**
@@ -264,9 +266,7 @@ protected:
* processed to check that they are compatible.
* @throws InvalidRequestException
*/
virtual void validate_where_clause_for_conditions() const;
db::timeout_clock::duration get_timeout(const service::client_state& state, const query_options& options) const;
void validate_where_clause_for_conditions() const;
friend class raw::modification_statement;
};

View File

@@ -40,7 +40,6 @@ protected:
public:
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
::shared_ptr<cql_statement_opt_metadata> prepare_statement(data_dictionary::database db, prepare_context& ctx, cql_stats& stats);
::shared_ptr<cql3::statements::modification_statement> prepare(data_dictionary::database db, prepare_context& ctx, cql_stats& stats) const;
void add_raw(sstring&& raw) { _raw_cql = std::move(raw); }
const sstring& get_raw_cql() const { return _raw_cql; }

View File

@@ -131,8 +131,6 @@ private:
void verify_ordering_is_valid(const prepared_orderings_type&, const schema&, const restrictions::statement_restrictions& restrictions) const;
prepared_ann_ordering_type prepare_ann_ordering(const schema& schema, prepare_context& ctx, data_dictionary::database db) const;
// Checks whether this ordering reverses all results.
// We only allow leaving select results unchanged or reversing them.
bool is_ordering_reversed(const prepared_orderings_type&) const;

View File

@@ -8,6 +8,8 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "cql3/statements/strong_consistency/select_statement.hh"
#include "cql3/statements/strong_consistency/statement_helpers.hh"
#include "cql3/statements/select_statement.hh"
#include "cql3/expr/expression.hh"
#include "cql3/expr/evaluate.hh"
@@ -16,7 +18,7 @@
#include "cql3/statements/raw/select_statement.hh"
#include "cql3/query_processor.hh"
#include "cql3/statements/prune_materialized_view_statement.hh"
#include "cql3/statements/strongly_consistent_select_statement.hh"
#include "cql3/statements/broadcast_select_statement.hh"
#include "exceptions/exceptions.hh"
#include <seastar/core/future.hh>
@@ -25,12 +27,14 @@
#include "service/broadcast_tables/experimental/lang.hh"
#include "service/qos/qos_common.hh"
#include "transport/messages/result_message.hh"
#include "cql3/functions/functions.hh"
#include "cql3/functions/as_json_function.hh"
#include "cql3/selection/selection.hh"
#include "cql3/util.hh"
#include "cql3/restrictions/statement_restrictions.hh"
#include "index/secondary_index.hh"
#include "types/vector.hh"
#include "vector_search/filter.hh"
#include "validation.hh"
#include "exceptions/unrecognized_entity_exception.hh"
#include <optional>
@@ -368,8 +372,9 @@ uint64_t select_statement::get_inner_loop_limit(uint64_t limit, bool is_aggregat
}
bool select_statement::needs_post_query_ordering() const {
// We need post-query ordering only for queries with IN on the partition key and an ORDER BY.
return _restrictions->key_is_in_relation() && !_parameters->orderings().empty();
// We need post-query ordering for queries with IN on the partition key and an ORDER BY
// and ANN index queries with rescoring.
return static_cast<bool>(_ordering_comparator);
}
struct select_statement_executor {
@@ -1958,14 +1963,46 @@ mutation_fragments_select_statement::do_execute(query_processor& qp, service::qu
}));
}
::shared_ptr<cql3::statements::select_statement> vector_indexed_table_select_statement::prepare(data_dictionary::database db, schema_ptr schema,
uint32_t bound_terms, lw_shared_ptr<const parameters> parameters, ::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
ordering_comparator_type ordering_comparator, prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, std::unique_ptr<attributes> attrs) {
struct ann_ordering_info {
secondary_index::index _index;
raw::select_statement::prepared_ann_ordering_type _prepared_ann_ordering;
bool is_rescoring_enabled;
};
static std::optional<ann_ordering_info> get_ann_ordering_info(
data_dictionary::database db,
schema_ptr schema,
lw_shared_ptr<const raw::select_statement::parameters> parameters,
prepare_context& ctx) {
if (parameters->orderings().empty()) {
return std::nullopt;
}
auto [column_id, ordering] = parameters->orderings().front();
const auto& ann_vector = std::get_if<raw::select_statement::ann_vector>(&ordering);
if (!ann_vector) {
return std::nullopt;
}
::shared_ptr<column_identifier> column = column_id->prepare_column_identifier(*schema);
const column_definition* def = schema->get_column_definition(column->name());
if (!def) {
throw exceptions::invalid_request_exception(
fmt::format("Undefined column name {}", column->text()));
}
if (!def->type->is_vector() || static_cast<const vector_type_impl*>(def->type.get())->get_elements_type()->get_kind() != abstract_type::kind::float_kind) {
throw exceptions::invalid_request_exception("ANN ordering is only supported on float vector indexes");
}
auto e = expr::prepare_expression(*ann_vector, db, schema->ks_name(), nullptr, def->column_specification);
expr::fill_prepare_context(e, ctx);
raw::select_statement::prepared_ann_ordering_type prepared_ann_ordering = std::make_pair(std::move(def), std::move(e));
auto cf = db.find_column_family(schema);
auto& sim = cf.get_index_manager();
auto [index_opt, _] = restrictions->find_idx(sim);
auto indexes = sim.list_indexes();
auto it = std::find_if(indexes.begin(), indexes.end(), [&prepared_ann_ordering](const auto& ind) {
@@ -1977,27 +2014,90 @@ mutation_fragments_select_statement::do_execute(query_processor& qp, service::qu
if (it == indexes.end()) {
throw exceptions::invalid_request_exception("ANN ordering by vector requires the column to be indexed using 'vector_index'");
}
index_opt = *it;
if (!index_opt) {
throw std::runtime_error("No index found.");
return ann_ordering_info{
*it,
std::move(prepared_ann_ordering),
secondary_index::vector_index::is_rescoring_enabled(it->metadata().options())
};
}
static uint32_t add_similarity_function_to_selectors(
std::vector<selection::prepared_selector>& prepared_selectors,
const ann_ordering_info& ann_ordering_info,
data_dictionary::database db,
schema_ptr schema) {
auto similarity_function_name = secondary_index::vector_index::get_cql_similarity_function_name(ann_ordering_info._index.metadata().options());
// Create the function name
auto func_name = functions::function_name::native_function(sstring(similarity_function_name));
// Create the function arguments
std::vector<expr::expression> args;
args.push_back(expr::column_value(ann_ordering_info._prepared_ann_ordering.first));
args.push_back(ann_ordering_info._prepared_ann_ordering.second);
// Get the function object
std::vector<shared_ptr<assignment_testable>> provided_args;
provided_args.push_back(expr::as_assignment_testable(args[0], expr::type_of(args[0])));
provided_args.push_back(expr::as_assignment_testable(args[1], expr::type_of(args[1])));
auto func = cql3::functions::instance().get(db, schema->ks_name(), func_name, provided_args, schema->ks_name(), schema->cf_name(), nullptr);
// Create the function call expression
expr::function_call similarity_func_call{
.func = func,
.args = std::move(args),
};
// Add the similarity function as a prepared selector (last)
prepared_selectors.push_back(selection::prepared_selector{
.expr = std::move(similarity_func_call),
.alias = nullptr,
});
return prepared_selectors.size() - 1;
}
static select_statement::ordering_comparator_type get_similarity_ordering_comparator(std::vector<selection::prepared_selector>& prepared_selectors, uint32_t similarity_column_index) {
auto type = expr::type_of(prepared_selectors[similarity_column_index].expr);
if (type->get_kind() != abstract_type::kind::float_kind) {
seastar::on_internal_error(logger, "Similarity function must return float type.");
}
return [similarity_column_index, type] (const raw::select_statement::result_row_type& r1, const raw::select_statement::result_row_type& r2) {
auto& c1 = r1[similarity_column_index];
auto& c2 = r2[similarity_column_index];
auto f1 = c1 ? value_cast<float>(type->deserialize(*c1)) : std::numeric_limits<float>::quiet_NaN();
auto f2 = c2 ? value_cast<float>(type->deserialize(*c2)) : std::numeric_limits<float>::quiet_NaN();
if (std::isfinite(f1) && std::isfinite(f2)) {
return f1 > f2;
}
return std::isfinite(f1);
};
}
::shared_ptr<cql3::statements::select_statement> vector_indexed_table_select_statement::prepare(data_dictionary::database db, schema_ptr schema,
uint32_t bound_terms, lw_shared_ptr<const parameters> parameters, ::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
ordering_comparator_type ordering_comparator, prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index, std::unique_ptr<attributes> attrs) {
auto prepared_filter = vector_search::prepare_filter(*restrictions, parameters->allow_filtering());
return ::make_shared<cql3::statements::vector_indexed_table_select_statement>(schema, bound_terms, parameters, std::move(selection), std::move(restrictions),
std::move(group_by_cell_indices), is_reversed, std::move(ordering_comparator), std::move(prepared_ann_ordering), std::move(limit),
std::move(per_partition_limit), stats, *index_opt, std::move(attrs));
std::move(per_partition_limit), stats, index, std::move(prepared_filter), std::move(attrs));
}
vector_indexed_table_select_statement::vector_indexed_table_select_statement(schema_ptr schema, uint32_t bound_terms, lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection, ::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed, ordering_comparator_type ordering_comparator,
prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index, std::unique_ptr<attributes> attrs)
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index,
vector_search::prepared_filter prepared_filter, std::unique_ptr<attributes> attrs)
: select_statement{schema, bound_terms, parameters, selection, restrictions, group_by_cell_indices, is_reversed, ordering_comparator, limit,
per_partition_limit, stats, std::move(attrs)}
, _index{index}
, _prepared_ann_ordering(std::move(prepared_ann_ordering)) {
, _prepared_ann_ordering(std::move(prepared_ann_ordering))
, _prepared_filter(std::move(prepared_filter)) {
if (!limit.has_value()) {
throw exceptions::invalid_request_exception("Vector ANN queries must have a limit specified");
@@ -2032,13 +2132,19 @@ future<shared_ptr<cql_transport::messages::result_message>> vector_indexed_table
auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
auto aoe = abort_on_expiry(timeout);
auto filter_json = _prepared_filter.to_json(options);
uint64_t fetch = static_cast<uint64_t>(std::ceil(limit * secondary_index::vector_index::get_oversampling(_index.metadata().options())));
auto pkeys = co_await qp.vector_store_client().ann(
_schema->ks_name(), _index.metadata().name(), _schema, get_ann_ordering_vector(options), limit, aoe.abort_source());
_schema->ks_name(), _index.metadata().name(), _schema, get_ann_ordering_vector(options), fetch, filter_json, aoe.abort_source());
if (!pkeys.has_value()) {
co_await coroutine::return_exception(
exceptions::invalid_request_exception(std::visit(vector_search::vector_store_client::ann_error_visitor{}, pkeys.error())));
}
if (pkeys->size() > limit && !secondary_index::vector_index::is_rescoring_enabled(_index.metadata().options())) {
pkeys->erase(pkeys->begin() + limit, pkeys->end());
}
co_return co_await query_base_table(qp, state, options, pkeys.value(), timeout);
});
@@ -2055,11 +2161,11 @@ void vector_indexed_table_select_statement::update_stats() const {
}
lw_shared_ptr<query::read_command> vector_indexed_table_select_statement::prepare_command_for_base_query(
query_processor& qp, service::query_state& state, const query_options& options) const {
query_processor& qp, service::query_state& state, const query_options& options, uint64_t fetch_limit) const {
auto slice = make_partition_slice(options);
return ::make_lw_shared<query::read_command>(_schema->id(), _schema->version(), std::move(slice), qp.proxy().get_max_result_size(slice),
query::tombstone_limit(qp.proxy().get_tombstone_limit()),
query::row_limit(get_inner_loop_limit(get_limit(options, _limit), _selection->is_aggregate())), query::partition_limit(query::max_partitions),
query::row_limit(get_inner_loop_limit(fetch_limit, _selection->is_aggregate())), query::partition_limit(query::max_partitions),
_query_start_time_point, tracing::make_trace_info(state.get_trace_state()), query_id::create_null_id(), query::is_first_page::no,
options.get_timestamp(state));
}
@@ -2077,7 +2183,7 @@ std::vector<float> vector_indexed_table_select_statement::get_ann_ordering_vecto
future<::shared_ptr<cql_transport::messages::result_message>> vector_indexed_table_select_statement::query_base_table(query_processor& qp,
service::query_state& state, const query_options& options, const std::vector<vector_search::primary_key>& pkeys,
lowres_clock::time_point timeout) const {
auto command = prepare_command_for_base_query(qp, state, options);
auto command = prepare_command_for_base_query(qp, state, options, pkeys.size());
// For tables without clustering columns, we can optimize by querying
// partition ranges instead of individual primary keys, since the
@@ -2116,6 +2222,7 @@ future<::shared_ptr<cql_transport::messages::result_message>> vector_indexed_tab
query::result_merger{command->get_row_limit(), query::max_partitions});
co_return co_await wrap_result_to_error_message([this, &command, &options](auto result) {
command->set_row_limit(get_limit(options, _limit));
return process_results(std::move(result), command, options, _query_start_time_point);
})(std::move(result));
}
@@ -2129,6 +2236,7 @@ future<::shared_ptr<cql_transport::messages::result_message>> vector_indexed_tab
{timeout, state.get_permit(), state.get_client_state(), state.get_trace_state(), {}, {}, options.get_specific_options().node_local_only},
std::nullopt)
.then(wrap_result_to_error_message([this, &options, command](service::storage_proxy::coordinator_query_result qr) {
command->set_row_limit(get_limit(options, _limit));
return this->process_results(std::move(qr.query_result), command, options, _query_start_time_point);
}));
}
@@ -2223,32 +2331,41 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
prepared_selectors = maybe_jsonize_select_clause(std::move(prepared_selectors), db, schema);
auto aggregation_depth = 0u;
std::optional<ann_ordering_info> ann_ordering_info_opt = get_ann_ordering_info(db, schema, _parameters, ctx);
bool is_ann_query = ann_ordering_info_opt.has_value();
// Force aggregation if GROUP BY is used. This will wrap every column x as first(x).
if (!_group_by_columns.empty()) {
aggregation_depth = std::max(aggregation_depth, 1u);
if (prepared_selectors.empty()) {
// We have a "SELECT * GROUP BY". If we leave prepared_selectors
// empty, below we choose selection::wildcard() for SELECT *, and
// forget to do the "levellize" trick needed for the GROUP BY.
// So we need to set prepared_selectors. See #16531.
auto all_columns = selection::selection::wildcard_columns(schema);
std::vector<::shared_ptr<selection::raw_selector>> select_all;
select_all.reserve(all_columns.size());
for (const column_definition *cdef : all_columns) {
auto name = ::make_shared<cql3::column_identifier::raw>(cdef->name_as_text(), true);
select_all.push_back(::make_shared<selection::raw_selector>(
expr::unresolved_identifier(std::move(name)), nullptr));
}
prepared_selectors = selection::raw_selector::to_prepared_selectors(select_all, *schema, db, keyspace());
if (prepared_selectors.empty() && (!_group_by_columns.empty() || (is_ann_query && ann_ordering_info_opt->is_rescoring_enabled))) {
// We have a "SELECT * GROUP BY" or "SELECT * ORDER BY ANN" with rescoring enabled. If we leave prepared_selectors
// empty, below we choose selection::wildcard() for SELECT *, and either:
// - forget to do the "levellize" trick needed for the GROUP BY. See #16531.
// - forget to add the similarity function needed for ORDER BY ANN with rescoring. See below.
// So we need to set prepared_selectors.
auto all_columns = selection::selection::wildcard_columns(schema);
std::vector<::shared_ptr<selection::raw_selector>> select_all;
select_all.reserve(all_columns.size());
for (const column_definition *cdef : all_columns) {
auto name = ::make_shared<cql3::column_identifier::raw>(cdef->name_as_text(), true);
select_all.push_back(::make_shared<selection::raw_selector>(
expr::unresolved_identifier(std::move(name)), nullptr));
}
prepared_selectors = selection::raw_selector::to_prepared_selectors(select_all, *schema, db, keyspace());
}
for (auto& ps : prepared_selectors) {
expr::fill_prepare_context(ps.expr, ctx);
}
// Force aggregation if GROUP BY is used. This will wrap every column x as first(x).
auto aggregation_depth = _group_by_columns.empty() ? 0u : 1u;
select_statement::ordering_comparator_type ordering_comparator;
bool hide_last_column = false;
if (is_ann_query && ann_ordering_info_opt->is_rescoring_enabled) {
uint32_t similarity_column_index = add_similarity_function_to_selectors(prepared_selectors, *ann_ordering_info_opt, db, schema);
hide_last_column = true;
ordering_comparator = get_similarity_ordering_comparator(prepared_selectors, similarity_column_index);
}
for (auto& ps : prepared_selectors) {
aggregation_depth = std::max(aggregation_depth, expr::aggregation_depth(ps.expr));
}
@@ -2266,6 +2383,11 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
? selection::selection::wildcard(schema)
: selection::selection::from_selectors(db, schema, keyspace(), levellized_prepared_selectors);
if (is_ann_query && hide_last_column) {
// Hide the similarity selector from the client by reducing column_count
selection->get_result_metadata()->hide_last_column();
}
// Cassandra 5.0.2 disallows PER PARTITION LIMIT with aggregate queries
// but only if GROUP BY is not used.
// See #9879 for more details.
@@ -2273,8 +2395,6 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
throw exceptions::invalid_request_exception("PER PARTITION LIMIT is not allowed with aggregate queries.");
}
bool is_ann_query = !_parameters->orderings().empty() && std::holds_alternative<select_statement::ann_vector>(_parameters->orderings().front().second);
auto restrictions = prepare_restrictions(db, schema, ctx, selection, for_view, _parameters->allow_filtering() || is_ann_query,
restrictions::check_indexes(!_parameters->is_mutation_fragments()));
@@ -2282,19 +2402,14 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
validate_distinct_selection(*schema, *selection, *restrictions);
}
select_statement::ordering_comparator_type ordering_comparator;
bool is_reversed_ = false;
std::optional<prepared_ann_ordering_type> prepared_ann_ordering;
auto orderings = _parameters->orderings();
if (!orderings.empty()) {
if (!orderings.empty() && !is_ann_query) {
std::visit([&](auto&& ordering) {
using T = std::decay_t<decltype(ordering)>;
if constexpr (std::is_same_v<T, select_statement::ann_vector>) {
prepared_ann_ordering = prepare_ann_ordering(*schema, ctx, db);
} else {
if constexpr (!std::is_same_v<T, select_statement::ann_vector>) {
SCYLLA_ASSERT(!for_view);
verify_ordering_is_allowed(*_parameters, *restrictions);
prepared_orderings_type prepared_orderings = prepare_orderings(*schema);
@@ -2307,7 +2422,7 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
}
std::vector<sstring> warnings;
if (!prepared_ann_ordering.has_value()) {
if (!is_ann_query) {
check_needs_filtering(*restrictions, db.get_config().strict_allow_filtering(), warnings);
ensure_filtering_columns_retrieval(db, *selection, *restrictions);
}
@@ -2361,7 +2476,21 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
&& restrictions->partition_key_restrictions_size() == schema->partition_key_size());
};
if (_parameters->is_prune_materialized_view()) {
if (strong_consistency::is_strongly_consistent(db, schema->ks_name())) {
stmt = ::make_shared<strong_consistency::select_statement>(
schema,
ctx.bound_variables_size(),
_parameters,
std::move(selection),
std::move(restrictions),
std::move(group_by_cell_indices),
is_reversed_,
std::move(ordering_comparator),
prepare_limit(db, ctx, _limit),
prepare_limit(db, ctx, _per_partition_limit),
stats,
std::move(prepared_attrs));
} else if (_parameters->is_prune_materialized_view()) {
stmt = ::make_shared<cql3::statements::prune_materialized_view_statement>(
schema,
ctx.bound_variables_size(),
@@ -2390,10 +2519,10 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
prepare_limit(db, ctx, _per_partition_limit),
stats,
std::move(prepared_attrs));
} else if (prepared_ann_ordering) {
} else if (is_ann_query) {
stmt = vector_indexed_table_select_statement::prepare(db, schema, ctx.bound_variables_size(), _parameters, std::move(selection), std::move(restrictions),
std::move(group_by_cell_indices), is_reversed_, std::move(ordering_comparator), std::move(*prepared_ann_ordering),
prepare_limit(db, ctx, _limit), prepare_limit(db, ctx, _per_partition_limit), stats, std::move(prepared_attrs));
std::move(group_by_cell_indices), is_reversed_, std::move(ordering_comparator), std::move(ann_ordering_info_opt->_prepared_ann_ordering),
prepare_limit(db, ctx, _limit), prepare_limit(db, ctx, _per_partition_limit), stats, ann_ordering_info_opt->_index, std::move(prepared_attrs));
} else if (restrictions->uses_secondary_indexing()) {
stmt = view_indexed_table_select_statement::prepare(
db,
@@ -2425,7 +2554,7 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
std::move(prepared_attrs)
);
} else if (service::broadcast_tables::is_broadcast_table_statement(keyspace(), column_family())) {
stmt = ::make_shared<cql3::statements::strongly_consistent_select_statement>(
stmt = ::make_shared<cql3::statements::broadcast_select_statement>(
schema,
ctx.bound_variables_size(),
_parameters,
@@ -2615,28 +2744,6 @@ void select_statement::verify_ordering_is_valid(const prepared_orderings_type& o
}
}
select_statement::prepared_ann_ordering_type select_statement::prepare_ann_ordering(const schema& schema, prepare_context& ctx, data_dictionary::database db) const {
auto [column_id, ordering] = _parameters->orderings().front();
const auto& ann_vector = std::get_if<select_statement::ann_vector>(&ordering);
SCYLLA_ASSERT(ann_vector);
::shared_ptr<column_identifier> column = column_id->prepare_column_identifier(schema);
const column_definition* def = schema.get_column_definition(column->name());
if (!def) {
throw exceptions::invalid_request_exception(
fmt::format("Undefined column name {}", column->text()));
}
if (!def->type->is_vector() || static_cast<const vector_type_impl*>(def->type.get())->get_elements_type()->get_kind() != abstract_type::kind::float_kind) {
throw exceptions::invalid_request_exception("ANN ordering is only supported on float vector indexes");
}
auto e = expr::prepare_expression(*ann_vector, db, keyspace(), nullptr, def->column_specification);
expr::fill_prepare_context(e, ctx);
return std::make_pair(std::move(def), std::move(e));
}
select_statement::ordering_comparator_type select_statement::get_ordering_comparator(const prepared_orderings_type& orderings,
selection::selection& selection,
const restrictions::statement_restrictions& restrictions) {

View File

@@ -22,6 +22,7 @@
#include "locator/host_id.hh"
#include "service/cas_shard.hh"
#include "vector_search/vector_store_client.hh"
#include "vector_search/filter.hh"
namespace service {
class client_state;
@@ -362,6 +363,7 @@ private:
class vector_indexed_table_select_statement : public select_statement {
secondary_index::index _index;
prepared_ann_ordering_type _prepared_ann_ordering;
vector_search::prepared_filter _prepared_filter;
mutable gc_clock::time_point _query_start_time_point;
public:
@@ -371,13 +373,13 @@ public:
lw_shared_ptr<const parameters> parameters, ::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
ordering_comparator_type ordering_comparator, prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, std::unique_ptr<cql3::attributes> attrs);
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index, std::unique_ptr<cql3::attributes> attrs);
vector_indexed_table_select_statement(schema_ptr schema, uint32_t bound_terms, lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection, ::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed, ordering_comparator_type ordering_comparator,
prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit, std::optional<expr::expression> per_partition_limit,
cql_stats& stats, const secondary_index::index& index, std::unique_ptr<cql3::attributes> attrs);
cql_stats& stats, const secondary_index::index& index, vector_search::prepared_filter prepared_filter, std::unique_ptr<cql3::attributes> attrs);
private:
future<::shared_ptr<cql_transport::messages::result_message>> do_execute(
@@ -385,7 +387,7 @@ private:
void update_stats() const;
lw_shared_ptr<query::read_command> prepare_command_for_base_query(query_processor& qp, service::query_state& state, const query_options& options) const;
lw_shared_ptr<query::read_command> prepare_command_for_base_query(query_processor& qp, service::query_state& state, const query_options& options, uint64_t fetch_limit) const;
std::vector<float> get_ann_ordering_vector(const query_options& options) const;

View File

@@ -0,0 +1,82 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "modification_statement.hh"
#include "transport/messages/result_message.hh"
#include "cql3/query_processor.hh"
#include "service/strong_consistency/coordinator.hh"
#include "cql3/statements/strong_consistency/statement_helpers.hh"
namespace cql3::statements::strong_consistency {
static logging::logger logger("sc_modification_statement");
modification_statement::modification_statement(shared_ptr<base_statement> statement)
: cql_statement_opt_metadata(&timeout_config::write_timeout)
, _statement(std::move(statement))
{
}
using result_message = cql_transport::messages::result_message;
future<shared_ptr<result_message>> modification_statement::execute(query_processor& qp, service::query_state& qs,
const query_options& options, std::optional<service::group0_guard> guard) const
{
return execute_without_checking_exception_message(qp, qs, options, std::move(guard))
.then(cql_transport::messages::propagate_exception_as_future<shared_ptr<result_message>>);
}
future<shared_ptr<result_message>> modification_statement::execute_without_checking_exception_message(
query_processor& qp, service::query_state& qs, const query_options& options,
std::optional<service::group0_guard> guard) const
{
auto json_cache = base_statement::json_cache_opt{};
const auto keys = _statement->build_partition_keys(options, json_cache);
if (keys.size() != 1 || !query::is_single_partition(keys[0])) {
throw exceptions::invalid_request_exception("Strongly consistent queries can only target a single partition");
}
if (_statement->requires_read()) {
throw exceptions::invalid_request_exception("Strongly consistent updates don't support data prefetch");
}
auto [coordinator, holder] = qp.acquire_strongly_consistent_coordinator();
const auto mutate_result = co_await coordinator.get().mutate(_statement->s,
keys[0].start()->value().token(),
[&](api::timestamp_type ts) {
const auto prefetch_data = update_parameters::prefetch_data(_statement->s);
const auto ttl = _statement->get_time_to_live(options);
const auto params = update_parameters(_statement->s, options, ts, ttl, prefetch_data);
const auto ranges = _statement->create_clustering_ranges(options, json_cache);
auto muts = _statement->apply_updates(keys, ranges, params, json_cache);
if (muts.size() != 1) {
on_internal_error(logger, ::format("statement '{}' has unexpected number of mutations {}",
raw_cql_statement, muts.size()));
}
return std::move(*muts.begin());
});
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&mutate_result)) {
co_return co_await redirect_statement(qp, options, redirect->target);
}
co_return seastar::make_shared<result_message::void_message>();
}
future<> modification_statement::check_access(query_processor& qp, const service::client_state& state) const {
return _statement->check_access(qp, state);
}
uint32_t modification_statement::get_bound_terms() const {
return _statement->get_bound_terms();
}
bool modification_statement::depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const {
return _statement->depends_on(ks_name, cf_name);
}
}

View File

@@ -0,0 +1,39 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "cql3/cql_statement.hh"
#include "cql3/expr/expression.hh"
#include "cql3/statements/modification_statement.hh"
namespace cql3::statements::strong_consistency {
class modification_statement : public cql_statement_opt_metadata {
using result_message = cql_transport::messages::result_message;
using base_statement = cql3::statements::modification_statement;
shared_ptr<base_statement> _statement;
public:
modification_statement(shared_ptr<base_statement> statement);
future<shared_ptr<result_message>> execute(query_processor& qp, service::query_state& state,
const query_options& options, std::optional<service::group0_guard> guard) const override;
future<shared_ptr<result_message>> execute_without_checking_exception_message(query_processor& qp,
service::query_state& qs, const query_options& options,
std::optional<service::group0_guard> guard) const override;
future<> check_access(query_processor& qp, const service::client_state& state) const override;
uint32_t get_bound_terms() const override;
bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
};
}

View File

@@ -0,0 +1,56 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "select_statement.hh"
#include "query/query-request.hh"
#include "cql3/query_processor.hh"
#include "service/strong_consistency/coordinator.hh"
#include "cql3/statements/strong_consistency/statement_helpers.hh"
namespace cql3::statements::strong_consistency {
using result_message = cql_transport::messages::result_message;
future<::shared_ptr<result_message>> select_statement::do_execute(query_processor& qp,
service::query_state& state,
const query_options& options) const
{
const auto key_ranges = _restrictions->get_partition_key_ranges(options);
if (key_ranges.size() != 1 || !query::is_single_partition(key_ranges[0])) {
throw exceptions::invalid_request_exception("Strongly consistent queries can only target a single partition");
}
const auto now = gc_clock::now();
auto read_command = make_lw_shared<query::read_command>(
_query_schema->id(),
_query_schema->version(),
make_partition_slice(options),
query::max_result_size(query::result_memory_limiter::maximum_result_size),
query::tombstone_limit(query::tombstone_limit::max),
query::row_limit(get_inner_loop_limit(get_limit(options, _limit), _selection->is_aggregate())),
query::partition_limit(query::max_partitions),
now,
tracing::make_trace_info(state.get_trace_state()),
query_id::create_null_id(),
query::is_first_page::no,
options.get_timestamp(state));
const auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
auto [coordinator, holder] = qp.acquire_strongly_consistent_coordinator();
auto query_result = co_await coordinator.get().query(_query_schema, *read_command,
key_ranges, state.get_trace_state(), timeout);
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&query_result)) {
co_return co_await redirect_statement(qp, options, redirect->target);
}
co_return co_await process_results(get<lw_shared_ptr<query::result>>(std::move(query_result)),
read_command, options, now);
}
}

View File

@@ -0,0 +1,26 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "cql3/cql_statement.hh"
#include "cql3/statements/select_statement.hh"
namespace cql3::statements::strong_consistency {
class select_statement : public cql3::statements::select_statement {
using result_message = cql_transport::messages::result_message;
public:
using cql3::statements::select_statement::select_statement;
future<::shared_ptr<cql_transport::messages::result_message>> do_execute(query_processor& qp,
service::query_state& state, const query_options& options) const override;
};
}

View File

@@ -0,0 +1,37 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "statement_helpers.hh"
#include "transport/messages/result_message_base.hh"
#include "cql3/query_processor.hh"
#include "replica/database.hh"
#include "locator/tablet_replication_strategy.hh"
namespace cql3::statements::strong_consistency {
future<::shared_ptr<cql_transport::messages::result_message>> redirect_statement(query_processor& qp,
const query_options& options,
const locator::tablet_replica& target)
{
const auto my_host_id = qp.db().real_database().get_token_metadata().get_topology().my_host_id();
if (target.host != my_host_id) {
throw exceptions::invalid_request_exception(format(
"Strongly consistent writes can be executed only on the leader node, "
"leader id {}, current host id {}",
target.host, my_host_id));
}
auto&& func_values_cache = const_cast<cql3::query_options&>(options).take_cached_pk_function_calls();
co_return qp.bounce_to_shard(target.shard, std::move(func_values_cache));
}
bool is_strongly_consistent(data_dictionary::database db, std::string_view ks_name) {
const auto* tablet_aware_rs = db.find_keyspace(ks_name).get_replication_strategy().maybe_as_tablet_aware();
return tablet_aware_rs && tablet_aware_rs->get_consistency() != data_dictionary::consistency_config_option::eventual;
}
}

View File

@@ -0,0 +1,23 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "cql3/cql_statement.hh"
#include "locator/tablets.hh"
namespace cql3::statements::strong_consistency {
future<::shared_ptr<cql_transport::messages::result_message>> redirect_statement(
query_processor& qp,
const query_options& options,
const locator::tablet_replica& target);
bool is_strongly_consistent(data_dictionary::database db, std::string_view ks_name);
}

View File

@@ -13,7 +13,7 @@
#include "cql3/expr/expression.hh"
#include "cql3/expr/evaluate.hh"
#include "cql3/expr/expr-utils.hh"
#include "cql3/statements/strongly_consistent_modification_statement.hh"
#include "cql3/statements/broadcast_modification_statement.hh"
#include "service/broadcast_tables/experimental/lang.hh"
#include "raw/update_statement.hh"
@@ -333,7 +333,7 @@ std::optional<expr::expression> get_value_condition(const expr::expression& the_
return binop->rhs;
}
::shared_ptr<strongly_consistent_modification_statement>
::shared_ptr<broadcast_modification_statement>
update_statement::prepare_for_broadcast_tables() const {
if (attrs) {
if (attrs->is_time_to_live_set()) {
@@ -359,7 +359,7 @@ update_statement::prepare_for_broadcast_tables() const {
.value_condition = get_value_condition(_condition),
};
return ::make_shared<strongly_consistent_modification_statement>(
return ::make_shared<broadcast_modification_statement>(
get_bound_terms(),
s,
query

View File

@@ -45,7 +45,7 @@ private:
virtual void execute_operations_for_key(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const json_cache_opt& json_cache) const;
public:
virtual ::shared_ptr<strongly_consistent_modification_statement> prepare_for_broadcast_tables() const override;
virtual ::shared_ptr<broadcast_modification_statement> prepare_for_broadcast_tables() const override;
};
/*

View File

@@ -323,6 +323,9 @@ void cache_mutation_reader::touch_partition() {
inline
future<> cache_mutation_reader::fill_buffer() {
if (const auto& ex = get_abort_exception(); ex) {
return make_exception_future<>(ex);
}
if (_state == state::before_static_row) {
touch_partition();
auto after_static_row = [this] {

View File

@@ -1341,7 +1341,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, sstable_compression_user_table_options(this, "sstable_compression_user_table_options", value_status::Used, compression_parameters{compression_parameters::algorithm::lz4_with_dicts},
"Server-global user table compression options. If enabled, all user tables"
"will be compressed using the provided options, unless overridden"
"by compression options in the table schema. The available options are:\n"
"by compression options in the table schema. User tables are all tables in non-system keyspaces. The available options are:\n"
"* sstable_compression: The compression algorithm to use. Supported values: LZ4Compressor, LZ4WithDictsCompressor (default), SnappyCompressor, DeflateCompressor, ZstdCompressor, ZstdWithDictsCompressor, '' (empty string; disables compression).\n"
"* chunk_length_in_kb: (Default: 4) The size of chunks to compress in kilobytes. Allowed values are powers of two between 1 and 128.\n"
"* crc_check_chance: (Default: 1.0) Not implemented (option value is ignored).\n"
@@ -1584,7 +1584,14 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, enable_create_table_with_compact_storage(this, "enable_create_table_with_compact_storage", liveness::LiveUpdate, value_status::Used, false, "Enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE. This feature will eventually be removed in a future version.")
, rf_rack_valid_keyspaces(this, "rf_rack_valid_keyspaces", liveness::MustRestart, value_status::Used, false,
"Enforce RF-rack-valid keyspaces. Additionally, if there are existing RF-rack-invalid "
"keyspaces, attempting to start a node with this option ON will fail.")
"keyspaces, attempting to start a node with this option ON will fail. "
"DEPRECATED. Use enforce_rack_list instead.")
, enforce_rack_list(this, "enforce_rack_list", liveness::MustRestart, value_status::Used, false,
"Enforce rack list for tablet keyspaces. "
"When the option is on, CREATE STATEMENT expands numeric rfs to rack lists "
"and ALTER STATEMENT is allowed only when rack lists are used in all DCs."
"Additionally, if there are existing tablet keyspaces with numeric rf in any DC "
"attempting to start a node with this option ON will fail.")
// FIXME: make frequency per table in order to reduce work in each iteration.
// Bigger tables will take longer to be resized. similar-sized tables can be batched into same iteration.
, tablet_load_stats_refresh_interval_in_seconds(this, "tablet_load_stats_refresh_interval_in_seconds", liveness::LiveUpdate, value_status::Used, 60,
@@ -1785,6 +1792,21 @@ const db::extensions& db::config::extensions() const {
return *_extensions;
}
compression_parameters db::config::get_sstable_compression_user_table_options(bool dicts_feature_enabled) const {
if (sstable_compression_user_table_options.is_set()
|| dicts_feature_enabled
|| !sstable_compression_user_table_options().uses_dictionary_compressor()) {
return sstable_compression_user_table_options();
} else {
// Fall back to non-dict if dictionary compression is not enabled cluster-wide.
auto options = sstable_compression_user_table_options();
auto params = options.get_options();
auto algo = compression_parameters::non_dict_equivalent(options.get_algorithm());
params[compression_parameters::SSTABLE_COMPRESSION] = sstring(compression_parameters::algorithm_to_name(algo));
return compression_parameters{params};
}
}
std::map<sstring, db::experimental_features_t::feature> db::experimental_features_t::map() {
// We decided against using the construct-on-first-use idiom here:
// https://github.com/scylladb/scylla/pull/5369#discussion_r353614807

View File

@@ -419,7 +419,13 @@ public:
named_value<bool> enable_sstables_mc_format;
named_value<bool> enable_sstables_md_format;
named_value<sstring> sstable_format;
// NOTE: Do not use this option directly.
// Use get_sstable_compression_user_table_options() instead.
named_value<compression_parameters> sstable_compression_user_table_options;
compression_parameters get_sstable_compression_user_table_options(bool dicts_feature_enabled) const;
named_value<bool> sstable_compression_dictionaries_allow_in_ddl;
named_value<bool> sstable_compression_dictionaries_enable_writing;
named_value<float> sstable_compression_dictionaries_memory_budget_fraction;
@@ -599,6 +605,7 @@ public:
named_value<bool> enable_create_table_with_compact_storage;
named_value<bool> rf_rack_valid_keyspaces;
named_value<bool> enforce_rack_list;
named_value<uint32_t> tablet_load_stats_refresh_interval_in_seconds;
named_value<bool> force_capacity_based_balancing;

View File

@@ -850,7 +850,7 @@ mutation_reader row_cache::make_nonpopulating_reader(schema_ptr schema, reader_p
std::move(permit),
e.key(),
query::clustering_key_filter_ranges(slice.row_ranges(*schema, e.key().key())),
e.partition().read(_tracker.region(), _tracker.memtable_cleaner(), nullptr, phase_of(pos)),
e.partition().read(_tracker.region(), _tracker.memtable_cleaner(), &_tracker, phase_of(pos)),
false,
_tracker.region(),
_read_section,

View File

@@ -96,16 +96,16 @@ static logging::logger diff_logger("schema_diff");
/** system.schema_* tables used to store keyspace/table/type attributes prior to C* 3.0 */
namespace db {
namespace {
const auto set_use_schema_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
if (ks_name == schema_tables::NAME) {
props.enable_schema_commitlog();
const auto set_use_schema_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
if (builder.ks_name() == schema_tables::NAME) {
builder.enable_schema_commitlog();
}
});
const auto set_group0_table_options =
schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
if (ks_name == schema_tables::NAME) {
schema_builder::register_schema_initializer([](schema_builder& builder) {
if (builder.ks_name() == schema_tables::NAME) {
// all schema tables are group0 tables
props.is_group0_table = true;
builder.set_is_group0_table(true);
}
});
}

View File

@@ -65,7 +65,7 @@ future<> snapshot_ctl::run_snapshot_modify_operation(noncopyable_function<future
});
}
future<> snapshot_ctl::take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf) {
future<> snapshot_ctl::take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts) {
if (tag.empty()) {
throw std::runtime_error("You must supply a snapshot name.");
}
@@ -74,21 +74,21 @@ future<> snapshot_ctl::take_snapshot(sstring tag, std::vector<sstring> keyspace_
std::ranges::copy(_db.local().get_keyspaces() | std::views::keys, std::back_inserter(keyspace_names));
};
return run_snapshot_modify_operation([tag = std::move(tag), keyspace_names = std::move(keyspace_names), sf, this] () mutable {
return do_take_snapshot(std::move(tag), std::move(keyspace_names), sf);
return run_snapshot_modify_operation([tag = std::move(tag), keyspace_names = std::move(keyspace_names), opts, this] () mutable {
return do_take_snapshot(std::move(tag), std::move(keyspace_names), opts);
});
}
future<> snapshot_ctl::do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf) {
future<> snapshot_ctl::do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts) {
co_await coroutine::parallel_for_each(keyspace_names, [tag, this] (const auto& ks_name) {
return check_snapshot_not_exist(ks_name, tag);
});
co_await coroutine::parallel_for_each(keyspace_names, [this, tag = std::move(tag), sf] (const auto& ks_name) {
return replica::database::snapshot_keyspace_on_all_shards(_db, ks_name, tag, bool(sf));
co_await coroutine::parallel_for_each(keyspace_names, [this, tag = std::move(tag), opts] (const auto& ks_name) {
return replica::database::snapshot_keyspace_on_all_shards(_db, ks_name, tag, opts);
});
}
future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf) {
future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
if (ks_name.empty()) {
throw std::runtime_error("You must supply a keyspace name");
}
@@ -99,14 +99,14 @@ future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<
throw std::runtime_error("You must supply a snapshot name.");
}
return run_snapshot_modify_operation([this, ks_name = std::move(ks_name), tables = std::move(tables), tag = std::move(tag), sf] () mutable {
return do_take_column_family_snapshot(std::move(ks_name), std::move(tables), std::move(tag), sf);
return run_snapshot_modify_operation([this, ks_name = std::move(ks_name), tables = std::move(tables), tag = std::move(tag), opts] () mutable {
return do_take_column_family_snapshot(std::move(ks_name), std::move(tables), std::move(tag), opts);
});
}
future<> snapshot_ctl::do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf) {
future<> snapshot_ctl::do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
co_await check_snapshot_not_exist(ks_name, tag, tables);
co_await replica::database::snapshot_tables_on_all_shards(_db, ks_name, std::move(tables), std::move(tag), bool(sf));
co_await replica::database::snapshot_tables_on_all_shards(_db, ks_name, std::move(tables), std::move(tag), opts);
}
future<> snapshot_ctl::clear_snapshot(sstring tag, std::vector<sstring> keyspace_names, sstring cf_name) {

View File

@@ -38,10 +38,14 @@ class backup_task_impl;
} // snapshot namespace
struct snapshot_options {
bool skip_flush = false;
gc_clock::time_point created_at = gc_clock::now();
std::optional<gc_clock::time_point> expires_at;
};
class snapshot_ctl : public peering_sharded_service<snapshot_ctl> {
public:
using skip_flush = bool_class<class skip_flush_tag>;
struct table_snapshot_details {
int64_t total;
int64_t live;
@@ -70,8 +74,8 @@ public:
*
* @param tag the tag given to the snapshot; may not be null or empty
*/
future<> take_snapshot(sstring tag, skip_flush sf = skip_flush::no) {
return take_snapshot(tag, {}, sf);
future<> take_snapshot(sstring tag, snapshot_options opts = {}) {
return take_snapshot(tag, {}, opts);
}
/**
@@ -80,7 +84,7 @@ public:
* @param tag the tag given to the snapshot; may not be null or empty
* @param keyspace_names the names of the keyspaces to snapshot; empty means "all"
*/
future<> take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf = skip_flush::no);
future<> take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts = {});
/**
* Takes the snapshot of multiple tables. A snapshot name must be specified.
@@ -89,7 +93,7 @@ public:
* @param tables a vector of tables names to snapshot
* @param tag the tag given to the snapshot; may not be null or empty
*/
future<> take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf = skip_flush::no);
future<> take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
/**
* Remove the snapshot with the given name from the given keyspaces.
@@ -127,8 +131,8 @@ private:
friend class snapshot::backup_task_impl;
future<> do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf = skip_flush::no);
future<> do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf = skip_flush::no);
future<> do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts = {} );
future<> do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
};
}

View File

@@ -42,11 +42,11 @@ extern logging::logger cdc_log;
namespace db {
namespace {
const auto set_wait_for_sync_to_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
if ((ks_name == system_distributed_keyspace::NAME_EVERYWHERE && cf_name == system_distributed_keyspace::CDC_GENERATIONS_V2) ||
(ks_name == system_distributed_keyspace::NAME && cf_name == system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION))
const auto set_wait_for_sync_to_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
if ((builder.ks_name() == system_distributed_keyspace::NAME_EVERYWHERE && builder.cf_name() == system_distributed_keyspace::CDC_GENERATIONS_V2) ||
(builder.ks_name() == system_distributed_keyspace::NAME && builder.cf_name() == system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION))
{
props.wait_for_sync_to_commitlog = true;
builder.set_wait_for_sync_to_commitlog(true);
}
});
}

View File

@@ -66,24 +66,24 @@ static thread_local auto sstableinfo_type = user_type_impl::get_instance(
namespace db {
namespace {
const auto set_null_sharder = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
const auto set_null_sharder = schema_builder::register_schema_initializer([](schema_builder& builder) {
// tables in the "system" keyspace which need to use null sharder
static const std::unordered_set<sstring> tables = {
// empty
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.use_null_sharder = true;
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.set_use_null_sharder(true);
}
});
const auto set_wait_for_sync_to_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
const auto set_wait_for_sync_to_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
static const std::unordered_set<sstring> tables = {
system_keyspace::PAXOS,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.wait_for_sync_to_commitlog = true;
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.set_wait_for_sync_to_commitlog(true);
}
});
const auto set_use_schema_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
const auto set_use_schema_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
static const std::unordered_set<sstring> tables = {
schema_tables::SCYLLA_TABLE_SCHEMA_HISTORY,
system_keyspace::BROADCAST_KV_STORE,
@@ -108,18 +108,18 @@ namespace {
system_keyspace::ROLE_MEMBERS,
system_keyspace::ROLE_ATTRIBUTES,
system_keyspace::ROLE_PERMISSIONS,
system_keyspace::v3::CDC_LOCAL,
system_keyspace::CDC_LOCAL,
system_keyspace::DICTS,
system_keyspace::VIEW_BUILDING_TASKS,
system_keyspace::CLIENT_ROUTES,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.enable_schema_commitlog();
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.enable_schema_commitlog();
}
});
const auto set_group0_table_options =
schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
schema_builder::register_schema_initializer([](schema_builder& builder) {
static const std::unordered_set<sstring> tables = {
// scylla_local may store a replicated tombstone related to schema
// (see `make_group0_schema_version_mutation`), so we include it in the group0 tables list.
@@ -142,8 +142,8 @@ namespace {
system_keyspace::CLIENT_ROUTES,
system_keyspace::REPAIR_TASKS,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.is_group0_table = true;
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.set_is_group0_table(true);
}
});
}
@@ -918,7 +918,7 @@ schema_ptr system_keyspace::corrupt_data() {
return scylla_local;
}
schema_ptr system_keyspace::v3::batches() {
schema_ptr system_keyspace::batches() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, BATCHES), NAME, BATCHES,
// partition key
@@ -946,53 +946,7 @@ schema_ptr system_keyspace::v3::batches() {
return schema;
}
schema_ptr system_keyspace::v3::built_indexes() {
// identical to ours, but ours otoh is a mix-in of the 3.x series cassandra one
return db::system_keyspace::built_indexes();
}
schema_ptr system_keyspace::v3::local() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, LOCAL), NAME, LOCAL,
// partition key
{{"key", utf8_type}},
// clustering key
{},
// regular columns
{
{"bootstrapped", utf8_type},
{"broadcast_address", inet_addr_type},
{"cluster_name", utf8_type},
{"cql_version", utf8_type},
{"data_center", utf8_type},
{"gossip_generation", int32_type},
{"host_id", uuid_type},
{"listen_address", inet_addr_type},
{"native_protocol_version", utf8_type},
{"partitioner", utf8_type},
{"rack", utf8_type},
{"release_version", utf8_type},
{"rpc_address", inet_addr_type},
{"schema_version", uuid_type},
{"thrift_version", utf8_type},
{"tokens", set_type_impl::get_instance(utf8_type, true)},
{"truncated_at", map_type_impl::get_instance(uuid_type, bytes_type, true)},
},
// static columns
{},
// regular column name type
utf8_type,
// comment
"information about the local node"
);
builder.set_gc_grace_seconds(0);
builder.with_hash_version();
return builder.build(schema_builder::compact_storage::no);
}();
return schema;
}
schema_ptr system_keyspace::v3::truncated() {
schema_ptr system_keyspace::truncated() {
static thread_local auto local = [] {
schema_builder builder(generate_legacy_id(NAME, TRUNCATED), NAME, TRUNCATED,
// partition key
@@ -1022,7 +976,7 @@ schema_ptr system_keyspace::v3::truncated() {
thread_local data_type replay_position_type = tuple_type_impl::get_instance({long_type, int32_type});
schema_ptr system_keyspace::v3::commitlog_cleanups() {
schema_ptr system_keyspace::commitlog_cleanups() {
static thread_local auto local = [] {
schema_builder builder(generate_legacy_id(NAME, COMMITLOG_CLEANUPS), NAME, COMMITLOG_CLEANUPS,
// partition key
@@ -1049,47 +1003,7 @@ schema_ptr system_keyspace::v3::commitlog_cleanups() {
return local;
}
schema_ptr system_keyspace::v3::peers() {
// identical
return db::system_keyspace::peers();
}
schema_ptr system_keyspace::v3::peer_events() {
// identical
return db::system_keyspace::peer_events();
}
schema_ptr system_keyspace::v3::range_xfers() {
// identical
return db::system_keyspace::range_xfers();
}
schema_ptr system_keyspace::v3::compaction_history() {
// identical
return db::system_keyspace::compaction_history();
}
schema_ptr system_keyspace::v3::sstable_activity() {
// identical
return db::system_keyspace::sstable_activity();
}
schema_ptr system_keyspace::v3::size_estimates() {
// identical
return db::system_keyspace::size_estimates();
}
schema_ptr system_keyspace::v3::large_partitions() {
// identical
return db::system_keyspace::large_partitions();
}
schema_ptr system_keyspace::v3::scylla_local() {
// identical
return db::system_keyspace::scylla_local();
}
schema_ptr system_keyspace::v3::available_ranges() {
schema_ptr system_keyspace::available_ranges() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, AVAILABLE_RANGES), NAME, AVAILABLE_RANGES,
// partition key
@@ -1112,7 +1026,7 @@ schema_ptr system_keyspace::v3::available_ranges() {
return schema;
}
schema_ptr system_keyspace::v3::views_builds_in_progress() {
schema_ptr system_keyspace::views_builds_in_progress() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, VIEWS_BUILDS_IN_PROGRESS), NAME, VIEWS_BUILDS_IN_PROGRESS,
// partition key
@@ -1135,7 +1049,7 @@ schema_ptr system_keyspace::v3::views_builds_in_progress() {
return schema;
}
schema_ptr system_keyspace::v3::built_views() {
schema_ptr system_keyspace::built_views() {
static thread_local auto schema = [] {
schema_builder builder(generate_legacy_id(NAME, BUILT_VIEWS), NAME, BUILT_VIEWS,
// partition key
@@ -1158,7 +1072,7 @@ schema_ptr system_keyspace::v3::built_views() {
return schema;
}
schema_ptr system_keyspace::v3::scylla_views_builds_in_progress() {
schema_ptr system_keyspace::scylla_views_builds_in_progress() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(NAME, SCYLLA_VIEWS_BUILDS_IN_PROGRESS);
return schema_builder(NAME, SCYLLA_VIEWS_BUILDS_IN_PROGRESS, std::make_optional(id))
@@ -1174,7 +1088,7 @@ schema_ptr system_keyspace::v3::scylla_views_builds_in_progress() {
return schema;
}
/*static*/ schema_ptr system_keyspace::v3::cdc_local() {
/*static*/ schema_ptr system_keyspace::cdc_local() {
static thread_local auto cdc_local = [] {
schema_builder builder(generate_legacy_id(NAME, CDC_LOCAL), NAME, CDC_LOCAL,
// partition key
@@ -2180,21 +2094,21 @@ future<> system_keyspace::update_cdc_generation_id(cdc::generation_id gen_id) {
co_await std::visit(make_visitor(
[this] (cdc::generation_id_v1 id) -> future<> {
co_await execute_cql(
format("INSERT INTO system.{} (key, streams_timestamp) VALUES (?, ?)", v3::CDC_LOCAL),
sstring(v3::CDC_LOCAL), id.ts);
format("INSERT INTO system.{} (key, streams_timestamp) VALUES (?, ?)", CDC_LOCAL),
sstring(CDC_LOCAL), id.ts);
},
[this] (cdc::generation_id_v2 id) -> future<> {
co_await execute_cql(
format("INSERT INTO system.{} (key, streams_timestamp, uuid) VALUES (?, ?, ?)", v3::CDC_LOCAL),
sstring(v3::CDC_LOCAL), id.ts, id.id);
format("INSERT INTO system.{} (key, streams_timestamp, uuid) VALUES (?, ?, ?)", CDC_LOCAL),
sstring(CDC_LOCAL), id.ts, id.id);
}
), gen_id);
}
future<std::optional<cdc::generation_id>> system_keyspace::get_cdc_generation_id() {
auto msg = co_await execute_cql(
format("SELECT streams_timestamp, uuid FROM system.{} WHERE key = ?", v3::CDC_LOCAL),
sstring(v3::CDC_LOCAL));
format("SELECT streams_timestamp, uuid FROM system.{} WHERE key = ?", CDC_LOCAL),
sstring(CDC_LOCAL));
if (msg->empty()) {
co_return std::nullopt;
@@ -2220,19 +2134,19 @@ static const sstring CDC_REWRITTEN_KEY = "rewritten";
future<> system_keyspace::cdc_set_rewritten(std::optional<cdc::generation_id_v1> gen_id) {
if (gen_id) {
return execute_cql(
format("INSERT INTO system.{} (key, streams_timestamp) VALUES (?, ?)", v3::CDC_LOCAL),
format("INSERT INTO system.{} (key, streams_timestamp) VALUES (?, ?)", CDC_LOCAL),
CDC_REWRITTEN_KEY, gen_id->ts).discard_result();
} else {
// Insert just the row marker.
return execute_cql(
format("INSERT INTO system.{} (key) VALUES (?)", v3::CDC_LOCAL),
format("INSERT INTO system.{} (key) VALUES (?)", CDC_LOCAL),
CDC_REWRITTEN_KEY).discard_result();
}
}
future<bool> system_keyspace::cdc_is_rewritten() {
// We don't care about the actual timestamp; it's additional information for debugging purposes.
return execute_cql(format("SELECT key FROM system.{} WHERE key = ?", v3::CDC_LOCAL), CDC_REWRITTEN_KEY)
return execute_cql(format("SELECT key FROM system.{} WHERE key = ?", CDC_LOCAL), CDC_REWRITTEN_KEY)
.then([] (::shared_ptr<cql3::untyped_result_set> msg) {
return !msg->empty();
});
@@ -2376,11 +2290,11 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
scylla_local(), db::schema_tables::scylla_table_schema_history(),
repair_history(),
repair_tasks(),
v3::views_builds_in_progress(), v3::built_views(),
v3::scylla_views_builds_in_progress(),
v3::truncated(),
v3::commitlog_cleanups(),
v3::cdc_local(),
views_builds_in_progress(), built_views(),
scylla_views_builds_in_progress(),
truncated(),
commitlog_cleanups(),
cdc_local(),
raft(), raft_snapshots(), raft_snapshot_config(), group0_history(), discovery(),
topology(), cdc_generations_v3(), topology_requests(), service_levels_v2(), view_build_status_v2(),
dicts(), view_building_tasks(), client_routes(), cdc_streams_state(), cdc_streams_history()
@@ -2403,7 +2317,7 @@ static bool maybe_write_in_user_memory(schema_ptr s) {
return (s.get() == system_keyspace::batchlog().get())
|| (s.get() == system_keyspace::batchlog_v2().get())
|| (s.get() == system_keyspace::paxos().get())
|| s == system_keyspace::v3::scylla_views_builds_in_progress();
|| s == system_keyspace::scylla_views_builds_in_progress();
}
future<> system_keyspace::make(
@@ -2689,7 +2603,7 @@ mutation system_keyspace::make_size_estimates_mutation(const sstring& ks, std::v
future<> system_keyspace::register_view_for_building(sstring ks_name, sstring view_name, const dht::token& token) {
sstring req = format("INSERT INTO system.{} (keyspace_name, view_name, generation_number, cpu_id, first_token) VALUES (?, ?, ?, ?, ?)",
v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS);
SCYLLA_VIEWS_BUILDS_IN_PROGRESS);
return execute_cql(
std::move(req),
std::move(ks_name),
@@ -2705,7 +2619,7 @@ future<> system_keyspace::register_view_for_building_for_all_shards(sstring ks_n
// before all shards are registered.
// if another shard has already registered, this won't overwrite its status. if it hasn't registered, we insert
// a status with first_token=null and next_token=null, indicating it hasn't made progress.
auto&& schema = db::system_keyspace::v3::scylla_views_builds_in_progress();
auto&& schema = db::system_keyspace::scylla_views_builds_in_progress();
auto timestamp = api::new_timestamp();
mutation m{schema, partition_key::from_single_value(*schema, utf8_type->decompose(ks_name))};
@@ -2723,7 +2637,7 @@ future<> system_keyspace::register_view_for_building_for_all_shards(sstring ks_n
future<> system_keyspace::update_view_build_progress(sstring ks_name, sstring view_name, const dht::token& token) {
sstring req = format("INSERT INTO system.{} (keyspace_name, view_name, next_token, cpu_id) VALUES (?, ?, ?, ?)",
v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS);
SCYLLA_VIEWS_BUILDS_IN_PROGRESS);
return execute_cql(
std::move(req),
std::move(ks_name),
@@ -2734,14 +2648,14 @@ future<> system_keyspace::update_view_build_progress(sstring ks_name, sstring vi
future<> system_keyspace::remove_view_build_progress_across_all_shards(sstring ks_name, sstring view_name) {
return execute_cql(
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ?", v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ?", SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
std::move(ks_name),
std::move(view_name)).discard_result();
}
future<> system_keyspace::remove_view_build_progress(sstring ks_name, sstring view_name) {
return execute_cql(
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ? AND cpu_id = ?", v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ? AND cpu_id = ?", SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
std::move(ks_name),
std::move(view_name),
int32_t(this_shard_id())).discard_result();
@@ -2749,20 +2663,20 @@ future<> system_keyspace::remove_view_build_progress(sstring ks_name, sstring vi
future<> system_keyspace::mark_view_as_built(sstring ks_name, sstring view_name) {
return execute_cql(
format("INSERT INTO system.{} (keyspace_name, view_name) VALUES (?, ?)", v3::BUILT_VIEWS),
format("INSERT INTO system.{} (keyspace_name, view_name) VALUES (?, ?)", BUILT_VIEWS),
std::move(ks_name),
std::move(view_name)).discard_result();
}
future<> system_keyspace::remove_built_view(sstring ks_name, sstring view_name) {
return execute_cql(
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ?", v3::BUILT_VIEWS),
format("DELETE FROM system.{} WHERE keyspace_name = ? AND view_name = ?", BUILT_VIEWS),
std::move(ks_name),
std::move(view_name)).discard_result();
}
future<std::vector<system_keyspace::view_name>> system_keyspace::load_built_views() {
return execute_cql(format("SELECT * FROM system.{}", v3::BUILT_VIEWS)).then([] (::shared_ptr<cql3::untyped_result_set> cql_result) {
return execute_cql(format("SELECT * FROM system.{}", BUILT_VIEWS)).then([] (::shared_ptr<cql3::untyped_result_set> cql_result) {
return *cql_result
| std::views::transform([] (const cql3::untyped_result_set::row& row) {
auto ks_name = row.get_as<sstring>("keyspace_name");
@@ -2774,7 +2688,7 @@ future<std::vector<system_keyspace::view_name>> system_keyspace::load_built_view
future<std::vector<system_keyspace::view_build_progress>> system_keyspace::load_view_build_progress() {
return execute_cql(format("SELECT keyspace_name, view_name, first_token, next_token, cpu_id FROM system.{}",
v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS)).then([] (::shared_ptr<cql3::untyped_result_set> cql_result) {
SCYLLA_VIEWS_BUILDS_IN_PROGRESS)).then([] (::shared_ptr<cql3::untyped_result_set> cql_result) {
std::vector<view_build_progress> progress;
for (auto& row : *cql_result) {
auto ks_name = row.get_as<sstring>("keyspace_name");
@@ -3227,6 +3141,8 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
co_return ret;
}
const bool strongly_consistent_tables = _db.features().strongly_consistent_tables;
for (auto& row : *rs) {
if (!row.has("host_id")) {
// There are no clustering rows, only the static row.
@@ -3463,7 +3379,9 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
ret.session = service::session_id(some_row.get_as<utils::UUID>("session"));
}
if (some_row.has("tablet_balancing_enabled")) {
if (strongly_consistent_tables) {
ret.tablet_balancing_enabled = false;
} else if (some_row.has("tablet_balancing_enabled")) {
ret.tablet_balancing_enabled = some_row.get_as<bool>("tablet_balancing_enabled");
} else {
ret.tablet_balancing_enabled = true;

View File

@@ -127,6 +127,8 @@ class system_keyspace : public seastar::peering_sharded_service<system_keyspace>
static schema_ptr raft_snapshot_config();
static schema_ptr local();
static schema_ptr truncated();
static schema_ptr commitlog_cleanups();
static schema_ptr peers();
static schema_ptr peer_events();
static schema_ptr range_xfers();
@@ -137,7 +139,10 @@ class system_keyspace : public seastar::peering_sharded_service<system_keyspace>
static schema_ptr large_rows();
static schema_ptr large_cells();
static schema_ptr corrupt_data();
static schema_ptr scylla_local();
static schema_ptr batches();
static schema_ptr available_ranges();
static schema_ptr built_views();
static schema_ptr cdc_local();
future<> force_blocking_flush(sstring cfname);
// This function is called when the system.peers table is read,
// and it fixes some types of inconsistencies that can occur
@@ -204,6 +209,12 @@ public:
static constexpr auto VIEW_BUILDING_TASKS = "view_building_tasks";
static constexpr auto CLIENT_ROUTES = "client_routes";
static constexpr auto VERSIONS = "versions";
static constexpr auto BATCHES = "batches";
static constexpr auto AVAILABLE_RANGES = "available_ranges";
static constexpr auto VIEWS_BUILDS_IN_PROGRESS = "views_builds_in_progress";
static constexpr auto BUILT_VIEWS = "built_views";
static constexpr auto SCYLLA_VIEWS_BUILDS_IN_PROGRESS = "scylla_views_builds_in_progress";
static constexpr auto CDC_LOCAL = "cdc_local";
// auth
static constexpr auto ROLES = "roles";
@@ -211,42 +222,6 @@ public:
static constexpr auto ROLE_ATTRIBUTES = "role_attributes";
static constexpr auto ROLE_PERMISSIONS = "role_permissions";
struct v3 {
static constexpr auto BATCHES = "batches";
static constexpr auto PAXOS = "paxos";
static constexpr auto BUILT_INDEXES = "IndexInfo";
static constexpr auto LOCAL = "local";
static constexpr auto PEERS = "peers";
static constexpr auto PEER_EVENTS = "peer_events";
static constexpr auto RANGE_XFERS = "range_xfers";
static constexpr auto COMPACTION_HISTORY = "compaction_history";
static constexpr auto SSTABLE_ACTIVITY = "sstable_activity";
static constexpr auto SIZE_ESTIMATES = "size_estimates";
static constexpr auto AVAILABLE_RANGES = "available_ranges";
static constexpr auto VIEWS_BUILDS_IN_PROGRESS = "views_builds_in_progress";
static constexpr auto BUILT_VIEWS = "built_views";
static constexpr auto SCYLLA_VIEWS_BUILDS_IN_PROGRESS = "scylla_views_builds_in_progress";
static constexpr auto CDC_LOCAL = "cdc_local";
static schema_ptr batches();
static schema_ptr built_indexes();
static schema_ptr local();
static schema_ptr truncated();
static schema_ptr commitlog_cleanups();
static schema_ptr peers();
static schema_ptr peer_events();
static schema_ptr range_xfers();
static schema_ptr compaction_history();
static schema_ptr sstable_activity();
static schema_ptr size_estimates();
static schema_ptr large_partitions();
static schema_ptr scylla_local();
static schema_ptr available_ranges();
static schema_ptr views_builds_in_progress();
static schema_ptr built_views();
static schema_ptr scylla_views_builds_in_progress();
static schema_ptr cdc_local();
};
// Partition estimates for a given range of tokens.
struct range_estimates {
schema_ptr schema;
@@ -264,6 +239,7 @@ public:
static schema_ptr batchlog_v2();
static schema_ptr paxos();
static schema_ptr built_indexes(); // TODO (from Cassandra): make private
static schema_ptr scylla_local();
static schema_ptr raft();
static schema_ptr raft_snapshots();
static schema_ptr repair_history();
@@ -283,6 +259,8 @@ public:
static schema_ptr dicts();
static schema_ptr view_building_tasks();
static schema_ptr client_routes();
static schema_ptr views_builds_in_progress();
static schema_ptr scylla_views_builds_in_progress();
// auth
static schema_ptr roles();

View File

@@ -195,7 +195,7 @@ public:
return mutation_reader(std::make_unique<build_progress_reader>(
s,
std::move(permit),
_db.find_column_family(s->ks_name(), system_keyspace::v3::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
_db.find_column_family(s->ks_name(), system_keyspace::SCYLLA_VIEWS_BUILDS_IN_PROGRESS),
range,
slice,
std::move(trace_state),

View File

@@ -3707,7 +3707,7 @@ void validate_view_keyspace(const data_dictionary::database& db, std::string_vie
try {
locator::assert_rf_rack_valid_keyspace(keyspace_name, tmptr, rs);
} catch (const std::exception& e) {
} catch (const std::invalid_argument& e) {
throw std::logic_error(fmt::format(
"Materialized views and secondary indexes are not supported on the keyspace '{}': {}",
keyspace_name, e.what()));

View File

@@ -68,6 +68,8 @@ public:
.with_column("peer", inet_addr_type, column_kind::partition_key)
.with_column("dc", utf8_type)
.with_column("up", boolean_type)
.with_column("draining", boolean_type)
.with_column("excluded", boolean_type)
.with_column("status", utf8_type)
.with_column("load", utf8_type)
.with_column("tokens", int32_type)
@@ -107,8 +109,11 @@ public:
if (tm.get_topology().has_node(hostid)) {
// Not all entries in gossiper are present in the topology
sstring dc = tm.get_topology().get_location(hostid).dc;
auto& node = tm.get_topology().get_node(hostid);
sstring dc = node.dc_rack().dc;
set_cell(cr, "dc", dc);
set_cell(cr, "draining", node.is_draining());
set_cell(cr, "excluded", node.is_excluded());
}
if (ownership.contains(eps.get_ip())) {
@@ -1134,6 +1139,8 @@ public:
set_cell(r.cells(), "dc", node.dc());
set_cell(r.cells(), "rack", node.rack());
set_cell(r.cells(), "up", _gossiper.local().is_alive(host));
set_cell(r.cells(), "draining", node.is_draining());
set_cell(r.cells(), "excluded", node.is_excluded());
if (auto ip = _gossiper.local().get_address_map().find(host)) {
set_cell(r.cells(), "ip", data_value(inet_address(*ip)));
}
@@ -1144,6 +1151,9 @@ public:
if (stats && stats->capacity.contains(host)) {
auto capacity = stats->capacity.at(host);
set_cell(r.cells(), "storage_capacity", data_value(int64_t(capacity)));
if (auto ts_iter = stats->tablet_stats.find(host); ts_iter != stats->tablet_stats.end()) {
set_cell(r.cells(), "effective_capacity", data_value(int64_t(ts_iter->second.effective_capacity)));
}
if (auto utilization = load.get_allocated_utilization(host)) {
set_cell(r.cells(), "storage_allocated_utilization", data_value(double(*utilization)));
@@ -1168,9 +1178,12 @@ private:
.with_column("rack", utf8_type)
.with_column("ip", inet_addr_type)
.with_column("up", boolean_type)
.with_column("draining", boolean_type)
.with_column("excluded", boolean_type)
.with_column("tablets_allocated", long_type)
.with_column("tablets_allocated_per_shard", double_type)
.with_column("storage_capacity", long_type)
.with_column("effective_capacity", long_type)
.with_column("storage_allocated_load", long_type)
.with_column("storage_allocated_utilization", double_type)
.with_column("storage_load", long_type)
@@ -1484,7 +1497,7 @@ future<> initialize_virtual_tables(
co_await add_table(std::make_unique<cdc_streams_table>(db, ss));
db.find_column_family(system_keyspace::size_estimates()).set_virtual_reader(mutation_source(db::size_estimates::virtual_reader(db, sys_ks.local())));
db.find_column_family(system_keyspace::v3::views_builds_in_progress()).set_virtual_reader(mutation_source(db::view::build_progress_virtual_reader(db)));
db.find_column_family(system_keyspace::views_builds_in_progress()).set_virtual_reader(mutation_source(db::view::build_progress_virtual_reader(db)));
db.find_column_family(system_keyspace::built_indexes()).set_virtual_reader(mutation_source(db::index::built_indexes_virtual_reader(db)));
}

View File

@@ -114,6 +114,10 @@ WantedBy=local-fs.target scylla-server.service
'''[1:-1]
with open('/etc/systemd/system/var-lib-systemd-coredump.mount', 'w') as f:
f.write(dot_mount)
# in case we have old mounts in deleted state hanging around from older installation
# systemd doesn't seem to be able to deal with those properly, and assume they are still active
# and doesn't do anything about them
run('umount /var/lib/systemd/coredump', shell=True, check=False)
os.makedirs('/var/lib/scylla/coredump', exist_ok=True)
systemd_unit.reload()
systemd_unit('var-lib-systemd-coredump.mount').enable()

View File

@@ -1,6 +1,10 @@
### a dictionary of redirections
#old path: new path
# Remove an outdated KB
/stable/kb/perftune-modes-sync.html: /stable/kb/index.html
# Move the diver information to another project
/stable/using-scylla/drivers/index.html: https://docs.scylladb.com/stable/drivers/index.html

View File

@@ -190,7 +190,7 @@ then every rack in every datacenter receives a replica, except for racks compris
of only :doc:`zero-token nodes </architecture/zero-token-nodes>`. Racks added after
the keyspace creation do not receive replicas.
When ``rf_rack_valid_keyspaces``` is enabled in the config and the keyspace is tablet-based,
When ``enforce_rack_list`` (or (deprecated) ``rf_rack_valid_keyspaces``) is enabled in the config and the keyspace is tablet-based,
the numeric replication factor is automatically expanded into a rack list when the statement is
executed, which can be observed in the DESCRIBE output afterwards. If the numeric RF is smaller than
the number of racks in a DC, a subset of racks is chosen arbitrarily.

View File

@@ -241,8 +241,8 @@ Currently, the possible orderings are limited by the :ref:`clustering order <clu
.. _vector-queries:
Vector queries
~~~~~~~~~~~~~~
Vector queries :label-note:`ScyllaDB Cloud`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``ORDER BY`` clause can also be used with vector columns to perform the approximate nearest neighbor (ANN) search.
When using vector columns, the syntax is as follows:
@@ -280,11 +280,25 @@ For example::
FROM ImageEmbeddings
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, 0.4] LIMIT 5;
.. warning::
Currently, vector queries do not support filtering with ``WHERE`` clause,
grouping with ``GROUP BY`` and paging. This will be added in the future releases.
Vector queries also support filtering with ``WHERE`` clauses on columns that are part of the primary key.
For example::
SELECT image_id FROM ImageEmbeddings
WHERE user_id = 'user123'
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, 0.4] LIMIT 5;
The supported operations are equal relations (``=`` and ``IN``) with restrictions as in regular ``WHERE`` clauses. See :ref:`WHERE <where-clause>`.
Other filtering scenarios are currently not supported.
.. note::
Vector indexes are supported in ScyllaDB Cloud only in clusters that have the Vector Search feature enabled.
Vector indexes do not support all ScyllaDB features (e.g., tracing, TTL, paging, and grouping). More information
about Vector Search is available in the
`ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/stable/vector-search/>`_.
.. _limit-clause:

View File

@@ -129,17 +129,15 @@ More on :doc:`Local Secondary Indexes </features/local-secondary-indexes>`
.. _create-vector-index-statement:
Vector Index :label-caution:`Experimental` :label-note:`ScyllaDB Cloud`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Vector Index :label-note:`ScyllaDB Cloud`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
Vector indexes are supported in ScyllaDB Cloud only in the clusters that have the vector search feature enabled.
Moreover, vector indexes are an experimental feature that:
* is not suitable for production use,
* does not guarantee backward compatibility between ScyllaDB versions,
* does not support all the features of ScyllaDB (e.g., tracing, filtering, TTL).
Vector indexes are supported in ScyllaDB Cloud only in clusters that have the Vector Search feature enabled.
Vector indexes do not support all ScyllaDB features (e.g., tracing, TTL, paging, and grouping). More information
about Vector Search is available in the
`ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/stable/vector-search/>`_.
ScyllaDB supports creating vector indexes on tables, allowing queries on the table to use those indexes for efficient
similarity search on vector data.
@@ -177,6 +175,26 @@ The following options are supported for vector indexes. All of them are optional
| | as ``efSearch``. Higher values lead to better recall (i.e., more relevant results are found) | |
| | but increase query latency. Supported values are integers between 1 and 4096. | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
| ``quantization`` | The quantization method to use for compressing vectors in Vector Index. Vectors in base table | ``f32`` |
| | are never compressed. Supported values (case-insensitive) are: | |
| | | |
| | * ``f32``: 32-bit single-precision IEEE 754 floating-point. | |
| | * ``f16``: 16-bit standard half-precision floating-point (IEEE 754). | |
| | * ``bf16``: 16-bit "Brain" floating-point (optimized for ML workloads). | |
| | * ``i8``: 8-bit signed integer. | |
| | * ``b1``: 1-bit binary value (packed 8 per byte). | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
| ``oversampling`` | A multiplier for the candidate set size during the search phase. For example, if a query asks for 10 | ``1.0`` |
| | similar vectors (``LIMIT 10``) and ``oversampling`` is 2.0, the search will initially retrieve 20 | |
| | candidates. This can improve accuracy at the cost of latency. Supported values are | |
| | floating-point numbers between 1.0 (no oversampling) and 100.0. | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
| ``rescoring`` | Flag enabling recalculation of similarity scores with full precision and re-ranking of the candidate set.| ``false`` |
| | Valid only for quantization below ``f32``. Supported values are: | |
| | | |
| | * ``true``: Enable rescoring. | |
| | * ``false``: Disable rescoring. | |
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+
.. _drop-index-statement:

View File

@@ -665,6 +665,14 @@ it is not possible to update only some elements of a vector (without updating th
Types stored in a vector are not implicitly frozen, so if you want to store a frozen collection or
frozen UDT in a vector, you need to explicitly wrap them using `frozen` keyword.
.. note::
The main application of vectors is to support vector search capabilities, which
are supported in ScyllaDB Cloud only in clusters that have the Vector Search feature enabled.
Note that Vector Search clusters do not support all ScyllaDB features (e.g., tracing, TTL, paging, and grouping). More information
about Vector Search is available in the
`ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/stable/vector-search/>`_.
.. .. _custom-types:
.. Custom Types

View File

@@ -221,6 +221,87 @@ scylla-bucket/prefix/
```
See the API [documentation](#copying-sstables-on-s3-backup) for more details about the actual backup request.
### The snapshot manifest
Each table snapshot directory contains a manifest.json file that lists the contents of the snapshot and some metadata.
The json structure is as follows:
```
{
"manifest": {
"version": "1.0",
"scope": "node"
},
"node": {
"host_id": "<UUID>",
"datacenter": "mydc",
"rack": "myrack"
},
"snapshot": {
"name": "snapshot name",
"created_at": seconds_since_epoch,
"expires_at": seconds_since_epoch | null,
},
"table": {
"keyspace_name": "my_keyspace",
"table_name": "my_table",
"table_id": "<UUID>",
"tablets_type": "none|powof2",
"tablet_count": N
},
"sstables": [
{
"id": "67e35000-d8c6-11f0-9599-060de9f3bd1b",
"toc_name": "me-3gw7_0ndy_3wlq829wcsddgwha1n-big-TOC.txt",
"data_size": 75,
"index_size": 8,
"first_token": -8629266958227979430,
"last_token": 9168982884335614769,
},
{
"id": "67e35000-d8c6-11f0-85dc-0625e9f3bd1b",
"toc_name": "me-3gw7_0ndy_3wlq821a6cqlbmxrtn-big-TOC.txt",
"data_size": 73,
"index_size": 8,
"first_token": 221146791717891383,
"last_token": 7354559975791427036,
},
...
],
"files": [ ... ]
}
The `manifest` member contains the following attributes:
- `version` - respresenting the version of the manifest itself. It is incremented when members are added or removed from the manifest.
- `scope` - the scope of metadata stored in this manifest file. The following scopes are supported:
- `node` - the manifest describes all SSTables owned by this node in this snapshot.
The `node` member contains metadata about this node that enables datacenter- or rack-aware restore.
- `host_id` - is the node's unique host_id (a UUID).
- `datacenter` - is the node's datacenter.
- `rack` - is the node's rack.
The `snapshot` member contains metadata about the snapshot.
- `name` - is the snapshot name (a.k.a. `tag`)
- `created_at` - is the time when the snapshot was created.
- `expires_at` - is an optional time when the snapshot expires and can be dropped, if a TTL was set for the snapshot. If there is no TTL, `expires_at` may be omitted, set to null, or set to 0.
The `table` member contains metadata about the table being snapshot.
- `keyspace_name` and `table_name` - are self-explanatory.
- `table_id` - a universally unique identifier (UUID) of the table set when the table is created.
- `tablets_type`:
- `none` - if the keyspace uses vnodes replication
- `powof2` - if the keyspace uses tables replication, and the tablet token ranges are based on powers of 2.
- `tablet_count` - Optional. If `tablets_type` is not `none`, contains the number of tablets allcated in the table. If `tablets_type` is `powof2`, tablet_count would be a power of 2.
The `sstables` member is a list containing metadata about the SSTables in the snapshot.
- `id` - is the STable's unique id (a UUID). It is carried over with the SSTable when it's streamed as part of tablet migration, even if it gets a new generation.
- `toc_name` - is the name of the SSTable Table Of Contents (TOC) component.
- `data_size` and `index_size` - are the sizes of the SSTable's data and index components, respectively. They can be used to estimate how much disk space is needed for restore.
- `first_token` and `last_token` - are the first and last tokens in the SSTable, respectively. They can be used to determine if a SSTable is fully contained in a (tablet) token range to enable efficient file-based streaming of the SSTable.
The optional `files` member may contain a list of non-SSTable files included in the snapshot directory, not including the manifest.json file and schema.cql.
```
3. `CREATE KEYSPACE` with S3/GS storage
When creating a keyspace with S3/GS storage, the data is stored under the bucket passed as argument to the `CREATE KEYSPACE` statement.

View File

@@ -6,10 +6,10 @@ same amount of disk space. This means that the number of tablets located on a no
proportional to the gross disk capacity of that node. Because the used disk space of
different tablets can vary greatly, this could create imbalance in disk utilization.
Size based load balancing aims to achieve better disk utilization accross nodes in a
Size based load balancing aims to achieve better disk utilization across nodes in a
cluster. The load balancer will continuously gather information about available disk
space and tablet sizes from all the nodes. It then incrementally computes tablet
migration plans which equalize disk utilization accross the cluster.
migration plans which equalize disk utilization across the cluster.
# Basic operation
@@ -75,7 +75,7 @@ migrations), and will wait for correct tablet sizes to arrive after the next ``l
refresh by the topology coordinator.
One exception to this are nodes which have been excluded from the cluster. These nodes
are down and therefor are not able to send fresh ``load_stats``. But they have to be drained
are down and therefore are not able to send fresh ``load_stats``. But they have to be drained
of their tablets (via tablet rebuild), and the balancer must do this even with incomplete
tablet data. So, only excluded nodes are allowed to have missing tablet sizes.

View File

@@ -357,6 +357,7 @@ Schema:
CREATE TABLE system.load_per_node (
node uuid PRIMARY KEY,
dc text,
effective_capacity bigint,
rack text,
storage_allocated_load bigint,
storage_allocated_utilization double,
@@ -372,6 +373,7 @@ Columns:
* `storage_allocated_load` - Disk space allocated for tablets, assuming each tablet has a fixed size (target_tablet_size).
* `storage_allocated_utilization` - Fraction of node's disk capacity taken for `storage_allocated_load`, where 1.0 means full utilization.
* `storage_capacity` - Total disk capacity in bytes. Used to compute `storage_allocated_utilization`. By default equal to file system's capacity.
* `effective_capacity` - Sum of available disk space and tablet sizes on a node. Used to compute load on a node for size based balancing.
* `storage_load` - Disk space allocated for tablets, computed with actual tablet sizes. Can be null if some of the tablet sizes are not known.
* `storage_utilization` - Fraction of node's disk capacity taken for `storage_load` (with actual tablet sizes), where 1.0 means full utilization. Can be null if some of the tablet sizes are not known.
* `tablets_allocated` - Number of tablet replicas on the node. Migrating tablets are accounted as if migration already finished.

View File

@@ -46,14 +46,11 @@ stateDiagram-v2
state replacing {
rp_join_group0 : join_group0
rp_left_token_ring : left_token_ring
rp_tablet_draining : tablet_draining
rp_write_both_read_old : write_both_read_old
rp_write_both_read_new : write_both_read_new
[*] --> rp_join_group0
rp_join_group0 --> rp_left_token_ring: rollback
rp_join_group0 --> rp_tablet_draining
rp_tablet_draining --> rp_write_both_read_old
rp_tablet_draining --> rp_left_token_ring: rollback
rp_join_group0 --> rp_write_both_read_old
rp_join_group0 --> [*]: rejected
rp_write_both_read_old --> rp_write_both_read_new: streaming completed
rp_write_both_read_old --> rp_left_token_ring: rollback
@@ -69,34 +66,34 @@ stateDiagram-v2
normal --> decommissioning: leave
normal --> removing: remove
state decommissioning {
de_tablet_migration0 : tablet_migration (draining)
de_tablet_migration1 : tablet_migration
[*] --> de_tablet_migration0
de_tablet_migration0 --> [*]: aborted
de_left_token_ring : left_token_ring
de_tablet_draining : tablet_draining
de_tablet_migration : tablet_migration
de_write_both_read_old: write_both_read_old
de_write_both_read_new : write_both_read_new
de_rollback_to_normal : rollback_to_normal
[*] --> de_tablet_draining
de_tablet_draining --> de_rollback_to_normal: rollback
de_rollback_to_normal --> de_tablet_migration
de_tablet_draining --> de_write_both_read_old
de_tablet_migration --> [*]
de_rollback_to_normal --> de_tablet_migration1
de_tablet_migration0 --> de_write_both_read_old
de_tablet_migration1 --> [*]
de_write_both_read_old --> de_write_both_read_new: streaming completed
de_write_both_read_old --> de_rollback_to_normal: rollback
de_write_both_read_new --> de_left_token_ring
de_left_token_ring --> [*]
}
state removing {
re_tablet_migration0 : tablet_migration (draining)
re_tablet_migration1 : tablet_migration
[*] --> re_tablet_migration0
re_tablet_migration0 --> [*]: aborted
re_left_token_ring : left_token_ring
re_tablet_draining : tablet_draining
re_tablet_migration : tablet_migration
re_write_both_read_old : write_both_read_old
re_write_both_read_new : write_both_read_new
re_rollback_to_normal : rollback_to_normal
[*] --> re_tablet_draining
re_tablet_draining --> re_rollback_to_normal: rollback
re_rollback_to_normal --> re_tablet_migration
re_tablet_migration --> [*]
re_tablet_draining --> re_write_both_read_old
re_rollback_to_normal --> re_tablet_migration1
re_tablet_migration1 --> [*]
re_tablet_migration0 --> re_write_both_read_old
re_write_both_read_old --> re_write_both_read_new: streaming completed
re_write_both_read_old --> re_rollback_to_normal: rollback
re_write_both_read_new --> re_left_token_ring
@@ -181,6 +178,27 @@ are the currently supported global topology operations:
contain replicas of the table being truncated. It uses [sessions](#Topology guards)
to make sure that no stale RPCs are executed outside of the scope of the request.
## Tablet draining
Presence of node requests of type `leave` (decommission) and `remove` will cause the load
balancer to start migrating tablets away from those nodes. Until they are drained
of tablets, the requests are in paused state and are not picked by topology
coordinator for execution.
Paused requests are also pending, the "topology_request" field is engaged.
The node's state is still `normal` when request is paused.
Canceling the requests will stop the draining process, because absence of a request
lifts the draining state on the node.
When tablet scheduler is done with migrating all tablets away from a draining node,
the associated request will be unpaused, and can be picked by topology coordinator
for execution. This time it will enter the `write_both_read_old` transition and proceed
with the vnode part.
This allows multiple requests to drain tablets in parallel, and other requests to not be blocked
by long `leave` and `remove` requests.
## Zero-token nodes
Zero-token nodes (the nodes started with `join_ring=false`) never own tokens or become
@@ -221,12 +239,6 @@ that there are no tablet transitions in the system.
Tablets are migrated in parallel and independently.
There is a variant of tablet migration track called tablet draining track, which is invoked
as a step of certain topology operations (e.g. decommission, removenode). Its goal is to readjust tablet replicas
so that a given topology change can proceed. For example, when decommissioning a node, we
need to migrate tablet replicas away from the node being decommissioned.
Tablet draining happens before making changes to vnode-based replication.
## Node replace with tablets
Tablet replicas on the replaced node are rebuilt after the replacing node is already in the normal state and

View File

@@ -28,8 +28,7 @@ Incremental Repair is only supported for tables that use the tablets architectur
Incremental Repair Modes
------------------------
Incremental is currently disabled by default. You can control its behavior for a given repair operation using the ``incremental_mode`` parameter.
This is useful for enabling incremental repair, or in situations where you might need to force a full data validation.
While incremental repair is the default and recommended mode, you can control its behavior for a given repair operation using the ``incremental_mode`` parameter. This is useful for situations where you might need to force a full data validation.
The available modes are:

View File

@@ -86,7 +86,7 @@ Compaction Strategies with Materialized Views
Materialized views, just like regular tables, use one of the available :doc:`compaction strategies </architecture/compaction/compaction-strategies>`.
When a materialized view is created, it does not inherit its base table compaction strategy settings, because the data model
of a view does not necessarily have the same characteristics as the one from its base table.
Instead, the default compaction strategy (SizeTieredCompactionStrategy) is used.
Instead, the default compaction strategy (IncrementalCompactionStrategy) is used.
A compaction strategy for a new materialized view can be explicitly set during its creation, using the following command:

View File

@@ -57,7 +57,6 @@ Knowledge Base
* :doc:`Customizing CPUSET </kb/customizing-cpuset>`
* :doc:`Recreate RAID devices </kb/raid-device>` - How to recreate your RAID devices without running scylla-setup
* :doc:`Configure ScyllaDB Networking with Multiple NIC/IP Combinations </kb/yaml-address>` - examples for setting the different IP addresses in scylla.yaml
* :doc:`Updating the Mode in perftune.yaml After a ScyllaDB Upgrade </kb/perftune-modes-sync>`
* :doc:`Kafka Sink Connector Quickstart </using-scylla/integrations/kafka-connector>`
* :doc:`Kafka Sink Connector Configuration </using-scylla/integrations/sink-config>`

View File

@@ -1,48 +0,0 @@
==============================================================
Updating the Mode in perftune.yaml After a ScyllaDB Upgrade
==============================================================
We improved ScyllaDB's performance by `removing the rx_queues_count from the mode
condition <https://github.com/scylladb/seastar/pull/949>`_. As a result, ScyllaDB operates in
the ``sq_split`` mode instead of the ``mq`` mode (see :doc:`Seastar Perftune </operating-scylla/admin-tools/perftune>` for information about the modes).
If you upgrade from an earlier version of ScyllaDB, your cluster's existing nodes may use the ``mq`` mode,
while new nodes will use the ``sq_split`` mode. As using different modes across one cluster is not recommended,
you should change the configuration to ensure that the ``sq_split`` mode is used on all nodes.
This section describes how to update the `perftune.yaml` file to configure the ``sq_split`` mode on all nodes.
Procedure
------------
The examples below assume that you are using the default locations for storing data and the `scylla.yaml` file,
and that your NIC is ``eth5``.
#. Backup your old configuration.
.. code-block:: console
sudo mv /etc/scylla.d/cpuset.conf /etc/scylla.d/cpuset.conf.old
sudo mv /etc/scylla.d/perftune.yaml /etc/scylla.d/perftune.yaml.old
#. Create a new configuration.
.. code-block:: console
sudo scylla_sysconfig_setup --nic eth5 --homedir /var/lib/scylla --confdir /etc/scylla
A new ``/etc/scylla.d/cpuset.conf`` will be generated on the output.
#. Compare the contents of the newly generated ``/etc/scylla.d/cpuset.conf`` with ``/etc/scylla.d/cpuset.conf.old`` you created in step 1.
- If they are exactly the same, rename ``/etc/scylla.d/perftune.yaml.old`` you created in step 1 back to ``/etc/scylla.d/perftune.yaml`` and continue to the next node.
- If they are different, move on to the next steps.
#. Restart the ``scylla-server`` service.
.. code-block:: console
nodetool drain
sudo systemctl restart scylla-server
#. Wait for the service to become up and running (similarly to how it is done during a :doc:`rolling restart </operating-scylla/procedures/config-change/rolling-restart>`). It may take a considerable amount of time before the node is in the UN state due to resharding.
#. Continue to the next node.

View File

@@ -25,4 +25,8 @@ Port Description Protocol
19142 Native shard-aware transport port (ssl) TCP
====== ============================================ ========
If you're using ScyllaDB Alternator, ensure that the ports configured
for Alternator with the ``alternator_port`` or ``alternator_https_port`` parameter
are open. See :doc:`ScyllaDB Alternator </alternator/alternator>` for details.
.. note:: For ScyllaDB Manager ports, see the `ScyllaDB Manager <https://manager.docs.scylladb.com/>`_ documentation.

View File

@@ -93,6 +93,25 @@ API calls
Cluster tasks are not unregistered from task manager with API calls.
Node operations module
----------------------
There is a module named ``node_ops``, which allows tracking node operations: decommission, removenode, bootstrap, replace, rebuild.
The ``type`` field designates the operation, and is one of:
- ``decommission``
- ``remove node``
- ``bootstrap``
- ``replace``
- ``rebuild``
The ``scope`` and ``kind`` fields are set to ``cluster``.
The ``entity`` field holds the host id of the node which is being operated on, as long as the request
is not finished. In case of the ``replace`` operation, it will hold the host id of the replacing node.
``decommission`` and ``remove node`` tasks are abortable, but only before they finish tablet migration.
Tasks API
---------

View File

@@ -55,7 +55,7 @@ ScyllaDB nodetool cluster repair command supports the following options:
nodetool cluster repair --tablet-tokens 1,10474535988
- ``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled'.
- ``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental.
For example:

View File

@@ -29,5 +29,13 @@ Before you run ``nodetool decommission``:
request may fail.
In such a case, ALTER the keyspace to reduce the RF before running ``nodetool decommission``.
It's allowed to invoke ``nodetool decommission`` on multiple nodes in parallel. This will be faster than doing
it sequentially if there is significant amount of data in tablet-based keyspaces, because
tablets are migrated from nodes in parallel. Decommission process first migrates tablets away, and this
part is done in parallel for all nodes being decommissioned. Then it does the vnode-based decommission, and
this part is serialized with other vnode-based operations, including those from other decommission operations.
Decommission which is still in the tablet draining phase can be canceled using Task Manager API.
See :doc:`Task manager </operating-scylla/admin-tools/task-manager>`.
.. include:: nodetool-index.rst

View File

@@ -56,6 +56,17 @@ To only mark the node as permanently down without doing actual removal, use :doc
.. _removenode-ignore-dead-nodes:
It's allowed to invoke ``nodetool removenode`` on multiple nodes in parallel. This will be faster than doing
it sequentially if there is significant amount of data in tablet-based keyspaces, because
tablets which belong to removed nodes will be rebuilt in parallel. Node removal first migrates tablets to new
replicas, in parallel for all nodes being removed. Then does the part which executes
removal for the vnode-based keyspaces, and this is serialized with other vnode-based operations, including
those from other removenode operations.
Removenode which is still in the tablet rebuild phase can be canceled using Task Manager API.
Tablets which are already rebuilt will remain on their new replicas.
See :doc:`Task manager </operating-scylla/admin-tools/task-manager>`.
Ignoring Dead Nodes
---------------------

View File

@@ -177,8 +177,8 @@ To display the log classes (output changes with each version so your display may
storage_proxy
storage_service
stream_session
strongly_consistent_modification_statement
strongly_consistent_select_statement
broadcast_modification_statement
broadcast_select_statement
system_distributed_keyspace
system_keyspace
table

View File

@@ -17,7 +17,7 @@ Glossary
The CAP Theorem is the notion that **C** (Consistency), **A** (Availability) and **P** (Partition Tolerance) of data are mutually dependent in a distributed system. Increasing any 2 of these factors will reduce the third. ScyllaDB chooses availability and partition tolerance over consistency. See :doc:`Fault Tolerance </architecture/architecture-fault-tolerance>`.
Cluster
One or multiple ScyllaDB nodes, acting in concert, which own a single contiguous token range. State is communicated between nodes in the cluster via the Gossip protocol. See :doc:`Ring Architecture </architecture/ringarchitecture/index>`.
One or multiple ScyllaDB nodes, acting in concert, which own a single contiguous token range. State is communicated between nodes in the cluster via :doc:`Raft </architecture/raft>`.
Clustering Key
A single or multi-column clustering key determines a rows uniqueness and sort order on disk within a partition. See :doc:`Ring Architecture </architecture/ringarchitecture/index>`.
@@ -157,14 +157,18 @@ Glossary
Table
A collection of columns fetched by row. Columns are ordered by Clustering Key. See :doc:`Ring Architecture </architecture/ringarchitecture/index>`.
Tablet
The name of a :term:`Token Range` in the `Tablet Model </architecture/tablets>` of assigning ownership of data in a cluster.
Time-window compaction strategy
TWCS is designed for time series data. See :doc:`Compaction Strategies</architecture/compaction/compaction-strategies/>`.
Token
A value in a range, used to identify both nodes and partitions. Each node in a ScyllaDB cluster is given an (initial) token, which defines the end of the range a node handles. See :doc:`Ring Architecture </architecture/ringarchitecture/index>`.
A hash value calculated over the partition key of a table. All tokens calculated over any partition key of any table are part the same token domain. This allows unified assignment of partitions to nodes in a ScyllaDB cluster via using :term:`Token Range` as the basis of ownership assignment.
Token Range
The total range of potential unique identifiers supported by the partitioner. By default, each ScyllaDB node in the cluster handles 256 token ranges. Each token range corresponds to a Vnode. Each range of hashes in turn is a segment of the total range of a given hash function. See :doc:`Ring Architecture </architecture/ringarchitecture/index>`.
A contiguous range of :term:`Token` values, the basis of ownership assignment in a ScyllaDB cluster. A token range is defined by a start and end token. Each node in a ScyllaDB cluster owns a number of token ranges.
There are two models of distributing token ranges in the cluster, the :doc:`VNode Model </architecture/ringarchitecture/index>` and the `Tablet Model </architecture/tablets>`.
Tombstone
A marker that indicates that data has been deleted. A large number of tombstones may impact read performance and disk usage, so an efficient tombstone garbage collection strategy should be employed. See :ref:`Tombstones GC options <ddl-tombstones-gc>`.

View File

@@ -47,7 +47,6 @@
namespace encryption {
static auto constexpr KSNAME = "system_replicated_keys";
static auto constexpr TABLENAME = "encrypted_keys";
static logger log("replicated_key_provider");
@@ -182,7 +181,7 @@ future<::shared_ptr<cql3::untyped_result_set>> replicated_key_provider::query(ss
future<> replicated_key_provider::force_blocking_flush() {
return _ctxt.get_database().invoke_on_all([](replica::database& db) {
// if (!Boolean.getBoolean("cassandra.unsafesystem"))
replica::column_family& cf = db.find_column_family(KSNAME, TABLENAME);
replica::column_family& cf = db.find_column_family(replicated_key_provider_factory::KSNAME, TABLENAME);
return cf.flush();
});
}
@@ -306,7 +305,7 @@ future<std::tuple<UUID, key_ptr>> replicated_key_provider::get_key(const key_inf
if (id.id) {
uuid = utils::UUID_gen::get_UUID(*id.id);
log.debug("Finding key {} ({})", uuid, info);
auto s = fmt::format("SELECT * FROM {}.{} WHERE key_file=? AND cipher=? AND strength=? AND key_id=?;", KSNAME, TABLENAME);
auto s = fmt::format("SELECT * FROM {}.{} WHERE key_file=? AND cipher=? AND strength=? AND key_id=?;", replicated_key_provider_factory::KSNAME, TABLENAME);
res = co_await query(std::move(s), _system_key->name(), cipher, int32_t(id.info.len), uuid);
// if we find nothing, and we actually queried a specific key (by uuid), we've failed.
@@ -316,7 +315,7 @@ future<std::tuple<UUID, key_ptr>> replicated_key_provider::get_key(const key_inf
}
} else {
log.debug("Finding key ({})", info);
auto s = fmt::format("SELECT * FROM {}.{} WHERE key_file=? AND cipher=? AND strength=? LIMIT 1;", KSNAME, TABLENAME);
auto s = fmt::format("SELECT * FROM {}.{} WHERE key_file=? AND cipher=? AND strength=? LIMIT 1;", replicated_key_provider_factory::KSNAME, TABLENAME);
res = co_await query(std::move(s), _system_key->name(), cipher, int32_t(id.info.len));
}
@@ -333,7 +332,7 @@ future<std::tuple<UUID, key_ptr>> replicated_key_provider::get_key(const key_inf
auto ks = base64_encode(b);
log.trace("Inserting generated key {}", uuid);
co_await query(fmt::format("INSERT INTO {}.{} (key_file, cipher, strength, key_id, key) VALUES (?, ?, ?, ?, ?)",
KSNAME, TABLENAME), _system_key->name(), cipher, int32_t(id.info.len), uuid, ks
replicated_key_provider_factory::KSNAME, TABLENAME), _system_key->name(), cipher, int32_t(id.info.len), uuid, ks
);
log.trace("Flushing key table");
co_await force_blocking_flush();
@@ -366,8 +365,8 @@ future<> replicated_key_provider::validate() const {
schema_ptr encrypted_keys_table() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(KSNAME, TABLENAME);
return schema_builder(KSNAME, TABLENAME, std::make_optional(id))
auto id = generate_legacy_id(replicated_key_provider_factory::KSNAME, TABLENAME);
return schema_builder(replicated_key_provider_factory::KSNAME, TABLENAME, std::make_optional(id))
.with_column("key_file", utf8_type, column_kind::partition_key)
.with_column("cipher", utf8_type, column_kind::partition_key)
.with_column("strength", int32_type, column_kind::clustering_key)
@@ -387,23 +386,23 @@ future<> replicated_key_provider::maybe_initialize_tables() {
}
future<> replicated_key_provider::do_initialize_tables(::replica::database& db, service::migration_manager& mm) {
if (db.has_schema(KSNAME, TABLENAME)) {
if (db.has_schema(replicated_key_provider_factory::KSNAME, TABLENAME)) {
co_return;
}
log.debug("Creating keyspace and table");
if (!db.has_keyspace(KSNAME)) {
if (!db.has_keyspace(replicated_key_provider_factory::KSNAME)) {
auto group0_guard = co_await mm.start_group0_operation();
auto ts = group0_guard.write_timestamp();
try {
auto ksm = keyspace_metadata::new_keyspace(
KSNAME,
replicated_key_provider_factory::KSNAME,
"org.apache.cassandra.locator.EverywhereStrategy",
{},
std::nullopt,
std::nullopt,
true);
co_await mm.announce(service::prepare_new_keyspace_announcement(db, ksm, ts), std::move(group0_guard), fmt::format("encryption at rest: create keyspace {}", KSNAME));
co_await mm.announce(service::prepare_new_keyspace_announcement(db, ksm, ts), std::move(group0_guard), fmt::format("encryption at rest: create keyspace {}", replicated_key_provider_factory::KSNAME));
} catch (exceptions::already_exists_exception&) {
}
}
@@ -411,10 +410,10 @@ future<> replicated_key_provider::do_initialize_tables(::replica::database& db,
auto ts = group0_guard.write_timestamp();
try {
co_await mm.announce(co_await service::prepare_new_column_family_announcement(mm.get_storage_proxy(), encrypted_keys_table(), ts), std::move(group0_guard),
fmt::format("encryption at rest: create table {}.{}", KSNAME, TABLENAME));
fmt::format("encryption at rest: create table {}.{}", replicated_key_provider_factory::KSNAME, TABLENAME));
} catch (exceptions::already_exists_exception&) {
}
auto& ks = db.find_keyspace(KSNAME);
auto& ks = db.find_keyspace(replicated_key_provider_factory::KSNAME);
auto& rs = ks.get_replication_strategy();
// should perhaps check name also..
if (rs.get_type() != locator::replication_strategy_type::everywhere_topology) {

View File

@@ -27,6 +27,8 @@ namespace encryption {
class replicated_key_provider_factory : public key_provider_factory {
public:
static constexpr const char* KSNAME = "system_replicated_keys";
replicated_key_provider_factory();
~replicated_key_provider_factory();

View File

@@ -172,6 +172,7 @@ public:
gms::feature topology_global_request_queue { *this, "TOPOLOGY_GLOBAL_REQUEST_QUEUE"sv };
gms::feature lwt_with_tablets { *this, "LWT_WITH_TABLETS"sv };
gms::feature repair_msg_split { *this, "REPAIR_MSG_SPLIT"sv };
gms::feature parallel_tablet_draining { *this, "PARALLEL_TABLET_DRAINING"sv };
gms::feature view_building_coordinator { *this, "VIEW_BUILDING_COORDINATOR"sv };
gms::feature ms_sstable { *this, "MS_SSTABLE_FORMAT"sv };
gms::feature rack_list_rf { *this, "RACK_LIST_RF"sv };

View File

@@ -148,6 +148,9 @@ public:
inet_address get_broadcast_address() const noexcept {
return get_token_metadata_ptr()->get_topology().my_address();
}
inet_address get_cql_address() const noexcept {
return get_token_metadata_ptr()->get_topology().my_cql_address();
}
const std::set<inet_address>& get_seeds() const noexcept;
public:

View File

@@ -54,6 +54,7 @@ set(idl_headers
sstables.idl.hh
storage_proxy.idl.hh
storage_service.idl.hh
strong_consistency/state_machine.idl.hh
group0_state_machine.idl.hh
mapreduce_request.idl.hh
replica_exception.idl.hh

View File

@@ -13,6 +13,7 @@
#include "idl/token.idl.hh"
#include "idl/keys.idl.hh"
#include "idl/uuid.idl.hh"
#include "idl/position_in_partition.idl.hh"
namespace db {
enum class read_repair_decision : uint8_t {
@@ -22,19 +23,6 @@ enum class read_repair_decision : uint8_t {
};
}
enum class bound_weight : int8_t {
before_all_prefixed = -1,
equal = 0,
after_all_prefixed = 1,
}
enum class partition_region : uint8_t {
partition_start,
static_row,
clustered,
partition_end,
};
namespace service {
namespace pager {
class paging_state {

View File

@@ -0,0 +1,20 @@
/*
* Copyright 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "idl/frozen_mutation.idl.hh"
#include "idl/uuid.idl.hh"
namespace service {
namespace strong_consistency {
struct raft_command {
frozen_mutation mutation;
};
} // namespace strong_consistency
} // namespace service

View File

@@ -227,7 +227,7 @@ public:
_db,
s,
std::move(permit),
_db.find_column_family(s->ks_name(), system_keyspace::v3::BUILT_VIEWS),
_db.find_column_family(s->ks_name(), system_keyspace::BUILT_VIEWS),
range,
slice,
std::move(trace_state),

View File

@@ -19,43 +19,114 @@
#include "types/concrete_types.hh"
#include "utils/managed_string.hh"
#include <seastar/core/sstring.hh>
#include <boost/algorithm/string.hpp>
namespace secondary_index {
template <int MAX>
static void validate_positive_option(const sstring& value) {
static void validate_positive_option(int max, const sstring& value_name, const sstring& value) {
int num_value;
size_t len;
try {
num_value = std::stoi(value, &len);
} catch (...) {
throw exceptions::invalid_request_exception(format("Numeric option {} is not a valid number", value));
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is not an integer", value_name, value));
}
if (len != value.size()) {
throw exceptions::invalid_request_exception(format("Numeric option {} is not a valid number", value));
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is not an integer", value_name, value));
}
if (num_value <= 0 || num_value > MAX) {
throw exceptions::invalid_request_exception(format("Numeric option {} out of valid range [1 - {}]", value, MAX));
if (num_value <= 0 || num_value > max) {
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is out of valid range [1 - {}]", value_name, value, max));
}
}
static void validate_similarity_function(const sstring& value) {
sstring similarity_function = value;
std::transform(similarity_function.begin(), similarity_function.end(), similarity_function.begin(), ::tolower);
if (similarity_function != "cosine" && similarity_function != "euclidean" && similarity_function != "dot_product") {
throw exceptions::invalid_request_exception(format("Unsupported similarity function: {}", value));
static void validate_factor_option(float min, float max, const sstring& value_name, const sstring& value) {
float num_value;
size_t len;
try {
num_value = std::stof(value, &len);
} catch (...) {
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is not a float", value_name, value));
}
if (len != value.size()) {
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is not a float", value_name, value));
}
if (!(num_value >= min && num_value <= max)) {
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is out of valid range [{} - {}]", value_name, value, min, max));
}
}
const static std::unordered_map<sstring, std::function<void(const sstring&)>> vector_index_options = {
{"similarity_function", validate_similarity_function},
{"maximum_node_connections", validate_positive_option<512>},
{"construction_beam_width", validate_positive_option<4096>},
{"search_beam_width", validate_positive_option<4096>},
static void validate_enumerated_option(const std::vector<sstring>& supported_values, const sstring& value_name, const sstring& value) {
bool is_valid = std::any_of(supported_values.begin(), supported_values.end(),
[&](const std::string& func) { return boost::iequals(value, func); });
if (!is_valid) {
throw exceptions::invalid_request_exception(
seastar::format("Invalid value in option '{}' for vector index: '{}'. Supported are case-insensitive: {}",
value_name,
value,
fmt::join(supported_values, ", ")));
}
}
static const std::vector<sstring> similarity_function_values = {
"cosine", "euclidean", "dot_product"
};
static const std::vector<sstring> quantization_values = {
"f32", "f16", "bf16", "i8", "b1"
};
static const std::vector<sstring> boolean_values = {
"false", "true"
};
const static std::unordered_map<sstring, std::function<void(const sstring&, const sstring&)>> vector_index_options = {
// `similarity_function` defines method of calculating similarity between vectors
// Used internally by vector store during both indexing and querying
// CQL implements corresponding functions in cql3/functions/similarity_functions.hh
{"similarity_function", std::bind_front(validate_enumerated_option, similarity_function_values)},
// 'maximum_node_connections', 'construction_beam_width', 'search_beam_width' define HNSW index parameters
// Used internally by vector store.
{"maximum_node_connections", std::bind_front(validate_positive_option, 512)},
{"construction_beam_width", std::bind_front(validate_positive_option, 4096)},
{"search_beam_width", std::bind_front(validate_positive_option, 4096)},
// 'quantization' enables compression of vectors in vector store (not in base table!)
// Used internally by vector store. Scylla only checks it to enable rescoring.
{"quantization", std::bind_front(validate_enumerated_option, quantization_values)},
// 'oversampling' defines factor by which number of candidates retrieved from vector store is multiplied.
// It can improve accuracy of ANN queries, especially for quantized vectors when combined with rescoring.
// Used by Scylla during query processing to increase query limit sent to vector store.
{"oversampling", std::bind_front(validate_factor_option, 1.0f, 100.0f)},
// 'rescoring' enables recalculating of similarity scores of candidates retrieved from vector store when quantization is used.
{"rescoring", std::bind_front(validate_enumerated_option, boolean_values)},
};
bool vector_index::is_rescoring_enabled(const index_options_map& properties) {
auto q = properties.find("quantization");
auto r = properties.find("rescoring");
return q != properties.end() && !boost::iequals(q->second, "f32")
&& r != properties.end() && boost::iequals(r->second, "true");
}
float vector_index::get_oversampling(const index_options_map& properties) {
auto it = properties.find("oversampling");
if (it != properties.end()) {
return std::stof(it->second);
}
return 1.0f;
}
sstring vector_index::get_cql_similarity_function_name(const index_options_map& properties) {
auto it = properties.find("similarity_function");
if (it != properties.end()) {
return "similarity_" + boost::to_lower_copy(it->second);
}
return "similarity_cosine";
}
bool vector_index::view_should_exist() const {
return false;
}
@@ -139,7 +210,7 @@ void vector_index::check_index_options(const cql3::statements::index_specific_pr
if (it == vector_index_options.end()) {
throw exceptions::invalid_request_exception(format("Unsupported option {} for vector index", option.first));
}
it->second(option.second);
it->second(option.first, option.second);
}
}

View File

@@ -35,6 +35,10 @@ public:
static bool has_vector_index(const schema& s);
static bool has_vector_index_on_column(const schema& s, const sstring& target_name);
static void check_cdc_options(const schema& schema);
static bool is_rescoring_enabled(const index_options_map& properties);
static float get_oversampling(const index_options_map& properties);
static sstring get_cql_similarity_function_name(const index_options_map& properties);
private:
void check_uses_tablets(const schema& schema, const data_dictionary::database& db) const;
void check_cdc_not_explicitly_disabled(const schema& schema) const;

View File

@@ -20,6 +20,7 @@
#include <boost/icl/interval.hpp>
#include <boost/icl/interval_map.hpp>
#include <variant>
namespace locator {
@@ -178,6 +179,15 @@ size_t get_replication_factor(const replication_strategy_config_option& opt) {
return replication_factor_data(opt).count();
}
bool uses_rack_list_exclusively(const replication_strategy_config_options& opts) {
for (auto& [_, val] : opts) {
if (!std::holds_alternative<rack_list>(val)) {
return false;
}
}
return true;
}
replication_factor_data::replication_factor_data(const replication_strategy_config_option& rf) {
std::visit(overloaded_functor {
[&] (const sstring& rf) {

View File

@@ -52,6 +52,8 @@ using replication_strategy_config_options = std::map<sstring, replication_strate
// Throws configuration_exception when option is not a valid replication factor specifier.
size_t get_replication_factor(const replication_strategy_config_option&);
bool uses_rack_list_exclusively(const replication_strategy_config_options&);
struct replication_strategy_params {
const replication_strategy_config_options options;
std::optional<unsigned> initial_tablets;

View File

@@ -311,7 +311,8 @@ future<tablet_map> network_topology_strategy::allocate_tablets_for_new_table(sch
rslogger.info("Rounding up tablet count from {} to {} for table {}.{}", tablet_count, aligned_tablet_count, s->ks_name(), s->cf_name());
tablet_count = aligned_tablet_count;
}
co_return co_await reallocate_tablets(std::move(s), std::move(tm), tablet_map(tablet_count));
co_return co_await reallocate_tablets(std::move(s), std::move(tm),
tablet_map(tablet_count, get_consistency() != data_dictionary::consistency_config_option::eventual));
}
future<tablet_map> network_topology_strategy::reallocate_tablets(schema_ptr s, token_metadata_ptr tm, tablet_map tablets) const {
@@ -325,6 +326,16 @@ future<tablet_map> network_topology_strategy::reallocate_tablets(schema_ptr s, t
for (tablet_id tb : tablets.tablet_ids()) {
auto tinfo = tablets.get_tablet_info(tb);
tinfo.replicas = co_await reallocate_tablets(s, tm, load, tablets, tb);
if (tablets.has_raft_info()) {
for (auto& r: tinfo.replicas) {
r.shard = 0;
}
if (!tablets.get_tablet_raft_info(tb).group_id) {
tablets.set_tablet_raft_info(tb, tablet_raft_info {
.group_id = raft::group_id{utils::UUID_gen::get_time_UUID()}
});
}
}
tablets.set_tablet(tb, std::move(tinfo));
}
@@ -497,7 +508,7 @@ future<tablet_replica_set> network_topology_strategy::add_tablets_in_dc(schema_p
candidates_list& rack_list = existing.empty() ? new_racks : existing_racks;
auto& candidate = rack_list.emplace_back(rack);
for (const auto& node : nodes) {
if (!node.get().is_normal()) {
if (!node.get().is_normal() || node.get().is_draining()) {
continue;
}
const auto& host_id = node.get().host_id();

View File

@@ -26,6 +26,7 @@ class tablet_aware_replication_strategy : public per_table_replication_strategy
private:
size_t _initial_tablets = 0;
db::tablet_options _tablet_options;
data_dictionary::consistency_config_option _consistency;
protected:
void validate_tablet_options(const abstract_replication_strategy&, const gms::feature_service&, const replication_strategy_config_options&) const;
void process_tablet_options(abstract_replication_strategy&, replication_strategy_config_options&, replication_strategy_params);
@@ -37,6 +38,8 @@ protected:
public:
size_t get_initial_tablets() const { return _initial_tablets; }
data_dictionary::consistency_config_option get_consistency() const { return _consistency; }
/// Generates tablet_map for a new table.
/// Runs under group0 guard.
virtual future<tablet_map> allocate_tablets_for_new_table(schema_ptr, token_metadata_ptr, size_t tablet_count) const = 0;

View File

@@ -470,16 +470,20 @@ bool tablet_metadata::operator==(const tablet_metadata& o) const {
return true;
}
tablet_map::tablet_map(size_t tablet_count)
tablet_map::tablet_map(size_t tablet_count, bool with_raft_info)
: _log2_tablets(log2ceil(tablet_count)) {
if (tablet_count != 1ul << _log2_tablets) {
on_internal_error(tablet_logger, format("Tablet count not a power of 2: {}", tablet_count));
}
_tablets.resize(tablet_count);
if (with_raft_info) {
_raft_info.resize(tablet_count);
}
}
tablet_map tablet_map::clone() const {
return tablet_map(_tablets, _log2_tablets, _transitions, _resize_decision, _resize_task_info, _repair_scheduler_config);
return tablet_map(_tablets, _log2_tablets, _transitions, _resize_decision, _resize_task_info,
_repair_scheduler_config, _raft_info);
}
future<tablet_map> tablet_map::clone_gently() const {
@@ -497,7 +501,15 @@ future<tablet_map> tablet_map::clone_gently() const {
co_await coroutine::maybe_yield();
}
co_return tablet_map(std::move(tablets), _log2_tablets, std::move(transitions), _resize_decision, _resize_task_info, _repair_scheduler_config);
raft_info_container raft_info;
raft_info.reserve(_raft_info.size());
for (const auto& i: _raft_info) {
raft_info.emplace_back(i);
co_await coroutine::maybe_yield();
}
co_return tablet_map(std::move(tablets), _log2_tablets, std::move(transitions), _resize_decision,
_resize_task_info, _repair_scheduler_config, std::move(raft_info));
}
void tablet_map::check_tablet_id(tablet_id id) const {
@@ -707,6 +719,24 @@ const tablet_transition_info* tablet_map::get_tablet_transition_info(tablet_id i
return &i->second;
}
const tablet_raft_info& tablet_map::get_tablet_raft_info(tablet_id id) const {
check_tablet_id(id);
if (_raft_info.empty()) {
on_internal_error(tablet_logger, "Tablet map doesn't have raft info");
}
return _raft_info[size_t(id)];
}
void tablet_map::set_tablet_raft_info(tablet_id id, tablet_raft_info raft_info) {
check_tablet_id(id);
if (_raft_info.empty()) {
on_internal_error(tablet_logger,
format("Tablet map has no raft info, tablet_id {}, group_id {}",
id, raft_info.group_id));
}
_raft_info[size_t(id)] = std::move(raft_info);
}
// The names are persisted in system tables so should not be changed.
static const std::unordered_map<tablet_transition_stage, sstring> tablet_transition_stage_to_name = {
{tablet_transition_stage::allow_write_both_read_old, "allow_write_both_read_old"},
@@ -1416,6 +1446,7 @@ void tablet_aware_replication_strategy::process_tablet_options(abstract_replicat
replication_strategy_params params) {
if (ars._uses_tablets) {
_initial_tablets = params.initial_tablets.value_or(0);
_consistency = params.consistency.value_or(data_dictionary::consistency_config_option::eventual);
mark_as_per_table(ars);
}
}
@@ -1580,15 +1611,25 @@ static void assert_rf_rack_valid_keyspace(std::string_view ks, const token_metad
tablet_logger.debug("[assert_rf_rack_valid_keyspace]: Keyspace '{}' has been verified to be RF-rack-valid", ks);
}
static bool is_normal_token_owner(const token_metadata& tm, locator::host_id host) {
auto& node = tm.get_topology().get_node(host);
return tm.is_normal_token_owner(host) && !tm.is_leaving(host) && !node.is_draining();
}
static bool is_transitioning_token_owner(const token_metadata& tm, locator::host_id host) {
auto& node = tm.get_topology().get_node(host);
return tm.get_topology().get_node(host).is_bootstrapping() ||
(tm.is_normal_token_owner(host) && (tm.is_leaving(host) || node.is_draining()));
}
void assert_rf_rack_valid_keyspace(std::string_view ks, const token_metadata_ptr tmptr, const abstract_replication_strategy& ars) {
assert_rf_rack_valid_keyspace(ks, tmptr, ars,
tmptr->get_topology().get_datacenter_racks(),
[&tmptr] (host_id host) {
return tmptr->is_normal_token_owner(host) && !tmptr->is_leaving(host);
return is_normal_token_owner(*tmptr, host);
},
[&tmptr] (host_id host) {
return tmptr->get_topology().get_node(host).is_bootstrapping() ||
(tmptr->is_normal_token_owner(host) && tmptr->is_leaving(host));
return is_transitioning_token_owner(*tmptr, host);
}
);
}
@@ -1625,14 +1666,13 @@ void assert_rf_rack_valid_keyspace(std::string_view ks, const token_metadata_ptr
if (op.tag == rf_rack_topology_operation::type::add && host == op.node_id) {
return true;
}
return tmptr->is_normal_token_owner(host) && !tmptr->is_leaving(host);
return is_normal_token_owner(*tmptr, host);
},
[&tmptr, &op] (host_id host) {
if (op.tag == rf_rack_topology_operation::type::add && host == op.node_id) {
return false;
}
return tmptr->get_topology().get_node(host).is_bootstrapping() ||
(tmptr->is_normal_token_owner(host) && tmptr->is_leaving(host));
return is_transitioning_token_owner(*tmptr, host);
}
);
}
@@ -1642,7 +1682,7 @@ rack_list get_allowed_racks(const locator::token_metadata& tm, const sstring& dc
auto normal_nodes = [&] (const sstring& rack) {
int count = 0;
for (auto n : topo.get_datacenter_rack_nodes().at(dc).at(rack)) {
count += int(n.get().is_normal());
count += int(n.get().is_normal() && !n.get().is_draining());
}
return count;
};
@@ -1715,6 +1755,12 @@ auto fmt::formatter<locator::tablet_map>::format(const locator::tablet_map& r, f
out = fmt::format_to(out, ", session={}", tr->session_id);
}
}
if (r.has_raft_info()) {
const auto& raft_info = r.get_tablet_raft_info(tid);
if (raft_info.group_id) {
out = fmt::format_to(out, ", group_id={}", raft_info.group_id);
}
}
first = false;
tid = *r.next_tablet(tid);
}

View File

@@ -21,6 +21,7 @@
#include "utils/chunked_vector.hh"
#include "utils/hash.hh"
#include "utils/UUID.hh"
#include "raft/raft.hh"
#include <ranges>
#include <seastar/core/reactor.hh>
@@ -200,10 +201,7 @@ enum class tablet_repair_incremental_mode : uint8_t {
disabled,
};
// FIXME: Incremental repair is disabled by default due to
// https://github.com/scylladb/scylladb/issues/26041 and
// https://github.com/scylladb/scylladb/issues/27414
constexpr tablet_repair_incremental_mode default_tablet_repair_incremental_mode{tablet_repair_incremental_mode::disabled};
constexpr tablet_repair_incremental_mode default_tablet_repair_incremental_mode{tablet_repair_incremental_mode::incremental};
sstring tablet_repair_incremental_mode_to_string(tablet_repair_incremental_mode);
tablet_repair_incremental_mode tablet_repair_incremental_mode_from_string(const sstring&);
@@ -250,6 +248,13 @@ struct tablet_info {
bool operator==(const tablet_info&) const = default;
};
/// Reft-related information for strongly-consistent tablets.
struct tablet_raft_info {
raft::group_id group_id;
bool operator==(const tablet_raft_info&) const = default;
};
// Merges tablet_info b into a, but with following constraints:
// - they cannot have active repair task, since each task has a different id
// - their replicas must be all co-located.
@@ -552,6 +557,7 @@ public:
class tablet_map {
public:
using tablet_container = utils::chunked_vector<tablet_info>;
using raft_info_container = utils::chunked_vector<tablet_raft_info>;
private:
using transitions_map = std::unordered_map<tablet_id, tablet_transition_info>;
// The implementation assumes that _tablets.size() is a power of 2:
@@ -564,16 +570,20 @@ private:
resize_decision _resize_decision;
tablet_task_info _resize_task_info;
std::optional<repair_scheduler_config> _repair_scheduler_config;
raft_info_container _raft_info;
// Internal constructor, used by clone() and clone_gently().
tablet_map(tablet_container tablets, size_t log2_tablets, transitions_map transitions,
resize_decision resize_decision, tablet_task_info resize_task_info, std::optional<repair_scheduler_config> repair_scheduler_config)
resize_decision resize_decision, tablet_task_info resize_task_info,
std::optional<repair_scheduler_config> repair_scheduler_config,
raft_info_container raft_info)
: _tablets(std::move(tablets))
, _log2_tablets(log2_tablets)
, _transitions(std::move(transitions))
, _resize_decision(resize_decision)
, _resize_task_info(std::move(resize_task_info))
, _repair_scheduler_config(std::move(repair_scheduler_config))
, _raft_info(std::move(raft_info))
{}
/// Returns the largest token owned by tablet_id when the tablet_count is `1 << log2_tablets`.
@@ -586,7 +596,7 @@ public:
/// Constructs a tablet map.
///
/// \param tablet_count The desired tablets to allocate. Must be a power of two.
explicit tablet_map(size_t tablet_count);
explicit tablet_map(size_t tablet_count, bool with_raft_info = false);
tablet_map(tablet_map&&) = default;
tablet_map(const tablet_map&) = delete;
@@ -612,6 +622,16 @@ public:
/// \throws std::logic_error If the given id does not belong to this instance.
const tablet_transition_info* get_tablet_transition_info(tablet_id) const;
/// Returns true for strongly-consistent tablets.
/// Use get_tablet_raft_info() to retrieve Raft info for a specific tablet_id.
bool has_raft_info() const {
return !_raft_info.empty();
}
/// Returns Raft information for the given tablet_id.
/// It is an internal error to call this method if has_raft_info() returns false.
const tablet_raft_info& get_tablet_raft_info(tablet_id) const;
/// Returns the largest token owned by a given tablet.
/// \throws std::logic_error If the given id does not belong to this instance.
dht::token get_last_token(tablet_id id) const;
@@ -721,6 +741,7 @@ public:
void set_repair_scheduler_config(std::optional<locator::repair_scheduler_config> config);
void clear_tablet_transition_info(tablet_id);
void clear_transitions();
void set_tablet_raft_info(tablet_id, tablet_raft_info);
// Destroys gently.
// The tablet map is not usable after this call and should be destroyed.
@@ -876,7 +897,7 @@ class abstract_replication_strategy;
/// * if the keyspace is RF-rack-valid, no side effect,
/// * if the keyspace is RF-rack-invalid, an exception will be thrown. It will contain information about the reason
/// why the keyspace is RF-rack-invalid and will be worded in a way that can be returned directly to the user.
/// There are NO guarantees about the type of the exception.
/// The exception type is std::invalid_argument.
///
/// Preconditions:
/// * Every DC that takes part in replication according to the passed replication strategy MUST be known

View File

@@ -55,23 +55,24 @@ thread_local const endpoint_dc_rack endpoint_dc_rack::default_location = {
.rack = locator::production_snitch_base::default_rack,
};
node::node(const locator::topology* topology, locator::host_id id, endpoint_dc_rack dc_rack, state state, shard_id shard_count, bool excluded, this_node is_this_node, node::idx_type idx)
node::node(const locator::topology* topology, locator::host_id id, endpoint_dc_rack dc_rack, state state, shard_id shard_count, bool excluded, this_node is_this_node, node::idx_type idx, bool draining)
: _topology(topology)
, _host_id(id)
, _dc_rack(std::move(dc_rack))
, _state(state)
, _shard_count(std::move(shard_count))
, _excluded(excluded)
, _draining(draining)
, _is_this_node(is_this_node)
, _idx(idx)
{}
node_holder node::make(const locator::topology* topology, locator::host_id id, endpoint_dc_rack dc_rack, state state, shard_id shard_count, bool excluded, node::this_node is_this_node, node::idx_type idx) {
return std::make_unique<node>(topology, std::move(id), std::move(dc_rack), std::move(state), shard_count, excluded, is_this_node, idx);
node_holder node::make(const locator::topology* topology, locator::host_id id, endpoint_dc_rack dc_rack, state state, shard_id shard_count, bool excluded, node::this_node is_this_node, node::idx_type idx, bool draining) {
return std::make_unique<node>(topology, std::move(id), std::move(dc_rack), std::move(state), shard_count, excluded, is_this_node, idx, draining);
}
node_holder node::clone() const {
return make(nullptr, host_id(), dc_rack(), get_state(), get_shard_count(), is_excluded(), is_this_node());
return make(nullptr, host_id(), dc_rack(), get_state(), get_shard_count(), is_excluded(), is_this_node(), -1, is_draining());
}
std::string node::to_string(node::state s) {

View File

@@ -65,6 +65,7 @@ private:
state _state;
shard_id _shard_count = 0;
bool _excluded = false;
bool _draining = false;
// Is this node the `localhost` instance
this_node _is_this_node;
@@ -78,7 +79,8 @@ public:
shard_id shard_count = 0,
bool excluded = false,
this_node is_this_node = this_node::no,
idx_type idx = -1);
idx_type idx = -1,
bool draining = false);
node(const node&) = delete;
node(node&&) = delete;
@@ -143,6 +145,16 @@ public:
_excluded = excluded;
}
// Indicates that the tablet scheduler should move tablet replicas away
// from this node because it's undergoing decommission or removenode operation.
bool is_draining() const {
return _draining;
}
void set_draining(bool draining) {
_draining = draining;
}
bool is_leaving() const noexcept {
switch (_state) {
case state::being_decommissioned:
@@ -178,7 +190,8 @@ private:
shard_id shard_count = 0,
bool excluded = false,
node::this_node is_this_node = this_node::no,
idx_type idx = -1);
idx_type idx = -1,
bool draining = false);
node_holder clone() const;
void set_topology(const locator::topology* topology) noexcept { _topology = topology; }
@@ -492,7 +505,7 @@ struct fmt::formatter<locator::node> : fmt::formatter<string_view> {
if (!verbose) {
return fmt::format_to(ctx.out(), "{}", node.host_id());
} else {
return fmt::format_to(ctx.out(), " idx={} host_id={} dc={} rack={} state={} shards={} excluded={} this_node={}",
return fmt::format_to(ctx.out(), " idx={} host_id={} dc={} rack={} state={} shards={} excluded={} draining={} this_node={}",
node.idx(),
node.host_id(),
node.dc_rack().dc,
@@ -500,6 +513,7 @@ struct fmt::formatter<locator::node> : fmt::formatter<string_view> {
locator::node::to_string(node.get_state()),
node.get_shard_count(),
node.is_excluded(),
node.is_draining(),
bool(node.is_this_node()));
}
}

123
main.cc
View File

@@ -109,6 +109,8 @@
#include "sstables/sstables_manager.hh"
#include "db/virtual_tables.hh"
#include "service/strong_consistency/groups_manager.hh"
#include "service/strong_consistency/coordinator.hh"
#include "service/raft/raft_group_registry.hh"
#include "service/raft/raft_group0_client.hh"
#include "service/raft/raft_group0.hh"
@@ -123,6 +125,7 @@
#include "auth/cache.hh"
#include "utils/labels.hh"
#include "tools/utils.hh"
#include "schema/compression_initializer.hh"
namespace fs = std::filesystem;
@@ -812,6 +815,17 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
notify_set.notify_all(configurable::system_state::stopped).get();
});
// Register schema initializer for default compression settings.
// This statement is deliberately positioned here:
// * Needs to be done before starting services that use `schema_builder`s
// because registration is not thread-safe (initializer vector is
// not thread-local) and every `schema_builder` must use it.
// * Needs to be done after the initialization of the configurables
// because this particular initializer depends on `db::extensions::is_extension_internal_keyspace()`.
register_compression_initializer(*cfg, [&feature_service] {
return bool(feature_service.local().sstable_compression_dicts);
});
cfg->setup_directories();
// We're writing to a non-atomic variable here. But bool writes are atomic
@@ -869,6 +883,11 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
startlog.warn("Ignoring unused features found in config: {}", unused_features);
}
// Enable Raft-related columns in system.tablets if the experimental feature is enabled.
replica::set_strongly_consistent_tables_enabled(cfg->check_experimental(
db::experimental_features_t::feature::STRONGLY_CONSISTENT_TABLES
));
gms::feature_config fcfg;
fcfg.disabled_features = get_disabled_features_from_db_config(*cfg);
@@ -1119,12 +1138,14 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
});
#endif
seastar::scheduling_supergroup user_ssg = create_scheduling_supergroup(1000).get();
// Note: changed from using a move here, because we want the config object intact.
replica::database_config dbcfg;
dbcfg.compaction_scheduling_group = create_scheduling_group("compaction", "comp", 1000).get();
dbcfg.memory_compaction_scheduling_group = create_scheduling_group("mem_compaction", "mcmp", 1000).get();
dbcfg.streaming_scheduling_group = maintenance_scheduling_group;
dbcfg.statement_scheduling_group = create_scheduling_group("statement", "stmt", 1000).get();
dbcfg.statement_scheduling_group = create_scheduling_group("statement", "stmt", 1000, user_ssg).get();
dbcfg.memtable_scheduling_group = create_scheduling_group("memtable", "mt", 1000).get();
dbcfg.memtable_to_cache_scheduling_group = create_scheduling_group("memtable_to_cache", "mt2c", 200).get();
dbcfg.gossip_scheduling_group = create_scheduling_group("gossip", "gms", 1000).get();
@@ -1212,7 +1233,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
checkpoint(stop_signal, "starting service level controller");
qos::service_level_options default_service_level_configuration;
default_service_level_configuration.shares = 1000;
sl_controller.start(std::ref(auth_service), std::ref(token_metadata), std::ref(stop_signal.as_sharded_abort_source()), default_service_level_configuration, dbcfg.statement_scheduling_group).get();
sl_controller.start(std::ref(auth_service), std::ref(token_metadata), std::ref(stop_signal.as_sharded_abort_source()), default_service_level_configuration, user_ssg, dbcfg.statement_scheduling_group).get();
sl_controller.invoke_on_all(&qos::service_level_controller::start).get();
auto stop_sl_controller = defer_verbose_shutdown("service level controller", [] {
sl_controller.stop().get();
@@ -1802,6 +1823,21 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
client_routes.stop().get();
});
checkpoint(stop_signal, "initializing strongly consistent groups manager");
sharded<service::strong_consistency::groups_manager> groups_manager;
groups_manager.start(std::ref(messaging), std::ref(raft_gr), std::ref(qp),
std::ref(db), std::ref(feature_service)).get();
auto stop_groups_manager = defer_verbose_shutdown("strongly consistent groups manager", [&] {
groups_manager.stop().get();
});
checkpoint(stop_signal, "initializing strongly consistent coordinator");
sharded<service::strong_consistency::coordinator> sc_coordinator;
sc_coordinator.start(std::ref(groups_manager), std::ref(db)).get();
auto stop_sc_coordinator = defer_verbose_shutdown("strongly consistent coordinator", [&] {
sc_coordinator.stop().get();
});
checkpoint(stop_signal, "initializing storage service");
debug::the_storage_service = &ss;
ss.start(std::ref(stop_signal.as_sharded_abort_source()),
@@ -1813,7 +1849,8 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
std::ref(auth_cache), std::ref(client_routes),
std::ref(tsm), std::ref(vbsm), std::ref(task_manager), std::ref(gossip_address_map),
compression_dict_updated_callback,
only_on_shard0(&*disk_space_monitor_shard0)
only_on_shard0(&*disk_space_monitor_shard0),
std::ref(groups_manager)
).get();
ss.local().set_train_dict_callback([&rpc_dict_training_worker] (std::vector<std::vector<std::byte>> sample) {
@@ -1829,7 +1866,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
checkpoint(stop_signal, "initializing query processor remote part");
// TODO: do this together with proxy.start_remote(...)
qp.invoke_on_all(&cql3::query_processor::start_remote, std::ref(mm), std::ref(mapreduce_service),
std::ref(ss), std::ref(group0_client)).get();
std::ref(ss), std::ref(group0_client), std::ref(sc_coordinator)).get();
auto stop_qp_remote = defer_verbose_shutdown("query processor remote part", [&qp] {
qp.invoke_on_all(&cql3::query_processor::stop_remote).get();
});
@@ -2178,6 +2215,17 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
// This will also disable migration manager schema pulls if needed.
group0_service.setup_group0_if_exist(sys_ks.local(), ss.local(), qp.local(), mm.local()).get();
// The call to setup_group0_if_exists() above guarantees that, if group0 is
// created and started, the locally persisted group0 state has been applied
// before it returns. As a result, tablet Raft groups are started using
// tablet metadata that is at least as recent as the locally persisted version.
// groups_manager::start() waits for all these Raft groups to start, so when
// we begin RPC messaging below, the system is ready to accept proxied requests
// from other replicas.
groups_manager.invoke_on_all([](service::strong_consistency::groups_manager& m) {
return m.start();
}).get();
api::set_server_storage_service(ctx, ss, group0_client).get();
auto stop_ss_api = defer_verbose_shutdown("storage service API", [&ctx] {
api::unset_server_storage_service(ctx).get();
@@ -2186,12 +2234,15 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
with_scheduling_group(maintenance_scheduling_group, [&] {
return messaging.invoke_on_all([&] (auto& ms) {
return ms.start_listen(token_metadata.local(), [&gossiper] (gms::inet_address ip) {
if (ip == gossiper.local().get_broadcast_address()) {
// #27429. When running with broadcast_address != rpc_address, topology gets
// confused if we can't resolve rpc/cql address as self.
if (ip == gossiper.local().get_broadcast_address() || ip == gossiper.local().get_cql_address()) {
return gossiper.local().my_host_id();
}
try {
return gossiper.local().get_host_id(ip);
} catch (...) {
startlog.debug("Could not resolve host id: {}, {}", ip, std::current_exception());
return locator::host_id{};
}
});
@@ -2217,50 +2268,27 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
startlog.info("Verifying that all of the keyspaces are RF-rack-valid");
db.local().check_rf_rack_validity(token_metadata.local().get());
startlog.info("Verifying that all of the tablet keyspaces use rack list replication factors");
db.local().check_rack_list_everywhere(cfg->enforce_rack_list());
// Start audit service after join_cluster so that the table-based audit backend
// can properly create its keyspace and table.
checkpoint(stop_signal, "starting audit service");
audit::audit::start_audit(*cfg, token_metadata, qp, mm).handle_exception([&] (auto&& e) {
startlog.error("audit start failed: {}", e);
}).get();
auto audit_stop = defer([] {
audit::audit::stop_audit().get();
});
// Semantic validation of sstable compression parameters from config.
// Adding here (i.e., after `join_cluster`) to ensure that the
// required SSTABLE_COMPRESSION_DICTS cluster feature has been negotiated.
//
// Also, if the dictionary compression feature is not enabled, use
// LZ4Compressor as the default algorithm instead of LZ4WithDictsCompressor.
const auto& dicts_feature_enabled = feature_service.local().sstable_compression_dicts;
auto& sstable_compression_options = cfg->sstable_compression_user_table_options;
gms::feature::listener_registration reg_listener;
if (!sstable_compression_options.is_set() && !dicts_feature_enabled) {
if (sstable_compression_options().get_algorithm() != compression_parameters::algorithm::lz4_with_dicts) {
on_internal_error(startlog, "expected LZ4WithDictsCompressor as default algorithm for sstable_compression_user_table_options.");
}
startlog.info("SSTABLE_COMPRESSION_DICTS feature is disabled. Overriding default SSTable compression to use LZ4Compressor instead of LZ4WithDictsCompressor.");
compression_parameters original_params{sstable_compression_options().get_options()};
auto params = sstable_compression_options().get_options();
params[compression_parameters::SSTABLE_COMPRESSION] = sstring(compression_parameters::algorithm_to_name(compression_parameters::algorithm::lz4));
smp::invoke_on_all([&sstable_compression_options, params = std::move(params)] {
if (!sstable_compression_options.is_set()) { // guard check; in case we ever make the option live updateable
sstable_compression_options(compression_parameters{params}, utils::config_file::config_source::None);
}
}).get();
// Register a callback to update the default compression algorithm when the feature is enabled.
// Precondition:
// The callback must run inside seastar::async context:
// - If the listener fires immediately, we are running inside seastar::async already.
// - If the listener is deferred, `feature_service::enable()` runs it inside seastar::async.
reg_listener = feature_service.local().sstable_compression_dicts.when_enabled([&sstable_compression_options, params = std::move(original_params)] {
startlog.info("SSTABLE_COMPRESSION_DICTS feature is now enabled. Overriding default SSTable compression to use LZ4WithDictsCompressor.");
smp::invoke_on_all([&sstable_compression_options, params = std::move(params)] {
if (!sstable_compression_options.is_set()) { // guard check; in case we ever make the option live updateable
sstable_compression_options(params, utils::config_file::config_source::None);
}
}).get();
});
}
const auto dicts_feature_enabled = bool(feature_service.local().sstable_compression_dicts);
try {
cfg->sstable_compression_user_table_options().validate(
compression_parameters::dicts_feature_enabled(bool(dicts_feature_enabled)));
cfg->get_sstable_compression_user_table_options(dicts_feature_enabled).validate(
compression_parameters::dicts_feature_enabled(dicts_feature_enabled));
} catch (const std::exception& e) {
startlog.error("Invalid sstable_compression_user_table_options: {}", e.what());
throw bad_configuration_error();
@@ -2506,13 +2534,6 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
seastar::set_abort_on_ebadf(cfg->abort_on_ebadf());
api::set_server_done(ctx).get();
audit::audit::start_audit(*cfg, token_metadata, qp, mm).handle_exception([&] (auto&& e) {
startlog.error("audit start failed: {}", e);
}).get();
auto audit_stop = defer([] {
audit::audit::stop_audit().get();
});
// Create controllers before drain_on_shutdown() below, so that it destructs
// after drain stops them in stop_transport()
// Register controllers after drain_on_shutdown() below, so that even on start

View File

@@ -178,7 +178,7 @@ class fragmenting_mutation_freezer {
tombstone _partition_tombstone;
std::optional<static_row> _sr;
std::deque<clustering_row> _crs;
utils::chunked_vector<clustering_row> _crs;
range_tombstone_list _rts;
frozen_mutation_consumer_fn _consumer;

View File

@@ -242,7 +242,7 @@ class streamed_mutation_freezer {
tombstone _partition_tombstone;
std::optional<static_row> _sr;
std::deque<clustering_row> _crs;
utils::chunked_vector<clustering_row> _crs;
range_tombstone_list _rts;
public:
streamed_mutation_freezer(const schema& s, const partition_key& key)

View File

@@ -226,7 +226,7 @@ future<> mutation_partition_serializer::write_gently(ser::writer_of_mutation_par
void serialize_mutation_fragments(const schema& s, tombstone partition_tombstone,
std::optional<static_row> sr, range_tombstone_list rts,
std::deque<clustering_row> crs, ser::writer_of_mutation_partition<bytes_ostream>&& wr)
utils::chunked_vector<clustering_row> crs, ser::writer_of_mutation_partition<bytes_ostream>&& wr)
{
auto srow_writer = std::move(wr).write_tomb(partition_tombstone).start_static_row();
auto row_tombstones = [&] {
@@ -242,10 +242,9 @@ void serialize_mutation_fragments(const schema& s, tombstone partition_tombstone
rts.clear();
auto clustering_rows = std::move(row_tombstones).end_range_tombstones().start_rows();
while (!crs.empty()) {
auto& cr = crs.front();
for (auto& cr : crs) {
write_row(clustering_rows.add(), s, cr.key(), cr.cells(), cr.marker(), cr.tomb()).end_deletable_row();
crs.pop_front();
cr = clustering_row(clustering_key_prefix{});
}
std::move(clustering_rows).end_rows().end_mutation_partition();
}

View File

@@ -41,4 +41,4 @@ public:
void serialize_mutation_fragments(const schema& s, tombstone partition_tombstone,
std::optional<static_row> sr, range_tombstone_list range_tombstones,
std::deque<clustering_row> clustering_rows, ser::writer_of_mutation_partition<bytes_ostream>&&);
utils::chunked_vector<clustering_row> clustering_rows, ser::writer_of_mutation_partition<bytes_ostream>&&);

View File

@@ -60,28 +60,47 @@ static future<db::system_keyspace::topology_requests_entries> get_entries(db::sy
return sys_ks.get_node_ops_request_entries(db_clock::now() - ttl);
}
static tasks::task_stats get_task_stats(const db::system_keyspace::topology_requests_entry& entry,
const tasks::virtual_task_hint& hint) {
return tasks::task_stats{
.task_id = tasks::task_id{entry.id},
.type = request_type_to_task_type(entry.request_type),
.kind = tasks::task_kind::cluster,
.scope = "cluster",
.state = get_state(entry),
.sequence_number = 0,
.keyspace = "",
.table = "",
.entity = hint.node_id ? fmt::to_string(*hint.node_id) : "",
.shard = 0,
.start_time = entry.start_time,
.end_time = entry.end_time,
};
}
future<std::optional<tasks::task_status>> node_ops_virtual_task::get_status(tasks::task_id id, tasks::virtual_task_hint hint) {
auto entry_opt = co_await _ss._sys_ks.local().get_topology_request_entry_opt(id.uuid());
if (!entry_opt) {
co_return std::nullopt;
}
auto& entry = *entry_opt;
auto stats = get_task_stats(entry, hint);
co_return tasks::task_status{
.task_id = id,
.type = request_type_to_task_type(entry.request_type),
.kind = tasks::task_kind::cluster,
.scope = "cluster",
.state = get_state(entry),
.task_id = stats.task_id,
.type = stats.type,
.kind = stats.kind,
.scope = stats.scope,
.state = stats.state,
.is_abortable = co_await is_abortable(std::move(hint)),
.start_time = entry.start_time,
.end_time = entry.end_time,
.start_time = stats.start_time,
.end_time = stats.end_time,
.error = entry.error,
.parent_id = tasks::task_id::create_null_id(),
.sequence_number = 0,
.shard = 0,
.keyspace = "",
.table = "",
.entity = "",
.sequence_number = stats.sequence_number,
.shard = stats.shard,
.keyspace = stats.keyspace,
.table = stats.table,
.entity = stats.entity,
.progress_units = "",
.progress = tasks::task_manager::task::progress{},
.children = co_await get_children(get_module(), id, std::bind_front(&gms::gossiper::is_alive, &_ss.gossiper()))
@@ -92,26 +111,41 @@ tasks::task_manager::task_group node_ops_virtual_task::get_group() const noexcep
return tasks::task_manager::task_group::topology_change_group;
}
static std::map<tasks::task_id, locator::host_id> get_requests(const service::topology& topology) {
std::map<tasks::task_id, locator::host_id> result;
for (auto& request : topology.requests) {
auto* rs = topology.find(request.first);
if (rs) {
result.emplace(tasks::task_id(rs->second.request_id), locator::host_id(request.first.uuid()));
}
}
for (auto& [node, rs] : topology.transition_nodes) {
result.emplace(tasks::task_id(rs.request_id), locator::host_id(node.uuid()));
}
return result;
}
future<std::optional<tasks::virtual_task_hint>> node_ops_virtual_task::contains(tasks::task_id task_id) const {
if (!task_id.uuid().is_timestamp()) {
// Task id of node ops operation is always a timestamp.
co_return std::nullopt;
}
auto empty_hint = std::make_optional<tasks::virtual_task_hint>({});
auto hint = std::make_optional<tasks::virtual_task_hint>({});
service::topology& topology = _ss._topology_state_machine._topology;
for (auto& request : topology.requests) {
if (topology.find(request.first)->second.request_id == task_id.uuid()) {
co_return empty_hint;
}
auto reqs = get_requests(topology);
if (reqs.contains(task_id)) {
hint->node_id = reqs.at(task_id);
co_return hint;
}
auto entry = co_await _ss._sys_ks.local().get_topology_request_entry_opt(task_id.uuid());
co_return entry && std::holds_alternative<service::topology_request>(entry->request_type) ? empty_hint : std::nullopt;
co_return entry && std::holds_alternative<service::topology_request>(entry->request_type) ? hint : std::nullopt;
}
future<tasks::is_abortable> node_ops_virtual_task::is_abortable(tasks::virtual_task_hint) const {
return make_ready_future<tasks::is_abortable>(tasks::is_abortable::no);
future<tasks::is_abortable> node_ops_virtual_task::is_abortable(tasks::virtual_task_hint hint) const {
// Currently, only node operations are supported by abort_topology_request().
return make_ready_future<tasks::is_abortable>(tasks::is_abortable(hint.node_id.has_value()));
}
future<std::optional<tasks::task_status>> node_ops_virtual_task::wait(tasks::task_id id, tasks::virtual_task_hint hint) {
@@ -124,30 +158,21 @@ future<std::optional<tasks::task_status>> node_ops_virtual_task::wait(tasks::tas
co_return co_await get_status(id, std::move(hint));
}
future<> node_ops_virtual_task::abort(tasks::task_id id, tasks::virtual_task_hint) noexcept {
return make_ready_future();
future<> node_ops_virtual_task::abort(tasks::task_id id, tasks::virtual_task_hint hint) noexcept {
co_await _ss.abort_topology_request(id.uuid());
}
future<std::vector<tasks::task_stats>> node_ops_virtual_task::get_stats() {
db::system_keyspace& sys_ks = _ss._sys_ks.local();
co_return std::ranges::to<std::vector<tasks::task_stats>>(co_await get_entries(sys_ks, get_task_manager().get_user_task_ttl())
| std::views::transform([] (const auto& e) {
auto id = e.first;
| std::views::transform([reqs = get_requests(_ss._topology_state_machine._topology)] (const auto& e) {
auto id = tasks::task_id{e.first};
auto& entry = e.second;
return tasks::task_stats {
.task_id = tasks::task_id{id},
.type = node_ops::request_type_to_task_type(entry.request_type),
.kind = tasks::task_kind::cluster,
.scope = "cluster",
.state = get_state(entry),
.sequence_number = 0,
.keyspace = "",
.table = "",
.entity = "",
.shard = 0,
.start_time = entry.start_time,
.end_time = entry.end_time
};
tasks::virtual_task_hint hint;
if (reqs.contains(id)) {
hint.node_id = reqs.at(id);
}
return get_task_stats(entry, hint);
}));
}

View File

@@ -250,6 +250,9 @@ public:
}
virtual future<> fill_buffer() override {
if (const auto& ex = get_abort_exception(); ex) {
return make_exception_future<>(ex);
}
return do_until([this] { return is_end_of_stream() || is_buffer_full(); }, [this] {
_reader.with_reserve([&] {
if (!_static_row_done) {

View File

@@ -300,8 +300,7 @@ private:
void add_to_rpc_config(server_address srv);
void remove_from_rpc_config(const server_address& srv);
// A helper to wait for a leader to get elected
future<> wait_for_leader(seastar::abort_source* as);
future<> wait_for_leader(seastar::abort_source* as) override;
future<> wait_for_state_change(seastar::abort_source* as) override;

View File

@@ -254,6 +254,16 @@ public:
// It it passes nullptr, the function is unabortable.
virtual future<> wait_for_state_change(seastar::abort_source* as) = 0;
// The returned future is resolved when a leader is elected for the current term.
// Note that it is not guaranteed that the leader will remain the same by the time
// the future is resolved, so the caller must check the synchronous
// `current_leader()` function and retry `wait_for_leader()` if it returns an empty
// `raft::server_id`.
//
// The caller may pass a pointer to an abort_source to make the function abortable.
// It it passes nullptr, the function is unabortable.
virtual future<> wait_for_leader(seastar::abort_source* as) = 0;
// Manually trigger snapshot creation and log truncation.
//
// Does nothing if the current apply index is less or equal to the last persisted snapshot descriptor index

View File

@@ -153,7 +153,7 @@ private:
sstring _op_name;
std::string_view _op_name_view;
reader_resources _base_resources;
const reader_resources _base_resources;
bool _base_resources_consumed = false;
reader_resources _resources;
reader_permit::state _state = reader_permit::state::active;
@@ -270,9 +270,7 @@ public:
_semaphore.on_permit_created(*this);
}
~impl() {
if (_base_resources_consumed) {
signal(_base_resources);
}
release_base_resources();
if (_resources.non_zero()) {
on_internal_error_noexcept(rcslog, format("reader_permit::impl::~impl(): permit {} detected a leak of {{count={}, memory={}}} resources",
@@ -342,6 +340,11 @@ public:
void on_admission() {
SCYLLA_ASSERT(_state != reader_permit::state::active_await);
on_permit_active();
if (_base_resources_consumed) {
on_internal_error(rcslog, fmt::format("on_admission(): permit {} already has its base resources consumed", description()));
}
consume(_base_resources);
_base_resources_consumed = true;
}
@@ -370,10 +373,7 @@ public:
void on_evicted() {
SCYLLA_ASSERT(_state == reader_permit::state::inactive);
_state = reader_permit::state::evicted;
if (_base_resources_consumed) {
signal(_base_resources);
_base_resources_consumed = false;
}
release_base_resources();
}
void consume(reader_resources res) {
@@ -403,10 +403,9 @@ public:
void release_base_resources() noexcept {
if (_base_resources_consumed) {
_resources -= _base_resources;
_base_resources_consumed = false;
signal(_base_resources);
}
_semaphore.signal(std::exchange(_base_resources, {}));
}
sstring description() const {

View File

@@ -1476,13 +1476,14 @@ future<> repair_service::sync_data_using_repair(
dht::token_range_vector ranges,
std::unordered_map<dht::token_range, repair_neighbors> neighbors,
streaming::stream_reason reason,
shared_ptr<node_ops_info> ops_info) {
shared_ptr<node_ops_info> ops_info,
service::frozen_topology_guard frozen_topology_guard) {
if (ranges.empty()) {
co_return;
}
SCYLLA_ASSERT(this_shard_id() == 0);
auto task = co_await _repair_module->make_and_start_task<repair::data_sync_repair_task_impl>({}, _repair_module->new_repair_uniq_id(), std::move(keyspace), "", std::move(ranges), std::move(neighbors), reason, ops_info);
auto task = co_await _repair_module->make_and_start_task<repair::data_sync_repair_task_impl>({}, _repair_module->new_repair_uniq_id(), std::move(keyspace), "", std::move(ranges), std::move(neighbors), reason, ops_info, std::move(frozen_topology_guard));
co_await task->done();
}
@@ -1523,7 +1524,7 @@ future<> repair::data_sync_repair_task_impl::run() {
auto start_time = std::chrono::steady_clock::now();
rlogger.info("repair[{}]: sync data for keyspace={}, status=started, reason={}, small_table_optimization={}", id.uuid(), keyspace, _reason, small_table_optimization);
co_await module->run(id, [this, &rs, id, &db, keyspace, ranges_reduced_factor, small_table_optimization, germs = std::move(germs), &ranges = _ranges, &neighbors = _neighbors, reason = _reason, &task_as = _as] () mutable {
co_await module->run(id, [this, &rs, id, &db, keyspace, ranges_reduced_factor, small_table_optimization, germs = std::move(germs), &ranges = _ranges, &neighbors = _neighbors, reason = _reason, &task_as = _as, frozen_topology_guard = _frozen_topology_guard] () mutable {
auto cfs = list_column_families(db, keyspace);
_cfs_size = cfs.size();
if (cfs.empty()) {
@@ -1535,7 +1536,7 @@ future<> repair::data_sync_repair_task_impl::run() {
repair_results.reserve(smp::count);
task_as.check();
for (auto shard : std::views::iota(0u, smp::count)) {
auto f = rs.container().invoke_on(shard, [keyspace, table_ids, id, ranges_reduced_factor, ranges, neighbors, reason, germs, small_table_optimization, parent_data = get_repair_uniq_id().task_info] (repair_service& local_repair) mutable -> future<> {
auto f = rs.container().invoke_on(shard, [keyspace, table_ids, id, ranges_reduced_factor, ranges, neighbors, reason, germs, small_table_optimization, parent_data = get_repair_uniq_id().task_info, frozen_topology_guard] (repair_service& local_repair) mutable -> future<> {
auto data_centers = std::vector<sstring>();
auto hosts = std::vector<sstring>();
auto ignore_nodes = std::unordered_set<locator::host_id>();
@@ -1545,7 +1546,7 @@ future<> repair::data_sync_repair_task_impl::run() {
auto task = co_await local_repair._repair_module->make_and_start_task<repair::shard_repair_task_impl>(parent_data, tasks::task_id::create_random_id(), std::move(keyspace),
local_repair, germs->get().shared_from_this(), std::move(ranges), std::move(table_ids),
id, std::move(data_centers), std::move(hosts), std::move(ignore_nodes), std::move(neighbors), reason, hints_batchlog_flushed, small_table_optimization, ranges_parallelism, flush_time,
service::default_session_id, tablet_repair_sched_info{}, ranges_reduced_factor);
frozen_topology_guard, tablet_repair_sched_info{}, ranges_reduced_factor);
co_await task->done();
});
repair_results.push_back(std::move(f));
@@ -1584,10 +1585,10 @@ future<std::optional<double>> repair::data_sync_repair_task_impl::expected_total
co_return _cfs_size ? std::make_optional<double>(_ranges.size() * _cfs_size * smp::count) : std::nullopt;
}
future<> repair_service::bootstrap_with_repair(locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> bootstrap_tokens) {
future<> repair_service::bootstrap_with_repair(locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> bootstrap_tokens, service::frozen_topology_guard frozen_topology_guard) {
SCYLLA_ASSERT(this_shard_id() == 0);
auto reason = streaming::stream_reason::bootstrap;
return seastar::async([this, tmptr = std::move(tmptr), tokens = std::move(bootstrap_tokens), reason] () mutable {
return seastar::async([this, tmptr = std::move(tmptr), tokens = std::move(bootstrap_tokens), reason, frozen_topology_guard = std::move(frozen_topology_guard)] () mutable {
auto& db = get_db().local();
auto ks_erms = db.get_non_local_strategy_keyspaces_erms();
auto& topology = tmptr->get_topology();
@@ -1776,7 +1777,7 @@ future<> repair_service::bootstrap_with_repair(locator::token_metadata_ptr tmptr
}
}
auto nr_ranges = desired_ranges.size();
sync_data_using_repair(keyspace_name, erm, std::move(desired_ranges), std::move(range_sources), reason, nullptr).get();
sync_data_using_repair(keyspace_name, erm, std::move(desired_ranges), std::move(range_sources), reason, nullptr, frozen_topology_guard).get();
rlogger.info("bootstrap_with_repair: finished with keyspace={}, nr_ranges={}, nr_tables={}", keyspace_name, nr_ranges * nr_tables, nr_tables);
}
rlogger.info("bootstrap_with_repair: finished with keyspaces={}", ks_erms | std::views::keys);
@@ -1804,9 +1805,9 @@ future<> repair_service::reset_node_ops_progress(streaming::stream_reason reason
});
}
future<> repair_service::do_decommission_removenode_with_repair(locator::token_metadata_ptr tmptr, locator::host_id leaving_node_id, shared_ptr<node_ops_info> ops, streaming::stream_reason reason) {
future<> repair_service::do_decommission_removenode_with_repair(locator::token_metadata_ptr tmptr, locator::host_id leaving_node_id, shared_ptr<node_ops_info> ops, streaming::stream_reason reason, service::frozen_topology_guard frozen_topology_guard) {
SCYLLA_ASSERT(this_shard_id() == 0);
return seastar::async([this, tmptr = std::move(tmptr), leaving_node_id = std::move(leaving_node_id), ops] () mutable {
return seastar::async([this, tmptr = std::move(tmptr), leaving_node_id = std::move(leaving_node_id), ops, frozen_topology_guard = std::move(frozen_topology_guard)] () mutable {
auto& db = get_db().local();
auto& topology = tmptr->get_topology();
auto myhostid = topology.my_host_id();
@@ -1990,7 +1991,7 @@ future<> repair_service::do_decommission_removenode_with_repair(locator::token_m
ranges.swap(ranges_for_removenode);
}
auto nr_ranges_synced = ranges.size();
sync_data_using_repair(keyspace_name, erm, std::move(ranges), std::move(range_sources), reason, ops).get();
sync_data_using_repair(keyspace_name, erm, std::move(ranges), std::move(range_sources), reason, ops, frozen_topology_guard).get();
rlogger.info("{}: finished with keyspace={}, leaving_node={}, nr_ranges={}, nr_ranges_synced={}, nr_ranges_skipped={}",
op, keyspace_name, leaving_node_id, nr_ranges_total, nr_ranges_synced * nr_tables, nr_ranges_skipped * nr_tables);
}
@@ -1998,15 +1999,15 @@ future<> repair_service::do_decommission_removenode_with_repair(locator::token_m
}).finally([this, reason] { return reset_node_ops_progress(reason); });
}
future<> repair_service::decommission_with_repair(locator::token_metadata_ptr tmptr) {
future<> repair_service::decommission_with_repair(locator::token_metadata_ptr tmptr, service::frozen_topology_guard frozen_topology_guard) {
SCYLLA_ASSERT(this_shard_id() == 0);
auto my_address = tmptr->get_topology().my_host_id();
return do_decommission_removenode_with_repair(std::move(tmptr), my_address, {}, streaming::stream_reason::decommission);
return do_decommission_removenode_with_repair(std::move(tmptr), my_address, {}, streaming::stream_reason::decommission, std::move(frozen_topology_guard));
}
future<> repair_service::removenode_with_repair(locator::token_metadata_ptr tmptr, locator::host_id leaving_node, shared_ptr<node_ops_info> ops) {
future<> repair_service::removenode_with_repair(locator::token_metadata_ptr tmptr, locator::host_id leaving_node, shared_ptr<node_ops_info> ops, service::frozen_topology_guard frozen_topology_guard) {
SCYLLA_ASSERT(this_shard_id() == 0);
return do_decommission_removenode_with_repair(std::move(tmptr), std::move(leaving_node), std::move(ops), streaming::stream_reason::removenode).then([this] {
return do_decommission_removenode_with_repair(std::move(tmptr), std::move(leaving_node), std::move(ops), streaming::stream_reason::removenode, std::move(frozen_topology_guard)).then([this] {
rlogger.debug("Triggering off-strategy compaction for all non-system tables on removenode completion");
seastar::sharded<replica::database>& db = get_db();
return db.invoke_on_all([](replica::database &db) {
@@ -2017,9 +2018,9 @@ future<> repair_service::removenode_with_repair(locator::token_metadata_ptr tmpt
});
}
future<> repair_service::do_rebuild_replace_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, sstring op, utils::optional_param source_dc, streaming::stream_reason reason, std::unordered_set<locator::host_id> ignore_nodes, locator::host_id replaced_node) {
future<> repair_service::do_rebuild_replace_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, sstring op, utils::optional_param source_dc, streaming::stream_reason reason, service::frozen_topology_guard frozen_topology_guard, std::unordered_set<locator::host_id> ignore_nodes, locator::host_id replaced_node) {
SCYLLA_ASSERT(this_shard_id() == 0);
return seastar::async([this, ks_erms = std::move(ks_erms), tmptr = std::move(tmptr), source_dc = std::move(source_dc), op = std::move(op), reason, ignore_nodes = std::move(ignore_nodes), replaced_node] () mutable {
return seastar::async([this, ks_erms = std::move(ks_erms), tmptr = std::move(tmptr), source_dc = std::move(source_dc), op = std::move(op), reason, ignore_nodes = std::move(ignore_nodes), replaced_node, frozen_topology_guard = std::move(frozen_topology_guard)] () mutable {
auto& db = get_db().local();
const auto& topology = tmptr->get_topology();
auto myid = tmptr->get_my_id();
@@ -2210,14 +2211,14 @@ future<> repair_service::do_rebuild_replace_with_repair(std::unordered_map<sstri
}).get();
}
auto nr_ranges = ranges.size();
sync_data_using_repair(keyspace_name, erm, std::move(ranges), std::move(range_sources), reason, nullptr).get();
sync_data_using_repair(keyspace_name, erm, std::move(ranges), std::move(range_sources), reason, nullptr, frozen_topology_guard).get();
rlogger.info("{}: finished with keyspace={}, source_dc={}, nr_ranges={}", op, keyspace_name, source_dc_for_keyspace, nr_ranges * nr_tables);
}
rlogger.info("{}: finished with keyspaces={}, source_dc={}", op, ks_erms | std::views::keys, source_dc);
});
}
future<> repair_service::rebuild_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, utils::optional_param source_dc) {
future<> repair_service::rebuild_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, utils::optional_param source_dc, service::frozen_topology_guard frozen_topology_guard) {
SCYLLA_ASSERT(this_shard_id() == 0);
auto op = sstring("rebuild_with_repair");
const auto& topology = tmptr->get_topology();
@@ -2226,7 +2227,7 @@ future<> repair_service::rebuild_with_repair(std::unordered_map<sstring, locator
}
auto reason = streaming::stream_reason::rebuild;
rlogger.info("{}: this-node={} source_dc={}", op, *topology.this_node(), source_dc);
co_await do_rebuild_replace_with_repair(std::move(ks_erms), std::move(tmptr), std::move(op), std::move(source_dc), reason).finally([this, reason] { return reset_node_ops_progress(reason);});
co_await do_rebuild_replace_with_repair(std::move(ks_erms), std::move(tmptr), std::move(op), std::move(source_dc), reason, frozen_topology_guard).finally([this, reason] { return reset_node_ops_progress(reason);});
co_await get_db().invoke_on_all([](replica::database& db) {
for (auto& t : db.get_non_system_column_families()) {
t->trigger_offstrategy_compaction();
@@ -2234,7 +2235,7 @@ future<> repair_service::rebuild_with_repair(std::unordered_map<sstring, locator
});
}
future<> repair_service::replace_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> replacing_tokens, std::unordered_set<locator::host_id> ignore_nodes, locator::host_id replaced_node) {
future<> repair_service::replace_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> replacing_tokens, std::unordered_set<locator::host_id> ignore_nodes, locator::host_id replaced_node, service::frozen_topology_guard frozen_topology_guard) {
SCYLLA_ASSERT(this_shard_id() == 0);
auto cloned_tm = co_await tmptr->clone_async();
auto op = sstring("replace_with_repair");
@@ -2248,7 +2249,7 @@ future<> repair_service::replace_with_repair(std::unordered_map<sstring, locator
co_await cloned_tmptr->update_normal_tokens(replacing_tokens, tmptr->get_my_id());
auto source_dc = utils::optional_param(myloc.dc);
rlogger.info("{}: this-node={} ignore_nodes={} source_dc={}", op, *topology.this_node(), ignore_nodes, source_dc);
co_await do_rebuild_replace_with_repair(std::move(ks_erms), std::move(cloned_tmptr), std::move(op), std::move(source_dc), reason, std::move(ignore_nodes), replaced_node).finally([this, reason] { return reset_node_ops_progress(reason); });
co_await do_rebuild_replace_with_repair(std::move(ks_erms), std::move(cloned_tmptr), std::move(op), std::move(source_dc), reason, frozen_topology_guard, std::move(ignore_nodes), replaced_node).finally([this, reason] { return reset_node_ops_progress(reason); });
}
// It is called by the repair_tablet rpc verb to repair the given tablet

View File

@@ -208,15 +208,15 @@ public:
// The tokens are the tokens assigned to the bootstrap node.
// all repair-based node operation entry points must be called on shard 0
future<> bootstrap_with_repair(locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> bootstrap_tokens);
future<> decommission_with_repair(locator::token_metadata_ptr tmptr);
future<> removenode_with_repair(locator::token_metadata_ptr tmptr, locator::host_id leaving_node, shared_ptr<node_ops_info> ops);
future<> rebuild_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, utils::optional_param source_dc);
future<> replace_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> replacing_tokens, std::unordered_set<locator::host_id> ignore_nodes, locator::host_id replaced_node);
future<> bootstrap_with_repair(locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> bootstrap_tokens, service::frozen_topology_guard frozen_topology_guard);
future<> decommission_with_repair(locator::token_metadata_ptr tmptr, service::frozen_topology_guard frozen_topology_guard);
future<> removenode_with_repair(locator::token_metadata_ptr tmptr, locator::host_id leaving_node, shared_ptr<node_ops_info> ops, service::frozen_topology_guard frozen_topology_guard);
future<> rebuild_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, utils::optional_param source_dc, service::frozen_topology_guard frozen_topology_guard);
future<> replace_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> replacing_tokens, std::unordered_set<locator::host_id> ignore_nodes, locator::host_id replaced_node, service::frozen_topology_guard frozen_topology_guard);
private:
future<> do_decommission_removenode_with_repair(locator::token_metadata_ptr tmptr, locator::host_id leaving_node, shared_ptr<node_ops_info> ops, streaming::stream_reason reason);
future<> do_decommission_removenode_with_repair(locator::token_metadata_ptr tmptr, locator::host_id leaving_node, shared_ptr<node_ops_info> ops, streaming::stream_reason reason, service::frozen_topology_guard frozen_topology_guard);
future<> do_rebuild_replace_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, sstring op, utils::optional_param source_dc, streaming::stream_reason reason, std::unordered_set<locator::host_id> ignore_nodes = {}, locator::host_id replaced_node = {});
future<> do_rebuild_replace_with_repair(std::unordered_map<sstring, locator::static_effective_replication_map_ptr> ks_erms, locator::token_metadata_ptr tmptr, sstring op, utils::optional_param source_dc, streaming::stream_reason reason, service::frozen_topology_guard frozen_topology_guard, std::unordered_set<locator::host_id> ignore_nodes = {}, locator::host_id replaced_node = {});
// Must be called on shard 0
future<> sync_data_using_repair(sstring keyspace,
@@ -224,7 +224,8 @@ private:
dht::token_range_vector ranges,
std::unordered_map<dht::token_range, repair_neighbors> neighbors,
streaming::stream_reason reason,
shared_ptr<node_ops_info> ops_info);
shared_ptr<node_ops_info> ops_info,
service::frozen_topology_guard frozen_topology_guard);
future<> reset_node_ops_progress(streaming::stream_reason reason);

View File

@@ -82,11 +82,13 @@ private:
std::unordered_map<dht::token_range, repair_neighbors> _neighbors;
optimized_optional<abort_source::subscription> _abort_subscription;
size_t _cfs_size = 0;
service::frozen_topology_guard _frozen_topology_guard;
public:
data_sync_repair_task_impl(tasks::task_manager::module_ptr module, repair_uniq_id id, std::string keyspace, std::string entity, dht::token_range_vector ranges, std::unordered_map<dht::token_range, repair_neighbors> neighbors, streaming::stream_reason reason, shared_ptr<node_ops_info> ops_info)
data_sync_repair_task_impl(tasks::task_manager::module_ptr module, repair_uniq_id id, std::string keyspace, std::string entity, dht::token_range_vector ranges, std::unordered_map<dht::token_range, repair_neighbors> neighbors, streaming::stream_reason reason, shared_ptr<node_ops_info> ops_info, service::frozen_topology_guard frozen_topology_guard)
: repair_task_impl(module, id.uuid(), id.id, "keyspace", std::move(keyspace), "", std::move(entity), tasks::task_id::create_null_id(), reason)
, _ranges(std::move(ranges))
, _neighbors(std::move(neighbors))
, _frozen_topology_guard(std::move(frozen_topology_guard))
{
if (ops_info && ops_info->as) {
_abort_subscription = ops_info->as->subscribe([this] () noexcept {

Some files were not shown because too many files have changed in this diff Show More