scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 07:42:16 +00:00

Author	SHA1	Message	Date
Marcin Maliszkiewicz	628e1ef2de	Merge 'Introduce auth::config to decouple auth modules from db::config' from Pavel Emelyanov Auth modules (authenticators, role managers, and auth::service) access their configuration options by reaching into db::config through the query processor. This abuses the database as a proxy object to get configuration. This series introduces a dedicated auth::config struct that carries the configuration options used by auth modules. The config is populated in main.cc and delivered to each shard via sharded_parameter. This makes the auth service conform to the overall design, where db::config is split into smaller per-service configs on start, thus decoupling individual components/services from global configuration. Component dependency cleanup, not backporting. Closes scylladb/scylladb#29870 * github.com:scylladb/scylladb: auth: Remove unused default_superuser() function auth: Switch role managers to use auth::config auth: Switch authenticators to use auth::config auth: Introduce auth::config and wire it through service	2026-05-18 11:32:11 +02:00
Pavel Emelyanov	07ed557a2f	auth: Introduce auth::config and wire it through service Add a dedicated auth::config struct that carries all configuration options needed by auth modules. The config is created per-shard using sharded_parameter to ensure updateable_value fields are shard-local. The config is stored as a member in auth::service and passed by const reference to factories so that each auth module can receive its configuration when constructed. The modules themselves are not yet converted — they still read from db::config via the query processor. The stored config is also used in describe_roles() to read the superuser name, eliminating the default_superuser() call that reached into db::config via the query processor. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-15 18:44:37 +03:00
Petr Gusev	9e3209e4a3	cql: refactor add_tablet_info to take tablet_routing_info directly Change add_tablet_info() to accept locator::tablet_routing_info instead of destructured (tablet_replica_set, token_range) pair. This simplifies all three call sites. Remove the empty-replicas guard inside add_tablet_info(): the only producer of tablet_routing_info is tablet ERM's check_locality(), which returns either nullopt (correctly routed) or info with replicas copied from tablet_info — a tablet always has replicas. All callers already check for nullopt before calling add_tablet_info(), so by the time we enter the function replicas are guaranteed non-empty.	2026-05-15 12:28:33 +02:00
Piotr Dulikowski	0c016cecc3	Merge 'QOS: self-heal stale V1-to-V2 migration state on upgrade' from Alex Dathskovsky service_levels: self-heal stale v1 marker after raft topology upgrade This PR handles an upgrade corner case where a node may already be using raft topology, while `system.scylla_local` still marks service levels as v1. The problem was introduced by commit `2917ec5d51` ("service:qos: service levels migration"), which added the service-levels migration from `system_distributed.service_levels` to `system.service_levels_v2` as part of the raft topology upgrade. However, if the cluster had no service levels configured, there was no data to migrate. In that case, the migration path could leave the local version marker unchanged, so the node would later observe an inconsistent state: * raft topology is already enabled; * service levels are still marked as v1 in `system.scylla_local`. Such clusters can be left in a stale state and fail startup during upgrade to 2026.2. This PR makes the upgrade path self-healing. The first commit restores `service_level_controller::migrate_to_v2()`, giving us a group0-based path for writing the service-levels v2 state even after raft topology is already in use. The second commit wires this path into startup. When the node detects the stale raft-topology + service-levels-v1 state, it retries the migration a bounded number of times and updates the version marker to v2 instead of failing startup. With this change, clusters that were left in this stale state can recover automatically during upgrade to 2026. Fixes: SCYLLADB-1807 backport: 2026.2 2026.1 we need this functionality when we are upgrading older servers Closes scylladb/scylladb#29749 * github.com:scylladb/scylladb: test/auth_cluster: simulate v1 state in self-heal test When skip_service_levels_v2_initialization is used, write an explicit v1 service level version marker while skipping v2 initialization. This lets the restart test exercise self-healing from v1 to v2. 
qos: self-heal stale service levels version on startup qos: reintroduce service levels v2 migration self-heal	2026-05-14 10:32:43 +02:00
Avi Kivity	f2ab911a46	Merge 'test/cluster: fix server-starting functions to wait for all ports' from Nadav Har'El This series fixes a recurring source of flaky tests in the cluster test suite. When a test configures Scylla to listen on non-default ports (e.g. a custom Alternator port, proxy-protocol port or shard-aware port), server_add() and server_start() would declare the server ready by polling the hardcoded standard CQL and Alternator ports. Those ports can become available slightly before the custom ports finish binding, so the test could start using the custom port before it was open, causing intermittent failures. The fix for each affected test was to pass `expected_server_up_state=ServerUpState.SERVING` explicitly, which waits for Scylla's sd_notify("STATUS=serving") signal instead. That signal is sent only after all configured listeners are fully open, so it is always the right readiness signal regardless of the port configuration. This workaround was applied again in PR #29737 and will keep being needed for every new test that uses a non-default port. This series makes ServerUpState.SERVING the default at every level of the server start/add call stack so no test needs to remember it: * Make server_add(), servers_add(), server_start() et al. all default to ServerUpState.SERVING. * Document that server_add/server_start wait for all ports to be ready, so future test authors understand what the functions guarantee. * Remove now-redundant expected_server_up_state=SERVING from existing tests. * A small optimization: Fix check_serving_notification() returning False on first completion. When the sd_notify future completed, the function correctly updated _received_serving but still returned False, wasting one 100ms polling interval. Return self._received_serving directly. 
Closes scylladb/scylladb#29758 * github.com:scylladb/scylladb: test/pylib: fix missing protocol_version=4 on control_cluster scylla_cluster: guard poll_status() set_result() calls against cancelled future test/cluster: avoid repeated CQL checks and leaks while waiting for SERVING test/cluster: fix check_serving_notification() inefficiency test/cluster: remove now-redundant expected_server_up_state=SERVING test/cluster: document that add/start waits for all ports to be ready test/cluster: update remaining CQL_ALTERNATOR_QUERIED defaults to SERVING test/cluster: fix server_add/server_start hanging when starting in maintenance mode main: notify "entering maintenance mode" after the maintenance CQL server is ready test/cluster: make server_start() default to ServerUpState.SERVING test/cluster: make server_add() default to ServerUpState.SERVING	2026-05-13 21:23:18 +03:00
Alex	6188bf3e01	test/auth_cluster: simulate v1 state in self-heal test When skip_service_levels_v2_initialization is used, write an explicit v1 service level version marker while skipping v2 initialization. This lets the restart test exercise self-healing from v1 to v2.	2026-05-13 17:55:20 +03:00
Alex	c2014f7e50	qos: self-heal stale service levels version on startup Add self_heal_service_levels_version() and use it during startup when the node is already on raft topology but service levels are still marked as v1. In that stale state, migrate service levels to v2 through group0 instead of failing startup.	2026-05-13 17:55:20 +03:00
Michael Litvak	5f8322a820	strong_consistency: change wait for groups to start on startup On startup, groups_manager::start() was previously called and waited for the groups to start. We change it to just start the raft servers in the background without waiting for them to be fully started. We wait for the servers to start explicitly at a later stage of startup, after starting the messaging service. The reason is that for the servers to be fully started they may require communication that requires the messaging service. Currently it is not required, but it will be changed in the next commit.	2026-05-13 08:43:26 +02:00
Michael Litvak	5a5c7c6241	strong_consistency: wait for raft servers to start in create table When creating a strongly consistent table, wait for the table's raft servers to start and be ready to serve queries before completing the operation. We want the create table operation to absorb the delay of starting the raft groups instead of the first queries. The create table coordinator commits and applies the schema statement, then it waits for all hosts that have a tablet replica to create and start the raft groups for the table's tablets. It does this by sending an RPC to all the relevant hosts that executes a group0 barrier, in order to ensure the table and raft groups are created, then waits for all raft groups on the host to finish starting and be ready. Fixes SCYLLADB-807	2026-05-13 08:43:24 +02:00
Botond Dénes	e95eb21a16	Merge 'Tablet-aware restore' from Pavel Emelyanov The mechanics of the restore is like this - A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet - The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas - Each replica handles the RPC verb by - Reading the snapshot_sstables table - Filtering the read sstable infos against current node and tablet being handled - Downloading and attaching the filtered sstables This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code. This is first step towards SCYLLADB-197 and lacks many things. 
In particular - the API only works for single-DC cluster - the caller needs to "lock" tablet boundaries with min/max tablet count - not abortable - no progress tracking - sub-optimal (re-kicking API on restore will re-download everything again) - not re-attachable (if API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via other node) - nodes download sstables in maintenance/streaming sched group (should be moved to maintenance/backup) Other follow-up items: - have an actual swagger object specification for `backup_location` Closes #28436 Closes #28657 Closes #28773 Closes scylladb/scylladb#28763 * github.com:scylladb/scylladb: docs: Update topology_over_raft.md with `restore` transition kind test: Add test for backup vs migration race test: Restore resilience test sstables_loader: Fail tablet-restore task if not all sstables were downloaded sstables_loader: mark sstables as downloaded after attaching sstables_loader: return shared_sstable from attach_sstable db: add update_sstable_download_status method db: add downloaded column to snapshot_sstables db: extract snapshot_sstables TTL into class constant test: Add a test for tablet-aware restore tablets: Implement tablet-aware cluster-wide restore messaging: Add RESTORE_TABLET RPC verb sstables_loader: Add method to download and attach sstables for a tablet tablets: Add restore_config to tablet_transition_info sstables_loader: Add restore_tablets task skeleton test: Add rest_client helper to kick newly introduced API endpoint api: Add /storage_service/tablets/restore endpoint skeleton sstables_loader: Add keyspace and table arguments to manfiest loading helper sstables_loader_helpers: just reformat the code sstables_loader_helpers: generalize argument and variable names sstables_loader_helpers: generalize get_sstables_for_tablet sstables_loader_helpers: add token getters for tablet filtering sstables_loader_helpers: remove underscores from struct members sstables_loader: move 
download_sstable and get_sstables_for_tablet sstables_loader: extract single-tablet SST filtering sstables_loader: make download_sstable static sstables_loader: fix formating of the new `download_sstable` function sstables_loader: extract single SST download into a function sstables_loader: add shard_id to minimal_sst_info sstables_loader: add function for parsing backup manifests split utility functions for creating test data from database_test export make_storage_options_config from lib/test_services rjson: Add helpers for conversions to dht::token and sstable_id Add system_distributed_keyspace.snapshot_sstables add get_system_distributed_keyspace to cql_test_env code: Add system_distributed_keyspace dependency to sstables_loader storage_service: Export export handle_raft_rpc() helper storage_service: Export do_tablet_operation() storage_service: Split transit_tablet() into two tablets: Add braces around tablet_transition_kind::repair switch	2026-05-12 16:24:13 +03:00
Pavel Emelyanov	1c0f8ab66e	Merge 'sstables: introduce --abort-on-malformed-sstable-error' from Botond Dénes When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site. This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths: - Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`) - Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`) - `parse_assert()` failures (via `on_parse_error()`) - BTI parse errors (via `on_bti_parse_error()`) The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure. The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption. Commit breakdown: 1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc` 2. `on_parse_error()` and `on_bti_parse_error()` check the new flag 3. All ~50 `throw malformed_sstable_exception(...)` sites migrated 4. 
Both `throw bufsize_mismatch_exception(...)` sites migrated Refs: SCYLLADB-1087 Backport: new feature, no backport Closes scylladb/scylladb#29324 * github.com:scylladb/scylladb: sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception() sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception() sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose sstables: introduce --abort-on-malformed-sstable-error infrastructure sstables: refactor parse_path() to return std::expected<> instead of throwing	2026-05-12 12:38:25 +03:00
Pavel Emelyanov	90ff7c5de3	code: Add system_distributed_keyspace dependency to sstables_loader The loader will need to populate and read data from system_distributed.snapshot_sstables table added recently, so this dependency is truly needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 10:17:40 +03:00
Botond Dénes	f6dc2cb5f8	sstables: introduce --abort-on-malformed-sstable-error infrastructure Add the --abort-on-malformed-sstable-error command-line option and the supporting infrastructure. When set, any malformed sstable error will abort the process and generate a coredump instead of throwing an exception. This is useful for debugging memory corruption that may manifest as apparent sstable corruption. The implementation introduces: - throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception() helper functions in sstables/sstables.cc, which check the new flag and either abort (with logging) or throw the appropriate exception. - set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error() to control the per-process atomic flag. - abort_on_malformed_sstable_error config option (LiveUpdate, default false) wired up in main.cc alongside abort_on_internal_error. Call-site migration will follow in subsequent commits.	2026-05-11 11:58:14 +03:00
Botond Dénes	eae15f4fdd	Merge 'Share timeout_config between services' from Pavel Emelyanov The timeout_config (more exactly -- updateable_timeout_config) is used by alternator/controller and transport/controller. Both create a local copy of that object by constructing one out of db::config. Also some options from this config are needed by storage_proxy, but since it doesn't have access to any timeout_config-s, it just uses db::config by getting it from the database. This PR introduces top-level sharded<updateable_timeout_config>, initializes it from db::config values and makes existing users plus storage_proxy use it where required. Motivation -- remove more replica::database::get_config() users. A side effect -- timeout_config is not duplicated by transport and alternator controllers. Components' dependencies cleanup, not backporting. Closes scylladb/scylladb#29636 * github.com:scylladb/scylladb: storage_proxy: Use shared updateable_timeout_config for CAS contention timeout alternator: Use shared updateable_timeout_config by reference cql_transport: Use shared updateable_timeout_config by reference storage_proxy: Use shared updateable_timeout_config by reference main: Introduce sharded<updateable_timeout_config> storage_proxy: Keep own updateable_timeout_config	2026-05-11 11:12:01 +03:00
Botond Dénes	9b2dfab2e5	Merge 'Don't use database.get_config() to fetch calculate_view_update_throttling_delay option' from Pavel Emelyanov This option is used in two places -- proxy and view-update-generator both need it to calculate the calculate_view_update_throttling_delay() value. This PR moves the option onto view_update_backlog top-level service, makes the calculating helper be method of that class and patches the callers to use it. This eliminates more places that abuse database as db::config accessor. Code dependencies refactoring, not backporting Closes scylladb/scylladb#29635 * github.com:scylladb/scylladb: view: Turn calculate_view_update_throttling_delay into node_update_backlog member view: Place view_flow_control_delay_limit_in_ms on node_update_backlog view: Add node_update_backlog reference to view_update_generator	2026-05-11 10:30:24 +03:00
Pavel Emelyanov	f39cbb1ec6	storage_proxy: Move maintenance_mode onto storage_proxy::config Stop reading maintenance_mode through replica::database's db::config. Add a properly typed maintenance_mode_enabled field to storage_proxy::config, populate it in main.cc from cfg->maintenance_mode() (same as messaging_service::config), and use a cached member in storage_proxy instead of db.local().get_config().maintenance_mode(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29637	2026-05-11 10:11:20 +03:00
Avi Kivity	5a887362e3	Merge 'Remove legacy tables creation code' from Gleb Natapov Drop the creation code for the `service_levels` and `cdc_generation_descriptions_v2` tables since they are no longer needed. Old clusters will still have these tables because they were created earlier. Also the series contains a small improvement around group0 creation. No backport needed since this removes functionality. Closes scylladb/scylladb#29482 * github.com:scylladb/scylladb: db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2 db/system_distributed_keyspace: remove unused code db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table db/system_distributed_keyspace: drop old service_levels table fix indent after the previous patch group0: call setup_group0 only when needed	2026-05-10 14:46:21 +03:00
Nadav Har'El	597838c501	main: notify "entering maintenance mode" after the maintenance CQL server is ready The sd_notify "entering maintenance mode" status was emitted before start_cql() was called, so clients that waited for this notification could attempt to connect to the maintenance socket before it was actually accepting connections. Move the checkpoint() call to after start_cql(), matching how the normal startup path emits "serving" only after all configured listeners are open. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-05-05 18:51:17 +03:00
Andrzej Jackowski	3755c370ac	audit: assert storage ordering invariants at runtime Abort if audit storage fails to start rather than silently running with an unaudited maintenance socket. Also assert that storage is already stopped when the audit service is destroyed, documenting the defer-stack ordering requirement. Refs SCYLLADB-1615 Refs SCYLLADB-1695	2026-04-28 18:58:49 +02:00
Andrzej Jackowski	543fb6a2db	audit: start maintenance socket after audit storage Without this, there is a window after startup where queries on the maintenance socket bypass auditing because audit storage is not yet initialized. Fixes SCYLLADB-1615	2026-04-28 18:58:49 +02:00
Andrzej Jackowski	b7bc2d89e6	audit: move audit construction before maintenance socket During graceful shutdown, deferred stops run in reverse order of construction. When the audit service was constructed after the maintenance socket, audit was destroyed first. A DML query still in-flight on the maintenance socket could then bypass auditing entirely. Move construction as early as possible so the audit service outlives the maintenance socket on the defer stack, and to maximise the window in which attempts to use audit before storage is ready are caught with on_internal_error_noexcept. Refs SCYLLADB-1615	2026-04-28 18:58:49 +02:00
Andrzej Jackowski	bc67dd0b82	audit: split startup into construction and storage phases The table-based audit backend needs Raft to create its keyspace, but the audit service must exist earlier so that CQL paths don't silently skip auditing. Split startup into two phases: construction and storage initialization. Queries arriving between the two phases are logged as errors. This is a refactoring commit and the split sections will be moved later in this patch series. Refs SCYLLADB-1615	2026-04-28 18:58:42 +02:00
Pavel Emelyanov	33cd3b5d68	alternator: Use shared updateable_timeout_config by reference Pass sharded<updateable_timeout_config>& into alternator::controller and through to alternator::server, which now stores a reference instead of constructing its own updateable_timeout_config from proxy.data_dictionary().get_config(). This removes the last creator of a per-owner updateable_timeout_config copy and completes the consolidation onto the single sharded<updateable_timeout_config> instance built in main. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-24 15:29:39 +03:00
Pavel Emelyanov	1a045d0cdd	cql_transport: Use shared updateable_timeout_config by reference Pass sharded<updateable_timeout_config>& into cql_transport::controller, which feeds the shard-local instance as a reference into cql_server_config::timeout_config. This drops the per-shard local updateable_timeout_config constructed from db::config inside the controller's sharded_parameter lambda, replacing it with a reference into the shared sharded instance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-24 15:21:31 +03:00
Pavel Emelyanov	aa99c1fd6e	storage_proxy: Use shared updateable_timeout_config by reference Drop storage_proxy's own updateable_timeout_config member built from db::config and take a reference to the shared sharded instance introduced by the previous patch. Both main and cql_test_env pass std::ref(timeout_cfg) into storage_proxy::start so each shard's storage_proxy references its shard-local updateable_timeout_config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-24 15:07:21 +03:00
Pavel Emelyanov	7b7295fde0	main: Introduce sharded<updateable_timeout_config> Build a single sharded updateable_timeout_config from db::config in both main and cql_test_env, sitting next to sharded<cql_config>. Subsequent patches migrate storage_proxy, the CQL transport controller and alternator server from their per-owner updateable_timeout_config copies to references into this shared instance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-24 15:03:35 +03:00
Pavel Emelyanov	855372db3c	view: Place view_flow_control_delay_limit_in_ms on node_update_backlog Store the view_flow_control_delay_limit_in_ms config option as an updateable_value on node_update_backlog. The value is threaded from main.cc into the backlog object at construction time. Existing call sites (tests) that construct node_update_backlog without the option continue to work via a default argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-24 13:47:54 +03:00
Pavel Emelyanov	ec2339e635	view: Add node_update_backlog reference to view_update_generator Pass node_update_backlog explicitly to view_update_generator via its constructor and start() call. This is plumbing only; no behavior change. A subsequent patch will use this reference to compute view update throttling delays without going through database::get_config(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-24 13:45:46 +03:00
Radosław Cybulski	74b523ea20	treewide: fix spelling errors. Fix various spelling errors. Closes scylladb/scylladb#29574	2026-04-21 18:20:26 +03:00
Radosław Cybulski	d93299b605	alternator: add system_keyspace reference Add a reference to `system_keyspace` object to `executor` object in alternator. The reference is needed, because in future commit we will add there (and use) helper functions that read `cdc_log` tables for tablet based tables similarly to already existing siblings for vnodes living in `system_distributed_keyspace`.	2026-04-17 18:57:43 +02:00
Piotr Dulikowski	37fc1507f0	Merge 'Alternator: Add vector search support' from Nadav Har'El This series adds support for vector search in Alternator based on the existing implementation in CQL. The series adds APIs for `CreateTable` and `UpdateTable` to add or remove vector indexes to Alternator tables, `DescribeTable` to list them and check the indexing status, and `Query` to perform a vector search - which contacts the vector store for the actual ANN (approximate nearest neighbor) search. Correct functionality of these features depends on some features of the vector store, that were already done (see https://github.com/scylladb/vector-store/pull/394). This initial implementation is fully functional, and can already be useful, but we do not yet support all the features we hope to eventually support. Here are things that we have not done yet, and plan to do later in follow-up pull requests: 1. Support a new optimized vector type ("V") - in addition to the "list of numbers" type supported in this version. 2. Allow choosing a different similarity function when creating an index, by SimilarityFunction in VectorIndex definition. 3. Allow choosing quantization (f32/f16/bf16/i8/b1) to ask the vector index to compress stored vectors. 4. Support oversampling and rescoring, defined per-index and per-query. 5. Support HNSW tuning parameters: maximum_node_connections, construction_beam_width, search_beam_width. 6. Support pre-filtering over key columns, which are available at the vector store, by sending the filter to the vector store (translated from DynamoDB filter syntax to the vector store's filter syntax). A decision still needs to be made if this will use KeyConditionExpression or FilterExpression. This version supports only post-filtering (with `FilterExpression`). 7. Support projecting non-key attributes into the index (Projection=INCLUDE and Projection=ALL), and then 1. pre-filtering using these attributes, and 2. 
efficiently return these attributes (using Select=ALL_PROJECTED_ATTRIBUTES, which today returns just the key columns). 8. Optimize the performance of `Query`, which today is inefficient for Select=ALL_ATTRIBUTES because it serially retrieves the matching items one at a time. 9. Returning the similarity scores with the items (the design proposes ReturnVectorSearchSimilarity). 10. Add more vector-search-specific metrics, beyond the metric we already have counting Query requests. For example separate latency and request-count metrics for vector-search Queries (distinct from GSI/LSI queries), and a metric accumulating the total Limit (K) across all vector search queries. 11. Consider how (and if at all) we want to run the tests in test/alternator/test_vector.py that need the vector store in the CI. Currently they are skipped in CI and only run manually (with `test/alternator/run --vs test_vector`). 12. UpdateTable 'Update' operation to modify index parameters. Only some can be modified, e.g., Oversampling. 13. Support for "local index" (separate index for each partition). 14. Make sure that vector search and Streams can be enabled concurrently on the same table - both need CDC but we need to verify that one doesn't confuse the other or disables options that the other needs. We can only do this after we have Alternator Streams running on tablets (since vector store requires tablets). Testing the new Alternator vector search end-to-end requires running both Scylla and the vector store together. We will have such end-to-end tests in the vector store repository (see https://github.com/scylladb/vector-store/pull/392), but we also add in this pull request many end-to-end tests written in Python, that can be run with the command "test/alternator/run --vs test_vector.py". The "--vs" option tells the run script to run both Scylla and the vector store (currently assumed to be in `.../vector-store/target/release/vector-store`). 
About 65% of the tests in this pull request check supported syntax and error paths so can run without the vector store, while about 35% of the tests do perform actual Query operations and require the vector store to be running. Currently, the tests that do require the vector store will not get run by CI, but can be easily re-run manually with `test/alternator/run --vs test_vector.py`. In total, this series includes 78 functional tests in 2200 lines of Python code. This series also includes documentation for the new Alternator feature and the new APIs introduced. You can see a more detailed design document here: https://docs.google.com/document/d/1cxLI7n-AgV5hhH1DTyU_Es8_f-t8Acql-1f58eQjZLY/edit Two patches in this series split the huge alternator/executor.cc, after this series continued to grow it and it reached a whopping 7,000 lines. These patches are just reorganization of code, no functional changes. But it's time that we finally do this (Refs #5783), we can't just continue to grow executor.cc with no end... 
Closes scylladb/scylladb#29046 * github.com:scylladb/scylladb: test/alternator: add option to "run" script to run with vector search alternator: document vector search test/alternator: fix retries in new_dynamodb_session test/alternator: test for allowed characters in attribute names test/alternator: tests for vector index support alternator, vector: add validation of non-finite numbers in Query alternator: Query: improve error message when VectorSearch is missing alternator: add per-table metrics for vector query alternator: clean up duplicated code alternator: fix default Select of Query alternator: split executor.cc even more alternator: split alternator/executor.cc alternator: validate vector index attribute values on write alternator: DescribeTable for vector index: add IndexStatus and Backfilling alternator: implement Query with a vector index alternator: fix bug in describe_multi_item() alternator: prevent adding GSI conflicting with a vector index alternator: implement UpdateTable with a vector index alternator: implement DescribeTable with a vector index alternator: implement CreateTable with a vector index alternator: reject empty attribute names cdc: fix on_pre_create_column_families to create CDC log for vector search	2026-04-17 10:25:45 +02:00
Botond Dénes	88a8324e68	Merge 'db: store large data records in SSTable metadata and serve via virtual tables' from Benny Halevy `system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables. This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades. When the cluster feature is enabled, each node drops the old system large_ tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables. Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records. 1. keys: move key_to_str() to keys/keys.hh — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable 2. sstables: add LargeDataRecords metadata type (tag 13) — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation 3. large_data_handler: rename partition_above_threshold to above_threshold_result — generalize the struct for reuse 4. large_data_handler: return above_threshold_result from maybe_record_large_cells — separate booleans for cell size vs collection elements thresholds 5. sstables: populate LargeDataRecords from writer — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable` 6. 
test: add LargeDataRecords round-trip unit tests — verify write/read, top-N bounding, below-threshold behavior 7. db: call initialize_virtual_tables from shard 0 only — preparatory refactoring to enable cross-shard coordination 8. db: implement large_data virtual tables with feature flag gating — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276 * Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport Closes scylladb/scylladb#29257 * github.com:scylladb/scylladb: db: implement large_data virtual tables with feature flag gating db: call initialize_virtual_tables from shard 0 only test: add LargeDataRecords round-trip unit tests sstables: populate LargeDataRecords from writer large_data_handler: return above_threshold_result from maybe_record_large_cells large_data_handler: rename partition_above_threshold to above_threshold_result sstables: add LargeDataRecords metadata type (tag 13) sstables: add fmt::formatter for large_data_type keys: move key_to_str() to keys/keys.hh	2026-04-16 14:03:31 +03:00
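The "bounded min-heaps" with a configurable top-N mentioned in item 5 of the series above can be sketched as follows. This is a minimal illustration of the technique, not the SSTable writer's code: the class name `TopNRecords`, its methods and the `(size, key)` record shape are hypothetical, and the size threshold that decides whether a record is considered at all is not modeled.

```python
import heapq


class TopNRecords:
    """Keep only the `capacity` largest records seen so far.

    The smallest retained record sits at the heap root, so a new record
    only has to be compared against that single entry.
    """

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._heap = []  # min-heap of (size, sequence number, key)
        self._seq = 0    # tie-breaker so keys themselves are never compared

    def add(self, size: int, key: str) -> None:
        if self._capacity <= 0:
            return
        entry = (size, self._seq, key)
        self._seq += 1
        if len(self._heap) < self._capacity:
            heapq.heappush(self._heap, entry)
        elif size > self._heap[0][0]:
            # Evict the smallest retained record in one O(log N) step.
            heapq.heapreplace(self._heap, entry)

    def largest_first(self) -> list:
        """Return the retained (size, key) pairs, largest first."""
        return [(size, key) for size, _, key in sorted(self._heap, reverse=True)]
```

Memory stays bounded by the configured N regardless of how many records are offered, which is the property that makes the approach suitable for per-SSTable metadata.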
Nadav Har'El	e43a2e5086	alternator: implement Query with a vector index We introduce to the Query request a new "VectorSearch" parameter, which takes a mandatory "QueryVector" (a value which must be a numeric vector of the right length) and "Limit". The "Limit" of a vector search (Query with VectorSearch) determines the number of nearest neighbors to return, and does not allow pagination (ExclusiveStartKey is not allowed). ConsistentRead=True is also not allowed on a vector search query. The "Select"/"ProjectionExpression"/"AttributesToGet" parameters are also supported, requesting which attributes to fetch. Using Select=ALL_PROJECTED_ATTRIBUTES means read only the attributes found in the vector index - currently only the key columns - so it is significantly faster than ALL_ATTRIBUTES because it doesn't require reading the items from the base table. The "FilterExpression" parameter is also supported. Like in DynamoDB's traditional Query, this does post-filtering, i.e., removing some of the results returned by the vector index that don't match the filter, and as a result fewer than Limit results may be returned. Pre-filtering (done on the vector store, and always returns Limit results) is not yet implemented.	2026-04-16 13:31:47 +03:00
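The request rules described in the commit message above can be sketched as a small client-side checker. This is a minimal sketch, not Alternator's code: the function name `validate_vector_query`, the `dimensions` argument, the representation of QueryVector as a plain Python list of numbers (rather than its wire encoding), and the requirement that Limit be a positive integer are all assumptions made for illustration. The rejection of non-finite numbers follows a separate patch in the same series.

```python
import math


def validate_vector_query(request: dict, dimensions: int) -> None:
    """Raise ValueError if `request` breaks one of the rules listed above."""
    vs = request.get("VectorSearch")
    if vs is None:
        raise ValueError("VectorSearch is missing")

    vector = vs.get("QueryVector")
    if not isinstance(vector, list) or len(vector) != dimensions:
        raise ValueError(f"QueryVector must be a list of {dimensions} numbers")
    for x in vector:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(x, bool) or not isinstance(x, (int, float)):
            raise ValueError("QueryVector must contain only numbers")
        if not math.isfinite(x):
            raise ValueError("QueryVector must not contain NaN or infinity")

    limit = vs.get("Limit")
    # Assumption: Limit (the K of the nearest-neighbor search) is a positive int.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("Limit is mandatory and must be a positive integer")

    # A vector search returns exactly one page of up to Limit neighbors.
    if "ExclusiveStartKey" in request:
        raise ValueError("pagination is not allowed on a vector search")
    if request.get("ConsistentRead") is True:
        raise ValueError("ConsistentRead=True is not allowed on a vector search")
```

Because FilterExpression is applied after the nearest neighbors are fetched, a request that passes these checks may still return fewer than Limit items.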
Benny Halevy	ce00d61917	db: implement large_data virtual tables with feature flag gating Replace the physical system.large_partitions, system.large_rows, and system.large_cells CQL tables with virtual tables that read from LargeDataRecords stored in SSTable scylla metadata (tag 13). The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster feature flag: - Before the feature is enabled: the old physical tables remain in all_tables(), CQL writes are active, no virtual tables are registered. This ensures safe rollback during rolling upgrades. - After the feature is enabled: old physical tables are dropped from disk via legacy_drop_table_on_all_shards(), virtual tables are registered on all shards, and CQL writes are skipped via skip_cql_writes() in cql_table_large_data_handler. Key implementation details: - Three virtual table classes (large_partitions_virtual_table, large_rows_virtual_table, large_cells_virtual_table) extend streaming_virtual_table with cross-shard record collection. - generate_legacy_id() gains a version parameter; virtual tables use version 1 to get different UUIDs than the old physical tables. - compaction_time is derived from SSTable generation UUID at display time via UUID_gen::unix_timestamp(). - Legacy SSTables without LargeDataRecords emit synthetic summary rows based on above_threshold > 0 in LargeDataStats. - The activation logic uses two paths: when the feature is already enabled (test env, restart), it runs as a coroutine; when not yet enabled, it registers a when_enabled callback that runs inside seastar::async from feature_service::enable(). - sstable_3_x_test updated to use a simplified large_data_test_handler and validate LargeDataRecords in SSTable metadata directly.	2026-04-16 08:49:02 +03:00
Benny Halevy	cb6004b625	db: call initialize_virtual_tables from shard 0 only Move the smp::invoke_on_all dispatch from the callers into initialize_virtual_tables() itself, so the function is called once from shard 0 and internally distributes the per-shard virtual table setup to all shards. This simplifies the callers and allows a single place to add cross-shard coordination logic (e.g. feature-gated table registration) in future commits.	2026-04-16 08:49:02 +03:00
Gleb Natapov	0ef06a34ed	group0: call setup_group0 only when needed setup_group0 and setup_group0_if_exist have hidden conditions inside that make them no-ops. It is not clear at the call site that the functions may do nothing. Change the code to check the conditions at the call site instead.	2026-04-15 15:48:48 +03:00
Dimitrios Symonidis	71714fdc0e	db: introduce read-write lock to synchronize config updates with REST API Config is reloaded from SIGHUP on shard 0 and broadcast to all shards under a write lock. REST API callers reading find_config_id acquire a read lock via value_as_json_string_for_name() and are guaranteed a consistent snapshot even when a reload is in progress.	2026-04-15 14:28:31 +02:00
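The locking pattern described in the commit above (an exclusive lock for the reload, a shared lock for readers) can be sketched in general terms. This is a thread-based model, not the Seastar implementation: the classes `ReadWriteLock` and `Config`, and the methods `reload` and `value_as_json_string`, are hypothetical names chosen for illustration, and this simple variant does not protect writers from starvation.

```python
import json
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Many concurrent readers, or one exclusive writer."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()


class Config:
    def __init__(self, values: dict) -> None:
        self._lock = ReadWriteLock()
        self._values = dict(values)

    def reload(self, new_values: dict) -> None:
        # Exclusive: no reader can observe a half-applied update.
        with self._lock.write():
            self._values = dict(new_values)

    def value_as_json_string(self, name: str) -> str:
        # Shared: readers do not block each other, only a reload in progress.
        with self._lock.read():
            return json.dumps(self._values[name])
```

A reader therefore sees either the old value or the new one in full, never a mixture, which is the consistency guarantee the commit message describes for REST API callers.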
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Avi Kivity	8ccee6803e	Merge 'Remove upgrade view builder' from Gleb Natapov Since we no longer support upgrade from versions that do not support v2 of "view building status" code (building status is managed by raft) we can remove v1 code and upgrade code and make sure we do not boot with old "builder status" version. v2 version was introduced by `8d25a4d678` which is included in scylla-2025.1.0. No backport needed since this is code removal. Closes scylladb/scylladb#29105 * github.com:scylladb/scylladb: view: drop unused v1 builder code view: remove upgrade to raft code	2026-04-12 00:39:26 +03:00
Avi Kivity	ca80ee8586	Merge 'Introduce maintenance scheduling supergroup and do initial population' from Pavel Emelyanov The supergroup replaces streaming (a.k.a. maintenance as well) group, inherits 200 shares from it and consists of four sub-groups (all have equal shares of 200 within the new supergroup) * maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it * backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there * maintenance. It's a new "visible" name, everything that was called "maintenance" in the code ran in "streaming" group. Now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why) * `tablet_allocator::balance_tablets()` * `sstables_manager::components_reclaim_reload_fiber()` * `tablet_storage_group_manager::merge_completion_fiber()` * metrics exporting http server altogether * streaming. This is purely existing streaming group that just moves under the new supergroup. Everything else that was run there, continues doing so, including * hints sender * all view building related components (update generator, builder, workers) * repair * stream_manager * messaging service (except for verb handlers that switch groups) * join_cluster() activity * REST API * ... something else I forgot The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility. All new sched groups inherit `request_class::maintenance` (however, "backup" seems not to make any requests yet). Moving more activities from "streaming" into "maintenance" (or its own group) is possible, but one will need to take care of RPC group switching. 
The thing is that when a client makes an RPC call, the server may switch to one of pre-negotiated scheduling groups. Verbs for existing activities that run in "streaming" group are routed through RPC index that negotiates "streaming" group on the server side. If any of that client code moves to some other group, server will still run the handlers in "streaming" which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. Restoring code uses load-and-stream and corresponding RPCs, so it cannot be just moved into its own new group. Fixes SCYLLADB-351 New feature, not backporting Closes scylladb/scylladb#28542 * github.com:scylladb/scylladb: code: Add maintenance/maintenance group backup: Add maintenance/backup group compaction: Add maintenance/maintenance_compaction group main: Introduce maintenance supergroup main: Move all maintenance sched group into streaming one database: Use local variable for current_scheduling_group code: Live-update IO throughputs from main	2026-04-12 00:34:48 +03:00
Pavel Emelyanov	cb329b10bf	code: Add maintenance/maintenance group And move some activities from streaming group into it, namely - tablet_allocator background group - sstables_manager-s components reclaimer - tablet storage group manager merge completion fiber - prometheus All other activity that was in streaming group remains there, but can be moved to this group (or to new maintenance subgroup) later. All but prometheus are patched here, prometheus still uses the maintenance_sched_group variable in main.cc, so it transparently moves into new group Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:03 +03:00
Pavel Emelyanov	de9bfe0f1d	backup: Add maintenance/backup group The snapshot_ctl::backup_task_impl runs in configured scheduling group. Now it's streaming one. This patch introduces the maintenance/backup group and re-configures backup task with it. The group gets its --backup_io_throughput_mb_per_sec option that controls bandwidth limit for this sub-group only. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Pavel Emelyanov	6f43e8562e	compaction: Add maintenance/maintenance_compaction group Compaction manager tells compaction_sched_group from maintenance_compaction_sched_group. The latter, however, is set to be "streaming" group. This patch adds real maintenance_compaction group under the maintenance supergroup and makes compaction manager use it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Pavel Emelyanov	13355d1845	main: Introduce maintenance supergroup And just move streaming group inside it. Next patches will populate this supergroup further. The new supergroup gets its --maintenance-io-throughput-mb-per-sec option that controls supergroup-wide IO bandwidth applied to it. If not configured, the supergroup gets the throughput from streaming to be backward compatible. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Pavel Emelyanov	7cb9fa0778	main: Move all maintenance sched group into streaming one The main.cc code uses two variables to reference streaming scheduling. This patch stops using the maintenance_sched_group one, because it's in fact streaming group, and real "maintenance" will appear later in this set. One place is deliberately not patched -- prometheus code starts before dbcfg.streaming_scheduling_group appears, so it still uses the maintenance_sched_group variable. This fact will be used in one of the next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Pavel Emelyanov	15c41bfb6c	code: Live-update IO throughputs from main Currently we have two live-updateable IO-throughput options -- one for streaming and one for compaction. Both are observed and the changed value is applied to the corresponding scheduling_group by the relevant service -- respectively, stream_manager and compaction_manager. Both observe/react/apply places use pretty heavy boilerplate code for such a simple task. Next patches will make things worse by adding two more options to control IO throughput of some other groups. That said, the proposal is to hold the updating code in main.cc with the help of a wrapper class. In there all the needed bits are at hand, and classes can get their IO updates applied easily. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Pavel Emelyanov	35f625e5c7	repair: Move repair_multishard_reader options onto repair_service::config This actually uses two interconnected options: repair_multishard_reader_buffer_hint_size and repair_multishard_reader_enable_read_ahead. Both are propagated through repair_service::config and pass their values to repair_reader/make_reader at construction time. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:36:50 +03:00
Pavel Emelyanov	9bc0d27aae	repair: Move critical_disk_utilization_level onto repair_service::config Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:23:47 +03:00
Pavel Emelyanov	80aa0fcdc2	repair: Move repair_partition_count_estimation_ratio onto repair_service::config Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:23:47 +03:00
Pavel Emelyanov	585cb0c718	repair: Move repair_hints_batchlog_flush_cache_time_in_ms onto repair_service::config Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:23:47 +03:00

1673 Commits