scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 03:20:37 +00:00

Author	SHA1	Message	Date
Nikos Dragazis	914d3f845a	schema: Add initializer for compression defaults In PR `5b6570be52` we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams). Fix this by moving the logic into the `schema_builder` via a schema initializer. This ensures that the default compression settings are applied uniformly regardless of how the table is created, while also keeping the logic in a central place. Register the initializer at startup in all executables where schemas are being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`). Finally, remove the ad-hoc logic from `create_table_statement` (redundant as of this patch), remove the xfail markers from the relevant tests and adjust `test_describe_cdc_log_table_create_statement` to expect LZ4WithDicts as the default compressor. Fixes #26914. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `1e37781d86`)	2026-01-28 12:42:10 +02:00
Nikos Dragazis	001581c69c	db: config: Add accessor for sstable_compression_user_table_options The `sstable_compression_user_table_options` config option determines the default compression settings for user tables. In patch `2fc812a1b9`, the default value of this option was changed from LZ4 to LZ4WithDicts and a fallback logic was introduced during startup to temporarily revert the option to LZ4 until the dictionary compression feature is enabled. Replace this fallback logic with an accessor that returns the correct settings depending on the feature flag. This is cleaner and more consistent with the way we handle the `sstable_format` option, where the same problem appears (see `get_preferred_sstable_version()`). As a consequence, the configuration option must always be accessed through this accessor. Add a comment to point this out. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `76b2d0f961`)	2026-01-28 12:42:07 +02:00
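The accessor idea in the commit above can be modelled in a few lines. This is a minimal Python sketch, not the actual C++ code in `db::config`: the class and method names are hypothetical, and it assumes the fallback applies only when the option was left at its default.

```python
from dataclasses import dataclass, replace

DICT_COMPRESSOR = "LZ4WithDictsCompressor"
FALLBACK_COMPRESSOR = "LZ4Compressor"


@dataclass(frozen=True)
class CompressionOptions:
    sstable_compression: str = DICT_COMPRESSOR
    chunk_length_in_kb: int = 4


class Config:
    """Hypothetical stand-in for the config object."""

    def __init__(self, options=None):
        # Remember whether the user set the option explicitly (assumption:
        # only the built-in default is subject to the fallback).
        self._explicit = options is not None
        self._options = options if options is not None else CompressionOptions()

    def user_table_compression(self, dicts_feature_enabled):
        """Return the effective defaults; callers never read the raw option."""
        opts = self._options
        if (not self._explicit
                and opts.sstable_compression == DICT_COMPRESSOR
                and not dicts_feature_enabled):
            # Feature not yet enabled cluster-wide: report the plain LZ4 default.
            return replace(opts, sstable_compression=FALLBACK_COMPRESSOR)
        return opts
```

Because the decision is made on every call, nothing needs to be flipped back once the feature becomes enabled; the next caller simply sees the dictionary compressor.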
Gleb Natapov	726c1f5734	direct_failure_detector: run direct failure detector in the gossiper scheduling group When the direct failure detector was introduced, the idea was that it would run on the same connection that raft group0 verbs run on, but in `60f1053087` raft verbs were moved to the gossiper connection while DIRECT_FD_PING was left where it was. This patch moves it to the gossiper connection as well and fixes the pinger code to run in the gossiper scheduling group. (cherry picked from commit `86dde50c0d`)	2025-12-09 17:19:31 +02:00
Michał Jadwiszczak	64e0405ba2	db/view/view_building: send coordinator's term in the RPC To avoid case when an old coordinator (which hasn't been stopped yet) dictates what should be done, add raft term to the `work_on_view_building_tasks` RPC. The worker needs to check if the term matches the current term from raft server, and deny the request when the term is bad. (cherry picked from commit `fb8cbf1615`)	2025-11-26 17:47:16 +01:00
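The term check described in the commit above can be sketched as follows. This is an illustrative Python model only: the `Worker` class, the `StaleTermError` exception and the `current_term` callable are hypothetical names, and the real handler is C++ code that reads the term from the raft server.

```python
class StaleTermError(Exception):
    """Raised when a request carries a term other than the current one."""


class Worker:
    def __init__(self, current_term):
        # current_term: callable returning the term known to the local
        # raft server (hypothetical interface).
        self._current_term = current_term

    def work_on_view_building_tasks(self, coordinator_term, task_ids):
        current = self._current_term()
        if coordinator_term != current:
            # The sender is not the coordinator for the current term.
            raise StaleTermError(
                f"request term {coordinator_term} != current term {current}")
        # Placeholder for the real work: report every task as finished.
        return {task_id: "done" for task_id in task_ids}
```

A coordinator that lost leadership keeps sending its old term, so its requests are denied as soon as the worker observes the newer term.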
Nikos Dragazis	3b801f3d80	db/config: Change default SSTable compressor to LZ4WithDictsCompressor `sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is `LZ4Compressor` (inherited from Cassandra). Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios. If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using a listener callback. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `2fc812a1b9`)	2025-11-04 15:41:40 +02:00
Nikos Dragazis	bafe2bbbbc	db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl The option is a knob that allows to reject dictionary-aware compressors in the validation stage of CREATE/ALTER statements, and in the validation of `sstable_compression_user_table_options`. It was introduced in `7d26d3c7cb` to allow the admins of Scylla Cloud to selectively enable it in certain clusters. For more details, check: https://github.com/scylladb/scylla-enterprise/issues/5435 As of this series, we want to start offering dictionary compression as the default option in all clusters, i.e., treat it as a generally available feature. This makes the knob redundant. Additionally, making dictionary compression the default choice in `sstable_compression_user_table_options` creates an awkward dependency with the knob (disabling the knob should cause `sstable_compression_user_table_options` to fall back to a non-dict compressor as default). That may not be very clear to the end user. For these reasons, mark the option as "Deprecated", remove all relevant tests, and adjust the business logic as if dictionary compression is always available. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `96e727d7b9`)	2025-11-04 15:40:46 +02:00
Raphael S. Carvalho	d998d9d418	sstables_loader: Synchronize tablet split and load-and-stream Load-and-stream is broken when running concurrently with the finalization step of tablet split. Consider this: 1) split starts 2) split finalization executes barrier and succeeds 3) load-and-stream runs now, starts writing sstable (pre-split) 4) split finalization publishes changes to tablet metadata 5) load-and-stream finishes writing sstable 6) sstable cannot be loaded since it spans two tablets. Two possible fixes (maybe both): 1) load-and-stream waits for topology to quiesce 2) perform split compaction on sstable that spans both sibling tablets. This patch implements #1. By waiting for topology to quiesce, we guarantee that load-and-stream only starts when there's no chance the coordinator is handling some topology operation like split finalization. Fixes #26455. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `3abc66da5a`)	2025-10-21 12:26:54 +00:00
Piotr Dulikowski	1f73e18eaf	Merge '[Backport 2025.4] db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Scylladb[bot] Materialized views are currently in the experimental phase and using them in tablet-based keyspaces requires starting Scylla with an experimental feature, `views-with-tablets`. Any attempts to create a materialized view or secondary index when it's not enabled will fail with an appropriate error. After considerable effort, we're drawing close to bringing views out of the experimental phase, and the experimental feature will no longer be needed. However, materialized views in tablet-based keyspaces will still be restricted, and creating them will only be possible after enabling the configuration option `rf_rack_valid_keyspaces`. That's what we do in this PR. In this patch, we adjust existing tests in the tree to work with the new restriction. That shouldn't have been necessary because we've already seemingly adjusted all of them to work with the configuration option, but some tests hid well. We fix that mistake now. After that, we introduce the new restriction. What's more, when starting Scylla, we verify that there is no materialized view that would violate the contract. If there are some that do, we list them, notify the user, and refuse to start. High-level implementation strategy: 1. Name the restrictions in form of a function. 2. Adjust existing tests. 3. Restrict materialized views by both the experimental feature and the configuration option. Add validation test. 4. Drop the requirement for the experimental feature. Adjust the added test and add a new one. 5. Update the user documentation. Fixes scylladb/scylladb#23030 Backport: 2025.4, as we are aiming to support materialized views for tablets from that version. 
- (cherry picked from commit `a1254fb6f3`) - (cherry picked from commit `d6fcd18540`) - (cherry picked from commit `994f09530f`) - (cherry picked from commit `6322b5996d`) - (cherry picked from commit `71606ffdda`) - (cherry picked from commit `00222070cd`) - (cherry picked from commit `288be6c82d`) - (cherry picked from commit `b409e85c20`) Parent PR: #25802 Closes scylladb/scylladb#26416 * github.com:scylladb/scylladb: view: Stop requiring experimental feature db/view: Verify valid configuration for tablet-based views db/view: Require rf_rack_valid_keyspaces when creating view test/cluster/random_failures: Skip creating secondary indexes test/cluster/mv: Mark test_mv_rf_change as skipped test/cluster: Adjust MV tests to RF-rack-validity test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces db/view: Name requirement for views with tablets	2025-10-12 08:20:20 +02:00
Michał Jadwiszczak	5eeb1e3e76	db/view/view_building_worker: futurize and rename `start_background_fibers()` Next commit will move `discover_existing_staging_sstables()` to the foreground, so to prepare for this we need to futurize `start_background_fibers()` method and change its name to better reflect its purpose. (cherry picked from commit `575dce765e`)	2025-10-09 22:39:32 +00:00
Dawid Mędrek	2e2d1f17bb	db/view: Verify valid configuration for tablet-based views Creating a materialized view or a secondary index in a tablet-based keyspace requires the user to enable two options: * experimental feature `views-with-tablets`, * configuration option `rf_rack_valid_keyspaces`. Because the latter has only become a necessity recently (in this series), it's possible that there are already existing materialized views that violate it. We add a new check at start-up that iterates over existing views and makes sure that that is not the case. Otherwise, Scylla notifies the user of the problem. (cherry picked from commit `288be6c82d`)	2025-10-06 13:19:54 +00:00
Avi Kivity	5b6570be52	Merge 'db/config: Add SSTable compression options for user tables' from Nikos Dragazis ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well. This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality. Fixes #25195. Closes scylladb/scylladb#26003 * github.com:scylladb/scylladb: test/cluster: Add tests for invalid SSTable compression options test/boost: Add tests for SSTable compression config options main: Validate SSTable compression options from config db/config: Add SSTable compression options for user tables db/config: Prepare compression_parameters for config system compressor: Validate presence of sstable_compression in parameters compressor: Add missing space in exception message	2025-09-28 20:23:23 +03:00
Nikos Dragazis	8d5bd212ca	main: Validate SSTable compression options from config `compression_parameters` provides two levels of validation: * syntactic checks - implemented in the constructor * semantic checks - implemented by `compression_parameters::validate()` The former are applied implicitly when parsing the options from the command line or from scylla.yaml. The latter are currently not applied, but they should. In lack of a better place, apply them in main, right after joining the cluster, to make sure that the cluster features have been negotiated. The feature needed here is the `SSTABLE_COMPRESSION_DICTS`. Validation will fail if the feature is disabled and a dictionary compression algorithm has been selected. Also, mark `validate()` as const so that it can be called from a config object. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-26 12:02:42 +03:00
Botond Dénes	86ed627fc4	compaction: move code to namespace compaction The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables There are cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.	2025-09-25 15:03:56 +03:00
Pavel Emelyanov	f6860d1de0	Merge 'mv: run view building worker fibers in streaming group' from Piotr Dulikowski The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong. Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit. No need to backport, view build coordinator is not a part of any release yet. Closes scylladb/scylladb#26122 * github.com:scylladb/scylladb: mv: fix typo in start_backgroud_fibers mv: run view building worker fibers in streaming group	2025-09-22 15:28:38 +03:00
Karol Nowacki	eae71d3e91	vector_store_client: Move to vector_search module Vector search related implementation moved to a new module vector_search. As the vector search functionality is going to be extended, it is better to keep it in a separate module.	2025-09-22 08:01:47 +02:00
Michał Chojnowski	9e70df83ab	db: get rid of sstables-format-selector Our sstable format selection logic is weird, and hard to follow. If I'm not misunderstanding, the pieces are: 1. There's the `sstable_format` config entry, which currently doesn't do anything, but in the past it used to disable cluster features for versions newer than the specified one. 2. There are deprecated and unused config entries for individual versions (`enable_sstables_mc_format`, `enable_sstables_md_format`, etc). 3. There is a cluster feature for each version: ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc. (Currently all sstable version features have been grandfathered, and aren't checked by the code anymore). 4. There's an entry in `system.scylla_local` which contains the latest enabled sstable version. (Why? Isn't this directly derived from cluster features anyway)? 5. There's `sstable_manager::_format` which contains the sstable version to be used for new writes. This field is updated by `sstables_format_selector` based on cluster features and the `system.scylla_local` entry. I don't see why those pieces are needed. Version selection has the following constraints: 1. New sstables must be written with a format that supports existing data. For example, range tombstones with an infinite bound are only supported by sstables since version "mc". So if a range tombstone with an infinite bound exists somewhere in the dataset, the format chosen for new sstables has to be at least as new as "mc". 2. A new format might only be used after a corresponding cluster feature is enabled. (Otherwise new sstables might become unreadable if they are sent to another node, or if a node is downgraded). 3. The user should have a way to inhibit format upgrades if they wish. So far, constraint (1) has been fulfilled by never using formats older than the newest format ever enabled on the node. (With an exception for resharding and reshaping system tables). 
Constraint (2) has been fulfilled by calling `sstable_manager::set_format` only after the corresponding cluster feature is enabled. Constraint (3) has been fulfilled by the ability to inhibit cluster features by setting `sstable_format` to some fixed value. The main thing I don't like about this whole setup is that it doesn't let me downgrade the preferred sstable format. After a format is enabled, there is no way to go back to writing the old format again. That is no good -- after I make some performance-sensitive changes in a new format, it might turn out to be a pessimization for the particular workload, and I want to be able to go back. This patch aims to give a way to downgrade formats without violating the constraints. What it does is: 1. The entry in `system.scylla_local` becomes obsolete. After the patch we no longer update or read it. As far as I understand, the purpose of this entry is to prevent unwanted format downgrades (which is something cluster features are designed for) and it's updated if and only if relevant cluster features are updated. So there's no reason to have it, we can just directly use cluster features. 2. `sstable_format_selector` gets deleted. Without the `system.scylla_local` around, it's just a glorified feature listener. 3. The format selection logic is moved into `sstable_manager`. It already sees the `db::config` and the `gms::feature_service`. For the foreseeable future, the knowledge of enabled cluster features and current config should be enough information to pick the right formats. 4. The `sstable_format` entry in `db::config` is no longer intended to inhibit cluster features. Instead, it is intended to select the format for new sstables, and it becomes live-updatable. 5. 
Instead of writing new sstables with "highest supported" format, (which used to be set by `sstables_format_selector`) we write them with the "preferred" format, which is determined by `sstable_manager` based on the combination of enabled features and the current value of `sstable_format`. Closes scylladb/scylladb#26092 [avi: Pavel found the reason for the scylla_local entry - it predates stable storage for cluster features]	2025-09-19 16:17:56 +03:00
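One way to read the "preferred format" rule from the commit above is: pick the newest format that is enabled by cluster features and is not newer than the configured `sstable_format`. The Python sketch below models that reading. The exact rule used by `sstable_manager` is not spelled out in the message, so the selection logic, the function name and the three-entry format list are assumptions.

```python
# Formats ordered from oldest to newest (illustrative subset).
FORMATS = ["mc", "md", "me"]


def preferred_format(configured, enabled_features):
    """Return the newest enabled format not newer than `configured`.

    configured: value of the `sstable_format` option, e.g. "me".
    enabled_features: set of formats whose cluster feature is enabled.
    """
    limit = FORMATS.index(configured)
    candidates = [f for f in FORMATS[: limit + 1] if f in enabled_features]
    if not candidates:
        raise ValueError("no enabled sstable format satisfies the config")
    return candidates[-1]
```

Under this reading, lowering `sstable_format` from "me" to "md" at runtime makes new sstables use "md" again, which is the downgrade path the commit message asks for.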
Pavel Emelyanov	a1ea553fe1	code: Replace distributed<> with sharded<> The latter is recommended in seastar, and the former was left as compatibility alias. Latest seastar explicitly marks it as deprecated so once the submodule is updated, compilation logs will explode. Most of the patch is generated with for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done and a small manual change in test/perf/perf.hh Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26136	2025-09-19 12:22:51 +02:00
Piotr Dulikowski	fb0e5784e4	mv: fix typo in start_backgroud_fibers Letter "n" was missing in this name.	2025-09-18 15:50:16 +02:00
Piotr Dulikowski	261f61d303	mv: run view building worker fibers in streaming group The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong. Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit.	2025-09-18 15:42:36 +02:00
Michał Jadwiszczak	dc1ffd2c10	service/storage_service: drain `view_building_worker` earlier Similarly to the view builder, the view building worker needs to be drained in `storage_service::do_drain()`. Storage service drain happens at the very beginning of the shutdown procedure. Before this patch, the worker was still building views after the storage service was drained and this caused errors like: `Error applying view update to (named_gate_closed_exception)` and `locator::no_such_tablet_map`. Fixes scylladb/scylladb#25908 Closes scylladb/scylladb#25984	2025-09-15 11:29:19 +03:00
Pavel Emelyanov	34d1648d21	main: Properly handle zero allocation warning threshold The --help text says about --large-memory-allocation-warning-threshold: "Warn about memory allocations above this size; set to zero to disable." That's half-true: setting the value to zero spams logs with warnings about allocations of any size, as seastar treats a zero threshold literally. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25850	2025-09-08 12:41:19 +02:00
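The commit message above does not say how the zero value is handled after the fix. One plausible approach, shown here purely as an assumption, is to translate zero into a threshold so large that no allocation can reach it before the value is handed to the allocator. The function names are hypothetical.

```python
import sys


def effective_warning_threshold(configured_bytes):
    """Map the documented "zero disables" value to a never-reached limit."""
    if configured_bytes == 0:
        return sys.maxsize  # effectively disables the warning
    return configured_bytes


def should_warn(allocation_bytes, configured_bytes):
    """Model of the check: warn when an allocation reaches the threshold."""
    return allocation_bytes >= effective_warning_threshold(configured_bytes)
```

Without the translation step, a literal zero threshold makes `allocation_bytes >= 0` true for every allocation, which matches the log spam described in the message.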
Piotr Dulikowski	78ef334333	Merge 'Move "cache" API endpoints registration closer to column_family ones ' from Pavel Emelyanov These two "blocks" of endpoints have different URL prefixes, but work with the same "service", which is sharded<replica::database>. The latter block had already been fixed to carry the sharded<database>& around (#25467), now it's the "cache" turn. However, since these endpoints also work with the database, there's no need in dedicated top-level set/unset machinery (similarly, gossiper has two API set/unset blocks that come together, see #19425), it's enough to just set/unset them next to each other. Ongoing http_context dependency cleanup, no need to backport Closes scylladb/scylladb#25674 * github.com:scylladb/scylladb: api: Capture and use db in cache_service handlers api: Add sharded<database>& arg to set_cache_service() api: Squash (un)set_cache_service into ..._column_family api: Coroutinize set_server_column_family()	2025-09-02 13:59:02 +02:00
Avi Kivity	bc5773f777	Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational and recoverable even under extreme conditions. To achieve this, the following proactive measures are implemented: - reject writes - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes - applicable to: user tables, views, CDC log, audit, cql tracing - stop running compactions/repairs and prevent new ones from starting - reject incoming tablet migrations The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches the critical level (default: 98%) and disabled when the utilization drops below the threshold. Apart from that, the series adds tests that require mounted volumes to simulate running out of space. The paths to the volumes can be provided using a pytest argument, i.e. `--space-limited-dirs`. When not provided, tests are skipped. Test scenarios: 1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization 2. Perform an operation that would take the nodes over 100% 3. The nodes should not exceed the critical disk utilization (98% by default) 4. Scale out the cluster by adding one node per rack 5. Retry or wait for the operation from step 2 The operation is: writing data, running compactions, building materialized views, running repair, migrating tablets (caused by RF change, decommission). The test is successful if no nodes run out of space, the operation from step 2 is aborted/paused/timed out and the operation from step 5 is successful. 
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency: Read path (before) ``` instructions_per_op: mean= 39661.51 standard-deviation=34.53 median= 39655.39 median-absolute-deviation=23.33 maximum=39708.71 minimum=39622.61 ``` Read path (after) ``` instructions_per_op: mean= 39691.68 standard-deviation=34.54 median= 39683.14 median-absolute-deviation=11.94 maximum=39749.32 minimum=39656.63 ``` Write path (before): ``` instructions_per_op: mean= 50942.86 standard-deviation=97.69 median= 50974.11 median-absolute-deviation=34.25 maximum=51019.23 minimum=50771.60 ``` Write path (after): ``` instructions_per_op: mean= 51000.15 standard-deviation=115.04 median= 51043.93 median-absolute-deviation=52.19 maximum=51065.81 minimum=50795.00 ``` Fixes: https://github.com/scylladb/scylladb/issues/14067 Refs: https://github.com/scylladb/scylladb/issues/2871 No backport, as it is a new feature. Closes scylladb/scylladb#23917 * github.com:scylladb/scylladb: tests/cluster: Add new storage tests test/scylla_cluster: Override workdir when passed via cmdline streaming: Reject incoming migrations storage_service: extend locator::load_stats to collect per-node critical disk utilization flag repair_service: Add a facility to disable the service compaction_manager: Subscribe to out of space controller compaction_manager: Replace enabled/disabled states with running state database: Add critical_disk_utilization mode database can be moved to disk_space_monitor: add subscription API for threshold-based disk space monitoring docs: Add feature documentation config: Add critical_disk_utilization_level option replica/exceptions: Add a new custom replica exception	2025-08-30 18:47:57 +03:00
Piotr Dulikowski	7ccb50514d	Merge 'Introduce view building coordinator' from Michał Jadwiszczak This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views. The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table. The coordinator builds one base table at a time and it can choose another when all views of the currently processed base table are built. The tasks are started by setting the `STARTED` state and they are executed by the node-local view building worker. The tasks are scheduled in such a way that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends the `work_on_view_building_tasks` RPC to start the tasks and receive their results. This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and the caller aborted waiting for the response), the next RPC call will attach itself to the already started batch. The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation. If the operation fails at any moment, aborted tasks are rolled back. The view building coordinator can also handle staging sstables using process_staging view building tasks. 
We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149). For detailed description check: `docs/dev/view-building-coordinator.md` Fixes https://github.com/scylladb/scylladb/issues/22288 Fixes https://github.com/scylladb/scylladb/issues/19149 Fixes https://github.com/scylladb/scylladb/issues/21564 Fixes https://github.com/scylladb/scylladb/issues/17603 Fixes https://github.com/scylladb/scylladb/issues/22586 Fixes https://github.com/scylladb/scylladb/issues/18826 Fixes https://github.com/scylladb/scylladb/issues/23930 --- This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942 Closes scylladb/scylladb#23760 * github.com:scylladb/scylladb: test/cluster: add view build status tests test/cluster: add view building coordinator tests utils/error_injection: allow to abort `injection_handler::wait_for_message()` test: adjust existing tests utils/error_injection: add injection with `sleep_abortable()` db/view/view_builder: ignore `no_such_keyspace` exception docs/dev: add view building coordinator documentation db/view/view_building_worker: work on `process_staging` tasks db/view/view_building_worker: register staging sstable to view building coordinator when needed db/view/view_building_worker: discover staging sstables db/view/view_building_worker: add method to register staging sstable db/view/view_update_generator: add method to process staging sstables instantly db/view/view_update_generator: extract generating updates from staging sstables to a method db/view/view_update_generator: ignore tablet-based sstables db/view/view_building_coordinator: update view build status on node join/left db/view/view_building_coordinator: handle tablet operations db/view: add view building task mutation builder service/topology_coordinator: run view building coordinator db/view: introduce `view_building_coordinator` 
db/view/view_building_worker: update built views locally db/view: introduce `view_building_worker` db/view: extract common view building functionalities db/view: prepare to create abstract `view_consumer` message/messaging_service: add `work_on_view_building_tasks` RPC service/topology_coordinator: make `term_changed_error` public db/schema_tables: create/cleanup tasks when an index is created/dropped service/migration_manager: cleanup view building state on drop keyspace service/migration_manager: cleanup view building state on drop view service/migration_manager: create view building tasks on create view test/boost: enable proxy remote in some tests service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` service/migration_manager: coroutinize `prepare_new_view_announcement()` service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine` service: reload `view_building_state_machine` on group0 apply() service/vb_coordinator: add currently processing base db/system_keyspace: move `get_scylla_local_mutation()` up db/system_keyspace: add `view_building_tasks` table db/view: add view_building_state and views_state db/system_keyspace: add method to get view build status map db/view: extract `system.view_build_status_v2` cql statements to system_keyspace db/system_keyspace: move `internal_system_query_state()` function earlier db/view: ignore tablet-based views in `view_builder` gms/feature_service: add VIEW_BUILDING_COORDINATOR feature	2025-08-29 17:28:44 +02:00
Łukasz Paszkowski	9809800aa8	repair_service: Add a facility to disable the service The repair service currently has two functions, stop() and shutdown(), that stop all ongoing repairs and prevent any further repairs from being started. The repair_service can be stopped only once. Once stopped, it cannot be restarted. We would like, however, to enable / disable the repair service many times. Similarly to compaction_manager, the repair service provides two new functions: - drain() - abort all ongoing local repair tasks and disable the service, i.e. no new local task will be scheduled and data received from the repair master is rejected. It is still possible, though, to schedule a global repair request - enable() - enable the service By default, the repair service is enabled immediately once started. For tablet-based keyspaces, the new facility prevents tablets from being repaired. Whenever the repair_service is disabled and a request to repair a tablet arrives, an exception is returned. Once the exception is thrown, the tablet is moved into the end_repair state and the operation will be retried later. Hence, disabling the service does not fail the global tablet repair request.	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	9539e80e54	compaction_manager: Subscribe to out of space controller	2025-08-29 14:56:07 +02:00
Łukasz Paszkowski	3d03b88719	database: Add critical_disk_utilization mode database can be moved to When the database operates in the critical disk utilization mode, all mutation writes (including inserts, updates, deletes, counter updates, hints, read+repair, LWT writes) to user tables and their associated tables, such as views, CDC log, and audit, are rejected, with a clear error exception returned. The mode is meant to be used with the disk space monitor in order to prevent any user writes when the node's disk utilization is too high.	2025-08-29 13:46:45 +02:00
Łukasz Paszkowski	3e740d25b5	disk_space_monitor: add subscription API for threshold-based disk space monitoring Introduce the `subscribe` method to disk_space_monitor, allowing clients to register callbacks triggered when disk utilization crosses a configurable threshold. The API supports flexible trigger options, including notifications on threshold crossing and direction (above/below). This enables more granular and efficient disk space monitoring for consumers.	2025-08-28 18:06:37 +02:00
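A minimal sketch of threshold-based subscription as described in the commit above. The names `threshold_monitor`, `crossing`, `subscribe`, and `update` are hypothetical and do not reflect the actual `disk_space_monitor` signatures; the sketch assumes utilization is a fraction in [0, 1] and that a callback fires only when a sample moves to the other side of the threshold.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical illustration only; not Scylla's disk_space_monitor.
enum class crossing { above, below };

class threshold_monitor {
    struct subscription {
        double threshold;                       // utilization in [0, 1]
        std::function<void(crossing)> callback; // invoked on each crossing
        bool is_above;                          // side of the threshold last seen
    };
    std::vector<subscription> _subs;
    double _utilization = 0.0;
public:
    // Register a callback fired whenever utilization crosses `threshold`.
    void subscribe(double threshold, std::function<void(crossing)> cb) {
        _subs.push_back({threshold, std::move(cb), _utilization > threshold});
    }
    // Feed a new utilization sample and notify the subscribers whose
    // threshold was crossed, reporting the direction of the crossing.
    void update(double utilization) {
        _utilization = utilization;
        for (auto& s : _subs) {
            const bool now_above = utilization > s.threshold;
            if (now_above != s.is_above) {
                s.is_above = now_above;
                s.callback(now_above ? crossing::above : crossing::below);
            }
        }
    }
};
```

Tracking the last observed side per subscription keeps repeated samples on the same side of the threshold from triggering duplicate notifications.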
Michał Jadwiszczak	233f4dcee3	db/view/view_building_worker: register staging sstable to view building coordinator when needed Change return type of `check_needs_view_update_path()`. Instead of returning a bool that tells whether to use the staging directory (and register to `view_update_generator`) or the normal directory, the function now returns an enum with possible values: - `normal_directory` - use normal directory for the sstable - `staging_directly_to_generator` - use staging directory and register to `view_update_generator` - `staging_managed_by_vbc` - use staging directory but don't register it to `view_update_generator` but create view building tasks for later The third option is new; it's used when the table has any view that is currently being built. In this case, registering it to `view_update_generator` prematurely may lead to base-view inconsistency (for example when a replica is in a pending state).	2025-08-27 10:23:03 +02:00
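A minimal sketch of the three-way result described in the commit above. The enumerator names come from the commit message; the enum name `sstable_destination`, the function `choose_destination`, and its two boolean inputs are assumptions made for illustration, and the real `check_needs_view_update_path()` takes different arguments and applies more conditions.

```cpp
// Enumerator names are taken from the commit message; everything else
// here is a hypothetical simplification.
enum class sstable_destination {
    normal_directory,              // no view updates needed
    staging_directly_to_generator, // stage and register to view_update_generator
    staging_managed_by_vbc,        // stage and create view building tasks for later
};

// Assumed decision order: a table with a view still being built defers
// to the view building coordinator instead of the generator.
sstable_destination choose_destination(bool needs_view_updates,
                                       bool any_view_being_built) {
    if (!needs_view_updates) {
        return sstable_destination::normal_directory;
    }
    if (any_view_being_built) {
        return sstable_destination::staging_managed_by_vbc;
    }
    return sstable_destination::staging_directly_to_generator;
}
```

Replacing the bool with an enum lets callers distinguish the two staging cases that a single true value used to conflate.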
Michał Jadwiszczak	651827cdab	db/view/view_building_worker: add method to register staging sstable The method will be used when a new staging sstable needs to go through the view building coordinator (the coordinator will decide when to process this staging sstable). Callers push new staging sstables to a queue and notify the async fiber to create `view_building_task`s from the sstables and commit them to group0.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	c9e710dca3	db/view: introduce `view_building_worker` The worker is responsible for building tablet-based views by executing tasks scheduled by the view building coordinator. It observes the view building state machine and waits on the machine's condition variable (so the worker is woken up when group0 state is applied). The tasks are executed in batches; all tasks in one batch need to have the same type, base_id, and table_id. One shard can only execute one batch at a time (at least for now; in the future we might want to change that). The worker keeps track of finished and failed tasks in its local state. The state is cleared when `view_building_state::currently_processed_base_table` is changed.	2025-08-27 10:22:59 +02:00
Michał Jadwiszczak	d2e1b6d44a	service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine` Those references are needed to manage view building tasks while a view is created/dropped.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	f2e7051a84	service: reload `view_building_state_machine` on group0 apply() The state may be also reloaded on `topology_change` or `mixed_change` because topology coordinator may change view building tasks during tablet operations.	2025-08-27 08:55:47 +02:00
Dawid Mędrek	dd5a35dc67	service/qos: Add auth::service to auth_integration The new service, `auth_integration`, has taken over the responsibility for managing effective service levels from `service_level_controller`. However, before these changes, it still accessed `auth::service` via the service level controller. Let's change that. Note that we also remove a check that `auth::service` has been initialized. It's not necessary anymore because the lifetime of `auth_integration` is strictly nested within the lifetime of `auth::service`. In actuality, `service_level_controller` should lose its reference to `auth::service` completely. All of the management of effective service levels has already been moved to `auth_integration`. However, the reference is still needed when dropping a distributed service level because we need to update the corresponding attribute for relevant roles. That should not lead to invalid accesses, though. Dropping a service level should not be possible when `auth::service` is not initialized.	2025-08-26 18:41:43 +02:00
Dawid Mędrek	e929279d74	service/qos: Reload effective SL cache conditionally Since `service_level_controller` outlives `auth_integration`, it may happen that we try to access it when it has already been deinitialized. To prevent that, we only try to reload or clear the effective service level cache when the object is still alive. These changes solve an existing problem with an invalid memory access. For more context, see issue scylladb/scylladb#24792. We provide a reproducer test that consistently fails before these changes but passes after them. Fixes scylladb/scylladb#24792	2025-08-26 18:41:40 +02:00
Dawid Mędrek	7d0086b093	service/qos: Introduce auth_integration We introduce a new type, `auth_integration`, that will be used internally by `service_level_controller`. Its purpose is to take over the responsibility for managing effective service levels. The main problem of the current implementation of service level controller is its dependency on `auth::service`, whose lifetime is strictly nested within the lifetime of service level controller. That may lead, and already has led, to invalid memory accesses; for an example, see issue scylladb/scylladb#24792. Our strategy is to split service level controller into smaller parts and ensure that we access `auth::service` only when it's valid to do so. This commit is the first step towards that. We don't change anything in the logic yet, just add the new type. Further adjustments will be made in following commits.	2025-08-26 18:41:34 +02:00
Pavel Emelyanov	4e556214ba	api: Squash (un)set_cache_service into ..._column_family The set_server_column_family() registers API handlers that work with replica::database. The set_server_cache() does the very same thing, but registers handlers with some other prefix. Squash the latter into former, later "cache" handlers will also make use of the database reference argument that's already available in ..._column_family() setter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-26 11:46:48 +03:00
Dawid Mędrek	837d267cbf	main: Log RF-rack-invalid keyspaces at startup When the configuration option `rf_rack_valid_keyspaces` is enabled and there is an RF-rack-invalid keyspace, starting a node fails. However, when the configuration option is disabled, but there still is a keyspace that violates the condition, we'd like Scylla to print a warning informing the user about the fact. That's what happens in this commit. We provide a validation test.	2025-08-21 19:35:33 +02:00
Botond Dénes	09dc285b4a	Merge 'Remove redis from scylla source tree' from Ran Regev - remove redis documentation First, remove the redis documentation. - remove ./redis and dependencies Second, remove the redis directory and its dependencies from the project. Fixes: #25144 This is a cleanup, no need to backport. Closes scylladb/scylladb#25148 * github.com:scylladb/scylladb: remove ./redis and dependencies remove redis documentation	2025-08-21 14:26:11 +03:00
Ran Regev	ebf1db5c5e	remove ./redis and dependencies Remove ./redis and all its usages. This is the second commit that removes ./redis from Scylla Signed-off-by: Ran Regev <ran.regev@scylladb.com>	2025-08-20 17:53:23 +03:00
Pavel Emelyanov	818a41ccdb	api: Capture sharded<database> for set_server_column_family() Similarly to other API handlers, instead of using a database from http context, patch the setting methods to capture the database from main code and pass it around to handlers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Botond Dénes	614d17347a	tombstone_gc: extract shared state into shared_tombstone_gc_state Instead of storing it partially in tombstone_gc and partially in an external map. Move all external parts into the new shared_tombstone_gc_state. This new class is responsible for keeping and updating the repair history. tombstone_gc_state just keeps const pointers to the shared state as before and is only responsible for querying the tombstone gc before times. This separation makes the code easier to follow and also enables further patching of tombstone_gc_state.	2025-08-11 07:09:14 +03:00
Patryk Jędrzejczak	3299ffba51	Merge 'raft_group0: split shutdown into abort-and-drain and destroy' from Petr Gusev Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`). However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This PR reworks the shutdown logic: * Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called. * Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup. The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods. Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`. Refs #25115 Fixes #24625 Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR. Closes scylladb/scylladb#25151 * https://github.com/scylladb/scylladb: raft_group0: split shutdown into abort_and_drain and destroy Revert "main.cc: fix group0 shutdown order"	2025-07-29 10:39:00 +02:00
Botond Dénes	f3ed27bd9e	Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov Nowadays the way to configure an internal service is 1. service declares its config struct 2. caller (main/test/tool) fills the respective config with values it wants 3. the service is started with the config passed by value The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config. For the reference: similar changes with other services: #23705 , #20174 , #19166 Closes scylladb/scylladb#25118 * github.com:scylladb/scylladb: gms,init: Move get_disabled_features_from_db_config() from gms code: Update callers generating feature service config gms: Make feature_config a simple struct gms: Split feature_config_from_db_config() into two	2025-07-29 08:17:49 +03:00
Petr Gusev	8b8b7adbe5	raft_group0: split shutdown into abort_and_drain and destroy Previously, raft_group0::abort() was called in storage_service::do_drain (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because raft::server depends on storage (via raft_sys_table_storage and group0_state_machine). However, this caused issues: services like sstable_dict_autotrainer and auth::service, which use group0_client but are not stopped by storage_service, could trigger use-after-free if raft_group0 was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This commit reworks the shutdown logic: * Introduces abort_and_drain(), which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see raft::stopped_error if they try to access group0 after abort_and_drain(). * Final destruction happens in a separate method destroy(), called later from main.cc. The raft_server_for_group::aborted is changed to a shared_future -- abort_server now returns a future so that we can wait for it in abort_and_drain(), it should return the future from the previous abort_server call, which can happen in the on_background_error callback. Node startup can fail before reaching storage_service, in which case ss.drain_on_shutdown() and abort_and_drain() are never called. To ensure proper cleanup, abort_and_drain() is called from main.cc before destroy(). Clients of raft_group_registry are expected to call destroy_server() for the servers they own. Currently, the only such client is raft_group0, which satisfies this requirement. As a result, raft_group_registry::stop_servers() is no longer needed. Instead, raft_group_registry::stop() now verifies that all servers have been properly destroyed. If any remain, it calls on_internal_error(). The call to drain_on_shutdown() in cql_test_env.cc appears redundant. 
The only source of raft::server instances in raft_group_registry is group0_service, and if group0_service.start() succeeds, both abort_and_drain() and destroy() are guaranteed to be called during shutdown.	2025-07-25 17:16:14 +02:00
Petr Gusev	ac4bc3f816	paxos_state: lazily create paxos state table We call paxos_store::ensure_initialized in the beginning of storage_proxy::cas to create a paxos state table for a user table if it doesn't exist. When the LWT coordinator sends RPCs to replicas, some of them may not yet have the paxos schema. In paxos_store::get_paxos_state_schema we just wait for them to appear, or throw 'no_such_column_family' if the base table was dropped.	2025-07-24 19:48:08 +02:00
Petr Gusev	6e87a6cdb0	paxos_state: extract state access functions into paxos_store Introduce paxos_store abstraction to isolate Paxos state access. Prepares for supporting either system.paxos or a co-located table as the storage backend.	2025-07-24 16:39:50 +02:00
Avi Kivity	e89f6c5586	config, main: make cpu scheduling mandatory CPU scheduling has been with us since `641aaba12c` (2017), and no one ever disables it. Likely nothing really works without it. Make it mandatory and mark the option unused. Closes scylladb/scylladb#24894	2025-07-22 12:39:01 +02:00
Pavel Emelyanov	8220974e76	code: Update callers generating feature service config Instead of requesting it from gms code, create it "by hand" with the help of get_disabled_features_from_db_config() method. This is how other services are configured by main/tools/testing code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-21 19:19:09 +03:00
Avi Kivity	c762425ea7	Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive. This PR addresses the issue in two ways: 1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case). 2) `passwords::check` is moved to a dedicated alien thread. Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it is not a good idea to fix or use it. - SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512. - MD5 is no longer considered secure for password hashing. Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce frequent lock contention, but in practice each shard handles only a few hundred new connections per second, even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server`, which limits the number of concurrent connection handlers. Fixes https://github.com/scylladb/scylladb/issues/24524 Backport not needed, as it is a new feature. 
Closes scylladb/scylladb#24924 * github.com:scylladb/scylladb: main: utils: add thread names to alien workers auth: move passwords::check call to alien thread test: wait for 3 clients with given username in test_service_level_api auth: refactor password checking in password_authenticator auth: make SHA-512 the only password hashing scheme for new passwords auth: whitespace change in identify_best_supported_scheme() auth: require scheme as parameter for `generate_salt` auth: check password hashing scheme support on authenticator start	2025-07-16 13:15:54 +03:00

1524 Commits