scylladb

Author	SHA1	Message	Date
Pavel Emelyanov	818a41ccdb	api: Capture sharded<database> for set_server_column_family() Similarly to other API handlers, instead of using a database from http context, patch the setting methods to capture the database from main code and pass it around to handlers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Botond Dénes	614d17347a	tombstone_gc: extract shared state into shared_tombstone_gc_state Instead of storing it partially in tombstone_gc and partially in an external map. Move all external parts into the new shared_tombstone_gc_state. This new class is responsible for keeping and updating the repair history. tombstone_gc_state just keeps const pointers to the shared state as before and is only responsible for querying the tombstone gc before times. This separation makes the code easier to follow and also enables further patching of tombstone_gc_state.	2025-08-11 07:09:14 +03:00
Patryk Jędrzejczak	3299ffba51	Merge 'raft_group0: split shutdown into abort-and-drain and destroy' from Petr Gusev Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`). However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This PR reworks the shutdown logic: * Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called. * Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup. The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods. Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`. Refs #25115 Fixes #24625 Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR. Closes scylladb/scylladb#25151 * https://github.com/scylladb/scylladb: raft_group0: split shutdown into abort_and_drain and destroy Revert "main.cc: fix group0 shutdown order"	2025-07-29 10:39:00 +02:00
Botond Dénes	f3ed27bd9e	Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov Nowadays the way to configure an internal service is 1. service declares its config struct 2. caller (main/test/tool) fills the respective config with values it wants 3. the service is started with the config passed by value The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config. For the reference: similar changes with other services: #23705 , #20174 , #19166 Closes scylladb/scylladb#25118 * github.com:scylladb/scylladb: gms,init: Move get_disabled_features_from_db_config() from gms code: Update callers generating feature service config gms: Make feature_config a simple struct gms: Split feature_config_from_db_config() into two	2025-07-29 08:17:49 +03:00
Petr Gusev	8b8b7adbe5	raft_group0: split shutdown into abort_and_drain and destroy Previously, raft_group0::abort() was called in storage_service::do_drain (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because raft::server depends on storage (via raft_sys_table_storage and group0_state_machine). However, this caused issues: services like sstable_dict_autotrainer and auth::service, which use group0_client but are not stopped by storage_service, could trigger use-after-free if raft_group0 was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This commit reworks the shutdown logic: * Introduces abort_and_drain(), which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see raft::stopped_error if they try to access group0 after abort_and_drain(). * Final destruction happens in a separate method destroy(), called later from main.cc. The raft_server_for_group::aborted is changed to a shared_future -- abort_server now returns a future so that we can wait for it in abort_and_drain(), it should return the future from the previous abort_server call, which can happen in the on_background_error callback. Node startup can fail before reaching storage_service, in which case ss.drain_on_shutdown() and abort_and_drain() are never called. To ensure proper cleanup, abort_and_drain() is called from main.cc before destroy(). Clients of raft_group_registry are expected to call destroy_server() for the servers they own. Currently, the only such client is raft_group0, which satisfies this requirement. As a result, raft_group_registry::stop_servers() is no longer needed. Instead, raft_group_registry::stop() now verifies that all servers have been properly destroyed. If any remain, it calls on_internal_error(). The call to drain_on_shutdown() in cql_test_env.cc appears redundant. The only source of raft::server instances in raft_group_registry is group0_service, and if group0_service.start() succeeds, both abort_and_drain() and destroy() are guaranteed to be called during shutdown.	2025-07-25 17:16:14 +02:00
Petr Gusev	ac4bc3f816	paxos_state: lazily create paxos state table We call paxos_store::ensure_initialized in the beginning of storage_proxy::cas to create a paxos state table for a user table if it doesn't exist. When the LWT coordinator sends RPCs to replicas, some of them may not yet have the paxos schema. In paxos_store::get_paxos_state_schema we just wait for them to appear, or throw 'no_such_column_family' if the base table was dropped.	2025-07-24 19:48:08 +02:00
Petr Gusev	6e87a6cdb0	paxos_state: extract state access functions into paxos_store Introduce paxos_store abstraction to isolate Paxos state access. Prepares for supporting either system.paxos or a co-located table as the storage backend.	2025-07-24 16:39:50 +02:00
Avi Kivity	e89f6c5586	config, main: make cpu scheduling mandatory CPU scheduling has been with us since `641aaba12c` (2017), and no one ever disables it. Likely nothing really works without it. Make it mandatory and mark the option unused. Closes scylladb/scylladb#24894	2025-07-22 12:39:01 +02:00
Pavel Emelyanov	8220974e76	code: Update callers generating feature service config Instead of requesting it from gms code, create it "by hand" with the help of get_disabled_features_from_db_config() method. This is how other services are configured by main/tools/testing code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-21 19:19:09 +03:00
Avi Kivity	c762425ea7	Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive. This PR addresses the issue in two ways: 1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case). 2) `passwords::check` is moved to a dedicated alien thread. Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it not good idea to fix or use it. - SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512. - MD5 is no longer considered secure for password hashing. Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers. Fixes https://github.com/scylladb/scylladb/issues/24524 Backport not needed, as it is a new feature. Closes scylladb/scylladb#24924 * github.com:scylladb/scylladb: main: utils: add thread names to alien workers auth: move passwords::check call to alien thread test: wait for 3 clients with given username in test_service_level_api auth: refactor password checking in password_authenticator auth: make SHA-512 the only password hashing scheme for new passwords auth: whitespace change in identify_best_supported_scheme() auth: require scheme as parameter for `generate_salt` auth: check password hashing scheme support on authenticator start	2025-07-16 13:15:54 +03:00
Andrzej Jackowski	77a9b5919b	main: utils: add thread names to alien workers This commit adds a call to `pthread_setname_np` in `alien_worker::spawn`, so each alien worker thread receives a descriptive name. This makes debugging, monitoring, and performance analysis easier by allowing alien workers to be clearly identified in tools such as `perf`.	2025-07-15 23:29:21 +02:00
Andrzej Jackowski	9574513ec1	auth: move passwords::check call to alien thread Analysis of customer stalls showed that the `detail::hash_with_salt` function, called from `passwords::check`, often blocks the reactor. This function internally uses the `crypt_r` function from an external library to compute password hashes, which is a CPU-intensive operation. To prevent such reactor stalls, this commit moves the `passwords::check` call to a dedicated alien thread. This thread is created at system startup and is shared by all shards. Within the alien thread, an `std::mutex` synchronizes access between the thread and the shards. While this could theoretically cause frequent lock contentions, in practice, even during connection storms, the number of new connections per second per shard is limited (typically hundreds per second). Additionally, the `_conns_cpu_concurrency_semaphore` in `generic_server` ensures that not too many connections are processed at once. Fixes scylladb/scylladb#24524	2025-07-15 23:29:13 +02:00
Avi Kivity	6fce817aa8	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft. Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit(). Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Fixes https://github.com/scylladb/scylladb/issues/24531 Closes scylladb/scylladb#24886 [avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>] * github.com:scylladb/scylladb: test: add type creation to test_snapshot storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during shema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-07-13 20:47:55 +03:00
Benny Halevy	3feb759943	everywhere: use utils::chunked_vector for list of mutations Currently, we use std::vector<*mutation> to keep a list of mutations for processing. This can lead to large allocation, e.g. when the vector size is a function of the number of tables. Use a chunked vector instead to prevent oversized allocations. `perf-simple-query --smp 1` results obtained for fixed 400MHz frequency and PGO disabled: Before (read path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 89055.97 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 18003 cycles/op, 0 errors) 103372.72 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39380 insns/op, 17300 cycles/op, 0 errors) 98942.27 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39413 insns/op, 17336 cycles/op, 0 errors) 103752.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39407 insns/op, 17252 cycles/op, 0 errors) 102516.77 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39403 insns/op, 17288 cycles/op, 0 errors) throughput: mean= 99528.13 standard-deviation=6155.71 median= 102516.77 median-absolute-deviation=3844.59 maximum=103752.93 minimum=89055.97 instructions_per_op: mean= 39403.99 standard-deviation=14.25 median= 39406.75 median-absolute-deviation=9.30 maximum=39416.63 minimum=39380.39 cpu_cycles_per_op: mean= 17435.81 standard-deviation=318.24 median= 17300.40 median-absolute-deviation=147.59 maximum=18002.53 minimum=17251.75 ``` After (read path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 59755.04 tps ( 66.2 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39466 insns/op, 22834 cycles/op, 0 errors) 71854.16 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 17883 cycles/op, 0 errors) 82149.45 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39411 insns/op, 17409 cycles/op, 0 errors) 49640.04 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 19975 cycles/op, 0 errors) 54963.22 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 18235 cycles/op, 0 errors) throughput: mean= 63672.38 standard-deviation=13195.12 median= 59755.04 median-absolute-deviation=8709.16 maximum=82149.45 minimum=49640.04 instructions_per_op: mean= 39448.38 standard-deviation=31.60 median= 39466.17 median-absolute-deviation=25.75 maximum=39474.12 minimum=39411.42 cpu_cycles_per_op: mean= 19267.01 standard-deviation=2217.03 median= 18234.80 median-absolute-deviation=1384.25 maximum=22834.26 minimum=17408.67 ``` `perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency and PGO disabled: Before (write path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 63736.96 tps ( 59.4 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 49667 insns/op, 19924 cycles/op, 0 errors) 64109.41 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 49992 insns/op, 20084 cycles/op, 0 errors) 56950.47 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50005 insns/op, 20501 cycles/op, 0 errors) 44858.42 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50014 insns/op, 21947 cycles/op, 0 errors) 28592.87 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50027 insns/op, 27659 cycles/op, 0 errors) throughput: mean= 51649.63 standard-deviation=15059.74 median= 56950.47 median-absolute-deviation=12087.33 maximum=64109.41 minimum=28592.87 instructions_per_op: mean= 49941.18 standard-deviation=153.76 median= 50005.24 median-absolute-deviation=73.01 maximum=50027.07 minimum=49667.05 cpu_cycles_per_op: mean= 22023.01 standard-deviation=3249.92 median= 20500.74 median-absolute-deviation=1938.76 maximum=27658.75 minimum=19924.32 ``` After (write path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 53395.93 tps ( 59.4 allocs/op, 16.5 logallocs/op, 14.3 tasks/op, 50326 insns/op, 21252 cycles/op, 0 errors) 46527.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50704 insns/op, 21555 cycles/op, 0 errors) 55846.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50731 insns/op, 21060 cycles/op, 0 errors) 55669.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50735 insns/op, 21521 cycles/op, 0 errors) 52130.17 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50757 insns/op, 21334 cycles/op, 0 errors) throughput: mean= 52713.91 standard-deviation=3795.38 median= 53395.93 median-absolute-deviation=2955.40 maximum=55846.30 minimum=46527.83 instructions_per_op: mean= 50650.57 standard-deviation=182.46 median= 50731.38 median-absolute-deviation=84.09 maximum=50756.62 minimum=50325.87 cpu_cycles_per_op: mean= 21344.42 standard-deviation=202.86 median= 21334.00 median-absolute-deviation=176.37 maximum=21554.61 minimum=21060.24 ``` Fixes #24815 Improvement for rare corner cases. No backport required Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24919	2025-07-13 19:13:11 +03:00
Marcin Maliszkiewicz	15b4db47c7	storage_service: always wake up load balancer on update tablet metadata Lack of wakeup is error-prone, as it relies on a wakeup occurring elsewhere.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	fa157e7e46	db: service: add store_service dependency to schema_applier There is already implicit logical dependency via migration_notifier but in the next commits we'll be moving store_service out from it as we need better control (i.e. return a value from the call).	2025-07-10 10:40:43 +02:00
Marcin Maliszkiewicz	847d7f4a3a	service: simplify load_tablet_metadata and update_tablet_metadata - remove load_tablet_metadata(), instead we add wake_up_load_balancer flag to update_tablet_metadata(), it reduces number of public functions and also serves as a comment (removed comment with very similar meaning) - reimplement the code to not use mutate_token_metadata(), this way it's more readable and it's also needed as we'll split update_tablet_metadata() in following commits so that we can have subroutine which doesn't yield (for ensuring atomicity)	2025-07-10 10:40:43 +02:00
Pawel Pery	7bf53fc908	vector_store_client: implement initial vector_store_client service This patch is a part of vector_store_client sharded service implementation for a communication with vector-store service. It adds a `services/vector_store_client.{cc\|hh}` sharded service and a configuration parameter `vector_store_uri` with a `http://vector-store.dns.name:port` format. If there will be an error during parsing that parameter there will be an exception during construction. For the future unit testing purposes the patch adds `vector_store_client_tester` as a way to inject mockup functionality. This service will be used by the select statements for the Vector search indexes (see VS-46). For this reason I've added vector_store_client service in the query processor. Reference: VS-47 VS-45	2025-07-08 16:29:55 +02:00
Piotr Dulikowski	62efe6616a	Merge 'mapreduce: add tablet-aware dispatching algorithm' from Andrzej Jackowski The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is <host, shard> pair) processes two tablets at a time. The new algorithm can be summarized as follows: 1. Prepare a tablet_replica -> partition_range mapping where the values cover the entire space. 2. For each tablet_replica, in parallel, take two partition ranges and dispatch them to the node hosting the replica. The ERM is released and re-acquired in each iteration, allowing the destination (i.e., tablet_replica) to change for each artition range (in such cases, the partition range is assigned to the appropriate tablet_replica). In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host. In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch. In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range. Additionally, shard_id is added to mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensure that the balance is preserved, even during events such as tablet splits. This patch series: - Refactors a bit mapreduce service, to facilitate having two algorithm versions (one for vnodes and one for tablets). - Implements tablet-aware dispatching algorithm. - Adds shard_id to mapreduce request and uses the information to handle requests entirely by selected shard. - Adds test_long_query_timeout_erm to verify the new functionality. Fixes: scylladb#21831 No backport, as it is rather new feature than a bugfix. Closes scylladb/scylladb#24383 * github.com:scylladb/scylladb: mapreduce: add missing comma and space in mapreduce_request operator<< mapreduce: add shard_id_hint to mapreduce request test: add test_long_query_timeout_erm mapreduce: add tablet-aware dispatching algorithm storage_proxy: make storage_proxy::is_alive public mapreduce: remove _shared_token_metadata from mapreduce_service mapreduce: move dispatching logic to dispatch_to_vnodes mapreduce: remove underscores from variable names mapreduce: move req_with_modified_pr handling to a new function mapreduce: change next_vnode lambda to get_next_partition_range function	2025-06-26 12:25:39 +02:00
Andrzej Jackowski	9dbb1468b4	mapreduce: remove _shared_token_metadata from mapreduce_service Before this change, `mapreduce_service` used `_shared_token_metadata` to get the topology. However, the token was used in a part of the code that already had its own ERM with its own metadata token. Moreover, as mapreduce_service's token and ERM's token are not guaranteed to be the same, inconsistencies could occur. Therefore, this commit removes `_shared_token_metadata` and its usage.	2025-06-25 08:42:16 +02:00
Marcin Maliszkiewicz	97c60b8153	main: don't start maintenance auth service if not enabled In `f96d30c2b5` we introduced the maintenance service, which is an additional instance of auth::service. But this service has a somewhat confusing 2-level startup mechanism: it's initialized with sharded<Service>::start and then auth::service::start (different method with the same name to confuse even more). When maintenance_socket was disabled (default setting), the code did only the first part of the startup. This registered a config observer but didn't create a permission_cache instance. As a result, a crash on SIGHUP when config is reloaded can occur.	2025-06-18 11:27:08 +02:00
Avi Kivity	cd79a8fc25	Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz" This reverts commit `0b516da95b`, reversing changes made to `30199552ac`. It breaks cluster.random_failures.test_random_failures.test_random_failures in debug mode (at least). Fixes #24513	2025-06-16 22:38:12 +03:00
Marcin Maliszkiewicz	2090e44283	storage_service: always wake up load balancer on update tablet metadata Lack of wakeup is error-prone, as it relies on a wakeup occurring elsewhere.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	92e3d69f79	db: service: add store_service dependency to schema_applier There is already implicit logical dependency via migration_notifier but in the next commits we'll be moving store_service out from it as we need better control (i.e. return a value from the call).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	1c8fd3a65d	service: simplify load_tablet_metadata and update_tablet_metadata - remove load_tablet_metadata(), instead we add wake_up_load_balancer flag to update_tablet_metadata(), it reduces number of public functions and also serves as a comment (removed comment with very similar meaning) - reimplement the code to not use mutate_token_metadata(), this way it's more readable and it's also needed as we'll split update_tablet_metadata() in following commits so that we can have subroutine which doesn't yield (for ensuring atomicity)	2025-06-06 08:50:33 +02:00
Aleksandra Martyniuk	9c03255fd2	cql_test_env: main: move stream_manager initialization Currently, stream_manager is initialized after storage_service and so it is stopped before the storage_service is. In its stop method storage_service accesses stream_manager which is uninitialized at a time. Move stream_manager initialization over the storage_service initialization. Fixes: #23207. Closes scylladb/scylladb#24008	2025-05-15 17:17:35 +03:00
Botond Dénes	4a802baccb	Merge 'compress: make sstable compression dictionaries NUMA-aware ' from Michał Chojnowski compress: distribute compression dictionaries over shards We don't want each shard to have its own copy of each dictionary. It would unnecessary pressure on cache and memory. Instead, we want to share dictionaries between shards. Before this commit, all dictionaries live on shard 0. All other shards borrow foreign shared pointers from shard 0. There's a problem with this setup: dictionary blobs receive many random accesses. If shard 0 is on a remote NUMA node, this could pose a performance problem. Therefore, for each dictionary, we would like to have one copy per NUMA node, not one copy per the entire machine. And each shard should use the copy belonging to its own NUMA node. This is the main goal of this patch. There is another issue with putting all dicts on shard 0: it eats an assymetric amount of memory from shard 0. This commit spreads the ownership of dicts over all shards within the NUMA group, to make the situation more symmetric. (Dict owner is decided based on the hash of dict contents). It should be noted that the last part isn't necessarily a good thing, though. While it makes the situation more symmetric within each node, it makes it less symmetric across the cluster, if different node sizes are present. If dicts occupy 1% of memory on each shard of a 100-shard node, then the same dicts would occupy 100% of memory on a 1-shard node. So for the sake of cluster-wide symmetry, we might later want to consider e.g. making the memory limit for dictionaries inversely proportional to the number of shards. New functionality, added to a feature which isn't in any stable branch yet. No backporting. Closes scylladb/scylladb#23590 * github.com:scylladb/scylladb: test: add test/boost/sstable_compressor_factory_test compress: add some test-only APIs compress: rename sstable_compressor_factory_impl to dictionary_holder compress: fix indentation compress: remove sstable_compressor_factory_impl::_owner_shard compress: distribute compression dictionaries over shards test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version test: remove sstables::test_env::do_with()	2025-05-08 09:52:46 +03:00
Petr Gusev	e6c3f954f6	main: check if current process group controls stdin tty test.py doesn't override stdin when starting Scylla, so when tests are run from a terminal, isatty() returns true and parsed command line output is not printed, which is inconvenient. In this commit we add a check if the current process group controls the stdin terminal. This serves two purposes: * improves the "interactive mode" check from #scylladb/scylladb#18309, as only the controlling process group can interact with the terminal. * solves the test.py problem above, because test.py runs scylla in a new session/process group (it calls setsid after fork), and is now correctly not considered interactive. Closes scylladb/scylladb#24047	2025-05-08 06:52:48 +03:00
Michał Chojnowski	1bcf77951c	compress: distribute compression dictionaries over shards We don't want each shard to have its own copy of each dictionary. It would unnecessary pressure on cache and memory. Instead, we want to share dictionaries between shards. Before this commit, all dictionaries live on shard 0. All other shards borrow foreign shared pointers from shard 0. There's a problem with this setup: dictionary blobs receive many random accesses. If shard 0 is on a remote NUMA node, this could pose a performance problem. Therefore, for each dictionary, we would like to have one copy per NUMA node, not one copy per the entire machine. And each shard should use the copy belonging to its own NUMA node. This is the main goal of this patch. There is another issue with putting all dicts on shard 0: it eats an assymetric amount of memory from shard 0. This commit spreads the ownership of dicts over all shards within the NUMA group, to make the situation more symmetric. (Dict owner is decided based on the hash of dict contents). It should be noted that the last part isn't necessarily a good thing, though. While it makes the situation more symmetric within each node, it makes it less symmetric across the cluster, if different node sizes are present. If dicts occupy 1% of memory on each shard of a 100-shard node, then the same dicts would occupy 100% of memory on a 1-shard node. So for the sake of cluster-wide symmetry, we might later want to consider e.g. making the memory limit for dictionaries inversely proportional to the number of shards.	2025-05-07 14:43:18 +02:00
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has a tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Piotr Dulikowski	ce2fab7cce	main: make dc and rack immutable after bootstrap Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has a tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278	2025-04-17 16:22:26 +02:00
Tomasz Grabiec	0b9a75d7b6	virtual-tables: Introduce system.load_per_node Can be used to query per-node stats about load as seen by the load balancer. In particular, node's capacity will be used by tablet-mon.py to scale tablet columns so that equal height is equal node utilization.	2025-04-09 20:21:51 +02:00
Benny Halevy	f702adf6a5	main: fix typo in tablet allocator checkpoint message Inroduced in `b6705ad48b` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23211	2025-04-08 17:19:41 +03:00
Kefu Chai	0cd6cf1dc5	main: Remove unused member variable `_sys_ks` Fixes a Clang error by removing the unused private field `sstable_dict_deleter::_sys_ks` that was flagged with: [-Werror,-Wunused-private-field] ``` /home/kefu/.local/bin/clang++ -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DSCYLLA_BUILD_MODE=release -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/build -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/RelWithDebInfo/seastar/gen/include -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -I/usr/include/p11-kit-1 -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -flto=thin -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/= -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -ffile-prefix-map=/home/kefu/dev/scylladb/build/=build -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -ffat-lto-objects -std=gnu++23 -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_LOGGER_TYPE_STDOUT -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -MD -MT CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -MF CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o.d -o CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -c /home/kefu/dev/scylladb/main.cc /home/kefu/dev/scylladb/main.cc:1660:38: error: private field '_sys_ks' is not used [-Werror,-Wunused-private-field] 1660 \| db::system_keyspace& _sys_ks; \| ^ ``` The member variable is not referenced anywhere in the code, so removing it improves maintainability without affecting functionality. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23545	2025-04-02 20:07:39 +03:00
Pavel Emelyanov	2ee9cec1d3	Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Also, `object_storage_config_file` options is moved to a deprecated state as it's no longer needed. This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f Refs https://github.com/scylladb/scylladb/issues/22428 Closes scylladb/scylladb#22952 * github.com:scylladb/scylladb: Remove db::config::object_storage_config Move `object_storage.yaml` endpoints to `scylla.yaml`	2025-04-01 16:01:44 +03:00
Michał Chojnowski	3f7969313f	main: run a sstable_dict_autotrainer Create an instance of `sstable_dict_autotrainer` in `scylla_main` and run it.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	bea866a46f	main: clean up sstable compression dicts after table drops When a table is dropped, its corresponding dictionary in `system.dicts` -- if any -- should be deleted, otherwise it will remain forever as garbage. This commit implements such cleanup.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	58ae278d10	api: add the retrain_dict API call Add an API call which will retrain the SSTable compression dictionary for a given table. Currently, it needs all nodes to be alive to succeed. We can relax this later.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	94d244ab49	main: in compression_dict_updated_callback, recognize and use SSTable compression dicts Currently, there is at most one dictionary in `system.dicts`: named "general", used by RPC compression. So the callback called on `system.dicts` just always refreshes the RPC compression dict. In a follow-up commit, we will publish SSTable compression dicts to `system.dicts` rows with a name in the "sstables/{table_uuid}" format. We want modification to such rows to be passed as new dictionary recommendations to the SSTable compressor factory. This commit teaches the `system.dicts` modification callback to recognize such modifications and forward them to the compressor factory.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	4856f4acca	db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict Extend the `system.dicts` helper for querying and modifying `system.dicts` with an ability to use names other than "general". We will use that in later commits to publish dictionaries for SSTable compression.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	b77c611c00	raft/group0_state_machine: on `system.dicts` mutations, pass the affected partitition keys to the callback Before this patch, `system.dicts` contains only one dictionary, for RPC compression, with the fixed name "general". In later parts of this series, we will add more dictionaries to system.dicts, one per table, for SSTable compression. To enable that, this patch adjusts the callback mechanism for group0's `write_mutations` command, so that the mutation callbacks for group0-managed tables can see which partition keys were affected. This way, the callbacks can query only the modified partitions instead of doing a full scan. (This is necessary to prevent quadratic behaviours.) For now, only the `system.dicts` callback uses the partition keys.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	30a9d471fa	sstables: plug an `sstable_compressor_factory` into `sstables_manager` Create a `sstable_compressor_factory_impl` in `scylla_main`, and pipe it through constructors into `sstables_manager`. In next commits, the factory available through the `sstables_manager` will be used to create compressors for SSTable readers and writers.	2025-04-01 00:07:28 +02:00
Robert Bindar	b647196121	Remove db::config::object_storage_config That map became redundant once we added object_storage_endpoints in the config, this patch removes it and switches all the user code to use the new option. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 17:15:12 +03:00
Robert Bindar	e3a3508960	Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 13:39:39 +03:00
Pavel Emelyanov	1da889f239	Merge 'Allow abort during join_cluster' from Benny Halevy Bootstrap or replace can take a long time, but since `feef7d3fa1`, the stop_signal is checked only in checkpoints, and in particular, abort isn't requested during join_cluster. Fixes #23222 * requires backport on top of https://github.com/scylladb/scylladb/pull/23184 Closes scylladb/scylladb#23306 * github.com:scylladb/scylladb: main: allow abort during join_cluster main: add checkpoint before joining cluster storage_service: add start_sys_dist_ks	2025-03-26 15:48:58 +03:00
Avi Kivity	7646e1448a	Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek This PR is an introductory step towards enforcing RF-rack-valid keyspaces in Scylla. The scope of changes: * defining RF-rack-valid keyspaces, * introducing a configuration option enforcing RF-rack-valid keyspaces, * restricting the CREATE and ALTER KEYSPACE statements so that they never lead to RF-rack invalid keyspaces, * during the initialization of a node, it verifies that all existing keyspaces are RF-rack-valid. If not, the initialization fails. We provide tests verifying that the changes behave as intended. --- Note that there are a number of things that still need to be implemented. That includes, for instance, restricting topology operations too. --- Implementation strategy (going beyond the scope of this PR): 1. Introduce the new configuration option `rf_rack_valid_keyspaces`. 2. Start enforcing RF-rack-validity in keyspaces if the option is enabled. 3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests. 4. Once the tests have been adjusted, change the default value of the option to enabled. 5. Stop explicitly enabling the option in tests. 6. Get rid of the option. --- Fixes scylladb/scylladb#20356 Fixes scylladb/scylladb#23276 Fixes scylladb/scylladb#23300 --- Backport: this is part of the requirements for releasing 2025.1. Closes scylladb/scylladb#23138 * github.com:scylladb/scylladb: main: Refuse to start node when RF-rack-invalid keyspace exists cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces db/config: Introduce RF-rack-valid keyspaces	2025-03-20 19:10:36 +02:00
Dawid Mędrek	0e04a6f3eb	main: Refuse to start node when RF-rack-invalid keyspace exists When a node is started with the option `rf_rack_valid_keyspaces` enabled, the initialization will fail if there is an RF-rack-invalid keyspace. We want to force the user to adjust their existing keyspaces when upgrading to 2025.* so that the invariant that every keyspace is RF-rack-valid is always satisfied. Fixes scylladb/scylladb#23300	2025-03-19 15:13:44 +01:00
Piotr Dulikowski	2ca1c0b6f9	Merge 'introduce the new Raft-based recovery procedure for group 0 majority loss' from Patryk Jędrzejczak This PR introduces the new Raft-based recovery procedure for group 0 majority loss. The Raft-based recovery procedure works with tablets. The old gossip-based recovery procedure does not because we have no code for tablet migrations after the gossip-based topology changes. The Raft-based procedure requires the Raft-based topology to be enabled in the cluster. If the Raft-based topology is not enabled, the gossip-based procedure must be used. We will be able to get rid of the gossip-based procedure when we make the Raft-based topology mandatory (we can do both in the same version, 2025.2 is the plan). Before we do it, we will have to keep both procedures and explain when each of them should be used. The idea behind the new procedure is to recreate group 0 without touching the topology structures. Once we create a new group 0, we can remove all dead nodes using the standard `removenode` and `replace` operations. For the procedure to be safe, we must ensure that each member of the new group 0 moves to the same initial group 0 state. Also, the only safe choice for the state is the latest persistent state available among the live nodes. The solution to the problem above is to ensure that the leader of the new group 0 (called the recovery leader) is one of the nodes with the latest state available. Other members will receive the snapshot from the recovery leader when they join the new group 0 and move to its state. Below is the shortened description of the new recovery procedure from the perspective of the administrator. For the full description, refer to the design document. 1. Find the set of live nodes. 2. Kill any live node that shouldn't be a member of the new group 0. 3. Ensure the full network connectivity between live nodes. 4. Rolling restart live nodes to ensure they are healthy and ready for recovery. 5. Check if some data could have been lost. If yes, restore it from backup after the recovery procedure. 6. Find the recovery leader (the node with the largest `group0_state_id`). 7. Remove `raft_group_id` from `system.scylla_local` and truncate `system.discovery` on each live node. 8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the recovery leader on each live node. 9. Rolling restart all live nodes, but the recovery leader must be restarted first. 10. Remove all dead nodes using `removenode` or `replace`. 11. Unset `recovery_leader` on all nodes. 12. Delete data of the old group 0 from `system.raft`, `system.raft_snaphots`, and `system.raft_snapshot_config`. In the future, we could automate some of these steps or even introduce a tool that will do all (or most) of them by itself. For now, we are fine with a procedure that is reliable and simple enough. This PR makes using 2025.1 with tablets much safer. We want to backport it to 2025.1. We will also want to backport a few follow-ups. Fixes scylladb/scylladb#20657 Closes scylladb/scylladb#22286 * github.com:scylladb/scylladb: test: mark tests with the gossip-based recovery procedure test: add tests for the Raft-based recovery procedure test: topology: util: fix the tokens consistency check for left nodes test: topology: util: extend start_writes gossip: allow group 0 ID mismatch in the Raft-based recovery procedure raft_group0: modify_raft_voter_status: do not add new members treewide: allow recreating group 0 in the Raft-based recovery procedure	2025-03-18 19:10:56 +01:00
Calle Wilund	1525cb2dba	main/commitlog: wait for file deletion and distribute recycled segments to shards Refs #23017 When replaying a large commitlog from an unclean node, we can cause shard 0 db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with diskspace. But deleting segments early here is also wasteful; A better solution is to simply give the segments to all CL shards, thus distributing the available space. v2: * Do segement distribution using ranges. go c++23	2025-03-17 12:09:00 +00:00
Benny Halevy	41f02c521d	main: allow abort during join_cluster Bootstrap or replace can take a long time, but since `feef7d3fa1`, the stop_signal is checked only in checkpoints, and in particular, abort isn't requested during join_cluster. Fixes #23222 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-16 12:21:15 +02:00

1 2 3 4 5 ...

1484 Commits