scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-22 17:40:34 +00:00

Author	SHA1	Message	Date
Avi Kivity	66173c06a3	Merge 'Eradicate the ability to create new sstables with numerical sstable generation' from Benny Halevy Remove support for generating numerical sstable generation for new sstables. Loading such sstables is still supported but new sstables are always created with a uuid generation. This is possible since: * All live versions (since 5.4 / `f014ccf369`) now support uuid sstable generations. * The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / `6da758d74c`) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`. Fixes #24248 * Enhancement, no backport needed Closes scylladb/scylladb#24512 * github.com:scylladb/scylladb: streaming: stream_blob: use the table sstable_generation_generator replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator sstables: sstable_generation_generator: stop tracking highest generation replica: table: get rid of update_sstables_known_generation sstables: sstable_directory: stop tracking highest_generation replica: distributed_loader: stop tracking highest_generation sstables: sstable_generation: get rid of uuid_identifiers bool class sstables_manager: drop uuid_sstable_identifiers feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set test: cql_query_test: add test_sstable_load_mixed_generation_type test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils test: database_test: move table_dir helper to test/lib/test_utils	2025-08-14 11:54:33 +03:00
Botond Dénes	614d17347a	tombstone_gc: extract shared state into shared_tombstone_gc_state Instead of storing it partially in tombstone_gc and partially in an external map. Move all external parts into the new shared_tombstone_gc_state. This new class is responsible for keeping and updating the repair history. tombstone_gc_state just keeps const pointers to the shared state as before and is only responsible for querying the tombstone gc before times. This separation makes the code easier to follow and also enables further patching of tombstone_gc_state.	2025-08-11 07:09:14 +03:00
Benny Halevy	7c9ce235d7	test: database_test: move table_dir helper to test/lib/test_utils It's a generic helper that can be used by all tests. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-07 12:04:23 +03:00
Patryk Jędrzejczak	3299ffba51	Merge 'raft_group0: split shutdown into abort-and-drain and destroy' from Petr Gusev Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`). However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This PR reworks the shutdown logic: * Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called. * Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup. The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods. Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`. Refs #25115 Fixes #24625 Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR. Closes scylladb/scylladb#25151 * https://github.com/scylladb/scylladb: raft_group0: split shutdown into abort_and_drain and destroy Revert "main.cc: fix group0 shutdown order"	2025-07-29 10:39:00 +02:00
Botond Dénes	f3ed27bd9e	Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov Nowadays the way to configure an internal service is 1. service declares its config struct 2. caller (main/test/tool) fills the respective config with values it wants 3. the service is started with the config passed by value The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config. For the reference: similar changes with other services: #23705 , #20174 , #19166 Closes scylladb/scylladb#25118 * github.com:scylladb/scylladb: gms,init: Move get_disabled_features_from_db_config() from gms code: Update callers generating feature service config gms: Make feature_config a simple struct gms: Split feature_config_from_db_config() into two	2025-07-29 08:17:49 +03:00
Petr Gusev	8b8b7adbe5	raft_group0: split shutdown into abort_and_drain and destroy Previously, raft_group0::abort() was called in storage_service::do_drain (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because raft::server depends on storage (via raft_sys_table_storage and group0_state_machine). However, this caused issues: services like sstable_dict_autotrainer and auth::service, which use group0_client but are not stopped by storage_service, could trigger use-after-free if raft_group0 was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This commit reworks the shutdown logic: * Introduces abort_and_drain(), which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see raft::stopped_error if they try to access group0 after abort_and_drain(). * Final destruction happens in a separate method destroy(), called later from main.cc. The raft_server_for_group::aborted is changed to a shared_future -- abort_server now returns a future so that we can wait for it in abort_and_drain(), it should return the future from the previous abort_server call, which can happen in the on_background_error callback. Node startup can fail before reaching storage_service, in which case ss.drain_on_shutdown() and abort_and_drain() are never called. To ensure proper cleanup, abort_and_drain() is called from main.cc before destroy(). Clients of raft_group_registry are expected to call destroy_server() for the servers they own. Currently, the only such client is raft_group0, which satisfies this requirement. As a result, raft_group_registry::stop_servers() is no longer needed. Instead, raft_group_registry::stop() now verifies that all servers have been properly destroyed. If any remain, it calls on_internal_error(). The call to drain_on_shutdown() in cql_test_env.cc appears redundant. 
The only source of raft::server instances in raft_group_registry is group0_service, and if group0_service.start() succeeds, both abort_and_drain() and destroy() are guaranteed to be called during shutdown.	2025-07-25 17:16:14 +02:00
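The two-phase shutdown described in the commit above can be sketched as follows. This is a minimal illustration in Python, not the actual C++ code: the class `Group0`, the `read_barrier` method, and the task names are hypothetical, and `StoppedError` only stands in for `raft::stopped_error`.

```python
class StoppedError(Exception):
    """Stand-in for raft::stopped_error (hypothetical Python analogue)."""


class Group0:
    """Toy model of a service with a two-phase shutdown."""

    def __init__(self):
        self.aborted = False
        self.destroyed = False
        # Hypothetical background work that must finish before destruction.
        self.background_tasks = ["apply-loop", "io-loop"]

    def abort_and_drain(self):
        # Phase 1: stop serving and wait for background work,
        # but keep the object alive. Safe to call more than once.
        if self.aborted:
            return
        self.aborted = True
        self.background_tasks.clear()

    def destroy(self):
        # Phase 2: final cleanup, only valid after phase 1.
        if not self.aborted:
            raise RuntimeError("destroy() called before abort_and_drain()")
        self.destroyed = True

    def read_barrier(self):
        # Clients that arrive after phase 1 get a clean error
        # instead of touching a destroyed object.
        if self.aborted:
            raise StoppedError("group0 is stopped")
        return "ok"
```

Because `abort_and_drain()` is idempotent in this sketch, it can be called both from the drain path and again from the final shutdown path, which mirrors the commit's point that startup may fail before the drain path is ever reached.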
Petr Gusev	ac4bc3f816	paxos_state: lazily create paxos state table We call paxos_store::ensure_initialized in the beginning of storage_proxy::cas to create a paxos state table for a user table if it doesn't exist. When the LWT coordinator sends RPCs to replicas, some of them may not yet have the paxos schema. In paxos_store::get_paxos_state_schema we just wait for them to appear, or throw 'no_such_column_family' if the base table was dropped.	2025-07-24 19:48:08 +02:00
Petr Gusev	6e87a6cdb0	paxos_state: extract state access functions into paxos_store Introduce paxos_store abstraction to isolate Paxos state access. Prepares for supporting either system.paxos or a co-located table as the storage backend.	2025-07-24 16:39:50 +02:00
Pavel Emelyanov	8220974e76	code: Update callers generating feature service config Instead of requesting it from gms code, create it "by hand" with the help of get_disabled_features_from_db_config() method. This is how other services are configured by main/tools/testing code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-21 19:19:09 +03:00
Avi Kivity	c762425ea7	Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive. This PR addresses the issue in two ways: 1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common case). 2) `passwords::check` is moved to a dedicated alien thread. Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it is not a good idea to fix or use it. - SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512. - MD5 is no longer considered secure for password hashing. Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce frequent lock contention, but in practice each shard handles only a few hundred new connections per second, even during storms. In addition, the existing `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers. Fixes https://github.com/scylladb/scylladb/issues/24524 Backport not needed, as it is a new feature. 
Closes scylladb/scylladb#24924 * github.com:scylladb/scylladb: main: utils: add thread names to alien workers auth: move passwords::check call to alien thread test: wait for 3 clients with given username in test_service_level_api auth: refactor password checking in password_authenticator auth: make SHA-512 the only password hashing scheme for new passwords auth: whitespace change in identify_best_supported_scheme() auth: require scheme as parameter for `generate_salt` auth: check password hashing scheme support on authenticator start	2025-07-16 13:15:54 +03:00
Andrzej Jackowski	77a9b5919b	main: utils: add thread names to alien workers This commit adds a call to `pthread_setname_np` in `alien_worker::spawn`, so each alien worker thread receives a descriptive name. This makes debugging, monitoring, and performance analysis easier by allowing alien workers to be clearly identified in tools such as `perf`.	2025-07-15 23:29:21 +02:00
Andrzej Jackowski	9574513ec1	auth: move passwords::check call to alien thread Analysis of customer stalls showed that the `detail::hash_with_salt` function, called from `passwords::check`, often blocks the reactor. This function internally uses the `crypt_r` function from an external library to compute password hashes, which is a CPU-intensive operation. To prevent such reactor stalls, this commit moves the `passwords::check` call to a dedicated alien thread. This thread is created at system startup and is shared by all shards. Within the alien thread, an `std::mutex` synchronizes access between the thread and the shards. While this could theoretically cause frequent lock contentions, in practice, even during connection storms, the number of new connections per second per shard is limited (typically hundreds per second). Additionally, the `_conns_cpu_concurrency_semaphore` in `generic_server` ensures that not too many connections are processed at once. Fixes scylladb/scylladb#24524	2025-07-15 23:29:13 +02:00
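The idea of keeping a CPU-heavy password check off the event loop can be illustrated with a small Python analogue. This is only a sketch under stated assumptions: a single-thread `ThreadPoolExecutor` stands in for the shared alien thread, the asyncio loop stands in for the reactor, and PBKDF2-HMAC-SHA512 from `hashlib` is swapped in for `crypt_r`; the function names mirror the commit text but the code is not Scylla's.

```python
import asyncio
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

# One shared worker thread, created once at startup (stand-in for the alien thread).
_worker = ThreadPoolExecutor(max_workers=1, thread_name_prefix="pw-check")


def hash_with_salt(password: str, salt: bytes) -> bytes:
    # CPU-intensive hashing; PBKDF2 is an assumption used in place of crypt_r.
    return hashlib.pbkdf2_hmac("sha512", password.encode(), salt, 10_000)


def check(password: str, salt: bytes, expected: bytes) -> bool:
    # Constant-time comparison of the computed and stored hashes.
    return hmac.compare_digest(hash_with_salt(password, salt), expected)


async def authenticate(password: str, salt: bytes, expected: bytes) -> bool:
    # The event loop stays responsive while the worker thread does the hashing.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_worker, check, password, salt, expected)
```

With a single worker, checks are serialized, which plays the role the commit assigns to the `std::mutex`: contention is possible in theory, but the per-shard rate of new connections bounds it.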
Avi Kivity	6fce817aa8	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft. Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit(). Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Fixes https://github.com/scylladb/scylladb/issues/24531 Closes scylladb/scylladb#24886 [avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>] * github.com:scylladb/scylladb: test: add type creation to test_snapshot storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during shema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: 
introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-07-13 20:47:55 +03:00
Benny Halevy	3feb759943	everywhere: use utils::chunked_vector for list of mutations Currently, we use std::vector<*mutation> to keep a list of mutations for processing. This can lead to large allocation, e.g. when the vector size is a function of the number of tables. Use a chunked vector instead to prevent oversized allocations. `perf-simple-query --smp 1` results obtained for fixed 400MHz frequency and PGO disabled: Before (read path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 89055.97 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 18003 cycles/op, 0 errors) 103372.72 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39380 insns/op, 17300 cycles/op, 0 errors) 98942.27 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39413 insns/op, 17336 cycles/op, 0 errors) 103752.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39407 insns/op, 17252 cycles/op, 0 errors) 102516.77 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39403 insns/op, 17288 cycles/op, 0 errors) throughput: mean= 99528.13 standard-deviation=6155.71 median= 102516.77 median-absolute-deviation=3844.59 maximum=103752.93 minimum=89055.97 instructions_per_op: mean= 39403.99 standard-deviation=14.25 median= 39406.75 median-absolute-deviation=9.30 maximum=39416.63 minimum=39380.39 cpu_cycles_per_op: mean= 17435.81 standard-deviation=318.24 median= 17300.40 median-absolute-deviation=147.59 maximum=18002.53 minimum=17251.75 ``` After (read path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
59755.04 tps ( 66.2 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39466 insns/op, 22834 cycles/op, 0 errors) 71854.16 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 17883 cycles/op, 0 errors) 82149.45 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39411 insns/op, 17409 cycles/op, 0 errors) 49640.04 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 19975 cycles/op, 0 errors) 54963.22 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 18235 cycles/op, 0 errors) throughput: mean= 63672.38 standard-deviation=13195.12 median= 59755.04 median-absolute-deviation=8709.16 maximum=82149.45 minimum=49640.04 instructions_per_op: mean= 39448.38 standard-deviation=31.60 median= 39466.17 median-absolute-deviation=25.75 maximum=39474.12 minimum=39411.42 cpu_cycles_per_op: mean= 19267.01 standard-deviation=2217.03 median= 18234.80 median-absolute-deviation=1384.25 maximum=22834.26 minimum=17408.67 ``` `perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency and PGO disabled: Before (write path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 63736.96 tps ( 59.4 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 49667 insns/op, 19924 cycles/op, 0 errors) 64109.41 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 49992 insns/op, 20084 cycles/op, 0 errors) 56950.47 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50005 insns/op, 20501 cycles/op, 0 errors) 44858.42 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50014 insns/op, 21947 cycles/op, 0 errors) 28592.87 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50027 insns/op, 27659 cycles/op, 0 errors) throughput: mean= 51649.63 standard-deviation=15059.74 median= 56950.47 median-absolute-deviation=12087.33 maximum=64109.41 minimum=28592.87 instructions_per_op: mean= 49941.18 standard-deviation=153.76 median= 
50005.24 median-absolute-deviation=73.01 maximum=50027.07 minimum=49667.05 cpu_cycles_per_op: mean= 22023.01 standard-deviation=3249.92 median= 20500.74 median-absolute-deviation=1938.76 maximum=27658.75 minimum=19924.32 ``` After (write path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 53395.93 tps ( 59.4 allocs/op, 16.5 logallocs/op, 14.3 tasks/op, 50326 insns/op, 21252 cycles/op, 0 errors) 46527.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50704 insns/op, 21555 cycles/op, 0 errors) 55846.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50731 insns/op, 21060 cycles/op, 0 errors) 55669.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50735 insns/op, 21521 cycles/op, 0 errors) 52130.17 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50757 insns/op, 21334 cycles/op, 0 errors) throughput: mean= 52713.91 standard-deviation=3795.38 median= 53395.93 median-absolute-deviation=2955.40 maximum=55846.30 minimum=46527.83 instructions_per_op: mean= 50650.57 standard-deviation=182.46 median= 50731.38 median-absolute-deviation=84.09 maximum=50756.62 minimum=50325.87 cpu_cycles_per_op: mean= 21344.42 standard-deviation=202.86 median= 21334.00 median-absolute-deviation=176.37 maximum=21554.61 minimum=21060.24 ``` Fixes #24815 Improvement for rare corner cases. No backport required Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24919	2025-07-13 19:13:11 +03:00
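The summary statistics quoted in the commit above can be re-derived from the per-run throughput figures. The snippet below is a small cross-check in Python; the variable and function names are illustrative, and the numbers are copied from the `tps` lines in the message.

```python
from statistics import mean, median

# Throughput (tps) per run, copied from the commit message.
read_before = [89055.97, 103372.72, 98942.27, 103752.93, 102516.77]
read_after = [59755.04, 71854.16, 82149.45, 49640.04, 54963.22]
write_before = [63736.96, 64109.41, 56950.47, 44858.42, 28592.87]
write_after = [53395.93, 46527.83, 55846.30, 55669.30, 52130.17]


def summarize(tps):
    """Return (mean, median) of a list of throughput samples, rounded to 2 places."""
    return round(mean(tps), 2), round(median(tps), 2)


if __name__ == "__main__":
    for name, runs in [
        ("read before", read_before),
        ("read after", read_after),
        ("write before", write_before),
        ("write after", write_after),
    ]:
        print(name, summarize(runs))
```

Note that the throughput samples vary widely between runs, while the reported instructions per operation differ by well under 2% between before and after on both paths, so the instruction counts are probably the more stable basis for comparison.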
Marcin Maliszkiewicz	fa157e7e46	db: service: add store_service dependency to schema_applier There is already implicit logical dependency via migration_notifier but in the next commits we'll be moving store_service out from it as we need better control (i.e. return a value from the call).	2025-07-10 10:40:43 +02:00
Pawel Pery	7bf53fc908	vector_store_client: implement initial vector_store_client service This patch is part of the vector_store_client sharded service implementation for communication with the vector-store service. It adds a `services/vector_store_client.{cc|hh}` sharded service and a configuration parameter `vector_store_uri` with a `http://vector-store.dns.name:port` format. If parsing that parameter fails, an exception is thrown during construction. For future unit testing purposes, the patch adds `vector_store_client_tester` as a way to inject mockup functionality. This service will be used by the select statements for the Vector search indexes (see VS-46). For this reason I've added the vector_store_client service to the query processor. Reference: VS-47 VS-45	2025-07-08 16:29:55 +02:00
Andrzej Jackowski	9dbb1468b4	mapreduce: remove _shared_token_metadata from mapreduce_service Before this change, `mapreduce_service` used `_shared_token_metadata` to get the topology. However, the token was used in a part of the code that already had its own ERM with its own metadata token. Moreover, as mapreduce_service's token and ERM's token are not guaranteed to be the same, inconsistencies could occur. Therefore, this commit removes `_shared_token_metadata` and its usage.	2025-06-25 08:42:16 +02:00
Dawid Mędrek	c60035cbf6	test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default We've adjusted all of the Boost tests so they respect the invariant enforced by the `rf_rack_valid_keyspaces` configuration option, or explicitly disabled the option in those that turned out to be more problematic and will require more attention. Thanks to that, we can now enable it by default in the test suite.	2025-05-27 18:53:39 +02:00
Aleksandra Martyniuk	9c03255fd2	cql_test_env: main: move stream_manager initialization Currently, stream_manager is initialized after storage_service and so it is stopped before storage_service is. In its stop method, storage_service accesses stream_manager, which is uninitialized at that time. Move stream_manager initialization before the storage_service initialization. Fixes: #23207. Closes scylladb/scylladb#24008	2025-05-15 17:17:35 +03:00
Michał Chojnowski	1bcf77951c	compress: distribute compression dictionaries over shards We don't want each shard to have its own copy of each dictionary. It would put unnecessary pressure on cache and memory. Instead, we want to share dictionaries between shards. Before this commit, all dictionaries live on shard 0. All other shards borrow foreign shared pointers from shard 0. There's a problem with this setup: dictionary blobs receive many random accesses. If shard 0 is on a remote NUMA node, this could pose a performance problem. Therefore, for each dictionary, we would like to have one copy per NUMA node, not one copy per the entire machine. And each shard should use the copy belonging to its own NUMA node. This is the main goal of this patch. There is another issue with putting all dicts on shard 0: it eats an asymmetric amount of memory from shard 0. This commit spreads the ownership of dicts over all shards within the NUMA group, to make the situation more symmetric. (Dict owner is decided based on the hash of dict contents). It should be noted that the last part isn't necessarily a good thing, though. While it makes the situation more symmetric within each node, it makes it less symmetric across the cluster, if different node sizes are present. If dicts occupy 1% of memory on each shard of a 100-shard node, then the same dicts would occupy 100% of memory on a 1-shard node. So for the sake of cluster-wide symmetry, we might later want to consider e.g. making the memory limit for dictionaries inversely proportional to the number of shards.	2025-05-07 14:43:18 +02:00
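Two points from the commit above lend themselves to a short sketch: choosing a dictionary's owner shard from a hash of its contents, and the memory arithmetic for nodes of different sizes. The Python below is a hypothetical illustration; the function names, the use of SHA-256, and the modulo mapping are assumptions, since the commit only says the owner is "decided based on the hash of dict contents".

```python
import hashlib
from typing import Sequence


def dict_owner(dict_contents: bytes, numa_group_shards: Sequence[int]) -> int:
    """Pick an owner shard within one NUMA group from a hash of the contents.

    SHA-256 and the modulo mapping are assumptions made for this sketch.
    """
    digest = hashlib.sha256(dict_contents).digest()
    index = int.from_bytes(digest[:8], "big") % len(numa_group_shards)
    return numa_group_shards[index]


def dict_share_per_shard(dict_bytes: float, shard_memory: float, shard_count: int) -> float:
    """Fraction of each shard's memory used when dicts are spread evenly."""
    return dict_bytes / (shard_memory * shard_count)
```

For the commit's example, take equal per-shard memory `M`. Dictionaries that use 1% of every shard on a 100-shard node total `0.01 * 100 * M = M` bytes, so `dict_share_per_shard(M, M, 1)` is 1.0 on a 1-shard node, which is the 100% figure the message warns about.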
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Piotr Dulikowski	ce2fab7cce	main: make dc and rack immutable after bootstrap Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278	2025-04-17 16:22:26 +02:00
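The startup rule from the commit above (refuse to start when the snitch reports a DC or rack different from the one recorded at first boot) amounts to a simple comparison. The Python sketch below is hypothetical: `Location`, `TopologyMismatch`, and `check_dc_rack` are illustrative names, and where the recorded value is persisted is left out.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    dc: str
    rack: str


class TopologyMismatch(Exception):
    """Raised when the reported DC/rack differs from the recorded one."""


def check_dc_rack(recorded: Optional[Location], reported: Location) -> Location:
    # First boot: nothing recorded yet, so the reported location becomes the record.
    if recorded is None:
        return reported
    # Any later boot: the location must match what was recorded.
    if recorded != reported:
        raise TopologyMismatch(
            f"snitch reports {reported}, but node was bootstrapped as {recorded}"
        )
    return recorded
```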
Tomasz Grabiec	0b9a75d7b6	virtual-tables: Introduce system.load_per_node Can be used to query per-node stats about load as seen by the load balancer. In particular, node's capacity will be used by tablet-mon.py to scale tablet columns so that equal height is equal node utilization.	2025-04-09 20:21:51 +02:00
Michał Chojnowski	b77c611c00	raft/group0_state_machine: on `system.dicts` mutations, pass the affected partition keys to the callback Before this patch, `system.dicts` contains only one dictionary, for RPC compression, with the fixed name "general". In later parts of this series, we will add more dictionaries to system.dicts, one per table, for SSTable compression. To enable that, this patch adjusts the callback mechanism for group0's `write_mutations` command, so that the mutation callbacks for group0-managed tables can see which partition keys were affected. This way, the callbacks can query only the modified partitions instead of doing a full scan. (This is necessary to prevent quadratic behaviours.) For now, only the `system.dicts` callback uses the partition keys.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	30a9d471fa	sstables: plug an `sstable_compressor_factory` into `sstables_manager` Create a `sstable_compressor_factory_impl` in `scylla_main`, and pipe it through constructors into `sstables_manager`. In next commits, the factory available through the `sstables_manager` will be used to create compressors for SSTable readers and writers.	2025-04-01 00:07:28 +02:00
Dawid Mędrek	0e04a6f3eb	main: Refuse to start node when RF-rack-invalid keyspace exists When a node is started with the option `rf_rack_valid_keyspaces` enabled, the initialization will fail if there is an RF-rack-invalid keyspace. We want to force the user to adjust their existing keyspaces when upgrading to 2025.* so that the invariant that every keyspace is RF-rack-valid is always satisfied. Fixes scylladb/scylladb#23300	2025-03-19 15:13:44 +01:00
Pavel Emelyanov	2bb455ec75	Merge 'Main: stop system_keyspace' from Benny Halevy This series adds an async guard to system_keyspace operations and adds a deferred action to stop the system_keyspace in main() before destroying the service. This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped. * Enhancement, no backport needed Closes scylladb/scylladb#23113 * github.com:scylladb/scylladb: main: stop system keyspace system_keyspace: call shutdown from stop system_keyspace: shutdown: allow calling more than once database, compaction_manager, large_data_handler: use pluggable<system_keysapce> utils: add class pluggable	2025-03-14 13:23:28 +03:00
Avi Kivity	696ce4c982	Merge "convert some parts of the gossiper to host ids" from Gleb " This series starts the conversion of the gossiper to use host ids to index nodes. It does not touch the main map yet, but converts a lot of internal code to host id. There are also some unrelated cleanups that were done while working on the series. One of which is dropping code related to the old shadow round. We replaced the shadow round with an explicit GOSSIP_GET_ENDPOINT_STATES verb in `cd7d64f588`, which is in scylla-4.3.0, so there should be no compatibility problem. We already dropped a lot of old shadow round code in previous patches anyway. I tested manually that old and new nodes can co-exist in the same cluster, " * 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits) gossiper: drop unneeded code gossiper: move _expire_time_endpoint_map to host_id gossiper: move _just_removed_endpoints to host id gossiper: drop unused get_msg_addr function messaging_service: change connection dropping notification to pass host id only messaging_service: pass host id to remove_rpc_client in down notification treewide: pass host id to endpoint_lifecycle_subscriber treewide: drop endpoint life cycle subscribers that do nothing load_meter: move to host id treewide: use host id directly in endpoint state change subscribers treewide: pass host id to endpoint state change subscribers gossiper: drop deprecated unsafe_assassinate_endpoint operation storage_service: drop unused code in handle_state_removed treewide: drop endpoint state change subscribers that do nothing gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory gossiper: start using host ids to send messages earlier messaging_service: add temporary address map entry on incoming connection topology_coordinator: notify about IP change from sync_raft_topology_nodes as well treewide: move everyone to use host id based gossiper::is_alive and drop ip based one storage_proxy: drop unused 
template ...	2025-03-13 13:36:31 +02:00
Avi Kivity	b1d9f80d85	Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec Before this patch, the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogeneous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shard's storage utilization, so it no longer assumes that shards have the same capacity. It still assumes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer needs to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant setback because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to the above, so not implemented. Also, per-shard goal for tablet count is still the same for all nodes in the cluster, so nodes with less capacity will be below limit and nodes with more capacity will be slightly above limit. This shouldn't be a significant problem in practice, we could compensate for this by increasing the limit. 
Refs #23042 Closes scylladb/scylladb#23079 * github.com:scylladb/scylladb: tablets: Make load balancing capacity-aware topology_coordinator: Fix confusing log message topology_coordinator: Refresh load stats after adding a new node topology_coordinator: Allow capacity stats to be refreshed with some nodes down topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places test: boost: tablets_test: Always provide capacity in load_stats test: perf_load_balancing: Set node capacity test: perf_load_balancing: Convert to topology_builder config, disk_space_monitor: Allow overriding capacity via config storage_service, tablets: Collect per-node capacity in load_stats	2025-03-11 14:34:27 +02:00
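The difference between balancing on tablet count and balancing on storage utilization, as described in the merge message above, can be illustrated with a small sketch. This is not Scylla's load balancer: `Shard`, `utilization` and `pick_migration` are hypothetical names, and the stopping rule is an assumption made for illustration. As in the commit message, every tablet is assumed to have the same size.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    tablet_count: int
    capacity_bytes: int  # storage capacity available to this shard (hypothetical field)


def utilization(shard, avg_tablet_size):
    """Fraction of the shard's capacity in use, assuming equally sized tablets."""
    return shard.tablet_count * avg_tablet_size / shard.capacity_bytes


def pick_migration(shards, avg_tablet_size):
    """Pick one tablet move from the most to the least utilized shard.

    Returns (source_name, destination_name), or None when a move would not
    reduce the imbalance.
    """
    src = max(shards, key=lambda s: utilization(s, avg_tablet_size))
    dst = min(shards, key=lambda s: utilization(s, avg_tablet_size))
    if src is dst:
        return None
    # Assumed stopping rule: skip the move if the destination would end up
    # at least as utilized as the source currently is.
    dst_after = (dst.tablet_count + 1) * avg_tablet_size / dst.capacity_bytes
    if dst_after >= utilization(src, avg_tablet_size):
        return None
    return (src.name, dst.name)
```

With two shards holding 10 tablets each, a count-based balancer sees nothing to do, while this sketch moves a tablet toward the shard with twice the capacity.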
Gleb Natapov	f0af3f261e	messaging_service: add temporary address map entry on incoming connection We want to move to use host ids as soon as possible. Currently it is possible only after the full gossiper exchange (because only at this point gossiper state is added and with it address map entry). To make it possible to move to host ids earlier this patch adds address map entries on incoming communication during CLIENT_ID verb processing. The patch also adds generation to CLIENT_ID to use it when address map is updated. It is done so that older gossiper entries can be overwritten with newer mapping in case of IP change.	2025-03-11 12:09:21 +02:00
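The role of the generation carried in CLIENT_ID can be sketched as a map that only accepts a mapping when it is newer than the one already stored. This is an illustrative model, not the messaging_service or address map code: the class and method names are hypothetical, and ignoring an update with an equal generation is an assumption.

```python
class AddressMap:
    """Toy host id to IP map where newer generations win (hypothetical API)."""

    def __init__(self):
        self._entries = {}  # host_id -> (ip, generation)

    def add_or_update(self, host_id, ip, generation):
        """Store the mapping unless an entry with the same or a newer generation exists.

        Returns True when the map was changed.
        """
        current = self._entries.get(host_id)
        if current is not None and current[1] >= generation:
            return False  # stale or duplicate information, keep what we have
        self._entries[host_id] = (ip, generation)
        return True

    def find(self, host_id):
        """Return the known IP for host_id, or None if there is no entry."""
        entry = self._entries.get(host_id)
        return entry[0] if entry else None
```

In this model, a node that restarts with a new IP presents a higher generation, so its entry replaces the older mapping, while a delayed message carrying the old generation is ignored.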
Tomasz Grabiec	d01cc16d1e	config, disk_space_monitor: Allow overriding capacity via config Intended for testing, or hot-fixing out-of-space issues in production. Tablet load balancer uses this information for determining per-shard load so reducing capacity will cause tablets to be migrated away from the node.	2025-03-06 13:35:37 +01:00
Avi Kivity	28906c9261	Merge 'scylla-sstable: introduce the query command' from Botond Dénes The scylla-sstable dump-* command suite has proven invaluable in many investigations. In certain cases however, I found that `dump-data` is quite cumbersome. An example would be trying to find certain values in an sstable, or trying to read the content of system tables when a node is down. For these cases, `dump-data` is very cumbersome: one has to trudge through tons of uninteresting metadata and do compaction in their heads. This PR introduces the new scylla-sstable query command, specifically targeted at situations like this: it allows executing queries on sstables, exposing to the user all the power of CQL, to tailor the output as they see fit. Select everything from a table: $ scylla sstable query --system-schema /path/to/data/system_schema/keyspaces-/-big-Data.db keyspace_name \| durable_writes \| replication -------------------------------+----------------+------------------------------------------------------------------------------------- system_replicated_keys \| true \| ({class : org.apache.cassandra.locator.EverywhereStrategy}) system_auth \| true \| ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 1}) system_schema \| true \| ({class : org.apache.cassandra.locator.LocalStrategy}) system_distributed \| true \| ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 3}) system \| true \| ({class : org.apache.cassandra.locator.LocalStrategy}) ks \| true \| ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1}) system_traces \| true \| ({class : org.apache.cassandra.locator.SimpleStrategy}, {replication_factor : 2}) system_distributed_everywhere \| true \| ({class : org.apache.cassandra.locator.EverywhereStrategy}) Select everything from a single SSTable, use the JSON output (filtered through [jq](https://jqlang.github.io/jq/) for better readability): $ scylla sstable query 
--system-schema --output-format=json /path/to/data/system_schema/keyspaces-/me-3gm7_127s_3ndxs28xt4llzxwqz6-big-Data.db \| jq [ { "keyspace_name": "system_schema", "durable_writes": true, "replication": { "class": "org.apache.cassandra.locator.LocalStrategy" } }, { "keyspace_name": "system", "durable_writes": true, "replication": { "class": "org.apache.cassandra.locator.LocalStrategy" } } ] Select a specific field in a specific partition using the command-line: $ scylla sstable query --system-schema --query "select replication from scylla_sstable.keyspaces where keyspace_name='ks'" ./scylla-workdir/data/system_schema/keyspaces-/-Data.db replication ------------------------------------------------------------------------------------- ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1}) Select a specific field in a specific partition using ``--query-file``: $ echo "SELECT replication FROM scylla_sstable.keyspaces WHERE keyspace_name='ks';" > query.cql $ scylla sstable query --system-schema --query-file=./query.cql ./scylla-workdir/data/system_schema/keyspaces-/-Data.db replication ------------------------------------------------------------------------------------- ({class : org.apache.cassandra.locator.NetworkTopologyStrategy}, {datacenter1 : 1}) New functionality: no backport needed. Closes scylladb/scylladb#22007 github.com:scylladb/scylladb: docs/operating-scylla: document scylla-sstable query test/cqlpy/test_tools.py: add tests for scylla-sstable query test/cqlpy/test_tools.py: make scylla_sstable() return table name also scylla-sstable: introduce the query command tools/utils: get_selected_operation(): use std::string for operation_options utils/rjson: streaming_writer: add RawValue() cql3/type_json: add to_json_type() test/lib/cql_test_env: introduce do_with_cql_env_noreentrant_in_thread()	2025-03-06 13:42:45 +02:00
Tomasz Grabiec	7e7f1e6f91	storage_service, tablets: Collect per-node capacity in load_stats New RPC is introduced because load_stats was marked "final" in the IDL. Will be needed by capacity-aware load balancing.	2025-03-06 12:17:32 +01:00
Benny Halevy	7a624e3df8	system_keyspace: call shutdown from stop and use that to replace the explicit shutdown when stopped in cql_test_env. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:30:23 +02:00
Botond Dénes	5d63ef4d15	Merge 'scylla sstable: Add standard extensions and propagate to schema load ' from Calle Wilund Fixes #22314 Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them. Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points. Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line. Closes scylladb/scylladb#22327 * github.com:scylladb/scylladb: tools: Add standard extensions and propagate to schema load cql_test_env: Use add all extensions instead of inidividually main: Move extensions adding to function tomstone_gc: Make validate work for tools	2025-02-26 13:52:47 +02:00
Tomasz Grabiec	f3b63bfeff	test: cql_test_env: Expose db config	2025-02-19 16:29:08 +01:00
Botond Dénes	01a4d30d88	test/lib/cql_test_env: introduce do_with_cql_env_noreentrant_in_thread() This variant of do_with_cql_env() forgoes the reentrancy support in the regular do_with_cql_env() variants and re-uses the caller's existing seastar thread. This is an optimized version for callers which don't need reentrancy and already have a thread.	2025-02-17 08:01:38 -05:00
Pavel Emelyanov	5d1f74b86a	main: Start sharded<view_builder> earlier The view_builder service is needed by repair service, but is started after it. It's OK in a sense that repair service holds a sharded reference on it and checks whether local_is_initialized() before using it, which is not nice. Fortunately, starting sharded view builder can be done early enough, because most of its dependencies would be already started by that time. Two exceptions are -- view_update_generator and system_distributed_keyspace. Both can be moved up too with the same justification. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 20:26:55 +03:00
Pavel Emelyanov	f650e75137	test/cql_env: Move stream manager start lower This is to keep it in-sync with main code, where stream manager is started after storage_proxy's and query_processor's remotes. This doesn't change anything for now, but next patches will move other services around main/cql_test_env and early start of stream manager in cql_test_env will be problematic. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 20:25:20 +03:00
Botond Dénes	4a7a75dfcb	Merge 'tasks: use host_id in task manager' from Aleksandra Martyniuk Use host_id in a children list of a task in task manager to indicate a node on which the child was created. Move TASKS_CHILDREN_REQUEST to IDL. Send it by host_id. Fixes: https://github.com/scylladb/scylladb/issues/22284. Ip to host_id transition; backport isn't needed. Closes scylladb/scylladb#22487 * github.com:scylladb/scylladb: tasks: drop task_manager::config::broadcast_address as it's unused tasks: replace ip with host_id in task_identity api: task_manager: pass gossiper to api::set_task_manager tasks: keep host_id in task_manager tasks: move tasks_get_children to IDL	2025-02-11 11:32:27 +02:00
Tomasz Grabiec	3bb9d2fbdb	test: cql_test_env: Expose topology_state_machine	2025-02-07 16:09:21 +01:00
Aleksandra Martyniuk	4470c2f6d3	tasks: keep host_id in task_manager Keep host_id of a node in task manager. If host_id wasn't resolved yet, task manager will keep an empty id. It's a preparation for the following changes.	2025-02-05 10:10:29 +01:00
Kamil Braun	febd45861e	test/lib: cql_test_env: make service shutdown more verbose Introduce `defer_verbose_shutdown` in `cql_test_env` which logs a message before and after shutting down a service, distinguishing between success and failure. The function is similar to the one in `main` but skips special error handling logic applicable only to the main Scylla binary. The purpose of the `cql_test_env` version of this function is only more verbose logging. If necessary it can be extended in the future with additional logic. I estimated the impact on the size of produced log files using `cdc_test` as an example: ``` $ build/dev/test/boost/combined_tests --run_test=cdc_test -- --smp=2 \ >logfile 2>&1 $ du -b logfile ``` the result before this commit: 1964064 bytes, after: 2196432 bytes, so estimated ~12% increase of log file size for boost tests that use `cql_test_env`, assuming that the number of logs printed by each test is similar to the logs printed by `cdc_test` (but I believe `cdc_test` is one of the less verbose tests so this is an overestimate). The motivation for this change is easier debugging of shutdown issues. When investigating scylladb/scylladb#21983, where an exception is thrown somewhere during the shutdown procedure, I found it hard to pinpoint the service from which the exception originates. This change will make it easier to debug issues like that by wrapping shutdown of each service in a pair of messages logged when shutdown starts and when it finishes (including when it fails). We should get more details on this issue when it reproduces again in CI after this commit is merged into `master`. (I failed to reproduce it locally with 1000 runs.) Ref scylladb/scylladb#21983 Closes scylladb/scylladb#22566	2025-01-30 10:27:45 +03:00
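The idea behind `defer_verbose_shutdown`, wrapping each service's shutdown in a pair of log messages that distinguish success from failure, can be sketched as follows. This is a Python illustration of the pattern only, not the `cql_test_env` code: the function name `verbose_shutdown`, its parameters and the message wording are all assumptions.

```python
import logging

logger = logging.getLogger("shutdown_sketch")


def verbose_shutdown(name, stop, log=logger.info):
    """Run stop() and log before and after, so a failing service is easy to spot.

    `name` labels the service in the log and `stop` is any zero-argument
    callable. Exceptions are logged and re-raised unchanged.
    """
    log(f"Shutting down {name}")
    try:
        stop()
    except Exception as exc:
        log(f"Shutting down {name} failed: {exc}")
        raise
    log(f"Shutting down {name} was successful")
```

Because the "failed" message names the service, an exception thrown during shutdown can be attributed to its origin directly from the log, which is the debugging benefit the commit message describes.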
Pavel Emelyanov	ca9b59f3b2	storage_service: Drop sys_dist_ks argument from join_cluster() Storage service has _sys_dist_ks onboard and can just use it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-01-24 12:26:32 +03:00
Piotr Dulikowski	6aa962f5f4	Merge 'Add audit subsystem for database operations' from Paweł Zakrzewski Introduces a comprehensive audit system to track database operations for security and compliance purposes. This change includes: Core Components: - New audit subsystem for logging database operations - Service level integration for proper resource management - CQL statement tracking with operation categories - Login process integration for tenant management Key Features: - Configurable audit logging (syslog/table) - Operation categorization (QUERY/DML/DDL/DCL/AUTH/ADMIN) - Selective auditing by keyspace/table - Password sanitization in audit logs - Service level shares support (1-1000) for workload prioritization - Proper lifecycle management and cleanup I ran the dtests for audit (manually enabled) and they pass. The in-repo tests pass. Notably, there should be no non-whitespace changes between this and scylla-enterprise Fixes scylladb/scylla-enterprise#4999 Closes scylladb/scylladb#22147 * github.com:scylladb/scylladb: audit: Add shares support to service level management audit: Add service level support to CQL login process audit: Add support to CQL statements audit: Integrate audit subsystem into Scylla main process audit: Add documentation for the audit subsystem audit: Add the audit subsystem	2025-01-17 13:14:55 +01:00
Kamil Braun	89ee2a6834	Merge 'drop ip addresses from token metadata' from Gleb Now that all topology related code uses host ids there is no point in maintaining ip to id (and back) mappings in the token metadata. After the patch the mapping will be maintained in the gossiper only. The rest of the system will use host ids and in rare cases where translation is needed (mostly for UX compatibility reasons) the translation will be done using gossiper. Fixes: scylladb/scylla#21777 * 'gleb/drop-ip-from-tm-v3' of github.com:scylladb/scylla-dev: (57 commits) hint manager: do not translate ip to id in case hint manager is stopped already locator: token_metadata: drop update_host_id() function that does nothing now locator: topology: drop indexing by ips repair: drop unneeded code storage_service: use host_id to look for a node in on_alive handler storage_proxy: translate ips to ids in forward array using gossiper locator: topology: remove unused functions storage_service: check for outdated ip in on_change notification in the peers table storage_proxy: translate id to ip using address map in tablets's describe_ring code instead of taking one from the topology topology coordinator: change connection dropping code to work on host ids cql3: report host id instead of ip in error during SELECT FROM MUTATION_FRAGMENTS query locator: drop unused function from tablet_effective_replication_map api: view_build_statuses: do not use IP from the topology, but translate id to ip using address map instead locator: token_metadata: remove unused ip based functions locator: network_topology_strategy: use host_id based function to check number of endpoints in dcs gossiper: drop get_unreachable_token_owners functions storage_service: use gossiper to map ip to id in node_ops operations storage_service: fix indentation after the last patch storage_service: drop loops from node ops replace_prepare handling since there can be only one replacing node token_metadata: drop no longer used functions ...	2025-01-17 11:00:52 +01:00
Gleb Natapov	50fb22c8f9	locator: topology: drop indexing by ips Do not track id to ip mapping in the topology class any longer. There are no remaining users.	2025-01-16 16:37:08 +02:00
Gleb Natapov	1b6e1456e5	messaging_service: drop the usage of ip based token_metadata APIs We want to drop ips from token_metadata so move to use host id based counterparts. Messaging service gets a function that maps from ips to id when it starts listening.	2025-01-16 16:37:06 +02:00
Calle Wilund	00b40eada3	cql_test_env: Use add all extensions instead of inidividually	2025-01-15 12:08:09 +00:00
Paweł Zakrzewski	28bd699c51	audit: Add service level support to CQL login process This change integrates service level functionality into the CQL authentication and connection handling: - Add scheduling_group_name to client_data to track service level assignments - Extend SASL challenge interface to expose authenticated username - Modify connection processing to support tenant switching: - Add switch_tenant() method to handle scheduling group changes - Add process_until_tenant_switch() to handle request processing boundaries - Implement no_tenant() default executor - Add execute_under_tenant_type for scheduling group management - Update connection lifecycle to properly handle service level changes: - Initialize connections with default scheduling group - Support dynamic scheduling group updates when service levels change - Ensure proper cleanup of scheduling group assignments The changes enable proper scheduling group assignment and management based on authenticated users' service levels, while maintaining backward compatibility for connections without service level assignments.	2025-01-15 11:10:36 +01:00


599 Commits