scylladb

Author	SHA1	Message	Date
Botond Dénes	4a802baccb	Merge 'compress: make sstable compression dictionaries NUMA-aware ' from Michał Chojnowski compress: distribute compression dictionaries over shards We don't want each shard to have its own copy of each dictionary. It would put unnecessary pressure on cache and memory. Instead, we want to share dictionaries between shards. Before this commit, all dictionaries live on shard 0. All other shards borrow foreign shared pointers from shard 0. There's a problem with this setup: dictionary blobs receive many random accesses. If shard 0 is on a remote NUMA node, this could pose a performance problem. Therefore, for each dictionary, we would like to have one copy per NUMA node, not one copy per the entire machine. And each shard should use the copy belonging to its own NUMA node. This is the main goal of this patch. There is another issue with putting all dicts on shard 0: it eats an asymmetric amount of memory from shard 0. This commit spreads the ownership of dicts over all shards within the NUMA group, to make the situation more symmetric. (Dict owner is decided based on the hash of dict contents). It should be noted that the last part isn't necessarily a good thing, though. While it makes the situation more symmetric within each node, it makes it less symmetric across the cluster, if different node sizes are present. If dicts occupy 1% of memory on each shard of a 100-shard node, then the same dicts would occupy 100% of memory on a 1-shard node. So for the sake of cluster-wide symmetry, we might later want to consider e.g. making the memory limit for dictionaries inversely proportional to the number of shards. New functionality, added to a feature which isn't in any stable branch yet. No backporting. 
Closes scylladb/scylladb#23590 * github.com:scylladb/scylladb: test: add test/boost/sstable_compressor_factory_test compress: add some test-only APIs compress: rename sstable_compressor_factory_impl to dictionary_holder compress: fix indentation compress: remove sstable_compressor_factory_impl::_owner_shard compress: distribute compression dictionaries over shards test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version test: remove sstables::test_env::do_with()	2025-05-08 09:52:46 +03:00
Petr Gusev	e6c3f954f6	main: check if current process group controls stdin tty test.py doesn't override stdin when starting Scylla, so when tests are run from a terminal, isatty() returns true and parsed command line output is not printed, which is inconvenient. In this commit we add a check if the current process group controls the stdin terminal. This serves two purposes: * improves the "interactive mode" check from #scylladb/scylladb#18309, as only the controlling process group can interact with the terminal. * solves the test.py problem above, because test.py runs scylla in a new session/process group (it calls setsid after fork), and is now correctly not considered interactive. Closes scylladb/scylladb#24047	2025-05-08 06:52:48 +03:00
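The interactivity check described in the entry above can be illustrated with a minimal, POSIX-only Python sketch. It is not Scylla's C++ code; the function name `stdin_is_interactive` is hypothetical. It treats stdin as interactive only when it is a tty and the terminal's foreground process group is the caller's own process group.

```python
import os

def stdin_is_interactive(fd: int = 0) -> bool:
    """Return True only if fd is a tty whose foreground process group is ours."""
    if not os.isatty(fd):
        return False
    try:
        # Compare the terminal's foreground process group with our own group.
        return os.tcgetpgrp(fd) == os.getpgrp()
    except OSError:
        # The tty is not our controlling terminal (for example after setsid()).
        return False
```

A process started in a new session, as test.py does with setsid, is reported as non-interactive even when its stdin is inherited from a terminal.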
Michał Chojnowski	1bcf77951c	compress: distribute compression dictionaries over shards We don't want each shard to have its own copy of each dictionary. It would put unnecessary pressure on cache and memory. Instead, we want to share dictionaries between shards. Before this commit, all dictionaries live on shard 0. All other shards borrow foreign shared pointers from shard 0. There's a problem with this setup: dictionary blobs receive many random accesses. If shard 0 is on a remote NUMA node, this could pose a performance problem. Therefore, for each dictionary, we would like to have one copy per NUMA node, not one copy per the entire machine. And each shard should use the copy belonging to its own NUMA node. This is the main goal of this patch. There is another issue with putting all dicts on shard 0: it eats an asymmetric amount of memory from shard 0. This commit spreads the ownership of dicts over all shards within the NUMA group, to make the situation more symmetric. (Dict owner is decided based on the hash of dict contents). It should be noted that the last part isn't necessarily a good thing, though. While it makes the situation more symmetric within each node, it makes it less symmetric across the cluster, if different node sizes are present. If dicts occupy 1% of memory on each shard of a 100-shard node, then the same dicts would occupy 100% of memory on a 1-shard node. So for the sake of cluster-wide symmetry, we might later want to consider e.g. making the memory limit for dictionaries inversely proportional to the number of shards.	2025-05-07 14:43:18 +02:00
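Two ideas from the entry above can be sketched in a few lines of Python: choosing a dictionary's owner from a hash of its contents, and the 1% versus 100% memory arithmetic. This is only an illustration under stated assumptions. The function names are hypothetical, SHA-256 stands in for whatever hash the real code uses, and every shard is assumed to have the same amount of memory.

```python
import hashlib

def pick_owner_shard(dict_blob: bytes, numa_group: list[int]) -> int:
    """Pick the shard in a NUMA group that owns this dictionary's copy."""
    digest = hashlib.sha256(dict_blob).digest()
    # The same contents always map to the same shard of the group.
    return numa_group[int.from_bytes(digest[:8], "big") % len(numa_group)]

def dict_share_per_shard(total_in_shard_units: float, shard_count: int) -> float:
    """Fraction of each shard's memory used when dicts are spread evenly."""
    return total_in_shard_units / shard_count

# Dicts totalling one shard's worth of memory:
share_on_100_shards = dict_share_per_shard(1.0, 100)  # 1% of every shard
share_on_1_shard = dict_share_per_shard(1.0, 1)       # all of the only shard
```

The second function shows why a limit that is inversely proportional to the shard count would restore cluster-wide symmetry: the total then stays constant regardless of node size.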
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Piotr Dulikowski	ce2fab7cce	main: make dc and rack immutable after bootstrap Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278	2025-04-17 16:22:26 +02:00
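The startup rule in the entry above amounts to comparing the location recorded at first boot with what the snitch reports now. A minimal Python sketch of that comparison follows. The names `Location`, `TopologyMismatch` and `check_dc_rack` are hypothetical, and where Scylla actually persists the first-boot values is not stated in the log.

```python
from typing import NamedTuple, Optional

class Location(NamedTuple):
    dc: str
    rack: str

class TopologyMismatch(RuntimeError):
    """Raised when the snitch reports a DC/rack other than the recorded one."""

def check_dc_rack(stored: Optional[Location], reported: Location) -> Location:
    if stored is None:
        # First boot: nothing recorded yet, so the reported location is kept.
        return reported
    if stored != reported:
        # Later boots: any difference aborts startup instead of updating.
        raise TopologyMismatch(f"recorded {stored}, snitch reports {reported}")
    return stored
```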
Tomasz Grabiec	0b9a75d7b6	virtual-tables: Introduce system.load_per_node Can be used to query per-node stats about load as seen by the load balancer. In particular, node's capacity will be used by tablet-mon.py to scale tablet columns so that equal height is equal node utilization.	2025-04-09 20:21:51 +02:00
Benny Halevy	f702adf6a5	main: fix typo in tablet allocator checkpoint message Introduced in `b6705ad48b` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23211	2025-04-08 17:19:41 +03:00
Kefu Chai	0cd6cf1dc5	main: Remove unused member variable `_sys_ks` Fixes a Clang error by removing the unused private field `sstable_dict_deleter::_sys_ks` that was flagged with: [-Werror,-Wunused-private-field] ``` /home/kefu/.local/bin/clang++ -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DSCYLLA_BUILD_MODE=release -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/build -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/RelWithDebInfo/seastar/gen/include -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -I/usr/include/p11-kit-1 -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -flto=thin -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/= -ffile-prefix-map=/home/kefu/dev/scylladb/build=. 
-ffile-prefix-map=/home/kefu/dev/scylladb/build/=build -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -ffat-lto-objects -std=gnu++23 -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_LOGGER_TYPE_STDOUT -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -MD -MT CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -MF CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o.d -o CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -c /home/kefu/dev/scylladb/main.cc /home/kefu/dev/scylladb/main.cc:1660:38: error: private field '_sys_ks' is not used [-Werror,-Wunused-private-field] 1660 \| db::system_keyspace& _sys_ks; \| ^ ``` The member variable is not referenced anywhere in the code, so removing it improves maintainability without affecting functionality. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23545	2025-04-02 20:07:39 +03:00
Pavel Emelyanov	2ee9cec1d3	Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Also, the `object_storage_config_file` option is moved to a deprecated state as it's no longer needed. This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f Refs https://github.com/scylladb/scylladb/issues/22428 Closes scylladb/scylladb#22952 * github.com:scylladb/scylladb: Remove db::config::object_storage_config Move `object_storage.yaml` endpoints to `scylla.yaml`	2025-04-01 16:01:44 +03:00
Michał Chojnowski	3f7969313f	main: run a sstable_dict_autotrainer Create an instance of `sstable_dict_autotrainer` in `scylla_main` and run it.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	bea866a46f	main: clean up sstable compression dicts after table drops When a table is dropped, its corresponding dictionary in `system.dicts` -- if any -- should be deleted, otherwise it will remain forever as garbage. This commit implements such cleanup.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	58ae278d10	api: add the retrain_dict API call Add an API call which will retrain the SSTable compression dictionary for a given table. Currently, it needs all nodes to be alive to succeed. We can relax this later.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	94d244ab49	main: in compression_dict_updated_callback, recognize and use SSTable compression dicts Currently, there is at most one dictionary in `system.dicts`: named "general", used by RPC compression. So the callback called on `system.dicts` just always refreshes the RPC compression dict. In a follow-up commit, we will publish SSTable compression dicts to `system.dicts` rows with a name in the "sstables/{table_uuid}" format. We want modification to such rows to be passed as new dictionary recommendations to the SSTable compressor factory. This commit teaches the `system.dicts` modification callback to recognize such modifications and forward them to the compressor factory.	2025-04-01 00:07:29 +02:00
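The name-based routing described in the entry above can be sketched as follows. This is a hypothetical Python illustration, not the actual callback: `route_dict_update` and the returned labels are made up, and only the two name formats ("general" and "sstables/{table_uuid}") come from the commit message.

```python
import uuid
from typing import Optional, Tuple

RPC_DICT_NAME = "general"
SSTABLE_DICT_PREFIX = "sstables/"

def route_dict_update(name: str) -> Tuple[str, Optional[uuid.UUID]]:
    """Decide which consumer should see a modified `system.dicts` row."""
    if name == RPC_DICT_NAME:
        return ("rpc", None)
    if name.startswith(SSTABLE_DICT_PREFIX):
        try:
            table_id = uuid.UUID(name[len(SSTABLE_DICT_PREFIX):])
        except ValueError:
            # Not a well-formed table UUID; this sketch ignores such rows.
            return ("ignored", None)
        return ("sstable", table_id)
    return ("ignored", None)
```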
Michał Chojnowski	4856f4acca	db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict Extend the `system.dicts` helper for querying and modifying `system.dicts` with an ability to use names other than "general". We will use that in later commits to publish dictionaries for SSTable compression.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	b77c611c00	raft/group0_state_machine: on `system.dicts` mutations, pass the affected partition keys to the callback Before this patch, `system.dicts` contains only one dictionary, for RPC compression, with the fixed name "general". In later parts of this series, we will add more dictionaries to system.dicts, one per table, for SSTable compression. To enable that, this patch adjusts the callback mechanism for group0's `write_mutations` command, so that the mutation callbacks for group0-managed tables can see which partition keys were affected. This way, the callbacks can query only the modified partitions instead of doing a full scan. (This is necessary to prevent quadratic behaviours.) For now, only the `system.dicts` callback uses the partition keys.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	30a9d471fa	sstables: plug an `sstable_compressor_factory` into `sstables_manager` Create a `sstable_compressor_factory_impl` in `scylla_main`, and pipe it through constructors into `sstables_manager`. In next commits, the factory available through the `sstables_manager` will be used to create compressors for SSTable readers and writers.	2025-04-01 00:07:28 +02:00
Robert Bindar	b647196121	Remove db::config::object_storage_config That map became redundant once we added object_storage_endpoints in the config, this patch removes it and switches all the user code to use the new option. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 17:15:12 +03:00
Robert Bindar	e3a3508960	Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 13:39:39 +03:00
Pavel Emelyanov	1da889f239	Merge 'Allow abort during join_cluster' from Benny Halevy Bootstrap or replace can take a long time, but since `feef7d3fa1`, the stop_signal is checked only in checkpoints, and in particular, abort isn't requested during join_cluster. Fixes #23222 * requires backport on top of https://github.com/scylladb/scylladb/pull/23184 Closes scylladb/scylladb#23306 * github.com:scylladb/scylladb: main: allow abort during join_cluster main: add checkpoint before joining cluster storage_service: add start_sys_dist_ks	2025-03-26 15:48:58 +03:00
Avi Kivity	7646e1448a	Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek This PR is an introductory step towards enforcing RF-rack-valid keyspaces in Scylla. The scope of changes: * defining RF-rack-valid keyspaces, * introducing a configuration option enforcing RF-rack-valid keyspaces, * restricting the CREATE and ALTER KEYSPACE statements so that they never lead to RF-rack invalid keyspaces, * during the initialization of a node, it verifies that all existing keyspaces are RF-rack-valid. If not, the initialization fails. We provide tests verifying that the changes behave as intended. --- Note that there are a number of things that still need to be implemented. That includes, for instance, restricting topology operations too. --- Implementation strategy (going beyond the scope of this PR): 1. Introduce the new configuration option `rf_rack_valid_keyspaces`. 2. Start enforcing RF-rack-validity in keyspaces if the option is enabled. 3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests. 4. Once the tests have been adjusted, change the default value of the option to enabled. 5. Stop explicitly enabling the option in tests. 6. Get rid of the option. --- Fixes scylladb/scylladb#20356 Fixes scylladb/scylladb#23276 Fixes scylladb/scylladb#23300 --- Backport: this is part of the requirements for releasing 2025.1. Closes scylladb/scylladb#23138 * github.com:scylladb/scylladb: main: Refuse to start node when RF-rack-invalid keyspace exists cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces db/config: Introduce RF-rack-valid keyspaces	2025-03-20 19:10:36 +02:00
Dawid Mędrek	0e04a6f3eb	main: Refuse to start node when RF-rack-invalid keyspace exists When a node is started with the option `rf_rack_valid_keyspaces` enabled, the initialization will fail if there is an RF-rack-invalid keyspace. We want to force the user to adjust their existing keyspaces when upgrading to 2025.* so that the invariant that every keyspace is RF-rack-valid is always satisfied. Fixes scylladb/scylladb#23300	2025-03-19 15:13:44 +01:00
Piotr Dulikowski	2ca1c0b6f9	Merge 'introduce the new Raft-based recovery procedure for group 0 majority loss' from Patryk Jędrzejczak This PR introduces the new Raft-based recovery procedure for group 0 majority loss. The Raft-based recovery procedure works with tablets. The old gossip-based recovery procedure does not because we have no code for tablet migrations after the gossip-based topology changes. The Raft-based procedure requires the Raft-based topology to be enabled in the cluster. If the Raft-based topology is not enabled, the gossip-based procedure must be used. We will be able to get rid of the gossip-based procedure when we make the Raft-based topology mandatory (we can do both in the same version, 2025.2 is the plan). Before we do it, we will have to keep both procedures and explain when each of them should be used. The idea behind the new procedure is to recreate group 0 without touching the topology structures. Once we create a new group 0, we can remove all dead nodes using the standard `removenode` and `replace` operations. For the procedure to be safe, we must ensure that each member of the new group 0 moves to the same initial group 0 state. Also, the only safe choice for the state is the latest persistent state available among the live nodes. The solution to the problem above is to ensure that the leader of the new group 0 (called the recovery leader) is one of the nodes with the latest state available. Other members will receive the snapshot from the recovery leader when they join the new group 0 and move to its state. Below is the shortened description of the new recovery procedure from the perspective of the administrator. For the full description, refer to the design document. 1. Find the set of live nodes. 2. Kill any live node that shouldn't be a member of the new group 0. 3. Ensure the full network connectivity between live nodes. 4. Rolling restart live nodes to ensure they are healthy and ready for recovery. 5. 
Check if some data could have been lost. If yes, restore it from backup after the recovery procedure. 6. Find the recovery leader (the node with the largest `group0_state_id`). 7. Remove `raft_group_id` from `system.scylla_local` and truncate `system.discovery` on each live node. 8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the recovery leader on each live node. 9. Rolling restart all live nodes, but the recovery leader must be restarted first. 10. Remove all dead nodes using `removenode` or `replace`. 11. Unset `recovery_leader` on all nodes. 12. Delete data of the old group 0 from `system.raft`, `system.raft_snapshots`, and `system.raft_snapshot_config`. In the future, we could automate some of these steps or even introduce a tool that will do all (or most) of them by itself. For now, we are fine with a procedure that is reliable and simple enough. This PR makes using 2025.1 with tablets much safer. We want to backport it to 2025.1. We will also want to backport a few follow-ups. Fixes scylladb/scylladb#20657 Closes scylladb/scylladb#22286 * github.com:scylladb/scylladb: test: mark tests with the gossip-based recovery procedure test: add tests for the Raft-based recovery procedure test: topology: util: fix the tokens consistency check for left nodes test: topology: util: extend start_writes gossip: allow group 0 ID mismatch in the Raft-based recovery procedure raft_group0: modify_raft_voter_status: do not add new members treewide: allow recreating group 0 in the Raft-based recovery procedure	2025-03-18 19:10:56 +01:00
Calle Wilund	1525cb2dba	main/commitlog: wait for file deletion and distribute recycled segments to shards Refs #23017 When replaying a large commitlog from an unclean node, we can cause shard 0 db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with disk space. But deleting segments early here is also wasteful; a better solution is to simply give the segments to all CL shards, thus distributing the available space. v2: * Do segment distribution using ranges. go c++23	2025-03-17 12:09:00 +00:00
Benny Halevy	41f02c521d	main: allow abort during join_cluster Bootstrap or replace can take a long time, but since `feef7d3fa1`, the stop_signal is checked only in checkpoints, and in particular, abort isn't requested during join_cluster. Fixes #23222 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-16 12:21:15 +02:00
Benny Halevy	f269480f53	main: add checkpoint before joining cluster Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-16 12:08:04 +02:00
Botond Dénes	83ea1877ab	Merge 'scylla-sstable: add native S3 support' from Ernest Zaslavsky scylla-sstable: Enable support for S3-stored sstables Minimal implementation of what was mentioned in this [issue](https://github.com/scylladb/scylladb/issues/20532) This update allows Scylla to work with sstables stored on AWS S3. Users can specify the fully qualified location of the sstable using the format: `s3://bucket/prefix/sstable_name`. One should have `object_storage_config_file` referenced in the `scylla.yaml` as described in docs/operating-scylla/admin.rst ref: https://github.com/scylladb/scylladb/issues/20532 fixes: https://github.com/scylladb/scylladb/issues/20535 No backport needed since the S3 functionality was never released Closes scylladb/scylladb#22321 * github.com:scylladb/scylladb: tests: Add Tests for Scylla-SSTable S3 Functionality docs: Update Scylla Tools Documentation for S3 SSTable Support scylla-sstable: Enable Support for S3 SSTables s3: Implement S3 Fully Qualified Name Manipulation Functions object_storage: Refactor `object_storage.yaml` parsing logic	2025-03-14 15:05:52 +02:00
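The `s3://bucket/prefix/sstable_name` format mentioned in the entry above splits into three parts. The Python sketch below shows one way to take such a name apart with the standard library. It is only an illustration: `parse_s3_fqn` is a hypothetical name, and the validation rules are assumptions, not those of the functions added in the PR.

```python
from typing import Tuple
from urllib.parse import urlsplit

def parse_s3_fqn(fqn: str) -> Tuple[str, str, str]:
    """Split an s3://bucket/prefix/sstable_name location into its parts."""
    parts = urlsplit(fqn)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// location: {fqn!r}")
    # Everything after the bucket is the object key; its last component
    # is the sstable name and the rest is the prefix.
    prefix, _, name = parts.path.lstrip("/").rpartition("/")
    if not name:
        raise ValueError(f"missing sstable name: {fqn!r}")
    return parts.netloc, prefix, name
```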
Patryk Jędrzejczak	9970c1fcc3	gossip: allow group 0 ID mismatch in the Raft-based recovery procedure This patch ensures that members of the new group 0 can gossip with members of the old group 0 during rolling restart in the Raft-based recovery procedure. Without this change, restarted nodes (members of the new group 0) wouldn't be marked as UP by other nodes (members of the old group 0), which would decrease availability.	2025-03-14 13:53:05 +01:00
Pavel Emelyanov	2bb455ec75	Merge 'Main: stop system_keyspace' from Benny Halevy This series adds an async guard to system_keyspace operations and adds a deferred action to stop the system_keyspace in main() before destroying the service. This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped. * Enhancement, no backport needed Closes scylladb/scylladb#23113 * github.com:scylladb/scylladb: main: stop system keyspace system_keyspace: call shutdown from stop system_keyspace: shutdown: allow calling more than once database, compaction_manager, large_data_handler: use pluggable<system_keysapce> utils: add class pluggable	2025-03-14 13:23:28 +03:00
Avi Kivity	696ce4c982	Merge "convert some parts of the gossiper to host ids" from Gleb " This series starts conversion of the gossiper to use host ids to index nodes. It does not touch the main map yet, but converts a lot of internal code to host id. There are also some unrelated cleanups that were done while working on the series. One of which is dropping code related to old shadow round. We replaced shadow round with explicit GOSSIP_GET_ENDPOINT_STATES verb in `cd7d64f588` which is in scylla-4.3.0, so there should be no compatibility problem. We already dropped a lot of old shadow round code in previous patches anyway. I tested manually that old and new node can co-exist in the same cluster, " * 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits) gossiper: drop unneeded code gossiper: move _expire_time_endpoint_map to host_id gossiper: move _just_removed_endpoints to host id gossiper: drop unused get_msg_addr function messaging_service: change connection dropping notification to pass host id only messaging_service: pass host id to remove_rpc_client in down notification treewide: pass host id to endpoint_lifecycle_subscriber treewide: drop endpoint life cycle subscribers that do nothing load_meter: move to host id treewide: use host id directly in endpoint state change subscribers treewide: pass host id to endpoint state change subscribers gossiper: drop deprecated unsafe_assassinate_endpoint operation storage_service: drop unused code in handle_state_removed treewide: drop endpoint state change subscribers that do nothing gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory gossiper: start using host ids to send messages earlier messaging_service: add temporary address map entry on incoming connection topology_coordinator: notify about IP change from sync_raft_topology_nodes as well treewide: move everyone to use host id based gossiper::is_alive and drop ip based one storage_proxy: drop unused 
template ...	2025-03-13 13:36:31 +02:00
Dawid Mędrek	0a6137218a	db/hints: Cancel draining when stopping node Draining hints may occur in one of the two scenarios: * a node leaves the cluster and the local node drains all of the hints saved for that node, * the local node is being decommissioned. Draining may take some time and the hint manager won't stop until it finishes. It's not a problem when decommissioning a node, especially because we want the cluster to retain the data stored in the hints. However, it may become a problem when the local node started draining hints saved for another node and now it's being shut down. There are two reasons for that: * Generally, in situations like that, we'd like to be able to shut down nodes as fast as possible. The data stored in the hints won't disappear from the cluster yet since we can restart the local node. * Draining hints may introduce flakiness in tests. Replaying hints doesn't have the highest priority and it's reflected in the scheduling groups we use as well as the explicitly enforced throughput. If there are a large number of hints to be replayed, it might affect our tests. It's already happened, see: scylladb/scylladb#21949. To solve those problems, we change the semantics of draining. It will behave as before when the local node is being decommissioned. However, when the local node is only being stopped, we will immediately cancel all ongoing draining processes and stop the hint manager. To amend for that, when we start a node and it initializes a hint endpoint manager corresponding to a node that's already left the cluster, we will begin the draining process of that endpoint manager right away. That should ensure all data is retained, while possibly speeding up the shutdown process. There's a small trade-off to it, though. If we stop a node, we can then remove it. It won't have a chance to replay hints it might've before these changes, but that's an edge case. We expect this commit to bring more benefit than harm. 
We also provide tests verifying that the implementation works as intended. Fixes scylladb/scylladb#21949 Closes scylladb/scylladb#22811	2025-03-13 11:55:15 +02:00
Avi Kivity	b1d9f80d85	Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec Before this patch, the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogeneous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shard's storage utilization, so it no longer assumes that shards have the same capacity. It still assumes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer needs to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant setback because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to above, so not implemented. Also, per-shard goal for tablet count is still the same for all nodes in the cluster, so nodes with less capacity will be below limit and nodes with more capacity will be slightly above limit. This shouldn't be a significant problem in practice, we could compensate for this by increasing the limit. 
Refs #23042 Closes scylladb/scylladb#23079 * github.com:scylladb/scylladb: tablets: Make load balancing capacity-aware topology_coordinator: Fix confusing log message topology_coordinator: Refresh load stats after adding a new node topology_coordinator: Allow capacity stats to be refreshed with some nodes down topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places test: boost: tablets_test: Always provide capacity in load_stats test: perf_load_balancing: Set node capacity test: perf_load_balancing: Convert to topology_builder config, disk_space_monitor: Allow overriding capacity via config storage_service, tablets: Collect per-node capacity in load_stats	2025-03-11 14:34:27 +02:00
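The difference between equalizing tablet count and equalizing storage utilization, as described in the entry above, can be shown with a small Python sketch. It is not the load balancer's algorithm. The function names are hypothetical, all tablets are assumed to have the same size (as the entry states), and the proportional split with largest-remainder rounding is a simplification chosen for the example.

```python
def shard_utilization(tablet_count: int, tablet_size: float, capacity: float) -> float:
    """Storage utilization of one shard under the equal-tablet-size assumption."""
    return tablet_count * tablet_size / capacity

def split_by_capacity(total_tablets: int, capacities: list[float]) -> list[int]:
    """Distribute tablets so that per-shard utilization is as equal as possible."""
    total_capacity = sum(capacities)
    exact = [total_tablets * c / total_capacity for c in capacities]
    counts = [int(x) for x in exact]
    # Hand out tablets lost to rounding, largest fractional part first.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[: total_tablets - sum(counts)]:
        counts[i] += 1
    return counts

capacities = [100.0, 200.0]   # two shards, the second has twice the disk
by_count = [30, 30]           # equal tablet count: utilization 0.30 vs 0.15
by_capacity = split_by_capacity(60, capacities)   # 20 and 40: 0.20 on both
```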
Gleb Natapov	f0af3f261e	messaging_service: add temporary address map entry on incoming connection We want to move to use host ids as soon as possible. Currently it is possible only after the full gossiper exchange (because only at this point gossiper state is added and with it address map entry). To make it possible to move to host ids earlier this patch adds address map entries on incoming communication during CLIENT_ID verb processing. The patch also adds generation to CLIENT_ID to use it when address map is updated. It is done so that older gossiper entries can be overwritten with newer mapping in case of IP change.	2025-03-11 12:09:21 +02:00
Ernest Zaslavsky	38165fd285	object_storage: Refactor `object_storage.yaml` parsing logic Refactored the parsing of `object_storage.yaml` out of Scylla's `main` function. This change is made to facilitate reusability of the parsing logic in other parts of the codebase.	2025-03-09 09:50:36 +02:00
Tomasz Grabiec	d01cc16d1e	config, disk_space_monitor: Allow overriding capacity via config Intended for testing, or hot-fixing out-of-space issues in production. Tablet load balancer uses this information for determining per-shard load so reducing capacity will cause tablets to be migrated away from the node.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	7e7f1e6f91	storage_service, tablets: Collect per-node capacity in load_stats New RPC is introduced because load_stats was marked "final" in the IDL. Will be needed by capacity-aware load balancing.	2025-03-06 12:17:32 +01:00
Botond Dénes	49d6bf8947	Merge 'main: safely check stop_signal in-between starting services' from Benny Halevy To simplify aborting scylla while starting the services, add a _ready state to stop_signal, so that until main is ready to be stopped by the abort_source, just register that the signal is caught, and let a check() method poll that and request abort and throw respective exception only then, in controlled points that are in-between starting of services after the service started successfully and a deferred stop action was installed. This patch prevents gate_closed_exception to escape handling when start-up is aborted early with the stop signal, causing https://github.com/scylladb/scylladb/issues/23153 The regression is apparently due to `a25c3eaa1c` Fixes https://github.com/scylladb/scylladb/issues/23153 * Requires backport to 2025.1 due to `a25c3eaa1c` Closes scylladb/scylladb#23103 * github.com:scylladb/scylladb: main: add checkpoints main: safely check stop_signal in-between starting services main: move prometheus start message main: move per-shard database start message	2025-03-06 08:28:29 +02:00
Benny Halevy	8ae8275f17	main: stop system keyspace To prevent internal queries coming from system_keyspace (like updating compaction history, for example) Refs scylladb/scylla-dtest#5581 Refs #22886 Refs #8995 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:30:23 +02:00
Benny Halevy	b6705ad48b	main: add checkpoints Before starting significant services that didn't have a corresponding call to supervisor::notify before them. Fixes #23153 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 07:29:34 +02:00
Benny Halevy	feef7d3fa1	main: safely check stop_signal in-between starting services To simplify aborting scylla while starting the services, add a _ready state to stop_signal, so that until main is ready to be stopped by the abort_source, just register that the signal is caught, and let a check() method poll that and request abort and throw respective exception only then, in controlled points that are in-between starting of services after the service started successfully and a deferred stop action was installed. Refs #23153 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 07:15:17 +02:00
Benny Halevy	282ff344db	main: move prometheus start message The `prometheus_server` is started only conditionally but the notification message is sent and logged unconditionally. Move it inside the conditional code block. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 07:09:09 +02:00
Benny Halevy	23433f593c	main: move per-shard database start message It is now logged out of place, so move it to right before calling `start` on every database shard. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 07:09:09 +02:00
Amnon Heiman	fd5d1f1f6a	main.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_scylladb_current_version scylla_reactor_utilization Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Botond Dénes	5d63ef4d15	Merge 'scylla sstable: Add standard extensions and propagate to schema load ' from Calle Wilund Fixes #22314 Adds expected schema extensions to the tools extension set (if used). Also uses the source config extensions in schema loader instead of temp one, to ensure we can, for example, load a schema.cql with things like `tombstone_gc` or encryption attributes in them. Bundles together the setup of "always on" schema extensions into a single call, and uses this from the three (3) init points. Could have opted for static reg via `configurables`, but since we are moving to a single code base, the need for this is going away, hence explicit init seems more in line. Closes scylladb/scylladb#22327 * github.com:scylladb/scylladb: tools: Add standard extensions and propagate to schema load cql_test_env: Use add all extensions instead of inidividually main: Move extensions adding to function tomstone_gc: Make validate work for tools	2025-02-26 13:52:47 +02:00
Tomasz Grabiec	3d01ce3707	config: Make tablets_initial_scale_factor live-updateable	2025-02-19 16:29:08 +01:00
Tomasz Grabiec	7e4a61953d	tablets: load_balancer: Pick initial_scale_factor from config So that it can be live-updated.	2025-02-19 16:29:08 +01:00
Tomasz Grabiec	f1bda8d4c1	tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal The limit is enforced by controlling average per-shard tablet replica count in a given DC, which is controlled by per-table tablet count. This is effective in respecting the limit on individual shards as long as tablet replicas are distributed evenly between shards. There is no attempt to move tablets around in order to enforce limits on individual shards in case of imbalance between shards. If the average per-shard tablet count exceeds the limit, all tables which contribute to it (have replicas in the DC) are scaled down by the same factor. Due to rounding up to the nearest power of 2, we may overshoot the per-shard goal by at most a factor of 2. If different DCs want different scale factors of a given table, the lowest scale factor is chosen for a given table. The limit is configurable. It's a global per-cluster config which controls how many tablet replicas per shard in total we consider to be still ok. It controls tablet allocator behavior, when choosing initial tablet count. Even though it's a per-node config, we don't support different limits per node. All nodes must have the same value of that config. It's similar in that regard to other scheduler config items like tablets_initial_scale_factor and target_tablet_size_in_bytes.	2025-02-19 16:29:07 +01:00
Pavel Emelyanov	5d1f74b86a	main: Start sharded<view_builder> earlier The view_builder service is needed by repair service, but is started after it. It's OK in a sense that repair service holds a sharded reference on it and checks whether local_is_initialized() before using it, which is not nice. Fortunately, starting sharded view builder can be done early enough, because most of its dependencies would be already started by that time. Two exceptions are -- view_update_generator and system_distributed_keyspace. Both can be moved up too with the same justification. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-02-14 20:26:55 +03:00
Gleb Natapov	d288d79d78	api: initialize token metadata API after starting the gossiper Token metadata API now depends on gossiper to do ip to host id mappings, so initialize it after the gossiper is initialized and de-initialize it before gossiper is stopped. Fixes: scylladb/scylladb#22743 Closes scylladb/scylladb#22760	2025-02-13 14:39:05 +01:00
Botond Dénes	4a7a75dfcb	Merge 'tasks: use host_id in task manager' from Aleksandra Martyniuk Use host_id in a children list of a task in task manager to indicate a node on which the child was created. Move TASKS_CHILDREN_REQUEST to IDL. Send it by host_id. Fixes: https://github.com/scylladb/scylladb/issues/22284. Ip to host_id transition; backport isn't needed. Closes scylladb/scylladb#22487 * github.com:scylladb/scylladb: tasks: drop task_manager::config::broadcast_address as it's unused tasks: replace ip with host_id in task_identity api: task_manager: pass gossiper to api::set_task_manager tasks: keep host_id in task_manager tasks: move tasks_get_children to IDL	2025-02-11 11:32:27 +02:00
Ernest Zaslavsky	dee4fc7150	aws creds: add STS and Instance Metadata service credentials providers This commit introduces two new credentials providers: STS and Instance Metadata Service. The S3 client's provider chain has been updated to incorporate these new providers. Additionally, unit tests have been added to ensure coverage of the new functionality.	2025-02-05 14:57:19 +02:00

1458 Commits