scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-01 12:36:56 +00:00

Author	SHA1	Message	Date
Radosław Cybulski	c36614e16d	alternator: add size check to BatchWriteItem Add a size check for the BatchWriteItem command: if the item count is bigger than the configuration value `alternator_maximum_batch_write_size`, an error will be raised and no modification will happen. This is done to synchronize with DynamoDB, where the maximum size of BatchWriteItem is 25. To avoid complaints from clients who rely on our BatchWriteItem being limitless, we set the default value to 100. Fixes #5057 Closes scylladb/scylladb#23232	2025-04-02 14:48:00 +03:00
Avi Kivity	882f405eed	Merge "Convert gossiper's endpoint state map to be host id based" from Gleb " The series makes endpoint state map in the gossiper addressable by host id instead of ips. The transition has implication outside of the gossiper as well. Gossiper based topology operations are affected by this change since they assume that the mapping is ip based. On wire protocol is not affected by the change as maps that are sent by the gossiper protocol remain ip based. If old node sends two different entries for the same host id the one with newer generation is applied. If new node has two ids that are mapped to the same ip the newer one is added to the outgoing map. Interoperability was verified manually by running mixed cluster. The series concludes the conversion of the system to be host id based. " * 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev: gossiper: make examine_gossiper private gossiper: rename get_nodes_with_host_id to get_node_ip treewide: drop id parameter from gossiper::for_each_endpoint_state treewide: move gossiper to index nodes by host id gossiper: drop ip from replicate function parameters gossiper: drop ip from apply_new_states parameters gossiper: drop address from handle_major_state_change parameter list gossiper: pass rpc::client_info to gossiper_shutdown verb handler gossiper: add try_get_host_id function gossiper: add ip to endpoint_state serialization: fix std::map de-serializer to not invoke value's default constructor gossiper: drop template from wait_alive_helper function gossiper: move get_supported_features and its users to host id storage_service: make candidates_for_removal host id based gossiper: use peers table to detect address change storage_service: use std::views::keys instead of std::views::transform that returns a key gossiper: move _pending_mark_alive_endpoints to host id gossiper: do not allow to assassinate endpoint in raft topology mode gossiper: fix indentation after previous patch 
gossiper: do not allow to assassinate non existing endpoint	2025-04-02 12:30:00 +03:00
Kefu Chai	6da758d74c	config: mark uuid_sstable_identifiers_enabled unused the option of `uuid_sstable_identifiers_enabled` was introduced in `f014ccf3`. the first version which has this change was 5.4, and 6.1 has been branched. during the discussion of backup and restore, we realized that we've been taking efforts to address problems which could have been addressed with the sstable with UUID-based identifier. see also #10459 which is the issue which proposed to implement UUID-v1 based sstable identifier. now that two major releases passed, we should have the luxury to mark this option "unused". this option was previously introduced to keep backward compatibility, and to allow users to opt out of the feature for some reasons. so in this change, mark the option unused, so that if any user still sets this option with command line, they will get a clear error. but we still parse and handle this setting in `scylla.yaml`, so that this option is still respected for existing settings, and for existing tests, which are not yet prepared for the uuid-based sstable identifiers. Refs #10459 Fixes #20337 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#20341	2025-04-01 20:21:47 +03:00
Pavel Emelyanov	2ee9cec1d3	Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Also, the `object_storage_config_file` option is moved to a deprecated state as it's no longer needed. This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f Refs https://github.com/scylladb/scylladb/issues/22428 Closes scylladb/scylladb#22952 * github.com:scylladb/scylladb: Remove db::config::object_storage_config Move `object_storage.yaml` endpoints to `scylla.yaml`	2025-04-01 16:01:44 +03:00
Michał Chojnowski	4f0d453acf	dict_autotrainer: introduce sstable_dict_autotrainer Add a fiber responsible for periodic re-training of compression dictionaries (for tables which opted into dict-aware compression). As of this patch, it works like this: every `$tick_period` (15 minutes), if we are the current Raft leader, we check for dict-aware tables which have no dict, or a dict older than `$retrain_period`. For those tables, if they have enough data (>1GiB) for a training, we train a new dict and check if it's significantly better than the current one (provides ratio smaller than 95% of current ratio), and if so, we update the dict.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	9d02e2c005	db/system_keyspace: add query_dict_timestamp Adds a helper method which queries the creation timestamp of a given dict in `system.dicts`. We will later use the age of the current SSTable compression dict to decide if another training should be done already.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	bea866a46f	main: clean up sstable compression dicts after table drops When a table is dropped, its corresponding dictionary in `system.dicts` -- if any -- should be deleted, otherwise it will remain forever as garbage. This commit implements such cleanup.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	4856f4acca	db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict Extend the `system.dicts` helper for querying and modifying `system.dicts` with an ability to use names other than "general". We will use that in later commits to publish dictionaries for SSTable compression.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	b77c611c00	raft/group0_state_machine: on `system.dicts` mutations, pass the affected partition keys to the callback Before this patch, `system.dicts` contains only one dictionary, for RPC compression, with the fixed name "general". In later parts of this series, we will add more dictionaries to system.dicts, one per table, for SSTable compression. To enable that, this patch adjusts the callback mechanism for group0's `write_mutations` command, so that the mutation callbacks for group0-managed tables can see which partition keys were affected. This way, the callbacks can query only the modified partitions instead of doing a full scan. (This is necessary to prevent quadratic behaviours.) For now, only the `system.dicts` callback uses the partition keys.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	30a9d471fa	sstables: plug an `sstable_compressor_factory` into `sstables_manager` Create a `sstable_compressor_factory_impl` in `scylla_main`, and pipe it through constructors into `sstables_manager`. In next commits, the factory available through the `sstables_manager` will be used to create compressors for SSTable readers and writers.	2025-04-01 00:07:28 +02:00
Robert Bindar	b647196121	Remove db::config::object_storage_config That map became redundant once we added object_storage_endpoints in the config, this patch removes it and switches all the user code to use the new option. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 17:15:12 +03:00
Gleb Natapov	28fb84117d	treewide: drop id parameter from gossiper::for_each_endpoint_state We have it in endpoint_state anyway, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	4609bbbbb2	treewide: move gossiper to index nodes by host id This patch changes gossiper to index nodes by host ids instead of ips. The main data structure that changes is _endpoint_state_map, but this results in a lot of changes since everything that uses the map directly or indirectly has to be changed. The big victim of this outside of the gossiper itself is topology over gossiper code. It works on IPs and assumes the gossiper does the same and both need to be changed together. Changes to other subsystems are much smaller since they already mostly work on host ids anyway.	2025-03-31 16:50:50 +03:00
Gleb Natapov	0dd86b4f1d	gossiper: move get_supported_features and its users to host id	2025-03-31 15:42:07 +03:00
Robert Bindar	e3a3508960	Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 13:39:39 +03:00
Pavel Emelyanov	9aa986a49a	snapshot-ctl: Remove unused snapshot-single-table method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-28 10:45:31 +03:00
Benny Halevy	62aeba759b	tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow to opt-out when creating new keyspaces by setting `tablets = {'enabled': false}`. Refs scylladb/scylla-enterprise#4355 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 15:32:16 +02:00
Benny Halevy	c62865df90	db/config: add tablets_mode_for_new_keyspaces option The new option deprecates the existing `enable_tablets` option. It will be extended in the next patch with a 3rd value: "enforced", which will enable tablets by default for new keyspaces but without the possibility to opt out using the `tablets = {'enabled': false}` keyspace schema option. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 14:54:45 +02:00
Avi Kivity	7646e1448a	Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek This PR is an introductory step towards enforcing RF-rack-valid keyspaces in Scylla. The scope of changes: * defining RF-rack-valid keyspaces, * introducing a configuration option enforcing RF-rack-valid keyspaces, * restricting the CREATE and ALTER KEYSPACE statements so that they never lead to RF-rack invalid keyspaces, * during the initialization of a node, it verifies that all existing keyspaces are RF-rack-valid. If not, the initialization fails. We provide tests verifying that the changes behave as intended. --- Note that there are a number of things that still need to be implemented. That includes, for instance, restricting topology operations too. --- Implementation strategy (going beyond the scope of this PR): 1. Introduce the new configuration option `rf_rack_valid_keyspaces`. 2. Start enforcing RF-rack-validity in keyspaces if the option is enabled. 3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests. 4. Once the tests have been adjusted, change the default value of the option to enabled. 5. Stop explicitly enabling the option in tests. 6. Get rid of the option. --- Fixes scylladb/scylladb#20356 Fixes scylladb/scylladb#23276 Fixes scylladb/scylladb#23300 --- Backport: this is part of the requirements for releasing 2025.1. Closes scylladb/scylladb#23138 * github.com:scylladb/scylladb: main: Refuse to start node when RF-rack-invalid keyspace exists cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces db/config: Introduce RF-rack-valid keyspaces	2025-03-20 19:10:36 +02:00
Avi Kivity	a62ab824e6	schema: deprecate schema_extension schema_extension allows making invisible changes to system_schema that evade upgrade rollback tests. They appear in system_schema as an encoded blob which reduces serviceability, as they cannot be read. Deprecate it and point users to adding explicit columns in scylla_tables. We could probably make use of the data structure, after we teach it to encode its payload into proper named and typed columns instead of using IDL. Closes scylladb/scylladb#23151	2025-03-19 20:36:16 +02:00
Dawid Mędrek	32879ec0d5	db/config: Introduce RF-rack-valid keyspaces We introduce a new term in the glossary: RF-rack-valid keyspace. We also highlight in our user documentation that all keyspaces must remain RF-rack-valid throughout their lifetime, and failing to guarantee that may result in data inconsistencies or other issues. We base that information on our experience with materialized views in keyspaces using tablets, even though they remain an experimental feature. Along with the new term, we introduce a new configuration option called `rf_rack_valid_keyspaces`, which, when enabled, will enforce preserving all keyspaces RF-rack-valid. That functionality will be implemented in upcoming commits. For now, we materialize the restriction in form of a named requirement: a function verifying that the passed keyspace is RF-rack-valid. The option is disabled by default. That will change once we adjust the existing tests to the new semantics. Once that is done, the option will first be enabled by default, and then it will be removed. Fixes scylladb/scylladb#20356	2025-03-19 14:46:35 +01:00
Piotr Dulikowski	2ca1c0b6f9	Merge 'introduce the new Raft-based recovery procedure for group 0 majority loss' from Patryk Jędrzejczak This PR introduces the new Raft-based recovery procedure for group 0 majority loss. The Raft-based recovery procedure works with tablets. The old gossip-based recovery procedure does not because we have no code for tablet migrations after the gossip-based topology changes. The Raft-based procedure requires the Raft-based topology to be enabled in the cluster. If the Raft-based topology is not enabled, the gossip-based procedure must be used. We will be able to get rid of the gossip-based procedure when we make the Raft-based topology mandatory (we can do both in the same version, 2025.2 is the plan). Before we do it, we will have to keep both procedures and explain when each of them should be used. The idea behind the new procedure is to recreate group 0 without touching the topology structures. Once we create a new group 0, we can remove all dead nodes using the standard `removenode` and `replace` operations. For the procedure to be safe, we must ensure that each member of the new group 0 moves to the same initial group 0 state. Also, the only safe choice for the state is the latest persistent state available among the live nodes. The solution to the problem above is to ensure that the leader of the new group 0 (called the recovery leader) is one of the nodes with the latest state available. Other members will receive the snapshot from the recovery leader when they join the new group 0 and move to its state. Below is the shortened description of the new recovery procedure from the perspective of the administrator. For the full description, refer to the design document. 1. Find the set of live nodes. 2. Kill any live node that shouldn't be a member of the new group 0. 3. Ensure the full network connectivity between live nodes. 4. Rolling restart live nodes to ensure they are healthy and ready for recovery. 5. 
Check if some data could have been lost. If yes, restore it from backup after the recovery procedure. 6. Find the recovery leader (the node with the largest `group0_state_id`). 7. Remove `raft_group_id` from `system.scylla_local` and truncate `system.discovery` on each live node. 8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the recovery leader on each live node. 9. Rolling restart all live nodes, but the recovery leader must be restarted first. 10. Remove all dead nodes using `removenode` or `replace`. 11. Unset `recovery_leader` on all nodes. 12. Delete data of the old group 0 from `system.raft`, `system.raft_snapshots`, and `system.raft_snapshot_config`. In the future, we could automate some of these steps or even introduce a tool that will do all (or most) of them by itself. For now, we are fine with a procedure that is reliable and simple enough. This PR makes using 2025.1 with tablets much safer. We want to backport it to 2025.1. We will also want to backport a few follow-ups. Fixes scylladb/scylladb#20657 Closes scylladb/scylladb#22286 * github.com:scylladb/scylladb: test: mark tests with the gossip-based recovery procedure test: add tests for the Raft-based recovery procedure test: topology: util: fix the tokens consistency check for left nodes test: topology: util: extend start_writes gossip: allow group 0 ID mismatch in the Raft-based recovery procedure raft_group0: modify_raft_voter_status: do not add new members treewide: allow recreating group 0 in the Raft-based recovery procedure	2025-03-18 19:10:56 +01:00
Botond Dénes	2795d83b32	Merge 'commitlog: Serialize file deletion and distribute replayed segments' from Calle Wilund Fixes #23017 When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either a.) replay release b.) timer check (explicit) c.) timer initiated flush callback where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed. Now, eventually, this should be released, once we do deletion again, but this can take a while. Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure. As noted in the issue above, when replaying a large commitlog from an unclean node, we can cause shard 0 db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with diskspace. But deleting segments early here is also wasteful; A better solution is to simply give the segments to all CL shards, thus distributing the available space. Closes scylladb/scylladb#23150 * github.com:scylladb/scylladb: main/commitlog: wait for file deletion and distribute recycled segments to shards commitlog: Serialize file deletion	2025-03-18 11:47:17 +02:00
Botond Dénes	fda3486770	Merge 'Remove some excessive ks:cf -> table_id conversions in API and schema_tables' from Pavel Emelyanov Actually, the main goal of this PR was to remove parse_tables() helpers from api/ in favor of more flexible (yet same complex) parse_table_infos(), but it turned out that it also saves some lookups in database maps. There are several places in API and schema_tables that have table_id at hand, but at some point drop it and carry keyspace and table names over to a place that maps ks:cf back to table_id and then uses it to find the table object. This PR keeps the table_id with the help of table_info struct in those places. This change allows removing the aforementioned parse_table() helpers from api/ and also saves few lookups in database maps. Removing the parse_tables() from api/ is the continuation of previous effort that reduces the set of helpers in api/ code that help handlers "parse" keyspaces and tables names see #22742 #21533 Closes scylladb/scylladb#23216 * github.com:scylladb/scylladb: api: Remove the remaining parse_tables() overload database: Sanitize flush_tables_on_all_shards() schema_tables: Remove all_table_names() database: Make tables flushing helper use table_info-s, not names api: Make keyspace flush endpoint use parse_table_infos() (and a bit more) schema_tables,client_state: Switch to using all_table_infos() schema_tables: Tune up some methods to benefit from table_infos schema_tables: Introduce all_table_infos()	2025-03-17 15:40:41 +02:00
Calle Wilund	4ed81e05bf	commitlog: Serialize file deletion Fixes #23017 When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either a.) replay release b.) timer check (explicit) c.) timer initiated flush callback where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed. Now, eventually, this should be released, once we do deletion again, but this can take a while. Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure. Small unit test included.	2025-03-17 12:09:00 +00:00
Patryk Jędrzejczak	9970c1fcc3	gossip: allow group 0 ID mismatch in the Raft-based recovery procedure This patch ensures that members of the new group 0 can gossip with members of the old group 0 during rolling restart in the Raft-based recovery procedure. Without this change, restarted nodes (members of the new group 0) wouldn't be marked as UP by other nodes (members of the old group 0), which would decrease availability.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	fd51d7e448	treewide: allow recreating group 0 in the Raft-based recovery procedure This patch adds support for recreating group 0 after losing majority. This is the only part of the new Raft-based recovery procedure that touches Scylla core. The following steps are necessary to recreate group 0: 1. Determine the new group 0 members. These are alive nodes that are normal or rebuilding. 2. Choose the recovery leader - the node which will become the new group 0 leader. This must be one of the nodes with the latest persistent group 0 state. 3. Remove `raft_group_id` from `system.scylla_local` and truncate `system.discovery` on each live node. 4. Set the new scylla.yaml parameter - `recovery_leader` - to Host ID of the recovery leader on each live node. 5. Rolling restart all live nodes, but the recovery leader must be restarted first. In the implementation, restarts in step 5 are very similar to normal restarts with the Raft-based topology enabled. The only differences are: 1. Steps 3-4 make the restarting node discover the new group 0 in `join_cluster`. 2. The group 0 server is started in `join_group0`, not `setup_group0_if_exists`. 3. The restarting node joins the new group 0 in `join_topology` using `legacy_handshaker`. There is no reason to contact the topology coordinator since the node has already joined the topology. Unfortunately, this patch creates another execution path for the starting logic. `join_cluster` becomes even messier. However, there is nothing we can do about it. Joining group 0 without joining topology is something completely new. Having a few small changes without touching other execution paths is the best we can do. We will start removing the old stuff soon, after making the Raft-based topology mandatory, and the situation will improve.	2025-03-14 13:52:57 +01:00
Pavel Emelyanov	2bb455ec75	Merge 'Main: stop system_keyspace' from Benny Halevy This series adds an async guard to system_keyspace operations and adds a deferred action to stop the system_keyspace in main() before destroying the service. This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped. * Enhancement, no backport needed Closes scylladb/scylladb#23113 * github.com:scylladb/scylladb: main: stop system keyspace system_keyspace: call shutdown from stop system_keyspace: shutdown: allow calling more than once database, compaction_manager, large_data_handler: use pluggable<system_keyspace> utils: add class pluggable	2025-03-14 13:23:28 +03:00
Avi Kivity	696ce4c982	Merge "convert some parts of the gossiper to host ids" from Gleb " This series starts the conversion of the gossiper to use host ids to index nodes. It does not touch the main map yet, but converts a lot of internal code to host id. There are also some unrelated cleanups that were done while working on the series. One of which is dropping code related to the old shadow round. We replaced shadow round with explicit GOSSIP_GET_ENDPOINT_STATES verb in `cd7d64f588` which is in scylla-4.3.0, so there should be no compatibility problem. We already dropped a lot of old shadow round code in previous patches anyway. I tested manually that old and new nodes can co-exist in the same cluster. " * 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits) gossiper: drop unneeded code gossiper: move _expire_time_endpoint_map to host_id gossiper: move _just_removed_endpoints to host id gossiper: drop unused get_msg_addr function messaging_service: change connection dropping notification to pass host id only messaging_service: pass host id to remove_rpc_client in down notification treewide: pass host id to endpoint_lifecycle_subscriber treewide: drop endpoint life cycle subscribers that do nothing load_meter: move to host id treewide: use host id directly in endpoint state change subscribers treewide: pass host id to endpoint state change subscribers gossiper: drop deprecated unsafe_assassinate_endpoint operation storage_service: drop unused code in handle_state_removed treewide: drop endpoint state change subscribers that do nothing gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory gossiper: start using host ids to send messages earlier messaging_service: add temporary address map entry on incoming connection topology_coordinator: notify about IP change from sync_raft_topology_nodes as well treewide: move everyone to use host id based gossiper::is_alive and drop ip based one storage_proxy: drop unused template ...	2025-03-13 13:36:31 +02:00
Dawid Mędrek	0a6137218a	db/hints: Cancel draining when stopping node Draining hints may occur in one of the two scenarios: * a node leaves the cluster and the local node drains all of the hints saved for that node, * the local node is being decommissioned. Draining may take some time and the hint manager won't stop until it finishes. It's not a problem when decommissioning a node, especially because we want the cluster to retain the data stored in the hints. However, it may become a problem when the local node started draining hints saved for another node and now it's being shut down. There are two reasons for that: * Generally, in situations like that, we'd like to be able to shut down nodes as fast as possible. The data stored in the hints won't disappear from the cluster yet since we can restart the local node. * Draining hints may introduce flakiness in tests. Replaying hints doesn't have the highest priority and it's reflected in the scheduling groups we use as well as the explicitly enforced throughput. If there are a large number of hints to be replayed, it might affect our tests. It's already happened, see: scylladb/scylladb#21949. To solve those problems, we change the semantics of draining. It will behave as before when the local node is being decommissioned. However, when the local node is only being stopped, we will immediately cancel all ongoing draining processes and stop the hint manager. To amend for that, when we start a node and it initializes a hint endpoint manager corresponding to a node that's already left the cluster, we will begin the draining process of that endpoint manager right away. That should ensure all data is retained, while possibly speeding up the shutdown process. There's a small trade-off to it, though. If we stop a node, we can then remove it. It won't have a chance to replay hints it might've before these changes, but that's an edge case. We expect this commit to bring more benefit than harm. 
We also provide tests verifying that the implementation works as intended. Fixes scylladb/scylladb#21949 Closes scylladb/scylladb#22811	2025-03-13 11:55:15 +02:00
Avi Kivity	b1d9f80d85	Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec Before this patch, the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogeneous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shard's storage utilization, so it no longer assumes that shards have the same capacity. It still assumes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer need to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant set back because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to above, so not implemented. Also, per-shard goal for tablet count is still the same for all nodes in the cluster, so nodes with less capacity will be below limit and nodes with more capacity will be slightly above limit. This shouldn't be a significant problem in practice, we could compensate for this by increasing the limit. 
Refs #23042 Closes scylladb/scylladb#23079 * github.com:scylladb/scylladb: tablets: Make load balancing capacity-aware topology_coordinator: Fix confusing log message topology_coordinator: Refresh load stats after adding a new node topology_coordinator: Allow capacity stats to be refreshed with some nodes down topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places test: boost: tablets_test: Always provide capacity in load_stats test: perf_load_balancing: Set node capacity test: perf_load_balancing: Convert to topology_builder config, disk_space_monitor: Allow overriding capacity via config storage_service, tablets: Collect per-node capacity in load_stats	2025-03-11 14:34:27 +02:00
Gleb Natapov	0e3dcb7954	treewide: move everyone to use host id based gossiper::is_alive and drop ip based one	2025-03-11 12:09:21 +02:00
Gleb Natapov	e47f251178	gossiper: move _live_endpoints and _unreachable_endpoints to host_id Index live and dead endpoints by host id. This also simplifies some code that does a translation.	2025-03-11 12:09:21 +02:00
Pavel Emelyanov	89f3c1a91e	database: Sanitize flush_tables_on_all_shards() Previous patch left this method with few uglinesses - the vector<table_id> argument is named table_names - the sstring keyspace argument is unused - the keyspace argument is captured for no use Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:13:10 +03:00
Pavel Emelyanov	0f9cc956f4	schema_tables: Remove all_table_names() Now it's unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:12:56 +03:00
Pavel Emelyanov	c2d23d7948	database: Make tables flushing helper use table_info-s, not names The database::flush_tables_on_all_shards() method accepts a keyspace name and a vector of table names. It then converts the ks:cf pair for each table name into a table-id and flushes the table with that ID. All the callers of that method already have, or can easily get, the vector of table_id-s rather than just names, so make use of this. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:11:32 +03:00
Pavel Emelyanov	5a897d7368	schema_tables,client_state: Switch to using all_table_infos() There are a few more places left that can use all_table_infos() as a replacement for all_table_names(); patch them. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:05:59 +03:00
Pavel Emelyanov	da05765746	schema_tables: Tune up some methods to benefit from table_infos There are convert_schema_to_mutations() and calculate_schema_digest() that collect table names and then use them to find schema and query mutations from the table. Both can use the newly introduced all_table_infos() and use the returned table_id-s to do the same, thus avoiding re-lookups (which are fast anyway, but still). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:01:50 +03:00
Pavel Emelyanov	d7bfa5a545	schema_tables: Introduce all_table_infos() This method is like all_table_names(), but returns a vector of table_info-s, each of which is effectively a pair of string name and uuid id. To be used later; the string-returning all_table_names() will be removed very soon too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 12:59:03 +03:00
Tomasz Grabiec	d01cc16d1e	config, disk_space_monitor: Allow overriding capacity via config Intended for testing, or hot-fixing out-of-space issues in production. Tablet load balancer uses this information for determining per-shard load so reducing capacity will cause tablets to be migrated away from the node.	2025-03-06 13:35:37 +01:00
Pavel Emelyanov	86b3e9b50b	code: Move checked-file-impl.hh to util/ fixes: #22100 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23123	2025-03-06 10:22:05 +02:00
Pavel Emelyanov	e7d1ea3ab6	commitlog: Use shorter input stream creation overload There's one that doesn't need the offset argument when it's 0 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23140	2025-03-06 08:06:42 +01:00
Benny Halevy	7a624e3df8	system_keyspace: call shutdown from stop and use that to replace the explicit shutdown when stopped in cql_test_env. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:30:23 +02:00
Benny Halevy	102aec64d5	system_keyspace: shutdown: allow calling more than once Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:30:22 +02:00
Benny Halevy	fba88bdd62	database, compaction_manager, large_data_handler: use pluggable<system_keyspace> To allow safe plug and unplug of the system_keyspace. This patch follows up on `917fdb9e53` (more specifically, `f9b57df471`), since just keeping a shared_ptr<system_keyspace> doesn't prevent stopping the system_keyspace shards, while using the `pluggable` interface allows safe draining of outstanding async calls on shutdown, before stopping the system_keyspace. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:27:23 +02:00
Botond Dénes	71d8b7aa9f	querier: demote tombstone warning for range-scans to debug level Range scans are expected to go through lots of tombstones, so there is no need to spam the logs about this. The tombstone warning log is demoted to debug level; anybody who wants to see it can bump the logger to debug level. Fixes: https://github.com/scylladb/scylladb/issues/23093 Closes scylladb/scylladb#23094	2025-03-04 10:38:06 +03:00
Botond Dénes	6f7a069bce	Merge 'Label basic metrics' from Amnon Heiman This series is part of the effort to reduce the overall overhead originating from metrics reporting, both on the Scylla side and on the metrics-collecting server (Prometheus or similar). The idea in this series is to create an equivalent of levels with a label. First, label a subset of the metrics used by the dashboards. Second, the per-table metrics that are now off by default will be marked with a different label. The following optional features now have a dedicated label: CDC, CAS, and Alternator. This will allow users to disable all metrics of features that are not in use. All the rest of the metrics are left unlabeled. Without any changes, users would get the same metrics they are getting today. But you could pass `__level=1` and get only those metrics the dashboard needs. That reduces the number of reported metrics by between 50% and 70% (many metrics are hidden if not used, so the overall number of metrics varies). The labels themselves are not reported, based on the Seastar feature of hiding labels that start with an underscore. 
Closes scylladb/scylladb#12246 * github.com:scylladb/scylladb: db/view/view.cc: label metrics with basic_level transport/server.cc: label metrics with basic_level service/storage_proxy.cc: label metrics with basic_level and cas main.cc: label metrics with basic_level streaming/stream_manager.cc: label metrics with basic_level repair/repair.cc: label metrics with basic_level service/storage_service.cc: label metrics with basic_level gms/gossiper.cc: label metrics with basic_level replica/database.cc: label metrics with basic_level cdc/log.cc: label metrics with basic_level and cdc alternator: label metrics with basic_level and alternator row_cache.cc: label metrics with basic_level query_processor.cc: label metrics with basic_level sstables.cc: label metrics with basic_level utils/logalloc.cc label metrics with basic_level commitlog.cc: label metrics with basic_level compaction_manager.cc: label metrics with basic_level Adding the __level and features labels	2025-03-04 09:32:11 +02:00
Calle Wilund	2f10205714	config: Enable optional TLS1.3 session ticket usage in cert setup Refs #22916 Adds an "enable_session_tickets" option to TLS setup for our server endpoints (not documented for internode RPC, as we don't handle it on the client side there), allowing TLS1.3 client session tickets to be enabled, i.e. quicker reconnect. Session tickets are valid within a time frame or until a node restarts, whichever comes first. v2: Use "TLS1.3" in help message Closes scylladb/scylladb#22928	2025-03-04 09:30:53 +02:00
Amnon Heiman	19a414598b	db/view/view.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_view_builder_builds_in_progress Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:39 +02:00
Amnon Heiman	f40dc4e5c4	row_cache.cc: label metrics with basic_level The following metrics will be marked with basic_level label: scylla_cache_bytes_total scylla_cache_bytes_used scylla_cache_partition_evictions scylla_cache_partition_hits scylla_cache_partition_insertions scylla_cache_partition_merges scylla_cache_partition_misses scylla_cache_partition_removals scylla_cache_range_tombstone_reads scylla_cache_reads scylla_cache_reads_with_misses scylla_cache_row_evictions scylla_cache_row_hits scylla_cache_row_insertions scylla_cache_row_misses scylla_cache_row_removals scylla_cache_rows scylla_cache_rows_merged_from_memtable scylla_cache_row_tombstone_reads Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-03-03 16:58:38 +02:00

4972 Commits