Commit Graph

635 Commits

Author SHA1 Message Date
Petr Gusev
fa25e6d63e token_metadata: add debug logs
We log the new version when the new token
metadata is set.

Also, the log for fence_version is moved
from storage_service to shared_token_metadata
for uniformity.
2023-08-22 14:31:04 +04:00
Benny Halevy
949ea43034 topology: unindex_node: erase dc from datacenters when empty
In branch 5.2 we erase `dc` from `_datacenters` if there are
no more endpoints listed in `_dc_endpoints[dc]`.

This was lost unintentionally in f3d5df5448;
this commit restores that behavior and fixes test_remove_endpoint.

Fixes scylladb/scylladb#14896

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14897
2023-08-02 09:08:24 +03:00
Avi Kivity
dac93b2096 Merge 'Concurrent tablet migration and balancing' from Tomasz Grabiec
This change makes tablet load balancing more efficient by performing
migrations independently for different tablets, and making new load
balancing plans concurrently with active migrations.

The migration track is interrupted by pending topology change operations.

The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.

The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.

Overload of shards is limited by the fact that the load balancer algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.

Closes #14851

* github.com:scylladb/scylladb:
  tablets: load_balancer: Remove double logging
  tests: tablets: Check that load balancing is interrupted by topology change
  tests: tablets: Add test for load balancing with active migrations
  tablets: Balance tablets concurrently with active migrations
  storage_service, tablets: Extract generate_migration_updates()
  storage_service, tablets: Move get_leaving_replica() to tablets.cc
  locator: tablets: Move std::hash definition earlier
  storage_service: Advance tablets independently
  topology_coordinator: Fix missed notification on abort
  tablets: Add formatter for tablet_migration_info
2023-07-31 16:44:33 +03:00
Benny Halevy
d903d03bf8 locator: topology: node::state: make fine grained
Currently node::state is coarse grained,
so one cannot distinguish between e.g. a node
leaving due to decommission (where the node is used
for reading) vs. one leaving due to removenode (where the
node is not used for reading).
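
As an illustration, a hypothetical sketch of the distinction (state
names are illustrative, not the exact enum from the source):

```
// Fine-grained states let readers treat a node leaving via
// decommission differently from one being removed.
enum class node_state {
    normal,
    being_decommissioned,  // leaving; still serves reads
    being_removed,         // leaving; excluded from reads
    left,
};
```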

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 10:33:48 +03:00
Tomasz Grabiec
8fdbc42e71 tests: tablets: Add test for load balancing with active migrations 2023-07-31 01:45:23 +02:00
Tomasz Grabiec
fe181b3bac tablets: Balance tablets concurrently with active migrations
After this change, the load balancer can make progress with active
migrations. If the algorithm is called with active tablet migrations
in tablet metadata, those are treated by the load balancer as if they
were already completed. This allows the algorithm to incrementally make
decisions which, when executed alongside active migrations, produce the
desired result.

Overload of shards is limited by the fact that the algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.

The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.

The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.
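
A hypothetical sketch of this concurrency accounting (names and the
limit are illustrative):

```
#include <compare>
#include <map>

struct shard_ref {
    int host;
    unsigned shard;
    auto operator<=>(const shard_ref&) const = default;
};

// Tracks in-flight migrations per shard; the planner only emits a new
// migration if neither its source nor its target shard is saturated.
struct streaming_accounting {
    std::map<shard_ref, unsigned> active;
    unsigned limit = 2;  // illustrative per-shard concurrency cap

    unsigned load(shard_ref s) const {
        auto it = active.find(s);
        return it == active.end() ? 0u : it->second;
    }

    bool can_stream(shard_ref src, shard_ref dst) const {
        return load(src) < limit && load(dst) < limit;
    }

    void start(shard_ref src, shard_ref dst) {
        ++active[src];
        ++active[dst];
    }
};
```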
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
fbc6076e6a storage_service, tablets: Move get_leaving_replica() to tablets.cc
For better encapsulation of tablet-specific code.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
18a59ab5ff locator: tablets: Move std::hash definition earlier
Will be needed in order to define a struct which has
unordered_set<tablet_replica> as a field.
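
A minimal sketch of why the ordering matters (types illustrative): the
std::hash specialization must precede any unordered_set<tablet_replica>
member, because declaring the member instantiates the container's hasher.

```
#include <cstddef>
#include <functional>
#include <unordered_set>

struct tablet_replica {
    int host;
    unsigned shard;
    bool operator==(const tablet_replica&) const = default;
};

// Must appear before the container below is instantiated.
template <>
struct std::hash<tablet_replica> {
    std::size_t operator()(const tablet_replica& r) const noexcept {
        return std::hash<int>()(r.host) ^ (std::hash<unsigned>()(r.shard) << 1);
    }
};

// Now a struct can hold the set as a field.
struct candidate_replicas {
    std::unordered_set<tablet_replica> replicas;
};
```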
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
6f4a35f9ae service: tablet_allocator: Introduce tablet load balancer
Will be invoked by the topology coordinator later to decide
which tablets to migrate.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
d59b8d316c tablets: Introduce tablet_map::for_each_tablet() 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
0e3eac29d0 topology: Introduce get_node() 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
f2fdf37415 token_metadata: Add non-const getter of tablet_metadata
Needed for tests.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
e3a8bb7ec9 tablets: Introduce global_tablet_id
Identifies a tablet in the scope of the whole cluster. Not to be
confused with tablet replicas, which all share the same global_tablet_id.

Will be needed by load balancer and tablet migration algorithm to
identify tablets globally.
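
A hypothetical sketch of the shape of such an identifier (field types
are stand-ins for the real UUID-based table id):

```
#include <cstdint>

struct table_id { uint64_t hi, lo; };  // stand-in for the real UUID type
struct tablet_id { uint64_t value; };  // per-table tablet index

// Identifies one tablet cluster-wide; all replicas of that tablet
// share this id.
struct global_tablet_id {
    table_id table;
    tablet_id tablet;
};
```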
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
8cf92d4c86 tablets: Turn tablet_id into a struct
The IDL compiler cannot deal with enum classes like this.
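
A minimal sketch of the change (hypothetical shape): an `enum class`
strong typedef is opaque to an IDL-style serializer, while a one-field
struct can be described as an ordinary aggregate:

```
#include <cstdint>

// Before (sketch): enum class tablet_id : uint64_t {};

// After (sketch): a plain struct wrapper with the same strong typing.
struct tablet_id {
    uint64_t id;
    uint64_t value() const { return id; }
    bool operator==(const tablet_id&) const = default;
};
```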
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
91dee5c872 tablets: effective_replication_map: Take transition stage into account when computing replicas 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
dc2ec3f81c tablets: Store "stage" in transition info
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.

See the "Tablet migration" section of topology-over-raft.md
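
A hypothetical sketch of the idea (stage names illustrative; the
authoritative list is in topology-over-raft.md):

```
// The coordinator advances a migrating tablet through these steps
// one edge at a time.
enum class tablet_transition_stage {
    write_both_read_old,
    streaming,
    write_both_read_new,
    use_new,
    cleanup,
};

struct transition_info_sketch {
    tablet_transition_stage stage;
    // ... plus the pending replica, etc.
};
```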
2023-07-25 21:08:02 +02:00
Tomasz Grabiec
7851694eaa locator: erm: Make get_endpoints_for_reading() always return read replicas
Just a simplification.

Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid: the first node inserts
its tokens as normal without going through the bootstrap procedure.
2023-07-25 21:08:01 +02:00
Kefu Chai
3129ae3c8c treewide: compare signed and unsigned using std::cmp_*()
when comparing signed and unsigned numbers, the compiler promotes
the signed number to the common type -- in this case, the unsigned
type -- so they can be compared. but sometimes this matters: after the
promotion, the comparison yields the wrong result. this can be
demonstrated with a short sample like:

```
#include <fmt/core.h>

int main(int argc, char **argv) {
    int x = -1;
    unsigned y = 2;
    // -1 is converted to unsigned (wrapping around), so this prints
    // "false" even though -1 < 2 mathematically.
    fmt::print("{}\n", x < y);
    return 0;
}
```

this error can be identified by `-Werror=sign-compare`; before
enabling this compiler option, let's use `std::cmp_*()` to compare
them.
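
for reference, a minimal sketch of the fixed comparison with
`std::cmp_less` from `<utility>` (C++20), which compares the
mathematical values:

```
#include <fmt/core.h>
#include <utility>

int main() {
    int x = -1;
    unsigned y = 2;
    // std::cmp_less compares the mathematical values, so this prints "true".
    fmt::print("{}\n", std::cmp_less(x, y));
    return 0;
}
```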

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 10:27:18 +08:00
Patryk Jędrzejczak
7ae7be0911 locator: remove this_host_id from topology::config
The `locator::topology::config::this_host_id` field is redundant
in all places that use `locator::topology::config`, so we can
safely remove it.

Closes #14638

Closes #14723
2023-07-17 14:57:36 +02:00
Petr Gusev
3737bf8fa2 topology.cc: unindex_node: _dc_racks removal fix
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.
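
A minimal standalone sketch of this reference-reuse pattern (names
hypothetical):

```
#include <string>
#include <unordered_map>
#include <unordered_set>

int main() {
    std::unordered_map<std::string, std::unordered_set<int>> dc_endpoints;
    std::unordered_map<std::string, std::unordered_set<int>> dc_racks;
    dc_endpoints["dc1"] = {1, 2, 3, 4};
    dc_racks["rack2"] = {};

    auto& eps = dc_endpoints["dc1"];
    // Buggy: `eps` is still bound to the dc entry, so this assigns the
    // (empty) rack set to dc_endpoints["dc1"] instead of touching dc_racks.
    eps = dc_racks["rack2"];
    return 0;
}
```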

The problem was demonstrated by the dtest
test_decommission_last_node_in_rack
(scylladb/scylla-dtest#3299).
The test set up four nodes, three on one rack
and one on another, all within a single data
center (dc). It then switched to a
'network_topology_strategy' for one keyspace
and tried to decommission the single node
on the second rack. This decommission command
failed with the error message 'zero replica after the removal.'
This happened because unindex_node assigned
the empty list from the second rack
as the value for the single dc in the
_dc_endpoints dictionary. As a result,
we got an empty node list for the single dc in
natural_endpoints_tracker::_all_endpoints,
node_count == 0 in data_center_endpoints,
and _rf_left == 0, so
network_topology_strategy::calculate_natural_endpoints
rejected all the endpoints and returned an empty
endpoint_set. In
repair_service::do_decommission_removenode_with_repair
this caused the 'zero replica after the removal' error.

With this fix the test passes both with
--consistent-cluster-management option and
without it.

A unit test specific to this problem was added.

Fixes: #14184

Closes #14673
2023-07-13 11:16:01 +03:00
Avi Kivity
d645e7a515 Update seastar submodule
locator/*_snitch.cc updated for http::reply losing the _status_code
member without a deprecation notice.

* seastar 99d28ff057...2b7a341210 (23):
  > Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity
Fixes #8828.
  > reactor: use structured binding when appropriate
  > Simplify payload length and mask parsing.
  > memcached: do not used deprecated API
  > build: serialize calls to openssl certificate generation
  > reactor: epoll backend: initialize _highres_timer_pending
  > shared_ptr: deprecate lw_shared_ptr operator=(T&&)
  > tests: fail spawn_test if output is empty
  > Support specifying the "build root" in configure
  > Merge 'Cleanup RPC request/response frames maintenance' from Pavel Emelyanov
  > build: correct the syntax error in comment
  > util: print_safe: fix hex print functions
  > Add code examples for handling exceptions
  > smp: warn if --memory parameter is not supported
  > Merge 'gate: track holders' from Benny Halevy
  > file: call lambda with std::invoke()
  > deleter: Delete move and copy constructors
  > file: fix the indent
  > file: call close() without the syscall thread
  > reactor: use s/::free()/::io_uring_free_probe()/
  > Merge 'seastar-json2code: generate better-formatted code' from Kefu Chai
  > reactor: Don't re-evaliate local reactor for thread_pool
  > Merge 'Improve http::reply re-allocations and copying in client' from Pavel Emelyanov

Closes #14602
2023-07-10 16:07:12 +03:00
Marcin Maliszkiewicz
c5de25be4c locator: use deferred_close in azure and gcp snitches
Close needs to be called even if the function throws in the middle.
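
A minimal sketch of the pattern, assuming Seastar's deferred_close from
<seastar/util/closeable.hh> running in a seastar::thread context (the
guard waits on close() when it goes out of scope):

```
#include <seastar/core/iostream.hh>
#include <seastar/core/thread.hh>
#include <seastar/util/closeable.hh>

// The stream is closed when `closed` is destroyed, even if read()
// or the processing below throws.
seastar::future<> consume(seastar::input_stream<char> in) {
    return seastar::async([in = std::move(in)] () mutable {
        auto closed = seastar::deferred_close(in);
        auto buf = in.read().get();
        // ... process buf; an exception here still closes `in` ...
    });
}
```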

Closes #14458
2023-07-07 11:08:10 +02:00
Tomasz Grabiec
ebdebb982b locator: network_topology_strategy: Allocate shards to tablets
Uses a simple algorithm for allocating shards which chooses the
least-loaded shard on a given node, encapsulated in load_sketch.

Takes load due to current tablet allocation into account.

Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
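
A hypothetical sketch of the least-loaded-shard choice encapsulated by
load_sketch (names illustrative):

```
#include <algorithm>
#include <cstdint>
#include <vector>

using shard_id = unsigned;

// Per-node tablet counts by shard; every tablet, new or already
// allocated, counts as one unit of load.
struct node_load_sketch {
    std::vector<uint64_t> tablets_per_shard;  // assumed non-empty

    shard_id least_loaded_shard() const {
        auto it = std::min_element(tablets_per_shard.begin(),
                                   tablets_per_shard.end());
        return shard_id(it - tablets_per_shard.begin());
    }

    void allocate(shard_id s) { ++tablets_per_shard[s]; }
};
```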
2023-06-21 00:58:25 +02:00
Tomasz Grabiec
e110167a2a locator: Store node shard count in topology
Will be needed by tablet allocator.
2023-06-21 00:58:25 +02:00
Tomasz Grabiec
353ce1a6d1 locator: Make sharder accessible through effective_replication_map
For tablets, sharding depends on the replication map, so the scope of the
sharder should be effective_replication_map rather than the schema
object.

Existing users will be transitioned incrementally in later patches.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
22ab100b41 tablets: Implement tablet sharder 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e44e6033d8 tablets: Include pending replica in get_shard()
We need to move get_shard() from tablet_info to tablet_map in order to
have access to transition_info.
2023-06-21 00:58:24 +02:00
Petr Gusev
246eaec14e shared_token_metadata: update_fence_version: on_internal_error -> throw
on_internal_error is wrong for a fence_version
condition violation: when the topology change
coordinator migrates to another node, a
raft_topology_cmd::command::fence command from the
old coordinator can run in parallel with the fence
command (or the topology-version-upgrading raft
command) from the new one.
The comment near the raft_topology_cmd::command::fence
handling describes this situation, assuming an exception
is thrown in this case.
2023-06-20 13:39:17 +04:00
Petr Gusev
f6b019c229 raft topology: add fence_version
It's stored outside of the topology table,
since it's updated not through the regular raft
path, but with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
2023-06-15 15:48:00 +04:00
Petr Gusev
4f99302c2b raft_topology: add barrier_and_drain cmd
We use utils::phased_barrier. The new phase
is started each time the version is updated.
We track all instances of token_metadata;
when an instance is destroyed, the
corresponding phased_barrier::operation is
released.
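
A sketch of the pattern, assuming utils::phased_barrier's
start()/advance_and_await() interface:

```
#include <seastar/core/future.hh>
#include "utils/phased_barrier.hh"

utils::phased_barrier versions_barrier;

// Each live token_metadata instance pins the current phase; the
// operation is released when the instance is destroyed.
utils::phased_barrier::operation pin_current_version() {
    return versions_barrier.start();
}

// barrier_and_drain: open a new phase and wait until every operation
// from earlier phases has been released.
seastar::future<> barrier_and_drain() {
    return versions_barrier.advance_and_await();
}
```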
2023-06-15 15:48:00 +04:00
Petr Gusev
253d8a8c65 token_metadata: add topology version
It's stored as a static column in the topology table,
and will be updated at various steps of the topology
change state machine.

The initial value is 1; zero means that topology
versions are not yet supported, which will be
used in RPC handling.
2023-06-15 15:48:00 +04:00
Avi Kivity
26c8470f65 treewide: use #include <seastar/...> for seastar headers
We treat Seastar as an external library, so fix the few places
that didn't do so to use angle brackets.

Closes #14037
2023-06-06 08:36:09 +03:00
Petr Gusev
819d710753 vnode_erm: optimize get_range_addresses
In get_range_addresses we iterate
over vnode tokens, so we don't need to do a
binary search for them in tmptr->first_token;
they can be used directly as keys
for _replication_map.
2023-05-24 12:16:37 +04:00
Petr Gusev
5976277c2c token_metadata: drop has_pending_ranges and migration_info
Use the new erm::has_pending_ranges function and drop the
old implementation from token_metadata.
2023-05-21 13:17:42 +04:00
Petr Gusev
5495065242 effective_replication_map: add has_pending_ranges
We add the has_pending_ranges function to erm. The
implementation for vnode is similar to that of token_metadata.
For tablets, we add new code that checks if the given endpoint
is contained in tablet_map::_transitions.
2023-05-21 13:17:42 +04:00
Petr Gusev
8cb709d3d6 token_metadata: drop update_pending_ranges
The function storage_service::update_pending_ranges is
turned into update_topology_change_info.
The pending_endpoints and read_endpoints will be
computed later, when the erms are rebuilt.
2023-05-21 13:17:42 +04:00
Petr Gusev
87307781c4 effective_replication_map: use new get_pending_endpoints and get_endpoints_for_reading
We already use the new pending_endpoints from erm through
the get_pending_ranges virtual function; in this commit
we update all the remaining places to use the new
implementation in erm, and remove the old implementation
in token_metadata.
2023-05-21 13:17:42 +04:00
Petr Gusev
e22a5c42c8 vnode_effective_replication_map: get_pending_endpoints and get_endpoints_for_reading
In this commit we introduce functions in erm for accessing
pending_endpoints and read_endpoints, similar to the
corresponding functions in token_metadata. The only
difference is that we no longer need the keyspace_name map.
The functions get_pending_endpoints and get_endpoints_for_reading
are virtual, since they have different implementations
for vnode and for tablets.

The get_pending_endpoints function already existed. For tablets it
remained unchanged, while for vnode we changed it
from calling into token_metadata to using a local field.
We have also removed ks_name from the signature as it's
no longer needed.

For vnodes, the get_endpoints_for_reading also just
employs the local field. In the case of tablets, we currently
return nullptr as the appropriate implementation remains unclear.
2023-05-21 13:17:42 +04:00
Petr Gusev
fbe3254a9e calculate_effective_replication_map: compute pending_endpoints and read_endpoints
In this commit we add logic to calculate pending_endpoints and
read_endpoints, similar to how it was done in update_pending_ranges.
For situations where 'natural_endpoints_depend_on_token'
is false we short-circuit the calculations, breaking out
of the loop after the first iteration. In this case we add a
single item with key=default_replication_map_key
to the replication_map and set pending_endpoints/read_endpoints
key range to the entire set of possible values.

In the loop we iterate over all_tokens, which contains the union of
all boundary tokens, from the old and from the new topology.
In addition to updating pending_endpoints and read_endpoints in the loop,
we remember the new natural endpoints in the replication_map
if the current token is contained in the current set of boundary tokens.
2023-05-21 13:17:42 +04:00
Petr Gusev
a8c36aad0b vnode_erm: optimize replication_map
We optimise the memory usage of replication_map by
storing the endpoints list only once in the case of
natural_endpoints_depend_on_token() == false. For simplicity,
this list is stored in the same unordered_map with the
special key default_replication_map_key.

We inline both get_natural_endpoints and
for_each_natural_endpoint_until from abstract_replication_strategy
into vnode_erm since now the overrides in local and everywhere
strategies are redundant. The default implementation works
for them, as an empty sorted_tokens() is not a problem: we
store endpoints with the special key.

The function do_get_natural_endpoints was extracted,
since get_natural_endpoints returns by value,
while for for_each_natural_endpoint_until a reference is sufficient.
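
A hypothetical sketch of the resulting lookup shape (everything
illustrative except the idea of default_replication_map_key):

```
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

using token = int64_t;
using endpoints = std::vector<std::string>;

// Sentinel under which the single shared endpoints list is stored.
constexpr token default_replication_map_key = std::numeric_limits<token>::min();

struct replication_map_sketch {
    std::unordered_map<token, endpoints> map;
    bool depends_on_token;

    const endpoints& get_natural_endpoints(token t) const {
        return map.at(depends_on_token ? t : default_replication_map_key);
    }
};
```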
2023-05-21 13:17:42 +04:00
Petr Gusev
b9812023c6 vnode_erm::get_range_addresses: use sorted_tokens
We want to refactor replication_map so that it doesn't
store multiple copies of the same endpoints vector
in case of natural_endpoints_depend_on_token == false.
To preserve get_range_addresses behaviour
we iterate over tm.sorted_tokens() instead of
_replication_map.

It's possible that the callers of this function
are ok with a single range when
natural_endpoints_depend_on_token == false,
but to restrict the scope of the refactoring we
refrain from going in that direction.
2023-05-21 11:33:38 +04:00
Petr Gusev
99ff1fefe5 abstract_replication_strategy.hh: de-virtualize natural_endpoints_depend_on_token
We are going to use this function in vnode_erm::get_natural_endpoints,
so for efficiency it's better to have fewer virtual calls.
2023-05-21 11:33:38 +04:00
Petr Gusev
6f12c72c3f effective_replication_map: clone_endpoints_gently -> clone_data_gently
We need to account for the new fields in the clone implementation.

The signature future<erm> erm::clone() const; doesn't work because
the call will be made via foreign_ptr on an instance from another
shard, so we need to use local values for replication_strategy
and token_metadata.
2023-05-21 11:33:38 +04:00
Petr Gusev
959f9757d3 vnode_erm: gentle destruction of _pending_endpoints and _read_endpoints
Refactor ~vnode_effective_replication_map to use
our new clear_gently overload for rvalue references,
and add the new fields _pending_endpoints and _read_endpoints
to the call.

vnode_effective_replication_map::clear_gently is removed as
it was not used.
2023-05-21 11:33:38 +04:00
Petr Gusev
084abc0e44 token_metadata: add pending_endpoints and read_endpoints to vnode_effective_replication_map
In this commit, we just add fields and pass them through
the constructor. Calculation and usage logic will be added later.
2023-05-19 19:04:43 +04:00
Petr Gusev
10bf8c7901 token_metadata: introduce topology_change_info
We plan to move pending_endpoints and read_endpoints, along
with their computation logic, from token_metadata to
vnode_effective_replication_map. The vnode_effective_replication_map
seems more appropriate for them since it contains the functionally
similar _replication_map, and we will be able to reuse
pending_endpoints/read_endpoints across keyspaces
sharing the same factory_key.

At present, pending_endpoints and read_endpoints are updated in the
update_pending_ranges function. The update logic comprises two
parts - preparing data common to all keyspaces/replication_strategies,
and calculating the migration_info for specific keyspaces. In this commit,
we introduce a new topology_change_info structure to hold the first
part's data and add an update_topology_change_info function to
update it. This structure will later be used in
vnode_effective_replication_map to compute pending_endpoints
and read_endpoints. This enables the reuse of topology_change_info
across all keyspaces, unlike the current update_pending_ranges
implementation, which is another benefit of this refactoring.

The update_topology_change_info implementation is mostly derived from
update_pending_ranges; there are a few differences though:
* replacing async and thread with plain co_awaits;
* adding a utils::clear_gently call for the previous value
to mitigate reactor stalls if target_token_metadata grows large;
* substituting immediately invoked lambdas with simple variables and
blocks to reduce noise, as lambdas would need to be converted into coroutines.

The original update_pending_ranges remains unchanged, and will be
removed entirely upon transitioning to the new implementation.
Meanwhile, we add an update_topology_change_info call to
storage_service::update_pending_ranges so that we can
iteratively switch the system to the new implementation.
2023-05-19 19:04:43 +04:00
Petr Gusev
51e80691ef token_metadata: replace set_topology_transition_state with set_read_new
This helps isolate topology::transition_state dependencies;
token_metadata doesn't need the entire enum, just this
boolean flag.
2023-05-19 19:04:43 +04:00
Petr Gusev
0e4e2df657 token_metadata: add endpoints for reading
In this patch we add the
token_metadata::set_topology_transition_state method.
If the current state is
write_both_read_new, update_pending_ranges
will compute new ranges for read requests. The default value
of topology_transition_state is null, meaning no read
ranges are computed. We will add the appropriate
set_topology_transition_state calls later.

Also, we add endpoints_for_reading method to get
read endpoints based on the computed ranges.
2023-05-09 18:41:59 +04:00
Petr Gusev
0567ab82ac token_metadata_impl: extract maybe_migration_endpoints helper function
We are going to add a function in token_metadata to get read endpoints,
similar to pending_endpoints_for. So in this commit we extract
the maybe_migration_endpoints helper function, which will be
used in both cases.
2023-05-09 13:56:38 +04:00
Petr Gusev
030f0f73aa token_metadata_impl: introduce migration_info
We are going to store read_endpoints in a way similar
to pending ranges, so in this commit we add
migration_info - a container for two
boost::icl::interval_map instances.

Also, _pending_ranges_interval_map is renamed to
_keyspace_to_migration_info, since it captures
the meaning better.
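
A minimal sketch of the container shape, assuming
boost::icl::interval_map aggregating endpoint sets over token ranges
(types illustrative):

```
#include <boost/icl/interval_map.hpp>
#include <cstdint>
#include <set>
#include <string>

using endpoint_set = std::set<std::string>;
using range_map = boost::icl::interval_map<int64_t, endpoint_set>;

struct migration_info_sketch {
    range_map pending_endpoints;
    range_map read_endpoints;
};

int main() {
    migration_info_sketch info;
    // Overlapping insertions aggregate via set union.
    info.pending_endpoints += std::make_pair(
        boost::icl::interval<int64_t>::right_open(0, 100),
        endpoint_set{"node1"});
    return 0;
}
```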
2023-05-09 13:56:38 +04:00