define token_metadata_ptr in token_metadata_fwd.hh
So that the declaration of `make_splitter` can be moved
to token_range_splitter.hh, where it belongs,
and so token_metadata.hh won't have to include it.
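For reference, the forward-declaration header pattern looks roughly like this (a sketch; the alias and include here are illustrative, not the exact header contents):
```
// token_metadata_fwd.hh (illustrative sketch)
#pragma once

#include <seastar/core/shared_ptr.hh>

namespace locator {

class token_metadata;  // forward declaration only, no heavy include needed

// Consumers that only pass the pointer around can include this
// lightweight header instead of the full token_metadata.hh.
using token_metadata_ptr = seastar::lw_shared_ptr<const token_metadata>;

} // namespace locator
```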
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.
The handling on the replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.
This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain) and it would fail on a missing host
id mapping (the node is no longer in topology), which is examined by the
tablet sharder when producing sstable sharding metadata, leading to an
abort due to the failed memtable flush.
Will be used to synchronize long-running tablet operations with the
topology coordinator.
It blocks barriers like erm_ptr does, but refreshes if the change is
irrelevant, so it behaves as if the erm_ptr's scope was narrowed down to
a single tablet.
It's better to pass a disengaged optional when
the caller doesn't have the information than to
pass the default dc_rack location, so the latter
will never implicitly override a known endpoint dc/rack location.
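A minimal sketch of the idea, with standalone illustrative types rather than the actual ScyllaDB signatures:
```
#include <optional>
#include <string>

struct endpoint_dc_rack {
    std::string dc = "dc1";   // default location
    std::string rack = "rack1";
};

// Passing std::nullopt when the location is unknown lets the callee keep
// whatever it already knows; passing a default-constructed endpoint_dc_rack
// could silently overwrite a known dc/rack.
void update_location(std::optional<endpoint_dc_rack> loc, endpoint_dc_rack& known) {
    if (loc) {
        known = *loc;  // only override when the caller really has the info
    }
}
```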
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15300
The local node's dc:rack pair is cached in the system keyspace on start. However, most other code doesn't need it, as it gets the dc:rack from the topology or directly from the snitch. There are a few places left that still touch the system keyspace cache, but they are easy to patch. So after this patch all the core code uses two sources of dc:rack -- topology / snitch -- instead of three.
Closes #15280
* github.com:scylladb/scylladb:
system_keyspace: Don't require snitch argument on start
system_keyspace: Don't cache local dc:rack pair
system_keyspace: Save local info with explicit location
storage_service: Get endpoint location from snitch, not system keyspace
snitch: Introduce and use get_location() method
repair: Local location variables instead of system keyspace's one
repair: Use full endpoint location instead of datacenter part
Since 5d1f60439a we have
this node's host_id in the topology config, so it can be used
to identify this node when adding it.
Prepare for extending the token_metadata interface
to provide host_id in update_topology.
We would like to compare the host_id first to be able to distinguish
this node from a node we're replacing that may have the same IP address
(but a different host_id).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There are some places out there that build a locator::endpoint_dc_rack
pair out of the snitch's get_datacenter() and get_rack() calls. Generalize
those with the snitch's new method. It will also be used by the next patch.
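A sketch of what such a helper looks like; the standalone types here are illustrative, not the actual snitch interface:
```
#include <string>

struct endpoint_dc_rack {
    std::string dc;
    std::string rack;
};

struct snitch_like {
    std::string get_datacenter() const { return "dc1"; }
    std::string get_rack() const { return "rack1"; }

    // Replaces the repeated {get_datacenter(), get_rack()} pattern
    // at the call sites with a single call.
    endpoint_dc_rack get_location() const {
        return endpoint_dc_rack{get_datacenter(), get_rack()};
    }
};
```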
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is too early to require that all nodes in normal state
have a non-null host_id.
The assertion was added in 44c14f3e2b
but unfortunately there are several call sites where
we add the node as normal, but without a host_id,
and patch it in later on.
In the future we should be able to require that,
once we identify nodes by host_id over the gossiper
and in token_metadata.
Fixes scylladb/scylladb#15181
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #15184
And verify that the returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.
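A rough sketch of the check being described; the error-reporting helper is stubbed out here and the types are stand-ins, not the actual code:
```
#include <cstdio>
#include <optional>
#include <string>

using host_id = std::string;  // stand-in for the real host_id type

void on_internal_error_noexcept_stub(const char* msg) noexcept {
    // The real helper logs the error and aborts in testing builds,
    // which helps catch token owners without a host_id early.
    std::fprintf(stderr, "internal error: %s\n", msg);
}

host_id get_host_id_checked(const std::optional<host_id>& id) {
    if (!id || id->empty()) {
        on_internal_error_noexcept_stub("token owner has a null host_id");
        return {};
    }
    return *id;
}
```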
Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793
Closes #14844
* github.com:scylladb/scylladb:
api: storage_service: improve description of /storage_service/host_id
token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners
In this PR a simple test for fencing is added. It exercises the data
plane, meaning that if it somehow happens that the node has a stale
topology version, then requests from this node will get a 'stale
topology' error. The test just decrements the node version manually through
CQL, so it's quite artificial. To test a more real-world scenario we
need to allow the topology change fiber to sometimes skip unavailable
nodes. Currently the algorithm fails and retries indefinitely in this case.
The PR also adds some logs, and removes one seemingly redundant topology
version increment; see the commit messages for details.
Closes #14901
* github.com:scylladb/scylladb:
test_fencing: add test_fence_hints
test.py: output the skipped tests
test.py: add skip_mode decorator and fixture
test.py: add mode fixture
hints: add debug log for dropped hints
hints: send_one_hint: extend the scope of file_send_gate holder
pylib: add ScyllaMetrics
hints manager: add send_errors counter
token_metadata: add debug logs
fencing: add simple data plane test
random_tables.py: add counter column type
raft topology: don't increment version when transitioning to node_state::normal
We log the new version when the new token
metadata is set.
Also, the fence_version log is moved
from storage_service to shared_token_metadata
for uniformity.
And verify that the returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.
Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
This is a crucial step for dynamic tablets, where each tablet can be
worked on independently.
There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.
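Conceptually, the per-tablet storage looks roughly like the sketch below (types are simplified stand-ins, not the actual replica code):
```
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using tablet_id = std::uint64_t;

// Simplified stand-ins for the real storage objects.
struct memtable { /* in-memory writes for one tablet */ };
struct sstable  { std::string filename; };

// One compaction group per tablet: each tablet gets its own memtable and
// sstable set, so it can be flushed, compacted and migrated independently.
struct compaction_group {
    memtable active_memtable;
    std::vector<sstable> sstables;
};

struct table_storage {
    std::unordered_map<tablet_id, compaction_group> groups;
};
```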
```
INFO 2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO 2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO 2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO 2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf
```
Closes #14863
* github.com:scylladb/scylladb:
Kill scylla option to configure number of compaction groups
replica: Wire tablet into compaction group
token_metadata: Add this_host_id to topology config
replica: Switch to chunked_vector for storing compaction groups
replica: Generate group_id for compaction_group on demand
The motivation is that token_metadata::get_my_id() is not available
early in the bootstrap process, as the raft topology is pulled later
than new tables are registered and created, and this node is added
to the topology even later.
To allow creation of compaction groups to retrieve "my id" from
token metadata early, initialization will now feed the local id
into the topology config, which is immutable for each node anyway.
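A sketch of the idea, with simplified, hypothetical types modeled on the commit titles above, not the actual ScyllaDB definitions:
```
#include <string>

using host_id = std::string;  // stand-in for the real host_id type

// The local host id is supplied once, at construction, as part of the
// immutable per-node topology config, so it is available before the
// node appears in the raft topology.
struct topology_config {
    host_id this_host_id;
};

struct token_metadata_stub {
    topology_config cfg;

    // Usable early in bootstrap, unlike an id learned from topology.
    const host_id& get_my_id() const { return cfg.this_host_id; }
};
```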
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The use-after-move is not very harmful as the value is only used when
handling an exception, so the user would be left with a bogus message.
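For illustration, the general shape of such a bug (a synthetic example, not the actual code):
```
#include <stdexcept>
#include <string>
#include <utility>

void example(std::string msg) {
    [[maybe_unused]] std::string stored = std::move(msg);  // msg is now moved-from
    try {
        throw std::runtime_error("boom");
    } catch (const std::exception&) {
        // Bug pattern: msg is used after it was moved from, so the rethrown
        // message is bogus (typically empty) rather than the original text.
        throw std::runtime_error("failed: " + msg);
    }
}
```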
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes #15054
In branch 5.2 we erase `dc` from `_datacenters` if there are
no more endpoints listed in `_dc_endpoints[dc]`.
This was lost unintentionally in f3d5df5448,
and this commit restores that behavior and fixes test_remove_endpoint.
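The restored behavior is roughly the following (a sketch with simplified container types, not the actual topology code):
```
#include <string>
#include <unordered_map>
#include <unordered_set>

using inet_address = std::string;  // stand-in

void remove_endpoint(
        std::unordered_map<std::string, std::unordered_set<inet_address>>& dc_endpoints,
        std::unordered_set<std::string>& datacenters,
        const std::string& dc, const inet_address& ep) {
    auto it = dc_endpoints.find(dc);
    if (it == dc_endpoints.end()) {
        return;
    }
    it->second.erase(ep);
    if (it->second.empty()) {
        // No endpoints left in this dc: forget the dc entirely,
        // as was done in branch 5.2.
        dc_endpoints.erase(it);
        datacenters.erase(dc);
    }
}
```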
Fixes scylladb/scylladb#14896
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #14897
This change makes tablet load balancing more efficient by performing
migrations independently for different tablets, and making new load
balancing plans concurrently with active migrations.
The migration track is interrupted by pending topology change operations.
The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.
The load balancer is also continuously invoked as long as it produces
a non-empty plan, in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.
Shard overload is limited by the fact that the load balancer algorithm
tracks streaming concurrency on both the source and target shards of
active migrations and takes the concurrency limit into account when
producing new migrations.
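The control flow described above is roughly the following (a sketch with hypothetical stubs; the real coordinator logic is more involved):
```
#include <vector>

struct tablet_migration_info { int tablet = 0; /* source/target shard, ... */ };
using migration_plan = std::vector<tablet_migration_info>;

// Stubs standing in for the real planner and executor.
migration_plan make_plan() { return {}; }        // would produce a possibly partial plan
void start_migrations(const migration_plan&) {}  // would kick off streaming

// Invoked on the edges of tablet state machine transitions; keeps asking
// for more work until a single call no longer finds anything to move,
// so the cluster stays saturated with streaming.
void balance_tablets() {
    while (true) {
        auto plan = make_plan();
        if (plan.empty()) {
            break;
        }
        start_migrations(plan);
    }
}
```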
Closes #14851
* github.com:scylladb/scylladb:
tablets: load_balancer: Remove double logging
tests: tablets: Check that load balancing is interrupted by topology change
tests: tablets: Add test for load balancing with active migrations
tablets: Balance tablets concurrently with active migrations
storage_service, tablets: Extract generate_migration_updates()
storage_service, tablets: Move get_leaving_replica() to tablets.cc
locator: tablets: Move std::hash definition earlier
storage_service: Advance tablets independently
topology_coordinator: Fix missed notification on abort
tablets: Add formatter for tablet_migration_info
Currently the node::state is coarse grained,
so one cannot distinguish between e.g. a leaving
node due to decommission (where the node is used
for reading) and one leaving due to removenode (where the
node is not used for reading).
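For illustration, a finer-grained state could distinguish the two leaving cases; the enumerator names below are hypothetical:
```
// Hypothetical illustration: instead of a single "leaving" state, the
// reason for leaving is encoded so readers can tell whether the node
// may still serve reads.
enum class node_state {
    normal,
    being_decommissioned,  // leaving via decommission: still used for reads
    being_removed,         // leaving via removenode: not used for reads
    left,
};
```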
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After this change, the load balancer can make progress with active
migrations. If the algorithm is called with active tablet migrations
in tablet metadata, those are treated by the load balancer as if they were
already completed. This allows the algorithm to make decisions
incrementally which, when executed with active migrations, will produce the
desired result.
Shard overload is limited by the fact that the algorithm tracks
streaming concurrency on both the source and target shards of active
migrations and takes the concurrency limit into account when producing new
migrations.
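A sketch of the concurrency accounting (hypothetical types and a made-up limit; the real bookkeeping lives in the load balancer):
```
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct shard_ref {
    std::uint64_t host;
    unsigned shard;
    bool operator==(const shard_ref& o) const { return host == o.host && shard == o.shard; }
};

struct shard_ref_hash {
    std::size_t operator()(const shard_ref& s) const {
        return std::hash<std::uint64_t>{}(s.host) ^ (std::hash<unsigned>{}(s.shard) << 1);
    }
};

// Active migrations (already counted as "completed" for placement purposes)
// still consume streaming slots on their source and target shards, so new
// migrations are only emitted while both ends stay under the limit.
struct streaming_load {
    static constexpr unsigned max_streams_per_shard = 2;  // hypothetical limit
    std::unordered_map<shard_ref, unsigned, shard_ref_hash> streams;

    bool can_stream(const shard_ref& src, const shard_ref& dst) const {
        auto load = [&](const shard_ref& s) {
            auto it = streams.find(s);
            return it == streams.end() ? 0u : it->second;
        };
        return load(src) < max_streams_per_shard && load(dst) < max_streams_per_shard;
    }

    void account(const shard_ref& src, const shard_ref& dst) {
        ++streams[src];
        ++streams[dst];
    }
};
```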
The coordinator executes the load balancer on the edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.
The load balancer is also continuously invoked as long as it produces
a non-empty plan, in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way the algorithm is implemented.
Identifies a tablet in the scope of the whole cluster. Not to be
confused with tablet replicas, which all share the same global_tablet_id.
Will be needed by load balancer and tablet migration algorithm to
identify tablets globally.
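Conceptually it is just a (table, tablet) pair, roughly as in this sketch (simplified id types, not the actual definitions):
```
#include <cstddef>
#include <cstdint>
#include <functional>

using table_id = std::uint64_t;   // stand-in for the real table UUID type
using tablet_id = std::uint64_t;  // index of the tablet within its table

// Identifies a tablet across the whole cluster; all replicas of that
// tablet share the same global_tablet_id.
struct global_tablet_id {
    table_id table;
    tablet_id tablet;

    bool operator==(const global_tablet_id&) const = default;
};

namespace std {
template <>
struct hash<global_tablet_id> {
    std::size_t operator()(const global_tablet_id& id) const {
        return std::hash<std::uint64_t>{}(id.table) ^ (std::hash<std::uint64_t>{}(id.tablet) << 1);
    }
};
}
```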
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.
See the "Tablet migration" section of topology-over-raft.md
Just a simplification.
Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with the exception
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid; the first node inserts
its tokens as normal without going through the bootstrap procedure.
When comparing signed and unsigned numbers, the compiler promotes
the signed number to the common type -- in this case, the unsigned type --
so they can be compared. But sometimes it matters, and after the
promotion the comparison yields the wrong result. This can be
demonstrated using a short sample like:
```
#include <fmt/core.h>

int main() {
    int x = -1;
    unsigned y = 2;
    // prints "false": x is converted to a huge unsigned value before comparing
    fmt::print("{}\n", x < y);
    return 0;
}
```
This error can be identified by `-Werror=sign-compare`, but before
enabling this compile option, let's use `std::cmp_*()` to compare
them.
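For example (C++20 comparison helpers from `<utility>`):
```
#include <fmt/core.h>
#include <utility>

int main() {
    int x = -1;
    unsigned y = 2;
    // std::cmp_less compares the mathematical values, with no unsigned
    // promotion, so this prints "true", unlike the plain x < y above.
    fmt::print("{}\n", std::cmp_less(x, y));
    return 0;
}
```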
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The `locator::topology::config::this_host_id` field is redundant
in all places that use `locator::topology::config`, so we can
safely remove it.
Closes #14638
Closes #14723
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.
The problem was demonstrated by the dtest
test_decommission_last_node_in_rack
(scylladb/scylla-dtest#3299).
The test set up four nodes, three on one rack
and one on another, all within a single data
center (dc). It then switched to the
'network_topology_strategy' for one keyspace
and tried to decommission the single node
on the second rack. This decommission command
failed with the error message 'zero replica after the removal'.
This happened because unindex_node assigned
the empty list from the second rack
as the value for the single dc in
the _dc_endpoints dictionary. As a result,
we got an empty node list for the single dc in
natural_endpoints_tracker::_all_endpoints,
node_count == 0 in data_center_endpoints,
and _rf_left == 0, so
network_topology_strategy::calculate_natural_endpoints
rejected all the endpoints and returned an empty
endpoint_set. In
repair_service::do_decommission_removenode_with_repair
this caused the 'zero replica after the removal' error.
With this fix the test passes both with
the --consistent-cluster-management option and
without it.
A specific unit test for this problem was added.
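The general shape of the bug, as a standalone illustration rather than the actual topology code:
```
#include <string>
#include <unordered_map>
#include <unordered_set>

int main() {
    std::unordered_map<std::string, std::unordered_set<std::string>> dc_endpoints;
    std::unordered_map<std::string, std::unordered_set<std::string>> rack_endpoints;
    dc_endpoints["dc1"] = {"n1", "n2", "n3", "n4"};
    rack_endpoints["rack2"] = {"n4"};

    auto& eps = dc_endpoints["dc1"];  // reference into the dc dictionary
    // Reusing the same reference for the racks dictionary assigns the rack's
    // node set into the dc entry instead of only updating the rack:
    eps = rack_endpoints["rack2"];
    eps.erase("n4");
    // dc1 now looks empty, which later surfaced as
    // 'zero replica after the removal' during decommission.
}
```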
Fixes: #14184
Closes #14673
locator/*_snitch.cc updated for http::reply losing the _status_code
member without a deprecation notice.
* seastar 99d28ff057...2b7a341210 (23):
> Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity
Fixes #8828.
> reactor: use structured binding when appropriate
> Simplify payload length and mask parsing.
> memcached: do not used deprecated API
> build: serialize calls to openssl certificate generation
> reactor: epoll backend: initialize _highres_timer_pending
> shared_ptr: deprecate lw_shared_ptr operator=(T&&)
> tests: fail spawn_test if output is empty
> Support specifying the "build root" in configure
> Merge 'Cleanup RPC request/response frames maintenance' from Pavel Emelyanov
> build: correct the syntax error in comment
> util: print_safe: fix hex print functions
> Add code examples for handling exceptions
> smp: warn if --memory parameter is not supported
> Merge 'gate: track holders' from Benny Halevy
> file: call lambda with std::invoke()
> deleter: Delete move and copy constructors
> file: fix the indent
> file: call close() without the syscall thread
> reactor: use s/::free()/::io_uring_free_probe()/
> Merge 'seastar-json2code: generate better-formatted code' from Kefu Chai
> reactor: Don't re-evaliate local reactor for thread_pool
> Merge 'Improve http::reply re-allocations and copying in client' from Pavel Emelyanov
Closes #14602
Uses a simple algorithm for allocating shards which chooses the
least-loaded shard on a given node, encapsulated in load_sketch.
It takes the load due to current tablet allocation into account.
Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
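Roughly, the allocation works like this sketch (simplified and hypothetical, not the actual load_sketch code):
```
#include <cstdint>
#include <vector>

// Tracks an estimated tablet count per shard of one node; new or planned
// tablets are counted too, and every tablet is assumed to weigh the same.
struct load_sketch {
    std::vector<std::uint64_t> tablets_per_shard;

    explicit load_sketch(unsigned shard_count) : tablets_per_shard(shard_count, 0) {}

    // Chooses the least-loaded shard and accounts for the new tablet on it.
    unsigned allocate_shard() {
        unsigned best = 0;
        for (unsigned s = 1; s < tablets_per_shard.size(); ++s) {
            if (tablets_per_shard[s] < tablets_per_shard[best]) {
                best = s;
            }
        }
        ++tablets_per_shard[best];
        return best;
    }
};
```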
For tablets, sharding depends on the replication map, so the scope of the
sharder should be the effective_replication_map rather than the schema
object.
Existing users will be transitioned incrementally in later patches.
on_internal_error is wrong for a fence_version
condition violation, since when the
topology change coordinator migrates to another
node, we can have the raft_topology_cmd::command::fence
command from the old coordinator running in
parallel with the fence command (or a topology version
upgrading raft command) from the new one.
The comment near the raft_topology_cmd::command::fence
handling describes this situation, assuming an exception
is thrown in this case.
It's stored outside of the topology table,
since it's updated not through the regular raft topology
updates, but with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
We use utils::phased_barrier. A new phase
is started each time the version is updated.
We track all instances of token_metadata;
when an instance is destroyed, the
corresponding phased_barrier::operation is
released.
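A rough, synchronous sketch of the mechanism (the real code uses utils::phased_barrier and futures; the types below are stand-ins):
```
#include <cstdint>
#include <memory>
#include <vector>

// Every fence-version bump starts a new "phase"; each live token_metadata
// instance pins the phase it was created in (a phased_barrier::operation in
// the real code) and releases it when destroyed. Waiting for a version to
// take effect means waiting for all pins of the older phases to be released.
struct version_tracker {
    std::uint64_t version = 0;
    std::shared_ptr<int> current_phase = std::make_shared<int>(0);
    std::vector<std::weak_ptr<int>> older_phases;

    using operation = std::shared_ptr<int>;  // held per token_metadata instance

    operation start_operation() { return current_phase; }

    void set_version(std::uint64_t v) {
        version = v;
        older_phases.push_back(current_phase);
        current_phase = std::make_shared<int>(0);  // new phase for the new version
    }

    // True once no token_metadata created before the last bump is still alive.
    bool older_phases_drained() const {
        for (const auto& p : older_phases) {
            if (!p.expired()) {
                return false;
            }
        }
        return true;
    }
};
```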
It's stored as a static column in the topology table,
and will be updated at various steps of the topology
change state machine.
The initial value is 1; zero means that topology
versions are not yet supported, which will be
used in RPC handling.
In get_range_addresses we are iterating
over vnode tokens, so we don't need to do a
binary search for them in tmptr->first_token;
they can be used directly as keys
for _replication_map.
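A sketch of the simplification (illustrative standalone types, not the actual replication map):
```
#include <map>
#include <string>
#include <vector>

using token = long;                            // stand-in for the real token type
using replica_set = std::vector<std::string>;  // stand-in for the replica list

// Before: for each vnode token, a binary search located the owning entry.
// After: the iterated tokens are exactly the map keys, so a direct lookup
// is enough.
replica_set lookup(const std::map<token, replica_set>& replication_map, token t) {
    auto it = replication_map.find(t);
    return it == replication_map.end() ? replica_set{} : it->second;
}
```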