Currently we print an ERROR on all exceptions in
`raft_topology_cmd_handler`. This log level is too high: in some cases
exceptions are expected -- like during shutdown -- and the ERROR logging
causes dtest failures.
Log exceptions caused by aborts at WARN level instead.
Also improve logging by printing the command that failed.
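A minimal sketch of the intended behavior (standalone C++, not the actual handler code; type and function names are illustrative):
```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

// Stand-in for the exception thrown when an abort source fires (e.g. on
// shutdown); in ScyllaDB this comes from seastar.
struct abort_requested_exception : std::runtime_error {
    abort_requested_exception() : std::runtime_error("abort requested") {}
};

void handle_topology_cmd(const std::string& cmd, void (*execute)(const std::string&)) {
    try {
        execute(cmd);
    } catch (const abort_requested_exception& e) {
        // Expected during shutdown: log at WARN and include the failing command.
        std::fprintf(stderr, "WARN: raft topology command '%s' aborted: %s\n",
                     cmd.c_str(), e.what());
    } catch (const std::exception& e) {
        // Anything else is still unexpected and stays at ERROR.
        std::fprintf(stderr, "ERROR: raft topology command '%s' failed: %s\n",
                     cmd.c_str(), e.what());
    }
}
```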
Fixes scylladb/scylladb#19754
(cherry picked from commit 7506709573)
Closes scylladb/scylladb#20072
Add more logging for raft-based topology operations at INFO and DEBUG
levels.
Improve the existing logging, adding more details.
Fix a FIXME in test_coordinator_queue_management (by re-adding a log
message that was removed in the past -- probably by accident -- and
properly waiting for it to appear in the test).
Enable group0_state_machine logging at TRACE level in tests. These logs
are relatively rare (group 0 commands are used for metadata operations)
and relatively small, mostly consisting of printing the `system.group0_history`
mutation in the applied command, for example:
```
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981
TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}}
```
Note that the mutation contains a human-readable description of the
command -- like "create system_distributed keyspace" above.
These logs might help with debugging various issues (e.g. when `apply` hangs
waiting for read_apply mutex, or takes too long to apply a command).
Ref: scylladb/scylladb#19105
Ref: scylladb/scylladb#19945
(cherry picked from commit e8d5974961)
Closes scylladb/scylladb#20049
Most callers of the raft group0 client interface are passing a real
abort source instance, so we can take the abort source by reference in the
client interface. This change makes the code simpler and more consistent.
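A rough sketch of the interface change, with a stand-in type instead of the real seastar/Scylla ones (names and signatures are illustrative, not the actual client API):
```cpp
// Stand-in for seastar::abort_source; only the part used here.
struct abort_source {
    bool abort_requested() const { return false; }
};

// Before (hypothetical shape): an optional pointer forced null checks everywhere.
// void add_entry(group0_command cmd, abort_source* as = nullptr);

struct group0_client {
    // After: a reference guarantees a valid abort source, so both the client and
    // its callers drop the "no abort source" special case.
    void add_entry(/* group0_command cmd, */ abort_source& as) {
        if (as.abort_requested()) {
            return;     // give up early instead of blocking on group 0
        }
        // ... append the command to group 0 ...
    }
};
```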
(cherry picked from commit 2dbe9ef2f2)
The view builder is doing write operations to the database.
In order for the view builder to shut down gracefully without errors, we
need to ensure the database can handle writes while it is drained.
The commit changes the drain order, so that view builder is drained
before the database shuts down.
Fixes scylladb/scylladb#18929
(cherry picked from commit 9d9318c564)
Closes scylladb/scylladb#19636
Before this patch, if we booted a node just after removing
a different node, the booting node may still see the removed node
as NORMAL and wait for it to be UP, which would time out and fail
the bootstrap.
This issue caused scylladb/scylladb#17526.
Fix it by recalculating the nodes to wait for in every iteration of
the `wait_alive` loop.
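A simplified sketch of the fixed loop shape (standalone C++; names and the blocking/polling style are illustrative, the real code is asynchronous):
```cpp
#include <chrono>
#include <functional>
#include <set>
#include <stdexcept>
#include <string>
#include <thread>

using node_set = std::set<std::string>;

void wait_alive(std::function<node_set()> nodes_to_wait_for,
                std::function<bool(const std::string&)> is_up,
                std::chrono::seconds timeout) {
    auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        // Recompute the set on every iteration: a node that looked NORMAL when
        // the loop started may have been removed from the topology since, and
        // we must stop waiting for it instead of timing out.
        auto nodes = nodes_to_wait_for();
        bool all_up = true;
        for (const auto& n : nodes) {
            if (!is_up(n)) { all_up = false; break; }
        }
        if (all_up) {
            return;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    throw std::runtime_error("timed out waiting for nodes to become UP");
}
```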
(cherry picked from commit 017134fd38)
Add more logging that provides more visibility into what happens during
topology loading.
Message-ID: <ZnE5OAmUbExVZMWA@scylladb.com>
(cherry picked from commit fb764720d3)
Do not allow replacing a node in one dc/rack
with a node in a different dc/rack, as this violates
the assumption of the replace-node operation that
all token ranges previously owned by the dead
node will be rebuilt on the new node.
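A minimal sketch of such a check (illustrative names, not the actual topology coordinator code):
```cpp
#include <stdexcept>
#include <string>

struct location {
    std::string dc;
    std::string rack;
};

void check_replace_location(const location& replaced, const location& replacing) {
    // Replace assumes the new node rebuilds exactly the token ranges owned by
    // the dead node, which only holds if both are in the same dc and rack.
    if (replaced.dc != replacing.dc || replaced.rack != replacing.rack) {
        throw std::runtime_error(
            "cannot replace a node in " + replaced.dc + "/" + replaced.rack +
            " with a node in " + replacing.dc + "/" + replacing.rack);
    }
}
```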
Fixes #16858
Refs scylladb/scylla-enterprise#3518
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 34dfa4d3a3)
Closes scylladb/scylladb#19281
We want to make repair mutually exclusive with tablet migrations to avoid
races between repair reads/writes and replica movement. Repair is not
prepared to handle topology transitions in the middle.
One reason why it's not safe is that repair may successfully write to
a leaving replica after the streaming phase and consider all replicas to be
repaired, but in fact they are not: the new replica would not be
repaired.
Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.
The exclusion works by keeping an effective_replication_map_ptr at a version
which doesn't have the table's tablets in transition. That prevents later
transitions from starting, because the topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, i.e. before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.
A blocked tablet migration (e.g. due to a down node) will block repair,
whereas before it would fail. Once the admin resolves the cause of the blocked
migration, repair will continue.
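A rough sketch of the exclusion idea, with toy stand-ins for the real erm and repair types (not the actual ScyllaDB code):
```cpp
#include <memory>

// Very rough stand-ins; the real types are locator::effective_replication_map_ptr
// and the tablet transition state kept in token metadata.
struct effective_replication_map {
    bool has_tablet_transitions = false;
};
using erm_ptr = std::shared_ptr<const effective_replication_map>;

// In ScyllaDB this also waits for transitions that are already in progress.
erm_ptr get_erm_without_transitions() {
    return std::make_shared<effective_replication_map>();
}

void repair_ranges(const effective_replication_map&) { /* ... run repair ... */ }

void repair_tablet_table() {
    // Pin an erm version that has no tablets in transition. Holding this pointer
    // is what makes the topology coordinator's barrier wait, so no migration can
    // advance past allow_write_both_read_old while repair still uses the old
    // replica sets.
    erm_ptr erm = get_erm_without_transitions();
    repair_ranges(*erm);
    // Releasing `erm` lets a blocked migration proceed.
}
```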
Fixes #17658.
Fixes #18561.
(cherry picked from commit 6c64cf33df)
(cherry picked from commit 1513d6f0b0)
(cherry picked from commit 476c076a21)
(cherry picked from commit c45ce41330)
(cherry picked from commit e97acf4e30)
(cherry picked from commit 98323be296)
(cherry picked from commit 5ca54a6e88)
Refs #18641
Closes scylladb/scylladb#19144
* github.com:scylladb/scylladb:
test: pylib: Do not block async reactor while removing directories
repair: Exclude tablet migrations with tablet repair
repair_service: Propagate topology_state_machine to repair_service
main, storage_service: Move topology_state_machine outside storage_service
storage_service, topology: Extract topology_state_machine::await_quiesced()
tablet_scheduler: Make disabling of balancing interrupt shuffle mode
tablet_scheduler: Log whether balancing is considered as enabled
Will be used later in a place which doesn't have access to storage_service
but does have access to topology_state_machine.
It's not necessary to start a group0 operation around polling, because
the busy() state can be checked atomically, and if it's false it means
the topology is no longer busy.
(cherry picked from commit 476c076a21)
Tests will rely on this: they will run in shuffle mode and disable
balancing around sections which would otherwise be blocked indefinitely
by ongoing shuffling (like repair).
(cherry picked from commit 1513d6f0b0)
If a node restarts just before it stores the bootstrapping node's IP, it will
not have an ID-to-IP mapping for the bootstrapping node, which may cause a
failure on the write path. Detect this and fail the bootstrap if it happens.
(cherry picked from commit 1faef47952)
(cherry picked from commit 27445f5291)
(cherry picked from commit 6853b02c00)
(cherry picked from commit f91db0c1e4)
Refs #18927
Closes scylladb/scylladb#19118
* github.com:scylladb/scylladb:
raft topology: fix indentation after previous commit
raft topology: do not add bootstrapping node without IP as pending
test: add test of bootstrap where the coordinator crashes just before storing IP mapping
schema_tables: remove unused code
If there is no mapping from host id to ip while a node is in the bootstrap
state, there is no point adding it as a pending endpoint, since the write
handler will not be able to map it back to a host id anyway. If the
transition state requires double writes, though, we still want to fail.
In case the state is write_both_read_old we fail the barrier, which will
cause the topology operation to roll back; in case of write_both_read_new
we assert, but this should not happen since the mapping is persisted by
that point (or we would have failed in the write_both_read_old state).
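A schematic sketch of that decision logic (names approximate the description above, not the actual code):
```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

enum class transition_state { none, write_both_read_old, write_both_read_new };

void on_bootstrapping_node(std::optional<std::string> ip_of_new_node,
                           transition_state state) {
    if (ip_of_new_node) {
        // Mapping is known: add the node as a pending endpoint as usual.
        return;
    }
    switch (state) {
    case transition_state::none:
        // Not doing double writes yet: skip adding the node as pending; the
        // write handler could not map it back to a host id anyway.
        break;
    case transition_state::write_both_read_old:
        // Fail the barrier so the topology operation rolls back.
        throw std::runtime_error("no ip mapping for bootstrapping node");
    case transition_state::write_both_read_new:
        // The mapping must already be persisted by this point.
        assert(false && "ip mapping missing in write_both_read_new");
        break;
    }
}
```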
Fixes: scylladb/scylladb#18676
(cherry picked from commit 6853b02c00)
On the next boot there is no host-ID-to-IP mapping, which causes the node to
crash again with the "No mapping for :: in the passed effective replication map"
assertion.
(cherry picked from commit 27445f5291)
Retrieval of tablet stats must be serialized with mutations to token metadata, as the former requires tablet id stability.
If a tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort, as the tablet map requires that any id fed into it is lower than its current tablet count.
Fixes https://github.com/scylladb/scylladb/issues/18085.
(cherry picked from commit abcc68dbe7)
(cherry picked from commit 551bf9dd58)
(cherry picked from commit e7246751b6)
Refs https://github.com/scylladb/scylladb/pull/18287
Closes scylladb/scylladb#19095
* github.com:scylladb/scylladb:
topology_experimental_raft/test_tablets: restore usage of check_with_down
test: Fix flakiness in topology_experimental_raft/test_tablets
service: Use tablet read selector to determine which replica to account table stats
storage_service: Fix race between tablet split and stats retrieval
This is required by repair, which will start using get_primary_replica
in a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c52f70f92c)
... and replace it with a boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.
The option is OFF by default, but the default scylla.yaml file sets it
to true, so that newly installed clusters turn tablets ON.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
(cherry picked from commit 83d491af02)
Closes scylladb/scylladb#19012
A separate keyspace which also behaves as a system keyspace brings
little benefit while creating some compatibility problems,
like a schema digest mismatch during rollback. So we decided
to move the auth tables into the system keyspace.
Fixes https://github.com/scylladb/scylladb/issues/18098
Closes scylladb/scylladb#18769
(cherry picked from commit 2ab143fb40)
[avi: adjust test/alternator/suite.yaml to reflect new keyspace]
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.
Until now, we assumed that the ERMs are synchronized when
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.
This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.
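A rough sketch of the idea of using each endpoint's own topology (toy types and a simplified pairing, not the real mapping algorithm):
```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for an effective replication map and its topology.
struct topology_view {
    std::map<std::string, std::string> dc_of;   // endpoint -> datacenter
};
struct erm {
    std::vector<std::string> replicas;
    topology_view topo;
};

// Pair each base replica with a view replica in the same datacenter. Each
// endpoint's DC is looked up in *its own* erm's topology; using the base
// topology for a view endpoint can crash when the two erms were taken across a
// topology change and the view endpoint is unknown to the base's topology.
std::vector<std::pair<std::string, std::string>>
base_view_mapping(const erm& base, const erm& view) {
    std::vector<std::pair<std::string, std::string>> out;
    for (const auto& b : base.replicas) {
        const auto& b_dc = base.topo.dc_of.at(b);
        for (const auto& v : view.replicas) {
            if (view.topo.dc_of.at(v) == b_dc) {    // view DC from the view's topology
                out.emplace_back(b, v);
                break;
            }
        }
    }
    return out;
}
```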
Fixes: #17786
Fixes: #18709
(cherry-picked from 519317dc58)
This commit also includes the follow-up patch that removes the
flakiness introduced into the test by the commit above.
The flakiness was caused by enabling the
delay_before_get_view_natural_endpoint injection on a node
and not disabling it before the node is shut down. The patch
removes the enabling of the injection on the node in the first
place.
By squashing the commits, we won't introduce a place in the
commit history where a potential bisect could mistakenly fail.
Fixes: https://github.com/scylladb/scylladb/issues/18941
(cherry-picked from 0de3a5f3ff)
Closes scylladb/scylladb#18974
Up until now we waited until the mutations were in place and then returned
directly to the caller of the ALTER statement, but that doesn't imply
that tablets were deleted/created, so we must wait until the whole
processing is done and return only then.
When a node with CDC enabled and with the topology on raft
disabled bootstraps, it reads system.cdc_local for the last
generation. Nodes with both enabled use group0 to get the last
generation.
In the following scenario with a cluster of one node:
1. the node is created with CDC and the topology on raft enabled
2. the user creates table T
3. the node is restarted in the recovery mode
4. the CDC log of T is extended with new entries
5. the node restarts in normal mode
The generation created in step 3 is seen in
system_distributed.cdc_generation_timestamps but not in
system.cdc_generations_v3; thus there are streams in use that raft-based
CDC doesn't know about. Instead of creating a new
generation, the node should use the generation already committed
to group0.
Save the last CDC generation in system.cdc_local while loading
the topology state so that it is visible to CDC not based on raft.
Fixes scylladb/scylladb#17819
(cherry picked from commit 4351eee1f6)
Since we introduced the ability to revert migrations, we can no longer
rely on the ordering of transition stages to determine whether to account
for the pending or the leaving replica. Let's use the read selector instead,
which correctly knows which replica type has the correct stats.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 551bf9dd58)
If a tablet split is finalized while retrieving stats, the saved erm, used by all
shards, will be invalidated. That can either cause incorrect behavior or
a crash if an id is not available.
It's worked around by feeding the local tablet map into the "coordinator"
collecting stats from all shards. We will also no longer have a snapshot
of the erm shared between shards to help intra-node migration. This is
simplified by serializing token metadata changes and the retrieval of
the stats (the latter should complete pretty fast, so it shouldn't block
the former for any significant time).
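A toy sketch of the serialization idea (ScyllaDB uses its own synchronization primitives around token metadata, not a plain std::mutex; the types here are illustrative):
```cpp
#include <cstddef>
#include <map>
#include <mutex>

struct tablet_stats { /* per-tablet size, etc. */ };

class token_metadata_guard {
    std::mutex _mut;                        // serializes the two operations below
    std::size_t _tablet_count = 8;
public:
    void finalize_tablet_split() {
        std::lock_guard<std::mutex> g(_mut);
        _tablet_count *= 2;                 // tablet ids are remapped here
    }
    std::map<std::size_t, tablet_stats> retrieve_stats() {
        std::lock_guard<std::mutex> g(_mut);
        // While we hold the lock the tablet count cannot change under us, so
        // every tablet id fed into the map stays below _tablet_count.
        std::map<std::size_t, tablet_stats> stats;
        for (std::size_t id = 0; id < _tablet_count; ++id) {
            stats[id] = tablet_stats{};
        }
        return stats;
    }
};
```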
Fixes #18085.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit abcc68dbe7)
The existing inet_address::to_string() calls fmt::format("{}", *this)
anyway. However, the to_string() method is defined in a .cc file, while
the fmt formatter is in the header and is equipped with constexprs so
that converting an address to a string is done at compile time as much
as possible.
Also, though minor, fmt::to_string(foo) is believed to be even faster
than fmt::format("{}", foo).
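A small self-contained illustration of both points, using a toy address type instead of gms::inet_address:
```cpp
#include <fmt/core.h>
#include <fmt/format.h>

struct addr {
    unsigned char a, b, c, d;
};

template <>
struct fmt::formatter<addr> : fmt::formatter<fmt::string_view> {
    // A header-defined formatter lets more of the conversion work be resolved
    // at compile time than an out-of-line to_string() in a .cc file.
    auto format(const addr& x, fmt::format_context& ctx) const {
        return fmt::format_to(ctx.out(), "{}.{}.{}.{}", x.a, x.b, x.c, x.d);
    }
};

int main() {
    addr ip{127, 0, 0, 1};
    auto s1 = fmt::to_string(ip);          // no format string to parse
    auto s2 = fmt::format("{}", ip);       // same result, slightly more work
    return s1 == s2 ? 0 : 1;
}
```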
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#18712
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of the leaving replica
and loading it into the pending one.
This cloning is specific to the underlying storage, but s3 doesn't support
snapshot() yet (the sstables::storage procedure which clone is built
upon). It's only supported by the file system, with the help of hard links.
A new generation is picked for each cloned sstable, and it will
live in the same directory as the original.
A challenge I bumped into was understanding why the table refused to
load the sstables at the pending replica, as it considered them foreign.
Later I realized that the sharder (for reads) at this stage of migration
points only to the leaving replica. It didn't fail with mutation-based
streaming, because the sstable writer considers the shard --
that the sstable was written into -- as its owner, regardless of what
the sharder says. That was fixed by mimicking this behavior when
loading at the pending replica.
test:
./test.py --mode=dev intranode --repeat=100 passes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We need a separate transition kind for intra-node migration so that we
don't have to recover this information from the replica set in an
expensive way. This information is needed in the hot path - in
effective_replication_map, to not return the pending tablet replica to
the coordinator. From its perspective, the replica set is not
transitional.
The transition will also be used to alter the behavior of the
sharder. When not in intra-node migration, the sharder should
advertise the shard which is either in the previous or the next replica
set. During intra-node migration, that's not possible as there may be
two such shards. So it will return the shard according to the current
read selector.
std::optional formatting changed while moving from the home-grown formatter to
the fmt-provided formatter; don't rely on it for user-visible messages.
Here, the formatted optional is known to be engaged, so just print its value.
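A small illustration of the difference (assumes fmt >= 10, which provides a std::optional formatter in <fmt/std.h>):
```cpp
#include <fmt/core.h>
#include <fmt/std.h>   // fmt's std::optional formatter
#include <optional>

int main() {
    std::optional<int> state_id = 42;
    // The library formatter spells out the wrapper, e.g. "optional(42)", and its
    // exact form changed when moving away from the home-grown formatter.
    fmt::print("raw:   {}\n", state_id);
    // For user-visible messages, when the optional is known to be engaged,
    // print the value itself instead.
    fmt::print("value: {}\n", *state_id);
    return 0;
}
```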
Closes scylladb/scylladb#18534
Some time ago #16558 was merged, moving view builder drain into the generic drain. After this merge dtests started to fail from time to time, so the PR was reverted (see #18278). In #18295 the hang was found: view builder drain had moved from "before stopping the messaging service" to "after" it, and view update write handlers in the proxy hung for a hard-coded timeout of 5 minutes without being aborted. Tests don't wait for 5 minutes; they kill scylla, then complain about it and fail.
This PR brings back the original PR as well as the necessary fix that cancels view update write handlers on stop.
Closes scylladb/scylladb#18408
* github.com:scylladb/scylladb:
Reapply "Merge 'Drain view_builder in generic drain' from ScyllaDB"
view: Abort pending view updates when draining
Shutdown of a bootstrapping node could hang on
`_topology_state_machine.event.when()` in
`wait_for_topology_request_completion`. It caused
scylladb/scylladb#17246 and scylladb/scylladb#17608.
On a normal node, `wait_for_group0_stop` would prevent it, but this
function won't be called before we join group 0. Solve it by adding
a new subscriber to `_abort_source`.
Additionally, trigger `_group0_as` to prevent other hang scenarios.
Note that if both the new subscriber and `wait_for_group0_stop` are
called, nothing will break. `abort_source::request_abort` and
`condition_variable::broken` can be called multiple times.
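A schematic sketch of the fix's shape, with stand-ins for seastar::abort_source and seastar::condition_variable (the real subscription API differs):
```cpp
#include <functional>
#include <vector>

struct condition_variable {
    void broken() { /* wake all waiters with a broken-condition error */ }
};
struct abort_source {
    std::vector<std::function<void()>> subs;
    void subscribe(std::function<void()> f) { subs.push_back(std::move(f)); }
    void request_abort() { for (auto& f : subs) f(); }
};

struct topology_state_machine { condition_variable event; };

struct storage_service {
    abort_source _abort_source;
    abort_source _group0_as;
    topology_state_machine _topology_state_machine;

    void setup_shutdown_hooks() {
        // New subscriber: even a node that never finished joining group 0 (so
        // wait_for_group0_stop is never reached) breaks the waiters in
        // wait_for_topology_request_completion and aborts group 0 work.
        _abort_source.subscribe([this] {
            _topology_state_machine.event.broken();
            _group0_as.request_abort();
        });
        // Calling broken()/request_abort() again later from wait_for_group0_stop
        // is harmless; both are safe to invoke multiple times.
    }
};
```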
The raft-based topology is moved out of experimental in 6.0, no need
to backport the patch.
Fixes scylladb/scylladb#17246
Fixes scylladb/scylladb#17608
Closes scylladb/scylladb#18549
Currently, in raft mode, when raft topology is reloaded from disk or a
notification is received from gossip about an endpoint change, token
metadata is updated accordingly. While updating token metadata we detect
whether some nodes are joining or are leaving and we notify endpoint
lifecycle subscribers if such an event occurs. These notifications are
fired _before_ we finish updating token metadata and before the updated
version is globally available.
This behavior, for "node leaving" notifications specifically, was not
present in legacy topology mode. Hinted handoff depends on token
metadata being updated before it is notified about a leaving node (we
had a similar issue before: scylladb/scylladb#5087, and we fixed it by
enforcing this property). Because this is not true right now for raft
mode, the hint draining logic does not work properly - when a
node leaves the cluster, there should be an attempt to send out hints
for that node, but instead hints are not sent out and are kept on disk.
In order to fix the issue with hints, postpone notifying endpoint
lifecycle subscribers about joined and left nodes until after the final
token metadata is computed and replicated to all shards.
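A toy sketch of the reordering (illustrative types; the real code deals with locator::token_metadata and endpoint lifecycle subscribers in storage_service):
```cpp
#include <functional>
#include <string>
#include <vector>

struct token_metadata { /* updated topology view */ };

struct topology_loader {
    std::vector<std::function<void(const std::string&)>> on_left_subscribers;

    void replicate_to_all_shards(const token_metadata&) { /* ... */ }

    void reload_topology(const token_metadata& updated,
                         const std::vector<std::string>& left_nodes) {
        // First make the updated token metadata globally visible...
        replicate_to_all_shards(updated);
        // ...and only then tell subscribers (e.g. hinted handoff) that nodes
        // left. Hint draining relies on seeing the post-removal metadata when
        // the notification fires.
        for (const auto& node : left_nodes) {
            for (auto& cb : on_left_subscribers) {
                cb(node);
            }
        }
    }
};
```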
Fixes: scylladb/scylladb#17023
Closes scylladb/scylladb#18377
This pull request introduces host IDs in the Hinted Handoff module. Nodes are now identified by their host IDs instead of their IPs. The conversion occurs on the boundary between the module and `storage_proxy.hh`, but aside from that, IPs have been erased.
The changes take into consideration that there might still be old hints on disk, still identified by IPs – at start-up, we map them to host IDs if possible so that they're not lost.
Refs scylladb/scylladb#6403
Fixes scylladb/scylladb#12278
Closes scylladb/scylladb#15567
* github.com:scylladb/scylladb:
docs: Update Hinted Handoff documentation
db/hints: Add endpoint_downtime_not_bigger_than()
db/hints: Migrate hinted handoff when cluster feature is enabled
db/hints: Handle arbitrary directories in resource manager
db/hints: Start using hint_directory_manager
db/hints: Enforce providing IP in get_ep_manager()
db/hints: Introduce hint_directory_manager
db/hints/resource_manager: Update function description
db/hints: Coroutinize space_watchdog::scan_one_ep_dir()
db/hints: Expose update lock of space watchdog
db/hints: Add function for migrating hint directories to host ID
db/hints: Take both IP and host ID when storing hints
db/hints: Prepare initializing endpoint managers for migrating from IP to host ID
db/hints: Migrate to locator::host_id
db/hints: Remove noexcept in do_send_one_mutation()
service: Add locator::host_id to on_leave_cluster
service: Fix indentation
db/hints: Fix indentation
With topology over raft, all operations are already serialized by the
coordinator anyway, so there is no need to synchronize removenode using the API
lock. All other operations are still synchronized since they cannot be executed
in parallel for the same node anyway.
* 'gleb/17681-fix' of github.com:scylladb/scylla-dev:
storage_service: do not take API lock for removenode operation if topology coordinator is enabled
test: return file mark from wait_for that points after the found string
Finalization of tablet split was only synchronizing with migrations, but
that's not enough, as we want to make sure that processes like repair
complete first, since they might hold an erm and therefore be working
with a "stale" version of token metadata.
For synchronization to work properly, handling of tablet split finalization
will now take over the state machine, when possible, and execute a
global token metadata barrier to guarantee that the topology update from the
split won't cause problems. Repair, for example, could be writing an
sstable with stale metadata, and therefore could generate an sstable
that spans multiple tablets. We don't want that to happen, hence
we need the barrier.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#18380