scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 00:13:31 +00:00

Author	SHA1	Message	Date
Kamil Braun	5d3dde50f4	Merge '[Backport 6.0] Fail bootstrap if ip mapping is missing during double write stage' from ScyllaDB If a node restart just before it stores bootstrapping node's IP it will not have ID to IP mapping for bootstrapping node which may cause failure on a write path. Detect this and fail bootstrapping if it happens. (cherry picked from commit `1faef47952`) (cherry picked from commit `27445f5291`) (cherry picked from commit `6853b02c00`) (cherry picked from commit `f91db0c1e4`) Refs #18927 Closes scylladb/scylladb#19118 * github.com:scylladb/scylladb: raft topology: fix indentation after previous commit raft topology: do not add bootstrapping node without IP as pending test: add test of bootstrap where the coordinator crashes just before storing IP mapping schema_tables: remove unused code	2024-06-06 11:35:13 +02:00
Gleb Natapov	e11827f37e	raft topology: fix indentation after previous commit (cherry picked from commit `f91db0c1e4`)	2024-06-05 13:55:29 +00:00
Gleb Natapov	0acfc223ab	raft topology: do not add bootstrapping node without IP as pending If there is no mapping from host id to ip while a node is in bootstrap state there is no point adding it to pending endpoint since write handler will not be able to map it back to host id anyway. If the transition sate requires double writes though we still want to fail. In case the state is write_both_read_old we fail the barrier that will cause topology operation to rollback and in case of write_both_read_new we assert but this should not happen since the mapping is persisted by this point (or we failed in write_both_read_old state). Fixes: scylladb/scylladb#18676 (cherry picked from commit `6853b02c00`)	2024-06-05 13:55:28 +00:00
Gleb Natapov	c53cd98a41	test: add test of bootstrap where the coordinator crashes just before storing IP mapping On the next boot there is no host ID to IP mapping which causes node to crash again with "No mapping for :: in the passed effective replication map" assertion. (cherry picked from commit `27445f5291`)	2024-06-05 13:55:28 +00:00
Botond Dénes	341c29bd74	Merge '[Backport 6.0] storage_service: Fix race between tablet split and stats retrieval' from Raphael "Raph" Carvalho Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability. If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id feeded into it is lower than its current tablet count. Fixes https://github.com/scylladb/scylladb/issues/18085. (cherry picked from commit `abcc68dbe7`) (cherry picked from commit `551bf9dd58`) (cherry picked from commit `e7246751b6`) Refs https://github.com/scylladb/scylladb/pull/18287 Closes scylladb/scylladb#19095 * github.com:scylladb/scylladb: topology_experimental_raft/test_tablets: restore usage of check_with_down test: Fix flakiness in topology_experimental_raft/test_tablets service: Use tablet read selector to determine which replica to account table stats storage_service: Fix race between tablet split and stats retrieval	2024-06-05 13:06:32 +03:00
Benny Halevy	21f87c9cfa	locator: tablet_map: get_primary_replica: return tablet_replica This is required by repair when it will start using get_primary_replica in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `c52f70f92c`)	2024-06-03 19:50:39 +00:00
Botond Dénes	a38d5463ef	Merge '[Backport 6.0] tablets: load balancer: Use random selection of candidates when moving tablets' from ScyllaDB In order to avoid per-table tablet load imbalance balance from forming in the cluster after adding nodes, the load balancer now picks the candidate tablet at random. This should keep the per-table distribution on the target node similar to the distribution on the source nodes. Currently, candidate selection picks the first tablet in the unordered_set, so the distribution depends on hashing in the unordered set. Due to the way hash is calculated, table id dominates the hash and a single table can be chosen more often for migration away. This can result in imbalance of tablets for any given table after bootstrapping a new node. For example, consider the following results of a simulation which starts with a 6-node cluster and does a sequence of node bootstraps and decommissions. One table has 4096 tablets and RF=1, and the other has 256 tablets and RF=2. Before the patch, the smaller table has node overcommit of 2.34 in the worst topology state, while after the patch it has overcommit of 1.65. overcommit is calculated as max load (tablet count per node) dividied by perfect average load (all tablets / nodes): Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64} Overcommit : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}} Overcommit : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}} Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}} Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}} The worst state before the patch had the following distribution of tablets for the smaller table: Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62 Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76 Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88 Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05 Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37 Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74 Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33 One node has as many as 171 tablets of that table and another one has as few as 3. After the patch, the worst distribution looks like this: Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17 Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68 Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00 Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32 Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88 Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72 Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65 Most-loaded node has 121 tablets and least loaded node has 34 tablets. It's still not good, a better distribution is possible, but it's an improvement. Refs #16824 (cherry picked from commit `3be6120e3b`) (cherry picked from commit `c9bcb5e400`) (cherry picked from commit `7b1eea794b`) (cherry picked from commit `603abddca9`) Refs #18885 Closes scylladb/scylladb#19036 * github.com:scylladb/scylladb: tablets: load balancer: Use random selection of candidates when moving tablets test: perf: Add test for tablet load balancer effectiveness load_sketch: Extract get_shard_minmax() load_sketch: Allow populating only for a given table	2024-06-03 12:25:05 +03:00
Pavel Emelyanov	62a23fd86a	config: Remove experimental TABLETS feature ... and replace it with boolean enable_tablets option. All the places in the code are patched to check the latter option instead of the former feature. The option is OFF by default, but the default scylla.yaml file sets this to true, so that newly installed clusters turn tablets ON. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> (cherry picked from commit `83d491af02`) Closes scylladb/scylladb#19012	2024-06-03 12:16:41 +03:00
Tomasz Grabiec	b9c88fdf4b	tablets: load balancer: Use random selection of candidates when moving tablets In order to avoid per-table tablet load imbalance balance from forming in the cluster after adding nodes, the load balancer now picks the candidate tablet at random. This should keep the per-table distribution on the target node similar to the distribution on the source nodes. Currently, candidate selection picks the first tablet in the unordered_set, so the distribution depends on hashing in the unordered set. Due to the way hash is calculated, table id dominates the hash and a single table can be chosen more often for migration away. This can result in imbalance of tablets for any given table after bootstrapping a new node. For example, consider the following results of a simulation which starts with a 6-node cluster and does a sequence of node bootstraps and decommissions. One table has 4096 tablets and RF=1, and the other has 256 tablets and RF=2. Before the patch, the smaller table has node overcommit of 2.34 in the worst topology state, while after the patch it has overcommit of 1.65. overcommit is calculated as max load (tablet count per node) dividied by perfect average load (all tablets / nodes): Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64} Overcommit : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}} Overcommit : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}} Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}} Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}} The worst state before the patch had the following distribution of tablets for the smaller table: Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62 Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76 Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88 Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05 Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37 Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74 Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33 One node has as many as 171 tablets of that table and the one has as few as 3. After the patch, the worst distribution looks like this: Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17 Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68 Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00 Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32 Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88 Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72 Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65 Most-loaded node has 121 tablets and least loaded node has 34 tablets. It's still not good, a better distribution is possible, but it's an improvement. Refs #16824 (cherry picked from commit `603abddca9`)	2024-06-02 22:40:46 +00:00
Marcin Maliszkiewicz	cbf47319c1	db: auth: move auth tables to system keyspace Separate keyspace which also behaves as system brings little benefit while creating some compatibility problems like schema digest mismatch during rollback. So we decided to move auth tables into system keyspace. Fixes https://github.com/scylladb/scylladb/issues/18098 Closes scylladb/scylladb#18769 (cherry picked from commit `2ab143fb40`) [avi: adjust test/alternator/suite.yaml to reflect new keyspace]	2024-06-02 21:41:14 +03:00
Wojciech Mitros	3c47ab9851	mv: handle different ERMs for base and view table When calculating the base-view mapping while the topology is changing, we may encounter a situation where the base table noticed the change in its effective replication map while the view table hasn't, or vice-versa. This can happen because the ERM update may be performed during the preemption between taking the base ERM and view ERM, or, due to `f2ff701`, the update may have just been performed partially when we are taking the ERMs. Until now, we assumed that the ERMs are synchronized while calling finding the base-view endpoint mapping, so in particular, we were using the topology from the base's ERM to check the datacenters of all endpoints. Now that the ERMs are more likely to not be the same, we may try to get the datacenter of a view endpoint that doesn't exist in the base's topology, causing us to crash. This is fixed in this patch by using the view table's topology for endpoints coming from the view ERM. The mapping resulting from the call might now be a temporary mapping between endpoints in different topologies, but it still maps base and view replicas 1-to-1. Fixes: #17786 Fixes: #18709 (cherry-picked from `519317dc58`) This commit also includes the follow-up patch that removes the flakiness from the test that is introduced by the commit above. The flakiness was caused by enabling the delay_before_get_view_natural_endpoint injection on a node and not disabling it before the node is shut down. The patch removes the enabling of the injection on the node in the first place. By squashing the commits, we won't introduce a place in the commit history where a potential bisect could mistakenly fail. Fixes: https://github.com/scylladb/scylladb/issues/18941 (cherry-picked from `0de3a5f3ff`) Closes scylladb/scylladb#18974	2024-05-30 09:13:31 +02:00
Piotr Smaron	e04964ba17	Return response only when tablets are reallocated Up until now we waited until mutations are in place and then returned directly to the caller of the ALTER statement, but that doesn't imply that tablets were deleted/created, so we must wait until the whole processing is done and return only then.	2024-05-30 08:33:26 +03:00
Piotr Smaron	544c424e89	Implement ALTER tablets KEYSPACE statement support This commit adds support for executing ALTER KS for keyspaces with tablets and utilizes all the previous commits. The ALTER KS is handled in alter_keyspace_statement, where a global topology request in generated with data attached to system.topology table. Then, once topology state machine is ready, it starts to handle this global topology event, which results in producing mutations required to change the schema of the keyspace, delete the system.topology's global req, produce tablets mutations and additional mutations for a table tracking the lifetime of the whole req. Tracking the lifetime is necessary to not return the control to the user too early, so the query processor only returns the response while the mutations are sent.	2024-05-30 08:33:25 +03:00
Piotr Smaron	73b59b244d	Parameterize migration_manager::announce by type to allow executing different raft commands Since ALTER KS requires creating topology_change raft command, some functions need to be extended to handle it. RAFT commands are recognized by types, so some functions are just going to be parameterized by type, i.e. made into templates. These templates are instantiated already, so that only 1 instances of each template exists across the whole code base, to avoid compiling it in each translation unit.	2024-05-30 08:33:15 +03:00
Piotr Smaron	885c7309ee	Extend system.topology with 3 new columns to store data required to process alter ks global topo req Because ALTER KS will result in creating a global topo req, we'll have to pass the req data to topology coordinator's state machine, and the easiest way to do it is through sytem.topology table, which is going to be extended with 3 extra columns carrying all the data required to execute ALTER KS from within topology coordinator.	2024-05-30 08:33:15 +03:00
Piotr Smaron	adfad686b3	Allow query_processor to check if global topo queue is empty With current implementation only 1 global topo req can be executed at a time, so when ALTER KS is executed, we'll have to check if any other global topo req is ongoing and fail the req if that's the case.	2024-05-30 08:33:15 +03:00
Piotr Smaron	1a70db17a6	Introduce new global topo `keyspace_rf_change` req It will be used when processing ALTER KS statement, but also to create a separate processing path for a KS with tablets (as opposed to a vnode KS).	2024-05-30 08:33:15 +03:00
Piotr Smaron	bd4b781dc8	New raft cmd for both schema & topo changes Allows executing combined topology & schema mutations under a single RAFT command	2024-05-30 08:33:15 +03:00
Paweł Zakrzewski	cedb47d843	tablet_allocator: make load_balancer_stats_manager configurable by name This is needed, because the same name cannot be used for 2 separate entities, because we're getting double-metrics-registration error, thus the names have to be configurable, not hardcoded.	2024-05-30 08:33:15 +03:00
Botond Dénes	8bff078a89	Merge '[Backport 6.0] cdc, raft topology: fix and test cdc in the recovery mode' from ScyllaDB This PR ensures that CDC keeps working correctly in the recovery mode after leaving the raft-based topology. We update `system.cdc_local` in `topology_state_load` to ensure a node restarting in the recovery mode sees the last CDC generation created by the topology coordinator. Additionally, we extend the topology recovery test to verify that the CDC keeps working correctly during the whole recovery process. In particular, we test that after restarting nodes in the recovery mode, they correctly use the active CDC generation created by the topology coordinator. Fixes scylladb/scylladb#17409 Fixes scylladb/scylladb#17819 (cherry picked from commit `4351eee1f6`) (cherry picked from commit `68b6e8e13e`) (cherry picked from commit `388db33dec`) (cherry picked from commit `2111cb01df`) Refs #18820 Closes scylladb/scylladb#18938 * github.com:scylladb/scylladb: test: test_topology_recovery_basic: test CDC during recovery test: util: start_writes_to_cdc_table: add FIXME to increase CL test: util: start_writes_to_cdc_table: allow restarting with new cql storage_service: update system.cdc_local in topology_state_load	2024-05-29 16:14:02 +03:00
Piotr Dulikowski	bc711a169d	Merge '[Backport 6.0] qos/raft_service_level_distributed_data_accessor: print correct error message when trying to modify a service level in recovery mode' from ScyllaDB Raft service levels are read-only in recovery mode. This patch adds check and proper error message when a user tries to modify service levels in recovery mode. Fixes https://github.com/scylladb/scylladb/issues/18827 (cherry picked from commit `2b56158d13`) (cherry picked from commit `ee08d7fdad`) (cherry picked from commit `af0b6bcc56`) Refs #18841 Closes scylladb/scylladb#18913 * github.com:scylladb/scylladb: test/auth_cluster/test_raft_service_levels: try to create sl in recovery service/qos/raft_sl_dda: reject changes to service levels in recovery mode service/qos/raft_sl_dda: extract raft_sl_dda steps to common function	2024-05-28 16:45:52 +02:00
Patryk Jędrzejczak	ed3ac1eea4	storage_service: update system.cdc_local in topology_state_load When the node with CDC enabled and with the topology on raft disabled bootstraps, it reads system.cdc_local for the last generation. Nodes with both enabled use group0 to get the last generation. In the following scenario with a cluster of one node: 1. the node is created with CDC and the topology on raft enabled 2. the user creates table T 3. the node is restarted in the recovery mode 4. the CDC log of T is extended with new entries 5. the node restarts in normal mode The generation created in the step 3 is seen in system_distributed.cdc_generation_timestamps but not in system.cdc_generations_v3, thus there are used streams that the CDC based on raft doesn't know about. Instead of creating a new generation, the node should use the generation already committed to group0. Save the last CDC generation in the system.cdc_local during loading the topology state so that it is visible for CDC not based on raft. Fixes scylladb/scylladb#17819 (cherry picked from commit `4351eee1f6`)	2024-05-28 14:01:59 +00:00
Raphael S. Carvalho	46220bd839	service: Use tablet read selector to determine which replica to account table stats Since we introduced the ability to revert migrations, we can no longer rely on ordering of transition stages to determine whether to account pending or leaving replica. Let's use read selector instead, which correctly has info which replica type has correct stats info. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `551bf9dd58`)	2024-05-27 18:21:21 +00:00
Raphael S. Carvalho	55a45e3486	storage_service: Fix race between tablet split and stats retrieval If tablet split is finalized while retrieving stats, the saved erm, used by all shards, will be invalidated. It can either cause incorrect behavior or crash if id is not available. It's worked by feeding local tablet map into the "coordinator" collecting stats from all shards. We will also no longer have a snapshot of erm shared between shards to help intra-node migration. This is simplified by serializing token metadata changes and the retrieval of the stats (latter should complete pretty fast, so it shouldn't block the former for any significant time). Fixes #18085. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `abcc68dbe7`)	2024-05-27 18:21:21 +00:00
Michał Jadwiszczak	6d655e6766	service/qos/raft_sl_dda: reject changes to service levels in recovery mode When a cluster goes into recovery mode and service levels were migrated to raft, service levels become temporarily read-only. This commit adds a proper error message in case a user tries to do any changes. (cherry picked from commit `ee08d7fdad`)	2024-05-27 18:20:36 +00:00
Michał Jadwiszczak	54b9fdab03	service/qos/raft_sl_dda: extract raft_sl_dda steps to common function When setting/dropping a service level using raft data accessor, the same validation steps are executed (this_shard_id = 0 and guard is present). To not duplicate the calls in both functions, they can be extracted to a helper function. (cherry picked from commit `2b56158d13`)	2024-05-27 18:20:36 +00:00
Kefu Chai	747ffd8776	migration_manager: do not reference moved-away smart pointer this change is inspired by clang-tidy. it warns like: ``` [752/852] Building CXX object service/CMakeFiles/service.dir/migration_manager.cc.o Warning: /home/runner/work/scylladb/scylladb/service/migration_manager.cc:891:71: warning: 'view' used after it was moved [bugprone-use-after-move] 891 \| db.get_notifier().before_create_column_family(keyspace, view, mutations, ts); \| ^ /home/runner/work/scylladb/scylladb/service/migration_manager.cc:886:86: note: move occurred here 886 \| auto mutations = db::schema_tables::make_create_view_mutations(keyspace, std::move(view), ts); \| ^ ``` in which, `view` is an instance of view_ptr which is a type with the semantics of shared pointer, it's backed by a member variable of `seastar::lw_shared_ptr<const schema>`, whose move-ctor actually resets the original instance. so we are actually accessing the moved-away pointer in ```c++ db.get_notifier().before_create_column_family(keyspace, view, mutations, ts) ``` so, in this change, instead of moving away from `view`, we create a copy, and pass the copy to `db::schema_tables::make_create_view_mutations()`. this should be fine, as the behavior of `db::schema_tables::make_create_view_mutations()` does not rely on if the `view` passed to it is a moved away from it or not. the change which introduced this use-after-move was `88a5ddabce` Refs `88a5ddabce` Fixes #18837 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `125464f2d9`) Closes scylladb/scylladb#18873	2024-05-27 15:18:29 +03:00
Pavel Emelyanov	b24fb8dc87	inet_address: Remove to_sstring() in favor of fmt::to_string The existing inet_address::to_string() calls fmt::format("{}", *this) anyway. However, the to_string() method is declared in .cc file, while form formatter is in the header and is equipeed with constexprs so that converting an address to string is done as much as possible compile-time. Also, though minor, fmt::to_string(foo) is believed to be even faster than fmt::format("{}", foo). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18712	2024-05-21 09:43:08 +03:00
Avi Kivity	52fe351c31	Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. The system is not prepared for intra-node migration currently. Request coordination is host-based, while for intra-node migration it should be (also) shard-based. The solution employed here is to keep the coordination between nodes as-is, and for intra-node migration storage_proxy-level coordinator is not aware of the migration (no pending host). The replica-side request handler will be a second-level coordinator which routes requests to shards, similar to how the first-level coordinator routes them to hosts. Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes. perf-simple-query test results show no signs of regression: Command: perf-simple-query -c1 -m1G --write --tablets --duration=10 Before: > 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors) > 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors) > 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors) > 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors) > 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors) > 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors) > 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors) > 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors) > 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors) > > median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > median absolute deviation: 243.04 > maximum: 87756.72 > minimum: 83294.81 > After: > 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors) > 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors) > 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors) > 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors) > 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors) > 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors) > 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors) > 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors) > 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors) > > median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > median absolute deviation: 1007.41 > maximum: 90306.31 > minimum: 85523.06 Command (reads): perf-simple-query -c1 -m1G --tablets --duration=10 Before: > 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors) > 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors) > 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors) > 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors) > 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors) > 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > > median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > median absolute deviation: 571.20 > maximum: 97549.23 > minimum: 94031.94 > After: > 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors) > 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors) > 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors) > 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors) > 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors) > 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors) > 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors) > > median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > median absolute deviation: 200.33 > maximum: 101559.09 > minimum: 99794.67 > Fixes #16594 Closes scylladb/scylladb#18026 * github.com:scylladb/scylladb: Implement fast streaming for intra-node migration test: tablets_test: Test sharding during intra-node migration test: tablets_test: Check sharding also on the pending host test: py: tablets: Test writes concurrent with migration test: py: tablets: Test crash during intra-node migration api, storage_service: Introduce API to wait for topology to quiesce dht, replica: Remove deprecated sharder APIs test: Avoid using deprecated sharded API db: do_apply_many() avoid deprecated sharded API replica: mutation_dump: Avoid deprecated sharder API repair: Avoid deprecated sharder API table: Remove optimization which returns empty reader when key is not owned by the shard dht: is_single_shard: Avoid deprecated sharder API dht: split_range_to_single_shard: Work with static_sharder only dht: ring_position_range_sharder: Avoid deprecated sharder APIs dht: token: Avoid use of deprecated sharder API by switching to static_sharder selective_token_sharder: Avoid use of deprecated sharder API docs: Document tablet sharding vs tablet replica placement readers/multishard.cc: use shard_for_reads() instead of shard_of() multishard_mutation_query.cc: use shard_for_reads() instead of shard_of() storage_proxy: Extract common code to apply mutations on many shards according to sharder storage_proxy: Prepare per-partition rate-limiting for intra-node migration storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() storage_proxy: Prepare mutate_hint() for intra-node tablet migration commitlog_replayer: Avoid deprecated sharder::shard_of() lwt: Avoid deprecated sharder::shard_of() compaction: Avoid deprecated sharder::shard_of() dht: Extract dht::static_sharder replica: Deprecate table::shard_of() locator: Deprecate effective_replication_map::shard_of() dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard tests: tablets: py: Add intra-node migration test tests: tablets: Test that drained nodes are not balanced internally tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load tests: tablets: Verify that disabling balancing results in no intra-node migrations tests: tablets: Check that nodes are internally balanced tests: tablets: Improve debuggability by showing which rows are missing tablets, storage_service: Support intra-node migration in move_tablet() API tablet_allocator: Generate intra-node migration plan tablet_allocator: Extract make_internode_plan() tablet_allocator: Maintain candidate list and shard tablet count for target nodes tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions tablets, streaming: Implement tablet streaming for intra-node migration dht, auto_refreshing_sharder: Allow overriding write selector multishard_writer: Handle intra-node migration storage_proxy: Handle intra-node tablet migration for writes tablets: Get rid of tablet_map::get_shard() tablets: Avoid tablet_map::get_shard in cleanup tablets: test: Use sharder instead of tablet_map::get_shard() tablets: tablet_sharder: Allow working with non-local host sharding: Prepare for intra-node-migration docs: Document sharder use for tablets tablets: Introduce tablet transition kind for intra-node migration tests: tablets: Fix use-after-move of skiplist in rebalance_tablets() sstables, gdb: Track readers in a linked list raft topology: Fix global token metadata barrier to not fence ahead of what is drained	2024-05-20 16:13:01 +03:00
Kefu Chai	a517fcf970	service/storage_proxy: capture `tr_state` by copy in handle_paxos_accept() this change is inspired by following warning from clang-tidy ``` Warning: /home/runner/work/scylladb/scylladb/service/storage_proxy.cc:884:13: warning: 'tr_state' used after it was moved [bugprone-use-after-move] 884 \| if (tr_state) { \| ^ /home/runner/work/scylladb/scylladb/service/storage_proxy.cc:872:139: note: move occurred here 872 \| auto f = get_schema_for_read(proposal.update.schema_version(), src_addr, *timeout).then([&sp = _sp, &sys_ks = _sys_ks, tr_state = std::move(tr_state), \| ^ ``` this is not a false positive. as `tr_state` is a captured by move for constructing a variable in the captured list of a lambda which is in turn passed to the expression evaluated to `f`. even the expression itself is not evaluated yet when we reference `tr_state` to check if it is empty after preparing the expression, `tr_state` is already moved away into the captured variable. so at that moment, the statement of `f = f.finally(...)` is never evaluated, because `tr_state` is always empty by then. so before this change, the trace message is never recorded. in this change, we address this issue by capturing `tr_state` by copying it. as `tr_state` is backed by a `lw_shared_ptr`, the overhead is neglectable. after this change, the tracing message is recorded. the change introduced this issue was `548767f91e`. please note, we could coroutinize this function to improve its readability, but since this is a fix and should be backported, let's start with a minimal fix, and worry about the readability in a follow-up change. Refs `548767f91e` Fixes #18725 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18702	2024-05-20 12:58:49 +03:00
Avi Kivity	2fbd78c769	feature: grandfather DIGEST_FOR_NULL_VALUES The DIGEST_FOR_NULL_VALUES feature was added in `21a77612b3` (2020; 4.4) and can now be assumed to be always present. The hasher which it invoked is removed.	2024-05-18 00:24:00 +03:00
Avi Kivity	879583c489	storage_proxy: drop use of MD5 as a digest algorithm The XXHASH feature was introduced in `0bab3e59c2` (2017; 2.2) and made mandatory in `defe6f49df` (2020; 4.4), but some vestiges remain. Remove them now. Note that md5_hasher itself is still in use by other components, so it cannot be removed.	2024-05-18 00:23:47 +03:00
Avi Kivity	d52c424a5f	feature: grandfather LWT LWT was make non-experimental in `9948f548a5` (2020; 4.1) and can now be assumed to be always present.	2024-05-18 00:20:53 +03:00
Avi Kivity	93088d0921	feature: grandfather HINTED_HANDOFF_SEPARATE_CONNECTION The HINTED_HANDOFF_SEPARATE_CONNECTION feature was introduced in `3a46b1bb2b` (2019; 3.3) and can be assumed always present.	2024-05-18 00:18:27 +03:00
Avi Kivity	3bead8cea0	feature: grandfather PER_TABLE_PARTITIONERS The PER_TABLE_PARTITIONERS feature was added in `90df9a44ce` (2020; 4.0) and can now be assumed to be always present. We also remove the associated schema_feature.	2024-05-18 00:15:07 +03:00
Avi Kivity	c7d7ca2c23	feature: grandfather CDC The CDC feature was made non-experimental in `e9072542c1` (2020; 4.4) and can now be assumed to be always present. We also remove the corresponding schema_feature.	2024-05-17 20:41:20 +03:00
Avi Kivity	82ad2913ca	feature: grandfather DIGEST_INSENSITIVE_TO_EXPIRY The DIGEST_INSENSITIVE_TO_EXPIRY feature was added in `9de071d214` (2019; 3.2) and can now be assumed to be always present. We enable the corresponding schema_feature unconditionally. We do not remove the corresponding schema feature, because it can be disabled when the related TABLE_DIGEST_INSENSITIVE_TO_EXPIRY is present.	2024-05-17 20:41:19 +03:00
Avi Kivity	b5f6021a6b	feature: grandfather VIEW_VIRTUAL_COLUMNS The VIEW_VIRTUAL_COLUMNS feature was added in `a108df09f9` (2019; 3.1) and can now be assumed to be always present. The corresponding schema_feature is removed. Note schema_features are not sent over the wire. A digest calculation without VIEW_VIRTUAL_COLUMNS is no longer tested.	2024-05-17 20:41:19 +03:00
Avi Kivity	6982de6dde	Merge 'Fix stalls in forward_service::dispatch() with large tablet count' from Raphael "Raph" Carvalho With a large tablet count, e.g. 128k, forward_service::dispatch() can potentially stall when grouping ranges per endpoint. ` Reactor stalled for 4 ms on shard 1. Backtrace: 0x5eb15ea 0x5eb09f5 0x5eb1daf 0x3dbaf 0x2d01e57 0x33f7d1e 0x348255f 0x2d005d4 0x2d3d017 0x2d3d58c 0x2d3d225 0x5e59622 0x5ec328f 0x5ec4577 0x5ee84e0 0x5e8394a 0x8c946 0x11296f ` Also there are inefficient copies that are being removed. partition_range_vector for a single endpoint can grow beyond 1M. Closes scylladb/scylladb#18695 * github.com:scylladb/scylladb: service: fix indentation in dispatch() service: fix reactor stall with large tablet count service: avoid potential expensive copies in forward_service::dispatch() service: coroutinize forward_service::dispatch()	2024-05-16 15:17:43 +03:00
Piotr Dulikowski	68eca3778c	Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed. See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions. This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh. The existing mechanism works in the following way: * Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes * Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking * We keep track of the percent of consumed units on each node, this is called `view update backlog`. * Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level. This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates. To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`. The new algorithm of view update generation looks something like this: ```c++ for(;;) { auto updates = generate_updates_batch_with_max_100_rows(); co_await seastar::sleep(calculate_sleep_time_from_backlogs()); spawn_background_tasks_for_updates(updates); } ``` Fixes: https://github.com/scylladb/scylladb/issues/12379 Closes scylladb/scylladb#16819 * github.com:scylladb/scylladb: test: add test for bad_allocs during large mv queries mv: throttle view update generation for large queries exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception db/view: extract view throttling delay calculation to a global function view_update_generator: add get_storage_proxy() storage_proxy: make view backlog getters public	2024-05-16 08:22:54 +02:00
Raphael S. Carvalho	715ae689c0	Implement fast streaming for intra-node migration With intra-node migration, all the movement is local, so we can make streaming faster by just cloning the sstable set of leaving replica and loading it into the pending one. This cloning is underlying storage specific, but s3 doesn't support snapshot() yet (th sstables::storage procedure which clone is built upon). It's only supported by file system, with help of hard links. A new generation is picked for new cloned sstable, and it will live in the same directory as the original. A challenge I bumped into was to understand why table refused to load the sstable at pending replica, as it considered them foreign. Later I realized that sharder (for reads) at this stage of migration will point only to leaving replica. It didn't fail with mutation based streaming, because the sstable writer considers the shard -- that the sstable was written into -- as its owner, regardless of what sharder says. That was fixed by mimicking this behavior during loading at pending. test: ./test.py --mode=dev intranode --repeat=100 passes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	ad02d85c16	test: py: tablets: Test crash during intra-node migration	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	7956a2991e	api, storage_service: Introduce API to wait for topology to quiesce	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	3b7d7088d1	storage_proxy: Extract common code to apply mutations on many shards according to sharder	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	660b3d1765	storage_proxy: Prepare per-partition rate-limiting for intra-node migration Note: there is a potential problem with rate-limit count going out of sync during intra-node migration between old and the new shard. Before this patch, when coordinator accounted and admitted the request, so the rate_limit_info passed to apply_locally() is account_only, it was converted to std::monostate for requests to the local replia. This makes sense because the request was already accounted by the coordinator. However, during intra-node migration when we do double writes to two shards locally, that means that the new shard will not account the write, it will have lower count than the limiter on the old shard. This means that the new shard may accept writes which will end up being rejected. This is not desirable, but not the end of the world since it's temporary, and the new shard will still protect itself from overload based on its own rate limiter.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	7c3291b5ea	storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() Cunters are not supported with tablets, so we should not reach this path.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	db2809317d	storage_proxy: Prepare mutate_hint() for intra-node tablet migration	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	c9294b1642	lwt: Avoid deprecated sharder::shard_of() Instead, use shard_for_reads(). The justification is that: 1) In cas_shard(), we need to pick a single request coordinator. shard_for_reads() gives that, which is equivalent to shard_of() if there is no intra-node migration. 2) In paxos handler for prepare(), the shard we execute it on is the shard from which we read, so shard_for_reads() is the one. 3) Updates of paxos state are separate CQL requests, and use their own sharding. 4) Handler for learn is executing updates using calls to storage_proxy::mutate_locally() which will use the right sharder for writes However, the code is still not prepared for intra-node migration, and possibly regular migration too in case of abandoned requests, because the locking of paxos state assumes that the shard is static. That would have to be fixed separately, e.g. by locking both shards (shard_for_writes()) during migration, so that the set of locked shards always intersects during migration and local serialization of paxos state updates is achieved. I left FIXMEs for that.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	329342bfb2	tablets, storage_service: Support intra-node migration in move_tablet() API	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	db9d3f0128	tablet_allocator: Generate intra-node migration plan Intra-node migrations are scheduled for each node independently with the aim to equalize per-shard tablet count on each node. This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. After scheduling inter-node migrations, the algorithm schedules intra-node migrations. This means that across-node migrations can proceed in parallel with intra-node migrations if there is free capacity to carry them out, but across-node migrations have higher priority. Fixes #16594	2024-05-16 00:28:47 +02:00

1 2 3 4 5 ...

4564 Commits