Commit Graph

4602 Commits

Author SHA1 Message Date
Wojciech Mitros
d31437b589 mv: replicate the gossiped backlog to all shards
On each shard of each node we store the view update backlogs of
other nodes so that we can, depending on their size, delay responses
to incoming writes, lowering the load on those nodes and helping them
bring their backlog back to normal if it grows too high.

These backlogs are propagated between nodes in two ways: the first
one is adding them to replica write responses. The second one
is gossiping any changes to the node's backlog every 1s. The gossip
becomes useful when writes to some node stop for some time and we
stop getting the backlog using the first method, but we still want
to be able to select a proper delay for new writes coming to this
node. It will also be needed for the mv admission control.

Currently, the backlog is gossiped from shard 0, as expected.
However, we also receive the backlog only on shard 0 and only
update that shard's stored backlog for the other node. Instead, we
want the backlogs updated on all shards, allowing us to use proper
delays also when requests are received on shards other than 0.

This patch changes the backlog update code so that the backlogs
on all shards are updated instead. This will only be performed
up to once per second for each other node, and is done with
a lower priority, so it won't severely impact other work.
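
At a high level, the fan-out can be pictured with a minimal Seastar-style sketch, assuming a hypothetical sharded `backlog_store` service; the names and structure are illustrative, not the actual ScyllaDB code:

```
// Minimal sketch (assumed names, not ScyllaDB code): fan a backlog value
// received for a remote node out to every shard instead of shard 0 only.
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <cstddef>

struct view_update_backlog {
    std::size_t current = 0;
    std::size_t max = 0;
};

// Hypothetical per-shard store of "remote node -> last known backlog".
class backlog_store {
public:
    void remember(const view_update_backlog& b) {
        // record b for the originating node on this shard
        (void)b;
    }
};

// Called when a gossiped backlog arrives (on shard 0): instead of updating
// only the local shard, propagate the value to all shards.
seastar::future<> replicate_backlog(seastar::sharded<backlog_store>& store,
                                    view_update_backlog b) {
    return store.invoke_on_all([b] (backlog_store& local) {
        local.remember(b);
    });
}
```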

Fixes: scylladb/scylladb#19232

Closes scylladb/scylladb#19268
2024-06-14 11:24:20 +02:00
Benny Halevy
34dfa4d3a3 storage_service: join_token_ring: reject replace on different dc or rack
Do not allow replacing a node on one dc/rack
with a node on a different dc/rack, as this violates
the assumption of the replace-node operation that
all token ranges previously owned by the dead
node will be rebuilt on the new node.
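
Conceptually, the new check boils down to a location comparison at replace time; the sketch below uses assumed names and is not the actual join_token_ring code:

```
// Sketch only (assumed types/names): reject a replace whose new node sits in
// a different dc or rack than the node it is replacing.
#include <stdexcept>
#include <string>

struct endpoint_dc_rack {
    std::string dc;
    std::string rack;
};

void check_replace_location(const endpoint_dc_rack& replaced,
                            const endpoint_dc_rack& replacing) {
    if (replaced.dc != replacing.dc || replaced.rack != replacing.rack) {
        throw std::runtime_error(
            "Cannot replace a node in " + replaced.dc + "/" + replaced.rack +
            " with a node in " + replacing.dc + "/" + replacing.rack);
    }
}
```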

Fixes scylladb/scylladb#16858
Refs scylladb/scylla-enterprise#3518

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#16862
2024-06-13 11:19:47 +02:00
Kefu Chai
0c9ea654f5 service/paxos: drop operator<< for proposal
since we stopped using the generic container formatters, which in turn
use operator<< for formatting the elements, we can drop more
operator<< overloads.

so, in this change, we drop operator<< for proposal.
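
For context, the pattern being adopted is a fmt::formatter specialization instead of a stream operator<<; the `proposal` type below is a stand-in, not the real service::paxos::proposal:

```
// Illustration of formatting via fmt instead of operator<< (stand-in type).
#include <fmt/core.h>
#include <fmt/format.h>
#include <string_view>

struct proposal {
    long ballot;
};

template <>
struct fmt::formatter<proposal> : fmt::formatter<std::string_view> {
    auto format(const proposal& p, fmt::format_context& ctx) const {
        return fmt::format_to(ctx.out(), "proposal(ballot={})", p.ballot);
    }
};

int main() {
    fmt::print("{}\n", proposal{42});  // prints: proposal(ballot=42)
}
```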

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19156
2024-06-12 10:14:47 +03:00
Dawid Medrek
431ec55f6c service/storage_proxy: Move a comment to its relevant place
In b92fb35, we put a comment in the wrong place. These changes
move it to the right one.

Closes scylladb/scylladb#19215
2024-06-12 10:10:02 +03:00
Michał Chojnowski
fee48f67ef storage_proxy: avoid infinite growth of _throttled_writes
storage_proxy has a throttling mechanism which attempts to limit the number
of background writes by forcefully raising CL to ALL
(it's not implemented exactly like that, but that's the effect) when
the amount of background and queued writes is above some fixed threshold.
If this is applied to a write, it becomes "throttled",
and its ID is appended to _throttled_writes.

Whenever the amount of background and queued writes falls below the threshold,
writes are "unthrottled" — some IDs are popped from _throttled_writes
and the writes represented by these IDs — if their handlers still exist —
have their CL lowered back.

The problem here is that IDs are only ever removed from _throttled_writes
if the number of queued and background writes falls below the threshold.
But this doesn't have to happen in any finite time, if there's constant write
pressure. And in fact, in one load test, it didn't happen for 3 hours,
eventually causing the buffer to grow to gigabytes and trigger an OOM.

This patch is intended to be a good-enough-in-practice fix for the problem.
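
The commit message doesn't spell out the exact fix, so the following is only a generic sketch of one way to bound such a buffer: cap its size and eagerly unthrottle whatever gets evicted. All names here are hypothetical:

```
// Hypothetical bounded buffer of throttled-write IDs.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

class throttled_writes_buffer {
    std::deque<uint64_t> _ids;
    std::size_t _max_size;
public:
    explicit throttled_writes_buffer(std::size_t max_size) : _max_size(max_size) {}

    // Record a newly throttled write. If the cap is exceeded, the oldest ID is
    // evicted and returned so the caller can unthrottle it immediately.
    std::optional<uint64_t> push(uint64_t id) {
        std::optional<uint64_t> evicted;
        if (_ids.size() >= _max_size) {
            evicted = _ids.front();
            _ids.pop_front();
        }
        _ids.push_back(id);
        return evicted;
    }
};
```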

Fixes scylladb/scylladb#17476
Fixes scylladb/scylladb#1834

Closes scylladb/scylladb#19136
2024-06-07 15:56:23 +02:00
Gleb Natapov
34cf5c81f6 group0, topology coordinator: run group0 and the topology coordinator in gossiper scheduling group
Currently they both run in the streaming group, which may become busy during
repair/mv building and affect group0 functionality. Move them to the
gossiper group, where they should have more time to run.
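
The mechanism involved is Seastar's scheduling groups; a minimal sketch of pinning work to a given group, with a placeholder fiber body and group, looks like this:

```
// Sketch: run a fiber under a specific scheduling group (e.g. the gossiper's).
#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>

seastar::future<> run_coordinator_loop() {          // placeholder fiber body
    return seastar::make_ready_future<>();
}

seastar::future<> start_in_group(seastar::scheduling_group gossip_sg) {
    return seastar::with_scheduling_group(gossip_sg, [] {
        return run_coordinator_loop();
    });
}
```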

Fixes scylladb/scylladb#18863

Closes scylladb/scylladb#19138
2024-06-07 15:31:44 +02:00
Piotr Dulikowski
e18aeb2486 Merge 'mv: gossip the same backlog if a different backlog was sent in a response' from Wojciech Mitros
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However, if this node's backlog as recorded on another
node changed through a write response, the value remembered by the gossiper
is not updated, so if after the response the backlog goes back to the value
from the previous gossip round, it will not get sent and the other node will
be left with an outdated backlog. This can be observed in the following scenario:

1. Cluster starts, all nodes gossip their empty view update backlog to one another
2. On node N, `view_update_backlog_broker` (the backlog gossiper) performs an iteration of its backlog update loop, sees no change (backlog has been empty since the start), schedules the next iteration after 1s
3. Within the next 1s, coordinator (different than N) sends a write to N causing a remote view update (which we do not wait for). As a result, node N replies immediately with an increased view update backlog, which is then noted by the coordinator.
4. Still within the 1s, node N finishes the view update in the background, dropping its view update backlog to 0.
5. In the next and following iterations of `view_update_backlog_broker` on N, backlog is empty, as it was in step 2, so no change is seen and no update is sent due to the check
```
auto backlog = _sp.local().get_view_update_backlog();
if (backlog_published && *backlog_published == backlog) {
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
    continue;
}
```

After this scenario happens, the coordinator keeps information about an increased view update backlog on N even though it's actually already empty.

This patch fixes this by notifying the gossiper that a different backlog
was sent in a response, causing it to send the (seemingly unchanged) backlog
to other nodes in the following gossip round.
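
A simplified sketch of the idea (hypothetical names, not the actual view_update_backlog_broker code): remember that a backlog different from the last gossiped one went out in a write response, and force the next gossip round to publish even an apparently unchanged value:

```
// Simplified sketch (hypothetical names): force the next gossip round to
// publish the backlog if a different value already went out in a response.
#include <atomic>
#include <cstdint>

class backlog_publisher {
    uint64_t _last_gossiped = 0;
    std::atomic<bool> _diverged_via_response{false};
public:
    // Replica write path: a response carried backlog b.
    void note_response_backlog(uint64_t b) {
        if (b != _last_gossiped) {
            _diverged_via_response.store(true, std::memory_order_relaxed);
        }
    }

    // Gossip loop, once per second: should we publish `current`?
    bool should_gossip(uint64_t current) {
        bool diverged = _diverged_via_response.exchange(false, std::memory_order_relaxed);
        if (current == _last_gossiped && !diverged) {
            return false;   // nothing changed and no response disagreed
        }
        _last_gossiped = current;
        return true;
    }
};
```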

Fixes: https://github.com/scylladb/scylladb/issues/18461

Similarly to https://github.com/scylladb/scylladb/pull/18646, without admission control (https://github.com/scylladb/scylladb/pull/18334), this patch doesn't affect much, so I'm marking it as backport/none

Tests: manual. Currently this patch only affects the length of MV flow control delay, which is not reliable to base a test on. A proper test will be added when MV admission control is added, so we'll be able to base the test on rejected requests

Closes scylladb/scylladb#18663

* github.com:scylladb/scylladb:
  mv: gossip the same backlog if a different backlog was sent in a response
  node_update_backlog: divide adding and fetching backlogs
2024-06-07 10:20:21 +02:00
Avi Kivity
cd553848c1 Merge 'auth-v2: use a single transaction in auth related statements ' from Marcin Maliszkiewicz
Due to the gradual introduction of raft into the statements code, in cases where a single statement modified more than one table, or where a mutation-producing function was composed out of simpler ones, we violated transactional logic and statement execution was not atomic as a whole.

This patch changes that, so now either all changes resulting from statement execution are applied or none. Affected statement types are:
- schema modification
- auth modifications
- service levels modifications

Fixes https://github.com/scylladb/scylladb/issues/17738

Closes scylladb/scylladb#17910

* github.com:scylladb/scylladb:
  raft: rename mutations_collector to group0_batch
  raft: rename announce to commit
  cql3: raft: attach description to each mutations collector group
  auth: unify mutations_generator type
  auth: drop redundant 'this' keyword
  auth: remove no longer used code from standard_role_manager::legacy_modify_membership
  cql3: auth: use mutation collector for service levels statements
  cql3: auth: use mutation collector for alter role
  cql3: auth: use mutation collector for grant role and revoke role
  cql3: auth: use mutation collector for drop role and auto-revoke
  auth: add refactored modify_membership func in standard_role_manager
  auth: implement empty revoke_all in allow_all_authorizer
  auth: drop request_execution_exception handling from default_authorizer::revoke_all
  Revert "Introduce TABLET_KEYSPACE event to differentiate processing path of a vnode vs tablets ks"
  cql3: auth: use mutation collector for grant and revoke permissions
  cql3: extract changes_tablets function in alter_keyspace_statement
  cql3: auth: use mutation collector for create role statement
  auth: move create_role code into service
  auth: add a way to announce mutations having only client_state ref
  auth: add collect_mutations common helper
  auth: remove unused header in common.hh
  auth: add class for gathering mutations without immediate announce
  auth: cql3: use auth facade functions consistently on write path
  auth: remove unused is_enforcing function
2024-06-06 17:31:26 +03:00
Marcin Maliszkiewicz
63e6334a64 raft: rename mutations_collector to group0_batch 2024-06-06 13:26:34 +02:00
Kamil Braun
57e810c852 Merge 'Serialize repair with tablet migration' from Tomasz Grabiec
We want to make repair mutually exclusive with tablet migrations to avoid
races between repair reads/writes and replica movement. Repair is not
prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not: the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

The exclusion works by keeping an effective_replication_map_ptr to a version
which has none of the table's tablets in transition. That prevents later
transitions from starting, because the topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, i.e. before any requests start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

A blocked tablet migration (e.g. due to a down node) will block repair,
whereas before it would fail. Once the admin resolves the cause of the blocked
migration, repair will continue.

Fixes #17658.
Fixes #18561.

Closes scylladb/scylladb#18641

* github.com:scylladb/scylladb:
  test: pylib: Do not block async reactor while removing directories
  repair: Exclude tablet migrations with tablet repair
  repair_service: Propagate topology_state_machine to repair_service
  main, storage_service: Move topology_state_machine outside storage_service
  storage_service, topology: Extract topology_state_machine::await_quiesced()
  tablet_scheduler: Make disabling of balancing interrupt shuffle mode
  tablet_scheduler: Log whether balancing is considered as enabled
2024-06-06 11:27:03 +02:00
Wojciech Mitros
f70f774e40 mv: gossip the same backlog if a different backlog was sent in a response
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However, if this node's backlog as recorded on another
node changed through a write response, the value remembered by the gossiper
is not updated, so if after the response the backlog goes back to the value
from the previous gossip round, it will not get sent and the other node will
be left with an outdated backlog.
This patch changes this by notifying the gossiper that the backlog changed
since the last gossip round, so a different backlog could have been sent
through the response piggyback mechanism. With that information, the gossiper
will send the (seemingly unchanged) backlog to other nodes in the following
gossip round.

Fixes: https://github.com/scylladb/scylladb/issues/18461
2024-06-06 10:45:15 +02:00
Wojciech Mitros
272e80fe0a node_update_backlog: divide adding and fetching backlogs
Currently, we only update the backlogs in node_update_backlog at the
same time when we're fetching them. This is done using storage_proxy's
method get_view_update_backlog, which is confusing because it's a getter
with side-effects. Additionally, we don't always want to update the
backlog when we're reading it (as in gossip, which runs only on shard 0),
and we don't always want to read it when we're updating it (when we're
not handling any writes but the backlog drops because background work
finished).

This patch divides both node_update_backlog::add_fetch and
storage_proxy::get_view_update_backlog into two methods: one
for updating and one for reading the backlog. This patch only replaces
the places where we're currently using the view backlog getter; more
situations where we should get/update the backlog should be considered
in a following patch.
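
Sketched, the split looks roughly like the following (hypothetical, much-simplified signatures; the real class aggregates per-shard values):

```
// Hypothetical, much-simplified split of the old "get with side effects" API.
#include <atomic>
#include <cstdint>

class node_update_backlog {
    std::atomic<uint64_t> _backlog{0};
public:
    // Writer side: record the current local backlog (write path, or when
    // background view-update work finishes).
    void add(uint64_t b) {
        _backlog.store(b, std::memory_order_relaxed);
    }

    // Reader side: fetch the last recorded value without updating anything
    // (e.g. for the once-per-second gossip on shard 0).
    uint64_t fetch() const {
        return _backlog.load(std::memory_order_relaxed);
    }
};
```
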
2024-06-06 10:45:13 +02:00
Botond Dénes
44975abe18 Merge 'Sanitize start-stop of protocol servers' from Pavel Emelyanov
Protocol servers are started last, and are registered in storage_service, which stops them. Also there are deferred actions scheduled to stop protocol servers on aborted start and a FIXME asking to make even this case rely on storage_service. Also, there's a (rather rare) aborted-start bug in alternator and redis. Yet, thrift can be left started in some weird circumstances. This patch fixes it all. As a side effect, the start-stop code becomes shorter and a bit better structured.

refs: #2737

Closes scylladb/scylladb#19042

* github.com:scylladb/scylladb:
  main: Start alternator expiration service earlier
  main: Start redis transparently
  main: Start alternator transparently
  main: Start thrift transparently
  main: Start native transport transparently
  storage_service: Make register_protocol_server() start the server
  storage_service: Turn register_protocol_server() async method
  storage_service: Outline register_protocol_server()
  main: Schedule deferred drain_on_shutdown() prior to protocol servers
  main: Move some trailing startup earlier
2024-06-06 09:08:05 +03:00
Kefu Chai
4a36918989 topology_coordinator: handle/wait futures when stopping topology_coordinator
before this change, unlike other services in scylla,
topology_coordinator is not properly stopped when it is aborted,
because the scylla instance is no longer a leader or is being shut down.
its `run()` method just stops the grand loop and bails out before
topology_coordinator is destroyed. but we are tracking the migration
state of tablets using a bunch of futures, which might not be
handled yet, and some of them could carry failures. in that case,
when the `future` instances with failure state get destroyed,
seastar calls `report_failed_future`. and seastar considers this
practice a source of a bug -- as one just fails to handle an error.
that's why we have the following error:

```
WARN  2024-05-19 23:00:42,895 [shard 0:strm] seastar - Exceptional future ignored: seastar::rpc::unknown_verb_error (unknown verb), backtrace: /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56c14e /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56c770 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x56ca58 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x38c6ad 0x29cdd07 0x29b376b 0x29a5b65 0x108105a /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff1df /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x400367 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x3ff838 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36de58 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libseastar.so+0x36d092 0x1017cba 0x1055080 0x1016ba7 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27b89 /home/bhalevy/.ccm/scylla-repository/local_tarball/libreloc/libc.so.6+0x27c4a 0x1015524
```
and the backtrace looks like:
```
seastar::current_backtrace_tasklocal() at ??:?
seastar::current_tasktrace() at ??:?
seastar::current_backtrace() at ??:?
seastar::report_failed_future(seastar::future_state_base::any&&) at ??:?
service::topology_coordinator::tablet_migration_state::~tablet_migration_state() at topology_coordinator.cc:?
service::topology_coordinator::~topology_coordinator() at topology_coordinator.cc:?
service::run_topology_coordinator(seastar::sharded<db::system_distributed_keyspace>&, gms::gossiper&, netw::messaging_service&, locator::shared_token_metadata&, db::system_keyspace&, replica::database&, service::raft_group0&, service::topology_state_machine&, seastar::abort_source&, raft::server&, seastar::noncopyable_function<seastar::future<service::raft_topology_cmd_result> (utils::tagged_tagged_integer<raft::internal::non_final, raft::term_tag, unsigned long>, unsigned long, service::raft_topology_cmd const&)>, service::tablet_allocator&, std::chrono::duration<long, std::ratio<1l, 1000l> >, service::endpoint_lifecycle_notifier&) [clone .resume] at topology_coordinator.cc:?
seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at main.cc:?
seastar::reactor::run_some_tasks() at ??:?
seastar::reactor::do_run() at ??:?
seastar::reactor::run() at ??:?
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ??:?
```

and even worse, these futures are indirectly owned by `topology_coordinator`.
so there are chances that they could be used even after `topology_coordinator`
is destroyed. this is a use-after-free issue. because the
`run_topology_coordinator` fiber exits when the scylla instance retires
from the leader role, this use-after-free could be fatal to a
running instance due to undefined behavior.

so, in this change, we handle the futures in `_tablets`, and note
down the failures carried by them if any.
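
A simplified sketch of the cleanup described above (illustrative, not the actual topology_coordinator code): before the owning object goes away, wait on every tracked future and log failures instead of letting them be destroyed unobserved:

```
// Simplified sketch: drain tracked futures and note any failures they carry.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/util/log.hh>
#include <exception>
#include <utility>
#include <vector>

static seastar::logger tlog("sketch");

seastar::future<> drain_tracked_futures(std::vector<seastar::future<>> futs) {
    for (auto& f : futs) {
        try {
            co_await std::move(f);
        } catch (...) {
            // Noting the failure avoids seastar's report_failed_future warning.
            tlog.warn("background tablet migration fiber failed: {}", std::current_exception());
        }
    }
}
```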

Fixes #18745
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18991
2024-06-06 07:55:03 +03:00
Tomasz Grabiec
c45ce41330 main, storage_service: Move topology_state_machine outside storage_service
It will be propagated to repair_service to avoid a cyclic dependency:

storage_service <-> repair_service
2024-06-05 16:11:22 +02:00
Tomasz Grabiec
476c076a21 storage_service, topology: Extract topology_state_machine::await_quiesced()
Will be used later in a place which doesn't have access to storage_service
but does have access to topology_state_machine.

It's not necessary to start a group0 operation around polling because
the busy() state can be checked atomically, and if it's false it means
the topology is no longer busy.
2024-06-05 16:11:22 +02:00
Tomasz Grabiec
1513d6f0b0 tablet_scheduler: Make disabling of balancing interrupt shuffle mode
Tests will rely on that: they will run in shuffle mode and disable
balancing around sections which would otherwise be blocked indefinitely
by ongoing shuffling (like repair).
2024-06-05 16:11:22 +02:00
Tomasz Grabiec
6c64cf33df tablet_scheduler: Log whether balancing is considered as enabled 2024-06-05 16:11:22 +02:00
Kamil Braun
18f5d6fd89 Merge 'Fail bootstrap if ip mapping is missing during double write stage' from Gleb Natapov
If a node restarts just before it stores the bootstrapping node's IP, it will
not have an ID-to-IP mapping for the bootstrapping node, which may cause a
failure on the write path. Detect this and fail the bootstrap if it happens.

Closes scylladb/scylladb#18927

* github.com:scylladb/scylladb:
  raft topology: fix indentation after previous commit
  raft topology: do not add bootstrapping node without IP as pending
  test: add test of bootstrap where the coordinator crashes just before storing IP mapping
  schema_tables: remove unused code
2024-06-05 11:15:15 +02:00
Marcin Maliszkiewicz
ac0e164a6b raft: rename announce to commit
The old wording was derived from existing code which
originated from the schema code. The name "commit" better
describes what we do here.
2024-06-04 15:43:04 +02:00
Marcin Maliszkiewicz
370a5b547e cql3: raft: attach description to each mutations collector group
This description is readable from the raft log table.
Previously a single description was provided for the whole
announce call, but since the call can contain mutations from
various subsystems, the description has been moved to the
add_mutation(s)/add_generator function calls.
2024-06-04 15:43:04 +02:00
Marcin Maliszkiewicz
a88b7fc281 cql3: auth: use mutation collector for service levels statements
This is done to achieve single transaction semantics.
2024-06-04 15:43:04 +02:00
Marcin Maliszkiewicz
0573fee2a9 cql3: auth: use mutation collector for grant and revoke permissions
This is done to achieve single transaction semantics.

The change includes the auto-grant feature. In particular,
for schema-related auto-grant we don't use the normal
mutation collector announce path but follow the migration manager;
this may be unified in the future.
2024-06-04 15:43:04 +02:00
Marcin Maliszkiewicz
6f654675c6 auth: add a way to announce mutations having only client_state ref
The statements code only has access to client_state, from
which it takes auth::service. It doesn't have an abort_source
or a group0_client, so we need to add them to auth::service.

Additionally, since abort_source can't be const, the whole
announce_mutations method needs a non-const auth::service,
so we need to remove const from the getter function.
2024-06-04 15:43:04 +02:00
Marcin Maliszkiewicz
7e0a801f53 auth: add class for gathering mutations without immediate announce
To achieve write atomicity across different tables we need to announce
mutations in a single transaction. So instead of each function doing
a separate announce we need to collect mutations and announce them once
at the end.
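
A bare-bones sketch of the collect-then-announce-once pattern (illustrative names; the real code goes through the group0 machinery rather than a plain vector):

```
// Stand-in types and names; the real code uses the group0 machinery.
#include <utility>
#include <vector>

struct mutation {};   // placeholder for the real mutation type

class mutation_collector {
    std::vector<mutation> _muts;
public:
    // Each helper adds what it would previously have announced on its own.
    void add_mutation(mutation m) {
        _muts.push_back(std::move(m));
    }

    // Called once at the end of statement execution: everything collected is
    // applied in a single transaction, or nothing is.
    std::vector<mutation> take() && {
        return std::move(_muts);
    }
};
```
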
2024-06-04 15:43:04 +02:00
Piotr Dulikowski
01ff8108c1 Merge 'db/hints: Use host ID to IP mappings to choose the ep manager to drain when node is leaving' from Dawid Mędrek
In d0f5873, we introduced mappings between hint directories (represented by IP addresses) and the hint endpoint managers (identified by host IDs) managing them. As a consequence, it may happen that one hint directory stores hints towards multiple nodes at the same time. If any of those nodes leaves the cluster, we should drain the hint directory. However, before these changes that doesn't happen – we only drain it when the node of the same host ID as the hint endpoint manager leaves the cluster.

This PR fixes that draining issue in the pre-host-ID-based hinted handoff. Now no matter which of the nodes corresponding to a hint directory leaves the cluster, the directory will be drained.

We also introduce error injections to be able to test that it indeed happens.

Fixes scylladb/scylladb#18761

Closes scylladb/scylladb#18764

* github.com:scylladb/scylladb:
  db/hints: Introduce an error injection to test draining
  db/hints: Ensure that draining happens
2024-06-04 10:17:14 +02:00
Avi Kivity
f133ae945a Merge 'repair: Introduce new primary replica selection algorithm for tablets' from Benny Halevy
Tablet allocation does not guarantee fairness of
the first replica in the replicas set across dcs.
The lack of this fix caused the following dtest to fail:
repair_additional_test.py::TestRepairAdditional::test_repair_option_pr_multi_dc

Use the tablet_map's get_primary_replica or get_primary_replica_within_dc,
respectively, to see whether this node is the primary replica for each tablet
or not.

Fixes https://github.com/scylladb/scylladb/issues/17752

No backport is required before 6.0 as tablets (and tablet repair) are introduced in 6.0

Closes scylladb/scylladb#18784

* github.com:scylladb/scylladb:
  repair: repair_tablets: use get_primary_replica
  repair: repair_tablets: no need to check ranges_specified per tablet
  locator: tablet_map: add get_primary_replica_within_dc
  locator: tablet_map: get_primary_replica: do not copy tablet info
  locator: tablet_map: get_primary_replica: return tablet_replica
2024-06-03 13:16:49 +03:00
Pavel Emelyanov
9292d326b7 storage_service: Make register_protocol_server() start the server
After a protocol server is registered, it can be instantly started by
the main code. It makes sense to generalize this sequence by teaching
register_protocol_server() to start it.

For now it's a no-op change, as "start_instantly" is false by default,
but next patches will make use of it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-06-03 12:12:03 +03:00
Pavel Emelyanov
2aab9f6340 storage_service: Turn register_protocol_server() async method
To make the next patch shorter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-06-03 12:12:03 +03:00
Pavel Emelyanov
eb033e3c5f storage_service: Outline register_protocol_server()
To make the next patch shorter

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-06-03 12:12:03 +03:00
Benny Halevy
c52f70f92c locator: tablet_map: get_primary_replica: return tablet_replica
This is required by repair, which will start using get_primary_replica
in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-06-02 20:26:09 +03:00
Tomasz Grabiec
603abddca9 tablets: load balancer: Use random selection of candidates when moving tablets
In order to avoid per-table tablet load imbalance from forming
in the cluster after adding nodes, the load balancer now picks the
candidate tablet at random. This should keep the per-table
distribution on the target node similar to the distribution on the
source nodes.

Currently, candidate selection picks the first tablet in the
unordered_set, so the distribution depends on hashing in the unordered
set. Due to the way hash is calculated, table id dominates the hash
and a single table can be chosen more often for migration away. This
can result in imbalance of tablets for any given table after
bootstrapping a new node.

For example, consider the following results of a simulation which
starts with a 6-node cluster and does a sequence of node bootstraps
and decommissions.  One table has 4096 tablets and RF=1, and the other
has 256 tablets and RF=2.  Before the patch, the smaller table has
node overcommit of 2.34 in the worst topology state, while after the
patch it has overcommit of 1.65. Overcommit is calculated as max load
(tablet count per node) divided by perfect average load (all tablets / nodes):

  Run #861, params: {iterations=6, nodes=6, tablets1=4096 (10.7/sh), tablets2=256 (1.3/sh), rf1=1, rf2=2, shards=64}
  Overcommit       : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit       : worst: {table1={shard=1.23, node=1.10}, table2={shard=9.85, node=1.65}}
  Overcommit (old) : init : {table1={shard=1.03, node=1.00}, table2={shard=1.51, node=1.01}}
  Overcommit (old) : worst: {table1={shard=1.31, node=1.12}, table2={shard=64.00, node=2.34}}

The worst state before the patch had the following distribution of tablets for the smaller table:

  Load on host ba7f866d...: total=171, min=1, max=7, spread=6, avg=2.67, overcommit=2.62
  Load on host 4049ae8d...: total=102, min=0, max=6, spread=6, avg=1.59, overcommit=3.76
  Load on host 3b499995...: total=89, min=0, max=4, spread=4, avg=1.39, overcommit=2.88
  Load on host ad33bede...: total=63, min=0, max=3, spread=3, avg=0.98, overcommit=3.05
  Load on host 0c2e65dc...: total=57, min=0, max=3, spread=3, avg=0.89, overcommit=3.37
  Load on host 3f2d32d4...: total=27, min=0, max=2, spread=2, avg=0.42, overcommit=4.74
  Load on host 9de9f71b...: total=3, min=0, max=1, spread=1, avg=0.05, overcommit=21.33

One node has as many as 171 tablets of that table while another has as few as 3.

After the patch, the worst distribution looks like this:

  Load on host 94a02049...: total=121, min=1, max=6, spread=5, avg=1.89, overcommit=3.17
  Load on host 65ac6145...: total=87, min=0, max=5, spread=5, avg=1.36, overcommit=3.68
  Load on host 856a66d1...: total=80, min=0, max=5, spread=5, avg=1.25, overcommit=4.00
  Load on host e3ac4a41...: total=77, min=0, max=4, spread=4, avg=1.20, overcommit=3.32
  Load on host 81af623f...: total=66, min=0, max=4, spread=4, avg=1.03, overcommit=3.88
  Load on host 4a038569...: total=47, min=0, max=2, spread=2, avg=0.73, overcommit=2.72
  Load on host c6ab3fe9...: total=34, min=0, max=3, spread=3, avg=0.53, overcommit=5.65

The most-loaded node has 121 tablets and the least-loaded node has 34 tablets.
It's still not ideal, and a better distribution is possible, but it's an improvement.
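
The selection change can be sketched as follows (hypothetical structures, not the actual load balancer code): instead of taking the first element of the candidate unordered_set, pick one uniformly at random so the hash order, dominated by table id, no longer decides which table migrates:

```
// Sketch (hypothetical types): pick a migration candidate uniformly at random
// from a non-empty unordered_set instead of taking its first element.
#include <cstddef>
#include <iterator>
#include <random>
#include <unordered_set>

using tablet_id = unsigned long;

tablet_id pick_candidate(const std::unordered_set<tablet_id>& candidates,
                         std::mt19937& rng) {
    // Precondition: !candidates.empty()
    std::uniform_int_distribution<std::size_t> dist(0, candidates.size() - 1);
    auto it = candidates.begin();
    std::advance(it, dist(rng));   // linear walk; fine for a sketch
    return *it;
}
```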

Refs #16824
2024-06-02 14:23:00 +02:00
Pavel Emelyanov
83d491af02 config: Remove experimental TABLETS feature
... and replace it with a boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the former
feature.

The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18898
2024-05-30 18:03:51 +03:00
Dawid Medrek
745a9c6ab8 db/hints: Ensure that draining happens
Before hinted handoff is migrated to using host IDs
to identify nodes in the cluster, we keep track
of mappings between hint endpoint managers
identified by host IDs and the hint directories
managed by them and represented by IP addresses.
As a consequence, it may happen that one hint
directory corresponds to multiple nodes
-- it's intended. See 64ba620 for more details.

Before these changes, we only started the draining
process of a hint directory if the node leaving
the cluster corresponded to that hint directory
AND was identified by the same host ID as
the hint endpoint manager managing that directory.
As a result, the draining did not always happen
when it was supposed to.

Draining should start no matter which of the nodes
corresponding to a hint directory is leaving
the cluster. This commit ensures that it happens.
2024-05-29 19:32:38 +02:00
Kefu Chai
a415bb07ab sl_controller: fix a typo in comment
s/necessairy/necessary/

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18950
2024-05-29 16:23:31 +03:00
Pavel Emelyanov
e74a4b038f Merge 'tablets: alter keyspace' from Piotr Smaron
This change supports changing replication factor in tablets-enabled keyspaces.
This covers both increasing and decreasing the number of tablet replicas through
first building topology mutations (`alter_keyspace_statement.cc`) and then
tablets/topology/schema mutations (`topology_coordinator.cc`).
For the limitations of the current solution, please see the docs changes attached to this PR.

Fixes: #16129

Closes scylladb/scylladb#16723

* github.com:scylladb/scylladb:
  test: Do not check tablets mutations on nodes that don't have them
  test: Fix the way tablets RF-change test parses mutation_fragments
  test/tablets: Unmark RF-changing test with xfail
  docs: document ALTER KEYSPACE with tablets
  Return response only when tablets are reallocated
  cql-pytest: Verify RF changes by at most 1 when tablets on
  cql3/alter_keyspace_statement: Do not allow for change of RF by more than 1
  Reject ALTER with 'replication_factor' tag
  Implement ALTER tablets KEYSPACE statement support
  Parameterize migration_manager::announce by type to allow executing different raft commands
  Introduce TABLET_KEYSPACE event to differentiate processing path of a vnode vs tablets ks
  Extend system.topology with 3 new columns to store data required to process alter ks global topo req
  Allow query_processor to check if global topo queue is empty
  Introduce new global topo `keyspace_rf_change` req
  New raft cmd for both schema & topo changes
  Add storage service to query processor
  tablets: tests for adding/removing replicas
  tablet_allocator: make load_balancer_stats_manager configurable by name
2024-05-29 14:17:51 +03:00
Gleb Natapov
f91db0c1e4 raft topology: fix indentation after previous commit 2024-05-29 12:11:28 +03:00
Gleb Natapov
6853b02c00 raft topology: do not add bootstrapping node without IP as pending
If there is no mapping from host id to ip while a node is in the bootstrap
state, there is no point adding it to pending endpoints, since the write
handler will not be able to map it back to a host id anyway. If the
transition state requires double writes, though, we still want to fail.
In case the state is write_both_read_old we fail the barrier, which will
cause the topology operation to roll back; in case of write_both_read_new
we assert, but this should not happen since the mapping is persisted by
this point (or we failed in the write_both_read_old state).

Fixes: scylladb/scylladb#18676
2024-05-29 12:11:18 +03:00
Gleb Natapov
27445f5291 test: add test of bootstrap where the coordinator crashes just before storing IP mapping
On the next boot there is no host ID to IP mapping, which causes the node to
crash again with the "No mapping for :: in the passed effective replication map"
assertion.
2024-05-29 11:46:23 +03:00
Kefu Chai
719d53a565 service/storage_proxy: coroutinize handle_paxos_accept()
for better readability.
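
Not the actual handle_paxos_accept() code, just the general shape of such a change: a continuation chain rewritten as a coroutine, with both helper steps as placeholders:

```
// Shape of the change only; the two steps are placeholders, not paxos code.
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

seastar::future<bool> do_accept() {
    return seastar::make_ready_future<bool>(true);
}
seastar::future<> persist_decision(bool) {
    return seastar::make_ready_future<>();
}

// Before: a chain of continuations.
seastar::future<> handle_accept_chained() {
    return do_accept().then([] (bool accepted) {
        return persist_decision(accepted);
    });
}

// After: the same logic as a coroutine, read top to bottom.
seastar::future<> handle_accept_coro() {
    bool accepted = co_await do_accept();
    co_await persist_decision(accepted);
}
```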

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18765
2024-05-28 20:51:10 +03:00
Wojciech Mitros
519317dc58 mv: handle different ERMs for base and view table
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.

Until now, we assumed that the ERMs are synchronized when
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.

This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.

Fixes: #17786
Fixes: #18709

Closes scylladb/scylladb#18816
2024-05-28 16:01:39 +02:00
Piotr Smaron
39181c4bf2 Return response only when tablets are reallocated
Up until now we waited until the mutations were in place and then returned
directly to the caller of the ALTER statement, but that doesn't imply
that tablets were deleted/created, so we must wait until the whole
processing is done and only then return.
2024-05-28 13:56:46 +02:00
Piotr Smaron
fbd75c5c06 Implement ALTER tablets KEYSPACE statement support
This commit adds support for executing ALTER KS for keyspaces with
tablets and utilizes all the previous commits.
The ALTER KS statement is handled in alter_keyspace_statement, where a global
topology request is generated with data attached to the system.topology
table. Then, once the topology state machine is ready, it starts to handle
this global topology event, which results in producing the mutations
required to change the schema of the keyspace, delete the
system.topology global req, produce tablets mutations, and additional
mutations for a table tracking the lifetime of the whole req. Tracking
the lifetime is necessary to not return control to the user too
early, so the query processor only returns the response once the
mutations have been sent.
2024-05-28 13:56:42 +02:00
Piotr Smaron
7081215552 Parameterize migration_manager::announce by type to allow executing different raft commands
Since ALTER KS requires creating a topology_change raft command, some
functions need to be extended to handle it. Raft commands are recognized
by type, so some functions are parameterized by type,
i.e. made into templates.
These templates are explicitly instantiated, so that only one instance of
each template exists across the whole code base, to avoid compiling them
in each translation unit.
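
For illustration only, this is the general C++ pattern being described (explicit instantiation of a type-parameterized function in a single translation unit); the names are stand-ins, not the real migration_manager API:

```
// The general pattern only; names are stand-ins for the real raft command types.

// header (sketch): declaration visible to all translation units
template <typename Command>
void announce(const Command& cmd);

// one .cc file (sketch): definition plus explicit instantiations
struct schema_change {};
struct topology_change {};

template <typename Command>
void announce(const Command& cmd) {
    (void)cmd;  // build and submit the raft command of the given type
}

// Exactly one instance of each specialization exists in the whole code base.
template void announce<schema_change>(const schema_change&);
template void announce<topology_change>(const topology_change&);
```
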
2024-05-28 13:55:11 +02:00
Piotr Smaron
59d3fd615f Extend system.topology with 3 new columns to store data required to process alter ks global topo req
Because ALTER KS will result in creating a global topo req, we'll have
to pass the req data to the topology coordinator's state machine, and the
easiest way to do it is through the system.topology table, which is going to
be extended with 3 extra columns carrying all the data required to
execute ALTER KS from within the topology coordinator.
2024-05-28 13:55:11 +02:00
Piotr Smaron
6fd0a49b63 Allow query_processor to check if global topo queue is empty
With the current implementation only one global topo req can be executed at a
time, so when ALTER KS is executed, we have to check whether any other
global topo req is ongoing and fail the request if that's the case.
2024-05-28 13:55:11 +02:00
Piotr Smaron
c174eee386 Introduce new global topo keyspace_rf_change req
It will be used when processing ALTER KS statement, but also to
create a separate processing path for a KS with tablets (as opposed to
a vnode KS).
2024-05-28 13:54:48 +02:00
Kamil Braun
247eb9020b Merge 'cdc, raft topology: fix and test cdc in the recovery mode' from Patryk Jędrzejczak
This PR ensures that CDC keeps working correctly in the recovery
mode after leaving the raft-based topology.

We update `system.cdc_local` in `topology_state_load` to ensure
a node restarting in the recovery mode sees the last CDC generation
created by the topology coordinator.

Additionally, we extend the topology recovery test to verify
that the CDC keeps working correctly during the whole recovery
process. In particular, we test that after restarting nodes in the
recovery mode, they correctly use the active CDC generation created
by the topology coordinator.

Fixes scylladb/scylladb#17409
Fixes scylladb/scylladb#17819

Closes scylladb/scylladb#18820

* github.com:scylladb/scylladb:
  test: test_topology_recovery_basic: test CDC during recovery
  test: util: start_writes_to_cdc_table: add FIXME to increase CL
  test: util: start_writes_to_cdc_table: allow restarting with new cql
  storage_service: update system.cdc_local in topology_state_load
2024-05-28 11:53:28 +02:00
Botond Dénes
2d79b0106c Merge 'storage_service: Fix race between tablet split and stats retrieval' from Raphael "Raph" Carvalho
Retrieval of tablet stats must be serialized with mutations to token metadata, as the former requires tablet id stability.
If a tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort, as the tablet map requires that any id fed into it is lower than its current tablet count.

Fixes #18085.

Closes scylladb/scylladb#18287

* github.com:scylladb/scylladb:
  test: Fix flakiness in topology_experimental_raft/test_tablets
  service: Use tablet read selector to determine which replica to account table stats
  storage_service: Fix race between tablet split and stats retrieval
2024-05-27 16:32:54 +03:00
Piotr Dulikowski
fa142a9ce7 Merge 'qos/raft_service_level_distributed_data_accessor: print correct error message when trying to modify a service level in recovery mode' from Michał Jadwiszczak
Raft service levels are read-only in recovery mode. This patch adds a check and a proper error message when a user tries to modify service levels in recovery mode.

Fixes https://github.com/scylladb/scylladb/issues/18827

Closes scylladb/scylladb#18841

* github.com:scylladb/scylladb:
  test/auth_cluster/test_raft_service_levels: try to create sl in recovery
  service/qos/raft_sl_dda: reject changes to service levels in recovery mode
  service/qos/raft_sl_dda: extract raft_sl_dda steps to common function
2024-05-27 13:26:06 +02:00