scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-24 00:32:15 +00:00

Author	SHA1	Message	Date
Aleksandra Martyniuk	e963631859	docs: describe task folding (cherry picked from commit `beef77a778`)	2024-06-05 10:09:13 +02:00
Marcin Maliszkiewicz	cbf47319c1	db: auth: move auth tables to system keyspace Separate keyspace which also behaves as system brings little benefit while creating some compatibility problems like schema digest mismatch during rollback. So we decided to move auth tables into system keyspace. Fixes https://github.com/scylladb/scylladb/issues/18098 Closes scylladb/scylladb#18769 (cherry picked from commit `2ab143fb40`) [avi: adjust test/alternator/suite.yaml to reflect new keyspace]	2024-06-02 21:41:14 +03:00
Piotr Smaron	885c7309ee	Extend system.topology with 3 new columns to store data required to process alter ks global topo req Because ALTER KS will result in creating a global topo req, we'll have to pass the req data to topology coordinator's state machine, and the easiest way to do it is through sytem.topology table, which is going to be extended with 3 extra columns carrying all the data required to execute ALTER KS from within topology coordinator.	2024-05-30 08:33:15 +03:00
Nadav Har'El	dcd26d8a16	Merge 'docs: update isolation.md' from Botond Dénes Update `docs/dev/isolation.d`: * Update the list of scheduling groups * Remove IO priority groups (they were folded into scheduling groups) * Add section on RPC isolation Closes scylladb/scylladb#18749 * github.com:scylladb/scylladb: docs: isolation.md: add section on RPC call isolation docs: isolation.md: remove mention of IO priority groups docs: isolation.md: update scheduling group list, add aliases	2024-05-21 11:46:57 +03:00
Botond Dénes	11fa79a537	docs: isolation.md: add section on RPC call isolation	2024-05-21 03:12:22 -04:00
Pavel Emelyanov	159e44d08a	test.py: Make it possible to avoid wildcard test names matching There's a nasty scenario when this searching plays bad joke. When CI picks up a new branch and notices, that a test had changed, it spawns a custom job with test.py --repeat 100 $changed_test_name in it. Next, when the test.py tries opt-in test name matching, it uses the wildcard search and can pick up extra unwanted tests into the run. To solve this, the case-selection syntax is extended. Now if the caller specifies `suite/test::` as test, the test file is selected by exact name match, but the specific test-case is not selected, the `` makes it run all cases. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18704	2024-05-20 15:50:47 +02:00
Botond Dénes	936a7e282b	docs: isolation.md: remove mention of IO priority groups They were folded into CPU scheduling groups, which now apply to both CPU and IO.	2024-05-20 03:33:24 -04:00
Botond Dénes	8f61468322	docs: isolation.md: update scheduling group list, add aliases	2024-05-20 03:30:04 -04:00
Tomasz Grabiec	eb3a22d5a8	docs: Document tablet sharding vs tablet replica placement	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	10a4903d0c	dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard Require users to specify whether we want shard for reads or for writes by switching to appropriate non-deprecated variant. For example, shard_of() can be replaced with shard_for_reads() or shard_for_writes(). The next_shard/token_for_next_shard APIs have only for-reads variant, and the act of switching will be a testimony to the fact that the code is valid for intra-node migration.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	6946ad2a45	sharding: Prepare for intra-node-migration Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes.	2024-05-16 00:28:46 +02:00
Tomasz Grabiec	b5bb46357b	docs: Document sharder use for tablets	2024-05-16 00:28:46 +02:00
Tomasz Grabiec	82b34d34d8	tablets: Introduce tablet transition kind for intra-node migration We need a separate transition kind for intra node migration so that we don't have to recover this information from replica set in an expensive way. This information is needed in the hot path - in effective_replicaiton_map, to not return the pending tablet replica to the coordinator. From its perspective, replica set is not transitional. The transition will also be used to alter the behavior of the sharder. When not in intra-node migration, the sharder should advertise the shard which is either in the previous or next replica set. During intra-node migration, that's not possible as there may be two such shards. So it will return the shard according to the current read selector.	2024-05-16 00:28:46 +02:00
Aleksandra Martyniuk	561fb1dd09	service: move to cleanup stage if allow_write_both_read_old fails If allow_write_both_read_old tablet transition stage fails, move to cleanup_target stage before reverting migration. It's a preparation for further patches which deallocate storage group of a tablet during cleanup.	2024-05-10 14:56:38 +02:00
Yaniv Michael Kaul	124064844f	docs/dev/object_stroage.md: convert example AWS keys to be more innocent Someone thought that they actually represent real keys (the 'EXAMPLE' in their name was not enough). Converted them to be as clear as can be, example data. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#18565	2024-05-09 08:26:43 +03:00
Botond Dénes	96a7ed7efb	Merge 'sstables: add dead row count when issuing warning to system.large_partitions' from Ferenc Szili This is the second half of the fix for issue #13968. The first half is already merged with PR #18346 Scylla issues warnings for partitions containing more rows than a configured threshold. The warning is issued by inserting a row into the `system.large_partitions` table. This row contains the information about the partition for which the warning is issued: keyspace, table, sstable, partition key and size, compaction time and the number of rows in the partition. A previous PR #18346 also added range tombstone count to this row. This change adds a new counter for dead rows to the large_partitions table. This change also adds cluster feature protection for writing into these new counters. This is needed in case a cluster is in the process of being upgraded to this new version, after which an upgraded node writes data with the new schema into `system.large_partitions`, and finally a node is then rolled back to an old version. This node will then revert the schema to the old version, but the written sstables will still contain data with the new counters, causing any readers of this table to throw errors when they encounter these cells. This is an enhancement, and backporting is not needed. Fixes #13968 Closes scylladb/scylladb#18458 * github.com:scylladb/scylladb: sstable: added test for counting dead rows sstable: added docs for system.large_partitions.dead_rows sstable: added cluster feature for dead rows and range tombstones sstable: write dead_rows count to system.large_partitions sstable: added counter for dead rows	2024-05-09 08:26:43 +03:00
Ferenc Szili	8e9771d010	sstable: added docs for system.large_partitions.dead_rows	2024-05-07 15:44:33 +02:00
Piotr Dulikowski	64ba620dc2	Merge 'hinted handoff: Use host IDs instead of IPs in the module' from Dawid Mędrek This pull request introduces host ID in the Hinted Handoff module. Nodes are now identified by their host IDs instead of their IPs. The conversion occurs on the boundary between the module and `storage_proxy.hh`, but aside from that, IPs have been erased. The changes take into considerations that there might still be old hints, still identified by IPs, on disk – at start-up, we map them to host IDs if it's possible so that they're not lost. Refs scylladb/scylladb#6403 Fixes scylladb/scylladb#12278 Closes scylladb/scylladb#15567 * github.com:scylladb/scylladb: docs: Update Hinted Handoff documentation db/hints: Add endpoint_downtime_not_bigger_than() db/hints: Migrate hinted handoff when cluster feature is enabled db/hints: Handle arbitrary directories in resource manager db/hints: Start using hint_directory_manager db/hints: Enforce providing IP in get_ep_manager() db/hints: Introduce hint_directory_manager db/hints/resource_manager: Update function description db/hints: Coroutinize space_watchdog::scan_one_ep_dir() db/hints: Expose update lock of space watchdog db/hints: Add function for migrating hint directories to host ID db/hints: Take both IP and host ID when storing hints db/hints: Prepare initializing endpoint managers for migrating from IP to host ID db/hints: Migrate to locator::host_id db/hints: Remove noexcept in do_send_one_mutation() service: Add locator::host_id to on_leave_cluster service: Fix indentation db/hints: Fix indentation	2024-05-06 09:58:18 +02:00
Raphael S. Carvalho	62b1cfa89c	topology_coordinator: Fix synchronization of tablet split with other concurrent ops Finalization of tablet split was only synchronizing with migrations, but that's not enough as we want to make sure that all processes like repair completes first as they might hold erm and therefore will be working with a "stale" version of token metadata. For synchronization to work properly, handling of tablet split finalize will now take over the state machine, when possible, and execute a global token metadata barrier to guarantee that update in topology by split won't cause problems. Repair for example could be writing a sstable with stale metadata, and therefore, could generate a sstable that spans multiple tablets. We don't want that to happen, therefore we need the barrier. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18380	2024-04-30 19:23:28 +02:00
Dawid Medrek	bf802e99eb	docs: Update Hinted Handoff documentation We briefly explain the process of migration of Hinted Handoff to host IDs, the rationale for it, consequences, and possible side effects.	2024-04-28 01:22:59 +02:00
Avi Kivity	c2b8ca7d71	Merge 'cql3: statements: change default tombstone_gc mode for tablets' from Aleksandra Martyniuk Repair may miss some tablets that migrated across nodes. So if tombstones expire after some timeout, then we can have data resurrection. Set default tombstone_gc mode to "repair" for tables which use tablets (if repair is required). Fixes: #16627. Closes scylladb/scylladb#18013 * github.com:scylladb/scylladb: test: check default value of tombstone_gc test: topology: move some functions to util.py cql3: statements: change default tombstone_gc mode for tablets	2024-04-25 19:18:37 +03:00
Aleksandra Martyniuk	58f72f9019	cql3: statements: change default tombstone_gc mode for tablets Currently, if tombstone_gc mode isn't specified for a table, then "timeout" is used by default. With tablets, running "nodetool repair -pr" may miss a tablet if it migrated across the nodes. Then, if we expire tombstones for ranges that weren't repaired, we may get data resurrection. Set default tombstone_gc mode value for DDLs that don't specify it. It's set to "repair" for tables which use tablets unless they use local replication strategy or rf = 1. Otherwise it's set to "timeout".	2024-04-24 10:42:10 +02:00
Ferenc Szili	c528597a84	sstables: add docs changes for system.large_partitions This commit updates the documentation changes for the new column range_tombstones in system.large_partitions	2024-04-22 15:25:41 +02:00
Anna Stuchlik	a3481a4566	doc: document the system_auth_v2 feature This commit includes updates related to replacing system_auth with system_auth_v2. - The keyspace name system_auth is renamed to system_auth_v2. - The procedures are updated to account for system_auth_v2. - No longer required system_auth RF changes are removed from procedures. - The information is added that if the consistent topology updates feature was not enabled upon upgrade from 5.4, there are limitations or additional steps to do (depending on the procedure). The files with that kind of information are to be found in _common folders and included as needed. - The upgrade guide has been updated to reflect system_auth_v2 and related impacts. Closes scylladb/scylladb#18077	2024-04-18 18:33:49 +02:00
Botond Dénes	2cb5dcabf7	docs/dev/maintainer.md: document another exceptions to rule no.0 Maintainers are also allowed to commit their own backport PR. They are allowed to backport their own code, opening a PR to get a CI run for a backport doesn't change this. Closes scylladb/scylladb#17727	2024-04-03 09:51:19 +03:00
Kefu Chai	d1e8d89ae2	doc: topology-over-raft: add transition_state to node state diagram in order to help the developers to understand the transitions of `node_state` and the `transition_state` on each of the `node_state`, in this change, the nested state machine diagram is added to the node state diagram. please note, instead of trying to merge similar states like bootstrapping and replacing into a single state, we keep them as separate ones, and replicate the nested state machine diagram in them as well, to be more clear. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18025	2024-03-27 12:16:35 +01:00
Avi Kivity	72bbe75d5b	Merge 'Fix node replace with tablets for RF=N' from Tomasz Grabiec This PR fixes a problem with replacing a node with tablets when RF=N. Currently, this will fail because tablet replica allocation for rebuild will not be able to find a viable destination, as the replacing node is not considered to be a candidate. It cannot be a candidate because replace rolls back on failure and we cannot roll back after tablets were migrated. The solution taken here is to not drain tablet replicas from replaced node during topology request but leave it to happen later after the replaced node is in left state and replacing node is in normal state. The replacing node waits for this draining to be complete on boot before the node is considered booted. Fixes https://github.com/scylladb/scylladb/issues/17025 Nodes in the left state will be kept in tablet replica sets for a while after node replace is done, until the new replica is rebuilt. So we need to know about those node's location (dc, rack) for two reasons: 1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first. 2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement. It's ok to not know the IP, and we don't keep it. Those nodes will not be present in the IP-based replica sets, e.g. those returned by get_natural_endpoints(), only in host_id-based replica sets. storage_proxy request coordination is not affected. Nodes in the left state are still not present in token ring, and not considered to be members of the ring (datacanter endpoints excludes them). In the future we could make the change even more transparent by only loading locator::node* for those nodes and keeping node* in tablet replica sets. Currently left nodes are never removed from topology, so will accumulate in memory. We could garbage-collect them from topology coordinator if a left node is absent in any replica set. That means we need a new state - left_for_real. Closes scylladb/scylladb#17388 * github.com:scylladb/scylladb: test: py: Add test for view replica pairing after replace raft, api: Add RESTful API to query current leader of a raft group test: test_tablets_removenode: Verify replacing when there is no spare node doc: topology-on-raft: Document replace behavior with tablets tablets, raft topology: Rebuild tablets after replacing node is normal tablets: load_balancer: Access node attributes via node struct tablets: load_balancer: Extract ensure_node() mv: Switch to using host_id-based replica set effective_replication_map: Introduce host_id-based get_replicas() raft topology: Keep nodes in the left state to topology tablets: Introduce read_required_hosts()	2024-03-18 16:16:08 +02:00
Tomasz Grabiec	1d01b4ca20	doc: topology-on-raft: Document replace behavior with tablets	2024-03-15 13:20:08 +01:00
Pavel Emelyanov	6a77f36519	doc: Add tablets migration state diagram Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#17790	2024-03-14 20:29:21 +01:00
Kefu Chai	15bea069a9	docs: use less slangy language this is a follow-up change of `1519904fb9`, to incorporate the comment from Anna Stuchlik. Signed-off-by: Anna Stuchlik <anna.stuchlik@scylladb.com> Closes scylladb/scylladb#17671	2024-03-13 13:33:37 +02:00
Pavel Emelyanov	3453a934ba	docs: Add info about the ability to run specific test case The test.py usage is documented, the ability to run a specific test by its name is described in doc. Extend it with the new ability to run specific test case as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-03-11 09:10:20 +03:00
Avi Kivity	6383aa1e3c	docs: maintainer.md: add exceptions to the don't-commit-your-own-code rules Submodule and toolchain updates aren't original code and so are exempt from the don't-commit-own-code rule. Closes scylladb/scylladb#17534	2024-03-06 10:19:46 +02:00
Kefu Chai	1519904fb9	docs: quote CQL keywords this "misspelling" was identified by codespell. actually, it's not quite a misspelling, as "UPDATE" and "INSERT" are keywords in CQL. so we intended to emaphasis them, so to make codespell more useful, and to preserve the intention, let's quote the keywords with backticks. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17391	2024-03-06 09:57:07 +02:00
Patryk Jędrzejczak	e145e758eb	raft topology: store committed CDC generations' IDs in the topology When we create a CDC generation and ring-delay is non-zero, the timestamp of the new generation is in the future. Hence, we can have multiple generations that can be written to. However, if we add a new node to the cluster with the Raft-based topology, it receives only the last committed generation. So, this node will be rejecting writes considered correct by the other nodes until the last committed generation starts operating. In scylladb/scylladb#17134, we have allowed sending writes to the previous CDC generations. So, the situation became even more complicated. We need to adjust the Raft-based topology to ensure all required generations are loaded into memory and their data isn't cleared too early. This patch is the first step of the adjustment. We replace `current_cdc_generation_{uuid, timestamp}` with the set containing IDs of all committed generations - `committed_cdc_generations`. This set is sorted by timestamps, just like `unpublished_cdc_generations`. This patch is mostly refactoring. The last generation in `committed_cdc_generations` is the equivalent of the previous `current_cdc_generation_{uuid, timestamp}`. The other generations are irrelevant for now. They will be used in the following patches. After introducing `committed_cdc_generations`, a newly committed generation is also unpublished (it was current and unpublished before the patch). We introduce `add_new_committed_cdc_generation`, which updates both sets of generations so that we don't have to call `add_committed_cdc_generation` and `add_unpublished_cdc_generation` together. It's easy to forget that both of them are necessary. Before this patch, there was no call to `add_unpublished_cdc_generation` in `topology_coordinator::build_coordinator_state`. It was a bug reported in scylladb/scylladb#17288. This patch fixes it. This patch also removes "the current generation" notion from the Raft-based topology. For the Raft-based topology, the current generation was the last committed generation. However, for the `cdc::metadata`, it was the generation operating now. These two generations could be different, which was confusing. For the `cdc::metadata`, the current generation is relevant as it is handled differently, but for the Raft-based topology, it isn't. Therefore, we change only the Raft-based topology. The generation called "current" is called "the last committed" from now.	2024-02-20 12:35:16 +01:00
Avi Kivity	9bb4482ad0	Merge 'cdc: metadata: allow sending writes to the previous generations' from Patryk Jędrzejczak Before this PR, writes to the previous CDC generations would always be rejected. After this PR, they will be accepted if the write's timestamp is greater than `now - generation_leeway`. This change was proposed around 3 years ago. The motivation was to improve user experience. If a client generates timestamps by itself and its clock is desynchronized with the clock of the node the client is connected to, there could be a period during generation switching when writes fail. We didn't consider this problem critical because the client could simply retry a failed write with a higher timestamp. Eventually, it would succeed. This approach is safe because these failed writes cannot have any side effects. However, it can be inconvenient. Writing to previous generations was proposed to improve it. The idea was rejected 3 years ago. Recently, it turned out that there is a case when the client cannot retry a write with the increased timestamp. It happens when a table uses CDC and LWT, which makes timestamps permanent. Once Paxos commits an entry with a given timestamp, Scylla will keep trying to apply that entry until it succeeds, with the same timestamp. Applying the entry involves writing to the CDC log table. If it fails, we get stuck. It's a major bug with an unknown perfect solution. Allowing writes to previous generations for `generation_leeway` is a probabilistic fix that should solve the problem in practice. Apart from this change, this PR adds tests for it and updates the documentation. This PR is sufficient to enable writes to the previous generations only in the gossiper-based topology. The Raft-based topology needs some adjustments in loading and cleaning CDC generations. These changes won't interfere with the changes introduced in this PR, so they are left for a follow-up. Fixes scylladb/scylladb#7251 Fixes scylladb/scylladb#15260 Closes scylladb/scylladb#17134 * github.com:scylladb/scylladb: docs: using-scylla: cdc: remove info about failing writes to old generations docs: dev: cdc: document writing to previous CDC generations test: add test_writes_to_previous_cdc_generations cdc: generation: allow increasing generation_leeway through error injection cdc: metadata: allow sending writes to the previous generations	2024-02-18 19:21:53 +02:00
Yaniv Kaul	d2ef100b60	Typos: more/less then -> more/less than Fix repated typos in comments: more then -> more than, less then -> less than Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#17303	2024-02-13 17:16:15 +02:00
Patryk Jędrzejczak	9b923f8b81	docs: dev: cdc: document writing to previous CDC generations We update the dev documentation after allowing writes to the previous CDC generations in one of the previous patches.	2024-02-12 10:14:00 +01:00
Kamil Braun	e9e24f47ec	Merge 'raft topology: implement upgrade and recovery procedure' from Piotr Dulikowski This PR implements a procedure that upgrades existing clusters to use raft-based topology operations. The procedure does not start automatically, it must be triggered manually by the administrator after making sure that no topology operations are currently running. Upgrade is triggered by sending `POST /storage_service/raft_topology/upgrade` request. This causes the topology coordinator to start who drives the rest of the process: it builds the `system.topology` state based on information observed in gossip and tells all nodes to switch to raft mode. Then, topology coordinator runs normally. Upgrade progress is tracked in a new static column `upgrade_state` in `system.topology`. The procedure also serves as an extension to the current recovery procedure on raft. The current recovery procedure requires restarting nodes in a special mode which disables raft, perform `nodetool removenode` on the dead nodes, clean up some state on the nodes and restart them so that they automatically rebuild the group 0. Raft topology fits into existing procedure by falling back to legacy topology operations after disabling raft. After rebuilding the group 0, upgrade needs to be triggered again. Because upgrade is manual and it might not be convenient for administrators to run it right after upgrading the cluster, we allow the cluster to operate in legacy topology operations mode until upgrade, which includes allowing new nodes to join. In order to allow it, nodes now ask the cluster about the mode they should use to join before proceeding by using a new `JOIN_NODE_QUERY` RPC. The procedure is explained in more detail in `topology-over-raft.md`. Fixes: https://github.com/scylladb/scylladb/issues/15008 Closes scylladb/scylladb#17077 * github.com:scylladb/scylladb: test/topology_custom: upgrade/recovery tests for topology on raft cdc/generation_service: in legacy mode, fall back to raft tables system_keyspace: add read_cdc_generation_opt cdc/generation_service: turn off gossip notifications in raft topo mode cql_test_env: move raft_topology_change_enabled var earlier group0_state_machine: pull snapshot after raft topology feature enabled storage_service: disable persistent feature enabler on upgrade storage_service: replicate raft features to system.peers storage_service: gossip tokens and cdc generation in raft topology mode API: add api for triggering and monitoring topology-on-raft upgrade storage_service: infer which topology operations to use on startup storage_service: set the topology kind value based on group 0 state raft_group0: expose link to the upgrade doc in the header feature_service: fall back to checking legacy features on startup storage_service: add fiber for tracking the topology upgrade progress gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES topology_coordinator: implement core upgrade logic topology_coordinator: extract top-level error handling logic storage_service: initialize discovery leader's state earlier topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data topology_state_machine: introduce upgrade_state storage_service: disallow topology ops when upgrade is in progress raft_group0_client: add in_recovery method storage_service: introduce join_node_query verb raft_group0: make discover_group0 public raft_group0: filter current node's IP in discover_group0 raft_group0: remove my_id arg from discover_group0 storage_service: make _raft_topology_change_enabled more advanced docs: document raft topology upgrade and recovery	2024-02-09 11:54:53 +01:00
Kefu Chai	64c829da70	docs: reformat the state machine diagram using mermaid for better readability Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16620	2024-02-08 19:43:53 +02:00
Piotr Dulikowski	1104f8b00f	docs: document raft topology upgrade and recovery	2024-02-07 09:54:54 +01:00
Patryk Jędrzejczak	2687204c7f	docs: dev: topology-over-raft: align indentation	2024-02-02 16:55:28 +01:00
Patryk Jędrzejczak	fdd3c3a280	docs: dev: topology-over-raft: document the rollback_to_normal state In one of the previous patches, we changed the `rollback_to_normal` state from a node state to a transition state. We document it in this patch. The node state wasn't documented, so there is nothing to remove.	2024-02-02 16:55:28 +01:00
Avi Kivity	c8397f0287	Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. Too large tablet makes migration inefficient, therefore slowing down the balancer. If the avg size grows beyond the upper bound (split threshold), then balancer decides to split. Split spans all tablets of a table, due to power-of-two constraint. Likewise, if the avg size decreases below the lower bound (merge threshold), then merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays foundation for it to be impĺemented later on. A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say table is being split and avg size drops below the target size (which is 50% of split threshold and 100% of merge one). That means after split, the avg size would drop below the merge threshold, causing a merge after split, which is wasteful, so it's better to just cancel the split. Tablet metadata gains 2 new fields for managing this: resize_type: resize decision type, can be either of "merge", "split", or "none". resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator). A new RPC was implemented to pull stats from each table replica, such that load balancer can calculate the avg tablet size and know the "split status", for a given table. Avg size is aggregated carefully while taking RF of each DC into account (which might differ). When a table is done splitting its storage, it loads (mirror) the resize_seq_number from tablet metadata into its local state (in another words, my split status is ready). If a table is split ready, coordinator will see that table's seq number is the same as the one in tablet metadata. Helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emited later on). Also, it's aggregated carefully, by taking the minimum among all replicas, so coordinator will only update topology when all replicas are ready. When load balancer emits split decision, replicas will listen to need to split with a "split monitor" that is awakened once a table has replication metadata updated and detects the need for split (i.e. resize_type field is "split"). The split monitor will start splitting of compaction groups (using mechanism introduced here: `081f30d149`) for the table. And once splitting work is completed, the table updates its local state as having completed split. When coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update appropriately their set of compaction groups (that were previously split in the preparation step). Fixes #16536. Closes scylladb/scylladb#16580 * github.com:scylladb/scylladb: test/topology_experimental_raft: Add tablet split test replica: Bypass reshape on boot with tablets temporarily replica: Fix table::compaction_group_for_sstable() for tablet streaming test/topology_experimental_raft: Disable load balancer in test fencing replica: Remap compaction groups when tablet split is finalized service: Split tablet map when split request is finalized replica: Update table split status if completed split compaction work storage_service: Implement split monitor topology_cordinator: Generate updates for resize decisions made by balancer load_balancer: Introduce metrics for resize decisions db: Make target tablet size a live-updateable config option load_balancer: Implement resize decisions service: Wire table_resize_plan into migration_plan service: Introduce table_resize_plan tablet_mutation_builder: Add set_resize_decision() topology_coordinator: Wire load stats into load balancer storage_service: Allow tablet split and migration to happen concurrently topology_coordinator: Periodically retrieve table_load_stats locator: Introduce topology::get_datacenter_nodes() storage_service: Implement table_load_stats RPC replica: Expose table_load_stats in table replica: Introduce storage_group::live_disk_space_used() locator: Introduce table_load_stats tablets: Add resize decision metadata to tablet metadata locator: Introduce resize_decision	2024-01-31 13:59:56 +02:00
Patryk Jędrzejczak	7c10cae6c4	docs: dev: topology-over-raft: document the left_token_ring state In one of the previous patches, we changed the `left_token_ring` state from a node state to a transition state. We document it in this patch. The node state wasn't documented, so there is nothing to remove.	2024-01-29 10:39:07 +01:00
Kefu Chai	9ee6c00c84	docs: fix misspellings these misspellings are identified by codespell. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17005	2024-01-26 13:14:21 +02:00
Raphael S. Carvalho	7ed5b44d52	load_balancer: Implement resize decisions This implements the ability in load balancer to emit split or merge requests, cancel ongoing ones if they're no longer needed, and also finalize those that are ready for the topology changes. That's all based on average tablet size, collected by coordinator from all nodes, and split and merge thresholds. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:08 -03:00
Raphael S. Carvalho	0d5ba1ee4b	tablets: Add resize decision metadata to tablet metadata The new metadata describes the ongoing resize operation (can be either of merge, split or none) that spans tablets of a given table. That's managed by group0, so down nodes will be able to see the decision when they come back up and see the changes to the metadata. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-01-25 18:36:06 -03:00
Avi Kivity	69d597075a	Merge 'tablets: Add support for removenode and replace handling' from Tomasz Grabiec New tablet replicas are allocated and rebuilt synchronously with node operations. They are safely rebuilt from all existing replicas. The list of ignored nodes passed to node operations is respected. Tablet scheduler is responsible for scheduling tablet rebuilding transition which changes the replicas set. The infrastructure for handling decommission in tablet scheduler is reused for this. Scheduling is done incrementally, respecting per-shard load limits. Rebuilding transitions are recognized by load calculation to affect all tablet replicas. New kind of tablet transition is introduced called "rebuild" which adds new tablet replica and rebuilds it from existing replicas. Other than that, the transition goes through the same stages as regular migration to ensure safe synchronization with request coordinators. In this PR we simply stream from all tablet replicas. Later we should switch to calling repair to avoid sending excessive amounts of data. Fixes https://github.com/scylladb/scylladb/issues/16690. Closes scylladb/scylladb#16894 * github.com:scylladb/scylladb: tests: tablets: Add tests for removenode and replace tablets: Add support for removenode and replace handling topology_coordinator: tablets: Do not fail in a tight loop topology_coordinator: tablets: Avoid warnings about ignored failured future storage_service, topology: Track excluded state in locator::topology raft topology: Introduce param-less topology::get_excluded_nodes() raft topology: Move get_excluded_nodes() to topology tablets: load_balancer: Generalize load tracking tablets: Introduce get_migration_streaming_info() which works on migration request tablets: Move migration_to_transition_info() to tablets.hh tablets: Extract get_new_replicas() which works on migraiton request tablets: Move tablet_migration_info to tablets.hh tablets: Store transition kind per tablet	2024-01-25 14:49:43 +02:00
Avi Kivity	4a57b67634	docs: add a rough diagram of module interaction It is incomplete and maybe inaccurate, but it is a start. Closes scylladb/scylladb#16903	2024-01-23 18:08:48 +02:00
Tomasz Grabiec	4a06ffb43c	tablets: Store transition kind per tablet Will be used to distinguish regular migration from rebuild, repair and RF change.	2024-01-23 01:12:57 +01:00

1 2 3 4

158 Commits