When the topology barrier is blocked for longer than the configured threshold
(2s), stale versions are marked as stalled, and when they are released
they report a backtrace to the logs. This should help identify what
was holding the token metadata pointer for too long.
Example log:
token_metadata - topology version 30 held for 299.159 [s] past expiry, released at: 0x2397ae1 0x23a36b6 ...
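A minimal sketch of the detection idea, with hypothetical names (the real code uses the token_metadata versioning machinery and logs a captured backtrace):
```cpp
#include <chrono>
#include <cstdio>

using clock_type = std::chrono::steady_clock;

// Hypothetical names, not the actual Scylla code: remember when a version was
// expected to be released, mark it as stalled once the threshold (2s) is
// exceeded, and on release log how long it was held.
struct tracked_version {
    clock_type::time_point expiry;  // when the topology barrier expected this version to go away
    bool stalled = false;
};

inline void mark_if_stalled(tracked_version& v, std::chrono::seconds threshold) {
    if (!v.stalled && clock_type::now() > v.expiry + threshold) {
        v.stalled = true;
    }
}

inline void on_release(const tracked_version& v) {
    if (v.stalled) {
        double held = std::chrono::duration<double>(clock_type::now() - v.expiry).count();
        // The real code logs through the token_metadata logger and appends a captured backtrace.
        std::printf("topology version held for %.3f [s] past expiry\n", held);
    }
}
```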
Closes scylladb/scylladb#17427
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.
Complete the deprecation by removing range.hh and replacing all the
aliases with the names they point to from the interval library. Note
that this now exposes uses of wrapping intervals, as they are now explicit.
The unit tests are renamed and range.hh is deleted.
Closes scylladb/scylladb#17428
Set filesystem permissions for the maintenance socket to 660 (previously 755) to allow members of the scyllaadm group to connect.
Split the socket-creation logic into two separate functions, one for each case: the regular CQL controller and the maintenance_socket.
Fixes https://github.com/scylladb/scylladb/issues/16487.
Closes scylladb/scylladb#17113
* github.com:scylladb/scylladb:
maintenance_socket: add option to set owning group
transport/controller: get rid of magic number for socket path's maximal length
transport/controller: set unix_domain_socket_permissions for maintenance_socket
transport/controller: pass unix_domain_socket_permissions to generic_server::listen
transport/controller: split configuring sockets into separate functions
Option `maintenance-socket-group` sets the owning group of the maintenance socket.
If not set, the group defaults to that of the user running the scylla node.
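For illustration only, setting the owning group and the 660 mode on a unix socket path boils down to standard POSIX calls (hypothetical helper, not the controller code):
```cpp
#include <sys/stat.h>
#include <unistd.h>
#include <grp.h>
#include <stdexcept>
#include <string>

// Illustration only: give the maintenance socket 660 permissions and,
// optionally, hand ownership to a named group (e.g. the value of
// maintenance-socket-group). Error handling is simplified.
void set_socket_group_and_mode(const std::string& path, const std::string& group) {
    if (!group.empty()) {
        struct group* gr = ::getgrnam(group.c_str());
        if (!gr) {
            throw std::runtime_error("unknown group: " + group);
        }
        // keep the owning user, change only the group
        if (::chown(path.c_str(), static_cast<uid_t>(-1), gr->gr_gid) != 0) {
            throw std::runtime_error("chown failed for " + path);
        }
    }
    if (::chmod(path.c_str(), 0660) != 0) {  // rw for owner and group, nothing for others
        throw std::runtime_error("chmod failed for " + path);
    }
}
```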
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `hints::host_filter`. its
operator<< is preserved as it's still used by the homebrew generic
formatter for vector<>, which is in turn used by db/config.cc.
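The general shape of such a conversion, shown on a made-up stand-in type rather than the real `hints::host_filter`:
```cpp
#include <fmt/core.h>
#include <fmt/format.h>

// A made-up type standing in for classes like hints::host_filter.
struct host_filter_like {
    bool enabled_for_all;
};

// With fmt v10 the formatter has to be spelled out explicitly instead of
// being derived from operator<<.
template <>
struct fmt::formatter<host_filter_like> : fmt::formatter<fmt::string_view> {
    auto format(const host_filter_like& f, fmt::format_context& ctx) const {
        return fmt::format_to(ctx.out(), "{}", f.enabled_for_all ? "enabled for all DCs"
                                                                 : "enabled for some DCs");
    }
};
```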
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17347
When a node changes IP address we need to remove its old IP from `system.peers` and gossiper.
We do this in `sync_raft_topology_nodes` when the new IP is saved into `system.peers` to avoid losing the mapping if the node crashes between deleting and saving the new IP. We also handle the possible duplicates in this case by dropping them on the read path when the node is restarted.
The PR also fixes the problem with old IPs getting resurrected when a node changes its IP address.
The following scenario is possible: a node `A` changes its IP from `ip1` to `ip2` with a restart, while other nodes are not yet aware of `ip2`, so they keep gossiping `ip1`. After the restart, `A` receives `ip1` in a gossip message and calls `handle_major_state_change`, since it considers it a new node. Then the `on_join` event is fired on the gossiper notification handlers; `raft_ip_address_updater` receives this event and reverts the IP of node `A` back to `ip1`.
To fix this we ensure that the new gossiper generation number is used when a node registers its IP address in `raft_address_map` at startup.
The `test_change_ip` is adjusted to ensure that the old IPs are properly removed in all cases, even if the node crashes.
Fixes #16886
Fixes #16691
Fixes #17199
Closes scylladb/scylladb#17162
* github.com:scylladb/scylladb:
test_change_ip: improve the test
raft_ip_address_updater: remove stale IPs from gossiper
raft_address_map: add my ip with the new generation
system_keyspace::update_peer_info: check ep and host_id are not empty
system_keyspace::update_peer_info: make host_id an explicit parameter
system_keyspace::update_peer_info: remove any_set flag optimisation
system_keyspace: remove duplicate ips for host_id
system_keyspace: peers table: use coroutines
storage_service::raft_ip_address_updater: log gossiper event name
raft topology: ip change: purge old IP
on_endpoint_change: coroutinize the lambda around sync_raft_topology_nodes
This API endpoint currently returns status 500 when called for a table that uses tablets. This series adds tablet support. No change in usage semantics is required; the endpoint already has a table parameter.
This endpoint is the backend of `nodetool getendpoints`, which should now work after this PR.
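Conceptually, the endpoint now branches on the table's replication scheme; a self-contained sketch with illustrative types (not the actual storage_service code):
```cpp
#include <cstdint>
#include <vector>

// Illustrative types; the real code works with locator:: types.
using host_id = uint64_t;
using token = int64_t;

struct tablet_map {
    std::vector<std::vector<host_id>> replicas_per_tablet;  // one replica set per tablet
    size_t tablet_of(token t) const {
        // stand-in for the real token -> tablet lookup
        return static_cast<uint64_t>(t) % replicas_per_tablet.size();
    }
};

struct table_ref {
    bool tablets;
    tablet_map tmap;
    std::vector<host_id> vnode_replicas;  // stand-in for the token-ring calculation
    bool uses_tablets() const { return tablets; }
};

// Sketch of the fixed endpoint: branch on the replication scheme instead of
// failing with status 500 for tablet-based tables.
std::vector<host_id> natural_endpoints(const table_ref& t, token tk) {
    if (t.uses_tablets()) {
        return t.tmap.replicas_per_tablet[t.tmap.tablet_of(tk)];
    }
    return t.vnode_replicas;
}
```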
Fixes: #17313
Closes scylladb/scylladb#17316
* github.com:scylladb/scylladb:
service/storage_service: get_natural_endpoints(): add tablets support
replica/database: keyspace: add uses_tablets()
service/storage_service: remove token overload of get_natural_endpoints()
The host_id field should always be set, so it's more
appropriate to pass it as a separate parameter.
The function storage_service::get_peer_info_for_update
is updated. It should no longer look for the host_id app
state in the passed map; instead, the callers should
obtain the host_id on their own.
This optimization never worked -- there were four usages of
the update_peer_info function and in all of them some of
the peer_info fields were set or should be set:
* sync_raft_topology_nodes/process_normal_node: e.g. tokens is set
* sync_raft_topology_nodes/process_transition_node: host_id is set
* handle_state_normal: tokens is set
* storage_service::on_change: get_peer_info_for_update could potentially
return a peer_info with all fields set to empty, but this shouldn't
be possible; host_id should always be set.
Moreover, there is a bug here: we extract host_id from the
states_ parameter, which represents the gossiper application
states that have been changed. This parameter contains host_id
only if a node changes its IP address; in all other cases host_id
is unset. This means we could end up with a record with an empty
host_id, if it wasn't previously set by some other means.
We are going to fix this bug in the next commit.
When a node changes its IP we call sync_raft_topology_nodes
from raft_ip_address_updater::on_endpoint_change with
the old IP value in the prev_ip parameter.
It's possible that the node crashes right after
we insert a new IP for the host_id, but before we
remove the old IP. In this commit we fix the
possible inconsistency by removing the system.peers
record with the older timestamp. This is what the new
peers_table_read_fixup function is responsible for.
We call this function in all system_keyspace methods
that read the system.peers table. The function
loads the table into memory, decides whether some rows
are stale by comparing their timestamps, and
removes them.
The new function also removes the records with no
host_id, so we no longer need the get_host_id function.
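A simplified sketch of the fixup idea - group rows by host_id, keep only the newest row per host, and drop rows without a host_id (row shape and names are simplified):
```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in for a system.peers row.
struct peer_row {
    std::string ip;
    std::optional<std::string> host_id;
    int64_t write_timestamp;  // timestamp of the row's cells
};

// Sketch of the peers_table_read_fixup() idea: keep the newest row per
// host_id, and collect everything else (stale duplicates and rows without a
// host_id) for deletion.
std::vector<peer_row> rows_to_delete(std::vector<peer_row>& rows) {
    std::unordered_map<std::string, peer_row*> newest;
    std::vector<peer_row> stale;
    for (auto& r : rows) {
        if (!r.host_id) {            // no host_id -> cannot be trusted, remove
            stale.push_back(r);
            continue;
        }
        auto [it, inserted] = newest.try_emplace(*r.host_id, &r);
        if (!inserted) {
            // keep the row with the larger timestamp, schedule the other for deletion
            if (r.write_timestamp > it->second->write_timestamp) {
                stale.push_back(*it->second);
                it->second = &r;
            } else {
                stale.push_back(r);
            }
        }
    }
    return stale;
}
```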
We'll add a test for the problem this commit fixes
in the next commit.
This is a refactoring commit with no observable
changes in behaviour.
We switch the functions to coroutines; it'll
be easier to work with them in this way in the
next commit. Also, we add more const-s
along the way.
Mirroring table::uses_tablets(), this provides a convenient and -- more
importantly -- easily discoverable way to determine whether the keyspace
uses tablets or not.
This information is of course already available via the abstract
replication strategy, but as seen in a few examples, this is not easily
discoverable and sometimes people resorted to enumerating the keyspace's
tables to be able to invoke table::uses_tablets().
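Presumably the new helper is a thin delegation to the replication strategy; a stub-level sketch of the shape:
```cpp
// Illustrative stubs only: a keyspace-level helper that mirrors
// table::uses_tablets() by asking the replication strategy directly,
// instead of enumerating the keyspace's tables.
struct replication_strategy_stub {
    bool tablets;
    bool uses_tablets() const { return tablets; }
};

struct keyspace_stub {
    replication_strategy_stub strategy;
    bool uses_tablets() const { return strategy.uses_tablets(); }
};
```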
This series makes several changes to how the ignored nodes list is treated
by the topology coordinator. First, the series makes it global rather than
part of a single topology operation; second, it extends the list at the
time of removenode/replace invocation; and third, it bans all nodes in
the list from contacting the cluster ever again.
The main motivation is to have a way to unblock tablet migration in case
of a node failure. Tablet migration knows how to avoid nodes in the ignored
nodes list, and this patch series provides a way to extend the list without
performing any topology operation (which is not possible while tablet
migration runs).
Fixes scylladb/scylladb#16108
* 'gleb/ignore-nodes-handling-v2' of github.com:scylladb/scylla-dev:
test: add test for the new ignore nodes behaviour
topology coordinator: cleanup node_state::decommissioning state handling code
topology coordinator: ban ignored nodes just like we ban nodes that are left
storage_service: topology coordinator: validate ignore dead nodes parameters in removenode/replace
topology coordinator: add removed/replaced nodes to ignored_nodes list at the request invocation time
topology coordinator: make ignored_nodes list global and permanent
topology_coordinator: do not cancel rebuild just because some other nodes are dead
topology coordinator: throw more specific error from wait_for_ip() function in case of a timeout
raft_group0: add make_nonvoters function that can make multiple node non voters simultaneously
Currently the ignored_nodes list is part of a request (removenode or
replace) and exists only while the request is handled. This patch
changes it to be global and exist outside of any request. Nodes stay
in the list until they are eventually removed and moved to the "left" state.
If a node is specified in the ignore-dead-nodes option for any command,
it will be ignored for all other operations that support ignored_nodes
(like tablet migration).
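An illustrative sketch of the new lifetime of the list (names simplified):
```cpp
#include <set>
#include <string>

// Illustrative only: ignored nodes accumulate in a single global set that
// outlives individual topology requests.
struct topology_state_stub {
    std::set<std::string> ignored_nodes;  // persists across topology requests

    // called when removenode/replace is invoked with ignore-dead-nodes
    void add_ignored(const std::set<std::string>& ids) {
        ignored_nodes.insert(ids.begin(), ids.end());
    }

    // a node leaves the set only when it is removed and reaches the "left" state
    void on_left(const std::string& id) {
        ignored_nodes.erase(id);
    }
};
```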
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.
We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!
We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.
The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.
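The essence of the fix as a self-contained sketch (simplified cell representation, not the actual view-update code):
```cpp
#include <algorithm>
#include <cstdint>

// Simplified stand-in for a base-table cell that acts as a GSI key column.
struct cell {
    int64_t timestamp;
    bool live;
};

// Sketch of the fix: a view row built from two base regular columns that act
// as the GSI key must use the *higher* of the two timestamps, so a later
// update of only one of them is not shadowed by an older deletion.
inline int64_t view_row_timestamp(const cell& a, const cell& b) {
    return std::max(a.timestamp, b.timestamp);
}
```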
Fixes #17119
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#17172
This PR implements a procedure that upgrades existing clusters to use
raft-based topology operations. The procedure does not start
automatically, it must be triggered manually by the administrator after
making sure that no topology operations are currently running.
Upgrade is triggered by sending a `POST
/storage_service/raft_topology/upgrade` request. This causes the
topology coordinator to start, which then drives the rest of the process: it
builds the `system.topology` state based on information observed in
gossip and tells all nodes to switch to raft mode. Then, the topology
coordinator runs normally.
Upgrade progress is tracked in a new static column `upgrade_state` in
`system.topology`.
The procedure also serves as an extension to the current recovery
procedure on raft. The current recovery procedure requires restarting
nodes in a special mode which disables raft, performing `nodetool
removenode` on the dead nodes, cleaning up some state on the nodes and
restarting them so that they automatically rebuild group 0. Raft
topology fits into the existing procedure by falling back to legacy topology
operations after disabling raft. After rebuilding group 0, upgrade
needs to be triggered again.
Because upgrade is manual and it might not be convenient for
administrators to run it right after upgrading the cluster, we allow the
cluster to operate in legacy topology operations mode until upgrade,
which includes allowing new nodes to join. To support this, nodes
now ask the cluster, using a new `JOIN_NODE_QUERY` RPC, which mode
they should use to join before proceeding.
The procedure is explained in more detail in `topology-over-raft.md`.
Fixes: https://github.com/scylladb/scylladb/issues/15008
Closes scylladb/scylladb#17077
* github.com:scylladb/scylladb:
test/topology_custom: upgrade/recovery tests for topology on raft
cdc/generation_service: in legacy mode, fall back to raft tables
system_keyspace: add read_cdc_generation_opt
cdc/generation_service: turn off gossip notifications in raft topo mode
cql_test_env: move raft_topology_change_enabled var earlier
group0_state_machine: pull snapshot after raft topology feature enabled
storage_service: disable persistent feature enabler on upgrade
storage_service: replicate raft features to system.peers
storage_service: gossip tokens and cdc generation in raft topology mode
API: add api for triggering and monitoring topology-on-raft upgrade
storage_service: infer which topology operations to use on startup
storage_service: set the topology kind value based on group 0 state
raft_group0: expose link to the upgrade doc in the header
feature_service: fall back to checking legacy features on startup
storage_service: add fiber for tracking the topology upgrade progress
gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES
topology_coordinator: implement core upgrade logic
topology_coordinator: extract top-level error handling logic
storage_service: initialize discovery leader's state earlier
topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data
topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data
topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data
topology_state_machine: introduce upgrade_state
storage_service: disallow topology ops when upgrade is in progress
raft_group0_client: add in_recovery method
storage_service: introduce join_node_query verb
raft_group0: make discover_group0 public
raft_group0: filter current node's IP in discover_group0
raft_group0: remove my_id arg from discover_group0
storage_service: make _raft_topology_change_enabled more advanced
docs: document raft topology upgrade and recovery
The `system_keyspace::read_cdc_generation` function loads a CDC generation from
the system tables. One of its preconditions is that the generation
exists - this precondition is quite easy to satisfy in raft mode, and
the function was designed to be used solely in that mode.
In legacy mode, however, when we revert from raft mode through
recovery, it might be necessary to use generations created in raft mode
for some time. In order to make the function useful as a fallback in
case lookup of a generation in legacy mode fails, introduce a relaxed
variant of `read_cdc_generation` which returns std::nullopt if the
generation does not exist.
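A sketch of the intended relationship between the strict and relaxed variants (illustrative types and names):
```cpp
#include <optional>
#include <stdexcept>

// Illustrative stand-in for the loaded CDC generation data.
struct cdc_generation_stub {};

// Relaxed variant: report absence instead of failing, so legacy mode can
// fall back gracefully after recovery.
std::optional<cdc_generation_stub> read_cdc_generation_opt_sketch(bool exists) {
    if (!exists) {
        return std::nullopt;  // generation not found: let the caller fall back
    }
    return cdc_generation_stub{};
}

// Strict variant: keeps the original precondition that the generation exists.
cdc_generation_stub read_cdc_generation_sketch(bool exists) {
    auto gen = read_cdc_generation_opt_sketch(exists);
    if (!gen) {
        throw std::runtime_error("CDC generation does not exist");
    }
    return *gen;
}
```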
When checking features on startup (i.e. whether support for any feature
was revoked in an unsafe way), it might happen that the upgrade to raft
topology has not finished yet. In that case, instead of loading an empty
set of features - which supposedly represents the set of features that
were enabled until last boot - we should fall back to loading the set
from the legacy `enabled_features` key in `system.scylla_local`.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `error_injection_at_startup`,
and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17211
For efficiency, if a base-table update generates many view updates that
go to the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation into mutations of no more than 100 rows (max_rows_for_view_updates)
each.
This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).
Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts, with the result
that some of the range tombstones in the update could be missed - and the
view could miss some deletions that happened in the base table.
This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:
1. The counter "_op_count" that decides where to break the output mutation
should only be incremented when adding rows to this output mutation.
The existing code strangely incremented it on every read (!?) which
resulted in the counter being incremented on every *input* fragment,
and in particular could reach the limit 100 between two range
tombstone pieces.
2. Moreover, the length of the output was checked in the wrong place...
The existing code could get to 100 rows, not check at that point,
read the next input - half a range tombstone - and only *then*
check that we reached 100 rows and stop. The fix is to calculate
the number of rows in the right place - exactly when it's needed,
not before the step.
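A sketch of the corrected accounting, with hypothetical names (the real logic lives in the view update builder):
```cpp
#include <cstddef>

// Illustrative sketch of the fixed splitting rule: count only rows that were
// actually added to the output mutation, and check the limit right before
// adding another row - never between the two halves of a range tombstone.
constexpr size_t max_rows_for_view_updates = 100;

struct output_mutation_stub {
    size_t rows = 0;
    bool full() const { return rows >= max_rows_for_view_updates; }
    void add_row() { ++rows; }
};

// Called once per *generated* view row (not once per input fragment).
// Returns true if the current output mutation had to be flushed first.
bool add_view_row(output_mutation_stub& out) {
    bool flushed = false;
    if (out.full()) {   // check exactly when a new output row is about to be added
        out = {};       // flush and start a new output mutation
        flushed = true;
    }
    out.add_row();
    return flushed;
}
```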
The first change needs more justification: The old code, which incremented
_op_count on every input fragment and not just output fragments, did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verified that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.
This patch also includes two tests reproducing this bug and confirming
it's fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure don't fail after this patch (they don't).
By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happened when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluded that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).
Fixes #17117
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#17164
After changing `left_token_ring` from a node state to a transition
state in scylladb/scylladb#17009, we do the same for
`rollback_to_normal`. `rollback_to_normal` was created as a node
state because `left_token_ring` was a node state.
This change will allow us to distinguish a failed removenode from
a failed decommission in the `rollback_to_normal` handler.
Currently, we use the same logic for both of them, so it's not
required. However, this might change, as it has happened with the
decommission and the failed bootstrap/replace in the
`left_token_ring` state (scylladb/scylladb#16797). We are making
this change now because it would be much harder after branching.
Fixes scylladb/scylladb#17032
Closes scylladb/scylladb#17136
* github.com:scylladb/scylladb:
docs: dev: topology-over-raft: align indentation
docs: dev: topology-over-raft: document the rollback_to_normal state
topology_coordinator: improve logs in rollback_to_normal handler
raft topology: make rollback_to_normal a transition state
After changing `left_token_ring` from a node state to a transition
state in scylladb/scylladb#17009, we do the same for
`rollback_to_normal`. `rollback_to_normal` was created as a node
state because `left_token_ring` was a node state.
This change will allow us to distinguish a failed removenode from
a failed decommission in the `rollback_to_normal` handler.
Currently, we use the same logic for both of them, so it's not
required. However, this might change, as it has happened with the
decommission and the failed bootstrap/replace in the
`left_token_ring` state (scylladb/scylladb#16797). We are making
this change now because it would be much harder after branching.
The change also simplifies the code in
`topology_coordinator:rollback_current_topology_op`.
Moving the `rollback_to_normal` handler from
`handle_node_transition` to `handle_topology_transition` created a
large diff. There is only one change - adding
`auto node = get_node_to_work_on(std::move(guard));`.
get0() dates back to the days when Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
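A trivial before/after illustration (not taken from the actual diff):
```cpp
#include <seastar/core/future.hh>

seastar::future<int> answer() {
    return seastar::make_ready_future<int>(42);
}

int get_answer() {
    // Before: return answer().get0();  // tuple-era accessor
    return answer().get();              // same behaviour, current API
}
```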
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `db::write_type`, and drop
its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17093
This reverts commit 370fbd346c, reversing
changes made to 0912d2a2c6.
The reverted change made scylla-manager misinterpret the data_file_directories
setting somehow; see issue #17078.
The motivation for tablet resizing is that we want to keep the average tablet size reasonable, so that load rebalancing can remain efficient. A tablet that is too large makes migration inefficient, slowing down the balancer.
If the avg size grows beyond the upper bound (split threshold), the balancer decides to split. A split spans all tablets of a table, due to the power-of-two constraint.
Likewise, if the avg size decreases below the lower bound (merge threshold), a merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays the foundation for it to be implemented later on.
A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say a table is being split and the avg size drops below the target size (which is 50% of the split threshold and 100% of the merge threshold). That means that after the split, the avg size would drop below the merge threshold, causing a merge right after the split, which is wasteful, so it's better to just cancel the split.
Tablet metadata gains 2 new fields for managing this:
resize_type: resize decision type, can be either of "merge", "split", or "none".
resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator).
A new RPC was implemented to pull stats from each table replica, so that the load balancer can calculate the avg tablet size and know the "split status" for a given table. The avg size is aggregated carefully, taking the RF of each DC (which might differ) into account.
When a table is done splitting its storage, it loads (mirrors) the resize_seq_number from tablet metadata into its local state (in other words, its split status is ready). If a table is split ready, the coordinator will see that the table's seq number is the same as the one in tablet metadata. This helps distinguish stale decisions from the latest one (in case decisions are revoked and re-emitted later on). The status is also aggregated carefully, by taking the minimum among all replicas, so the coordinator will only update topology when all replicas are ready.
When the load balancer emits a split decision, replicas react to the need to split with a "split monitor" that is awakened once a table's replication metadata is updated and detects the need for a split (i.e. the resize_type field is "split").
The split monitor will start the splitting of compaction groups (using the mechanism introduced in 081f30d149) for the table. Once the splitting work is completed, the table updates its local state as having completed the split.
When the coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which means updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can appropriately update their set of compaction groups (which were previously split in the preparation step).
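A condensed sketch of the decision rule described above (thresholds, names and the revocation rule are illustrative):
```cpp
#include <cstdint>

enum class resize_type { none, split, merge };

// Illustrative only: the target size is 50% of the split threshold and 100%
// of the merge threshold, as described above. Real inputs come from the
// per-table load stats RPC.
struct resize_policy {
    uint64_t split_threshold;
    uint64_t merge_threshold() const { return split_threshold / 2; }

    resize_type decide(uint64_t avg_tablet_size, resize_type current) const {
        if (avg_tablet_size > split_threshold) {
            return resize_type::split;
        }
        if (avg_tablet_size < merge_threshold()) {
            // splitting now would immediately be followed by a merge, so a
            // pending split is revoked; merge itself is not implemented yet
            return current == resize_type::split ? resize_type::none : resize_type::merge;
        }
        return current;  // keep whatever decision is already in flight
    }
};
```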
Fixes #16536.
Closes scylladb/scylladb#16580
* github.com:scylladb/scylladb:
test/topology_experimental_raft: Add tablet split test
replica: Bypass reshape on boot with tablets temporarily
replica: Fix table::compaction_group_for_sstable() for tablet streaming
test/topology_experimental_raft: Disable load balancer in test fencing
replica: Remap compaction groups when tablet split is finalized
service: Split tablet map when split request is finalized
replica: Update table split status if completed split compaction work
storage_service: Implement split monitor
topology_cordinator: Generate updates for resize decisions made by balancer
load_balancer: Introduce metrics for resize decisions
db: Make target tablet size a live-updateable config option
load_balancer: Implement resize decisions
service: Wire table_resize_plan into migration_plan
service: Introduce table_resize_plan
tablet_mutation_builder: Add set_resize_decision()
topology_coordinator: Wire load stats into load balancer
storage_service: Allow tablet split and migration to happen concurrently
topology_coordinator: Periodically retrieve table_load_stats
locator: Introduce topology::get_datacenter_nodes()
storage_service: Implement table_load_stats RPC
replica: Expose table_load_stats in table
replica: Introduce storage_group::live_disk_space_used()
locator: Introduce table_load_stats
tablets: Add resize decision metadata to tablet metadata
locator: Introduce resize_decision
`db::config` is a class that is used in many places across the code base. When it is changed, its clients' code needs to be recompiled. It represents the configuration of the database. Some fields of the configuration that describe the location of directories may be empty. In such cases the `db::config::setup_directories()` function is called - it modifies the provided configuration. Such modification is undesirable - it is better to keep `db::config` intact.
This PR:
- extends the public interface of utils::directories class to provide required directory paths to the users
- removes 'db::config::setup_directories()' to avoid altering the fields of configuration object
- replaces usages of db::config object with utils::directories object in places that require obtaining paths to dirs
Fixes: scylladb#5626
Closes scylladb/scylladb#16787
* github.com:scylladb/scylladb:
utils/directories: make utils::directories::set an internal type
db::config: keep dir paths unchanged
cql_transport/controler: use utils::directories to get paths of dirs
service/storage_proxy: use utils::directories to get paths of dirs
api/storage_service.cc: use utils::directories to get paths of dirs
tools/scylla-sstable.cc: use utils::directories to get paths
db/commitlog: do not use db::config to get dirs
Use utils::directories to get dirs paths in replica::database
Allow utils::directories to provide paths to dirs
Clean-up of utils::directories
When a node is in the `left_token_ring` state, we don't know how
it has ended up in this state. We cannot distinguish a node that
has finished decommissioning from a node that has failed bootstrap.
The main problem it causes is that we incorrectly send the
`barrier_and_drain` command to a node that has failed
bootstrapping or replacing. We must do it for a node that has
finished decommissioning because it could still coordinate
requests. However, since we cannot distinguish nodes in the
`left_token_ring` state, we must send the command to all of them.
This issue appeared in scylladb/scylladb#16797 and this PR is
a follow-up that fixes it.
The solution is changing `left_token_ring` from a node state
to a transition state.
Fixes scylladb/scylladb#16944
Closes scylladb/scylladb#17009
* github.com:scylladb/scylladb:
docs: dev: topology-over-raft: document the left_token_ring state
topology_coordinator: adjust reason string in left_token_ring handler
raft topology: make left_token_ring a transition state
topology_coordinator: rollback_current_topology_op: remove unused exclude_nodes
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `db::read_repair_decision`,
and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17033
Previously, utils::directories::set could be used by
clients of the utils::directories class to provide dirs for creation.
Since the responsibility for providing paths of dirs has moved from
db::config to utils::directories, such usage is no longer possible.
This change:
- defines utils::directories::set in utils/directories.cc to disallow
its usage by the clients of utils::directories
- makes utils::directories::create_and_verify() member function
private; now it is used only by the internals of the class
- introduces a new member function to utils::directories called
create_and_verify_sharded_directory() to limit the functionality
provided to clients
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
This change is intended to ensure that
db::config fields related to directories
are not changed. To achieve that, the member
function called setup_directories() is
removed.
The responsibility for directory paths
has been moved to utils::directories,
which may generate default paths if the
configuration does not provide a specific
value.
Fixes: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
This change removes the usage of db::config to
get the path of commitlog_directory. Instead, it
introduces a new parameter to directly pass
the path to db::commitlog::config::from_db_config().
Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
A node can be in the `left_token_ring` state after:
- a finished decommission,
- a failed bootstrap,
- a failed replace.
When a node is in the `left_token_ring` state, we don't know how
it has ended up in this state. We cannot distinguish a node that
has finished decommissioning from a node that has failed bootstrap.
The main problem it causes is that we incorrectly send the
`barrier_and_drain` command to a node that has failed
bootstrapping or replacing. We must do it for a node that has
finished decommissioning because it could still coordinate
requests. However, since we cannot distinguish nodes in the
`left_token_ring` state, we must send the command to all of them.
This issue appeared in scylladb/scylladb#16797 and this patch is
a follow-up that fixes it.
The solution is changing `left_token_ring` from a node state
to a transition state.
Regarding implementation, most of the changes are simple
refactoring. The less obvious are:
- Before this patch, in `system_keyspace::left_topology_state`, we
had to keep the ignored nodes' IDs for replace to ensure that the
replacing node would have access to them after moving to the
`left_token_ring` state, which happens when replace fails. We
don't need this workaround anymore. When we enter the new
`left_token_ring` transition state, the new node will still be in
the `decommissioning` state, so it won't lose its request param.
- Before this patch, a decommissioning node lost its tokens
while moving to the `left_token_ring` state. After the patch, it
loses tokens while still being in the `decommissioning` state. We
ensure that all `decommissioning` handlers correctly handle a node
that lost its tokens.
Moving the `left_token_ring` handler from `handle_node_transition`
to `handle_topology_transition` created a large diff. There are
only three changes:
- adding `auto node = get_node_to_work_on(std::move(guard));`,
- adding `builder.del_transition_state()`,
- changing error logged when `global_token_metadata_barrier` fails.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `db::replay_position`,
and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17014
In this commit, we postpone the start-up
of the hint manager until we obtain information
about other nodes in the cluster.
When we start the hint managers, one of the
things that happens is creating endpoint
managers -- structures managed by
db::hints::manager. Whether we create
an instance of endpoint manager depends on
the value returned by host_filter::can_hint_for,
which, in turn, may depend on the current state
of locator::topology.
If locator::topology is incomplete, some endpoint
managers may not be started even though they
should (because the target node IS part of the
cluster and we SHOULD send hints to it if there
are some).
The situation like that can happen because we
start the hint managers too early. This commit
aims to solve that problem. We only start
the hint managers when we've gathered information
about the other nodes in the cluster and created
the locator::topology using it.
Hinted Handoff is not negatively affected by these
changes since in between the previous point of
starting the hint managers and the current one,
all of the mutations performed by
service::storage_proxy target the local node, so
no hints would need to be generated anyway.
Fixes scylladb/scylladb#11870
Closes scylladb/scylladb#16511
In this mode, the node is not reachable from the outside, i.e.
* it refuses all incoming RPC connections,
* it does not join the cluster, thus
* all group0 operations are disabled (e.g. schema changes),
* all cluster-wide operations are disabled for this node (e.g. repair),
* other nodes see this node as dead,
* it cannot read or write data from/to other nodes,
* it does not open Alternator and Redis transport ports and the TCP CQL port.
The only way to make CQL queries is to use the maintenance socket. The node serves only local data.
To start the node in maintenance mode, use the `--maintenance-mode true` flag or set `maintenance_mode: true` in the configuration file.
REST API works as usual, but some routes are disabled:
* authorization_cache
* failure_detector
* hinted_hand_off_manager
This PR also updates the maintenance socket documentation:
* add cqlsh usage to the documentation
* update the documentation to use `WhiteListRoundRobinPolicy`
Fixes #5489.
Closes scylladb/scylladb#15346
* github.com:scylladb/scylladb:
test.py: add test for maintenance mode
test.py: generalize usage of cluster_con
test.py: when connecting to node in maintenance mode use maintenance socket
docs: add maintenance mode documentation
main: add maintenance mode
main: move some REST routes initialization before joining group0
message_service: add sanity check that rpc connections are not created in the maintenance mode
raft_group0_client: disable group0 operations in the maintenance mode
service/storage_service: add start_maintenance_mode() method
storage_service: add MAINTENANCE option to mode enum
service/maintenance_mode: add maintenance_mode_enabled bool class
service/maintenance_mode: move maintenance_socket_enabled definition to seperate file
db/config: add maintenance mode flag
docs: add cqlsh usage to maintenance socket documentation
docs: update maintenance socket documentation to use WhiteListRoundRobinPolicy
To avoid data resurrection, mutations deleted by cleanup operations should be skipped during commitlog replay.
This series implements the above for tablet cleanups by using a new system table which holds records of cleanup operations.
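The replay-time check boils down to something like the following sketch (illustrative types; the real records live in `system.commitlog_cleanups` and use proper token ranges and replay positions):
```cpp
#include <cstdint>
#include <map>
#include <utility>

// Illustrative replay position: (segment id, position within segment).
using replay_position = std::pair<uint64_t, uint32_t>;

// Per token-range cleanup records, as remembered in system.commitlog_cleanups:
// "everything for this range written before this position was cleaned up".
struct cleanup_records {
    // keyed by (start_token, end_token), value is the cleanup's replay position
    std::map<std::pair<int64_t, int64_t>, replay_position> ranges;

    // A mutation read back from the commitlog should be skipped on replay if a
    // cleanup for its token range happened after the mutation was written.
    bool should_skip(int64_t token, replay_position mutation_pos) const {
        for (const auto& [range, cleaned_up_to] : ranges) {
            if (token >= range.first && token <= range.second && mutation_pos < cleaned_up_to) {
                return true;
            }
        }
        return false;
    }
};
```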
Fixes #16752
Closes scylladb/scylladb#16888
* github.com:scylladb/scylladb:
test: test_tablets: add a test for cleanup after migration
test: pylib: add ScyllaCluster.wipe_sstables
test: boost: add commitlog_cleanup_test
db: commitlog_replayer: ignore mutations affected by (tablet) cleanups
replica: table: garbage-collect irrelevant system.commitlog_cleanups records
db: commitlog: add min_position()
replica: table: populate system.commitlog_cleanups on tablet cleanup
db: system_keyspace: add system.commitlog_cleanups
replica: table: refresh compound sstable set after tablet cleanup
We add a sanity check to ensure there is at most one transitioning node at
a time. If there are more, something must have gone wrong.
In the future, we might implement concurrent topology operations.
Then, we will remove this sanity check.
We also extend the comment describing `transition_nodes` so that
it better explains why we use a map and how it should be handled.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for db::schema_tables::table_kind,
and its operator<<() is still used by the homebrew generic formatter
for std::map<>, so it is preserved.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#16972
Native histograms (also known as sparse histograms) are an experimental Prometheus feature.
They use protobuf as the reporting layer.
Native histograms offer the benefit of high resolution at a lower resource cost.
This series allows sending histograms in a native histogram format over protobuf.
By default, protobuf support is disabled. To use protobuf with native histograms, the command line flag prometheus_allow_protobuf should be set to true, and the Prometheus server should send the accept header with protobuf.
Fixes #12931
Closes scylladb/scylladb#16737
* github.com:scylladb/scylladb:
main.cc: Add prometheus_allow_protobuf command line
histogram_metrics_helper: support native histogram
config: Add prometheus_allow_protobuf flag
To avoid data resurrection, mutations deleted by cleanup operations
have to be skipped during commitlog replay.
This patch implements this, based on the metadata recorded on cleanup
operations into system.commitlog_cleanups.
Add a helper function which returns the minimum replay position
across all existing or future commitlog segments.
Only positions greater than or equal to it can be replayed on the next reboot.
We will use this helper in a future patch to garbage collect some cleanup
metadata which refers to replay positions.
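Conceptually (sketch with illustrative types):
```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using replay_position = std::pair<uint64_t, uint32_t>;  // (segment id, offset)

// Sketch of the min_position() idea: the smallest replay position among the
// live segments. Cleanup records referring to positions below this value can
// never match anything on the next replay, so they can be garbage-collected.
replay_position min_position(const std::vector<replay_position>& segment_starts) {
    if (segment_starts.empty()) {
        return {0, 0};
    }
    return *std::min_element(segment_starts.begin(), segment_starts.end());
}
```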