scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 04:06:59 +00:00

Author	SHA1	Message	Date
Gleb Natapov	f9215b4d7e	test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join (cherry picked from commit `9e4cd32096`)	2024-09-26 21:13:34 +00:00
Gleb Natapov	ac24ab5141	topology coordinator: mark node as being replaced earlier Before `17f4a151ce` the node was marked as being replaced in the join_group0 state, before it actually joined group0, so by the time it joined and started transferring the snapshot/log, no traffic was sent to it. That commit changed this to mark the node as being replaced after the snapshot/log is already transferred, so traffic can reach the node while it still has not caught up with the leader, and this may cause problems since the state is not complete. Mark the node as being replaced earlier, but still add the new node to the topology later as the commit above intended. (cherry picked from commit `c0939d86f9`)	2024-09-26 03:45:50 +00:00
Abhinav	ea6349a6f5	raft topology: add error for removal of non-normal nodes Currently, we check whether a node being removed is normal on the node initiating the removenode request. However, we don't have a similar check on the topology coordinator. The node being removed could be normal when we initiate the request, but it doesn't have to be normal when the topology coordinator starts handling the request. For example, the topology coordinator could have removed this node while handling another removenode request that was added to the request queue earlier. This commit fixes the issue by adding more checks in the enqueuing phase and returning errors for duplicate node removal requests. This PR fixes a bug. Hence we need to backport it. Fixes: scylladb/scylladb#20271 (cherry picked from commit `b25b8dccbd`) Closes scylladb/scylladb#20799	2024-09-25 11:34:20 +02:00
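The enqueue-time checks described in the removenode commit above can be illustrated with a small model. This is only a sketch under stated assumptions: `RemovalQueue`, `enqueue_removenode` and `handle_next` are hypothetical names invented for illustration and do not correspond to ScyllaDB's actual topology coordinator code.

```python
from collections import deque


class RemovalQueue:
    """Hypothetical model of a topology request queue; not ScyllaDB code."""

    def __init__(self, node_states):
        self.node_states = dict(node_states)  # node id -> state string
        self.pending = deque()                # queued removenode targets, FIFO

    def enqueue_removenode(self, node_id):
        # Validate on the side that owns the queue, at enqueue time.
        if self.node_states.get(node_id) != "normal":
            raise ValueError(f"node {node_id} is not in the normal state")
        if node_id in self.pending:
            raise ValueError(f"removal of node {node_id} is already queued")
        self.pending.append(node_id)

    def handle_next(self):
        # Process the oldest request and mark the node as gone.
        node_id = self.pending.popleft()
        self.node_states[node_id] = "left"
        return node_id
```

In this model a second request for the same node fails immediately instead of reaching the handler after the node has already been removed.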
Kefu Chai	3e84d43f93	treewide: use seastar::format() or fmt::format() explicitly before this change, we relied on `using namespace seastar` to use `seastar::format()` without qualifying `format()` with its namespace. this worked fine until we changed the type of the format string parameter of `seastar::format()` from `const char*` to `fmt::format_string<...>`. this change practically invited `seastar::format()` to the club of `std::format()` and `fmt::format()`, where all members accept a templated parameter as their `fmt` parameter, and `seastar::format()` is not the best candidate anymore. argument-dependent lookup (ADL for short) favors the function which is in the same namespace as its parameter, but `using namespace` makes `seastar::format()` more competitive, so both `std::format()` and `seastar::format()` are considered as candidates. that is what is happening in scylladb at quite a few call sites of `format()`, hence ADL is not able to tell which function is the winner in the name lookup: ``` /__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous 265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id()); | ^~~~~~ /usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 4290 | format(format_string<_Args...> __fmt, _Args&&... __args) | ^ /__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 143 | format(fmt::format_string<A...> fmt, A&&... a) { | ^ ``` in this change, we change all `format()` to either `fmt::format()` or `seastar::format()` with the following rules: - if the caller expects an `sstring` or `std::string_view`, change to `seastar::format()` - if the caller expects an `std::string`, change to `fmt::format()`, because `sstring::operator std::basic_string` would incur a deep copy. we will need another change to enable scylladb to compile with the latest seastar, namely, to pass the format string as a templated parameter down to helper functions which format their parameters. to minimize the scope of this change, let's include that change when bumping up the seastar submodule, as that change will depend on the seastar change. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-09-11 23:21:40 +03:00
Piotr Dulikowski	d98708013c	Merge 'view: move view_build_status to group0' from Michael Litvak Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations. The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table. New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table. The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility. When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows. 
Fixes scylladb/scylladb#15329 dtest https://github.com/scylladb/scylla-dtest/pull/4827 Closes scylladb/scylladb#19745 * github.com:scylladb/scylladb: view: test view_build_status table with node replace test/pylib: use view_build_status_v2 table in wait_for_view view_builder: common write view_build_status function view_builder: improve migration to v2 with intermediate phase view: delete node rows from view_build_status on node removal view: sanitize view_build_status during migration view: make old view_build_status table a virtual table replica: move streaming_reader_lifecycle_policy to header file view_builder: test view_build_status_v2 storage_service: add view_build_status to raft snapshot view_builder: migration to v2 db:system_keyspace: add view_builder_version to scylla_local view_builder: read view status from v2 table view_builder: introduce writing status mutations via raft view_builder: pass group0_client and qp to view_builder view_builder: extract sys_dist status operations to functions db:system_keyspace: add view_build_status_v2 table	2024-09-11 13:02:58 +02:00
Gleb Natapov	af83c5e53e	group0: stop group0 before draining storage service during shutdown Currently storage service is drained while group0 is still active. The draining stops commitlogs, so after this point no more writes are possible, but if group0 is still active it may try to apply commands which will try to do writes and they will fail causing group0 state machine errors. This is benign since we are shutting down anyway, but better to fix shutdown order to keep logs clean. Fixes scylladb/scylladb#19665	2024-09-10 13:15:56 +02:00
Evgeniy Naydanov	769424723b	test: error injections for Raft-based topology Add following error injections: - stop_after_init_of_system_ks - stop_after_init_of_schema_commitlog - stop_after_starting_gossiper - stop_after_starting_raft_address_map - stop_after_starting_migration_manager - stop_after_starting_commitlog - stop_after_starting_repair - stop_after_starting_cdc_generation_service - stop_after_starting_group0_service - stop_after_starting_auth_service - stop_during_gossip_shadow_round - stop_after_saving_tokens - stop_after_starting_gossiping - stop_after_sending_join_node_request - stop_after_setting_mode_to_normal_raft_topology - stop_before_becoming_raft_voter - topology_coordinator_pause_after_updating_cdc_generation - stop_before_streaming - stop_after_streaming - stop_after_bootstrapping_initial_raft_configuration	2024-09-05 22:11:31 +00:00
Michael Litvak	c1f3517a75	view_builder: improve migration to v2 with intermediate phase Add an intermediate phase to the view builder migration to v2 where we write to both the old and new table in order to not lose writes during the migration. We add an additional view builder version v1_5 between v1 and v2 where we write to both tables. We perform a barrier before moving to v2 to ensure all the operations to the old table are completed.	2024-09-05 15:42:35 +03:00
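The three-phase migration described in the commit above (v1, then the dual-write phase v1_5, then v2) can be modeled roughly as follows. This is a sketch under assumptions: the class and method names are hypothetical, the tables are plain dictionaries, the barrier is only marked by a comment, and the `concurrent_writes` argument merely simulates writes that arrive while the copy is in progress.

```python
from enum import Enum


class Version(Enum):
    V1 = "v1"
    V1_5 = "v1_5"
    V2 = "v2"


class ViewBuildStatusStore:
    """Hypothetical model of the old and new status tables; not ScyllaDB code."""

    def __init__(self):
        self.version = Version.V1
        self.old_table = {}
        self.new_table = {}

    def write(self, key, status):
        # v1 writes only the old table, v2 only the new one, v1_5 writes both.
        if self.version in (Version.V1, Version.V1_5):
            self.old_table[key] = status
        if self.version in (Version.V1_5, Version.V2):
            self.new_table[key] = status

    def migrate(self, concurrent_writes=()):
        # Phase 1: start dual writes so nothing written from now on is lost.
        self.version = Version.V1_5
        for key, status in concurrent_writes:
            self.write(key, status)
        # Phase 2: copy the rows of the old table into the new one.
        self.new_table.update(self.old_table)
        # A real system would wait on a barrier here so that all in-flight
        # operations on the old table complete before the switch.
        self.version = Version.V2
```

Because writes in the intermediate phase go to both tables, a write that races with the copy still ends up in the new table.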
Michael Litvak	fcf66ad541	storage_service: add view_build_status to raft snapshot Include the table system.view_build_status_v2 in the raft snapshot, and also the view_builder version parameter.	2024-09-05 15:42:30 +03:00
Michael Litvak	8d25a4d678	view_builder: migration to v2 Migrate view_builder to v2, to store the view build status of all nodes in the group0 based table view_build_status_v2. Introduce a feature view_build_status_on_group0 so we know when all nodes are ready to migrate and use the new table. A new cluster is initialized to use v2. Otherwise, the topology coordinator initiates the migration when the feature is enabled, if it was not done already. The migration reads all the rows in the v1 table and writes them via group0 to the v2 table, together with a mutation that updates the view_builder parameter in scylla_local to v2. When this mutation is applied, it updates the view_builder service to start using the v2 table.	2024-09-05 15:41:04 +03:00
Kamil Braun	79983723c8	storage_service: pass `_abort_source` to `hold_read_apply_mutex` There's no point waiting for this lock if `storage_service` is being aborted. In theory the lock, if held, should be eventually released by whatever is holding it during shutdown -- but if there is some cyclic reference between the services, and e.g. whatever holds the lock is stuck because of ongoing shutdown and would only be unstuck by `storage_service` getting stopped (which it can't because it's waiting on the lock), that would cause a shutdown deadlock. Better to be safe than sorry.	2024-09-03 15:52:05 +02:00
Kamil Braun	a4d1065628	api: move `reload_raft_topology_state` implementation inside `storage_service` In later commit we'll want to access more `storage_service` internals in the API's implementation (namely, `_abort_source`) Also moving the implementation there allows making `service::topology_transition()` private again (it was made public in `992f1327d3` only for this API implementation)	2024-09-03 15:52:03 +02:00
Kamil Braun	292ef0d1f9	Merge 'Fix node replace with inter-dc encryption enabled.' from Gleb Natapov Currently if a coordinator and a node being replaced are in the same DC while inter-dc encryption is enabled (connections between nodes in the same DC should not be encrypted) the replace operation will fail. It fails because a coordinator uses non encrypted connection to push raft data to the new node, but the new node will not accept such connection until it knows which DC the coordinator belongs to and for that the raft data needs to be transferred. The series adds the test for this scenario and the fix for the chicken&egg problem above. The series (or at least the fix itself) needs to be backported because this is a serious regression. Fixes: scylladb/scylladb#19025 Closes scylladb/scylladb#20290 * github.com:scylladb/scylladb: topology coordinator: fix indentation after the last patch topology coordinator: do not add replacing node without a ring to topology test: add test for replace in clusters with encryption enabled test.py: add server encryption support to cluster manager .gitignore: fix pattern for resources to match only one specific directory	2024-08-30 11:29:05 +02:00
Raphael S. Carvalho	26facd807e	storage_service: avoid processing same table unnecessarily in split monitor If there's a token metadata for a given table, and it is in split mode, it will be registered such that split monitor can look at it, for example, to start split work, or do nothing if the table completed it. During topology change, e.g. drain, split is stalled since it cannot take over the state machine. It was noticed that the log is being spammed with a message saying the table completed split work, since every tablet metadata update wakes up the monitor on behalf of a table. So it makes sense to demote the logging level to debug. That persists until drain completes and split can finally complete. Another thing that was noticed is that during drain, a table can be submitted for processing faster than the monitor can handle, so the candidate queue may end up with multiple duplicated entries for the same table, which means unnecessary work. That is fixed by using a sequenced set, which keeps the current FIFO behavior. Fixes #20339. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#20029	2024-08-29 19:38:43 +03:00
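The "sequenced set" idea from the commit above, a queue that drops duplicate submissions but still hands out entries in FIFO order, can be sketched as below. The class is hypothetical and relies on Python dictionaries preserving insertion order; it is not the container used in ScyllaDB.

```python
class SequencedSet:
    """FIFO queue without duplicates (illustrative sketch only)."""

    def __init__(self):
        self._items = {}  # dict keys keep insertion order

    def push(self, item):
        # Ignore an item that is already waiting to be processed.
        if item in self._items:
            return False
        self._items[item] = None
        return True

    def pop_front(self):
        # Remove and return the oldest item.
        item = next(iter(self._items))
        del self._items[item]
        return item

    def __len__(self):
        return len(self._items)
```

Submitting the same table several times before the monitor gets to it leaves a single entry, so the monitor processes it once.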
Gleb Natapov	32a59ba98f	topology coordinator: fix indentation after the last patch	2024-08-29 17:14:09 +03:00
Gleb Natapov	17f4a151ce	topology coordinator: do not add replacing node without a ring to topology When only inter dc encryption is enabled, a non encrypted connection between two nodes is allowed only if both nodes are in the same dc. If a node that initiates the connection knows that dst is in the same dc and hence uses a non encrypted connection, but the dst does not yet know the topology of the src, such a connection will not be allowed since dst cannot guarantee that src is in the same dc. Currently, when topology coordinator is used, a replacing node will appear in the coordinator's topology immediately after it is added to the group0. The coordinator will try to send raft message to the new node and (assuming only inter dc encryption is enabled and replacing node and the coordinator are in the same dc) it will try to open regular, non encrypted, connection to it. But the replacing node will not have the coordinator in its topology yet (it needs to sync the raft state for that), so it will reject such connection. To solve the problem the patch does not add a replacing node that was just added to group0 to the topology. It will be added later, when tokens are assigned to it. At this point a replacing node will already make sure that its topology state is up-to-date (since it will execute a raft barrier in join_node_response_params handler) and it knows coordinator's topology. This aligns replace behaviour with bootstrap since bootstrap also does not add a node without a ring to the topology. The patch effectively reverts `b8ee8911ca` Fixes: scylladb/scylladb#19025	2024-08-29 17:14:09 +03:00
Patryk Jędrzejczak	02bb70da19	treewide: support zero-token nodes in the recovery mode Before we implement the manual recovery tool, we must support zero-token nodes in the recovery mode. This means that two topology operations involving zero-token nodes must work in the gossip-based topology: - removing a dead zero-token node, - restarting a live zero-token node. We make changes necessary to make them work in this patch.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	574c252391	feature_service: introduce the ZERO_TOKEN_NODES feature Zero-token nodes must be supported by all nodes in the cluster. Otherwise, the non-supporting nodes would crash on some assertion that assumes only token-owning normal nodes make sense. Hence, we introduce the ZERO_TOKEN_NODES cluster feature. Zero-token nodes refuse to boot if it is not supported. I tested this patch manually. First, I booted a node built in the previous patch. Then, I tried to add a zero-token node built in this patch. It refused to boot as expected.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	c25eefe217	storage_service: rename join_token_ring to join_topology After introducing zero-token nodes that call join_token_ring but do not join the ring, the join_token_ring name does not make much sense.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	9937cf3a24	storage_service: raft_topology_cmd_handler: improve warnings	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	22d907e721	treewide: introduce support for zero-token nodes in Raft topology We revive the `join_ring` option. We support it only in the Raft-based topology, as we plan to remove the gossip-based topology when we fix the last blocker - the implementation of the manual recovery tool. In the Raft-based topology, a node can be assigned tokens only once when it joins the cluster. Hence, we disallow joining the ring later, which is possible in Cassandra. The main idea behind the solution is simple. We make the unsupported special case of zero tokens a supported normal case. Nodes with zero tokens assigned are called "zero-token nodes" from now on. From the topology point of view, zero-token nodes are the same as token-owning nodes. They can be in the same states, etc. From the data point of view, they are different. They are not members of the token ring, so they are not present in `token_metadata::_normal_token_owners`. Hence, they are ignored in all non-local replication strategies. The tablet load balancer also ignores them. Topology operations involving zero-token nodes are simplified: - `add` and `replace` finish in the `join_group0` state, so creating a new CDC generation and streaming are skipped, - `removenode` and `decommission` skip streaming, - `rebuild` does not even contact the topology coordinator as there is nothing to rebuild. Also, if the topology operation involves a token-owning node, zero-token nodes are ignored in streaming. Zero-token nodes can be used as coordinator-only nodes, just like in Cassandra. They can handle requests just like token-owning nodes. The main motivation behind zero-token nodes is that they can prevent the Raft majority loss efficiently. Zero-token nodes are group 0 voters, but they can run on much weaker and cheaper machines because they do not replicate data and, by default, do not handle client requests (drivers ignore them).
For example, if there are two DCs, one with 4 nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes, every DC will contain less than half of the nodes, so we won't lose the majority when any DC dies. Another way of preventing the Raft majority loss is changing the voter set, which is tracked by scylladb/scylladb#18793. That approach can be used together with zero-token nodes. In the example above, if we choose equal numbers of voters in both DCs, then a DC with one zero-token node will be sufficient. However, in the typical setup of 2 DCs with the same number of nodes it is enough to add a DC with only one zero-token node without changing the voter set. Zero-token nodes could also be used as load balancers in the Alternator.	2024-08-29 10:37:07 +02:00
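The majority arithmetic in the zero-token commit above can be checked with a few lines. This is a minimal sketch assuming every node is a group 0 voter; the function name and the DC labels are made up for illustration.

```python
def survives_any_dc_loss(voters_per_dc):
    """Return True if a Raft majority remains after losing any single DC."""
    total = sum(voters_per_dc.values())
    majority = total // 2 + 1  # strict majority of all voters
    return all(total - lost >= majority for lost in voters_per_dc.values())


# Two DCs with 4 and 5 nodes: losing the larger DC loses the majority.
two_dcs = {"dc1": 4, "dc2": 5}
# Adding a third DC with 2 zero-token nodes keeps every DC below half.
three_dcs = {"dc1": 4, "dc2": 5, "dc3": 2}
# Two equal DCs need only one extra zero-token node.
equal_dcs = {"dc1": 3, "dc2": 3, "dc3": 1}
```

With 11 voters the majority is 6, and the worst case (losing the 5-node DC) still leaves 6 voters, which matches the example in the commit message.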
Patryk Jędrzejczak	ed55261650	treewide: distinguish all nodes from all token owners In one of the following patches, we introduce support for zero-token nodes. From that point, getting all nodes and getting all token owners isn't equivalent. In this patch, we ensure that we consider only token owners when we want to consider only token owners (for example, in the replication logic), and we consider all nodes when we want to consider all nodes (for example, in the topology logic). The main purpose of this patch is to make the PR introducing zero-token nodes easier to review. The patch that introduces zero-token nodes is already complicated. We don't want trivial changes from this patch to make noise there. This patch introduces changes needed for zero-token nodes only in the Raft-based topology and in the recovery mode. Zero-token nodes are unsupported in the gossip-based topology outside recovery. Some functions added to `token_metadata` and `topology` are inefficient because they compute a new data structure in every call. They are never called in the hot path, so it's not a serious problem. Nevertheless, we should improve it somehow. Note that it's not obvious how to do it because we don't want to make `token_metadata` store topology-related data. Similarly, we don't want to make `topology` store token-related data. We can think of an improvement in a follow-up. We don't remove unused `topology::get_datacenter_rack_nodes` and `topology::get_datacenter_nodes`. These functions can be useful in the future. Also, `topology::_dc_nodes` is used internally in `topology`.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	2d9575d6a9	gossip topology: make a replacing node remove the replaced node from topology In the following patch, we change the gossiper to work the same for zero-token nodes and token-owning nodes. We replace occurrences of `is_normal_token_owner` with topology-based conditions. We want to rely on the invariant that token-owning nodes own tokens if and only if they are in the normal or leaving state. However, this invariant is broken by a replacing node because it does not remove the replaced node from topology. Hence, after joining, the replacing node has topology with a node that is not a token owner anymore but is in a leaving state (`being_replaced`). We fix it to prevent the following patch from introducing a regression.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	c7016dedb3	locator: topology: add_or_update_endpoint: use none as the default node state In one of the following patches, we change the gossiper to work the same for zero-token nodes and token-owning nodes. We replace occurrences of `is_normal_token_owner` with topology-based conditions. We want to rely on the invariant that token-owning nodes own tokens if and only if they are in the normal or leaving state. However, this invariant can be broken in the gossip-based topology when a new node joins the cluster. When a bootstrapping node starts gossiping, other nodes add it to their topology in `storage_service::on_alive`. Surprisingly, the state of the new node is set to `normal`, as it's the default value used by `add_or_update_endpoint`. Later, the state will be set to `bootstrapping` or `replacing`, and finally it will be set again to `normal` when the join operation finishes. We fix this strange behavior by setting the node state to `none` in `storage_service::on_alive` for nodes not present in the topology. Note that we must add such nodes to the topology. Other code needs their Host ID, IP, and location. We change the default node state from `normal` to `none` in `add_or_update_endpoint` to prevent bugs like the one in `storage_service::on_alive`. Also, we ensure that nodes in the `none` state are ignored in the getters of `locator::topology`.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	366605224c	token_metadata: rename get_all_endpoints and get_all_ips In one of the following patches, we introduce support for zero-token nodes. A zero-token node that has successfully joined the cluster is in the normal state but is not a normal token owner. Hence, the names of `get_all_endpoints` and `get_all_ips` become misleading. They should specify that the functions return only IDs/IPs of token owners.	2024-08-29 10:37:07 +02:00
Benny Halevy	18c45f7502	raft_rebuild: propagate source_dc force option to rebuild_option Currently, the `force` property of the `source_dc` rebuild option is lost and `raft_topology_cmd_handler` has no way to know if it was given or not. This in turn can cause rebuild to fail, even when `--force` is set by the user, where it would succeed with gossip topology changes, based on the source_dc --force semantics. Fixes scylladb/scylladb#20242 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#20249	2024-08-27 17:05:48 +02:00
Benny Halevy	686a8f2939	abstract_replication_strategy: make get_ranges async To prevent stalls due to large number of tokens. For example, large cluster with say 70 nodes can have more than 16K tokens. Fixes #19757 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-08-25 10:57:34 +03:00
Tomasz Grabiec	ff52527c54	Merge 'repair: do_rebuild_replace_with_repair: use source_dc only when safe' from Benny Halevy It is unsafe to restrict the sync nodes for repair to the source data center if it has too low replication factor in network_topology_replication_strategy, or if other nodes in that DC are ignored. Also, this change restricts the usage of source_dc to `network_topology` and `everywhere_topology` strategies, as with simple replication strategy there is no guarantee that there would be any more replicas in that data center. Fixes #16826 Reproducer submitted as https://github.com/scylladb/scylla-dtest/pull/3865 It fails without this fix and passes with it. * Requires backport to live versions. Issue hit in the field with 2022.2.14 Closes scylladb/scylladb#16827 * github.com:scylladb/scylladb: repair: do_rebuild_replace_with_repair: use source_dc only when safe repair: replace_with_repair: pass the replace_node downstream repair: replace_with_repair: pass ignore_nodes as a set of host_id:s repair: replace_rebuild_with_repair: pass ks_erms from caller nodetool: rebuild: add force option Add and use utils::optional_param to pass source_dc	2024-08-20 16:13:23 +02:00
Benny Halevy	8665eef98c	repair: replace_with_repair: pass the replace_node downstream To be used by the next patch to count how many nodes are lost in each datacenter. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-08-19 17:23:33 +03:00
Benny Halevy	9729dd21c3	repair: replace_with_repair: pass ignore_nodes as a set of host_id:s The callers already pass ignore_nodes as host_id:s and we translate them into inet_address only for repair, so delay the translation as much as possible. Refs scylladb/scylladb#6403 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-08-19 17:22:01 +03:00
Benny Halevy	b5d0ab092c	repair: replace_rebuild_with_repair: pass ks_erms from caller The keyspaces replication maps must be in sync with the token_metadata_ptr passed already to the functions, so instead of getting it in the callee, let the caller get the ks_erms along with retrieving the tmptr. Note that it's already done on the rebuild path for streaming based rebuild. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-08-19 17:20:27 +03:00
Benny Halevy	8b1877f3ca	Add and use utils::optional_param to pass source_dc Clearly indicate if a source_dc is provided, and if so, was it explicitly given by the user, or was implicitly selected by scylla. This will become useful in the next patches that will use that to either reject the operation if it's unsafe to use the source_dc and the dc was explicitly given by the user, or whether to fallback to using all nodes otherwise. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-08-19 17:13:54 +03:00
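The distinction drawn in the commit above, between a source_dc given explicitly by the user and one selected implicitly, can be sketched as follows. `OptionalParam` and `choose_source_dc` are hypothetical names for illustration only; they are not the interface of `utils::optional_param`, and the `is_safe` flag stands in for whatever safety check the later patches perform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OptionalParam:
    """Hypothetical value that remembers whether the user supplied it."""
    value: Optional[str] = None
    user_provided: bool = False


def choose_source_dc(param, is_safe):
    """Return the DC to restrict to, or None to use all nodes."""
    if param.value is None or is_safe:
        return param.value
    if param.user_provided:
        # The user asked for this DC explicitly, so refuse rather than ignore it.
        raise ValueError(f"source_dc {param.value} is unsafe to use")
    # Implicitly selected DC: quietly fall back to using all nodes.
    return None
```

Keeping the "who chose it" bit next to the value lets the callee decide between rejecting the operation and falling back without extra arguments.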
Laszlo Ersek	baccbc09c5	gms/gossiper: return "strong_ordering" from compare_endpoint_startup() The callers of gossiper::compare_endpoint_startup() need not (should not) learn of any particular (tagged or untagged) difference of generations; they only care about the ordering of generations. Change the return type of compare_endpoint_startup() to "std::strong_ordering", and delegate the comparison to tagged_tagged_integer::operator<=>. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2024-08-14 13:35:08 +02:00
Botond Dénes	5bff422b54	service/storage_service: load_tablet_metadata(): add hint parameter Allowing for reloading only those parts of the tablet metadata that were actually changed.	2024-08-11 09:53:19 -04:00
Botond Dénes	2cec0d8dd1	service/migration_listener: update_tablet_metadata(): add hint parameter The hint contains information related to what exactly changed, allowing listeners to do partial updates, instead of reloading all metadata on each notification.	2024-08-11 09:53:19 -04:00
Botond Dénes	806ec3244a	service/storage_service: topology_state_load(): allow providing change hint So that when reloading state from disk, only changed parts are reloaded instead of all. For now, only tablets have hints implemented.	2024-08-11 09:53:18 -04:00
Botond Dénes	0254cfc7d3	locator/tablets: make tablet_metadata cheap to copy Keep lw_shared_ptr<tablet_map> in the tablet map and use COW semantics. To prevent accidental changes to shared tablet_map instances, all modifications to a tablet_map have to go through a new `mutate_tablet_map()` method, which implements the copy-modify-swap idiom.	2024-08-11 09:52:37 -04:00
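The copy-modify-swap idiom mentioned in the commit above can be sketched as below. This is an illustration under assumptions: the Python class only mimics the idea of sharing per-table maps between copies, tablet maps are plain dictionaries here, and only the method name `mutate_tablet_map` is taken from the commit message.

```python
import copy


class TabletMetadata:
    """Illustrative copy-on-write container; not the ScyllaDB class."""

    def __init__(self, tablets=None):
        # table id -> tablet map, shared between copies and never edited in place
        self._tablets = dict(tablets or {})

    def copy(self):
        # Cheap: copies only the references to the shared tablet maps.
        return TabletMetadata(self._tablets)

    def get_tablet_map(self, table):
        return self._tablets[table]

    def mutate_tablet_map(self, table, func):
        # Copy the shared map, modify the private copy, then swap it in.
        new_map = copy.deepcopy(self._tablets[table])
        func(new_map)
        self._tablets[table] = new_map
```

Other holders of the old map keep seeing the unmodified version, which is what makes sharing the maps between copies safe.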
Michał Jadwiszczak	5f8132c13c	service/raft/group0_state_machine: update effective service levels cache Updates to `system.role_members` and `system.role_attributes` affect effective service levels cache, so applying mutations to those tables should reload the effective SL cache.	2024-08-08 10:42:09 +02:00
Michał Jadwiszczak	be4c83ad3c	service/qos: define effective service level Write down definitions of `service level` and `effective service level` in service/qos/service_level_controller.hh. Until now, effective service level was only used as result of `LIST EFFECTIVE SERVICE LEVEL OF <role>`. Now we want to have quick access to effective service level of each role and introduce cache of effective sl to do it. New definitions clarify things. The commit also renames: - `update_service_levels_from_distributed_data` -> `update_service_levels_cache` Later we will introduce effective_service_level_cache, so this change standardizes the names. - `find_service_level` -> `find_effective_service_level` The function actually returns effective service level.	2024-08-08 10:42:09 +02:00
Kamil Braun	4181a1c53e	storage_service: raft topology: warn when `raft_topology_cmd_handler` fails due to abort Currently we print an ERROR on all exceptions in `raft_topology_cmd_handler`. This log level is too high, in some cases exceptions are expected -- like during shutdown. And it causes dtest failures. Turn exceptions from aborts into WARN level. Also improve logging by printing the command that failed. Fixes scylladb/scylladb#19754 Closes scylladb/scylladb#19935	2024-08-07 17:57:23 +02:00
Kamil Braun	f348f33667	raft topology: improve logging Add more logging for raft-based topology operations in INFO and DEBUG levels. Improve the existing logging, adding more details. Fix a FIXME in test_coordinator_queue_management (by readding a log message that was removed in the past -- probably by accident -- and properly awaiting for it to appear in test). Enable group0_state_machine logging at TRACE level in tests. These logs are relatively rare (group 0 commands are used for metadata operations) and relatively small, mostly consist of printing `system.group0_history` mutation in the applied command, for example: ``` TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - apply() is called with 1 commands TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd: prev_state_id: optional(dd9d47c6-50ee-11ef-d77f-500b8e1edde3), new_state_id: dd9ea5c6-50ee-11ef-ae64-dfbcd08d72c3, creator_addr: 127.219.233.1, creator_id: 02679305-b9d1-41ef-866d-d69be156c981 TRACE 2024-08-02 18:47:12,238 [shard 0: gms] group0_raft_sm - cmd.history_append: {canonical_mutation: table_id 027e42f5-683a-3ed7-b404-a0100762063c schema_version c9c345e1-428f-36e0-b7d5-9af5f985021e partition_key pk{0007686973746f7279} partition_tombstone {tombstone: none}, row tombstone {range_tombstone: start={position: clustered, ckp{0010b4ba65c64b6e11ef8080808080808080}, 1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=1722617232237511, deletion_time=1722617232}}{row {position: clustered, ckp{0010dd9ea5c650ee11efae64dfbcd08d72c3}, 0} tombstone {row_tombstone: none} marker {row_marker: 1722617232237511 0 0}, column description atomic_cell{ create system_distributed keyspace; create system_distributed_everywhere keyspace; create and update system_distributed(_everywhere) tables,ts=1722617232237511,expiry=-1,ttl=0}}} ``` note that the mutation contains a human-readable description of the command -- like "create system_distributed keyspace" above. 
These logs might help debugging various issues (e.g. when `apply` hangs waiting for read_apply mutex, or takes too long to apply a command). Ref: scylladb/scylladb#19105 Ref: scylladb/scylladb#19945 Closes scylladb/scylladb#19998	2024-08-06 11:50:16 +03:00
Avi Kivity	aa1270a00c	treewide: change assert() to SCYLLA_ASSERT() assert() is traditionally disabled in release builds, but not in scylladb. This hasn't caused problems so far, but the latest abseil release includes a commit [1] that causes a 1000 insn/op regression when NDEBUG is not defined. Clearly, we must move towards a build system where NDEBUG is defined in release builds. But we can't just define it blindly without vetting all the assert() calls, as some were written with the expectation that they are enabled in release mode. To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT() macro in utils/assert.hh. This macro is always defined and is not conditional on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release mode. [1] `66ef711d68` Closes scylladb/scylladb#20006	2024-08-05 08:23:35 +03:00
Piotr Dulikowski	44f327675d	Merge 'Remove gossiper argument from storage_service::join_cluster()' from Pavel Emelyanov It's only needed to start hints via proxy, but proxy can do it without the gossiper argument Closes scylladb/scylladb#19894 * github.com:scylladb/scylladb: storage_service: Remove gossiper argument from join_cluster() proxy: Use remote gossiper to start hints resource manager hints: Const-ify gossiper references and anchor pointers	2024-08-01 10:18:14 +02:00
Emil Maskovsky	2dbe9ef2f2	raft: use the abort source reference in raft group0 client interface Most callers of the raft group0 client interface are passing a real source instance, so we can use the abort source reference in the client interface. This change makes the code simpler and more consistent.	2024-07-31 09:18:54 +02:00
Nadav Har'El	ca8b91f641	test: increase timeouts for /localnodes test In commit `bac7c33313` we introduced a new test for the Alternator "/localnodes" request, checking that a node that is still joining does not get returned. The tests used what I thought were "very high" timeouts - we had a timeout of 10 seconds for starting a single node, and injected a 20 second sleep to leave us 10 seconds after the first sleep. But the test failed in one extremely slow run (a debug build on aarch64), where starting just a single node took more than 15 seconds! So in this patch I increase the timeouts significantly: We increase the wait for the node to 60 seconds, and the sleeping injection to 120 seconds. These should definitely be enough for anyone (famous last words...). The test doesn't actually wait for these timeouts, so the ridiculously high timeouts shouldn't affect the normal runtime of this test. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#19916	2024-07-30 10:41:48 +03:00
Pavel Emelyanov	aaad2bbeaf	storage_service: Remove gossiper argument from join_cluster() This pointer was only needed to pass it all the way down to the hints resource manager start() method. It's no longer needed for that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-07-26 16:29:58 +03:00
Pavel Emelyanov	a1dbaba9e1	proxy: Use remote gossiper to start hints resource manager By the time the hints resource manager is started, proxy already has its remote part initialized. Remote returns a const gossiper pointer, but after the previous change the hints code can live with it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-07-26 16:29:03 +03:00
Botond Dénes	84db147c58	Merge 'tasks: introduce virtual tasks' from Aleksandra Martyniuk Introduce virtual tasks - task manager tasks which cover cluster-wide operations. Virtual tasks aren't kept in memory, instead their statuses are retrieved from associated service when user requests them with task manager API. From API users' perspective, virtual tasks behave similarly to regular tasks, but they can be queried from any node in a cluster. Virtual tasks cannot have a parent task. They can have children on each node in a cluster, but do not keep references to them. So, if a direct child of a virtual task is unregistered from task manager, it will no longer be shown in parent's children vector. virtual_task class corresponds to all virtual tasks in one group. If users want to list all tasks in a module, a virtual_task returns all recent supported operations; if they request virtual task's status - info about the one specified operation is presented. Time to live, number of tracked operations etc. depend on the implementation of individual virtual_task. All virtual_tasks are kept only on shard 0. Refs: https://github.com/scylladb/scylladb/issues/15852 New feature, no backport needed. 
Closes scylladb/scylladb#16374 * github.com:scylladb/scylladb: docs: describe virtual tasks db: node_ops: filter topology request entries test: add a topology suite for testing tasks node_ops: service: create streaming tasks node_ops: register node_ops_virtual_task in task manager service: node_ops: keep node ops module in storage service node_ops: implement node_ops_virtual_task methods db: service: modify methods to get topology_requests data db: service: add request type column to topology_requests node_ops: add task manager module and node_ops_virtual_task tasks: api: add virtual task support to get_task_status_recursively tasks: api: add virtual task support tasks: api: add virtual tasks support to get_tasks tasks: add task_handler to hide task and virtual_task differences from user tasks: modify invoke_on_task tasks: implement task_manager::virtual_task::impl::get_children tasks: keep virtual tasks in task manager tasks: introduce task_manager::virtual_task	2024-07-24 08:34:28 +03:00
Aleksandra Martyniuk	36b77c0592	test: add a topology suite for testing tasks Add topology_tasks test suite for testing task manager's node ops tasks. Add TaskManagerClient to topology_tasks for an easy usage of task manager rest api. Write a test for bootstrap, replace, rebuild, decommission and remove top level tasks using the above.	2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk	a903971a74	node_ops: service: create streaming tasks Create tasks which cover streaming part of topology changes. These tasks are children of respective node_ops_virtual_task.	2024-07-23 13:35:01 +02:00


2104 Commits