scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-04 14:03:06 +00:00

Author	SHA1	Message	Date
Kefu Chai	24d14b601b	treewide: s/boost::adaptors::map_values/std::views::values/ now that we are allowed to use C++23. we now have the luxury of using `std::views::values`. in this change, we: - replace `boost::adaptors::map_values` with `std::views::values` - update affected code to work with `std::views::values` - the places where we use `boost::join()` are not changed, because we cannot use `std::views::concat` yet. this helper is only available in C++26. to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21265	2024-10-27 21:32:45 +02:00
Benny Halevy	04d741bcbb	storage_service: on_change: update_peer_info only if peer info changed Return an optional peer_info from get_peer_info_for_update when the `app_state_map` arg does not change peer_info, so that we can skip calling update_peer_info, if it didn't change. Fixes scylladb/scylladb#20991 Refs scylladb/scylladb#16376 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#21152	2024-10-22 10:26:08 +02:00
Kefu Chai	6ead5a4696	treewide: move log.hh into utils/log.hh the log.hh under the root of the tree was created to keep backward compatibility when seastar was extracted into a separate library. so log.hh should belong to the `utils` directory, as it is based solely on seastar, and can be used by all subsystems. in this change, we move log.hh into utils/log.hh so that it is more modularized. this also improves readability: when one sees `#include "utils/log.hh"`, it is obvious that this source file needs the logging system, instead of its own log facility -- please note, we do have two other `log.hh` in the tree. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-10-22 06:54:46 +03:00
Kefu Chai	5cd619a60c	treewide: s/boost::adaptors::map_keys/std::views::keys/ now that we are allowed to use C++23. we now have the luxury of using `std::views::keys`. in this change, we: - replace `boost::adaptors::map_keys` with `std::views::keys` - update affected code to work with `std::views::keys` to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21198	2024-10-21 12:47:52 +03:00
Avi Kivity	c3be2489ce	treewide: drop includes of <boost/range/adaptors.hpp> This includes way too much, including <boost/regex.hpp>, which is huge. Drop includes of adaptors.hpp and replace by what is needed. Closes scylladb/scylladb#21187	2024-10-20 17:17:11 +03:00
Emil Maskovsky	a840949ea0	treewide: code cleanup and refactoring Fix the clang-tidy warnings, code cleanup and improvements. Applied the clang format to the updated places.	2024-10-08 20:53:54 +02:00
Kamil Braun	1b9337bf99	Merge 'Wait for all users of group0 server to complete before destroying it' from Gleb Natapov The group0 server is often used in asynchronous contexts, but we do not wait for those users to complete before destroying the server. We already have a shutdown gate for it, so let's use it in those async functions. Also make sure to signal the group0 abort source if initialization fails. Fixes scylladb/scylladb#20701 Backport to 6.2 since it contains `af83c5e53e` and it made the race easier to hit, so tests became flaky. Closes scylladb/scylladb#20891 * github.com:scylladb/scylladb: group: hold group0 shutdown gate during async operations group0: Stop group0 if node initialization fails	2024-10-08 13:46:54 +02:00
Gleb Natapov	e642f0a86d	group: hold group0 shutdown gate during async operations Wait for all outstanding async work that uses group0 to complete before destroying group0 server. Fixes scylladb/scylladb#20701	2024-10-06 17:20:52 +03:00
Gleb Natapov	ba22493a69	group0: Stop group0 if node initialization fails Commit `af83c5e53e` moved aborting of group0 into the storage service drain function. But it is not called if the node fails during initialization (for instance, if it failed to join the cluster). So let's abort on both paths (but only once).	2024-10-06 17:20:52 +03:00
Botond Dénes	07094c3e44	Merge 'replica: Fix tombstone GC during tablet split preparation' from Raphael "Raph" Carvalho During split prepare phase, there will be more than 1 compaction group with overlapping token range for a given replica. Assume tablet 1 has sstable A containing deleted data, and sstable B containing a tombstone that shadows data in A. Then split starts: 1) sstable B is split first, and moved from main (unsplit) group to a split-ready group 2) now compaction runs in split-ready group before sstable A is split tombstone GC logic today only looks at underlying group, so compaction in step 2 will discard the deleted data in A, since it belongs to another group (the unsplit one), and so the tombstone can be purged incorrectly. To fix it, compaction will now work with all uncompacting sstables that belong to the same replica, since tombstone GC requires all sstables that possibly contain shadowed data to be available for correct decision to be made. Fixes https://github.com/scylladb/scylladb/issues/20044. Branches 6.0, 6.1 and 6.2 are vulnerable, so backport is needed. Closes scylladb/scylladb#20939 * github.com:scylladb/scylladb: replica: Fix tombstone GC during tablet split preparation service: Improve error handling for split	2024-10-04 10:29:42 +03:00
Raphael S. Carvalho	bcd358595f	service: Improve error handling for split Retry wasn't really happening since the loop was broken and sleep part was skipped on error. Also, we were treating abort of split during shutdown as if it were an actual error and that confused longevity tests that parse for logs with error level. The fix is about demoting the level of logs when we know the exception comes from shutdown. Fixes #20890.	2024-10-02 11:23:44 -03:00
Sergey Zolotukhin	6398b7548c	config: Add a warning about use of IP address for join topology and replace operations. When the '--ignore-dead-nodes-for-replace' config option contains IP addresses, a warning will be logged, notifying the user that using IP addresses with this option is deprecated and will no longer be supported in the next release. Fixes scylladb/scylladb#19218	2024-10-02 11:56:59 +02:00
Sergey Zolotukhin	3b9033423d	utils: Optimizations for utils::split_comma_separated_list and usage of host_id_or_endpoint lists - utils::split_comma_separated_list now accepts a reference to sstring instead of a copy to avoid extra memory allocations. Additionally, the results of trimming are moved to the resulting vector instead of being copied. - service/storage_service removenode, raft_removenode, find_raft_nodes_from_hoeps, parse_node_list and api/storage_service::set_storage_service were changed to use std::vector<host_id_or_endpoint> instead of std::list<host_id_or_endpoint> as std::vector is a more cache-friendly structure, resulting in better performance.	2024-10-02 11:56:59 +02:00
Kamil Braun	9224e48d6b	Merge 'Populate raft address map from gossiper on raft configuration change' from Gleb Natapov For each new node added to the raft config populate its ID to IP mapping in raft address map from the gossiper. The mapping may have expired if a node is added to the raft configuration long after it first appears in the gossiper. Fixes scylladb/scylladb#20600 Backport to all supported versions since the bug may cause bootstrapping failure. Closes scylladb/scylladb#20601 * github.com:scylladb/scylladb: test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join group0: make sure that address map has an entry for each new node in the raft configuration	2024-09-26 12:41:25 +02:00
Gleb Natapov	9e4cd32096	test: extend existing test to check that a joining node can map addresses of all pre-existing nodes during join	2024-09-25 17:10:09 +03:00
Kamil Braun	7d8f1d251a	Merge 'Mark node as being replaced earlier' from Gleb Natapov Before `17f4a151ce` the node was marked as being replaced in the join_group0 state, before it actually joins group0, so by the time it actually joins and starts transferring the snapshot/log no traffic is sent to it. The commit changed this to mark the node as being replaced after the snapshot/log is already transferred, so traffic can reach the node while it still has not caught up with the leader, and this may cause problems since the state is not complete. Mark the node as being replaced earlier, but still add the new node to the topology later as the commit above intended. Fixes: scylladb/scylladb#20629 Needs to be backported since this is a regression Closes scylladb/scylladb#20743 * github.com:scylladb/scylladb: test: amend test_replace_reuse_ip test to check that there is no stale writes after snapshot transfer starts topology coordinator:: mark node as being replaced earlier topology coordinator: do metadata barrier before calling finish_accepting_node() during replace	2024-09-25 15:46:12 +02:00
Avi Kivity	d16ea0afd6	Merge 'cql3: Extend DESC SCHEMA by auth and service levels' from Dawid Mędrek Auth has been managed via Raft since Scylla 6.0. Restoring data following the usual procedure (1) is error-prone and so a safer method must have been designed and implemented. That's what happens in this PR. We want to extend `DESC SCHEMA` by auth and service levels to provide a safe way to backup and restore those two components. To realize that, we change the meaning of `DESC SCHEMA WITH INTERNALS` and add a new "tier": `DESC SCHEMA WITH INTERNALS AND PASSWORDS`. * `DESC SCHEMA` -- no change, i.e. the statement describes the current schema items such as keyspaces, tables, views, UDTs, etc. * `DESC SCHEMA WITH INTERNALS` -- does the same as the previous tier and also describes auth and service levels. No information about passwords is returned. * `DESC SCHEMA WITH INTERNALS AND PASSWORDS` -- does the same as the previous tier and also includes information about the salted hashes corresponding to the passwords of roles. To restore existing roles, we extend the `CREATE ROLE` statement by allowing to use the option `WITH SALTED HASH = '[...]'`. --- Implementation strategy: * Add missing things/adjust existing ones that will be used later. * Implement creating a role with salted hash. * Add tests for creating a role with salted hash. * Prepare for implementing describe functionality of auth and service levels. * Implement describe functionality for elements of auth and service levels. * Extend the grammar. * Add tests for describe auth and service levels. * Add/update documentation. --- (1): https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/backup-restore/restore.html In case the link stops working, restoring a schema was realised by managing raw files on disk. 
Fixes scylladb/scylladb#18750 Fixes scylladb/scylladb#18751 Fixes scylladb/scylladb#20711 Closes scylladb/scylladb#20168 * github.com:scylladb/scylladb: docs: Update user documentation for backup and restore docs/dev: Add documentation for DESC SCHEMA test: Add tests for describing auth and service levels cql3/functions/user_function: Remove newline character before and after UDF body cql3: Implement DESCRIBE SCHEMA WITH INTERNALS AND PASSWORDS auth: Implement describing auth auth/authenticator: Add member functions for querying password hash service/qos/service_level_controller: Describe service levels data_dictionary: Remove keyspace_element.hh treewide: Start using new overloads of describe treewide: Fix indentation in describe functions treewide: Return create statement optionally in describe functions treewide: Add new describe overloads to implementations of data_dictionary::keyspace_element treewide: Start using schema::ks_name() instead of schema::keyspace_name() cql3: Refactor `description` cql3: Move description to dedicated files test: Add tests for `CREATE ROLE WITH SALTED HASH` cql3/statements: Restrict CREATE ROLE WITH SALTED HASH auth: Allow for creating roles with SALTED HASH types: Introduce a function `cql3_type_name_without_frozen()` cql3/util: Accept std::string_view rather than const sstring&	2024-09-24 21:44:32 +03:00
Abhinav	36d68ec955	raft topology: add error for removal of non-normal nodes Currently, we check if a node being removed is normal on the node initiating the removenode request. However, we don't have a similar check on the topology coordinator. The node being removed could be normal when we initiate the request, but it doesn't have to be normal when the topology coordinator starts handling the request. For example, the topology coordinator could have removed this node while handling another removenode request that was added to the request queue earlier. This commit intends to fix this issue by adding more checks in the enqueuing phase and returning errors for duplicate requests for node removal. This PR fixes a bug. Hence we need to backport it. Fixes: scylladb/scylladb#20271 Closes scylladb/scylladb#20500	2024-09-24 16:11:19 +02:00
Dawid Mędrek	39cf106151	treewide: Start using schema::ks_name() instead of schema::keyspace_name() We're going to remove the interface `data_dictionary::keyspace_element`. As `schema::keyspace_name()` is an implementation of one of the methods specified by that interface, we replace its uses by `schema::ks_name()`. `schema::keyspace_name()` was an alias for it, so no semantic change has occurred.	2024-09-20 14:24:53 +02:00
Gleb Natapov	c0939d86f9	topology coordinator:: mark node as being replaced earlier Before `17f4a151ce` the node was marked as being replaced in the join_group0 state, before it actually joins group0, so by the time it actually joins and starts transferring the snapshot/log no traffic is sent to it. The commit changed this to mark the node as being replaced after the snapshot/log is already transferred, so traffic can reach the node while it still has not caught up with the leader, and this may cause problems since the state is not complete. Mark the node as being replaced earlier, but still add the new node to the topology later as the commit above intended.	2024-09-19 15:23:48 +03:00
Benny Halevy	574a08ed96	storage_service: rebuild: warn about tablets-enabled keyspaces Until we automatically support rebuild for tablets-enabled keyspaces, warn the user about them. The reason this is not an error, is that after increasing RF in a new datacenter, the current procedure is to run `nodetool rebuild` on all nodes in that dc to rebuild the new vnode replicas. This is not required for tablets, since the additional replicas are rebuilt automatically as part of ALTER KS. However, `nodetool rebuild` is also run after local data loss (e.g. due to corruption and removal of sstables). In this case, rebuild is not supported for tablets-enabled keyspaces, as tablet replicas that had lost data may have already been migrated to other nodes, and rebuilding the requested node will not know about it. It is advised to repair all nodes in the datacenter instead. Refs scylladb/scylladb#17575 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#20375	2024-09-19 14:25:46 +03:00
Pavel Emelyanov	36863d4ad0	sstables_manager: Remove table_dir from make_sstable() It used to be passed to sstable constructor, but now it doesn't need this argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-09-13 16:49:50 +03:00
Kefu Chai	3e84d43f93	treewide: use seastar::format() or fmt::format() explicitly before this change, we rely on `using namespace seastar` to use `seastar::format()` without qualifying the `format()` with its namespace. this worked fine until we changed the parameter type of the format string of `seastar::format()` from `const char*` to `fmt::format_string<...>`. this change practically invited `seastar::format()` to the club of `std::format()` and `fmt::format()`, where all members accept a templated parameter as their `fmt` parameter. and `seastar::format()` is not the best candidate anymore. although argument-dependent lookup (ADL for short) favors the function which is in the same namespace as its parameter, `using namespace` makes `seastar::format()` more competitive, so both `std::format()` and `seastar::format()` are considered as candidates. that is what is happening in scylladb at quite a few call sites of `format()`, hence ADL is not able to tell which function is the winner in the name lookup: ``` /__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous 265 \| return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id()); \| ^~~~~~ /usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 4290 \| format(format_string<_Args...> __fmt, _Args&&... __args) \| ^ /__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 143 \| format(fmt::format_string<A...> fmt, A&&... 
a) { \| ^ ``` in this change, we change all `format()` to either `fmt::format()` or `seastar::format()` with the following rules: - if the caller expects an `sstring` or `std::string_view`, change to `seastar::format()` - if the caller expects an `std::string`, change to `fmt::format()`. because, `sstring::operator std::basic_string` would incur a deep copy. we will need another change to enable scylladb to compile with the latest seastar. namely, to pass the format string as a templated parameter down to helper functions which format their parameters. to minimize the scope of this change, let's include that change when bumping up the seastar submodule. as that change will depend on the seastar change. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-09-11 23:21:40 +03:00
Piotr Dulikowski	d98708013c	Merge 'view: move view_build_status to group0' from Michael Litvak Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations. The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table. New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table. The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility. When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows. 
Fixes scylladb/scylladb#15329 dtest https://github.com/scylladb/scylla-dtest/pull/4827 Closes scylladb/scylladb#19745 * github.com:scylladb/scylladb: view: test view_build_status table with node replace test/pylib: use view_build_status_v2 table in wait_for_view view_builder: common write view_build_status function view_builder: improve migration to v2 with intermediate phase view: delete node rows from view_build_status on node removal view: sanitize view_build_status during migration view: make old view_build_status table a virtual table replica: move streaming_reader_lifecycle_policy to header file view_builder: test view_build_status_v2 storage_service: add view_build_status to raft snapshot view_builder: migration to v2 db:system_keyspace: add view_builder_version to scylla_local view_builder: read view status from v2 table view_builder: introduce writing status mutations via raft view_builder: pass group0_client and qp to view_builder view_builder: extract sys_dist status operations to functions db:system_keyspace: add view_build_status_v2 table	2024-09-11 13:02:58 +02:00
Gleb Natapov	af83c5e53e	group0: stop group0 before draining storage service during shutdown Currently storage service is drained while group0 is still active. The draining stops commitlogs, so after this point no more writes are possible, but if group0 is still active it may try to apply commands which will try to do writes and they will fail causing group0 state machine errors. This is benign since we are shutting down anyway, but better to fix shutdown order to keep logs clean. Fixes scylladb/scylladb#19665	2024-09-10 13:15:56 +02:00
Evgeniy Naydanov	769424723b	test: error injections for Raft-based topology Add following error injections: - stop_after_init_of_system_ks - stop_after_init_of_schema_commitlog - stop_after_starting_gossiper - stop_after_starting_raft_address_map - stop_after_starting_migration_manager - stop_after_starting_commitlog - stop_after_starting_repair - stop_after_starting_cdc_generation_service - stop_after_starting_group0_service - stop_after_starting_auth_service - stop_during_gossip_shadow_round - stop_after_saving_tokens - stop_after_starting_gossiping - stop_after_sending_join_node_request - stop_after_setting_mode_to_normal_raft_topology - stop_before_becoming_raft_voter - topology_coordinator_pause_after_updating_cdc_generation - stop_before_streaming - stop_after_streaming - stop_after_bootstrapping_initial_raft_configuration	2024-09-05 22:11:31 +00:00
Michael Litvak	c1f3517a75	view_builder: improve migration to v2 with intermediate phase Add an intermediate phase to the view builder migration to v2 where we write to both the old and new table in order to not lose writes during the migration. We add an additional view builder version v1_5 between v1 and v2 where we write to both tables. We perform a barrier before moving to v2 to ensure all the operations to the old table are completed.	2024-09-05 15:42:35 +03:00
Michael Litvak	fcf66ad541	storage_service: add view_build_status to raft snapshot Include the table system.view_build_status_v2 in the raft snapshot, and also the view_builder version parameter.	2024-09-05 15:42:30 +03:00
Michael Litvak	8d25a4d678	view_builder: migration to v2 Migrate view_builder to v2, to store the view build status of all nodes in the group0 based table view_build_status_v2. Introduce a feature view_build_status_on_group0 so we know when all nodes are ready to migrate and use the new table. A new cluster is initialized to use v2. Otherwise, the topology coordinator initiates the migration when the feature is enabled, if it was not done already. The migration reads all the rows in the v1 table and writes them via group0 to the v2 table, together with a mutation that updates the view_builder parameter in scylla_local to v2. When this mutation is applied, it updates the view_builder service to start using the v2 table.	2024-09-05 15:41:04 +03:00
Kamil Braun	79983723c8	storage_service: pass `_abort_source` to `hold_read_apply_mutex` There's no point waiting for this lock if `storage_service` is being aborted. In theory the lock, if held, should be eventually released by whatever is holding it during shutdown -- but if there is some cyclic reference between the services, and e.g. whatever holds the lock is stuck because of ongoing shutdown and would only be unstuck by `storage_service` getting stopped (which it can't because it's waiting on the lock), that would cause a shutdown deadlock. Better to be safe than sorry.	2024-09-03 15:52:05 +02:00
Kamil Braun	a4d1065628	api: move `reload_raft_topology_state` implementation inside `storage_service` In later commit we'll want to access more `storage_service` internals in the API's implementation (namely, `_abort_source`) Also moving the implementation there allows making `service::topology_transition()` private again (it was made public in `992f1327d3` only for this API implementation)	2024-09-03 15:52:03 +02:00
Kamil Braun	292ef0d1f9	Merge 'Fix node replace with inter-dc encryption enabled.' from Gleb Natapov Currently if a coordinator and a node being replaced are in the same DC while inter-dc encryption is enabled (connections between nodes in the same DC should not be encrypted) the replace operation will fail. It fails because a coordinator uses non encrypted connection to push raft data to the new node, but the new node will not accept such connection until it knows which DC the coordinator belongs to and for that the raft data needs to be transferred. The series adds the test for this scenario and the fix for the chicken&egg problem above. The series (or at least the fix itself) needs to be backported because this is a serious regression. Fixes: scylladb/scylladb#19025 Closes scylladb/scylladb#20290 * github.com:scylladb/scylladb: topology coordinator: fix indentation after the last patch topology coordinator: do not add replacing node without a ring to topology test: add test for replace in clusters with encryption enabled test.py: add server encryption support to cluster manager .gitignore: fix pattern for resources to match only one specific directory	2024-08-30 11:29:05 +02:00
Raphael S. Carvalho	26facd807e	storage_service: avoid processing same table unnecessarily in split monitor If there's a token metadata for a given table, and it is in split mode, it will be registered such that split monitor can look at it, for example, to start split work, or do nothing if table completed it. during topology change, e.g. drain, split is stalled since it cannot take over the state machine. It was noticed that the log is being spammed with a message saying the table completed split work, since every tablet metadata update, means waking up the monitor on behalf of a table. So it makes sense to demote the logging level to debug. That persists until drain completes and split can finally complete. Another thing that was noticed is that during drain, a table can be submitted for processing faster than the monitor can handle, so the candidate queue may end up with multiple duplicated entries for same table, which means unnecessary work. That is fixed by using a sequenced set, which keeps the current FIFO behavior. Fixes #20339. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#20029	2024-08-29 19:38:43 +03:00
Gleb Natapov	32a59ba98f	topology coordinator: fix indentation after the last patch	2024-08-29 17:14:09 +03:00
Gleb Natapov	17f4a151ce	topology coordinator: do not add replacing node without a ring to topology When only inter dc encryption is enabled a non encrypted connection between two nodes is allowed only if both nodes are in the same dc. If a node that initiates the connection knows that dst is in the same dc and hence uses a non encrypted connection, but the dst does not yet know the topology of the src, such a connection will not be allowed since dst cannot guarantee that src is in the same dc. Currently, when topology coordinator is used, a replacing node will appear in the coordinator's topology immediately after it is added to the group0. The coordinator will try to send raft message to the new node and (assuming only inter dc encryption is enabled and replacing node and the coordinator are in the same dc) it will try to open regular, non encrypted, connection to it. But the replacing node will not have the coordinator in its topology yet (it needs to sync the raft state for that), so it will reject such a connection. To solve the problem the patch does not add a replacing node that was just added to group0 to the topology. It will be added later, when tokens will be assigned to it. At this point a replacing node will already make sure that its topology state is up-to-date (since it will execute a raft barrier in join_node_response_params handler) and it knows coordinator's topology. This aligns replace behaviour with bootstrap since bootstrap also does not add a node without a ring to the topology. The patch effectively reverts `b8ee8911ca` Fixes: scylladb/scylladb#19025	2024-08-29 17:14:09 +03:00
Patryk Jędrzejczak	02bb70da19	treewide: support zero-token nodes in the recovery mode Before we implement the manual recovery tool, we must support zero-token nodes in the recovery mode. This means that two topology operations involving zero-token nodes must work in the gossip-based topology: - removing a dead zero-token node, - restarting a live zero-token node. We make changes necessary to make them work in this patch.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	574c252391	feature_service: introduce the ZERO_TOKEN_NODES feature Zero-token nodes must be supported by all nodes in the cluster. Otherwise, the non-supporting nodes would crash on some assertion that assumes only token-owning normal nodes make sense. Hence, we introduce the ZERO_TOKEN_NODES cluster feature. Zero-token nodes refuse to boot if it is not supported. I tested this patch manually. First, I booted a node built in the previous patch. Then, I tried to add a zero-token node built in this patch. It refused to boot as expected.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	c25eefe217	storage_service: rename join_token_ring to join_topology After introducing zero-token nodes that call join_token_ring but do not join the ring, the join_token_ring name does not make much sense.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	9937cf3a24	storage_service: raft_topology_cmd_handler: improve warnings	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	22d907e721	treewide: introduce support for zero-token nodes in Raft topology We revive the `join_ring` option. We support it only in the Raft-based topology, as we plan to remove the gossip-based topology when we fix the last blocker - the implementation of the manual recovery tool. In the Raft-based topology, a node can be assigned tokens only once when it joins the cluster. Hence, we disallow joining the ring later, which is possible in Cassandra. The main idea behind the solution is simple. We make the unsupported special case of zero tokens a supported normal case. Nodes with zero tokens assigned are called "zero-token nodes" from now on. From the topology point of view, zero-token nodes are the same as token-owning nodes. They can be in the same states, etc. From the data point of view, they are different. They are not members of the token ring, so they are not present in `token_metadata::_normal_token_owners`. Hence, they are ignored in all non-local replication strategies. The tablet load balancer also ignores them. Topology operations involving zero-token nodes are simplified: - `add` and `replace` finish in the `join_group0` state, so creating a new CDC generation and streaming are skipped, - `removenode` and `decommission` skip streaming, - `rebuild` does not even contact the topology coordinator as there is nothing to rebuild. Also, if the topology operation involves a token-owning node, zero-token nodes are ignored in streaming. Zero-token nodes can be used as coordinator-only nodes, just like in Cassandra. They can handle requests just like token-owning nodes. The main motivation behind zero-token nodes is that they can prevent the Raft majority loss efficiently. Zero-token nodes are group 0 voters, but they can run on much weaker and cheaper machines because they neither replicate data nor handle client requests by default (drivers ignore them). 
For example, if there are two DCs, one with 4 nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes, every DC will contain less than half of the nodes, so we won't lose the majority when any DC dies. Another way of preventing the Raft majority loss is changing the voter set, which is tracked by scylladb/scylladb#18793. That approach can be used together with zero-token nodes. In the example above, if we choose equal numbers of voters in both DCs, then a DC with one zero-token node will be sufficient. However, in the typical setup of 2 DCs with the same number of nodes it is enough to add a DC with only one zero-token node without changing the voter set. Zero-token nodes could also be used as load balancers in the Alternator.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	ed55261650	treewide: distinguish all nodes from all token owners In one of the following patches, we introduce support for zero-token nodes. From that point, getting all nodes and getting all token owners isn't equivalent. In this patch, we ensure that we consider only token owners when we want to consider only token owners (for example, in the replication logic), and we consider all nodes when we want to consider all nodes (for example, in the topology logic). The main purpose of this patch is to make the PR introducing zero-token nodes easier to review. The patch that introduces zero-token nodes is already complicated. We don't want trivial changes from this patch to make noise there. This patch introduces changes needed for zero-token nodes only in the Raft-based topology and in the recovery mode. Zero-token nodes are unsupported in the gossip-based topology outside recovery. Some functions added to `token_metadata` and `topology` are inefficient because they compute a new data structure in every call. They are never called in the hot path, so it's not a serious problem. Nevertheless, we should improve it somehow. Note that it's not obvious how to do it because we don't want to make `token_metadata` store topology-related data. Similarly, we don't want to make `topology` store token-related data. We can think of an improvement in a follow-up. We don't remove unused `topology::get_datacenter_rack_nodes` and `topology::get_datacenter_nodes`. These functions can be useful in the future. Also, `topology::_dc_nodes` is used internally in `topology`.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	2d9575d6a9	gossip topology: make a replacing node remove the replaced node from topology In the following patch, we change the gossiper to work the same for zero-token nodes and token-owning nodes. We replace occurrences of `is_normal_token_owner` with topology-based conditions. We want to rely on the invariant that token-owning nodes own tokens if and only if they are in the normal or leaving state. However, this invariant is broken by a replacing node because it does not remove the replaced node from topology. Hence, after joining, the replacing node has topology with a node that is not a token owner anymore but is in a leaving state (`being_replaced`). We fix it to prevent the following patch from introducing a regression.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	c7016dedb3	locator: topology: add_or_update_endpoint: use none as the default node state In one of the following patches, we change the gossiper to work the same for zero-token nodes and token-owning nodes. We replace occurrences of `is_normal_token_owner` with topology-based conditions. We want to rely on the invariant that token-owning nodes own tokens if and only if they are in the normal or leaving state. However, this invariant can be broken in the gossip-based topology when a new node joins the cluster. When a bootstrapping node starts gossiping, other nodes add it to their topology in `storage_service::on_alive`. Surprisingly, the state of the new node is set to `normal`, as it's the default value used by `add_or_update_endpoint`. Later, the state will be set to `bootstrapping` or `replacing`, and finally it will be set again to `normal` when the join operation finishes. We fix this strange behavior by setting the node state to `none` in `storage_service::on_alive` for nodes not present in the topology. Note that we must add such nodes to the topology. Other code needs their Host ID, IP, and location. We change the default node state from `normal` to `none` in `add_or_update_endpoint` to prevent bugs like the one in `storage_service::on_alive`. Also, we ensure that nodes in the `none` state are ignored in the getters of `locator::topology`.	2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak	366605224c	token_metadata: rename get_all_endpoints and get_all_ips In one of the following patches, we introduce support for zero-token nodes. A zero-token node that has successfully joined the cluster is in the normal state but is not a normal token owner. Hence, the names of `get_all_endpoints` and `get_all_ips` become misleading. They should specify that the functions return only IDs/IPs of token owners.	2024-08-29 10:37:07 +02:00
Benny Halevy	18c45f7502	raft_rebuild: propagate source_dc force option to rebuild_option Currently, the `force` property of the `source_dc` rebuild option is lost and `raft_topology_cmd_handler` has no way to know if it was given or not. This in turn can cause rebuild to fail, even when `--force` is set by the user, where it would succeed with gossip topology changes, based on the source_dc --force semantics. Fixes scylladb/scylladb#20242 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#20249	2024-08-27 17:05:48 +02:00
Benny Halevy	686a8f2939	abstract_replication_strategy: make get_ranges async To prevent stalls due to large number of tokens. For example, large cluster with say 70 nodes can have more than 16K tokens. Fixes #19757 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-08-25 10:57:34 +03:00
Tomasz Grabiec	ff52527c54	Merge 'repair: do_rebuild_replace_with_repair: use source_dc only when safe' from Benny Halevy It is unsafe to restrict the sync nodes for repair to the source data center if it has too low replication factor in network_topology_replication_strategy, or if other nodes in that DC are ignored. Also, this change restricts the usage of source_dc to `network_topology` and `everywhere_topology` strategies, as with simple replication strategy there is no guarantee that there would be any more replicas in that data center. Fixes #16826 Reproducer submitted as https://github.com/scylladb/scylla-dtest/pull/3865 It fails without this fix and passes with it. * Requires backport to live versions. Issue hit in the field with 2022.2.14 Closes scylladb/scylladb#16827 * github.com:scylladb/scylladb: repair: do_rebuild_replace_with_repair: use source_dc only when safe repair: replace_with_repair: pass the replace_node downstream repair: replace_with_repair: pass ignore_nodes as a set of host_id:s repair: replace_rebuild_with_repair: pass ks_erms from caller nodetool: rebuild: add force option Add and use utils::optional_param to pass source_dc	2024-08-20 16:13:23 +02:00
Benny Halevy	8665eef98c	repair: replace_with_repair: pass the replace_node downstream To be used by the next patch to count how many nodes are lost in each datacenter. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-08-19 17:23:33 +03:00
Benny Halevy	9729dd21c3	repair: replace_with_repair: pass ignore_nodes as a set of host_id:s The callers already pass ignore_nodes as host_id:s and we translate them into inet_address only for repair, so delay the translation as much as possible. Refs scylladb/scylladb#6403 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-08-19 17:22:01 +03:00
Benny Halevy	b5d0ab092c	repair: replace_rebuild_with_repair: pass ks_erms from caller The keyspaces replication maps must be in sync with the token_metadata_ptr passed already to the functions, so instead of getting it in the callee, let the caller get the ks_erms along with retrieving the tmptr. Note that it's already done on the rebuild path for streaming based rebuild. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-08-19 17:20:27 +03:00

2123 Commits