Commit Graph

Kamil Braun
292ef0d1f9 Merge 'Fix node replace with inter-dc encryption enabled.' from Gleb Natapov
Currently, if the coordinator and the node being replaced are in the same DC
while inter-DC encryption is enabled (connections between nodes in the
same DC should not be encrypted), the replace operation fails. It
fails because the coordinator uses an unencrypted connection to push Raft
data to the new node, but the new node will not accept such a connection
until it knows which DC the coordinator belongs to, and for that the Raft
data first needs to be transferred.

The series adds a test for this scenario and a fix for the
chicken-and-egg problem described above.

The series (or at least the fix itself) needs to be backported because
this is a serious regression.

Fixes: scylladb/scylladb#19025

Closes scylladb/scylladb#20290

* github.com:scylladb/scylladb:
  topology coordinator: fix indentation after the last patch
  topology coordinator: do not add replacing node without a ring to topology
  test: add test for replace in clusters with encryption enabled
  test.py: add server encryption support to cluster manager
  .gitignore: fix pattern for resources to match only one specific directory
2024-08-30 11:29:05 +02:00
Pavel Emelyanov
cec4d207f6 Merge 'repair: throw if batchlog manager isn't initialized' from Aleksandra Martyniuk
repair_service::repair_flush_hints_batchlog_handler may access the
batchlog manager while it is uninitialized.

Throw if the batchlog manager isn't initialized.
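
A minimal, self-contained sketch of the guard this adds (hypothetical names; not the actual repair_service code):

```cpp
// Hypothetical sketch: the handler bails out with an exception instead of
// dereferencing a batchlog manager that has not been initialized yet.
#include <optional>
#include <stdexcept>

struct batchlog_manager {};        // stand-in for the real service

struct repair_service_sketch {
    std::optional<batchlog_manager> _bm;   // empty until initialization

    void flush_hints_batchlog_handler() {
        if (!_bm) {
            throw std::runtime_error("batchlog manager is not initialized yet");
        }
        // ... proceed with flushing hints/batchlog using *_bm ...
    }
};
```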

Fixes: #20236.

Needs backport to 6.0 and 6.1, as they suffer from the uninitialized batchlog manager access.

Closes scylladb/scylladb#20251

* github.com:scylladb/scylladb:
  test: add test to ensure repair won't fail with uninitialized bm
  repair: throw if batchlog manager isn't initialized
2024-08-30 11:37:24 +03:00
Anna Stuchlik
4471c80bdc doc: add the 6.1-to-6.2 upgrade guide
This commit replaces the 6.0-to-6.1 upgrade guide with the 6.1-to-6.2 upgrade guide.

The new guide is a template that covers the basic procedure.
If any 6.2-specific updates are required, they will have to be added along with development.

Closes scylladb/scylladb#20178
2024-08-30 10:10:45 +03:00
Piotr Dulikowski
c05be27e4a Merge 'db/hints: Move the code for writing hints to a separate function' from Dawid Mędrek
In scylladb/scylladb@7301a96, in the function `hint_endpoint_manager::store_hint()`,
we transformed the lambda passed to `seastar::with_gate()` into a coroutine lambda
to improve readability. However, there was a subtle problem related to the
lifetimes of the captures that needed to be addressed:

* Since we started `co_await`ing in the lambda, the captures were at risk of
  being destructed too soon. The usual solution is to wrap a coroutine lambda
  within a `seastar::coroutine::lambda` object and rely on the extended lifetime
  enforced by the semantics of the language.
  See `docs/dev/lambda-coroutine-fiasco.md` for more context.
* However, since we don't immediately `co_await` the future returned by
  `with_gate()`, we cannot rely on the extended lifetime provided by the wrapper.
  The document linked in the previous bullet point suggests keeping the passed
  coroutine lambda in a variable and passing it by reference to `with_gate()`.
  However, that's not feasible either because we discard the returned future and
  the function returns almost instantly -- destructing every local object, which
  would encompass the lambda too.

The solution used in the commit was to move captures of the lambda into
the lambda's body. That helped because Seastar's backend is responsible for
keeping all of the local variables alive until the lambda finishes its execution.
However, we didn't move all of the captures into the lambda -- the missing one
was the `this` pointer that was implicitly used in the lambda.

Address sanitiser hasn't reported any bugs related to the pointer yet, but
the bug is most likely there.

In this commit, we transform the lambda's body into a new member function
and only call it from the lambda. This way, we don't need to care about
the lifetimes of the captures because Seastar ensures that the function's
arguments stay alive until the coroutine finishes.

Choosing this solution instead of assigning `this` to a pointer variable
inside the lambda's body and using it to refer to the object's members
has an actual benefit: it's not possible to accidentally forget to refer
to a member of the object via the pointer; it also makes the code less
awkward.
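
The resulting shape of the code can be illustrated with a minimal, self-contained sketch (hypothetical `writer` class and `do_store()` member, not the actual hints code):

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/sleep.hh>
#include <chrono>

class writer {
    seastar::gate _gate;
public:
    void store(int value) {
        // The returned future is discarded on purpose; the gate tracks the
        // background work instead, and stop() waits for it before destruction.
        (void)seastar::with_gate(_gate, [this, value] {
            // Plain (non-coroutine) lambda: it only forwards to the member
            // coroutine, so the captures are consumed during this single
            // synchronous call and never need to outlive it.
            return do_store(value);
        });
    }
    seastar::future<> stop() {
        return _gate.close();
    }
private:
    // Member coroutine: `value` (and the implicit object parameter) are
    // function arguments, which the coroutine frame keeps alive across
    // every co_await until the coroutine finishes.
    seastar::future<> do_store(int value) {
        co_await seastar::sleep(std::chrono::milliseconds(value));
        // ... actually persist the value here ...
    }
};
```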

Fixes scylladb/scylladb#20306

Closes scylladb/scylladb#20258

* github.com:scylladb/scylladb:
  db/hints: Fix indentation in `do_store_hint()`
  db/hints: Move code for writing hints to separate function
2024-08-30 09:09:02 +02:00
Avi Kivity
bbcfd47bf5 doc: nodetool: toppartitions: document --samplers and --capacity
In particular, --capacity is critical for obtaining accurate measurements.

Closes scylladb/scylladb#20192
2024-08-30 10:07:54 +03:00
Botond Dénes
9f9346fc59 Merge 'nodetool: tasks: add nodetool commands to track task manager tasks' from Aleksandra Martyniuk
Add nodetool commands to manage task manager tasks:
- tasks abort - aborts the task
- tasks list - lists all tasks in the module
- tasks modules - lists all modules
- tasks set-ttl - sets task ttl
- tasks status - gets status of the task
- tasks tree - gets statuses of the task and all its descendants
- tasks ttl - gets task ttl
- tasks wait - waits for the task and gets its status

Fixes: https://github.com/scylladb/scylladb/issues/19201.

Closes scylladb/scylladb#19614

* github.com:scylladb/scylladb:
  test: nodetool: add tests for tasks commands
  nodetool: tasks: add nodetool commands to track task manager tasks
  api: task_manager: return status 403 if a task is not abortable
  api: task_manager: return none instead of empty task id
  api: task_manager: add timeout to wait_task
  api: task_manager: add operation to get ttl
  nodetool: add suboperations support
  nodetool: change operations_with_func type
  nodetool: prepare operation related classes for suboperations
2024-08-30 07:37:37 +03:00
Avi Kivity
67b24859bc Merge 'generic_server: convert connection tracking to seastar::gate' from Laszlo Ersek
~~~
generic_server: convert connection tracking to seastar::gate

If we call server::stop() right after "server" construction, it hangs:

With the server never listening (never accepting connections and never
serving connections), nothing ever calls server::maybe_stop().
Consequently,

    co_await _all_connections_stopped.get_future();

at the end of server::stop() deadlocks.

Such a server::stop() call does occur in controller::do_start_server()
[transport/controller.cc], when

- cserver->start() (sharded<cql_server>::start()) constructs a
  "server"-derived object,

- start_listening_on_tcp_sockets() throws an exception before reaching
  listen_on_all_shards() (for example because it fails to set up client
  encryption -- certificate file is inaccessible etc.),

- the "deferred_action"

      cserver->stop().get();

  is invoked during cleanup.

(The cserver->stop() call exposing the connection tracking problem dates
back to commit ae4d5a60ca ("transport::controller: Shut down distributed
object on startup exception", 2020-11-25), and it's been triggerable
through the above code path since commit 6b178f9a4a
("transport/controller: split configuring sockets into separate
functions", 2024-02-05).)

Tracking live connections and connection acceptances seems like a good fit
for "seastar::gate", so rewrite the tracking with that. "seastar::gate"
can be closed (and the returned future can be waited for) without anyone
ever having entered the gate.

NOTE: this change makes it quite clear that neither server::stop() nor
server::shutdown() may be called multiple times. The permitted sequences
are:

- server::shutdown() + server::stop()

- or just server::stop().

Fixes #10305

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
~~~
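
A minimal, self-contained sketch of the property this change relies on (hypothetical `tiny_server` class, not the actual generic_server code): a `seastar::gate` can be closed, and the close future waited on, even if nothing ever entered it, so stop() no longer hangs for a server that never listened.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>

class tiny_server {
    seastar::gate _connections;   // tracks accept loops and live connections
public:
    seastar::future<> handle_connection() {
        // Every served connection holds the gate for its whole lifetime.
        return seastar::with_gate(_connections, [] {
            return seastar::make_ready_future<>();   // serve the connection
        });
    }
    seastar::future<> stop() {
        // Safe even if handle_connection() was never called: close() resolves
        // immediately when the gate has no current holders.
        co_await _connections.close();
    }
};
```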

Fixes #10305.

I think we might want to backport this -- it fixes a hang-on-misconfiguration which affects `scylla-6.1.0-0.20240804.abbf0b24a60c.x86_64` at a minimum. Basically, every release that contains commit ae4d5a60ca has a theoretical chance of the hang, and every release that contains commit 6b178f9a4a has a practical chance of the hang.

Focusing on the more practical symptom (i.e., releases containing commit 6b178f9a4a), `git tag --contains 6b178f9a4a90` gives us (ignoring candidates and release candidates):
- scylla-6.0.0
- scylla-6.0.1
- scylla-6.0.2
- scylla-6.1.0

Closes scylladb/scylladb#20212

* github.com:scylladb/scylladb:
  generic_server: make server::stop() idempotent
  generic_server: coroutinize server::shutdown()
  generic_server: make server::shutdown() idempotent
  test/generic_server: add test case
  configure, cmake: sort the lists of boost unit tests
  generic_server: convert connection tracking to seastar::gate
2024-08-29 19:45:48 +03:00
Laszlo Ersek
db44000f8d Update seastar submodule
* seastar 83e6cdfd...ec5da7a6 (1):
  > reactor, linux-aio: advise users in more detail on setting aio-max-nr

Fixes #5981

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>

Closes scylladb/scylladb#20307
2024-08-29 19:42:02 +03:00
Raphael S. Carvalho
26facd807e storage_service: avoid processing same table unnecessarily in split monitor
If there is token metadata for a given table and it is in split mode,
the table will be registered so that the split monitor can look at it,
for example to start split work, or do nothing if the table has already
completed it.

During a topology change, e.g. drain, split is stalled since it cannot
take over the state machine. It was noticed that the log is being
spammed with a message saying the table completed split work, since
every tablet metadata update means waking up the monitor on behalf of a
table. So it makes sense to demote the logging level to debug. This
situation persists until drain completes and split can finally complete.

Another thing that was noticed is that during drain, a table can be
submitted for processing faster than the monitor can handle, so the
candidate queue may end up with multiple duplicated entries for the
same table, which means unnecessary work. That is fixed by using a
sequenced set, which keeps the current FIFO behavior.
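
A minimal sketch of the deduplicating FIFO idea (hypothetical `candidate_queue` keyed by a table name; the real code uses a sequenced set keyed by the table):

```cpp
#include <deque>
#include <optional>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical sketch: a queue that keeps FIFO submission order but drops
// duplicate submissions of the same table, so the monitor never has the
// same candidate queued more than once at a time.
class candidate_queue {
    std::deque<std::string> _fifo;             // submission order
    std::unordered_set<std::string> _present;  // what is already queued
public:
    void submit(const std::string& table) {
        if (_present.insert(table).second) {   // only queue the first submission
            _fifo.push_back(table);
        }
    }
    std::optional<std::string> next() {
        if (_fifo.empty()) {
            return std::nullopt;
        }
        std::string table = std::move(_fifo.front());
        _fifo.pop_front();
        _present.erase(table);
        return table;
    }
};
```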

Fixes #20339.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#20029
2024-08-29 19:38:43 +03:00
Aleksandra Martyniuk
1f46cad5de test: nodetool: add tests for tasks commands 2024-08-29 17:37:13 +02:00
Aleksandra Martyniuk
20fffcdcf5 nodetool: tasks: add nodetool commands to track task manager tasks 2024-08-29 17:37:12 +02:00
Avi Kivity
7da3314deb Merge 'Integrated restore' from Ernest Zaslavsky
Handed over from https://github.com/scylladb/scylladb/pull/20149

This adds a minimal implementation of the start-restore API call.

The method starts a task that runs load-and-stream functionality against sstables from an S3 bucket. Its arguments are:

```
endpoint -- the ID in object_store.yaml config file
bucket -- the target bucket to get objects from
keyspace -- the keyspace to work on
table -- the table to work on
snapshot -- the name of the snapshot from which the backup was taken
```
The task runs in the background; its task_id is returned from the method once the task is spawned, and it should be used via the /task_manager API to track the task's execution and completion.

Remote sstable components are scanned as if they were placed in the local upload/ directory. Then the collected sstables are fed into load-and-stream.

This branch has https://github.com/scylladb/scylladb/pull/19890 (Integrated backup), https://github.com/scylladb/scylladb/pull/20120 (S3 lister) and a few more minor PRs merged in. The restore branch itself starts with the [utils: Introduce abstract (directory) lister](29c867b54d) commit.

refs: https://github.com/scylladb/scylladb/issues/18392

Closes scylladb/scylladb#20305

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: add restore integration
  test/object_store: Add simple restore test
  test/object_store: Generalize prepare_snapshot_for_backup()
  code: Introduce restore API method
  sstable_loader: Add sstables::storage_manager dependency
  sstable_loader: Maintain task manager module
  sstable_loader: Out-line constructor
  distributed_loader: Split get_sstables_from_upload_dir()
  sstables/storage: Compose uploaded sstable path simpler
  sstable_directory: Prepare FS lister to scan files on S3
  sstable_directory: Parse sstable component without full path
  s3-client: Add support for lister::filter
  utils: Introduce abstract (directory) lister
2024-08-29 18:25:30 +03:00
Kamil Braun
9574c399ce Merge 'add support for zero-token nodes' from Patryk Jędrzejczak
We revive the `join_ring` option. We support it only in the
Raft-based topology, as we plan to remove the gossip-based topology
when we fix the last blocker - the implementation of the manual
recovery tool. In the Raft-based topology, a node can be assigned
tokens only once when it joins the cluster. Hence, we disallow
joining the ring later, which is possible in Cassandra.

The main idea behind the solution is simple. We make the unsupported
special case of zero tokens a supported normal case. Nodes with zero
tokens assigned are called "zero-token nodes" from now on.

From the topology point of view, zero-token nodes are the same as
token-owning nodes. They can be in the same states, etc. From the
data point of view, they are different. They are not members of
the token ring, so they are not present in
`token_metadata::_normal_token_owners`. Hence, they are ignored in
all non-local replication strategies. The tablet load balancer also
ignores them.

Zero-token nodes can be used as coordinator-only nodes, just like in
Cassandra. They can handle requests just like token-owning nodes.

The main motivation behind zero-token nodes is that they can prevent
the Raft majority loss efficiently. Zero-token nodes are group 0
voters, but they can run on much weaker and cheaper machines because
they do not replicate data and handle client requests by default
(drivers ignore them). For example, if there are two DCs, one with 4
nodes and one with 5 nodes, adding a DC with 2 zero-token nodes means
every DC contains less than half of the nodes (4 + 5 + 2 = 11 nodes in
total, so the majority is 6, and losing any single DC leaves at least 6
nodes alive), so we won't lose the majority when any DC dies.

Another way of preventing the Raft majority loss is changing the
voter set, which is tracked by scylladb/scylladb#18793. That approach
can be used together with zero-token nodes. In the example above, if
we choose equal numbers of voters in both DCs, then a DC with one
zero-token node will be sufficient. However, in the typical setup of
2 DCs with the same number of nodes it is enough to add a DC with
only one zero-token node without changing the voter set.

Zero-token nodes could also be used as load balancers in the
Alternator.

Additionally, this PR fixes scylladb/scylladb#11087, which turned out to
be a blocker.

This PR introduces a new feature. There is no need to backport it.

Fixes scylladb/scylladb#6527
Fixes scylladb/scylladb#11087
Fixes scylladb/scylladb#15360

Closes scylladb/scylladb#19684

* github.com:scylladb/scylladb:
  docs: raft: document using zero-token nodes to prevent majority loss
  test: test recovery mode in the presence of zero-token nodes
  test: topology: util.py: add cqls parameter to check_system_topology_and_cdc_generations_v3_consistency
  test: topology: util.py: accept zero tokens in check_system_topology_and_cdc_generations_v3_consistency
  treewide: support zero-token nodes in the recovery mode
  storage_proxy: make TRUNCATE work locally for local tables
  test: topology: util.py: document that check_token_ring_and_group0_consistency fails with zero-token nodes
  test: test zero-token nodes
  test: test_topology_ops: move helpers to topology/util.py
  feature_service: introduce the ZERO_TOKEN_NODES feature
  storage_service: rename join_token_ring to join_topology
  storage_service: raft_topology_cmd_handler: improve warnings
  topology_coordinator: fix indentation after the previous patch
  treewide: introduce support for zero-token nodes in Raft topology
  system_keyspace: load_topology_state: remove assertion impossible to hit
  treewide: distinguish all nodes from all token owners
  gossip topology: make a replacing node remove the replaced node from topology
  locator: topology: add_or_update_endpoint: use none as the default node state
  test: boost: tablets tests: ensure all nodes are normal token owners
  token_metadata: rename get_all_endpoints and get_all_ips
  network_topology_strategy: reallocate_tablets: remove unused dc_rack_nodes
  virtual_tables: cluster_status_table: execute: set dc regardless of the token ownership
2024-08-29 16:26:21 +02:00
Gleb Natapov
32a59ba98f topology coordinator: fix indentation after the last patch 2024-08-29 17:14:09 +03:00
Gleb Natapov
17f4a151ce topology coordinator: do not add replacing node without a ring to topology
When only inter-DC encryption is enabled, an unencrypted connection
between two nodes is allowed only if both nodes are in the same DC.
If the node that initiates the connection knows that the destination is
in the same DC and hence uses an unencrypted connection, but the
destination does not yet know the topology of the source, such a
connection will not be allowed, since the destination cannot guarantee
that the source is in the same DC.

Currently, when the topology coordinator is used, a replacing node will
appear in the coordinator's topology immediately after it is added to
group0. The coordinator will try to send a Raft message to the new node
and (assuming only inter-DC encryption is enabled and the replacing node
and the coordinator are in the same DC) it will try to open a regular,
unencrypted connection to it. But the replacing node will not have the
coordinator in its topology yet (it needs to sync the Raft state for
that), so it will reject such a connection.

To solve the problem, the patch does not add a replacing node that was
just added to group0 to the topology. It will be added later, when
tokens are assigned to it. At that point the replacing node will already
have made sure that its topology state is up to date (since it executes
a Raft barrier in the join_node_response_params handler) and it will
know the coordinator's topology. This aligns the replace behaviour with
bootstrap, since bootstrap also does not add a node without a ring to
the topology.

The patch effectively reverts b8ee8911ca.

Fixes: scylladb/scylladb#19025
2024-08-29 17:14:09 +03:00
Gleb Natapov
2f1b1fd45e test: add test for replace in clusters with encryption enabled 2024-08-29 17:14:09 +03:00
Gleb Natapov
b98282a976 test.py: add server encryption support to cluster manager 2024-08-29 17:14:09 +03:00
Gleb Natapov
84757a4ed3 .gitignore: fix pattern for resources to match only one specific directory 2024-08-29 17:13:58 +03:00
Dawid Medrek
d459cf91eb db/hints: Fix indentation in do_store_hint() 2024-08-29 14:47:08 +02:00
Dawid Medrek
75ce6943d0 db/hints: Move code for writing hints to separate function
In scylladb/scylladb@7301a96, in the function `hint_endpoint_manager::store_hint()`,
we transformed the lambda passed to `seastar::with_gate()` into a coroutine lambda
to improve readability. However, there was a subtle problem related to the
lifetimes of the captures that needed to be addressed:

* Since we started `co_await`ing in the lambda, the captures were at risk of
  being destructed too soon. The usual solution is to wrap a coroutine lambda
  within a `seastar::coroutine::lambda` object and rely on the extended lifetime
  enforced by the semantics of the language.
  See `docs/dev/lambda-coroutine-fiasco.md` for more context.

* However, since we don't immediately `co_await` the future returned by
  `with_gate()`, we cannot rely on the extended lifetime provided by the wrapper.
  The document linked in the previous bullet point suggests keeping the passed
  coroutine lambda in a variable and passing it by reference to `with_gate()`.
  However, that's not feasible either because we discard the returned future and
  the function returns almost instantly -- destructing every local object, which
  would encompass the lambda too.

The solution used in the commit was to move captures of the lambda into
the lambda's body. That helped because Seastar's backend is responsible for
keeping all of the local variables alive until the lambda finishes its execution.
However, we didn't move all of the captures into the lambda -- the missing one
was the `this` pointer that was implicitly used in the lambda.

Address sanitiser hasn't reported any bugs related to the pointer yet, but
the bug is most likely there.

In this commit, we transform the lambda's body into a new member function
and only call it from the lambda. This way, we don't need to care about
the lifetimes of the captures because Seastar ensures that the function's
arguments stay alive until the coroutine finishes.

Choosing this solution instead of assigning `this` to a pointer variable
inside the lambda's body and using it to refer to the object's members
has an actual benefit: it's not possible to accidentally forget to refer
to a member of the object via the pointer; it also makes the code less
awkward.
2024-08-29 14:47:02 +02:00
Aleksandra Martyniuk
627fc46ca7 api: task_manager: return status 403 if a task is not abortable 2024-08-29 13:53:40 +02:00
Aleksandra Martyniuk
10ab60f32b api: task_manager: return none instead of empty task id
If a user requests the status of a task that does not have a parent,
show "none" instead of an empty parent_id.
2024-08-29 13:53:40 +02:00
Aleksandra Martyniuk
5bcff4d544 api: task_manager: add timeout to wait_task 2024-08-29 13:53:40 +02:00
Aleksandra Martyniuk
3d78172328 api: task_manager: add operation to get ttl 2024-08-29 13:53:39 +02:00
Aleksandra Martyniuk
fb160afaf6 nodetool: add suboperations support
Modify nodetool methods so that they support suboperations.
2024-08-29 13:53:39 +02:00
Aleksandra Martyniuk
4b96f9abb9 nodetool: change operations_with_func type
Change the type of operations_with_func so that it can contain
suboperations.
2024-08-29 13:53:39 +02:00
Aleksandra Martyniuk
c6f8a0116a nodetool: prepare operation related classes for suboperations
Modify the operation class and add an operation_action class so that
information about suboperations is stored. This is a preparation for
adding suboperation support to nodetool.
2024-08-29 13:53:39 +02:00
Kefu Chai
dbb056f4f7 build: cmake: point -ffile-prefix-map to build directory
Before this change, we included `-ffile-prefix-map=${CMAKE_SOURCE_DIR}=.`
in the cflags when building the tree with CMake, but this was wrong,
as the "." directory is the build directory used by CMake, and that
directory is specified by the `-B` option when generating the build
system. If `configure.py --use-cmake` is used to build the tree,
the build directory is "build". This option instructs the compiler
to replace the directory of each source file in the debug symbols and in
`__FILE__` at compile time.

But in a typical workspace, for instance, `build/main.cc` does not exist.
The reason why this mapping works for the rules generated by
`configure.py` but not for CMake is that `configure.py` puts the generated
`build.ninja` right under the top source directory, so `.` is correct and
it helps to create reproducible builds, because it practically erases
the path prefixes in the build output. CMake, on the other hand, puts
`build.ninja` under the specified build directory, so replacing the source
directory with the build directory in the file prefix map is just wrong.

There are two options to address this problem:

* Stop passing this option. But this would lead to non-reproducible
  builds, as we would encode the build directory in the "scylla"
  executable. If a developer needs to rebuild an executable for debugging
  a coredump generated in production, they would have to either build
  the tree in the same directory as our CI does, or pass
  `-ffile-prefix-map=...` to map the local build directory to the one
  used by CI. This is not convenient.
* Instead of using `${CMAKE_SOURCE_DIR}=.`, add `${CMAKE_BINARY_DIR}=.`.
  This erases the build directory in the outputs, but preserves
  debuggability.

So we pick the second solution.
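
As an illustration (hypothetical paths, not part of the build system change itself), the effect of the option on `__FILE__` and on the paths recorded in debug info looks like this:

```cpp
// Hypothetical example: if this file is compiled as
//   g++ -g -ffile-prefix-map=/home/dev/scylla/build=. /home/dev/scylla/build/example.cc
// then __FILE__ below expands to "./example.cc" instead of the absolute path,
// and the debug info records the remapped path as well, so the resulting
// binary does not encode (or depend on) the local build directory.
#include <cstdio>

int main() {
    std::printf("compiled from: %s\n", __FILE__);
    return 0;
}
```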

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20329
2024-08-29 12:28:11 +03:00
Patryk Jędrzejczak
c192a9ee3b docs: raft: document using zero-token nodes to prevent majority loss 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
e027ffdffc test: test recovery mode in the presence of zero-token nodes
We modify existing tests to verify that the recovery mode works
correctly in the presence of zero-token nodes.

In `test_topology_recovery_basic`, we test the case when a
zero-token node is live. In particular, we test that the
gossip-based restart of such a node works.

In `test_topology_recovery_after_majority_loss`, we test the case
when zero-token nodes are unrecoverable. In particular, we test
that the gossip-based removenode of such nodes works.

Since zero-token nodes are ignored by the Python driver if it also
connects to other nodes, we use different CQL sessions for a
zero-token node in `test_topology_recovery_basic`.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
fb1e060c4c test: topology: util.py: add cqls parameter to check_system_topology_and_cdc_generations_v3_consistency
In the following commit, we modify `test_topology_recovery_basic`
to test the recovery mode in the presence of live zero-token nodes.
Unfortunately, it requires a somewhat ugly workaround. Zero-token nodes
are ignored by the Python driver if it also connects to other
nodes because of empty tokens in the `system.peers` table. In that
test, we must connect to a zero-token node to enter the recovery
mode and purge the Raft data. Hence, we use different CQL sessions
for different nodes.

In the future, we may change the Python driver behavior and revert
this workaround. Moreover, the recovery tests will be removed or
significantly changed when we implement the manual recovery tool.
Therefore, we shouldn't worry about this workaround too much.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
54905fc179 test: topology: util.py: accept zero tokens in check_system_topology_and_cdc_generations_v3_consistency
Before we use `check_system_topology_and_cdc_generations_v3_consistency`
in a test with a zero-token node, we must ensure it doesn't fail
because of zero tokens in a row of the `system.topology` table.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
02bb70da19 treewide: support zero-token nodes in the recovery mode
Before we implement the manual recovery tool, we must support
zero-token nodes in the recovery mode. This means that two topology
operations involving zero-token nodes must work in the gossip-based
topology:
- removing a dead zero-token node,
- restarting a live zero-token node.
We make changes necessary to make them work in this patch.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
87b415efdc storage_proxy: make TRUNCATE work locally for local tables
In one of the following patches, we implement support for zero-token
nodes in the recovery mode. To achieve this, we need to be able to
purge all Raft data on live zero-token nodes by using TRUNCATE.
Currently, TRUNCATE works the same for all replication strategies - it
is performed on all token owners. However, zero-token nodes are not
token owners, so TRUNCATE would ignore them. Since zero-token nodes
store only local tables, fixing scylladb/scylladb#11087 is the perfect
solution for the issue with zero-token nodes. We do it in this patch.

Fixes scylladb/scylladb#11087
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
21c8409fa4 test: topology: util.py: document that check_token_ring_and_group0_consistency fails with zero-token nodes 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
95e14ae44b test: test zero-token nodes
We add tests to verify the basic properties of zero-token nodes.

`test_zero_token_nodes_no_replication` and
`test_not_enough_token_owners` are more or less deterministic tests.
Running them only in the dev mode is sufficient.

`test_zero_token_nodes_topology_ops` is quite slow, as expected,
considering parameterization and the number of topology operations.
In the future, we can think of making it faster or skipping it in the
debug mode. For now, our priority is to test zero-token nodes
thoroughly.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
d43d67c525 test: test_topology_ops: move helpers to topology/util.py
In one of the following patches, we reuse the helper functions from
`test_topology_ops` in a new test, so we move them to `util.py`.

Also, we add the `cl` parameter to `start_writes`, as the new test
will use `cl=2`.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
574c252391 feature_service: introduce the ZERO_TOKEN_NODES feature
Zero-token nodes must be supported by all nodes in the cluster.
Otherwise, the non-supporting nodes would crash on some assertion
that assumes only token-owning normal nodes make sense.

Hence, we introduce the ZERO_TOKEN_NODES cluster feature. Zero-token
nodes refuse to boot if it is not supported.

I tested this patch manually. First, I booted a node built in the
previous patch. Then, I tried to add a zero-token node built in this
patch. It refused to boot as expected.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
c25eefe217 storage_service: rename join_token_ring to join_topology
After introducing zero-token nodes that call join_token_ring but do
not join the ring, the join_token_ring name does not make much sense.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
9937cf3a24 storage_service: raft_topology_cmd_handler: improve warnings 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
3ce936da7b topology_coordinator: fix indentation after the previous patch 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
22d907e721 treewide: introduce support for zero-token nodes in Raft topology
We revive the `join_ring` option. We support it only in the
Raft-based topology, as we plan to remove the gossip-based topology
when we fix the last blocker - the implementation of the manual
recovery tool. In the Raft-based topology, a node can be assigned
tokens only once when it joins the cluster. Hence, we disallow
joining the ring later, which is possible in Cassandra.

The main idea behind the solution is simple. We make the unsupported
special case of zero tokens a supported normal case. Nodes with zero
tokens assigned are called "zero-token nodes" from now on.

From the topology point of view, zero-token nodes are the same as
token-owning nodes. They can be in the same states, etc. From the
data point of view, they are different. They are not members of
the token ring, so they are not present in
`token_metadata::_normal_token_owners`. Hence, they are ignored in
all non-local replication strategies. The tablet load balancer also
ignores them.

Topology operations involving zero-token nodes are simplified:
- `add` and `replace` finish in the `join_group0` state, so creating
a new CDC generation and streaming are skipped,
- `removenode` and `decommission` skip streaming,
- `rebuild` does not even contact the topology coordinator as there
is nothing to rebuild.

Also, if the topology operation involves a token-owning node,
zero-token nodes are ignored in streaming.

Zero-token nodes can be used as coordinator-only nodes, just like in
Cassandra. They can handle requests just like token-owning nodes.

The main motivation behind zero-token nodes is that they can prevent
the Raft majority loss efficiently. Zero-token nodes are group 0
voters, but they can run on much weaker and cheaper machines because
they do not replicate data and handle client requests by default
(drivers ignore them). For example, if there are two DCs, one with 4
nodes and one with 5 nodes, if we add a DC with 2 zero-token nodes,
every DC will contain less than half of the nodes, so we won't lose
the majority when any DC dies.

Another way of preventing the Raft majority loss is changing the
voter set, which is tracked by scylladb/scylladb#18793. That approach
can be used together with zero-token nodes. In the example above, if
we choose equal numbers of voters in both DCs, then a DC with one
zero-token node will be sufficient. However, in the typical setup of
2 DCs with the same number of nodes it is enough to add a DC with
only one zero-token node without changing the voter set.

Zero-token nodes could also be used as load balancers in the
Alternator.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
ba016c9af7 system_keyspace: load_topology_state: remove assertion impossible to hit
We store tokens in a non-frozen set, which doesn't distinguish an
empty set from no value. Hence, hitting this assertion is impossible.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
ed55261650 treewide: distinguish all nodes from all token owners
In one of the following patches, we introduce support for zero-token
nodes. From that point, getting all nodes and getting all token
owners isn't equivalent. In this patch, we ensure that we consider
only token owners when we want to consider only token owners (for
example, in the replication logic), and we consider all nodes when
we want to consider all nodes (for example, in the topology logic).

The main purpose of this patch is to make the PR introducing
zero-token nodes easier to review. The patch that introduces
zero-token nodes is already complicated. We don't want trivial
changes from this patch to make noise there.

This patch introduces changes needed for zero-token nodes only in the
Raft-based topology and in the recovery mode. Zero-token nodes are
unsupported in the gossip-based topology outside recovery.

Some functions added to `token_metadata` and `topology` are
inefficient because they compute a new data structure in every call.
They are never called in the hot path, so it's not a serious problem.
Nevertheless, we should improve it somehow. Note that it's not
obvious how to do it because we don't want to make `token_metadata`
store topology-related data. Similarly, we don't want to make
`topology` store token-related data. We can think of an improvement
in a follow-up.

We don't remove unused `topology::get_datacenter_rack_nodes` and
`topology::get_datacenter_nodes`. These functions can be useful in the
future. Also, `topology::_dc_nodes` is used internally in `topology`.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
2d9575d6a9 gossip topology: make a replacing node remove the replaced node from topology
In the following patch, we change the gossiper to work the same for
zero-token nodes and token-owning nodes. We replace occurrences of
`is_normal_token_owner` with topology-based conditions. We want to
rely on the invariant that token-owning nodes own tokens if and only
if they are in the normal or leaving state. However, this invariant
is broken by a replacing node because it does not remove the
replaced node from topology. Hence, after joining, the replacing node
has a topology containing a node that is not a token owner anymore but is
in a leaving state (`being_replaced`). We fix it to prevent the
following patch from introducing a regression.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
c7016dedb3 locator: topology: add_or_update_endpoint: use none as the default node state
In one of the following patches, we change the gossiper to work the
same for zero-token nodes and token-owning nodes. We replace
occurrences of `is_normal_token_owner` with topology-based
conditions. We want to rely on the invariant that token-owning nodes
own tokens if and only if they are in the normal or leaving state.
However, this invariant can be broken in the gossip-based topology
when a new node joins the cluster. When a bootstrapping node starts
gossiping, other nodes add it to their topology in
`storage_service::on_alive`. Surprisingly, the state of the new node
is set to `normal`, as it's the default value used by
`add_or_update_endpoint`. Later, the state will be set to
`bootstrapping` or `replacing`, and finally it will be set again to
`normal` when the join operation finishes. We fix this strange
behavior by setting the node state to `none` in
`storage_service::on_alive` for nodes not present in the topology.
Note that we must add such nodes to the topology. Other code needs
their Host ID, IP, and location.

We change the default node state from `normal` to `none` in
`add_or_update_endpoint` to prevent bugs like the one in
`storage_service::on_alive`. Also, we ensure that nodes in the `none`
state are ignored in the getters of `locator::topology`.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
6adaf85634 test: boost: tablets tests: ensure all nodes are normal token owners
In one of the following patches, we make NetworkTopologyStrategy
and the tablet load balancer consider only normal token owners to
ensure they ignore zero-token nodes. Some unit tests would start
failing after this change because they do not ensure that all
nodes are normal token owners. This patch prevents it.

Judging by the logic in the test cases in
`network_topology_strategy_test`, `point++` was probably intended
anyway.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
366605224c token_metadata: rename get_all_endpoints and get_all_ips
In one of the following patches, we introduce support for zero-token
nodes. A zero-token node that has successfully joined the cluster is
in the normal state but is not a normal token owner. Hence, the names
of `get_all_endpoints` and `get_all_ips` become misleading. They
should specify that the functions return only IDs/IPs of token owners.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
293a66fe41 network_topology_strategy: reallocate_tablets: remove unused dc_rack_nodes 2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
4ff08decb8 virtual_tables: cluster_status_table: execute: set dc regardless of the token ownership
If a node is in `locator::topology`, then it has a location.
We remove the token ownership condition to make the table more
descriptive.
2024-08-29 10:37:06 +02:00