scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-24 00:32:15 +00:00

Author	SHA1	Message	Date
Kamil Braun	b6b35ce061	service: storage_proxy: sequence CDC preimage select with Paxos learn `paxos_response_handler::learn_decision` was calling `cdc_service::augment_mutation_call` concurrently with `storage_proxy::mutate_internal`. `augment_mutation_call` was selecting rows from the base table in order to create the preimage, while `mutate_internal` was writing rows to the table. It was therefore possible for the preimage to observe the update that it accompanied, which doesn't make any sense, because the preimage is supposed to show the state before the update. Fix this by performing the operations sequentially. We can still perform the CDC mutation write concurrently with the base mutation write. `cdc_with_lwt_test` was sometimes failing in debug mode due to this bug and was marked flaky. Unmark it. Fixes #12098 (cherry picked from commit `1ef113691a`)	2023-03-21 20:23:19 +02:00
Petr Gusev	069e38f02d	transport server: fix unexpected server errors handling If request processing ended with an error, it is worth sending the error to the client through make_error/write_response. Previously in this case we just wrote a message to the log and didn't handle the client connection in any way. As a result, the only thing the client got in this case was a timeout error. A new test_batch_with_error is added. It is quite difficult to reproduce the error condition in a test, so we use error injection instead. Passing injection_key in the body of the request ensures that the exception will be thrown only for this test request and will not affect other requests that the driver may send in the background. Closes: scylladb#12104 (cherry picked from commit `a4cf509c3d`)	2023-03-21 20:23:09 +02:00
Gleb Natapov	39158f55d0	lwt: do not destroy capture in upgrade_if_needed lambda since the lambda is used more than once If the capture is destroyed on the first call, the second call may crash. Fixes: #12958 Message-Id: <Y/sks73Sb35F+PsC@scylladb.com> (cherry picked from commit `1ce7ad1ee6`)	2023-02-27 14:19:37 +02:00
Kamil Braun	291b1f6e7f	service/raft: raft_group0: prevent double abort There was a small chance that we called `timeout_src.request_abort()` twice in the `with_timeout` function, first by timeout and then by shutdown. `abort_source` fails on an assertion in this case. Fix this. Fixes: #12512 Closes #12514 (cherry picked from commit `54170749b8`)	2023-02-05 18:31:50 +02:00
Tomasz Grabiec	563998b69a	Merge 'raft: improve group 0 reconfiguration failure handling' from Kamil Braun Make it so that failures in `removenode`/`decommission` don't lead to reduced availability, and any leftovers in group 0 can be removed by `removenode`: - In `removenode`, make the node a non-voter before removing it from the token ring. This removes the possibility of having a group 0 voting member which doesn't correspond to a token ring member. We can still be left with a non-voter, but that doesn't reduce the availability of group 0. - As above but for `decommission`. - Make it possible to remove group 0 members that don't correspond to token ring members from group 0 using `removenode`. - Add an API to query the current group 0 configuration. Fixes #11723. Closes #12502 * github.com:scylladb/scylladb: test: test_topology: test for removing garbage group 0 members test/pylib: move some utility functions to util.py db: system_keyspace: add a virtual table with raft configuration db: system_keyspace: improve system.raft_snapshot_config schema service: storage_service: better error handling in `decommission` service: storage_service: fix indentation in removenode service: storage_service: make `removenode` work for group 0 members which are not token ring members service/raft: raft_group0: perform read_barrier in wait_for_raft service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove service/raft: raft_group0: link to Raft docs where appropriate service/raft: raft_group0: more logging service/raft: raft_group0: separate function for checking and waiting for Raft	2023-01-17 21:23:15 +01:00
Kamil Braun	5545547d07	test: test_topology: test for removing garbage group 0 members Verify that `removenode` can remove group 0 members which are not token ring members.	2023-01-17 12:28:00 +01:00
Kamil Braun	a483915c62	db: system_keyspace: add a virtual table with raft configuration Add a new virtual table `system.raft_state` that shows the currently operating Raft configuration for each present group. The schema is the same as `system.raft_snapshot_config` (the latter shows the config from the last snapshot). In the future we plan to add more columns to this table, showing more information (like the current leader and term), hence the generic name. Adding the table requires some plumbing of `sharded<raft_group_registry>&` through function parameters to make it accessible from `register_virtual_tables`, but it's mostly straightforward. Also added some APIs to `raft_group_registry` to list all groups and find a given group (returning `nullptr` if one isn't found, not throwing an exception).	2023-01-17 12:28:00 +01:00
Kamil Braun	2bfe85ce9b	db: system_keyspace: improve system.raft_snapshot_config schema Remove the `ip_addr` column which was not used. IP addresses are not part of Raft configuration now and they can change dynamically. Swap the `server_id` and `disposition` columns in the clustering key, so when querying the configuration, we first obtain all servers with the current disposition and then all servers with the previous disposition (note that a server may appear both in current and previous).	2023-01-17 12:28:00 +01:00
Kamil Braun	c3ed82e5fb	service: storage_service: better error handling in `decommission` Improve the error handling in `decommission` in case `leave_group0` fails, informing the user what they should do (i.e. call `removenode` to get rid of the group 0 member), and allowing decommission to finish; it does not make sense to let the node continue to run after it leaves the token ring. (And I'm guessing it's also not safe. Or maybe impossible.)	2023-01-17 12:28:00 +01:00
Kamil Braun	beb0eee007	service: storage_service: fix indentation in removenode	2023-01-17 12:28:00 +01:00
Kamil Braun	aba33dd352	service: storage_service: make `removenode` work for group 0 members which are not token ring members Due to failures we might end up in a situation where we have a group 0 member which is not a token ring member: a decommission/removenode which failed after leaving/removing a node from the token ring but before leaving / removing a node from group 0. There was no way to get rid of such a group 0 member. A node that left the token ring must not be allowed to run further (or it can cause data loss, data resurrection and maybe other fun stuff), so we can't run decommission a second time (even if we tried, it would just say that "we're not a member of the token ring" and abort). And `removenode` would also not work, because it proceeds only if the node requested to be removed is a member of the token ring. We modify `removenode` so it can run in this situation and remove the group 0 member. The parts of `removenode` related to token ring modification are now conditioned on whether the node was a member of the token ring. The final `remove_from_group0` step is in its own branch. Some minor refactors were necessary. Some log messages were also modified so it's easier to understand which messages correspond to the "token movement" part of the procedure. The `make_nonvoter` step happens only if token ring removal happens, otherwise we can skip directly to `remove_from_group0`. We also move `remove_from_group0` outside the "try...catch", fixing #11723. The "node ops" part of the procedure is related strictly to token ring movement, so it makes sense for `remove_from_group0` to happen outside. Indentation is broken in this commit for easier reviewability, fixed in the following commit. Fixes: #11723	2023-01-17 12:28:00 +01:00
Kamil Braun	ec2cd29e42	service/raft: raft_group0: perform read_barrier in wait_for_raft Right now wait_for_raft is called before performing group 0 configuration changes. We want to also call it before checking for membership, for that it's desirable to have the most recent information, hence call read_barrier. In the existing use cases it's not strictly necessary, but it doesn't hurt.	2023-01-17 12:28:00 +01:00
Kamil Braun	db734cd74f	service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode removenode currently works roughly like this: 1. stream/repair data so it ends up on new replica sets (calculated without the node we want to remove) 2. remove the node from the token ring 3. remove the node from group 0 configuration. If the procedure fails after step 2 but before step 3 finishes, we're in trouble: the cluster is left with an additional voting group 0 member, which reduces group 0's availability, and there is no way to remove this member because `removenode` no longer considers it to be part of the cluster (it consults the token ring to decide). Improve this failure scenario by including a new step at the beginning: make the node a non-voter in group 0 configuration. Then, even if we fail after removing the node from the token ring but before removing it from group 0, we'll only be left with a non-voter which doesn't reduce availability. We make a similar change for `decommission`: between `unbootstrap()` (which streams data) and `leave_ring()` (which removes our tokens from the ring), become a non-voter. The difference here is that we don't become a non-voter at the beginning, but only after streaming/repair. In `removenode` it's desirable to make the node a non-voter as soon as possible because it's already dead. In decommission it may be desirable for us to remain a voter if we fail during streaming because we're still alive and functional in that case. In a later commit we'll also make it possible to retry `removenode` to remove a node that is only a group 0 member and not a token ring member.	2023-01-17 12:28:00 +01:00
Kamil Braun	4f0801406e	service/raft: raft_group0: link to Raft docs where appropriate Resolve some TODOs.	2023-01-17 12:28:00 +01:00
Kamil Braun	2befbaa341	service/raft: raft_group0: more logging Make the logs in leave_group0 consistent with logs in remove_from_group0.	2023-01-17 12:28:00 +01:00
Kamil Braun	77dc1c4c70	service/raft: raft_group0: separate function for checking and waiting for Raft leave_group0 and remove_from_group0 functions both start with the following steps: - if Raft is disabled or in RECOVERY mode, print a simple log message and abort - if Raft cluster feature flag is not yet enabled, print a complex log message and abort - wait for Raft upgrade procedure to finish - then perform the actual group 0 reconfiguration. Refactor these preparation steps to a separate function, `wait_for_raft`. This reduces code duplication; the function will also be used in more operations later (becoming a nonvoter or turning another server into a nonvoter). We also change the API so that the preparation function is called from outside by the caller before they call the reconfiguration function. This is because in later commits, some of the call sites (mainly `removenode`) will want to check explicitly whether Raft is enabled and wait for Raft's availability, then perform a sequence of steps related to group 0 configuration depending on the result. Also add a private function `raft_upgrade_complete()` which we use to assert that Raft is ready to be used.	2023-01-17 12:27:58 +01:00
Wojciech Mitros	5f45b32bfa	forward_service: prevent heap use-after-free of forward_aggregates Currently, we create `forward_aggregates` inside a function that returns the result of a future lambda that captures these aggregates by reference. As a result, the aggregates may be destructed before the lambda finishes, resulting in a heap use-after-free. To prolong the lifetime of these aggregates, we cannot use a move capture, because the lambda is wrapped in a with_thread_if_needed() call on these aggregates. Instead, we fix this by wrapping the entire return statement in a do_with(). Fixes #12528 Closes #12533	2023-01-17 13:25:57 +02:00
Gleb Natapov' via ScyllaDB development	15ebd59071	lwt: upgrade stored mutations to the latest schema during prepare Currently they are upgraded during learn on a replica. There are two problems with this. First, the column mapping may not exist on a replica if it missed this particular schema (because it was down, for instance) and the mapping history is not part of the schema. In this case "Failed to look up column mapping for schema version" will be thrown. Second, the lwt request coordinator may not have the schema for the mutation either (because it was already freed from the registry), and when a replica tries to retrieve the schema from the coordinator the retrieval will fail, causing the whole request to fail with "Schema version XXXX not found". Both of those problems can be fixed by upgrading stored mutations during prepare on the node they are stored at. To upgrade a mutation its column mapping is needed, and it is guaranteed to be present at the node the mutation is stored at, since availability of the corresponding schema is a prerequisite for storing it. After that the mutation is processed using the latest schema, which will be available on all nodes. Fixes #10770 Message-Id: <Y7/ifraPJghCWTsq@scylladb.com>	2023-01-17 11:14:46 +01:00
Avi Kivity	0b418fa7cf	cql3, transport, tests: remove "unset" from value type system The CQL binary protocol introduced "unset" values in version 4 of the protocol. Unset values can be bound to variables, which cause certain CQL fragments to be skipped. For example, the fragment `SET a = :var` will not change the value of `a` if `:var` is bound to an unset value. Unsets, however, are very limited in where they can appear. They can only appear at the top-level of an expression, and any computation done with them is invalid. For example, `SET list_column = [3, :var]` is invalid if `:var` is bound to unset. This causes the code to be littered with checks for unset, and there are plenty of tests dedicated to catching unsets. However, a simpler way is possible - prevent the infiltration of unsets at the point of entry (when evaluating a bind variable expression), and introduce guards to check for the few cases where unsets are allowed. This is what this long patch does. It performs the following: (general) 1. unset is removed from the possible values of cql3::raw_value and cql3::raw_value_view. (external->cql3) 2. query_options is fortified with a vector of booleans, unset_bind_variable_vector, where each boolean corresponds to a bind variable index and is true when it is unset. 3. To avoid churn, two compatibility structs are introduced: cql3::raw_value{,_view}_vector_with_unset, which can be constructed from a std::vector<raw_value{,_view}>, which is what most callers have. They can also be constructed with explicit unset vectors, for the few cases they are needed. (cql3->variables) 4. query_options::get_value_at() now throws if the requested bind variable is unset. This replaces all the throwing checks in expression evaluation and statement execution, which are removed. 5. A new query_options::is_unset() is added for the users that can tolerate unset; though it is not used directly. 6. A new cql3::unset_operation_guard class guards against unsets. It accepts an expression, and can be queried whether an unset is present. Two conditions are checked: the expression must be a singleton bind variable, and at runtime it must be bound to an unset value. 7. The modification_statement operations are split into two, via two new subclasses of cql3::operation. cql3::operation_no_unset_support ignores unsets completely. cql3::operation_skip_if_unset checks if an operand is unset (luckily all operations have at most one operand that tolerates unset) and applies unset_operation_guard to it. 8. The various sites that accept expressions or operations are modified to check for should_skip_operation(). These are the loops around operations in update_statement and delete_statement, and the checks for unset in attributes (LIMIT and PER PARTITION LIMIT). (tests) 9. Many unset tests are removed. It's now impossible to enter an unset value into the expression evaluation machinery (there's just no unset value), so it's impossible to test for it. 10. Other unset tests now have to be invoked via bind variables, since there's no way to create an unset cql3::expr::constant. 11. Many tests have their exception message match strings relaxed. Since unsets are now checked very early, we don't know the context where they happen. It would be possible to reintroduce it (by adding a format string parameter to cql3::unset_operation_guard), but it seems not to be worth the effort. Usage of unsets is rare, and it is explicit (at least with the Python driver, an unset cannot be introduced by omission). I tried as an alternative to wrap cql3::raw_value{,_view} (that doesn't recognize unsets) with cql3::maybe_unset_value (that does), but that caused huge amounts of churn, so I abandoned that in favor of the current approach. Closes #12517	2023-01-16 21:10:56 +02:00
Kamil Braun	7510144fba	Merge 'Add replace-node-first-boot option' from Benny Halevy Allow replacing a node given its Host ID rather than its IP address. This series adds a replace_node_first_boot option to db/config and makes use of it in storage_service. The new option takes priority over the legacy replace_address* options. When the latter are used, a deprecation warning is printed. Documentation is updated accordingly, and a cql unit_test is added. Ref #12277 Closes #12316 * github.com:scylladb/scylladb: docs: document the new replace_node_first_boot option dist/docker: support --replace-node-first-boot db: config: describe replace_address* options as deprecated test: test_topology: test replace using host_id test: pylib: ServerInfo: add host_id storage_service: get rid of get_replace_address storage_service: is_replacing: rely directly on config options storage_service: pass replacement_info to run_replace_ops storage_service: pass replacement_info to bootstrap storage_service: join_token_ring: reuse replacement_info.address storage_service: replacement_info: add replace address init: do not allow cfg.replace_node_first_boot of seed node db: config: add replace_node_first_boot option	2023-01-16 15:08:31 +01:00
Michał Sala	bbbe12af43	forward_service: fix timeout support in parallel aggregates `forward_request` verb carried information about timeouts using `lowres_clock::time_point` (that came from local steady clock `seastar::lowres_clock`). The time point was produced on one node and later compared against another node's `lowres_clock`. That behavior was wrong (`lowres_clock::time_point`s produced with different `lowres_clock`s cannot be compared) and could lead to delayed or premature timeout. To fix this issue, `lowres_clock::time_point` was replaced with `lowres_system_clock::time_point` in `forward_request` verb. Representation to which both time point types serialize is the same (64-bit integer denoting the count of elapsed nanoseconds), so it was possible to do an in-place switch of those types using logic suggested by @avikivity: - using steady_clock is just broken, so we aren't taking anything from users by breaking it further - once all nodes are upgraded, it magically starts to work Closes #12529	2023-01-16 12:08:13 +02:00
Benny Halevy	db2b76beb5	storage_service: get rid of get_replace_address It is unused now. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:34:29 +02:00
Benny Halevy	17f70e4619	storage_service: is_replacing: rely directly on config options Rather than on get_replace_address, before we remove the latter. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:34:29 +02:00
Benny Halevy	7282d58d11	storage_service: pass replacement_info to run_replace_ops So it won't need to call get_replace_address. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:34:09 +02:00
Benny Halevy	08598e4f64	storage_service: pass replacement_info to bootstrap So it won't need to call get_replace_address. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:30:48 +02:00
Benny Halevy	b863f7a75f	storage_service: join_token_ring: reuse replacement_info.address Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:30:48 +02:00
Benny Halevy	add2f209b8	storage_service: replacement_info: add replace address Populate replacement_info.address in prepare_replacement_info as a first step towards getting rid of get_replace_address(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-13 18:30:48 +02:00
Kamil Braun	be390285b6	db: system_keyspace: remove (my_)server_id column from RAFT_SNAPSHOTS and RAFT_SNAPSHOT_CONFIG A single node will run a single Raft server in any given Raft group, so this column is not necessary.	2023-01-12 16:48:50 +01:00
Kamil Braun	bed555d1e5	db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config' Make it clear that the table stores the snapshot configuration, which is not necessarily the currently operating configuration (the last one appended to the log). In the future we plan to have a separate virtual table for showing the currently operating configuration, perhaps we will call it `system.raft_config`.	2023-01-12 16:21:26 +01:00
Nadav Har'El	d6e6820f33	Merge 'Drop support for cql binary protocols versions 1 and 2' from Avi Kivity The CQL binary protocol version 3 was introduced in 2014. All Scylla versions support it, as do Cassandra versions 2.1 and newer. Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer use 32-bit collection sizes. Unfortunately, we implemented support for multiple serialization formats very intrusively, by pushing the format everywhere. This avoids the need to re-serialize (sometimes) but is quite obnoxious. It's also likely to be broken, since it's almost untested and it's too easy to write cql_serialization_format::internal() instead of propagating the client-specified value. Since protocols 1 and 2 have been obsolete for 9 years, just drop them. It's easy to verify that they are no longer in use on a running system by examining the `system.clients` table before upgrade. Fixes #10607 Closes #12432 * github.com:scylladb/scylladb: treewide: drop cql_serialization_format cql: modification_statement: drop protocol check for LWT transport: drop cql protocol versions 1 and 2	2023-01-09 18:52:41 +02:00
Botond Dénes	2612f98a6c	Merge 'Abort repair tasks' from Aleksandra Martyniuk Aborting of repair operation is fully managed by task manager. Repair tasks are aborted: - on shutdown; top level repair tasks subscribe to global abort source. On shutdown all tasks are aborted recursively - through node operations (applies to data_sync_repair_task_impls and their descendants only); data_sync_repair_task_impl subscribes to node_ops_info abort source - with task manager api (top level tasks are abortable) - with storage_service api and on failure; these cases were modified to be aborted the same way as the ones from above are. Closes #12085 * github.com:scylladb/scylladb: repair: make top level repair tasks abortable repair: unify a way of aborting repair operations repair: delete sharded abort source from node_ops_info repair: delete unused node_ops_info from data_sync_repair_task_impl repair: delete redundant abort subscription from shard_repair_task_impl repair: add abort subscription to data sync task tasks: abort tasks on system shutdown	2023-01-05 15:21:35 +01:00
Avi Kivity	cc6010b512	Merge 'Make restore_replica_count abortable' from Benny Halevy Similar to the way we allow aborting streaming-based removenode, subscribe to storage_service::_abort_source to request abort locally and pass a shared_ptr<abort_source> to `node_ops_info`, used to abort removenode_with_repair on shutdown. Fixes #12429 Closes #12430 * github.com:scylladb/scylladb: storage_service: restore_replica_count: demote status_checker related logging to debug level storage_service: restore_replica_count: allow aborting removenode_with_repair storage_service: coroutinize restore_replica_count storage_service: restore_replica_count: undefer stop_status_checker storage_service: restore_replica_count: handle exceptions from stream_async and send_replication_notification storage_service: restore_replica_count: coroutinize status_checker	2023-01-05 15:21:35 +01:00
Benny Halevy	086546f575	storage_service: restore_replica_count: demote status_checker related logging to debug level The status_checker is not the main line of business of restore_replica_count; starting and stopping it don't seem to deserve info level logging, which might have been useful in the past to debug issues surrounding it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:05:04 +02:00
Benny Halevy	3879ee1db8	storage_service: restore_replica_count: allow aborting removenode_with_repair Similar to the way we allow aborting streaming-based removenode, subscribe to storage_service::_abort_source to request abort locally and pass a shared_ptr<abort_source> to `node_ops_info`, used to abort removenode_with_repair on shutdown. Fixes #12429 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:05:04 +02:00
Benny Halevy	afece5bdc4	storage_service: coroutinize restore_replica_count and unwrap the async thread started for streaming. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:05:04 +02:00
Benny Halevy	d1eadc39c1	storage_service: restore_replica_count: undefer stop_status_checker Now that all exceptions in the rest of the function are swallowed, just execute the stop_status_checker deferred action serially before returning, on the way to coroutinizing restore_replica_count (since we can't co_await status_checker inside the deferred action). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:05:04 +02:00
Benny Halevy	788ecb738d	storage_service: restore_replica_count: handle exceptions from stream_async and send_replication_notification On the way to coroutinizing restore_replica_count, extract awaiting stream_async and send_replication_notification into try/catch blocks so we can later undefer stop_status_checker. The exception is still returned as an exceptional future which is logged by the caller as a warning. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:02:42 +02:00
Benny Halevy	b54d121dfd	storage_service: restore_replica_count: coroutinize status_checker There is no need to start a thread for the status_checker; it can be implemented using a background coroutine. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-04 19:02:20 +02:00
Kamil Braun	4268b1bbc2	Merge 'raft: raft_group0, register RPC verbs on all shards' from Gusev Petr raft_group0 used to register RPC verbs only on shard 0. This worked on clusters with the same --smp setting on all nodes, since RPCs in this case are processed on the same shard as the calling code, and raft_group0 methods only run on shard 0. A new test test_nodes_with_different_smp was added to identify the problem. Since --smp can only be specified via the command line, a corresponding parameter was added to the ManagerClient.server_add method. It allows overriding the default parameters set by the SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting individual items. Fixes: #12252 Closes #12374 * github.com:scylladb/scylladb: raft: raft_group0, register RPC verbs on all shards raft: raft_append_entries, copy entries to the target shard test.py, allow to specify the node's command line in test	2023-01-04 11:11:21 +01:00
Avi Kivity	2739ac66ed	treewide: drop cql_serialization_format Now that we don't accept cql protocol version 1 or 2, we can drop cql_serialization_format everywhere, except in the IDL (since it's part of the inter-node protocol). A few functions had duplicate versions, one with and one without a cql_serialization_format parameter. They are deduplicated. Care is taken that `partition_slice`, which communicates the cql_serialization_format across nodes, still presents a valid cql_serialization_format to other nodes when transmitting itself and rejects protocol 1 and 2 serialization format when receiving. The IDL is unchanged. One test checking the 16-bit serialization format is removed.	2023-01-03 19:54:13 +02:00
Petr Gusev	8417840647	raft: raft_group0, register RPC verbs on all shards raft_group0 used to register RPC verbs only on shard 0. This worked on clusters with the same --smp setting on all nodes, since RPCs in this case are (usually) processed on the same shard as the calling code, and raft_group0 methods only run on shard 0. A new test test_nodes_with_different_smp was added to identify the problem. Fixes: #12252	2023-01-03 17:04:07 +03:00
Petr Gusev	7725e03a09	raft: raft_append_entries, copy entries to the target shard If append_entries RPC was received on a non-zero shard, we may need to pass it to a zero (or, potentially, some other) shard. The problem is that raft::append_request contains entries in the form of raft::log_entry_ptr == lw_shared_ptr<log_entry>, which doesn't support cross-shard reference counting. In debug mode it contains a special ref-counting facility debug_shared_ptr_counter_type, which resorts to on_internal_error if it detects such a case. To solve this, we just copy log entries to the target shard if it isn't equal to the current one. In most cases, if --smp setting is the same on all nodes, RPC will be handled on zero shard, so there will be no overhead.	2023-01-03 15:25:00 +03:00
Avi Kivity	767b7be8be	Merge 'Get rid of handle_state_replacing' from Benny Halevy Since [repair: Always use run_replace_ops](`2ec1f719de`), nodes no longer publish HIBERNATE state so we don't need to support handling it. Replace is now always done using node operations (using repair or streaming), so nodes are never expected to change status to HIBERNATE. Therefore storage_service::handle_state_replacing is not needed anymore. This series gets rid of it and updates documentation related to STATUS:HIBERNATE accordingly. Fixes #12330 Closes #12349 * github.com:scylladb/scylladb: docs: replace-dead-node: get rid of hibernate status storage_service: get rid of handle_state_replacing	2023-01-02 13:35:29 +02:00
Gleb Natapov	28952d32ff	storage_service: move leave_ring outside of unbootstrap() We want to reuse the latter without the call. Message-Id: <20221228144944.3299711-17-gleb@scylladb.com>	2023-01-02 12:03:29 +02:00
Gleb Natapov	96453ff75f	service: raft: improve group0_state_machine::apply logging Trace how many entries are applied as well. Message-Id: <20221228144944.3299711-14-gleb@scylladb.com>	2023-01-02 11:57:16 +02:00
Gleb Natapov	dbd5b97201	storage_service: improve logging in update_pending_ranges() function We pass the reason for the change. Log it as well. Message-Id: <20221228144944.3299711-11-gleb@scylladb.com>	2023-01-02 11:54:03 +02:00
Gleb Natapov	5a96751534	storage_service: remove start_leaving since it is no longer used Message-Id: <20221228144944.3299711-2-gleb@scylladb.com>	2023-01-02 11:37:48 +02:00
Asias He	d819d98e78	storage_service: Ignore dropped table for repair_updater In case a table is dropped, we should ignore it in the repair_updater, since we can not update off strategy trigger for a dropped table. Refs #12373 Closes #12388	2022-12-24 13:48:25 +02:00
Aleksandra Martyniuk	f56e886127	repair: delete sharded abort source from node_ops_info Sharded abort source in node_ops_info is no longer needed since its functionality is provided by task manager's tasks structure.	2022-12-21 11:37:03 +01:00
Aleksandra Martyniuk	60e298fda1	repair: change utils::UUID to node_ops_id Type of the id of node operations is changed from utils::UUID to node_ops_id. This way the id of node operations would be easily distinguished from the ids of other entities. Closes #11673	2022-12-20 17:04:47 +02:00
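
The step ordering described in commit `db734cd74f` above (make the node a non-voter first, then remove it from the token ring, then from group 0) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: `Cluster`, `removenode`, `InjectedFailure` and `leftover_after_failure` are hypothetical names invented for the illustration and are not ScyllaDB APIs; streaming is not modelled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model (hypothetical): token ring members plus group 0 voters/non-voters."""
    ring: set = field(default_factory=set)
    voters: set = field(default_factory=set)
    nonvoters: set = field(default_factory=set)


class InjectedFailure(Exception):
    """Raised to simulate the procedure failing right after a given step."""


def _make_nonvoter(cluster, node):
    # Demote the node: it stays in group 0 but no longer counts towards quorum.
    if node in cluster.voters:
        cluster.voters.remove(node)
        cluster.nonvoters.add(node)


def _remove_from_group0(cluster, node):
    cluster.voters.discard(node)
    cluster.nonvoters.discard(node)


def removenode(cluster, node, nonvoter_first, fail_after=None):
    """Run the removal steps in order; raise after the step named by fail_after."""
    steps = []
    if nonvoter_first:
        steps.append(("make_nonvoter", lambda: _make_nonvoter(cluster, node)))
    steps += [
        ("stream", lambda: None),  # data streaming/repair is not modelled
        ("remove_from_ring", lambda: cluster.ring.discard(node)),
        ("remove_from_group0", lambda: _remove_from_group0(cluster, node)),
    ]
    for name, action in steps:
        action()
        if name == fail_after:
            raise InjectedFailure(name)


def leftover_after_failure(nonvoter_first):
    """Fail between ring removal and group 0 removal; return the resulting state."""
    nodes = {"a", "b", "c"}
    cluster = Cluster(ring=set(nodes), voters=set(nodes))
    try:
        removenode(cluster, "c", nonvoter_first, fail_after="remove_from_ring")
    except InjectedFailure:
        pass
    return cluster
```

With the old ordering (`nonvoter_first=False`) the failed run leaves `"c"` as a voting group 0 member that is no longer in the ring; with the new ordering it leaves only a non-voter, which matches the failure scenario the commit message describes.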


3211 Commits