3211 Commits

Author SHA1 Message Date
Kamil Braun
b6b35ce061 service: storage_proxy: sequence CDC preimage select with Paxos learn
`paxos_response_handler::learn_decision` was calling
`cdc_service::augment_mutation_call` concurrently with
`storage_proxy::mutate_internal`. `augment_mutation_call` was selecting
rows from the base table in order to create the preimage, while
`mutate_internal` was writing rows to the table. It was therefore
possible for the preimage to observe the update that it accompanied,
which doesn't make any sense, because the preimage is supposed to show
the state before the update.

Fix this by performing the operations sequentially. We can still perform
the CDC mutation write concurrently with the base mutation write.
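
The ordering described above can be sketched as follows. This is a toy Python model, not Scylla's C++ code: `learn_decision` here and its three inner coroutines are hypothetical stand-ins for the preimage select, the base table write and the CDC log write.

```python
import asyncio


async def learn_decision(table, cdc_log, key, new_value):
    # Hypothetical stand-ins for the preimage select and the two writes.
    async def select_preimage():
        await asyncio.sleep(0)  # yield to the loop, as a real read would
        return table.get(key)

    async def write_base():
        await asyncio.sleep(0)
        table[key] = new_value

    async def write_cdc(preimage):
        await asyncio.sleep(0)
        cdc_log.append((key, preimage, new_value))

    # Step 1: read the preimage before any base write has started.
    preimage = await select_preimage()
    # Step 2: the base write and the CDC log write may still overlap.
    await asyncio.gather(write_base(), write_cdc(preimage))


table = {"k": "old"}
cdc_log = []
asyncio.run(learn_decision(table, cdc_log, "k", "new"))
```

Because the select is awaited before the base write is started, the recorded preimage cannot contain the new value.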

`cdc_with_lwt_test` was sometimes failing in debug mode due to this bug
and was marked flaky. Unmark it.

Fixes #12098

(cherry picked from commit 1ef113691a)
2023-03-21 20:23:19 +02:00
Petr Gusev
069e38f02d transport server: fix unexpected server errors handling
If request processing ended with an error, it is worth
sending the error to the client through
make_error/write_response. Previously in this case we
just wrote a message to the log and didn't handle the
client connection in any way. As a result, the only
thing the client got in this case was a timeout error.

A new test_batch_with_error is added. It is quite
difficult to reproduce the error condition in a test,
so we use error injection instead. Passing injection_key
in the body of the request ensures that the exception
will be thrown only for this test request and
will not affect other requests that
the driver may send in the background.

Closes: scylladb#12104
(cherry picked from commit a4cf509c3d)
2023-03-21 20:23:09 +02:00
Gleb Natapov
39158f55d0 lwt: do not destroy capture in upgrade_if_needed lambda since the lambda is used more than once
If on the first call the capture is destroyed the second call may crash.

Fixes: #12958

Message-Id: <Y/sks73Sb35F+PsC@scylladb.com>
(cherry picked from commit 1ce7ad1ee6)
2023-02-27 14:19:37 +02:00
Kamil Braun
291b1f6e7f service/raft: raft_group0: prevent double abort
There was a small chance that we called `timeout_src.request_abort()`
twice in the `with_timeout` function, first by timeout and then by
shutdown. `abort_source` fails on an assertion in this case. Fix this.
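
A minimal sketch of the kind of guard that avoids the second call, assuming a toy `AbortSource` class that models the assertion described above. The class and the `abort_once` helper are hypothetical; the actual fix in `with_timeout` may be structured differently.

```python
class AbortSource:
    """Toy model: a second request_abort() trips an assertion."""

    def __init__(self):
        self._requested = False

    def abort_requested(self):
        return self._requested

    def request_abort(self):
        assert not self._requested, "abort requested twice"
        self._requested = True


def abort_once(src):
    # Both the timeout path and the shutdown path go through this guard,
    # so whichever fires second becomes a no-op.
    if src.abort_requested():
        return False
    src.request_abort()
    return True
```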

Fixes: #12512

Closes #12514

(cherry picked from commit 54170749b8)
2023-02-05 18:31:50 +02:00
Tomasz Grabiec
563998b69a Merge 'raft: improve group 0 reconfiguration failure handling' from Kamil Braun
Make it so that failures in `removenode`/`decommission` don't lead to reduced availability, and any leftovers in group 0 can be removed by `removenode`:
- In `removenode`, make the node a non-voter before removing it from the token ring. This removes the possibility of having a group 0 voting member which doesn't correspond to a token ring member. We can still be left with a non-voter, but that doesn't reduce the availability of group 0.
- As above but for `decommission`.
- Make it possible to remove group 0 members that don't correspond to token ring members from group 0 using `removenode`.
- Add an API to query the current group 0 configuration.

Fixes #11723.

Closes #12502

* github.com:scylladb/scylladb:
  test: test_topology: test for removing garbage group 0 members
  test/pylib: move some utility functions to util.py
  db: system_keyspace: add a virtual table with raft configuration
  db: system_keyspace: improve system.raft_snapshot_config schema
  service: storage_service: better error handling in `decommission`
  service: storage_service: fix indentation in removenode
  service: storage_service: make `removenode` work for group 0 members which are not token ring members
  service/raft: raft_group0: perform read_barrier in wait_for_raft
  service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode
  test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove
  service/raft: raft_group0: link to Raft docs where appropriate
  service/raft: raft_group0: more logging
  service/raft: raft_group0: separate function for checking and waiting for Raft
2023-01-17 21:23:15 +01:00
Kamil Braun
5545547d07 test: test_topology: test for removing garbage group 0 members
Verify that `removenode` can remove group 0 members which are not token
ring members.
2023-01-17 12:28:00 +01:00
Kamil Braun
a483915c62 db: system_keyspace: add a virtual table with raft configuration
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.

Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.

Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
2023-01-17 12:28:00 +01:00
Kamil Braun
2bfe85ce9b db: system_keyspace: improve system.raft_snapshot_config schema
Remove the `ip_addr` column which was not used. IP addresses are not
part of Raft configuration now and they can change dynamically.

Swap the `server_id` and `disposition` columns in the clustering key, so
when querying the configuration, we first obtain all servers with the
current disposition and then all servers with the previous disposition
(note that a server may appear both in current and previous).
2023-01-17 12:28:00 +01:00
Kamil Braun
c3ed82e5fb service: storage_service: better error handling in decommission
Improve the error handling in `decommission` in case `leave_group0`
fails, informing the user what they should do (i.e. call `removenode` to
get rid of the group 0 member), and allowing decommission to finish; it
does not make sense to let the node continue to run after it leaves the
token ring. (And I'm guessing it's also not safe. Or maybe impossible.)
2023-01-17 12:28:00 +01:00
Kamil Braun
beb0eee007 service: storage_service: fix indentation in removenode
2023-01-17 12:28:00 +01:00
Kamil Braun
aba33dd352 service: storage_service: make removenode work for group 0 members which are not token ring members
Due to failures we might end up in a situation where we have a group 0
member which is not a token ring member: a decommission/removenode
which failed after leaving/removing a node from the token ring but
before leaving / removing a node from group 0.

There was no way to get rid of such a group 0 member. A node that left
the token ring must not be allowed to run further (or it can cause data
loss, data resurrection and maybe other fun stuff), so we can't run
decommission a second time (even if we tried, it would just say that
"we're not a member of the token ring" and abort). And `removenode`
would also not work, because it proceeds only if the node requested to
be removed is a member of the token ring.

We modify `removenode` so it can run in this situation and remove the
group 0 member. The parts of `removenode` related to token ring
modification are now conditioned on whether the node was a member of the
token ring. The final `remove_from_group0` step is in its own branch. Some
minor refactors were necessary. Some log messages were also modified so
it's easier to understand which messages correspond to the "token movement"
part of the procedure.

The `make_nonvoter` step happens only if token ring removal happens,
otherwise we can skip directly to `remove_from_group0`.

We also move `remove_from_group0` outside the "try...catch",
fixing #11723. The "node ops" part of the procedure is related strictly
to token ring movement, so it makes sense for `remove_from_group0` to
happen outside.

Indentation is broken in this commit for easier reviewability, fixed in
the following commit.

Fixes: #11723
2023-01-17 12:28:00 +01:00
Kamil Braun
ec2cd29e42 service/raft: raft_group0: perform read_barrier in wait_for_raft
Right now wait_for_raft is called before performing group 0
configuration changes. We also want to call it before checking for
membership, and for that it's desirable to have the most recent
information, hence the call to read_barrier. In the existing use cases
it's not strictly necessary, but it doesn't hurt.
2023-01-17 12:28:00 +01:00
Kamil Braun
db734cd74f service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode
removenode currently works roughly like this:
1. stream/repair data so it ends up on new replica sets (calculated
   without the node we want to remove)
2. remove the node from the token ring
3. remove the node from group 0 configuration.

If the procedure fails after step 2 but before step 3 finishes,
we're in trouble: the cluster is left with an additional voting group 0
member, which reduces group 0's availability, and there is no way to
remove this member because `removenode` no longer considers it to be
part of the cluster (it consults the token ring to decide).

Improve this failure scenario by including a new step at the beginning:
make the node a non-voter in group 0 configuration. Then, even if we
fail after removing the node from the token ring but before removing it
from group 0, we'll only be left with a non-voter which doesn't reduce
availability.
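
The failure scenario can be illustrated with a toy model. The dictionary layout and the `fail_before_group0_removal` switch are assumptions made for this sketch and are not part of `storage_service`.

```python
def removenode(cluster, node, fail_before_group0_removal=False):
    # New first step: demote the node to a non-voter in group 0.
    cluster["voters"].discard(node)
    cluster["non_voters"].add(node)
    # Step 1 (streaming/repair to the new replica sets) is omitted here.
    # Step 2: remove the node from the token ring.
    cluster["ring"].discard(node)
    if fail_before_group0_removal:
        raise RuntimeError("simulated failure before group 0 removal")
    # Step 3: remove the node from the group 0 configuration.
    cluster["non_voters"].discard(node)


cluster = {"ring": {"a", "b", "c"}, "voters": {"a", "b", "c"}, "non_voters": set()}
try:
    removenode(cluster, "c", fail_before_group0_removal=True)
except RuntimeError:
    pass
```

After the simulated failure the leftover member `c` is only a non-voter, so the set of voters still matches the token ring.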

We make a similar change for `decommission`: between `unbootstrap()` (which
streams data) and `leave_ring()` (which removes our tokens from the
ring), become a non-voter. The difference here is that we don't become a
non-voter at the beginning, but only after streaming/repair. In
`removenode` it's desirable to make the node a non-voter as soon as
possible because it's already dead. In decommission it may be desirable
for us to remain a voter if we fail during streaming because we're still
alive and functional in that case.

In a later commit we'll also make it possible to retry `removenode` to
remove a node that is only a group 0 member and not a token ring member.
2023-01-17 12:28:00 +01:00
Kamil Braun
4f0801406e service/raft: raft_group0: link to Raft docs where appropriate
Resolve some TODOs.
2023-01-17 12:28:00 +01:00
Kamil Braun
2befbaa341 service/raft: raft_group0: more logging
Make the logs in leave_group0 consistent with logs in
remove_from_group0.
2023-01-17 12:28:00 +01:00
Kamil Braun
77dc1c4c70 service/raft: raft_group0: separate function for checking and waiting for Raft
leave_group0 and remove_from_group0 functions both start with the
following steps:
- if Raft is disabled or in RECOVERY mode, print a simple log message
  and abort
- if Raft cluster feature flag is not yet enabled, print a complex log
  message and abort
- wait for Raft upgrade procedure to finish
- then perform the actual group 0 reconfiguration.

Refactor these preparation steps to a separate function,
`wait_for_raft`. This reduces code duplication; the function will also
be used in more operations later (becoming a nonvoter or turning another
server into a nonvoter).

We also change the API so that the preparation function is called from
outside by the caller before they call the reconfiguration function.
This is because in later commits, some of the call sites (mainly
`removenode`) will want to check explicitly whether Raft is enabled and
wait for Raft's availability, then perform a sequence of steps related
to group 0 configuration depending on the result.

Also add a private function `raft_upgrade_complete()` which we use to
assert that Raft is ready to be used.
2023-01-17 12:27:58 +01:00
Wojciech Mitros
5f45b32bfa forward_service: prevent heap use-after-free of forward_aggregates
Currently, we create `forward_aggregates` inside a function that
returns the result of a future lambda that captures these aggregates
by reference. As a result, the aggregates may be destructed before
the lambda finishes, resulting in a heap use-after-free.

To prolong the lifetime of these aggregates, we cannot use a move
capture, because the lambda is wrapped in a with_thread_if_needed()
call on these aggregates. Instead, we fix this by wrapping the
entire return statement in a do_with().

Fixes #12528

Closes #12533
2023-01-17 13:25:57 +02:00
Gleb Natapov
15ebd59071 lwt: upgrade stored mutations to the latest schema during prepare
Currently they are upgraded during learn on a replica. There are two
problems with this. First, the column mapping may not exist on a replica
if it missed this particular schema (because it was down, for instance)
and the mapping history is not part of the schema. In this case "Failed
to look up column mapping for schema version" will be thrown. Second, the
lwt request coordinator may not have the schema for the mutation either
(because it was freed from the registry already), and when a replica
tries to retrieve the schema from the coordinator the retrieval will fail,
causing the whole request to fail with "Schema version XXXX not found".

Both of those problems can be fixed by upgrading stored mutations
during prepare on the node they are stored at. To upgrade a mutation its
column mapping is needed, and it is guaranteed to be present
at the node the mutation is stored at, since a prerequisite for storing
it is that the corresponding schema is available. After that the mutation
is processed using the latest schema, which will be available on all nodes.

Fixes #10770

Message-Id: <Y7/ifraPJghCWTsq@scylladb.com>
2023-01-17 11:14:46 +01:00
Avi Kivity
0b418fa7cf cql3, transport, tests: remove "unset" from value type system
The CQL binary protocol introduced "unset" values in version 4
of the protocol. Unset values can be bound to variables, which
cause certain CQL fragments to be skipped. For example, the
fragment `SET a = :var` will not change the value of `a` if `:var`
is bound to an unset value.

Unsets, however, are very limited in where they can appear. They
can only appear at the top-level of an expression, and any computation
done with them is invalid. For example, `SET list_column = [3, :var]`
is invalid if `:var` is bound to unset.

This causes the code to be littered with checks for unset, and there
are plenty of tests dedicated to catching unsets. However, a simpler
way is possible - prevent the infiltration of unsets at the point of
entry (when evaluating a bind variable expression), and introduce
guards to check for the few cases where unsets are allowed.

This is what this long patch does. It performs the following:

(general)

1. unset is removed from the possible values of cql3::raw_value and
   cql3::raw_value_view.

(external->cql3)

2. query_options is fortified with a vector of booleans,
   unset_bind_variable_vector, where each boolean corresponds to a bind
   variable index and is true when it is unset.
3. To avoid churn, two compatibility structs are introduced:
   cql3::raw_value{,_view}_vector_with_unset, which can be constructed
   from a std::vector<raw_value{,_view}>, which is what most callers
   have. They can also be constructed with explicit unset vectors, for
   the few cases they are needed.

(cql3->variables)

4. query_options::get_value_at() now throws if the requested bind variable
   is unset. This replaces all the throwing checks in expression evaluation
   and statement execution, which are removed.
5. A new query_options::is_unset() is added for the users that can tolerate
   unset; though it is not used directly.
6. A new cql3::unset_operation_guard class guards against unsets. It accepts
   an expression, and can be queried whether an unset is present. Two
   conditions are checked: the expression must be a singleton bind
   variable, and at runtime it must be bound to an unset value.
7. The modification_statement operations are split into two, via two
   new subclasses of cql3::operation. cql3::operation_no_unset_support
   ignores unsets completely. cql3::operation_skip_if_unset checks if
   an operand is unset (luckily all operations have at most one operand that
   tolerates unset) and applies unset_operation_guard to it.
8. The various sites that accept expressions or operations are modified
   to check for should_skip_operation(). These are the loops around
   operations in update_statement and delete_statement, and the checks
   for unset in attributes (LIMIT and PER PARTITION LIMIT).

(tests)

9. Many unset tests are removed. It's now impossible to enter an
   unset value into the expression evaluation machinery (there's
   just no unset value), so it's impossible to test for it.
10. Other unset tests now have to be invoked via bind variables,
   since there's no way to create an unset cql3::expr::constant.
11. Many tests have their exception message match strings relaxed.
   Since unsets are now checked very early, we don't know the context
   where they happen. It would be possible to reintroduce it (by adding
   a format string parameter to cql3::unset_operation_guard), but it
   seems not to be worth the effort. Usage of unsets is rare, and it is
   explicit (at least with the Python driver, an unset cannot be
   introduced by omission).
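
The split between a throwing accessor (item 4), an explicit unset query (item 5) and skip-if-unset operations (item 7) can be modeled roughly as below. This is a Python sketch; `QueryOptions` and `apply_set` are hypothetical names that only mirror the roles of `query_options` and `cql3::operation_skip_if_unset`, not their real interfaces.

```python
class QueryOptions:
    """Hypothetical stand-in for query_options: values plus a parallel unset vector."""

    def __init__(self, values, unset_bind_variable_vector):
        assert len(values) == len(unset_bind_variable_vector)
        self._values = values
        self._unset = unset_bind_variable_vector

    def is_unset(self, index):
        return self._unset[index]

    def get_value_at(self, index):
        # Unsets are rejected at the point of entry, so evaluation code
        # further down never sees one.
        if self._unset[index]:
            raise ValueError(f"bind variable {index} is unset")
        return self._values[index]


def apply_set(row, column, options, index):
    """Model of a skip-if-unset operation such as `SET a = :var`."""
    if options.is_unset(index):
        return False  # operation skipped, row left untouched
    row[column] = options.get_value_at(index)
    return True
```

With this shape, only the few operations that tolerate unset need to ask `is_unset`; everything else simply calls `get_value_at` and gets an error for an unset variable.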

I tried as an alternative to wrap cql3::raw_value{,_view} (that doesn't
recognize unsets) with cql3::maybe_unset_value (that does), but that
caused huge amounts of churn, so I abandoned that in favor of the
current approach.

Closes #12517
2023-01-16 21:10:56 +02:00
Kamil Braun
7510144fba Merge 'Add replace-node-first-boot option' from Benny Halevy
Allow replacing a node given its Host ID rather than its ip address.

This series adds a replace_node_first_boot option to db/config
and makes use of it in storage_service.

The new option takes priority over the legacy replace_address* options.
When the latter are used, a deprecation warning is printed.

Documentation is updated accordingly.

And a cql unit_test is added.

Ref #12277

Closes #12316

* github.com:scylladb/scylladb:
  docs: document the new replace_node_first_boot option
  dist/docker: support --replace-node-first-boot
  db: config: describe replace_address* options as deprecated
  test: test_topology: test replace using host_id
  test: pylib: ServerInfo: add host_id
  storage_service: get rid of get_replace_address
  storage_service: is_replacing: rely directly on config options
  storage_service: pass replacement_info to run_replace_ops
  storage_service: pass replacement_info to bootstrap
  storage_service: join_token_ring: reuse replacement_info.address
  storage_service: replacement_info: add replace address
  init: do not allow cfg.replace_node_first_boot of seed node
  db: config: add replace_node_first_boot option
2023-01-16 15:08:31 +01:00
Michał Sala
bbbe12af43 forward_service: fix timeout support in parallel aggregates
`forward_request` verb carried information about timeouts using
`lowres_clock::time_point` (that came from local steady clock
`seastar::lowres_clock`). The time point was produced on one node and
later compared against another node's `lowres_clock`. That behavior
was wrong (`lowres_clock::time_point`s produced with different
`lowres_clock`s cannot be compared) and could lead to delayed or
premature timeout.

To fix this issue, `lowres_clock::time_point` was replaced with
`lowres_system_clock::time_point` in `forward_request` verb.
Representation to which both time point types serialize is the same
(64-bit integer denoting the count of elapsed nanoseconds), so it was
possible to do an in-place switch of those types using logic suggested
by @avikivity:
    - using steady_clock is just broken, so we aren't taking anything
        from users by breaking it further
    - once all nodes are upgraded, it magically starts to work

Closes #12529
2023-01-16 12:08:13 +02:00
Benny Halevy
db2b76beb5 storage_service: get rid of get_replace_address
It is unused now.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:34:29 +02:00
Benny Halevy
17f70e4619 storage_service: is_replacing: rely directly on config options
Rather than on get_replace_address, before we remove the latter.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:34:29 +02:00
Benny Halevy
7282d58d11 storage_service: pass replacement_info to run_replace_ops
So it won't need to call get_replace_address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:34:09 +02:00
Benny Halevy
08598e4f64 storage_service: pass replacement_info to bootstrap
So it won't need to call get_replace_address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:30:48 +02:00
Benny Halevy
b863f7a75f storage_service: join_token_ring: reuse replacement_info.address
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:30:48 +02:00
Benny Halevy
add2f209b8 storage_service: replacement_info: add replace address
Populate replacement_info.address in prepare_replacement_info
as a first step towards getting rid of get_replace_address().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:30:48 +02:00
Kamil Braun
be390285b6 db: system_keyspace: remove (my_)server_id column from RAFT_SNAPSHOTS and RAFT_SNAPSHOT_CONFIG
A single node will run a single Raft server in any given Raft group,
so this column is not necessary.
2023-01-12 16:48:50 +01:00
Kamil Braun
bed555d1e5 db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config'
Make it clear that the table stores the snapshot configuration, which is
not necessarily the currently operating configuration (the last one
appended to the log).

In the future we plan to have a separate virtual table for showing the
currently operating configuration, perhaps we will call it
`system.raft_config`.
2023-01-12 16:21:26 +01:00
Nadav Har'El
d6e6820f33 Merge 'Drop support for cql binary protocols versions 1 and 2' from Avi Kivity
The CQL binary protocol version 3 was introduced in 2014. All Scylla
versions support it, as do Cassandra versions 2.1 and newer.

Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer
use 32-bit collection sizes.
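
The difference in size prefixes can be sketched as follows. `serialize_list` is a hypothetical helper, and the assumption that element lengths use the same width as the element count reflects my reading of the protocol specifications rather than anything stated in this series.

```python
import struct


def serialize_list(elements, protocol_version):
    # Versions 1 and 2: unsigned 16-bit big-endian sizes.
    # Version 3 and newer: signed 32-bit big-endian sizes.
    fmt = ">H" if protocol_version < 3 else ">i"
    out = struct.pack(fmt, len(elements))
    for element in elements:
        out += struct.pack(fmt, len(element)) + element
    return out
```

For a single two-byte element the version 2 encoding is 6 bytes and the version 3 encoding is 10 bytes, and a list with more than 65535 elements cannot be encoded with the 16-bit prefix at all (`struct.pack` raises `struct.error`).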

Unfortunately, we implemented support for multiple serialization formats
very intrusively, by pushing the format everywhere. This avoids the need
to re-serialize (sometimes) but is quite obnoxious. It's also likely to be
broken, since it's almost untested and it's too easy to write
cql_serialization_format::internal() instead of propagating the client
specified value.

Since protocols 1 and 2 have been obsolete for 9 years, just drop them. It's
easy to verify that they are no longer in use on a running system by
examining the `system.clients` table before upgrade.

Fixes #10607

Closes #12432

* github.com:scylladb/scylladb:
  treewide: drop cql_serialization_format
  cql: modification_statement: drop protocol check for LWT
  transport: drop cql protocol versions 1 and 2
2023-01-09 18:52:41 +02:00
Botond Dénes
2612f98a6c Merge 'Abort repair tasks' from Aleksandra Martyniuk
Aborting of repair operation is fully managed by task manager.
Repair tasks are aborted:
- on shutdown; top level repair tasks subscribe to global abort source. On shutdown all tasks are aborted recursively
- through node operations (applies to data_sync_repair_task_impls and their descendants only); data_sync_repair_task_impl subscribes to node_ops_info abort source
- with task manager api (top level tasks are abortable)
- with storage_service api and on failure; these cases were modified to be aborted the same way as the ones from above are.

Closes #12085

* github.com:scylladb/scylladb:
  repair: make top level repair tasks abortable
  repair: unify a way of aborting repair operations
  repair: delete sharded abort source from node_ops_info
  repair: delete unused node_ops_info from data_sync_repair_task_impl
  repair: delete redundant abort subscription from shard_repair_task_impl
  repair: add abort subscription to data sync task
  tasks: abort tasks on system shutdown
2023-01-05 15:21:35 +01:00
Avi Kivity
cc6010b512 Merge 'Make restore_replica_count abortable' from Benny Halevy
Similar to the way we allow aborting streaming-based
removenode, subscribe to storage_service::_abort_source
to request abort locally and pass a shared_ptr<abort_source>
to `node_ops_info`, used to abort removenode_with_repair
on shutdown.

Fixes #12429

Closes #12430

* github.com:scylladb/scylladb:
  storage_service: restore_replica_count: demote status_checker related logging to debug level
  storage_service: restore_replica_count: allow aborting removenode_with_repair
  storage_service: coroutinize restore_replica_count
  storage_service: restore_replica_count: undefer stop_status_checker
  storage_service: restore_replica_count: handle exceptions from stream_async and send_replication_notification
  storage_service: restore_replica_count: coroutinize status_checker
2023-01-05 15:21:35 +01:00
Benny Halevy
086546f575 storage_service: restore_replica_count: demote status_checker related logging to debug level
The status_checker is not the main line of business
of restore_replica_count; starting and stopping it
do not seem to deserve info level logging, which
might have been useful in the past to debug issues
surrounding that.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:05:04 +02:00
Benny Halevy
3879ee1db8 storage_service: restore_replica_count: allow aborting removenode_with_repair
Similar to the way we allow aborting streaming-based
removenode, subscribe to storage_service::_abort_source
to request abort locally and pass a shared_ptr<abort_source>
to `node_ops_info`, used to abort removenode_with_repair
on shutdown.

Fixes #12429

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:05:04 +02:00
Benny Halevy
afece5bdc4 storage_service: coroutinize restore_replica_count
and unwrap the async thread started for streaming.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:05:04 +02:00
Benny Halevy
d1eadc39c1 storage_service: restore_replica_count: undefer stop_status_checker
Now that all exceptions in the rest of the function
are swallowed, just execute the stop_status_checker
deferred action serially before returning, on the
way to coroutinizing restore_replica_count (since
we can't co_await status_checker inside the deferred
action).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:05:04 +02:00
Benny Halevy
788ecb738d storage_service: restore_replica_count: handle exceptions from stream_async and send_replication_notification
On the way to coroutinizing restore_replica_count,
extract awaiting stream_async and send_replication_notification
into try/catch blocks so we can later undefer stop_status_checker.

The exception is still returned as an exceptional future
which is logged by the caller as warning.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:02:42 +02:00
Benny Halevy
b54d121dfd storage_service: restore_replica_count: coroutinize status_checker
There is no need to start a thread for the status_checker;
it can be implemented using a background coroutine.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-04 19:02:20 +02:00
Kamil Braun
4268b1bbc2 Merge 'raft: raft_group0, register RPC verbs on all shards' from Gusev Petr
raft_group0 used to register RPC verbs only on shard 0. This worked on
clusters with the same --smp setting on all nodes, since RPCs in this
case are processed on the same shard as the calling code, and
raft_group0 methods only run on shard 0.

A new test test_nodes_with_different_smp was added to identify the
problem. Since --smp can only be specified via the command line, a
corresponding parameter was added to the ManagerClient.server_add
method. It allows overriding the default parameters set by the
SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting
individual items.

Fixes: #12252

Closes #12374

* github.com:scylladb/scylladb:
  raft: raft_group0, register RPC verbs on all shards
  raft: raft_append_entries, copy entries to the target shard
  test.py, allow to specify the node's command line in test
2023-01-04 11:11:21 +01:00
Avi Kivity
2739ac66ed treewide: drop cql_serialization_format
Now that we don't accept cql protocol version 1 or 2, we can
drop cql_serialization_format everywhere, except in the IDL
(since it's part of the inter-node protocol).

A few functions had duplicate versions, one with and one without
a cql_serialization_format parameter. They are deduplicated.

Care is taken that `partition_slice`, which communicates
the cql_serialization_format across nodes, still presents
a valid cql_serialization_format to other nodes when
transmitting itself and rejects protocol 1 and 2 serialization
format when receiving. The IDL is unchanged.

One test checking the 16-bit serialization format is removed.
2023-01-03 19:54:13 +02:00
Petr Gusev
8417840647 raft: raft_group0, register RPC verbs on all shards
raft_group0 used to register RPC verbs only on shard 0.
This worked on clusters with the same --smp setting on
all nodes, since RPCs in this case are (usually)
processed on the same shard as the calling code,
and raft_group0 methods only run on shard 0.

A new test test_nodes_with_different_smp was added
to identify the problem.

Fixes: #12252
2023-01-03 17:04:07 +03:00
Petr Gusev
7725e03a09 raft: raft_append_entries, copy entries to the target shard
If append_entries RPC was received on a non-zero shard, we may
need to pass it to a zero (or, potentially, some other) shard.
The problem is that raft::append_request contains entries in the form
of raft::log_entry_ptr == lw_shared_ptr<log_entry>, which doesn't
support cross-shard reference counting. In debug mode it contains
a special ref-counting facility debug_shared_ptr_counter_type,
which resorts to on_internal_error if it detects such a case.

To solve this, we just copy log entries to the target shard if it
isn't equal to the current one. In most cases, if --smp setting
is the same on all nodes, RPC will be handled on zero shard,
so there will be no overhead.
2023-01-03 15:25:00 +03:00
Avi Kivity
767b7be8be Merge 'Get rid of handle_state_replacing' from Benny Halevy
Since [repair: Always use run_replace_ops](2ec1f719de), nodes no longer publish HIBERNATE state so we don't need to support handling it.

Replace is now always done using node operations (using repair or streaming),
so nodes are never expected to change status to HIBERNATE.

Therefore storage_service::handle_state_replacing is not needed anymore.

This series gets rid of it and updates documentation related to STATUS:HIBERNATE accordingly.

Fixes #12330

Closes #12349

* github.com:scylladb/scylladb:
  docs: replace-dead-node: get rid of hibernate status
  storage_service: get rid of handle_state_replacing
2023-01-02 13:35:29 +02:00
Gleb Natapov
28952d32ff storage_service: move leave_ring outside of unbootstrap()
We want to reuse the latter without the call.

Message-Id: <20221228144944.3299711-17-gleb@scylladb.com>
2023-01-02 12:03:29 +02:00
Gleb Natapov
96453ff75f service: raft: improve group0_state_machine::apply logging
Trace how many entries are applied as well.

Message-Id: <20221228144944.3299711-14-gleb@scylladb.com>
2023-01-02 11:57:16 +02:00
Gleb Natapov
dbd5b97201 storage_service: improve logging in update_pending_ranges() function
We pass the reason for the change. Log it as well.

Message-Id: <20221228144944.3299711-11-gleb@scylladb.com>
2023-01-02 11:54:03 +02:00
Gleb Natapov
5a96751534 storage_service: remove start_leaving since it is no longer used
Message-Id: <20221228144944.3299711-2-gleb@scylladb.com>
2023-01-02 11:37:48 +02:00
Asias He
d819d98e78 storage_service: Ignore dropped table for repair_updater
In case a table is dropped, we should ignore it in the repair_updater,
since we can not update off strategy trigger for a dropped table.

Refs #12373

Closes #12388
2022-12-24 13:48:25 +02:00
Aleksandra Martyniuk
f56e886127 repair: delete sharded abort source from node_ops_info
Sharded abort source in node_ops_info is no longer needed since
its functionality is provided by task manager's tasks structure.
2022-12-21 11:37:03 +01:00
Aleksandra Martyniuk
60e298fda1 repair: change utils::UUID to node_ops_id
Type of the id of node operations is changed from utils::UUID
to node_ops_id. This way the id of node operations would be easily
distinguished from the ids of other entities.

Closes #11673
2022-12-20 17:04:47 +02:00