Raft rebuild is broken because the session id is not set.
The following was seen when run rebuild
stream_session - [Stream #8cfca940-afc9-11ee-b6f1-30b8f78c1451]
stream_transfer_task: Fail to send to 127.0.70.1:0:
seastar::rpc::remote_verb_error (Session not found:
00000000-0000-0000-0000-000000000000)
with raft topology, e.g.,
scylla --enable-repair-based-node-ops 0 --consistent-cluster-management true --experimental-features consistent-topology-changes
Fix by setting the session id.
Fixes#16741Closesscylladb/scylladb#16814
If the joining node fails while handling the response from the
topology coordinator, it hangs even though it knows the join
operation has failed. Therefore, we ensure it shuts down in
this patch.
Additionally, we ensure that if the first join request response
was a rejection or the node failed while handling it, the
following acceptances by the (possibly different) coordinator
don't succeed. The node considers the join operation as failed.
We shouldn't add it to the cluster.
Fixesscylladb/scylladb#16333Closesscylladb/scylladb#16650
* github.com:scylladb/scylladb:
topology_coordinator: clarify warnings
raft topology: join: allow only the first response to be a succesful acceptance
storage_service: join_node_response_handler: fix indentation
raft topology: join: shut down a node on error in response handler
Service level controller updates itself in interval. However the interval time is hardcoded in main to 10 seconds and it leads to long sleeps in some of the tests.
This patch moves this value to `service_levels_interval_ms` command line option and sets this value to 0.5s in cql-pytest.
Closesscylladb/scylladb#16394
* github.com:scylladb/scylladb:
test:cql-pytest: change service levels intervals in tests
configure service levels interval
Currently to figure out if a topology request is complete a submitter
checks the topology state and tries to figure out from that the status
of the request. This is not exact. Lets look at rebuild handling for
instance. To figure out if request is completed the code waits for
request object to disappear from the topology, but if another rebuild
starts between the end of the previous one and the code noticing that
it completed the code will continue waiting for the next rebuild.
Another problem is that in case of operation failure there is no way to
pass an error back to the initiator.
This series solves those problems by assigning an id for each request and
tracking the status of each request in a separate table. The initiator
can query the request status from the table and see if the request was
completed successfully or if it failed with an error, which is also
evadable from the table.
The schema for the table is:
CREATE TABLE system.topology_requests (
id timeuuid PRIMARY KEY,
initiating_host uuid,
start_time timestamp,
done boolean,
error text,
end_time timestamp,
);
and all entries have TTL of one month.
Now that we have explicit status for each request we may use it to
replace shutdown notification rpc. During a decommission, in
left_token_ring state, we set done to true after metadata barrier
that waits for all request to the decommissioning node to complete
and notify the decommissioning node with a regular barrier. At this
point the node will see that the request is complete and exit.
Instead of trying to guess if a request completed by looking into the
topology state (which is sometimes can be error prone) look at the
request status in the new topology_requests. If request failed report
a reason for the failure from the table.
Provide a unique ID for each topology request and store it the topology
state machine. It will be used to index new topology requests table in
order to retrieve request status.
keyspace objects are heavyweight and copies are immediately our-of-date,
so copying them is bad.
Fix by deleting the copy constructor and copy assignment operator. One
call site is fixed. This call site is safe since the it's only used
for accessing a few attributes (introduced in f70c4127c6).
Closesscylladb/scylladb#16782
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define a formatter for service::cleanup_status,
and remove its operator<<().
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16778
Introduce new REST API "/storage_service/cleanup_all"
that, when triggered, instructs the topology coordinator to initiate
cluster wide cleanup on all dirty nodes. It is done by introducing new
global command "global_topology_request::cleanup".
Sometimes it is unsafe to start a new topology operation before cleanup
runs on dirty nodes. This patch detects the situation when the topology
operation to be executed cannot be run safely until all dirty nodes do
cleanup and initiates the cleanup automatically. It also waits for
cleanup to complete before proceeding with the topology operation.
There can be a situation that nodes that needs cleanup dies and will
never clear the flag. In this case if a topology operation that wants to
run next does not have this node in its ignore node list it may stuck
forever. To fix this the patch also introduces the "liveness aware"
request queue management: we do not simple choose _a_ request to run next,
but go over the queue and find requests that can proceed considering
the nodes liveness situation. If there are multiple requests eligible to
run the patch introduces the order based on the operation type: replace,
join, remove, leave, rebuild. The order is such so to not trigger cleanup
needlessly.
We want to change the coordinator to consider nodes liveness when
processing the topology operation queue. If there is no enough live
nodes to process any of the ops we want to cancel them. For that to work
we need to be able to kick the coordinator if liveness situation
changes.
Introduce a fiber that waits on a topology event and when it sees that
the node it runs on needs to perform sstable cleanup it initiates one
for each non tablet, non local table and resets "cleanup" flag back to
"clean" in the topology.
We want to be able to wait for all writes started through the storage
proxy before a fence is advanced. Add phased_barrier that is entered
on each local write operation before checking the fence to do so. A
write will be either tracked by the phased_barrier or fenced. This will
be needed to wait for all non fenced local writes to complete before
starting a cleanup.
A cleanup needs to run when a node loses an ownership of a range (during
bootstrap) or if a range movement to an normal node failed (removenode,
decommission failure). Mark all dirty node as "cleanup needed" in those cases.
The function creates a mutation that sets cleanup to "needed" for each
normal node that, according to the erm, has data it does not own after
successful or unsuccessful topology operation.
The patch adds cleanup state to the persistent and in memory state and
handles the loading. The state can be "clean" which means no cleanup
needed, "needed" which means the node is dirty and needs to run cleanup
at some point, "running" which means that cleanup is running by the node
right now and when it will be completed the state will be reset to "clean".
The loop in `id2ip` lambda makes problems if we are applying an old raft
log that contains long-gone nodes. In this case, we may never receive
the `IP` for a node and stuck in the loop forever. In this series we
replace the loop with an if - we just don't update the `host_id <-> ip`
mapping in the `token_metadata.topology` if we don't have an `IP` yet.
The PR moves `host_id -> IP` resolution to the data plane, now it
happens each time the IP-based methods of `erm` are called. We need this
because IPs may not be known at the time the erm is built. The overhead
of `raft_address_map` lookup is added to each data plane request, but it
should be negligible. In this PR `erm/resolve_endpoints` continues to
treat missing IP for `host_id` as `internal_error`, but we plan to relax
this in the follow-up (see this PR first comment).
Closesscylladb/scylladb#16639
* github.com:scylladb/scylladb:
raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater
gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes
storage_service: topology_state_load: remove IP waiting loop
storage_service: sync_raft_topology_nodes: add target_node parameter
storage_service: sync_raft_topology_nodes: move loops to the end
storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node
storage_service: sync_raft_topology_nodes: rename add_normal_node -> process_normal_node
storage_service: sync_raft_topology_nodes: move update_topology up
storage_service: topology_state_load: remove clone_async/clear_gently overhead
storage_service: fix indentation
storage_service: extract sync_raft_topology_nodes
storage_service: topology_state_load: move remove_endpoint into mutate_token_metadata
address_map: move gossiper subscription logic into storage_service
topology_coordinator: exec_global_command: small refactor, use contains + reformat
storage_service: wait_for_ip for new nodes
storage_service.idl.hh: fix raft_topology_cmd.command declaration
erm: for_each_natural_endpoint_until: use is_vnode == true
erm: switch the internal data structures to host_id-s
erm: has_pending_ranges: switch to host_id
When a node changes its IP we need to store the mapping in
system.peers and update token_metadata.topology and erm
in-memory data structures.
The test_change_ip was improved to verify this new
behaviour. Before this patch the test didn't check
that IPs used for data requests are updated on
IP change. In this commit we add the read/write check.
It fails on insert with 'node unavailable'
error without the fix.
The loop makes problems if we are applying an old
raft log that contains long-gone nodes. In this case, we may
never receive the IP for a node and stuck in the loop forever.
The idea of the patch is to replace the loop with an
if - we just don't update the host_id <-> ip mapping
in the token_metadata.topology if we don't have an IP yet.
When we get the mapping later, we'll call
sync_raft_topology_nodes again from
gossiper_state_change_subscriber_proxy.
If it's set, instead of going over all the nodes in raft topology,
the function will update only the specified node. This parameter
will be used in the next commit, in the call to sync_raft_topology_nodes
from gossiper_state_change_subscriber_proxy.
In the following commits we need part of the
topology_state_load logic to be applied
from gossiper_state_change_subscriber_proxy.
In this commit we extract this logic into a
new function sync_raft_topology_nodes.
In the next commit we extract the loops by nodes into
a new function, in this commit we just move them
closer to each other.
Now the remove_endpoint function might be called under
token_metadata_lock (mutate_token_metadata takes it).
It's not a problem since gossiper event handlers in
raft_topology mode doesn't modify token_metadata so
we won't get a deadlock.
We are going to remove the IP waiting loop from topology_state_load
in subsequent commits. An IP for a given host_id may change
after this function has been called by raft. This means we need
to subscribe to the gossiper notifications and call it later
with a new id<->ip mapping.
In this preparatory commit we move the existing address_map
update logic into storage_service so that in later commits
we can enhance it with topology_state_load call.
When a new node joins the cluster we need to be sure that it's IP
is known to all other nodes. In this patch we do this by waiting
for the IP to appear in raft_address_map.
A new raft_topology_cmd::command::wait_for_ip command is added.
It's run on all nodes of the cluster before we put the topology
into transition state. This applies both to new and replacing nodes.
It's important to run wait_for_ip before moving to
topology::transition_state::join_group0 since in this state
node IPs are already used to populate pending nodes in erm.
So far the service levels interval, responsible for updating SL configuration,
was hardcoded in main.
Now it's extracted to `service_levels_interval_ms` option.
In a longevity test reported in scylladb/scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. Which is the case couldn't be
determined from the logs, and attempts at creating a local reproducer
failed.
Thus the plan is to continue debugging using the longevity test, but we need
more logs. To check whether `handle_state_normal` was called and which branches
were taken, include some INFO level logs there. Also, detect deadlocks inside
`gossiper::lock_endpoint` by reporting an error message if `lock_endpoint`
waits for the lock for too long.
Ref: scylladb/scylladb#16668Closesscylladb/scylladb#16733
* github.com:scylladb/scylladb:
gossiper: report error when waiting too long for endpoint lock
gossiper: store source_location instead of string in endpoint_permit
storage_service: more verbose logging in handle_state_normal
In the next patches we are going to change erm data structures
(replication_map and ring_mapping) from IP to host_id. Having
locator::host_id instead of IP in has_pending_ranges arguments
makes this transition easier.