Before this patch, if we booted a node just after removing
a different node, the booting node may still see the removed node
as NORMAL and wait for it to be UP, which would time out and fail
the bootstrap.
This issue caused scylladb/scylladb#17526.
Fix it by recalculating the set of nodes to wait for on every
iteration of the `wait_alive` loop.
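The recalculation can be sketched as follows (a language-agnostic sketch with hypothetical names, not the actual Scylla code):

```python
import time

def wait_alive(get_members, is_alive, timeout_s, poll_s=0.05):
    """Wait until every *current* cluster member is alive.

    The member set is recalculated on every iteration, so a node that
    was removed from the topology while we were waiting simply drops
    out of the wait set instead of blocking the bootstrap until timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        pending = [n for n in get_members() if not is_alive(n)]
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nodes never became alive: {pending}")
        time.sleep(poll_s)
```

With the old behavior the wait set was computed once, so a node removed mid-boot stayed in it forever.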
(cherry picked from commit 017134fd38)
Currently a new node is marked as alive too late, after it is already
reported as a pending node. The patch series changes the replace
procedure to be the same as what node_ops do: first stop reporting the
IP of the node that is being replaced as a natural replica for writes,
then mark the IP as alive, and only after that report the IP as a
pending endpoint.
Fixes: scylladb/scylladb#17421
* 'gleb/17421-fix-v2' of github.com:scylladb/scylla-dev:
test_replace_reuse_ip: add data plane load
sync_raft_topology_nodes: make replace procedure similar to nodeops one
storage_service: topology_coordinator: fix indentation after previous patch
storage_service: topology coordinator: drop ring check in node_state::replacing state
In replace-with-same-ip a new node calls gossiper.start_gossiping
from join_token_ring with the 'advertise' parameter set to false.
This means that this node will fail echo RPC-s from other nodes,
making it appear as not alive to them. The node changes this only
in storage_service::join_node_response_handler, when the topology
coordinator notifies it that it's actually allowed to join the
cluster. The node calls _gossiper.advertise_to_nodes({}), and
only from this moment other nodes can see it as alive.
The problem is that the topology coordinator sends this notification
in topology::transition_state::join_group0 state. In this state
nodes of the cluster already see the new node as pending,
they react by calling tmpr->add_replacing_endpoint and
update_topology_change_info when they process the corresponding
raft notification in sync_raft_topology_nodes. When the new
token_metadata is published, assure_sufficient_live_nodes
sees the new node in pending_endpoints. All of this happens
before the new node has handled the successful join notification,
so it's not alive yet. Suppose we had a cluster with three
nodes and we're replacing one of them with a fourth node.
For CL=QUORUM, assure_sufficient_live_nodes throws if
live < need + pending, which in our case becomes 2 < 2 + 1.
The end effect is that during replace-with-same-ip
data plane requests can fail with unavailable_exception,
breaking availability.
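The arithmetic above can be illustrated with a minimal sketch of the check (a hypothetical simplification of assure_sufficient_live_nodes, not the real implementation):

```python
class UnavailableException(Exception):
    pass

def assure_sufficient_live_nodes(live, need, pending):
    # Pending replicas raise the bar: a pending node that is not yet
    # seen as alive can push a previously satisfiable QUORUM over it.
    if live < need + pending:
        raise UnavailableException(
            f"cannot satisfy CL: live={live} need={need} pending={pending}")
```

In the scenario above: three nodes, one being replaced with the same IP, so live=2, QUORUM need=2, and the not-yet-alive replacement counts as pending=1, which makes the check throw.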
The patch makes the boot procedure more similar to the node-ops one.
It splits marking a node as "being replaced" and adding it to the
pending set into two different steps, and marks it as alive in between.
So when the node is in the topology::transition_state::join_group0 state
it is marked as "being replaced", which means it will no longer be used
for reads and writes. Then, in the next state, the new node is marked
as alive and is added to the pending list.
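The reordered flow might be sketched like this (a toy model with invented names, only to show the ordering):

```python
class Node:
    """Toy node (hypothetical model, not the Scylla classes)."""
    def __init__(self, ip):
        self.ip = ip
        self.alive = False

    def advertise(self):
        # In Scylla this corresponds to answering echo RPCs after
        # _gossiper.advertise_to_nodes({}).
        self.alive = True

class Cluster:
    def __init__(self):
        self.log = []

    def mark_being_replaced(self, ip):
        # Step 1: the IP stops being a natural replica for reads/writes,
        # but is NOT yet counted as pending.
        self.log.append(("being_replaced", ip))

    def add_pending(self, ip):
        # Step 3: only now the IP counts toward pending_endpoints.
        self.log.append(("pending", ip))

def replace_with_same_ip(cluster, new_node):
    cluster.mark_being_replaced(new_node.ip)
    new_node.advertise()              # step 2: node becomes visibly alive
    cluster.add_pending(new_node.ip)  # alive before it counts as pending
```

By the time the IP appears in the pending set, echo RPCs already succeed, so assure_sufficient_live_nodes stays satisfied.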
Fixes: scylladb/scylladb#17421
Gossiper automatically removes endpoints that do not have tokens in
normal state and either do not send gossiper updates or are dead for a
long time. We do not need this in topology-coordinator mode, since in
this mode the coordinator is responsible for managing the set of nodes
in the cluster. In addition, the patch disables quarantined-endpoint
maintenance in the gossiper in raft mode and uses the left-nodes list
from the topology coordinator to ignore updates for nodes that are no
longer part of the topology.
When loading this node's endpoint state, if it has
tokens in token_metadata, its status can already be set
to NORMAL.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When loading endpoint_state from system.peers,
pass the loaded node's dc/rack info from
storage_service::join_token_ring to gossiper::add_saved_endpoint.
Load the endpoint DC/RACK information into the endpoint_state,
if available, so it can propagate to bootstrapping nodes
via gossip, even if those nodes are DOWN after a full cluster restart.
Note that this change makes the host_id presence
mandatory following https://github.com/scylladb/scylladb/pull/16376.
The reason to do so is that the other states: tokens, dc, and rack
are useless without the host_id.
This change is backward compatible since the HOST_ID application state
has been written to system.peers since Scylla's inception,
and it would be missing only due to a potential exception
in an older version that failed to write it.
In this case, manual intervention is needed and
the correct HOST_ID needs to be manually updated in system.peers.
Refs #15787
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Pack the topology-related data loaded from system.peers
in `gms::load_endpoint_state`, to be used in a following
patch for `add_saved_endpoint`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In a longevity test reported in scylladb/scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but gets stuck in the middle. Which of the two is the case could not
be determined from the logs, and attempts at creating a local
reproducer failed.
One hypothesis is that `gossiper` is stuck on `lock_endpoint`. We dealt
with gossiper deadlocks in the past (e.g. scylladb/scylladb#7127).
Modify the code so it reports an error if `lock_endpoint` waits for the
lock for more than a minute. When the issue reproduces again in
longevity, we will see if `lock_endpoint` got stuck.
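The diagnostic might look roughly like this (a Python sketch with assumed names; the real code uses the gossiper's own locking and logging):

```python
import threading
import time

def lock_endpoint(lock, warn_after_s=60.0, log=print):
    """Acquire `lock`, reporting an error if it takes suspiciously long.

    We never give up on the lock; we just leave a trace in the log each
    time the threshold elapses, so a stuck `lock_endpoint` becomes
    visible in longevity runs.
    """
    start = time.monotonic()
    while not lock.acquire(timeout=warn_after_s):
        log(f"lock_endpoint: waited {time.monotonic() - start:.0f}s "
            "for endpoint lock, possible deadlock")
```

The caller still blocks until the lock is held; the only change is the periodic error report.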
The original code extracted only the function_name from the
source_location for logging. We'll use more information from the
source_location in later commits.
Currently, `add_saved_endpoint` is called from two paths: One, is when
loading states from system.peers in the join path (join_cluster,
join_token_ring), when `_raft_topology_change_enabled` is false, and the
other is from `storage_service::topology_state_load` when raft topology
changes are enabled.
In the latter path, from `topology_state_load`, `add_saved_endpoint` is
called only if the endpoint_state does not exist yet. However, this is
checked without acquiring the endpoint_lock and so it races with the
gossiper, and once `add_saved_endpoint` acquires the lock, the endpoint
state may already be populated.
Since `add_saved_endpoint` applies local information about the endpoint
state (e.g. tokens, dc, rack), it uses the local heart_beat_version,
with generation=0, to update the endpoint states, and that is
incompatible with changes applied via gossip, which carry the
endpoint's generation and version, determining the state's update order.
This change makes sure that the endpoint state is never updated in
`add_saved_endpoint` if it has a non-zero generation. An internal-error
exception is thrown if a non-zero generation is found, and in the only
call site that might reach that state, in
`storage_service::topology_state_load`, the caller acquires the
endpoint_lock to check for the existence of the endpoint_state,
calling `add_saved_endpoint` under the lock only if the endpoint_state
does not exist.
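The invariant can be sketched as follows (a simplified data model with invented shapes, not the real endpoint_state):

```python
import threading

def add_saved_endpoint(state_map, ep, saved_state, lock):
    """Apply locally-saved info (tokens/dc/rack) for endpoint `ep`.

    Saved info is applied with generation 0; if the state was already
    populated via gossip (non-zero generation), clobbering it would
    break the generation/version ordering, so that case is treated as
    an internal error. The caller checks existence under the same lock.
    """
    with lock:
        cur = state_map.get(ep)
        if cur is not None and cur["generation"] != 0:
            raise AssertionError(
                f"add_saved_endpoint: {ep} already has live state {cur}")
        state_map[ep] = dict(saved_state, generation=0)
```

Holding the lock across both the existence check and the update is what closes the race described above.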
Fixes #16429
Closes scylladb/scylladb#16432
* github.com:scylladb/scylladb:
gossiper: add_saved_endpoint: keep heart_beat_state if ep_state is found
storage_service: topology_state_load: lock endpoint for add_saved_endpoint
raft_group_registry: move on_alive error injection to gossiper
Change the mutate_live_and_unreachable_endpoints procedure
so that the called `func` would mutate a cloned
`live_and_unreachable_endpoints` object in place.
Those are replicated to temporary copies on all shards
using `foreign<unique_ptr<>>` so that they would be
automatically freed on exception.
Only after all copies are made, they are applied
on all gossiper shards in a noexcept loop
and finally, an `on_success` function is called
to apply further side effects if everything else
was replicated successfully.
The latter is still susceptible to exceptions,
but we can live with those as long as `_live_endpoints`
and `_unreachable_endpoints` are synchronized on all shards.
With that, the read-only methods:
`get_live_members_synchronized` and
`get_unreachable_members_synchronized`
become trivial and they just return the required data
from shard 0.
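The two-phase shape of the procedure can be modeled like this (a Python sketch of the C++ `foreign<unique_ptr<>>` scheme; shards are modeled as list slots):

```python
import copy

def mutate_live_and_unreachable(shards, func, on_success=None):
    """Two-phase replicated update.

    Phase 1 (may throw): run `func` on a clone, then prepare one copy
    per shard; an exception here leaves every shard untouched.
    Phase 2 (must not throw): swap the prepared copies in on each
    shard, then run `on_success` for further side effects.
    """
    cloned = copy.deepcopy(shards[0])
    func(cloned)                                  # may throw: no harm done
    copies = [copy.deepcopy(cloned) for _ in shards]
    for i in range(len(shards)):                  # "noexcept" swap loop
        shards[i] = copies[i]
    if on_success is not None:
        on_success()
```

Because all allocation and copying happens before the swap loop, the shards either all observe the new state or all keep the old one.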
Fixes #15089
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#16597
Rather than calling on_change for each particular
application_state, pass an endpoint_state::map_type
with all changed states, to be processed as a batch.
In particular, this allows storage_service::on_change
to call update_peer_info once for all changed states.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
None of the subscribers does anything in before_change.
This is done before changing `on_change` in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Have a central definition for the map held
in the endpoint_state (before changing it to
std::unordered_map).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`topology_state_load` currently calls `add_saved_endpoint`
only if it finds no endpoint_state_ptr for the endpoint.
However, this is done before locking the endpoint
and the endpoint state could be inserted concurrently.
To prevent that, a permit_id parameter was added to
`add_saved_endpoint` allowing the caller to call it
while the endpoint is locked. With that, `topology_state_load`
locks the endpoint and checks the existence of the endpoint state
under the lock, before calling `add_saved_endpoint`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Returns this node's endpoint_state_ptr.
With this entry point, the caller doesn't need to
get_broadcast_address.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Remove `fall_back_to_syn_msg` which is not necessary in newer Scylla versions.
Fix the calculation of `nodes_down` which could count a single node multiple times.
Make shadow round mandatory during bootstrap and replace -- these operations are unsafe to do without checking features first, which are obtained during the shadow round (outside raft-topology mode).
Finally, during node restart, allow the shadow round to be skipped when getting `timeout_error`s from contact points, not only when getting `closed_error`s (during restart it's best-effort anyway, and in general it's impossible to distinguish between a dead node and a partitioned node).
More details in commit messages.
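The fixed `nodes_down` calculation can be sketched like this (assumed data shapes, not the actual code):

```python
def count_nodes_down(contact_results):
    """Count distinct unreachable contact points.

    A node that fails several gossip rounds is counted only once (via
    the set), and a timeout counts as down just like a closed
    connection, matching the two fixes described above.
    """
    down = set()
    for node, outcome in contact_results:
        if outcome in ("closed_error", "timeout_error"):
            down.add(node)
    return len(down)
```

Counting per-failure instead of per-node is what previously let a single flaky node be counted multiple times.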
Ref: https://github.com/scylladb/scylladb/issues/15675
Closes scylladb/scylladb#15941
* github.com:scylladb/scylladb:
gossiper: do_shadow_round: increment `nodes_down` in case of timeout
gossiper: do_shadow_round: fix `nodes_down` calculation
storage_service: make shadow round mandatory during bootstrap/replace
gossiper: do_shadow_round: remove default value for nodes param
gossiper: do_shadow_round: remove `fall_back_to_syn_msg`
It is unsafe to bootstrap or perform replace without performing the
shadow round, which is used to obtain features from the existing cluster
and verify that we support all enabled features.
Before this patch, I could easily produce the following scenario:
1. bootstrap first node in the cluster
2. shut it down
3. start bootstrapping second node, pointing to the first as seed
4. the second node skips shadow round because it gets
`rpc::closed_error` when trying to connect to first node.
5. the node then passes the feature check (!) and proceeds to the next
step, where it waits for nodes to show up in gossiper
6. we now restart the first node, and the second node finishes bootstrap
The shadow round must be mandatory during bootstrap/replace, which is
what this patch does.
On restart it can remain optional as it was until now. In fact it should
be completely unnecessary during restart, but since we did it until now
(as best-effort), we can keep doing it.
The status has not been used since 2ec1f719de,
which is included in scylla-4.6.0. We cannot have a mixed cluster with
a version that old, so the new version should not carry the
compatibility burden.
Many of the gossiper internal functions currently use seastar threads for historical reasons,
but since they are short-lived, the cost of spawning a seastar thread for them is excessive,
and they can be simplified and made more efficient using coroutines.
Closes #15364
* github.com:scylladb/scylladb:
gossiper: reindent do_stop_gossiping
gossiper: coroutinize do_stop_gossiping
gossiper: reindent assassinate_endpoint
gossiper: coroutinize assassinate_endpoint
gossiper: coroutinize handle_ack2_msg
gossiper: handle_ack_msg: always log warning on exception
gossiper: reindent handle_ack_msg
gossiper: coroutinize handle_ack_msg
gossiper: reindent handle_syn_msg
gossiper: coroutinize handle_syn_msg
gossiper: message handlers: no need to capture shared_from_this
gossiper: add_local_application_state: throw internal error if endpoint state is not found
gossiper: coroutinize add_local_application_state
Simplify the function. It does not need to spawn
a seastar thread.
While at it, declare it as private since it's called
only internally by the gossiper (and on shard 0).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This function has practically returned true since inception.
In d38deef499
it started using messaging_service().knows_version(endpoint),
which also returns `true` unconditionally, to this day.
So there's no point in calling it, since we can assume
that `uses_host_id` is true for all versions.
Closes #15343
* github.com:scylladb/scylladb:
storage_service: fixup indentation after last patch
gossiper: get rid of uses_host_id
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We need to const_cast `this` since the const
container() has no const invoke_on override.
Trying to fix this in seastar sharded.hh breaks
many other call sites in scylla.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that the endpoint_state isn't changed in place,
we do not need to copy it for each subscriber.
We can instead just pass the lw_shared_ptr holding
a snapshot of it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
And restrict the accessor methods to return const pointers
or references.
With that, the endpoint_state_ptr:s held in the _endpoint_state_map
point to immutable endpoint_state objects - with one exception:
the endpoint_state update_timestamp may be updated in place,
but the endpoint_state_map is immutable.
replicate() replaces the endpoint_state_ptr in the map
with a new one to maintain immutability.
A later change will also make this exception safe, so that
replicate() will guarantee strong exception safety: either all shards
are updated or none of them.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This commit changes the interface to
using endpoint_state_ptr = lw_shared_ptr<const endpoint_state>
so that users can get a snapshot of the endpoint_state
that they must not modify in place anyway.
While internally, gossiper still has the legacy helpers
to manage the endpoint_state.
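The snapshot semantics can be modeled in a few lines (a Python analogue of the `lw_shared_ptr<const endpoint_state>` scheme, with an invented state shape):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EndpointState:          # stand-in for the immutable C++ state
    generation: int
    version: int

def apply_update(state_map, ep, **changes):
    """Swap in a new immutable snapshot instead of mutating in place.

    An update builds a new object and replaces the reference in the
    map, so readers holding the old snapshot keep a consistent view
    and never observe a half-modified state.
    """
    state_map[ep] = replace(state_map[ep], **changes)
```

This is why subscribers can hold a snapshot across yields without copying the whole state.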
Fixes scylladb/scylladb#14799
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before changing _endpoint_state_map to hold a
lw_shared_ptr<endpoint_state>, provide synchronous helpers
for users to traverse all endpoint_states with no need
to copy them (as long as the called func does not yield).
With that, gossiper::get_endpoint_states() can be made private.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>