Broken in f570e41d18.
Not replicating this may cause the coordinator to treat a node which is
down as alive, or vice versa.
Fixes a regression in dtest:
consistency_test.py:TestAvailability.test_simple_strategy
which expected an "unavailable" exception but was getting a
timeout.
Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>
storage_service depends on endpoint states to be replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies the whole endpoint state map,
which is expensive in large clusters. It's more efficient to replicate
only incremental changes, and only once, rather than for each
application state.
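As a rough illustration of the difference (plain C++ with stand-in types
and hypothetical names, not the actual sharded seastar code):

    #include <map>
    #include <string>
    #include <vector>

    using inet_address = std::string;                  // stand-in for the real type
    struct endpoint_state { std::string status; int version = 0; };
    using state_map = std::map<inet_address, endpoint_state>;

    // One copy of the state per shard (modelled here as plain containers).
    std::vector<state_map> per_shard_state(4);

    // Old approach: copy the entire map to every shard on any change.
    void replicate_full(const state_map& master) {
        for (auto& shard_copy : per_shard_state) {
            shard_copy = master;                       // whole-map copy, expensive in large clusters
        }
    }

    // New approach: push only the entry that actually changed, once.
    void replicate_delta(const inet_address& ep, const endpoint_state& es) {
        for (auto& shard_copy : per_shard_state) {
            shard_copy[ep] = es;                       // incremental update
        }
    }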
Makes state application faster due to increased parallelism.
Refs #2855.
Bootstrap of the 11th node, ignoring apply_state_locally() calls which complete instantly:
Before:
DEBUG 2017-10-06 15:24:04,213 [shard 0] gossip - apply_state_locally() took 1230 ms
DEBUG 2017-10-06 15:24:04,223 [shard 0] gossip - apply_state_locally() took 1421 ms
DEBUG 2017-10-06 15:24:04,225 [shard 0] gossip - apply_state_locally() took 607 ms
DEBUG 2017-10-06 15:24:04,288 [shard 0] gossip - apply_state_locally() took 488 ms
DEBUG 2017-10-06 15:24:04,408 [shard 0] gossip - apply_state_locally() took 1425 ms
After:
DEBUG 2017-10-06 16:24:13,130 [shard 0] gossip - apply_state_locally() took 814 ms
It's possible that a change listener for a later state will run before
the change listener for the previous state completes, in which case the
node's state can be corrupted. For example, the previous change listener
may overwrite system.peers with an old value.
This patch fixes the problem by serializing state changes and
listeners for each node.
The implementation uses loading_shared_values so that the lock remains
alive as long as there is anyone holding it. Using endpoint_state_map
for that doesn't seem appropriate, because entries can be removed from
it while listeners are still running. There is code in the gossiper
which anticipates that an entry may be gone across deferring points in
some places.
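A minimal sketch of that idea, using standard C++ primitives and
hypothetical names in place of loading_shared_values and the
future-based gossiper code: the per-endpoint lock entry is kept alive
by shared ownership, independently of the endpoint_state_map entry.

    #include <map>
    #include <memory>
    #include <mutex>
    #include <string>

    using inet_address = std::string;                  // stand-in for the real type

    class per_endpoint_locks {
        std::mutex _registry_lock;                     // protects the registry itself
        std::map<inet_address, std::weak_ptr<std::mutex>> _locks;
    public:
        // The returned handle keeps the per-endpoint lock alive for as long as
        // anyone holds it, even if the endpoint is removed from the state map.
        std::shared_ptr<std::mutex> get(const inet_address& ep) {
            std::lock_guard<std::mutex> g(_registry_lock);
            if (auto existing = _locks[ep].lock()) {
                return existing;
            }
            auto fresh = std::make_shared<std::mutex>();
            _locks[ep] = fresh;
            return fresh;
        }
    };

    void apply_state_for(per_endpoint_locks& locks, const inet_address& ep) {
        auto l = locks.get(ep);
        std::lock_guard<std::mutex> g(*l);             // serializes application and listeners per node
        // ... apply the new state and run change listeners for `ep` ...
    }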
apply_new_states() always fires change listeners for received values,
even if we already processed the state earlier. Some change listeners
are heavy-weight, e.g. storage_service::handle_state_normal(). We
should avoid calling them more than necessary.
Make sure that we always run the change listeners by putting them in a
defer() block. Otherwise, if an exception is thrown in the middle of
state application, the change listeners would not be run. Later we would
not detect the change for states which were already applied, and would
not run the change listeners.
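A simplified sketch (plain C++; a small scope guard stands in for
defer(), and the helpers are stand-ins): collect the states that were
actually applied and notify listeners from the guard, so notification
happens even if application throws part-way.

    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    enum class application_state { STATUS, TOKENS, LOAD };

    // Stand-ins for the real gossiper machinery.
    void apply_one_state(application_state, const std::string&) { /* may throw in real code */ }
    void do_on_change_notifications(const std::vector<application_state>&) { /* fire listeners */ }

    struct scope_guard {                               // minimal stand-in for defer()
        std::function<void()> f;
        ~scope_guard() { f(); }
    };

    void apply_new_states(const std::vector<std::pair<application_state, std::string>>& incoming) {
        std::vector<application_state> changed;
        // Listeners run on scope exit, whether we return normally or unwind on an exception.
        scope_guard notify{[&] { do_on_change_notifications(changed); }};
        for (const auto& [state, value] : incoming) {
            apply_one_state(state, value);             // if this throws, `notify` still runs
            changed.push_back(state);                  // only notify for states we actually applied
        }
    }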
Fixes #2867
It has been serialized since e428d06f40. This causes a regression in
the performance of application state propagation due to reduced
parallelism.
Processing states for each node has high latency due to memtable
flushes triggered by update_tokens() and commitlog syncs done by
system.peers updates, if commitlog sync mode is set to "batch". We
have high internal concurrency for these, so increasing parallelism
significantly reduces the time to process all states.
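Illustrative sketch only (plain C++ threads stand in for the
future-based concurrency; names and bounds are made up): apply each
node's state concurrently and wait for all of them, instead of
processing nodes one after another.

    #include <future>
    #include <map>
    #include <string>
    #include <vector>

    using inet_address = std::string;
    struct endpoint_state { int heartbeat_version = 0; };

    // Stand-in for the slow per-node work (memtable flush, commitlog sync).
    void apply_one_endpoint(const inet_address&, const endpoint_state&) {}

    void apply_state_locally(const std::map<inet_address, endpoint_state>& incoming) {
        std::vector<std::future<void>> pending;
        for (const auto& [ep, es] : incoming) {
            // Launch each endpoint's application in parallel; the real code bounds concurrency.
            pending.push_back(std::async(std::launch::async, apply_one_endpoint, ep, es));
        }
        for (auto& f : pending) {
            f.get();                                   // wait for all, propagating any exception
        }
    }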
Fixes #2855.
The failure detector decides that a node is down if it hasn't received a change of
its heartbeat for longer than ~11 times the average of past intervals between
updates.
If there are multiple incoming ACKs containing information about the
same node, we may detect and report a change for each of them. This
will cause the failure_detector to compute an average report period on
the order of milliseconds. After the update storm is over, it will
declare the node as failed very soon, because the report period will now
be a large multiple of the average.
Fix by not counting short updates into the calculation of average
arrival time.
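Simplified model of the fix (hypothetical names and threshold, not the
actual failure_detector code): intervals shorter than a minimum are not
added to the window used to compute the average.

    #include <chrono>
    #include <cstddef>
    #include <deque>
    #include <numeric>

    using namespace std::chrono;

    class arrival_window {
        std::deque<milliseconds> _intervals;
        steady_clock::time_point _last{};
        static constexpr auto MIN_INTERVAL = milliseconds(500);   // assumed threshold
        static constexpr std::size_t MAX_SAMPLES = 1000;
    public:
        void report(steady_clock::time_point now) {
            if (_last != steady_clock::time_point{}) {
                auto gap = duration_cast<milliseconds>(now - _last);
                if (gap >= MIN_INTERVAL) {                         // ignore update storms
                    if (_intervals.size() >= MAX_SAMPLES) {
                        _intervals.pop_front();
                    }
                    _intervals.push_back(gap);
                }
            }
            _last = now;
        }
        milliseconds mean() const {
            if (_intervals.empty()) {
                return milliseconds(0);
            }
            auto total = std::accumulate(_intervals.begin(), _intervals.end(), milliseconds(0));
            return total / _intervals.size();
        }
    };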
Fixes #2861.
This patch introduces the get_application_state_ptr() function, which
allows access to a versioned_value of a particular endpoint.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Now that we have get_endpoint_state_for_endpoint_ptr(), which does not
return a copy and allows mutating the actual state, we can use it
instead of repeating the lookup code.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Have convict() use get_endpoint_state_for_endpoint_ptr(), simplify
logging, and also protect expensive operations by checking the log
level.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.
This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).
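Sketch of the two lookup styles, with stand-in types (the real functions
live on the gossiper and return its types):

    #include <map>
    #include <optional>
    #include <string>

    using inet_address = std::string;
    struct endpoint_state { std::map<int, std::string> application_states; };

    std::map<inet_address, endpoint_state> endpoint_state_map;

    // Copy-returning variant: safe to hold across defer points, but expensive.
    std::optional<endpoint_state> get_endpoint_state_for_endpoint(const inet_address& ep) {
        auto it = endpoint_state_map.find(ep);
        if (it == endpoint_state_map.end()) {
            return std::nullopt;
        }
        return it->second;                             // deep copy of all application states
    }

    // Pointer-returning variant: cheap, but the pointer must not escape the function
    // or be used across a deferring point, since the map entry may move or be erased.
    endpoint_state* get_endpoint_state_for_endpoint_ptr(const inet_address& ep) {
        auto it = endpoint_state_map.find(ep);
        return it == endpoint_state_map.end() ? nullptr : &it->second;
    }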
Fixes #764
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
gossiper::apply_state_locally() calls handle_major_state_change() for
each endpoint, in a seastar thread, which calls mark_alive() for new
nodes, which calls ms().send_gossip_echo(id).get(). So it synchronously
waits for each node to respond before it moves on to the next entry. As
a result it may take a while before the whole state is processed.
Apache (tm) Cassandra (tm) sends echoes in the background.
In a large cluster, we see that at the time the joining node starts
streaming, it hasn't managed to apply all the endpoint_state for peer
nodes, so the joining node does not know some of the nodes yet. As a
result, the joining node ignores streaming from some of the existing
nodes.
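The shape of the change, sketched with plain C++ threads and stand-in
functions (the real code would keep the echo on a background future
chain rather than blocking on .get()):

    #include <string>
    #include <thread>

    using inet_address = std::string;

    bool send_gossip_echo(const inet_address&) { return true; }   // stand-in for the echo RPC
    void real_mark_alive(const inet_address&) {}                  // runs once the echo is answered

    void mark_alive(const inet_address& node) {
        // Detach the wait: apply_state_locally() moves on to the next endpoint immediately.
        std::thread([node] {
            if (send_gossip_echo(node)) {
                real_mark_alive(node);
            }
        }).detach();
    }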
Fixes #2787
Fixes #2797
Message-Id: <3760da2bef1a83f1b6a27702a67ca4170e74b92c.1505719669.git.asias@scylladb.com>
"This series tries to improve the bootstrap of a node in a large cluster by
improving how gossip applies the gossip node state. In #2404, the joining node
failed to bootstrap, because it did not see the seed node when
storage_service::bootstrap ran. After this series, we apply the whole gossip
state contained in the gossip ack/ack2 message before applying the next one,
and we apply the state of the seed node earlier than non-seed nodes so we can
have the seed node's state faster. We also add some randomness to the order of
applying gossip node state to prevent some of the nodes' state from always being
applied earlier than the others.
This series improves apply_state_locally for large clusters:
- Tune the order of applying endpoint_state
- Serialize apply_state_locally
- Avoid copying of the gossip state map
Fixes#2404"
* tag 'asias/gossip_issue_2404_v2' of github.com:scylladb/seastar-dev:
gossip: Avoid copying with apply_state_locally
gossip: Serialize apply_state_locally
gossip: Tune the order of applying endpoint_state in apply_state_locally
gossip: Introduce is_seed helper
gossip: Pass const endpoint_state& in notify_failure_detector
gossip: Pass reference in notify_failure_detector
Move the std::map<inet_address, endpoint_state> out of the gossip
ack/ack2 message directly, and move it around in apply_state_locally(),
to avoid copying the map.
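A minimal sketch of the move-only path, with stand-in types:

    #include <map>
    #include <string>
    #include <utility>

    using inet_address = std::string;
    struct endpoint_state { int heartbeat_version = 0; };

    void apply_state_locally(std::map<inet_address, endpoint_state> map) {
        // `map` was moved in by the caller; consume it without further copies.
        for (auto& [ep, es] : map) {
            (void)ep; (void)es;                        // ... apply each entry ...
        }
    }

    void handle_ack(std::map<inet_address, endpoint_state> ep_state_map) {
        apply_state_locally(std::move(ep_state_map));  // move, don't copy
    }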
apply_state_locally() will be called when a gossip ack/ack2
message is received. It will use the std::map<inet_address,
endpoint_state>& map to update the endpoint state.
However, we can receive multiple such gossip ack/ack2 messages from
multiple peer nodes in parallel. Currently, we process them in parallel.
It is better to apply all the states from one node and then move on to
applying all the states from another node, rather than interleaving
them, because it is more important to have the state of the whole
cluster than to have slightly newer state from another peer (if it is
newer), especially when the node boots up and runs its first round of
gossip exchange.
After this patch, we apply the whole gossip state contained in the
gossip ack/ack2 message before applying the next one.
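Simplified sketch (a std::mutex stands in for whatever synchronization
the real future-based code uses): the whole map from one ack/ack2
message is applied before the next one starts.

    #include <map>
    #include <mutex>
    #include <string>

    using inet_address = std::string;
    struct endpoint_state { int heartbeat_version = 0; };

    std::mutex apply_lock;                             // one message applied at a time

    void apply_one_entry(const inet_address&, const endpoint_state&) { /* ... */ }

    void apply_state_locally(const std::map<inet_address, endpoint_state>& incoming) {
        std::lock_guard<std::mutex> g(apply_lock);
        for (const auto& [ep, es] : incoming) {
            apply_one_entry(ep, es);                   // finish this sender's view first
        }
    }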
We currently always apply the endpoint_state in the order of the
endpoint IP address. This is not good because some endpoints' state is
always applied earlier than the others'. In a large cluster, the number
of endpoints can be large and it takes time to apply all of them. To
make it fairer, we apply the endpoint_state in random order.
Apply the seed node's state earlier because, during bootstrap, we check
if we have seen the seed node in storage_service::bootstrap. In #2404,
the bootstrap failed because the joining node hadn't applied the seed
node's state when storage_service::bootstrap ran.
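Rough sketch of the ordering (stand-in types, hypothetical helper name):
seed nodes first, random order within each group.

    #include <algorithm>
    #include <random>
    #include <set>
    #include <string>
    #include <vector>

    using inet_address = std::string;

    std::vector<inet_address> get_apply_order(const std::vector<inet_address>& endpoints,
                                              const std::set<inet_address>& seeds) {
        std::vector<inet_address> seed_nodes;
        std::vector<inet_address> other_nodes;
        for (const auto& ep : endpoints) {
            (seeds.count(ep) ? seed_nodes : other_nodes).push_back(ep);
        }
        // Randomize within each group so no endpoint is consistently applied last.
        std::mt19937 rng{std::random_device{}()};
        std::shuffle(seed_nodes.begin(), seed_nodes.end(), rng);
        std::shuffle(other_nodes.begin(), other_nodes.end(), rng);
        // Seeds first: storage_service::bootstrap checks that a seed has been seen.
        seed_nodes.insert(seed_nodes.end(), other_nodes.begin(), other_nodes.end());
        return seed_nodes;
    }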
This reverts commit b56ba02335.
After commit 8fa35d6ddf (messaging_service: Get rid of timeout and retry
logic for streaming verb), the streaming verb in rpc does not check if
a node is in gossip membership since all the retry logic is removed.
Remove the extra wait before removing the joining node from gossip
membership.
Message-Id: <a416a735bb8aad533bbee190e3324e6b16799415.1504063598.git.asias@scylladb.com>