In 3a36ec33db (gossip: Wait longer for seed node during boot up), we
increased the timeout by the factor of 60, i.e., ring_dealy * 60 = 5
seconds * 60 = 5 minutes.
In 57ee9676c2 (storage_service: Fix default ring_delay time), we fixed
the default ring_dealy to 30 seconds. Now the timeout is 30 * 60 seconds
= 30 minutes, which is too long.
Make it 5 minues.
We want to prevent older version of scylla which has fewer features to
join a cluster with newer version of scylla which has more features,
because when scylla sees a feature is enabled on all other nodes, it
will start to use the feature and assume existing nodes and future nodes
will always have this feature.
In order to support downgrade during rolling upgrade, we need to support
mixed old and new nodes case.
1) All old nodes
O O O O O <- N OK
O O O O O <- O OK
2) All new nodes
N N N N N <- N OK
N N N N N <- O FAIL
3) Mixed old and new nodes
O N O N O <- N OK
O N O N O <- O OK
(O == old node, N == new node, <- == joining the cluster)
With this patch, I tested:
1.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}
1.2) Add old node to old node cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}
2.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}
2.2) Add old node to new node cluster
seastar - Exiting on unhandled exception: std::runtime_error (Feature
check failed. This node can not join the cluster because it does not
understand the feature. Local node 127.0.0.4 features = {}, Remote
common_features = {RANGE_TOMBSTONES})
3.1) Add new node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {}
3.2) Add old node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}
Fixes#1253
If the feature string is empty, boost::split will return
std::set<sstring> = {""} instead of std::set<sstring> = {}
which will make a node with a feaure, e.g. std::set<sstring> =
{"RANGE_TOMBSTONES"}, think it does not understand the feature of
a node with no features at all.
This patch changes the gms::feature destructor so it
checks whether the gossiper has been stopped before trying
to unregister the feature.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
When a node starts up, peer node can send gossip syn message to it
before the gossip message handlers are registered in messaging_service.
We can see:
scylla[123]: [shard 0] rpc - client a.b.c.d: unknown verb exception 6 ignored
To fix, we delay the listening of messaging_service to the point when
gossip message handlers are registered.
Message-Id: <9b20d85e199ef0e44cdcde2920123a301a88f3d7.1464254400.git.asias@scylladb.com>
This class encapsulates the waiting for a cluster feature. A feature
object is registered with the gossiper, which is responsible for later
marking it as enabled.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the sleep-based mechanism of detecting new features
by instead registering waiters with a condition variable that is
signaled whenever a new endpoint information is received.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch removes the timeout when waiting for features,
since future patches will make this argument unnecessary.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch fixes an inadvertent change to the shadow endpoint state
map in gossiper::run, done by calling get_heart_beat_state() which
also updates the endpoint state's timestamp. This did not happen for
the normal map, but did happen for the shadow map. As a result, every
time gossiper::run() was scheduled, endpoint_map_changed would always
be true and all the shards would make superfluous copies of the
endpoint state maps.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1464309023-3254-2-git-send-email-duarte@scylladb.com>
_live_endpoints_just_added tracks the peer node which just becomes live.
When a down node gets back, the peer nodes can receive multiple messages
which would mark the node up, e.g., the message piled up in the sender's
tcp stack, after a node was blocked with gdb and released. Each such
message will trigger a echo message and when the reply of the echo
message is received (real_mark_alive), the same node will be added to
_live_endpoints_just_added.push_back more than once. Thus, we see the
same node be favored more than once:
INFO 2016-04-12 12:09:57,399 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:09:58,412 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:09:59,429 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:00,429 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:01,430 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:02,442 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:03,454 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
To fix, do not insert the node if it is already in
_live_endpoints_just_added.
Fixes#1178
Message-Id: <6bcfad4430fbc63b4a8c40ec86a2744bdfafb40f.1464161975.git.asias@scylladb.com>
In perf-flame, I saw in
service::storage_proxy::create_write_response_handler (2.66% cpu)
gossiper::is_alive takes 0.72% cpu
locator::token_metadata::pending_endpoints_for takes 1.2% cpu
After this patch:
service::storage_proxy::create_write_response_handler (2.17% cpu)
gossiper::is_alive does not show up at all
locator::token_metadata::pending_endpoints_for takes 1.3% cpu
There is no need to copy the endpoint_state from the endpoint_state_map
to check if a node is alive. Optimize it since gossiper::is_alive is
called in the fast path.
Message-Id: <2144310aef8d170cab34a2c96cb67cabca761ca8.1463540290.git.asias@scylladb.com>
seastar::async() creates a seastar thread and to do that allocates
memory. That allocation, obviously, may fail so the error handling code
needs to be moved so that it also catches errors from thread creation.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"There is a need to have an ability to detect whether a feature is
supported by entire cluster. The way to do it is to advertise feature
availability over gossip and then each node will be able to check if all
other nodes have a feature in question.
The idea is to have new application state SUPPORTED_FEATURES that will contain
set of strings, each string holding feature name.
This series adds API to do so.
The following patch on top of this series demostreates how to wait for features
during boot up. FEATURE1 and FEATURE2 are introduced. We use
wait_for_feature_on_all_node to wait for FEATURE1 and FEATURE2 successfully.
Since FEATURE3 is not supported, the wait will not succeed, the wait will timeout.
--- a/service/storage_service.cc
+++ b/service/storage_service.cc
@@ -95,7 +95,7 @@ sstring storage_service::get_config_supported_features() {
// Add features supported by this local node. When a new feature is
// introduced in scylla, update it here, e.g.,
// return sstring("FEATURE1,FEATURE2")
- return sstring("");
+ return sstring("FEATURE1,FEATURE2");
}
std::set<inet_address> get_seeds() {
@@ -212,6 +212,11 @@ void storage_service::prepare_to_join() {
// gossip snitch infos (local DC and rack)
gossip_snitch_info().get();
+ gossiper.wait_for_feature_on_all_node(std::set<sstring>{sstring("FEATURE1"), sstring("FEATURE2")}, std::chrono::seconds(30)).get();
+ logger.info("Wait for FEATURE1 and FEATURE2 done");
+ gossiper.wait_for_feature_on_all_node(std::set<sstring>{sstring("FEATURE3")}).get();
+ logger.info("Wait for FEATURE3 done");
+
We can query the supported_features:
cqlsh> SELECT supported_features from system.peers;
supported_features
--------------------
FEATURE1,FEATURE2
FEATURE1,FEATURE2
(2 rows)
cqlsh> SELECT supported_features from system.local;
supported_features
--------------------
FEATURE1,FEATURE2
(1 rows)"
API to wait for features are available on a node or all the nodes in the
cluster.
$timeout specifies how long we want to wait. If the features are not
availabe yet, sleep 2 seconds and retry.
- Get features supported by this particular node
std::set<sstring> get_supported_features(inet_address endpoint) const;
- Get features supported by all the nodes this node knows about
std::set<sstring> get_supported_features() const;
Since calculate_pending_ranges will modify token_metadata, we need to
replicate to other shards. With this patch, when we call
calculate_pending_ranges, token_metadata will be replciated to other
non-zero shards.
In addition, it is not useful as a standalone class. We can merge it
into the storage_service. Kill one singleton class.
Fixes#1033
Refs #962
Message-Id: <fb5b26311cafa4d315eb9e72d823c5ade2ab4bda.1457943074.git.asias@scylladb.com>
If we do
- Decommission a node
- Stop a node
we will shutdown gossip more than once in:
- storage_service::decommission
- storage_service::drain_on_shutdown
Fix by checking if it is already stopped and back off if so.
If storage_service::token_metadata is not distributed together with
gossiper::endpoint_state_map there may be a situation when a non-zero
shard sees a new value in token_metadata (e.g. newly added node's
token ranges) while still seeing an old gossiper::endpoint_state_map
contents (e.g. a mentioned above newly added node may not be present,
thus causing gossiper::is_alive() to return FALSE for that node, while
the node is actually alive and kicking).
To avoid this discrepancy we will always update a token_metadata together
with an endpoint_state_map when we distribute new token_metadata data
among shards.
Fixes#909
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
We will need to access it from a storage_service class when replicate
token_metadata.
Rename _shadow_endpoint_state_map -> shadow_endpoint_state_map
according to our coding convention.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
1) As explained in commit 697b16414a (gossip: Make gossip message
handling async), in each gossip round we can make talking to the 1-3
peer nodes in parallel to reduce latency of gossip round.
2) Gossip syn message uses one way rpc message, but now the returned
future of the one way message is ready only when message is dequeued for
some reason (sent or dropped). If we wait for the one way syn messge to
return it might block the gossip round for a unbounded time. To fix, do
not wait for it in the gossip round. The downside is there will be no
back pressure to bound the syn messages, however since the messages are
once per second, I think it is fine.
Message-Id: <ea4655f121213702b3f58185378bb8899e422dd1.1456991561.git.asias@scylladb.com>
add_local_application_state is used in various places. Before this
patch, it can only be called on cpu zero. To make it safer to use, use
invoke_on() to foward the code to run on cpu zero, so that caller can
call it on any cpu.
Refs: #795
Message-Id: <d69b81c5561622078dbe887d87209c4ea2e3bf46.1456315043.git.asias@scylladb.com>
Gleb saw once:
scylla: gms/gossiper.cc:1393:
gms::gossiper::add_local_application_state(gms::application_state,
gms::versioned_value):: mutable: Assertion
`endpoint_state_map.count(ep_addr)' failed.
The assert is about we can not find the entry in endpoint_state_map of
the node itself. I can not really find any place we could call
add_local_application_state before we call gossiper::start_gossiping()
where it inserts broadcast address into endpoint_state_map.
I can not reproduce issue, let's log the error so we can narrow down
which application state triggered the assert.
Refs: #795
Message-Id: <f4433be0a0d4f23470a5e24e528afdb67b74c7ef.1456315043.git.asias@scylladb.com>
In each gossip round, i.e., gossiper::run(), we do:
1) send syn message
2) peer node: receive syn message, send back ack message
3) process ack message in handle_ack_msg
apply_state_locally
mark_alive
send_gossip_echo
handle_major_state_change
on_restart
mark_alive
send_gossip_echo
mark_dead
on_dead
on_join
apply_new_states
do_on_change_notifications
on_change
4) send back ack2 message
5) peer node: process ack2 message
apply_state_locally
At the moment, syn is "wait" message, it times out in 3 seconds. In step
3, all the registered gossip callbacks are called which might take
significant amount of time to complete.
In order to reduce the gossip round latency, we make syn "no-wait" and
do not run the handle_ack_msg insdie the gossip::run(). As a result, we
will not get a ack message as the return value of a syn message any
more, so a GOSSIP_DIGEST_ACK message verb is introduced.
With this patch, the gossip message exchange is now async. It is useful
when some nodes are down in the cluster. We will not delay the gossip
round, which is supposed to run every second, 3*n seconds (n = 1-3,
since it talks to 1-3 peer nodes in each gossip round) or even
longer (considering the time to run gossip callbacks).
Later, we can make talking to the 1-3 peer nodes in parallel to reduce
latency even more.
Refs: #900
The problem is we initialize _last_interpret when failure_detector
object is constructed. When interpret() runs for the first time, the
_last_interpret value is not the last time we run interpret() but the
time we initialize failure_detector object.
Fix by initializing _last_interpret inside interpret().
[Thu Feb 18 02:40:04 2016] INFO [shard 0] storage_service - Node 127.0.0.1 state jump to normal
[Thu Feb 18 02:40:04 2016] INFO [shard 0] storage_service - NORMAL: node is now in normal status
[Thu Feb 18 02:40:04 2016] INFO [shard 0] gossip - Waiting for gossip to settle before accepting client requests...
[Thu Feb 18 02:40:12 2016] INFO [shard 0] gossip - No gossip backlog; proceeding
Starting listening for CQL clients on 127.0.0.1:9042...
[Thu Feb 18 02:40:12 2016] INFO [shard 0] gossip - Node 127.0.0.2 is now part of the cluster
[Thu Feb 18 02:40:12 2016] INFO [shard 0] gossip - InetAddress 127.0.0.2 is now UP
[Thu Feb 18 02:40:13 2016] INFO [shard 0] gossip - do_gossip_to_live_member: Favor newly added node 127.0.0.2
[Thu Feb 18 02:40:13 2016] WARN [shard 0] failure_detector - Not marking nodes down due to local pause of 9091 > 5000 (milliseconds)
PHI_FACTOR is a constexpr variable that is defined using std::log.
Though G++ has a constexpr version of std::log, this itself is not spec
complaint (in fact, Clang enforces this). See C++ Spec 26.8 for the
definition of std::log and 17.6.5.6 for the rule regarding adding
constexpr where it isn't specified.
This patch replaces the std::log statement with a version from math.h
that contains the exact value (M_LOG10El).
Signed-off-by: Erich Keane <erich.keane@verizon.net>
Message-Id: <1454603285-32677-1-git-send-email-erich.keane@verizon.net>