Commit Graph

107 Commits

Vlad Zolotarov
eb4fbb3949 gms::gossiper: move collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:55 -05:00
Asias He
86c2620b7a gossip: Skip stopping if it is not started
If an exception is triggered early in boot while doing an I/O operation,
scylla will fail because the I/O checker calls storage service to stop
transport services, not all of which were initialized yet.

Scylla was failing as follows:
scylla: ./seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local()
[with Service = gms::gossiper]: Assertion `local_is_initialized()' failed.
Aborting on shard 0.
Backtrace:
  0x000000000048a2ca
  0x000000000048a3d3
  0x00007fc279e739ff
  0x00007fc279ad6a27
  0x00007fc279ad8629
  0x00007fc279acf226
  0x00007fc279acf2d1
  0x0000000000c145f8
  0x000000000110d1bc
  0x000000000041bacd
  0x00000000005520f1
  0x00007fc279aeaf1f
Aborted (core dumped)

Refs #883.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <963f7b0f5a7a8a1405728b414a7d7a6dccd70581.1479172124.git.asias@scylladb.com>
2016-12-05 09:42:37 +02:00
Avi Kivity
c94fb1bf12 build: reduce inclusions of messaging_service.hh
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.

Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
2016-10-05 11:46:49 +03:00
Asias He
774d16306f gossip: Use lowres_clock for scheduled_gossip_task
The timer is fired once per second. Using a low-resolution clock is enough.
Message-Id: <1f21514e975afea6ac5c9dde18a881a41561da70.1475130948.git.asias@scylladb.com>
2016-09-29 10:03:14 +03:00
Asias He
f0d3084c8b gossip: Switch to use system_clock
The expire time, which is used to decide when to remove a node from
gossip membership, is gossiped around the cluster. We switched to a
steady clock in the past. In order to have a consistent time_point on
all the nodes in the cluster, we have to use a wall clock. Switch to
system_clock for gossip.

Fixes #1704
2016-09-27 16:42:13 +08:00
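
A wall-clock time_point is only meaningful across nodes if it is exchanged in a node-independent encoding. A minimal sketch of that idea (the function names are illustrative, not Scylla's actual API): encode a system_clock time_point as milliseconds since the Unix epoch, which every node interprets the same way, unlike steady-clock ticks, which are process-local.

```cpp
#include <chrono>
#include <cstdint>

using clk = std::chrono::system_clock;

// Encode a wall-clock time_point as epoch milliseconds for gossiping.
int64_t to_epoch_millis(clk::time_point tp) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        tp.time_since_epoch()).count();
}

// Decode epoch milliseconds received from a peer back into a time_point.
clk::time_point from_epoch_millis(int64_t ms) {
    return clk::time_point(std::chrono::milliseconds(ms));
}
```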
Asias He
ef782f0335 gossip: Add heart_beat_version to collectd
$ tools/scyllatop/scyllatop.py '*gossip*'

node-1/gossip-0/gauge-heart_beat_version 1.0
node-2/gossip-0/gauge-heart_beat_version 1.0
node-3/gossip-0/gauge-heart_beat_version 1.0

Gossip heart beat version changes every second. If everything is working
correctly, the gauge-heart_beat_version output should be 1.0. If not,
the gauge-heart_beat_version output will be less than 1.0.

Message-Id: <cbdaa1397cdbcd0dc6a67987f8af8038fd9b2d08.1470712861.git.asias@scylladb.com>
2016-08-15 12:32:00 +03:00
Asias He
0c56bbe793 gossip: Make get_supported_features and wait_for_feature_on{_all}_node private
They are used only inside gossiper itself. Also make the helper
get_supported_features(std::unordered_map<gms::inet_address, sstring>) static.

Message-Id: <f434c145ad9138084708b60c1d959b84360e47b2.1467775291.git.asias@scylladb.com>
2016-07-06 09:54:56 +03:00
Asias He
88f0bb3a7b gossip: Add check_knows_remote_features
To check whether this node knows the features listed in
std::unordered_map<inet_address, sstring> peer_features_string
2016-07-05 10:09:54 +08:00
Asias He
2b53c50c15 gossip: Add get_supported_features
To get features supported by all the nodes listed in the
address/feature map.
2016-07-05 10:09:53 +08:00
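
The "features supported by all the nodes" computation is a set intersection over the per-node feature sets. A minimal sketch under assumed types (Scylla parses the comma-separated SUPPORTED_FEATURES application state into such sets; the signature here is illustrative):

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <unordered_map>

// Return the features supported by *all* nodes: the intersection of the
// per-node feature sets in the address/feature map.
std::set<std::string> common_features(
        const std::unordered_map<std::string, std::set<std::string>>& node_features) {
    std::set<std::string> common;
    bool first = true;
    for (const auto& [node, features] : node_features) {
        if (first) {
            common = features;      // seed with the first node's set
            first = false;
        } else {
            std::set<std::string> tmp;
            std::set_intersection(common.begin(), common.end(),
                                  features.begin(), features.end(),
                                  std::inserter(tmp, tmp.begin()));
            common = std::move(tmp);
        }
    }
    return common;
}
```

One node lacking a feature is enough to drop it from the result, which is exactly why a cluster only starts using a feature once every node advertises it.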
Asias He
4f3ce42163 storage_service: Prevent old version node to join a new version cluster
We want to prevent an older version of scylla, which has fewer features,
from joining a cluster with a newer version of scylla, which has more
features, because when scylla sees a feature enabled on all other nodes,
it will start to use the feature and assume existing nodes and future
nodes will always have it.

In order to support downgrade during rolling upgrade, we need to support
the mixed old and new nodes case.

1) All old nodes
O O O O O <- N   OK
O O O O O <- O   OK

2) All new nodes
N N N N N <- N   OK
N N N N N <- O   FAIL

3) Mixed old and new nodes
O N O N O <- N   OK
O N O N O <- O   OK

(O == old node, N == new node, <- == joining the cluster)

With this patch, I tested:

1.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}

1.2) Add old node to old node cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}

2.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}

2.2) Add old node to new node cluster
seastar - Exiting on unhandled exception: std::runtime_error (Feature
check failed. This node can not join the cluster because it does not
understand the feature. Local node 127.0.0.4 features = {}, Remote
common_features = {RANGE_TOMBSTONES})

3.1) Add new node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {}

3.2) Add old node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}

Fixes #1253
2016-06-17 10:49:45 +08:00
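
The scenario matrix above boils down to a subset test: a node may join only if its local feature set includes every feature already common to the rest of the cluster. A minimal sketch of that check (illustrative signature, not Scylla's actual API):

```cpp
#include <algorithm>
#include <set>
#include <string>

// A joining node passes the feature check iff its own features are a
// superset of the features common to all existing nodes; otherwise it
// could not understand a feature the cluster may already be using.
bool feature_check_passes(const std::set<std::string>& local_features,
                          const std::set<std::string>& remote_common_features) {
    return std::includes(local_features.begin(), local_features.end(),
                         remote_common_features.begin(),
                         remote_common_features.end());
}
```

In the mixed cluster (case 3) the remote common set is empty, so both old and new nodes pass, matching the test output in the commit message.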
Duarte Nunes
f613dabf53 gossip: Introduce the gms::feature class
This class encapsulates the waiting for a cluster feature. A feature
object is registered with the gossiper, which is responsible for later
marking it as enabled.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-05-27 17:20:51 +00:00
Duarte Nunes
4684b8ecbb gossip: Refactor waiting for features
This patch changes the sleep-based mechanism of detecting new features
by instead registering waiters with a condition variable that is
signaled whenever a new endpoint information is received.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-05-27 17:20:51 +00:00
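
The refactor replaces sleep-and-repoll with waiters blocked on a condition variable that is signaled when new endpoint state arrives. A simplified, std-only sketch of the pattern (Seastar has its own condition_variable; the class and method names here are illustrative):

```cpp
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>

class feature_waiter {
    std::mutex _mtx;
    std::condition_variable _cv;
    std::set<std::string> _enabled;
public:
    // Called when gossip learns a feature is enabled; wakes all waiters
    // so they can re-check their predicate.
    void mark_enabled(const std::string& feature) {
        { std::lock_guard<std::mutex> g(_mtx); _enabled.insert(feature); }
        _cv.notify_all();
    }
    // Blocks until the feature is enabled -- no polling, no sleep loop.
    void wait_for(const std::string& feature) {
        std::unique_lock<std::mutex> lk(_mtx);
        _cv.wait(lk, [&] { return _enabled.count(feature) > 0; });
    }
    bool is_enabled(const std::string& feature) {
        std::lock_guard<std::mutex> g(_mtx);
        return _enabled.count(feature) > 0;
    }
};
```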
Duarte Nunes
422f244172 gossip: Don't timeout when waiting for features
This patch removes the timeout when waiting for features,
since future patches will make this argument unnecessary.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-05-27 17:20:51 +00:00
Pekka Enberg
47a904c0f6 Merge "gossip: Introduce SUPPORTED_FEATURES" from Asias
"There is a need for the ability to detect whether a feature is
supported by the entire cluster. The way to do it is to advertise feature
availability over gossip; then each node will be able to check if all
other nodes have the feature in question.

The idea is to have a new application state, SUPPORTED_FEATURES, that will
contain a set of strings, each string holding a feature name.

This series adds API to do so.

The following patch on top of this series demonstrates how to wait for features
during boot up. FEATURE1 and FEATURE2 are introduced. We use
wait_for_feature_on_all_node to wait for FEATURE1 and FEATURE2 successfully.
Since FEATURE3 is not supported, that wait will not succeed and will time out.

   --- a/service/storage_service.cc
   +++ b/service/storage_service.cc
   @@ -95,7 +95,7 @@ sstring storage_service::get_config_supported_features() {
        // Add features supported by this local node. When a new feature is
        // introduced in scylla, update it here, e.g.,
        // return sstring("FEATURE1,FEATURE2")
   -    return sstring("");
   +    return sstring("FEATURE1,FEATURE2");
    }

    std::set<inet_address> get_seeds() {
   @@ -212,6 +212,11 @@ void storage_service::prepare_to_join() {
        // gossip snitch infos (local DC and rack)
        gossip_snitch_info().get();

   +    gossiper.wait_for_feature_on_all_node(std::set<sstring>{sstring("FEATURE1"), sstring("FEATURE2")}, std::chrono::seconds(30)).get();
   +    logger.info("Wait for FEATURE1 and FEATURE2 done");
   +    gossiper.wait_for_feature_on_all_node(std::set<sstring>{sstring("FEATURE3")}).get();
   +    logger.info("Wait for FEATURE3 done");
   +

We can query the supported_features:

    cqlsh> SELECT supported_features from system.peers;

     supported_features
    --------------------
      FEATURE1,FEATURE2
      FEATURE1,FEATURE2

    (2 rows)
    cqlsh> SELECT supported_features from system.local;

     supported_features
    --------------------
      FEATURE1,FEATURE2

    (1 rows)"
2016-04-08 09:22:50 +03:00
Pekka Enberg
38a54df863 Fix pre-ScyllaDB copyright statements
People keep tripping over the old copyrights and copy-pasting them to
new files. Search and replace "Cloudius Systems" with "ScyllaDB".

Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>
2016-04-08 08:12:47 +03:00
Asias He
04e8727793 gossip: Introduce wait_for_feature_on_{all}_node
API to wait for features to become available on a node or on all the
nodes in the cluster.

$timeout specifies how long we want to wait. If the features are not
available yet, sleep 2 seconds and retry.
2016-04-06 07:12:34 +08:00
Asias He
1e437e925c gossip: Introduce get_supported_features
- Get features supported by this particular node

  std::set<sstring> get_supported_features(inet_address endpoint) const;

- Get features supported by all the nodes this node knows about

  std::set<sstring> get_supported_features() const;
2016-04-06 07:12:34 +08:00
Asias He
1bf0412e7a gossip: Introduce handle_shutdown_msg helper 2016-03-15 16:09:43 +08:00
Asias He
54d8ac16b5 gossip: Introduce handle_echo_msg helper 2016-03-15 16:09:42 +08:00
Asias He
1f64f4bfcb gossip: Introduce handle_ack2_msg helper 2016-03-15 16:09:42 +08:00
Vlad Zolotarov
3a72ef87f2 gossiper: make _shadow_endpoint_state_map public and rename
We will need to access it from a storage_service class when replicate
token_metadata.

Rename _shadow_endpoint_state_map -> shadow_endpoint_state_map
according to our coding convention.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-03-06 11:16:44 +02:00
Vlad Zolotarov
4a21d48cc5 gossiper: use a semaphore instead of a future<> for serializing a timer callback
Use a semaphore to allow serializing with a gossiper's timer callback.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-03-06 11:16:44 +02:00
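
The semaphore pattern here: the timer callback holds a single unit while it runs, and anything that must serialize with it (e.g. stop) acquires the same unit, implicitly waiting for an in-flight callback. A std-only sketch with a minimal counting semaphore standing in for seastar::semaphore (names illustrative):

```cpp
#include <condition_variable>
#include <mutex>

// Minimal counting semaphore (stand-in for seastar::semaphore).
class semaphore {
    std::mutex _mtx;
    std::condition_variable _cv;
    size_t _count;
public:
    explicit semaphore(size_t count) : _count(count) {}
    void wait() {                       // acquire one unit
        std::unique_lock<std::mutex> lk(_mtx);
        _cv.wait(lk, [&] { return _count > 0; });
        --_count;
    }
    void signal() {                     // release one unit
        { std::lock_guard<std::mutex> g(_mtx); ++_count; }
        _cv.notify_one();
    }
};

semaphore timer_sem(1);  // one unit: at most one holder at a time

// The timer callback body runs while holding the unit, so any other
// caller of timer_sem.wait() serializes with it.
int run_serialized(int (*fn)()) {
    timer_sem.wait();
    int r = fn();
    timer_sem.signal();
    return r;
}
```

Compared with chaining on a stored future<>, a semaphore lets any number of callers line up without each one having to replace the stored future.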
Asias He
01cb6b0d42 gossip: Send syn message in parallel and do not wait for it
1) As explained in commit 697b16414a (gossip: Make gossip message
handling async), in each gossip round we can talk to the 1-3 peer nodes
in parallel to reduce the latency of the gossip round.

2) Gossip syn uses a one-way rpc message, but the returned future of a
one-way message becomes ready only when the message is dequeued for some
reason (sent or dropped). If we wait for the one-way syn message to
return, it might block the gossip round for an unbounded time. To fix, do
not wait for it in the gossip round. The downside is that there will be no
back pressure to bound the syn messages; however, since the messages are
sent once per second, I think it is fine.
Message-Id: <ea4655f121213702b3f58185378bb8899e422dd1.1456991561.git.asias@scylladb.com>
2016-03-03 11:17:50 +02:00
Asias He
59564591d5 storage_service: Use get_gossip_status to get status
The helper was introduced recently; use it instead of open-coding the logic.
2016-02-25 21:19:52 +08:00
Asias He
697b16414a gossip: Make gossip message handling async
In each gossip round, i.e., gossiper::run(), we do:

1) send syn message
2)                           peer node: receive syn message, send back ack message
3) process ack message in handle_ack_msg
   apply_state_locally
     mark_alive
       send_gossip_echo
     handle_major_state_change
       on_restart
       mark_alive
         send_gossip_echo
       mark_dead
         on_dead
       on_join
     apply_new_states
       do_on_change_notifications
          on_change
4) send back ack2 message
5)                            peer node: process ack2 message
   			      apply_state_locally

At the moment, syn is a "wait" message; it times out in 3 seconds. In step
3, all the registered gossip callbacks are called, which might take a
significant amount of time to complete.

In order to reduce the gossip round latency, we make syn "no-wait" and
do not run handle_ack_msg inside gossiper::run(). As a result, we
will no longer get an ack message as the return value of a syn message,
so a GOSSIP_DIGEST_ACK message verb is introduced.

With this patch, the gossip message exchange is now async. It is useful
when some nodes are down in the cluster. We will not delay the gossip
round, which is supposed to run every second, by 3*n seconds (n = 1-3,
since we talk to 1-3 peer nodes in each gossip round) or even
longer (considering the time to run gossip callbacks).

Later, we can talk to the 1-3 peer nodes in parallel to reduce
latency even more.

Refs: #900
2016-02-24 19:33:39 +08:00
Asias He
755d792c78 gossip: Wait for gossip timer callback to finish in do_stop_gossiping
Also do not rearm the timer if we stopped the gossip.

Message-Id: <73765857b554d9914e87b24d287ff35ab0af6fce.1453378191.git.asias@scylladb.com>
2016-01-21 14:15:57 +02:00
Asias He
02b04e5907 gossip: Add is_safe_for_bootstrap
Make the following tests pass:

bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test
bootstrap_test.py:TestBootstrap.killed_wiped_node_cannot_join_test

    1) start node2
    2) wait for cql connection with node2 is ready
    3) stop node2
    4) delete data and commitlog directory for node2
    5) start node2

In step 5), node2 will go through the bootstrap process since its data,
including the system tables, was wiped. It will think it is a completely
new node and can possibly stream from the wrong node and violate
consistency.

To fix, we reject the boot if we find the node was in SHUTDOWN or
STATUS_NORMAL.

CASSANDRA-9765
Message-Id: <47bc23f4ce1487a60c5b4fbe5bfe9514337480a8.1452158975.git.asias@scylladb.com>
2016-01-07 15:55:01 +02:00
Asias He
2345cda42f messaging_service: Rename shard_id to msg_addr
Using shard_id as the destination of the messaging_service is confusing,
since shard_id is used in the context of a cpu id.
Message-Id: <8c9ef193dc000ef06f8879e6a01df65cf24635d8.1452155241.git.asias@scylladb.com>
2016-01-07 10:36:35 +02:00
Asias He
8c909122a6 gossip: Add wait_for_gossip_to_settle
Implement the wait for gossip to settle logic in the bootup process.

CASSANDRA-4288

Fixes:
bootstrap_test.py:TestBootstrap.shutdown_wiped_node_cannot_join_test

1) start node2
2) wait for cql connection with node2 is ready
3) stop node2
4) delete data and commitlog directory for node2
5) start node2

In step 5, I sometimes saw node2's shadow round get node2's status as
BOOT from other nodes in the cluster instead of NORMAL. The problem is
that we do not wait for gossip to settle before we start the cql server;
as a result, when we stop node2 in step 3), the other nodes in the cluster
have not yet received node2's status update to NORMAL.
2016-01-07 10:09:25 +02:00
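
A common shape for "wait for gossip to settle" logic is to poll a counter of observed gossip activity and declare gossip settled only after it has been stable for several consecutive polls. A sketch under assumed constants and names (the real implementation also sleeps between polls; omitted here to keep the sketch self-contained and testable):

```cpp
#include <functional>

// Poll gossiped_rounds() up to max_polls times; return true once the
// value has been unchanged for required_stable_polls consecutive polls,
// i.e. gossip has stopped producing new state and boot may proceed.
bool wait_for_gossip_to_settle(std::function<int()> gossiped_rounds,
                               int required_stable_polls,
                               int max_polls) {
    int last = gossiped_rounds();
    int stable = 0;
    for (int polls = 0; polls < max_polls; ++polls) {
        int cur = gossiped_rounds();
        stable = (cur == last) ? stable + 1 : 0;  // reset on any change
        last = cur;
        if (stable >= required_stable_polls) {
            return true;   // settled
        }
    }
    return false;          // gave up waiting
}
```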
Avi Kivity
f3980f1fad Merge seastar upstream
* seastar 51154f7...8b2171e (9):
  > memcached: avoid a collision of an expiration with time_point(-1).
  > tutorial: minor spelling corrections etc.
  > tutorial: expand semaphores section
  > Merge "Use steady_clock where monotonic clock is required" from Vlad
  > Merge "TLS fixes + RPC adaption" from Calle
  > do_with() optimization
  > tutorial: explain limiting parallelism using semaphores
  > submit_io: change pending flushes criteria
  > apps: remove defunct apps/seastar

Adjust code to use steady_clock instead of high_resolution_clock.
2015-12-27 14:40:20 +02:00
Asias He
9d4382c626 gossip: Introduce get_gossip_status
Get value of application_state::STATUS.
2015-12-09 12:29:15 +08:00
Asias He
3004866f59 gossip: Rename start to start_gossiping
So that we have more consistent names, start_gossiping() and
stop_gossiping(), and it will not be confused with get_gossiper().start().
2015-12-02 16:50:34 +08:00
Asias He
5c3951b28a gossip: Get rid of the handler helper 2015-12-02 16:50:34 +08:00
Asias He
7a6ad7aec2 gossip: Fix Assertion `local_is_initialized()' failed
This patch fixes the following cql_query_test failure.

   cql_query_test: scylla/seastar/core/sharded.hh:439:
   Service& seastar::sharded<Service>::local() [with Service =
   gms::gossiper]: Assertion `local_is_initialized()' failed.

The problem is that in gossiper::stop() we call gossip::add_local_application_state(),
which will in turn call gms::get_local_gossiper(). In seastar::sharded::stop:

 _instances[engine().cpu_id()].service = nullptr;
 return inst->stop().then([this, inst] {
     return _instances[engine().cpu_id()].freed.get_future();
 });

We set the _instances entry to nullptr before we call the stop method, so
local_is_initialized() asserts when we try to access get_local_gossiper
again.

To fix, we make the stopping of gossiper explicit. In the shutdown
procedure, we call stop_gossiping() explicitly.

This has two more advantages:

1) The API to stop gossip now calls stop_gossiping() instead of
sharing seastar::sharded's stop method.

2) We can now get rid of the _handler seastar::sharded helper.
2015-12-02 16:50:34 +08:00
Asias He
f62a6f234b gossip: Add shutdown gossip state
Backported: CASSANDRA-8336 and CASSANDRA-9871

84b2846 remove redundant state
b2c62bb Add shutdown gossip state to prevent timeouts during rolling restarts
8f9ca07 Cannot replace token does not exist - DN node removed as Fat Client

Fixes:

When X is shutdown, X sends SHUTDOWN message to both Y and Z, but for
some reason, only Y receives the message and Z does not receive the
message. If Z has a higher gossip version for X than Y has for
X, Z will initiate a gossip with Y and Y will mark X alive again.

X ------> Y
 \      /
  \    /
    Z
2015-12-01 17:29:25 +08:00
Asias He
80d1d4d161 storage_service: Relax bootstrapping/leaving/moving nodes check in check_for_endpoint_collision
When other bootstrapping/leaving/moving nodes are found during
bootstrap, instead of throwing immediately, sleep and try again for one
minute, hoping other nodes will finish the operation soon.

Since we are retrying the shadow gossip round more than once, we need
to put the gossip state back to shadow round after each shadow round, to
make the shadow round work correctly.

This is useful when starting an empty cluster for testing. E.g,

   $ scylla --listen-address 127.0.0.1
   $ sleep 3
   $ scylla --listen-address 127.0.0.2
   $ sleep 3
   $ scylla --listen-address 127.0.0.3

Without this patch, node 3 will hit the check.

   TIME  STATUS
   -----------------------
   Node  1:
   32:00 Starts
   32:00 In NORMAL status

   Node  2:
   32:03 Starts
   32:04 In BOOT status
   32:10 In NORMAL status

   Node  3:
   32:06 Starts
   32:06 Found node 2 in BOOT status, hit the check, sleep and try again
   32:11 Found node 2 in NORMAL status, can keep going now
   32:12 In BOOT status
   32:18 In NORMAL status
2015-11-30 09:07:57 +08:00
Asias He
3b52033371 gossip: Favor newly added node in do_gossip_to_live_member
When a new node joins a cluster, it starts a gossip round with a seed
node. However, within this round, the seed node will not tell the new
node anything it knows about other nodes in the cluster, because the
digest in the gossip SYN message contains only the new node itself and
no other nodes. The seed node picks randomly from the live nodes,
including the newly added node, in do_gossip_to_live_member to start a
gossip round. If the new node is "lucky", the seed node will talk to it
very soon and tell it all the information it knows about the cluster;
thus the new node will mark the seed node alive and consider it seen.
If there is a considerably large number of live nodes, it might take a
long time before the seed node picks the new node and talks to it.

In bootstrap code, storage_service::bootstrap checks if we have seen any
nodes after a sleep of RING_DELAY milliseconds and throws "Unable to
contact any seeds!" if not; thus the node will fail to bootstrap.

To help the seed node talk to the new node faster, we favor the new node
in do_gossip_to_live_member.
2015-11-18 15:00:37 +02:00
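
The "favor the new node" heuristic can be sketched as: if any live endpoint has never been gossiped to, pick it first; otherwise fall back to a random live endpoint. This is an illustrative simplification of the idea, not the actual scylla selection logic:

```cpp
#include <string>
#include <vector>

struct endpoint {
    std::string addr;
    bool gossiped_to_yet;   // have we ever completed a round with it?
};

// Prefer an endpoint we have never talked to (the newly added node);
// otherwise pick a pseudo-random live endpoint using the caller's rnd.
const endpoint* pick_gossip_target(const std::vector<endpoint>& live, size_t rnd) {
    if (live.empty()) {
        return nullptr;
    }
    for (const auto& ep : live) {
        if (!ep.gossiped_to_yet) {
            return &ep;     // favor the node we have never gossiped to
        }
    }
    return &live[rnd % live.size()];
}
```

With this bias, a seed learns about the newcomer within roughly one round instead of waiting for random selection to eventually land on it.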
Asias He
d622fe867e gossip: Pass const ref if possible
It is clear that we will not change the parameter.
2015-11-09 13:01:37 +02:00
Asias He
5e8037b50a gossip: Futurize add_local_application_state()
We are ignoring the future returned by seastar::async. Futurize it so
caller can wait for the application state to be actually applied.

In addition, dropping the unused add_local_application_states function.
2015-11-01 11:20:52 +02:00
Vlad Zolotarov
33b195760b gms::gossiper: allow the modification of _subscribers while it's being iterated
Introduce a subscribers_list class that exposes 3 methods:
  - push_back(s) - adds a new element s to the back of the list
  - remove(s) - removes an element s from the list
  - for_each(f) - invokes f on each element of the list

Also make subscribers_list store a shared_ptr to each subscriber
to allow removing (currently it stores a naked pointer to the object).

subscribers_list allows push_back() and remove() to be called while
another thread (e.g. seastar::async()) is in the middle of for_each().

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

New in v2:
   - Simplify subscribers_list::remove() method.
   - load_broadcaster: inherit from enable_shared_from_this instead
     of async_sharded_service.
2015-10-30 00:16:16 +02:00
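
One way to make for_each() safe against concurrent push_back()/remove() is to hold elements by shared_ptr and iterate over a snapshot copy, so modifying the real list never invalidates the walk and removed subscribers stay alive until it finishes. A simplified sketch of that approach (the real class integrates with seastar; names follow the commit message):

```cpp
#include <functional>
#include <list>
#include <memory>

template <typename T>
class subscribers_list {
    std::list<std::shared_ptr<T>> _items;
public:
    void push_back(std::shared_ptr<T> s) { _items.push_back(std::move(s)); }
    void remove(const std::shared_ptr<T>& s) { _items.remove(s); }
    // Walk a snapshot: push_back()/remove() during the walk affect only
    // _items, never the copy being iterated, and shared_ptr keeps each
    // visited subscriber alive even if it is removed mid-walk.
    void for_each(std::function<void(T&)> f) {
        auto snapshot = _items;
        for (auto& s : snapshot) {
            f(*s);
        }
    }
};
```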
Asias He
1469cec5bf gossiper: Kill free function helper to get heart version and generation number
They can only be executed on cpu 0. Make the gossiper member
functions for them to do so.
2015-10-27 21:48:37 +08:00
Asias He
f573059698 gossiper: Kill free function helper for {unsafe_,}assassinate_endpoint
They can only be executed on cpu 0. Make the gossiper member functions
for them to do so.
2015-10-27 21:48:37 +08:00
Asias He
c5f377eb8b gossip: Simplify get_endpoint_downtime
_unreachable_endpoints is replicated to all cores. No need to query
on core 0.
2015-10-27 21:48:37 +08:00
Asias He
6f1db4fb72 gossip: Simplify get_unreachable_members
_unreachable_endpoints is replicated to all cores. No need to query on
core 0.

This also fixes a bug in storage_proxy::truncate_blocking
which might access _unreachable_endpoints on non-zero cores.
2015-10-27 21:48:37 +08:00
Asias He
a9f96d1f5a gossip: Replicate _unreachable_endpoints to all cores 2015-10-27 21:48:37 +08:00
Asias He
2439a2a982 gossip: Simplify get_live_members
_live_endpoints is replicated to all cores. No need to query on core 0.
2015-10-27 21:48:37 +08:00
Amnon Heiman
ff67285091 gossiper: make the get cluster name and partitioner public
The API needs the cluster and the partitioner names, so the methods are
now public.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-20 10:21:19 +03:00
Asias He
817c138034 gossip: Add get_current_heart_beat_version interface
HTTP API will use it.
2015-09-28 09:38:22 +08:00
Avi Kivity
d5cf0fb2b1 Add license notices 2015-09-20 10:43:39 +03:00
Asias He
c44afca3d8 gossip: Make is_dead_state take const reference 2015-09-11 15:43:27 +08:00