Commit Graph

453 Commits

Asias He
02befb6474 gossip: Log seeds seen
It is useful for debugging bootstrap issues, especially for large
clusters.

Also, do not use `_seeds` as the set_seeds function parameter name, since
there is a class member called _seeds.

Refs #3417
Message-Id: <15e6bdf06376949ced1bdb845f810da09266783d.1532474820.git.asias@scylladb.com>
2018-08-01 10:57:56 +03:00
Nadav Har'El
25bd139508 cross-tree: clean up use of std::random_device()
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).

In many places in the code, we first created a random_device variable, and then
used it to create a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object instead of the
random_engine object, because both have the same API; this hurts performance.

This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for uuid.cc was sent previously by Pawel and is
not included in this patch; the fix for gossiper.{cc,hh} is included here.

To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so cannot be
accidentally used. We use the following idiom:

   std::default_random_engine _engine{std::random_device{}()};

Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>
2018-07-26 16:54:58 +01:00
Asias He
fd71c5718f gossip: Reduce contiguous memory usage
Gossip SYN and ACK use std::vector to store a list of gossip_digest;
the larger the cluster, the more contiguous memory is needed. To reduce
the memory pressure, which might cause std::bad_alloc, switch from
std::vector to chunked_vector.

In addition, change add_local_application_state to use std::list instead
of std::vector.

Refs #2782
2018-07-17 20:15:32 +08:00
Asias He
c3b5a2ecd5 gossip: Fix tokens assignment in assassinate_endpoint
The tokens vector is defined a few lines above and is needed outside the
if block.

Do not redefine it in the if block; otherwise the tokens will be empty.

Found by code inspection.

Fixes #3551.

Message-Id: <c7a06375c65c950e94236571127f533e5a60cbfd.1530002177.git.asias@scylladb.com>
2018-06-26 16:38:12 +01:00
Asias He
059ec89ad1 gms: Add is_normal helper to endpoint_state
It is faster than gossiper::is_normal because it avoids a search in
the std::map<application_state, versioned_value>. It is useful for
fast-path code that needs to query whether a node is in NORMAL
status.

Fixes #3500

Message-Id: <42db91fa4108f9f4fcf94fed3ec403ccf35d15e9.1528354644.git.asias@scylladb.com>
2018-06-10 19:21:03 +03:00
Duarte Nunes
b1dd1876e5 gms/gossiper: Prevent duplicate processing of EchoMessage reply
We make multiple attempts to mark a node as alive. We do that by
sending an EchoMessage and marking the node as alive upon receiving a
successful answer. In case there's a network partition and the nodes
can't reach each other, multiple messages may be delivered and
processed.

We can avoid processing duplicate EchoMessage replies by checking
whether we had already marked the node as alive.

Fixes #1184

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180428191942.31990-1-duarte@scylladb.com>
2018-04-29 14:20:01 +03:00
Calle Wilund
b1edf75c8b types: Make seastar::inet_address the "native" type for CQL inet.
Fixes #3187

Requires seastar "inet_address: Add constructor and conversion function
from/to IPv4"

Implements IPv6 support for the CQL inet data type. The actual data
stored will now vary between 4 and 16 bytes. gms::inet_address has been
augmented to interop with seastar::inet_address, though of course
actually trying to use an IPv6 address there, or in any of its tables,
will throw badly.

Tests assuming IPv4 have been changed. Storing an ipv4_address should be
transparent, as it now "widens". However, since every IPv4 address is an
inet_address but not vice versa, there is no implicit overloading on
the read paths, i.e. tests and system_keyspace (where we read IP
addresses from tables explicitly) are modified to use the proper type.
Message-Id: <20180424161817.26316-1-calle@scylladb.com>
2018-04-24 23:12:07 +01:00
Asias He
d71a94a08b gossip: Add tokens and host_id in add_saved_endpoint
Problem:

   Start node 1 2 3
   Shutdown node2
   Shutdown node1 node3
   Start node1 node3
Try to replace_address for node 2
   The replace operation fails with the error:
     seastar - Exiting on unhandled exception: std::runtime_error
     (Cannot replace_address node2 because it doesn't exist in gossip)

This is because after all nodes shut down, the other nodes do not have the
tokens and host_id info of node2 until node2 boots up and talks to the cluster.

If node2 cannot boot up for whatever reason, currently the only way to
recover node2 is to `nodetool removenode` and bootstrap node2 again. This will
change the tokens in the cluster and cause more data movement than just
replacing node2.

To fix, we add the tokens and host_id gossip application state in
add_saved_endpoint during boot-up.

This is pretty safe because the generation for the application state added by
add_saved_endpoint is zero; if node2 actually boots, other nodes will update
with node2's version.

Before:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool

    {
        "addrs": "127.0.0.2",
        "generation": 0,
        "is_alive": false,
        "update_time": 1523344828953,
        "version": 0
    }

Node 2 cannot be replaced.

After:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool

    {
        "addrs": "127.0.0.2",
        "application_state": [
            {
                "application_state": 12,
                "value": "31284090-2557-4036-9367-7bb4ef49c35a",
                "version": 2
            },
            {
                "application_state": 13,
                "value": "... a lot of tokens ...",
                "version": 1
            }
        ],
        "generation": 0,
        "is_alive": false,
        "update_time": 1523344828953,
        "version": 0
    }

Node 2 can be replaced.

Tests: dtest/replace_address_test.py
Fixes: #3347
Message-Id: <117fd6649939e0505847335791be8d7a96e7d273.1523346805.git.asias@scylladb.com>
2018-04-10 13:14:31 +02:00
Asias He
f539e993d3 gossip: Relax generation max difference check
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node

I saw node2 could not bootstrap; it was stuck forever waiting for schema information to complete:

On node1, node3

    [shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704

On node2

    [shard 0] storage_service - JOINING: waiting for schema information to complete

This is because during the nodetool removenode operation, the locally stored generation of node2 was increased from 0 to 2.

   gossiper::advertise_removing () calls eps.get_heart_beat_state().force_newer_generation_unsafe();
   gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();

Each force_newer_generation_unsafe increases the generation by 1.

Here is an example,

Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
   {
   "addrs": "127.0.0.2",
   "generation": 0,
   "is_alive": false,
   "update_time": 1521778757334,
   "version": 0
   },
```

After nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
 {
     "addrs": "127.0.0.2",
     "application_state": [
         {
             "application_state": 0,
             "value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
             "version": 214
         },
         {
             "application_state": 6,
             "value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
             "version": 153
            }
     ],
     "generation": 2,
     "is_alive": false,
     "update_time": 1521779276246,
     "version": 0
 },
```

In gossiper::apply_state_locally, we have this check:

```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
    // assume some peer has corrupted memory and is broadcasting an unbelievable
    // generation about another peer (or itself)
    logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}",
                ep, local_generation, remote_generation);
}
```
to skip the gossip update.

To fix, we relax generation max difference check to allow the generation
of a removed node.

After this patch, the removed node bootstraps successfully.

Tests: dtest:update_cluster_layout_tests.py
Fixes #3331

Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
2018-03-29 12:09:49 +03:00
Duarte Nunes
9cadfb27f1 gms/gossiper: Remove superfluous check
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Duarte Nunes
69b28a4f2b gms/gossiper: Check for shadow round completion before throwing
For values of `shadow_round_ms` lower than 1 second, this was assuming
failure without checking.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-03-19 13:08:53 +00:00
Avi Kivity
02ce0c4cde gms: allow gossiper to start_gossiping() without binding to the port
This is useful in tests, which don't communicate. Binding to a port can
fail if the system is running something else.

It would be better to prevent even more of the gossiper from starting up,
but that is more difficult.
2018-03-19 12:16:11 +02:00
Duarte Nunes
810db425a5 gms/gossiper: Synchronize endpoint state destruction
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.

This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (on all cores).

Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.

Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
2018-03-18 14:38:04 +02:00
Asias He
25aa59f2f1 gossip: Fix force_after in wait_for_gossip
In commit 8af0b501a2 (gossip: wait for stabilized gossip on bootstrap),
the force_after variable was changed from int32_t to stdx::optional<int32_t>:

-            if (force_after > 0 && total_polls > force_after) {
+            if (force_after && total_polls > *force_after) {

The force_after > 0 check was dropped, which is wrong because force_after
is set to -1 by default, so the if branch will always be executed after
1 poll.

We always see:

   [shard 0] gossip - Gossip not settled but startup forced by
   skip_wait_for_gossip_to_settle. Gossp total polls: 1

even if skip_wait_for_gossip_to_settle is not set at all.

Fixes #3257
Message-Id: <845d219cea6101a7c507c13879c850a5c882e510.1520297548.git.asias@scylladb.com>
2018-03-06 10:11:02 +02:00
Asias He
5bae9b4e22 gossip: Check get_application_state_ptr in get_host_id
Check the pointer returned from get_application_state_ptr before using it.

Refs #2136

Message-Id: <e2ea32993754a79837dd97a7c5c601461dc5e1d1.1516581663.git.asias@scylladb.com>
2018-01-22 12:56:20 +02:00
Asias He
34f6218dc5 gossip: Show correct nodetool status against the shutdown node itself
If a node shuts itself down due to an I/O error (such as ENOSPC), then
nodetool status will show the cluster status at the time the shutdown
occurred.

In fact the node will be in shutdown status (nodetool gossipinfo shows
the correct status); however, `nodetool status` does not interpret the
shutdown status, and instead uses the output of:

curl -X GET --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/endpoint/live"

to decide if a node is in UN status.

To fix, do not include the node itself in the output of get_live_members.

Without this patch, when a node is shutdown due to I/O error:
UN  127.0.0.1  296.2 MB   256          ?  056ff68e-615c-4412-8d35-a4626569b9fd  rack1

With this patch, when a node is shutdown due to I/O error:
?N  127.0.0.1  296.2 MB   256          ?  056ff68e-615c-4412-8d35-a4626569b9fd  rack1

Fixes #1629
Message-Id: <039196a478b5b1a8749b3fdaf7e16cfe2eb73a2f.1498528642.git.asias@scylladb.com>
2018-01-04 08:31:01 +02:00
Calle Wilund
8af0b501a2 gossip: wait for stabilized gossip on bootstrap
Fixes #2866

Instead of a raw 30s sleep waiting for gossip to stabilize/set up
ranges on bootstrap, use logic similar to 'wait_for_gossip_to_settle'
and loop for said 30s or more until the endpoint set neither grows nor
shrinks and we are not processing ACKs.
2017-12-05 14:28:34 +00:00
Calle Wilund
1c8302e692 gossiper: Prevent race condition in propagation
Fixes #2894

Allow applying certain application states as monotonic sets,
i.e. allow a set of states as input, and ensure the values are
re-versioned and all applied together. Then do so for certain states
that are by design coupled (status/tokens).

Similar solution as Origin's, as the issue is a copy of the same.
2017-12-05 14:28:34 +00:00
Tomasz Grabiec
7323fe76db gossiper: Replicate endpoint_state::is_alive()
Broken in f570e41d18.

Not replicating this may cause a coordinator to treat a node which is
down as alive, or vice versa.

Fixes regression in dtest:

  consistency_test.py:TestAvailability.test_simple_strategy

which was expected to get "unavailable" exception but it was getting a
timeout.

Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>
2017-11-14 15:58:00 +02:00
Tomasz Grabiec
f570e41d18 gms/gossiper: Remove periodic replication of endpoint state map
For large clusters the map can be big and cause latency problems.
Since we now actively replicate changes, this is no longer needed.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
84c7b63c51 gossiper: Check for features in the change listener
In preparation for removal of periodic replication
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
2d5fb9d109 gms/gossiper: Replicate changes incrementally to other shards
storage_service depends on endpoint states to be replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies whole endpoint state map,
which is expensive in large clusters. It's more efficient to replicate
only incremental changes, and only once, rather than for each
application state.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
28c9609370 gms/gossiper: Document validity of endpoint_state properties 2017-10-18 08:49:53 +02:00
Tomasz Grabiec
5cc83b9b3c gms/gossiper: Process endpoints in parallel
Makes state application faster due to increased parallelism.

Refs #2855.

Bootstrap of the 11th node, ignoring apply_state_locally() calls which complete instantly:

Before:

DEBUG 2017-10-06 15:24:04,213 [shard 0] gossip - apply_state_locally() took 1230 ms
DEBUG 2017-10-06 15:24:04,223 [shard 0] gossip - apply_state_locally() took 1421 ms
DEBUG 2017-10-06 15:24:04,225 [shard 0] gossip - apply_state_locally() took 607 ms
DEBUG 2017-10-06 15:24:04,288 [shard 0] gossip - apply_state_locally() took 488 ms
DEBUG 2017-10-06 15:24:04,408 [shard 0] gossip - apply_state_locally() took 1425 ms

After:

DEBUG 2017-10-06 16:24:13,130 [shard 0] gossip - apply_state_locally() took 814 ms
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
8f01e08690 gms/gossiper: Serialize state changes and notifications for given node
It's possible that a change listener for a later state will run before
the change listener for the previous state completes, in which case the
node's state can be corrupted. For example, the previous change listener
may override system.peers with an old value.

This patch fixes the problem by serializing state changes and
listeners for each node.

The implementation uses loading_shared_values so that the lock remains
alive as long as there is anyone holding it. Using endpoint_state_map
for that doesn't seem appropriate, because entries can be removed from
it while listeners are still running. There is code in the gossiper
which anticipates, in some places, that the entry may be gone across
deferring points.
2017-10-18 08:49:53 +02:00
Tomasz Grabiec
6fccf7f4d0 gms/gossiper: Encapsulate lookup of endpoint_state 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
41ffefd194 gossiper: Add and improve logging 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
0ed84710d9 gms/gossiper: Don't fire change listeners when there is no change
apply_new_states() always fires change listeners for received values,
even if we already processed the state earlier. Some change listeners
are heavy-weight, e.g. storage_service::handle_state_normal().  We
should avoid calling them more than necessary.

Make sure that we always run the change listeners by putting them in a
defer() block. Otherwise, if an exception is thrown in the middle of
state application, the change listeners would not be run; later we would
not detect the change for states which were already applied, and would
not run the change listeners.

Fixes #2867
2017-10-18 08:49:52 +02:00
Tomasz Grabiec
c780a74b58 gms/gossiper: Allow parallel apply_state_locally()
It has been serialized since e428d06f40. This causes a regression in
the performance of application-state propagation due to reduced
parallelism.

Processing states for each node has high latency due to memtable
flushes triggered by update_tokens() and commitlog syncs done by
system.peers updates, if commitlog sync mode is set to "batch".  We
have high internal concurrency for these, so increasing parallelism
significantly reduces time to process all states.

Fixes #2855.
2017-10-18 08:49:52 +02:00
Tomasz Grabiec
f20a805eca gms/gossiper: Avoid copies in endpoint_state::add_application_state() 2017-10-18 08:49:52 +02:00
Tomasz Grabiec
a71624d58d gms/failure_detector: Ignore short update intervals
Failure detector decides that a node is down if it hasn't received a change of
its heartbeat for longer than ~11 times the average of past intervals between
updates.

If there are multiple incoming ACKs containing information about the
same node, we may detect and report a change for each of them. This
will cause the failure_detector to conclude that the average report
period is in milliseconds. After the update storm is over, it will
claim node failure very soon, because the normal report period will
then be a large multiple of the average.

Fix by not counting short intervals in the calculation of the average
arrival time.

Fixes #2861.
2017-10-18 08:49:52 +02:00
Duarte Nunes
f67a553b96 gms/endpoint_state: Remove get_application_state()
It is no longer used, as all callsites have moved to
get_application_state_ptr().

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
9d5c6e0c72 gms/endpoint_state: Avoid copies in is_shutdown()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
92df519b91 gms/gossiper: Cleanup get_supported_features()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
39f71f7d12 gms/gossiper: Cleanup get_gossip_status()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
db660f1e08 gms/gossiper: Cleanup seen_any_seed()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
88dd97fe8e gms/gossiper: Cleanup get_host_id()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
95079795ce gms/gossiper: Removed dead uses_vnodes() function
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
7db7704edc gms/gossiper: Cleanup uses_host_id()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
2984bdab29 gms/gossiper: Add get_application_state_ptr()
This patch introduces the get_application_state_ptr() function, which
allows access to a versioned_value of a particular endpoint.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
f41748af81 gms/gossiper: Cleanup notify_failure_detector()
Now that we have get_endpoint_state_for_endpoint_ptr(), which does not
return a copy and allows mutating the actual state, we can use it
instead of repeating the lookup code.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
2210d10552 gms/gossiper: Cleanup is_alive()
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Duarte Nunes
ceef45a6fe gms/gossiper: Const-qualify functions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
955aee1588 gms/gossiper: Cleanup convict()
Have convict() use get_endpoint_state_for_endpoint_ptr(), simplify
logging, and also protect expensive operations by checking the log
level.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
cf99a41226 gms/gossiper: Add non-const get_endpoint_state_for_endpoint_ptr()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
d0fba1a113 gms/failure_detector: Simplify alive/dead endpoint count
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Duarte Nunes
dc65cda1a3 gms/failure_detector: Fix if/else style to include braces
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:31 +01:00
Tomasz Grabiec
66a15ccd18 gms/gossiper: Introduce copy-less endpoint_state::get_application_state_ptr()
Message-Id: <1507642411-28680-3-git-send-email-tgrabiec@scylladb.com>
2017-10-10 18:27:43 +01:00
Duarte Nunes
ceebbe14cc gossiper: Avoid endpoint_state copies
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.

This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).

Fixes #764

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:48:02 +01:00
Duarte Nunes
bc976b4773 endpoint_state: const-qualify functions
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:30:28 +01:00