This lays the groundwork for brokering a node's view update
backlog across the whole cluster. This is needed for when a
coordinator does not contact a given replica for a long time, and uses
a backlog view that is outdated and causes requests to be
unnecessarily delayed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
db::config is a global class; changes in any module can cause changes
in db::config. Therefore, it is a cause of needless recompilation.
Remove some of these dependencies by having consumers of db::config
declare an intermediate config struct that is contains only
configuration of interest to them, and have their caller fill it out
(in the case of auth, it already followed this scheme and the patchset
only moves the translation function).
In addition, some outright pointless inclusions of db/config.hh are
removed.
The result is somewhat shorter compile times, and fewer needless
recompiles.
* https://github.com/avikivity/scylla unconfig-1/v1:
config: remove inclusions of db/config.hh from header files
repair: remove unneeded config.hh inclusion
batchlog_manager: remove dependency on db::config
auth: remove permissions_cache dependency on db::config
auth: remove auth::service dependency on db::config
auth: remove unneeded db/config.hh includes
- New scylla node always send application_state::RPC_READY = false when
the node boots and send application_state::RPC_READY = true when cql
server is up
- Old scylla node that does not support the application_state::RPC_READY
never has application_state::RPC_READY in the endpoint_state, we can
only think their cql server is up, so we return true here if
application_state::RPC_READY is not present
Instead, distribute those inclusions to .cc files that require them. This
reduces rebuilds when config.hh changes, and makes it easier to locate files
that need config disaggregation.
storage_service keeps a bunch of "feature" variables, indicating cluster-wide
supported features, and has the ability to wait until the entire cluster supports
a given feature.
The propagation of features depends on gossip, but gossip is initialized after
storage_service, so the current code late-initializes the features. However, that
means that whoever waits on a feature between storage_service initialization and
gossip initialization loses their wait entry. In #3952, we have proof that this
in fact happens.
Fix this by removing the circular dependency. We now store features in a new
service, feature_service, that is started before both gossip and storage_service.
Gossip updates feature_service while storage_service reads for it.
Fixes#3953.
* https://github.com/avikivity/3953/v4.1:
storage_service: deinline enable_all_features()
gossiper: keep features registered
tests/gossip: switch to seastar::thread
storage_service: deinline init/deinit functions
gossiper: split feature storage into a new feature_service
gossiper: maybe enable features after start_gossiping()
storage_service: fix gap when feature::when_enabled() doesn't work
Since we may now start with features already registered, we need to enable
features immediately after gossip is started. This case happens in a cluster
that already is fully upgraded on startup. Before this series, features were
only added after this point.
Feature lifetime is tied to storage_service lifetime, but features are now managed
by gossip. To avoid circular dependency, add a new feature_service service to manage
feature lifetime.
To work around the problem, the current code re-initializes features after
gossip is initialized. This patch does not fix this problem; it only makes it
possible to solve it by untyping features from gossip.
Gossiper unregisters enabled features as an optimization. However that makes
decoupling features from gossiper harder. Disable this optimization; since the
number of features is small and normal access is to a single feature at a time,
there is no significant performance or memory loss.
In dtest, we have
self.check_rows_on_node(node1, 2000)
self.check_rows_on_node(node2, 2000)
which introduce the following cluster operations:
1) Initially:
- node1 up
- node2 up
2) self.check_rows_on_node(node1, 2000)
- node2 down
- node2 up (A: node2 will call gossiper::real_mark_alive when node2 boots
up to mark node1 up)
3) self.check_rows_on_node(node2, 2000)
- node1 down (B: node1 will send shutdown gossip message to node2, node2
will mark node1 down)
- node1 up (C: when node1 is up, node2 will call
gossiper::real_mark_alive)
Since there is no guarantee the order of Operation A and Operation B, it
is possible node2 will mark node1 as status=shutdown and mark node1 is
UP.
In Operation C, node2 will call gossiper::real_mark_alive to mark node1
up, but since node2 might think node1 is already up, node2 will exit
early in gossiper::real_mark_alive and not log "InetAddress 127.0.0.1 is
now UP, status={}"
As a result, dtest fails to see node2 reports node1 is up when it boots
node1 and fail the test.
TimeoutError: 23 Nov 2018 10:44:19 [node2] Missing: ['127.0.0.1.* now UP']
In the log we can see node1 marked as DOWN and UP almost at the same time on node2:
INFO 2018-11-23 22:31:29,999 [shard 0] gossip - InetAddress 127.0.0.1 is now DOWN, status = shutdown
INFO 2018-11-23 22:31:30,006 [shard 0] gossip - InetAddress 127.0.0.1 is now UP, status = shutdown
Fixes#3940
Tests: dtest with 20 consecutive succesful runs
Message-Id: <996dc325cbcc3f94fc0b7569217aa65464eaaa1c.1543213511.git.asias@scylladb.com>
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
* seastar d152f2d...c1e0e5d (6):
> scripts: perftune.py: properly merge parameters from the command line and the configuration file
> fmt: update to 5.2.1
> io_queue: only increment statistics when request is admitted
> Adds `read_first_line.cc` and `read_first_line.hh` to CMake.
> fstream: remove default extent allocation hint
> core/semaphore: Change the access of semaphore_units main ctor
Due to a compile-time fight between fmt and boost::multiprecision, a
lexical_cast was added to mediate.
sprint("%s", var) no longer accepts numeric values, so some sprint()s were
converted to format() calls. Since more may be lurking we'll need to remove
all sprint() calls.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Fixes#3798Fixes#3694
Tests:
unit(release), dtest([new] cql_tests.py:TruncateTester.truncate_after_restart_test)
* tag 'fix-gossip-shard-replication-v1' of github.com:tgrabiec/scylla:
gms/gossiper: Replicate enpoint states in add_saved_endpoint()
gms/gossiper: Make reset_endpoint_state_map() have effect on all shards
gms/gossiper: Replicate STATUS change from mark_as_shutdown() to other shards
gms/gossiper: Always override states from older generations
Lack of this may result in non-zero shards on some nodes still seeing
STATUS as NORMAL for a node which shut down, in some cases.
mark_as_shutdown() is invoked in reaction to an RPC call initiated by
the node which is shutting down. Another way a node can learn about
other node shutting down is via gossiping with a node which knows
this. In that case, the states will be replicated to non-zero
shards. The node which learnt via mark_as_shutdown() may also
eventually propagate this to non-zero shards, e.g. when it gossips
about it with other nodes, and its local version number at the time of
mark_as_shudown() was smaller than the one used to set the STATE by
the shutting down node.
Application states of each node are versioned per-node with a pair of
generation number (more significant) and value version. Generation
number uniquely identifies the life time of a scylla
process. Generation number changes after restart. Value versions start
from 0 on each restart. When a node gets updates for application
states, it merges them with its view on given node. Value updates with
older versions are ignored.
Gossiper processes updates only on shard 0, and replicates value
updates to other shards. When it sees a value with a new generation,
it correclty forgets all previous values. However, non-zero shards
don't forget values from previous generations. As a result,
replication will fail to override the values on non-zero shards when
generation number changes until their value version exceeds the
version prior to the restart.
This will result in incorrect STATUS for non-seed nodes on non-zero
shards. When restarting a non-seed node, it will do a shadow gossip
round before setting its STATUS to NORMAL. In the shadow round it will
learn from other nodes about itself, and set its STATUS to shutdown on
all shards with a high value version. Later, when it sets its status
to NORMAL, it will override it only on shard 0, because on other
shards the version of STATUS is higher.
This will cause CQL truncate to skip current node if the coordinator
runs on non-zero shards.
The fix is to override the entries on remote shards in the same way we
do on shard 0. All updates to endpoint states should be already
serialized on shard 0, and remote shards should see them in the same
order.
Introduced in 2d5fb9dFixes#3798Fixes#3694
shutdown_announce_in_ms specifies a period of time that a node which
is shutting down waits to allow its state to propagate to other nodes.
However, we were setting _enabled to false before waiting, which
will make the current node ignore gossip messages.
Message-Id: <1538576996-26283-1-git-send-email-tgrabiec@scylladb.com>
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).
In many places in the code, we first created a random_device variable, and then
using it created a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object, instead of the
random_engine object, because both have the same API; This hurts performance.
This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for to uuid.cc was sent previously by Pawel and is
not included in this patch, and the fix for gossiper.{cc,hh} is included here.
To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so cannot be
accidentally used. We use the following idiom:
std::default_random_engine _engine{std::random_device{}()};
Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>
Gossip SYN and ACK uses std::vector to store a list of gossip_digest,
the larger the cluster, the more continuous memory is needed. To reduce
the memory pressure which might cause std::bad_alloc, switch the std::vector
to chunked_vector.
In addition, change add_local_application_state to use std::list instead
of std::vector.
Refs #2782
It is faster than gossiper::is_normal because it avoids to do search in
the std::map<application_state, versioned_value>. It is useful for the
code in the fast path which needs to query if a node is in NORMAL
status.
Fixes#3500
Message-Id: <42db91fa4108f9f4fcf94fed3ec403ccf35d15e9.1528354644.git.asias@scylladb.com>
We make multiple attempts to mark a node as alive. We do that be
sending an EchoMessage, and marking the node as alive upon receiving a
successful answer. In case there's a network partition and the nodes
can't reach each other, multiple messages may be delivered and
processed.
We can avoid processing duplicate EchoMessage replies by checking
whether we had already marked the node as alive.
Fixes#1184
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180428191942.31990-1-duarte@scylladb.com>
Fixes#3187
Requires seastar "inet_address: Add constructor and conversion function
from/to IPv4"
Implements support IPv6 for CQL inet data. The actual data stored will
now vary between 4 and 16 bytes. gms::inet_address has been augumented
to interop with seastar::inet_address, though of course actually trying
to use an Ipv6 address there or in any of its tables with throw badly.
Tests assuming ipv4 changed. Storing a ipv4_address should be
transparent, as it now "widens". However, since all ipv4 is
inet_address, but not vice versa, there is no implicit overloading on
the read paths. I.e. tests and system_keyspace (where we read ip
addresses from tables explicitly) are modified to use the proper type.
Message-Id: <20180424161817.26316-1-calle@scylladb.com>
Problem:
Start node 1 2 3
Shutdown node2
Shutdown node1 node3
Start node1 node3
Try to repalce_address for node 2
The replace operation fails with the error:
seastar - Exiting on unhandled exception: std::runtime_error
(Cannot replace_address node2 because it doesn't exist in gossip)
This is because after all nodes shutdown, the other nodes do not have the
tokens and host_id info of node2 until node2 boots up and talks to the cluster.
If node2 can not boots up for whatever reason, currently the only way to
recover node2 is to `nodetool removenode` and bootstrap node2 again. This will
change tokens in the cluster and cause more data movement than just replacing
node2.
To fix, we add the tokens and host_id gossip application state in add_saved_endpoint
during boot up.
This is pretty safe because the generation for application state added by
add_saved_endpoint is zero, if node2 actually boots, other nodes will update
with node2's version.
Before:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"generation": 0,
"is_alive": false,
"update_time": 1523344828953,
"version": 0
}
Node 2 can not be replaced.
After:
$ curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"application_state": [
{
"application_state": 12,
"value": "31284090-2557-4036-9367-7bb4ef49c35a",
"version": 2
},
{
"application_state": 13,
"value": "... a lot of tokens ...",
"version": 1
}
],
"generation": 0,
"is_alive": false,
"update_time": 1523344828953,
"version": 0
}
Node 2 can be replaced.
Tests: dtest/replace_address_test.py
Fixes: #3347
Message-Id: <117fd6649939e0505847335791be8d7a96e7d273.1523346805.git.asias@scylladb.com>
start node 1 2 3
shutdown node2
shutdown node1 and node3
start node1 and node3
nodetool removenode node2
clean up all scylla data on node2
bootstrap node2 as a new node
I saw node2 could not bootstrap stuck at waiting for schema information to compelte for ever:
On node1, node3
[shard 0] gossip - received an invalid gossip generation for peer 127.0.0.2; local generation = 2, received generation = 1521779704
On node2
[shard 0] storage_service - JOINING: waiting for schema information to complete
This is becasue in nodetool removenode operation, the generation of node1 was increased from 0 to 2.
gossiper::advertise_removing () calls eps.get_heart_beat_state().force_newer_generation_unsafe();
gossiper::advertise_token_removed() calls eps.get_heart_beat_state().force_newer_generation_unsafe();
Each force_newer_generation_unsafe increases the generation by 1.
Here is an example,
Before nodetool removenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"generation": 0,
"is_alive": false,
"update_time": 1521778757334,
"version": 0
},
```
After nodetool revmoenode:
```
curl -X GET --header "Accept: application/json" "http://127.0.0.1:10000/failure_detector/endpoints/" | python -mjson.tool
{
"addrs": "127.0.0.2",
"application_state": [
{
"application_state": 0,
"value": "removed,146b52d5-dc94-4e35-b7d4-4f64be0d2672,1522038476246",
"version": 214
},
{
"application_state": 6,
"value": "REMOVER,14ecc9b0-4b88-4ff3-9c96-38505fb4968a",
"version": 153
}
],
"generation": 2,
"is_alive": false,
"update_time": 1521779276246,
"version": 0
},
```
In gossiper::apply_state_locally, we have this check:
```
if (local_generation != 0 && remote_generation > local_generation + MAX_GENERATION_DIFFERENCE) {
// assume some peer has corrupted memory and is broadcasting an unbelievable generation about another peer (or itself)
logger.warn("received an invalid gossip generation for peer {}; local generation = {}, received generation = {}",ep, local_generation, remote_generation);
}
```
to skip the gossip update.
To fix, we relax generation max difference check to allow the generation
of a removed node.
After this patch, the removed node bootstraps successfully.
Tests: dtest:update_cluster_layout_tests.py
Fixes#3331
Message-Id: <678fb60f6b370d3ca050c768f705a8f2fd4b1287.1522289822.git.asias@scylladb.com>
This is useful in tests, which don't communicate. Binding to a port can
fail if the system is running something else.
It would be better to prevent even more of the gossiper from starting up,
but that is more difficult.
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.
This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).
Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.
Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
In commit 8af0b501a2 (gossip: wait for stabilized gossip on bootstrap)
The force_after variable was changed from int32_t to stdx::optional<int32_t>
- if (force_after > 0 && total_polls > force_after) {
+ if (force_after && total_polls > *force_after) {
Checking force_after > 0 was dropped which is wrong because force_after
is set to -1 by default. So the if branch will always be executed after
1 poll.
We always see:
[shard 0] gossip - Gossip not settled but startup forced by
skip_wait_for_gossip_to_settle. Gossp total polls: 1
even if skip_wait_for_gossip_to_settle is not set at all.
Fixes#3257
Message-Id: <845d219cea6101a7c507c13879c850a5c882e510.1520297548.git.asias@scylladb.com>
If a node shuts itself down due to I/O error (such as ENOSPC), then
nodetool status will show the cluster status at the time the shutdown
occured.
In fact the node will be in shutdown status (nodetool gossipinfo shows
the correct status), however, `nodetool status` does not interpret the
shutdown status, instead it use the output of:
curl -X GET --header "Accept: application/json"
"http://127.0.0.1:10000/gossiper/endpoint/live"
to decide if a node is in UN status.
To fix, do not include the node itself in the output of get_live_members
Without this patch, when a node is shutdown due to I/O error:
UN 127.0.0.1 296.2 MB 256 ? 056ff68e-615c-4412-8d35-a4626569b9fd rack1
With this patch, when a node is shutdown due to I/O error:
?N 127.0.0.1 296.2 MB 256 ? 056ff68e-615c-4412-8d35-a4626569b9fd rack1
Fixes#1629
Message-Id: <039196a478b5b1a8749b3fdaf7e16cfe2eb73a2f.1498528642.git.asias@scylladb.com>
Fixes#2866
Instead of a raw 30s sleep waiting for gossip to stabilize/set up
ranges on bootstrap, use similar logic as 'wait_for_gossip_to_settle'
and loop said 30s or more until we neither grow/shrink ep set, or
are processing ACK:s.
Fixes#2894
Allow applying certain application states as monotonic sets,
i.e. allow set of states as input, and ensure the values are
re-versioned and all applied together.
Then do so for certain states that are by design coupled
(status/tokens).
Similar solution as origins, as issue is copy of the same.
Broken in f570e41d18.
Not replicating this may cause coordinator to treat a node which is
down as alive, or vice verse.
Fixes regression in dtest:
consistency_test.py:TestAvailability.test_simple_strategy
which was expected to get "unavailable" exception but it was getting a
timeout.
Message-Id: <1510666967-1288-1-git-send-email-tgrabiec@scylladb.com>
storage_service depends on endpoint states to be replicated to all
shards before token metadata is replicated. Currently this is taken
care of by storage_service::replicate_to_all_cores(), invoked from
storage_service's change listener. It copies whole endpoint state map,
which is expensive in large clusters. It's more efficient to replicate
only incremental changes, and only once, rather than for each
application state.
Makes state application faster due to increased parallelism.
Refs #2855.
Bootrap of 11th node, ignoring apply_state_locally() which complete instantly:
Before:
DEBUG 2017-10-06 15:24:04,213 [shard 0] gossip - apply_state_locally() took 1230 ms
DEBUG 2017-10-06 15:24:04,223 [shard 0] gossip - apply_state_locally() took 1421 ms
DEBUG 2017-10-06 15:24:04,225 [shard 0] gossip - apply_state_locally() took 607 ms
DEBUG 2017-10-06 15:24:04,288 [shard 0] gossip - apply_state_locally() took 488 ms
DEBUG 2017-10-06 15:24:04,408 [shard 0] gossip - apply_state_locally() took 1425 ms
After:
DEBUG 2017-10-06 16:24:13,130 [shard 0] gossip - apply_state_locally() took 814 ms
It's possible that a change listener for a later state will run before
change listener for the previous state completes, in which case
node's state can be corruped. For example, the previous change listener
may override sysytem.peers with an old value.
This patch fixes the problem by serializing state changes and
listeners for each node.
The implementation uses loading_shared_values so that the lock remains
alive as long as there is anyone holding it. Using endpoint_state_map
for that doesn't seem appropraite, because entries can be removed from
it while listeners are still running. There is code in the gossiper
which anticipates that entry may be gone across deferring points in
some places.
apply_new_states() always fires change listeners for received values,
even if we already processed the state earlier. Some change listeners
are heavy-weight, e.g. storage_service::handle_state_normal(). We
should avoid calling them more than necessary.
Make sure that we always run the change listeners by putting them in a
defer() block. Otherwise, if exception is thrown in the middle of state
application, change listeners would not be run. Later we would not
detect the change for states which were already applied, and not run
the change listers.
Fixes#2867