The echo message does not need to access gossip internal state, so we can
run it on all shards and avoid forwarding to shard zero.
This makes gossip's marking of nodes as up more robust when shard zero is loaded.
There is an argument that we should make the echo message return only when
all shards have responded, so that we know all shards are live and responding.
However, in a heavily loaded cluster, one shard might be overloaded on
multiple nodes in the cluster at the same time. If we require an echo
response from all shards, there is a chance the local node will mark all peer
nodes as down. As a result, the whole cluster is down. This is much
worse than failing to exclude a node with a slow shard from the cluster.
Refs: #7197
The gossiper verbs are registered in two places -- start_gossiping()
and do_shadow_round(). They are unregistered in one -- stop_gossiping(),
and only if the start took place. As a result, there's a chance that after
a shadow round scylla exits without starting gossiping, thus leaving the
verbs armed.
Fix by unregistering the verbs on stop if they are still registered.
fixes: #7262
tests: manual(start, abort start after shadow round), unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200921140357.24495-1-xemul@scylladb.com>
Checks for features introduced over 2 years ago were removed
in previous commits, so all that is left is removing the feature
bits themselves. Note that the feature strings are still sent
to other nodes just to be extra safe, but the new code assumes
that all these features are implicitly enabled.
... and tests. Printing a pointer in logs is considered bad practice,
so the proposal is to keep this explicit (with fmt::ptr) and allow it for
the .debug and .trace cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We saw errors in killed_wiped_node_cannot_join_test dtest:
Aug 2020 10:30:43 [node4] Missing: ['A node with address 127.0.76.4 already exists, cancelling join']:
The test does:
n1, n2, n3, n4
wipe data on n4
start n4 again with the same ip address
Without this patch, n4 will bootstrap into the cluster with new tokens. We
should prevent n4 from bootstrapping because there is an existing
node with that address in the cluster.
In the shadow round, the local node should apply the application state of
the node with the same ip address. This is useful for detecting a node
trying to bootstrap with the same IP address as an existing node.
Tests: bootstrap_test.py
Fixes: #7073
"
We keep references to locator::token_metadata in many places.
Most of them are for read-only access and only a few want
to modify the token_metadata.
Recently, in 94995acedb,
we added yielding loops that access token_metadata in order
to avoid cpu stalls. To make that possible we need to make
sure the token_metadata object they are traversing won't change
mid-loop.
This series is a first step toward serializing updates to the shared
token metadata with respect to its readers.
Test: unit(dev)
Dtest: bootstrap_test:TestBootstrap.start_stop_test{,_node}, update_cluster_layout_tests.py -a next-gating(dev)
"
* tag 'constify-token-metadata-access-v2' of github.com:bhalevy/scylla:
api/http_context: keep a const sharded<locator::token_metadata>&
gossiper: keep a const token_metadata&
storage_service: separate get_mutable_token_metadata
range_streamer: keep a const token_metadata&
storage_proxy: delete unused get_restricted_ranges declaration
storage_proxy: keep a const token_metadata&
storage_proxy: get rid of mutable get_token_metadata getter
database: keep const token_metadata&
database: keyspace_metadata: pass const locator::token_metadata& around
everywhere_replication_strategy: move methods out of line
replication_strategy: keep a const token_metadata&
abstract_replication_strategy: get_ranges: accept const token_metadata&
token_metadata: rename calculate_pending_ranges to update_pending_ranges
token_metadata: mark const methods
token_ranges: pending_endpoints_for: return empty vector if keyspace not found
token_ranges: get_pending_ranges: return empty vector if keyspace not found
token_ranges: get rid of unused get_pending_ranges variant
replication_strategy: calculate_natural_endpoints: make token_metadata& param const
token_metadata: add get_datacenter_racks() const variant
The gossiper needs the messaging service. The messaging service is
started before the gossiper, so we can pass a reference to it into the
gossiper.
The gossiper is not stopped for real, and neither is the messaging
service, so the memory usage is still safe.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
gossip: Get rid of seed concept
The concept of seed and the different behaviour between seed nodes and
non-seed nodes generates a lot of confusion, complication and errors for
users. For example: how to add a seed node into a cluster, how to
promote a non-seed node to a seed node, how to choose seed nodes in a
multi-DC setup, how to edit config files for seeds, why a seed node does
not bootstrap.
If we remove the concept of seed, things get much easier for users.
After this series, the seed config option is used only once, when a new
node joins a cluster.
Major changes:
Seed nodes are only used as the initial contact point nodes.
Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.
The unsafe auto_bootstrap option is now ignored.
Gossip shadow round now talks to all nodes instead of just seed nodes.
Refs: #6845
Tests: update_cluster_layout_tests.py + manual test
"
* 'gossip_no_seed_v2' of github.com:asias/scylla:
gossip: Get rid of seed concept
gossip: Introduce GOSSIP_GET_ENDPOINT_STATES verb
gossip: Add do_apply_state_locally helper
gossip: Do not talk to seed node explicitly
gossip: Talk to live endpoints in a shuffled fashion
The concept of seed and the different behaviour between seed nodes and
non-seed nodes generates a lot of confusion, complication and errors for
users. For example: how to add a seed node into a cluster, how to
promote a non-seed node to a seed node, how to choose seed nodes in a
multi-DC setup, how to edit config files for seeds, why a seed node does
not bootstrap.
If we remove the concept of seed, things get much easier for users.
After this series, the seed config option is used only once, when a new
node joins a cluster.
Major changes:
- Seed nodes are only used as the initial contact point nodes.
- Seed nodes now perform bootstrap. The only exception is the first node
in the cluster.
- The unsafe auto_bootstrap option is now ignored.
- Gossip shadow round now attempts to talk to all nodes instead of just seed nodes.
Manual test:
- bootstrap n1, n2, n3 (n1 and n2 are listed as seed, check only n1
will skip bootstrap, n2 and n3 will bootstrap)
- shutdown n1, n2, n3
- start n2 (check non seed node can boot)
- start n1 (check n1 talks to both n2 and n3)
- start n3 (check n3 talks to both n1 and n2)
Upgrade/Downgrade test:
- Initialize cluster
Start 3 nodes n1, n2, n3 using the old version
n1 and n2 are listed as seed
- Test upgrade starting from seed nodes
Rolling restart n1 using new version
Rolling restart n2 using new version
Rolling restart n3 using new version
- Test downgrade to old version
Rolling restart n1 using old version
Rolling restart n2 using old version
Rolling restart n3 using old version
- Test upgrade starting from non seed nodes
Rolling restart n3 using new version
Rolling restart n2 using new version
Rolling restart n1 using new version
Notes on upgrade procedure:
There is no special procedure needed to upgrade to Scylla without seed
concept. Rolling upgrade node one by one is good enough.
Fixes: #6845
Tests: ./test.py + update_cluster_layout_tests.py + manual test
C++20 introduced a `contains` member function for maps and sets for
checking whether an element is present in the collection. Previously
the `count` function was often used in various ways.
`contains` not only expresses the intent of the code better but also
does it in a more unified way.
This commit replaces all such occurrences of `count` with
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
1. node1 is shut down
2. node1 sends a shutdown message to node2
3. node2 receives the gossip shutdown message, but the handler yields
4. node1 is restarted
5. node1 sends a new gossip endpoint_state to node2; node2 applies the state
in apply_state_locally and calls gossiper::handle_major_state_change
and then gossiper::mark_alive
6. The shutdown message handler from step 3 resumes and sets the status of node1 to SHUTDOWN
7. The gossiper::mark_alive fiber from step 5 resumes and calls gossiper::real_mark_alive;
node2 skips marking node1 as alive because the status of node1 is
SHUTDOWN. As a result, node1 is alive but it is not marked as UP by node2.
To fix, we serialize the two operations.
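One way to serialize the two code paths is a per-endpoint lock taken by both the status-update handler and the mark-alive path. The sketch below is a simplified illustration with invented names and a plain mutex; the actual gossiper code is fiber-based and differs:

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Illustrative per-endpoint state; not the real gossiper endpoint_state.
struct endpoint_state {
    std::string status = "NORMAL";
    bool alive = false;
    std::mutex lock;  // serializes status updates against mark-alive
};

// Shutdown handler: runs under the lock, so it cannot interleave
// with the mark-alive path for the same endpoint.
void handle_shutdown(endpoint_state& ep) {
    std::lock_guard<std::mutex> g(ep.lock);
    ep.status = "SHUTDOWN";
    ep.alive = false;
}

// Mark-alive path: also takes the lock; once an endpoint is in
// SHUTDOWN it is not marked alive again.
void mark_alive(endpoint_state& ep) {
    std::lock_guard<std::mutex> g(ep.lock);
    if (ep.status != "SHUTDOWN") {
        ep.alive = true;
    }
}
```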
Fixes #7032
"
This series adds support for the "md" sstable format.
Support is based on the following:
* Do not use clustering-based filtering in the presence
of static rows or tombstones.
* Disable min/max column names in the metadata for
formats older than "md".
* When updating the metadata, reset and disable min/max
in the presence of range tombstones (like Cassandra does
and until we process them accurately).
* Fix the way we maintain min/max column names by
keeping whole clustering key prefixes as min/max
rather than calculating min/max independently for
each component, like Cassandra does in the "md" format.
Fixes #4442
Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug)
md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1
"
* tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits)
config: enable_sstables_md_format by default
test: cql_query_test: add test_clustering_filtering unit tests
table: filter_sstable_for_reader: allow clustering filtering md-format sstables
table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results
table: filter_sstable_for_reader: adjust to md-format
table: filter_sstable_for_reader: include non-scylla sstables with tombstones
table: filter_sstable_for_reader: do not filter if static column is requested
table: filter_sstable_for_reader: refactor clustering filtering conditional expression
features: add MD_SSTABLE_FORMAT cluster feature
config: add enable_sstables_md_format
database: add set_format_by_config
test: sstable_3_x_test: test both mc and md versions
test: Add support for the "md" format
sstables: mx/writer: use version from sstable for write calls
sstables: mx/writer: update_min_max_components for partition tombstone
sstables: metadata_collector: support min_max_components for range tombstones
sstable: validate_min_max_metadata: drop outdated logic
sstables: rename mc folder to mx
sstables: may_contain_rows: always true for old formats
sstables: add may_contain_rows
...
C++20 introduced a `contains` member function for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:
<collection>.find(<element>) != <collection>.end()
In C++20 the same can be expressed with:
<collection>.contains(<element>)
This is not only more concise but also expresses the intent of the code
more clearly.
This commit replaces all the occurrences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
The MD format is disabled by default at this point.
The option extends enable_sstables_mc_format,
so both need to be set to support
the md format.
The MD_FORMAT cluster feature will be added in
a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The new verb is used to replace the current gossip shadow round
implementation. The current shadow round implementation reuses the gossip
syn and ack async messages, which has plenty of drawbacks. It is hard to
tell whether a specific peer node has responded to the syn message. The
delayed responses from a shadow round can be applied to the normal gossip
states even after the shadow round is done. The syn and ack message
handlers are full of special cases due to the shadow round. All gossip
application states, including the ones that are not relevant, are sent
back. The gossip application states are applied and the gossip
listeners are called as if in normal gossip operation. It is
completely unnecessary to call the gossip listeners in the shadow round.
This patch introduces a new verb to request exactly the gossip application
states the shadow round needs, using a synchronous verb, and applies the
application states without calling the gossip listeners. This makes
the shadow round easier to reason about, more robust and more
efficient.
Refs: #6845
Tests: update_cluster_layout_tests.py
Currently, we talk to a seed node in each gossip round with some
probability, i.e., nr_of_seeds / (nr_of_live_nodes + nr_of_unreachable_nodes).
For example, with 5 seeds in a 50-node cluster, the probability is 0.1.
Now that we talk to all live nodes, including the seed nodes, within a
bounded time period, it is no longer a must to talk to a seed node in
each gossip round.
In order to get rid of the seed concept, do not talk to a seed node
explicitly in each gossip round.
This patch is a preparatory patch to remove the seed concept in gossip.
Refs: #6845
Tests: update_cluster_layout_tests.py
Currently, we select 10 percent of random live nodes to talk with in
each gossip round. There is no upper bound on how long it will take to
talk to all live nodes.
This patch changes the way we select live nodes to talk with as below:
1) Shuffle all the live endpoints randomly
2) Split the live endpoints into 10 groups
3) Talk to one of the groups in each gossip round
4) Go to step 1 to shuffle again after all groups have been talked to
We keep the randomness of selecting nodes as before, and gain the
determinism of completing a pass over all live nodes in bounded time.
In addition, the way we favor a newly added node is simplified. When a
new live node is added, it is always added to the front of the group, so
it will be talked to in the next gossip round.
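The scheme above can be sketched roughly as follows (simplified, with integer endpoint ids standing in for node addresses; the structure and names are illustrative, not the actual gossiper code):

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <random>
#include <vector>

// Maintain a shuffled queue of live endpoints; each gossip round takes
// roughly one tenth of them, so a full pass over all live nodes
// completes within a bounded number of rounds.
struct round_robin_selector {
    std::vector<int> live;   // all live endpoints
    std::deque<int> queue;   // shuffled endpoints awaiting a round

    // Steps 1-2: shuffle all live endpoints for a new pass.
    void reshuffle(std::mt19937& rng) {
        std::vector<int> v = live;
        std::shuffle(v.begin(), v.end(), rng);
        queue.assign(v.begin(), v.end());
    }

    // Step 3: the endpoints to talk to in this round (one of ~10 groups).
    // Step 4: reshuffle once the whole pass has been consumed.
    std::vector<int> next_group(std::mt19937& rng) {
        if (queue.empty()) {
            reshuffle(rng);
        }
        size_t group_size = std::max<size_t>(1, live.size() / 10);
        std::vector<int> group;
        while (!queue.empty() && group.size() < group_size) {
            group.push_back(queue.front());
            queue.pop_front();
        }
        return group;
    }

    // A newly added live node goes to the front of the queue,
    // so it is talked to in the very next round.
    void add_live_node(int node) {
        live.push_back(node);
        queue.push_front(node);
    }
};
```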
This patch is a preparatory patch to remove the seed concept in gossip.
Refs: #6845
Tests: update_cluster_layout_tests.py
It is not used any more after 75cf1d18b5
(storage_service: Unify handling of replaced node removal from gossip)
in the "Make replacing node take writes" series.
Refs #5482
When a replacing node is in early boot-up and is not in the HIBERNATE state
yet, if the node is killed by a user, the node will wrongly send a
shutdown message to other nodes. This is because UNKNOWN is not in
SILENT_SHUTDOWN_STATES, so in gossiper::do_stop_gossiping the node
sends the shutdown message. Other nodes in the cluster will call
storage_service::handle_state_normal for this node, since the NORMAL and
SHUTDOWN statuses share the same status handler. As a result, other nodes
will incorrectly think the node is part of the cluster and that the
replace operation is finished.
Such problem was seen in replace_node_no_hibernate_state_test dtest:
n1, n2 are in the cluster
n2 is dead
n3 is started to replace n2, but n3 is killed in the middle
n3 announces SHUTDOWN status wrongly
n1 runs storage_service::handle_state_normal for n3
n1 gets the tokens for n3, which are empty, because n3 hasn't gossiped tokens yet
n1 skips updating normal tokens for n3, but thinks n3 has replaced n2
n4 starts to replace n2
n4 checks the tokens for n2 in storage_service::join_token_ring (Cannot
replace token {} which does not exist!) or
storage_service::prepare_replacement_info (Cannot replace_address {}
because it doesn't exist in gossip)
To fix, we add UNKNOWN into SILENT_SHUTDOWN_STATES and avoid sending
shutdown message.
Tests: replace_address_test.py:TestReplaceAddress.replace_node_no_hibernate_state_test
Fixes: #6436
This is purely a utility helper routine. As a nice side effect, the
inclusion of storage_service.hh is removed from several unrelated
places.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Nowadays the knowledge about known/supported features is
scattered between feature_service and storage_service. The
latter uses knowledge about the selected _sstables_format
to alter the "supported" set.
Encapsulate this knowledge inside the feature_service with
the help of "masked_features" -- those that shouldn't be
advertised to other nodes. The only maskable feature
today is UNBOUNDED_RANGE_TOMBSTONES. Nowadays it's
reported as supported only if the sstables format is MC.
With this patch it starts as masked and gets unmasked when
the sstables format is selected to be MC, so the behavior
is unchanged.
This will make it possible to move sstables_format from
storage_service to anywhere else.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The set of bool enable_something-s on feature_config duplicates
the disabled_features set on it, so remove the former and make
full use of the latter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This feature will ensure that caching can be switched
off per table only after the whole cluster supports it.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
Fixes #6067
Makes the scylla endpoint initializations that support TLS use reloadable certificate stores, watching the used cert + key files for changes and reloading them iff modified.
Tests in separate dtest set.
"
* elcallio-calle/reloadable-tls:
transport: Use reloadable tls certificates
redis: Use reloadable tls certificates
alternator: Use reloadable tls certificates
messaging_service: Use reloadable TLS certificates
Changes messaging service rpc to use reloadable tls
certificates iff tls is enabled.
Note that this means that the service cannot start
listening at construction time if TLS is active,
and the user needs to call start_listen_ex to initialize
and actually start the service.
Since the "normal" messaging service is actually started
from gms, this route too is made a continuation.
1. Remove the `versioned_value::factory` class; it didn't add any value. It just
forced us to create an object for making `versioned_value`s, for no sensible
reason.
2. Move some `versioned_value` deserialization code (string -> internal data
structures) into the versioned_value module. Previously, it was scattered all
around the place.
3. Make `gossiper::get_seeds` const and return a const reference.
I needed these refactors for a PR I was preparing to fix an issue with CDC. The
attempt to fix the issue failed (I'm trying something different now), but the
refactors might be useful anyway.
* kbr--vv-refactor:
gossiper: make `get_seeds` method const and return a const ref
versioned_value: remove versioned_value::factory class
gms: move TOKENS string deserialization code into versioned_value
These callbacks can block a seastar thread, and the underlying vector
can be reallocated concurrently.
This is no different than if it were a plain std::vector, and the
solution is similar: use values instead of references.
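The value-instead-of-reference fix can be illustrated with a small sketch (names invented for the example; the real code deals with gossiper subscriber callbacks):

```cpp
#include <cassert>
#include <functional>
#include <vector>

using callback = std::function<void(int&)>;

// Unsafe pattern: holding a reference into the vector while a callback
// runs is dangerous if the callback (or a concurrent fiber) causes the
// vector to reallocate, invalidating the reference.
//
// Safe pattern: copy each callback (a value) before invoking it, so the
// invocation does not depend on the vector's storage staying put.
void run_callbacks(std::vector<callback>& callbacks, int& counter) {
    for (size_t i = 0; i < callbacks.size(); ++i) {
        callback cb = callbacks[i];  // copy by value, then call
        cb(counter);                 // may grow `callbacks`, reallocating it
    }
}
```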
Fixes #6230
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200422182304.120906-1-espindola@scylladb.com>
The gossiper has convenience functions get_up_endpoint_count() and
get_down_endpoint_count(), but strangely no function to get the total
number. Even though it's easy to calculate the total by summing up their
results, that is inefficient and also inconvenient because each of these
functions returns a future.
So let's add another function, get_all_endpoint_count(), to get the
total number of nodes. We will use this function in the next patch.
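A minimal sketch of the new getter, assuming the counts come from two illustrative sets (the real gossiper's fields and its future-based API differ):

```cpp
#include <cassert>
#include <set>
#include <string>

// Sketch only; member names are invented, not the actual gossiper fields.
struct gossiper_sketch {
    std::set<std::string> live;          // endpoints currently marked up
    std::set<std::string> unreachable;   // endpoints currently marked down

    size_t get_up_endpoint_count() const { return live.size(); }
    size_t get_down_endpoint_count() const { return unreachable.size(); }

    // New convenience getter: total number of known endpoints in one
    // call, instead of summing the results of two separate calls.
    size_t get_all_endpoint_count() const {
        return live.size() + unreachable.size();
    }
};
```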
Signed-off-by: Nadav Har'El <n...@scylladb.com>
Message-Id: <20200422182035.15106-1-nyh@scylladb.com>
And do the same with CDC_STREAMS_TIMESTAMP.
The code that took a list of tokens represented as a string inside
versioned_value (for gossiping) and deserialized it into
an `unordered_set<dht::token>` lived in the storage_service module,
while the code that did the serializing (set -> string) lived in
versioned_value. There was a similar situation with the CDC generation
timestamp.
To increase maintainability and reusability, the deserialization code is
now placed next to the serialization code in versioned_value.
Furthermore, the `make_full_token_string`, `make_token_string`, and
`make_cdc_streams_timestamp_string` (serialization functions) are moved
out of versioned_value::factory and made static methods of
versioned_value instead.
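The intent of co-locating both directions can be illustrated with a hedged sketch: a serializer and deserializer for a token set kept side by side, so the wire format is defined in one place. The string token type and comma delimiter here are placeholder assumptions, not the actual versioned_value format:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_set>

// Serialize a token set into a delimiter-separated string
// (illustrative; real tokens are dht::token values).
std::string make_token_string(const std::unordered_set<std::string>& tokens) {
    std::string out;
    for (const auto& t : tokens) {
        if (!out.empty()) out += ',';
        out += t;
    }
    return out;
}

// Deserialize the same format back into a set; keeping this next to the
// serializer means a format change only touches one module.
std::unordered_set<std::string> parse_token_string(const std::string& s) {
    std::unordered_set<std::string> tokens;
    std::stringstream ss(s);
    std::string t;
    while (std::getline(ss, t, ',')) {
        if (!t.empty()) tokens.insert(t);
    }
    return tokens;
}
```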
Fixes #5808
It seems some gcc versions will generate the code as sign-extending. Mine
does not, but this should be more correct anyhow.
Added a small stringify test to serialization_test for inet_address.
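The kind of fix implied here can be sketched as follows (illustrative helper, not the actual inet_address code): casting each address byte through uint8_t before widening avoids sign extension regardless of whether plain char is signed on the target:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// A plain (possibly signed) char above 0x7f can be sign-extended to a
// negative int when widened, producing e.g. "-64" instead of "192".
// Casting through uint8_t first makes the result well-defined.
std::string byte_to_string(char b) {
    return std::to_string(static_cast<unsigned>(static_cast<uint8_t>(b)));
}
```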
This removes the need to include reactor.hh, a source of compile
time bloat.
In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.
Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().
Ref #1
Always enable lightweight transactions. Remove the check for the command
line switch from the feature service, assuming LWT is always enabled.
Remove the check for LWT from Alternator.
Note that in order for the cluster to work with LWT, all nodes need
to support it.
Rename LWT to UNUSED in db/config.hh, to keep accepting lwt keyword in
--experimental-features command line option, but do nothing with it.
Changes in v2:
* remove enable_lwt feature flag, it's always there
Closes #6102
test: unit (dev, debug)
Message-Id: <20200401071149.41921-1-kostja@scylladb.com>
Consider 3 nodes in the cluster, n1, n2, n3 with gossip generation
number g1, g2, g3.
n1, n2, n3 running scylla version with commit
0a52ecb6df (gossip: Fix max generation
drift measure)
One year later, the user wants to upgrade n1, n2, n3 to a new version.
When n3 does a rolling restart with the new version, n3 will use a new
generation number g3'. Because g3' - g2 > MAX_GENERATION_DIFFERENCE and
g3' - g1 > MAX_GENERATION_DIFFERENCE, n1 and n2 will reject n3's
gossip update and mark n3 as down.
Such unnecessary marking of nodes as down can cause availability issues.
For example:
DC1: n1, n2
DC2: n3, n4
When n3 and n4 restart, n1 and n2 will mark n3 and n4 as down, which
causes the whole DC2 to be unavailable.
To fix, we can start the node with a gossip generation within
MAX_GENERATION_DIFFERENCE of its previous one.
Once all the nodes run a version with commit
0a52ecb6df, the option is no longer
needed.
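The intended clamping can be sketched as follows. The constant value and the function shape are illustrative assumptions (the generation is treated as a seconds-since-epoch value and the window as one year), not the actual gossiper code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Illustrative: peers reject a generation that jumps by more than this.
constexpr int64_t MAX_GENERATION_DIFFERENCE = 86400LL * 365;

// On restart, pick a generation that is newer than the last one we
// gossiped but still within the window peers will accept, even if the
// node was down for a long time.
int64_t next_generation(int64_t last_generation, int64_t now_seconds) {
    int64_t candidate = std::max(now_seconds, last_generation + 1);
    int64_t limit = last_generation + MAX_GENERATION_DIFFERENCE;
    return std::min(candidate, limit);
}
```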
Fixes #5164