Commit Graph

318 Commits

Author SHA1 Message Date
Kamil Braun
d71513d814 abstract_replication_strategy: avoid reactor stalls in get_address_ranges and friends
The algorithm used in `get_address_ranges` and `get_range_addresses`
calls `calculate_natural_endpoints` in a loop; the loop iterates over
all tokens in the token ring. If a particular implementation of
`calculate_natural_endpoints` has large complexity - say `θ(n)`,
where `n` is the number of tokens - this results in a `θ(n^2)`
algorithm (or worse). This is the case for the `Everywhere`
replication strategy.

For small clusters this doesn't matter that much, but if `n` is, say, `20*255`,
this may result in huge reactor stalls, as observed in practice.

We avoid these stalls by inserting tactical yields. We hope that
some day someone actually implements a subquadratic algorithm here.
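The shape of the fix can be sketched with a minimal stand-alone model (illustrative only: `maybe_yield` is a counting stub standing in for Seastar's yield point, and `calculate_natural_endpoints` is reduced to a linear ring scan, as in the `Everywhere` case):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stub standing in for Seastar's yield point: in the real code this would
// suspend if the reactor has pending tasks; here it just counts yield offers.
static size_t yield_points = 0;
static void maybe_yield() { ++yield_points; }

// Toy calculate_natural_endpoints(): the Everywhere strategy walks the
// whole ring, i.e. O(n) work per call.
static size_t calculate_natural_endpoints(long token, const std::vector<long>& ring) {
    (void)token;              // the toy scan ignores the token value
    size_t work = 0;
    for (long t : ring) {
        (void)t;
        ++work;
    }
    return work;
}

// The pattern from get_address_ranges: n iterations of O(n) work, so a
// tactical yield per token keeps any single reactor task bounded.
static size_t scan_all_tokens(const std::vector<long>& ring) {
    size_t total = 0;
    for (long token : ring) {
        total += calculate_natural_endpoints(token, ring);
        maybe_yield();        // the tactical yield added by this commit
    }
    return total;
}
```

The quadratic cost is unchanged; only the stall per task is bounded, since the loop offers to yield once per outer iteration.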

The commit also adds a comment on
`abstract_replication_strategy::calculate_natural_endpoints` explaining
that the interface does not give a complexity guarantee (at this point);
the different implementations have different complexities.

For example, `Everywhere` implementation always iterates over all tokens
in the token ring, so it has `θ(n)` worst and best case complexity.
On the other hand, `NetworkTopologyStrategy` implementation usually
finishes after visiting a small part of the token ring (specifically,
as soon as it finds a token for each node in the ring) and performs
a constant number of operations for each visited token on average,
but theoretically its worst case complexity is actually `O(n + k^2)`,
where `n` is the number of all tokens and `k` is the number of endpoints
(the `k^2` appears since for each endpoint we must perform finds and
inserts on `unordered_set` of size `O(k)`; `unordered_set` operations
have `O(1)` average complexity but `O(size of the set)` worst case
complexity).

Therefore it's not easy to put any complexity guarantee in the interface
at this point. Instead, we say that:
- some implementations may yield - if their complexities force us to do so
- but in general, there is no guarantee that an implementation will
  yield - e.g. the `Everywhere` implementation does not yield.

Fixes #8555.

Closes #8647
2021-05-25 11:53:28 +03:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Kamil Braun
03ad111beb tree-wide: comments on deprecated functions to access global variables
Closes #8665
2021-05-18 11:31:10 +03:00
Nadav Har'El
fb0c4e469a Merge 'token_metadata: Fix get_all_endpoints to return nodes in the ring' from Asias He
The get_all_endpoints() should return the nodes that are part of the ring.

A node's presence in _endpoint_to_host_id_map does not guarantee
that it is part of the ring.

To fix, return from _token_to_endpoint_map.

Fixes #8534

Closes #8536

* github.com:scylladb/scylla:
  token_metadata: Get rid of get_all_endpoints_count
  range_streamer: Handle everywhere_topology
  range_streamer: Adjust use_strict_sources_for_ranges
  token_metadata: Fix get_all_endpoints to return nodes in the ring
2021-05-11 18:39:10 +03:00
Asias He
5a410cb6e3 token_metadata: Get rid of get_all_endpoints_count
It is now only a wrapper for count_normal_token_owners.

Refs #8534
2021-05-06 15:36:20 +08:00
Asias He
ddeabba6aa token_metadata: Fix get_all_endpoints to return nodes in the ring
The get_all_endpoints() should return the nodes that are part of the ring.

A node's presence in _endpoint_to_host_id_map does not guarantee
that it is part of the ring.

To fix, return from _token_to_endpoint_map.

Fixes #8534
2021-05-06 10:02:11 +08:00
Avi Kivity
e9802348b5 storage_proxy, treewide: use utils::small_vector inet_address_vector:s
Replace std::vector<inet_address> with a small_vector of size 3 for
replica sets (reflecting the common case of local reads, and the somewhat
less common case of single-datacenter writes). Vectors used to
describe topology changes are of size 1, reflecting that up to one
node is usually involved with topology changes. At those counts and
below we save an allocation; above those counts everything still works,
but small_vector allocates like std::vector.

In a few places we need to convert between std::vector and the new types,
but these are all out of the hot paths (or are in a hot path, but behind a
cache).
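The allocation trade-off described above can be sketched with a minimal small-vector (an illustrative toy, not ScyllaDB's `utils::small_vector`): up to `N` elements live inline with no allocation; beyond that it degrades to heap storage like `std::vector`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy small-vector: stores up to N default-constructible elements inline;
// falls back to heap allocation once the size exceeds N.
template <typename T, std::size_t N>
class small_vector {
    std::array<T, N> _inline{};
    std::vector<T> _heap;     // used only once size() > N
    std::size_t _size = 0;
public:
    void push_back(const T& v) {
        if (_size < N) {
            _inline[_size] = v;              // inline path: no allocation
        } else {
            if (_heap.empty()) {
                // first spill: move the inline elements to the heap
                _heap.assign(_inline.begin(), _inline.end());
            }
            _heap.push_back(v);
        }
        ++_size;
    }
    const T& operator[](std::size_t i) const {
        return _heap.empty() ? _inline[i] : _heap[i];
    }
    std::size_t size() const { return _size; }
    bool is_inline() const { return _heap.empty(); }
};
```

With `N = 3` this models the replica-set case above: local reads and single-DC writes stay inline, larger replica sets still work but allocate.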
2021-05-05 18:36:54 +03:00
Avi Kivity
cea5493cb7 storage_proxy, treewide: introduce names for vectors of inet_address
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
2021-05-05 18:36:48 +03:00
Juliusz Stasiewicz
b6fb5ee912 locator: Check DC names in NTS
The same trick is used as in C*:
79e693e16e/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java (L241)

Fixes #7595
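The check can be sketched roughly like this (hypothetical names and plain containers; the real code follows the linked Cassandra logic): replication options naming a datacenter unknown to the topology are rejected.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative sketch, not the actual NTS code: validate that every DC
// named in the strategy options is a datacenter the snitch knows about.
static void check_dc_names(const std::vector<std::string>& option_dcs,
                           const std::set<std::string>& known_dcs) {
    for (const auto& dc : option_dcs) {
        if (known_dcs.find(dc) == known_dcs.end()) {
            throw std::runtime_error("Unrecognized datacenter name: " + dc);
        }
    }
}
```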
2021-02-09 07:04:17 +01:00
Pavel Emelyanov
d3ee8774ad storage-service: Subscribe to snitch to update topology
Currently snitch explicitly calls storage service (if
it's initialized) to update topology on snitch data
change.

Instead of that -- make the storage service subscribe to the
snitch reconfigure signal upon creation.

This finally makes snitch fully independent from storage
service.

In tests the snitch instance is not created, so check
for it before subscribing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
d1a2d0f894 snitch: Introduce reconfiguration signal
Add a notifier to snitch_base, to which others may subscribe, that
gets triggered when the snitch configuration changes.

For now only the gossiping-file-snitch triggers it when it
re-reads its config file. Other existing snitches are kinda
static in this sense.

The subscribe-trigger engine is based on scoped connection
from boost::signals2.
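The subscribe-trigger engine can be sketched without boost (a toy model: `snitch_signal` stands in for the `boost::signals2` signal, and the scoped-connection lifetime management is omitted):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Minimal sketch of the notifier: subscribers register callbacks that fire
// whenever the snitch re-reads its configuration.
class snitch_signal {
    std::vector<std::function<void()>> _subscribers;
public:
    void subscribe(std::function<void()> cb) {
        _subscribers.push_back(std::move(cb));
    }
    void reconfigured() {              // invoked after the config is re-read
        for (auto& cb : _subscribers) {
            cb();
        }
    }
};
```

In the real code a `boost::signals2::scoped_connection` additionally unsubscribes the listener automatically when it goes out of scope.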

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
ca336409d7 snitch: Always gossip snitch info itself
The gossiping_property_file_snitch updates the gossip RACK and DC
values upon config change. Right now this is done with the help
of storage service, but the needed code to gossip rack and dc is
already available in the snitch itself.

That said -- gossip the snitch info via the snitch's own helper and
remove the storage_service's one. This is the 2nd step in decoupling
the snitch and the storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
99e71bd1f6 snitch: Do gossip DC and RACK itself
This is the 2nd step in generalizing the snitch data gossiping
and at the same time the 1st step in decoupling the storage service
and the snitch.

During start, the storage service starts the gossiper, which notifies
the snitch with a .gossiper_starting() call, then the storage service
calls gossip_snitch_info.

This patch makes snitch itself do the last step.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Pavel Emelyanov
bc1a3a358d snitch: Add generic gossiping helper
Nowadays some snitch implementations gossip the INTERNAL_IP
value and storage_service gossips RACK and DC for all of them.

This functionality is going to be generalized and the first
step is in making a common method for a snitch to gossip its
data.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-01-13 16:41:34 +03:00
Benny Halevy
322aa2f8b5 token_metadata: add clear_gently
clear_gently gently clears the token_metadata members.
It uses continuations to allow yielding if needed
to prevent reactor stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 11:22:21 +02:00
Benny Halevy
56aa49ca81 token_metadata: shared_token_metadata: add mutate_token_metadata
mutate_token_metadata acquires the shared_token_metadata lock,
clones the token_metadata (using clone_async)
and calls an asynchronous functor on
the cloned copy of the token_metadata to mutate it.

If the functor is successful, the mutated clone
is set back to the shared_token_metadata;
otherwise, the clone is destroyed.

With that, get rid of shared_token_metadata::clone
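The clone-mutate-publish flow described above can be sketched synchronously (illustrative only: the real `mutate_token_metadata` is asynchronous, clones with `clone_async`, and uses the shared_token_metadata lock rather than a plain mutex):

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <set>

struct token_metadata { std::set<long> tokens; };

class shared_token_metadata {
    token_metadata _tm;
    std::mutex _lock;   // stands in for the shared_token_metadata lock
public:
    // Clone under the lock, mutate the clone, and publish it only if the
    // mutation succeeds; on failure the clone is simply discarded.
    bool mutate(const std::function<bool(token_metadata&)>& func) {
        std::lock_guard<std::mutex> guard(_lock);
        token_metadata clone = _tm;      // clone_async in the real code
        if (!func(clone)) {
            return false;                // clone destroyed, _tm untouched
        }
        _tm = std::move(clone);          // set the mutated clone back
        return true;
    }
    const token_metadata& get() const { return _tm; }
};
```

Readers never observe a half-mutated state: they either see the old metadata or the fully mutated clone.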

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 11:22:19 +02:00
Benny Halevy
e089c22ec1 token_metadata: futurize update_normal_tokens
The function's complexity is O(#tokens) in the worst case,
as for each endpoint token it traverses _token_to_endpoint_map
linearly to erase the endpoint mapping if it exists.

This change renames the current implementation of
update_normal_tokens to update_normal_tokens_sync
and clones the code as a coroutine that returns a future
and may yield if needed.

Eventually we should futurize the whole token_metadata
and abstract_replication_strategy interface and get rid
of the synchronous functions.  Until then the sync
version is still required for call sites that
neither return a future nor run in a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 10:35:15 +02:00
Benny Halevy
e7f4cd89a9 abstract_replication_strategy: get_pending_address_ranges: invoke clone_only_token_map if can_yield
Optimize the can_yield case by invoking the futurized version
of clone_only_token_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 09:49:08 +02:00
Asias He
829b4c1438 repair: Make removenode safe by default
Currently removenode works like below:

- The coordinator node advertises the node to be removed in
  REMOVING_TOKEN status in gossip

- Existing nodes learn the node in REMOVING_TOKEN status

- Existing nodes sync data for the range it owns

- Existing nodes send notification to the coordinator

- The coordinator node waits for notification and announce the node in
  REMOVED_TOKEN

Current problems:

- Existing nodes do not tell the coordinator if the data sync is ok or failed.

- The coordinator can not abort the removenode operation in case of error

- A failed removenode operation leaves the node to be removed in
  REMOVING_TOKEN status forever.

- The removenode runs in best effort mode which may cause data
  consistency issues.

  It means if a node that owns the range after the removenode
  operation is down during the operation, the removenode node operation
  will continue to succeed without requiring that node to perform data
  syncing. This can cause data consistency issues.

  For example: five nodes in the cluster, RF = 3; for a given range,
  n1, n2, n3 are the old replicas and n2 is being removed, so after
  the removenode operation the new replicas are n1, n5, n3. If n3 is
  down during the removenode operation, only n1 will be used to sync
  data with the new owner n5. This will break QUORUM read consistency
  if n1 happens to miss some writes.

Improvements in this patch:

- This patch makes the removenode safe by default.

We require all nodes in the cluster to participate in the removenode operation and
sync data if needed. We fail the removenode operation if any of them is down or
fails.

If the user wants the removenode operation to succeed even if some of the nodes
are not available, the user has to explicitly pass a list of nodes that can be
skipped for the operation.

$ nodetool removenode --ignore-dead-nodes <list_of_dead_nodes_to_ignore> <host_id>

Example restful api:

$ curl -X POST "http://127.0.0.1:10000/storage_service/remove_node/?host_id=7bd303e9-4c7b-4915-84f6-343d0dbd9a49&ignore_nodes=127.0.0.3,127.0.0.5"

- The coordinator can abort data sync on existing nodes

For example, if one of the nodes fails to sync data, it makes no sense
for the other nodes to continue to sync data because the whole
operation will fail anyway.

- The coordinator can decide which nodes to ignore and pass the decision
  to other nodes

Previously, there was no way for the coordinator to tell existing nodes
to run in strict mode or best effort mode. Users would have to modify
the config file or run a restful api cmd on all the nodes to select
strict or best effort mode. With this patch, that cluster wide
configuration is eliminated.

Fixes #7359

Closes #7626
2020-12-10 10:14:39 +02:00
Benny Halevy
157a964a63 locator: extract can_yield to utils/maybe_yield.hh
Move the definition of bool_class can_yield to a standalone
header file and define there a maybe_yield(can_yield) helper.
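A rough sketch of the extracted helper (the names follow the commit message, but the body is illustrative: the real `maybe_yield` yields to the Seastar reactor, while here a counter stands in for the yield):

```cpp
#include <cassert>
#include <cstddef>

// Strong boolean in the style of seastar::bool_class: can_yield{true} and
// plain `true` are not interchangeable, so call sites stay self-documenting.
template <typename Tag>
class bool_class {
    bool _value;
public:
    constexpr explicit bool_class(bool v) : _value(v) {}
    constexpr explicit operator bool() const { return _value; }
};

struct can_yield_tag {};
using can_yield = bool_class<can_yield_tag>;

static size_t yields = 0;   // test hook standing in for the actual yield

inline void maybe_yield(can_yield cy) {
    if (bool(cy)) {
        ++yields;           // the real helper yields to the reactor here
    }
}
```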

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-24 12:23:56 +02:00
Calle Wilund
9f48dc7dac locator::ec2_multi_region_snitch: Handle ipv6 broadcast/public ip
Fixes #7064

Iff broadcast address is set to ipv6 from main (meaning prefer
ipv6), determine the "public" ipv6 address (which should be
the same, but might not be), via aws metadata query.

Closes #7633
2020-11-18 12:48:25 +02:00
Piotr Jastrzebski
661b52c7df token_metadata: Remove std::iterator from tokens_iterator
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.
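The replacement pattern looks like this (an illustrative iterator, not the actual `tokens_iterator`): instead of inheriting from the deprecated `std::iterator`, the five `iterator_traits` member typedefs are declared directly.

```cpp
#include <cstddef>
#include <iterator>

struct token { long id; };   // illustrative element type

class tokens_iterator {
public:
    // The five typedefs std::iterator used to provide, declared directly.
    using iterator_category = std::forward_iterator_tag;
    using value_type = token;
    using difference_type = std::ptrdiff_t;
    using pointer = const token*;
    using reference = const token&;

    explicit tokens_iterator(const token* p) : _p(p) {}
    reference operator*() const { return *_p; }
    tokens_iterator& operator++() { ++_p; return *this; }
    bool operator!=(const tokens_iterator& o) const { return _p != o._p; }
private:
    const token* _p;
};
```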

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Piotr Jastrzebski
87bf577450 token_metadata: Remove std::iterator from tokens_iterator_impl
std::iterator is deprecated since C++17 so define all the required
iterator_traits directly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-11-17 16:53:20 +01:00
Benny Halevy
275fe30628 token_metadata_impl: set_pending_ranges: add can_yield_param
To prevent a > 10 ms stall when inserting to boost::icl::interval_map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
1e2138e8ef abstract_replication_strategy: get rid of get_ranges_in_thread
Use the can_yield param to get_ranges instead.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
ba31350239 abstract_replication_strategy: add can_yield param to get_pending_ranges and friends
To prevent reactor stalls as seen in #7313.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
6c2a089a6f abstract_replication_strategy: define can_yield bool_class
To be used by convention by several other methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
7fb489d338 token_metadata_impl: calculate_pending_ranges_for_* reindent
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
6ce2436a4c token_metadata_impl: calculate_pending_ranges_for_* pass new_pending_ranges by ref
We can use the seastar thread to keep the vector rather than creating
a lw_shared_ptr for it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
0ca423dcfc token_metadata_impl: calculate_pending_ranges_for_* call in thread
The functions can be simplified as they are all now being called
from a seastar thread.

Make them sequential, returning void, and yielding if necessary.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
84d086dc77 token_metadata: update_pending_ranges: create seastar thread
So we can yield in this path to prevent reactor stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
1e6c181678 abstract_replication_strategy: add get_address_ranges method for specific endpoint
Some of the callers of get_address_ranges are interested in the ranges
of a specific endpoint.

Rather than building a map for all endpoints and then traversing
it looking for this specific endpoint, build a multimap of token ranges
relating only to the specified endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
2ce6773dae token_metadata_impl: clone_after_all_left: sort tokens only once
Currently the sorted tokens are copied needlessly on this path
by `clone_only_token_map` and then recalculated after calling
remove_endpoint for each leaving endpoint.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
0abd8e62cd token_metadata: futurize clone_after_all_left
Call the futurized clone_only_token_map and
remove the _leaving_endpoints from the cloned token_metadata_impl.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
4a622c14e1 token_metadata: futurize clone_only_token_map
Does part of clone_async() using continuations to prevent stalls.

Rename the synchronous variant to clone_only_token_map_sync,
which is going to be deprecated once all its users are futurized.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
d1a73ec7b3 token_metadata: use mutable_token_metadata_ptr in calculate_pending_ranges_for_*
Replacing old code using lw_shared_ptr<token_metadata> with the "modern"
mutable_token_metadata_ptr alias.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
4fc5997949 token_metadata: add clone_async
Clone token_metadata object using async continuation to
prevent reactor stalls.

Refs https://github.com/scylladb/scylla/issues/7220

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
5ab7b0b2ea abstract_replication_strategy: accept a token_metadata_ptr in get_pending_address_ranges methods
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
349aa966ba abstract_replication_strategy: accept a token_metadata_ptr in get_ranges methods
In preparation to returning future<dht::token_range_vector>
from async variants.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
41c7efd0c0 storage_service: convert to token_metadata_ptr
Clone _token_metadata for updating into _updated_token_metadata
and use it to update the local token_metadata on all shards via
do_update_pending_ranges().

Adjust get_token_metadata to return either the updated_token_metadata,
if available, or the base token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
fa880439c9 storage_service: use token_metadata_lock to serialize updates to token_metadata
Rather than using `serialized_action`, grab a lock before mutating
_token_metadata and hold it until it's replicated to all shards.

A following patch will use a mutable token_metadata_ptr
that is updated out of line under the lock.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
6d06853e6c abstract_replication_strategy: convert to shared_token_metadata
To facilitate that, keep a const shared_token_metadata& in class database
rather than a const token_metadata&

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
45fb57a2ec abstract_replication_strategy: pass token_metadata& to get_cached_endpoints
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
ade8c77a7c abstract_replication_strategy: pass token_metadata& to do_get_natural_endpoints
Rather than accessing abstract_replication_strategy::_token_metadata directly.
In preparation to changing it to a shared_token_metadata.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
29ed59f8c4 main: start a shared_token_metadata
And use it to get a token_metadata& compatible
with current usage, until the services are converted to
use token_metadata_ptr.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
0e739aa801 storage_service: add update_topology method
Move the functionality from gossiping_property_file_snitch::reload_configuration
to the storage_service class.

With that we can make get_mutable_token_metadata private.

TODO: update token_metadata on shard 0 and then
replicate_to_all_cores rather than updating on all shards
in parallel.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Benny Halevy
5a250f529f token_metadata: get rid of unused calculate_pending_ranges_for_* methods
They are only called internally by token_metadata_impl.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-30 23:16:23 +03:00
Benny Halevy
41e5a3a245 token_metadata: get rid of clone_after_all_settled
It's unused.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-30 23:15:11 +03:00
Benny Halevy
105a2f5244 token_metadata_impl: remove_endpoint: do not sort tokens
Call sort_tokens at the caller: all call sites within
token_metadata_impl call remove_endpoint for multiple endpoints,
so the tokens can be re-sorted only once, after all endpoints
have been removed.
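The batching pattern can be sketched as follows (illustrative types, not the actual `token_metadata_impl` code): the callee stops sorting, and the caller sorts once after removing every leaving endpoint.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct entry { long token; int endpoint; };   // illustrative ring entry

// Removes one endpoint's tokens without re-sorting (the new callee behavior).
static void remove_endpoint(std::vector<entry>& ring, int endpoint) {
    ring.erase(std::remove_if(ring.begin(), ring.end(),
                   [endpoint](const entry& e) { return e.endpoint == endpoint; }),
               ring.end());
}

// The caller removes all leaving endpoints, then sorts exactly once.
static void remove_endpoints(std::vector<entry>& ring, const std::vector<int>& leaving) {
    for (int ep : leaving) {
        remove_endpoint(ring, ep);
    }
    std::sort(ring.begin(), ring.end(),
              [](const entry& a, const entry& b) { return a.token < b.token; });
}
```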

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-30 23:12:32 +03:00
Benny Halevy
86303f4fdd token_metadata_impl: always sort_tokens in place
No need to return the sorted tokens vector as it's
always assigned to _sorted_tokens.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-09-30 23:08:56 +03:00