Commit Graph

88 Commits

Author SHA1 Message Date
Kamil Braun
30cc07b40d Merge 'Introduce tablets' from Tomasz Grabiec
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from the other
side: divide the resources of each replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.

Things achieved in this PR:

  - You can start a cluster and create a keyspace whose tables will use
    tablet-based replication. This is done by setting `initial_tablets`
    option:

    ```
        CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
                        'replication_factor': 3,
                        'initial_tablets': 8};
    ```

    All tables created in such a keyspace will be tablet-based.

    Tablet-based replication is a trait, not a separate replication
    strategy. Tablets don't change the spirit of the replication strategy;
    they just alter the way in which data ownership is managed. In theory,
    we could use them for other strategies as well, such as
    EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
    is augmented to support tablets.

  - You can create and drop tablet-based tables (no DDL language changes)

  - DML / DQL work with tablet-based tables

    Replicas for tablet-based tables are chosen from tablet metadata
    instead of token metadata

Things which are not yet implemented:

  - handling of views, indexes, CDC created on tablet-based tables
  - sharding is done using the old method, it ignores the shard allocated in tablet metadata
  - node operations (topology changes, repair, rebuild) are not handling tablet-based tables
  - not integrated with compaction groups
  - tablet allocator piggy-backs on tokens to choose replicas.
    Eventually we want to allocate based on current load, not statically

Closes #13387

* github.com:scylladb/scylladb:
  test: topology: Introduce test_tablets.py
  raft: Introduce 'raft_server_force_snapshot' error injection
  locator: network_topology_strategy: Support tablet replication
  service: Introduce tablet_allocator
  locator: Introduce tablet_aware_replication_strategy
  locator: Extract maybe_remove_node_being_replaced()
  dht: token_metadata: Introduce get_my_id()
  migration_manager: Send tablet metadata as part of schema pull
  storage_service: Load tablet metadata when reloading topology state
  storage_service: Load tablet metadata on boot and from group0 changes
  db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
  migration_notifier: Introduce before_drop_keyspace()
  migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
  test: perf: Introduce perf-tablets
  test: Introduce tablets_test
  test: lib: Do not override table id in create_table()
  utils, tablets: Introduce external_memory_usage()
  db: tablets: Add printers
  db: tablets: Add persistence layer
  dht: Use last_token_of_compaction_group() in split_token_range_msb()
  locator: Introduce tablet_metadata
  dht: Introduce first_token()
  dht: Introduce next_token()
  storage_proxy: Improve trace-level logging
  locator: token_metadata: Fix confusing comment on ring_range()
  dht, storage_proxy: Abstract token space splitting
  Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
  db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
  db: Introduce get_non_local_vnode_based_strategy_keyspaces()
  service: storage_proxy: Avoid copying keyspace name in write handler
  locator: Introduce per-table replication strategy
  treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
  locator: Introduce effective_replication_map
  locator: Rename effective_replication_map to vnode_effective_replication_map
  locator: effective_replication_map: Abstract get_pending_endpoints()
  db: Propagate feature_service to abstract_replication_strategy::validate_options()
  db: config: Introduce experimental "TABLETS" feature
  db: Log replication strategy for debugging purposes
  db: Log full exception on error in do_parse_schema_tables()
  db: keyspace: Remove non-const replication strategy getter
  config: Reformat
2023-04-27 09:40:18 +02:00
Kefu Chai
5a11d67709 dht: token: s/tri_compare/operator<=>/
Now that C++20 can generate the defaulted comparison operators for us,
there is no need to define them manually. In addition,
`std::rel_ops::*` are deprecated in C++20.

Also, use `foo <=> bar` instead of `tri_compare(foo, bar)` for better
readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-26 14:09:57 +08:00
Tomasz Grabiec
d3c9ad4ed6 locator: Rename effective_replication_map to vnode_effective_replication_map
In preparation for introducing a more abstract
effective_replication_map which can describe replication maps which
are not based on vnodes.
2023-04-24 10:49:36 +02:00
Benny Halevy
3f1ac846d8 gms: get rid of unused failure_detector
The legacy failure_detector is now unused and can be removed.

TODO: integrate direct_failure_detector with failure_detector api.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-21 09:08:27 +03:00
Benny Halevy
06a0902708 dht/range_streamer: stream_async: move ranges_to_stream to do_streaming
Currently the ranges_to_stream variable lives
on the caller state, and do_streaming() moves its
contents down to request_ranges/transfer_ranges
and then calls clear() to make it ready for reuse.

This works in principle but it makes it harder
for an occasional reader of this code to figure out
what is going on.

This change transfers control of the ranges_to_stream vector
to do_streaming, by calling it with (std::exchange(ranges_to_stream, {})).
With that, the moved vector doesn't need to be cleared by
do_streaming, and the caller is responsible for readying
the variable for reuse in its for loop.
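
The ownership hand-off can be sketched as follows. This is an illustration only: `consume_batch` and `stream_in_batches` are hypothetical names, and plain `int`s stand in for token ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical consumer standing in for do_streaming(): it takes the batch
// by value, so it owns the ranges and the caller has nothing left to clear.
inline std::size_t consume_batch(std::vector<int> batch) {
    return batch.size();
}

// Feeds `input` to consume_batch() in batches of at most `batch_size`
// and returns how many elements were consumed in total.
inline std::size_t stream_in_batches(const std::vector<int>& input, std::size_t batch_size) {
    assert(batch_size > 0);
    std::vector<int> pending;   // plays the role of ranges_to_stream
    std::size_t consumed = 0;
    for (int r : input) {
        pending.push_back(r);
        if (pending.size() == batch_size) {
            // std::exchange moves the contents out and leaves `pending`
            // as a fresh empty vector, ready for the next iteration.
            consumed += consume_batch(std::exchange(pending, {}));
            assert(pending.empty());
        }
    }
    if (!pending.empty()) {
        // Flush the final, possibly shorter, batch.
        consumed += consume_batch(std::exchange(pending, {}));
    }
    return consumed;
}
```

Because the callee receives the vector by value, there is no shared state left for it to reset, which is the readability gain the message describes.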

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-02-28 17:38:34 +02:00
Benny Halevy
775c6b9697 dht/range_streamer: stream_async: do_streaming: move ranges downstream
The ranges can be moved rather than copied to both
`request_ranges` and `transfer_ranges` as they are only cleared
after this point.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-02-28 16:56:55 +02:00
Benny Halevy
3cd8838a09 dht/range_streamer: add_ranges: clear_gently ranges_for_keyspace
After calling get_range_fetch_map, ranges_for_keyspace
is not used anymore.
Synchronously destroying it may potentially stall in large clusters
so use utils::clear_gently to gently clear the map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-02-28 16:52:30 +02:00
Benny Halevy
a80c2d16dd dht/range_streamer: get_range_fetch_map: reduce copies
Use const& to refer to the input ranges and endpoints
rather than copying them individually along the way
more than needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-02-28 16:52:30 +02:00
Benny Halevy
9d6e5d50d1 dht/range_streamer: add_ranges: move ranges down-stream
Eliminate extraneous copy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-02-28 16:52:27 +02:00
Benny Halevy
27b382dcce dht/range_streamer: stream_async: incrementally update _nr_ranges_remaining
Rather than calling nr_ranges_to_stream() inside `do_streaming`,
as nr_ranges_to_stream depends on `_to_stream`, which will be updated
only later on after the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-02-28 16:50:40 +02:00
Benny Halevy
c3c7efffb1 dht/range_streamer: stream_async: erase from range_vec only after do_streaming success
range_vec is used for calculating nr_ranges_to_stream.
Currently, the ranges_to_stream that were
moved out of range_vec are pushed back on exception,
but this isn't safe, since they may have already been moved
to request_ranges or transfer_ranges.

Instead, erase the ranges we pass to do_streaming
only after it succeeds, so that on exception range_vec
will not need adjusting.
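
A simplified, synchronous sketch of the "erase only after success" idea. The names and the `int` element type are assumptions for illustration, and the batch is copied here for simplicity; the real code is asynchronous and may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Streams `range_vec` in batches of at most `batch_size` elements.
// `do_streaming` is a hypothetical callback that takes its batch by value
// and may throw. A batch is erased from `range_vec` only after the callback
// returns, so on exception `range_vec` still holds every unstreamed range.
inline void stream_all(std::vector<int>& range_vec, std::size_t batch_size,
                       const std::function<void(std::vector<int>)>& do_streaming) {
    assert(batch_size > 0);
    while (!range_vec.empty()) {
        const auto n = static_cast<std::ptrdiff_t>(std::min(batch_size, range_vec.size()));
        // Copy the batch; the originals stay in range_vec until success.
        std::vector<int> batch(range_vec.begin(), range_vec.begin() + n);
        do_streaming(std::move(batch));   // may throw: range_vec is untouched
        range_vec.erase(range_vec.begin(), range_vec.begin() + n);
    }
}
```

If the callback throws on some batch, the caller can inspect `range_vec` and find exactly the ranges that were not streamed, with no rollback step.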

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-02-28 16:50:40 +02:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Benny Halevy
912b56ebcf dht: range_streamer: define logger as static
dht::logger can't be global in this case,
as it's too generic, but should be static
to range_streamer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-22 22:54:26 +02:00
Asias He
9ed401c4b2 streaming: Add finished percentage metrics for node ops using streaming
We have added the finished percentage for repair based node operations.

This patch adds the finished percentage for node ops using the old
streaming.

Example output:

scylla_streaming_finished_percentage{ops="bootstrap",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="decommission",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="rebuild",shard="0"} 0.561945
scylla_streaming_finished_percentage{ops="removenode",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="repair",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="replace",shard="0"} 1.000000

In addition to the metrics, a log message showing the percentage is added:

[shard 0] range_streamer - Finished 2698 out of 2817 ranges for rebuild, finished percentage=0.95775646
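
Judging by the sample values, the metric is a fraction in [0, 1] rather than a 0-100 percentage. A sketch of the arithmetic, where the function name and the handling of an empty operation are assumptions rather than the actual implementation:

```cpp
#include <cassert>
#include <cstddef>

// Fraction of ranges already streamed, in [0, 1].
// Assumption: an operation with nothing to stream is reported as finished (1.0).
inline float finished_percentage(std::size_t finished, std::size_t total) {
    assert(finished <= total);
    if (total == 0) {
        return 1.0f;
    }
    return static_cast<float>(finished) / static_cast<float>(total);
}
```

With the numbers from the log line above, `finished_percentage(2698, 2817)` is approximately 0.9578.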

Fixes #11600

Closes #11601
2022-09-22 14:19:34 +03:00
Pavel Emelyanov
b6fdea9a79 code: Call sort_endpoints_by_proximity() via topology
The method is about to be moved from snitch to topology, this patch
prepares the rest of the code to use the latter to call it. The
topology's method just calls snitch, but it's going to change in the
next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:14:01 +03:00
Pavel Emelyanov
4184091f1c snitch, code: Remove get_sorted_list_by_proximity()
There are two sorting methods in snitch -- one sorts the list of
addresses in place, the other one creates a sorted copy of the passed
const list (in fact -- the passed reference is not const, but it's not
modified by the method). However, both callers of the latter create
their own temporary list of addresses anyway, so they don't really benefit
from snitch generating another copy.

So this patch leaves just one sorting method -- the in-place one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:11:37 +03:00
Pavel Emelyanov
6dedc69608 topology: Do not add bootstrapping nodes to topology
Recent change in topology (commit 4cbe6ee9 titled
"topology: Require entry in the map for update_normal_tokens()")
made token_metadata::update_normal_tokens() require the entry's presence
in the embedded topology object. Accordingly, the commit in question
equipped most callers of update_normal_tokens() with a preceding
topology update call to satisfy the requirement.

However, tokens are put into token_metadata not only for normal state,
but also for bootstrapping, and one place that added bootstrapping
tokens erroneously got a topology update. This is wrong -- a node must
not be present in the topology until switching into normal state. As
a result several tests with bootstrapping nodes started to fail.

The fix removes the topology update for bootstrapping nodes, but this
change reveals a few other places that piggy-backed on this mistaken
update, so now _they_ need to update topology themselves.

tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2040/
       update_cluster_layout_tests.py::test_simple_add_new_node_while_schema_changes_with_repair
       update_cluster_layout_tests.py::test_simple_kill_new_node_while_bootstrapping_with_parallel_writes_in_multidc
       repair_based_node_operations_test.py::test_lcs_reshape_efficiency

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220902082753.17827-1-xemul@scylladb.com>
2022-09-04 13:53:38 +03:00
Benny Halevy
91ab8ee1c3 effective_replication_map: make get_range_addresses asynchronous
So it may yield, preventing reactor stalls as seen in #11005.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Benny Halevy
9b2af3f542 range_streamer: add_ranges and friends: get erm as param
Rather than getting it in the callee, let the caller
(e.g.  storage_service)
hold the erm and pass it down to potentially multiple
async functions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 17:31:01 +03:00
Pavel Emelyanov
5e2fa32c8c range_streamer: Get rack/datacenter from topology
It's needed in source filter classes so range-streamer passes the
topology reference into its methods.

Nice side effect -- snitch header goes away from range-streamer one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-22 11:47:26 +03:00
Asias He
1f8b529e08 range_streamer: Disable restream logic
Consider:
- n1 and n2 in the cluster
- n3 bootstraps to join
- n1 does not hear gossip update from n3 due to network issue
- n1 removes n3 from gossip and pending node list
- stream between n1 and n3 fails
- n1 and n3 network issue is fixed
- n3 retries the stream with n1
- n3 finishes the stream with n1
- n3 advertises normal to join the cluster

The problem is that n1 will not treat n3 as the pending node so writes
will not route to n3 once n1 removes n3.

Another problem is that when n1 gets the normal gossip status update from
n3, the gossip listener will fail because n1 has removed n3, so n1 cannot
find the host id for n3. This will cause n1 to abort.

To fix, disable the retry logic in range_streamer so that once a stream
with an existing node fails, the bootstrap fails.

The downside is that we lose the ability to restream after a temporary
network issue, but since we have repair based node operations, we can use
them to resume previously failed node operations.

Fixes: #9805

Closes #9806
2022-05-24 11:24:25 +03:00
Pavel Emelyanov
469ded71a9 bootstrapper: Get 'is-replacing' via argument too
This also removes the only usage of this helper outside of the storage
service. The place that needs it is the use_strict_sources_for_ranges()
checker, and all of its callers are aware of whether a replace is
happening or not.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-07 12:41:02 +03:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Pavel Emelyanov
831f18e392 dht: Pass gossiper to range_streamer::add_ranges
A continuation of the previous patch. The range_streamer needs
gossiper too, and is called from boot_strapper and storage_service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-25 10:54:16 +03:00
Pavel Emelyanov
3087422d4d stream_plan: Keep stream_manager onboard
The plan itself doesn't need it, but it creates some lower level
objects that do. Next patches will use this reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Pavel Emelyanov
c593f8624d dht: Keep stream_manager on board
This is the preparation for the future patching. The stream_plan
creation will need the manager reference, so keep one on dht
object in advance. These are only created from the storage service
bootstrap code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-11-24 12:17:37 +03:00
Benny Halevy
17296cba4b effective_replication_map: add get_range_addresses
Equivalent to abstract_replication_strategy get_range_addresses,
yet synchronous, as it uses the precalculated map.

Call it from storage_service::get_new_source_ranges
and range_streamer::get_all_ranges_with_sources_for.

Consequently, get_new_source_ranges and removenode_add_ranges
can become synchronous too.

Unfortunately we can't entirely get rid of
abstract_replication_strategy::get_range_addresses
as it's still needed by
range_streamer::get_all_ranges_with_strict_sources_for.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
4d2561ff75 abstract_replication_strategy: precacluate get_replication_factor for effective_replication_map
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Benny Halevy
91581ba23a abstract_replication_strategy: futurize get_range_addresses
All remaining use sites are called in a seastar thread
so we drop the can_yield param and make get_range_addresses
always async.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 16:10:06 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day
2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
e0749d6264 treewide: some random header cleanups
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-06-06 19:18:49 +03:00
Nadav Har'El
fb0c4e469a Merge 'token_metadata: Fix get_all_endpoints to return nodes in the ring' from Asias He
The get_all_endpoints() should return the nodes that are part of the ring.

A node inside _endpoint_to_host_id_map does not guarantee that the node
is part of the ring.

To fix, return from _token_to_endpoint_map.

Fixes #8534

Closes #8536

* github.com:scylladb/scylla:
  token_metadata: Get rid of get_all_endpoints_count
  range_streamer: Handle everywhere_topology
  range_streamer: Adjust use_strict_sources_for_ranges
  token_metadata: Fix get_all_endpoints to return nodes in the ring
2021-05-11 18:39:10 +03:00
Asias He
4793894fac range_streamer: Handle everywhere_topology
The everywhere_topology returns the number of nodes in the cluster as
RF. This makes streaming only from the node losing the range impossible,
since no node is losing the range after bootstrap.

As a shortcut, do not use strict sources if the keyspace uses
everywhere_topology.

Refs #8503
2021-05-06 10:02:11 +08:00
Asias He
1b7414860b range_streamer: Adjust use_strict_sources_for_ranges
Now that get_all_endpoints() returns only the nodes in the ring, we
need to adjust the check for whether to use strict sources.

Use strict sources when the number of nodes in the ring is equal to or
greater than RF.

Refs #8534
2021-05-06 10:02:11 +08:00
Avi Kivity
cea5493cb7 storage_proxy, treewide: introduce names for vectors of inet_address
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
2021-05-05 18:36:48 +03:00
Pavel Emelyanov
5ecbc33be5 database.*: Remove unused headers
The database.hh is the central recursive-headers knot -- it has ~50
includes. This patch leaves only 34 (it remains the champion though).
Similar thing for database.cc.
Both changes help the latter compile ~4% faster :)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210414183107.30374-1-xemul@scylladb.com>
2021-04-18 14:03:17 +03:00
Pavel Emelyanov
ffc9cc9aec range-streamer: Remove global storage service reference
The reference is used by range streamer and (!) storage
service itself to find out if the consistent_rangemovement
option is ON/OFF.

Both places already have the database with config at hand
and can be simplified.

v2: spellchecking

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210212095403.22662-1-xemul@scylladb.com>
2021-02-12 15:50:30 +01:00
Benny Halevy
322aa2f8b5 token_metadata: add clear_gently
clear_gently gently clears the token_metadata members.
It uses continuations to allow yielding if needed
to prevent reactor stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 11:22:21 +02:00
Benny Halevy
e089c22ec1 token_metdata: futurize update_normal_tokens
The function complexity is O(#tokens) in the worst case,
as for each endpoint token it traverses _token_to_endpoint_map
linearly to erase the endpoint mapping if it exists.

This change renames the current implementation of
update_normal_tokens to update_normal_tokens_sync
and clones the code as a coroutine that returns a future
and may yield if needed.

Eventually we should futurize the whole token_metadata
and abstract_replication_strategy interface and get rid
of the synchronous functions.  Until then the sync
version is still required from call sites that
are neither returning a future nor run in a seastar thread.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-12-22 10:35:15 +02:00
Benny Halevy
157a964a63 locator: extract can_yield to utils/maybe_yield.hh
Move the definition of bool_class can_yield to a standalone
header file and define there a maybe_yield(can_yield) helper.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-24 12:23:56 +02:00
Benny Halevy
ba31350239 abstract_replication_strategy: add can_yield param to get_pending_ranges and friends
To prevent reactor stalls as seen in #7313.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
4a622c14e1 token_metadata: futurize clone_only_token_map
Does part of clone_async() using continuations to prevent stalls.

Rename the synchronous variant to clone_only_token_map_sync;
it is going to be deprecated once all its users are futurized.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:24 +02:00
Benny Halevy
63137b35ea range_streamer: convert to token_metadata_ptr
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2020-11-11 14:20:23 +02:00
Piotr Jastrzebski
c001374636 codebase wide: replace count with contains
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the `count` function was often used in various ways.

`contains` not only expresses the intent of the code better but also
does it in a more unified way.

This commit replaces all the occurrences of `count` with
`contains`.

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
2020-08-15 20:26:02 +03:00
Piotr Jastrzebski
52ec0c683e codebase wide: replace erase + remove_if with erase_if
C++20 introduced std::erase_if which simplifies removal of elements
from the collection. Previously the code pattern looked like:

<collection>.erase(
        std::remove_if(<collection>.begin(), <collection>.end(), <predicate>),
        <collection>.end());

In C++20 the same can be expressed with:

std::erase_if(<collection>, <predicate>);

This commit replaces all the occurrences of the old pattern with the new
approach.

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ffcace5cce79793ca6bd65c61dc86e6297233fd.1597064990.git.piotr@scylladb.com>
2020-08-10 18:17:38 +03:00
Piotr Sarna
bd2d48e99c streaming: make stream_plan::abort noexcept
Aborting a stream plan is used in deinitialization code
run in a noexcept environment, so it should be noexcept itself.
Tested on a not-merged-yet Seastar patch with hardened noexcept
checks for abort_source.
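
To illustrate why this matters, here is a minimal sketch with a hypothetical `plan_sketch` class. It is not the real `streaming::stream_plan` and does not use Seastar's `abort_source`.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a stream plan; not the real streaming::stream_plan.
class plan_sketch {
    bool _aborted = false;
public:
    // noexcept: safe to call from destructors and other noexcept cleanup paths.
    void abort() noexcept { _aborted = true; }
    bool aborted() const noexcept { return _aborted; }
};

// Compile-time check that abort() really is declared noexcept.
static_assert(noexcept(std::declval<plan_sketch&>().abort()), "abort() must not throw");

// Cleanup helper that is itself noexcept; if a callee threw here and the
// exception escaped, the program would end in std::terminate.
inline void shutdown(plan_sketch& p) noexcept {
    p.abort();
    assert(p.aborted());
}
```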

Message-Id: <6eada033bb394d725b83a7e0f92381cb792ef6a1.1596446857.git.sarna@scylladb.com>
2020-08-03 14:00:19 +03:00
Asias He
81f0260816 range_streamer: Handle table of RF 1 in get_range_fetch_map
After "Make replacing node take writes" series, with repair based node
operations disabled, we saw the replace operation fail like:

```
[shard 0] init - Startup failed: std::runtime_error (unable to find
sufficient sources for streaming range (9203926935651910749, +inf) in
keyspace system_auth)
```
The reason is the system_auth keyspace has default RF of 1. It is
impossible to find a source node to stream from for the ranges owned by
the replaced node.

In the past, the replace operation with keyspace of RF 1 passed, because
the replacing node calls token_metadata.update_normal_tokens(tokens,
ip_of_replacing_node) before streaming. We saw:

```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9021954492552185543, -9016289150131785593] exists on {127.0.0.6}
```

Node 127.0.0.6 is the replacing node 127.0.0.5. The source node check in
range_streamer::get_range_fetch_map will pass if the source is the node
itself. However, it will not stream from the node itself. As a result,
the system_auth keyspace will not get any data.

After the "Make replacing node take writes" series, the replacing node
calls token_metadata.update_normal_tokens(tokens, ip_of_replacing_node)
after the streaming finishes. We saw:

```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9049647518073030406, -9048297455405660225] exists on {127.0.0.5}
```

Since 127.0.0.5 was dead, the source node check failed, and so did the
bootstrap operation.

To fix, we ignore the keyspace of RF 1 when it is unable to find a source
node to stream.

Fixes #6351
2020-05-22 09:30:52 +08:00
Pavel Emelyanov
7bc34c17eb range-streamer: Tune the progress message
Now it will show the full info about the range being streamed, like

range_streamer - Rebuild with 127.0.0.2 for keyspace=ks2, streaming [72, 96) out of 248 ranges

The [x, y) range is a semi-open one, so the full streaming progress
can then be logged like

... streaming [0, 16) out of 36 ranges   <- first send
... streaming [16, 24) out of 36 ranges
... streaming [24, 36) out of 36 ranges  <- last send
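
A sketch of how such semi-open progress lines can be produced. `progress_lines` and its `batch_sizes` input are hypothetical; the real code logs from inside the streaming loop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds one progress line per batch, in the "[begin, end) out of N ranges" form.
// `batch_sizes` is a hypothetical input: how many ranges each send carries.
inline std::vector<std::string> progress_lines(const std::vector<std::size_t>& batch_sizes) {
    std::size_t total = 0;
    for (std::size_t n : batch_sizes) {
        total += n;
    }
    std::vector<std::string> lines;
    std::size_t nr_ranges_streamed = 0;
    for (std::size_t n : batch_sizes) {
        // Format the message first, then advance the counter, so that `begin`
        // is the number of ranges streamed before this batch.
        lines.push_back("streaming [" + std::to_string(nr_ranges_streamed) + ", " +
                        std::to_string(nr_ranges_streamed + n) + ") out of " +
                        std::to_string(total) + " ranges");
        nr_ranges_streamed += n;
    }
    assert(nr_ranges_streamed == total);
    return lines;
}
```

Calling `progress_lines({16, 8, 12})` reproduces the three example lines above; each `end` equals the next line's `begin`, so consecutive intervals neither overlap nor leave gaps.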

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200304101505.5506-1-xemul@scylladb.com>
2020-03-05 12:56:29 +01:00
Pavel Emelyanov
f4e789a9c2 range_streamer: Fix off-by-size in stream progress log
The nr_ranges_streamed denotes the number of ranges streamed
so far, but by the time the sending lambda is called this
counter is already incremented by the number of ranges to be
streamed in this call. And the variable is not used for
anything else but logging.

Fix this by swapping logging with incrementing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200221101601.18779-1-xemul@scylladb.com>
2020-02-23 11:20:17 +02:00