This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, an alternative to the
current vnode-based replication. The vnode-based replication strategy
tries to evenly distribute the global token space shared by all tables
among nodes and shards. With tablets, the aim is to approach the problem
from the other direction: divide the resources of each replica shard
into tablets, with the goal of a fixed target tablet size, and then
assign those tablets to serve fragments of tables (also called tablets).
This will allow us to balance the load in a more flexible manner, by
moving individual tablets around. Also, unlike vnode ranges, tablet
replicas live on a particular shard on a given node, which will allow us
to bind raft groups to tablets. Those goals are not yet achieved with
this PR, but it lays the groundwork for them.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of the replication strategy;
they only alter the way in which data ownership is managed. In theory,
the trait could be applied to other strategies as well, e.g.
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables; replicas for tablet-based
  tables are chosen from tablet metadata instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method; it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) do not handle tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes #13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
Now that C++20 is able to generate the defaulted comparison
operators for us, there is no need to define them manually. Besides,
`std::rel_ops::*` is deprecated in C++20.
Also, use `foo <=> bar` instead of `tri_compare(foo, bar)` for better
readability.
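A minimal sketch of the pattern this change applies, using a hypothetical
type for illustration:
```cpp
#include <compare>
#include <cstdint>

struct token {
    int64_t value;
    // Defaulting operator<=> also gives a defaulted operator==, so all six
    // comparison operators (==, !=, <, <=, >, >=) come for free.
    auto operator<=>(const token&) const = default;
};

static_assert(token{1} < token{2});  // uses the synthesized operator<
```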
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
The legacy failure_detector is now unused and can be removed.
TODO: integrate direct_failure_detector with the failure_detector API.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the ranges_to_stream variable lives
on the caller's state, and do_streaming() moves its
contents down to request_ranges/transfer_ranges
and then calls clear() to make it ready for reuse.
This works in principle, but it makes it harder
for an occasional reader of this code to figure out
what is going on.
This change transfers ownership of the ranges_to_stream vector
to do_streaming, by calling it with std::exchange(ranges_to_stream, {}).
With that, the moved vector doesn't need to be cleared by
do_streaming, and the caller is responsible for readying
the variable for reuse in its for loop.
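A self-contained sketch of this hand-off pattern (simplified types,
hypothetical bodies): std::exchange surrenders the caller's vector and
leaves an empty one behind, ready for reuse in the next iteration.
```cpp
#include <utility>
#include <vector>

// Takes ownership of the ranges; neither side needs to call clear().
void do_streaming(std::vector<int> ranges_to_stream) {
    // ... stream the ranges ...
}

int main() {
    std::vector<int> ranges_to_stream;
    for (int i = 0; i < 3; ++i) {
        ranges_to_stream.push_back(i);
        // Hand the vector off; ranges_to_stream is left empty afterwards.
        do_streaming(std::exchange(ranges_to_stream, {}));
    }
}
```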
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The ranges can be moved rather than copied to both
`request_ranges` and `transfer_ranges` as they are only cleared
after this point.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After calling get_range_fetch_map, ranges_for_keyspace
is not used anymore.
Synchronously destroying it may potentially stall in large clusters,
so use utils::clear_gently to gently clear the map.
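A minimal sketch, assuming utils::clear_gently() from utils/stall_free.hh,
which destroys container elements incrementally and yields to the reactor
between them (map type assumed for illustration):
```cpp
#include "utils/stall_free.hh"

future<> consume_ranges(std::unordered_map<sstring, dht::token_range_vector> ranges_for_keyspace) {
    // ... build the range fetch map from ranges_for_keyspace here ...
    // Clear the map gently instead of destroying it synchronously,
    // which could stall the reactor in large clusters.
    co_await utils::clear_gently(ranges_for_keyspace);
}
```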
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use const& to refer to the input ranges and endpoints
rather than copying them individually along the way
more than needed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than calling nr_ranges_to_stream() inside `do_streaming`,
since nr_ranges_to_stream() depends on `_to_stream`, which will be
updated only later on, after the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
range_vec is used for calculating nr_ranges_to_stream.
Currently, the ranges_to_stream that were
moved out of range_vec are pushed back on exception,
but this isn't safe, since they may have already been moved
to request_ranges or transfer_ranges.
Instead, erase the ranges we pass to do_streaming
only after it succeeds, so that on exception, range_vec
will not need adjusting.
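A hypothetical sketch of the resulting ordering (names and types assumed):
```cpp
// Inside the streaming loop: stream the first n ranges, but trim
// range_vec only after do_streaming() succeeds, so that an exception
// leaves range_vec unchanged.
auto batch = dht::token_range_vector(range_vec.begin(), range_vec.begin() + n);
co_await do_streaming(std::move(batch));                    // may throw
range_vec.erase(range_vec.begin(), range_vec.begin() + n);  // success only
```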
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
These warnings are found by Clang 17 after removing
`-Wno-unused-lambda-capture` and `-Wno-unused-variable` from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
dht::logger can't be global in this case,
as it's too generic, but should be static
to range_streamer.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We have added the finished percentage for repair-based node operations.
This patch adds the finished percentage for node ops using the old
streaming.
Example output:
scylla_streaming_finished_percentage{ops="bootstrap",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="decommission",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="rebuild",shard="0"} 0.561945
scylla_streaming_finished_percentage{ops="removenode",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="repair",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="replace",shard="0"} 1.000000
In addition to the metrics, the log now shows the percentage:
[shard 0] range_streamer - Finished 2698 out of 2817 ranges for rebuild, finished percentage=0.95775646
Fixes #11600
Closes #11601
The method is about to be moved from snitch to topology, this patch
prepares the rest of the code to use the latter to call it. The
topology's method just calls snitch, but it's going to change in the
next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two sorting methods in snitch -- one sorts the list of
addresses in place, the other creates a sorted copy of the passed
const list (in fact, the passed reference is not const, but it's not
modified by the method). However, both callers of the latter create
their own temporary list of addresses anyway, so they don't really
benefit from the snitch generating another copy.
So this patch leaves just one sorting method -- the in-place one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A recent change in topology (commit 4cbe6ee9 titled
"topology: Require entry in the map for update_normal_tokens()")
made token_metadata::update_normal_tokens() require the entry's presence
in the embedded topology object. Accordingly, the commit in question
equipped most callers of update_normal_tokens() with a preceding
topology update call to satisfy the requirement.
However, tokens are put into token_metadata not only for the normal state,
but also for bootstrapping, and one place that added bootstrapping
tokens erroneously got a topology update. This is wrong -- a node must
not be present in the topology until switching into the normal state. As
a result, several tests with bootstrapping nodes started to fail.
The fix removes the topology update for bootstrapping nodes, but this
change reveals a few other places that piggy-backed on this mistaken
update, so now _they_ need to update the topology themselves.
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2040/
update_cluster_layout_tests.py::test_simple_add_new_node_while_schema_changes_with_repair
update_cluster_layout_tests.py::test_simple_kill_new_node_while_bootstrapping_with_parallel_writes_in_multidc
repair_based_node_operations_test.py::test_lcs_reshape_efficiency
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220902082753.17827-1-xemul@scylladb.com>
Rather than getting it in the callee, let the caller
(e.g. storage_service)
hold the erm and pass it down to potentially multiple
async functions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It's needed in source filter classes, so range-streamer passes the
topology reference into its methods.
A nice side effect -- the snitch header goes away from the range-streamer one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Consider:
- n1 and n2 in the cluster
- n3 bootstraps to join
- n1 does not hear the gossip update from n3 due to a network issue
- n1 removes n3 from gossip and the pending node list
- the stream between n1 and n3 fails
- the network issue between n1 and n3 is fixed
- n3 retries the stream with n1
- n3 finishes the stream with n1
- n3 advertises normal to join the cluster
The problem is that n1 will not treat n3 as a pending node, so writes
will not be routed to n3 once n1 removes it.
Another problem is that when n1 gets the normal gossip status update
from n3, the gossip listener will fail because n1 has removed n3, so n1
cannot find the host id for n3. This will cause n1 to abort.
To fix, disable the retry logic in range_streamer, so that once a stream
with an existing node fails, the bootstrap fails.
The downside is that we lose the ability to restream after a temporary
network issue, but since we have repair-based node operations, we can
use them to resume previously failed node operations.
Fixes: #9805
Closes #9806
This also removes the only usage of this helper outside of the storage
service. The place that needs it is the use_strict_sources_for_ranges()
checker, and all its callers are aware of whether a replace is
happening or not.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except for
licenses/README.md.
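For illustration, a dual-licensed file header then reduces to a single
identifier line along these lines (sketch; exact wording follows the
expression chosen above):
```cpp
/*
 * SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
 */
```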
Closes #9937
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
A continuation of the previous patch. The range_streamer needs
gossiper too, and is called from boot_strapper and storage_service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The plan itself doesn't need it, but it creates some lower level
objects that do. Next patches will use this reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is preparation for future patches. The stream_plan
creation will need the manager reference, so keep one on the dht
object in advance. These are only created from the storage service
bootstrap code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Equivalent to abstract_replication_strategy::get_range_addresses(),
yet synchronous, as it uses the precalculated map.
Call it from storage_service::get_new_source_ranges
and range_streamer::get_all_ranges_with_sources_for.
Consequently, get_new_source_ranges and removenode_add_ranges
can become synchronous too.
Unfortunately we can't entirely get rid of
abstract_replication_strategy::get_range_addresses
as it's still needed by
range_streamer::get_all_ranges_with_strict_sources_for.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All remaining use sites are called in a seastar thread
so we drop the can_yield param and make get_range_addresses
always async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Eliminate unused includes and replace some more includes
with forward declarations where appropriate.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
get_all_endpoints() should return the nodes that are part of the ring.
A node's presence in _endpoint_to_host_id_map does not guarantee that
the node is part of the ring.
To fix, return the nodes from _token_to_endpoint_map.
Fixes #8534
Closes #8536
* github.com:scylladb/scylla:
token_metadata: Get rid of get_all_endpoints_count
range_streamer: Handle everywhere_topology
range_streamer: Adjust use_strict_sources_for_ranges
token_metadata: Fix get_all_endpoints to return nodes in the ring
The everywhere_topology returns the number of nodes in the cluster as
the RF. This makes streaming only from the node losing the range
impossible, since no node loses a range after bootstrap.
Shortcut it so that strict sources are not used in case the keyspace
uses everywhere_topology.
Refs #8503
Now that get_all_endpoints() returns the nodes in the ring, we
need to adjust the check for using strict sources.
Use strict sources when the number of nodes in the ring is equal to or
greater than the RF.
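Expressed as a condition (hypothetical names, folding in the
everywhere_topology shortcut from the previous patch):
```cpp
// Strict sources only make sense when some node actually loses the range.
bool use_strict = !is_everywhere_topology
               && nodes_in_ring >= replication_factor;
```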
Refs #8534
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
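A sketch of the aliasing step (alias names assumed for illustration):
introduce the names now, so the later switch of container touches a
single line.
```cpp
#include <vector>

// Hypothetical alias; the underlying type is unchanged for now.
using inet_address_vector_replica_set = std::vector<gms::inet_address>;

// A follow-up patch can then swap the container without touching callers:
// using inet_address_vector_replica_set =
//         utils::small_vector<gms::inet_address, 3>;
```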
database.hh is the central knot of recursive headers -- it has ~50
includes. This patch leaves only 34 (it remains the champion, though).
A similar thing is done for database.cc.
Both changes help the latter compile ~4% faster :)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210414183107.30374-1-xemul@scylladb.com>
The reference is used by the range streamer and (!) the storage
service itself to find out if the consistent_rangemovement
option is ON/OFF.
Both places already have the database with the config at hand
and can be simplified.
v2: spellchecking
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210212095403.22662-1-xemul@scylladb.com>
clear_gently gently clears the token_metadata members.
It uses continuations to allow yielding if needed
to prevent reactor stalls.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The function's complexity is O(#tokens) in the worst case,
as for each endpoint token it traverses _token_to_endpoint_map
linearly to erase the endpoint mapping if it exists.
This change renames the current implementation of
update_normal_tokens to update_normal_tokens_sync
and clones the code as a coroutine that returns a future
and may yield if needed.
Eventually we should futurize the whole token_metadata
and abstract_replication_strategy interface and get rid
of the synchronous functions. Until then, the sync
version is still required by call sites that
neither return a future nor run in a seastar thread.
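A hypothetical sketch of the coroutine's shape (member and helper names
assumed, not the actual implementation):
```cpp
future<> token_metadata::update_normal_tokens(std::unordered_set<token> tokens, inet_address endpoint) {
    // Erase the endpoint's previous mappings, yielding between iterations
    // so that large token sets do not stall the reactor.
    for (auto it = _token_to_endpoint_map.begin(); it != _token_to_endpoint_map.end();) {
        it = (it->second == endpoint) ? _token_to_endpoint_map.erase(it) : std::next(it);
        co_await coroutine::maybe_yield();
    }
    for (const auto& t : tokens) {
        _token_to_endpoint_map[t] = endpoint;
        co_await coroutine::maybe_yield();
    }
}
```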
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the definition of bool_class can_yield to a standalone
header file and define there a maybe_yield(can_yield) helper.
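A minimal sketch of such a helper, assuming seastar's bool_class and a
seastar thread context (exact header layout assumed):
```cpp
#include <seastar/core/thread.hh>
#include <seastar/util/bool_class.hh>

struct can_yield_tag {};
using can_yield = seastar::bool_class<can_yield_tag>;

// Yield to the reactor only when the caller allows it.
inline void maybe_yield(can_yield yield) {
    if (yield) {
        seastar::thread::yield();
    }
}
```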
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Does part of clone_async() using continuations to prevent stalls.
Rename the synchronous variant to clone_only_token_map_sync,
which is going to be deprecated once all its users are futurized.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously,
the `count` function was often used in various ways.
`contains` not only expresses the intent of the code better but also
does so in a more unified way.
This commit replaces all the occurrences of `count` with
`contains`.
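A runnable before/after illustration:
```cpp
#include <cassert>
#include <set>

int main() {
    std::set<int> s{1, 2, 3};
    assert(s.count(2) > 0); // old idiom: count() used as a membership test
    assert(s.contains(2));  // C++20: states the intent directly
}
```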
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
C++20 introduced std::erase_if, which simplifies removal of elements
from a collection. Previously the code pattern looked like:
<collection>.erase(
std::remove_if(<collection>.begin(), <collection>.end(), <predicate>),
<collection>.end());
In C++20 the same can be expressed with:
std::erase_if(<collection>, <predicate>);
This commit replaces all the occurrences of the old pattern with the new
approach.
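A runnable illustration of the replacement:
```cpp
#include <cassert>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4, 5};
    std::erase_if(v, [](int x) { return x % 2 == 0; }); // drop even elements
    assert((v == std::vector<int>{1, 3, 5}));
}
```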
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6ffcace5cce79793ca6bd65c61dc86e6297233fd.1597064990.git.piotr@scylladb.com>
After "Make replacing node take writes" series, with repair based node
operations disabled, we saw the replace operation fail like:
```
[shard 0] init - Startup failed: std::runtime_error (unable to find
sufficient sources for streaming range (9203926935651910749, +inf) in
keyspace system_auth)
```
The reason is that the system_auth keyspace has a default RF of 1. It is
impossible to find a source node to stream from for the ranges owned by
the replaced node.
In the past, the replace operation with a keyspace of RF 1 passed, because
the replacing node calls token_metadata.update_normal_tokens(tokens,
ip_of_replacing_node) before streaming. We saw:
```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9021954492552185543, -9016289150131785593] exists on {127.0.0.6}
```
Node 127.0.0.6 is the node replacing 127.0.0.5. The source node check in
range_streamer::get_range_fetch_map will pass if the source is the node
itself. However, it will not stream from the node itself. As a result,
the system_auth keyspace will not get any data.
After the "Make replacing node take writes" series, the replacing node
calls token_metadata.update_normal_tokens(tokens, ip_of_replacing_node)
after the streaming finishes. We saw:
```
[shard 0] range_streamer - Bootstrap : keyspace system_auth range
(-9049647518073030406, -9048297455405660225] exists on {127.0.0.5}
```
Since 127.0.0.5 was dead, the source node check failed, and so did the
bootstrap operation.
To fix, we ignore a keyspace with RF 1 when it is impossible to find a
source node to stream from.
Fixes #6351
Now it will show the full info about the range being streamed, like
range_streamer - Rebuild with 127.0.0.2 for keyspace=ks2, streaming [72, 96) out of 248 ranges
The [x, y) range is a semi-open one; the full streaming progress
can then be logged like
... streaming [0, 16) out of 36 ranges <- first send
... streaming [16, 24) out of 36 ranges
... streaming [24, 36) out of 36 ranges <- last send
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200304101505.5506-1-xemul@scylladb.com>
nr_ranges_streamed denotes the number of ranges streamed
so far, but by the time the sending lambda is called, this
counter has already been incremented by the number of ranges to be
streamed in this call. And the variable is not used for
anything but logging.
Fix this by swapping the logging with the incrementing.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200221101601.18779-1-xemul@scylladb.com>