Our interval template started life as `range`, and was supported
wrapping to follow Cassandra's convention of wrapping around the
maximum token.
We later recognized that an interval type should usually be non-wrapping
and split it into wrapping_range and nonwrapping_range, with `range`
aliasing wrapping_range to preserve compatibility.
Even later, we realized the name was already taken by C++ ranges and
so renamed it to `interval`. Given that intervals are usually non-wrapping,
the default `interval` type is non-wrapping.
We can now simplify it further, recognizing that everyone assumes
that an interval is non-wrapping and so doesn't need the
nonwrapping_interval_designation. We just rename nonwrapping_interval
to `interval` and remove the type alias.
Consider the inclusiveness of the token-range's start and end bounds and
copy the flag to the output bounds, instead of assuming they are always
inclusive.
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.
Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.
The unit tests are renamed and range.hh is deleted.
Closesscylladb/scylladb#17428
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
`partition_range_view` and `i_partition`, and drop their operator<<:s.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17331
This commit renames keyspace::get_effective_replication_map()
to keyspace::get_vnode_effective_replication_map(). This change
is required to ease the analysis of the usage of this function.
When tablets are enabled, then this function shall not be used.
Instead of per-keyspace, per-table replication map should be used.
The rename was performed to distinguish between those two calls.
The next step will be an audit of usages of
keyspace::get_vnode_effective_replication_map().
Refs: scylladb#16626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#17314
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `dht::ring_posittion`,
and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17194
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `dht::sharder`, and drop
its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17178
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `dht::ring_position_ext` and
`dht::ring_position_view`, and drop their operator<<.
Refs #13245Closesscylladb/scylladb#17128
* github.com:scylladb/scylladb:
db: add formatter for dht::ring_position_ext
db: add formatter for dht::ring_position_view
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `dht::ring_position_ext`,
and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `dht::ring_position_view`,
and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for dht::decorated_key and
repair_sync_boundary.
please note, before this change, repair_sync_boundary was using
the operator<< based formatter of `dht::decorated_key`, so we are
updating both of them in a single commit.
because we still use the homebrew generic formatter of vector<>
in to format vector<repair_sync_boundary> and vector<dht::decorated_key>,
so their operator<< are preserved.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16994
This patch reverts commit 10f8f13b90 from
November 2022. That commit added to the "view update generator", the code
which builds view updates for staging sstables, a filter that ignores
ranges that do not belong to this node. However,
1. I believe this filter was never necessary, because the view update
code already silently ignores base updates which do not belong to
this replica (see get_view_natural_endpoint()). After all, the view
update needs to know that this replica is the Nth owner of the base
update to send its update to the Nth view replica, but if no such
N exists, no view update is sent.
2. The code introduced for that filter used a per-keyspace replication
map, which was ok for vnodes but no longer works for tablets, and
causes the operation using it to fail.
3. The filter was used every time the "view update generator" was used,
regardless of whether any cleanup is necessary or not, so every
such operation would fail with tablets. So for example the dtest
test_mvs_populating_from_existing_data fails with tablets:
* This test has view building in parallel with automatic tablet
movement.
* Tablet movement is streaming.
* When streaming happens before view building has finished, the
streamed sstables get "view update generator" run on them.
This causes the problematic code to be called.
Before this patch, the dtest test_mvs_populating_from_existing_data
fails when tablets are enabled. After this patch, it passes.
Fixes#16598
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Change the token_metadata type to token_metadata2 in
the signatures of CDC-related methods in
storage_service and cdc/generation. Use
get_new_strong to get a pointer to the new host_id-based
token_metadata from the inet_address-based one,
living in the shared_token_metadata.
The starting point of the patch is in
storage_service::handle_global_request. We change the
tmptr type to token_metadata2 and propagate the change
down the call chains. This includes token-related methods
of the boot_strapper class.
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
Closesscylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
Streaming was keeping effective_replication_map_ptr around the whole
process, which blocks topology change barriers.
This will inhibit progress of tablet load balancer or concurrent
migrations, resulting in worse performance.
Fix by switching to the most recent erm on sharder
calls. multishard_writer calls shard_of() for each new partition.
A better way would be to switch immediately when topology version
changes, but this is left for later.
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.
we generate a rule for each .hh files to create a corresponding
.cc and then compile it, in order to verify the self-containness of
that header. so the number of rule is quite large, to avoid the
unnecessary overhead. the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15913
Extract decorated_key.hh and ring_position.hh
out of i_partitioner.hh so they can be included
selectively, since i_partitioner.hh contains too much
bagage that is not always needed in full.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move the `token_comparator` definition and
implementation to token.{hh,cc}, respectively
since they are independent of i_partitioner.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This option allows user to change the number of ranges to stream in
batch per stream plan.
Currently, each stream plan streams 10% of the total ranges.
With more ranges per stream plan, it reduces the waiting time between
two stream plans. For example,
stream_plan1: shard0 (t0), shard1 (t1)
stream_plan2: shard0 (t2), shard1 (t3)
We start stream_plan2 after all shards finish streaming in stream_plan1.
If shard0 and shard1 in stream_plan1 finishes at different time. One of
the shards will be idle.
If we stream more ranges in a single stream plan, the waiting time will
be reduced.
Previously, we retry the stream plan if one of the stream plans is
failed. That's one of the reasons we want more stream plans. With RBNO
and 1f8b529e08 (range_streamer: Disable restream logic), the
restream factor is not important anymore.
Also, more ranges in a single stream plan will create bigger but fewer
sstables on the receiver side.
The default value is the same as before: 10% percentage of total ranges.
Fixes#14191Closes#14402
Today, SSTable cleanup skips to the next partition, one at a time, when it finds
that the current partition is no longer owned by this node.
That's very inefficient because when a cluster is growing in size, existing
nodes lose multiple sequential tokens in its owned ranges. Another inefficiency
comes from fetching index pages spanning all unowned tokens, which was described
in #14317.
To solve both problems, cleanup will now use multi range reader, to guarantee
that it will only process the owned data and as a result skip unowned data.
This results in cleanup scanning an owned range and then fast forwarding to the
next one, until it's done with them all. This reduces significantly the amount
of data in the index caching, as index will only be invoked at each range
boundary instead.
Without further ado,
before:
... 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.
after:
... 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.
Fixes#12998.
Fixes#14317.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is in order to prevent new incorrect uses of dht::shard_of() to
be accidentally added. Also, makes sure that all current uses are
caught by the compiler and require an explicit rename.
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.
The main reason to avoid a multishard reader is avoiding its
complexity, and avoiding adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
The function currently assumes that shard assignment for subsequent
tokens is round robin, which will not be the case for tablets. This
can lead to incorrect split calculation or infinite loop.
Another assumption was that subsequent splits returned by the sharder
have distinct shards. This also doesn't hold for tablets, which may
return the same shard for subsequent tokens. This assumption was
embedded in the following line:
start_token = sharder.token_for_next_shard(end_token, shard);
If the range which starts with end_token is also owned by "shard",
token_for_next_shard() would skip over it.
We need those functions to work with tablet sharder, which is not
accessible through schema::get_sharder(). In order to propagate the
right sharder, those functions need to take it externally rather from
the schema object. The sharder will come from the
effective_replication_map attached to the table object.
Those splitting functions are used when generating sharding metadata
of an sstable. We need to keep this sharding metadata consistent with
tablet mapping to shards in order for node restart to detect that
those sstables belong to a single shard and that resharding is not
necessary. Resharding of sstables based on tablet metadata is not
implemented yet and will abort after this series.
Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
The logic was extracted from ring_position_range_sharder::next(), and
the latter was changed to rely on sharder::next_shard().
The tablet sharder will have a different implementation for
next_shard(). This way, ring_position_range_sharder can work with both
current sharder and the tablet sharder.
This reverts commit d85af3dca4. It
restores the linear search algorithm, as we expect the search to
terminate near the origin. In this case linear search is O(1)
while binary search is O(log n).
A comment is added so we don't repeat the mistake.
Closes#13704
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.
Things achieved in this PR:
- You can start a cluster and create a keyspace whose tables will use
tablet-based replication. This is done by setting `initial_tablets`
option:
```
CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
'replication_factor': 3,
'initial_tablets': 8};
```
All tables created in such a keyspace will be tablet-based.
Tablet-based replication is a trait, not a separate replication
strategy. Tablets don't change the spirit of replication strategy, it
just alters the way in which data ownership is managed. In theory, we
could use it for other strategies as well like
EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
is augmented to support tablets.
- You can create and drop tablet-based tables (no DDL language changes)
- DML / DQL work with tablet-based tables
Replicas for tablet-based tables are chosen from tablet metadata
instead of token metadata
Things which are not yet implemented:
- handling of views, indexes, CDC created on tablet-based tables
- sharding is done using the old method, it ignores the shard allocated in tablet metadata
- node operations (topology changes, repair, rebuild) are not handling tablet-based tables
- not integrated with compaction groups
- tablet allocator piggy-backs on tokens to choose replicas.
Eventually we want to allocate based on current load, not statically
Closes#13387
* github.com:scylladb/scylladb:
test: topology: Introduce test_tablets.py
raft: Introduce 'raft_server_force_snapshot' error injection
locator: network_topology_strategy: Support tablet replication
service: Introduce tablet_allocator
locator: Introduce tablet_aware_replication_strategy
locator: Extract maybe_remove_node_being_replaced()
dht: token_metadata: Introduce get_my_id()
migration_manager: Send tablet metadata as part of schema pull
storage_service: Load tablet metadata when reloading topology state
storage_service: Load tablet metadata on boot and from group0 changes
db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
migration_notifier: Introduce before_drop_keyspace()
migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
test: perf: Introduce perf-tablets
test: Introduce tablets_test
test: lib: Do not override table id in create_table()
utils, tablets: Introduce external_memory_usage()
db: tablets: Add printers
db: tablets: Add persistence layer
dht: Use last_token_of_compaction_group() in split_token_range_msb()
locator: Introduce tablet_metadata
dht: Introduce first_token()
dht: Introduce next_token()
storage_proxy: Improve trace-level logging
locator: token_metadata: Fix confusing comment on ring_range()
dht, storage_proxy: Abstract token space splitting
Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
db: Introduce get_non_local_vnode_based_strategy_keyspaces()
service: storage_proxy: Avoid copying keyspace name in write handler
locator: Introduce per-table replication strategy
treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
locator: Introduce effective_replication_map
locator: Rename effective_replication_map to vnode_effective_replication_map
locator: effective_replication_map: Abstract get_pending_endpoints()
db: Propagate feature_service to abstract_replication_strategy::validate_options()
db: config: Introduce experimental "TABLETS" feature
db: Log replication strategy for debugging purposes
db: Log full exception on error in do_parse_schema_tables()
db: keyspace: Remove non-const replication strategy getter
config: Reformat
in C++20, compiler generate operator!=() if the corresponding
operator==() is already defined, the language now understands
that the comparison is symmetric in the new standard.
fortunately, our operator!=() is always equivalent to
`! operator==()`, this matches the behavior of the default
generated operator!=(). so, in this change, all `operator!=`
are removed.
in addition to the defaulted operator!=, C++20 also brings to us
the defaulted operator==() -- it is able to generated the
operator==() if the member-wise lexicographical comparison.
under some circumstances, this is exactly what we need. so,
in this change, if the operator==() is also implemented as
a lexicographical comparison of all memeber variables of the
class/struct in question, it is implemented using the default
generated one by removing its body and mark the function as
`default`. moreover, if the class happen to have other comparison
operators which are implemented using lexicographical comparison,
the default generated `operator<=>` is used in place of
the defaulted `operator==`.
sometimes, we fail to mark the operator== with the `const`
specifier, in this change, to fulfil the need of C++ standard,
and to be more correct, the `const` specifier is added.
also, to generate the defaulted operator==, the operand should
be `const class_name&`, but it is not always the case, in the
class of `version`, we use `version` as the parameter type, to
fulfill the need of the C++ standard, the parameter type is
changed to `const version&` instead. this does not change
the semantic of the comparison operator. and is a more idiomatic
way to pass non-trivial struct as function parameters.
please note, because in C++20, both operator= and operator<=> are
symmetric, some of the operators in `multiprecision` are removed.
they are the symmetric form of the another variant. if they were
not removed, compiler would, for instance, find ambiguous
overloaded operator '=='.
this change is a cleanup to modernize the code base with C++20
features.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13687
now that C++20 is able to generate the default-generated comparing
operators for us. there is no need to define them manually. and,
`std::rel_ops::*` are deprecated in C++20.
also, use `foo <=> bar` instead of `tri_compare(foo, bar)` for better
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
this change ensures that `dk._key` is formatted with the "pk" prefix.
as in 3738fcb, the `operator<<` for partition_key was removed. so the
compiler has to find an alternative when trying to fulfill the needs
when this operator<< is called. fortunately, from the compiler's
perspective, `partition_key` has an `operator managed_bytes_view`, and
this operator does not have the explicit specifier, and,
`managed_bytes_view` does support `operator<<`. so this ends up with a
change in the format of `decorated_key` when it is printed using
`operator<<`. the code compiles. but unfortunately, the behavior is
changed, and it breaks scylla-dtest/cdc_tracing_info_test.py where the
partition_key is supposed to be printed like "pk{010203}" instead of
"010203". the latter is how `managed_bytes_view` is formatted.
a test is added accordingly to avoid future changes which break the
dtest.
Fixes scylladb#13628
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes#13653