Take advantage of the fact that both ranges and
ranges_to_subtract are deoverlapped and sorted
to reduce the calculation complexity from
quadratic to linear.
Fixes #11922
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The algorithm is generic and can be used elsewhere.
Add a unit test for the function before it gets
optimized in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since staging sstables are currently not cleaned up by cleanup compaction,
filter their tokens, processing only tokens owned by the
current node (based on the keyspace replication strategy).
Refs #9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is currently used by the cleanup_compaction partition filter.
Factor it out so it can be used to filter staging sstables in
the next patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This class exists for one purpose only: to serve as glue code between
dht::ring_position and boost::icl::interval_map. The latter requires
that keys in its intervals are:
* default constructible
* copyable
* have standalone compare operations
For this reason we have to wrap `dht::ring_position` in a class,
together with a schema to provide all this. This is
`compatible_ring_position`. There is one further requirement by code
using the interval map: it wants to do lookups without copying the
lookup key(s). To solve this, we came up with
`compatible_ring_position_or_view` which is a union of a key or a key
view + schema. As we recently found out, boost::icl copies its keys **a
lot**. It seems to assume these keys are cheap to copy and carelessly
copies them around even when iterating over the map. But
`compatible_ring_position_or_view` is not cheap to copy as it copies a
`dht::ring_position` which allocates, and it does that via an
`std::optional` and `std::variant` to add insult to injury.
This patch makes said class cheap to copy by getting rid of the variant
and storing the `dht::ring_position` via a shared pointer. The view is
stored separately and either points to the ring position stored in the
shared pointer or to an outside ring position (for lookups).
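The layout described above can be sketched as follows. This is a minimal illustration, not the actual class. `fake_ring_position` is a hypothetical stand-in for `dht::ring_position`, and the schema is omitted. Copying the wrapper copies a `shared_ptr` and a raw pointer. The raw view stays valid after a copy because it points at the shared heap object, not into the wrapper itself.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for dht::ring_position: owns heap memory, so copies allocate.
struct fake_ring_position {
    std::string key;
};

// Sketch only: owned positions live behind a shared_ptr; lookups use a non-owning view.
class position_or_view {
    std::shared_ptr<const fake_ring_position> _owned; // set only when we own the key
    const fake_ring_position* _view = nullptr;        // always points at the key in use
public:
    // Default constructible, as boost::icl requires; must be assigned before use.
    position_or_view() = default;

    // Owning constructor: one allocation here, none on later copies.
    explicit position_or_view(fake_ring_position rp)
        : _owned(std::make_shared<fake_ring_position>(std::move(rp)))
        , _view(_owned.get()) {}

    // Non-owning lookup key: no allocation and no copy of the position.
    static position_or_view view_of(const fake_ring_position& rp) {
        position_or_view p;
        p._view = &rp;
        return p;
    }

    const fake_ring_position& position() const { return *_view; }
    bool owns() const { return static_cast<bool>(_owned); }

    // Standalone comparison, as boost::icl requires.
    friend bool operator<(const position_or_view& a, const position_or_view& b) {
        return a.position().key < b.position().key;
    }
};
```

Storing the owned position by value inside the wrapper would not work with this scheme: the view pointer would dangle as soon as the wrapper was copied.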
Fixes: #11669
Closes #11670
We have added the finished percentage for repair based node operations.
This patch adds the finished percentage for node ops using the old
streaming.
Example output:
scylla_streaming_finished_percentage{ops="bootstrap",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="decommission",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="rebuild",shard="0"} 0.561945
scylla_streaming_finished_percentage{ops="removenode",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="repair",shard="0"} 1.000000
scylla_streaming_finished_percentage{ops="replace",shard="0"} 1.000000
In addition to the metrics, the log now shows the percentage as well:
[shard 0] range_streamer - Finished 2698 out of 2817 ranges for rebuild, finished percentage=0.95775646
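The figures above are fractions in the 0..1 range, not percentages out of 100. For example, 2698 / 2817 ≈ 0.9578. Below is a minimal sketch of the computation. `finished_percentage` is a hypothetical helper, and treating zero total ranges as finished is an assumption, not something the commit states.

```cpp
#include <cstdint>

// Hypothetical helper: fraction of ranges streamed so far, in [0.0, 1.0].
double finished_percentage(uint64_t finished_ranges, uint64_t total_ranges) {
    if (total_ranges == 0) {
        return 1.0; // assumption: nothing to stream counts as done
    }
    return static_cast<double>(finished_ranges) / static_cast<double>(total_ranges);
}
```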
Fixes #11600
Closes #11601
The method is about to be moved from snitch to topology; this patch
prepares the rest of the code to call it via the latter. The
topology's method just calls snitch, but it's going to change in the
next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two sorting methods in snitch -- one sorts the list of
addresses in place, the other one creates a sorted copy of the passed
const list (in fact -- the passed reference is not const, but it's not
modified by the method). However, both callers of the latter anyway
create their own temporary list of addresses, so they don't really benefit
from snitch generating another copy.
So this patch leaves just one sorting method -- the in-place one.
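The shape of an in-place-only API can be sketched like this. Callers that want a sorted copy make the copy themselves, which the callers described above already do. `node_info`, `proximity_rank` and `sort_by_proximity` are hypothetical. The same-rack / same-dc / remote-dc ranking is an assumed proximity rule, not necessarily what any particular snitch does.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical node descriptor; real code uses inet addresses plus dc/rack from topology.
struct node_info {
    std::string address;
    std::string dc;
    std::string rack;
};

// Assumed proximity scale: 0 = same rack, 1 = same dc, 2 = other dc.
int proximity_rank(const node_info& local, const node_info& other) {
    if (local.dc != other.dc) {
        return 2;
    }
    if (local.rack != other.rack) {
        return 1;
    }
    return 0;
}

// The only sorting entry point: sorts the caller's list in place, closest first.
// stable_sort keeps the original relative order of equally distant nodes.
void sort_by_proximity(const node_info& local, std::vector<node_info>& addresses) {
    std::stable_sort(addresses.begin(), addresses.end(),
        [&](const node_info& a, const node_info& b) {
            return proximity_rank(local, a) < proximity_rank(local, b);
        });
}
```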
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Recent change in topology (commit 4cbe6ee9 titled
"topology: Require entry in the map for update_normal_tokens()")
made token_metadata::update_normal_tokens() require the entry's presence
in the embedded topology object. Accordingly, the commit in question
equipped most callers of update_normal_tokens() with a preceding
topology update call to satisfy the requirement.
However, tokens are put into token_metadata not only for the normal state,
but also for bootstrapping, and one place that adds bootstrapping
tokens erroneously got a topology update. This is wrong -- a node must
not be present in the topology until it switches into the normal state. As
a result, several tests with bootstrapping nodes started to fail.
The fix removes the topology update for bootstrapping nodes, but this
change reveals a few other places that piggy-backed on this mistaken
update, so now _they_ need to update the topology themselves.
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/2040/
update_cluster_layout_tests.py::test_simple_add_new_node_while_schema_changes_with_repair
update_cluster_layout_tests.py::test_simple_kill_new_node_while_bootstrapping_with_parallel_writes_in_multidc
repair_based_node_operations_test.py::test_lcs_reshape_efficiency
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220902082753.17827-1-xemul@scylladb.com>
The method creates a copy of token metadata and pushes an endpoint (with
some tokens) into it. Next patches will require providing dc/rack info
together with the endpoint, this patch prepares for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Both classes may populate (temporary clones of) the token metadata object
with endpoint:tokens pairs for the endpoint they work with. Next patches
will require that the endpoint comes with dc/rack info. This patch makes
sure the dht classes have the necessary information at hand (for now it's
just an empty pair of strings).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rather than getting it in the callee, let the caller
(e.g. storage_service)
hold the erm and pass it down to potentially multiple
async functions.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For node operations, we currently call get_non_system_keyspaces
but really want to work on all keyspaces that have a non-local
replication strategy as they are replicated on other nodes.
Reflect that in the replica::database function name.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
While we're iterating over the fetched keyspace names, some of these
keyspaces may get dropped. Handle that by checking if the keyspace still
exists.
Also, when retrieving the replication strategy from the keyspace, store
the pointer (which is an `lw_shared_ptr`) to the strategy to keep it
alive, in case the keyspace that was holding it gets dropped.
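Both measures can be sketched as follows. `std::shared_ptr` stands in for Seastar's `lw_shared_ptr`, and the map stands in for the database. All type and function names here are hypothetical. The lookup skips keyspaces that disappeared after their names were fetched. Each strategy pointer is copied, so the strategy outlives a later drop of its keyspace.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; real code uses replica::keyspace and seastar::lw_shared_ptr.
struct replication_strategy {
    std::string name;
};
struct keyspace {
    std::shared_ptr<replication_strategy> strategy;
};
using keyspace_map = std::map<std::string, keyspace>;

std::vector<std::shared_ptr<replication_strategy>>
collect_strategies(const keyspace_map& db, const std::vector<std::string>& names) {
    std::vector<std::shared_ptr<replication_strategy>> out;
    for (const auto& name : names) {
        auto it = db.find(name);
        if (it == db.end()) {
            continue; // dropped since the names were fetched: skip it
        }
        // Copy the pointer so the strategy stays alive even if the keyspace is dropped.
        out.push_back(it->second.strategy);
    }
    return out;
}
```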
Closes #10861
It's needed in source filter classes so range-streamer passes the
topology reference into its methods.
Nice side effect -- the snitch header is no longer included by the range-streamer one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Consider:
- n1 and n2 in the cluster
- n3 bootstraps to join
- n1 does not hear gossip update from n3 due to network issue
- n1 removes n3 from gossip and pending node list
- stream between n1 and n3 fails
- n1 and n3 network issue is fixed
- n3 retries the stream with n1
- n3 finishes the stream with n1
- n3 advertises normal to join the cluster
The problem is that n1 will not treat n3 as a pending node, so writes
will not be routed to n3 once n1 removes n3.
Another problem is that when n1 gets the normal gossip status update from
n3, the gossip listener will fail because n1 has removed n3 and cannot
find the host id for n3. This will cause n1 to abort.
To fix, disable the retry logic in range_streamer so that once a stream
with an existing node fails, the bootstrap fails.
The downside is that we lose the ability to restream after a temporary
network issue, but since we have repair based node operations, we can use
them to resume previously failed node operations.
Fixes: #9805
Closes #9806
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.
Closes #10562
The flat_mutation_reader files were conflated and contained multiple
readers that did not strictly need to live there. Splitting them improves
both iterative and total compilation times. Iterative builds benefit
because touching rarely used readers no longer recompiles large chunks of
the codebase. Total build time benefits because flat_mutation_reader.hh and
flat_mutation_reader_v2.hh, which are included by many files in the
codebase, are now smaller.
With changes
real 29m14.051s
user 168m39.071s
sys 5m13.443s
Without changes
real 30m36.203s
user 175m43.354s
sys 5m26.376s
Closes #10194
Traditionally in Scylla and in Cassandra, an empty partition key is mapped
to minimum_token() instead of the empty key's usual hash function (0).
The reasons for this are unknown (to me), but one possibility is that
having one known key that maps to the minimal token is useful for
various iterations.
In murmur3_partitioner.cc we have two variants of the token calculation
function - the first is get_token(bytes_view) and the second is
get_token(schema, partition_key_view). The first includes that empty-
key special case, but the second was missing this special case!
As Kamil first noted in #9352, the second variant is used when looking
up partitions in the index file - so if a partition with an empty-string
key is saved under one token, it will be looked up under a different
token and not found. I reproduced exactly this problem when fixing
issues #9364 and #9375 (empty-string keys in materialized views and
indexes) - where a partition with an empty key was visible in a
full-table scan but couldn't be found by looking up its key because of
the wrong index lookup.
I also tried an alternative fix - changing both implementations to return
minimum_token (and not 0) for the empty key. But this is undesirable -
minimum_token is not supposed to be a valid token: the tokenizer and
sharder may not return a valid replica or shard for it, so we shouldn't
store data under such a token. We also have code (such as an
increasing-key sanity check in the flat mutation reader) which assumes that
no real key in the data can be minimum_token, and our plan is to start
allowing data with an empty key (at least for materialized views).
This patch does not risk a backward-incompatible disk format change,
for two reasons:
1. In the current Scylla, there was no valid case where an empty partition
key may appear. CQL and Thrift forbid such keys, and materialized-views
and indexes also (incorrectly - see #9364, #9375) drop such rows.
2. Although Cassandra *does* allow empty partition keys, they are only
allowed in materialized views and indexes - and we don't support reading
materialized views generated by Cassandra (the user must re-generate
them in Scylla).
Once #9364 and #9375 are fixed by the next patch, empty partition keys
will start appearing in Scylla (in materialized views and in the
materialized view backing a secondary index), and this fix will become
important.
Fixes #9352
Refs #9364
Refs #9375
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
After the mechanical change in fcb8d040e8
("treewide: use Software Package Data Exchange (SPDX) license identifiers"),
a few stray license blurbs or fragments thereof remain. In two cases
these were extra blurbs in code generators intended for the generated code,
in others they were just missed by the script.
Clean them up, adding an SPDX license identifier where needed.
Closes #10072
This also removes the only usage of this helper outside of the storage
service. The place that needs it is the use_strict_sources_for_ranges()
checker, and all its callers already know whether a replace is
happening or not.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The replica::database is passed into the helper just to get the
config from. Better to use config directly without messing with
the database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper in question has nothing to do with replica/database and
is only used by dht to convert config option to a set of tokens.
So it makes sense for the helper to live where it's needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a place in normal node start that parses the initial_token
option or generates num_tokens random tokens. This code has remained almost
unchanged since it was ported from its Java version. Later,
dht::get_bootstrap_token() appeared with the same internal logic.
This patch unifies these two places. Logging messages are unified
too (dtests seem not to check them).
The change improves a corner case. The normal node startup code doesn't
check whether initial_token is empty and num_tokens is 0, in which case it
generates an empty bootstrap_tokens set. It then fails later with an obscure
'remove_endpoint should be used instead' message.
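Below is a minimal sketch of the unified logic. It parses a comma-separated initial_token list, or else generates num_tokens random tokens, and it fails early when the result would be empty. `get_bootstrap_tokens_sketch` is hypothetical and uses plain `int64_t` tokens. Excluding INT64_MIN from random generation is an assumption meant to avoid a minimum-token sentinel. The error text is illustrative, not the message Scylla logs.

```cpp
#include <cstdint>
#include <limits>
#include <random>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse initial_token (comma separated) or generate num_tokens tokens.
std::set<int64_t> get_bootstrap_tokens_sketch(const std::string& initial_token,
                                              unsigned num_tokens, std::mt19937_64& rng) {
    std::set<int64_t> tokens;
    if (!initial_token.empty()) {
        std::stringstream ss(initial_token);
        std::string item;
        while (std::getline(ss, item, ',')) {
            tokens.insert(std::stoll(item)); // stoll skips leading whitespace
        }
    } else {
        // Assumption: skip INT64_MIN so no generated token equals a minimum-token sentinel.
        std::uniform_int_distribution<int64_t> dist(
            std::numeric_limits<int64_t>::min() + 1, std::numeric_limits<int64_t>::max());
        while (tokens.size() < num_tokens) {
            tokens.insert(dist(rng)); // the set discards rare duplicates; loop until enough
        }
    }
    // Fail early with a clear message instead of later with an obscure one.
    if (tokens.empty()) {
        throw std::runtime_error("initial_token is empty and num_tokens is 0");
    }
    return tokens;
}
```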
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
Closes #9937
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
In a previous patch, we noticed that the header file <gms/inet_address.hh>,
which is included, directly or indirectly, by most source files,
includes <seastar/net/ip.hh> which is very slow to compile, and
replaced it by the much faster-to-include <seastar/net/ipv[46]_address.hh>.
However, we also included <seastar/net/ip.hh> in types.hh - and that
too is included by almost every file, so the actual saving from the
above patch was minimal. So in this patch we replace this include too.
After this patch Scylla does not include <seastar/net/ip.hh> at all.
According to ClangBuildAnalyzer, this reduces the average time to include
types.hh (and it is included 312 times!) from 4 seconds to 1.8 seconds,
and reduces total build time (dev mode) by about 3%.
Some source files were now missing include directives for headers that
were previously pulled in through ip.hh, so we need to add those explicitly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A continuation of the previous patch. The range_streamer needs
gossiper too, and is called from boot_strapper and storage_service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The boot_strapper::bootstrap needs gossiper and is called only from
the storage_service code that has it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The plan itself doesn't need it, but it creates some lower level
objects that do. Next patches will use this reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the preparation for the future patching. The stream_plan
creation will need the manager reference, so keep one on the dht
objects in advance. These objects are only created from the storage
service bootstrap code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Equivalent to abstract_replication_strategy get_range_addresses,
yet synchronous, as it uses the precalculated map.
Call it from storage_service::get_new_source_ranges
and range_streamer::get_all_ranges_with_sources_for.
Consequently, get_new_source_ranges and removenode_add_ranges
can become synchronous too.
Unfortunately we can't entirely get rid of
abstract_replication_strategy::get_range_addresses
as it's still needed by
range_streamer::get_all_ranges_with_strict_sources_for.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Remaining callers of get_address_ranges and get_pending_address_ranges
are all either from a seastar thread or from a coroutine
so we can make the methods always async and drop the
can_yield param.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All remaining use sites are called in a seastar thread
so we drop the can_yield param and make get_range_addresses
always async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Enable creating shared_ptr<BaseClass> in nonstatic_class_registry
using BaseClass::ptr_type and use that for
abstract_replication_strategy.
While at it, also clean up compressor in that respect
to define compressor::ptr_type as shared_ptr<compressor>
thus simplifying compressor_registry.
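The pattern can be sketched like this: the registry takes its factory return type from `BaseClass::ptr_type`, so each base class chooses the smart pointer its instances are handed out in. This is only an illustration of the idea. `class_registry_sketch`, `compressor_sketch` and `lz4_sketch` are hypothetical, and the real `nonstatic_class_registry` interface may differ.

```cpp
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical registry: factories return BaseClass::ptr_type.
template <typename BaseClass, typename... Args>
class class_registry_sketch {
public:
    using ptr_type = typename BaseClass::ptr_type;
    using creator = std::function<ptr_type(Args...)>;

    void register_class(const std::string& name, creator c) {
        _creators[name] = std::move(c);
    }
    ptr_type create(const std::string& name, Args... args) const {
        auto it = _creators.find(name);
        if (it == _creators.end()) {
            throw std::out_of_range("no such class: " + name);
        }
        return it->second(args...);
    }
private:
    std::unordered_map<std::string, creator> _creators;
};

// Hypothetical base class declaring its pointer type, as the patch does for compressor.
struct compressor_sketch {
    using ptr_type = std::shared_ptr<compressor_sketch>;
    virtual ~compressor_sketch() = default;
    virtual std::string name() const = 0;
};

struct lz4_sketch : compressor_sketch {
    std::string name() const override { return "lz4"; }
};
```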
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.
Enable the warning and fix numerous violations.
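The warning is not named here. The snippet below only illustrates the class of bug it describes. A derived function whose signature differs slightly from the base one, for example by missing `const`, does not override the base function. Calls through a base reference then silently reach the base version. Marking the function `override` turns such a mismatch into a compile error. The class names are hypothetical.

```cpp
#include <string>

struct base_strategy {
    virtual ~base_strategy() = default;
    virtual std::string describe(int rf) const { return "base rf=" + std::to_string(rf); }
};

struct derived_strategy : base_strategy {
    // Without `const` (or with `long` instead of `int`) this would declare a new,
    // unrelated function and calls through base_strategy& would still reach the
    // base version. With `override`, such a mismatch fails to compile.
    std::string describe(int rf) const override { return "derived rf=" + std::to_string(rf); }
};
```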
Closes #9347
If x is of type std::strong_ordering, then "x <=> 0" is equivalent to
x. These no-ops were inserted during #1449 fixes, but are now unnecessary.
They have potential for harm, since they can hide an accidental change of the
type of x to an arithmetic type, so remove them.
Ref #1449.
Prevent accidental conversions to bool from yielding the wrong results.
Unprepared users (that converted to bool, or assigned to int) are adjusted.
Ref #1449
Test: unit (dev)
Closes #9088
In an upcoming commit I will add a "system.describe_ring" table which uses
the endpoint's inet address as part of the clustering key (CK) and, therefore,
needs to keep the addresses sorted with `inet_addr_type::less`.
In preparation for caching index objects, manage them under LSA.
Implementation notes:
key_view was changed to be a view on managed_bytes_view instead of
bytes, so it now can be fragmented. Old users of key_view now have to
linearize it. Actual linearization should be rare since partition
keys are typically small.
The index parser no longer constructs index_entry directly; instead it
produces value objects which live in the standard allocator space:
class parsed_promoted_index_entry;
class parsed_partition_index_entry;
This change was needed to support consumers which don't populate the
partition index cache and don't use LSA,
e.g. sstable::generate_summary(). It's now the consumer's responsibility
to allocate index_entry out of parsed_partition_index_entry.
This is a set of a few cosmetic changes in dht/token. Mostly some comments and a simplification of `midpoint()`.
Closes #8803
* github.com:scylladb/scylla:
dht: token: add a comment excusing the `const bytes&` constructor
dht: token: simplify midpoint()
dht: token: add a comment to normalize()
dht: token: use {read,write}_unaligned instead of std::copy_n
dht: token-sharding: fix a typo in a comment
Eliminate unused includes and replace some more includes
with forward declarations where appropriate.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>