If tablet-based table is created concurrently with node being
decommissioned after tablets are already drained, the new table may be
permanently left with replicas on the node which is no longer in the
topology. That creates an immidiate availability risk because we are
running with one replica down.
This also violates invariants about replica placement and this state
cannot be fixed by topology operations.
One effect is that this will lead to load balancer failure which will
inhibit progress of any topology operations:
load_balancer - Replica 154b0380-1dd2-11b2-9fdd-7156aa720e1a:0 of tablet 7e03dd40-537b-11ef-9fdd-7156aa720e1a:1 not found in topology, at: ...
Fixes#20032
(cherry picked from commit f5c74a5df2)
Closesscylladb/scylladb#20067
The API already promises this, the comment on effective_replication_map says:
"Excludes replicas which are in the left state".
Tablet replicas on the replaced node are rebuilt after the node
already left. We may no longer have the IP mapping for the left node
so we should not include that node in the replica set. Otherwise,
storage_proxy may try to use the empty IP and fail:
storage_proxy - No mapping for :: in the passed effective replication map
It's fine to not include it, because storage proxy uses keyspace RF
and not replica list size to determine quorum. The node is not coming
up, so noone should need to contact it.
Users which need replica list stability should use the host_id-based API.
Fixes#18843
(cherry picked from commit 0d596a425c)
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.
One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.
Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.
Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requets start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.
Fixes#17658.
Fixes#18561.
(cherry picked from commit 98323be296)
Currently, the function needlessly copies the tablet_info
(all tablet replicas in particular) to a local variable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2de79c39dc)
This is required by repair when it will start using get_primary_replica
in a following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c52f70f92c)
This commit adds support for executing ALTER KS for keyspaces with
tablets and utilizes all the previous commits.
The ALTER KS is handled in alter_keyspace_statement, where a global
topology request in generated with data attached to system.topology
table. Then, once topology state machine is ready, it starts to handle
this global topology event, which results in producing mutations
required to change the schema of the keyspace, delete the
system.topology's global req, produce tablets mutations and additional
mutations for a table tracking the lifetime of the whole req. Tracking
the lifetime is necessary to not return the control to the user too
early, so the query processor only returns the response while the
mutations are sent.
Require users to specify whether we want shard for reads or for writes
by switching to appropriate non-deprecated variant.
For example, shard_of() can be replaced with shard_for_reads() or
shard_for_writes().
The next_shard/token_for_next_shard APIs have only for-reads variant,
and the act of switching will be a testimony to the fact that the code
is valid for intra-node migration.
During streaming for intra-node migration we want to write only to the
new shard. To achieve that, allow altering write selector in
sharder::shard_for_writes() and per-instance of
auto_refreshing_sharder.
Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.
The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.
The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.
We need a separate transition kind for intra node migration so that we
don't have to recover this information from replica set in an
expensive way. This information is needed in the hot path - in
effective_replicaiton_map, to not return the pending tablet replica to
the coordinator. From its perspective, replica set is not
transitional.
The transition will also be used to alter the behavior of the
sharder. When not in intra-node migration, the sharder should
advertise the shard which is either in the previous or next replica
set. During intra-node migration, that's not possible as there may be
two such shards. So it will return the shard according to the current
read selector.
The primary replica is an arbitrary replica of the tablet's, which is
considered to tbe the "main" owner of the tablet, similar to how
replicas own tokens in the vnode world.
To avoid aliasing the primary replicas with a certain DC or rack,
primary replicas are rotated among the tablet's replicas, selecting
tablet_id % replica_count as the primary replica.
since we do not rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`.
in this change,
* utils: drop the range formatters in to_string.hh and to_string.c, as
we don't use them anymore. and the tests for them in
test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector. as
we are not able to print the elements using operator<< anymore
after switching to {fmt} formatters.
* test/boost: specialize fmt::details::is_std_string_like<bytes>
due to a bug in {fmt} v9, {fmt} fails to format a range whose
element type is `basic_sstring<uint8_t>`, as it considers it
as a string-like type, but `basic_sstring<uint8_t>`'s char type
is signed char, not char. this issue does not exist in {fmt} v10,
so, in this change, we add a workaround to explicitly specialize
the type trait to assure that {fmt} format this type using its
`fmt::formatter` specialization instead of trying to format it
as a string. also, {fmt}'s generic ranges formatter calls the
pair formatter's `set_brackets()` and `set_separator()` methods
when printing the range, but operator<< based formatter does not
provide these method, we have to include this change in the change
switching to {fmt}, otherwise the change specializing
`fmt::details::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
for comparing values. but without the operator<< based formatters,
Boost.Test would not be able to print them. after removing
the homebrew formatters, we need to use the generic
`boost_test_print_type()` helper to do this job. so we are
including `test_utils.hh` in tests so that we can print
the formattable types.
* treewide: add "#include "utils/to_string.hh" where
`fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in in {fmt} before v10, it provides the specialization of `fmt::formatter<..>`
for `std::string_view` as well as the specialization of `fmt::formatter<..>`
for `fmt::string_view` which is an implementation builtin in {fmt} for
compatibility of pre-C++17. and this type is used even if the code is
compiled with C++ stadandard greater or equal to C++17. also, before v10,
the `fmt::formatter<std::string_view>::format()` is defined so it accepts
`std::string_view`. after v10, `fmt::formatter<std::string_view>` still
exists, but it is now defined using `format_as()` machinery, so it's
`format()` method does not actually accept `std::string_view`, it
accepts `fmt::string_view`, as the former can be converted to
`fmt::string_view`.
this is why we can inherit from `fmt::formatter<std::string_view>` and
use `formatter<std::string_view>::format(foo, ctx);` to implement the
`format()` method with {fmt} v9, but we cannot do this with {fmt} v10,
and we would have following compilation failure:
```
FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o
/home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc
/home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format'
254 | return formatter<std::string_view>::format(it->second, ctx);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument
2759 | FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const
| ^ ~~~~~~~~~~~~
```
because the inherited `format()` method actually comes from
`fmt::formatter<fmt::string_view>`. to reduce the confusion, in this
change, we just inherit from `fmt::format<string_view>`, where
`string_view` is actually `fmt::string_view`. this follows
the document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,
and since there is less indirection under the hood -- we do not
use the specialization created by `FMT_FORMAT_AS` which inherit
from `formatter<fmt::string_view>`, hopefully this can improve
the compilation speed a little bit. also, this change addresses
the build failure with {fmt} v10.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18299
The problem this series solves is correctly ignoring DOWN nodes state
when replacing a node.
When a node is replaced and there are other nodes that are down, the
replacing node is told to ignore those DOWN nodes using the
`ignore_dead_nodes_for_replace` option.
Since the replacing node is bootstrapping it starts with an empty
system.peers table so it has no notion about any node state and it
learns about all other nodes via gossip shadow round done in
`storage_service::prepare_replacement_info`.
Normally, since the DOWN nodes to ignore already joined the ring, the
remaining node will have their endpoint state already in gossip, but if
the whole cluster was restarted while those DOWN nodes did not start,
the remaining nodes will only have a partial endpoint state from them,
which is loaded from system.peers.
Currently, the partial endpoint state contains only `HOST_ID` and
`TOKENS`, and in particular it lacks `STATUS`, `DC`, and `RACK`.
The first part of this series loads also `DC` and `RACK` from
system.peers to make them available to the replacing node as they are
crucial for building a correct replication map with network topology
replication strategy.
But still, without a `STATUS` those nodes are not considered as normal
token owners yet, and they do not go through handle_state_normal which
adds them to the topology and token_metadata.
The second part of this series uses the endpoint state retrieved in the
gossip shadow round to explicitly add the ignored nodes' state to
topology (including dc and rack) and token_metadata (tokens) in
`prepare_replacement_info`. If there are more DOWN nodes that are not
explicitly ignored replace will fail (as it should).
Fixesscylladb/scylladb#15787Closesscylladb/scylladb#15788
* github.com:scylladb/scylladb:
storage_service: join_token_ring: load ignored nodes state if replacing
storage_service: replacement_info: return ignore_nodes state
locator: host_id_or_endpoint: keep value as variant
gms: endpoint_state: add getters for host_id, dc_rack, and tokens
storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
gossiper: add_saved_endpoint: set dc and rack
gossiper: add_saved_endpoint: fixup indentation
gossiper: add_saved_endpoint: make host_id mandatory
gossiper: add load_endpoint_state
gossiper: start_gossiping: log local state
Just like leaving replica could be optional when adding replica to
tablet, the pending replica can be optional too if we're removing a
replica from tablet
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rather than allowing to keep both
host_id and endpoint, keep only one of them
and provide resolve functions that use the
token_metadata to resolve the host_id into
an inet_address or vice verse.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When getting leaving replica from from tablet info and transition info,
the getter code assumes that this replica always exists. It's not going
to be the case soon, so make the return value be optional.
There are four places that mess with leaving replica:
- stream tablet handler: this place checks that the leaving replica is
_not_ current host. If leaving replica is missing, the check should
pass
- cleanup tablet handler: this place checks that the leaving replica
_is_ current host. If leaving replica is missing, the check should
fail as well
- topology coordinator: it gets leaving replica to call cleanup on. If
leaving replica is missing, the cleanup call is short-circuited to
succeed immediately
- load-stats calculator: it checks if the leaving replica is self. This
check is not patched as it's automatically satisfied by std::optional
comparison operator overload for wrapped type
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We can cache tablet map in erm, to avoid looking it up on every write for
getting write replicas. We do that in tablet_sharder, but not in tablet
erm. Tablet map is immutable in the context of a given erm, so the
address of the map is stable during erm lifetime.
This caught my attention when looking at perf diff output
(comparing tablet and vnode modes).
It also helps when erm is called again on write completion for
checking locality, used for forwarding info to the driver if needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#18158
This series provides a reallocate_tablets function, that's initially called by allocate_tablets_for_new_table.
The new allocation implementation is independent of vnodes/token ownership.
Rather than using the natural_endpoints_tracker, it implements its own tracking
based on dc/rack load (== number of replicas in rack), with the additional benefit
that tablet allocation will balance the allocation across racks, using a heap structure,
similar to the one we use to balance tablet allocation across shards in each node.
reallocate_tablets may also be called with an optional parameter pointing the the current tablet_map.
In this case the function either allocates more tablet replicas in datacenters for which the replication factor was increased,
or it will deallocate tablet replicas from datacenters for which replication factor was decreased.
The NetworkTopologyStrategy_tablets_test unit test was extended to cover replication factor changes.
Closesscylladb/scylladb#17846
* github.com:scylladb/scylladb:
network_topology_strategy: reallocate_tablets: consider new_racks before existing racks
network_topology_startegy_test: add NetworkTopologyStrategy_tablet_allocation_balancing_test
network_topology_strategy: reallocate_tablets: support deallocation via rf change
network_topology_startegy_test: tablets_test: randomize cases
network_topology_strategy: allocate_tablets_for_new_table: do not rely on token ownership
network_topology_startegy_test: add NetworkTopologyStrategy_tablets_negative_test
network_topology_strategy_test: endpoints_check: use particular BOOST_CHECK_* functions
network_topology_strategy_test: endpoints_check: verify that replicas are placed on unique nodes
network_topology_strategy_test: endpoints_check: strictly check rf for tablets
network_topology_strategy_test: full_ring_check for tablets: drop unused options param
Allocate first from new (unpopulated) racks before
allocating from racks that are already populated
with replicas.
Still, rotate both new and existing racks by tablet id
to ensure fairness.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add support for deallocating tablet replicas when the
datacenter replication factor is decreased.
We deallocate replicas back-to-front order to maintain
replica pairing between the base table and
its materialized views.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Base initial tablets allocation for new table
on the dc/rack topology, rather then on the token ring,
to remove the dependency on token ownership.
We keep the rack ordinal order in each dc
to facilitate in-rack pairing of base/view
replica pairing, and we apply load-balancing
principles by sorting the nodes in each rack
by their load (number of tablets allocated to
the node), and attempting to fill lease-loaded
nodes first.
This method is more efficient than circling
the token ring and attemting to insert the endpoints
to the natural_endpoint_tracker until the replication
factor per dc is fulfilled, and it allows an easier
way to incrementally allocate more replicas after
rf is increased.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There are several places in code that calculate replica sets associated
with specific tablet transision. Having a helper to substract two sets
improves code readability.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18033
feature_service.hh is a high-level header that integrates much
of the system functionality, so including it in lower-level headers
causes unnecessary rebuilds. Specifically, when retiring features.
Fix by removing feature_service.hh from headers, and supply forward
declarations and includes in .cc where needed.
Closesscylladb/scylladb#18005
The loader is writing to pending replica even when write selector is set
to previous. If migration is reverted, then the writes won't be rolled
back as it assumes pending replicas weren't written to yet. That can
cause data resurrection if tablet is later migrated back into the same
replica.
NOTE: write selector is handled correctly when set to next, because
get_natural_endpoints() will return the next replica set, and none
of the replicas will be considered leaving. And of course, selector
set to both is also handled correctly.
Fixes#17892.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#17902
This fixes a problem with replacing a node with tablets when
RF=N. Currently, this will fail because new tablet replica allocation
will not be able to find a viable destination, as the replacing node
is not considered a candidate. It cannot be a candidate because
replace rolls back on failure and we cannot roll back after tablets
were migrated.
The solution taken here is to not drain tablet replicas from replaced
node during topology request but leave it to happen later after the
replaced node is left and replacing node is normal.
The replacing node waits for this draining to be complete on boot
before the node is considered booted.
Fixes#17025
Those nodes will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
about those node's location (dc, rack) for two reasons:
1) algorithms which work with replica sets filter nodes based on
their location. For example materialized views code which pairs base
replicas with view replicas filters by datacenter first.
2) tablet scheduler needs to identify each node's location in order
to make decisions about new replica placement.
It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.
Nodes in the left state are still not present in token ring, and not
considered to be members of the ring (datacanter endpoints excludes them).
In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet
replica sets.
We load topology infromation only for left nodes which are actually
referenced by any tablet. To achieve that, topology loading code
queries system.tablet for the set of hosts. This set is then passed to
system.topology loading method which decides whether to load
replica_state for a left node or not.
Before this patch, the mentioned function was a specific
member of vnode_effective_replication_strategy class.
To allow its usage also when tablets are enabled it was
shifted to the base class - effective_replication_strategy
and made pure virtual to force the derived classes to
implement it.
It is used by 'storage_service::get_ranges_for_endpoint()'
that is used in calculation of effective ownership. Such
calculation needs to be performed also when tablets are
enabled.
Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
This change introudces a new member function that
returns a vector of sorted tokens where each pair of adjacent
elements depicts a range of tokens that belong to tablet.
It will be used to produce the equivalent of sorted_tokens() of
vnodes when trying to use dht::describe_ownership() for tablets.
Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for
* tablet_id
* tablet_replica
* tablet_metadata
* tablet_map
their operator<<:s are dropped
Refs scylladb/scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17504
When topology barrier is blocked for longer than configured threshold
(2s), stale versions are marked as stalled and when they get released
they report backtrace to the logs. This should help to identify what
was holding for token metadata pointer for too long.
Example log:
token_metadata - topology version 30 held for 299.159 [s] past expiry, released at: 0x2397ae1 0x23a36b6 ...
Closesscylladb/scylladb#17427
Given a list of partition-ranges, yields the intersection of this
range-list, with that of that tablet-ranges, for tablets located on the
given host.
This will be used in multishard_mutation_query.cc, to obtain the ranges
to read from the local node: given the read ranges, obtain the ranges
belonging to tablets who have replicas on the local node.
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.
Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.
The unit tests are renamed and range.hh is deleted.
Closesscylladb/scylladb#17428
It can happen that a node is lost during tablet migration involving that node. Migration will be stuck, blocking topology state machine. To recover from this, the current procedure is for the admin to execute nodetool removenode or replacing the node. This marks the node as "ignored" and tablet state machine can pick this up and abort the migration.
This PR implements the handling for streaming stage only and adds a test for it. Checking other stages needs more work with failure injection to inject failures into specific barrier.
To handle streaming failure two new stages are introduced -- cleanup_target and revert_migration. The former is to clean the pending replica that could receive some data by the time streaming stopped working, the latter is like end_migration, but doesn't commit the new_replicas into replicas field.
refs: #16527Closesscylladb/scylladb#17360
* github.com:scylladb/scylladb:
test/topology: Add checking error paths for failed migration
topology.tablets_migration: Handle failed streaming
topology.tablets_migration: Add cleanup_target transition stage
topology.tablets_migration: Add revert_migration transition stage
storage_service: Rewrap cleanup stage checking in cleanup_tablet()
test/topology: Move helpers to get tablet replicas to pylib