We make `consistent_cluster_management` mandatory in 5.5. This
option will be always unused and assumed to be true.
Additionally, we make `override_decommission` deprecated, as this option
has been supported only with `consistent_cluster_management=false`.
Making `consistent_cluster_management` mandatory also simplifies
the code. Branches that execute only with
`consistent_cluster_management` disabled are removed.
We also update documentation by removing information irrelevant in 5.5.
Fixesscylladb/scylladb#15854
Note about upgrades: this PR does not introduce any more limitations
to the upgrade procedure than there are already. As in
scylladb/scylladb#16254, we can upgrade from the first version of Scylla
that supports the schema commitlog feature, i.e. from 5.1 (or
corresponding Enterprise release) or later. Assuming this PR ends up in
5.5, the documented upgrade support is from 5.4. For corresponding
Enterprise release, it's from 2023.x (based on 5.2), so all requirements
are met.
Closesscylladb/scylladb#16334
* github.com:scylladb/scylladb:
docs: update after making consistent_cluster_management mandatory
system_keyspace, main, cql_test_env: fix indendations
db: config: make consistent_cluster_management mandatory
test: boost: schema_change_test: replace disable_raft_schema_config
db: config: make override_decommission deprecated
db: config: make force_schema_commit_log deprecated
This introduces the ability to split a storage group.
The main compaction group is split into left and right groups.
set_split() is used to set the storage group to splitting mode, which
will create left and right compaction groups. Incoming writes will
now be placed into memtable of either left or right groups.
split() is used to complete the splitting of a group. It only
returns when all preexisting data is split. That means main
compaction group will be empty and all the data will be stored
in either left or right group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Code that executed only when consistent_cluster_management=false is
removed. In particular, after this patch:
- raft_group0 and raft_group_registry are always enabled,
- raft_group0::status_for_monitoring::disabled becomes unused,
- topology tests can only run with consistent_cluster_management.
Make host_id parameter non-optional and
move it to the beginning of the arguments list.
Delete unused overloads of add_or_update_endpoint.
Delete unused overload of token_metadata::update_topology
with inet_address argument.
locator_topology_test, network_topology_strategy_test and
tablets_test are fully switched to the host_id-based token_metadata,
meaning they no longer populate the old token_metadata.
All the boost and topology tests pass with this change.
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
Closesscylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.
Also will be used in tests to ensure tablets stick to expected locations.
when compiling the tests with -Wsign-compare, the compiler complains like:
```
/home/kefu/.local/bin/clang++ -DBOOST_ALL_DYN_LINK -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_TESTING_MAIN -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/cmake/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -isystem /home/kefu/dev/scylladb/build/cmake/rust -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -DSEASTAR_SSTRING -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o -MF test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o.d -o test/boost/CMakeFiles/tablets_test.dir/tablets_test.cc.o -c /home/kefu/dev/scylladb/test/boost/tablets_test.cc
/home/kefu/dev/scylladb/test/boost/tablets_test.cc:1335:53: error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare]
for (int log2_tablets = 0; log2_tablets < tablet_count_bits; ++log2_tablets) {
~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~
```
in this case, it should be safe to use an signed int as the loop
variable to be compared with `tablet_count_bits`, but let's just
appease the compiler so we can enable the warning option project-wide
to prevent any potential issues caused by signed-unsigned comparision.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15449
Load balancer will recognize decommissioning nodes and will
move tablet replicas away from such nodes with highest priority.
Topology changes have now an extra step called "tablet draining" which
calls the load balancer. The step will execute tablet migration track
as long as there are nodes which require draining. It will not do regular
load balancing.
If load balancer is unable to find new tablet replicas, because RF
cannot be met or availability is at risk due to insufficient node
distribution in racks, it will throw an exception. Currently, topology
change will retry in a loop. We should make this error cause topology
change to be paused so that admin becomes aware of the problem and
issues an abort on the topology change. There is no infrastructure for
aborts yet, so this is not implemented.
The motivation is that token_metadata::get_my_id() is not available
early in the bootstrap process, as raft topology is pulled later
than new tables are registered and created, and this node is added
to topology even later.
To allow creation of compaction groups to retrieve "my id" from
token metadata early, initialization will now feed local id
into topology config which is immutable for each node anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We add a special mode of load balancing, enabled through error
injection, which causes it to continuously generate plans. This
should keep the topology coordinator continuously in the tablet
migration track.
We enable this mode in test_tablets.py:test_bootstrap before
bootstrapping nodes to see that bootstrap request interrupts
tablet migration track. If this would not be the case, the
test will hang.
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.
See the "Tablet migration" section of topology-over-raft.md