This PR fixes a bug where certain calls to the `mintimeuuid()` CQL function with large negative timestamps could crash Scylla. It turns out we already had protections in place against very large positive timestamps, but very negative timestamps could still cause bugs.
The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp(), which takes a date parameter), as well as reproducers for this bug - one directly, without the totimestamp() function, and one with it.
Finally, this PR also replaces the assert(), which turned this molehill of a bug into a mountain, with a throw.
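A minimal sketch of the shape of that change, assuming Seastar's on_internal_error() (the series adds an overload with a common logger); the helper below is hypothetical, not the actual UUID_gen.hh code:

```cpp
#include <cstdint>
#include <fmt/format.h>
#include <seastar/util/log.hh>

// Hypothetical helper, for illustration only.
// before: assert(ts_in_range) - a bad user-supplied timestamp aborted Scylla.
// after: an out-of-range timestamp logs and throws (or aborts, if so
// configured) instead of always aborting:
void check_timeuuid_range(int64_t ts_ms, int64_t min_ms, int64_t max_ms,
                          seastar::logger& log) {
    if (ts_ms < min_ms || ts_ms > max_ms) {
        seastar::on_internal_error(log,
            fmt::format("timestamp {} out of range for timeuuid", ts_ms));
    }
}
```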
Fixes #17035
Closes scylladb/scylladb#17073
* github.com:scylladb/scylladb:
utils: replace assert() by on_internal_error()
utils: add on_internal_error with common logger
utils: add a timeuuid minimum, like we had maximum
test/cql-pytest: tests for "date" type
This reverts commit 370fbd346c, reversing
changes made to 0912d2a2c6.
This makes scylla-manager misinterpret the data_file_directories somehow; see issue #17078.
The motivation for tablet resizing is that we want to keep the average tablet size reasonable, so that load rebalancing can remain efficient. Tablets that are too large make migration inefficient, slowing down the balancer.
If the avg size grows beyond the upper bound (the split threshold), the balancer decides to split. A split spans all tablets of a table, due to the power-of-two constraint.
Likewise, if the avg size decreases below the lower bound (the merge threshold), a merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays the foundation for it to be implemented later on.
A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, say a table is being split and the avg size drops below the target size (which is 50% of the split threshold and 100% of the merge threshold). That means that after the split, the avg size would drop below the merge threshold, causing a merge right after the split, which is wasteful, so it's better to just cancel the split.
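A minimal sketch of that decision rule; names and the exact threshold relations are illustrative, not the actual load-balancer code:

```cpp
#include <cstdint>

enum class resize_type { none, split, merge };

// Per the description above, the target size is 50% of the split threshold
// and 100% of the merge threshold.
resize_type decide(uint64_t avg_tablet_size, uint64_t target_size) {
    const uint64_t split_threshold = 2 * target_size;
    const uint64_t merge_threshold = target_size;
    if (avg_tablet_size > split_threshold) {
        return resize_type::split;  // halving tablets brings the avg back near target
    }
    if (avg_tablet_size < merge_threshold) {
        return resize_type::merge;  // doubling tablet size grows the avg back near target
    }
    return resize_type::none;
}

// Revocation: while a split is pending, if the avg drops below the target,
// completing the split would halve it below the merge threshold and trigger
// a wasteful merge right after, so the pending split is cancelled instead.
bool should_revoke_pending_split(uint64_t avg_tablet_size, uint64_t target_size) {
    return avg_tablet_size < target_size;
}
```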
Tablet metadata gains two new fields for managing this:
resize_type: the resize decision type; one of "merge", "split", or "none".
resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator).
A new RPC was implemented to pull stats from each table replica, so that the load balancer can calculate the avg tablet size and know the "split status" for a given table. The avg size is aggregated carefully, taking the RF of each DC (which might differ) into account.
When a table is done splitting its storage, it mirrors the resize_seq_number from the tablet metadata into its local state (in other words: "my split status is ready"). If a table is split-ready, the coordinator will see that the table's seq number is the same as the one in the tablet metadata. This helps distinguish stale decisions from the latest one (in case decisions are revoked and re-emitted later on). The status is also aggregated carefully, by taking the minimum among all replicas, so the coordinator will only update the topology when all replicas are ready - see the sketch below.
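A minimal sketch of that aggregation rule, with illustrative names (not the actual coordinator code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Each replica reports the resize_seq_number it mirrored into its local
// state after finishing its split work. Taking the minimum means the table
// is only considered split-ready once *all* replicas caught up with the
// latest decision; any smaller number is a stale (possibly revoked and
// re-emitted) decision.
bool split_ready(const std::vector<int64_t>& replica_seq_numbers,
                 int64_t current_decision_seq_number) {
    if (replica_seq_numbers.empty()) {
        return false;
    }
    auto min_seq = *std::min_element(replica_seq_numbers.begin(),
                                     replica_seq_numbers.end());
    return min_seq == current_decision_seq_number;
}
```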
When the load balancer emits a split decision, replicas listen for the need to split with a "split monitor", which is awakened once a table's replication metadata is updated and the need for a split is detected (i.e. the resize_type field is "split").
The split monitor will start the splitting of the table's compaction groups (using the mechanism introduced in 081f30d149). Once the splitting work is completed, the table updates its local state as having completed the split.
When the coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which means updating the tablet metadata to split each tablet in two. Once the table replicas have their replication metadata updated with the new tablet count, they can appropriately update their set of compaction groups (which were previously split in the preparation step).
Fixes #16536.
Closes scylladb/scylladb#16580
* github.com:scylladb/scylladb:
test/topology_experimental_raft: Add tablet split test
replica: Bypass reshape on boot with tablets temporarily
replica: Fix table::compaction_group_for_sstable() for tablet streaming
test/topology_experimental_raft: Disable load balancer in test fencing
replica: Remap compaction groups when tablet split is finalized
service: Split tablet map when split request is finalized
replica: Update table split status if completed split compaction work
storage_service: Implement split monitor
topology_coordinator: Generate updates for resize decisions made by balancer
load_balancer: Introduce metrics for resize decisions
db: Make target tablet size a live-updateable config option
load_balancer: Implement resize decisions
service: Wire table_resize_plan into migration_plan
service: Introduce table_resize_plan
tablet_mutation_builder: Add set_resize_decision()
topology_coordinator: Wire load stats into load balancer
storage_service: Allow tablet split and migration to happen concurrently
topology_coordinator: Periodically retrieve table_load_stats
locator: Introduce topology::get_datacenter_nodes()
storage_service: Implement table_load_stats RPC
replica: Expose table_load_stats in table
replica: Introduce storage_group::live_disk_space_used()
locator: Introduce table_load_stats
tablets: Add resize decision metadata to tablet metadata
locator: Introduce resize_decision
The persisted snapshot index may be 0 if the snapshot was created in an older version of Scylla, which means a snapshot transfer won't be triggered to a bootstrapping node. Commands present in the log may not cover all schema changes --- group 0 might have been created through the upgrade procedure, on a cluster with existing schema. So a deployment with an index=0 snapshot is broken and we need to fix it. We can use the new `raft::server::trigger_snapshot` API for that.
Also add a test.
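A minimal sketch of the new logic, assuming a `trigger_snapshot(abort_source*)` signature on `raft::server` (the exact signature and return type are assumptions); this is not the actual raft_group0 code:

```cpp
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include "raft/server.hh"

seastar::future<> maybe_fix_legacy_snapshot(raft::server& server,
                                            uint64_t persisted_snapshot_index,
                                            seastar::abort_source& as) {
    if (persisted_snapshot_index == 0) {
        // An index-0 snapshot predates the 5.4 "snapshot at index 1" logic;
        // force a fresh snapshot so bootstrapping nodes get a snapshot
        // transfer with the full schema.
        co_await server.trigger_snapshot(&as);
    }
}
```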
Fixes scylladb/scylladb#16683
Closes scylladb/scylladb#17072
* github.com:scylladb/scylladb:
test: add test for fixing a broken group 0 snapshot
raft_group0: trigger snapshot if existing snapshot index is 0
Our time-handling code in UUID_gen.hh is very fragile for very large timestamps, because the different types - such as the Cassandra "timestamp" type and timeuuid - use very different resolutions and ranges.
In issue #17035 we discovered a situation where a certain CQL "timestamp"-type value could cause an assertion failure and a crash in the create_time() function that creates a timeuuid - because that timestamp didn't fit the space we have in a timeuuid.
We already added a limit in the past, UUID_UNIXTIME_MAX, beyond which we refuse timestamps, to avoid these assertion failures. However, we missed the possibility of *negative* timestamps (which are allowed in CQL), and indeed a negative timestamp (or a timestamp which was "wrapped" to a negative value) is what caused issue #17035.
So this patch adds a second limit, UUID_UNIXTIME_MIN - limiting the most negative timestamp that we support to well below the area which causes problems - and adds tests that reproduce #17035 and verify that we didn't break anything else (e.g., negative timestamps are still allowed - just not extremely negative ones).
Fixes #17035.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add error handling to rebuild instead of retrying it until it succeeds.
* 'gleb/rebuild-fail-v2' of github.com:scylladb/scylla-dev:
test: add test for rebuild failure
test: add expected_error to rebuild_node operation
topology_coordinator: Propagate rebuild failure to the initiator
This patch adds a few simple tests for the values of the "date" column type: how it can be initialized from strings or integers, and what those values mean.
Two of the tests reproduce issue #17066, where validation is missing
for values that don't fit in a 32-bit unsigned integer.
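A minimal sketch of the missing validation, assuming the standard CQL "date" encoding (an unsigned 32-bit day count where 2^31 represents 1970-01-01); names are illustrative, not Scylla's serialization code:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Only day counts in [0, 2^32 - 1] are representable by the "date" type;
// anything outside that range should be rejected at validation time.
uint32_t validate_cql_date(int64_t days) {
    if (days < 0 || days > std::numeric_limits<uint32_t>::max()) {
        throw std::out_of_range("value out of range for the CQL date type");
    }
    return static_cast<uint32_t>(days);
}
```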
Refs #17066
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In a cluster whose group 0 has a snapshot at index 0 (such a group 0 might be established in a 5.2 cluster, then preserved once it upgrades to 5.4 or later), no snapshot transfer will be triggered when a node is bootstrapped. This way the new node might not obtain the full schema, or might obtain an incorrect schema, as in scylladb/scylladb#16683.
Simulate this scenario in a test case using the RECOVERY mode and error injections. Check that the newly added logic, which creates a new snapshot when such a situation is detected, helps in this case.
The issues mentioned in the comment above are already fixed.
Unfortunately, there is another, opposite issue which this function can
be used for. The previous issue was about the existing driver session
not reconnecting. The current issue is about the existing driver session
reconnecting too much... (and in the middle of queries.)
Waiting for CQL connections is not enough. For the queries to succeed,
nodes must see each other. We have to wait for this, otherwise the test
will be flaky.
Fixes #17029
Closes scylladb/scylladb#17040
We do not support tablet resharding yet. All tablet-related code assumes that the (host_id, shard) tablet replica is always valid. Violating this leads to undefined behaviour: errors in the tablet load balancer and potential crashes.
Avoid this by refusing to start if the need to reshard is detected. Be as lenient as possible: check all tablets with a replica on this node, and only refuse startup if at least one tablet has an invalid replica shard.
Startup will fail like this:
ERROR 2024-01-26 07:03:06,931 [shard 0:main] init - Startup failed: std::runtime_error (Detected a tablet with invalid replica shard, reducing shard count with tablet-enabled tables is not yet supported. Replace the node instead.)
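A minimal sketch of such a check, with illustrative types (the real check_tablet_replica_shards() works on the actual tablet metadata types):

```cpp
#include <stdexcept>
#include <vector>

struct tablet_replica { int host_index; unsigned shard; };  // stand-in type

void check_tablet_replica_shards(const std::vector<std::vector<tablet_replica>>& tablets,
                                 int this_host, unsigned this_shard_count) {
    for (const auto& replicas : tablets) {
        for (const auto& r : replicas) {
            // Only tablets with a replica on this node are checked; startup
            // is refused only if at least one replica shard is out of range.
            if (r.host_index == this_host && r.shard >= this_shard_count) {
                throw std::runtime_error(
                    "Detected a tablet with invalid replica shard, reducing shard count "
                    "with tablet-enabled tables is not yet supported. Replace the node instead.");
            }
        }
    }
}
```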
Refs: #16739
Fixes: #16843
Closes scylladb/scylladb#17008
* github.com:scylladb/scylladb:
test/topology_experimental_raft: test_tablets.py: add test for resharding
test/pylib: manager[_client]: add update_cmdline()
main: refuse startup when tablet resharding is required
locator: tablets: add check_tablet_replica_shards()
`db::config` is a class that is used in many places across the code base; when it is changed, its clients' code needs to be recompiled. It represents the configuration of the database. Some fields of the configuration that describe the location of directories may be empty. In such cases the `db::config::setup_directories()` function is called - it modifies the provided configuration. Such modification is undesirable - it is better to keep `db::config` intact.
This PR:
- extends the public interface of the utils::directories class to provide the required directory paths to users
- removes 'db::config::setup_directories()' to avoid altering the fields of the configuration object
- replaces usages of the db::config object with the utils::directories object in places that need to obtain directory paths
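A minimal sketch of the resulting interface, with illustrative members (the real class covers more directories and derives its defaults from db::config at construction time):

```cpp
#include <filesystem>
#include <vector>

namespace utils {
class directories {
    std::vector<std::filesystem::path> _data_file_dirs;
    std::filesystem::path _commitlog_dir;
public:
    // Paths are resolved once at construction, so db::config stays untouched
    // and clients no longer need to include or depend on it for paths.
    const std::vector<std::filesystem::path>& get_data_file_dirs() const {
        return _data_file_dirs;
    }
    const std::filesystem::path& get_commitlog_dir() const {
        return _commitlog_dir;
    }
};
}
```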
Fixes: scylladb#5626
Closes scylladb/scylladb#16787
* github.com:scylladb/scylladb:
utils/directories: make utils::directories::set an internal type
db::config: keep dir paths unchanged
cql_transport/controller: use utils::directories to get paths of dirs
service/storage_proxy: use utils::directories to get paths of dirs
api/storage_service.cc: use utils::directories to get paths of dirs
tools/scylla-sstable.cc: use utils::directories to get paths
db/commitlog: do not use db::config to get dirs
Use utils::directories to get dirs paths in replica::database
Allow utils::directories to provide paths to dirs
Clean-up of utils::directories
When a node is in the `left_token_ring` state, we don't know how
it has ended up in this state. We cannot distinguish a node that
has finished decommissioning from a node that has failed bootstrap.
The main problem it causes is that we incorrectly send the
`barrier_and_drain` command to a node that has failed
bootstrapping or replacing. We must do it for a node that has
finished decommissioning because it could still coordinate
requests. However, since we cannot distinguish nodes in the
`left_token_ring` state, we must send the command to all of them.
This issue appeared in scylladb/scylladb#16797 and this PR is
a follow-up that fixes it.
The solution is changing `left_token_ring` from a node state
to a transition state.
Fixes scylladb/scylladb#16944
Closes scylladb/scylladb#17009
* github.com:scylladb/scylladb:
docs: dev: topology-over-raft: document the left_token_ring state
topology_coordinator: adjust reason string in left_token_ring handler
raft topology: make left_token_ring a transition state
topology_coordinator: rollback_current_topology_op: remove unused exclude_nodes
This allows the user of `raft::server` to cause it to create a snapshot and truncate the Raft log (leaving no trailing entries; in the future we may extend the API to specify the number of trailing entries to leave, if needed). In a later commit we'll add a REST endpoint to Scylla to trigger group 0 snapshots.
One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.
In 5.4 the logic of initial group 0 setup was changed to start the Raft log with a snapshot at index 1 (ff386e7a44), but a problem remains with the existing deployments coming from 5.2: we need a way to trigger a snapshot in them (other than performing 1000 arbitrary schema changes).
Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).
The PR adds the API to `raft::server` and an HTTP endpoint that uses it. In a follow-up PR, we plan to modify the group 0 server startup logic to automatically call this API if it sees that no snapshot is present yet (to automatically fix the aforementioned 5.2 deployments once they upgrade).
Closes scylladb/scylladb#16816
* github.com:scylladb/scylladb:
raft: remove `empty()` from `fsm_output`
test: add test for manual triggering of Raft snapshots
api: add HTTP endpoint to trigger Raft snapshots
raft: server: add `trigger_snapshot` API
raft: server: track last persisted snapshot descriptor index
raft: server: framework for handling server requests
raft: server: inline `poll_fsm_output`
raft: server: fix indentation
raft: server: move `io_fiber`'s processing of `batch` to a separate function
raft: move `poll_output()` from `fsm` to `server`
raft: move `_sm_events` from `fsm` to `server`
raft: fsm: remove constructor used only in tests
raft: fsm: move trace message from `poll_output` to `has_output`
raft: fsm: extract `has_output()`
raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
raft: server: pass `*_aborted` to `set_exception` call
This change replaces usage of db::config with
usage of utils::directories to get paths of
directories in service/storage_proxy.
Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
This change removes usage of db::config to
get path of commitlog_directory. Instead, it
introduces a new parameter to directly pass
the path to db::commitlog::config::from_db_config().
Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
This change replaces the usage of db::config with usage of utils::directories to get directory paths in the replica::database class.
Moreover, it adjusts tests that require construction of replica::database - its constructor has been changed to accept a utils::directories object.
Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
This change extends utils::directories class in
the following way:
- adds new member variables that correspond to
fields from db::config that describe paths
of directories
- introduces a public interface to retrieve the
values of the new members
- allows construction of utils::directories
object based on db::config to setup internal
member variables related to paths to dirs
The new members of utils::directories are overridden when the provided values are empty. The way of setting the paths is taken from db::config.
To ensure that the new logic works correctly
`utils_directories_test` has been created.
Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Similar to the existing update_config(). Updates the command-line
arguments of the specified nodes, merging the new options into the
existing ones. Needs a restart to take effect.
A node can be in the `left_token_ring` state after:
- a finished decommission,
- a failed bootstrap,
- a failed replace.
When a node is in the `left_token_ring` state, we don't know how
it has ended up in this state. We cannot distinguish a node that
has finished decommissioning from a node that has failed bootstrap.
The main problem it causes is that we incorrectly send the
`barrier_and_drain` command to a node that has failed
bootstrapping or replacing. We must do it for a node that has
finished decommissioning because it could still coordinate
requests. However, since we cannot distinguish nodes in the
`left_token_ring` state, we must send the command to all of them.
This issue appeared in scylladb/scylladb#16797 and this patch is
a follow-up that fixes it.
The solution is changing `left_token_ring` from a node state
to a transition state.
Regarding implementation, most of the changes are simple refactoring. The less obvious ones are:
- Before this patch, in `system_keyspace::left_topology_state`, we had to keep the ignored nodes' IDs for replace to ensure that the replacing node would have access to them after moving to the `left_token_ring` state, which happens when replace fails. We don't need this workaround anymore. When we enter the new `left_token_ring` transition state, the new node will still be in the `decommissioning` state, so it won't lose its request param.
- Before this patch, a decommissioning node lost its tokens
while moving to the `left_token_ring` state. After the patch, it
loses tokens while still being in the `decommissioning` state. We
ensure that all `decommissioning` handlers correctly handle a node
that lost its tokens.
Moving the `left_token_ring` handler from `handle_node_transition`
to `handle_topology_transition` created a large diff. There are
only three changes:
- adding `auto node = get_node_to_work_on(std::move(guard));`,
- adding `builder.del_transition_state()`,
- changing error logged when `global_token_metadata_barrier` fails.
before this change, we relied on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter.
in this change, we define formatters for `db::replay_position`,
and drop its operator<<.
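A sketch of what such a fmt v10 formatter looks like; the member names of db::replay_position below are stand-ins, not the real fields:

```cpp
#include <cstdint>
#include <string_view>
#include <fmt/format.h>

namespace db { struct replay_position { uint64_t id; uint32_t pos; }; }  // stand-in

template <>
struct fmt::formatter<db::replay_position> : fmt::formatter<std::string_view> {
    // parse() is inherited from the string_view formatter; only format()
    // is custom. Prints e.g. "{rp 42:1000}"; the real format may differ.
    auto format(const db::replay_position& rp, fmt::format_context& ctx) const {
        return fmt::format_to(ctx.out(), "{{rp {}:{}}}", rp.id, rp.pos);
    }
};
```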
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17014
This series makes a similar change to Alternator as was done recently to CQL:
1. If the "tablets" experimental feature is enabled, new Alternator tables will use tablets automatically, without requiring an option on each new table. A default choice of initial_tablets is used. These choices can still be overridden per-table if the user wants to.
2. In particular, all test/alternator tests will also automatically run with tablets enabled.
3. However, some tests will fail on tablets because they use features that haven't yet been implemented with tablets - namely Alternator Streams (Refs #16317) and Alternator TTL (Refs #16567). These tests will - until those features are implemented with tablets - continue to run without tablets.
4. An option is added to the test/alternator/run script to allow developers to manually run tests without tablets, if they wish to (this option will be useful in the short term, and can be removed later).
Fixes #16355
Closes scylladb/scylladb#16900
* github.com:scylladb/scylladb:
test/alternator: add "--vnodes" option to run script
alternator: use tablets by default, if available
test/alternator: run some tests without tablets
In this commit, we postpone the start-up
of the hint manager until we obtain information
about other nodes in the cluster.
When we start the hint managers, one of the things that happens is the creation of endpoint managers -- structures managed by db::hints::manager. Whether we create an instance of an endpoint manager depends on the value returned by host_filter::can_hint_for, which, in turn, may depend on the current state of locator::topology.
If locator::topology is incomplete, some endpoint managers may not be started even though they should be (because the target node IS part of the cluster and we SHOULD send hints to it if there are some).
A situation like that can happen because we start the hint managers too early. This commit aims to solve that problem: we only start the hint managers once we've gathered information about the other nodes in the cluster and created the locator::topology using it.
Hinted handoff is not negatively affected by these changes since, between the previous point of starting the hint managers and the current one, all of the mutations performed by service::storage_proxy target the local node, so no hints would need to be generated anyway.
Fixes scylladb/scylladb#11870
Closes scylladb/scylladb#16511
this RESTful API is a scylla-specific extension and is only used by scylla-nodetool. currently, the java-based nodetool does not use it at all, so mark it with "scylla_only".
one can verify this change with:
```
pytest --mode=debug --nodetool=cassandra test_cleanup.py::test_cleanup
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17001
we should allow users to run nodetool tests without `test.py`. but there is a good chance that the host could be reused by multiple tests or by multiple users, who could all be using port 12345. by randomizing the IP and port, they have a better chance of completing the test without running into a used-port problem.
Closes scylladb/scylladb#16996
* github.com:scylladb/scylladb:
test/nodetool: return a randomized address if not running with unshare
test/nodetool: return an address from loopback_network fixture
In this mode, the node is not reachable from the outside, i.e.
* it refuses all incoming RPC connections,
* it does not join the cluster, thus
* all group0 operations are disabled (e.g. schema changes),
* all cluster-wide operations are disabled for this node (e.g. repair),
* other nodes see this node as dead,
* it cannot read or write data from/to other nodes,
* it does not open Alternator and Redis transport ports and the TCP CQL port.
The only way to make CQL queries is to use the maintenance socket. The node serves only local data.
To start the node in maintenance mode, use the `--maintenance-mode true` flag or set `maintenance_mode: true` in the configuration file.
The REST API works as usual, but some routes are disabled:
* authorization_cache
* failure_detector
* hinted_hand_off_manager
This PR also updates the maintenance socket documentation:
* add cqlsh usage to the documentation
* update the documentation to use `WhiteListRoundRobinPolicy`
Fixes #5489.
Closes scylladb/scylladb#15346
* github.com:scylladb/scylladb:
test.py: add test for maintenance mode
test.py: generalize usage of cluster_con
test.py: when connecting to node in maintenance mode use maintenance socket
docs: add maintenance mode documentation
main: add maintenance mode
main: move some REST routes initialization before joining group0
message_service: add sanity check that rpc connections are not created in the maintenance mode
raft_group0_client: disable group0 operations in the maintenance mode
service/storage_service: add start_maintenance_mode() method
storage_service: add MAINTENANCE option to mode enum
service/maintenance_mode: add maintenance_mode_enabled bool class
service/maintenance_mode: move maintenance_socket_enabled definition to separate file
db/config: add maintenance mode flag
docs: add cqlsh usage to maintenance socket documentation
docs: update maintenance socket documentation to use WhiteListRoundRobinPolicy
we should allow users to run nodetool tests without `test.py`. but there is a good chance that the host could be reused by multiple tests or by multiple users, who could all be using port 12345. by randomizing the IP and port, they have a better chance of completing the test without running into a used-port problem.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
* rename "maybe_setup_loopback_network" to "server_address"
* return an address from the fixture
this change prepares for bringing back the randomized IP and port. in case users run this test without test.py, randomizing the IP and port gives them a better chance of completing the test without running into a used-port problem.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This is easier to reproduce after the changes in the load balancer to emit resize decisions, which in turn result in the topology version being incremented, and that might race with fencing tests that manipulate the topology version manually.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This implements the ability of the load balancer to emit split or merge requests, cancel ongoing ones if they're no longer needed, and also finalize those that are ready for the topology changes.
That's all based on the average tablet size, collected by the coordinator from all nodes, and the split and merge thresholds.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The new metadata describes the ongoing resize operation (which can be merge, split, or none) that spans the tablets of a given table.
It's managed by group0, so nodes that are down will be able to see the decision when they come back up and see the changes to the metadata.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
To avoid data resurrection, mutations deleted by cleanup operations should be skipped during commitlog replay.
This series implements the above for tablet cleanups, by using a new system table which holds records of cleanup operations.
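A minimal sketch of the replay-time filter, with illustrative types (the real system.commitlog_cleanups records are keyed per table and token range):

```cpp
#include <compare>
#include <cstdint>
#include <map>

// Stand-in for db::replay_position: ordered by segment id, then offset.
struct replay_position {
    uint64_t id;
    uint32_t pos;
    auto operator<=>(const replay_position&) const = default;
};

// Latest cleanup position per (table, token-range) key, loaded from
// system.commitlog_cleanups before replay starts.
using cleanup_index = std::map<uint64_t, replay_position>;

// Drop a replayed mutation if a cleanup ran at or after the position it was
// written at; replaying it would resurrect data the cleanup already deleted.
bool skip_on_replay(const cleanup_index& cleanups, uint64_t key,
                    replay_position written_at) {
    auto it = cleanups.find(key);
    return it != cleanups.end() && written_at <= it->second;
}
```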
Fixes #16752
Closes scylladb/scylladb#16888
* github.com:scylladb/scylladb:
test: test_tablets: add a test for cleanup after migration
test: pylib: add ScyllaCluster.wipe_sstables
test: boost: add commitlog_cleanup_test
db: commitlog_replayer: ignore mutations affected by (tablet) cleanups
replica: table: garbage-collect irrelevant system.commitlog_cleanups records
db: commitlog: add min_position()
replica: table: populate system.commitlog_cleanups on tablet cleanup
db: system_keyspace: add system.commitlog_cleanups
replica: table: refresh compound sstable set after tablet cleanup
We didn't send the `barrier_and_drain` command to a
decommissioning node that could still be coordinating requests. It
could happen that a decommissioning node sent a request with an
old topology version after normal nodes received the new fence
version. Then, the request would fail on replicas with the stale
topology exception.
This PR fixes this problem by modifying `exec_global_command`.
From now on, it sends `barrier_and_drain` to a decommissioning
node.
We also stop filtering stale topology exceptions in
`test_topology_ops`. We added this filter after detecting the bug
fixed by this PR.
Fixes scylladb/scylladb#15804
Fixes scylladb/scylladb#16579
Fixes scylladb/scylladb#16642
Closes scylladb/scylladb#16797
* github.com:scylladb/scylladb:
test: test_topology_ops: remove failed mutations filter
raft topology: send barrier_and_drain to a decommissioning node
raft topology: ensure at most one transitioning node
before this change, we used a random address when launching the rest_api_mock server, but there was a chance that the randomly picked address conflicted with an already-used address on the host. the subprocess fails right away with a returncode of 1 upon this failure, but we just continued on and checked the readiness of the already-dead server. we have actually seen test failures caused by the EADDRINUSE failure: when we checked the readiness of the rest_api_mock by sending an HTTP request and reading the response, what we got was not a JSON-encoded response but a webpage, which was likely one returned by a minio server.
in this change, we
* set the "launcher" option of the nodetool test suite to "unshare", so that all its tests are launched in separate namespaces.
* do not use a random address for the mock server, as the network namespaces are separated.
Fixes #16542
Closes scylladb/scylladb#16773
* github.com:scylladb/scylladb:
test/nodetool: run nodetool tests using "unshare"
test.py: add "launcher" option support
The test checks that in maintenance mode, server A is not available to other nodes or to clients. It is possible to connect to server A through the maintenance socket and perform local CQL operations.
A node in maintenance mode doesn't have the regular CQL port open; to connect to the node, the scylla cluster needs to use the node's maintenance socket.
In maintenance mode, the node doesn't communicate with other nodes, so it doesn't start or apply group0 operations. Users can still try to start one, e.g. a schema change, but the node won't allow it.
Init _upgrade_state with recovery in maintenance mode.
Throw an error if a group0 operation is started in maintenance mode.
before this change, we always cast the wait duration to milliseconds, even if it could be using a higher resolution. `std::chrono::steady_clock` actually uses nanoseconds for its duration, so if we inject a deadline using `steady_clock`, we could be woken earlier due to the narrowing of the duration type caused by the duration_cast.
in this change, we just use the duration as it is. this should allow the caller to use the resolution provided by Seastar without losing precision. the tests are updated to print the time duration instead of the count, to provide information with a higher resolution.
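a tiny standalone illustration of the truncation (not Scylla code):

```cpp
#include <chrono>
#include <fmt/chrono.h>
#include <fmt/core.h>

using namespace std::chrono;
using namespace std::chrono_literals;

int main() {
    // steady_clock durations are nanosecond-based; casting to milliseconds
    // truncates, so a sleeper could be woken earlier than requested.
    auto requested = 1500us;
    auto truncated = duration_cast<milliseconds>(requested);
    fmt::print("requested {} but slept {}\n", requested, truncated);  // 1500µs vs 1ms
}
```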
Fixes #15902
Closes scylladb/scylladb#16264
* github.com:scylladb/scylladb:
tests: utils: error injection: print time duration instead of count
error_injection: do not cast to milliseconds when injecting timeout
New tablet replicas are allocated and rebuilt synchronously with node
operations. They are safely rebuilt from all existing replicas.
The list of ignored nodes passed to node operations is respected.
The tablet scheduler is responsible for scheduling the tablet rebuild transition, which changes the replica set. The infrastructure for handling decommission in the tablet scheduler is reused for this.
Scheduling is done incrementally, respecting per-shard load limits. Rebuild transitions are recognized by the load calculation as affecting all tablet replicas.
A new kind of tablet transition called "rebuild" is introduced, which adds a new tablet replica and rebuilds it from the existing replicas. Other than that, the transition goes through the same stages as a regular migration, to ensure safe synchronization with request coordinators.
In this PR we simply stream from all tablet replicas. Later we should
switch to calling repair to avoid sending excessive amounts of data.
Fixes https://github.com/scylladb/scylladb/issues/16690.
Closes scylladb/scylladb#16894
* github.com:scylladb/scylladb:
tests: tablets: Add tests for removenode and replace
tablets: Add support for removenode and replace handling
topology_coordinator: tablets: Do not fail in a tight loop
topology_coordinator: tablets: Avoid warnings about ignored failed future
storage_service, topology: Track excluded state in locator::topology
raft topology: Introduce param-less topology::get_excluded_nodes()
raft topology: Move get_excluded_nodes() to topology
tablets: load_balancer: Generalize load tracking
tablets: Introduce get_migration_streaming_info() which works on migration request
tablets: Move migration_to_transition_info() to tablets.hh
tablets: Extract get_new_replicas() which works on migration request
tablets: Move tablet_migration_info to tablets.hh
tablets: Store transition kind per tablet
We added this filter after detecting a bug in the Raft-based
topology. We weren't sending `barrier_and_drain` commands to a
decommissioning node that could still be coordinating requests.
It could cause stale topology exceptions on replicas if the
decommissioning node sent a request with an old topology version
after normal nodes received the new fence version.
This bug has been fixed in the previous commit, so we remove the
filter.