Commit Graph

6240 Commits

Botond Dénes
2a4b991772 Merge 'Fix mintimeuuid() call that could crash Scylla' from Nadav Har'El
This PR fixes a bug where certain calls to the `mintimeuuid()` CQL function with large negative timestamps could crash Scylla. It turns out we already had protections in place against very large positive timestamps, but very negative timestamps could still cause bugs.

The actual fix in this series is just a few lines; the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp(), which takes a date parameter), and also added reproducers for this bug, one calling mintimeuuid() directly without the totimestamp() function, and one with that function.

Finally, this PR also replaces the assert() that turned this molehill of a bug into a mountain with a throw.

Fixes #17035

Closes scylladb/scylladb#17073

* github.com:scylladb/scylladb:
  utils: replace assert() by on_internal_error()
  utils: add on_internal_error with common logger
  utils: add a timeuuid minimum, like we had maximum
  test/cql-pytest: tests for "date" type
2024-02-01 10:48:48 +02:00
Asias He
2888c3086c utils: Add uuid_xor_to_uint32 helper
Convert the UUID to a uint32_t using xor.
This is useful when a uint32_t number derived from the UUID is needed.
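
A minimal sketch of how such a helper might fold the two 64-bit halves of a UUID into 32 bits (the folding order and the way the halves are obtained are assumptions, not the actual implementation):
```cpp
#include <cstdint>

// Sketch: fold the two 64-bit halves of a UUID into a uint32_t by XOR.
uint32_t uuid_xor_to_uint32(uint64_t msb, uint64_t lsb) {
    uint64_t folded = msb ^ lsb;                       // 128 -> 64 bits
    return uint32_t(folded) ^ uint32_t(folded >> 32);  // 64 -> 32 bits
}
```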

Refs: #16927

Closes scylladb/scylladb#17049
2024-02-01 10:27:55 +02:00
Pavel Emelyanov
7c5c89ba8d Revert "Merge 'Use utils::directories instead of db::config to get dirs' from Patryk Wróbel"
This reverts commit 370fbd346c, reversing
changes made to 0912d2a2c6.

This makes scylla-manager mis-interpret the data_file_directories
somehow; see issue #17078.
2024-01-31 15:08:14 +03:00
Avi Kivity
c8397f0287 Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho
The motivation for tablet resizing is that we want to keep the average tablet size reasonable, so that load rebalancing can remain efficient. Too large a tablet makes migration inefficient, thereby slowing down the balancer.

If the avg size grows beyond the upper bound (the split threshold), the balancer decides to split. A split spans all tablets of a table, due to the power-of-two constraint.

Likewise, if the avg size decreases below the lower bound (the merge threshold), a merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays the foundation for it to be implemented later on.

A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say a table is being split and the avg size drops below the target size (which is 50% of the split threshold and 100% of the merge one). That means that after the split, the avg size would drop below the merge threshold, causing a merge right after the split, which is wasteful, so it's better to just cancel the split.

Tablet metadata gains 2 new fields for managing this:
- resize_type: the resize decision type; can be either "merge", "split", or "none".
- resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator).

A new RPC was implemented to pull stats from each table replica, so that the load balancer can calculate the avg tablet size and know the "split status" for a given table. The avg size is aggregated carefully, taking the RF of each DC into account (which might differ).
When a table is done splitting its storage, it loads (mirrors) the resize_seq_number from tablet metadata into its local state (in other words, "my split status is ready"). If a table is split-ready, the coordinator will see that the table's seq number is the same as the one in tablet metadata. This helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emitted later on). The status is also aggregated carefully, by taking the minimum among all replicas, so the coordinator will only update topology when all replicas are ready.

When the load balancer emits a split decision, replicas listen for the need to split with a "split monitor" that is awakened once a table has its replication metadata updated, and detects the need for a split (i.e. the resize_type field is "split").
The split monitor starts the splitting of compaction groups (using the mechanism introduced in 081f30d149) for the table. Once the splitting work is completed, the table updates its local state as having completed the split.

When the coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which means updating tablet metadata to split each tablet in two. Once table replicas have their replication metadata updated with the new tablet count, they can appropriately update their set of compaction groups (which were previously split in the preparation step).
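
In outline, the decision rule the balancer applies amounts to comparing the average tablet size against the two thresholds. A minimal sketch (names and structure are illustrative, not the actual load-balancer code, which also carries the seq number, cancellation, and finalization):
```cpp
#include <cstdint>

enum class resize_type { none, split, merge };

// Illustrative decision rule only; thresholds come from config in reality.
resize_type decide(uint64_t avg_tablet_size,
                   uint64_t split_threshold,    // upper bound on avg size
                   uint64_t merge_threshold) {  // lower bound on avg size
    if (avg_tablet_size > split_threshold) {
        return resize_type::split;   // doubling tablet count halves avg size
    }
    if (avg_tablet_size < merge_threshold) {
        return resize_type::merge;   // halving tablet count grows avg size
    }
    return resize_type::none;
}
```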

Fixes #16536.

Closes scylladb/scylladb#16580

* github.com:scylladb/scylladb:
  test/topology_experimental_raft: Add tablet split test
  replica: Bypass reshape on boot with tablets temporarily
  replica: Fix table::compaction_group_for_sstable() for tablet streaming
  test/topology_experimental_raft: Disable load balancer in test fencing
  replica: Remap compaction groups when tablet split is finalized
  service: Split tablet map when split request is finalized
  replica: Update table split status if completed split compaction work
  storage_service: Implement split monitor
  topology_cordinator: Generate updates for resize decisions made by balancer
  load_balancer: Introduce metrics for resize decisions
  db: Make target tablet size a live-updateable config option
  load_balancer: Implement resize decisions
  service: Wire table_resize_plan into migration_plan
  service: Introduce table_resize_plan
  tablet_mutation_builder: Add set_resize_decision()
  topology_coordinator: Wire load stats into load balancer
  storage_service: Allow tablet split and migration to happen concurrently
  topology_coordinator: Periodically retrieve table_load_stats
  locator: Introduce topology::get_datacenter_nodes()
  storage_service: Implement table_load_stats RPC
  replica: Expose table_load_stats in table
  replica: Introduce storage_group::live_disk_space_used()
  locator: Introduce table_load_stats
  tablets: Add resize decision metadata to tablet metadata
  locator: Introduce resize_decision
2024-01-31 13:59:56 +02:00
Botond Dénes
181f68f248 Merge 'raft_group0: trigger snapshot if existing snapshot index is 0' from Kamil Braun
The persisted snapshot index may be 0 if the snapshot was created in an
older version of Scylla, which means snapshot transfer won't be
triggered to a bootstrapping node. Commands present in the log may not
cover all schema changes --- group 0 might have been created through the
upgrade procedure, on a cluster with existing schema. So a
deployment with an index=0 snapshot is broken, and we need to fix it. We can
use the new `raft::server::trigger_snapshot` API for that.

Also add a test.
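
In outline, the fix has this shape (`trigger_snapshot` is the API named above; the server type and index accessor are stubbed assumptions):
```cpp
#include <cstdint>

// Stub standing in for raft::server: trigger_snapshot is the API named
// above; the snapshot-index accessor is an assumption.
struct raft_server {
    uint64_t snapshot_index = 0;               // e.g. loaded from storage
    uint64_t persisted_snapshot_index() const { return snapshot_index; }
    void trigger_snapshot() { /* snapshot + truncate the Raft log */ }
};

// On group 0 startup: an index-0 snapshot predates the 5.4 "snapshot at
// index 1" logic and would never be transferred to bootstrapping nodes,
// so force a fresh snapshot.
void maybe_fix_legacy_snapshot(raft_server& srv) {
    if (srv.persisted_snapshot_index() == 0) {
        srv.trigger_snapshot();
    }
}
```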

Fixes scylladb/scylladb#16683

Closes scylladb/scylladb#17072

* github.com:scylladb/scylladb:
  test: add test for fixing a broken group 0 snapshot
  raft_group0: trigger snapshot if existing snapshot index is 0
2024-01-31 13:04:59 +02:00
Nadav Har'El
827c20467c utils: add a timeuuid minimum, like we had maximum
Our time-handling code in UUID_gen.hh is very fragile for very large
timestamps, because the different types - such as Cassandra "timestamp"
and timeuuid - use very different resolutions and ranges.

In issue #17035 we discovered a situation where a certain CQL
"timestamp"-type value could cause an assertion-failure and a crash
in the create_time() function that creates a timeuuid - because that
timestamp didn't fit the place we have in timeuuid.

We already added in the past a limit, UUID_UNIXTIME_MAX, beyond which
we refuse timestamps, to avoid these assertion failures. However, we
missed the possibility of *negative* timestamps (which are allowed in
CQL), and indeed a negative timestamp (or a timestamp which was "wrapped"
to a negative value) is what caused issue #17035.

So this patch adds a second limit, UUID_UNIXTIME_MIN - limiting the
most negative timestamp that we support to well below the area which
causes problems - and adds tests that reproduce #17035 and check that we
didn't break anything else (e.g., negative timestamps are still
allowed - just not extremely negative timestamps).
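
A minimal sketch of the resulting bounds check (the limit values live in UUID_gen.hh and are not reproduced here, so they are taken as parameters; names are illustrative):
```cpp
#include <cstdint>
#include <stdexcept>

// Sketch: the real UUID_UNIXTIME_MIN/MAX constants are defined in
// UUID_gen.hh; they are parameters here since their values aren't shown.
void check_timestamp_for_timeuuid(int64_t unixtime_ms,
                                  int64_t uuid_unixtime_min,
                                  int64_t uuid_unixtime_max) {
    // throw (instead of assert()) so an out-of-range CQL timestamp becomes
    // a reportable error rather than a node crash
    if (unixtime_ms < uuid_unixtime_min || unixtime_ms > uuid_unixtime_max) {
        throw std::runtime_error("timestamp out of range for timeuuid");
    }
}
```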

Fixes #17035.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-01-31 11:32:26 +02:00
Kamil Braun
bb22e06a9e Merge 'abort failed rebuild instead of retrying it forever' from Gleb
Add error handling to rebuild instead of retrying it until it succeeds.

* 'gleb/rebuild-fail-v2' of github.com:scylladb/scylla-dev:
  test: add test for rebuild failure
  test: add expected_error to rebuild_node operation
  topology_coordinator: Propagate rebuild failure to the initiator
2024-01-31 10:07:28 +01:00
Nadav Har'El
47955642d9 test/cql-pytest: tests for "date" type
This patch adds a few simple tests for the values of the "date" column
type, how it can be initialized from strings or integers, and what
those values mean.

Two of the tests reproduce issue #17066, where validation is missing
for values that don't fit in a 32-bit unsigned integer.

Refs #17066

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-01-31 10:58:02 +02:00
Kamil Braun
57d5aa5a68 test: add test for fixing a broken group 0 snapshot
In a cluster with a group 0 snapshot at index 0 (such a group 0 might
be established in a 5.2 cluster, then preserved once it upgrades to 5.4
or later), no snapshot transfer will be triggered when a node is
bootstrapped. This way the new node might not obtain the full schema, or
might obtain incorrect schema, like in scylladb/scylladb#16683.

Simulate this scenario in a test case using the RECOVERY mode and error
injections. Check that the newly added logic for creating a new snapshot
if such situation is detected helps in this case.
2024-01-30 16:44:01 +01:00
Kamil Braun
74bf60a8ca test_raft_snapshot_request: fix flakiness
Add workaround for scylladb/python-driver#295.

Also, an assertion made at the end of the test was false; it is fixed,
with an appropriate comment added.
2024-01-30 16:21:24 +01:00
Kamil Braun
39339b9f70 test: topology/util: update comment for reconnect_driver
The issues previously mentioned in the comment are already fixed.
Unfortunately, there is another, opposite issue which this function can
be used for. The previous issue was about the existing driver session
not reconnecting. The current issue is about the existing driver session
reconnecting too much... (and in the middle of queries.)
2024-01-30 15:36:48 +01:00
Kamil Braun
cf3f26dc94 test_maintenance_mode: fix flakiness
Wait until CQL is available and nodes see each other before trying to
perform a query.

Closes scylladb/scylladb#17059
2024-01-30 14:11:14 +02:00
Gleb Natapov
8b50613465 test: add test for rebuild failure 2024-01-30 11:04:19 +02:00
Gleb Natapov
d62204e758 test: add expected_error to rebuild_node operation 2024-01-30 11:04:19 +02:00
Michał Chojnowski
904bb25987 test: test_tablet_cleanup: wait for servers to see each other before multi-node queries
Waiting for CQL connections is not enough. For the queries to succeed,
nodes must see each other. We have to wait for this, otherwise the test
will be flaky.

Fixes #17029

Closes scylladb/scylladb#17040
2024-01-30 08:56:01 +02:00
Tomasz Grabiec
36f218c83b Merge 'main: refuse startup when tablet resharding is required' from Botond Dénes
We do not support tablet resharding yet. All tablet-related code assumes that the (host_id, shard) tablet replica is always valid. Violating this leads to undefined behaviour: errors in the tablet load balancer and potential crashes.
Avoid this by refusing to start if the need for resharding is detected. Be as lenient as possible: check all tablets with a replica on this node, and only refuse startup if at least one tablet has an invalid replica shard.

Startup will fail as:

    ERROR 2024-01-26 07:03:06,931 [shard 0:main] init - Startup failed: std::runtime_error (Detected a tablet with invalid replica shard, reducing shard count with tablet-enabled tables is not yet supported. Replace the node instead.)
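
Sketched, the check has roughly this shape (the tablet/replica types are stubbed; only the error message is taken from the startup log above):
```cpp
#include <stdexcept>
#include <vector>

// Stub standing in for the real tablet replica metadata.
struct tablet_replica {
    unsigned shard;
};

void check_tablet_replica_shards(const std::vector<tablet_replica>& local_replicas,
                                 unsigned this_shard_count) {
    for (const auto& r : local_replicas) {
        // a replica shard >= the current shard count means the node was
        // restarted with fewer shards than the tablet metadata expects
        if (r.shard >= this_shard_count) {
            throw std::runtime_error(
                "Detected a tablet with invalid replica shard, reducing shard "
                "count with tablet-enabled tables is not yet supported. "
                "Replace the node instead.");
        }
    }
}
```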

Refs: #16739
Fixes: #16843

Closes scylladb/scylladb#17008

* github.com:scylladb/scylladb:
  test/topolgy_experimental_raft: test_tablets.py: add test for resharding
  test/pylib: manager[_client]: add update_cmdline()
  main: refuse startup when tablet resharding is required
  locator: tablets: add check_tablet_replica_shards()
2024-01-29 23:39:41 +01:00
Pavel Emelyanov
370fbd346c Merge 'Use utils::directories instead of db::config to get dirs' from Patryk Wróbel
`db::config` is a class that is used in many places across the code base. When it is changed, its clients' code needs to be recompiled. It represents the configuration of the database. Some fields of the configuration that describe the locations of directories may be empty. In such cases the `db::config::setup_directories()` function is called - it modifies the provided configuration. Such modification is not good - it is better to keep `db::config` intact.

This PR:
 - extends the public interface of utils::directories class to provide required directory paths to the users
 - removes 'db::config::setup_directories()' to avoid altering the fields of configuration object
 - replaces usages of db::config object with utils::directories object in places that require obtaining paths to dirs

Fixes: scylladb#5626

Closes scylladb/scylladb#16787

* github.com:scylladb/scylladb:
  utils/directories: make utils::directories::set an internal type
  db::config: keep dir paths unchanged
  cql_transport/controler: use utils::directories to get paths of dirs
  service/storage_proxy: use utils::directories to get paths of dirs
  api/storage_service.cc: use utils::directories to get paths of dirs
  tools/scylla-sstable.cc: use utils::directories to get paths
  db/commitlog: do not use db::config to get dirs
  Use utils::directories to get dirs paths in replica::database
  Allow utils::directories to provide paths to dirs
  Clean-up of utils::directories
2024-01-29 18:01:15 +03:00
Kamil Braun
0912d2a2c6 Merge 'raft topology: make left_token_ring a transition state' from Patryk Jędrzejczak
When a node is in the `left_token_ring` state, we don't know how
it has ended up in this state. We cannot distinguish a node that
has finished decommissioning from a node that has failed bootstrap.

The main problem it causes is that we incorrectly send the
`barrier_and_drain` command to a node that has failed
bootstrapping or replacing. We must do it for a node that has
finished decommissioning because it could still coordinate
requests. However, since we cannot distinguish nodes in the
`left_token_ring` state, we must send the command to all of them.
This issue appeared in scylladb/scylladb#16797 and this PR is
a follow-up that fixes it.

The solution is changing `left_token_ring` from a node state
to a transition state.

Fixes scylladb/scylladb#16944

Closes scylladb/scylladb#17009

* github.com:scylladb/scylladb:
  docs: dev: topology-over-raft: document the left_token_ring state
  topology_coordinator: adjust reason string in left_token_ring handler
  raft topology: make left_token_ring a transition state
  topology_coordinator: rollback_current_topology_op: remove unused exclude_nodes
2024-01-29 15:29:01 +01:00
Botond Dénes
d202d32f81 Merge 'Add an API to trigger snapshot in Raft servers' from Kamil Braun
This allows the user of `raft::server` to cause it to create a snapshot
and truncate the Raft log (leaving no trailing entries; in the future we
may extend the API to specify number of trailing entries left if
needed). In a later commit we'll add a REST endpoint to Scylla to
trigger group 0 snapshots.

One use case for this API is to create group 0 snapshots in Scylla
deployments which upgraded to Raft in version 5.2 and started with an
empty Raft log with no snapshot at the beginning. This causes problems,
e.g. when a new node bootstraps to the cluster, it will not receive a
snapshot that would contain both schema and group 0 history, which would
then lead to inconsistent schema state and trigger assertion failures as
observed in scylladb/scylladb#16683.

In 5.4 the logic of initial group 0 setup was changed to start the Raft
log with a snapshot at index 1 (ff386e7a44),
but a problem remains with these existing deployments coming from 5.2:
we need a way to trigger a snapshot in them (other than performing 1000
arbitrary schema changes).

Another potential use case in the future would be to trigger snapshots
based on external memory pressure in tablet Raft groups (for strongly
consistent tables).

The PR adds the API to `raft::server` and a HTTP endpoint that uses it.

In a follow-up PR, we plan to modify group 0 server startup logic to automatically
call this API if it sees that no snapshot is present yet (to automatically
fix the aforementioned 5.2 deployments once they upgrade.)

Closes scylladb/scylladb#16816

* github.com:scylladb/scylladb:
  raft: remove `empty()` from `fsm_output`
  test: add test for manual triggering of Raft snapshots
  api: add HTTP endpoint to trigger Raft snapshots
  raft: server: add `trigger_snapshot` API
  raft: server: track last persisted snapshot descriptor index
  raft: server: framework for handling server requests
  raft: server: inline `poll_fsm_output`
  raft: server: fix indentation
  raft: server: move `io_fiber`'s processing of `batch` to a separate function
  raft: move `poll_output()` from `fsm` to `server`
  raft: move `_sm_events` from `fsm` to `server`
  raft: fsm: remove constructor used only in tests
  raft: fsm: move trace message from `poll_output` to `has_output`
  raft: fsm: extract `has_output()`
  raft: pass `max_trailing_entries` through `fsm_output` to `store_snapshot_descriptor`
  raft: server: pass `*_aborted` to `set_exception` call
2024-01-29 15:06:04 +02:00
Patryk Wrobel
f08768e767 service/storage_proxy: use utils::directories to get paths of dirs
This change replaces usage of db::config with
usage of utils::directories to get paths of
directories in service/storage_proxy.

Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-01-29 13:11:33 +01:00
Patryk Wrobel
804afffb11 db/commitlog: do not use db::config to get dirs
This change removes usage of db::config to
get path of commitlog_directory. Instead, it
introduces a new parameter to directly pass
the path to db::commitlog::config::from_db_config().

Refs: scylladb#5626

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-01-29 13:11:33 +01:00
Patryk Wrobel
9483d149af Use utils::directories to get dirs paths in replica::database
This change replaces the usage of db::config with
usage of utils::directories to get dirs paths in
replica::database class.

Moreover, it adjusts tests that require construction
of replica::database - its constructor has been
changed to accept utils::directories object.

Refs: scylladb#5626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-01-29 13:11:33 +01:00
Patryk Wrobel
1cd676e438 Allow utils::directories to provide paths to dirs
This change extends utils::directories class in
the following way:
 - adds new member variables that correspond to
   fields from db::config that describe paths
   of directories
 - introduces a public interface to retrieve the
   values of the new members
 - allows construction of utils::directories
   object based on db::config to setup internal
   member variables related to paths to dirs

The new members of utils::directories are overridden
when the provided values are empty. The way of setting
the paths is taken from db::config.
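
A sketch of the shape of the extended interface (getter names and types are assumptions):
```cpp
#include <string>
#include <vector>

// Sketch only: getter names and types are assumptions, not the actual API.
namespace utils {

class directories {
    std::vector<std::string> _data_file_dirs;
    std::string _commitlog_dir;
    std::string _hints_dir;
public:
    // Paths are resolved from db::config once, at construction time, so
    // the config object itself is never modified afterwards.
    const std::vector<std::string>& get_data_file_dirs() const { return _data_file_dirs; }
    const std::string& get_commitlog_dir() const { return _commitlog_dir; }
    const std::string& get_hints_dir() const { return _hints_dir; }
};

} // namespace utils
```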

To ensure that the new logic works correctly
`utils_directories_test` has been created.

Refs: scylladb#5626

Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-01-29 13:11:33 +01:00
Botond Dénes
fd66ce1591 test/topolgy_experimental_raft: test_tablets.py: add test for resharding
Check that scylla refuses to start when the shard count is reduced.
2024-01-29 07:04:33 -05:00
Botond Dénes
a7a5aada2a test/pylib: manager[_client]: add update_cmdline()
Similar to the existing update_config(). Updates the command-line
arguments of the specified nodes, merging the new options into the
existing ones. Needs a restart to take effect.
2024-01-29 07:04:33 -05:00
Patryk Jędrzejczak
b0eef50b2e raft topology: make left_token_ring a transition state
A node can be in the `left_token_ring` state after:
- a finished decommission,
- a failed bootstrap,
- a failed replace.

When a node is in the `left_token_ring` state, we don't know how
it has ended up in this state. We cannot distinguish a node that
has finished decommissioning from a node that has failed bootstrap.

The main problem it causes is that we incorrectly send the
`barrier_and_drain` command to a node that has failed
bootstrapping or replacing. We must do it for a node that has
finished decommissioning because it could still coordinate
requests. However, since we cannot distinguish nodes in the
`left_token_ring` state, we must send the command to all of them.
This issue appeared in scylladb/scylladb#16797 and this patch is
a follow-up that fixes it.

The solution is changing `left_token_ring` from a node state
to a transition state.

Regarding implementation, most of the changes are simple
refactoring. The less obvious are:
- Before this patch, in `system_keyspace::left_topology_state`, we
had to keep the ignored nodes' IDs for replace to ensure that the
replacing node will have access to it after moving to the
`left_token_ring` state, which happens when replace fails. We
don't need this workaround anymore. When we enter the new
`left_token_ring` transition state, the new node will still be in
the `decommissioning` state, so it won't lose its request param.
- Before this patch, a decommissioning node lost its tokens
while moving to the `left_token_ring` state. After the patch, it
loses tokens while still being in the `decommissioning` state. We
ensure that all `decommissioning` handlers correctly handle a node
that lost its tokens.

Moving the `left_token_ring` handler from `handle_node_transition`
to `handle_topology_transition` created a large diff. There are
only three changes:
- adding `auto node = get_node_to_work_on(std::move(guard));`,
- adding `builder.del_transition_state()`,
- changing error logged when `global_token_metadata_barrier` fails.
2024-01-29 10:39:07 +01:00
Kefu Chai
8f38bd5376 commitlog: add formatter for db::replay_position
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `db::replay_position`,
and drop its operator<<.
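
a minimal sketch of such a formatter specialization, assuming `db::replay_position` exposes `id` and `pos` members (stubbed below):
```cpp
#include <cstdint>
#include <string_view>
#include <fmt/format.h>

// Stub standing in for db::replay_position; the member names are an
// assumption here.
namespace db { struct replay_position { uint64_t id; uint32_t pos; }; }

// fmt v10 no longer derives a formatter from operator<<, so one is defined
// explicitly; parse() is inherited from the string_view formatter.
template <>
struct fmt::formatter<db::replay_position> : fmt::formatter<std::string_view> {
    auto format(const db::replay_position& rp, fmt::format_context& ctx) const {
        return fmt::format_to(ctx.out(), "{{id={}, pos={}}}", rp.id, rp.pos);
    }
};
```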

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17014
2024-01-29 09:59:30 +02:00
Botond Dénes
d3c1be9107 Merge 'alternator: enable tablets by default if experimental feature is enabled' from Nadav Har'El
This series does a similar change to Alternator as was done recently to CQL:

1. If the "tablets" experimental feature is enabled, new Alternator tables will use tablets automatically, without requiring an option on each new table. A default choice of initial_tablets is used. These choices can still be overridden per-table if the user wants to.
2. In particular, all test/alternator tests will also automatically run with tablets enabled.
3. However, some tests will fail on tablets because they use features that haven't yet been implemented with tablets - namely Alternator Streams (Refs #16317) and Alternator TTL (Refs #16567). These tests will - until those features are implemented with tablets - continue to be run without tablets.
4. An option is added to the test/alternator/run script to allow developers to manually run tests without tablets enabled, if they wish to (this option will be useful in the short term, and can be removed later).

Fixes #16355

Closes scylladb/scylladb#16900

* github.com:scylladb/scylladb:
  test/alternator: add "--vnodes" option to run script
  alternator: use tablets by default, if available
  test/alternator: run some tests without tablets
2024-01-29 09:22:13 +02:00
Dawid Medrek
b92fb3537a main: Postpone start-up of hint manager
In this commit, we postpone the start-up
of the hint manager until we obtain information
about other nodes in the cluster.

When we start the hint managers, one of the
things that happens is creating endpoint
managers -- structures managed by
db::hints::manager.
an instance of endpoint manager depends on
the value returned by host_filter::can_hint_for,
which, in turn, may depend on the current state
of locator::topology.

If locator::topology is incomplete, some endpoint
managers may not be started even though they
should (because the target node IS part of the
cluster and we SHOULD send hints to it if there
are some).

A situation like that can happen because we
start the hint managers too early. This commit
aims to solve that problem. We only start
the hint managers when we've gathered information
about the other nodes in the cluster and created
the locator::topology using it.

Hinted Handoff is not negatively affected by these
changes since in between the previous point of
starting the hint managers and the current one,
all of the mutations performed by
service::storage_proxy target the local node, so
no hints would need to be generated anyway.

Fixes scylladb/scylladb#11870
Closes scylladb/scylladb#16511
2024-01-26 12:49:40 +01:00
Kefu Chai
a9d781d70f test/nodetool: only test "storage_service/cleanup_all" with scylla
this RESTful API is a scylla-specific extension and is only used
by scylla-nodetool. currently, the java-based nodetool does not use
it at all, so mark it with "scylla_only".

one can verify this change with:
```
pytest --mode=debug --nodetool=cassandra test_cleanup.py::test_cleanup
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17001
2024-01-26 13:19:15 +02:00
Botond Dénes
582ddc70ec Merge 'test/nodetool: return a randomized address if not running with unshare' from Kefu Chai
we should allow users to run nodetool tests without `test.py`. but there
is a good chance that the host could be reused by multiple tests or
multiple users who could be using port 12345. by randomizing the IP and
port, they have a better chance of completing the test without running
into a used-port problem.

Closes scylladb/scylladb#16996

* github.com:scylladb/scylladb:
  test/nodetool: return a randomized address if not running with unshare
  test/nodetool: return an address from loopback_network fixture
2024-01-26 13:15:58 +02:00
Kamil Braun
4f736894e1 Merge 'Add maintenance mode' from Mikołaj Grzebieluch
In this mode, the node is not reachable from the outside, i.e.
* it refuses all incoming RPC connections,
* it does not join the cluster, thus
  * all group0 operations are disabled (e.g. schema changes),
  * all cluster-wide operations are disabled for this node (e.g. repair),
  * other nodes see this node as dead,
  * it cannot read or write data from/to other nodes,
* it does not open Alternator and Redis transport ports and the TCP CQL port.

The only way to make CQL queries is to use the maintenance socket. The node serves only local data.

To start the node in maintenance mode, use the `--maintenance-mode true` flag or set `maintenance_mode: true` in the configuration file.

REST API works as usual, but some routes are disabled:
* authorization_cache
* failure_detector
* hinted_hand_off_manager

This PR also updates the maintenance socket documentation:
* add cqlsh usage to the documentation
* update the documentation to use `WhiteListRoundRobinPolicy`

Fixes #5489.

Closes scylladb/scylladb#15346

* github.com:scylladb/scylladb:
  test.py: add test for maintenance mode
  test.py: generalize usage of cluster_con
  test.py: when connecting to node in maintenance mode use maintenance socket
  docs: add maintenance mode documentation
  main: add maintenance mode
  main: move some REST routes initialization before joining group0
  message_service: add sanity check that rpc connections are not created in the maintenance mode
  raft_group0_client: disable group0 operations in the maintenance mode
  service/storage_service: add start_maintenance_mode() method
  storage_service: add MAINTENANCE option to mode enum
  service/maintenance_mode: add maintenance_mode_enabled bool class
  service/maintenance_mode: move maintenance_socket_enabled definition to seperate file
  db/config: add maintenance mode flag
  docs: add cqlsh usage to maintenance socket documentation
  docs: update maintenance socket documentation to use WhiteListRoundRobinPolicy
2024-01-26 11:02:34 +01:00
Kefu Chai
01727a5399 test/nodetool: return a randomized address if not running with unshare
we should allow users to run nodetool tests without `test.py`. but there
is a good chance that the host could be reused by multiple tests or
multiple users who could be using port 12345. by randomizing the IP and
port, they have a better chance of completing the test without running
into a used-port problem.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-01-26 13:32:47 +08:00
Kefu Chai
358d30fd29 test/nodetool: return an address from loopback_network fixture
* rename "maybe_setup_loopback_network" to "server_address"
* return an address from the fixture

this change prepares for bringing back the randomized IP and port,
in case users run this test without test.py. by randomizing the
IP and port, they have a better chance of completing the test
without running into a used-port problem.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-01-26 13:20:37 +08:00
Raphael S. Carvalho
3b14c5b84a test/topology_experimental_raft: Add tablet split test
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:58:43 -03:00
Raphael S. Carvalho
4245ad333a test/topology_experimental_raft: Disable load balancer in test fencing
This is easier to reproducer after changes in load balancer, to
emit resize decisions, which in turn results in topology version
being incremented, and that might race with fencing tests that
manipulate the topology version manually.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:58:43 -03:00
Raphael S. Carvalho
7ed5b44d52 load_balancer: Implement resize decisions
This implements the ability of the load balancer to emit split or merge
requests, cancel ongoing ones if they're no longer needed, and
also finalize those that are ready for the topology changes.

That's all based on the average tablet size, collected by the coordinator
from all nodes, and on the split and merge thresholds.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:08 -03:00
Raphael S. Carvalho
ed2138a35a tablet_mutation_builder: Add set_resize_decision()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:08 -03:00
Raphael S. Carvalho
0d5ba1ee4b tablets: Add resize decision metadata to tablet metadata
The new metadata describes the ongoing resize operation (which can be
merge, split or none) that spans the tablets of a given table.
It's managed by group0, so nodes that were down will be able to see the
decision when they come back up and catch up on the changes to the
metadata.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:06 -03:00
Avi Kivity
03313d359e Merge ' db: commitlog_replayer: ignore mutations affected by (tablet) cleanups ' from Michał Chojnowski
To avoid data resurrection, mutations deleted by cleanup operations should be skipped during commitlog replay.

This series implements the above for tablet cleanups, by using a new system table which holds records of cleanup operations.
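
A sketch of the replay-time filter this implies (the record fields are assumptions; the real system.commitlog_cleanups layout is not shown in this log):
```cpp
#include <cstdint>

// Sketch only: a cleanup record says "data for this table written before
// replay position P was deleted by cleanup"; field layout is assumed.
struct cleanup_record {
    uint64_t table_id;        // which table the cleanup ran on
    uint64_t up_to_position;  // replay position the cleanup covered
};

// During replay, drop a mutation if a cleanup already deleted the data it
// carries -- replaying it would resurrect deleted data.
bool should_skip_on_replay(const cleanup_record& rec,
                           uint64_t mutation_table_id,
                           uint64_t mutation_position) {
    return mutation_table_id == rec.table_id
        && mutation_position < rec.up_to_position;
}
```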

Fixes #16752

Closes scylladb/scylladb#16888

* github.com:scylladb/scylladb:
  test: test_tablets: add a test for cleanup after migration
  test: pylib: add ScyllaCluster.wipe_sstables
  test: boost: add commitlog_cleanup_test
  db: commitlog_replayer: ignore mutations affected by (tablet) cleanups
  replica: table: garbage-collect irrelevant system.commitlog_cleanups records
  db: commitlog: add min_position()
  replica: table: populate system.commitlog_cleanups on tablet cleanup
  db: system_keyspace: add system.commitlog_cleanups
  replica: table: refresh compound sstable set after tablet cleanup
2024-01-25 20:51:03 +02:00
Kamil Braun
543ad0987a Merge 'raft topology: send barrier_and_drain to a decommissioning node' from Patryk Jędrzejczak
We didn't send the `barrier_and_drain` command to a
decommissioning node that could still be coordinating requests. It
could happen that a decommissioning node sent a request with an
old topology version after normal nodes received the new fence
version. Then, the request would fail on replicas with the stale
topology exception.

This PR fixes this problem by modifying `exec_global_command`.
From now on, it sends `barrier_and_drain` to a decommissioning
node.

We also stop filtering stale topology exceptions in
`test_topology_ops`. We added this filter after detecting the bug
fixed by this PR.

Fixes scylladb/scylladb#15804
Fixes scylladb/scylladb#16579
Fixes scylladb/scylladb#16642

Closes scylladb/scylladb#16797

* github.com:scylladb/scylladb:
  test: test_topology_ops: remove failed mutations filter
  raft topology: send barrier_and_drain to a decommissioning node
  raft topology: ensure at most one transitioning node
2024-01-25 16:09:02 +01:00
Botond Dénes
6d5ee6d48a Merge 'test/nodetool: run nodetool tests using "unshare"' from Kefu Chai
before this change, we used a random address when launching the
rest_api_mock server, but there was a chance that the randomly
picked address conflicted with an already-used address on the
host. the subprocess fails right away with a return code of
1 upon this failure, but we just continue on and check the readiness
of the already-dead server. we have actually seen test failures
caused by EADDRINUSE, and when we checked the readiness
of the rest_api_mock by sending an HTTP request and reading the
response, what we got was not a JSON-encoded response but a webpage,
which was likely the one returned by a minio server.

in this change, we

* specify the "launcher" option of the nodetool
  test suite as "unshare", so that all its tests are launched
  in separate namespaces.
* do not use a random address for the mock server, as the network
  namespaces are separated.

Fixes #16542

Closes scylladb/scylladb#16773

* github.com:scylladb/scylladb:
  test/nodetool: run nodetool tests using "unshare"
  test.py: add "launcher" option support
2024-01-25 16:53:49 +02:00
Mikołaj Grzebieluch
763911af5b test.py: add test for maintenance mode
The test checks that in maintenance mode server A is not available to other
nodes or to clients. It is possible to connect to server A via the maintenance
socket and perform local CQL operations.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
ca35e352f5 test.py: generalize usage of cluster_con
Add option to pass load_balancing policy.
Change hosts type to list of IPs or cassandra.Endpoint.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
77a656bfd6 test.py: when connecting to node in maintenance mode use maintenance socket
A node in maintenance mode doesn't have its regular CQL port open.
To connect to the node, the scylla cluster needs to use the node's maintenance socket.
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
c08266cfe5 raft_group0_client: disable group0 operations in the maintenance mode
In maintenance mode, the node doesn't communicate with other nodes, so it doesn't
start or apply group0 operations. Users can still try to start one, e.g. to change
the schema, but the node can't allow it.

Init _upgrade_state with recovery in the maintenance mode.
Throw an error if the group0 operation is started in maintenance mode.
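
A minimal sketch of the guard described above (the function name and flag plumbing are assumptions):
```cpp
#include <stdexcept>

// Sketch of the guard: refuse to start a group 0 operation when the node
// runs isolated in maintenance mode.
void start_group0_operation(bool maintenance_mode_enabled) {
    if (maintenance_mode_enabled) {
        throw std::runtime_error(
            "group 0 operations are disabled in maintenance mode");
    }
    // ... proceed with building and announcing the group 0 command
}
```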
2024-01-25 15:27:53 +01:00
Mikołaj Grzebieluch
8b2f0e38d9 service/maintenance_mode: move maintenance_socket_enabled definition to seperate file 2024-01-25 15:27:53 +01:00
Pavel Emelyanov
bf3cae4992 Merge 'tests: utils: error injection: print time duration instead of count' from Kefu Chai
before this change, we always cast the wait duration to milliseconds,
even if it could be using a higher resolution. `std::chrono::steady_clock`
actually uses `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be woken
earlier due to the narrowing of the duration type caused by the
duration_cast.

in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
precision. the tests are updated to print the time duration
instead of the count, to provide information with a higher resolution.
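
a self-contained illustration of the narrowing (plain std::chrono, no Seastar):
```cpp
#include <cassert>
#include <chrono>

int main() {
    using namespace std::chrono;
    // steady_clock durations count nanoseconds; casting a sub-millisecond
    // remainder to milliseconds truncates it, so a timer armed with the
    // cast value fires before the intended deadline.
    auto wait = nanoseconds(2'500'000);                  // 2.5 ms
    auto truncated = duration_cast<milliseconds>(wait);  // 2 ms
    assert(truncated < wait);  // we would wake ~0.5 ms too early
}
```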

Fixes #15902

Closes scylladb/scylladb#16264

* github.com:scylladb/scylladb:
  tests: utils: error injection: print time duration instead of count
  error_injection: do not cast to milliseconds when injecting timeout
2024-01-25 16:13:27 +03:00
Avi Kivity
69d597075a Merge 'tablets: Add support for removenode and replace handling' from Tomasz Grabiec
New tablet replicas are allocated and rebuilt synchronously with node
operations. They are safely rebuilt from all existing replicas.
The list of ignored nodes passed to node operations is respected.

The tablet scheduler is responsible for scheduling the tablet rebuild transition,
which changes the replica set. The infrastructure for handling decommission
in the tablet scheduler is reused for this.

Scheduling is done incrementally, respecting per-shard load
limits. Rebuilding transitions are recognized by the load calculation
as affecting all tablet replicas.

A new kind of tablet transition is introduced, called "rebuild", which
adds a new tablet replica and rebuilds it from existing replicas. Other
than that, the transition goes through the same stages as a regular
migration, to ensure safe synchronization with request coordinators.

In this PR we simply stream from all tablet replicas. Later we should
switch to calling repair to avoid sending excessive amounts of data.

Fixes https://github.com/scylladb/scylladb/issues/16690.

Closes scylladb/scylladb#16894

* github.com:scylladb/scylladb:
  tests: tablets: Add tests for removenode and replace
  tablets: Add support for removenode and replace handling
  topology_coordinator: tablets: Do not fail in a tight loop
  topology_coordinator: tablets: Avoid warnings about ignored failured future
  storage_service, topology: Track excluded state in locator::topology
  raft topology: Introduce param-less topology::get_excluded_nodes()
  raft topology: Move get_excluded_nodes() to topology
  tablets: load_balancer: Generalize load tracking
  tablets: Introduce get_migration_streaming_info() which works on migration request
  tablets: Move migration_to_transition_info() to tablets.hh
  tablets: Extract get_new_replicas() which works on migraiton request
  tablets: Move tablet_migration_info to tablets.hh
  tablets: Store transition kind per tablet
2024-01-25 14:49:43 +02:00
Patryk Jędrzejczak
b348014745 test: test_topology_ops: remove failed mutations filter
We added this filter after detecting a bug in the Raft-based
topology. We weren't sending `barrier_and_drain` commands to a
decommissioning node that could still be coordinating requests.
It could cause stale topology exceptions on replicas if the
decommissioning node sent a request with an old topology version
after normal nodes received the new fence version.

This bug has been fixed in the previous commit, so we remove the
filter.
2024-01-25 13:42:48 +01:00