Commit Graph

358 Commits

Author SHA1 Message Date
Gleb Natapov
97ab3f6622 storage_service: topology_coordinator: introduce cleanup REST API integrated with the topology coordinator
Introduce new REST API "/storage_service/cleanup_all"
that, when triggered, instructs the topology coordinator to initiate
cluster wide cleanup on all dirty nodes. It is done by introducing new
global command "global_topology_request::cleanup".
2024-01-14 15:45:53 +02:00
Avi Kivity
22b77edef3 Merge 'scylla-nodetool: implement the scrub command' from Botond Dénes
On top of the capabilities of the java-nodetool command, the following additional functionalit is implemented:
* Expose quarantine-mode option of the scrub_keyspace REST API
* Exit with error and print a message, when scrub finishes with abort or validation_errors return code

The command comes with tests and all tests pass with both the new and the current nodetool implementations.

Refs: #15588
Refs: #16208

Closes scylladb/scylladb#16391

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: implement the scrub command
  test/nodetool: rest_api_mock.py: add missing "f" to error message f string
  api: extract scrub_status into its own header
2023-12-12 22:22:35 +02:00
Botond Dénes
8064d17f78 api: extract scrub_status into its own header
So it can be shared with scylla-nodetool code.
2023-12-12 09:33:39 -05:00
Botond Dénes
885a807c71 Merge 'api: storage_service: api for starting async compaction' from Aleksandra Martyniuk
For all compaction types which can be started with api, add an asynchronous version of api, which returns task_id of the corresponding task manager task. With the task_id a user can check task status, abort, or wait for it, using task manager api.

Closes scylladb/scylladb#15092

* github.com:scylladb/scylladb:
  test: use async api in test_not_created_compaction_task_abort
  test: test compaction task started asynchronously
  api: tasks: api for starting async compaction
  api: compaction: pass pointer to top level compaction tasks
2023-12-12 12:06:52 +02:00
Asias He
5f20e33e15 api: Reject unsupported http api options for repair
If an option is not supported, reject the request instead of silently
ignoring the unsupported options.

It prevents the user thinks the option is supported but it is ignored by
scylla core.

Fixes #16299

Closes scylladb/scylladb#16300
2023-12-12 09:18:00 +02:00
Aleksandra Martyniuk
b485897704 api: tasks: api for starting async compaction
For all compaction types which can be started with api, add an asynchronous
version of api, which returns task_id of the corresponding task manager
task. With the task_id a user can check task status, abort, or wait for it,
using task manager api.
2023-12-11 11:39:33 +01:00
Aleksandra Martyniuk
ceec5577d8 api: compaction: pass pointer to top level compaction tasks
As a preparation for asynchronous compaction api, from which we
cannot take values by reference, top level compaction tasks get
pointers which need to be set to nullptr when they are not needed
(like in async api).
2023-12-11 11:36:10 +01:00
Avi Kivity
9c0f05efa1 Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.

This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.

The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.

The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.

This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.

Closes scylladb/scylladb#15847

* github.com:scylladb/scylladb:
  test: tablets: Add test for failed streaming being fenced away
  error_injection: Introduce poll_for_message()
  error_injection: Make is_enabled() public
  api: Add API to kill connection to a particular host
  range_streamer: Do not block topology change barriers around streaming
  range_streamer, tablets: Do not keep token metadata around streaming
  tablets: Fail gracefully when migrating tablet has no pending replica
  storage_service, api: Add API to disable tablet balancing
  storage_service, api: Add API to migrate a tablet
  storage_service, raft topology: Run streaming under session topology guard
  storage_service, tablets: Use session to guard tablet streaming
  tablets: Add per-tablet session id field to tablet metadata
  service: range_streamer: Propagate topology_guard to receivers
  streaming: Always close the rpc::sink
  storage_service: Introduce concept of a topology_guard
  storage_service: Introduce session concept
  tablets: Fix topology_metadata_guard holding on to the old erm
  docs: Document the topology_guard mechanism
2023-12-07 16:29:02 +02:00
Tomasz Grabiec
d1c1b59236 storage_service, api: Add API to disable tablet balancing
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.

Also will be used in tests to ensure tablets stick to expected locations.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
1f57d1ea28 storage_service, api: Add API to migrate a tablet
Will be used in tests, or for hot fixes in production.
2023-12-06 18:36:17 +01:00
Kefu Chai
f483309165 compaction, api: drop unused functions
run_on_existing_tables() is not used at all. and we have two of them.
in this change, let's drop them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16304
2023-12-06 14:31:08 +02:00
Benny Halevy
e5d3c6741f api: use locator::topology rather than fb_utilities
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Benny Halevy
b12b142232 api: add /storage_service/compact
For major compacting all tables in the database.
The advantage of this api is that `commitlog->force_new_active_segment`
happens only once in `database::flush_all_tables` rather than
once per keyspace (when `nodetool compact` translates to
a sequence of `/storage_service/keyspace_compaction` calls).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-28 16:37:42 +02:00
Benny Halevy
1b576f358b api: add /storage_service/flush
For flushing all tables in the database.
The advantage of this api is that `commitlog->force_new_active_segment`
happens only once in `database::flush_all_tables` rather than
once per keyspace (when `nodetool flush` translates to
a sequence of `/storage_service/keyspace_flush` calls).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-28 16:37:42 +02:00
Benny Halevy
1fd85bd37b api: compaction: add flush_memtables option
When flushing is done externally, e.g. by running
`nodetool flush` prior to `nodetool compact`,
flush_memtables=false can be passed to skip flushing
of tables right before they are major-compacted.

This is useful to prevent creation of small sstables
due to excessive memtable flushing.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-28 16:37:42 +02:00
Botond Dénes
697cf41b9b Merge 'repair: Introduce small table optimization' from Asias He
repair: Introduce small table optimization

*) Problem:

We have seen in the field it takes longer than expected to repair system tables
like system_auth which has a tiny amount of data but is replicated to all nodes
in the cluster. The cluster has multiple DCs. Each DC has multiple nodes. The
main reason for the slowness is that even if the amount of data is small,
repair has to walk though all the token ranges, that is num_tokens *
number_of_nodes_in_the_cluster. The overhead of the repair protocol for each
token range dominates due to the small amount of data per token range. Another
reason is the high network latency between DCs makes the RPC calls used to
repair consume more time.

*) Solution:

To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization.

- No token range to repair is needed by the user. It  will repair all token
ranges automatically.

- Users only need to send the repair rest api to one of the nodes in the
cluster. It can be any of the nodes in the cluster.

- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any tables as long as the amount of data
is low, e.g., less than 100MiB per node.

*) Performance:

1)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```

After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```

The time to finish repair difference = 1.5648403 seconds / 0.001459798 seconds = 1072X

2)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Same test as above except 5ms delay is added to simulate multiple dc
network latency:

The time to repair is reduced from 333s to 0.2s.

333.26758 s / 0.22625381s = 1472.98

3)

3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3}
, 10 ms network latency

Before:

```
repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=2305, round_nr=6916,
round_nr_fast_path_already_synced=6915,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015
seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0},
{127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1},
{127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}},
row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2,
1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385},
{127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0},
{127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0},
{127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}}, rx_row_nr_peer={}
```

After:

```
repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=1, round_nr=4,
round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915
seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178},
{127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}},
row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}},
row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2,
0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793},
{127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7,
0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634},
{127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4,
2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634},
{127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9,
2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
```

The time to repair is reduced from 986s (16 minutes) to 1.6s

*) Summary

So, a more than 1000X difference is observed for this common usage of
system table repair procedure.

Fixes #16011
Refs  #15159

Closes scylladb/scylladb#15974

* github.com:scylladb/scylladb:
  repair: Introduce small table optimization
  repair: Convert put_row_diff_with_rpc_stream to use coroutine
2023-11-24 15:11:42 +02:00
Asias He
c605220bb3 repair: Introduce small table optimization
*) Problem:

We have seen in the field it takes longer than expected to repair system tables
like system_auth which has a tiny amount of data but is replicated to all nodes
in the cluster. The cluster has multiple DCs. Each DC has multiple nodes. The
main reason for the slowness is that even if the amount of data is small,
repair has to walk though all the token ranges, that is num_tokens *
number_of_nodes_in_the_cluster. The overhead of the repair protocol for each
token range dominates due to the small amount of data per token range. Another
reason is the high network latency between DCs makes the RPC calls used to
repair consume more time.

*) Solution:

To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization.

- No token range to repair is needed by the user. It  will repair all token
ranges automatically.

- Users only need to send the repair rest api to one of the nodes in the
cluster. It can be any of the nodes in the cluster.

- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any tables as long as the amount of data
is low, e.g., less than 100MiB per node.

*) Performance:

1)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```

After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```

The time to finish repair difference = 1.5648403 seconds / 0.001459798 seconds = 1072X

2)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Same test as above except 5ms delay is added to simulate multiple dc
network latency:

The time to repair is reduced from 333s to 0.2s.

333.26758 s / 0.22625381s = 1472.98

3)

3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3}
, 10 ms network latency

Before:

```
repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=2305, round_nr=6916,
round_nr_fast_path_already_synced=6915,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015
seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0},
{127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1},
{127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}},
row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2,
1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385},
{127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0},
{127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0},
{127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}}, rx_row_nr_peer={}
```

After:

```
repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=1, round_nr=4,
round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915
seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178},
{127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}},
row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}},
row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2,
0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793},
{127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7,
0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634},
{127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4,
2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634},
{127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9,
2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
```

The time to repair is reduced from 986s (16 minutes) to 1.6s

*) Summary

So, a more than 1000X difference is observed for this common usage of
system table repair procedure.

Fixes #16011
Refs  #15159
2023-11-20 15:11:16 +08:00
Botond Dénes
dfd7981fa7 api/storage_service: start/stop native transport in the statement sg
Currently, it is started/stopped in the streaming/maintenance sg, which
is what the API itself runs in.
Starting the native transport in the streaming sg, will lead to severely
degraded performance, as the streaming sg has significantly less
CPU/disk shares and reader concurrency semaphore resources.
Furthermore, it will lead to multi-paged reads possibly switching
between scheduling groups mid-way, triggering an internal error.

To fix, use `with_scheduling_group()` for both starting and stopping
native transport. Technically, it is only strictly necessary for
starting, but I added it for stop as well for consistency.

Also apply the same treatment to RPC (Thrift). Although no one uses it,
best to fix it, just to be on the safe side.

I think we need a more systematic approach for solving this once and for
all, like passing the scheduling group to the protocol server and have
it switch to it internally. This allows the server to always run on the
correct scheduling group, not depending on the caller to remember using
it. However, I think this is best done in a follow-up, to keep this
critical patch small and easily backportable.

Fixes: #15485

Closes scylladb/scylladb#16019
2023-11-13 14:08:01 +03:00
Pavel Emelyanov
e4c0a4d34d api: Move storage_service endpoints that use token metadata only
There are few of them that don't need the storage service for anything
but get token metadata from. Move them to own .cc/.hh units.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-24 17:44:53 +03:00
Kamil Braun
b68d6ad5e9 api: storage_service: unset reload_raft_topology_state
Every endpoint needs to be unset. Oversight in
992f1327d3.

Closes scylladb/scylladb#15591
2023-10-03 09:12:12 +03:00
Pavel Emelyanov
0eea513663 api: Move storage_proxy/ endpoint using storage_service
The storage_proxy/get_schema_version is served by storage_service, so it
should be in storage_service.cc instead

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:08 +03:00
Pavel Emelyanov
b5eb474d95 api: Remove storage_proxy.hh from storage_service.cc
Proxy is not used in storage service handlers

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:09:01 +03:00
Kamil Braun
992f1327d3 api: storage_service: add REST API to reload topology state
Some tests may want to modify system.topology table directly. Add a REST
API to reload the state into memory. An alternative would be restarting
the server, but that's slower and may have other side effects undesired
in the test.

The API can also be called outside tests, it should not have any
observable effects unless the user modifies `system.topology` table
directly (which they should never do, outside perhaps some disaster
recovery scenarios).
2023-09-28 11:59:16 +02:00
Pavel Emelyanov
e022a76350 main, api: Set/Unset storage_service API in proper place
Currently the storage-service API handlers are set up in "random" place.
It can happen earlier -- as soon as the storage service itself is ready.

Also, despite storage service is stopped on shutdown, API handlers
continue reference it leading to potential use-after-frees or "local is
not initialized" assertions.

Fix both. Unsetting is pretty bulky, scylladb/seastar#1620 is to help.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:22:13 +03:00
Pavel Emelyanov
78a22c5ae3 api/storage_service: Remove gossiper arg from API
Now all handlers work purely on storage_service and gossiper argument is
no longer needed

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:21:46 +03:00
Pavel Emelyanov
8dc6e74138 api/storage_service: Remove system keyspace arg from API
It's not used nowadays

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:21:25 +03:00
Pavel Emelyanov
27eaff9d44 api/storage_service: Get gossiper from storage service
Some handlers in set_storage_service() have implicit dependency on
gossiper. It's not API that should track it, but storage service itself,
so get the gossiper from service, not from the external argument (it
will be removed soon)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:20:27 +03:00
Pavel Emelyanov
4008ebb1b0 api/storage_service: Get token_metadata from storage service
The API handlers that live in set_storage_service() should be
self-contained and operate on storage-service only. Said that, they
should get the token metadata, when needed, from storage service, not
from somewhere else.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:19:24 +03:00
Petr Gusev
b90011294d config.cc: drop db::config::host_id
In this refactoring commit we remove the db::config::host_id
field, as it's hacky and duplicates token_metadata::get_my_id.

Some tests want specific host_id, we add it to cql_test_config
and use in cql_test_env.

We can't pass host_id to sstables_manager by value since it's
initialized in database constructor and host_id is not loaded yet.
We also prefer not to make a dependency on shared_token_metadata
since in this case we would have to create artificial
shared_token_metadata in many tools and tests where sstables_manager
is used. So we pass a function that returns host_id to
sstables_manager constructor.
2023-09-13 23:00:15 +04:00
Tomasz Grabiec
c27d212f4b api, storage_service: Recalculate table digests on relocal_schema api call
Currently, the API call recalculates only per-node schema version. To
workaround issues like #4485 we want to recalculate per-table
digests. One way to do that is to restart the node, but that's slow
and has impact on availability.

Use like this:

  curl -X POST http://127.0.0.1:10000/storage_service/relocal_schema

Fixes #15380

Closes #15381
2023-09-13 18:27:57 +03:00
Pavel Emelyanov
c3a6e31368 api: Don't carry cdc gen. service over
There's a storage_service/cdc_streams_check_and_repair endpoint that
needs to provide cdc gen. service to call storage_service method on. Now
the latter has its own reference to the former and API can stop taking
care of that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-08-29 09:36:58 +03:00
Patryk Jędrzejczak
0beabdc6ba utils: introduce split_comma_separated_list
Three places handle comma-separated lists similarly:
- ss::remove_node.set(...) in api::set_storage_service,
- storage_service::parse_node_list,
- storage_service::is_repair_based_node_ops_enabled.
In the next commit, the fourth place that needs the same logic
appears -- storage_service::raft_replace. It needs to load
and parse the --ignore-dead-nodes-for-replace param from config.

Moreover, the code in is_repair_based_node_ops_enabled is
different and doesn't seem right. We swap '\"' and '\'' with ' '
but don't do anything with it afterward.

To avoid code duplication and fix is_repair_based_node_ops_enabled,
we introduce the new function utils::split_comma_separated_list.

This change has a small side effect on logging. For example,
ignore_nodes_strs in storage_service::parse_node_list might be
printed in a slightly different form.
2023-08-22 10:30:36 +02:00
Botond Dénes
946c6487ee Merge 'repair: Add ranges_parallelism option' from Asias He
This patch adds the ranges_parallelism option to repair restful API.

Users can use this option to optionally specify the number of ranges to repair in parallel per repair job to a smaller number than the Scylla core calculated default max_repair_ranges_in_parallel.

Scylla manager can also use this option to provide more ranges (>N) in a single repair job but only repairing N ranges_parallelism in parallel, instead of providing N ranges in a repair job.

To make it safer, unlike the PR #4848, this patch does not allow user to exceed the max_repair_ranges_in_parallel.

Fixes #4847

Closes #14886

* github.com:scylladb/scylladb:
  repair: Add ranges_parallelism option
  repair: Change to use coroutine in do_repair_ranges
2023-08-03 11:34:05 +03:00
Asias He
9b3fd9407b repair: Add ranges_parallelism option
This patch adds the ranges_parallelism option to repair restful API.

Users can use this option to optionally specify the number of ranges
to repair in parallel per repair job to a smaller number than the Scylla
core calculated default max_repair_ranges_in_parallel.

Scylla manager can also use this option to provide more ranges (>N) in
a single repair job but only repairing N ranges_parallelism in parallel,
instead of providing N ranges in a repair job.

To make it safer, unlike the PR #4848, this patch does not allow user to
exceed the max_repair_ranges_in_parallel.

Fixes #4847
2023-08-01 10:58:14 +08:00
Aleksandra Martyniuk
cdbfa0b2f5 replica: iterate safely over tables related maps
Loops over _column_families and _ks_cf_to_uuid which may preempt
are protected by reader mode of rwlock so that iterators won't
get invalid.
2023-07-25 17:13:04 +02:00
Aleksandra Martyniuk
52afd9d42d replica: wrap column families related maps into tables_metadata
As a preparation for ensuring access safety for column families
related maps, add tables_metadata, access to members of which
would be protected by rwlock.
2023-07-25 16:13:00 +02:00
Aleksandra Martyniuk
f48b57e7b9 compaction: use table_info in compaction tasks
Task manager compaction tasks need table names for logs.
Thus, compaction tasks store table infos instead of table ids.

get_table_ids function is deleted as it isn't used anywhere.
2023-05-30 09:58:55 +02:00
Aleksandra Martyniuk
4206139e5a api: move table_info to schema/schema_fwd.hh
table_info is moved from api/storage_service.hh to schema/schema_fwd.hh
so that it could be used in task manager's tasks.
2023-05-30 09:57:21 +02:00
Kefu Chai
2fbcbc09b0 api: specialize fmt::formatter<api::table_info>
this is a part of a series to migrating from `operator<<(ostream&, ..)`
based formatting to fmtlib based formatting. the goal here is to enable
fmtlib to print `api::table_info` without the help of `operator<<`.

but the corresponding `operator<<()` is preserved in this change, as we
still have lots of callers relying on this << operator instorage_service.cc
where std::vector<table_info> is formatted using operator<<(ostream&, const Range&)
defined in to_string.hh. we could have used fmt/ranges.h to print the
std::vector<table_info>. but the combination of operator<<(ostream&, const Range&)
and FMT_DEPRECATED_OSTREAM renders this impossible. because
unlike the builtin range formatter specializations, the fallback formatter
synthesized from the operator<< does not have brackets defined for
the range printer. the brackets are used as the left and right marks
of the range, for instance, the array-alike containers are printed
like [1,2,3], while the tuple-alike containers are printed like
(1,2,3). once we are allowed to remove FMT_DEPRECATED_OSTREAM, we
should be able to use the builtin range formatter, and remove the
operator<< for api::table_info by then.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13975
2023-05-24 09:49:44 +03:00
Tomasz Grabiec
9d4bca26cc Merge 'raft topology: implement check_and_repair_cdc_streams API' from Kamil Braun
`check_and_repair_cdc_streams` is an existing API which you can use when the
current CDC generation is suboptimal, e.g. after you decommissioned a node the
current generation has more stream IDs than you need. In that case you can do
`nodetool checkAndRepairCdcStreams` to create a new generation with fewer
streams.

It also works when you change number of shards on some node. We don't
automatically introduce a new generation in that case but you can use
`checkAndRepairCdcStreams` to create a new generation with restored
shard-colocation.

This PR implements the API on top of raft topology, it was originally
implemented using gossiper.  It uses the `commit_cdc_generation` topology
transition state and a new `publish_cdc_generation` state to create new CDC
generations in a cluster without any nodes changing their `node_state`s in the
process.

Closes #13683

* github.com:scylladb/scylladb:
  docs: update topology-over-raft.md
  test: topology_experimental_raft: test `check_and_repair_cdc` API
  raft topology: implement `check_and_repair_cdc_streams` API
  raft topology: implement global request handling
  raft topology: introduce `prepare_new_cdc_generation_data`
  raft_topology: `get_node_to_work_on_opt`: return guard if no node found
  raft topology: remove `node_to_work_on` from `commit_cdc_generation` transition
  raft topology: separate `publish_cdc_generation` state
  raft topology: non-node-specific `exec_global_command`
  raft topology: introduce `start_operation()`
  raft topology: non-node-specific `topology_mutation_builder`
  topology_state_machine: introduce `global_topology_request`
  topology_state_machine: use `uint16_t` for `enum_class`es
  raft topology: make `new_cdc_generation_data_uuid` topology-global
2023-05-22 11:33:58 +02:00
Kefu Chai
b112a3b78a api: storage_service: use string for generation
in this change, the type of the "generation" field of "sstable" in the
return value of RESTful API entry point at
"/storage_service/sstable_info" is changed from "long" to "string".

this change depends on the corresponding change on tools/jmx submodule,
so we have to include the submodule change in this very commit.

this API is used by our JMX exporter, which in turn exposes the
SSTable information via the "StorageService.getSSTableInfo" mBean
operation, which returns the retrieved SSTable info as a list of
CompositeData. and "generation" is a field of an element in the
CompositeData. in general, the scylla JMX exporter is consumed
by the nodetool, which prints out returned SSTable info list with
a pretty formatted table, see
tools/java/src/java/org/apache/cassandra/tools/nodetool/SSTableInfo.java.
the nodetool's formatter is not aware of the schema or type of the
SSTables to be printed, neither does it enforce the type -- it just
tries it best to pretty print them as a tabular.

But the fields in CompositeData is typed, when the scylla JMX exporter
translates the returned SSTables from the RESTful API, it sets the
typed fields of every `SSTableInfo` when constructing `PerTableSSTableInfo`.
So, we should be consistent on the type of "generation" field on both
the JMX and the RESTful API sides. because we package the same version
of scylla-jmx and nodetool in the same precompiled tarball, and enforce
the dependencies on exactly same version when shipping deb and rpm
packages, we should be safe when it comes to interoperability of
scylla-jmx and scylla. also, as explained above, nodetool does not care
about the typing, so it is not a problem on nodetool's front.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13834
2023-05-15 20:33:48 +03:00
Raphael S. Carvalho
abc1eae1c2 Add API to disable tombstone GC in compaction
Adding new APIs /column_family/tombstone_gc and
/storage_service/tombstone_gc.

Mimicks existing APIs /column_family/autocompaction and
/storage_service/autocompaction.

column_family variant must specify a single table only,
following existing convention.

whereas the storage_service one can specify an entire
keyspace, or a subset of a tables in a keyspace.

column_family API usage
-----

The table name must be in keyspace:name format

Get status:
curl -s -X GET "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"

Enable GC
curl -s -X POST "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"

Disable GC
curl -s -X DELETE "http://127.0.0.1:10000/column_family/tombstone_gc/ks:cf"

storage_service API usage
-----

Tables can be specified using a comma-separated list.

Enable GC on keyspace
curl -s -X POST "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"

Disable GC on keyspace
curl -s -X DELETE "http://127.0.0.1:10000/storage_service/tombstone_gc/ks"

Enable GC on a subset of tables
curl -s -X POST
"http://127.0.0.1:10000/storage_service/tombstone_gc/ks?cf=table1,table2"

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:34:38 -03:00
Raphael S. Carvalho
07104393af api: storage_service: restore indentation
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:34:36 -03:00
Raphael S. Carvalho
501b5a9408 api: storage_service: extract code to set attribute for a set of tables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-05-12 10:33:50 -03:00
Botond Dénes
bb62038119 Merge 'Scrub compaction task' from Aleksandra Martyniuk
Task manager's tasks covering scrub compaction on top,
shard and table level.

For this levels we have common scrub tasks for each scrub
mode since they share code. Scrub modes will be differentiated
on compaction group level.

Closes #13694

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to test scrub compaction
  compaction: add table_scrub_sstables_compaction_task_impl
  compaction: add shard_scrub_sstables_compaction_task_impl
  compaction: add scrub_sstables_compaction_task_impl
  api: get rid of unnecessary std::optional in scrub
  compaction: rename rewrite_sstables_compaction_task_impl
2023-05-10 14:18:20 +03:00
Aleksandra Martyniuk
8d32579fe6 compaction: add scrub_sstables_compaction_task_impl
Implementation of task_manager's task covering scrub sstables
compaction.
2023-05-09 11:13:57 +02:00
Aleksandra Martyniuk
79c39e4ea7 api: get rid of unnecessary std::optional in scrub
In scrub lambdas returning std::optional<compaction_stats> cannot
return empty value. Hence, std::optional wrapper isn't needed.
2023-05-09 10:31:44 +02:00
Kamil Braun
372a06f735 raft topology: implement check_and_repair_cdc_streams API
The original API is gossiper-based. Since we're moving CDC generations
handling to Raft-based topology, we need to implement this API as well.

For now the API creates a new generation unconditionally, in a follow-up
I'll introduce a check to skip the creation if the current generation is
optimal.
2023-05-08 16:49:01 +02:00
Kefu Chai
9b35faf485 treewide: replace generation_type::value() with generation_type::as_int()
* replace generation_type::value() with generation_type::as_int()
* drop generation_value()

because we will switch over to UUID based generation identifier, the member
function or the free function generation_value() cannot fulfill the needs
anymore. so, in this change, they are consolidated and are replaced by
"as_int()", whose name is more specific, and will also work and won't be
misleading even after switching to UUID based generation identifier. as
`value()` would be confusing by then: it could be an integer or a UUID.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-05-06 18:24:45 +08:00