Commit Graph

844 Commits

Petr Gusev
63f64f3303 token_metadata: make it a template with NodeId=inet_address/host_id
NodeId is used in all internal token_metadata data structures, which
previously used inet_address. We choose topology::key_kind based
on the value of the template parameter.

A generic_token_metadata::update_topology overload with a host_id
parameter is added to make update_topology_change_info work; the
latter now uses NodeId as its parameter type.

topology::remove_endpoint(host_id) is added to make
generic_token_metadata::remove_endpoint(NodeId) work.

pending_endpoints_for and endpoints_for_reading are simply removed - they
are neither used nor implemented. The declarations were left behind by
mistake during a refactoring that moved these methods to erm.

generic_token_metadata_base is extracted to contain the declarations
common to both token_metadata versions.

The templates are explicitly instantiated inside token_metadata.cc, since
the implementation is also a template and is not exposed in the header.

There are no other behavioral changes in this commit, just the syntactic
changes needed to make token_metadata a template.
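
For illustration, a minimal sketch of the pattern described above, with
simplified stand-in names and signatures (these are not the actual
ScyllaDB declarations):

```
// Illustrative only: simplified stand-ins for the real types.
namespace gms { struct inet_address {}; }
namespace locator { struct host_id {}; }

// Declarations common to both token_metadata versions live in the base.
struct generic_token_metadata_base {};

// NodeId is either gms::inet_address or locator::host_id;
// topology::key_kind would be chosen based on which one is used.
template <typename NodeId>
class generic_token_metadata : public generic_token_metadata_base {
public:
    void update_topology(NodeId node);
    void remove_endpoint(NodeId node);
};

// The definitions live in token_metadata.cc, not in the header...
template <typename NodeId>
void generic_token_metadata<NodeId>::update_topology(NodeId) { /* ... */ }
template <typename NodeId>
void generic_token_metadata<NodeId>::remove_endpoint(NodeId) { /* ... */ }

// ...so token_metadata.cc explicitly instantiates both variants:
template class generic_token_metadata<gms::inet_address>;
template class generic_token_metadata<locator::host_id>;
```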
2023-12-11 12:51:34 +04:00
Avi Kivity
9c0f05efa1 Merge 'Track tablet streaming under global sessions to prevent side-effects of failed streaming' from Tomasz Grabiec
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure: when the streaming master fails (e.g. due to an RPC failing), some streaming work may still be alive somewhere (e.g. an RPC on the wire) and produce side-effects at some point later.

This PR implements tracking of all streaming operations that may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.

The tracking and fencing are implemented using global "sessions", created for the streaming of a single tablet. A session is globally identified by a UUID. The identifier is assigned by the topology change coordinator and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.

The barrier is blocked only if some session still has work left behind by unsuccessful streaming, in which case it should not be blocked for long, because the streaming process frequently checks whether the guard was left behind and stops if it was.

This tracking mechanism is fault-tolerant: the session id is stored in group0, so the coordinator can make progress on failover. The barriers guarantee that the session exists on all replicas, and that it will be closed on all replicas.
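
As a rough illustration, a minimal single-threaded sketch of the session/guard idea, with all names hypothetical (the real implementation uses utils::UUID, Seastar primitives, and a group0-driven lifecycle):

```
#include <map>
#include <memory>
#include <string>

using session_id = std::string;  // stands in for a UUID

struct session {
    bool closed = false;
};

class session_manager {
    std::map<session_id, std::shared_ptr<session>> _sessions;
public:
    // Sessions are created/closed in response to group0 state changes.
    std::shared_ptr<session> create(const session_id& id) {
        auto s = std::make_shared<session>();
        _sessions.emplace(id, s);
        return s;
    }
    // Close the session and report whether all guards were already dropped;
    // the real barrier would wait (drain) asynchronously instead.
    bool close_and_check_drained(const session_id& id) {
        auto it = _sessions.find(id);
        if (it == _sessions.end()) {
            return true;
        }
        it->second->closed = true;
        bool drained = it->second.use_count() == 1;  // only the registry holds it
        if (drained) {
            _sessions.erase(it);
        }
        return drained;
    }
};

// Streaming work holds a guard on its session; when the session is closed
// mid-way, the work notices and aborts instead of producing late side-effects.
struct topology_guard {
    std::shared_ptr<session> s;
    bool valid() const { return s && !s->closed; }
};
```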

Closes scylladb/scylladb#15847

* github.com:scylladb/scylladb:
  test: tablets: Add test for failed streaming being fenced away
  error_injection: Introduce poll_for_message()
  error_injection: Make is_enabled() public
  api: Add API to kill connection to a particular host
  range_streamer: Do not block topology change barriers around streaming
  range_streamer, tablets: Do not keep token metadata around streaming
  tablets: Fail gracefully when migrating tablet has no pending replica
  storage_service, api: Add API to disable tablet balancing
  storage_service, api: Add API to migrate a tablet
  storage_service, raft topology: Run streaming under session topology guard
  storage_service, tablets: Use session to guard tablet streaming
  tablets: Add per-tablet session id field to tablet metadata
  service: range_streamer: Propagate topology_guard to receivers
  streaming: Always close the rpc::sink
  storage_service: Introduce concept of a topology_guard
  storage_service: Introduce session concept
  tablets: Fix topology_metadata_guard holding on to the old erm
  docs: Document the topology_guard mechanism
2023-12-07 16:29:02 +02:00
Tomasz Grabiec
733eb21601 api: Add API to kill connection to a particular host
For testing failure scenarios.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
d1c1b59236 storage_service, api: Add API to disable tablet balancing
Load balancing needs to be disabled before making a series of manual
migrations so that we don't fight with the load balancer.

It will also be used in tests, to ensure tablets stick to expected locations.
2023-12-06 18:36:17 +01:00
Tomasz Grabiec
1f57d1ea28 storage_service, api: Add API to migrate a tablet
Will be used in tests, or for hot fixes in production.
2023-12-06 18:36:17 +01:00
Kefu Chai
f483309165 compaction, api: drop unused functions
run_on_existing_tables() is not used at all, and we have two copies of it.
This change drops them both.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16304
2023-12-06 14:31:08 +02:00
Botond Dénes
d2a88cd8de Merge 'Typos: fix typos in code' from Yaniv Kaul
Fixes more typos found by running codespell on the code. This commit covers the more user-visible errors.

Refs: https://github.com/scylladb/scylladb/issues/16255

Closes scylladb/scylladb#16289

* github.com:scylladb/scylladb:
  Update unified/build_unified.sh
  Update main.cc
  Update dist/common/scripts/scylla-housekeeping
  Typos: fix typos in code
2023-12-06 07:36:41 +02:00
Yaniv Kaul
ae2ab6000a Typos: fix typos in code
Fixes more typos found by running codespell on the code.
This commit covers the more user-visible errors.

Refs: https://github.com/scylladb/scylladb/issues/16255
2023-12-05 15:18:11 +02:00
Benny Halevy
e5d3c6741f api: use locator::topology rather than fb_utilities
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:42:49 +02:00
Wojciech Mitros
a8c9451fb2 commitlog: add max disk size api
Currently, the maximum size of the commitlog is obtained either from the
config parameter commitlog_total_space_in_mb or, when that parameter
is -1, from the total memory allocated for Scylla.
To facilitate testing the behavior of the commitlog hard limit,
expose the value of the commitlog max_disk_size in a dedicated API.
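
A sketch of the fallback logic being exposed, with illustrative names and
a simplified memory-based rule (the real computation lives in the
commitlog configuration code and may differ):

```
#include <cstdint>

// -1 means "derive the limit from the memory allocated for Scylla".
int64_t commitlog_max_disk_size(int64_t commitlog_total_space_in_mb,
                                int64_t total_memory_bytes) {
    if (commitlog_total_space_in_mb == -1) {
        return total_memory_bytes;  // simplified stand-in for the real rule
    }
    return commitlog_total_space_in_mb * 1024 * 1024;
}
```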

Closes scylladb/scylladb#16020
2023-12-03 17:16:58 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos found by running codespell on the code.
In this commit, I aimed to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of those.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Benny Halevy
b12b142232 api: add /storage_service/compact
For major-compacting all tables in the database.
The advantage of this API is that `commitlog->force_new_active_segment`
happens only once, in `database::flush_all_tables`, rather than
once per keyspace (as when `nodetool compact` translates to
a sequence of `/storage_service/keyspace_compaction` calls).
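
An illustrative model of the difference, with hypothetical names (not the
actual ScyllaDB code):

```
#include <string>
#include <vector>

struct commitlog_t { void force_new_active_segment() {} };
struct database {
    commitlog_t log;
    std::vector<std::string> keyspaces;
    void flush_keyspace(const std::string&) {}
    void compact_keyspace(const std::string&) {}
    void flush_all_tables() { log.force_new_active_segment(); /* flush every table */ }
};

// Old path (`nodetool compact`): one segment roll per keyspace.
void compact_per_keyspace(database& db) {
    for (const auto& ks : db.keyspaces) {
        db.log.force_new_active_segment();
        db.flush_keyspace(ks);
        db.compact_keyspace(ks);
    }
}

// New path (/storage_service/compact): the segment rolls only once.
void compact_database(database& db) {
    db.flush_all_tables();
    for (const auto& ks : db.keyspaces) {
        db.compact_keyspace(ks);
    }
}
```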

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-28 16:37:42 +02:00
Benny Halevy
1b576f358b api: add /storage_service/flush
For flushing all tables in the database.
The advantage of this API is that `commitlog->force_new_active_segment`
happens only once, in `database::flush_all_tables`, rather than
once per keyspace (as when `nodetool flush` translates to
a sequence of `/storage_service/keyspace_flush` calls).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-28 16:37:42 +02:00
Benny Halevy
1fd85bd37b api: compaction: add flush_memtables option
When flushing is done externally, e.g. by running
`nodetool flush` prior to `nodetool compact`,
flush_memtables=false can be passed to skip the flushing
of tables right before they are major-compacted.

This is useful to prevent the creation of small sstables
due to excessive memtable flushing.
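
A sketch of the option's effect in the compaction path, assuming an
illustrative table interface (not the actual ScyllaDB API):

```
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

using namespace seastar;

struct table_t {
    future<> flush();                  // illustrative signatures
    future<> compact_all_sstables();
};

// flush_memtables=false skips the flush right before major compaction.
future<> major_compact(table_t& t, bool flush_memtables) {
    if (flush_memtables) {
        co_await t.flush();
    }
    co_await t.compact_all_sstables();
}
```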

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-28 16:37:42 +02:00
Botond Dénes
697cf41b9b Merge 'repair: Introduce small table optimization' from Asias He
repair: Introduce small table optimization

*) Problem:

We have seen in the field that it takes longer than expected to repair system
tables like system_auth, which have a tiny amount of data but are replicated
to all nodes in the cluster. The cluster has multiple DCs, and each DC has
multiple nodes. The main reason for the slowness is that even if the amount
of data is small, repair has to walk through all the token ranges, that is,
num_tokens * number_of_nodes_in_the_cluster of them. The overhead of the
repair protocol for each token range dominates due to the small amount of
data per token range. Another reason is that the high network latency between
DCs makes the RPC calls used for repair consume more time.

*) Solution:

To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization (a
simplified sketch of the idea follows the list).

- The user does not need to specify a token range to repair; all token ranges
are repaired automatically.

- Users only need to send the repair REST API request to a single node, which
can be any node in the cluster.

- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any table as long as the amount of data
is low, e.g., less than 100 MiB per node.
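
A simplified sketch of the idea, with hypothetical names (not the ScyllaDB
code): the classic path pays the fixed per-range protocol overhead
num_tokens * nodes times, while the small-table path repairs the whole ring
as a single range:

```
#include <cstdint>
#include <limits>
#include <vector>

struct token_range { int64_t start; int64_t end; };

// Stands in for the per-range negotiation and sync RPC round trips.
void negotiate_and_sync(const token_range&) {}

// Classic path: one round per vnode range (ranges_nr=1537 in test 1 below).
void repair_classic(const std::vector<token_range>& vnode_ranges) {
    for (const auto& r : vnode_ranges) {
        negotiate_and_sync(r);
    }
}

// Small-table path: a single round over the whole ring (ranges_nr=1).
void repair_small_table() {
    negotiate_and_sync(token_range{
        std::numeric_limits<int64_t>::min(),
        std::numeric_limits<int64_t>::max()});
}
```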

*) Performance:

1)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```

After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```

Repair time improvement: 1.5648403 s / 0.001459798 s ≈ 1072x

2)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Same test as above, except that a 5 ms delay is added to simulate multi-DC
network latency:

The time to repair is reduced from 333 s to 0.2 s.

333.26758 s / 0.22625381 s ≈ 1473x

3)

3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3}, 10 ms network latency

Before:

```
repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=2305, round_nr=6916,
round_nr_fast_path_already_synced=6915,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015
seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0},
{127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1},
{127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}},
row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2,
1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385},
{127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0},
{127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0},
{127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}}, rx_row_nr_peer={}
```

After:

```
repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=1, round_nr=4,
round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915
seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178},
{127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}},
row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}},
row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2,
0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793},
{127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7,
0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634},
{127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4,
2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634},
{127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9,
2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
```

The time to repair is reduced from 986 s (about 16 minutes) to 1.6 s.

*) Summary

So, a more than 1000x improvement is observed for this common usage of the
system-table repair procedure.

Fixes #16011
Refs  #15159

Closes scylladb/scylladb#15974

* github.com:scylladb/scylladb:
  repair: Introduce small table optimization
  repair: Convert put_row_diff_with_rpc_stream to use coroutine
2023-11-24 15:11:42 +02:00
Asias He
c605220bb3 repair: Introduce small table optimization
*) Problem:

We have seen in the field that it takes longer than expected to repair system
tables like system_auth, which have a tiny amount of data but are replicated
to all nodes in the cluster. The cluster has multiple DCs, and each DC has
multiple nodes. The main reason for the slowness is that even if the amount
of data is small, repair has to walk through all the token ranges, that is,
num_tokens * number_of_nodes_in_the_cluster of them. The overhead of the
repair protocol for each token range dominates due to the small amount of
data per token range. Another reason is that the high network latency between
DCs makes the RPC calls used for repair consume more time.

*) Solution:

To solve this problem, a small table optimization for repair is introduced in
this patch. A new repair option is added to turn on this optimization.

- The user does not need to specify a token range to repair; all token ranges
are repaired automatically.

- Users only need to send the repair REST API request to a single node, which
can be any node in the cluster.

- It does not require the RF to be configured to replicate to all nodes in the
cluster. This means it can work with any table as long as the amount of data
is low, e.g., less than 100 MiB per node.

*) Performance:

1)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Before:
```
repair - repair[744cd573-2621-45e4-9b27-00634963d0bd]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1537, round_nr=4612,
round_nr_fast_path_already_synced=4611,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=115289, tx_hashes_nr=0, rx_hashes_nr=5, duration=1.5648403 seconds,
tx_row_nr=2, rx_row_nr=0, tx_row_bytes=356, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 0},
{127.0.14.4, 0}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.00010848}, {127.0.14.2,
0.00010848}, {127.0.14.3, 0}, {127.0.14.4, 0}, {127.0.14.5, 0.00010848},
{127.0.14.6, 0.00010848}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
0.639043}, {127.0.14.2, 0.639043}, {127.0.14.3, 0}, {127.0.14.4, 0},
{127.0.14.5, 0.639043}, {127.0.14.6, 0.639043}} Rows/s,
tx_row_nr_peer={{127.0.14.3, 1}, {127.0.14.4, 1}}, rx_row_nr_peer={}
```

After:
```
repair - repair[d6e544ba-cb68-4465-ab91-6980bcbb46a9]: stats:
repair_reason=repair, keyspace=system_auth, tables={roles, role_attributes,
role_members}, ranges_nr=1, round_nr=4, round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=80, tx_hashes_nr=0, rx_hashes_nr=0, duration=0.001459798 seconds,
tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.14.1, 178}, {127.0.14.2, 178}, {127.0.14.3, 178},
{127.0.14.4, 178}, {127.0.14.5, 178}, {127.0.14.6, 178}},
row_from_disk_nr={{127.0.14.1, 1}, {127.0.14.2, 1}, {127.0.14.3, 1},
{127.0.14.4, 1}, {127.0.14.5, 1}, {127.0.14.6, 1}},
row_from_disk_bytes_per_sec={{127.0.14.1, 0.116286}, {127.0.14.2, 0.116286},
{127.0.14.3, 0.116286}, {127.0.14.4, 0.116286}, {127.0.14.5, 0.116286},
{127.0.14.6, 0.116286}} MiB/s, row_from_disk_rows_per_sec={{127.0.14.1,
685.026}, {127.0.14.2, 685.026}, {127.0.14.3, 685.026}, {127.0.14.4, 685.026},
{127.0.14.5, 685.026}, {127.0.14.6, 685.026}} Rows/s, tx_row_nr_peer={},
rx_row_nr_peer={}
```

Repair time improvement: 1.5648403 s / 0.001459798 s ≈ 1072x

2)
3 DCs, each DC has 2 nodes, 6 nodes in the cluster. RF = {dc1: 2, dc2: 2, dc3: 2}

Same test as above, except that a 5 ms delay is added to simulate multi-DC
network latency:

The time to repair is reduced from 333 s to 0.2 s.

333.26758 s / 0.22625381 s ≈ 1473x

3)

3 DCs, each DC has 3 nodes, 9 nodes in the cluster. RF = {dc1: 3, dc2: 3, dc3: 3}, 10 ms network latency

Before:

```
repair - repair[86124a4a-fd26-42ea-a078-437ca9e372df]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=2305, round_nr=6916,
round_nr_fast_path_already_synced=6915,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=1,
rpc_call_nr=276630, tx_hashes_nr=0, rx_hashes_nr=8, duration=986.34015
seconds, tx_row_nr=7, rx_row_nr=0, tx_row_bytes=1246, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
0}, {127.0.57.4, 0}, {127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0},
{127.0.57.8, 0}, {127.0.57.9, 0}}, row_from_disk_nr={{127.0.57.1, 1},
{127.0.57.2, 1}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}},
row_from_disk_bytes_per_sec={{127.0.57.1, 1.72105e-07}, {127.0.57.2,
1.72105e-07}, {127.0.57.3, 0}, {127.0.57.4, 0}, {127.0.57.5, 0},
{127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0}, {127.0.57.9, 0}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 0.00101385},
{127.0.57.2, 0.00101385}, {127.0.57.3, 0}, {127.0.57.4, 0},
{127.0.57.5, 0}, {127.0.57.6, 0}, {127.0.57.7, 0}, {127.0.57.8, 0},
{127.0.57.9, 0}} Rows/s, tx_row_nr_peer={{127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}}, rx_row_nr_peer={}
```

After:

```
repair - repair[07ebd571-63cb-4ef6-9465-6e5f1e98f04f]: stats:
repair_reason=repair, keyspace=system_auth, tables={role_attributes,
role_members, roles}, ranges_nr=1, round_nr=4,
round_nr_fast_path_already_synced=4,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0,
rpc_call_nr=128, tx_hashes_nr=0, rx_hashes_nr=0, duration=1.6052915
seconds, tx_row_nr=0, rx_row_nr=0, tx_row_bytes=0, rx_row_bytes=0,
row_from_disk_bytes={{127.0.57.1, 178}, {127.0.57.2, 178}, {127.0.57.3,
178}, {127.0.57.4, 178}, {127.0.57.5, 178}, {127.0.57.6, 178},
{127.0.57.7, 178}, {127.0.57.8, 178}, {127.0.57.9, 178}},
row_from_disk_nr={{127.0.57.1, 1}, {127.0.57.2, 1}, {127.0.57.3, 1},
{127.0.57.4, 1}, {127.0.57.5, 1}, {127.0.57.6, 1}, {127.0.57.7, 1},
{127.0.57.8, 1}, {127.0.57.9, 1}},
row_from_disk_bytes_per_sec={{127.0.57.1, 0.00037793}, {127.0.57.2,
0.00037793}, {127.0.57.3, 0.00037793}, {127.0.57.4, 0.00037793},
{127.0.57.5, 0.00037793}, {127.0.57.6, 0.00037793}, {127.0.57.7,
0.00037793}, {127.0.57.8, 0.00037793}, {127.0.57.9, 0.00037793}}
MiB/s, row_from_disk_rows_per_sec={{127.0.57.1, 2.22634},
{127.0.57.2, 2.22634}, {127.0.57.3, 2.22634}, {127.0.57.4,
2.22634}, {127.0.57.5, 2.22634}, {127.0.57.6, 2.22634},
{127.0.57.7, 2.22634}, {127.0.57.8, 2.22634}, {127.0.57.9,
2.22634}} Rows/s, tx_row_nr_peer={}, rx_row_nr_peer={}
```

The time to repair is reduced from 986 s (about 16 minutes) to 1.6 s.

*) Summary

So, a more than 1000x improvement is observed for this common usage of the
system-table repair procedure.

Fixes #16011
Refs  #15159
2023-11-20 15:11:16 +08:00
Botond Dénes
dfd7981fa7 api/storage_service: start/stop native transport in the statement sg
Currently, it is started/stopped in the streaming/maintenance sg, which
is what the API itself runs in.
Starting the native transport in the streaming sg leads to severely
degraded performance, as the streaming sg has significantly fewer
CPU/disk shares and reader concurrency semaphore resources.
Furthermore, it can lead to multi-paged reads switching
between scheduling groups mid-way, triggering an internal error.

To fix, use `with_scheduling_group()` for both starting and stopping the
native transport. Technically, it is only strictly necessary for
starting, but I added it for stopping as well, for consistency.
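
A sketch of the fix, with a stand-in controller type (only Seastar's
`with_scheduling_group()` is a real API here; the wiring is illustrative):

```
#include <seastar/core/future.hh>
#include <seastar/core/scheduling.hh>
#include <seastar/core/with_scheduling_group.hh>

using namespace seastar;

struct protocol_controller {        // stand-in for the transport controller
    future<> start_server();
    future<> stop_server();
};

future<> start_native_transport(protocol_controller& ctl,
                                scheduling_group statement_sg) {
    // Hop to the statement group instead of inheriting the caller's
    // streaming/maintenance group.
    return with_scheduling_group(statement_sg, [&ctl] {
        return ctl.start_server();
    });
}
```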

Also apply the same treatment to RPC (Thrift). Although no one uses it,
it is best to fix it too, just to be on the safe side.

I think we need a more systematic approach to solve this once and for
all, like passing the scheduling group to the protocol server and having
it switch to it internally. That would allow the server to always run in
the correct scheduling group, without depending on the caller to remember
to use it. However, I think this is best done in a follow-up, to keep this
critical patch small and easily backportable.

Fixes: #15485

Closes scylladb/scylladb#16019
2023-11-13 14:08:01 +03:00
Kefu Chai
efd65aebb2 build: cmake: add check-header target
To have feature parity with `configure.py`. We won't need this
once we migrate to C++20 modules, but until that day comes we
need to stick with C++ headers.

We generate a rule for each .hh file to create a corresponding
.cc and then compile it, in order to verify that the header is
self-contained. The number of rules is therefore quite large, so
to avoid the unnecessary overhead, the check-header target is
enabled only if the `Scylla_CHECK_HEADERS` option is enabled.
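
Conceptually, each generated checker is just a translation unit that
includes the header and nothing else, so compiling it on its own proves the
header is self-contained (file name and path illustrative):

```
// gen/locator__token_metadata_check.cc (illustrative path):
#include "locator/token_metadata.hh"
```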

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15913
2023-11-13 10:27:06 +02:00
Kamil Braun
eb6943b852 api: failure_detector: fix indentation 2023-11-06 17:12:17 +01:00
Kamil Braun
a89c69007e api: failure_detector: invoke on shard 0
These APIs may return stale or simply incorrect data on shards
other than 0. Newer versions of Scylla are better at maintaining
cross-shard consistency, but we need a simple fix that can be
backported easily and without risk to older versions; this is that fix.
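
A sketch of the shape of the fix (the handler body is hypothetical;
smp::submit_to is the real Seastar call):

```
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

using namespace seastar;

int read_endpoint_states() { return 0; }  // stand-in for the gossiper read

future<int> get_states_on_shard0() {
    // Route the query to shard 0, where the state is authoritative.
    return smp::submit_to(0, [] {
        return read_endpoint_states();
    });
}
```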

Fixes: scylladb/scylladb#15816
2023-11-06 17:03:38 +01:00
Kefu Chai
3a6e359328 build: cmake: add token_metadata.cc to api
`token_metadata.cc` moved into api in e4c0a4d34d, let's update CMake
accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15857
2023-10-29 18:30:32 +02:00
Pavel Emelyanov
8e1ff745fa api: Remove http::context -> token_metadata dependency
Now token metadata usage is fine-grained, confined to the relevant
endpoint handlers only.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-24 17:49:05 +03:00
Pavel Emelyanov
be9ea0c647 api: Pass shared_token_metadata instead of storage_service
The token metadata endpoints need token metadata, not the storage service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-24 17:48:27 +03:00
Pavel Emelyanov
c23193bed0 api: Move snitch endpoints that use token metadata only
The snitch is now a service that speaks for the local node only. In order
to get the dc/rack for peers in the cluster, one needs to use the topology,
which, in turn, lives on token metadata. This patch moves the dc/rack
getters to api/token_metadata.cc, next to the other token-metadata-related
endpoints.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-24 17:47:18 +03:00
Pavel Emelyanov
e4c0a4d34d api: Move storage_service endpoints that use token metadata only
There are a few of them that don't need the storage service for anything
but getting token metadata. Move them to their own .cc/.hh units.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-24 17:44:53 +03:00
Pavel Emelyanov
0981661f8b api: Unset task_manager test API handlers
So that the task_manager reference is not used on stop, when it shouldn't be

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:56:24 +03:00
Pavel Emelyanov
2d543af78e api: Unset task_manager API handlers
So that the task_manager reference is not used on stop, when it shouldn't be

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:56:01 +03:00
Pavel Emelyanov
0632ad50f3 api: Remove ctx->task_manager dependency
Now the task manager's API (and test API) use the argument, and this
explicit dependency is no longer required.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:55:27 +03:00
Pavel Emelyanov
572c880d97 api: Use task_manager& argument in test API handlers
Now it's there and can be used. This will allow removing the
ctx->task_manager dependency soon

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:55:13 +03:00
Pavel Emelyanov
0396ce7977 api: Push sharded<task_manager>& down the test API set calls
This is to make it possible for the next patch to use this reference
instead of the ctx.tm one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:54:53 +03:00
Pavel Emelyanov
ef1d2b2c86 api: Use task_manager& argument in API handlers
Now it's there and can be used. This will allow removing the
ctx->task_manager dependency soon

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:54:24 +03:00
Pavel Emelyanov
14e10e7db4 api: Push sharded<task_manager>& down the API set calls
This is to make it possible for the next patch to use this reference
instead of the ctx.tm one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-18 18:52:46 +03:00
Pavel Emelyanov
162642ac18 api: Remove proxy reference from http context
Now it's unused

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:16:52 +03:00
Pavel Emelyanov
e76f23994c api,hints: Use proxy instead of ctx
Hints endpoints currently use the ctx.sp reference, but they have the
direct proxy reference at hand and should prefer it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:16:01 +03:00
Pavel Emelyanov
6ce7ec4a5e api,hints: Pass sharded<proxy>& instead of gossiper&
The proxy is the target service for handling hints API endpoints, so we
need to pass it as an argument to the handlers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:15:28 +03:00
Pavel Emelyanov
5f521116a2 api,hints: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:15:20 +03:00
Pavel Emelyanov
53891dd9cc api,hints: Move gossiper access to proxy
API handlers should try to avoid using any service other than the "main"
one. For the hints API this service is going to be the proxy, so there is
no gossiper access in the handler itself.

(indentation is left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:14:26 +03:00
Kamil Braun
b68d6ad5e9 api: storage_service: unset reload_raft_topology_state
Every endpoint needs to be unset. Oversight in
992f1327d3.

Closes scylladb/scylladb#15591
2023-10-03 09:12:12 +03:00
Pavel Emelyanov
2603605cd5 api: Make storage_proxy handlers use proxy argument
And stop using the proxy reference from the http context. After a while,
the proxy dependency will be removed from the http context.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:09 +03:00
Pavel Emelyanov
fc4335387a api: Change some static helpers to use proxy instead of ctx
There are some helpers in storage_proxy.cc that get the proxy reference
from the passed http context argument. The next patch will stop using ctx
for that purpose, so prepare in advance by making the helpers take a proxy
reference argument directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:09 +03:00
Pavel Emelyanov
4910b4d5b7 api: Pass sharded<storage_proxy> reference to storage_proxy handlers
The goal is to make the handlers use the proxy argument instead of
keeping the proxy as a dependency of the http context (most other
handlers already work this way).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:09 +03:00
Pavel Emelyanov
7ef7b05397 api: Remove storage_service argument from storage_proxy setup
The code setting up the storage_proxy/ endpoints no longer needs
storage_service and the related decoration.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:09 +03:00
Pavel Emelyanov
0eea513663 api: Move storage_proxy/ endpoint using storage_service
The storage_proxy/get_schema_version endpoint is served by
storage_service, so it should live in storage_service.cc instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:10:08 +03:00
Pavel Emelyanov
b5eb474d95 api: Remove storage_proxy.hh from storage_service.cc
Proxy is not used in storage service handlers

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-29 14:09:01 +03:00
Kamil Braun
992f1327d3 api: storage_service: add REST API to reload topology state
Some tests may want to modify the system.topology table directly. Add a
REST API to reload the state into memory. An alternative would be
restarting the server, but that's slower and may have other side effects
undesired in tests.

The API can also be called outside tests; it should not have any
observable effects unless the user modifies the `system.topology` table
directly (which they should never do, outside of perhaps some
disaster-recovery scenarios).
2023-09-28 11:59:16 +02:00
Pavel Emelyanov
e022a76350 main, api: Set/Unset storage_service API in proper place
Currently the storage-service API handlers are set up in a "random" place.
It can happen earlier -- as soon as the storage service itself is ready.

Also, even though the storage service is stopped on shutdown, API handlers
continue to reference it, leading to potential use-after-frees or "local is
not initialized" assertions.

Fix both. Unsetting is pretty bulky; scylladb/seastar#1620 is to help.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:22:13 +03:00
Pavel Emelyanov
78a22c5ae3 api/storage_service: Remove gossiper arg from API
Now all handlers work purely on storage_service, and the gossiper
argument is no longer needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:21:46 +03:00
Pavel Emelyanov
8dc6e74138 api/storage_service: Remove system keyspace arg from API
It's not used anymore.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:21:25 +03:00
Pavel Emelyanov
27eaff9d44 api/storage_service: Get gossiper from storage service
Some handlers in set_storage_service() have an implicit dependency on the
gossiper. It's not the API that should track it, but the storage service
itself, so get the gossiper from the service, not from the external
argument (which will be removed soon).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:20:27 +03:00
Pavel Emelyanov
4008ebb1b0 api/storage_service: Get token_metadata from storage service
The API handlers that live in set_storage_service() should be
self-contained and operate on the storage service only. That said, they
should get the token metadata, when needed, from the storage service, not
from somewhere else.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:19:24 +03:00