Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; they are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes #12788
The system keyspace uses the local replication strategy and thus
does not need to be repaired. It is nevertheless possible to invoke repair
of this keyspace through the API, which leads to a runtime error since
peer_events and scylla_table_schema_history have different sharding logic.
For keyspaces with a local replication strategy, repair_service::do_repair_start
now returns immediately.
Closes #12459
* github.com:scylladb/scylladb:
test: rest_api: check if repair of system keyspace returns before corresponding task is created
repair: finish repair immediately on local keyspaces
In most cases, the task manager's tasks are started just after they are
created. Thus, to reduce the boilerplate of creating and starting
tasks, a tasks::task_manager::module::make_and_start_task method is added.
Repair tasks are modified to use the method where possible.
Closes #12729
* github.com:scylladb/scylladb:
repair: use tasks::task_manager::module::make_and_start_task for repair tasks
tasks: add task_manager::module::make_and_start_task method
Consider:
- Bootstrap n1 in dc 1
- Create ks with EverywhereStrategy
- Bootstrap n2 in dc 2
Since n2 is the first node in dc2, there are no local-dc nodes to
sync data from. In this case, n2 should sync data with a node in dc1, even
though that node is in a remote dc.
Aborting of repair operations is fully managed by the task manager.
Repair tasks are aborted:
- on shutdown; top-level repair tasks subscribe to the global abort source, and on shutdown all tasks are aborted recursively
- through node operations (applies to data_sync_repair_task_impls and their descendants only); data_sync_repair_task_impl subscribes to the node_ops_info abort source
- with the task manager API (top-level tasks are abortable)
- with the storage_service API and on failure; these cases were modified to abort the same way as the ones above.
Closes #12085
* github.com:scylladb/scylladb:
repair: make top level repair tasks abortable
repair: unify a way of aborting repair operations
repair: delete sharded abort source from node_ops_info
repair: delete unused node_ops_info from data_sync_repair_task_impl
repair: delete redundant abort subscription from shard_repair_task_impl
repair: add abort subscription to data sync task
tasks: abort tasks on system shutdown
The type of the id of node operations is changed from utils::UUID
to node_ops_id. This way, the ids of node operations are easily
distinguished from the ids of other entities.
Closes #11673
data_sync_repair_task_impl subscribes to the corresponding node_ops_info
abort source and then, when an abort is requested, all its descendants are
aborted recursively. Thus, shard_repair_task_impl does not need
to subscribe to the node_ops_info abort source, since the parent
task takes care of aborting once it is requested.
The abort_subscription and connected attributes are deleted from
shard_repair_task_impl.
When a node operation is aborted, the same should happen to
the corresponding task manager repair task.
Subscribe data_sync_repair_task_impl's abort() to the node_ops_info
abort_source.
The type of an operation is tied to a specific task implementation.
Therefore, it should be accessed through a virtual
method on tasks::task_manager::task::impl rather than stored as
an attribute.
Closes #12326
* github.com:scylladb/scylladb:
api: delete unused type parameter from task_manager_test api
tasks: repair: api: remove type attribute from task_manager::task::status
tasks: add type() method to task_manager::task::impl
repair: add reason attribute to repair_task
The func is moved into the async thread, so the encapsulating
lambda should be declared mutable to move the func
rather than copy it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #12267
When repair master and followers have different shard count, the repair
followers need to create multi-shard readers. Each multi-shard reader
will create one local reader on each shard, N (smp::count) local readers
in total.
There is a hard limit on the number of readers that can work in parallel.
When there are more readers than this limit, the readers start to
evict each other, causing buffers already read from disk to be dropped
and readers to be recreated, which is not very efficient.
To reduce reader eviction overhead, a global reader permit
is introduced which accounts for the fan-out of multi-shard readers.
With this patch, at any point in time, the number of readers created by
repair will not exceed the reader limit.
Test Results:
1) with stream sem 10, repair global sem 10, 5 ranges in parallel, n1=2
shards, n2=8 shards, memory wanted =1
1.1)
[asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2 (repair on n2)
[2022-11-23 17:45:24,770] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:45:53,869] Repair session 1
[2022-11-23 17:45:53,869] Repair session 1 finished
real 0m30.212s
user 0m1.680s
sys 0m0.222s
1.2)
[asias@hjpc2 mycluster]$ time nodetool repair ks2 (repair on n1)
[2022-11-23 17:46:07,507] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:46:30,608] Repair session 1
[2022-11-23 17:46:30,608] Repair session 1 finished
real 0m24.241s
user 0m1.731s
sys 0m0.213s
2) with stream sem 10, repair global sem no_limit, 5 ranges in
parallel, n1=2 shards, n2=8 shards, memory wanted =1
2.1)
[asias@hjpc2 mycluster]$ time nodetool -p 7200 repair ks2 (repair on n2)
[2022-11-23 17:49:49,301] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:52:01,414] Repair session 1
[2022-11-23 17:52:01,415] Repair session 1 finished
real 2m13.227s
user 0m1.752s
sys 0m0.218s
2.2)
[asias@hjpc2 mycluster]$ time nodetool repair ks2 (repair on n1)
[2022-11-23 17:52:19,280] Starting repair command #1, repairing 1
ranges for keyspace ks2 (parallelism=SEQUENTIAL, full=true)
[2022-11-23 17:52:42,387] Repair session 1
[2022-11-23 17:52:42,387] Repair session 1 finished
real 0m24.196s
user 0m1.689s
sys 0m0.184s
Comparing 1.1) and 2.1) shows that eviction played a major role here.
The patch gives a 73s / 30s = 2.5X speed up in this setup.
Comparing 1.1) and 1.2) shows that even with the reader limit, starting
on the node with fewer shards is faster: 30s / 24s = 1.25X (the total
number of multishard readers is lower).
Fixes #12157
Closes #12158
The PR introduces shard_repair_task_impl, which represents a repair task
that spans a single shard's repair.
repair_info is replaced with shard_repair_task_impl, since both serve
a similar purpose.
Closes #12066
* github.com:scylladb/scylladb:
repair: reindent
repair: replace repair_info with shard_repair_task_impl
repair: move repair_info methods to shard_repair_task_impl
repair: rename methods of repair_module
repair: change type of repair_module::_repairs
repair: keep a reference to shard_repair_task_impl in row_level_repair
repair: move repair_range method to shard_repair_task_impl
repair: make do_repair_ranges a method of shard_repair_task_impl
repair: copy repair_info methods to shard_repair_task_impl
repair: coroutinize shard task creation
repair: define run for shard_repair_task_impl
repair: add shard_repair_task_impl
Currently, each data sync repair task is started (and hence run) twice.
Thus, when the two runs are separated by a long enough time frame,
the following situation may occur:
- the first run finishes
- after some time (ttl) the task is unregistered from the task manager
- the second run finishes and attempts to finish the task which does
not exist anymore
- memory access causes a segfault.
The second call to start is deleted. A check is added
to the start method to ensure that each task is started at most once.
Fixes: #12089
Closes #12090
As a preparation for replacing repair_info with shard_repair_task_impl,
the type of _repairs in the repair module is changed from
std::unordered_map<int, lw_shared_ptr<repair_info>> to
std::unordered_map<int, tasks::task_id>.
As a part of replacing repair_info with shard_repair_task_impl,
instead of a reference to repair_info, row_level_repair keeps
a reference to shard_repair_task_impl.
The do_repair_ranges function is directly connected to shard repair tasks.
Turning it into a shard_repair_task_impl method enables access to the
task's members without additional intermediate layers.
Methods of repair_info are copied to shard_repair_task_impl. They are
not used yet, it's a preparation for replacing repair_info with
shard_repair_task_impl.
Also make sure the token_metadata ring version is the same as the
reference one (from the erm on shard 0) when starting the
repair on each shard.
Refs #11993
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>