Before the integration with the task manager, the state of a single
shard repair was kept in repair_info. The repair_info object was
destroyed immediately after the shard repair finished.
During the integration, repair_info's fields were moved to
shard_repair_task_impl, as the two served similar purposes.
However, shard_repair_task_impl isn't destroyed immediately; it is
kept in the task manager for task_ttl seconds after it completes.
Thus, some of repair_info's fields had their lifetime prolonged,
which delayed the release of the repair state.
Release shard_repair_task_impl's resources immediately after the
shard repair finishes.
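The lifetime fix can be sketched as follows; this is a minimal Python
model of the C++ change, with illustrative names (RepairTask,
release_resources) that are not the actual ScyllaDB identifiers:

```python
class RepairTask:
    def __init__(self, ranges):
        # Heavy per-shard repair state, formerly held by repair_info.
        self._ranges = ranges
        self._done = False

    def run(self):
        # ... perform the shard repair over self._ranges ...
        self._done = True
        # Release the heavy state now, instead of waiting for the task
        # object itself to be dropped after task_ttl seconds.
        self.release_resources()

    def release_resources(self):
        self._ranges = None

task = RepairTask(ranges=[(0, 100), (100, 200)])
task.run()
assert task._done and task._ranges is None
```

The task object itself still outlives the repair for task_ttl seconds,
so its status can be queried, but the bulky repair state is gone.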
Fixes: #15505.
Closes scylladb/scylladb#15506
Most of the time, only the roots of a task tree should be
non-internal. Change the default implementation of is_internal
accordingly and delete the overrides made redundant by it.
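A minimal sketch of the new default, under the assumption that a task
is internal exactly when it has a parent in the tree (names here are
illustrative, not the actual ScyllaDB API):

```python
class Task:
    def __init__(self, parent=None):
        self._parent = parent

    def is_internal(self):
        # Default: only tree roots are user-visible (non-internal);
        # subclasses no longer need to override this per task type.
        return self._parent is not None

root = Task()
child = Task(parent=root)
assert not root.is_internal()
assert child.is_internal()
```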
Closes scylladb/scylladb#15353
Node operations will be integrated with the task manager, so a
node_ops directory needs to be created. To give the task manager
access to node-ops-related classes and preserve consistent naming,
move the classes to node_ops/node_ops_data.cc.
Override the methods returning the expected number of children and
the job size in repair tasks. With them, the get_progress method can
return a more precise progress value.
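The idea can be sketched as follows; this is a hypothetical model, not
the ScyllaDB implementation: knowing the total job size up front lets
progress be reported against the whole job rather than only against
the child tasks created so far.

```python
def get_progress(completed_units, total_job_size):
    # With the expected job size known from the start, the ratio is
    # meaningful even before all child tasks have been created.
    if total_job_size == 0:
        return 0.0
    return completed_units / total_job_size

# 3 of 8 expected ranges finished -> 37.5%, rather than an
# overestimate based only on the children that exist so far.
assert get_progress(3, 8) == 0.375
```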
Taking a reason argument in task_manager_module::get_progress is
misleading, as the method works properly only for
streaming::stream_reason::repair
(repair::shard_repair_task_impl::nr_ranges_finished isn't updated for
any other reason).
This patch adds the ranges_parallelism option to the repair RESTful
API. Users can use this option to optionally lower the number of
ranges repaired in parallel per repair job below the default
max_repair_ranges_in_parallel calculated by the Scylla core.
Scylla Manager can also use this option to provide more ranges (>N)
in a single repair job while repairing only N (ranges_parallelism)
ranges in parallel, instead of providing N ranges per repair job.
To make it safer, unlike PR #4848, this patch does not allow the user
to exceed max_repair_ranges_in_parallel.
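The clamping behavior described above can be sketched as follows
(illustrative names, assuming a user-supplied ranges_parallelism may
lower, but never raise, the core-calculated limit):

```python
def effective_parallelism(max_repair_ranges_in_parallel,
                          ranges_parallelism=None):
    # No user override: use the core-calculated default.
    if ranges_parallelism is None:
        return max_repair_ranges_in_parallel
    # User override is honored only if it does not exceed the max.
    return min(ranges_parallelism, max_repair_ranges_in_parallel)

assert effective_parallelism(16) == 16      # default
assert effective_parallelism(16, 4) == 4    # user lowers it
assert effective_parallelism(16, 64) == 16  # cannot exceed the max
```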
Fixes #4847
Instead of just a boolean _failed flag, persist the error message of the
exception which caused the repair to fail, and include it in the log
message announcing the failure.
This helps users figure out whether the repair failed because a peer
node was down during the repair.
For example:
```
WARN [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: Repair
1026 out of 1026 ranges, keyspace=ks2a, table={test_table, tb},
range=(9203128250168517738,+inf), peers={127.0.0.2}, live_peers={},
status=skipped_no_live_peers
INFO [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: stats:
repair_reason=repair, keyspace=ks2a, tables={test_table, tb}, ranges_nr=513,
round_nr=0, round_nr_fast_path_already_synced=0,
round_nr_fast_path_same_combined_hashes=0, round_nr_slow_path=0, rpc_call_nr=0,
tx_hashes_nr=0, rx_hashes_nr=0, duration=0 seconds, tx_row_nr=0, rx_row_nr=0,
tx_row_bytes=0, rx_row_bytes=0, row_from_disk_bytes={}, row_from_disk_nr={},
row_from_disk_bytes_per_sec={} MiB/s, row_from_disk_rows_per_sec={} Rows/s,
tx_row_nr_peer={}, rx_row_nr_peer={}
WARN [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: 1026 out
of 1026 ranges failed, keyspace=ks2a, tables={test_table, tb},
repair_reason=repair, nodes_down_during_repair={127.0.0.2}
WARN [shard 0] repair - repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]:
repair_tracker run failed: std::runtime_error ({shard 0: std::runtime_error
(repair[ec2e9646-918e-4345-99ab-fa07aa1f17de]: 1026 out of 1026 ranges failed,
keyspace=ks2a, tables={test_table, tb}, repair_reason=repair,
nodes_down_during_repair={127.0.0.2})})
```
In addition, change `status=skipped` to `status=skipped_no_live_peers`
to make it clearer.
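The failure-state change can be sketched as follows; this is a
hypothetical Python model (RepairState, record_failure are
illustrative names), keeping the text of the first failing exception
instead of only a boolean flag so it can appear in the final log
message:

```python
class RepairState:
    def __init__(self):
        # Was: self._failed = False
        self._failed_because = None

    def record_failure(self, exc):
        # Keep the first error message for the failure log line.
        if self._failed_because is None:
            self._failed_because = str(exc)

    def failed(self):
        return self._failed_because is not None

state = RepairState()
state.record_failure(RuntimeError("peer 127.0.0.2 is down"))
assert state.failed()
assert "127.0.0.2" in state._failed_because
```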
Closes #13928