The flush of hints and batchlog is needed only for tables with
tombstone_gc_mode set to repair mode. We should skip the flush if the
tombstone_gc_mode is not repair mode.
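A minimal sketch of the intended check (helper and accessor names here
are illustrative, not the actual Scylla API):

    // Hypothetical helper: flush hints and batchlog only when the
    // table's tombstone GC mode depends on repair time information.
    future<> maybe_flush_hints_batchlog(const schema& s) {
        if (s.tombstone_gc_options().mode() != tombstone_gc_mode::repair) {
            // Not repair mode: the flush is unnecessary, skip it.
            return make_ready_future<>();
        }
        return flush_hints_and_batchlog();
    }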
Fixes #10004
Closes #10124
More and more places are using the repair[uuid]: format to log repair
jobs with their uuid. Convert the remaining places to the new format to
unify the log format.
This makes it easier to grep for a specific repair job in the log.
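For example (an illustrative call; the exact message text varies by
call site):

    // Unified format: prefix each message with repair[<uuid>]:
    rlogger.info("repair[{}]: Starting repair for keyspace={}, table={}",
            id.uuid, keyspace_name, table_name);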
Closes #10125
Rather than using a static uint32_t next_id,
move the next_id variable into repair_service shard 0
and manage it there.
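A sketch of the shard-0 allocation (method and member names are
illustrative):

    // Allocate repair ids on shard 0 so they are unique node-wide.
    future<int> repair_service::next_repair_id() {
        return container().invoke_on(0, [] (repair_service& rs) {
            return int(++rs._next_repair_id);
        });
    }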
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Note that we can't pass the repair_service container()
from its ctor since it's not populated until all shards start.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Keep repair_meta in repair_meta_map as shared_ptr<repair_meta>
rather than lw_shared_ptr<repair_meta> so the map can be defined
in the header file using only a forward-declared repair_meta class.
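The point is that std::shared_ptr, unlike seastar::lw_shared_ptr, can
be declared over an incomplete type. A sketch:

    class repair_meta;  // forward declaration is sufficient

    // OK: std::shared_ptr type-erases its deleter, so repair_meta may
    // be incomplete here. lw_shared_ptr would need the complete type,
    // forcing the header to include repair_meta's full definition.
    using repair_meta_map
        = std::unordered_map<node_repair_meta_id, std::shared_ptr<repair_meta>>;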
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define the static {get,insert,remove}_repair_meta functions outside
the repair_meta class definition, on the way to moving them,
along with repair_meta_map itself, to repair_service.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All repair_meta needs is the local instance.
If need be, repair_service is a peering sharded service,
so its container() can be used.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use repair_service as the authoritative source for
the database, messaging_service, system_distributed_keyspace,
and view_update_generator, similar to repair_info.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except for
licenses/README.md.
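For example, the dual-licensed header becomes:

    /*
     * SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
     */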
Closes #9937
"
Repair obtains a permit for each repair-meta instance it creates. This
permit is supposed to track all resources consumed by that repair as
well as ensure the concurrency limit is respected. However, when the
non-local reader path is used (shard config of master != shard config of
follower), a second permit will be obtained -- for the shard reader of
the multishard reader. This creates a situation where the repair-meta's
permit can block the shard permit, resulting in a deadlock.
This patch solves this by dropping the count resource on the
repair-meta's permit when the non-local reader path is executed -- that
is, when a multishard reader is created.
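A sketch of the fix, using the release_base_resource() API added in
this series (the surrounding condition is illustrative):

    // Non-local reader path: the multishard reader's shard readers
    // obtain their own permits, so release the count resource held by
    // the repair-meta's permit to keep one effective permit per repair
    // and avoid the two permits deadlocking on the semaphore.
    if (!is_local_reader) {  // illustrative condition
        _permit.release_base_resource();
    }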
Fixes: #9751
"
* 'repair-double-permit-block/v4' of https://github.com/denesb/scylla:
repair: make sure there is one permit per repair with count res
reader_permit: add release_base_resource()
"
SSTables created by repair will potentially not conform to the
compaction strategy layout goal. If the node shuts down before
off-strategy compaction has a chance to reshape those files, the node
will be forced to reshape them on restart. That causes unexpected
downtime. It turns out we can skip reshaping those files on boot and
allow them to be reshaped after the node becomes online, as if the node
never went down. Those files will go through the same procedure as
files created by repair-based operations: they will be placed in the
maintenance set and reshaped iteratively until ready for integration
into the main set.
"
Fixes #9895.
tests: UNIT(dev).
* 'postpone_reshape_on_repair_originated_files' of https://github.com/raphaelsc/scylla:
distributed_loader: postpone reshape of repair-originated sstables
sstables: Introduce filter for sstable_directory::reshape
table: add fast path when offstrategy is not needed
sstables: add constant for repair origin
This series greatly reduces the gossiper's dependence on `seastar::async` (though not completely).
`i_endpoint_state_change_subscriber` callbacks are converted to return futures (again, to get rid of the `seastar::async` dependency), and all users are adjusted appropriately (e.g. `storage_service`, `cdc::generation_service`, `streaming::stream_manager`, `view_update_backlog_broker` and `migration_manager`).
This includes futurizing and coroutinizing the whole function call chain up to the `i_endpoint_state_change_subscriber` callback functions.
To aid the conversion process, a non-`seastar::async`-dependent variant of `utils::atomic_vector::for_each` is introduced (`for_each_futurized`). A different name is used to clearly distinguish converted and non-converted code, so that the last step (removing the `seastar::async()` wrappers around callback-calling code in the gossiper) is easier. This is left for a follow-up series, though.
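For example, a subscriber callback changes from a blocking function run
inside seastar::async() to a future-returning one (sketch):

    // Before: blocking, relies on seastar::async() in the gossiper.
    virtual void on_join(gms::inet_address ep, gms::endpoint_state state) override;

    // After: returns a future; callers chain or co_await it instead.
    virtual future<> on_join(gms::inet_address ep, gms::endpoint_state state) override;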
Tests: unit(dev)
Closes #9844
* github.com:scylladb/scylla:
service: storage_service: coroutinize `set_gossip_tokens`
service: storage_service: coroutinize `leave_ring`
service: storage_service: coroutinize `handle_state_left`
service: storage_service: coroutinize `handle_state_leaving`
service: storage_service: coroutinize `handle_state_removing`
service: storage_service: coroutinize `do_drain`
service: storage_service: coroutinize `shutdown_protocol_servers`
service: storage_service: coroutinize `excise`
service: storage_service: coroutinize `remove_endpoint`
service: storage_service: coroutinize `handle_state_replacing`
service: storage_service: coroutinize `handle_state_normal`
service: storage_service: coroutinize `update_peer_info`
service: storage_service: coroutinize `do_update_system_peers_table`
service: storage_service: coroutinize `update_table`
service: storage_service: coroutinize `handle_state_bootstrap`
service: storage_service: futurize `notify_*` functions
service: storage_service: coroutinize `handle_state_replacing_update_pending_ranges`
repair: row_level_repair_gossip_helper: coroutinize `remove_row_level_repair`
locator: reconnectable_snitch_helper: coroutinize `reconnect`
gms: i_endpoint_state_change_subscriber: make callbacks return futures
utils: atomic_vector: introduce future-returning `for_each` function
utils: atomic_vector: rename `for_each` to `thread_for_each`
gms: gossiper: coroutinize `start_gossiping`
gms: gossiper: coroutinize `force_remove_endpoint`
gms: gossiper: coroutinize `do_status_check`
gms: gossiper: coroutinize `remove_endpoint`
"
The first patch introduces evictable_reader_v2, and the second one
further simplifies it. We clone instead of converting because there
is at least one downstream (by way of multishard_combining_reader) use
that is not itself straightforward to convert at the moment
(multishard_mutation_query), and because evictable_reader instances
cannot be {up,down}graded (since users also access the underlying
buffers). This also means that shard_reader, reader_lifecycle_policy
and multishard_combining_reader have to be cloned.
"
* tag 'clone-evictable-reader-to-v2/v3' of https://github.com/cmm/scylla:
convert make_multishard_streaming_reader() to flat_mutation_reader_v2
convert table::make_streaming_reader() to flat_mutation_reader_v2
convert make_flat_multi_range_reader() to flat_mutation_reader_v2
view_update_generator: remove unneeded call to downgrade_to_v1()
introduce multishard_combining_reader_v2
introduce shard_reader_v2
introduce the reader_lifecycle_policy_v2 abstract base
evictable_reader_v2: further code simplifications
introduce evictable_reader_v2 & friends
Do not depend on the_repair_tracker().
With that, the_repair_tracker() is no longer used
and should be deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Embed the node_ops_metrics instance as a member of the
sharded repair_service.
Test: curl --silent http://127.0.0.1:9180/metrics | grep node_ops | grep -v "^#"
on a freshly started scylla instance.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This prepares for de-globalizing its thread-local instance
by placing a node_ops_metrics member in repair_service.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
These functions are called from the api layer.
Continue to hide the repair tracker from the caller
but use the repair_service already available
at the api layer to invoke the respective high-level
methods without requiring `the_repair_tracker()`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rename the global repair_tracker getter to `the_repair_tracker()`
as a first step toward getting rid of it.
repair_service methods now use the repair_service::repair_tracker()
method.
The global getter keeps the `the_repair_tracker()` name temporarily
while it is being eliminated in this series, to help distinguish it
from repair_service::repair_tracker().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than keeping all shards' semaphores
and repair_info instances on the tracker's single-shard instance,
instantiate the tracker on all shards, each tracking the
repair jobs on its local shard.
For now, until it's deglobalized, turn
_the_tracker into a static thread_local pointer.
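A sketch of the transitional arrangement (simplified):

    // Per-shard tracker: each shard's repair_service installs its own
    // instance; the global accessor resolves to the local shard's
    // tracker until it is removed later in the series.
    static thread_local tracker* _the_tracker = nullptr;

    tracker& the_repair_tracker() {
        assert(_the_tracker);
        return *_the_tracker;
    }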
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader), but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
Sort follower nodes by proximity so that, in the step where the
master node gets missing rows from repair follower nodes, the master
node has a chance to get the missing rows from a near node first
(e.g., a local-DC node), avoiding getting rows from a far node.
For example:
dc1: n1, n2
dc2: n3, n4
dc3: n5, n6
Run repair on n1. With this patch, n1 will get data first from n2,
which is in the same DC; see the sketch after the log excerpt below.
[shard 0] repair - Repair 1 out of 1 ranges, id=[id=1, uuid=8b0040bd-5aa5-42e1-bb9f-58c5e7052aec],
shard=0, keyspace=ks, table={cf}, range=(-6734413101754081925, -6539883972247625343],
peers={127.0.39.5, 127.0.39.6, 127.0.39.2, 127.0.39.4, 127.0.39.3},
live_peers={127.0.39.5, 127.0.39.6, 127.0.39.2, 127.0.39.4, 127.0.39.3}
[shard 0] repair - Before sort = {127.0.39.5, 127.0.39.6, 127.0.39.2, 127.0.39.4, 127.0.39.3}
[shard 0] repair - After sort = {127.0.39.2, 127.0.39.5, 127.0.39.6, 127.0.39.4, 127.0.39.3}
[shard 0] repair - Started Row Level Repair (Master): local=127.0.39.1,
peers={127.0.39.2, 127.0.39.5, 127.0.39.6, 127.0.39.4, 127.0.39.3}
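A sketch of the sorting criterion (the snitch accessors are
illustrative):

    // Move nodes in the master's own DC to the front so missing rows
    // are fetched from a near node first.
    auto my_dc = snitch->get_datacenter(my_address);
    std::stable_sort(nodes.begin(), nodes.end(),
            [&] (gms::inet_address a, gms::inet_address b) {
        return int(snitch->get_datacenter(a) == my_dc)
             > int(snitch->get_datacenter(b) == my_dc);
    });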
Closes #9769
Currently, the repair parallelism is calculated from the amount of
memory allocated to repair and the memory usage per repair instance.
However, the per-instance memory usage does not take into account the
maximum possible memory usage caused by repair followers.
As a result, when repairing a table with a higher replication factor,
e.g., 3 DCs with 3 replicas each, the repair master node could use 9X
the repair buffer size in the worst case. This could cause OOM when the
system is under pressure.
This patch introduces a semaphore to cap the max memory usage.
Each repair instance takes the max possible memory usage budget before
it starts. This ensures repair never uses more than the memory
allocated to it.
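A minimal sketch of the scheme (names and the exact budget formula are
illustrative):

    // One semaphore per shard, sized to the memory allocated to repair.
    seastar::semaphore _memory_sem{repair_memory_budget};

    // Each repair instance reserves its worst-case budget up front:
    // roughly the row buffer size times the number of peers it may
    // pull rows from. The units are held for the repair's lifetime.
    future<seastar::semaphore_units<>> reserve_memory(size_t nr_peers) {
        return seastar::get_units(_memory_sem, row_buf_size * nr_peers);
    }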
Fixes #9817
Closes #9818
The gc_grace_seconds mechanism is a very fragile and broken design
inherited from Cassandra. Deleted data can be resurrected if a
cluster-wide repair is not performed within gc_grace_seconds. This
design pushes the job of keeping the database consistent to the user.
In practice, it is very hard to guarantee repair is performed within
gc_grace_seconds all the time. For example, the repair workload has the
lowest priority in the system and can be slowed down by higher-priority
workloads, so there is no guarantee when a repair will finish. A
gc_grace_seconds value that used to work might not work after the data
volume grows in a cluster. Users might want to avoid running repair
during a specific period when latency is the top priority for their
business.
To solve this problem, an automatic mechanism to protect against data
resurrection is proposed and implemented. The main idea is to remove a
tombstone only after the range that covers the tombstone has been
repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure the tombstone GC mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode: if no tombstone_gc option is specified by
the user, the old gc_grace_seconds-based GC is used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option,
'propagation_delay_in_seconds', is added. It defines the max time a
write could possibly be delayed before it eventually arrives at a node.
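The two options can be combined, for example (the delay value shown is
illustrative):

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair', 'propagation_delay_in_seconds':'3600'};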
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes #3560
[avi: resolve conflicts vs data_dictionary]
Consider
1) n1, n2, n3, n4, n5
2) n2 and n3 are both down
3) start n6 to replace n2
4) start n7 to replace n3
We want to replace the dead nodes n2 and n3 to restore the cluster to
5 running nodes.
The replace operation in step 3 will fail because n3 is down.
We would see errors like below:
replace[25edeec0-57d4-11ec-be6b-7085c2409b2d]: Nodes={127.0.0.3} needed
for replace operation are down. It is highly recommended to fix the down
nodes and try again.
In the above example, there is currently no way to replace either of
the dead nodes.
Users can either fix one of the dead nodes and run replace, or run a
removenode operation to remove one of the dead nodes, then run replace,
and then run bootstrap to add another node.
Fixing dead nodes is always the best solution, but it might not be
possible. Running a removenode operation is no better than running a
replace operation (with best effort, ignoring the other dead node) in
terms of data consistency. In addition, users would have to run a
bootstrap operation to add back the removed node. So, allowing replace
in such a case is a clear win.
This patch adds the --ignore-dead-nodes-for-replace option to allow
running the replace operation in best-effort mode. Please note, use
this option only if the dead nodes are completely broken and down, and
there is no way to fix them and bring them back. This also means the
user has to make sure the ignored dead nodes specified are really down,
to avoid any data consistency issues.
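An illustrative invocation for the example above, replacing n2
(127.0.0.2) while ignoring the other dead node n3 (127.0.0.3); the
replace flag shown alongside the new option may differ per setup:

    scylla --replace-address-first-boot 127.0.0.2 \
           --ignore-dead-nodes-for-replace 127.0.0.3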
Fixes #9757
Closes #9758
Add a stringify function for the node_ops_cmd enum.
This will make the log output more readable and will make it
possible (hopefully) to do initial analysis without consulting
the source code.
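A sketch of the stringify function (the enum values shown are an
illustrative subset):

    // Map each node_ops_cmd value to its name for logging.
    std::ostream& operator<<(std::ostream& os, node_ops_cmd cmd) {
        switch (cmd) {
        case node_ops_cmd::removenode_prepare: return os << "removenode_prepare";
        case node_ops_cmd::removenode_done:    return os << "removenode_done";
        case node_ops_cmd::replace_prepare:    return os << "replace_prepare";
        // ... remaining values elided ...
        default: return os << "node_ops_cmd(" << uint32_t(cmd) << ")";
        }
    }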
Refs #9629
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
Closes #9745
Let's enable off-strategy compaction for repair-based rebuild, so it
can take advantage of the off-strategy benefits. One of the most
important is that compaction does not act aggressively, which matters
both for reducing operation time and for delivering good latency while
the operation is running.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211130115957.13779-1-raphaelsc@scylladb.com>
There are two places in repair that call for the global gossiper
instance. However, the repair_service already has a sharded gossiper on
board, so the first place can use it directly.
The second place is called from inside a repair_info method. This place
is fixed by keeping a gossiper reference on the info, just like it's
done for the other services the info needs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The range::subtract() was used as a trick to implement
range::intersection() back when intersection was not yet available.
The following code is problematic:
dht::token_range given_range_complement(tok_end, tok_start);
because dht::token_range should be a non-wrapping range.
To fix, use compat::unwrap_into() to unwrap the range and use
range::intersection() to calculate the intersection.
Note that even though given_range_complement is problematic, the
current code generates correct intersections.
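A sketch of the new calculation (simplified; treat it as a sketch of
the idea, not the exact diff):

    // Unwrap the possibly wrapping (start, end] input into non-wrapping
    // pieces, then intersect each piece with the node's owned ranges.
    dht::token_range_vector intersections;
    compat::unwrap_into(std::move(wrange), dht::token_comparator(),
            [&] (dht::token_range r) {
        for (const auto& owned : ranges) {
            if (auto x = r.intersection(owned, dht::token_comparator())) {
                intersections.push_back(std::move(*x));
            }
        }
    });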
Example 1:
$ curl -X POST 'http://127.0.0.1:10000/storage_service/repair_async/keyspace1?startToken=5&endToken=100'
[shard 0] repair - starting user-requested
repair for keyspace keyspace1, repair id [id=1,
uuid=aa71a192-5967-4f05-99b8-5febd9d81d50], options {{ endToken -> 100}, { startToken -> 5}}
[shard 0] repair - start=5, end=100, given_range_complement=(100, 5], wrange=(100, 5], is_wrap_around=true,
ranges={(-inf, -7612759882658906007], (-7612759882658906007,
-6766703710995023384], (-6766703710995023384, 2918449800065200439],
(2918449800065200439, 8039072586540417979], (8039072586540417979,
+inf)}, intersections={(5, 100]}
[shard 0] repair - New method intersections={(5, 100]}
Example 2:
$ curl -X POST
'http://127.0.0.1:10000/storage_service/repair_async/keyspace1?startToken=100&endToken=5'
[shard 0] repair - starting user-requested
repair for keyspace keyspace1, repair id [id=1,
uuid=f6076438-015c-4bdc-8ebd-0a55664365fa], options {{ endToken -> 5}, { startToken -> 100}}
[shard 0] repair - start=100, end=5, given_range_complement=(5, 100], wrange=(5, 100], is_wrap_around=false,
ranges={(-inf, -7612759882658906007], (-7612759882658906007,
-6766703710995023384], (-6766703710995023384, 2918449800065200439],
(2918449800065200439, 8039072586540417979], (8039072586540417979,
+inf)}, intersections={
(-inf, -7612759882658906007],
(-7612759882658906007, -6766703710995023384],
(-6766703710995023384, 5],
(100, 2918449800065200439],
(2918449800065200439, 8039072586540417979],
(8039072586540417979, +inf)}
[shard 0] repair - New method intersections={
(-inf, -7612759882658906007],
(-7612759882658906007, -6766703710995023384],
(-6766703710995023384, 5],
(100, 2918449800065200439],
(2918449800065200439, 8039072586540417979],
(8039072586540417979, +inf)}
Fixes #9560
Closes #9561