The repair reader depends on the table object being alive, while it is
reading. However, for local reads, there was no synchronization between
the lifecycle of the repair reader and that of the table. In some cases
this can result in use-after-free. Solve by using the table's existing
mechanism for lifecycle extension: `read_in_progress()`.
For the non-local reader, when the local node's shard configuration is
different from the remote one's, this problem is already solved, as the
multishard streaming reader already pins table objects on the used
shards. This creates an inconsistency that might be suprising (in a bad
way). One reader takes care of pinning needed resources while the other
one doesn't. I was thorn on how to reconcile this, and decided to go
with the simplest solution, explicitely pinning the table for local
reads, that is conserve the inconsistency. It was suggested that this
inconsitency is remedied by building resource pinning into the local
reader as well [1] but there is opposition to this [2]. Adding a wrapper
reader which does just the resource pinning seems excessive, both in
code and runtime overhead.
Spotted while investigating repair-related crashes which occured during
interrupted repairs.
Fixes: #4342
[1] https://github.com/scylladb/scylla/issues/4342#issuecomment-474271050
[2] https://github.com/scylladb/scylla/issues/4342#issuecomment-474331657
Tests: none, this is a trivial fix for a not-yet-seen-in-the-wild bug.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <8e84ece8343468960d4e161467ecd9bb10870c27.1553072505.git.bdenes@scylladb.com>
4 nodes in the cluster
n1, n2 in dc1
n3, n4 in dc2
dc1 RF=2, dc2 RF=2.
If we run
nodetool repair -hosts 127.0.0.1,127.0.03 -dc "dc1,dc2" multi
on n1.
The -hosts option will be ignored and only the -dc option
will be used to choose which hosts to repair. In this case, n1 to n4
will be repaired.
If user wants to select specific hosts to repair with, there is no need
to specify the -dc option. Use the -hosts option is enough.
Reject the combination and not to surprise the user.
In https://issues.apache.org/jira/browse/CASSANDRA-9876, the same logic
is introduced as well.
Refs #3836
Message-Id: <e95ac1099f98dd53bb9d6534316005ea3577e639.1551406529.git.asias@scylladb.com>
" from Asias
Some of the function names are not updated after we change the rpc verb
names. Rename them to make them consistent with the rpc verb names.
* seastar-dev.git asias/row_level_repair_rename_consistent_with_rpc_verb/v1:
repair: Rename request_sync_boundary to get_sync_boundary
repair: Rename request_full_row_hashes to get_full_row_hashes
repair: Rename request_combined_row_hash to get_combined_row_hash
repair: Rename request_row_diff to get_row_diff
repair: Rename send_row_diff to put_row_diff
repair: Update function name in docs/row_level_repair.md
cryptopp's config.h has the following pragma:
#pragma GCC diagnostic ignored "-Wunused-function"
It is not wrapped in a push/pop. Because of that, including cryptopp
headers disables that warning on scylla code too.
The issue has been reported as
https://github.com/weidai11/cryptopp/issues/793
To work around it, this patch uses a pimpl to have a single .cc file
that has to include cryptopp headers.
While at it, it also reduces the differences and code duplication
between the md5 and sha1 hashers.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
"
This is a first step in fixing #3988.
"
* 'espindola/large-row-warn-only-v4' of https://github.com/espindola/scylla:
Rename large_partition_handler
Print a warning if a row is too large
Remove defaut parameter value
Rename _threshold_bytes to _partition_threshold_bytes
keys: add schema-aware printing for clustering_key_prefix
When a repair failed, we saw logs like:
repair - Checksum of range (8235770168569320790, 8235957818553794560] on
127.0.0.1 failed: std::bad_alloc (std::bad_alloc)
It is hard to tell which keyspace and table has failed.
To fix, log the keyspace and table name. It is useful to know when debugging.
Fixes#4166
Message-Id: <8424d314125b88bf5378ea02a703b0f82c2daeda.1548818669.git.asias@scylladb.com>
Number of rows sent and received
- tx_row_nr
- rx_row_nr
Bytes of rows sent and received
- tx_row_bytes
- rx_row_bytes
Number of row hashes sent and received
- tx_hashes_nr
- rx_hashes_nr
Number of rows read from disk
- row_from_disk_nr
Bytes of rows read from disk
- row_from_disk_bytes
Message-Id: <d1ee6b8ae8370857fe45f88b6c13087ea217d381.1547603905.git.asias@scylladb.com>
"
This series adds generating view updates from sstables added through
/upload directory if their tables have accompanying materialized views.
Said sstables are left in /upload directory until updates are generated
from them and are treated just like staging sstables from /staging dir.
If there are no views for a given tables, sstables are simply moved
from /upload dir to datadir without any changes.
Tests: unit (release)
"
* 'add_handling_staging_sstables_to_upload_dir_5' of https://github.com/psarna/scylla:
all: rename view_update_from_staging_generator
distributed_loader: fix indentation
service: add generating view updates from uploaded sstables
init: pass view update generator to storage service
sstables: treat sstables in upload dir as needing view build
sstables,table: rename is_staging to requires_view_building
distributed_loader: use proper directory for opening SSTable
db,view: make throttling optional for view_update_generator
In order to allow other services to wait for incoming streams
to finish, row level repair uses stream phasing when creating
new sstables from incoming data.
Fixes scylladb#4032
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
This header, which is easily replaced with a forward declaration,
introduces a dependency on database.hh everywhere. Remove it and scatter
includes of database.hh in source files that really need it.
"
=== How the the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few mega bytes of rows,
We read all the rows within the small range into memory. Find the
mismatch and send the mismatch rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data.
We need to find the mismatched rows between nodes and only send the delta.
The problem is called set reconciliation problem which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
dht::decorated_key pk;
position_in_partition position
}
Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.
At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.
Now, local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.
=== How the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- reqeust_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx= 162 GiB, rx = 90 GiB
new: tx= 1.15 GiB, tx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, repair master needs to receives 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows from repair master and 1 million rows
received from repair slave 1 to repair slave 2. Repair master sends
sends 1 million rows from repair master and 1 million rows received from
repair slave 2 to repair slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes: #3033
Tests: dtests/repair_additional_test.py
"
* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
repair: Enable row level repair
repair: Add row_level_repair
repair: Add docs for row level repair
repair: Add repair_init_messaging_service_handler
repair: Add repair_meta
repair: Add repair_writer
repair: Add repair_reader
repair: Add repair_row
repair: Add fragment_hasher
repair: Add decorated_key_with_hash
repair: Add get_random_seed
repair: Add get_common_diff_detect_algorithm
repair: Add shard_config
repair: Add suportted_diff_detect_algorithms
repair: Add repair_stats to repair_info
repair: Introduce repair_stats
flat_mutation_reader: Add make_generating_reader
storage_service: Introduce ROW_LEVEL_REPAIR feature
messaging_service: Add RPC verbs for row level repair
repair: Export the repair logger
...
Many headers don't really need to include database.hh, the include can
be replaced by forward declarations and/or including the actually needed
headers directly. Some headers don't need this include at all.
Each header was verified to be compilable on its own after the change,
by including it into an empty `.cc` file and compiling it. `.cc` files
that used to get `database.hh` through headers that no longer include it
were changed to include it themselves.
=== How the the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few mega bytes of rows,
We read all the rows within the small range into memory. Find the
mismatch and send the mismatch rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data.
We need to find the mismatched rows between nodes and only send the delta.
The problem is called set reconciliation problem which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
dht::decorated_key pk;
position_in_partition position
}
Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.
At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.
Now, local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.
=== How the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- reqeust_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx= 162 GiB, rx = 90 GiB
new: tx= 1.15 GiB, tx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, repair master needs to receives 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows from repair master and 1 million rows
received from repair slave 1 to repair slave 2. Repair master sends
sends 1 million rows from repair master and 1 million rows received from
repair slave 2 to repair slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes#3033
This patch introduces repair_meta class that is the core class for the
row level repair.
For each range to repair, repair_meta objects are created on both repair
master and repair slaves. It stores the meta data for the row level
repair algorithms, e.g, the current sync boundary, the buffer used to
hold the rows the peers are working on, the reader to read data from
sstable and the writer to write data to sstable.
This patch also implements the RPC verbs for row level repair, for
example, REPAIR_ROW_LEVEL_START/REPAIR_ROW_LEVEL_STOP to starts/stops
row level repair for a range, REPAIR_GET_SYNC_BOUNDARY to get sync
boundary peers want to work on, REPAIR_GET_ROW_DIFF to get missing rows
from repair slaves and REPAIR_PUT_ROW_DIFF to pus missing rows to repair
slaves.
repair_writer uses multishard_writer to apply the mutation_fragments to
sstable. The repair master needs one such writer for each of the repair
slave. The repair slave needs one writer for the repair master.
repair_reader is used to read data from disk. It is simply a local
flat_mutation_reader reader for the repair master. It is more
complicated for the repair slave.
The repair slaves have to follow what repair master read from disk.
For example,
Assume repair master has 2 shards and repair slave has 3 shards
Repair master on shard 0 asks repair slave on shard 0 to read range [0,100).
Repair master on shard 1 asks repair slave on shard 1 to read range [0,100).
Repair master on shard 0 will only read the data that belongs to shard 0
within range [0,100). Since master and slave have different shard count,
repair slave on shard 0 has to use the multi shard reader to collect
data on all the shards. It can not pass range [0, 100) to the multi
shard reader, otherwise it will read more data than the repair master.
Instead, repair slave uses a sharder using sharding configuration of the
repair master, to generate the sub ranges belong to shard 0 of repair
master.
If repair master and slave has the same sharding configuration, a simple
local reader is enough for repair slave.
repair_row is the in-memory representation of "row" that the row level
repair works on. It represents a mutation_fragment that is read from the
flat_mutation reader. The hash of a repair_row is the combination of the
mutation_fragment hash and partition_key hash.
Get a random uint64_t number as the seed for the repair row hashing.
The seed is passed to xx_hasher.
We add the randomization when hashing rows so that when we run repair
for the next time the same row produces different hashing number.
It is used to find the common difference detection algorithms supported
by repair master and repair slaves.
It is up to repair master to choose what algorithm to use.
It returns a vector of row level repair difference detection algorithms
supported by this node.
We are going to implement the "send_full_set" in the following patches.
Represent a position of a mutation_fragment read from a flat mutation
reader. Repair nodes negotiate a small sub range identified by two
repair_sync_boundary to work on in each round.