284 lines
8.4 KiB
Markdown
284 lines
8.4 KiB
Markdown
# Row level repair
|
|
|
|
## How the the partition level repair works
|
|
|
|
- The repair master decides which ranges to work on.
|
|
- The repair master splits the ranges to sub ranges which contains around 100
|
|
partitions.
|
|
- The repair master computes the checksum of the 100 partitions and asks the
|
|
related peers to compute the checksum of the 100 partitions.
|
|
- If the checksum matches, the data in this sub range is synced.
|
|
- If the checksum mismatches, repair master fetches the data from all the peers
|
|
and sends back the merged data to peers.
|
|
|
|
## Major problems with partition level repair
|
|
|
|
- A mismatch of a single row in any of the 100 partitions causes 100
|
|
partitions to be transferred. A single partition can be very large. Not to
|
|
mention the size of 100 partitions.
|
|
|
|
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
|
|
same data twice
|
|
|
|
## Row level repair
|
|
|
|
Row level checksum and synchronization: detect row level mismatch and transfer
|
|
only the mismatch
|
|
|
|
## How the row level repair works
|
|
|
|
- To solve the problem of reading data twice
|
|
|
|
Read the data only once for both checksum and synchronization between nodes.
|
|
|
|
We work on a small range which contains only a few mega bytes of rows,
|
|
We read all the rows within the small range into memory. Find the
|
|
mismatch and send the mismatch rows between peers.
|
|
|
|
We need to find a sync boundary among the nodes which contains only N bytes of
|
|
rows.
|
|
|
|
- To solve the problem of sending unnecessary data.
|
|
|
|
We need to find the mismatched rows between nodes and only send the delta.
|
|
The problem is called set reconciliation problem which is a common problem in
|
|
distributed systems.
|
|
|
|
For example:
|
|
|
|
- Node1 has set1 = {row1, row2, row3}
|
|
|
|
- Node2 has set2 = { row2, row3}
|
|
|
|
- Node3 has set3 = {row1, row2, row4}
|
|
|
|
To repair:
|
|
|
|
- Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
|
|
|
|
- Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2.
|
|
|
|
- Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
|
|
|
|
## How to implement repair with set reconciliation
|
|
|
|
- Step A: Negotiate sync boundary
|
|
|
|
class repair_sync_boundary {
|
|
dht::decorated_key pk;
|
|
position_in_partition position
|
|
}
|
|
|
|
Reads rows from disk into row buffers until the size is larger than N
|
|
bytes. Return the repair_sync_boundary of the last mutation_fragment we
|
|
read from disk. The smallest repair_sync_boundary of all nodes is
|
|
set as the current_sync_boundary.
|
|
|
|
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
|
|
|
|
Request combined hashes from all nodes between last_sync_boundary and
|
|
current_sync_boundary. If the combined hashes from all nodes are identical,
|
|
data is synced, goto Step A. If not, request the full hashes from peers.
|
|
|
|
At this point, the repair master knows exactly what rows are missing. Request the
|
|
missing rows from peer nodes.
|
|
|
|
Now, local node contains all the rows.
|
|
|
|
- Step C: Send missing rows to the peer nodes
|
|
|
|
Since local node also knows what peer nodes own, it sends the missing rows to
|
|
the peer nodes.
|
|
|
|
## How the RPC API looks like
|
|
|
|
Start:
|
|
- repair_range_start()
|
|
|
|
Step A:
|
|
- get_sync_boundary()
|
|
|
|
Step B:
|
|
- get_combined_row_hashes()
|
|
- get_full_row_hashes()
|
|
- get_row_diff()
|
|
|
|
Step C:
|
|
- put_row_diff()
|
|
|
|
Finish:
|
|
- repair_range_stop()
|
|
|
|
## Performance evaluation
|
|
|
|
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
|
|
created a keyspace with a replication factor of 3 and inserted 1 billion
|
|
rows to each of the 3 nodes. Each node has 241 GiB of data.
|
|
We tested 3 cases below.
|
|
|
|
### 1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.
|
|
|
|
Time to repair:
|
|
|
|
* old = 87 min
|
|
* new = 70 min (rebuild took 50 minutes)
|
|
* improvement = 19.54%
|
|
|
|
### 2) 100% synced: all of the 3 nodes have 1 billion identical rows.
|
|
|
|
Time to repair:
|
|
|
|
* old = 43 min
|
|
* new = 24 min
|
|
* improvement = 44.18%
|
|
|
|
### 3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
|
|
|
|
Time to repair:
|
|
|
|
* old: 211 min
|
|
* new: 44 min
|
|
* improvement: 79.15%
|
|
|
|
Bytes sent on wire for repair:
|
|
|
|
* old: tx= 162 GiB, rx = 90 GiB
|
|
* new: tx= 1.15 GiB, tx = 0.57 GiB
|
|
* improvement: tx = 99.29%, rx = 99.36%
|
|
|
|
It is worth noting that row level repair sends and receives exactly the
|
|
number of rows needed in theory.
|
|
|
|
In this test case, repair master needs to receives 2 million rows and
|
|
sends 4 million rows. Here are the details: Each node has 1 billion *
|
|
0.1% distinct rows, that is 1 million rows. So repair master receives 1
|
|
million rows from repair follower 1 and 1 million rows from repair follower 2.
|
|
Repair master sends 1 million rows from repair master and 1 million rows
|
|
received from repair follower 1 to repair follower 2. Repair master sends
|
|
sends 1 million rows from repair master and 1 million rows received from
|
|
repair follower 2 to repair follower 1.
|
|
|
|
In the result, we saw the rows on wire were as expected.
|
|
|
|
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
|
|
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
|
|
|
|
## RPC stream usage in repair
|
|
|
|
Some of the RPC verbs used by row level repair can transmit bulk of data,
|
|
namely REPAIR_GET_FULL_ROW_HASHES, REPAIR_GET_ROW_DIFF and REPAIR_PUT_ROW_DIFF.
|
|
To have better repair bandwidth with high latency link, we want to increase the
|
|
row buff size. It is more efficient to send large amount of data with RPC
|
|
stream interface instead of the RPC verb interface. We want to switch to RPC
|
|
stream for such RPC verbs. Three functions, get_full_row_hashes(),
|
|
get_row_diff() and put_row_diff(), are converted to use the new RPC stream
|
|
interface.
|
|
|
|
|
|
### 1) get_full_row_hashes()
|
|
A new REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM rpc veb is introduced.
|
|
|
|
#### The repair master sends: repair_stream_cmd.
|
|
|
|
The repair_stream_cmd can be:
|
|
- repair_stream_cmd::get_full_row_hashes
|
|
Asks the repair follower to send all the hashes in working row buffer.
|
|
|
|
#### The repair follower replies: repair_hash_with_cmd.
|
|
|
|
```
|
|
struct repair_hash_with_cmd {
|
|
repair_stream_cmd cmd;
|
|
repair_hash hash;
|
|
};
|
|
```
|
|
|
|
The repair_stream_cmd in repair_hash_with_cmd can be:
|
|
|
|
- repair_stream_cmd::hash_data
|
|
One of the hashes in the working row buffer. The hash is stored in repair_hash_with_cmd.hash.
|
|
|
|
- repair_stream_cmd::end_of_current_hash_set
|
|
Notifies repair master that repair follower has send all the hashes in the working row buffer
|
|
|
|
- repair_stream_cmd::error
|
|
Notifies an error has happened on the follower
|
|
|
|
|
|
### 2) get_row_diff()
|
|
|
|
A new REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM rpc veb is introduced.
|
|
|
|
```
|
|
struct repair_hash_with_cmd {
|
|
repair_stream_cmd cmd;
|
|
repair_hash hash;
|
|
};
|
|
|
|
struct repair_row_on_wire_with_cmd {
|
|
repair_stream_cmd cmd;
|
|
repair_row_on_wire row;
|
|
};
|
|
```
|
|
|
|
#### The repair master sends: repair_hash_with_cmd
|
|
|
|
The repair_stream_cmd in repair_hash_with_cmd can be:
|
|
|
|
- repair_stream_cmd::needs_all_rows
|
|
Asks the repair follower to send all the rows in the working row buffer.
|
|
|
|
- repair_stream_cmd::hash_data
|
|
Contains the hash for the row in the working row buffer that repair master
|
|
needs. The hash is stored in repair_hash_with_cmd.hash.
|
|
|
|
- repair_stream_cmd::end_of_current_hash_set
|
|
|
|
Notifies repair follower that repair master has sent all the rows it needs.
|
|
|
|
#### The repair follower replies: repair_row_on_wire_with_cmd
|
|
|
|
The repair_stream_cmd in repair_row_on_wire_with_cmd can be:
|
|
|
|
- repair_stream_cmd::row_data
|
|
This is one of the row repair master requested. The row data is stored in
|
|
repair_row_on_wire_with_cmd.row.
|
|
|
|
- repair_stream_cmd::end_of_current_rows
|
|
Notifies repair follower has sent all the rows repair master requested.
|
|
|
|
- repair_stream_cmd::error
|
|
Notifies an error has happened on the follower.
|
|
|
|
|
|
### 3) put_row_diff()
|
|
|
|
A new REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM rpc veb is introduced.
|
|
|
|
```
|
|
struct repair_row_on_wire_with_cmd {
|
|
repair_stream_cmd cmd;
|
|
repair_row_on_wire row;
|
|
};
|
|
```
|
|
|
|
#### The repair master sends: repair_row_on_wire_with_cmd
|
|
|
|
The repair_stream_cmd in repair_row_on_wire_with_cmd can be:
|
|
|
|
- repair_stream_cmd::row_data
|
|
Contains one of the rows that repair master wants to send to repair follower.
|
|
The row data is stored in repair_row_on_wire_with_cmd.row.
|
|
|
|
- repair_stream_cmd::end_of_current_rows
|
|
Notifies repair follower that repair master has sent all the rows it wants to send.
|
|
|
|
#### The repair follower replies: repair_stream_cmd
|
|
|
|
- repair_stream_cmd::put_rows_done
|
|
Notifies repair master that repair follower has received and applied all the
|
|
rows repair master sent.
|
|
|
|
- repair_stream_cmd::error
|
|
Notifies an error has happened on the follower.
|