Row level repair
How the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges into sub ranges, each containing around 100 partitions.
- The repair master computes the checksum of the 100 partitions and asks the related peers to compute the checksum of the same 100 partitions.
- If the checksums match, the data in this sub range is synced.
- If the checksums do not match, the repair master fetches the data from all the peers and sends the merged data back to the peers.
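The steps above can be sketched in Python as a toy model. This is not Scylla's implementation: nodes are plain dicts mapping a partition key to a dict of rows, SHA-256 stands in for the real checksum, merging is a naive dict union, and every name (`checksum`, `repair_sub_range`) is hypothetical.

```python
import hashlib

def checksum(node, keys):
    """Hash every row of the given partitions into a single digest."""
    h = hashlib.sha256()
    for pk in keys:
        for row_key, value in sorted(node.get(pk, {}).items()):
            h.update(repr((pk, row_key, value)).encode())
    return h.hexdigest()

def repair_sub_range(master, peers, keys):
    """Repair one sub range; return how many partitions were fetched from peers."""
    nodes = [master] + peers
    if len({checksum(n, keys) for n in nodes}) == 1:
        return 0  # checksums match: the sub range is already in sync
    # Checksums differ: merge the whole sub range from all nodes ...
    merged = {}
    for pk in keys:
        rows = {}
        for n in nodes:
            rows.update(n.get(pk, {}))
        merged[pk] = rows
    # ... and write the merged data back to every node.
    for n in nodes:
        for pk in keys:
            n[pk] = dict(merged[pk])
    return len(peers) * len(keys)

# Usage: 100 partitions, one extra row on the peer.
keys = [f"pk{i}" for i in range(100)]
master = {pk: {"r0": "v"} for pk in keys}
peer = {pk: {"r0": "v"} for pk in keys}
peer["pk7"]["r1"] = "extra"  # a single mismatched row
fetched = repair_sub_range(master, [peer], keys)
print(fetched)  # all 100 partitions are fetched for one mismatched row
```

Note that the data is walked once to compute the checksum and again to merge, which mirrors the double read described in the next section.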
Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes all 100 partitions to be transferred. A single partition can be very large, and 100 partitions even more so.
- Checksum (find the mismatch) and streaming (fix the mismatch) read the same data twice.
Row level repair
Row level checksum and synchronization: detect mismatches at the row level and transfer only the mismatched rows.
How the row level repair works
- To solve the problem of reading data twice
Read the data only once for both checksum and synchronization between nodes.
We work on a small range that contains only a few megabytes of rows. We read all the rows within the small range into memory, find the mismatches, and send the mismatched rows between peers.
To do this, the nodes need to agree on a sync boundary such that the range up to it contains only about N bytes of rows.
- To solve the problem of sending unnecessary data.
We need to find the mismatched rows between nodes and send only the delta. This is known as the set reconciliation problem, a common problem in distributed systems.
For example:
- Node1 has set1 = {row1, row2, row3}
- Node2 has set2 = {row2, row3}
- Node3 has set3 = {row1, row2, row4}
To repair:
- Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
- Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2.
- Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
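The example maps directly onto set operations. A minimal Python sketch, where rows are represented by plain strings and the variable names are illustrative only:

```python
set1 = {"row1", "row2", "row3"}  # Node1, the repair master
set2 = {"row2", "row3"}          # Node2
set3 = {"row1", "row2", "row4"}  # Node3

# Rows Node1 is missing and must fetch from each peer.
fetch_from_node2 = set2 - set1
fetch_from_node3 = set3 - set1

# After fetching, Node1 holds the union of all three sets.
union = set1 | set2 | set3

# Rows each peer is missing and Node1 must send.
send_to_node2 = union - set2
send_to_node3 = union - set3

print(fetch_from_node2, fetch_from_node3)
print(sorted(send_to_node2), sorted(send_to_node3))
```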
How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary { dht::decorated_key pk; position_in_partition position; };
Each node reads rows from disk into its row buffer until the buffer is larger than N bytes, then returns the repair_sync_boundary of the last mutation_fragment it read from disk. The smallest repair_sync_boundary of all nodes is set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request the combined hashes from all nodes for the rows between last_sync_boundary and current_sync_boundary. If the combined hashes from all nodes are identical, the data is synced; go to Step A. If not, request the full hashes from the peers.
At this point, the repair master knows exactly which rows are missing. Request the missing rows from the peer nodes.
Now the local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since the local node also knows which rows each peer node has, it sends the missing rows to the peer nodes.
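Steps A and B can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the text above: a row position is modeled as a sortable `(partition_key, clustering_key)` tuple standing in for repair_sync_boundary, row size is the length of the value string, SHA-256 is the per-row hash, and the combined hash is an XOR of the row hashes. The helper names `read_until`, `full_row_hashes` and `combined_row_hash` are hypothetical.

```python
import hashlib

def read_until(rows, last_boundary, max_bytes):
    """Buffer rows after last_boundary, in sorted order, until the buffer
    exceeds max_bytes. Return the position of the last row read, or None."""
    size = 0
    boundary = None
    for pos in sorted(rows):
        if last_boundary is not None and pos <= last_boundary:
            continue
        size += len(rows[pos])
        boundary = pos
        if size > max_bytes:
            break
    return boundary

def full_row_hashes(rows, last, cur):
    """One hash per row in the range (last, cur]."""
    return {hashlib.sha256(repr((p, v)).encode()).hexdigest()
            for p, v in rows.items()
            if (last is None or p > last) and p <= cur}

def combined_row_hash(rows, last, cur):
    """XOR of the row hashes, so the result is independent of row order."""
    combined = 0
    for h in full_row_hashes(rows, last, cur):
        combined ^= int(h, 16)
    return combined

node1 = {("a", 1): "xx", ("a", 2): "yy", ("b", 1): "zz"}
node2 = {("a", 1): "xx", ("b", 1): "zz", ("c", 1): "ww"}

# Step A: every node proposes a boundary; the smallest one wins.
boundaries = [read_until(n, None, 3) for n in (node1, node2)]
current_sync_boundary = min(boundaries)

# Step B: compare the cheap combined hashes first, full hashes only on mismatch.
in_sync = (combined_row_hash(node1, None, current_sync_boundary)
           == combined_row_hash(node2, None, current_sync_boundary))
hashes1 = full_row_hashes(node1, None, current_sync_boundary)
hashes2 = full_row_hashes(node2, None, current_sync_boundary)
print(current_sync_boundary, in_sync, len(hashes1 - hashes2))
```

Taking the smallest boundary means no node has to compare rows it has not read yet; rows beyond it are handled in the next round.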
What the RPC API looks like
Start:
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- request_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
Finish:
- repair_range_stop()
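To show the order in which these calls are made, here is a toy in-memory model in Python. Only the method names come from the list above; the signatures, the `Node` class, the `max_rows` limit (used instead of a byte limit) and the use of Python's built-in `hash` as a row hash are all assumptions for illustration. For brevity the master calls the same methods on itself, and rows are merged by a plain dict update rather than by real mutation merging.

```python
class Node:
    """In-memory stand-in for a replica; rows map a sortable position to a value."""
    def __init__(self, rows):
        self.rows = dict(rows)
        self.calls = []  # records the order of the calls received

    def _in_range(self, last, cur):
        return {p: v for p, v in self.rows.items()
                if (last is None or p > last) and p <= cur}

    def repair_range_start(self):
        self.calls.append("repair_range_start")

    def request_sync_boundary(self, last, max_rows):
        self.calls.append("request_sync_boundary")
        pending = sorted(p for p in self.rows if last is None or p > last)
        return pending[:max_rows][-1] if pending else None

    def request_combined_row_hashes(self, last, cur):
        self.calls.append("request_combined_row_hashes")
        return hash(frozenset(self._in_range(last, cur).items()))

    def request_full_row_hashes(self, last, cur):
        self.calls.append("request_full_row_hashes")
        return {hash(item) for item in self._in_range(last, cur).items()}

    def request_row_diff(self, last, cur, wanted):
        self.calls.append("request_row_diff")
        return {p: v for p, v in self._in_range(last, cur).items()
                if hash((p, v)) in wanted}

    def send_row_diff(self, rows):
        self.calls.append("send_row_diff")
        self.rows.update(rows)

    def repair_range_stop(self):
        self.calls.append("repair_range_stop")

def repair_range(master, followers, max_rows=2):
    nodes = [master] + followers
    for n in nodes:
        n.repair_range_start()
    last = None
    while True:
        # Step A: negotiate the sync boundary (smallest proposal wins).
        proposals = [n.request_sync_boundary(last, max_rows) for n in nodes]
        proposals = [b for b in proposals if b is not None]
        if not proposals:
            break
        cur = min(proposals)
        # Step B: pull the rows the master is missing.
        if len({n.request_combined_row_hashes(last, cur) for n in nodes}) > 1:
            peer_hashes = []
            for f in followers:
                mine = master.request_full_row_hashes(last, cur)
                theirs = f.request_full_row_hashes(last, cur)
                peer_hashes.append(theirs)
                master.rows.update(f.request_row_diff(last, cur, theirs - mine))
            # Step C: push the rows each follower is missing.
            mine = master.request_full_row_hashes(last, cur)
            for f, theirs in zip(followers, peer_hashes):
                f.send_row_diff(master.request_row_diff(last, cur, mine - theirs))
        last = cur
    for n in nodes:
        n.repair_range_stop()

node1 = Node({"row1": "a", "row2": "b", "row3": "c"})
node2 = Node({"row2": "b", "row3": "c"})
node3 = Node({"row1": "a", "row2": "b", "row4": "d"})
repair_range(node1, [node2, node3])
print(sorted(node2.rows), node2.calls[0], node2.calls[-1])
```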
Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instances. We created a keyspace with a replication factor of 3 and inserted 1 billion rows into each of the 3 nodes. Each node has 241 GiB of data. We tested the 3 cases below.
1) 0% synced: one of the nodes has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
- old = 87 min
- new = 70 min (rebuild took 50 minutes)
- improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
- old = 43 min
- new = 24 min
- improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
- old: 211 min
- new: 44 min
- improvement: 79.15%
Bytes sent on wire for repair:
- old: tx= 162 GiB, rx = 90 GiB
- new: tx = 1.15 GiB, rx = 0.57 GiB
- improvement: tx = 99.29%, rx = 99.36%
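The improvement figures follow from (old - new) / old. A small Python check, using only the rounded numbers reported above; because those inputs are rounded, the recomputed percentages may differ from the reported ones in the last digit. The function name `improvement` is illustrative.

```python
def improvement(old, new):
    """Relative reduction from old to new, in percent."""
    return (old - new) / old * 100

# Time to repair, in minutes (old, new).
time_0_synced = improvement(87, 70)
time_100_synced = improvement(43, 24)
time_999_synced = improvement(211, 44)

# Bytes on the wire for the 99.9% synced case, in GiB (old, new).
tx_bytes = improvement(162, 1.15)
rx_bytes = improvement(90, 0.57)

for name, value in [("0% synced", time_0_synced),
                    ("100% synced", time_100_synced),
                    ("99.9% synced", time_999_synced),
                    ("tx", tx_bytes), ("rx", rx_bytes)]:
    print(f"{name}: {value:.2f}%")
```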
It is worth noting that row level repair sends and receives exactly the number of rows needed in theory.
In this test case, the repair master needs to receive 2 million rows and send 4 million rows. Here are the details: each node has 1 billion * 0.1% distinct rows, that is, 1 million rows. So the repair master receives 1 million rows from repair follower 1 and 1 million rows from repair follower 2. The repair master sends its own 1 million rows and the 1 million rows received from repair follower 1 to repair follower 2. The repair master sends its own 1 million rows and the 1 million rows received from repair follower 2 to repair follower 1.
In the results, the number of rows on the wire was as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
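The expected row counts and the per-shard sums can be checked with a few lines of Python. The per-shard numbers are the ones reported above; the variable names are illustrative.

```python
rows_per_node = 1_000_000_000
distinct_per_node = rows_per_node // 1000  # 0.1% distinct rows = 1 million
followers = 2

# The master receives each follower's distinct rows.
expected_rx = followers * distinct_per_node
# The master sends each follower its own distinct rows plus the other follower's.
expected_tx = followers * (2 * distinct_per_node)

tx_per_shard = [1000505, 999619, 1001257, 998619]
rx_per_shard = [500233, 500235, 499559, 499973]

print(sum(tx_per_shard), expected_tx)
print(sum(rx_per_shard), expected_rx)
```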