mirrors/scylladb


mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 08:23:29 +00:00


Asias He b2b20cd5c0 repair: Add docs for row level repair

2018-12-12 16:49:01 +08:00


Row level repair

How the partition level repair works

The repair master decides which ranges to work on.
The repair master splits the ranges into sub ranges, each containing around 100 partitions.
The repair master computes the checksum of the 100 partitions and asks the related peers to compute the checksum of the same 100 partitions.
If the checksums match, the data in this sub range is synced.
If the checksums do not match, the repair master fetches the data from all the peers and sends the merged data back to the peers.
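
The sub range check above can be sketched as follows. This is a toy model rather than Scylla's code: the types `partition` and `sub_range`, the functions `checksum` and `partitions_to_transfer`, and the choice of FNV-1a as the checksum are all assumptions made for illustration. It shows that a single differing row marks the whole sub range for transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy model (an assumption for illustration): a partition is a list of rows
// and a sub range is a list of partitions.
using partition = std::vector<std::string>;
using sub_range = std::vector<partition>;

// FNV-1a over every byte of every row in the sub range. Any checksum would
// do; FNV-1a is used here only because it is short.
inline std::uint64_t checksum(const sub_range& range) {
    std::uint64_t h = 14695981039346656037ull;
    for (const auto& p : range) {
        for (const auto& row : p) {
            for (unsigned char c : row) {
                h ^= c;
                h *= 1099511628211ull;
            }
            h ^= 0xff;  // mix in a marker after each row
            h *= 1099511628211ull;
        }
    }
    return h;
}

// Number of partitions that must be transferred for this sub range:
// zero if the checksums match, otherwise every partition in the sub range.
inline std::size_t partitions_to_transfer(const sub_range& master,
                                          const sub_range& peer) {
    return checksum(master) == checksum(peer) ? 0 : master.size();
}

// Usage example: one differing row forces all 100 partitions to be sent.
inline void partition_level_example() {
    sub_range master(100, partition{"row-a", "row-b"});
    sub_range peer = master;
    assert(partitions_to_transfer(master, peer) == 0);
    peer[7][1] = "row-c";  // a single row differs in a single partition
    assert(partitions_to_transfer(master, peer) == 100);
}
```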

Major problems with partition level repair

A mismatch of a single row in any of the 100 partitions causes all 100 partitions to be transferred. A single partition can be very large, let alone 100 partitions.
Checksum (find the mismatch) and streaming (fix the mismatch) each read the data, so the same data is read twice.

Row level repair

Row level checksum and synchronization: detect row level mismatch and transfer only the mismatch

How the row level repair works

To solve the problem of reading data twice

Read the data only once for both checksum and synchronization between nodes.

We work on a small range which contains only a few megabytes of rows. We read all the rows within the small range into memory, find the mismatch, and send the mismatched rows between peers.

To do that, the nodes need to agree on a sync boundary such that the range up to it contains only about N bytes of rows.

To solve the problem of sending unnecessary data

We need to find the mismatched rows between nodes and send only the delta. This is known as the set reconciliation problem, a common problem in distributed systems.

For example:

Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}

To repair:

Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2.
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
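
The example above can be reproduced with plain set operations. The following is a minimal sketch, assuming rows are represented as strings; `row_set`, `repair_plan`, `difference` and `reconcile` are hypothetical names, not part of Scylla.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

using row_set = std::set<std::string>;

// Rows present in `a` but not in `b` (set difference a - b).
inline row_set difference(const row_set& a, const row_set& b) {
    row_set out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.end()));
    return out;
}

struct repair_plan {
    std::vector<row_set> fetch_from_peer;  // rows the master pulls from each peer
    std::vector<row_set> send_to_peer;     // rows the master pushes to each peer
};

// The master first pulls what it lacks from every peer (peer - master), then
// pushes to each peer whatever that peer lacks from the union of all sets.
inline repair_plan reconcile(const row_set& master,
                             const std::vector<row_set>& peers) {
    repair_plan plan;
    row_set all = master;
    for (const auto& peer : peers) {
        plan.fetch_from_peer.push_back(difference(peer, master));
        all.insert(peer.begin(), peer.end());
    }
    for (const auto& peer : peers) {
        plan.send_to_peer.push_back(difference(all, peer));
    }
    return plan;
}

// Usage example with the three sets from the text (Node1 is the master).
inline void reconcile_example() {
    const row_set set1{"row1", "row2", "row3"};
    const row_set set2{"row2", "row3"};
    const row_set set3{"row1", "row2", "row4"};
    const repair_plan plan = reconcile(set1, {set2, set3});
    assert(plan.fetch_from_peer[0].empty());                    // nothing from Node2
    assert(plan.fetch_from_peer[1] == row_set{"row4"});         // row4 from Node3
    assert((plan.send_to_peer[0] == row_set{"row1", "row4"}));  // to Node2
    assert(plan.send_to_peer[1] == row_set{"row3"});            // to Node3
}
```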

How to implement repair with set reconciliation

Step A: Negotiate sync boundary

class repair_sync_boundary { dht::decorated_key pk; position_in_partition position; };

Each node reads rows from disk into its row buffer until the buffer size is larger than N bytes, then returns the repair_sync_boundary of the last mutation_fragment it read from disk. The smallest repair_sync_boundary among all nodes is set as the current_sync_boundary.
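
A minimal sketch of Step A, under simplifying assumptions: the boundary is reduced to two integers instead of `dht::decorated_key` and `position_in_partition`, rows are held in a sorted in-memory vector instead of being read from disk, and a node with nothing left to read is simply ignored during negotiation. The names `sync_boundary`, `row`, `read_until_full` and `negotiate` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <tuple>
#include <vector>

// Simplified stand-in for repair_sync_boundary (both fields are assumptions).
struct sync_boundary {
    long pk;        // stands in for dht::decorated_key
    long position;  // stands in for position_in_partition
    bool operator<(const sync_boundary& o) const {
        return std::tie(pk, position) < std::tie(o.pk, o.position);
    }
    bool operator==(const sync_boundary& o) const {
        return pk == o.pk && position == o.position;
    }
};

struct row {
    sync_boundary pos;
    std::size_t size_bytes;
};

// Reads rows in order, starting at index `from`, until the buffered size is
// larger than `max_bytes`. Returns the boundary of the last row read, or
// nullopt if there was nothing left to read.
inline std::optional<sync_boundary> read_until_full(
        const std::vector<row>& sorted_rows, std::size_t from,
        std::size_t max_bytes) {
    std::optional<sync_boundary> last;
    std::size_t buffered = 0;
    for (std::size_t i = from;
         i < sorted_rows.size() && buffered <= max_bytes; ++i) {
        buffered += sorted_rows[i].size_bytes;
        last = sorted_rows[i].pos;
    }
    return last;
}

// The smallest boundary proposed by any node becomes current_sync_boundary.
inline std::optional<sync_boundary> negotiate(
        const std::vector<std::optional<sync_boundary>>& proposals) {
    std::optional<sync_boundary> current;
    for (const auto& p : proposals) {
        if (p && (!current || *p < *current)) {
            current = p;
        }
    }
    return current;
}
```

Taking the smallest proposal means every node has already buffered all of its rows up to the agreed boundary, so the rows can then be compared without going back to disk.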

Step B: Get missing rows from peer nodes so that repair master contains all the rows

Request combined hashes from all nodes between last_sync_boundary and current_sync_boundary. If the combined hashes from all nodes are identical, the data is synced; go to Step A. If not, request the full hashes from the peers.

At this point, the repair master knows exactly which rows are missing. Request the missing rows from the peer nodes.

Now the local node (the repair master) contains all the rows.
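
Step B can be sketched like this. The document does not say how the combined hash is built, so the sketch assumes it is the XOR of the per-row hashes (which makes it independent of row order) and uses FNV-1a as a placeholder row hash. `hash_row`, `full_hashes`, `combined_hash` and `missing_on_master` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <set>
#include <string>
#include <vector>

using row_hash = std::uint64_t;

// Placeholder per-row hash (FNV-1a); the real row hash is not specified here.
inline row_hash hash_row(const std::string& row) {
    row_hash h = 14695981039346656037ull;
    for (unsigned char c : row) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// "Full hashes": one hash per row in the current range.
inline std::set<row_hash> full_hashes(const std::vector<std::string>& rows) {
    std::set<row_hash> out;
    for (const auto& r : rows) {
        out.insert(hash_row(r));
    }
    return out;
}

// "Combined hash": assumed here to be the XOR of all row hashes.
inline row_hash combined_hash(const std::set<row_hash>& hashes) {
    row_hash h = 0;
    for (row_hash x : hashes) {
        h ^= x;
    }
    return h;
}

// Hashes of rows the peer has but the master lacks; these are the rows the
// master would ask the peer for.
inline std::set<row_hash> missing_on_master(const std::set<row_hash>& master,
                                            const std::set<row_hash>& peer) {
    std::set<row_hash> out;
    std::set_difference(peer.begin(), peer.end(), master.begin(), master.end(),
                        std::inserter(out, out.end()));
    return out;
}

// Usage example: the cheap combined hash detects the mismatch, the full
// hashes identify which row is missing.
inline void step_b_example() {
    const auto master = full_hashes({"row1", "row2", "row3"});
    const auto peer = full_hashes({"row4", "row2", "row1"});
    assert(combined_hash(master) != combined_hash(peer));
    const auto missing = missing_on_master(master, peer);
    assert(missing.size() == 1 && missing.count(hash_row("row4")) == 1);
}
```

Only one combined hash per node crosses the wire when the range is already synced; the full hash sets are exchanged only after a mismatch.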

Step C: Send missing rows to the peer nodes

Since the local node also knows which rows each peer node has, it sends each peer the rows that peer is missing.

What the RPC API looks like

Start:

repair_range_start()

Step A:

request_sync_boundary()

Step B:

request_combined_row_hashes()
request_full_row_hashes()
request_row_diff()

Step C:

send_row_diff()

Finish:

repair_range_stop()
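
The call order implied by Steps A to C can be sketched as below. The verb names come from the list above, but their arguments and return values are not given in this document, so the sketch only records the order of calls on a hypothetical `recording_peer`. It also assumes that every combined hash mismatch leads to rows being both requested and sent. `repair_one_range` is a hypothetical driver, not Scylla's.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a peer connection: it only records which verbs
// were called, in order.
struct recording_peer {
    std::vector<std::string> calls;
    void call(const std::string& verb) { calls.push_back(verb); }
};

// Drives one range. hashes_match[i] says whether the combined hashes matched
// in iteration i (one iteration per negotiated sync boundary).
inline void repair_one_range(recording_peer& peer,
                             const std::vector<bool>& hashes_match) {
    peer.call("repair_range_start");
    for (bool match : hashes_match) {
        peer.call("request_sync_boundary");        // Step A
        peer.call("request_combined_row_hashes");  // Step B
        if (match) {
            continue;  // synced up to this boundary, back to Step A
        }
        peer.call("request_full_row_hashes");
        peer.call("request_row_diff");
        peer.call("send_row_diff");                // Step C
    }
    peer.call("repair_range_stop");
}

// Usage example: a fully synced range needs only the cheap verbs.
inline void rpc_order_example() {
    recording_peer peer;
    repair_one_range(peer, {true});
    const std::vector<std::string> want{
        "repair_range_start", "request_sync_boundary",
        "request_combined_row_hashes", "repair_range_stop"};
    assert(peer.calls == want);
}
```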

Performance evaluation

We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instances. We created a keyspace with a replication factor of 3 and inserted 1 billion rows into each of the 3 nodes. Each node has 241 GiB of data. We tested the 3 cases below.

1) 0% synced: one of the nodes has zero data. The other two nodes have 1 billion identical rows.

Time to repair:

old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%

2) 100% synced: all of the 3 nodes have 1 billion identical rows.

Time to repair:

old = 43 min
new = 24 min
improvement = 44.18%

3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.

Time to repair:

old: 211 min
new: 44 min
improvement: 79.15%

Bytes sent on wire for repair:

old: tx= 162 GiB, rx = 90 GiB
new: tx= 1.15 GiB, rx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%

It is worth noting that row level repair sends and receives exactly the number of rows needed in theory.

In this test case, the repair master needs to receive 2 million rows and send 4 million rows. Here are the details: each node has 1 billion * 0.1% distinct rows, that is 1 million rows. So the repair master receives 1 million rows from repair follower 1 and 1 million rows from repair follower 2. The repair master sends 1 million rows of its own and 1 million rows received from repair follower 1 to repair follower 2. The repair master sends 1 million rows of its own and 1 million rows received from repair follower 2 to repair follower 1.

In the results, the number of rows on the wire was as expected.

tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
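
The expected row counts can be checked with a few lines of arithmetic. This is a sketch that generalizes the reasoning above to any number of followers, assuming every node holds the same number of rows that no other node has. `expected_rows`, `expected_row_counts` and `sum_shards` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

struct expected_rows {
    std::uint64_t rx;  // rows the repair master receives
    std::uint64_t tx;  // rows the repair master sends
};

// With `followers` peers and `distinct` rows unique to each node, the master
// receives `distinct` rows from each follower. It then sends each follower
// everything that follower lacks: the master's own distinct rows plus the
// distinct rows of every other follower, i.e. followers * distinct rows.
inline expected_rows expected_row_counts(std::uint64_t followers,
                                         std::uint64_t distinct) {
    return {followers * distinct, followers * followers * distinct};
}

// Sums the per-shard counters.
inline std::uint64_t sum_shards(const std::vector<std::uint64_t>& shards) {
    return std::accumulate(shards.begin(), shards.end(), std::uint64_t{0});
}

// Usage example with the per-shard numbers measured in test case 3.
inline void row_count_example() {
    const expected_rows e = expected_row_counts(2, 1000000);
    const std::vector<std::uint64_t> tx{1000505, 999619, 1001257, 998619};
    const std::vector<std::uint64_t> rx{500233, 500235, 499559, 499973};
    assert(sum_shards(tx) == e.tx);  // 4'000'000
    assert(sum_shards(rx) == e.rx);  // 2'000'000
}
```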
