Avoid including the lengthy stream_session.hh in messaging_service.
More importantly, fix the build because currently messaging_service.cc
and messaging_service.hh does not include stream_mutation_fragments_cmd.
I am not sure why it builds on my machine. Spotted this when backporting
the "streaming: Send error code from the sender to receiver" to 3.0
branch.
Refs: #4789
In case of error on the sender side, the sender does not propagate the
error to the receiver. The sender will close the stream. As a result,
the receiver will get nullopt from the source in
get_next_mutation_fragment and pass mutation_fragment_opt with no value
to the generating_reader. In turn, the generating_reader generates end
of stream. However, the last element that the generating_reader has
generated can be any type of mutation_fragment. This makes the sstable
that consumes the generating_reader violates the mutation_fragment
stream rule.
To fix, we need to propagate the error. However RPC streaming does not
support propagate the error in the framework. User has to send an error
code explicitly.
Fixes: #4789
get_rpc_client assumes the messaging_service is not stopped. We should check
is_stopping() before we call get_rpc_client.
We do such check in existing code, e.g., send_message and friends. Do
the same check in the newly introduced
make_sink_and_source_for_stream_mutation_fragments() and friends for row
level repair.
Fixes: #4767
Because inet_address was initially hardcoded to
ipv4, its wire format is not very forward compatible.
Since we potentially need to communicate with older version nodes, we
manually define the new serial format for inet_address to be:
ipv4: 4 bytes address
ipv6: 4 bytes marker 0xffffffff (invalid address)
16 bytes data -> address
- REPAIR_GET_ROW_DIFF_WITH_RPC_STREAM
Get repair rows from follower nodes
- REPAIR_PUT_ROW_DIFF_WITH_RPC_STREAM
Put repair rows to follower nodes
- REPAIR_GET_FULL_ROW_HASHES_WITH_RPC_STREAM:
Get full hashes from follower nodes
Before this patch, repair master and followers use their own schema
version at the point repair starts independently. The schemas can be
different due to schema change. Repair uses the schema to serialize
mutation_fragment and deserialize the mutation_fragment received from
peer nodes. Using different schema version to serialize and deserialize
cause undefined behaviour.
To fix, we use the schema the repair master decides for all the repair
nodes involved.
On top of this patch, we could do another step to make sure all nodes
has the latest schema. But let's do it in a separate patch.
Fixes: #4549
Backports: 3.1
Currently, REPAIR_GET_COMBINED_ROW_HASH RPC verb returns only the
repair_hash object. In the future, we will use set reconciliation
algorithm to decode the full row hashes in working row buf. It is useful
to return the number of rows inside working row buf in addition to the
combined row hashes to make sure the decode is successful.
It is also better to use a wrapper class for the verb response so we can
extend the return values later more easily with IDL.
Fixes#4526
Message-Id: <93be47920b523f07179ee17e418760015a142990.1559771344.git.asias@scylladb.com>
Seastar now supports two RPC compression algorithm: the original LZ4 one
and LZ4_FRAGMENTED. The latter uses lz4 stream interface which allows it
to process large messages without fully linearising them. Since, RPC
requests used by Scylla often contain user-provided data that
potentially could be very large, LZ4_FRAGMENTED is a better choice for
the default compression algorithm.
Message-Id: <20190417144318.27701-1-pdziepak@scylladb.com>
Function messaging_service::get_rpc_client() suppose to either return
existing client or create one and return it. The function is suppose to
be atomic, so after checking that requested client does not exist it is
safe to assume emplace() will succeed. But we saw bugs that made the
function to not be atomic. Lets add an assert that will help to catch
such bugs easier if they will happen in the future.
Message-Id: <20190326115741.GX26144@scylladb.com>
Current code captures a reference to rpc::client in a continuation, but
there is no guaranty that the reference will be valid when continuation runs.
Capture shared pointer to rpc::client instead.
Fixes#4350.
Message-Id: <20190314135538.GC21521@scylladb.com>
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
"
=== How the the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few mega bytes of rows,
We read all the rows within the small range into memory. Find the
mismatch and send the mismatch rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data.
We need to find the mismatched rows between nodes and only send the delta.
The problem is called set reconciliation problem which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
dht::decorated_key pk;
position_in_partition position
}
Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.
At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.
Now, local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.
=== How the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- reqeust_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx= 162 GiB, rx = 90 GiB
new: tx= 1.15 GiB, tx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, repair master needs to receives 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows from repair master and 1 million rows
received from repair slave 1 to repair slave 2. Repair master sends
sends 1 million rows from repair master and 1 million rows received from
repair slave 2 to repair slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes: #3033
Tests: dtests/repair_additional_test.py
"
* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
repair: Enable row level repair
repair: Add row_level_repair
repair: Add docs for row level repair
repair: Add repair_init_messaging_service_handler
repair: Add repair_meta
repair: Add repair_writer
repair: Add repair_reader
repair: Add repair_row
repair: Add fragment_hasher
repair: Add decorated_key_with_hash
repair: Add get_random_seed
repair: Add get_common_diff_detect_algorithm
repair: Add shard_config
repair: Add suportted_diff_detect_algorithms
repair: Add repair_stats to repair_info
repair: Introduce repair_stats
flat_mutation_reader: Add make_generating_reader
storage_service: Introduce ROW_LEVEL_REPAIR feature
messaging_service: Add RPC verbs for row level repair
repair: Export the repair logger
...
Change the inter-node protocol so we can propagate the view update
backlog from a base replica to the coordinator through the
mutation_done and mutation_failed verbs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Many headers don't really need to include database.hh, the include can
be replaced by forward declarations and/or including the actually needed
headers directly. Some headers don't need this include at all.
Each header was verified to be compilable on its own after the change,
by including it into an empty `.cc` file and compiling it. `.cc` files
that used to get `database.hh` through headers that no longer include it
were changed to include it themselves.
This patch adds the RPC verbs that are needed by the row level repair.
The usage of those verbs are in the following patches.
All the verbs for row level repair are sent by the repair master.
Repair master asks repair slaves to create repair meta objects, a.k.a,
repair_meta object, to store the repair meta data needed by row level
repair algorithm. The repair meta object is identified by the IP address
of the repair master and a uint32 number repair_meta_id chosen by repair
master. When repair master restarts or is out of the cluster, repair
slaves will detect it and remove all existing repair_meta for the repair
master. When repair slave restarts, the existing repair_meta on the
slave will be gone.
The sync boundary used in the verbs is the position_in_partition of the
last mutation_fragment. In each repair round, peers work on
(last_sync_boundary, current_sync_boundary]
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
In a homogeneous cluster this will reduce number of internal cross-shard hops
per request since RPC calls will arrive to correct shard.
Message-Id: <20181118150817.GF2062@scylladb.com>
On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to notify the receiver what
the streaming is used for. So the receiver can decide further operation,
e.g., send view updates, beyond applying the streaming data on disk.
Fixes#3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>
Fixes#3787
Message service streaming sink was created using direct call to
rpc::client::make_sink. This in turn needs a new socker, which it
creates completely ignoring what underlying transport is active for the
client in question.
Fix by retaining the tls credential pointer in the client wrapper, and
using this in a sink method to determine whether to create a new tls
socker, or just go ahead with a plain one.
Message-Id: <20181010003249.30526-1-calle@scylladb.com>
The non-TLS RPC server has an rpc::resource_limits configuration that limits
its memory consumption, but the TLS server does not. That means a many-node
TLS configuration can OOM if all nodes gang up on a single replica.
Fix by passing the limits to the TLS server too.
Fixes#3757.
Message-Id: <20180907192607.19802-1-avi@scylladb.com>
Since the messaging service will assign a scheduling group based
on the client index, it's more important now to get the verbs categorized
correctly.
Re-categorize REPLICATION_FINISHED, REPAIR_CHECKSUM_RANGE, and most
importantly STREAM_MUTATION_FRAGMENTS to the repair/streaming oriented
connections so we get the correct scheduling.
A default means that when adding new verbs, we may forget to
categorize a verb correctly. Without the default, the compiler
will complain due to -Wswitch.
"
This work is on top of Gleb's rpc streaming which is merged recently.
What this series does is to replace scylla streaming service's data plane to
use the new rpc streaming instead of the old rpc verb to send the mutations for
scylla streaming. Other parts of scylla streaming, the control plane, are not
changed.
In my test, to bootstrap a new node to the existing one node cluster, smp 2,
scylla stores data on ramdisk to minimize disk io impact.
I saw x2 improvment in streaming bandwidth.
Before:
[shard 0] stream_session - [Stream #2ae92320-5fc8-11e8-911a-000000000000]
Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1570312 KiB, 109521.02 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 14.338 seconds
After:
[shard 0] stream_session - [Stream #e5589ac0-5fc7-11e8-b463-000000000000]
Streaming plan for Bootstrap-ks3-index-0 succeeded, peers={127.0.0.1}, tx=0 KiB, 0.00 KiB/s, rx=1546875 KiB, 220415.36 KiB/s
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 7.018 seconds
Tests: dtest update_cluster_layout_tests.py
Fixes: #3591
"
* tag 'asias/scylla_streaming_with_rpc_streaming_v8' of github.com:scylladb/seastar-dev:
streaming: Add rpc streaming support
storage_service: Introduce STREAM_WITH_RPC_STREAM feature
streaming: Add estimate_partitions to send_info
messaging_service: Add streaming with rpc streaming support
messaging_service: Add streaming_domain
database: Add add_sstable_and_update_cache
database: Add make_streaming_sstable_for_write
Assign a scheduling_group for each RPC service. Assignement is
done by connection (get_rpc_client_idx()) - all verbs on the
same connection are assigned the same group. While this may seem
arbitrary, it avoids priority inversion; if two verbs on the same
connection have different scheduling groups, the verb with the low
shares may cause a backlog and stall the connection, including
following requests from verbs that ought to have higher shares.
The scheduling_group parameters are encapsulated in different
classes as they are passed around to avoid adding dependencies.
Message-Id: <20180708140433.6426-1-avi@scylladb.com>
Preparation for adding rpc streaming in scylla streaming.
- register_stream_mutation_fragments is used to register the rpc
streaming verb
- make_sink_and_source_for_stream_mutation_fragments is used to get the
sink and source object for the sender
- make_sink_for_stream_mutation_fragments is used to get a sink object
for the receiver
While not strictly needed, specify which algorithm to use when request
a digest from a remote node. This is more flexible than relying on a
cluster wide feature, although that's what we'll do in subsequent
patches. It also makes the verb more consistent with the data request.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
wan't no timeout there at all.
We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.
This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.
In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.
We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.
After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.
Fixes#2462
Implementation notes, and general comments about open discussion in 2462:
* Because we are not bypassing the timeout, just setting it high enough,
I consider the concerns about the batchlog moot: if we fail for any
other reason that will be propagated. Last case, because the timeout
is per-CF, we could do what we do for the dirty memory manager and
move the batchlog alone to use a different timeout setting.
* Storage proxy likes specifying its timeouts as a time_point, whereas
when we get low enough as to deal with the read_concurrency_config,
we are talking about deltas. So at some point we need to convert time_points
to durations. We do that in the database query functions.
v2:
- use per-request instead of per-table timeouts.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Don't enforce the outgoing connections from the 'listen_address'
interface only.
If 'local_address' is given to connect() it will enforce it to use a
particular interface to connect from, even if the destination address
should be accessed from a different interface. If we don't specify the
'local_address' the source interface will be chosen according to the
routing configuration.
Fixes#3066
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1513372688-21595-1-git-send-email-vladz@scylladb.com>
With the "Use range_streamer everywhere" (7217b7ab36) seires, all
the user of streaming now do streaming with relative small ranges and
can retry streaming at higher level.
There are problems with timeout and retry at RPC verb level in streaming:
1) Timeout can be false negative.
2) We can not cancel the send operations which are already called. When
user aborts the streaming, the retry logic keeps running for a long
time.
This patch removes all the timeout and retry logic for streaming verbs.
After this, the timeout is the job of TCP, the retry is the job of the
upper layer.
Message-Id: <df20303c1fa728dcfdf06430417cf2bd7a843b00.1503994267.git.asias@scylladb.com>
This reverts commit 98757069a5. We have the
failure detector which will detect an unresponsive node and fail the RPC.
Adding a timeout can just introduce false positives.