There is no need to call dht::split_ranges_to_shards to split the token
range into a <shard> : <a lot of small ranges> mapping and create a flat
mutation reader over a lot of small ranges.
Because:
1) The flat mutation reader on each shard only returns data that
belongs to the local shard, so there is no correctness issue if we skip
the split and do not feed only the sub-ranges that belong to the local
shard.
2) With murmur3_partitioner_ignore_msb_bits = 12, it is almost certain
that, for any given token range, every shard will have data for it
anyway. Even if we ask all the shards to work on the token range and
some of them have no data for it, that is fine: those shards simply
send no data.
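The ownership argument in point 1 can be sketched with a toy model. The
shard_of mapping below is a plain modulo, standing in for the real
murmur3-based mapping with msb biasing; the names are ours, not Scylla's:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the partitioner's token-to-shard mapping
// (the real one applies murmur3_partitioner_ignore_msb_bits biasing);
// a plain modulo is enough to show the point.
unsigned shard_of(uint64_t token, unsigned smp_count) {
    return token % smp_count;
}

// A per-shard reader only emits tokens owned by this_shard, so feeding
// the full, unsplit range to every shard yields no duplicates and no
// missing data.
std::vector<uint64_t> read_range_on_shard(uint64_t begin, uint64_t end,
                                          unsigned this_shard,
                                          unsigned smp_count) {
    std::vector<uint64_t> out;
    for (uint64_t t = begin; t < end; ++t) {
        if (shard_of(t, smp_count) == this_shard) {
            out.push_back(t);
        }
    }
    return out;
}
```

Concatenating the output of all shards over the same range reconstructs
the range exactly once, which is why the split is unnecessary.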
Tests: update_cluster_layout_tests.py
Message-Id: <ac00cd21d6156c47b74451dd415d627481e48212.1526864222.git.asias@scylladb.com>
In streaming, the sender sends mutations from all of its local shards
in parallel, so the receiver may handle more than one such connection
on the same shard. Which shard handles a connection is determined by
where the TCP connection lands; the current rpc layer ignores the
destination shard id when sending the rpc message.
For instance, say node1 has 2 shards and node2 has 2 shards. Currently,
we can end up like this:
Node 1 shard 0 -> Node 2 shard 1
Node 1 shard 1 -> Node 2 shard 1
It is better if we do:
Node 1 shard 0 -> Node 2 shard 0
Node 1 shard 1 -> Node 2 shard 1
This patch solves the problem by having the handler always run on
shard = src_cpu_id % smp::count.
If the sender and receiver have the same shard configuration, the work
is distributed completely evenly.
If they do not, it is unavoidable that some shards will do more work
than others.
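The placement rule is small enough to sketch directly; smp_count below
stands in for smp::count and the function name is ours:

```cpp
#include <cassert>

// Receiver-side placement rule from the patch: the handler for a
// stream connection always runs on shard src_cpu_id % smp::count.
unsigned handler_shard(unsigned src_cpu_id, unsigned smp_count) {
    return src_cpu_id % smp_count;
}
```

With equal shard counts the mapping is the identity (shard 0 to shard 0,
shard 1 to shard 1); with a 4-shard sender and a 2-shard receiver,
sender shards 2 and 3 fold onto receiver shards 0 and 1.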
Tests: dtest update_cluster_layout_tests.py
Message-Id: <911827bcf67459a07ec92623a9ed4c4fbba195ca.1524622375.git.asias@scylladb.com>
This reverts commit f792c78c96.
With the "Use range_streamer everywhere" (7217b7ab36) series,
all users of streaming now stream relatively small ranges
and can retry streaming at a higher level.
Reduce the time-to-recover from 5 hours to 10 minutes per stream
session. Even if the 10-minute idle detection causes more false
positives, that is fine, since we can retry the "small" stream session
anyway. In the long term, we should replace the whole idle-detection
logic: whenever the stream initiator goes away, the stream slave should
go away too.
Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
The uninitialized session has no peer associated with it yet, so there
is no point sending the failed message when aborting the session.
Sending the failed message in this case would target a peer with an
uninitialized dst_cpu_id, which causes the receiver to pass a bogus
shard id to smp::submit_to, which causes a segfault.
In addition, to be safe, initialize dst_cpu_id to zero, so that an
uninitialized session sends messages to shard zero instead of a random
bogus shard id.
Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test
Fixes #3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>
stream_session.cc:417:62: error: cannot call member function ‘utils::UUID streaming::stream_session::plan_id()’ without object
sslog.warn("[Stream #{}] Failed to send: {}", plan_id(), ep);
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171214022621.19442-1-raphaelsc@scylladb.com>
When there is a large number of column families, the sender sends all
of them in parallel. We allow 20% of shard memory for streaming on the
receiver, so each column family gets 1/N of that for its memtable,
where N is the number of in-flight column families. A large N causes a
lot of small sstables to be generated.
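The memory split described above can be sketched as simple arithmetic;
the 20% streaming share and the 1/N split come from the text, while the
function name is ours:

```cpp
#include <cassert>
#include <cstddef>

// Per-memtable memory budget on the receiver: 20% of shard memory is
// reserved for streaming, divided evenly among the N in-flight column
// families.
size_t memtable_budget(size_t shard_memory, size_t inflight_cfs) {
    size_t streaming_budget = shard_memory / 5;  // the 20% streaming share
    return streaming_budget / inflight_cfs;
}
```

As a rough illustration, with 1 GiB of shard memory and 100 in-flight
column families each memtable gets only about 2 MiB, which is why large
N produces a flood of tiny sstables.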
It is possible for there to be multiple senders to a single receiver,
e.g., when a new node joins the cluster, in which case the maximum
number of in-flight column families is the number of peer nodes. The
column families are sent in cf_id order. It is not guaranteed that all
peers stream at the same speed, so they will not necessarily be sending
the same cf_id at the same time; still, there is a chance that some of
the peers are sending the same cf_id.
Fixes #3065
Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
There is a user of fragment_and_freeze() (streaming) that will need to
be able to break the loop. Right now it does that between
streamed_mutations, but that won't be possible after we switch to flat
readers.
When we abort a session, it is possible that:
node 1 aborts the session by user request
node 1 sends the complete_message to node 2
node 2 aborts the session upon receiving the complete_message
node 1 sends one more stream message to node 2, and the stream_manager
for the session cannot be found.
It is fine for node 2 to be unable to find the stream_manager; make the
log on node 2 less verbose so it confuses users less.
This is the handler for the failed complete message. Add a flag to
remember whether we received such a message from the peer; if so, do
not send the failed complete message back to the peer when running
close_session with failed status.
When a node shuts itself down, it sends a shutdown status to its peer
nodes. When the peers receive the shutdown status update, they are
supposed to close all sessions with that node, because the node is shut
down; there is no need to wait for a timeout and then fail the session.
This change speeds up the closing of sessions.
Currently, send_complete_message is not used. We will use it shortly
when the local session fails: send a complete message with the failed
flag to notify the peer node that the session failed, so that the peer
can close the session. This speeds up the closing of failed sessions.
Also rename it to send_failed_complete_message.
The complete_message is not needed, and the handler of this rpc message
does nothing but return a ready future. The patch to remove it did not
make it into the Scylla 1.0 release, so it was left there.
Use this flag to notify the peer that the session failed so that the
peer can close the failed session more quickly.
The flag is passed as an rpc::optional, so it remains compatible with
old versions of the verb.
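The compatibility behavior can be sketched with std::optional standing
in for rpc::optional (an assumption: the real rpc::optional decodes a
missing trailing parameter from an old sender as an empty value, which
std::optional models here; the function name is ours):

```cpp
#include <cassert>
#include <optional>

// An old sender that does not know about the new flag sends nothing for
// it; the handler sees an empty optional and treats it as "not failed",
// preserving the old behavior.
bool peer_reported_failure(std::optional<bool> failed) {
    return failed.value_or(false);
}
```

Only upgraded senders can pass true, so mixed-version clusters degrade
gracefully to the pre-flag behavior.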
Streaming usually takes a long time to complete. Aborting it on a
false-positive idle detection can be very wasteful.
Increase the abort timeout from 10 minutes to a very large timeout, 300
minutes. A genuinely idle session will still be aborted eventually if
the other mechanisms, e.g., the streaming manager's gossip callbacks
for the on_remove and on_restart events, do not abort it first.
Fixes#2197
Message-Id: <57f81bfebfdc6f42164de5a84733097c001b394e.1494552921.git.asias@scylladb.com>
- introduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
Now that we have the new interface to make readers with ranges, we can
simplify the code a lot.
1) Fewer readers are needed
before: as many readers as there are ranges
after: at most smp::count readers
2) No foreign_ptr is needed
There is no need to forward to a shard to create the foreign_ptr for
send_info in the first phase and then forward to that shard again to
execute the send_info in the second phase.
3) No do_with is needed in send_mutations, since si is now a
lw_shared_ptr
4) Fix a possible use-after-free of 'si' in do_send_mutations
We need to take a reference to 'si' when sending the mutation with the
send_stream_mutation rpc call; otherwise:
msg1 gets an exception
si->mutations_done.broken()
si is freed
msg2 gets an exception
si is used again
The issue was introduced in dc50ce0ce5 (streaming: Make the mutation
readers when streaming starts), which is master-only; branch 1.5 is not
affected.
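The fix can be illustrated with std::shared_ptr standing in for
seastar::lw_shared_ptr and plain callbacks standing in for rpc
continuations (names and types below are ours, not the real streaming
code): each in-flight message captures si by value, so a failure that
drops the original owner cannot free si while another message still
needs it.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Toy stand-in for the streaming send_info object.
struct send_info {
    int mutations_done = 0;
};

// Queue two "messages" whose continuations each capture si by value.
// Capturing by value gives every in-flight message its own reference,
// so si stays alive even if the caller drops its copy after msg1 fails.
std::vector<std::function<void()>>
queue_sends(std::shared_ptr<send_info> si) {
    std::vector<std::function<void()>> continuations;
    for (int msg = 0; msg < 2; ++msg) {
        continuations.push_back([si] {   // one reference per message
            si->mutations_done += 1;     // safe: si cannot be freed here
        });
    }
    return continuations;
}
```

With a raw pointer capture instead, the second continuation would touch
freed memory once the first failure released the only owner.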
Currently we make the mutation readers for streaming at different
points in time, i.e.,
do_for_each(_ranges.begin(), _ranges.end(), [] (auto range) {
    // make a mutation reader for this range
    // read mutations from the reader and send them
});
If there is write workload in the background, we will stream extra
data, since the later a reader is made, the more data we need to send.
Fix this by making all the readers before streaming starts.
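The difference can be shown with a toy model in which a snapshot stands
in for a mutation reader (no real ranges or readers; all names below
are ours):

```cpp
#include <cstddef>
#include <vector>

// Toy table that keeps receiving writes while streaming runs.
struct table {
    std::vector<int> rows;
};

// A "reader" sees only the data present when it is created.
struct reader {
    std::vector<int> snapshot;
    explicit reader(const table& t) : snapshot(t.rows) {}
};

// Eager: make all readers before any background writes land.
size_t stream_eager(table& t, const std::vector<int>& writes) {
    reader r1(t), r2(t);              // both made up front
    t.rows.insert(t.rows.end(), writes.begin(), writes.end());
    return r1.snapshot.size() + r2.snapshot.size();
}

// Lazy: the second reader is made after background writes arrived,
// so it also streams the new data.
size_t stream_lazy(table& t, const std::vector<int>& writes) {
    reader r1(t);
    t.rows.insert(t.rows.end(), writes.begin(), writes.end());
    reader r2(t);                     // sees the extra writes too
    return r1.snapshot.size() + r2.snapshot.size();
}
```

The later a reader is created, the more of the background writes it
picks up, which is exactly the extra streamed data the fix removes.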
Fixes #1815
Message-Id: <1479341474-1364-2-git-send-email-asias@scylladb.com>
There are places in which we need to use the column family object many
times, with deferring points in between. Because the column family may
have been destroyed during a deferring point, we need to go and find it
again.
If we use lw_shared_ptr, however, we'll be able to at least guarantee
that the object will be alive. Some users will still need to check, if
they want to guarantee that the column family wasn't removed. But others
that only need to make sure we don't access an invalid object will be
able to avoid the cost of re-finding it just fine.
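The guarantee can be sketched with std::shared_ptr standing in for
seastar::lw_shared_ptr and a plain map standing in for the database
(all names below are illustrative, not the real Scylla API):

```cpp
#include <map>
#include <memory>
#include <string>

// Toy stand-in for a column family.
struct column_family {
    std::string name;
};

// The "database": id -> shared column family object.
using cf_map = std::map<int, std::shared_ptr<column_family>>;

// Look up once and keep the returned reference across deferring points;
// the object stays alive even if it is dropped from the map meanwhile.
std::shared_ptr<column_family> find_cf(const cf_map& db, int id) {
    auto it = db.find(id);
    return it == db.end() ? nullptr : it->second;
}
```

Holding the reference makes the object safe to access; callers that
also need "not removed" semantics still re-check the map.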
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>
Wrapping ranges are a pain, so we are moving wrap handling to the edges.
Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.
Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.
Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
"This series improves repair by
1) using less streaming sessions
2) reducing unnecessary streaming traffic
3) fixing a hang during shutdown
See commit log for "repair: Reduce stream_plan usage", "repair: Reduce
unnecessary streaming traffic" and "streaming: Fail streaming sessions
during shutdown" for details.
Tested with repair_additional_test.py."
Using make_streaming_reader for streaming on the sender side has the
following advantages:
- Streaming and repair will no longer pollute the row cache on the
sender side. Currently, we risk evicting all the frequently-queried
partitions from the cache when an operation like repair reads entire
sstables and floods the row cache with swathes of cold data read from
disk.
- Less data will be sent, because the reader only returns data that
existed before the point at which the reader was created, plus a
bounded amount of writes that arrive later. This helps reduce the
streaming time when new data is being inserted continuously while
streaming is in progress, e.g., adding a new node while there is a lot
of CQL write workload.
Fixes #382
Fixes #1682