Commit Graph

444 Commits

Author SHA1 Message Date
Piotr Sarna
32c0fe8df2 streaming: stream tables with views through staging sstables
While streaming to a table with paired views, staging sstables
are used. After the table is written to disk, it's used to generate
all required view updates. It's also resistant to restarts as it's
stored on disk in the staging/ directory.

Refs #3275
2018-11-13 15:04:42 +01:00
Piotr Sarna
dc74887ff3 streaming: add system distributed keyspace ref to streaming
Streaming code needs system distributed keyspace to check if streamed
sstables should be staging, so a proper reference is added.
2018-11-13 15:01:53 +01:00
Piotr Sarna
7ef5e1b685 streaming: add view update generator reference to streaming
Streaming code may need view update generator service to generate
and send view updates, so a proper reference is added.
2018-11-13 15:01:53 +01:00
Avi Kivity
fd513c42ad streaming: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Avi Kivity
8501e2a45d streaming: progress_info: fix format string
We try to escape % as \%, but the correct escape is %%.
2018-11-01 13:16:17 +00:00
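The escaping rule fixed in 8501e2a45d can be illustrated with a plain printf-style call. This is only a minimal sketch: it uses std::snprintf rather than the sprint()/format() helpers named in the commits, and format_progress is a hypothetical helper, not a function from the streaming code.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the Scylla sources): renders a progress
// percentage with a printf-style format string.
std::string format_progress(int percent) {
    assert(percent >= 0 && percent <= 100);
    char buf[32];
    // "%%" produces one literal '%' in the output. "\%" is not a
    // printf escape, so it must not be used for this purpose.
    std::snprintf(buf, sizeof(buf), "progress: %d%%", percent);
    return std::string(buf);
}
```

Calling format_progress(42) yields the string "progress: 42%".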
Asias He
7f826d3343 streaming: Expose reason for streaming
On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to tell the receiver what
the streaming is used for, so that the receiver can decide on further
operations, e.g., sending view updates, beyond applying the streamed data on disk.

Fixes #3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>
2018-10-15 22:03:28 +01:00
Gleb Natapov
ceb361544a stream_session: remove unused capture
'Consumer function' parameter for distribute_reader_and_consume_on_shards()
captures schema_ptr (which is a seastar::shared_ptr), but the function
is later copied on another shard at which point schema_ptr is also copied
and its counter is incremented by the wrong shard. The capture is not
even used, so let's just drop it.

Fixes #3838

Message-Id: <20181011075500.GN14449@scylladb.com>
2018-10-11 11:10:58 +03:00
Botond Dénes
eb357a385d flat_mutation_reader: make timeout opt-out rather than opt-in
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewers of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.

To bring order to this chaos, make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.

There were surprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the vast majority of the changes) I just used
`db::no_timeout` everywhere.

Tests: unit(release, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
2018-09-20 11:31:24 +02:00
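The opt-out idea from eb357a385d can be sketched in standard C++ as follows. Everything here is an assumption made for illustration: toy_reader, timeout_point and the sketch::no_timeout constant only stand in for flat_mutation_reader and db::no_timeout, and the real reader interface is asynchronous and considerably richer.

```cpp
#include <cassert>
#include <chrono>

namespace sketch {

using clock_type = std::chrono::steady_clock;
using timeout_point = clock_type::time_point;

// Stand-in for db::no_timeout: the caller explicitly asks for no deadline.
const timeout_point no_timeout = timeout_point::max();

// Hypothetical reader holding a fixed number of fragments.
class toy_reader {
    int _remaining;
public:
    explicit toy_reader(int fragments) : _remaining(fragments) {
        assert(fragments >= 0);
    }

    // The timeout has no default argument, so forgetting it is a compile
    // error. Returns true if a fragment was produced before the deadline.
    bool fill_buffer(timeout_point timeout) {
        if (timeout != no_timeout && clock_type::now() >= timeout) {
            return false; // deadline already passed
        }
        if (_remaining == 0) {
            return false; // nothing left to read
        }
        --_remaining;
        return true;
    }
};

} // namespace sketch
```

A call such as reader.fill_buffer() no longer compiles; the caller has to write reader.fill_buffer(sketch::no_timeout) or pass a real deadline, which makes the decision visible at the call site.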
Asias He
de05df216f streaming: Use rpc::source on the shard where it is created
rpc::source can only work on the shard where it is created, thus we
cannot apply the load distribution optimization. Disable it and let the
multishard_writer forward the data to the correct shard.

Fixes #3731.

Message-Id: <0d1b4d3e7adcfdc4e392b83aeb2544b95f3f46dd.1537430162.git.asias@scylladb.com>
2018-09-20 12:29:24 +03:00
Duarte Nunes
a025bf6a7d Merge seastar upstream
Seastar introduced a "compat" namespace, which conflicts with Scylla's
own "compat" namespaces. The merge thus includes changes to scope
uses of Scylla's "compat" namespaces.

* seastar 8ad870f...9bb1611  (5):
  > util/variant_utils: Ensure variant_cast behaves well with rvalues
  > util/std-compat: Fix infinite recursion
  > doc/tutorial: Undo namespace changes
  > util/variant_utils: Add cast_variant()
  > Add compatibility with C++17's library types

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-14 13:07:09 +01:00
Asias He
d47d46e1a8 streaming: Use streaming_write_priority for the sstable writer
Use the streaming I/O priority; otherwise the writer uses the default I/O priority.

Message-Id: <e1836a9a93e7204d4bc9bba9c841d57c8b24aff8.1533715438.git.asias@scylladb.com>
2018-08-08 11:08:06 +03:00
Asias He
deff5e7d60 streaming: Add rpc streaming support
This patch changes scylla streaming to use the recently added rpc
streaming feature provided by seastar to send mutation fragments for
scylla streaming instead of the rpc verbs.

It also changes the receiver to write to the sstable file directly,
skipping writing to memtable.
2018-07-13 08:36:47 +08:00
Asias He
faa6769cdb streaming: Add estimate_partitions to send_info
The sender needs to estimate the number of partitions to send, because
the receiver needs this to prepare the sstables.
2018-07-13 08:36:46 +08:00
Asias He
db8c3a7059 streaming: Do not use dht::split_ranges_to_shards
There is no need to call dht::split_ranges_to_shards to split the token
range into <shard> : <a lot of small ranges> mapping and create a flat
mutation reader with a lot of small ranges.

Because:

1) The flat mutation reader on each shard only returns data that belongs
to this local shard, so there is no correctness issue if we do not split
the range and feed each shard only the sub-ranges that belong to it.

2) With murmur3_partitioner_ignore_msb_bits = 12, it is almost certain
that given a token range, all the shards will have data for the range
anyway. Even if we ask all the shards to work on the token range and
some of the shards have no data for it, it is fine. We simply send no
data from this shard.

Tests: update_cluster_layout_tests.py

Message-Id: <ac00cd21d6156c47b74451dd415d627481e48212.1526864222.git.asias@scylladb.com>
2018-05-21 10:42:45 +03:00
Asias He
e20038eb84 streaming: Handle stream_mutation rpc handler on all shards
In streaming, the sender sends the mutations on all the local shards in
parallel, so it is possible that the receiver handles more than one such
connection on the same shard. Which shard handles a connection is
determined by where the TCP connection lands. The current rpc ignores the
destination shard id when sending the rpc message.

For instance, say node1 has 2 shards and node2 has 2 shards. Currently, we
can end up like this:

   Node 1 shard 0 -> Node 2 shard 1
   Node 1 shard 1 -> Node 2 shard 1

It is better if we do:

   Node 1 shard 0 -> Node 2 shard 0
   Node 1 shard 1 -> Node 2 shard 1

This patch solves this problem by letting the handler always run on
shard = src_cpu_id % smp::count.

If sender and receiver have the same shard config, the work is
distributed completely evenly.

If sender and receiver do not have the same shard config, it is
unavoidable that some of the shards will do more work than the others.

Tests: dtest update_cluster_layout_tests.py

Message-Id: <911827bcf67459a07ec92623a9ed4c4fbba195ca.1524622375.git.asias@scylladb.com>
2018-05-19 21:08:25 +03:00
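The shard-selection rule from e20038eb84 (shard = src_cpu_id % smp::count) amounts to the small function below. This is a sketch under stated assumptions: handler_shard is a hypothetical name, and local_shard_count stands in for smp::count on the receiving node.

```cpp
#include <cassert>

// Hypothetical helper: picks the receiver shard that should run the
// handler for a message sent from shard src_cpu_id of the sender.
// local_shard_count plays the role of smp::count on the receiver.
unsigned handler_shard(unsigned src_cpu_id, unsigned local_shard_count) {
    assert(local_shard_count > 0);
    return src_cpu_id % local_shard_count;
}
```

With 2 shards on both nodes, sender shards 0 and 1 map to receiver shards 0 and 1. With 4 sender shards and 2 receiver shards, sender shards 0 and 2 both map to receiver shard 0, which is the unavoidable imbalance the commit message mentions.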
Asias He
ad7b132188 Revert "streaming: Do not abort session too early in idle detection"
This reverts commit f792c78c96.

With the "Use range_streamer everywhere" (7217b7ab36) series,
all the users of streaming now stream relatively small ranges
and can retry streaming at a higher level.

Reduce the time-to-recover from 5 hours to 10 minutes per stream session.

Even if the 10-minute idle detection causes more false positives,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic so that
whenever the stream initiator goes away, the stream slave goes away too.

Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
2018-03-14 10:11:00 +02:00
Asias He
774307b3a7 streaming: Do not send failed message for uninitialized session
The uninitialized session has no peer associated with it yet. There is
no point in sending the failed message when aborting the session. Sending
the failed message in this case would send it to a peer with an
uninitialized dst_cpu_id, which causes the receiver to pass a bogus shard
id to smp::submit_to, which in turn causes a segfault.

In addition, to be safe, initialize the dst_cpu_id to zero, so that an
uninitialized session sends the message to shard zero instead of a random
bogus shard id.

Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test

Fixes #3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>
2018-01-08 15:04:06 +02:00
Raphael S. Carvalho
95d1995876 fix compilation of stream_session.cc
stream_session.cc:417:62: error: cannot call member function ‘utils::UUID streaming::stream_session::plan_id()’ without object
         sslog.warn("[Stream #{}] Failed to send: {}", plan_id(), ep);

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171214022621.19442-1-raphaelsc@scylladb.com>
2017-12-14 10:57:33 +01:00
Asias He
a9dab60b6c streaming: One cf at a time on sender
When there is a large number of column families, the sender will
send all the column families in parallel. We allow 20% of shard memory
for streaming on the receiver, so each column family gets 1/N of that
memory for its memtable, where N is the number of in-flight column
families. A large N causes a lot of small sstables to be generated.

It is possible that there are multiple senders to a single receiver, e.g.,
when a new node joins the cluster; the maximum number of in-flight column
families is then the number of peer nodes. The column families are sent
in the order of cf_id. It is not guaranteed that all peers have the same
speed and are therefore sending the same cf_id at the same time, though.
There is still a chance that some of the peers are sending the same cf_id.

Fixes #3065

Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
2017-12-13 12:32:41 +02:00
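The memory arithmetic behind a9dab60b6c can be written out as a short calculation. This is only an illustrative sketch: memtable_budget_per_cf is a hypothetical function, and the 20% figure is the one quoted in the commit message, not a value read from the code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: memtable memory available to each in-flight
// column family when 20% of shard memory is reserved for streaming
// and split evenly between N in-flight column families.
std::uint64_t memtable_budget_per_cf(std::uint64_t shard_memory_bytes,
                                     unsigned in_flight_cfs) {
    assert(in_flight_cfs > 0);
    const std::uint64_t streaming_budget = shard_memory_bytes / 5; // 20%
    return streaming_budget / in_flight_cfs;
}
```

For a shard with 1000 units of memory, one in-flight column family gets 200 units, while 10 in-flight column families get 20 units each, so each memtable is flushed far sooner and produces smaller sstables.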
Paweł Dziepak
dca93bea23 db: convert make_streaming_reader() to flat_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
6f1e0d3ed8 stream_transfer_task: switch to flat_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
8bb672502d fragment_and_freeze: allow callback to stop iteration
There is a user of fragment_and_freeze() (streaming) that will need
to be able to break the loop. Right now, it does that between
streamed_mutations, but that won't be possible after we switch to flat
readers.
2017-11-13 16:44:33 +00:00
Avi Kivity
85a6a2b3cb streaming: remove unneeded includes 2017-09-12 10:43:39 +03:00
Asias He
9c8da2cc56 streaming: Add abort to stream_plan
It can be used by the user of stream_plan to abort the stream sessions.
Repair will be the first user when aborting the repair.
2017-08-30 15:19:51 +08:00
Asias He
475b7a7f1c streaming: Add abort_all_stream_sessions for stream_coordinator
It will abort all the sessions within the stream_coordinator.
It will be used by stream_plan soon.
2017-08-30 15:19:51 +08:00
Asias He
fad34801bf streaming: Introduce streaming::abort()
It will be used soon by stream_plan::abort() to abort a stream session.
2017-08-30 15:19:50 +08:00
Asias He
7fba7cca01 streaming: Make stream_manager and coordinator message debug level
When we abort a session, it is possible that:

node 1 aborts the session by user request
node 1 sends the complete_message to node 2
node 2 aborts the session upon receipt of the complete_message
node 1 sends one more stream message to node 2 and the stream_manager
for the session cannot be found.

It is fine for node 2 to be unable to find the stream_manager, so make
the log on node 2 less verbose to confuse users less.
2017-08-30 15:19:50 +08:00
Asias He
be573bcafb streaming: Check if _stream_result is valid
If on_error() was called before init() was executed, the
_stream_result can be invalid.
2017-08-30 15:19:44 +08:00
Asias He
8a3f6acdd2 streaming: Log peer address in on_error 2017-08-30 15:18:27 +08:00
Asias He
eace5fc6e8 streaming: Introduce received_failed_complete_message
It is the handler for the failed complete message. Add a flag to
remember whether we received such a message from the peer; if so, do not
send the failed complete message back to the peer when running
close_session with failed status.
2017-08-30 15:18:27 +08:00
Duarte Nunes
85e85ec72e Don't catch polymorphic exceptions by value
It makes gcc a very sad compiler.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170726172053.5639-2-duarte@scylladb.com>
2017-07-27 09:39:58 +03:00
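The problem addressed by 85e85ec72e is that catching a polymorphic exception by value copies only the base part of the thrown object (slicing). A minimal sketch follows; base_error and stream_error are hypothetical types defined just for this example, and the by-value handler is exactly the pattern the commit removes (compilers may warn about it).

```cpp
#include <cassert>
#include <string>

// Hypothetical exception hierarchy used only for this illustration.
struct base_error {
    virtual ~base_error() = default;
    virtual std::string name() const { return "base_error"; }
};

struct stream_error : base_error {
    std::string name() const override { return "stream_error"; }
};

// Catching by value copies only the base_error part of the thrown
// stream_error, so the override is lost.
std::string caught_by_value() {
    try {
        throw stream_error();
    } catch (base_error e) {
        return e.name();
    }
}

// Catching by const reference keeps the dynamic type intact.
std::string caught_by_reference() {
    try {
        throw stream_error();
    } catch (const base_error& e) {
        return e.name();
    }
}

// Small self-check showing the difference between the two handlers.
void example_catch_styles() {
    assert(caught_by_value() == "base_error");
    assert(caught_by_reference() == "stream_error");
}
```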
Asias He
d6cebd1341 streaming: Listen on shutdown gossip callback
When a node shuts itself down, it sends a shutdown status to its peer
nodes. When the peer nodes receive the shutdown status update, they are
supposed to close all the sessions with that node: because the node is
shut down, there is no need to wait for a timeout before failing the session.

This change can speed up the closing of sessions.
2017-07-19 10:11:06 +08:00
Asias He
aa87429e67 streaming: Send complete message with failed flag when session is failed
This notifies the peer node that the session has failed.
2017-07-19 10:11:05 +08:00
Asias He
03b838705c streaming: Handle failed flag in complete message
Fail the current session if the failed flag is on in the complete
message handler.
2017-07-19 10:11:05 +08:00
Asias He
12d18cfab4 streaming: Do not fail the session when failed to send complete message
Since the complete message is not mandatory, there is no point in failing
the session when sending the complete message fails.
2017-07-19 10:11:04 +08:00
Asias He
ca5248cd58 streaming: Introduce send_failed_complete_message
Currently, send_complete_message is not used. We will use it shortly for
the case where the local session has failed: send a complete message with
the failed flag to notify the peer node that the session has failed, so
that the peer can close the session. This can speed up the closing of
failed sessions.

Also rename it to send_failed_complete_message.
2017-07-19 10:11:04 +08:00
Asias He
f21cb75cdb streaming: Do not send complete message when session is successful
The complete_message is not needed and the handler of this rpc message
does nothing but return a ready future. The patch to remove it did not
make it into the Scylla 1.0 release, so it was left there.
2017-07-18 15:29:42 +08:00
Asias He
0ba4e73068 streaming: Introduce the failed parameter for complete message
Use this flag to notify the peer that the session has failed so that the
peer can close the failed session more quickly.

The flag is passed as an rpc::optional, so it stays compatible with the
old version of the verb.
2017-07-18 11:24:31 +08:00
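The compatibility argument in 0ba4e73068 can be sketched with std::optional standing in for rpc::optional. This is an assumption-laden illustration: session_action and on_complete_message are hypothetical names, and an empty optional models a message from an older sender that does not know about the flag.

```cpp
#include <cassert>
#include <optional>

// Hypothetical outcome of handling a complete message.
enum class session_action { close_normally, fail_session };

// Hypothetical handler. An old sender never sets the flag, so the
// optional is empty and the message is treated as "not failed".
session_action on_complete_message(std::optional<bool> failed) {
    const bool peer_failed = failed.value_or(false);
    return peer_failed ? session_action::fail_session
                       : session_action::close_normally;
}

// Small self-check covering old and new senders.
void example_complete_message() {
    assert(on_complete_message(std::nullopt) == session_action::close_normally);
    assert(on_complete_message(false) == session_action::close_normally);
    assert(on_complete_message(true) == session_action::fail_session);
}
```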
Asias He
7599c1524d streaming: Remove unused session_failed function
It is never used. Get rid of it.
2017-07-18 11:22:09 +08:00
Asias He
caad7ced23 streaming: Less verbose in logging
Now we will have a large number of small streaming sessions. Make the
less important log messages debug level.
2017-07-18 11:17:09 +08:00
Asias He
d0dffd7346 streaming: Better stats
Log the number of bytes streamed and a streaming bandwidth summary on the same line as the session
complete message.
2017-07-18 11:17:09 +08:00
Asias He
f792c78c96 streaming: Do not abort session too early in idle detection
Streaming usually takes a long time to complete. Aborting it on a false
positive idle detection can be very wasteful.

Increase the abort timeout from 10 minutes to a very large timeout, 300
minutes. A truly idle session will still be aborted eventually if other
mechanisms (e.g., the streaming manager's gossip callbacks for on_remove
and on_restart events) do not abort the session first.

Fixes #2197

Message-Id: <57f81bfebfdc6f42164de5a84733097c001b394e.1494552921.git.asias@scylladb.com>
2017-05-24 12:29:50 +03:00
Avi Kivity
ebaeefa02b Merge seastar upstream (seastar namespace)
 - introduced "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Avi Kivity
ca69a04969 streaming: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.
2017-04-17 23:03:15 +03:00
Vlad Zolotarov
a850bea820 streaming::stream_manager: move a collectd counters registration to the metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:54 -05:00
Asias He
937f28d2f1 Convert to use dht::partition_range_vector and dht::token_range_vector 2016-12-19 14:08:50 +08:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Asias He
d1178fa299 Convert to use dht::token_range 2016-12-19 08:04:29 +08:00
Asias He
ba54654af3 streaming: Use interval_set to sort and merge ranges
So that the ranges are sorted and have no overlaps. We then have fewer
ranges to deal with, which can help the mutation readers to optimize.

Here is an example of ranges generated by repair:

Before:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    before ranges = {(-3383928698815274642, -3376937163195039606],
    (-7260764223708720005, -7251657821052234309], (-4767213984179237293,
    -4747032371925842389], (-7645879646119667643, -7589962743703481776],
    (-2340199306656526861, -2320523117224780931], (-576028861239229331,
    -560973674020019962], (-4070378863644120252, -3987599893827407860],
    (-2551584407739673151, -2498779102482524711], (-5416061903556353312,
    -5354212455975869358], (37594980457713898, 67885601051654285],
    (3083778975065200884, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (5765437476661317163, 5778671293583720541],
    (5960610072466058818, 5972289771228014343], (7749618183851698485,
    7758080813117351135], (-3987599893827407860, -3899198931034439776],
    (-7251657821052234309, -7131649010279865221], (-3576581915808403133,
    -3383928698815274642], (-417850207760366422, -327959672080599465],
    (-2671876682129336880, -2551584407739673151], (-1305178847032904465,
    -1137497074548854552], (8540448858050275827, 8610171849752115483],
    (-560973674020019962, -417850207760366422], (-2498779102482524711,
    -2340199306656526861], (2394447940525988167, 2523396860109747637],
    (-6703329224557608009, -6517757811218772762], (-3675103288021821677,
    -3576581915808403133], (-5622185785296846551, -5416061903556353312],
    (8610171849752115483, 8742605005068551458], (8068079250973315241,
    8185655671734937642], (560264964510741191, 790641981923757238],
    (5581202487214475094, 5765437476661317163], (8742605005068551458,
    8923908282731801645], (-6038176423022601107, -5622185785296846551],
    (5778671293583720541, 5960610072466058818], (-3899198931034439776,
    -3675103288021821677], (8356739976149429222, 8540448858050275827],
    (-6517757811218772762, -6038176423022601107], (-8052600134279395253,
    -7645879646119667643], (-327959672080599465, 37594980457713898],
    (7758080813117351135, 8019254284118543066], (4781565016737645510,
    5067070718000527886], (2523396860109747637, 3083778975065200884],
    (-5354212455975869358, -4767213984179237293], (6784138025918878582,
    7190719703944308372], (67885601051654285, 447405341661896387],
    (-2190610927722759275, -1305178847032904465], (-4747032371925842389,
    -4070378863644120252]}, size=48

After:

    INFO  2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id =
    dec9fa90-bc3b-11e6-af78-000000000001,
    after  ranges = {(-8052600134279395253, -7589962743703481776],
    (-7260764223708720005, -7131649010279865221], (-6703329224557608009,
    -3376937163195039606], (-2671876682129336880, -2320523117224780931],
    (-2190610927722759275, -1137497074548854552], (-576028861239229331,
    447405341661896387], (560264964510741191, 790641981923757238],
    (2394447940525988167, 3091232478835418439], (3131345970514528877,
    3187922544267434961], (4781565016737645510, 5067070718000527886],
    (5581202487214475094, 5972289771228014343], (6784138025918878582,
    7190719703944308372], (7749618183851698485, 8019254284118543066],
    (8068079250973315241, 8185655671734937642], (8356739976149429222,
    8923908282731801645]}, size=15
2016-12-12 11:09:26 +08:00
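The sort-and-merge effect shown in the log output of ba54654af3 can be reproduced with a few lines of standard C++. This is a minimal sketch of the technique, not the commit's implementation: it uses std::sort and a linear merge instead of an interval_set, and token_range and sort_and_merge are hypothetical names. Ranges are treated as (start, end], so two ranges that merely touch are joined, as in the log above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// A token range (first, second]: start exclusive, end inclusive.
using token_range = std::pair<std::int64_t, std::int64_t>;

// Sorts the ranges by start token and joins ranges that overlap or
// touch, so the result is sorted and has no overlaps.
std::vector<token_range> sort_and_merge(std::vector<token_range> ranges) {
    std::sort(ranges.begin(), ranges.end());
    std::vector<token_range> merged;
    for (const auto& r : ranges) {
        if (!merged.empty() && r.first <= merged.back().second) {
            // Overlapping or adjacent: extend the previous range.
            merged.back().second = std::max(merged.back().second, r.second);
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}

// Self-check using three of the ranges from the log output above.
void example_sort_and_merge() {
    std::vector<token_range> in = {
        {-3383928698815274642, -3376937163195039606},
        {-3576581915808403133, -3383928698815274642},
        {560264964510741191, 790641981923757238},
    };
    const auto out = sort_and_merge(in);
    assert(out.size() == 2);
    assert(out[0] == token_range(-3576581915808403133, -3376937163195039606));
    assert(out[1] == token_range(560264964510741191, 790641981923757238));
}
```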
Asias He
1987264beb streaming: Make streaming reader with ranges
Now that we have the new interface to make readers with ranges, we can
simplify the code a lot.

1) Fewer readers are needed
before: one reader per range
after: smp::count readers at most

2) No foreign_ptr is needed
There is no need to forward to a shard to make the foreign_ptr for
send_info in the first phase and forward to that shard to execute the
send_info in the second phase.

3) No do_with is needed in send_mutations since si now is a
lw_shared_ptr

4) Fix possible use after free of 'si' in do_send_mutations
We need to take a reference of 'si' when sending the mutation with
send_stream_mutation rpc call, otherwise:
   msg1 got exception
   si->mutations_done.broken()
   si is freed
   msg2 got exception
   si is used again
The issue was introduced in dc50ce0ce5 (streaming: Make the mutation
readers when streaming starts), which is master only; branch 1.5 is not
affected.
2016-12-12 09:04:21 +08:00
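The use-after-free fix described in point 4 of 1987264beb relies on every in-flight send holding its own reference to 'si'. The sketch below shows that pattern with std::shared_ptr standing in for lw_shared_ptr. All names (send_info fields, start_sends) are hypothetical, and plain callbacks stand in for the rpc continuations.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical, heavily simplified stand-in for the real send_info.
struct send_info {
    int errors_seen = 0;
};

// Starts `count` simulated sends. Each error handler captures its own
// copy of the shared pointer, so send_info stays alive until the last
// handler has been destroyed, no matter which handler runs first.
std::vector<std::function<void()>> start_sends(std::shared_ptr<send_info> si,
                                               int count) {
    assert(si && count >= 0);
    std::vector<std::function<void()>> on_error;
    for (int i = 0; i < count; ++i) {
        on_error.push_back([si] { ++si->errors_seen; });
    }
    return on_error;
}

// Self-check: both handlers can run after the initiator dropped 'si'.
void example_shared_send_info() {
    auto si = std::make_shared<send_info>();
    std::weak_ptr<send_info> watch = si;
    auto handlers = start_sends(si, 2);
    si.reset();                       // initiator drops its reference
    assert(!watch.expired());         // handlers still keep it alive
    handlers[0]();                    // "msg1 got exception"
    handlers[1]();                    // "msg2 got exception": still safe
    assert(watch.lock()->errors_seen == 2);
    handlers.clear();                 // last references released
    assert(watch.expired());
}
```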