Commit Graph

473 Commits

Author SHA1 Message Date
Asias He
49a73aa2fc streaming: Move stream_mutation_fragments_cmd to a new file (#4812)
Avoid including the lengthy stream_session.hh in messaging_service.

More importantly, fix the build because currently messaging_service.cc
and messaging_service.hh does not include stream_mutation_fragments_cmd.
I am not sure why it builds on my machine. Spotted this when backporting
the "streaming: Send error code from the sender to receiver" to 3.0
branch.

Refs: #4789
2019-08-07 14:59:46 +02:00
Asias He
288371ce75 streaming: Do not call rpc stream flush in send_mutation_fragments
The stream close() guarantees the data sent will be flushed. No need to
call the stream flush() since the stream is not reused.

Follow up fix for commit bac987e32a (streaming: Send error code from
the sender to receiver).

Refs #4789
2019-08-07 14:31:17 +02:00
Asias He
bac987e32a streaming: Send error code from the sender to receiver
In case of error on the sender side, the sender does not propagate the
error to the receiver. The sender will close the stream. As a result,
the receiver will get nullopt from the source in
get_next_mutation_fragment and pass mutation_fragment_opt with no value
to the generating_reader. In turn, the generating_reader generates end
of stream. However, the last element that the generating_reader has
generated can be any type of mutation_fragment. This makes the sstable
that consumes the generating_reader violates the mutation_fragment
stream rule.

To fix, we need to propagate the error. However RPC streaming does not
support propagate the error in the framework. User has to send an error
code explicitly.

Fixes: #4789
2019-08-06 16:54:56 +02:00
Asias He
64a4c0ede2 streaming: Do not open rpc stream connection if ranges are not relevant to a shard
Given a list of ranges to stream, stream_transfer_task will create an
reader with the ranges and create a rpc stream connection on all the shards.

When user provides ranges to repair with -st -et options, e.g.,
using scylla-manger, such ranges can belong to only one shard, repair
will pass such ranges to streaming.

As a result, only one shard will have data to send while the rpc stream
connections are created on all the shards, which can cause the kernel
run out of ports in some systems.

To mitigate the problem, do not open the connection if the ranges do not
belong to the shard at all.

Refs: #4708
2019-07-18 18:31:21 +03:00
Botond Dénes
12b8405720 streaming,repair: restore indentation
Deferred from the previous two patches.
2019-06-26 18:45:36 +03:00
Botond Dénes
9c2407573c streaming: pass the data stream through the compaction strategy's interposer consumer 2019-06-26 18:45:36 +03:00
Botond Dénes
2693f1838a Introduce mutation_writer namespace
Currently there is a single mutation_writer: `multishard_writer`,
however in the next path we are going to add another one. This is the
right moment to move these into a common namespace (and folder), we
have way too much stuff scattered already in the top-level namespace
(and folder).
Also rename `tests/multishard_writer_test.cc` to
`tests/mutation_writer_test.cc`, this test-suite will be the home of all
the different mutation writer's unit test cases.
2019-06-26 15:45:59 +03:00
Asias He
f212dfb887 streaming: Reject stream if the _sys_dist_ks or _view_update_generator are not ready
They are of type db::system_distributed_keyspace and
db::view::view_update_generator.

n1 is in normal status
n2 boots up and _sys_dist_ks or _view_update_generator are not
initialized
n1 runs stream, n2 is the follower.
n2 uses the _sys_dist_ks or _view_update_generator
"Assertion `local_is_initialized()' failed" is observed

Fixes #4360

Message-Id: <4ae13e1640ac8707a9ba0503a2744f6faf89ecf4.1554330030.git.asias@scylladb.com>
2019-04-04 10:48:00 +03:00
Benny Halevy
223e1af521 sstables: provide large_data_handler to constructor
And use it for writing the sstable and/or when deleting it.

Refs #4198

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-03-26 16:24:19 +02:00
Asias He
b8158dd65d streaming: Get rid of the keep alive timer in streaming
There is no guarantee that rpc streaming makes progress in some time
period. Remove the keep alive timer in streaming to avoid killing the
session when the rpc streaming is just slow.

The keep alive timer is used to close the session in the following case:

n2 (the rpc streaming sender) streams to n1 (the rpc streaming receiver)
kill -9 n2

We need this because we do not kill the session when gossip think a node
is down, because we think the node down might only be temporary
and it is a waste to drop the previous work that has done especially
when the stream session takes long time.

Since in range_streamer, we do not stream all data in a single stream
session, we stream 10% of the data per time, and we have retry logic.
I think it is fine to kill a stream session when gossip thinks a node is
down. This patch changes to close all stream session with the node that
gossip think it is down.
Message-Id: <bdbb9486a533eee25fcaf4a23a946629ba946537.1551773823.git.asias@scylladb.com>
2019-03-12 12:20:28 +01:00
Rafael Ávila de Espíndola
625080b414 Rename large_partition_handler
Now that it also handles large rows, rename it to large_data_handler.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-01-28 15:03:14 -08:00
Piotr Jastrzebski
1ac7283550 Fix cross shard cf usage in streaming
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-01-24 18:13:30 +01:00
Duarte Nunes
04a14b27e4 Merge 'Add handling staging sstables to /upload dir' from Piotr
"
This series adds generating view updates from sstables added through
/upload directory if their tables have accompanying materialized views.
Said sstables are left in /upload directory until updates are generated
from them and are treated just like staging sstables from /staging dir.
If there are no views for a given tables, sstables are simply moved
from /upload dir to datadir without any changes.

Tests: unit (release)
"

* 'add_handling_staging_sstables_to_upload_dir_5' of https://github.com/psarna/scylla:
  all: rename view_update_from_staging_generator
  distributed_loader: fix indentation
  service: add generating view updates from uploaded sstables
  init: pass view update generator to storage service
  sstables: treat sstables in upload dir as needing view build
  sstables,table: rename is_staging to requires_view_building
  distributed_loader: use proper directory for opening SSTable
  db,view: make throttling optional for view_update_generator
2019-01-15 18:19:27 +00:00
Piotr Sarna
0eb703dc80 all: rename view_update_from_staging_generator
The new name, view_update_generator, is both more concise
and correct, since we now generate from directories
other than "/staging".
2019-01-15 17:31:47 +01:00
Piotr Sarna
7e61f02365 streaming: add phasing incoming streams
Incoming streams are now phased, which can be leveraged later
to wait for all ongoing streams to finish.

Refs #4032
2019-01-15 10:28:15 +01:00
Duarte Nunes
fa2b0384d2 Replace std::experimental types with C++17 std version.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.

Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.

Scylla now requires GCC 8 to compile.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
2019-01-08 13:16:36 +02:00
Avi Kivity
f02c64cadf streaming: stream_session: remove include of db/view/view_update_from_staging_generator.hh
This header, which is easily replaced with a forward declaration,
introduces a dependency on database.hh everywhere. Remove it and scatter
includes of database.hh in source files that really need it.
2019-01-05 17:33:25 +02:00
Piotr Sarna
9d46715613 streaming,view: move view update checks to separate file
Checking if view update path should be used for sstables
is going to be reused in row level repair code,
so relevant functions are moved to a separate header.
2019-01-03 08:31:40 +01:00
Asias He
d90836a2d3 streaming: Make total_incoming_bytes and total_outgoing_bytes metrics monotonic
Currently, they increases and decreases as the stream sessions are
created and destroyed. Make them prometheus monotonically increasing
counter for easier monitoring.

Message-Id: <7c07cea25a59a09377292dc8f64ed33ff12eda87.1545959905.git.asias@scylladb.com>
2018-12-30 16:52:17 +02:00
Duarte Nunes
bab7e6877b streaming/stream_session: Only stage sstables for tables with views
When streaming, sstables for which we need to generate view updates
are placed in a special staging directory. However, we only need to do
this for tables that actually have views.

Refs #4021
Message-Id: <20181227215412.5632-1-duarte@scylladb.com>
2018-12-28 18:32:24 +02:00
Gleb Natapov
37b4043677 streaming: always read from rpc::source until end-of-stream during mutation sending
rpc::source cannot be abandoned until EOS is reached, but current code
does not obey it if error code is received, it throws exception instead that
aborts the reading loop. Fix it by moving exception throwing out of the
loop.

Fixes: #4025

Message-Id: <20181227135051.GC29458@scylladb.com>
2018-12-27 16:50:53 +02:00
Duarte Nunes
66e45469b2 streaming/stream_session: Don't use table reference across defer points
When creating a sstable from which to generate view updates, we held
on to a table reference across defer points. In case there's a
concurrent schema drop, the table object might be destroyed and we
will incur in a use-after-free. Solve this by holding on to a shared
pointer and pinning the table object.

Refs #4021

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181227105921.3601-1-duarte@scylladb.com>
2018-12-27 13:05:46 +02:00
Avi Kivity
c96fc1d585 Merge "Introduce row level repair" from Asias
"
=== How the the partition level repair works

- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.

=== Major problems with partition level repair

- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.

- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice

=== Row level repair

Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch

=== How the row level repair works

- To solve the problem of reading data twice

Read the data only once for both checksum and synchronization between nodes.

We work on a small range which contains only a few mega bytes of rows,
We read all the rows within the small range into memory. Find the
mismatch and send the mismatch rows between peers.

We need to find a sync boundary among the nodes which contains only N bytes of
rows.

- To solve the problem of sending unnecessary data.

We need to find the mismatched rows between nodes and only send the delta.
The problem is called set reconciliation problem which is a common problem in
distributed systems.

For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = {      row2, row3}
Node3 has set3 = {row1, row2, row4}

To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.

=== How to implement repair with set reconciliation

- Step A: Negotiate sync boundary

class repair_sync_boundary {
    dht::decorated_key pk;
    position_in_partition position
}

Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.

- Step B: Get missing rows from peer nodes so that repair master contains all the rows

Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.

At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.

Now, local node contains all the rows.

- Step C: Send missing rows to the peer nodes

Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.

=== How the RPC API looks like

- repair_range_start()

Step A:
- request_sync_boundary()

Step B:
- request_combined_row_hashes()
- reqeust_full_row_hashes()
- request_row_diff()

Step C:
- send_row_diff()

- repair_range_stop()

=== Performance evaluation

We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.

1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.

Time to repair:
   old = 87 min
   new = 70 min (rebuild took 50 minutes)
   improvement = 19.54%

2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
   old = 43 min
   new = 24 min
   improvement = 44.18%

3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.

Time to repair:
   old: 211 min
   new: 44 min
   improvement: 79.15%

Bytes sent on wire for repair:
   old: tx= 162 GiB,  rx = 90 GiB
   new: tx= 1.15 GiB, tx = 0.57 GiB
   improvement: tx = 99.29%, rx = 99.36%

It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.

In this test case, repair master needs to receives 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows from repair master and 1 million rows
received from repair slave 1 to repair slave 2. Repair master sends
sends 1 million rows from repair master and 1 million rows received from
repair slave 2 to repair slave 1.

In the result, we saw the rows on wire were as expected.

tx_row_nr  = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr  =  500233 + 500235 +  499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000

Fixes: #3033

Tests: dtests/repair_additional_test.py
"

* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
  repair: Enable row level repair
  repair: Add row_level_repair
  repair: Add docs for row level repair
  repair: Add repair_init_messaging_service_handler
  repair: Add repair_meta
  repair: Add repair_writer
  repair: Add repair_reader
  repair: Add repair_row
  repair: Add fragment_hasher
  repair: Add decorated_key_with_hash
  repair: Add get_random_seed
  repair: Add get_common_diff_detect_algorithm
  repair: Add shard_config
  repair: Add suportted_diff_detect_algorithms
  repair: Add repair_stats to repair_info
  repair: Introduce repair_stats
  flat_mutation_reader:  Add make_generating_reader
  storage_service: Introduce ROW_LEVEL_REPAIR feature
  messaging_service: Add RPC verbs for row level repair
  repair: Export the repair logger
  ...
2018-12-25 13:13:00 +02:00
Gleb Natapov
393269d34b streaming: hold to sink while close() is running and call close on error as well
Currently if something throws while streaming in mutation sending loop
sink is not closed. Also when close() is running the code does not hold
onto sink object. close() is async, so sink should be kept alive until
it completes. The patch uses do_with() to hold onto sink while close is
running and run close() on error path too.

Fixes #4004.

Message-Id: <20181220155931.GL3075@scylladb.com>
2018-12-20 18:03:37 +02:00
Asias He
bcba6b4f4d streaming: Futurize estimate_partitions
The loop can take a long time if the number of sstables and/or ranges
are large. To fix, futurize the loop.

Fixes: #4005

Message-Id: <3b05cb84f3f57cc566702142c6365a04b075018e.1545290730.git.asias@scylladb.com>
2018-12-20 12:08:03 +02:00
Asias He
0067d32b47 flat_mutation_reader: Add make_generating_reader
Move generating_reader from stream_session.cc to flat_mutation_reader.cc.
It will be used by repair code soon.

Also introduce a helper make_generating_reader to hide the
implementation of generating_reader.
2018-12-12 16:49:01 +08:00
Piotr Sarna
8e6021dfa1 streaming: don't check view building of system tables
System tables will never need view building, and, what's more,
are actually used in the process of view build checking.
So, checking whether system tables need a view update path
is simplified to returning 'false'.
2018-11-28 09:21:56 +01:00
Piotr Sarna
6ad2c39f88 streaming: remove unused sstable_is_staging bool class
sstable_is_staging bool class is not used anywhere in the code anymore,
so it's removed.
2018-11-28 09:21:56 +01:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Piotr Sarna
32c0fe8df2 streaming: stream tables with views through staging sstables
While streaming to a table with paired views, staging sstables
are used. After the table is written to disk, it's used to generate
all required view updates. It's also resistant to restarts as it's
stored on a hard drive in staging/ directory.

Refs #3275
2018-11-13 15:04:42 +01:00
Piotr Sarna
dc74887ff3 streaming: add system distributed keyspace ref to streaming
Streaming code needs system distributed keyspace to check if streamed
sstables should be staging, so a proper reference is added.
2018-11-13 15:01:53 +01:00
Piotr Sarna
7ef5e1b685 streaming: add view update generator reference to streaming
Streaming code may need view update generator service to generate
and send view updates, so a proper reference is added.
2018-11-13 15:01:53 +01:00
Avi Kivity
fd513c42ad streaming: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Avi Kivity
8501e2a45d streaming: progress_info: fix format string
We try to escape % as \%, but the correct escape is %%.
2018-11-01 13:16:17 +00:00
Asias He
7f826d3343 streaming: Expose reason for streaming
On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to notify the receiver what
the streaming is used for. So the receiver can decide further operation,
e.g., send view updates, beyond applying the streaming data on disk.

Fixes #3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>
2018-10-15 22:03:28 +01:00
Gleb Natapov
ceb361544a stream_session: remove unused capture
'Consumer function' parameter for distribute_reader_and_consume_on_shards()
captures schema_ptr (which is a seastar::shared_ptr), but the function
is later copied on another shard at which point schema_ptr is also copied
and its counter is incremented by the wrong shard. The capture is not
even used, so lets just drop it.

Fixes #3838

Message-Id: <20181011075500.GN14449@scylladb.com>
2018-10-11 11:10:58 +03:00
Botond Dénes
eb357a385d flat_mutation_reader: make timeout opt-out rather than opt-in
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewrs of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.

To make order in this chaos make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.

There were suprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the wast majority of the changes) I just used
`db::no_timeout` everywhere.

Tests: unit(release, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
2018-09-20 11:31:24 +02:00
Asias He
de05df216f streaming: Use rpc::source on the shard where it is created
rpc::source can only work on the shard where it is created, thus we can
not apply the load distribution optimization. Disable it and let the
multishard_writer to forward the data to the correct shard.

Fixes #3731.

Message-Id: <0d1b4d3e7adcfdc4e392b83aeb2544b95f3f46dd.1537430162.git.asias@scylladb.com>
2018-09-20 12:29:24 +03:00
Duarte Nunes
a025bf6a7d Merge seastar upstream
Seastar introduced a "compat" namespace, which conflicts with Scylla's
own "compat" namespaces. The merge thus includes changes to scope
uses of Scylla's "compat" namespaces.

* seastar 8ad870f...9bb1611  (5):
  > util/variant_utils: Ensure variant_cast behaves well with rvalues
  > util/std-compat: Fix infinite recursion
  > doc/tutorial: Undo namespace changes
  > util/variant_utils: Add cast_variant()
  > Add compatbility with C++17's library types

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-14 13:07:09 +01:00
Asias He
d47d46e1a8 streaming: Use streaming_write_priority for the sstable writer
Use the streaming io priority otherwise it uses the default io priority.

Message-Id: <e1836a9a93e7204d4bc9bba9c841d57c8b24aff8.1533715438.git.asias@scylladb.com>
2018-08-08 11:08:06 +03:00
Asias He
deff5e7d60 streaming: Add rpc streaming support
This patch changes scylla streaming to use the recently added rpc
streaming feature provided by seastar to send mutation fragments for
scylla streaming instead of the rpc verbs.

It also changes the receiver to write to the sstable file directly,
skipping writing to memtable.
2018-07-13 08:36:47 +08:00
Asias He
faa6769cdb streaming: Add estimate_partitions to send_info
The sender needs to estimate the number of partitions to send, because
the receiver needs this to prepare the sstables.
2018-07-13 08:36:46 +08:00
Asias He
db8c3a7059 streaming: Do not use dht::split_ranges_to_shards
There is no need to call dht::split_ranges_to_shards to split the token
range into <shard> : <a lot of small ranges> mapping and create a flat
mutation reader with a lot of small ranges.

Because:

1) The flat mutation reader on each shard only returns data belongs to
this local shard, there is no correctness issue if we do not split and
feed the sub ranges only belongs to this local shard.

2) With murmur3_partitioner_ignore_msb_bits = 12, it is almost certain
that given a token range, all the shards will have data for the range
anyway. Even if we ask all the shards to work on the token range and
some of the shards have no data for it, it is fine. We simply send no
data from this shard.

Tests: update_cluster_layout_tests.py

Message-Id: <ac00cd21d6156c47b74451dd415d627481e48212.1526864222.git.asias@scylladb.com>
2018-05-21 10:42:45 +03:00
Asias He
e20038eb84 streaming: Handle stream_mutation rpc handler on all shards
In streaming, the sender sends the mutations on all the local shards in
parallel, it is possible that the receiver handle more than one such
connection on the same shard. It is determined by where the tcp
connection goes. Current rpc ignores the dest shard id when sending the
rpc message.

For instance, say node1 has 2 shards, node2 has 2 shards. Currently, we
can end up with like this:

   Node 1 shard 0 -> Node 2 shard 1
   Node 1 shard 1 -> Node 2 shard 1

It is better if we do:

   Node 1 shard 0 -> Node 2 shard 0
   Node 1 shard 1 -> Node 2 shard 1

This patch solves this problem by let the handler always handle on
shard = src_cpu_id % smp::count.

If sender and receiver have the same shard config, it is completely
distributed the work evenly.

If sender and receiver do not have the same shard config, it is
unavoidable some of the shard will do more work than the others.

Tests: dtest update_cluster_layout_tests.py

Message-Id: <911827bcf67459a07ec92623a9ed4c4fbba195ca.1524622375.git.asias@scylladb.com>
2018-05-19 21:08:25 +03:00
Asias He
ad7b132188 Revert "streaming: Do not abort session too early in idle detection"
This reverts commit f792c78c96.

With the "Use range_streamer everywhere" (7217b7ab36) series,
all the user of streaming now do streaming with relative small ranges
and can retry streaming at higher level.

Reduce the time-to-recover from 5 hours to 10 minutes per stream session.

Even if the 10 minutes idle detection might cause higher false positive,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with
whenever the stream initiator goes away, the stream slave goes away.

Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
2018-03-14 10:11:00 +02:00
Asias He
774307b3a7 streaming: Do send failed message for uninitialized session
The uninitialized session has no peer associated with it yet. There is
no point sending the failed message when abort the session. Sending the
failed message in this case will send to a peer with uninitialized
dst_cpu_id which will casue the receiver to pass a bogus shard id to
smp::submit_to which cases segfault.

In addition, to be safe, initialize the dst_cpu_id to zero. So that
uninitialized session will send message to shard zero instead of random
bogus shard id.

Fixes the segfault issue found by
repair_additional_test.py:RepairAdditionalTest.repair_abort_test

Fixes #3115
Message-Id: <9f0f7b44c7d6d8f5c60d6293ab2435dadc3496a9.1515380325.git.asias@scylladb.com>
2018-01-08 15:04:06 +02:00
Raphael S. Carvalho
95d1995876 fix compilation of stream_session.cc
stream_session.cc:417:62: error: cannot call member function ‘utils::UUID streaming::stream_session::plan_id()’ without object
         sslog.warn("[Stream #{}] Failed to send: {}", plan_id(), ep);

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171214022621.19442-1-raphaelsc@scylladb.com>
2017-12-14 10:57:33 +01:00
Asias He
a9dab60b6c streaming: One cf per time on sender
In the case there are large number of column families, the sender will
send all the column families in parallel. We allow 20% of shard memory
for streaming on the receiver, so each column family will have 1/N, N is
the number of in-flight column families, memory for memtable. Large N
causes a lot of small sstables to be generated.

It is possible there are multiple senders to a single receiver, e.g.,
when a new node joins the cluster, the maximum in-flight column families
is number of peer node. The column families are sent in the order of
cf_id. It is not guaranteed that all peers has the same speed so they
are sending the same cf_id at the same time, though. We still have
chance some of the peers are sending the same cf_id.

Fixes #3065

Message-Id: <46961463c2a5e4f1faff232294dc485ac4f1a04e.1513159678.git.asias@scylladb.com>
2017-12-13 12:32:41 +02:00
Paweł Dziepak
dca93bea23 db: convert make_streaming_reader() to flat_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
6f1e0d3ed8 stream_transfer_task: switch to flat_mutation_reader 2017-11-13 16:49:52 +00:00