Files
scylladb/flat_mutation_reader.cc
Avi Kivity c96fc1d585 Merge "Introduce row level repair" from Asias
"
=== How the the partition level repair works

- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.

=== Major problems with partition level repair

- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.

- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice

=== Row level repair

Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch

=== How the row level repair works

- To solve the problem of reading data twice

Read the data only once for both checksum and synchronization between nodes.

We work on a small range which contains only a few mega bytes of rows,
We read all the rows within the small range into memory. Find the
mismatch and send the mismatch rows between peers.

We need to find a sync boundary among the nodes which contains only N bytes of
rows.

- To solve the problem of sending unnecessary data.

We need to find the mismatched rows between nodes and only send the delta.
The problem is called set reconciliation problem which is a common problem in
distributed systems.

For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = {      row2, row3}
Node3 has set3 = {row1, row2, row4}

To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.

=== How to implement repair with set reconciliation

- Step A: Negotiate sync boundary

class repair_sync_boundary {
    dht::decorated_key pk;
    position_in_partition position
}

Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.

- Step B: Get missing rows from peer nodes so that repair master contains all the rows

Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.

At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.

Now, local node contains all the rows.

- Step C: Send missing rows to the peer nodes

Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.

=== How the RPC API looks like

- repair_range_start()

Step A:
- request_sync_boundary()

Step B:
- request_combined_row_hashes()
- reqeust_full_row_hashes()
- request_row_diff()

Step C:
- send_row_diff()

- repair_range_stop()

=== Performance evaluation

We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.

1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.

Time to repair:
   old = 87 min
   new = 70 min (rebuild took 50 minutes)
   improvement = 19.54%

2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
   old = 43 min
   new = 24 min
   improvement = 44.18%

3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.

Time to repair:
   old: 211 min
   new: 44 min
   improvement: 79.15%

Bytes sent on wire for repair:
   old: tx= 162 GiB,  rx = 90 GiB
   new: tx= 1.15 GiB, tx = 0.57 GiB
   improvement: tx = 99.29%, rx = 99.36%

It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.

In this test case, repair master needs to receives 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows from repair master and 1 million rows
received from repair slave 1 to repair slave 2. Repair master sends
sends 1 million rows from repair master and 1 million rows received from
repair slave 2 to repair slave 1.

In the result, we saw the rows on wire were as expected.

tx_row_nr  = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr  =  500233 + 500235 +  499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000

Fixes: #3033

Tests: dtests/repair_additional_test.py
"

* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
  repair: Enable row level repair
  repair: Add row_level_repair
  repair: Add docs for row level repair
  repair: Add repair_init_messaging_service_handler
  repair: Add repair_meta
  repair: Add repair_writer
  repair: Add repair_reader
  repair: Add repair_row
  repair: Add fragment_hasher
  repair: Add decorated_key_with_hash
  repair: Add get_random_seed
  repair: Add get_common_diff_detect_algorithm
  repair: Add shard_config
  repair: Add suportted_diff_detect_algorithms
  repair: Add repair_stats to repair_info
  repair: Introduce repair_stats
  flat_mutation_reader:  Add make_generating_reader
  storage_service: Introduce ROW_LEVEL_REPAIR feature
  messaging_service: Add RPC verbs for row level repair
  repair: Export the repair logger
  ...
2018-12-25 13:13:00 +02:00

852 lines
35 KiB
C++

/*
* Copyright (C) 2017 ScyllaDB
*/
/*
* This file is part of Scylla.
*
* Scylla is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Scylla is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Scylla. If not, see <http://www.gnu.org/licenses/>.
*/
#include "flat_mutation_reader.hh"
#include "mutation_reader.hh"
#include "seastar/util/reference_wrapper.hh"
#include <algorithm>
#include <boost/range/adaptor/transformed.hpp>
#include <seastar/util/defer.hh>
static size_t compute_buffer_size(const schema& s, circular_buffer<mutation_fragment>& buffer)
{
return boost::accumulate(
buffer
| boost::adaptors::transformed([&s] (const mutation_fragment& mf) {
return mf.memory_usage(s);
}), size_t(0)
);
}
void flat_mutation_reader::impl::forward_buffer_to(const position_in_partition& pos) {
_buffer.erase(std::remove_if(_buffer.begin(), _buffer.end(), [this, &pos] (mutation_fragment& f) {
return !f.relevant_for_range_assuming_after(*_schema, pos);
}), _buffer.end());
_buffer_size = compute_buffer_size(*_schema, _buffer);
}
void flat_mutation_reader::impl::clear_buffer_to_next_partition() {
auto next_partition_start = std::find_if(_buffer.begin(), _buffer.end(), [] (const mutation_fragment& mf) {
return mf.is_partition_start();
});
_buffer.erase(_buffer.begin(), next_partition_start);
_buffer_size = compute_buffer_size(*_schema, _buffer);
}
flat_mutation_reader flat_mutation_reader::impl::reverse_partitions(flat_mutation_reader::impl& original) {
// FIXME: #1413 Full partitions get accumulated in memory.
class partition_reversing_mutation_reader final : public flat_mutation_reader::impl {
flat_mutation_reader::impl* _source;
range_tombstone_list _range_tombstones;
std::stack<mutation_fragment> _mutation_fragments;
mutation_fragment_opt _partition_end;
private:
stop_iteration emit_partition() {
auto emit_range_tombstone = [&] {
auto it = std::prev(_range_tombstones.tombstones().end());
auto& rt = *it;
_range_tombstones.tombstones().erase(it);
auto rt_owner = alloc_strategy_unique_ptr<range_tombstone>(&rt);
push_mutation_fragment(mutation_fragment(std::move(rt)));
};
position_in_partition::less_compare cmp(*_source->_schema);
while (!_mutation_fragments.empty() && !is_buffer_full()) {
auto& mf = _mutation_fragments.top();
if (!_range_tombstones.empty() && !cmp(_range_tombstones.tombstones().rbegin()->end_position(), mf.position())) {
emit_range_tombstone();
} else {
push_mutation_fragment(std::move(mf));
_mutation_fragments.pop();
}
}
while (!_range_tombstones.empty() && !is_buffer_full()) {
emit_range_tombstone();
}
if (is_buffer_full()) {
return stop_iteration::yes;
}
push_mutation_fragment(std::move(*std::exchange(_partition_end, stdx::nullopt)));
return stop_iteration::no;
}
future<stop_iteration> consume_partition_from_source(db::timeout_clock::time_point timeout) {
if (_source->is_buffer_empty()) {
if (_source->is_end_of_stream()) {
_end_of_stream = true;
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
return _source->fill_buffer(timeout).then([] { return stop_iteration::no; });
}
while (!_source->is_buffer_empty() && !is_buffer_full()) {
auto mf = _source->pop_mutation_fragment();
if (mf.is_partition_start() || mf.is_static_row()) {
push_mutation_fragment(std::move(mf));
} else if (mf.is_end_of_partition()) {
_partition_end = std::move(mf);
if (emit_partition()) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
} else if (mf.is_range_tombstone()) {
_range_tombstones.apply(*_source->_schema, std::move(mf.as_range_tombstone()));
} else {
_mutation_fragments.emplace(std::move(mf));
}
}
return make_ready_future<stop_iteration>(is_buffer_full());
}
public:
explicit partition_reversing_mutation_reader(flat_mutation_reader::impl& mr)
: flat_mutation_reader::impl(mr._schema)
, _source(&mr)
, _range_tombstones(*mr._schema)
{ }
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override {
return repeat([&, timeout] {
if (_partition_end) {
// We have consumed full partition from source, now it is
// time to emit it.
auto stop = emit_partition();
if (stop) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
}
return consume_partition_from_source(timeout);
});
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty() && !is_end_of_stream()) {
while (!_mutation_fragments.empty()) {
_mutation_fragments.pop();
}
_range_tombstones.clear();
_partition_end = stdx::nullopt;
_source->next_partition();
}
}
virtual future<> fast_forward_to(const dht::partition_range&, db::timeout_clock::time_point) override {
throw std::bad_function_call();
}
virtual future<> fast_forward_to(position_range, db::timeout_clock::time_point) override {
throw std::bad_function_call();
}
virtual size_t buffer_size() const override {
return flat_mutation_reader::impl::buffer_size() + _source->buffer_size();
}
};
return make_flat_mutation_reader<partition_reversing_mutation_reader>(original);
}
template<typename Source>
future<bool> flat_mutation_reader::impl::fill_buffer_from(Source& source, db::timeout_clock::time_point timeout) {
if (source.is_buffer_empty()) {
if (source.is_end_of_stream()) {
return make_ready_future<bool>(true);
}
return source.fill_buffer(timeout).then([this, &source, timeout] {
return fill_buffer_from(source, timeout);
});
} else {
while (!source.is_buffer_empty() && !is_buffer_full()) {
push_mutation_fragment(source.pop_mutation_fragment());
}
return make_ready_future<bool>(source.is_end_of_stream() && source.is_buffer_empty());
}
}
template future<bool> flat_mutation_reader::impl::fill_buffer_from<flat_mutation_reader>(flat_mutation_reader&, db::timeout_clock::time_point);
flat_mutation_reader& to_reference(reference_wrapper<flat_mutation_reader>& wrapper) {
return wrapper.get();
}
flat_mutation_reader make_delegating_reader(flat_mutation_reader& r) {
return make_flat_mutation_reader<delegating_reader<reference_wrapper<flat_mutation_reader>>>(ref(r));
}
flat_mutation_reader make_forwardable(flat_mutation_reader m) {
class reader : public flat_mutation_reader::impl {
flat_mutation_reader _underlying;
position_range _current;
mutation_fragment_opt _next;
// When resolves, _next is engaged or _end_of_stream is set.
future<> ensure_next(db::timeout_clock::time_point timeout) {
if (_next) {
return make_ready_future<>();
}
return _underlying(timeout).then([this] (auto&& mfo) {
_next = std::move(mfo);
if (!_next) {
_end_of_stream = true;
}
});
}
public:
reader(flat_mutation_reader r) : impl(r.schema()), _underlying(std::move(r)), _current({
position_in_partition(position_in_partition::partition_start_tag_t()),
position_in_partition(position_in_partition::after_static_row_tag_t())
}) { }
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override {
return repeat([this, timeout] {
if (is_buffer_full()) {
return make_ready_future<stop_iteration>(stop_iteration::yes);
}
return ensure_next(timeout).then([this] {
if (is_end_of_stream()) {
return stop_iteration::yes;
}
position_in_partition::less_compare cmp(*_schema);
if (!cmp(_next->position(), _current.end())) {
_end_of_stream = true;
// keep _next, it may be relevant for next range
return stop_iteration::yes;
}
if (_next->relevant_for_range(*_schema, _current.start())) {
push_mutation_fragment(std::move(*_next));
}
_next = {};
return stop_iteration::no;
});
});
}
virtual future<> fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) override {
_current = std::move(pr);
_end_of_stream = false;
forward_buffer_to(_current.start());
return make_ready_future<>();
}
virtual void next_partition() override {
_end_of_stream = false;
if (!_next || !_next->is_partition_start()) {
_underlying.next_partition();
_next = {};
}
clear_buffer_to_next_partition();
_current = {
position_in_partition(position_in_partition::partition_start_tag_t()),
position_in_partition(position_in_partition::after_static_row_tag_t())
};
}
virtual future<> fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) override {
_end_of_stream = false;
clear_buffer();
_next = {};
_current = {
position_in_partition(position_in_partition::partition_start_tag_t()),
position_in_partition(position_in_partition::after_static_row_tag_t())
};
return _underlying.fast_forward_to(pr, timeout);
}
virtual size_t buffer_size() const override {
return flat_mutation_reader::impl::buffer_size() + _underlying.buffer_size();
}
};
return make_flat_mutation_reader<reader>(std::move(m));
}
flat_mutation_reader make_nonforwardable(flat_mutation_reader r, bool single_partition) {
class reader : public flat_mutation_reader::impl {
flat_mutation_reader _underlying;
bool _single_partition;
bool _static_row_done = false;
bool is_end_end_of_underlying_stream() const {
return _underlying.is_buffer_empty() && _underlying.is_end_of_stream();
}
future<> on_end_of_underlying_stream(db::timeout_clock::time_point timeout) {
if (!_static_row_done) {
_static_row_done = true;
return _underlying.fast_forward_to(position_range::all_clustered_rows(), timeout);
}
push_mutation_fragment(partition_end());
if (_single_partition) {
_end_of_stream = true;
return make_ready_future<>();
}
_underlying.next_partition();
_static_row_done = false;
return _underlying.fill_buffer(timeout).then([this] {
_end_of_stream = is_end_end_of_underlying_stream();
});
}
public:
reader(flat_mutation_reader r, bool single_partition)
: impl(r.schema())
, _underlying(std::move(r))
, _single_partition(single_partition)
{ }
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override {
return do_until([this] { return is_end_of_stream() || is_buffer_full(); }, [this, timeout] {
return fill_buffer_from(_underlying, timeout).then([this, timeout] (bool underlying_finished) {
if (underlying_finished) {
return on_end_of_underlying_stream(timeout);
}
return make_ready_future<>();
});
});
}
virtual future<> fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) override {
throw std::bad_function_call();
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
_underlying.next_partition();
}
_end_of_stream = is_end_end_of_underlying_stream();
}
virtual future<> fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) override {
_end_of_stream = false;
clear_buffer();
return _underlying.fast_forward_to(pr, timeout);
}
virtual size_t buffer_size() const override {
return flat_mutation_reader::impl::buffer_size() + _underlying.buffer_size();
}
};
return make_flat_mutation_reader<reader>(std::move(r), single_partition);
}
class empty_flat_reader final : public flat_mutation_reader::impl {
public:
empty_flat_reader(schema_ptr s) : impl(std::move(s)) { _end_of_stream = true; }
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override { return make_ready_future<>(); }
virtual void next_partition() override {}
virtual future<> fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) override { return make_ready_future<>(); };
virtual future<> fast_forward_to(position_range cr, db::timeout_clock::time_point timeout) override { return make_ready_future<>(); };
};
flat_mutation_reader make_empty_flat_reader(schema_ptr s) {
return make_flat_mutation_reader<empty_flat_reader>(std::move(s));
}
flat_mutation_reader
flat_mutation_reader_from_mutations(std::vector<mutation> ms,
const query::partition_slice& slice,
streamed_mutation::forwarding fwd) {
std::vector<mutation> sliced_ms;
for (auto& m : ms) {
auto ck_ranges = query::clustering_key_filter_ranges::get_ranges(*m.schema(), slice, m.key());
auto mp = mutation_partition(std::move(m.partition()), *m.schema(), std::move(ck_ranges));
sliced_ms.emplace_back(m.schema(), m.decorated_key(), std::move(mp));
}
return flat_mutation_reader_from_mutations(sliced_ms, query::full_partition_range, fwd);
}
flat_mutation_reader
flat_mutation_reader_from_mutations(std::vector<mutation> mutations, const dht::partition_range& pr, streamed_mutation::forwarding fwd) {
class reader final : public flat_mutation_reader::impl {
std::vector<mutation> _mutations;
std::vector<mutation>::iterator _cur;
std::vector<mutation>::iterator _end;
position_in_partition::less_compare _cmp;
bool _static_row_done = false;
mutation_fragment_opt _rt;
mutation_fragment_opt _cr;
private:
void prepare_next_clustering_row() {
auto& crs = _cur->partition().clustered_rows();
while (true) {
auto re = crs.unlink_leftmost_without_rebalance();
if (!re) {
break;
}
auto re_deleter = defer([re] { current_deleter<rows_entry>()(re); });
if (!re->dummy()) {
_cr = mutation_fragment(std::move(*re));
break;
}
}
}
void prepare_next_range_tombstone() {
auto& rts = _cur->partition().row_tombstones().tombstones();
auto rt = rts.unlink_leftmost_without_rebalance();
if (rt) {
auto rt_deleter = defer([rt] { current_deleter<range_tombstone>()(rt); });
_rt = mutation_fragment(std::move(*rt));
}
}
mutation_fragment_opt read_next() {
if (_cr && (!_rt || _cmp(_cr->position(), _rt->position()))) {
auto cr = std::exchange(_cr, { });
prepare_next_clustering_row();
return cr;
} else if (_rt) {
auto rt = std::exchange(_rt, { });
prepare_next_range_tombstone();
return rt;
}
return { };
}
private:
void do_fill_buffer(db::timeout_clock::time_point timeout) {
while (!is_end_of_stream() && !is_buffer_full()) {
if (!_static_row_done) {
_static_row_done = true;
if (!_cur->partition().static_row().empty()) {
push_mutation_fragment(static_row(std::move(_cur->partition().static_row())));
}
}
auto mfopt = read_next();
if (mfopt) {
push_mutation_fragment(std::move(*mfopt));
} else {
push_mutation_fragment(partition_end());
++_cur;
if (_cur == _end) {
_end_of_stream = true;
} else {
start_new_partition();
}
}
}
}
void start_new_partition() {
_static_row_done = false;
push_mutation_fragment(partition_start(_cur->decorated_key(),
_cur->partition().partition_tombstone()));
prepare_next_clustering_row();
prepare_next_range_tombstone();
}
void destroy_current_mutation() {
auto &crs = _cur->partition().clustered_rows();
auto re = crs.unlink_leftmost_without_rebalance();
while (re) {
current_deleter<rows_entry>()(re);
re = crs.unlink_leftmost_without_rebalance();
}
auto &rts = _cur->partition().row_tombstones().tombstones();
auto rt = rts.unlink_leftmost_without_rebalance();
while (rt) {
current_deleter<range_tombstone>()(rt);
rt = rts.unlink_leftmost_without_rebalance();
}
}
struct cmp {
bool operator()(const mutation& m, const dht::ring_position& p) const {
return m.decorated_key().tri_compare(*m.schema(), p) < 0;
}
bool operator()(const dht::ring_position& p, const mutation& m) const {
return m.decorated_key().tri_compare(*m.schema(), p) > 0;
}
};
static std::vector<mutation>::iterator find_first_partition(std::vector<mutation>& ms, const dht::partition_range& pr) {
if (!pr.start()) {
return std::begin(ms);
}
if (pr.is_singular()) {
return std::lower_bound(std::begin(ms), std::end(ms), pr.start()->value(), cmp{});
} else {
if (pr.start()->is_inclusive()) {
return std::lower_bound(std::begin(ms), std::end(ms), pr.start()->value(), cmp{});
} else {
return std::upper_bound(std::begin(ms), std::end(ms), pr.start()->value(), cmp{});
}
}
}
static std::vector<mutation>::iterator find_last_partition(std::vector<mutation>& ms, const dht::partition_range& pr) {
if (!pr.end()) {
return std::end(ms);
}
if (pr.is_singular()) {
return std::upper_bound(std::begin(ms), std::end(ms), pr.start()->value(), cmp{});
} else {
if (pr.end()->is_inclusive()) {
return std::upper_bound(std::begin(ms), std::end(ms), pr.end()->value(), cmp{});
} else {
return std::lower_bound(std::begin(ms), std::end(ms), pr.end()->value(), cmp{});
}
}
}
public:
reader(schema_ptr s, std::vector<mutation>&& mutations, const dht::partition_range& pr)
: impl(s)
, _mutations(std::move(mutations))
, _cur(find_first_partition(_mutations, pr))
, _end(find_last_partition(_mutations, pr))
, _cmp(*s)
{
_end_of_stream = _cur == _end;
if (!_end_of_stream) {
auto mutation_destroyer = defer([this] { destroy_mutations(); });
start_new_partition();
do_fill_buffer(db::no_timeout);
mutation_destroyer.cancel();
}
}
void destroy_mutations() noexcept {
// After unlink_leftmost_without_rebalance() was called on a bi::set
// we need to complete destroying the tree using that function.
// clear_and_dispose() used by mutation_partition destructor won't
// work properly.
_cur = _mutations.begin();
while (_cur != _end) {
destroy_current_mutation();
++_cur;
}
}
~reader() {
destroy_mutations();
}
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override {
do_fill_buffer(timeout);
return make_ready_future<>();
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty() && !is_end_of_stream()) {
destroy_current_mutation();
++_cur;
if (_cur == _end) {
_end_of_stream = true;
} else {
start_new_partition();
}
}
}
virtual future<> fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) override {
clear_buffer();
_cur = find_first_partition(_mutations, pr);
_end = find_last_partition(_mutations, pr);
_static_row_done = false;
_cr = {};
_rt = {};
_end_of_stream = _cur == _end;
if (!_end_of_stream) {
start_new_partition();
}
return make_ready_future<>();
};
virtual future<> fast_forward_to(position_range cr, db::timeout_clock::time_point timeout) override {
throw std::runtime_error("This reader can't be fast forwarded to another position.");
};
};
assert(!mutations.empty());
schema_ptr s = mutations[0].schema();
auto res = make_flat_mutation_reader<reader>(std::move(s), std::move(mutations), pr);
if (fwd) {
return make_forwardable(std::move(res));
}
return res;
}
/// A reader that is empty when created but can be fast-forwarded.
///
/// Useful when a reader has to be created without an initial read-range and it
/// has to be fast-forwardable.
/// Delays the creation of the underlying reader until it is first
/// fast-forwarded and thus a range is available.
class forwardable_empty_mutation_reader : public flat_mutation_reader::impl {
mutation_source _source;
const query::partition_slice& _slice;
const io_priority_class& _pc;
tracing::trace_state_ptr _trace_state;
flat_mutation_reader_opt _reader;
public:
forwardable_empty_mutation_reader(schema_ptr s,
mutation_source source,
const query::partition_slice& slice,
const io_priority_class& pc,
tracing::trace_state_ptr trace_state)
: impl(s)
, _source(std::move(source))
, _slice(slice)
, _pc(pc)
, _trace_state(std::move(trace_state)) {
_end_of_stream = true;
}
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override {
if (!_reader) {
return make_ready_future<>();
}
if (_reader->is_buffer_empty()) {
if (_reader->is_end_of_stream()) {
_end_of_stream = true;
return make_ready_future<>();
} else {
return _reader->fill_buffer(timeout).then([this, timeout] { return fill_buffer(timeout); });
}
}
_reader->move_buffer_content_to(*this);
return make_ready_future<>();
}
virtual future<> fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) override {
if (!_reader) {
_reader = _source.make_reader(_schema, pr, _slice, _pc, std::move(_trace_state), streamed_mutation::forwarding::no,
mutation_reader::forwarding::yes);
_end_of_stream = false;
return make_ready_future<>();
}
clear_buffer();
_end_of_stream = false;
return _reader->fast_forward_to(pr, timeout);
}
virtual future<> fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) override {
throw std::bad_function_call();
}
virtual void next_partition() override {
if (!_reader) {
return;
}
clear_buffer_to_next_partition();
if (is_buffer_empty() && !is_end_of_stream()) {
_reader->next_partition();
}
}
virtual size_t buffer_size() const override {
return impl::buffer_size() + (_reader ? _reader->buffer_size() : 0);
}
};
template<typename Generator>
class flat_multi_range_mutation_reader : public flat_mutation_reader::impl {
std::optional<Generator> _generator;
flat_mutation_reader _reader;
const dht::partition_range* next() {
if (!_generator) {
return nullptr;
}
return (*_generator)();
}
public:
flat_multi_range_mutation_reader(
schema_ptr s,
mutation_source source,
const dht::partition_range& first_range,
Generator generator,
const query::partition_slice& slice,
const io_priority_class& pc,
tracing::trace_state_ptr trace_state)
: impl(s)
, _generator(std::move(generator))
, _reader(source.make_reader(s, first_range, slice, pc, trace_state, streamed_mutation::forwarding::no, mutation_reader::forwarding::yes))
{
}
virtual future<> fill_buffer(db::timeout_clock::time_point timeout) override {
return do_until([this] { return is_end_of_stream() || !is_buffer_empty(); }, [this, timeout] {
return _reader.fill_buffer(timeout).then([this, timeout] () {
while (!_reader.is_buffer_empty()) {
push_mutation_fragment(_reader.pop_mutation_fragment());
}
if (!_reader.is_end_of_stream()) {
return make_ready_future<>();
}
if (auto r = next()) {
return _reader.fast_forward_to(*r, timeout);
} else {
_end_of_stream = true;
return make_ready_future<>();
}
});
});
}
virtual future<> fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) override {
clear_buffer();
_end_of_stream = false;
return _reader.fast_forward_to(pr, timeout).then([this] {
_generator.reset();
});
}
virtual future<> fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) override {
throw std::bad_function_call();
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty() && !is_end_of_stream()) {
_reader.next_partition();
}
}
virtual size_t buffer_size() const override {
return flat_mutation_reader::impl::buffer_size() + _reader.buffer_size();
}
};
flat_mutation_reader
make_flat_multi_range_reader(schema_ptr s, mutation_source source, const dht::partition_range_vector& ranges,
const query::partition_slice& slice, const io_priority_class& pc,
tracing::trace_state_ptr trace_state,
mutation_reader::forwarding fwd_mr)
{
class adapter {
dht::partition_range_vector::const_iterator _it;
dht::partition_range_vector::const_iterator _end;
public:
adapter(dht::partition_range_vector::const_iterator begin, dht::partition_range_vector::const_iterator end) : _it(begin), _end(end) {
}
const dht::partition_range* operator()() {
if (_it == _end) {
return nullptr;
}
return &*_it++;
}
};
if (ranges.empty()) {
if (fwd_mr) {
return make_flat_mutation_reader<forwardable_empty_mutation_reader>(std::move(s), std::move(source), slice, pc, std::move(trace_state));
} else {
return make_empty_flat_reader(std::move(s));
}
} else if (ranges.size() == 1) {
return source.make_reader(std::move(s), ranges.front(), slice, pc, std::move(trace_state), streamed_mutation::forwarding::no, fwd_mr);
} else {
return make_flat_mutation_reader<flat_multi_range_mutation_reader<adapter>>(std::move(s), std::move(source),
ranges.front(), adapter(std::next(ranges.cbegin()), ranges.cend()), slice, pc, std::move(trace_state));
}
}
flat_mutation_reader
make_flat_multi_range_reader(
schema_ptr s,
mutation_source source,
std::function<std::optional<dht::partition_range>()> generator,
const query::partition_slice& slice,
const io_priority_class& pc,
tracing::trace_state_ptr trace_state,
mutation_reader::forwarding fwd_mr) {
class adapter {
std::function<std::optional<dht::partition_range>()> _generator;
std::unique_ptr<dht::partition_range> _previous;
std::unique_ptr<dht::partition_range> _current;
public:
explicit adapter(std::function<std::optional<dht::partition_range>()> generator)
: _generator(std::move(generator))
, _previous(std::make_unique<dht::partition_range>(dht::partition_range::make_singular({dht::token{}, partition_key::make_empty()})))
, _current(std::make_unique<dht::partition_range>(dht::partition_range::make_singular({dht::token{}, partition_key::make_empty()}))) {
}
const dht::partition_range* operator()() {
std::swap(_current, _previous);
if (auto next = _generator()) {
*_current = std::move(*next);
return _current.get();
} else {
return nullptr;
}
}
};
auto adapted_generator = adapter(std::move(generator));
auto* first_range = adapted_generator();
if (!first_range) {
if (fwd_mr) {
return make_flat_mutation_reader<forwardable_empty_mutation_reader>(std::move(s), std::move(source), slice, pc, std::move(trace_state));
} else {
return make_empty_flat_reader(std::move(s));
}
} else {
return make_flat_mutation_reader<flat_multi_range_mutation_reader<adapter>>(std::move(s), std::move(source),
*first_range, std::move(adapted_generator), slice, pc, std::move(trace_state));
}
}
flat_mutation_reader
make_flat_mutation_reader_from_fragments(schema_ptr schema, std::deque<mutation_fragment> fragments) {
class reader : public flat_mutation_reader::impl {
std::deque<mutation_fragment> _fragments;
public:
reader(schema_ptr schema, std::deque<mutation_fragment> fragments)
: flat_mutation_reader::impl(std::move(schema))
, _fragments(std::move(fragments)) {
}
virtual future<> fill_buffer(db::timeout_clock::time_point) override {
while (!(_end_of_stream = _fragments.empty()) && !is_buffer_full()) {
push_mutation_fragment(std::move(_fragments.front()));
_fragments.pop_front();
}
return make_ready_future<>();
}
virtual void next_partition() override {
clear_buffer_to_next_partition();
if (is_buffer_empty()) {
while (!(_end_of_stream = _fragments.empty()) && !_fragments.front().is_partition_start()) {
_fragments.pop_front();
}
}
}
virtual future<> fast_forward_to(position_range pr, db::timeout_clock::time_point timeout) override {
throw std::runtime_error("This reader can't be fast forwarded to another range.");
}
virtual future<> fast_forward_to(const dht::partition_range& pr, db::timeout_clock::time_point timeout) override {
throw std::runtime_error("This reader can't be fast forwarded to another position.");
}
};
return make_flat_mutation_reader<reader>(std::move(schema), std::move(fragments));
}
/*
* This reader takes a get_next_fragment generator that produces mutation_fragment_opt which is returned by
* generating_reader.
*
*/
class generating_reader final : public flat_mutation_reader::impl {
std::function<future<mutation_fragment_opt> ()> _get_next_fragment;
public:
generating_reader(schema_ptr s, std::function<future<mutation_fragment_opt> ()> get_next_fragment)
: impl(std::move(s)), _get_next_fragment(std::move(get_next_fragment))
{ }
virtual future<> fill_buffer(db::timeout_clock::time_point) override {
return do_until([this] { return is_end_of_stream() || is_buffer_full(); }, [this] {
return _get_next_fragment().then([this] (mutation_fragment_opt mopt) {
if (!mopt) {
_end_of_stream = true;
} else {
push_mutation_fragment(std::move(*mopt));
}
});
});
}
virtual void next_partition() override {
throw std::bad_function_call();
}
virtual future<> fast_forward_to(const dht::partition_range&, db::timeout_clock::time_point) override {
throw std::bad_function_call();
}
virtual future<> fast_forward_to(position_range, db::timeout_clock::time_point) override {
throw std::bad_function_call();
}
};
flat_mutation_reader make_generating_reader(schema_ptr s, std::function<future<mutation_fragment_opt> ()> get_next_fragment) {
return make_flat_mutation_reader<generating_reader>(std::move(s), std::move(get_next_fragment));
}