In some cases the full-blown partition key validation and especially the
associated key copy per partition might be deemed too costly. As a next
best thing this patch adds a token only validation, which should cover
99% (number pulled out of my sleeve) of the cases. Let's hope no one
gets unlucky.
Currently key order validation for the mutation fragment stream
validating filter is all or nothing. Either no keys (partition or
clustering) are validated or all of them. As we suspect that clustering
key order validation would add a significant overhead, this discourages
turning key validation on, which means we miss out on partition key
monotonicity validation which has a much more moderate cost.
This patch makes this configurable in a more fine-grained fashion,
providing separate levels for partition and clustering key monotonicity
validation.
As the choice for the default validation level is not as clear-cut as
before, the default value for the validation level is removed in the
validating filter's constructor.
Currently the code uses a look of unlink_leftmost_without_rebalance
calls. B-tree does have it, but plain clearing of the tree is a bit
faster with clear().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The current API is tailored to the `mutation_fragment` type. In
the next patch we will want to use the validator from a context where
the mutation fragments are already decomposed into their respective
concrete types, e.g. static_row, clustering_row, etc. To avoid having to
reconstruct a mutation fragment type just to use the validator, add an
API which allows validating these concrete types conveniently too.
The main user of this method, the one which required this method to
return the collective buffer size of the entire reader tree, is now
gone. The remaining two users just use it to check the size of the
reader instance they are working with.
So de-virtualize this method and reduce its responsibility to just
returning the buffer size of the current reader instance.
The memory usage is now maintained and updated on each change to the
mutation fragment, so it needs not be recalculated on a call to
`memory_usage()`, hence the schema parameter is unused and can be
removed.
We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and pass to
the permit.
In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
Via a tracked_allocator. Although the memory allocations made by the
_buffer shouldn't dominate the memory consumption of the read itself,
they can still be a significant portion that scales with the number of
readers in the read.
Not used yet, this patch does all the churn of propagating a permit
to each impl.
In the next patch we will use it to track to track the memory
consumption of `_buffer`.
There's an optimization in flat_mutation_reader_from_mutations
that folds the list from left-to-right in linear time. In case
of currently used boost::set the .unlink_leftmost_without_rebalance
helper is used, so wrap this exception with a method of the
range_tombstone_list. This is the last place where caller need
to mess with the exact internal collection.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method extracts an element from the list, constructs
a desired object from it and frees. This is common usage
of range_tombstone_list. Having a helper helps encapsulating
the exact collection inside the class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal is to stop revealing the exact collection from
the range_tombstone_list, so make use of existing begin/end
methods and extend with rbegin() where needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We want to switch from using a single limit to a dual soft/hard limit.
As a first step we switch the limit field of `query_class_config` to use
the recently introduced type for this. As this field has a single user
at the moment -- reverse queries (and not a lot of propagation) -- we
update it in this same patch to use the soft/hard limit: warn on
reaching the soft limit and abort on the hard limit (the previous
behaviour).
In order to eventually switch to a single JSON library,
most of the libjsoncpp usage is dropped in favor of rjson.
Unfortunately, one usage still remains:
test/utils/test_repl utility heavily depends on the *exact textual*
format of its output JSON files, so replacing a library results
in all tests failing because of differences in formatting.
It is possible to force rjson to print its documents in the exact
matching format, but that's left for later, since the issue is not
critical. It would be nice though if our test suite compared
JSON documents with a real JSON parser, since there are more
differences - e.g. libjsoncpp keeps children of the object
sorted, while rapidjson uses an unordered data structure.
This change should cause no change in semantics, it strives
just to replace all usage of libjsoncpp with rjson.
Mutation sources will soon require a valid permit so make sure we have
one and pass it to the mutation sources when creating the underlying
readers.
For now, pass no_reader_permit() on call sites, deferring the obtaining
of a valid permit to later patches.
We typically use `std::bad_function_call` to throw from
mandatory-to-implement virtual functions, that cannot have a meaningful
implementation in the derived class. The problem with
`std::bad_function_call` is that it carries absolutely no information
w.r.t. where was it thrown from.
I originally wanted to replace `std::bad_function_call` in our codebase
with a custom exception type that would allow passing in the name of the
function it is thrown from to be included in the exception message.
However after I ended up also including a backtrace, Benny Halevy
pointed out that I might as well just throw `std:bad_function_call` with
a backtrace instead. So this is what this patch does.
All users are various unimplemented methods of the
`flat_mutation_reader::impl` interface.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200408075801.701416-1-bdenes@scylladb.com>
If the reversing requires more memory than the limit, the read is
aborted. All users are updated to get a meaningful limit, from the
respective table object, with the exception of tests of course.
Currently reverse reads just pass a flag to
`flat_mutation_reader::consume()` to make the read happen in reverse.
This is deceptively simple and streamlined -- while in fact behind the
scenes a reversing reader is created to wrap the reader in question to
reverse partitions, one-by-one.
This patch makes this apparent by exposing the reversing reader via
`make_reversing_reader()`. This now makes how reversing works more
apparent. It also allows for more configuration to be passed to the
reversing reader (in the next patches).
This change is forward compatible, as in time we plan to add reversing
support to the sstable layer, in which case the reversing reader will
go.
The low-level validator allows fine-grained validation of different
aspects of monotonicity of a fragment stream. It doesn't do any error
handling. Since different aspects can be validated with different
functions, this allows callers to understand what exactly is invalid.
The high-level API is the previous fragment filter one. This is now
built on the low-level API.
This division allows for advanced use cases where the user of the
validator wants to do all error handling and wants to decide exactly
what monotonicity to validate. The motivating use-case is scrubbing
compaction, added in the next patches.
The former was never really more than a reader_permit with one
additional method. Currently using it doesn't even save one from any
includes. Now that readers will be using reader_permit we would have to
pass down both to mutation_source. Instead get rid of
reader_resource_tracker and just use reader_permit. Instead of making it
a last and optional parameter that is easy to ignore, make it a
first class parameter, right after schema, to signify that permits are
now a prominent part of the reader API.
This -- mostly mechanical -- patch essentially refactors mutation_source
to ask for the reader_permit instead of reader_resource_tracking and
updates all usage sites.
Add a name parameter to the validator, so that the validator can be
identified in log messages. Schema identity information is added to the
name automatically. This should help pinpoint the problematic place
where validation failed.
Although at the moment we have a single validator, it still benefits
from having a name, as we can now include in it the name of the sstable
being written and hence trace the source of the bad data.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200117150616.895878-1-bdenes@scylladb.com>
So a higher level component using the validator to validate a stream can
catch only validation errors, and let any other incidental exception
through.
This allows building data correctors on top of the
`mutation_fragment_stream_validator`, by filtering a fragment stream
through a validator, catching invalid fragment stream exceptions and
dropping the respective fragments from the stream.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191220073443.530750-1-bdenes@scylladb.com>
Currently end of stream validation is done in the destructor,
but the validator may be destructed prematurely, e.g. on
exception, as seen in https://github.com/scylladb/scylla/issues/5215
This patch adds a on_end_of_stream() method explicitly called by
consume_pausable_in_thread. Also, the respective concepts for
ParitionFilter, MutationFragmentFilter and a new on for the
on_end_of_stream method were unified as FlattenedConsumerFilter.
Refs #5215
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 506ff40bd447f00158c24859819d4bb06436c996)
Merged patch series from Avi Kivity:
The static row can be rare: many tables don't have them, and tables
that do will often have mutations without them (if the static row
is rarely updated, it may be present in the cache and in readers,
but absent in memtable mutations). However, it always consumes ~100
bytes of memory, even if it not present, due to row's overhead.
Change it to be optional by allocating it as an external object rather
than inlined into mutation_partition. This adds overhead when the
static row is present (17 bytes for the reference, back reference,
and lsa allocator overhead).
perf_simple_query appears to marginally (2%) faster. Footprint is
reduced by ~9% for a cache entry, 12% in memtables. More details are
provided in the patch commitlog.
Tests: unit (debug)
Avi Kivity (4):
managed_ref: add get() accessor
managed_ref: add external_memory_usage()
mutation_partition: introduce lazy_row
mutation_partition: make static_row optional to reduce memory
footprint
cell_locking.hh | 2 +-
converting_mutation_partition_applier.hh | 4 +-
mutation_partition.hh | 284 ++++++++++++++++++++++-
partition_builder.hh | 4 +-
utils/managed_ref.hh | 12 +
flat_mutation_reader.cc | 2 +-
memtable.cc | 2 +-
mutation_partition.cc | 45 +++-
mutation_partition_serializer.cc | 2 +-
partition_version.cc | 4 +-
tests/multishard_mutation_query_test.cc | 2 +-
tests/mutation_source_test.cc | 2 +-
tests/mutation_test.cc | 12 +-
tests/sstable_mutation_test.cc | 10 +-
14 files changed, 355 insertions(+), 32 deletions(-)
The static row can be rare: many tables don't have them, and tables
that do will often have mutations without them (if the static row
is rarely updated, it may be present in the cache and in readers,
but absent in memtable mutations). However, it always consumes ~100
bytes of memory, even if it not present, due to row's overhead.
Change it to be optional by using lazy_row instead of row. Some call
sites treewide were adjusted to deal with the extra indirection.
perf_simple_query appears to improve by 2%, from 163krps to 165 krps,
though it's hard to be sure due to noisy measurements.
memory_footprint comparisons (before/after):
mutation footprint: mutation footprint:
- in cache: 1096 - in cache: 992
- in memtable: 854 - in memtable: 750
- in sstable: 351 - in sstable: 351
- frozen: 540 - frozen: 540
- canonical: 827 - canonical: 827
- query result: 342 - query result: 342
sizeof(cache_entry) = 112 sizeof(cache_entry) = 112
-- sizeof(decorated_key) = 36 -- sizeof(decorated_key) = 36
-- sizeof(cache_link_type) = 32 -- sizeof(cache_link_type) = 32
-- sizeof(mutation_partition) = 200 -- sizeof(mutation_partition) = 96
-- -- sizeof(_static_row) = 112 -- -- sizeof(_static_row) = 8
-- -- sizeof(_rows) = 24 -- -- sizeof(_rows) = 24
-- -- sizeof(_row_tombstones) = 40 -- -- sizeof(_row_tombstones) = 40
sizeof(rows_entry) = 232 sizeof(rows_entry) = 232
sizeof(lru_link_type) = 16 sizeof(lru_link_type) = 16
sizeof(deletable_row) = 168 sizeof(deletable_row) = 168
sizeof(row) = 112 sizeof(row) = 112
sizeof(atomic_cell_or_collection) = 8 sizeof(atomic_cell_or_collection) = 8
Tests: unit (dev)
Storing and comparing keys is expensive.
Add a flag to enable/disable this feature (disabled by default).
Without the flag, only the partition region monotonicity is
validated, allowing repeated clustering rows, regardless of
clustering keys.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The fragment reader calls `fast_forward_to()` from its constructor to
discard fragments that fall outside the query range. Mmove the
the fast-forward code in to an internal void returning method, and call
that from both the constructor and `fast_forward_to()`, to avoid a
warning on a discarded future<>.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20190801133942.10744-1-bdenes@scylladb.com>
To be able to support this new overload, the reader is made
partition-range aware. It will now correctly only return fragments that
fall into the partition-range it was created with. For completeness'
sake and to be able to test it, also implement
`fast_forward_to(const dht::partition_range)`. Slicing is done by
filtering out non-overlapping fragments from the initial list of
fragments. Also add a unit test that runs it through the mutation_source
test suite.
To be able to run the mutation-source test suite with this reader. In
the next patch, this reader will be used in testing another reader, so
it is important to make sure it works correctly first.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
"
=== How the the partition level repair works
- The repair master decides which ranges to work on.
- The repair master splits the ranges to sub ranges which contains around 100
partitions.
- The repair master computes the checksum of the 100 partitions and asks the
related peers to compute the checksum of the 100 partitions.
- If the checksum matches, the data in this sub range is synced.
- If the checksum mismatches, repair master fetches the data from all the peers
and sends back the merged data to peers.
=== Major problems with partition level repair
- A mismatch of a single row in any of the 100 partitions causes 100
partitions to be transferred. A single partition can be very large. Not to
mention the size of 100 partitions.
- Checksum (find the mismatch) and streaming (fix the mismatch) will read the
same data twice
=== Row level repair
Row level checksum and synchronization: detect row level mismatch and transfer
only the mismatch
=== How the row level repair works
- To solve the problem of reading data twice
Read the data only once for both checksum and synchronization between nodes.
We work on a small range which contains only a few mega bytes of rows,
We read all the rows within the small range into memory. Find the
mismatch and send the mismatch rows between peers.
We need to find a sync boundary among the nodes which contains only N bytes of
rows.
- To solve the problem of sending unnecessary data.
We need to find the mismatched rows between nodes and only send the delta.
The problem is called set reconciliation problem which is a common problem in
distributed systems.
For example:
Node1 has set1 = {row1, row2, row3}
Node2 has set2 = { row2, row3}
Node3 has set3 = {row1, row2, row4}
To repair:
Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3.
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2
Node1 sends row3 (set1 + set2 + set3 - set3) to Node3.
=== How to implement repair with set reconciliation
- Step A: Negotiate sync boundary
class repair_sync_boundary {
dht::decorated_key pk;
position_in_partition position
}
Reads rows from disk into row buffers until the size is larger than N
bytes. Return the repair_sync_boundary of the last mutation_fragment we
read from disk. The smallest repair_sync_boundary of all nodes is
set as the current_sync_boundary.
- Step B: Get missing rows from peer nodes so that repair master contains all the rows
Request combined hashes from all nodes between last_sync_boundary and
current_sync_boundary. If the combined hashes from all nodes are identical,
data is synced, goto Step A. If not, request the full hashes from peers.
At this point, the repair master knows exactly what rows are missing. Request the
missing rows from peer nodes.
Now, local node contains all the rows.
- Step C: Send missing rows to the peer nodes
Since local node also knows what peer nodes own, it sends the missing rows to
the peer nodes.
=== How the RPC API looks like
- repair_range_start()
Step A:
- request_sync_boundary()
Step B:
- request_combined_row_hashes()
- reqeust_full_row_hashes()
- request_row_diff()
Step C:
- send_row_diff()
- repair_range_stop()
=== Performance evaluation
We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We
created a keyspace with a replication factor of 3 and inserted 1 billion
rows to each of the 3 nodes. Each node has 241 GiB of data.
We tested 3 cases below.
1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows.
Time to repair:
old = 87 min
new = 70 min (rebuild took 50 minutes)
improvement = 19.54%
2) 100% synced: all of the 3 nodes have 1 billion identical rows.
Time to repair:
old = 43 min
new = 24 min
improvement = 44.18%
3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows.
Time to repair:
old: 211 min
new: 44 min
improvement: 79.15%
Bytes sent on wire for repair:
old: tx= 162 GiB, rx = 90 GiB
new: tx= 1.15 GiB, tx = 0.57 GiB
improvement: tx = 99.29%, rx = 99.36%
It is worth noting that row level repair sends and receives exactly the
number of rows needed in theory.
In this test case, repair master needs to receives 2 million rows and
sends 4 million rows. Here are the details: Each node has 1 billion *
0.1% distinct rows, that is 1 million rows. So repair master receives 1
million rows from repair slave 1 and 1 million rows from repair slave 2.
Repair master sends 1 million rows from repair master and 1 million rows
received from repair slave 1 to repair slave 2. Repair master sends
sends 1 million rows from repair master and 1 million rows received from
repair slave 2 to repair slave 1.
In the result, we saw the rows on wire were as expected.
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000
rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000
Fixes: #3033
Tests: dtests/repair_additional_test.py
"
* 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits)
repair: Enable row level repair
repair: Add row_level_repair
repair: Add docs for row level repair
repair: Add repair_init_messaging_service_handler
repair: Add repair_meta
repair: Add repair_writer
repair: Add repair_reader
repair: Add repair_row
repair: Add fragment_hasher
repair: Add decorated_key_with_hash
repair: Add get_random_seed
repair: Add get_common_diff_detect_algorithm
repair: Add shard_config
repair: Add suportted_diff_detect_algorithms
repair: Add repair_stats to repair_info
repair: Introduce repair_stats
flat_mutation_reader: Add make_generating_reader
storage_service: Introduce ROW_LEVEL_REPAIR feature
messaging_service: Add RPC verbs for row level repair
repair: Export the repair logger
...
If the reader is fast-forwarded to another partition range mutation_ may
be left with some partial mutations. Make sure that those are properly
destroyed.
Move generating_reader from stream_session.cc to flat_mutation_reader.cc.
It will be used by repair code soon.
Also introduce a helper make_generating_reader to hide the
implementation of generating_reader.
Allows creating a multi range reader from an arbitrary callable that
return std::optional<dht::partition_range>. The callable is expected to
return a new range on each call, such that passing each successive range
to `flat_mutation_reader::fast_forward_to` is valid. When exhausted the
callable is expected to return std::nullopt.
Instead of working with a dht::partition_range_vector directly, work
with an abstract generator that returns a pointer to the next range on
each invocation. When exhausted it returns nullptr. This opens up the
possibility to create multi range readers from a generator functor that
creates ranges lazily. This is indeed what the next path does.
Previously, when the passed in range of partition ranges contained 0
ranges, an empty reader was returned. This means that the returned
reader was forwardable or not depending on the number of passed in
ranges. This is inconsistent and can lead to nasty surprises.
To solve this problem add `forwardable_empty_mutation_reader`, a
specialized reader that delays creating the underlying reader until
fast_forward_to() is called on it, and thus a range is available.
When `make_flat_multi_range_mutation_reader()` is called with
`mutation_reader::forwarding::no` a simple empty reader is created, like
before.
The factory function creating this reader ensures that the passed-in
ranges vector has more then one range, which effectively makes the
`fwd_mr` constructor parameter have no effect. The underlying reader
will always be created with `mutation_reader::forwarding::yes` as it has
to be able to fast-forward between the ranges.
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewrs of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.
To make order in this chaos make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.
There were suprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the wast majority of the changes) I just used
`db::no_timeout` everywhere.
Tests: unit(release, debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
This works around a problem of std::terminate() being called in debug
mode build if initialization of _current throws.
Backtrace:
Thread 2 "row_cache_test_" received signal SIGABRT, Aborted.
0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007ffff17ce9fb in raise () from /lib64/libc.so.6
#1 0x00007ffff17d077d in abort () from /lib64/libc.so.6
#2 0x00007ffff5773025 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007ffff5770c16 in ?? () from /lib64/libstdc++.so.6
#4 0x00007ffff576fb19 in ?? () from /lib64/libstdc++.so.6
#5 0x00007ffff5770508 in __gxx_personality_v0 () from /lib64/libstdc++.so.6
#6 0x00007ffff3ce4ee3 in ?? () from /lib64/libgcc_s.so.1
#7 0x00007ffff3ce570e in _Unwind_Resume () from /lib64/libgcc_s.so.1
#8 0x0000000003633602 in reader::reader (this=0x60e0001160c0, r=...) at flat_mutation_reader.cc:214
#9 0x0000000003655864 in std::make_unique<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (__args#0=...)
at /usr/include/c++/7/bits/unique_ptr.h:825
#10 0x0000000003649a63 in make_flat_mutation_reader<make_forwardable(flat_mutation_reader)::reader, flat_mutation_reader>(flat_mutation_reader &&) (args#0=...)
at flat_mutation_reader.hh:440
#11 0x000000000363565d in make_forwardable (m=...) at flat_mutation_reader.cc:270
#12 0x000000000303f962 in memtable::make_flat_reader (this=0x61300001d540, s=..., range=..., slice=..., pc=..., trace_state_ptr=..., fwd=..., fwd_mr=...)
at memtable.cc:592
Message-Id: <1528792447-13336-1-git-send-email-tgrabiec@scylladb.com>
Don't create a flat_multi_range_mutation_reader when the range vector
has 0 or 1 element. In the former case create an empty reader and in the
latter just create a reader with the mutation-source with the only range
in the vector.
Builds a reader from a set of ordered mutations fragments. This is
useful for building a reader out of a subset of segments returned by a
different reader. It is equivalent to building a mutation out of the
set of mutation fragments, and calling
make_flat_mutation_reader_from_mutations, except that it doest not yet
support fast-forwarding.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
buffer_size() exposes the collective size of the external memory
consumed by the mutattion-fragments in the flat reader's buffer. This
provides a basis to build basic memory accounting on. Altought this is
not the entire memory consumption of any given reader it is the most
volatile component and usually by far the largest one too.