Commit Graph

120 Commits

Author SHA1 Message Date
George Kollias
c2343dc841 Make restricting reader fill_buffer more efficient
Currently, restricting_mutation_reader::fill_buffer justs reads
lower-layer reader's fragments one by one without doing any further
transformations. This change just swaps the parent-child buffers in a
single step, as suggested in #3604, and, hence, removing any possible
per-fragment overhead.

I couldn't find any test that exercises restricting_mutation_reader as
a mutation source, so I added test_restricted_reader_as_mutation_source
in mutation_reader_test.

Tests: unit (release), though these 4 tests are failing regardless of
my changes (they fail on master for me as well): snitch_reset_test,
sstable_mutation_test, sstable_test, sstable_3_x_test.

Fixes: #3604

Signed-off-by: George Kollias <georgioskollias@gmail.com>
Message-Id: <1540052861-621-1-git-send-email-georgioskollias@gmail.com>
2018-10-22 11:36:54 +03:00
Botond Dénes
dfad223ea2 multishard_mutation_reader: shard_reader: don't do concurrent read-aheads
multishard_mutation_reader starts read-aheads on the
shards-to-be-read-soon. When doing this it didn't check whether the
respective shards had an ongoing read-ahead already. This lead to a
single shard executing multiple concurrent read-aheads. This is damaging
for multiple reasons:
    * Can lead to concurrent access of the remote reader's data members.
    * The `shard_reader` was designed around a single read-ahead and
    thus will synchronise foreground reads with only the last one.

The practical implications of this seen so far was that queries reading
a large number of rows (large enough to reliably trigger the
bug) would stop the read early, due the `combined_mutation_reader`'s
internal accounting being messed up by concurrent access.

Also add a unit test. Instead of coming up with a very specific, and
very contrived unit test, use the test-case that detected this bug in
the first place: count(*) on a table with lots of rows (>1000). This
unit-test should serve well for detecting any similar bugs in the
future.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <ff1c49be64e2fb443f9aa8c5c8d235e682442248.1536746388.git.bdenes@scylladb.com>
2018-09-12 11:43:18 +01:00
Botond Dénes
6a07b8ae83 multishard_mutation_reader: update shard_reader's comment
The `adandoned` member was renamed to `stopped`. Update the comment
accordingly.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1d655785f28fe1e5fa041f2f49852f0ad88be53e.1536743950.git.bdenes@scylladb.com>
2018-09-12 11:32:08 +02:00
Botond Dénes
49704755b0 combined_mutation_reader: propagate timeout in fill_buffer()
All user reads go through the combined reader. Not propagating the
timeout down from there means that the storage layer's timeout
functionality is effectively disabled. Spotted while reading the code.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <7fc10eca1c231dd04ac433913d9e6a51b6b17139.1536657041.git.bdenes@scylladb.com>
2018-09-11 15:44:28 +02:00
Botond Dénes
a011a9ebf2 mutation_reader: multishard_combining_reader: support custom dismantler
Add a dismantler functor parameter. When the multishard reader is
destroyed this functor will be called for each shard reader, passing a
future to a `stopped_foreign_reader`. This future becomes available when
the shard reader is stopped, that is, when it finished all in-progress
read-aheads and/or pending next partition calls.

The intended use case for the dismantler functor is a client that needs
to be notified when readers are destroyed and/or has to have access to
any unconsumed fragments from the foreign readers wrapping the shard
readers.
2018-09-03 10:31:44 +03:00
Botond Dénes
f13b878a94 mutation_reader: pass all standard reader params to remote_reader_factory
Extend `remote_reader_factory` interface so that it accepts all standard
mutation reader creation parameters. This allows factory lambdas to be
truly stateless, not having to capture any standard parameters that is
needed for creating the reader.
Standard parameters are those accepted by
`mutation_source::make_reader()`.
2018-09-03 10:31:44 +03:00
Botond Dénes
8915293257 multishard_combining_reader: fix incorrect comment 2018-09-03 10:31:44 +03:00
Botond Dénes
81a03db955 mutation_reader: reader_selector: use ring_position instead of token
sstable_set::incremental selector was migrated to ring position, follow
suit and migrate the reader_selector to use ring_position as well. Above
correctness this also improves efficiency in case of dense tables,
avoiding prematurely selecting sstables that share the token but start
at different keys, altough one could argue that this is a niche case.
2018-07-04 17:42:37 +03:00
Botond Dénes
a8e795a16e sstables_set::incremental_selector: use ring_position instead of token
Currently `sstable_set::incremental_selector` works in terms of tokens.
Sstables can be selected with tokens and internally the token-space is
partitioned (in `partitioned_sstable_set`, used for LCS) with tokens as
well. This is problematic for severeal reasons.
The sub-range sstables cover from the token-space is defined in terms of
decorated keys. It is even possible that multiple sstables cover
multiple non-overlapping sub-ranges of a single token. The current
system is unable to model this and will at best result in selecting
unnecessary sstables.
The usage of token for providing the next position where the
intersecting sstables change [1] causes further problems. Attempting to
walk over the token-space by repeatedly calling `select()` with the
`next_position` returned from the previous call will quite possibly lead
to an infinite loop as a token cannot express inclusiveness/exclusiveness
and thus the incremental selector will not be able to make progress when
the upper and lower bounds of two neighbouring intervals share the same
token with different inclusiveness e.g. [t1, t2](t2, t3].

To solve these problems update incremental_selector to work in terms of
ring position. This makes it possible to partition the token-space
amoing sstables at decorated key granularity. It also makes it possible
for select() to return a next_position that is guaranteed to make
progress.

partitioned_sstable_set now builds the internal interval map using the
decorated key of the sstables, not just the tokens.
incremental_selector::select() now uses `dht::ring_position_view` as
both the selector and the next_position. ring_position_view can express
positions between keys so it can also include information about
inclusiveness/exclusiveness of the next interval guaranteeing forward
progress.

[1] `sstable_set::incremental_selector::selection::next_position`
2018-07-04 17:42:33 +03:00
Botond Dénes
78ecf2740a mutation_reader_merger::maybe_add_readers(): remove early return
It's unnecessary (doesn't prevent anything). The code without it
expresses intent better (and is shorter by two lines).
2018-07-02 11:41:09 +03:00
Botond Dénes
d26b35b058 mutation_reader_merger: get rid of _key
`_key` is only used in a single place and this does not warrant storing
it in a member. Also get rid of current_position() which was used to
query `_key`.
2018-07-02 11:40:43 +03:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection 2018-05-31 15:51:11 +01:00
Tomasz Grabiec
bb96518cc5 mutation_reader: Make empty mutation source advertize no partitions
So that perf_row_cache_update will always populate cache.
2018-05-30 12:18:56 +02:00
Avi Kivity
7d29addb1f mutation_reader: optimize make_combined_reader for the single-reader case
If we're given a single reader (can be common in a low-write-rate table,
where most of the data will be in a single large sstable, or in leveled
tables) then we can avoid the overhead of the combining reader by returning
the single input.

Tests: unit (release)
Message-Id: <20180513130333.15424-1-avi@scylladb.com>
2018-05-13 20:07:10 +02:00
Botond Dénes
7a3eab90c8 multishard_combining_reader: use optimized optional for the shard reader
Use flat_mutation_reader_opt instead of
std::optional<flat_mutation_reader>.
2018-05-10 13:06:47 +03:00
Botond Dénes
04643fb223 multishard_combining_reader: prepare for read-ahead otliving the reader
When the multishard reader is destroyed there might be severeal pending
read-aheads running in the background. These read-aheads need their
associated reader to stay alive until after the read-ahead completes.
To solve this move the flat_mutation_reader into a struct and manage
this struct's lifetime through a shared pointer. Fibers associated with
read-aheads that might outlive the multishard reader will hold on to a
copy of the shard pointer keeping the underlying reader alive until they
complete. To avoid doing any extra work a flag is added to this state
which is set when the multishard reader is destroyed. When this flag is
set, pending continuations will return early.  All this is encapsulated
in multishard_combining_reader::shard_reader the multishard reader code
itself need not be changed.
2018-04-30 17:16:21 +03:00
Botond Dénes
a05d398be7 foreign_reader: prepare for read-ahead outliving the reader
The foreign reader keeps track of ongoing read-aheads via a
foreign_ptr to the read-ahead's future on the remote shard. This pointer
is overwritten after each "remote call" to the remote reader with a
pointer to the future of the new read-ahead's future.
There are severeal problems with the current implementation:
1) There is a new read-ahead launched after each "remote call"
  unconditionally, even if the remote reader is at EOS. This will start
  unecessary read-ahead when the reader is already finished and may be
  soon destroyed (legally) by the client.
2) The pointer to the remote read-ahead future is not set to nullptr
  when a remote call is issued. Thus in the destructor, where we
  attach a continuation to the read-ahead's future to extend the
  reader's lifetime until after the read-ahead finishes, we migh attach
  a continuation to a future that already has one and run into a failed
  assert().

To fix this issues reset the read-ahead pointer to nullptr each time a
remote call is issued and don't start a new read-ahead if the remote
reader is at EOS. This way we can ensure that when the reader is
destroyed we either have a valid and non-stale read-aead future or none
at all and can reliably make a decision about whether we need to extend
the lifetime of the remote reader or not.
2018-04-30 14:34:43 +03:00
Botond Dénes
704d3d8421 multishard_combining_reader: avoid creating the shard reader twice
The multishard reader creates its shard readers on demand when they are
first attempted to be used. However at this time the reader migh already
be in the progress of being created, initiated by a previous read-ahead.
To avoid creating the shard reader twice, before creating the reader
check whether there are any read-aheads in progress. If there is, it
already created (is creating or will create) the reader and hence
synchronise with the read ahead. Synchronisation happens via a promise,
the read ahead creates a promise which will be fulfilled when the reader
is created. A concurrent create_reader() call will wait on this promise
instead of attempting to create a new reader.
2018-04-30 14:34:43 +03:00
Botond Dénes
f9464cfcd7 multishard_combining_reader: read_ahead: don't assume reader is created
Currently it is assumed that when read_ahead is called the reader is
already created. Under most circumstances this will not be true. It was
blind (bad) luck that we didn't hit this before (during testing).
2018-04-30 14:34:43 +03:00
Botond Dénes
d9fceb398a multishard_combining_reader: move read-ahead related methods
To the group of methods that do not assume the reader is already
created. A patch will follow that will update read_ahead() to not assume
that the reader is created.
2018-04-30 14:34:43 +03:00
Botond Dénes
5dcfaa68f6 multishard_combining_reader: avoid looking up the shard reader twice 2018-04-30 14:34:43 +03:00
Botond Dénes
79504a7d28 multishard_combining_reader: use optional for maybe created reader
After a little "research" [1] it turns out my initial fears were
completely without ground, std::optional::operator->() and
std::optional::opterator*() doesn't involve an unnecessary branch and
thus there is no need to hand-roll an optional with a separate bool.

[1] http://en.cppreference.com/w/cpp/utility/optional/operator*
2018-04-30 14:34:37 +03:00
Botond Dénes
07fb2e9c4d make_foreign_reader: don't wrap local readers
If the to-be-wrapped reader is local (lives on the same shard where
make_foreign_reader() is called) there is no need to wrap it with
foreign_reader. Just return it as is.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <886ed883b707f163603a40b56b8823f2bb6c47c6.1523873224.git.bdenes@scylladb.com>
2018-04-16 15:11:20 +03:00
Botond Dénes
3a6f397fd0 Add multishard_combined_reader
Takes care of reading a range from all shards that own a subrange in the
range. The read happens sequentially, reading from one shard at a time.
Under the scenes it uses combined_mutation_reader and foreign_reader,
the former providing the merging logic and the latter taking care of
transferring the output of the remote readers to the local shard.
Readers are created on-demand by a reader-selector implementation that
creates readers for yet unvisited shards as the read progresses.
The read starts with a concurrency of one, that is the reader reads from
a single shard at a time. The concurrency is exponentially increased (to
a maximum of the number of shards) when a reader's buffer is empty after
moving the next shard. This condition is important as we only wan't to
increase concurrency for sparse tables that have little data and the
reader has to move between shards often. When concurrency is > 1, the
reader issues background read-aheads to the next shards so that by the
time it needs to move to them they have the data ready.
For dense tables (where we rarely cross shards) we rely on the
foreign_reader to issue sufficient read-aheads on its own to avoid
blocking.
2018-04-11 10:03:47 +03:00
Botond Dénes
2c0f8d0586 Add foreign_reader
Local representant of a reader located on a remote shard. Manages the
lifecycle and takes care of seamlessly transferring fragments produced
by the remote reader. Fragments are *copied* between the shards in
batches, a bufferful at a time.
To maximize throughput read-ahead is used. After each fill_buffer() or
fast_forward_to() a read-ahead (a fill_buffer() on the remote reader) is
issued. This read-ahead runs in the background and is brough back to
foreground on the next fill_buffer() or fast_forward_to() call.
2018-04-11 09:22:45 +03:00
Botond Dénes
f488ae3917 Add buffer_size() to flat_mutation_reader
buffer_size() exposes the collective size of the external memory
consumed by the mutattion-fragments in the flat reader's buffer. This
provides a basis to build basic memory accounting on. Altought this is
not the entire memory consumption of any given reader it is the most
volatile component and usually by far the largest one too.
2018-03-13 10:34:34 +02:00
Botond Dénes
212b2dabc4 Resource-based cache eviction
Readers serving user-reads need to obtain a permit to start reading.
There exists a restriction on how much active readers can be admitted
based on their count and their memory onsumption.
Since the saved readers of cached queriers are techically active (they
hold a permit) they can block new readers from obtaining a permit.
New readers have a higher priority because a cached reader might be
abandoned or used later at best so in the face of memory pressure we
evict cached readers to free up permits for new readers.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
2018-03-13 10:34:34 +02:00
Botond Dénes
1259031af3 Use the reader_concurrency_semaphore to limit reader concurrency 2018-03-08 14:12:12 +02:00
Botond Dénes
dfa04c3fea Add reader_concurrency_semaphore
This semaphore implements the new dual, count and memory based active
reader limiting. As purely memory-based limiting proved to cause
problems on big boxes admitting a large number of readers (more than any
disk could handle) the previous count-based limit is reintroduced in
addition to the existing memory-based limit.
When creating new readers first the count-based limit is checked. If
that clears the memory limit is checked before admitting the reader.
reader_conccurency_semaphore wraps the two semaphores that implement
these limits and enforces the correct order of limit checking.
This class also completely replaces the restricted_reader_config struct,
it encapsulates all data and related functinality of the latter, making
client code simpler.
2018-03-08 14:12:12 +02:00
Botond Dénes
d5bb8a47fc mv reader_resource_tracker.hh -> reader_concurrency_semaphore.hh
In preparation to reader_concurrency_semaphore being added to the file.
The reader_resource_tracker is really only a helper class for
reader_concurrency_semaphore so the latter is better suited to provide
the name of the file.
2018-03-08 10:29:16 +02:00
Botond Dénes
206e7d40d4 restricted_mutation_reader: switch to std::variant
Tests: unit-tests(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a8930b764171db131d9d8d5fe4035014ecb452f4.1519391304.git.bdenes@scylladb.com>
2018-02-25 14:35:57 +02:00
Piotr Jastrzebski
37285ad7fa Delete unused make_reader_returning
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
864db78fcf Delete unused make_reader_returning_many
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
ff4ffc1c64 Delete unused make_empty_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
0b8aedcc59 Delete unused mutation_reader_from_flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
8aaf5dc900 Delete unused streamed_mutation_from_flat_mutation_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
d266eaa01e mutation_source: rename make_flat_mutation_reader to make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-19 09:30:12 +01:00
Avi Kivity
93076d25b6 Merge "mutation_source: remove support for creation with mutation_reader" from Piotr
"After this patchset it's only possible to create a mutation_source with a function that produces flat_mutation_reader."

* 'haaawk/mutation_source_v1' of ssh://github.com/scylladb/seastar-dev:
  Merge flat_mutation_reader_mutation_source into mutation_source
  Remove unused mutation_reader_mutation_source
  Remove unused mutation_source constructor.
  Migrate make_source to flat reader
  Migrate run_conversion_to_mutation_reader_tests to flat reader
  flat_mutation_reader_from_mutations: add support for slicing
  Remove unused mutation_source constructor.
  Migrate partition_counting_reader to flat reader
  Migrate throttled_mutation_source to flat reader
  Extract delegating_reader from make_delegating_reader
  row_cache_test: call row_cache::make_flat_reader in mutation_sources
  Remove unused friend declaration in flat_mutation_reader::impl
  Migrate make_source_with to flat reader
  Migrate make_empty_mutation_source to flat reader
  Remove unused mutation_source constructor
  Migrate test_multi_range_reader to flat reader
  Remove unused mutation_source constructors
2018-01-15 18:15:53 +02:00
Avi Kivity
fe788e0a5d mutation_reader: adjust FragmentProducer concept for timeout
forward_to() no accepts a timeout parameter, and the concept should
reflect it, or it breaks the build when concepts are enabled.
2018-01-14 18:09:37 +02:00
Glauber Costa
3c9eeea4cf restricted_mutation_reader: don't pass timeouts through the config structure
This patch enables passing a timeout to the restricted_mutation_reader
through the read path interface -- using fill_buffer and friends. This
will serve as a basis for having per-timeout requests.

The config structure still has a timeout, but that is so far only used
to actually pass the value to the query interface. Once that starts
coming from the storage proxy layer (next patch) we will remove.

The query callers are patched so that we pass the timeout down. We patch
the callers in database.cc, but leave the streaming ones alone. That can
be safely done because the default for the query path is now no_timeout,
and that is what the streaming code wants. So there is no need to
complicate the interface to allow for passing a timeout that we intend
to disable.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:21 -05:00
Glauber Costa
5140aaea00 add a timeout to fast forward to
In the last patch, we enabled per-request timeouts, we enable timeouts
in fill_buffer. There are many places, though, in which we
fast_forward_to before we fill_buffer, so in order to make that
effective we need to propagate the timeouts to fast_forward_to as well.

In the same way as fill_buffer, we make the argument optional wherever
possible in the high level callers, making them mandatory in the
implementations.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-12 07:43:19 -05:00
Glauber Costa
d965af42b0 add a timeout to fill_buffer
As part of the work to enable per-request timeouts, we enable timeouts
in fill_buffer.

The argument is made optional at the main classes, but mandatory in all
the ::impl versions. This way we'll make sure we didn't forget anything.

At this point we're still mostly passing that information around and
don't have any entity that will act on those timeouts. In the next patch
we will wire that up.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-11 12:07:41 -05:00
Paweł Dziepak
b4a4c04bab combined_reader: optimise for disjoint partition streams
The legacy mutation_reader/streamed_mutation design allowed very easily
to skip the partition merging logic if there was only one underlying
reader that has emitted it.

That optimisation was lost after conversion to flat mutation readers
which has impacted the performance. This patch mostly recovers it by
bypassing most of mutation_reader_merger logic if there is only a single
active reader for a given partition.

The performance regression was introduced in
8731c1bc66 "Flatten the implementation of
combined_mutation_reader".

perf_simple_query -c4 read results (medians of 60):

original regression
             before 8731c1     after 8731c1   diff
 read            326241.02        300244.09  -8.0%

this patch
                    before            after  diff
 read            313882.59        325148.05  3.6%
Message-Id: <20180103121019.764-1-pdziepak@scylladb.com>
2018-01-11 10:21:17 +01:00
Raphael S. Carvalho
818830715f Fix potential infinite recursion when combining mutations for leveled compaction
The issue is triggered by compaction of sstables of level higher than 0.

The problem happens when interval map of partitioned sstable set stores
intervals such as follow:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]

When selector is called for first interval above, the exclusive lower
bound of the second interval is returned as next token, but the
inclusivess info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.

This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made.

Fix is to use ring_position in reader_selector, such that
inclusiveness would be respected.
So reader_selector::has_new_readers() won't return false positive
under the conditions described above.

Fixes #2908.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 16:23:01 -02:00
Avi Kivity
8795238869 Merge "Fix handling of range tombstones starting at same position" from Tomasz
"When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstable), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20]. The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.

One effect of this may be that a reader will miss rows which are in
the range (1, 10) (after the start of the first range tombstone, and
before the start of the second range tombstone), if the second range
tombstone happens to be the last fragment which was read for a
discontinuous range in cache and we stopped reading at that point
because of a full buffer and cache was evicted before we resumed
reading, so we went to reading from the sstable reader again. There
could be more cases in which this violation may resurface.

There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with smaller positions which were emitted before, violating
monotonicity of fragment positions in the stream.

A similar bug was also present in partition_snapshot_flat_reader.

Possible solutions:

 1) relax the assumption (in cache) that streams contain only relevant
 range tombstones, and only require that they contain at least all
 relevant tombstones

 2) allow subsequent range tombstones in a stream to share the same
 starting position (position is weakly monotonic), then we don't need
 to de-overlap the tombstones in readers.

 3) teach combining readers about query restrictions so that they can drop
fragments which fall outside the range

 4) force leaf readers to trim all range tombstones to query restrictions

This patch implements solution no 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.

I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).

Solution 4 is confined to implementations of leaf readers, but also
has disadvantage of making those more complicated and slower.

There is only one consumer which needs the tombstones with monotonic positions, and
that is the sstable writer.

Fixes #3093."

* tag 'tgrabiec/fix-out-of-range-tombstones-v1' of github.com:scylladb/seastar-dev:
  tests: row_cache: Introduce test for concurrent read, population and eviction
  tests: sstables: Add test for writing combined stream with range tombstones at same position
  tests: memtable: Test that combined mutation source is a mutation source
  tests: memtable: Test that memtable with many versions is a mutation source
  tests: mutation_source: Add test for stream invariants with overlapping tombstones
  tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones
  tests: mutation_reader: Test combined reader slicing on random mutations
  tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys()
  mutation_fragment: Introduce range()
  clustering_interval_set: Introduce overlaps()
  clustering_interval_set: Extract private make_interval()
  mutation_reader: Allow range tombstones with same position in the fragment stream
  sstables: Handle consecutive range_tombstone fragments with same position
  tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone()
  streamed_mutation: Introduce peek()
  mutation_fragment: Extract mergeable_with()
  mutation_reader: Move definition of combining mutation reader to source file
  mutation_reader: Use make_combined_reader() to create combined reader
2018-01-02 18:32:09 +02:00
Duarte Nunes
2618209c2d Remove obsolete includes and fix build
move.hh was deleted, but files weren't updated to reflect that.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-12-28 12:03:44 +00:00
Tomasz Grabiec
41ede08a1d mutation_reader: Allow range tombstones with same position in the fragment stream
When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstable), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20]. The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.

One effect of this may be that a reader will miss rows which are in
the range (1, 10) (after the start of the first range tombstone, and
before the start of the second range tombstone), if the second range
tombstone happens to be the last fragment which was read for a
discontinuous range in cache and we stopped reading at that point
because of a full buffer and cache was evicted before we resumed
reading, so we went to reading from the sstable reader again. There
could be more cases in which this violation may resurface.

There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with smaller positions which were emitted before, violating
monotonicity of fragment positions in the stream.

A similar bug was also present in partition_snapshot_flat_reader.

Possible solutions:

 1) relax the assumption (in cache) that streams contain only relevant
 range tombstones, and only require that they contain at least all
 relevant tombstones

 2) allow subsequent range tombstones in a stream to share the same
 starting position (position is weakly monotonic), then we don't need
 to de-overlap the tombstones in readers.

 3) teach combining readers about query restrictions so that they can drop
fragments which fall outside the range

 4) force leaf readers to trim all range tombstones to query restrictions

This patch implements solution no 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.

I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).

Solution 4 is confined to implementations of leaf readers, but also
has disadvantage of making those more complicated and slower.

Fixes #3093.
2017-12-22 11:06:20 +01:00
Tomasz Grabiec
60ed5d29c0 mutation_reader: Move definition of combining mutation reader to source file
So that the whole world doesn't recompile when it changes.
2017-12-21 21:24:11 +01:00
Tomasz Grabiec
52285a9e73 mutation_reader: Use make_combined_reader() to create combined reader
So that we can hide the definition of combined_mutation_reader. It's
also less verbose.
2017-12-21 21:24:11 +01:00
Piotr Jastrzebski
2c1f0250c2 Migrate make_empty_mutation_source to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 21:17:46 +01:00