streaming generates lots of small sstables with large token range,
which triggers O(N^2) in space in interval map.
level 0 sstables will now be stored in a structure that has O(N)
in space complexity and which will be included for every read.
Fixes#2287.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
Currently we return proceed::no after every mutation_fragment which is
to be consumed. This froces parser to save and reload its state
often. This can be avoided if we pushed the fragments directly from
mp_row_consumer, then we would return proceed::no only when the buffer
fills up.
tests/perf/perf_fast_forward shows 15% increase in throughput of a large partition scan,
from 1.34M frag/s to 1.55M frag/s.
Message-Id: <1490882700-22684-1-git-send-email-tgrabiec@scylladb.com>
"sstable_streamed_mutation::fast_forward_to() is changed to use promoted index
(via index_reader) to optimize skipping in large partitions.
In addition to that, sstable mutation_reader is changed to use the index
to skip to the next partition.
Performance impact was evaluated using newly added tests/perf/perf_fast_forward
What's beyond this series:
- Using index_reader for single-partition reads as well
- Using index_reader for skipping across ranges in clustering restrictions"
* tag 'tgrabiec/skip-within-partition-using-index-v2' of github.com:cloudius-systems/seastar-dev: (47 commits)
tests: Add performance test for fast forwarding of sstable readers
tests: Allow starting cql_test_env on pre-existing data
config: Allow specifying source when setting value
tests: sstable: Add test for fast forwarding within partition using index
sstables: sstable_streamed_mutation: use index in fast_forward_to()
sstables: Store parsed promoted index in index_entry
sstables: Add trace-level logging for sstable consumption
sstables: Define deletion_time earlier
sstables: Make parsing throw exception on malformed promoted index
tests: Add tests for ordering of position_in_partition relative to composites
position_range: Introduce all_clustered_rows() factory method
position_in_partition: Introduce for_key()/after_key() factory methods
position_in_partition: Add factory methods for positions around all rows
position_in_partition: Introduce for_range_start()/for_range_end()
position_in_partition: Fix friendship declaration
keys: Introduce is_empty() for prefixes
position_in_partition: Make comparable with composites
types: Enhance lexicographical comparators
compound_compat: Accept marker value in serialize_value()
compound_compat: Add trichotomic comparator
...
quick introduction to level starvation:
high levels may be left uncompacted (thus starved) for a long time if user
makes something that make they contain little data, such as cleanup or change
of max sstable size (default 160M). Leveled strategy handles this problem as
follow: consider we're compacting L1 to L2. If L3 is starved, we look for one
of its sstable that is fully contained in token range of candidates L1->L2,
so that we won't end up with an overlapping in L2.
now the problem:
the functionality isn't working properly now because range of candidates is
being incorrectly calculated due to an accident when converting the code to
C++. It won't cause an overlap because it's actually being more restrictive
about which sstable from starved level can be used.
A test case was added to confirm the problem.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170328223753.15398-1-raphaelsc@scylladb.com>
Will be easier to propagate failure to upper layers once parsing is
reused in the index_reader.
The old behavior of ignoring parsing failures is preserved, but the
error is logged now.
Direct motivation for this is to be able to use two index readers from
a single mutation reader, one for lower bound of the range and one for
the upper bound of the range, without sacrificing optimization of
avoiding index reads when forwarding to partition ranges which are
close by. After the change, all index readers of given sstable will
share index buffers, so lower bound reader can reuse the page read by
the upper bound reader.
The reason for using two readers will be so that we are able to skip
inside the partition range, not only outside of it. This is not
possible if we use the same index reader to locate the upper bound of
the range, because we may only advance the cursor.
Needed to make the index_list copyable, which is going to be needed to
implement legacy get_index_entries() which returns by value, after
index sharing is implemented.
Index reader already can be queried only with monotonic positions, so
the concept of a cursor is ingrained. Making it explicit will make it easier
to define behavior for forwarding withing the partition.
After the change:
- lower_bound() is renamed to advance_to() and doesn't return
the position, only advances the cursor
- data file position for partition under cursor can be obtained
at any time with data_file_position()
Handling of forwarding is done inside mp_row_consumer, because it
allows us to filter out irrelevant data sooner and thus more
efficiently.
Becuase static row can be now skipped as well, _skip_clustering_row
was renamed to more generic _skip_in_progress.
This is a preliminary step before adding support for fast-forwarding
to mp_row_consumer, so that range handling can be solely in
mp_row_consumer rather than split between it and
sstable_streamed_mutation.
This also alleviates #2080 by reading all tombstones only up to the
first row, after that range tombstones are treated like other
fragments.
As part of this change, skip detection detection is refactored. This
simplifies reasoning about mp_row_consumer's state a bit because now
is_mutation() is not reset externally and only depends on current
position of the reader.
It will prove useful when we extend mutation reader to decide if it
should skip to the next partition up front before calling
_context.read(), so that we can for instance skip using index instead.
Fixes#2088.
The skip() implementation for the compressed file input stream incorrectly
handled the case of skipping to the end of file: In that case we just need
to update the file pointer, but not skip anywhere in the compressed disk
file; In particular, we must NOT call locate() to find the relevant on-disk
compressed chunk, because there is none - locate() can only be called on
actual positions of bytes, not on the one-past-end-of-file position.
Fixes#2143
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170308100057.23316-1-nyh@scylladb.com>
The current code is assymetric: the first N-1 shards to delete a set receive
a synthetic future to wait on, while the last deletion receives the result
of the delete operation (which also broadcasts completion to the first N-1
operations. This results, in case of an error, with the Nth future being
reported as an unhandled error.
Fix by making everything symmetric: all N callers receive a synthetic
future. Nobody waits for the deletion operation (which still broadcasts its
completion to all waiters, so errors are not lost).
Message-Id: <20170305151607.14264-1-avi@scylladb.com>
"This introduces an API which allows forward navigation in a stream of mutation
fragments. It allows one to consume only a subset of the stream by iteratively
specifying sub-ranges from which fragments should be returned.
API outline:
When in forwarding mode, the stream does not return all fragments right away,
but only those belonging to the current range. Initially current range only
covers the static row. The stream can be forwarded, even before reaching end-
of-stream for current range, to a later range with fast_forward_to().
Forwarding doesn't change initial restrictions of the stream, it can only be
used to skip over data.
Monotonicity of positions is preserved by forwarding. That is fragments
emitted after forwarding will have greater positions than any fragments
emitted before forwarding.
For any range, all range tombstones relevant for that range which are present
in the original stream will be emitted. Range tombstones emitted before
forwarding which overlap with the new range are not necessarily re-emitted.
When not in forwarding mode, the stream acts as if the current range was equal
to the full range. This implies that fast_forward_to() cannot be
used.
Whether stream is in forwarding mode or not is specified when the stream
is created, typically via mutation_source interface.
What's left for later series:
Optimization by providing specialized implementations. This series implements
forwarding support in all mutation sources via generic wrapper which simply
drops fragments."
* tag 'tgrabiec/clustering-fast-forward-to-v2' of github.com:scylladb/seastar-dev:
tests: mutation_source_tests: Verify monotonicty of positions
tests: random_mutation_generator: Spread the keys more
tests: mutation_source_test: Make blobs more easily distinguishable
tests: streamed_mutation: Test that merged stream passes mutation source tests
tests: mutation_source_test: Add tests for forwarding of streamed_mutation
tests: streamed_mutation_assertions: Add methods for navigating the stream
tests: Add range generators to random_mutation_generator
partition_slice_builder: Add with_ranges()
query: Introduce full_clustering_range
streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation()
db: Enable creating forwardable readers via mutation_source
mutation_source: Document liveness requirements
mutation_source: Cleanup
db: Replace virtual_reader_type with mutation_source_opt
partition_version: Refactor make_partition_snapshot_reader() overloads
database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
memtable: Accept all mutation_source parameters
streamed_mutation: Implement fast_forward_to() in stream merger
streamed_mutation: Add generic implementation of forwardable streamed_mutation
streamed_mutation: Add fast_forward_to() API
position_in_partition: Introduce position_range
position_in_partition: Introduce position constructor for right after the static row
streamed_mutation: Make cast to view non-explicit
streamed_mutation: Make schema() getter non-copying
Failing to close a file properly before destroying file's object causes
crashes.
[tgrabiec: fixed typo]
Message-Id: <20170221144858.GG11471@scylladb.com>
This reverts commit 1e2c01ff49.
We do not detect repeated tombstone if it follows an in-range
tombstone following a skipped clustering row, because _in_progress
will be disengaged after such tombstone is emitted.
Message-Id: <1487596080-21480-1-git-send-email-tgrabiec@scylladb.com>
mutation_fragments are going to be caching their size in memory. In
order to be able to invalidate that correctly, they need to know when
that size may change (but avoid invalidation when it is not necessary).
continuous_data_consumer::fast_forward_to() returns a future which was
later ignored by data_consume_context::fast_forward_to().
With the current implementation, the future in question is always ready
and that's why the problem didn't manifest itself in the form of crashes
or invalid results.
Message-Id: <20170120105746.7300-1-pdziepak@scylladb.com>