Commit Graph

850 Commits

Author SHA1 Message Date
Raphael S. Carvalho
11b74050a1 partitioned_sstable_set: fix quadratic space complexity
streaming generates lots of small sstables with large token range,
which triggers O(N^2) in space in interval map.
level 0 sstables will now be stored in a structure that has O(N)
in space complexity and which will be included for every read.

Fixes #2287.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
2017-04-18 13:04:38 +03:00
Tomasz Grabiec
d523c60629 sstables: Push fragments from mp_row_consumer so that parser is interrupted less
Currently we return proceed::no after every mutation_fragment which is
to be consumed. This froces parser to save and reload its state
often. This can be avoided if we pushed the fragments directly from
mp_row_consumer, then we would return proceed::no only when the buffer
fills up.

tests/perf/perf_fast_forward shows 15% increase in throughput of a large partition scan,
from 1.34M frag/s to 1.55M frag/s.

Message-Id: <1490882700-22684-1-git-send-email-tgrabiec@scylladb.com>
2017-04-05 18:10:54 +03:00
Avi Kivity
5b530aa464 Merge "Use promoted index for skipping in sstable mutation readers" from Tomasz
"sstable_streamed_mutation::fast_forward_to() is changed to use promoted index
(via index_reader) to optimize skipping in large partitions.

In addition to that, sstable mutation_reader is changed to use the index
to skip to the next partition.

Performance impact was evaluated using newly added tests/perf/perf_fast_forward

What's beyond this series:

  - Using index_reader for single-partition reads as well

  - Using index_reader for skipping across ranges in clustering restrictions"

* tag 'tgrabiec/skip-within-partition-using-index-v2' of github.com:cloudius-systems/seastar-dev: (47 commits)
  tests: Add performance test for fast forwarding of sstable readers
  tests: Allow starting cql_test_env on pre-existing data
  config: Allow specifying source when setting value
  tests: sstable: Add test for fast forwarding within partition using index
  sstables: sstable_streamed_mutation: use index in fast_forward_to()
  sstables: Store parsed promoted index in index_entry
  sstables: Add trace-level logging for sstable consumption
  sstables: Define deletion_time earlier
  sstables: Make parsing throw exception on malformed promoted index
  tests: Add tests for ordering of position_in_partition relative to composites
  position_range: Introduce all_clustered_rows() factory method
  position_in_partition: Introduce for_key()/after_key() factory methods
  position_in_partition: Add factory methods for positions around all rows
  position_in_partition: Introduce for_range_start()/for_range_end()
  position_in_partition: Fix friendship declaration
  keys: Introduce is_empty() for prefixes
  position_in_partition: Make comparable with composites
  types: Enhance lexicographical comparators
  compound_compat: Accept marker value in serialize_value()
  compound_compat: Add trichotomic comparator
  ...
2017-03-29 19:01:12 +03:00
Raphael S. Carvalho
023031b0c8 compaction: lcs: fix functionality to feed starved levels
quick introduction to level starvation:
high levels may be left uncompacted (thus starved) for a long time if user
makes something that make they contain little data, such as cleanup or change
of max sstable size (default 160M). Leveled strategy handles this problem as
follow: consider we're compacting L1 to L2. If L3 is starved, we look for one
of its sstable that is fully contained in token range of candidates L1->L2,
so that we won't end up with an overlapping in L2.

now the problem:
the functionality isn't working properly now because range of candidates is
being incorrectly calculated due to an accident when converting the code to
C++. It won't cause an overlap because it's actually being more restrictive
about which sstable from starved level can be used.

A test case was added to confirm the problem.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170328223753.15398-1-raphaelsc@scylladb.com>
2017-03-29 18:59:46 +03:00
Tomasz Grabiec
3fbc0bed6e sstables: sstable_streamed_mutation: use index in fast_forward_to() 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
5b36976bf0 sstables: Store parsed promoted index in index_entry 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
a2a8312c78 sstables: Add trace-level logging for sstable consumption 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
5af815bf20 sstables: Define deletion_time earlier 2017-03-28 18:34:55 +02:00
Tomasz Grabiec
5e34743882 sstables: Make parsing throw exception on malformed promoted index
Will be easier to propagate failure to upper layers once parsing is
reused in the index_reader.

The old behavior of ignoring parsing failures is preserved, but the
error is logged now.
2017-03-28 18:34:55 +02:00
Tomasz Grabiec
18a057aa81 compound_compat: Return composite from serialize_value()
To make the code more type-safe. Also, mark constructor from bytes
explicit.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
123b102dd6 sstables: Skip to next partition using index
Slicing front of a very large partition:

Before:

 offset  read      time [s]     frags     frag/s    aio      [KiB] blocked dropped    cpu
 0       1         0.110960         1          9    992     126956     924       0  92.4%

After:

 offset  read      time [s]     frags     frag/s    aio      [KiB] blocked dropped    cpu
 0       1         0.000784         1       1276      3        344       2       1  37.3%
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
a9252dfc58 sstables: Use separate index readers for lower and upper bounds
So that lower bound can be advanced within the range.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
27d86dfe18 sstables: Enable skipping to cells at data_consume_context level 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
aad943523a sstables: index_reader: Add trace-level logging 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
388315c1ff sstables: Expose index metrics 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
1dbd2e239e sstables: index_reader: Share index lists among other index readers
Direct motivation for this is to be able to use two index readers from
a single mutation reader, one for lower bound of the range and one for
the upper bound of the range, without sacrificing optimization of
avoiding index reads when forwarding to partition ranges which are
close by. After the change, all index readers of given sstable will
share index buffers, so lower bound reader can reuse the page read by
the upper bound reader.

The reason for using two readers will be so that we are able to skip
inside the partition range, not only outside of it. This is not
possible if we use the same index reader to locate the upper bound of
the range, because we may only advance the cursor.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
0635d74e17 sstables: Make index_entry copyable
Needed to make the index_list copyable, which is going to be needed to
implement legacy get_index_entries() which returns by value, after
index sharing is implemented.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
e36979da47 sstables: index_reader: Use sstable's schema
Makes for a simpler interface.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
e3e2f037bb sstables: index_reader: Refactor around the concept of a cursor
Index reader already can be queried only with monotonic positions, so
the concept of a cursor is ingrained. Making it explicit will make it easier
to define behavior for forwarding withing the partition.

After the change:

 - lower_bound() is renamed to advance_to() and doesn't return
   the position, only advances the cursor

 - data file position for partition under cursor can be obtained
   at any time with data_file_position()
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
27862fa8f6 sstables: index_reader: Narrow down summary range during lookup
Positions passed to lower_bound() must be non-decreasing, so summary
indexes as well.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
02ace99798 sstables: index_reader: Change lookup to work on ring_position_view
In preparation for changing the interface to work not only with ranges.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
5edb427873 sstables: Remove private constructor
To reduce duplication.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
705bd6da1a sstables: Remove unused method 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
d5e704ca1e sstables: Make key_view constructor from bytes_view explicit 2017-03-28 18:10:39 +02:00
Raphael S. Carvalho
6b6bb38f38 compaction_manager: stop manager after storage io error
Manager will stop itself if a compaction fails due to storage io
error, which unconditionally results in stop of transportation
services.

Fixes #2147.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170316054538.23423-1-raphaelsc@scylladb.com>
2017-03-16 10:37:47 +02:00
Tomasz Grabiec
1f1b516b31 sstables: Remove use of forwarding wrapper 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
d7afab21e7 sstables: Implement sstable_streamed_mutation::fast_forward_to()
Handling of forwarding is done inside mp_row_consumer, because it
allows us to filter out irrelevant data sooner and thus more
efficiently.

Becuase static row can be now skipped as well, _skip_clustering_row
was renamed to more generic _skip_in_progress.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
4750216387 sstables: Extract and use clustering_ranges_walker
Extracted from mp_row_consumer.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
124dde30db sstables: Extract writer parameters into config objects
Also enables users to change the default promoted index block size.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
01374c41f2 sstables: Move workaround for out-of-order range tombstones to mp_row_consumer
This is a preliminary step before adding support for fast-forwarding
to mp_row_consumer, so that range handling can be solely in
mp_row_consumer rather than split between it and
sstable_streamed_mutation.

This also alleviates #2080 by reading all tombstones only up to the
first row, after that range tombstones are treated like other
fragments.
2017-03-10 14:42:22 +01:00
Tomasz Grabiec
d41a7c5eb4 sstables: Drop default mp_row_consumer constructor 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
56f1ad7841 sstables: Swap order of values in "proceed" so that "no" is assigned 0 2017-03-10 14:42:22 +01:00
Tomasz Grabiec
084747b1ee sstables: streamed_mutation: Stop reading when end of slice reached
As part of this change, skip detection detection is refactored. This
simplifies reasoning about mp_row_consumer's state a bit because now
is_mutation() is not reset externally and only depends on current
position of the reader.

It will prove useful when we extend mutation reader to decide if it
should skip to the next partition up front before calling
_context.read(), so that we can for instance skip using index instead.

Fixes #2088.
2017-03-10 14:42:19 +01:00
Tomasz Grabiec
55358cacc5 sstables: Switch is_in_range() to position_in_partition
Makes it immune to #1446 and is a prerequisite for implementing
forwarding in mp_row_consumer.
2017-03-09 21:15:11 +01:00
Nadav Har'El
506e074ba4 sstable decompression: fix skip() to end of file
The skip() implementation for the compressed file input stream incorrectly
handled the case of skipping to the end of file: In that case we just need
to update the file pointer, but not skip anywhere in the compressed disk
file; In particular, we must NOT call locate() to find the relevant on-disk
compressed chunk, because there is none - locate() can only be called on
actual positions of bytes, not on the one-past-end-of-file position.

Fixes #2143

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170308100057.23316-1-nyh@scylladb.com>
2017-03-08 12:35:05 +02:00
Avi Kivity
1b5ba63676 sstable: fix unhandled exception in atomic_deletion_manager::delete_atomically()
The current code is assymetric: the first N-1 shards to delete a set receive
a synthetic future to wait on, while the last deletion receives the result
of the delete operation (which also broadcasts completion to the first N-1
operations.  This results, in case of an error, with the Nth future being
reported as an unhandled error.

Fix by making everything symmetric: all N callers receive a synthetic
future.  Nobody waits for the deletion operation (which still broadcasts its
completion to all waiters, so errors are not lost).

Message-Id: <20170305151607.14264-1-avi@scylladb.com>
2017-03-07 12:41:12 +02:00
Paweł Dziepak
5d66031b7a sstable: make input_stream_history initializers in-class
sstable has two constructors but only one of them was creating input
stream history objects.
Message-Id: <20170227151734.16928-1-pdziepak@scylladb.com>
2017-02-28 09:22:11 +01:00
Paweł Dziepak
0198d8e470 Merge "Introduce streamed_mutation::fast_forward_to()" from Tomasz
"This introduces an API which allows forward navigation in a stream of mutation
fragments. It allows one to consume only a subset of the stream by iteratively
specifying sub-ranges from which fragments should be returned.

API outline:

  When in forwarding mode, the stream does not return all fragments right away,
  but only those belonging to the current range. Initially current range only
  covers the static row. The stream can be forwarded, even before reaching end-
  of-stream for current range, to a later range with fast_forward_to().
  Forwarding doesn't change initial restrictions of the stream, it can only be
  used to skip over data.

  Monotonicity of positions is preserved by forwarding. That is fragments
  emitted after forwarding will have greater positions than any fragments
  emitted before forwarding.

  For any range, all range tombstones relevant for that range which are present
  in the original stream will be emitted. Range tombstones emitted before
  forwarding which overlap with the new range are not necessarily re-emitted.

  When not in forwarding mode, the stream acts as if the current range was equal
  to the full range. This implies that fast_forward_to() cannot be
  used.

  Whether stream is in forwarding mode or not is specified when the stream
  is created, typically via mutation_source interface.

What's left for later series:

  Optimization by providing specialized implementations. This series implements
  forwarding support in all mutation sources via generic wrapper which simply
  drops fragments."

* tag 'tgrabiec/clustering-fast-forward-to-v2' of github.com:scylladb/seastar-dev:
  tests: mutation_source_tests: Verify monotonicty of positions
  tests: random_mutation_generator: Spread the keys more
  tests: mutation_source_test: Make blobs more easily distinguishable
  tests: streamed_mutation: Test that merged stream passes mutation source tests
  tests: mutation_source_test: Add tests for forwarding of streamed_mutation
  tests: streamed_mutation_assertions: Add methods for navigating the stream
  tests: Add range generators to random_mutation_generator
  partition_slice_builder: Add with_ranges()
  query: Introduce full_clustering_range
  streamed_mutation: Add non-owning variant of mutation_from_streamed_mutation()
  db: Enable creating forwardable readers via mutation_source
  mutation_source: Document liveness requirements
  mutation_source: Cleanup
  db: Replace virtual_reader_type with mutation_source_opt
  partition_version: Refactor make_partition_snapshot_reader() overloads
  database: Fix mutation_source created by as_mutation_source() to not ignore trace_state_ptr
  memtable: Accept all mutation_source parameters
  streamed_mutation: Implement fast_forward_to() in stream merger
  streamed_mutation: Add generic implementation of forwardable streamed_mutation
  streamed_mutation: Add fast_forward_to() API
  position_in_partition: Introduce position_range
  position_in_partition: Introduce position constructor for right after the static row
  streamed_mutation: Make cast to view non-explicit
  streamed_mutation: Make schema() getter non-copying
2017-02-24 10:37:51 +00:00
Tomasz Grabiec
892d4a2165 db: Enable creating forwardable readers via mutation_source
Right now all mutation source implementations will use
make_forwardable() wrapper.
2017-02-23 18:50:44 +01:00
Gleb Natapov
0977f4fdf8 sstable: close sstable_writer's file if writing of sstable fails.
Failing to close a file properly before destroying file's object causes
crashes.

[tgrabiec: fixed typo]

Message-Id: <20170221144858.GG11471@scylladb.com>
2017-02-21 18:17:47 +01:00
Tomasz Grabiec
33457cc9a9 sstables: Fix detection of repeated tombstones
The check was not catching range tombstone repeated immediately after
itself.
Message-Id: <1487596098-17409-1-git-send-email-tgrabiec@scylladb.com>
2017-02-20 15:35:15 +00:00
Tomasz Grabiec
cc439df542 Revert "sstables: Simplify sstable_streamed_mutation::read_next()"
This reverts commit 1e2c01ff49.

We do not detect repeated tombstone if it follows an in-range
tombstone following a skipped clustering row, because _in_progress
will be disengaged after such tombstone is emitted.
Message-Id: <1487596080-21480-1-git-send-email-tgrabiec@scylladb.com>
2017-02-20 15:34:58 +00:00
Raphael S. Carvalho
53d9008052 sstables/deletion_manager: kill dead code
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <2b24d9e622238030a737fbbe12b8439853d5d075.1487095059.git.raphaelsc@scylladb.com>
2017-02-16 18:38:54 +02:00
Tomasz Grabiec
1e2c01ff49 sstables: Simplify sstable_streamed_mutation::read_next()
mp_row_consumer doesn't split row fragments on repeated range
tombstones any more.
2017-02-13 16:12:16 +01:00
Tomasz Grabiec
6324876f24 sstables: Emit only relevant range tombstones 2017-02-13 16:12:16 +01:00
Paweł Dziepak
354ce0b2c7 mutation_fragment: make write access more explicit
mutation_fragments are going to be caching their size in memory. In
order to be able to invalidate that correctly, they need to know when
that size may change (but avoid invalidation when it is not necessary).
2017-02-09 10:49:46 +00:00
Paweł Dziepak
83c6fc1114 sstables: write counter cells 2017-02-02 10:35:14 +00:00
Paweł Dziepak
5905729c4a sstables: read counter cells 2017-02-02 10:35:14 +00:00
Tomasz Grabiec
6c75614d19 sstables: Fix input_stream not being closed by index_reader
Fixes #2022
Message-Id: <1484912679-5729-1-git-send-email-tgrabiec@scylladb.com>
2017-01-20 11:58:33 +00:00
Paweł Dziepak
19ad35610b sstables: do not discard future returned by fast_forward_to()
continuous_data_consumer::fast_forward_to() returns a future which was
later ignored by data_consume_context::fast_forward_to().

With the current implementation, the future in question is always ready
and that's why the problem didn't manifest itself in the form of crashes
or invalid results.
Message-Id: <20170120105746.7300-1-pdziepak@scylladb.com>
2017-01-20 12:22:17 +01:00