Commit Graph

201 Commits

Author SHA1 Message Date
Piotr Jastrzebski
11a354b144 Introduce sstable::read_row_flat
This will be used together with sstables::read_range_rows
to migrate sstables::as_mutation_source().

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
65c6f339d6 Delete sstable_streamed_mutation
It's no longer used so can be safely removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
e241b0c2de Stop using streamed_mutation in sstable_data_source
Use a partition_header instead.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:26:54 +01:00
Piotr Jastrzebski
375c321e9d Stop using streamed_mutation in consumer and reader
Don't use streamed_mutation in mp_row_consumer
and sstable_mutation_reader.

Also use sstable_mutation_reader in sstable::read_row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-20 16:22:57 +01:00
Piotr Jastrzebski
f7bf782a41 Store sstable_mutation_reader pointer in mp_row_consumer
The reader will be used by mp_row_consumer instead of streamed_mutation
in next patches.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
145fcf846e Move advance_to_upper_bound above sstable_mutation_reader
It will be used in sstable_mutation_reader when the reader
will be used to implement sstable::read_row.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
1c7938c44d Replace "sm" with "partition" in get_next_sm and on_sm_finished
Streamed mutation won't be used any more so get_next_partition
and on_partition_finished are more suitable names.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
4943f52ad7 Remove unused sstable_mutation_reader constructor
The constructor is never used so it can be safely removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
c7971eb8e3 Move mp_row_consumer methods implementations to the bottom
Those methods have to be below sstable_mutation_reader because
they will be using the reader instead of streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:28 +01:00
Piotr Jastrzebski
537b42e153 Turn sstable_mutation_reader into a flat_mutation_reader
This is the first step which still uses streamed_mutation.
Next step will be to get rid of streamed_mutation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-16 22:06:00 +01:00
Piotr Jastrzebski
74f0c01865 Add sstables::read_rows_flat and sstables::read_range_rows_flat
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 15:33:23 +01:00
Piotr Jastrzebski
ea449c9cce Replace sstables::mutation_reader with ::mutation_reader
This will make migration to flat_mutation_reader much
easier and sstables::mutation_reader is going away with
this migration anyway.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 10:40:01 +01:00
Piotr Jastrzebski
228f0737f4 Reduce dependencies from mp_row_consumer to sstable_streamed_mutation
Before this patch mp_row_consumer was using sstable_streamed_mutation
in two ways:

1. Populate sstable_streamed_mutation's buffer with mutation_fragments
2. Advance sstable_streamed_mutation's sstable_data_source to new position.

We can easily reduce those dependencies only to the first one.
This will reduce the coupling between those classes and simplify
the flow of execution.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-11-15 10:40:01 +01:00
Duarte Nunes
baeec0935f Replace query::full_slice with schema::full_slice()
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.

Refs #2885

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
2017-10-17 11:25:53 +02:00
Botond Dénes
dead2617ce mp_row_consumer: remove unnecessary _reasource_tracker member
Leftovers from a43901f84.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <88237d9cd97feeca47e12ec4af89c90f1a3a6bb5.1507535176.git.bdenes@scylladb.com>
2017-10-09 10:59:40 +03:00
Botond Dénes
a43901f842 row_consumer: de-virtualize io_priority() and resource_tracker()
Fixes #2830

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <448a1f739ab8c88a7a5562bce8dce5ae6efdf934.1507302530.git.bdenes@scylladb.com>
2017-10-06 18:50:12 +01:00
Botond Dénes
47e07b787e restricted_mutation_reader: restrict based-on memory consumption
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
2017-10-03 12:44:12 +03:00
Avi Kivity
78eae8bf48 Revert "Merge "Make restricting_mutation_reader more accurate" from Botond"
This reverts commit c6e5dcc556, reversing
changes made to 19b21a0ab2. Failes to build,
plus author has more changes.
2017-10-03 11:58:59 +03:00
Botond Dénes
33e97e7457 restricted_mutation_reader: restrict based-on memory consumption
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
2017-09-20 11:14:35 +03:00
Paweł Dziepak
3e1d09e71d sstables: do not expect counter shards to be sorted 2017-09-05 10:32:48 +01:00
Tomasz Grabiec
65e488c150 sstables: Fix abort in mutation reader for certain skip pattern
The problem happens for the following sequence of events:

 1) reader stops in the middle of some partition before it
    skips to another partition range

 2) reader is fast forwarded to a partition range which has no data in
    the sstable. There are some partitions between the previous
    partition range and the one we skip to

 3) the reader is asked for next partition

The problem was that mutation_reader::fast_forward_to() was putting
the reader in _read_enabled == false state in step 2, but
data_consume_context was not fast forwarded to the range. When in step
3 we were asked for the next partition, we attempted to skip using
index (because of 1). The result of the skip was some position which
is outside of the current range of data_consume_context, which causes
it to abort. To fix, add a check for _read_enabled before we try to
skip.
2017-08-28 10:28:15 +02:00
Tomasz Grabiec
dc3c8863f3 sstables: Fix reader returning partition past the query range in some cases
If index was used to skip to the next partition (because the current
partition wasn't consumed in full) and reader's partition range ends
before the data file ends, we did not detect that we're out of range
before returning a streamed_mutation. Fix by checking _context.eof()
before doing that.

Refs #2733.
2017-08-28 10:16:27 +02:00
Paweł Dziepak
7b0f75c0d1 sstables: avoid indirect calls to abstract_type::is_multi_cell() 2017-07-26 14:38:27 +01:00
Paweł Dziepak
28c105e4a7 sstables: avoid copying key components 2017-07-26 14:38:27 +01:00
Paweł Dziepak
e0a04cb7fe sstables: make sure that fill_buffer() actually fills buffer
streamed_mutation::impl::fill_buffer() is supposed to either push
mutation fragments to the buffer or set EOS flag. However, it was
possible that mp_row_consumer would return proceed::no if a skip was
needed without satisfying any of these conditions.
2017-07-26 14:36:36 +01:00
Tomasz Grabiec
a9237c1666 schema: Revert back to the 1.7 layout of static compact tables in memory
We are using C* 3.x compatible layout in schema tables but want to
keep using the 1.7 layout in memory for compatibility during rolling
upgrade. This patch switches the schema and schema_builder classes
back to the old layout. Translation of layout happens when converting
to/from schema mutations.

Notable changes:

 1) Includes a revert of commit 6260f31e08
    "thrift: Update CQL mapping of static CFs".

 2) Brings back the "default_validation_class" schema attribute. In v3
    it can be dervied from column definitions, but in v2 it can't, so
    we have to store it.

 3) legacy_schema_migrator and schema_builder don't have to do
    conversions to v3, this is now handled by the v3_columns
    class. schema_builder works with the same layout as schema, that
    is v2.

 4) Includes a revert of commit 66991a7ccb
    "v3 schema test fixes"

Fixes #2555.
2017-07-19 09:52:15 +02:00
Nadav Har'El
3018df11b5 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::fowarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
2017-06-19 18:31:32 +03:00
Avi Kivity
6e2c9ef9fb Revert "Allow reading exactly desired byte ranges and fast_forward_to"
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2).  It causes crashes
during range scans, reported by Gleb:

"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.

Backtrace:
    at /home/gleb/work/seastar/seastar/core/apply.hh:36
    rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
    range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
    at ./seastar/core/future.hh:890
    at /home/gleb/work/seastar/seastar/core/future-util.hh:119
    at /home/gleb/work/seastar/seastar/core/future-util.hh:142
2017-06-18 16:10:21 +03:00
Nadav Har'El
317d7fc253 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::fowarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exception are functions which are known
   to read only a single partition, and not support fast_forward_to() a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
2017-06-15 13:22:46 +01:00
Piotr Jastrzebski
6528f3a963 Make sure mutation_reader for sstables can be fast-forwarded
Fixes #2145.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec: Extracted from a series, fixed title]
Message-Id: <1495639745-19387-1-git-send-email-tgrabiec@scylladb.com>
2017-05-24 16:36:24 +01:00
Paweł Dziepak
3ecceaee48 Merge "Fix fast_forward_to() on sstable reader being ignored in some cases" from Tomasz
"When mutation reader enters the partition using index,
streamed_mutation object is returned to the user before the row start
fragment is processed. In that case, when we process the row start, we
should ignore it and not call setup_for_partition() again. That may
override user's fast_forward_to() request."

* 'tgrabiec/fix-initial-fast-forward-to-for-single-key-sstable-readers' of github.com:scylladb/seastar-dev:
  tests: mutation_source_test: Test forwarding in single-key readers
  sstables: Remove unused code
  sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases
  sstables: Fix verify_end_state() to tolerate ATOM_START_2 state
2017-05-17 15:35:30 +01:00
Tomasz Grabiec
e07cc44af2 sstables: Remove unused code 2017-05-16 13:31:01 +02:00
Tomasz Grabiec
0e23f8aa9b sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases
When mutation reads enters the partition using index,
streamed_mutation object is returned to the user before the row start
fragment is processed. In that case, when we process the row start, we
should ignore it and not call setup_for_partition() again. That may
override user's fast_forward_to() request.
2017-05-16 13:31:01 +02:00
Calle Wilund
6c8b5fc09d schema_tables: Use v3 schema tables and formats
Switches system/schema_* for system_schema/*, updates schema/schema
builder and uses to hold/expect v3 style info (i.e. types & dropped).
2017-05-10 16:44:48 +00:00
Tomasz Grabiec
e56711a54d sstables: mutation_reader: Avoid reading index when restrictions cover whole partition
The check for is_static_row() used to be enough, but it no longer is
after optimization made in commit 3e06065, which avoids reading the
static row.
Message-Id: <1494241164-25810-1-git-send-email-tgrabiec@scylladb.com>
2017-05-09 11:03:18 +01:00
Duarte Nunes
d45596ae8e sstables: Read and write shadowable tombstones
This patch serializes shadowable tombstones to sstables by adding a
new, incompatible atom's mask.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-04-25 11:46:33 +02:00
Tomasz Grabiec
3472a74de4 sstables: Remove unused code 2017-04-20 11:23:05 +02:00
Tomasz Grabiec
c1059ca8e4 sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition
It's cheaper than a key-based lookup, so use it when we can.
2017-04-20 11:23:05 +02:00
Tomasz Grabiec
4742008b70 sstables: mutation_reader: Use index_reader for single-partition reads
This switches single-partition query to use the index_reader
infrastructure. Index lookups via index_reader are faster than
find_disk_ranges().

perf_fast_forward, rows: 1000000, value size: 100

Before:

  Testing forwarding with clustering restriction in a large partition:
  pk-scan   time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  no        0.002182         2        916      3        152       2       0        0        1        1  88.1%

After:

  Testing forwarding with clustering restriction in a large partition:
  pk-scan   time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  no        0.000758         2       2639      3        152       2       0        0        1        1  48.6%

This is also a cleanup, a step towards converting all code to use the
index_reader.
2017-04-20 11:23:05 +02:00
Tomasz Grabiec
9d8795089d sstables: mutation_reader: Add trace-level logging 2017-04-20 11:18:55 +02:00
Tomasz Grabiec
b198c31c46 sstables: mutation_reader: Move partition reading code to sstable_data_source
It will be reused for read_row(), which does't create mutation_reader
instance, only sstable_data_source.
2017-04-20 11:18:26 +02:00
Tomasz Grabiec
6e4bca0be6 sstables: mutation_reader: Move definitions out of the class body
To make further refactoring easier to review. No functional changes here.
2017-04-20 10:54:38 +02:00
Tomasz Grabiec
4ed7e529db sstables: Move binary_search() to a header
There are instantiations of binary_search() used in sstables.cc, but
defined in partition.cc. The instantiations are explicitly declared in
partition.cc, but the types changed and they became obsolete. The
thing worked because partition.cc also instantiated it with the right
type. But after that code will be removed, it no longer would, and we
would get a linker error. To avoid such problems, define
binary_search() in a header.
2017-04-20 10:54:38 +02:00
Tomasz Grabiec
3e8795494e sstables: mutation_reader: Advance to next partition using index in some cases
To produce a streamed_mutation for the next partition, we need to read
its key and the tombstone. Currently we always do that by consuming
the partition header from the data file. In some cases that may cause
unnecessary IO.

It's better to obtain partition information from the index if we
already have it. We can save on IO if the user will skip past the
front of partition immediately after.

It is also better to pay the cost of reading the index if we know that
we will need to use the index anyway soon. This patch predicts that by
checking if there are any clustering restrictions. If there are any,
we will almost surely need_skip() and use the index anyway.

This change also lays the ground for unification of multi and single
partiton queries without loss of performance.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
ae72c159b1 sstables: index_reader: Introduce promoted_index_view
So that we have a nice way of extracting tombstone out of it. We not
always need fully parsed index.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
0ef33b7f29 sstables: mutation_reader: Move _index_in_current to sstable_data_source
sstable_data_source holds a shared state between mutation_reader and
streamed_mutation for sstables. The information whether index is in
current partition will have to be accessed by both in the following
patches.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
885f53d905 sstables: mutation_reader: Avoid resetting the walker
Before the change, the following scenario was happening:

  1) we try to skip based on clustering restrictions
  2) we find the page and fast forward to it, recording walker's
     lower bound counter
  3) we read the first fragment, it's not a tombstone, so we reset the walker,
     and its lower bound counter too
  4) the fragment is not in range (the range starts in the middle of the page)
  5) needs_skip() is true, we redo the index lookup, which wastes some CPU

This change fixes the problem by avoiding resetting the walker. We can
do that because leading tombstones are checked with a non-mutable
contains_tombstone()
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
b030ce693d sstables: mutation_reader: Don't try to read index to skip to static row
Static row is always at the beginning, there's no point in doing
index lookups.
2017-04-20 10:54:37 +02:00
Tomasz Grabiec
3e060659f1 sstables: mutation_reader: Don't try to read static row if table doesn't have any 2017-04-20 10:54:37 +02:00
Tomasz Grabiec
77d3e30239 sstables: mutation_reader: Use index to skip across clustering restrictions
Improves scans with clustering restrictions. Before the change such
scans would scan whole partition.

Below are results of a test case from perf_fast_forward which selects
few rows from a large partition using query restrictions (not fast forwarding).

Before:

  stride  rows      time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  1000000 1         0.000609         1       1642      3        152       2       1        0        1        1  38.0%
  500000  2         0.242255         2          8    511      64152     398       4        0        1        1  98.6%
  250000  4         0.281592         4         14    749      95832     564       4        0        1        1  98.4%
  125000  8         0.328056         8         24    873     111704     657       4        0        1        1  98.4%
  62500   16        0.306700        16         52    935     119640     751       4        0        1        1  99.4%

After:

  stride  rows      time [s]     frags     frag/s    aio      [KiB] blocked dropped  idx hit idx miss  idx blk    cpu
  1000000 1         0.000711         1       1406      3        152       2       1        0        1        1  42.1%
  500000  2         0.000910         2       2197      5        216       3       2        0        1        1  39.2%
  250000  4         0.001384         4       2891      9        344       5       4        0        1        1  35.3%
  125000  8         0.003197         8       2502     21        728      13       8        0        1        1  53.1%
  62500   16        0.006664        16       2401     41       1368      25      16        0        1        1  58.2%
2017-04-20 10:54:37 +02:00