Commit Graph

166 Commits

Author SHA1 Message Date
Avi Kivity
48b9e47f7d Revert "row_cache: Add missing handling for failures happening outside the updating thread"
This reverts commit f9feb310ab (requested by author).
2017-08-29 19:26:02 +03:00
Tomasz Grabiec
f9feb310ab row_cache: Add missing handling for failures happening outside the updating thread
Thread stack allocation may fail, in which case we did not do the
necessary invalidation. Fix by hoisting the scope of the cleanup function.

Also fixes the following test failure:

  tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed

which started to trigger after commit 318423d50b.

Message-Id: <1504023113-30374-2-git-send-email-tgrabiec@scylladb.com>
2017-08-29 19:17:22 +03:00
Raphael S. Carvalho
637f3bfa50 db: refresh row cache's underlying data source after compaction
Underlying data source in row cache holds a reference to sstable set
prior to compaction which isn't released until a memtable flush, which
means file descriptors of deleted sstables remain open, wasting
disk space.
The fix is to refresh underlying data source in row cache.

Fixes #2570.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2017-07-24 15:49:11 -03:00
Piotr Jastrzebski
a4b6cfe8f0 row_cache: use continuity info in single partition queries
If a query requests a single partition that is inside
a range that has already been queried, use the continuity info
and don't go to disk when it's not needed.

Fixes #2244.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <15bb3b5b03225e7402e3862da53b5e06d3f4fa74.1499345295.git.piotr@scylladb.com>
2017-07-07 10:29:19 +02:00
Tomasz Grabiec
37d2b6b3c6 row_cache: Switch _stats.hits/misses to row granularity
Those are exported by the RESTful APIs called
"get_row_hits/get_row_misses" and reported by nodetool.
2017-07-04 13:55:06 +02:00
Tomasz Grabiec
60c2a86192 row_cache: Track mispopulations also at row level 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
94547db620 row_cache: Track row insertions 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
a58f2c8640 row_cache: Track row hits and misses 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
77b2a92ece row_cache: Make mispopulation counter also apply for continuity information 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
a5fdff2ac2 row_cache: Add partition_ prefix to current counters
In preparation for adding per-row counters.
2017-07-04 13:55:06 +02:00
Tomasz Grabiec
6a22cbceaf row_cache: Add metrics for operations on underlying reader 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
5c7b6fc164 row_cache: Add reader-related metrics 2017-07-04 13:55:06 +02:00
Tomasz Grabiec
e720b317c9 row_cache: Restore update of concurrent_misses_same_key
It was lost in action in 6f6575f456.

Message-Id: <1499168837-5072-1-git-send-email-tgrabiec@scylladb.com>
2017-07-04 14:51:05 +03:00
Tomasz Grabiec
1d6fec0755 row_cache: Drop not very useful prefixes from metric names
This drops "total_operations_" and "objects_" prefixes. There is no
convention of adding them in other parts of the system, and they don't
add much value.

Fixes scylladb/scylla-grafana-monitoring#169.

Message-Id: <1499160342-25865-1-git-send-email-tgrabiec@scylladb.com>
2017-07-04 13:37:12 +03:00
Tomasz Grabiec
97005825bf row_cache: Fix compilation errors with gcc 5
Message-Id: <1498741526-27055-1-git-send-email-tgrabiec@scylladb.com>
2017-06-29 16:34:46 +03:00
Tomasz Grabiec
786e75dbf7 row_cache: Use continuity information to decide whether to populate
If cache is missing given key, but the range is marked as continuous,
it means sstables don't have that entry and we can insert it without
asking the presence checker (bloom filter based). The latter is more
expensive and gives false positives. So this improves update
performance and hit ratio.

Another positive effect is that we don't have to clear continuity now.

Fixes #1999.

Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>
2017-06-28 13:32:48 +03:00
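The population decision described in 786e75dbf7 can be sketched roughly as follows. This is a toy model, not the row_cache code: keys are plain ints, `entry`, `cache`, `outcome`, and `update_one` are hypothetical names, and the `continuous` flag is assumed to mean "no keys exist in sstables between the previous cached entry and this one". The presence checker is passed in as a callback that may return false positives.

```cpp
#include <cassert>
#include <functional>
#include <map>

// Hypothetical cache entry for illustration only.
struct entry {
    int value;
    // Assumption: true means sstables hold no keys between the previous
    // cached entry and this one.
    bool continuous;
};

using cache = std::map<int, entry>;

enum class outcome { merged, inserted_by_continuity, inserted_after_check, skipped };

// Applies one flushed write to the cache.
outcome update_one(cache& c, int key, int value,
                   const std::function<bool(int)>& may_be_in_sstables) {
    auto it = c.lower_bound(key);
    if (it != c.end() && it->first == key) {
        it->second.value = value;            // key already cached: just update it
        return outcome::merged;
    }
    if (it != c.end() && it->second.continuous) {
        // The key falls inside a continuous range, so sstables cannot contain
        // it. Insert without consulting the (more expensive) presence checker.
        c.emplace_hint(it, key, entry{value, true});
        return outcome::inserted_by_continuity;
    }
    if (!may_be_in_sstables(key)) {
        // The checker has no false negatives, so the write is the whole entry.
        c.emplace_hint(it, key, entry{value, false});
        return outcome::inserted_after_check;
    }
    // Possibly present on disk (or a false positive): do not populate.
    return outcome::skipped;
}
```

In this model the successor entry keeps its `continuous` flag after an insertion by continuity, which mirrors the remark that continuity no longer has to be cleared.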
Tomasz Grabiec
b56232b216 row_cache: Introduce evict() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
6f6575f456 row_cache: Enable partial partition population 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e792220c3a row_cache: Introduce update_invalidating() 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c29878f49f row_cache: Extract memtable walking logic from update() into do_update()
So that it can be reused in update_invalidating().
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
509a0d8a83 row_cache: Allow reading from underlying through read_context
The interaction will be as follows:

  - Before creating cache_streamed_mutation for given partition, cache
    mutation reader sets up read_context for current partition (in one
    of two ways) so that the matching underlying streamed_mutation can
    be accessed at any time by cache_streamed_mutation.

  - cache_streamed_mutation assumes that read_context is set up for
    current partition and invokes fast_forward_to() and
    get_next_fragment() to access the underlying
    streamed_mutation.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a1d3e0318c row_cache: Store autoupdating_underlying_reader in read_context
Will be reused for reading of incomplete partition entries.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
3f2320c377 row_cache: Store information whether query is a range query in read_context
We will need to use this information later in yet another place, when
creating a reader for incomplete cache entry. This refactors the code
so that there is a single place which determines this fact.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
a2207ee9a6 row_cache: Move autoupdating_underlying_reader to read_context.hh 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
ca920bd0ef row_cache: Keep only one streamed_mutation in scanning_and_populating_reader
Currently scanning_and_populating_reader asks
just_cache_scanning_reader for the next partition from cache, together
with information if the range is continuous. If it's not, it saves the
partition it got from it and moves on to reading from the underlying
reader up to that partition. When that's done, it emits the stored
partition.

This approach won't work well with upcoming changes for storing
partial partitions. We won't have whole partitions any more, so
streamed_mutation returned for the entry needs to be prepared for
reading from the underlying mutation source. We want to reuse the same
underlying reader as much as possible, so all streamed_mutations for
given read (read_context) will share the state of the underlying
reader. Construction of a streamed_mutation will depend on the fact
that the shared state is set up for it, so we cannot have two
streamed_mutations prepared at the same time (one for entry from
primary, and one for the earlier entry being populated). This change
defers the creation of a streamed_mutation for the entry present in
cache until the whole reader reaches it to avoid this problem.

This will also have another potentially beneficial effect. Since we
defer the decision about which snapshot to use until we reach the
entry, there is a higher chance that the current snapshot of the entry
will match the one used last by the populating read, and that we will
be able to reuse the reader.

It's implemented by utilizing a stable partition cursor which tracks
its current position so that it's possible to revisit the cache entry
(if it's still there) after population ends. The functionality of
just_cache_scanning_reader was inlined into
scanning_and_populating_reader.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
045888d5f3 row_cache: Introduce partition_range_cursor 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
c3905bf235 row_cache: Print position instead of key of cache_entry 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
5bfecaad99 row_cache: Switch invalidate_unwrapped() to use ring_position_view ranges
It's needed before switching cache_entry ordering to rely solely on
cache_entry::position() so that invalidate_unwrapped() never removes
the dummy entry at the end. Currently if the range has upper bound
like this:

  { ring_position::max(), inclusive=true }

The code which selects entries for removal would include the dummy row
at the end. It uses upper_bound() to get the end iterator, and the
dummy entry has a position which is equal to the position in the
bound.

ring_position_view ranges are end-exclusive, so it's impossible to
create a partition range which would include a dummy entry.

The code is also simpler.
2017-06-24 18:06:11 +02:00
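The boundary problem described in 5bfecaad99 can be reproduced with a small model. This is only a sketch: positions are ints, the cache is a `std::map`, and `entries`, `make_cache`, `erase_inclusive`, and `erase_end_exclusive` are hypothetical names rather than anything from row_cache. Both helpers assume `start` is not greater than the end bound.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <limits>
#include <map>

// Toy model: the cache always ends with a dummy entry at the maximum position.
constexpr int max_pos = std::numeric_limits<int>::max();
using entries = std::map<int, bool>;   // position -> is_dummy

entries make_cache() {
    return entries{{10, false}, {20, false}, {max_pos, true}};
}

// Inclusive upper bound: upper_bound() steps past an entry whose position
// equals the bound, so a bound of max_pos selects the dummy for removal.
std::size_t erase_inclusive(entries& c, int start, int end_inclusive) {
    auto first = c.lower_bound(start);
    auto last = c.upper_bound(end_inclusive);
    auto n = static_cast<std::size_t>(std::distance(first, last));
    c.erase(first, last);
    return n;
}

// End-exclusive range: the end iterator stops at the first entry that is
// not below the bound, so the dummy at max_pos can never be selected.
std::size_t erase_end_exclusive(entries& c, int start, int end_exclusive) {
    auto first = c.lower_bound(start);
    auto last = c.lower_bound(end_exclusive);
    auto n = static_cast<std::size_t>(std::distance(first, last));
    c.erase(first, last);
    return n;
}
```

With the three-entry cache from `make_cache()`, the inclusive variant called with `max_pos` removes all three entries including the dummy, while the end-exclusive variant removes only the two real entries.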
Tomasz Grabiec
64626b32b0 row_cache: Make printable 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
54b3da1910 row_cache: Introduce find_or_create() helper 2017-06-24 18:06:11 +02:00
Tomasz Grabiec
f2d2c221d4 row_cache: Return cache_entry reference from do_find_or_create_entry
Will be useful when additional action needs to be done on the entry
after it was created or constructed.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
bbfa52822e row_cache: Switch readers to use per-entry snapshots
Currently readers are always using the latest snapshot. This is fine
for respecting write atomicity if partitions are fully continuous in
cache (now), but will break write atomicity once partial population is
allowed.

Consider the following case:

  flush write(ck=1), write(ck=2) -> snapshot_1
  cache reader 1 reads and inserts ck=1 @snapshot_1
  flush write(ck=1), write(ck=2) -> snapshot_2
  cache reader 2 reads and inserts ck=2 @snapshot_2

Because cache update is not atomic, it can happen that reader 2 will
complete while the partition hasn't been updated yet for snapshot_2.
In such case, after read 2 the partition would contain ck=1 from
snapshot_1 and ck=2 from snapshot_2. It will match neither of the
snapshots, and this could violate write atomicity.

To solve this problem we conceptually assign each partition key in the
ring to its current snapshot which it reflects. The update process
gradually converts entries in ring order to the new snapshot. Reads
will not be using the latest snapshot, but rather the current snapshot
for the position in the ring they are at.

There is a race between the update process and populating reads. Since
after the update all entries must reflect the new snapshot, reads
using the old snapshot cannot be allowed to insert data which can no
longer be reached by the update process. Before this patch this race
was prevented by the use of a phased_barrier, where readers would keep
phased_barrier::operation alive between starting a read of a partition
and inserting it into cache. Cache update was waiting for all prior
operations before starting the update. Any later read which was not
waited for would use the latest snapshot for reads, so the update
process didn't have to fix anything up for such reads.

After this change, later reads cannot always use the latest snapshot,
they have to use the snapshot corresponding to given entry. So it's
not enough for update() to wait for prior reads in order to prevent
stale populations. The (simple) solution implemented in this patch is
to detect the conflict and abandon population of given sub-range. In
general, reads are allowed to populate given range only if it belongs
to a single snapshot.

Note that the range here is not the whole query range. For population
of continuity, it is the range starting after the previous key and
ending after the key being inserted. When populating a partition
entry, the range is a singular range containing only the partition
key. Readers switch to new snapshots automatically as they move across
the ring. It's possible that the insertion of the partition doesn't
conflict, but continuity does. In such case the entry will be inserted
but continuity will not be set.
2017-06-24 18:06:11 +02:00
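The "populate only if the range belongs to a single snapshot" rule from bbfa52822e can be illustrated with a minimal model. Everything here is an assumption made for the sketch: ring positions are ints, snapshots are identified by a phase number, and the update process is represented by a single cursor (`converted_up_to`). The names `update_progress`, `phase_of`, and `can_populate` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

using phase_t = std::uint64_t;

// Progress of an update that converts entries in ring order.
struct update_progress {
    phase_t old_phase;
    int converted_up_to;   // positions below this already reflect old_phase + 1

    // Snapshot (phase) that a given ring position currently reflects.
    phase_t phase_of(int pos) const {
        return pos < converted_up_to ? old_phase + 1 : old_phase;
    }
};

// A reader holding a snapshot of phase reader_phase may populate the closed
// position range [first_pos, last_pos] only if the whole range reflects that
// phase. The phase is monotonic along the ring, so checking both ends suffices.
bool can_populate(const update_progress& u, int first_pos, int last_pos,
                  phase_t reader_phase) {
    return u.phase_of(first_pos) == reader_phase
        && u.phase_of(last_pos) == reader_phase;
}
```

For example, with the update cursor at position 8, a reader on the old phase can insert the partition at position 12 (a singular range), but cannot mark the range from 6 to 12 as continuous because it straddles two snapshots. That corresponds to the case where the entry is inserted but continuity is not set.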
Tomasz Grabiec
8ba6366610 row_cache: Switch to using snapshot_source
Currently every time cache needs to create reader for missing data it
obtains a reader which is most up to date. That reader includes writes
from later populate phases, for which update() was not yet
called. This will be problematic once we allow partitions to be
partially populated, because different parts of the partition could be
partially populated using readers using different sets of writes, and break
write atomicity.

The solution will be to always populate given partition using the same
set of writes, using reader created from the current snapshot. The
snapshot changes only on update(), with update() gradually converting
each partition to the new snapshot.
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
e23c7e2f34 row_cache: Rework invalidate() implementation
 1) Reduce duplication by delegating to more general overloads

 2) Improve documentation to not mention effects in terms of
    population (detail) but rather write visibility

 3) Rename clear() to invalidate() and merge with the range variant,
    it has the same semantics
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
2d73c193e7 row_cache: Introduce read_context
This object stores all the read-related context that is needed in
many places. This leads to cleaner code.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec:
  - made read_context shareable to allow storing shared
    mutable state later
  - added range and cache getters
]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
a3ff8db323 row_cache: Introduce autoupdating_underlying_reader
This is an abstraction that represents a reader
to the underlying source and auto updates itself
to make sure the reader reflects the latest state
of the underlying source.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec: Add range getter to avoid friendships]
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
60346a2819 row_cache: remove unused read overload
This will simplify the following patches and unused
code should be removed anyway.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
77f944880c cache: Remove support for wide partitions
This will be handled by row cache now.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Nadav Har'El
3018df11b5 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As a result, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::forwarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exceptions are functions which are known
   to read only a single partition and do not support fast_forward_to() to a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
2017-06-19 18:31:32 +03:00
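The choice of `last_end` described in 3018df11b5 can be summarized with a short sketch. It does not reproduce the real `data_consume_rows()` signature: `partition_forwarding`, `choose_last_end`, and `may_fast_forward_to` are hypothetical names, and byte positions are plain integers.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Stand-in for the mutation_reader::forwarding flag: may the reader later be
// moved to a different partition range?
enum class partition_forwarding { no, yes };

// Picks the byte position up to which the disk stream is opened.
// Without forwarding there is no point in reading past the range's end;
// with forwarding the stream must extend to the end of the file.
std::uint64_t choose_last_end(std::uint64_t range_end, std::uint64_t file_end,
                              partition_forwarding fwd) {
    return fwd == partition_forwarding::yes ? file_end : range_end;
}

// Forwarding is only allowed to byte ranges that end at or before last_end.
bool may_fast_forward_to(std::uint64_t new_range_end, std::uint64_t last_end) {
    return new_range_end <= last_end;
}

} // namespace sketch
```

Under these assumptions a repair-style read (forwarding `no`) opens the stream only up to the end of its range, so read-ahead cannot run past it, while the default readers keep the pre-patch behaviour of opening the stream to the end of the file.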
Avi Kivity
6e2c9ef9fb Revert "Allow reading exactly desired byte ranges and fast_forward_to"
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2).  It causes crashes
during range scans, reported by Gleb:

"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.

Backtrace:
    at /home/gleb/work/seastar/seastar/core/apply.hh:36
    rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
    range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
    at ./seastar/core/future.hh:890
    at /home/gleb/work/seastar/seastar/core/future-util.hh:119
    at /home/gleb/work/seastar/seastar/core/future-util.hh:142
2017-06-18 16:10:21 +03:00
Nadav Har'El
317d7fc253 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As a result, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::forwarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exceptions are functions which are known
   to read only a single partition and do not support fast_forward_to() to a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
2017-06-15 13:22:46 +01:00
Tomasz Grabiec
6cf2841654 mvcc: Extract partition_snapshot_reader to separate header
Right now the whole world includes it transitively, which results in
painful recompiles when the code changes.

Relax dependencies.
Message-Id: <1495620201-8046-1-git-send-email-tgrabiec@scylladb.com>
2017-05-24 12:13:15 +01:00
Avi Kivity
ebaeefa02b Merge seastar upstream (seastar namespace)
 - introduced "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Tomasz Grabiec
d1bde3036e row_cache: Keep counters in a struct
So that taking a snapshot of all stats is easy.
2017-05-17 14:15:14 +02:00
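The benefit named in d1bde3036e, that grouping counters in one struct makes snapshots easy, can be shown with a minimal sketch. The field names and the `tracker` and `delta` helpers are hypothetical and do not reflect the actual row_cache stats layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter set; the real struct has different members.
struct cache_stats {
    std::uint64_t hits = 0;
    std::uint64_t misses = 0;
    std::uint64_t insertions = 0;
};

// Difference between two snapshots, for example "after" minus "before".
cache_stats delta(const cache_stats& after, const cache_stats& before) {
    cache_stats d;
    d.hits = after.hits - before.hits;
    d.misses = after.misses - before.misses;
    d.insertions = after.insertions - before.insertions;
    return d;
}

// Owner of the counters. Taking a snapshot is a plain struct copy.
class tracker {
    cache_stats _stats;
public:
    void on_hit() { ++_stats.hits; }
    void on_miss() { ++_stats.misses; }
    void on_insertion() { ++_stats.insertions; }
    cache_stats snapshot() const { return _stats; }
};
```

A test or benchmark could take a snapshot before and after an operation and compare the two with `delta()` instead of reading each counter individually.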
Tomasz Grabiec
35c9dfecc2 row_cache: Implement mutation_reader::fast_forward_to() for cache scanner
Needed to make perf_fast_forward work with cache enabled.
2017-05-17 14:15:14 +02:00
Tomasz Grabiec
7b6be7e188 row_cache: Add missing propagation of the forwarding flag in handle_large_partition()
Message-Id: <1494503145-25622-1-git-send-email-tgrabiec@scylladb.com>
2017-05-11 15:47:19 +01:00
Tomasz Grabiec
0351ab8bc6 row_cache: Fix undefined behavior in read_wide()
_underlying is created with _range, which is captured by
reference. But range_and_underlyig_reader is moved after being
constructed by do_with(), so _range reference is invalidated.

Fixes #2377.
Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>
2017-05-11 09:43:43 +01:00
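The class of bug described in 0351ab8bc6 can be shown in isolation. This is a simplified sketch, not the read_wide() code: `range`, `reader`, `range_and_reader`, and `stable_range_and_reader` are hypothetical types, and the heap-allocated variant is only one possible remedy, not necessarily the one the commit applied. The sketch only compares addresses and never dereferences a stale pointer.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct range { int start; int end; };

// Hypothetical reader that remembers where its range lives.
struct reader {
    const range* r;
    explicit reader(const range& rg) : r(&rg) {}
};

// Problematic shape: _underlying refers to this object's own _range.
// The implicit move constructor copies that pointer, so the moved-to
// object's reader still refers to the moved-from object's _range.
struct range_and_reader {
    range _range;
    reader _underlying;
    explicit range_and_reader(range rg) : _range(rg), _underlying(_range) {}
    bool self_consistent() const { return _underlying.r == &_range; }
};

// One possible remedy: keep the range at a stable heap address, so moving
// the owner does not change where the range lives.
struct stable_range_and_reader {
    std::unique_ptr<range> _range;
    reader _underlying;
    explicit stable_range_and_reader(range rg)
        : _range(std::make_unique<range>(rg)), _underlying(*_range) {}
    bool self_consistent() const { return _underlying.r == _range.get(); }
};
```

If the moved-from object is destroyed while the moved-to object is still in use, the first variant leaves the reader with a dangling reference, which matches the undefined behavior the commit message describes.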
Amnon Heiman
064f5e1b63 row_cache: switch to the metrics layer registration
This patch moves the row_cache metrics registration from collectd to the
metric layer.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170321143812.785-3-amnon@scylladb.com>
2017-03-21 16:42:58 +02:00
Tomasz Grabiec
892d4a2165 db: Enable creating forwardable readers via mutation_source
Right now all mutation source implementations will use
make_forwardable() wrapper.
2017-02-23 18:50:44 +01:00
Glauber Costa
facb0aa6d9 row_cache: rewrite loop so that debug mode doesn't become a noop
need_preempt() is always true in debug mode. Because of that, this loop
will never be executed. Rewrite it as a do-while loop so we are sure
that it is executed at least once - or exactly once in debug mode.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1485913079-1283-1-git-send-email-glauber@scylladb.com>
2017-02-01 10:02:13 +02:00
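The loop rewrite described in facb0aa6d9 can be sketched as follows. `preempt_source` is a hypothetical stand-in for the real preemption check, with debug mode modelled as a source that always asks for preemption. The function names are illustrative as well.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the scheduler's preemption check.
struct preempt_source {
    bool always = false;   // true models debug mode
    int budget = 0;        // otherwise: how many checks report "no preemption"
    bool need_preempt() {
        if (always) {
            return true;
        }
        return budget-- <= 0;
    }
};

// Original shape: the body is skipped entirely when need_preempt() is
// already true on the first check.
std::size_t drain_while(std::vector<int>& work, preempt_source& p) {
    std::size_t done = 0;
    while (!work.empty() && !p.need_preempt()) {
        work.pop_back();
        ++done;
    }
    return done;
}

// Rewritten shape: at least one item is processed per call when there is
// work, and exactly one when preemption is always requested.
std::size_t drain_do_while(std::vector<int>& work, preempt_source& p) {
    std::size_t done = 0;
    if (work.empty()) {
        return done;
    }
    do {
        work.pop_back();
        ++done;
    } while (!work.empty() && !p.need_preempt());
    return done;
}
```

With a source that always requests preemption, `drain_while` makes no progress at all, whereas `drain_do_while` still processes one item per call, so repeated calls eventually drain the work.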