Commit Graph

130 Commits

Author SHA1 Message Date
Piotr Jastrzebski
60346a2819 row_cache: remove unused read overload
This will simplify the following patches and unused
code should be removed anyway.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
77f944880c cache: Remove support for wide partitions
This will be handled by row cache now.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Nadav Har'El
3018df11b5 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::forwarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exceptions are functions which are known
   to read only a single partition, and do not support fast_forward_to() to a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.
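
The choice of last_end described above can be sketched roughly as follows. This is only an illustration: the enum, the byte_range struct and both helper functions are hypothetical stand-ins, not the real sstable or mutation_reader API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for mutation_reader::forwarding (illustration only).
enum class forwarding { no, yes };

// Byte positions of a partition range inside the sstable data file.
struct byte_range {
    uint64_t start;
    uint64_t end;
};

// Picks the position up to which the input stream is opened.
// With forwarding::no nothing past the range's end is ever needed, so the
// stream (and its read-ahead) stops there. With forwarding::yes a later
// fast_forward_to() may land anywhere, so the stream stays open until the
// end of the file, as the code before the patch did.
inline uint64_t choose_last_end(byte_range r, uint64_t file_size, forwarding fwd) {
    return fwd == forwarding::no ? r.end : file_size;
}

// fast_forward_to() is only allowed up to last_end.
inline bool can_fast_forward_to(uint64_t pos, uint64_t last_end) {
    return pos <= last_end;
}

inline void last_end_example() {
    const byte_range r{4096, 8192};
    const uint64_t file_size = 1 << 20;
    // A repair-style read (forwarding::no) stops exactly at the range's end.
    assert(choose_last_end(r, file_size, forwarding::no) == 8192);
    // Other readers keep the old behaviour and may forward within the file.
    assert(choose_last_end(r, file_size, forwarding::yes) == file_size);
}
```
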

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
2017-06-19 18:31:32 +03:00
Avi Kivity
6e2c9ef9fb Revert "Allow reading exactly desired byte ranges and fast_forward_to"
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2).  It causes crashes
during range scans, reported by Gleb:

"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.

Backtrace:
    at /home/gleb/work/seastar/seastar/core/apply.hh:36
    rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
    range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
    at ./seastar/core/future.hh:890
    at /home/gleb/work/seastar/seastar/core/future-util.hh:119
    at /home/gleb/work/seastar/seastar/core/future-util.hh:142
2017-06-18 16:10:21 +03:00
Nadav Har'El
317d7fc253 Allow reading exactly desired byte ranges and fast_forward_to
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.

As a result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).

This patch has two levels:

1. In the lower level, sstable::data_consume_rows(), which reads all
   partitions in a given disk byte range, now gets another byte position,
   "last_end". That can be the range's end, the end of the file, or anything
   in between the two. It opens the disk stream until last_end, which means
   1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is
   not allowed beyond last_end.

2. In the upper level, we add to the various layers of sstable readers,
   mutation readers, etc., a boolean flag mutation_reader::forwarding, which
   says whether fast_forward_to() is allowed on the stream of mutations to
   move the stream to a different partition range.

   Note that this flag is separate from the existing boolean flag
   streamed_mutation::forwarding - that one talks about skipping inside a
   single partition, while the flag we are adding is about switching the
   partition range being read. Most of the functions that previously
   accepted streamed_mutation::forwarding now accept *also* the option
   mutation_reader::forwarding. The exceptions are functions which are known
   to read only a single partition, and do not support fast_forward_to() to a
   different partition range.

   We note that if mutation_reader::forwarding::no is requested, and
   fast_forward_to() is forbidden, there is no point in reading anything
   beyond the range's end, so data_consume_rows() is called with last_end as
   the range's end. But if forwarding::yes is requested, we use the end of the
   file as last_end, exactly like the code before this patch did.

Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.

In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
2017-06-15 13:22:46 +01:00
Tomasz Grabiec
6cf2841654 mvcc: Extract partition_snapshot_reader to separate header
Right now the whole world includes it transitively, which results in
painful recompiles when the code changes.

Relax dependencies.
Message-Id: <1495620201-8046-1-git-send-email-tgrabiec@scylladb.com>
2017-05-24 12:13:15 +01:00
Avi Kivity
ebaeefa02b Merge seastar upstream (seastar namespace)
 - introduced "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Tomasz Grabiec
d1bde3036e row_cache: Keep counters in a struct
So that taking a snapshot of all stats is easy.
2017-05-17 14:15:14 +02:00
Tomasz Grabiec
35c9dfecc2 row_cache: Implement mutation_reader::fast_forward_to() for cache scanner
Needed to make perf_fast_forward work with cache enabled.
2017-05-17 14:15:14 +02:00
Tomasz Grabiec
7b6be7e188 row_cache: Add missing propagation of the forwarding flag in handle_large_partition()
Message-Id: <1494503145-25622-1-git-send-email-tgrabiec@scylladb.com>
2017-05-11 15:47:19 +01:00
Tomasz Grabiec
0351ab8bc6 row_cache: Fix undefined behavior in read_wide()
_underlying is created with _range, which is captured by
reference. But range_and_underlyig_reader is moved after being
constructed by do_with(), so _range reference is invalidated.
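
A minimal sketch of this kind of bug, using hypothetical stand-in types rather than the real row_cache ones: a member that refers to a sibling member keeps pointing into the old object once the struct is moved. Keeping the referenced state on the heap is one possible way to keep the address stable; it is not necessarily what this patch does.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins; not the real row_cache types.
struct partition_range { std::string name; };

struct reader {
    const partition_range& range;   // borrowed, must outlive the reader
};

// Problematic shape: `underlying` refers to the sibling member `range`.
// The implicit move constructor copies the reference as-is, so the moved-to
// object's reader still points into the moved-from object.
struct range_and_reader_unsafe {
    partition_range range;
    reader underlying;
    explicit range_and_reader_unsafe(partition_range r)
        : range(std::move(r)), underlying{range} {}
};

// One possible fix: the range lives on the heap, so its address does not
// change when the owning struct is moved.
struct range_and_reader_safe {
    std::unique_ptr<partition_range> range;
    reader underlying;
    explicit range_and_reader_safe(partition_range r)
        : range(std::make_unique<partition_range>(std::move(r)))
        , underlying{*range} {}
};

inline void moved_reference_example() {
    range_and_reader_unsafe a(partition_range{"r"});
    range_and_reader_unsafe b(std::move(a));
    // b's reader still refers to a.range; once `a` is destroyed it dangles.
    assert(&b.underlying.range == &a.range);

    range_and_reader_safe c(partition_range{"r"});
    range_and_reader_safe d(std::move(c));
    // d's reader refers to the heap object that d now owns.
    assert(&d.underlying.range == d.range.get());
}
```
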

Fixes #2377.
Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>
2017-05-11 09:43:43 +01:00
Amnon Heiman
064f5e1b63 row_cache: switch to the metrics layer registration
This patch moves the row_cache metrics registration from collectd to the
metric layer.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170321143812.785-3-amnon@scylladb.com>
2017-03-21 16:42:58 +02:00
Tomasz Grabiec
892d4a2165 db: Enable creating forwardable readers via mutation_source
Right now all mutation source implementations will use
make_forwardable() wrapper.
2017-02-23 18:50:44 +01:00
Glauber Costa
facb0aa6d9 row_cache: rewrite loop so that debug mode doesn't become a noop
need_preempt() is always true in debug mode. Because of that, this loop
will never be executed. Rewrite it as a do-while loop so we are sure
that it is executed at least once - or exactly once in debug mode.
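
The difference between the two loop shapes can be sketched as below. The predicate is passed in as a parameter and always_preempt() is a hypothetical stand-in for need_preempt() in debug mode; the loop bodies are placeholders, not the real cache update code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for need_preempt() in debug mode, where it always
// reports that the task should yield.
inline bool always_preempt() { return true; }

// Original shape: the predicate is checked before any work is done, so
// nothing is processed when it is always true.
template <typename Pred>
std::size_t drain_while(std::size_t pending, Pred need_preempt) {
    std::size_t done = 0;
    while (pending > 0 && !need_preempt()) {
        --pending;
        ++done;
    }
    return done;
}

// Rewritten shape: one unit of work is always done before the predicate is
// consulted, so each call makes progress even in debug mode.
template <typename Pred>
std::size_t drain_do_while(std::size_t pending, Pred need_preempt) {
    std::size_t done = 0;
    if (pending == 0) {
        return done;
    }
    do {
        --pending;
        ++done;
    } while (pending > 0 && !need_preempt());
    return done;
}

inline void preempt_loop_example() {
    assert(drain_while(5, always_preempt) == 0);     // never runs
    assert(drain_do_while(5, always_preempt) == 1);  // exactly once
}
```
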

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1485913079-1283-1-git-send-email-glauber@scylladb.com>
2017-02-01 10:02:13 +02:00
Glauber Costa
69dbb3e108 row_cache: yield if need_preempt(), even if there is quota left.
The quota check is quite old at the moment, and dates back to a time in
which the infrastructure in seastar threads was lacking a lot. It is a
bad check since it will not take into consideration the size of the
partition or the time it takes to merge them.

A better check would at least take need_preempt() into account, so that
we would respect the task quota. That check is now embedded into
should_yield(), so there would be no need to check anything else.

Although should_yield() does the job, it is still currently quite
expensive. And because we are in a seastar thread with a computationally
intensive loop, it can hurt latency a lot.

So as a temporary measure, let's at least check for need_preempt() - as
it is hurting real users at the moment - and soon work on making
should_yield() cheaper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-01-26 22:10:54 -05:00
Glauber Costa
0e1f64b163 row_cache: add systemtap markers for the update process
update is one of our biggest sources of performance issues as far as the
cache is concerned. systemtap can be useful in helping track some of
them down.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-01-26 21:56:32 -05:00
Tomasz Grabiec
d048eec254 row_cache: Fix stats handling for uncached wide partitions
Report hitting wide partition dummy as a cache miss instead of a hit.

Refs #2011
Message-Id: <1484302266-3828-1-git-send-email-tgrabiec@scylladb.com>
2017-01-18 09:58:04 +00:00
Tomasz Grabiec
87f15624f4 row_cache: Add counter for wide partition mispopulations
Message-Id: <1484733250-14470-1-git-send-email-tgrabiec@scylladb.com>
2017-01-18 09:57:51 +00:00
Tomasz Grabiec
78844fa2e5 db: Use incremental selector in partition_presence_checker
This reduces the number of sstables we need to check to only those
whose token range overlaps with the key. Reduces cache update
time. Especially effective with leveled compaction strategy.

Refs #1943.

Incremental selector works with an immutable sstable set, so cache
updates need to be serialized. Otherwise we could mispopulate due to
stale presence information.

Presence checker interface was changed to accept decorated key in
order to gain easy access to the token, which is required by
the incremental selector.
2016-12-19 14:20:58 +01:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Tomasz Grabiec
1b5f338c17 memtable: Track flushed memory in memtable object
2016-12-05 12:59:09 +01:00
Paweł Dziepak
f877be50b0 Merge "Keep wide partition cache entry longer than others" from Piotr
"Cache entries for wide partitions are usually smaller than other
entries and the cost of recreating them is higher so it makes sense
to keep them longer than ordinary entries."
2016-11-15 20:44:52 +00:00
Paweł Dziepak
999dafbe57 row_cache: touch entries read during range queries
Fixes #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479230809-27547-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:54:11 +01:00
Piotr Jastrzebski
5ec668c9c6 Add separate LRU for wide partitions.
Evict wide partitions only every 1000 normal partition
evictions.
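
A rough sketch of such a policy, under stated assumptions: the class name, the string keys and the fallback to the wide list when the normal list is empty are all hypothetical, and the real cache uses its own LRU structures.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <utility>

// Two LRU lists; front = least recently used. Illustration only.
class two_level_lru {
    std::list<std::string> _normal;
    std::list<std::string> _wide;
    std::size_t _normal_evictions_since_wide = 0;
    std::size_t _wide_period;
public:
    explicit two_level_lru(std::size_t wide_period = 1000)
        : _wide_period(wide_period) {}

    void add_normal(std::string key) { _normal.push_back(std::move(key)); }
    void add_wide(std::string key) { _wide.push_back(std::move(key)); }

    // Evicts one entry and returns its key, or nullopt if both lists are empty.
    std::optional<std::string> evict_one() {
        const bool wide_due = _normal_evictions_since_wide >= _wide_period;
        // Assumption: wide entries are also evicted when nothing else is left.
        if (!_wide.empty() && (wide_due || _normal.empty())) {
            std::string key = std::move(_wide.front());
            _wide.pop_front();
            _normal_evictions_since_wide = 0;
            return key;
        }
        if (!_normal.empty()) {
            std::string key = std::move(_normal.front());
            _normal.pop_front();
            ++_normal_evictions_since_wide;
            return key;
        }
        return std::nullopt;
    }
};

inline void two_level_lru_example() {
    two_level_lru lru(2);           // small period to keep the example short
    lru.add_normal("n1");
    lru.add_normal("n2");
    lru.add_normal("n3");
    lru.add_wide("w1");
    assert(lru.evict_one().value() == "n1");
    assert(lru.evict_one().value() == "n2");
    assert(lru.evict_one().value() == "w1");  // due after two normal evictions
    assert(lru.evict_one().value() == "n3");
}
```
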

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 16:19:13 +01:00
Piotr Jastrzebski
9a41bfbf69 Add collectd metric for wide partition evictions.
This will allow us to see how many cache entries
holding info about wide partitions get evicted.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-11-15 15:53:14 +01:00
Paweł Dziepak
a8308e2a8d row_cache: dummy entry does not count as partition
Since the continuity flag was introduced, the row cache contains a single dummy
entry. cache_tracker knows nothing about it so that it doesn't appear in
any of the metrics. However, cache destructor calls
cache_tracker::on_erase() for every entry in the cache including the
dummy one. This is incorrect since the tracker wasn't informed when the
dummy entry was created.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>
2016-11-08 13:54:44 +01:00
Avi Kivity
a35136533d Convert ring_position and token ranges to be nonwrapping
Wrapping ranges are a pain, so we are moving wrap handling to the edges.

Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.

Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
2016-11-02 21:04:11 +02:00
Paweł Dziepak
a7224ae46e row_cache: avoid dereferencing invalid iterator
Conditions in row_cache::do_find_or_create_entry() make it possible that
std::prev(it) is going to be dereferenced even if it is a begin
iterator.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 15:24:23 +01:00
Paweł Dziepak
654f651e0c row_cache: set _first_element flag correctly
If the continuity flag was set for the first element _first_element flag
would not be cleared. This shouldn't cause any correctness problems but
properly setting the flag allows to avoid some unnecessary key
comparisons.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 15:07:24 +01:00
Paweł Dziepak
567ff96f2a row_cache: fix clearing continuity flag at eviction
In the original implementation the continuity flag indicated that the cache has
full information about the range between the current partition and the
one following it, hence when evicting an entry the one preceding it
had to have its continuity flag cleared.

This was changed, however, and now the continuity flag tells whether the
cache is continuous between the current element and the one before it.
This means that eviction code needs to clear the flag for the entry
directly following the evicted one.
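
With the flag meaning "continuous with the previous entry", eviction can be sketched as below. The std::map and the entry struct are hypothetical simplifications of the real cache structures.

```cpp
#include <cassert>
#include <map>

// Simplified cache entry: `continuous` is true when the cache holds
// everything between the previous key and this key.
struct entry {
    bool continuous = false;
};

using cache_map = std::map<int, entry>;

// Evicts `key` and clears the continuity flag of the entry that directly
// follows it, since the gap before that entry is no longer fully cached.
inline void evict(cache_map& cache, int key) {
    auto it = cache.find(key);
    if (it == cache.end()) {
        return;
    }
    auto next = cache.erase(it);    // iterator to the following entry
    if (next != cache.end()) {
        next->second.continuous = false;
    }
}

inline void continuity_example() {
    cache_map cache;
    cache[1].continuous = false;
    cache[2].continuous = true;     // nothing missing between 1 and 2
    cache[3].continuous = true;     // nothing missing between 2 and 3
    evict(cache, 2);
    assert(!cache.at(3).continuous);
}
```
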

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-26 14:58:20 +01:00
Paweł Dziepak
5ff699e09f row_cache: rework cache to use fast forwarding reader
This uncomfortably large patch overhauls cache range reader so that it
can take advantage of fast forwarding mutation readers.

A significant change in the cache itself is that the continuity flag now
is used to determine whether cache is contiguous between the previous
entry and the current one. This allows for a significant simplification
of the cache code and easier integration with reader fast forwarding.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
18acb0c0e6 row_cache: put cache entry flags in a struct
Flags are easier to manage if they are in a single structure.
Especially, default initialization and move constructors are simpler
and less error prone.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
f248e23db5 row_cache: add do_find_or_create_entry() to reduce code duplication
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
6755a679f6 drop key readers
key_readers weren't used since introduction of continuity flag to cache
entries.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Avi Kivity
9ac441d3b5 range: adjust split_after to allow split_point outside input range
Make split_after() more generic by allowing split_point to be anywhere,
not just within the input range.  If the split_point is before, the entire
range is returned; and if it is after, stdx::nullopt is returned.

"before" and "after" are not well defined for wrap-around ranges,
but we are phasing them out and soon there will not be
wrapping_range::split_after() users.
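
The generalized split_after() behaviour described above can be illustrated on a much simpler type. This sketch assumes a closed integer interval and uses std::optional; the real range type has inclusive and exclusive bounds and wrap-around handling that are left out here, and the struct below is hypothetical.

```cpp
#include <cassert>
#include <optional>

// Hypothetical closed integer interval [lo, hi]; illustration only.
struct int_range {
    int lo;
    int hi;
};

// Returns the part of r that lies strictly after split_point.
inline std::optional<int_range> split_after(int_range r, int split_point) {
    if (split_point < r.lo) {
        return r;                   // split point before the range: keep all of it
    }
    if (split_point >= r.hi) {
        return std::nullopt;        // nothing is left after the split point
    }
    return int_range{split_point + 1, r.hi};
}

inline void split_after_example() {
    assert(split_after(int_range{10, 20}, 5)->lo == 10);    // before: whole range
    assert(split_after(int_range{10, 20}, 15)->lo == 16);   // inside: the tail
    assert(!split_after(int_range{10, 20}, 25));            // after: nothing
}
```
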

This is a prerequisite for converting partition_range and friends to
nonwrapping_range.
Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>
2016-10-06 17:54:44 +02:00
Duarte Nunes
f864bca773 row_cache: Deal with side-effects in allocating_section
In row_cache::make_reader, we update statistics inside an
allocating_section, which retries the supplied function until it can
satisfy all allocations by way of reserving LSA memory up front. Since
those updates are interleaved with allocations, retries can lead to
miscounts.

This patch fixes this by updating statistics after all allocations.
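
The miscount and the fix can be sketched as follows. run_with_retries() and flaky_allocator are hypothetical stand-ins for allocating_section and LSA allocation; only the retry-on-failure shape is meant to match.

```cpp
#include <cassert>
#include <new>

// Hypothetical stand-in for allocating_section: re-runs fn until it
// completes without an allocation failure.
template <typename Func>
auto run_with_retries(Func&& fn) {
    for (;;) {
        try {
            return fn();
        } catch (const std::bad_alloc&) {
            // The real code would reserve more memory before retrying.
        }
    }
}

struct stats {
    int hits = 0;
};

// Allocation that fails a fixed number of times before succeeding.
struct flaky_allocator {
    int failures_left;
    void allocate() {
        if (failures_left > 0) {
            --failures_left;
            throw std::bad_alloc();
        }
    }
};

// Miscounts: the counter is bumped on every attempt, including failed ones.
inline int read_miscounting(stats& s, flaky_allocator& a) {
    return run_with_retries([&] { ++s.hits; a.allocate(); return 42; });
}

// Counts once: the side effect happens after all allocations succeeded.
inline int read_counting_once(stats& s, flaky_allocator& a) {
    int result = run_with_retries([&] { a.allocate(); return 42; });
    ++s.hits;
    return result;
}

inline void retry_stats_example() {
    stats bad, good;
    flaky_allocator a1{2}, a2{2};
    read_miscounting(bad, a1);
    read_counting_once(good, a2);
    assert(bad.hits == 3);   // two failed attempts were counted as well
    assert(good.hits == 1);
}
```
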

Fixes #1659

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1473845977-20205-1-git-send-email-duarte@scylladb.com>
2016-09-14 10:46:25 +01:00
Glauber Costa
dc5d8e33af Revert "row_cache: update sstable histograms on cache hits"
This reverts commit 1726b1d0cc.

Reverting this patch turns our SSTable access counter into a miss counter only.
The estimated histogram always starts its first bucket at 1, so by marking cache
accesses we will be wrongly feeding "1" into the buckets.

Notice that this is not yet ideal: nodetool is supposed to show a histogram of
all reads, and by doing this we are changing its meaning slightly. Workloads
that serve mostly from cache will be distorted towards their misses.

The real solution is to use a different histogram, but we will need to enforce
a newer version of nodetool for that: the current issue is that nodetool expects
an EstimatedHistogram in a specific format in the other side.

Conflicts:
	row_cache.hh

Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-01 18:07:31 +03:00
Duarte Nunes
9269256246 row_cache: Accept a trace_state_ptr
This patch changes the row_cache so it accepts a trace_state_ptr,
which it is responsible for passing to the underlying mutation_reader
if needed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:00:55 +02:00
Glauber Costa
1726b1d0cc row_cache: update sstable histograms on cache hits
If we have a cache hit, we still need to update our sstable histogram - noting
that we have touched 0 SSTables.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:14:22 -04:00
Piotr Jastrzebski
3607d99269 Remove clustering_key_filtering_context.
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-30 20:31:55 +02:00
Piotr Jastrzebski
b05b90b3a5 Introduce clustering_key_filter_ranges.
This fixes the problem of multiple concurrent get_ranges calls.
Previously each call was invalidating the result of the previous
call. Now they don't step on each other's toes.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-30 19:46:38 +02:00
Avi Kivity
fbc3377ad4 row_cache: add a counter for a miss that did not result in an insertion
Such misses are due to concurrent access to the same key.  Add a counter
to track this as it results in unnecessary I/O being performed.

See #1534.
Message-Id: <1470139871-14693-1-git-send-email-avi@scylladb.com>
2016-08-02 14:14:27 +02:00
Piotr Jastrzebski
ca9c29e296 Cache information about partition being wide
Once we encounter a wide partition store information
about this in cache entry and don't try to read it all
and cache next time it's requested.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[Paweł: rebased, moved large partition reading logic to
cache_entry::read_wide()]
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-29 18:39:22 +01:00
Paweł Dziepak
ee1f1ee1c4 row_cache: fix creating readers for large partitions
There were cases of use-after-free introduced by the code responsible
for creating mutation_readers for large partitions – the lifetimes
of partition ranges and the readers themselves weren't sufficiently
extended.

Another problem was that if the partition was no longer present in the
sstable the reader would return EOS which was then returned by
range_populating_reader itself causing its users to incorrectly
interpret that as an end of stream.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-07-29 17:02:17 +01:00
Piotr Jastrzebski
fdfd1af694 Use continuity flag correctly with concurrent invalidations
Between reading a cache entry and actually using it,
invalidations can happen, so we have to check whether any flag was
cleared; if it was, we need to read the entry again.

Fixes #1464.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <7856b0ded45e42774ccd6f402b5ee42175bd73cf.1469701026.git.piotr@scylladb.com>
2016-07-28 11:55:18 +01:00
Piotr Jastrzebski
37a7d49676 Add collectd counter for uncached wide partitions.
Keep track of every read of wide partition that's
not cached.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:47:49 +02:00
Piotr Jastrzebski
636a4acfd0 Add flag to configure
max size of a cached partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:47:20 +02:00
Piotr Jastrzebski
98c12dc2e2 Try to read whole streamed_mutation up to limit
If limit is exceeded then return the streamed_mutation
and don't cache it.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:35:35 +02:00
Paweł Dziepak
81e4952c78 row_cache: fix marking last entry as continuous
Range queries need to take special care when transitioning between
ranges that are read from sstables and ranges that are already in the
cache.

Original code in such case just started a secondary reader and told it
to unconditionally mark the last entry as continuous (primary reader has
already returned an element that immediately follows the range that is
going to be read from sstables).

However, that information may get stale. For instance, by the time
secondary reader finishes reading its range the element immediately
following it may get evicted from the cache thus causing continuity flag
to be incorrectly set.

The solution is to ensure that the element immediately after the range
read from sstables is still in the cache.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1468586893-15266-1-git-send-email-pdziepak@scylladb.com>
2016-07-15 15:15:02 +02:00
Avi Kivity
9a8788019d row_cache: fix visitor for boost <= 1.55
Older boosts can't return a future from a visitor (likely lacking support
for move-only objects).  Supply a dirty hackaround.

Message-Id: <1467822548-25940-1-git-send-email-avi@scylladb.com>
2016-07-06 19:55:51 +03:00