scylladb

Author	SHA1	Message	Date
Piotr Jastrzebski	60346a2819	row_cache: remove unused read overload This will simplify the following patches and unused code should be removed anyway. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	77f944880c	cache: Remove support for wide partitions This will be handled by row cache now. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now also accept the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and do not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not help if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
Avi Kivity	6e2c9ef9fb	Revert "Allow reading exactly desired byte ranges and fast_forward_to" This reverts commit `317d7fc253` (and also the related `2c57ab84b2`). It causes crashes during range scans, reported by Gleb: "To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s dataset and 3 node cluster. Backtrace: at /home/gleb/work/seastar/seastar/core/apply.hh:36 rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57 range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142 at ./seastar/core/future.hh:890 at /home/gleb/work/seastar/seastar/core/future-util.hh:119 at /home/gleb/work/seastar/seastar/core/future-util.hh:142	2017-06-18 16:10:21 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now also accept the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and do not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not help if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Tomasz Grabiec	6cf2841654	mvcc: Extract partition_snapshot_reader to separate header Right now the whole world includes it transitively, which results in painful recompiles when the code changes. Relax dependencies. Message-Id: <1495620201-8046-1-git-send-email-tgrabiec@scylladb.com>	2017-05-24 12:13:15 +01:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Tomasz Grabiec	d1bde3036e	row_cache: Keep counters in a struct So that taking a snapshot of all stats is easy.	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	35c9dfecc2	row_cache: Implement mutation_reader::fast_forward_to() for cache scanner Needed to make perf_fast_forward work with cache enabled.	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	7b6be7e188	row_cache: Add missing propagation of the forwarding flag in handle_large_partition() Message-Id: <1494503145-25622-1-git-send-email-tgrabiec@scylladb.com>	2017-05-11 15:47:19 +01:00
Tomasz Grabiec	0351ab8bc6	row_cache: Fix undefined behavior in read_wide() _underlying is created with _range, which is captured by reference. But range_and_underlyig_reader is moved after being constructed by do_with(), so _range reference is invalidated. Fixes #2377. Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>	2017-05-11 09:43:43 +01:00
Amnon Heiman	064f5e1b63	row_cache: switch to the metrics layer registration This patch moves the row_cache metrics registration from collectd to the metric layer. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170321143812.785-3-amnon@scylladb.com>	2017-03-21 16:42:58 +02:00
Tomasz Grabiec	892d4a2165	db: Enable creating forwardable readers via mutation_source Right now all mutation source implementations will use make_forwardable() wrapper.	2017-02-23 18:50:44 +01:00
Glauber Costa	facb0aa6d9	row_cache: rewrite loop so that debug mode doesn't become a noop need_preempt() is always true in debug mode. Because of that, this loop will never be executed. Rewrite it as a do-while loop so we are sure that it is executed at least once - or exactly once in debug mode. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1485913079-1283-1-git-send-email-glauber@scylladb.com>	2017-02-01 10:02:13 +02:00
Glauber Costa	69dbb3e108	row_cache: yield if need_preempt(), even if there is quota left. The quota check is quite old at the moment, and dates back to a time in which the infrastructure in seastar threads was lacking a lot. It is a bad check since it will not take into consideration the size of the partition or the time it takes to merge them. A better check would at least take need_preempt() into account, so that we would respect the task quota. That check is now embedded into should_yield(), so there would be no need to check anything else. Although should_yield() does the job, it is still currently quite expensive. And because we are in a seastar thread with a computationally intensive loop, it can hurt latency a lot. So as a temporary measure, let's at least check for need_preempt() - as it is hurting real users at the moment - and soon work on making should_yield() cheaper. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-01-26 22:10:54 -05:00
Glauber Costa	0e1f64b163	row_cache: add systemtap markers for the update process update is one of our biggest sources of performance issues as far as the cache is concerned. systemtap can be useful in helping tracking some of them down. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-01-26 21:56:32 -05:00
Tomasz Grabiec	d048eec254	row_cache: Fix stats handling for uncached wide partitions Report hitting wide partition dummy as a cache miss instead of a hit. Refs #2011 Message-Id: <1484302266-3828-1-git-send-email-tgrabiec@scylladb.com>	2017-01-18 09:58:04 +00:00
Tomasz Grabiec	87f15624f4	row_cache: Add counter for wide partition mispopulations Message-Id: <1484733250-14470-1-git-send-email-tgrabiec@scylladb.com>	2017-01-18 09:57:51 +00:00
Tomasz Grabiec	78844fa2e5	db: Use incremental selector in partition_presence_checker This reduces the number of sstables we need to check to only those whose token range overlaps with the key. Reduces cache update time. Especially effective with leveled compaction strategy. Refs #1943. Incremental selector works with an immutable sstable set, so cache updates need to be serialized. Otherwise we could mispopulate due to stale presence information. Presence checker interface was changed to accept decorated key in order to gain easy access to the token, which is required by the incremental selector.	2016-12-19 14:20:58 +01:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Tomasz Grabiec	1b5f338c17	memtable: Track flushed memory in memtable object	2016-12-05 12:59:09 +01:00
Paweł Dziepak	f877be50b0	Merge "Keep wide partition cache entry longer than others" from Piotr "Cache entries for wide partitions are usually smaller than other entries and the cost of recreating them is higher so it makes sense to keep them longer than ordinary entries."	2016-11-15 20:44:52 +00:00
Paweł Dziepak	999dafbe57	row_cache: touch entries read during range queries Fixes #1847. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1479230809-27547-1-git-send-email-pdziepak@scylladb.com>	2016-11-15 18:54:11 +01:00
Piotr Jastrzebski	5ec668c9c6	Add separate LRU for wide partitions. Evict wide partitions only every 1000 normal partition evictions. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-11-15 16:19:13 +01:00
Piotr Jastrzebski	9a41bfbf69	Add collectd metric for wide partition evictions. This will allow us to see how many evictions of cached info about wide partitions occur. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-11-15 15:53:14 +01:00
Paweł Dziepak	a8308e2a8d	row_cache: dummy entry does not count as partition Since continuity flag introduction row cache contains a single dummy entry. cache_tracker knows nothing about it so that it doesn't appear in any of the metrics. However, cache destructor calls cache_tracker::on_erase() for every entry in the cache including the dummy one. This is incorrect since the tracker wasn't informed when the dummy entry was created. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>	2016-11-08 13:54:44 +01:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
Paweł Dziepak	a7224ae46e	row_cache: avoid dereferencing invalid iterator Conditions in row_cache::do_find_or_create_entry() make it possible that std::prev(it) is going to be dereferenced even if it is a begin iterator. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-26 15:24:23 +01:00
Paweł Dziepak	654f651e0c	row_cache: set _first_element flag correctly If the continuity flag was set for the first element _first_element flag would not be cleared. This shouldn't cause any correctness problems but properly setting the flag allows to avoid some unnecessary key comparisons. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-26 15:07:24 +01:00
Paweł Dziepak	567ff96f2a	row_cache: fix clearing continuity flag at eviction In the original implementation the continuity flag indicated that the cache has full information about the range between the current partition and the one following it, hence when evicting an entry the one preceding it had to have its continuity flag cleared. This was changed, however, and now the continuity flag tells whether the cache is continuous between the current element and the one before it. This means that eviction code needs to clear the flag for the entry directly following the evicted one. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-26 14:58:20 +01:00
Paweł Dziepak	5ff699e09f	row_cache: rework cache to use fast forwarding reader This uncomfortably large patch overhauls cache range reader so that it can take advantage of fast forwarding mutation readers. A significant change in the cache itself is that the continuity flag now is used to determine whether cache is contiguous between the previous entry and the current one. This allows for a significant simplification of the cache code and easier integration with reader fast forwarding. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	18acb0c0e6	row_cache: put cache entry flags in a struct Flags are easier to manage if they are in a single structure. Especially, default initialization and move constructors are simpler and less error prone. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	f248e23db5	row_cache: add do_find_or_create_entry() to reduce code duplication Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	6755a679f6	drop key readers key_readers weren't used since introduction of continuity flag to cache entries. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Avi Kivity	9ac441d3b5	range: adjust split_after to allow split_point outside input range Make split_after() more generic by allowing split_point to be anywhere, not just within the input range. If the split_point is before, the entire range is returned; and if it is after, stdx::nullopt is returned. "before" and "after" are not well defined for wrap-around ranges, but we are phasing them out and soon there will not be wrapping_range::split_after() users. This is a prerequisite for converting partition_range and friends to nonwrapping_range. Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>	2016-10-06 17:54:44 +02:00
Duarte Nunes	f864bca773	row_cache: Deal with side-effects in allocating_section In row_cache::make_reader, we update statistics inside an allocating_section, which retries the supplied function until it can satisfy all allocations by way of reserving LSA memory up front. Since those updates are interleaved with allocations, retries can lead to miscounts. This patch fixes this by updating statistics after all allocations. Fixes #1659 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1473845977-20205-1-git-send-email-duarte@scylladb.com>	2016-09-14 10:46:25 +01:00
Glauber Costa	dc5d8e33af	Revert "row_cache: update sstable histograms on cache hits" This reverts commit `1726b1d0cc`. Reverting this patch turns our SSTable access counter into a miss counter only. The estimated histogram always starts its first bucket at 1, so by marking cache accesses we will be wrongly feeding "1" into the buckets. Notice that this is not yet ideal: nodetool is supposed to show a histogram of all reads, and by doing this we are changing its meaning slightly. Workloads that serve mostly from cache will be distorted towards their misses. The real solution is to use a different histogram, but we will need to enforce a newer version of nodetool for that: the current issue is that nodetool expects an EstimatedHistogram in a specific format on the other side. Conflicts: row_cache.hh Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scylladb.com> Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-01 18:07:31 +03:00
Duarte Nunes	9269256246	row_cache: Accept a trace_state_ptr This patch changes the row_cache so it accepts a trace_state_ptr, which it is responsible for passing to the underlying mutation_reader if needed. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:00:55 +02:00
Glauber Costa	1726b1d0cc	row_cache: update sstable histograms on cache hits If we have a cache hit, we still need to update our sstable histogram - noting that we have touched 0 SSTables. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:14:22 -04:00
Piotr Jastrzebski	3607d99269	Remove clustering_key_filtering_context. Remove clustering_key_filter_factory and clustering_key_filtering_context. Use partition_slice directly with a static get_ranges method. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 20:31:55 +02:00
Piotr Jastrzebski	b05b90b3a5	Introduce clustering_key_filter_ranges. This fixes the problem of multiple concurrent get_ranges calls. Previously each call was invalidating the result of the previous call. Now they don't step on each other's toes. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 19:46:38 +02:00
Avi Kivity	fbc3377ad4	row_cache: add a counter for a miss that did not result in an insertion Such misses are due to concurrent access to the same key. Add a counter to track this as it results in unnecessary I/O being performed. See #1534. Message-Id: <1470139871-14693-1-git-send-email-avi@scylladb.com>	2016-08-02 14:14:27 +02:00
Piotr Jastrzebski	ca9c29e296	Cache information about partition being wide Once we encounter a wide partition store information about this in cache entry and don't try to read it all and cache next time it's requested. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [Paweł: rebased, moved large partition reading logic to cache_entry::read_wide()] Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-29 18:39:22 +01:00
Paweł Dziepak	ee1f1ee1c4	row_cache: fix creating readers for large partitions There were cases of use-after-free introduced by the code responsible for creating mutation_readers for large partitions – the lifetimes of partition ranges and the readers themselves weren't sufficiently extended. Another problem was that if the partition was no longer present in the sstable the reader would return EOS which was then returned by range_populating_reader itself causing its users to incorrectly interpret that as an end of stream. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-29 17:02:17 +01:00
Piotr Jastrzebski	fdfd1af694	Use continuity flag correctly with concurrent invalidations Between reading a cache entry and actually using it, invalidations can happen, so we have to check that no flag was cleared; if one was, we need to read the entry again. Fixes #1464. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <7856b0ded45e42774ccd6f402b5ee42175bd73cf.1469701026.git.piotr@scylladb.com>	2016-07-28 11:55:18 +01:00
Piotr Jastrzebski	37a7d49676	Add collectd counter for uncached wide partitions. Keep track of every read of wide partition that's not cached. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:47:49 +02:00
Piotr Jastrzebski	636a4acfd0	Add flag to configure max size of a cached partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:47:20 +02:00
Piotr Jastrzebski	98c12dc2e2	Try to read whole streamed_mutation up to limit If limit is exceeded then return the streamed_mutation and don't cache it. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:35:35 +02:00
Paweł Dziepak	81e4952c78	row_cache: fix marking last entry as continuous Range queries need to take special care when transitioning between ranges that are read from sstables and ranges that are already in the cache. Original code in such case just started a secondary reader and told it to unconditionally mark the last entry as continuous (primary reader has already returned an element that immediately follows the range that is going to be read from sstables). However, that information may get stale. For instance, by the time the secondary reader finishes reading its range the element immediately following it may get evicted from the cache thus causing continuity flag to be incorrectly set. The solution is to ensure that the element immediately after the range read from sstables is still in the cache. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1468586893-15266-1-git-send-email-pdziepak@scylladb.com>	2016-07-15 15:15:02 +02:00
Avi Kivity	9a8788019d	row_cache: fix visitor for boost <= 1.55 Older boosts can't return a future from a visitor (likely lacking support for move-only objects). Supply a dirty hackaround. Message-Id: <1467822548-25940-1-git-send-email-avi@scylladb.com>	2016-07-06 19:55:51 +03:00
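
The continuity-flag rule described in commits `5ff699e09f` and `567ff96f2a` above can be illustrated with a small sketch. It is not ScyllaDB code: `entry`, `toy_cache` and `evict` are hypothetical names, and a `std::map` keyed by integers stands in for the real cache structure. The only assumption taken from the log is the semantics of the flag: it says whether the cache is continuous between the previous entry and the current one, so eviction has to clear the flag on the entry directly following the evicted one.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for a cache entry (not ScyllaDB's cache_entry).
struct entry {
    // True when the cache holds everything between the previous
    // entry's key and this entry's key.
    bool continuous = false;
};

// Hypothetical toy cache: an ordered map from key to entry.
using toy_cache = std::map<int, entry>;

// Evicts `key`, then clears the continuity flag of the entry that
// directly follows it, because the range ending at that successor
// now has a hole where the evicted entry used to be.
inline void evict(toy_cache& cache, int key) {
    auto it = cache.find(key);
    if (it == cache.end()) {
        return; // nothing to evict
    }
    auto next = cache.erase(it); // erase() returns the successor
    if (next != cache.end()) {
        next->second.continuous = false;
    }
}
```

With entries 10, 20 and 30 where 20 and 30 are marked continuous, evicting 20 leaves 30 marked as not continuous, so a range scan would have to consult the underlying source for the gap between 10 and 30.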

130 Commits