scylladb

Author	SHA1	Message	Date
Avi Kivity	48b9e47f7d	Revert "row_cache: Add missing handling for failures happening outside the updating thread" This reverts commit `f9feb310ab` (requested by author).	2017-08-29 19:26:02 +03:00
Tomasz Grabiec	f9feb310ab	row_cache: Add missing handling for failures happening outside the updating thread Thread stack allocation may fail, in which case we did not do the necessary invalidation. Fix by hoisting the scope of the cleanup function. Also fixes the following test failure: tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed which started to trigger after commit `318423d50b`. Message-Id: <1504023113-30374-2-git-send-email-tgrabiec@scylladb.com>	2017-08-29 19:17:22 +03:00
Raphael S. Carvalho	637f3bfa50	db: refresh row cache's underlying data source after compaction Underlying data source in row cache holds a reference to sstable set prior to compaction which isn't released until a memtable flush, which means file descriptors of deleted sstables remain open, wasting disk space. The fix is to refresh underlying data source in row cache. Fixes #2570. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:49:11 -03:00
Piotr Jastrzebski	a4b6cfe8f0	row_cache: use continuity info in single partition queries If a query requests a single partition that is inside a range that has already been queried, use the continuity info and don't go to disk when it's not needed. Fixes #2244. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <15bb3b5b03225e7402e3862da53b5e06d3f4fa74.1499345295.git.piotr@scylladb.com>	2017-07-07 10:29:19 +02:00
Tomasz Grabiec	37d2b6b3c6	row_cache: Switch _stats.hits/misses to row granularity Those are exported by the RESTful APIs called "get_row_hits/get_row_misses" and reported by nodetool.	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	60c2a86192	row_cache: Track mispopulations also at row level	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	94547db620	row_cache: Track row insertions	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	a58f2c8640	row_cache: Track row hits and misses	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	77b2a92ece	row_cache: Make mispopulation counter also apply for continuity information	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	a5fdff2ac2	row_cache: Add partition_ prefix to current counters In preparation for adding per-row counters.	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	6a22cbceaf	row_cache: Add metrics for operations on underlying reader	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	5c7b6fc164	row_cache: Add reader-related metrics	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	e720b317c9	row_cache: Restore update of concurrent_misses_same_key It was lost in action in `6f6575f456`. Message-Id: <1499168837-5072-1-git-send-email-tgrabiec@scylladb.com>	2017-07-04 14:51:05 +03:00
Tomasz Grabiec	1d6fec0755	row_cache: Drop not very useful prefixes from metric names This drops "total_operations_" and "objects_" prefixes. There is no convention of adding them in other parts of the system, and they don't add much value. Fixes scylladb/scylla-grafana-monitoring#169. Message-Id: <1499160342-25865-1-git-send-email-tgrabiec@scylladb.com>	2017-07-04 13:37:12 +03:00
Tomasz Grabiec	97005825bf	row_cache: Fix compilation errors with gcc 5 Message-Id: <1498741526-27055-1-git-send-email-tgrabiec@scylladb.com>	2017-06-29 16:34:46 +03:00
Tomasz Grabiec	786e75dbf7	row_cache: Use continuity information to decide whether to populate If cache is missing given key, but the range is marked as continuous, it means sstables don't have that entry and we can insert it without asking the presence checker (bloom filter based). The latter is more expensive and gives false positives. So this improves update performance and hit ratio. Another positive effect is that we don't have to clear continuity now. Fixes #1999. Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>	2017-06-28 13:32:48 +03:00
Tomasz Grabiec	b56232b216	row_cache: Introduce evict()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	6f6575f456	row_cache: Enable partial partition population	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e792220c3a	row_cache: Introduce update_invalidating()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	c29878f49f	row_cache: Extract memtable walking logic from update() into do_update() So that it can be reused in update_invalidating().	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	509a0d8a83	row_cache: Allow reading from underlying through read_context The interaction will be as follows: - Before creating cache_streamed_mutation for given partition, cache mutation reader sets up read_context for current partition (in one of two ways) so that the matching underlying streamed_mutation can be accessed at any time by cache_streamed_mutation. - cache_streamed_mutation assumes that read_context is set up for current partition and invokes fast_forward_to() and get_next_fragment() to access the underlying streamed_mutation.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	a1d3e0318c	row_cache: Store autoupdating_underlying_reader in read_context Will be reused for reading of incomplete partition entries.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	3f2320c377	row_cache: Store information whether query is a range query in read_context We will need to use this information later in yet another place, when creating a reader for incomplete cache entry. This refactors the code so that there is a single place which determines this fact.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	a2207ee9a6	row_cache: Move autoupdating_underlying_reader to read_context.hh	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	ca920bd0ef	row_cache: Keep only one streamed_mutation in scanning_and_populating_reader Currently scanning_and_populating_reader asks just_cache_scanning_reader for the next partition from cache, together with information if the range is continuous. If it's not, it saves the partition it got from it and moves on to reading from the underlying reader up to that partition. When that's done, it emits the stored partition. This approach won't work well with upcoming changes for storing partial partitions. We won't have whole partitions any more, so streamed_mutation returned for the entry needs to be prepared for reading from the underlying mutation source. We want to reuse the same underlying reader as much as possible, so all streamed_mutations for given read (read_context) will share the state of the underlying reader. Construction of a streamed_mutation will depend on the fact that the shared state is set up for it, so we cannot have two streamed_mutations prepared at the same time (one for entry from primary, and one for the earlier entry being populated). This change defers the creation of a streamed_mutation for the entry present in cache until the whole reader reaches it to avoid this problem. This will also have another potentially beneficial effect. Since we defer the decision about which snapshot to use until we reach the entry, there is a higher chance that the current snapshot of the entry will match the one used last by the populating read, and that we will be able to reuse the reader. It's implemented by utilizing a stable partition cursor which tracks its current position so that it's possible to revisit the cache entry (if it's still there) after population ends. The functionality of just_cache_scanning_reader was inlined into scanning_and_populating_reader.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	045888d5f3	row_cache: Introduce partition_range_cursor	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	c3905bf235	row_cache: Print position instead of key of cache_entry	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	5bfecaad99	row_cache: Switch invalidate_unwrapped() to use ring_position_view ranges It's needed before switching cache_entry ordering to rely solely on cache_entry::position() so that invalidate_unwrapped() never removes the dummy entry at the end. Currently if the range has upper bound like this: { ring_position::max(), inclusive=true } The code which selects entries for removal would include the dummy row at the end. It uses upper_bound() to get the end iterator, and the dummy entry has a position which is equal to the position in the bound. ring_position_view ranges are end-exclusive, so it's impossible to create a partition range which would include a dummy entry. The code is also simpler.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	64626b32b0	row_cache: Make printable	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	54b3da1910	row_cache: Introduce find_or_create() helper	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	f2d2c221d4	row_cache: Return cache_entry reference from do_find_or_create_entry Will be useful when additional action needs to be done on the entry after it was created or constructed.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	bbfa52822e	row_cache: Switch readers to use per-entry snapshots Currently readers are always using the latest snapshot. This is fine for respecting write atomicity if partitions are fully continuous in cache (now), but will break write atomicity once partial population is allowed. Consider the following case: flush write(ck=1), write(ck=2) -> snapshot_1 cache reader 1 reads and inserts ck=1 @snapshot_1 flush write(ck=1), write(ck=2) -> snapshot_2 cache reader 2 reads and inserts ck=2 @snapshot_2 Because cache update is not atomic, it can happen that reader 2 will complete while the partition hasn't been updated yet for snapshot_2. In such case, after read 2 the partition would contain ck=1 from snapshot_1 and ck=2 from snapshot_2. It will match neither of the snapshots, and this could violate write atomicity. To solve this problem we conceptually assign each partition key in the ring to its current snapshot which it reflects. The update process gradually converts entries in ring order to the new snapshot. Reads will not be using the latest snapshot, but rather the current snapshot for the position in the ring they are at. There is a race between the update process and populating reads. Since after the update all entries must reflect the new snapshot, reads using the old snapshot cannot be allowed to insert data which can no longer be reached by the update process. Before this patch this race was prevented by the use of a phased_barrier, where readers would keep phased_barrier::operation alive between starting a read of a partition and inserting it into cache. Cache update was waiting for all prior operations before starting the update. Any later read which was not waited for would use the latest snapshot for reads, so the update process didn't have to fix anything up for such reads. After this change, later reads cannot always use the latest snapshot, they have to use the snapshot corresponding to given entry. 
So it's not enough for update() to wait for prior reads in order to prevent stale populations. The (simple) solution implemented in this patch is to detect the conflict and abandon population of given sub-range. In general, reads are allowed to populate given range only if it belongs to a single snapshot. Note that the range here is not the whole query range. For population of continuity, it is the range starting after the previous key and ending after the key being inserted. When populating a partition entry, the range is a singular range containing only the partition key. Readers switch to new snapshots automatically as they move across the ring. It's possible that the insertion of the partition doesn't conflict, but continuity does. In such case the entry will be inserted but continuity will not be set.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8ba6366610	row_cache: Switch to using snapshot_source Currently every time cache needs to create reader for missing data it obtains a reader which is most up to date. That reader includes writes from later populate phases, for which update() was not yet called. This will be problematic once we allow partitions to be partially populated, because different parts of the partition could be partially populated using readers using different sets of writes, and break write atomicity. The solution will be to always populate given partition using the same set of writes, using reader created from the current snapshot. The snapshot changes only on update(), with update() gradually converting each partition to the new snapshot.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e23c7e2f34	row_cache: Rework invalidate() implementation 1) Reduce duplication by delegating to more general overloads 2) Improve documentation to not mention effects in terms of population (detail) but rather write visibility 3) Rename clear() to invalidate() and merge with the range variant, it has the same semantics	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	2d73c193e7	row_cache: Introduce read_context This object stores all read relevant context required all over the place. This leads to a cleaner code. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: - made read_context shareable to allow storing shared mutable state later - added range and cache getters ]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	a3ff8db323	row_cache: Introduce autoupdating_underlying_reader This is an abstraction that represents a reader to the underlying source and auto updates itself to make sure the reader reflects the latest state of the underlying source. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: Add range getter to avoid friendships]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	60346a2819	row_cache: remove unused read overload This will simplify the following patches and unused code should be removed anyway. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	77f944880c	cache: Remove support for wide partitions This will be handled by row cache now. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
Avi Kivity	6e2c9ef9fb	Revert "Allow reading exactly desired byte ranges and fast_forward_to" This reverts commit `317d7fc253` (and also the related `2c57ab84b2`). It causes crashes during range scans, reported by Gleb: "To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s dataset and 3 node cluster. Backtrace: at /home/gleb/work/seastar/seastar/core/apply.hh:36 rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57 range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142 at ./seastar/core/future.hh:890 at /home/gleb/work/seastar/seastar/core/future-util.hh:119 at /home/gleb/work/seastar/seastar/core/future-util.hh:142	2017-06-18 16:10:21 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Tomasz Grabiec	6cf2841654	mvcc: Extract partition_snapshot_reader to separate header Right now the whole world includes it transitively, which results in painful recompiles when the code changes. Relax dependencies. Message-Id: <1495620201-8046-1-git-send-email-tgrabiec@scylladb.com>	2017-05-24 12:13:15 +01:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Tomasz Grabiec	d1bde3036e	row_cache: Keep counters in a struct So that taking a snapshot of all stats is easy.	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	35c9dfecc2	row_cache: Implement mutation_reader::fast_forward_to() for cache scanner Needed to make perf_fast_forward work with cache enabled.	2017-05-17 14:15:14 +02:00
Tomasz Grabiec	7b6be7e188	row_cache: Add missing propagation of the forwarding flag in handle_large_partition() Message-Id: <1494503145-25622-1-git-send-email-tgrabiec@scylladb.com>	2017-05-11 15:47:19 +01:00
Tomasz Grabiec	0351ab8bc6	row_cache: Fix undefined behavior in read_wide() _underlying is created with _range, which is captured by reference. But range_and_underlyig_reader is moved after being constructed by do_with(), so _range reference is invalidated. Fixes #2377. Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>	2017-05-11 09:43:43 +01:00
Amnon Heiman	064f5e1b63	row_cache: switch to the metrics layer registration This patch moves the row_cache metrics registration from collectd to the metric layer. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170321143812.785-3-amnon@scylladb.com>	2017-03-21 16:42:58 +02:00
Tomasz Grabiec	892d4a2165	db: Enable creating forwardable readers via mutation_source Right now all mutation source implementations will use make_forwardable() wrapper.	2017-02-23 18:50:44 +01:00
Glauber Costa	facb0aa6d9	row_cache: rewrite loop so that debug mode doesn't become a noop need_preempt() is always true in debug mode. Because of that, this loop will never be executed. Rewrite it as a do-while loop so we are sure that it is executed at least once - or exactly once in debug mode. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <1485913079-1283-1-git-send-email-glauber@scylladb.com>	2017-02-01 10:02:13 +02:00
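
The loop rewrite described in `facb0aa6d9` can be illustrated with a minimal sketch. This is not the actual row_cache code: `always_preempt`, `drain_while` and `drain_do_while` are hypothetical names, and the preemption check is passed in as a callable so that the "always true in debug mode" behavior can be simulated without Seastar.

```cpp
#include <cassert>   // only needed if you add your own checks around these helpers
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a preemption check that always asks to yield,
// which is how the commit message describes need_preempt() in debug mode.
struct always_preempt {
    bool operator()() const { return true; }
};

// Pre-checked loop: the preemption check runs before any work is done,
// so with always_preempt the body never executes.
template <typename NeedPreempt>
std::size_t drain_while(std::vector<int>& pending, NeedPreempt need_preempt) {
    std::size_t processed = 0;
    while (!pending.empty() && !need_preempt()) {
        pending.pop_back();
        ++processed;
    }
    return processed;
}

// Post-checked loop: one item is always processed (if any exist) before the
// preemption check is consulted, so progress is made even in debug mode.
template <typename NeedPreempt>
std::size_t drain_do_while(std::vector<int>& pending, NeedPreempt need_preempt) {
    std::size_t processed = 0;
    if (pending.empty()) {
        return processed;
    }
    do {
        pending.pop_back();
        ++processed;
    } while (!pending.empty() && !need_preempt());
    return processed;
}
```

With a check that always returns true, `drain_while` processes nothing while `drain_do_while` processes exactly one item per call, which matches the "at least once, or exactly once in debug mode" wording of the commit message.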


166 Commits