scylladb

Author	SHA1	Message	Date
Glauber Costa	1d7617723d	row cache: pin real dirty during cache updates. Right now, once a region is moved to the cache is no longer visible to the dirty memory system. Not as real dirty nor virtual dirty. The problem is that until a particular partition is moved to the cache it is not evictable. As a result we can OOM the system if we have a lot of pending cache updates as the writes will not be throttled and memory won't be made available. This patch pins the memory used by the region as real dirty before the cache update starts, and unpins it when it is over. In the mean time it gradually releases memory of the partitions that are being moved to cache. I have verified in a couple of workloads that the amount of memory accounted through this is the same amount of memory accounted through the memtable flush procedure. Fixes #1942 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 19:46:36 -05:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Tomasz Grabiec	c78047fa5b	row_cache: Evict partition snapshots If snapshots are not evicted, they may pin unbouned amount of memory for a long time in cache, which may lead to OOM. Evict snapshots together with the entry. Fixes #2775. Fixes #2730.	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	adb159d51b	row_cache: Reuse allocation_strategy::invalidate_references() Modification count in the tracker is redundant, we can rely on allocator's invalidation counter.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	27a3b4bca9	row_cache: Don't invalidate references on insertion modification_count is currently only used to detect invalidation of references, intended to be incremented on erasure. Insertion into intrusive set doesn't invalidate references, so no need to increment the counter.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	d22fdf4261	row_cache: Improve safety of cache updates Cache imposes requirements on how updates to the on-disk mutation source are made: 1) each change to the on-disk muation source must be followed by cache synchronization reflecting that change 2) The two must be serialized with other synchronizations 3) must have strong failure guarantees (atomicity) Because of that, sstable list update and cache synchronization must be done under a lock, and cache synchronization cannot fail to synchronize. Normally cache synchronization achieves no-failure thing by wiping the cache (which is noexcept) in case failure is detect. There are some setup steps hoever which cannot be skipped, e.g. taking a lock followed by switching cache to use the new snapshot. That truly cannot fail. The lock inside cache synchronizers is redundant, since the user needs to take it anyway around the combined operation. In order to make ensuring strong exception guarantees easier, and making the cache interface easier to use correctly, this patch moves the control of the combined update into the cache. This is done by having cache::update() et al accept a callback (external_updater) which is supposed to perform modiciation of the underlying mutation source when invoked. This is in-line with the layering. Cache is layered on top of the on-disk mutation source (it wraps it) and reading has to go through cache. After the patch, modification also goes through cache. This way more of cache's requirements can be confined to its implementation. The failure semantics of update() and other synchronizers needed to change due to strong exception guaratnees. Now if it fails, it means the update was not performed, neither to the cache nor to the underlying mutation source. The database::_cache_update_sem goes away, serialization is done internally by the cache. The external_updater needs to have strong exception guarantees. This requirement is not new. It is however currently violated in some places. This patch marks those callbacks as noexcept and leaves a FIXME. Those should be fixed, but that's not in the scope of this patch. Aborting is still better than corrupting the state. Fixes #2754. Also fixes the following test failure: tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed which started to trigger after commit `318423d50b`. Thread stack allocation may fail, in which case we did not do the necessary invalidation.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	b0f3efa577	row_cache: Extract invalidate_sync()	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	56e3ce05db	row_cache: Don't require presence checker to be supplied externally The API is simpler and safer this way.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	bc3112a187	row_cache: Allow marking as fully continuous on construction Will be needed in tests.	2017-09-04 10:04:29 +02:00
Raphael S. Carvalho	637f3bfa50	db: refresh row cache's underlying data source after compaction Underlying data source in row cache holds a reference to sstable set prior to compaction which isn't released until a memtable flush, which means file descriptors of deleted sstables remains opened, wasting disk space. The fix is to refresh underlying data source in row cache. Fixes #2570. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:49:11 -03:00
Piotr Jastrzebski	b950c59bbb	row_cache: Fix wrong comment on continuity flag This comment was stating exactly the opposite to the truth. This is very misleading Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <79062a061e22ef4c4add24cbdf723cbfb5cda060.1499345295.git.piotr@scylladb.com>	2017-07-07 10:29:19 +02:00
Tomasz Grabiec	62c76abf71	row_cache: Rename num_entries() to partitions() for clarity	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	60c2a86192	row_cache: Track mispopulations also at row level	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	94547db620	row_cache: Track row insertions	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	a58f2c8640	row_cache: Track row hits and misses	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	a5fdff2ac2	row_cache: Add partition_ prefix to current counters In preparation for adding per-row counters.	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	6a22cbceaf	row_cache: Add metrics for operations on underlying reader	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	5c7b6fc164	row_cache: Add reader-related metrics	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	be2e89d596	row_cache: Remove dead code	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	b56232b216	row_cache: Introduce evict()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	c4e8effffa	tests: Add cache_streamed_mutation_test [tgrabiec: - extracted from a larger commit - removed coupling with how cache_streamed_mutation is created (the code went out of sync), used more stable make_reader(). it's simpler too. - replaced false/true literals with is_continuous/is_dummy where appropraite - dropped tests for cache::underlying (class is gone) - reused streamed_mutation_assertions, it has better error messages - fixed the tests to not create tombstones with missing timestamps - relaxed range tombstone assertions to only check information relevant for the query range - print cache on failure for improved debuggability ]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	6f6575f456	row_cache: Enable partial partition population	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e792220c3a	row_cache: Introduce update_invalidating()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	c29878f49f	row_cache: Extract memtable walking logic from update() into do_update() So that it can be reused in update_invalidating().	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	3755641c4b	row_cache: Introduce cache_streamed_mutation This streamed mutation populates cache with the rows requested by the read. It takes whatever it can find in the cache and fetches the remainings from underlying source. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: - fixed maybe_add_to_cache_and_update_continuity() leaking entries if the key already exists in the snapshot - fixed a problem where population race could result in a read missing some rows, because cache_streamed_mutation was advancing the cursor, then deferring, and then checking continuity. We should check continuity atomically with advancing. - fixed rows_handle.maybe_refresh() being accessed outside of update section in read_from_underlying() (undefined behavior) - fixed a problem in start_reading_from_underlying() where we would use incorrect start if lower_bound ended with a range tombstone starting before a key. - range tombstone trimming in add_to_buffer() could create a tombstone which has too low start bound if last_rt.end was a prefix and had inclusive end. invert_kind(end_kind) should be used instead of unconditional inc_start. - range tombstone trimming incorrectly assumed it is fine to trim the tombstone from underlying to the previous fragment's end and emit such tombstone. That would mean the stream can't emit any fragments which start before previous tombstone's end. Solve with range_tombstone_stream. - split add_to_buffer() into overloads for clustering_row, and range_tombstone. Better than wrapping into mutation_fragment before the call and having add_to_buffer() rediscover the information. - changed maybe_add_to_cache_and_update_continuity() to not set continuity to false for existing entries, it's not necessary - moved range tombstone trimming to range_tombstone class - moved range tombstone slicing code to range_tombstone_list and partition_snapshot - can_populate::can_use_cache was unused, dropped - dropped assumption that dummy entries are only at the end - renamed maybe_add_to_cache_and_update_continuity() to maybe_add_to_cache() - dropped no longer needed lower_bound class - extracted row_handle to a seaparate patch - made the copy-from-cache loop preemptable - split maybe_add_next_to_buffer_and_update_continuity(bool) - dropped cache_populator - replaced "underlying" class with use of read_context - replaced can_populate class with a function - simplified lsa_manager methods to avoid moves ]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	a2207ee9a6	row_cache: Move autoupdating_underlying_reader to read_context.hh	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	045888d5f3	row_cache: Introduce partition_range_cursor	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	f0cc86e5db	row_cache: Introduce cache_entry::position()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e3272526a1	row_cache: Allow comparing with ring_position views in row_cache::compare	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	64626b32b0	row_cache: Make printable	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	54b3da1910	row_cache: Introduce find_or_create() helper	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	f2d2c221d4	row_cache: Return cache_entry reference from do_find_or_create_entry Will be useful when additional action needs to be done on the entry after it was created or constructed.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	0db6bdc916	row_cache: Introduce cache_entry constructor which constructs incomplete entry	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1c642219c9	row_cache: Ensure there is always a dummy entry after all clustered rows Algorithms will assume that.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	bbfa52822e	row_cache: Switch readers to use per-entry snapshots Currently readers are always using the latest snapshot. This is fine for respecting write atomicity if partitions are fully continuous in cache (now), but will break write atomicity once partial population is allowed. Consider the following case: flush write(ck=1), write(ck=2) -> snapshot_1 cache reader 1 reads and inserts ck=1 @snapshot_1 flush write(ck=1), write(ck=2) -> snapshot_2 cache reader 2 reads and inserts ck=2 @snapshot_2 Because cache update is not atomic, it can happen that reader 2 will complete while the partition hasn't been updated yet for snapshot_2. In such case, after read 2 the partition would contain ck=1 from snapshot_1 and ck=2 from snapshot_2. It will match neither of the snapshots, and this could violate write atomicity. To solve this problem we conceptually assign each partition key in the ring to its current snapshot which it reflects. The update process gradually converts entries in ring order to the new snapshot. Reads will not be using the latest snapshot, but rather the current snapshot for the position in the ring they are at. There is a race between the update process and populating reads. Since after the update all entries must reflect the new snapshot, reads using the old snapshot cannot be allowed to insert data which can no longer be reached by the update process. Before this patch this race was prevented by the use of a phased_barrier, where readers would keep phased_barrier::operation alive between starting a read of a partition and inserting it into cache. Cache update was waiting for all prior operations before starting the update. Any later read which was not waited for would use the latest snapshot for reads, so the update process didn't have to fix anything up for such reads. After this change, later reads cannot always use the latest snapshot, they have to use the snapshot corresponding to given entry. So it's not enough for update() to wait for prior reads in order to prevent stale populations. The (simple) solution implemented in this patch is to detect the conflict and abandon population of given sub-range. In general, reads are allowed to populate given range only if it belongs to a single snapshot. Note that the range here is not the whole query range. For population of continuity, it is the range starting after the previous key and ending after the key being inserted. When populating a partition entry, the range is a singular range containing only the partition key. Readers switch to new snapshots automatically as they move across the ring. It's possible that the insertion of the partition doesn't conflict, but continuity does. In such case the entry will be inserted but continuity will not be set.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8ba6366610	row_cache: Switch to using snapshot_source Currently every time cache needs to create reader for missing data it obtains a reader which is most up to date. That reader includes writes from later populate phases, for which update() was not yet called. This will be problematic once we allow partitions to be partially populated, because different parts of the partition could be partially populated using readers using different sets of writes, and break write atomicity. The solution will be to always populate given partition using the same set of writes, using reader created from the current snapshot. The snapshot changes only on update(), with update() gradually converting each partition to the new snapshot.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e23c7e2f34	row_cache: Rework invalidate() implementation 1) Reduce duplication by delegating to more general overloads 2) Improve documentation to not mention effects in terms of population (detail) but rather write visibiliy 3) Rename clear() to invalidate() and merge with the range variant, it has the same semantics	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	2d73c193e7	row_cache: Introduce read_context This object stores all read relevant context required all over the place. This leads to a cleaner code. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: - made read_context shareable to allow storing shared mutable state later - added range and cache getters ]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	a3ff8db323	row_cache: Introduce autoupdating_underlying_reader This is an abstraction that represents a reader to the underlying source and auto updates itself to make sure the reader reflects the latest state of the underlying source. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: Add range getter to avoid friendships]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	60346a2819	row_cache: remove unused read overload This will simplify the following patches and unused code should be removed anyway. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	77f944880c	cache: Remove support for wide partitions This will be handled by row cache now. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::fowarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exception are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve of anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
Avi Kivity	6e2c9ef9fb	Revert "Allow reading exactly desired byte ranges and fast_forward_to" This reverts commit `317d7fc253` (and also the related `2c57ab84b2`). It causes crashes during range scans, reported by Gleb: "To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s dataset and 3 node cluster. Backtrace: at /home/gleb/work/seastar/seastar/core/apply.hh:36 rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57 range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142 at ./seastar/core/future.hh:890 at /home/gleb/work/seastar/seastar/core/future-util.hh:119 at /home/gleb/work/seastar/seastar/core/future-util.hh:142	2017-06-18 16:10:21 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::fowarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exception are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve of anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Tomasz Grabiec	d1bde3036e	row_cache: Keep counters in a struct So that taking a snapshot of all stats is easy.	2017-05-17 14:15:14 +02:00
Amnon Heiman	064f5e1b63	row_cache: switch to the metrics layer registration This patch moves the row_cache metrics registration from collectd to the metric layer. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <20170321143812.785-3-amnon@scylladb.com>	2017-03-21 16:42:58 +02:00
Tomasz Grabiec	892d4a2165	db: Enable creating forwardable readers via mutation_source Right now all mutation source implementations will use make_forwardable() wrapper.	2017-02-23 18:50:44 +01:00
Tomasz Grabiec	87f15624f4	row_cache: Add counter for wide partition mispopulations Message-Id: <1484733250-14470-1-git-send-email-tgrabiec@scylladb.com>	2017-01-18 09:57:51 +00:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Piotr Jastrzebski	5ec668c9c6	Add separate LRU for wide partitions. Evict wide partitions only every 1000 normal partition evictions. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-11-15 16:19:13 +01:00

1 2 3

114 Commits