scylladb

Author	SHA1	Message	Date
Tomasz Grabiec	1971332195	row_cache: Fix exception safety of cache_entry::read() When we fail, we need to return streamed_mutation back, so that the operation can be retried. Causes SIGSEGV on nullptr otherwise on bad_alloc.	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	11a195c403	row_cache: scanning_and_populating_reader: Fix exception unsafety causing read to skip data If assignment to _lower_bound in the "_secondary_in_progress = true;" case in do_read_from_primary() throws due to allocation failure, the update section will be retried and we will take the not_moved path, skipping the range which was discontinuous and was supposed to be read from underlying. Fix by redoing lookup using _lower_bound in case the section is retried. When we retry, _primary.valid() will be false. We need to ensure now that _lower_bound is always valid. Fixes #2944.	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	5dc1ee41e4	row_cache: partition_range_cursor: Extract valid() and advance_to() from refresh()	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	09c49b2db3	cache_streamed_mutation: Add trace-level logging to cache_streamed_mutation	2017-11-13 20:55:14 +01:00
Glauber Costa	1d7617723d	row cache: pin real dirty during cache updates. Right now, once a region is moved to the cache it is no longer visible to the dirty memory system, neither as real dirty nor as virtual dirty. The problem is that until a particular partition is moved to the cache it is not evictable. As a result we can OOM the system if we have a lot of pending cache updates as the writes will not be throttled and memory won't be made available. This patch pins the memory used by the region as real dirty before the cache update starts, and unpins it when it is over. In the meantime it gradually releases memory of the partitions that are being moved to cache. I have verified in a couple of workloads that the amount of memory accounted through this is the same amount of memory accounted through the memtable flush procedure. Fixes #1942 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 19:46:36 -05:00
Glauber Costa	b836005555	row_cache: modernize use of seastar threads For a while now we have an async() function, that simplifies the code by not needing to issue an explicit join. This patch converts the row cache to use async() as well, which most of our code already does. Doing so will make it easier to make changes to update_cache. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Tomasz Grabiec	083b9cddef	row_cache: Fix handling of concurrent partition population This fixes a regression introduced in `27a3b4bca9` (master only). partition_range_cursor assumes that as long as references are valid, _end is valid as well. But if new entries were inserted before _end, it may not be, if the new entries fall after the query range. This may result in reads returning partitions from outside the query range. Message-Id: <1507815478-20269-1-git-send-email-tgrabiec@scylladb.com>	2017-10-12 15:55:20 +01:00
Piotr Jastrzebski	6069bab755	Cache single queries to non-existing partitions This way we don't need to query sstables again when the query is repeated. Fixes #1533 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <8f8559ff19c534dbbb7c9ef6c28271cec607ba20.1506521461.git.piotr@scylladb.com>	2017-09-27 16:15:18 +02:00
Tomasz Grabiec	0911fbbdef	row_cache: Fix row_cache::update_invalidating() evict() doesn't guarantee that the whole partition is discontinuous. In particular, partition tombstone cannot be marked as discontinuous. The parts which are still continuous must be updated. Broken after `c78047fa5b`. Message-Id: <1505375684-28574-1-git-send-email-tgrabiec@scylladb.com>	2017-09-14 10:58:25 +03:00
Tomasz Grabiec	c78047fa5b	row_cache: Evict partition snapshots If snapshots are not evicted, they may pin unbounded amount of memory for a long time in cache, which may lead to OOM. Evict snapshots together with the entry. Fixes #2775. Fixes #2730.	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	adb159d51b	row_cache: Reuse allocation_strategy::invalidate_references() Modification count in the tracker is redundant, we can rely on allocator's invalidation counter.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	27a3b4bca9	row_cache: Don't invalidate references on insertion modification_count is currently only used to detect invalidation of references, intended to be incremented on erasure. Insertion into intrusive set doesn't invalidate references, so no need to increment the counter.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	2df6f356b1	mvcc: Store LSA region reference in partition_snapshot Will be useful for improving encapsulation.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	d22fdf4261	row_cache: Improve safety of cache updates Cache imposes requirements on how updates to the on-disk mutation source are made: 1) each change to the on-disk mutation source must be followed by cache synchronization reflecting that change 2) The two must be serialized with other synchronizations 3) must have strong failure guarantees (atomicity) Because of that, sstable list update and cache synchronization must be done under a lock, and cache synchronization cannot fail to synchronize. Normally cache synchronization achieves no-failure thing by wiping the cache (which is noexcept) in case failure is detected. There are some setup steps however which cannot be skipped, e.g. taking a lock followed by switching cache to use the new snapshot. That truly cannot fail. The lock inside cache synchronizers is redundant, since the user needs to take it anyway around the combined operation. In order to make ensuring strong exception guarantees easier, and making the cache interface easier to use correctly, this patch moves the control of the combined update into the cache. This is done by having cache::update() et al accept a callback (external_updater) which is supposed to perform modification of the underlying mutation source when invoked. This is in-line with the layering. Cache is layered on top of the on-disk mutation source (it wraps it) and reading has to go through cache. After the patch, modification also goes through cache. This way more of cache's requirements can be confined to its implementation. The failure semantics of update() and other synchronizers needed to change due to strong exception guarantees. Now if it fails, it means the update was not performed, neither to the cache nor to the underlying mutation source. The database::_cache_update_sem goes away, serialization is done internally by the cache. The external_updater needs to have strong exception guarantees. This requirement is not new. 
It is however currently violated in some places. This patch marks those callbacks as noexcept and leaves a FIXME. Those should be fixed, but that's not in the scope of this patch. Aborting is still better than corrupting the state. Fixes #2754. Also fixes the following test failure: tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed which started to trigger after commit `318423d50b`. Thread stack allocation may fail, in which case we did not do the necessary invalidation.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	b0f3efa577	row_cache: Extract invalidate_sync()	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	56e3ce05db	row_cache: Don't require presence checker to be supplied externally The API is simpler and safer this way.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	1a2f17d42c	row_cache: Make populate() preserve continuity	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	bc3112a187	row_cache: Allow marking as fully continuous on construction Will be needed in tests.	2017-09-04 10:04:29 +02:00
Avi Kivity	48b9e47f7d	Revert "row_cache: Add missing handling for failures happening outside the updating thread" This reverts commit `f9feb310ab` (requested by author).	2017-08-29 19:26:02 +03:00
Tomasz Grabiec	f9feb310ab	row_cache: Add missing handling for failures happening outside the updating thread Thread stack allocation may fail, in which case we did not do the necessary invalidation. Fix by hoisting the scope of the cleanup function. Also fixes the following test failure: tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed which started to trigger after commit `318423d50b`. Message-Id: <1504023113-30374-2-git-send-email-tgrabiec@scylladb.com>	2017-08-29 19:17:22 +03:00
Raphael S. Carvalho	637f3bfa50	db: refresh row cache's underlying data source after compaction Underlying data source in row cache holds a reference to sstable set prior to compaction which isn't released until a memtable flush, which means file descriptors of deleted sstables remain open, wasting disk space. The fix is to refresh underlying data source in row cache. Fixes #2570. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-07-24 15:49:11 -03:00
Piotr Jastrzebski	a4b6cfe8f0	row_cache: use continuity info in single partition queries If a query requests a single partition that is inside a range that has already been queried, use the continuity info and don't go to disk when it's not needed. Fixes #2244. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <15bb3b5b03225e7402e3862da53b5e06d3f4fa74.1499345295.git.piotr@scylladb.com>	2017-07-07 10:29:19 +02:00
Tomasz Grabiec	37d2b6b3c6	row_cache: Switch _stats.hits/misses to row granularity Those are exported by the RESTful APIs called "get_row_hits/get_row_misses" and reported by nodetool.	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	60c2a86192	row_cache: Track mispopulations also at row level	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	94547db620	row_cache: Track row insertions	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	a58f2c8640	row_cache: Track row hits and misses	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	77b2a92ece	row_cache: Make mispopulation counter also apply for continuity information	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	a5fdff2ac2	row_cache: Add partition_ prefix to current counters In preparation for adding per-row counters.	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	6a22cbceaf	row_cache: Add metrics for operations on underlying reader	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	5c7b6fc164	row_cache: Add reader-related metrics	2017-07-04 13:55:06 +02:00
Tomasz Grabiec	e720b317c9	row_cache: Restore update of concurrent_misses_same_key It was lost in action in `6f6575f456`. Message-Id: <1499168837-5072-1-git-send-email-tgrabiec@scylladb.com>	2017-07-04 14:51:05 +03:00
Tomasz Grabiec	1d6fec0755	row_cache: Drop not very useful prefixes from metric names This drops "total_operations_" and "objects_" prefixes. There is no convention of adding them in other parts of the system, and they don't add much value. Fixes scylladb/scylla-grafana-monitoring#169. Message-Id: <1499160342-25865-1-git-send-email-tgrabiec@scylladb.com>	2017-07-04 13:37:12 +03:00
Tomasz Grabiec	97005825bf	row_cache: Fix compilation errors with gcc 5 Message-Id: <1498741526-27055-1-git-send-email-tgrabiec@scylladb.com>	2017-06-29 16:34:46 +03:00
Tomasz Grabiec	786e75dbf7	row_cache: Use continuity information to decide whether to populate If cache is missing given key, but the range is marked as continuous, it means sstables don't have that entry and we can insert it without asking the presence checker (bloom filter based). The latter is more expensive and gives false positives. So this improves update performance and hit ratio. Another positive effect is that we don't have to clear continuity now. Fixes #1999. Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>	2017-06-28 13:32:48 +03:00
Tomasz Grabiec	b56232b216	row_cache: Introduce evict()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	6f6575f456	row_cache: Enable partial partition population	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e792220c3a	row_cache: Introduce update_invalidating()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	c29878f49f	row_cache: Extract memtable walking logic from update() into do_update() So that it can be reused in update_invalidating().	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	509a0d8a83	row_cache: Allow reading from underlying through read_context The interaction will be as follows: - Before creating cache_streamed_mutation for given partition, cache mutation reader sets up read_context for current partition (in one of two ways) so that the matching underlying streamed_mutation can be accessed at any time by cache_streamed_mutation. - cache_streamed_mutation assumes that read_context is set up for current partition and invokes fast_forward_to() and get_next_fragment() to access the underlying streamed_mutation.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	a1d3e0318c	row_cache: Store autoupdating_underlying_reader in read_context Will be reused for reading of incomplete partition entries.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	3f2320c377	row_cache: Store information whether query is a range query in read_context We will need to use this information later in yet another place, when creating a reader for incomplete cache entry. This refactors the code so that there is a single place which determines this fact.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	a2207ee9a6	row_cache: Move autoupdating_underlying_reader to read_context.hh	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	ca920bd0ef	row_cache: Keep only one streamed_mutation in scanning_and_populating_reader Currently scanning_and_populating_reader asks just_cache_scanning_reader for the next partition from cache, together with information if the range is continuous. If it's not, it saves the partition it got from it and moves on to reading from the underlying reader up to that partition. When that's done, it emits the stored partition. This approach won't work well with upcoming changes for storing partial partitions. We won't have whole partitions any more, so streamed_mutation returned for the entry needs to be prepared for reading from the underlying mutation source. We want to reuse the same underlying reader as much as possible, so all streamed_mutations for given read (read_context) will share the state of the underlying reader. Construction of a streamed_mutation will depend on the fact that the shared state is set up for it, so we cannot have two streamed_mutations prepared at the same time (one for entry from primary, and one for the earlier entry being populated). This change defers the creation of a streamed_mutation for the entry present in cache until the whole reader reaches it to avoid this problem. This will also have another potentially beneficial effect. Since we defer the decision about which snapshot to use until we reach the entry, there is a higher chance that the current snapshot of the entry will match the one used last by the populating read, and that we will be able to reuse the reader. It's implemented by utilizing a stable partition cursor which tracks its current position so that it's possible to revisit the cache entry (if it's still there) after population ends. The functionality of just_cache_scanning_reader was inlined into scanning_and_populating_reader.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	045888d5f3	row_cache: Introduce partition_range_cursor	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	c3905bf235	row_cache: Print position instead of key of cache_entry	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	5bfecaad99	row_cache: Switch invalidate_unwrapped() to use ring_position_view ranges It's needed before switching cache_entry ordering to rely solely on cache_entry::position() so that invalidate_unwrapped() never removes the dummy entry at the end. Currently if the range has upper bound like this: { ring_position::max(), inclusive=true } The code which selects entries for removal would include the dummy row at the end. It uses upper_bound() to get the end iterator, and the dummy entry has a position which is equal to the position in the bound. ring_position_view ranges are end-exclusive, so it's impossible to create a partition range which would include a dummy entry. The code is also simpler.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	64626b32b0	row_cache: Make printable	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	54b3da1910	row_cache: Introduce find_or_create() helper	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	f2d2c221d4	row_cache: Return cache_entry reference from do_find_or_create_entry Will be useful when additional action needs to be done on the entry after it was created or constructed.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	bbfa52822e	row_cache: Switch readers to use per-entry snapshots Currently readers are always using the latest snapshot. This is fine for respecting write atomicity if partitions are fully continuous in cache (now), but will break write atomicity once partial population is allowed. Consider the following case: flush write(ck=1), write(ck=2) -> snapshot_1 cache reader 1 reads and inserts ck=1 @snapshot_1 flush write(ck=1), write(ck=2) -> snapshot_2 cache reader 2 reads and inserts ck=2 @snapshot_2 Because cache update is not atomic, it can happen that reader 2 will complete while the partition hasn't been updated yet for snapshot_2. In such case, after read 2 the partition would contain ck=1 from snapshot_1 and ck=2 from snapshot_2. It will match neither of the snapshots, and this could violate write atomicity. To solve this problem we conceptually assign each partition key in the ring to its current snapshot which it reflects. The update process gradually converts entries in ring order to the new snapshot. Reads will not be using the latest snapshot, but rather the current snapshot for the position in the ring they are at. There is a race between the update process and populating reads. Since after the update all entries must reflect the new snapshot, reads using the old snapshot cannot be allowed to insert data which can no longer be reached by the update process. Before this patch this race was prevented by the use of a phased_barrier, where readers would keep phased_barrier::operation alive between starting a read of a partition and inserting it into cache. Cache update was waiting for all prior operations before starting the update. Any later read which was not waited for would use the latest snapshot for reads, so the update process didn't have to fix anything up for such reads. After this change, later reads cannot always use the latest snapshot, they have to use the snapshot corresponding to given entry. 
So it's not enough for update() to wait for prior reads in order to prevent stale populations. The (simple) solution implemented in this patch is to detect the conflict and abandon population of given sub-range. In general, reads are allowed to populate given range only if it belongs to a single snapshot. Note that the range here is not the whole query range. For population of continuity, it is the range starting after the previous key and ending after the key being inserted. When populating a partition entry, the range is a singular range containing only the partition key. Readers switch to new snapshots automatically as they move across the ring. It's possible that the insertion of the partition doesn't conflict, but continuity does. In such case the entry will be inserted but continuity will not be set.	2017-06-24 18:06:11 +02:00
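
The population rule described in commit bbfa52822e can be illustrated with a small model. The sketch below is not ScyllaDB code: `toy_cache`, `ring_pos`, `phase_t`, `populate_result` and every member name are hypothetical stand-ins, ring positions are plain integers, and a snapshot is reduced to a phase number. It assumes the update process converts entries in ring order, so every position before the update front already reflects the new snapshot. A read may insert an entry only if the key belongs to the snapshot the read used, and may set continuity only if the whole range up to the key belongs to that same snapshot.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <set>

// Hypothetical model types; none of these names come from ScyllaDB.
using ring_pos = int;           // stand-in for a position in the ring
using phase_t = std::uint64_t;  // stand-in for a snapshot identity

struct populate_result {
    bool inserted;        // the partition entry was added to the cache
    bool continuity_set;  // the range ending at the key was marked continuous
};

class toy_cache {
    phase_t _phase = 0;              // snapshot of positions not yet converted
    std::optional<ring_pos> _front;  // during an update: positions < *_front use _phase + 1
    std::set<ring_pos> _entries;
    std::set<ring_pos> _continuous;  // keys whose preceding range is continuous
public:
    // Snapshot that a given ring position currently reflects.
    phase_t phase_of(ring_pos p) const {
        return (_front && p < *_front) ? _phase + 1 : _phase;
    }
    // The update converts entries gradually, in ring order.
    void begin_update() { _front = std::numeric_limits<ring_pos>::min(); }
    void advance_update(ring_pos p) { _front = p; }
    void end_update() { ++_phase; _front.reset(); }

    // range_start is the first position after the previous key; key is the
    // partition being inserted; reader_phase is the snapshot the read used.
    populate_result try_populate(ring_pos range_start, ring_pos key, phase_t reader_phase) {
        populate_result r{false, false};
        if (phase_of(key) != reader_phase) {
            return r;  // stale read: abandon population of this sub-range
        }
        _entries.insert(key);
        r.inserted = true;
        // Continuity covers [range_start, key]; it must not span two snapshots.
        if (phase_of(range_start) == reader_phase) {
            _continuous.insert(key);
            r.continuity_set = true;
        }
        return r;
    }
    bool contains(ring_pos key) const { return _entries.count(key) != 0; }
};
```

In this model a reader that started under the old snapshot can still populate positions the update has not reached yet, which is the case where the entry is inserted but continuity is left unset when the range straddles the update front.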


184 Commits