scylladb

Author	SHA1	Message	Date
Tomasz Grabiec	7bb975eb22	row_cache, lru: Introduce evict_shallow() Will be used by MVCC tests which don't want (can't) deal with the row_cache as the container but work with the partition_entry directly. Currently, rows_entry::on_evicted() assumes that it's embedded in row_cache and would segfault when trying to evict the containing partition entry which is not embedded in row_cache. The solution is to call evict_shallow() from mvcc_tests, which does not attempt to evict the containing partition_entry.	2023-01-27 21:56:31 +01:00
Tomasz Grabiec	026f8cc1e7	db: Use mutation_partition_v2 in mvcc This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes #2578 Fixes #3288 Fixes #10587	2023-01-27 21:56:28 +01:00
Tomasz Grabiec	f97268d8f2	row_cache: Fix violation of the "oldest versions are evicted first" rule when evicting last dummy Consider the following MVCC state of a partition: v2: ==== <7> [entry2] ==== <9> ===== <last dummy> v1: ================================ <last dummy> [entry1] Where === means a continuous range and --- means a discontinuous range. After two LRU items are evicted (entry1 and entry2), we will end up with: v2: ---------------------- <9> ===== <last dummy> v1: ================================ <last dummy> [entry1] This will cause readers to incorrectly think there are no rows before entry <9>, because the range is continuous in v1, and continuity of a snapshot is a union of continuous intervals in all versions. The cursor will see the interval before <9> as continuous and the reader will produce no rows. This is only temporary, because current MVCC merging rules are such that the flag on the latest entry wins, so we'll end up with this once v1 is no longer needed: v2: ---------------------- <9> ===== <last dummy> ...and the reader will go to sstables to fetch the evicted rows before entry <9>, as expected. The bug is in rows_entry::on_evicted(), which treats the last dummy entry in a special way, and doesn't evict it, and doesn't clear the continuity by omission. The situation is not easy to trigger because it requires a certain eviction pattern concurrent with multiple reads of the same partition in different versions, so across memtable flushes. Closes #12452	2023-01-09 16:10:52 +02:00
Tomasz Grabiec	992a73a861	row_cache: Destroy coroutine under region's allocator The reason is alloc-dealloc mismatch of position_in_partition objects allocated by cursors inside coroutine object stored in the update variable in row_cache::do_update(). It is allocated under cache region, but in case of exception it will be destroyed under the standard allocator. If update is successful, it will be cleared under region allocator, so there is no problem in the normal case. Fixes #12068 Closes #12233	2022-12-07 21:44:21 +02:00
Avi Kivity	444de2831e	dirty_memory_manager: move to replica module It's a replica-side thing, so move it there. The related flush_permit and sstable_write_permit are moved alongside.	2022-12-06 22:24:17 +02:00
Tomasz Grabiec	4ff204c028	Merge 'cache: make all removals of cache items explicit' from Michał Chojnowski This series is a step towards non-LRU cache algorithms. Our cache items are able to unlink themselves from the LRU list. (In other words, they can be unlinked solely via a pointer to the item, without access to the containing list head). Some places in the code make use of that, e.g. by relying on auto-unlink of items in their destructor. However, to implement algorithms smarter than LRU, we might want to update some cache-wide metadata on item removal. But any cache-wide structures are unreachable through an item pointer, since items only have access to themselves and their immediate neighbours. Therefore, we don't want items to unlink themselves — we want `cache.remove(item)`, rather than `item.remove_self()`, because the former can update the metadata in `cache`. This series inserts explicit item unlink calls in places that were previously relying on destructors, gets rid of other self-unlinks, and adds an assert which ensures that every item is explicitly unlinked before destruction. Closes #11716 * github.com:scylladb/scylladb: utils: lru: assert that evictables are unlinked before destruction utils: lru: remove unlink_from_lru() cache: make all cache unlinks explicit	2022-10-17 12:47:02 +02:00
Michał Chojnowski	f340c9cca5	utils: lru: remove unlink_from_lru() unlink_from_lru() allows for unlinking elements from cache without notifying the cache. This messes up any potential cache bookkeeping. Improved that by replacing all uses of unlink_from_lru() with calls to lru::remove(), which does have access to cache's metadata.	2022-10-17 12:07:27 +02:00
Michał Chojnowski	a0204c17c5	treewide: remove mentions of seastar::thread::should_yield() thread_scheduling_group was retired many years ago. Remove the leftovers, they are confusing. Closes #11714	2022-10-05 12:26:37 +03:00
Michał Chojnowski	8aa24194b7	row_cache: remove a dead try...catch block in eviction All calls in the try block have been noexcept for some time. Remove the try...catch and the associated misleading comment to avoid confusing source code readers. Closes #11715	2022-10-05 12:23:47 +03:00
Tomasz Grabiec	4c33d1650d	cache_tracker: Make clear() leave no garbage Preemption during partition entry eviction could put it in the mutation cleaner. No known issues caused by this. Affects only tests.	2022-08-02 11:02:22 +02:00
Tomasz Grabiec	169025d9b4	memtable: Add counters for tombstone compaction	2022-06-15 11:30:25 +02:00
Avi Kivity	5129280f45	Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec" This reverts commit `e0670f0bb5`, reversing changes made to `605ee74c39`. It causes failures in debug mode in database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain, though with low probability. Fixes #10780 Reopens #652.	2022-06-14 18:06:22 +03:00
Tomasz Grabiec	0bc45f9666	memtable: Add counters for tombstone compaction	2022-06-06 19:25:41 +02:00
Michael Livshin	029508b77c	flat_mutation_reader ist tot Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Avi Kivity	528ab5a502	treewide: change metric calls from make_derive to make_counter make_derive was recently deprecated in favor of make_counter, so make the change throughout the codebase. Closes #10564	2022-05-14 12:53:55 +02:00
Botond Dénes	f527956cdb	readers: remove v1 empty_reader The only user is row level repair: it is replaced with downgrade_to_v1(make_empty_flat_reader_v2()). The row level reader has lots of downgrade_to_v1() calls, we will deal with these later all at once. Another use is the empty mutation source, this is trivially converted to use the v2 variant.	2022-04-28 14:12:24 +03:00
Botond Dénes	5e97fb9fc4	row_cache: update reader implementations to v2 cache_flat_mutation_reader gets a native v2 implementation. The underlying mutation representation is not changed: range deletions are still stored as v1 range_tombstones in mutation_partition. These are converted to range tombstone changes during reading. This allows for separating the change of a native v2 reader implementation and a native v2 in-memory storage format, enabling the two to be done at separate times and incrementally.	2022-04-21 14:57:04 +03:00
Botond Dénes	7626beb729	readers/nonforwardable: convert to v2 It has a single user, the row cache, which for now has to upgrade/downgrade around the nonforwardable reader, but this will go away in the next patches when the row cache readers are converted to v2 proper.	2022-04-21 14:34:00 +03:00
Botond Dénes	2a0d7e8a1d	row_cache: cache_entry::read(): return v2 reader Push the conversion down one level. Soon we will make cache flat mutation reader a v2 reader, this keeps the related noise separate.	2022-04-20 10:59:09 +03:00
Botond Dénes	0b035c9099	row_cache: return v2 readers from make_reader*() And adjust callers. The factory functions just sprinkle upgrade_to_v2() on returned readers for now. One test in row_cache_test.cc had to be disabled, because the upgrade to v2 wrapper we now have over cache readers doesn't allow it to directly control the reader's buffer size and so the test fails. There is a FIXME left in the test code and the test will be re-enabled once a native v2 reader implementation allows us to get rid of the upgrade wrapper.	2022-04-20 10:59:09 +03:00
Avi Kivity	e7fb71020b	Merge 'replica: Optimize empty_flat_reader out of the hot path' from Michał Chojnowski When row_cache::make_reader() and memtable::make_flat_reader() see that the query result is empty, they return empty_flat_reader, which is a trivial implementation of flat_mutation_reader. Even though empty_flat_reader doesn't do anything meaningful, it still needs to be created, handled in merging_reader and destroyed. Turns out this is costly. This patch series replaces hot path uses of empty_flat_reader with an empty optional. Performance effects: `perf_simple_query --smp 1` TPS: 138k -> 168k allocs/op: 80.2 -> 71.1 insns/op: 49.9k -> 45.1k `perf_simple_query --smp 1 --enable-cache=1 --flush` TPS: 125k -> 150k allocs/op: 79.2 -> 71.1 insns/op: 51.7k -> 47.2k For a cassandra-stress benchmark (localhost, 100% cache reads) this translates to a TPS increase from ~42k to ~48k per hyperthread. Note that this optimization is effective for single-partition reads where the queried partition is only in cache/sstables or only in memtables. Other queries (e.g. where the partition is in both cache and memtables and needs to be merged) are unaffected. Closes #10204 * github.com:scylladb/scylla: replica: Prefer row_cache::make_reader_opt() to row_cache::make_reader() row_cache: Add row_cache::make_reader_opt() replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader() memtable: Add memtable::make_flat_reader_opt() [avi: adjust #include for readers/ split]	2022-03-14 14:07:00 +02:00
Mikołaj Sielużycki	1d84a254c0	flat_mutation_reader: Split readers by file and remove unnecessary includes. The flat_mutation_reader files were conflated and contained multiple readers, which were not strictly necessary. Splitting optimizes both iterative compilation times, as touching rarely used readers doesn't recompile large chunks of codebase. Total compilation times are also improved, as the sizes of flat_mutation_reader.hh and flat_mutation_reader_v2.hh have been reduced and those files are included by many files in the codebase. With changes real 29m14.051s user 168m39.071s sys 5m13.443s Without changes real 30m36.203s user 175m43.354s sys 5m26.376s Closes #10194	2022-03-14 13:20:25 +02:00
Michał Chojnowski	6c6519a909	row_cache: Add row_cache::make_reader_opt()	2022-03-14 12:02:49 +01:00
Avi Kivity	cbba80914d	memtable: move to replica module and namespace Memtables are a replica-side entity, and so are moved to the replica module and namespace. Memtables are also used outside the replica, in two places: - in some virtual tables; this is also in some way inside the replica, (virtual readers are installed at the replica level, not the coordinator), so I don't consider it a layering violation - in many sstable unit tests, as a convenient way to create sstables with known input. This is a layering violation. We could make memtables their own module, but I think this is wrong. Memtables are deeply tied into replica memory management, and trying to make them a low-level primitive (at a lower level than sstables) will be difficult. Not least because memtables use sstables. Instead, we should have a memtable-like thing that doesn't support merging and doesn't have all other funky memtable stuff, and instead replace the uses of memtables in sstable tests with some kind of make_flat_mutation_reader_from_unsorted_mutations() that does the sorting that is the reason for the use of memtables in tests (and live with the layering violation meanwhile). Test: unit (dev) Closes #10120	2022-02-23 09:05:16 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Tomasz Grabiec	63351483f0	row_cache: Support reverse reads natively Some implementation notes below. When iterating in reverse, _last_row is after the current entry (_next_row) in table schema order, not before like in the forward mode. Since there is no dummy row before all entries, reverse iteration must be now prepared for the fact that advancing _next_row may land not pointing at any row. The partition_snapshot_row_cursor maintains continuity() correctly in this case, and positions the cursor before all rows, so most of the code works unchanged. The only exception is in move_to_next_entry(), which now cannot assume that failure to advance to an entry means it can end a read. maybe_drop_last_entry() is not implemented in reverse mode, which may expose reverse-only workload to the problem of accumulating dummy entries. ensure_population_lower_bound() was not updating _last_row after inserting the entry in latest version. This was not a problem for forward reads because they do not modify the row in the partition snapshot represented by _last_row. They only need the row to be there in the latest version after the call. It's different for reversed reads, which change the continuity of the entry represented by _last_row, hence _last_row needs to have the iterator updated to point to the entry from the latest version, otherwise we'd set the continuity of the previous version entry which would corrupt the continuity.	2021-12-19 22:41:35 +01:00
Botond Dénes	41facb3270	treewide: move reversing to the mutation sources Push down reversing to the mutation-sources proper, instead of doing it on the querier level. This will allow us to test reverse reads on the mutation source level. The `max_size` parameter of `consume_page()` is now unused but is not removed in this patch, it will be removed in a follow-up to reduce churn.	2021-09-29 12:15:45 +03:00
Benny Halevy	4476800493	flat_mutation_reader: get rid of timeout parameter Now that the timeout is taken from the reader_permit. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:51 +03:00
Benny Halevy	e9aff2426e	everywhere: make deferred actions noexcept Prepare for updating seastar submodule to a change that requires deferred actions to be noexcept (and return void). Test: unit(dev, debug) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-22 21:11:52 +03:00
Michael Livshin	f364666d4a	row_cache: count read row tombstones Refs #7749. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:41:11 +03:00
Pavel Emelyanov	6ef27c9fa1	btree: Make iterators not modify the tree itself The const_iterator cannot modify anything, but the plain iterator has public methods to remove the key from the tree. To control how the tree is modified this method must be marked private and modification by iterator should come from somewhere else. This somewhere else is the existing key_grabber that's already used to move keys between trees. Generalize this ability to move a key out of a tree (i.e. -- erase). Once done -- mark the iterator::erase_and_dispose private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Piotr Sarna	e9d26dd7ed	utils/coroutine: wrap a helper in utils namespace The class name `coroutine` became problematic since seastar introduced it as a namespace for coroutine helpers. To avoid a clash, the class from scylla is wrapped in a separate namespace. Without this patch, Seastar submodule update fails to compile. Message-Id: <6cb91455a7ac3793bc78d161e2cb4174cf6a1606.1626949573.git.sarna@scylladb.com>	2021-07-22 13:28:43 +03:00
Tomasz Grabiec	e947fac74c	database: Fix cache metrics not being registered Introduced in `6a6403d`. The default constructor with dummy_app_stats is also used by production code. Fixes #9012 Message-Id: <20210712221447.71902-1-tgrabiec@scylladb.com>	2021-07-13 07:50:44 +03:00
Tomasz Grabiec	6a6403d19d	row_cache: cache_tracker: Do not register metrics when constructed for tests Some tests will create two cache_tracker instances because of one being embedded in the sstable test env. This would lead to double registration of metrics, which raises run time error. Avoid by not registering metrics in prometheus in tests at all.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	7fa4e10aa0	row_cache: Use generic LRU for eviction In preparation for tracking different kinds of objects, not just rows_entry, in the LRU, switch to the LRU implementation from utils/lru.hh which can hold arbitrary element type.	2021-07-02 10:25:58 +02:00
Michael Livshin	9ef2317248	row_cache: count range tombstones processed during read Refs #7749. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210602152210.17948-1-michael.livshin@scylladb.com>	2021-06-14 14:29:05 +02:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Benny Halevy	b4cbd46adb	row_cache: create_underlying_reader: call read_context on_underlying_created only on success ctx.on_underlying_created() mustn't be called if src.make_reader failed and a reader isn't created. Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210511054525.35090-1-bhalevy@scylladb.com>	2021-05-12 01:34:48 +02:00
Benny Halevy	0a2670c9ec	row_cache: hold read_context as unique_ptr Such that the holder, that is responsible for closing the read_context before destroying it, holds it uniquely. cache_flat_mutation_reader may be constructed either with a read_context&, where it knows that the read_context is owned externally, by the caller, or it could be constructed with a std::unique_ptr<read_context> in which case it assumes ownership of the read_context and it is now responsible for closing it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	8531eaaacf	row_cache: make_reader: make read_context only when needed So we can have better control on who's responsible to close it. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	9944586480	row_cache: make_reader: use range directly Not via ctx, so we can delay the making of the read_context, as needed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	4c969756ac	row_cache: scanning_and_populating_reader: make sure to close underlying readers Note that scanning_and_populating_reader::read_next_partition now closes the current reader unconditionally and before assigning a new reader. This should be an improvement since we want to release the reader resources as early as possible, certainly before allocating new resources. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	e34ed3d3e4	row_cache: range_populating_reader: add close method To close the underlying _reader. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	c707ff27a4	row_cache: single_partition_populating_reader: add close method To close the optional underlying _reader and _read_context. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Piotr Jastrzebski	cb3dbb1a4b	row_cache: remove redundant check in make_reader This check is always true because a dummy entry is added at the end of each cache entry. If that wasn't true, the check in else-if would be an UB. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-04-12 21:12:33 +02:00
Piotr Jastrzebski	b3b68dc662	read_context: remove skip_first_fragment arg from create_underlying All callers pass false for its value so no need to keep it around. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-04-12 19:51:06 +02:00
Tomasz Grabiec	cb0b8d1903	row_cache: Zap dummy entries when populating or reading a range This will prevent accumulation of unnecessary dummy entries. A single-partition populating scan with clustering key restrictions will insert dummy entries positioned at the boundaries of the clustering query range to mark the newly populated range as continuous. Those dummy entries may accumulate with time, increasing the cost of the scan, which needs to walk over them. In some workloads we could prevent this. If a populating query overlaps with dummy entries, we could erase the old dummy entry since it will not be needed, it will fall inside a broader continuous range. This will be the case for time series workloads which scan with a decreasing (newest) lower bound. Refs #8153. _last_row is now updated atomically with _next_row. Before, _last_row was moved first. If exception was thrown and the section was retried, this could cause the wrong entry to be removed (new next instead of old last) by the new algorithm. I don't think this was causing problems before this patch. The problem is not solved for all the cases. After this patch, we remove dummies only when there is a single MVCC version. We could patch apply_monotonically() to also do it, so that dummies which are inside continuous ranges are eventually removed, but this is left for later. perf_row_cache_reads output after that patch shows that the second scan touches no dummies: $ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M Rows in cache: 0 Populating with dummy rows Rows in cache: 265320 Scanning read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB] read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB] Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>	2021-03-01 20:34:35 +02:00
Avi Kivity	d980f550d1	Merge 'row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows' from Tomasz Grabiec fill_buffer() will keep scanning until _lower_bound_changed is true, even if preemption is signaled, so that the reader makes forward progress. Before the patch, we did not update _lower_bound on touching a dummy entry. The read will not respect preemption until we hit a non-dummy row. If there is a lot of dummy rows, that can cause reactor stalls. Fix that by updating _lower_bound on dummy entries as well. Refs #8153. Tested with perf_row_cache_reads: ``` $ build/release/test/perf/perf_row_cache_reads -c1 -m200M Rows in cache: 0 Populating with dummy rows Rows in cache: 373929 Scanning read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB] read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB] ``` Notice that max preemption latency is low in the second "read:" line. Closes #8167 * github.com:scylladb/scylla: row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows tests: perf: Introduce perf_row_cache_reads row_cache: Add metric for dummy row hits	2021-02-28 21:00:20 +02:00
Tomasz Grabiec	f0a3272a5f	row_cache: Add metric for dummy row hits This will help to diagnose performance problems related to the read having to walk through a lot of dummy rows to fill the buffer. Refs #8153	2021-02-25 18:26:01 +01:00
Benny Halevy	4b46793c19	row_cache: scanning_and_populating_reader: add _read_next_partition flag Instead of resetting _reader in scanning_and_populating_reader::fill_buffer in the `reader_finished` case, use a gentler, _read_next_partition flag on which `read_next_partition` will be called in the next iteration. Then, read_next_partition can close _reader only before overwriting it with a new reader. Otherwise, if _reader is always closed in the `reader_finished` case, we end up hitting premature end_of_stream. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210215101254.480228-30-bhalevy@scylladb.com>	2021-02-17 19:06:21 +02:00
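
Note on commit f97268d8f2: its message relies on the rule that the continuity of a snapshot is a union of continuous intervals in all versions. The following is a minimal sketch of that rule under a simplified model, not ScyllaDB's actual code. The names `entry`, `version`, `last_dummy`, `version_is_continuous_at` and `snapshot_is_continuous_at` are hypothetical, positions are modelled as plain integers, and each entry's flag is assumed to describe the interval between the previous entry and itself.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical, simplified model of one cached entry in one MVCC version.
struct entry {
    int position;     // clustering position, modelled as an int
    bool continuous;  // true if the interval (previous entry, this entry] is complete
};

// Sentinel position standing in for the "last dummy" entry of a partition.
constexpr int last_dummy = std::numeric_limits<int>::max();

// Entries of a single version, sorted by position and ending with last_dummy.
using version = std::vector<entry>;

// A version covers pos if the first entry at or after pos is flagged continuous.
bool version_is_continuous_at(const version& v, int pos) {
    auto it = std::lower_bound(v.begin(), v.end(), pos,
        [](const entry& e, int p) { return e.position < p; });
    return it != v.end() && it->continuous;
}

// Union rule: the snapshot is continuous at pos if any version is.
bool snapshot_is_continuous_at(const std::vector<version>& versions, int pos) {
    return std::any_of(versions.begin(), versions.end(),
        [pos](const version& v) { return version_is_continuous_at(v, pos); });
}
```

In this model the state after both evictions is v2 = `{{9, false}, {last_dummy, true}}` and v1 = `{{last_dummy, true}}`. The snapshot then reports position 5 as continuous even though entry <7> is gone, which matches the symptom described in the commit. If eviction also clears the flag on v1's last dummy, the same query reports a discontinuity, and a reader would know to go to sstables.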


353 Commits