scylladb

Author	SHA1	Message	Date
Paweł Dziepak	637b9a7b3b	atomic_cell_or_collection: make operator<< show cell content After the new in-memory representation of cells was introduced there was a regression in atomic_cell_or_collection::operator<< which stopped printing the content of the cell. This makes debugging more inconvenient and time-consuming. This patch fixes the problem. Schema is propagated to the atomic_cell_or_collection printer and the full content of the cell is printed. Fixes #3571. Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>	2018-10-24 13:29:51 +03:00
Botond Dénes	eb357a385d	flat_mutation_reader: make timeout opt-out rather than opt-in Currently timeout is opt-in, that is, all methods that even have it default it to `db::no_timeout`. This means that ensuring timeout is used where it should be is completely up to the author and the reviewers of the code. As humans are notoriously prone to mistakes this has resulted in a very inconsistent usage of timeout, many clients of `flat_mutation_reader` passing the timeout only to some members and only on certain call sites. This is small wonder considering that some core operations like `operator()()` only recently received a timeout parameter and others like `peek()` didn't even have one until this patch. Both of these methods call `fill_buffer()` which potentially talks to the lower layers and is supposed to propagate the timeout. All this makes the `flat_mutation_reader`'s timeout effectively useless. To make order in this chaos make the timeout parameter a mandatory one on all `flat_mutation_reader` methods that need it. This ensures that humans now get a reminder from the compiler when they forget to pass the timeout. Clients can still opt-out from passing a timeout by passing `db::no_timeout` (the previous default value) but this will be now explicit and developers should think before typing it. There were surprisingly few core call sites to fix up. Where a timeout was available nearby I propagated it to be able to pass it to the reader, where I couldn't I passed `db::no_timeout`. Authors of the latter kind of code (view, streaming and repair are some of the notable examples) should maybe consider propagating down a timeout if needed. In the test code (the vast majority of the changes) I just used `db::no_timeout` everywhere. Tests: unit(release, debug) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>	2018-09-20 11:31:24 +02:00
Tomasz Grabiec	477d7b439b	row_cache: Fix violation of continuity on concurrent eviction and population ensure_population_lower_bound() returned true if current clustering range covers all rows, which means that the populator has a right to set continuity flag to true on the row it inserts. This is correct only if the current population range actually starts since before all clustering rows. Otherwise we're populating since _last_row, and should consult it. The fix introduces a new flag, set when starting to populate, which indicates if we're populating from the beginning of the range or not. We cannot simply check if _last_row is set in ensure_population_lower_bound() because _last_row can be set and then become empty again. Fixes #3608	2018-07-17 16:43:21 +02:00
Tomasz Grabiec	450985dfee	mvcc: Use RAII to ensure that partition versions are merged Before this patch, maybe_merge_versions() had to be manually called before partition snapshot goes away. That is error prone and makes client code more complicated. Delegate that task to a new partition_snapshot_ptr object, through which all snapshots are published now.	2018-06-27 21:51:04 +02:00
Tomasz Grabiec	c26a304fbb	mvcc: Merge partition versions gradually in the background When snapshots go away, typically when the last reader is destroyed, we used to merge adjacent versions atomically. This could induce reactor stalls if partitions were large. This is especially true for versions created on cache update from memtables. The solution is to allow this process to be preempted and move to the background. mutation_cleaner keeps a linked list of such unmerged snapshots and has a worker fiber which merges them incrementally and asynchronously with regards to reads. This reduces scheduling latency spikes in tests/perf_row_cache_update for the case of large partition with many rows. For -c1 -m1G I saw them dropping from 23ms to 2ms.	2018-06-27 12:48:30 +02:00
Tomasz Grabiec	9975135110	row_cache: Make sure reader makes forward progress after each fill_buffer() If reader's buffer is small enough, or preemption happens often enough, fill_buffer() may not make enough progress to advance _lower_bound. If also iterators are constantly invalidated across fill_buffer() calls, the reader will not be able to make progress. See row_cache_test.cc::test_reading_progress_with_small_buffer_and_invalidation() for an exemplary scenario. Also reproduced in debug-mode row_cache_test.cc::test_concurrent_reads_and_eviction Message-Id: <1528283957-16696-1-git-send-email-tgrabiec@scylladb.com>	2018-06-06 16:01:52 +03:00
Paweł Dziepak	27014a23d7	treewide: require type info for copying atomic_cell_or_collection	2018-05-31 15:51:11 +01:00
Tomasz Grabiec	7c34cd04e2	mvcc: Propagate information if insertion happened from ensure_entry_if_complete() It's needed by users to update statistics, different ones depending on if the row already existed or not.	2018-03-07 16:50:55 +01:00
Tomasz Grabiec	381bf02f55	cache: Evict with row granularity Instead of evicting whole partitions, evicts whole rows. As part of this, invalidation of partition entries was changed to not evict from snapshots right away, but unlink them and let them be evicted by the reclaimer.	2018-03-06 11:50:29 +01:00
Tomasz Grabiec	dce9185fc9	cache: Track static row insertions separately from regular rows So that row eviction counter, which doesn't look at the static row, can be in sync with row insertion counter.	2018-03-06 11:50:28 +01:00
Tomasz Grabiec	ab407d99cc	mvcc: Store complete rows in each version in evictable entries For row-level eviction we need to ensure that each version has complete rows so that eviction from older versions doesn't affect the value of the row in newer snapshots. This is achieved by copying the row from an older version before applying the increment in the new version. Only affects evictable entries, memtables are not affected.	2018-03-06 11:50:28 +01:00
Tomasz Grabiec	29d167bf01	mvcc: Introduce partition_snapshot_row_cursor::ensure_entry_in_latest() To avoid duplication of logic between cache reader and ensure_entry_if_complete().	2018-03-06 11:50:28 +01:00
Tomasz Grabiec	9893e8e5f7	mvcc: Make each version have independent continuity This change is a preparation for introducing row-level eviction, such that entries can be evicted from older versions without having to touch other versions. Currently continuity flags on entries are interpreted relative to the combined view merged from all entries. For example: v2: <key=2, cont=1> v1: <key=1, cont=1> In v2, the flag on entry key=2 marks the range (1, 2) as continuous. This is problematic because if the old version is evicted, continuity will change in an incorrect way: v2: <key=2, cont=1> Here, the range (-inf, 1) would be marked as continuous, which is not true. To solve this problem, we change the rules for continuity interpretation in MVCC. Each version will have its own continuity, fully specified in that version, independent of continuity of other versions. Continuity of the snapshot will be a union of continuous ranges in each version. It is assumed that continuous intervals in different versions are non-overlapping, except for points corresponding to complete rows, in which case a later version may overlap with an older version (overwrite). We make use of this assumption to make calculation of the union of intervals on merging easier. I make use of the above assumption in mutation_partition::apply_monotonically(). MVCC population of incomplete entries already almost maintains the non-overlapping invariant, because population intervals correspond to intervals which are incomplete in the old snapshot. The only change needed is to ensure that both population bounds will have entries in the latest version. Population from memtables doesn't mark any intervals as continuous, so also conforms. The only change needed there is to not inherit continuity flags from the old snapshot, effectively making the new version internally discontinuous except for row points. 
The example from the beginning will become: v2: <key=1, cont=0> <key=2, cont=1> v1: <key=1, cont=1> When marking a range as continuous with some rows present only in older versions, we need to insert entries in the latest version, so that we can mark the range as continuous. The easiest solution is to copy the entry from the old version. Another option would be to add support for incomplete rows and insert such instead. This way we would avoid duplicating row contents. This optimization is deferred.	2018-03-06 11:50:25 +01:00
Tomasz Grabiec	d0e1a3c63e	mvcc: partition_snapshot_row_weakref: Introduce is_in_latest_version()	2018-03-06 11:32:09 +01:00
Tomasz Grabiec	313f2c2bb0	cache: Document intent of maybe_update_continuity()	2018-03-06 11:32:09 +01:00
Tomasz Grabiec	3214883a25	cache: Extract cache_streamed_mutation::ensure_population_lower_bound()	2018-03-06 11:32:09 +01:00
Duarte Nunes	712c051de6	cache_flat_mutation_reader: Pre-calculate cell hash When digest is requested, pre-calculate the cell's hash. We consider the case when the cell is already in the cache, and the case when it is added by the underlying reader. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 01:02:50 +00:00
Piotr Jastrzebski	96c97ad1db	Rename streamed_mutation* files to mutation_fragment* Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-24 20:56:49 +01:00
Piotr Jastrzebski	19e1f7c285	cache_flat_mutation_reader: fix tombstones handling with small buffer Before when the buffer was so small that it could fit only a single range_tombstone, cache_flat_mutation_reader would keep returning the same tombstone over and over again. The fix is to set _lower_bound to the next fragment we want to return. Fixes #3139 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-24 20:09:11 +01:00
Tomasz Grabiec	6654fa6df7	row_cache: Drop unnecessary assignment to _lower_bound on exception We no longer drain cached tombstones since commit `41ede08a1d`, so this adjustment of lower_bound is not needed. Message-Id: <1516796248-11290-1-git-send-email-tgrabiec@scylladb.com>	2018-01-24 16:39:34 +02:00
Glauber Costa	5140aaea00	add a timeout to fast forward to In the last patch, we enabled per-request timeouts, we enable timeouts in fill_buffer. There are many places, though, in which we fast_forward_to before we fill_buffer, so in order to make that effective we need to propagate the timeouts to fast_forward_to as well. In the same way as fill_buffer, we make the argument optional wherever possible in the high level callers, making them mandatory in the implementations. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:19 -05:00
Glauber Costa	d965af42b0	add a timeout to fill_buffer As part of the work to enable per-request timeouts, we enable timeouts in fill_buffer. The argument is made optional at the main classes, but mandatory in all the ::impl versions. This way we'll make sure we didn't forget anything. At this point we're still mostly passing that information around and don't have any entity that will act on those timeouts. In the next patch we will wire that up. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Duarte Nunes	16c975edcc	partition_version: Return static_row fragment from static_row() Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180109162815.5811-1-duarte@scylladb.com>	2018-01-09 19:17:02 +01:00
Tomasz Grabiec	41ede08a1d	mutation_reader: Allow range tombstones with same position in the fragment stream When we get two range tombstones with the same lower bound from different data sources (e.g. two sstable), which need to be combined into a single stream, they need to be de-overlapped, because each mutation fragment in the stream must have a different position. If we have range tombstones [1, 10) and [1, 20), the result of that de-overlapping will be [1, 10) and [10, 20]. The problem is that if the stream corresponds to a clustering slice with upper bound greater than 1, but lower than 10, the second range tombstone would appear as being out of the query range. This is currently violating assumptions made by some consumers, like cache populator. One effect of this may be that a reader will miss rows which are in the range (1, 10) (after the start of the first range tombstone, and before the start of the second range tombstone), if the second range tombstone happens to be the last fragment which was read for a discontinuous range in cache and we stopped reading at that point because of a full buffer and cache was evicted before we resumed reading, so we went to reading from the sstable reader again. There could be more cases in which this violation may resurface. There is also a related bug in mutation_fragment_merger. If the reader is in forwarding mode, and the current range is [1, 5], the reader would still emit range_tombstone([10, 20]). If that reader is later fast forwarded to another range, say [6, 8], it may produce fragments with smaller positions which were emitted before, violating monotonicity of fragment positions in the stream. A similar bug was also present in partition_snapshot_flat_reader. 
Possible solutions: 1) relax the assumption (in cache) that streams contain only relevant range tombstones, and only require that they contain at least all relevant tombstones 2) allow subsequent range tombstones in a stream to share the same starting position (position is weakly monotonic), then we don't need to de-overlap the tombstones in readers. 3) teach combining readers about query restrictions so that they can drop fragments which fall outside the range 4) force leaf readers to trim all range tombstones to query restrictions This patch implements solution no 2. It simplifies combining readers, which don't need to accumulate and trim range tombstones. I don't like solution 3, because it makes combining readers more complicated, slower, and harder to properly construct (currently combining readers don't need to know restrictions of the leaf streams). Solution 4 is confined to implementations of leaf readers, but also has disadvantage of making those more complicated and slower. Fixes #3093.	2017-12-22 11:06:20 +01:00
Piotr Jastrzebski	b976872c1a	Rename all _underlying_flat methods in read_context to _underlying. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-12-18 16:37:57 +01:00
Piotr Jastrzebski	a322268416	Turn cache_flat_mutation_reader into a flat reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-12-18 13:28:33 +01:00
Piotr Jastrzebski	f467e84424	Rename cache_streamed_mutation to cache_flat_mutation_reader in cache_flat_mutation_reader.hh Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-12-18 13:28:33 +01:00
Piotr Jastrzebski	3075780097	Make copy of cache_streamed_mutation.hh and call it cache_flat_mutation_reader.hh. It will be turned into a flat mutation reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-12-18 13:28:33 +01:00
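
The per-version continuity rule described in commit 9893e8e5f7 can be illustrated with a small sketch. It is not Scylla code: `toy_entry`, `toy_version`, `version_covers` and `snapshot_covers` are hypothetical names, keys are plain integers, and the gap after the last entry of a version is assumed to be discontinuous. The flag on an entry describes the gap between it and the previous entry of the same version, and the snapshot's continuity is the union over versions.

```cpp
#include <vector>

// Hypothetical toy model; none of these names exist in Scylla.
struct toy_entry {
    int key;
    bool continuous; // describes the gap between the previous entry of this version and this key
};

// One MVCC version: entries sorted by key.
using toy_version = std::vector<toy_entry>;

// Is the gap containing x continuous according to this version alone?
// x is expected to lie strictly between keys (hence a double).
bool version_covers(const toy_version& v, double x) {
    for (const auto& e : v) {
        if (x < e.key) {
            return e.continuous; // first entry above x owns the gap containing x
        }
    }
    return false; // assumption: nothing is known past the last entry
}

// Snapshot continuity is the union of the continuity of each version.
bool snapshot_covers(const std::vector<toy_version>& versions, double x) {
    for (const auto& v : versions) {
        if (version_covers(v, x)) {
            return true;
        }
    }
    return false;
}
```

With v1 = `{{1, true}}` and v2 = `{{1, false}, {2, true}}`, the snapshot covers both 0.5 and 1.5. Dropping v1 leaves only (1, 2) continuous, whereas a lone v2 = `{{2, true}}` under these rules would claim (-inf, 2), which is the incorrect widening the commit message warns about.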

28 Commits
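
The opt-out timeout idea from commit eb357a385d can be sketched in a few lines. This is an illustration under stated assumptions, not the real `flat_mutation_reader` API: `toy_reader`, `toy_clock`, `toy_time_point` and `toy_no_timeout` are hypothetical stand-ins, and the "lower layer" is just an in-memory queue. The point is only that the timeout parameter has no default, so a caller who forgets it gets a compile error, while opting out requires spelling out the sentinel.

```cpp
#include <chrono>
#include <deque>
#include <optional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-ins; these are not Scylla's types.
using toy_clock = std::chrono::steady_clock;
using toy_time_point = toy_clock::time_point;

// Explicit opt-out value, similar in spirit to db::no_timeout.
inline const toy_time_point toy_no_timeout = toy_time_point::max();

class toy_reader {
    std::deque<int> _source; // pretend lower layer
    std::deque<int> _buffer;
public:
    explicit toy_reader(std::deque<int> source) : _source(std::move(source)) {}

    // No default argument: leaving the timeout out does not compile.
    void fill_buffer(toy_time_point timeout) {
        if (toy_clock::now() >= timeout) {
            throw std::runtime_error("timed out");
        }
        while (!_source.empty()) {
            _buffer.push_back(_source.front());
            _source.pop_front();
        }
    }

    // Propagates the caller's timeout down to fill_buffer().
    std::optional<int> next(toy_time_point timeout) {
        if (_buffer.empty()) {
            fill_buffer(timeout);
        }
        if (_buffer.empty()) {
            return std::nullopt;
        }
        int value = _buffer.front();
        _buffer.pop_front();
        return value;
    }
};
```

A call such as `reader.next(toy_no_timeout)` still works, but the opt-out is now visible at the call site, while `reader.next()` is rejected by the compiler.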