scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 21:47:10 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	bbfa52822e	row_cache: Switch readers to use per-entry snapshots Currently readers are always using the latest snapshot. This is fine for respecting write atomicity if partitions are fully continuous in cache (now), but will break write atomicity once partial population is allowed. Consider the following case: flush write(ck=1), write(ck=2) -> snapshot_1 cache reader 1 reads and inserts ck=1 @snapshot_1 flush write(ck=1), write(ck=2) -> snapshot_2 cache reader 2 reads and inserts ck=2 @snapshot_2 Because cache update is not atomic, it can happen that reader 2 will complete while the partition hasn't been updated yet for snapshot_2. In such case, after read 2 the partition would contain ck=1 from snapshot_1 and ck=2 from snapshot_2. It will match neither of the snapshots, and this could violate write atomicity. To solve this problem we conceptually assign each partition key in the ring to its current snapshot which it reflects. The update process gradually converts entries in ring order to the new snapshot. Reads will not be using the latest snapshot, but rather the current snapshot for the position in the ring they are at. There is a race between the update process and populating reads. Since after the update all entries must reflect the new snapshot, reads using the old snapshot cannot be allowed to insert data which can no longer be reached by the update process. Before this patch this race was prevented by the use of a phased_barrier, where readers would keep phased_barrier::operation alive between starting a read of a partition and inserting it into cache. Cache update was waiting for all prior operations before starting the update. Any later read which was not waited for would use the latest snapshot for reads, so the update process didn't have to fix anything up for such reads. After this change, later reads cannot always use the latest snapshot, they have to use the snapshot corresponding to given entry. 
So it's not enough for update() to wait for prior reads in order to prevent stale populations. The (simple) solution implemented in this patch is to detect the conflict and abandon population of given sub-range. In general, reads are allowed to populate given range only if it belongs to a single snapshot. Note that the range here is not the whole query range. For population of continuity, it is the range starting after the previous key and ending after the key being inserted. When populating a partition entry, the range is a singular range containing only the partition key. Readers switch to new snapshots automatically as they move across the ring. It's possible that the insertion of the partition doesn't conflict, but continuity does. In such case the entry will be inserted but continuity will not be set.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	81e7b561da	dht: Add ring_position min()/max()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8ba6366610	row_cache: Switch to using snapshot_source Currently every time cache needs to create reader for missing data it obtains a reader which is most up to date. That reader includes writes from later populate phases, for which update() was not yet called. This will be problematic once we allow partitions to be partially populated, because different parts of the partition could be partially populated using readers using different sets of writes, and break write atomicity. The solution will be to always populate given partition using the same set of writes, using reader created from the current snapshot. The snapshot changes only on update(), with update() gradually converting each partition to the new snapshot.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	446bcdb00d	database: Add missing cache invalidation after attaching sstables This violation of the contract is currently benign, because there are no reads from those tables before they are populated. If there were, the cache would mark the whole (empty) range as continuous and the table would appear empty. It will cause similar problem once cache starts using snapshots of the underlying mutation source. Then this lack of invalidate() will also result in cache thinking that the table is still empty.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e23c7e2f34	row_cache: Rework invalidate() implementation 1) Reduce duplication by delegating to more general overloads 2) Improve documentation to not mention effects in terms of population (detail) but rather write visibility 3) Rename clear() to invalidate() and merge with the range variant, since it has the same semantics	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	c82c6ec6ed	database: Allow obtaining snapshot_source for sstables	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	bd023b6161	tests: Introduce memtable_snapshot_source Snapshottable in-memory mutation source for use in row_cache tests.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	ddfcf64966	mutation_source: Make copying cheaper Cache readers will need to take snapshots by copying the mutation_source. That's going to happen quite often, so make copying cheaper.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	58d5e1393b	mutation_reader: Introduce make_combined_mutation_source()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1e2463a382	mutation_reader: Introduce make_empty_*_source()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	289d01c2cc	mutation_reader: Introduce concept of snapshot_source	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	2d73c193e7	row_cache: Introduce read_context This object stores all read relevant context required all over the place. This leads to a cleaner code. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: - made read_context shareable to allow storing shared mutable state later - added range and cache getters ]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	a3ff8db323	row_cache: Introduce autoupdating_underlying_reader This is an abstraction that represents a reader to the underlying source and auto updates itself to make sure the reader reflects the latest state of the underlying source. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: Add range getter to avoid friendships]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	b6d349728f	range_tombstone_list: Introduce slice() working with position range	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	6ce08f2f9a	range_tombstone: Introduce trim_front()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	271dfc2eac	position_in_partition: Introduce for_range_start()/for_range_end()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	3b52afa4a3	position_in_partition: Introduce no_clustering_row_between()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	9c2b3e1167	position_in_partition: Introduce as_start_bound_view()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	b47c8f1df7	partition_snapshot: Add const-qualified overload of version() [tgrabiec: Extracted from a different patch]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	dd9d35c166	partition_snapshot: Add getter for range tombstones	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	60c3c0a471	partition_entry: Add squashed() overload with a single schema	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	98f7671553	partition_snapshot: Introduce squashed()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	87b0f11be3	partition_snapshot: Add getters for static row and partition tombstone [tgrabiec: - Extracted from a different patch - Renamed concept names to more familiar Map and Reduce - Renamed aggregate() to squashed() to match the existing nomenclature - Uncommented the concepts ]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	ea59b9475e	partition_version: Add const-qualified variant of operator-> [tgrabiec: Extracted from a different patch]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	f6fe0acea4	partition_version: Make operator bool() const-qualified [tgrabiec: Extracted from a different patch]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	efc75b0bc3	mutation_partition: Add rows_entry constructor which accepts full contents [tgrabiec: Extracted from different patch]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	7f8620d4a7	tests: mutation_source: Relax expectations about range tombstones In preparation for having partial cache which trims range tombstones to the lower bound of the query.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	3a9212e0f2	tests: mutation_assertions: Add ability to limit verification to given clustering_row_ranges Currently mutation sources are free to return range tombstones covering range which is larger than the query range. The cache mutation source will soon become more eager about trimming such tombstones. To cover up for such differences, allow telling the restrictions to only care about differences relevant for given clustering ranges.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	f925b26241	tests: mutation_reader_assertions: Simplify	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1d5d5e26a2	mutation: Introduce sliced()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	92d6456070	range_tombstone_list: Introduce equal()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	1594ace4d3	range_tombstone_stream: Make printable	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	19edb0b535	range_tombstone_list: Make printable	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	2e75595ecf	range_tombstone_list: Introduce trim()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	5a29c70f3e	mutation_fragment: make mutation_fragment copyable This will be needed by implementation of cache_streamed_mutation Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	2fdabcaa9b	Track population phase in partition_snapshot This will be used by partial cache in later patches. [tgrabiec: - changed title, - documented meaning of the variable, - renamed the variable, - introduced open_version(), - fixed continuity of the static row not being preserved in case a new version is created] Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	a841c77c54	Introduce maybe_merge_versions This will be used in the following patches by partial cache. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	9642f806ab	partition_version: Introduce version() getter	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	9380dd1ee3	mutation_source: make sure we never ignore fast forwarding Mutation sources sometimes ignore the fast forwarding parameter, so this change adds an assertion to check that this parameter can be safely ignored. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	ab72241e22	mutation_reader: Accept forwarding flag in make_reader_returning() By default make_reader_returning creates a reader that does not support fast forwarding but the second parameter can be used to make it support fast forwarding. [tgrabiec: Improve title] Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	ac03331490	row_cache_test: improve test_sliced_read_row_presence Remove unused parameter and add checks to make sure all expected rows have been received. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	db053ef902	tests: Add test for continuity merging rules	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	2edf08d36a	tests: random_mutation_generator: Generate random continuity	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8873a443db	tests: mutation: Generate mutations with continuity	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	dce293e11c	tests: row_cache: Apply only fully continuous mutations to underlying mutation source Cache currently assumes that mutations coming from outside are fully continuous.	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	e86f74edd8	tests: row_cache: Add missing apply() to test_mvcc test case [tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	95dcfa859b	tests: row_cache: Improve test_mvcc() assert_that().is_equal_to() gives a better error message. Also, there is code which can be replaced with assert_that_stream().has_monotonic_positions()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	05b56fcfb0	mutation_partition: Add support for specifying continuity This will allow expressing lack of information about certain ranges of rows (including the static row), which will be used in cache to determine if information in cache is complete or not. Continuity is represented internally using flags on row entries. The key range between two consecutive entries is continuous iff rows_entry::continuous() is true for the later entry. The range starting after the last entry is assumed to be continuous. The range corresponding to the key of the entry is continuous iff rows_entry::dummy() is false. [tgrabiec: - based on the following commits: 4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry 773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry - documented that partition tombstone is always complete - require specifying the partition tombstone when creating an incomplete entry - replaced rows_entry(dummy_tag, ...) constructor with more general rows_entry(position_in_partition, ...) - documented continuity semantics on mutation_partition - fixed _static_row_cached being lost by mutation_partition copy constructors - fixed conversion to streamed_mutation to ignore dummy entries - fixed mutation_partition serializer to drop dummy entries - documented semantics of continuity on mutation_partition level - dropped assumptions that dummy entries can be only at the last position - changed equality to ignore continuity completely, rather than partially (it was not ignoring dummy entries, but ignoring continuity flag) - added printout of continuity information in mutation_partition - fixed handling of empty entries in apply_reversibly() with regards to continuity; we no longer can remove empty entries before merging, since that may affect continuity of the right-hand mutation. Added _erased flag. 
- fixed mutation_partition::clustered_row() with dummy==true to not ignore the key - fixed partition_builder to not ignore continuity - renamed dummy_tag_t to dummy_tag. _t suffix is reserved. - standardized all APIs on is_dummy and is_continuous bool_class:es - replaced add_dummy_entry() with ensure_last_dummy() with safer semantics - dropped unused remove_dummy_entry() - simplified and inlined cache_entry::add_dummy_entry() - fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous ]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	063b37f352	partition_snapshot_reader: Be prepared for skipping some row entries If some row entries may have to be skipped by the reader then it could be that _clustering_rows is not empty, but read_next() will return a disengaged optional because there are no more rows in the current range. The code assumed that this is never the case, and that if read_next() returns a disengaged optional then we have exhausted all ranges. Before introducing dummy entries this needs to be refactored.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	2cfe23a35e	partition_snapshot_reader: Use rows_entry::position() for comparing rows key() will not be valid for dummy entries.	2017-06-24 18:06:11 +02:00
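
The continuity rules stated in the message of commit 05b56fcfb0 can be illustrated with a minimal sketch. This is not the ScyllaDB implementation: `entry_flags`, `rows`, `is_continuous_at()` and `example_rows()` are hypothetical names, and keys are plain integers rather than positions in a partition. Treating the range before the first entry as governed by the first entry's flag is also an assumption made here for illustration.

```cpp
#include <cassert>
#include <map>

// Hypothetical, simplified stand-in for rows_entry: only the two flags
// described in the commit message, with no row data.
struct entry_flags {
    bool continuous; // is the key range between the previous entry and this one continuous?
    bool dummy;      // dummy entries carry no row, so their own key is not continuous
};

// Entries ordered by key. Integer keys are an assumption for brevity.
using rows = std::map<int, entry_flags>;

// Returns true iff the information about `key` is complete under the stated rules.
bool is_continuous_at(const rows& r, int key) {
    auto it = r.lower_bound(key);  // first entry whose key is >= `key`
    if (it == r.end()) {
        return true;               // the range after the last entry is assumed continuous
    }
    if (it->first == key) {
        return !it->second.dummy;  // the entry's own key is continuous iff it is not a dummy
    }
    return it->second.continuous;  // between entries: the flag of the later entry decides
}

// Sample data: a discontinuous gap before key 10, a dummy entry at key 20.
rows example_rows() {
    return rows{
        {10, {false, false}},
        {20, {true, true}},
        {30, {true, false}},
    };
}
```

With `example_rows()`, a lookup at key 15 is continuous because the later entry (key 20) has its flag set, while key 20 itself is not, because that entry is a dummy. A cache reader following these rules would have to consult the underlying source for key 20 and for anything below key 10.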


12313 Commits