scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-23 18:10:39 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	78274276f5	row_cache: Use the memtable cleaner to create memtable snapshot during update Memtable entries should be cleaned using memtable cleaner, which unlike the cache's cleaner is not associated with the cache tracker. It's an error to clean a snapshot using a tracker which doesn't own the entries. This will corrupt cache tracker's row counter. Fixes failure of test_exception_safety_of_update_from_memtable from row_cache.cc in debug mode and with allocation failure injection enabled. Introduced in "cache: Defer during partition merging" (`70c72773be`). Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>	2018-06-14 18:03:02 +03:00
Tomasz Grabiec	f775fc2e4c	mvcc: Fix partition_entry::open_version() After `70c72773be` it's possible that open_version() is called with a phase which is smaller than the phase of the latest version, because latest version belongs to the in-progress cache update. In such a case we must return the existing non-latest snapshot and not create a new version on top of the in-progress update. Not doing this violates several invariants, and may lead to inconsistencies, including violation of write atomicity or temporary loss of writes. partition_entry::read() was already adjusted by the aforementioned commit. Do a similar adjustment for open_version(). Fixes sporadic failures of row_cache_test.cc::test_concurrent_reads_and_eviction Message-Id: <1528211847-22825-1-git-send-email-tgrabiec@scylladb.com>	2018-06-05 18:22:38 +03:00
Paweł Dziepak	ec9d166a4f	treewide: require type to compute cell memory usage	2018-05-31 15:51:11 +01:00
Tomasz Grabiec	5bc201df10	cache: Release dirty memory with row granularity	2018-05-30 14:41:41 +02:00
Tomasz Grabiec	70c72773be	cache: Defer during partition merging	2018-05-30 14:41:41 +02:00
Tomasz Grabiec	c653137b2b	mvcc: Make apply_to_incomplete() work with attached versions Needed before making it preemptible. We cannot steal the entry since we may need to resume merging later.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	1792be3697	cache: Propagate phase to apply_to_incomplete() It will be needed to create snapshots with appropriate phase markers.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	494cb3f3da	cache: Prepare for incremental apply_to_incomplete() Incremental merging will be implemented by the means of resumable functions, which return stop_iteration::no when not yet finished. We're not using futures, so that the caller can do work around preemption points as well.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	3f19f76c67	mvcc: Destroy memtable partition versions gently Now all snapshots will have a mutation_cleaner which they will use to gently destroy freed partition_version objects. Destruction of memtable entries during cache update is also using the gentle cleaner now. We need to have a separate cleaner for memtable objects even though they're owned by cache's region, because memtable versions must be cleared without a cache_tracker. Each memtable will have its own cleaner, which will be merged with the cache's cleaner when memtable is merged into cache. Fixes some sources of reactor stalls on cache update when there are large partition entries in memtables.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	81d231f35b	mvcc: Remove rows from tracker gently Some partitions may have a lot of rows. Better to iterate over them incrementally as part of clear_gently() to avoid stalls.	2018-05-30 14:41:40 +02:00
Tomasz Grabiec	e5aa02efeb	mvcc: Introduce partition_version_list	2018-05-30 12:18:56 +02:00
Tomasz Grabiec	ca1ee93577	mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner We didn't rely on that yet, it seems, but will. (cherry picked from commit 21a744337de01f699d5c5c340483ad23cabab2ee)	2018-05-30 12:18:56 +02:00
Tomasz Grabiec	40cc766cf2	database: Add API for incremental clearing of partition entries Partitions can get very large. Destroying them all at once can stall the reactor for a significant amount of time. We want to avoid that by doing destruction incrementally, deferring in between. A new API is added for that at various levels: stop_iteration clear_gently() noexcept; It returns stop_iteration::yes when the object is fully cleared and can be now destroyed quickly. So a deferring destruction can look like this: return repeat([this] { return clear_gently(); }); The reason why clear_gently() doesn't return a future<> itself is that some contexts cannot defer, like memory reclamation.	2018-05-30 12:18:56 +02:00
Vladimir Krivopalov	e1ee833861	Always pass mutation_partitions to partition_entry::apply() Previously it was also possible to pass a frozen_mutation to it. Now we de-serialize frozen mutations at the calling side. This is a pre-requisite for collecting memtable statistics needed for writing into the SSTables 3.0 format. For #1969. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-04-25 14:58:47 -07:00
Tomasz Grabiec	381bf02f55	cache: Evict with row granularity Instead of evicting whole partitions, evicts whole rows. As part of this, invalidation of partition entries was changed to not evict from snapshots right away, but unlink them and let them be evicted by the reclaimer.	2018-03-06 11:50:29 +01:00
Tomasz Grabiec	bee875fa7d	cache: Ensure all evictable partition_versions have a dummy after all rows Every evictable version will have a dummy entry at the end so that it can be tracked in the LRU. It is also needed to allow old versions to stay around (with tombstones and static rows) after all rows are evicted. Such versions must be fully discontinuous, and we need some entry to mark that.	2018-03-06 11:50:27 +01:00
Tomasz Grabiec	5320705300	cache: Propagate cache_tracker to places manipulating evictable entries cache_tracker reference will be needed to link/unlink row entries. No change of behavior in this patch.	2018-03-06 11:50:27 +01:00
Tomasz Grabiec	e571bd5a2e	mvcc: Add partition_entry::versions_from_oldest()	2018-03-06 11:50:26 +01:00
Tomasz Grabiec	d9a38c1c85	mutation_partition: Add API to walk from rows_entry to cache_entry Will be needed on row eviction, to unlink containers when they become fully evicted.	2018-03-06 11:50:26 +01:00
Tomasz Grabiec	9893e8e5f7	mvcc: Make each version have independent continuity This change is a preparation for introducing row-level eviction, such that entries can be evicted from older versions without having to touch other versions. Currently continuity flags on entries are interpreted relative to the combined view merged from all entries. For example: v2: <key=2, cont=1> v1: <key=1, cont=1> In v2, the flag on entry key=2 marks the range (1, 2) as continuous. This is problematic because if the old version is evicted, continuity will change in an incorrect way: v2: <key=2, cont=1> Here, the range (-inf, 1) would be marked as continuous, which is not true. To solve this problem, we change the rules for continuity interpretation in MVCC. Each version will have its own continuity, fully specified in that version, independent of continuity of other versions. Continuity of the snapshot will be a union of continuous ranges in each version. It is assumed that continuous intervals in different versions are non-overlapping, except for points corresponding to complete rows, in which case a later version may overlap with an older version (overwrite). We make use of this assumption to make calculation of the union of intervals on merging easier. I make use of the above assumption in mutation_partition::apply_monotonically(). MVCC population of incomplete entries already almost maintains the non-overlapping invariant, because population intervals correspond to intervals which are incomplete in the old snapshot. The only change needed is to ensure that both population bounds will have entries in the latest version. Population from memtables doesn't mark any intervals as continuous, so also conforms. The only change needed there is to not inherit continuity flags from the old snapshot, effectively making the new version internally discontinuous except for row points. 
The example from the beginning will become: v2: <key=1, cont=0> <key=2, cont=1> v1: <key=1, cont=1> When marking a range as continuous with some rows present only in older versions, we need to insert entries in the latest version, so that we can mark the range as continuous. The easiest solution is to copy the entry from the old version. Another option would be to add support for incomplete rows and insert such instead. This way we would avoid duplicating row contents. This optimization is deferred.	2018-03-06 11:50:25 +01:00
Tomasz Grabiec	2f956499a7	mvcc: Drop unused _evictable flag from partition_version_ref	2018-03-06 11:32:09 +01:00
Paweł Dziepak	6b66e4833b	mvcc: avoid ubsan warning about uninitialised boolean Message-Id: <20180223160133.21383-1-pdziepak@scylladb.com>	2018-02-23 16:54:23 +00:00
Tomasz Grabiec	b0b57b8143	mvcc: Do not move unevictable snapshots to cache Commit `6ccd317` introduced a bug in partition_entry::evict() where a partition entry may be partially evicted if there are non-evictable snapshots in it. Partially evicting some of the versions may violate consistency of a snapshot which includes evicted versions. For one, continuity flags are interpreted relative to the merged view, not within a version, so evicting from some of the versions may mark ranges as continuous when before they were discontinuous. Also, range tombstones of the snapshot are taken from all versions, so we can't partially evict some of them without marking all affected ranges as discontinuous. The fix is to revert back to full eviction, and avoid moving non-evictable snapshots to cache. When moving whole partition entry to cache, we first create a neutral empty partition entry and then merge the memtable entry into it just like we would if the entry already existed. Fixes #3215. Tests: unit (release) Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>	2018-02-15 16:48:07 +00:00
Tomasz Grabiec	27b114fe45	cache: Handle exceptions from make_evictable() cache_entry constructor was marked noexcept, yet make_evictable() may fail in rare cases due to allocation in add_version(). Lift the annotation and make sure that construction has strong exception guarantees for the moved-in state so that it can be retried without data loss inside allocating section.	2018-02-14 16:42:49 +01:00
Avi Kivity	404172652e	Merge "Use xxHash for digest instead of MD5" from Duarte "This series changes digest calculation to use a faster algorithm (xxHash) and to also cache calculated cell hashes that can be kept in memory to speed up subsequent digest requests. The MD5 hash function has proved to be slow for large cell values: size = 256; elapsed = 4us size = 512; elapsed = 8us size = 1024; elapsed = 14us size = 2048; elapsed = 21us size = 4096; elapsed = 33us size = 8192; elapsed = 51us size = 16384; elapsed = 86us size = 32768; elapsed = 150us size = 65536; elapsed = 278us size = 131072; elapsed = 531us size = 262144; elapsed = 1032us size = 524288; elapsed = 2026us size = 1048576; elapsed = 4004us size = 2097152; elapsed = 7943us size = 4194304; elapsed = 15800us size = 8388608; elapsed = 31731us size = 16777216; elapsed = 64681us size = 33554432; elapsed = 130752us size = 67108864; elapsed = 263154us The xxHash is a non-cryptographic, 64bit (there's work in progress on the 128 version) hash that can be used to replace MD5. It performs much better: size = 256; elapsed = 2us size = 512; elapsed = 1us size = 1024; elapsed = 1us size = 2048; elapsed = 2us size = 4096; elapsed = 2us size = 8192; elapsed = 3us size = 16384; elapsed = 5us size = 32768; elapsed = 8us size = 65536; elapsed = 14us size = 131072; elapsed = 28us size = 262144; elapsed = 59us size = 524288; elapsed = 116us size = 1048576; elapsed = 226us size = 2097152; elapsed = 456us size = 4194304; elapsed = 935us size = 8388608; elapsed = 1848us size = 16777216; elapsed = 4723us size = 33554432; elapsed = 10507us size = 67108864; elapsed = 21622us Performance was tested using a 3 node cluster with 1 cpu and 8GB, and with the following cassandra-stress loaders. Measurements are for the read workload. 
sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100 sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100 xxhash + caching: Results: op rate : 32699 [READ:32699] partition rate : 32699 [READ:32699] row rate : 32699 [READ:32699] latency mean : 3.0 [READ:3.0] latency median : 3.0 [READ:3.0] latency 95th percentile : 3.9 [READ:3.9] latency 99th percentile : 4.5 [READ:4.5] latency 99.9th percentile : 6.6 [READ:6.6] latency max : 24.0 [READ:24.0] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:05:05 END md5: Results: op rate : 25241 [READ:25241] partition rate : 25241 [READ:25241] row rate : 25241 [READ:25241] latency mean : 3.9 [READ:3.9] latency median : 3.9 [READ:3.9] latency 95th percentile : 5.1 [READ:5.1] latency 99th percentile : 5.8 [READ:5.8] latency 99.9th percentile : 8.0 [READ:8.0] latency max : 24.8 [READ:24.8] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:06:36 END This translates into a 21% improvement for this workload. 
Bigger cell values were also tested: sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100 sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100 xxhash + caching: Results: op rate : 19964 [READ:19964] partition rate : 19964 [READ:19964] row rate : 19964 [READ:19964] latency mean : 4.9 [READ:4.9] latency median : 4.6 [READ:4.6] latency 95th percentile : 7.2 [READ:7.2] latency 99th percentile : 11.5 [READ:11.5] latency 99.9th percentile : 13.6 [READ:13.6] latency max : 29.2 [READ:29.2] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:08:20 END md5: Results: op rate : 12773 [READ:12773] partition rate : 12773 [READ:12773] row rate : 12773 [READ:12773] latency mean : 7.7 [READ:7.7] latency median : 7.3 [READ:7.3] latency 95th percentile : 10.2 [READ:10.2] latency 99th percentile : 16.8 [READ:16.8] latency 99.9th percentile : 19.2 [READ:19.2] latency max : 71.5 [READ:71.5] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:13:02 END This translates into a 37% improvement for this workload. Fixes #2884 Tests: unit-tests (release), dtests (smp=2) Note: dtests are kinda broken in master (> 30 failures), so take the tests tag with a grain of himalayan salt." 
* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits) tests/row_cache_test: Test hash caching tests/memtable_test: Test hash caching tests/mutation_test: Use xxHash instead of MD5 for some tests tests/mutation_test: Test xx_hasher alongside md5_hasher schema: Remove unneeded include service/storage_proxy: Enable hash caching service/storage_service: Add and use xxhash feature message/messaging_service: Specify algorithm when requesting digest storage_proxy: Extract decision about digest algorithm to use cache_flat_mutation_reader: Pre-calculate cell hash partition_snapshot_reader: Pre-calculate cell hash query::partition_slice: Add option to specify when digest is requested row: Use cached hash for hash calculation mutation_partition: Replace hash_row_slice with appending_hash mutation_partition: Allow caching cell hashes mutation_partition: Force vector_storage internal storage size test.py: Increase memory for row_cache_stress_test atomic_cell_hash: Add specialization for atomic_cell_or_collection query-result: Use digester instead of md5_hasher range_tombstone: Replace feed_hash() member function with appending_hash ...	2018-02-08 18:24:58 +02:00
Tomasz Grabiec	06b7b54c3d	mvcc: Take partition_entry by const ref in operator<<() Some users will only have const&.	2018-02-06 14:24:19 +01:00
Tomasz Grabiec	50f5bee12e	mvcc: Do not evict from non-evictable snapshots When moving whole partition entries from memtable to cache, we move snapshots as well. It is incorrect to evict from such snapshots though, because associated readers would miss data. Solution is to record evictability of partition version references (snapshots) and avoid eviction from non-evictable snapshots. Could affect scanning reads, if the reader uses partition entry from memtable, and the partition is too large to fit in reader's buffer, and that entry gets moved to cache (was absent in cache), and then gets evicted (memory pressure). The reader will not see the remainder of that entry. Introduced in `ca8e3c4`, so affects 2.1+ Fixes #3186.	2018-02-06 14:24:19 +01:00
Tomasz Grabiec	d899ae0f02	mvcc: Encapsulate construction of evictable entries Internal invariants of MVCC are better preserved by partition_entry methods, so move construction of partition entries out of cache_entry constructors.	2018-02-05 17:54:03 +01:00
Duarte Nunes	712c051de6	cache_flat_mutation_reader: Pre-calculate cell hash When digest is requested, pre-calculate the cell's hash. We consider the case when the cell is already in the cache, and the case when it is added by the underlying reader. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 01:02:50 +00:00
Duarte Nunes	ec5b7fb553	partition_snapshot_reader: Pre-calculate cell hash When digest is requested, pre-calculate the cell's hash. A downside of this approach is that more work will be done when there are multiple versions of a row that contain values for the same cell, but we expect these cases to be rare and the upside of caching a cell's hash to compensate for the extra work. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 01:02:50 +00:00
Piotr Jastrzebski	96c97ad1db	Rename streamed_mutation* files to mutation_fragment* Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-24 20:56:49 +01:00
Tomasz Grabiec	88aff526df	mvcc: Extract version merging functions	2018-01-18 11:32:49 +01:00
Duarte Nunes	16c975edcc	partition_version: Return static_row fragment from static_row() Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180109162815.5811-1-duarte@scylladb.com>	2018-01-09 19:17:02 +01:00
Tomasz Grabiec	12704fd679	mvcc: Propagate region reference to partition_entry::apply_to_incomplete()	2017-12-08 17:50:48 +01:00
Tomasz Grabiec	a6e083ef6f	mvcc: Add const-qualified partition_version_ref::operator*()	2017-12-08 17:50:48 +01:00
Tomasz Grabiec	b26ce36d4b	mvcc: Introduce partition_snapshot::static_row_continuous()	2017-12-08 17:50:47 +01:00
Tomasz Grabiec	c283744fcb	mvcc: Introduce partition_snapshot::range_tombstones() for full range	2017-12-08 17:50:47 +01:00
Tomasz Grabiec	df964c70f8	mvcc: Don't require external schema in partition_snapshot::range_tombstones()	2017-12-08 17:50:47 +01:00
Tomasz Grabiec	49c0705409	mvcc: partition_version: Use apply_monotonically() to provide atomicity This patch drops the use of apply_reversibly(). We move the mutation to be applied into a new version and then use apply_monotonically() to merge it (if no snapshot) with the current version. This guarantees that apply() is atomic even if apply_monotonically() throws. Fixes #2012.	2017-11-28 12:38:28 +01:00
Tomasz Grabiec	52cabe343c	mvcc: Extract partition_entry::add_version()	2017-11-28 12:38:27 +01:00
Glauber Costa	c2f49da609	partition: add method to calculate memory size of a partition Once that is added, also add a method to a memtable entry to calculate the entire size of a memtable entry. Right now we only have one method to calculate the size minus rows. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Tomasz Grabiec	967cabcaf2	mvcc: Make the null state of partition_snapshot::change_mark explicit	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	4b7933543d	mvcc: Add partition_snapshot::region() getter	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	9cf30f19ae	mvcc: Add partition_snapshot::schema() getter	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	b6ae5783cd	mvcc: Introduce partition_entry::evict() The operation frees as much memory as possible, marking affected mutation elements as discontinuous.	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	cda86abdbc	mvcc: Encapsulate reference stability check in partition_snapshot	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	2df6f356b1	mvcc: Store LSA region reference in partition_snapshot Will be useful for improving encapsulation.	2017-09-13 17:38:08 +02:00
Piotr Jastrzebski	896bf2e5de	Remove unused methods from MVCC Some apply methods were replaced by apply_to_incomplete(). Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	6ebfb730ee	partition_entry: Introduce partition_tombstone() getter	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	b680de930c	partition_entry: Introduce apply_to_incomplete() Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: - extracted from a larger commit - fix heap comparator in apply_incomplete_target to order versions properly - extracted partition_version detaching into partition_entry::with_detached_versions() - dropped unnecessary rows_iterator::_version field - dropped unnecessary allocation of rows_entry and key copies in rows_iterator - dropped row_pointer - replaced apply_reversibly() with weaker and faster apply() - added handling of dummy entries at any position - fixed exception safety issue in apply_to_incomplete() which may result in data loss. We cannot move data out of applied versions into a new synthetic row and then apply it, because if exception happens in the middle, the data which was moved from the source will be lost. To fix that, row_iterator::consume_row() is introduced which allows in-place consumption of data without construction of temporary deletable_row. ]	2017-06-24 18:06:11 +02:00
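
The `clear_gently()` contract described in commit 40cc766cf2 (return `stop_iteration::yes` only once the object is fully cleared) can be illustrated with a minimal standalone sketch. It is not the Scylla implementation: the `stop_iteration` enum below is a local stand-in for the Seastar type, and `gentle_rows`, its batch size, and the synchronous `drive()` loop are hypothetical names introduced only for illustration. In the real code the loop would be Seastar's `repeat()`, which can yield between steps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-in for seastar::stop_iteration so the sketch compiles on its own.
enum class stop_iteration { no, yes };

// Hypothetical container that releases its rows in bounded batches.
class gentle_rows {
    std::vector<int> _rows;
    std::size_t _batch;
public:
    gentle_rows(std::size_t n, std::size_t batch) : _rows(n, 0), _batch(batch) {
        assert(batch > 0); // a zero batch would never make progress
    }

    std::size_t size() const noexcept { return _rows.size(); }

    // Drops at most _batch rows per call. Returns stop_iteration::yes once
    // nothing is left, so the final destructor call is cheap.
    stop_iteration clear_gently() noexcept {
        const std::size_t n = std::min(_batch, _rows.size());
        _rows.resize(_rows.size() - n); // shrinking does not allocate
        return _rows.empty() ? stop_iteration::yes : stop_iteration::no;
    }
};

// Synchronous stand-in for a repeat() loop: calls step until it reports
// completion and returns the number of calls made. A real caller would
// give other tasks a chance to run between calls.
template <typename Step>
std::size_t drive(Step step) {
    std::size_t calls = 1;
    while (step() == stop_iteration::no) {
        ++calls;
    }
    return calls;
}
```

Because `clear_gently()` returns a plain value rather than a future, the same step function can also be called from a context that cannot defer, which is the reason the commit message gives for this design.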


79 Commits
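
The per-version continuity rule from commit 9893e8e5f7 can be made concrete with a small standalone sketch. It models only what the commit message states: each version lists its entries in key order, an entry with `cont=1` marks the open range between the previous entry of the same version (or -inf) and its own key as continuous, and the snapshot's continuity is the union over versions. The types `entry`, `version`, `interval` and both functions are hypothetical simplifications with integer keys, not Scylla's data structures.

```cpp
#include <cassert>
#include <limits>
#include <utility>
#include <vector>

using pos = long long;
// Sentinel used for "-inf"; assumed never to be a real key.
constexpr pos minus_inf = std::numeric_limits<pos>::min();

struct entry {
    int key;
    bool continuous; // true: the range (previous key in this version, key) is continuous
};

using version = std::vector<entry>;   // entries sorted by key
using interval = std::pair<pos, pos>; // open interval (first, second)

// Continuity of a single version, derived from that version alone.
std::vector<interval> continuous_intervals(const version& v) {
    std::vector<interval> out;
    pos prev = minus_inf;
    for (const entry& e : v) {
        assert(prev < e.key); // entries must be strictly ordered
        if (e.continuous) {
            out.emplace_back(prev, e.key);
        }
        prev = e.key;
    }
    return out;
}

// Continuity of a snapshot: the union of the intervals of each version.
// The intervals are simply concatenated, relying on the stated assumption
// that intervals from different versions do not overlap.
std::vector<interval> snapshot_continuity(const std::vector<version>& versions) {
    std::vector<interval> out;
    for (const version& v : versions) {
        const auto part = continuous_intervals(v);
        out.insert(out.end(), part.begin(), part.end());
    }
    return out;
}
```

With the commit's reworked example, `v2 = {<key=1, cont=0>, <key=2, cont=1>}` and `v1 = {<key=1, cont=1>}`, the snapshot covers (1, 2) from v2 and (-inf, 1) from v1. Dropping v1 leaves only (1, 2), so evicting the older version no longer changes what v2 claims to be continuous.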