scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 04:06:59 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	2f956499a7	mvcc: Drop unused _evictable flag from partition_version_ref	2018-03-06 11:32:09 +01:00
Tomasz Grabiec	b0b57b8143	mvcc: Do not move unevictable snapshots to cache Commit `6ccd317` introduced a bug in partition_entry::evict() where a partition entry may be partially evicted if there are non-evictable snapshots in it. Partially evicting some of the versions may violate consistency of a snapshot which includes evicted versions. For one, continuity flags are interpreted relative to the merged view, not within a version, so evicting from some of the versions may mark ranges as continuous when before they were discontinuous. Also, range tombstones of the snapshot are taken from all versions, so we can't partially evict some of them without marking all affected ranges as discontinuous. The fix is to revert to full eviction, and avoid moving non-evictable snapshots to cache. When moving whole partition entry to cache, we first create a neutral empty partition entry and then merge the memtable entry into it just like we would if the entry already existed. Fixes #3215. Tests: unit (release) Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>	2018-02-15 16:48:07 +00:00
Tomasz Grabiec	27b114fe45	cache: Handle exceptions from make_evictable() cache_entry constructor was marked noexcept, yet make_evictable() may fail in rare cases due to allocation in add_version(). Lift the annotation and make sure that construction has strong exception guarantees for the moved-in state so that it can be retried without data loss inside allocating section.	2018-02-14 16:42:49 +01:00
Avi Kivity	404172652e	Merge "Use xxHash for digest instead of MD5" from Duarte "This series changes digest calculation to use a faster algorithm (xxHash) and to also cache calculated cell hashes that can be kept in memory to speed up subsequent digest requests. The MD5 hash function has proved to be slow for large cell values:
size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us
The xxHash is a non-cryptographic, 64bit (there's work in progress on the 128 version) hash that can be used to replace MD5. It performs much better:
size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us
Performance was tested using a 3 node cluster with 1 cpu and 8GB, and with the following cassandra-stress loaders. Measurements are for the read workload.
sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
xxhash + caching:
Results:
op rate : 32699 [READ:32699]
partition rate : 32699 [READ:32699]
row rate : 32699 [READ:32699]
latency mean : 3.0 [READ:3.0]
latency median : 3.0 [READ:3.0]
latency 95th percentile : 3.9 [READ:3.9]
latency 99th percentile : 4.5 [READ:4.5]
latency 99.9th percentile : 6.6 [READ:6.6]
latency max : 24.0 [READ:24.0]
Total partitions : 10000000 [READ:10000000]
Total errors : 0 [READ:0]
total gc count : 0
total gc mb : 0
total gc time (s) : 0
avg gc time(ms) : NaN
stdev gc time(ms) : 0
Total operation time : 00:05:05
END
md5:
Results:
op rate : 25241 [READ:25241]
partition rate : 25241 [READ:25241]
row rate : 25241 [READ:25241]
latency mean : 3.9 [READ:3.9]
latency median : 3.9 [READ:3.9]
latency 95th percentile : 5.1 [READ:5.1]
latency 99th percentile : 5.8 [READ:5.8]
latency 99.9th percentile : 8.0 [READ:8.0]
latency max : 24.8 [READ:24.8]
Total partitions : 10000000 [READ:10000000]
Total errors : 0 [READ:0]
total gc count : 0
total gc mb : 0
total gc time (s) : 0
avg gc time(ms) : NaN
stdev gc time(ms) : 0
Total operation time : 00:06:36
END
This translates into a 21% improvement for this workload.
Bigger cell values were also tested:
sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
xxhash + caching:
Results:
op rate : 19964 [READ:19964]
partition rate : 19964 [READ:19964]
row rate : 19964 [READ:19964]
latency mean : 4.9 [READ:4.9]
latency median : 4.6 [READ:4.6]
latency 95th percentile : 7.2 [READ:7.2]
latency 99th percentile : 11.5 [READ:11.5]
latency 99.9th percentile : 13.6 [READ:13.6]
latency max : 29.2 [READ:29.2]
Total partitions : 10000000 [READ:10000000]
Total errors : 0 [READ:0]
total gc count : 0
total gc mb : 0
total gc time (s) : 0
avg gc time(ms) : NaN
stdev gc time(ms) : 0
Total operation time : 00:08:20
END
md5:
Results:
op rate : 12773 [READ:12773]
partition rate : 12773 [READ:12773]
row rate : 12773 [READ:12773]
latency mean : 7.7 [READ:7.7]
latency median : 7.3 [READ:7.3]
latency 95th percentile : 10.2 [READ:10.2]
latency 99th percentile : 16.8 [READ:16.8]
latency 99.9th percentile : 19.2 [READ:19.2]
latency max : 71.5 [READ:71.5]
Total partitions : 10000000 [READ:10000000]
Total errors : 0 [READ:0]
total gc count : 0
total gc mb : 0
total gc time (s) : 0
avg gc time(ms) : NaN
stdev gc time(ms) : 0
Total operation time : 00:13:02
END
This translates into a 37% improvement for this workload. Fixes #2884 Tests: unit-tests (release), dtests (smp=2) Note: dtests are kinda broken in master (> 30 failures), so take the tests tag with a grain of himalayan salt."
* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits)
  tests/row_cache_test: Test hash caching
  tests/memtable_test: Test hash caching
  tests/mutation_test: Use xxHash instead of MD5 for some tests
  tests/mutation_test: Test xx_hasher alongside md5_hasher
  schema: Remove unneeded include
  service/storage_proxy: Enable hash caching
  service/storage_service: Add and use xxhash feature
  message/messaging_service: Specify algorithm when requesting digest
  storage_proxy: Extract decision about digest algorithm to use
  cache_flat_mutation_reader: Pre-calculate cell hash
  partition_snapshot_reader: Pre-calculate cell hash
  query::partition_slice: Add option to specify when digest is requested
  row: Use cached hash for hash calculation
  mutation_partition: Replace hash_row_slice with appending_hash
  mutation_partition: Allow caching cell hashes
  mutation_partition: Force vector_storage internal storage size
  test.py: Increase memory for row_cache_stress_test
  atomic_cell_hash: Add specialization for atomic_cell_or_collection
  query-result: Use digester instead of md5_hasher
  range_tombstone: Replace feed_hash() member function with appending_hash
  ...	2018-02-08 18:24:58 +02:00
Tomasz Grabiec	06b7b54c3d	mvcc: Take partition_entry by const ref in operator<<() Some users will only have const&.	2018-02-06 14:24:19 +01:00
Tomasz Grabiec	50f5bee12e	mvcc: Do not evict from non-evictable snapshots When moving whole partition entries from memtable to cache, we move snapshots as well. It is incorrect to evict from such snapshots though, because associated readers would miss data. The solution is to record evictability of partition version references (snapshots) and avoid eviction from non-evictable snapshots. Could affect scanning reads, if the reader uses partition entry from memtable, and the partition is too large to fit in reader's buffer, and that entry gets moved to cache (was absent in cache), and then gets evicted (memory pressure). The reader will not see the remainder of that entry. Introduced in `ca8e3c4`, so affects 2.1+ Fixes #3186.	2018-02-06 14:24:19 +01:00
Tomasz Grabiec	c391bff1d2	mvcc: Drop unnecessary assignment to partition_snapshot::_version merge_partition_versions() is responsible for merging versions unpinned by the current snapshot. If that fails, we don't need to set _version back since versions must be still referenced by someone else, this snapshot is not a unique owner. This change makes it easier to add tracking of evictability.	2018-02-06 14:24:18 +01:00
Tomasz Grabiec	d899ae0f02	mvcc: Encapsulate construction of evictable entries Internal invariants of MVCC are better preserved by partition_entry methods, so move construction of partition entries out of cache_entry constructors.	2018-02-05 17:54:03 +01:00
Duarte Nunes	ec5b7fb553	partition_snapshot_reader: Pre-calculate cell hash When digest is requested, pre-calculate the cell's hash. A downside of this approach is that more work will be done when there are multiple versions of a row that contain values for the same cell, but we expect these cases to be rare and the upside of caching a cell's hash to compensate for the extra work. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 01:02:50 +00:00
Tomasz Grabiec	60d3c25c02	mvcc: Reverse order of version merging Change merging to apply newer version to older instead of older to newer. Before: (((v3 + v2) + v1) + v0) After: (v0 + (v1 + (v2 + v3))) or equivalent: (((v0 + v1) + v2) + v3) There are several reasons to do this: 1) When continuity merging will change semantics to support eviction from older versions, it will be easier to implement apply() if we can assume that we merge newer to older instead of older to newer, since newer version may have entries falling into a continuous interval in older, but not the other way around. If we didn't reverse the order, apply() would have to keep track of lower bound of a continuous interval in the right-hand side argument (older version) as it is applied and update continuity flags in the left-hand side by scanning all entries overlapping with it. If order is reversed, merging only needs to deal with the current entry. Also, if we were to keep the old order, we cannot simply move entries from the left-hand side as we merge because we need to keep track of the lower bound of a continuous interval, and we need to provide monotonic exception guarantees. So merging would be both more complicated and slower. 2) With large partitions older versions are typically larger than newer versions, and since merging is O(N_right*(1 + log(N_left))), it's better to merge newer into older. Fixes #2715.	2018-01-18 13:52:08 +01:00
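The cost argument in point 2 of the commit above can be checked with a small model. This is only a sketch: `merge_cost`, `cost_newer_into_older` and `cost_older_into_newer` are hypothetical helpers, not ScyllaDB functions. The logarithm base (2) and the assumption that merged versions share no keys (so sizes simply add up) are simplifications made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cost model quoted in the commit message: applying n_right entries to a target
// holding n_left entries costs about n_right * (1 + log(n_left)).
// Base-2 logarithm is an assumption of this sketch.
double merge_cost(double n_left, double n_right) {
    return n_right * (1.0 + std::log2(std::max(n_left, 1.0)));
}

// sizes[0] is the oldest version (v0), sizes.back() the newest.
// New order: (((v0 + v1) + v2) + v3), each newer version is applied to the older result.
double cost_newer_into_older(const std::vector<double>& sizes) {
    if (sizes.empty()) return 0.0;
    double total = 0.0;
    double acc = sizes.front();
    for (std::size_t i = 1; i < sizes.size(); ++i) {
        total += merge_cost(acc, sizes[i]);
        acc += sizes[i];  // assumes no overlapping keys
    }
    return total;
}

// Old order: (((v3 + v2) + v1) + v0), each older version is applied to the newer result.
double cost_older_into_newer(const std::vector<double>& sizes) {
    if (sizes.empty()) return 0.0;
    double total = 0.0;
    double acc = sizes.back();
    for (std::size_t i = sizes.size() - 1; i-- > 0;) {
        total += merge_cost(acc, sizes[i]);
        acc += sizes[i];  // assumes no overlapping keys
    }
    return total;
}
```

With a large old version and small new ones, e.g. sizes `{1000000, 100, 10}`, the newer-into-older order is cheaper in this model because the large version never appears on the right-hand side.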
Tomasz Grabiec	5331b7b8e2	mvcc: Implement partition_entry::upgrade() using squashed() To reduce duplication of version merging logic.	2018-01-18 11:32:49 +01:00
Tomasz Grabiec	88aff526df	mvcc: Extract version merging functions	2018-01-18 11:32:49 +01:00
Duarte Nunes	16c975edcc	partition_version: Return static_row fragment from static_row() Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180109162815.5811-1-duarte@scylladb.com>	2018-01-09 19:17:02 +01:00
Tomasz Grabiec	4094c66979	mvcc: Reuse partition_snapshot_row_cursor in apply_to_incomplete() Reduces duplication of knowledge about how logical mutation_partition view is obtained for multiple versions.	2017-12-08 17:50:48 +01:00
Tomasz Grabiec	12704fd679	mvcc: Propagate region reference to partition_entry::apply_to_incomplete()	2017-12-08 17:50:48 +01:00
Tomasz Grabiec	b26ce36d4b	mvcc: Introduce partition_snapshot::static_row_continuous()	2017-12-08 17:50:47 +01:00
Tomasz Grabiec	c283744fcb	mvcc: Introduce partition_snapshot::range_tombstones() for full range	2017-12-08 17:50:47 +01:00
Tomasz Grabiec	df964c70f8	mvcc: Don't require external schema in partition_snapshot::range_tombstones()	2017-12-08 17:50:47 +01:00
Tomasz Grabiec	183554cbc4	mvcc: Optimize partition_snapshot::range_tombstones() for single version case	2017-12-08 10:15:58 +01:00
Tomasz Grabiec	1303320377	mvcc: Fix partition_snapshot::range_tombstones() partition_snapshot::range_tombstones() is deoverlapping tombstones coming from different versions and it may happen that due to range tombstone splitting the method will return a tombstone which starts after the requested range. This would cause it to return a tombstone which doesn't overlap with the requested range. This breaks assumptions made by cache reader. It keeps track of the maximum fragment position, and if cache reader will then need to read from sstables due to a miss, it would do so starting from the position marked by that out of range tombstone, possibly skipping over some rows. Exposed by a change in row_cache_test.cc::test_mvcc() which fills the buffer of sm5 reader after it is created. Fixes #3053.	2017-12-08 10:15:58 +01:00
Tomasz Grabiec	376cddb212	mvcc: Use apply_monotonically() where sufficient	2017-11-28 12:38:28 +01:00
Tomasz Grabiec	49c0705409	mvcc: partition_version: Use apply_monotonically() to provide atomicity This patch drops the use of apply_reversibly(). We move the mutation to be applied into a new version and then use apply_monotonically() to merge it (if no snapshot) with the current version. This guarantees that apply() is atomic even if apply_monotonically() throws. Fixes #2012.	2017-11-28 12:38:28 +01:00
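A toy model of the approach described in the commit above, under stated assumptions: `partition`, `toy_entry` and the `fail_after` failure injection are hypothetical, and this `apply_monotonically()` only borrows the name, not ScyllaDB's signature or implementation. Swallowing the failure inside `apply()` is a choice made for this sketch. The point it illustrates is that once the mutation is stored as its own version, a merge that fails midway leaves every entry in exactly one of the two versions, so the logical contents are unaffected.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <map>
#include <new>
#include <string>
#include <utility>

using partition = std::map<int, std::string>;  // toy stand-in for mutation_partition

// Moves entries from src (newer) into dst (older) one node at a time.
// If it throws midway, each entry is still in dst or in src, never lost.
// fail_after is a test hook: throw after that many entries were moved.
void apply_monotonically(partition& dst, partition& src, int fail_after = -1) {
    int moved = 0;
    while (!src.empty()) {
        if (moved == fail_after) throw std::bad_alloc();
        auto res = dst.insert(src.extract(src.begin()));  // node move, no allocation
        if (!res.inserted) {
            // Key already present in the older version: the newer value wins.
            res.position->second = std::move(res.node.mapped());
        }
        ++moved;
    }
}

struct toy_entry {
    std::list<partition> versions;  // front() is the newest version
    bool has_snapshot = false;

    void apply(partition m, int fail_after = -1) {
        versions.push_front(std::move(m));  // if this throws, nothing has changed
        if (has_snapshot || versions.size() < 2) return;
        auto newest = versions.begin();
        auto older = std::next(newest);
        try {
            apply_monotonically(*older, *newest, fail_after);
            versions.erase(newest);
        } catch (const std::bad_alloc&) {
            // Leave both versions in place; the merged view is still complete.
        }
    }

    // Logical contents: fold versions from oldest to newest.
    partition squashed() const {
        partition result;
        for (auto it = versions.rbegin(); it != versions.rend(); ++it) {
            for (const auto& [key, value] : *it) result[key] = value;
        }
        return result;
    }
};
```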
Tomasz Grabiec	52cabe343c	mvcc: Extract partition_entry::add_version()	2017-11-28 12:38:27 +01:00
Tomasz Grabiec	8402728747	row_cache: Call open_version() under region's allocator partition_entry::read() calls open_version() under standard allocator, but it may allocate a new partition version if a snapshot already exists which was created in an earlier phase. Versions are supposed to be allocated using region's allocator, they will be freed using region's allocator. LSA will delegate free() to the standard allocator correctly in this case, but it will subtract from its _non_lsa_occupancy, assuming the allocation was done through it. This will corrupt occupancy() for cache region. Fixes #2948. Message-Id: <1510229584-14398-1-git-send-email-tgrabiec@scylladb.com>	2017-11-13 15:20:08 +00:00
Glauber Costa	c2f49da609	partition: add method to calculate memory size of a partition Once that is added, also add a method to a memtable entry to calculate the entire size of a memtable entry. Right now we only have one method to calculate the size minus rows. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Tomasz Grabiec	bbca83d4c0	cache: Make range tombstone merging exception-safe range_tombstone_list::apply() has no exception safety guarantees about the logical state. The target mutation_partition in cache should be assumed to be left in unspecified state. In particular, some of the preexisting overlapping tombstones may be removed and not reinserted, so the cache would be missing some of the range tombstone information in case the whole allocating section fails. Use apply_monotonically() which provides the needed guarantees. Fixes #2938.	2017-11-07 15:33:24 +01:00
Tomasz Grabiec	9cf30f19ae	mvcc: Add partition_snapshot::schema() getter	2017-11-02 11:05:19 +01:00
Tomasz Grabiec	b6ae5783cd	mvcc: Introduce partition_entry::evict() The operation frees as much memory as possible, marking affected mutation elements as discontinuous.	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	4053c801e2	mvcc: Ensure partition_snapshot always destroys versions using proper allocator partition_snapshot is managed by lw_shared_ptr. Currently it is assumed that before it dies, maybe_merge_versions() is called on it, which destroys it in the right allocator context. It's not very safe. This patch improves safety by using the right allocator in snapshot's destructor.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	2df6f356b1	mvcc: Store LSA region reference in partition_snapshot Will be useful for improving encapsulation.	2017-09-13 17:38:08 +02:00
Piotr Jastrzebski	896bf2e5de	Remove unused methods from MVCC Some apply methods were replaced by apply_to_incomplete(). Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	6ebfb730ee	partition_entry: Introduce partition_tombstone() getter	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e433e68610	partition_entry: Make squashed() and upgrade() work with not fully continuous versions Those methods first create a neutral mutation_partition, and left-fold it with the versions. The problem is that there is no neutral element for static row continuity, the flag from the first addend always wins. We have to copy the flag from the first version to preserve the logical value.	2017-06-24 18:06:11 +02:00
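The left-fold problem described in the commit above can be reproduced with a toy type. Everything here is hypothetical (`toy_partition`, `apply`, `squashed_naive`, `squashed_fixed`), and the default of `true` for the continuity flag is an assumption of the sketch. It only mirrors the rule stated in the message: the flag of the first addend always wins, so starting the fold from a freshly built "neutral" value loses the first version's flag.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a partition version; only what the example needs.
struct toy_partition {
    bool static_row_continuous = true;  // assumed default of a fresh partition
    int static_row = 0;                 // stand-in for the static row contents
};

// Merge rule from the commit message: data from rhs is added, but the
// continuity flag of the left-hand side (the first addend) is kept.
void apply(toy_partition& lhs, const toy_partition& rhs) {
    lhs.static_row += rhs.static_row;
}

// Folds into a default-constructed value: the first version's flag is lost.
toy_partition squashed_naive(const std::vector<toy_partition>& versions) {
    toy_partition acc;
    for (const auto& v : versions) apply(acc, v);
    return acc;
}

// Copies the flag from the first version before folding.
toy_partition squashed_fixed(const std::vector<toy_partition>& versions) {
    toy_partition acc;
    if (!versions.empty()) {
        acc.static_row_continuous = versions.front().static_row_continuous;
    }
    for (const auto& v : versions) apply(acc, v);
    return acc;
}
```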
Piotr Jastrzebski	b680de930c	partition_entry: Introduce apply_to_incomplete() Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: - extracted from a larger commit - fix heap comparator in apply_incomplete_target to order versions properly - extracted partition_version detaching into partition_entry::with_detached_versions() - dropped unnecessary rows_iterator::_version field - dropped unnecessary allocation of rows_entry and key copies in rows_iterator - dropped row_pointer - replaced apply_reversibly() with weaker and faster apply() - added handling of dummy entries at any position - fixed exception safety issue in apply_to_incomplete() which may result in data loss. We cannot move data out of applied versions into a new synthetic row and then apply it, because if exception happens in the middle, the data which was moved from the source will be lost. To fix that, row_iterator::consume_row() is introduced which allows in-place consumption of data without construction of temporary deletable_row. ]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	b6ce963200	partition_version: Introduce partition_entry::with_detached_versions()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	64626b32b0	row_cache: Make printable	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	dd9d35c166	partition_snapshot: Add getter for range tombstones	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	60c3c0a471	partition_entry: Add squashed() overload with a single schema	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	98f7671553	partition_snapshot: Introduce squashed()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	87b0f11be3	partition_snapshot: Add getters for static row and partition tombstone [tgrabiec: - Extracted from a different patch - Renamed concept names to more familiar Map and Reduce - Renamed aggregate() to squashed() to match the existing nomenclature - Uncommented the concepts ]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	2fdabcaa9b	Track population phase in partition_snapshot This will be used by partial cache in later patches. [tgrabiec: - changed title, - documented meaning of the variable, - renamed the variable, - introduced open_version(), - fixed continuity of the static row not being preserved in case a new version is created] Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	fe387f8ba0	partition_version: Fix corruption of partition_version list The move constructor of partition_version was not invoking move constructor of anchorless_list_base_hook. As a result, when partition_version objects were moved, e.g. during LSA compaction, they were unlinked from their lists. This can make readers return invalid data, because not all versions will be reachable. It also causes leaks of the versions which are not directly attached to memtable entry. This will trigger assertion failure in LSA region destructor. This assertion triggers with row cache disabled. With cache enabled (default) all segments are merged into the cache region, which currently is not destroyed on shutdown, so this problem would go unnoticed. With cache disabled, memtable region is destroyed after memtable is flushed and after all readers stop using that memtable. Fixes #1753. Message-Id: <1476778472-5711-1-git-send-email-tgrabiec@scylladb.com>	2016-10-18 09:25:38 +01:00
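The bug pattern in the commit above is a general C++ pitfall: a user-written move constructor that does not name the base class in its initializer list default-constructs the base. A minimal sketch with a hypothetical intrusive hook follows; `list_hook`, `broken_version` and `fixed_version` are illustrative and are not ScyllaDB's anchorless_list_base_hook or partition_version.

```cpp
#include <cassert>
#include <utility>

// Hypothetical minimal intrusive list hook.
struct list_hook {
    list_hook* prev = nullptr;
    list_hook* next = nullptr;

    list_hook() = default;
    // Moving a hook splices the new object into the old object's place.
    list_hook(list_hook&& o) noexcept : prev(o.prev), next(o.next) {
        if (prev) prev->next = this;
        if (next) next->prev = this;
        o.prev = o.next = nullptr;
    }
    void insert_after(list_hook& pos) {
        prev = &pos;
        next = pos.next;
        if (next) next->prev = this;
        pos.next = this;
    }
};

struct broken_version : list_hook {
    int payload = 0;
    explicit broken_version(int p) : payload(p) {}
    // Bug pattern: the base is default-constructed, so the moved-to object
    // is not on the list.
    broken_version(broken_version&& o) noexcept : payload(o.payload) {}
};

struct fixed_version : list_hook {
    int payload = 0;
    explicit fixed_version(int p) : payload(p) {}
    // Fix: forward the move to the base hook as well.
    fixed_version(fixed_version&& o) noexcept
        : list_hook(std::move(o)), payload(o.payload) {}
};

// Links two versions, relocates the second one (as a compacting allocator
// might), and reports whether the relocated object is still reachable.
template <typename Version>
bool still_linked_after_move() {
    Version head(0);
    Version tail(1);
    tail.insert_after(head);
    Version relocated(std::move(tail));
    return head.next == &relocated && relocated.prev == &head;
}
```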
Glauber Costa	452eb95943	move partition_snapshot_reader code to header file This is so we can template it without worrying about declaring the specializations in the .cc file. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Piotr Jastrzebski	b05b90b3a5	Introduce clustering_key_filter_ranges. This fixes the problem of multiple concurrent get_ranges calls. Previously each call was invalidating the result of the previous call. Now they don't step on each other's toes. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 19:46:38 +02:00
Paweł Dziepak	5cae44114f	partition_version: handle errors during version merge Currently, partition snapshot destructor can throw which is a big no-no. The solution is to ignore the exception and leave versions unmerged and hope that subsequent reads will succeed at merging. However, another problem is that the merge doesn't use allocating sections which means that memory won't be reclaimed to satisfy its needs. If the cache is full this may result in partition versions not being merged for a very long time. This patch introduces partition_snapshot::merge_partition_versions() which contains all the version merging logic that was previously present in the snapshot destructor. This function may throw so that it can be used with allocating sections. The actual merging and handling of potential errors is done from partition_snapshot_reader destructor. It tries to merge versions under the allocating section. Only if that fails it gives up and leaves them unmerged. Fixes #1578 Fixes #1579. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1471265544-23579-1-git-send-email-pdziepak@scylladb.com>	2016-08-15 15:56:53 +03:00
Tomasz Grabiec	1b2ea14d0e	partition_version: Add missing linearization context Snapshot removal merges partitions, and cell merging must be done inside linearization context. Fixes #1574 Reviewed-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1471010625-18019-1-git-send-email-tgrabiec@scylladb.com>	2016-08-12 17:55:23 +03:00
Paweł Dziepak	db5ea591ad	add mvcc implementation for mutation_partitions To ensure isolation of operation when streaming a mutation from a mutable source (such as cache or memtable) MVCC is used. Each entry in memtable or cache is actually a list of used versions of that entry. Incoming writes are either applied directly to the last version (if it wasn't being read by anyone) or prepended to the list (if the former head was being read by someone). When reader finishes it tries to squash versions together provided there is no other reader that could prevent this. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00

47 Commits