Memtable entries should be cleaned using the memtable cleaner, which,
unlike the cache's cleaner, is not associated with the cache
tracker. It's an error to clean a snapshot using a tracker which doesn't
own the entries; doing so will corrupt the cache tracker's row counter.
Fixes failure of test_exception_safety_of_update_from_memtable from
row_cache.cc in debug mode and with allocation failure injection
enabled.
Introduced in "cache: Defer during partition merging"
(70c72773be).
Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>
Incremental merging will be implemented by means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
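A minimal sketch of the resumable-function shape described above (the names here are illustrative, not Scylla's actual API; only stop_iteration mirrors seastar::stop_iteration):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors seastar::stop_iteration.
enum class stop_iteration { no, yes };

// A resumable merge step: each call moves a bounded batch of entries and
// returns stop_iteration::no until the whole input has been consumed, so
// the caller can preempt and do other work between calls.
struct incremental_merger {
    std::vector<int> src;
    std::vector<int>& dst;
    std::size_t pos = 0;

    stop_iteration merge_some(std::size_t batch = 2) {
        std::size_t end = std::min(pos + batch, src.size());
        for (; pos < end; ++pos) {
            dst.push_back(src[pos]);
        }
        return pos == src.size() ? stop_iteration::yes : stop_iteration::no;
    }
};
```

A caller would loop on merge_some(), yielding to the reactor whenever it returns stop_iteration::no, which is the work-around-preemption-points property futures would not give us.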
Currently the tests hold a single LSA region lock around construction of
managed objects, their manipulation, and access. This way we avoid the
complexity of dealing with allocating sections. That will not be
possible once apply_to_incomplete() is changed to enter an allocating
section itself, because this requires the region to be unlocked on
entry. The tests will have to take more fine-grained locks. That is
somewhat tricky and would add a lot of noise to the tests. This patch
makes things easier by abstracting LSA management, among other things,
inside the mvcc_container and mvcc_partition classes.
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.
Destruction of memtable entries during cache update now also uses the
gentle cleaner. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.
Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when the memtable is merged into the cache.
Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
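As a rough illustration of the cleaner scheme described above (all names here are hypothetical, not the actual mutation_cleaner interface):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>

struct version_obj {}; // stand-in for partition_version

// Sketch of a cleaner that destroys freed objects gradually instead of
// all at once, so large backlogs don't stall the reactor.
struct cleaner_sketch {
    std::deque<std::unique_ptr<version_obj>> backlog;

    // Takes ownership; the object is freed later, in small steps.
    void destroy_gently(std::unique_ptr<version_obj> v) {
        backlog.push_back(std::move(v));
    }

    // Merges another cleaner's backlog into this one, e.g. when a
    // memtable is merged into the cache.
    void merge(cleaner_sketch&& other) {
        while (!other.backlog.empty()) {
            backlog.push_back(std::move(other.backlog.front()));
            other.backlog.pop_front();
        }
    }

    // Destroys a bounded number of objects per call to avoid stalls.
    // Returns true when the backlog is fully drained.
    bool clean_some(std::size_t batch = 1) {
        while (batch-- && !backlog.empty()) {
            backlog.pop_front();
        }
        return backlog.empty();
    }
};
```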
Instead of evicting whole partitions, we now evict individual rows.
As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.
Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:
v2: <key=2, cont=1>
v1: <key=1, cont=1>
In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:
v2: <key=2, cont=1>
Here, the range (-inf, 1) would be marked as continuous, which is not true.
To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.
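Under the new rules, the snapshot's continuity could be computed as the union of the per-version intervals, sketched here with keys simplified to integers (names are illustrative, not Scylla's code):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using interval = std::pair<int, int>; // [start, end] over simplified int keys

// Each version fully specifies its own continuity; the snapshot's
// continuity is the union of continuous intervals over all versions.
std::vector<interval> snapshot_continuity(std::vector<std::vector<interval>> versions) {
    std::vector<interval> all;
    for (auto& v : versions) {
        all.insert(all.end(), v.begin(), v.end());
    }
    std::sort(all.begin(), all.end());
    std::vector<interval> merged;
    for (auto& i : all) {
        // Overlapping or touching intervals coalesce into one.
        if (!merged.empty() && i.first <= merged.back().second) {
            merged.back().second = std::max(merged.back().second, i.second);
        } else {
            merged.push_back(i);
        }
    }
    return merged;
}
```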
It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption in
mutation_partition::apply_monotonically() to make the calculation of
the union of intervals on merging easier.
MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so it also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.
The example from the beginning will become:
v2: <key=1, cont=0> <key=2, cont=1>
v1: <key=1, cont=1>
When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
Simply copying mutations which are not fully continuous may violate
MVCC invariants, like the one about non-overlapping continuity which
will be added later. Use apply_to_incomplete() instead.
This unfortunately reduces the strength of the test, since the continuity
of the entry is now completely determined by the first version. We should
use populate() instead, but it doesn't exist yet. It could be extracted
from cache_streamed_mutation, but that's not an easy change.
This is alleviated by adding a similar test to row_cache_test_g, in a
later patch.
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted relative to the merged view, not
within a version, so evicting from some of the versions may mark
ranges as continuous that were previously discontinuous. Also, range
tombstones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.
The fix is to revert to full eviction, and avoid moving
non-evictable snapshots to cache. When moving a whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.
Fixes #3215.
Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
When a digest is requested, pre-calculate the cell's hash. We handle
both the case when the cell is already in the cache, and the case when
it is added by the underlying reader.
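The caching idea can be sketched as a lazily computed, memoized hash on the cell (a hypothetical shape; the real cell type and hash function differ):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>

// Illustrative cell with a memoized hash: the hash is computed once,
// on the first digest request, and reused afterwards.
struct cell_sketch {
    std::string value;
    mutable std::optional<std::size_t> cached_hash;

    std::size_t hash() const {
        if (!cached_hash) {
            cached_hash = std::hash<std::string>{}(value);
        }
        return *cached_hash;
    }
};
```

Whether the cell was already cached or was just populated by the underlying reader, the first digest computes and stores the hash, and later digests reuse it.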
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Changes merging in MVCC to apply newer version to older instead of older to
newer.
Before (v0 = oldest):
(((v3 + v2) + v1) + v0)
After:
(v0 + (v1 + (v2 + v3)))
or:
(((v0 + v1) + v2) + v3)
There are several reasons to do this:
1) When continuity merging semantics change to support eviction
from older versions, it will be easier to implement apply() if we
can assume that we merge newer into older instead of older into
newer, since the newer version may have entries falling into a
continuous interval in the older one, but not the other way around. If we
didn't reverse the order, apply() would have to keep track of the
lower bound of a continuous interval in the right-hand side
argument (the older version) as it is applied, and update continuity
flags in the left-hand side by scanning all entries overlapping
with it. If the order is reversed, merging only needs to deal with
the current entry. Also, if we kept the old order, we
could not simply move entries from the left-hand side as we merge,
because we need to keep track of the lower bound of a continuous
interval, and we need to provide monotonic exception
guarantees. So merging would be both more complicated and slower.
2) With large partitions older versions are typically larger than
newer versions, and since merging is O(N_right*(1 + log(N_left))),
it's better to merge newer into older.
This fixes latency spikes seen in perf_cache_eviction.
Fixes #2715."
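A toy model of the reversed direction, using std::map to stand in for a version (hypothetical code; it only shows why the cost is one lookup-and-insert into the older side per newer entry, i.e. O(N_right*(1 + log(N_left)))):

```cpp
#include <cassert>
#include <map>
#include <utility>

// Moves all entries of `newer` into `older`; on key collision the newer
// value wins (overwrite), mirroring "apply newer version to older".
void apply_newer_to_older(std::map<int, int>& older, std::map<int, int>&& newer) {
    for (auto it = newer.begin(); it != newer.end(); ) {
        // Extract the node so entries are moved, not copied; this keeps
        // the merge monotonic: moved entries stay moved on failure.
        auto node = newer.extract(it++);
        auto res = older.insert(std::move(node));
        if (!res.inserted) {
            res.position->second = res.node.mapped(); // overwrite with newer
        }
    }
}
```

Each iteration costs one O(log N_left) insert into the older (left) side, so it is cheaper to iterate over the typically smaller newer side.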
* tag 'tgrabiec/reverse-order-of-mvcc-version-merging-v1' of github.com:scylladb/seastar-dev:
mvcc: Reverse order of version merging
anchorless_list: Introduce last()
mvcc: Implement partition_entry::upgrade() using squashed()
mvcc: Extract version merging functions
mutation_partition: Add rows_entry::set_dummy()
position_in_partition: Introduce after_key()
It is currently updated only when iterators are invalidated. It's better
not to assume that, because it's not really needed, and
maintaining this would complicate maybe_refresh() after continuity
merging rules change later.
"This simplifies implementation of mutation_partition merging by relaxing
exception guarantees it needs to provide. This allows reverters to be dropped.
Direct motivation for this is to make it easier to implement new semantics
for merging of clustering range continuity.
Implementation details:
We only need strong exception guarantees when applying to the memtable, which is
using MVCC. Instead of calling apply() with strong exception guarantees on the latest
version, we will move the incoming mutation to a new partition_version and then
use monotonic apply() to merge them. If that merging fails, we attach the version with
the remainder, which cannot fail. This way apply() always succeeds if the allocation
of partition_version object succeeds.
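The scheme can be sketched like this, with a std::map standing in for a version's data (all names are illustrative, not the partition_entry API):

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <map>

using partition_data = std::map<int, int>; // keys/values simplified to ints

struct partition_entry_sketch {
    std::list<partition_data> versions; // front = newest

    void apply(partition_data incoming) {
        // Allocating the new version is the only step that may fail;
        // past this point, apply() as a whole always succeeds.
        versions.push_front(std::move(incoming));
        if (versions.size() == 1) {
            return; // first version, nothing to merge
        }
        // Monotonic merge: entries already moved stay moved. If the
        // merge were interrupted part-way, the remainder would simply
        // stay attached as a version and the snapshot stays consistent.
        auto& newer = versions.front();
        auto& older = *std::next(versions.begin());
        while (!newer.empty()) {
            older.insert_or_assign(newer.begin()->first, newer.begin()->second);
            newer.erase(newer.begin());
        }
        versions.pop_front(); // fully merged; drop the now-empty version
    }
};
```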
Results of `perf_simple_query_g -c1 -m1G --write` (high overwrite rate):
Before:
101011.13 tps
102498.07 tps
103174.68 tps
102879.55 tps
103524.48 tps
102794.56 tps
103565.11 tps
103018.51 tps
103494.37 tps
102375.81 tps
103361.65 tps
After:
101785.37 tps
101366.19 tps
103532.26 tps
100834.83 tps
100552.11 tps
100891.31 tps
101752.06 tps
101532.00 tps
100612.06 tps
102750.62 tps
100889.16 tps
Fixes #2012."
* tag 'tgrabiec/drop-reversible-apply-v1' of github.com:scylladb/seastar-dev:
mutation_partition: Drop apply_reversibly()
mutation_partition: Relax exception guarantees of apply()
mutation_partition: Introduce apply_weak()
tests: mvcc: Add test for atomicity of partition_entry::apply()
tests: Move failure_injecting_allocation_strategy to a header
tests: mutation_partition: Test exception guarantees of apply_monotonically()
mvcc: Use apply_monotonically() where sufficient
mvcc: partition_version: Use apply_monotonically() to provide atomicity
mvcc: Extract partition_entry::add_version()
mutation_partition: Introduce apply_monotonically()
mutation_partition: Introduce row::consume_with()