scylladb

Author	SHA1	Message	Date
Tomasz Grabiec	638d23025b	tests: row_cache: Add test for exception safety of multi-partition scans	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	084e1861c8	tests: row_cache: Add test for exception safety of single-partition reads	2017-11-13 20:55:14 +01:00
Tomasz Grabiec	16d6222a96	tests: row_cache: Add test for population of single rows	2017-11-02 12:16:17 +01:00
Tomasz Grabiec	bbf8ccb709	tests: Add test for population of continuity	2017-11-02 12:16:17 +01:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Tomasz Grabiec	b74b06808e	tests: row_cache: Add test for concurrent population of partition entries Message-Id: <1507815478-20269-2-git-send-email-tgrabiec@scylladb.com>	2017-10-12 15:55:33 +01:00
Piotr Jastrzebski	6069bab755	Cache single queries to non-existing partitions This way we don't need to query sstables again when the query is repeated. Fixes #1533 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <8f8559ff19c534dbbb7c9ef6c28271cec607ba20.1506521461.git.piotr@scylladb.com>	2017-09-27 16:15:18 +02:00
Tomasz Grabiec	e4adc9c600	tests: row_cache: Add test for concurrent population	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	a3fb7ce660	tests: row_cache: Make populate_range() accept partition_range	2017-09-25 11:21:58 +02:00
Tomasz Grabiec	0911fbbdef	row_cache: Fix row_cache::update_invalidating() evict() doesn't guarantee that the whole partition is discontinuous. In particular, partition tombstone cannot be marked as discontinuous. The parts which are still continuous must be updated. Broken after `c78047fa5b`. Message-Id: <1505375684-28574-1-git-send-email-tgrabiec@scylladb.com>	2017-09-14 10:58:25 +03:00
Tomasz Grabiec	2dfb3b95a5	tests: row_cache: Add eviction tests	2017-09-13 17:47:03 +02:00
Tomasz Grabiec	99aa3d1964	tests: row_cache_test: Don't assume mvcc snapshots are not evictable The test was not updating the underlying mutation source but still expecting to get the right data after calling invalidate(). If snapshots are evictable, that's not guaranteed. Apply to underlying as well, so data is read from underlying if necessary.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	423142ec81	tests: row_cache_test: Fix abort in debug mode The test used apply() variant which assumed that it was invoked in a seastar thread, which is no longer the case after commit `d22fdf4`. Fix by copying outside cache update, and use non-deferring apply() variant for cache update. Message-Id: <1505200142-3650-1-git-send-email-tgrabiec@scylladb.com>	2017-09-12 10:57:36 +03:00
Tomasz Grabiec	d22fdf4261	row_cache: Improve safety of cache updates Cache imposes requirements on how updates to the on-disk mutation source are made: 1) each change to the on-disk mutation source must be followed by cache synchronization reflecting that change 2) The two must be serialized with other synchronizations 3) must have strong failure guarantees (atomicity) Because of that, sstable list update and cache synchronization must be done under a lock, and cache synchronization cannot fail to synchronize. Normally cache synchronization achieves no-failure thing by wiping the cache (which is noexcept) in case failure is detected. There are some setup steps however which cannot be skipped, e.g. taking a lock followed by switching cache to use the new snapshot. That truly cannot fail. The lock inside cache synchronizers is redundant, since the user needs to take it anyway around the combined operation. In order to make ensuring strong exception guarantees easier, and making the cache interface easier to use correctly, this patch moves the control of the combined update into the cache. This is done by having cache::update() et al accept a callback (external_updater) which is supposed to perform modification of the underlying mutation source when invoked. This is in-line with the layering. Cache is layered on top of the on-disk mutation source (it wraps it) and reading has to go through cache. After the patch, modification also goes through cache. This way more of cache's requirements can be confined to its implementation. The failure semantics of update() and other synchronizers needed to change due to strong exception guarantees. Now if it fails, it means the update was not performed, neither to the cache nor to the underlying mutation source. The database::_cache_update_sem goes away, serialization is done internally by the cache. The external_updater needs to have strong exception guarantees. This requirement is not new. 
It is however currently violated in some places. This patch marks those callbacks as noexcept and leaves a FIXME. Those should be fixed, but that's not in the scope of this patch. Aborting is still better than corrupting the state. Fixes #2754. Also fixes the following test failure: tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed which started to trigger after commit `318423d50b`. Thread stack allocation may fail, in which case we did not do the necessary invalidation.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	56e3ce05db	row_cache: Don't require presence checker to be supplied externally The API is simpler and safer this way.	2017-09-04 10:04:29 +02:00
Tomasz Grabiec	8f2ca52740	tests: Run test_query_only_static_row test case on all mutation sources The test checks behavior common to all mutation readers, so it's better to run it against all mutation sources rather than only for cache reader. Message-Id: <1503072333-17995-1-git-send-email-tgrabiec@scylladb.com>	2017-08-20 12:23:28 +03:00
Paweł Dziepak	79a1ad7a37	tests/row_cache: test queries with no clustering ranges Reproducer for #2604. Message-Id: <20170725131220.17467-3-pdziepak@scylladb.com>	2017-07-25 15:29:17 +02:00
Piotr Jastrzebski	a4b6cfe8f0	row_cache: use continuity info in single partition queries If a query requests for a single partition that is inside a range that has already been queried, use the continuity info and don't go to disk when it's not needed. Fixes #2244. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <15bb3b5b03225e7402e3862da53b5e06d3f4fa74.1499345295.git.piotr@scylladb.com>	2017-07-07 10:29:19 +02:00
Piotr Jastrzebski	70f4b23876	row_cache_test: Add test to reproduce issue 2544 This test checks that cache should use continuity information for single partition queries inside a range that has already been queried. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <2ebd03ff5366e554d520f86da8054e0b9eff4178.1499345295.git.piotr@scylladb.com>	2017-07-07 10:29:19 +02:00
Tomasz Grabiec	e68925595c	tests: row_cache: Remove unused method	2017-06-27 14:10:37 +02:00
Avi Kivity	9b21a9bfb6	Merge "Implement partial cache" from Tomasz and Piotr "This series enables cache to keep partial partitions. Reads no longer have to read whole partition from sstables in order to cache the result. The 10MB threshold for partition size in cache is lifted. Known issues: - There is no partial eviction yet, whole partitions are still evicted, and partition snapshots held by active reads are not evictable at all - Information about range continuity is not recorded if that would require inserting a dummy entry, or if previous entry doesn't belong to the latest snapshot - Cache update after memtable flush happening concurrently with reads may inhibit that reads' ability to populate cache (new issue) - Cache update from flushed memtables has partition granularity, so may cause latency problems with large partition - Schema is still tracked per-partition, so after schema changes reads may induce high latency due to whole partition needing to be converted atomically - Range tombstones are repeated in the stream for every range between cache entries they cover (new issue) - Populating scans for both small and large partitions (perf_fast_forward) experienced a 40% reduction of throughput, CPU bound How was this tested: - test.py --mode release - row_cache_stress_test -c1 -m1G - perf_fast_forward, passes except for the test case checking range continuity population which would require inserting a dummy entry (mentioned above) - perf_simple_query (-c1 -m1G --duration 32): before: 90k [ops/s] stdev: 4k [ops/s] after: 94k [ops/s] stdev: 2k [ops/s]" * tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits) tests: row_cache: Add test_tombstone_merging_in_partial_partition test case tests: Introduce row_cache_stress_test utils: Add helpers for dealing with nonwrapping_range<int> tests: simple_schema: Allow passing the tombstone to make_range_tombstone() tests: simple_schema: Accept value by reference tests: 
simple_schema: Make add_row() accept optional timestamp tests: simple_schema: Make new_timestamp() public tests: simple_schema: Introduce make_ckeys() tests: simple_schema: Introduce get_value(const clustered_row&) helper tests: simple_schema: Fix comment tests: simple_schema: Add missing include row_cache: Introduce evict() tests: Add cache_streamed_mutation_test tests: mutation_assertions: Allow expecting fragments mutation_fragment: Implement equality check tests: row_cache: Add test for population of random partitions tests: row_cache: Add test for partition tombstone population tests: row_cache: Test reading randomly populated partition tests: row_cache: Add test_single_partition_update() tests: row_cache: Add test_scan_with_partial_partitions ...	2017-06-26 14:54:37 +03:00
Avi Kivity	236a8370e4	Remove use of std::random_shuffle() It was removed in C++17. Replace with std::shuffle(). Message-Id: <20170626063809.7563-1-avi@scylladb.com>	2017-06-26 09:36:38 +02:00
Tomasz Grabiec	b0bcf2be53	tests: row_cache: Add test_tombstone_merging_in_partial_partition test case	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	116bcb8b30	tests: row_cache: Add test for population of random partitions	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	930a1415fe	tests: row_cache: Add test for partition tombstone population	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	9bfece6f82	tests: row_cache: Test reading randomly populated partition	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	0358334579	tests: row_cache: Add test_single_partition_update() [tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8bb76e2f12	tests: row_cache: Add test_scan_with_partial_partitions	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	7ae40d7045	tests: Add test for update_invalidating()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	bbfa52822e	row_cache: Switch readers to use per-entry snapshots Currently readers are always using the latest snapshot. This is fine for respecting write atomicity if partitions are fully continuous in cache (now), but will break write atomicity once partial population is allowed. Consider the following case: flush write(ck=1), write(ck=2) -> snapshot_1 cache reader 1 reads and inserts ck=1 @snapshot_1 flush write(ck=1), write(ck=2) -> snapshot_2 cache reader 2 reads and inserts ck=2 @snapshot_2 Because cache update is not atomic, it can happen that reader 2 will complete while the partition hasn't been updated yet for snapshot_2. In such case, after read 2 the partition would contain ck=1 from snapshot_1 and ck=2 from snapshot_2. It will match neither of the snapshots, and this could violate write atomicity. To solve this problem we conceptually assign each partition key in the ring to its current snapshot which it reflects. The update process gradually converts entries in ring order to the new snapshot. Reads will not be using the latest snapshot, but rather the current snapshot for the position in the ring they are at. There is a race between the update process and populating reads. Since after the update all entries must reflect the new snapshot, reads using the old snapshot cannot be allowed to insert data which can no longer be reached by the update process. Before this patch this race was prevented by the use of a phased_barrier, where readers would keep phased_barrier::operation alive between starting a read of a partition and inserting it into cache. Cache update was waiting for all prior operations before starting the update. Any later read which was not waited for would use the latest snapshot for reads, so the update process didn't have to fix anything up for such reads. After this change, later reads cannot always use the latest snapshot, they have to use the snapshot corresponding to given entry. 
So it's not enough for update() to wait for prior reads in order to prevent stale populations. The (simple) solution implemented in this patch is to detect the conflict and abandon population of given sub-range. In general, reads are allowed to populate given range only if it belongs to a single snapshot. Note that the range here is not the whole query range. For population of continuity, it is the range starting after the previous key and ending after the key being inserted. When populating a partition entry, the range is a singular range containing only the partition key. Readers switch to new snapshots automatically as they move across the ring. It's possible that the insertion of the partition doesn't conflict, but continuity does. In such case the entry will be inserted but continuity will not be set.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8ba6366610	row_cache: Switch to using snapshot_source Currently every time cache needs to create reader for missing data it obtains a reader which is most up to date. That reader includes writes from later populate phases, for which update() was not yet called. This will be problematic once we allow partitions to be partially populated, because different parts of the partition could be partially populated using readers using different sets of writes, and break write atomicity. The solution will be to always populate given partition using the same set of writes, using reader created from the current snapshot. The snapshot changes only on update(), with update() gradually converting each partition to the new snapshot.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e23c7e2f34	row_cache: Rework invalidate() implementation 1) Reduce duplication by delegating to more general overloads 2) Improve documentation to not mention effects in terms of population (detail) but rather write visibility 3) Rename clear() to invalidate() and merge with the range variant, it has the same semantics	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	9380dd1ee3	mutation_source: make sure we never ignore fast forwarding Mutation sources sometimes ignore the fast forwarding parameter, so this change adds an assertion to check that this parameter can be safely ignored. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	ac03331490	row_cache_test: improve test_sliced_read_row_presence Remove unused parameter and add checks to make sure all expected rows have been received. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	dce293e11c	tests: row_cache: Apply only fully continuous mutations to underlying mutation source Cache currently assumes that mutations coming from outside are fully continuous.	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	e86f74edd8	tests: row_cache: Add missing apply() to test_mvcc test case [tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	95dcfa859b	tests: row_cache: Improve test_mvcc() assert_that().is_equal_to() gives better error message. Also, there is code which can be replaced with assert_that_stream().has_monotonic_positions()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	77f944880c	cache: Remove support for wide partitions This will be handled by row cache now. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exception are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
Avi Kivity	6e2c9ef9fb	Revert "Allow reading exactly desired byte ranges and fast_forward_to" This reverts commit `317d7fc253` (and also the related `2c57ab84b2`). It causes crashes during range scans, reported by Gleb: "To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s dataset and 3 node cluster. Backtrace: at /home/gleb/work/seastar/seastar/core/apply.hh:36 rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57 range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142 at ./seastar/core/future.hh:890 at /home/gleb/work/seastar/seastar/core/future-util.hh:119 at /home/gleb/work/seastar/seastar/core/future-util.hh:142	2017-06-18 16:10:21 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exception are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Tomasz Grabiec	e054ccc037	tests: row_cache_test: Induce update failure more reliably After changing region evictability condition to be less strict, cache update stopped failing because reclamation was able to compact dense region. Induce failure by installing evictor which refuses to evict from cache beyond a few elements.	2017-04-20 14:51:47 +02:00
Tomasz Grabiec	892d4a2165	db: Enable creating forwardable readers via mutation_source Right now all mutation source implementations will use make_forwardable() wrapper.	2017-02-23 18:50:44 +01:00
Tomasz Grabiec	2b8bd10dca	tests: Pass all mutation source parameters	2017-02-13 20:52:49 +01:00
Piotr Jastrzebski	36b2c4df19	row_cache_test: extend test_mvcc Make the test execute with and without an active reader to memtable that's flushed to cache. This improves the code coverage of MVCC with tests. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <007b6cd1ba7a84ea5675ea82e454bf1adf3b3330.1485954941.git.piotr@scylladb.com>	2017-02-02 13:51:32 +01:00
Piotr Jastrzebski	c7e95af0b0	row_cache_test: fix test_mvcc Currently the test does not wait for cache update to finish before carrying on with the checks. This makes the test nondeterministic and purely wrong because checks expect update to be finished. This patch changes the test to wait for update to finish. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <2a99bba24b1628466d3495332b48ef3ccdb43c26.1485862389.git.piotr@scylladb.com>	2017-01-31 11:37:29 +00:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Paweł Dziepak	b8d737ff0a	tests/row_cache_test: verify that eviction follows lru Refs #1847. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>	2016-11-15 18:57:54 +01:00
Piotr Jastrzebski	50b41f7d1d	Fix row_cache_test partition_range passed to row_cache::make_reader has to be kept alive as long as the resulting reader is used. Otherwise weird things start to happen. This used to work just because of pure luck. When I started changing the row_cache implementation I ran into very weird behaviors for this test. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <2c9e337dbbcf35f4e1394cad043eda10b8c2bd4a.1478602876.git.piotr@scylladb.com>	2016-11-08 13:28:53 +01:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
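
Commit `236a8370e4` above replaces std::random_shuffle(), which was removed in C++17, with std::shuffle(). The following is a minimal sketch of that kind of replacement, not the actual patch: the helper name `shuffled_keys`, the `int` key type, and the explicit seed parameter are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not from the scylladb tree): returns a shuffled copy
// of `keys`. std::shuffle needs an explicit uniform random bit generator,
// whereas the removed std::random_shuffle used an implicit one.
std::vector<int> shuffled_keys(std::vector<int> keys, std::uint32_t seed) {
    const std::vector<int> original = keys;

    // Seeding explicitly keeps a test run reproducible within one build.
    std::mt19937 engine(seed);

    // Pre-C++17 equivalent: std::random_shuffle(keys.begin(), keys.end());
    std::shuffle(keys.begin(), keys.end(), engine);

    // Shuffling only reorders elements; it never adds or drops any.
    assert(std::is_permutation(keys.begin(), keys.end(),
                               original.begin(), original.end()));
    return keys;
}
```

Passing the engine explicitly is the main behavioural difference: the caller decides where the randomness comes from, so a test can pin the seed when it needs a repeatable order.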

90 Commits