scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-02 21:17:01 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	de7cb7bfa4	tests: commitlog: Check there are no segments left on disk after clean shutdown Reproduces #2550. Message-Id: <1499358825-17855-2-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `72e01b7fe8`)	2017-07-09 19:25:44 +03:00
Avi Kivity	c9ed522fa8	Merge "Adjust row cache metrics for row granularity" from Tomasz * tag 'tgrabiec/row-cache-metrics-v2' of github.com:cloudius-systems/seastar-dev: row_cache: Switch _stats.hits/misses to row granularity row_cache: Rename num_entries() to partitions() for clarity row_cache: Track mispopulations also at row level row_cache: Track row insertions row_cache: Track row hits and misses row_cache: Make mispopulation counter also apply for continuity information row_cache: Add partition_ prefix to current counters misc_services: Switch to using reads_with[_no]_misses counters row_cache: Add metrics for operations on underlying reader row_cache: Add reader-related metrics row_cache: Remove dead code (cherry picked from commit `b1a0e37fcb`)	2017-07-04 15:21:00 +03:00
Avi Kivity	7893a3aad2	Merge "Use selective_token_range_sharder in repair" from Asias "This series introduces selective_token_range_sharder and uses it in repair to generate dht::token_range objects that belong to a specific shard." * tag 'asias/repair-selective_token_range_sharder-v3' of github.com:cloudius-systems/seastar-dev: repair: Use selective_token_range_sharder tests: Add test_selective_token_range_sharder dht: Add selective_token_range_sharder (cherry picked from commit `66e56511d6`)	2017-07-04 14:18:08 +03:00
Nadav Har'El	e467eef58d	Fix test to use non-wrapping range The test put a wrapping range into a non-wrapping range variable. This was harmless at the time this test was written, but newer code may not be as forgiving so better use a non-wrapping range as intended. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170704103128.29689-1-nyh@scylladb.com> (cherry picked from commit `d95f908586`)	2017-07-04 14:18:01 +03:00
Avi Kivity	3de701dbe1	Merge "Fix compilation issues in older environments" from Tomasz * 'tgrabiec/fix-compilation-issues' of github.com:cloudius-systems/seastar-dev: tests: streamed_mutation_test: Avoid using boost::size() on row ranges tests: row_cache: Remove unused method (cherry picked from commit `ff7be8241f`)	2017-06-27 16:31:42 +03:00
Avi Kivity	9b21a9bfb6	Merge "Implement partial cache" from Tomasz and Piotr "This series enables cache to keep partial partitions. Reads no longer have to read whole partition from sstables in order to cache the result. The 10MB threshold for partition size in cache is lifted. Known issues: - There is no partial eviction yet, whole partitions are still evicted, and partition snapshots held by active reads are not evictable at all - Information about range continuity is not recorded if that would require inserting a dummy entry, or if previous entry doesn't belong to the latest snapshot - Cache update after memtable flush happening concurrently with reads may inhibit that reads' ability to populate cache (new issue) - Cache update from flushed memtables has partition granularity, so may cause latency problems with large partition - Schema is still tracked per-partition, so after schema changes reads may induce high latency due to whole partition needing to be converted atomically - Range tombstones are repeated in the stream for every range between cache entries they cover (new issue) - Populating scans for both small and large partitions (perf_fast_forward) experienced a 40% reduction of throughput, CPU bound How was this tested: - test.py --mode release - row_cache_stress_test -c1 -m1G - perf_fast_forward, passes except for the test case checking range continuity population which would require inserting a dummy entry (mentioned above) - perf_simple_query (-c1 -m1G --duration 32): before: 90k [ops/s] stdev: 4k [ops/s] after: 94k [ops/s] stdev: 2k [ops/s]" * tag 'tgrabiec/introduce-partial-cache-v8' of github.com:cloudius-systems/seastar-dev: (130 commits) tests: row_cache: Add test_tombstone_merging_in_partial_partition test case tests: Introduce row_cache_stress_test utils: Add helpers for dealing with nonwrapping_range<int> tests: simple_schema: Allow passing the tombstone to make_range_tombstone() tests: simple_schema: Accept value by reference tests: simple_schema: Make add_row() accept optional timestamp tests: simple_schema: Make new_timestamp() public tests: simple_schema: Introduce make_ckeys() tests: simple_schema: Introduce get_value(const clustered_row&) helper tests: simple_schema: Fix comment tests: simple_schema: Add missing include row_cache: Introduce evict() tests: Add cache_streamed_mutation_test tests: mutation_assertions: Allow expecting fragments mutation_fragment: Implement equality check tests: row_cache: Add test for population of random partitions tests: row_cache: Add test for partition tombstone population tests: row_cache: Test reading randomly populated partition tests: row_cache: Add test_single_partition_update() tests: row_cache: Add test_scan_with_partial_partitions ...	2017-06-26 14:54:37 +03:00
Avi Kivity	555621b537	Disentangle memtables from sstables Remove sstable::write_components(memtable), replacing it with a helper. Fixes #2354 Message-Id: <20170624142639.16662-1-avi@scylladb.com>	2017-06-26 09:37:11 +02:00
Avi Kivity	236a8370e4	Remove use of std::random_shuffle() It was removed in C++17. Replace with std::shuffle(). Message-Id: <20170626063809.7563-1-avi@scylladb.com>	2017-06-26 09:36:38 +02:00
Tomasz Grabiec	b0bcf2be53	tests: row_cache: Add test_tombstone_merging_in_partial_partition test case	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	23c6f517cb	tests: Introduce row_cache_stress_test Runs readers, updates and eviction concurrently and verifies the following property of reads: - reads see all past writes - reads see no partial writes within a single partition	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	5c9f87fb27	tests: simple_schema: Allow passing the tombstone to make_range_tombstone()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	edf4a3494c	tests: simple_schema: Accept value by reference	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	5f70df472f	tests: simple_schema: Make add_row() accept optional timestamp	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	53867c4328	tests: simple_schema: Make new_timestamp() public	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	51b5814ec2	tests: simple_schema: Introduce make_ckeys()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	074c67fe4d	tests: simple_schema: Introduce get_value(const clustered_row&) helper	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8ffc776e06	tests: simple_schema: Fix comment	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	ecacd2e84a	tests: simple_schema: Add missing include	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	c4e8effffa	tests: Add cache_streamed_mutation_test [tgrabiec: - extracted from a larger commit - removed coupling with how cache_streamed_mutation is created (the code went out of sync), used more stable make_reader(). it's simpler too. - replaced false/true literals with is_continuous/is_dummy where appropriate - dropped tests for cache::underlying (class is gone) - reused streamed_mutation_assertions, it has better error messages - fixed the tests to not create tombstones with missing timestamps - relaxed range tombstone assertions to only check information relevant for the query range - print cache on failure for improved debuggability ]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	44fdee3f2e	tests: mutation_assertions: Allow expecting fragments	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	116bcb8b30	tests: row_cache: Add test for population of random partitions	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	930a1415fe	tests: row_cache: Add test for partition tombstone population	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	9bfece6f82	tests: row_cache: Test reading randomly populated partition	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	0358334579	tests: row_cache: Add test_single_partition_update() [tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8bb76e2f12	tests: row_cache: Add test_scan_with_partial_partitions	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	5a0ae55f6d	Introduce schema_upgrader	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	7ae40d7045	tests: Add test for update_invalidating()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	fb62dfab02	tests: mvcc: Introduce test_schema_upgrade_preserves_continuity	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	164989a574	tests: mvcc: Add test for partition_entry::apply_to_incomplete()	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	bbfa52822e	row_cache: Switch readers to use per-entry snapshots Currently readers are always using the latest snapshot. This is fine for respecting write atomicity if partitions are fully continuous in cache (now), but will break write atomicity once partial population is allowed. Consider the following case: flush write(ck=1), write(ck=2) -> snapshot_1 cache reader 1 reads and inserts ck=1 @snapshot_1 flush write(ck=1), write(ck=2) -> snapshot_2 cache reader 2 reads and inserts ck=2 @snapshot_2 Because cache update is not atomic, it can happen that reader 2 will complete while the partition hasn't been updated yet for snapshot_2. In such case, after read 2 the partition would contain ck=1 from snapshot_1 and ck=2 from snapshot_2. It will match neither of the snapshots, and this could violate write atomicity. To solve this problem we conceptually assign each partition key in the ring to its current snapshot which it reflects. The update process gradually converts entries in ring order to the new snapshot. Reads will not be using the latest snapshot, but rather the current snapshot for the position in the ring they are at. There is a race between the update process and populating reads. Since after the update all entries must reflect the new snapshot, reads using the old snapshot cannot be allowed to insert data which can no longer be reached by the update process. Before this patch this race was prevented by the use of a phased_barrier, where readers would keep phased_barrier::operation alive between starting a read of a partition and inserting it into cache. Cache update was waiting for all prior operations before starting the update. Any later read which was not waited for would use the latest snapshot for reads, so the update process didn't have to fix anything up for such reads. After this change, later reads cannot always use the latest snapshot, they have to use the snapshot corresponding to given entry. So it's not enough for update() to wait for prior reads in order to prevent stale populations. The (simple) solution implemented in this patch is to detect the conflict and abandon population of given sub-range. In general, reads are allowed to populate given range only if it belongs to a single snapshot. Note that the range here is not the whole query range. For population of continuity, it is the range starting after the previous key and ending after the key being inserted. When populating a partition entry, the range is a singular range containing only the partition key. Readers switch to new snapshots automatically as they move across the ring. It's possible that the insertion of the partition doesn't conflict, but continuity does. In such case the entry will be inserted but continuity will not be set.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8ba6366610	row_cache: Switch to using snapshot_source Currently every time cache needs to create reader for missing data it obtains a reader which is most up to date. That reader includes writes from later populate phases, for which update() was not yet called. This will be problematic once we allow partitions to be partially populated, because different parts of the partition could be partially populated using readers using different sets of writes, and break write atomicity. The solution will be to always populate given partition using the same set of writes, using reader created from the current snapshot. The snapshot changes only on update(), with update() gradually converting each partition to the new snapshot.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	e23c7e2f34	row_cache: Rework invalidate() implementation 1) Reduce duplication by delegating to more general overloads 2) Improve documentation to not mention effects in terms of population (detail) but rather write visibility 3) Rename clear() to invalidate() and merge with the range variant, it has the same semantics	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	bd023b6161	tests: Introduce memtable_snapshot_source Snapshottable in-memory mutation source for use in row_cache tests.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	7f8620d4a7	tests: mutation_source: Relax expectations about range tombstones In preparation for having partial cache which trims range tombstones to the lower bound of the query.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	3a9212e0f2	tests: mutation_assertions: Add ability to limit verification to given clustering_row_ranges Currently mutation sources are free to return range tombstones covering range which is larger than the query range. The cache mutation source will soon become more eager about trimming such tombstones. To cover up for such differences, allow telling the restrictions to only care about differences relevant for given clustering ranges.	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	f925b26241	tests: mutation_reader_assertions: Simplify	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	9380dd1ee3	mutation_source: make sure we never ignore fast forwarding Mutation sources sometimes ignore the fast forwarding parameter, so this change adds an assertion to check that this parameter can be safely ignored. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	ac03331490	row_cache_test: improve test_sliced_read_row_presence Remove unused parameter and add checks to make sure all expected rows have been received. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	db053ef902	tests: Add test for continuity merging rules	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	2edf08d36a	tests: random_mutation_generator: Generate random continuity	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	8873a443db	tests: mutation: Generate mutations with continuity	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	dce293e11c	tests: row_cache: Apply only fully continuous mutations to underlying mutation source Cache currently assumes that mutations coming from outside are fully continuous.	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	e86f74edd8	tests: row_cache: Add missing apply() to test_mvcc test case [tgrabiec: Extracted from "row_cache: Introduce cache_streamed_mutation"]	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	95dcfa859b	tests: row_cache: Improve test_mvcc() assert_that().is_equal_to() gives a better error message. Also, there is code which can be replaced with assert_that_stream().has_monotonic_positions()	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	05b56fcfb0	mutation_partition: Add support for specifying continuity This will allow expressing lack of information about certain ranges of rows (including the static row), which will be used in cache to determine if information in cache is complete or not. Continuity is represented internally using flags on row entries. The key range between two consecutive entries is continuous iff rows_entry::continuous() is true for the later entry. The range starting after the last entry is assumed to be continuous. The range corresponding to the key of the entry is continuous iff rows_entry::dummy() is false. [tgrabiec: - based on the following commits: 4a5bf75 - Piotr Jastrzebski : mutation_partition: introduce dummy rows_entry 773070e - Piotr Jastrzebski : mutation_partition: add continuity flag to rows_entry - documented that partition tombstone is always complete - require specifying the partition tombstone when creating an incomplete entry - replaced rows_entry(dummy_tag, ...) constructor with more general rows_entry(position_in_partition, ...) - documented continuity semantics on mutation_partition - fixed _static_row_cached being lost by mutation_partition copy constructors - fixed conversion to streamed_mutation to ignore dummy entries - fixed mutation_partition serializer to drop dummy entries - documented semantics of continuity on mutation_partition level - dropped assumptions that dummy entries can be only at the last position - changed equality to ignore continuity completely, rather than partially (it was not ignoring dummy entries, but ignoring continuity flag) - added printout of continuity information in mutation_partition - fixed handling of empty entries in apply_reversibly() with regards to continuity; we no longer can remove empty entries before merging, since that may affect continuity of the right-hand mutation. Added _erased flag. - fixed mutation_partition::clustered_row() with dummy==true to not ignore the key - fixed partition_builder to not ignore continuity - renamed dummy_tag_t to dummy_tag. _t suffix is reserved. - standardized all APIs on is_dummy and is_continuous bool_class:es - replaced add_dummy_entry() with ensure_last_dummy() with safer semantics - dropped unused remove_dummy_entry() - simplified and inlined cache_entry::add_dummy_entry() - fixed mutation_partition(incomplete_tag) constructor to mark all row ranges as discontinuous ]	2017-06-24 18:06:11 +02:00
Piotr Jastrzebski	77f944880c	cache: Remove support for wide partitions This will be handled by row cache now. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-06-24 18:06:11 +02:00
Tomasz Grabiec	fbe8c24ebe	tests: row_cache_alloc_stress: Make eviction detection more reliable It can happen that touch() will trigger eviction on entry to allocating section, and drop in occupancy around insertion will not happen. As a result, we may evict a lot without detecting that. Extend the check to include touch() and use more reliable eviction counters.	2017-06-24 18:06:11 +02:00
Avi Kivity	672de608bf	tests: fix call to seastar::sleep() It's not in the global namespace.	2017-06-22 18:16:13 +03:00
Raphael S. Carvalho	4bb27cbd6f	lcs: actually prefer oldest sstables of L0 when it falls behind Strategy prefers promoting oldest sstables in L0. Because sort procedure is incorrectly sorting elements in descending order, newest sstables will be promoted first if and only if L0 falls behind (more than 32 sstables). If L0 doesn't fall behind, we'll have all L0 sstables compacted with overlapping ones in L1. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-06-19 20:45:39 -03:00
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and do not support fast_forward_to() a different partition range. We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
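
Note on commit `05b56fcfb0` (mutation_partition: Add support for specifying continuity): the three continuity rules stated in that message can be modelled with a small C++ sketch. This is only a toy model, not code from the ScyllaDB tree. `toy_entry` and `is_continuous_at()` are hypothetical names invented here, integer keys stand in for clustering positions, and the sketch assumes that the first entry's `continuous` flag covers the range from the start of the partition.

```cpp
#include <algorithm>
#include <vector>

// Toy stand-in for rows_entry (illustrative assumption, not the real class):
// an integer key plus the two flags described in the commit message.
struct toy_entry {
    int key;
    bool continuous; // true if the gap between the previous entry and this one is complete
    bool dummy;      // true if the entry marks a position only and carries no row
};

// Returns true if the toy partition holds complete information about `key`.
// `entries` must be sorted by key in ascending order.
bool is_continuous_at(const std::vector<toy_entry>& entries, int key) {
    // Find the first entry whose key is not less than `key`.
    auto it = std::lower_bound(entries.begin(), entries.end(), key,
        [](const toy_entry& e, int k) { return e.key < k; });
    if (it == entries.end()) {
        // The range starting after the last entry is assumed to be continuous.
        return true;
    }
    if (it->key == key) {
        // The range corresponding to the entry's own key is continuous
        // iff the entry is not a dummy.
        return !it->dummy;
    }
    // `key` falls in the gap before `*it`; the flag lives on the later entry.
    return it->continuous;
}
```

For entries `{10, continuous=false}`, `{20, continuous=true}` and a dummy at `30` with `continuous=false`, the model reports keys 10, 15, 20 and 35 as continuous and keys 5, 25 and 30 as not continuous.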

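Note on commit `236a8370e4` (Remove use of std::random_shuffle()): the replacement that message describes can be sketched as below. The helper name `shuffle_in_place()` is hypothetical and the choice of `std::mt19937` as the engine is an assumption made for illustration; the commit message does not say which engine the actual change uses.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical helper (not from the ScyllaDB tree): shuffles `items` in place.
// std::random_shuffle() was removed in C++17; std::shuffle() takes an explicit
// uniform random bit generator instead of relying on a hidden global one.
template <typename T>
void shuffle_in_place(std::vector<T>& items, std::mt19937& engine) {
    std::shuffle(items.begin(), items.end(), engine);
}
```

A caller would typically construct one engine, for example `std::mt19937 engine(std::random_device{}());`, and reuse it across calls rather than reseeding for every shuffle.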
1494 Commits