scylladb

Author	SHA1	Message	Date
Vladimir Krivopalov	948c4d79d3	Collect encoding statistics for memtable updates. We keep track of all updates and store the minimal values of timestamps, TTLs and local deletion times across all the inserted data. These values are written as a part of serialization_header for Statistics.db and used for delta-encoding values when writing Data.db file in SSTables 3.0 (mc) format. For #1969. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-04-25 15:39:14 -07:00
Vladimir Krivopalov	e1ee833861	Always pass mutation_partitions to partition_entry::apply() Previously it was also possible to pass a frozen_mutation to it. Now we de-serialize frozen mutations at the calling side. This is a pre-requisite for collecting memtable statistics needed for writing into the SSTables 3.0 format. For #1969. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-04-25 14:58:47 -07:00
Botond Dénes	f488ae3917	Add buffer_size() to flat_mutation_reader buffer_size() exposes the collective size of the external memory consumed by the mutattion-fragments in the flat reader's buffer. This provides a basis to build basic memory accounting on. Altought this is not the entire memory consumption of any given reader it is the most volatile component and usually by far the largest one too.	2018-03-13 10:34:34 +02:00
Tomasz Grabiec	5320705300	cache: Propagate cache_tracker to places manipulating evictable entries cache_tracker reference will be needed to link/unlink row entries. No change of behavior in this patch.	2018-03-06 11:50:27 +01:00
Piotr Jastrzebski	29eb9f30bc	Fix memtable::clear_gently to work in debug mode. It was getting into an infinite loop because need_preempt was always returning true. Tests: units (release,debug) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <a324e7f576b247124080830455c920bdad1f617b.1520025213.git.piotr@scylladb.com>	2018-03-04 14:11:54 +02:00
Avi Kivity	404172652e	Merge "Use xxHash for digest instead of MD5" from Duarte "This series changes digest calculation to use a faster algorithm (xxHash) and to also cache calculated cell hashes that can be kept in memory to speed up subsequent digest requests. The MD5 hash function has proved to be slow for large cell values: size = 256; elapsed = 4us size = 512; elapsed = 8us size = 1024; elapsed = 14us size = 2048; elapsed = 21us size = 4096; elapsed = 33us size = 8192; elapsed = 51us size = 16384; elapsed = 86us size = 32768; elapsed = 150us size = 65536; elapsed = 278us size = 131072; elapsed = 531us size = 262144; elapsed = 1032us size = 524288; elapsed = 2026us size = 1048576; elapsed = 4004us size = 2097152; elapsed = 7943us size = 4194304; elapsed = 15800us size = 8388608; elapsed = 31731us size = 16777216; elapsed = 64681us size = 33554432; elapsed = 130752us size = 67108864; elapsed = 263154us The xxHash is a non-cryptographic, 64bit (there's work in progress on the 128 version) hash that can be used to replace MD5. It performs much better: size = 256; elapsed = 2us size = 512; elapsed = 1us size = 1024; elapsed = 1us size = 2048; elapsed = 2us size = 4096; elapsed = 2us size = 8192; elapsed = 3us size = 16384; elapsed = 5us size = 32768; elapsed = 8us size = 65536; elapsed = 14us size = 131072; elapsed = 28us size = 262144; elapsed = 59us size = 524288; elapsed = 116us size = 1048576; elapsed = 226us size = 2097152; elapsed = 456us size = 4194304; elapsed = 935us size = 8388608; elapsed = 1848us size = 16777216; elapsed = 4723us size = 33554432; elapsed = 10507us size = 67108864; elapsed = 21622us Performance was tested using a 3 node cluster with 1 cpu and 8GB, and with the following cassandra-stress loaders. Measurements are for the read workload. sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100 sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100 xxhash + caching: Results: op rate : 32699 [READ:32699] partition rate : 32699 [READ:32699] row rate : 32699 [READ:32699] latency mean : 3.0 [READ:3.0] latency median : 3.0 [READ:3.0] latency 95th percentile : 3.9 [READ:3.9] latency 99th percentile : 4.5 [READ:4.5] latency 99.9th percentile : 6.6 [READ:6.6] latency max : 24.0 [READ:24.0] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:05:05 END md5: Results: op rate : 25241 [READ:25241] partition rate : 25241 [READ:25241] row rate : 25241 [READ:25241] latency mean : 3.9 [READ:3.9] latency median : 3.9 [READ:3.9] latency 95th percentile : 5.1 [READ:5.1] latency 99th percentile : 5.8 [READ:5.8] latency 99.9th percentile : 8.0 [READ:8.0] latency max : 24.8 [READ:24.8] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:06:36 END This translates into a 21% improvoment for this workload. Bigger cell values were also tested: sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100 sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100 xxhash + caching: Results: op rate : 19964 [READ:19964] partition rate : 19964 [READ:19964] row rate : 19964 [READ:19964] latency mean : 4.9 [READ:4.9] latency median : 4.6 [READ:4.6] latency 95th percentile : 7.2 [READ:7.2] latency 99th percentile : 11.5 [READ:11.5] latency 99.9th percentile : 13.6 [READ:13.6] latency max : 29.2 [READ:29.2] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:08:20 END md5: Results: op rate : 12773 [READ:12773] partition rate : 12773 [READ:12773] row rate : 12773 [READ:12773] latency mean : 7.7 [READ:7.7] latency median : 7.3 [READ:7.3] latency 95th percentile : 10.2 [READ:10.2] latency 99th percentile : 16.8 [READ:16.8] latency 99.9th percentile : 19.2 [READ:19.2] latency max : 71.5 [READ:71.5] Total partitions : 10000000 [READ:10000000] Total errors : 0 [READ:0] total gc count : 0 total gc mb : 0 total gc time (s) : 0 avg gc time(ms) : NaN stdev gc time(ms) : 0 Total operation time : 00:13:02 END This translates into a 37% improvoment for this workload. Fixes #2884 Tests: unit-tests (release), dtests (smp=2) Note: dtests are kinda broken in master (> 30 failures), so take the tests tag with a grain of himalayan salt." * 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits) tests/row_cache_test: Test hash caching tests/memtable_test: Test hash caching tests/mutation_test: Use xxHash instead of MD5 for some tests tests/mutation_test: Test xx_hasher alongside md5_hasher schema: Remove unneeded include service/storage_proxy: Enable hash caching service/storage_service: Add and use xxhash feature message/messaging_service: Specify algorithm when requesting digest storage_proxy: Extract decision about digest algorithm to use cache_flat_mutation_reader: Pre-calculate cell hash partition_snapshot_reader: Pre-calculate cell hash query::partition_slice: Add option to specify when digest is requested row: Use cached hash for hash calculation mutation_partition: Replace hash_row_slice with appending_hash mutation_partition: Allow caching cell hashes mutation_partition: Force vector_storage internal storage size test.py: Increase memory for row_cache_stress_test atomic_cell_hash: Add specialization for atomic_cell_or_collection query-result: Use digester instead of md5_hasher range_tombstone: Replace feed_hash() member function with appending_hash ...	2018-02-08 18:24:58 +02:00
Tomasz Grabiec	cce1a2bce8	Merge "Use the CPU scheduler" from Glauber & Avi In this patchset I am resubmitting Avi's enablement of the CPU scheduler in his behalf. I've done a ton of testing in the series and there are some improvements / changes that I had previously sent as a separate series. What you see here is the result of merging that work. After this patchset is applied, workloads are smoother and we are able to uphold the pre-defined shares among the various actors. We also finally have everything we need to merge the CPU and I/O controllers. After that is done the code is now much simpler. But also, as a bonus, controllers that were previously available for I/O only (compactions) are enabled for CPU as well. * git@github.com:glommer/scylla.git cpusched-v7: Avi Kivity (4): database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler memtable, database: make memtable::clear_gently() inherit scheduling_group config: mark background_writer_scheduling_quota as Unused database: place data_query execution stage into scheduling_group Glauber Costa (9): database, main: set up scheduling_groups for our main tasks row_cache: actually use the scheduling group for update_cache allow update_cache and clear_gently to use the entire task quota. database: remove cpu_flush_quota metric controllers: retire auto_adjust_flush_quota controllers: allow memtable I/O controller to have shares statically set controllers: update control points for memtable I/O controller controllers: allow a static priority to override the controller output controllers: unify the I/O and CPU controllers	2018-02-08 15:58:40 +01:00
Glauber Costa	c4974392b7	allow update_cache and clear_gently to use the entire task quota. We have had a quota of partitions to process in clear_gently / update_cache, so that we don't overwork. However, with those things now being in their own task group there is no harm in allowing it to run until we reach a natural preemption point. While we are at it, clear_gently did not check for need_preempt() before, so this patch fixes it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-02-07 17:19:29 -05:00
Avi Kivity	ac525c9124	memtable, database: make memtable::clear_gently() inherit scheduling_group Instead of using a private thread_scheduling_group, make clear_gently use its caller's scheduling_group to control resource usage.	2018-02-07 17:19:29 -05:00
Tomasz Grabiec	d85d651e0f	memtable: Make printable Useful when debugging test failures.	2018-02-06 14:24:19 +01:00
Duarte Nunes	ec5b7fb553	partition_snapshot_reader: Pre-calculate cell hash When digest is requested, pre-calculate the cell's hash. A downside of this approach is that more work will be done when there are multiple versions of a row that contain values for the same cell, but we expect these cases to be rare and the upside of caching a cell's hash to compensate for the extra work. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 01:02:50 +00:00
Paweł Dziepak	c945bdc7f6	memtable: do not change accounting state in alloc section Allocating sections can be retried so code that has side effects (like updating flushed bytes accouting) has no place there.	2018-01-31 16:04:31 +00:00
Paweł Dziepak	d803370868	memtable: do not update iterator_reader::_last in alloc section iterator_reader::_last is a part of the state that survives allocating section retries, therefore, it should not be modified in the retryable context.	2018-01-31 16:03:16 +00:00
Paweł Dziepak	ea7248056f	memtable: drop memtable_entry::read()	2018-01-30 18:33:26 +01:00
Paweł Dziepak	0420ca48a5	paratition_snapshot_reader: minimise amount of retryable code Retryable code that has side effects is a recipe for bugs. This patch reworkds the snapshot reader so that the amount of logic run with reclamation disabled is minimal and has a very limited side effects.	2018-01-30 18:33:26 +01:00
Piotr Jastrzebski	d266eaa01e	mutation_source: rename make_flat_mutation_reader to make_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-19 09:30:12 +01:00
Glauber Costa	5140aaea00	add a timeout to fast forward to In the last patch, we enabled per-request timeouts, we enable timeouts in fill_buffer. There are many places, though, in which we fast_forward_to before we fill_buffer, so in order to make that effective we need to propagate the timeouts to fast_forward_to as well. In the same way as fill_buffer, we make the argument optional wherever possible in the high level callers, making them mandatory in the implementations. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-12 07:43:19 -05:00
Glauber Costa	d965af42b0	add a timeout to fill_buffer As part of the work to enable per-request timeouts, we enable timeouts in fill_buffer. The argument is made optional at the main classes, but mandatory in all the ::impl versions. This way we'll make sure we didn't forget anything. At this point we're still mostly passing that information around and don't have any entity that will act on those timeouts. In the next patch we will wire that up. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Piotr Jastrzebski	17f2eb8ff7	Remove memtable::make_reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-12-21 11:47:07 +01:00
Piotr Jastrzebski	6bcee5976b	Stop using memtable::make_reader in memtable::apply Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-12-21 11:47:07 +01:00
Paweł Dziepak	9dc566c64b	memtable: switch flush reader implementation to flat streams	2017-11-27 20:07:22 +01:00
Paweł Dziepak	32eb6437fd	memtable: make make_flush_reader() return flat_mutation_reader	2017-11-27 20:07:22 +01:00
Piotr Jastrzebski	6cd4b6b09c	Remove sstable_range_wrapping_reader The wrapper is no longer needed because read_range_rows returns ::mutation_reader instead of sstables::mutation_reader and the reader returned from it keeps the pointer to shared_sstable that was used to create the reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 10:40:02 +01:00
Piotr Jastrzebski	acfc6fef55	Simplify flat_mutation_reader wrappers If a wrapper takes a flat_mutation_reader in a constructor then it does not have to take schema_ptr because it can obtain it from the inner flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <88c3672df08d2ac465711e9138d426e43ae9c62b.1510331382.git.piotr@scylladb.com>	2017-11-13 08:53:34 +01:00
Tomasz Grabiec	484dde692f	Merge "make sure that cache updates don't overflow dirty memory" from Glauber Since we started accounting virtual dirty memory we no longer have a cap on real dirty memory. In most situations that is not needed, since real dirty will just be at most twice as much as virtual dirty (current flushing memtable plus new memtable). However, due to things like cache updates and component flushing we can end up having a lot of memtables that are virtually freed but not yet fully released, leading real dirty memory to explode using all the box' memory. This patch adds a cap on real dirty memory as well. Because of the hierarchical nature of region_group, if the parent blocks due to memory depletion, so will the child (virtual dirty region group). After that is done, we need to make sure that dirty memory is not seen as freed until the cache update is done. Until a particular partition is moved to the cache it is not evictable. As a result we can OOM the system if we have a lot of pending cache updates as the writes will not be throttled and memory won't be made available. This patch pins the memory used by the region as real dirty before the cache update starts, and unpins it when it is over. In the mean time it gradually releases memory of the partitions that are being moved to cache. I have verified in a couple of workloads that the amount of memory accounted through this is the same amount of memory accounted through the memtable flush procedure. Fixes #1942 * git@github.com:glommer/scylla.git glommer/update-cache-v4: row_cache: modernize use of seastar threads mutation_partition: estimate size of partition memtable: factor out calculation of memtable_entry memory size memtable: add a method to export memtable's dirty memory manager dirty_memory_manager: block if we hit the real dirty limit dirty_memory_manager: add functions to manipulate real dirty partition: add method to calculate memory size of a partition row cache: pin real dirty during cache updates.	2017-11-10 13:55:12 +01:00
Piotr Jastrzebski	e7a0732f72	Add schema_ptr to flat_mutation_reader It is usefull to have a schema inside a flat reader the same way we had schema inside a streamed_mutation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <b37e0dbf38810c00bd27fb876b69e1754c16a89f.1510312137.git.piotr@scylladb.com>	2017-11-10 13:54:55 +01:00
Glauber Costa	ec36b9eddc	memtable: factor out calculation of memtable_entry memory size The total size is the sum of two components. Add a method that does that sum so this code gets easier to reuse. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Piotr Jastrzebski	aa16cd7eef	flat_mutation_reader_from_mutation: support multiple mutations Rename flat_mutation_reader_from_mutation to flat_mutation_reader_from_mutations. Make it work with std::vector<mutation> instead of a single mutation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:26:10 +01:00
Piotr Jastrzebski	864d02e795	Turn scanning_reader into flat_mutation_reader This will make memtable::make_reader more efficient. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:08:53 +01:00
Piotr Jastrzebski	68505a5065	Change memtable_entry::read to return flat_mutation_reader This is the first step to move scanning_reader to be flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	7b016527bf	Make iterator_reader independent from mutation_reader iterator_reader will be used also in flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	647dd7f86a	Introduce empty_flat_reader This is an implementation of flat_mutation_reader that returns nothing. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	0a9ab7ff80	memtable: Introduce make_flat_reader This method creates a flat_mutation_reader instead of mutation_reader. All users will be gradually converted to the new interface. make_reader is implemented using make_flat_reader and will be removed once all users are migrated. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Paweł Dziepak	8c3b7fea81	Merge "Introduce new API and converters from/to old mutation_reader" from Piotr "This changeset is the first step to flatten mutation_reader. Then it introduces new mutation_fragment types for partition header and end of partition. Using those a new flat_mutation_reader is defined. Finally it introduces converters between new flat_mutation_reader and old mutation_reader." * 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev: Add tests for flat_mutation_reader Introduce conversion from flat_mutation_reader to mutation_reader Introduce conversion from mutation_reader to flat_mutation_reader Introduce flat_mutation_reader Extract FlattenedConsumer concept using GCC6_CONCEPT Introduce partition_end mutation_fragment Introduce a position for end of partition Introduce partition_start mutation_fragment Introduce FragmentConsumer Introduce a position for partition start streamed_mutation: Extract concepts using GCC6_CONCEPT macro	2017-10-16 12:14:23 +01:00
Piotr Jastrzebski	46727f12e0	Introduce partition_end mutation_fragment This type of mutation_fragment will be used in new mutation_reader to signal the end of the current partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-10 16:15:59 +02:00
Piotr Jastrzebski	2516b42752	Introduce partition_start mutation_fragment This type of mutation_fragment will be used in new mutation_reader to signal the beginning of the next partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-10 16:15:59 +02:00
Piotr Jastrzebski	2583207d9d	Fix memtable scanning_reader::fast_forward_to If memtable is flushed then call fast_forward_to on _delegate. Otherwise call iterator_reader::fast_forward_to. Fixes #2854 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <6bf1c8bafce845ef945698ce4d722c3c8606e632.1506690042.git.piotr@scylladb.com>	2017-09-29 15:17:39 +02:00
Tomasz Grabiec	2df6f356b1	mvcc: Store LSA region reference in partition_snapshot Will be useful for improving encapsulation.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	673a22f8e1	memtable: Mark mark_flushed() as noexcept Callers rely on that.	2017-09-04 10:04:29 +02:00
Duarte Nunes	a2b732c156	dirty_memory_manager: Refactor flush permit lifetime management This patch refactors how the flush permit lifetime is managed, dropping the current hash table in favour of a RAII approach. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Avi Kivity	e855a28fae	Revert "Merge "memtable flush: Fixes and improvements" from Duarte" This reverts commit `733a64a1df`, reversing changes made to `e11e66723a`. Breaks sstable_test and perf_fast_forward.	2017-07-31 12:44:28 +03:00
Duarte Nunes	ef1275e9dd	dirty_memory_manager: Refactor flush permit lifetime management This patch refactors how the flush permit lifetime is managed, dropping the current hash table in favour of a RAII approach. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::fowarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exception are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve of anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
Avi Kivity	6e2c9ef9fb	Revert "Allow reading exactly desired byte ranges and fast_forward_to" This reverts commit `317d7fc253` (and also the related `2c57ab84b2`). It causes crashes during range scans, reported by Gleb: "To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s dataset and 3 node cluster. Backtrace: at /home/gleb/work/seastar/seastar/core/apply.hh:36 rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57 range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142 at ./seastar/core/future.hh:890 at /home/gleb/work/seastar/seastar/core/future-util.hh:119 at /home/gleb/work/seastar/seastar/core/future-util.hh:142	2017-06-18 16:10:21 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::fowarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exception are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range. We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve of anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Calle Wilund	2913241df1	memtable/commitlog: Change bookkeep to track individul segments Use per CF-id reference count instead, and use handles as result of add operations. These must either be explicitly released or stored (rp_set), or they will release the corresponding replay_position upon destruction. Note: this does _not_ remove the replay positioning ordering requirement for mutations. It just removes it as a means to track segment liveness.	2017-06-07 12:07:01 +00:00
Tomasz Grabiec	de70d942a9	memtable: Decouple from sstable We can make the dependency more abstract by using mutation_source instead of an sstable. Will be useful in some stress tests which want to avoid the disk, but is also good for the sake of decoupling. Message-Id: <1495729508-30081-2-git-send-email-tgrabiec@scylladb.com>	2017-05-25 19:30:21 +03:00
Tomasz Grabiec	f3a6d94398	sstables: Introduce sstable::as_mutation_source() Adaptors extracted from existing testing code. Message-Id: <1495729508-30081-1-git-send-email-tgrabiec@scylladb.com>	2017-05-25 19:30:20 +03:00
Tomasz Grabiec	6cf2841654	mvcc: Extract partition_snapshot_reader to separate header Right know whole world includes it transitively, which results in painful recompiles when the code changes. Relax dependencies. Message-Id: <1495620201-8046-1-git-send-email-tgrabiec@scylladb.com>	2017-05-24 12:13:15 +01:00

1 2 3

120 Commits