scylladb

Author	SHA1	Message	Date
Piotr Jastrzebski	6cd4b6b09c	Remove sstable_range_wrapping_reader The wrapper is no longer needed because read_range_rows returns ::mutation_reader instead of sstables::mutation_reader and the reader returned from it keeps the pointer to shared_sstable that was used to create the reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 10:40:02 +01:00
Piotr Jastrzebski	acfc6fef55	Simplify flat_mutation_reader wrappers If a wrapper takes a flat_mutation_reader in a constructor then it does not have to take schema_ptr because it can obtain it from the inner flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <88c3672df08d2ac465711e9138d426e43ae9c62b.1510331382.git.piotr@scylladb.com>	2017-11-13 08:53:34 +01:00
Tomasz Grabiec	484dde692f	Merge "make sure that cache updates don't overflow dirty memory" from Glauber Since we started accounting virtual dirty memory we no longer have a cap on real dirty memory. In most situations that is not needed, since real dirty will just be at most twice as much as virtual dirty (current flushing memtable plus new memtable). However, due to things like cache updates and component flushing we can end up having a lot of memtables that are virtually freed but not yet fully released, leading real dirty memory to explode, using all of the box's memory. This patch adds a cap on real dirty memory as well. Because of the hierarchical nature of region_group, if the parent blocks due to memory depletion, so will the child (virtual dirty region group). After that is done, we need to make sure that dirty memory is not seen as freed until the cache update is done. Until a particular partition is moved to the cache it is not evictable. As a result we can OOM the system if we have a lot of pending cache updates as the writes will not be throttled and memory won't be made available. This patch pins the memory used by the region as real dirty before the cache update starts, and unpins it when it is over. In the meantime it gradually releases memory of the partitions that are being moved to cache. I have verified in a couple of workloads that the amount of memory accounted through this is the same amount of memory accounted through the memtable flush procedure. 
Fixes #1942 * git@github.com:glommer/scylla.git glommer/update-cache-v4: row_cache: modernize use of seastar threads mutation_partition: estimate size of partition memtable: factor out calculation of memtable_entry memory size memtable: add a method to export memtable's dirty memory manager dirty_memory_manager: block if we hit the real dirty limit dirty_memory_manager: add functions to manipulate real dirty partition: add method to calculate memory size of a partition row cache: pin real dirty during cache updates.	2017-11-10 13:55:12 +01:00
Piotr Jastrzebski	e7a0732f72	Add schema_ptr to flat_mutation_reader It is useful to have a schema inside a flat reader the same way we had schema inside a streamed_mutation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <b37e0dbf38810c00bd27fb876b69e1754c16a89f.1510312137.git.piotr@scylladb.com>	2017-11-10 13:54:55 +01:00
Glauber Costa	ec36b9eddc	memtable: factor out calculation of memtable_entry memory size The total size is the sum of two components. Add a method that does that sum so this code gets easier to reuse. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2017-11-08 16:21:44 -05:00
Piotr Jastrzebski	aa16cd7eef	flat_mutation_reader_from_mutation: support multiple mutations Rename flat_mutation_reader_from_mutation to flat_mutation_reader_from_mutations. Make it work with std::vector<mutation> instead of a single mutation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:26:10 +01:00
Piotr Jastrzebski	864d02e795	Turn scanning_reader into flat_mutation_reader This will make memtable::make_reader more efficient. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 14:08:53 +01:00
Piotr Jastrzebski	68505a5065	Change memtable_entry::read to return flat_mutation_reader This is the first step to move scanning_reader to be flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	7b016527bf	Make iterator_reader independent from mutation_reader iterator_reader will be used also in flat_mutation_reader. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	647dd7f86a	Introduce empty_flat_reader This is an implementation of flat_mutation_reader that returns nothing. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Piotr Jastrzebski	0a9ab7ff80	memtable: Introduce make_flat_reader This method creates a flat_mutation_reader instead of mutation_reader. All users will be gradually converted to the new interface. make_reader is implemented using make_flat_reader and will be removed once all users are migrated. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-08 13:52:09 +01:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Paweł Dziepak	8c3b7fea81	Merge "Introduce new API and converters from/to old mutation_reader" from Piotr "This changeset is the first step to flatten mutation_reader. Then it introduces new mutation_fragment types for partition header and end of partition. Using those a new flat_mutation_reader is defined. Finally it introduces converters between new flat_mutation_reader and old mutation_reader." * 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev: Add tests for flat_mutation_reader Introduce conversion from flat_mutation_reader to mutation_reader Introduce conversion from mutation_reader to flat_mutation_reader Introduce flat_mutation_reader Extract FlattenedConsumer concept using GCC6_CONCEPT Introduce partition_end mutation_fragment Introduce a position for end of partition Introduce partition_start mutation_fragment Introduce FragmentConsumer Introduce a position for partition start streamed_mutation: Extract concepts using GCC6_CONCEPT macro	2017-10-16 12:14:23 +01:00
Piotr Jastrzebski	46727f12e0	Introduce partition_end mutation_fragment This type of mutation_fragment will be used in new mutation_reader to signal the end of the current partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-10 16:15:59 +02:00
Piotr Jastrzebski	2516b42752	Introduce partition_start mutation_fragment This type of mutation_fragment will be used in new mutation_reader to signal the beginning of the next partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-10 16:15:59 +02:00
Piotr Jastrzebski	2583207d9d	Fix memtable scanning_reader::fast_forward_to If memtable is flushed then call fast_forward_to on _delegate. Otherwise call iterator_reader::fast_forward_to. Fixes #2854 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <6bf1c8bafce845ef945698ce4d722c3c8606e632.1506690042.git.piotr@scylladb.com>	2017-09-29 15:17:39 +02:00
Tomasz Grabiec	2df6f356b1	mvcc: Store LSA region reference in partition_snapshot Will be useful for improving encapsulation.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	673a22f8e1	memtable: Mark mark_flushed() as noexcept Callers rely on that.	2017-09-04 10:04:29 +02:00
Duarte Nunes	a2b732c156	dirty_memory_manager: Refactor flush permit lifetime management This patch refactors how the flush permit lifetime is managed, dropping the current hash table in favour of a RAII approach. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-31 12:40:19 +02:00
Avi Kivity	e855a28fae	Revert "Merge "memtable flush: Fixes and improvements" from Duarte" This reverts commit `733a64a1df`, reversing changes made to `e11e66723a`. Breaks sstable_test and perf_fast_forward.	2017-07-31 12:44:28 +03:00
Duarte Nunes	ef1275e9dd	dirty_memory_manager: Refactor flush permit lifetime management This patch refactors how the flush permit lifetime is managed, dropping the current hash table in favour of a RAII approach. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-07-27 21:09:18 +02:00
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and do not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
Avi Kivity	6e2c9ef9fb	Revert "Allow reading exactly desired byte ranges and fast_forward_to" This reverts commit `317d7fc253` (and also the related `2c57ab84b2`). It causes crashes during range scans, reported by Gleb: "To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s dataset and 3 node cluster. Backtrace: at /home/gleb/work/seastar/seastar/core/apply.hh:36 rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57 range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142 at ./seastar/core/future.hh:890 at /home/gleb/work/seastar/seastar/core/future-util.hh:119 at /home/gleb/work/seastar/seastar/core/future-util.hh:142	2017-06-18 16:10:21 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and do not support fast_forward_to() a different partition range. 
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Calle Wilund	2913241df1	memtable/commitlog: Change bookkeep to track individual segments Use per CF-id reference count instead, and use handles as result of add operations. These must either be explicitly released or stored (rp_set), or they will release the corresponding replay_position upon destruction. Note: this does _not_ remove the replay positioning ordering requirement for mutations. It just removes it as a means to track segment liveness.	2017-06-07 12:07:01 +00:00
Tomasz Grabiec	de70d942a9	memtable: Decouple from sstable We can make the dependency more abstract by using mutation_source instead of an sstable. Will be useful in some stress tests which want to avoid the disk, but is also good for the sake of decoupling. Message-Id: <1495729508-30081-2-git-send-email-tgrabiec@scylladb.com>	2017-05-25 19:30:21 +03:00
Tomasz Grabiec	f3a6d94398	sstables: Introduce sstable::as_mutation_source() Adaptors extracted from existing testing code. Message-Id: <1495729508-30081-1-git-send-email-tgrabiec@scylladb.com>	2017-05-25 19:30:20 +03:00
Tomasz Grabiec	6cf2841654	mvcc: Extract partition_snapshot_reader to separate header Right now the whole world includes it transitively, which results in painful recompiles when the code changes. Relax dependencies. Message-Id: <1495620201-8046-1-git-send-email-tgrabiec@scylladb.com>	2017-05-24 12:13:15 +01:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Gleb Natapov	5c4158daac	memtable: do not yield while holding reclaim_lock Holding reclaim_lock while yielding may cause memory allocations to fail. Fixes #2139 Message-Id: <20170306153151.GA5902@scylladb.com>	2017-03-06 17:24:22 +01:00
Gleb Natapov	d7bdf16a16	memtable: do not open code logalloc::reclaim_lock use logalloc::reclaim_lock prevents reclaim from running, which may cause regular allocation to fail although there is enough free memory. To solve that there is an allocation_section which acquires reclaim_lock and, if allocation fails, runs the reclaimer outside of the lock and retries the allocation. The patch makes use of allocation_section instead of direct use of reclaim_lock in memtable code. Fixes #2138. Message-Id: <20170306160050.GC5902@scylladb.com>	2017-03-06 17:24:22 +01:00
Tomasz Grabiec	892d4a2165	db: Enable creating forwardable readers via mutation_source Right now all mutation source implementations will use make_forwardable() wrapper.	2017-02-23 18:50:44 +01:00
Tomasz Grabiec	2cc27f72ca	memtable: Accept all mutation_source parameters	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	2b8bd10dca	tests: Pass all mutation source parameters	2017-02-13 20:52:49 +01:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Glauber Costa	80440c0d79	database: rework dirty memory hierarchy Issue #1918 describes a problem, in which we are generating smaller memtables than we could, and therefore not respecting the flush criteria. That happens because group sizes (and limits) for pressure purposes, and the soft threshold is currently at 40 %. This causes system group's soft threshold to be way below regular's virtual dirty limit and close to regular group's soft threshold. The system group was very likely to become under soft pressure when regular was because writes to regular group are not yet throttled when they cross both soft thresholds. This is a direct consequence of the linear hierarchy between the regions and to guarantee that it won't happen we would have to acquire the semaphore of all ancestor regions when flushing from a child region. While that works, it can lead to problems on its own, like priority inversion if the regions have different priorities - like streaming and regular, and groups lower in the hierarchy, like user, blocking explicit flushes from their ancestors. To fix that, this patch reorganizes the dirty memory region groups so that groups are now completely independent. As a disadvantage, when streaming happens we will draw some memory from the cache, but we will live with it for the time being. Fixes #1918 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 14:07:53 -05:00
Tomasz Grabiec	527ff6aa40	db: Clear memtable after flush when cache is disabled So that memory is released gradually (impacting latency less) and sooner than when memtable is destroyed. Active readers may keep the memtable alive for an unbounded amount of time. Refs #1879	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	1bba51319e	memtable: Maintain virtual dirty on clear() When memtable is flushing, it subtracts _flushed_memory from group's size to gradually allow more writes. Ideally _flushed_memory would be equal to region's size when flush ends, so the group's size would reach zero. When the memtable and its region are gone the group size should remain the same as after the flush. This is ensured by adding back _flushed_memory to group's size right before the region is removed from the group. Calling clear() before region is removed from the group breaks the accounting because it will shrink the region, but will not affect the amount of memory subtracted due to _flushed_memory. So group's size would decrease more than we want (twice the region's size). The fix is to change clear() so that it reverts _flushed_memory by the amount by which the region size is reduced. This will keep the group's size constant as long as _flushed_memory > 0.	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	1b5f338c17	memtable: Track flushed memory in memtable object	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	c3768fe4de	memtable: Pass dirty_memory_manager& to memtable constructor The implementation assumes that memtable's region group is owned by dirty_memory_manager, and tries to obtain a reference to it like this: boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group)); This is undefined behavior when the region's group does not come from dirty manager. It's safer to be explicit about this dependency by taking a reference to dirty_memory_manager in the constructor.	2016-12-05 12:59:09 +01:00
Glauber Costa	0ca8c3f162	database: keep a pointer to the memtable list in a memtable We currently pass a region group to the memtable, but after so many recent changes, that is a bit too low level. This patch changes that so we pass a memtable list instead. Doing that also has a couple of advantages. Mainly, during flush we must get from a memtable to a memtable_list. Currently we do that by going from the memtable to a column family through the schema, and from there to the memtable_list. That, however, involves calling virtual functions in a derived class, because a single column family could have both streaming and normal memtables. If we pass a memtable_list to the memtable, we can keep a pointer, and when needed get the memtable_list directly. Not only does that get rid of the inheritance for aesthetic reasons, but that inheritance is not even correct anymore. Since the introduction of the big streaming memtables, we now have a plethora of lists per column family and this traversal is totally wrong. We haven't noticed before because we were flushing the memtables based on their individual sizes, but it has been wrong all along for edge cases in which we would have to resort to size-based flush. This could be the case, for instance, with various plan_ids in flight at the same time. At this point, there is no more reason to keep the derived classes for the dirty_memory_manager. I'm only keeping them around to reduce clutter, although they are useful for the specialized constructors and to communicate to the reader exactly what they are. But those can be removed in a follow up patch if we want. The old memtable constructor signature is kept around for the benefit of two tests in memtable_tests which have their own flush logic. In the future we could do something like we do for the SSTable tests, and have a proxy class that is friends with the memtable class. That too, is left for the future. 
Fixes #1870 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>	2016-11-21 18:18:27 +02:00
Paweł Dziepak	e04664e851	partition_snapshot_accounter: use range_tombstone::memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	ef57b9a26f	rename memory_usage() to external_memory_usage() where applicable Renaming the function to external_memory_usage() makes it clear that sizeof(T) is not included, something that was a source of confusion in the past. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Glauber Costa	2ed3f342c1	pass a region to dirty_memory_manager accounting API We would like to know which region a particular flush is coming from, and account accordingly. The reasoning behind that is that soon we'll be driving the flushes internally from the dirty_memory_manager without explicitly triggering them. We need to start a flush before the current one finishes, otherwise we'll have a period without significant disk activity when the current SSTable is being sealed, the caches are being updated, etc. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
Paweł Dziepak	e14f8027d5	memtable: add support for fast_forward_to() Fast forwarding of memtable readers is needed only for unit tests which often use memtables as underlying data source for cache and the cache readers. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	6755a679f6	drop key readers key_readers weren't used since introduction of continuity flag to cache entries. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Tomasz Grabiec	63784fd921	db: Fix corruption of partition_entry Memory accounting code was attaching partition_snapshot to partition_entry in order to calculate the size of partition_version object. However, it is only allowed if partition_entry doesn't have any snapshot attached already. In this case it always has one, created by the flushing reader. Change the accounting code to reuse existing partition_snapshot reference. Fixes #1746 Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>	2016-10-14 15:10:48 +01:00
Glauber Costa	7146776d7c	fix sstable tests by not using the flush_reader if no region_group The latest virtual dirty patches broke the SSTable tests. The reason for this is that those tests will flush synthetic memtables that do not have a region_group attached to it. Normally in cases like this we would just give the flush_reader an empty region group. However, the memtable class constructor takes a region_group pointer and that can be null according to the interface. So we must conditionally test it. If there isn't a region_group involved, the virtual dirty accounting should be disabled: after all, we won't even have the baseline memory to begin with. One of the approaches to fix this could be to just provide null accounter classes to be used as a surrogate for the accounting classes in this case. However, since this is mostly used for tests, a much simpler way is to just revert back to the scanning reader in that case. The scanning reader is similar enough to the flush_reader, except that it can handle partial ranges, slices, and delegate accesses to an sstable post-flush. We don't need any of that, but as argued above, there is no need to remove it either. Signed-off-by: Glauber Costa <glommer@scylladb.com> Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>	2016-10-05 12:44:21 +01:00
Glauber Costa	f89a67c75c	database: allow virtual dirty memory management Scylla currently suffers from a brick wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them when some memory was written. The practical effect of that is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
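
The virtual dirty accounting described in commit f89a67c75c can be sketched roughly as follows. This is an illustrative model only, not ScyllaDB's dirty_memory_manager: the class virtual_dirty_tracker and all of its member names are hypothetical, and the single-flush bookkeeping is a simplifying assumption. It shows the two ideas from the commit message: writers are throttled once virtual dirty memory reaches 50 % of the limit, and memory counts as virtually freed as soon as it has been written, not only when the whole SSTable is complete.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of virtual dirty accounting; names are not ScyllaDB's.
class virtual_dirty_tracker {
    std::size_t _limit;           // total dirty memory budget, in bytes
    std::size_t _real_dirty = 0;  // bytes still held by memtables
    std::size_t _flushed = 0;     // bytes of the flushing memtable already written
public:
    explicit virtual_dirty_tracker(std::size_t limit) : _limit(limit) {}

    // A write adds to real (and therefore virtual) dirty memory.
    void on_write(std::size_t bytes) { _real_dirty += bytes; }

    // Flush progress frees memory virtually; real memory is still held.
    void on_flush_progress(std::size_t bytes) {
        assert(_flushed + bytes <= _real_dirty);
        _flushed += bytes;
    }

    // When the flush completes, the written bytes are really released.
    void on_flush_done() {
        _real_dirty -= _flushed;
        _flushed = 0;
    }

    std::size_t real_dirty() const { return _real_dirty; }
    std::size_t virtual_dirty() const { return _real_dirty - _flushed; }

    // Throttle at 50 % of the limit instead of 100 %.
    bool should_throttle() const { return virtual_dirty() >= _limit / 2; }
};
```

With a limit of 100 bytes, a tracker that has absorbed 50 bytes of writes reports should_throttle() as true; after 10 bytes of flush progress it accepts writes again, even though real dirty memory is unchanged until on_flush_done() is called.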

98 Commits