scylladb

Author	SHA1	Message	Date
Gleb Natapov	5c4158daac	memtable: do not yield while holding reclaim_lock Holding reclaim_lock while yielding may cause memory allocations to fail. Fixes #2139 Message-Id: <20170306153151.GA5902@scylladb.com>	2017-03-06 17:24:22 +01:00
Gleb Natapov	d7bdf16a16	memtable: do not open code logalloc::reclaim_lock use logalloc::reclaim_lock prevents reclaim from running which may cause regular allocation to fail although there is enough of free memory. To solve that there is an allocation_section which acquire reclaim_lock and if allocation fails it run reclaimer outside of a lock and retries the allocation. The patch make use of allocation_section instead of direct use of reclaim_lock in memtable code. Fixes #2138. Message-Id: <20170306160050.GC5902@scylladb.com>	2017-03-06 17:24:22 +01:00
Tomasz Grabiec	892d4a2165	db: Enable creating forwardable readers via mutation_source Right now all mutation source implementations will use make_forwardable() wrapper.	2017-02-23 18:50:44 +01:00
Tomasz Grabiec	2cc27f72ca	memtable: Accept all mutation_source parameters	2017-02-23 18:23:52 +01:00
Tomasz Grabiec	2b8bd10dca	tests: Pass all mutation source parameters	2017-02-13 20:52:49 +01:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Glauber Costa	80440c0d79	database: rework dirty memory hierarchy Issue #1918 describes a problem, in which we are generating smaller memtables than we could, and therefore not respecting the flush criteria. That happens because group sizes (and limits) for pressure purposes, and the the soft threshold is currently at 40 %. This causes system group's soft threshold to be way below regular's virtual dirty limit and close to regular group's soft threshold. The system group was very likely to become under soft pressure when regular was because writes to regular group are not yet throttled when they cross both soft thresholds. This is a direct consequence of the linear hierarchy between the regions and to guarantee that it won't happen we would have acqire the semaphore of all ancestor regions when flushing from a child region. While that works, it can lead to problems on its own, like priority inversion if the regions have different priorities - like streaming and regular, and groups lower in the hierarchy, like user, blocking explicit flushes from their ancestors To fix that, this patch reorganizes the dirty memory region groups so that groups are now completely independent. As a disadvantage, when streaming happen we will draw some memory from the cache, but we will live with it for the time being. Fixes #1918 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 14:07:53 -05:00
Tomasz Grabiec	527ff6aa40	db: Clear memtable after flush when cache is disabled So that memory is released gradually (impacting latency less) and sooner than when memtable is destroyed. Active readers may keep the memtable alive for unbounded amount of time. Refs #1879	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	1bba51319e	memtable: Maintain virtual dirty on clear() When memtable is flushing, it subtracts _flushed_memory from groups's size to gradually allow more writes. Ideally _flushed_memory would be equal to region's size when flush ends, so the group's size would reach zero. When the memtable and its region are gone the group size should remain the same as after the flush. This is ensured by adding back _flushed_memory to group's size right before the region is removed from the group. Calling clear() before region is removed from the group breaks the accounting because it will shrink the region, but will not affect the amount of memory subtracted due to _flushed_memory. So group's size would decrease more than we want (twice the region's size). The fix is to change clear() so that it reverts _flushed_memory by the amount by which the region size is reduced. This will keep the groups's size constant as long as _flushed_memory > 0.	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	1b5f338c17	memtable: Track flushed memory in memtable object	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	c3768fe4de	memtable: Pass dirty_memory_manager& to memtable constructor The implementation assumes that memtable's region group is owned by dirty_memory_manager, and tries to obtain a reference to it like this: boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group)); This is undefined behavior when the region's group does not come from dirty manager. It's safer to be explicit about this dependency by taking a reference to dirty_memory_manager in the constructor.	2016-12-05 12:59:09 +01:00
Glauber Costa	0ca8c3f162	database: keep a pointer to the memtable list in a memtable We current pass a region group to the memtable, but after so many recent changes, that is a bit too low level. This patch changes that so we pass a memtable list instead. Doing that also has a couple of advantages. Mainly, during flush we must get to a memtable to a memtable_list. Currently we do that by going to the memtable to a column family through the schema, and from there to the memtable_list. That, however, involves calling virtual functions in a derived class, because a single column family could have both streaming and normal memtables. If we pass a memtable_list to the memtable, we can keep pointer, and when needed get the memtable_list directly. Not only that gets rid of the inheritance for aesthetic reasons, but that inheritance is not even correct anymore. Since the introduction of the big streaming memtables, we now have a plethora of lists per column family and this transversal is totally wrong. We haven't noticed before because we were flushing the memtables based on their individual sizes, but it has been wrong all along for edge cases in which we would have to resort to size-based flush. This could be the case, for instance, with various plan_ids in flight at the same time. At this point, there is no more reason to keep the derived classes for the dirty_memory_manager. I'm only keeping them around to reduce clutter, although they are useful for the specialized constructors and to communicate to the reader exactly what they are. But those can be removed in a follow up patch if we want. The old memtable constructor signature is kept around for the benefit of two tests in memtable_tests which have their own flush logic. In the future we could do something like we do for the SSTable tests, and have a proxy class that is friends with the memtable class. That too, is left for the future. Fixes #1870 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>	2016-11-21 18:18:27 +02:00
Paweł Dziepak	e04664e851	partition_snapshot_accounter: use range_tombstone::memory_usage() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Paweł Dziepak	ef57b9a26f	rename memory_usage() to external_memory_usage() where applicable Renaming the function to external_memory_usage() makes it clear that sizeof(T) is not included, something that was a source of confusion in the past. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-11-18 11:25:36 +00:00
Glauber Costa	2ed3f342c1	pass a region to dirty_memory_manager accounting API We would like to know from which region is a particular flush coming from, and account accordingly. The reasoning behind that, is that soon we'll be driving the flushes internally from the dirty_memory_manager without explcitly triggering them. We need to start a flush before the current one finishes, otherwise we'll have a period without significant disk activity when the current SSTable is being sealed, the caches are being updated, etc. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
Paweł Dziepak	e14f8027d5	memtable: add support for fast_forward_to() Fast forwarding of memtable readers is needed only for unit tests which often use memtables as underlying data source for cache and the cache readers. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	6755a679f6	drop key readers key_readers weren't used since introduction of continuity flag to cache entries. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Tomasz Grabiec	63784fd921	db: Fix corruption of partition_entry Memory accounting code was attaching partition_snapshot to partition_entry in order to calculate the size of partition_version object. However, it is only allowed if partition_entry doesn't have any snapshot attached already. In this case it always has one, created by the flushing reader. Change the accounting code to reuse existing partition_snapshot reference. Fixes #1746 Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>	2016-10-14 15:10:48 +01:00
Glauber Costa	7146776d7c	fix sstable tests by not using the flush_reader if no region_group The latest virtual dirty patches broke the SSTable tests. The reason for this is that those tests will flush synthetic memtables that do not have a region_group attached to it. Normally in cases like this we would just give the flush_reader an empty region group. However, the memtable class constructor takes a region_group pointer and that can be null according to the interface. So we must conditionally test it. If there isn't a region_group involved, the virtual dirty accounting should be disabled: after all, we won't even have the baseline memory to begin with. One of the approaches to fix this could be to just provide null accounter classes to be used as a surrogate for the accounting classes in this case. However, since this is mostly used for tests, a much simpler way is to just revert back to the scanning reader in that case. The scanning reader is similar enough to the flush_reader, except that it can handle partial ranges, slices, and delegate accesses to an sstable post-flush. We don't need any of that, but as argued above, there is no need to remove it either. Signed-off-by: Glauber Costa <glommer@scylladb.com> Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>	2016-10-05 12:44:21 +01:00
Glauber Costa	f89a67c75c	database: allow virtual dirty memory management Scylla currently suffers from a brick wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them when some memory was written. The practical effect of that is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	eee15578fb	memtables: split scanning reader in two The code that is common will live in its own reader, the iterator_reader. All friendly private access to memtable attributes and methods happen through the iterator reader. After this patch, we are now left with the scanning_reader - same as always, but now implemented on top of the iterator_reader, and a flush_reader, which will be used by SSTable flushes only. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	16886eeb96	sstables: use special reader for writing a memtable Right now the special reader doesn't do much, but the idea is that we will soon replace it will a reader that specializes in flush, and is in turn able to provide read-side on-flush functionality like virtual dirty. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Piotr Jastrzebski	3607d99269	Remove clustering_key_filtering_context. Remove clustering_key_filter_factory and clustering_key_filtering_context. Use partition_slice directly with a static get_ranges method. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 20:31:55 +02:00
Piotr Jastrzebski	b05b90b3a5	Introduce clustering_key_filter_ranges. This fixes the problem of multiple concurrent get_ranges calls. Previously each call was invalidating the result of the previous call. Now they don't step on each other foot. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 19:46:38 +02:00
Glauber Costa	d41fcd45d1	memtables: make memtable inherit from region The LSA memory pressure mechanism will let us know which region is the best candidate for eviction when under pressure. We need to somehow then translate region -> memtable -> column family. The easiest way to convert from region to memtable, is having memtable inherit from region. Despite the fact that this requires multiple inheritance, which always raise a flag a bit, the other class we inherit from is enable_shared_from_this, which has a very simple and well defined interface. So I think it is worthy for us to do it. Once we have the memtable, grabing the column family is easy provided we have a database object. We can grab it from the schema. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 15:05:29 -04:00
Paweł Dziepak	6871bd5fa0	memtable: fully support streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:52 +01:00
Paweł Dziepak	2ab1a73efa	memtable: rename partition_entry to memtable_entry partition_entry is going to be a more general object used by both cache and memtable entries. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:51 +01:00
Paweł Dziepak	737eb73499	mutation_reader: make readers return streamed_mutations Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Piotr Jastrzebski	dcba6f5c45	Pass clustering_row_ranges to mutation readers. This will allow readers to reduce the amount of data read. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-05-16 14:36:57 +02:00
Piotr Jastrzebski	23c23abe53	Make memtable mutation_reader slice using clustering ranges. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-05-16 11:46:41 +02:00
Piotr Jastrzebski	484d2ecd0a	Slice data with clustering key range in sstable reader Add additional parameters to mp_row_consumer to be able to fetch only cells for given clustering key ranges This will be used in row_cache when it will work on clustering key level instead of partition key level. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-05-16 11:46:30 +02:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Glauber Costa	336babfcb8	database: add a priority class to a few SSTable readers Not all SSTable readers will end up getting the right tag for a priority class. In particular, the range reader, also used for the memtables complete ignores any priority class. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-02-24 18:00:34 -05:00
Glauber Costa	80ab41a715	memtable reader: also include a priority class There are situations when a memtable is already flushed but the memtable reader will continue to be in place, relaying reads to the underlying table. For that reason, the "memtables don't need a priority class" argument gets obviously broken. We need to pass a priority class for its reader as well. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-02-24 18:00:34 -05:00
Avi Kivity	d415167496	memtable: use managed_bytes linearization context when applying mutations Ensures that we don't access scattered keys when looking up stuff.	2016-02-16 14:37:46 +02:00
Glauber Costa	15336e7eb7	key_source: turn it into a class Its definition as a lambda function is inconvenient, because it does not allow us to use default values for parameters. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-01-25 15:20:38 -05:00
Glauber Costa	58fdae33bd	mutation_source: turn it into a class Its definition as a lambda function is inconvenient, because it does not allow us to use default values for parameters. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-01-25 15:20:38 -05:00
Tomasz Grabiec	d81a46d7b5	column_family: Add schema setters There is one current schema for given column_family. Entries in memtables and cache can be at any of the previous schemas, but they're always upgraded to current schema on access.	2016-01-11 10:34:52 +01:00
Tomasz Grabiec	4e5a52d6fa	db: Make read interface schema version aware The intent is to make data returned by queries always conform to a single schema version, which is requested by the client. For CQL queries, for example, we want to use the same schema which was used to compile the query. The other node expects to receive data conforming to the requested schema. Interface on shard level accepts schema_ptr, across nodes we use table_schema_version UUID. To transfer schema_ptr across shards, we use global_schema_ptr. Because schema is identified with UUID across nodes, requestors must be prepared for being queried for the definition of the schema. They must hold a live schema_ptr around the request. This guarantees that schema_registry will always know about the requested version. This is not an issue because for queries the requestor needs to hold on to the schema anyway to be able to interpret the results. But care must be taken to always use the same schema version for making the request and parsing the results. Schema requesting across nodes is currently stubbed (throws runtime exception).	2016-01-11 10:34:52 +01:00
Tomasz Grabiec	036974e19b	Make mutation interfaces support multiple versions Schema is tracked in memtable and cache per-entry. Entries are upgraded lazily on access. Incoming mutations are upgraded to table's current schema on given shard. Mutating nodes need to keep schema_ptr alive in case schema version is requested by target node.	2016-01-11 10:34:51 +01:00
Tomasz Grabiec	8a05b61d68	memtable: Read under _read_section	2016-01-11 10:34:51 +01:00
Tomasz Grabiec	5184381a0b	memtable: Deconstify memtable in readers We want to upgrade entries on read and for that we need mutating permission.	2016-01-11 10:34:51 +01:00
Tomasz Grabiec	32ac2ccc4a	memtable: Introduce apply(memtable&)	2015-11-29 16:25:21 +01:00
Avi Kivity	a40a62d840	memtable: use allocating_section to guard allocations Without this, an allocation can fail, and we may not be able to reclaim memory.	2015-11-16 10:56:06 +02:00
Paweł Dziepak	b0edaa5bb7	memtable: add as_key_source() Needed only for tests. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2015-10-20 20:27:53 +02:00
Nadav Har'El	1a4c8db71a	scanning_reader: fix bug on still-being-written memtable scanning_reader has a bug in its range support when it iterates over a memtable which is still open, and thus might still be modified between calls to the read function. This caused, among other things, issue #368 - where repair was reading a memtable which was still open and being written to (by a stream from a a remote node). The problem is that scanning_reader has an optimization so it can avoid comparing the current partition with the range's end on every iteration: It finds, once, a pointer to the element past the end of the range (the so-called "upper bound"), and saves this pointer in _end. Then at every iteration, we can just compare pointers. But If partitions are added to the memtable, the _end we saved is no longer relevant: It still points to a valid partition, but this partition which was once the first partition after the range, may now be precedeed by many new partitions, which may be now returned despite being after the range's end. The fix is to re-calculate "_end" if partitions were added to the memtable. Moreover, we also need to re-calculate "_i" in this case - the current code calculates in one iteration a pointer, _i, to the element to be returned in the next iteration. If additional partitions were added in the meantime, we may need to return them. Because it's impossible to delete partitions from a memtable (just to add new ones or modify existing ones), we can trivially figure out if new partitions were added, using _memtable->partition_count(). Because boost::intrusive::set defaults to constant_time_size(true), using this count is efficient. Fixes #368. Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>	2015-09-20 15:08:08 +02:00
Avi Kivity	d5cf0fb2b1	Add license notices	2015-09-20 10:43:39 +03:00
Tomasz Grabiec	a0c180ef49	memtable: Fix flush in the middle of scanning bug Fixes #309. When scanning memtable readers detect is was flushed, which means that it started to be moved to cache, they fall back to reading from memtable's sstable. Eventually what we should do is to combine memtable and cache contents so that as long as data is not evicted we won't do IO. We do not support scanning in cache yet though, so there is no point in doing this now, and it is not trivial.	2015-09-09 10:17:35 +02:00
Tomasz Grabiec	920fe4278a	Cleanup leftovers after compaction_counter to reclaim_counter rename	2015-09-08 10:19:19 +02:00

1 2

69 Commits