scylladb

Author	SHA1	Message	Date
Avi Kivity	b6ebe2e20b	Merge "Avoid avalanche of tasks after memtable flush" from Tomasz "Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single releasing thread. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. I saw this happening in a write workload from issue #2021 where the number of request releasing threads grew into thousands. 
Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs." * tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev: tests: lsa: Add test for reclaimer starting and stopping tests: lsa: Add request releasing stress test lsa: Avoid avalanche releasing of requests lsa: Move definitions to .cc lsa: Simplify hard pressure notification management lsa: Do not start or stop reclaiming on hard pressure tests: lsa: Adjust to take into account that reclaimers are run synchronously lsa: Document and annotate reclaimer notification callbacks tests: lsa: Use with_timeout() in quiesce() (cherry picked from commit `7a00dd6985`)	2017-02-02 22:19:25 +01:00
Tomasz Grabiec	e14caaef60	utils/logalloc: Add ability to timeout run_when_memory_available() task	2016-11-29 16:40:58 +01:00
Avi Kivity	176fca5775	logalloc: use correct header for unique_ptr <bits/unique_ptr.hh> is a libstdc++ internal header. Use <memory> instead.	2016-11-27 23:08:04 +02:00
Glauber Costa	f86c9e36f4	logalloc: allow region group reclaimer to specify a soft limit The region_group_reclaimer will let us know every time we are over the limit we have specified for memory usage. However, for some applications, we would be interested in knowing about memory build up earlier, so we can start doing something about it before we reach that condition. This patch introduces soft limit notifications for the region_group_reclaimer. After this patch is applied, start_reclaim() is called earlier, and stop_reclaim() later, after the soft condition is abated. There are methods that allow one to easily test if the pressure condition is a soft limit condition or a hard, threshold condition and act accordingly. Whether to act on both conditions or just one of them is up to the application. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:23 -05:00
Paweł Dziepak	b8d737ff0a	tests/row_cache_test: verify that eviction follows lru Refs #1847. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>	2016-11-15 18:57:54 +01:00
Tomasz Grabiec	6548132423	lsa: Make logalloc::tracker::full_compaction() compact all reclaimable regions is_compactible() will pass on very small regions. full_compaction() is only used in tests to force objects to be moved due to compaction, so we want all reclaimable regions to be compacted.	2016-10-18 11:16:08 +02:00
Tomasz Grabiec	4357d0a6d9	db: Add counter for writes blocked on dirty memory There is already queue_length-requests_blocked_memory, but it's a gauge so does not reflect what happened between the sampling points. total_operations-requests_blocked_memory will allow to see if there were any (and how many) requests which were blocked by dirty memory. Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>	2016-10-10 14:25:22 +03:00
Glauber Costa	86aa0b830d	LSA: allow a group to query its own region group Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	f5fd6bd714	LSA: export information about size of the throttle queue Also add information about for how long has the oldest been sitting in the queue. This is part of the backpressure work to allow us to throttle incoming requests if we won't have memory to process them. Shortages can happen in all sorts of places, and it is useful when designing and testing the solutions to know where they are, and how bad they are. This counter is named for consistency after similar counters from transport/. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-27 12:09:08 -04:00
Glauber Costa	fe6a0d97d1	logalloc: make sure allocations in release_requests don't recurse back into the allocator Calls like later() and with_gate() may allocate memory, although that is not very common. This can create a problem in the sense that it will potentially recurse and bring us back to the allocator during free - which is the very thing we are trying to avoid with the call to later(). This patch wraps the relevant calls in the reclaimer lock. This does mean that the allocation may fail if we are under severe pressure - which includes having exhausted all reserved space - but at least we won't recurse back to the allocator. To make sure we do this as early as possible, we just fold both release_requests and do_release_requests into a single function. Thanks Tomek for the suggestion. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>	2016-08-04 11:16:53 +02:00
Glauber Costa	ad58691afb	logalloc: make sure blocked requests memory allocations are served from the standard allocator Issue 1510 describes a scenario in which, under load, we allocate memory within release_requests() leading to a reentry into an invalid state in our blocked requests' shared_promise. This is not easy to trigger since not all allocations will actually get to the point in which they need a new segment, let alone have that happening during another allocator call. Having those kinds of reentry is something we have always sought to avoid with release_requests(): this is the reason why most of the actual routine is deferred after a call to later(). However, that is a trick we cannot use for updating the state of the blocked requests' shared_promise: we can't guarantee when that is going to run, and we always need a valid shared_promise, in a valid state, waiting for new requests to hook into. The solution employed by this patch is to make sure that no allocation operations whatsoever happen during the initial part of release_requests on behalf of the shared promise. Allocation is now deferred to first use, which relieves release_requests() from all allocation duties. All it needs to do is free the old object and signal to its user that an allocation is needed (by storing {} into the shared_promise). Fixes #1510 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <49771e51426f972ddbd4f3eeea3cdeef9cc3b3c6.1470238168.git.glauber@scylladb.com>	2016-08-03 20:40:30 +02:00
Tomasz Grabiec	9476bc5a31	Introduce --abort-on-lsa-bad-alloc command line option Useful for triggering core dump on allocation failure inside LSA, which makes it easier to debug allocation failures. They normally don't cause aborts, just fail the current operation, which makes it hard to figure out what was the cause of allocation failure. Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>	2016-08-03 17:26:44 +03:00
Tomasz Grabiec	e783b58e3b	Merge branch 'glommer/LSA-throttler-v6' from git@github.com:glommer/scylla.gi From Glauber: This is my new take at the "Move throttler to the LSA" series, except this one doesn't actually move anything anywhere: I am leaving all memtable conversion out, and instead I am sending just the LSA bits + LSA active reclaim. This should help us see where we are going, and then we can discuss all memtable changes in a series on its own, logically separated (and hopefully already integrated with virtual dirty). [tgrabiec: trivial merge conflicts in logalloc.cc]	2016-06-21 10:22:26 +02:00
Glauber Costa	579d121db8	LSA: export largest region We now keep the regions sorted by size, and the children region groups as well. Internally, the LSA has all information it needs to make size-based reclaim decisions. However, we don't do reclaim internally, but rather warn our user that a pressure situation is mounted. The user of a region_group doesn't need to evict the largest region in case of pressure and is free to do whatever it chooses - including nothing. But more likely than not, taking into account which region is the largest makes sense. This patch puts together this last missing piece of the puzzle, and exports the information we have internally to the user. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:51:00 -04:00
Glauber Costa	38a402307d	LSA: enhance region_group reclaimer We are currently just allowing the region_group to specify a throttle_threshold, that triggers throttling when a certain amount of memory is reached. We would like to notify the callers that such condition is reached, so that the callers can do something to alleviate it - like triggering flushes of their structures. The approach we are taking here is to pass a reclaimer instance. Any user of a region_group can specialize its methods start_reclaiming and stop_reclaiming that will be called when the region_group becomes under pressure or ceases to be, respectively. Now that we have such facility, it makes more sense to move the throttle_threshold here than having it separately. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:59 -04:00
Glauber Costa	6404028c6a	LSA: move subgroups to a heap as well When we decide to evict from a specific region_group due to excessive memory usage, we must also consider looking at each of their children (subgroups). It could very well be that most of memory is used by one of the subgroups, and we'll have to evict from there. We also want to make sure we are evicting from the biggest region of all, and not the biggest region in the biggest region_group. To understand why this is important, consider the case in which the regions are memtables associated with dirty region groups. It could be that a very big memtable was recently flushed, and a fairly small one took its place. That region group is still quite large because the memtable hasn't finished flushing yet, but that doesn't mean we should evict from it. To allow us to efficiently pick which region is the largest, each root of each subtree will keep track of its maximal score, defined as the maximum between our largest region total_space and the maximum maximal score of subtrees. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:13 -04:00
Glauber Costa	e1eab5c845	LSA: store regions in a heap for regions_group Currently, the regions in a region group are organized in a simple vector. We can do better by using a binomial heap, as we do for segments, and then updating when there is change. Internally to the LSA, we are in good position to always know when change happens, so that's really the best way to do it. The end game here is to easily call for the reclaim of the largest offending region (potentially asynchronously). Because of that, we aren't really interested in the region occupancy, but in the region reclaimable occupancy instead: that's simply equal to the occupancy if the region is reclaimable, and 0 otherwise. Doing that effectively lists all non reclaimable regions in the end of the heap, in no particular order. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:13 -04:00
Glauber Costa	54d4d46cf7	LSA: move throttling code to LSA. The database code uses a throttling function to make sure that memory used for the dirty region never is over the limit. We track that with a region group, so it makes sense to move this as generic functionality into LSA. This patch implements the LSA-side functionality and a later patch will convert the current memtable throttler to use it. Unlike the current throttling mechanism, we'll not use a timer-based mechanism here. Aside from being more generic and friendlier towards other users, this is a good change for current memtable by itself. The constants - 10ms and 1MB chosen by the current throttler are arbitrary, and we would be better off without them. Let's discuss the merits of each separately: 1) 10ms timer: If we are throttling, we expect somebody to flush the memtables for memory to be released. Since we are in position to know exactly when a memtable was written, thus releasing memory, we can just call unthrottle at that point, instead of using a timer. 2) 1MB release threshold: we do that because we have no idea how much memory a request will use, so we put the cut somehow. However, because of 1) we don't call unthrottle through a timer anymore, and do it directly instead. This means that we can just execute the request and see how much memory it has used, with no need to guess. So we'll call unthrottle at the end of every request that was previously throttled. Writing the code this way also has the advantage that we need one less continuation in the common case of the database not being throttled. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:34:19 -04:00
Glauber Costa	01a658f51d	LSA: helper function for region_group current hierarchy walk converted, but more users will come. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Glauber Costa	741aa16748	LSA: allow a region_group to have a threshold for throttling specified Allocations will still be allowed if made directly, but callers will have the choice (in an upcoming patch) to proceed only if memory is below this threshold. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Glauber Costa	7cd0c0731e	region_group: delete move constructor Tomek correctly points out that since we are now using "this" in lambda captures, we should make the region_group not movable. We currently define a move constructor, but there are no users. So we should just remove them. copy constructor is already deleted, and so are the copy and move assignment operators. So by removing the move constructor, we should be fine. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Tomasz Grabiec	86b76171a8	lsa: Use the same step in both internal and external reclamations	2016-06-14 15:13:15 +02:00
Tomasz Grabiec	d74d902a01	lsa: Make reclamation step configurable	2016-06-14 15:13:14 +02:00
Piotr Jastrzebski	136b8148d2	Use idle CPU to compact LSA memory Register an idle CPU handler that compacts a single segment every time there's nothing better to execute on CPU. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <c26aa608a1e0752fb9e6db1833ef3ba1de95f161.1464169748.git.piotr@scylladb.com>	2016-05-26 12:43:53 +03:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Tomasz Grabiec	a0cba3c86f	logalloc: Introduce tracker::occupancy() Returns occupancy information for all memory allocated by LSA, including segment pools / zones.	2016-03-22 16:28:10 +01:00
Tomasz Grabiec	529c8b8858	logalloc: Rename tracker::occupancy() to region_occupancy()	2016-03-22 14:56:44 +01:00
Paweł Dziepak	83b004b2fb	lsa: avoid fragmenting memory Originally, lsa allocated each segment independently, which could result in high memory fragmentation. As a result many compaction and eviction passes may be needed to release a sufficiently big contiguous memory block. These problems are solved by introduction of segment zones, contiguous groups of segments. All segments are allocated from zones and the algorithm tries to keep the number of zones to a minimum. Moreover, segments can be migrated between zones or inside a zone in order to deal with fragmentation inside zone. Segment zones can be shrunk but cannot grow. Segment pool keeps a tree containing all zones ordered by their base addresses. This tree is used only by the memory reclaimer. There is also a list of zones that have at least one free segment that is used during allocation. Segment allocation doesn't have any preferences which segment (and zone) to choose. Each zone contains a free list of unused segments. If there are no zones with free segments a new one is created. Segment reclamation migrates segments from the zones higher in memory to the ones at lower addresses. The remaining zones are shrunk until the requested number of segments is reclaimed. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2015-12-08 19:31:40 +01:00
Avi Kivity	1c425d6b50	logalloc: allow allocating_section code blocks to return references	2015-11-15 19:10:24 +02:00
Avi Kivity	d5cf0fb2b1	Add license notices	2015-09-20 10:43:39 +03:00
Tomasz Grabiec	8e1b3e5475	lsa: Remove underscore from local variable names	2015-09-10 12:40:12 +03:00
Avi Kivity	6d0a2b5075	logalloc: don't invalidate merged region A region being merged can still be in use; but after merging, compaction_lock and the reclaim counter will no longer work. This can lead to use-after-compact-without-re-lookup errors. Fix by making the source region be the same as the target region; they will share compaction locks and reclaim counters, so lookup avoidance will still work correctly. Fixes #286.	2015-09-08 08:55:44 +02:00
Tomasz Grabiec	3b441416fa	lsa: Make segment size publicly accessible Some tests depend on segment size.	2015-09-06 21:25:44 +02:00
Tomasz Grabiec	c82325a76c	lsa: Make region evictor signal forward progress In some cases region may be in a state where it is not empty and nothing could be evicted from it. For example when creating the first entry, reclaimer may get invoked during creation before it gets linked. We therefore can't rely on emptiness as a stop condition for reclamation, the eviction function shall signal us if it made forward progress.	2015-09-06 21:25:44 +02:00
Tomasz Grabiec	d022a1a4a3	lsa: Introduce allocating_section Related to #259. In some cases we need to allocate memory and hold reclaim lock at the same time. If that region holds most of the reclaimable memory, allocations inside that code section may fail. allocating_section is a work-around of the problem. It learns how big reserves should be from past execution of critical section and tries to ensure proper reserves before entering the section.	2015-09-06 21:24:59 +02:00
Tomasz Grabiec	870e9e5729	lsa: Replace compaction_lock with broader reclaim_lock Disabling compaction of a region is currently done in order to keep the references valid. But disabling only compaction is not enough, we also need to disable eviction, as it also invalidates references. Rather than introducing another type of lock, compaction and eviction are controlled together, generalized as "reclaiming" (hence the reclaim_lock).	2015-09-01 17:29:04 +03:00
Tomasz Grabiec	d20fae96a2	lsa: Make reclaimer run synchronously with allocations The goal is to make allocation less likely to fail. With async reclaimer there is an implicit bound on the amount of memory that can be allocated between deferring points. This bound is difficult to enforce though. Sync reclaimer lifts this limitation off. Also, allocations which could not be satisfied before because of fragmentation now will have higher chances of succeeding, although depending on how much memory is fragmented, that could involve evicting a lot of segments from cache, so we should still avoid them. Downside of sync reclaiming is that now references into regions may be invalidated not only across deferring points but at any allocation site. compaction_lock can be used to pin data, preferably just temporarily.	2015-08-31 21:50:18 +02:00
Tomasz Grabiec	6105c05dbe	lsa: Introduce compaction_lock helper	2015-08-31 21:50:17 +02:00
Tomasz Grabiec	42dce17c82	lsa: Fix documentation for eviction functions	2015-08-31 21:50:17 +02:00
Tomasz Grabiec	110a55886c	lsa: Introduce region::compaction_counter()	2015-08-31 13:58:42 +02:00
Tomasz Grabiec	9ad3dbe592	lsa: Add region::compaction_enabled()	2015-08-31 13:58:42 +02:00
Tomasz Grabiec	048387782a	lsa: Rename region::set_compactible() to set_compaction_enabled() To avoid confusion with region_impl::is_compactible() when the getter is added.	2015-08-31 13:58:42 +02:00
Avi Kivity	9ed2bbb25c	lsa: introduce region_group A region_group is a nestable group of regions, for cumulative statistics purposes.	2015-08-19 19:36:40 +03:00
Avi Kivity	71aad57ca8	lsa: make region::impl a top-level class Makes using forward declarations possible.	2015-08-19 14:43:17 +03:00
Tomasz Grabiec	ef549ae5a5	lsa: Reclaim space from evictable regions incrementally When LSA reclaimer cannot reclaim more space by compaction, it will reclaim data by evicting from evictable regions. Currently the only evictable region is the one owned by the row cache.	2015-08-08 09:59:24 +02:00
Tomasz Grabiec	6ae0747fe5	lsa: Use size_t for sizes	2015-08-06 18:40:06 +02:00
Tomasz Grabiec	df6f0c35df	utils: lsa: Add reclaimer hook which compacts regions	2015-08-06 14:05:15 +02:00
Tomasz Grabiec	5a9e296803	utils: lsa: Introduce log-structured allocator	2015-08-06 14:05:15 +02:00

48 Commits
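
Note on b6ebe2e20b: the fix described in that merge (at most one request-releasing fiber per region group, woken when pressure lifts and running blocked requests until pressure returns) can be illustrated with a small toy model. The sketch below is plain Python, not Seastar or ScyllaDB code. The class `RegionGroup`, its methods `update`, `submit`, `run_tasks`, and the `task_queue` stand-in for the reactor's task queue are all hypothetical names chosen for illustration. It only shows why guarding the scheduling with a single flag keeps the number of pending releaser tasks at one, even when each released request changes the group size itself.

```python
from collections import deque


class RegionGroup:
    """Toy model of a region group with blocked requests (hypothetical names)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.size = 0
        self.blocked = deque()           # requests waiting for pressure to lift
        self.releaser_scheduled = False  # guard: at most one releaser at a time
        self.task_queue = deque()        # stand-in for the reactor's task queue
        self.max_queue_len = 0           # high-water mark, for the demo only

    def under_pressure(self):
        return self.size >= self.threshold

    def update(self, delta):
        """Change the group size and wake the releaser if pressure is lifted."""
        self.size += delta
        if (not self.under_pressure() and self.blocked
                and not self.releaser_scheduled):
            self.releaser_scheduled = True
            self.task_queue.append(self._release)
            self.max_queue_len = max(self.max_queue_len, len(self.task_queue))

    def submit(self, request):
        """Run the request now, or queue it behind already blocked requests."""
        if self.under_pressure() or self.blocked:
            self.blocked.append(request)
        else:
            request(self)

    def _release(self):
        # Single releaser: run blocked requests until pressure occurs again.
        while self.blocked and not self.under_pressure():
            self.blocked.popleft()(self)
        self.releaser_scheduled = False

    def run_tasks(self):
        while self.task_queue:
            self.task_queue.popleft()()


completed = []


def make_request(i):
    def request(group):
        group.update(+1)   # the request itself changes the group size,
        group.update(-1)   # which used to schedule extra releasing tasks
        completed.append(i)
    return request


group = RegionGroup(threshold=100)
group.update(100)                 # reach the threshold: pressure is on
for i in range(1000):
    group.submit(make_request(i))  # all of these block
group.update(-50)                 # e.g. a memtable flush lifts the pressure
group.run_tasks()
print(len(completed), group.max_queue_len)
```

Without the `releaser_scheduled` guard, every `update()` call made by a released request would append another releaser to `task_queue`, which is the avalanche the commit message describes.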