scylladb

Author	SHA1	Message	Date
Tomasz Grabiec	8d69d217af	lsa: Guarantee invalidated references on allocating section retry There is existing code (e.g. use of partition_snapshot_row_cursor in cache_streamed_mutation) which assumes that references will be invalidated when bad_alloc is thrown from allocating_section. That is currently the case because on retry we will attempt memory reclamation which will invalidate references either through compaction or eviction. Make this guarantee explicit.	2017-11-13 20:55:13 +01:00
Tomasz Grabiec	87be474c19	lsa: Move reclaim counter concept to allocation_strategy level So that generic code can detect invalidation of references. Also, to allow reusing the same mechanism for signalling external reference invalidation.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	5d2f2bc90b	lsa: Mark region::merge() as noexcept It seems to satisfy this, and row_cache::do_update() will rely on it to simplify error handling. Message-Id: <1504023113-30374-1-git-send-email-tgrabiec@scylladb.com>	2017-08-29 19:17:17 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Tomasz Grabiec	7aa286439f	lsa: Add getter for region's eviction function	2017-04-20 14:51:42 +02:00
Avi Kivity	844529fe33	logalloc: avoid auto in function argument declaration 'auto' in a non-lambda function argument is not legal C++, and is hard to read besides. Replace with the right type. Since the right type is private, add some friendship.	2017-04-17 23:18:44 +03:00
Tomasz Grabiec	4ab8b255da	lsa: Allow adjusting reserves in allocating_section	2017-03-16 10:21:10 +01:00
Avi Kivity	7a00dd6985	Merge "Avoid avalanche of tasks after memtable flush" from Tomasz "Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single releasing thread. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. I saw this happening in a write workload from issue #2021 where the number of request releasing threads grew into thousands. 
Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs." * tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev: tests: lsa: Add test for reclaimer starting and stopping tests: lsa: Add request releasing stress test lsa: Avoid avalanche releasing of requests lsa: Move definitions to .cc lsa: Simplify hard pressure notification management lsa: Do not start or stop reclaiming on hard pressure tests: lsa: Adjust to take into account that reclaimers are run synchronously lsa: Document and annotate reclaimer notification callbacks tests: lsa: Use with_timeout() in quiesce()	2017-02-02 17:49:31 +02:00
Tomasz Grabiec	e40fb438f5	lsa: Avoid avalanche releasing of requests Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single thread of execution. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when it contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. Refs #2021. Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. 
It executes blocked requests until pressure occurs. The logic for notification across hierarchy was replaced by calling region_group::notify_relief() from region_group::update() on the broadest relieved group.	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	d55baa0cd1	lsa: Move definitions to .cc	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	8f8b111b33	lsa: Simplify hard pressure notification management The hard pressure was only signalled on region group when run_when_memory_available() was called after the pressure condition was met. So the following loop is always an infinite loop rather than stopping when enough is allocated to cause pressure: while (!gr.under_pressure()) { region.allocate(...); } It's cleaner if pressure notification works not only if run_when_memory_available() is used but whenever condition changes, like we do for the soft pressure. There is a comment in run_when_memory_available() which gives reasons why notifications are called from there, but I think those reasons no longer hold: - we already notify on soft pressure conditions from update(), and if that is safe, notifying about hard pressure should also be safe. I checked and it looks safe to me. - avoiding notification in the rare case when we stopped writing right after crossing the threshold doesn't seem beneficial. It's unlikely in the first place, and one could argue it's better to actually flush now so that when writes resume they will not block.	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	9aa1be5d08	lsa: Do not start or stop reclaiming on hard pressure We already call these when crossing the soft threshold. We shouldn't stop reclaiming when hard pressure is gone because soft pressure may still be present. Calling start_reclaiming() on hard pressure is unnecessary because soft pressure also starts it, and when there is hard pressure there is also soft pressure.	2017-02-01 17:40:15 +01:00
Amnon Heiman	45b6070832	Merge seastar upstream * seastar 397685c...c1dbd89 (13): > lowres_clock: drop cache-line alignment for _timer > net/packet: add missing include > Merge "Adding histogram and description support" from Amnon > reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&' > Set the option '--server' of tests/tcp_sctp_client to be required > core/memory: Remove superfluous assignment > core/memory: Remove dead code > core/reactor: Use logger instead of cerr > fix inverted logic in overprovision parameter > rpc: fix timeout checking condition > rpc: use lowres_clock instead of high resolution one > semaphore: make semaphore's clock configurable > rpc: detect timedout outgoing packets earlier Includes treewide change to accommodate rpc changing its timeout clock to lowres_clock. Includes fixup from Amnon: collectd api should use the metrics getters As part of the preparation for the change in the metrics layer, this changes the way the collectd api uses the metrics value to use the getters instead of calling the member directly. This will be important when the internal implementation is changed from union to variant. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>	2017-02-01 14:39:08 +02:00
Tomasz Grabiec	ed9ff19467	lsa: Document and annotate reclaimer notification callbacks They are called from region_group::update(), so must be alloc-free and noexcept.	2017-01-30 19:18:07 +01:00
Amnon Heiman	e19fa02a17	remove scollectd from headers As the metrics migration progressed, some includes of scollectd.hh were left behind. Because of the nature of the scollectd implementation, those includes bring a lot of code with them to the header files and eventually to many source files. This patch removes those includes and adds a missing include to storage_proxy.cc. The fact that the compiler didn't complain is an indication of the problematic nature of those includes in the first place. Before this patch, a change in metrics.hh would cause 169 files to compile, after this change 17. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>	2017-01-17 17:39:47 +02:00
Tomasz Grabiec	e14caaef60	utils/logalloc: Add ability to timeout run_when_memory_available() task	2016-11-29 16:40:58 +01:00
Avi Kivity	176fca5775	logalloc: use correct header for unique_ptr <bits/unique_ptr.hh> is a libstdc++ internal header. Use <memory> instead.	2016-11-27 23:08:04 +02:00
Glauber Costa	f86c9e36f4	logalloc: allow region group reclaimer to specify a soft limit The region_group_reclaimer will let us know every time we are over the limit we have specified for memory usage. However, for some applications, we would be interested in knowing about memory build up earlier, so we can start doing something about it before we reach that condition. This patch introduces soft limit notifications for the region_group_reclaimer. After this patch is applied, start_reclaim() is called earlier, and stop_reclaim() later, after the soft condition is abated. There are methods that allow one to easily test if the pressure condition is a soft limit condition or a hard, threshold condition and act accordingly. Whether to act on both conditions or just one of them is up to the application. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:23 -05:00
Paweł Dziepak	b8d737ff0a	tests/row_cache_test: verify that eviction follows lru Refs #1847. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>	2016-11-15 18:57:54 +01:00
Tomasz Grabiec	6548132423	lsa: Make logalloc::tracker::full_compaction() compact all reclaimable regions is_compactible() will pass on very small regions. full_compaction() is only used in tests to force objects to be moved due to compaction, so we want all reclaimable regions to be compacted.	2016-10-18 11:16:08 +02:00
Tomasz Grabiec	4357d0a6d9	db: Add counter for writes blocked on dirty memory There is already queue_length-requests_blocked_memory, but it's a gauge so does not reflect what happened between the sampling points. total_operations-requests_blocked_memory will allow to see if there were any (and how many) requests which were blocked by dirty memory. Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>	2016-10-10 14:25:22 +03:00
Glauber Costa	86aa0b830d	LSA: allow a group to query its own region group Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	f5fd6bd714	LSA: export information about size of the throttle queue Also add information about for how long has the oldest been sitting in the queue. This is part of the backpressure work to allow us to throttle incoming requests if we won't have memory to process them. Shortages can happen in all sorts of places, and it is useful when designing and testing the solutions to know where they are, and how bad they are. This counter is named for consistency after similar counters from transport/. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-27 12:09:08 -04:00
Glauber Costa	fe6a0d97d1	logalloc: make sure allocations in release_requests don't recurse back into the allocator Calls like later() and with_gate() may allocate memory, although that is not very common. This can create a problem in the sense that it will potentially recurse and bring us back to the allocator during free - which is the very thing we are trying to avoid with the call to later(). This patch wraps the relevant calls in the reclaimer lock. This does mean that the allocation may fail if we are under severe pressure - which includes having exhausted all reserved space - but at least we won't recurse back to the allocator. To make sure we do this as early as possible, we just fold both release_requests and do_release_requests into a single function. Thanks Tomek for the suggestion. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>	2016-08-04 11:16:53 +02:00
Glauber Costa	ad58691afb	logalloc: make sure blocked requests memory allocations are served from the standard allocator Issue 1510 describes a scenario in which, under load, we allocate memory within release_requests() leading to a reentry into an invalid state in our blocked requests' shared_promise. This is not easy to trigger since not all allocations will actually get to the point in which they need a new segment, let alone have that happening during another allocator call. Having those kinds of reentry is something we have always sought to avoid with release_requests(): this is the reason why most of the actual routine is deferred after a call to later(). However, that is a trick we cannot use for updating the state of the blocked requests' shared_promise: we can't guarantee when that is going to run, and we always need a valid shared_promise, in a valid state, waiting for new requests to hook into. The solution employed by this patch is to make sure that no allocation operations whatsoever happen during the initial part of release_requests on behalf of the shared promise. Allocation is now deferred to first use, which relieves release_requests() from all allocation duties. All it needs to do is free the old object and signal to its user that an allocation is needed (by storing {} into the shared_promise). Fixes #1510 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <49771e51426f972ddbd4f3eeea3cdeef9cc3b3c6.1470238168.git.glauber@scylladb.com>	2016-08-03 20:40:30 +02:00
Tomasz Grabiec	9476bc5a31	Introduce --abort-on-lsa-bad-alloc command line option Useful for triggering core dump on allocation failure inside LSA, which makes it easier to debug allocation failures. They normally don't cause aborts, just fail the current operation, which makes it hard to figure out what was the cause of allocation failure. Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>	2016-08-03 17:26:44 +03:00
Tomasz Grabiec	e783b58e3b	Merge branch 'glommer/LSA-throttler-v6' from git@github.com:glommer/scylla.gi From Glauber: This is my new take at the "Move throttler to the LSA" series, except this one don't actually move anything anywhere: I am leaving all memtable conversion out, and instead I am sending just the LSA bits + LSA active reclaim. This should help us see where we are going, and then we can discuss all memtable changes in a series on its own, logically separated (and hopefully already integrated with virtual dirty). [tgrabiec: trivial merge conflicts in logalloc.cc]	2016-06-21 10:22:26 +02:00
Glauber Costa	579d121db8	LSA: export largest region We now keep the regions sorted by size, and the children region groups as well. Internally, the LSA has all information it needs to make size-based reclaim decisions. However, we don't do reclaim internally, but rather warn our user that a pressure situation is mounted. The user of a region_group doesn't need to evict the largest region in case of pressure and is free to do whatever it chooses - including nothing. But more likely than not, taking into account which region is the largest makes sense. This patch puts together this last missing piece of the puzzle, and exports the information we have internally to the user. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:51:00 -04:00
Glauber Costa	38a402307d	LSA: enhance region_group reclaimer We are currently just allowing the region_group to specify a throttle_threshold, that triggers throttling when a certain amount of memory is reached. We would like to notify the callers that such condition is reached, so that the callers can do something to alleviate it - like triggering flushes of their structures. The approach we are taking here is to pass a reclaimer instance. Any user of a region_group can specialize its methods start_reclaiming and stop_reclaiming that will be called when the region_group becomes under pressure or ceases to be, respectively. Now that we have such facility, it makes more sense to move the throttle_threshold here than having it separately. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:59 -04:00
Glauber Costa	6404028c6a	LSA: move subgroups to a heap as well When we decide to evict from a specific region_group due to excessive memory usage, we must also consider looking at each of their children (subgroups). It could very well be that most of memory is used by one of the subgroups, and we'll have to evict from there. We also want to make sure we are evicting from the biggest region of all, and not the biggest region in the biggest region_group. To understand why this is important, consider the case in which the regions are memtables associated with dirty region groups. It could be that a very big memtable was recently flushed, and a fairly small one took its place. That region group is still quite large because the memtable hasn't finished flushing yet, but that doesn't mean we should evict from it. To allow us to efficiently pick which region is the largest, each root of each subtree will keep track of its maximal score, defined as the maximum between our largest region total_space and the maximum maximal score of subtrees. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:13 -04:00
Glauber Costa	e1eab5c845	LSA: store regions in a heap for regions_group Currently, the regions in a region group are organized in a simple vector. We can do better by using a binomial heap, as we do for segments, and then updating when there is change. Internally to the LSA, we are in good position to always know when change happens, so that's really the best way to do it. The end game here, is to easily call for the reclaim of the largest offending region (potentially asynchronously). Because of that, we aren't really interested in the region occupancy, but in the region reclaimable occupancy instead: that's simply equal to the occupancy if the region is reclaimable, and 0 otherwise. Doing that effectively lists all non reclaimable regions in the end of the heap, in no particular order. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:13 -04:00
Glauber Costa	54d4d46cf7	LSA: move throttling code to LSA. The database code uses a throttling function to make sure that memory used for the dirty region never is over the limit. We track that with a region group, so it makes sense to move this as generic functionality into LSA. This patch implements the LSA-side functionality and a later patch will convert the current memtable throttler to use it. Unlike the current throttling mechanism, we'll not use a timer-based mechanism here. Aside from being more generic and friendlier towards other users, this is a good change for current memtable by itself. The constants - 10ms and 1MB chosen by the current throttler are arbitrary, and we would be better off without them. Let's discuss the merits of each separately: 1) 10ms timer: If we are throttling, we expect somebody to flush the memtables for memory to be released. Since we are in position to know exactly when a memtable was written, thus releasing memory, we can just call unthrottle at that point, instead of using a timer. 2) 1MB release threshold: we do that because we have no idea how much memory a request will use, so we put the cut somehow. However, because of 1) we don't call unthrottle through a timer anymore, and do it directly instead. This means that we can just execute the request and see how much memory it has used, with no need to guess. So we'll call unthrottle at the end of every request that was previously throttled. Writing the code this way also has the advantage that we need one less continuation in the common case of the database not being throttled. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:34:19 -04:00
Glauber Costa	01a658f51d	LSA: helper function for region_group current hierarchy walk converted, but more users will come. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Glauber Costa	741aa16748	LSA: allow a region_group to have a threshold for throttling specified Allocations will still be allowed if made directly, but callers will have the choice (in an upcoming patch) to proceed only if memory is below this threshold. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Glauber Costa	7cd0c0731e	region_group: delete move constructor Tomek correctly points out that since we are now using "this" in lambda captures, we should make the region_group not movable. We currently define a move constructor, but there are no users. So we should just remove them. copy constructor is already deleted, and so are the copy and move assignment operators. So by removing the move constructor, we should be fine. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-15 22:26:50 -04:00
Tomasz Grabiec	86b76171a8	lsa: Use the same step in both internal and external reclamations	2016-06-14 15:13:15 +02:00
Tomasz Grabiec	d74d902a01	lsa: Make reclamation step configurable	2016-06-14 15:13:14 +02:00
Piotr Jastrzebski	136b8148d2	Use idle CPU to compact LSA memory Register an idle CPU handler that compacts a single segment every time there's nothing better to execute on CPU. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <c26aa608a1e0752fb9e6db1833ef3ba1de95f161.1464169748.git.piotr@scylladb.com>	2016-05-26 12:43:53 +03:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Tomasz Grabiec	a0cba3c86f	logalloc: Introduce tracker::occupancy() Returns occupancy information for all memory allocated by LSA, including segment pools / zones.	2016-03-22 16:28:10 +01:00
Tomasz Grabiec	529c8b8858	logalloc: Rename tracker::occupancy() to region_occupancy()	2016-03-22 14:56:44 +01:00
Paweł Dziepak	83b004b2fb	lsa: avoid fragmenting memory Originally, lsa allocated each segment independently, which could result in high memory fragmentation. As a result many compaction and eviction passes may be needed to release a sufficiently big contiguous memory block. These problems are solved by introduction of segment zones, contiguous groups of segments. All segments are allocated from zones and the algorithm tries to keep the number of zones to a minimum. Moreover, segments can be migrated between zones or inside a zone in order to deal with fragmentation inside zone. Segment zones can be shrunk but cannot grow. Segment pool keeps a tree containing all zones ordered by their base addresses. This tree is used only by the memory reclaimer. There is also a list of zones that have at least one free segment that is used during allocation. Segment allocation doesn't have any preferences which segment (and zone) to choose. Each zone contains a free list of unused segments. If there are no zones with free segments a new one is created. Segment reclamation migrates segments from the zones higher in memory to the ones at lower addresses. The remaining zones are shrunk until the requested number of segments is reclaimed. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2015-12-08 19:31:40 +01:00
Avi Kivity	1c425d6b50	logalloc: allow allocating_section code blocks to return references	2015-11-15 19:10:24 +02:00
Avi Kivity	d5cf0fb2b1	Add license notices	2015-09-20 10:43:39 +03:00
Tomasz Grabiec	8e1b3e5475	lsa: Remove underscore from local variable names	2015-09-10 12:40:12 +03:00
Avi Kivity	6d0a2b5075	logalloc: don't invalidate merged region A region being merged can still be in use; but after merging, compaction_lock and the reclaim counter will no longer work. This can lead to use-after-compact-without-re-lookup errors. Fix by making the source region be the same as the target region; they will share compaction locks and reclaim counters, so lookup avoidance will still work correctly. Fixes #286.	2015-09-08 08:55:44 +02:00
Tomasz Grabiec	3b441416fa	lsa: Make segment size publicly accessible Some tests depend on segment size.	2015-09-06 21:25:44 +02:00
Tomasz Grabiec	c82325a76c	lsa: Make region evictor signal forward progress In some cases region may be in a state where it is not empty and nothing could be evicted from it. For example when creating the first entry, reclaimer may get invoked during creation before it gets linked. We therefore can't rely on emptiness as a stop condition for reclamation, the eviction function shall signal us if it made forward progress.	2015-09-06 21:25:44 +02:00
Tomasz Grabiec	d022a1a4a3	lsa: Introduce allocating_section Related to #259. In some cases we need to allocate memory and hold reclaim lock at the same time. If that region holds most of the reclaimable memory, allocations inside that code section may fail. allocating_section is a work-around of the problem. It learns how big reserves should be from past execution of critical section and tries to ensure proper reserves before entering the section.	2015-09-06 21:24:59 +02:00
Tomasz Grabiec	870e9e5729	lsa: Replace compaction_lock with broader reclaim_lock Disabling compaction of a region is currently done in order to keep the references valid. But disabling only compaction is not enough, we also need to disable eviction, as it also invalidates references. Rather than introducing another type of lock, compaction and eviction are controlled together, generalized as "reclaiming" (hence the reclaim_lock).	2015-09-01 17:29:04 +03:00
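
Several entries above (d022a1a4a3, 4ab8b255da, 87be474c19, 8d69d217af) describe allocating_section: it ensures a memory reserve before running a critical section, learns a larger reserve when the section fails with bad_alloc, and guarantees that references are invalidated on retry. The following is a minimal sketch of that idea only, not the code from logalloc. The names toy_region and toy_allocating_section, the byte-counting model, and the reserve-doubling policy are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for an LSA region: it only counts bytes.
// reclaim_counter is bumped whenever references may have been invalidated.
struct toy_region {
    std::size_t free_bytes = 0;
    std::size_t reclaimable_bytes = 0;
    unsigned long reclaim_counter = 0;

    // Move up to `want` bytes from reclaimable to free.
    // Models compaction/eviction, so it invalidates references.
    void reclaim(std::size_t want) {
        std::size_t got = want < reclaimable_bytes ? want : reclaimable_bytes;
        reclaimable_bytes -= got;
        free_bytes += got;
        ++reclaim_counter;
    }

    // Allocation fails with bad_alloc when free memory is insufficient.
    void allocate(std::size_t n) {
        if (n > free_bytes) {
            throw std::bad_alloc();
        }
        free_bytes -= n;
    }
};

// Hypothetical sketch of a section that remembers how much reserve
// its callers needed in the past.
class toy_allocating_section {
    std::size_t _reserve = 1;
public:
    std::size_t reserve() const { return _reserve; }
    void set_reserve(std::size_t r) { _reserve = r; }

    template <typename Func>
    auto operator()(toy_region& r, Func&& fn) {
        assert(_reserve > 0);
        for (;;) {
            // Ensure the learned reserve before entering the section.
            if (r.free_bytes < _reserve) {
                r.reclaim(_reserve - r.free_bytes);
            }
            if (r.free_bytes < _reserve) {
                throw std::bad_alloc(); // reserve cannot be met, give up
            }
            try {
                return fn();
            } catch (const std::bad_alloc&) {
                // Learn a bigger reserve (doubling is an assumed policy) and
                // make the invalidation of references on retry explicit.
                _reserve *= 2;
                ++r.reclaim_counter;
            }
        }
    }
};
```

A caller that cached references before entering the section can compare reclaim_counter before and after to decide whether it must look them up again. In this toy model a second run of the same section succeeds on the first attempt, because the reserve learned in the first run is established up front.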


62 Commits