scylladb

Author	SHA1	Message	Date
Avi Kivity	ae660eeec4	logalloc: reduce minimum lsa reserve in allocating_section to 1 Many workloads have fairly constant and small request sizes, so we don't need large reserves for them. These workloads suffer needlessly from the current large reserve of 10 segments (1.2MB) when they really need a few hundred bytes. Reduce the reserve to a minimum of 1 segment. Note that due to #8542 this can make a large difference. Consider a workload that has a 1000-byte footprint in cache. If we've just consumed some free memory and reduced the reserve to zero, then we'll evict about 50,000 objects before proceeding to compact. With the reserve reduced to 1, we'll evict 128 objects. All this for 1000 bytes of memory. Of course, #8542 should be fixed, but reducing the reserve provides some quick relief and makes sense even with the larger fix. The reserve will quickly grow for workloads that handle bigger requests, so they won't see an impact from the reduction. Closes #8572	2021-05-02 15:22:04 +02:00
Avi Kivity	ca0c006b37	logalloc: background reclaim Set up a coroutine in a new scheduling group to ensure there is a "cushion" of free memory. It reclaims in preemptible mode in order to reduce reactor stalls (contrast with synchronous reclaim, which cannot preempt until it has achieved its goal). The free memory target is arbitrarily set at 60MB. The reclaimer's shares are proportional to the distance from the free memory target, so a workload that allocates memory rapidly will have the background reclaimer working harder. I rolled my own condition variable here, mostly as an experiment. seastar::condition_variable requires several allocations, while the one here requires none. We should formalize it after we gain more experience with it.	2021-02-14 19:09:29 +02:00
Botond Dénes	7b56ed6057	utils: logalloc: add lsa_global_occupancy_stats() Allows querying the occupancy stats of all the lsa memory.	2020-11-17 15:13:21 +02:00
Avi Kivity	7ac59dcc98	lsa: decay reserves The log-structured allocator (LSA) reserves memory when performing operations, since its operations are performed with reclaiming disabled and if it runs out, it cannot evict cache to gain more. The amount of memory to reserve is remembered across calls so that it does not have to repeat the fail/increase-reserve/retry cycle for every operation. However, we currently lack decaying the amount to reserve. This means that if a single operation increased the reserve in the distant past, all current operations also require this large reserve. Large reserves are expensive since they can cause large amounts of cache to be evicted. This patch adds reserve decay. The time-to-decay is inversely proportional to reserve size: 10GB/reserve. This means that a 20MB reserve is halved after 500 operations (10GB/20MB) while a 20kB reserve is halved after 500,000 operations (10GB/20kB). So large, expensive reserves are decayed quickly while small, inexpensive reserves are decayed slowly to reduce the risk of allocation failures and exceptions. A unit test is added. Fixes #325.	2020-09-08 15:59:25 +03:00
Pavel Emelyanov	3237796e00	region: Mark trivial noexcept methods as such Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-09 14:41:37 +03:00
Pavel Emelyanov	7696ed1343	shard_tracker: Configure it in one go Instead of doing 3 smp::invoke_on_all-s and duplicating tracker::impl API for the tracker itself, introduce the tracker::configure, simplify the tracker configuration and narrow down the public tracker API. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200528185442.10682-1-xemul@scylladb.com>	2020-05-29 14:50:43 +02:00
Avi Kivity	1799cfa88a	logalloc: use namespace-scope seastar::idle_cpu_handler and related rather than reactor scope This allows us to drop a #include <reactor.hh>, reducing compile time. Several translation units that lost access to required declarations are updated with the required includes (this can be an include of reactor.hh itself, in case the translation unit that lost it got it indirectly via logalloc.hh) Ref #1.	2020-04-05 12:45:08 +03:00
Rafael Ávila de Espíndola	8da235e440	everywhere: Use futurize_invoke instead of futurize<T>::invoke No functionality change, just simpler. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200330165308.52383-1-espindola@scylladb.com>	2020-04-03 15:53:35 +02:00
Rafael Ávila de Espíndola	c5795e8199	everywhere: Replace engine().cpu_id() with this_shard_id() This is a bit simpler and might allow removing a few includes of reactor.hh. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200326194656.74041-1-espindola@scylladb.com>	2020-03-27 11:40:03 +03:00
Rafael Ávila de Espíndola	eca0ac5772	everywhere: Update for deprecated apply functions Now apply is only for tuples, for varargs use invoke. This depends on the seastar changes adding invoke. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200324163809.93648-1-espindola@scylladb.com>	2020-03-25 08:49:53 +02:00
Botond Dénes	93039a085d	utils/logalloc: run_when_memory_available(): remove default timeout	2020-02-27 18:36:32 +02:00
Avi Kivity	454074f284	Merge "database: Avoid OOMing with flush continuations after failed memtable flush" from Tomasz " The original fix (`10f6b125c8`) didn't take into account that if there was a failed memtable flush (Refs flush) but is not a flushable memtable because it's not the latest in the memtable list. If that happens, it means no other memtable is flushable as well, cause otherwise it would be picked due to evictable_occupancy(). Therefore the right action is to not flush anything in this case. Suspected to be observed in #4982. I didn't manage to reproduce after triggering a failed memtable flush. Fixes #3717 " * tag 'avoid-ooming-with-flush-continuations-v2' of github.com:tgrabiec/scylla: database: Avoid OOMing with flush continuations after failed memtable flush lsa: Introduce operator bool() to occupancy_stats lsa: Expose region_impl::evictable_occupancy in the region class	2020-01-08 16:58:54 +02:00
Juliusz Stasiewicz	430b2ad19d	commitlog+region_group: timeout exceptions with names `segment_manager' now uses a decorated version of `timed_out_error' with hardcoded name. On the other hand `region_group' uses named `on_request_expiry' within its `expiring_fifo'.	2019-12-03 19:07:19 +01:00
Tomasz Grabiec	fb28543116	lsa: Introduce operator bool() to occupancy_stats	2019-11-22 12:08:28 +01:00
Tomasz Grabiec	a69fda819c	lsa: Expose region_impl::evictable_occupancy in the region class	2019-11-22 12:08:10 +01:00
Tomasz Grabiec	eb08ab7ed9	lsa: Assert no cross-shard region locking We observed an abort on bad_alloc which was not caused by real OOM, but could be explained by cache region being locked from a different shard, which is not allowed, concurrently with memory reclamation. It's impossible now to prove this, or, if that was indeed the case, to determine which code path was attempting such lock. This patch adds an assert which would catch such incorrect locking at the attempt. Refs #4978	2019-09-23 12:51:29 +02:00
Tomasz Grabiec	f7e79b07d1	lsa: Respect the reclamation step hint from seastar allocator This will allow us to reduce the amount of segment compaction when reclaiming on behalf of a large allocation because we'll evict much more up front. Tests: - unit (dev) Reviewed-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <1559906584-16770-1-git-send-email-tgrabiec@scylladb.com>	2019-06-23 16:03:06 +03:00
Tomasz Grabiec	dafe22dd83	lsa: Fix spurious abort with --enable-abort-on-lsa-bad-alloc allocate_segment() can fail even though we're not out of memory, when it's invoked inside an allocating section with the cache region locked. That section may later succeed when retried after memory reclamation. We should ignore bad_alloc thrown inside the allocating section body and fail only when the whole section fails. Fixes #2924 Message-Id: <1550597493-22500-1-git-send-email-tgrabiec@scylladb.com>	2019-02-20 12:53:49 +02:00
Duarte Nunes	fa2b0384d2	Replace std::experimental types with C++17 std version. Replace stdx::optional and stdx::string_view with the C++ std counterparts. Some instances of boost::variant were also replaced with std::variant, namely those that called seastar::visit. Scylla now requires GCC 8 to compile. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20190108111141.5369-1-duarte@scylladb.com>	2019-01-08 13:16:36 +02:00
Tomasz Grabiec	1e50f85288	database: Make soft-pressure memtable flusher not consider already flushed memtables The flusher picks the memtable list which contains the largest region according to region_impl::evictable_occupancy().total_space(), which follows region::occupancy().total_space(). But only the latest memtable in the list can start flushing. It can happen that the memtable corresponding to the largest region was already flushed to an sstable (flush permit released), but not yet fsynced or moved to cache, so it's still in the memtable list. The latest memtable in the winning list may be small, or empty, in which case the soft pressure flusher will not be able to make much progress. There could be other memtable lists with non-empty (flushable) latest memtables. This can lead to writes unnecessarily blocking on dirty. I observed this for the system memtable group, where it's easy for the memtables to overshoot small soft pressure limits. The flusher kept trying to flush empty memtables, while the previous non-empty memtable was still in the group. The CPU scheduler makes this worse, because it runs memtable_to_cache in a separate scheduling group, so it further defers in time the removal of the flushed memtable from the memtable list. This patch fixes the problem by making regions corresponding to memtables which started flushing report evictable_occupancy() as 0, so that they're picked by the flusher last. Fixes #3716. Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>	2018-08-26 11:02:34 +03:00
Avi Kivity	0fc54aab98	logalloc: run releaser() in user-provided scheduling group Let the user specify which scheduling group should run the releaser, since it is running functions on the user's behalf. Perhaps a cleaner interface is to require the user to call a long-running function for the releaser, and so we'd just inherit its scheduling group, but that's a much bigger change.	2018-07-31 11:57:58 +03:00
Gleb Natapov	b38ced0fcd	Configure logalloc memory size during initialization	2018-06-11 15:34:14 +03:00
Tomasz Grabiec	4fdd61f1b0	lsa: Expose counters for allocation and compaction throughput Allow observing amplification induced by segment compaction.	2018-05-15 21:49:01 +02:00
Avi Kivity	2baa16b371	logalloc: introduce prime_segment_pool() To segregate std and lsa allocations, we prime the segment pool during initialization so that lsa will release lower-addressed memory to std, rather than lsa and std competing for memory at random addresses. However, tests often evict all of lsa memory for their own purposes, which defeats this priming. Extract the functionality into a new prime_segment_pool() function for use in tests that rely on allocation segregation.	2018-04-07 14:52:58 +03:00
Avi Kivity	54db0f3d30	logalloc: reduce segment size to 128k Reducing the segment size reduces the time needed to compact segments, and increases the number of segments that can be compacted (and so the probability of finding low-occupancy segments). 128k is the size of I/O buffers and of thread stacks, so we can't go lower than that without more significant changes.	2018-04-07 14:52:58 +03:00
Avi Kivity	c9aa9f0d86	Revert "logalloc: capture current scheduling group for deferring function" This reverts commit `3b53f922a3`. It's broken in two ways: 1. concrete_allocating_function::allocate()'s caller, the region_group::start_releaser() loop, will delete the object as soon as it returns; however we scheduled some work depending on `this` in a separate continuation (via with_scheduling_group()) 2. the calling loop's termination condition depends on the work being done immediately, not later.	2018-03-29 16:08:12 +03:00
Glauber Costa	3b53f922a3	logalloc: capture current scheduling group for deferring function When we call run_when_memory_available, it is entirely possible that the caller is doing that inside a scheduling_group. If we don't defer we will execute correctly. But if we do defer, the current code will execute - in the future - with the default scheduling group. This patch fixes that by capturing the caller scheduling group and making sure the function is executed later using it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-03-20 16:58:35 -04:00
Paweł Dziepak	dcd79af8ed	lsa: optimise disabling reclamation and invalidation counter Most of the lsa gory details are hidden in utils/logalloc.cc. That includes the actual implementation of a lsa region: region_impl. However, there is code in the hot path that often accesses the _reclaiming_enabled member as well as its base class allocation_strategy. In order to optimise those accesses another class is introduced: basic_region_impl that inherits from allocation_strategy and is a base of region_impl. It is defined in utils/logalloc.hh so that it is publicly visible and its member functions are inlineable from anywhere in the code. This class is supposed to be as small as possible, but contain all members and functions that are accessed from the fast path and should be inlined.	2018-01-30 18:33:26 +01:00
Paweł Dziepak	d825ae37bf	lsa: split alloc section into reserving and reclamation-disabled parts Allocating sections reserves certain amount of memory, then disables reclamation and attempts to perform given operation. If that fails due to std::bad_alloc the reserve is increased and the operation is retried. Reserving memory is expensive while just disabling reclamation isn't. Moreover, the code that runs inside the section needs to be safely retryable. This means that we want the amount of logic running with reclamation disabled as small as possible, even if it means entering and leaving the section multiple times. In order to reduce the performance penalty of such solution the memory reserving and reclamation disabling parts of the allocating sections are separated.	2018-01-30 18:33:26 +01:00
Tomasz Grabiec	5c85e9c2db	lsa: Expose max_zone_segments for tests	2018-01-16 13:17:20 +01:00
Tomasz Grabiec	99708cc498	lsa: Expose tracker::non_lsa_used_space() So that it can be used in unit tests.	2018-01-16 13:17:20 +01:00
Glauber Costa	80c4a211d8	consolidate timeout_clock At the moment, various different subsystems use their different ideas of what a timeout_clock is. This makes it a bit harder to pass timeouts between them because although most are actually a lowres_clock, that is not guaranteed to be the case. As a matter of fact, the timeout for restricted reads is expressed as nanoseconds, which is not a valid duration in the lowres_clock. As a first step towards fixing this, we'll consolidate all of the existing timeout_clocks in one, now called db::timeout_clock. Other things that tend to be expressed in terms of that clock--like the fact that the maximum time_point means no timeout and a semaphore that wait()s with that resolution are also moved to the common header. In the upcoming patch we will fix the restricted reader timeouts to be expressed in terms of the new timeout_clock. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Tomasz Grabiec	8d69d217af	lsa: Guarantee invalidated references on allocating section retry There is existing code (e.g. use of partition_snapshot_row_cursor in cache_streamed_mutation) which assumes that references will be invalidated when bad_alloc is thrown from allocating_section. That is currently the case because on retry we will attempt memory reclamation which will invalidate references either through compaction or eviction. Make this guarantee explicit.	2017-11-13 20:55:13 +01:00
Tomasz Grabiec	87be474c19	lsa: Move reclaim counter concept to allocation_strategy level So that generic code can detect invalidation of references. Also, to allow reusing the same mechanism for signalling external reference invalidation.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	5d2f2bc90b	lsa: Mark region::merge() as noexcept It seems to satisfy this, and row_cache::do_update() will rely on it to simplify error handling. Message-Id: <1504023113-30374-1-git-send-email-tgrabiec@scylladb.com>	2017-08-29 19:17:17 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Tomasz Grabiec	7aa286439f	lsa: Add getter for region's eviction function	2017-04-20 14:51:42 +02:00
Avi Kivity	844529fe33	logalloc: avoid auto in function argument declaration 'auto' in a non-lambda function argument is not legal C++, and is hard to read besides. Replace with the right type. Since the right type is private, add some friendship.	2017-04-17 23:18:44 +03:00
Tomasz Grabiec	4ab8b255da	lsa: Allow adjusting reserves in allocating_section	2017-03-16 10:21:10 +01:00
Avi Kivity	7a00dd6985	Merge "Avoid avalanche of tasks after memtable flush" from Tomasz "Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single releasing thread. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. I saw this happening in a write workload from issue #2021 where the number of request releasing threads grew into thousands. 
Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs." * tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev: tests: lsa: Add test for reclaimer starting and stopping tests: lsa: Add request releasing stress test lsa: Avoid avalanche releasing of requests lsa: Move definitions to .cc lsa: Simplify hard pressure notification management lsa: Do not start or stop reclaiming on hard pressure tests: lsa: Adjust to take into account that reclaimers are run synchronously lsa: Document and annotate reclaimer notification callbacks tests: lsa: Use with_timeout() in quiesce()	2017-02-02 17:49:31 +02:00
Tomasz Grabiec	e40fb438f5	lsa: Avoid avalanche releasing of requests Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single thread of execution. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when it contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. Refs #2021. Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. 
It executes blocked requests until pressure occurs. The logic for notification across the hierarchy was replaced by calling region_group::notify_relief() from region_group::update() on the broadest relieved group.	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	d55baa0cd1	lsa: Move definitions to .cc	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	8f8b111b33	lsa: Simplify hard pressure notification management The hard pressure was only signalled on region group when run_when_memory_available() was called after the pressure condition was met. So the following loop is always an infinite loop rather than stopping when enough is allocated to cause pressure: `while (!gr.under_pressure()) { region.allocate(...); }` It's cleaner if pressure notification works not only if run_when_memory_available() is used but whenever the condition changes, like we do for the soft pressure. There is a comment in run_when_memory_available() which gives reasons why notifications are called from there, but I think those reasons no longer hold: - we already notify on soft pressure conditions from update(), and if that is safe, notifying about hard pressure should also be safe. I checked and it looks safe to me. - avoiding notification in the rare case when we stopped writing right after crossing the threshold doesn't seem beneficial. It's unlikely in the first place, and one could argue it's better to actually flush now so that when writes resume they will not block.	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	9aa1be5d08	lsa: Do not start or stop reclaiming on hard pressure We already call these when crossing the soft threshold. We shouldn't stop reclaiming when hard pressure is gone because soft pressure may still be present. Calling start_reclaiming() on hard pressure is unnecessary because soft pressure also starts it, and when there is hard pressure there is also soft pressure.	2017-02-01 17:40:15 +01:00
Amnon Heiman	45b6070832	Merge seastar upstream * seastar 397685c...c1dbd89 (13): > lowres_clock: drop cache-line alignment for _timer > net/packet: add missing include > Merge "Adding histogram and description support" from Amnon > reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&' > Set the option '--server' of tests/tcp_sctp_client to be required > core/memory: Remove superfluous assignment > core/memory: Remove dead code > core/reactor: Use logger instead of cerr > fix inverted logic in overprovision parameter > rpc: fix timeout checking condition > rpc: use lowres_clock instead of high resolution one > semaphore: make semaphore's clock configurable > rpc: detect timedout outgoing packets earlier Includes treewide change to accommodate rpc changing its timeout clock to lowres_clock. Includes fixup from Amnon: collectd api should use the metrics getters As part of the preparation for the change in the metrics layer, this changes the way the collectd api uses the metrics value to use the getters instead of calling the member directly. This will be important when the internal implementation is changed from union to variant. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>	2017-02-01 14:39:08 +02:00
Tomasz Grabiec	ed9ff19467	lsa: Document and annotate reclaimer notification callbacks They are called from region_group::update(), so must be alloc-free and noexcept.	2017-01-30 19:18:07 +01:00
Amnon Heiman	e19fa02a17	remove scollectd from headers As the metrics migration progressed, some includes of scollectd.hh were left behind. Because of the nature of the scollectd implementation, those includes bring a lot of code with them into the header files and eventually into many source files. This patch removes those includes and adds a missing include to storage_proxy.cc. The fact that the compiler didn't complain is an indication of the problematic nature of those includes in the first place. Before this patch, a change in metrics.hh would cause 169 files to recompile; after this change, 17. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>	2017-01-17 17:39:47 +02:00
Tomasz Grabiec	e14caaef60	utils/logalloc: Add ability to timeout run_when_memory_available() task	2016-11-29 16:40:58 +01:00
Avi Kivity	176fca5775	logalloc: use correct header for unique_ptr <bits/unique_ptr.hh> is a libstdc++ internal header. Use <memory> instead.	2016-11-27 23:08:04 +02:00
Glauber Costa	f86c9e36f4	logalloc: allow region group reclaimer to specify a soft limit The region_group_reclaimer will let us know every time we are over the limit we have specified for memory usage. However, for some applications, we would be interested in knowing about memory build up earlier, so we can start doing something about it before we reach that condition. This patch introduces soft limit notifications for the region_group_reclaimer. After this patch is applied, start_reclaim() is called earlier, and stop_reclaim() later, after the soft condition is abated. There are methods that allow one to easily test if the pressure condition is a soft limit condition or a hard, threshold condition and act accordingly. Whether to act on both conditions or just one of them is up to the application. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:23 -05:00
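
The decay arithmetic quoted in commit 7ac59dcc98 ("lsa: decay reserves") can be checked with a small sketch. It assumes decimal units (10GB = 10^10 bytes, 20MB = 2*10^7 bytes, 20kB = 2*10^4 bytes), which is what makes the quoted figures come out exactly. The names decay_budget_bytes and operations_until_halved are hypothetical and are not taken from logalloc.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical constant: the "10GB" figure from the commit message,
// interpreted in decimal units (an assumption).
constexpr std::uint64_t decay_budget_bytes = 10000000000ull;

// Number of operations after which a reserve of the given size is halved,
// following the "10GB/reserve" rule stated in the commit message.
// A zero reserve has nothing to decay, so it is reported as "never".
constexpr std::uint64_t operations_until_halved(std::uint64_t reserve_bytes) {
    return reserve_bytes == 0
        ? std::numeric_limits<std::uint64_t>::max()
        : decay_budget_bytes / reserve_bytes;
}

// The two data points given in the commit message.
static_assert(operations_until_halved(20000000ull) == 500, "20MB reserve");
static_assert(operations_until_halved(20000ull) == 500000, "20kB reserve");
```

The inverse proportionality is the point of the design: a large reserve, which is expensive because it can force cache eviction, decays within hundreds of operations, while a small one is kept for much longer.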
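
Commits d825ae37bf and 7ac59dcc98 describe the allocating-section cycle: reserve memory, run the operation with reclamation disabled, and on std::bad_alloc increase the reserve and retry, remembering the reserve across calls. A minimal sketch of that control flow follows. The class allocating_section_sketch is a hypothetical stand-in and not ScyllaDB's allocating_section; doubling the reserve and the max_reserve cap are assumptions, since the commit messages only say the reserve "is increased".

```cpp
#include <cstddef>
#include <new>

// Hypothetical stand-in for an LSA allocating section (illustration only).
class allocating_section_sketch {
    std::size_t _reserve = 1;  // segments to reserve; remembered across calls
    std::size_t _retries = 0;  // how many times an operation had to be retried
public:
    // Runs op(reserve). In the real allocator the reserve would be secured
    // first (expensive) and reclamation disabled while op runs (cheap), so
    // op must be safe to retry from scratch.
    template <typename Func>
    auto run(Func&& op, std::size_t max_reserve = 1024) {
        for (;;) {
            try {
                return op(_reserve);
            } catch (const std::bad_alloc&) {
                if (_reserve >= max_reserve) {
                    throw;  // the whole section failed
                }
                _reserve *= 2;  // assumption: growth policy is doubling
                ++_retries;
            }
        }
    }
    std::size_t reserve() const { return _reserve; }
    std::size_t retries() const { return _retries; }
};
```

For example, an operation that needs a reserve of at least 8 segments fails three times (at 1, 2, and 4) before succeeding. A second call on the same object starts at 8 and succeeds immediately, which is the behaviour that makes reserve decay necessary.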

94 Commits