Commit Graph

204 Commits

Avi Kivity
290897ddbc logalloc: background reclaim: use default scheduling group for adjusting shares
If the shares are currently low, we might not get enough CPU time to
adjust the shares in time.

This is currently a no-op, since Seastar runs the callback outside
scheduling groups (and only uses the scheduling group for inherited
continuations); but it's better to be insulated against such details.
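
A sketch of the idea, assuming Seastar's with_scheduling_group() and
default_scheduling_group() helpers (illustrative, not the actual Scylla code):

    #include <seastar/core/future.hh>
    #include <seastar/core/scheduling.hh>
    // with_scheduling_group() comes from seastar's future utilities

    seastar::future<> adjust_reclaim_shares(seastar::scheduling_group reclaim_sg,
                                            float new_shares) {
        // Run the adjustment under the default group, so a reclaim group
        // whose shares are currently very low cannot starve it.
        return seastar::with_scheduling_group(seastar::default_scheduling_group(),
                [reclaim_sg, new_shares] () mutable {
            reclaim_sg.set_shares(new_shares);
        });
    }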
2021-03-15 13:54:49 +02:00
Avi Kivity
a87f6498c3 logalloc: background reclaim: log shares adjustment under trace level
Useful when debugging, but too noisy at any other time.
2021-03-15 13:54:49 +02:00
Avi Kivity
ce1b1d6ec4 logalloc: background reclaim: fix shares not updated by periodic timer
adjust_shares() thinks it needs to do nothing if the main loop
is running, but in reality it can only avoid waking the main loop;
it still needs to adjust the shares unconditionally. Otherwise,
the background reclaim shares can get locked into a low value.

Fix by splitting the conditional into two.
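
A sketch of the fix's shape (member and helper names are hypothetical):

    #include <seastar/core/scheduling.hh>

    struct background_reclaimer {
        seastar::scheduling_group _sg;
        bool _main_loop_running = false;
        void wake_main_loop();  // hypothetical

        void adjust_shares(float shares) {
            _sg.set_shares(shares);     // always update, or shares go stale
            if (!_main_loop_running) {  // only the wakeup is skippable
                wake_main_loop();
            }
        }
    };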
2021-03-15 13:54:37 +02:00
Avi Kivity
cb4e1bb0b9 logalloc: reduce gap between std min_free and logalloc min_free
With the larger gap, logalloc reserved more memory for std than the
threshold at which background reclaim runs, so background reclaim was
triggered only rarely.

With the gap reduced, background reclaim is constantly running in
an allocating workload (e.g. cache misses).
2021-02-14 19:09:29 +02:00
Avi Kivity
ca0c006b37 logalloc: background reclaim
Set up a coroutine in a new scheduling group to ensure there is
a "cushion" of free memory. It reclaims in preemptible mode in
order to reduce reactor stalls (contrast with synchronous reclaim,
which cannot preempt until it has achieved its goal).

The free memory target is arbitrarily set at 60MB. The reclaimer's
shares are proportional to the distance from the free memory target;
so a workload that allocates memory rapidly will have the background
reclaimer working harder.

I rolled my own condition variable here, mostly as an experiment.
seastar::condition_variable requires several allocations, while
the one here requires none. We should formalize it after we gain
more experience with it.
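
A sketch of the loop under the commit's stated parameters; the condition
variable and the reclaim step are hypothetical stand-ins:

    #include <cstddef>
    #include <seastar/core/coroutine.hh>
    #include <seastar/coroutine/maybe_yield.hh>
    #include <seastar/core/memory.hh>
    #include <seastar/core/scheduling.hh>

    constexpr size_t free_memory_target = 60 * 1024 * 1024;  // 60MB

    seastar::future<> wait_for_pressure();  // hypothetical allocation-free condvar
    void reclaim_one_step();                // hypothetical preemptible reclaim step

    seastar::future<> background_reclaim(seastar::scheduling_group sg) {
        while (true) {
            size_t free = seastar::memory::stats().free_memory();
            if (free >= free_memory_target) {
                co_await wait_for_pressure();
                continue;
            }
            // Shares grow with the distance from the target: a rapidly
            // allocating workload makes the reclaimer work harder.
            sg.set_shares(1000.0f * (free_memory_target - free) / free_memory_target);
            reclaim_one_step();
            co_await seastar::coroutine::maybe_yield();
        }
    }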
2021-02-14 19:09:29 +02:00
Avi Kivity
35076dd2d3 logalloc: preemptible reclaim
Add an option (currently unused by all callers) to preempt
reclaim. If reclaim is preempted, it just stops what it is
doing, even if it reclaimed nothing. This is useful for background
reclaim.

Currently, preemption checks are on segment granularity. This is
probably too coarse, and should be refined later, but is already
better than the current granularity, which does not allow preemption
until the entire requested memory size has been reclaimed.
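
A sketch of the option's effect (the real code uses LSA internals; names
here are stand-ins):

    #include <cstddef>
    #include <seastar/core/preempt.hh>

    size_t reclaim_one_segment();  // hypothetical: frees one segment, returns bytes

    size_t reclaim(size_t bytes_to_reclaim, bool preemptible) {
        size_t reclaimed = 0;
        while (reclaimed < bytes_to_reclaim) {
            // Segment-granularity preemption check: bail out as soon as
            // the reactor wants the CPU back, even with nothing reclaimed.
            if (preemptible && seastar::need_preempt()) {
                break;
            }
            reclaimed += reclaim_one_segment();
        }
        return reclaimed;
    }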
2021-02-14 19:09:29 +02:00
Piotr Jastrzebski
f2b98b0aad Replace disable_failure_guard with scoped_critical_alloc_section
scoped_critical_alloc_section was recently introduced to replace
disable_failure_guard and made the old class deprecated.

This patch replaces all occurrences of disable_failure_guard with
scoped_critical_alloc_section.

Without this patch the build prints many warnings like:
warning: 'disable_failure_guard' is deprecated: Use scoped_critical_section instead [-Wdeprecated-declarations]

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <ca2a91aaf48b0f6ed762a6aa687e6ac5e936355d.1605621284.git.piotr@scylladb.com>
2020-11-17 16:01:25 +02:00
Botond Dénes
7b56ed6057 utils: logalloc: add lsa_global_occupancy_stats()
Allows querying the occupancy stats of all the lsa memory.
2020-11-17 15:13:21 +02:00
Avi Kivity
7ac59dcc98 lsa: decay reserves
The log-structured allocator (LSA) reserves memory when performing
operations, since its operations are performed with reclaiming disabled,
and if it runs out, it cannot evict cache to gain more. The amount of
memory to reserve is remembered across calls so that it does not have
to repeat the fail/increase-reserve/retry cycle for every operation.

However, we currently never decay the reserved amount. This means
that if a single operation increased the reserve in the distant past,
all current operations also require this large reserve. Large reserves
are expensive since they can cause large amounts of cache to be evicted.

This patch adds reserve decay. The time-to-decay is inversely proportional
to reserve size: 10GB/reserve. This means that a 20MB reserve is halved
after 500 operations (10GB/20MB) while a 20kB reserve is halved after
500,000 operations (10GB/20kB). So large, expensive reserves are decayed
quickly while small, inexpensive reserves are decayed slowly to reduce
the risk of allocation failures and exceptions.
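
The arithmetic, as a sketch with illustrative bookkeeping: charge every
operation its reserve size, and halve the reserve once roughly 10GB of
charge has accumulated.

    #include <cstddef>
    #include <cstdint>

    constexpr uint64_t decay_work = uint64_t(10) * 1024 * 1024 * 1024;  // 10GB

    void account_operation(uint64_t& accumulated, size_t& reserve) {
        accumulated += reserve;
        if (accumulated >= decay_work && reserve > 0) {
            reserve /= 2;       // 20MB reserve: ~500 ops; 20kB: ~500,000 ops
            accumulated = 0;
        }
    }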

A unit test is added.

Fixes #325.
2020-09-08 15:59:25 +03:00
Pavel Emelyanov
812eed27fe code: Force formatting of pointer in .debug and .trace
... and tests. Printing a pointer in logs is considered to be a bad practice,
so the proposal is to keep this explicit (with fmt::ptr) and allow it for
.debug and .trace cases.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-08-26 20:44:11 +03:00
Rafael Ávila de Espíndola
30722b8c8e logalloc: Add disable_failure_guard during a few tls variable initialization
The constructors of these global variables can allocate memory. Since
the variables are thread_local, they are initialized at first use.

There is nothing we can do if these allocations fail, so use
disable_failure_guard.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200729184901.205646-1-espindola@scylladb.com>
2020-07-31 15:49:21 +02:00
Pavel Emelyanov
d908646b28 logalloc: Compact segments on reclaim instead of migration
When reclaiming segments back to seastar, the code tries to free the segments
sequentially. For this it walks the segments from left to right and frees
them, but every time a non-empty segment is met it gets migrated to another
segment, that's allocated from the right end of the list.

This is sometimes a waste of cycles: the destination segment inherits the
holes from the source one, and thus it will itself be compacted some time in the
future. Why not compact it right at the reclamation time? It will take the
same time or less, but will result in better compaction.

To achieve this, the segment to be reclaimed is compacted with the existing
compact_segment_locked() code with some special care around it.

1. The allocation of new segments from seastar is locked
2. The reclaiming of segments with evict-and-compact is locked as well
3. The emergency pool is opened (the compaction is called with non-empty
   reserve to avoid bad_alloc exception throw in the middle of compaction)
4. The segment is forcibly removed from the histogram and the closed_occupancy
   is updated just like it is with general compaction

The segments-migration auxiliary code can be removed after this.
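
A sketch of the guarded path; everything except compact_segment_locked()
is a hypothetical stand-in:

    struct segment;
    struct allocation_guard { allocation_guard(); ~allocation_guard(); };  // guard 1
    struct reclaim_guard { reclaim_guard(); ~reclaim_guard(); };           // guard 2
    template <typename F> void with_emergency_reserve(F f) { f(); }        // guard 3 (stub)
    void remove_from_histogram(segment*);                                  // step 4
    void compact_segment_locked(segment*);  // the existing compaction code

    void reclaim_segment(segment* seg) {
        allocation_guard no_new_segments;   // 1. no growth from seastar
        reclaim_guard no_reentry;           // 2. no nested evict-and-compact
        with_emergency_reserve([&] {        // 3. avoid bad_alloc mid-compaction
            remove_from_histogram(seg);     // 4. keep closed_occupancy consistent
            compact_segment_locked(seg);
        });
    }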

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 14:07:35 +03:00
Pavel Emelyanov
4db6ef7b6d logalloc: Introduce RAII allocation lock
The lock disables the segment_pool to call for more segments from
the underlying allocator.

To be used in the next patch.
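
A minimal sketch of such a guard, assuming a counter on the pool (not
the actual Scylla type); a counter lets locks nest:

    struct segment_pool {
        int allocation_lock_count = 0;
        bool can_allocate() const { return allocation_lock_count == 0; }
    };

    class allocation_lock {
        segment_pool& _pool;
    public:
        explicit allocation_lock(segment_pool& p) : _pool(p) {
            ++_pool.allocation_lock_count;  // pool must not grow while held
        }
        ~allocation_lock() {
            --_pool.allocation_lock_count;
        }
        allocation_lock(const allocation_lock&) = delete;
        allocation_lock& operator=(const allocation_lock&) = delete;
    };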

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 14:07:30 +03:00
Pavel Emelyanov
2005aca444 logalloc: Shuffle code around region::impl::compact
This includes 3 small changes to facilitate the next patches:
- rename region::impl::compact into compact_segment_locked
- merge the former compact with compact_single_segment_locked
- move the log print and stats update into compact_segment_locked

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 14:06:45 +03:00
Pavel Emelyanov
8c81c6b7aa logalloc: Do not lock reclaimer twice
The tracker::impl::reclaim is already in reclaim-locked
section, no need for yet another nested lock.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 13:14:33 +03:00
Pavel Emelyanov
0392c5ca77 logalloc: Do not calculate object size twice
When walking objects during compaction, the migrator->size() virtual fn is
called twice.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 13:14:33 +03:00
Pavel Emelyanov
81c9c4c7b2 logalloc: Do not convert obj_desc to migrator back and forth
When calling alloc_small the migrator is passed just to get the
object descriptor, but during compaction the descriptor is already
at hand, so there is no need to fetch it again.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 13:14:33 +03:00
Pavel Emelyanov
878f8d856a logalloc: Report reclamation timing with rate
The timer.stop() call, which reports not only the time taken but also
the reclamation rate, was unintentionally dropped while its scope was
expanded (c70ebc7c).

Bring it back (and mark compact_and_evict_locked as private while
at it).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200528185331.10537-1-xemul@scylladb.com>
2020-05-29 14:50:43 +02:00
Pavel Emelyanov
7696ed1343 shard_tracker: Configure it in one go
Instead of doing three smp::invoke_on_all calls and duplicating
the tracker::impl API on the tracker itself, introduce
tracker::configure, simplify the tracker configuration,
and narrow down the public tracker API.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200528185442.10682-1-xemul@scylladb.com>
2020-05-29 14:50:43 +02:00
Avi Kivity
1799cfa88a logalloc: use namespace-scope seastar::idle_cpu_handler and related rather than reactor scope
This allows us to drop a #include <reactor.hh>, reducing compile time.

Several translation units that lost access to required declarations
are updated with the required includes (this can be an include of
reactor.hh itself, in case the translation unit that lost it got it
indirectly via logalloc.hh).

Ref #1.
2020-04-05 12:45:08 +03:00
Avi Kivity
c020b4e5e2 logalloc: increase capacity of _regions vector outside reclaim lock
Reclaim consults the _regions vector, so we don't want it moving around while
allocating more capacity. For that we take the reclaim lock. However, that
can cause a false-positive OOM during startup:

1. all memory is allocated to LSA as part of priming (2baa16b371)
2. the _regions vector is resized from 64k to 128k, requiring a segment
   to be freed (plenty are free)
3. but reclaiming_lock is taken, so we cannot reclaim anything.

To fix, resize the _regions vector outside the lock.
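
A sketch of the fix's shape (the lock type is a hypothetical stand-in):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct region_impl;
    struct reclaim_guard { reclaim_guard(); ~reclaim_guard(); };  // hypothetical

    void add_region(std::vector<region_impl*>& regions, region_impl* r) {
        if (regions.size() == regions.capacity()) {
            // May allocate (and free a segment); reclaim is still allowed here.
            regions.reserve(std::max<std::size_t>(1, regions.capacity() * 2));
        }
        reclaim_guard guard;      // safe now: push_back below cannot reallocate
        regions.push_back(r);
    }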

Fixes #6003.
Message-Id: <20200311091217.1112081-1-avi@scylladb.com>
2020-03-11 12:29:31 +02:00
Rafael Ávila de Espíndola
090164791c logalloc: Store unused ids in a std::vector
There doesn't seem to be any requirement for how unused ids are
reused, so we may as well use the simpler type.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200129211154.47907-1-espindola@scylladb.com>
2020-01-30 10:31:16 +02:00
Avi Kivity
454074f284 Merge "database: Avoid OOMing with flush continuations after failed memtable flush" from Tomasz
"
The original fix (10f6b125c8) didn't take into account that a
memtable whose flush failed (Refs flush) may not be a flushable
memtable because it's not the latest in the memtable list. If that
happens, it means no other memtable is flushable either, because
otherwise it would have been picked due to evictable_occupancy().
Therefore the right action is to not flush anything in this case.

Suspected to have been observed in #4982. I didn't manage to
reproduce it after triggering a failed memtable flush.

Fixes #3717
"

* tag 'avoid-ooming-with-flush-continuations-v2' of github.com:tgrabiec/scylla:
  database: Avoid OOMing with flush continuations after failed memtable flush
  lsa: Introduce operator bool() to occupancy_stats
  lsa: Expose region_impl::evictable_occupancy in the region class
2020-01-08 16:58:54 +02:00
Juliusz Stasiewicz
430b2ad19d commitlog+region_group: timeout exceptions with names
`segment_manager' now uses a decorated version of `timed_out_error'
with a hardcoded name. On the other hand, `region_group' uses a named
`on_request_expiry' within its `expiring_fifo'.
2019-12-03 19:07:19 +01:00
Tomasz Grabiec
a69fda819c lsa: Expose region_impl::evictable_occupancy in the region class 2019-11-22 12:08:10 +01:00
Rafael Ávila de Espíndola
99c7f8457d logalloc: Add a migrators_base that is common to debug and release
This simplifies the debug implementation, and it should now work with
scylla-gdb.py.

It is not clear what, if anything, is lost by not using random
ids. They were never being reused in the debug implementation anyway.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190618144755.31212-1-espindola@scylladb.com>
2019-08-12 19:44:55 +03:00
Paweł Dziepak
eb7d17e5c5 lsa: make sure align_up_for_asan() doesn't cause reads past end of segment
In debug mode the LSA needs objects to be 8-byte aligned in order to
maximise coverage from the AddressSanitizer.

Usually `close_active()` creates a dummy object that covers the end of
the segment being closed. However, if the last real object ends in the
last eight bytes of the segment, then that dummy won't be created because
of the alignment requirements. This broke the exit conditions of loops
trying to read all objects in the segment and caused them to attempt to
dereference an address at the end of the segment. This patch fixes that.

Fixes #4653.
2019-07-10 19:19:24 +02:00
Rafael Ávila de Espíndola
d8dbacc7f6 More precise poisoning in logalloc
This change aligns descriptors and values to 8 bytes so that poisoning
a descriptor or value doesn't interfere with other descriptors and
values.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-06-26 13:13:48 -07:00
Rafael Ávila de Espíndola
6a2accb483 Convert macros to inline functions
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-06-26 13:13:48 -07:00
Tomasz Grabiec
f7e79b07d1 lsa: Respect the reclamation step hint from seastar allocator
This will allow us to reduce the amount of segment compaction when
reclaiming on behalf of a large allocation because we'll evict much
more up front.

Tests:
  - unit (dev)

Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1559906584-16770-1-git-send-email-tgrabiec@scylladb.com>
2019-06-23 16:03:06 +03:00
Rafael Ávila de Espíndola
bf87b7e1df logalloc: Use asan to poison free areas
With this patch, when using asan, we poison segment memory that has
been allocated from the system but should not be accessible to user
code.

Should help with debugging use-after-free bugs.
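
A sketch using the public ASan interface (the hooks are illustrative;
the macros compile to no-ops when ASan is disabled):

    #include <cstddef>
    #include <sanitizer/asan_interface.h>

    void on_lsa_free(void* p, std::size_t size) {
        ASAN_POISON_MEMORY_REGION(p, size);    // any later touch reports an error
    }

    void on_lsa_alloc(void* p, std::size_t size) {
        ASAN_UNPOISON_MEMORY_REGION(p, size);  // memory is legitimately in use again
    }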

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190607140313.5988-1-espindola@scylladb.com>
2019-06-12 11:46:45 +02:00
Rafael Ávila de Espíndola
b3adabda2d Reduce logalloc differences between debug and release
A lot of code in scylla is only reachable if SEASTAR_DEFAULT_ALLOCATOR
is not defined. In particular, refill_emergency_reserve in the default
allocator case is empty, but in the seastar allocator case it compacts
segments.

I am trying to debug a crash that seems to involve memory corruption
around the lsa allocator, and being able to use a debug build for that
would be awesome.

This patch reduces the differences between the two cases by having a
common segment_pool that defers only a few operations to different
segment_store implementations.

Tests: unit (debug, dev)

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190606020937.118205-1-espindola@scylladb.com>
2019-06-06 12:55:56 +03:00
Tomasz Grabiec
21fbf59fa8 lsa: Fix compact_and_evict() being called with a too low step
compact_and_evict gets memory_to_release in bytes while the
reclamation step is in segments.

Broken in f092decd90.

It doesn't make much difference with the current default step of 1
segment, since we cannot reclaim less than that, so it shouldn't cause
problems in practice.

Message-Id: <1556013920-29676-1-git-send-email-tgrabiec@scylladb.com>
2019-04-23 13:14:43 +03:00
Tomasz Grabiec
f092decd90 lsa: Fix potential bad_alloc even though evictable memory exists
When LSA reclamation starts, it can happen under some conditions that
segment_pool::_free_segments is 0 and
segment_pool::_current_emergency_reserve_goal is set to 1. The
reclamation step is 1 segment, and compact_and_evict_locked() frees 1
segment back into the segment_pool. However,
segment_pool::reclaim_segments() doesn't free anything to the standard
allocator because the condition _free_segments >
_current_emergency_reserve_goal is false. As a result,
tracker::impl::reclaim() returns 0 as the amount of released memory,
tracker::reclaim() returns
memory::reclaiming_result::reclaimed_nothing and the seastar allocator
thinks it's a real OOM and throws std::bad_alloc.

The fix is to change compact_and_evict() to make sure that reserves
are met, by releasing more if they're not met at entry.

This change also allows us to drop the variant of allocate_segment()
which accepts the reclamation step as a means to refill reserves
faster. This is now not needed, because compact_and_evict() will look
at the reserve deficit to increase the amount of memory to reclaim.
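
A sketch of the fix: top up the request with the reserve deficit, so
reclaim_segments() can refill the reserve and still release memory to
the standard allocator (member names follow the commit message; the
rest, including the segment size, is illustrative):

    #include <cstddef>

    constexpr std::size_t segment_size = 128 * 1024;  // illustrative

    struct pool_state {
        std::size_t free_segments;
        std::size_t emergency_reserve_goal;
    };

    std::size_t reclaim_target(std::size_t memory_to_release, const pool_state& p) {
        std::size_t deficit = 0;
        if (p.free_segments < p.emergency_reserve_goal) {
            deficit = (p.emergency_reserve_goal - p.free_segments) * segment_size;
        }
        return memory_to_release + deficit;  // what compact_and_evict() works on
    }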

Fixes #4445

Message-Id: <1555671713-16530-1-git-send-email-tgrabiec@scylladb.com>
2019-04-20 09:17:49 +03:00
Tomasz Grabiec
3356a085d2 lsa: Cover more bad_alloc cases with abort
When --abort-on-lsa-bad-alloc is enabled we want to abort whenever
we think we may be out of memory.

We covered failures due to bad_alloc thrown from inside of the
allocation section, but did not cover failures from reservations done
at the beginning of with_reserve(). Fix by moving the trap into
reserve().

Message-Id: <1553258915-27929-1-git-send-email-tgrabiec@scylladb.com>
2019-04-03 16:39:40 +03:00
Tomasz Grabiec
dafe22dd83 lsa: Fix spurious abort with --enable-abort-on-lsa-bad-alloc
allocate_segment() can fail even though we're not out of memory, when
it's invoked inside an allocating section with the cache region
locked. That section may later succeed when retried after memory
reclamation.

We should ignore bad_alloc thrown inside allocating section body and
fail only when the whole section fails.
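
A sketch of the intended behaviour (the reclaim helper is hypothetical):

    #include <new>

    bool try_reclaim_memory();  // hypothetical: evict/compact, true if progress

    template <typename Func>
    auto run_allocating_section(Func&& f) {
        while (true) {
            try {
                return f();  // may throw despite memory being reclaimable
            } catch (const std::bad_alloc&) {
                if (!try_reclaim_memory()) {
                    throw;   // whole-section failure: a genuine OOM
                }
            }
        }
    }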

Fixes #2924

Message-Id: <1550597493-22500-1-git-send-email-tgrabiec@scylladb.com>
2019-02-20 12:53:49 +02:00
Tomasz Grabiec
dbc1894bd5 lsa: Avoid unnecessary compact_and_evict_locked()
When the reclaim request was satisfied from the pool there's no need
to call compact_and_evict_locked(). This allows us to avoid calling
boost::range::make_heap(), which is a tiny performance difference, as
well as some confusing log messages.

Message-Id: <1548091941-8534-1-git-send-email-tgrabiec@scylladb.com>
2019-01-21 20:19:20 +02:00
Rafael Ávila de Espíndola
67039e942b Remove the only use of with_alignment from scylla
In c++17 there are standard ways of requesting aligned memory, so
seastar doesn't need to provide one.

This patch is in preparation for removing with_alignment from seastar.

Tests: unit (debug)

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190107191019.22295-1-espindola@scylladb.com>
2019-01-07 21:34:47 +02:00
Avi Kivity
be99101f36 utils: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
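
The shape of the conversion, assuming fmt's brace syntax (which
seastar's format() also uses):

    #include <fmt/format.h>

    // before: sprint("%s", 5);             // now throws at runtime
    auto converted = fmt::format("{}", 5);  // yields "5"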

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Tomasz Grabiec
1e50f85288 database: Make soft-pressure memtable flusher not consider already flushed memtables
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.

The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.

I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.

The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers the removal of
the flushed memtable from the memtable list.

This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.
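
A sketch of the fix's shape (the flag and types are hypothetical):

    #include <cstddef>

    struct occupancy_stats {
        std::size_t used = 0, total = 0;
        std::size_t total_space() const { return total; }
    };

    struct memtable_region {
        bool flush_started = false;
        occupancy_stats real_occupancy;

        occupancy_stats evictable_occupancy() const {
            // A flushing memtable advertises no evictable space, so the
            // soft-pressure flusher picks genuinely flushable ones first.
            return flush_started ? occupancy_stats{} : real_occupancy;
        }
    };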

Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:34 +03:00
Tomasz Grabiec
364418b5c5 logalloc: Make evictable_occupancy() indicate no free space
Doesn't fix any bug, but it's closer to the truth that all segments
are used rather than that none are used.

Message-Id: <1535040132-11153-1-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:32 +03:00
Avi Kivity
2c9b886b6d logalloc: reindent
No functional changes.
Message-Id: <20180731125116.32009-1-avi@scylladb.com>
2018-08-01 00:35:54 +01:00
Avi Kivity
0fc54aab98 logalloc: run releaser() in user-provided scheduling group
Let the user specify which scheduling group should run the
releaser, since it is running functions on the user's behalf.

Perhaps a cleaner interface is to require the user to call
a long-running function for the releaser, and so we'd just
inherit its scheduling group, but that's a much bigger change.
2018-07-31 11:57:58 +03:00
Tomasz Grabiec
d94c7c07a3 lsa: Disable alloc failure injector inside the LSA sanitizer
Message-Id: <1531814822-30259-1-git-send-email-tgrabiec@scylladb.com>
2018-07-17 11:27:56 +01:00
Paweł Dziepak
55bf9d78a6 lsa: enhance sanitizer for migrators
The current LSA sanitizer performs only basic checks on migrator use,
without any additional reporting in case an error is detected. This
patch enhances it so that when a problem is detected, the relevant
stack traces get printed.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
fcd9b1f821 lsa: formalise migrator id requirements
object_descriptor uses a special encoding for migrator ids which assumes
that the valid ones are in a range smaller than uint32_t. Let's add some
static asserts that make this fact more visible.
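
A sketch of such an assert, with an illustrative limit:

    #include <cstdint>
    #include <limits>

    constexpr std::uint64_t max_migrator_id = (std::uint64_t(1) << 24) - 1;  // illustrative

    static_assert(max_migrator_id < std::numeric_limits<std::uint32_t>::max(),
                  "migrator ids must fit the object_descriptor encoding");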
2018-06-25 09:37:43 +01:00
Gleb Natapov
b38ced0fcd Configure logalloc memory size during initialization 2018-06-11 15:34:14 +03:00
Tomasz Grabiec
498a4132c5 lsa: Add use for debug::static_migrators
Otherwise GDB complains about it being optimized out, breaking our
debug scripts.
2018-05-17 14:22:14 +02:00
Avi Kivity
05cec4a265 Merge "Reduce LSA memory reclamation overhead" from Tomasz
"
The main optimization is in the patch titled "lsa: Reduce amount of segment compactions".

I measured 50% reduction of cache update run time in a steady state for an
append-only workload with large partition, in perf_row_cache_update version from:

  c3f9e6ce1f/tests/perf_row_cache_update.cc

Other workloads and other allocation sites could probably also see
the improvement.
"

* tag 'tgrabiec/reduce-lsa-segment-compactions-v1' of github.com:tgrabiec/scylla:
  lsa: Expose counters for allocation and compaction throughput
  lsa: Reduce amount of segment compactions
  lsa: Avoid the call to segment_pool::descriptor() in compact()
  lsa: Make reclamation on reserve refill more efficient
2018-05-16 10:24:20 +03:00
Tomasz Grabiec
4fdd61f1b0 lsa: Expose counters for allocation and compaction throughput
Allow observing amplification induced by segment compaction.
2018-05-15 21:49:01 +02:00