Commit Graph

166 Commits

Author SHA1 Message Date
Avi Kivity
be99101f36 utils: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Tomasz Grabiec
1e50f85288 database: Make soft-pressure memtable flusher not consider already flushed memtables
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.

The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.

I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.

The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.

This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.

Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:34 +03:00
Tomasz Grabiec
364418b5c5 logalloc: Make evictable_occupancy() indicate no free space
Doesn't fix any bug, but it's closer to the truth that all segments
are used rather than none is used.

Message-Id: <1535040132-11153-1-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:32 +03:00
Avi Kivity
2c9b886b6d logalloc: reindent
No functional changes.
Message-Id: <20180731125116.32009-1-avi@scylladb.com>
2018-08-01 00:35:54 +01:00
Avi Kivity
0fc54aab98 logalloc: run releaser() in user-provided scheduling group
Let the user specify which scheduling group should run the
releaser, since it is running functions on the user's behalf.

Perhaps a cleaner interface is to require the user to call
a long-running function for the releaser, and so we'd just
inherit its scheduling group, but that's a much bigger change.
2018-07-31 11:57:58 +03:00
Tomasz Grabiec
d94c7c07a3 lsa: Disable alloc failure injector inside the LSA sanitizer
Message-Id: <1531814822-30259-1-git-send-email-tgrabiec@scylladb.com>
2018-07-17 11:27:56 +01:00
Paweł Dziepak
55bf9d78a6 lsa: enhance sanitizer for migrators
Current LSA sanitizer performs only basic checks on the migrators use,
without doing any additonal reporting in case an error is detected. This
patch enhances it so that when a problem is detected relevant stack
traces get printed.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
fcd9b1f821 lsa: formalise migrator id requirements
object_descriptor uses special encoding for migrator ids which assumes
that the valid ones are in a range smaller than uint32_t. Let's add some
static asserts that make this fact more visible.
2018-06-25 09:37:43 +01:00
Gleb Natapov
b38ced0fcd Configure logalloc memory size during initialization 2018-06-11 15:34:14 +03:00
Tomasz Grabiec
498a4132c5 lsa: Add use for debug::static_migrators
Otherwise GDB complains about it being optimized out, breaking our
debug scritps.
2018-05-17 14:22:14 +02:00
Avi Kivity
05cec4a265 Merge "Reduce LSA memory reclamation overhead" from Tomasz
"
Main optimization is in the patch titled "lsa: Reduce amount of segment compactions".

I measured 50% reduction of cache update run time in a steady state for an
append-only workload with large partition, in perf_row_cache_update version from:

  c3f9e6ce1f/tests/perf_row_cache_update.cc

Other workloads, and other allocation sites probably also could see the
improvement.
"

* tag 'tgrabiec/reduce-lsa-segment-compactions-v1' of github.com:tgrabiec/scylla:
  lsa: Expose counters for allocation and compaction throughput
  lsa: Reduce amount of segment compactions
  lsa: Avoid the call to segment_pool::descriptor() in compact()
  lsa: Make reclamation on reserve refill more efficient
2018-05-16 10:24:20 +03:00
Tomasz Grabiec
4fdd61f1b0 lsa: Expose counters for allocation and compaction throughput
Allow observing amplification induced by segment compaction.
2018-05-15 21:49:01 +02:00
Tomasz Grabiec
3775a9ecec lsa: Reduce amount of segment compactions
Reclaiming memory through segment compaction is expensive. For
occupancy of 85%, in order to reclaim one free segment, we need to
compact 7 segments, by migrating 6 segments worth of data. This results
in significant amplification. Compaction involves moving objects,
which in some cases is expensive in itself as well
(See https://github.com/scylladb/scylla/issues/3247).

This patch reduces amount of segment compactions in favor of doing
more eviction. It especially helps workloads in which LRU order
matches allocation order, in which case there will be no segment
compaction, and just eviction.

In perf_row_cache_update test case for large partition with lots of
rows, which simulates appending workload, I measured that for each new
object allocated, 2 need to be migrated, before the patch. After the
patch, only 0.003 objects are migrated. This reduces run time of
cache update part by 50%.
2018-05-15 21:49:01 +02:00
Tomasz Grabiec
8faafdaae5 lsa: Avoid the call to segment_pool::descriptor() in compact() 2018-05-11 19:07:23 +02:00
Tomasz Grabiec
19edf3970e lsa: Make reclamation on reserve refill more efficient
Currently reserve refill allocates segments repeatedly until the
reserve threhsold is met. If single segment allocation needs to
reclaim memory, it will ask the reclaimer for one segment. The
reclaimer could make better decisions if it knew the total number of
segments we try to allocate. In particular, it would not attempt to
compact any segment until it evicts total amount of memory first,
which may reduce the total amount of segment compactions during
refill.

This patch changes refill to increase reclamation step used by
allocate_segment() so that it matches the total amount of memory we
refill.
2018-05-11 19:07:23 +02:00
Paweł Dziepak
c6c5accd19 lsa: provide migrator with the object size
While the migration function should have enough information to obtain
the object size itself, the LSA logic needs to compute it as well.
IMR is going to make calculating object sizes more expensive, so by
providing the infromation to the migrator we can avoid some needless
operations.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
884888dc11 lsa: add free() that does not require object size
It is non-trivial to get the size of an IMR object. However, the
standard allocator doesn't really need it and LSA can compute it itself
by asking the migrator.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
b1bec336b3 lsa: sanitize use of migrators
Having migrators dynamically registered and deregistered opens a new
class of bugs. This patch adds some additional checks in the debug mode
with the hopes of catching any misuse early.
2018-05-09 16:52:26 +01:00
Paweł Dziepak
cca9f8c944 lsa: reuse registered migrator ids
With the introduction of the new in-memory representation we will get
type- and schema-dependent migrators. Since there is no bound how many
times they can be created and destroyed it is better to be safe and
reuse registered migrator ids.
2018-05-09 16:52:20 +01:00
Paweł Dziepak
b3699f286d lsa: make migrators table thread-local
Migrators can be registered and deregistered at any time. If the table
is not thread-local we risk race conditions.
2018-05-09 16:10:46 +01:00
Avi Kivity
7161244130 Merge seastar upstream
* seastar 70aecca...ac02df7 (5):
  > Merge "Prefix preprocessor definitions" from Jesse
  > cmake: Do not enable warnings transitively
  > posix: prevent unused variable warning
  > build: Adjust DPDK options to fix compilation
  > io_scheduler: adjust property names

DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro
references prefixed with SEASTAR_. Some may need to become
Scylla macros.
2018-04-29 11:03:21 +03:00
Avi Kivity
fc488adc72 logalloc: remove segment_descriptor::_lsa_managed
_lsa_managed is always 1:1 with _region, so we can remove it, saving
some space in the segment descriptor vector.

Tests: unit (release), logalloc_test (debug)
Message-Id: <20180410122606.10671-1-avi@scylladb.com>
2018-04-10 13:54:38 +01:00
Avi Kivity
2c670f6161 logalloc: limit std segment allocations in debug mode
Address Sanitizer has a global limit on the number of allocations
(note: not number of allocations less number of frees, but cumulative
number of allocations). Running some tests in debug mode on a machine
with sufficient memory can break that limit.

Work around that limit by restricting the amount of memory the
debug mode segment_pool can allocate. It's also nicer for running
the test on a workstation.
2018-04-07 21:04:10 +03:00
Avi Kivity
2baa16b371 logalloc: introduce prime_segment_pool()
To segregate std and lsa allocations, we prime the segment pool
during initialization so that lsa will release lower-addressed
memory to std, rather than lsa and std competing for memory at
random addresses.

However, tests often evict all of lsa memory for their own
purposes, which defeats this priming.

Extract the functionality into a new prime_segment_pool()
function for use in tests that rely on allocation segregation.
2018-04-07 14:52:58 +03:00
Avi Kivity
ff6325ee7e logalloc: limit non-contiguous reclaims
We may fail to reclaim because a region has reclaim disabled (usually because
it is in an allocating_section. Failed reclaims can cause high CPU usage
if all of the lower addresses happen to be in a reclaim-disabled region (this
is somewhat mitigated by the fact that checking for reclaim disabled is very
cheap), but worse, failing a segment reclaim can lead to reclaimed memory
being fragmented.  This results in the original allocation continuing to fail.

To combat that, we limit the number of failed reclaims. If we reach the limit,
we fail the reclaim.  The surrounding allocating_section will release the
reclaim_lock, and increase reserves, which will result in reclaim being
retried with all regions being reclaimable, and succeed in allocating
contiguous memory.
2018-04-07 14:52:58 +03:00
Avi Kivity
c6c659ce7a logalloc: pre-allocate all memory as lsa on startup
Since lsa tries to keep some non-lsa memory as reserve, we end up
with three blocks of memory: at low addresses, non-lsa memory that was
allocated during startup or subsequently freed by lsa; at middle addresses,
lsa; and at the top addresses, memory that lsa left alone during initial
cache population due to the reserve.

After time passes, both std and lsa will allocate from the top section,
causing a mix of lsa and non-lsa memory. Since lsa tries to free from
lower addresses, this mix will stay there forever, increasing fragmentation.

Fix that by disabling the reserve during startup and allocating all of memory
for lsa. Any further allocation will then have to be satisfied by lsa first
freeing memory from the low addresses, so we will now have just two sections
of memory: low addresses for std, and top addresses for lsa.

Note that this startup allocation does not page in lsa segments, since the
segment constructor does not touch memory.
2018-04-07 14:52:58 +03:00
Avi Kivity
14510ae986 dynamic_bitset: get rid of resize()
Makes it easier to modify later on. Maybe "dynamic" is not so justified now.
2018-04-07 14:52:58 +03:00
Avi Kivity
3f17dbfcbc logalloc: get rid of the emergency reserve stack
Instead of keeping specific segments in the emergency reserve,
just keep the number of segments in the reserve. This simplifies the
code considerably.
2018-04-07 14:52:55 +03:00
Avi Kivity
fa73d844e9 logalloc: replace zones with segment-at-a-time alloc/free
This patch replaces the zones mechanism with something simpler: a
single segment is moved from the standard allocator to lsa and vice
versa, at a time. Fragmentation resistance is (hopefully) achieved
by having lsa prefer high addresses for lsa data, and return segments
at low address to the standard allocator. Over time, the two will move
apart.

Moving just once segment at a time reduces the latency costs of
transferring memory between free and std.
2018-04-07 13:48:40 +03:00
Paweł Dziepak
5dfa36c526 lsa: add basic sanitizer
LSA being an allocator built on top of the standard may hide some
erroneous usage from AddressSanitizer. Moreover, it has its own classes
of bugs that could be caused by incorrect user behaviour (e.g. migrator
returning wrong object size).

This patch adds basic sanitizer for the LSA that is active in the debug
mode and verifies if the allocator is used correctly and if a problem is
found prints information about the affected object that it has collected
earlier. Theat includes the address and size of an object as well as
backtrace of the allocation site. At the moment the following errors are
being checked for:
 * leaks, objects not freed at region destructor
 * attempts to free objects at invalid address
 * mismatch between object size at allocation and free
 * mismatch between object size at allocation and as reported by the
   migrator
 * internal LSA error: attempt to allocate object at already used
   address
 * internal LSA error: attempt to merge regions containing allocated
   objects at conflicting addresses

Message-Id: <20180226122314.32049-1-pdziepak@scylladb.com>
2018-02-26 14:35:13 +02:00
Tomasz Grabiec
7e0ff8a920 lsa: Disable allocation failure injection inside merge()
Fixes termiantion in tests due to throw from merge(), which is noexcept.
2018-02-14 16:42:49 +01:00
Tomasz Grabiec
66701c1671 lsa: Make region deregistration robust against duplicates 2018-02-14 16:42:49 +01:00
Tomasz Grabiec
cf876bbe2d lsa: Make region allocation exception safe
We were not unregisterring in case add() fails.
2018-02-14 16:42:49 +01:00
Paweł Dziepak
dcd79af8ed lsa: optimise disabling reclamation and invalidation counter
Most of the lsa gory details are hidden in utils/logalloc.cc. That
includes the actual implementation of a lsa region: region_impl.

However, there is code in the hot path that often accesses the
_reclaiming_enabled member as well as its base class
allocation_strategy.

In order to optimise those accesses another class is introduced:
basic_region_impl that inherits from allocation_strategy and is a base
of region_impl. It is defined in utils/logalloc.hh so that it is
publicly visible and its member functions are inlineable from anywhere
in the code. This class is supposed to be as small as possible, but
contain all members and functions that are accessed from the fast path
and should be inlined.
2018-01-30 18:33:26 +01:00
Paweł Dziepak
d825ae37bf lsa: split alloc section into reserving and reclamation-disabled parts
Allocating sections reserves certain amount of memory, then disables
reclamation and attempts to perform given operation. If that fails due
to std::bad_alloc the reserve is increased and the operation is retried.

Reserving memory is expensive while just disabling reclamation isn't.
Moreover, the code that runs inside the section needs to be safely
retryable. This means that we want the amount of logic running with
reclamation disabled as small as possible, even if it means entering and
leaving the section multiple times.

In order to reduce the performance penalty of such solution the memory
reserving and reclamation disabling parts of the allocating sections are
separated.
2018-01-30 18:33:26 +01:00
Tomasz Grabiec
5c85e9c2db lsa: Expose max_zone_segments for tests 2018-01-16 13:17:20 +01:00
Tomasz Grabiec
99708cc498 lsa: Expose tracker::non_lsa_used_space()
So that it can be used in unit tests.
2018-01-16 13:17:20 +01:00
Tomasz Grabiec
e5f8176c32 lsa: Fix memory leak on zone reclaim
_free_segments_in_zones is not adjusted by
segment_pool::reclaim_segments() for empty zones on reclaim under some
conditions. For instance when some zone becomes empty due to regular
free() and then reclaiming is called from the std allocator, and it is
satisfied from a zone after the one which is empty. This would result
in free memory in such zone to appear as being leaked due to corrupted
free segment count, which may cause a later reclaim to fail. This
could result in bad_allocs.

The fix is to always collect such zones.

Fixes #3129
Refs #3119
Refs #3120
2018-01-16 13:17:11 +01:00
Tomasz Grabiec
34ccf234ea Integrate with allocation failure injection framework 2017-11-07 15:33:24 +01:00
Avi Kivity
a2f26f7b29 log_histogram: rename to log_heap
log_histogram is not really a histogram, it is a heap-like container.
Rename to log_heap in case we do want a log_histogram one day.
Message-Id: <20170916172137.30941-1-avi@scylladb.com>
2017-09-18 12:44:05 +02:00
Tomasz Grabiec
87be474c19 lsa: Move reclaim counter concept to allocation_strategy level
So that generic code can detect invalidation of references. Also, to
allow reusing the same mechanism for signalling external reference
invalidation.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
5d2f2bc90b lsa: Mark region::merge() as noexcept
It seems to satisfy this, and row_cache::do_update() will rely on it to simplify
error handling.

Message-Id: <1504023113-30374-1-git-send-email-tgrabiec@scylladb.com>
2017-08-29 19:17:17 +03:00
Avi Kivity
5a2439e702 main: check for large allocations
Large allocations can require cache evictions to be satisfied, and can
therefore induce long latencies. Enable the seastar large allocation
warning so we can hunt them down and fix them.

Message-Id: <20170819135212.25230-1-avi@scylladb.com>
2017-08-21 10:25:40 +03:00
Tomasz Grabiec
3489c68a68 lsa: Fix performance regression in eviction and compact_on_idle
Region comparator, used by the two, calls region_impl::min_occupancy(),
which calls log_histogram::largest(). The latter is O(N) in terms of
the number of segments, and is supposed to be used only in tests.
We should call one_of_largest() instead, which is O(1).

This caused compact_on_idle() to take more CPU as the number of
segments grew (even when there was nothing to compact). Eviction
would see the same kind of slow down as well.

Introduced in 11b5076b3c.

Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>
2017-06-28 12:32:43 +03:00
Avi Kivity
ebaeefa02b Merge seatar upstream (seastar namespace)
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Avi Kivity
1d12d69881 logalloc: define segment_zone::maximum_size
Yield build errors with some compilers, if missing.
2017-05-01 16:31:29 +03:00
Paweł Dziepak
f5cf86484e lsa: introduce upper bound on zone size
Attempting to create huge zones may introduce significant latency. This
patch introduces the maximum allowed zone size so that the time spent
trying to allocate and initialising zone is bounded.

Fixes #2335.

Message-Id: <20170428145916.28093-1-pdziepak@scylladb.com>
2017-04-30 10:58:11 +03:00
Pekka Enberg
940c3f4330 Merge "Clang fixes (part 2)" from Avi
"This series fixes some more errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler.  A single issue remains, but it's
probably in std::experimental::optional::swap(); not in our code."

* tag 'clang/2/v1' of https://github.com/avikivity/scylla:
  sstable_test: avoid passing negative non-type template arguments to unsigned parameters
  UUID: add more comparison operators
  sstable_datafile_test: avoid string_view user-defined literal conversion operator
  mutation_source_test: avoid template function without template keyword
  cql_query_test: define static variable
  cql_query_test: add braces for single-item collection initializers
  storage_service: don't use typeid(temporary)
  logalloc: remove unused max_occupancy_for_compaction
  storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic
  storage_proxy: drop unused member access from return value
  storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare
  read_repair_decision: fix operator<<(std::ostream&, ...)
2017-04-24 20:32:16 +03:00
Avi Kivity
6d9e18fd61 logalloc: reduce descriptor overhead
Every lsa-allocated object is prefixed by a header that contains information
needed to free or migrate it.  This includes its size (for freeing) and
an 8-byte migrator (for migrating).  Together with some flags, the overhead
is 14 bytes (16 bytes if the default alignment is used).

This patch reduces the header size to 1 byte (8 bytes if the default alignment
is used).  It uses the following techniques:

 - ULEB128-like encoding (actually more like ULEB64) so a live object's header
   can typically be stored using 1 byte
 - indirection, so that migrators can be encoded in a small index pointing
   to a migrator table, rather than using an 8-byte pointer; this exploits
   the fact that only a small number of types are stored in LSA
 - moving the responsibility for determining an object's size to its
   migrator, rather than storing it in the header; this exploits the fact
   that the migrator stores type information, and object size is in fact
   information about the type

The patch improves the results of memory_footprint_test as following:

Before:

 - in cache:     976
 - in memtable:  947

After:

mutation footprint:
 - in cache:     880
 - in memtable:  858

A reduction of about 10%.  Further reductions are possible by reducing the
alignment of lsa objects.

logalloc_test was adjusted to free more objects, since with the lower
footprint, rounding errors (to full segments) are different and caused
false errors to be detected.

Missing: adjustments to scylla-gdb.py; will be done after we agree on the
new descriptor's format.
2017-04-24 12:23:12 +02:00
Avi Kivity
9303b09a64 logalloc: remove unused max_occupancy_for_compaction
Noticed by clang.
2017-04-22 21:09:41 +03:00