Commit Graph

128 Commits

Author SHA1 Message Date
Tomasz Grabiec
34ccf234ea Integrate with allocation failure injection framework 2017-11-07 15:33:24 +01:00
Avi Kivity
a2f26f7b29 log_histogram: rename to log_heap
log_histogram is not really a histogram, it is a heap-like container.
Rename to log_heap in case we do want a log_histogram one day.
Message-Id: <20170916172137.30941-1-avi@scylladb.com>
2017-09-18 12:44:05 +02:00
Tomasz Grabiec
87be474c19 lsa: Move reclaim counter concept to allocation_strategy level
So that generic code can detect invalidation of references. Also, to
allow reusing the same mechanism for signalling external reference
invalidation.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
5d2f2bc90b lsa: Mark region::merge() as noexcept
It seems to satisfy this, and row_cache::do_update() will rely on it to simplify
error handling.

Message-Id: <1504023113-30374-1-git-send-email-tgrabiec@scylladb.com>
2017-08-29 19:17:17 +03:00
Avi Kivity
5a2439e702 main: check for large allocations
Large allocations can require cache evictions to be satisfied, and can
therefore induce long latencies. Enable the seastar large allocation
warning so we can hunt them down and fix them.

Message-Id: <20170819135212.25230-1-avi@scylladb.com>
2017-08-21 10:25:40 +03:00
Tomasz Grabiec
3489c68a68 lsa: Fix performance regression in eviction and compact_on_idle
Region comparator, used by the two, calls region_impl::min_occupancy(),
which calls log_histogram::largest(). The latter is O(N) in terms of
the number of segments, and is supposed to be used only in tests.
We should call one_of_largest() instead, which is O(1).

This caused compact_on_idle() to take more CPU as the number of
segments grew (even when there was nothing to compact). Eviction
would see the same kind of slow down as well.

Introduced in 11b5076b3c.

Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>
2017-06-28 12:32:43 +03:00
Avi Kivity
ebaeefa02b Merge seatar upstream (seastar namespace)
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Avi Kivity
1d12d69881 logalloc: define segment_zone::maximum_size
Yield build errors with some compilers, if missing.
2017-05-01 16:31:29 +03:00
Paweł Dziepak
f5cf86484e lsa: introduce upper bound on zone size
Attempting to create huge zones may introduce significant latency. This
patch introduces the maximum allowed zone size so that the time spent
trying to allocate and initialising zone is bounded.

Fixes #2335.

Message-Id: <20170428145916.28093-1-pdziepak@scylladb.com>
2017-04-30 10:58:11 +03:00
Pekka Enberg
940c3f4330 Merge "Clang fixes (part 2)" from Avi
"This series fixes some more errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler.  A single issue remains, but it's
probably in std::experimental::optional::swap(); not in our code."

* tag 'clang/2/v1' of https://github.com/avikivity/scylla:
  sstable_test: avoid passing negative non-type template arguments to unsigned parameters
  UUID: add more comparison operators
  sstable_datafile_test: avoid string_view user-defined literal conversion operator
  mutation_source_test: avoid template function without template keyword
  cql_query_test: define static variable
  cql_query_test: add braces for single-item collection initializers
  storage_service: don't use typeid(temporary)
  logalloc: remove unused max_occupancy_for_compaction
  storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic
  storage_proxy: drop unused member access from return value
  storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare
  read_repair_decision: fix operator<<(std::ostream&, ...)
2017-04-24 20:32:16 +03:00
Avi Kivity
6d9e18fd61 logalloc: reduce descriptor overhead
Every lsa-allocated object is prefixed by a header that contains information
needed to free or migrate it.  This includes its size (for freeing) and
an 8-byte migrator (for migrating).  Together with some flags, the overhead
is 14 bytes (16 bytes if the default alignment is used).

This patch reduces the header size to 1 byte (8 bytes if the default alignment
is used).  It uses the following techniques:

 - ULEB128-like encoding (actually more like ULEB64) so a live object's header
   can typically be stored using 1 byte
 - indirection, so that migrators can be encoded in a small index pointing
   to a migrator table, rather than using an 8-byte pointer; this exploits
   the fact that only a small number of types are stored in LSA
 - moving the responsibility for determining an object's size to its
   migrator, rather than storing it in the header; this exploits the fact
   that the migrator stores type information, and object size is in fact
   information about the type

The patch improves the results of memory_footprint_test as following:

Before:

 - in cache:     976
 - in memtable:  947

After:

mutation footprint:
 - in cache:     880
 - in memtable:  858

A reduction of about 10%.  Further reductions are possible by reducing the
alignment of lsa objects.

logalloc_test was adjusted to free more objects, since with the lower
footprint, rounding errors (to full segments) are different and caused
false errors to be detected.

Missing: adjustments to scylla-gdb.py; will be done after we agree on the
new descriptor's format.
2017-04-24 12:23:12 +02:00
Avi Kivity
9303b09a64 logalloc: remove unused max_occupancy_for_compaction
Noticed by clang.
2017-04-22 21:09:41 +03:00
Tomasz Grabiec
20f4c9bf23 lsa: Reduce reclamation latency
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.

I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:

 lsa-timing - Reclamation cycle took 12.934 us.
 lsa-timing - Reclamation cycle took 47.771 us.
 lsa-timing - Reclamation cycle took 125.946 us.
 lsa-timing - Reclamation cycle took 144356 us.
 lsa-timing - Reclamation cycle took 655.765 us.
 lsa-timing - Reclamation cycle took 693.418 us.
 lsa-timing - Reclamation cycle took 509.869 us.
 lsa-timing - Reclamation cycle took 1139.15 us.

The 144ms pause is when large eviction is necessary.

Statistics for reclamation pauses for a read workload over
larger-than-memory data set:

Before:

 avg = 865.796362
 stdev = 10253.498038
 min = 93.891000
 max = 264078.000000
 sum = 574022.988000
 samples = 663

After:

 avg = 513.685650
 stdev = 275.270157
 min = 212.286000
 max = 1089.670000
 sum = 340573.586000
 samples = 663

Refs #1634.

Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
2017-04-21 12:52:31 +02:00
Tomasz Grabiec
c83768d6bb log_histogram: Allow non-power-of-two minimum values
We will want to reuse the min_size mechanism for the whole compaction
threshold, including the occupancy threshold. That threshold is close
to the segment size and we cannot pick a power of two which would be
close enough to what we need.

Therefore, change log_histogram to support arbitrary minimum base.

bucket_of() was moved into log_histogram_options so that it can be used
in number_of_buckets(), which makes for a simple and much less
error-prone implementation.
2017-04-21 10:54:50 +02:00
Tomasz Grabiec
7a800c54bf lsa: Use regular compaction threshold in on-idle compaction
Idle-time compaction should not produce not-compactible segments
becuase that means we would have to evict a lot when we finally need
to reclaim some memory, so that occupancy falls below the regular
compaction threshold. This may cause latency spikes.

Refs #1634.
2017-04-20 15:00:15 +02:00
Tomasz Grabiec
7aa286439f lsa: Add getter for region's eviction function 2017-04-20 14:51:42 +02:00
Duarte Nunes
af37a3fdbf logalloc: Fix compilation error
This patch moves a function using the region_impl type after the type
has been defined.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170418124551.25369-1-duarte@scylladb.com>
2017-04-18 15:56:26 +03:00
Avi Kivity
844529fe33 logalloc: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.

Since the right type is private, add some friendship.
2017-04-17 23:18:44 +03:00
Paweł Dziepak
0318dccafd lsa: avoid unnecessary segment migrations during reclaim
segment_zone::migrate_all_segments() was trying to migrate all segments
inside a zone to the other one hoping that the original one could be
completely freed. This was an attempt to optimise for throughput.

However, this may unnecesairly hurt latency if the zone is large, but
only few segments are required to satisfy reclaimer's demands.
Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com>
2017-04-11 08:55:29 +02:00
Tomasz Grabiec
3609665b19 lsa: Fix debug-mode compilation error
By moving definitions of setters out of #ifdef
2017-03-16 18:23:05 +01:00
Tomasz Grabiec
88e7b3ff79 lsa: Ensure can_allocate_more_memory() always leaves a gap above seastar's min_free_memory()
One of the goals of can_allocate_more_memory() is to prevent depleting
seastar's free memory close to its minimum, leaving a head room above
that minimum so that standard allocations will not cause reclamation
immediately. Currently the function doesn't take into accoutn actual
threshold used by the seastar allocator, so there could be no gap or
even could go below the minimum.

Fix that by ensuring there's always a gap above min_free_memory().

min_gap was reduced to 1 MiB so that low memory setups are not
impacted significantly by the change.
Message-Id: <1489667863-15099-1-git-send-email-tgrabiec@scylladb.com>
2017-03-16 12:42:50 +00:00
Tomasz Grabiec
4ab8b255da lsa: Allow adjusting reserves in allocating_section 2017-03-16 10:21:10 +01:00
Duarte Nunes
d32c848d73 utils/logalloc: Change linkage of hist_options to external
Change linkage of segment_descriptor_hist_options to external to keep
good old GCC5 happy, despite C++11 allowing static linkage of non-type
template arguments.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170309213206.10383-1-duarte@scylladb.com>
2017-03-10 11:02:51 +02:00
Duarte Nunes
ca4f5cabd4 lsa: Extract log_histogram class
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-04 14:47:19 +01:00
Duarte Nunes
2b6abd5a91 lsa: Make log_histogram more generic
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-03 17:59:17 +01:00
Duarte Nunes
3819e6d55f lsa: log_histogram cleanups
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-03-03 17:09:07 +01:00
Duarte Nunes
11b5076b3c lsa: Use log histogram for closed segments
This patch replaces the current heap with a logarithmic histogram
to hold the closed segment descriptors.

This histogram stores elements in different buckets according to
their size. Values are mapped to a sequence of power-of-two ranges
that are split in N sub-buckets. Values less than a minimum value
are placed in bucket 0, whereas values bigger than a maximum value
are not admitted.

There is some loss of precision as segments are now not totally
ordered, and precision decreases the more sparse a segment is. This
allows to reduce the cost of the computations needed when freeing
from a closed segment.

Performance results for perf_simple_query -c4 --duration 60
           before       after       diff
read     43954.27    45246.10      +2.9%
write    48911.54    52807.76      +7.9%

Fixes #1442

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170227235328.27937-1-duarte@scylladb.com>
2017-02-28 18:40:38 +02:00
Tomasz Grabiec
c70ebc7ca5 lsa: Make reclaim_timer enclose segment_pool::reclaim_segments()
LSA timing did not include segment migration. It does after this
change.
Message-Id: <1486657046-9378-1-git-send-email-tgrabiec@scylladb.com>
2017-02-09 17:07:59 +00:00
Tomasz Grabiec
e40fb438f5 lsa: Avoid avalanche releasing of requests
Before, the logic for releasing writes blocked on dirty worked like
this:

  1) When region group size changes and it is not under pressure and
     there are some requests blocked, then schedule request releasing
     task

  2) request releasing task, if no pressure, runs one request and if
     there are still blocked requests, schedules next request
     releasing task

If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The amount of scheduled tasks is at most 1, there is a
single thread of excution.

However, if requests themselves would change the size of the group,
then each such change would schedule yet another request releasing
thread, growing the task queue size by one.

The group size can also change when memory is reclaimed from the
groups (e.g.  when contains sparse segments). Compaction may start
many request releasing threads due to group size updates.

Such behavior is detrimental for performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency becuase timed out requests stay in the queue. This is less
likely on 1.6 where they are dropped from the queue.

The releasing of tasks may start to dominate over other processes in
the system. When the amount of scheduled tasks reaches 1000, polling
stops and server becomes unresponsive until all of the released
requests are done, which is either when they start to block on dirty
memory again or run out of blocked requests. It may take a while to
reach pressure condition after memtable flush if it brings virtual
dirty much below the threshold, which is currently the case for
workloads with overwrites producing sparse regions.

Refs #2021.

Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.

The logic for notification across hierachy was replaced by calling
region_group::notify_relief() from region_group::update() on the
broadest relieved group.
2017-02-01 17:41:55 +01:00
Tomasz Grabiec
d55baa0cd1 lsa: Move definitions to .cc 2017-02-01 17:41:55 +01:00
Tomasz Grabiec
8f8b111b33 lsa: Simplify hard pressure notification management
The hard pressure was only signalled on region group when
run_when_memory_available() was called after the pressure condition
was met.

So the following loop is always an infinite loop rather than stopping
when engouh is allocated to cause pressure:

   while (!gr.under_pressure()) {
       region.allocate(...);
   }

It's cleaner if pressure notification works not only if
run_when_memory_available() is used but whenever conditino changes,
like we do for the soft pressure.

There is comment in run_when_memory_available() which gives reasons
why notifications are called from there, but I think those reasons no
longer hold:

 - we already notify on soft pressure conditions from update(), and if
   that is safe, notifying about hard pressure should also be safe. I
   checked and it looks safe to me.

 - avoiding notification in the rare case when we stopped writing
   right after crossing the threshold doesn't seem benefitial. It's
   unlikely in the first place, and one could argue it's better to
   actually flush now so that when writes resume they will not block.
2017-02-01 17:41:55 +01:00
Tomasz Grabiec
2c7902fb2b Revert "lsa: Reduce reclamation latency"
This reverts commit d61002cc33.

Introduced a regression in row_cache_alloc_stress.

The problem is that reclaim_from_evictable() evicts way too much after
the refactor due to the stop condition not taking into account how
much data was evicted so far and only looking at occupancy of the
minimal segment. This may lead to eviction of the whole region.
2017-01-26 10:43:18 +01:00
Tomasz Grabiec
d61002cc33 lsa: Reduce reclamation latency
Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.

I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:

 lsa-timing - Reclamation cycle took 12.934 us.
 lsa-timing - Reclamation cycle took 47.771 us.
 lsa-timing - Reclamation cycle took 125.946 us.
 lsa-timing - Reclamation cycle took 144356 us.
 lsa-timing - Reclamation cycle took 655.765 us.
 lsa-timing - Reclamation cycle took 693.418 us.
 lsa-timing - Reclamation cycle took 509.869 us.
 lsa-timing - Reclamation cycle took 1139.15 us.

The 144ms pause is when large eviction is necessary.

The change improves worst case latency. Reclamation time statistics
over 30 second period after cache fills up, in microseconds:

Before:

  avg = 1524.283148
  stdev = 11021.021118
  min = 12.934000
  max = 144356.000000
  sum = 257603.852000
  samples = 169

After:

  avg = 1317.362414
  stdev = 1913.542802
  min = 263.935000
  max = 19244.600000
  sum = 175209.201000
  samples = 133

Refs #1634.

Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
2017-01-19 17:35:36 +02:00
Vlad Zolotarov
022bca16bf utils::logalloc: move collectd counters registration to metrics registration layer
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-01-10 16:24:55 -05:00
Tomasz Grabiec
e14caaef60 utils/logalloc: Add ability to timeout run_when_memory_available() task 2016-11-29 16:40:58 +01:00
Paweł Dziepak
b8d737ff0a tests/row_cache_test: verify that eviction follows lru
Refs #1847.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>
2016-11-15 18:57:54 +01:00
Tomasz Grabiec
6548132423 lsa: Make logalloc::tracker::full_compaction() compact all reclaimable regions
is_compactible() will pass on very small regions. full_compaction() is
only used in tests to force objects to be moved due to compaction, so
we want all reclaimable regions to be compacted.
2016-10-18 11:16:08 +02:00
Paweł Dziepak
d08cffd3c7 lsa: avoid exceptions during segment_zone creation
LSA tries to allocate zones as large as possible (while still leaving
enough free space for the standard allocator). It uses the amount of
free memory in order to guess how much it can get, but that obviously
doesn't account for fragmentation and the allocation attempt may fail.

This patch changes the LSA code so that it doesn't throw in case zone
couldn't be created but just returns a null pointer which should be
more performant if the LSA memory cannot grow any more.

Fixes #1394.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>
2016-10-14 11:08:24 +02:00
Tomasz Grabiec
e617bcd8a7 logalloc: disable abort on allocation failure in places in which it is benign
Some places start big expecting allocation failure, then reduce the
requested size. Let's not abort in such cases.

Message-Id: <1476295120-32047-1-git-send-email-tgrabiec@scylladb.com>
2016-10-13 10:53:32 +03:00
Glauber Costa
86aa0b830d LSA: allow a group to query its own region group
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
28e3f2f6ee LSA: export information about object memory footprint
We allocate objects of a certain size, but we use a bit more memory to hold
them.  To get a clerer picture about how much memory will an object cost us, we
need help from the allocator. This patch exports an interface that allow users
to query into a specific allocator to get that information.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Tomasz Grabiec
b0b28696b5 scylla-gdb: Add 'scylla lsa-segment' command
Allows one to examine contents of LSA segment.

Example:

  (gdb) scylla lsa-segment 0x601000480000
  0x601000480e70: live size=144 migrator=standard_migrator<cache_entry>::object
  0x601000480f10: live size=144 migrator=standard_migrator<cache_entry>::object
  0x601000480fb0: free size=192
  0x60100048107e: free size=42
  0x6010004814e0: free size=192
  0x6010004815ae: free size=40
  0x6010004815e8: free size=192
  0x6010004816b8: live size=144 migrator=standard_migrator<cache_entry>::object
  0x601000481758: free size=192
  ...
2016-09-20 16:53:21 +02:00
Glauber Costa
fe6a0d97d1 logalloc: make sure allocations in release_requests don't recurse back into the allocator
Calls like later() and with_gate() may allocate memory, although that is not
very common. This can create a problem in the sense that it will potentially
recurse and bring us back to the allocator during free - which is the very thing
we are trying to avoid with the call to later().

This patch wraps the relevant calls in the reclaimer lock. This do mean that the
allocation may fail if we are under severe pressure - which includes having
exhausted all reserved space - but at least we won't recurse back to the
allocator.

To make sure we do this as early as possible, we just fold both release_requests
and do_release_requests into a single function

Thanks Tomek for the suggestion.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>
2016-08-04 11:16:53 +02:00
Tomasz Grabiec
9476bc5a31 Introduce --abort-on-lsa-bad-alloc command line option
Useful for triggerring core dump on allocation failure inside LSA,
which makes it easier to debug allocation failures. They normally
don't cause aborts, just fail the current operation, which makes it
hard to figure out what was the cause of allocation failure.

Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>
2016-08-03 17:26:44 +03:00
Avi Kivity
d261927fa3 logalloc: change sprint() of a pointer to use void* explicitly
Otherwise, fmtlib dislikes it.
2016-07-18 19:37:16 +03:00
Glauber Costa
4e81f19ab5 LSA: fix typo in region merge
There are many potentially tricky things about referring to different regions
from the LSA perspective. Madness, however, is not one of them. I can only
assume we meant made?

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8eb81f35de4b208a494e43cb392eea07b87b2bf1.1466534798.git.glauber@scylladb.com>
2016-06-21 22:58:44 +03:00
Tomasz Grabiec
e783b58e3b Merge branch 'glommer/LSA-throttler-v6' from git@github.com:glommer/scylla.gi
From Glauber:

This is my new take at the "Move throttler to the LSA" series, except
this one don't actually move anything anywhere: I am leaving all
memtable conversion out, and instead I am sending just the LSA bits +
LSA active reclaim. This should help us see where we are going, and
then we can discuss all memtable changes in a series on its own,
logically separated (and hopefully already integrated with virtual
dirty).

[tgrabiec: trivial merge conflicts in logalloc.cc]
2016-06-21 10:22:26 +02:00
Glauber Costa
579d121db8 LSA: export largest region
We now keep the regions sorted by size, and the children region groups as well.
Internally, the LSA has all information it needs to make size-based reclaim
decisions. However, we don't do reclaim internally, but rather warn our user
that a pressure situation is mounted.

The user of a region_group doesn't need to evict the largest region in case of
pressure and is free to do whatever it chooses - including nothing. But more
likely than not, taking into account which region is the largest makes sense.

This patch puts together this last missing piece of the puzzle, and exports the
information we have internally to the user.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-06-20 18:51:00 -04:00
Glauber Costa
35f8a2ce2c LSA: add a backpointer to the region from its private data
Region is implemented using the pimpl pattern (region_impl), and all its
relevant data is present in a private structure instead of the region itself.

That private structure is the one that the other parts of the LSA will refer to,
the region_group being the prime example. To allow classes such as the
region_group the externally export a particular region, we will introduce a
backpointer region_impl -> region.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-06-20 18:50:59 -04:00
Glauber Costa
38a402307d LSA: enhance region_group reclaimer
We are currently just allowing the region_group to specify a throttle_threshold,
that triggers throttling when a certain amount of memory is reached. We would
like to notify the callers that such condition is reached, so that the callers
can do something to alleviate it - like triggering flushes of their structures.

The approach we are taking here is to pass a reclaimer instance. Any user of a
region_group can specialize its methods start_reclaiming and stop_reclaiming
that will be called when the region_group becomes under pressure or ceases to
be, respectively.

Now that we have such facility, it makes more sense to move the
throttle_threshold here than having it separately.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-06-20 18:50:59 -04:00