Calls like later() and with_gate() may allocate memory, although that is not
very common. This can create a problem in the sense that it will potentially
recurse and bring us back to the allocator during free - which is the very thing
we are trying to avoid with the call to later().
This patch wraps the relevant calls in the reclaimer lock. This do mean that the
allocation may fail if we are under severe pressure - which includes having
exhausted all reserved space - but at least we won't recurse back to the
allocator.
To make sure we do this as early as possible, we just fold both release_requests
and do_release_requests into a single function
Thanks Tomek for the suggestion.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>
(cherry picked from commit fe6a0d97d1)
From Glauber:
This is my new take at the "Move throttler to the LSA" series, except
this one don't actually move anything anywhere: I am leaving all
memtable conversion out, and instead I am sending just the LSA bits +
LSA active reclaim. This should help us see where we are going, and
then we can discuss all memtable changes in a series on its own,
logically separated (and hopefully already integrated with virtual
dirty).
[tgrabiec: trivial merge conflicts in logalloc.cc]
We now keep the regions sorted by size, and the children region groups as well.
Internally, the LSA has all information it needs to make size-based reclaim
decisions. However, we don't do reclaim internally, but rather warn our user
that a pressure situation is mounted.
The user of a region_group doesn't need to evict the largest region in case of
pressure and is free to do whatever it chooses - including nothing. But more
likely than not, taking into account which region is the largest makes sense.
This patch puts together this last missing piece of the puzzle, and exports the
information we have internally to the user.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Region is implemented using the pimpl pattern (region_impl), and all its
relevant data is present in a private structure instead of the region itself.
That private structure is the one that the other parts of the LSA will refer to,
the region_group being the prime example. To allow classes such as the
region_group the externally export a particular region, we will introduce a
backpointer region_impl -> region.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are currently just allowing the region_group to specify a throttle_threshold,
that triggers throttling when a certain amount of memory is reached. We would
like to notify the callers that such condition is reached, so that the callers
can do something to alleviate it - like triggering flushes of their structures.
The approach we are taking here is to pass a reclaimer instance. Any user of a
region_group can specialize its methods start_reclaiming and stop_reclaiming
that will be called when the region_group becomes under pressure or ceases to
be, respectively.
Now that we have such facility, it makes more sense to move the
throttle_threshold here than having it separately.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we decide to evict from a specific region_group due to excessive memory
usage, we must also consider looking at each of their children (subgroups). It
could very well be that most of memory is used by one of the subgroups, and
we'll have to evict from there.
We also want to make sure we are evicting from the biggest region of all, and
not the biggest region in the biggest region_group. To understand why this is
important, consider the case in which the regions are memtables associated with
dirty region groups. It could be that a very big memtable was recently flushed,
and a fairly small one took its place. That region group is still quite large
because the memtable hasn't finished flushing yet, but that doesn't mean we
should evict from it.
To allow us to efficiently pick which region is the largest, each root of each
subtree will keep track of its maximal score, defined as the maximum between our
largest region total_space and the maximum maximal score of subtrees.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, the regions in a region group are organized in a simple vector.
We can do better by using a binomial heap, as we do for segments, and then
updating when there is change. Internally to the LSA, we are in good position
to always know when change happens, so that's really the best way to do it.
The end game here, is to easily call for the reclaim of the largest offending
region (potentially asynchronously). Because of that, we aren't really interested
in the region occupancy, but in the region reclaimable occuppancy instead: that's
simply equal to the occupancy if the region is reclaimable, and 0 otherwise. Doing
that effectively lists all non reclaimable regions in the end of the heap, in no
particular order.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The database code uses a throttling function to make sure that memory
used for the dirty region never is over the limit. We track that with
a region group, so it makes sense to move this as generic functionality
into LSA.
This patch implements the LSA-side functionality and a later patch will
convert the current memtable throttler to use it.
Unlike the current throttling mechanism, we'll not use a timer-based
mechanism here. Aside from being more generic and friendlier towards
other users, this is a good change for current memtable by itself.
The constants - 10ms and 1MB chosen by the current throttler are arbitrary, and we
would be better off without them. Let's discuss the merits of each separately:
1) 10ms timer: If we are throttling, we expect somebody to flush the memtables
for memory to be released. Since we are in position to know exactly when a memtable
was written, thus releasing memory, we can just call unthrottle at that point, instead
of using a timer.
2) 1MB release threshold: we do that because we have no idea how much memory a request
will use, so we put the cut somehow. However, because of 1) we don't call unthrottle
through a timer anymore, and do it directly instead. This means that we can just execute
the request and see how much memory it has used, with no need to guess. So we'll call
unthrottle at the end of every request that was previously throttled.
Writing the code this way also has the advantage that we need one less continuation in
the common case of the database not being throttled.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Tomek correctly points out that since we are now using "this" in lambda
captures, we should make the region_group not movable. We currently define a
move constructor, but there are no users. So we should just remove them.
copy constructor is already deleted, and so are the copy and move assignment
operators. So by removing the move constructor, we should be fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reclaiming many segments was observed to cause up to multi-ms
latency. With the new setting, the latency of reclamation cycle with
full segments (worst case mode) is below 1ms.
I saw no decrease in throughput compared to the step of 16 segments in
neither of these modes:
- full segments, reclaim by random evicition
- sparse segments (3% occupancy), reclaim by compaction and no eviction
Fixes#1274.
_closed_occupancy will be used when a region is removed from its region
group, make sure that it is accurate.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
In region destructor, after active segments is freed pointer to it is
left unchanged. This confuses the remaining parts of the destructor
logic (namely, removal from region group) which may rely on the
information in region_impl::_active.
In this particular case the problem was that code removing from the
region group called region_impl::occupancy() which was
dereferencing _active if not null.
Fixes#993.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1457341670-18266-1-git-send-email-pdziepak@scylladb.com>
Historically the purpose of the metric is to show how much memory is
in standard allocations. After zones were introduced, this would also
include free space in lsa zones, which is almost all memory, and thus
the metric lost its original meaning. This change brings it back to
its original meaning.
Message-Id: <1452865125-4033-1-git-send-email-tgrabiec@scylladb.com>
Use steady_clock instead of high_resolution_clock where monotonic
clock is required. high_resolution_clock is essentially a
system_clock (Wall Clock) therefore may not to be assumed monotonic
since Wall Clock may move backwards due to time/date adjustments.
Fixes issue #638
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This fixes compile error:
In function `logalloc::segment_zone::segment_zone()':
/home/lmr/Code/scylla/utils/logalloc.cc:412: undefined reference to `logalloc::segment_zone::minimum_size'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
Originally, lsa allocated each segment independently what could result
in high memory fragmentation. As a result many compaction and eviction
passes may be needed to release a sufficiently big contiguous memory
block.
These problems are solved by introduction of segment zones, contiguous
groups of segments. All segments are allocated from zones and the
algorithm tries to keep the number of zones to a minimum. Moreover,
segments can be migrated between zones or inside a zone in order to deal
with fragmentation inside zone.
Segment zones can be shrunk but cannot grow. Segment pool keeps a tree
containing all zones ordered by their base addresses. This tree is used
only by the memory reclamer. There is also a list of zones that have
at least one free segments that is used during allocation.
Segment allocation doesn't have any preferences which segment (and zone)
to choose. Each zone contains a free list of unused segments. If there
are no zones with free segments a new one is created.
Segment reclamation migrates segments from the zones higher in memory
to the ones at lower addresses. The remaining zones are shrunk until the
requested number of segments is reclaimed.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
boost::heap::binomial_heap allocates helper object in push() and,
therefore, may throw an exception. This shouldn't happen during
compaction.
The solution is to reserve space for this helper object in
segment_descriptor and use a custom allocator with
boost::heap::binomial_heap.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
LSA memory reclaimer logic assumes that the amount of memory used by LSA
equals: segments_in_use * segment_size. However, LSA is also responsible
for eviction of large objects which do not affect the used segmentcount,
e.g. region with no used segments may still use a lot of memory for
large objects. The solution is to switch from measuring memory in used
segments to used bytes count that includes also large objects.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
While the objects above max_manage_object_size aren't stored in the
LSA segments they are still considered to be belonging to the LSA
region and are evictable using that region evictor.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The migrator tells lsa how to move an object when it is compacted.
Currently it is a function pointer, which means we must know how to move
the object at compile time. Making it an object allows us to build the
migration function at runtime, making it suitable for runtime-defined types
(such as tuples and user-defined types).
In the future, we may also store the size there for fixed-size types,
reducing lsa overhead.
C++ variable templates would have made this patch smaller, but unfortunately
they are only supported on gcc 5+.
This is certainly the right thing to do and seems to fix#403. However
I didn't manage to convince myself that this would cause problems for
binomial_heap, given that binomial_heap::erase() calls siftup()
anyway:
void erase(handle_type handle)
{
node_pointer n = handle.node_;
siftup(n, force_inf());
top_element = n;
pop();
}
void increase (handle_type handle)
{
node_pointer n = handle.node_;
siftup(n, *this);
update_top_element();
sanity_check();
}
The segment heap is a max-heap, with sparser segments on the top. When
we free from a segment its occupancy is decreased, but its position in
the heap increases.
This bug caused that we picked up segments for compaction in the wrong
order. In extreme cases this can lead to a livelock, in some cases may
just increase compaction latency.
A region being merged can still be in use; but after merging, compaction_lock
and the reclaim counter will no longer work. This can lead to
use-after-compact-without-re-lookup errors.
Fix by making the source region be the same as the target region; they
will share compaction locks and reclaim counters, so lookup avoidance
will still work correctly.
Fixes#286.
In some cases region may be in a state where it is not empty and
nothing could be evicted from it. For example when creating the first
entry, reclaimer may get invoked during creation before it gets
linked. We therefore can't rely on emptiness as a stop condition for
reclamation, the evction function shall signal us if it made forward
progress.