There is existing code (e.g. use of partition_snapshot_row_cursor in
cache_streamed_mutation) which assumes that references will be
invalidated when bad_alloc is thrown from allocating_section. That is
currently the case because on retry we will attempt memory reclamation
which will invalidate references either through compaction or
eviction. Make this guarantee explicit.
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides. Replace with the right type.
Since the right type is private, add some friendship.
"Before, the logic for releasing writes blocked on dirty worked like this:
1) When region group size changes and it is not under pressure and there
are some requests blocked, then schedule request releasing task
2) request releasing task, if no pressure, runs one request and if there are
still blocked requests, schedules next request releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.
However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when contains sparse segments). Compaction may start many request releasing
threads due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
Before, the logic for releasing writes blocked on dirty worked like
this:
1) When region group size changes and it is not under pressure and
there are some requests blocked, then schedule request releasing
task
2) request releasing task, if no pressure, runs one request and if
there are still blocked requests, schedules next request
releasing task
If requests don't change the size of the region group, then either
some request executes or there is a request releasing task
scheduled. The amount of scheduled tasks is at most 1, there is a
single thread of excution.
However, if requests themselves would change the size of the group,
then each such change would schedule yet another request releasing
thread, growing the task queue size by one.
The group size can also change when memory is reclaimed from the
groups (e.g. when contains sparse segments). Compaction may start
many request releasing threads due to group size updates.
Such behavior is detrimental for performance and stability if there
are a lot of blocked requests. This can happen on 1.5 even with modest
concurrency becuase timed out requests stay in the queue. This is less
likely on 1.6 where they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in
the system. When the amount of scheduled tasks reaches 1000, polling
stops and server becomes unresponsive until all of the released
requests are done, which is either when they start to block on dirty
memory again or run out of blocked requests. It may take a while to
reach pressure condition after memtable flush if it brings virtual
dirty much below the threshold, which is currently the case for
workloads with overwrites producing sparse regions.
Refs #2021.
Fix by ensuring there is at most one request releasing thread at a
time. There will be one releasing fiber per region group which is
woken up when pressure is lifted. It executes blocked requests until
pressure occurs.
The logic for notification across hierachy was replaced by calling
region_group::notify_relief() from region_group::update() on the
broadest relieved group.
The hard pressure was only signalled on region group when
run_when_memory_available() was called after the pressure condition
was met.
So the following loop is always an infinite loop rather than stopping
when engouh is allocated to cause pressure:
while (!gr.under_pressure()) {
region.allocate(...);
}
It's cleaner if pressure notification works not only if
run_when_memory_available() is used but whenever conditino changes,
like we do for the soft pressure.
There is comment in run_when_memory_available() which gives reasons
why notifications are called from there, but I think those reasons no
longer hold:
- we already notify on soft pressure conditions from update(), and if
that is safe, notifying about hard pressure should also be safe. I
checked and it looks safe to me.
- avoiding notification in the rare case when we stopped writing
right after crossing the threshold doesn't seem benefitial. It's
unlikely in the first place, and one could argue it's better to
actually flush now so that when writes resume they will not block.
We already call these when crossing the soft threshold. We shouldn't
stop reclaiming when hard pressure is gone because soft pressure may
still be present. Calling start_reclaiming() on hard pressure is
unnecessary because soft pressure also starts it, and when there is
hard pressure there is also soft pressure.
* seastar 397685c...c1dbd89 (13):
> lowres_clock: drop cache-line alignment for _timer
> net/packet: add missing include
> Merge "Adding histogram and description support" from Amnon
> reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&'
> Set the option '--server' of tests/tcp_sctp_client to be required
> core/memory: Remove superfluous assignment
> core/memory: Remove dead code
> core/reactor: Use logger instead of cerr
> fix inverted logic in overprovision parameter
> rpc: fix timeout checking condition
> rpc: use lowres_clock instead of high resolution one
> semaphore: make semaphore's clock configurable
> rpc: detect timedout outgoing packets earlier
Includes treewide change to accomodate rpc changing its timeout clock
to lowres_clock.
Includes fixup from Amnon:
collectd api should use the metrics getters
As part of a preperation of the change in the metrics layer, this change
the way the collectd api uses the metrics value to use the getters
instead of calling the member directly.
This will be important when the internal implementation will changed
from union to variant.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>
As the metrics migration progressed, some include to scollectd.hh left
behind.
Because of the nature of the scollecd implementation those include
brings alot of code with them to the header files and eventually to many
source file.
This patch remove those include and add a missing include to
storage_proxy.cc.
The reason the compiler didn't complain is an indication to the
problematic nature of those include in the first place.
Before this patch, change in metrics.hh would cause 169 files to
compile, after this change 17.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1484667536-2185-1-git-send-email-amnon@scylladb.com>
The region_group_reclaimer will let us know every time we are over the
limit we have specified for memory usage.
However, For some applications, we would be interested in knowing about
memory build up earlier, so we can start doing something about it before
we reach that condition.
This patch introduce soft limit notifications for the
region_group_reclaimer. After this patch is applied, start_reclaim() is
called earlier, and stop_reclaim() later, after the soft condition is
abated.
There are methods that allow one to easily test if the pressure
condition is a soft limit condition or a hard, threshold condition and
act accordingly. Whether to act on both conditions or just one of them
is up to the application.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
is_compactible() will pass on very small regions. full_compaction() is
only used in tests to force objects to be moved due to compaction, so
we want all reclaimable regions to be compacted.
There is already queue_length-requests_blocked_memory, but it's a
gauge so does not reflect what happened between the sampling points.
total_operations-requests_blocked_memory will allow to see if there
were any (and how many) requests which were blocked by dirty memory.
Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>
Also add information about for how long has the oldest been sitting in the
queue. This is part of the backpressure work to allow us to throttle incoming
requests if we won't have memory to process them. Shortages can happen in all
sorts of places, and it is useful when designing and testing the solutions to
know where they are, and how bad they are.
This counter is named for consistency after similar counters from transport/.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Calls like later() and with_gate() may allocate memory, although that is not
very common. This can create a problem in the sense that it will potentially
recurse and bring us back to the allocator during free - which is the very thing
we are trying to avoid with the call to later().
This patch wraps the relevant calls in the reclaimer lock. This do mean that the
allocation may fail if we are under severe pressure - which includes having
exhausted all reserved space - but at least we won't recurse back to the
allocator.
To make sure we do this as early as possible, we just fold both release_requests
and do_release_requests into a single function
Thanks Tomek for the suggestion.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>
Issue 1510 describes a scenario in which, under load, we allocate memory within
release_requests() leading to a reentry into an invalid state in our
blocked requests' shared_promise.
This is not easy to trigger since not all allocations will actually get to the
point in which they need a new segment, let alone have that happening during
another allocator call.
Having those kinds of reentry is something we have always sought to avoid with
release_requests(): this is the reason why most of the actual routine is
deferred after a call to later().
However, that is a trick we cannot use for updating the state of the blocked
requests' shared_promise: we can't guarantee when is that going to run, and we
always need a valid shared_promise, in a valid state, waiting for new requests
to hook into.
The solution employed by this patch is to make sure that no allocation
operations whatsoever happen during the initial part of release_requests on
behalf of the shared promise. Allocation is now deferred to first use, which
relieves release_requests() from all allocation duties. All it needs to do is
free the old object and signal to the its user that an allocation is needed (by
storing {} into the shared_promise).
Fixes#1510
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49771e51426f972ddbd4f3eeea3cdeef9cc3b3c6.1470238168.git.glauber@scylladb.com>
Useful for triggerring core dump on allocation failure inside LSA,
which makes it easier to debug allocation failures. They normally
don't cause aborts, just fail the current operation, which makes it
hard to figure out what was the cause of allocation failure.
Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>
From Glauber:
This is my new take at the "Move throttler to the LSA" series, except
this one don't actually move anything anywhere: I am leaving all
memtable conversion out, and instead I am sending just the LSA bits +
LSA active reclaim. This should help us see where we are going, and
then we can discuss all memtable changes in a series on its own,
logically separated (and hopefully already integrated with virtual
dirty).
[tgrabiec: trivial merge conflicts in logalloc.cc]
We now keep the regions sorted by size, and the children region groups as well.
Internally, the LSA has all information it needs to make size-based reclaim
decisions. However, we don't do reclaim internally, but rather warn our user
that a pressure situation is mounted.
The user of a region_group doesn't need to evict the largest region in case of
pressure and is free to do whatever it chooses - including nothing. But more
likely than not, taking into account which region is the largest makes sense.
This patch puts together this last missing piece of the puzzle, and exports the
information we have internally to the user.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are currently just allowing the region_group to specify a throttle_threshold,
that triggers throttling when a certain amount of memory is reached. We would
like to notify the callers that such condition is reached, so that the callers
can do something to alleviate it - like triggering flushes of their structures.
The approach we are taking here is to pass a reclaimer instance. Any user of a
region_group can specialize its methods start_reclaiming and stop_reclaiming
that will be called when the region_group becomes under pressure or ceases to
be, respectively.
Now that we have such facility, it makes more sense to move the
throttle_threshold here than having it separately.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we decide to evict from a specific region_group due to excessive memory
usage, we must also consider looking at each of their children (subgroups). It
could very well be that most of memory is used by one of the subgroups, and
we'll have to evict from there.
We also want to make sure we are evicting from the biggest region of all, and
not the biggest region in the biggest region_group. To understand why this is
important, consider the case in which the regions are memtables associated with
dirty region groups. It could be that a very big memtable was recently flushed,
and a fairly small one took its place. That region group is still quite large
because the memtable hasn't finished flushing yet, but that doesn't mean we
should evict from it.
To allow us to efficiently pick which region is the largest, each root of each
subtree will keep track of its maximal score, defined as the maximum between our
largest region total_space and the maximum maximal score of subtrees.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, the regions in a region group are organized in a simple vector.
We can do better by using a binomial heap, as we do for segments, and then
updating when there is change. Internally to the LSA, we are in good position
to always know when change happens, so that's really the best way to do it.
The end game here, is to easily call for the reclaim of the largest offending
region (potentially asynchronously). Because of that, we aren't really interested
in the region occupancy, but in the region reclaimable occuppancy instead: that's
simply equal to the occupancy if the region is reclaimable, and 0 otherwise. Doing
that effectively lists all non reclaimable regions in the end of the heap, in no
particular order.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The database code uses a throttling function to make sure that memory
used for the dirty region never is over the limit. We track that with
a region group, so it makes sense to move this as generic functionality
into LSA.
This patch implements the LSA-side functionality and a later patch will
convert the current memtable throttler to use it.
Unlike the current throttling mechanism, we'll not use a timer-based
mechanism here. Aside from being more generic and friendlier towards
other users, this is a good change for current memtable by itself.
The constants - 10ms and 1MB chosen by the current throttler are arbitrary, and we
would be better off without them. Let's discuss the merits of each separately:
1) 10ms timer: If we are throttling, we expect somebody to flush the memtables
for memory to be released. Since we are in position to know exactly when a memtable
was written, thus releasing memory, we can just call unthrottle at that point, instead
of using a timer.
2) 1MB release threshold: we do that because we have no idea how much memory a request
will use, so we put the cut somehow. However, because of 1) we don't call unthrottle
through a timer anymore, and do it directly instead. This means that we can just execute
the request and see how much memory it has used, with no need to guess. So we'll call
unthrottle at the end of every request that was previously throttled.
Writing the code this way also has the advantage that we need one less continuation in
the common case of the database not being throttled.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Allocations will still be allowed if made directly, but callers will have the
choice (in an upcoming patch) to proceed only if memory is below this threshold.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Tomek correctly points out that since we are now using "this" in lambda
captures, we should make the region_group not movable. We currently define a
move constructor, but there are no users. So we should just remove them.
copy constructor is already deleted, and so are the copy and move assignment
operators. So by removing the move constructor, we should be fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Originally, lsa allocated each segment independently what could result
in high memory fragmentation. As a result many compaction and eviction
passes may be needed to release a sufficiently big contiguous memory
block.
These problems are solved by introduction of segment zones, contiguous
groups of segments. All segments are allocated from zones and the
algorithm tries to keep the number of zones to a minimum. Moreover,
segments can be migrated between zones or inside a zone in order to deal
with fragmentation inside zone.
Segment zones can be shrunk but cannot grow. Segment pool keeps a tree
containing all zones ordered by their base addresses. This tree is used
only by the memory reclamer. There is also a list of zones that have
at least one free segments that is used during allocation.
Segment allocation doesn't have any preferences which segment (and zone)
to choose. Each zone contains a free list of unused segments. If there
are no zones with free segments a new one is created.
Segment reclamation migrates segments from the zones higher in memory
to the ones at lower addresses. The remaining zones are shrunk until the
requested number of segments is reclaimed.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
A region being merged can still be in use; but after merging, compaction_lock
and the reclaim counter will no longer work. This can lead to
use-after-compact-without-re-lookup errors.
Fix by making the source region be the same as the target region; they
will share compaction locks and reclaim counters, so lookup avoidance
will still work correctly.
Fixes#286.
In some cases region may be in a state where it is not empty and
nothing could be evicted from it. For example when creating the first
entry, reclaimer may get invoked during creation before it gets
linked. We therefore can't rely on emptiness as a stop condition for
reclamation, the evction function shall signal us if it made forward
progress.
Related to #259. In some cases we need to allocate memory and hold
reclaim lock at the same time. If that region holds most of the
reclaimable memory, allocations inside that code section may
fail. allocating_section is a work-around of the problem. It learns
how big reserves shold be from past execution of critical section and
tries to ensure proper reserves before entering the section.
Disabling compaction of a region is currently done in order to keep
the references valid. But disabling only compaction is not enough, we
also need to disable eviction, as it also invalidates
references. Rather than introducing another type of lock, compaction
and eviction are controlled together, generalized as "reclaiming"
(hence the reclaim_lock).