This patch removes the log message about "compaction_manager - Asked to stop"
at the very end of Scylla runs. This log message is confusing because it
only has the "asked to stop" part, without finally a "stopped", and may
lead a user to incorrectly fear that the shutdown hung - when it in fact
finished just fine.
The database object holds a compaction_manager and stop()s it when the
database is stop()ed - and that is the very last thing our shutdown does.
However, much earlier, as the *first* shutdown operation (i.e., the last
at_exit() in main.cc), we already stop() the compaction manager.
The second stop() call does nothing, but unfortunately prints the log
message just before checking if it has anything to stop. So this patch
just moves the log message to after the check.
Fixes#4238.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190217142657.19963-1-nyh@scylladb.com>
Compaction manager holds reference to all cleaning sstables till the very
end, and that becomes a problem because disk space of cleaned sstables
cannot be reclaimed due to respective file descriptors opened.
Fixes#3735.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181221000941.15024-1-raphaelsc@scylladb.com>
That's only the very first step which introduces the machinery for making
major compaction aware of all strategies. By the time being, default
implementation is used for them all which only suits size tiered.
Refs #1431.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is needed for parallel compaction to work with sstable run based approach.
That's because regular compaction clones a set containing all sstables of its
column family. So compaction A can potentially hold a reference to a compacting
sstable of compaction B, so preventing compacting B from releasing its exhausted
sstable.
So all replacements are propagated to all compactions of a given column family,
and compactions in turn, including the one which initiated the propagation,
will do the replacement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
motivation is that we need a more efficient way to find compactions
that belong to a given column family in compaction list.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Filter out sstable belonging to a partial run being generated by an ongoing
compaction. Otherwise, that could lead to wrong decisions by the compaction
strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's important for the reference to sstable to not be kept throughout
the compaction procedure, which would break the goal of releasing
space during compaction.
Manager passes a callback to compaction which calls it whenever
there's sstable replacement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
Function to reevaluate postponed compaction was called too early for strategies that
don't allow parallel compaction (only leveled strategy (LCS) at this moment).
Such strategies must first have the ongoing compaction deregistered before reevaluating
the postponed ones. Manager uses task list of ongoing compaction to decides if there's
ongoing compaction for a given column family. So compaction could stop making progress
at all *if and only if* we stop flushing new data.
So it could happen that a column family would be left with lots of pending compaction,
leading the user to think all compacting is done, but after reboot, there will be
lots of compaction activity.
We'll both improve method to detect parallel compaction here and also add a call to
reevaluate postponed compaction after compaction is done.
Fixes#3534.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180702185327.26615-1-raphaelsc@scylladb.com>
If we see an exception when adding or removing SSTables from the backlog
tracker, the backlog tracker can be inconsistent forever. It would be
best if we act before that happens and disable the backlog tracker. Once
the backlog tracker is disabled it will default to returning a fixed
number of shares.
We can either disable the backlog tracker or remove it. But if we remove
it we can end up with a backlog of zero if that's the only tracker with
a backlog. We then keep it registered but mark it as disabled. This also
leaves room for recovery in some situations: we can recover the backlog
by a doing a schema change in the column family that had the backlog
disabled, for instance.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Backlog calculations should be exception free, but there are at cases in
which I can see they happening. One example is if some backlog tracker
that uses temporary objects fails an allocation.
Memory shortages can be specially pernicious: if we leave the
responsibility of catching those to the individual backlog tracker, we
will keep trying to make more allocations in the other backlog trackers
if we have many column families. By handling it here we can stop that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We have observed the following behavior with user initiated compactions,
like major compactions:
- if there are no writes, the backlog doesn't increase.
- as compaction progresses the backlog decreases.
- at some point, the backlog is so low that compaction barely makes any
progress.
Going forward, we should allow one to read from the generated partial
SSTables, in which case this doesn't matter that much. But for
user-iniated compactions we would like to guarantee a minimum baseline.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are situations in which we want the controllers to stop working
altogether. Usually that's when we have an unimplemented controller or
some exception.
We want to return fixed shares in this case, but this is a very
different situation from when we want fixed shares for *one* backlog
tracker: we want to return fixed shares, yes, but if we disable 200
backlog trackers (because they all failed, for instance), we don't want
that fixed number x 200 to be our backlog.
So the mechanism to globally disable the controller is still granted,
and infinity is a good way to represent that. It's a float that the
controller can easily test against. But actually using infinity in the
code is confusing. People reading it may interpret it as the other way
around from what it means, just meaning "a very large backlog".
Let's turn that into a constant instead. It will help us convey meaning.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager-- since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.
Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and its consequences it is a lot more natural to
have it inside the compaction manager to begin with.
Once we do that, all the aforementioned problems go away. So let's move
there where it belongs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
After change to serialize compaction on compaction weight (eff62bc61e),
LCS invariant may break because parallel compaction can start, and it's
not currently supported for LCS.
The condition is that weight is deregistered right before last sstable
for a leveled compaction is sealed, so it may happen that a new compaction
starts for the same column family meanwhile that will promote a sstable to
an overlapping token range.
That leads to strategy restoring invariant when it finds the overlapping,
and that means wasted resources.
The fix is about removing a fast path check which is incorrect now because
we release weight early and also fixing a check for ongoing compaction
which prevented compaction from starting for LCS whenever weight tracker
was not empty.
Fixes#3279.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180410034538.30486-1-raphaelsc@scylladb.com>
Commit cce1a2bce8 ("Use the CPU scheduler")
placed some compaction manager code in a scheduling_group. Unfortunately,
downstream code relied on the callers not deferring, so it can rely
on the column_family's existence. That doesn't happen if the column_family
is removed quickly, as with_scheduling_group() always defers.
Fix applying the scheduling group after we've taken the lock and guaranteed
the stability of the column_family object.
Fixes#3196.
Message-Id: <20180211165155.18179-1-avi@scylladb.com>
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.
The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.
Comments from Glauber:
This patch is based in a patch from Avi with the same subject, but the
differences are signficant enough so that I reset authorship. In
particular:
1) A bug/regression is fixed with the boundary calculations for the
memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
go back to the main group before going to the cache group again
3) As per Tomek's suggestion, now the submission of compactions
themselves are run in the compaction scheduling group. Having that
working is what changes this patch the most: we now store the
scheduling group in the compaction manager and let the compaction
manager itself enforce the scheduling group.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If the compaction_backlog_manager's lifetime ends before the linked
compaction_backlog_tracker's, the latter's _manager pointer not being
cleared, can lead to a use-after-free error when running
~compaction_backlog_tracker(), as evidenced by unit-tests failed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180111004914.25796-2-duarte@scylladb.com>
This patch adds infrastucture in various points in the system to allow
us to determine the amount of work present as backlog from compactions.
What needs to be done can be explained in three major pieces:
1) Add hooks in the points where sstables are added or inserted to a
column family (or more precisely, to a compaction_strategy object).
2) Add hooks in reads and write monitors that allows a compaction
backlog estimator (tracker) to become aware of bytes that are
partially written and compacted away.
3) Add a per-column family class (compaction_backlog_tracker) that
can be used to track work that is done and relevant to compactions
(like the two above), and a compaction manager to provide a
system-wide backlog based on the response of the individual trackers.
The definition of how much backlog one has is strategy-specific. The
Null strategy is easy, as it never really has any backlog, and so is the
major strategy - since what it really matters is the backlog of the
underlying compaction strategy.
Although backlogs are strategy-specific, they should be "compatible", in
the sense that if a particular strategy has more work to do, it should
yield a higher number than its counterparts.
All the others are presented in this patch as unimplemented: they will
always advertise a mild backlog that should yield a constant
CPU-utilization if used alone.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, compaction manager will serialize compaction of same size tier
(or weight) if they belong to the same column family. However, it fails to
do so if the compaction jobs belong to different column families.
That can lead to an ungodly amount of running compaction which gets worse
the higher the number of shards and active column families. The problem
is that it may affect overall system performance due to excessive resource
usage. It's easy to trigger it during bootstraping after loading node with
new sstables or repairing, or if lots of cfs are being actively written.
That being said, compaction jobs of same size tier are now serialized
on a given shard, such that maximum number of compaction (system wise)
is now:
(SHARDS) * (SIZE TIERS)
instead of:
(SHARDS) * (COLUMN FAMILIES) * (SIZE TIERS)
We'll work hard to release a size tier (weight) for a column family
waiting on it as fast as possible, given that we wouldn't like to
underutilize resources available for compaction. We want one starting
after the other. Compaction for a column family that cannot run now
because the size tier is taken, will be postponed. There's a worker
that will be sleeping on a condition variable that will be signalled
whenever a compaction completes. FIFO ordering is used on postponed
list for fairness.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will be needed for using it in compaction.hh. We can't declare
compaction_weight_registration in compaction_manager.hh, because
compaction.hh can't include the former due to cyclic dependency,
so compaction_weight_registration will be declared in its own
header.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Two reasons for this change:
1) every compaction should be multiplexed to manager which in turn
will make decision when to schedule. improvements on it will
immediately benefit every existing compaction type.
2) active tasks metric will now track ongoing reshard jobs.
Fixes#2671.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>
This is useful for a column family which isn't generating new content
and will have lots of expired data later on that can be purged.
Compaction submission is NO-OP if there's nothing to do, so I think
it's reasonable to do it at an interval of 1 hour.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
A column family which was truncated will remove itself from compaction
manager. Any task running a compaction should be interrupted and a
task waiting to run should bail out when it wakes up.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170425224350.15965-3-raphaelsc@scylladb.com>
For new resharding, it's important to exclude resharding sstables
from the list of candidates for regular compaction. That's doesn't
affect current resharding because it marks the sstables as
compacting. That won't work with new resharding which will work
with sstables from multiple shards.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
From now on, major compaction will go through compaction manager.
Major compaction is serialized to reduce disk space requirement.
Each column family will be running either minor and major compaction
at a given time. The only issue is number of small sstables growing
while major compaction is running, but major compaction itself will
reduce the number of tables considerably. If this turns out to be
an issue, we can allow minor to start in parallel to major, but not
the other way around.
Fixes#1156.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417233125.14092-1-raphaelsc@scylladb.com>
Problem is that column family field of task wasn't being set for resharding,
so column family wasn't being properly removed from compaction manager.
In addition to fixing this issue, we'll also interrupt ongoing compactions
when dropping a column family, exactly like we do with shutdown.
Fixes#2291.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170418125807.7712-1-raphaelsc@scylladb.com>
Previously, same function was used to handle both regular compaction
and cleanup requests. That's bad because a lot of conditions were
added for both compaction types to live in the same function.
Now, cleanup and regular compaction will live in different functions.
They share a lot of code, so helper functions were introduced.
This change is also important for user-initiated compaction that
will go through compaction manager in the future.
Code is also a lot easier to read now.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There is no longer a need to use gate for regular termination of
fiber that runs compaction. Now, we only set task->stopping to
true, ask for compaction termination, and wait for its future to
resolve. Code is simplified a lot with this change.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>