37 Commits

Author SHA1 Message Date
Lakshmi Narayanan Sreethar
9cb766f929 db/config: introduce new config parameter compaction_max_shares
Add support for the new configuration parameter `compaction_max_shares`,
and update the compaction manager to pass it down to the compaction
controller when it changes. The shares allocated to compaction jobs will
be limited by this new parameter.
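
A minimal sketch of how a live configuration value could be passed down to a controller when it changes. `observable_value`, `toy_controller` and `toy_compaction_manager` are hypothetical stand-ins written for illustration; they are not the ScyllaDB config or compaction manager classes.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical live-updatable config value: observers run on every change.
template <typename T>
class observable_value {
    T _value;
    std::vector<std::function<void(const T&)>> _observers;
public:
    explicit observable_value(T v) : _value(std::move(v)) {}
    void observe(std::function<void(const T&)> cb) {
        _observers.push_back(std::move(cb));
    }
    void set(T v) {
        _value = std::move(v);
        for (auto& cb : _observers) {
            cb(_value);
        }
    }
    const T& get() const { return _value; }
};

// Hypothetical controller that only remembers its current cap.
struct toy_controller {
    float max_shares;
    void set_max_shares(float m) { max_shares = m; }
};

// Hypothetical manager: subscribes to the config value and forwards updates.
// It captures `this`, so it must not be moved or outlive the config value.
class toy_compaction_manager {
    toy_controller _controller;
public:
    explicit toy_compaction_manager(observable_value<float>& cfg)
        : _controller{cfg.get()} {
        cfg.observe([this](const float& v) { _controller.set_max_shares(v); });
    }
    float max_shares() const { return _controller.max_shares; }
};
```

Calling `set()` on the config value is enough for the manager's controller to see the new cap; no restart or reconstruction is involved in this sketch.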

Fixes #9431

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-11-24 12:52:29 -03:00
Lakshmi Narayanan Sreethar
f2b0489d8c compaction_controller: add configurable maximum shares
Add a `max_shares` constructor parameter to compaction_controller to
allow configuring the maximum output of the control points at
construction time. The constructor now calls `set_max_shares()` with the
provided max_shares value. The subsequent commits will wire this value
to a new configuration option.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-11-24 11:43:24 -03:00
Lakshmi Narayanan Sreethar
853811be90 compaction_controller: introduce set_max_shares()
Add a method to dynamically adjust the maximum output of control points
in the compaction controller. This is required for supporting runtime
configuration of the maximum shares allocated to the compaction process
by the controller.
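
As a rough illustration of a runtime-adjustable cap, the sketch below clamps whatever the controller computed to a maximum that can be changed at any time. `share_limiter` and `limit()` are hypothetical names; the real controller may adjust its control points instead of clamping the result.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: caps computed shares at a maximum that can change at runtime.
class share_limiter {
    float _max_shares;
public:
    explicit share_limiter(float max_shares) : _max_shares(max_shares) {
        assert(max_shares > 0);
    }
    // Takes effect on the next call to limit().
    void set_max_shares(float max_shares) {
        assert(max_shares > 0);
        _max_shares = max_shares;
    }
    // Returns the computed shares, never exceeding the current maximum.
    float limit(float computed_shares) const {
        return std::min(computed_shares, _max_shares);
    }
};
```

Keeping the cap separate from the computed value means that raising the maximum later restores the original output without recomputing anything.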

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-11-24 11:43:20 -03:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Kefu Chai
db9e314965 treewide: apply codespell to the comments in source code
for fewer spelling errors in comments.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16408
2023-12-20 10:25:03 +02:00
Pavel Emelyanov
5412c7947a backlog_controller: Unwrap scheduling_group
Some time ago (997a34bf8c) the backlog
controller was generalized to maintain some scheduling group. Back then
the group was the pair of seastar::scheduling_group and
seastar::io_priority_class. Now the latter is gone, so the controller's
notion of what a sched group is can be relaxed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14266
2023-06-16 12:02:14 +03:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritors to the updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-automatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicable)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairy -- it needs to use the task queues
list for IO class names and shares, but to detect whether it should, it
checks whether the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Benny Halevy
774a10017c backlog_controller: destroy _update_timer before _current_backlog
The _update_timer callback calls adjust() that
depends on _current_backlog and currently, _current_backlog is
destroyed before _update_timer.

This is benign since there are no preemption points in
the destructor, but it's more correct and elegant
to destroy the timer first, before other members it depends on.
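
The ordering relies on a C++ rule: non-static data members are destroyed in reverse order of declaration. The sketch below demonstrates that rule with a hypothetical `tracked` member type and a toy class; the member names only mirror the ones mentioned above and the layout is an assumption.

```cpp
#include <string>
#include <utility>
#include <vector>

// Records the order in which tracked members are destroyed.
inline std::vector<std::string>& destruction_log() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical member type that logs its name on destruction.
struct tracked {
    std::string name;
    explicit tracked(std::string n) : name(std::move(n)) {}
    ~tracked() { destruction_log().push_back(name); }
};

// Toy layout: the timer is declared after the state its callback reads,
// so the timer is destroyed first.
struct toy_backlog_controller {
    tracked current_backlog{"current_backlog"};
    tracked update_timer{"update_timer"};
};
```

Declaring a dependent member last is therefore enough to guarantee it goes away before the members it uses.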

Fixes #14056

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14057
2023-05-29 23:03:24 +03:00
Benny Halevy
c9a9720247 backlog_controller: keep scheduling_group by value
There is no need to keep a mutable reference to the
scheduling_group passed at construction time since
setting / updating shares is using the scheduling_group /
io_priority_class id as a handle, and the id itself is never
changed by the backlog_controller.

Note that the class names are misleading; in hindsight,
they would better be called scheduling_group_id
and io_priority_class_id, respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:38:40 +03:00
Benny Halevy
78ad1c70a2 backlog_controller: scheduling_group: keep io_priority_class by value
Exactly like the cpu scheduling_group, io_priority_class
contains the class id, which is a handle to the io_priority_class
and so can be kept by value, rather than by reference,
and be safely copied around.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:38:40 +03:00
Benny Halevy
450ecd60c6 backlog_controller: scheduling_group: define default member initializers
To prepare for the next patch, implement default initialization
of the scheduling_group and io_priority_class, to the default values.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:38:40 +03:00
Benny Halevy
3e6622180e backlog_controller: get rid of _interval member
It isn't used outside the constructor.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-02 07:38:40 +03:00
Igor Ribeiro Barbosa Duarte
8dd0f4672d compaction: Make compaction_static_shares liveupdateable
This patch makes compaction_static_shares liveupdateable
to avoid having to restart the cluster after updating
this config.

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-07-19 10:10:46 -03:00
Igor Ribeiro Barbosa Duarte
c2ee6492e6 backlog_controller: Unify backlog_controller constructors
This patch adds the _static_shares variable to the backlog_controller so that
instead of having to use a separate constructor when controller is disabled,
we can use a single constructor and periodically check in the adjust method
whether we should use the static shares or the controller. This will be useful in
the next patches to make compaction_static_shares and memtable_flush_static_shares
live updateable.
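
A minimal sketch of the single-constructor idea, assuming that a positive static value means "controller disabled". `toy_backlog_controller`, `update_static_shares()` and the linear `shares_from_backlog()` placeholder are hypothetical and do not reproduce the real class.

```cpp
#include <algorithm>
#include <functional>
#include <utility>

// Hypothetical controller: one constructor, static override checked in adjust().
class toy_backlog_controller {
    float _static_shares;                 // > 0 means: ignore the backlog, use this value
    std::function<float()> _backlog;      // samples the current backlog
    float _shares = 0;

    // Placeholder control function, for illustration only.
    static float shares_from_backlog(float backlog) {
        return std::min(1000.0f, 100.0f * backlog);
    }
public:
    toy_backlog_controller(float static_shares, std::function<float()> backlog)
        : _static_shares(static_shares), _backlog(std::move(backlog)) {}

    // May be called at runtime; takes effect on the next adjust().
    void update_static_shares(float s) { _static_shares = s; }

    // Called periodically (e.g. from a timer in a real system).
    void adjust() {
        if (_static_shares > 0) {
            _shares = _static_shares;
            return;
        }
        _shares = shares_from_backlog(_backlog());
    }
    float shares() const { return _shares; }
};
```

Because the check happens on every `adjust()`, switching between static and dynamic behaviour needs no second constructor.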

Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
2022-07-19 10:06:12 -03:00
Pavel Emelyanov
997a34bf8c backlog_controller: Generalize scheduling groups
Make struct scheduling_group be a sub-class of the backlog controller. Its
new meaning is now -- the group under controller maintenance. Both
database and compaction manager derive their sched groups from this one.

This makes backlog controller construction simpler, prepares the ground
for sched groups unification in seastar and facilitates next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Pavel Emelyanov
fbb59fc920 compaction_manager: Keep compaction_sg on board
This is mainly to make next patch simpler. Also this makes the backlog
controller API smaller by removing its sg() method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-06-16 17:40:19 +03:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
a55b434a2b treewide: extend copyright statements to present day
2021-06-06 19:18:49 +03:00
Botond Dénes
e0284bb9ee treewide: add missing headers and/or forward declarations
2020-03-23 09:29:45 +02:00
Avi Kivity
b0980ba7c6 compaction_controller: increase minimum shares to 50 (~5%) for small-data workloads
The workload in #3844 has these characteristics:
 - very small data set size (a few gigabytes per shard)
 - large working set size (all the data, enough for high cache miss rate)
 - high overwrite rate (so a compaction results in 12X data reduction)

As a result, the compaction backlog controller assigns very few shares to
compaction (low data set size -> low backlog), so compaction proceeds very slowly.
Meanwhile, we have tons of cache misses, and each cache miss needs to read from a
large number of sstables (since compaction isn't progressing). The end result is
a high read amplification, and in this test, timeouts.

While we could declare that the scenario is very artificial, there are other
real-world scenarios that could trigger it. Consider a 100% write load
(population phase) followed by 100% read. Towards the end of the last compaction,
the backlog will drop more and more until compaction slows to a crawl, and until
it completes, all the data (for that compaction) will have to be read from its
input sstables, resulting in read amplification.

We should probably have read amplification affect the backlog, but for now the
simpler solution is to increase the minimum shares to 50 so that compaction
always makes forward progress. This will result in higher-than-needed compaction
bandwidth in some low write rate scenarios so we will see fluctuations in request
rate (what the controller was designed to avoid), but these fluctuations will be
limited to 5%.

Since the base class backlog_controller has a fixed (0, 0) point, remove it
and add it to derived classes (setting it to (0, 50) for compaction).
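
To illustrate what a (0, 50) control point means, here is a sketch of piecewise-linear interpolation over control points. Only the (0, 50) point comes from the text above; the other two points and all names are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One control point: at this backlog, hand out this many shares.
struct control_point {
    float backlog;
    float shares;
};

// Illustrative points only; (0, 50) is the minimum described above,
// the remaining values are assumptions.
inline std::vector<control_point> illustrative_compaction_points() {
    return {{0.0f, 50.0f}, {1.5f, 100.0f}, {30.0f, 1000.0f}};
}

// Linear interpolation between neighbouring points (sorted by backlog),
// clamped to the first and last point outside their range.
inline float shares_for_backlog(const std::vector<control_point>& points, float backlog) {
    assert(!points.empty());
    if (backlog <= points.front().backlog) {
        return points.front().shares;
    }
    for (std::size_t i = 1; i < points.size(); ++i) {
        const control_point& lo = points[i - 1];
        const control_point& hi = points[i];
        if (backlog <= hi.backlog) {
            const float t = (backlog - lo.backlog) / (hi.backlog - lo.backlog);
            return lo.shares + t * (hi.shares - lo.shares);
        }
    }
    return points.back().shares;
}
```

With a first point at (0, 50), even a zero backlog yields 50 shares, so compaction keeps making progress.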

Fixes #3844 (or at least improves it).
Message-Id: <20181231162710.29410-1-avi@scylladb.com>
2019-01-04 10:58:43 +01:00
Glauber Costa
70c47eb045 controller: adjust constants for compaction controller
Right now the controller adjusts its shares based on how big the backlog
is in comparison to shard memory. We have seen in some tests that if the
dataset becomes too big, this may cause compactions to dominate.

While we may change the input altogether in future versions, I'd like to
propose a quick change for the time being: move the high point from 10x
memory size to 30x memory size. This will cause compactions to increase
in shares more slowly.

While 30 is as magic a number as the 10 before, it will allow us to err on
the side of caution, with compactions not becoming aggressive enough to
overly disrupt workloads.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-18 15:16:38 -04:00
Glauber Costa
c55ab93178 backlog_controller: add constants to represent a globally disabled controller
There are situations in which we want the controllers to stop working
altogether. Usually that's when we have an unimplemented controller or
some exception.

We want to return fixed shares in this case, but this is a very
different situation from when we want fixed shares for *one* backlog
tracker: we want to return fixed shares, yes, but if we disable 200
backlog trackers (because they all failed, for instance), we don't want
that fixed number x 200 to be our backlog.

So the mechanism to globally disable the controller is still warranted,
and infinity is a good way to represent that. It's a float that the
controller can easily test against. But actually using infinity in the
code is confusing. People reading it may interpret it as the other way
around from what it means, just meaning "a very large backlog".

Let's turn that into a constant instead. It will help us convey meaning.
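
A small sketch of the idea, assuming a named constant that wraps infinity. `disabled_backlog`, `total_backlog()` and `shares_for()` are hypothetical names, and the linear formula is a placeholder for the real control function.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Named sentinel: "the controller is globally disabled", not "a huge backlog".
constexpr float disabled_backlog = std::numeric_limits<float>::infinity();

// Sum of per-tracker backlogs. Any number of disabled trackers still
// yields the same sentinel, since infinity absorbs finite additions.
inline float total_backlog(const std::vector<float>& tracker_backlogs) {
    float total = 0;
    for (float b : tracker_backlogs) {
        total += b;
    }
    return total;
}

// Fixed shares are returned once when disabled, never multiplied per tracker.
inline float shares_for(float backlog, float fixed_shares) {
    if (backlog == disabled_backlog) {
        return fixed_shares;
    }
    return std::min(1000.0f, 100.0f * backlog); // placeholder control function
}
```

The comparison against a named constant reads as intent, whereas a bare infinity in the code could be mistaken for an enormous backlog.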

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:25:23 -04:00
Glauber Costa
d758a416f8 backlog_controller: move compaction controller to the compaction manager
There was recently an attempt to add minimum shares to major compactions
which ended up being harder than it should be due to all the plumbing
necessary to call the compaction controller from inside the compaction
manager, since it is currently a database object. We had this problem
again when trying to return fixed shares in case of an exception.

Taking a step back, all of those problems stem from the fact that the
compaction controller really shouldn't be a part of the database: as it
deals with compactions and its consequences it is a lot more natural to
have it inside the compaction manager to begin with.

Once we do that, all the aforementioned problems go away. So let's move
it to where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-22 09:24:19 -04:00
Glauber Costa
d3f985ef46 backlog_controller: allow users to compute inverse function of shares
There are some situations in which we want to force a specific amount of
shares and don't have a backlog. We can provide a function to get that
from the controller.
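
One way to picture the inverse: if shares are a piecewise-linear, strictly increasing function of backlog, the backlog that yields a given number of shares can be found by interpolating in the other direction. The control points and function name below are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One control point: at this backlog, hand out this many shares.
struct control_point {
    float backlog;
    float shares;
};

// Inverse lookup: which backlog would produce `shares`?
// Assumes points are sorted and shares are strictly increasing.
inline float backlog_for_shares(const std::vector<control_point>& points, float shares) {
    assert(!points.empty());
    if (shares <= points.front().shares) {
        return points.front().backlog;
    }
    for (std::size_t i = 1; i < points.size(); ++i) {
        const control_point& lo = points[i - 1];
        const control_point& hi = points[i];
        if (shares <= hi.shares) {
            const float t = (shares - lo.shares) / (hi.shares - lo.shares);
            return lo.backlog + t * (hi.backlog - lo.backlog);
        }
    }
    return points.back().backlog;
}
```

A caller that wants to force, say, 75 shares can feed the returned backlog into the controller instead of bypassing it.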

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-05-21 19:35:07 -04:00
Avi Kivity
80651e6dcc database: reduce idle memtable flush cpu shares to 1%
Commit 1671d9c433 (not on any release branch)
accidentally bumped the idle memtable flush cpu shares to 100 (representing
10%), causing flushes to be too when they don't consume too much cpu.

Fixes #3243.
Message-Id: <20180408104601.9607-1-avi@scylladb.com>
2018-04-08 17:12:14 +01:00
Duarte Nunes
b7bd9b8058 backlog_controller: Stop update timer
On database shutdown, this timer can cause use-after-free errors if
not stopped.

Refs #3315

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180324140822.3743-1-duarte@scylladb.com>
2018-03-26 14:36:16 +03:00
Glauber Costa
4272279bbb controllers: unify the I/O and CPU controllers
We have had so far an I/O controller, for compactions and memtables, and
a CPU controller, for memtables only -- since the scheduling was still
quota-based.

Now that the CPU scheduler is fully functional, it is time to do away
with the differences and integrate them both into one.  We now have a
memtable controller and a compaction controller, and they control both
CPU and I/O.

In the future, we may want to control processes that don't do one of
them, like cache updates. If that ever happens, we'll try to make
controlling one of them optional. But for now, since the I/O and CPU
controllers for our main two processes would look exactly the same we
should integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:30 -05:00
Glauber Costa
7b6f188e27 controllers: allow a static priority to override the controller output
We have merged the I/O controller without this, but we want to integrate
the CPU and I/O controllers into one. Currently, the quota can be
statically set for the CPU controller. For now, until we gain more
experience with it we should allow a static value to override the
controller's output as well.

That is particularly important since we don't yet control some
strategies like LCS and the time-based ones. Users in the field may be
using one of those strategies with a static value for background quota.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
6f295a2a8a controllers: update control points for memtable I/O controller
Right now CPU and I/O controllers have slightly different control points
for no good reason. Let's use the CPU controller ones as the standard, as
we have been using it in the field for longer and trust it more.

The end goal is to fully integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
b895d495cc controllers: allow memtable I/O controller to have shares statically set
This is so it looks more like the CPU controller. The end goal is to integrate them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
c099c98676 controllers: retire auto_adjust_flush_quota
It no longer makes sense now that we have the full scheduler +
controllers.  In its place, we will provide an option to statically set
the controller's shares as a safeguard against us getting this wrong.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Glauber Costa
956af9f099 database, main: set up scheduling_groups for our main tasks
Set up scheduling groups for streaming, compaction, memtable flush, query,
and commitlog.

The background writer scheduling group is retired; it is split into
the memtable flush and compaction groups.

Comments from Glauber:

This patch is based on a patch from Avi with the same subject, but the
differences are significant enough so that I reset authorship. In
particular:

1) A bug/regression is fixed with the boundary calculations for the
   memtable controller sampling function.
2) A leftover is removed, where after flushing a memtable we would
   go back to the main group before going to the cache group again
3) As per Tomek's suggestion, the submission of compactions
   itself is now run in the compaction scheduling group. Having that
   working is what changes this patch the most: we now store the
   scheduling group in the compaction manager and let the compaction
   manager itself enforce the scheduling group.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Avi Kivity
641aaba12c database, sstables, compaction: convert use of thread_scheduling_group to seastar cpu scheduler
thread_scheduling_groups are converted to plain scheduling_group. Due to
differences in initialization (scheduling_group initialization defers), we
create the scheduling_groups in main.cc and propagate them to users via
a new class database_config.

The sstable writer loses its thread_scheduling_group parameter and instead
inherits scheduling from its caller.

Since shares are in the 1-1000 range vs. 0-1 for thread scheduling quotas,
the flush controller was adjusted to return values within the higher ranges.
2018-02-07 17:19:29 -05:00
Glauber Costa
4f1b875784 database: add a controller for I/O on memtable flushes.
The algorithm and principle of operation is the same as the CPU
controller. It is, however, always enabled and we will operate on
I/O shares.

I/O-bound workloads are expected to hit the maximum once virtual
dirty fills up and stay there while the load is steady.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:58:57 -05:00
Glauber Costa
244c564aac compaction: adjust shares for compactions
Compactions can be a heavy disk user and the I/O scheduler can always
guarantee that it uses its fair share of disk.

Such a fair share can, however, be a lot more than what compaction actually
needs. This patch draws on the controller infrastructure to adjust the
I/O shares that the compaction class will get so that compaction
bandwidth is dynamically adjusted.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:58:57 -05:00
Glauber Costa
4b44a22236 backlog_controllers: implement generic I/O controller
Like the CPU controller, but will act on I/O priorities.
Shares can go from 0 to 1000.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:56:54 -05:00
Glauber Costa
1671d9c433 factor out some of the controller code
The control algorithm we are using for memtables has proven itself
quite successful. We will very likely use the same for other processes,
like compactions.

Make the code a bit more generic, so that a new controller has to only
set the desired parameters.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-01-03 19:56:54 -05:00