"This patchset implements the compaction controller for I/O shares. The
goal is to automatically adjust compaction shares based on a
strategy-specific backlog. A higher backlog translates into higher
shares.
As compaction progresses, the backlog shrinks. As new data is
flushed, the backlog grows. The goal of the controller is to
keep the backlog stable, so that compaction goes neither
too fast nor too slow.
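To illustrate "a higher backlog translates into higher shares", here is a minimal sketch that assumes a piecewise-linear mapping over a few control points. The names control_point and shares_for_backlog and the numbers in the usage note are hypothetical; the patchset may compute shares differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical control point: at `backlog`, the controller hands out `shares`.
struct control_point {
    float backlog;
    float shares;
};

// Maps a backlog value to shares by linear interpolation between control points.
// Assumes `points` is non-empty and sorted by strictly ascending backlog.
inline float shares_for_backlog(const std::vector<control_point>& points, float backlog) {
    if (backlog <= points.front().backlog) {
        return points.front().shares;   // clamp below the first point
    }
    for (std::size_t i = 1; i < points.size(); ++i) {
        const control_point& lo = points[i - 1];
        const control_point& hi = points[i];
        if (backlog <= hi.backlog) {
            float t = (backlog - lo.backlog) / (hi.backlog - lo.backlog);
            return lo.shares + t * (hi.shares - lo.shares);
        }
    }
    return points.back().shares;        // clamp above the last point
}
```

With control points {0, 10}, {1, 100} and {2, 1000}, a backlog halfway between the first two points gets shares halfway between 10 and 100, and any backlog beyond the last point is clamped to 1000.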
Tracking reads and writes:
==========================
Tracking of reads and writes happens through the read_monitor and the
write_monitor. The write monitor is an existing interface whose
purpose is to release the write permit at particular points of the write
process. We enhance it to receive a reference to an instance that tracks
the current offset inside the sstables::file_writer. This way the
backlog tracker always knows the exact offset of the
current write.
A similar thing is done for reads. The data_consumer already tracks the
position of the current read, and we isolate that into a structure to
which we can get a reference. A read_monitor allows us to connect the
compaction to that reference.
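The write side of this arrangement can be sketched as follows. This is a simplified illustration, not the patchset's code: writer_offset_tracker, file_writer_sketch and write_monitor_sketch are hypothetical stand-ins, and the real interfaces carry more state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the offset kept inside the file writer.
struct writer_offset_tracker {
    uint64_t offset = 0;
};

// Hypothetical writer: advances its tracker on every write and exposes
// the tracker by reference, so observers never copy the offset.
class file_writer_sketch {
    writer_offset_tracker _tracker;
public:
    void write(uint64_t bytes) { _tracker.offset += bytes; }
    const writer_offset_tracker& offset_tracker() const { return _tracker; }
};

// Hypothetical monitor: remembers where the tracker lives once the write
// starts, and reads the live offset on demand.
class write_monitor_sketch {
    const writer_offset_tracker* _tracker = nullptr;
public:
    void on_write_started(const writer_offset_tracker& t) { _tracker = &t; }
    uint64_t written() const { return _tracker ? _tracker->offset : 0; }
};
```

Because the monitor holds a reference rather than a snapshot, the writer must outlive the monitor's use of it, which is what the lifetime rules in the next section are about.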
Lifetime management:
====================
In general, tracking objects will be owned by their callers and passed
down as references. The compaction object will own the read monitors and
the compaction write monitors, while the memtable flush write monitor will
be kept alive in a do_with block around the flush itself.
The backlog_{write,read}_progress_manager needs to be kept alive until
the SSTable is no longer in progress. For writes, that means until we
are able to add the SSTable charges in full, and for reads (compaction)
that means until we are able to remove the charges in full.
It is important to do that to avoid spikes in the graph. If we remove
the progress managers in a different operation than updating the SSTable
list we will be left in a temporary state where charges appear or
disappear abruptly, to be fixed when the final
add_sstable/remove_sstable happens. So we want those things to happen
together.
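A minimal sketch of why the two steps belong together, assuming the backlog is a plain byte count (the real backlog is strategy-specific) and using a hypothetical backlog_tracker_sketch class:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Simplified tracker. Progress is observed through pointers to counters
// owned by the caller, mirroring the "passed down as references" rule.
class backlog_tracker_sketch {
    uint64_t _sstable_bytes = 0;
    std::unordered_map<int, const uint64_t*> _ongoing_writes; // id -> bytes written so far
    std::unordered_map<int, const uint64_t*> _ongoing_reads;  // id -> bytes compacted so far
public:
    void register_write(int id, const uint64_t& written) { _ongoing_writes[id] = &written; }
    void register_read(int id, const uint64_t& compacted) { _ongoing_reads[id] = &compacted; }

    // Drops the partial write charge and adds the full charge in one step.
    void add_sstable(int id, uint64_t size) {
        _ongoing_writes.erase(id);
        _sstable_bytes += size;
    }

    // Drops the partial read credit and removes the full charge in one step.
    void remove_sstable(int id, uint64_t size) {
        _ongoing_reads.erase(id);
        _sstable_bytes -= size;
    }

    uint64_t backlog() const {
        uint64_t b = _sstable_bytes;
        for (const auto& kv : _ongoing_writes) { b += *kv.second; }
        for (const auto& kv : _ongoing_reads) { b -= *kv.second; }
        return b;
    }
};
```

If the write entry were erased before add_sstable ran, the backlog would briefly drop by the bytes already written and then jump back up, which is the kind of spike described above.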
The compaction_backlog_tracker is kept alive until the strategy changes,
for example, through ALTER TABLE. Current charges are transferred to the
new strategy's compaction_backlog_tracker object when we do that. If the
type of strategy changes, the current read charges are forgotten. We can
do that because the compactions still running will not contribute to
decreasing the backlog of the new compaction strategy.
Transfer of Charges
===================
When ALTER TABLE happens, we need to transfer ongoing writes to the new
backlog manager. Ongoing reads will still be tracked by the
backlog_manager that originated them.
The rationale for that is that reads still belong to the current
compaction, with the strategy that generated them. But new SSTables being
written will add to the backlog of the new strategy.
Note that ALTER TABLE operations do not necessarily cause a change of
strategy. We can be using the same strategy but just changing
properties. If that is the case, we expect no discontinuity in the
backlog graph (tested).
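The transfer rule can be sketched as below. The tracker_sketch type and its transfer_ongoing_charges member are hypothetical names for illustration only; they show ongoing writes moving to the new tracker while ongoing reads stay with the tracker that originated them.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical tracker holding only the in-progress charges.
struct tracker_sketch {
    std::unordered_map<int, const uint64_t*> ongoing_writes; // id -> bytes written so far
    std::unordered_map<int, const uint64_t*> ongoing_reads;  // id -> bytes compacted so far

    // Moves write charges to `new_tracker`. Read charges are left behind:
    // they belong to compactions started under the old strategy.
    void transfer_ongoing_charges(tracker_sketch& new_tracker) {
        for (const auto& w : ongoing_writes) {
            new_tracker.ongoing_writes.insert(w);
        }
        ongoing_writes.clear();
    }
};
```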
Resharding
==========
Resharding compactions are more complex than normal compactions because
the SSTables are created in one shard and later sent to another shard.
It is better, then, to track resharding compactions separately and let
them have their own backlog tracker, which will insert backlog in
proportion to the amount of data to be resharded.
Memtable Flush I/O Controller
=============================
With the current infrastructure it becomes trivial to add a new
controller, for either I/O or CPU. This patchset then adds an I/O
controller for memtable flushes, using the same backlog algorithm that
we already used for CPU."
* 'compaction-controller-io-v5' of github.com:glommer/scylla:
database: add a controller for I/O on memtable flushes.
document the compaction controller
compaction: adjust shares for compactions
backlog_controllers: implement generic I/O controller
factor out some of the controller code
io shares: multiply all shares by 10
compaction_strategy: implement backlog manager for the SizeTiered strategy
infrastructure for backlog estimator for compaction work.
sstables: notify about end of data component write
sstables: add read_monitor_generator
sstables: add read_monitor
sstables: enhance data consumer with a position tracker
sstables: enhance the file_writer with an offset tracker
sstables: pass references instead of pointers for write_monitor
compaction: control destruction of readers
144 lines
5.4 KiB
C++
/*
 * Copyright (C) 2015 ScyllaDB
 *
 */

/*
 * This file is part of Scylla.
 *
 * Scylla is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Scylla is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with Scylla. If not, see <http://www.gnu.org/licenses/>.
 */

#pragma once

#include "database_fwd.hh"
#include "shared_sstable.hh"
#include "gc_clock.hh"
#include "compaction_weight_registration.hh"
#include <seastar/core/thread.hh>
#include <functional>

namespace sstables {

    struct compaction_descriptor {
        // List of sstables to be compacted.
        std::vector<sstables::shared_sstable> sstables;
        // Level of sstable(s) created by compaction procedure.
        int level;
        // Threshold size for sstable(s) to be created.
        uint64_t max_sstable_bytes;
        // Holds ownership of a weight assigned to this compaction iff it's a regular one.
        stdx::optional<compaction_weight_registration> weight_registration;

        compaction_descriptor() = default;

        explicit compaction_descriptor(std::vector<sstables::shared_sstable> sstables, int level = 0, uint64_t max_sstable_bytes = std::numeric_limits<uint64_t>::max())
            : sstables(std::move(sstables))
            , level(level)
            , max_sstable_bytes(max_sstable_bytes) {}
    };

    struct resharding_descriptor {
        std::vector<sstables::shared_sstable> sstables;
        uint64_t max_sstable_bytes;
        shard_id reshard_at;
        uint32_t level;
    };

    enum class compaction_type {
        Compaction = 0,
        Cleanup = 1,
        Validation = 2,
        Scrub = 3,
        Index_build = 4,
        Reshard = 5,
    };

    static inline sstring compaction_name(compaction_type type) {
        switch (type) {
        case compaction_type::Compaction:
            return "COMPACTION";
        case compaction_type::Cleanup:
            return "CLEANUP";
        case compaction_type::Validation:
            return "VALIDATION";
        case compaction_type::Scrub:
            return "SCRUB";
        case compaction_type::Index_build:
            return "INDEX_BUILD";
        case compaction_type::Reshard:
            return "RESHARD";
        default:
            throw std::runtime_error("Invalid Compaction Type");
        }
    }

    struct compaction_info {
        compaction_type type = compaction_type::Compaction;
        sstring ks;
        sstring cf;
        size_t sstables = 0;
        uint64_t start_size = 0;
        uint64_t end_size = 0;
        uint64_t total_partitions = 0;
        uint64_t total_keys_written = 0;
        int64_t ended_at;
        std::vector<shared_sstable> new_sstables;
        sstring stop_requested;
        bool tracking = true;

        bool is_stop_requested() const {
            return stop_requested.size() > 0;
        }

        void stop(sstring reason) {
            stop_requested = std::move(reason);
        }

        void stop_tracking() {
            tracking = false;
        }
    };

    // Compact a list of N sstables into M sstables.
    // Returns info about the finished compaction, which includes a vector of the new sstables.
    //
    // creator is used to get a sstable object for a new sstable that will be written.
    // max_sstable_size is a relaxed size limit for a sstable to be generated.
    // Example: It's okay for the size of a new sstable to go beyond max_sstable_size
    // when writing its last partition.
    // sstable_level will be the level of the sstable(s) to be created by this function.
    // If cleanup is true, mutations that don't belong to the current node will be
    // cleaned up, log messages will inform the user that compact_sstables runs for
    // a cleaning operation, and compaction history will not be updated.
    future<compaction_info> compact_sstables(sstables::compaction_descriptor descriptor,
        column_family& cf, std::function<shared_sstable()> creator, bool cleanup = false,
        seastar::thread_scheduling_group* tsg = nullptr);

    // Compacts a set of N shared sstables into M sstables. For every shard involved,
    // i.e. which owns any of the sstables, a new unshared sstable is created.
    future<std::vector<shared_sstable>> reshard_sstables(std::vector<shared_sstable> sstables,
        column_family& cf, std::function<shared_sstable(shard_id)> creator,
        uint64_t max_sstable_size, uint32_t sstable_level,
        seastar::thread_scheduling_group* tsg = nullptr);

    // Return list of expired sstables for column family cf.
    // A sstable is fully expired *iff* its max_local_deletion_time precedes gc_before and its
    // max timestamp is lower than any other relevant sstable.
    // In simpler words, a sstable is fully expired if all of its live cells with TTL are expired
    // and it possibly doesn't contain any tombstone that covers cells in other sstables.
    std::unordered_set<sstables::shared_sstable>
    get_fully_expired_sstables(column_family& cf, const std::vector<sstables::shared_sstable>& compacting, gc_clock::time_point gc_before);
}
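
The "fully expired" rule documented for get_fully_expired_sstables can be sketched with plain structs. This is not the function's implementation: sstable_stats and is_fully_expired are hypothetical, times are reduced to integers, and the sketch assumes that "lower than any other relevant sstable" means lower than the smallest minimum timestamp among the other relevant sstables.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical, simplified view of the per-sstable metadata involved.
struct sstable_stats {
    int64_t max_local_deletion_time;
    int64_t min_timestamp;
    int64_t max_timestamp;
};

// Returns true when `candidate` can be dropped whole:
//  1. everything in it became deletable before gc_before, and
//  2. (assumption) its newest write is older than the oldest write in every
//     other relevant sstable, so none of its tombstones can shadow their data.
inline bool is_fully_expired(const sstable_stats& candidate,
                             const std::vector<sstable_stats>& others,
                             int64_t gc_before) {
    if (candidate.max_local_deletion_time >= gc_before) {
        return false;
    }
    int64_t min_other_ts = std::numeric_limits<int64_t>::max();
    for (const auto& o : others) {
        min_other_ts = std::min(min_other_ts, o.min_timestamp);
    }
    return candidate.max_timestamp < min_other_ts;
}
```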